Problems associated with document classification, an important application of text mining of text data, are focused on in this paper. There have been many models and algorithms proposed for text classification; one of these is a technique using a vector space model. In these methods, a digital document is represented as a point in the vector space which is constructed by morphological analysis and counting the frequency of each word in the document. In the vector space model, the documents can be classified using the distance measure between documents. However, there are specific characteristics in the vector space model for document classification. Firstly, it is not easy to automatically remove unnecessary words completely. The existence of unnecessary words is one of the characteristics of the text mining problems. Secondly, the dimensions of the word vector space are usually huge in comparison to the number of words appearing in a document. Although the frequencies of words appearing in a document could be small in many cases, many kinds of such words with small frequency can usually be used to classify the documents. In this paper, we evaluate the performance of document classification in the case where unnecessary words are included in the word set. Moreover, the performance of the distance measure between documents in a large dimensional word vector space is analyzed. From the asymptotic results about the distance measure, we can provide an explanation of the fact given in many experiments that classification using the empirical distance between documents calculated via the cosine measure is not particularly bad. It is also suggested that the KL-divergence is not useful for text mining problems.
|ジャーナル||Journal of Japan Industrial Management Association|
|出版ステータス||Published - 2010 12月 1|
ASJC Scopus subject areas
- 経営科学およびオペレーションズ リサーチ