A theoretical analysis of document classification based on a high-dimensional vector space model - Asymptotic analysis of classification performance and distance measures

Masayuki Goto, Takashi Ishida, Makoto Suzuki, Shigeichi Hirasawa

Research output: Contribution to journal > Article > peer-review

Abstract

This paper addresses problems in document classification, an important application of text mining. Many models and algorithms have been proposed for text classification; one of these is the vector space model. In this approach, a digital document is represented as a point in a vector space constructed by morphological analysis and by counting the frequency of each word in the document. Documents can then be classified using a distance measure between them. However, the vector space model has specific characteristics when used for document classification. First, it is not easy to remove unnecessary words completely by automatic means, and the presence of such words is characteristic of text mining problems. Second, the dimension of the word vector space is usually huge compared with the number of words appearing in a document. Although the frequency of each word in a document is small in many cases, many kinds of such low-frequency words can usually be used to classify documents. In this paper, we evaluate the performance of document classification when unnecessary words are included in the word set. Moreover, we analyze the performance of distance measures between documents in a high-dimensional word vector space. The asymptotic results on the distance measure explain the fact, observed in many experiments, that classification using the empirical distance between documents calculated via the cosine measure is not particularly bad. The results also suggest that the KL-divergence is not useful for text mining problems.
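The two measures named in the abstract can be illustrated with a minimal Python sketch. It does not reproduce the paper's asymptotic analysis. Whitespace splitting stands in for morphological analysis, and the function names and sample sentences are hypothetical, chosen only for illustration. The sketch shows one commonly cited practical difficulty with the KL-divergence on sparse word counts (it becomes infinite as soon as one document contains a word the other lacks), which may or may not be the argument the authors make.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count word occurrences (assumption: whitespace tokens, not morphemes)."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def kl_divergence(tf_p, tf_q):
    """KL(P || Q) between empirical word distributions, without smoothing."""
    total_p = sum(tf_p.values())
    total_q = sum(tf_q.values())
    if total_p == 0:
        return 0.0
    if total_q == 0:
        return math.inf
    result = 0.0
    for word, count in tf_p.items():
        p = count / total_p
        q = tf_q[word] / total_q
        if q == 0:
            # A word present in P but absent from Q makes the divergence infinite.
            return math.inf
        result += p * math.log(p / q)
    return result


# Hypothetical sample documents.
doc_a = term_frequencies("the cat sat on the mat")
doc_b = term_frequencies("the cat lay on the rug")
doc_c = term_frequencies("stock prices fell sharply today")

sim_ab = cosine_similarity(doc_a, doc_b)  # related documents, shared words
sim_ac = cosine_similarity(doc_a, doc_c)  # unrelated documents, no shared words
kl_ab = kl_divergence(doc_a, doc_b)       # infinite: "sat" and "mat" are missing from doc_b
```

In this toy case the cosine measure still separates the related pair from the unrelated one, while the unsmoothed KL-divergence cannot distinguish them because both values are infinite.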

Original language: English
Pages (from-to): 97-106
Number of pages: 10
Journal: Journal of Japan Industrial Management Association
Volume: 61
Issue number: 3
Publication status: Published - 2010 Dec 1

Keywords

  • Distance measure
  • Document classification
  • Term frequency
  • Text mining

ASJC Scopus subject areas

  • Strategy and Management
  • Management Science and Operations Research
  • Industrial and Manufacturing Engineering
  • Applied Mathematics
