A theoretical analysis of document classification based on a high-dimensional vector space model - Asymptotic analysis of classification performance and distance measures

Masayuki Goto, Takashi Ishida, Makoto Suzuki, Shigeichi Hirasawa

    Research output: Contribution to journal › Article

    Abstract

    This paper addresses document classification, an important application of text mining. Many models and algorithms have been proposed for text classification; one family of techniques uses a vector space model. In these methods, a digital document is represented as a point in a vector space constructed by morphological analysis and by counting the frequency of each word in the document. In the vector space model, documents can be classified using a distance measure between documents. However, the vector space model for document classification has some specific characteristics. First, it is not easy to remove unnecessary words completely and automatically; the presence of such words is characteristic of text mining problems. Second, the dimension of the word vector space is usually huge compared with the number of words appearing in a document. Although the frequency of each word appearing in a document may be small in many cases, many kinds of such low-frequency words can usually be used to classify documents. In this paper, we evaluate the performance of document classification when unnecessary words are included in the word set. Moreover, we analyze the performance of distance measures between documents in a high-dimensional word vector space. The asymptotic results on the distance measure provide an explanation for the fact, observed in many experiments, that classification using the empirical distance between documents calculated via the cosine measure is not particularly bad. The results also suggest that the KL-divergence is not useful for text mining problems.
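As a rough illustration of the quantities named in the abstract, the following Python sketch builds term-frequency vectors for two toy documents and compares them with the cosine measure and with the KL-divergence. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the toy sentences, and the use of whitespace splitting in place of morphological analysis. It also shows one commonly noted difficulty with the KL-divergence on sparse word counts (it becomes infinite when a word occurs in one document but not the other), which may or may not be the argument the authors make.

```python
import math
from collections import Counter


def term_frequencies(tokens, vocabulary):
    """Count how often each vocabulary word occurs in a tokenized document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_u * norm_v)


def kl_divergence(u, v):
    """KL-divergence between the empirical word distributions of u and v.

    No smoothing is applied (an assumption of this sketch), so the result
    is infinite as soon as a word occurs in u but never in v.
    """
    total_u, total_v = sum(u), sum(v)
    result = 0.0
    for a, b in zip(u, v):
        if a == 0:
            continue  # terms with zero probability under u contribute nothing
        if b == 0:
            return math.inf
        p, q = a / total_u, b / total_v
        result += p * math.log(p / q)
    return result


# Toy documents; whitespace splitting stands in for morphological analysis.
doc_a = "text mining classifies text documents".split()
doc_b = "document classification uses text mining".split()

vocabulary = sorted(set(doc_a) | set(doc_b))
vec_a = term_frequencies(doc_a, vocabulary)
vec_b = term_frequencies(doc_b, vocabulary)

cos_ab = cosine_similarity(vec_a, vec_b)
kl_ab = kl_divergence(vec_a, vec_b)

print(f"cosine similarity: {cos_ab:.3f}")
print(f"KL-divergence:     {kl_ab}")
```

In this toy case the cosine measure stays finite and bounded even though most vocabulary entries are zero in each document, while the unsmoothed KL-divergence is infinite because some words appear in only one of the two documents.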

    Original language: English
    Pages (from-to): 97-106
    Number of pages: 10
    Journal: Journal of Japan Industrial Management Association
    Volume: 61
    Issue number: 3
    Publication status: Published - 2010

    Keywords

    • Distance measure
    • Document classification
    • Term frequency
    • Text mining

    ASJC Scopus subject areas

    • Industrial and Manufacturing Engineering
    • Applied Mathematics
    • Management Science and Operations Research
    • Strategy and Management
