A theoretical analysis of document classification based on a high-dimensional vector space model - Asymptotic analysis of classification performance and distance measures

Masayuki Goto, Takashi Ishida, Makoto Suzuki, Shigeichi Hirasawa

    Research output: Contribution to journal › Article

    Abstract

    This paper focuses on document classification, an important application of text mining. Many models and algorithms have been proposed for text classification; one of these is a technique using a vector space model. In this approach, a digital document is represented as a point in a vector space constructed by morphological analysis and by counting the frequency of each word in the document. Documents can then be classified using a distance measure between them. However, the vector space model has specific characteristics when used for document classification. First, it is not easy to remove unnecessary words completely and automatically; the presence of such words is characteristic of text mining problems. Second, the dimension of the word vector space is usually huge compared with the number of words appearing in a document. Although the frequency of each word in a document is often small, many kinds of such low-frequency words can usually be used to classify documents. In this paper, we evaluate the performance of document classification when unnecessary words are included in the word set. We also analyze the performance of distance measures between documents in a high-dimensional word vector space. The asymptotic results on the distance measures explain the observation, reported in many experiments, that classification using the empirical distance between documents computed with the cosine measure is not particularly bad. The results also suggest that the KL-divergence is not useful for text mining problems.
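
    As a rough illustration of the quantities the abstract names, the sketch below builds term-frequency vectors for two short texts and compares them with the cosine measure and the KL-divergence. It is not the authors' analysis: the function names, the whitespace tokenizer (a stand-in for morphological analysis), and the sample sentences are all assumptions made for this example. It shows one commonly noted difficulty with the unsmoothed KL-divergence on sparse word counts, namely that it becomes infinite as soon as a word occurs in one document but not in the other; whether this is the reason behind the paper's conclusion is not stated in the abstract.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count lowercase whitespace-separated tokens.

    Assumption: simple splitting stands in for morphological analysis.
    """
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    # Counter returns 0 for words missing from tf_b.
    dot = sum(count * tf_b[word] for word, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def kl_divergence(tf_p, tf_q):
    """Unsmoothed KL-divergence D(P || Q) between empirical word distributions."""
    total_p = sum(tf_p.values())
    total_q = sum(tf_q.values())
    if total_p == 0:
        return 0.0
    if total_q == 0:
        return math.inf
    divergence = 0.0
    for word, count in tf_p.items():
        p = count / total_p
        q = tf_q[word] / total_q
        if q == 0:
            # A word seen in P but never in Q makes the divergence infinite.
            return math.inf
        divergence += p * math.log(p / q)
    return divergence


doc_a = term_frequencies("the cat sat on the mat")
doc_b = term_frequencies("the cat lay on the rug")

print(cosine_similarity(doc_a, doc_b))  # 0.75
print(kl_divergence(doc_a, doc_b))      # inf
```

    In this toy case the cosine measure still returns a graded similarity for two documents that share only part of their vocabulary, while the unsmoothed KL-divergence gives no finite value.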

    Original language: English
    Pages (from-to): 97-106
    Number of pages: 10
    Journal: Journal of Japan Industrial Management Association
    Volume: 61
    Issue number: 3
    Publication status: Published - 2010

    Fingerprint

    Document Classification
    Vector Space Model
    Asymptotic analysis
    Text Mining
    Distance Measure
    Vector spaces
    Model Analysis
    Performance Measures
    Vector space
    Theoretical Analysis
    High-dimensional
    Morphological Analysis
    Text Classification
    Counting
    Divergence
    Classify
    Evaluate
    Experiment

    Keywords

    • Distance measure
    • Document classification
    • Term frequency
    • Text mining

    ASJC Scopus subject areas

    • Industrial and Manufacturing Engineering
    • Applied Mathematics
    • Management Science and Operations Research
    • Strategy and Management

    Cite this

    @article{bd6b989002b24e35a0a773747779a78a,
    title = "A theoretical analysis of document classification based on a high-dimensional vector space model - Asymptotic analysis of classification performance and distance measures",
    abstract = "Problems associated with document classification, an important application of text mining of text data, are focused on in this paper. There have been many models and algorithms proposed for text classification; one of these is a technique using a vector space model. In these methods, a digital document is represented as a point in the vector space which is constructed by morphological analysis and counting the frequency of each word in the document. In the vector space model, the documents can be classified using the distance measure between documents. However, there are specific characteristics in the vector space model for document classification. Firstly, it is not easy to automatically remove unnecessary words completely. The existence of unnecessary words is one of the characteristics of the text mining problems. Secondly, the dimensions of the word vector space are usually huge in comparison to the number of words appearing in a document. Although the frequencies of words appearing in a document could be small in many cases, many kinds of such words with small frequency can usually be used to classify the documents. In this paper, we evaluate the performance of document classification in the case where unnecessary words are included in the word set. Moreover, the performance of the distance measure between documents in a large dimensional word vector space is analyzed. From the asymptotic results about the distance measure, we can provide an explanation of the fact given in many experiments that classification using the empirical distance between documents calculated via the cosine measure is not particularly bad. It is also suggested that the KL-divergence is not useful for text mining problems.",
    keywords = "Distance measure, Document classification, Term frequency, Text mining",
    author = "Masayuki Goto and Takashi Ishida and Makoto Suzuki and Shigeichi Hirasawa",
    year = "2010",
    language = "English",
    volume = "61",
    pages = "97--106",
    journal = "Journal of Japan Industrial Management Association",
    issn = "0386-4812",
    publisher = "Nihon Keiei Kogakkai",
    number = "3"
    }

    TY - JOUR

    T1 - A theoretical analysis of document classification based on a high-dimensional vector space model - Asymptotic analysis of classification performance and distance measures

    AU - Goto, Masayuki

    AU - Ishida, Takashi

    AU - Suzuki, Makoto

    AU - Hirasawa, Shigeichi

    PY - 2010

    Y1 - 2010

    N2 - Problems associated with document classification, an important application of text mining of text data, are focused on in this paper. There have been many models and algorithms proposed for text classification; one of these is a technique using a vector space model. In these methods, a digital document is represented as a point in the vector space which is constructed by morphological analysis and counting the frequency of each word in the document. In the vector space model, the documents can be classified using the distance measure between documents. However, there are specific characteristics in the vector space model for document classification. Firstly, it is not easy to automatically remove unnecessary words completely. The existence of unnecessary words is one of the characteristics of the text mining problems. Secondly, the dimensions of the word vector space are usually huge in comparison to the number of words appearing in a document. Although the frequencies of words appearing in a document could be small in many cases, many kinds of such words with small frequency can usually be used to classify the documents. In this paper, we evaluate the performance of document classification in the case where unnecessary words are included in the word set. Moreover, the performance of the distance measure between documents in a large dimensional word vector space is analyzed. From the asymptotic results about the distance measure, we can provide an explanation of the fact given in many experiments that classification using the empirical distance between documents calculated via the cosine measure is not particularly bad. It is also suggested that the KL-divergence is not useful for text mining problems.

    KW - Distance measure

    KW - Document classification

    KW - Term frequency

    KW - Text mining

    UR - http://www.scopus.com/inward/record.url?scp=78651239894&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=78651239894&partnerID=8YFLogxK

    M3 - Article

    AN - SCOPUS:78651239894

    VL - 61

    SP - 97

    EP - 106

    JO - Journal of Japan Industrial Management Association

    JF - Journal of Japan Industrial Management Association

    SN - 0386-4812

    IS - 3

    ER -