Text classification using similarity of tree sources estimated from Bayes coding algorithm

Hiroki Iwama, Takashi Ishida, Masayuki Goto

    Research output: Contribution to journal › Article

    Abstract

    In this paper, we propose a text classification method based on the Bayes coding algorithm, an efficient data compression method. The Bayes coding algorithm achieves Bayes optimal data compression over the tree source model class. When data are compressed by the Bayes coding algorithm, the probability structure of the information source is implicitly estimated from the compressed data. This implicit estimate can therefore be expected to be useful for purposes other than compression, in particular for the document classification problem. Existing document classification methods based on data compression use the ZIP format or the context tree weighting method. However, these methods do not achieve Bayes optimal compression, and they use the compression ratio as the similarity measure between documents. In the Bayes coding algorithm, the weighted mixture tree obtained in the compression phase can be used as the estimated probability structure. Tree sources are a class of Markov sources, and the divergence between two tree sources with the same structure can be measured. However, the Bayes coding algorithm outputs a different tree structure depending on the data sequence being compressed. Because the tree structures derived from different documents differ from each other, the divergence between them cannot be measured directly. This paper proposes a new method that converts the weighted mixture trees into a common tree structure so that the divergence can be measured. Documents can then be classified using the divergence between the trees estimated from them. The effectiveness of the proposed method is demonstrated through a simulation experiment on document classification with natural data.
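
The idea of classifying a document by the divergence between estimated Markov sources that share one structure can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it replaces the Bayes coding algorithm and its weighted mixture tree with plain fixed-depth context counts, so every source trivially has the same context set. The function names (`context_counts`, `divergence`, `classify`), the smoothing constant, and the toy strings are assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def context_counts(text, depth):
    """Count next-symbol occurrences for every length-`depth` context.

    Assumption: a fixed-order Markov model stands in for the tree source
    that the Bayes coding algorithm would estimate.
    """
    counts = defaultdict(Counter)
    for i in range(depth, len(text)):
        counts[text[i - depth:i]][text[i]] += 1
    return counts


def smoothed_prob(counter, symbol, alphabet_size, alpha=0.5):
    """Additive estimate of P(symbol | context); alpha=0.5 is an assumed choice."""
    total = sum(counter.values())
    return (counter[symbol] + alpha) / (total + alpha * alphabet_size)


def divergence(p_counts, q_counts, alphabet):
    """Kullback-Leibler divergence D(P || Q) between two sources with the
    same context set, weighted by the context frequencies observed under P."""
    total = sum(sum(c.values()) for c in p_counts.values())
    if total == 0:
        return 0.0
    k = len(alphabet)
    d = 0.0
    for ctx, p_ctr in p_counts.items():
        weight = sum(p_ctr.values()) / total
        q_ctr = q_counts.get(ctx, Counter())  # unseen context: uniform after smoothing
        for s in alphabet:
            p = smoothed_prob(p_ctr, s, k)
            q = smoothed_prob(q_ctr, s, k)
            d += weight * p * math.log(p / q)
    return d


def classify(doc, class_texts, depth=1):
    """Assign `doc` to the class whose estimated source is closest in divergence."""
    alphabet = sorted(set(doc).union(*class_texts.values()))
    doc_counts = context_counts(doc, depth)
    scores = {
        label: divergence(doc_counts, context_counts(text, depth), alphabet)
        for label, text in class_texts.items()
    }
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy "documents": two classes with different symbol-transition behaviour.
    classes = {"A": "ab" * 50, "B": "aabb" * 25}
    label, scores = classify("ab" * 10, classes)
    print(label, scores)
```

In this sketch the common structure is imposed from the start by fixing the context depth. The method summarized in the abstract instead starts from weighted mixture trees whose structures differ between documents and converts them into the same structure before the divergence is measured.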

    Original language: English
    Pages (from-to): 438-446
    Number of pages: 9
    Journal: Journal of Japan Industrial Management Association
    Volume: 64
    Issue number: 3
    Publication status: Published - 2013

    Keywords

    • Author identification
    • Bayes coding algorithm
    • Data compression algorithm
    • Document classification

    ASJC Scopus subject areas

    • Industrial and Manufacturing Engineering
    • Applied Mathematics
    • Management Science and Operations Research
    • Strategy and Management

    Cite this

    Text classification using similarity of tree sources estimated from Bayes coding algorithm. / Iwama, Hiroki; Ishida, Takashi; Goto, Masayuki.

    In: Journal of Japan Industrial Management Association, Vol. 64, No. 3, 2013, p. 438-446.

    @article{c1d0795733094931afe5224b058b9e95,
    title = "Text classification using similarity of tree sources estimated from Bayes coding algorithm",
    keywords = "Author identification, Bayes coding algorithm, Data compression algorithm, Document classification",
    author = "Hiroki Iwama and Takashi Ishida and Masayuki Goto",
    year = "2013",
    language = "English",
    volume = "64",
    pages = "438--446",
    journal = "Journal of Japan Industrial Management Association",
    issn = "0386-4812",
    publisher = "Nihon Keikei Kogakkai",
    number = "3",

    }

    TY - JOUR

    T1 - Text classification using similarity of tree sources estimated from Bayes coding algorithm

    AU - Iwama, Hiroki

    AU - Ishida, Takashi

    AU - Goto, Masayuki

    PY - 2013

    Y1 - 2013

    KW - Author identification

    KW - Bayes coding algorithm

    KW - Data compression algorithm

    KW - Document classification

    UR - http://www.scopus.com/inward/record.url?scp=84923307992&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84923307992&partnerID=8YFLogxK

    M3 - Article

    AN - SCOPUS:84923307992

    VL - 64

    SP - 438

    EP - 446

    JO - Journal of Japan Industrial Management Association

    JF - Journal of Japan Industrial Management Association

    SN - 0386-4812

    IS - 3

    ER -