A proposal for classification of document data with unobserved categories considering latent topics

Yusei Yamamoto, Kenta Mikawa, Masayuki Goto

    Research output: Contribution to journal (Article)

    1 Citation (Scopus)

    Abstract

    With the rapid development of the information society, automatic document classification by machine learning has become increasingly important. Document classification usually assumes that a new input document belongs to one of the categories observed in the training data. Consequently, if a new document belongs to an unobserved category that does not appear in the training data, it cannot be classified correctly. To address this problem, Arakawa et al. proposed a method that models the generative probabilities of documents with a mixture of Polya distributions and estimates the optimal category among all observed and unobserved categories, assuming that the documents in each category are generated from a single Polya distribution. However, the statistical characteristics of document categories are generally more complicated, and a category contains various underlying latent topics. Because the conventional approach models each category with a single Polya distribution, it cannot represent the variation in word frequency that arises from multiple unobserved latent topics. This paper proposes a new model that assumes a mixture of Polya distributions for the generative probabilities of the documents within a category, so that multiple latent topics can be represented. To verify the effectiveness of the proposed method, we conduct simulation experiments on document classification using a set of English newspaper articles.
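The per-category model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it only evaluates a Polya (Dirichlet-multinomial) mixture likelihood for each category and picks the best-scoring one. The function names, the toy parameters, and the equal priors are assumptions made for illustration, and neither parameter estimation nor the handling of unobserved categories is covered, since the abstract does not specify them.

```python
import math


def polya_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a Polya
    (Dirichlet-multinomial) distribution with parameters alpha."""
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod(x_v!)
    log_p = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma(A) / Gamma(A + n)
    log_p += math.lgamma(a) - math.lgamma(a + n)
    # prod_v Gamma(x_v + alpha_v) / Gamma(alpha_v)
    log_p += sum(math.lgamma(x + av) - math.lgamma(av)
                 for x, av in zip(counts, alpha))
    return log_p


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def category_log_likelihood(counts, weights, alphas):
    """Log-likelihood of a document under one category, modeled as a
    mixture of Polya distributions (one component per latent topic)."""
    return log_sum_exp([math.log(w) + polya_log_pmf(counts, a)
                        for w, a in zip(weights, alphas)])


def classify(counts, categories):
    """Return the category with the highest prior-weighted likelihood.
    categories maps name -> (prior, topic_weights, topic_alphas)."""
    scores = {name: math.log(prior) + category_log_likelihood(counts, w, a)
              for name, (prior, w, a) in categories.items()}
    return max(scores, key=scores.get)


# Toy example (hypothetical parameters, vocabulary of three words).
categories = {
    "sports": (0.5, [0.5, 0.5], [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0]]),
    "politics": (0.5, [1.0], [[1.0, 1.0, 5.0]]),
}
print(classify([4, 0, 0], categories))
print(classify([0, 0, 4], categories))
```

In this toy setup the "sports" category has two latent topics with different word-frequency profiles, which a single Polya distribution per category could not capture.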

    Original language: English
    Pages (from-to): 165-174
    Number of pages: 10
    Journal: Industrial Engineering and Management Systems
    Volume: 16
    Issue number: 2
    DOI: 10.7232/iems.2017.16.2.165
    Publication status: Published - 2017 Jun 1

    Keywords

    • Document classification
    • Latent topic model
    • Polya distribution
    • Unobserved category

    ASJC Scopus subject areas

    • Social Sciences (all)
    • Economics, Econometrics and Finance (all)

    Cite this

    A proposal for classification of document data with unobserved categories considering latent topics. / Yamamoto, Yusei; Mikawa, Kenta; Goto, Masayuki.

    In: Industrial Engineering and Management Systems, Vol. 16, No. 2, 01.06.2017, p. 165-174.

    @article{ac12988b897d4b71aa241d18cbf60035,
    title = "A proposal for classification of document data with unobserved categories considering latent topics",
    keywords = "Document classification, Latent topic model, Polya distribution, Unobserved category",
    author = "Yusei Yamamoto and Kenta Mikawa and Masayuki Goto",
    year = "2017",
    month = "6",
    day = "1",
    doi = "10.7232/iems.2017.16.2.165",
    language = "English",
    volume = "16",
    pages = "165--174",
    journal = "Industrial Engineering and Management Systems",
    issn = "1598-7248",
    publisher = "Korean Institute of Industrial Engineers",
    number = "2",
    }

    TY - JOUR

    T1 - A proposal for classification of document data with unobserved categories considering latent topics

    AU - Yamamoto, Yusei

    AU - Mikawa, Kenta

    AU - Goto, Masayuki

    PY - 2017/6/1

    Y1 - 2017/6/1

    KW - Document classification

    KW - Latent topic model

    KW - Polya distribution

    KW - Unobserved category

    UR - http://www.scopus.com/inward/record.url?scp=85030775353&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85030775353&partnerID=8YFLogxK

    U2 - 10.7232/iems.2017.16.2.165

    DO - 10.7232/iems.2017.16.2.165

    M3 - Article

    AN - SCOPUS:85030775353

    VL - 16

    SP - 165

    EP - 174

    JO - Industrial Engineering and Management Systems

    JF - Industrial Engineering and Management Systems

    SN - 1598-7248

    IS - 2

    ER -