A proposal for classification of document data with unobserved categories considering latent topics

Yusei Yamamoto, Kenta Mikawa, Masayuki Goto

Research output: Article, peer-reviewed

2 Citations (Scopus)

Abstract

With the rapid development of the information society, automatic document classification by machine learning has become increasingly important. Document classification conventionally assumes that new input data can be classified into one of the categories observed in the training data. Therefore, if new input data belongs to an unobserved category that does not exist in the training data, it cannot be classified correctly. To solve this problem, Arakawa et al. proposed a method that models the generative probabilities of documents with a mixture of Polya distributions and estimates the optimum category among all observed and unobserved categories, assuming that the documents in each category are generated from a single Polya distribution. However, the statistical characteristics of document categories are generally more complicated, and a category contains various underlying latent topics. Because the conventional approach models each category with a single Polya distribution, it cannot represent the variation in word frequency that depends on multiple unobserved latent topics. This paper proposes a new model that assumes a mixture of Polya distributions for the generative probabilities of documents in a category in order to represent multiple latent topics. To verify the effectiveness of the proposed method, we conduct simulation experiments of document classification using a set of English newspaper articles.
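
As a rough illustration of the idea in the abstract, the sketch below scores a document's word counts under a per-category mixture of Polya (Dirichlet-multinomial) distributions and picks the highest-scoring category. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy vocabulary of three words, and all parameter values are hypothetical, the parameters are taken as given rather than estimated, and the handling of unobserved categories described in the abstract is not covered.

```python
import math


def polya_log_prob(counts, alpha):
    """Log-probability of a word-count vector under a Polya
    (Dirichlet-multinomial) distribution with parameters alpha.
    The multinomial coefficient is omitted: it depends only on the
    document, so it does not change which category scores highest."""
    n = sum(counts)
    a = sum(alpha)
    lp = math.lgamma(a) - math.lgamma(a + n)
    for c, al in zip(counts, alpha):
        lp += math.lgamma(al + c) - math.lgamma(al)
    return lp


def mixture_log_prob(counts, weights, alphas):
    """Log-probability under a mixture of Polya components, one
    component per assumed latent topic (log-sum-exp for stability)."""
    terms = [math.log(w) + polya_log_prob(counts, a)
             for w, a in zip(weights, alphas)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))


def classify(counts, categories):
    """Return the category whose mixture gives the highest log-probability.
    Equal prior probability over categories is assumed here."""
    return max(categories,
               key=lambda name: mixture_log_prob(counts, *categories[name]))


# Hypothetical toy parameters: (mixture weights, one alpha vector per topic).
categories = {
    "sports": ([0.5, 0.5], [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0]]),
    "politics": ([1.0], [[1.0, 1.0, 5.0]]),
}

# Word counts for one document over the three-word toy vocabulary.
label = classify([0, 1, 6], categories)
```

With a single component per category, `mixture_log_prob` reduces to `polya_log_prob`, which corresponds to the single-Polya-per-category setting that the abstract attributes to the conventional approach.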

Original language: English
Pages (from-to): 165-174
Number of pages: 10
Journal: Industrial Engineering and Management Systems
Volume: 16
Issue number: 2
DOI
Publication status: Published - Jun 2017

ASJC Scopus subject areas

  • Social Sciences(all)
  • Economics, Econometrics and Finance(all)
