Korean text categorization using the character N-gram

Makoto Suzuki*, Naohide Yamagishi, Masayuki Goto

*Corresponding author for this work

Research output: Conference contribution

Abstract

We previously proposed the accumulation method, a language-independent text classification method based on the character N-gram, and used it to classify English and Japanese text documents. The accumulation method does not depend on language structure because it uses character N-grams to form index terms. As long as text documents are encoded in Unicode, the accumulation method can classify them using the same algorithm. In the present paper, we improve the proposed method and classify Korean text documents, namely newspaper articles from the Korean Hankyoreh 2008 data set. The highest macro-averaged F-measure of the proposed method on this data set is 90.2%, a good result for Korean. In addition, we demonstrate an improvement in classification accuracy for English. Finally, we discuss the qualitative meaning of the accumulation method.
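The two ideas the abstract relies on, character N-grams as language-independent index terms and the macro-averaged F-measure as the evaluation score, can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' accumulation method, whose scoring rule is not described in this record. The function names `char_ngrams` and `macro_f1` are hypothetical, and the NFC normalization step is an assumption added so that each precomposed Hangul syllable counts as one character.

```python
import unicodedata


def char_ngrams(text: str, n: int) -> list[str]:
    """Return every overlapping run of n Unicode code points in text.

    Assumption: the text is normalized to NFC first, so a Hangul syllable
    is a single code point rather than a sequence of jamo.
    """
    text = unicodedata.normalize("NFC", text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def macro_f1(gold: list[str], predicted: list[str]) -> float:
    """Average the per-category F-measure over the categories in gold."""
    scores = []
    for category in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == category and p == category for g, p in pairs)
        fp = sum(g != category and p == category for g, p in pairs)
        fn = sum(g == category and p != category for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)
```

Because the extraction works on code points rather than words, the same call handles English, Japanese, and Korean without a tokenizer: `char_ngrams("한겨레", 2)` and `char_ngrams("text", 3)` go through identical code. The macro average weights every category equally, regardless of how many documents each category contains.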

Original language: English
Title of host publication: 7th International Conference on Information Technology and Application, ICITA 2011
Pages: 197-202
Number of pages: 6
Publication status: Published - Dec 1, 2011
Event: 7th International Conference on Information Technology and Application, ICITA 2011 - Sydney, NSW, Australia
Duration: Nov 21, 2011 to Nov 24, 2011

Publication series

Name: 7th International Conference on Information Technology and Application, ICITA 2011

Conference

Conference: 7th International Conference on Information Technology and Application, ICITA 2011
Country/Territory: Australia
City: Sydney, NSW
Period: Nov 21, 2011 to Nov 24, 2011

ASJC Scopus subject areas

  • General Computer Science
