Language-independent word acquisition method using a state-transition model

Bin Xu, Naohide Yamagishi, Makoto Suzuki, Masayuki Goto

Research output: Contribution to journal › Article › peer-review

Abstract

New words, spoken-language expressions, and abbreviations are used extensively on the Internet, which makes it very difficult to acquire words automatically for the purpose of analyzing Internet content. In a previous study, we proposed a method for Japanese word segmentation using character N-grams. That method is based on a simple state-transition model, which assumes that the input document can be described by four states (denoted A, B, C, and D) specified beforehand: state A represents words (nouns, verbs, etc.); state B represents statement separators (punctuation marks, conjunctions, etc.); state C represents postpositions (namely, words that follow nouns); and state D represents prepositions (namely, words that precede nouns). Under this model, a state is assigned to each pseudo-word, and the document is searched from beginning to end for an accessible pattern of state transitions; words are detected in the course of this search. In the present paper, we perform experiments with the proposed word acquisition algorithm on Japanese and Chinese newspaper articles, obtained from Japan's Kyoto University and the Chinese People's Daily, respectively. The proposed method does not depend on the language structure: if the text documents are encoded in Unicode, the same algorithm can obtain words in both Japanese and Chinese, neither of which places spaces between words. Hence, we demonstrate that the proposed method is language independent.
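The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the permitted transitions or how pseudo-words receive their states, so the `ALLOWED` table, the hand-written `LEXICON`, the `START` pseudo-state, and the longest-match-first backtracking order are all assumptions made for illustration. In the paper, pseudo-words are derived from character N-grams rather than listed by hand.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical pseudo-word -> state table (A: word, B: separator,
# C: postposition, D: preposition). Hand-written here for illustration;
# the paper derives pseudo-words from character N-grams instead.
LEXICON: Dict[str, str] = {
    "猫": "A", "魚": "A", "食べる": "A",
    "が": "C", "を": "C",
    "。": "B",
}

# Assumed transition table; the abstract does not list the real one.
# "START" is an extra pseudo-state marking the beginning of the document.
ALLOWED: Dict[str, Set[str]] = {
    "START": {"A", "B", "D"},
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
    "C": {"A", "B", "D"},
    "D": {"A"},
}


def segment(
    text: str,
    lexicon: Dict[str, str],
    allowed: Dict[str, Set[str]],
    max_len: int = 4,
) -> Optional[List[Tuple[str, str]]]:
    """Search the text from beginning to end for a sequence of pseudo-words
    whose states form a permitted transition path. Returns None if no such
    path covers the whole text."""

    def search(pos: int, prev: str) -> Optional[List[Tuple[str, str]]]:
        if pos == len(text):
            return []  # reached the end: the path is complete
        # Try longer candidates first, backtracking on failure.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            state = lexicon.get(piece)
            if state is None or state not in allowed[prev]:
                continue
            rest = search(pos + length, state)
            if rest is not None:
                return [(piece, state)] + rest
        return None

    return search(0, "START")


if __name__ == "__main__":
    print(segment("猫が魚を食べる。", LEXICON, ALLOWED))
```

Because the function only slices a Unicode string and consults two tables, nothing in it is specific to Japanese; supplying a Chinese lexicon would exercise the same code path, which is the sense of language independence the abstract claims.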

Original language: English
Pages (from-to): 224-230
Number of pages: 7
Journal: Industrial Engineering and Management Systems
Volume: 15
Issue number: 3
Publication status: Published - 2016 Sep

Keywords

  • Character N-gram
  • Language independent
  • State transition
  • Word segmentation

ASJC Scopus subject areas

  • Social Sciences(all)
  • Economics, Econometrics and Finance(all)
