Language-independent word acquisition method using a state-transition model

Bin Xu, Naohide Yamagishi, Makoto Suzuki, Masayuki Goto

    Research output: Contribution to journal › Article

    Abstract

    New words, numerous spoken languages, and abbreviations are used extensively on the Internet. As a result, automatically acquiring words for the purpose of analyzing Internet content is very difficult. In a previous study, we proposed a method for Japanese word segmentation using character N-grams. That method is based on a simple state-transition model which assumes that the input document can be described by four states (denoted A, B, C, and D) specified beforehand: state A represents words (nouns, verbs, etc.); state B represents statement separators (punctuation marks, conjunctions, etc.); state C represents postpositions (namely, words that follow nouns); and state D represents prepositions (namely, words that precede nouns). Using the states assigned to each pseudo-word, we search the document from beginning to end for an accessible pattern of state transitions, and words are detected in the course of this search. In the present paper, we perform experiments with the proposed word acquisition algorithm on Japanese and Chinese newspaper articles. These articles were obtained from Japan's Kyoto University and the Chinese People's Daily. The proposed method does not depend on the language structure. If text documents are expressed in Unicode, the proposed method can use the same algorithm to obtain words in Japanese and Chinese, neither of which places spaces between words. Hence, we demonstrate that the proposed method is language independent.
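
The two ingredients named in the abstract, character N-grams and a four-state transition model, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the transition rules or say how states are assigned to pseudo-words, so the `ALLOWED` table, the function names, and the pre-labelled input are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return every contiguous character n-gram of a Unicode string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_counts(text, n):
    """Count how often each character n-gram occurs in the text."""
    return Counter(char_ngrams(text, n))

# ASSUMPTION: a made-up transition table over the four states in the abstract
# (A = word, B = separator, C = postposition, D = preposition).
# The real table is not stated in the abstract.
ALLOWED = {
    "START": {"A", "B", "D"},
    "A": {"A", "B", "C"},
    "B": {"A", "B", "D"},
    "C": {"A", "B", "D"},
    "D": {"A"},
}

def is_accessible(states):
    """Check that each consecutive pair of states is an allowed transition."""
    prev = "START"
    for state in states:
        if state not in ALLOWED.get(prev, set()):
            return False
        prev = state
    return True

def acquire_words(labelled):
    """Return pseudo-words in state A if the state sequence is accessible.

    `labelled` is a list of (pseudo_word, state) pairs; how the states are
    assigned is outside this sketch.
    """
    states = [state for _, state in labelled]
    if not is_accessible(states):
        return []
    return [word for word, state in labelled if state == "A"]
```

Because Python strings are sequences of Unicode code points, the same functions apply unchanged to Japanese or Chinese input, which mirrors the abstract's point that the method only requires Unicode text.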

    Original language: English
    Pages (from-to): 224-230
    Number of pages: 7
    Journal: Industrial Engineering and Management Systems
    Volume: 15
    Issue number: 3
    Publication status: Published - 2016 Sep 1

    Keywords

    • Character N-gram
    • Language independent
    • State transition
    • Word segmentation

    ASJC Scopus subject areas

    • Social Sciences (all)
    • Economics, Econometrics and Finance (all)
