Word segmentation for the sequences emitted from a word-valued source

Takashi Ishida, Toshiyasu Matsushima, Shigeichi Hirasawa

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    Abstract

    Word segmentation is the most fundamental and important process in Japanese and Chinese language processing. Because these languages do not mark boundaries between words, a sentence must first be separated into words. Approaches based on probabilistic language models are known to be highly effective for this problem, and this has been demonstrated in practice. Meanwhile, the word-valued source (WVS) has recently been proposed as a new class of source model for the source coding problem. This model is expected to reflect more of the probabilistic structure of natural languages. A Japanese or Chinese sentence can be regarded as a sequence emitted from a non-prefix-free WVS. In this paper, as a first step toward applying WVS to natural language processing, we formulate a word segmentation problem for sequences emitted from a non-prefix-free WVS. We then examine the word segmentation performance for these models through numerical computations.
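
As an illustration of the kind of problem the abstract describes, the following is a minimal sketch of maximum-probability segmentation by dynamic programming. It is not the formulation used in the paper: it assumes words are drawn independently from a known distribution, and the function name `segment` and the toy dictionary `toy_probs` are hypothetical. The dictionary is deliberately non-prefix-free ("a" is a prefix of "ab"), so the same character sequence admits several segmentations.

```python
import math


def segment(sequence, word_probs):
    """Return the most probable split of `sequence` into dictionary words.

    Assumption (not from the paper): words are emitted i.i.d. with the
    probabilities in `word_probs`. Returns None if no split exists.
    """
    n = len(sequence)
    max_len = max(len(w) for w in word_probs)
    # best[i] is the highest log-probability of any split of sequence[:i].
    best = [-math.inf] * (n + 1)
    # back[i] is the start index of the last word in that best split.
    back = [0] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in word_probs and best[start] > -math.inf:
                score = best[start] + math.log(word_probs[word])
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        return None
    # Walk the back-pointers from the end to recover the words.
    words = []
    i = n
    while i > 0:
        words.append(sequence[back[i]:i])
        i = back[i]
    return words[::-1]


# Hypothetical non-prefix-free word set with made-up probabilities.
toy_probs = {"a": 0.4, "ab": 0.3, "b": 0.2, "ba": 0.1}
print(segment("abab", toy_probs))
```

Under these toy probabilities the split "ab | ab" has probability 0.09, which exceeds that of every alternative such as "a | ba | b" (0.008), so it is the one returned. A source with dependence between successive words would need a richer state than the single position index used here.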

    Original language: English
    Title of host publication: CIT 2007: 7th IEEE International Conference on Computer and Information Technology
    Pages: 662-667
    Number of pages: 6
    DOI: https://doi.org/10.1109/CIT.2007.4385160
    Publication status: Published - 2007
    Event: CIT 2007: 7th IEEE International Conference on Computer and Information Technology - Aizu-Wakamatsu, Fukushima
    Duration: 2007 Oct 16 - 2007 Oct 19

    Fingerprint

    Segmentation
    Natural Language
    Source Coding
    Language Model
    Probabilistic Model
    Numerical Computation
    Processing
    Model
    Language
    Class

    ASJC Scopus subject areas

    • Computer Science Applications
    • Information Systems
    • Software
    • Mathematics (all)

    Cite this

    Ishida, T., Matsushima, T., & Hirasawa, S. (2007). Word segmentation for the sequences emitted from a word-valued source. In CIT 2007: 7th IEEE International Conference on Computer and Information Technology (pp. 662-667). [4385160] https://doi.org/10.1109/CIT.2007.4385160

    @inproceedings{1376624a4c2d46dea7e41abc625429d8,
    title = "Word segmentation for the sequences emitted from a word-valued source",
    author = "Takashi Ishida and Toshiyasu Matsushima and Shigeichi Hirasawa",
    year = "2007",
    doi = "10.1109/CIT.2007.4385160",
    language = "English",
    isbn = "0769529836",
    pages = "662--667",
    booktitle = "CIT 2007: 7th IEEE International Conference on Computer and Information Technology",

    }

    TY - GEN

    T1 - Word segmentation for the sequences emitted from a word-valued source

    AU - Ishida, Takashi

    AU - Matsushima, Toshiyasu

    AU - Hirasawa, Shigeichi

    PY - 2007

    Y1 - 2007

    UR - http://www.scopus.com/inward/record.url?scp=38049025202&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=38049025202&partnerID=8YFLogxK

    U2 - 10.1109/CIT.2007.4385160

    DO - 10.1109/CIT.2007.4385160

    M3 - Conference contribution

    AN - SCOPUS:38049025202

    SN - 0769529836

    SN - 9780769529837

    SP - 662

    EP - 667

    BT - CIT 2007: 7th IEEE International Conference on Computer and Information Technology

    ER -