English and Taiwanese text categorization using N-gram based on Vector Space Model

Makoto Suzuki, Naohide Yamagishi, Yi Ching Tsai, Takashi Ishida, Masayuki Goto

    Research output: Chapter in Book/Report/Conference proceeding - Conference contribution

    1 Citation (Scopus)

    Abstract

    In this paper, we present a new mathematical model based on the Vector Space Model and consider its implications. We evaluate the proposed method, FRAM, in several experiments in which we classify newspaper articles from the English Reuters-21578 data set, a benchmark for automatic text categorization, and from the Taiwanese China Times 2005 data set. The results show that FRAM achieves good classification accuracy for English, with a micro-averaged F-measure of 94.5%, but only 78.0% for Taiwanese. Although the proposed method is language-independent and provides a new perspective, improving its classification accuracy for Taiwanese remains future work.
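
The abstract does not describe FRAM's internals, so the following is only a generic illustration of the two ideas it names: character N-gram features in a vector space, and the micro-averaged F-measure. It is a minimal sketch that swaps in plain cosine similarity against per-category count vectors, which is an assumption and not the authors' method. All function names and the toy training data are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count overlapping character N-grams; needs no word segmentation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify(text, category_vectors, n=2):
    """Return the category whose N-gram vector is most similar to the text."""
    vec = char_ngrams(text, n)
    return max(category_vectors, key=lambda c: cosine(vec, category_vectors[c]))


def micro_f1(gold, predicted):
    """Micro-averaged F-measure: pool TP/FP/FN over all documents, then compute F."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical toy training text, not taken from Reuters-21578 or China Times.
training = {
    "grain": "wheat corn harvest wheat export",
    "money": "bank interest rate dollar bank",
}
category_vectors = {c: char_ngrams(t) for c, t in training.items()}

label = classify("corn and wheat prices", category_vectors)
# Gold and predicted label sets for two documents (multi-label is allowed).
score = micro_f1([{"grain"}, {"money"}], [{label}, {"money", "grain"}])
```

Character N-grams are what make an approach of this kind language-independent in principle: the same `char_ngrams` function applies to Chinese-script text without a word segmenter, although the best value of `n` may differ between languages.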

    Original language: English
    Title of host publication: ISITA/ISSSTA 2010 - 2010 International Symposium on Information Theory and Its Applications
    Pages: 106-111
    Number of pages: 6
    DOIs: https://doi.org/10.1109/ISITA.2010.5649453
    Publication status: Published - 2010
    Event: 2010 20th International Symposium on Information Theory and Its Applications, ISITA 2010 and the 2010 20th International Symposium on Spread Spectrum Techniques and Applications, ISSSTA 2010 - Taichung
    Duration: 2010 Oct 17 - 2010 Oct 20

    Keywords

    • Classification
    • N-gram
    • Newspaper
    • Text mining

    ASJC Scopus subject areas

    • Computational Theory and Mathematics
    • Information Systems

    Cite this

    Suzuki, M., Yamagishi, N., Tsai, Y. C., Ishida, T., & Goto, M. (2010). English and Taiwanese text categorization using N-gram based on Vector Space Model. In ISITA/ISSSTA 2010 - 2010 International Symposium on Information Theory and Its Applications (pp. 106-111). [5649453] https://doi.org/10.1109/ISITA.2010.5649453

    @inproceedings{f90d3022e052455b946ffe5e4a250506,
    title = "English and Taiwanese text categorization using N-gram based on Vector Space Model",
    abstract = "In this paper, we present a new mathematical model based on the Vector Space Model and consider its implications. We evaluate the proposed method, FRAM, in several experiments in which we classify newspaper articles from the English Reuters-21578 data set, a benchmark for automatic text categorization, and from the Taiwanese China Times 2005 data set. The results show that FRAM achieves good classification accuracy for English, with a micro-averaged F-measure of 94.5{\%}, but only 78.0{\%} for Taiwanese. Although the proposed method is language-independent and provides a new perspective, improving its classification accuracy for Taiwanese remains future work.",
    keywords = "Classification, N-gram, Newspaper, Text mining",
    author = "Makoto Suzuki and Naohide Yamagishi and Tsai, {Yi Ching} and Takashi Ishida and Masayuki Goto",
    year = "2010",
    doi = "10.1109/ISITA.2010.5649453",
    language = "English",
    isbn = "9781424460175",
    pages = "106--111",
    booktitle = "ISITA/ISSSTA 2010 - 2010 International Symposium on Information Theory and Its Applications",

    }

    TY - GEN
    T1 - English and Taiwanese text categorization using N-gram based on Vector Space Model
    AU - Suzuki, Makoto
    AU - Yamagishi, Naohide
    AU - Tsai, Yi Ching
    AU - Ishida, Takashi
    AU - Goto, Masayuki
    PY - 2010
    Y1 - 2010
    AB - In this paper, we present a new mathematical model based on the Vector Space Model and consider its implications. We evaluate the proposed method, FRAM, in several experiments in which we classify newspaper articles from the English Reuters-21578 data set, a benchmark for automatic text categorization, and from the Taiwanese China Times 2005 data set. The results show that FRAM achieves good classification accuracy for English, with a micro-averaged F-measure of 94.5%, but only 78.0% for Taiwanese. Although the proposed method is language-independent and provides a new perspective, improving its classification accuracy for Taiwanese remains future work.
    KW - Classification
    KW - N-gram
    KW - Newspaper
    KW - Text mining
    UR - http://www.scopus.com/inward/record.url?scp=78651327327&partnerID=8YFLogxK
    UR - http://www.scopus.com/inward/citedby.url?scp=78651327327&partnerID=8YFLogxK
    U2 - 10.1109/ISITA.2010.5649453
    DO - 10.1109/ISITA.2010.5649453
    M3 - Conference contribution
    AN - SCOPUS:78651327327
    SN - 9781424460175
    SP - 106
    EP - 111
    BT - ISITA/ISSSTA 2010 - 2010 International Symposium on Information Theory and Its Applications
    ER -