Khmer POS tagger: A transformation-based approach with hybrid unknown word handling

Chenda Nou, Wataru Kameyama

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    4 Citations (Scopus)

    Abstract

    This paper presents initial research on a Khmer part-of-speech tagger. We propose modifications to the rule-application algorithm of the transformation-based approach to adapt it to the Khmer language, which is morphologically and syntactically different from English. Furthermore, to overcome the limited coverage of the rule-based approach in handling unknown words, we propose a hybrid approach that combines the rule-based and trigram models. Although trained on a very small corpus, both proposed approaches achieve higher accuracy than the conventional methods. The tagger achieves 95.27% accuracy on training data and 91.96% on test data, which includes 9% unknown words.
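To make the two ideas in the abstract concrete, the following is a minimal sketch of a transformation-based tagger with a trigram fallback for unknown words. It is not the authors' implementation: the toy English corpus, the tag set, the single rule template ("change tag A to B when the previous tag is Z"), and all function names are assumptions chosen for illustration, and the paper's Khmer-specific modifications are not reproduced here.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical data, not taken from the paper).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]

# Hypothetical hand-written rule: (from_tag, to_tag, previous_tag).
RULES = [("VERB", "NOUN", "DET")]


def train(corpus):
    """Collect per-word tag counts and tag-trigram counts."""
    lexicon = defaultdict(Counter)
    trigrams = defaultdict(Counter)
    for sentence in corpus:
        history = ("<s>", "<s>")  # two start-of-sentence markers
        for word, tag in sentence:
            lexicon[word][tag] += 1
            trigrams[history][tag] += 1
            history = (history[1], tag)
    return lexicon, trigrams


def initial_tags(words, lexicon, trigrams, default="NOUN"):
    """Known words get their most frequent tag; unknown words get the
    tag that most often followed the two preceding tags in training."""
    tags = []
    for word in words:
        history = tuple((["<s>", "<s>"] + tags)[-2:])
        if word in lexicon:
            tag = lexicon[word].most_common(1)[0][0]
        elif trigrams.get(history):
            tag = trigrams[history].most_common(1)[0][0]
        else:
            tag = default  # assumed last-resort tag
        tags.append(tag)
    return tags


def apply_rules(tags, rules):
    """Apply each rule in order, scanning left to right; a change takes
    effect immediately for the positions that follow it."""
    tags = list(tags)
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return tags


def tag(words, lexicon, trigrams, rules):
    """Initial tagging followed by transformation rules."""
    return apply_rules(initial_tags(words, lexicon, trigrams), rules)


LEXICON, TRIGRAMS = train(TRAIN)
# "bird" is unseen in TRAIN, so its tag comes from the trigram counts.
print(tag(["the", "bird", "runs"], LEXICON, TRIGRAMS, RULES))
# ['DET', 'NOUN', 'VERB']
```

In a full transformation-based learner the rules would be learned from the corpus by repeatedly selecting the transformation that most reduces tagging errors; here they are written by hand to keep the sketch short.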

    Original language: English
    Title of host publication: ICSC 2007 International Conference on Semantic Computing
    Pages: 482-489
    Number of pages: 8
    DOI: https://doi.org/10.1109/ICSC.2007.104
    Publication status: Published - 2007
    Event: ICSC 2007 International Conference on Semantic Computing - Irvine CA
    Duration: 2007 Sep 17 to 2007 Sep 19

    Other

    Other: ICSC 2007 International Conference on Semantic Computing
    City: Irvine CA
    Period: 07/9/17 to 07/9/19

    ASJC Scopus subject areas

    • Computer Science (all)
    • Computer Science Applications

    Cite this

    Nou, C., & Kameyama, W. (2007). Khmer POS tagger: A transformation-based approach with hybrid unknown word handling. In ICSC 2007 International Conference on Semantic Computing (pp. 482-489). [4338385] https://doi.org/10.1109/ICSC.2007.104

    @inproceedings{b3b03858c9d3484bb3dcdaa8c40a3900,
    title = "Khmer POS tagger: A transformation-based approach with hybrid unknown word handling",
    abstract = "This paper presents an initiative research on Khmer part-of-speech tagger. We propose some modifications on applying rule algorithm of the transformation-based approach to adapt to Khmer language which is morphologically and syntactically different from the English language. Furthermore, to overcome the limited coverage of the rule-based approach in handling unknown words, we propose a hybrid approach to combine the rule-based and trigram models. Although training on a very small corpus, both proposed approaches achieve higher accuracy than the conventional methods. The tagger achieves 95.27{\%} on training data and 91.96{\%} on test data which includes 9{\%} of unknown words.",
    author = "Chenda Nou and Wataru Kameyama",
    year = "2007",
    doi = "10.1109/ICSC.2007.104",
    language = "English",
    isbn = "0769529976",
    pages = "482--489",
    booktitle = "ICSC 2007 International Conference on Semantic Computing",

    }

    TY - GEN

    T1 - Khmer POS tagger

    T2 - A transformation-based approach with hybrid unknown word handling

    AU - Nou, Chenda

    AU - Kameyama, Wataru

    PY - 2007

    Y1 - 2007

    AB - This paper presents an initiative research on Khmer part-of-speech tagger. We propose some modifications on applying rule algorithm of the transformation-based approach to adapt to Khmer language which is morphologically and syntactically different from the English language. Furthermore, to overcome the limited coverage of the rule-based approach in handling unknown words, we propose a hybrid approach to combine the rule-based and trigram models. Although training on a very small corpus, both proposed approaches achieve higher accuracy than the conventional methods. The tagger achieves 95.27% on training data and 91.96% on test data which includes 9% of unknown words.

    UR - http://www.scopus.com/inward/record.url?scp=47749121029&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=47749121029&partnerID=8YFLogxK

    U2 - 10.1109/ICSC.2007.104

    DO - 10.1109/ICSC.2007.104

    M3 - Conference contribution

    AN - SCOPUS:47749121029

    SN - 0769529976

    SN - 9780769529974

    SP - 482

    EP - 489

    BT - ICSC 2007 International Conference on Semantic Computing

    ER -