What is your Mother Tongue? Improving Chinese native language identification by cleaning noisy data and adopting BM25

Lan Wang, Masahiro Tanaka, Hayato Yamana

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    3 Citations (Scopus)

    Abstract

    Native language identification (NLI) is the task of identifying an author's native language from essays that the author wrote in a second language. In this work, a supervised model is built for this task using a Chinese learner corpus. In the NLI field, this is the first work to (1) automatically eliminate noisy data before the training phase and (2) employ the BM25 term-weighting technique to score each feature. We also adopt a hierarchical structure of linear support vector machine classifiers, achieving a state-of-the-art accuracy of 77.1%, which exceeds that of other Chinese NLI methods by over 10%.
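As a rough illustration of the BM25 term weighting mentioned in the abstract, the sketch below scores each feature of a tokenized essay with a BM25 weight. It is not the authors' implementation: the abstract does not state which BM25 variant, parameters, or features were used, so the Lucene-style non-negative IDF, the defaults `k1=1.2` and `b=0.75`, the function name `bm25_weights`, and the toy essays are all assumptions made for this example.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    k1 and b are commonly used defaults, not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are damped.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        weights = {}
        for term, freq in tf.items():
            # Non-negative IDF variant (an assumption for this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            weights[term] = idf * freq * (k1 + 1) / (freq + length_norm)
        weighted.append(weights)
    return weighted


# Hypothetical toy essays, already split into features (here: words).
essays = [
    ["我", "是", "学生"],
    ["我", "喜欢", "学习", "汉语"],
    ["他", "是", "老师"],
]
features = bm25_weights(essays)
```

In a pipeline like the one the abstract describes, weight dictionaries of this kind would be turned into feature vectors and passed to linear SVM classifiers; that step is omitted here.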

    Original language: English
    Title of host publication: Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    ISBN (Electronic): 9781467395908
    DOIs: https://doi.org/10.1109/ICBDA.2016.7509793
    Publication status: Published - 2016 Jul 12
    Event: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016 - Hangzhou, China
    Duration: 2016 Mar 12 - 2016 Mar 14

    Other

    Other: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
    Country: China
    City: Hangzhou
    Period: 16/3/12 - 16/3/14

    Keywords

    • author profiling
    • machine learning
    • text mining

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Computer Science Applications
    • Information Systems
    • Information Systems and Management

  • Cite this

    Wang, L., Tanaka, M., & Yamana, H. (2016). What is your Mother Tongue? Improving Chinese native language identification by cleaning noisy data and adopting BM25. In Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016 [7509793]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICBDA.2016.7509793