What is your Mother Tongue?

Improving Chinese native language identification by cleaning noisy data and adopting BM25

Lan Wang, Masahiro Tanaka, Hayato Yamana

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    1 Citation (Scopus)

    Abstract

    Native language identification (NLI) is the task of identifying an author's native language from essays the author has written in a second language. In this work, a supervised model is built for this task using a Chinese learner corpus. In the NLI field, this is the first work to (1) automatically eliminate noisy data before the training phase and (2) employ the BM25 term weighting technique to score each feature. We also adopt a hierarchical structure of linear support vector machine classifiers, achieving a state-of-the-art accuracy of 77.1%, which exceeds that of other Chinese NLI methods by over 10%.
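
As a rough illustration of the BM25 term weighting mentioned in the abstract, the sketch below scores each term of each tokenized essay. It is not the authors' implementation: the function name `bm25_weights`, the toy essays, the smoothed IDF variant, and the parameter values `k1=1.2` and `b=0.75` (common defaults) are all assumptions, since the abstract does not state which variant or settings were used.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    Assumed variant: smoothed, non-negative IDF multiplied by a
    length-normalized, saturating term-frequency factor.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        weighted.append({
            term: math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            * freq * (k1 + 1) / (freq + norm)
            for term, freq in tf.items()
        })
    return weighted


# Hypothetical, pre-tokenized toy essays (not taken from the paper's corpus).
docs = [
    ["我", "是", "学生"],
    ["我", "喜欢", "学习", "汉语"],
    ["他", "是", "老师"],
]
weights = bm25_weights(docs)
```

In a pipeline like the one the abstract describes, dictionaries of this kind could be turned into sparse feature vectors and passed to linear SVM classifiers; terms that occur in fewer essays receive a larger weight than common ones with the same in-essay frequency.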

    Original language: English
    Title of host publication: Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
    Publisher: Institute of Electrical and Electronics Engineers Inc.
    ISBN (Electronic): 9781467395908
    DOI: https://doi.org/10.1109/ICBDA.2016.7509793
    Publication status: Published - 2016 Jul 12
    Event: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016 - Hangzhou, China
    Duration: 2016 Mar 12 to 2016 Mar 14

    Other

    Other: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
    Country: China
    City: Hangzhou
    Period: 16/3/12 to 16/3/14

    Fingerprint

    Support vector machines
    Cleaning
    Classifiers
    Language

    Keywords

    • author profiling
    • machine learning
    • text mining

    ASJC Scopus subject areas

    • Computer Networks and Communications
    • Computer Science Applications
    • Information Systems
    • Information Systems and Management

    Cite this

    Wang, L., Tanaka, M., & Yamana, H. (2016). What is your Mother Tongue? Improving Chinese native language identification by cleaning noisy data and adopting BM25. In Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016 [7509793]. Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICBDA.2016.7509793
