What is your Mother Tongue? Improving Chinese native language identification by cleaning noisy data and adopting BM25

Lan Wang, Masahiro Tanaka, Hayato Yamana

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

Native language identification (NLI) is the task of identifying an author's native language from essays the author has written in a second language. In this work, we build a supervised NLI model based on a Chinese learner corpus. In the NLI field, this is the first work to (1) eliminate noisy data automatically before the training phase and (2) employ a BM25 term weighting technique to score each feature. We also adopt a hierarchical structure of linear support vector machine classifiers. The resulting model achieves a state-of-the-art accuracy of 77.1%, which exceeds that of other Chinese NLI methods by over 10%.
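For readers unfamiliar with BM25 term weighting, the following is a minimal sketch of how each feature in a document could be scored. It is not the authors' implementation: the function name `bm25_weights`, the parameter values `k1=1.2` and `b=0.75` (common defaults), the non-negative IDF variant, and the toy token lists are all assumptions made for illustration. The abstract does not specify the features or the parameter settings used.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: BM25 weight} dict per tokenized document.

    docs is a non-empty list of token lists. k1 and b are the usual
    BM25 saturation and length-normalization parameters (assumed values).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    weighted = []
    for d in docs:
        tf = Counter(d)
        # Longer-than-average documents are penalized through this term.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        weights = {}
        for term, freq in tf.items():
            # IDF variant with +1 inside the log, so it is never negative.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            weights[term] = idf * freq * (k1 + 1) / (freq + length_norm)
        weighted.append(weights)
    return weighted


# Hypothetical toy corpus: each essay is a list of feature tokens.
essays = [["a", "b", "a"], ["b", "c"], ["b", "d", "d", "d"]]
weights = bm25_weights(essays)
```

In this sketch a feature that occurs in every essay (here `"b"`) receives a lower weight than a feature confined to one essay (here `"a"`). The resulting dictionaries could then be converted to feature vectors for a linear classifier.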

Original language: English
Title of host publication: Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781467395908
Publication status: Published - 2016 Jul 12
Event: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016 - Hangzhou, China
Duration: 2016 Mar 12 to 2016 Mar 14

Publication series

Name: Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016

Other

Other: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
Country/Territory: China
City: Hangzhou
Period: 16/3/12 to 16/3/14

Keywords

  • author profiling
  • machine learning
  • text mining

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Information Systems and Management
