What is your Mother Tongue? Improving Chinese native language identification by cleaning noisy data and adopting BM25

Lan Wang, Masahiro Tanaka, Hayato Yamana

Research output: Conference contribution

5 Citations (Scopus)

Abstract

Native language identification (NLI) is the task of identifying an author's native language from essays the author has written in a second language. In this work, we build a supervised NLI model on a Chinese learner corpus. In the NLI field, this is the first work to (1) automatically eliminate noisy data before the training phase and (2) employ the BM25 term weighting technique to score each feature. We also adopt a hierarchical structure of linear support vector machine classifiers and achieve a state-of-the-art accuracy of 77.1%, which exceeds that of other Chinese NLI methods by over 10%.
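As an illustration of the BM25 term weighting mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the standard Okapi BM25 formula with a smoothed, non-negative IDF and the commonly used defaults `k1 = 1.2` and `b = 0.75`; the abstract does not state which variant or parameters the paper uses. The function name `bm25_weights` and the toy essays are hypothetical.

```python
import math
from collections import Counter


def bm25_weights(docs, k1=1.2, b=0.75):
    """Return one {term: weight} dict per tokenized document.

    Assumption: standard Okapi BM25 with a smoothed IDF. The paper's
    exact variant and parameter values are not given in the abstract.
    """
    if not docs:
        return []
    n_docs = len(docs)
    # Average document length; fall back to 1.0 if every document is empty.
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))

    weighted = []
    for d in docs:
        tf = Counter(d)
        # Length normalization shared by all terms of this document.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        weights = {}
        for term, freq in tf.items():
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            weights[term] = idf * freq * (k1 + 1) / (freq + length_norm)
        weighted.append(weights)
    return weighted


# Hypothetical toy essays, already segmented into tokens.
essays = [
    ["我", "是", "学生"],
    ["我", "喜欢", "学习", "中文"],
    ["他", "是", "老师"],
]
weights = bm25_weights(essays)
```

In this sketch, a term that occurs in fewer essays receives a larger weight than a more common term with the same frequency in the same essay. The resulting per-essay weight dictionaries could then be vectorized and passed to a linear classifier such as a support vector machine.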

Original language: English
Title of host publication: Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781467395908
DOI
Publication status: Published - 12 Jul 2016
Event: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016 - Hangzhou, China
Duration: 12 Mar 2016 to 14 Mar 2016

Publication series

Name: Proceedings of 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016

Other

Other: 2016 IEEE International Conference on Big Data Analysis, ICBDA 2016
Country/Territory: China
City: Hangzhou
Period: 16/3/12 to 16/3/14

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Information Systems and Management
