Korean L2 vocabulary prediction: Can a large annotated corpus be used to train better models for predicting unknown words?

Kevin P. Yancey, Yves Lepage

Research output: Conference contribution

Abstract

Vocabulary knowledge prediction is an important task in lexical text simplification for foreign language learners (L2 learners). However, previously studied methods that use hand-crafted rules based on one or two word features have had limited success. A recent study hypothesized that a supervised learning classifier trained on a large annotated corpus of words unknown by L2 learners may yield better results. Our study crowdsourced the production of such a corpus for Korean, now consisting of 2,385 annotated passages contributed by 357 distinct L2 learners. Our preliminary evaluation of models trained on this corpus shows favorable results, thus confirming the hypothesis. In this paper, we describe our methodology for building this resource in detail and analyze its results so that it can be duplicated for other languages. We also present our preliminary evaluation of models trained on this annotated corpus, the best of which recalls 80% of unknown words with 71% precision. We make our annotation data available.

Original language: English
Title of host publication: LREC 2018 - 11th International Conference on Language Resources and Evaluation
Editors: Hitoshi Isahara, Bente Maegaard, Stelios Piperidis, Christopher Cieri, Thierry Declerck, Koiti Hasida, Helene Mazo, Khalid Choukri, Sara Goggi, Joseph Mariani, Asuncion Moreno, Nicoletta Calzolari, Jan Odijk, Takenobu Tokunaga
Publisher: European Language Resources Association (ELRA)
Pages: 438-445
Number of pages: 8
ISBN (electronic): 9791095546009
Publication status: Published - 1 Jan 2019
Event: 11th International Conference on Language Resources and Evaluation, LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018


ASJC Scopus subject areas

  • Linguistics and Language
  • Education
  • Library and Information Sciences
  • Language and Linguistics

Cite this

Yancey, K. P., & Lepage, Y. (2019). Korean L2 vocabulary prediction: Can a large annotated corpus be used to train better models for predicting unknown words? In H. Isahara, B. Maegaard, S. Piperidis, C. Cieri, T. Declerck, K. Hasida, H. Mazo, K. Choukri, S. Goggi, J. Mariani, A. Moreno, N. Calzolari, J. Odijk, & T. Tokunaga (Eds.), LREC 2018 - 11th International Conference on Language Resources and Evaluation (pp. 438-445). European Language Resources Association (ELRA).