Shrinking Japanese morphological analyzers with neural networks and semi-supervised learning

Arseny Tolmachev, Daisuke Kawahara, Sadao Kurohashi

研究成果: Conference contribution

抄録

For languages without natural word boundaries, like Japanese and Chinese, word segmentation is a prerequisite for downstream analysis. For Japanese, segmentation is often done jointly with part of speech tagging, and this process is usually referred to as morphological analysis. Morphological analyzers are trained on data hand-annotated with segmentation boundaries and part of speech tags. A segmentation dictionary or character n-gram information is also provided as additional inputs to the model. Incorporating this extra information makes models large. Modern neural morphological analyzers can consume gigabytes of memory. We propose a compact alternative to these cumbersome approaches which do not rely on any externally provided n-gram or word representations. The model uses only unigram character embeddings, encodes them using either stacked bi-LSTM or a self-attention network, and independently infers both segmentation and part of speech information. The model is trained in an end-to-end and semi-supervised fashion, on labels produced by a state-of-the-art analyzer. We demonstrate that the proposed technique rivals performance of a previous dictionary-based state-of-the-art approach and can even surpass it when training with the combination of human-annotated and automatically-annotated data. Our model itself is significantly smaller than the dictionary-based one: it uses less than 15 megabytes of space.

本文言語English
ホスト出版物のタイトルLong and Short Papers
出版社Association for Computational Linguistics (ACL)
ページ2744-2755
ページ数12
ISBN(電子版)9781950737130
出版ステータスPublished - 2019
外部発表はい
イベント2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019 - Minneapolis, United States
継続期間: 2019 6 22019 6 7

出版物シリーズ

名前NAACL HLT 2019 - 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference
1

Conference

Conference2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL HLT 2019
CountryUnited States
CityMinneapolis
Period19/6/219/6/7

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Linguistics and Language

フィンガープリント 「Shrinking Japanese morphological analyzers with neural networks and semi-supervised learning」の研究トピックを掘り下げます。これらがまとまってユニークなフィンガープリントを構成します。

引用スタイル