Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

Jing Sun, Yves Lepage

Research output: Conference contribution

Abstract

Unlike in most Western languages, there are no typographic boundaries between words in written Japanese and Chinese. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, some problems still remain to be tackled. In this paper, we present an effective approach to extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.
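
As a rough illustration of what skipping word segmentation can mean in practice, the sketch below treats every character as its own token, so that an alignment tool expecting space-separated tokens can consume unsegmented Japanese and Chinese text. This is an assumption made for illustration only: the abstract does not specify the preprocessing actually used, and the function name `to_char_tokens` and the example sentence pair are hypothetical.

```python
def to_char_tokens(sentence: str) -> str:
    """Return the sentence with one space between every character.

    Existing whitespace (including the full-width space U+3000) is dropped,
    so each remaining character becomes a separate token.
    """
    return " ".join(ch for ch in sentence if not ch.isspace())


# Hypothetical Japanese-Chinese sentence pair, not taken from the paper.
pairs = [("私は学生です", "我是学生")]

for ja, zh in pairs:
    # One line per pair: character-tokenized source and target.
    print(to_char_tokens(ja), "|||", to_char_tokens(zh))
```

Character-level tokens avoid any dependence on a segmenter, at the cost of longer token sequences for the aligner to process.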

Original language: English
Title of host publication: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Pages: 351-360
Number of pages: 10
Publication status: Published - 1 Dec 2012
Event: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 - Bali, Indonesia
Duration: 7 Nov 2012 to 7 Nov 2012

Publication series

Name: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012

Conference

Conference: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Country: Indonesia
City: Bali
Period: 7 Nov 2012 to 7 Nov 2012

ASJC Scopus subject areas

  • Information Systems
  • Software

  • Cite this

    Sun, J., & Lepage, Y. (2012). Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? In Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 (pp. 351-360). (Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012).