Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

Jing Sun, Yves Lepage

Research output: Conference contribution

Abstract

Unlike most Western languages, written Japanese and Chinese have no typographic boundaries between words. Word segmentation is therefore normally adopted as an initial step in most natural language processing tasks for these languages. Although word segmentation techniques have improved greatly, both theoretically and practically, some problems remain to be tackled. In this paper, we present an effective approach to extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.
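The abstract does not say how unsegmented text is fed to the aligner. One common way to avoid a word segmenter is to treat every character as a token, so that phrases can later be formed from character sequences. The following is a minimal sketch under that assumption; the function name `to_char_tokens` and the sample sentences are hypothetical and are not taken from the paper.

```python
def to_char_tokens(sentence: str) -> str:
    """Return the sentence as space-separated characters.

    Assumption (not from the paper): character-level tokens stand in
    for the output of a word segmenter.
    """
    # Drop any whitespace already present, then emit one token per character.
    return " ".join(ch for ch in sentence if not ch.isspace())


if __name__ == "__main__":
    # Hypothetical Japanese-Chinese sentence pair.
    print(to_char_tokens("私は学生です"))
    print(to_char_tokens("我是学生"))
```

Text prepared this way could be passed to any alignment tool that expects whitespace-delimited tokens. Whether the authors used exactly this preprocessing is not stated in the abstract.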

Original language: English
Title of host publication: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Pages: 351-360
Number of pages: 10
Publication status: Published - 1 Dec 2012
Event: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 - Bali, Indonesia
Duration: 7 Nov 2012 → 7 Nov 2012

Publication series

Name: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012

Conference

Conference: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Country/Territory: Indonesia
City: Bali
Period: 7 Nov 2012 → 7 Nov 2012

ASJC Scopus subject areas

  • Information Systems
  • Software
