Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

Jing Sun, Yves Lepage

Research output: Conference contribution

Abstract

Unlike most Western languages, written Japanese and Chinese have no typographic boundaries between words. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, some problems remain to be tackled. In this paper, we present an effective approach to extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.

Original language: English
Title of host publication: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Pages: 351-360
Number of pages: 10
Publication status: Published - 2012
Event: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 - Bali
Duration: 7 Nov 2012 to 7 Nov 2012

Fingerprint

Sampling
Processing
Experiments

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Sun, J., & Lepage, Y. (2012). Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 (pp. 351-360).

Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? / Sun, Jing; Lepage, Yves.

Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012. 2012. p. 351-360.

Sun, J & Lepage, Y 2012, Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012. pp. 351-360, 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012, Bali, 7 Nov 2012.
Sun J, Lepage Y. Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012. 2012. p. 351-360
Sun, Jing ; Lepage, Yves. / Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012. 2012. pp. 351-360
@inproceedings{8ea17ff2298d4c81b6503be9a3162a62,
title = "Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?",
abstract = "Unlike most Western languages, there are no typographic boundaries between words in written Japanese and Chinese. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, there still remains some problems to be tackled. In this paper, we present an effective approach in extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.",
author = "Jing Sun and Yves Lepage",
year = "2012",
language = "English",
isbn = "9789791421171",
pages = "351--360",
booktitle = "Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012",

}

TY - GEN

T1 - Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

AU - Sun, Jing

AU - Lepage, Yves

PY - 2012

Y1 - 2012

AB - Unlike most Western languages, there are no typographic boundaries between words in written Japanese and Chinese. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, there still remains some problems to be tackled. In this paper, we present an effective approach in extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.

UR - http://www.scopus.com/inward/record.url?scp=84883365383&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883365383&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:84883365383

SN - 9789791421171

SP - 351

EP - 360

BT - Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012

ER -