Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

Jing Sun, Yves Lepage

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Unlike most Western languages, written Japanese and Chinese have no typographic boundaries between words. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, some problems remain to be tackled. In this paper, we present an effective approach to extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.

Original language: English
Title of host publication: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Pages: 351-360
Number of pages: 10
Publication status: Published - 2012
Event: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 - Bali
Duration: 2012 Nov 7 → 2012 Nov 7

Other

Other: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
City: Bali
Period: 12/11/7 → 12/11/7

Fingerprint

Sampling
Processing
Experiments

ASJC Scopus subject areas

  • Information Systems
  • Software

Cite this

Sun, J., & Lepage, Y. (2012). Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese? In Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 (pp. 351-360)

@inproceedings{8ea17ff2298d4c81b6503be9a3162a62,
title = "Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?",
author = "Jing Sun and Yves Lepage",
year = "2012",
language = "English",
isbn = "9789791421171",
pages = "351--360",
booktitle = "Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012",

}

TY - GEN
T1 - Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?
AU - Sun, Jing
AU - Lepage, Yves
PY - 2012
Y1 - 2012
UR - http://www.scopus.com/inward/record.url?scp=84883365383&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84883365383&partnerID=8YFLogxK
M3 - Conference contribution
SN - 9789791421171
SP - 351
EP - 360
BT - Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
ER -