Can word segmentation be considered harmful for statistical machine translation tasks between Japanese and Chinese?

Jing Sun, Yves Lepage

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Unlike most Western languages, written Japanese and Chinese have no typographic boundaries between words. Word segmentation is thus normally adopted as an initial step in most natural language processing tasks for these Asian languages. Although word segmentation techniques have improved greatly both theoretically and practically, some problems remain to be tackled. In this paper, we present an effective approach to extracting Chinese and Japanese phrases without conducting word segmentation beforehand, using a sampling-based multilingual alignment method. According to our experiments, it is also feasible to train a statistical machine translation system on a small Japanese-Chinese training corpus without performing word segmentation beforehand.

Original language: English
Title of host publication: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Pages: 351-360
Number of pages: 10
Publication status: Published - 2012 Dec 1
Event: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012 - Bali, Indonesia
Duration: 2012 Nov 7 → 2012 Nov 7

Publication series

Name: Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012

Conference

Conference: 26th Pacific Asia Conference on Language, Information and Computation, PACLIC 2012
Country: Indonesia
City: Bali
Period: 12/11/7 → 12/11/7

ASJC Scopus subject areas

  • Information Systems
  • Software
