Unsupervised bilingual segmentation using MDL for machine translation

Bin Shan, Hao Wang, Yves Lepage

Research output: Contribution to conference › Paper › peer-review

Abstract

In statistical machine translation systems, alignment performs poorly when word forms or segmentation granularity differ across languages. To address this problem, we propose an unsupervised bilingual segmentation method based on the minimum description length (MDL) principle. Our work aims to improve translation quality by using a proper segmentation model (lexicon). To generate bilingual lexica, we implement a heuristic, iterative algorithm. Each entry in the bilingual lexicon is required to have a proper length and to fit the data well. The results show that this bilingual segmentation significantly improved translation quality on the Chinese-Japanese and Japanese-Chinese sub-tasks.

Original language: English
Pages: 89-96
Number of pages: 8
Publication status: Published - 2019
Event: 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017 - Cebu City, Philippines
Duration: 2017 Nov 16 → 2017 Nov 18

Conference

Conference: 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017
Country/Territory: Philippines
City: Cebu City
Period: 17/11/16 → 17/11/18

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (miscellaneous)
