Unsupervised bilingual segmentation using MDL for machine translation

Bin Shan, Hao Wang, Yves Lepage

Research output: Contribution to conference (Paper)

Abstract

In statistical machine translation systems, alignment quality suffers when word forms or segmentation granularity differ across languages. To address this problem, we propose an unsupervised bilingual segmentation method based on the minimum description length (MDL) principle. Our work aims to improve translation quality by using a suitable segmentation model (lexicon). To generate bilingual lexica, we implement a heuristic, iterative algorithm. Each entry in the bilingual lexicon is required to have an appropriate length and to fit the data well. The results show that this bilingual segmentation significantly improved translation quality on the Chinese-Japanese and Japanese-Chinese sub-tasks.
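The trade-off between entry length and fit to the data can be illustrated with a generic two-part MDL cost. The sketch below is not the authors' algorithm: it is monolingual, and the function name `description_length`, the unigram coding of the data, and the uniform per-character coding of the lexicon are all assumptions made only for illustration.

```python
import math
from collections import Counter


def description_length(segmented_corpus):
    """Two-part MDL cost in bits: L(lexicon) + L(data | lexicon).

    Hypothetical helper for illustration; the coding scheme is an assumption.
    """
    tokens = [tok for sentence in segmented_corpus for tok in sentence]
    counts = Counter(tokens)
    total = len(tokens)

    # L(data | lexicon): code length under unigram relative frequencies.
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())

    # L(lexicon): spell each entry character by character plus one
    # end-of-entry symbol, using a uniform code over the alphabet.
    alphabet = {ch for tok in counts for ch in tok}
    bits_per_char = math.log2(len(alphabet) + 1)
    lexicon_bits = sum((len(tok) + 1) * bits_per_char for tok in counts)

    return lexicon_bits + data_bits


# Three candidate segmentations of the same toy corpus: "abab", "ab", "abab".
by_char = [list("abab"), list("ab"), list("abab")]
by_unit = [["ab", "ab"], ["ab"], ["ab", "ab"]]
by_sentence = [["abab"], ["ab"], ["abab"]]

for name, seg in [("char", by_char), ("unit", by_unit), ("sentence", by_sentence)]:
    print(name, round(description_length(seg), 2))
```

On this toy corpus the mid-sized unit `ab` gives the lowest cost: single characters make the data expensive to encode, while whole sentences make the lexicon expensive. An iterative search would compare candidate segmentations by a cost of this kind and keep the changes that lower it.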

Original language: English
Pages: 89-96
Number of pages: 8
Publication status: Published - 2019 Jan 1
Event: 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017 - Cebu City, Philippines
Duration: 2017 Nov 16 to 2017 Nov 18

Conference

Conference: 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017
Country: Philippines
City: Cebu City
Period: 17/11/16 to 17/11/18

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (miscellaneous)

Cite this

Shan, B., Wang, H., & Lepage, Y. (2019). Unsupervised bilingual segmentation using MDL for machine translation. 89-96. Paper presented at 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017, Cebu City, Philippines.