Unsupervised bilingual segmentation using MDL for machine translation

Bin Shan, Hao Wang, Yves Lepage

Research output: Contribution to conference (Paper)

Abstract

In statistical machine translation systems, alignment performs poorly when word forms or granularity differ across languages. To address this problem, we propose an unsupervised bilingual segmentation method based on the minimum description length (MDL) principle. Our work aims to improve translation quality through a suitable segmentation model (lexicon). To generate bilingual lexica, we implement a heuristic, iterative algorithm. Each entry in the bilingual lexicon is required to have a suitable length and to fit the data well. The results show that this bilingual segmentation significantly improved translation quality on the Chinese-Japanese and Japanese-Chinese sub-tasks.
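
The abstract does not state the cost function, so the following is only a minimal sketch of the generic two-part MDL idea, total cost = L(lexicon) + L(corpus | lexicon), and not the authors' formulation. It is monolingual for brevity. The function name `description_length`, the uniform character code for lexicon entries, the unigram code for the corpus, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def description_length(segmented_corpus):
    """Two-part MDL cost in bits: L(lexicon) + L(corpus | lexicon).

    `segmented_corpus` is a list of sentences, each a list of tokens.
    Illustrative cost model only (an assumption, not the paper's).
    """
    tokens = [tok for sentence in segmented_corpus for tok in sentence]
    counts = Counter(tokens)
    total = len(tokens)

    # Model cost: spell out each lexicon entry character by character with a
    # uniform code over the observed characters plus one end-of-entry symbol.
    alphabet = {ch for tok in counts for ch in tok}
    bits_per_char = math.log2(len(alphabet) + 1)
    model_bits = sum((len(tok) + 1) * bits_per_char for tok in counts)

    # Data cost: each token costs -log2 of its empirical unigram probability.
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    return model_bits + data_bits


# Three candidate segmentations of the same toy corpus (hypothetical data).
raw = ["abcabc", "abcabc", "abc"]
chars = [list(s) for s in raw]                                   # too fine
chunks = [[s[i:i + 3] for i in range(0, len(s), 3)] for s in raw]  # "abc" units
sentences = [[s] for s in raw]                                   # too coarse

for name, seg in [("chars", chars), ("chunks", chunks), ("sentences", sentences)]:
    print(name, round(description_length(seg), 2))
```

On this toy corpus the "abc" chunks give the smallest total: single characters keep the lexicon small but make the corpus expensive to encode, while whole sentences fit the corpus closely but inflate the lexicon. This is one way to read the requirement that each entry have a suitable length and fit the data well.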

Original language: English
Pages: 89-96
Number of pages: 8
Publication status: Published - 2019 Jan 1
Event: 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017 - Cebu City, Philippines
Duration: 2017 Nov 16 to 2017 Nov 18

Conference

Conference: 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017
Country: Philippines
City: Cebu City
Period: 17/11/16 to 17/11/18

Fingerprint

Segmentation
Machine Translation
Length
Bilingual Lexicon
Machine Translation System
Lexicon
Statistical Machine Translation
Granularity
Language
Word Forms
Alignment
Heuristics

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (miscellaneous)

Cite this

Shan, B., Wang, H., & Lepage, Y. (2019). Unsupervised bilingual segmentation using MDL for machine translation. 89-96. Paper presented at 31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017, Cebu City, Philippines.

@conference{1569c55963844932808046c8c232b0f3,
title = "Unsupervised bilingual segmentation using MDL for machine translation",
abstract = "In statistical machine translation systems, alignment performs poorly when word forms or granularity differ across languages. To address this problem, we propose an unsupervised bilingual segmentation method based on the minimum description length (MDL) principle. Our work aims to improve translation quality through a suitable segmentation model (lexicon). To generate bilingual lexica, we implement a heuristic, iterative algorithm. Each entry in the bilingual lexicon is required to have a suitable length and to fit the data well. The results show that this bilingual segmentation significantly improved translation quality on the Chinese-Japanese and Japanese-Chinese sub-tasks.",
author = "Bin Shan and Hao Wang and Yves Lepage",
year = "2019",
month = "1",
day = "1",
language = "English",
pages = "89--96",
note = "31st Pacific Asia Conference on Language, Information and Computation, PACLIC 2017 ; Conference date: 16-11-2017 Through 18-11-2017",

}

TY - CONF

T1 - Unsupervised bilingual segmentation using MDL for machine translation

AU - Shan, Bin

AU - Wang, Hao

AU - Lepage, Yves

PY - 2019/1/1

Y1 - 2019/1/1

N2 - In statistical machine translation systems, alignment performs poorly when word forms or granularity differ across languages. To address this problem, we propose an unsupervised bilingual segmentation method based on the minimum description length (MDL) principle. Our work aims to improve translation quality through a suitable segmentation model (lexicon). To generate bilingual lexica, we implement a heuristic, iterative algorithm. Each entry in the bilingual lexicon is required to have a suitable length and to fit the data well. The results show that this bilingual segmentation significantly improved translation quality on the Chinese-Japanese and Japanese-Chinese sub-tasks.


UR - http://www.scopus.com/inward/record.url?scp=85072810641&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85072810641&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85072810641

SP - 89

EP - 96

ER -