Inflating a training corpus for SMT by using unrelated unaligned monolingual data

Wei Yang, Yves Lepage

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

To improve translation quality for less-resourced language pairs, the most natural answer is to build larger and larger aligned training data, that is, to make those language pairs well resourced. But aligned data are not always easy to collect. In contrast, monolingual data are usually easier to access. In this paper, we show how to leverage unrelated, unaligned monolingual data to construct additional training data that vary only a little from the original training data. We measure the contribution of such additional data to translation quality. We report an experiment between Chinese and Japanese in which we use 70,000 sentences of unrelated, unaligned monolingual data in each language to construct new sentence pairs that are not perfectly aligned. We add these sentence pairs to a training corpus of 110,000 sentence pairs and report an increase of 6 BLEU points.

Original language: English
Pages (from-to): 236-248
Number of pages: 13
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8686
Publication status: Published - 2014 Jan 1

Keywords

  • Analogies
  • Machine translation
  • Monolingual corpus
  • Quasi-parallel corpus

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
