Inflating a training corpus for SMT by using unrelated unaligned monolingual data

Wei Yang, Yves Lepage

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

To improve the translation quality of less-resourced language pairs, the most natural answer is to build larger and larger aligned training data, that is, to make those language pairs well-resourced. But aligned data is not always easy to collect. In contrast, monolingual data is usually easier to access. In this paper we show how to leverage unrelated unaligned monolingual data to construct additional training data that varies only a little from the original training data. We measure the contribution of such additional data to translation quality. We report an experiment between Chinese and Japanese where we use 70,000 sentences of unrelated unaligned monolingual additional data in each language to construct new sentence pairs that are not perfectly aligned. We add these sentence pairs to a training corpus of 110,000 sentence pairs, and report an increase of 6 BLEU points.
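The abstract does not spell out how the additional sentences are constructed, but the keywords (analogies, quasi-parallel corpus) point to generating new sentences by proportional analogy between strings. The Python sketch below only illustrates solving a string analogy A : B :: C : D in the restricted case where A and B differ in their ending or in their beginning; it is not the authors' implementation, and the function names (solve_analogy, common_prefix_len) are hypothetical.

# Illustrative sketch only: a restricted solver for proportional analogies
# on strings (a : b :: c : d).  It assumes a and b differ only in their
# ending (shared prefix) or in their beginning (shared suffix); names and
# structure are hypothetical, not taken from the paper.

def common_prefix_len(x: str, y: str) -> int:
    """Length of the longest common prefix of x and y."""
    n = 0
    while n < min(len(x), len(y)) and x[n] == y[n]:
        n += 1
    return n

def solve_analogy(a: str, b: str, c: str):
    """Return d such that a : b :: c : d, or None if the restricted
    prefix/suffix patterns handled here do not apply."""
    # Case 1: a = x + y, b = x + z (shared prefix x), c = w + y  =>  d = w + z
    k = common_prefix_len(a, b)
    y, z = a[k:], b[k:]
    if y == "" or c.endswith(y):
        return c[:len(c) - len(y)] + z
    # Case 2: a = y + x, b = z + x (shared suffix x), c = y + w  =>  d = z + w
    k = common_prefix_len(a[::-1], b[::-1])
    y, z = a[:len(a) - k], b[:len(b) - k]
    if y == "" or c.startswith(y):
        return z + c[len(y):]
    return None

if __name__ == "__main__":
    print(solve_analogy("walk", "walked", "talk"))      # talked
    print(solve_analogy("unhappy", "happy", "unkind"))  # kind

In the setting of the paper, sentences generated this way in each language would then be matched across the two languages to form the quasi-parallel (not perfectly aligned) pairs added to the training corpus; the matching step itself is described in the article.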

Original language: English
Pages (from-to): 236-248
Number of pages: 13
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 8686
Publication status: Published - 2014

Keywords

  • Analogies
  • Machine translation
  • Monolingual corpus
  • Quasi-parallel corpus

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

@article{3c34a8b1d56043a595f24d87a4dd7952,
title = "Inflating a training corpus for SMT by using unrelated unaligned monolingual data",
abstract = "To improve the translation quality of less resourced language pairs, the most natural answer is to build larger and larger aligned training data, that is to make those language pairs well resourced. But aligned data is not always easy to collect. In contrast, monolingual data are usually easier to access. In this paper we show how to leverage unrelated unaligned monolingual data to construct additional training data that varies only a little from the original training data. We measure the contribution of such additional data to translation quality. We report an experiment between Chinese and Japanese where we use 70,000 sentences of unrelated unaligned monolingual additional data in each language to construct new sentence pairs that are not perfectly aligned. We add these sentence pairs to a training corpus of 110,000 sentence pairs, and report an increase of 6 BLEU points.",
keywords = "Analogies, Machine translation, Monolingual corpus, Quasi-parallel corpus",
author = "Wei Yang and Yves Lepage",
year = "2014",
language = "English",
volume = "8686",
pages = "236--248",
journal = "Lecture Notes in Computer Science",
issn = "0302-9743",
publisher = "Springer Verlag",
}
