Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation

Wei Yang, Hanfei Shen, Yves Lepage

Research output: Contribution to journal › Article

1 Citation (Scopus)

Abstract

Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method to construct a quasi-parallel corpus by inflating a small amount of Chinese–Japanese corpus, so as to improve statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two filtering methods: one based on BLEU and the second one based on N-sequences. We add the obtained aligned quasi-parallel corpus to a small parallel Chinese–Japanese corpus and perform SMT experiments. We obtain significant improvements over a baseline system.

Original language: English
Pages (from-to): 88-99
Number of pages: 12
Journal: Journal of Information Processing
Volume: 25
DOI: 10.2197/ipsjjip.25.88
Publication status: Published - 2017

Keywords

  • Analogies
  • BLEU
  • Clustering
  • Filtering
  • Machine translation
  • Quasi-parallel corpus

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

@article{22fc236e47cc4539953096bc82036dd9,
title = "Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation",
abstract = "Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method to construct a quasi-parallel corpus by inflating a small amount of Chinese–Japanese corpus, so as to improve statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two filtering methods: one based on BLEU and the second one based on N-sequences. We add the obtained aligned quasi-parallel corpus to a small parallel Chinese–Japanese corpus and perform SMT experiments. We obtain significant improvements over a baseline system.",
keywords = "Analogies, BLEU, Clustering, Filtering, Machine translation, Quasi-parallel corpus",
author = "Wei Yang and Hanfei Shen and Yves Lepage",
year = "2017",
doi = "10.2197/ipsjjip.25.88",
language = "English",
volume = "25",
pages = "88--99",
journal = "Journal of Information Processing",
issn = "0387-5806",
publisher = "Information Processing Society of Japan",
}

TY - JOUR

T1 - Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation

AU - Yang, Wei

AU - Shen, Hanfei

AU - Lepage, Yves

PY - 2017

Y1 - 2017

N2 - Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method to construct a quasi-parallel corpus by inflating a small amount of Chinese–Japanese corpus, so as to improve statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We filter over-generated sentences using two filtering methods: one based on BLEU and the second one based on N-sequences. We add the obtained aligned quasi-parallel corpus to a small parallel Chinese–Japanese corpus and perform SMT experiments. We obtain significant improvements over a baseline system.

KW - Analogies

KW - BLEU

KW - Clustering

KW - Filtering

KW - Machine translation

KW - Quasi-parallel corpus

UR - http://www.scopus.com/inward/record.url?scp=85009932302&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85009932302&partnerID=8YFLogxK

U2 - 10.2197/ipsjjip.25.88

DO - 10.2197/ipsjjip.25.88

M3 - Article

AN - SCOPUS:85009932302

VL - 25

SP - 88

EP - 99

JO - Journal of Information Processing

JF - Journal of Information Processing

SN - 0387-5806

ER -