Inflating a small parallel corpus into a large quasi-parallel corpus using monolingual data for Chinese–Japanese machine translation

Wei Yang, Hanfei Shen, Yves Lepage

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Increasing the size of parallel corpora for less-resourced language pairs is essential for machine translation (MT). To address the shortage of parallel corpora between Chinese and Japanese, we propose a method for constructing a quasi-parallel corpus by inflating a small Chinese–Japanese parallel corpus, with the aim of improving statistical machine translation (SMT) quality. We generate new sentences using analogical associations based on large amounts of monolingual data and a small amount of parallel data. We then filter the over-generated sentences with two methods: one based on BLEU and the other on N-sequences. We add the resulting aligned quasi-parallel corpus to a small parallel Chinese–Japanese corpus and perform SMT experiments, obtaining significant improvements over a baseline system.
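
As an illustration of the N-sequence filtering idea mentioned in the abstract, a minimal Python sketch follows. It assumes that the filter keeps a generated sentence only when every character N-gram it contains is attested in a reference monolingual corpus. This reading, the function names, and the toy English data are assumptions made for illustration and are not taken from the article.

```python
def char_ngrams(sentence, n):
    """Return the set of all character n-grams of a sentence."""
    return {sentence[i:i + n] for i in range(len(sentence) - n + 1)}


def build_attested(corpus, n):
    """Collect every character n-gram seen in the reference corpus."""
    attested = set()
    for sentence in corpus:
        attested |= char_ngrams(sentence, n)
    return attested


def n_sequence_filter(candidates, attested, n):
    """Keep candidates whose character n-grams are all attested.

    Assumption: a sentence shorter than n has no n-grams and is kept.
    """
    return [c for c in candidates if char_ngrams(c, n) <= attested]


# Toy data (hypothetical); real input would be Chinese or Japanese text.
reference = ["the cat sleeps", "a cat eats"]
candidates = ["the cat eats", "the cta eats"]

attested = build_attested(reference, 3)
kept = n_sequence_filter(candidates, attested, 3)
print(kept)
```

In this toy run the second candidate is discarded because it contains character trigrams that never occur in the reference sentences. The choice of N trades strictness against coverage: a larger N rejects more of the generated sentences.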

Original language: English
Pages (from-to): 88-99
Number of pages: 12
Journal: Journal of Information Processing
Volume: 25
Publication status: Published - 2017

Keywords

  • Analogies
  • BLEU
  • Clustering
  • Filtering
  • Machine translation
  • Quasi-parallel corpus

ASJC Scopus subject areas

  • Computer Science (all)
