Improved Chinese-Japanese phrase-based MT quality using an extended quasi-parallel corpus

Hao Wang, Wei Yang, Yves Lepage

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

State-of-the-art phrase-based machine translation (MT) systems usually require large parallel corpora for training. Both the quality and the quantity of the training data directly influence the performance of such systems. The lack of open-source bilingual corpora for a given language pair results in lower reported translation scores for that pair. This is the case for Chinese-Japanese. In this paper, we propose extending an initial parallel corpus with quasi-parallel sentences instead of adding new parallel sentences. The extension is obtained using monolingual analogical associations. Our experiments show that using such quasi-parallel corpora improves the performance of Chinese-Japanese translation systems.

Original language: English
Title of host publication: PIC 2014 - Proceedings of 2014 IEEE International Conference on Progress in Informatics and Computing
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 6-10
Number of pages: 5
ISBN (Print): 9781479920334
DOI: https://doi.org/10.1109/PIC.2014.6972285
Publication status: Published - 2014 Dec 2
Event: 2014 2nd IEEE International Conference on Progress in Informatics and Computing, PIC 2014 - Shanghai
Duration: 2014 May 16 to 2014 May 18

Keywords

  • analogy
  • machine translation
  • paraphrasing
  • quasi-parallel data

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Cite this

Wang, H., Yang, W., & Lepage, Y. (2014). Improved Chinese-Japanese phrase-based MT quality using an extended quasi-parallel corpus. In PIC 2014 - Proceedings of 2014 IEEE International Conference on Progress in Informatics and Computing (pp. 6-10). [6972285] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/PIC.2014.6972285

@inproceedings{805ea078d9c34485a554ed5229b231dc,
title = "Improved Chinese-Japanese phrase-based MT quality using an extended quasi-parallel corpus",
abstract = "State-of-the-art phrase-based machine translation (MT) systems usually demand large parallel corpora in the step of training. The quality and the quantity of the training data exert a direct influence on the performance of such translation systems. The lack of open-source bilingual corpora for a particular language pair results in lower translation scores reported for such a language pair. This is the case of Chinese-Japanese. In this paper, we propose to build an extension of an initial parallel corpus in the form of quasi-parallel sentences, instead of adding new parallel sentences. The extension of the initial corpus is obtained by using monolingual analogical associations. Our experiments show that the use of such quasi-parallel corpora improves the performance of Chinese-Japanese translation systems.",
keywords = "analogy, machine translation, paraphrasing, quasi-parallel data",
author = "Hao Wang and Wei Yang and Yves Lepage",
year = "2014",
month = "12",
day = "2",
doi = "10.1109/PIC.2014.6972285",
language = "English",
isbn = "9781479920334",
pages = "6--10",
booktitle = "PIC 2014 - Proceedings of 2014 IEEE International Conference on Progress in Informatics and Computing",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN

T1 - Improved Chinese-Japanese phrase-based MT quality using an extended quasi-parallel corpus

AU - Wang, Hao

AU - Yang, Wei

AU - Lepage, Yves

PY - 2014/12/2

Y1 - 2014/12/2

AB - State-of-the-art phrase-based machine translation (MT) systems usually demand large parallel corpora in the step of training. The quality and the quantity of the training data exert a direct influence on the performance of such translation systems. The lack of open-source bilingual corpora for a particular language pair results in lower translation scores reported for such a language pair. This is the case of Chinese-Japanese. In this paper, we propose to build an extension of an initial parallel corpus in the form of quasi-parallel sentences, instead of adding new parallel sentences. The extension of the initial corpus is obtained by using monolingual analogical associations. Our experiments show that the use of such quasi-parallel corpora improves the performance of Chinese-Japanese translation systems.

KW - analogy

KW - machine translation

KW - paraphrasing

KW - quasi-parallel data

UR - http://www.scopus.com/inward/record.url?scp=84919384902&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84919384902&partnerID=8YFLogxK

U2 - 10.1109/PIC.2014.6972285

DO - 10.1109/PIC.2014.6972285

M3 - Conference contribution

AN - SCOPUS:84919384902

SN - 9781479920334

SP - 6

EP - 10

BT - PIC 2014 - Proceedings of 2014 IEEE International Conference on Progress in Informatics and Computing

PB - Institute of Electrical and Electronics Engineers Inc.

ER -