Improving Patent Translation using Bilingual Term Extraction and Re-tokenization for Chinese-Japanese

Wei Yang, Yves Lepage

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Unlike European languages, many Asian languages such as Chinese and Japanese do not mark word boundaries typographically in their writing systems. Word segmentation (tokenization), which breaks sentences down into individual words (tokens), is therefore normally treated as the first step in machine translation (MT). For Chinese and Japanese, different segmentation rules and tools lead to segmentation results with different levels of granularity between the two languages. To improve translation accuracy, we adjust and balance the granularity of the segmentation results around terms in a Chinese-Japanese patent corpus used to train the translation model. In this paper, we describe a statistical machine translation (SMT) system built on a Chinese-Japanese patent training corpus re-tokenized using extracted bilingual multi-word terms.
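The core idea of the re-tokenization step is to merge token sequences that correspond to an extracted multi-word term into a single token before SMT training. The paper itself provides no code; the sketch below is only an illustrative assumption of how such a merge could work, and the function name, term list, and example sentence are hypothetical rather than taken from the authors' system.

```python
# Minimal sketch of re-tokenization around extracted multi-word terms:
# token sequences that match a term are merged into one token.
# The term set and example below are illustrative assumptions.

def retokenize(tokens, terms, joiner=""):
    """Greedily merge the longest matching multi-word term at each position."""
    max_len = max((len(t) for t in terms), default=1)
    result, i = [], 0
    while i < len(tokens):
        merged = False
        # Try the longest span first so longer terms win over their prefixes.
        for span in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(tokens[i:i + span])
            if candidate in terms:
                result.append(joiner.join(candidate))
                i += span
                merged = True
                break
        if not merged:
            result.append(tokens[i])
            i += 1
    return result

# Hypothetical example: a Chinese patent phrase segmented too finely,
# with one extracted multi-word term to be merged.
terms = {("半导体", "装置")}
tokens = ["一", "种", "半导体", "装置", "的", "制造", "方法"]
print(retokenize(tokens, terms))
# ['一', '种', '半导体装置', '的', '制造', '方法']
```

Applying the same merge to both sides of the bilingual patent corpus would bring the Chinese and Japanese segmentations to a more comparable granularity around terms before training.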

Original language: English
Title of host publication: WAT 2016 - 3rd Workshop on Asian Translation, Proceedings of the Workshop
Publisher: Association for Computational Linguistics (ACL)
Pages: 194-202
Number of pages: 9
ISBN (Electronic): 9784879747143
Publication status: Published - 2016
Event: 3rd Workshop on Asian Translation, WAT 2016 - Osaka, Japan
Duration: 2016 Dec 11 - 2016 Dec 16

Publication series

Name: WAT 2016 - 3rd Workshop on Asian Translation, Proceedings of the Workshop

Conference

Conference: 3rd Workshop on Asian Translation, WAT 2016
Country/Territory: Japan
City: Osaka
Period: 16/12/11 - 16/12/16

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (miscellaneous)
