Exploiting parallel corpus for handling out-of-vocabulary words

Juan Luo, John Tinsley, Yves Lepage

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

This paper presents a hybrid model for handling out-of-vocabulary words in Japanese-to-English statistical machine translation output by exploiting a parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese katakana words into English words. A Japanese dependency structure analyzer is employed to tackle out-of-vocabulary kanji and hiragana words. The evaluation results demonstrate that it is an effective approach for addressing out-of-vocabulary word problems and decreasing the OOV rate in Japanese-to-English machine translation tasks.
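The script-dependent routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`script_of`, `route_oov`, `oov_rate`), the routing labels, and the choice to detect scripts by Unicode block are assumptions made for illustration. The abstract does not say how romaji or mixed-script words are handled, so the sketch labels them "other".

```python
def script_of(ch: str) -> str:
    """Classify one character by Unicode block (illustrative assumption)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:  # includes the prolonged sound mark U+30FC
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs (basic block only)
        return "kanji"
    if ch.isascii() and ch.isalpha():
        return "romaji"
    return "other"


def route_oov(word: str) -> str:
    """Pick a handler label for an out-of-vocabulary word (hypothetical labels)."""
    scripts = {script_of(ch) for ch in word}
    if scripts == {"katakana"}:
        # Katakana words go to a transliteration model.
        return "transliteration"
    if scripts and scripts <= {"kanji", "hiragana"}:
        # Kanji and hiragana words go to dependency-structure analysis.
        return "dependency_analysis"
    return "other"


def oov_rate(tokens: list[str], vocabulary: set[str]) -> float:
    """Fraction of tokens that are absent from the vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in vocabulary)
    return unknown / len(tokens)
```

Under these assumptions, a katakana loanword such as コンピューター would be routed to transliteration, while 食べる would be routed to dependency analysis. `oov_rate` is one plain reading of the "OOV rate" the abstract mentions; the paper may define it differently.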

Original language: English
Title of host publication: 27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 27
Publisher: National Chengchi University
Pages: 399-408
Number of pages: 10
ISBN (Print): 9789860385670
Publication status: Published - 2013
Event: 27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 2013 - Taipei
Duration: 2013 Nov 21 to 2013 Nov 24

Other

Other: 27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 2013
City: Taipei
Period: 13/11/21 to 13/11/24

Fingerprint

Vocabulary
Parallel Corpora
Katakana
Kanji
Hiragana
Writing Systems
Romaji
Machine Translation
Evaluation
Transliteration
English Words
Japanese Writing
Hybrid Model
Statistical Machine Translation

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science (all)

Cite this

Luo, J., Tinsley, J., & Lepage, Y. (2013). Exploiting parallel corpus for handling out-of-vocabulary words. In 27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 27 (pp. 399-408). National Chengchi University.

@inproceedings{cb3329e6c2fd4d76b2c269a7021697f1,
title = "Exploiting parallel corpus for handling out-of-vocabulary words",
abstract = "This paper presents a hybrid model for handling out-of-vocabulary words in Japanese-to-English statistical machine translation output by exploiting a parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese katakana words into English words. A Japanese dependency structure analyzer is employed to tackle out-of-vocabulary kanji and hiragana words. The evaluation results demonstrate that it is an effective approach for addressing out-of-vocabulary word problems and decreasing the OOV rate in Japanese-to-English machine translation tasks.",
author = "Juan Luo and John Tinsley and Yves Lepage",
year = "2013",
language = "English",
isbn = "9789860385670",
pages = "399--408",
booktitle = "27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 27",
publisher = "National Chengchi University"
}

TY - GEN
T1 - Exploiting parallel corpus for handling out-of-vocabulary words
AU - Luo, Juan
AU - Tinsley, John
AU - Lepage, Yves
PY - 2013
Y1 - 2013
AB - This paper presents a hybrid model for handling out-of-vocabulary words in Japanese-to-English statistical machine translation output by exploiting a parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese katakana words into English words. A Japanese dependency structure analyzer is employed to tackle out-of-vocabulary kanji and hiragana words. The evaluation results demonstrate that it is an effective approach for addressing out-of-vocabulary word problems and decreasing the OOV rate in Japanese-to-English machine translation tasks.
UR - http://www.scopus.com/inward/record.url?scp=84922792037&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84922792037&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84922792037
SN - 9789860385670
SP - 399
EP - 408
BT - 27th Pacific Asia Conference on Language, Information, and Computation, PACLIC 27
PB - National Chengchi University
ER -