A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

Arbi Haza Nasution, Yohei Murakami, Toru Ishida

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending the constraints of the recent pivot-based induction technique and further enabling multiple symmetry assumption cycles to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. As baselines, we use three constraint-based methods from our previous work, the Inverse Consultation method, and translation pairs generated from the Cartesian product of the input dictionaries. We evaluate our results using precision, recall, and F-score. Our customizable approach allows the user to conduct cross-validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combinations of heuristics and numbers of symmetry assumption cycles to obtain the highest F-score. Our proposed methods show statistically significant improvements in precision and F-score compared to our previous constraint-based methods. The results show that our method has the potential to complement other bilingual dictionary creation methods, such as word alignment models using parallel corpora for high-resource languages, while also handling low-resource languages well.
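
As a rough illustration of the evaluation described in the abstract, the following Python sketch pairs words of two languages that share a pivot word (one naive reading of a Cartesian-product baseline, which may differ from the baseline actually used in the article) and scores the result with precision, recall, and F-score. The function names, the dictionary layout, and the toy tokens `a1`, `b1`, `p1`, and so on are all hypothetical and are not taken from the article.

```python
from itertools import product


def pivot_candidates(a_to_pivot, b_to_pivot):
    """Pair every A-word with every B-word that shares at least one pivot word.

    Assumed input layout (hypothetical): each dictionary maps a word to the
    set of pivot-language words it translates to.
    """
    pivot_to_a = {}
    for a_word, pivots in a_to_pivot.items():
        for pivot in pivots:
            pivot_to_a.setdefault(pivot, set()).add(a_word)

    pivot_to_b = {}
    for b_word, pivots in b_to_pivot.items():
        for pivot in pivots:
            pivot_to_b.setdefault(pivot, set()).add(b_word)

    pairs = set()
    # Only pivots present in both dictionaries can link the two languages.
    for pivot in pivot_to_a.keys() & pivot_to_b.keys():
        pairs.update(product(pivot_to_a[pivot], pivot_to_b[pivot]))
    return pairs


def precision_recall_f1(predicted, gold):
    """Score a set of predicted translation pairs against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy data with placeholder tokens, not real dictionary entries.
a_to_pivot = {"a1": {"p1"}, "a2": {"p1", "p2"}}
b_to_pivot = {"b1": {"p1"}, "b2": {"p2"}}
gold = {("a1", "b1"), ("a2", "b2")}

candidates = pivot_candidates(a_to_pivot, b_to_pivot)
precision, recall, f1 = precision_recall_f1(candidates, gold)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

On this toy input the naive pairing over-generates one wrong pair, (a2, b1), which is the kind of precision loss that constraint-based filtering is meant to reduce.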

Original language: English
Article number: 9
Journal: ACM Transactions on Asian and Low-Resource Language Information Processing
Volume: 17
Issue number: 2
DOI: 10.1145/3138815
Publication status: Published - 2017 Nov 1
Externally published: Yes

Keywords

  • Closely-related languages
  • Cognate recognition
  • Constraint satisfaction problem
  • Low-resource languages
  • Pivot-based bilingual lexicon induction

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families. / Nasution, Arbi Haza; Murakami, Yohei; Ishida, Toru.

In: ACM Transactions on Asian and Low-Resource Language Information Processing, Vol. 17, No. 2, Article 9, 01.11.2017.

@article{93963414c9a14f5f8833323fa6af5a27,
title = "A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families",
abstract = "The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycle to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. We use three constraint-based methods from our previous work, the Inverse Consultation method and translation pairs generated from Cartesian product of input dictionaries as baselines. We evaluate our result using the metrics of precision, recall, and F-score. Our customizable approach allows the user to conduct cross validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combination of heuristics and number of symmetry assumption cycles to gain the highest F-score. Our proposed methods have statistically significant improvement of precision and F-score compared to our previous constraint-based methods. The results show that our method demonstrates the potential to complement other bilingual dictionary creation methods like word alignment models using parallel corpora for high-resource languages while well handling low-resource languages.",
keywords = "Closely-related languages, Cognate recognition, Constraint satisfaction problem, Low-resource languages, Pivot-based bilingual lexicon induction",
author = "Nasution, {Arbi Haza} and Murakami, Yohei and Ishida, Toru",
year = "2017",
month = "11",
day = "1",
doi = "10.1145/3138815",
language = "English",
volume = "17",
journal = "ACM Transactions on Asian and Low-Resource Language Information Processing",
issn = "2375-4699",
publisher = "Association for Computing Machinery (ACM)",
number = "2",

}

TY - JOUR

T1 - A Generalized Constraint Approach to Bilingual Dictionary Induction for Low-Resource Language Families

AU - Nasution, Arbi Haza

AU - Murakami, Yohei

AU - Ishida, Toru

PY - 2017/11/1

Y1 - 2017/11/1

N2 - The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycle to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. We use three constraint-based methods from our previous work, the Inverse Consultation method and translation pairs generated from Cartesian product of input dictionaries as baselines. We evaluate our result using the metrics of precision, recall, and F-score. Our customizable approach allows the user to conduct cross validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combination of heuristics and number of symmetry assumption cycles to gain the highest F-score. Our proposed methods have statistically significant improvement of precision and F-score compared to our previous constraint-based methods. The results show that our method demonstrates the potential to complement other bilingual dictionary creation methods like word alignment models using parallel corpora for high-resource languages while well handling low-resource languages.

AB - The lack or absence of parallel and comparable corpora makes bilingual lexicon extraction a difficult task for low-resource languages. The pivot language and cognate recognition approaches have been proven useful for inducing bilingual lexicons for such languages. We propose constraint-based bilingual lexicon induction for closely related languages by extending constraints from the recent pivot-based induction technique and further enabling multiple symmetry assumption cycle to reach many more cognates in the transgraph. We further identify cognate synonyms to obtain many-to-many translation pairs. This article utilizes four datasets: one Austronesian low-resource language and three Indo-European high-resource languages. We use three constraint-based methods from our previous work, the Inverse Consultation method and translation pairs generated from Cartesian product of input dictionaries as baselines. We evaluate our result using the metrics of precision, recall, and F-score. Our customizable approach allows the user to conduct cross validation to predict the optimal hyperparameters (cognate threshold and cognate synonym threshold) with various combination of heuristics and number of symmetry assumption cycles to gain the highest F-score. Our proposed methods have statistically significant improvement of precision and F-score compared to our previous constraint-based methods. The results show that our method demonstrates the potential to complement other bilingual dictionary creation methods like word alignment models using parallel corpora for high-resource languages while well handling low-resource languages.

KW - Closely-related languages

KW - Cognate recognition

KW - Constraint satisfaction problem

KW - Low-resource languages

KW - Pivot-based bilingual lexicon induction

UR - http://www.scopus.com/inward/record.url?scp=85034666604&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85034666604&partnerID=8YFLogxK

U2 - 10.1145/3138815

DO - 10.1145/3138815

M3 - Article

AN - SCOPUS:85034666604

VL - 17

JO - ACM Transactions on Asian and Low-Resource Language Information Processing

JF - ACM Transactions on Asian and Low-Resource Language Information Processing

SN - 2375-4699

IS - 2

M1 - 9

ER -