Using multiple edit distances to automatically grade outputs from machine translation systems

Yasuhiro Akiba, Kenji Imamura, Eiichiro Sumita, Hiromi Nakaiwa, Seiichi Yamamoto, Hiroshi G. Okuno

Research output: Contribution to journal › Article

Abstract

This paper addresses the challenging problem of automatically evaluating output from machine translation (MT) systems that are subsystems of speech-to-speech MT (SSMT) systems. Conventional automatic MT evaluation methods include BLEU, which MT researchers have frequently used. However, BLEU has two drawbacks in SSMT evaluation. First, BLEU assesses errors lightly at the beginning of translations and heavily in the middle, even though its assessments should be independent of position. Second, BLEU lacks tolerance in accepting colloquial sentences with small errors, although such errors do not prevent us from continuing an SSMT-mediated conversation. In this paper, the authors report a new evaluation method called "gRader based on Edit Distances (RED)" that automatically grades each MT output by using a decision tree (DT). The DT is learned from training data that are encoded by using multiple edit distances, that is, normal edit distance (ED) defined by insertion, deletion, and replacement, as well as its extensions. The use of multiple edit distances allows more tolerance than either ED or BLEU. Each evaluated MT output is assigned a grade by using the DT. RED and BLEU were compared for the task of evaluating MT systems of varying quality on ATR's Basic Travel Expression Corpus (BTEC). Experimental results show that RED significantly outperforms BLEU.
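For readers wanting a concrete picture of the approach, the sketch below shows the word-level edit distance (insertion, deletion, replacement) that RED builds on, and one hypothetical way to turn several such distances into features for a decision-tree grader. The abstract does not specify the paper's exact ED extensions or feature encoding, so the min/max/mean statistics and the scikit-learn classifier referenced in the comments are illustrative assumptions, not the authors' implementation.

# Minimal illustrative sketch, NOT the paper's implementation: the exact
# edit-distance extensions, feature encoding, and decision-tree setup are
# assumptions made here for illustration only.

from typing import List

def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level edit distance with insertion, deletion, and replacement."""
    m, n = len(hyp), len(ref)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i                       # delete every hypothesis word
    for j in range(n + 1):
        dp[0][j] = j                       # insert every reference word
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match or replacement
    return dp[m][n]

def multi_ed_features(hyp: str, refs: List[str]) -> List[float]:
    """Encode one MT output as several edit-distance statistics computed
    against a set of reference translations (hypothetical encoding)."""
    h = hyp.lower().split()
    norm = [edit_distance(h, r.lower().split()) / max(len(r.split()), 1)
            for r in refs]
    return [min(norm), max(norm), sum(norm) / len(norm)]

# A decision tree learned from such feature vectors, with human-assigned
# grades as labels, could then grade new MT outputs, e.g.:
#   clf = sklearn.tree.DecisionTreeClassifier().fit(X_train, y_train)
#   grade = clf.predict([multi_ed_features(mt_output, references)])

In the paper the DT is induced from training data encoded with normal ED and its extensions; the statistics above merely stand in for those features.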

Original language: English
Pages (from-to): 393-401
Number of pages: 9
Journal: IEEE Transactions on Audio, Speech, and Language Processing
Volume: 14
Issue number: 2
DOI: 10.1109/TSA.2005.860770
Publication status: Published - 2006
Externally published: Yes

Keywords

  • BLEU
  • Decision tree (DT)
  • Edit distances (EDs)
  • Machine translation evaluation
  • mWER
  • Reference translations

ASJC Scopus subject areas

  • Electrical and Electronic Engineering
  • Acoustics and Ultrasonics

Cite this

Using multiple edit distances to automatically grade outputs from machine translation systems. / Akiba, Yasuhiro; Imamura, Kenji; Sumita, Eiichiro; Nakaiwa, Hiromi; Yamamoto, Seiichi; Okuno, Hiroshi G.

In: IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 2, 2006, pp. 393-401.
