TY - JOUR
T1 - Using multiple edit distances to automatically grade outputs from machine translation systems
AU - Akiba, Yasuhiro
AU - Imamura, Kenji
AU - Sumita, Eiichiro
AU - Nakaiwa, Hiromi
AU - Yamamoto, Seiichi
AU - Okuno, Hiroshi G.
PY - 2006
AB - This paper addresses the challenging problem of automatically evaluating output from machine translation (MT) systems that serve as subsystems of speech-to-speech MT (SSMT) systems. Conventional automatic MT evaluation methods include BLEU, which MT researchers have used frequently. However, BLEU has two drawbacks for SSMT evaluation. First, BLEU penalizes errors lightly at the beginning of a translation and heavily in the middle, even though its assessments should be independent of position. Second, BLEU is intolerant of colloquial sentences with small errors, although such errors do not prevent an SSMT-mediated conversation from continuing. The authors report a new evaluation method called "gRader based on Edit Distances (RED)" that automatically grades each MT output using a decision tree (DT). The DT is learned from training data encoded with multiple edit distances: the normal edit distance (ED), defined by insertion, deletion, and replacement, as well as its extensions. Using multiple edit distances provides greater tolerance than either ED or BLEU alone. Each evaluated MT output is assigned a grade by the DT. RED and BLEU were compared on the task of evaluating MT systems of varying quality on ATR's Basic Travel Expression Corpus (BTEC). Experimental results show that RED significantly outperforms BLEU.
KW - BLEU
KW - Decision tree (DT)
KW - Edit distances (EDs)
KW - Machine translation evaluation
KW - mWER
KW - Reference translations
UR - http://www.scopus.com/inward/record.url?scp=33947164689&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33947164689&partnerID=8YFLogxK
DO - 10.1109/TSA.2005.860770
M3 - Article
AN - SCOPUS:33947164689
VL - 14
SP - 393
EP - 401
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
SN - 1558-7916
IS - 2
ER -