TY - JOUR
T1 - On the reliability of factoid question answering evaluation
AU - Sakai, Tetsuya
PY - 2007/4/1
Y1 - 2007/4/1
N2 - This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within top 5 (NQcorrect5) and that with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.
AB - This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within top 5 (NQcorrect5) and that with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.
KW - Evaluation metrics
KW - Question answering
UR - http://www.scopus.com/inward/record.url?scp=34247192084&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=34247192084&partnerID=8YFLogxK
U2 - 10.1145/1227850.1227853
DO - 10.1145/1227850.1227853
M3 - Article
AN - SCOPUS:34247192084
VL - 6
JO - ACM Transactions on Asian Language Information Processing
JF - ACM Transactions on Asian Language Information Processing
SN - 1530-0226
IS - 1
M1 - 1227853
ER -