This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within top 5 (NQcorrect5) and that with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.
|Journal||ACM Transactions on Asian Language Information Processing|
|Publication status||Published - 2007 Apr 1|
- Evaluation metrics
- Question answering
ASJC Scopus subject areas
- Computer Science(all)