On the reliability of factoid question answering evaluation

Research output: Contribution to journal (Article)

2 Citations (Scopus)

Abstract

This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within top 5 (NQcorrect5) and that with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.
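The three rank-based metrics named in finding (1) can be illustrated with a minimal sketch. It assumes each question's ranked answer list has already been judged as a list of booleans (True = correct), truncated at five answers as in the abstract. The function names and the toy data are hypothetical and are not taken from the paper. Q-measure is left out because it needs graded correctness levels and gain values that the abstract does not specify.

```python
from typing import List, Optional


def first_correct_rank(judgments: List[bool], cutoff: int = 5) -> Optional[int]:
    """Return the 1-based rank of the first correct answer within `cutoff`, or None."""
    for rank, is_correct in enumerate(judgments[:cutoff], start=1):
        if is_correct:
            return rank
    return None


def reciprocal_rank(judgments: List[bool], cutoff: int = 5) -> float:
    """Reciprocal rank for one question: 1/rank of the first correct answer, else 0."""
    rank = first_correct_rank(judgments, cutoff)
    return 1.0 / rank if rank is not None else 0.0


def mean_reciprocal_rank(runs: List[List[bool]], cutoff: int = 5) -> float:
    """Average reciprocal rank over all questions."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(j, cutoff) for j in runs) / len(runs)


def nq_correct(runs: List[List[bool]], k: int) -> float:
    """Fraction of questions with a correct answer within the top k ranks."""
    if not runs:
        return 0.0
    hits = sum(1 for j in runs if first_correct_rank(j, k) is not None)
    return hits / len(runs)


# Toy judgments for four questions (assumed data, for illustration only).
runs = [
    [True, False],                        # correct at rank 1
    [False, False, True],                 # correct at rank 3
    [False, False, False, False, False],  # no correct answer
    [False, True],                        # correct at rank 2
]

print("NQcorrect1:", nq_correct(runs, 1))
print("NQcorrect5:", nq_correct(runs, 5))
print("Mean reciprocal rank:", mean_reciprocal_rank(runs))
```

In this toy data NQcorrect5 treats a correct answer at rank 1 and one at rank 3 identically, and NQcorrect1 ignores everything below rank 1, whereas reciprocal rank separates all three cases. That coarser granularity is one plausible intuition for the sensitivity difference the abstract reports, though the sketch itself does not reproduce the paper's stability or swap experiments.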

Original language: English
Article number: 1227853
Journal: ACM Transactions on Asian Language Information Processing
Volume: 6
Issue number: 1
DOI: 10.1145/1227850.1227853
Publication status: Published - 2007 Apr 1
Externally published: Yes

Keywords

  • Evaluation metrics
  • Question answering

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

On the reliability of factoid question answering evaluation. / Sakai, Tetsuya.

In: ACM Transactions on Asian Language Information Processing, Vol. 6, No. 1, 1227853, 01.04.2007.

@article{0dd99cd5719947f782a7a428767e4ed4,
title = "On the reliability of factoid question answering evaluation",
abstract = "This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within top 5 (NQcorrect5) and that with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.",
keywords = "Evaluation metrics, Question answering",
author = "Tetsuya Sakai",
year = "2007",
month = "4",
day = "1",
doi = "10.1145/1227850.1227853",
language = "English",
volume = "6",
journal = "ACM Transactions on Asian Language Information Processing",
issn = "1530-0226",
publisher = "Association for Computing Machinery (ACM)",
number = "1",
}

TY - JOUR

T1 - On the reliability of factoid question answering evaluation

AU - Sakai, Tetsuya

PY - 2007/4/1

Y1 - 2007/4/1

AB - This paper compares some existing evaluation metrics for factoid question answering (QA) from the viewpoint of stability and sensitivity, using the NTCIR-4 QAC2 Japanese factoid QA tasks and the Buckley/Voorhees stability method and Voorhees/Buckley swap method. Our main findings are: (1) For QA evaluation with ranked lists containing up to five answers, the fraction of questions with a correct answer within top 5 (NQcorrect5) and that with a correct answer at rank 1 (NQcorrect1) are not as stable and sensitive as reciprocal rank. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is at least as stable and sensitive as reciprocal rank, provided that a mild gain value assignment is used. Emphasizing answer correctness levels tends to hurt stability and sensitivity, while handling multiple correct answers improves them. As our experimental methods are language-independent, we believe that these findings apply to QA in languages other than Japanese as well.

KW - Evaluation metrics

KW - Question answering

UR - http://www.scopus.com/inward/record.url?scp=34247192084&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=34247192084&partnerID=8YFLogxK

U2 - 10.1145/1227850.1227853

DO - 10.1145/1227850.1227853

M3 - Article

VL - 6

JO - ACM Transactions on Asian Language Information Processing

JF - ACM Transactions on Asian Language Information Processing

SN - 1530-0226

IS - 1

M1 - 1227853

ER -