On information retrieval metrics designed for evaluation with incomplete relevance assessments

Tetsuya Sakai*, Noriko Kando

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review


Abstract

Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has remained more or less constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs: the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I error, and on Kendall's rank correlation, which reflects the overall resemblance of two system rankings produced by two different metrics or two different relevance data sets. According to these experiments, nDCG' and AP' proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.
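
The abstract refers to the formal definitions of bpref and Rank-Biased Precision. For reference, the commonly cited forms of these two metrics are sketched below in our own notation; this is a gloss on the published definitions, not a quotation from the article:

```latex
% bpref (Buckley & Voorhees): R = number of judged relevant documents and
% N = number of judged nonrelevant documents for a topic; r ranges over the
% retrieved judged relevant documents, and n over judged nonrelevant
% documents ranked above r.
\mathrm{bpref} = \frac{1}{R} \sum_{r}
  \left( 1 - \frac{|\, n \text{ ranked above } r \,|}{\min(R, N)} \right)

% Rank-Biased Precision (Moffat & Zobel): r_i is the relevance of the
% document at rank i, p is the user's persistence parameter, and d is the
% evaluation depth.
\mathrm{RBP} = (1 - p) \sum_{i=1}^{d} r_i \, p^{\, i-1}
```

In both formulas an unjudged document earns no credit (bpref skips it; RBP treats it as nonrelevant), which is why the behaviour of these metrics under incomplete assessments merits scrutiny.

The two analyses the abstract describes can also be made concrete in code. The sketch below is a minimal illustration under our own assumptions, not the authors' implementation: the function and variable names are ours, and a paired t-test stands in purely for illustration where any significance test (e.g. a bootstrap test, as used in this line of work) could be substituted.

```python
# Minimal sketch (assumed names, not the authors' code) of the two analyses
# described in the abstract: discriminative power and Kendall's rank
# correlation between system rankings.

from itertools import combinations
from scipy import stats


def discriminative_power(scores, alpha=0.05):
    """Proportion of run pairs whose per-topic scores differ significantly.

    scores: dict mapping run name -> list of per-topic metric values,
            with all runs scored on the same topics in the same order.
    A paired two-sided t-test is used here for illustration; any
    significance test could be plugged in instead.
    """
    pairs = list(combinations(sorted(scores), 2))
    significant = sum(
        1 for a, b in pairs
        if stats.ttest_rel(scores[a], scores[b]).pvalue < alpha
    )
    return significant / len(pairs)


def ranking_similarity(mean_x, mean_y):
    """Kendall's tau between the system rankings induced by two score sets.

    mean_x, mean_y: dicts mapping run name -> mean score under metric X / Y
    (or under the full vs. the reduced relevance data).
    """
    runs = sorted(mean_x)  # fixed, common run order for both score lists
    tau, _ = stats.kendalltau([mean_x[r] for r in runs],
                              [mean_y[r] for r in runs])
    return tau
```

Under this reading, the reduction experiment amounts to recomputing both quantities after downsampling the relevance judgments and comparing the results against those obtained with the full data.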

Original language: English
Pages (from-to): 447-470
Number of pages: 24
Journal: Information Retrieval
Volume: 11
Issue number: 5
DOIs
Publication status: Published - October 2008
Externally published: Yes

Keywords

  • Evaluation metrics
  • Incompleteness
  • Relevance assessments
  • Test collections

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences
