Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs-the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I Error, and on Kendall's rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, nDCG' and AP proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.
ASJC Scopus subject areas