On information retrieval metrics designed for evaluation with incomplete relevance assessments

Tetsuya Sakai, Noriko Kando

Research output: Contribution to journal › Article

59 Citations (Scopus)

Abstract

Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs: the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I error, and on Kendall's rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, nDCG' and AP' proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.
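
The analyses named in the abstract (artificially reducing relevance data, discriminative power, and Kendall's rank correlation between system rankings) can be illustrated with a minimal sketch. This is not the authors' implementation: the input names per_topic_scores (system name to list of per-topic metric scores) and qrels (topic to judged document grades) are hypothetical, and a paired t-test stands in for whichever significance test the paper actually applies.

# Minimal sketch (assumptions noted above) of the abstract's three analyses.
import random
from scipy import stats

def discriminative_power(per_topic_scores, alpha=0.05):
    """Proportion of system pairs whose per-topic scores differ significantly
    at level alpha (the allowed probability of a Type I error)."""
    systems = list(per_topic_scores)
    pairs = [(a, b) for i, a in enumerate(systems) for b in systems[i + 1:]]
    significant = 0
    for a, b in pairs:
        _, p = stats.ttest_rel(per_topic_scores[a], per_topic_scores[b])
        if p < alpha:
            significant += 1
    return significant / len(pairs) if pairs else 0.0

def kendall_tau_between_rankings(mean_scores_a, mean_scores_b):
    """Kendall's rank correlation between two system rankings, e.g. rankings
    produced by two different metrics or by full vs. reduced relevance data."""
    systems = sorted(set(mean_scores_a) & set(mean_scores_b))
    tau, _ = stats.kendalltau([mean_scores_a[s] for s in systems],
                              [mean_scores_b[s] for s in systems])
    return tau

def reduce_qrels(qrels, fraction=0.1, seed=0):
    """Simulate extreme incompleteness by randomly keeping only a fraction of
    the judged documents per topic (a simplification of the paper's setup)."""
    rng = random.Random(seed)
    reduced = {}
    for topic, judged in qrels.items():
        docs = list(judged)
        keep = max(1, int(len(docs) * fraction))
        reduced[topic] = {d: judged[d] for d in rng.sample(docs, keep)}
    return reduced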

Original language: English
Pages (from-to): 447-470
Number of pages: 24
Journal: Information Retrieval
Volume: 11
Issue number: 5
DOIs: 10.1007/s10791-008-9059-7
Publication status: Published - 2008 Oct
Externally published: Yes

Fingerprint

  • Information retrieval
  • Information retrieval systems
  • Evaluation
  • Manpower
  • Ranking
  • Experiments

Keywords

  • Evaluation metrics
  • Incompleteness
  • Relevance assessments
  • Test collections

ASJC Scopus subject areas

  • Information Systems

Cite this

On information retrieval metrics designed for evaluation with incomplete relevance assessments. / Sakai, Tetsuya; Kando, Noriko.

In: Information Retrieval, Vol. 11, No. 5, 10.2008, p. 447-470.

Research output: Contribution to journal › Article

@article{e62693e2dfb74ac8901e2a86a5edd9b8,
title = "On information retrieval metrics designed for evaluation with incomplete relevance assessments",
abstract = "Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs-the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I Error, and on Kendall's rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, nDCG' and AP proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.",
keywords = "Evaluation metrics, Incompleteness, Relevance assessments, Test collections",
author = "Tetsuya Sakai and Noriko Kando",
year = "2008",
month = "10",
doi = "10.1007/s10791-008-9059-7",
language = "English",
volume = "11",
pages = "447--470",
journal = "Information Retrieval",
issn = "1386-4564",
publisher = "Springer Netherlands",
number = "5",

}

TY - JOUR

T1 - On information retrieval metrics designed for evaluation with incomplete relevance assessments

AU - Sakai, Tetsuya

AU - Kando, Noriko

PY - 2008/10

Y1 - 2008/10

N2 - Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs: the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I error, and on Kendall's rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, nDCG' and AP' proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.

AB - Modern information retrieval (IR) test collections have grown in size, but the available manpower for relevance assessments has more or less remained constant. Hence, how to reliably evaluate and compare IR systems using incomplete relevance data, where many documents exist that were never examined by the relevance assessors, is receiving a lot of attention. This article compares the robustness of IR metrics to incomplete relevance assessments, using four different sets of graded-relevance test collections with submitted runs: the TREC 2003 and 2004 robust track data and the NTCIR-6 Japanese and Chinese IR data from the crosslingual task. Following previous work, we artificially reduce the original relevance data to simulate IR evaluation environments with extremely incomplete relevance data. We then investigate the effect of this reduction on discriminative power, which we define as the proportion of system pairs with a statistically significant difference for a given probability of Type I error, and on Kendall's rank correlation, which reflects the overall resemblance of two system rankings according to two different metrics or two different relevance data sets. According to these experiments, nDCG' and AP' proposed by Sakai are superior to bpref proposed by Buckley and Voorhees and to Rank-Biased Precision proposed by Moffat and Zobel. We also point out some weaknesses of bpref and Rank-Biased Precision by examining their formal definitions.

KW - Evaluation metrics

KW - Incompleteness

KW - Relevance assessments

KW - Test collections

UR - http://www.scopus.com/inward/record.url?scp=50849122035&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=50849122035&partnerID=8YFLogxK

U2 - 10.1007/s10791-008-9059-7

DO - 10.1007/s10791-008-9059-7

M3 - Article

VL - 11

SP - 447

EP - 470

JO - Information Retrieval

JF - Information Retrieval

SN - 1386-4564

IS - 5

ER -