On the robustness of information retrieval metrics to biased relevance assessments

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

Information Retrieval (IR) test collections are growing larger, and relevance data constructed through pooling are suspected of becoming more and more incomplete and biased. Several studies have used IR evaluation metrics specifically designed to handle this problem, but most of them have only examined the metrics under incomplete but unbiased conditions, using random samples of the original relevance data. This paper examines nine metrics in more realistic settings, by reducing the number of pooled systems and the number of pooled documents. Even though previous studies have shown that metrics based on a condensed list, obtained by removing all unjudged documents from the original ranked list, are effective for handling very incomplete but unbiased relevance data, we show that these results do not hold when the relevance data are biased towards particular systems or towards the top of the pools. More specifically, we show that the condensed-list versions of Average Precision, Q-measure and normalised Discounted Cumulative Gain, which we denote as AP′, Q′ and nDCG′, are not necessarily superior to the original metrics for handling biases. Nevertheless, AP′ and Q′ are generally superior to bpref, Rank-Biased Precision and its condensed-list version even in the presence of biases.
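
The key construct above is the "condensed list": before a run is scored, every document that has no relevance judgment is removed, and the metric is computed over the remaining, judged documents only. Below is a minimal sketch, not code from the paper, contrasting standard Average Precision (AP) with its condensed-list counterpart AP′; the function names and the toy run and judgments are hypothetical, introduced only to illustrate the mechanism.

# Minimal sketch (not the paper's code) of AP and condensed-list AP'.
# qrels maps judged documents to 1 (relevant) or 0 (non-relevant);
# documents absent from qrels are unjudged. R = number of known relevant documents.

def average_precision(run, qrels, R):
    """Standard AP: unjudged documents are treated as non-relevant."""
    hits, prec_sum = 0, 0.0
    for rank, doc in enumerate(run, start=1):
        if qrels.get(doc, 0) == 1:
            hits += 1
            prec_sum += hits / rank
    return prec_sum / R if R else 0.0

def condensed_average_precision(run, qrels, R):
    """AP': remove all unjudged documents, then compute AP on the condensed list."""
    condensed = [doc for doc in run if doc in qrels]
    return average_precision(condensed, qrels, R)

# Hypothetical toy data: d3 and d5 are unjudged.
qrels = {"d1": 1, "d2": 0, "d4": 1, "d6": 0}        # R = 2 known relevant documents
run = ["d3", "d1", "d5", "d4", "d2", "d6"]
print(average_precision(run, qrels, 2))              # 0.5
print(condensed_average_precision(run, qrels, 2))    # 1.0

Unjudged documents depress AP but are simply skipped by AP′; the paper asks whether that behaviour remains trustworthy once the judged documents are biased towards particular systems or towards the top of the pools, rather than being an unbiased random sample of the full relevance data.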

Original language: English
Pages (from-to): 156-166
Number of pages: 11
Journal: Journal of Information Processing
Volume: 17
DOI: 10.2197/ipsjjip.17.156
Publication status: Published - 2009
Externally published: Yes

Fingerprint

Information retrieval

ASJC Scopus subject areas

  • Computer Science (all)

Cite this

On the robustness of information retrieval metrics to biased relevance assessments. / Sakai, Tetsuya.

In: Journal of Information Processing, Vol. 17, 2009, p. 156-166.

@article{2c1f6ba0e6f9431284b2e932701dc852,
title = "On the robustness of information retrieval metrics to biased relevance assessments",
author = "Tetsuya Sakai",
year = "2009",
doi = "10.2197/ipsjjip.17.156",
language = "English",
volume = "17",
pages = "156--166",
journal = "Journal of Information Processing",
issn = "0387-5806",
publisher = "Information Processing Society of Japan",

}

TY - JOUR

T1 - On the robustness of information retrieval metrics to biased relevance assessments

AU - Sakai, Tetsuya

PY - 2009

Y1 - 2009

UR - http://www.scopus.com/inward/record.url?scp=76749165630&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=76749165630&partnerID=8YFLogxK

U2 - 10.2197/ipsjjip.17.156

DO - 10.2197/ipsjjip.17.156

M3 - Article

VL - 17

SP - 156

EP - 166

JO - Journal of Information Processing

JF - Journal of Information Processing

SN - 0387-5806

ER -