Comparing metrics across TREC and NTCIR: The robustness to system bias

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

17 Citations (Scopus)

Abstract

Test collections are growing larger, and relevance data constructed through pooling are suspected of becoming more and more incomplete and biased. Several studies have used evaluation metrics specifically designed to handle this problem, but most of them have only examined the metrics under incomplete but unbiased conditions, using random samples of the original relevance data. This paper examines nine metrics in a more realistic setting, by reducing the number of pooled systems. Even though previous work has shown that metrics based on a condensed list, obtained by removing all unjudged documents from the original ranked list, are effective for handling very incomplete but unbiased relevance data, we show that these results do not hold in the presence of system bias. In our experiments using TREC and NTCIR data, we first show that condensed-list metrics overestimate new systems while traditional metrics underestimate them, and that the overestimation tends to be larger than the underestimation. We then show that, when relevance data is heavily biased towards a single team or a few teams, the condensed-list versions of Average Precision (AP), Q-measure (Q) and normalised Discounted Cumulative Gain (nDCG), which we call AP', Q' and nDCG', are not necessarily superior to the original metrics in terms of discriminative power, i.e., the overall ability to detect pairwise statistical significance. Nevertheless, even under system bias, AP' and Q' are generally more discriminative than bpref and the condensed-list version of Rank-Biased Precision (RBP), which we call RBP'.
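The central mechanism the abstract refers to is the condensed-list transformation: every unjudged document is removed from a system's ranked list before a metric such as Average Precision is computed. The following Python sketch is not from the paper; the document IDs, the example run, and the qrels are hypothetical, and it uses binary relevance only (the paper's graded-relevance metrics Q and nDCG follow the same condensation idea). It simply illustrates how AP and its condensed-list counterpart AP' can diverge when a run retrieves documents that fall outside the judged pool.

```python
# Minimal sketch (assumed example, not the paper's implementation):
# AP on the full ranked list vs. AP' on the "condensed list", where all
# unjudged documents are removed before scoring.

def average_precision(ranked_docs, qrels):
    """AP over a ranked list; unjudged documents are treated as non-relevant."""
    num_relevant = sum(1 for rel in qrels.values() if rel > 0)
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if qrels.get(doc, 0) > 0:          # unjudged -> counts as non-relevant
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant

def condensed_ap(ranked_docs, qrels):
    """AP': drop every unjudged document, then apply AP to what remains."""
    condensed = [d for d in ranked_docs if d in qrels]
    return average_precision(condensed, qrels)

# Hypothetical single-topic example: d3 and d5 are unjudged (absent from qrels).
run   = ["d1", "d3", "d2", "d5", "d4"]
qrels = {"d1": 1, "d2": 0, "d4": 1}        # judged pool: d1, d2, d4

print(f"AP  = {average_precision(run, qrels):.3f}")   # 0.700: unjudged docs penalise the run
print(f"AP' = {condensed_ap(run, qrels):.3f}")        # 0.833: condensed list ignores them
```

In this toy case AP' exceeds AP, mirroring the abstract's observation that condensed-list metrics tend to overestimate runs whose retrieved documents were not pooled, while traditional metrics underestimate them.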

Original language: English
Title of host publication: International Conference on Information and Knowledge Management, Proceedings
Pages: 581-590
Number of pages: 10
DOI: 10.1145/1458082.1458159
ISBN: 9781595939913
Publication status: Published - 2008
Externally published: Yes
Event: 17th ACM Conference on Information and Knowledge Management, CIKM'08, Napa Valley, CA
Duration: 26 Oct 2008 - 30 Oct 2008

Keywords

  • Evaluation metrics
  • Graded relevance
  • Test collection

ASJC Scopus subject areas

  • Business, Management and Accounting (all)
  • Decision Sciences (all)

Cite this

Sakai, T. (2008). Comparing metrics across TREC and NTCIR: The robustness to system bias. In International Conference on Information and Knowledge Management, Proceedings (pp. 581-590). https://doi.org/10.1145/1458082.1458159
