Alternatives to Bpref

Research output: Chapter in Book/Report/Conference proceeding › Chapter

84 Citations (Scopus)

Abstract

Recently, a number of TREC tracks have adopted a retrieval effectiveness metric called bpref which has been designed for evaluation environments with incomplete relevance data. A graded-relevance version of this metric called rpref has also been proposed. However, we show that the application of Q-measure, normalised Discounted Cumulative Gain (nDCG) or Average Precision (AveP) to condensed lists, obtained by filtering out all unjudged documents from the original ranked lists, is actually a better solution to the incompleteness problem than bpref. Furthermore, we show that the use of graded relevance boosts the robustness of IR evaluation to incompleteness and therefore that Q-measure and nDCG based on condensed lists are the best choices. To this end, we use four graded-relevance test collections from NTCIR to compare ten different IR metrics in terms of system ranking stability and pairwise discriminative power.
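The condensed-list idea described in the abstract can be sketched very simply: before computing a standard metric, remove from the ranked list every document that has no relevance judgement, so that only judged documents remain in their original order. The snippet below is a minimal illustration of this, not the paper's implementation; the function names, the toy run and qrels, and the simple binary AveP formulation are assumptions made for the example (the paper applies Q-measure and nDCG to condensed lists in the same way, using graded relevance).

```python
def condense(ranked_docs, judged):
    """Drop unjudged documents, preserving the original ranking order.
    `judged` maps doc_id -> relevance grade (0 = judged non-relevant)."""
    return [d for d in ranked_docs if d in judged]

def average_precision(ranked_docs, judged):
    """Binary Average Precision: any grade > 0 counts as relevant."""
    num_relevant = sum(1 for grade in judged.values() if grade > 0)
    if num_relevant == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, doc in enumerate(ranked_docs, start=1):
        if judged.get(doc, 0) > 0:
            hits += 1
            ap += hits / rank
    return ap / num_relevant

# Hypothetical example: d3 and d5 were never judged (incomplete qrels).
run = ["d1", "d3", "d2", "d5", "d4"]
qrels = {"d1": 2, "d2": 0, "d4": 1}  # graded judgements; d3, d5 unjudged

print(average_precision(run, qrels))                    # AveP on the original list
print(average_precision(condense(run, qrels), qrels))   # AveP' on the condensed list
```

Here the condensed list is ["d1", "d2", "d4"], so the relevant document d4 moves from rank 5 to rank 3 and the condensed-list score (0.83) is higher than the original-list score (0.70); unjudged documents no longer count as non-relevant.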

Original language: English
Title of host publication: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
Pages: 71-78
Number of pages: 8
DOIs: https://doi.org/10.1145/1277741.1277756
ISBN (Print): 1595935975, 9781595935977
Publication status: Published - 2007
Externally published: Yes
Event: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 - Amsterdam
Duration: 2007 Jul 23 – 2007 Jul 27

Other

Other: 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07
City: Amsterdam
Period: 07/7/23 – 07/7/27

Keywords

  • Evaluation metrics
  • Graded relevance
  • Test collection

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Applied Mathematics

Cite this

Sakai, T. (2007). Alternatives to Bpref. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'07 (pp. 71-78). https://doi.org/10.1145/1277741.1277756
