Evaluating evaluation metrics on the bootstrap

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

112 Citations (Scopus)

Abstract

This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based "swap" method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.
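As context for the kind of test the abstract advocates, the sketch below shows a paired bootstrap hypothesis test over per-topic scores (the bootstrap t-test of Efron and Tibshirani, resampling topic-level score differences under the null hypothesis). The function name, the B = 1000 resample count, and the synthetic data are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def bootstrap_asl(x, y, b=1000, seed=0):
    """Paired bootstrap test: achieved significance level (ASL) for the
    hypothesis that systems x and y have equal mean scores.

    x, y: per-topic effectiveness scores (e.g. AP per topic) for two runs.
    b: number of bootstrap resamples (illustrative default).
    """
    rng = np.random.default_rng(seed)
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    n = d.size
    # Observed studentized mean difference.
    t_obs = d.mean() / (d.std(ddof=1) / np.sqrt(n))
    # Shift the differences so the null hypothesis (zero mean) holds.
    d0 = d - d.mean()
    count = 0
    for _ in range(b):
        s = rng.choice(d0, size=n, replace=True)
        se = s.std(ddof=1) / np.sqrt(n)
        # Count resamples at least as extreme as the observed statistic.
        if se > 0 and abs(s.mean() / se) >= abs(t_obs):
            count += 1
    return count / b
```

A small ASL (e.g. below 0.05) indicates the observed difference between the two runs is unlikely under the null hypothesis; because the test resamples the data rather than assuming a parametric distribution, it rests on fewer assumptions than a classical paired t-test.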

Original language: English
Title of host publication: Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 525-532
Number of pages: 8
Volume: 2006
Publication status: Published - 2006
Externally published: Yes
Event: 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - Seattle, WA
Duration: 2006 Aug 6 - 2006 Aug 11

Other

Other: 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
City: Seattle, WA
Period: 06/8/6 - 06/8/11

Keywords

  • Bootstrap
  • Evaluation
  • Graded relevance
  • Test collection

ASJC Scopus subject areas

  • Engineering(all)
  • Information Systems
  • Software
  • Applied Mathematics

Cite this

Sakai, T. (2006). Evaluating evaluation metrics on the bootstrap. In Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Vol. 2006, pp. 525-532).

@inproceedings{7275b174d5754e12b14bceb96d6601ee,
title = "Evaluating evaluation metrics on the bootstrap",
abstract = "This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based {"}swap{"} method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.",
keywords = "Bootstrap, Evaluation, Graded relevance, Test collection",
author = "Tetsuya Sakai",
year = "2006",
language = "English",
isbn = "1595933697",
volume = "2006",
pages = "525--532",
booktitle = "Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval",

}

TY - GEN

T1 - Evaluating evaluation metrics on the bootstrap

AU - Sakai, Tetsuya

PY - 2006

Y1 - 2006

AB - This paper describes how the Bootstrap approach to statistics can be applied to the evaluation of IR effectiveness metrics. First, we argue that Bootstrap Hypothesis Tests deserve more attention from the IR community, as they are based on fewer assumptions than traditional statistical significance tests. We then describe straightforward methods for comparing the sensitivity of IR metrics based on Bootstrap Hypothesis Tests. Unlike the heuristics-based "swap" method proposed by Voorhees and Buckley, our method estimates the performance difference required to achieve a given significance level directly from Bootstrap Hypothesis Test results. In addition, we describe a simple way of examining the accuracy of rank correlation between two metrics based on the Bootstrap Estimate of Standard Error. We demonstrate the usefulness of our methods using test collections and runs from the NTCIR CLIR track for comparing seven IR metrics, including those that can handle graded relevance and those based on the Geometric Mean.

KW - Bootstrap

KW - Evaluation

KW - Graded relevance

KW - Test collection

UR - http://www.scopus.com/inward/record.url?scp=33750340100&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33750340100&partnerID=8YFLogxK

M3 - Conference contribution

SN - 1595933697

SN - 9781595933690

VL - 2006

SP - 525

EP - 532

BT - Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval

ER -