On the reliability of information retrieval metrics based on graded relevance

Research output: Contribution to journal › Article

61 Citations (Scopus)

Abstract

This paper compares 14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance, in terms of stability, sensitivity and resemblance of system rankings. More specifically, we compare these metrics using the Buckley/Voorhees stability method, the Voorhees/Buckley swap method and Kendall's rank correlation, with three data sets comprising test collections and submitted runs from NTCIR. Our experiments show that (Average) Normalised Discounted Cumulative Gain at document cut-off l are the best among the rank-based graded-relevance metrics, provided that l is large. On the other hand, if one requires a recall-based graded-relevance metric that is highly correlated with Average Precision, then Q-measure is the best choice. Moreover, these best graded-relevance metrics are at least as stable and sensitive as Average Precision, and are fairly robust to the choice of gain values.
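
For context, a minimal Python sketch of the three central metrics follows, assuming one common formulation of each: the log2 discount in nDCG, the default beta = 1 in Q-measure, the function names, and the toy gain values (3 for highly relevant, 1 for partially relevant) are illustrative assumptions, not necessarily the exact definitions evaluated in the paper.

import math

def ndcg_at(ranking, gains, l):
    """Normalised DCG at cutoff l (one common log2-discount variant)."""
    ideal = sorted(gains.values(), reverse=True)
    dcg = sum(gains.get(d, 0) / math.log2(r + 1)
              for r, d in enumerate(ranking[:l], start=1))
    idcg = sum(g / math.log2(r + 1)
               for r, g in enumerate(ideal[:l], start=1))
    return dcg / idcg if idcg > 0 else 0.0

def q_measure(ranking, gains, beta=1.0):
    """Q-measure: blended-ratio of precision and normalised cumulative gain,
    averaged over the ranks of relevant documents."""
    R = len(gains)                       # number of relevant documents
    ideal = sorted(gains.values(), reverse=True)
    cg = cg_star = rel_count = 0.0
    total = 0.0
    for r, d in enumerate(ranking, start=1):
        if r <= R:                       # ideal cumulative gain grows up to rank R
            cg_star += ideal[r - 1]
        g = gains.get(d, 0)
        if g > 0:                        # relevant document found at rank r
            cg += g
            rel_count += 1
            total += (rel_count + beta * cg) / (r + beta * cg_star)
    return total / R if R else 0.0

def average_precision(ranking, gains):
    """Binary-relevance AP: any positive gain counts as relevant."""
    R = len(gains)
    rel_count, total = 0, 0.0
    for r, d in enumerate(ranking, start=1):
        if gains.get(d, 0) > 0:
            rel_count += 1
            total += rel_count / r
    return total / R if R else 0.0

# Toy topic: d1 highly relevant (gain 3), d3 partially relevant (gain 1).
gains = {"d1": 3, "d3": 1}
run = ["d2", "d1", "d4", "d3"]
print(ndcg_at(run, gains, l=3), q_measure(run, gains), average_precision(run, gains))

Per-topic scores such as these, averaged over topics, induce a system ranking; the paper measures the resemblance of two such rankings with Kendall's rank correlation (computable, for instance, with scipy.stats.kendalltau).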

Original language: English
Pages (from-to): 531-548
Number of pages: 18
Journal: Information Processing and Management
Volume: 43
Issue number: 2
DOIs: 10.1016/j.ipm.2006.07.020
Publication status: Published - 2007 Mar
Externally published: Yes

Keywords

  • Cumulative gain
  • Evaluation
  • Graded relevance
  • Q-measure
  • Reliability

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
  • Library and Information Sciences

Cite this

On the reliability of information retrieval metrics based on graded relevance. / Sakai, Tetsuya.

In: Information Processing and Management, Vol. 43, No. 2, 03.2007, p. 531-548.

Research output: Contribution to journal › Article

@article{11c85803706140f4bfc7aca9b9d77db1,
title = "On the reliability of information retrieval metrics based on graded relevance",
abstract = "This paper compares 14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance, in terms of stability, sensitivity and resemblance of system rankings. More specifically, we compare these metrics using the Buckley/Voorhees stability method, the Voorhees/Buckley swap method and Kendall's rank correlation, with three data sets comprising test collections and submitted runs from NTCIR. Our experiments show that (Average) Normalised Discounted Cumulative Gain at document cut-off l are the best among the rank-based graded-relevance metrics, provided that l is large. On the other hand, if one requires a recall-based graded-relevance metric that is highly correlated with Average Precision, then Q-measure is the best choice. Moreover, these best graded-relevance metrics are at least as stable and sensitive as Average Precision, and are fairly robust to the choice of gain values.",
keywords = "Cumulative gain, Evaluation, Graded relevance, Q-measure, Reliability",
author = "Tetsuya Sakai",
year = "2007",
month = "3",
doi = "10.1016/j.ipm.2006.07.020",
language = "English",
volume = "43",
pages = "531--548",
journal = "Information Processing and Management",
issn = "0306-4573",
publisher = "Elsevier Limited",
number = "2",

}

TY - JOUR

T1 - On the reliability of information retrieval metrics based on graded relevance

AU - Sakai, Tetsuya

PY - 2007/3

Y1 - 2007/3

N2 - This paper compares 14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance, in terms of stability, sensitivity and resemblance of system rankings. More specifically, we compare these metrics using the Buckley/Voorhees stability method, the Voorhees/Buckley swap method and Kendall's rank correlation, with three data sets comprising test collections and submitted runs from NTCIR. Our experiments show that (Average) Normalised Discounted Cumulative Gain at document cut-off l are the best among the rank-based graded-relevance metrics, provided that l is large. On the other hand, if one requires a recall-based graded-relevance metric that is highly correlated with Average Precision, then Q-measure is the best choice. Moreover, these best graded-relevance metrics are at least as stable and sensitive as Average Precision, and are fairly robust to the choice of gain values.

AB - This paper compares 14 information retrieval metrics based on graded relevance, together with 10 traditional metrics based on binary relevance, in terms of stability, sensitivity and resemblance of system rankings. More specifically, we compare these metrics using the Buckley/Voorhees stability method, the Voorhees/Buckley swap method and Kendall's rank correlation, with three data sets comprising test collections and submitted runs from NTCIR. Our experiments show that (Average) Normalised Discounted Cumulative Gain at document cut-off l are the best among the rank-based graded-relevance metrics, provided that l is large. On the other hand, if one requires a recall-based graded-relevance metric that is highly correlated with Average Precision, then Q-measure is the best choice. Moreover, these best graded-relevance metrics are at least as stable and sensitive as Average Precision, and are fairly robust to the choice of gain values.

KW - Cumulative gain

KW - Evaluation

KW - Graded relevance

KW - Q-measure

KW - Reliability

UR - http://www.scopus.com/inward/record.url?scp=33750437740&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33750437740&partnerID=8YFLogxK

U2 - 10.1016/j.ipm.2006.07.020

DO - 10.1016/j.ipm.2006.07.020

M3 - Article

VL - 43

SP - 531

EP - 548

JO - Information Processing and Management

JF - Information Processing and Management

SN - 0306-4573

IS - 2

ER -