The reliability of metrics based on graded relevance

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

3 Citations (Scopus)

Abstract

This paper compares 14 metrics designed for information retrieval evaluation with graded relevance, together with 10 traditional metrics based on binary relevance, in terms of reliability and resemblance of system rankings. More specifically, we use two test collections with submitted runs from the Chinese IR and English IR tasks in the NTCIR-3 CLIR track to examine the metrics using methods proposed by Buckley/Voorhees and Voorhees/Buckley as well as Kendall's rank correlation. Our results show that AnDCG_l and nDCG_l ((Average) Normalised Discounted Cumulative Gain at document cut-off l) are good metrics, provided that l is large. However, if one wants to avoid the parameter l altogether, or if one requires a metric that closely resembles TREC Average Precision, then Q-measure appears to be the best choice.
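To make the abstract's two key tools concrete, the sketch below shows a minimal nDCG at document cut-off l and a Kendall's rank correlation between two system rankings. This is an illustrative implementation only: the gain values, the 1/log2(rank+1) discount, and the function names are assumptions, not the paper's exact definitions.

```python
import math
from itertools import combinations

def ndcg_at_l(gains, l):
    """nDCG at document cut-off l for one ranked list.

    `gains` holds the graded-relevance gain of each retrieved document,
    in ranked order (e.g. 3 = highly relevant, 0 = non-relevant).
    Uses the common 1/log2(rank+1) discount; the paper's exact gain
    and discount settings may differ.
    """
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:l]))
    ideal = dcg(sorted(gains, reverse=True))  # DCG of the ideal ranking
    return dcg(gains) / ideal if ideal > 0 else 0.0

def kendall_tau(ranking_a, ranking_b):
    """Kendall's rank correlation between two system rankings,
    each a list of system names ordered best-first.
    Assumes both lists contain the same systems, with no ties."""
    pos = {name: i for i, name in enumerate(ranking_b)}
    concordant = discordant = 0
    for (i, x), (j, y) in combinations(enumerate(ranking_a), 2):
        # A pair is concordant if both rankings order it the same way.
        if (i - j) * (pos[x] - pos[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A perfectly ordered run scores nDCG 1.0 at any cut-off; two identical system rankings give tau = 1.0 and fully reversed rankings give tau = -1.0, which is how the paper compares how closely one metric's system ranking resembles another's.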

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 1-16
Number of pages: 16
Volume: 3689 LNCS
Publication status: Published - 2005
Externally published: Yes
Event: 2nd Asia Information Retrieval Symposium, AIRS 2005 - Jeju Island
Duration: 2005 Oct 13 - 2005 Oct 15

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 3689 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 2nd Asia Information Retrieval Symposium, AIRS 2005
City: Jeju Island
Period: 05/10/13 - 05/10/15

ASJC Scopus subject areas

  • Computer Science (all)
  • Biochemistry, Genetics and Molecular Biology (all)
  • Theoretical Computer Science

Cite this

Sakai, T. (2005). The reliability of metrics based on graded relevance. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3689 LNCS, pp. 1-16). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 3689 LNCS).

@inproceedings{dd3cfdff05d54189b5eb6d8c12534296,
title = "The reliability of metrics based on graded relevance",
abstract = "This paper compares 14 metrics designed for information retrieval evaluation with graded relevance, together with 10 traditional metrics based on binary relevance, in terms of reliability and resemblance of system rankings. More specifically, we use two test collections with submitted runs from the Chinese IR and English IR tasks in the NTCIR-3 CLIR track to examine the metrics using methods proposed by Buckley/Voorhees and Voorhees/Buckley as well as Kendall's rank correlation. Our results show that AnDCG_l and nDCG_l ((Average) Normalised Discounted Cumulative Gain at document cut-off l) are good metrics, provided that l is large. However, if one wants to avoid the parameter l altogether, or if one requires a metric that closely resembles TREC Average Precision, then Q-measure appears to be the best choice.",
author = "Tetsuya Sakai",
year = "2005",
language = "English",
isbn = "3540291865",
volume = "3689 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "1--16",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}

TY - GEN

T1 - The reliability of metrics based on graded relevance

AU - Sakai, Tetsuya

PY - 2005

Y1 - 2005

N2 - This paper compares 14 metrics designed for information retrieval evaluation with graded relevance, together with 10 traditional metrics based on binary relevance, in terms of reliability and resemblance of system rankings. More specifically, we use two test collections with submitted runs from the Chinese IR and English IR tasks in the NTCIR-3 CLIR track to examine the metrics using methods proposed by Buckley/Voorhees and Voorhees/Buckley as well as Kendall's rank correlation. Our results show that AnDCG_l and nDCG_l ((Average) Normalised Discounted Cumulative Gain at document cut-off l) are good metrics, provided that l is large. However, if one wants to avoid the parameter l altogether, or if one requires a metric that closely resembles TREC Average Precision, then Q-measure appears to be the best choice.

UR - http://www.scopus.com/inward/record.url?scp=33646126694&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33646126694&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:33646126694

SN - 3540291865

SN - 9783540291862

VL - 3689 LNCS

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 1

EP - 16

BT - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

ER -