Evaluating diversified search results using per-intent graded relevance

Tetsuya Sakai, Ruihua Song

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

71 Citations (Scopus)

Abstract

Search queries are often ambiguous and/or underspecified. To accommodate different user needs, search result diversification has received attention in the past few years. Accordingly, several new metrics for evaluating diversification have been proposed, but their properties are little understood. We compare the properties of existing metrics given the premises that (1) queries may have multiple intents; (2) the likelihood of each intent given a query is available; and (3) graded relevance assessments are available for each intent. We compare a wide range of traditional and diversified IR metrics after adding graded relevance assessments to the TREC 2009 Web track diversity task test collection, which originally had binary relevance assessments. Our primary criterion is discriminative power, which represents the reliability of a metric in an experiment. Our results show that diversified IR experiments with a given number of topics can be as reliable as traditional IR experiments with the same number of topics, provided that the right metrics are used. Moreover, we compare the intuitiveness of diversified IR metrics by closely examining the actual ranked lists from TREC. We show that a family of metrics called D#-measures has several advantages over other metrics such as α-nDCG and Intent-Aware metrics.
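The D#-measure family mentioned in the abstract combines a per-intent graded-relevance score with an intent-coverage term. The sketch below illustrates the idea in Python: a document's "global gain" is its per-intent gain weighted by intent probabilities P(i|q), nDCG is computed over those global gains (D-nDCG), and intent recall (I-rec) is mixed in. This is an illustrative simplification, not the paper's exact definition — in particular, the paper builds the ideal list from the whole pool of judged documents, whereas this sketch sorts only the given ranked list, and the gain values and discount are assumptions.

```python
import math

def global_gain(doc_gains, intent_probs):
    """Expected gain of one document over all intents: sum_i P(i|q) * gain_i."""
    return sum(p * g for p, g in zip(intent_probs, doc_gains))

def dcg(gains, cutoff):
    """DCG with a log2(rank + 1) discount (one common convention)."""
    return sum(g / math.log2(r + 1) for r, g in enumerate(gains[:cutoff], start=1))

def d_sharp_ndcg(ranked_gains, intent_probs, cutoff, gamma=0.5):
    """Sketch of a D#-nDCG-style score.

    ranked_gains[r][i] is the graded gain of the document at rank r
    for intent i; intent_probs[i] is P(i|q).
    """
    gg = [global_gain(doc, intent_probs) for doc in ranked_gains]
    # Simplification: ideal ranking taken from this list only,
    # not from the full pool of judged documents as in the paper.
    ideal = sorted(gg, reverse=True)
    ideal_dcg = dcg(ideal, cutoff)
    d_ndcg = dcg(gg, cutoff) / ideal_dcg if ideal_dcg > 0 else 0.0
    # Intent recall: fraction of intents covered by at least one
    # relevant document within the cutoff.
    n_intents = len(intent_probs)
    covered = sum(
        1 for i in range(n_intents)
        if any(doc[i] > 0 for doc in ranked_gains[:cutoff])
    )
    i_rec = covered / n_intents
    return gamma * i_rec + (1 - gamma) * d_ndcg
```

A ranked list sorted by global gain that also covers every intent within the cutoff scores 1.0; rankings that bury an intent or demote high-gain documents score lower.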

Original language: English
Title of host publication: SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
Pages: 1043-1052
Number of pages: 10
DOI: 10.1145/2009916.2010055
Publication status: Published - 2011
Externally published: Yes
Event: 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR'11 - Beijing
Duration: 2011 Jul 24 - 2011 Jul 28

Keywords

  • Ambiguity
  • Diversity
  • Evaluation
  • Graded relevance
  • Test collection

ASJC Scopus subject areas

  • Information Systems

Cite this

Sakai, T., & Song, R. (2011). Evaluating diversified search results using per-intent graded relevance. In SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1043-1052). https://doi.org/10.1145/2009916.2010055

@inproceedings{c3d9dada1f4b4ff1b9518766e03bc0e7,
title = "Evaluating diversified search results using per-intent graded relevance",
abstract = "Search queries are often ambiguous and/or underspecified. To accommodate different user needs, search result diversification has received attention in the past few years. Accordingly, several new metrics for evaluating diversification have been proposed, but their properties are little understood. We compare the properties of existing metrics given the premises that (1) queries may have multiple intents; (2) the likelihood of each intent given a query is available; and (3) graded relevance assessments are available for each intent. We compare a wide range of traditional and diversified IR metrics after adding graded relevance assessments to the TREC 2009 Web track diversity task test collection which originally had binary relevance assessments. Our primary criterion is discriminative power, which represents the reliability of a metric in an experiment. Our results show that diversified IR experiments with a given number of topics can be as reliable as traditional IR experiments with the same number of topics, provided that the right metrics are used. Moreover, we compare the intuitiveness of diversified IR metrics by closely examining the actual ranked lists from TREC. We show that a family of metrics called D#-measures has several advantages over other metrics such as α-nDCG and Intent-Aware metrics.",
keywords = "Ambiguity, Diversity, Evaluation, Graded relevance, Test collection",
author = "Tetsuya Sakai and Ruihua Song",
year = "2011",
doi = "10.1145/2009916.2010055",
language = "English",
isbn = "9781450309349",
pages = "1043--1052",
booktitle = "SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval",

}
