TY - GEN
T1 - Evaluating diversified search results using per-intent graded relevance
AU - Sakai, Tetsuya
AU - Song, Ruihua
PY - 2011
Y1 - 2011
N2 - Search queries are often ambiguous and/or underspecified. To accommodate different user needs, search result diversification has received attention in the past few years. Accordingly, several new metrics for evaluating diversification have been proposed, but their properties are little understood. We compare the properties of existing metrics given the premises that (1) queries may have multiple intents; (2) the likelihood of each intent given a query is available; and (3) graded relevance assessments are available for each intent. We compare a wide range of traditional and diversified IR metrics after adding graded relevance assessments to the TREC 2009 Web track diversity task test collection, which originally had binary relevance assessments. Our primary criterion is discriminative power, which represents the reliability of a metric in an experiment. Our results show that diversified IR experiments with a given number of topics can be as reliable as traditional IR experiments with the same number of topics, provided that the right metrics are used. Moreover, we compare the intuitiveness of diversified IR metrics by closely examining the actual ranked lists from TREC. We show that a family of metrics called D#-measures has several advantages over other metrics such as α-nDCG and Intent-Aware metrics.
KW - Ambiguity
KW - Diversity
KW - Evaluation
KW - Graded relevance
KW - Test collection
UR - http://www.scopus.com/inward/record.url?scp=80052111133&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052111133&partnerID=8YFLogxK
U2 - 10.1145/2009916.2010055
DO - 10.1145/2009916.2010055
M3 - Conference contribution
AN - SCOPUS:80052111133
SN - 9781450309349
T3 - SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 1043
EP - 1052
BT - SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery
T2 - 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011
Y2 - 24 July 2011 through 28 July 2011
ER -