Which diversity evaluation measures are “good”?

Tetsuya Sakai, Zhaohao Zeng

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This study evaluates 30 IR evaluation measures or their instances, of which nine are for adhoc IR and 21 are for diversified IR, primarily from the viewpoint of whether their preferences of one SERP (search engine result page) over another actually align with users' preferences. The gold preferences were constructed by hiring 15 assessors, who independently examined 1,127 SERP pairs and made preference assessments. Two sets of preference assessments were obtained: one based on a relevance question “Which SERP is more relevant?” and the other based on a diversity question “Which SERP is likely to satisfy a higher number of users?” To our knowledge, our study is the first to have collected diversity preference assessments in this way and evaluated diversity measures successfully. Our main results are that (a) popular adhoc IR measures such as nDCG actually align quite well with the gold relevance preferences; and that (b) while the D#-measures align well with the gold diversity preferences, intent-aware measures perform relatively poorly. Moreover, as by-products of our analysis of existing evaluation measures, we define new adhoc measures called iRBU (intentwise Rank-Biased Utility) and EBR (Expected Blended Ratio); we demonstrate that an instance of iRBU performs as well as nDCG when compared to the gold relevance preferences. On the other hand, the original RBU, a recently proposed diversity measure, underperforms the best D#-measures when compared to the gold diversity preferences.

Original language: English
Title of host publication: SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 595-604
Number of pages: 10
ISBN (Electronic): 9781450361729
DOIs: https://doi.org/10.1145/3331184.3331215
Publication status: Published - 2019 Jul 18
Event: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019 - Paris, France
Duration: 2019 Jul 21 to 2019 Jul 25

Publication series

Name: SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

Conference

Conference: 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019
Country: France
City: Paris
Period: 19/7/21 to 19/7/25

Keywords

  • Evaluation measures
  • Search result diversification
  • User preferences

ASJC Scopus subject areas

  • Information Systems
  • Applied Mathematics
  • Software

  • Cite this

    Sakai, T., & Zeng, Z. (2019). Which diversity evaluation measures are “good”? In SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 595-604). (SIGIR 2019 - Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval). Association for Computing Machinery, Inc. https://doi.org/10.1145/3331184.3331215