Diversified search evaluation

Lessons from the NTCIR-9 INTENT task

Tetsuya Sakai, Ruihua Song

Research output: Contribution to journal (Article)

10 Citations (Scopus)

Abstract

The evaluation of diversified web search results is a relatively new research topic and is not as well-understood as the time-honoured evaluation methodology of traditional IR based on precision and recall. In diversity evaluation, one topic may have more than one intent, and systems are expected to balance relevance and diversity. The recent NTCIR-9 evaluation workshop launched a new task called INTENT which included a diversified web search subtask that differs from the TREC web diversity task in several aspects: the choice of evaluation metrics, the use of intent popularity and per-intent graded relevance, and the use of topic sets that are twice as large as those of TREC. The objective of this study is to examine whether these differences are useful, using the actual data recently obtained from the NTCIR-9 INTENT task. Our main experimental findings are: (1) The D♯ evaluation framework used at NTCIR provides more "intuitive" and statistically reliable results than Intent-Aware Expected Reciprocal Rank; (2) Utilising both intent popularity and per-intent graded relevance as is done at NTCIR tends to improve discriminative power, particularly for D♯-nDCG; and (3) Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved. These results suggest that the directions being explored at NTCIR are valuable.

Original language: English
Pages (from-to): 504-529
Number of pages: 26
Journal: Information Retrieval
Volume: 16
Issue number: 4
DOIs: 10.1007/s10791-012-9208-x
Publication status: Published - 2013
Externally published: Yes

Fingerprint

Testing
evaluation
ranking
popularity
methodology

Keywords

  • Diversity
  • Evaluation
  • Intents
  • NTCIR
  • Search result diversification
  • Test collections
  • TREC
  • Web search

ASJC Scopus subject areas

  • Information Systems
  • Library and Information Sciences

Cite this

Diversified search evaluation: Lessons from the NTCIR-9 INTENT task. / Sakai, Tetsuya; Song, Ruihua.

In: Information Retrieval, Vol. 16, No. 4, 2013, p. 504-529.

@article{6a2dd7fe77104f868eda405dbc5810ba,
title = "Diversified search evaluation: Lessons from the NTCIR-9 INTENT task",
abstract = "The evaluation of diversified web search results is a relatively new research topic and is not as well-understood as the time-honoured evaluation methodology of traditional IR based on precision and recall. In diversity evaluation, one topic may have more than one intent, and systems are expected to balance relevance and diversity. The recent NTCIR-9 evaluation workshop launched a new task called INTENT which included a diversified web search subtask that differs from the TREC web diversity task in several aspects: the choice of evaluation metrics, the use of intent popularity and per-intent graded relevance, and the use of topic sets that are twice as large as those of TREC. The objective of this study is to examine whether these differences are useful, using the actual data recently obtained from the NTCIR-9 INTENT task. Our main experimental findings are: (1) The D$\sharp$ evaluation framework used at NTCIR provides more {"}intuitive{"} and statistically reliable results than Intent-Aware Expected Reciprocal Rank; (2) Utilising both intent popularity and per-intent graded relevance as is done at NTCIR tends to improve discriminative power, particularly for D$\sharp$-nDCG; and (3) Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved. These results suggest that the directions being explored at NTCIR are valuable.",
keywords = "Diversity, Evaluation, Intents, NTCIR, Search result diversification, Test collections, TREC, Web search",
author = "Tetsuya Sakai and Ruihua Song",
year = "2013",
doi = "10.1007/s10791-012-9208-x",
language = "English",
volume = "16",
pages = "504--529",
journal = "Information Retrieval",
issn = "1386-4564",
publisher = "Springer Netherlands",
number = "4",

}

TY - JOUR

T1 - Diversified search evaluation

T2 - Lessons from the NTCIR-9 INTENT task

AU - Sakai, Tetsuya

AU - Song, Ruihua

PY - 2013

Y1 - 2013

N2 - The evaluation of diversified web search results is a relatively new research topic and is not as well-understood as the time-honoured evaluation methodology of traditional IR based on precision and recall. In diversity evaluation, one topic may have more than one intent, and systems are expected to balance relevance and diversity. The recent NTCIR-9 evaluation workshop launched a new task called INTENT which included a diversified web search subtask that differs from the TREC web diversity task in several aspects: the choice of evaluation metrics, the use of intent popularity and per-intent graded relevance, and the use of topic sets that are twice as large as those of TREC. The objective of this study is to examine whether these differences are useful, using the actual data recently obtained from the NTCIR-9 INTENT task. Our main experimental findings are: (1) The D♯ evaluation framework used at NTCIR provides more "intuitive" and statistically reliable results than Intent-Aware Expected Reciprocal Rank; (2) Utilising both intent popularity and per-intent graded relevance as is done at NTCIR tends to improve discriminative power, particularly for D♯-nDCG; and (3) Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved. These results suggest that the directions being explored at NTCIR are valuable.

AB - The evaluation of diversified web search results is a relatively new research topic and is not as well-understood as the time-honoured evaluation methodology of traditional IR based on precision and recall. In diversity evaluation, one topic may have more than one intent, and systems are expected to balance relevance and diversity. The recent NTCIR-9 evaluation workshop launched a new task called INTENT which included a diversified web search subtask that differs from the TREC web diversity task in several aspects: the choice of evaluation metrics, the use of intent popularity and per-intent graded relevance, and the use of topic sets that are twice as large as those of TREC. The objective of this study is to examine whether these differences are useful, using the actual data recently obtained from the NTCIR-9 INTENT task. Our main experimental findings are: (1) The D♯ evaluation framework used at NTCIR provides more "intuitive" and statistically reliable results than Intent-Aware Expected Reciprocal Rank; (2) Utilising both intent popularity and per-intent graded relevance as is done at NTCIR tends to improve discriminative power, particularly for D♯-nDCG; and (3) Reducing the topic set size, even by just 10 topics, can affect not only significance testing but also the entire system ranking; when 50 topics are used (as in TREC) instead of 100 (as in NTCIR), the system ranking can be substantially different from the original ranking and the discriminative power can be halved. These results suggest that the directions being explored at NTCIR are valuable.

KW - Diversity

KW - Evaluation

KW - Intents

KW - NTCIR

KW - Search result diversification

KW - Test collections

KW - TREC

KW - Web search

UR - http://www.scopus.com/inward/record.url?scp=84880838418&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880838418&partnerID=8YFLogxK

U2 - 10.1007/s10791-012-9208-x

DO - 10.1007/s10791-012-9208-x

M3 - Article

VL - 16

SP - 504

EP - 529

JO - Information Retrieval

JF - Information Retrieval

SN - 1386-4564

IS - 4

ER -