The reusability of a diversified search test collection

Tetsuya Sakai, Zhicheng Dou, Ruihua Song, Noriko Kando

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Traditional ad hoc IR test collections were built using a relatively large pool depth (e.g. 100), and are usually assumed to be reusable. Moreover, when they are reused to compare a new system with another or with systems that contributed to the pools ("contributors"), an even larger measurement depth (e.g. 1,000) is often used for computing evaluation metrics. In contrast, the web diversity test collections that have been created in the past few years at TREC and NTCIR use a much smaller pool depth (e.g. 20). The measurement depth is also small (e.g. 10-30), as search result diversification is primarily intended for the first result page. In this study, we examine the reusability of a typical web diversity test collection, namely, one from the NTCIR-9 INTENT-1 Chinese Document Ranking task, which used a pool depth of 20 and official measurement depths of 10, 20 and 30. First, we conducted additional relevance assessments to expand the official INTENT-1 collection to achieve a pool depth of 40. Using the expanded relevance assessments, we show that run rankings at the measurement depth of 30 are too unreliable, given that the pool depth is 20. Second, we conduct a leave-one-out experiment for every participating team of the INTENT-1 Chinese task, to examine how (un)fairly new runs are evaluated with the INTENT-1 collection. We show that, for the purpose of comparing new systems with the contributors of the test collection being used, condensed-list versions of existing diversity evaluation metrics are more reliable than the raw metrics. However, even the condensed-list metrics may be unreliable if the new systems are not competitive compared to the contributors.
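
To make the abstract's terminology concrete, here is a minimal sketch (not the authors' code) of the two ideas it relies on: depth-k pooling and condensed-list evaluation. Precision@k stands in below for the diversity metrics actually used at INTENT-1 (such as D#-nDCG), and every identifier, document ID and number is hypothetical.

# Illustrative sketch: depth-k pooling and a condensed-list metric.
# Relevance judgements (qrels) are assumed to be a {doc_id: grade} dict.

def depth_k_pool(runs, pool_depth):
    """Judging pool: the union of the top-`pool_depth` documents of each run."""
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:pool_depth])
    return pool

def precision_at(ranked_list, qrels, cutoff):
    """Raw metric: an unjudged document is simply treated as non-relevant."""
    top = ranked_list[:cutoff]
    return sum(1 for doc in top if qrels.get(doc, 0) > 0) / cutoff

def condensed_precision_at(ranked_list, qrels, cutoff):
    """Condensed-list metric: drop unjudged documents before measuring, so a
    run is not penalised merely for retrieving documents nobody judged."""
    condensed = [doc for doc in ranked_list if doc in qrels]
    return precision_at(condensed, qrels, cutoff)

# Leave-one-out style check for a single topic: pool only the contributors'
# runs, so the held-out "new" run retrieves some unjudged documents.
contributor_runs = [["d1", "d2", "d3"], ["d2", "d4", "d5"]]
pool = depth_k_pool(contributor_runs, pool_depth=3)    # {"d1", ..., "d5"}
qrels = {"d1": 1, "d2": 0, "d3": 1, "d4": 1, "d5": 0}  # assessors judge the pool
assert set(qrels) == pool

new_run = ["d6", "d7", "d1", "d4"]  # d6, d7 were never pooled, hence unjudged
print(precision_at(new_run, qrels, cutoff=2))            # raw score: 0.0
print(condensed_precision_at(new_run, qrels, cutoff=2))  # condensed score: 1.0

The gap between the two scores for the held-out run is exactly the kind of (un)fairness that the paper's leave-one-out experiments quantify, and the reason the condensed-list variants are recommended when a collection is reused to evaluate non-contributors.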

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 26-38
Number of pages: 13
Volume: 7675 LNCS
DOIs: 10.1007/978-3-642-35341-3_3
Publication status: Published - 2012
Externally published: Yes
Event: 8th Asia Information Retrieval Societies Conference, AIRS 2012 - Tianjin
Duration: 2012 Dec 17 - 2012 Dec 19

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 7675 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 8th Asia Information Retrieval Societies Conference, AIRS 2012
City: Tianjin
Period: 12/12/17 - 12/12/19

Fingerprint

Reusability, Experiments, Metric, Ranking, Diversification, Evaluation, Expand, Computing

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Sakai, T., Dou, Z., Song, R., & Kando, N. (2012). The reusability of a diversified search test collection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 7675 LNCS, pp. 26-38). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 7675 LNCS). https://doi.org/10.1007/978-3-642-35341-3_3

@inproceedings{e0d4de52591f4ec2bd0a633199c9b1b2,
title = "The reusability of a diversified search test collection",
abstract = "Traditional ad hoc IR test collections were built using a relatively large pool depth (e.g. 100), and are usually assumed to be reusable. Moreover, when they are reused to compare a new system with another or with systems that contributed to the pools ({"}contributors{"}), an even larger measurement depth (e.g. 1,000) is often used for computing evaluation metrics. In contrast, the web diversity test collections that have been created in the past few years at TREC and NTCIR use a much smaller pool depth (e.g. 20). The measurement depth is also small (e.g. 10-30), as search result diversification is primarily intended for the first result page. In this study, we examine the reusability of a typical web diversity test collection, namely, one from the NTCIR-9 INTENT-1 Chinese Document Ranking task, which used a pool depth of 20 and official measurement depths of 10, 20 and 30. First, we conducted additional relevance assessments to expand the official INTENT-1 collection to achieve a pool depth of 40. Using the expanded relevance assessments, we show that run rankings at the measurement depth of 30 are too unreliable, given that the pool depth is 20. Second, we conduct a leave-one-out experiment for every participating team of the INTENT-1 Chinese task, to examine how (un)fairly new runs are evaluated with the INTENT-1 collection. We show that, for the purpose of comparing new systems with the contributors of the test collection being used, condensed-list versions of existing diversity evaluation metrics are more reliable than the raw metrics. However, even the condensed-list metrics may be unreliable if the new systems are not competitive compared to the contributors.",
author = "Tetsuya Sakai and Zhicheng Dou and Ruihua Song and Noriko Kando",
year = "2012",
doi = "10.1007/978-3-642-35341-3_3",
language = "English",
isbn = "9783642353406",
volume = "7675 LNCS",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "26--38",
booktitle = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

}
