TY - GEN
T1 - Simple and effective approach to score standardisation
AU - Sakai, Tetsuya
PY - 2016/9/12
Y1 - 2016/9/12
N2 - Webber, Moffat and Zobel proposed score standardisation for information retrieval evaluation with multiple test collections. Given a topic-by-run raw score matrix in terms of some evaluation measure, each score can be standardised using the topic's sample mean and sample standard deviation across a set of past runs, so as to quantify how different a system is from the "average" system in standard deviation units. Using standardised scores, researchers can compare systems across different test collections without worrying about topic hardness or normalisation. While Webber et al. mapped the standardised scores to the [0,1] range using the standard normal cumulative distribution function, the present study demonstrates that a linear transformation of the standardised scores, a method widely used in educational research, can be a simple and effective alternative. We use three TREC robust track data sets with graded relevance assessments and official runs to compare these methods by means of leave-one-out tests, discriminative power, swap rate tests, and topic set size design. In particular, we demonstrate that our method is superior to the method of Webber et al. in terms of swap rates and topic set size design: put simply, our method ensures pairwise system comparisons that are more consistent across different data sets, and is arguably more convenient for designing a new test collection from a statistical viewpoint.
KW - Evaluation
KW - Measures
KW - Standardization
KW - Statistical power
KW - Statistical significance
KW - Test collections
KW - Topics
KW - Variances
UR - http://www.scopus.com/inward/record.url?scp=84991047660&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84991047660&partnerID=8YFLogxK
U2 - 10.1145/2970398.2970399
DO - 10.1145/2970398.2970399
M3 - Conference contribution
AN - SCOPUS:84991047660
T3 - ICTIR 2016 - Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval
SP - 95
EP - 104
BT - ICTIR 2016 - Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval
PB - Association for Computing Machinery, Inc
T2 - 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR 2016
Y2 - 12 September 2016 through 16 September 2016
ER -
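
The abstract above outlines the standardisation procedure; the following is a minimal Python sketch of it, for illustration only. It assumes a runs-by-topics raw score matrix as described, uses the standard normal cumulative distribution function for the mapping attributed to Webber et al., and uses assumed T-score-style constants (0.15, 0.5) for the linear transformation; the constants and clipping actually used in the paper may differ.

# Illustrative sketch of per-topic score standardisation as described
# in the abstract, with the two [0,1] mappings it contrasts.
# The linear-map constants (0.15, 0.5) are an ASSUMPTION in the style of
# educational T-scores; consult the paper for the actual values.
import numpy as np
from scipy.stats import norm

def standardise(raw, past_runs):
    """z-score each topic's raw score against past runs.

    raw: shape (topics,) scores for the run under test
    past_runs: shape (runs, topics) raw score matrix of past runs
    """
    mean = past_runs.mean(axis=0)            # per-topic sample mean
    std = past_runs.std(axis=0, ddof=1)      # per-topic sample standard deviation
    return (raw - mean) / std

def to_unit_interval_cdf(z):
    # Mapping used by Webber et al.: standard normal CDF of the z-score.
    return norm.cdf(z)

def to_unit_interval_linear(z, a=0.15, b=0.5):
    # Linear transformation with assumed constants, clipped to [0,1].
    return np.clip(a * z + b, 0.0, 1.0)

# Toy example: 5 past runs over 3 topics, plus one run under test.
rng = np.random.default_rng(0)
past = rng.uniform(0.0, 1.0, size=(5, 3))
run = rng.uniform(0.0, 1.0, size=3)
z = standardise(run, past)
print(to_unit_interval_cdf(z))
print(to_unit_interval_linear(z))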