TY - JOUR
T1 - Topic set size design
AU - Sakai, Tetsuya
N1 - Publisher Copyright:
© 2015, The Author(s).
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Traditional pooling-based information retrieval (IR) test collections typically have n = 50–100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata’s three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively. These topic set size design methods require topic-by-run score matrices from past test collections for the purpose of estimating the within-system population variance for a particular evaluation measure. While the previous work of Sakai incorrectly used estimates of the total variances, here we use the correct estimates of the within-system variances, which yield slightly smaller topic set sizes than those reported previously by Sakai. Moreover, this study provides a comparison across the three methods. Our conclusions nevertheless echo those of Sakai: as different evaluation measures can have vastly different within-system variances, they require substantially different topic set sizes under the same set of statistical requirements; by analysing the tradeoff between the topic set size and the pool depth for a particular evaluation measure in advance, researchers can build statistically reliable yet highly economical test collections.
UR - http://www.scopus.com/inward/record.url?scp=84945280242&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84945280242&partnerID=8YFLogxK
DO - 10.1007/s10791-015-9273-z
M3 - Article
AN - SCOPUS:84945280242
VL - 19
SP - 256
EP - 283
JO - Information Retrieval
JF - Information Retrieval
SN - 1386-4564
IS - 3
ER -