Traditional pooling-based information retrieval (IR) test collections typically have n= 50 –100 topics, but it is difficult for an IR researcher to say why the topic set size should really be n. The present study provides details on principled ways to determine the number of topics for a test collection to be built, based on a specific set of statistical requirements. We employ Nagata’s three sample size design techniques, which are based on the paired t test, one-way ANOVA, and confidence intervals, respectively. These topic set size design methods require topic-by-run score matrices from past test collections for the purpose of estimating the within-system population variance for a particular evaluation measure. While the previous work of Sakai incorrectly used estimates of the total variances, here we use the correct estimates of the within-system variances, which yield slightly smaller topic set sizes than those reported previously by Sakai. Moreover, this study provides a comparison across the three methods. Our conclusions nevertheless echo those of Sakai: as different evaluation measures can have vastly different within-system variances, they require substantially different topic set sizes under the same set of statistical requirements; by analysing the tradeoff between the topic set size and the pool depth for a particular evaluation measure in advance, researchers can build statistically reliable yet highly economical test collections.
ASJC Scopus subject areas