Designing test collections for comparing many systems

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    12 Citations (Scopus)


    A researcher decides to build a test collection for comparing her new information retrieval (IR) systems with several state-of-the-art baselines. She wants to know in advance the number of topics (n) she needs to create, so that she can start looking for (say) a query log large enough for sampling n good topics, and start estimating the relevance assessment cost. We provide practical solutions to researchers like her using power analysis and sample size design techniques, and demonstrate their usefulness for several IR tasks and evaluation measures. We consider not only the paired t-test but also one-way analysis of variance (ANOVA) for significance testing, to accommodate comparison of m (≥ 2) systems under a given set of statistical requirements (α: the Type I error rate, β: the Type II error rate, and minD: the minimum detectable difference between the best and the worst systems). Using our simple Excel tools and some pooled variance estimates from past data, researchers can build statistically well-designed test collections. We demonstrate that, as different evaluation measures have different variances across topics, they inevitably require different topic set sizes. This suggests that the evaluation measures should be chosen at the test collection design phase. Moreover, through a pool depth reduction experiment with past data, we show how the relevance assessment cost can be reduced dramatically while freezing the set of statistical requirements. Based on the cost analysis and the available budget, researchers can determine the right balance between n and the pool depth pd. Our techniques and tools are applicable to test collections for non-IR tasks as well.
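
    To illustrate how α, β, minD and a variance estimate jointly determine the topic set size, here is a minimal sketch for the two-system (paired t-test) case. It uses the textbook normal approximation, not the authors' Excel tools or their exact procedure, and it does not cover the ANOVA case. The function name `topics_for_paired_t` and the parameter `var_diff` (an assumed estimate of the variance of per-topic score differences) are hypothetical, as are the numbers in the usage lines.

```python
import math
from statistics import NormalDist


def topics_for_paired_t(alpha, beta, min_d, var_diff):
    """Approximate number of topics n for a two-sided paired comparison.

    alpha    -- Type I error rate
    beta     -- Type II error rate (power = 1 - beta)
    min_d    -- minimum detectable difference between the two systems
    var_diff -- ASSUMED estimate of the variance of per-topic score
                differences, e.g. taken from past data

    Normal approximation: n ~= ((z_{alpha/2} + z_beta) / (min_d / sd_diff))^2.
    This is an illustrative sketch, not the method from the paper.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(1 - beta)        # quantile for the target power
    effect_size = min_d / math.sqrt(var_diff)    # standardized difference
    return math.ceil(((z_alpha + z_beta) / effect_size) ** 2)


# Hypothetical inputs: alpha = 0.05, beta = 0.20, minD = 0.10, variance = 0.04.
n_topics = topics_for_paired_t(0.05, 0.20, 0.10, 0.04)
print(n_topics)
```

    Under this approximation n grows in proportion to the variance and shrinks with the square of minD, which is consistent with the abstract's point that evaluation measures with different variances across topics require different topic set sizes.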

    Original language: English
    Title of host publication: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
    Publisher: Association for Computing Machinery, Inc
    Number of pages: 10
    ISBN (Print): 9781450325981
    Publication status: Published - 2014 Nov 3
    Event: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014 - Shanghai, China
    Duration: 2014 Nov 3 → 2014 Nov 7




    • Effect sizes
    • Evaluation
    • Evaluation measures
    • Power
    • Sample sizes
    • Statistical significance
    • Test collections
    • Variances

    ASJC Scopus subject areas

    • Information Systems and Management
    • Computer Science Applications
    • Information Systems

