Designing test collections for comparing many systems

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    12 Citations (Scopus)

    Abstract

    A researcher decides to build a test collection for comparing her new information retrieval (IR) systems with several state-of-the-art baselines. She wants to know the number of topics (n) she needs to create in advance, so that she can start looking for (say) a query log large enough for sampling n good topics, and estimating the relevance assessment cost. We provide practical solutions to researchers like her using power analysis and sample size design techniques, and demonstrate their usefulness for several IR tasks and evaluation measures. We consider not only the paired t-test but also one-way analysis of variance (ANOVA) for significance testing, to accommodate comparison of m (≥ 2) systems under a given set of statistical requirements (α: the Type I error rate, β: the Type II error rate, and minD: the minimum detectable difference between the best and the worst systems). Using our simple Excel tools and some pooled variance estimates from past data, researchers can build statistically well-designed test collections. We demonstrate that, as different evaluation measures have different variances across topics, they inevitably require different topic set sizes. This suggests that the evaluation measures should be chosen at the test collection design phase. Moreover, through a pool depth reduction experiment with past data, we show how the relevance assessment cost can be reduced dramatically while freezing the set of statistical requirements. Based on the cost analysis and the available budget, researchers can determine the right balance between n and the pool depth pd. Our techniques and tools are applicable to test collections for non-IR tasks as well.
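    The paper's sample size calculations are distributed as Excel tools; purely as an illustration of the kind of power computation involved, the Python sketch below derives a topic set size n from the statistical requirements (α, β, minD) and a variance estimate from past data, for both the paired t-test and one-way ANOVA. The function names, the example numbers, and the mapping of minD to the ANOVA noncentrality parameter via the conservative configuration in which only the best and worst systems deviate from the grand mean are assumptions made here for illustration, not the authors' implementation.

        # Illustrative sketch only (not the paper's Excel tools): topic set size design
        # by power analysis, assuming a pooled per-topic variance estimate for the
        # chosen evaluation measure is available from past data. Requires scipy.
        from math import sqrt
        from scipy import stats

        def topics_for_paired_ttest(minD, sigma_d, alpha=0.05, beta=0.20, n_max=100000):
            """Smallest topic set size n such that a two-sided paired t-test detects a
            mean score difference of minD with power >= 1 - beta, where sigma_d is the
            standard deviation of the per-topic score differences."""
            for n in range(2, n_max + 1):
                df = n - 1
                t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)   # critical value under H0
                nc = minD * sqrt(n) / sigma_d                 # noncentrality under H1
                power = (1.0 - stats.nct.cdf(t_crit, df, nc)) + stats.nct.cdf(-t_crit, df, nc)
                if power >= 1.0 - beta:
                    return n
            raise ValueError("n_max exceeded")

        def topics_for_oneway_anova(minD, sigma, m, alpha=0.05, beta=0.20, n_max=100000):
            """Smallest n (topics per system) such that one-way ANOVA over m systems
            reaches power >= 1 - beta for the hardest-to-detect configuration with a
            best-worst gap of minD: two systems at +/- minD/2, the rest at the mean."""
            for n in range(2, n_max + 1):
                df1, df2 = m - 1, m * (n - 1)
                f_crit = stats.f.ppf(1.0 - alpha, df1, df2)
                nc = n * minD ** 2 / (2.0 * sigma ** 2)       # lambda = n * sum_i (mu_i - mu)^2 / sigma^2
                power = 1.0 - stats.ncf.cdf(f_crit, df1, df2, nc)
                if power >= 1.0 - beta:
                    return n
            raise ValueError("n_max exceeded")

        if __name__ == "__main__":
            # Hypothetical numbers for illustration only; real variance estimates
            # should be pooled from past test collections for the target measure.
            print(topics_for_paired_ttest(minD=0.05, sigma_d=0.15))
            print(topics_for_oneway_anova(minD=0.05, sigma=0.15, m=10))

    With a measure whose per-topic variance is larger, both functions return a larger n, which is why different evaluation measures require different topic set sizes and should be fixed at the test collection design phase.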

    Original language: English
    Title of host publication: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
    Publisher: Association for Computing Machinery, Inc
    Pages: 61-70
    Number of pages: 10
    ISBN (Print): 9781450325981
    DOI: https://doi.org/10.1145/2661829.2661893
    Publication status: Published - 2014 Nov 3
    Event: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014 - Shanghai, China
    Duration: 2014 Nov 3 - 2014 Nov 7

    Other

    Other: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014
    Country: China
    City: Shanghai
    Period: 14/11/3 - 14/11/7


    Keywords

    • Effect sizes
    • Evaluation
    • Evaluation measures
    • Power
    • Sample sizes
    • Statistical significance
    • Test collections
    • Variances

    ASJC Scopus subject areas

    • Information Systems and Management
    • Computer Science Applications
    • Information Systems

    Cite this

    Sakai, T. (2014). Designing test collections for comparing many systems. In CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management (pp. 61-70). Association for Computing Machinery, Inc. https://doi.org/10.1145/2661829.2661893

    @inproceedings{a816e0f68b27485188af61f554b77a44,
    title = "Designing test collections for comparing many systems",
    abstract = "A researcher decides to build a test collection for comparing her new information retrieval (IR) systems with several state-of-the-art baselines. She wants to know the number of topics (n) she needs to create in advance, so that she can start looking for (say) a query log large enough for sampling n good topics, and estimating the relevance assessment cost. We provide practical solutions to researchers like her using power analysis and sample size design techniques, and demonstrate its usefulness for several IR tasks and evaluation measures. We consider not only the paired t-test but also one-way analysis of variance (ANOVA) for significance testing to accommodate comparison of m(≥ 2) systems under a given set of statistical requirements (α: the Type I error rate, β: the Type II error rate, and minD: the minimum detectable difference between the best and the worst systems). Using our simple Excel tools and some pooled variance estimates from past data, researchers can design statistically well-designed test collections. We demonstrate that, as different evaluation measures have different variances across topics, they inevitably require different topic set sizes. This suggests that the evaluation measures should be chosen at the test collection design phase. Moreover, through a pool depth reduction experiment with past data, we show how the relevance assessment cost can be reduced dramatically while freezing the set of statistical requirements. Based on the cost analysis and the available budget, researchers can determine the right balance betweeen n and the pool depth pd. Our techniques and tools are applicable to test collections for non-IR tasks as well.",
    keywords = "Effect sizes, Evaluation, Evaluation measures, Power, Sample sizes, Statistical significance, Test collections, Variances",
    author = "Tetsuya Sakai",
    year = "2014",
    month = "11",
    day = "3",
    doi = "10.1145/2661829.2661893",
    language = "English",
    isbn = "9781450325981",
    pages = "61--70",
    booktitle = "CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management",
    publisher = "Association for Computing Machinery, Inc",

    }
