Designing test collections for comparing many systems

    Research output: Conference contribution

    12 citations (Scopus)

    Abstract

    A researcher decides to build a test collection for comparing her new information retrieval (IR) systems with several state-of-the-art baselines. She wants to know the number of topics (n) she needs to create in advance, so that she can start looking for (say) a query log large enough for sampling n good topics, and estimating the relevance assessment cost. We provide practical solutions to researchers like her using power analysis and sample size design techniques, and demonstrate their usefulness for several IR tasks and evaluation measures. We consider not only the paired t-test but also one-way analysis of variance (ANOVA) for significance testing to accommodate comparison of m(≥ 2) systems under a given set of statistical requirements (α: the Type I error rate, β: the Type II error rate, and minD: the minimum detectable difference between the best and the worst systems). Using our simple Excel tools and some pooled variance estimates from past data, researchers can design statistically well-designed test collections. We demonstrate that, as different evaluation measures have different variances across topics, they inevitably require different topic set sizes. This suggests that the evaluation measures should be chosen at the test collection design phase. Moreover, through a pool depth reduction experiment with past data, we show how the relevance assessment cost can be reduced dramatically while freezing the set of statistical requirements. Based on the cost analysis and the available budget, researchers can determine the right balance between n and the pool depth pd. Our techniques and tools are applicable to test collections for non-IR tasks as well.
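
    The topic set size design sketched in the abstract can be reproduced, in outline, with any statistics package that exposes the noncentral t and F distributions. The paper itself ships Excel tools, so the Python sketch below is only an illustration of the general power-analysis recipe, not the authors' implementation: it assumes the textbook least-favourable configuration for one-way ANOVA (two systems separated by minD, the rest at the grand mean), and the variance arguments stand in for the pooled per-topic variance estimates from past data for the chosen evaluation measure. The numbers in the final call are purely illustrative.

        from scipy import stats

        def topic_set_size_ttest(alpha, beta, min_d, var_d, n_max=10000):
            """Smallest number of topics n for a two-sided paired t-test (m = 2 systems)
            to detect a mean score difference of min_d with power >= 1 - beta,
            given an estimate var_d of the per-topic score-difference variance."""
            for n in range(2, n_max + 1):
                df = n - 1
                nc = min_d * n ** 0.5 / var_d ** 0.5          # noncentrality parameter
                t_crit = stats.t.ppf(1.0 - alpha / 2.0, df)   # two-sided critical value
                power = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)
                if power >= 1.0 - beta:
                    return n
            return None

        def topic_set_size_anova(m, alpha, beta, min_d, var_hat, n_max=10000):
            """Smallest n for a one-way ANOVA over m systems to detect a best-worst
            gap of min_d with power >= 1 - beta, given a pooled within-system
            variance estimate var_hat, under the least favourable configuration
            (two systems at +/- min_d / 2, the rest at the grand mean)."""
            for n in range(2, n_max + 1):
                df1, df2 = m - 1, m * (n - 1)
                lam = n * min_d ** 2 / (2.0 * var_hat)        # noncentrality parameter
                f_crit = stats.f.ppf(1.0 - alpha, df1, df2)
                if stats.ncf.sf(f_crit, df1, df2, lam) >= 1.0 - beta:
                    return n
            return None

        # Illustrative call only -- the variance estimate here is made up, not from the paper:
        # m = 10 systems, alpha = 0.05, beta = 0.20, minD = 0.05, pooled variance 0.04.
        print(topic_set_size_anova(m=10, alpha=0.05, beta=0.20, min_d=0.05, var_hat=0.04))

    Both routines simply increase n until the noncentral-distribution power crosses 1 - β, which is the same brute-force search that sample-size tables encode; repeating the search with variance estimates obtained at different pool depths is one way to explore the n versus pd trade-off the abstract mentions.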

    Original language: English
    Host publication title: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management
    Publisher: Association for Computing Machinery, Inc
    Pages: 61-70
    Number of pages: 10
    ISBN (Print): 9781450325981
    DOI: 10.1145/2661829.2661893
    Publication status: Published - 3 Nov 2014
    Event: 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014 - Shanghai, China
    Duration: 3 Nov 2014 - 7 Nov 2014

    ASJC Scopus subject areas

    • Information Systems and Management
    • Computer Science Applications
    • Information Systems

    Cite this

    Sakai, T. (2014). Designing test collections for comparing many systems. In CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management (pp. 61-70). Association for Computing Machinery, Inc. https://doi.org/10.1145/2661829.2661893

    Designing test collections for comparing many systems. / Sakai, Tetsuya.

    CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management. Association for Computing Machinery, Inc, 2014. p. 61-70.

    Research output: Conference contribution

    Sakai, T 2014, Designing test collections for comparing many systems. In CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management. Association for Computing Machinery, Inc, pp. 61-70, 23rd ACM International Conference on Information and Knowledge Management, CIKM 2014, Shanghai, China, 14/11/3. https://doi.org/10.1145/2661829.2661893
    Sakai T. Designing test collections for comparing many systems. In: CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management. Association for Computing Machinery, Inc. 2014. p. 61-70 https://doi.org/10.1145/2661829.2661893
    Sakai, Tetsuya. / Designing test collections for comparing many systems. CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management. Association for Computing Machinery, Inc, 2014. pp. 61-70
    @inproceedings{a816e0f68b27485188af61f554b77a44,
    title = "Designing test collections for comparing many systems",
    keywords = "Effect sizes, Evaluation, Evaluation measures, Power, Sample sizes, Statistical significance, Test collections, Variances",
    author = "Tetsuya Sakai",
    year = "2014",
    month = "11",
    day = "3",
    doi = "10.1145/2661829.2661893",
    language = "English",
    isbn = "9781450325981",
    pages = "61--70",
    booktitle = "CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management",
    publisher = "Association for Computing Machinery, Inc",

    }

    TY - GEN

    T1 - Designing test collections for comparing many systems

    AU - Sakai, Tetsuya

    PY - 2014/11/3

    Y1 - 2014/11/3

    KW - Effect sizes

    KW - Evaluation

    KW - Evaluation measures

    KW - Power

    KW - Sample sizes

    KW - Statistical significance

    KW - Test collections

    KW - Variances

    UR - http://www.scopus.com/inward/record.url?scp=84925449110&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84925449110&partnerID=8YFLogxK

    U2 - 10.1145/2661829.2661893

    DO - 10.1145/2661829.2661893

    M3 - Conference contribution

    AN - SCOPUS:84925449110

    SN - 9781450325981

    SP - 61

    EP - 70

    BT - CIKM 2014 - Proceedings of the 2014 ACM International Conference on Information and Knowledge Management

    PB - Association for Computing Machinery, Inc

    ER -