Statistical significance, power, and sample sizes

A systematic review of SIGIR and TOIS, 2006-2015

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    12 Citations (Scopus)

    Abstract

    We conducted a systematic review of 840 SIGIR full papers and 215 TOIS papers published between 2006 and 2015. The original objective of the study was to identify IR effectiveness experiments that are seriously underpowered (i.e., the sample size is far too small so that the probability of missing a real difference is extremely high) or overpowered (i.e., the sample size is so large that a difference will be considered statistically significant even if the actual effect size is extremely small). However, it quickly became clear to us that many IR effectiveness papers either lack significance testing or fail to report p-values and/or test statistics, which prevents us from conducting power analysis. Hence we first report on how IR researchers (fail to) report on significance test results, what types of tests they use, and how the reporting practices may have changed over the last decade. From those papers that reported enough information for us to conduct power analysis, we identify extremely overpowered and underpowered experiments, as well as appropriate sample sizes for future experiments. The raw results of our systematic survey of 1,055 papers and our R scripts for power analysis are available online. Our hope is that this study will help improve the reporting practices and experimental designs of future IR effectiveness studies.
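    The underpowered/overpowered distinction above can be illustrated with a small Monte Carlo sketch. This is not the paper's R scripts (those are available online); it is a hypothetical Python illustration that estimates the power of a two-sided one-sample test on paired score differences, using a normal critical value as a large-sample approximation:

    ```python
    import math
    import random
    import statistics

    def estimated_power(n, effect_size, alpha=0.05, sims=2000, seed=0):
        """Monte Carlo estimate of the power of a two-sided one-sample test
        on n paired score differences drawn from N(effect_size, 1).
        Uses the normal critical value, so it is a rough large-n approximation."""
        rng = random.Random(seed)
        z_crit = statistics.NormalDist().inv_cdf(1 - alpha / 2)
        rejections = 0
        for _ in range(sims):
            diffs = [rng.gauss(effect_size, 1.0) for _ in range(n)]
            se = statistics.stdev(diffs) / math.sqrt(n)
            t = statistics.mean(diffs) / se
            if abs(t) > z_crit:
                rejections += 1
        return rejections / sims

    # With a moderate effect, 10 topics leave a high chance of missing a
    # real difference (underpowered), while 100 topics detect it almost surely.
    print(estimated_power(10, 0.5))
    print(estimated_power(100, 0.5))
    ```

    The numbers 10, 100, and the effect size 0.5 are illustrative choices, not values from the paper; the point is only that power grows with sample size at a fixed effect size.
    
    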

    Original language: English
    Title of host publication: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
    Publisher: Association for Computing Machinery, Inc
    Pages: 5-14
    Number of pages: 10
    ISBN (Electronic): 9781450342902
    DOI: 10.1145/2911451.2911492
    Publication status: Published - 2016 Jul 7
    Event: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy
    Duration: 2016 Jul 17 - 2016 Jul 21

    Other

    Other: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
    Country: Italy
    City: Pisa
    Period: 16/7/17 - 16/7/21

    Keywords

    • Effect sizes
    • Evaluation
    • Power analysis
    • Sample sizes
    • Statistical power
    • Statistical significance
    • Systematic review

    ASJC Scopus subject areas

    • Information Systems
    • Software

    Cite this

    Sakai, T. (2016). Statistical significance, power, and sample sizes: A systematic review of SIGIR and TOIS, 2006-2015. In SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 5-14). Association for Computing Machinery, Inc. https://doi.org/10.1145/2911451.2911492
