Two sample T-tests for IR evaluation: Student or Welch?

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

    8 Citations (Scopus)

    Abstract

    There are two well-known versions of the t-test for comparing means from unpaired data: Student's t-test and Welch's t-test. While Welch's t-test does not assume homoscedasticity (i.e., equal variances), it involves approximations. A classical textbook recommendation would be to use Student's t-test if either the two sample sizes are similar or the two sample variances are similar, and to use Welch's t-test only when both of the above conditions are violated. However, a more recent recommendation seems to be to use Welch's t-test unconditionally. Using past data from both TREC and NTCIR, the present study demonstrates that the latter advice should not be followed blindly in the context of IR system evaluation. More specifically, our results suggest that if the sample sizes differ substantially and if the larger sample has a substantially larger variance, Welch's t-test may not be reliable.
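The two statistics contrasted in the abstract can be sketched as follows. This is a minimal illustration using only the Python standard library, not the paper's experimental procedure; the function names `student_t` and `welch_t` and the score lists are hypothetical. Student's test pools the two sample variances and uses n1 + n2 - 2 degrees of freedom, whereas Welch's test keeps the variances separate and approximates the degrees of freedom with the Welch-Satterthwaite formula, which is the approximation the abstract refers to.

```python
import math
from statistics import mean, variance


def student_t(x, y):
    """Student's two-sample t statistic (pooled variance) and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # unbiased sample variances
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x, y):
    """Welch's t statistic and the Welch-Satterthwaite approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    a, b = variance(x) / n1, variance(y) / n2  # per-sample squared standard errors
    t = (mean(x) - mean(y)) / math.sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical per-topic scores for two systems (illustrative values only).
scores_a = [1, 2, 3, 4, 5]
scores_b = [2, 4, 6, 8, 10]

t_s, df_s = student_t(scores_a, scores_b)
t_w, df_w = welch_t(scores_a, scores_b)
print(f"Student: t={t_s:.4f}, df={df_s}")
print(f"Welch:   t={t_w:.4f}, df={df_w:.3f}")
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ; the tests diverge more noticeably when both the sample sizes and the variances differ, which is the situation the abstract singles out. To obtain p-values, a library routine such as SciPy's `scipy.stats.ttest_ind` (with `equal_var=True` for Student's test or `equal_var=False` for Welch's) could be used instead.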

    Original language: English
    Title of host publication: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
    Publisher: Association for Computing Machinery, Inc
    Pages: 1045-1048
    Number of pages: 4
    ISBN (Electronic): 9781450342902
    DOI: https://doi.org/10.1145/2911451.2914684
    Publication status: Published - 2016 Jul 7
    Event: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy
    Duration: 2016 Jul 17 to 2016 Jul 21

    Keywords

    • Statistical significance
    • Test collections
    • Topics
    • Variances

    ASJC Scopus subject areas

    • Information Systems
    • Software

    Cite this

    Sakai, T. (2016). Two sample T-tests for IR evaluation: Student or Welch? In SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 1045-1048). Association for Computing Machinery, Inc. https://doi.org/10.1145/2911451.2914684

    @inproceedings{ef05cfda637f49cdb7285e5a6d5d00f8,
    title = "Two sample T-tests for IR evaluation: Student or Welch?",
    abstract = "There are two well-known versions of the t-test for comparing means from unpaired data: Student's t-test and Welch's t-test. While Welch's t-test does not assume homoscedasticity (i.e., equal variances), it involves approximations. A classical textbook recommendation would be to use Student's t-test if either the two sample sizes are similar or the two sample variances are similar, and to use Welch's t-test only when both of the above conditions are violated. However, a more recent recommendation seems to be to use Welch's t-test unconditionally. Using past data from both TREC and NTCIR, the present study demonstrates that the latter advice should not be followed blindly in the context of IR system evaluation. More specifically, our results suggest that if the sample sizes differ substantially and if the larger sample has a substantially larger variance, Welch's t-test may not be reliable.",
    keywords = "Statistical significance, Test collections, Topics, Variances",
    author = "Tetsuya Sakai",
    year = "2016",
    month = "7",
    day = "7",
    doi = "10.1145/2911451.2914684",
    language = "English",
    pages = "1045--1048",
    booktitle = "SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval",
    publisher = "Association for Computing Machinery, Inc",

    }

    TY - GEN

    T1 - Two sample T-tests for IR evaluation

    T2 - Student or Welch?

    AU - Sakai, Tetsuya

    PY - 2016/7/7

    Y1 - 2016/7/7

    N2 - There are two well-known versions of the t-test for comparing means from unpaired data: Student's t-test and Welch's t-test. While Welch's t-test does not assume homoscedasticity (i.e., equal variances), it involves approximations. A classical textbook recommendation would be to use Student's t-test if either the two sample sizes are similar or the two sample variances are similar, and to use Welch's t-test only when both of the above conditions are violated. However, a more recent recommendation seems to be to use Welch's t-test unconditionally. Using past data from both TREC and NTCIR, the present study demonstrates that the latter advice should not be followed blindly in the context of IR system evaluation. More specifically, our results suggest that if the sample sizes differ substantially and if the larger sample has a substantially larger variance, Welch's t-test may not be reliable.

    KW - Statistical significance

    KW - Test collections

    KW - Topics

    KW - Variances

    UR - http://www.scopus.com/inward/record.url?scp=84980398049&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84980398049&partnerID=8YFLogxK

    U2 - 10.1145/2911451.2914684

    DO - 10.1145/2911451.2914684

    M3 - Conference contribution

    AN - SCOPUS:84980398049

    SP - 1045

    EP - 1048

    BT - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval

    PB - Association for Computing Machinery, Inc

    ER -