Two sample T-tests for IR evaluation: Student or Welch?

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

There are two well-known versions of the t-test for comparing means from unpaired data: Student's t-test and Welch's t-test. While Welch's t-test does not assume homoscedasticity (i.e., equal variances), it involves approximations. A classical textbook recommendation would be to use Student's t-test if either the two sample sizes are similar or the two sample variances are similar, and to use Welch's t-test only when both of the above conditions are violated. However, a more recent recommendation seems to be to use Welch's t-test unconditionally. Using past data from both TREC and NTCIR, the present study demonstrates that the latter advice should not be followed blindly in the context of IR system evaluation. More specifically, our results suggest that if the sample sizes differ substantially and if the larger sample has a substantially larger variance, Welch's t-test may not be reliable.
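The difference between the two tests can be made concrete with a minimal sketch of the textbook statistics. The abstract itself gives no formulas, so the definitions below (pooled variance for Student's test, the Welch-Satterthwaite approximation for Welch's degrees of freedom) are the standard ones rather than anything taken from the paper. The function names and the per-topic scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def student_t(x, y):
    """Student's two-sample t statistic (assumes equal variances).

    Returns (t, degrees_of_freedom).
    """
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)  # unbiased sample variances
    df = n1 + n2 - 2
    # Pool the two variances, weighting each by its degrees of freedom.
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, df


def welch_t(x, y):
    """Welch's two-sample t statistic (no equal-variance assumption).

    Returns (t, approximate_degrees_of_freedom), where the degrees of
    freedom come from the Welch-Satterthwaite approximation.
    """
    n1, n2 = len(x), len(y)
    a = variance(x) / n1  # squared standard error of the first mean
    b = variance(y) / n2  # squared standard error of the second mean
    t = (mean(x) - mean(y)) / sqrt(a + b)
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical per-topic scores for two systems, for illustration only.
scores_a = [0.2, 0.4, 0.6, 0.8]
scores_b = [0.1, 0.3, 0.3, 0.5]

t_student, df_student = student_t(scores_a, scores_b)
t_welch, df_welch = welch_t(scores_a, scores_b)
```

With equal sample sizes the two t statistics coincide and only the degrees of freedom differ. The tests diverge when both the sample sizes and the variances differ, which is the situation the abstract is concerned with.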

Original language: English
Title of host publication: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 1045-1048
Number of pages: 4
ISBN (Electronic): 9781450342902
Publication status: Published - 2016 Jul 7
Event: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy
Duration: 2016 Jul 17 - 2016 Jul 21

Publication series

Name: SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval

Other

Other: 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016
Country: Italy
City: Pisa
Period: 16/7/17 - 16/7/21

Keywords

  • Statistical significance
  • Test collections
  • Topics
  • Variances

ASJC Scopus subject areas

  • Information Systems
  • Software
