### Abstract

There are two well-known versions of the t-test for comparing means from unpaired data: Student's t-test and Welch's t-test. While Welch's t-test does not assume homoscedasticity (i.e., equal variances), it involves approximations. A classical textbook recommendation would be to use Student's t-test if either the two sample sizes are similar or the two sample variances are similar, and to use Welch's t-test only when both of the above conditions are violated. However, a more recent recommendation seems to be to use Welch's t-test unconditionally. Using past data from both TREC and NTCIR, the present study demonstrates that the latter advice should not be followed blindly in the context of IR system evaluation. More specifically, our results suggest that if the sample sizes differ substantially and if the larger sample has a substantially larger variance, Welch's t-test may not be reliable.
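To make the contrast in the abstract concrete, here is a minimal sketch of the two test statistics, using only the Python standard library. The function names `student_t` and `welch_t` and the sample scores are illustrative assumptions and are not taken from the paper. The formulas are the textbook ones: a pooled variance with n1 + n2 - 2 degrees of freedom for Student's test, and the Welch-Satterthwaite approximation for Welch's test.

```python
import math
from statistics import mean, variance


def student_t(x, y):
    """Student's two-sample t statistic (pooled variance) and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    # Pooled variance: the step that assumes both populations share one variance.
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(x, y):
    """Welch's t statistic and the Welch-Satterthwaite approximate degrees of freedom."""
    n1, n2 = len(x), len(y)
    # Each sample contributes its own variance estimate; nothing is pooled.
    a, b = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / math.sqrt(a + b)
    # The degrees of freedom are approximated, which is where the
    # "approximations" mentioned in the abstract enter.
    df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return t, df


# Hypothetical per-topic scores (made up for illustration, not from the paper):
# the larger sample also has the larger variance.
small = [0.30, 0.32, 0.31, 0.29, 0.33]
large = [0.10, 0.55, 0.20, 0.70, 0.05, 0.60, 0.15, 0.65, 0.25, 0.50, 0.35, 0.45]

print("Student:", student_t(small, large))
print("Welch:  ", welch_t(small, large))
```

When the two samples have the same size, the two t statistics are identical and only the degrees of freedom differ. The tests diverge when sample sizes and variances are both unequal, which is the situation the abstract singles out.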

Original language | English |
---|---|
Title of host publication | SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval |
Publisher | Association for Computing Machinery, Inc |
Pages | 1045-1048 |
Number of pages | 4 |
ISBN (Electronic) | 9781450342902 |
DOIs | https://doi.org/10.1145/2911451.2914684 |
Publication status | Published - 2016 Jul 7 |
Event | 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 - Pisa, Italy. Duration: 2016 Jul 17 → 2016 Jul 21 |

### Other

Other | 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016 |
---|---|
Country | Italy |
City | Pisa |
Period | 2016 Jul 17 → 2016 Jul 21 |

### Keywords

- Statistical significance
- Test collections
- Topics
- Variances

### ASJC Scopus subject areas

- Information Systems
- Software

### Cite this

Sakai, T. (2016). Two sample T-tests for IR evaluation: Student or Welch? In *SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval* (pp. 1045-1048). Association for Computing Machinery, Inc. https://doi.org/10.1145/2911451.2914684

**Two sample T-tests for IR evaluation: Student or Welch?** / Sakai, Tetsuya.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

*SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval.* Association for Computing Machinery, Inc, pp. 1045-1048, 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, Pisa, Italy, 2016 Jul 17. https://doi.org/10.1145/2911451.2914684

TY - GEN

T1 - Two sample T-tests for IR evaluation

T2 - Student or Welch?

AU - Sakai, Tetsuya

PY - 2016/7/7

Y1 - 2016/7/7

KW - Statistical significance

KW - Test collections

KW - Topics

KW - Variances

UR - http://www.scopus.com/inward/record.url?scp=84980398049&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84980398049&partnerID=8YFLogxK

U2 - 10.1145/2911451.2914684

DO - 10.1145/2911451.2914684

M3 - Conference contribution

AN - SCOPUS:84980398049

SP - 1045

EP - 1048

BT - SIGIR 2016 - Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval

PB - Association for Computing Machinery, Inc

ER -