Topic set size design for paired and unpaired data

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Topic set size design is an approach to determining the sample sizes of an experiment (e.g., number of topics) based on a statistical requirement, namely a desired statistical power or a cap on the confidence interval (CI) width for the difference in means. Previous work considered paired data cases for a desired power of the t-test and for a cap on CI width, as well as unpaired data cases for a desired power of one-way ANOVA. In the present study, we consider unpaired (i.e., two-sample) cases for the t-test and for the CI width. Since one-way ANOVA with two groups is strictly equivalent to the two-sample t-test, we compare the outcomes of the topic set size design results based on these two approaches, and show that the one-way ANOVA-based approach actually returns tighter sample sizes than the two-sample t-test approach. Moreover, we compare the paired and unpaired cases for both t-test-based and CI-based topic set size design approaches. Because estimating the variance of the score differences for the paired data setting is problematic, we recommend the use of our unpaired-data versions of t-test-based and CI-based topic set size design tools, as they only require a variance estimate for individual scores and the appropriate sample sizes for unpaired data are also large enough for paired data.
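To make the two kinds of statistical requirement concrete, the following is a minimal sketch of sample size calculations for a desired power and for a cap on CI width. It is not the paper's tool: it assumes a textbook normal approximation rather than exact t-based calculations, and the function names (`n_for_power`, `n_for_ci_width`) and parameter names are hypothetical. It also uses the identity Var(X - Y) = 2σ²(1 - ρ) for two systems with equal score variance σ² and correlation ρ, which suggests why an unpaired sample size tends to be large enough for paired data whenever ρ ≥ 0.

```python
import math
from statistics import NormalDist


def n_for_power(min_diff, var, alpha=0.05, beta=0.20, paired=False):
    """Approximate number of topics for a two-sided test with power 1 - beta.

    Normal approximation (an assumption of this sketch):
        n >= k * (z_{alpha/2} + z_beta)^2 * var / min_diff^2
    Unpaired: k = 2 and var is the variance of individual scores.
    Paired:   k = 1 and var is the variance of the score differences.
    """
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(1 - beta)
    k = 1 if paired else 2
    return math.ceil(k * z_sum ** 2 * var / min_diff ** 2)


def n_for_ci_width(max_width, var, alpha=0.05, paired=False):
    """Approximate number of topics so the CI for the mean difference
    is no wider than max_width.

    Full width is 2 * z_{alpha/2} * sqrt(k * var / n), so
        n >= 4 * k * z_{alpha/2}^2 * var / max_width^2
    with k and var defined as in n_for_power.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    k = 1 if paired else 2
    return math.ceil(4 * k * z ** 2 * var / max_width ** 2)


def diff_variance(var, rho):
    """Variance of paired score differences for equal variances and correlation rho."""
    return 2 * var * (1 - rho)


if __name__ == "__main__":
    sigma2 = 0.04  # hypothetical variance of individual topic scores
    print("unpaired, power:", n_for_power(0.1, sigma2))
    print("unpaired, CI   :", n_for_ci_width(0.1, sigma2))
    for rho in (0.0, 0.5, 0.9):
        var_d = diff_variance(sigma2, rho)
        print("paired, rho =", rho, ":", n_for_power(0.1, var_d, paired=True))
```

In this sketch the unpaired calculation needs only the variance of individual scores, whereas the paired calculation needs the variance of the differences, which depends on the usually unknown correlation between the two systems.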

Original language: English
Title of host publication: ICTIR 2018 - Proceedings of the 2018 ACM SIGIR International Conference on the Theory of Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 199-202
Number of pages: 4
ISBN (Electronic): 9781450356565
DOI: https://doi.org/10.1145/3234944.3234971
Publication status: Published - 2018 Sep 10
Event: 8th ACM SIGIR International Conference on the Theory of Information Retrieval, ICTIR 2018 - Tianjin, China
Duration: 2018 Sep 14 to 2018 Sep 17

Publication series

Name: ICTIR 2018 - Proceedings of the 2018 ACM SIGIR International Conference on the Theory of Information Retrieval

Conference

Conference: 8th ACM SIGIR International Conference on the Theory of Information Retrieval, ICTIR 2018
Country: China
City: Tianjin
Period: 18/9/14 to 18/9/17

Fingerprint

Analysis of variance (ANOVA)
Experiments

Keywords

  • confidence intervals
  • effect sizes
  • evaluation
  • sample sizes
  • statistical power
  • statistical significance
  • test collections

ASJC Scopus subject areas

  • Information Systems
  • Computer Science (miscellaneous)

Cite this

Sakai, T. (2018). Topic set size design for paired and unpaired data. In ICTIR 2018 - Proceedings of the 2018 ACM SIGIR International Conference on the Theory of Information Retrieval (pp. 199-202). (ICTIR 2018 - Proceedings of the 2018 ACM SIGIR International Conference on the Theory of Information Retrieval). Association for Computing Machinery, Inc. https://doi.org/10.1145/3234944.3234971
