The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    6 Citations (Scopus)

    Abstract

    Using classical statistical significance tests, researchers can only discuss P(D+|H), the probability of observing the data D at hand or something more extreme, under the assumption that the hypothesis H is true (i.e., the p-value). But what we usually want is P(H|D), the probability that a hypothesis is true, given the data. If we use Bayesian statistics with state-of-the-art Markov Chain Monte Carlo (MCMC) methods for obtaining posterior distributions, this is no longer a problem. That is, instead of the classical p-values and 95% confidence intervals, which are often misinterpreted respectively as "probability that the hypothesis is (in)correct" and "probability that the true parameter value drops within the interval is 95%," we can easily obtain P(H|D) and credible intervals which represent exactly the above. Moreover, with Bayesian tests, we can easily handle virtually any hypothesis, not just "equality of means," and obtain an Expected A Posteriori (EAP) value of any statistic that we are interested in. We provide simple tools to encourage the IR community to take up paired and unpaired Bayesian tests for comparing two systems. Using a variety of TREC and NTCIR data, we compare P(H|D) with p-values, credible intervals with confidence intervals, and Bayesian EAP effect sizes with classical ones. Our results show that (a) p-values and confidence intervals can respectively be regarded as approximations of what we really want, namely, P(H|D) and credible intervals; and (b) sample effect sizes from classical significance tests can differ considerably from the Bayesian EAP effect sizes, which suggests that the former can be poor estimates of population effect sizes. For both paired and unpaired tests, we propose that the IR community report the EAP, the credible interval, and the probability of the hypothesis being true, not only for the raw difference in means but also for the effect size in terms of Glass's δ.
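
    The paired Bayesian comparison the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's actual tooling: the score differences are synthetic, and instead of the Hamiltonian Monte Carlo sampling the paper uses, it takes a noninformative-prior shortcut under which the posterior of the mean difference is a shifted, scaled Student-t and can be sampled directly.

    ```python
    import numpy as np

    rng = np.random.default_rng(7)

    # Hypothetical per-topic score differences (system A minus system B),
    # e.g. nDCG over 20 topics. A real study would use actual run scores.
    d = rng.normal(loc=0.03, scale=0.05, size=20)

    n = d.size
    mean, sd = d.mean(), d.std(ddof=1)

    # Under a noninformative prior p(mu, sigma) ∝ 1/sigma with a normal
    # likelihood, the posterior of mu is a Student-t with n-1 degrees of
    # freedom, shifted by the sample mean and scaled by sd/sqrt(n). We
    # draw posterior samples directly instead of running MCMC.
    mu_post = mean + (sd / np.sqrt(n)) * rng.standard_t(df=n - 1, size=100_000)

    eap = mu_post.mean()                                # Expected A Posteriori
    ci_lo, ci_hi = np.percentile(mu_post, [2.5, 97.5])  # 95% credible interval
    p_h_given_d = (mu_post > 0).mean()                  # P(H|D) for H: "A beats B"

    print(f"EAP = {eap:.4f}, 95% CrI = [{ci_lo:.4f}, {ci_hi:.4f}], "
          f"P(H|D) = {p_h_given_d:.3f}")
    ```

    The fraction of posterior samples above zero is exactly the quantity the abstract contrasts with the p-value: the probability, given the data, that the hypothesis "system A outperforms system B" is true.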

    Original language: English
    Title of host publication: SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
    Publisher: Association for Computing Machinery, Inc
    Pages: 25-34
    Number of pages: 10
    ISBN (Electronic): 9781450350228
    DOIs: 10.1145/3077136.3080766
    Publication status: Published - 2017 Aug 7
    Event: 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017 - Tokyo, Shinjuku, Japan
    Duration: 2017 Aug 7 - 2017 Aug 11

    Other

    Other: 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017
    Country: Japan
    City: Tokyo, Shinjuku
    Period: 17/8/7 - 17/8/11

    Keywords

    • Bayesian Hypothesis Tests
    • Confidence Intervals
    • Credible Intervals
    • Effect Sizes
    • Hamiltonian Monte Carlo
    • Markov Chain Monte Carlo
    • P-Values
    • Statistical Significance

    ASJC Scopus subject areas

    • Information Systems
    • Software
    • Computer Graphics and Computer-Aided Design

    Cite this

    Sakai, T. (2017). The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation. In SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 25-34). Association for Computing Machinery, Inc. https://doi.org/10.1145/3077136.3080766