The probability that your hypothesis is correct, credible intervals, and effect sizes for IR evaluation

Tetsuya Sakai*

*Corresponding author for this work

Research output: Conference contribution

8 Citations (Scopus)

Abstract

Using classical statistical significance tests, researchers can only discuss P(D+|H), the probability of observing the data D at hand or something more extreme, under the assumption that the hypothesis H is true (i.e., the p-value). But what we usually want is P(H|D), the probability that a hypothesis is true, given the data. If we use Bayesian statistics with state-of-the-art Markov Chain Monte Carlo (MCMC) methods for obtaining posterior distributions, this is no longer a problem. That is, instead of the classical p-values and 95% confidence intervals, which are often misinterpreted respectively as "probability that the hypothesis is (in)correct" and "probability that the true parameter value drops within the interval is 95%," we can easily obtain P(H|D) and credible intervals which represent exactly the above. Moreover, with Bayesian tests, we can easily handle virtually any hypothesis, not just "equality of means," and obtain an Expected A Posteriori (EAP) value of any statistic that we are interested in. We provide simple tools to encourage the IR community to take up paired and unpaired Bayesian tests for comparing two systems. Using a variety of TREC and NTCIR data, we compare P(H|D) with p-values, credible intervals with confidence intervals, and Bayesian EAP effect sizes with classical ones. Our results show that (a) p-values and confidence intervals can respectively be regarded as approximations of what we really want, namely, P(H|D) and credible intervals; and (b) sample effect sizes from classical significance tests can differ considerably from the Bayesian EAP effect sizes, which suggests that the former can be poor estimates of population effect sizes. For both paired and unpaired tests, we propose that the IR community report the EAP, the credible interval, and the probability of hypothesis being true, not only for the raw difference in means but also for the effect size in terms of Glass's δ.
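The quantities named in the abstract can be illustrated with a minimal Python sketch of an unpaired comparison. It is not the tools the paper provides: it assumes a normal model per system with the standard noninformative prior p(mu, sigma^2) ∝ 1/sigma^2, and it draws from the resulting closed-form posterior directly instead of running MCMC. The function names, the per-topic scores, and the hypothesis H ("system X has a higher mean than baseline Y") are hypothetical choices made for illustration. The effect size here divides the mean difference by the baseline's standard deviation, in the spirit of Glass's effect size.

```python
import math
import random
import statistics


def posterior_draws(scores, n_draws, rng):
    """Draw (mu, sigma) pairs from the posterior of a normal model
    under the prior p(mu, sigma^2) proportional to 1/sigma^2 (an assumption)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    var = statistics.variance(scores)  # unbiased sample variance
    draws = []
    for _ in range(n_draws):
        # sigma^2 | D = (n-1) s^2 / X, X ~ chi-square(n-1) = Gamma(shape=(n-1)/2, scale=2)
        sigma2 = (n - 1) * var / rng.gammavariate((n - 1) / 2.0, 2.0)
        # mu | sigma^2, D ~ Normal(mean, sigma^2 / n)
        mu = rng.gauss(mean, math.sqrt(sigma2 / n))
        draws.append((mu, math.sqrt(sigma2)))
    return draws


def summarize(values):
    """EAP (posterior mean) and central 95% credible interval from draws."""
    ordered = sorted(values)
    lo = ordered[int(0.025 * len(ordered))]
    hi = ordered[int(0.975 * len(ordered)) - 1]
    return {"eap": statistics.fmean(ordered), "ci95": (lo, hi)}


def unpaired_bayes(system_x, baseline_y, n_draws=20000, seed=0):
    """Summarize the mean difference, a Glass-style effect size, and P(H|D)
    for the hypothetical H: mean of X > mean of Y."""
    rng = random.Random(seed)
    dx = posterior_draws(system_x, n_draws, rng)
    dy = posterior_draws(baseline_y, n_draws, rng)
    diffs = [mx - my for (mx, _), (my, _) in zip(dx, dy)]
    # Standardize by the baseline's standard deviation only.
    glass = [(mx - my) / sy for (mx, _), (my, sy) in zip(dx, dy)]
    p_h = sum(d > 0 for d in diffs) / n_draws
    return {"diff": summarize(diffs), "glass": summarize(glass), "p_h_given_d": p_h}


# Made-up per-topic scores, for illustration only.
x = [0.42, 0.55, 0.61, 0.38, 0.70, 0.52, 0.47, 0.66]
y = [0.30, 0.41, 0.45, 0.29, 0.50, 0.39, 0.35, 0.48]
result = unpaired_bayes(x, y)
print(result)
```

Because every summary is computed from the same posterior draws, any other hypothesis (for example, "the difference exceeds 0.05") can be scored by changing the condition counted in `p_h`, which reflects the abstract's point that Bayesian tests are not limited to equality of means.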

Original language: English
Title of host publication: SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
Publisher: Association for Computing Machinery, Inc
Pages: 25-34
Number of pages: 10
ISBN (Electronic): 9781450350228
DOI
Publication status: Published - 7 Aug 2017
Event: 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017 - Tokyo, Shinjuku, Japan
Duration: 7 Aug 2017 to 11 Aug 2017

Publication series

Name: SIGIR 2017 - Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval

Other

Other: 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2017
Country/Territory: Japan
City: Tokyo, Shinjuku
Period: 17/8/7 to 17/8/11

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Computer Graphics and Computer-Aided Design
