Evaluating evaluation measures with worst-case confidence interval widths

    Research output: Contribution to journal › Conference article

    Abstract

    IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool that cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.
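
    To make the idea of a CI-width curve over topic set sizes concrete, here is a minimal sketch. It is not the paper's ANOVA-based tool. It assumes a textbook normal-approximation CI for the mean per-topic score difference between two systems, and it takes an assumed upper bound on the variance of that difference as input. The names `ci_width`, `width_curve`, and `var_diff_bound` are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(var_diff_bound: float, n_topics: int, alpha: float = 0.05) -> float:
    """Width of a two-sided (1 - alpha) CI for a mean per-topic difference.

    Assumption (not from the paper): normal approximation, with
    var_diff_bound an assumed upper bound on the variance of the
    per-topic score difference between two systems.
    """
    if n_topics < 1:
        raise ValueError("n_topics must be at least 1")
    if var_diff_bound < 0:
        raise ValueError("var_diff_bound must be non-negative")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Full width = 2 * margin of error.
    return 2.0 * z * sqrt(var_diff_bound / n_topics)


def width_curve(var_diff_bound: float, topic_set_sizes, alpha: float = 0.05):
    """Return (n, width) pairs, one per topic set size."""
    return [(n, ci_width(var_diff_bound, n, alpha)) for n in topic_set_sizes]


if __name__ == "__main__":
    # Illustrative variance bounds for two hypothetical measures.
    for name, bound in [("measure A", 0.04), ("measure B", 0.09)]:
        for n, width in width_curve(bound, [10, 50, 100, 500]):
            print(f"{name}: n={n:4d} width={width:.4f}")
```

    Under these assumptions the width shrinks in proportion to one over the square root of the topic set size, so a measure with a smaller variance bound gives a curve that sits lower at every size.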

    Original language: English
    Pages (from-to): 16-19
    Number of pages: 4
    Journal: CEUR Workshop Proceedings
    Volume: 2008
    Publication status: Published - 2017 Jan 1
    Event: 8th International Workshop on Evaluating Information Access, EVIA 2017 - Tokyo, Japan
    Duration: 2017 Dec 5 → …

    Keywords

    • ANOVA
    • Confidence intervals
    • Effect sizes
    • Evaluation measures
    • P-values
    • Sample sizes
    • Statistical significance

    ASJC Scopus subject areas

    • Computer Science(all)

    Cite this

    Evaluating evaluation measures with worst-case confidence interval widths. / Sakai, Tetsuya.

    In: CEUR Workshop Proceedings, Vol. 2008, 01.01.2017, p. 16-19.

    @article{161b986b7d62484cacabcf70b0d20b30,
    title = "Evaluating evaluation measures with worst-case confidence interval widths",
    abstract = "IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool that cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.",
    keywords = "ANOVA, Confidence intervals, Effect sizes, Evaluation measures, P-values, Sample sizes, Statistical significance",
    author = "Tetsuya Sakai",
    year = "2017",
    month = "1",
    day = "1",
    language = "English",
    volume = "2008",
    pages = "16--19",
    journal = "CEUR Workshop Proceedings",
    issn = "1613-0073",
    publisher = "CEUR-WS",

    }

    TY - JOUR

    T1 - Evaluating evaluation measures with worst-case confidence interval widths

    AU - Sakai, Tetsuya

    PY - 2017/1/1

    Y1 - 2017/1/1

    N2 - IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool that cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.

    KW - ANOVA

    KW - Confidence intervals

    KW - Effect sizes

    KW - Evaluation measures

    KW - P-values

    KW - Sample sizes

    KW - Statistical significance

    UR - http://www.scopus.com/inward/record.url?scp=85038855715&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85038855715&partnerID=8YFLogxK

    M3 - Conference article

    AN - SCOPUS:85038855715

    VL - 2008

    SP - 16

    EP - 19

    JO - CEUR Workshop Proceedings

    JF - CEUR Workshop Proceedings

    SN - 1613-0073

    ER -