Comparing two binned probability distributions for information access evaluation

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    1 Citation (Scopus)

    Abstract

    Some modern information access tasks, such as natural language dialogue tasks, are difficult to evaluate, for often there is no such thing as the ground truth: different users may have different opinions about the system's output. A few task designs for dialogue evaluation have been implemented and/or proposed recently, where both the ground truth data and the system's output are represented as a distribution of users' votes over bins on a non-nominal scale. The present study first points out that popular bin-by-bin measures such as Jensen-Shannon divergence and Sum of Squared Errors are clearly not adequate for such tasks, and that cross-bin measures should be used. Through experiments using artificial distributions as well as real ones from a dialogue evaluation task, we demonstrate that two cross-bin measures, namely, the Normalised Match Distance (NMD; a special case of the Earth Mover's Distance) and the Root Symmetric Normalised Order-aware Divergence (RSNOD), are indeed substantially different from the bin-by-bin measures. Furthermore, RSNOD lies between the popular bin-by-bin measures and NMD in terms of how it behaves. We recommend using both of these measures in the aforementioned type of evaluation tasks.
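    The bin-by-bin versus cross-bin distinction in the abstract can be illustrated with a small sketch. This is not the paper's code: the JSD here is the standard base-2 Jensen-Shannon divergence, and the NMD normalisation follows the standard Match Distance definition (summed absolute CDF differences, divided by L - 1 for L bins), which may differ in detail from the paper's exact formulation.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence (base 2): a bin-by-bin measure.

    It compares p and q one bin at a time, so it is blind to how far
    apart the mismatched bins are on the ordinal scale.
    """
    def kl(a, b):
        # 0 * log(0/x) is taken to be 0, hence the ai > 0 guard.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def nmd(p, q):
    """Normalised Match Distance: a cross-bin measure.

    On a one-dimensional ordinal scale the Earth Mover's Distance reduces
    to the Match Distance, i.e. the sum of absolute differences between
    the two cumulative distributions; dividing by L - 1 (the largest
    possible value, reached when all mass sits in the two extreme bins)
    maps it to [0, 1].
    """
    cdf_p = cdf_q = 0.0
    md = 0.0
    for pi, qi in zip(p, q):
        cdf_p += pi
        cdf_q += qi
        md += abs(cdf_p - cdf_q)
    return md / (len(p) - 1)

# Hypothetical 5-bin vote distributions: all gold votes in bin 1,
# one system off by a single bin, the other off by four bins.
gold = [1.0, 0.0, 0.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0, 0.0, 0.0]
far  = [0.0, 0.0, 0.0, 0.0, 1.0]

# JSD cannot tell the two systems apart (disjoint support either way),
# whereas NMD penalises the far miss four times as heavily (0.25 vs 1.0).
print(jsd(gold, near), jsd(gold, far))
print(nmd(gold, near), nmd(gold, far))
```

    This is exactly the failure mode the abstract describes for bin-by-bin measures on non-nominal scales: a near miss and a far miss receive identical JSD scores, while a cross-bin measure ranks them sensibly.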

    Original language: English
    Title of host publication: 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
    Publisher: Association for Computing Machinery, Inc
    Pages: 1073-1076
    Number of pages: 4
    ISBN (Electronic): 9781450356572
    DOIs: 10.1145/3209978.3210073
    Publication status: Published - 2018 Jun 27
    Event: 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018 - Ann Arbor, United States
    Duration: 2018 Jul 8 – 2018 Jul 12

    Other

    Other: 41st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018
    Country: United States
    City: Ann Arbor
    Period: 18/7/8 – 18/7/12


    Keywords

    • Dialogue evaluation
    • Earth mover's distance
    • Evaluation measures
    • Jensen-Shannon divergence
    • Kullback-Leibler divergence
    • Order-aware divergence
    • Wasserstein distance

    ASJC Scopus subject areas

    • Software
    • Computer Graphics and Computer-Aided Design
    • Information Systems

    Cite this

    Sakai, T. (2018). Comparing two binned probability distributions for information access evaluation. In 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2018 (pp. 1073-1076). Association for Computing Machinery, Inc. https://doi.org/10.1145/3209978.3210073

