Towards automatic evaluation of multi-turn dialogues: A task design that leverages inherently subjective annotations

    Research output: Contribution to journal > Conference article

    Abstract

    This paper proposes a design of a shared task whose ultimate goal is automatic evaluation of multi-turn, dyadic, textual helpdesk dialogues. The proposed task takes the form of an offline evaluation, where participating systems are given a dialogue as input, and output at least one of the following: (1) an estimated distribution of the annotators' quality ratings for that dialogue; and (2) an estimated distribution of the annotators' nugget type labels for each utterance block (i.e., a maximal sequence of consecutive posts by the same utterer) in that dialogue. This shared task should help researchers build automatic helpdesk dialogue systems that respond appropriately to inquiries by considering the diverse views of customers. The proposed task has been accepted as part of the NTCIR-14 Short Text Conversation (STC-3) task. While estimated and gold distributions are traditionally compared by means of root mean squared error, Jensen-Shannon divergence and the like, we propose a pilot measure that considers the order of the probability bins for the dialogue quality subtask, which we call Symmetric Normalised Order-aware Divergence (SNOD).
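The quantities named in the abstract can be illustrated with a minimal sketch. It groups posts into utterance blocks (maximal runs of consecutive posts by the same utterer), turns individual annotator ratings into a gold distribution, and compares an estimated distribution against it with root mean squared error and Jensen-Shannon divergence. The function names, the post representation as `(utterer, text)` pairs, the per-bin averaging in the RMSE, and the base-2 logarithm are all assumptions made for illustration; they are not taken from the paper. SNOD is not sketched because the abstract does not define it.

```python
import math
from collections import Counter
from itertools import groupby


def utterance_blocks(posts):
    """Group (utterer, text) posts into maximal runs by the same utterer."""
    return [
        (utterer, [text for _, text in run])
        for utterer, run in groupby(posts, key=lambda post: post[0])
    ]


def rating_distribution(ratings, bins):
    """Convert individual annotator ratings into probabilities over the bins."""
    counts = Counter(ratings)
    return [counts[b] / len(ratings) for b in bins]


def rmse(p, q):
    """Root mean squared error, averaged over bins (one common convention)."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p))


def _kl(p, m):
    """Kullback-Leibler divergence in bits; zero-probability bins contribute 0."""
    return sum(pi * math.log2(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence; with base-2 logs it lies in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


if __name__ == "__main__":
    # Hypothetical dialogue: "C" = customer, "H" = helpdesk.
    posts = [("C", "Hi"), ("C", "My router is down"), ("H", "Try rebooting"), ("C", "Thanks")]
    print(utterance_blocks(posts))

    # Hypothetical 5-point rating scale and five annotators' ratings.
    bins = [-2, -1, 0, 1, 2]
    gold = rating_distribution([1, 1, 2, 0, 1], bins)
    estimated = [0.0, 0.1, 0.3, 0.4, 0.2]
    print(rmse(estimated, gold), jsd(estimated, gold))
```

Neither measure in this sketch looks at the order of the bins: moving probability mass to an adjacent bin is penalised the same as moving it to a distant one. That is the gap the abstract says the proposed order-aware measure is meant to address.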

    Original language: English
    Pages (from-to): 24-30
    Number of pages: 7
    Journal: CEUR Workshop Proceedings
    Volume: 2008
    Publication status: Published - 2017 Jan 1
    Event: 8th International Workshop on Evaluating Information Access, EVIA 2017 - Tokyo, Japan
    Duration: 2017 Dec 5 → …

    Keywords

    • Dialogues
    • Divergence
    • Evaluation
    • Nuggets
    • Probability distributions
    • Test collections

    ASJC Scopus subject areas

    • Computer Science(all)

    Cite this

    Towards automatic evaluation of multi-turn dialogues: A task design that leverages inherently subjective annotations. / Sakai, Tetsuya.

    In: CEUR Workshop Proceedings, Vol. 2008, 01.01.2017, p. 24-30.

    @article{234fd47948484e908364217da9095fca,
    title = "Towards automatic evaluation of multi-turn dialogues: A task design that leverages inherently subjective annotations",
    keywords = "Dialogues, Divergence, Evaluation, Nuggets, Probability distributions, Test collections",
    author = "Tetsuya Sakai",
    year = "2017",
    month = "1",
    day = "1",
    language = "English",
    volume = "2008",
    pages = "24--30",
    journal = "CEUR Workshop Proceedings",
    issn = "1613-0073",
    publisher = "CEUR-WS",

    }

    TY - JOUR

    T1 - Towards automatic evaluation of multi-turn dialogues

    T2 - A task design that leverages inherently subjective annotations

    AU - Sakai, Tetsuya

    PY - 2017/1/1

    Y1 - 2017/1/1

    KW - Dialogues

    KW - Divergence

    KW - Evaluation

    KW - Nuggets

    KW - Probability distributions

    KW - Test collections

    UR - http://www.scopus.com/inward/record.url?scp=85038882038&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85038882038&partnerID=8YFLogxK

    M3 - Conference article

    AN - SCOPUS:85038882038

    VL - 2008

    SP - 24

    EP - 30

    JO - CEUR Workshop Proceedings

    JF - CEUR Workshop Proceedings

    SN - 1613-0073

    ER -