The effect of inter-assessor disagreement on IR system evaluation

A case study with lancers and students

    Research output: Contribution to journal › Conference article

    Abstract

    This paper reports on a case study of inter-assessor disagreement in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, the pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part-time job matching website, where the hirer is required to rate the quality of the lancer's work upon task completion, so the lancer has a reputation to maintain. Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student. We then compared the system rankings and statistical significance test results obtained with different qrels versions, created by changing which assessors to rely on: overall, the outcomes do differ across the qrels versions, and the versions that rely on multiple assessors have higher discriminative power than those that rely on a single assessor. Furthermore, we consider removing topics with relatively low inter-assessor agreement from the original topic set: we thus rank systems using the 27 high-agreement topics that remain after removing the 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and high-agreement sets in terms of discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for drawing concrete research conclusions than a low-agreement one.
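
    The abstract does not name the agreement coefficient used in the study, so the following is only a minimal sketch of how pairwise inter-assessor agreement on relevance labels might be computed, assuming Cohen's kappa as the measure; the function name and the toy labels are illustrative and do not come from the paper.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa between two assessors' relevance labels
        for the same set of pooled documents."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of documents with identical labels.
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Chance agreement from each assessor's marginal label distribution.
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum(counts_a[l] * counts_b[l]
                       for l in set(labels_a) | set(labels_b)) / n ** 2
        return (observed - expected) / (1 - expected)

    # Toy graded relevance labels (0 = nonrelevant, 2 = highly relevant)
    # from two hypothetical assessors over 8 pooled documents.
    lancer_1 = [2, 1, 0, 0, 2, 1, 1, 0]
    lancer_2 = [2, 1, 0, 1, 2, 1, 0, 0]
    print(round(cohens_kappa(lancer_1, lancer_2), 2))  # 0.62

    Under this assumed measure, "high-agreement" and "low-agreement" topics could then be separated by thresholding the per-topic kappa values averaged over assessor pairs.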

    Original language: English
    Pages (from-to): 31-38
    Number of pages: 8
    Journal: CEUR Workshop Proceedings
    Volume: 2008
    Publication status: Published - 2017 Jan 1
    Event: 8th International Workshop on Evaluating Information Access, EVIA 2017 - Tokyo, Japan
    Duration: 2017 Dec 5 → …

    Keywords

    • Inter-assessor agreement
    • P-values
    • Relevance assessments
    • Statistical significance

    ASJC Scopus subject areas

    • Computer Science (all)

    Cite this

    The effect of inter-assessor disagreement on IR system evaluation: A case study with lancers and students. / Sakai, Tetsuya.

    In: CEUR Workshop Proceedings, Vol. 2008, 01.01.2017, p. 31-38.

    Research output: Contribution to journal › Conference article

    @article{7d9c890802ba4bf2ab379187b620ff03,
    title = "The effect of inter-assessor disagreement on IR system evaluation: A case study with lancers and students",
    abstract = "This paper reports on a case study of inter-assessor disagreement in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, the pooled documents were independently judged by three assessors: two {"}lancers{"} and one Waseda University student. A lancer is a worker hired through a Japanese part-time job matching website, where the hirer is required to rate the quality of the lancer's work upon task completion, so the lancer has a reputation to maintain. Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student. We then compared the system rankings and statistical significance test results obtained with different qrels versions, created by changing which assessors to rely on: overall, the outcomes do differ across the qrels versions, and the versions that rely on multiple assessors have higher discriminative power than those that rely on a single assessor. Furthermore, we consider removing topics with relatively low inter-assessor agreement from the original topic set: we thus rank systems using the 27 high-agreement topics that remain after removing the 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and high-agreement sets in terms of discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for drawing concrete research conclusions than a low-agreement one.",
    keywords = "Inter-assessor agreement, P-values, Relevance assessments, Statistical significance",
    author = "Tetsuya Sakai",
    year = "2017",
    month = "1",
    day = "1",
    language = "English",
    volume = "2008",
    pages = "31--38",
    journal = "CEUR Workshop Proceedings",
    issn = "1613-0073",
    publisher = "CEUR-WS",

    }

    TY - JOUR

    T1 - The effect of inter-assessor disagreement on IR system evaluation

    T2 - A case study with lancers and students

    AU - Sakai, Tetsuya

    PY - 2017/1/1

    Y1 - 2017/1/1

    N2 - This paper reports on a case study of inter-assessor disagreement in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, the pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part-time job matching website, where the hirer is required to rate the quality of the lancer's work upon task completion, so the lancer has a reputation to maintain. Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student. We then compared the system rankings and statistical significance test results obtained with different qrels versions, created by changing which assessors to rely on: overall, the outcomes do differ across the qrels versions, and the versions that rely on multiple assessors have higher discriminative power than those that rely on a single assessor. Furthermore, we consider removing topics with relatively low inter-assessor agreement from the original topic set: we thus rank systems using the 27 high-agreement topics that remain after removing the 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and high-agreement sets in terms of discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for drawing concrete research conclusions than a low-agreement one.

    AB - This paper reports on a case study of inter-assessor disagreement in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, the pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part-time job matching website, where the hirer is required to rate the quality of the lancer's work upon task completion, so the lancer has a reputation to maintain. Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student. We then compared the system rankings and statistical significance test results obtained with different qrels versions, created by changing which assessors to rely on: overall, the outcomes do differ across the qrels versions, and the versions that rely on multiple assessors have higher discriminative power than those that rely on a single assessor. Furthermore, we consider removing topics with relatively low inter-assessor agreement from the original topic set: we thus rank systems using the 27 high-agreement topics that remain after removing the 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and high-agreement sets in terms of discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for drawing concrete research conclusions than a low-agreement one.

    KW - Inter-assessor agreement

    KW - P-values

    KW - Relevance assessments

    KW - Statistical significance

    UR - http://www.scopus.com/inward/record.url?scp=85038864544&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=85038864544&partnerID=8YFLogxK

    M3 - Conference article

    VL - 2008

    SP - 31

    EP - 38

    JO - CEUR Workshop Proceedings

    JF - CEUR Workshop Proceedings

    SN - 1613-0073

    ER -