The effect of inter-assessor disagreement on IR system evaluation: A case study with lancers and students

Tetsuya Sakai*


    Research output: Conference article › peer-review

    1 citation (Scopus)


    This paper reports on a case study on the inter-assessor disagreements in the English NTCIR-13 We Want Web (WWW) collection. For each of our 50 topics, pooled documents were independently judged by three assessors: two "lancers" and one Waseda University student. A lancer is a worker hired through a Japanese part-time job matching website, where the hirer is required to rate the quality of the lancer's work upon task completion and therefore the lancer has a reputation to maintain. Nine lancers and five students were hired in total; the hourly pay was the same for all assessors. On the whole, the inter-assessor agreement between two lancers is statistically significantly higher than that between a lancer and a student. We then compared the system rankings and statistical significance test results according to different qrels versions created by changing which assessors to rely on: overall, the outcomes do differ across the qrels versions, and those that rely on multiple assessors have a higher discriminative power than those that rely on a single assessor. Furthermore, we consider removing topics with relatively low inter-assessor agreement from the original topic set: we thus rank systems using 27 high-agreement topics, after removing 23 low-agreement topics. While the system ranking with the full topic set and that with the high-agreement set are statistically equivalent, the ranking with the high-agreement set and that with the low-agreement set are not. Moreover, the low-agreement set substantially underperforms the full and the high-agreement sets in terms of discriminative power. Hence, from a statistical point of view, our results suggest that a high-agreement topic set is more useful for finding concrete research conclusions than a low-agreement one.
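    Inter-assessor agreement between two sets of relevance judgments on the same pooled documents is commonly quantified with a chance-corrected statistic such as Cohen's kappa. The following minimal sketch is illustrative only (the paper does not specify this exact computation or these labels); the `lancer`/`student` judgment lists are hypothetical binary relevance labels:

    ```python
    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        """Cohen's kappa for two assessors' labels over the same documents."""
        assert len(labels_a) == len(labels_b)
        n = len(labels_a)
        # Observed agreement: fraction of documents given identical labels.
        p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        # Expected chance agreement from each assessor's label marginals.
        count_a, count_b = Counter(labels_a), Counter(labels_b)
        p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical binary relevance judgments from two assessors.
    lancer = [1, 1, 0, 0, 1, 0, 1, 0]
    student = [1, 0, 0, 0, 1, 0, 1, 1]
    print(cohens_kappa(lancer, student))  # → 0.5
    ```

    A kappa of 1.0 indicates perfect agreement and 0.0 indicates agreement no better than chance; comparing such scores between lancer-lancer and lancer-student pairs is the kind of analysis the abstract describes.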

    Journal: CEUR Workshop Proceedings
    Publication status: Published - 1 Jan 2017
    Event: 8th International Workshop on Evaluating Information Access, EVIA 2017 - Tokyo, Japan
    Duration: 5 Dec 2017 → …

    ASJC Scopus subject areas

    • Computer Science (all)
