Simple and effective approach to score standardisation

    Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

    7 Citations (Scopus)

    Abstract

    Webber, Moffat and Zobel proposed score standardisation for information retrieval evaluation with multiple test collections. Given a topic-by-run raw score matrix in terms of some evaluation measure, each score can be standardised using the topic's sample mean and sample standard deviation across a set of past runs, so as to quantify how different a system is from the "average" system in standard deviation units. Using standardised scores, researchers can compare systems across different test collections without worrying about topic hardness or normalisation. While Webber et al. mapped the standardised scores to the [0,1] range using the standard normal cumulative distribution function, the present study demonstrates that a linear transformation of the standardised scores, a method widely used in educational research, can be a simple and effective alternative. We use three TREC robust track data sets with graded relevance assessments and official runs to compare these methods by means of leave-one-out tests, discriminative power, swap rate tests, and topic set size design. In particular, we demonstrate that our method is superior to the method of Webber et al. in terms of swap rates and topic set size design: put simply, our method ensures pairwise system comparisons that are more consistent across different data sets, and is arguably more convenient for designing a new test collection from a statistical viewpoint.
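    The abstract contrasts two ways of mapping per-topic z-scores into the [0,1] range. The following is a minimal sketch in Python, not the paper's implementation: standardise() computes per-topic z-scores from a topic-by-run raw score matrix, cdf_map() applies the standard normal CDF as in Webber et al., and linear_map() applies a linear transformation clipped to [0,1]. The constants a = 0.1 and b = 0.5 are assumptions chosen by analogy with the T-score convention (50 + 10z) of educational research; the exact constants used in the paper may differ.

        # Sketch of the two z-score mappings contrasted in the abstract.
        # NOTE: a = 0.1, b = 0.5 are assumed constants (T-score analogy),
        # not necessarily those used in the paper itself.
        import numpy as np
        from scipy.stats import norm

        def standardise(raw):
            """Z-standardise a topic-by-run score matrix, one row per topic."""
            raw = np.asarray(raw, dtype=float)
            mean = raw.mean(axis=1, keepdims=True)       # per-topic sample mean
            sd = raw.std(axis=1, ddof=1, keepdims=True)  # per-topic sample SD
            return (raw - mean) / sd

        def cdf_map(z):
            """Webber et al.: standard normal CDF maps z-scores into [0,1]."""
            return norm.cdf(z)

        def linear_map(z, a=0.1, b=0.5):
            """Linear transformation of z-scores, clipped to [0,1]."""
            return np.clip(a * z + b, 0.0, 1.0)

        # Toy data: 3 topics x 4 runs of raw scores for some evaluation measure.
        raw = [[0.10, 0.20, 0.30, 0.40],   # a hard topic
               [0.60, 0.70, 0.80, 0.90],   # an easy topic
               [0.20, 0.50, 0.55, 0.60]]
        z = standardise(raw)
        print(np.round(cdf_map(z), 3))
        print(np.round(linear_map(z), 3))

    One intuition for the swap-rate result reported in the abstract: the linear map is affine in z until clipping, so differences between systems stay proportional to their standardised differences, whereas the nonlinear CDF compresses differences in the tails.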

    Original language: English
    Title of host publication: ICTIR 2016 - Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval
    Publisher: Association for Computing Machinery, Inc
    Pages: 95-104
    Number of pages: 10
    ISBN (Electronic): 9781450344975
    DOIs: https://doi.org/10.1145/2970398.2970399
    Publication status: Published - 2016 Sep 12
    Event: 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR 2016 - Newark, United States
    Duration: 2016 Sep 12 - 2016 Sep 16

    Other

    Other: 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR 2016
    Country: United States
    City: Newark
    Period: 16/9/12 - 16/9/16

    Fingerprint

    Standardization
    Linear transformations
    Information retrieval
    Probability density function
    Hardness

    Keywords

    • Evaluation
    • Measures
    • Standardization
    • Statistical power
    • Statistical significance
    • Test collections
    • Topics
    • Variances

    ASJC Scopus subject areas

    • Information Systems
    • Computer Science (miscellaneous)

    Cite this

    Sakai, T. (2016). Simple and effective approach to score standardisation. In ICTIR 2016 - Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval (pp. 95-104). Association for Computing Machinery, Inc. https://doi.org/10.1145/2970398.2970399

    @inproceedings{a08e4df1086944b3b31252c7440b6387,
    title = "Simple and effective approach to score standardisation",
    abstract = "Webber, Moffat and Zobel proposed score standardisation for information retrieval evaluation with multiple test collections. Given a topic-by-run raw score matrix in terms of some evaluation measure, each score can be standardised using the topic's sample mean and sample standard deviation across a set of past runs, so as to quantify how different a system is from the {"}average{"} system in standard deviation units. Using standardised scores, researchers can compare systems across different test collections without worrying about topic hardness or normalisation. While Webber et al. mapped the standardised scores to the [0,1] range using the standard normal cumulative distribution function, the present study demonstrates that a linear transformation of the standardised scores, a method widely used in educational research, can be a simple and effective alternative. We use three TREC robust track data sets with graded relevance assessments and official runs to compare these methods by means of leave-one-out tests, discriminative power, swap rate tests, and topic set size design. In particular, we demonstrate that our method is superior to the method of Webber et al. in terms of swap rates and topic set size design: put simply, our method ensures pairwise system comparisons that are more consistent across different data sets, and is arguably more convenient for designing a new test collection from a statistical viewpoint.",
    keywords = "Evaluation, Measures, Standardization, Statistical power, Statistical significance, Test collections, Topics, Variances",
    author = "Tetsuya Sakai",
    year = "2016",
    month = "9",
    day = "12",
    doi = "10.1145/2970398.2970399",
    language = "English",
    pages = "95--104",
    booktitle = "ICTIR 2016 - Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval",
    publisher = "Association for Computing Machinery, Inc",

    }
