A Closer Look at Evaluation Measures for Ordinal Quantification

Tetsuya Sakai*

*Corresponding author for this work

Research output: Contribution to journal › Conference article › peer-review


In his ACL 2021 paper [1], Sakai compared several evaluation measures in the context of Ordinal Quantification (OQ) tasks in terms of system ranking similarity, system ranking consistency (i.e., robustness to the choice of test data), and discriminative power (i.e., ability to find many statistically significant differences). Based on his experimental results, he recommended the use of his RNOD (Root Normalised Order-aware Divergence) measure along with NMD (Normalised Match Distance, i.e., normalised Earth Mover's Distance). The present study follows up on his discriminative power experiments by taking a much closer look at the statistical significance test results obtained from each evaluation measure. Our new analyses show that (1) RNOD is the overall winner among the OQ measures in terms of pooled discriminative power (i.e., discriminative power across multiple data sets); (2) NMD behaves noticeably differently from RNOD and from measures that cannot handle ordinal classes; and (3) NMD tends to favour a popularity-based baseline (which accesses the gold distributions) over a uniform-distribution baseline, thus contradicting the other measures in terms of statistical significance. As both RNOD and NMD have their merits, we recommend that the organisers of OQ tasks use both measures to evaluate the systems from multiple angles.
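The abstract names the two measures but does not define them. The following is a minimal Python sketch under assumed definitions: NMD as the sum of absolute differences between the cumulative predicted and gold distributions, divided by the number of classes minus one, and RNOD as the square root of the normalised average distance-weighted squared error over classes with non-zero gold probability. The function names `nmd` and `rnod`, and the three-class toy distributions, are illustrative assumptions and do not come from the paper.

```python
import math


def nmd(pred, gold):
    """Assumed NMD: match distance between two distributions over ordered
    classes, divided by (number of classes - 1) so the result lies in [0, 1]."""
    assert len(pred) == len(gold) and len(gold) > 1
    cum_pred = cum_gold = 0.0
    total = 0.0
    for p, g in zip(pred, gold):
        cum_pred += p
        cum_gold += g
        total += abs(cum_pred - cum_gold)  # cumulative mass mismatch
    return total / (len(gold) - 1)


def rnod(pred, gold):
    """Assumed RNOD: for each class i with gold probability > 0, weight the
    squared error of every class j by |i - j|, average over those classes,
    normalise by (number of classes - 1), and take the square root."""
    assert len(pred) == len(gold) and len(gold) > 1
    n = len(gold)
    gold_classes = [i for i, g in enumerate(gold) if g > 0]
    weighted = sum(
        abs(i - j) * (pred[j] - gold[j]) ** 2
        for i in gold_classes
        for j in range(n)
    )
    order_aware_divergence = weighted / len(gold_classes)
    return math.sqrt(order_aware_divergence / (n - 1))


# Toy example (made up for illustration): all gold mass on the first class.
gold = [1.0, 0.0, 0.0]
near = [0.0, 1.0, 0.0]  # all mass one class away from the gold class
far = [0.0, 0.0, 1.0]   # all mass two classes away from the gold class

print(nmd(near, gold), nmd(far, gold))    # 0.5 1.0
print(rnod(near, gold), rnod(far, gold))  # approx. 0.707 and 1.0
```

In this toy case both measures penalise the prediction that is further from the gold class more heavily, which is the ordinal awareness that the abstract contrasts with measures that cannot handle ordinal classes.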

Original language: English
Journal: CEUR Workshop Proceedings
Publication status: Published - 2021
Event: 2021 International Conference on Information and Knowledge Management Workshops, CIKMW 2021 - Gold Coast, Australia
Duration: 2021 Nov 1 → 2021 Nov 5


Keywords

  • Distributions
  • Evaluation
  • Evaluation measures
  • Ordinal classes
  • Ordinal quantification
  • Prevalence estimation

ASJC Scopus subject areas

  • Computer Science (all)

