An Effectiveness Metric for Ordinal Classification: Formal Properties and Experimental Results

In Ordinal Classification tasks, items have to be assigned to classes that have a relative ordering, such as “positive”, “neutral”, “negative” in sentiment analysis. Remarkably, the most popular evaluation metrics for ordinal classification tasks either ignore relevant information (for instance, precision/recall on each of the classes ignores their relative ordering) or assume additional information (for instance, Mean Average Error assumes absolute distances between classes). In this paper we propose a new metric for Ordinal Classification, the Closeness Evaluation Measure, which is rooted in Measurement Theory and Information Theory. Our theoretical analysis and experimental results over both synthetic data and data from NLP shared tasks indicate that the proposed metric captures quality aspects from different traditional tasks simultaneously. In addition, it generalizes some popular classification (nominal scale) and error minimization (interval scale) metrics, depending on the measurement scale in which it is instantiated.


Introduction
In Ordinal Classification (OC) tasks, items have to be assigned to classes that have a relative ordering, such as positive, neutral, negative in sentiment analysis. OC is different from n-ary classification because it considers the ordinal relationships between classes. It is different from ranking tasks, which only care about the relative ordering between items, because it requires category matching; and it is different from value prediction, because it does not assume fixed numeric intervals between categories.
Most research on Ordinal Classification, however, evaluates systems with metrics designed for those other problems: classification measures ignore the ordering between classes, ranking metrics ignore category matching, and value prediction metrics assume (usually equal) numeric intervals between categories.
In this paper we propose a metric designed to evaluate Ordinal Classification systems which relies on concepts from Measurement Theory and from Information Theory. The key idea is defining a general notion of closeness between item value assignments (system output prediction vs gold standard class) which is instantiated on ordinal scales but can also be used with nominal or interval scales. Our approach establishes closeness between classes in terms of the distribution of items per class in the gold standard, instead of assuming predefined intervals between classes. We provide a formal (Section 4) and empirical (Section 5) comparison of our metric with previous approaches, and both analytical and empirical evidence indicate that our metric suits the problem better than the currently most popular choices.

State of the Art
In this section we first summarize the most popular metrics used in OC evaluation campaigns, and then discuss previous work on OC evaluation.

OC Metrics in NLP shared tasks
OC does not match traditional classification, because the ordering between classes makes some errors more severe than others. For instance, misclassifying a positive opinion as negative is a more severe error than misclassifying it as neutral. Classification metrics, however, have been used for OC tasks in several shared tasks (see Table 1). For instance, Evalita-16 (Barbieri et al., 2016) uses F1, and SemEval-2017 (Rosenthal et al., 2017) uses Macro Average Recall. OC does not match ranking metrics either: three items categorized by a system as very_high/high/low, respectively, are perfectly ranked with respect to a ground truth high/low/very_low, yet no single item is correctly classified. However, ranking metrics have been applied in some campaigns, such as R/S for reputation polarity and priority in RepLab 2013 (Amigó et al., 2013a).
OC has also been evaluated as a value prediction problem, for instance in SemEval 2015 Task 11 (Ghosh et al., 2015), with metrics such as Mean Average Error (MAE) or Mean Squared Error (MSE), usually assuming that all classes are equidistant. In general, however, we cannot assume fixed intervals between classes if we are dealing with an OC task. For instance, in the paper-reviewing scale strong_accept / accept / weak_accept / undecided / weak_reject / reject / strong_reject, the differences in appreciation between each ordinal step do not necessarily map into predefined numerical intervals.
Finally, OC has also been considered as a linear correlation problem, as in the Semantic Textual Similarity track (Cer et al., 2017). An OC output, however, can have perfect linear correlation with the ground truth without matching any single value.
This diversity of approaches, which does not happen in other types of tasks, indicates a lack of consensus about which tasks are true Ordinal Classification problems, and what the general requirements of OC evaluation are.

Studies on Ordinal Classification
There are a number of previous formal studies on OC in the literature. First, the problem has been studied from the perspective of loss functions for ordinal regression algorithms in Machine Learning.
In particular, in a comprehensive work, Rennie and Srebro (2005) reviewed the existing loss functions for traditional classification and extended them to OC. Although they did not try to formalize OC tasks, in later sections we will study the implications of using their loss function for OC evaluation purposes.
Other authors analyzed OC from a classification perspective. For instance, Waegeman et al. (2006) presented an extended version of the ROC curve for ordinal classification, and Vanbelle and Albert (2009) studied the properties of the Weighted Kappa coefficient in OC.
Other authors applied a value prediction perspective. Gaudette and Japkowicz (2009) analysed the effect of using different error minimization metrics for OC. Baccianella et al. (2009) focused on imbalanced datasets. They imported macro averaging (from classification) to error minimization metrics such as MAE, MSE, and Mean Zero-One Error.
Remarkably, a common aspect of all these contributions is that they all assume predefined intervals between categories. Rennie and Srebro assumed, for their loss function, uniform interval distributions across categories. In their probabilistic extension, they assume predefined intervals via parameters in the joint distribution model. Waegeman et al. explicitly assumed that "the misclassification costs are always proportional to the absolute difference between the real and the predicted label". The predefined intervals are defined by Vanbelle and Albert via weighting parameters in Kappa. The MAE and MSE metrics compared by Gaudette and Japkowicz also assume predefined (uniform) intervals. Finally, the solution proposed by Baccianella et al. is based on "a sum of the classification errors across classes".
In our opinion, assuming and adding intervals between categories to estimate misclassification errors violates the notion of ordinal scale in Measurement Theory (Stevens, 1946), which establishes that intervals are not meaningful relationships for ordinal scales. Our measure and our theoretical analysis are meant to address this problem.

Measure Definition
Evaluation metrics establish proximity between a system output and the gold standard (Amigó and Mizzaro, 2020). In ordinal classification we have to compare the classes assigned by the system with the true classes in the gold standard.

Figure 1: Two possible ground-truth distributions in a paper-reviewing scenario. In the left distribution, a disagreement between weak accept and weak reject would be a strong disagreement between reviewers (i.e., the classes are distant), because in practice these are almost the extreme cases of the scale (reviewers rarely go for accept or reject). In the right distribution the situation is the opposite: reviewers tend to take a clear stance, which makes weak accept and weak reject closer assessments than in the left case.
A key idea in our metric is to establish a notion of informational closeness that depends on how items are distributed in the rank of classes. The idea is that two items a and b are informationally close if the probability of finding an item between the two is low. As an example, Figure 1 illustrates the intuition of how item distribution affects informational closeness in the context of paper reviewing. This is similar in spirit to, for instance, comparing the quality of two journals according to their quartiles in the rank of journals of comparable topics. With this notion of informational closeness, proximity between classes adapts to the way in which classes are used in a given dataset.
This idea of informational closeness can be implemented using Information Theory: the more unexpected it is to find an item between a and b, the more information such an event provides, and the closer a and b are informationally. Let P(x ⊴_b a) be the probability that, sampling an item x from the space of items, x is closer to b than a in the ordinal scale of classes. Then we can define the Closeness Information Quantity (CIQ) between a and b as the Information Quantity of the event x ⊴_b a:

CIQ_ORD(a, b) = IQ(x ⊴_b^ORD a) = −log(P(x ⊴_b^ORD a))

Let us now apply this concept to the evaluation of system outputs. Let D be the item collection, C = {c_1, . . . , c_n} a set of sorted classes such that c_1 < c_2 < . . . < c_n, and g, s : D → C the gold standard and a system output. Given the classes g(d), s(d) assigned to an item d ∈ D by the gold standard and the system output, CIQ_ORD(s(d), g(d)) measures the closeness between the assigned class and the gold standard class:

CIQ_ORD(s(d), g(d)) = −log(P(x ⊴_{g(d)}^ORD s(d)))
Our proposed evaluation measure consists in adding the CIQ values for all items d ∈ D, and normalizing the sum by its maximal value, which is the one obtained by a system output that matches the gold standard perfectly. This is what we call the Closeness Evaluation Measure, CEM_ORD:

CEM_ORD(s, g) = Σ_{d∈D} CIQ_ORD(s(d), g(d)) / Σ_{d∈D} CIQ_ORD(g(d), g(d))
In an ordinal scale, the condition x ⊴_b^ORD a (x is closer to b than a) implies that x is between a and b (a ≥ x ≥ b or a ≤ x ≤ b). Therefore, if n_i is the number of items assigned to class c_i in the gold standard, and N is the total number of items, the formula above turns into:

CEM_ORD(s, g) = Σ_{d∈D} prox(s(d), g(d)) / Σ_{d∈D} prox(g(d), g(d))

prox(c_i, c_j) = −log((n_i/2 + Σ_k n_k) / N)

where the sum Σ_k ranges over all classes from c_i to c_j (both included) except c_i itself, and logarithms are base 2 in all our examples. Note that the term prox(c_i, c_j), which is the core of the metric, reflects the informational closeness that the metric assigns to a pair of classes c_i, c_j. Note also that half of the ties (items in the class c_i) are included in the computation. Every time the system assigns the class c_i and the ground truth is c_j, the contribution of that assignment to the final value of CEM_ORD is proportional to the informational closeness between both classes.
As an example, let us consider the two ground truth distributions in Figure 1. The proximity between the classes weak_accept and weak_reject is lower for the left distribution (where most items lie between the two classes) than for the right one, where it is:

prox(weak_accept, weak_reject) = −log((10/2 + 3 + 10) / 376) = 4.38
A mistake between these two classes is therefore more heavily penalized by the metric in the left distribution. Note also that correct predictions have different weights, prox(c_i, c_i), which are higher for infrequent classes. For instance, a correct guess for a reject ground truth in the left distribution has a weight of prox(reject, reject) = 6.84, because it is a rare class (7/402 items); but a correct guess for an undecided item has a weight of only 2.06, because the class is very frequent in the ground truth (193/402 items). This is an effect of using Information Theory to characterize closeness: an infrequent class carries more information than a frequent class.
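To make the computation concrete, the following Python sketch (our illustration, not the authors' official EvALL implementation; it assumes base-2 logarithms, which reproduce the worked values above) derives prox and CEM_ORD from the gold-standard class counts:

```python
import math

def prox(counts, i, j):
    """Informational closeness between system class i and gold class j,
    given the gold-standard class counts in ordinal order."""
    lo, hi = min(i, j), max(i, j)
    # half of the items in the system-assigned class i, plus all items in
    # the gold class j and in the classes strictly between i and j
    between = counts[i] / 2 + sum(counts[k] for k in range(lo, hi + 1) if k != i)
    return -math.log2(between / sum(counts))

def cem_ord(system, gold, counts):
    """Summed prox of (system, gold) class pairs, normalized by the score
    of a perfect output (which therefore yields exactly 1.0)."""
    num = sum(prox(counts, s, g) for s, g in zip(system, gold))
    den = sum(prox(counts, g, g) for g in gold)
    return num / den

# Example ground-truth distribution (Appendix A): 10 negative, 60 neutral, 30 positive
counts = [10, 60, 30]
print(round(prox(counts, 0, 2), 3))  # extreme classes: -log2(95/100) -> 0.074
```

With the left distribution of Figure 1, the same formula reproduces prox(reject, reject) = −log2(3.5/402) ≈ 6.84 and the other values reported above.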
Overall, CEM_ORD rewards exact matches, considers ordinal relationships, and does not assume predefined intervals between classes (instead, intervals depend on the distribution of items over classes in the gold standard). Appendix A shows detailed examples of how to compute CEM_ORD from the confusion matrix of a system output.

Formalization of CEM on Different Scales
We have specified our measure CEM_ORD at the ordinal scale to address OC tasks, but it can be instantiated at any scale. In this section we briefly investigate this generalization. In Measurement Theory, at least in Stevens's model (1946), all measures map items to real numbers, and measurement equivalence at different scales is determined by the family F_T of permissible transformation functions for each scale type T. Starting from |a − b| as the standard algebraic distance between numbers, we define closeness at a measurement scale T by requiring that the algebraic condition holds under every permissible transformation in F_T.
Definition 1 (Closeness for a Scale Type) Given three numbers x, a and b, we say that x is closer to b than a (x ⊴_b^T a) for a certain scale type T if and only if:

∀f ∈ F_T : |f(b) − f(x)| ≤ |f(b) − f(a)|

That is, at ordinal scale, x must be located between a and b to be closer to b than a. At nominal scale, the condition reduces to exact matching (x = b, or x = a). At interval scale, the condition matches the standard algebraic closeness between numbers: |b − x| ≤ |b − a|.
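These three instantiations can be sketched as simple predicates (hypothetical helper names; a minimal sketch of Definition 1, assuming the nominal condition reduces to exact matching as stated above):

```python
def closer_ordinal(x, a, b):
    # Ordinal scale: x is closer to b than a iff x lies between a and b
    return min(a, b) <= x <= max(a, b)

def closer_interval(x, a, b):
    # Interval scale: permissible transformations are increasing linear maps,
    # so the condition reduces to standard algebraic closeness
    return abs(b - x) <= abs(b - a)

def closer_nominal(x, a, b):
    # Nominal scale: any one-to-one relabeling is permissible, so only
    # exact matches survive the universal quantification
    return x == b or x == a

# 2 lies between 1 and 3, but 0 does not
print(closer_ordinal(2, 1, 3), closer_ordinal(0, 1, 3))  # True False
```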
We can generalize CIQ_ORD and CEM_ORD to consider closeness at any scale T by simply replacing ⊴^ORD with ⊴^T in their definitions. We denote these generalizations as CIQ_T and CEM_T. The CEM_T metric generalizes some of the most popular metrics in classification.
Proposition 1 Assuming that the categories in g follow a uniform distribution, Accuracy is proportional to CEM at nominal scale. Formally, whenever P(g(d) = c) is equal for all categories c ∈ C, then Acc(s, g) ∝ CEM_NOM(s, g).
Macro Average Accuracy can also be defined by aggregating CIQ_NOM(s(d), g(d)) in the corresponding manner. Also, under the same statistical assumptions, Precision and Recall for a category c can be defined in terms of the aggregated CIQs of the items in the system or gold category, respectively.

Proposition 2 Whenever P(g(d) = c) is equal for all categories c ∈ C, Precision and Recall for a category c correspond to the aggregation of CIQ_NOM(s(d), g(d)) over the items with s(d) = c and g(d) = c, respectively. Exact match between Precision, Recall and the CIQ aggregation is achieved when values are normalized with respect to the maximum.
On the other hand, if we do not assume a uniform distribution of items over classes in the gold standard, then we obtain a classification metric CEM_NOM(s, g) which gives more (logarithmic) weight to errors in infrequent classes.
Finally, at interval scale, CEM_INT would be equivalent to a logarithmic version of MAE whenever items are uniformly distributed across classes.
We leave a more detailed formal and empirical analysis of CEM at other scales for future work, as it is not the primary scope of this paper.

Metric Properties
Property 1 (Ordinal Invariance) An effectiveness metric Eff(s, g) should not assume predefined intervals between classes, i.e., it should be invariant under permissible transformation functions at ordinal scale.

Although we cannot compare intervals at ordinal scale, we know, e.g., that "neutral" is closer to "positive" than "negative" is. Therefore we need another property to verify monotonicity with respect to category closeness.
Property 2 (Ordinal Monotonicity) Changing system predictions so that they are closer to the true category should result in a metric increase: if s′(d) ⊴_{g(d)} s(d) for every item d ∈ D, with strict closeness for at least one item, then Eff(s′, g) > Eff(s, g).

In words, the formalization of ordinal monotonicity states that if all predictions by system s′ are at least as close to the gold standard as the predictions by s, and at least one is strictly closer, then the metric score of s′ must be higher.

Figure 2: Illustration of desirable formal properties for Ordinal Classification. Each bin is a system output, where columns represent ordered classes assigned by the system, and colors represent the items' true classes, ordered from black to white. "=" means that both outputs should have the same quality, and ">" that the left output should receive a higher metric value than the right output.
Finally, in order to manage the effect of imbalanced data sets, another desirable property is that a classification error on an item of a frequent class should have less effect than a classification error on an item of a small class (Fatourechi et al., 2008). In order to formalize this property, we use g_{d→c} to denote the result of moving the item d to the class c in the gold standard.
Property 3 (Imbalance) Distancing items from a small class has more effect than distancing items from a large class. Let c_1, c_2, c_3 be three contiguous classes such that c_1 is larger than c_3, and d_1, d_3 two items such that g(d_1) = c_1 and g(d_3) = c_3. Then Eff(g_{d_1→c_2}, g) > Eff(g_{d_3→c_2}, g).

Metric Analysis
Table 2 displays the properties satisfied by metrics grouped by families. Classification metrics are ordinal invariant, but they do not satisfy ordinal monotonicity. Attempts to mitigate this limitation include (i) Accuracy at n (Gaudette and Japkowicz, 2009), which relaxes Accuracy with an ordinal margin of error, and (ii) ignoring the neutral class (Rosenthal et al., 2014). However, both approaches are insensitive to some types of error. Some classification metrics, such as Macro Average Accuracy (MAAC), Cohen's Kappa or the F-measure averaged across classes, satisfy the imbalance constraint. Error minimization metrics such as MAE and MSE satisfy ordinal monotonicity, and their macro-averaged versions also satisfy imbalance (Baccianella et al., 2009); however, since they assume predefined intervals between classes, they are not ordinal invariant. The weighted Kappa can be monotonic whenever the accumulated weights are consistent with the ordinal structure (Vanbelle and Albert, 2009). In addition, it can satisfy imbalance depending on the weighting scheme. However, ordinal invariance is not satisfied. The loss function for ordinal classification proposed by Rennie and Srebro (2005) is, in the same way as MAE, grounded on category differences, and therefore does not satisfy ordinal invariance. Finally, cosine similarity has also been employed to evaluate OC (Ghosh et al., 2015), where documents are dimensions and categories are vector values. Like any other geometric measure, it is not ordinal invariant and it does not satisfy imbalance.
In general, correlation coefficients do not satisfy monotonicity, given that exact matching of gold standard values is not required to achieve the maximum score. Unlike linear correlation, ordinal correlation coefficients (i.e., Kendall or Spearman) are ordinal invariant. Kendall can be computed in different ways depending on how ties are managed. In Tau-a, only discordant pairs (g(d_1) > g(d_2) and s(d_1) < s(d_2)) are considered, and imbalance is not satisfied. The most popular Kendall coefficient (Tau-b) and Spearman both satisfy imbalance. The Pearson coefficient does not, due to the interval effect. Reliability and Sensitivity, which extend the clustering metric BCubed, are essentially ordinal correlation metrics: they are invariant but fail on monotonicity, with the advantage of satisfying imbalance due to their precision/recall notions.
By definition, clustering metrics are ordinal invariant, because they are not affected by the ordering of the category descriptors. In addition, most of them, such as Mutual Information (MI) or Purity and Inverse Purity, satisfy imbalance. However, they are not ordinal monotonic, given that they do not consider any ordinal relationship between categories.
Finally, we must include the approach by Cardoso and Sousa (2011), a path-based metric called the Ordinal Classification Index, which is designed specifically for OC problems. This metric integrates aspects from the previous three metric families, including two parameters β_1 and β_2 to combine the different components, and can therefore capture the different quality aspects involved in the OC process. However, it inherits the lack of invariance of MAE and MSE when computing the ordinal distance between categories, and its monotonicity can be violated depending on the effect of discordant item pairs.
The table ends with our proposed metric CEM, which behaves as a classification, error minimization, or OC metric depending on whether it is instantiated at the nominal (CEM_NOM), interval (CEM_INT), or ordinal (CEM_ORD) measurement scale. CEM_ORD is the only metric that satisfies the three properties, provided that there are no empty classes in the gold standard (see Appendix A.2).

Empirical Study
Meta-evaluating metrics is not straightforward. A common criterion is robustness, defined as the consistency (correlation) of system rankings across data sets. However, although robustness is relevant (and we do report it at the end of this section), it does not reflect to what extent a metric captures the quality aspects of systems.
As many authors have pointed out, an OC metric should capture diverse aspects of systems: class matching, ordering, and imbalance. In our experiments, in addition to robustness, we select three complementary metrics, each focused on one of these partial aspects, and we evaluate to what extent existing OC metrics are able to capture all these aspects simultaneously.
The selected metrics are: (i) Accuracy, as a partial metric which captures class matching; (ii) Kendall's correlation coefficient Tau-a (which does not count ties; see en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient), in order to capture class ordering; and (iii) Mutual Information (MI), a clustering metric which reflects how much knowing the system output reduces uncertainty about the gold standard values, and which accentuates the effect of small classes (imbalance property).
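For reference, Tau-a can be sketched as follows (a simple quadratic-time illustration; ties contribute to neither the concordant nor the discordant count, but do count in the denominator):

```python
def tau_a(gold, system):
    """Kendall's Tau-a: (concordant - discordant) / total number of pairs."""
    n = len(gold)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            g = (gold[i] > gold[j]) - (gold[i] < gold[j])      # sign of gold order
            s = (system[i] > system[j]) - (system[i] < system[j])  # sign of system order
            if g * s > 0:
                conc += 1
            elif g * s < 0:
                disc += 1
    return (conc - disc) / (n * (n - 1) / 2)

# A perfectly ordered output that matches no single class
# (cf. the very_high/high/low vs high/low/very_low example in Section 2)
print(tau_a([3, 2, 1], [2, 1, 0]))  # 1.0
```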

Meta-evaluation Metric
In order to quantify the ability of metrics to capture the aspects reflected by these three metrics, we use the Unanimous Improvement Ratio (UIR) (Amigó et al., 2011). While robustness focuses on consistency across data sets, UIR focuses on consistency across metrics. It essentially counts in how many test cases an improvement is observed for all metrics simultaneously. Being M a set of metrics, T a set of test cases, and s_t a system output for the test case t, the Unanimous Improvement Ratio UIR_M(s, s′) between two systems is defined as:

UIR_M(s, s′) = (|{t ∈ T : s_t ≥_M s′_t}| − |{t ∈ T : s′_t ≥_M s_t}|) / |T|

where s_t ≥_M s′_t represents that system s improves system s′ on the test case t unanimously for every metric:

s_t ≥_M s′_t ⟺ ∀m ∈ M : m(s_t) ≥ m(s′_t)

Therefore, UIR reflects to what extent a system outperforms another system for several metrics simultaneously. Then, we define our meta-evaluation measure Coverage for a single metric m as the Spearman correlation, over system output pairs (s, s′), between differences in m and unanimous improvements over the reference metric set M (we use the non-parametric Spearman coefficient instead of Pearson, which focuses the meta-evaluation on system score ordering rather than on particular scale properties of metrics):

Cov_M(m) = Spear(m(s) − m(s′), UIR_M(s, s′))
The higher the coverage of a metric m with respect to a reference metric set M, the better an improvement according to m reflects all the quality aspects represented by M.
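The UIR computation can be sketched as follows (illustrative code under the counting definition above; per-test-case metric scores are stored as dictionaries, and variable names are ours):

```python
def uir(scores_s, scores_s2):
    """Unanimous Improvement Ratio between two systems.
    scores_s[t] is a dict {metric_name: value} for test case t."""
    wins = losses = 0
    for m_s, m_s2 in zip(scores_s, scores_s2):
        if all(m_s[m] >= m_s2[m] for m in m_s):    # s unanimously >= s'
            wins += 1
        if all(m_s2[m] >= m_s[m] for m in m_s):    # s' unanimously >= s
            losses += 1
    return (wins - losses) / len(scores_s)

# Two test cases: s unanimously wins the first; the metrics disagree
# on the second, so there is no unanimous winner there
s  = [{"acc": 0.8, "tau": 0.6}, {"acc": 0.5, "tau": 0.9}]
s2 = [{"acc": 0.7, "tau": 0.5}, {"acc": 0.6, "tau": 0.7}]
print(uir(s, s2))  # (1 - 0) / 2 = 0.5
```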

Compared Metrics
We evaluate the coverage of CEM_ORD and other metrics with respect to the reference metric set Accuracy, Kendall, and MI. In the empirical study we have considered most metrics used in practice to evaluate OC problems; we have excluded a few metrics that are included in the theoretical study, either because they have not previously been used to evaluate OC problems (such as clustering metrics) or because they have internal parameters and therefore a range of variability that requires a dedicated study (such as weighted Kappa and the Ordinal Classification Index). In order to check the need for the logarithmic scaling in CEM_ORD (which comes from the application of Information Quantity), we also include an alternative metric, CEM_ORD^flat, which is similar to CEM_ORD but without the logarithmic scaling.

Experiments on Synthetic Data
In order to experiment with a representative and controlled number of classes and distributions, we first work with synthetic data. Let us consider a synthetic dataset with 100 test cases and 200 documents per test case, classified into 11 categories. In order to study different degrees of imbalance, we assign ground-truth labels to documents according to a normal distribution with mean 4 and a standard deviation between 1 and 3. The imbalance grade (deviation) varies uniformly across topics; the majority class is therefore the fourth class. Finally, we discretize the resulting values into their closest category in {1, 2, . . . , 11}. We generate synthetic system outputs according to the following behaviour: each system makes mistakes in a certain ratio r of value assignments, where r ∈ {0.1, 0.2, . . . , 0.9, 1}. We then distinguish between five kinds of mistakes, thus obtaining 10 × 5 possible system configurations. The five alternative mistakes include:

1. Majority class assignment: assign the most frequent category: s_maj(d) = 4.

2. Proximity assignment: the assignment is closer to the gold standard than a random one: it assigns a category between a randomly selected position and the gold standard class, with rPos ∼ U(1, n) (a random position between 1 and n).
We discretize the resulting values in the same way as the gold standard. The synthetic outputs are designed to produce trade-offs between evaluation metrics. For instance, a total displacement (s_tDisp^{r=1}) achieves the maximal Kendall correlation but the lowest Accuracy. On the contrary, 30% of random assignments (s_rand^{r=0.3}) can substantially degrade the ordinal relationships while keeping 70% Accuracy. Also, s_rand^{r=0.3} outperforms s_prox^{r=0.5} in terms of accuracy, but not necessarily in terms of error minimization metrics. Finally, s_rand^{r=0.3} can be outperformed by s_maj^{r=0.4}, given that the second system assigns documents to the majority class, but not in terms of MI, which accounts for the imbalance effect.

Table 3 (left part) shows the results. The metric coverage can vary substantially when changing the distribution of systems. For this reason, we first consider every synthetic output and then we repeat the experiment removing each of the system types. As the table shows, CEM_ORD outperforms all other metrics, including the individual metrics used as reference via UIR (MI, Kendall, and Accuracy). Note that the flat (non-logarithmic) version CEM_ORD^flat performs systematically worse than the original metric, which supports the use of the logarithmic, information-theoretic formula to compute similarity.
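The gold-label generation described above can be sketched as follows (a simplified illustration with hypothetical function names; Python's random.gauss plays the role of the normal distribution):

```python
import random

def synthetic_gold(n_docs=200, n_classes=11, mean=4.0, std=2.0, seed=0):
    """Draw real-valued scores from a normal distribution and discretize
    them into the closest category in {1, ..., n_classes}."""
    rng = random.Random(seed)
    return [min(n_classes, max(1, round(rng.gauss(mean, std))))
            for _ in range(n_docs)]

# A small deviation yields a strongly imbalanced distribution
# concentrated around the fourth class
gold = synthetic_gold(std=1.0)
print(len(gold), min(gold), max(gold))
```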

Experiments on NLP shared tasks
Let us now study how metrics behave with actual data from evaluation campaigns, where we cannot control the amount and types of error. We use data from six OC evaluation campaigns for which system outputs are publicly available.
The first data set comes from the Replab 2013 reputational polarity task (Amigó et al., 2013a). It consists of 61 companies with 1,500 tweets each; tweets are annotated as positive, negative, or neutral for the company's reputation.
All the other five datasets are sentiment analysis subtasks from SemEval for which system outputs are available online: SemEval-2015 task 10A (1680 samples, 13 systems), task 10B (8985 samples, 51 systems) and task 10C (3097 samples, 11 systems) (Rosenthal et al., 2015); and SemEval-2014, tasks 9A (2392 samples, 48 systems) and 9B (2396 samples, 7 systems). All these tasks contain three categories. Given that the SemEval tasks do not distribute samples in test cases, we emulate 10 test cases by randomly dividing the datasets into 10 partitions in order to compute UIR.

Table 3 (right part) shows the results. CEM_ORD is the top performer in four datasets, and the second best (with a minimal difference of 0.01 with respect to the best metric) in the other two. The non-logarithmic version of CEM_ORD is, again, worse than the logarithmic version in all cases except one (SemEval-2014 task 9A, where both give the same result).
Some metrics are able to achieve a high coverage in some data sets, but not in a consistent manner. For instance, Kappa maximizes the coverage in the last dataset in the table, but achieves an extremely low result for RepLab. In general, the table also shows that the relative coverage performance of metrics varies depending on the dataset characteristics.
Finally, we also computed metric robustness in terms of the Spearman correlation between the system rankings produced by each metric across topic (or dataset partition) pairs in the campaigns. The highest robustness (0.57) is achieved by CEM_ORD, Accuracy and the F-measure; the lowest (0.49) is achieved by Accuracy at 1 and Macro Average MAE. CEM_ORD is more robust than its non-logarithmic version CEM_ORD^flat (0.57 vs 0.55), again supporting the use of the information-theoretic logarithmic formula.

Conclusions
Our findings can be summarized as follows: (i) metrics commonly used for Ordinal Classification problems are highly heterogeneous and, in general, inconsistent with the notion of ordinal scale in Measurement Theory; (ii) the notion of closeness between classes can be modelled in terms of Measurement Theory and Information Theory and particularized for different scales; and (iii) our proposed Ordinal Closeness Evaluation Measure (CEM_ORD) is the only metric that satisfies all the desirable formal properties, is as robust as the best state-of-the-art metrics, and is the one that best captures the different quality aspects of OC problems in our experimentation, with both synthetic and naturalistic datasets.
From a methodological perspective, the evidence that we have presented covers the four approaches pointed out in Amigó et al. (2018): we have compared metrics in terms of desirable formal properties to be satisfied (theoretic top-down), we have generalized existing approaches (theoretic bottom-up), and we have compared effectiveness on human-assessed and on synthetic data (empirical bottom-up and top-down). Future work includes the application of CEM at scales other than the ordinal one.
Code to compute CEM will be available at github.com/EvALLTEAM/EvALLToolkit.

Appendix A. Example computation of CEM

Figure 3 illustrates the computation of CEM for two systems (A and B) on the same ground truth with the three usual classes in sentiment analysis: negative, neutral, positive. The ground truth distribution is 10, 60 and 30 items, respectively, which is all the information needed to compute proximity between classes. Note that the proximity of one class with respect to another is −log of the number of items that lie between them (including all items in the ground-truth class and half of the items in the system-predicted class) divided by the total number of items. The lowest score corresponds to the proximity between the two extreme classes (in the example, the negative and positive classes), because all items except half of the items in the system-predicted class lie between them, and therefore the −log value is minimal. System A and System B in the figure both have the same accuracy (0.70), but system B receives a higher CEM_ORD score (0.76 vs 0.71). The main reason is that system A makes more mistakes between distant classes (positive and negative). Another reason is that system A makes more positive/neutral than negative/neutral mistakes; and positive/neutral errors are more penalized by the metric than negative/neutral ones. The reason is that, together, the positive and neutral classes represent 90% of the items in the dataset, and are therefore considered less close from an information-theoretic point of view.
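Using the ground-truth distribution of Figure 3 (10/60/30 items), the pairwise proximities can be tabulated with the same illustrative formula as in Section 3 (log base 2; our sketch, not the official toolkit):

```python
import math

def prox(counts, i, j):
    # -log2 of the fraction of items between the system class i (half of it)
    # and the gold class j (all of it, plus the classes strictly in between)
    lo, hi = min(i, j), max(i, j)
    between = counts[i] / 2 + sum(counts[k] for k in range(lo, hi + 1) if k != i)
    return -math.log2(between / sum(counts))

counts = [10, 60, 30]  # negative, neutral, positive
classes = ["neg", "neu", "pos"]
for i, row_name in enumerate(classes):
    print(row_name, [round(prox(counts, i, j), 2) for j in range(3)])
```

The minimum of the table is the proximity between the extreme classes (system negative vs gold positive), as noted above, because 95 of the 100 items lie between them.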