A method for in-depth comparative evaluation: How (dis)similar are outputs of POS taggers, dependency parsers and coreference resolvers really?

This paper proposes a generic method for the comparative evaluation of system outputs. The approach is able to quantify the pairwise differences between two outputs and to unravel in detail what the differences consist of. We apply our approach to three tasks in Computational Linguistics, i.e. POS tagging, dependency parsing, and coreference resolution. We find that system outputs are more distinct than the (often) small differences in evaluation scores seem to suggest.


Introduction
While there exist well-defined procedures for evaluating system outputs against manually annotated gold data for many tasks in Computational Linguistics, generally little effort goes into identifying and analysing the differences between the outputs themselves. System outputs are usually compared as follows: the standard evaluation protocol for many tasks compares a system output (the response) to a manual annotation of the same data (the key). The difference between response and key is quantified by a similarity metric such as accuracy, and different system outputs are compared to each other by ranking their scores with respect to this metric.
However, comparing the scores of the similarity metric does not paint the full picture of the differences between the outputs, as we will demonstrate. There are hardly any principled or generic evaluation approaches that aim at comparing two or more system responses directly to investigate, highlight, and quantify their differences in detail. Closing this gap is desirable, because progress in many NLP tasks is made in small steps, and it is often left unclear what the specific contribution of a novel approach is if the comparison to related work is solely based on a (sometimes marginal) improvement in F1 score or accuracy. Furthermore, an overall improvement in accuracy achieved by a new approach might come at the cost of failing in some areas where a baseline system was correct. Vice versa, a new approach might not improve overall accuracy, but solve particular problems that no other system has been able to address.
We propose an evaluation approach which aims at shedding light on the particular differences between system responses and which is intended as a complement to evaluation metrics such as F1 score and accuracy. In doing so, we strive to provide researchers with a tool that gives insight into the particular strengths and weaknesses of their system in comparison to others. Our method is also useful in iterative system development, as it tracks changes in the outputs of different system versions or feature sets. Furthermore, our approach can compare multiple system outputs at once, which enables it to identify hard (or easy) problem areas by assessing how many of the systems solve a problem correctly, and to derive corresponding upper bounds for system ensembles. The performance difference between the simulated ensemble and the individual systems serves as an additional indicator of the difference between the system outputs.
We exemplify the application of our approach by asking how (dis)similar the outputs of several state-of-the-art systems for different NLP tasks really are. We first motivate why evaluation metrics such as accuracy are not suitable for comparing outputs (next section). We then propose a method which introduces an inventory to systematically classify and quantify output differences (section 2). Next, we demonstrate how combining a set of outputs can be used to identify their divergence and to locate hard (and easy) problem areas by looking at upper bounds in performance achieved by an oracle output combination (section 3).

Motivation
First, let us motivate why comparing accuracy or F1 scores is not a suitable method for establishing the (dis)similarity of system outputs. Consider a simple synthetic problem set with four test cases {A, B, C, D} (e.g. a sequence of POS tags). A system response S1 correctly solves cases A and B, while a system response S2 returns the correct answers for cases C and D. In terms of accuracy, both responses achieve identical scores, i.e. 50%. However, their outputs are maximally dissimilar. Extending the set of cases, assume five problems {A, B, C, D, E} and three responses S1, S2, and S3 as shown in table 1. Although the three responses achieve the same accuracy (left table), their pairwise overlap in terms of identical correct responses (right table) varies considerably, i.e. S1 is much more similar to S2 (two shared answers) than to S3 (one shared answer). In fact, establishing the similarity of the responses S1, S2, and S3 is more complicated still, because we have left out the overlap of the incorrect answers in the responses. Consider the full responses in table 2. The overlap metric (right table) now compares how many of the cells in two rows have identical answers, regardless of whether the answer is correct. The overlap-based similarities between the systems have become more diverse, i.e. S1 and S3 are more dissimilar than in the previous table, and the similarities of the pairs (S1, S2) and (S2, S3) are now distinct, because S2 and S3 share the error Z (besides the correct answers C and D), while S1 and S2 do not share an error.
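The accuracy-versus-overlap contrast can be reproduced in a few lines of Python. This is a toy sketch with made-up responses (not the exact values of tables 1 and 2); the function names are ours:

```python
def accuracy(response, key):
    """Fraction of cases answered correctly."""
    return sum(r == k for r, k in zip(response, key)) / len(key)

def overlap(r1, r2):
    """Fraction of cases where two responses give the same answer,
    regardless of correctness."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

key = ["a", "b", "c", "d", "e"]   # gold answers
s1  = ["a", "b", "c", "X", "Y"]   # correct on a, b, c
s2  = ["Z", "b", "c", "d", "Y"]   # correct on b, c, d; shares error Y with s1
s3  = ["a", "W", "V", "d", "e"]   # correct on a, d, e

# Identical accuracy for all three responses...
print([accuracy(s, key) for s in (s1, s2, s3)])
# ...but very different pairwise overlaps: s1 is far closer to s2 than to s3.
print(overlap(s1, s2), overlap(s1, s3), overlap(s2, s3))
```

Here all three responses score 60% accuracy, yet s1 agrees with s2 on three of five cases and with s3 on only one, illustrating why the accuracy ranking alone says nothing about output similarity.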
Hence, evaluating systems based on performance metrics such as accuracy and F1 scores provides no insight into the differences between the systems and is not able to accurately quantify the similarities between them. That is, a small difference in accuracy does not necessarily imply a high similarity of the outputs, and, vice versa, a larger difference in accuracy does not necessarily signify vastly dissimilar outputs.
Moreover, evaluation based on scores in performance metrics such as F1 does not detail in what regard a system performs better than another. Two systems might implement very distinct approaches, but achieve very similar evaluation scores. Based on e.g. F1, we cannot assert whether a response S2 performs better than a response S1 because a) it solves the same problems as S1 plus some additional ones, or b) S2 and S1 solve quite diverse sets of problems and S2 happens to solve a few more in its area of expertise. Additionally, a system that performs better than a baseline is bound to make errors where the baseline was correct. The overall accuracies cannot tell us how often this is the case.
In summary, the comparison of systems based on overall performance scores only lets us glimpse the proverbial tip of the iceberg. Therefore, our approach to comparative evaluation features three main points of interest: 1. How can the differences between system responses be quantified? 2. What is the nature of the difference between two responses? 3. How divergent is a set of responses, i.e. how complementary are they?
We try to answer these questions regarding three main tasks, namely POS tagging, dependency parsing, and coreference resolution. We select these tasks because they are fairly widespread procedures in Computational Linguistics and their evaluation increases in complexity. While we limit ourselves to these, we believe our approach to be generic enough to be applied to other labeling problems, such as named entity recognition and semantic role labeling.

Quantifying differences in system responses
As argued above, system responses differ both in the correct answers they give and in the errors they make. The underlying idea of our approach is to assess how many of the labeled linguistic units (i.e. tokens) in the key have different labels in the responses, regardless of whether the labels are correct. In a second step, we use a class inventory to analyse and quantify these differences in more detail. Formally, given a set of tokens T and two accompanying system responses S1 and S2, we quantify how many of the tokens t_i ∈ T have a different label in S1 and S2:

diff(S1, S2 | T) = |{t_i ∈ T : label(t_i, S1) ≠ label(t_i, S2)}| / |T|    (1)

Note that switching the inequality condition (≠) to equality (=) actually yields the accuracy metric. That is, taking S1 as the key and S2 as the response and calculating accuracy produces the inverse of our metric, i.e. 1 − diff(S1, S2 | T), since accuracy is the ratio of tokens that have identical labels. The question is then: why not simply use S1 as the key and S2 as the response and calculate accuracy? While this answers whether two systems solve a similar or diverse set of problems, it does not enable us to identify the sources of the differences that drive the better performance of one response over the other. That is, if a token has a different label in S1 and S2, we cannot tell which of the responses, if any, is correct. Hence, we need to look at the gold labels of the tokens T in a key K. This enables us to categorise differences in the outputs into three distinct and informative classes:
• Correction: S1 labels a token incorrectly, S2 corrects this error
• New error: S1 is correct, S2 introduces an error
• Changed error: Both S1 and S2 are incorrect but have different labels
The general algorithm for quantifying the differences in two responses S1 and S2 given tokens t_1...t_n in a key K is outlined in algorithm 1.
This procedure lets us track and count how often S 2 has a different label than S 1 , classify the difference, and calculate the percentage of each class of difference. The approach can be applied straightforwardly to comparing outputs of POS taggers and dependency parsers.

Algorithm 1 Track differences in two responses
for each token t_i in K do
  L1 ← label(t_i, S1); L2 ← label(t_i, S2); G ← label(t_i, K)
  if L1 ≠ L2 then
    if L2 = G then count as correction
    else if L1 = G then count as new error
    else count as changed error
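The difference-classification procedure can be sketched in Python as follows (function and variable names are our own, not from the original implementation):

```python
def classify_differences(key, s1, s2):
    """Classify per-token differences between two system responses s1 and s2
    against the gold labels in key, following the three-class inventory:
    correction, new error, changed error. Returns the diff ratio and the
    per-class counts."""
    counts = {"correction": 0, "new_error": 0, "changed_error": 0}
    for gold, l1, l2 in zip(key, s1, s2):
        if l1 == l2:
            continue  # identical labels: no difference to classify
        if l2 == gold:
            counts["correction"] += 1     # s2 fixes an s1 error
        elif l1 == gold:
            counts["new_error"] += 1      # s2 breaks a correct s1 label
        else:
            counts["changed_error"] += 1  # both wrong, but differently
    diff_ratio = sum(counts.values()) / len(key) if key else 0.0
    return diff_ratio, counts
```

The diff ratio equals the metric in equation 1, and the counts allow reporting the percentage of each difference class, as done in the task sections below.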

POS tagging
We compare three POS taggers that can be used off-the-shelf to tag German: the Stanford POS Tagger (Toutanova et al., 2003), the TreeTagger (Schmid, 1995), and the Clevertagger (Sennrich et al., 2013, state-of-the-art). Following Sennrich et al. (2013), we use 3000 sentences from the Tüba-D/Z (Telljohann et al., 2004), a corpus of articles from a German newspaper, as a test set. Table 3 shows the labeling accuracy of the POS taggers and the percentage of correctly tagged sentences. The accuracy improvement of Clevertagger over TreeTagger is +1.27 points, and the percentage of correctly tagged sentences increases substantially (+9.9 points). In comparison to the Stanford tagger, Clevertagger raises performance by roughly 6 points in accuracy and almost doubles the number of correctly tagged sentences.
In the lower table, we see that although the accuracy difference puts the Stanford tagger closer to the TreeTagger (4.48) than to the Clevertagger (5.75), the Stanford tagger's response is more different from the one of TreeTagger (11.06) than from the response of Clevertagger (9.41). Comparing the two best performing taggers, we see that despite their accuracy difference of only 1.27 points, they label 4.96% of the tokens differently.
To get a more detailed understanding of the differences, we apply algorithm 1 to the two outputs; the results are shown in table 4, listing the five most frequent changes per difference class. Of the 4.96% different labels in Clevertagger compared to TreeTagger, 58.71% are corrections, 33.13% are new errors, and 8.15% are changed errors. That is, one third of the changes that Clevertagger introduces are errors. This is a noteworthy observation which applies to all our system comparisons: every improved response introduces a considerable amount of errors with respect to the baseline, i.e. it invalidates correct decisions of the baseline. While this observation is to some degree expected, our method is able to quantify and analyse such changes in detail.
Regarding the differences, we see that both the most frequent correction (NN→NE) and the most frequent new error (NE→NN) revolve around the confusion of named entities and common nouns.


Dependency parsing
In table 5, we report the unlabeled attachment score (UAS), the labeling score (LS), and the labeled attachment score (LAS) for the parsers. Furthermore, we evaluate how many of the sentences are fully parsed correctly under each criterion.
We see that Parsey outperforms the Stanford parsers mainly due to its performance in attachment (UAS). The performance differences in assigning grammatical labels (LS) are comparatively marginal. Parsey also features almost identical performance in attaching and in labeling tokens. However, there is a gap to its labeled attachment score, which indicates that although Parsey attaches more tokens correctly than the other parsers, it does not necessarily assign the correct grammatical label to these tokens. Looking at the difference chart, we see that despite the rather small differences in LAS (1-4 points), the parsers attach and label around 15% of the tokens differently. The Stanford parsers differ by only 1.07 points in LAS, but this difference is based on 14.01% (diff) of the tokens in the test set. Parsey outperforms the Stanford NN parser by 2.51 LAS based on 13.62% of the tokens. To gain a better understanding of the differences contained in these 13.62% of the tokens, we apply algorithm 1, whose output is shown in table 6.
The table shows that half (50.22%) of the 13.62% changed token annotations from Stanford NN to Parsey are corrections. All of these changes are attachment corrections, i.e. the labels of the tokens are not changed, which correlates with the small difference we saw in the labeling score.

Coreference resolution
The final task we investigate is coreference resolution. We choose three freely available systems for English, again due to the lack of available systems for other languages: the Stanford statistical coreference resolver (Clark and Manning, 2015, state-of-the-art), HOTCoref, and the Berkeley coreference system (Durrett and Klein). We use the CoNLL 2012 shared task test set (Pradhan et al., 2012). The coreference task differs from the previous two, since not all tokens in a document partake in coreference relations (whereas all tokens are in syntactic relations and carry a POS tag). Furthermore, the linguistic units of coreference relations are not single word tokens, but syntactic units called mentions (i.e. mostly noun phrases). Therefore, we have to adapt our similarity metric in equation 1. To quantify the difference of two coreference system outputs S1 and S2 given a key K, we count how many of the mentions m are classified differently using a mention classification function c:

diff(S1, S2 | K) = |{m ∈ K : c(m, K, S1) ≠ c(m, K, S2)}| / |K|    (2)

The mention classification function c requires a class inventory, which is not featured by the common evaluation metrics for coreference resolution. Therefore, we adapt the mention classification paradigm introduced in the ARCS framework for coreference resolution evaluation (Tuggener, 2014), which assigns one of four classes (true positive, false positive, false negative, wrong linkage) to a mention m given a key K and a system response S. However, one issue with ARCS is determining a criterion for the TP class, i.e. under what circumstances m is regarded as resolved correctly. Tuggener (2014) proposed to determine correct antecedents based on the requirements of prospective downstream applications. We implement one loose criterion and regard m as correctly resolved if any of its antecedents in S is also an antecedent of m in K. Conversely, if none of the antecedents of m overlap in S and K, we label m as WL. This yields the ARCS_any metric.
Alternatively, we require that the closest preceding nominal antecedent of m in S is also an antecedent of m in K, which yields the ARCS_nom metric. This metric is more conservative in assigning the TP class, but implements a more realistic criterion for correct antecedents from the perspective of downstream applications: machine translation requires pronouns to be linked to nominal antecedents, sentiment analysis needs named entity antecedents (if available), etc. Note that the common coreference metrics analyse either the links between mentions or calculate a percentage of overlapping mentions in coreference chains in the key and a response; they are not able to determine whether a given mention m is resolved correctly or to assign a class to it.
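The mention-level diff of equation 2 and the loose ARCS_any criterion can be illustrated with a small sketch. The function names and the data layout (antecedents as sets, classifications as dicts keyed by mention id) are our own assumptions, not the authors' implementation:

```python
def arcs_any(key_antecedents, resp_antecedents):
    """Classify one mention under the loose ARCS_any criterion.
    key_antecedents:  set of gold antecedents of the mention (empty if the
                      mention is not coreferent in the key)
    resp_antecedents: set of antecedents the system assigned (empty if the
                      system did not resolve the mention)"""
    if not resp_antecedents:
        return "fn"  # coreferent in the key, but left unresolved
    if not key_antecedents:
        return "fp"  # resolved, but not coreferent in the key
    if resp_antecedents & key_antecedents:
        return "tp"  # at least one response antecedent is in the key
    return "wl"      # resolved, but linked only to wrong antecedents

def mention_diff(classes_s1, classes_s2):
    """Equation 2: fraction of mentions classified differently by the two
    responses. classes_s1/classes_s2 map mention ids to ARCS classes."""
    mentions = classes_s1.keys() | classes_s2.keys()
    differing = sum(classes_s1.get(m) != classes_s2.get(m) for m in mentions)
    return differing / len(mentions)
```

A stricter ARCS_nom variant would replace the set intersection with a check on the closest preceding nominal antecedent only.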
The official CoNLL score MELA (the average of MUC, CEAFE, and BCUB) and the recently proposed LEA metric (Moosavi and Strube, 2016), which addresses several issues of the other metrics, as well as the ARCS scores, are given in table 7. Using the ARCS class inventory and equation 2, we quantify how many of the mentions are classified differently in the system responses. The F1 scores are lowest for the LEA metric, because it gives more weight to errors involving longer coreference chains. The ARCS_any metric assigns the highest scores due to the loose criterion that any antecedent is correct as long as it is in the key chain of a given mention. Furthermore, all the metrics agree on the ranking of the systems.
The mention-based differences between the systems are considerably larger than the relatively small differences in F1 scores suggest. The Stanford system outperforms HOTCoref by 2.3 MELA, 3.79 LEA F1, and 3.03 ARCS_any F1, but the systems process one fourth (26.30%) of the mentions differently in the ARCS_any setting. For the ARCS_nom criterion, the differences are even larger. The Stanford system outperforms the Berkeley system by 2.91 ARCS_nom F1, but the systems process 35.39% of the mentions differently. Furthermore, we observe that the differences in F1 (∆F1) do not correlate with the differences between the outputs (diff) for either ARCS metric. Given ARCS_nom, we see that the smallest difference in F1 (Stanford ↔ HOTCoref: 0.09) actually occurs between the two responses that the diff metric deems most dissimilar (37.45).
Finally, we apply algorithm 1, using the ARCS_nom criterion and our mention classification scheme, to the two best performing systems, i.e. HOTCoref and Stanford. Results are given in table 8. We see that less than 50% of the changes that the Stanford system introduces are corrections (44.62%). But this percentage is still higher than that of the newly introduced errors (41.65%); hence the improvement in overall F1. Furthermore, the most frequent change is the correction of wrong linkages to true positives (wl → tp). The most frequent new error also involves true mentions, i.e. attaching correctly resolved mentions to incorrect antecedents (tp → wl). Recovering false negatives and turning true positives into false negatives occur roughly equally often. Hence, the performance difference stems mainly from attaching anaphoric mentions to (nominal) antecedents, rather than from deciding which mentions to resolve, which are two subproblems in coreference resolution.

System combination
Lastly, we combine the system outputs per task and calculate the upper bounds for perfect system combinations by deeming a token labeled correctly if at least one of the systems provides the correct label. The upper bounds are intended to be another measure of the (dis)similarity of the outputs: the higher the upper bound, the higher the divergence of the outputs. Furthermore, looking at per-label performance of all systems, we can identify labels with low scores but high upper bounds, which is an interesting starting point for future work.
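For accuracy-style tasks, the oracle combination can be sketched as follows (a simplified token-level illustration under our own naming):

```python
def oracle_accuracy(key, responses):
    """Upper bound for a perfect system combination: a token counts as
    correctly labeled if at least one response provides the gold label."""
    correct = sum(
        any(resp[i] == gold for resp in responses)
        for i, gold in enumerate(key)
    )
    return correct / len(key)

key = ["NN", "NE", "ART", "VVFIN"]
r1  = ["NN", "NN", "ART", "VVINF"]  # correct on 2 of 4 tokens
r2  = ["VB", "NE", "ADV", "VVINF"]  # correct on 1 of 4 tokens
# The oracle reaches 3/4: only the last token is missed by both systems.
print(oracle_accuracy(key, [r1, r2]))
```

The gap between the oracle and the best individual system quantifies how complementary the outputs are: if the responses were identical, the oracle would add nothing.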

POS tagging
We start with the POS tagging task and present the upper bound of the system combination in table 9. We also indicate the accuracy gains for the ten most frequent POS tags relative to the best performing tagger (Clevertagger). The Stanford tagger, despite having the lowest overall accuracy, achieves the highest accuracy on named entities (NE), while the TreeTagger struggles particularly in this category. The TreeTagger surpasses the other taggers on finite verbs (VVFIN) by a wide margin, and on finite auxiliary verbs (VAFIN). Clevertagger performs best overall, but, interestingly, it only achieves the highest accuracy on three of the ten most frequent POS tags. Looking at the overall upper bound, we see that it more than halves the error rate of the best performing system and is near 99% accuracy. The POS tag that profits most from the combination is named entities. Interestingly, all the taggers have low accuracy on this tag, but the upper bound of the combination raises it drastically. Hence, it seems that the taggers diverge mostly here, which correlates with our analysis of the difference between the two best performing systems in table 4.

Dependency parsing
Next, we analyse the upper bounds of the combination of the dependency parsers, given in table 10. In contrast to the POS tagging task, we find that the best performing system, Parsey (P-MP), achieves the highest LAS for almost all considered labels. Still, its overall LAS is drastically increased by the upper bound (+5.99) of the perfect system combination. Two of the labels that benefit most from the combination are amod (adjectival modifier), which is often confused with nn (noun compound modifier) as we saw in table 6, and advmod (adverbial modifier). All parsers are below 90 LAS for these labels, but the combination raises performance to 95.26 and 91.40, respectively. Furthermore, prepositions (prep) gain considerably in LAS in the combination. We observed in table 6 that almost ten percent of the difference between Parsey and the Stanford NN parser stems from correcting attachments of prepositions. However, more than seven percent of the difference also stems from invalidating correctly attached prepositions in the Stanford NN output. The large performance jump in the combination of the systems is further evidence that the parsers are highly complementary with respect to prepositions.

Coreference resolution
For the coreference task, it is not trivial to calculate the F1 upper bound of the response combination, as the systems do not feature the same mentions in their outputs, and disentangling the false positives is a cumbersome undertaking. (The systems have to decide which NPs they consider for coreference resolution, i.e. the anaphoricity detection problem. The mentions are not known beforehand, and the systems will hallucinate different incorrect ones.) Therefore, we limit our investigation to the gold mentions in the key and count for how many of them at least one of the responses produces a correct nominal antecedent, which yields the upper bound for ARCS_nom recall. To gain a deeper insight into the benefits of the combination and the performance of the systems, we divide the mentions into nouns (named entities and common nouns), personal pronouns (PRP), and possessive pronouns (PRP$). Results are given in table 11. The system with the best overall recall also features the highest recall for all mention types. (Note that the HOTCoref system has better recall than the Stanford system, but the Stanford system features better precision, which leads to a higher F1 score in table 7.)

Related work
One way to establish the difference between two system outputs is to apply statistical significance tests. However, there is generally little agreement on which test to use, and it is often not trivial to verify whether all criteria for applying a specific test to a given data set are met (Yeh, 2000). Furthermore, significance tests provide no insight into the nature of the differences between two outputs. Several survey papers have analysed the performance of state-of-the-art tools for POS tagging (Volk and Schneider, 1998; Giesbrecht and Evert, 2009; Horsmann et al., 2015) or dependency parsing (McDonald and Nivre, 2007). While these surveys provide performance results along different axes (accuracy, time, domain, frequent errors), they do not analyse the particular differences between the system responses on the token level and hence do not provide a (dis)similarity rating of the responses. Regarding dependency parsing, our work is most closely related to McDonald and Nivre (2007) and Seddah et al. (2013). Both papers analyse the performance of parsers with respect to several subproblems. McDonald and Nivre (2007) also performed output combination experiments to stress that the two parsers they investigated are complementary to a significant degree.
Comparative system evaluation in shared tasks is usually performed by pitting scores in evaluation metrics against each other, e.g. the CoNLL shared tasks on coreference (Pradhan et al., 2011; Pradhan et al., 2012) or on dependency parsing (Buchholz and Marsi, 2006; Nilsson et al., 2007). While the post-task evaluation of the CoNLL 2007 shared task included a system combination experiment which showed performance improvements, it generally remains unclear how similar the system outputs with (sometimes marginally) small differences in the evaluation metrics are.
Another branch of evaluation related to our work is error analysis. Gärtner et al. (2014) presented a tool to explore coreference errors visually, but it does not aggregate and classify them. Kummerfeld and Klein (2013) devised a set of error classes for coreference and analysed quantitatively which systems make which errors. Martschat and Strube (2014) presented an analysis and grouping of recall errors for coreference and evaluated a set of system responses. However, these analyses focus on the errors of one system at a time and then compare the overall error statistics, i.e. there is no direct linking or combination of the responses. Hence, we believe our approach to be complementary to the work outlined above.

Conclusion
We have presented a generic dissimilarity metric for system outputs and applied it to several systems for POS tagging, dependency parsing, and coreference resolution. We found that systems with marginal differences in accuracy scores or F1 actually have considerably distinct outputs. We combined system outputs and calculated upper bounds in performance as an additional measure of the degree of difference between the outputs.
We discussed and applied a method for analysing the specific differences between two system outputs using a class inventory to label and quantify the differences. Our analysis revealed the (often considerable) quantity of new errors that improvements introduce compared to baselines. We believe that this kind of analysis is also useful during system and method design, as it allows one to track all changes in the output when adjusting a system or a feature set.
While we have explored our approach on three core tasks in Computational Linguistics, we believe it to be applicable to other areas in the field. Our hope is that our method of comparative evaluation will motivate other researchers to gain an in-depth understanding of the outputs of their systems and what distinguishes them from others, beyond differences in accuracy or F1 scores.

A.2 Coreference resolution
Since the ARCS framework is relatively unknown and not widely used, we revisit the connection of our diff metric to accuracy and F1 outlined in section 2, in order to use one of the coreference metrics to establish the differences between the outputs. We saw that our metric is inversely equivalent to accuracy when taking one system response as the key and the other as the response. That is, we can calculate the diff ratio as

diff(S1, S2 | T) = 1 − |{t_i ∈ T : label(t_i, S1) = label(t_i, S2)}| / |T|,

which is equivalent to taking S1 as the key and S2 as the response (or vice versa). For the coreference task, we can thus use one response as the key and the other as the response. The resulting F1 score can then be used as an agreement value, which, however, does not provide any detailed analysis of the nature of the differences compared to the ARCS approach. Table 13 shows the F1 scores when using one response as the key and the second as the response. Note that switching the key and response roles yields the same F1 scores for two responses; the only effect is that the recall and precision values are swapped.
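The inverse relation between diff and agreement can be checked with a small sketch (hypothetical labels, our own function names):

```python
def diff_ratio(s1, s2):
    """diff(S1, S2 | T): fraction of tokens labeled differently."""
    return sum(a != b for a, b in zip(s1, s2)) / len(s1)

def accuracy(response, key):
    """Fraction of tokens whose labels match the key."""
    return sum(r == k for r, k in zip(response, key)) / len(key)

s1 = ["NN", "NE", "ART", "VVFIN"]
s2 = ["NN", "NN", "ART", "VAFIN"]

# Treating s1 as the key, accuracy is exactly the complement of diff.
assert diff_ratio(s1, s2) == 1 - accuracy(s2, s1)
```

The same complement holds for percentage-based scores, i.e. diff = 100 − score when the metric is expressed in percent.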
The table shows that this approach yields F1 scores that indicate quite high dissimilarities when turned into the diff metric, i.e. diff = 100 − F1.