Do dependency parsing metrics correlate with human judgments?

Using automatic measures such as labeled and unlabeled attachment scores is common practice in dependency parser evaluation. In this paper, we examine whether these measures correlate with human judgments of overall parse quality. We ask linguists with experience in dependency annotation to judge system outputs. We measure the correlation between their judgments and a range of parse evaluation metrics across five languages. The human-metric correlation is lower for dependency parsing than for other NLP tasks. Also, inter-annotator agreement is sometimes higher than the agreement between judgments and metrics, indicating that the standard metrics fail to capture certain aspects of parse quality, such as the relevance of root attachment or the relative importance of the different parts of speech.


Introduction
In dependency parser evaluation, the standard accuracy metrics, labeled and unlabeled attachment scores, are defined simply as averages over correct attachment decisions. Several authors have pointed out problems with these metrics: they are sensitive to annotation guidelines (Schwartz et al., 2012; Tsarfaty et al., 2011), and they fail to say anything about how parsers fare on rare but important linguistic constructions (Nivre et al., 2010). Both criticisms rely on the intuition that some parsing errors are more important than others, and that our metrics should somehow reflect that. There are sentences that are hard to annotate because they are ambiguous, or because they contain phenomena peripheral to linguistic theory, such as punctuation, clitics, or fragments. Manning (2011) discusses similar issues for part-of-speech tagging.
To measure the variable relevance of parsing errors, we present experiments with human judgment of parse output quality across five languages: Croatian, Danish, English, German, and Spanish. For the human judgments, we asked professional linguists with dependency annotation experience to judge which of two parsers produced the better parse. Our stance here is that, insofar as experts are able to annotate dependency trees, they are also able to determine the quality of a predicted syntactic structure, which we can in turn use to evaluate parser evaluation metrics. Downstream evaluation is critical in assessing the usefulness of parses, but it also presents non-trivial challenges in choosing the appropriate downstream tasks (Elming et al., 2013); we therefore see human judgments as an important supplement to extrinsic evaluation.
To the best of our knowledge, no prior study has analyzed the correlation between dependency parsing metrics and human judgments. For a range of other NLP tasks, metrics have been evaluated by how well they correlate with human judgments. For instance, the standard automatic metrics for certain tasks (such as BLEU in machine translation, or ROUGE-N and NIST in summarization or natural language generation) were evaluated, reaching correlation coefficients well above .80 (Papineni et al., 2002; Lin, 2004; Belz and Reiter, 2006; Callison-Burch et al., 2007).
We find that correlations between evaluation metrics and human judgments are weaker for dependency parsing than for other NLP tasks (our correlation coefficients are typically between .35 and .55) and that inter-annotator agreement is sometimes higher than human-metric agreement. Moreover, our analysis (§5) reveals that humans have a preference for attachment over labeling decisions, and that attachments closer to the root are more important. Our findings suggest that the currently employed metrics are not fully adequate.
Contributions We present i) a systematic comparison between a range of available dependency parsing metrics and their correlation with human judgments; and ii) a novel dataset of 984 sentences (up to 200 sentences for each of the 5 languages) annotated with human judgments for the preferred automatically parsed dependency tree, enabling further research in this direction.

Metrics
We evaluate seven dependency parsing metrics, described in this section.
Given a labeled gold tree $G = \langle V, E_G, l_G(\cdot) \rangle$ and a labeled predicted tree $P = \langle V, E_P, l_P(\cdot) \rangle$, let $E \subset V \times V$ be the set of directed edges from dependents to heads, and let $l : V \times V \to L$ be the edge labeling function, with $L$ the set of dependency labels.
We include two further metrics, namely labeled (LCP) and unlabeled (UCP) complete predications, to account for the relevance of correct predicate prediction for parsing quality.
LCP is inspired by the complete predicates metric from the SemEval 2015 shared task on semantic parsing (Oepen et al., 2015). LCP is triggered by a verb (i.e., a node in the set $V_{verb}$) and checks whether all its core arguments match, i.e., all outgoing dependency edges except for punctuation. Since LCP is a very strict metric, we also evaluate UCP, its unlabeled variant. Given a function $c_X(v)$ that retrieves the set of child nodes of a node $v$ from a tree $X$, we first define UCP as follows, and then incorporate the label matching for LCP:
[...]
To reach the final count of seven parsing metrics, we add to the previous five the neutral edge direction metric (NED) (Schwartz et al., 2011) and tree edit distance (TED) (Tsarfaty et al., 2011; Tsarfaty et al., 2012).
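The attachment-based and predicate-based metrics described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the tree encoding (a dictionary from dependent index to a head/label pair, with head 0 as an artificial root), the function names, and the toy sentence are all assumptions made for illustration. LA is assumed here to be plain label accuracy, and the UCP/LCP functions follow only the verbal description above (all non-punctuation children of each verb must match).

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

# Assumed encoding: dependent index -> (head index, label); head 0 is the root.
Tree = Dict[int, Tuple[int, str]]


def uas(gold: Tree, pred: Tree) -> float:
    """Unlabeled attachment score: share of tokens with the correct head."""
    return sum(pred[d][0] == gold[d][0] for d in gold) / len(gold)


def las(gold: Tree, pred: Tree) -> float:
    """Labeled attachment score: share of tokens with correct head and label."""
    return sum(pred[d] == gold[d] for d in gold) / len(gold)


def la(gold: Tree, pred: Tree) -> float:
    """Label accuracy (assumed reading of LA): share of tokens with correct label."""
    return sum(pred[d][1] == gold[d][1] for d in gold) / len(gold)


def children(tree: Tree, v: int, punct: FrozenSet[int]) -> Set[int]:
    """Child nodes of v in the tree, ignoring punctuation tokens."""
    return {d for d, (h, _) in tree.items() if h == v and d not in punct}


def ucp(gold: Tree, pred: Tree, verbs: Iterable[int],
        punct: FrozenSet[int] = frozenset()) -> float:
    """Unlabeled complete predications: share of verbs whose child sets match."""
    verbs = list(verbs)
    hits = sum(children(gold, v, punct) == children(pred, v, punct) for v in verbs)
    return hits / len(verbs)


def lcp(gold: Tree, pred: Tree, verbs: Iterable[int],
        punct: FrozenSet[int] = frozenset()) -> float:
    """Labeled complete predications: child sets and their labels must match."""
    def labeled(tree: Tree, v: int) -> Set[Tuple[int, str]]:
        return {(d, tree[d][1]) for d in children(tree, v, punct)}

    verbs = list(verbs)
    hits = sum(labeled(gold, v) == labeled(pred, v) for v in verbs)
    return hits / len(verbs)


# Toy sentence "She reads books ." (hypothetical data).
gold = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "dobj"), 4: (2, "punct")}
pred = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "nmod"), 4: (3, "punct")}
verbs, punct = [2], frozenset({4})

print(uas(gold, pred), las(gold, pred), la(gold, pred))          # 0.75 0.5 0.75
print(ucp(gold, pred, verbs, punct), lcp(gold, pred, verbs, punct))  # 1.0 0.0
```

In the toy example the predicted parse attaches every argument of the verb correctly, so UCP is 1.0, but one argument label is wrong, so the stricter LCP drops to 0.0.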

Experiment
In our analysis, we compare the metrics with human judgments. We examine how well the automatic metrics correlate with each other, as well as with human judgments, and whether inter-annotator agreement exceeds annotator-metric agreement.
Data In our experiments we use data from five languages: the English (en), German (de) and Spanish (es) treebanks from the Universal Dependencies (UD v1.0) project (Nivre et al., 2015), the Copenhagen Dependency Treebank (da) (Buch-Kromann, 2003), and the Croatian Dependency Treebank (hr) (Agić and Merkler, 2013). [...] dependency parsers do not agree on the correct analysis, after removing punctuation. We do not control for predicted trees matching the gold standard.
Annotation task A total of 7 annotators were involved in the annotation task. All the annotators are either native or fluent speakers, and well versed in dependency syntax analysis.
For each language, we present the selected 200 sentences with their two predicted dependency structures to 2-4 annotators and ask them to judge which of the two parses is better. They see graphical representations of the two dependency structures, visualized with the What's Wrong With My NLP? tool. The annotators were not informed of which parser produced which tree, nor did they have access to the gold standard. The dataset of 984 sentences is available at: https://bitbucket.org/lowlands/release (folder CoNLL2015).

Results
First, we perform a standard evaluation in order to see how the parsers fare, using our range of dependency evaluation measures. In addition, we compute correlations between metrics to assess their similarity. Finally, we correlate the measures with human judgments, and compare average annotator and human-system agreements.
Table 2 presents the parsing performances with respect to the set of metrics. We see that using LAS, Malt performs better on English, while MST performs better on the remaining four languages.
[...] correlated, e.g., LAS and LA, and UAS and NED, but some exhibit very low correlation coefficients.
Next we study correlations with human judgments (Table 4). In order to aggregate over the annotations, we use an item-response model (Hovy et al., 2013). The correlations are relatively weak compared to similar findings for other NLP tasks. For instance, ROUGE-1 (Lin, 2004) correlates strongly with perceived summary quality, with a coefficient of 0.99. The same holds for BLEU and human judgments of machine translation quality (Papineni et al., 2002).
We find that, overall, LAS is the metric that correlates best with human judgments. It is closely followed by UAS, which does not differ significantly from LAS, although the correlations for UAS are slightly lower on average. NED is in turn highly correlated with UAS. The correlations for the predicate-based measures (LCP, UCP) are the lowest, presumably because they are too strict and very different from LAS.
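As an illustration of how a metric can be correlated with pairwise human preferences, the sketch below computes a rank correlation (Spearman's ρ, implemented as Pearson correlation over average ranks) between per-sentence metric differences and preferences. This is only one plausible setup and not necessarily the procedure used here: the encoding of preferences as +1/-1, the use of the score difference between the two parses, and the numbers are all hypothetical, and the item-response aggregation step is skipped by assuming one aggregated preference per sentence.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: metric(parse A) - metric(parse B) per sentence, and the
# aggregated human preference (+1 if parse A was preferred, -1 if parse B).
las_diff = [0.10, -0.05, 0.20, -0.15, 0.02]
human_pref = [1, -1, 1, -1, -1]
rho = spearman(las_diff, human_pref)
print(round(rho, 3))
```

Because preferences are binary, many ranks are tied; averaging tied ranks keeps the coefficient well defined in that case.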
Motivated by the fact that people prefer the parse that gets the overall structure right (§5), we experimented with weighting edges proportionally to their log-distance to root. However, the signal was fairly weak; the correlations were only slightly higher for English and Danish: .552 and .338, respectively.
Finally, we compare the mean agreement between humans with the mean agreement between humans and standard metrics, cf. Table 5. For two languages (English and Croatian), humans agree more with each other than with the standard metrics, suggesting that metrics are not fully adequate. The mean agreement between humans is .728 for English, with slightly lower scores for the metrics (LAS: .715, UAS: .705, NED: .660). The gap between inter-annotator agreement and human-metric agreement was larger for Croatian: .80 vs. .755. For Danish, German and Spanish, however, average agreement between metrics and human judgments is higher than our inter-annotator agreement.
Table 5: Average mean agreement between annotators, and between annotators and metrics.
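The comparison between inter-annotator agreement and annotator-metric agreement can be sketched as follows. The sketch assumes that agreement is the simple proportion of sentences on which two raters pick the same parse, and that a metric "votes" for the parse with the higher score; the function names, the tie handling, and the data are hypothetical, and the exact agreement measure behind the reported figures may differ.

```python
from itertools import combinations
from typing import List, Sequence


def agreement(a: Sequence[str], b: Sequence[str]) -> float:
    """Share of items on which two raters make the same choice."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def mean_inter_annotator(annotations: Sequence[Sequence[str]]) -> float:
    """Mean pairwise agreement over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


def mean_annotator_metric(annotations: Sequence[Sequence[str]],
                          metric_choice: Sequence[str]) -> float:
    """Mean agreement between each annotator and the metric's choice."""
    return sum(agreement(a, metric_choice) for a in annotations) / len(annotations)


def metric_preference(scores_a: Sequence[float],
                      scores_b: Sequence[float]) -> List[str]:
    """Parse preferred by a metric; ties get a label no annotator uses (assumption)."""
    return ["A" if sa > sb else "B" if sb > sa else "tie"
            for sa, sb in zip(scores_a, scores_b)]


# Hypothetical judgments from three annotators on four sentences.
annotators = [["A", "B", "A", "A"],
              ["A", "B", "B", "A"],
              ["A", "A", "A", "A"]]
# Hypothetical per-sentence LAS for the two candidate parses.
las_votes = metric_preference([0.9, 0.6, 0.8, 0.5], [0.7, 0.8, 0.6, 0.7])

human_human = mean_inter_annotator(annotators)
human_metric = mean_annotator_metric(annotators, las_votes)
print(round(human_human, 3), round(human_metric, 3))  # 0.667 0.583
```

In this toy case the annotators agree more with each other than with the metric, which is the pattern reported above for English and Croatian.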

Analysis
In sum, our experiments show that metrics correlate relatively weakly with human judgments, suggesting that some errors are more important to humans than others, and that the relevance of these errors is not captured by the metrics.
To better understand this, we first consider the POS-wise correlations between human judgments and LAS, cf. Table 6. In English, for example, the correlation between judgments and LAS is significantly stronger for content words, i.e., those tagged as ADJ, NOUN, PROPN, or VERB ($\rho_c = 0.522$), than for function words ($\rho_f = 0.175$). This also holds for the other UD languages, namely German ($\rho_c = 0.423$ vs. $\rho_f = 0.263$) and Spanish ($\rho_c = 0.403$ vs. $\rho_f = 0.228$). This is not the case for the non-UD languages, Croatian and Danish, where the difference between content-POS and function-POS correlations is not significant. In Danish, function words head nouns, and are thus more important than in UD, where content-content word relations are annotated, and function words are leaves in the dependency tree. This difference in dependency formalism is reflected in the higher $\rho_f$ for Danish.
Table 6: Correlations between human judgments and POS-wise LAS (content $\rho_c$ vs. function $\rho_f$ POS-wise LAS correlations).
The greater correlation for content words for English, German and Spanish suggests that errors in attaching or labeling content words mean more to human judges than errors in attaching or labeling function words. We also observe that longer sentences do not compromise annotation quality: the correlation between sentence length and agreement lies between −0.07 and 0.08 across languages.
For the languages for which we had 4 annotators, we analyzed the subset of trees where humans and system (by LAS) disagreed, but where there was a majority vote for one tree. We obtained 35 dependency instances for English and 27 for Spanish (cf. Table 7). Table 7 shows that there is a prevalent preference for attachment over labeling for both languages. For Spanish, there is a proportionally higher label preference. Out of the attachment preferences, 36% and 28% were related to root/main predicate attachments, for English and Spanish respectively. The relevance of the root-attachment preference indicates that attachment is more important than labeling for our annotators.
Figure 5 provides three examples from the data where human and system disagree. Parse i) involves a coordination as well as a (local) adverbial, where humans voted for correct coordination (red) and thus unanimously preferred attachment over labeling. Yet, LAS was higher for the analysis in blue because "certainly" is attached to "Europeans" in the gold standard. Parse ii) is another example where humans preferred attachment (in this case root attachment), while iii) shows a Spanish example ("waiter is needed") where the subject label (nsubj) of "camarero" ("waiter") was the decisive trait.
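The POS-wise LAS used in the content versus function word comparison can be sketched as LAS restricted to tokens of a given POS class. The content tag set (ADJ, NOUN, PROPN, VERB) comes from the text; treating every other tag as a function word, the tree encoding, and the function names are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Assumed encoding: dependent index -> (head index, label); head 0 is the root.
Tree = Dict[int, Tuple[int, str]]

CONTENT_POS = frozenset({"ADJ", "NOUN", "PROPN", "VERB"})


def is_content(tag: str) -> bool:
    """True for the content-word tags named in the text."""
    return tag in CONTENT_POS


def is_function(tag: str) -> bool:
    """Assumption: every tag outside the content set counts as a function word."""
    return tag not in CONTENT_POS


def pos_wise_las(gold: Tree, pred: Tree, pos: Dict[int, str],
                 keep: Callable[[str], bool]) -> Optional[float]:
    """LAS over the tokens whose POS tag satisfies `keep`; None if there are none."""
    tokens = [d for d in gold if keep(pos[d])]
    if not tokens:
        return None
    return sum(pred[d] == gold[d] for d in tokens) / len(tokens)


# Toy sentence "The dog barks" (hypothetical data): only the determiner is misattached.
gold = {1: (2, "det"), 2: (3, "nsubj"), 3: (0, "root")}
pred = {1: (3, "det"), 2: (3, "nsubj"), 3: (0, "root")}
pos = {1: "DET", 2: "NOUN", 3: "VERB"}

content_las = pos_wise_las(gold, pred, pos, is_content)
function_las = pos_wise_las(gold, pred, pos, is_function)
print(content_las, function_las)  # 1.0 0.0
```

Computing the two scores per sentence and correlating each with the human preferences would give per-class coefficients in the spirit of $\rho_c$ and $\rho_f$.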

Related Work
Parsing metrics are sensitive to the choice of annotation scheme (Schwartz et al., 2012; Tsarfaty et al., 2011) and fail to capture how parsers fare on important linguistic constructions (Nivre et al., 2010). In other NLP tasks, several studies have examined how metrics correlate with human judgments, including machine translation, summarization and natural language generation (Papineni et al., 2002; Lin, 2004; Belz and Reiter, 2006; Callison-Burch et al., 2007). Our study is the first to assess the correlation of human judgments and dependency parsing metrics. While previous studies reached correlation coefficients over 0.80, this is not the case for dependency parsing, where we observe much lower coefficients.

Conclusions
We have shown that, out of seven metrics, LAS correlates best with human judgments. Nevertheless, our study shows that some human preferences are not captured by LAS. Our analysis of human versus system disagreement indicates that attachment is more important than labeling, and that humans prefer a parse that gets the overall structure right. For some languages, inter-annotator agreement is higher than annotator-metric (LAS) agreement, and content-POS is more important than function-POS, which again points to preferences that our current metrics miss. These observations raise the important question of how to incorporate them into parsing metrics that provide a better fit to human judgments. We do not propose a better metric here, but simply show that while LAS seems to be the most adequate metric, there is still a need for better metrics to complement downstream evaluation.
We outline a number of extensions for future research. Among those, we aim to augment the annotations by obtaining more detailed judgments from human annotators. The current evaluation would ideally encompass more (diverse) domains and languages, as well as the many diverse annotation schemes implemented in various publicly available dependency treebanks that were not included in our experiment.