Learning to parse with IAA-weighted loss

Natural language processing (NLP) annotation projects employ guidelines to maximize inter-annotator agreement (IAA), and models are estimated assuming that there is one single ground truth. However, not all disagreement is noise, and in fact some of it may contain valuable linguistic information. We integrate such information in the training of a cost-sensitive dependency parser. We introduce five different factorizations of IAA and the corresponding loss functions, and evaluate these across six different languages. We obtain robust improvements across the board using a factorization that considers dependency labels and directionality. The best method-dataset combination reaches an average overall error reduction of 6.4% in labeled attachment score.


Introduction
Typically, NLP annotation projects employ guidelines to maximize inter-annotator agreement. Possible inconsistencies are resolved by adjudication, and models are induced assuming there is one single ground truth. However, there exist linguistically hard cases where there is no clear answer (Zeman, 2010; Manning, 2011), and incorporating such disagreements into the training of a model has proven helpful for POS tagging (Plank et al., 2014a; Plank et al., 2014b).
Inter-annotator agreement (IAA) is straightforward to calculate for POS, but not for dependency trees. There is no well-established standard for computing agreement on trees (Skjaerholt, 2014).
For a dependency tree, annotators can disagree in attachment, labeling, or both. We implement different strategies, i.e., factorizations (§2), to capture disagreement on specific syntactic phenomena.
Our hypothesis is that a dependency parser can be informed of disagreements to regularize over annotators' biases. Testing our hypothesis requires the availability of doubly-annotated data, and involves two steps: i) factorizing attachment or labeling disagreements; and ii) informing the parser of them during learning (§3).

Factorizations
Assume a sample of sentences annotated by annotators $A_1$ and $A_2$. With such a sample we can estimate the probability of the two annotators disagreeing on the annotation of a word or span, relative to some dependency tree factorization. We factorize disagreement on dependency tree annotations relative to four properties of the annotated dependency edges: the POS of the dependent, the POS of the head, the label of the edge, and the direction (left or right) of the head with respect to the dependent. This section describes the different factorizations.
We present five factorizations, depicted in Figure 1. With artificial root nodes, all words in a dependency tree have one incoming edge. This means that in our sample, any word $w_i$ has two $\langle \mathit{headId}, \mathit{label} \rangle$ annotations, i.e., $\langle h_1, l_1 \rangle$ and $\langle h_2, l_2 \rangle$, given by $A_1$ and $A_2$, respectively, with POS(·) being a function from word indices to POS. The five factorizations are as follows:
a) LABEL: disagreement over label pairs, regardless of attachment ($h_1$, $h_2$). That is, $\langle h_1, l_1 \rangle$ and $\langle h_2, l_2 \rangle$ count as disagreement iff $l_1 \neq l_2$.
b) LABELD: same as LABEL, but incorporating edge direction. That is, $\langle h_1, l_1 \rangle$ and $\langle h_2, l_2 \rangle$ count as disagreement iff, for some $j, k \in \{1, 2\}$, $h_j < i < h_k$, or $l_1 \neq l_2$.
c) CHILDPOSD: disagreement on attachment direction given POS($i$). That is, for POS($i$), $\langle h_1, l_1 \rangle$ and $\langle h_2, l_2 \rangle$ count as disagreement iff $h_j < i < h_k$.
d) HEADPOS: disagreement on head POS. That is, $\langle h_1, l_1 \rangle$ and $\langle h_2, l_2 \rangle$ count as disagreement iff POS($h_1$) $\neq$ POS($h_2$).
e) HEADPOSD: HEADPOS, plus direction. That is, $\langle h_1, l_1 \rangle$ and $\langle h_2, l_2 \rangle$ count as disagreement iff POS($h_1$) $\neq$ POS($h_2$) or $h_j < i < h_k$.
Each factorization yields a symmetric confusion matrix. In our Norwegian data ( §4), for instance, for LABEL there are 834 words that have been labeled as ATR (attribute) by both annotators, while there are 44 cases where one annotator has given the ATR label and the other has given the ADV (adverbial) label. For LABELD, there are 968 words that have been labeled as ADV where both annotators agree on the head being on the left side of the word, whereas there are 9 cases where the annotators agree on ADV label but not on the direction of the head. These 9 cases count as disagreements for LABELD but not for LABEL.
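The five disagreement tests can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Edge` type, all function names, and the convention that the artificial root has index 0 (and therefore always lies to the left of a word) are assumptions made for the example.

```python
from collections import Counter
from typing import NamedTuple


class Edge(NamedTuple):
    head: int   # index of the head word (assumption: 0 is the artificial root)
    label: str  # dependency label


def direction(dep: int, head: int) -> str:
    """Side of the dependent on which its head lies."""
    return "L" if head < dep else "R"


def differs_in_direction(i: int, e1: Edge, e2: Edge) -> bool:
    """True iff word i lies strictly between the two proposed heads."""
    return direction(i, e1.head) != direction(i, e2.head)


def disagree(factorization: str, i: int, e1: Edge, e2: Edge, pos) -> bool:
    """True iff the two annotations of word i count as a disagreement.

    pos maps word indices to POS tags. For CHILDPOSD the test itself only
    looks at direction; POS(i) would be used as the key of the matrix.
    """
    if factorization == "LABEL":
        return e1.label != e2.label
    if factorization == "LABELD":
        return e1.label != e2.label or differs_in_direction(i, e1, e2)
    if factorization == "CHILDPOSD":
        return differs_in_direction(i, e1, e2)
    if factorization == "HEADPOS":
        return pos[e1.head] != pos[e2.head]
    if factorization == "HEADPOSD":
        return pos[e1.head] != pos[e2.head] or differs_in_direction(i, e1, e2)
    raise ValueError(f"unknown factorization: {factorization}")


def label_confusion_counts(pairs):
    """Symmetric LABEL counts: an unordered label pair is the key."""
    counts = Counter()
    for e1, e2 in pairs:
        counts[frozenset((e1.label, e2.label))] += 1
    return counts
```

Keying the counts on an unordered pair is one simple way to obtain the symmetry mentioned above: which annotator chose which label does not matter.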

Cost-sensitive updates
We use the cost-sensitive perceptron classifier, following Plank et al. (2014a), but extend it to transition-based dependency parsing, where the predicted values are transitions (Goldberg and Nivre, 2012). Given a gold $y_i$ and a predicted label $\hat{y}_i$ (POS tags or transitions), the loss is weighted by $\gamma(\hat{y}_i, y_i)$:

$$L_w(\hat{y}_i, y_i) = \gamma(\hat{y}_i, y_i)\, L(\hat{y}_i, y_i)$$

Whenever a transition has been wrongly predicted, we retrieve the predicted edge and compare it to the gold dependency to calculate $\gamma$. $\gamma(y_i, y_j)$ is then the inverse of the confusion probability estimated from our sample of doubly-annotated data. For example, using the factorization LABEL, if the parser predicts $w_i$ to be SUBJECT and the gold annotation is OBJECT, the confusion probability is the number of times one annotator said SUBJECT while the other said OBJECT out of the times one annotator said one of them. In LABELD, $A_1$ and $A_2$ can disagree even if both say the grammatical function of some word $w_i$ is SUBJECT, namely if one says the head is left of $w_i$, and the other says it is right of $w_i$. The confusion probability is then the count of disagreements over the total number of cases where both annotators said a word was SUBJECT.
In our baseline model, $\gamma(\hat{y}_i, y_i) = 1$. The values for our cost-sensitive systems (LABEL, LABELD, CHILDPOSD, HEADPOS, HEADPOSD) are never above 1, which means that we are selectively underfitting the parser for specific syntactic phenomena.
In other words, we use the doubly-annotated data to regularize our model, hopefully preventing overfitting to annotators' biases.
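The weighting can be illustrated with a small Python sketch for the LABEL factorization. Several points here are assumptions rather than details confirmed by the text: "inverse" is read as the complement 1 − p (which is consistent with the weights never exceeding 1), the denominator counts every word for which at least one annotator used either label, and the function names and the simple multiclass perceptron update are hypothetical stand-ins for the parser's actual update.

```python
from collections import defaultdict


def label_confusion_probability(pairs, a, b):
    """pairs: one (label by A1, label by A2) tuple per doubly-annotated word."""
    # Words where one annotator chose a and the other chose b.
    confused = sum(1 for l1, l2 in pairs if {l1, l2} == {a, b})
    # Words where at least one annotator chose a or b (assumed denominator).
    seen = sum(1 for l1, l2 in pairs if l1 in (a, b) or l2 in (a, b))
    return confused / seen if seen else 0.0


def gamma(pairs, predicted, gold):
    """Update weight for a wrong prediction (assumed form: 1 - p)."""
    return 1.0 - label_confusion_probability(pairs, predicted, gold)


def weighted_update(weights, features, predicted, gold, g):
    """Perceptron update scaled by g; the baseline corresponds to g = 1."""
    for f in features:
        weights[(f, gold)] += g
        weights[(f, predicted)] -= g


# Toy sample: the annotators confuse SUBJ and OBJ on 2 of 10 words.
sample = [("SUBJ", "SUBJ")] * 6 + [("SUBJ", "OBJ")] * 2 + [("OBJ", "OBJ")] * 2
g = gamma(sample, predicted="OBJ", gold="SUBJ")
weights = defaultdict(float)
weighted_update(weights, ["f1", "f2"], predicted="OBJ", gold="SUBJ", g=g)
```

Under these assumptions, a label pair that annotators often confuse yields a smaller update than a pair they never confuse, which is the selective underfitting described above.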

Data
We use six treebanks (Buch-Kromann et al., 2003; Buch-Kromann et al., 2007; Arias et al., 2014; Solberg et al., 2014; Agić and Merkler, 2013; Haverinen et al., 2010) for which we could get a sample of doubly-annotated data. All these treebanks were developed directly as dependency treebanks, instead of being converted from constituent treebanks. Table 1 gives overview statistics of the treebanks. Table 2 lists the sizes of the doubly-annotated samples, as well as F1 scores between annotators and α values (Skjaerholt, 2014). The doubly-annotated samples are solely used to estimate confusion probabilities, and not for training or testing. When a treebank had no canonical train/test split, we took the final 30% for testing.

Experiments
In our experiments, we use redshift, a transition-based arc-eager dependency parser that implements the dynamic oracle (Goldberg and Nivre, 2012) with averaged perceptron training. We modified the parser to read confusion matrices and weight the updates with the respective γ. We compare the five factorized systems (§2) to a baseline system that does not take confusion probabilities into account, i.e., standard redshift. Throughout the experiments, we fix the number of iterations to 5, and we use pseudo-projectivization (Nivre and Nilsson, 2005). The parser does not include morphological features, which lowers performance for morphologically rich languages like FI. We report labeled attachment scores (LAS) including punctuation.
We use bootstrap sampling in all our experiments in order to get more reliable results. This method allows abstracting away from biases (in sampling and annotation) of training and test splits. We use two complementary evaluation methods: cross-validation within the training data, and learning curves against the test set. We calculate significance using the approximate randomization test (Noreen, 1989) with 10k iterations.
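For readers unfamiliar with approximate randomization, the following Python sketch shows one common paired formulation. It is not taken from the experimental code: the function name, the use of per-sentence counts of correctly attached and labeled tokens as the unit of comparison, and the add-one smoothing of the p-value are assumptions of the example.

```python
import random


def approximate_randomization(correct_a, correct_b, iterations=10_000, seed=0):
    """Two-sided paired test on per-sentence correct-token counts.

    correct_a[i] and correct_b[i] are the numbers of correct tokens that
    systems A and B obtain on sentence i (hypothetical inputs).
    """
    rng = random.Random(seed)
    observed = abs(sum(correct_a) - sum(correct_b))
    at_least_as_extreme = 0
    for _ in range(iterations):
        diff = 0
        for a, b in zip(correct_a, correct_b):
            # Swap the two systems' outputs for this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (iterations + 1)
```

Because the test set is the same for both systems, comparing totals of correct tokens is equivalent to comparing LAS.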
Cross-validation. In this setup, we perform 50 runs of 5-fold cross-validation on bootstrap-based samples of the training data. This allows us to gauge the effect of our factorization without committing to a certain test set. We report the average over the total of 250 runs.
Learning curve. To calculate the learning curves, we train the parser on increasing amounts of training data, bootstrap-sampled in steps of 10%, and evaluate against the test set. Each 10% increment is repeated k = 50 times. We finally report average overall error reduction over the baseline.
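The learning-curve protocol can be sketched as below. This is an illustration under stated assumptions: `train_and_score` is a hypothetical callback that trains a parser on the given sentences and returns its LAS on the test set, bootstrap sampling is taken to mean sampling with replacement, and error reduction is assumed to be the usual relative reduction of the baseline's error (100 − LAS).

```python
import random


def bootstrap_sample(sentences, size, rng):
    """Draw `size` sentences with replacement."""
    return [rng.choice(sentences) for _ in range(size)]


def error_reduction(las_system, las_baseline):
    """Relative error reduction in percent (assumed definition)."""
    return 100.0 * (las_system - las_baseline) / (100.0 - las_baseline)


def learning_curve(train, train_and_score, repetitions=50, seed=0):
    """Mean score at 10%, 20%, ..., 100% of the training data."""
    rng = random.Random(seed)
    curve = []
    for step in range(1, 11):
        size = len(train) * step // 10
        scores = [
            train_and_score(bootstrap_sample(train, size, rng))
            for _ in range(repetitions)
        ]
        curve.append(sum(scores) / len(scores))
    return curve
```

With one curve for the baseline and one for a cost-sensitive system, the overall figure would then be the mean of `error_reduction` over the ten steps.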

Results
Cross-validation. The results for cross-validation are shown in Table 3. For 5 out of the 6 languages we get significant improvements over the baseline with some factorization. We obtain improvements on all treebanks using LABELD, and on five out of six using CHILDPOSD. For CA, with the smallest doubly-annotated sample, results are not as consistent across the two evaluation methods.
Learning curve. Table 4 summarizes the overall average error reduction over the 10-step bootstrap-based learning curve (with 50 runs at each step). We get consistent improvements for languages for which we have a sample of 100+ sentences (Table 2). Again, the most robust factorization is LABELD. Figure 2 shows the learning curves for the system with the highest error reduction (NO with CHILDPOSD).
Additional studies. In order to evaluate whether our results are meaningful and not just artifacts of random regularization, we performed a sanity check for the best performing system and factorization (i.e., NO with CHILDPOSD factorization). We shuffled the confusion matrix and ran the bootstrap learning curve with k = 50 repetitions, for five different shufflings. The mean over the five runs for the overall average error reductions is negative (-0.38%, compared to the 2.4% mean for the original, non-shuffled version). We thus conclude that our factorizations capture linguistically plausible information rather than random noise.
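The text does not specify how the confusion matrix was shuffled. One plausible reading, sketched here in Python purely as an assumption, is to permute the γ values among the existing cells so that the distribution of weights is preserved while their link to particular syntactic phenomena is broken. The function name and the dictionary representation are hypothetical.

```python
import random


def shuffle_confusion(gammas, seed):
    """Reassign the same gamma values to randomly permuted cells.

    gammas maps a cell of the confusion matrix (e.g. a label pair) to its weight.
    """
    rng = random.Random(seed)
    keys = list(gammas)
    values = list(gammas.values())
    rng.shuffle(values)
    return dict(zip(keys, values))


# Five different shufflings, one per seed, mirroring the five runs above.
original = {("SUBJ", "OBJ"): 0.8, ("ATR", "ADV"): 0.95, ("ADV", "ADV"): 0.99}
shuffled_versions = [shuffle_confusion(original, seed) for seed in range(5)]
```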

Related Work
Plank et al. (2014a) propose IAA-weighted cost-sensitive learning for POS tagging. We extend their line of work to dependency parsing. A single sentence can have more than one plausible dependency annotation. Some researchers have proposed evaluation metrics that do not penalize disagreements (Schwartz et al., 2011; Tsarfaty et al., 2011), while others have argued that we should instead ensure the consistency of treebanks (Dickinson, 2010; Manning, 2011; McDonald et al., 2013). Others have claimed that because of these ambiguities, only downstream evaluations are meaningful (Elming et al., 2013).
Syntactic annotation disagreement has typically been studied in the context of treebank development. Haverinen et al. (2012), for example, analyze annotator disagreement for Finnish dependency syntax, and compare it against parser performance. Skjaerholt (2014) uses doubly-annotated data to evaluate various agreement metrics. Our paper differs from both lines of research in that we leverage disagreements from doubly-annotated data to obtain more robust models. While we agree that evaluation metrics should probably reflect disagreements, we show that our learning algorithms can indeed benefit from information about disagreement, also using standard performance metrics.

Conclusions
We have evaluated five different factorizations on six treebanks to assess the impact of IAA-weighted learning for dependency parsing, obtaining promising results. The findings support our hypothesis that annotator disagreement is informative for parsing. The LABELD factorization, which takes both labeling and word order into account, is the most robust factorization overall across all languages. However, the best factorization for each language varies. This variation can be a result of the morphosyntax of the language, but also of the dependency annotation formalism, annotation method, training corpus, and size of the doubly-annotated sample.