Supersense tagging with inter-annotator disagreement

Linguistic annotation underlies many successful approaches in Natural Language Processing (NLP), where the annotated corpora are used for training and evaluating supervised learners. The consistency of annotation limits the performance of supervised models, and thus a lot of effort is put into obtaining high-agreement annotated datasets. Recent research has shown that annotation disagreement is not random noise, but carries a systematic signal that can be used for improving the supervised learner. However, prior work was limited in scope, focusing only on part-of-speech tagging in a single language. In this paper we broaden the experiments to a semantic task (supersense tagging) using multiple languages. In particular, we analyse how systematic disagreement is for sense annotation, and we present a preliminary study of whether patterns of disagreement transfer across languages.


Introduction
Consistent annotations are important if we wish to train reliable models and perform conclusive evaluation of NLP systems. The standard practice in annotation efforts is to define annotation guidelines that aim to minimize annotator disagreement. However, in practical annotation projects, perfect agreement is virtually unattainable. Moreover, not all disagreement should be considered noise, because some of it is systematic (Krippendorff, 2011).
Plank et al. (2014a) show that the regularity of some disagreement in part-of-speech (POS) annotation can be used to obtain more robust POS taggers. They adjust the training loss of each example according to its possible variation in agreement, assigning smaller losses when the classifier makes a misclassification that matches human disagreement. For example, the loss for predicting a particle instead of an adverb is smaller than the loss for predicting a noun instead of an adverb, because the particle/adverb confusion is fairly common among annotators (Sec. 3).
In this article, we apply the method of Plank et al. (2014a) to a semantic sequence-prediction task, namely supersense tagging (SST). SST is considered a more difficult task than POS tagging, because the semantic classes are more dependent on world knowledge, and the number of supersenses is higher than the number of POS labels. We experiment with different methods to calculate the label-wise agreement (Sec. 3.1), and apply these methods to datasets in two languages, namely English and Danish (Sec. 3.2). Moreover, we also perform cross-linguistic experiments to assess how much of the annotation variation in one language can be applied to another.

Variation in supersense annotation
This section provides examples of reasonable disagreement in supersense annotation. We have extracted examples of disagreement from English supersense data (Johannsen et al., 2014), which we later use in our experiments. Table 1 provides example nominal and verbal expressions, and shows how they have been annotated by three annotators, namely $A_1$ to $A_3$.
In the first noun example, human being is seen by most as a two-token multiword of N.PERSON, but $A_2$ emphasizes the biological reading of human being when assigning senses, thus interpreting it as N.ANIMAL.
For lightning, we observe a disagreement across two types (N.EVENT and N.PHENOMENON) that arguably have a hyponymy relation between them (phenomena being a type of event), and we consider this disagreement a consequence of the overlap in the supersense inventory. The word thunder shows the same disagreement.
In the case of October Iron Range eNews, there is disagreement on the extension of the spans of the multiword. This difference also makes $A_3$ assign a different semantic type to each of the three multiwords.
Even without span-size disagreements and with a slightly smaller inventory, supersense annotation for verbs is harder than for nouns. For instance, run is the main verb of "He's gonna run out of money", and even though run is prototypically V.MOTION, the three senses provided in Table 1 reflect the meaning of "run out of". In the second example, the word stop has full disagreement, and it even has two supersenses that seem contradictory, namely V.MOTION and V.STATIVE. This disagreement is a result of the overlap between possible annotations for stop.
The case of rewind seems more surprising, but it comes from the sentence "Rewind the 1st time I gave you a bar of chocolate", where rewind is used to mean remember. Both $A_2$ and $A_3$ have chosen V.COGNITION to account for the metaphorical meaning of the verb, while $A_1$ has given the prototypical, literal sense of rewind.

Method
Our approach is based on the confusion-matrix cost-sensitive learning described in Plank et al. (2014a). We use a soft notion of correctness, so that the cost of making a prediction $\hat{y}$ depends not only on whether the correct gold label $y$ is recovered, but also on how often annotators clashed when deciding between $y$ and $\hat{y}$. The idea is to give the learner more leeway to make mistakes as long as these mistakes are the same as those made by human annotators. The learning algorithm is parameterized with a cost matrix $C$, where $C_{i,j}$ is the cost of predicting $j$ when $i$ is the true label.
To obtain the costs, we first calculate the disagreement matrix $D$ for each doubly-annotated dataset. An entry $D_{i,j}$ contains the probability of two annotators providing a conflicting annotation with labels $i$ and $j$. High-probability entries indicate low agreement. The cost matrix $C$ is then derived from $D$, so that label pairs that annotators frequently confuse incur lower costs. In our experiments we use a structured perceptron with cost-sensitive updates as the learner.
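As a rough illustration of the two steps described in this section, the sketch below estimates a disagreement matrix from the token-aligned label sequences of two annotators and derives costs from it. The function names, the normalization by token count, the symmetric counting, and in particular the form $C = 1 - D$ are assumptions made for illustration; they are not taken from the text above.

```python
from collections import Counter
from itertools import product


def disagreement_matrix(labels_a, labels_b):
    """Estimate D[(i, j)]: relative frequency with which two annotators
    gave the conflicting labels i and j to the same token (symmetric)."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("annotations must be non-empty and token-aligned")
    counts = Counter()
    for a, b in zip(labels_a, labels_b):
        if a != b:
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    total = len(labels_a)
    labels = sorted(set(labels_a) | set(labels_b))
    return {(i, j): counts[(i, j)] / total
            for i, j in product(labels, repeat=2)}


def cost_matrix(disagreement):
    """Assumed form: C = 1 - D off the diagonal, zero cost when correct."""
    return {(i, j): 0.0 if i == j else 1.0 - d
            for (i, j), d in disagreement.items()}


# Toy doubly-annotated sample (invented for this example).
ann_1 = ["N.EVENT", "N.PERSON", "V.MOTION", "N.EVENT"]
ann_2 = ["N.PHENOMENON", "N.PERSON", "V.STATIVE", "N.EVENT"]
D = disagreement_matrix(ann_1, ann_2)
C = cost_matrix(D)
```

Under these assumptions, a confusion that annotators also make (N.EVENT vs. N.PHENOMENON) receives a lower cost than one they never make (N.EVENT vs. N.PERSON).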

Factorizations
While disagreement for POS is straightforward, disagreement on supersense labels can be estimated in various ways, because supersense tags contain span, POS and sense information. Supersense tags are similar to named entity tags, but use semantic types from WordNet's lexicographer files. To account for the various kinds of information captured by the supersense tags, we use four different factorizations, i.e., four different ways of factoring costs into the model training. Each factorization determines when two tags are considered different in terms of applying a different loss during cost-sensitive training.
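To make the idea of a factorization concrete, the following sketch splits a tag into its span, POS and sense parts and compares two tags under different views. The BIO-prefixed tag format (e.g. `B-N.PERSON`), the helper names, and the definitions given for WHOLETAGS, NOPOS and JUSTSENSE are hypothetical readings of names used in this paper, not its actual definitions; the fourth factorization is not sketched because it is not named in this section.

```python
def split_tag(tag):
    """Split a tag such as 'B-N.PERSON' into (span, pos, sense).
    The BIO-prefixed format is an assumption of this sketch."""
    if tag == "O":
        return ("O", None, None)
    span, _, rest = tag.partition("-")
    pos, _, sense = rest.partition(".")
    return (span, pos, sense)


# Hypothetical views: which parts of a tag each factorization compares.
FACTORIZATIONS = {
    "WHOLETAGS": lambda span, pos, sense: (span, pos, sense),
    "NOPOS": lambda span, pos, sense: (span, sense),
    "JUSTSENSE": lambda span, pos, sense: (sense,),
}


def tags_differ(tag_a, tag_b, scheme):
    """True if the two tags count as different under the given scheme."""
    view = FACTORIZATIONS[scheme]
    return view(*split_tag(tag_a)) != view(*split_tag(tag_b))
```

For example, under these assumed definitions `B-N.PERSON` and `I-N.PERSON` differ for WHOLETAGS but not for JUSTSENSE, so only the former would treat that confusion as a distinct entry in the cost matrix.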

Data
We use supersense data from two languages, Danish and English. For Danish, we use the SemDax corpus (Pedersen et al., 2016), a collection of supersense-annotated documents of different domains. For English, we use SemCor (Miller et al., 1994). Table 3 provides statistics on the doubly-annotated data used to calculate disagreement factorizations, including annotator agreement scores. Note that the English doubly-annotated data is considerably smaller.

Model
Supersense tagging is typically cast as a sequential problem like POS tagging, but the class distribution is more skewed, with a majority class O. We use the structured perceptron RUNGSTED, which allows cost-sensitive training. We use the same feature representation as Martínez Alonso et al. (2015b), which includes information on word forms, morphology, part of speech and word embeddings. We use 5 epochs for training. All results are expressed in terms of micro-averaged F1-score, calculated using the official CONLLEVAL.PL script from the NER shared tasks.
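The notion of a cost-sensitive perceptron update can be sketched as follows. This is a deliberately simplified, token-level (non-structured) version written for illustration; it does not reproduce RUNGSTED, and the function name, feature strings and cost values are invented for the example.

```python
from collections import defaultdict


def cost_sensitive_update(weights, features, gold, predicted, cost):
    """Perceptron update whose step size is cost[(gold, predicted)].
    Nothing happens when the prediction is correct."""
    if gold == predicted:
        return
    step = cost[(gold, predicted)]
    for feat in features:
        weights[(feat, gold)] += step        # promote the gold label
        weights[(feat, predicted)] -= step   # demote the wrong label


# Toy costs: the event/phenomenon confusion is cheaper than event/person.
cost = {("N.EVENT", "N.PHENOMENON"): 0.75, ("N.EVENT", "N.PERSON"): 1.0}
weights = defaultdict(float)
cost_sensitive_update(weights, ["w=lightning"], "N.EVENT", "N.PHENOMENON", cost)
```

A mistake that matches human disagreement therefore moves the weights less than a mistake that annotators would not make.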

Experiments
We perform two kinds of experiments: monolingual and cross-language. For the monolingual experiments we use each of the four possible factorizations (Sec. 3.1) to train SST models with different costs on a single language. We evaluate each system against the most-frequent sense baseline (MFS), and against a regular structured perceptron without cost-sensitive training (BASELINE). The cross-language experiments assess whether some of the disagreement information captured by the factorizations can be used cross-lingually. To study this hypothesis, we run factorized systems using $S_{DA}$ (Sec. 3.1) on English, and vice versa.
Adapting $S_{DA}$ to English requires projecting back to the canonical supersense inventory, namely removing the adjective supersenses and treating, e.g., all cases of NOUN.VEHICLE as N.ARTIFACT, before calculating factorizations for the different confusion matrices.
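The projection to the canonical inventory can be pictured with the sketch below. Only the NOUN.VEHICLE to N.ARTIFACT pair comes from the text; the mapping table name, the `A.` prefix test for adjective supersenses, and the choice to replace removed adjective senses with `O` are assumptions made for this example.

```python
# Hypothetical mapping from extended senses to canonical supersenses;
# only this single pair is mentioned in the text.
EXTENDED_TO_CANONICAL = {"NOUN.VEHICLE": "N.ARTIFACT"}


def project_label(label):
    """Map one extended-inventory label to the canonical inventory.
    Adjective supersenses have no canonical counterpart and are dropped,
    which this sketch represents as the outside label 'O' (assumption)."""
    if label.startswith("A."):
        return "O"
    return EXTENDED_TO_CANONICAL.get(label, label)


def project_sequence(labels):
    """Project every label of an annotated sentence."""
    return [project_label(label) for label in labels]


projected = project_sequence(["NOUN.VEHICLE", "A.MENTAL", "N.PERSON", "O"])
```

Applying the projection to both annotations before counting disagreements keeps the resulting confusion matrix within the canonical label set.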
Applying the complementary process, namely using English disagreement information to train cost-sensitive models for Danish SST, is more involved. We have converted all the Danish data to the English SST inventory to be able to use the coarser inventory of $S_{EN}$ by projecting the extended senses to their original sense. Modifying the Danish data to harmonize with $S_{EN}$ thus has an effect on the most-frequent sense baseline, because the test data is effectively relabeled.

Results
Table 4 shows the performance of our system compared to the MFS baseline and the non-regularized baseline that does not use factorizations. Note that our baseline structured perceptron already beats the tough MFS baseline. We mark results in bold when another system beats the BASELINE. Some factorizations are more favorable for certain datasets. For instance, all factorizations improve the performance on Ritter-eval, but only WHOLETAGS helps on Ritter-dev. Over all in-language data sets, WHOLETAGS beats the macro-averaged baseline for English. However, the most reliable factorization overall is JUSTSENSE, which beats BASELINE for English and Danish. For JUSTSENSE on Danish, we observe that the adjective supersenses improve (A.MENTAL goes from 0.00 to 16.53 for a support of 15 instances, and A.SOCIAL goes from 48.87 to 56.75 for a support of 169 instances in the training data), but other senses with much higher support also improve, regardless of POS, like N.PERSON (from 49.72 to 52.66 for 951 instances) or V.COMMUNICATION (from 49.66 to 50.31 for 364 instances).
With regard to our cross-lingual investigation, only the direction of using Danish disagreement on English proves promising. Table 5 shows the results of using $S_{DA}$ when training and testing on English. While JUSTSENSE still helps defeat BASELINE, using NOPOS yields better results in this setup, indicating that coarser information might be the easiest to transfer across languages. Indeed, we find that N.COMMUNICATION goes from 60.63 to 66.60 and V.COMMUNICATION goes from 71.34 to 72.05.
Unfortunately, we have not found the improvements across factorizations to be statistically significant using a bootstrap test at p < 0.05. Some of the differences in performance between the two languages may stem from the differences in size of the doubly-annotated sample. In fact, the amount of data in $S_{DA}$ is much larger than in $S_{EN}$ (200 newswire sentences vs. 40 tweets).
The results provide some evidence that the systematicity of annotator disagreement in supersense annotation can be used for cost-sensitive training, in particular using the JUSTSENSE factorization. Notice that the improvements in Plank et al. (2014a) for POS tagging reach a maximum of 4 accuracy points over the regular baseline. It would be unrealistic to expect improvements of such a magnitude for SST instead of POS tagging, in particular when evaluating with label-wise micro-averaged F1 instead of accuracy.
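For readers unfamiliar with bootstrap significance testing on F1, one common variant (paired resampling over sentences) can be sketched as follows. The text does not specify which variant was used, so the procedure, the function names, the per-sentence (tp, fp, fn) representation and the default number of resamples are all assumptions of this example.

```python
import random


def micro_f1(counts):
    """Micro-averaged F1 from per-sentence (tp, fp, fn) triples."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def paired_bootstrap(counts_base, counts_sys, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which the system does not
    beat the baseline; small values suggest a significant improvement."""
    if len(counts_base) != len(counts_sys):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(counts_base)
    failures = 0
    for _ in range(n_samples):
        # Resample sentence indices with replacement, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        base = micro_f1([counts_base[i] for i in idx])
        system = micro_f1([counts_sys[i] for i in idx])
        if system <= base:
            failures += 1
    return failures / n_samples
```

In this setup, an improvement would be reported as significant at p < 0.05 when the returned fraction falls below 0.05.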

Related Work
Statistical NLP has long been aware of the importance of annotator bias for NLP models (Yarowsky and Florian, 2002). Ratnaparkhi (1996) already mentioned that annotator identity was a predictive feature for maximum-entropy POS tagging, thereby including annotator bias as a feature. Instead of training on annotator-specific data, we use disagreement to regularize over individual annotators. Tomuro (2001) has used mismatching annotations between two sense-annotated corpora to find causes of disagreement such as systematic polysemy.

Table 5: F1 scores for English using cross-lingual costs calculated from $S_{DA}$.

Conclusions
We presented an application of cost-sensitive learning (Plank et al., 2014a) to supersense tagging. Prior work focused only on syntactic tasks and single languages. We evaluated different factorizations of label disagreement, ran monolingual experiments on two languages, and attempted a cross-lingual regularization experiment.
We identify a consistent factorization (JUSTSENSE) that beats the baseline in both monolingual scenarios and in the cross-lingual scenario of using Danish annotation disagreement to train an English SST model.
We believe that capturing semantic disagreement is even more adequate for cross-lingual studies as semantics is more abstract and should better carry over to other languages. However, our investigation is only preliminary, and we would like to test the approach on further semantic tasks for which doubly-annotated data is available.