Predicting word sense annotation agreement

High agreement is a common objective when annotating data for word senses. However, a number of factors make perfect agreement impossible, e.g. the limitations of sense inventories, the difficulty of the examples or the interpretation preferences of the annotators. Estimating potential agreement is thus a relevant task to supplement the evaluation of sense annotations. In this article we propose two methods to predict agreement on word-annotation instances. We experiment with a continuous representation and a three-way discretization of observed agreement. In spite of the difficulty of the task, we find that different levels of agreement can be identified; in particular, low-agreement examples are easier to identify.


Introduction
Sense-annotation tasks show less-than-perfect agreement scores. However, variation in agreement is not the result of featureless white noise in the annotations; Krippendorff (2011) distinguishes disagreement by chance, caused by unavoidable inconsistencies in annotator behavior, from systematic disagreement, caused by properties of the data.
Our goal is to predict the agreement of sense-annotated examples by examining their linguistic properties. If we can identify properties predictive of low or high agreement, then we can claim that some of the agreement variation in the data is indeed systematic. Artstein and Poesio (2008) provide an interpretation of Krippendorff's α coefficient to describe the reliability of a whole annotation task and the way that observed agreement (A_o) is calculated for each example. Strictly speaking, the value of α only provides an indication of the replicability of an annotation task, but we propose that the difficulty of annotating a particular example will influence its local observed agreement. Thus, easy examples will have a high A_o, and more difficult examples a lower one.
Identifying low-agreement examples by their linguistic features would help characterize contexts that make words difficult to annotate. Estimating the agreement of examples has an immediate application for data collection, as a way of estimating the proportion of examples of each difficulty level that one wants to sample. Moreover, a model of (dis)agreement can help interpret the mispredictions of a word-sense disambiguation system without requiring the data to be multiply annotated.
Observed agreement A_o is a continuous-valued variable in the unit interval, and we tackle its prediction as a regression task (Section 4.1). We also experiment with a discretization of observed agreement into low, mid and high agreement, which is predicted using classification (Section 4.2).

Related work
In their study, Yarowsky and Florian (2002) examine the relation between agreement variation and predictive power of word-sense disambiguation systems, which is later expanded by Lopez de Lacalle and Agirre (2015a). Our work is different in that we do not study the relation between agreement and performance, but between example properties and agreement. Martínez Alonso (2013) experiments with prediction of agreement for coarse-sense annotation. Tomuro (2001) uses the disagreement between annotators of two English sense-annotated corpora to provide insights on the relations between synsets, and more recent studies (Jurgens, 2013; Jurgens, 2014; Lopez de Lacalle and Agirre, 2015b) have empirically tackled the issue of inter-annotator disagreement as a phenomenon that is potentially informative for natural language processing. Other research efforts advocate for models of annotator behavior (Passonneau et al., 2009; Passonneau et al., 2010; Passonneau and Carpenter, 2014; Cohn and Specia, 2013).

Data
We conduct our study on sense-annotated datasets, keeping only the examples with at least two annotations per item. In the datasets with two annotators and one adjudicator, we disregard adjudications given their potentially different bias.
1. MASCC: The English crowdsourced lexical-sample word-sense corpus from Passonneau and Carpenter (2014).
2-5. MASCE*: The expert annotations for a series of English lexical-sample words from Passonneau et al. (2012), with several annotation rounds. We include the second, third and fourth rounds of annotation in our experiments. We experiment on the whole dataset (MASCEW), pooling all the rounds together, as well as on each round independently, namely MASCE2, MASCE3 and MASCE4.
6. FNTW: The English Twitter FrameNet data of Søgaard et al. (2015). We treat the frame-name layer as a word-sense layer, and disregard the arguments.
7. ENSST: The English supersense-annotated data of Johannsen et al. (2014).
8. EUSC: The Basque lexical-sample SemCor of Agirre et al. (2006).
9. DASST: The Danish supersense-annotated data of Martínez .
Table 1 provides the characteristics of the datasets. The annotation task can be lexical-sample (ls) or all-words (aw). The number of instances is different from the number of sentences for all-words annotation. The type of annotators can be expert (ex) or crowdsourced (cs). The α scores can differ from those reported in the datasets' documentation given our example-selection criteria. The last two columns describe the target variables of observed agreement (A_o) and the proportion of low-, mid- and high-agreement instances, cf. Section 3.2 for details.

Features
We define an instance as a sentence with a target word for annotation. If a sentence has n annotated target words, it yields n instances. For each instance, we obtain features for a word w and its syntactic parent p in a sentence s, organized in feature groups. The word identities of w and p are not included in the features, to keep the models more general. The number of features in each group is given in parentheses.
Frequency (2) We calculate the frequency of w and p, scaling by log(rank(x) + 1)^(-1).
Morphology (5) We consider the part-of-speech tag (POS) of w, of p, and the POS-bigram at the left and at the right of w. In order to incorporate information on inflectional complexity, we calculate which proportion of the frequency of the stem of w is covered by w, e.g. the occurrences of 'jumping' constitute 22% of the occurrences of the stem 'jump'.
Syntax (5) We calculate the number of dependents of w and p, and a bag of words for the labels of the dependents of w and p. We also include the distance from w to the root node, and the linear distance between w and p.
Context (5) We calculate the length of s in tokens, the proportion of s made up of content words, and a bag of words of the context of w, i.e. all the words of s except w. To capture context specificity, we calculate the maximum and the sum of the sentence-wise idf of each stem in s.
Sense inventory (2) We calculate the number of possible senses for w, plus an additional sense when w could be discarded from the annotation (like the tag 'O' for supersenses) or when the right synset was not present in WordNet. We also calculate the sense entropy for each word following Yarowsky and Florian (2002).
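The context-specificity features described under Context can be illustrated with a minimal sketch. It assumes that "sentence-wise idf" treats each sentence as a document and uses the common log(N / df) formulation; the exact idf variant, the stemmer and the toy corpus below are assumptions, not details given in the text.

```python
import math
from collections import Counter


def sentence_idf(stemmed_sentences):
    """Map each stem to log(N / df), where df is the number of sentences containing it."""
    n_sentences = len(stemmed_sentences)
    doc_freq = Counter()
    for stems in stemmed_sentences:
        doc_freq.update(set(stems))  # count each stem once per sentence
    return {stem: math.log(n_sentences / df) for stem, df in doc_freq.items()}


def context_specificity(stems, idf):
    """Return (maximum idf, sum of idf) over the stems of one sentence."""
    values = [idf[stem] for stem in stems]
    return max(values), sum(values)


# Hypothetical pre-stemmed toy corpus, for illustration only.
corpus = [
    ["the", "bank", "rais", "rate"],
    ["the", "river", "bank", "flood"],
    ["the", "rate", "fell"],
]
idf = sentence_idf(corpus)
max_idf, sum_idf = context_specificity(corpus[1], idf)
```

A stem that occurs in every sentence (here "the") gets an idf of zero, so it contributes nothing to either feature, while rare stems dominate both the maximum and the sum.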

Target variable
Regression Instance-wise observed agreement (A_o) is the target variable for the regression experiments. We obtain A_o for each example by counting the pairwise matches in the annotation and dividing by the number of pairwise combinations. Note that α is an aggregate measure that is obtained dataset-wise, and A_o is the only agreement measure available for individual instances.
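As an illustration of the computation just described, the following sketch counts matching annotator pairs and divides by the number of pairs. The function name and the toy labels are ours, chosen for the example.

```python
from itertools import combinations


def observed_agreement(labels):
    """Proportion of annotator pairs that chose the same label for one instance."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("an instance needs at least two annotations")
    matches = sum(1 for first, second in pairs if first == second)
    return matches / len(pairs)


two_agree = observed_agreement(["a", "a"])              # 1 pair, 1 match
two_disagree = observed_agreement(["a", "b"])           # 1 pair, 0 matches
three_mixed = observed_agreement(["a", "a", "b"])       # 3 pairs, 1 match
four_mixed = observed_agreement(["a", "a", "a", "b"])   # 6 pairs, 3 matches
```

With two annotators the value can only be 0 or 1, which is why the resolution of the target variable grows with the number of annotators.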
Classification The target variable for the classification experiments is a discretization of A_o into three agreement-level classes, namely LOW, MID and HIGH. The threshold for LOW is set at A_o ≤ 1/3, and for HIGH at A_o ≥ 2/3. The MID value is only possible for datasets with more than three annotators (cf. Table 1).
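The thresholds above can be sketched as follows. The example uses exact fractions to avoid floating-point edge cases at 1/3 and 2/3, and enumerates all labelings for a given number of annotators to check which classes are reachable; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product

LOW_MAX = Fraction(1, 3)
HIGH_MIN = Fraction(2, 3)


def agreement_class(a_o):
    """Map an exact observed-agreement value (a Fraction in [0, 1]) to a class."""
    if a_o <= LOW_MAX:
        return "LOW"
    if a_o >= HIGH_MIN:
        return "HIGH"
    return "MID"


def reachable_classes(n_annotators):
    """Enumerate every labeling by n annotators and collect the resulting classes."""
    classes = set()
    for labels in product(range(n_annotators), repeat=n_annotators):
        pairs = list(combinations(labels, 2))
        matches = sum(first == second for first, second in pairs)
        classes.add(agreement_class(Fraction(matches, len(pairs))))
    return classes


classes_for_three = reachable_classes(3)
classes_for_four = reachable_classes(4)
```

With three annotators the possible values are 0, 1/3 and 1, so MID never occurs; with four annotators a 3-versus-1 split gives 1/2, which falls in MID.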

Experiments
We use the scikit-learn implementation for all learning algorithms, and train and test using 10-fold cross-validation.
Regression We use L2-regularized linear regression. The baselines for regression are MEAN, where all instances receive the mean A_o of the dataset, and MEDIAN, which assigns the median A_o.
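The two regression baselines can be sketched with the standard library alone. The sketch assumes that the mean and median are estimated on the training portion and evaluated with mean absolute error (MAE); the train and test values below are invented for illustration.

```python
from statistics import mean, median


def mae(gold, predicted):
    """Mean absolute error between gold and predicted agreement values."""
    return mean(abs(g - p) for g, p in zip(gold, predicted))


def constant_baseline(train_targets, n_test, statistic):
    """Predict the same value (e.g. the training mean or median) for every test instance."""
    value = statistic(train_targets)
    return [value] * n_test


# Toy observed-agreement values for a two-annotator dataset (0 or 1 only).
train = [1.0, 1.0, 1.0, 0.0, 1.0, 0.0]
test = [1.0, 0.0, 1.0, 1.0]

mean_pred = constant_baseline(train, len(test), mean)
median_pred = constant_baseline(train, len(test), median)
mean_mae = mae(test, mean_pred)
median_mae = mae(test, median_pred)
```

On skewed data such as this, the median baseline snaps to the majority value and can reach a lower MAE than the mean baseline, which is one reason why both are worth reporting.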
Classification We use a maximum-entropy classifier. The baselines for classification are MFC, where all instances receive the most frequent class, and two random baselines: STRA, where the assigned values are randomly selected via stratified sampling from the distribution of classes in the dataset, and UNI, where values are assigned from the uniform distribution of the three labels.

Regression
The results for regression show that predicting instance-wise A_o is a hard task. The learnability of the task is limited by the resolution of the target variable; the only two datasets that can beat all baselines (and thus have lower MAE) have many instances and many annotators (about 50% of the instances in MASCEW have five or more annotators). The size of the dataset is also a relevant factor for a good estimation of A_o.
We also examine goodness of fit in terms of R^2 (determination coefficient or explained variance). R^2 does not strictly say how much agreement is systematic, but how much of the agreement variation within a dataset can be explained by the features. The only two datasets with positive R^2 are MASCC and EUSC, at 0.082 and 0.014 respectively. EUSC has only two annotators per instance, but it is a large dataset that allows mapping some properties of the features onto the variance of A_o. The two datasets with a goodness of fit over baseline are the largest ones. This behavior indicates that the regression method suffers from the data bottleneck. Smooth estimation of continuous values might be more sensitive to data volume than estimation of discrete values; we therefore experiment with classification in the next section.

Table 3: Agreement prediction as classification compared against the most-frequent, stratified and uniform baselines. Datasets where the system outperforms the hardest baseline are marked in bold, with error reduction in parentheses.

Table 3 shows the results for classification in terms of micro-averaged F1 score. Error reduction over the hardest baseline is given in parentheses.
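To make the goodness-of-fit figures for regression easier to read, the following sketch computes the determination coefficient in its usual form, one minus the ratio of residual to total sum of squares. It illustrates why a model that does worse than predicting the mean obtains a negative value; the data are invented.

```python
def r_squared(gold, predicted):
    """Determination coefficient: 1 - SS_res / SS_tot (undefined if gold is constant)."""
    mean_gold = sum(gold) / len(gold)
    ss_res = sum((g - p) ** 2 for g, p in zip(gold, predicted))
    ss_tot = sum((g - mean_gold) ** 2 for g in gold)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all gold values are identical")
    return 1 - ss_res / ss_tot


gold = [1.0, 0.0, 1.0, 1.0]

r2_mean_baseline = r_squared(gold, [0.75] * 4)         # predicting the gold mean
r2_good_model = r_squared(gold, [0.9, 0.2, 0.8, 0.9])  # close to the gold values
r2_poor_model = r_squared(gold, [0.5] * 4)             # worse than the gold mean
```

Predicting the mean yields exactly zero, so a positive value, however small, means that the features explain some of the agreement variation beyond the MEAN baseline.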

Classification
The ENSST dataset is the only dataset where the system cannot beat both baselines, albeit by a small margin. It is a small, all-words dataset, and the data might be too heterogeneous for the model to make sense of it with only 326 instances. The F1 scores are not very high in absolute terms, but agreement prediction is at least as hard as sense prediction.
MAE and F1 are not comparable measures; without evaluating both on error reduction over equivalent baselines, we cannot strictly say that classification outperforms regression. Nevertheless, classification seems a promising approach. Figure 1 shows the Spearman correlation with A_o for the numeric features on two English datasets, namely MASCC and FNTW.
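The three classification baselines that the system is compared against (MFC, STRA and UNI) can be sketched as below. This is a standard-library illustration under our own assumptions: class distributions are estimated from a training portion, and accuracy is used for scoring because it coincides with micro-averaged F1 when every instance receives exactly one label. The label lists are invented.

```python
import random
from collections import Counter

LABELS = ["LOW", "MID", "HIGH"]


def mfc_baseline(train_labels, n_test):
    """Assign the most frequent training class to every test instance."""
    most_frequent = Counter(train_labels).most_common(1)[0][0]
    return [most_frequent] * n_test


def stratified_baseline(train_labels, n_test, rng):
    """Sample labels at random following the training class distribution."""
    return rng.choices(train_labels, k=n_test)


def uniform_baseline(n_test, rng):
    """Sample labels uniformly from the three agreement classes."""
    return rng.choices(LABELS, k=n_test)


def accuracy(gold, predicted):
    """Equals micro-averaged F1 in single-label multi-class classification."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


rng = random.Random(0)  # fixed seed so the random baselines are repeatable
train = ["HIGH"] * 6 + ["LOW"] * 3 + ["MID"]
gold = ["HIGH", "HIGH", "LOW", "HIGH", "MID"]

mfc_pred = mfc_baseline(train, len(gold))
stra_pred = stratified_baseline(train, len(gold), rng)
uni_pred = uniform_baseline(len(gold), rng)
mfc_score = accuracy(gold, mfc_pred)
```

When one class dominates, MFC is usually the hardest of the three baselines to beat, since the random baselines spread their predictions over the minority classes.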

Feature analysis
Even though there is variation in the magnitude across datasets, we observe a strong negative correlation for the sense inventory features (z senseentropy, z nlabels), and also for the frequency of the target word (a targetfreq). Notice that these features are also collinear: in word-sense annotation, high-frequency words can be more difficult to annotate partly because they tend to be more polysemous.
Given these correlations, the feature repertoire we use better captures the low-agreement area of the data, but no feature has a consistently high positive correlation with agreement. That is, the predictors for low agreement are more reliable than those for high agreement.
A possible candidate for high-agreement prediction could be the proportion of content words over the length of the context, arguably because lexically richer contexts are easier for the annotators to disambiguate. This feature has a positive correlation value for most datasets except MASCC. This property has already been noted by Passonneau et al. (2009), who mention that 'greater specificity in the contexts of use leads to higher agreement'.
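For readers who want to reproduce this kind of analysis, the sketch below computes a Spearman correlation as the Pearson correlation of average ranks, which handles the many ties that occur in agreement values. The feature values and agreement scores are invented, and the variable names are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises if either variable is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation, computed on tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: number of candidate senses per instance and its observed agreement.
n_senses = [2, 3, 5, 8, 13]
agreement = [1.0, 1.0, 0.5, 1 / 3, 0.0]
rho = spearman(n_senses, agreement)
```

In this toy example agreement drops as the number of senses grows, so the coefficient is strongly negative, mirroring the direction reported for the sense inventory features.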
Syntactic complexity is also an indicator of difficulty. Words with many dependents are often more difficult to annotate (d targetdeps has a consistently negative correlation with A_o), while words with many syntactic siblings are placed in more specific contexts and are easier to annotate, giving d headdeps a slight positive correlation with A_o. This behavior holds for all the English datasets except MASCC, as well as for EUSC.
We have also performed group-wise feature ablation tests on regression and classification, with similar results. Based on the contribution of single feature groups, we find that the sense-inventory group consistently outperforms the other groups, followed by the morphology group. When the sense inventory is removed from the features, performance almost always decreases, indicating that sense inventory information is very valuable to predict agreement. However, context information is necessary to distinguish between examples of the same word (say, in a one-lemma lexical-sample dataset), where the sense-inventory features would be constant across the whole dataset.
Similarly, in the class-based experiments of Martínez Alonso (2013), certain features like plural or number of dependents are strong predictors for low agreement when annotating between the container and the content senses of words like bowl and glass. However, our datasets are ei- Nevertheless, the systems do not always improve when adding context features, which suggests that there is room for improvement in capturing contextual information for sense-annotated instances.

Conclusions and further work
This article addresses the prediction of instance-wise agreement for sense-annotated data. We have described a method to model agreement as a continuous value, and as a set of three discrete values. We use a feature scheme that tries to account for the lexical, morphological and syntactic properties of the examples. We have conducted experiments on nine datasets, which comprise three languages, all-words vs. lexical-sample word annotations, and crowdsourced vs. expert annotations.
The overall conclusiveness of the study requires expanding this research to more datasets and languages, as well as further exploring the difference in annotator bias between expert and crowdsourced annotations. Our feature repertoire can be expanded with characteristics of the sense inventory in terms of sense relatedness like autohyponymy, depth in the sense ontology, or qualitative properties of the senses such as abstractness. Context features can also be expanded by adding information from word sense induction and distributional models.
Moreover, if we are to examine agreement variation in full-document (as opposed to sentence-by-sentence) annotation, we suggest that document-level frequency would help concretize the meaning of a certain word, following the principle of one sense per discourse (Gale et al., 1992).
If the numeric prediction of agreement is desirable over classification, a metric like annotation entropy (Lopez de Lacalle and Agirre, 2015a) is worth considering as an alternative measure to A_o, since it is an information-theoretical measure that also accounts for distribution skewness.
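The difference between the two measures can be illustrated with a small sketch. It assumes that annotation entropy is the Shannon entropy of the label distribution that the annotators assign to one instance; the exact definition used by Lopez de Lacalle and Agirre (2015a) may differ, and the labelings are invented.

```python
import math
from collections import Counter
from itertools import combinations


def observed_agreement(labels):
    """Proportion of annotator pairs that chose the same label."""
    pairs = list(combinations(labels, 2))
    return sum(first == second for first, second in pairs) / len(pairs)


def annotation_entropy(labels):
    """Shannon entropy (in bits) of the labels assigned to one instance."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Two hypothetical instances with six annotations each.
balanced = ["a", "a", "a", "b", "b", "b"]  # even split over two senses
skewed = ["a", "a", "a", "a", "b", "c"]    # one dominant sense, two outliers

same_agreement = observed_agreement(balanced) == observed_agreement(skewed)
entropy_balanced = annotation_entropy(balanced)
entropy_skewed = annotation_entropy(skewed)
```

Both instances have six matching pairs out of fifteen and therefore the same observed agreement, yet their entropies differ, so entropy separates label distributions that pairwise agreement conflates.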