Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning

Supervised learning assumes that a ground truth label exists. However, the reliability of this ground truth depends on human annotators, who often disagree. Prior work has shown that this disagreement can be helpful in training models. We propose a novel method to incorporate this disagreement as information: in addition to the standard error computation, we use soft-labels (i.e., probability distributions over the annotator labels) as an auxiliary task in a multi-task neural network. We measure the divergence between the predictions and the target soft-labels with several loss-functions and evaluate the models on various NLP tasks. We find that the soft-label prediction auxiliary task reduces the penalty for errors on ambiguous entities, and thereby mitigates overfitting. It significantly improves performance across tasks, beyond the standard approach and prior work.


Introduction
Usually, the labels used in NLP classification tasks are produced by sets of human annotators. As disagreement between annotators is common, many methods aggregate the different answers into a supposedly correct one (Dawid and Skene, 1979; Carpenter, 2008; Hovy et al., 2013; Raykar et al., 2010; Paun et al., 2018; Ruiz et al., 2019). However, the aggregated labels obtained in this way mask the world's real complexity: instances can be intrinsically ambiguous (Poesio and Artstein, 2005; Zeman, 2010; Pavlick and Kwiatkowski, 2019), or so challenging to evaluate that considerable disagreement between different annotators is unavoidable. In those cases, it is reasonable to wonder whether the ambiguity is indeed harmful to the models or whether it carries valuable information about the relative difficulty of each instance (Aroyo and Welty, 2015). Several authors followed that intuition, trying ways to incorporate the information about the level of annotator agreement in their models (Sheng et al., 2008; Plank et al., 2016; Jamison and Gurevych, 2015; Rodrigues and Pereira, 2018; Lalor et al., 2017).
Usually, Deep Learning models compute the error as the divergence between the predicted label distribution and a one-hot encoded gold distribution (i.e., nothing but the gold label has any probability mass). However, for complex tasks, this binary black-and-white notion of truth is not plausible and can lead to overfitting. Instead, we can use a more nuanced notion of truth by comparing against soft labels: we collect the probability distributions over the labels given by the annotators, rather than using one-hot encodings with a single correct label. To measure the divergence between probability distributions, we can use well-known measures like the Kullback-Leibler divergence (Kullback and Leibler, 1951), the Jensen-Shannon divergence (Lin, 1991), and the Cross-Entropy, which is also used to quantify the error with one-hot encoded labels. The main impediment to the direct use of soft labels as targets, though, is the lack of universally accepted performance metrics to evaluate the divergence between probability distributions. (Most metrics lack an upper bound, making it difficult to assess prediction quality.) Usually, annotations are incorporated into the models without soft labels (Rodrigues and Pereira, 2018). Where soft labels are used, they are variously filtered according to their distance from the correct labels and then used to weight the training instances rather than as prediction targets. These models still predict only true labels (Jamison and Gurevych, 2015).
In contrast to previous approaches, we use Multi-Task Learning (MTL) to predict a probability distribution over the soft labels as additional output. We jointly model the main task of predicting standard gold labels and the novel auxiliary task of predicting the soft label distributions. Due to the difficulty of interpreting its performance, we do not directly evaluate the distance between the target and the predicted probability distributions. However, the MTL framework allows us to indirectly evaluate its effect on the main task. Exploiting the standard metrics for gold labels, we can also compare the effect of different loss functions for the soft label task. In particular, we propose a standard and an inverse version of the KL-divergence and Cross-Entropy. In previous work (Jamison and Gurevych, 2015), filtering and weighting the training instances according to soft labels did not lead to consistent performance improvements. In contrast, we find that the information carried by MTL soft labels does significantly improve model performance on several NLP tasks.
Contributions 1) We show that MTL models, trained with soft labels, consistently outperform the corresponding Single-Task Learning (STL) networks, and 2) we evaluate the use of different loss functions for soft labels.

MTL with three loss functions
For the experiments, we use different types of neural networks, depending on the type of task. However, we create two versions of each model architecture: an STL model and an MTL model. In STL, we predict the one-hot encoded labels. In MTL, we add the auxiliary task of predicting the soft label distributions to the previous main task.
In both cases, we use Adam optimization (Kingma and Ba, 2014). The loss function for the main task is standard cross-entropy. For the auxiliary task, we have different options. The KL-divergence is a natural choice to measure the difference between the prediction distribution Q and the distribution of soft labels P. However, there are two ways we can do that, depending on what we want to capture. The standard KL-divergence is:
$$D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)} \qquad (1)$$
This measures the divergence from Q to P and encourages a wide Q, because if the model overestimates the regions of small mass from P it will be heavily penalised. The inverse KL-divergence is:
$$D_{KL}(Q \| P) = \sum_{i} Q(i) \log \frac{Q(i)}{P(i)} \qquad (2)$$
This measures the divergence from P to Q and encourages a narrow Q distribution because the model will try to allocate mass to Q in all the places where P has mass; otherwise, it will get a strong penalty.
Considering that we use the auxiliary task to reduce overfitting on the main task, we expect equation 2 to be more effective because it encourages the model to learn a distribution that pays attention to the classes where the annotations possibly agree.
A third option is to directly apply Cross-Entropy. It is closely related to the KL-divergence: it equals the entropy of P added to the standard KL-divergence:
$$H(P, Q) = H(P) + D_{KL}(P \| Q) \qquad (3)$$
Since H(P) does not depend on the model's predictions, regular KL-divergence and Cross-Entropy tend to lead to the same performance. For completeness, we report the results of Cross-Entropy as well.
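The three candidate auxiliary losses described above can be compared on a toy example. The following is a minimal sketch in plain Python, not the implementation used in the experiments; the distributions `P` and `Q` are made-up values, natural logarithms are assumed, and the function names are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """H(p) = -sum_i p_i * log(p_i)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical values: P plays the role of the soft labels, Q of the prediction.
P = [0.6, 0.3, 0.1]
Q = [0.5, 0.4, 0.1]

standard_kl = kl_divergence(P, Q)   # standard version, D_KL(P || Q)
inverse_kl = kl_divergence(Q, P)    # inverse version, D_KL(Q || P)
ce = cross_entropy(P, Q)            # equals H(P) + D_KL(P || Q)

print(standard_kl, inverse_kl, ce)
```

The two KL variants give different values for the same pair of distributions, while Cross-Entropy differs from the standard KL-divergence only by the entropy of P, which is constant with respect to the prediction.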
As the overall loss, we compute the sum of the main-task and auxiliary-task losses. We do not normalize the two losses, as this is unnecessary: we use the LogSoftmax activation function for the main task, which is a standard choice for one-hot encoded labels, and standard Softmax for the auxiliary task. Since the distributions of gold (one-hot encoded) and soft labels both sum to one, the errors are on the same scale.
We also derive the soft labels using the Softmax function, which prevents the probability of the single labels from falling to zero.
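To make the combination of the two losses concrete, here is a minimal sketch in plain Python for a single training instance. It assumes that the soft labels are obtained by applying Softmax to the raw annotation counts (the text does not spell out the exact input to the Softmax), and it uses the inverse KL-divergence as the auxiliary loss. The counts, logits, and variable names are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax; every output is strictly positive."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(scores):
    """Numerically stable log-softmax."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def kl_divergence(p, q):
    """D_KL(p || q); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumption: soft labels are a softmax over per-class annotation counts.
counts = [4, 1, 0]               # hypothetical: five annotators, three classes
soft_labels = softmax(counts)    # no class receives exactly zero probability

gold = 0                         # index of the gold label (hypothetical)
main_logits = [2.0, 0.5, -1.0]   # hypothetical output of the main-task head
aux_logits = [1.5, 0.7, -0.5]    # hypothetical output of the auxiliary head

# Main task: LogSoftmax followed by negative log-likelihood (cross-entropy).
main_loss = -log_softmax(main_logits)[gold]

# Auxiliary task: Softmax prediction Q compared against soft labels P
# with the inverse KL-divergence, D_KL(Q || P).
aux_pred = softmax(aux_logits)
aux_loss = kl_divergence(aux_pred, soft_labels)

# Overall loss: plain, unweighted sum of the two losses.
total_loss = main_loss + aux_loss
print(main_loss, aux_loss, total_loss)
```

Because the Softmax keeps every soft-label probability above zero, the logarithms in the KL terms stay finite in both directions.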

Methods
We evaluate our approach on two NLP tasks: POS tagging and morphological stemming. We use the respective data sets from  and Jamison and Gurevych (2015) (where data sets are sufficiently large to train a neural model). In both cases, we use data sets where both one-hot (gold) and probabilistic (soft) labels (i.e., distributions over label annotations) are available. The code for all models in this paper will be available on github.com/fornaciari.

POS tagging
Data set For this task, we use the data set released by Gimpel et al. (2010) with the crowdsourced labels provided by . The same data set was used by Jamison and Gurevych (2015). Similarly, we use the CoNLL Universal POS tags (Petrov et al., 2012) and 5-fold cross-validation. The soft labels come from 177 annotators, with at least five annotations for each instance. Unlike Jamison and Gurevych (2015), however, we also test the model on a completely independent test set, released by . This data set does not contain soft labels, but these are not needed to test our models.

Model
We use a tagging model that takes two kinds of input representations, at the character and the word level (Plank et al., 2016). At the character level, we use character embeddings trained on the same data set; at the word level, we use GloVe embeddings (Pennington et al., 2014). We feed the word representation into a 'context bi-RNN', selecting the hidden state of the RNN at the target word's position in the sentence. The character representation is then fed into a 'sequence bi-RNN', whose output is its final state. The two outputs are concatenated and passed to an attention mechanism, as proposed by Vaswani et al. (2017). In the STL models, the attention mechanism's output is passed to a last attention mechanism and to a fully connected layer that gives the output. In the MTL models, the last two components of the STL network (attention + fully connected layer) are duplicated and used for the auxiliary task, providing softmax predictions.

Morphological stemming
Data set We use the data set used in Jamison and Gurevych (2015), which was originally created by Carpenter et al. (2009). It consists of (word, stem) pairs, and the task is to classify whether the stem belongs to the word (binary classification). The soft labels come from 26 unique annotators, and each instance received at least four labels.
Model We represent each (word, stem) pair with the same character embeddings trained for the previous task. Each representation passes through two convolutional/max-pooling layers. We use two convolutional layers with 64 and 128 channels and three window sizes of 3, 4, and 5 characters. Their outputs are connected to two independent attention mechanisms (Vaswani et al., 2017). The outputs of the attention mechanisms are concatenated and passed directly to the fully connected layers (one for each task), which provide the predictions. In the MTL models, the concatenation of the attention mechanisms is passed to another fully connected layer, which predicts the soft labels.

Gold standard and soft labels
To account for the effects of random initializations, we run ten experiments for each experimental condition. During the training, we select the models relying on the F-measure observed on the development set. We report the averaged results for accuracy and F-measure, the metrics used by the studies we compare to. For each task, we compare the STL and MTL models. Where possible, we compare model performance with previous work.

Silver standard and soft labels
Since we did not create the corpora that we use in our experiments, we do not know the details of the gold labels' creation process. However, we verified that the gold labels do not correspond to the classes resulting from the majority voting of the annotations used for the soft labels. Consequently, the MTL models exploit an additional source of information that is not provided to the STL ones. To validate our hypothesis, we need to exclude the possibility that the MTL models succeed simply because the soft labels inject more information into the models. We therefore ran a set of experiments where the main task was trained on the majority voting (silver) labels from the annotations, rather than on the gold labels. We still performed the tests on the gold labels. In these conditions, both tasks rely on the same source of (imperfect) information, so MTL has no potential advantage over STL. While overall performance drops compared to the results of Table 1, Table 2 shows that the MTL models still maintain a significant advantage over the STL ones. As before, results are averaged over ten independent runs for each condition.
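The silver labels used in this control experiment can be derived from the raw annotations with a simple majority vote. The sketch below is illustrative only: the function name, the tag values, and the tie-breaking rule (the first label to reach the top count, in annotation order) are assumptions, since the text does not say how ties were resolved.

```python
from collections import Counter

def majority_vote(annotations):
    """Return the most frequent label among the annotations of one instance.

    Assumption: ties are broken in favour of the label that appears first
    in the annotation list.
    """
    if not annotations:
        raise ValueError("an instance needs at least one annotation")
    counts = Counter(annotations)
    top = max(counts.values())
    for label in annotations:
        if counts[label] == top:
            return label

# Hypothetical crowd annotations for two tokens.
clear_case = ["NOUN", "VERB", "NOUN", "ADJ", "NOUN"]
tied_case = ["ADJ", "NOUN", "NOUN", "ADJ"]

print(majority_vote(clear_case))  # NOUN
print(majority_vote(tied_case))   # ADJ (first label among those tied)
```

In this setup, the silver label for the main task and the soft-label distribution for the auxiliary task are both computed from the same list of annotations, which is what removes the MTL models' potential information advantage.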

Error analysis
To gain further insights about their contributions, we inspect the soft labels' probability distributions, comparing the predictions of STL and MTL models.
We perform the following analysis for the POS and the stemming tasks, and for each kind of loss function in the MTL models. In particular, we consider four conditions of the predictions: 1) where both STL and MTL gave the correct answer, 2) where both gave the wrong answer, 3) where STL was correct and MTL incorrect, and 4) where MTL was correct and STL incorrect (see the confusion matrix in Table 3). For each of these categories, we compute the relative kurtosis of the soft labels. We choose this measure as it describes how uniform the probability distribution is: whether the annotators agree on a single class, or whether they disperse their votes among different classes.
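As an illustration of why kurtosis separates high-agreement from low-agreement instances, the sketch below computes the excess (Fisher) kurtosis of a soft-label vector, treating the class probabilities as a sample of values. This estimator is an assumption: the text does not state which kurtosis definition was used. The two 12-class distributions are invented for the example.

```python
def excess_kurtosis(values):
    """Population excess kurtosis m4 / m2**2 - 3 of a list of numbers.

    Returns NaN when the values are (numerically) constant, e.g. for a
    perfectly uniform distribution, where the kurtosis is undefined.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 < 1e-12:
        return float("nan")
    return m4 / m2 ** 2 - 3.0

# Hypothetical soft labels over 12 classes.
agreed = [0.89] + [0.01] * 11                    # annotators mostly agree
dispersed = [0.3, 0.3, 0.2, 0.1] + [0.0125] * 8  # votes spread over classes

print(excess_kurtosis(agreed))     # high: one dominant class
print(excess_kurtosis(dispersed))  # lower: mass spread over several classes
```

Under this definition, a distribution with a single dominant class yields a high kurtosis, and the value drops as the annotators' votes spread across classes.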
Not surprisingly, we find the highest average kurtosis where both STL and MTL models give the correct prediction. Both kinds of models find it easier to predict the instances that are also unambiguous for the annotators. The opposite holds as well: the instances where both MTL and STL models are wrong show the lowest mean kurtosis.
More interesting is the outcome where MTL models are correct and STL models wrong, and vice versa. In these cases, the average kurtosis lies between the two previous extremes. Also, we find a consistent trend across data sets and MTL loss functions: the instances where only the MTL models are correct show a slightly higher kurtosis than those instances where only the STL models give the right answer. To measure the significance of this trend, we apply the Mann-Whitney rank test (Mann and Whitney, 1947). We use a non-parametric test because the distribution of the kurtosis values is not normal. We find two significant results: when we use Cross-Entropy as MTL loss function on the POS data set, and with the inverse KL-divergence on the stemming data set. We report the POS results in Table 3. As in the previous sections, the results refer to 10 runs of each experimental condition. This finding suggests that, when dealing with ambiguous cases, the soft labels tend to provide a qualified hint: they train the models to predict the classes that seem to be the most probable for the annotators.

                  MTL correct   MTL incorrect
STL correct       6.614         5.961
STL incorrect     6.015*        5.727

Table 3: Average kurtosis of the soft labels for instances predicted correctly/incorrectly by STL and MTL models (with Cross-Entropy as loss function) in the POS data set. The kurtosis where only the MTL models are correct is significantly higher than that where only the STL models are correct (*: p ≤ 0.05).
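The rank test used above can be sketched in a few lines of plain Python. This is a hand-rolled illustration, not the routine used for the reported results: it uses the normal approximation for a two-sided p-value and omits the tie and continuity corrections, so its p-values will differ slightly from those of statistical libraries. The sample values are invented.

```python
import math

def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p-value) for samples x and y.

    Assumptions: normal approximation, no tie correction and no
    continuity correction.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values and give each the average rank.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u_x, p_value

# Hypothetical kurtosis values for two groups of instances.
only_stl_correct = [5.1, 5.8, 6.0, 5.5, 5.9]
only_mtl_correct = [6.2, 6.4, 5.7, 6.9, 6.1]

u, p = mann_whitney_u(only_mtl_correct, only_stl_correct)
print(u, p)
```

The test only compares ranks, which is why it remains applicable when the kurtosis values are not normally distributed.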

Related Work
Several different lines of research use annotation disagreement. One line focuses on the aggregation of multiple annotations before model training. Seminal work includes Dawid and Skene (1979), who proposed an Expectation-Maximization (EM) based aggregation model. This model has since influenced a large body of work on annotation aggregation and on modeling annotator competence (Carpenter et al., 2009; Hovy et al., 2013; Raykar et al., 2010; Paun et al., 2018; Ruiz et al., 2019). In our experiments on POS tagging, we evaluated the possibility of testing Dawid-Skene labels rather than Majority Voting, finding that the performance of the two against the gold standard was mostly the same. Some of these methods also evaluate the annotators' expertise (Dawid and Skene, 1979; Raykar et al., 2010; Hovy et al., 2013; Ruiz et al., 2019). Others just penalize disagreement (Pan et al., 2019). The second line of work focuses on filtering out presumably low-quality data to train on the remaining data (Beigman Klebanov and Beigman, 2014; Jamison and Gurevych, 2015). However, such filtering strategies require an effective filtering threshold, which is non-trivial to choose; relying only on high-agreement cases also results in worse performance (Jamison and Gurevych, 2015). Some studies (Goldberger and Ben-Reuven, 2016; Han et al., 2018b,a) treat disagreement as a corruption of a theoretical gold standard. Since the robustness of machine learning models is affected by the data annotation quality, reducing noisy labels generally improves the models' performance. The closest to our work are the studies of Cohn and Specia (2013) and Rodrigues and Pereira (2018), who both use MTL. In contrast to our approach, though, each of their tasks represents an annotator. We instead propose to learn from both the gold labels and the distribution over multiple annotators, which we treat as soft label distributions in a single auxiliary task.
Compared to treating each annotator as a task, our approach has the advantage that it requires fewer output nodes, which reduces the number of parameters. To our knowledge, the only study that directly uses soft labels is the one by Lalor et al. (2017). Unlike our study, they assume that soft labels are available only for a subset of the data. Therefore, they use them to fine-tune STL networks. Despite this methodological difference, their findings support this paper's intuition that soft labels carry signal rather than noise.
In a broad sense, our study belongs to the research area of regularization methods for neural networks. Among them, label smoothing (Pereyra et al., 2017) penalizes over-confident network predictions. Both label smoothing and soft labels reduce overfitting by regulating the size of the loss. However, label smoothing relies on the gold labels' distribution and does not account for the instances' inherent ambiguity, while soft labels selectively train the models to reduce their confidence when dealing with unclear cases, without affecting the prediction of clear cases. Disagreement also relates to the issue of annotator biases (Shah et al., 2020; Sap et al., 2019; Hovy and Yang, 2021), and our method can provide a possible way to address it.
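The contrast between label smoothing and soft labels can be illustrated with a toy example. The sketch below uses the common uniform-smoothing formulation (mixing the one-hot vector with a uniform distribution using a weight `epsilon`), which is one variant among several and is an assumption here; the function name and all values are hypothetical.

```python
def smooth_labels(gold_index, num_classes, epsilon=0.1):
    """Uniform label smoothing: (1 - epsilon) * one_hot + epsilon / num_classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == gold_index else uniform
        for i in range(num_classes)
    ]

# Two hypothetical instances with the same gold label (class 0).
smoothed_clear = smooth_labels(0, 3)
smoothed_ambiguous = smooth_labels(0, 3)

# Hypothetical soft labels from annotators for the same two instances.
soft_clear = [0.90, 0.05, 0.05]      # annotators agree
soft_ambiguous = [0.50, 0.40, 0.10]  # annotators are split

print(smoothed_clear == smoothed_ambiguous)  # True: same target for both
print(soft_clear == soft_ambiguous)          # False: target reflects ambiguity
```

With smoothing, every instance sharing a gold label receives the same target, whereas the soft-label target changes with the degree of annotator disagreement on each instance.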

Conclusion
We propose a new method for leveraging instance ambiguity, as expressed by the probability distribution over label annotations. We set up MTL models to predict this label distribution as an auxiliary task in addition to the standard classification task. This setup allows us to incorporate uncertainty about the instances' class membership into the model. Across two NLP tasks, three data sets, and three loss functions, we always find that our method significantly improves over the STL performance. While the performance difference between the loss functions is not significant, we find that the inverse version of KL gives the best results in all the experimental conditions but one. This finding supports our idea of emphasizing the coders' disagreement during training. We conjecture that predicting the soft labels acts as a regularizer, reducing overfitting. That effect is especially likely for ambiguous instances, where annotators' label distributions differ especially strongly from one-hot encoded gold labels.