Looking Beyond Label Noise: Shifted Label Distribution Matters in Distantly Supervised Relation Extraction

In recent years there has been a surge of interest in applying distant supervision (DS) to automatically generate training data for relation extraction (RE). In this paper, we study what limits the performance of DS-trained neural models, conduct thorough analyses, and identify a factor that can greatly influence performance: shifted label distribution. Specifically, we found that this problem commonly exists in real-world DS datasets, and that without special handling, typical DS-RE models cannot automatically adapt to this shift and thus achieve deteriorated performance. To further validate our intuition, we develop a simple yet effective adaptation method for DS-trained models, bias adjustment, which updates models learned over the source domain (i.e., the DS training set) with a label distribution estimated on the target domain (i.e., the test set). Experiments demonstrate that bias adjustment achieves consistent performance gains on DS-trained models, especially on neural models, with up to a 23% relative F1 improvement, which verifies our assumptions. Our code and data can be found at https://github.com/INK-USC/shifted-label-distribution.


Introduction
Aiming to identify the relation between an entity pair, relation extraction (RE) serves as an important step towards text understanding and has been a long-standing pursuit of many researchers. To reduce the reliance on human-annotated data, especially for data-hungry neural models (Zeng et al., 2014; Zhang et al., 2017), there have been extensive studies on leveraging distant supervision (DS) in conjunction with external knowledge bases to automatically generate large-scale training data (Mintz et al., 2009; Zeng et al., 2015). While recent DS-based relation extraction methods focus on handling label noise (Riedel et al., 2010; Hoffmann et al., 2011; Lin et al., 2016), i.e., false labels introduced by the error-prone DS process, other factors may have been overlooked. Here, we observe that model behaviors differ between DS datasets and a clean dataset, which implies the existence of other challenges that restrict the performance of DS-RE models. In this paper, we conduct thorough analyses over both real-world and synthetic datasets to explore the question: what limits the performance of DS-trained neural models?
Our analysis starts with a performance comparison among recent relation extraction methods on both DS datasets (i.e., KBP (Ellis et al., 2012), NYT (Riedel et al., 2010)) and a human-annotated dataset (i.e., TACRED (Zhang et al., 2017)), with the goal of seeking models that can consistently yield strong results. We observe that, on the human-annotated dataset, neural relation extraction models outperform feature-based models by notable gaps, but these gaps diminish when the same models are applied to DS datasets: neural models merely achieve performance comparable with feature-based models. We endeavor to analyze the underlying problem that leads to this unexpected "diminishing" phenomenon. Inspired by two heuristic threshold techniques that prove to be effective on DS datasets, and further convinced by comprehensive analysis on synthetic datasets, we reveal an important characteristic of DS datasets, shifted label distribution: the issue that the label distribution of the training set does not align with that of the test set. There often exists a large margin between the label distribution of a distantly supervised training set and that of a human-annotated test set, as shown in Fig 1. Intuitively, this is mainly caused by "false positive" and "false negative" labels generated by error-prone DS processes, and by the imbalanced data distribution of external knowledge bases.
To some extent, such distortion is a special case of domain shift, i.e., training the model on a source domain and applying the learned model to a different target domain. To further verify our assumption, we develop a simple domain adaptation method, bias adjustment, to address the shifted label distribution issue. It modifies the bias term in softmax classifiers and explicitly fits models along the shift. Specifically, the proposed method estimates the label distribution of the target domain with a small development set sampled from the test set, and derives the adapted predictions under reasonable assumptions. In our experiments, we observe consistent performance improvement, which validates that model performance may be severely hindered by label distribution shift.
In the rest of the paper, we first introduce the problem setting in Section 2 and report the inconsistency of model performance between human annotations and DS in Section 3. Then, we present two threshold techniques which are found to be effective on DS datasets and which further lead us to the discovery of shifted label distribution. We explore its impact on synthetic datasets in Section 4, and introduce the bias adjustment method in Section 5. In addition, a comparison of denoising methods, heuristic thresholds, and bias adjustment is conducted in Section 6. We discuss related work in Section 7 and conclude our findings in Section 8.

Experiment Setup
In this paper, we conduct extensive empirical analyses on distantly supervised relation extraction (DS-RE). For a meaningful comparison, we ensure the same setup in all experiments. In this section, we provide a brief introduction to the setting, while more details can be found in Appendix A. All implementations are available at https://github.com/INK-USC/shifted-label-distribution.

Problem Setting
Following previous works (Liu et al., 2017), we conduct relation extraction at the sentence level. Formally speaking, the basic unit is the relation mention, which is composed of one sentence and one ordered entity pair within the sentence. The relation extraction task is to categorize each relation mention into a given set of relation types, or a Not-Target-Type (NONE).
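To make the unit of prediction concrete, the following minimal Python sketch shows one relation mention and its gold label (all names here, e.g. RelationMention and the toy label set, are our own illustration, not from the released code):

```python
from typing import NamedTuple

class RelationMention(NamedTuple):
    """One classification unit: a sentence plus an ordered entity pair."""
    tokens: list        # the sentence, tokenized
    subj_span: tuple    # (start, end) token indices of the subject entity
    obj_span: tuple     # (start, end) token indices of the object entity

NONE = "NONE"
RELATION_TYPES = ["per:title", "org:founded_by", NONE]  # hypothetical label set

m = RelationMention(
    tokens=["Steve", "Jobs", "founded", "Apple", "."],
    subj_span=(3, 4),   # "Apple"
    obj_span=(0, 2),    # "Steve Jobs"
)
gold_label = "org:founded_by"  # the type assigned to this ordered pair
```

Because the entity pair is ordered, the same sentence with the spans swapped is a different relation mention and may carry a different label.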

Datasets
We select three popular relation extraction datasets as benchmarks. Specifically, two of them are distantly supervised and one is human-annotated. KBP (Ling and Weld, 2012) uses Wikipedia articles annotated with Freebase entries as its train set, and manually annotated sentences from the 2013 KBP slot filling assessment results (Ellis et al., 2012) as its test set.
NYT (Riedel et al., 2010) contains New York Times news articles that have been heuristically annotated; its test set was constructed manually by Hoffmann et al. (2011). TACRED (Zhang et al., 2017) is a large-scale crowd-sourced dataset, substantially larger than previous manually annotated datasets.

Models
We consider two classes of relation extraction methods, i.e., feature-based and neural models.
For each relation mention, these models first construct a representation vector h, and then make predictions with a softmax classifier based on h:

p(y = r_i | h) = exp(r_i^T h + b_i) / Σ_j exp(r_j^T h + b_j),   (1)

where r_i and b_i are the parameters corresponding to the i-th relation type. More details on these models can be found in Appendix A.
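The classifier above can be sketched as follows (a minimal pure-Python illustration of the softmax prediction; the toy parameters and dimensions are ours):

```python
import math

def softmax_predict(h, R, b):
    """p(y = r_i | h) = exp(r_i . h + b_i) / sum_j exp(r_j . h + b_j)."""
    logits = [sum(ri_k * h_k for ri_k, h_k in zip(r_i, h)) + b_i
              for r_i, b_i in zip(R, b)]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

# toy example: 3 relation types, 2-dimensional mention representation
h = [1.0, -0.5]
R = [[0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]   # row i holds r_i
b = [0.0, 0.0, 0.0]                          # bias b_i per relation type
p = softmax_predict(h, R, b)                 # p[i] = p(y = r_i | h)
```

The bias terms b_i are the quantities that the adaptation method in Section 5 adjusts.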

The Diminishing Phenomenon
Neural models alleviate the reliance on handcrafted features and have greatly advanced the state of the art, especially on datasets with human annotations. Meanwhile, we observe that such performance boosts start to diminish on distantly supervised datasets. Specifically, we list the performance of all tested models in Table 2 and summarize three popular models in Figure 2. (CoType-RM and logistic regression are exceptions, as they do not adopt softmax to generate output. Bi-GRU outperforms ReHession by a significant gap on TACRED, but only has comparable performance on KBP and NYT; a similar gap-diminishing phenomenon happens to Bi-LSTM.)
On TACRED, a human-annotated dataset, complex neural models like Bi-LSTM and Bi-GRU significantly outperform the feature-based ReHession, with up to a 13% relative F1 improvement. On the other hand, on distantly supervised datasets (KBP and NYT), the performance gaps between the aforementioned methods diminish to within 5% (relative improvement). We refer to this observation as the "diminishing" phenomenon. It implies a lack of handling of the underlying difference between human annotations and distant supervision.
After a broad exploration, we found two heuristic techniques that we believe capture problems exclusive to distantly supervised RE and are potentially related to the "diminishing" phenomenon. We found that they can greatly boost performance on distantly supervised datasets, but fail to do so on the human-annotated dataset. To get deeper insights, we analyze the diminishing phenomenon and the two heuristic methods.

Heuristic Threshold Techniques
Max threshold and entropy threshold are designed to identify "ambiguous" relation mentions (i.e., those predicted with low confidence) and label them as the NONE type (Liu et al., 2017). In particular, referring to the original prediction as r* = argmax_{r_i} p(y = r_i | h), we formally introduce these two threshold techniques:
• Max Threshold introduces an additional hyper-parameter T_m, and adjusts the prediction to: ŷ = r* if p(y = r* | h) > T_m, and NONE otherwise.
• Entropy Threshold introduces an additional hyper-parameter T_e. It first calculates the entropy of the prediction, H = - Σ_i p(y = r_i | h) ln p(y = r_i | h), then adjusts the prediction to (Liu et al., 2017): ŷ = r* if H < T_e, and NONE otherwise.
To estimate T_m or T_e, 20% of the instances are sampled from the test set as an additional development set and used to tune the value of T_m and T_e with grid search. After that, we evaluate model performance on the remaining 80% of the test set. We refer to this new dev set as the clean dev, and to the original dev set used for tuning other parameters as the noisy dev. We would like to highlight that tuning thresholds on this clean dev is necessary, as it acts as a bridge between the distantly supervised train set and the human-annotated test set.
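The two adjustment rules can be sketched as follows (a minimal illustration under the notation above; the toy probability vectors and threshold values are our own):

```python
import math

NONE = "NONE"

def max_threshold(p, labels, t_m):
    """Predict NONE when the highest class probability falls below T_m."""
    best = max(range(len(p)), key=lambda i: p[i])
    return labels[best] if p[best] > t_m else NONE

def entropy_threshold(p, labels, t_e):
    """Predict NONE when the entropy of the prediction exceeds T_e."""
    ent = -sum(pi * math.log(pi) for pi in p if pi > 0)
    best = max(range(len(p)), key=lambda i: p[i])
    return labels[best] if ent < t_e else NONE

labels = ["per:title", "org:founded", NONE]
confident = [0.90, 0.05, 0.05]   # low entropy, high max: kept as-is
ambiguous = [0.40, 0.35, 0.25]   # high entropy, low max: rejected to NONE
```

Both rules leave confident predictions untouched and reroute ambiguous ones to NONE, which is exactly what pushes the predicted NONE proportion upward.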

Results and Discussion
Results of three representative models (ReHession, Bi-GRU and Bi-LSTM) using threshold techniques are summarized in Figure 3. Full results are listed in Table 3. We observe significant improvements on distantly supervised datasets (i.e., KBP and NYT), with up to a 19% relative F1 improvement (Bi-GRU from 37.77% to 45.01% on KBP). However, on the human-annotated corpus, the performance gain can hardly be noticed. This inconsistency implies that these heuristics may capture some important but overlooked factors for distantly supervised relation extraction, while their underlying mechanisms remain unclear.
Intuitively, annotations provided by distant supervision differ from human annotations in two ways: (1) False Positive: falsely annotating unrelated entities in a sentence as a certain relation type; (2) False Negative: neglecting related entities by marking their relationship as NONE. The terms false positive and false negative are commonly used to describe wrongful predictions; here we borrow them only to describe label noise. These label noises distort the true label distribution of the train corpora, creating a gap between the label distributions of the train and test sets (i.e., shifted label distribution). With existing denoising methods, the effect of noisy training instances may be reduced; still, it is infeasible to recover the original labels, and thus label distribution shift remains an unsolved problem.
In our experiments, we notice that in distantly supervised datasets, instances labeled as NONE make up a larger portion of the test set than of the train set. It is apparent that the strategy of rejecting "ambiguous" predictions guides the model to predict the NONE type more often, moving the predicted label distribution in a favorable direction. Specifically, in the train set of KBP, 74.25% of instances are annotated as NONE, versus 85.67% in the test set. The original Bi-GRU model annotates 75.72% of instances as NONE, which is close to 74.25%; after applying the max threshold and entropy threshold, this proportion becomes 86.18% and 88.30% respectively, both close to 85.67%.
Accordingly, we believe part of the underlying mechanism of heuristic thresholds is to better handle the label distribution difference between the train and test sets. We further verify this hypothesis with experiments in the next section.

Shifted Label Distribution
In this section, we first summarize our observation on shifted label distribution, and then conduct empirical analysis to study its impact on model performance using synthetic datasets. These datasets are carefully created so that label distribution is the only variable and other factors are controlled.

Shifted Label Distribution
Shifted label distribution refers to the problem that the label distribution of the train set does not align with that of the test set. This problem is related to but different from "learning from imbalanced data", where the data distribution is imbalanced but consistent across the train and test sets. Admittedly, one relation may appear more or less often than another in natural language, creating distribution skews; however, that problem occurs widely in both supervised and distantly supervised settings, and is not our focus in this paper.
Our focus is the label distribution difference between train and test set. This problem is critical to distantly supervised relation extraction, where the train set is annotated with distant supervision and the test set is manually annotated. As previously mentioned in Section 3.2, distant supervision differs from human annotations by introducing "false positive" and "false negative" labels. The label distribution of train set is subject to existing entries in KBs, and thus there exists a gap between label distributions of train and test set.
We visualize the label distributions of KBP (a DS dataset) and a truncated 6-class version of TACRED (a human-annotated dataset) in Fig 1, along with the proportions of instances falling into bins of δ = |p(r_i | D_train) − p(r_i | D_test)|. We observe that KBP and NYT both have shifted label distributions (most instances fall into δ > 0.05 bins), while TACRED has consistent label distributions (all instances fall into δ < 0.05 bins).
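The shift statistic δ above can be computed with a few lines (a sketch; the toy corpora merely mimic the NONE proportions reported for KBP and are not the actual data):

```python
from collections import Counter

def label_distribution(labels):
    """Empirical label distribution p(r_i | D) of a dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {r: c / total for r, c in counts.items()}

def shift_deltas(train_labels, test_labels):
    """delta_i = |p(r_i | D_train) - p(r_i | D_test)| per relation type."""
    p_train = label_distribution(train_labels)
    p_test = label_distribution(test_labels)
    types = set(p_train) | set(p_test)
    return {r: abs(p_train.get(r, 0.0) - p_test.get(r, 0.0)) for r in types}

# toy corpora: NONE is much rarer in train than in test, as on KBP
train = ["NONE"] * 74 + ["per:title"] * 26
test = ["NONE"] * 86 + ["per:title"] * 14
deltas = shift_deltas(train, test)   # both types land in a delta > 0.05 bin
```

Binning instances by the δ of their gold type then gives the histograms shown in Fig 1.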

Impact of Shifted Label Distribution
In order to quantitatively study the impact of label distribution shift, we construct synthetic datasets by sub-sampling instances from the human-annotated TACRED dataset. In this way, the only variable is the label distribution of the synthetic datasets, and the impact of other factors such as label noise is excluded.

Synthetic Dataset Generation. We create five synthetic train sets by sampling sentence-label pairs from the original TACRED train set with label distributions S1-S5. S5 is a randomly generated label distribution (see Fig 7 in Appendix B), and S0 is the label distribution of TACRED's original train set. S1-S4 are created by linear interpolation between S0 and S5, i.e., S_i = ((5 − i) / 5) · S0 + (i / 5) · S5. We apply disproportionate stratification to control the label distribution of the synthetic datasets. In this way, we ensure that the number of sampled instances in each synthetic train set is kept constant within 10000 ± 3, and that the label distribution of each set satisfies S1-S5 respectively.

Results and Discussion. We conduct experiments with three typical models (i.e., Bi-GRU, Bi-LSTM and ReHession) and summarize their results in Fig 4. We observe that, from S1 to S5, the performance of all models consistently drops. This phenomenon verifies that shifted label distribution has a negative influence on model performance, and the negative effect grows as the train set label distribution becomes more distorted.
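The generation procedure can be sketched as follows (a simplified illustration; the toy distributions S0 and S5 and all helper names are ours, and the real construction additionally keeps the total size within 10000 ± 3):

```python
import random
from collections import defaultdict

def interpolate(s0, s5, i, steps=5):
    """S_i = ((steps - i) / steps) * S0 + (i / steps) * S5 (same label keys)."""
    return {r: (steps - i) / steps * s0[r] + i / steps * s5[r] for r in s0}

def stratified_sample(instances, target_dist, n, seed=0):
    """Disproportionate stratification: draw round(n * p(r)) instances per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sent, label in instances:
        by_label[label].append((sent, label))
    sample = []
    for label, p in target_dist.items():
        pool = by_label[label]
        k = min(round(n * p), len(pool))
        sample.extend(rng.sample(pool, k))
    return sample

s0 = {"NONE": 0.8, "per:title": 0.2}   # original train distribution (toy)
s5 = {"NONE": 0.4, "per:title": 0.6}   # random target distribution (toy)
s3 = interpolate(s0, s5, 3)            # interpolated: {"NONE": 0.56, "per:title": 0.44}
pool = [("s%d" % j, "NONE") for j in range(1000)] + \
       [("t%d" % j, "per:title") for j in range(1000)]
subset = stratified_sample(pool, s3, 100)
```

Because per-label quotas are fixed by the target distribution rather than by the pool's own proportions, each synthetic train set matches its S_i regardless of the original skew.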
At the same time, we observe that the feature-based ReHession is more robust to such shift. The gap between ReHession and Bi-GRU steadily decreases, and ReHession eventually starts to outperform the other two at S4. This could account for the "diminishing" phenomenon: neural models such as Bi-GRU are supposed to outperform ReHession by a huge gap (as with S1); however, on distantly supervised datasets, shifted label distribution seriously interferes with performance (as with S4 and S5), and therefore the performance gap appears to diminish.

[Figure 4: F1 scores on synthesized datasets S1-S5. We observe that (a) performance consistently drops from S1 to S5, demonstrating the impact of shifted label distributions; (b) ReHession is more robust to such distribution shift, outperforming Bi-LSTM and Bi-GRU on S4 and S5; (c) thresholding is an effective way to handle such shift.]
Applying Threshold Techniques. We also applied the two threshold techniques to the synthetic datasets and summarize their performance in Fig 4. All three models become more robust to the label distribution shift when a threshold is applied. Threshold techniques consistently make improvements, and the improvements are more significant on S5 (most shifted) than on S0 (least shifted). This observation verifies that the underlying mechanism of threshold techniques helps the model better handle label distribution shift.

Bias Adjustment: An Adaptation Method for Label Distribution Shift
Investigating the probabilistic nature of the softmax classifier, we present a principled domain adaptation approach, bias adjustment (BA), to deal with label distribution shift. This approach explicitly fits the model along such shift by adjusting the bias term in the softmax classifier.

Bias Adjustment
We view corpora with different label distributions as different domains. Denoting the distantly supervised corpus (train set) as D_d and the human-annotated corpus (test set) as D_m, our task becomes to calculate p(y = r_i | h, D_m) based on p(y = r_i | h, D_d).
We assume the only difference between p(y = r_i | h, D_m) and p(y = r_i | h, D_d) is the label distribution, while the semantic meaning of each label is unchanged. Accordingly, we assume p(h | r_i) is universal across domains, i.e.,

p(h | y = r_i, D_d) = p(h | y = r_i, D_m) = p(h | y = r_i).   (2)

As distantly supervised relation extraction models are trained on D_d, the prediction in Equation 1 can be viewed as p(y = r_i | h, D_d), i.e.,

p(y = r_i | h, D_d) = exp(r_i^T h + b_i) / Σ_j exp(r_j^T h + b_j).   (3)

Based on Bayes' theorem, we have, for D ∈ {D_d, D_m}:

p(y = r_i | h, D) = p(h | y = r_i, D) · p(y = r_i | D) / p(h | D).   (4)

Based on the definition of conditional probability and Equation 2, we have:

p(h | y = r_i) = p(y = r_i | h, D_d) · p(h | D_d) / p(y = r_i | D_d).   (5)

With Equations 3, 4 and 5, we can derive that

p(y = r_i | h, D_m) = exp(r_i^T h + b'_i) / Σ_j exp(r_j^T h + b'_j),   (6)

where

b'_i = b_i − ln p(r_i | D_d) + ln p(r_i | D_m).   (7)

[Table 3: F1 scores of RE models with threshold and bias adjustment. 5-run average and standard deviation of F1 scores are reported. ∆ denotes the F1 improvement over the original model. On DS datasets, the four methods targeting label distribution shift achieve consistent performance improvement, with on average 3.83 F1 improvement on KBP and 1.72 on NYT. However, the same four methods fail to improve performance on the human-annotated TACRED.]
With this derivation, we know that under the assumption in Equation 2, we can adjust the prediction to fit a target label distribution given p(r_i | D_d) and p(r_i | D_m). Accordingly, we use Equations 6 and 7 to calculate the adjusted prediction.

Label distribution estimation. The source domain (train) label distribution p(r_i | D_d) can be easily estimated on the train set. As for the target domain (test) distribution p(r_i | D_m), we use maximum likelihood estimation on a held-out clean dev set, which is a 20% sample from the test set. This setting is similar to that of the heuristic threshold techniques.

Implementation details. The bias adjustment method is implemented in two ways:
• BA-Set directly replaces the bias term in Equation 1 with b'_i from Equation 7 during evaluation. No modification to model training is required.
• BA-Fix fixes the bias term in Equation 1 to b_i = ln p(r_i | D_d) during training and replaces it with b_i = ln p(r_i | D_m) during evaluation. Intuitively, BA-Fix encourages the model to better fit our assumption (Equation 2); still, it needs special handling during model training, which is a minor disadvantage of BA-Fix compared with BA-Set.
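BA-Set amounts to a one-line change at evaluation time, sketched below (a minimal illustration of the bias-adjusted softmax; the toy logits and the train/test distributions, which echo the NONE proportions of KBP, are ours):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ba_set(logits, p_train, p_test):
    """BA-Set: replace each bias b_i with b'_i = b_i - ln p(r_i|D_d) + ln p(r_i|D_m),
    which is equivalent to shifting the logit of class i by ln p(r_i|D_m) - ln p(r_i|D_d)."""
    adjusted = [l - math.log(pd) + math.log(pm)
                for l, pd, pm in zip(logits, p_train, p_test)]
    return softmax(adjusted)

# toy example: the model was trained where NONE covers 74% of labels,
# but at test time NONE makes up 86% of instances
logits = [1.0, 0.2]          # pre-softmax scores for [NONE, per:title]
p_train = [0.74, 0.26]       # p(r_i | D_d), estimated on the train set
p_test = [0.86, 0.14]        # p(r_i | D_m), estimated on the clean dev
p_adj = ba_set(logits, p_train, p_test)
```

The adjustment raises the probability of types that are more frequent in the target domain (here NONE) and lowers the rest, without retraining the model.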

Results and Discussion
We conduct experiments to explore the effectiveness of BA-Set and BA-Fix and summarize their performance in Table 3; the resulting label distribution when BA-Fix is used is shown in Fig 5. We find that these two techniques bring consistent improvements to all RE models on both distantly supervised datasets. Notably, in the case of PCNN on the KBP dataset, a 23% relative F1 improvement is observed. At the same time, the same techniques fail to achieve performance improvements on TACRED, the human-annotated dataset. In other words, we find that by explicitly adapting the model along label distribution shift, consistent improvements can be achieved with distant supervision but not with human annotations. This observation again supports our assumption that shifted label distribution is a unique factor for distantly supervised RE that needs special handling.
Summary of all four methods. Compared with heuristic threshold techniques, bias adjustment methods directly address the shifted label distribution issue by adjusting the bias term in the classifier, and are more principled and explainable. Though both threshold and bias adjustment methods require an extra clean dev sampled from the test set, threshold techniques require predicting on the instances in the clean dev and tuning T_m or T_e based on the output probabilities, whereas bias adjustment only needs the label distribution estimated on the clean dev, and is thus more computationally efficient. As for performance, there is no clear indication that one method is consistently stronger than another. However, all four methods (two threshold and two bias adjustment) achieve similar results when applied to RE models: we observe relatively significant performance improvements on DS datasets, but only marginal improvements on the human-annotated dataset.
Comments on neural RE models. On average, bias adjustment methods result in 3.66 F1 improvement on KBP and 2.05 on NYT for neural models, and BA-Fix gains a surprising 7.20 with PCNN on KBP. Noting that only the bias terms in the softmax classifier are modified and only a small piece of extra information is used, this implies that shifted label distribution severely hinders model performance and that the capabilities of state-of-the-art neural models are not fully reflected by traditional evaluation. The hidden representations h learned by neural models indeed capture semantic meanings more accurately than feature-based models, while the bias in the classifier becomes the major obstacle towards better performance.

Comparison with Denoising Methods
In this section, we analyze shifted label distribution and its relation to label noise. We apply a popular label noise reduction method, selective attention (Lin et al., 2016), which groups all sentences with the same entity pair into one bag, conducts multi-instance training, and tries to place more weight on high-quality sentences within the bag. This method, along with the threshold techniques and bias adjustment introduced in previous sections, is applied to two different models (i.e., PCNN and Bi-GRU). We summarize their improvements over the original models in Figure 6. We find that selective attention is indeed effective and improves the performance; meanwhile, the heuristic threshold and bias adjustment approaches also boost performance, and in some cases the boost is even more significant than that of selective attention. This observation is reasonable, since both the heuristic and bias adjustment approaches are able to access additional information from the clean dev (20% of the test set). Still, it is surprising that such a small piece of information brings about such a huge difference, demonstrating the importance of handling shifted label distribution. It also shows that there remains much room for improving distantly supervised RE models from a shifted label distribution perspective.
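For reference, a simplified sketch of selective attention over one bag (we use a plain dot-product score for illustration, whereas Lin et al. (2016) learn a bilinear scoring form; all names and toy vectors are ours):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def selective_attention(sentence_reprs, query):
    """Bag-level representation: attention-weighted sum of sentence vectors.

    Each sentence in the bag (same entity pair) gets a weight proportional to
    exp(h . q), where q represents the candidate relation; sentences that match
    the relation poorly (likely noisy labels) receive low weight.
    """
    scores = [sum(h_k * q_k for h_k, q_k in zip(h, query)) for h in sentence_reprs]
    alphas = softmax(scores)
    dim = len(sentence_reprs[0])
    bag = [sum(a * h[k] for a, h in zip(alphas, sentence_reprs)) for k in range(dim)]
    return bag, alphas

# a bag of three sentences mentioning the same entity pair; the second one
# aligns best with the (hypothetical) relation query vector
bag_reprs = [[0.1, 0.0], [1.0, 1.0], [-0.5, 0.2]]
query = [1.0, 1.0]
bag, alphas = selective_attention(bag_reprs, query)
```

Down-weighting noisy sentences reduces the effect of false labels, but, as argued above, it does not correct the label distribution of the train set itself.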
Related Work

Relation Extraction
Relation extraction aims to identify the relationship between a pair of entity mentions. The task setting varies slightly, as the relation can be extracted either from a bag of sentences (corpus-level) or from one single sentence (sentence-level). In this paper, we focus on sentence-level RE; that is, prediction should be based purely on the information provided within the sentence, instead of external knowledge or commonsense. Recent approaches follow the supervised learning paradigm and rely on considerable amounts of labeled instances to train effective models. Zeng et al. (2014) proposed using CNNs for relation extraction, which can automatically capture features from text. Zeng et al. (2015) further extended this with piecewise max-pooling, i.e., splitting the sentence into three pieces at the subject and object entities, max-pooling over the three pieces separately, and finally concatenating the hidden representations. Lin et al. (2016) applied selective attention over sentences for learning from multiple instances. This method was originally designed for the corpus-level setting; it organizes sentences into bags and assigns lower weights to less relevant sentences in the bag. Zhang et al. (2017) proposed a position-aware LSTM network which incorporates entity position information into encoding and enables the attention mechanism to simultaneously exploit semantic and positional information.

Distant Supervision
In the supervised relation extraction paradigm, one long-standing bottleneck is the lack of large-scale labeled training data. In order to alleviate the dependency on human supervision, Mintz et al. (2009) proposed distant supervision, namely constructing large datasets automatically by aligning text to an existing knowledge base (e.g., Freebase). Distantly supervised relation extraction has also been formulated as a reinforcement learning problem by Feng et al. (2018) for selecting high-quality instances. Similar annotation generation strategies using distant supervision have also been used for other NLP tasks, such as named entity recognition (Shang et al., 2018) and sentiment classification (Go et al., 2009).
Though this strategy lightens annotation burdens, distant supervision inevitably introduces label noise. As the relation types are annotated merely according to the entity mentions in a sentence, the local context may be annotated with labels that are not expressed in the sentence. In recent years, researchers have mainly focused on dealing with label noise and proposed the following methods: Riedel et al. (2010) use a multi-instance single-label learning paradigm; Hoffmann et al. (2011) and Surdeanu et al. (2012) propose multi-instance multi-label learning paradigms. Recently, with the advance of neural network techniques, deep learning methods (Zeng et al., 2015; Lin et al., 2016) have been applied to distantly supervised datasets, with powerful automatic feature extraction and advanced label noise reduction techniques such as selective attention. Liu et al. (2017) proposed a general framework to consolidate heterogeneous information and refine the true labels from noisy labels.
Label noise is certainly an important factor limiting the performance of DS-RE models. Meanwhile, we argue that shifted label distribution is also a performance-limiting aspect; it has been long overlooked and should be handled properly.

Conclusion and Future Work
In this paper, we first present the observation of inconsistent performance when models are trained with human annotations versus distant supervision in the task of relation extraction. This leads us to explore the underlying challenges of distantly supervised relation extraction. Relating two effective threshold techniques to label distribution, we reveal an important yet long-overlooked issue: shifted label distribution. The impact of this issue is further demonstrated with experiments on five synthetic train sets. We also consider this issue from a domain adaptation perspective, introducing a theoretically sound bias adjustment method to recognize and highlight label distribution shift. The bias adjustment methods achieve significant performance improvements on distantly supervised datasets. All of these findings support our argument that shifted label distribution can severely hinder model performance and should be handled properly in future research.
Based on these observations, we suggest that, in addition to label noise, more attention be paid to shifted label distribution in distantly supervised relation extraction research. We hope that the analysis presented will provide new insights into this long-overlooked factor and encourage future research on creating models robust to label distribution shift. We also hope that methods such as threshold techniques and bias adjustment will become useful tools in future research.