Relabel the Noise: Joint Extraction of Entities and Relations via Cooperative Multiagents

Distant supervision based methods for entity and relation extraction have gained increasing popularity because they require little human annotation effort. In this paper, we consider the problem of shifted label distribution, which is caused by the inconsistency between the noisily labeled training set, subject to an external knowledge graph, and the human-annotated test set, and is exacerbated by the pipelined entity-then-relation extraction manner with noise propagation. We propose a joint extraction approach that addresses this problem by re-labeling noisy instances with a group of cooperative multiagents. To handle noisy instances in a fine-grained manner, each agent in the cooperative group evaluates an instance by calculating a continuous confidence score from its own perspective; to leverage the correlations between the two extraction tasks, a confidence consensus module is designed to gather the wisdom of all agents and re-distribute the noisy training set with confidence-scored labels. Further, the confidences are used to adjust the training losses of the extractors. Experimental results on two real-world datasets verify the benefits of re-labeling noisy instances, and show that the proposed model significantly outperforms state-of-the-art entity and relation extraction methods.


Introduction
The extraction of entities and relations has long been recognized as an important task within natural language processing, as it facilitates text understanding. The goal of the extraction task is to identify entity mentions, assign predefined entity types, and extract their semantic relations from text corpora. For example, given the sentence "Washington is the president of the United States of America", an extraction system will find a PRESIDENT OF relation between the PERSON entity "Washington" and the COUNTRY entity "United States of America".
A major challenge of the entity and relation extraction task is the absence of large-scale, domain-specific labeled training data due to expensive labeling efforts. One promising solution to this challenge is distant supervision (DS) (Mintz et al., 2009; Hoffmann et al., 2011), which generates labeled training data automatically by aligning an external knowledge graph (KG) to a text corpus. Despite its effectiveness, the aligning process introduces many noisy labels that degrade the performance of extractors. To alleviate the noise introduced by DS, extensive studies have been performed, such as using probabilistic graphical models (Surdeanu et al., 2012), neural networks with attention (Zeng et al., 2015; Lin et al., 2016), and instance selectors with reinforcement learning (RL) (Qin et al., 2018; Feng et al., 2018).
However, most existing works overlook the shifted label distribution problem (Ye et al., 2019), which severely hinders the performance of DS-based extraction models. Specifically, there is a label distribution gap between the DS-labeled training set and the human-annotated test data, since two kinds of noisy labels, both subject to the aligned KG, are introduced: (1) false positives: entity pairs that are unrelated in the sentence but labeled with a relation from the KG; and (2) false negatives: related entity pairs that are neglected and labeled as NONE. Existing denoising works assign low weights to noisy instances or discard false positives without recovering the original labels, leaving the shifted label distribution problem unsolved.
Moreover, most denoising works assume that the target entities have already been extracted, i.e., entity and relation extraction is processed in a pipelined manner. By extracting entities first and then classifying predefined relations, entity extraction errors are propagated to the relation extractor, introducing more noisy labels and exacerbating the shifted label problem. Besides, there are correlations and complementary information between the two extraction tasks, which are under-utilized but can provide hints to reduce noise more precisely, e.g., it is unreasonable to predict the relation PRESIDENT OF between two COUNTRY entities.
In this paper, to reduce the shifted label distribution gap and further enhance DS-based extraction models, we propose a novel method to re-label the noisy training data and jointly extract entities and relations. Specifically, we incorporate RL to re-label noisy instances and iteratively retrain the entity and relation extractors with adjusted labels, such that the labels can be corrected by trial and error. To leverage the correlations between the two extraction tasks, we train a group of cooperative multiagents to evaluate the instance confidence from different extraction views. Through a proposed confidence consensus module, the instances are re-labeled with confidence-scored labels, and this confidence information is used to adjust the training losses of the extractors. Finally, the performance of the extractors is refined by exploring suitable label distributions with iterative re-training.
Empirical evaluations on two real-world datasets show that the proposed approach can effectively help existing extractors to achieve remarkable extraction performance with noisy labels, and the agent training is efficient with the help of correlations between these two extraction tasks.

Overview
In this research, we aim to refine an entity extractor and a relation extractor trained with DS by incorporating a group of cooperative multiagents. Formally, given a DS training corpus D = {s_1, ..., s_n}, an entity extractor θ_e and a relation extractor θ_r trained on D are input to the multiagents. The agents re-distribute D with confidence-scored labels and output two refined extractors θ*_e and θ*_r using the adjusted labels. Towards this purpose, we model our problem as a decentralized multiagent RL problem, where each agent receives a local environmental observation and takes actions individually without inferring the policies of other agents. It is hard to directly evaluate the correctness of adjusted noisy labels, since we do not know the "gold" training label distribution suitable for the test set. Nonetheless, we can apply RL to judge the re-labeling effect indirectly, using performance scores on an independent validation set as rewards, which are delayed over the extractor re-training. Further, the decentralized setting allows interaction between the distinct information of the entity and relation extractors via the intermediate agents.
As shown in Figure 1, a group of agents acts as confidence evaluators, and the external environment consists of the training instances and the classification results of the extractors. Each agent receives a private observation from the perspective of the entity extractor or the relation extractor, and takes an independent action to compute a confidence score for the instance. These actions (confidence scores) are then considered together by the confidence consensus module, which determines whether the current sentence is positive or negative and assigns a confidence score. Finally, the updated confidences are used to retrain the extractors, and the performance score on the validation set and the consistency score of the two extractors are combined into rewards for the agents.
The proposed method can be regarded as a post-processing plugin for existing entity and relation extraction models. That is, we design a general framework of states, actions and rewards by reusing the inputs and outputs of the extractors.

Confidence Evaluators as Agents
A group of cooperative multiagents is used to evaluate the confidence of each instance. These multiagents are divided into two subgroups, which act from the entity and relation perspectives respectively. There can be multiple agents in each subgroup, for the purpose of scaling to larger observation and action spaces for better performance. Next, we detail the states, actions and rewards of these agents.
States The states S_e for entity-view agents and S_r for relation-view agents represent their own viewpoints for evaluating instance confidence. Specifically, entity-view agents evaluate sentence confidence according to three kinds of information: the current sentence, the entity extraction results (typed entities), and the noisy label types. Similarly, relation-view agents make their decisions depending on the current sentence, the relation types from the relation extractor, and the noisy label types from DS.
Most entity and relation extractors encode the semantic and syntactic information of extracted sentences into low-dimensional embeddings as their inputs. We also encode entity types and relation types into embeddings; some extractors, such as CoType (Ren et al., 2017), have already learned these vectors. Given the reused extractors, we denote the encoded sentence vector as s, the extracted type vectors as t_e and t_r for entities and relations respectively, and the DS type vectors as t_e^d and t_r^d. We reuse the sentence and type vectors of the base extractors to make our approach lightweight and pluggable. Finally, we average the extracted and DS type embeddings to decrease the size of the observation space, and concatenate them with the sentence embedding s to form the states S_e and S_r for entity/relation agents respectively:

S_e = s ⊕ (t_e + t_e^d)/2,  S_r = s ⊕ (t_r + t_r^d)/2,  (1)

where ⊕ denotes concatenation. Note that some semantics are already encoded into the type vectors, e.g., the margin-based loss used in CoType enforces that type vectors are closer to their candidate type vectors than to any non-candidate types. Intuitively, in the representation space, the average operation yields the midpoint of the extracted type vector and the DS type vector, which partially preserves the distance relations between the two vectors and other type vectors, and thus helps form distinguishable states.
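The state construction of Equation (1) is a concatenation with a type-midpoint; a minimal NumPy sketch, assuming pre-computed embeddings (all names and dimensions here are illustrative, not from the paper):

```python
import numpy as np

def build_state(sent_emb, extracted_type_emb, ds_type_emb):
    """Form an agent state: sentence embedding concatenated with the
    midpoint of the extracted-type and DS-type embeddings."""
    avg_type = (extracted_type_emb + ds_type_emb) / 2.0  # midpoint of the two type vectors
    return np.concatenate([sent_emb, avg_type])

s = np.ones(4)                          # hypothetical sentence embedding
t_e = np.array([1.0, 0.0, 1.0, 0.0])    # extracted entity-type embedding
t_e_d = np.array([0.0, 1.0, 1.0, 0.0])  # DS entity-type embedding
state = build_state(s, t_e, t_e_d)      # 8-dimensional state S_e
```

Averaging rather than concatenating the two type vectors keeps the observation dimension fixed regardless of how many type sources are combined.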
Actions To assign confidence in a fine-grained manner and accelerate the learning procedure, we adopt a continuous action space. Each agent uses a neural policy network Θ to determine whether the current sentence is positive (conforms with the extracted type t_i) or negative ("None" type), and computes a confidence score c. We model this action as conditional probability prediction, i.e., estimating the probability, given the extracted type t_i and the current state S, as the confidence: c = p(positive | t_i, Θ, S). We adopt a gated recurrent unit (GRU) as the policy network, which outputs the probability value through a sigmoid function. A probability value (confidence score) close to 1/0 means that the agent votes the sentence as positive/negative with high weight.
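The paper's policy network is a GRU; as a minimal stand-in, a single logistic layer illustrates the sigmoid confidence output only (weights are random and purely illustrative):

```python
import numpy as np

def confidence(state, W, b):
    """c = p(positive | t_i, Theta, S): squash a linear score into (0, 1)
    via the sigmoid, mirroring the policy network's output layer."""
    return 1.0 / (1.0 + np.exp(-(W @ state + b)))

rng = np.random.default_rng(0)
state = rng.normal(size=8)        # a state vector as in Equation (1)
W, b = rng.normal(size=8), 0.0    # stand-in policy parameters
c = confidence(state, W, b)       # continuous confidence score in (0, 1)
```

In the full model the GRU's hidden state replaces the raw state vector before this readout.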
To handle huge state spaces (e.g., there are thousands of target types in our experimental datasets) and make our approach scalable, we divide and conquer the state space by using more than one agent in the entity-view and relation-view groups. The target type set is divided equally by agent number, and each agent is in charge of only a part of the types. Based on this allocation and the DS labels, a sentence is evaluated by only one relation agent and two entity agents at a time, while the other agents are masked.
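The divide-and-conquer allocation can be sketched as a simple equal partition of the type set (the round-robin scheme and function names are our illustration, not necessarily the paper's exact allocation):

```python
def assign_types(type_list, n_agents):
    """Partition target types equally among agents: type -> agent index."""
    return {t: i % n_agents for i, t in enumerate(type_list)}

def active_agent(owners, ds_label):
    """Only the agent owning the sentence's type acts; the rest are masked."""
    return owners[ds_label]

owners = assign_types(["PERSON", "COUNTRY", "ORG", "LOC"], 2)
agent_id = active_agent(owners, "COUNTRY")
```

Each agent then only needs a policy over its own slice of the type set, which keeps individual observation and action spaces small.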

Re-labeling with Confidence Consensus
To leverage the wisdom of the crowd, we design a consensus strategy for the confidences evaluated by the multiagents. This is conducted in two steps: gathering confidences and re-labeling with a confidence score. Specifically, we calculate an averaged score c = c_sum / 3, where c_sum is the sum of all agent confidences and the divisor 3 reflects that exactly three agents evaluate each sentence due to the masking strategy above. Then we label the current sentence as negative ("None" type) with confidence C = 1 − c if c ≤ 0.5; otherwise we label it as positive (replacing the noisy label with the extracted type) with confidence C = c. This procedure can be regarded as weighted voting; it re-distributes the training set with confidence-scored labels as shown in the right part of Figure 1, where some falsely labeled instances are put into their intended positions or assigned low confidences.
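The two consensus steps above can be sketched directly (the relation label string is a placeholder):

```python
def consensus(agent_scores, extracted_type):
    """Average the three voting agents' scores, then re-label:
    c <= 0.5 -> negative ("None") with confidence C = 1 - c,
    c  > 0.5 -> positive (extracted type) with confidence C = c."""
    c = sum(agent_scores) / 3.0
    if c <= 0.5:
        return "None", 1.0 - c
    return extracted_type, c

label, conf = consensus([0.9, 0.8, 0.7], "PRESIDENT OF")  # unanimous positive vote
```

Note that divergent votes (e.g. one high and two low scores) land near 0.5 and therefore produce a low consensus confidence, which later damps the training loss.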
Rewards The reward of each agent is composed of two parts: a shared global reward g expressing the correlations among the sub-tasks, and separate local rewards restricting the reward signals to the three agents responsible for each sentence (recall that each sentence is evaluated by different agents w.r.t. their responsible types). Specifically, the global reward g can give hints for denoising, and here we adopt a general, translation-based triple score as used in TransE (Bordes et al., 2013), g = −‖t_1 + t_r − t_2‖, where t_1, t_r and t_2 are the embeddings of the triple (E_1, R, E_2), pre-trained by TransE. The score measures the semantic consistency of each triple and can easily be extended with many other KG embedding methods (Wang et al., 2017). As for the separate local rewards, we use the F1 scores F1_e and F1_r gained by the entity extractor and the relation extractor on an independent validation set to reflect extractor performance. Finally, to control the proportions of the two reward parts, we introduce a hyper-parameter α, which is shared for ease of scaling to multiple agents:

r_e = g + α · F1_e,  r_r = g + α · F1_r.  (2)
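A sketch of the two-part reward, assuming it combines the TransE triple score and the validation F1 additively with weight α (the exact form of Equation (2) in the original may differ in detail):

```python
import numpy as np

def transe_score(t1, tr, t2):
    """Translation-based triple score as in TransE: -||t1 + tr - t2||,
    so a perfectly consistent triple scores 0 and worse triples go negative."""
    return -np.linalg.norm(t1 + tr - t2)

def reward(t1, tr, t2, f1, alpha=2.0):
    """Shared global consistency score plus alpha-weighted local F1."""
    return transe_score(t1, tr, t2) + alpha * f1

t1 = np.array([1.0, 0.0])
tr = np.array([0.0, 1.0])
t2 = np.array([1.0, 1.0])        # t1 + tr == t2: a fully consistent triple
r = reward(t1, tr, t2, f1=0.7)
```

The default alpha=2.0 matches the value the paper reports selecting by grid search.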

Training Algorithm
Pre-training Many RL-based models introduce pre-training strategies to improve agent training efficiency (Qin et al., 2018; Feng et al., 2018). In this study, we pre-train our models in two respects: (1) we first pre-train the entity and relation extractors to be refined, as environment initialization, which is vital to provide reasonable agent states (embeddings of sentences and extracted types); (2) we then pre-train the policy networks of the agents to gain a preliminary ability to evaluate confidence. To guide the instance confidence evaluation, we extract a small part of the validation data. The relatively clean DS type labels of these data are used to form states, binary labels are assigned accordingly, and the policy networks are pre-trained for several epochs. Although the binary labels from the validation data are not exactly continuous confidences, this approximate training strategy gives the policy networks a better parameter initialization than random initialization.
Iterative Re-training With the pre-trained extractors and policy networks, we retrain the extractors and agents as detailed in Algorithm 1. The agents refine the extractors in each epoch, and we record the parameters of the extractors that achieve the best F1 performance. For each data batch, the entity and relation extractors perform extraction, form the states S_e and S_r as in Equation (1), and send them to the entity and relation agents respectively. The agents then take actions (evaluate confidences) and re-distribute the instances based on the confidence consensus module (Section 2.2). Finally, the extractors are trained with the confidences and give rewards as in Equation (2).
Curriculum Learning for Multiagents It is difficult for many RL agents to learn from scratch. In this study, we extend the curriculum learning strategy (Bengio et al., 2009) to our cooperative multiagents. The motivation is that we can leverage the complementarity of the two tasks and enhance agent exploration by smoothly increasing the policy difficulty. More specifically, we maintain a priority queue and sample instances ordered by their reward values. Once the reward of the current sentence exceeds the training reward threshold r_threshold, or the queue is full, we update the agent policies using the Proximal Policy Optimization (PPO) (Schulman et al., 2017) algorithm, which achieves good performance in many continuous control tasks. Algorithm 2 details the training procedure.
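The queueing logic of the curriculum can be sketched with a standard heap; the PPO update itself is omitted, and the trigger condition follows the description above (a sketch, not the paper's exact algorithm):

```python
import heapq

def curriculum_step(queue, max_size, sentence, reward, r_threshold):
    """Queue a (reward, sentence) pair; return True when a policy update
    should fire, i.e., the reward exceeds the threshold or the queue is full."""
    heapq.heappush(queue, (reward, sentence))
    return reward > r_threshold or len(queue) >= max_size

q = []
update1 = curriculum_step(q, 3, "sent-1", 0.2, r_threshold=0.9)   # keep collecting
update2 = curriculum_step(q, 3, "sent-2", 0.95, r_threshold=0.9)  # exceeds threshold
```

Ordering instances by reward lets the agents consume easy (high-reward, semantically consistent) sentences first, which is the curriculum effect.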
Multiagents Setup To evaluate the ability of our approach to refine existing extractors, we choose two basic extractors for our Multiagent RL approach, CoType and PCNN, and denote them as MRL-CoType and MRL-PCNN respectively.
Since PCNN is a pipelined method, we reuse a pre-trained and fixed CoType entity extractor and adopt PCNN as the base relation extractor to adapt it to the joint manner. For CoType, we use the implementation of the original paper (https://github.com/INK-USC/DS-RelationExtraction), and adopt the same sentence dimension, type dimension and hyper-parameter settings as reported in (Ren et al., 2017). For PCNN, we set the number of kernels to 230 and the window size to 3. For the KG embeddings, we set the dimension to 50 and pre-train them with TransE. We use Stochastic Gradient Descent with a cosine-annealing learning rate scheduler to optimize both the agents and the extractors; the learning rate range and batch size are set to [1e-4, 1e-2] and 64 respectively. We implement our RL agents using a scalable RL library, RLlib (Liang et al., 2018), and adopt 2/8 relation agents and 2/16 entity agents for the Wiki-KBP/BioInfer datasets respectively, according to the scales of their type sets. For the multiagents, due to the limitation of RL training time, we set the PPO parameters to the default RLlib settings and perform preliminary grid searches for the other parameters. For the PPO algorithm, we set the GAE lambda parameter to 1.0 and the initial coefficient for the KL divergence to 0.2. The loss adjusting factor λ is searched among {1, 2, 4} and set to 2; the reward control factor α is searched among {2e-1, 1, 2, 4} and set to 2. For all agents, the GRU dimension is searched among {32, 64}; the 64 setting achieved slightly better performance than 32 but leads to higher memory overhead per agent, hence we set it to 32 to enable a larger number of agents.

Performance on Entity Extraction
We adopt the Macro-F1, Micro-F1 and Strict-F1 metrics (Ling and Weld, 2012) in the entity extraction evaluation. For Strict-F1, an entity prediction is considered "strictly" correct if and only if the true set of entity tags equals the predicted set. The results are shown in Table 2, and we can see that our approach effectively refines the base extractors and outperforms all baseline methods on all metrics. Note that the improvement on BioInfer is significant (t-test with p < 0.05) even though BioInfer has a large entity type set (2,200 types) and the base extractor CoType already achieves a high performance (0.74 S-F1), which shows that our agents are capable of leading entity extractors towards a better optimum under noisy labels.
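The strict criterion can be made concrete with a simplified sketch that assumes every mention receives a prediction, so strict precision and recall coincide (a simplification of the full Ling and Weld metric):

```python
def strict_f1(gold_sets, pred_sets):
    """A mention is 'strictly' correct only when its predicted tag set
    exactly equals the gold tag set."""
    exact = sum(g == p for g, p in zip(gold_sets, pred_sets))
    p = r = exact / len(gold_sets)
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

f1 = strict_f1([{"PERSON"}, {"ORG", "COMPANY"}],
               [{"PERSON"}, {"ORG"}])  # only the first mention matches exactly
```

Under this criterion a partially correct tag set (the second mention) earns no credit, which is what makes Strict-F1 the harshest of the three metrics.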

Performance on Relation Extraction
Another comparison is on the end-to-end relation extraction task; we report precision, recall and F1 in Table 3, which illustrates that: (1) our method achieves the best F1 on Wiki-KBP, outperforms all baselines on all metrics on the BioInfer data, and significantly refines both base extractors, PCNN and CoType (t-test with p < 0.05), demonstrating the effectiveness of our approach.
(2) The improvements for CoType are larger than those for PCNN, since CoType is a joint extraction model and leverages the multiagents better than the single-task extractor with a fixed entity extractor. This shows the benefit of the correlations between the two extraction tasks.
(3) Using the same base relation extractor, MRL-PCNN achieves significantly better improvements than RRL-PCNN (t-test with p < 0.05). Besides, the precision of RRL-PCNN is relatively worse than its recall, mainly because of the noise propagation from entity extraction and its binary discard-or-retain action. By contrast, our model achieves better and more balanced results by leveraging the cooperative multiagents with fine-grained confidences.
(4) MRL-PCNN gains performance comparable to BA-Fix-PCNN, which leverages additional information from the test set to adjust the softmax classifier. This verifies the effectiveness and robustness of the proposed RL-based re-labeling method in reducing the shifted label distribution gap without knowing the test set.

Ablation Analysis
To evaluate the impact of the curriculum learning strategy and the joint learning strategy of our method, we compare three training settings: curriculum learning, joint training, and separate training; the results are shown in Table 4. The curriculum setting and the joint setting achieve much better results than the separate training setting. This shows the superiority of cooperative multiagents over single-view extraction, which evaluates confidences with limited information. Besides, the curriculum setting achieves better results than the joint setting, especially on the BioInfer data, which has a larger type set and is more challenging than Wiki-KBP. This indicates the effectiveness of the curriculum learning strategy, which enhances the model's ability to handle a large state space with gradual exploration.
Training efficiency is an important issue for RL methods, since the agents face the exploration-exploitation dilemma. We also compare the three settings from the view of model training (Figure 2). The curriculum learning setting achieves higher rewards than the other two settings with fewer epochs, since convergence to a local optimum can be accelerated by smoothly increasing the instance difficulty, and the multiagents provide a regularization effect.

Re-labeling Study
To gain insight into the proposed method, we conduct a statistical analysis of the final re-labeled instances. Figure 3 reports the results and shows that our approach identifies noisy instances among both positives and negatives, and leverages them in a fine-grained manner compared with the discard-or-retain strategy. Besides, the instances re-labeled from negative to positive take a larger proportion than those with the inverse re-labeling, especially on the Wiki-KBP data. This accords with the fact that many noisy labels are "None" in the DS setting. Note that some instances are re-labeled with divergent evaluations between the entity-view and relation-view agents; these usually get low confidences through the consensus module and have a small impact on the optimization thanks to the damped losses. We further sample two sentences to illustrate the re-labeling process. In Table 5, the first sentence has a noisy relation label None, while the relation extractor recognizes a country_of_birth relation. Based on the extracted type, the relation-view agent evaluates it as a confident positive instance due to the typical pattern "born in" in the sentence. The entity-view agents also evaluate it as positive, with relatively lower confidences, and finally the sentence is re-labeled as positive by the consensus module. For the second sentence, the agents disagree on whether it is positive. With the help of the diverse extraction information, the consensus module re-labels the instance with a low confidence score, further alleviating the performance harm through loss damping.

Related Works
Many entity and relation extraction methods have been proposed in the pipelined fashion, i.e., performing named entity recognition (NER) first and then relation classification. Traditional NER systems usually focus on a few predefined types with supervised learning (Yosef et al., 2012). However, expensive human annotation blocks large-scale training data construction. Recently, several efforts on DS and weak supervision (WS) for NER have been made to address the training data bottleneck (Yogatama et al., 2015; Yang et al., 2018). For relation extraction, there are also many DS methods (Mintz et al., 2009; Min et al., 2013; Zeng et al., 2015; Han and Sun, 2016; Ji et al., 2017; Lei et al., 2018) and WS methods (Jiang, 2009; Ren et al., 2016; Deng et al., 2019) that address the limitations of supervised methods. Our method can be applied to a large number of these extractors as a post-processing plugin, since DS and WS usually introduce much noise.
A recent work, CrossWeigh (Wang et al., 2019), estimates label mistakes and adjusts the weights of sentences in the NER benchmark CoNLL03. It focuses on the noise in supervised "gold standard" labels, while we focus on the noise in automatically constructed "silver standard" labels. Moreover, we deal with the noise by considering the shifted label distribution problem, which is overlooked by most existing DS works. Ye et al. (2019) analyze this issue and improve performance significantly by using distribution information from the test set. In this paper, we propose to use RL to explore suitable label distributions by re-distributing the training set with confidence-scored labels, which is practical and robust to label distribution shift, since we may not know the distribution of the test set in real-world applications.
Another extraction manner is joint extraction, such as methods based on neural networks with parameter sharing (Miwa and Bansal, 2016), joint representation learning (Ren et al., 2017) and a new tagging scheme (Zheng et al., 2017). However, these works perform extraction without explicitly handling the noise. Our approach introduces multiagents to the joint extraction task and explicitly models sentence confidences. As for RL-based methods, Zeng et al. (2018) introduce an RL agent as a bag-level relation predictor, while Qin et al. (2018) and Feng et al. (2018) use agents as instance selectors to discard noisy instances at the sentence level. Different from the binary action strategy and the exclusive focus on false positives in these works, we adopt a continuous action space (confidence evaluation) and handle the noise in a fine-grained manner.
The binary selection strategy is also adopted in a related study, Reinforced Co-Training (Wu et al., 2018), which uses an agent to select instances and help classifiers form auto-labeled datasets. An important difference is that they select unlabeled instances, while we evaluate noisy instances and re-label them. More recently, HRL (Takanobu et al., 2019) uses a hierarchical agent to first identify relation indicators and then entities. Different from the single task-switching agent of that work, we leverage a group of multiagents, which can serve as a pluggable helper for existing extraction models.

Conclusions
To deal with noisy labels and the accompanying shifted label distribution problem in distant supervision, we propose in this paper a novel method to jointly extract entities and relations through a group of cooperative multiagents. To make full use of each instance, each agent evaluates the instance confidence from a different view, and a confidence consensus module is designed to re-label noisy instances with confidences. Thanks to the exploration of suitable label distributions by the RL agents, the confidences are further used to adjust the training losses of the extractors, alleviating the potential harm caused by noisy instances.
To demonstrate the effectiveness of the proposed method, we evaluate it on two real-world datasets and the results confirm that the proposed method can significantly improve extractor performance and achieve effective learning.

Figure 1 :
Figure 1: Overview of the proposed method. A group of multiagents is leveraged to evaluate the confidences of noisy instances from different extraction views. Base extractors are refined by iteratively training on the re-distributed instances with confidence-scored labels.
Correction for Extractors With the evaluated confidences and re-labeled instances, we adjust the training losses of the entity and relation extractors to alleviate the performance harm from noise and the shifted label distribution. Denoting the original loss of an extractor as ℓ, the new loss is adjusted by an exponential scaling factor λ and the confidence C as ℓ' = C^λ · ℓ. Intuitively, a small confidence score C and a large λ indicate that the current instance has almost no impact on the model optimization. This can alleviate the side-effects caused by noise.

Algorithm 1 Training Framework for Extractors
Input: noisy training data D, pre-trained entity extractor θ_e, pre-trained relation extractor θ_r
Output: refined entity/relation extractors θ*_e, θ*_r
1: pre-train the policy networks of the agents based on θ_e and θ_r
2: init: best F1*_e ← F1(θ_e), best F1*_r ← F1(θ_r)
3: for epoch i = 1 → N do
4:  init: current extractor parameters θ'_e ← θ_e, θ'_r ← θ_r
5:  for batch d_i ∈ D do
6:   extractors generate S_e / S_r as in Equation (1)
…   [agents evaluate confidences; the consensus module re-labels d_i; θ'_e / θ'_r are retrained with scaled losses ℓ_e / ℓ_r]
10:  calculate rewards r_e and r_r as in Equation (2)
…
  end for
…
14: end for
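The loss correction is a one-line scaling; a minimal sketch (the default lam=2.0 matches the grid-searched λ reported in the experiments):

```python
def damped_loss(loss, confidence, lam=2.0):
    """Scale the original loss by C**lambda; low-confidence (likely noisy)
    instances then barely influence the gradient."""
    return (confidence ** lam) * loss

high = damped_loss(1.0, 0.9)  # confident instance: most of the loss is kept
low = damped_loss(1.0, 0.2)   # divergent votes -> small C -> heavily damped
```

Raising C to the power λ makes the damping super-linear, so instances with near-tied agent votes (C close to 0.5 or below) contribute very little to the update.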

Figure 2 :
Figure 2: Smoothed average rewards on the Wiki-KBP data for two agents of MRL-CoType. The light-colored lines are un-smoothed rewards.

Figure 3 :
Figure 3: Proportions of re-labeled instances for MRL-CoType. "N-to-P" denotes instances re-labeled from negative to positive. "divergent" means that the entity agents and the relation agent have different evaluations of whether the instance is positive or negative.

Datasets: Wiki-KBP, whose training sentences are sampled from Wikipedia articles and whose test set is manually annotated from the 2013 KBP slot filling task; and BioInfer, which is sampled and manually annotated from biomedical paper abstracts. The two datasets vary in domain and in the scale of the type set; detailed statistics are shown in Table 1.

Table 1 :
Dataset statistics. M_r and M_e denote relation and entity mentions respectively.

Algorithm 2 (excerpt)
Input: data batch d_i, queue size l, pre-trained policy network with parameters Θ
Output: policy network parameters Θ
1: initialize an empty priority queue q with size l
2: for sentence s_j ∈ d_i do …

Table 2 :
NER performance on the two datasets; averages over 3 runs with standard deviations are reported.

Table 3 :
End-to-end relation extraction performance; averages over 3 runs with standard deviations are reported.

Table 4 :
Ablation results of the MRL-CoType for end-to-end relation extraction.

Table 5 :
Confidence evaluations on two noisy instances using MRL-CoType.