DIAG-NRE: A Neural Pattern Diagnosis Framework for Distantly Supervised Neural Relation Extraction

Pattern-based labeling methods have achieved promising results in alleviating the inevitable labeling noise of distantly supervised neural relation extraction. However, these methods require significant expert labor to write relation-specific patterns, which makes them too sophisticated to generalize quickly. To ease the labor-intensive workload of pattern writing and enable quick generalization to new relation types, we propose a neural pattern diagnosis framework, DIAG-NRE, that can automatically summarize and refine high-quality relational patterns from noisy data with human experts in the loop. To demonstrate the effectiveness of DIAG-NRE, we apply it to two real-world datasets and present both significant and interpretable improvements over state-of-the-art methods.


Introduction
Relation extraction aims to extract relational facts from plain text and can benefit downstream knowledge-driven applications. A relational fact is defined as a relation between a head entity and a tail entity, e.g., (Letizia Moratti, Birthplace, Milan). Conventional methods often regard relation extraction as a supervised classification task that predicts the relation type between two detected entities mentioned in a sentence, including both statistical models (Zelenko et al., 2003; Zhou et al., 2005) and neural models (Zeng et al., 2014; dos Santos et al., 2015).
These supervised models require a large amount of human-annotated data to train, which is both expensive and time-consuming to collect. Therefore, Craven et al. (1999) and Mintz et al. (2009) proposed distant supervision (DS) to automatically generate training data for relation extraction, by aligning relational facts from a knowledge base (KB) to plain text and assuming that every sentence mentioning two entities can describe their relationships in the KB. As DS can acquire large-scale data without human annotation, it has been widely adopted by recent neural relation extraction (NRE) models (Zeng et al., 2015). Although DS is both simple and effective in many cases, it inevitably introduces intolerable labeling noises. As Figure 1 shows, there are two types of error labels, false negatives and false positives. False negatives arise when a sentence does describe the target relation between two entities, but the fact has not been covered by the KB yet. False positives arise because not all sentences mentioning an entity pair actually express their relation in the KB. The noisy-labeling problem can become severe when the KB and text do not match well and, as a result, heavily weaken the model performance (Riedel et al., 2010).
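The DS labeling assumption and its two error types can be illustrated with a minimal sketch. The toy KB, the second example sentence, and the function name `ds_label` are hypothetical and only serve the illustration; real DS pipelines rely on entity linking rather than substring matching.

```python
# Toy knowledge base: (head, tail) -> relation. Entries are illustrative only.
KB = {("Letizia Moratti", "Milan"): "Birthplace"}

def ds_label(sentence, head, tail, kb=KB):
    """Apply the DS assumption: if both entities are mentioned and the pair
    is in the KB, assign the KB relation; otherwise assign NA."""
    if head in sentence and tail in sentence:
        return kb.get((head, tail), "NA")
    return "NA"

# False positive: the pair is in the KB, but the sentence does not express Birthplace.
fp = ds_label("Letizia Moratti gave a speech in Milan.", "Letizia Moratti", "Milan")

# False negative: the sentence expresses Birthplace, but the fact is missing from the KB.
fn = ds_label("Ada Rossi was born in Turin.", "Ada Rossi", "Turin")
```

Here `fp` receives the label Birthplace although the sentence only reports a speech, and `fn` receives NA although the sentence states a birthplace.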
Recent research has realized that introducing appropriate human efforts is essential for reducing such labeling noises. For example, Zhang et al. (2012), Pershina et al. (2014) and Angeli et al. (2014) mixed a small set of crowd-annotated labels with purely DS-generated noise labels. However, they found that only sufficiently large and high-quality human labels can bring notable improvements, because noise labels are far more numerous.
To enlarge the impact of human efforts, Ratner et al. (2016) and Liu et al. (2017a) proposed to incorporate pattern-based labeling, where the key idea was to regard both DS and pattern-based heuristics as weak supervision sources and develop a weak-label-fusion (WLF) model to produce denoised labels. However, the major limitation of the WLF paradigm lies in the requirement of human experts to write relation-specific patterns. Unfortunately, writing good patterns is both a high-skill and labor-intensive task that requires experts to learn detailed pattern-composing instructions, examine adequate examples, tune patterns for different corner cases, etc. For example, the spouse relation example of Ratner et al. (2016) uses 11 functions with over 20 relation-specific keywords. Even worse, when generalizing to a new relation type, we need to repeat the hard manual operations mentioned above.
To ease the pattern-writing work of human experts and enable the quick generalization to new relation types, we propose a neural pattern diagnosis framework, DIAG-NRE, which establishes a bridge between DS and WLF, for common NRE models. The general workflow of DIAG-NRE, as Figure 2 shows, contains two key stages: 1) pattern extraction, extracting potential patterns from NRE models by employing reinforcement learning (RL), and 2) pattern refinement, asking human experts to annotate a small set of actively selected examples. Following these steps, we not only minimize the workload and difficulty of human experts by generating patterns automatically, but also enable the quick generalization by only requiring a small number of human annotations. After the processing of DIAG-NRE, we obtain high-quality patterns that are either supportive or unsupportive of the target relation with high probabilities and can feed them into the WLF stage to get denoised labels and retrain a better model. To demonstrate the effectiveness of DIAG-NRE, we conduct extensive experiments on two real-world datasets, where DIAG-NRE not only achieves significant improvements over state-of-the-art methods but also provides insightful diagnostic results for different noise behaviors via refined patterns.
In summary, DIAG-NRE has the following contributions:
• easing the pattern-writing work of human experts by generating patterns automatically;
• enabling the quick generalization to new relation types by only requiring a small number of human annotations;
• presenting both significant and interpretable performance improvements as well as intuitive diagnostic analyses.
Particularly, for one relation with severe false negative noises, we improve the F1 score by about 0.4.
To the best of our knowledge, we are the first to explicitly reveal and address this severe noise problem for that dataset.

Related Work
To reduce labeling noises of DS, earlier work attempted to design specific model architectures that can better tolerate labeling noises, such as the multi-instance learning paradigm (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Wu et al., 2017). These models relax the raw assumption of DS by grouping multiple sentences that mention the same entity pair together as a bag and then assuming that at least one sentence in this bag expresses the relation. This weaker assumption can alleviate the noisy-labeling problem to some extent, but this problem still exists at the bag level, and Feng et al. (2018) discovered that bag-level models struggled to do sentence-level predictions. Later work tried to design a dynamic label-adjustment strategy for training (Liu et al., 2017b; Luo et al., 2017). In particular, the most recent work (Feng et al., 2018; Qin et al., 2018) adopted RL to train an agent that interacts with the NRE model to learn how to remove or alter noise labels. These methods work without human intervention by utilizing the consistency and difference between DS-generated labels and model-predicted ones. However, such methods can neither discover noise labels that coincide with the model predictions nor explain the reasons for removed or altered labels. As discussed in the introduction, introducing human efforts is a promising direction to contribute both significant and interpretable improvements, which is also the focus of this paper.
As for the pattern-extraction part, we note that there are some methods with similar insights but different purposes. For example, Zhang et al. (2018) improved the performance of the vanilla LSTM (Hochreiter and Schmidhuber, 1997) by utilizing RL to discover structured representations, and other work interpreted the sentiment prediction of neural models by employing RL to find the decision-changing phrases. However, NRE models are unique because we only care about the semantic inter-entity relation mentioned in the sentence. To the best of our knowledge, we are the first to extract patterns from NRE models by RL.
We also note that the relational-pattern mining has been extensively studied (Califf and Mooney, 1999;Carlson et al., 2010;Nakashole et al., 2012;Jiang et al., 2017). Different from those studies, our pattern-extraction method 1) is simply based on RL, 2) does not rely on any lexical or syntactic annotation, and 3) can be aware of the pattern importance via the prediction of NRE models. Besides, Takamatsu et al. (2012) inferred negative syntactic patterns via the example-pattern-relation co-occurrence and removed the false-positive labels accordingly. In contrast, built upon modern neural models, our method not only reduces negative patterns to alleviate false positives but also reinforces positive patterns to address false negatives at the same time.

Methodology
Provided with DS-generated data and NRE models trained on them, DIAG-NRE can generate high-quality patterns for the WLF stage to produce denoised labels. As Figure 2 shows, DIAG-NRE contains two key stages in general: pattern extraction (Section 3.2) and pattern refinement (Section 3.3). Moreover, we briefly introduce the WLF paradigm in Section 3.4 for completeness. Next, we start with reviewing the common input-output schema of modern NRE models.

NRE Models
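Given an instance s with T tokens, a common input representation of NRE models is $x = [x_1, x_2, \cdots, x_T] \in \mathbb{R}^{d_x \times T}$, where each column $x_i = [w_i; p_i]$ concatenates the word vector $w_i \in \mathbb{R}^{d_w}$ and the position vector $p_i \in \mathbb{R}^{d_p}$ of token i to be aware of both semantics and entity positions, where $d_x = d_w + d_p$. Given the relation type r, NRE models perform different types of tensor manipulations on x and obtain the predicting probability of r given the instance s as $P_\phi(r|x)$, where $\phi$ denotes model parameters except for the input embedding tables.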

Pattern Extraction
In this stage, we build a pattern-extraction agent to distill relation-specific patterns from NRE models with the aforementioned input-output schema. The basic idea is to erase irrelevant tokens and preserve the raw target prediction simultaneously, which can be modeled as a token-erasing decision process and optimized by RL. Figure 3 shows this RL-based workflow in a general way together with an intuitive pattern-induction example. Next, we elaborate details of this workflow.
Action. The agent takes an action $a_i$, retaining (0) or erasing (1), for each token of the instance s and transforms the input representation from $x$ into $\hat{x}$. During this process, the column i of x, $x_i = [w_i; p_i]$, is transformed into $\hat{x}_i = [\hat{w}_i; p_i]$, where the position vectors are left untouched and the new word vector $\hat{w}_i$ is adjusted based on the action taken by the agent. For the retaining action, we retain the raw word vector as $\hat{w}_i = w_i$. For erasing, we set $\hat{w}_i$ to be all zeros to remove the semantic meaning. After taking a sequence of actions, $a = [a_1; a_2; \cdots; a_T]$, we get the transformed representation $\hat{x}$ with $\hat{T}$ tokens retained.
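The token-erasing transformation can be sketched as follows. This is a minimal illustration with plain Python lists; the function name `erase_tokens` and the toy vectors are hypothetical, and each column is assumed to be the concatenation of a word vector and a position vector.

```python
def erase_tokens(word_vecs, pos_vecs, actions):
    """Build the transformed columns: position vectors are untouched,
    word vectors are zeroed when the action is 1 (erase) and kept when 0 (retain)."""
    assert len(word_vecs) == len(pos_vecs) == len(actions)
    x_hat = []
    for w, p, a in zip(word_vecs, pos_vecs, actions):
        w_hat = [0.0] * len(w) if a == 1 else list(w)
        x_hat.append(w_hat + list(p))  # column = [w_hat; p]
    num_retained = actions.count(0)    # number of retained tokens
    return x_hat, num_retained

word_vecs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # toy word vectors (d_w = 2)
pos_vecs = [[1.0], [2.0], [3.0]]                  # toy position vectors (d_p = 1)
x_hat, t_hat = erase_tokens(word_vecs, pos_vecs, [0, 1, 0])
```

In this toy call the second token is erased, so its word vector becomes zeros while its position vector survives, and two tokens are retained.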
Reward. Our purpose is to find the most simplified sequence that preserves the raw prediction confidence. Therefore, given the raw input representation x and the corresponding action vector a, we define the reward as follows: $R(a|x) = \log\left(\frac{P_\phi(r|\hat{x})}{P_\phi(r|x)}\right) + \eta \cdot \left(1 - \hat{T}/T\right)$, where the total reward is composed of two parts: one is the log-likelihood term to pursue the high prediction confidence and the other is the sparse ratio term to induce sparsity in terms of retained tokens. We balance these two parts through a hyper-parameter η.
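As a worked illustration of the two reward terms, the sketch below assumes that the log-likelihood term is the log-ratio between the prediction probability after erasing and the raw prediction probability, and that the sparse ratio is the fraction of erased tokens. The function name and arguments are hypothetical.

```python
import math

def reward(p_raw, p_erased, num_retained, num_tokens, eta):
    """Reward = log-likelihood term + eta * sparse ratio (assumed forms)."""
    # Log-likelihood term: 0 when erasing does not change the confidence,
    # negative when the confidence drops.
    log_likelihood = math.log(p_erased / p_raw)
    # Sparse ratio: fraction of tokens that were erased.
    sparse_ratio = 1.0 - num_retained / num_tokens
    return log_likelihood + eta * sparse_ratio

# Erasing 8 of 10 tokens without hurting the prediction is rewarded.
r_good = reward(p_raw=0.9, p_erased=0.9, num_retained=2, num_tokens=10, eta=0.5)
# Erasing the same number of tokens but destroying the prediction is penalized.
r_bad = reward(p_raw=0.9, p_erased=0.09, num_retained=2, num_tokens=10, eta=0.5)
```

Under these assumptions, a larger η favors shorter patterns, while a smaller η favors patterns that keep the prediction confidence close to the raw one.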
State. To be general, the state provided to the agent should be independent of NRE architectures. Moreover, the state needs to incorporate complete information of the current instance. Therefore, in our design, the agent directly employs the input representation x as the state.
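Agent. We employ policy-based RL to train a neural-network-based agent that can predict a sequence of actions for an instance to maximize the reward. Our agent network directly estimates $\pi_\Theta(a|x) = \prod_{i=1}^{T} \pi_\Theta(a_i|x)$ in a non-autoregressive manner by calculating $\pi_\Theta(a_i|x)$ in parallel, where Θ denotes the parameters of the agent network. To enrich the contextual information when deciding the action for each token, we employ the forward and backward LSTM networks to encode x into $h = [h_1, h_2, \cdots, h_T]$, where $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i] \in \mathbb{R}^{2 \times d_h}$ concatenates the forward and backward hidden states of token i, and $d_h$ denotes the size of LSTM's hidden state. Then, we employ an attention-based strategy (Bahdanau et al., 2015) to aggregate the contextual information as $c = [c_1, c_2, \cdots, c_T]$.
For each token i, we compute the context vector $c_i \in \mathbb{R}^{2d_h}$ as follows: $c_i = \sum_{j=1}^{T} \alpha^i_j h_j$, where each scalar weight is $\alpha^i_j = e^i_j / \sum_{k=1}^{T} e^i_k$ with $e^i_j = \exp\left(v_\alpha^\top \tanh(W_x x_i + W_h h_j)\right)$, and $W_x \in \mathbb{R}^{2d_h \times d_x}$, $W_h \in \mathbb{R}^{2d_h \times 2d_h}$ and $v_\alpha \in \mathbb{R}^{2d_h}$ are network parameters. Next, we compute the final representation to infer actions as $z = [z_1, z_2, \cdots, z_T]$, where for each token i, $z_i = [x_i; c_i]$ is aware of semantic, positional and contextual information. Finally, we estimate the probability of taking action $a_i$ for token i as $\pi_\Theta(a_i|x) = o_i^{a_i} (1 - o_i)^{1 - a_i}$, where $o_i = \mathrm{sigmoid}(W_o^\top z_i + b_o)$ denotes the probability of erasing token i, and $W_o$ and $b_o$ are network parameters.
Optimization. We employ the REINFORCE algorithm (Williams, 1992) and policy gradient methods (Sutton et al., 2000) to optimize parameters of the agent network, where the key step is to rewrite the gradient formulation and then apply the back-propagation algorithm (Rumelhart et al., 1986) to update network parameters. Specifically, we define our objective as: $J(\Theta) = \mathbb{E}_s\left[\mathbb{E}_{\pi_\Theta(a|x)} R(a|x)\right]$, where x denotes the input representation of the instance s. By taking the derivative of $J(\Theta)$ with respect to Θ, we can obtain the gradient $\nabla_\Theta J(\Theta) = \mathbb{E}_s\left[\mathbb{E}_{\pi_\Theta(a|x)}\left[R(a|x) \nabla_\Theta \log \pi_\Theta(a|x)\right]\right]$. Besides, we utilize the ε-greedy trick to balance exploration and exploitation.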
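The REINFORCE update for a factorized erase/retain policy can be sketched in a few lines. This is a deliberately simplified illustration, not the agent network itself: each token has a free logit instead of an LSTM-with-attention encoder, and the ε-greedy step is one possible reading in which a uniformly random action replaces the sampled one with probability ε. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_actions(logits, epsilon, rng):
    """Sample one erase (1) / retain (0) action per token, independently.
    With probability epsilon, explore with a uniformly random action."""
    actions = []
    for z in logits:
        if rng.random() < epsilon:
            actions.append(rng.randint(0, 1))
        else:
            actions.append(1 if rng.random() < sigmoid(z) else 0)
    return actions

def reinforce_step(logits, actions, reward, lr):
    """One gradient-ascent step on reward * log pi(a).
    For a Bernoulli policy with erase probability sigmoid(z),
    d/dz log pi(a) = a - sigmoid(z)."""
    return [z + lr * reward * (a - sigmoid(z)) for z, a in zip(logits, actions)]

rng = random.Random(0)
sampled = sample_actions([0.0, 0.0, 0.0], epsilon=0.1, rng=rng)
updated = reinforce_step([0.0], actions=[1], reward=2.0, lr=0.1)
```

A positive reward pushes the logit toward the action that was taken, and a negative reward pushes it away; in the full framework the same gradient would be back-propagated through the agent network.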
Pattern Induction. Given instances and corresponding agent actions, we take the following steps to induce compact patterns. First, to be general, we substitute raw entity pairs with corresponding entity types. Then, we evaluate the agent to obtain retained tokens with the relative distance preserved. To enable the generalized position indication, we divide the relative distance between two adjacent retained tokens into four categories: zero (no tokens between them), short (1-3 tokens), medium (4-9 tokens) and long (10 or more tokens) distance. For instance, Figure 3 shows a typical pattern-induction example. Patterns with such formats can incorporate multiple kinds of crucial information, such as entity types, key tokens and the relative distance among them.
Figure 4: The human-in-the-loop pattern refinement.
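The distance bucketing and pattern assembly described in the pattern-induction step can be sketched as below. The category labels (`ZERO`, `SHORT`, `MEDIUM`, `LONG`), the entity placeholders and the list-based pattern format are hypothetical choices for illustration; the concrete pattern syntax is the one shown in Figure 3.

```python
def distance_category(gap):
    """Map the number of erased tokens between two retained tokens to a category."""
    if gap == 0:
        return "ZERO"
    if gap <= 3:
        return "SHORT"
    if gap <= 9:
        return "MEDIUM"
    return "LONG"

def induce_pattern(tokens, actions):
    """tokens: instance tokens with entities already replaced by entity types.
    actions: 0 (retain) or 1 (erase) for each token."""
    kept = [i for i, a in enumerate(actions) if a == 0]
    pattern = []
    for prev, cur in zip(kept, kept[1:]):
        pattern.append(tokens[prev])
        pattern.append(distance_category(cur - prev - 1))
    if kept:
        pattern.append(tokens[kept[-1]])
    return pattern

tokens = ["ENTITY1:PER", "was", "born", "in", "ENTITY2:CITY"]
pattern = induce_pattern(tokens, [0, 1, 0, 0, 0])
```

In this toy example the erased token "was" leaves a short gap after the head entity, while the remaining retained tokens are adjacent.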

Pattern Refinement
The above pattern-extraction stage operates at the instance level by producing a pattern for each evaluated instance. However, after aggregating available patterns at the dataset level, there inevitably exist redundant ones. Therefore, we design a pattern hierarchy to merge redundant patterns. Afterward, we can introduce human experts into the workflow by asking them to annotate a small number of actively selected examples. Figure 4 shows the general workflow of this stage.
Pattern Hierarchy. To identify redundant patterns, we group multiple instances with the same pattern and build a pattern hierarchy by the matching statistics. In this hierarchy, the parent pattern should cover all instances matched by the child pattern. As the parent pattern already has sufficient relation-supporting signals, we can omit child patterns for human annotation. Moreover, the number of instances from which the pattern can be induced is closely related to the pattern representativeness. Therefore, we follow the decreasing order of this number to select top n r most representative patterns for human annotation.
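One way to sketch the hierarchy-based selection is to treat a pattern as a child when another pattern matches a strict superset of its instances, drop such children, and rank the rest by how many instances each pattern was induced from. The data layout and the function name `select_representative` are assumptions for illustration.

```python
def select_representative(matched, induced_count, n_r):
    """matched: pattern -> set of instance ids matched by the pattern.
    induced_count: pattern -> number of instances the pattern was induced from.
    Returns the top n_r non-child patterns in decreasing order of induced_count."""
    parents = []
    for pattern, instances in matched.items():
        # A child is covered by another pattern that matches strictly more instances.
        is_child = any(
            other != pattern and instances < other_instances
            for other, other_instances in matched.items()
        )
        if not is_child:
            parents.append(pattern)
    parents.sort(key=lambda p: induced_count[p], reverse=True)
    return parents[:n_r]

matched = {"A": {1, 2, 3, 4}, "B": {1, 2}, "C": {5, 6}}
induced_count = {"A": 3, "B": 2, "C": 5}
selected = select_representative(matched, induced_count, n_r=2)
```

Pattern "B" is omitted because "A" already covers all of its instances; patterns with identical match sets are both kept under this strict-superset rule.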
Human Annotation. To quantitatively evaluate the pattern quality, we adopt an approximate method by randomly selecting $n_a$ pattern-matched instances and annotating them manually. Thus, for each relation type, we end up with $n_r \times n_a$ human-annotated instances. We assign patterns with the accuracy higher than $p_h$ and lower than $p_l$ into the positive pattern set and the negative pattern set, respectively, to serve the WLF stage. In practice, users can tune these hyper-parameters ($n_r$, $n_a$, $p_h$ and $p_l$) accordingly for different applications, such as increasing $p_h$ to prefer precision. In this paper, to show the wide applicability and robustness of DIAG-NRE, we demonstrate that a single configuration can handle all 14 relation types.
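The threshold-based assignment of patterns can be sketched as follows, assuming each pattern comes with a list of human judgments on its sampled instances. The function name `split_patterns` and the data layout are hypothetical; the default thresholds are the values used later in the experiments.

```python
def split_patterns(annotations, p_h=0.8, p_l=0.1):
    """annotations: pattern -> list of booleans, True if the annotated
    instance really expresses the target relation."""
    positive, negative = [], []
    for pattern, labels in annotations.items():
        accuracy = sum(labels) / len(labels)
        if accuracy > p_h:
            positive.append(pattern)   # supportive of the relation
        elif accuracy < p_l:
            negative.append(pattern)   # unsupportive of the relation
        # Patterns in between are discarded as uninformative.
    return positive, negative

annotations = {
    "P1": [True] * 9 + [False],       # accuracy 0.9
    "P2": [False] * 10,               # accuracy 0.0
    "P3": [True] * 5 + [False] * 5,   # accuracy 0.5
}
positive, negative = split_patterns(annotations)
```

With these toy annotations, "P1" becomes a positive pattern, "P2" a negative pattern, and "P3" is dropped.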

Weak Label Fusion
The WLF model aims to fuse weak labels from multiple labeling sources, including both DS and patterns, to produce denoised labels. In this paper, we adopt data programming (DP) (Ratner et al., 2016) as our WLF model. The input unit of DP is called labeling function (LF), which takes one instance and emits a label (+1: positive, -1: negative or 0: unknown). In our case, the LF of DS generates +1 or -1, LFs of positive patterns generate +1 or 0, and LFs of negative patterns generate -1 or 0. We estimate parameters of DP on the small set of human-annotated labels with a closed-form solution (see the appendix for detailed formulations). With the help of DP, we get denoised labels to retrain a better model. Note that designing better generic WLF models is still a hot research topic (Varma et al., 2016; Bach et al., 2017; Liu et al., 2017a) but outside the scope of this work, which focuses on automatically generating patterns to ease the work of human experts.
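The three kinds of labeling functions can be sketched as plain Python callables. Keyword containment stands in for real pattern matching here, and the instance fields (`text`, `in_kb`) and function names are hypothetical.

```python
def lf_distant_supervision(instance):
    """DS always votes: +1 if the KB links the entity pair, otherwise -1."""
    return 1 if instance["in_kb"] else -1

def make_pattern_lf(keyword, polarity):
    """Pattern LF: vote `polarity` (+1 or -1) on a match, abstain (0) otherwise."""
    def lf(instance):
        return polarity if keyword in instance["text"] else 0
    return lf

labeling_functions = [
    lf_distant_supervision,
    make_pattern_lf("was born in", +1),  # a positive pattern
    make_pattern_lf("visited", -1),      # a negative pattern
]

instance = {"text": "PER was born in CITY .", "in_kb": False}
weak_labels = [lf(instance) for lf in labeling_functions]
```

For this instance DS votes negative because the fact is missing from the KB, the positive pattern votes positive, and the negative pattern abstains; the WLF model then has to reconcile the conflicting votes.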

Experiments
In this section, we present experimental results and comprehensive analyses to demonstrate the effectiveness of DIAG-NRE.

Experimental Setup
Evaluation. To clearly show the different noise behaviors for various relation types, we treat each relation prediction task as a single binary classification problem, that is, predicting whether that relation exists for a given instance. Different from previous studies, we report relation-specific metrics (Precision, Recall and F1 scores, all in the percentage format) and macro-averaged ones at the dataset level, because the distribution of relation types is extremely imbalanced and the micro-averaged evaluation inevitably overlooks noisy-labeling issues of many relation types. Moreover, we only utilize human-annotated test data to evaluate models trained on noise labels, as Ratner et al. (2016) did, because the severe labeling noises of many relation types heavily weaken the reliability of the DS-based held-out evaluation (Mintz et al., 2009), which cannot judge the performance accurately.
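The difference between macro- and micro-averaging under class imbalance can be seen in a small worked example. The counts below are invented for illustration and are not taken from the experiments.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# (tp, fp, fn) per relation: one frequent, well-predicted relation and one rare, noisy one.
counts = {"frequent": (900, 100, 100), "rare": (1, 9, 9)}

# Macro average: every relation contributes equally.
macro_f1 = sum(prf(*c)[2] for c in counts.values()) / len(counts)

# Micro average: counts are pooled, so the frequent relation dominates.
pooled = [sum(c[k] for c in counts.values()) for k in range(3)]
micro_f1 = prf(*pooled)[2]
```

The macro-averaged F1 is 0.5 because the rare relation performs poorly, whereas the micro-averaged F1 stays near 0.89 and hides that failure.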
Data & Tasks. We select top ten relation types with enough coverage (over 1,000 instances) from the NYT dataset (Riedel et al., 2010) and all four relation types from the UW dataset (Liu et al., 2016). Originally, the NYT dataset contains a train set and a test set both generated by DS with 522,611 and 172,448 instances, respectively; the UW dataset contains a train set generated by DS, a crowd-annotated set and a minimal human-annotated test set with 676,882, 18,128 and 164 instances, respectively. To enable the reliable evaluation based on human annotations, for the NYT dataset, we randomly select up to 100 instances per relation (including the special unknown relation NA) from the test set and manually annotate them; while for the UW dataset, we directly utilize the crowd-annotated set (disjoint from the train set) with the broad coverage and very high quality as the ground truth. Table 1 summarizes detailed statistics of these 14 tasks.
Hyper-parameters. We implement DIAG-NRE based on PyTorch and directly utilize its default initialization for neural networks. For the NRE model, we adopt a simple yet effective LSTM-based architecture described in Zhou et al. (2016) and adopt widely-used hyper-parameters (see the appendix for details). As for DIAG-NRE, we use the following configuration for all 14 tasks. For the agent network, the LSTM hidden size is 200, the optimizer is Adam with a learning rate of 0.001, the batch size is 5, and the number of training epochs is 10. At the pattern-extraction stage, we use ε = 0.1 and alter η in {0.05, 0.1, 0.5, 1.0, 1.5} to train multiple agents that tend to squeeze patterns with different granularities and combine outputs of all agents to serve the pattern-refinement stage. To speed up the agent training, we use filtered instances by taking the top 10,000 ones with the highest prediction probabilities. At the pattern-refinement stage, hyper-parameters include $n_r = 20$, $n_a = 10$, $p_h = 0.8$ and $p_l = 0.1$. Thus, for each task, we get 200 human-annotated instances (about 0.05% of the entire train set) and at most 20 patterns for the WLF stage.

Performance Comparisons
Based on the above hyper-parameters, DIAG-NRE together with the WLF model can produce denoised labels to retrain a better NRE model. Next, we present the overall performance comparisons of NRE models trained with different labels.
Baselines. We adopt the following baselines: 1) Distant Supervision, the vanilla DS described in Mintz et al. (2009), 2) Gold Label Mix, mixing human-annotated high-quality labels with DS-generated noise labels, and 3) RLRE (Feng et al., 2018), building an instance-selection agent to select correctly labeled instances by only interacting with NRE models trained on noise labels. Specifically, for Gold Label Mix, we use the same 200 labels obtained at the pattern-refinement stage as the high-quality labels. To focus on the impact of training labels produced with different methods, besides keeping all hyper-parameters exactly the same, we run the NRE model with five random seeds, ranging from 0 to 4, for each case and present the averaged scores.
Overall Results. Table 2 shows the overall results with precision (P.), recall (R.) and F1 scores. For a majority of tasks suffering large labeling noises, including $R_1$, $R_4$, $R_5$, $R_8$, $R_9$ and $R^u_8$, we improve the F1 score by 5.0 over the best baseline. Notably, the F1 improvement for task $R_1$ has reached 40. For some tasks with fewer noises, including $R_0$, $R_7$, $R^u_7$ and $R^u_9$, our method can obtain small improvements. For a few tasks, such as $R_3$, $R_6$ and $R^u_6$, only using DS is sufficient to train competitive models. In such cases, fusing other weak labels may have negative effects, but these side effects are small. The detailed reasons for these improvements will be elaborated together with the diagnostic results in Section 4.3. Another interesting observation is that RLRE yields the best result on tasks $R_2$ and $R^u_6$ but gets worse results than the vanilla DS on tasks $R_0$, $R_1$, $R_4$ and $R_7$. Since the instance selector used in RLRE is difficult to interpret, we can hardly figure out the specific reason. We conjecture that this behavior is due to the gap between maximizing the likelihood of the NRE model and the ground-truth instance selection. In contrast, DIAG-NRE can contribute both stable and interpretable improvements with the help of human-readable patterns.
Table 3: Total diagnostic results, where columns contain the precision, recall and accuracy of DS-generated labels evaluated on 200 human-annotated labels as well as the number of positive and negative patterns preserved after the pattern-refinement stage, and we underline some cases in which DS performs poorly.

Pattern-based Diagnostic Results
Besides improving the extraction performance, DIAG-NRE can interpret different noise effects caused by DS via refined patterns, as Table 3 shows. Next, we elaborate these diagnostic results and the corresponding performance degradation of NRE models from two perspectives: false negatives (FN) and false positives (FP).

FN. A typical example of FN is task $R_1$ (Administrative Division), where the precision of DS-generated labels is fairly good but the recall is too low. The underlying reason is that the relational facts stored in the KB cover too few real facts actually contained by the corpus. This low-recall issue introduces too many negative instances with common relation-supporting patterns and thus confuses the NRE model in capturing correct features. This issue also explains results of $R_1$ in Table 2 that the NRE model trained on DS-generated data achieves high precision but low recall, while DIAG-NRE with reinforced positive patterns can obtain significant improvements due to much higher recall. For tasks $R_8$ (Birthplace) and $R_9$ (Deathplace), we observe similar low-recall issues.
FP. The FP errors are mainly caused by the assumption of DS described in the introduction. For example, the precision of DS-generated labels for tasks $R_8$ and $R^u_8$ is too low. This low precision means that a large portion of DS-generated positive labels do not indicate the target relation. Thus, this issue inevitably causes the NRE model to absorb some irrelevant patterns. This explanation also corresponds to the fact that we have obtained some negative patterns. By reducing labels with FP errors through negative patterns, DIAG-NRE can achieve large precision improvements.
For other tasks, DS-generated labels are relatively good, but the noise issue still exists, major or minor, except for task $R_3$ (Contains), where labels automatically generated by DS are incredibly accurate. We conjecture the reason for such high-quality labeling is that for task $R_3$, the DS assumption is consistent with the written language convention: when mentioning two locations with the containing relation in one sentence, people get used to declaring this relation explicitly.

Incremental Diagnosis
In addition to the performance comparisons based on 200 human-annotated instances, we show the incremental diagnosis ability of DIAG-NRE by gradually increasing the number of human annotations from 10 to 200. As Figure 5 shows, where we pick those tasks (three from NYT and two from UW) suffering large labeling noises, most tasks experience a rapid improvement phase with the help of high-quality patterns automatically generated by DIAG-NRE and then enter a saturation phase where adding annotations does not contribute much. This saturation accords with the intuition that high-quality relational patterns are often limited. The only exception is task $R_9$, which drops first and then increases again. The reason is that the fully automatic pattern refinement of DIAG-NRE accidentally produces one incorrect pattern, while later patterns alleviate this mistake. In practice, users can further curate patterns generated by DIAG-NRE to get even better results, which can also be much easier and quicker than writing patterns from scratch.
Table 4 shows five pattern examples from three tasks. For task $R_1$, the positive pattern can remedy the extremely low coverage caused by DS. For tasks $R_8$ and $R^u_9$, besides the help of the positive pattern, the negative pattern can correct many FP labels caused by DS. These cases intuitively illustrate the ability of DIAG-NRE to diagnose and denoise DS-generated labels.

Conclusion and Future Work
In this paper, we propose a neural pattern diagnosis framework, DIAG-NRE, to diagnose and improve NRE models trained on DS-generated data. DIAG-NRE not only eases the hard pattern-writing work of human experts by generating patterns automatically, but also enables the quick generalization to new relation types by only requiring a small number of human annotations. Coupled with the WLF model, DIAG-NRE can produce denoised labels to retrain a better NRE model. Extensive experiments with comprehensive analyses demonstrate that DIAG-NRE can contribute both significant and interpretable improvements.
For future work, we plan to extend DIAG-NRE to other DS-based applications, such as question answering (Lin et al., 2018), event extraction (Chen et al., 2017), etc.

A Appendices
In the appendices, we introduce formulation details of the weak-label-fusion (WLF) model and the hyper-parameters for our neural relation extraction (NRE) model.

A.1 Weak Label Fusion
As mentioned in the main body, we employ the data programming (DP) (Ratner et al., 2016) as our WLF model. DP proposed an abstraction of the weak label generator, named as the labeling function (LF), which can incorporate both DS and pattern-based heuristics. Typically, for a binary classification task, an LF is supposed to produce one label (+1: positive, -1: negative or 0: unknown) for each input instance. In our case, the LF of DS generates +1 or -1, LFs of positive patterns generate +1 or 0, and LFs of negative patterns generate -1 or 0.
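Given m labeling functions, we can write the joint probability of weak labels $L^s$ and the true label $Y^s \in \{-1, +1\}$ for instance s, $P_{\alpha,\beta}(L^s, Y^s)$, as $P_{\alpha,\beta}(L^s, Y^s) = \frac{1}{2} \prod_{i=1}^{m} \left( \beta_i \alpha_i \mathbb{1}\{L^s_i = Y^s\} + \beta_i (1 - \alpha_i) \mathbb{1}\{L^s_i = -Y^s\} + (1 - \beta_i) \mathbb{1}\{L^s_i = 0\} \right)$, where each $L^s_i \in \{-1, 0, +1\}$ denotes the weak label generated for instance s by the i-th labeling function, and α and β are model parameters to be estimated.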
Originally, Ratner et al. (2016) conducted the unsupervised parameter estimation based on unlabeled data by solving $\max_{\alpha,\beta} \sum_{s \in S} \log \left( \sum_{Y^s} P_{\alpha,\beta}(L^s, Y^s) \right)$.
Different from the general DP that treats each LF with the equal prior, we have strong priors that patterns produced by DIAG-NRE are either supportive or unsupportive of the target relation with high probabilities. Therefore, in our case, we directly employ the small labeled set $S_L$ obtained at the pattern-refinement stage to estimate (α, β) by solving $\max_{\alpha,\beta} \sum_{s \in S_L} \log P_{\alpha,\beta}(L^s, Y^s)$, where the closed-form solutions are $\alpha_i = \frac{\sum_{s \in S_L} \mathbb{1}\{L^s_i = Y^s\}}{\sum_{s \in S_L} \mathbb{1}\{L^s_i \neq 0\}}$ and $\beta_i = \frac{\sum_{s \in S_L} \mathbb{1}\{L^s_i \neq 0\}}{|S_L|}$ for each $i \in \{1, \cdots, m\}$. After estimating these parameters, we can infer the true label distribution by the posterior $P_{\alpha,\beta}(Y^s | L^s)$ and use the denoised soft label to train a better NRE model, just as Ratner et al. (2016) did.
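The supervised estimation and the posterior inference can be sketched as follows. The sketch assumes the independent-LF form of data programming, in which each labeling function i fires with probability β_i and is correct with probability α_i when it fires; under that assumption the maximum-likelihood estimates reduce to simple counting. Function names and the toy data are hypothetical.

```python
def estimate_params(weak_labels, true_labels):
    """weak_labels: one row per labeled instance, each with m values in {-1, 0, +1}.
    true_labels: one value in {-1, +1} per instance.
    Returns per-LF accuracies (alpha) and coverages (beta)."""
    n, m = len(weak_labels), len(weak_labels[0])
    alphas, betas = [], []
    for i in range(m):
        fired = sum(1 for row in weak_labels if row[i] != 0)
        correct = sum(1 for row, y in zip(weak_labels, true_labels) if row[i] == y)
        betas.append(fired / n)
        alphas.append(correct / fired if fired else 0.5)  # 0.5 if the LF never fired
    return alphas, betas

def posterior_positive(row, alphas, betas):
    """P(Y = +1 | weak labels) under the assumed independent-LF model."""
    def joint(y):
        prob = 1.0
        for label, a, b in zip(row, alphas, betas):
            if label == 0:
                prob *= 1 - b
            elif label == y:
                prob *= b * a
            else:
                prob *= b * (1 - a)
        return prob
    pos, neg = joint(+1), joint(-1)
    return pos / (pos + neg) if pos + neg else 0.5

weak = [[+1, +1], [+1, 0], [-1, 0], [-1, 0]]  # column 0: DS, column 1: a positive pattern
gold = [+1, -1, -1, -1]
alphas, betas = estimate_params(weak, gold)
soft_label = posterior_positive([+1, 0], alphas, betas)
```

The resulting posterior serves as a soft label: when only the DS vote is available, the label inherits the estimated accuracy of DS, and pattern votes shift it further.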

A.2 Hyper-parameters of the NRE model
For the NRE model, we implement a simple yet effective LSTM-based architecture described in Zhou et al. (2016). We conduct the hyper-parameter search via cross-validation and adopt the following configurations that can produce pretty good results for all 14 tasks. First, the word embedding table ($d_w = 100$) is initialized with GloVe vectors (Pennington et al., 2014), the size of the position vector ($d_p$) is 5, the maximum length of the encoded relative distance is 60, and we follow Zeng et al. (2015) to randomly initialize these position vectors. Besides, the LSTM hidden size is 200, and the dropout probabilities at the embedding layer, the LSTM layer and the last layer are 0.3, 0.3 and 0.5, respectively. During training, we employ the Adam (Kingma and Ba, 2014) optimizer with the learning rate of 0.001 and the batch size of 50. Moreover, we select the best epoch according to the score on the validation set. Notably, we observe that when training on data with large labeling noises, different parameter initializations can heavily influence the extraction performance of trained models. Therefore, as mentioned in the main body, to clearly and fairly show the actual impact of different types of training labels, we restart the training of NRE models with 5 random seeds, ranging from 0 to 4, for each case and report the averaged scores.