Uncover the Ground-Truth Relations in Distant Supervision: A Neural Expectation-Maximization Framework

Distant supervision for relation extraction makes it possible to acquire structured relations from very large text corpora with less human effort. Nevertheless, most prior-art models for this task assume that the given text can be noisy but that the corresponding labels are clean. This unrealistic assumption contradicts the fact that the given labels are often noisy as well, and it leads to significant performance degradation of those models on real-world data. To cope with this challenge, we propose a novel label-denoising framework that combines neural networks with probabilistic modelling and naturally takes the noisy labels into account during learning. We empirically demonstrate that our approach significantly improves on the current state of the art in uncovering the ground-truth relation labels.


Introduction
Relation extraction aims at automatically extracting semantic relationships from a piece of text. Consider the sentence "Larry Page, the chief executive officer of Alphabet, Google's parent company, was born in East Lansing, Michigan." From this sentence, the knowledge triple (Larry Page, employed by, Google) can be extracted. Despite various efforts in building relation extraction models (Zelenko et al., 2002; Zhou et al., 2005; Bunescu and Mooney, 2005; Zeng et al., 2014; dos Santos et al., 2015; Yang et al., 2016; Xu et al., 2015; Miwa and Bansal, 2016; Gábor et al., 2018), the difficulty of obtaining abundant training data with labelled relations remains a challenge, which motivates the development of distant supervision relation extraction (Mintz et al., 2009; Riedel et al., 2010; Zeng et al., 2015; Lin et al., 2016; Ji et al., 2017; Zeng et al., 2018; Feng et al., 2018; Wang et al., 2018b,a; Hoffmann et al., 2011; Surdeanu et al., 2012; Jiang et al., 2016; Ye et al., 2017; Su et al., 2018; Qu et al., 2018). Distant supervision (DS) relation extraction methods collect a large dataset with a "distant" supervision signal and learn a relation predictor from such data. In detail, each example in the dataset contains a collection, or bag, of sentences all involving the same pair of entities extracted from some corpus (e.g., news reports). Although such a dataset is expected to be very noisy, one hopes that when the dataset is large enough, useful correspondence between the semantics of a sentence and the relation label it implies still reinforces and manifests itself. Despite their capability of learning from large-scale data, we show in this paper that these DS relation extraction strategies fail to adequately model the characteristics of the noise in the data. Specifically, most of these works fail to recognize that the labels of a bag can be noisy, and they directly use the bag labels as training targets.
This observation has inspired us to study a more realistic setting for DS relation extraction. That is, we treat the bag labels as noisy observations of the ground-truth labels. To that end, we develop a novel framework that jointly models the semantic representation of the bag, the latent ground-truth labels, and the noisy observed labels. The framework, probabilistic in nature, allows any neural network that encodes the bag semantics to be nested within it. We show that the well-known Expectation-Maximization (EM) algorithm can be applied to the end-to-end learning of the models built with this framework. As such, we term the framework Neural Expectation-Maximization, or nEM.
Since our approach deviates from the conventional models and does not regard bag labels as the ground truth, bags with the real ground-truth labels are required for evaluating our model. To that end, we manually re-annotate a fraction of the testing bags in a standard DS dataset with their real ground-truth labels. We then perform extensive experiments and demonstrate that the proposed nEM framework improves on the current state-of-the-art models in uncovering the ground-truth relations. To the best of our knowledge, this work is the first that combines a neural network model with EM training under the "noisy-sentence noisy-label" assumption. The re-annotated testing dataset¹, containing the ground-truth relation labels, would also benefit the research community.
Figure 1 (caption excerpt): In each matrix, rows correspond to the sentences in a bag, and columns correspond to the labels assigned to the bag. A check mark on $(S_i, L_j)$ indicates that the label $L_j$ is supported by sentence $S_i$.

Problem Statement
Let the set $R$ contain all relation labels of interest. Specifically, each label $r$ in $R$ corresponds to a candidate relation in which any considered pair $(e, e')$ of entities may participate. Since $R$ cannot contain all relations that are implied in a corpus, we include in $R$ an additional relation label "NA", which refers to any relation that cannot be regarded as one of the other relations in $R$. Any subset of $R$ will be written as a $\{0, 1\}$-valued vector of length $|R|$, for example $z$, where each element of the vector $z$ corresponds to a label $r \in R$. Specifically, the element $z[r]$ of $z$ equals 1 if and only if label $r$ is contained in the subset. If two entities $e$ and $e'$ participate in a relation $r$, we say that the triple $(e, r, e')$ is factual. Let $B$ be a finite set in which each $b \in B$ is a pair $(e, e')$ of entities. Each $b = (e, e') \in B$ serves as the index for a bag $x_b$ of sentences.
¹ Will be released upon the acceptance of the paper.
The objective of DS is to use a large but noisy training set to learn a predictor that predicts the relations involving two arbitrary (possibly unseen) entities; the predictor takes as input a bag of sentences each containing the two entities of interest, and hopefully outputs the set of all relations in which the two entities participate.
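As an illustration of the notation above, the following minimal sketch encodes a label subset as a $\{0, 1\}$-valued vector and indexes a bag by its entity pair. The relation inventory `RELATIONS`, the helper `encode_labels`, and the relation names are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical label inventory R; "NA" is the catch-all relation from the text.
RELATIONS: List[str] = ["NA", "employed_by", "born_in"]


def encode_labels(labels: Set[str], relations: Sequence[str] = RELATIONS) -> List[int]:
    """Return the {0,1}-valued vector z with z[r] = 1 iff r is in the subset."""
    unknown = labels - set(relations)
    if unknown:
        raise ValueError(f"labels outside R: {sorted(unknown)}")
    return [1 if r in labels else 0 for r in relations]


# Each bag x_b is indexed by an entity pair b = (e, e').
EntityPair = Tuple[str, str]
bags: Dict[EntityPair, List[str]] = {
    ("Larry Page", "Google"): [
        "Larry Page, the chief executive officer of Alphabet, "
        "Google's parent company, was born in East Lansing, Michigan.",
    ],
}

# Observed (possibly noisy) label vector for the bag above.
z = encode_labels({"employed_by"})
```

A predictor in this setting maps a bag such as `bags[("Larry Page", "Google")]` to a vector of the same shape as `z`.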

Prior Art and Related Works
Relation extraction is an important task in natural language processing. Many supervised approaches have been proposed for this task. These works, such as (Zelenko et al., 2002; Zhou et al., 2005; Bunescu and Mooney, 2005), although achieving good performance, rely on carefully selected features and well-labelled datasets. Recently, neural network models have been used in (Zeng et al., 2014; dos Santos et al., 2015; Yang et al., 2016; Xu et al., 2015; Miwa and Bansal, 2016) for supervised relation extraction. These models avoid feature engineering and are shown to improve upon previous models. However, having a large number of parameters to estimate, these models rely heavily on costly human-labelled data.
Distant supervision was proposed in (Mintz et al., 2009) to automatically generate large datasets by aligning a given knowledge base to a text corpus. However, such datasets can be quite noisy. To articulate the nature of noise in a DS dataset, a sentence is said to be noisy if it supports no relation label of the bag, and a label of the bag is said to be noisy if it is not supported by any sentence in the bag. A sentence or label that is not noisy will be called clean. The cleanness of a training example may obey one of the following four assumptions, for each of which an example is given in Figure 1.
• Clean-Sentence Clean-Label (CSCL): All sentences and all labels are clean (Figure 1(a)).
• Noisy-Sentence Clean-Label (NSCL): Some sentences may be noisy but all labels are clean (Figure 1(b)). Note that CSCL is a special case of NSCL.
• Clean-Sentence Noisy-Label (CSNL): All sentences are clean but some labels may be noisy (Figure 1(c)). Note that CSNL includes CSCL as a special case.
• Noisy-Sentence Noisy-Label (NSNL): Some sentences may be noisy and some labels may also be noisy (Figure 1(d)).
Obviously, CSCL, NSCL, CSNL are all special cases of NSNL. Thus NSNL is the most general among all these assumptions.
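The definitions of noisy sentences and noisy labels can be checked mechanically on a support matrix such as those sketched in Figure 1. The snippet below is a minimal illustration; the function name `noise_assumption` and the boolean-matrix input format are assumptions made for this example. It returns the most specific of the four assumptions that a single bag satisfies.

```python
from typing import List


def noise_assumption(support: List[List[bool]]) -> str:
    """Classify one bag from its support matrix.

    support[i][j] is True iff sentence S_i supports label L_j
    (a check mark in the corresponding matrix cell).
    """
    n_labels = len(support[0]) if support else 0
    # A sentence is noisy if it supports no label of the bag.
    noisy_sentence = any(not any(row) for row in support)
    # A label is noisy if no sentence in the bag supports it.
    noisy_label = any(
        not any(row[j] for row in support) for j in range(n_labels)
    )
    if noisy_sentence and noisy_label:
        return "NSNL"
    if noisy_label:
        return "CSNL"
    if noisy_sentence:
        return "NSCL"
    return "CSCL"
```

Because the function reports the most specific case, a bag classified as CSCL also satisfies the three more general assumptions.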
Mintz et al. (2009) build a model under the CSCL assumption, which was later pointed out to be too strong (Riedel et al., 2010). To alleviate this issue, many studies adopt the NSCL assumption. Some of them, including (Riedel et al., 2010; Zeng et al., 2015; Lin et al., 2016; Ji et al., 2017; Zeng et al., 2018; Feng et al., 2018; Wang et al., 2018b,a), formulate the task as a multi-instance learning problem where only one label is allowed for each bag. These works perform sentence denoising by selecting the max-scored sentence (Riedel et al., 2010; Zeng et al., 2015, 2018), applying sentence selection with soft attention (Lin et al., 2016; Ji et al., 2017), performing sentence-level prediction as well as filtering noisy bags (Feng et al., 2018), and redistributing the noisy sentences into negative bags (Wang et al., 2018b,a). Other studies address this task with multi-instance multi-label learning (Hoffmann et al., 2011; Surdeanu et al., 2012; Jiang et al., 2016; Ye et al., 2017; Su et al., 2018), which allows overlapping relations in a bag. Despite the demonstrated successes, these models ignore the fact that relation labels can be noisy, and "noisy" sentences that in fact point to factual relations may also be ignored. Two recent approaches using the NSNL assumption have also been introduced, but these methods are evaluated under the assumption that the evaluation labels are clean.
We note that this paper is not the first work that combines neural networks with EM training. Very recently, a model also known as Neural Expectation-Maximization or NEM (Greff et al., 2017) has been presented to learn latent representations for clusters of objects (e.g., images) under a completely unsupervised setting. The NEM model is not directly applicable to our problem setting, which deals with noisy supervision signals from categorical relation labels. Nonetheless, given the existence of the acronym NEM, we choose to abbreviate our Neural Expectation-Maximization model as the nEM model.

The nEM Framework
We first introduce the nEM architecture and its learning strategy, and then present approaches to encode a bag of sentences (i.e., the Bag Encoding Models) needed by the framework.

The nEM Architecture
Let random variable $X$ denote a random bag and random variable $Z$ denote the label set assigned to $X$. Under the NSNL assumption, $Z$, or some labels within it, may not be clean for $X$. We introduce another latent random variable $Y$, taking values in the subsets of $R$, indicating the set of ground-truth (namely, clean) labels for $X$. We will again write $Y$ as an $|R|$-dimensional $\{0, 1\}$-valued vector. From here on, we will adopt the convention that a random variable is written using a capitalized letter, and the value it takes is written using the corresponding lower-cased letter.
A key modelling assumption in nEM is that random variables $X$, $Y$ and $Z$ form a Markov chain $X \to Y \to Z$. Specifically, the dependency of noisy labels $Z$ on the bag $X$ is modelled as
$$p_{Z|X}(z|x) = \sum_{y \in \{0,1\}^{|R|}} p_{Y|X}(y|x)\, p_{Z|Y}(z|y). \quad (1)$$
The conditional distribution $p_{Z|Y}$ is modelled as
$$p_{Z|Y}(z|y) = \prod_{r \in R} p_{Z[r]|Y[r]}(z[r]\,|\,y[r]). \quad (2)$$
That is, for each ground-truth label $r \in R$, $Z[r]$ depends only on $Y[r]$. Furthermore, we assume that for each $r$, there are two parameters $\phi^0_r$ and $\phi^1_r$ governing the dependency of $Z[r]$ on $Y[r]$. We will denote $\{\phi^0_r, \phi^1_r : r \in R\}$ collectively by $\phi$.
On the other hand, we model $p_{Y|X}$ by
$$p_{Y|X}(y|x) = \prod_{r \in R} p_{Y[r]|X}(y[r]\,|\,\bar{x}), \quad (4)$$
where $\bar{x}$ is the encoding of bag $x$, implemented using any suitable neural network. Postponing the details of bag encoding to a later section (namely Section 4.3), we here specify the form of $p_{Y[r]|X}(y[r]|x)$ for each $r$:
$$p_{Y[r]|X}(y[r]\,|\,x) = \sigma(\mathbf{r}^T \bar{x} + b_r)^{y[r]} \left(1 - \sigma(\mathbf{r}^T \bar{x} + b_r)\right)^{1 - y[r]}, \quad (5)$$
where $\sigma(\cdot)$ is the sigmoid function, $\mathbf{r}$ is a $|R|$-dimensional vector and $b_r$ is a bias. That is, $p_{Y[r]|X}$ is essentially a logistic regression (binary) classifier based on the encoding $\bar{x}$ of bag $x$. We will denote by $\theta$ the set of all parameters $\{\mathbf{r} : r \in R\}$ and the parameters for generating the encoding $\bar{x}$. At this point, we have defined the overall structure of the nEM framework (summarized in Figure 2). Next, we discuss how it is learned.
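To make the Markov chain $X \to Y \to Z$ concrete, the following minimal sketch computes, for a single label $r$, the classifier output $p(Y[r]=1|x)$ and the implied probability of observing the label, $p(Z[r]=1|x)$. The noise convention used here, $\phi^0_r = P(Z[r]=1 \mid Y[r]=0)$ and $\phi^1_r = P(Z[r]=0 \mid Y[r]=1)$, is an assumption made for illustration and is not confirmed by the text above; the function names are likewise hypothetical.

```python
import math
from typing import Sequence


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def prob_true_label(x_bar: Sequence[float], r_vec: Sequence[float], b_r: float) -> float:
    """p(Y[r]=1 | x): logistic classifier on the bag encoding x_bar."""
    return sigmoid(sum(ri * xi for ri, xi in zip(r_vec, x_bar)) + b_r)


def prob_observed_label(p_y1: float, phi0: float, phi1: float) -> float:
    """p(Z[r]=1 | x), obtained by summing out Y[r].

    ASSUMED convention (not taken from the paper):
      phi0 = P(Z[r]=1 | Y[r]=0), phi1 = P(Z[r]=0 | Y[r]=1).
    """
    return p_y1 * (1.0 - phi1) + (1.0 - p_y1) * phi0
```

With `phi0 = phi1 = 0` the observed label distribution coincides with the classifier output, which corresponds to trusting the bag labels completely.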

Learning with the EM Algorithm
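Let $\ell(z|x; \phi, \theta)$ be the log-likelihood of observing the label set $z$ given the bag $x$, that is,
$$\ell(z|x; \phi, \theta) = \log p_{Z|X}(z|x) = \log \sum_{y \in \{0,1\}^{|R|}} p_{Y|X}(y|x; \theta)\, p_{Z|Y}(z|y; \phi). \quad (6)$$
The structure of the framework readily enables a principled learning algorithm based on the EM algorithm (Dempster et al., 1977). Let $\ell_{total}(\phi, \theta)$ be the log-likelihood defined as
$$\ell_{total}(\phi, \theta) = \sum_{b \in B} \ell(z_b|x_b; \phi, \theta). \quad (7)$$
The learning problem can then be formulated as maximizing this objective function over its parameters $(\phi, \theta)$, or solving
$$(\hat{\phi}, \hat{\theta}) = \arg\max_{\phi, \theta}\ \ell_{total}(\phi, \theta). \quad (8)$$
Let $Q$ be an arbitrary distribution over $\{0, 1\}^{|R|}$. Then it is possible to show
$$\ell(z|x; \phi, \theta) \ge \sum_{y \in \{0,1\}^{|R|}} Q(y) \log \frac{p_{Y|X}(y|x; \theta)\, p_{Z|Y}(z|y; \phi)}{Q(y)} =: \mathcal{L}(z|x; \phi, \theta, Q), \quad (9)$$
where the lower bound $\mathcal{L}(z|x; \phi, \theta, Q)$ is often referred to as the variational lower bound. Now we define $\mathcal{L}_{total}$ as
$$\mathcal{L}_{total}(\phi, \theta, \{Q_b\}) = \sum_{b \in B} \mathcal{L}(z_b|x_b; \phi, \theta, Q_b), \quad (10)$$
where we have introduced a $Q_b$ for each $b \in B$. Instead of solving the original optimization problem (8), we can turn to solving a different optimization problem by maximizing $\mathcal{L}_{total}$:
$$(\hat{\phi}, \hat{\theta}, \{\hat{Q}_b\}) = \arg\max_{\phi, \theta, \{Q_b\}}\ \mathcal{L}_{total}(\phi, \theta, \{Q_b\}). \quad (11)$$
The EM algorithm for solving the optimization problem (11) is essentially the coordinate ascent algorithm on the objective function $\mathcal{L}_{total}$, where we iterate over two steps, the E-step and the M-step.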
In the E-step, we maximize $\mathcal{L}_{total}$ over $\{Q_b\}$ for the current setting of $(\phi, \theta)$, and in the M-step, we maximize $\mathcal{L}_{total}$ over $(\phi, \theta)$ for the current setting of $\{Q_b\}$. We now describe the two steps in detail, where we use the superscript $t$ to denote the iteration number.
E-step: In this step, we hold $(\phi, \theta)$ fixed and update $\{Q_b\}$. This boils down to updating each factor $Q_{b,r}$ of $Q_b$ according to:
$$Q^{t+1}_{b,r}(y[r]) = \frac{p_{Y[r]|X}(y[r]\,|\,x_b; \theta^t)\, p_{Z[r]|Y[r]}(z_b[r]\,|\,y[r]; \phi^t)}{\sum_{y' \in \{0,1\}} p_{Y[r]|X}(y'\,|\,x_b; \theta^t)\, p_{Z[r]|Y[r]}(z_b[r]\,|\,y'; \phi^t)}. \quad (12)$$
M-step: In this step, we hold $\{Q_b\}$ fixed and update $(\phi, \theta)$ to maximize the lower bound
$$\mathcal{L}_{total}(\phi, \theta, \{Q^{t+1}_b\}) = \sum_{b \in B} \mathcal{L}(z_b|x_b; \phi, \theta, Q^{t+1}_b); \quad (13)$$
then $(\phi, \theta)$ is updated according to:
$$(\phi^{t+1}, \theta^{t+1}) = \arg\max_{\phi, \theta}\ \mathcal{L}_{total}(\phi, \theta, \{Q^{t+1}_b\}). \quad (14)$$
Overall, the EM algorithm starts by initializing $(\phi, \theta, \{Q_b\})$, and then iterates over the two steps until convergence or until a prescribed number of iterations is reached. There are, however, several caveats that call for caution. First, the optimization problem in the M-step cannot be solved in closed form. As such, we take a stochastic gradient descent (SGD) approach².
In each M-step, we perform $\Delta$ updates, where $\Delta$ is a hyper-parameter. Such an approach is sometimes referred to as "Generalized EM" (Wu, 1983; Jojic and Frey, 2001; Greff et al., 2017). Note that since the parameters $\theta$ include the parameters of the neural network performing bag encoding, the objective function $\mathcal{L}_{total}$ is highly non-convex with respect to $\theta$. This makes it desirable to choose an appropriate $\Delta$. Too small a $\Delta$ results in little change in $\theta$ and hence provides insufficient signal to update $\{Q_b\}$; too large a $\Delta$, particularly in the early iterations when $\{Q_b\}$ has not been sufficiently optimized, tends to make the optimization get stuck in an undesired local optimum. In practice, one can try several values of $\Delta$, inspect the value of $\mathcal{L}_{total}$ each achieves, and select the $\Delta$ giving rise to the highest $\mathcal{L}_{total}$. Note that such tuning of $\Delta$ requires no access to the testing data. The second issue is that the EM algorithm is known to be sensitive to initialization. In our implementation, in order to provide a good initial parameter setting, we set each $Q^0_b$ to $z_b$. Despite the fact that $z_b$ contains noise, this is a much better approximation of the posterior of the true labels than any random initialization. The nEM framework needs the encoding of a bag $x$, as discussed in Equation 4. Any suitable neural network can be deployed to achieve this goal. Next, we present the methods widely used for this purpose in DS relation extraction: the Bag Encoding Models.
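The alternation described above can be sketched on a toy problem as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: bag encodings are fixed feature vectors rather than the output of a neural encoder, $\phi$ is held fixed as a hyper-parameter, the noise convention $\phi^0_r = P(Z[r]=1 \mid Y[r]=0)$, $\phi^1_r = P(Z[r]=0 \mid Y[r]=1)$ is assumed, and the loop starts with an M-step because $Q^0_b$ is initialised to $z_b$. All function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0.0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def predict(x_bar: List[float], w: List[float], bias: float) -> float:
    """p(Y[r]=1 | x) for one label: logistic classifier on the encoding."""
    return sigmoid(sum(wk * xk for wk, xk in zip(w, x_bar)) + bias)


def e_step(p_y1: float, z: int, phi0: float, phi1: float) -> float:
    """Posterior Q(Y[r]=1) for one bag and one label.

    ASSUMED convention: phi0 = P(Z[r]=1 | Y[r]=0), phi1 = P(Z[r]=0 | Y[r]=1).
    """
    like_y1 = (1.0 - phi1) if z == 1 else phi1   # P(z | Y[r]=1)
    like_y0 = phi0 if z == 1 else (1.0 - phi0)   # P(z | Y[r]=0)
    num = p_y1 * like_y1
    den = num + (1.0 - p_y1) * like_y0
    # If both terms vanish numerically, fall back to the observed label.
    return num / den if den > 0.0 else float(z)


def m_step(X: Matrix, Q: Matrix, W: Matrix, bias: List[float],
           delta: int, lr: float, rng: random.Random) -> None:
    """Delta stochastic gradient-ascent updates on the lower bound, Q fixed."""
    for _ in range(delta):
        b = rng.randrange(len(X))
        for r in range(len(W)):
            # Gradient of Q*log(p) + (1-Q)*log(1-p) w.r.t. the logit.
            grad = Q[b][r] - predict(X[b], W[r], bias[r])
            for k in range(len(W[r])):
                W[r][k] += lr * grad * X[b][k]
            bias[r] += lr * grad


def run_em(X: Matrix, Z: List[List[int]], phi0: List[float], phi1: List[float],
           n_iters: int = 5, delta: int = 200, lr: float = 0.5,
           seed: int = 0) -> Tuple[Matrix, List[float], Matrix]:
    rng = random.Random(seed)
    n_rel, dim = len(Z[0]), len(X[0])
    W = [[0.0] * dim for _ in range(n_rel)]
    bias = [0.0] * n_rel
    Q = [[float(v) for v in z] for z in Z]       # initialise Q^0_b to z_b
    for _ in range(n_iters):
        m_step(X, Q, W, bias, delta, lr, rng)
        Q = [[e_step(predict(X[b], W[r], bias[r]), Z[b][r], phi0[r], phi1[r])
              for r in range(n_rel)] for b in range(len(X))]
    return W, bias, Q
```

In this sketch, a bag whose encoding resembles positively labelled bags but whose observed label is 0 receives a larger posterior `Q` than a dissimilar bag, which is the mechanism by which missing labels can be recovered.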

Bag Encoding Models
As illustrated in Figure 3, the Bag Encoding Models include three components: Word-Position Embedding, Sentence Encoding, and Sentence Selectors.

Word-Position Embedding
For the $j$-th word in a sentence, Word-Position Embedding generates a vector representation $w_j$ as the concatenation of three components $[w^w_j, w^{p1}_j, w^{p2}_j]$. Specifically, $w^w_j$ is the word embedding of the word, and $w^{p1}_j$ and $w^{p2}_j$ are two position embeddings. Here $w^{p1}_j$ (resp. $w^{p2}_j$) is the embedding of the relative location of the word with respect to the first (resp. second) entity in the sentence. The dimensions of the word and position embeddings are denoted by $d_w$ and $d_p$, respectively.
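The concatenation can be sketched as below. The lookup tables, the clipping of relative offsets to a maximum distance `max_dist`, and the function names are assumptions made for this illustration; the text above does not specify how out-of-range offsets are handled.

```python
from typing import List, Sequence


def _table_index(offset: int, max_dist: int) -> int:
    """Clip a relative offset to [-max_dist, max_dist] and shift it to a row index."""
    return max(-max_dist, min(max_dist, offset)) + max_dist


def word_position_embedding(
    word_vecs: Sequence[Sequence[float]],   # one d_w-dimensional vector per token
    pos1_table: Sequence[Sequence[float]],  # (2*max_dist+1) rows of dimension d_p
    pos2_table: Sequence[Sequence[float]],  # (2*max_dist+1) rows of dimension d_p
    e1: int,                                # token index of the first entity
    e2: int,                                # token index of the second entity
    max_dist: int,
) -> List[List[float]]:
    """Return w_j = [w^w_j, w^p1_j, w^p2_j] for every token j."""
    rows = []
    for j, w_word in enumerate(word_vecs):
        w_p1 = pos1_table[_table_index(j - e1, max_dist)]
        w_p2 = pos2_table[_table_index(j - e2, max_dist)]
        rows.append(list(w_word) + list(w_p1) + list(w_p2))
    return rows
```

Each output row has dimension $d_w + 2 d_p$, which is the token embedding size consumed by the sentence encoder.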

Sentence Encoding
Sentence encoding uses Piecewise Convolutional Neural Networks (PCNN) (Zeng et al., 2015; Lin et al., 2016; Ye et al., 2017; Ji et al., 2017), which consist of a convolution followed by piecewise max-pooling. The convolution operation uses a list of matrix kernels $\{K_1, K_2, \cdots, K_{l_{ker}}\}$ to extract n-gram features through sliding windows of length $l_{win}$. Here $K_i \in \mathbb{R}^{l_{win} \times d_e}$, $l_{ker}$ is the number of kernels, and $d_e = d_w + 2 d_p$ is the dimension of a token embedding $w_j$. Let $w_{j-l_{win}+1:j} \in \mathbb{R}^{l_{win} \times d_e}$ be the concatenation of the token embeddings in the $j$-th window. The output of the convolution operation is a matrix $U \in \mathbb{R}^{l_{ker} \times (m+l_{win}-1)}$ where each element is computed by
$$U_{ij} = \langle K_i, w_{j-l_{win}+1:j} \rangle, \quad 1 \le i \le l_{ker},\ 1 \le j \le m + l_{win} - 1, \quad (15)$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product. In piecewise max-pooling, each row $U_i$ is segmented into three parts $\{U^1_i, U^2_i, U^3_i\}$ depending on whether an element of $U_i$ is on the left or right of the two entities, or between the two entities. Then max-pooling is applied to each segment, giving rise to
$$g_{ip} = \max(U^p_i), \quad 1 \le i \le l_{ker},\ 1 \le p \le 3. \quad (16)$$
Let $g = [g_1, g_2, g_3]$. Then sentence encoding outputs
$$v = \tanh(g) \circ h, \quad (17)$$
where $\circ$ is element-wise multiplication and $h$ is a vector of Bernoulli random variables, representing dropout.
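The piecewise max-pooling step can be illustrated with the short sketch below. The exact segment boundaries (here, each entity position closes the segment it belongs to) and the ordering of the output (grouped by segment) are assumptions made for the example, and the function name is hypothetical.

```python
from typing import List, Sequence


def piecewise_max_pool(U: Sequence[Sequence[float]], e1: int, e2: int) -> List[float]:
    """Pool each row of U over three segments split at the entity positions.

    U has one row per kernel; e1 < e2 are the column positions of the two
    entities. Returns the concatenation [g_1, g_2, g_3], where g_p holds the
    maximum of segment p for every kernel.
    """
    if not U:
        return []
    width = len(U[0])
    if not (0 <= e1 < e2 < width - 1):
        raise ValueError("need 0 <= e1 < e2 < width - 1 so that no segment is empty")
    g_parts: List[List[float]] = [[], [], []]
    for row in U:
        segments = (row[: e1 + 1], row[e1 + 1 : e2 + 1], row[e2 + 1 :])
        for p, seg in enumerate(segments):
            g_parts[p].append(max(seg))      # g_{ip} = max(U_i^p)
    return g_parts[0] + g_parts[1] + g_parts[2]
```

The output has length $3 \times l_{ker}$, matching the sentence vector dimension used by the sentence selectors.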

Sentence Selectors
Let $n$ be the number of sentences in a bag. We denote by $V \in \mathbb{R}^{n \times (l_{ker} \times 3)}$ the matrix whose rows are the sentence vectors $v^T$. $V_{k:}$ and $V_{:j}$ index the $k$-th row vector and the $j$-th column vector of $V$, respectively. Three kinds of sentence selectors are used to construct the bag encoding.
Mean-selector (Lin et al., 2016; Ye et al., 2017): The bag encoding is computed as
$$\bar{x} = \frac{1}{n} \sum_{k=1}^{n} V_{k:}. \quad (18)$$
Max-selector (Jiang et al., 2016): The $j$-th element of the bag encoding $\bar{x}$ is computed as
$$\bar{x}_j = \max(V_{:j}). \quad (19)$$
Attention-selector: The attention mechanism is extensively used for sentence selection in relation extraction through a weighted sum of the sentence vectors, such as in (Lin et al., 2016; Ye et al., 2017; Su et al., 2018). However, all these works assume that the labels are correct and only use the golden label embeddings to select sentences at the training stage. We instead select sentences using all label embeddings $\mathbf{r}$ and construct a bag encoding for each label $r \in R$. The output is then a list of vectors $\{\bar{x}_r\}$, in which the $r$-th vector is calculated through the attention mechanism as
$$\bar{x}_r = \sum_{k=1}^{n} \alpha_k V_{k:}, \qquad \alpha_k = \frac{\exp(e_k)}{\sum_{j=1}^{n} \exp(e_j)}, \qquad e_k = V_{k:} A \mathbf{r}, \quad (20)$$
where $A \in \mathbb{R}^{d_r \times d_r}$ is a diagonal matrix and $d_r$ is the dimension of the relation embedding.
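The three selectors can be sketched in a few lines. This is an illustration only: the attention score is taken to be the bilinear form $V_{k:} A \mathbf{r}$ with a diagonal $A$ followed by a softmax over sentences, which is an assumed reading of the attention mechanism, and the sentence vectors, the diagonal of $A$, and $\mathbf{r}$ are assumed to share one dimension. Function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def mean_selector(V: Matrix) -> List[float]:
    """Bag encoding as the average of the sentence vectors (rows of V)."""
    n = len(V)
    return [sum(col) / n for col in zip(*V)]


def max_selector(V: Matrix) -> List[float]:
    """Bag encoding whose j-th element is the maximum of column j of V."""
    return [max(col) for col in zip(*V)]


def attention_selector(V: Matrix, a_diag: Sequence[float],
                       r_vec: Sequence[float]) -> List[float]:
    """Label-specific bag encoding as an attention-weighted sum of rows of V.

    ASSUMED scoring: e_k = V_k: A r with A = diag(a_diag), then softmax over k.
    """
    scores = [sum(v * a * r for v, a, r in zip(row, a_diag, r_vec)) for row in V]
    top = max(scores)                          # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    return [sum(a * row[j] for a, row in zip(alphas, V)) for j in range(len(V[0]))]
```

Calling `attention_selector` once per label embedding yields the list of label-specific encodings, one per relation.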

Experimental Study
We first conduct experiments on the widely used benchmark dataset Riedel (Riedel et al., 2010), and then on the TACRED (Zhang et al., 2017) dataset. The latter allows us to control the noise level in the labels to observe the behavior and working mechanism of our proposed method. The code for our model is available on the GitHub page³.

Evaluation on the Riedel Dataset
The Riedel dataset 2 is a widely used DS dataset for relation extraction. It was developed in (Riedel et al., 2010)

Ground-Truth Annotation
The Riedel dataset contains no ground-truth labels. The held-out evaluation (Mintz et al., 2009) is highly unreliable in measuring a model's performance against the ground truth. Since this study is concerned with discovering the ground-truth labels, the conventional held-out evaluation is no longer appropriate. For this reason, we annotate a subset of the original test data for evaluation purposes. Specifically, we annotate a bag $x_b$ with all of its correct labels, and if no such labels exist, we label the bag NA.
In total 3762 bags are annotated, which include all bags originally labelled as non-NA ("Part 1") and 2000 bags ("Part 2") selected from the bags that were originally labelled as NA but have a relatively low score for the NA label under a pre-trained PCNN+ATT (Lin et al., 2016) model. The rationale here is that we are interested in inspecting the model's performance in detecting the labels of known relations rather than the NA relation. The statistics of the annotation are shown in Table 1. Through the annotation, we notice that about 36% of the original labels in the original non-NA bags and 53% of the labels in the original NA bags are wrongly assigned. Similar statistics have been reported in previous works (Riedel et al., 2010; Feng et al., 2018).
Table 1: The statistics of ground-truth annotation. "origin" denotes the total number of originally assigned labels of these bags. "correct" and "wrong" denote the total numbers of correctly assigned and wrongly assigned labels. "added" denotes the number of missing labels we added to these bags.

Evaluation Metrics and Baselines
For the Riedel dataset, we train the models on the noisy training set and test the models on the manually labelled test set. The precision-recall (PR) curve is reported to compare the performance of the models. Three baselines are considered:
• PCNN+MEAN (Lin et al., 2016; Ye et al., 2017): A model using PCNN to encode sentences and a mean-selector to generate the bag encoding.
• PCNN+MAX (Jiang et al., 2016): A model using PCNN to encode sentences and a max-selector to generate the bag encoding.
• PCNN+ATT (Lin et al., 2016; Ye et al., 2017; Su et al., 2018): A model using PCNN to encode sentences and an attention-selector to generate the bag encoding.
We compare the three baselines with their nEM versions (namely using them as the Bag Encoding component in nEM), which are denoted with a "+nEM" suffix.

Implementation Detail
Following previous work, we tune our models using three-fold validation on the training set. As the best configuration, we set $d_w = 50$, $d_p = 5$ and $d_r = 230$. For PCNN, we set $l_{win} = 3$, $l_{ker} = 230$ and set the dropout probability to 0.5. We use Adadelta (Zeiler, 2012) with its default setting ($\rho = 0.95$, $\epsilon = 10^{-6}$) to optimize the models, and the initial learning rate is set to 1. The batch size is fixed to 160 and the maximum length of a sentence is set to 120. For the noise model $p_{Z|Y}$, we set $\phi^0_{NA} = 0.3$, $\phi^1_{NA} = 0$ for the NA label and $\phi^0_r = 0$, $\phi^1_r = 0.9$ for the other labels $r \neq NA$. In addition, the number $\Delta$ of SGD updates in the M-step is set to 2000.

Predictive performance
The evaluation results on the manually labelled test set are shown in Figure 4. From the figure, we observe that the PR curves of the nEM models lie above those of their corresponding baselines by significant margins, especially in the low-recall regime. This observation demonstrates the effectiveness of the nEM model in improving the extraction performance of the baseline models. We also observe that models with the attention-selector achieve better performance than models with the mean-selector and the max-selector, demonstrating the superiority of the attention mechanism for sentence selection.
We then analyze the probabilities predicted by PCNN+ATT and PCNN+ATT+nEM for the ground-truth labels in the test set. We divide the predicted probability values into 5 bins and count the number of labels within each bin. The result is shown in Figure 5(a). We observe that the counts for PCNN+ATT in bins 0.0−0.2, 0.2−0.4, 0.4−0.6 and 0.6−0.8 are all greater than those for PCNN+ATT+nEM. But in bin 0.8−1.0, the count for PCNN+ATT+nEM is about 55% larger than that for PCNN+ATT. This observation indicates that nEM raises the overall predicted scores of the ground-truth labels. Figure 5(b) compares PCNN+ATT and PCNN+ATT+nEM in their predicted probabilities on the frequent relations. The result shows that PCNN+ATT+nEM achieves a higher average predicted probability on all these relations, except for the NA relation, on which PCNN+ATT+nEM is nearly level with PCNN+ATT. This phenomenon demonstrates that nEM tends to be more confident in predicting the correct labels. In this case, raising the predicted probability of the correct label does increase the model's ability to make the correct decision, thereby improving performance.
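The binning analysis can be reproduced with a few lines of code. The sketch below counts how many predicted probabilities fall into each of five equal-width bins; the function name and the convention that a probability of exactly 1.0 belongs to the last bin are assumptions made for the example.

```python
from typing import List, Sequence


def bin_counts(probs: Sequence[float], n_bins: int = 5) -> List[int]:
    """Count probabilities in n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # p == 1.0 is assigned to the last bin instead of overflowing.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
    return counts
```

Applying `bin_counts` to the probabilities that two models assign to the ground-truth labels gives the per-bin counts compared in the analysis above.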

Evaluation on the TACRED Dataset
TACRED is a large supervised relation extraction dataset collected in (Zhang et al., 2017). Another advantage of constructing single-sentence bags is that it allows us to pinpoint the correspondence between the correct label and its supporting sentence. The number of relations in TACRED dataset is 42, including a special relation "no relation", which is treated as NA in this study.

Constructing Semi-synthetic Dataset
To obtain insight into the working of nEM, we create a simulated DS dataset by inserting noisy labels into the training set of a supervised dataset, TACRED. Since the training set of TACRED was manually annotated, the labels therein may be regarded as the ground-truth labels. Training on this semi-synthetic dataset allows us to easily observe the models' behaviour with respect to the noisy labels and the true labels. We inject artificial noise into the TACRED training data through the following procedure. For each bag $x_b$ in the training set, we generate a noisy label vector $\tilde{z}_b$ from the observed ground-truth label vector $y_b$. Specifically, $\tilde{z}_b$ is generated by flipping each element $y_b[r]$ from 0 to 1 or from 1 to 0 with probability $p_f$. This procedure simulates a DS dataset by introducing wrong labels into the training bags, thus corrupting the training dataset.
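The noise-injection procedure amounts to an independent flip of every element of the label vector. A minimal sketch follows; the function name `inject_label_noise` and the use of a seeded `random.Random` instance for reproducibility are choices made for this illustration.

```python
import random
from typing import List, Sequence


def inject_label_noise(y: Sequence[int], p_f: float, rng: random.Random) -> List[int]:
    """Flip each element of the {0,1}-valued vector y with probability p_f."""
    if not 0.0 <= p_f <= 1.0:
        raise ValueError("p_f must lie in [0, 1]")
    # rng.random() is uniform on [0, 1), so p_f = 0 never flips and p_f = 1 always flips.
    return [1 - v if rng.random() < p_f else v for v in y]


rng = random.Random(0)
y_b = [0, 1, 0, 0]
z_tilde_b = inject_label_noise(y_b, p_f=0.1, rng=rng)
```

Because each element is flipped independently, a noisy vector may contain several labels or none at all even when the original vector contains exactly one.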

Experimental Settings
Following (Zhang et al., 2017), the common relation classification metrics Precision, Recall and F1 are used for evaluation. The PCNN model is used to generate the bag encoding since sentence selection is not needed in this setting. The same hyper-parameter settings as in Section 5.1.3 are used in this experiment. For the noise model $p_{Z[r]|Y[r]}$ of PCNN+nEM, we set $\phi^0_r = 0.1$, $\phi^1_r = 0.1$ for each label $r \in R$. The number $\Delta$ of SGD updates in each M-step is set to 1600.

Test Results
From Table 2, we see that PCNN+nEM achieves better recall and F1 score than the PCNN model under various noise levels. Additionally, the recall and F1 margins between PCNN and PCNN+nEM increase with the noise level. This suggests that nEM maintains better performance than the corresponding baseline model under various levels of training noise. We also observe that the precision of nEM is consistently lower than that of PCNN when noise is injected into TACRED. This is a necessary trade-off present in nEM. Under the noisy-label assumption, the training of nEM places less confidence in the training labels. This effectively lowers the probability of the seen training labels and assigns the unseen labels some probability of occurring. When trained this way, nEM tends, at prediction time, to predict more labels than its baseline (PCNN) does. Note that in TACRED, each instance contains only a single ground-truth label. Thus the tendency of nEM to predict more labels leads to reduced precision. However, despite this tendency, nEM, compared with PCNN, has a stronger capability of detecting the correct label, which gives it the better recall. The gain in recall dominates the loss in precision.

Training Label Probabilities
The predicted probabilities for the noisy labels and the original true labels are also evaluated under the trained models. The results are shown in Figure 6 (left), which reveals that with increasing noise, the average probabilities for the noisy label sets of PCNN and PCNN+nEM both increase and the average probabilities for the original label sets of PCNN and PCNN+nEM both decrease. The performance degradation of PCNN and PCNN+nEM under noise, however, differs. The average probability for the noisy label sets rises with a higher slope in PCNN than in PCNN+nEM. Additionally, the average probability for the original label sets of PCNN+nEM is higher than or equal to that of PCNN at all noise levels. These observations confirm that the denoising capability of nEM is learned by effectively denoising the training set.

Effectiveness of EM Iterations
For each training bag $x_b$ and each artificial noisy label $r$, we track the probability $Q_{b,r}(y[r] = 1)$ over EM iterations. This probability, measuring the likelihood of the noisy label $r$ being correct, is then averaged over $r$ and over all bags $x_b$. It can be seen in Figure 6 (right) that the average value of $Q_{b,r}(y[r] = 1)$ decreases as the training progresses, leading the model to gradually ignore noisy labels. This demonstrates the effectiveness of EM iterations and validates the proposed EM-based framework.

Concluding Remarks
We proposed the nEM framework to deal with the noisy-label problem in distant supervision relation extraction. We empirically demonstrated the effectiveness of the nEM framework, and provided insights into its working and behaviour through data with controllable noise levels. Our framework combines latent variable models from probabilistic modelling with contemporary deep neural networks. Consequently, it naturally supports a training algorithm that elegantly nests the SGD training of any appropriate neural network inside an EM algorithm. We hope that our approach and the annotated clean testing data will inspire further research along this direction.