DSGAN: Generative Adversarial Training for Distant Supervision Relation Extraction

Distant supervision can effectively label data for relation extraction, but it suffers from the noisy labeling problem. Recent works mainly adopt soft bag-level noise reduction strategies that find the relatively better samples in a sentence bag, which is suboptimal compared with making a hard decision about false positive samples at the sentence level. In this paper, we introduce an adversarial learning framework, which we name DSGAN, to learn a sentence-level true-positive generator. Inspired by Generative Adversarial Networks, we regard the positive samples generated by the generator as negative samples to train the discriminator. The optimal generator is obtained when the discrimination ability of the discriminator has declined the most. We use the generator to filter the distant supervision training dataset and redistribute the false positive instances into the negative set, thereby providing a cleaned dataset for relation classification. The experimental results show that the proposed strategy significantly improves the performance of distant supervision relation extraction compared with state-of-the-art systems.


Introduction
Relation extraction is a crucial task in the field of natural language processing (NLP). It has a wide range of applications including information retrieval, question answering, and knowledge base completion. The goal of a relation extraction system is to predict the relation between an entity pair in a sentence (Zelenko et al., 2003; Bunescu and Mooney, 2005; GuoDong et al., 2005).
With the unbounded number of facts in the real world, it is extremely expensive, and almost impossible, for human annotators to annotate training datasets that meet the needs of every domain. This problem has received increasing attention. Few-shot learning and zero-shot learning (Xian et al., 2017) try to predict unseen classes with few labeled data or even without labeled data. In contrast, distant supervision (DS) (Mintz et al., 2009; Hoffmann et al., 2011; Surdeanu et al., 2012) efficiently generates relational data from plain text for unseen relations. However, it naturally brings some defects: the resulting distantly-supervised training samples are often very noisy (shown in Figure 1), which is the main problem impeding performance (Roth et al., 2013). Most of the current state-of-the-art methods (Zeng et al., 2015; Lin et al., 2016) perform the denoising operation within the sentence bag of an entity pair, and integrate this process into distant supervision relation extraction. Indeed, these methods can filter a substantial number of noisy samples; however, they overlook the case in which all sentences of an entity pair are false positives, which is also a common phenomenon in distant supervision datasets. Under this consideration, an independent and accurate sentence-level noise reduction strategy is the better choice.
In this paper, we design an adversarial learning process (Goodfellow et al., 2014; Radford et al., 2015) to obtain a sentence-level generator that can recognize the true positive samples in the noisy distant supervision dataset without any supervised information. In Figure 1, the existence of false positive samples makes the DS decision boundary suboptimal and therefore hinders the performance of relation extraction. However, in terms of quantity, the true positive samples still make up the majority; this is the prerequisite of our method. Given a discriminator that possesses the decision boundary of the DS dataset (the brown decision boundary in Figure 1), the generator tries to generate true positive samples from the DS positive dataset. Then, we assign the generated samples a negative label and the remaining samples a positive label to challenge the discriminator. Under this adversarial setting, the more true positive samples the generated set includes and the more false positive samples are left in the remaining set, the faster the classification ability of the discriminator drops. Empirically, we show that our method brings consistent performance gains to various deep-neural-network-based models, achieving strong performance on the widely used New York Times dataset (Riedel et al., 2010). Our contributions are three-fold:
• We are the first to consider adversarial learning to denoise the distant supervision relation extraction dataset.
• Our method is sentence-level and model-agnostic, so it can be used as a plug-and-play technique for any relation extractor.
• We show that our method can generate a cleaned dataset without any supervised information, thereby boosting the performance of recently proposed neural relation extractors.
In Section 2, we outline some related works on distant supervision relation extraction. Next, we describe our adversarial learning strategy in Section 3. In Section 4, we show the stability analyses of DSGAN and the empirical evaluation results. And finally, we conclude in Section 5.

Related Work
To address the above-mentioned data sparsity issue, Mintz et al. (2009) first align an unlabeled text corpus with Freebase by distant supervision. However, distant supervision inevitably suffers from the wrong labeling problem. Instead of explicitly removing noisy instances, the early works try to suppress the noise. Riedel et al. (2010) adopt multi-instance single-label learning in relation extraction; Hoffmann et al. (2011) and Surdeanu et al. (2012) model distant supervision relation extraction as a multi-instance multi-label problem.
Recently, some deep-learning-based models (Zeng et al., 2014; Shen and Huang, 2016) have been proposed to solve relation extraction. Naturally, some works try to alleviate the wrong labeling problem with deep learning techniques, and their denoising process is integrated into relation extraction. Zeng et al. (2015) select the single most plausible sentence to represent the relation between an entity pair, which inevitably misses some valuable information. Lin et al. (2016) calculate a series of soft attention weights for all sentences of one entity pair so that the incorrect sentences can be down-weighted; based on the same idea, Ji et al. (2017) bring useful entity information into the calculation of the attention weights. However, compared to these soft attention weight assignment strategies, recognizing the true positive samples in the distant supervision dataset before relation extraction is a better choice. Takamatsu et al. (2012) build a noise-filtering strategy based on linguistic features extracted from many NLP tools, including NER and dependency trees, which inevitably suffers from the error propagation problem, whereas we use only word embeddings as the input information. In this work, we learn a true-positive identifier (the generator) that is independent of the relation prediction of entity pairs, so it can be directly applied on top of any existing relation extraction classifier. Then, we redistribute the false positive samples into the negative set, thereby making full use of the distantly labeled resources.
In this section, we introduce an adversarial learning pipeline to obtain a robust generator that can automatically discover the true positive samples in the noisy distantly-supervised dataset without any supervised information. The overview of our adversarial learning process is shown in Figure 2.
Given a set of distantly-labeled sentences, the generator tries to generate true positive samples from it; however, these generated samples are regarded as negative samples to train the discriminator. Thus, after one scan of the DS positive dataset, the more true positive samples the generator discovers, the sharper the drop in the discriminator's performance. After adversarial training, we hope to obtain a robust generator that is capable of forcing the discriminator to lose as much of its classification ability as possible.
In the following section, we describe the adversarial training pipeline between the generator and the discriminator, including the pre-training strategy, objective functions and gradient calculation. Because the generator involves a discrete sampling step, we introduce a policy gradient method to calculate gradients for the generator.

Pre-Training Strategy
Both the generator and the discriminator require pre-training, which is the common setting for GANs (Cai and Wang, 2017). With better initial parameters, adversarial learning converges more readily. As presented in Figure 2, the discriminator is pre-trained with the DS positive dataset $P$ (label 1) and the DS negative set $N_D$ (label 0). After our adversarial learning process, we desire a strong generator that can, to the maximum extent, collapse the discriminator. A more robust generator is obtained by competing with a more robust discriminator, so we pre-train the discriminator until its accuracy reaches 90% or more. The pre-training of the generator is similar to that of the discriminator; however, for the negative dataset, we use a completely different dataset $N_G$, which ensures the robustness of the experiment. In particular, we let the generator overfit the DS positive dataset $P$. The reason for this setting is that we want the generator to wrongly give high probabilities to all of the noisy DS positive samples at the beginning of the training process. Then, during adversarial learning, the generator learns to gradually decrease the probabilities of the false positive samples.

Generative Adversarial Training for Distant Supervision Relation Extraction
The generator and the discriminator of DSGAN are both modeled by a simple CNN, because CNNs perform well in understanding sentences (Zeng et al., 2014) and have fewer parameters than RNN-based networks. For relation extraction, the input information consists of the sentences and entity pairs; thus, following the common setting (Zeng et al., 2014; Nguyen and Grishman, 2015), we use both word embeddings and position embeddings to convert input instances into continuous real-valued vectors.
What we want the generator to do is accurately recognize true positive samples. Unlike the generator applied in the computer vision field (Im et al., 2016), which generates a new image from input noise, our generator just needs to discover true positive samples in the noisy DS positive dataset. Thus, it realizes the "sampling from a probability distribution" process of discrete GANs (Figure 2). For an input sentence $s_j$, we define the probability of being a true positive sample according to the generator as $p_G(s_j)$. Similarly, for the discriminator, the probability of being a true positive sample is represented as $p_D(s_j)$. We define one epoch as one scan of the entire DS positive dataset. In order to obtain more feedback and make the training process more efficient, we split the DS positive dataset $P = \{s_1, s_2, \ldots, s_j, \ldots\}$ into $N$ bags $B = \{B_1, B_2, \ldots, B_N\}$, and the network parameters $\theta_G$, $\theta_D$ are updated after each bag $B_i$ is processed. Based on the notion of adversarial learning, we define the objectives of the generator and the discriminator as follows, and they are trained alternately towards their respective objectives.
Generator Suppose that the generator produces a set of probabilities $\{p_G(s_j)\}_{j=1 \ldots |B_i|}$ for a sentence bag $B_i$. Based on these probabilities, a set of sentences is sampled, and we denote this set as $T$. This generated dataset $T$ consists of the high-confidence sentences and is regarded as true positive samples by the current generator; however, it will be treated as negative samples to train the discriminator. In order to challenge the discriminator, the objective of the generator, $L_G$, can be formulated as maximizing the probabilities of the generated dataset $T$. Because $L_G$ involves a discrete sampling step, it cannot be directly optimized by gradient-based algorithms. We adopt a common approach: policy-gradient-based reinforcement learning.
The following section gives a detailed introduction to the reinforcement learning setting. The parameters of the generator are continually updated until the convergence condition is reached.
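The bag splitting and the sampling step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and drawing each sentence independently with probability $p_G(s_j)$ is an assumption, since the text only states that $T$ is sampled according to the generator's probabilities.

```python
import random


def split_into_bags(sentences, num_bags):
    """Split the DS positive set P into num_bags contiguous bags of near-equal size."""
    size, extra = divmod(len(sentences), num_bags)
    bags, start = [], 0
    for i in range(num_bags):
        # The first `extra` bags receive one additional sentence.
        end = start + size + (1 if i < extra else 0)
        bags.append(sentences[start:end])
        start = end
    return bags


def sample_generated_set(bag, p_g, rng):
    """Sample T from a bag and return (T, F) with F = B_i - T.

    Assumption: each sentence is kept independently with probability p_G(s_j).
    """
    generated, rest = [], []
    for sentence, prob in zip(bag, p_g):
        (generated if rng.random() < prob else rest).append(sentence)
    return generated, rest
```

Keeping the bag order fixed, as `split_into_bags` does, matches the requirement that every epoch sees the same bag sequence; passing an explicit `random.Random` instance makes the sampling reproducible.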
Discriminator After the generator has generated the sample subset $T$, the discriminator treats them as negative samples; conversely, the remaining part $F = B_i - T$ is treated as positive samples. So, the objective of the discriminator can be formulated as minimizing a cross-entropy loss over these two sets. The update of the discriminator is identical to that of a common binary classification problem, so it can be optimized by any gradient-based algorithm. Unlike the common setting of the discriminator in previous works, our discriminator loads the same pre-trained parameter set at the beginning of each epoch, as shown in Figure 2. There are two reasons. First, at the end of our adversarial training, what we need is a robust generator rather than a discriminator. Second, our generator samples data rather than generating new data from scratch; therefore, the discriminator is relatively easy to collapse. So we design this new adversarial strategy: the most robust generator is yielded when the discriminator has the largest drop of performance in one epoch. To create equal conditions, the bag set $B$ for each epoch is identical, including the order of the bags and the sentences in each bag $B_i$.

Algorithm 1 The DSGAN algorithm.
Data: DS positive set $P$, DS negative set $N_G$ for generator $G$, DS negative set $N_D$ for discriminator $D$
Input: Pre-trained $G$ with parameters $\theta_G$ on dataset $(P, N_G)$; pre-trained $D$ with parameters $\theta_D$ on dataset $(P, N_D)$
Output: Adversarially trained generator $G$
Load parameters $\theta_G$ for $G$
Split $P$ into the bag sequence $P = \{B_1, B_2, \ldots, B_i, \ldots, B_N\}$
repeat
    Load parameters $\theta_D$ for $D$
    for each bag $B_i$ in $P$ do
        Compute the probability $p_G(s_j)$ for each sentence $s_j$ in $B_i$
        Obtain the generated part $T$ by sampling according to $\{p_G(s_j)\}_{j=1 \ldots |B_i|}$ and the rest set $F = B_i - T$
        Update $\theta_D$ with $T$ as negative samples and $F$ as positive samples
        Calculate the reward $r$
        Update $\theta_G$ by policy gradient with the reward $r$
    end for
    Compute the accuracy $ACC_D$ on $N_D$ with the current $\theta_D$
until $ACC_D$ no longer drops
Save $\theta_G$
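Written out, the two objectives described above take a form like the following. This is a reconstruction from the prose, with $T$ labeled negative and $F = B_i - T$ labeled positive for the discriminator; the symbol $L_D$ and the exact normalization are assumptions rather than the authors' verbatim equations.

```latex
% Generator: maximize the discriminator's confidence that the generated set T is positive.
L_G = \sum_{s_j \in T} \log p_D(s_j)

% Discriminator: minimize binary cross-entropy with F as positive and T as negative.
L_D = -\Big( \sum_{s_j \in F} \log p_D(s_j) + \sum_{s_j \in T} \log\big(1 - p_D(s_j)\big) \Big)
```

The two objectives pull in opposite directions on $T$: the generator is rewarded when $p_D(s_j)$ is high on $T$, while the discriminator is trained to push it towards zero.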
Optimizing Generator The objective of the generator is similar to the objective of a one-step reinforcement learning problem: maximizing the expectation of a given function of samples from a parametrized probability distribution. Therefore, we use a policy gradient strategy to update the generator. In reinforcement learning terminology, $s_j$ is the state and $p_G(s_j)$ is the policy. In order to better reflect the quality of the generator, we define the reward $r$ from two angles:
• As in the common setting of adversarial learning, for the generated sample set, we hope that the discriminator's confidence that these samples are positive becomes higher. Therefore, the first component of our reward, $r_1$, is based on this confidence, offset by a baseline $b_1$ whose function is to reduce variance during reinforcement learning.
• The second component comes from the average prediction probability on $N_D$. $N_D$ participates in the pre-training of the discriminator, but not in the adversarial training process. When the classification capacity of the discriminator declines, the accuracy of predicting the samples in $N_D$ as negative gradually drops; thus, the average probability $\tilde{p}$ that the discriminator assigns to them increases. In other words, the generator becomes better. Therefore, for epoch $k$, after processing the bag $B_i$, the reward $r_2$ is calculated from $\tilde{p}$, with a baseline $b_2$ that has the same function as $b_1$.
The gradient of $L_G$ can be formulated as below:
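The reward components and the gradient described above can be sketched as follows. This is a reconstruction from the prose, not the authors' exact equations: the averaging over $T$, the plain difference from the baseline in $r_2$, and the symbol $\tilde{p}$ are assumptions, and how $r_1$ and $r_2$ are combined into $r$ is left open here.

```latex
% First reward component: discriminator confidence on the generated set, minus a baseline.
r_1 = \frac{1}{|T|} \sum_{s_j \in T} p_D(s_j) - b_1

% Second reward component: average positive probability on the held-aside negative set N_D.
\tilde{p} = \frac{1}{|N_D|} \sum_{s_j \in N_D} p_D(s_j), \qquad r_2 = \tilde{p} - b_2

% Policy-gradient (REINFORCE-style) estimate of the generator gradient over the sampled set T.
\nabla_{\theta_G} L_G \approx \frac{1}{|T|} \sum_{s_j \in T} r \, \nabla_{\theta_G} \log p_G(s_j)
```

Because the reward multiplies $\nabla_{\theta_G} \log p_G(s_j)$, sentences sampled in rounds that hurt the discriminator more have their generator probability increased, which is what lets the discrete sampling step be trained without differentiating through it.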

Cleaning Noisy Dataset with Generator
After our adversarial learning process, we obtain one generator per relation type. Each generator is capable of generating true positive samples for its corresponding relation type. Thus, we can use the generator to filter the noisy samples out of the distant supervision dataset. Concretely, we use the generator as a binary classifier. In order to make maximum use of the data, we adopt the following strategy: for an entity pair with a set of annotated sentences, if all of these sentences are judged to be false positives by our generator, this entity pair is redistributed into the negative set. Under this strategy, the scale of the distant supervision training set remains unchanged.
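The redistribution rule above can be sketched as a small function. This is an illustration under stated assumptions: `positive_pairs` and `is_true_positive` are hypothetical names, and the callable stands in for the trained generator used as a binary classifier (for example, thresholding $p_G(s_j)$, where the threshold itself is not specified in the text).

```python
def redistribute(positive_pairs, is_true_positive):
    """Move entity pairs whose sentences are all judged false positive to the negative set.

    positive_pairs: dict mapping an entity pair to its list of DS-positive sentences.
    is_true_positive: callable standing in for the generator used as a binary classifier.
    Returns (kept_positive, moved_to_negative); no entity pair is dropped.
    """
    kept, moved = {}, {}
    for pair, sentences in positive_pairs.items():
        if any(is_true_positive(s) for s in sentences):
            kept[pair] = sentences
        else:
            # Every sentence of this pair was rejected, so the pair is relabeled as negative.
            moved[pair] = sentences
    return kept, moved
```

Since every entity pair ends up in exactly one of the two returned dictionaries, the total size of the training set is unchanged, as the strategy requires.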

Experiments
This paper proposes an adversarial learning strategy to detect true positive samples in the noisy distant supervision dataset. Due to the absence of supervised information, we define a generator that heuristically learns to recognize true positive samples by competing with a discriminator. Therefore, our experiments are intended to demonstrate that our DSGAN method possesses this capability. To this end, we first briefly introduce the dataset and the evaluation metrics. Empirically, adversarial learning is unstable to some extent; therefore, we next illustrate the convergence of our adversarial training process. Finally, we demonstrate the effectiveness of our generator from two angles: the quality of the generated samples and the performance on the widely used distant supervision relation extraction task.

Evaluation and Implementation Details
The Riedel dataset (Riedel et al., 2010) is a commonly used distant supervision relation extraction dataset. Freebase is a huge knowledge base including billions of triples: the entity pair and the specific relationship between them. Given these triples, the sentences of each entity pair are selected from the New York Times (NYT) corpus. Entity mentions in the NYT corpus are recognized by the Stanford named entity recognizer (Finkel et al., 2005). There are 52 actual relations and a special relation NA, which indicates that there is no relation between the head and tail entities. NA instances are defined as entity pairs that appear in the same sentence but are not related according to Freebase.
Due to the absence of a corresponding labeled dataset, there is no ground-truth test dataset with which to evaluate the performance of a distant supervision relation extraction system. Under this circumstance, previous works adopt the held-out evaluation, which provides an approximate measure of precision without requiring costly human evaluation. It builds a test set in which entity pairs are also extracted from Freebase. Similarly, relation facts that are discovered from test articles are automatically compared with those in Freebase. CNNs are widely used in relation classification (Santos et al., 2015; Qin et al., 2017); thus, the generator and the discriminator are both modeled as a simple CNN with window size $c_w$ and kernel size $c_k$. Word embeddings are taken directly from the word embedding matrix released by Lin et al. (2016). Position embeddings follow the same setting as previous works: relative distances are limited to the range from -30 to 30. Some detailed hyperparameter settings are displayed in Table 1.
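The position feature mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the entity is treated as a single token at `entity_index`, and shifting the clipped offsets to non-negative ids for an embedding lookup is a common implementation choice rather than something stated in the text.

```python
MAX_DISTANCE = 30  # relative distances are limited to the range [-30, 30]


def relative_positions(num_tokens, entity_index, max_distance=MAX_DISTANCE):
    """Clipped offset of every token from the entity located at entity_index."""
    return [
        max(-max_distance, min(max_distance, i - entity_index))
        for i in range(num_tokens)
    ]


def position_ids(num_tokens, entity_index, max_distance=MAX_DISTANCE):
    """Shift clipped offsets to non-negative ids (assumed embedding-table indexing)."""
    return [
        offset + max_distance
        for offset in relative_positions(num_tokens, entity_index, max_distance)
    ]
```

In practice one such sequence is computed for the head entity and one for the tail entity, and each id indexes a row of a position embedding table with `2 * MAX_DISTANCE + 1` rows.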

Training Process of DSGAN
Because adversarial learning is widely regarded as an effective but unstable technique, here we illustrate how some properties change during the training process, in order to indicate the learning trend of our proposed approach. We use 3 relation types as examples: /business/person/company, /people/person/place_lived and /location/neighborhood/neighborhood_of. We choose them because they come from three major classes (business, people, location) of the Riedel dataset and they all have enough distantly supervised instances. The first row in Figure 3 shows how the classification ability of the discriminator changes during training. The curves become darker as the epochs go on. Because the discriminator reloads the pre-trained parameters at the beginning of each epoch, all curves start from the same point for each relation type; as adversarial training proceeds, the generator gradually collapses the discriminator. The figures in the second row reflect the performance of the generators in terms of how difficult it is to train with the positive datasets generated by different strategies. Starting from the noisy DS positive dataset $P$, DSGAN denotes that the cleaned positive dataset is generated by our DSGAN generator; Random means that the positive set is randomly selected from $P$; Pre-training denotes that the dataset is selected according to the prediction probability of the pre-trained generator. These three new positive datasets are of the same size.
The accuracy is calculated on the negative set $N_D$. At the beginning of adversarial learning, the discriminator performs well on $N_D$; moreover, $N_D$ is not used during adversarial training. Therefore, the accuracy on $N_D$ is the criterion that reflects the performance of the discriminator. The trends in the first row of Figure 3 are not limited to $N_D$; different randomly selected negative sets show the same trends. In the early epochs, the samples generated by the generator increase the accuracy, because the generator is not yet able to challenge the discriminator; however, as the training epochs increase, this accuracy gradually decreases, which means that the discriminator becomes weaker. This is because the generator gradually learns to generate more accurate true positive samples in each bag. After the proposed adversarial learning process, the generator is strong enough to collapse the discriminator. Figure 4 gives a more intuitive display of the trend of accuracy. Note that there is a critical point in the decline of accuracy for each presented relation type. This is because the generator gets only one scan of the noisy dataset in which to challenge the discriminator; the critical point is reached when the generator has already become robust enough. Thus, we stop the training process when the model reaches this critical point. To sum up, the capability of our generator increases steadily, which indicates that DSGAN is a robust adversarial learning strategy.
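The stopping rule above (train until the end-of-epoch accuracy on $N_D$ no longer drops) can be sketched as a simple selection over the recorded accuracies. This is one possible reading, not the authors' procedure: the function name and the `patience` parameter are hypothetical, and `patience` is introduced only to tolerate the early epochs in which the accuracy may rise before it starts to fall.

```python
def select_stopping_epoch(acc_per_epoch, patience=2):
    """Return the epoch with the lowest accuracy on N_D before the decline stalls.

    acc_per_epoch: non-empty list of end-of-epoch discriminator accuracies on N_D.
    patience: number of consecutive non-improving epochs tolerated (an assumption).
    """
    best_epoch, best_acc, waited = 0, acc_per_epoch[0], 0
    for epoch, acc in enumerate(acc_per_epoch[1:], start=1):
        if acc < best_acc:
            # The discriminator got weaker: remember this epoch's generator.
            best_epoch, best_acc, waited = epoch, acc, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch
```

The generator parameters saved at the returned epoch would then be the ones kept for cleaning the dataset.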

Quality of Generator
Due to the absence of supervised information, we validate the quality of the generator from another angle. As Figure 1 suggests, for one relation type, the true positive samples must have evidently higher relevance to one another (the cluster of purple circles). Therefore, a positive set with more true positive samples is easier to train on; in other words, convergence is faster and the degree of fit on the training set is higher. Based on this observation, we present the comparison tests in the second row of Figure 3. We build three positive datasets from the noisy distant supervision dataset $P$: the randomly selected positive set, the positive set based on the pre-trained generator and the positive set based on the DSGAN generator. For the pre-trained generator, the positive set is selected according to the probability of being positive, from high to low. These three sets have the same size and are paired with the same negative set. The positive set from the DSGAN generator yields the best performance, which indicates that our adversarial learning process is able to produce a robust true-positive generator. In addition, the pre-trained generator also performs well; however, compared with the DSGAN generator, it cannot provide the boundary between the false positives and the true positives.

Performance on Distant Supervision Relation Extraction
Based on the proposed adversarial learning process, we obtain a generator that can recognize the true positive samples in the noisy distant supervision dataset. Naturally, the improvement of distant supervision relation extraction provides an intuitive evaluation of our generator. We adopt the strategy described in Section 3.3 to redistribute the dataset. After obtaining this redistributed dataset, we use it to train recent state-of-the-art models and observe whether it brings further improvement to these systems. Zeng et al. (2015) and Lin et al. (2016) are both robust models for the wrong labeling problem of distant supervision relation extraction. According to the comparison displayed in Figure 5 and Figure 6, all four models (CNN+ONE, CNN+ATT, PCNN+ONE and PCNN+ATT) achieve further improvement. Even though Zeng et al. (2015) and Lin et al. (2016) are designed to alleviate the influence of false positive samples, both of them focus only on noise filtering within the sentence bag of an entity pair. Zeng et al. (2015) combine at-least-one multi-instance learning with a deep neural network to extract only one active sentence to represent the target entity pair; Lin et al. (2016) assign soft attention weights to the representations of all sentences of one entity pair, then employ the weighted sum of these representations to predict the relation between the target entity pair. However, from our manual inspection of the Riedel dataset (Riedel et al., 2010), we found another false positive case in which all the sentences of a specific entity pair are wrong. The aforementioned methods overlook this case, while the proposed method can handle it. Our DSGAN pipeline is independent of the relation prediction of entity pairs, so we can adopt our generator as a true-positive indicator to filter the noisy distant supervision dataset before relation extraction, which explains the further improvements in Figure 5 and Figure 6.
In order to give a more intuitive comparison, Table 2 presents the AUC value of each PR curve, that is, the area under the curve. A larger AUC value reflects better performance. Also, as can be seen from the results of the t-test evaluation, all the p-values are less than 5e-02, so the improvements are statistically significant at the 0.05 level.
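The AUC of a precision-recall curve can be approximated from its sampled points. The text does not state how the area was computed, so the trapezoidal rule below is an assumption, and the function name is hypothetical.

```python
def pr_auc(recalls, precisions):
    """Area under a precision-recall curve using the trapezoidal rule.

    recalls, precisions: equal-length sequences of curve points; they are sorted
    by recall here, so the input order does not matter.
    """
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        # Width of the recall interval times the mean precision over it.
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area
```

Under held-out evaluation the curve usually covers only part of the recall axis, so AUC values computed this way are comparable only between curves evaluated over the same recall range.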

Conclusion
Distant supervision has become a standard method in relation extraction. However, while it brings convenience, it also introduces noise into the distantly labeled sentences. In this work, we propose the first generative adversarial training method for robust distant supervision relation extraction. More specifically, our framework has two components: a generator that generates true positives, and a discriminator that tries to classify positive and negative data samples. With adversarial training, our goal is to gradually decrease the performance of the discriminator, while the generator improves at predicting true positives as equilibrium is reached. Our approach is model-agnostic, and thus can be applied to any distant supervision model. Empirically, we show that our method can significantly improve the performance of many competitive baselines on the widely used New York Times dataset.