Adversarial Training for Relation Extraction

Adversarial training is a means of regularizing classification algorithms by adding adversarial noise to the training data. We apply adversarial training to relation extraction within the multi-instance multi-label learning framework. We evaluate various neural network architectures on two different datasets. Experimental results demonstrate that adversarial training is generally effective for both CNN and RNN models and significantly improves the precision of predicted relations.


Introduction
Despite the recent successes of deep neural networks in various applications, neural network models tend to be overconfident about noise in input signals. Adversarial examples (Szegedy et al., 2013) are examples generated by adding noise in the form of small perturbations to the original data; these perturbations are often imperceptible to humans but drastically increase the loss incurred by a deep model. Adversarial training (Goodfellow et al., 2014) is a technique for regularizing deep models by encouraging the neural network to correctly classify both unmodified examples and perturbed ones, which in practice not only enhances the robustness of the neural network but also improves its generalizability. Previous work has largely applied adversarial training to straightforward classification tasks, including image classification (Goodfellow et al., 2014) and text classification (Miyato et al., 2016), where the goal is simply to predict a single label for every example and the training examples provide strong supervision. It remains unclear whether adversarial training is still effective for tasks with much weaker supervision, e.g., distant supervision (Mintz et al., 2009), or with an evaluation metric other than prediction accuracy (e.g., F1 score).
This paper focuses on the task of relation extraction, where the goal is to predict the relation that exists between a particular entity pair given several text mentions. One popular way to handle this problem is the multi-instance multi-label learning framework (MIML) (Hoffmann et al., 2011; Surdeanu et al., 2012) with distant supervision (Mintz et al., 2009), where the mentions for an entity pair are aligned with the relations in Freebase (Bollacker et al., 2008). In this setting, relation extraction is much harder than the canonical classification problem in two respects: (1) although distant supervision can provide a large amount of data, the training labels are very noisy, and due to the multi-instance framework, the supervision is much weaker; (2) the evaluation metric of relation extraction is often the precision-recall curve or F1 score, which cannot be represented (and thereby optimized) directly in the loss function.
To evaluate the effectiveness of adversarial training for relation extraction, we apply it to two different architectures (a convolutional neural network and a recurrent neural network) on two different datasets. Experimental results show that even on this harder task with much weaker supervision, adversarial training still improves performance in all of the cases we studied.

Related Work
Neural Relation Extraction: In recent years, neural network models have shown superior performance over approaches using hand-crafted features in various tasks. Convolutional neural networks (CNN) are among the first deep models that have been applied to relation extraction (Santos et al., 2015; Nguyen and Grishman, 2015). Variants of convolutional networks include piecewise-CNN (PCNN) (Zeng et al., 2014), split CNN (Adel et al., 2016), CNN with sentence-wise pooling (Jiang et al., 2016) and attention CNN. Recurrent neural networks (RNN) are another popular choice, and have been used in recent work in the form of recurrent CNNs (Cai et al., 2016) and attention RNNs (Zhou et al., 2016). An instance-level selective attention mechanism has also been introduced for MIML, and has significantly improved the prediction accuracy for several of these base deep models.
Adversarial Training: Adversarial training (AT) (Goodfellow et al., 2014) was originally introduced in the context of image classification tasks where the input data is continuous. Miyato et al. (2015, 2016) adapt AT to text classification by adding perturbations to word embeddings and also extend AT to a semi-supervised setting by minimizing the entropy of the predicted label distributions on unlabeled data. AT introduces an end-to-end and deterministic way of perturbing data by utilizing the gradient information. Other works regularize classifiers by adding random noise to the data, such as dropout (Srivastava et al., 2014) and its variant for NLP tasks, word dropout (Iyyer et al., 2015). Xie et al. (2017) discuss various data noising techniques for language models. Søgaard (2013) and others focus on linguistic adversaries.

Methodology
We first introduce MIML and then describe the base neural network models we consider: piecewise CNN (Zeng et al., 2015) (PCNN) and bidirectional GRU (Cho et al., 2014) (RNN). We also utilize the selective attention mechanism for both PCNN and RNN models. Adversarial training is presented at the end of this section.

Preliminaries
In MIML, we consider the set of text sentences $X = \{x_1, x_2, \ldots, x_n\}$ for each entity pair. Supposing we have $R$ predefined relations (including NA) to extract, we want to predict the probability of each of the $R$ relations given the mentions.
Formally, for each relation $r$, we want to predict $P(r \mid x_1, \ldots, x_n)$.
Note that since an entity pair may have no relations, we introduce a special relation NA to the label set. Hence, we simply assume there will be at least one relation existing for every entity pair. During evaluation, we ignore the probability predicted for the NA relation.
Figure 1: The computation graph of encoding a sentence $x_i$ with adversarial training. $e_i$ denotes the adversarial perturbation w.r.t. $x_i$. Dropout is placed on the output of the variables in the double-lined rectangles.

Neural Architectures
Input Representation: For each sentence $x_i$, we use pretrained word embeddings to project each word token into a $d_w$-dimensional space. We also need to include the entity position information in $x_i$, so we introduce an extra feature vector $p_i^{(w)}$ for each word $w$ to encode the entities' positions. One choice is the position embedding (Zeng et al., 2014): for each word $w$, we compute the relative distances to the two entities and embed the distances in two $d_p$-dimensional vectors, which are then concatenated as $p_i^{(w)}$. Position embedding introduces extra variables in the model and slows down training. We also investigate a simpler choice, indicator encoding: when a word $w$ is exactly an entity, we generate a $d_p$-dimensional all-ones vector, and an all-zeros vector otherwise. In our experiments, position embedding is crucial for PCNN due to the spatial invariance of CNNs. For RNN, position embedding helps little (likely because an RNN is able to exploit temporal dependencies), so we adopt indicator encoding instead.
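The two position features described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the clipping bound `max_dist`, the single-token entities, and the example sentence are assumptions, not details taken from the paper or its released code.

```python
def relative_positions(num_tokens, head_idx, tail_idx, max_dist=30):
    """Return, per token, the clipped relative distances to the two entities.

    In a position-embedding model these integer distances would index two
    learned d_p-dimensional embedding tables. max_dist is an assumed bound.
    """
    def clip(d):
        return max(-max_dist, min(max_dist, d))

    return [(clip(i - head_idx), clip(i - tail_idx)) for i in range(num_tokens)]


def indicator_encoding(num_tokens, head_idx, tail_idx, d_p=3):
    """Return a d_p-dimensional all-ones vector at entity tokens, zeros elsewhere."""
    entity_positions = {head_idx, tail_idx}
    return [[1.0] * d_p if i in entity_positions else [0.0] * d_p
            for i in range(num_tokens)]


# Hypothetical five-token sentence with entities at positions 0 and 4.
tokens = ["Obama", "was", "born", "in", "Hawaii"]
positions = relative_positions(len(tokens), head_idx=0, tail_idx=4)
indicators = indicator_encoding(len(tokens), head_idx=0, tail_idx=4)
```

The indicator variant needs no learned parameters, which matches the motivation given above for preferring it when position embeddings bring little benefit.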
Sentence Encoder: For a sentence $x_i$, we want to apply a non-linear transformation to the vector representation of $x_i$ to derive a feature vector $s_i = f(x_i; \theta)$ given a set of parameters $\theta$. We consider both PCNN and RNN as $f(x_i; \theta)$.
For PCNN, inheriting the settings from Zeng et al. (2014), we adopt a convolution kernel with window size 3 and $d_s$ output channels and then apply piecewise pooling and ReLU (Nair and Hinton, 2010) as an activation function to eventually obtain a $3 \cdot d_s$-dimensional feature vector $s_i$.
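As an illustration of piecewise pooling, the following minimal Python sketch max-pools each convolution channel separately over the three segments delimited by the two entity positions and then applies ReLU. The function name, the plain-list representation of the convolution output, the exact segment boundaries, and the handling of empty segments are assumptions made for illustration.

```python
def piecewise_max_pool(conv_out, head_idx, tail_idx):
    """Max-pool each channel over three segments, then apply ReLU.

    conv_out: list of T rows, each a list of d_s channel activations.
    Segments (an assumed convention): [0, first], (first, second], (second, T).
    Returns a flat list of length 3 * d_s.
    """
    first, second = sorted((head_idx, tail_idx))
    segments = [conv_out[:first + 1],
                conv_out[first + 1:second + 1],
                conv_out[second + 1:]]
    d_s = len(conv_out[0])
    pooled = []
    for segment in segments:
        for c in range(d_s):
            # An empty segment contributes 0 (assumption for edge cases).
            best = max((row[c] for row in segment), default=0.0)
            pooled.append(max(0.0, best))  # ReLU
    return pooled


# Toy convolution output: T = 5 positions, d_s = 2 channels.
conv_out = [[1.0, -2.0], [0.5, 3.0], [-1.0, -1.0], [2.0, 0.0], [0.0, -4.0]]
pooled = piecewise_max_pool(conv_out, head_idx=1, tail_idx=3)
```

Because ReLU is monotone, applying it after max pooling gives the same result as applying it before on non-empty segments.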
For RNN, we adopt a bidirectional GRU with $d_s$ hidden units and concatenate the hidden states of the last timesteps from both the forward and the backward RNN as a $2 \cdot d_s$-dimensional feature vector $s_i$.
Selective Attention: Following prior work on selective attention, for each relation $r$, we aim to softly select an attended sentence $s^r$ by taking a weighted average of $s_1, s_2, \ldots, s_n$, namely $s^r = \sum_i \alpha_i^r s_i$. Here $\alpha^r$ denotes the attention weights w.r.t. relation $r$. For computing the weights, we define a query vector $q_r$ for each relation $r$ and compute $\alpha^r = \mathrm{softmax}(u^r)$ where $u_i^r = \tanh(s_i)^\top q_r$. The query vector $q_r$ can be considered as the embedding vector for the relation $r$, which is jointly learned with other model parameters.
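The attention step can be made concrete with a small pure-Python sketch. It scores each sentence vector by the inner product of its element-wise tanh with the relation query, normalizes the scores with a softmax, and returns the weighted average. The function names and the toy vectors are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_attention(sentence_vecs, query):
    """Return (alpha, s_r) for one relation.

    sentence_vecs: list of n sentence feature vectors s_i.
    query: the relation query vector q_r, same dimension as each s_i.
    """
    scores = [sum(math.tanh(x) * q for x, q in zip(s, query))
              for s in sentence_vecs]
    alpha = softmax(scores)
    dim = len(sentence_vecs[0])
    s_r = [sum(a * s[j] for a, s in zip(alpha, sentence_vecs))
           for j in range(dim)]
    return alpha, s_r


# Toy bag of three 2-dimensional sentence vectors and one relation query.
sentence_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, -1.0]
alpha, s_r = selective_attention(sentence_vecs, query)
```

With a zero query every sentence receives the same weight, so the attended vector reduces to the plain average of the bag.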
Loss Function: For an entity pair, we compute the probability of relation $r$ by $P(r \mid X; \theta) = \mathrm{softmax}(A s^r + b)$, where $A$ is the projection matrix and $b$ is the bias. For the multi-label setting, suppose $K$ relations $r_1, \ldots, r_K$ exist for $X$. Summing the negative log probabilities of all those labels yields the final loss function
$$L(X; \theta) = -\sum_{i=1}^{K} \log P(r_i \mid X; \theta). \tag{1}$$
Dropout: For regularizing the parameters, we apply dropout (Srivastava et al., 2014) to both the word embedding and the sentence feature vector $s_i$. Note that we do not perform dropout on the position embedding $p_i$.
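The loss in Eq. (1) can be sketched as follows, assuming it is the negated sum of log probabilities so that it can be minimized. For each gold relation $r$, the probability is read off the $r$-th entry of the softmax computed from that relation's attended vector. All names and the toy parameters are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relation_probability(A, b, s_r, r):
    """P(r | X): the r-th entry of softmax(A s_r + b)."""
    logits = [sum(w * x for w, x in zip(row, s_r)) + bias
              for row, bias in zip(A, b)]
    return softmax(logits)[r]


def miml_loss(A, b, attended, gold_relations):
    """Negative sum of log probabilities over the gold relations of one pair.

    attended: dict mapping relation index r to its attended vector s_r.
    """
    return -sum(math.log(relation_probability(A, b, attended[r], r))
                for r in gold_relations)


# Toy setting: R = 3 relations, 2-dimensional sentence features.
A = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
attended = {0: [1.0, 2.0], 1: [0.5, -1.0], 2: [0.0, 0.0]}
loss = miml_loss(A, b, attended, gold_relations=[0, 2])
```

With all-zero parameters every relation has probability 1/3, so the loss for two gold relations equals 2 log 3, which is a convenient sanity check.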

Adversarial Training
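Adversarial training (AT) is a way of regularizing the classifier to improve robustness to small worst-case perturbations by computing the gradient direction of a loss function w.r.t. the data. AT generates continuous perturbations, so we add the adversarial noise at the level of the word embeddings, similar to Miyato et al. (2016). Formally, consider the input data $X$ and let $V$ denote the word embeddings of all the words in $X$. AT adds a small adversarial perturbation $e_{adv}$ to $V$ and optimizes the following objective instead of Eq. (1):
$$L_{adv}(X; \theta) = L(X + e_{adv}; \theta), \tag{2}$$
$$e_{adv} = \arg\max_{\|e\| \le \epsilon} L(X + e; \hat{\theta}), \tag{3}$$
where $X + e$ denotes adding the perturbation $e$ to the word embeddings $V$, $\epsilon$ bounds the norm of the perturbation, and $\hat{\theta}$ is a fixed copy of the current parameters $\theta$. Since the exact maximization in Eq. (3) is intractable for neural networks, it is approximated by linearizing the loss around $X$ (Goodfellow et al., 2014):
$$e_{adv} = \epsilon \, g / \|g\|, \quad g = \nabla_V L(X; \hat{\theta}). \tag{4}$$
Accordingly, in Eq. (4), $\|g\|$ denotes the norm of the gradients over all the words from all the sentences in $X$. In addition, we do not perturb the feature vector $p$ for entity positions. A visualization of the process is shown in Fig. 1.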
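The perturbation step can be sketched in plain Python, assuming the common normalized-gradient form $e_{adv} = \epsilon \, g / \|g\|$ with a single norm taken over all words of all sentences in the bag. In practice the gradient would come from automatic differentiation in the training framework; here it is passed in as a nested list. The function names and the toy gradient are illustrative assumptions.

```python
import math


def adversarial_perturbation(grads, eps):
    """Scale the embedding gradient to have global L2 norm eps.

    grads: nested list [sentence][word][dim] holding dL/dV for one bag X.
    Returns e_adv with the same shape.
    """
    norm = math.sqrt(sum(g * g for sent in grads for word in sent for g in word))
    if norm == 0.0:
        # No gradient signal: return a zero perturbation (assumed convention).
        return [[[0.0 for _ in word] for word in sent] for sent in grads]
    scale = eps / norm
    return [[[scale * g for g in word] for word in sent] for sent in grads]


def perturb(embeddings, e_adv):
    """Add e_adv to the word embeddings; position features stay untouched."""
    return [[[v + e for v, e in zip(word, e_word)]
             for word, e_word in zip(sent, e_sent)]
            for sent, e_sent in zip(embeddings, e_adv)]


# Toy gradient for a bag of two sentences with 2-dimensional embeddings.
grads = [[[3.0, 0.0], [0.0, 4.0]], [[0.0, 0.0]]]
e_adv = adversarial_perturbation(grads, eps=0.01)
```

Setting `eps` to 0 yields a zero perturbation, so the perturbed input equals the original one and the adversarial loss coincides with the original loss.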

Experiments
To measure the effectiveness of adversarial training on relation extraction, we evaluate both the CNN (PCNN) and RNN (bi-GRU) models on two different datasets, the NYT dataset (NYT) developed by Riedel et al. (2010) and the UW dataset (UW). All code is implemented in TensorFlow (Abadi et al., 2016) and available at https://github.com/jxwuyi/AtNRE. We adopt the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001, batch size 50 and dropout rate 0.5. For adversarial training, the only parameter is the perturbation magnitude $\epsilon$. In each of the following experiments, we fix all the hyper-parameters of the base model, perform a binary search solely on $\epsilon$ and report the most effective value of $\epsilon$.

Datasets
The statistics of the two datasets are summarized in Table 1. We exclude sentences longer than Sent-Len during training and randomly split data for entity pairs with more than 500 mentions. Note that the numbers of target relations in the two datasets differ significantly, which helps demonstrate the applicability of adversarial training in various evaluation settings.
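One possible reading of this preprocessing is sketched below: sentences above the length limit are dropped, and the remaining mentions of an entity pair are shuffled and cut into bags of at most 500 mentions. The chunking strategy, the dropping of empty bags, the seed, and all names are assumptions; the text above does not specify how the split is performed.

```python
import random


def prepare_bags(bags, max_sent_len, max_mentions=500, seed=0):
    """Filter long sentences and split oversized bags at random.

    bags: list of (entity_pair, sentences), each sentence a list of tokens.
    Returns a list of (entity_pair, sentences) with at most max_mentions
    sentences per bag.
    """
    rng = random.Random(seed)
    prepared = []
    for pair, sentences in bags:
        kept = [s for s in sentences if len(s) <= max_sent_len]
        if not kept:
            continue  # assumed: pairs with no remaining mention are skipped
        rng.shuffle(kept)
        for start in range(0, len(kept), max_mentions):
            prepared.append((pair, kept[start:start + max_mentions]))
    return prepared


# Toy bag: five short sentences and one sentence that exceeds the limit.
toy_bags = [(("e1", "e2"), [["tok"] * 3 for _ in range(5)] + [["tok"] * 10])]
prepared = prepare_bags(toy_bags, max_sent_len=5, max_mentions=2)
```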
Since the test set of the UW dataset only contains 200 sentences, we adopt a subset of the test set from the NYT dataset: all the entity pairs with the corresponding 4 relations in UW and another 1500 randomly selected NA pairs.

Practical Performances
The NYT dataset: We utilize previously released word embeddings, which have $d_w = 50$ dimensions. For model parameters, we set $d_e = 5$ (dimension of the entity position feature vector) and $d_s = 230$ (dimension of the sentence feature vector) for PCNN and $d_e = 3$ and $d_s = 150$ for RNN. For adversarial training, we choose $\epsilon = 0.01$ for PCNN and $\epsilon = 0.02$ for RNN. We empirically observed that PCNN performs significantly worse when dropout is added to the word embeddings. Hence we only apply dropout to $s_i$ for PCNN. RNN, by contrast, still performs well even with a dropout rate of 0.5 on the word embeddings. We conjecture that this is because PCNN is more sensitive to input signals and the dimensionality of the word embedding ($d_w = 50$) is very small.
The precision-recall curves for different models on the test set are shown in Fig. 2. Since the precision drops significantly at large recall values on the NYT dataset, we emphasize the part of the curve with recall smaller than 0.5 in the figure. Adversarial training significantly improves the precision for both PCNN and RNN models. We also show the precision at some particular recall values as well as the AUC (for the whole PR curve) in Table 2, where RNN generally leads to better precision.
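For readers unfamiliar with this style of evaluation, the sketch below shows one common way to obtain a precision-recall curve from ranked predictions, read off the precision at a given recall, and integrate the curve with the trapezoidal rule. The exact procedure used for the reported numbers is not stated here, so the integration rule, the starting point of the curve, and all names are assumptions.

```python
def precision_recall_curve(predictions, num_gold):
    """Return (recall, precision) points from ranked predictions.

    predictions: list of (score, is_correct) for non-NA (pair, relation)
    predictions. num_gold: total number of gold non-NA facts in the test set.
    """
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    points = []
    correct = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        points.append((correct / num_gold, correct / k))
    return points


def precision_at_recall(points, target):
    """Precision at the first point whose recall reaches target, else None."""
    for recall, precision in points:
        if recall >= target:
            return precision
    return None


def area_under_curve(points):
    """Trapezoidal area under the PR curve, starting at recall 0 (assumed)."""
    area = 0.0
    prev_recall, prev_precision = 0.0, points[0][1]
    for recall, precision in points:
        area += (recall - prev_recall) * (precision + prev_precision) / 2.0
        prev_recall, prev_precision = recall, precision
    return area


# Toy ranked predictions with four gold facts in total.
predictions = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
points = precision_recall_curve(predictions, num_gold=4)
auc = area_under_curve(points)
```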

The UW dataset: We train word embeddings of $d_w = 200$ dimensions using GloVe (Pennington et al., 2014) on the New York Times Corpus in this experiment. For model parameters, we set the entity feature dimension $d_e = 5$ and sentence feature dimension $d_s = 250$ for PCNN and $d_e = 3$ and $d_s = 200$ for RNN. For adversarial training, we choose $\epsilon = 0.05$ for PCNN and $\epsilon = 0.5$ for RNN. Since the word embedding dimension $d_w$ here is larger than that used for the NYT dataset, the word embeddings have larger norms, and the optimal value of $\epsilon$ increases accordingly. The precision-recall curves on the test data are shown in Fig. 3, where adversarial training again significantly improves the precision for both models. The precision at some particular recall values as well as the AUC numbers are reported in Table 3. Similarly, RNN yields superior performance on the UW dataset.

CNN vs RNN: In the experiments, RNN generally produces more precise predictions than CNN due to its rich model capacity, and it is also highly robust to the choice of input embeddings. The CNN, in contrast, has far fewer parameters, which leads to much faster training and testing and suggests a practical trade-off.
Notably, although the improvements in AUC from adversarial training are roughly the same for both RNN and CNN, the optimal value of $\epsilon$ for RNN is always much larger than for CNN. This implies that empirically RNN is more robust under adversarial attacks than CNN, which also helps RNN maintain higher precision as recall increases.
Choice of $\epsilon$: When $\epsilon = 0$, the AT loss (Eq. (2)) degenerates to the original loss (Eq. (1)); when $\epsilon$ becomes too large, the noise can change the semantics of a sentence and make it extremely hard for the model to correctly classify the adversarial examples. The optimal value of $\epsilon$ is much smaller than the norm of the word embedding, which implies that adversarial training works most effectively when it produces only tiny perturbations on word features while keeping the semantics of sentences unchanged.
Connection to other approaches: Prior work on linguistic adversaries, including Xie et al. (2017), proposes techniques that enhance the robustness of the model by randomly changing the word tokens in a sentence. This explicitly modifies the semantics of a sentence. By contrast, adversarial training focuses on smaller and continuous perturbations in the embedding space while preserving the semantics of sentences. Hence, adversarial training is complementary to linguistic adversaries.