Adversarial training for multi-context joint entity and relation extraction

Adversarial training (AT) is a regularization method that can be used to improve the robustness of neural network methods by adding small perturbations to the training data. We show how to use AT for the tasks of entity recognition and relation extraction. In particular, we demonstrate that applying AT to a general-purpose baseline model for jointly extracting entities and relations improves the state-of-the-art effectiveness on several datasets in different contexts (i.e., news, biomedical, and real estate data) and for different languages (English and Dutch).


Introduction
Many neural network methods have recently been exploited in various natural language processing (NLP) tasks, such as parsing, POS tagging (Lample et al., 2016), relation extraction (dos Santos et al., 2015), translation (Bahdanau et al., 2015), and joint tasks (Miwa and Bansal, 2016). However, Szegedy et al. (2014) observed that intentional small-scale perturbations (i.e., adversarial examples) to the input of such models may lead to incorrect decisions (with high confidence). Goodfellow et al. (2015) proposed adversarial training (AT) (for image recognition) as a regularization method which uses a mixture of clean and adversarial examples to enhance the robustness of the model. Although AT has recently been applied in NLP tasks (e.g., text classification (Miyato et al., 2017)), this paper is, to the best of our knowledge, the first attempt to investigate the regularization effects of AT in a joint setting for two related tasks.
We start from a baseline joint model that performs the tasks of named entity recognition (NER) and relation extraction at once. Previously proposed models (summarized in Section 2) exhibit several issues that the neural network-based baseline approach (detailed in Section 3.1) overcomes: (i) our model uses automatically extracted features without the need for external parsers or manually extracted features (see Gupta et al. (2016); Miwa and Bansal (2016); Li et al. (2017)), (ii) all entities and the corresponding relations within the sentence are extracted at once, instead of examining one pair of entities at a time (see Adel and Schütze (2017)), and (iii) we model relation extraction in a multi-label setting, allowing multiple relations per entity (see Katiyar and Cardie (2017); Bekoulis et al. (2018a)). The core contribution of the paper is the use of AT as an extension in the training procedure for the joint extraction task (Section 3.2).
To evaluate the proposed AT method, we perform a large scale experimental study in this joint task (see Section 4), using datasets from different contexts (i.e., news, biomedical, real estate) and languages (i.e., English, Dutch). We use a strong baseline that outperforms all previous models that rely on automatically extracted features, achieving state-of-the-art performance (Section 5). Compared to the baseline model, applying AT during training leads to a consistent additional increase in joint extraction effectiveness.

Related work
Joint entity and relation extraction: Joint models (Li and Ji, 2014; Miwa and Sasaki, 2014) that are based on manually extracted features have been proposed for performing both the named entity recognition (NER) and relation extraction subtasks at once. These methods rely on the availability of NLP tools (e.g., POS taggers) or manually designed features, leading to additional complexity. Neural network methods have been exploited to overcome this feature design issue and usually involve RNNs and CNNs (Miwa and Bansal, 2016; Zheng et al., 2017). Specifically, Miwa and Bansal (2016) as well as Li et al. (2017) apply bidirectional tree-structured RNNs for different contexts (i.e., news, biomedical) to capture syntactic information (using external dependency parsers). Gupta et al. (2016) propose the use of various manually extracted features along with RNNs. Adel and Schütze (2017) solve the simpler problem of entity classification (EC, assuming entity boundaries are given), instead of NER, and they replicate the context around the entities, feeding entity pairs to the relation extraction layer. Katiyar and Cardie (2017) investigate RNNs with attention without taking into account that relation labels are not mutually exclusive. Finally, Bekoulis et al. (2018a) use LSTMs in a joint model for extracting just one relation at a time, but increase the complexity of the NER part. Our baseline model enables simultaneous extraction of multiple relations from the same input. Then, we further extend this strong baseline using adversarial training.
Adversarial training (AT) (Goodfellow et al., 2015) has been proposed to make classifiers more robust to input perturbations in the context of image recognition. In the context of NLP, several variants have been proposed for different tasks such as text classification (Miyato et al., 2017), relation extraction (Wu et al., 2017) and POS tagging (Yasunaga et al., 2018). AT is considered a regularization method. Unlike other regularization methods (e.g., dropout (Srivastava et al., 2014), word dropout (Iyyer et al., 2015)) that introduce random noise, AT generates perturbations that are variations of examples easily misclassified by the model.

Joint learning as head selection
The baseline model, described in detail in Bekoulis et al. (2018b), is illustrated in Fig. 1. It aims to detect (i) the type and the boundaries of the entities and (ii) the relations between them. The input is a sequence of tokens (i.e., sentence) $w = w_1, ..., w_n$. We use character level embeddings to implicitly capture morphological features (e.g., prefixes and suffixes), representing each character by a vector (embedding). The character embeddings are fed to a bidirectional LSTM (BiLSTM) to obtain the character-based representation of the word. We also use pre-trained word embeddings. Word and character embeddings are concatenated to form the final token representation, which is then fed to a BiLSTM layer to extract sequential information.
For the NER task, we adopt the BIO (Beginning, Inside, Outside) encoding scheme. In Fig. 1, the B-PER tag is assigned to the beginning token of a 'person' (PER) entity. For the prediction of the entity tags, we use: (i) a softmax approach for the entity classification (EC) task (assuming entity boundaries are given) or (ii) a CRF approach where we identify both the type and the boundaries for each entity. During decoding, in the softmax setting, we greedily detect the entity types of the tokens (i.e., independent prediction). Although an independent distribution of types is reasonable for EC tasks, this is not the case when there are strong correlations between neighboring tags. For instance, the BIO encoding scheme imposes several constraints in the NER task (e.g., the B-PER and I-LOC tags cannot be sequential). Motivated by this intuition, we use a linear-chain CRF for the NER task (Lample et al., 2016). For decoding, in the CRF setting, we use the Viterbi algorithm. During training, for both EC (softmax) and NER tasks (CRF), we minimize the cross-entropy loss $L_{NER}$. The entity tags are later fed into the relation extraction layer as label embeddings (see Fig. 1), assuming that knowledge of the entity types is beneficial in predicting the relations between the involved entities.
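To make the contrast between greedy (independent) decoding and Viterbi decoding concrete, the following is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the toy emission scores, and the hard-coded transition table are assumptions made for illustration. In particular, a trained linear-chain CRF learns its transition scores, whereas here invalid BIO transitions are simply given a score of minus infinity.

```python
import math


def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence for a linear-chain model.

    emissions[t][tag] is the score of `tag` at position t;
    transitions[(prev, cur)] is the score of moving from prev to cur.
    """
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def bio_transitions(tags):
    """Hypothetical hard constraints: I-X may only follow B-X or I-X."""
    table = {}
    for prev in tags:
        for cur in tags:
            allowed = True
            if cur.startswith("I-"):
                allowed = prev in ("B-" + cur[2:], "I-" + cur[2:])
            table[(prev, cur)] = 0.0 if allowed else -math.inf
    return table


tags = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]
# Toy scores for a two-token name; the second token slightly prefers I-LOC
emissions = [
    {"O": 0.1, "B-PER": 2.0, "I-PER": 0.0, "B-LOC": 0.5, "I-LOC": 0.0},
    {"O": 0.2, "B-PER": 0.0, "I-PER": 1.0, "B-LOC": 0.0, "I-LOC": 1.2},
]

greedy = [max(tags, key=lambda tag: e[tag]) for e in emissions]
decoded = viterbi(emissions, bio_transitions(tags), tags)
print(greedy)   # per-token argmax, ignores tag dependencies
print(decoded)  # sequence-level argmax under the BIO constraints
```

On these toy scores, greedy decoding yields the invalid sequence B-PER, I-LOC, while Viterbi decoding returns B-PER, I-PER because it scores the sequence as a whole.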
We model the relation extraction task as a multi-label head selection problem (Bekoulis et al., 2018b). In our model, each word $w_i$ can be involved in multiple relations with other words. For instance, in the example illustrated in Fig. 1, "Smith" could be involved not only in a Lives in relation with the token "California" (head) but also in other relations simultaneously (e.g., Works for, Born In with some corresponding tokens). The goal of the task is to predict, for each word $w_i$, a vector of heads $\hat{y}_i$ and the vector of corresponding relations $\hat{r}_i$. We compute the score $s(w_j, w_i, r_k)$ of word $w_j$ to be the head of $w_i$ given a relation label $r_k$ using a single layer neural network. The corresponding probability is defined as $P(w_j, r_k \mid w_i; \theta) = \sigma(s(w_j, w_i, r_k))$, where $\sigma(\cdot)$ is the sigmoid function. During training, we minimize the cross-entropy loss $L_{rel}$ as:
$$L_{rel} = \sum_{i=1}^{n} \sum_{j=1}^{m} -\log P(y_{i,j}, r_{i,j} \mid w_i; \theta) \tag{1}$$
where $y_{i,j}$ and $r_{i,j}$ denote the $j$-th ground-truth head of $w_i$ and its relation label, and $m$ is the number of associated heads (and thus relations) per word $w_i$. During decoding, the most probable heads and relations are selected using threshold-based prediction. The final objective for the joint task is computed as $L_{JOINT}(w; \theta) = L_{NER} + L_{rel}$, where $\theta$ is a set of parameters. In the case of multi-token entities, only the last token of the entity can serve as head of another token, to eliminate redundant relations. If an entity is not involved in any relation, we predict the auxiliary "N" relation label and the token itself as head.
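The sigmoid scoring and threshold-based decoding described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the scores are hand-written rather than produced by a neural layer, the threshold of 0.5 and the function names are hypothetical, and the loss is taken to be the summed negative log-probability of the gold (head, relation) pairs.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_heads(scores, tokens, relations, threshold=0.5):
    """Threshold-based multi-label decoding.

    scores[i][j][k] stands for s(w_j, w_i, r_k): the score of token j being
    the head of token i under relation k. Every (head, relation) pair whose
    sigmoid probability exceeds the threshold is kept, so a single token
    can receive several heads and relations.
    """
    predictions = {}
    for i, token in enumerate(tokens):
        predictions[token] = [
            (tokens[j], relations[k])
            for j in range(len(tokens))
            for k in range(len(relations))
            if sigmoid(scores[i][j][k]) > threshold
        ]
    return predictions


def relation_loss(scores, gold):
    """Sum of -log P(head, relation | w_i) over gold (i, j, k) triples."""
    return sum(-math.log(sigmoid(scores[i][j][k])) for i, j, k in gold)


tokens = ["Smith", "lives", "California"]
relations = ["N", "Lives in"]
n, r = len(tokens), len(relations)

# Hand-written toy scores: strongly negative unless stated otherwise
scores = [[[-5.0] * r for _ in range(n)] for _ in range(n)]
scores[0][2][1] = 3.0  # "California" is the head of "Smith" (Lives in)
scores[1][1][0] = 2.0  # "lives" has no relation: itself as head, label N
scores[2][2][0] = 2.0  # "California" has no outgoing relation either

predictions = decode_heads(scores, tokens, relations)
loss = relation_loss(scores, gold=[(0, 2, 1), (1, 1, 0), (2, 2, 0)])
print(predictions)
print(round(loss, 4))
```

Because each (head, relation) pair is scored with an independent sigmoid rather than a softmax over heads, raising a second score above the threshold for "Smith" would add a second relation without suppressing the first.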

Adversarial training (AT)
We exploit the idea of AT (Goodfellow et al., 2015) as a regularization method to make our model robust to input perturbations. Specifically, we generate examples which are variations of the original ones by adding some noise at the level of the concatenated word representation (Miyato et al., 2017). This is similar to the concept introduced by Goodfellow et al. (2015) to improve the robustness of image recognition classifiers. We generate an adversarial example by adding to the original embedding $w$ the worst-case perturbation $\eta_{adv}$ that maximizes the loss function:
$$\eta_{adv} = \operatorname*{argmax}_{\|\eta\| \le \epsilon} L_{JOINT}(w + \eta; \hat{\theta}) \tag{2}$$
where $\hat{\theta}$ is a copy of the current model parameters. Since Eq. (2) is intractable in neural networks, we use the approximation proposed in Goodfellow et al. (2015), defined as $\eta_{adv} = \epsilon g / \|g\|$, with $g = \nabla_w L_{JOINT}(w; \hat{\theta})$, where $\epsilon$ is a small bounded norm treated as a hyperparameter. Similar to Yasunaga et al. (2018), we set $\epsilon$ to be $\alpha \sqrt{D}$ (where $D$ is the dimension of the embeddings). We train on the mixture of original and adversarial examples, so the final loss is computed as $L_{JOINT}(w; \hat{\theta}) + L_{JOINT}(w + \eta_{adv}; \hat{\theta})$.
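The perturbation step can be illustrated with a small, self-contained sketch. It replaces the joint loss with a toy logistic loss whose gradient with respect to the input is known in closed form; this stand-in loss, the function names, and the numeric values are all assumptions for illustration. In an actual model the gradient would come from backpropagation, and the perturbation would be treated as a constant when the parameters are updated.

```python
import math


def loss_and_grad(embedding, theta, label):
    """Toy stand-in for the joint loss: logistic loss of a linear scorer.

    Returns the loss and its gradient with respect to the input embedding
    (not the parameters), which is what the perturbation is built from.
    """
    margin = label * sum(t * x for t, x in zip(theta, embedding))
    loss = math.log(1.0 + math.exp(-margin))
    coeff = -label / (1.0 + math.exp(margin))  # d loss / d (theta . x)
    return loss, [coeff * t for t in theta]


def adversarial_perturbation(grad, alpha):
    """Scale the gradient to norm eps, with eps = alpha * sqrt(D)."""
    dim = len(grad)
    eps = alpha * math.sqrt(dim)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * dim  # no gradient signal, so no perturbation
    return [eps * g / norm for g in grad]


embedding = [0.5, -1.0, 0.25, 2.0]  # hypothetical token representation, D = 4
theta = [1.0, 0.5, -0.5, 0.25]      # frozen copy of the toy parameters
alpha = 0.01

clean_loss, grad = loss_and_grad(embedding, theta, label=1)
eta = adversarial_perturbation(grad, alpha)
perturbed = [x + e for x, e in zip(embedding, eta)]
adv_loss, _ = loss_and_grad(perturbed, theta, label=1)

# Training objective on the mixture of clean and adversarial examples
total_loss = clean_loss + adv_loss
eta_norm = math.sqrt(sum(e * e for e in eta))
print(clean_loss, adv_loss, eta_norm)
```

The perturbation always has norm $\alpha \sqrt{D}$ regardless of the gradient magnitude, so $\alpha$ alone controls how far the adversarial example moves from the original embedding.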

Experimental setup
We evaluate our models on four datasets, using the code available in our GitHub codebase. Specifically, we follow the 5-fold cross-validation defined by Miwa and Bansal (2016) for the ACE04 (Doddington et al., 2004) dataset. For the CoNLL04 (Roth and Yih, 2004) EC task (assuming boundaries are given), we use the same splits as in Gupta et al. (2016); Adel and Schütze (2017). We also evaluate our models on the NER task, similar to Miwa and Sasaki (2014), on the same dataset using 10-fold cross-validation. For the Dutch Real Estate Classifieds (DREC) dataset (Bekoulis et al., 2017), we use train-test splits as in Bekoulis et al. (2018a). For the Adverse Drug Events (ADE) dataset (Gurulingappa et al., 2012), we perform 10-fold cross-validation similar to Li et al. (2017). To obtain comparable results that are not affected by the input embeddings, we use the embeddings of the previous works. We employ early stopping in all of the experiments. We use the Adam optimizer (Kingma and Ba, 2015) and we fix the hyperparameters (i.e., α, dropout values, best epoch, learning rate) on the validation sets. The scaling parameter α is selected from {5e−2, 1e−2, 1e−3, 1e−4}. Larger values of α (i.e., larger perturbations) led to a consistent performance decrease in our early experiments. This can be explained by the fact that adding more noise can change the content of the sentence, as also reported by Wu et al. (2017).
We use three types of evaluation, namely: (i) S(trict): we score an entity as correct if both the entity boundaries and the entity type are correct (ACE04, ADE, CoNLL04, DREC), (ii) B(oundaries): we score an entity as correct if only the entity boundaries are correct while the entity type is not taken into account (DREC) and (iii) R(elaxed): a multi-token entity is considered correct if at least one correct type is assigned to the tokens comprising the entity, assuming that the boundaries are known (CoNLL04), to compare to previous works. In all cases, a relation is considered correct when both the relation type and the argument entities are correct.
Table 1: Comparison of our method with the state-of-the-art in terms of F1 score. The proposed models are: (i) baseline, (ii) baseline EC (predicts only entity classes) and (iii) baseline (EC) + AT (regularized by AT). Symbols indicate whether the models rely on external NLP tools. We include different evaluation types (S, R and B).
Table 1 shows our experimental results. The name of the dataset is presented in the first column while the models are listed in the second column. The proposed models are the following: (i) baseline: the baseline model shown in Fig. 1 with the CRF layer and the sigmoid loss, (ii) baseline EC: the proposed model with the softmax layer for EC, (iii) baseline (EC) + AT: the baseline regularized using AT. The final three columns present the F1 results for the two subtasks and their average performance. Bold values indicate the best results among models that use only automatically extracted features. For ACE04, the baseline outperforms Katiyar and Cardie (2017) by ∼2% in both tasks. This improvement can be explained by the use of: (i) multi-label head selection, (ii) the CRF layer and (iii) character level embeddings. Compared to Miwa and Bansal (2016), who rely on NLP tools, the baseline performs within a reasonable margin (less than 1%) on the joint task. On the other hand, Li et al. (2017) use the same model as Miwa and Bansal (2016) for the ADE biomedical dataset, where we report a 2.5% overall improvement. This indicates that NLP tools are not always accurate for various contexts.
For the CoNLL04 dataset, we use two evaluation settings. We use the relaxed evaluation similar to Gupta et al. (2016); Adel and Schütze (2017) on the EC task. The baseline model outperforms the state-of-the-art models that do not rely on manually extracted features (>4% improvement for both tasks), since we directly model the whole sentence, instead of just considering pairs of entities. Moreover, compared to the model of Gupta et al. (2016) that relies on complex features, the baseline model performs within a margin of 1% in terms of overall F1 score.
We also report NER results on the same dataset and improve the overall F1 score by ∼1% compared to Miwa and Sasaki (2014), indicating that our automatically extracted features are more informative than the hand-crafted ones. The performance improvement of these automatically extracted features is mainly due to the shared LSTM layer, which learns to automatically generate feature representations of entities and their corresponding relations within a single model. For the DREC dataset, we use two evaluation methods. In the boundaries evaluation, the baseline has an improvement of ∼3% on both tasks compared to Bekoulis et al. (2018a), whose quadratic scoring layer complicates NER.
Table 1 and Fig. 2 show the effectiveness of the adversarial training on top of the baseline model. In all of the experiments, AT improves the predictive performance of the baseline model in the joint setting. Moreover, as seen in Fig. 2, the performance of the models using AT is closer to the maximum even from the early training epochs. Specifically, for ACE04, there is an improvement in both tasks as well as in the overall F1 performance (0.4%). For CoNLL04, we note an improvement in the overall F1 of 0.4% for the EC and 0.8% for the NER tasks, respectively. For the DREC dataset, in both settings, there is an overall improvement of ∼1%. Figure 2 shows that from the first epochs, the model obtains its maximum performance on the DREC validation set. Finally, for ADE, our AT model beats the baseline F1 by 0.7%.

Results
Our results demonstrate that AT outperforms the neural baseline model consistently, considering our experiments across multiple and more diverse datasets than typical related works. The improvement of AT over our baseline (depending on the dataset) ranges from ∼0.4% to ∼0.9% in terms of overall F1 score. This seemingly small performance increase is mainly due to the limited performance benefit for the NER component, which is in accordance with the recent advances in NER using neural networks that report similarly small gains (e.g., the performance improvement in Ma and Hovy (2016) and Lample et al. (2016) on the CoNLL-2003 test set is 0.01% and 0.17% F1, respectively, while in the work of Yasunaga et al. (2018), a 0.07% F1 improvement on CoNLL-2000 using AT for NER is reported). However, the relation extraction performance increases by ∼1% F1, except for the ACE04 dataset. Further, as seen in Fig. 2, the improvement for CoNLL04 is particularly small on the evaluation set. This may indicate a correlation between the dataset size and the benefit of adversarial training in the context of joint models, but this needs further investigation in future work.