Exploiting Entity BIO Tag Embeddings and Multi-task Learning for Relation Extraction with Imbalanced Data

In practical scenarios, relation extraction needs to first identify entity pairs that have a relation and then assign a correct relation class. However, the number of non-relation entity pairs in context (negative instances) usually far exceeds the number of entity pairs that have a relation (positive instances), which negatively affects a model's performance. To mitigate this problem, we propose a multi-task architecture which jointly trains a model to perform relation identification with cross-entropy loss and relation classification with ranking loss. Meanwhile, we observe that a sentence may have multiple entities and relation mentions, and the patterns in which the entities appear in a sentence may contain useful semantic information that can be utilized to distinguish between positive and negative instances. Thus we further incorporate the embeddings of character-wise/word-wise BIO tag from the named entity recognition task into character/word embeddings to enrich the input representation. Experiment results show that our proposed approach can significantly improve the performance of a baseline model with more than 10% absolute increase in F1-score, and outperform the state-of-the-art models on the ACE 2005 Chinese and English corpus. Moreover, BIO tag embeddings are particularly effective and can be used to improve other models as well.


Introduction
Relation extraction, which aims to extract semantic relations from a given instance, i.e., an entity pair and the corresponding text in context, is an important and challenging task in information extraction. It serves as a stepping stone for many downstream tasks such as question answering and knowledge graph construction.
The relation extraction task can be divided into two steps: determining which pair of entities in a given sentence has relation, and assigning a correct relation class to the identified entity pair. We define these two steps as two related tasks: Relation Identification and Relation Classification. If one only needs to categorize the given entities that are guaranteed to have some expected relation, then relation extraction is reduced to relation classification (Nguyen and Grishman, 2015).
One variation of relation classification is the introduction of a new artificial relation class "Other." If the number of non-relation entity pairs in context (negative instances) in the dataset is comparable to the number of entity pairs that have relation in context (positive instances), then the non-relation pairs can be treated as having the relation class Other.
Strictly speaking, most existing studies of relation extraction treat the task as relation classification. However, relation extraction often comes with an extremely imbalanced dataset where the number of non-relation entity pairs far exceeds the others, making it a more challenging yet more practical task than relation classification. For example, after filtering out those entity pairs whose entity type combination has never appeared in the Chinese corpus of ACE 2005, there are still more than 200,000 entity pairs left, in which the positive/negative instance ratio is about 1:20. In this paper, we focus on the relation extraction task with an imbalanced corpus, and adopt multi-task learning paradigm to mitigate the data imbalance problem.
Only a few studies have considered the negative effect of having too many negative instances. Nguyen and Grishman (2015) proposed using CNN with filters of multiple window sizes. dos Santos et al. (2015) focused on learning the common features of the positive instances by computing only the scores of the relation classes excluding the class Other, and proposed using a pairwise ranking loss. We have also adopted these methods in our approach.
For relation classification, the prediction error can be categorized into three types: 1) false negative: predicting a positive instance to be negative; 2) false positive: predicting a negative instance to be positive; 3) wrong relation class: predicting a positive instance to be positive yet assigning a wrong relation class. After training a baseline model to perform relation classification on the extremely imbalanced ACE 2005 Chinese corpus and dissecting its prediction errors, we find that the proportions of these three types of error are 30.20%, 62.80% and 7.00% respectively. It is conceivable that to improve a model's performance on such a corpus, it is best to focus on telling positive and negative instances apart.
Since the negative instances may not have much in common, distinguishing between positive and negative instances is much more challenging than only classifying positive instances into a correct class. Moreover, the total number of positive instances combined is more comparable to the number of negative instances than positive instances of any individual relation class alone. Based on these rationales, we propose to jointly train a model to do another binary classification task, relation identification, alongside relation classification to mitigate the data imbalance problem.
Another facet that most existing studies fail to consider is that there may be multiple relation mentions in a given sentence if it contains multiple entities. In the Chinese corpus of ACE 2005, there are 4.9 entities and 1.34 relation mentions in a sentence on average. The patterns in which these entities appear in the sentence can provide useful semantic information to distinguish between positive and negative instances. Therefore, we exploit the character-wise/word-wise BIO (Beginning, Inside, Outside) tag used in the named entity recognition (NER) task to enrich the input representation. The details of our approach will be presented in Section 2.
We conducted extensive experiments on ACE 2005 Chinese and English corpus. Results show that both the novel multi-task architecture and the incorporation of BIO tag embeddings can improve the performance, and the model equipped with both achieves the highest F1-score, significantly outperforming the state-of-the-art models. Analysis of the results indicates that our proposed approach can successfully address the problem of having a large number of negative instances.
To summarize, we make the following contributions in this paper: 1. We propose a multi-task architecture which jointly trains a model to perform relation identification with cross-entropy loss and relation classification task with ranking loss, which can successfully mitigate the negative effect of having too many negative instances.
2. We incorporate the embeddings of character-wise/word-wise BIO tag from the NER task to enrich the input representation, which proves to be very effective not only for our model but for other models as well. We argue that BIO tag embeddings could be a general part of character/word representation, just like the entity position embeddings (Zeng et al., 2014) that many researchers have used in recent years.

Proposed Approach
We have designed a novel multi-task architecture which combines two related tasks: 1) relation identification, which is a binary classification problem to determine whether a given entity pair has a relation; 2) relation classification, which is a multi-class classification problem to determine the relation class. Figure 1 shows the overall architecture.

Figure 1: The overall multi-task architecture. To demonstrate, there are three window sizes for filters in the convolutional layer, as denoted by the three-layer stack; for each window size there are four filters, as denoted by the number of rows in each layer. Max-pooling is applied to each row in each layer of the stack, and the dimension of the output is equal to the total number of filters.
There are three main parts in the architecture:
• Input Layer Given an input sentence x of n words {x_1, x_2, ..., x_n} with m entities {e_1, e_2, ..., e_m} where e_i ∈ x, and two target entities e_t1, e_t2 ∈ {e_1, e_2, ..., e_m}, the input layer transforms the sentence into a matrix X, which includes the word embeddings, position embeddings and BIO tag embeddings of each word.
• Convolutional Layer with Max-pooling Following the input layer is a convolutional layer that extracts high-level features, with filters (convolution kernels) of multiple window sizes (Nguyen and Grishman, 2015). Then max-pooling is applied to each feature map to reduce dimensionality.
• Multi-task Layer In the multi-task layer, the model jointly learns the relation identification task using cross-entropy loss and the relation classification task using ranking loss. (We use a character-wise model for the Chinese corpus and a word-wise model for the English corpus. For simplicity's sake, we use "word" to denote either an English word or a Chinese character when presenting our model.)

Input Layer
• Word Embeddings We use word embeddings with random initialization for each word in the input sentence. The dimension of word embeddings is d_w.
• Position Embeddings We also employ position embeddings to encode the relative distance between each word and the two target entities in the sentence. We believe that more useful information regarding the relation is hidden in the words closer to the target entities. The dimension of position embeddings is d_p.
• BIO Tag Embeddings Since an input sentence often contains more than two entities, we utilize the BIO tag information of entities to enrich the representation of the input. More specifically, for each word in the input sentence, if the word is part of an entity, we use the entity type T to label the start of the entity as B_T, and label the rest of the entity as I_T. If the word is not part of an entity, then we label the word as O. The dimension of BIO tag embeddings is d_t.
After concatenating all three embeddings together for each word, we transform a sentence into a matrix X = [w_1, w_2, ..., w_n] as the input representation, where each column vector w_i ∈ R^(d_w + 2*d_p + d_t). Figure 2 illustrates how to derive position embeddings and BIO tag embeddings.
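The concatenation above can be sketched in a few lines. The following is a minimal NumPy illustration, not the authors' implementation: the tables are randomly initialized (as in the paper), the dimensions are shrunk for readability, and all id sequences are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions only; the paper uses d_w = 200, d_p = 50, d_t = 50.
d_w, d_p, d_t = 8, 4, 4
vocab_size, n_positions, n_tags = 100, 21, 10  # hypothetical table sizes

W_word = rng.normal(size=(vocab_size, d_w))    # word embedding table
W_pos = rng.normal(size=(n_positions, d_p))    # position embedding table
W_tag = rng.normal(size=(n_tags, d_t))         # BIO tag embedding table

def input_matrix(word_ids, pos1_ids, pos2_ids, tag_ids):
    """Concatenate word, two position, and BIO tag embeddings per token."""
    rows = [np.concatenate([W_word[w], W_pos[p1], W_pos[p2], W_tag[t]])
            for w, p1, p2, t in zip(word_ids, pos1_ids, pos2_ids, tag_ids)]
    return np.stack(rows)  # shape: (n, d_w + 2*d_p + d_t)

# A toy 3-token sentence with made-up ids.
X = input_matrix([5, 7, 2], [10, 11, 12], [9, 10, 11], [1, 0, 0])
```

In a real model the three tables would be trainable parameters updated by backpropagation; the sketch only shows how the per-token vectors are assembled.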

Convolutional Layer with Multi-Sized Window Kernels
Next, the matrix X is fed into the convolutional layer to extract high-level features. A filter with window size k can be denoted as a matrix F ∈ R^((d_w + 2*d_p + d_t) × k). Applying the convolution operation on the two matrices X and F, we get a score sequence T = {t_1, t_2, ..., t_(n−k+1)}:

t_i = g(sum(F ⊙ X_[i:i+k−1]) + b)    (1)

where ⊙ denotes element-wise multiplication, g is a non-linear function and b is a bias term.
In our experiments, we apply zero-paddings during the convolution operation, so that the score sequence has the same length as the input sequence, which is n, instead of n − k + 1 as in Equation 1, which assumes no padding.

Figure 2: Illustration of BIO tag information and positional information for a given instance. In this example, there are five entities in the input sentence, and the target entities are the second and the third.
There are multiple filters with different window sizes in the convolutional layer. Max-pooling is then applied to the feature map output by each filter. Eventually the input sentence x is represented as a column vector r whose dimension is equal to the total number of filters.
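The convolution-plus-pooling step can be sketched as follows. This is an illustrative NumPy version under assumed details (tokens stored as rows for convenience, tanh as the non-linearity g, bias omitted), not the trained model:

```python
import numpy as np

def conv_maxpool(X, filters_per_size):
    """Multi-window convolution with zero padding, followed by max-pooling.

    X: (n, d) input matrix (one row per token).
    filters_per_size: dict mapping window size k -> array of shape
    (n_filters, k, d). Returns r, one pooled value per filter.
    """
    n, d = X.shape
    pooled = []
    for k, F in filters_per_size.items():
        top, bottom = (k - 1) // 2, k - 1 - (k - 1) // 2
        Xp = np.vstack([np.zeros((top, d)), X, np.zeros((bottom, d))])
        for f in F:
            # Score at each position; tanh plays the role of g, bias omitted.
            scores = [np.tanh(np.sum(Xp[i:i + k] * f)) for i in range(n)]
            pooled.append(max(scores))  # max-pooling over the feature map
    return np.array(pooled)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 5))                # toy 6-token sentence
filters = {3: rng.normal(size=(4, 3, 5)),  # 4 filters of window size 3
           5: rng.normal(size=(4, 5, 5))}  # 4 filters of window size 5
r = conv_maxpool(X, filters)               # one pooled score per filter
```

Because of the zero padding, every filter produces a length-n feature map regardless of its window size, and the pooled vector r has one entry per filter, matching the description above.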

Multi-Task Layer
• Relation Identification with Cross-entropy Loss For the binary classification task of relation identification, we use cross-entropy loss. Positive instances are labelled "1" and negative instances "0." If p is the one-hot true distribution over all classes C = {c} and q is the distribution the model predicts, then the cross-entropy loss of a given instance can be defined as follows:

H(p, q) = − Σ_{c∈C} p(c) log q(c)    (2)

So the loss of this task can be defined as the sum over training instances:

L_identification = Σ_i H(p_i, q_i)    (3)

• Relation Classification with Ranking Loss For the multiple classification task of relation classification, we use the pairwise ranking loss proposed by dos Santos et al. (2015).
Given the sentence representation r, the score for class c is computed as:

s_c = r^T (W_classes)_c    (4)

where W_classes is a matrix to be learned, whose number of columns is equal to the number of classes. (W_classes)_c is the column vector corresponding to class c, whose dimension is equal to that of r.
For each instance, the input sentence x has a correct class label y+ and incorrect ones y−. Let s_y+ and s_y− be the scores for y+ and y− respectively; then the ranking loss can be computed by the following two equations:

L+ = log(1 + exp(γ(m+ − s_y+)))    (5)
L− = log(1 + exp(γ(m− + s_y−)))    (6)

where m+ and m− are margins and γ is a scaling factor. L+ decreases as the score s_y+ increases, and is close to zero when s_y+ > m+, which encourages the network to give a score greater than m+ for the correct class. Similarly, L− decreases as the score s_y− decreases, and is close to zero when s_y− < −m−, which encourages the network to give scores smaller than −m− for incorrect classes.
For the class Other, only L− is calculated to penalize the incorrect prediction. Following dos Santos et al. (2015), we only choose the class with the highest score among all incorrect classes as the one used to perform a training step. Then we optimize the pairwise ranking loss function:

L_classification = L+ + L−    (7)

The total loss function for multi-task training is:

L = α L_identification + β L_classification    (8)

where α and β are weights of the two losses. In our experiments, we find that α = β yields the best result.
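The two-term objective can be sketched in plain Python. This is an illustration of the losses described above, with the margins defaulting to the English-corpus values reported in the hyper-parameter section (m+ = 2.5, m− = 0.5, γ = 2), not the actual training code:

```python
import numpy as np

def ranking_loss(scores, y_true, other_idx, m_pos=2.5, m_neg=0.5, gamma=2.0):
    """Pairwise ranking loss in the style of dos Santos et al. (2015).

    scores: per-class scores s_c; y_true: gold class index; other_idx:
    index of the artificial class Other. Only the highest-scoring
    incorrect class contributes to the negative term.
    """
    incorrect = [c for c in range(len(scores)) if c != y_true]
    s_neg = max(scores[c] for c in incorrect)
    loss_neg = np.log1p(np.exp(gamma * (m_neg + s_neg)))
    if y_true == other_idx:            # for Other, only the negative term
        return loss_neg
    loss_pos = np.log1p(np.exp(gamma * (m_pos - scores[y_true])))
    return loss_pos + loss_neg

def total_loss(ce_loss, rank_loss, alpha=1.0, beta=1.0):
    """Weighted sum of the two task losses; the paper finds alpha = beta best."""
    return alpha * ce_loss + beta * rank_loss
```

Note that the loss shrinks toward zero as the correct class's score climbs above m+ and the best incorrect score falls below −m−, which is exactly the margin behaviour described above.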

Prediction
We only use the class score s c in the multiple classification task to make predictions, while the binary classification task is only used for optimizing the network parameters.
Given an instance, the prediction P is made by:

P = Other           if s_c < θ for every class c
P = argmax_c s_c    otherwise

where θ is a threshold. That is, the relation in an instance is predicted as the class Other if the score s_c is less than θ for every class c; otherwise, we choose the class with the highest score as the prediction.
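The decision rule amounts to a thresholded argmax; a minimal sketch (the relation labels shown are hypothetical):

```python
def predict(scores, theta, labels=None, other_label="Other"):
    """Return Other if every class score is below theta; otherwise the
    highest-scoring class (its index, or its label if labels are given)."""
    best = max(range(len(scores)), key=lambda c: scores[c])
    if scores[best] < theta:
        return other_label
    return labels[best] if labels is not None else best

# Toy scores; "PHYS" / "PART-WHOLE" are hypothetical relation labels.
assert predict([-1.0, -0.5], theta=0.0) == "Other"
assert predict([0.1, 3.0], theta=0.0, labels=["PART-WHOLE", "PHYS"]) == "PHYS"
```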

Data Preparation
We use both the Chinese and English corpus of ACE 2005 to evaluate our proposed approach. Only positive instances have been annotated in the dataset. To extract negative instances, we need to enumerate every entity pair in a sentence. We consider two approaches: one ignores the direction of the relation while the other considers it. In the first approach, we extract only one instance for any pair of entities e_1, e_2 in a sentence x, regardless of direction. Those instances that have been annotated, regardless of direction, are positive instances, and the rest are negative instances. A trained model only needs to determine whether an entity pair has a relation. In the second approach, we extract two instances for any pair of entities in a sentence, with the two entities in different orders. Since at most one of the two instances has been annotated as positive, we treat the other one, as well as pairs for which neither order is annotated, as negative instances. A trained model will additionally need to identify the head entity and the tail entity in a relation, which is considerably harder.
After extracting negative instances, we further filtered out those instances whose entity type combination has never appeared in a relation mention. Then we added the remaining negative instances to the positive instances to complete data preparation.
We adopted the first approach to extract negative instances from the English corpus of ACE 2005, and ended up with 71,895 total instances after filtering, among which 64,989 are negative instances. The positive/negative instance ratio is about 1:9.4. We adopted the second approach to extract negative instances from the Chinese corpus of ACE 2005, and ended up with 215,117 total instances after filtering, among which 205,800 are negative instances. The positive/negative instance ratio is about 1:20.
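The two enumeration schemes can be sketched with itertools; a small illustration in which the entity mentions are hypothetical:

```python
from itertools import combinations, permutations

def candidate_pairs(entities, directed):
    """Enumerate candidate entity pairs within one sentence.

    First approach (undirected): one instance per unordered pair.
    Second approach (directed): two instances per pair, one per order,
    so the model must also identify head and tail entities.
    """
    gen = permutations(entities, 2) if directed else combinations(entities, 2)
    return list(gen)

entities = ["e1", "e2", "e3"]  # hypothetical entity mentions
undirected = candidate_pairs(entities, directed=False)  # 3 candidate pairs
directed = candidate_pairs(entities, directed=True)     # 6 candidate pairs
```

With m entities per sentence the undirected scheme yields m(m−1)/2 candidates and the directed scheme m(m−1), which is why the Chinese corpus (second approach) ends up so much more imbalanced.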

Embeddings
In our approach, we use three kinds of embeddings, namely word embeddings, position embeddings and BIO tag embeddings. They are all randomly initialized, and are adjusted during training. The dimensions of these three embeddings are 200, 50 and 50 respectively.

Hyper-parameters
The number of filters in the convolutional layer is 64, and the window size of filters ranges from 4 to 10. The fully connected layer to calculate class scores has 128 hidden units with a dropout rate of 0.2. The batch size is 256. The neural networks are trained using the RMSprop optimizer with the learning rate set to 0.001.
As for the parameters in the pairwise ranking loss, for the English corpus, we set m + to 2.5, m − to 0.5, γ to 2 and θ to 0; for the Chinese corpus, we set m + to 4.5, m − to -0.5, γ to 2 and θ to 1. The cross-entropy loss and the pairwise ranking loss in multi-task learning are equally weighted.

Experiment Results
We use five-fold cross-validation to reduce the randomness in the experiment results. The precision (P), recall (R) and F1-score (F1) of the positive instances are used as evaluation metrics.
We compare several variants of our proposed models with the state-of-the-art models on the English and Chinese corpus of ACE 2005 respectively. Variants of our models are:
• Baseline: a model that uses a CNN with filters of multiple window sizes and only performs the relation classification task using the pairwise ranking loss. The baseline model is motivated by dos Santos et al. (2015) and Nguyen and Grishman (2015).
• Baseline+Tag: baseline model with BIO tag embeddings.
• Baseline+MTL: baseline model that performs relation identification using crossentropy loss in addition to relation classification.
• Baseline+MTL+Tag: baseline model that adopts both multi-task learning and BIO tag embeddings.
For the English corpus, we choose SPTree (Miwa and Bansal, 2016) and Walk-based Model (Christopoulou et al., 2018) for comparison. Since the data preparation is similar, we directly report the results from the original papers. The experiment results are summarized in Table 1.
For the Chinese corpus, we choose PCNN (Zeng et al., 2015) and Eatt-BiGRU (Qin et al., 2017) for comparison. We re-implemented these two models, and the experiment results are summarized in Table 2.

Table 2: Comparison between our model and the state-of-the-art models using the ACE 2005 Chinese corpus. F1-scores higher than the state-of-the-art are in bold.
From Table 1 and Table 2, we can see: 1. Both BIO tag embeddings and multi-task learning can improve the performance of the baseline model.
2. Baseline+Tag can outperform the state-of-the-art models on both the Chinese and English corpus. Compared to the baseline model, BIO tag embeddings lead to an absolute increase of about 10% in F1-score, which indicates that BIO tag embeddings are very effective.
3. Multi-task learning can yield further improvement in addition to BIO tag embeddings: Baseline+MTL+Tag achieves the highest F1-score on both corpora.

Effectiveness of BIO Tag Embeddings
To further investigate the effectiveness of BIO tag embeddings, we incorporated these embeddings into PCNN (Zeng et al., 2015) and Eatt-BiGRU (Qin et al., 2017) to form two new models: PCNN+Tag and Eatt-BiGRU+Tag, and evaluated their performance using the Chinese corpus of ACE 2005. The results are summarized in Table 3. Comparing Table 3 with Table 2, we can see that thanks to BIO tag embeddings, the F1-score of PCNN increases from 46.1% to 58.2%, while the F1-score of Eatt-BiGRU increases from 52.0% to 61.1%. Such significant improvement is consistent with that on the baseline model and further attests to the effectiveness of BIO tag embeddings. We believe that BIO tag embeddings could be used as a general part of character/word representation for other models and potentially other tasks as well.

Effect of Positive/Negative Instance Ratio
To see how our approach would perform as the degree of data imbalance varies, we used the same random seed to sample negative instances extracted from the Chinese corpus of ACE 2005 and added them to the positive instances, producing positive/negative instance ratios of 1:0.5, 1:1, 1:5, 1:10 and 1:15. Then we trained and evaluated two models: Baseline and Baseline+MTL+Tag. The results are shown in Figure 3. As shown in Figure 3, the performance of both models drops in terms of F1-score as the positive/negative instance ratio decreases. Yet, as the data become more imbalanced, the gap between the performances of Baseline+MTL+Tag and Baseline widens. This indicates that our proposed approach is more useful when the data is more imbalanced, though it performs better than the baseline regardless of the positive/negative instance ratio.

Effect of Loss Function w/o Multi-tasking
Recall that in the multi-task architecture that we have proposed, we use the pairwise ranking loss for the multiple classification task of relation classification and cross-entropy loss for the binary classification task of relation identification. We can, however, use cross-entropy in relation classification as well. To see how the choice of loss function affects performance in different scenarios, we switched the ranking loss to cross-entropy loss or simply added cross-entropy loss in the relation classification task, and evaluated the Baseline+Tag model with and without multi-task learning, using the Chinese corpus of ACE 2005. The results are summarized in Table 4, from which we can see: 1. When doing a single task of relation classification, the model has higher precision and lower recall with cross-entropy loss, but lower precision and higher recall with ranking loss; the F1-scores do not differ much. This suggests that for relation classification alone, the choice of loss function does not matter much.
2. Multi-task learning helps, regardless of the loss function used in relation classification.
3. When we use cross-entropy loss and ranking loss at the same time for relation classification, without multi-tasking, the F1-score only increases slightly from 61.4% to 61.7%. But when cross-entropy is applied to another related task, relation identification, with multi-tasking, the F1-score increases from 61.4% to 62.9%, an absolute increase of 1.5%. This suggests that the effectiveness of our multi-task architecture mostly comes from the introduction of relation identification, and this binary classification task does help with the data imbalance problem, corroborating our motivation stated in Section 1.
4. In the same multi-tasking scenario, using ranking loss in relation classification is better than using cross-entropy loss (62.9% vs. 62.0%), an absolute increase of 0.9% in F1-score. Note that cross-entropy loss is already used in relation identification. This suggests that the diversity that comes with ranking loss can improve performance.
Related Work

Liu et al. (2013) were the first to adopt deep learning for relation extraction. They proposed using a CNN to learn features automatically without handcrafted features. Zeng et al. (2014) also employed a CNN to encode the sentence, adding lexical features to the word embeddings. Their biggest contribution is the introduction of position embeddings. Zeng et al. (2015) proposed a model named Piecewise Convolutional Neural Networks (PCNN), in which the feature map of each convolutional filter p_i is divided into three segments (p_i1, p_i2, p_i3) by the head and tail entities, and the max-pooling operation is applied to these three segments separately. dos Santos et al. (2015) also used a CNN but proposed a new pairwise ranking loss function to reduce the impact of negative instances. Lin et al. (2016) used a CNN with a sentence-level attention mechanism over multiple instances to reduce noise in labels.
RNN is also widely used in relation extraction. Miwa and Bansal (2016) used LSTM and tree structures for relation extraction task. Their model is composed of three parts: an embedding layer to encode the input sentence, a sequence layer to identify whether a word is an entity or not, and a dependency layer for relation extraction. Zhou et al. (2016) used BiLSTM and attention mechanism to improve the model's performance. She et al. (2018) proposed a novel Hierarchical attention-based Bidirectional Gated recurrent neural network (HBGD) integrated with entity descriptions to mitigate the problem of having wrong labels and enable the model to capture the most important semantic information.
Entity background knowledge also contains important information for relation extraction. To capture such information, Ji et al. (2017) and She et al. (2018) extracted entity descriptions from Freebase and Wikipedia and used an encoder to extract features from these descriptions. He et al. (2018) used a dependency tree to represent the context of entities and transformed the tree into an entity context embedding using a tree-based GRU.
Unlike most existing works which only consider a single entity pair in a sentence, Christopoulou et al. (2018) considered multiple entity pairs in a sentence simultaneously and proposed a novel walk-based model to capture the interaction pattern among the entity pairs. Su et al. (2018) pointed out that the global statistics of relations between entity pairs are also useful, and proposed to construct a relation graph and learn relation embeddings to improve the performance of relation extraction.
Several studies are motivated to mitigate the effect of wrong labels (Lin et al., 2016; She et al., 2018; Qin et al., 2018), and Li and Ji (2014) proposed to jointly extract entity mentions and relations. This is not the focus of our paper.

Conclusion
In this paper, we focus on the relation extraction task with an imbalanced corpus. To mitigate the problem of having too many negative instances, we propose a multi-task architecture which jointly trains a model to perform the relation identification task with cross-entropy loss and the relation classification task with ranking loss. Moreover, we introduce the embeddings of character-wise/word-wise BIO tag from the named entity recognition task to enrich the input representation. Experiment results on the ACE 2005 Chinese and English corpus show that our proposed approach can successfully address the data imbalance problem and significantly improve the performance, outperforming the state-of-the-art models in terms of F1-score. In particular, we find BIO tag embeddings very effective, and we believe they could be used as a general part of character/word representation.