Recurrent Interaction Network for Jointly Extracting Entities and Classifying Relations

Named entity recognition (NER) and relation extraction (RE) are two fundamental tasks in natural language processing applications. In practice, these two tasks often need to be solved simultaneously. Traditional multi-task learning models capture the correlations between NER and RE only implicitly. However, there exist intrinsic connections between the outputs of NER and RE. In this study, we argue that an explicit interaction between the NER model and the RE model will better guide the training of both models. Based on the traditional multi-task learning framework, we design an interactive feature encoding method to capture the intrinsic connections between the NER and RE tasks. In addition, we propose a recurrent interaction network to progressively capture the correlation between the two models. Empirical studies on two real-world datasets confirm the superiority of the proposed model.


Introduction
Named entity recognition (NER) and relation extraction (RE) are two crucial tasks for information extraction from textual data. NER aims to extract all entities in a sentence. RE aims to classify the relation between given entities. In practice, both tasks need to be solved simultaneously. Consider the sentence "John was born in Sheffield which is a city of England" as an example. The goal of joint entity and relation extraction is to identify all the relational triples, here (Sheffield, birth place of, John) and (England, contains, Sheffield). This joint task plays a vital role in extracting structured knowledge from unstructured text, which is important for several applications, including knowledge base construction (Komninos and Manandhar, 2017; Deng et al., 2019; Nathani et al., 2019).
The simplest way to solve this joint task is a pipeline-based approach that first extracts all entities in the sentence and then classifies the relation between all entity pairs (Zelenko et al., 2003; Zhou et al., 2005; Chan and Roth, 2011). However, pipeline-based approaches ignore the correlation between the NER and RE tasks and may suffer from error propagation.
Considering the close correlation between the NER and RE tasks, many recent studies have focused on joint relation extraction and named entity recognition. Multi-task learning (MTL) techniques (Collobert and Weston, 2008) have been exploited (Miwa and Bansal, 2016; Zeng et al., 2018; Fu et al., 2019) to capture the correlation between NER and RE and improve joint extraction performance. These MTL-based models implicitly model the correlation of NER and RE via a shared common representation. It is worth noting that the output of RE helps the prediction of NER and vice versa.
Take the sentence "Reynolds had been in CBS for ten years" as an example. If we know that the relation company of holds between CBS and Reynolds, then CBS and Reynolds have high probabilities of being Organization and Person and low probabilities of being Person and Person. On the other hand, if we know that CBS and Reynolds are Organization and Person, then the relation cannot be nationality of. In this study, we regard NER and RE as dual tasks and introduce an interaction model between the outputs of NER and RE to explicitly leverage their correlation to guide the training process of both models.
As shown in Figure 1 (A), most previous works jointly extract entities and relations in a multi-task learning framework that focuses on learning shared layers $f$ to extract the common features $S$ for NER and RE. Then, the learned common features $S$ are fed into two independent modules $g_1$ and $g_2$ for NER and RE. If $f$ computes the sufficient statistics of $X$ for predicting $y_1$ and $y_2$, $g_1$ and $g_2$ are sufficiently expressive, and the data supports the learning of $g_1$ and $g_2$, there is no need for further interaction. However, this is not always the case, especially when $g_1$ and $g_2$ correlate with each other. An interaction between $g_1$ and $g_2$ can be introduced as shown in Figure 1 (B). Although a single simple interaction may have limited expressive power, multiple interactions allow each component to be less expressive while the overall model remains sufficiently expressive. Following this motivation, in this paper we propose a recurrent interaction network (RIN) to capture the correlations between the NER and RE dual tasks. Specifically, we present a method to learn dual-task interaction features that represent the "degree of alignment" of NER and RE on each word. We further introduce a recurrent structure to progressively refine the predictions of NER and RE based on the learned dual-task interaction features. Empirical studies on the NYT and WebNLG datasets show that RIN achieves new state-of-the-art performance and confirm its effectiveness. A further experiment that introduces a pre-trained BERT (Devlin et al., 2019) model as the sentence encoder shows a significant performance gain over the BiLSTM encoder, which supports the use of BERT in joint entity and relation extraction.

Figure 1 (B): An interaction augmented multi-task model. $X$ is the input; $g_1$ and $g_2$ are two models for different tasks, corresponding to NER and RE in this study.

Related Work
Extracting relational facts from raw text is one of the most important tasks in natural language processing. In the earlier relation extraction (RE) task, the goal is to classify the relation between two given entities into one of the pre-defined relations. Most researchers adopted sequence-based models and attention mechanisms to encode the words of a sentence and distill a vector representation, which is then passed to a classifier (Zhang et al., 2015; Shen and Huang, 2016; Wang et al., 2016). Some other studies also incorporate the dependency grammar information of the sentence into the encoder to achieve a better representation and classification accuracy (Xu et al., 2015; Miwa and Bansal, 2016; Zhang et al., 2018; Guo et al., 2019). These methods, while effective, are limited to the relation extraction task with given entities.
A more challenging task is to extract all relational facts from an arbitrary sentence that is not accompanied by marked entities. Recently, deep neural network based joint models have been exploited to unite the NER and RE problems. In (Zheng et al., 2017), the authors propose a tagging strategy that transforms this task into a sequence prediction problem, where the relation labels and entity types share a common label space. This joint model is capable of extracting entities and relations simultaneously. However, it fails to handle the case where more than one relation exists between two entities. In (Zhang et al., 2018), the authors propose an end-to-end sequence-to-sequence model that detects a relational triple by first decoding the relation and then decoding the two entities of the relation. However, the number of relational triples that can be extracted from a sentence is limited to a predefined constant. This model also cannot extract entities containing multiple words. A more recent study addresses these limitations by transforming the task into subject tagging followed by relation-specific object tagging. A two-level framework is presented in which the low-level tagging module recognizes all possible subjects and the high-level tagging module identifies all possible objects for each relation.
It is worth noting that the above-mentioned works seldom consider the implicit constraints and connections between NER and RE. Multi-task learning techniques have been introduced to implicitly model the interaction between NER and RE. In (Fu et al., 2019), the authors follow the generalized MTL framework and exploit Bi-RNN and GCN to extract both sequential and regional dependency word features of the sentence. The shared features are then fed into two independent classifiers for RE and NER respectively. As discussed in the Introduction section, the explicit correlation is also a potential constraint that can improve the learning of both the NER and RE models. An interactive multi-task learning network has also been proposed for jointly extracting aspects and classifying their sentiment. There, both sub-tasks are treated as sequence prediction problems, which differs from our joint NER and RE model. In addition, the linear transformation exploited for the interaction between the outputs of the sub-tasks is not sufficient to model the interaction between the models.

Problem Statement
In this section, we formally describe the problem. For a set $T = \{t_1, \cdots, t_l\}$ of $l$ pre-defined relation types and a given sentence $s = \{w_1, w_2, \cdots, w_n\}$ of $n$ words, the problem is to extract all relational triples from the given sentence. A single relational triple is defined as $\langle w, t, w' \rangle$, where the relation $t \in T$ and the entity words $w, w' \in s$, $w \neq w'$. In the case where a phrase of multiple words forms an entity, we denote the entity by the beginning word of the entity phrase. Note that one word $w$, and even the same entity pair $(w, w')$, may be involved in multiple relational triples, and that the order of the two words in the triple matters. From a probabilistic point of view, we predict the probability $p(\langle w, t, w' \rangle)$ that the relational triple holds. When the relation is more likely to hold than not, i.e. $p(\langle w, t, w' \rangle) \geq 0.5$, we extract the relation.
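As a minimal sketch of the decision rule above, the following Python snippet thresholds a table of triple probabilities at 0.5. The function name `extract_triples`, the dictionary layout of `probs`, and the toy probability values are assumptions made for illustration; they are not part of the problem definition.

```python
def extract_triples(probs, threshold=0.5):
    """Return all ordered triples (w, t, w2) whose probability reaches the threshold.

    probs maps an ordered word pair (w, w2) to a dict {relation: probability}.
    This layout is a hypothetical choice for the example.
    """
    triples = []
    for (w, w2), by_relation in probs.items():
        if w == w2:
            continue  # a triple relates two different words
        for t, p in by_relation.items():
            if p >= threshold:
                triples.append((w, t, w2))
    return sorted(triples)


# Toy probabilities (invented) for the example sentence from the Introduction.
probs = {
    ("Sheffield", "John"): {"birth place of": 0.91, "contains": 0.02},
    ("John", "Sheffield"): {"birth place of": 0.07, "contains": 0.01},
    ("England", "Sheffield"): {"birth place of": 0.03, "contains": 0.88},
}
triples = extract_triples(probs)
print(triples)
```

Because the pairs are ordered, (Sheffield, birth place of, John) and (John, birth place of, Sheffield) are scored separately, and the same pair may pass the threshold for several relations.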

Model
In this section, we describe our model. First, we introduce the recurrent interaction network (RIN). Next, we present the NER and RE modules. Finally, we show the input and training objective of our model. The framework of RIN is shown in Figure 2.

Recurrent Interaction Network
As discussed above, the output of RE helps the prediction of NER and vice versa. Based on this assumption, we aim to model the interaction of NER and RE and feed the interaction result back to refine the predictions of NER and RE. Assume that for each word $w$ of the given sentence $s$, we have extracted a relation-specific feature vector $\tilde{h}$ and an entity-specific feature vector $\bar{h}$ based on the word embedding $h$. The feature vectors of all words in $s$ make up the corresponding sentence-level matrices $\tilde{H}$, $\bar{H}$ and $H$. The NER module predicts the entity label information (represented by $\bar{y}$ for the moment) based on the task features $\bar{H}$, and the RE module predicts the probability $p(\langle w, t, w' \rangle)$ based on the task features $\tilde{H}$. The key idea behind our model is to encode the interaction $X$ among the word embedding $H$ and the subtask results $p(\langle w, t, w' \rangle)$ and $\bar{y}$, and then update the task features $\tilde{H}$ and $\bar{H}$ based on the interaction features $X$. $X$ is supposed to contain information about the "alignment" of the NER and RE results on each word of a sentence.
We introduce the interaction (INT) module to extract the interaction information. For each word $w$ with word embedding $h$, the INT module learns an interaction feature vector $x$ from the subtask results $p(\langle w, t, w' \rangle)$ and $\bar{y}$ according to the following calculation.
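The displayed equation is not reproduced in this text. One form that is consistent with the symbols defined next, offered as an assumption rather than as the authors' exact formula, is given below. Here $\tilde{y} \in \mathbb{R}^{l}$ (the symbol also used in the Ablation Study) is assumed to summarize, for each of the $l$ relations, the probability that $w$ takes part in that relation with some word; how the probabilities are aggregated over $w'$ and the order of the concatenated terms are left open.

```latex
x = \phi\left(W_a\,(\tilde{y} \oplus \bar{y} \oplus h) + b_a\right)
```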
Here, $\oplus$ denotes the concatenation operation and $\phi(\cdot)$ is the ReLU activation function. $\theta_{INT} = \{W_a, b_a\}$ are learnable model parameters. In the calculation of $x$, we consider the probability that word $w$ participates in each of the $l$ relations with some word, the probability of word $w$ being an entity, and the word embedding. By combining these three kinds of information in the INT module, we aim to learn a feature that conveys information about the alignment of NER and RE on word $w$. The interaction features of all words of the sentence $s$ make up the interaction feature matrix $X = \{x_1, ..., x_n\}$.
We employ two separate gated recurrent units (GRUs) to update the task features $\tilde{H}$ and $\bar{H}$ based on the interaction features $X$. Taking the update of the relation-specific task features $\tilde{H}$ as an example, the updated task feature $\tilde{h}^{new}$ of a word $w$ is obtained from the interaction feature $x$ of this word by a GRU calculation in which $\oplus$ is the concatenation operation and $*$ is the dot product operation. $\theta_{GRU_r} = \{W_z, W_u, W_o\}$ are learnable model parameters. The update of the entity-specific task features $\bar{H}$ is similar to that of $\tilde{H}$, using a separate model with parameters $\theta_{GRU_g}$.
The updating process can be run for $K$ rounds. In the $k$-th updating round, relations are predicted based on $\tilde{H}^{(k)}$ and entities are labeled based on $\bar{H}^{(k)}$; then the interaction information $X^{(k)}$ is extracted. Based on $X^{(k)}$, the relation-specific and entity-specific features are updated to $\tilde{H}^{(k+1)}$ and $\bar{H}^{(k+1)}$. We believe that by updating the task features in this recurrent way, the predictions of NER and RE are progressively refined over multiple updating rounds. We also conduct experiments with different numbers of updating rounds $K$ to verify this assumption. Finally, after the $K$-th round of updating, we use the refined representations $\tilde{H}^{(K)}$ and $\bar{H}^{(K)}$ for the final NER and RE predictions.

Figure 2: Overview of RIN. $f_r$ extracts the relation-specific features $\tilde{H}$ and $f_g$ extracts the entity-specific features $\bar{H}$ from the sentence embedding $H$. $C_r$ is the relation extraction model and $C_g$ is the entity recognition model. INT encodes the interaction information between the two sub-tasks.
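The recurrent refinement described in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the gating follows the standard GRU form, the roles assigned to $W_z$, $W_u$ and $W_o$ (update gate, reset gate, candidate state) are guesses, biases are omitted, and `interact` is a hypothetical stand-in for running the RE and NER modules and encoding their outputs with the INT module.

```python
import math


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gru_update(h, x, W_z, W_u, W_o):
    """Refine task feature h with interaction feature x (standard GRU gating).

    Assumed roles: W_z update gate, W_u reset gate, W_o candidate state.
    """
    hx = h + x  # list concatenation plays the role of the concatenation operator
    z = [sigmoid(a) for a in matvec(W_z, hx)]
    u = [sigmoid(a) for a in matvec(W_u, hx)]
    gated = [u_i * h_i for u_i, h_i in zip(u, h)]  # element-wise gating
    cand = [math.tanh(a) for a in matvec(W_o, gated + x)]
    return [(1.0 - z_i) * h_i + z_i * c_i for z_i, h_i, c_i in zip(z, h, cand)]


def refine(h_rel, h_ent, interact, theta_r, theta_g, K):
    """Run K updating rounds for one word.

    interact(h_rel, h_ent) is a hypothetical stand-in for predicting with the
    RE and NER modules and encoding the results with the INT module.
    """
    for _ in range(K):
        x = interact(h_rel, h_ent)  # interaction feature from the current features
        # Both features are updated from the same x, with separate parameters.
        h_rel, h_ent = (gru_update(h_rel, x, *theta_r),
                        gru_update(h_ent, x, *theta_g))
    return h_rel, h_ent


# Toy check with all-zero weights: both gates are 0.5 and the candidate is 0,
# so every round halves the feature vector.
d, m = 2, 1
zeros = [[0.0] * (d + m) for _ in range(d)]
theta = (zeros, zeros, zeros)
h_rel, h_ent = refine([1.0, -2.0], [4.0, 0.0], lambda a, b: [0.5], theta, theta, K=2)
print(h_rel, h_ent)
```

The point of the sketch is the control flow: each round computes one interaction feature from the current relation and entity features and then updates both with separate parameters.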

Named Entity Recognition
The NER module recognizes all the entities in the sentence based on the entity-specific features $\bar{H}$. As one entity can consist of multiple words, we formalize this problem as tagging each word with an entity label that takes values from (Begin, Inside, End, Single, Out). When a word is tagged with a Begin label, it is the beginning word of a detected entity. More specifically, the NER module classifies each word into one of the five label clusters. The probability distribution $\bar{y}$ of word $w$ over these five clusters is calculated based on the entity feature $\bar{h}$ as follows:

$$\bar{y} = \mathrm{softmax}(W_g \bar{h} + b_g)$$

where $\theta_{NER} = \{W_g, b_g\}$ are learnable model parameters.
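To illustrate the five-label tagging scheme, the sketch below converts a tag sequence into entity phrases. The function `decode_entities`, the single-letter tag abbreviations, and the example tags are assumptions for illustration; the text does not specify a decoding procedure.

```python
def decode_entities(words, tags):
    """Collect entity phrases from per-word tags.

    Tags: "B" (Begin), "I" (Inside), "E" (End), "S" (Single), "O" (Out).
    Each entity is returned with the index of its beginning word.
    Malformed sequences (for example "B" with no closing "E") are dropped.
    """
    entities = []
    start = None  # index of the Begin word of the entity being read
    for i, tag in enumerate(tags):
        if tag == "S":
            entities.append((i, words[i]))
            start = None
        elif tag == "B":
            start = i
        elif tag == "I":
            pass  # stay inside the current entity, if any
        elif tag == "E":
            if start is not None:
                entities.append((start, " ".join(words[start:i + 1])))
            start = None
        else:  # "O" discards any unfinished entity
            start = None
    return entities


words = ["John", "was", "born", "in", "New", "York", "City"]
tags = ["S", "O", "O", "O", "B", "I", "E"]
entities = decode_entities(words, tags)
print(entities)
```

Keeping the index of the Begin word matches the convention in the Problem Statement, where a multi-word entity is denoted by its beginning word.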

Relation Extraction
The RE module extracts all the relational triples in the given sentence based on the relation-specific features $\tilde{H}$. Following (Fu et al., 2019), we consider all the relations between all the word pairs in the sentence. For the word pair $(w, w')$ and the considered relation $t$, relation extraction is formulated probabilistically as a binary classification problem. Specifically, the RE module calculates the probability $p(\langle w, t, w' \rangle)$ that the relation holds. If the relation is more likely to hold than not, i.e. $p(\langle w, t, w' \rangle) \geq 0.5$, we extract the relation. In the definition of the classifier, $\oplus$ is the concatenation operation, $\phi(\cdot)$ is the ReLU activation function, and $\sigma(\cdot)$ is the sigmoid activation function. $\theta_{RE} = \{W_m, W_r, W_p, b_1, b_2, ..., b_l\}$ are learnable model parameters. Note that, unlike (Fu et al., 2019), we use a sigmoid activation function rather than a softmax function in Eq. (10). Since more than one relation may exist between the same word pair $(w, w')$, the softmax function used in (Fu et al., 2019) cannot address this overlapping problem.
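The difference between the two activation choices can be seen on a small numeric example. In the sketch below, the three scores are invented values for one word pair, and the first two relations are meant to hold at the same time.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    shifted = [math.exp(a - top) for a in scores]
    total = sum(shifted)
    return [e / total for e in shifted]


# Invented scores of three relations for one word pair.
scores = [3.0, 3.0, -2.0]

independent = [sigmoid(a) for a in scores]  # one binary decision per relation
competing = softmax(scores)                 # relations share one unit of probability

kept_sigmoid = [i for i, p in enumerate(independent) if p >= 0.5]
kept_softmax = [i for i, p in enumerate(competing) if p >= 0.5]
print(kept_sigmoid, kept_softmax)
```

With the sigmoid, both overlapping relations pass the 0.5 threshold. With the softmax, the two equally strong relations split the probability mass, so neither reaches 0.5.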

Input of Model
The whole model takes the embedding $H$ of the given sentence $s$ as input and further extracts the relation-specific features $\tilde{H}$ and the entity-specific features $\bar{H}$. The embedding matrix $H$ of a sentence can be formed from the word embeddings $h$ obtained by looking up a pre-trained word embedding matrix. To further encode contextual information into the word embeddings, a BiLSTM can be trained over the pre-trained word embeddings of each sentence. Alternatively, we can use the commonly used pre-trained model BERT to obtain the sentence embedding $H$ from the words of the sentence.
We denote the learnable parameters in the BiLSTM or BERT as $\theta_H$. After obtaining the representation $H$ of the sentence $s$, we feed $H$ into two separate linear transformation modules to obtain the task-specific features $\tilde{H}$ and $\bar{H}$ for each word. The relation-specific feature $\tilde{h}$ of word $w$ is extracted from the word embedding $h$ by linear transformations.
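The transformation itself is not shown in this text. A single-layer form consistent with the stated activation and parameters, given here as an assumption rather than as the authors' exact definition (the original may stack more than one layer), would be:

```latex
\tilde{h} = \phi\left(W_f\, h + b_f\right)
```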
Here, $\phi(\cdot)$ is the ReLU activation function and $\theta_{f_r} = \{W_f, b_f\}$ are learnable model parameters. The entity-specific feature $\bar{h}$ of each word is extracted from $h$ by similar linear transformations with separate model parameters $\theta_{f_g}$.

Training Objective
The training loss of the whole RIN model comprises two parts: the relation extraction loss $L_r$ and the named entity recognition loss $L_g$. Assume that for each word $w$, $\bar{t}$ is the one-hot ground truth entity label and $\bar{y}$ is the predictive distribution over the five labels obtained from the NER module after $K$ rounds. The entity recognition loss on one word is the cross entropy between the true one-hot label and the predictive distribution:

$$L_g(w) = \mathrm{CrossEntropy}(\bar{t}, \bar{y})$$

Assume that for each relational triple $\langle w, t, w' \rangle$, $\tilde{t}$ is the one-hot ground truth label, taking the value $[1, 0]$ if the relation holds and $[0, 1]$ otherwise, and $p(\langle w, t, w' \rangle)$ is the probability that the relation holds obtained from the RE module after $K$ rounds. The predictive distribution is denoted as $\tilde{y} = [p(\langle w, t, w' \rangle), 1 - p(\langle w, t, w' \rangle)]$. Then the relation extraction loss on one relational triple is the cross entropy between the true one-hot label and the predictive distribution:

$$L_r(\langle w, t, w' \rangle) = \mathrm{CrossEntropy}(\tilde{t}, \tilde{y})$$
The total loss $L$ is then calculated over all words and relational triples of all sentences.
With a gradient-based algorithm, we seek to minimize the total loss $L$ over all model parameters $\Theta = \{\theta_{INT}, \theta_{GRU_r}, \theta_{GRU_g}, \theta_{RE}, \theta_{NER}, \theta_H, \theta_{f_r}, \theta_{f_g}\}$ to achieve good performance on both the NER and RE tasks.
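The two loss terms can be sketched in plain Python as follows. The function names, the input layout, and the toy values are assumptions for illustration. In particular, combining the two parts as an unweighted sum is an assumption of this sketch, since the exact form of the total loss is not reproduced in the text.

```python
import math


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


def total_loss(entity_examples, relation_examples):
    """Sum of per-word NER losses and per-triple RE losses.

    entity_examples: list of (one_hot_label, predicted_distribution) over 5 labels.
    relation_examples: list of (holds, p), where holds is a bool and p is the
    predicted probability that the triple holds.
    The unweighted sum is an assumption of this sketch.
    """
    loss_g = sum(cross_entropy(t, y) for t, y in entity_examples)
    loss_r = 0.0
    for holds, p in relation_examples:
        target = [1.0, 0.0] if holds else [0.0, 1.0]
        loss_r += cross_entropy(target, [p, 1.0 - p])
    return loss_g + loss_r


# Toy inputs: one word whose true label is the fourth of five, and two triples.
entity_examples = [([0, 0, 0, 1, 0], [0.1, 0.1, 0.1, 0.5, 0.2])]
relation_examples = [(True, 0.5), (False, 0.5)]
loss = total_loss(entity_examples, relation_examples)
print(loss)
```

Each of the three toy terms assigns probability 0.5 to the correct outcome, so each contributes the same amount to the total.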

Experiment
In this section, we conduct experiments to evaluate our model on two public datasets, NYT (Riedel et al., 2010) and WebNLG (Gardent et al., 2017). The NYT dataset was originally produced by a distant supervision method. It consists of 1.18M sentences with 24 predefined relation types. The WebNLG dataset was created for Natural Language Generation (NLG) tasks and adapted by (Zeng et al., 2018) for the relational triple extraction task. It contains 246 predefined relation classes. For a fair comparison, we directly use the preprocessed datasets provided by (Zeng et al., 2018). For both datasets, we follow the evaluation setting used in previous works. An extracted relational triple (subject, relation, object) is regarded as correct only if the relation and the heads of both subject and object are correct. We report Precision, Recall and F1-score for all the compared models. The statistics of the datasets are summarized in Table 2.

Implementation Details
For a fair comparison with previous work, we use the pre-trained 100-dimensional word embeddings provided by (Zeng et al., 2018), as well as 10-dimensional part-of-speech (POS) embeddings. We concatenate the word and POS embeddings and learn a 100-dimensional BiLSTM embedding for each word. We randomly drop out 10% of the neurons in the input layer. The model is trained with a batch size of 50 on both datasets. We use the Adam optimizer with an initial learning rate of 0.001 for all datasets. To compare with the SOTA model HBT, which exploits the pre-trained BERT (Devlin et al., 2019) model to initialize word embeddings, we follow their work and use the same pre-trained BERT model, [BERT-Base, Cased]. In the BERT-initialized setting, the model is trained with a batch size of 70 on NYT and 50 on WebNLG. We use the Adam optimizer with an initial learning rate of 2e-4 for both datasets.
The code for our model is available at XXX.
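The triple-level evaluation described in the Experiment section can be sketched as follows. The function name and the toy triples are assumptions for illustration, and pooling all triples into one set before scoring is an assumption about the averaging, which the text does not specify.

```python
def precision_recall_f1(predicted, gold):
    """Scores over sets of (subject_head, relation, object_head) triples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # exact match on all three fields
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [("Sheffield", "birth place of", "John"), ("England", "contains", "Sheffield")]
predicted = [("Sheffield", "birth place of", "John"), ("England", "contains", "John")]
p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)
```

The second predicted triple has the right relation and subject but the wrong object head, so it counts as an error for both precision and recall.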

Performance Comparison
We now show the results on the NYT and WebNLG datasets. As baselines, we include BiLSTM and BiLSTM s. In BiLSTM, $H$ is fed directly into $C_g$ and $C_r$ for the final NER and RE predictions. In BiLSTM s, $H$ is first fed into $f_g$ and $f_r$, and the outputs of $f_g$ and $f_r$ are fed into $C_g$ and $C_r$ for the final NER and RE predictions. We also compare with several recent models, including the sequential model NovelTagging (Zheng et al., 2017), the encoder-decoder based models MultiDecoder (Zeng et al., 2018) and Seq2Seq+RL (Zeng et al., 2019), the dependency-based model GraphRel (Fu et al., 2019), and the hierarchical binary tagging framework HBT. The result of Seq2Seq+RL is taken from (Zeng et al., 2019) and the others from (Devlin et al., 2019).
Result Discussion (100d) Table 1 shows the performance of the different models. In the setting of 100-dimensional word embeddings, RIN consistently outperforms all previous models. Notably, even the baseline models BiLSTM and BiLSTM s significantly surpass Seq2Seq+RL and MultiDecoder, revealing the superiority of word-pair based methods over encoder-decoder based methods. Note that GraphRel shows low F1 scores of 61.9 and 42.9 on the two datasets. As discussed above, because it uses the softmax function in the prediction of RE, GraphRel cannot address the cases where more than one relation exists between two entities. This may be the main reason for its low performance. We also find that RIN significantly outperforms BiLSTM and BiLSTM s on the two datasets with K = 1. This improvement supports the effectiveness of the interactive updating mechanism used in RIN. By setting K = 3 and K = 5, RIN achieves its best F1 performance on the WebNLG and NYT datasets, respectively. The F1 performance is significantly improved compared to K = 1. The improvement achieved by increasing K supports the effectiveness of the recurrent structure in the model.
Result Discussion (BERT) In the BERT setting, the F1 performance of RIN is further improved and surpasses the non-BERT models by a large margin. With K = 1, RIN exceeds the non-BERT model by 6.0 F1 on the NYT dataset and is competitive with the SOTA model HBT on both datasets. By setting K to 2, RIN surpasses HBT by 0.3 and 1.3 F1. These results show that incorporating BERT significantly improves the performance of RIN.

Impact of Updating Rounds
In this section, we conduct experiments on the NYT and WebNLG datasets to show the performance of RIN with different numbers of updating rounds K. To evaluate the effectiveness of the GRU, we also present the performance of a vanilla RNN (Hochreiter and Schmidhuber, 1997). The results are shown in Figure 3. Both RIN (RNN) and RIN (GRU) significantly outperform BiLSTM s in F1 with K = 1. From the F1 curve of RIN (GRU) on the NYT dataset, we also find that the F1 performance increases, up to a point, as we increase the number of updating rounds K. In particular, the performance of RIN (GRU) keeps increasing over 5 rounds. This progressive increase in F1 supports our original assumption that the recurrent structure improves performance.
It can also be seen that RIN (GRU) consistently outperforms RIN (RNN) on both datasets and that the optimal round for RIN (GRU) comes later than for RIN (RNN). As shown in the first subfigure, the F1 performance of RIN (RNN) only increases in the first two rounds, while the F1 performance of RIN (GRU) keeps increasing for 5 rounds. A similar phenomenon can be found on WebNLG. Considering that the gate mechanism used in the GRU is designed to solve the problem of long-term memories being overwritten by short-term memories, RIN (GRU) is more adept than RIN (RNN) at leveraging historical updating information to adjust the updating process. From this perspective, RIN (GRU) is more expressive than RIN (RNN).
We also report the F1 performance of NER and RE on the NYT dataset. The results are presented in Table 4. From the table, we find that the performance of both NER and RE improves over BiLSTM and BiLSTM s with K = 1. The performance is further improved by setting K to 5. These results support our argument that explicit interaction can enhance the performance of both tasks.

Ablation Study
In this section, we perform an ablation study on RIN. The ablated models are denoted RIN$_{-INT}$, RIN$_{-\tilde{y}}$, RIN$_{-\bar{y}}$, RIN$_{-H}$ and RIN$_{-\gamma}$. We find that the performance of RIN deteriorates as we remove critical components. Specifically, RIN$_{-INT}$ underperforms RIN on both datasets, suggesting the importance of modeling the interaction of NER and RE. From the performance on the NYT dataset, we also find that RIN$_{-\tilde{y}}$, RIN$_{-\bar{y}}$ and RIN$_{-H}$ underperform RIN while outperforming RIN$_{-INT}$, indicating that all three kinds of information play an important role in learning the interaction feature. Notice that the F1 performance drops by 0.3 when $H$ is removed. This deterioration is marginal compared to removing $\tilde{y}$ or $\bar{y}$, which suggests that $\tilde{y}$ together with $\bar{y}$ may play a pivotal role in providing "alignment" information for learning the interaction feature. From the performance of RIN$_{-\gamma}$, we observe that directly using two linear transformations to update $\tilde{H}$ and $\bar{H}$ hurts the performance. The F1 performance drops by 0.3 and 0.5 on NYT and WebNLG compared to RIN$_{-INT}$. This observation indicates that it is the learned interaction feature that plays the important role in refining the predictions.

Case Study
In this section, we conduct a case study on RIN and BiLSTM s using examples from NYT. From the first case in Figure 3, we observe that BiLSTM s misses the relational triple (Europe, /location/location/contains, Norway), while RIN extracts all the relational triples in the sentence. Although BiLSTM s correctly extracts all the entities in the sentence, including Norway, it cannot leverage the prediction state of NER to refine its RE without interaction. In contrast, RIN captures this "alignment" information and correctly extracts the relational triple that contains the entity Norway.
From the second case in Figure 3, we similarly observe that both RIN and BiLSTM s correctly extract the relational triple (York, /location/location/contains, Scott). However, BiLSTM s mistakenly identifies Texas as an entity, while RIN correctly extracts the entity Scott, which is involved in the relation /location/location/contains. This suggests that RIN is capable of leveraging the prediction state of RE to refine its NER and tends to extract as entities the words that are involved in relational triples.

Conclusion
This paper studies the joint entity and relation extraction problem. Existing multi-task learning based models implicitly characterize the commonalities and differences of the two tasks via shared representations. We argue that an explicit interaction between these two tasks can improve the performance of both. In this study, we present a recurrent interaction network to capture the intrinsic connection between the two sub-tasks. Specifically, the features that represent the interaction between NER and RE are encoded into a distributed representation. In addition, a recurrent module is proposed to progressively accumulate the dependencies. Empirical studies on two publicly available datasets confirm the effectiveness of the presented model.