Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers

Many approaches to extracting multiple relations from a paragraph require multiple passes over the paragraph. In practice, multiple passes are computationally expensive, making it difficult to scale to longer paragraphs and larger text corpora. In this work, we focus on the task of multiple-relation extraction by encoding the paragraph only once. We build our solution upon pre-trained self-attentive models (Transformers): we first add a structured prediction layer to handle extraction between multiple entity pairs, then enhance the paragraph embedding with entity-aware attention to capture the relational information associated with each entity. We show that our approach is not only scalable but also achieves state-of-the-art performance on the standard ACE 2005 benchmark.


Introduction
Relation extraction (RE) aims to find the semantic relation between a pair of entity mentions in an input paragraph. A solution to this task is essential for many downstream NLP applications such as automatic knowledge-base completion (Surdeanu et al., 2012; Riedel et al., 2013; Verga et al., 2016), knowledge-base question answering (Yih et al., 2015; Xu et al., 2016; Yu et al., 2017), and symbolic approaches for visual question answering (Mao et al., 2019; Hu et al., 2019).
One particular type of RE task is multiple-relation extraction (MRE), which aims to recognize the relations of multiple pairs of entity mentions in an input paragraph. Because input paragraphs in real-world applications dominantly contain multiple pairs of entities, an efficient and effective solution for MRE has more important and more practical implications. However, nearly all existing approaches to MRE (Qu et al., 2014; Gormley et al., 2015; Nguyen and Grishman, 2015) adopt some variation of the single-relation extraction (SRE) approach, which treats each pair of entity mentions as an independent instance and requires multiple passes of encoding for the multiple pairs of entities. The drawback of this approach is obvious: it is computationally expensive, and the issue becomes more severe when the input paragraph is long, making the solution impractical when the encoding step involves deep models.
This work presents a solution that resolves the inefficient multiple-pass issue of existing MRE solutions by encoding the input only once, which significantly increases efficiency and scalability. Specifically, the proposed solution is built on top of existing transformer-based, pre-trained, general-purpose language encoders. In this paper we use Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) as the transformer-based encoder, but the solution is not limited to BERT alone. The two novel modifications to the original BERT architecture are: (1) we introduce a structured prediction layer for predicting multiple relations for different entity pairs; and (2) we make the self-attention layers aware of the positions of all entities in the input paragraph. To the best of our knowledge, this work is the first promising solution that solves MRE tasks with such high efficiency (encoding the input in one pass) and effectiveness (achieving new state-of-the-art performance), as demonstrated on the ACE 2005 benchmark.

Background
MRE is an important task, as it is an essential prior step for many downstream tasks such as automatic knowledge-base completion and question answering. Popular MRE benchmarks include ACE (Walker et al., 2006) and ERE (Linguistic Data Consortium, 2013). In MRE, given a text paragraph x = {x_1, ..., x_N} and M mentions e = {e_1, ..., e_M} as input, the goal is to predict the relation r_ij for each mention pair (e_i, e_j): either one class from a list of pre-defined relations R, or a special class NA indicating no relation. This paper uses "entity mention", "mention", and "entity" interchangeably.
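As a concrete illustration, the task input and output can be sketched as follows. The tokens, span format, and relation labels here are invented for illustration and are not taken from any benchmark annotation:

```python
# Toy MRE instance; tokens, spans, and labels are invented for illustration.
NA = "NA"  # special class: no relation

paragraph = ["John", "works", "for", "Acme", "in", "Paris", "."]
# mentions as (start, end) token spans, end exclusive
mentions = [(0, 1), (3, 4), (5, 6)]  # "John", "Acme", "Paris"

# gold relations over ordered mention-index pairs; unlisted pairs are NA
relations = {(0, 1): "ORG-AFF", (1, 2): "GEN-AFF"}

def label_for(i, j):
    """The prediction target r_ij for the mention pair (e_i, e_j)."""
    return relations.get((i, j), NA)
```

Note that every ordered mention pair receives a label, so most pairs in a long paragraph fall into NA.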
Existing MRE approaches are based on either feature and model architecture selection (Xu et al., 2015; Gormley et al., 2015; Nguyen and Grishman, 2015; F. Petroni and Gemulla, 2015; Sorokin and Gurevych, 2017; Song et al., 2018b) or domain adaptation (Fu et al., 2017; Shi et al., 2018). But these approaches require multiple passes of encoding over the paragraph, as they treat an MRE task as multiple passes of an SRE task.

Proposed Approach
This section describes the proposed one-pass encoding MRE solution. The solution is built upon BERT with a structured prediction layer that enables BERT to predict multiple relations with one-pass encoding, and an entity-aware self-attention mechanism that infuses relational information with regard to multiple entities at each layer of hidden states. The framework is illustrated in Figure 1. It is worth mentioning that our solution can easily use other transformer-based encoders besides BERT, e.g. (Radford et al., 2018).

Structured Prediction with BERT for MRE
The BERT model has been successfully applied to various NLP tasks. However, the final prediction layer used in the original model is not applicable to MRE tasks, which essentially require edge predictions over a graph with entities as nodes. Inspired by (Dozat and Manning, 2018; Ahmad et al., 2018), we first encode the input paragraph with BERT. The representations of a pair of entity mentions (e_i, e_j) are denoted o_i and o_j respectively. When a mention e_i consists of multiple hidden states (due to byte-pair encoding), o_i is aggregated via average-pooling over the hidden states of the corresponding tokens in the last BERT layer. We then concatenate o_i and o_j as [o_i : o_j] and pass the result to a linear classifier to predict the relation:

P(r_ij | e_i, e_j) = softmax([o_i : o_j] W_L),    (1)

where W_L ∈ R^{2d_z × l}, d_z is the dimension of the BERT embedding at each token position, and l is the number of relation labels.
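A minimal numpy sketch of this prediction step follows. The dimensions are toy values and the random matrix stands in for BERT's last-layer hidden states (BERT-base would give d_z = 768); function names are ours, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, n_labels = 8, 5              # toy sizes; BERT-base would give d_z = 768

def mention_repr(hidden, span):
    """Average-pool last-layer hidden states over a mention's token span
    (start inclusive, end exclusive), as done for byte-pair pieces."""
    s, e = span
    return hidden[s:e].mean(axis=0)

def predict_relation(hidden, span_i, span_j, W_L):
    """Concatenate [o_i : o_j] and apply the linear classifier with softmax."""
    o = np.concatenate([mention_repr(hidden, span_i),
                        mention_repr(hidden, span_j)])
    logits = o @ W_L                   # W_L: (2 * d_z, n_labels)
    p = np.exp(logits - logits.max()) # numerically stable softmax
    return p / p.sum()

hidden = rng.standard_normal((10, d_z))   # stand-in for BERT's last layer
W_L = rng.standard_normal((2 * d_z, n_labels))
probs = predict_relation(hidden, (1, 3), (6, 7), W_L)
```

Because the encoder runs once, `predict_relation` can be called for every mention pair over the same `hidden` array, which is the source of the one-pass speedup.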

Entity-Aware Self-Attention based on Relative Distance
This section describes how we encode multiple-relation information into the model. The key idea is to use the relative distances between words and entities to encode the positional information of each entity. This information is propagated through the layers via the attention computation. Following (Shaw et al., 2018), for each pair of word tokens (x_i, x_j) with input representations h_i and h_j from the previous layer, we extend the computation of self-attention z_i as:

z_i = sum_j alpha_ij (h_j W^V + a^V_ij),
alpha_ij = exp(e_ij) / sum_{j'} exp(e_{ij'}),
e_ij = h_i W^Q (h_j W^K + a^K_ij)^T / sqrt(d_z),

where W^Q, W^K, W^V are the parameters of the model, and d_z is the dimension of the output of the self-attention layer.
Compared to standard BERT self-attention, the terms a^V_ij, a^K_ij ∈ R^{d_z} are extra; they can be viewed as edge representations between the input elements x_i and x_j. Specifically, we devise a^V_ij and a^K_ij to encourage each token to be aware of its relative distance to different entity mentions, and vice versa.

[Figure 2: The matrix of {a^K_ij} introduced in the self-attention computation. Each red cell embedding is defined as w^K_{d(i,j)}, using the distance from entity x_i to token x_j. Each blue cell embedding is defined as w^K_{d(j,i)}, using the distance from entity x_j to token x_i. White cells are zero embeddings, since neither x_i nor x_j is an entity. The {a^V_ij} matrix follows the same pattern with independent parameters.]
Adapted from (Shaw et al., 2018), we argue that the relative distance information does not help once the distance is beyond a certain threshold. Hence we first define the distance function as:

d(i, j) = max(−k, min(k, j − i)),

which clips all distances to the region [−k, k]; k is a hyper-parameter tuned on the development set. We can now define a^K_ij formally as:

a^K_ij = w^K_{d(i,j)} if x_i belongs to an entity mention,
a^K_ij = w^K_{d(j,i)} if only x_j belongs to an entity mention,
a^K_ij = 0 otherwise,

with a^V_ij defined analogously using parameters w^V. As defined above, if either token x_i or x_j belongs to an entity, we introduce a relative positional representation according to their distance. The distance is defined in an entity-centric way: we always compute the distance from the entity mention to the other token. If neither x_i nor x_j is an entity mention, we explicitly assign a zero vector to a^K_ij and a^V_ij. When both x_i and x_j are inside entity mentions, we take the distance as d(i, j) to keep the row-wise attention computation coherent, as depicted in Figure 2.
During model fine-tuning, the newly introduced parameters {w^K_{−k}, ..., w^K_{k}} and {w^V_{−k}, ..., w^V_{k}} are trained from scratch.
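The clipping and the entity-centric selection of the edge representations can be sketched as follows (numpy, toy sizes, single attention head; variable and function names are ours, not from the paper):

```python
import numpy as np

k = 2  # clipping threshold for relative distances (hyper-parameter)

def clip_dist(d):
    """Clip a raw relative distance to the region [-k, k]."""
    return max(-k, min(k, d))

def edge_embedding(i, j, is_entity, w):
    """a_ij chosen entity-centrically: the distance is measured from the
    entity to the other token; w maps clipped distances -k..k to vectors
    (index shifted by +k)."""
    if is_entity[i]:                    # also covers the both-entities case: d(i, j)
        return w[clip_dist(j - i) + k]
    if is_entity[j]:
        return w[clip_dist(i - j) + k]
    return np.zeros(w.shape[1])         # neither token is an entity: zero vector

def entity_aware_attention(h, is_entity, Wq, Wk, Wv, w_K, w_V):
    """One head: z_i = sum_j alpha_ij (h_j Wv + a^V_ij), with a^K_ij
    added to the keys before the scaled dot product."""
    n, d_z = h.shape[0], Wv.shape[1]
    q, key, val = h @ Wq, h @ Wk, h @ Wv
    z = np.zeros((n, d_z))
    for i in range(n):
        scores = np.array([q[i] @ (key[j] + edge_embedding(i, j, is_entity, w_K))
                           for j in range(n)]) / np.sqrt(d_z)
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        z[i] = sum(alpha[j] * (val[j] + edge_embedding(i, j, is_entity, w_V))
                   for j in range(n))
    return z
```

The `w_K` and `w_V` tables each hold 2k + 1 vectors, matching the trained-from-scratch parameters {w^K_{−k}, ..., w^K_{k}} and {w^V_{−k}, ..., w^V_{k}} described above.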

Experiments
We demonstrate the advantage of our method on a popular MRE benchmark, ACE 2005 (Walker et al., 2006), and a more recent MRE benchmark, SemEval 2018 Task 7 (Gábor et al., 2018). We also evaluate on the commonly used SRE benchmark SemEval 2010 Task 8 (Hendrickx et al., 2009), achieving state-of-the-art performance.

Settings
Data For ACE 2005, we adopt the multi-domain setting and split the data following (Gormley et al., 2015): we train on the union of the news domains (nw and bn), tune hyper-parameters on half of the broadcast conversation (bc) domain, and evaluate on the remainder of the broadcast conversation (bc), telephone speech (cts), usenet newsgroups (un), and weblogs (wl) domains. For SemEval 2018 Task 7, we evaluate on its sub-task 1.1 and use the same data split as the shared task. The passages in this task are usually much longer than in ACE, so we adopt the following pre-processing step: for the entity pair in each relation, we assume the tokens relevant to the relation label always lie within a range from the fifth token before the pair to the fifth token after it. Tokens in the original passage that are not covered by the range of ANY input relation are removed from the input.
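This pruning step can be sketched as follows; the span representation (one inclusive token-index range covering both entities of a relation) is our assumption about the data format, not a detail given in the paper:

```python
WINDOW = 5  # tokens kept on each side of an entity pair, as described above

def prune_tokens(n_tokens, relation_spans):
    """Keep only token indices within WINDOW tokens of some relation's
    entity-pair span. relation_spans: list of (start, end) index ranges
    covering both entities of a relation (end inclusive). Returns the
    sorted indices of tokens that survive pruning."""
    keep = set()
    for start, end in relation_spans:
        lo = max(0, start - WINDOW)
        hi = min(n_tokens - 1, end + WINDOW)
        keep.update(range(lo, hi + 1))
    return sorted(keep)
```

Any token outside every relation's window is dropped, which shortens the long SemEval 2018 passages before encoding.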

Methods
We compare our solution with previous works that predict a single relation per pass (Gormley et al., 2015; Nguyen and Grishman, 2015; Fu et al., 2017; Shi et al., 2018), with our model applied in the single-relation-per-pass mode, and with the following naive modifications of BERT that could achieve MRE in one pass.
• BERT SP : BERT with structured prediction only, which includes the proposed improvement in §3.1.
• Entity-Aware BERT SP : our full model, which includes both improvements in §3.1 and §3.2.
• BERT SP with position embeddings on the final attention layer. This is a more straightforward way to achieve MRE in one pass, derived from previous works that use position embeddings (Nguyen and Grishman, 2015; Fu et al., 2017; Shi et al., 2018). In this method, the BERT model encodes the paragraph up to the last attention layer. Then, for each entity pair, it takes the hidden states, adds the relative position embeddings corresponding to the target entities, and finally makes the relation prediction for that pair.
• BERT SP with entity indicators on the input layer: this replaces our entity-aware attention layers and instead adds indicators of entities (transformed to embeddings) to the input embeddings, so that the encoder is aware of entity positions from the input layer onward.

Results on ACE 2005
Main Results Table 1 gives the overall results on ACE 2005. The first observation is that our model architecture achieves much better results than the previous state-of-the-art methods. Although our method was not designed for domain adaptation, it still outperforms those methods with domain adaptation, which further demonstrates its effectiveness. Among all the BERT-based approaches, fine-tuning the off-the-shelf BERT does not give a satisfying result, because the sentence embeddings cannot distinguish different entity pairs. The simpler version of our approach, BERT SP , can successfully adapt the pre-trained BERT to the MRE task, and achieves performance comparable to the prior state-of-the-art methods without domain adaptation.

Note that relative position embeddings do not work for one-pass MRE, since each word corresponds to a varying number of position embedding vectors, and summing these vectors confuses the information. They work for the single-relation-per-pass setting, but the performance lags behind using only indicators of the two target entities.
Our full model, with the structured fine-tuning of attention layers, brings a further improvement of about 5.5% in the MRE one-pass setting, and achieves new state-of-the-art performance even when compared to the methods with domain adaptation. It also beats the other two BERT-based methods in the multi-relation-per-pass setting.
Performance Gap between MRE in One-Pass and Multi-Pass The MRE-in-one-pass models can also be trained and tested with one entity pair per pass (the Single-Relation per Pass results in Table 1). We therefore compare the same methods under the multi-relation and single-relation settings. BERT SP with entity indicators on inputs is expected to perform slightly better in the single-relation setting, because the multi-relation setting mixes information from multiple pairs; a 2% gap is observed as expected. By comparison, our full model has a much smaller performance gap between the two settings (and no consistent performance drop across domains).
BERT SP is not expected to have a gap, as shown in the table. For BERT SP with position embeddings on the final attention layer, we train the model in the single-relation setting and test it under both settings, so the results are the same.
Training and Inference Time Through our experiments, we verify that the full model with MRE is significantly faster than all other methods for both training and inference. Training the full model with MRE is 3.5x faster than with SRE. As for inference speed, the former reaches 126 relations per second, compared to the latter's 23 relations per second. It is also much faster than the second-best-performing approach, BERT SP w/ position embeddings on the final attention layer, which runs at 76 relations per second, since that approach runs the last layer for every entity pair.

Prediction Module Selection
Table 2 evaluates different prediction modules, replacing the linear layer in Eq. (1) with an MLP or a biaffine (Biaff) layer. The results show that the linear predictor gives better results. This is consistent with the motivation of pre-trained encoders: through unsupervised pre-training, the encoders are expected to be sufficiently powerful, so adding more complex layers on top does not improve capacity but leads to more free parameters and a higher risk of over-fitting.
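To make the comparison concrete, the three candidate heads can be sketched as follows (numpy, toy sizes; the function names and hidden size are ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d, l = 6, 4                        # toy embedding size and label count

o_i = rng.standard_normal(d)       # stand-ins for the two mention representations
o_j = rng.standard_normal(d)

def linear_head(o_i, o_j, W):
    """Linear classifier on the concatenation [o_i : o_j], as in Eq. (1)."""
    return np.concatenate([o_i, o_j]) @ W          # W: (2d, l)

def mlp_head(o_i, o_j, W1, W2):
    """One hidden tanh layer before the label projection."""
    return np.tanh(np.concatenate([o_i, o_j]) @ W1) @ W2

def biaffine_head(o_i, o_j, U):
    """Per-label bilinear score o_i^T U_r o_j (cf. Dozat and Manning, 2018)."""
    return np.array([o_i @ U[r] @ o_j for r in range(U.shape[0])])

scores = linear_head(o_i, o_j, rng.standard_normal((2 * d, l)))
```

All three map a mention pair to l label scores; the linear head simply has the fewest free parameters, in line with the over-fitting argument above.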

Results on SemEval 2018 Task 7
The results on SemEval 2018 Task 7 are shown in Table 3. Our Entity-Aware BERT SP gives results comparable to the top-ranked system (Rotsztejn et al., 2018) in the shared task, with slightly lower Macro-F1 (the official metric of the task) and slightly higher Micro-F1. When predicting multiple relations in one pass, we see a 0.9% drop in Macro-F1 but a further 0.8% improvement in Micro-F1. Note that the system of (Rotsztejn et al., 2018) integrates many techniques, such as feature engineering, model combination, pre-training embeddings on in-domain data, and artificial data generation, while our model is almost a direct adaptation of the ACE architecture.
On the other hand, compared to the top single-model result (Luan et al., 2018), which makes use of additional word and entity embeddings pre-trained on in-domain data, our methods demonstrate a clear advantage as a single model.

Additional SRE Results
[Table 3: Averaged F1 (Macro / Micro) on SemEval 2018 Task 7 — Top 3 in the Shared Task (Rotsztejn et al., 2018): 81.7 / 82.8; (Luan et al., 2018): 78.9 / –; (Nooralahzadeh et al., 2018): 76.]

We conduct additional experiments on the relation classification task, SemEval 2010 Task 8, to compare with models developed on this benchmark. From the results in Table 4, our proposed techniques also outperform the state of the art on this single-relation benchmark. On this single-relation task, the out-of-the-box BERT achieves a reasonable result after fine-tuning. Adding the entity-aware attention gives about an 8% improvement, due to the availability of the entity information during encoding. Adding the structured prediction layer to BERT (i.e., BERT SP ) also leads to a similar amount of improvement. However, the gap between the BERT SP method with and without entity-aware attention is small. This is likely because of a bias in the data distribution: the assumption that only two target entities exist makes the two techniques have similar effects.

Conclusion
In summary, we propose a first-of-its-kind solution that extracts multiple relations simultaneously with one-pass encoding of the input paragraph for MRE tasks. With the proposed structured prediction and entity-aware self-attention layers on top of BERT, we achieve new state-of-the-art results with high efficiency on the ACE 2005 benchmark. Our idea of encoding a passage with regard to multiple entities potentially has broader applications beyond relation extraction, e.g., entity-centric passage encoding in question answering (Song et al., 2018a). In future work, we will explore the use of this method in other applications.