A Frustratingly Easy Approach for Joint Entity and Relation Extraction

End-to-end relation extraction aims to identify named entities and extract relations between them simultaneously. Most recent work models these two subtasks jointly, either by unifying them in one structured prediction framework, or by multi-task learning through shared representations. In this work, we describe a very simple approach for joint entity and relation extraction, and establish the new state-of-the-art on standard benchmarks (ACE04, ACE05, and SciERC). Our approach essentially builds on two independent pre-trained encoders and merely uses the entity model to provide input features for the relation model. Through a series of careful examinations, we validate the importance of learning distinct contextual representations for entities and relations, fusing entity information at the input layer of the relation model, and incorporating global context. Finally, we also present an efficient approximation to our approach which requires only one pass of both encoders at inference time, obtaining an 8-16$\times$ speedup with a small accuracy drop.


Introduction
Extracting entities and their relations from unstructured text is a fundamental problem in information extraction. This problem can be decomposed into two subtasks: named entity recognition (Sang and De Meulder, 2003; Ratinov and Roth, 2009) and relation extraction (Zelenko et al., 2002; Bunescu and Mooney, 2005). Early work employed a pipelined approach, training one model to extract entities and another to classify relations between them. More recently, however, end-to-end evaluations have been dominated by systems that model these two tasks jointly (Li and Ji, 2014; Miwa and Bansal, 2016; Katiyar and Cardie, 2017; Zhang et al., 2017a; Li et al., 2019; Luan et al., 2018, 2019; Wadden et al., 2019; Lin et al., 2020; Wang and Lu, 2020). It is commonly thought that joint models can better capture the interactions between entities and relations and help mitigate error propagation issues.
In this work, we review this problem and present a very simple approach which learns two encoders built on top of deep pre-trained language models (Devlin et al., 2019; Beltagy et al., 2019; Lan et al., 2020). The two models, which we refer to as the entity model and the relation model throughout the paper, are trained independently, and the relation model relies on the entity model only to provide its input features. Our entity model builds on span-level representations and our relation model builds on contextual representations specific to a given pair of spans. Despite its simplicity, we find this pipelined approach to be extremely effective: using the same pre-trained encoders, our model outperforms all previous joint models on three standard benchmarks (ACE04, ACE05, SciERC).
To understand why this approach performs so well, we carry out a series of careful analyses. We observe that (1) the contextual representations for the entity and relation models essentially capture distinct information, so sharing their representations hurts performance; (2) it is crucial to fuse the entity information (both boundary and type) at the input layer of the relation model; (3) leveraging cross-sentence information is useful in both tasks; and (4) stronger pre-trained language models can bring further gains. Hence, we expect that this simple model will serve as a very strong baseline and make us rethink the value of joint training in end-to-end relation extraction.
Finally, one possible shortcoming of our approach is that we need to run our relation model once for every pair of entities. To alleviate this issue, we present a novel and efficient alternative by approximating and batching the computations of different groups of entity pairs at inference time. This approximation achieves an 8-16× speedup with only a very small accuracy drop (e.g., a 0.5-0.9% F1 drop on ACE05), which makes our model fast and accurate to use in practice. We summarize our contributions as follows:
• We present a very simple and effective approach for end-to-end relation extraction, which learns two independent encoders for entity recognition and relation extraction. Our model establishes the new state-of-the-art on three standard benchmarks and surpasses all previous joint models using the same pre-trained models.
• We conduct careful analyses to understand why our approach performs so well and how different factors affect final performance. We conclude that it is more effective to learn distinct contextual representations for entities and relations than to learn them jointly.
• To speed up the inference time of our model, we also propose a novel and efficient approximation, which achieves an 8-16× runtime improvement with only a small accuracy drop.

Related Work
Traditionally, extracting relations between entities in text has been studied as two separate tasks: named entity recognition and relation extraction.
In the last several years, there has been a surge of interest in developing models for joint extraction of entities and relations (Li and Ji, 2014; Miwa and Bansal, 2016). We group existing work on end-to-end relation extraction into two main categories: structured prediction and multi-task learning.

Structured prediction
The first category casts the two tasks into one structured prediction framework, although it can be formulated in various ways. Li and Ji (2014) propose an action-based system which identifies new entities as well as links to previous entities; Zhang et al. (2017a) and Wang and Lu (2020) adopt a table-filling approach proposed by Miwa and Sasaki (2014); Katiyar and Cardie (2017) and Zheng et al. (2017) employ sequence tagging-based approaches; Sun et al. (2019) and Fu et al. (2019) propose graph-based approaches to jointly predict entity and relation types; and Li et al. (2019) convert the task into a multi-turn question answering problem. All of these approaches need to tackle a global optimization problem and perform joint decoding at inference time, using beam search or reinforcement learning.

Multi-task learning
This family of models essentially builds two separate models for entity recognition and relation extraction and optimizes them together through parameter sharing. Miwa and Bansal (2016) propose to use a sequence tagging model for entity prediction and a tree-based LSTM model for relation extraction. The two models share one LSTM layer for contextualized word representations, and they find that sharing parameters improves performance for both models. The approach of Bekoulis et al. (2018) is similar except that they model relation classification as a multi-label head selection problem. Note that these approaches still perform pipelined decoding: entities are first extracted and the relation model is applied on the predicted entities.
The closest work to ours is probably DYGIE and DYGIE++ (Luan et al., 2018, 2019; Wadden et al., 2019), which build on recent span-based models for coreference resolution (Lee et al., 2017) and semantic role labeling (He et al., 2018). The key idea is to share span representations between the two tasks with joint optimization. This approach was later improved by incorporating relation and coreference propagation layers to update span representations (Luan et al., 2019) and by replacing LSTM encodings with BERT-based representations (Wadden et al., 2019). A more recent work (Lin et al., 2020) further extends Wadden et al. (2019) by incorporating global features based on cross-subtask and cross-instance constraints. Compared to DYGIE++, our approach is much simpler: we use two independent encoders and do not use beam search or graph propagation layers. We will detail the differences in Section 3.2 and argue why our simpler model performs better.

BERT for relation extraction
Earlier work explored a variety of neural network architectures for relation extraction, including convolutional neural networks (Zeng et al., 2014), recurrent neural networks (Xu et al., 2015; Zhang et al., 2017b), and graph neural networks (Zhang et al., 2018). More recent work uses pre-trained language models (LMs) such as BERT, built on deep Transformer encoders (Shi and Lin, 2019; Soares et al., 2019). We follow this trend and also study the impact of different pre-trained LMs on final performance.

Method
In this section, we first formally define the problem of end-to-end relation extraction in Section 3.1 and then detail our approach in Section 3.2. Finally, we present our approximation solution in Section 3.3, which considerably improves the efficiency of our approach during inference.

Problem Definition
The input of the problem is a sentence $X$ consisting of $n$ tokens $x_1, \ldots, x_n$. Let $S = \{s_1, \ldots, s_m\}$ be all the possible spans in $X$ of up to length $L$, and let $\mathrm{START}(i)$ and $\mathrm{END}(i)$ denote the start and end indices of $s_i$. Optionally, we can incorporate cross-sentence context to help improve predictions, which we will elaborate on in the next section. The problem can be decomposed into two sub-tasks:

Named entity recognition
Let $\mathcal{E}$ denote a set of pre-defined entity types. The named entity recognition task is, for each span $s_i \in S$, to predict an entity type $y_e(s_i) \in \mathcal{E}$, or $y_e(s_i) = \epsilon$, representing that span $s_i$ is not an entity. The output of the task is $Y_e = \{(s_i, e) : s_i \in S, e \in \mathcal{E}\}$.

Relation extraction
Let $\mathcal{R}$ denote a set of pre-defined relation types. The task is, for every pair of spans $s_i \in S, s_j \in S$, to predict a relation type $y_r(s_i, s_j) \in \mathcal{R}$, or that there is no relation between them: $y_r(s_i, s_j) = \epsilon$. The output of the task is $Y_r = \{(s_i, s_j, r) : s_i, s_j \in S, r \in \mathcal{R}\}$.
We aim to build a model which takes $X$ as input and outputs $Y_e$ and $Y_r$ at the same time. During evaluation, $Y_e$ and $Y_r$ are compared against the ground truth $Y_e^*$ and $Y_r^*$, and entity and relation F1 are reported respectively.
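
The candidate span set $S$ can be made concrete with a short sketch. The snippet below enumerates every span of at most $L$ tokens; the function name `enumerate_spans` and the 0-based, inclusive indexing are illustrative assumptions rather than part of the formulation above.

```python
from typing import List, Tuple

def enumerate_spans(n_tokens: int, max_width: int) -> List[Tuple[int, int]]:
    """Return all (start, end) spans, end inclusive, with width <= max_width."""
    spans = []
    for start in range(n_tokens):
        # The span width is end - start + 1, so end stops at start + max_width - 1.
        for end in range(start, min(start + max_width, n_tokens)):
            spans.append((start, end))
    return spans

# A 5-token sentence with spans of up to 2 tokens.
spans = enumerate_spans(n_tokens=5, max_width=2)
print(len(spans), spans)
```

The number of candidates grows as roughly $n \cdot L$, which is why a width limit such as $L$ keeps the entity model tractable.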

Our Model
In the following, we will describe our full model, which consists of an entity model and a relation model. As shown in Figure 1, our entity model first takes the input sentence and predicts an entity type (or $\epsilon$) for each single span. We then process every pair of candidate entities independently in the relation model by inserting extra marker tokens to highlight the subject and object and their types. We will detail each component below. At the end of this section, we will also summarize the main differences between our approach and DYGIE++ (Wadden et al., 2019), which is the closest work to ours and serves as a strong baseline in the literature.

Entity model
Our entity model is a standard span-based model following prior work (Lee et al., 2017; Luan et al., 2018, 2019; Wadden et al., 2019). We first use a pre-trained language model (e.g., BERT) to obtain contextualized representations $\mathbf{x}_t$ for each input token $x_t$. Given a span $s_i \in S$, the span representation $\mathbf{h}_e(s_i)$ is defined as:

$$\mathbf{h}_e(s_i) = [\mathbf{x}_{\mathrm{START}(i)}; \mathbf{x}_{\mathrm{END}(i)}; \phi(s_i)],$$

where $\phi(s_i) \in \mathbb{R}^{d_W}$ represents the learned embeddings of span width features. The span representation $\mathbf{h}_e(s_i)$ is then used to predict entity types $e \in \mathcal{E} \cup \{\epsilon\}$:

$$P_e(e \mid s_i) = \mathrm{softmax}(\mathbf{W}_e \, \mathrm{FFNN}(\mathbf{h}_e(s_i))).$$

We follow Wadden et al. (2019) and use a 2-layer feedforward neural network with ReLU activations.
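
To illustrate how the span representation and the entity classifier fit together, here is a minimal pure-Python sketch. It assumes the token encodings are already available as plain lists of floats; the helper names (`span_representation`, `entity_probs`) and the omission of bias terms are simplifications for illustration, not the released implementation.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]

def span_representation(token_vecs: List[Vector], start: int, end: int,
                        width_emb: Dict[int, Vector]) -> Vector:
    """Concatenate [x_START; x_END; phi(width)] for an inclusive span."""
    width = end - start + 1
    return token_vecs[start] + token_vecs[end] + width_emb[width]

def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]

def softmax(scores: Vector) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entity_probs(h: Vector, W1: Matrix, W2: Matrix, W_out: Matrix) -> Vector:
    """2-layer ReLU FFNN, then a linear layer and softmax (biases omitted)."""
    hidden = relu(matvec(W2, relu(matvec(W1, h))))
    return softmax(matvec(W_out, hidden))

# Toy example: 3 tokens with 2-dim encodings and 1-dim width embeddings.
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
width_emb = {1: [0.1], 2: [0.2], 3: [0.3]}
h = span_representation(token_vecs, 0, 1, width_emb)
print(h)
```

The output layer has one row per label in $\mathcal{E} \cup \{\epsilon\}$, so the null label is predicted like any other class.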

Relation model
The relation model takes a pair of spans $s_i, s_j$ (a subject and an object) as input and predicts a relation type, or $\epsilon$, between the two spans. Previous approaches (Luan et al., 2018, 2019; Wadden et al., 2019) re-use span representations $\mathbf{h}_e(s_i), \mathbf{h}_e(s_j)$ to predict their relation. We observe that these representations only capture contextual information around each individual entity and might fail to capture the dependencies between a specific pair of spans. We also hypothesize that sharing the contextual representations for different pairs of spans may be suboptimal. For example, the words "is a" in Figure 1 are crucial in classifying the relationship between MORPA and PARSER but not for MORPA and TEXT-TO-SPEECH.
Our relation model instead processes each pair of spans independently and inserts typed markers at the input layer to highlight the subject and object and their types. Specifically, given an input sentence $X$ and a pair of spans $s_i, s_j$ with types $e_i, e_j \in \mathcal{E} \cup \{\epsilon\}$ respectively, we define text markers as $\langle S{:}e_i \rangle$, $\langle /S{:}e_i \rangle$, $\langle O{:}e_j \rangle$, and $\langle /O{:}e_j \rangle$, and insert them into the input sentence before and after the subject and object spans (Figure 1 (b)). Let $\widehat{X}$ denote this modified sequence with text markers inserted:

$$\widehat{X} = \ldots \langle S{:}e_i \rangle, x_{\mathrm{START}(i)}, \ldots, x_{\mathrm{END}(i)}, \langle /S{:}e_i \rangle, \ldots \langle O{:}e_j \rangle, x_{\mathrm{START}(j)}, \ldots, x_{\mathrm{END}(j)}, \langle /O{:}e_j \rangle, \ldots$$
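
The marker insertion can be sketched in a few lines of Python. This is a hedged illustration: the function name `insert_typed_markers`, the bracketed marker strings (in the style of the Figure 1 caption), the toy sentence, and the assumption that the two spans do not overlap are all choices made here for clarity.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 0-based and inclusive (an assumption)

def insert_typed_markers(tokens: List[str], subj: Span, subj_type: str,
                         obj: Span, obj_type: str) -> List[str]:
    """Wrap the subject and object spans with typed start/end markers."""
    s_start, s_end = subj
    o_start, o_end = obj
    out: List[str] = []
    for idx, tok in enumerate(tokens):
        # Opening markers go immediately before the first token of a span.
        if idx == s_start:
            out.append(f"[S:{subj_type}]")
        if idx == o_start:
            out.append(f"[O:{obj_type}]")
        out.append(tok)
        # Closing markers go immediately after the last token of a span.
        if idx == s_end:
            out.append(f"[/S:{subj_type}]")
        if idx == o_end:
            out.append(f"[/O:{obj_type}]")
    return out

tokens = ["MORPA", "is", "a", "parser"]
marked = insert_typed_markers(tokens, (0, 0), "Method", (3, 3), "Method")
print(marked)
```

Because the markers depend on which pair is being classified, every candidate pair yields a different input sequence, which is the source of the per-pair encoding cost discussed in Section 3.3.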
We then apply another pre-trained encoder on $\widehat{X}$ and denote the output representations by $\hat{\mathbf{x}}_t$. We concatenate the output representations of the two start positions and obtain the span pair representation:

$$\mathbf{h}_r(s_i, s_j) = [\hat{\mathbf{x}}_{\widehat{\mathrm{START}}(i)}; \hat{\mathbf{x}}_{\widehat{\mathrm{START}}(j)}],$$

where $\widehat{\mathrm{START}}(i)$ and $\widehat{\mathrm{START}}(j)$ are the indices of $\langle S{:}e_i \rangle$ and $\langle O{:}e_j \rangle$ in $\widehat{X}$. Finally, the representation $\mathbf{h}_r(s_i, s_j)$ is used to predict the relation type $r \in \mathcal{R} \cup \{\epsilon\}$:

$$P_r(r \mid s_i, s_j) = \mathrm{softmax}(\mathbf{W}_r \mathbf{h}_r(s_i, s_j)).$$

This idea of using additional markers to highlight the subject and object is not entirely new, as it has been studied recently in relation classification tasks (Zhang et al., 2019; Soares et al., 2019). However, most relation classification tasks (e.g., TACRED (Zhang et al., 2017b)) only focus on a given pair of subject and object in an input sentence, and the effectiveness of markers has not been evaluated in the end-to-end setting, in which we need to classify the relationships between multiple entity mentions. We observed a large improvement in our experiments (Section 5.1), and this strengthens our hypothesis that modeling the relationships between different entity pairs in one sentence requires different contextual representations. Furthermore, Zhang et al. (2019) and Soares et al. (2019) only considered untyped markers (e.g., $\langle S \rangle$, $\langle /S \rangle$), and previous end-to-end models (e.g., Wadden et al., 2019) inject the entity type information into the relation model through auxiliary losses. Typed entity markers have not been explored before. We find that injecting type information at the input layer is very helpful in distinguishing entity types (for example, whether "Disney" refers to a person or an organization) before trying to understand the relations between them.

Cross-sentence context
Cross-sentence information can be used to help predict entity types and relations, especially for pronominal mentions. Luan et al. (2019) and Wadden et al. (2019) employ a propagation mechanism through coreference and relation links to incorporate cross-sentence context. Wadden et al. (2019) also add a 3-sentence context window, which is shown to improve performance. We also evaluate the importance of leveraging cross-sentence context in end-to-end relation extraction. As we expect pre-trained language models to be able to capture long-range dependencies already, we simply incorporate cross-sentence context by extending the sentence to a fixed window size $W$ for both the entity and relation model. Specifically, given an input sentence with $n$ words, we augment the input with $(W - n)/2$ words from the left context and from the right context respectively ($W = 100$ in our default model).
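
The context extension is simple enough to sketch directly. The following helper is an assumption-laden illustration: the name `add_context`, the exclusive end index, giving the odd leftover word to the right side, and truncating at document boundaries are choices made here, since the text does not specify them.

```python
from typing import List, Tuple

def add_context(doc_tokens: List[str], sent_start: int, sent_end: int,
                window: int) -> Tuple[List[str], int]:
    """Extend the sentence doc_tokens[sent_start:sent_end] to about `window` tokens.

    Returns the extended tokens and the offset of the sentence inside them.
    """
    n = sent_end - sent_start
    extra = max(window - n, 0)
    left = extra // 2          # (W - n) / 2 words on the left
    right = extra - left       # the remainder on the right
    ctx_start = max(sent_start - left, 0)            # truncate at document start
    ctx_end = min(sent_end + right, len(doc_tokens))  # truncate at document end
    return doc_tokens[ctx_start:ctx_end], sent_start - ctx_start

doc = [f"w{i}" for i in range(20)]
tokens, offset = add_context(doc, sent_start=8, sent_end=12, window=10)
print(tokens, offset)
```

The returned offset is needed to map span indices of the original sentence into the extended input.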

Training & inference
We use a separate pre-trained language model for the entity model and for the relation model, and fine-tune each of them using a task-specific loss. We use cross-entropy loss for both models:

$$\mathcal{L}_e = -\sum_{s_i \in S} \log P_e(e_i^* \mid s_i),$$

$$\mathcal{L}_r = -\sum_{s_i, s_j \in S_G, \, s_i \neq s_j} \log P_r(r_{i,j}^* \mid s_i, s_j).$$

During inference, we first predict the entities by taking $y_e(s_i) = \arg\max_{e \in \mathcal{E} \cup \{\epsilon\}} P_e(e \mid s_i)$. Denoting $S_{\mathrm{pred}} = \{s_i : y_e(s_i) \neq \epsilon\}$, we enumerate all the spans $s_i, s_j \in S_{\mathrm{pred}}$ and use $y_e(s_i), y_e(s_j)$ as the inputs for the relation model $P_r(r \mid s_i, s_j)$.

Unlike DYGIE++, we do not use beam search or graph propagation layers. As a result, our model is much simpler. Moreover, we will show that it also achieves large gains on all the benchmarks, using the same pre-trained encoders.
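
The pipelined inference procedure can be summarized with a short sketch. Here `predict_entity` and `predict_relation` are hypothetical callables standing in for the two fine-tuned models, the string `"NULL"` stands in for $\epsilon$, and skipping pairs with $s_i = s_j$ is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]

def pipeline_inference(
    spans: List[Span],
    predict_entity: Callable[[Span], str],
    predict_relation: Callable[[Span, str, Span, str], str],
    null: str = "NULL",
) -> Tuple[Dict[Span, str], List[Tuple[Span, Span, str]]]:
    """Run the entity model first, then the relation model on predicted entities."""
    # Step 1: keep only spans whose predicted type is not the null label.
    entities = {}
    for span in spans:
        label = predict_entity(span)
        if label != null:
            entities[span] = label
    # Step 2: classify every ordered (subject, object) pair of predicted entities,
    # passing the predicted types as input features of the relation model.
    relations = []
    for subj, subj_type in entities.items():
        for obj, obj_type in entities.items():
            if subj == obj:
                continue
            label = predict_relation(subj, subj_type, obj, obj_type)
            if label != null:
                relations.append((subj, obj, label))
    return entities, relations

# Toy stand-ins for the two models.
toy_entities = {(0, 0): "Method", (3, 3): "Method"}
ents, rels = pipeline_inference(
    spans=[(0, 0), (1, 1), (3, 3)],
    predict_entity=lambda s: toy_entities.get(s, "NULL"),
    predict_relation=lambda s, st, o, ot:
        "HYPONYM-OF" if (s, o) == ((0, 0), (3, 3)) else "NULL",
)
print(ents, rels)
```

Note that the relation model is called once per ordered pair, which motivates the batched approximation of Section 3.3.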

Efficient Batch Computations
Despite the simplicity and effectiveness of our approach (which we will demonstrate in our experiments), one possible shortcoming is that we need to run our relation model once for every pair of entities. To alleviate this issue, we propose a novel and efficient alternative for our relation model. The key problem is that we would like to re-use computations for different span pairs in the same sentence. This is impossible in our original model because we must insert special entity markers for each pair of spans independently. Thus we propose an approximation model by making two major changes to the original relation model. First, instead of directly inserting entity markers into the original sentence, we tie the position embeddings of the markers with the start and end tokens of the corresponding span:

$$\mathrm{POS}(\langle S{:}e_i \rangle), \mathrm{POS}(\langle /S{:}e_i \rangle) := \mathrm{POS}(x_{\mathrm{START}(i)}), \mathrm{POS}(x_{\mathrm{END}(i)})$$

$$\mathrm{POS}(\langle O{:}e_j \rangle), \mathrm{POS}(\langle /O{:}e_j \rangle) := \mathrm{POS}(x_{\mathrm{START}(j)}), \mathrm{POS}(x_{\mathrm{END}(j)})$$

where $\mathrm{POS}(\cdot)$ denotes the position id of a token. In the example shown in Figure 1, if we want to classify the relationship between MORPA and PARSER, the first entity marker $\langle S{:}\mathrm{METHOD} \rangle$ will share the position embedding with the token MOR. By doing this, the position embeddings of the original tokens are not changed.
Second, we add a constraint to the attention layers: we enforce the text tokens to only attend to text tokens and not to the marker tokens, while an entity marker token can attend to all the text tokens and all the 4 marker tokens associated with the same span pair. These two modifications allow us to re-use the computations of all text tokens, because text tokens are independent of the entity marker tokens. Thus, we can batch multiple pairs of spans from the same sentence in one run of the relation model. In practice, we add all marker tokens to the end of the sentence to form an input that batches a set of span pairs (Figure 1 (c)). This leads to a large speedup at inference time and only a small drop in performance (Section 4.3).
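
The two modifications amount to building custom position ids and a custom attention mask for the batched input. The sketch below is one possible construction under stated assumptions: markers are appended in the order subject-start, subject-end, object-start, object-end; `mask[q][k] == 1` means query position `q` may attend to key position `k`; and the function name `build_batched_inputs` is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 0-based and inclusive

def build_batched_inputs(
    n_text: int, pairs: List[Tuple[Span, Span]]
) -> Tuple[List[int], List[List[int]]]:
    """Return position ids and an attention mask for text tokens plus 4 markers per pair."""
    total = n_text + 4 * len(pairs)
    # Text tokens keep their own positions; each marker reuses the position
    # of the span boundary token it is tied to.
    position_ids = list(range(n_text))
    for (s_start, s_end), (o_start, o_end) in pairs:
        position_ids += [s_start, s_end, o_start, o_end]
    mask = [[0] * total for _ in range(total)]
    # Every token (text or marker) may attend to all text tokens.
    for q in range(total):
        for k in range(n_text):
            mask[q][k] = 1
    # A marker may additionally attend to the 4 markers of its own pair only.
    for p in range(len(pairs)):
        lo = n_text + 4 * p
        for q in range(lo, lo + 4):
            for k in range(lo, lo + 4):
                mask[q][k] = 1
    return position_ids, mask

pairs = [((0, 0), (3, 3)), ((3, 3), (0, 0))]
position_ids, mask = build_batched_inputs(n_text=4, pairs=pairs)
print(position_ids)
```

Since no text row of the mask ever attends to a marker column, the text-token representations are identical for every pair, which is what allows them to be computed once and shared.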

Experimental Setup
Datasets
We evaluate our approach on three end-to-end relation extraction datasets: ACE04, ACE05, and SciERC (Luan et al., 2018).

Implementation details
For the entity model, we follow Wadden et al. (2019) and set the width embedding size as $d_W = 150$ and use a 2-layer FFNN with 150 hidden units. For our approximation model (Section 4.3), we batch candidate pairs by adding 4 markers for each pair to the end of the sentence, until the total number of tokens exceeds 250. We use a context window size of $W = 100$ in our default setting using cross-sentence context, and we will study the effect of different context sizes in Section 5.4. We consider spans up to $L = 8$ words. For all the experiments, we report the averaged F1 scores of 5 runs. We implement our models based on HuggingFace's Transformers library (Wolf et al., 2019). We use bert-base-uncased (Devlin et al., 2019) and albert-xxlarge-v1 (Lan et al., 2020) as the base encoders for ACE04 and ACE05, for a fair comparison with previous work and an investigation of the impact of small vs large pre-trained models. We also use scibert-scivocab-uncased (Beltagy et al., 2019) as the base encoder for SciERC, as this in-domain pre-trained model is shown to be more effective than BERT (Wadden et al., 2019). We train our models with the Adam optimizer and a linear scheduler with a warmup ratio of 0.1. For all the experiments, we train the entity model for 100 epochs with a learning rate of 1e-5 for the weights in pre-trained LMs and 5e-4 for the others, and a batch size of 16. We train the relation model for 10 epochs with a learning rate of 2e-5 and a batch size of 32.
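
The batching rule for the approximation model can be sketched as a greedy grouping under a token budget. This is one conservative reading of the description: a new batch is started whenever adding another pair would push the input past the budget. The function name `batch_pairs` and its parameters are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def batch_pairs(n_text: int, pairs: Sequence[T], max_tokens: int = 250,
                markers_per_pair: int = 4) -> List[List[T]]:
    """Greedily group span pairs so text tokens plus markers fit the budget."""
    batches: List[List[T]] = []
    current: List[T] = []
    for pair in pairs:
        # Token count if this pair were added to the current batch.
        needed = n_text + markers_per_pair * (len(current) + 1)
        if current and needed > max_tokens:
            batches.append(current)
            current = []
        current.append(pair)
    if current:
        batches.append(current)
    return batches

# 100 text tokens leave room for 37 pairs (148 marker tokens) per batch.
batches = batch_pairs(n_text=100, pairs=list(range(40)))
print([len(b) for b in batches])
```

With a window of 100 tokens, most sentences need only one or two encoder passes in this scheme, instead of one pass per pair.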

Main Results
Table 2 compares our approach to all the previous results. We report the F1 scores in both single-sentence (no cross-sentence context) and cross-sentence (a context window size of W = 100) settings for a fair comparison with previous work. As is shown, our single-sentence models achieve strong performance, and incorporating cross-sentence context further improves the results consistently. Our BERT-base (or SciBERT) models achieve similar or better results compared to all the previous work, including models built on top of larger pre-trained LMs, and the performance is further improved by using a larger encoder, i.e., ALBERT.
For entity recognition, our best model achieves an absolute F1 improvement of +1.4%, +1.7%, +0.7% on ACE05, ACE04, and SciERC respectively. This shows that cross-sentence information is useful for the entity model and that pre-trained Transformer encoders are able to capture long-range dependencies from a large context. For relation extraction, our approach outperforms the best previous methods by an absolute F1 of +2.6%, +2.8%, +1.7% on ACE05, ACE04, and SciERC respectively. We also obtained a 4.3% higher relation F1 on ACE05 compared to DYGIE++ (Wadden et al., 2019) using the same BERT-base pre-trained model. All these large improvements demonstrate the effectiveness of learning distinct representations for entities and relations of different entity pairs, as well as fusing entity information at the input layer of the relation model.
We also noticed that compared to the previous state-of-the-art model (Wang and Lu, 2020) based on ALBERT, our model achieves a similar entity F1 (89.5 vs 89.7) but a substantially better relation F1 (67.6 vs 69.0) without using cross-sentence context. This clearly demonstrates the superiority of our relation model.

Table 2: Test F1 scores on ACE04, ACE05, and SciERC. We evaluate our approach in two settings: single-sentence and cross-sentence, depending on whether cross-sentence context is used or not. ♣: These models leverage cross-sentence information. †: These models are trained with additional data (e.g., coreference). The Encoder column denotes the base encoder each model used: L = LSTM, L+E = LSTM + ELMo, Bb = BERT-base, Bl = BERT-large, SciB = SciBERT (size as BERT-base), ALB = ALBERT-xxlarge-v1. Rel denotes the boundaries evaluation (the entity boundaries must be correct) and Rel+ denotes the strict evaluation (both the entity boundaries and entity type must be correct).

Table 3: We compare our full relation model and the approximation model (Approx) for both accuracy (relation F1 on the test set) and efficiency. We evaluate our BERT-base for ACE05 and SciBERT for SciERC for both single-sentence and cross-sentence (W = 100) settings. The speed is measured on a single NVIDIA GeForce 2080 Ti GPU with a batch size of 32.
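
The difference between the Rel and Rel+ criteria in the Table 2 caption can be made concrete with a small scoring sketch. The tuple layout, the function names, and the label strings below are illustrative assumptions; the sketch computes F1 over sets of predicted and gold relations and is not the official scorer.

```python
from typing import Set, Tuple

Span = Tuple[int, int]
# (subject span, subject type, object span, object type, relation label)
Relation = Tuple[Span, str, Span, str, str]

def f1_score(pred: set, gold: set) -> float:
    """F1 between a set of predictions and a set of gold items."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def boundaries_only(relations: Set[Relation]) -> set:
    """Drop the entity types, keeping span boundaries and the relation label."""
    return {(subj, obj, label) for (subj, _, obj, _, label) in relations}

def rel_f1(pred: Set[Relation], gold: Set[Relation]) -> float:
    """Rel: entity boundaries and relation label must match."""
    return f1_score(boundaries_only(pred), boundaries_only(gold))

def rel_plus_f1(pred: Set[Relation], gold: Set[Relation]) -> float:
    """Rel+: entity boundaries, entity types, and relation label must match."""
    return f1_score(pred, gold)

gold = {((0, 0), "Method", (3, 3), "Method", "HYPONYM-OF")}
pred = {((0, 0), "Task", (3, 3), "Method", "HYPONYM-OF")}
print(rel_f1(pred, gold), rel_plus_f1(pred, gold))
```

In this toy case the prediction has the right boundaries and relation but a wrong subject type, so it counts under Rel and not under Rel+.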

Batch Computations and Speedup
In Section 3.3, we proposed an efficient approximation solution for the relation model, which enables us to re-use the computations of text tokens and batch multiple span pairs in one input sentence. We evaluate this approximation model on ACE05 and SciERC. Table 3 shows the relation F1 scores and the inference speed of the full relation model and the approximation model. On both datasets, our approximation model significantly improves the efficiency of the inference process. For example, in the single-sentence setting, we obtain an 11.9× speedup on ACE05 and an 8.7× speedup on SciERC. By re-using a large part of the computations, we are able to make predictions on the full ACE05 test set (2k sentences) in less than 10 seconds on a single NVIDIA GeForce 2080 Ti GPU. On the other hand, this approximation only brings a small performance drop: compared to the full model, the F1 score drops 0.5% and 1.0% on ACE05 and SciERC respectively in the single-sentence setting. Considering the accuracy and efficiency of this approximation model, we expect it to be very effective to use in practice.

Analysis
Despite its simple design and training paradigm, we have shown that our approach outperforms all previous joint models. In this section, we aim to take a deeper look and understand why this model performs so well and what contributes to its final performance.

Importance of Typed Text Markers
Our first argument is that it is important to build different contextual representations for different pairs of spans, and that an early fusion of entity type information can further improve performance. To validate the importance of typed text markers, we experiment with the following variants on both ACE05 and SciERC when the gold entities are given:
• TEXT: We use the span representations defined in the entity model (Section 3.2) and concatenate the hidden representations for the subject and the object, as well as their element-wise multiplication: $[\mathbf{h}_e(s_i), \mathbf{h}_e(s_j), \mathbf{h}_e(s_i) \odot \mathbf{h}_e(s_j)]$. This is similar to the relation model in Luan et al. (2018, 2019).
• TEXTETYPE: In addition to TEXT, we concatenate the span-pair representations with entity type embeddings $\psi(e_i), \psi(e_j) \in \mathbb{R}^{d_E}$ ($d_E = 150$).
• MARKERS: We use untyped markers ($\langle S \rangle$, $\langle /S \rangle$, $\langle O \rangle$, $\langle /O \rangle$) at the input layer and concatenate the representations of the two spans' starting points.
• MARKERSETYPE: In addition to MARKERS, we concatenate the span-pair representations with entity type embeddings $\psi(e_i), \psi(e_j) \in \mathbb{R}^{d_E}$ ($d_E = 150$).
• MARKERSELOSS: We also consider a variant which uses untyped markers but adds another FFNN to predict the entity types of the subject and object through auxiliary losses. This is similar to how the entity information is used in multi-task learning (Luan et al., 2019; Wadden et al., 2019).
• TYPEDMARKERS: This is our final model described in Section 3.2. We use typed markers at the input layer.
Table 4 shows the performance of all the variants, and it clearly indicates that different input representations make a real difference in the relation accuracy. Compared to TEXT, TYPEDMARKERS improves the F1 scores substantially, by +5.5% and +7.4% absolute. All the variants that use marker tokens are significantly better than the standard text representations, which suggests the importance of learning different representations with respect to different subject-object pairs. Finally, entity type is useful in improving the relation performance, and an early fusion of entity information is particularly effective (TYPEDMARKERS vs MARKERSETYPE and MARKERSELOSS). We also find that MARKERSETYPE performs even better than MARKERSELOSS, which suggests that using entity types directly as features is better than using them to provide training signals through auxiliary losses.

Shared encoders?    No      Yes
Entity F1           88.8    87.7
Relation F1         64.8    64.4

Modeling Interactions between Entities and Relations
One main argument for joint models is that modeling the interactions between the two tasks allows them to benefit each other. In this section, we aim to validate whether this is the case in our approach. We first study whether sharing the two representation encoders can improve the performance or not. We train the entity and relation models together by jointly optimizing $\mathcal{L}_e + \mathcal{L}_r$. As shown in Table 5, we find that simply sharing the encoders hurts both the entity and relation F1. We think this is because the two tasks have different input formats and require different features for predicting entity types and relations; thus, using separate encoders indeed learns better task-specific features.
In the previous section, we have already shown that the entity information is useful in the relation model (whether as entity embeddings, an auxiliary loss, or input features) and that the best way to use it is through typed markers. Next, we aim to investigate whether the relation information can improve the entity performance. To do so, we add an auxiliary loss to our entity model, which concatenates the two span representations as well as their element-wise multiplication (see the TEXT variant in Section 5.1) and predicts the relation type between the two spans ($r \in \mathcal{R}$ or $\epsilon$). Through joint training with this auxiliary relation loss, we observe a negligible improvement (< 0.1%) on averaged entity F1 over 5 runs on the ACE05 development set. Hence, we conclude that relation information does not improve the entity model substantially.
To summarize our findings, (1) entity information is clearly helpful in predicting relations. However, we don't find enough evidence in our experiments that relation information can improve the entity performance substantially. (2) Simply sharing the encoders does not provide benefits to our approach.

For comparison with prior findings on (1), Miwa and Bansal (2016) observed a slight improvement on entity F1 by sharing the parameters (80.8 → 81.8 F1) on the ACE05 development data. Wadden et al. (2019) observed that their relation propagation layers improved the entity F1 slightly on SciERC but hurt performance on ACE05.

Mitigating Error Propagation
One well-known drawback of pipeline training is the error propagation issue. In our final model, we use gold entities (and their types) to train the relation model and the predicted entities during inference, and this may lead to a discrepancy between training and testing. In the following, we describe several attempts we made to address this issue.
We first study whether using predicted entities, instead of gold entities, during training can mitigate this issue. We adopt a 10-way jackknifing method, which is a standard technique in many NLP tasks such as dependency parsing (Agić and Schluter, 2017). Specifically, we divide the data into 10 folds and predict the entities in the k-th fold using an entity model trained on the remainder. As shown in Table 6, we find, surprisingly, that the jackknifing strategy hurts the final relation performance. We hypothesize that this is because it introduces additional noise during training.
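
The jackknifing procedure can be sketched as follows. The helper `train_fn` is a hypothetical stand-in that trains an entity model on a list of examples and returns a prediction function; the round-robin fold assignment and the requirement that examples be hashable are assumptions of this sketch.

```python
from typing import Callable, Dict, Hashable, List, Sequence

def jackknife_predictions(
    examples: Sequence[Hashable],
    train_fn: Callable[[List[Hashable]], Callable[[Hashable], object]],
    n_folds: int = 10,
) -> Dict[Hashable, object]:
    """Predict each example with a model trained on the other folds only."""
    # Round-robin assignment of examples to folds.
    folds = [list(examples[k::n_folds]) for k in range(n_folds)]
    predictions: Dict[Hashable, object] = {}
    for k, held_out in enumerate(folds):
        # Train on every fold except the k-th one.
        train = [ex for j, fold in enumerate(folds) if j != k for ex in fold]
        model = train_fn(train)
        for ex in held_out:
            predictions[ex] = model(ex)
    return predictions

# Toy check: the "model" reports whether it saw the example during training.
examples = list(range(25))
preds = jackknife_predictions(examples, lambda train: (lambda ex: ex in train))
print(all(seen is False for seen in preds.values()))
```

The resulting predictions imitate test-time entity errors on the training data, which is what the relation model is then trained on in this experiment.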
Second, we consider using more pairs of spans for the relation model at both training and testing time. The main reason is that in the current pipeline approach, if a gold entity is missed by the entity model during inference, the relation model will not be able to predict any relations associated with that entity. Following the beam search strategy used in previous work (Luan et al., 2019; Wadden et al., 2019), we consider using the top $\lambda n$ spans scored by the entity model ($\lambda = 0.4$ and $n$ is the sentence length). We explored several different strategies for encoding the top-scoring spans for the relation model: (1) typed markers: the same as our main model except that we now also have markers such as $\langle S{:}\epsilon \rangle$ and $\langle /S{:}\epsilon \rangle$ as input tokens; (2) untyped markers: in this case, the relation model is unaware of whether a span is an entity or not; (3) untyped markers trained with an auxiliary entity loss ($e \in \mathcal{E}$ or $\epsilon$). As Table 6 shows, none of these changes led to significant improvements, and using untyped markers is especially bad because the relation model struggles to identify whether a span is an entity or not.
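
Selecting the top-scoring spans can be sketched in a few lines. The function name `top_spans` and the rounding of $\lambda n$ down to an integer are assumptions, since the exact rounding rule is not stated here.

```python
from typing import List, Tuple

Span = Tuple[int, int]

def top_spans(scored_spans: List[Tuple[Span, float]], n_tokens: int,
              lam: float = 0.4) -> List[Span]:
    """Keep the int(lam * n_tokens) highest-scoring spans."""
    k = int(lam * n_tokens)  # rounding down is an assumption of this sketch
    ranked = sorted(scored_spans, key=lambda item: item[1], reverse=True)
    return [span for span, _ in ranked[:k]]

# Ten single-token spans whose score equals their position.
scored = [((i, i), float(i)) for i in range(10)]
print(top_spans(scored, n_tokens=10))
```

Unlike the main pipeline, this keeps spans that the entity model would label as non-entities, which is why the typed-marker variant needs a marker for the null type.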
In sum, we do not find that any of these attempts improved performance significantly, and our simple pipeline training turns out to be a surprisingly effective strategy. We do not argue that the error propagation issue does not exist or cannot be solved; rather, better solutions to address it remain to be explored.

Effect of Cross-sentence Context
In Table 2, we demonstrated the improvements from using cross-sentence context on both the entity and relation performance. Finally, we explore the effect of different context sizes W in Figure 2. We find that using cross-sentence context clearly improves both entity and relation F1. However, the results don't further increase from W = 100 to W = 300. In our final models, we use W = 100 for both the entity model and relation model.

Conclusion
In this paper, we present a very simple and effective approach for end-to-end relation extraction. Our model learns two encoders for entity recognition and relation extraction independently, and our experiments show that it considerably outperforms the previous state of the art on three standard benchmarks. We conduct extensive analysis to understand the superior performance of our approach, and we validate the importance of learning distinct contextual representations for entities and relations and of using entity information as input features for the relation model. We also propose an efficient approximation, obtaining a large speedup at inference time with a small accuracy drop. We hope that this simple model will serve as a very strong baseline and make us rethink the value of joint training in end-to-end relation extraction.

Figure 1: An example from the SciERC dataset (Luan et al., 2018), where a system is expected to identify that MORPA and PARSER are entities of type METHOD and TEXT-TO-SPEECH is a TASK, as well as that MORPA is a hyponym of PARSER and MORPA is used for TEXT-TO-SPEECH. Our entity model (a) predicts all the entities at once and our relation model (b) considers every pair of entities independently by inserting typed entity markers (e.g., [S:MD] = the subject is a METHOD, [O:TK] = the object is a TASK). We also propose an approximation relation model (c) which supports batch computations. The tokens of the same color in (c) share the positional embeddings. See text for more details.
Our model differs from DYGIE++ in the following ways: (1) We use separate encoders for the entity and relation model, without any multi-task learning; the predicted entity labels are used directly as the input features of the relation model. (2) The contextual representations in the relation model are specific to each pair of spans. (3) We incorporate cross-sentence information by extending the input with additional context.

Figure 2: Effect of different context window sizes, measured on the ACE05 development set with the BERT-base model. We use the same entity model (an entity model with W = 100) to report the relation F1 scores.
In the training losses $\mathcal{L}_e$ and $\mathcal{L}_r$ (Section 3.2), $e_i^*$ represents the gold entity type of $s_i$ and $r_{i,j}^*$ represents the gold relation type of span pair $s_i, s_j$ in the training data. For training the relation model, we only consider the gold entities $S_G \subset S$ in the training set and use the gold entity labels as the input of the relation model. We considered training on all spans $S$ (with pruning) as well as on predicted entity types, but neither led to meaningful improvements compared to this simple pipeline training (see more details in Section 5.3).

Table 4: Relation F1 scores on the development set of ACE05 and SciERC with different input features (the gold entities are given). The results are obtained using BERT-base for ACE05 and SciBERT for SciERC, without cross-sentence context.

Table 5: We compare sharing and not sharing the entity and relation encoders on the ACE05 development set. This result is obtained from BERT-base models with cross-sentence context.