Span-based Joint Entity and Relation Extraction with Attention-based Span-specific and Contextual Semantic Representations

Span-based joint extraction models have shown their effectiveness in entity recognition and relation extraction. These models regard text spans as candidate entities and span tuples as candidate relation tuples. Span semantic representations are shared in both entity recognition and relation extraction, yet existing models cannot fully capture the semantics of these candidate entities and relations. To address these problems, we introduce a span-based joint extraction framework with attention-based semantic representations. Specifically, attention mechanisms are utilized to calculate semantic representations, including span-specific and contextual ones. We further investigate the effects of four attention variants in generating contextual semantic representations. Experiments show that our model outperforms previous systems and achieves state-of-the-art results on ACE2005, CoNLL2004 and ADE.


Introduction
This paper considers intra-sentence joint entity and relation extraction. Because the joint extraction mode can alleviate cascading errors and promote information utilization compared to the pipelined one, it has drawn much attention. Typically, the joint extraction task is solved by sequence tagging based methods (Chi et al., 2019).
Rather than using sequence tagging based methods, recent works attempt to solve the task in a span-based joint extraction mode (Dixit and Al-Onaizan, 2019). Typically, this mode first processes the sentence text into text spans, which are span-based candidate entities ("spans" for short); then calculates span semantic representations and performs span classification on them to obtain predicted entities; next, forms span-based candidate relation ("relation" for short) tuples from spans and calculates relation semantic representations from the corresponding span semantic representations; at last, performs relation classification on relation semantic representations and obtains predicted relation triples. This mode further improves joint extraction performance, but suffers from three problems.
First, different tokens in a span should contribute differently to the span representation, which we call span-specific features. But existing methods treat each span token as equally important (Eberts and Ulges, 2019) or consider only the span head and tail tokens (Dixit and Al-Onaizan, 2019), ignoring these significant features. Take the span "a Palestinian youth" in sentence 1 of Figure 1 as an example: "youth" should contribute much more to the span representation than "a" and "Palestinian" when classifying the span as "PER". Second, the local contextual information of relation tuples is omitted (Luan et al., 2018; Dixit and Al-Onaizan, 2019) or calculated only by max pooling (Eberts and Ulges, 2019) when performing relation classification, which does not sufficiently capture the information contained in it, whereas the local context may contain crucial cues for predicting the relations that relation tuples hold. A case study is shown in sentence 2 of Figure 1: "ownership" (in red font) greatly helps to determine the relation ("PART-WHOLE") of the relation tuple <several foreign subsidiaries, Starbucks>. Third, sentence-level contextual information is ignored in both span and relation classification, although it may provide important compensation information for both. An example is given in sentence 3 of Figure 1: "state" (in red font) benefits the relation classification of the relation tuple <all of the West Bank and Gaza, Palestinians>, while "state" is contained neither in the relation tuple nor in the local context (namely "claim"), but in another part of sentence 3, i.e., at the sentence level.

Figure 1: Sentence examples including gold entities and gold relation triples from the ACE2005 dataset, where "PER", "ORG" etc. denote gold entity types, "PART-WHOLE" denotes a gold relation type, texts in brackets are spans of gold entities, underlined texts are local contexts of gold relation tuples, and "Relation" denotes a gold relation triple.
To address the above issues, we introduce a span-based joint extraction model with attention-based span-specific and contextual semantic representations. Specifically, 1) MLP attention is used to calculate the span-specific semantic representation; 2) the attention-based sentence-level contextual semantic representation for a span is calculated by taking the span-specific semantic representation as query and the sentence token semantic representations as keys and values; 3) local and sentence-level contextual semantic representations for a relation are obtained by attention calculation between relation tuple semantic representations and the corresponding token sequence semantic representations. The advantage of this approach is that we can capture the most useful information to constitute efficient span and relation semantic representations.
We take BERT (Devlin et al., 2019) as the default backbone network and explore the three problems above. Moreover, we investigate the effects of Multi-Head attention (Vaswani et al., 2017), Dot-Product attention (Luong et al., 2015), General attention (Luong et al., 2015) and Additive attention (Bahdanau et al., 2015) in generating contextual semantic representations. Extensive experiments on three benchmark datasets show that our model consistently outperforms previous systems. In addition, Multi-Head attention consistently improves over the other attention variants.
Related Work

As discussed in §1, joint entity and relation extraction is typically formulated as a sequence tagging based task. Traditionally, table-filling methods have been widely explored (Miwa and Sasaki, 2014; Gupta et al., 2016), where token labels and relation labels fill the diagonal and off-diagonal cells of a table respectively. Recently, many works concentrate on leveraging deep neural networks to tackle this task, e.g., stacked bidirectional LSTMs (Miwa and Bansal, 2016), combinations of bidirectional LSTM and CNN, and combinations of bidirectional LSTM and attention mechanisms (Chi et al., 2019; Nguyen and Verspoor, 2019). In addition, a machine reading comprehension based approach (Li et al., 2019) has been proposed, which formulates the task as a question answering task but still in a sequence tagging based mode.
Recently, span-based joint extraction methods have been investigated to tackle problems of sequence tagging based methods, e.g., the inability to detect overlapping entities. Specifically, Dixit and Al-Onaizan (2019) realize this method by obtaining span semantic representations through a BiLSTM over concatenated ELMo, word and character embeddings, and share them in both span and relation classification. Luan et al. (2018) obtain span semantic representations largely as in Lee et al. (2017), but reinforce them by introducing a coreference task. Following Luan et al. (2018), Luan et al. (2019) propose DyGIE, which captures span interactions through a dynamically constructed span graph. Wadden et al. (2019) deliver further performance increases over DyGIE by replacing the BiLSTM with BERT, introducing DyGIE++. More recently, Eberts and Ulges (2019) propose SpERT, a simple but effective span-based model that takes BERT as its backbone and uses two FFNNs to classify spans and relations respectively. Unlike previous works, SpERT dramatically reduces model training complexity by adopting negative sampling. Our work follows SpERT, but differs in the span-specific and contextual semantic representations. Specifically, our model obtains these semantic representations with attention mechanisms. By calculating the matching degree between target sequence semantic representations and source sequence semantic representations, an attention mechanism obtains attention scores over the source sequence, which are in essence weight scores: the more important the information, the higher the weight score it holds. Classified by the implementation of the score function, attention mechanisms have multiple variants, e.g., Content-Based attention (Graves et al., 2014), Additive attention (Bahdanau et al., 2015), General attention (Luong et al., 2015), Dot-Product attention (Luong et al., 2015) and Multi-Head attention (Vaswani et al., 2017).

Approach
In the rest of this paper, we abbreviate "semantic representation" as "representation". Figure 2 gives an overview of our model, which uses BERT as the encoder following SpERT: we map word embeddings into BERT embeddings using pre-trained Transformer blocks (Vaswani et al., 2017). Based on these representations, we first calculate span representations and perform span classification and filtration (§3.1); then, we organize relation tuples, calculate relation representations and perform relation classification and filtration (§3.2); third, we investigate the effects of multiple attention variants in generating contextual representations (§3.3); at last, we introduce the model training settings (§3.4).
To help introduce the rest, we define a sentence S and a span s from the sentence, where t denotes tokens and subscripts (e.g., 1, 2, 3, ...) denote token indexes, as:

S = (t_1, t_2, ..., t_n), s = (t_i, t_{i+1}, ..., t_{i+j})

Span Classification and Filtration
We add a NoneEntity type to the pre-defined entity types (together denoted as η). Spans are classified into NoneEntity as long as they do not hold any pre-defined entity type.
As Figure 2 shows, the span representation for classification is composed of four parts, namely a) the concatenation of span head and tail representations, b) the span-specific representation, c) the sentence-level contextual representation, and d) the span width embedding. We use X_i to denote the BERT embedding of token t_i, and the BERT embedding sequences of S and s are defined as follows, where X_0 denotes the BERT embedding of [CLS]:

B_S = (X_0, X_1, X_2, ..., X_n), B_s = (X_i, X_{i+1}, ..., X_{i+j})

Concatenation of span head and tail representations. If a span consists of more than one token, we concatenate the BERT embeddings of the span head and tail tokens; otherwise, we duplicate the BERT embedding of the single token and concatenate the copies. The concatenation result for span s is:

C_s = [X_i; X_{i+j}]

Figure 2: Our joint extraction model with attention-based span-specific and contextual representations. 1) MLP attention is utilized to obtain the span-specific representation; 2) the sentence-level contextual representation for a span is obtained by attention calculation between the span-specific representation and the sentence token embedding sequence; 3) relation local and sentence-level contextual representations are calculated by the relation tuple representation attending to the corresponding token embedding sequences.
Span-specific representation. We use MLP attention (Dixit and Al-Onaizan, 2019) to calculate the span-specific representation. Take span s as an example:

V_k = w · tanh(W · X_k + b), α_k = exp(V_k) / Σ_{k'} exp(V_{k'}), F_s = Σ_k α_k · X_k

where k ranges over the tokens of s; V_k is a scalar score; α_k is the attention weight of X_k, computed by the Softmax function; and F_s is the span-specific representation obtained from the attention weights and B_s. In this way, we can evaluate the significance of each span token: the more important a token, the larger the attention weight it holds.
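For concreteness, here is a minimal PyTorch sketch of this MLP attention; the module and dimension names (SpanMLPAttention, hidden_dim) are ours, not from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanMLPAttention(nn.Module):
    """MLP attention over the tokens of one span (a sketch)."""

    def __init__(self, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.proj = nn.Linear(embed_dim, hidden_dim)       # W, b
        self.score = nn.Linear(hidden_dim, 1, bias=False)  # w

    def forward(self, span_embeds: torch.Tensor) -> torch.Tensor:
        # span_embeds: (span_len, embed_dim) -- the BERT embeddings B_s
        v = self.score(torch.tanh(self.proj(span_embeds)))  # V_k: (span_len, 1)
        alpha = F.softmax(v, dim=0)                          # attention weights α_k
        f_s = (alpha * span_embeds).sum(dim=0)               # F_s: weighted sum
        return f_s
```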
Sentence-level contextual representation. Taking F_s as the query and B_S as keys and values, the sentence-level contextual representation for span s is calculated as:

T_s = Attention(Q = F_s, K = B_S, V = B_S)

Information beneficial for span classification is assigned a heavy weight, and the resulting contextual representation is taken to constitute the span representation.
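A sketch of this single-query attention is shown below; for brevity it uses scaled dot-product scoring, whereas §3.3 studies several scoring variants.

```python
import torch


def sentence_context_for_span(f_s: torch.Tensor, b_S: torch.Tensor) -> torch.Tensor:
    # f_s: (d,) span-specific representation; b_S: (sent_len, d) sentence embeddings
    scores = b_S @ f_s / (f_s.shape[-1] ** 0.5)  # one score per sentence token
    alpha = torch.softmax(scores, dim=0)         # attention weights over B_S
    return alpha @ b_S                           # T_s: weighted sum of values
```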
Span width embedding. The span width embedding allows the model to incorporate a prior over span widths. A fixed-size embedding for each span width 1, 2, ... (Lee et al., 2017) is learned during model training. Thus, we can look up the width embedding W_{j+1} from the embedding matrix for s.
Span classification. The final span representation for classification is:

R_s = [C_s; F_s; T_s; W_{j+1}]

R_s first passes through a multi-layer FFNN and is then fed into a Softmax classifier, which yields a posterior for s over η (including NoneEntity):

y_s = Softmax(FFNN(R_s))

Span filtration. By searching for the highest-scored class, y_s estimates which entity type s holds. We only keep spans that are not classified into NoneEntity, forming a predicted entity set E. We then perform relation classification on relation tuples derived from {E ⊗ E} to reduce the search space, where ⊗ denotes the Cartesian product.
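The sketch below assembles the four parts and applies the classifier; module names, layer counts, and the NoneEntity index are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

NONE_ENTITY = 0  # assumed index of the NoneEntity class


class SpanClassifier(nn.Module):
    """Assembles R_s = [C_s; F_s; T_s; W_{j+1}] and classifies it (a sketch)."""

    def __init__(self, d: int, width_dim: int, num_types: int, max_width: int = 10):
        super().__init__()
        self.width_emb = nn.Embedding(max_width + 1, width_dim)
        in_dim = 2 * d + d + d + width_dim  # [X_head; X_tail], F_s, T_s, W_{j+1}
        self.ffnn = nn.Sequential(nn.Linear(in_dim, d), nn.ReLU(),
                                  nn.Linear(d, num_types))

    def forward(self, x_head, x_tail, f_s, t_s, width):
        r_s = torch.cat([x_head, x_tail, f_s, t_s, self.width_emb(width)], dim=-1)
        return torch.softmax(self.ffnn(r_s), dim=-1)  # posterior y_s over η


# Filtration: keep spans whose argmax class is not NoneEntity, e.g.
# entities = [s for s, y in zip(spans, posteriors) if y.argmax() != NONE_ENTITY]
```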

Relation Classification and Filtration
We add a NoneRelation type to the pre-defined relation types (together denoted as γ). Let s_1, s_2 be two spans; the relation tuples taken for relation classification are defined as:

<s_1, s_2> ∈ {E ⊗ E}

As Figure 2 shows, the relation representation for classification is composed of three parts, namely a) the concatenation of relation tuple representations, b) the local contextual representation, and c) the sentence-level contextual representation.
Concatenation of relation tuple representations. Before concatenating R_{s_1} and R_{s_2}, we first apply a multi-layer FFNN to them to reduce their dimensions. The concatenation result is:

H_r = [FFNN(R_{s_1}); FFNN(R_{s_2})]

Local contextual representation. Let B_c denote the BERT embedding sequence of the local context between s_1 and s_2:

B_c = (X_m, X_{m+1}, X_{m+2}, ..., X_{m+n})

The attention-based local contextual representation is calculated by taking H_r as the query and B_c as keys and values:

F_r = Attention(Q = H_r, K = B_c, V = B_c)

Sentence-level contextual representation. The sentence-level contextual representation is calculated by taking H_r as the query and B_S as keys and values:

T_r = Attention(Q = H_r, K = B_S, V = B_S)

Relation classification. Before F_r and T_r are taken to constitute the relation representation, we apply two different multi-layer FFNNs to them to reduce their dimensions, aiming to keep them in a proper proportion in the relation representation. The final relation representation for classification is:

R_r = [H_r; FFNN_F(F_r); FFNN_T(T_r)]

Akin to span classification, R_r first passes through a multi-layer FFNN and is then fed into a Softmax classifier, which yields a posterior for <s_1, s_2> over γ (including NoneRelation):

y_r = Softmax(FFNN(R_r))

Relation filtration. By searching for the highest-scored class, y_r estimates which relation type <s_1, s_2> holds. Only relation tuples that are not classified into NoneRelation are kept, composing predicted relation triples with their predicted types.
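A compact sketch of this pipeline follows; the projection aligning H_r with the BERT key dimension, the layer counts, and all module names are assumptions for illustration, and the scoring again uses the scaled dot-product form as a stand-in for the variants of §3.3.

```python
import torch
import torch.nn as nn


class RelationClassifier(nn.Module):
    """Builds H_r, F_r, T_r and R_r for one candidate tuple (a sketch)."""

    def __init__(self, span_dim: int, bert_dim: int, d_r: int, num_rel_types: int):
        super().__init__()
        self.reduce = nn.Linear(span_dim, d_r)      # applied to R_s1 and R_s2
        self.q_proj = nn.Linear(2 * d_r, bert_dim)  # align H_r with the key dim
        self.ffnn_F = nn.Sequential(nn.Linear(bert_dim, d_r), nn.ReLU())  # on F_r
        self.ffnn_T = nn.Sequential(nn.Linear(bert_dim, d_r), nn.ReLU())  # on T_r
        self.out = nn.Linear(4 * d_r, num_rel_types)

    @staticmethod
    def attend(q, kv):
        # single-query attention; see §3.3 for the scoring variants
        alpha = torch.softmax(kv @ q / (q.shape[-1] ** 0.5), dim=0)
        return alpha @ kv

    def forward(self, r_s1, r_s2, b_c, b_S):
        h_r = torch.cat([self.reduce(r_s1), self.reduce(r_s2)], dim=-1)  # H_r
        q = self.q_proj(h_r)
        f_r = self.ffnn_F(self.attend(q, b_c))    # local contextual part
        t_r = self.ffnn_T(self.attend(q, b_S))    # sentence-level part
        r_r = torch.cat([h_r, f_r, t_r], dim=-1)  # R_r
        return torch.softmax(self.out(r_r), dim=-1)  # posterior y_r over γ
```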

Attention Variants
We investigate the effects of Multi-Head attention, Additive attention, Dot-Product attention and General attention in generating contextual representations; their score functions are shown below.

Multi-Head attention: score = (Q · K^T) / √d_K

Additive attention: score = W_1 · Q + W_2 · K

Dot-Product attention: score = W · (Q ⊙ K)

General attention: score = Q · W · K

where Q, K denote the query and key respectively; W, W_1 and W_2 denote parameter matrices; d_K denotes the dimension of K; and · and ⊙ denote (broadcast) matrix multiplication and element-wise multiplication respectively. For Multi-Head attention and Dot-Product attention, we first apply different multi-layer FFNNs to Q and K to convert them to the same dimension.
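The sketches below implement the four score functions for a single query attending over a key sequence; the parameter shapes are illustrative, and the additive variant is written in the full form of Bahdanau et al. (2015), with the tanh and the scoring vector v slightly expanding the abbreviated formula above.

```python
import math
import torch

# q: (d,) query; K: (len, d) keys. All parameter shapes are assumptions.


def additive_score(q, K, W1, W2, v):
    # score = v · tanh(W_1·q + W_2·K); W1, W2: (d, d_h), v: (d_h,)
    return torch.tanh(q @ W1 + K @ W2) @ v  # (len,)


def dot_product_score(q, K, w):
    # score = w · (q ⊙ K): element-wise product, then weighted by w: (d,)
    return (q * K) @ w  # (len,)


def general_score(q, K, W):
    # score = q · W · K, with W: (d, d)
    return K @ (W @ q)  # (len,)


def scaled_dot_score(q, K):
    # per-head score in Multi-Head attention: (q · K^T) / sqrt(d_K)
    return (K @ q) / math.sqrt(K.shape[-1])  # (len,)
```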

Model Training
The parameter matrices of the FFNNs and attentions are learned, and BERT is fine-tuned during model training. The joint loss function of our model is defined as:

L = L_s + λ · L_r

where L_s denotes the cross-entropy loss of span classification and L_r denotes the binary cross-entropy loss of relation classification. Since the performance of relation classification is generally worse than that of span classification, we apply a larger weight λ to L_r, letting the model focus more on relation classification.
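A minimal sketch of this weighted joint loss follows; the value of λ is ours for illustration, as the paper only states that it is larger than the span weight.

```python
import torch.nn.functional as F


def joint_loss(span_logits, span_gold, rel_logits, rel_gold, lam: float = 2.0):
    # L = L_s + λ·L_r; a larger λ (assumed 2.0 here) focuses on relations
    l_s = F.cross_entropy(span_logits, span_gold)                   # span CE loss
    l_r = F.binary_cross_entropy_with_logits(rel_logits, rel_gold)  # relation BCE loss
    return l_s + lam * l_r
```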
Negative sampling (Eberts and Ulges, 2019) is adopted during model training to improve performance and robustness. Unlike previous works, we adopt a dynamic sampling strategy, where the negative examples for both entities and relations are thirtyfold the ground-truth ones in each sentence. With this strategy, our model keeps a much more balanced data distribution over the training data.
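A sketch of this per-sentence dynamic sampling, assuming hashable candidate objects:

```python
import random


def sample_negatives(gold, candidates, rate: int = 30):
    # Draw up to rate * |gold| negative examples (spans or relation tuples)
    # from one sentence's candidates, keeping the 30:1 ratio dynamic per sentence.
    gold_set = set(gold)
    negatives = [c for c in candidates if c not in gold_set]
    k = min(len(negatives), rate * len(gold))
    return random.sample(negatives, k)
```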

Datasets
We test our model on ACE2005, CoNLL2004 and ADE, which are referred to as ACE05, CoNLL04 and ADE respectively in the rest of this paper.
• ACE05 (Doddington et al., 2004) is an English dataset composed of news articles from multiple domains, e.g., broadcast, newswire and weblog. Seven coarse-grained entity types and six coarse-grained relation types are pre-defined. We follow the training/dev/test split of (Li and Ji, 2014; Li et al., 2019): 351 documents for training, 80 for development and 80 for test, of which 437 contain overlapping entities.
• CoNLL04 (Roth and Yih, 2004) is composed of news articles from outlets such as WSJ and AP. We follow the training/dev/test split of (Adel and Schütze, 2017; Bekoulis et al., 2018), which consists of 910 sentences for training, 243 for development and 288 for test.
• ADE (Gurulingappa et al., 2012) aims at extracting drug-related adverse effects from medical text, with two pre-defined entity types (namely Adverse-Effect and Drug) and a single relation type (namely Adverse-Effect). It consists of 4272 sentences, of which 1695 contain overlapping entities. We conduct 10-fold cross validation following (Bekoulis et al., 2018; Eberts and Ulges, 2019).
For ACE05, following (Li et al., 2019), an entity is considered correct if its head region and type are identified correctly, and a relation is considered correct if its argument entities and type are identified correctly. For CoNLL04 and ADE, following (Li et al., 2019; Eberts and Ulges, 2019), we treat an entity as correct when its type and entity region match the ground truth, and treat a relation as correct when its type and argument entities match the ground truth.

Experimental Settings
We build our model upon the English cased version of BERT_BASE. We set the negative sampling rate to 30, the batch size to 8, dropout to 0.2 and the width embedding dimension to 50. For Multi-Head attention, the number of heads is set to 8. FFNN_F and FFNN_T contain three fully connected layers; all other FFNNs contain two layers. We use different numbers of epochs for different datasets. For all datasets, the span width threshold is set to 10. We adopt the weighted loss setting shown in equ. (11), and follow Eberts and Ulges (2019) for other hyperparameter settings.
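For reference, the settings listed above can be collected into a single configuration; the key names below are ours, not from the released code.

```python
config = {
    "encoder": "bert-base-cased",   # English cased BERT_BASE
    "neg_sampling_rate": 30,
    "batch_size": 8,
    "dropout": 0.2,
    "width_embedding_dim": 50,
    "attention_heads": 8,           # Multi-Head attention
    "ffnn_F_and_T_layers": 3,       # FFNN_F and FFNN_T
    "other_ffnn_layers": 2,
    "max_span_width": 10,           # span width threshold
}
```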

Baseline Models
We compare our model with the following models.
• DyGIE (Luan et al., 2019) is the current span-based state-of-the-art model on ACE05. It reinforces span and relation representations by introducing a coreference task.
• Multi-turn QA (Li et al., 2019) is the current sequence tagging based state-of-the-art model on ACE05 and CoNLL04. It formulates joint entity and relation extraction as a multi-turn question answering task, but still in a sequence tagging based mode.
• SpERT (Eberts and Ulges, 2019) is the current span-based state-of-the-art model on ADE and CoNLL04. Our work follows this model but adopts attention-based span-specific and contextual representations.
• Relation-Metric (Tran and Kavuluru, 2019) is a sequence tagging based model in multi-task learning scheme. It reports performances on ADE and CoNLL04, and achieves state-of-the-art on ADE.

Main Results
We compare our model with both sequence tagging based and span-based methods on the three benchmark datasets; the results are shown in Table 1. We denote our method as SPAN_Multi-Head, meaning that Multi-Head attention is used to calculate contextual representations. Following prior works, we report micro-average metrics for ACE05 and CoNLL04 and macro-average metrics for ADE. For ACE05 and ADE, all reported performances take overlapping entities into consideration. SPAN_Multi-Head consistently outperforms both the sequence tagging based and span-based state-of-the-art methods on the three benchmark datasets. Compared to SpERT, SPAN_Multi-Head delivers performance increases in entity recognition of 1.29 (CoNLL04) and 1.31 (ADE) points, and larger ones in relation extraction of 2.86 (CoNLL04) and 1.89 (ADE) points. We attribute these increases to the efficient span-specific and contextual representations. Moreover, SPAN_Multi-Head delivers solid performance increases over DyGIE, by 1.19 in entity recognition and 2.04 in relation extraction on ACE05. It is worth noting, however, that DyGIE adopts a multi-task learning scheme, reinforcing span representations by introducing a coreference task, which is absent in our method.
Besides Multi-Head attention, we investigate Dot-Product attention, General attention and Additive attention in our method. Table 2 shows the performances of these attention variants on the three benchmark datasets. SPAN_Multi-Head consistently outperforms the other three methods. One possible reason is that in Multi-Head attention, the eight attention heads attend to different contextual information and learn features from different representation spaces; thus, Multi-Head attention based contextual representations can better compensate span and relation representations.

Ablation Study
Based on SPAN_Multi-Head, we conduct ablations on the ACE2005 dev set to analyze the effects of different model components.

Table 3 shows the effects of the span-specific and sentence-level contextual representations for spans in our model, where -SpanSpecific denotes ablating the span-specific representation and -SentenceLevel denotes ablating the sentence-level contextual representation; base is the model with both ablations performed, which is the default span representation setting in SpERT. For ACE05, we observe that both the span-specific representation and the sentence-level contextual representation are helpful for entity recognition and relation extraction. This is because span representations are shared by the two subtasks.

Table 4 shows the effects of the local and sentence-level contextual representations for relations in our model, where -local denotes ablating the local contextual representation by replacing FFNN_F(F_r) in equ. (9) with the max pooling of B_c; -SentenceLevel denotes ablating the sentence-level contextual representation by removing FFNN_T(T_r) from equ. (9); base is the model with both ablations performed, which is the default relation representation setting in SpERT. For ACE05, we observe that both local and sentence-level contextual representations clearly benefit relation extraction, while having negligible influence on entity recognition. A convincing explanation is that these representations directly constitute the relation representation, while they affect span representations only through backpropagation. It is worth noting that the local contextual representation has a greater impact on relation extraction than the sentence-level one. One reason is that the information determining the relation type mainly resides in the relation tuple and the local context. Another is that, as compensation information, the sentence-level contextual representation occupies a relatively small proportion of the relation representation, aiming to avoid introducing noise.

Conclusion
We introduce attention-based semantic representation generation methods into span-based joint entity and relation extraction. We apply MLP attention to capture span-specific features, aiming to obtain semantically rich span representations, and calculate task-specific contextual representations with attention architectures to further reinforce span and relation representations. Our approach consistently outperforms both the sequence tagging based and span-based SOTA methods on three benchmark datasets, establishing new state-of-the-art results. As future work, we would like to further improve relation classification performance by reducing span classification errors. We also plan to explore more advanced methods for encoding efficient span and relation representations.