Two Are Better than One: Joint Entity and Relation Extraction with Table-Sequence Encoders

Named entity recognition and relation extraction are two important fundamental problems. Joint learning algorithms have been proposed to solve both tasks simultaneously, and many of them cast the joint task as a table-filling problem. However, they typically focus on learning a single encoder (usually learning representation in the form of a table) to capture information required for both tasks within the same space. We argue that it can be beneficial to design two distinct encoders to capture such two different types of information in the learning process. In this work, we propose the novel {\em table-sequence encoders} where two different encoders -- a table encoder and a sequence encoder -- are designed to help each other in the representation learning process. Our experiments confirm the advantages of having {\em two} encoders over {\em one} encoder. On several standard datasets, our model shows significant improvements over existing approaches.


Introduction
Named Entity Recognition (NER, Florian et al. 2006, 2010) and Relation Extraction (RE, Zhao and Grishman 2005; Jiang and Zhai 2007; Sun et al. 2011; Plank and Moschitti 2013) are two fundamental tasks in Information Extraction (IE). Both tasks aim to extract structured information from unstructured texts. One typical approach is to first identify entity mentions, and next perform classification between every two mentions to extract relations, forming a pipeline (Zelenko et al., 2002; Chan and Roth, 2011). An alternative and more recent approach is to perform these two tasks jointly (Li and Ji, 2014; Miwa and Sasaki, 2014; Miwa and Bansal, 2016), which mitigates the error propagation issue associated with the pipeline approach and leverages the interaction between tasks, resulting in improved performance. (Our code is available at https://github.com/LorrinWWW/two-are-better-than-one.)

Among several joint approaches, one popular idea is to cast NER and RE as a table filling problem (Miwa and Sasaki, 2014; Gupta et al., 2016; Zhang et al., 2017). Typically, a two-dimensional (2D) table is formed where each entry captures the interaction between two individual words within a sentence. NER is then regarded as a sequence labeling problem where tags are assigned along the diagonal entries of the table. RE is regarded as the problem of labeling other entries within the table. Such an approach allows NER and RE to be performed using a single model, enabling the potentially useful interaction between these two tasks. One example is illustrated in Figure 1.
Unfortunately, there are limitations with the existing joint methods. First, these methods typically suffer from feature confusion as they use a single representation for the two tasks -- NER and RE. As a result, features extracted for one task may coincide or conflict with those for the other, thus confusing the learning model. Second, these methods underutilize the table structure as they usually convert it to a sequence and then use a sequence labeling approach to fill the table. However, crucial structural information (e.g., the 4 entries at the bottom-left corner of Figure 1 share the same label) in the 2D table might be lost during such conversions.
In this paper, we present a novel approach to address the above limitations. Instead of predicting entities and relations with a single representation, we focus on learning two types of representations, namely sequence representations and table representations, for NER and RE respectively. On one hand, the two separate representations can be used to capture task-specific information. On the other hand, we design a mechanism to allow them to interact with each other, in order to take advantage of the inherent association underlying the NER and RE tasks. In addition, we employ neural network architectures that can better capture the structural information within the 2D table representation. As we will see, such structural information (in particular the context of neighboring entries in the table) is essential in achieving better performance.
The recent prevalence of BERT (Devlin et al., 2019) has led to great performance gains on various NLP tasks. However, we believe that the previous use of BERT, i.e., employing the contextualized word embeddings, does not fully exploit its potential. One important observation here is that the pairwise self-attention weights maintained by BERT carry knowledge of word-word interactions. Our model can effectively use such knowledge, which helps to better learn table representations. To the best of our knowledge, this is the first work to use the attention weights of BERT for learning table representations.
We summarize our contributions as follows:
• We propose to learn two separate encoders -- a table encoder and a sequence encoder. They interact with each other, and can capture task-specific information for the NER and RE tasks;
• We propose to use multi-dimensional recurrent neural networks to better exploit the structural information of the table representation;
• We effectively leverage the word-word interaction information carried in the attention weights from BERT, which further improves the performance.
Our proposed method achieves state-of-the-art performance on four datasets, namely ACE04, ACE05, CoNLL04, and ADE. We also conduct further experiments to confirm the effectiveness of our proposed approach.

Related Work
NER and RE can be tackled by using separate models. By assuming gold entity mentions are given as inputs, RE can be regarded as a classification task. Such models include kernel methods (Zelenko et al., 2002), RNNs (Zhang and Wang, 2015), recursive neural networks (Socher et al., 2012), CNNs (Zeng et al., 2014), and Transformer models (Verga et al., 2018; Wang et al., 2019). Another branch is to detect cross-sentence level relations (Peng et al., 2017; Gupta et al., 2019), and even document-level relations (Yao et al., 2019; Nan et al., 2020). However, entities are usually not directly available in practice, so these approaches may require an additional entity recognizer to form a pipeline.
Joint learning has been shown effective since it can alleviate the error propagation issue and benefit from exploiting the interrelation between NER and RE. Many studies address the joint problem through a cascade approach, i.e., performing NER first followed by RE. Miwa and Bansal (2016) use bi-LSTMs (Graves et al., 2013) and tree-LSTMs (Tai et al., 2015) for the joint task. Bekoulis et al. (2018a,b) formulate it as a head selection problem. Nguyen and Verspoor (2019) apply biaffine attention (Dozat and Manning, 2017) for RE. Luan et al. (2019), Dixit and Al-Onaizan (2019), and Wadden et al. (2019) use span representations to predict relations. Miwa and Sasaki (2014) tackle joint NER and RE from a table filling perspective, where the entry at row i and column j of the table corresponds to the pair of the i-th and j-th words of the input sentence. The diagonal of the table is filled with the entity tags and the rest with the relation tags indicating possible relations between word pairs. Similarly, Gupta et al. (2016) employ a bi-RNN structure to label each word pair. Zhang et al. (2017) propose a global optimization method to fill the table. Tran and Kavuluru (2019) investigate CNNs on this task.
Joint models can also benefit from pre-trained language models, such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019). However, none of them use pre-trained attention weights, which convey rich relational information between words. We believe such information can be useful for learning better table representations for RE.

Problem Formulation
In this section, we formally formulate the NER and RE tasks. We regard NER as a sequence labeling problem, where the gold entity tags $y^{\text{NER}}$ are in the standard BIO (Begin, Inside, Outside) scheme (Sang and Veenstra, 1999; Ratinov and Roth, 2009). For the RE task, we mainly follow the work of Miwa and Sasaki (2014) and formulate it as a table filling problem. Formally, given an input sentence $x = [x_i]_{1 \le i \le N}$, we maintain a tag table $y^{\text{RE}} = [y^{\text{RE}}_{i,j}]_{1 \le i,j \le N}$. Suppose there is a relation with type $r$ pointing from mention $x_{i_b}, \ldots, x_{i_e}$ to mention $x_{j_b}, \ldots, x_{j_e}$; then $y^{\text{RE}}_{i,j} = \overrightarrow{r}$ and $y^{\text{RE}}_{j,i} = \overleftarrow{r}$ for all $i_b \le i \le i_e$ and $j_b \le j \le j_e$. We use $\perp$ for word pairs with no relation. An example was given earlier in Figure 1.
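The formulation above can be sketched in code. The following is a hypothetical toy example (the tag strings and span format are our own, not from the released implementation): BIO tags go on the diagonal, and a directed relation from one span to another fills all word-pair cells between the two spans with directed tags.

```python
# Sketch of the table-filling formulation (hypothetical tags, not the authors' code).
# NER tags fill the diagonal; a relation r from span (ib, ie) to span (jb, je)
# fills cell (i, j) with "->r" and cell (j, i) with "<-r". "_" stands for ⊥.
def build_tag_table(n, entities, relations):
    """entities: list of (begin, end, type) spans (end inclusive);
    relations: list of (head_span, tail_span, type)."""
    table = [["_" for _ in range(n)] for _ in range(n)]
    for (b, e, t) in entities:
        table[b][b] = "B-" + t
        for i in range(b + 1, e + 1):
            table[i][i] = "I-" + t
    for (ib, ie), (jb, je), r in relations:
        for i in range(ib, ie + 1):
            for j in range(jb, je + 1):
                table[i][j] = "->" + r
                table[j][i] = "<-" + r
    return table

# "John lives in Paris": PER at word 0, LOC at word 3, Live_In(John, Paris)
tab = build_tag_table(4, [(0, 0, "PER"), (3, 3, "LOC")],
                      [((0, 0), (3, 3), "Live_In")])
```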

Model
We describe the model in this section. The model consists of two types of interconnected encoders, a table encoder for the table representation and a sequence encoder for the sequence representation, as shown in Figure 2. Collectively, we call them table-sequence encoders. Figure 3 presents the details of each layer of the two encoders, and how they interact with each other. In each layer, the table encoder uses the sequence representation to construct the table representation; and then the sequence encoder uses the table representation to contextualize the sequence representation. With multiple layers, we incrementally improve the quality of both representations.

Text Embedder
For a sentence containing N words $x = [x_i]_{1 \le i \le N}$, we define the word embeddings $x^w \in \mathbb{R}^{N \times d_1}$, as well as character embeddings $x^c \in \mathbb{R}^{N \times d_2}$ computed by an LSTM (Lample et al., 2016). We also consider the contextualized word embeddings $x^{\ell} \in \mathbb{R}^{N \times d_3}$, which can be produced by language models such as BERT.
We concatenate those embeddings for each word and use a linear projection to form the initial sequence representation $S_0 \in \mathbb{R}^{N \times H}$:
$$S_0 = \text{Linear}([x^w ; x^c ; x^{\ell}]),$$
where each word is represented as an $H$-dimensional vector.
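As a rough sketch, the initial sequence representation can be formed as follows. All dimensions and the projection matrix are stand-in assumptions; in the actual model, the character embeddings come from an LSTM and the contextualized embeddings from a pre-trained language model.

```python
import numpy as np

# Minimal sketch of the initial sequence representation S0 (random stand-ins
# for the embeddings and the projection; not the authors' implementation).
N, d1, d2, d3, H = 5, 100, 30, 768, 200
rng = np.random.default_rng(0)
x_word = rng.normal(size=(N, d1))   # GloVe-style word embeddings
x_char = rng.normal(size=(N, d2))   # char-LSTM outputs (stand-in)
x_ctx = rng.normal(size=(N, d3))    # contextualized embeddings (stand-in)

W = rng.normal(size=(d1 + d2 + d3, H)) * 0.01     # linear projection
S0 = np.concatenate([x_word, x_char, x_ctx], axis=-1) @ W  # (N, H)
```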

Table Encoder
The table encoder, shown in the left part of Figure 3, is a neural network used to learn a table representation, an N × N table of vectors, where the vector at row i and column j corresponds to the i-th and j-th word of the input sentence.
We first construct a non-contextualized table by concatenating every two vectors of the sequence representation, followed by a fully-connected layer to halve the hidden size. Formally, for the l-th layer, we have $X_l \in \mathbb{R}^{N \times N \times H}$, where:
$$X_{l,i,j} = \text{ReLU}(\text{Linear}([S_{l-1,i} ; S_{l-1,j}])).$$
Next, we use Multi-Dimensional Recurrent Neural Networks (MD-RNN, Graves et al. 2007) with Gated Recurrent Units (GRU, Cho et al. 2014) to contextualize $X_l$. We iteratively compute the hidden states of each cell to form the contextualized table representation $T_l$, where:
$$T_{l,i,j} = \text{GRU}(X_{l,i,j},\ T_{l-1,i,j},\ T_{l,i-1,j},\ T_{l,i,j-1}).$$
We provide the multi-dimensional adaptations of GRU in Appendix A to avoid excessive formulas here.

[Figure 4 caption: How the hidden states are computed in MD-RNN with 4 directions. We use $D^+$ or $D^-$ to indicate the direction in which the hidden states flow between cells along dimension $D$ (where $D$ can be layer, row or col). For brevity, we omit the input and the layer dimension for cases (b), (c) and (d), as they are the same as (a).]
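A minimal sketch of the table construction step (the pairwise concatenation followed by a fully-connected layer) might look as follows; the weights here are random stand-ins rather than learned parameters.

```python
import numpy as np

# Sketch of the non-contextualized table X_l: concatenate every pair of
# sequence vectors, then apply a fully-connected layer with ReLU to halve
# the hidden size (random stand-in weights; not the released code).
def build_table(S, W):
    N, H = S.shape
    rows = np.repeat(S[:, None, :], N, axis=1)     # (N, N, H): i-th vector
    cols = np.repeat(S[None, :, :], N, axis=0)     # (N, N, H): j-th vector
    pairs = np.concatenate([rows, cols], axis=-1)  # (N, N, 2H)
    return np.maximum(pairs @ W, 0.0)              # ReLU, (N, N, H)

rng = np.random.default_rng(0)
S = rng.normal(size=(4, 6))
W = rng.normal(size=(12, 6))
X = build_table(S, W)
```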
Generally, it exploits the context along the layer, row, and column dimensions. That is, it considers not only the cells at neighbouring rows and columns, but also those of the previous layer.
The time complexity of the naive implementation (i.e., two nested for-loops) for each layer is $O(N^2)$ for a sentence of length N. However, antidiagonal entries can be calculated at the same time, as they do not depend on each other. Therefore, we can optimize the implementation through parallelization and reduce the effective time complexity to $O(N)$.
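The antidiagonal schedule can be sketched as follows: within a layer, cell (i, j) depends only on cells (i−1, j) and (i, j−1), so all cells with the same i + j can be processed together, yielding 2N−1 sequential steps.

```python
# Sketch of the antidiagonal schedule: cells (i, j) sharing the same i + j
# have no mutual dependency, so each antidiagonal can be computed in parallel,
# reducing the number of sequential steps from N*N to 2N - 1, i.e., O(N).
def antidiagonal_order(n):
    steps = []
    for s in range(2 * n - 1):  # s = i + j
        steps.append([(i, s - i) for i in range(n) if 0 <= s - i < n])
    return steps

order = antidiagonal_order(3)
```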
The above illustration describes a unidirectional RNN, corresponding to Figure 4(a). Intuitively, we would prefer the network to have access to the surrounding context in all directions. However, this cannot be achieved with a single RNN. For the case of 1D sequence modeling, this problem is resolved by introducing bidirectional RNNs. Graves et al. (2007) discussed quaddirectional RNNs to access the context from four directions for modeling 2D data. Therefore, similar to 2D-RNNs, we also need to consider RNNs in four directions. We visualize them in Figure 4.
Empirically, we found that considering only cases (a) and (c) in Figure 4 achieves performance no worse than considering all four cases. Therefore, to reduce the amount of computation, we use this setting by default. The final table representation is then the concatenation of the hidden states of the two RNNs:
$$T_{l,i,j} = [T^{(a)}_{l,i,j} ; T^{(c)}_{l,i,j}].$$

Sequence Encoder
The sequence encoder is used to learn the sequence representation -- a sequence of vectors, where the i-th vector corresponds to the i-th word of the input sentence. The architecture is similar to Transformer (Vaswani et al., 2017), shown in the right portion of Figure 3. However, we replace the scaled dot-product attention with our proposed table-guided attention. Here, we mainly illustrate why and how the table representation can be used to compute attention weights. First of all, given Q (queries), K (keys) and V (values), a generalized form of attention is defined in Figure 5. For each query, the output is a weighted sum of the values, where the weight assigned to each value is determined by the relevance (given by the score function f) of the query with all the keys.
For each query $Q_i$ and key $K_j$, Bahdanau et al. (2015) define $f$ in the form of:
$$f(Q_i, K_j) = U^{\top} g(Q_i, K_j),$$
where $U$ is a learnable vector and $g$ is the function mapping each query-key pair to a vector. Specifically,
$$g(Q_i, K_j) = \tanh(Q_i W_0 + K_j W_1),$$
where $W_0, W_1$ are learnable parameters. Our attention mechanism is essentially a self-attention mechanism, where the queries, keys and values are exactly the same. In our case, they are the sequence representation $S_{l-1}$ of the previous layer (i.e., $Q = K = V = S_{l-1}$). The attention weights (i.e., the output of the function $f$ in Figure 5) are constructed from both queries and keys (which are the same in our case). On the other hand, we also notice that the table representation $T_l$ is constructed from $S_{l-1}$. So we can consider $T_l$ to be a function of queries and keys, such that
$$T_{l,i,j} = g(Q_i, K_j).$$
Putting this $g$ function back into the score function above, we get the proposed table-guided attention, whose score function is:
$$f(Q_i, K_j) = U^{\top} T_{l,i,j}.$$
The advantages of using this table-guided attention are: (1) we do not have to calculate the $g$ function, since $T_l$ is already obtained from the table encoder; (2) $T_l$ is contextualized along the row, column, and layer dimensions, which correspond to queries, keys, and the queries and keys of the previous layer, respectively. Such contextual information allows the network to better capture more difficult word-word dependencies; (3) it allows the table encoder to participate in the sequence representation learning process, thereby forming the bidirectional interaction between the two encoders.
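A sketch of the table-guided score computation under these definitions: the table entry stands in for g(Q_i, K_j), so the score reduces to a dot product with the learned vector U (all tensors below are random stand-ins, not learned parameters).

```python
import numpy as np

# Sketch of table-guided attention: the table entry T_l[i, j] plays the role
# of g(Q_i, K_j), so the score f(Q_i, K_j) is just a dot product with U.
def table_guided_attention(T, U, V):
    scores = T @ U                                    # (N, N): f(Q_i, K_j)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)             # row-wise softmax
    return w @ V                                      # weighted sum of values

rng = np.random.default_rng(0)
N, H = 4, 6
T = rng.normal(size=(N, N, H))   # stand-in table representation
U = rng.normal(size=(H,))        # stand-in learnable vector
V = rng.normal(size=(N, H))      # values = sequence representation
out = table_guided_attention(T, U, V)
```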
The table-guided attention can be extended to have multiple heads (Vaswani et al., 2017), where each head is an attention with independent parameters. We concatenate their outputs and use a fully-connected layer to get the final attention outputs.
The remaining parts are similar to Transformer. For layer l, we use position-wise feedforward neural networks (FFNN) after self-attention, and wrap attention and FFNN with a residual connection (He et al., 2016) and layer normalization (Ba et al., 2016), to get the output sequence representation:
$$\tilde{S}_l = \text{LayerNorm}(S_{l-1} + \text{SelfAttn}(S_{l-1})), \qquad S_l = \text{LayerNorm}(\tilde{S}_l + \text{FFNN}(\tilde{S}_l)).$$

Exploit Pre-trained Attention Weights
In this section, we describe the dashed lines in Figures 2 and 3, which we ignored in the previous discussions. Essentially, they exploit information in the form of attention weights from a pre-trained language model such as BERT.
We stack the attention weights of all heads and all layers to form $T^{\ell} \in \mathbb{R}^{N \times N \times (L' \times A')}$, where $L'$ is the number of stacked Transformer layers, and $A'$ is the number of heads in each layer. We leverage $T^{\ell}$ to form the inputs of the MD-RNNs in the table encoder. Equation 2 is now replaced with:
$$X_{l,i,j} = \text{ReLU}(\text{Linear}([S_{l-1,i} ; S_{l-1,j} ; T^{\ell}_{i,j}])).$$
We keep the rest unchanged. We believe this simple yet novel use of the attention weights allows us to effectively incorporate the useful word-word interaction information captured by pre-trained models such as BERT into our table-sequence encoders for improved performance.
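The stacking operation can be sketched as follows; the per-layer attention arrays are random stand-ins for what a pre-trained Transformer would return (e.g., the tensors exposed via `output_attentions=True` in common BERT implementations).

```python
import numpy as np

# Sketch of stacking pre-trained attention weights into a table-shaped
# feature (shapes assumed: L' layers, A' heads, N tokens; random stand-ins
# for the attention maps a model like BERT would produce).
def stack_attentions(attn_per_layer):
    # attn_per_layer: list of L' arrays, each of shape (A', N, N)
    att = np.stack(attn_per_layer)                 # (L', A', N, N)
    Lp, Ap, N, _ = att.shape
    # move token axes to the front, flatten layers x heads into features
    return att.transpose(2, 3, 0, 1).reshape(N, N, Lp * Ap)  # (N, N, L'*A')

rng = np.random.default_rng(0)
T_hat = stack_attentions([rng.random((12, 7, 7)) for _ in range(4)])
```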

Training and Evaluation
We use $S_L$ and $T_L$ to predict the probability distributions of the entity and relation tags:
$$P_{\theta}(Y^{\text{NER}} \mid x) = \text{softmax}(\text{Linear}(S_L)), \qquad P_{\theta}(Y^{\text{RE}} \mid x) = \text{softmax}(\text{Linear}(T_L)),$$
where $Y^{\text{NER}}$ and $Y^{\text{RE}}$ are random variables of the predicted tags, and $P_{\theta}$ is the estimated probability function with $\theta$ being our model parameters.
For training, both NER and RE adopt the prevalent cross-entropy loss. Given the input text $x$, its gold tag sequence $y^{\text{NER}}$ and tag table $y^{\text{RE}}$, we calculate the following two losses:
$$\mathcal{L}_{\text{NER}} = -\sum_{1 \le i \le N} \log P_{\theta}(Y^{\text{NER}}_i = y^{\text{NER}}_i \mid x), \qquad \mathcal{L}_{\text{RE}} = -\sum_{1 \le i,j \le N} \log P_{\theta}(Y^{\text{RE}}_{i,j} = y^{\text{RE}}_{i,j} \mid x).$$
The goal is to minimize $\mathcal{L}_{\text{NER}} + \mathcal{L}_{\text{RE}}$. During evaluation, the prediction of relations relies on the prediction of entities, so we first predict the entities, and then look up the relation probability table $P_{\theta}(Y^{\text{RE}})$ to see if there exists a valid relation between predicted entities. Specifically, we predict the entity tag of each word by choosing the class with the highest probability:
$$\hat{y}^{\text{NER}}_i = \arg\max_{t} P_{\theta}(Y^{\text{NER}}_i = t \mid x).$$
The whole tag sequence can then be transformed into entities with their boundaries and types.
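A sketch of the joint objective with stand-in logits and gold tags (a numerically stable cross-entropy over token tags plus cell tags; not the authors' training code):

```python
import numpy as np

# Sketch of the joint loss: token-level cross-entropy for NER plus cell-level
# cross-entropy for RE (logits and gold tags are random stand-ins).
def cross_entropy(logits, gold):
    logits = logits.reshape(-1, logits.shape[-1])
    gold = gold.reshape(-1)
    shifted = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(gold)), gold].mean()

rng = np.random.default_rng(0)
N, E, R = 5, 7, 4  # sentence length, # entity tags, # relation tags
ner_loss = cross_entropy(rng.normal(size=(N, E)), rng.integers(0, E, size=N))
re_loss = cross_entropy(rng.normal(size=(N, N, R)), rng.integers(0, R, size=(N, N)))
loss = ner_loss + re_loss  # the quantity to minimize
```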
Relations between entities are mapped to the relation classes with the highest probabilities over the word pairs of the entities. We also consider the two directed tags for each relation. Therefore, for two entity spans $(i_b, i_e)$ and $(j_b, j_e)$, their relation is given by:
$$\hat{r} = \arg\max_{r} \sum_{i_b \le i \le i_e} \; \sum_{j_b \le j \le j_e} \Big( P_{\theta}(Y^{\text{RE}}_{i,j} = \overrightarrow{r} \mid x) + P_{\theta}(Y^{\text{RE}}_{j,i} = \overleftarrow{r} \mid x) \Big),$$
where the no-relation type $\perp$ has no direction, so if $\overrightarrow{r} = \perp$, we have $\overleftarrow{r} = \perp$ as well.
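The span-pair decoding rule can be sketched as follows. Here we simply average the cell probabilities over the word pairs of the two spans and take the argmax; the exact aggregation in the paper also accounts for the two directed tags, so this is a simplification with a random stand-in probability table.

```python
import numpy as np

# Simplified sketch of span-pair relation decoding: average the table
# probabilities over all word pairs of the two spans, then take the argmax
# (tag 0 standing in for the no-relation tag ⊥; P is a random stand-in).
def decode_relation(P, span_a, span_b):
    ib, ie = span_a  # inclusive word indices of the first entity
    jb, je = span_b  # inclusive word indices of the second entity
    avg = P[ib:ie + 1, jb:je + 1].mean(axis=(0, 1))  # (num_tags,)
    return int(avg.argmax())

rng = np.random.default_rng(0)
P = rng.random((6, 6, 5))   # stand-in (N, N, num_tags) probability table
r = decode_relation(P, (0, 1), (3, 4))
```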

Data
We evaluate our model on four datasets, namely ACE04 (Doddington et al., 2004), ACE05 (Walker et al., 2006), CoNLL04 (Roth and tau Yih, 2004) and ADE (Gurulingappa et al., 2012). More details can be found in Appendix B. Following the established line of work, we use the F1 measure to evaluate the performance of NER and RE. For NER, an entity prediction is correct if and only if its type and boundaries both match those of a gold entity. For RE, a relation prediction is considered correct if its relation type and the boundaries of the two entities match those in the gold data. We also report the strict relation F1 (denoted RE+), where a relation prediction is considered correct if its relation type as well as the boundaries and types of the two entities all match those in the gold data. Relations are asymmetric, so the order of the two entities in a relation matters.

Model Setup
We tune hyperparameters based on results on the development set of ACE05 and use the same setting for the other datasets. GloVe vectors (Pennington et al., 2014) are used to initialize word embeddings. We also use the BERT variant ALBERT as the default pre-trained language model. Both the pre-trained word embeddings and the language model are fixed without fine-tuning. In addition, we stack three encoding layers (L = 3) with independent parameters, including the GRU cell in each layer. For the table encoder, we use two separate MD-RNNs with the directions "layer⁺row⁺col⁺" and "layer⁺row⁻col⁻" respectively. For the sequence encoder, we use eight attention heads to attend to different representation subspaces. We report the averaged F1 scores of 5 runs for our models. For each run, we keep the model that achieves the highest averaged entity F1 and relation F1 on the development set, and evaluate and report its score on the test set. Other hyperparameters can be found in Appendix C.

Comparison with Other Models
Table 1 presents the comparison of our model with previous methods on the four datasets. Our NER performance increases by 1.2, 0.9, 1.2/0.6 and 0.4 absolute F1 points over the previous best results. Besides, we observe even stronger performance gains in the RE task, which are 3.6, 4.2, 2.1/2.5 (RE+) and 0.9 (RE+) absolute F1 points, respectively. This indicates the effectiveness of our model for jointly extracting entities and their relations. Since our reported numbers are the average of 5 runs, we can consider our model to be achieving new state-of-the-art results.

Comparison of Pre-trained Models
In this section, we evaluate our method with different pre-trained language models, including ELMo, BERT, RoBERTa and ALBERT, with and without attention weights, to see their individual contribution to the final performance.
Table 2 shows that, even using the relatively earlier contextualized embeddings without attention weights (ELMo +x), our system is still comparable to the state-of-the-art approach (Wadden et al., 2019), which was based on BERT and achieved F1 scores of 88.6 and 63.4 for NER and RE respectively. It is important to note that the model of Wadden et al. (2019) was trained on the additional coreference annotations from OntoNotes (Weischedel et al., 2011) before fine-tuning on ACE05. Nevertheless, our system still achieves comparable results, showing the effectiveness of the table-sequence encoding architecture.
The overall results reported in Table 2 confirm the importance of leveraging the attention weights, which bring improvements for both the NER and RE tasks. This allows the system using vanilla BERT to obtain results no worse than RoBERTa and ALBERT in relation extraction.

Ablation Study
We design several additional experiments to understand the effectiveness of the components in our system. The experiments are conducted on ACE05.
We also compare different table filling settings, which are included in Appendix E.

Bidirectional Interaction
We first focus on understanding the necessity of modeling the bidirectional interaction between the two encoders. Results are presented in Table 3. "RE (gold)" is presented so as to compare with settings that do not predict entities, where the gold entity spans are used in the evaluation. We first try optimizing the NER and RE objectives separately, corresponding to "w/o Relation Loss" and "w/o Entity Loss". Compared with learning with a joint objective, the results of these two settings are slightly worse, which indicates that learning better representations for one task not only is helpful for the corresponding task, but also can be beneficial for the other task.
Next, we investigate the individual sequence and table encoders, corresponding to "w/o Table Encoder" and "w/o Sequence Encoder". We also try jointly training the two encoders but cutting off the interaction between them, which is "w/o Bi-Interaction". Since no interaction is allowed in the above three settings, the table-guided attention is changed to conventional multi-head scaled dot-product attention, and the table encoding layer always uses the initial sequence representation S_0 to enrich the table representation. The results of these settings are all significantly worse than the default one, which indicates the importance of the bidirectional interaction between the sequence and table representations in our table-sequence encoders.
We also experiment with using the main diagonal entries of the table representation to tag entities, with results reported under "NER on diagonal". This setup attempts to address NER and RE in the same encoding space, in line with the original intention of Miwa and Sasaki (2014). By exploiting the interrelation between NER and RE, it achieves better performance compared with models without such information. However, it is worse than our default setting. We ascribe this to the potential incompatibility of the desired encoding spaces of entities and relations. Finally, although this setting does not directly use the sequence representation, removing the sequence encoder still leads to a performance drop for NER, which indicates the sequence encoder can help improve the table encoder by better capturing the structured information within the sequence.

[Table 4 caption: The performance on ACE05 with different numbers of layers. Pre-trained word embeddings and language models are not counted in the number of parameters. The underlined results are from our default setting.]

Encoding Layers
Table 4 shows the effect of the number of encoding layers, which is also the number of bidirectional interactions involved. We conduct one set of experiments with shared parameters for the encoding layers and another set with independent parameters. In general, the performance increases as we gradually enlarge the number of layers L. Specifically, since the shared model does not introduce more parameters when tuning L, we consider that our model benefits from the mutual interaction inside the table-sequence encoders. Typically, for the same value of L, the non-shared model employs more parameters than the shared one to enhance its modeling capability, leading to better performance. However, when L > 3, there is no significant improvement from using the non-shared model. We believe that increasing the number of layers may bring the risk of over-fitting, which limits the performance of the network. We choose to adopt the non-shared model with L = 3 as our default setting.

Settings of MD-RNN
Table 5 presents comparisons of using different dimensions and directions to learn the table representation with MD-RNNs. Among these settings, "Unidirectional" refers to an MD-RNN with direction "layer⁺row⁺col⁺"; "Bidirectional" uses two MD-RNNs with directions "layer⁺row⁺col⁺" and "layer⁺row⁻col⁻" respectively; "Quaddirectional" uses MD-RNNs in the four directions illustrated in Figure 4. The results show that exploiting more directional context is beneficial. Since the bidirectional model is almost as good as the quaddirectional one, we keep the former as the default setting.
In addition, we are also curious about the contribution of the layer, row, and column dimensions for MD-RNNs. We separately remove the layer, row, and column dimensions. As we can see, the results are all lower than those of the original model without removal of any dimension. "Layer-wise only" removes the row and column dimensions, and is worse than the others as it does not exploit the sentential context.
More experiments with more settings are presented in Appendix D. Specifically, all unidirectional RNNs are consistently worse than the others, while bidirectional RNNs are usually on par with quaddirectional RNNs. Besides, we also tried to use CNNs to implement the table encoder. However, since it is usually difficult for CNNs to learn long-range dependencies, we found the performance was worse than that of the RNN-based models.

Attention Visualization
We visualize the table-guided attention with bertviz (Vig, 2019) for a better understanding of how the network works. We compare it with pre-trained Transformers (ALBERT) and human-defined ground truth, as presented in Figure 6.
Our discovery is similar to Clark et al. (2019). Most attention heads in the table-guided attention and ALBERT show simple patterns. As shown in the left part of Figure 6, these patterns include attending to the word itself, the next word, the last word, and the punctuation.
The right part of Figure 6 also shows task-related patterns, i.e., entities and relations. For a relation, we connect words from the head entity to the tail entity; for an entity, we connect every two words inside the entity mention. We find that our proposed table-guided attention has learned more task-related knowledge compared to ALBERT. In fact, not only does it capture the entities and relations that ALBERT failed to capture, but it also does so with higher confidence. This indicates that our model has a stronger ability to capture complex patterns beyond simple ones.

Probing Intermediate States
Figure 7 presents an example picked from the development set of ACE05. The prediction layer after training (a linear layer) is used as a probe to display the intermediate states of the model, so we can interpret how the model improves both representations by stacking multiple layers and thus through the bidirectional interaction. Such probing is valid since we use skip connections between adjacent encoding layers, so the encoding spaces of the outputs of different encoding layers are consistent and therefore compatible with the prediction layer.

In Figure 7, the model makes many wrong predictions in the first layer, which are gradually corrected in the subsequent layers. Therefore, we can see that more layers allow more interaction and thus make the model better at capturing entities and relations, especially difficult ones. More cases are presented in Appendix F.

Conclusion
In this paper, we introduce the novel table-sequence encoders architecture for the joint extraction of entities and their relations. It learns two separate encoders rather than one -- a sequence encoder and a table encoder -- with explicit interactions between the two. We also introduce a new method to effectively employ useful information captured by pre-trained language models for such a joint learning task where a table representation is involved. We achieve state-of-the-art F1 scores for both the NER and RE tasks across four standard datasets, which confirms the effectiveness of our approach. In the future, we would like to investigate how the table representation may be applied to other tasks. Another direction is to generalize the way in which the table and sequence interact to other types of representations.

A MD-RNN
In this section we present the detailed implementation of MD-RNN with GRU.
Formally, at layer l, row i, and column j, given the input $X_{l,i,j}$, the cell first computes the gates -- the reset gate $r_{l,i,j}$, the update gate $z_{l,i,j}$, and the lambda gate $\lambda_{l,i,j}$ -- each from the input $X_{l,i,j}$ and the predecessor hidden states $T_{l-1,i,j}$, $T_{l,i-1,j}$ and $T_{l,i,j-1}$ through a linear transformation followed by the sigmoid function.

[Figure 8 caption: For 2D-RNNs, cells in the same color can be computed in parallel.]
It then computes the hidden states as
$$T_{l,i,j} = z_{l,i,j} \odot \tilde{T}_{l,i,j} + (1 - z_{l,i,j}) \odot \bar{T}_{l,i,j},$$
where $\tilde{T}_{l,i,j}$ is the candidate activation, $\bar{T}_{l,i,j}$ is the $\lambda$-weighted combination of the predecessor hidden states, and the $W$ and $b$ terms are trainable parameters. Note that parameters are shared across different rows and columns, but not necessarily across layers.
Besides, $\odot$ is the element-wise product, and $\sigma$ is the sigmoid function.
As in GRU, r is the reset gate controlling whether to forget previous hidden states, and z is the update gate, selecting whether the hidden states are to be updated with new hidden states.In addition, we employ a lambda gate λ, which is used to weight the predecessor cells before passing them through the update gate.
There are two slightly different ways to compute the candidate activation $\tilde{T}_{l,i,j}$: applying the reset gate to the predecessor states either before or after their linear transformation. We found in our preliminary experiments that both performed equally well, and we choose the former, which saves some computation.
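As a self-contained sketch (our own simplification, not the paper's exact equations), one MD-GRU cell can be written as follows: the lambda gate mixes the three predecessor states into a single state, which then enters a standard GRU update.

```python
import numpy as np

# Simplified sketch of one MD-GRU cell (our own formulation, not the exact
# equations of the paper): lambda weights the layer/row/column predecessors,
# then a standard GRU update (reset gate r, update gate z) applies.
def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def md_gru_cell(x, h_layer, h_row, h_col, W):
    h_prev = np.stack([h_layer, h_row, h_col])        # (3, H) predecessors
    inp = np.concatenate([x, h_prev.mean(0)])
    lam = sigmoid(W["lam"] @ inp).reshape(3, -1)      # lambda gate
    lam = lam / lam.sum(0, keepdims=True)             # normalize weights
    h_bar = (lam * h_prev).sum(0)                     # weighted predecessor state
    r = sigmoid(W["r"] @ np.concatenate([x, h_bar]))  # reset gate
    z = sigmoid(W["z"] @ np.concatenate([x, h_bar]))  # update gate
    h_tilde = np.tanh(W["h"] @ np.concatenate([x, r * h_bar]))  # candidate
    return z * h_tilde + (1 - z) * h_bar

rng = np.random.default_rng(0)
Hd, D = 4, 6
W = {"lam": rng.normal(size=(3 * Hd, D + Hd)),
     "r": rng.normal(size=(Hd, D + Hd)),
     "z": rng.normal(size=(Hd, D + Hd)),
     "h": rng.normal(size=(Hd, D + Hd))}
h = md_gru_cell(rng.normal(size=D), np.zeros(Hd), np.zeros(Hd), np.zeros(Hd), W)
```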

B Data
Table 6 shows the dataset statistics after pre-processing. We keep the same pre-processing and evaluation standards used by most previous works. The ACE04 and ACE05 corpora are collected from a variety of domains, such as newswire and online forums. We use the same entity and relation types, data splits, and pre-processing as Li and Ji (2014) and Miwa and Bansal (2016). Specifically, they use head spans for entities rather than the full mention boundary.
The CoNLL04 dataset provides entity and relation labels. We use the same train-test split as Gupta et al. (2016), and we use the same 20% of the train set as the development set as Eberts and Ulges (2019). Both micro and macro average F1 are used in previous work, so we will specify this when comparing with other systems.
The ADE dataset is constructed from medical reports that describe adverse effects arising from drug use. It contains a single relation type "Adverse-Effect" and the two entity types "Adverse-Effect" and "Drug". Similar to previous work, we filter out instances containing overlapping entities, which account for only 2.8% of the total.
Following prior work, we perform 5-fold cross-validation for ACE04 and 10-fold for ADE. Besides, we use 15% of the training set as the development set. We report the average score of 5 runs. (We use the pre-processing script provided by Luan et al. (2019).)

C Hyperparameters and Pre-trained Language Models
The detailed hyperparameters are presented in Table 7.
For the word embeddings, we use 100-dimensional GloVe word embeddings trained on 6B tokens as initialization. We disable updating the word embeddings during training. We set the hidden size to 200, and since we use bidirectional MD-RNNs, the hidden size for each MD-RNN is 100. We use inverse-time learning rate decay: lr = lr_0 / (1 + decay_rate × steps / decay_steps), with decay_rate 0.05 and decay_steps 1000.
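The learning rate schedule above is straightforward to reproduce:

```python
# Inverse-time learning rate decay as described above
# (decay_rate = 0.05, decay_steps = 1000).
def inverse_time_decay(lr0, steps, decay_rate=0.05, decay_steps=1000):
    return lr0 / (1.0 + decay_rate * steps / decay_steps)

lr_start = inverse_time_decay(1e-3, 0)      # initial learning rate
lr_later = inverse_time_decay(1e-3, 2000)   # after 2000 steps: 1e-3 / 1.1
```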
Besides, the tested pre-trained language models are as follows: • [ELMo] (Peters et al., 2018): Character-based pre-trained language model. We use the large checkpoint, with embeddings of dimension 3072. We use the implementations provided by Wolf et al. (2019) and Akbik et al. (2019) to generate contextualized embeddings and attention weights. Specifically, we generate the contextualized word embedding by averaging all sub-word embeddings in the last four layers; we generate the attention weight feature (if available) by summing all sub-word attention weights for each word, which are then concatenated for all layers and all heads. Both of them are fixed without fine-tuning.

Context along row and column Neighbors along both the row and column dimensions are important. The settings "layer⁺row⁺col; layer⁺row⁻col" and "layer⁺row col⁺; layer⁺row col⁻" remove the column and row dimensions respectively; their performance is better than that of "layer⁺row col", but worse than that of "layer⁺row⁺col⁺; layer⁺row⁻col⁻".

D Ways to Leverage the Table Context
Multiple dimensions In setting "layer⁺row⁺col⁺", the cell at row i and column j only has access to the context preceding the i-th and the j-th words, which leads to worse performance than the bidirectional ("layer⁺row⁺col⁺ ; layer⁺row⁻col⁻" and "layer⁺row⁺col⁻ ; layer⁺row⁻col⁺") and quad-directional ("layer⁺row⁺col⁺ ; layer⁺row⁻col⁻ ; layer⁺row⁺col⁻ ; layer⁺row⁻col⁺") settings. Moreover, the quad-directional model does not outperform the bidirectional ones, so we use the latter by default.
Layer dimension Unlike the row and column dimensions, the layer dimension does not carry additional sentential context. Instead, it carries information from previous layers, so the model can reason about high-level relations based on low-level dependencies captured by preceding layers, which may help recognize syntactically and semantically complex relations. Moreover, recurring along the layer dimension can also be viewed as a layer-wise shortcut, serving a similar role to highway networks (Srivastava et al., 2015) and residual connections (He et al., 2016), and making it possible for the network to be very deep. Removing it (results under "layer row⁺col⁺ ; layer row⁻col⁻") harms performance.
Other networks Our model architecture can be adapted to other table encoders. We also tried a CNN to encode the table representation. For each layer $l$, given inputs $X_l$, we have:

$T^{(1)}_l = \mathrm{ReLU}(\mathrm{LayerNorm}(\mathrm{CNN}(T^{(0)}_l)))$
$T_l = \mathrm{ReLU}(T_{l-1} + \mathrm{LayerNorm}(\mathrm{CNN}(T^{(1)}_l)))$

We also tried different kernel sizes for the CNN. However, despite its advantage in training time, its performance is worse than that of the MD-RNN-based encoders.
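A minimal runnable sketch of such a residual CNN table-encoder layer, assuming $T^{(0)}_l = X_l$ and, for brevity, a 1×1 kernel (so the convolution reduces to a per-cell linear map); all function names and shapes here are ours:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the channel (last) dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv1x1(x, w, b):
    # A 1x1 convolution over an (N, N, C) table is a per-cell linear map.
    return x @ w + b

def cnn_table_layer(x_prev, t_prev, w1, b1, w2, b2):
    """One CNN-based table-encoder layer (sketch):
    T1  = ReLU(LayerNorm(CNN(X_l)))
    T_l = ReLU(T_{l-1} + LayerNorm(CNN(T1)))"""
    t1 = np.maximum(0.0, layer_norm(conv1x1(x_prev, w1, b1)))
    return np.maximum(0.0, t_prev + layer_norm(conv1x1(t1, w2, b2)))

rng = np.random.default_rng(0)
N, C = 5, 8  # sentence length, hidden size
x = rng.normal(size=(N, N, C))       # layer input X_l
t = rng.normal(size=(N, N, C))       # previous table T_{l-1}
w1, b1 = rng.normal(size=(C, C)), np.zeros(C)
w2, b2 = rng.normal(size=(C, C)), np.zeros(C)
out = cnn_table_layer(x, t, w1, b1, w2, b2)
print(out.shape)  # (5, 5, 8)
```

A larger kernel would mix neighboring cells before the linear map, but the LayerNorm-then-ReLU residual structure stays the same.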

E Table Filling Formulations
Our table filling formulation does not exactly follow Miwa and Sasaki (2014). Specifically, we fill the entire table instead of only the lower (or upper) triangular part, and we assign relation tags to cells where entity spans intersect rather than only where their last words intersect. Although the entire table could express directed relations with undirected tags, we keep the directed relation tags to maintain the ratio of positive to negative instances. That is, if $y^{\mathrm{RE}}_{i,j} = \overrightarrow{r}$ then $y^{\mathrm{RE}}_{j,i} = \overleftarrow{r}$, and vice versa. Table 9 ablates our formulation (last row) and compares it with the original one (Miwa and Sasaki, 2014) (first row).
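As a toy illustration of this formulation (the sentence, entity spans, tag strings, and helper names below are ours, not from the paper):

```python
def fill_table(n, entities, relations):
    """entities: list of (start, end, type) token spans (end exclusive);
    relations: list of (head_entity_idx, tail_entity_idx, relation)."""
    table = [["O"] * n for _ in range(n)]
    for s, e, typ in entities:
        for i in range(s, e):
            table[i][i] = typ  # NER tags sit on the diagonal
    for h, t, r in relations:
        hs, he, _ = entities[h]
        ts, te, _ = entities[t]
        # Directed tags fill ALL cells where the two entity spans intersect,
        # not only the cell of their last words.
        for i in range(hs, he):
            for j in range(ts, te):
                table[i][j] = "->" + r   # y_RE[i][j] = r (forward)
                table[j][i] = "<-" + r   # y_RE[j][i] = r (backward counterpart)
    return table

# "John lives in New York": PER span [0,1), LOC span [3,5), LIVES_IN(John, New York)
tab = fill_table(5, [(0, 1, "PER"), (3, 5, "LOC")], [(0, 1, "LIVES_IN")])
print(tab[0][3], tab[3][0], tab[3][3])  # ->LIVES_IN <-LIVES_IN LOC
```

Note that the full table is populated: both the forward tag at (0, 3) and its mirrored backward tag at (3, 0) are present.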

F Probing Intermediate States
In Figure 9a, the model made a wrong prediction

Figure 1: An example of table filling for NER and RE.

Figure 3: A layer in the table-sequence encoders.

Figure 5: The generalized form of attention. The softmax function is used to normalize the weights of the values V for each query Q_i.

Figure 6: Comparison between the ground truth and selected heads of the table-guided attention. The sentence is randomly selected from the development set of ACE05.

Figure 9 presents examples picked from the development set of ACE05. The prediction layer (a linear layer) after training is used as a probe to display the intermediate states of the model, so we can interpret how the model improves both representations by stacking multiple layers and through the bidirectional interaction. Such probing is valid since, for the table encoder, the encoding spaces of different cells are consistent as they are connected through the gate mechanism, including cells in different encoding layers; for the sequence encoder, we use residual connections, so the encoding spaces of the inputs and outputs are consistent. Therefore, they are all compatible with the prediction layer. Empirically, the intermediate layers did give valid predictions, although they are not directly trained for prediction.
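The probing step amounts to reusing one trained linear prediction layer across all intermediate states; a hypothetical sketch (shapes and names are ours):

```python
import numpy as np

def probe_layers(states, W, b):
    """Reuse the trained prediction layer (a linear map) as a probe:
    apply it to the representation produced by every encoding layer.
    Consistent encoding spaces across layers (via gates / residual
    connections) make all states compatible with the same probe."""
    return [np.argmax(h @ W + b, axis=-1) for h in states]

rng = np.random.default_rng(1)
L, N, C, K = 3, 4, 8, 5  # layers, sentence length, hidden size, tag-set size
states = [rng.normal(size=(N, C)) for _ in range(L)]  # stand-ins for layer outputs
W, b = rng.normal(size=(C, K)), np.zeros(K)           # stand-in trained probe
preds = probe_layers(states, W, b)
print(len(preds), preds[0].shape)  # 3 (4,)
```

Comparing `preds` across layers is what lets one trace where a prediction flips from wrong to right as depth increases.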

Table 2: Using different pre-trained language models on ACE05. +x uses the contextualized word embeddings; +T uses the attention weights.

Table 3: Ablation of the two encoders on ACE05. Gold entity spans are given in RE (gold).

Table 5: The effect of the dimensions and directions of MD-RNNs. Experiments are conducted on ACE05. The underlined ones are from our default setting. Their results improve when more directions are added, showing that richer contextual information is helpful.

Table 6: Dataset statistics.

The time complexity of the naive implementation (i.e., two for-loops in each layer) is O(L × N²) for a sentence of length N and L encoding layers. However, antidiagonal entries can be computed at the same time because their values do not depend on each other, as shown in the same color in Figure 8. Therefore, we can parallelize the computation and reduce the effective time complexity to O(L × N).
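The antidiagonal scheduling can be sketched as follows (the function name is ours; the dependency pattern assumed is that of a row⁺col⁺ MD-RNN, where each cell depends only on its top and left neighbors):

```python
def antidiagonal_schedule(n):
    """Group cells (i, j) of an n x n table by antidiagonal k = i + j.
    For a row+ col+ MD-RNN, cell (i, j) depends on (i-1, j) and (i, j-1),
    both on antidiagonal k-1, so each group can be computed in parallel:
    2n - 1 wavefront steps instead of n * n sequential cell updates."""
    return [[(i, k - i) for i in range(max(0, k - n + 1), min(k, n - 1) + 1)]
            for k in range(2 * n - 1)]

steps = antidiagonal_schedule(3)
print(len(steps))  # 5 wavefront steps for a 3 x 3 table
print(steps[2])    # [(0, 2), (1, 1), (2, 0)]
```

Since the effective number of sequential steps per layer is 2N − 1, the overall complexity drops to O(L × N) given enough parallel hardware.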
Table 8 presents comparisons of different ways to learn the table representation. Importance of context Setting "layer⁺row col" does not exploit the table context when learning the table representation; instead, only layer-wise operations are used. As a result, it performs much worse than the settings that exploit the context, confirming the importance of leveraging contextual information.

Table 9: Comparisons of different table filling formulations. When the entire table is not filled, "L" fills only the lower-triangular part and "U" fills only the upper-triangular part.