Probing Linguistic Features of Sentence-Level Representations in Relation Extraction

Despite recent progress, little is known about the features captured by state-of-the-art neural relation extraction (RE) models. Common methods encode the source sentence, conditioned on the entity mentions, before classifying the relation. However, the complexity of the task makes it difficult to understand how the encoder architecture and supporting linguistic knowledge affect the features learned by the encoder. We introduce 14 probing tasks targeting linguistic properties relevant to RE, and use them to study representations learned by more than 40 combinations of encoder architecture and linguistic features, trained on two datasets, TACRED and SemEval 2010 Task 8. We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance. For example, adding contextualized word representations greatly increases performance on probing tasks with a focus on named entity and part-of-speech information, and yields better results on the RE task. In contrast, entity masking improves RE performance, but considerably lowers performance on entity-type-related probing tasks.


Introduction
Relation extraction (RE) is concerned with extracting relationships between entities mentioned in text, where relations correspond to semantic categories such as org:founded_by, person:spouse, or org:subsidiaries (Figure 1). Neural models have shown impressive results on this task, achieving state-of-the-art performance on standard datasets such as SemEval 2010 Task 8 (dos Santos et al., 2015; Wang et al., 2016; Lee et al., 2019), TACRED (Zhang et al., 2018; Alt et al., 2019b; Peters et al., 2019; Joshi et al., 2019), and NYT (Lin et al., 2016; Vashishth et al., 2018; Alt et al., 2019a). The majority of models implement an encoder architecture to learn a fixed-size representation of the input, e.g. a sentence, which is passed to a classification layer to predict the target relation label.
These good results suggest that the learned representations capture linguistic and semantic properties of the input that are relevant to the downstream RE task, an intuition previously discussed for a variety of other NLP tasks by Conneau et al. (2018). However, it is often unknown which exact properties the various models have learned. Our aim is to pinpoint the information a given RE model relies on, in order to improve model performance as well as to diagnose errors.
A general approach to model introspection is the use of probing tasks. Probing tasks (Shi et al., 2016; Adi et al., 2017), also called diagnostic classifiers, are a well-established method for analyzing the presence of specific information in a model's latent representations, e.g. in machine translation (Belinkov et al., 2017), language modeling (Giulianelli et al., 2018), and sentence encoding (Conneau et al., 2018). For each probing task, a classifier is trained on a set of representations, and its performance measures how well the information is encoded. The probing task itself is typically selected in accordance with the downstream task, e.g. an encoder trained on RE may be probed for the entity type of a relation argument. If the classifier correctly predicts the type, this implies the encoder retains entity type information in the representations, which also directly informs the relation prediction. The simplicity of this approach makes it easier to pinpoint the information a model relies on, as opposed to probing the downstream task directly.
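To make the probing protocol above concrete, the following sketch fits a diagnostic classifier on frozen sentence representations and measures its accuracy. The paper uses a logistic regression classifier (see the table captions); as a dependency-free stand-in we use a toy nearest-centroid classifier, and all vectors and labels below are invented illustrative data.

```python
# Probing protocol sketch: a lightweight diagnostic classifier is fit on
# frozen sentence representations; its accuracy measures how well a linguistic
# property (here, a toy "entity type" label) is encoded.

def fit_centroids(reps, labels):
    """Average the representations of each class to obtain one centroid per label."""
    sums, counts = {}, {}
    for vec, lab in zip(reps, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [x / counts[lab] for x in acc] for lab, acc in sums.items()}

def predict(centroids, vec):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(centroids, key=lambda lab: dist(centroids[lab]))

# Toy probe: do these frozen representations encode the entity type?
train_reps = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["PERSON", "PERSON", "ORG", "ORG"]
centroids = fit_centroids(train_reps, train_labels)
accuracy = sum(predict(centroids, v) == l
               for v, l in zip(train_reps, train_labels)) / len(train_reps)
```

High probing accuracy is then read as evidence that the property is linearly recoverable from the representation, not as evidence that the downstream model actually uses it.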
Our goal in this paper is to understand which features of the input a model trained on relation extraction has learned to find useful for the task, in order to better interpret and explain model predictions. The relation extraction literature is rich with information about useful features for the task (Zhou et al., 2005; Mintz et al., 2009; Surdeanu et al., 2011). Consequently, our initial question is whether, and how well, the sentence representations learned by state-of-the-art neural RE models encode these well-known features, such as argument entity types, dependency paths, or argument distance. Another question is how the prior imposed by different encoding architectures, e.g. CNN, RNN, Graph Convolutional Network, and Self-Attention, affects the features stored in the learned sentence representations. Finally, we would like to understand the effect of additional input features on the learned sentence representations. These include explicit semantic and syntactic knowledge, such as entity information and grammatical role, as well as recently proposed contextualized word representations such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018). We therefore significantly extend earlier work on probing tasks as follows:

• Following the framework of Conneau et al. (2018), we propose a set of 14 probing tasks specifically focused on linguistic properties relevant to relation extraction.
• We evaluate four encoder architectures, also in combination with supporting linguistic knowledge, on two datasets, TACRED (Zhang et al., 2017) and SemEval 2010 Task 8 (Hendrickx et al., 2010), for a total of more than 40 variants.
• We follow up on this analysis with an evaluation on the proposed probing tasks to establish a connection between task performance and captured linguistic properties.


Probing Tasks
Syntactic features are well established in relation extraction, e.g. dependency structure (et al., 2012; Mintz et al., 2009) or part-of-speech tags (Zhou et al., 2005; Surdeanu et al., 2011). We therefore include the tree depth task (TreeDepth) described by Conneau et al. (2018). This task tests whether an encoder can group sentences by the depth of the longest path from the root to any leaf. We group tree depth values into 10 (TACRED) and 7 (SemEval) approximately uniformly distributed classes, ranging from depth 1 to depth 15.
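The TreeDepth label computation described above can be sketched as follows: compute the longest root-to-leaf path in the dependency tree and bucket it into a fixed number of classes. The child-list tree encoding and the equal-width bucketing below are our assumptions; the paper only states that depths 1-15 are grouped into approximately uniformly distributed classes.

```python
# TreeDepth label sketch: depth of the longest root-to-leaf path, bucketed.

def tree_depth(children, node=0):
    """Longest root-to-leaf path length; a single node has depth 1."""
    if not children.get(node):
        return 1
    return 1 + max(tree_depth(children, c) for c in children[node])

def depth_class(depth, min_depth=1, max_depth=15, num_classes=10):
    """Map a clamped depth onto one of num_classes equal-width buckets."""
    depth = max(min_depth, min(depth, max_depth))
    width = (max_depth - min_depth + 1) / num_classes
    return int((depth - min_depth) / width)

# Toy tree: root 0 -> {1, 2}, node 2 -> {3}; the longest path spans 3 nodes
children = {0: [1, 2], 2: [3]}
```

In the paper the buckets are chosen to be approximately uniform over the data distribution rather than equal-width, so the boundaries would differ in practice.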
To account for shortest dependency path (SDP) information, we include an SDP tree depth task (SDPTreeDepth), which tests whether the learned sentence embedding stores information about the syntactic link between the relation arguments. Again, we group SDP tree depth values into bins, in this case only 6 (TACRED) and 4 (SemEval) classes, since SDP trees are generally more shallow than the original sentence dependency parse tree. The argument ordering task (ArgOrd) tests whether the head argument of a relation occurs before the tail argument in the token sequence. An encoder that successfully addresses this challenge captures some information about syntactic structures in which the order of a relation's arguments is inverted, e.g. constructions such as "The acquisition of Monsanto by Bayer", as compared to default constructions like "Bayer acquired Monsanto". We also include four tasks that test for the part-of-speech tag of the token directly to the left or right of the relation's arguments: PosHeadL, PosHeadR, PosTailL, PosTailR. These tasks test whether the encoder is sensitive to the immediate context of an argument. Some relation types, e.g. per:nationality or org:top_members/employees, can often be identified from the immediate argument context, e.g. "US president-NN Donald Trump" or "Google 's-POSS CEO-NN Larry Page". Representing this type of information in the sentence embedding should be useful for relation classification.
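The ArgOrd and PosHead{L,R}/PosTail{L,R} labels described above can be derived mechanically from an annotated sentence. The sketch below assumes inclusive token-index spans and a PAD tag for arguments at the sentence boundary; both conventions are our assumptions, not taken from the paper.

```python
# Probing label sketch for ArgOrd and the four POS-context tasks.

def arg_order_label(head_span, tail_span):
    """1 if the head argument starts before the tail argument, else 0."""
    return 1 if head_span[0] < tail_span[0] else 0

def pos_context_labels(pos_tags, span, pad="PAD"):
    """POS tags of the tokens immediately left and right of an argument span."""
    start, end = span
    left = pos_tags[start - 1] if start > 0 else pad
    right = pos_tags[end + 1] if end + 1 < len(pos_tags) else pad
    return left, right

# "The acquisition of Monsanto by Bayer": head (Bayer) follows tail (Monsanto)
pos = ["DT", "NN", "IN", "NNP", "IN", "NNP"]
head_span, tail_span = (5, 5), (3, 3)
```

On this inverted construction the ArgOrd label is 0, whereas "Bayer acquired Monsanto" would yield 1.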
Argument information Finally, we include probing tasks that require some understanding of what each argument denotes. The argument entity type tasks (TypeHead, TypeTail) ask for the entity tag of the head and, respectively, the tail argument. Entity type information is highly relevant for relation extraction systems, since it strongly constrains the set of possible relation labels for a given argument pair. We treat these tasks as multi-class classification problems over the set of possible argument entity tags (see Section 3.3).
Our last task concerns the grammatical function of relation arguments. The grammatical role tasks (GRHead, GRTail) ask for the role of each argument, as given by the dependency label connecting the argument and its syntactic head token. The motivation is that the subject and object of verbal constructions often correspond to relation arguments for some relation types, e.g. "Bayer acquired Monsanto". We currently test for four roles, namely nsubj, nsubjpass, dobj, and iobj, and group all other dependency labels into an other class. Note that there are other grammatical relations that may be of interest for relation extraction, for example possessive modifiers ("Google's Larry Page"), compounds ("Google CEO Larry Page"), and appositions ("Larry Page, CEO of Google").
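The GRHead/GRTail label scheme described above reduces to a small lookup: keep the dependency label if it is one of the four core roles, otherwise map it to the other class.

```python
# GRHead/GRTail label sketch, following the four-role scheme described above.

CORE_ROLES = {"nsubj", "nsubjpass", "dobj", "iobj"}

def grammatical_role_label(dep_label):
    """Keep the four core roles; collapse everything else into 'other'."""
    return dep_label if dep_label in CORE_ROLES else "other"

# In "Bayer acquired Monsanto", Bayer is nsubj and Monsanto is dobj;
# a possessive modifier such as in "Google's Larry Page" maps to 'other'.
```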

Experiment Setup
This section first introduces the four sentence encoding architectures we consider for evaluation (§3.1), followed by a description of the supporting linguistic knowledge we evaluate: entity masking and contextualized word representations (§3.2). We also introduce the two datasets we use for training the relation extraction models and probing the sentence representations (§3.3).

Sentence Encoders
Generally, methods in relation extraction follow a sequence-to-vector approach, encoding the input (often a single sentence) into a fixed-size representation before applying a fully connected relation classification layer (Figure 2). A single input is represented as a sequence of T tokens {w_t}_{t=1,...,T}, and the spans (head_start, head_end) and (tail_start, tail_end) of the two entity mentions in question. We focus our evaluation on four widely used approaches that have been shown to perform well on RE. For all architectures, we signal the position of head and tail by encoding the relative offset of each token w_i to the two spans as positional embeddings, which are combined with the token embedding e_{w_i} ∈ R^d.
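The positional features above can be sketched as follows: each token receives its signed offset to the head span and to the tail span, which would then index a learned positional embedding table. Clamping offsets inside a span to 0 is our assumption about the convention; the paper does not spell it out.

```python
# Relative-offset sketch for the positional embeddings described above.

def relative_offsets(num_tokens, span):
    """Signed distance of every token position to the (inclusive) span."""
    start, end = span
    offsets = []
    for i in range(num_tokens):
        if i < start:
            offsets.append(i - start)   # negative: token precedes the span
        elif i > end:
            offsets.append(i - end)     # positive: token follows the span
        else:
            offsets.append(0)           # token inside the span
    return offsets

# "Bayer acquired Monsanto" with head span (0, 0) and tail span (2, 2)
head_offsets = relative_offsets(3, (0, 0))
tail_offsets = relative_offsets(3, (2, 2))
```

Each offset would be mapped through an embedding lookup and concatenated with e_{w_i} before entering the encoder.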
CNN We follow the work of Zeng et al. (2014) and Nguyen and Grishman (2015), who both use a convolutional neural network for relation extraction. Their models encode the input token sequence {w_t}_{t=1,...,T} by applying a series of 1-dimensional convolutions with different filter sizes, yielding a set of output feature maps M_f, followed by a max pooling operation that selects the maximum values along the temporal dimension of M_f to form a fixed-size representation.
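The convolution-plus-pooling step above can be illustrated with a minimal, framework-free sketch: a single filter slides over the token embeddings, and max pooling over time keeps its strongest activation. The embeddings and filter weights are fixed toy values, not learned parameters, and the bias and nonlinearity are omitted.

```python
# CNN encoder sketch: valid 1-D convolution followed by max-over-time pooling.

def conv1d(embeddings, filt):
    """Dot product of the filter with each window of token embeddings."""
    k = len(filt)
    out = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        out.append(sum(w * x for row_w, row_x in zip(filt, window)
                       for w, x in zip(row_w, row_x)))
    return out

def max_over_time(feature_map):
    """Keep the strongest activation of this filter across all positions."""
    return max(feature_map)

# Three tokens with 2-dim embeddings, one filter of width 2
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]
fmap = conv1d(emb, filt)
pooled = max_over_time(fmap)
```

With several filters of different widths, the pooled values are concatenated into the fixed-size sentence representation.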
Bi-LSTM max Similar to Zhang and Wang (2015) and Zhang et al. (2017), we use a Bi-LSTM to encode the input sequence. A Bi-LSTM yields a sequence of hidden states {h_t}_{t=1,...,T}, where each h_t is the concatenation of the states of a forward LSTM h_t^f and a backward LSTM h_t^b. Similar to the CNN, we use max pooling across the temporal dimension to obtain a fixed-size representation.
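The pooling step above can be sketched without the recurrent machinery: concatenate the forward and backward states per time step, then take the element-wise maximum over time. The state vectors below are toy values standing in for real LSTM outputs.

```python
# Bi-LSTM max sketch: concatenate directions, then element-wise max over time.

def concat_states(forward, backward):
    """h_t = [h_t^f ; h_t^b] for every time step t."""
    return [f + b for f, b in zip(forward, backward)]

def max_pool(states):
    """Element-wise maximum across the temporal dimension."""
    return [max(col) for col in zip(*states)]

fwd = [[0.1, 0.5], [0.9, 0.2]]
bwd = [[0.3, 0.4], [0.8, 0.1]]
hidden = concat_states(fwd, bwd)
sentence_rep = max_pool(hidden)
```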
GCN Graph convolutional networks (Kipf and Welling, 2016) adapt convolutional neural networks to graphs. Following the approach of Zhang et al. (2018), we treat the input token sequence {w_t}_{t=1,...,T} as a graph consisting of T nodes, with an edge between w_i and w_j if there exists a dependency edge between the two tokens. We convert the dependency tree into a T × T adjacency matrix, after pruning the graph to the shortest dependency path between head and tail. An L-layer GCN applied to {w_t}_{t=1,...,T} yields a sequence of hidden states {h_t}_{t=1,...,T}, contextualized on neighboring tokens within a graph distance of at most L. A fixed-size representation is formed by max pooling over the temporal dimension, combined with local max pooling over the tokens {w_t} for t ∈ [head_start, ..., head_end], and similarly for t ∈ [tail_start, ..., tail_end].
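The graph construction above can be sketched as follows: the (pruned) dependency tree becomes a symmetric T × T adjacency matrix. Adding self-loops is a common GCN convention and our assumption here; the pruning step itself is omitted.

```python
# GCN input sketch: symmetric adjacency matrix from dependency edges.

def build_adjacency(num_tokens, dep_edges, self_loops=True):
    """T x T adjacency matrix; dependency edges are treated as undirected."""
    adj = [[0] * num_tokens for _ in range(num_tokens)]
    for i, j in dep_edges:
        adj[i][j] = 1
        adj[j][i] = 1
    if self_loops:
        for i in range(num_tokens):
            adj[i][i] = 1
    return adj

# "Bayer acquired Monsanto": acquired -> Bayer (nsubj), acquired -> Monsanto (dobj)
adj = build_adjacency(3, [(1, 0), (1, 2)])
```

Each GCN layer then updates a token's state from its row in this matrix, so L layers propagate information across a graph distance of at most L.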
Multi-Headed Self-Attention Similar to the Transformer (Vaswani et al., 2017), we compute a sequence of contextualized representations {h_t}_{t=1,...,T} by applying L layers of multi-headed self-attention to the input token sequence {w_t}_{t=1,...,T}. The representation h_t of w_t is computed as a weighted sum of a projection V of the input tokens, with the weights given by the scaled, normalized dot product of Q and K, which are both also linear projections of the input; the procedure is repeated for each attention head. A fixed-size representation is obtained by taking the final state h_T at the last layer L.
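The core operation above is scaled dot-product attention. The sketch below shows a single head with Q, K, V taken directly as inputs; the learned linear projections, the multi-head split, and the layer stacking are omitted, and the vectors are toy values.

```python
import math

# Single-head scaled dot-product attention, the building block described above.

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """For each query: weights = softmax(q . k / sqrt(d)), output = sum w_i v_i."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# One query attending over two positions; it is more similar to the first key,
# so the output lies closer to the first value vector.
ctx = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```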

Supporting Linguistic Knowledge
Adding additional lexical, syntactic, and semantic input features to neural RE approaches has been shown to considerably improve performance (Zeng et al., 2014; Zhang et al., 2017, 2018). Features include, e.g., casing, named entity, part-of-speech, and dependency information. Most recently, pre-learned contextualized word representations (deep language representations) have emerged, capturing syntactic and semantic information useful to a wide range of downstream tasks (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). We therefore evaluate the effect of adding explicit named entity and grammatical role information (through entity masking) on our learned sentence representations, and compare it to adding contextualized word representations computed by ELMo (Peters et al., 2018) and BERT (Devlin et al., 2018) as additional input features.
Entity Masking Entity masking has been shown to provide a significant gain in RE performance on the TACRED dataset (Zhang et al., 2017): each entity mention is replaced with a combination of its entity type and grammatical role (subject or object). This limits the information about entity mentions available to a model, possibly preventing overfitting to specific mentions and forcing the model to focus more on the context.
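The masking transform above is a simple token replacement. The sketch below collapses each mention span into a single TYPE-ROLE token; the exact mask token format is our assumption, loosely following the SUBJ-TYPE/OBJ-TYPE convention used for TACRED by Zhang et al. (2017).

```python
# Entity-masking sketch: replace mention tokens with TYPE-ROLE mask tokens.

def mask_entities(tokens, head_span, tail_span, head_type, tail_type):
    """Collapse each (inclusive) mention span into one mask token."""
    masked = []
    for i, tok in enumerate(tokens):
        if head_span[0] <= i <= head_span[1]:
            if i == head_span[0]:           # emit one mask token per span
                masked.append(f"SUBJ-{head_type}")
        elif tail_span[0] <= i <= tail_span[1]:
            if i == tail_span[0]:
                masked.append(f"OBJ-{tail_type}")
        else:
            masked.append(tok)
    return masked

tokens = ["Bayer", "acquired", "Monsanto"]
masked = mask_entities(tokens, (0, 0), (2, 2), "ORGANIZATION", "ORGANIZATION")
```

After masking, the model can no longer memorize the surface strings "Bayer" or "Monsanto" and must rely on the context and the type/role signal.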
ELMo Embeddings from Language Models, as introduced by Peters et al. (2018), are an approach to computing contextualized word representations by applying a pre-learned, two-layer Bi-LSTM neural network to an input token sequence {w_t}_{t=1,...,T}. ELMo operates on the character level and is pre-trained with the forward and backward directions as separate unidirectional language models. It yields a representation for each token w_i, with h_i^f conditioned on the preceding context {w_t}_{t=1,...,i−1} and, independently, h_i^b conditioned on the succeeding context {w_t}_{t=i+1,...,T}.
BERT Bidirectional Encoder Representations from Transformers (Devlin et al., 2018) improves upon methods such as ELMo and the OpenAI Generative Pre-trained Transformer (GPT) (Radford et al., 2018) by using a masked language model objective that allows for jointly training the forward and backward directions. Compared to ELMo, BERT operates on word-piece input and is based on the self-attentive Transformer architecture (Vaswani et al., 2017). It computes a representation for a token w_i jointly conditioned on the preceding context {w_t}_{t=1,...,i−1} and the succeeding context {w_t}_{t=i+1,...,T}.

For SemEval, we follow the official convention and report macro-averaged F1 scores with directionality taken into account.

Results
Table 2 and Table 3 summarize the probing task and RE results on TACRED and SemEval, respectively. Baseline performances are reported in the top section of Table 2 and Table 3. SentLen and ArgDist are linear classifiers that use sentence length and the distance between head and tail argument, respectively, as their only feature. BoE computes a representation of the input sentence by summing over the embeddings of all tokens it contains. Generally, there is a large gap between the top baseline performance and that of a trained encoder. While SentLen and ArgDist are trivially solved by the respective linear classifier, BoE shows surprisingly good performance on SentLen and ArgOrd, and a clear improvement over the other baselines on named entity- and part-of-speech-related probing tasks.
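The BoE baseline above is simply the sum of the token embeddings, with no trained encoder. A minimal sketch, using toy embedding vectors:

```python
# Bag-of-embeddings (BoE) baseline sketch: the sentence representation is the
# element-wise sum of all token embeddings.

def bag_of_embeddings(embeddings):
    """Sum the embedding vectors of all tokens in the sentence."""
    dim = len(embeddings[0])
    rep = [0.0] * dim
    for vec in embeddings:
        for i, x in enumerate(vec):
            rep[i] += x
    return rep

rep = bag_of_embeddings([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Because the sum grows with the number of tokens and reflects which tokens are present, this representation already carries sentence-length and lexical signals, which is consistent with BoE's strong SentLen performance noted above.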
Encoder Architecture For most probing tasks, except SentLen and ArgOrd, a proper encoder clearly outperforms the bag-of-embeddings baseline (BoE), which is consistent with the findings of Adi et al. (2017) and Conneau et al. (2018). Similarly, the results indicate that the prior imposed by the encoder architecture preconditions the information encoded in the learned embeddings. Models with a local or recency bias (CNN, Bi-LSTM) perform well on probing tasks with a local focus, such as PosHead{L,R} and PosTail{L,R}, and on distance-related tasks (ArgDist, ArgOrd). Similarly, models with access to dependency information (GCN) perform well on tree-related tasks (SDPTreeDepth). Due to the graph pruning step (Zhang et al., 2018), the GCN is left with a limited view of the dependency tree, which explains its low performance on TreeDepth. Surprisingly, while Self-Attention exhibits superior performance on the RE task, it consistently scores lower on the probing tasks than the other encoding architectures. This could indicate that Self-Attention encodes "deeper" linguistic information into the sentence representation, not covered by the current set of probing tasks.
Probing Tasks Compared to the baselines, all proper encoders exhibit consistently high performance on TypeHead and TypeTail, clearly highlighting the importance of entity type information for RE. In contrast, encoders trained on the downstream task perform worse on SentLen, which intuitively makes sense, since sentence length is mostly irrelevant for RE. This is consistent with Conneau et al. (2018), who found SentLen performance to decrease for models trained on more complex downstream tasks, e.g. neural machine translation, strengthening the assumption that, as a model captures deeper linguistic properties, it will tend to forget this superficial feature. With the exception of the CNN, all encoders consistently show low performance on the argument distance (ArgDist) task. A similar pattern can be observed for ArgOrd, where models that are biased towards locality (CNN and Bi-LSTM) perform better, while models that are able to efficiently model long-range dependencies, such as the GCN and S-Att., show lower performance. The superior RE task performance of the latter indicates that their bias may allow them to learn "deeper" linguistic features.
The balanced performance of the CNN, Bi-LSTM, and GCN encoders across the part-of-speech-related tasks (PosHeadL, PosHeadR, PosTailL, PosTailR) highlights the importance of part-of-speech features for RE, again with the exception of S-Att., which performs only slightly above the baselines. On TreeDepth and SDPTreeDepth (with the GCN as the exception), average performance in many cases ranges only slightly above baseline performance, suggesting that TreeDepth requires more nuanced syntactic information, which the models fail to acquire. The good performance on the grammatical role tasks (GRHead, GRTail) once more emphasizes the relevance of this feature for RE, with the GCN exhibiting the best performance on average. This is unsurprising, because the GCN focuses on token-level information along the dependency path connecting the arguments, and hence seems able to capture grammatical relations among tokens more readily than the other encoders (even though the GCN does not have access to the dependency labels themselves).
Entity Masking Perhaps most interestingly, masking entity mentions with their respective named entity type and grammatical role considerably lowers performance on the entity-type-related tasks (TypeHead and TypeTail). This indicates that masking shifts the encoder's focus away from the entity mentions, which is confirmed by the performance decrease on probing tasks with a focus on argument position and distance, e.g. ArgDist, ArgOrd, and SentLen. The CNN and Bi-LSTM encoders exhibit the greatest decrease in performance, suggesting severe overfitting to specific entity mentions when no masking is applied. In comparison, the GCN shows less tendency to overfit. Surprisingly, with entity masking the self-attentive encoder (S-Att.) increases its focus on entity mentions and their surroundings, as suggested by the performance increase on the distance- and argument-related probing tasks.
Word Representations Adding contextualized word representations computed by ELMo or BERT greatly increases performance on probing tasks with a focus on named entity and part-of-speech information. This indicates that contextualized word representations encode syntactic and semantic features relevant to RE, which is consistent with the findings of Peters et al. (2018) and Radford et al. (2018), who both highlight the effectiveness of the linguistic features encoded in contextualized word representations (deep language representations) for downstream tasks. The improved syntactic and semantic abilities are also reflected in an overall improvement in RE task performance. Compared to ELMo, encoders with BERT generally exhibit better and more balanced performance on the probing tasks. This is also reflected in superior RE performance, suggesting that a jointly bidirectional language model encodes the linguistic properties of the input more effectively. Somewhat surprisingly, BERT without casing performs equally well or better on the probing tasks focused on entity and part-of-speech information, compared to the cased version. While this intuitively makes sense for SemEval, as the dataset focuses on semantic relations between concepts, it is surprising for TACRED, which contains relations between proper entities, e.g. person and company names, where casing information is more important for identifying the entity type.
Probing vs. Relation Extraction One interesting observation is that encoders that perform better on the probing tasks do not necessarily perform better on the downstream RE task. For example, CNN+ELMo scores highest on most of the probing tasks, but has an F1 score 8.1 points lower than the best model on this dataset, S-Att.+BERT cased with masking. Similarly, all variants of the self-attentive encoder (S-Att.) show superior performance on RE but consistently come up last on the probing tasks, occasionally performing just above the baselines. Conneau et al. (2018) observed a similar phenomenon for encoders trained on neural machine translation.

Relation Extraction
The relation extraction task performance on the TACRED dataset ranges between 55.3 F1 (Bi-LSTM) and 57.6 F1 (S-Att.), with performance improving to around 58.8 to 64.7 F1 when adding pre-learned, contextualized word representations. As observed in previous work (Zhang et al., 2017), masking helps the encoders generalize better, with gains of around 4 to 8 F1 compared to the vanilla models. This is mainly due to better recall, which indicates that without masking, models may overfit, e.g. by memorizing specific entity names. The best-performing model achieves a score of 66.9 F1 (S-Att.+BERT cased with masking).
On the SemEval dataset, the performance of the vanilla models is around 80.0 F1. Adding contextualized word representations significantly improves the performance of all models, by 3.5 to 6 F1. The best-performing model on this dataset is a CNN with uncased BERT embeddings, with an F1 score of 86.3, which is comparable to state-of-the-art models (Wang et al., 2016; Cai et al., 2016).


Related Work
Conneau et al. (2018) introduced probing tasks for general sentence encoders trained on NMT and NLI for general text classification. Their setup, however, is not directly applicable to relation extraction, because the RE task requires not only the input sentence, but also the entity arguments. We therefore extend their framework to accommodate the RE setting. Another difference to their work is that while their probing tasks focus on linguistic properties of general sentence encoders, we specifically focus on relation extraction. To that end, we introduce a set of 14 probing tasks, including SentLen and TreeDepth, specifically designed to probe linguistic properties relevant to relation extraction.

Conclusion
We introduced a set of probing tasks to study the linguistic features captured in sentence encoder representations trained on relation extraction. We conducted a comprehensive evaluation of common RE encoder architectures, and studied the effect of explicitly and implicitly provided semantic and syntactic knowledge, uncovering interesting properties of the architectures and input features. For example, we found self-attentive encoders to be well suited for RE on sentences of varying complexity, though they consistently perform lower on the probing tasks, hinting that these architectures capture "deeper" linguistic features. We also showed that the bias induced by different architectures clearly affects the learned properties, as suggested by probing task performance, e.g. on distance- and dependency-related probing tasks.
In future work, we want to extend the probing tasks to cover specific linguistic patterns such as appositions, and to investigate a model's ability to generalize to specific entity types, e.g. company and person names.

Figure 1: Example relation from TACRED. The sentence contains the relation org:subsidiaries between the head and tail entities 'Aerolineas' and 'Austral'.

Figure 2: Probing task setup. In the first step, we train a RE model (sentence encoder and relation classifier) on a dataset D. In the second step, we fix the encoder and, for each probing task, train a classifier on the encoder representations {s_j}_{j=1,...,|D|} of all sentences in D. The probing classifier performance indicates how well the sentence representations encode the information probed for, e.g. the entity type of the tail relation argument.

Table 2: TACRED probing task accuracies and model F1 scores on the test set. ↑ and ↓ indicate the cased and uncased versions of BERT, ⊗ models with entity masking. Probing task classification is performed by a logistic regression on the representations s_j of all sentences in the dataset.

Table 3: SemEval probing task accuracies and model F1 scores on the test set. ↑ and ↓ indicate the cased and uncased versions of BERT. Probing task classification is performed by a logistic regression on the representations s_j of all sentences in the dataset.