Interpretability Rules: Jointly Bootstrapping a Neural Relation Extractor with an Explanation Decoder

We introduce a method that transforms a rule-based relation extraction (RE) classifier into a neural one such that both interpretability and performance are achieved. Our approach jointly trains an RE classifier with a decoder that generates explanations for these extractions, using as sole supervision a set of rules that match these relations. Our evaluation on the TACRED dataset shows that our neural RE classifier outperforms the rule-based one we started from by 9 F1 points; our decoder generates explanations with a high BLEU score of over 90%; and the joint learning improves the performance of both the classifier and the decoder.


Introduction
Information extraction (IE) is one of the key challenges in the natural language processing (NLP) field. With the explosion of unstructured information on the Internet, the demand for high-quality tools that convert free text to structured information continues to grow (Chang et al., 2010; Lee et al., 2013; Valenzuela-Escarcega et al., 2018).
The past decades have seen a steady transition from rule-based IE systems (Appelt et al., 1993) to methods that rely on machine learning (ML) (see Related Work). While this transition has generally yielded considerable performance improvements, it was not without a cost. For example, in contrast to modern deep learning methods, the predictions of rule-based approaches are easily explainable, as a small number of rules tends to apply to each extraction. Further, in many situations, rule-based methods can be developed by domain experts with minimal training data. For these reasons, rule-based IE methods remain widely used in industry (Chiticariu et al., 2013).
In this work we demonstrate that this transition from rule- to ML-based IE can be performed such that the benefits of both worlds are preserved. In particular, we start with a rule-based relation extraction (RE) system (Angeli et al., 2015) and bootstrap a neural RE approach that is trained jointly with a decoder that learns to generate the rules that best explain each particular extraction. The contributions of our idea are the following: (1) We introduce a strategy that jointly learns an RE classifier between pairs of entity mentions with a decoder that generates explanations for these extractions in the form of TokensRegex (Chang and Manning, 2014) or Semgrex (Chambers et al., 2007) patterns. The only supervision for our method is a set of input rules (or patterns) in these two frameworks (Angeli et al., 2015), which we use to generate positive examples for both the classifier and the decoder. We generate negative examples automatically from the sentences that contain positive examples.
(2) We evaluate our approach on the TACRED dataset (Zhang et al., 2017) and demonstrate that: (a) our neural RE classifier considerably outperforms the rule-based one we started from; (b) our decoder generates explanations with high accuracy, i.e., a BLEU overlap score of over 90% between the generated rules and the gold, hand-written rules; and (c) joint learning improves the performance of both the classifier and the decoder.
(3) We demonstrate that our approach generalizes to the situation where a large amount of labeled training data is combined with rules. We combined the TACRED training data with the above rules and showed that when our method is trained on this combined data, the classifier obtains near state-of-the-art performance at 67.0% F1, while the decoder generates accurate explanations with a BLEU score of 92.4%.

Related Work
Relation extraction using statistical methods is well studied. Methods range from supervised, "traditional" approaches (Zelenko et al., 2003; Bunescu and Mooney, 2005) to neural methods. Neural approaches for RE range from methods that rely on simpler representations such as CNNs (Zeng et al., 2014) and RNNs (Zhang and Wang, 2015) to more complicated ones, such as augmenting RNNs with different components (Xu et al., 2015; Zhou et al., 2016), combining RNNs and CNNs (Vu et al., 2016; Wang et al., 2016), and using mechanisms like attention (Zhang et al., 2017) or GCNs (Zhang et al., 2018). To address the lack of annotated data, distant supervision (Mintz et al., 2009; Surdeanu et al., 2012) is commonly used to generate a training dataset from an existing knowledge base. Jat et al. (2018) address the inherent noise in distant supervision with an entity attention method.
Rule-based methods in IE have also been extensively investigated. Riloff (1996) developed a system that learns extraction patterns using only a pre-classified corpus of relevant and irrelevant texts. Lin and Pantel (2001) proposed an unsupervised method for discovering inference rules from text based on the Harris distributional similarity hypothesis (Harris, 1954). Valenzuela-Escárcega et al. (2016) introduced a rule language that covers both surface text and syntactic dependency graphs. Angeli et al. (2015) further showed that converting rule-based models to statistical ones can capture some of the benefits of both, i.e., the precision of patterns and the generalizability of statistical models.
Interpretability has gained more attention recently in the ML/NLP community. For example, some efforts convert neural models to more interpretable ones such as decision trees (Craven and Shavlik, 1996; Frosst and Hinton, 2017). Others focus on producing a post-hoc explanation of individual model outputs (Ribeiro et al., 2016; Hendricks et al., 2016).
Inspired by these directions, here we propose an approach that combines the interpretability of rule-based methods with the performance and generalizability of neural approaches.

Approach
Our approach jointly addresses classification and interpretability through an encoder-decoder architecture trained with multi-task learning (MTL) on two tasks: relation extraction between pairs of named entities (Task 1) and rule generation (Task 2). Figure 1 summarizes our approach.

Task 1: Relation Classifier
We define the RE task as follows. The inputs consist of a sentence W = [w_1, ..., w_n], and a pair of entities (called "subject" and "object") corresponding to two spans in this sentence: W_s = [w_{s1}, ..., w_{sn}] and W_o = [w_{o1}, ..., w_{on}]. The goal is to predict a relation r ∈ R (from a predefined set of relation types) that holds between the subject and object, or "no relation" otherwise.
For each sentence, we associate each word w_i with a representation x_i that concatenates three embeddings: x_i = e(w_i) ∘ e(n_i) ∘ e(p_i), where e(w_i) is the word embedding of token i, e(n_i) is the NER embedding of token i, and e(p_i) is the POS tag embedding of token i. We feed these representations into a sentence-level bidirectional LSTM encoder (Hochreiter and Schmidhuber, 1997):

h_1, ..., h_n = biLSTM(x_1, ..., x_n)

Following Zhang et al. (2018), we extract the "K-1 pruned" dependency tree that covers the two entities, i.e., the shortest dependency path between the two entities enhanced with all tokens that are directly attached to the path, and feed it into a GCN (Kipf and Welling, 2016) layer:

h_i^(l) = ReLU( (1/d_i) Σ_{j=1}^{n} Ã_ij W^(l) h_j^(l-1) + b^(l) )

where A is the corresponding adjacency matrix, Ã = A + I with I being the n × n identity matrix, d_i = Σ_{j=1}^{n} Ã_ij is the degree of token i in the resulting graph, and W^(l) is a linear transformation.
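One layer of this GCN propagation can be sketched as follows (a minimal numpy sketch under the definitions above; the toy dimensions and all variable names are ours, not the authors' implementation):

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One GCN layer over the pruned dependency tree:
    h_i^(l) = ReLU((1/d_i) * sum_j A~_ij W h_j^(l-1) + b).

    H: (n, d) token representations from the previous layer
    A: (n, n) adjacency matrix of the dependency tree
    W: (d, d) linear transformation, b: (d,) bias
    """
    n = A.shape[0]
    A_tilde = A + np.eye(n)                  # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1, keepdims=True)   # degree d_i of each token
    return np.maximum(0.0, (A_tilde @ H @ W) / d + b)  # ReLU nonlinearity

# toy example: 3 tokens on a path, 4-dimensional representations
rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
out = gcn_layer(H, A, rng.normal(size=(4, 4)), np.zeros(4))
```

Stacking several such layers (the appendix uses 2) lets each token aggregate information from tokens several dependency hops away.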
Lastly, we concatenate the sentence representation, the subject entity representation, and the object entity representation:

h_final = f(h^(L)) ∘ f(h_s^(L)) ∘ f(h_o^(L))

where h^(l) denotes the collective hidden representations at layer l of the GCN, and f : R^{d×n} → R^d is a max pooling function that maps from n output vectors to a single representation vector. The concatenated representation h_final is fed to a feedforward layer with a softmax function to produce a probability distribution over relation types.

Figure 1: Neural architecture of the proposed multitask learning approach. The input is a sequence of words together with NER labels and POS tags. The pair of entities to be classified ("subject" in blue and "object" in orange) is also provided. We use a concatenation of several representations, including embeddings of words, NER labels, and POS tags. The encoder uses a sentence-level bidirectional LSTM (biLSTM) and graph convolutional networks (GCN). There are pooling layers for the subject, object, and full-sentence GCN outputs. The concatenated pooling outputs are fed to the classifier's feedforward layer. The decoder is an LSTM with an attention mechanism.
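The pooling and concatenation step can be sketched as follows (a numpy sketch assuming f is max pooling over token positions, per the appendix; the toy values and names are ours):

```python
import numpy as np

def max_pool(H):
    """f: maps n token vectors (rows of H) to a single d-dim vector."""
    return H.max(axis=0)

# H_L: (n, d) GCN outputs for a 4-token sentence with d = 3;
# subj/obj hold the token indices of the two entity spans
H_L = np.arange(12.0).reshape(4, 3)
subj, obj = [0, 1], [3]

# h_final = f(h^(L)) . f(h_s^(L)) . f(h_o^(L)), i.e. sentence, subject,
# and object poolings concatenated into one 3d-dim vector
h_final = np.concatenate([max_pool(H_L), max_pool(H_L[subj]), max_pool(H_L[obj])])
```

The resulting vector would then be fed to the feedforward + softmax classification head described above.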

Task 2: Rule Decoder
The rule decoder's goal is to generate the pattern P that extracted the corresponding data point, where P is represented as a sequence of tokens in the corresponding pattern language: P = [p_1, ..., p_n]. For example, the pattern (([{kbpentity:true}]+) /was/ /born/ /on/ ([{slotvalue:true}]+)) (where kbpentity:true marks subject tokens, and slotvalue:true marks object tokens) extracts mentions of the per:date_of_birth relation.
We implemented this decoder using an LSTM with an attention mechanism. To center rule decoding around the subject and object, we first feed the concatenation of the subject and object representations from the encoder as the initial state of the decoder. Then, at each timestep t, we generate the attention context vector C_t^D using the current hidden state of the decoder, h_t^D:

α_{t,i} = softmax_i( h_t^D W_A h_i^{E(L)} )
C_t^D = Σ_i α_{t,i} h_i^{E(L)}

where W_A is a learned matrix, and h^{E(L)} are the hidden representations from the encoder's GCN.
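This attention step can be sketched as follows (a numpy sketch; the bilinear scoring form is our reading of "W_A is a learned matrix", and all names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(h_t, H_E, W_A):
    """Bilinear attention over encoder states:
    score_i = h_t . W_A . h_i^E, alpha = softmax(scores),
    C_t = sum_i alpha_i * h_i^E.

    h_t: (d,) current decoder hidden state
    H_E: (n, d) encoder GCN hidden states
    W_A: (d, d) learned attention matrix
    """
    scores = h_t @ W_A @ H_E.T   # (n,) one score per encoder state
    alpha = softmax(scores)      # attention weights
    return alpha @ H_E           # (d,) context vector

h_t = np.array([1.0, 0.0])
H_E = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C_t = attention_context(h_t, H_E, np.eye(2))
```

With the identity as W_A, the weights concentrate on encoder states aligned with the decoder state, so the context vector leans toward the first dimension here.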
We feed this C_t^D vector to a single feedforward layer coupled with a softmax function, and use its output to obtain a probability distribution over the pattern vocabulary. We use cross entropy to calculate the losses for both the classifier and the decoder. To balance the loss between classifier and decoder, we normalize the decoder loss by the pattern length. Note that for data points without an existing rule, we only calculate the classifier loss. Formally, the joint loss function is:

L = L_classifier + (1/|P|) L_decoder

where |P| is the length of the pattern P.

Table 3: Learning curve of our approach based on the amount of rules used, in the rule-only data configuration. These results are on the TACRED development set.
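The loss combination just described can be sketched as (a minimal sketch; function and argument names are ours):

```python
def joint_loss(loss_cls, loss_dec, pattern_len):
    """Joint loss: classification loss plus decoder (rule generation) loss
    normalized by the pattern length; data points without a matching rule
    contribute only the classification loss."""
    if pattern_len is None:  # no rule matched this data point
        return loss_cls
    return loss_cls + loss_dec / pattern_len

with_rule = joint_loss(0.5, 4.0, 8)        # 0.5 + 4.0/8
without_rule = joint_loss(0.5, 4.0, None)  # decoder term dropped
```

Normalizing by pattern length keeps long rules from dominating the gradient relative to the single-label classification term.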
Experiments

We bootstrapped our models from the patterns in the rule-based system of Angeli et al. (2015), which uses 4,528 surface patterns (in the TokensRegex language) and 169 patterns over syntactic dependencies (using Semgrex). We experimented with two configurations: rule-only data, and rules + TACRED training data. In the former setting, we use solely positive training examples generated by the above rules. We combine these positive examples with negative ones generated automatically by assigning 'no_relation' to all other entity mention pairs in any sentence that contains a positive example. We generated 3,850 positive and 12,311 negative examples for this configuration. In the latter configuration, we apply the same rules to the entire TACRED training dataset; thus, some training examples are associated with a rule and some are not, and we adjust the loss function to use only the classification loss when no rule applies.

Baselines: We compare our approach with two baselines: the rule-based system of Zhang et al. (2017), and a neural baseline.
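The negative-example generation described above can be sketched as (a stdlib sketch; entity representations and function names are ours):

```python
from itertools import permutations

def make_negatives(entity_mentions, positive_pairs):
    """For a sentence containing a positive example, label every other
    ordered (subject, object) entity mention pair as 'no_relation'.

    entity_mentions: entity mentions in the sentence
    positive_pairs: set of (subject, object) pairs matched by a rule
    """
    negatives = []
    for subj, obj in permutations(entity_mentions, 2):
        if (subj, obj) not in positive_pairs:
            negatives.append((subj, obj, "no_relation"))
    return negatives

ents = ["John", "Seattle", "Microsoft"]
pos = {("John", "Seattle")}  # e.g. matched by a per:city_of_birth rule
negs = make_negatives(ents, pos)
```

With 3 mentions there are 6 ordered pairs; removing the 1 positive leaves 5 negatives, which matches the roughly 3:1 negative-to-positive ratio reported for the rule-only configuration.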
For a fair comparison, we do not compare against ensemble methods or transformer-based ones. Also, note that the neural baseline does not use rules at all.

We use pretrained GloVe vectors (Pennington et al., 2014) as our word embeddings. We use the Adagrad optimizer (Duchi et al., 2011). We apply entity masking to the subject and object entities in the sentence, i.e., we replace the original tokens with a special <NER>-SUBJ or <NER>-OBJ token, where <NER> is the corresponding named entity label provided by TACRED.
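The entity-masking step can be sketched as (a minimal sketch; the helper name, span convention, and example sentence are ours):

```python
def mask_entities(tokens, subj_span, subj_ner, obj_span, obj_ner):
    """Replace subject/object tokens with <NER>-SUBJ / <NER>-OBJ
    placeholders. Spans are (start, end) token indices, end exclusive."""
    out = list(tokens)
    for i in range(*subj_span):
        out[i] = f"{subj_ner}-SUBJ"
    for i in range(*obj_span):
        out[i] = f"{obj_ner}-OBJ"
    return out

masked = mask_entities(
    ["Barack", "Obama", "was", "born", "in", "Hawaii"],
    (0, 2), "PERSON", (5, 6), "LOCATION")
```

Masking prevents the classifier from memorizing specific entity strings, forcing it to rely on the entity types and the surrounding context.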
We used micro precision, recall, and F1 scores to evaluate the RE classifier. We used the BLEU score to measure the quality of the generated rules, i.e., how close they are to the corresponding gold rules that extracted the same output. We used the BLEU implementation in NLTK (Loper and Bird, 2002), which allows us to calculate multi-reference BLEU scores over 1- to 4-grams. We report BLEU scores only over the non-'no_relation' extractions, i.e., the testing data points that are matched by one of the rules in (Zhang et al., 2017).

Results and Discussion

Table 1 reports the overall performance of our approach, the baselines, and the ablation settings, for the two configurations investigated. We draw the following observations from these results:
(1) The rule-based method of Zhang et al. (2017) has high precision but suffers from low recall. In contrast, our approach, which is bootstrapped from the same information, has 13% higher recall and almost 9% higher F1 (absolute). Further, our approach decodes explanatory rules with a high BLEU score of 90%, which indicates that it maintains almost the entire explanatory power of the rule-based method.
(2) The ablation experiments indicate that joint training for classification and explainability helps both tasks, in both configurations. This indicates that performance and explainability are interconnected.
(3) The two configurations analyzed in the table demonstrate that our approach performs well not only when trained solely on rules, but also when rules are combined with a training dataset annotated for RE. This suggests that our direction may be a general strategy to infuse some explainability in a statistical method, when rules are available during training.
(4) Table 3 lists the learning curve of our approach in the rule-only data configuration as the amount of rules available varies. This table shows that our approach obtains a higher F1 than the complete rule-based RE classifier even when using only 40% of the rules.

(5) Note that the BLEU score provides an incomplete evaluation of rule quality. To understand whether the decoded rules explain their corresponding data points, we performed a manual evaluation of 176 decoded rules. We classified them into three categories: (a) the rules correctly explain the prediction (according to the human annotator), (b) they approximately explain the prediction, and (c) they do not explain the prediction. Class (b) contains rules that do not lexically match the input text, but capture the correct semantics, as shown in Table 2. The percentages we measured were: (a) 33.5%, (b) 31.3%, and (c) 26.1%. The remaining 9% of these rules were skipped in the evaluation because they were false negatives, i.e., data points incorrectly labeled as no_relation by our model. These numbers support our hypothesis that, in general, the decoded rules do explain the classifier's predictions.
Further, out of the 750 data points associated with rules in the evaluation data, our method incorrectly classifies only 26. Of these 26, 16 were false negatives and had no rules decoded. Of the remaining 10 predictions, 7 rules fell into class (b) (see the examples in Table 2). The other 3 were incorrect due to ambiguity, i.e., the pattern created is an ambiguous succession of POS tags or syntactic dependencies without any lexicalization. This suggests that, even when our classifier is incorrect, the decoded rules tend to capture the underlying semantics.
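The multi-reference BLEU evaluation used above to score decoded rules against gold rules can be sketched with NLTK (the rule tokenization and smoothing choice here are our illustration, not the authors' exact setup):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Gold rules that produced the same extraction act as multiple references;
# each rule is treated as a sequence of pattern-language tokens.
references = [
    "( [ {kbpentity:true} ]+ ) /was/ /born/ /on/ ( [ {slotvalue:true} ]+ )".split(),
]
# A decoded rule is scored against all references over 1- to 4-grams.
hypothesis = "( [ {kbpentity:true} ]+ ) /was/ /born/ /on/ ( [ {slotvalue:true} ]+ )".split()

score = sentence_bleu(
    references, hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),          # uniform 1- to 4-gram weights
    smoothing_function=SmoothingFunction().method1)
```

An exact match scores 1.0; rules in class (b), which differ lexically but keep the same structure, would score lower on BLEU despite being semantically adequate, which is why the manual evaluation above complements it.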

Conclusion
We introduced a strategy that jointly bootstraps a relation extraction classifier with a decoder that generates explanations for these extractions, using as sole supervision a set of example patterns that match such relations. Our experiments on the TACRED dataset demonstrated that our approach outperforms the strong rule-based method it was bootstrapped from, while generating accurate explanations for its extractions.

A Experimental Details
We use the dependency parse trees, POS and NER sequences included in the original release of the TACRED dataset, which were generated with Stanford CoreNLP (Manning et al., 2014). We use pretrained 300-dimensional GloVe vectors (Pennington et al., 2014) to initialize word embeddings. We use 2 layers of biLSTM, 2 layers of GCN, and 2 feedforward layers in our encoder, and 2 layers of LSTM and 1 feedforward layer in our decoder. Table 4 shows the details of the proposed neural network. We apply the ReLU function for all nonlinearities in the GCN layers and standard max pooling in all pooling layers. For regularization, we apply dropout with p = 0.5 to all encoder LSTM layers and to all but the last GCN layer. For training, we use Adagrad (Duchi et al., 2011) with an initial learning rate of 0.3; starting from epoch 1, we anneal the learning rate by a factor of 0.9 every time the F1 score on the development set does not increase after one epoch. We tuned the initial learning rate between 0.01 and 1; we chose 0.3 as this obtained the best performance on development.
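The annealing schedule can be sketched as (a minimal sketch of the rule stated above; the function name and toy F1 values are ours):

```python
def anneal(lr, dev_f1_history, factor=0.9):
    """Multiply the learning rate by `factor` whenever the development F1
    fails to improve over the previous epoch."""
    if len(dev_f1_history) >= 2 and dev_f1_history[-1] <= dev_f1_history[-2]:
        return lr * factor
    return lr

lr = 0.3                              # tuned initial learning rate
history = []
for f1 in [60.1, 61.0, 60.8, 60.8]:   # toy dev F1 per epoch
    history.append(f1)
    lr = anneal(lr, history)          # two non-improving epochs -> 0.3 * 0.9^2
```

In this toy run the last two epochs do not improve dev F1, so the learning rate is annealed twice.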
We trained for 100 epochs in all experiments with a batch size of 50. There were 3,850 positive and 12,311 negative data points in the rule-only data. For this dataset, one epoch took 1 minute on average; for rules + TACRED training data, one epoch took 4 minutes on average. All the hyperparameters above were tuned manually. We trained our model with PyTorch on Python 3.8.5 with CUDA version 10.0, using one NVIDIA Titan RTX.

B Dataset Introduction
The details of the TACRED dataset can be found at https://nlp.stanford.edu/projects/tacred/.

C Rules
The rule-based system we use is the combination of Stanford's TokensRegex (Chang and Manning, 2014) and Semgrex (Chambers et al., 2007). The rules we use are from the system of Angeli et al. (2015), which contains 4,528 TokensRegex patterns and 169 Semgrex patterns.
We extracted the rules from CoreNLP and mapped each rule to the TACRED dataset. We provide the mapping files in our released dataset. We also generated a version of the dataset containing only the data points matched by rules in the TACRED training partition, together with its mapping file.