ExpBERT: Representation Engineering with Natural Language Explanations

Suppose we want to specify the inductive bias that married couples typically go on honeymoons for the task of extracting pairs of spouses from text. In this paper, we allow model developers to specify these types of inductive biases as natural language explanations. We use BERT fine-tuned on MultiNLI to “interpret” these explanations with respect to the input sentence, producing explanation-guided representations of the input. Across three relation extraction tasks, our method, ExpBERT, matches a BERT baseline but with 3–20x less labeled data and improves on the baseline by 3–10 F1 points with the same amount of labeled data.


Introduction
Consider the relation extraction task of finding spouses in text, and suppose we wanted to specify the inductive bias that married couples typically go on honeymoons. In a traditional feature engineering approach, we might try to construct a "did they go on a honeymoon?" feature and add that to the model. In a modern neural network setting, however, it is not obvious how to induce such an inductive bias through standard approaches like careful neural architecture design or data augmentation. While the shift from feature engineering toward end-to-end neural networks and representation learning has alleviated the burden of manual feature engineering and increased model expressivity, it has also reduced our control over a model's inductive biases.
In this paper, we explore using natural language explanations (Figure 1) to generate features that can augment modern neural representations. This imbues representations with inductive biases corresponding to the explanations, thereby restoring some degree of control while maintaining their expressive power. Prior work on training models with explanations uses semantic parsers to interpret the explanations: the parser converts each explanation into a logical form that is executed over the input sentence, and the resulting outputs are used as features (Srivastava et al., 2017) or as noisy labels on unlabeled data (Hancock et al., 2018). However, semantic parsers can typically only parse low-level statements like "'wife' appears between {o1} and {o2} and the last word of {o1} is the same as the last word of {o2}" (Hancock et al., 2018).
We remove these limitations by using modern distributed language representations, instead of semantic parsers, to interpret language explanations. Our approach, ExpBERT (Figure 2), uses BERT (Devlin et al., 2019) fine-tuned on the MultiNLI natural language inference dataset (Williams et al., 2018) to produce features that "interpret" each explanation on an input. We then use these features to augment the input representation. Just as a semantic parser grounds an explanation by converting it into a logical form and then executing it, the features produced by BERT can be seen as a soft "execution" of the explanation on the input. On three benchmark relation extraction tasks, ExpBERT improves over a BERT baseline with no explanations: it achieves an F1 score 3-10 points higher with the same amount of labeled data, and matches the full-data baseline with 3-20x less labeled data. ExpBERT also improves on a semantic parsing baseline (+3 to +5 F1 points), suggesting that natural language explanations can be richer than low-level, programmatic explanations.

Setup
Problem. We consider the task of relation extraction: given x = (s, o1, o2), where s is a sequence of words and o1 and o2 are two entities that appear as substrings of s, our goal is to classify the relation y ∈ Y between o1 and o2. The label space Y includes a NO-RELATION label for when no relation applies. Additionally, we are given a set of natural language explanations E = {e1, e2, ..., en} designed to capture relevant features of the input for classification. These explanations define a global collection of features and are not tied to individual examples.
Approach. Our approach (Figure 2) uses pretrained neural models to interpret the explanations E in the context of a given input x. Formally, we define an interpreter I as any function that takes an input x and an explanation ej and produces a feature vector in R^d. In our ExpBERT implementation, we choose I to capture whether the explanation ej is entailed by the input x. Concretely, we use BERT (Devlin et al., 2019) fine-tuned on MultiNLI (Williams et al., 2018): we feed the wordpiece-tokenized explanation ej (hypothesis) and instance x (premise), separated by a [SEP] token, to BERT. Following standard practice, we use the vector at the [CLS] token to represent the pair as a 768-dimensional feature vector:

  I(x, ej) = BERT([CLS] x [SEP] ej) ∈ R^768.  (1)

These vectors, one for each of the n explanations, are concatenated to form the explanation representation v(x) ∈ R^(768n):

  v(x) = [I(x, e1); I(x, e2); ...; I(x, en)].  (2)

In addition to v(x), we also map x to an input representation u(x) ∈ R^(768|Y|) by applying the same interpreter to textual descriptions of each potential relation. Specifically, we map each potential relation yi in the label space Y to a textual description ri (Figure 2), apply I(x, ·) to ri, and concatenate the resulting feature vectors:

  u(x) = [I(x, r1); I(x, r2); ...; I(x, r|Y|)].  (3)

Finally, we train a classifier over u(x) and v(x):

  y = MLP([u(x); v(x)]).  (4)

Note that u(x) and v(x) can be computed in a preprocessing step, since I(·, ·) is fixed (i.e., we do not additionally fine-tune BERT on our tasks). For more model details, please refer to Appendix A.1.
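The feature pipeline above can be sketched as follows. Here `interpret` is a hypothetical stand-in for the frozen MNLI-fine-tuned BERT: instead of real [CLS] features it returns a deterministic dummy 768-dimensional vector, so the sketch only illustrates how u(x) and v(x) are assembled.

```python
import hashlib
import numpy as np

DIM = 768  # size of BERT's [CLS] vector

def interpret(x: str, e: str) -> np.ndarray:
    """Stand-in for the interpreter I(x, e). The real model feeds the
    premise x and hypothesis e through MNLI-fine-tuned BERT and returns
    the [CLS] vector; here we return a deterministic dummy vector."""
    seed = int.from_bytes(hashlib.sha256(f"{x}||{e}".encode()).digest()[:4], "big")
    return np.random.default_rng(seed).standard_normal(DIM)

def expbert_features(x: str, explanations, relation_descriptions) -> np.ndarray:
    # v(x): one 768-d vector per explanation, concatenated
    v = np.concatenate([interpret(x, e) for e in explanations])
    # u(x): the same interpreter applied to a description of each relation
    u = np.concatenate([interpret(x, r) for r in relation_descriptions])
    # the classifier is trained over the concatenation of u(x) and v(x)
    return np.concatenate([u, v])

feats = expbert_features(
    "Jim Bob and Michelle Duggar went on a honeymoon.",
    explanations=["{o1} and {o2} are a couple",
                  "{o1} went on a honeymoon with {o2}"],
    relation_descriptions=["{o1} is married to {o2}", "no relation"],
)
```

With 2 explanations and 2 relation descriptions, `feats` has dimension 4 x 768; since the interpreter is fixed, these features can be precomputed once per example.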
Baselines. We compare ExpBERT against several baselines that train a classifier over the same input representation u(x). NoExp trains a classifier on u(x) alone. The other baselines augment u(x) with variants of the explanation representation v(x). BERT+SemParser uses the semantic parser from Hancock et al. (2018) to convert explanations into executable logical forms; the resulting denotations over the input x (a single bit per explanation) form the explanation representation, i.e., v(x) ∈ {0, 1}^n. We use two different sets of explanations for this baseline: our natural language explanations (LangExp) and the low-level explanations from Hancock et al. (2018) that are better suited to the semantic parser (ProgExp). BERT+Patterns converts explanations into a collection of unigram, bigram, and trigram patterns and creates a binary feature for each pattern based on whether it appears in s, giving v(x) ∈ {0, 1}^p, where p is the number of patterns. Finally, we compare ExpBERT against ExpBERT (Prob), a variant that summarizes each 768-dimensional feature vector I(x, ej) into a single entailment probability, so that v(x) ∈ R^n.
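A minimal sketch of the BERT+Patterns featurization, assuming simple regex tokenization and placeholder stripping (the paper's exact pattern extraction may differ):

```python
import re

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_patterns(explanations):
    """Collect unigram/bigram/trigram patterns from the explanations,
    dropping the {o1}/{o2} placeholders."""
    patterns = set()
    for e in explanations:
        text = re.sub(r"\{o\d\}", " ", e.lower())
        toks = re.findall(r"[a-z]+", text)
        for n in (1, 2, 3):
            patterns.update(ngrams(toks, n))
    return sorted(patterns)

def pattern_features(sentence, patterns):
    """One binary feature per pattern: does it occur in the sentence?"""
    toks = re.findall(r"[a-z]+", sentence.lower())
    present = set()
    for n in (1, 2, 3):
        present.update(ngrams(toks, n))
    return [1 if p in present else 0 for p in patterns]

patterns = build_patterns(["{o1} went on a honeymoon with {o2}"])
feats = pattern_features("They went on a honeymoon last May.", patterns)
```

Each input sentence thus maps to a fixed-length binary vector over the pattern vocabulary, which plays the role of v(x) for this baseline.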

Experiments
Datasets. We consider three relation extraction datasets from various domains: Spouse and Disease (Hancock et al., 2018), and TACRED (Zhang et al., 2017). Spouse involves classifying whether two entities are married; Disease involves classifying whether the first entity (a chemical) causes the second entity (a disease); and TACRED involves classifying the relation between the two entities into one of 41 categories. Dataset statistics are in Table 1; for more details, see Appendix A.2.
Explanations. To construct explanations, we randomly sampled 50 training examples for each y ∈ Y and wrote a collection of natural language statements explaining the gold label for each example. For Spouse and Disease, we additionally wrote some negative explanations for the NO-RELATION category. To interpret explanations for Disease, we use SciBERT, a variant of BERT that is better suited for scientific text (Beltagy et al., 2019). A list of explanations can be found in Appendix A.3.
Benchmarks. We find that explanations improve model performance across all three datasets: ExpBERT improves on the NoExp baseline by +10.6 F1 points on Spouse, +2.7 points on Disease, and +3.2 points on TACRED (Table 2). On TACRED, the most well-established of our benchmarks and the one with significant prior work, ExpBERT (which uses a smaller BERT-base model that is not fine-tuned on our task) outperforms the standard, fine-tuned BERT-large model by +1.5 F1 points (Joshi et al., 2019). Prior work on Spouse and Disease used a simple logistic classifier over traditional features created from dependency paths of the input sentence (Hancock et al., 2018); this performs poorly compared to neural models, and our models attain significantly higher accuracies. Using BERT to interpret natural language explanations improves on using semantic parsers to evaluate programmatic explanations (+5.5 and +2.7 points over BERT+SemParser (ProgExp) on Spouse and Disease, respectively). ExpBERT also outperforms the BERT+SemParser (LangExp) model by +9.9 and +3.3 points on Spouse and Disease. We exclude these comparisons on TACRED, as it was not studied in Hancock et al. (2018), so we do not have a corresponding semantic parser or set of programmatic explanations.
We note that ExpBERT-which uses the full 768-dimensional feature vector from each explanation-outperforms ExpBERT (Prob), which summarizes these vectors into one number per explanation, by +2-5 F1 points across all three datasets.
Data efficiency. Collecting a set of explanations E requires additional effort; it took the authors about a minute or less to construct each explanation, though this only needs to be done once per dataset (not per example). In return, a small number of explanations can disproportionately reduce the number of labeled examples required. We trained ExpBERT and the NoExp baseline on varying fractions of the Spouse and TACRED training data (Figure 3). ExpBERT matches the NoExp baseline with 20x less data on Spouse: we obtain the same performance with 40 explanations and 2k labeled training examples as NoExp obtains with 22k examples. On TACRED, ExpBERT requires 3x less data, matching NoExp trained on 68k examples with 128 explanations and 23k training examples. These results suggest that the higher-bandwidth signal in language can help models be more data-efficient.

Which explanations are important?
To understand which explanations are important, we group explanations into a few semantic categories (details in Appendix A.3) and cumulatively add them to the NoExp baseline. In particular, we break down the explanations for Spouse into the groups MARRIED (10 explanations), CHILDREN (5), ENGAGED (3), NEGATIVES (13), and MISC (9). We find that adding each new explanation group improves performance (Table 3), which suggests that broad coverage of various explanatory factors is helpful. We also observe that the MARRIED group (which contains paraphrases of "{o1} is married to {o2}") alone boosts performance over NoExp, suggesting that a variety of paraphrases of the same explanation can improve performance.

Quality vs. quantity of explanations
We now test whether ExpBERT can do equally well with the same number of random explanations, obtained by replacing words in each explanation with random words. The results are dataset-specific: random explanations help on Spouse but not on Disease. In both cases, however, random explanations do significantly worse than the original explanations (Table 4). Separately, adding 10 random explanations to our original explanations led to a slight drop (≈1 F1 point) in accuracy. These results suggest that ExpBERT's performance comes from having a diverse set of high-quality explanations and is not just due to providing more features.

Complementing language explanations with external databases
Natural language explanations can capture different types of inductive biases and prior knowledge, but some types of prior knowledge are of course better introduced through other means. We wrap up our experiments with a vignette on how language explanations can complement other forms of feature and representation engineering. We consider Disease, where we have access to an external ontology (the Comparative Toxicogenomics Database, or CTD) from Wei et al. (2015) containing chemical-disease interactions. Following Hancock et al. (2018), we add 6 bits to the explanation representation v(x) that test whether the given chemical-disease pair appears in certain relations in CTD (e.g., in the ctd-therapy dictionary). Table 5 shows that, as expected, other sources of information can complement language explanations in ExpBERT.
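These database-lookup bits can be sketched as follows. The toy ontology below is hypothetical (the dictionary names follow the paper's ctd-* naming, but the pair contents are made up for illustration):

```python
def ctd_bits(chemical, disease, ctd_dicts):
    """One bit per CTD relation dictionary, set if the (chemical, disease)
    pair appears in that dictionary. These bits are appended to v(x)."""
    return [1 if (chemical, disease) in pairs else 0
            for name, pairs in sorted(ctd_dicts.items())]

# Hypothetical toy ontology with two of the six relation dictionaries.
ctd = {
    "ctd-therapy": {("chem_a", "disease_x")},
    "ctd-marker": {("chem_b", "disease_y")},
}
bits = ctd_bits("chem_a", "disease_x", ctd)
```

Sorting the dictionary names keeps the bit order stable across examples, so the same position always corresponds to the same CTD relation.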

Related work
Many other works have used language to guide model training. As mentioned above, semantic parsers have been used to convert language explanations into features (Srivastava et al., 2017) and noisy labels on unlabeled data (Hancock et al., 2018; Wang et al., 2019). Rather than using language to define a global collection of features, Rajani et al. (2019) and Camburu et al. (2018) use instance-level explanations to train models that generate their own explanations. Zaidan and Eisner (2008) ask annotators to highlight important words, then learn a generative model over parameters given these rationales. Others have used language to directly produce the parameters of a classifier (Ba et al., 2015) and as part of a classifier's parameter space (Andreas et al., 2017).
While the above works consider learning from static language supervision, Li et al. (2016) and Weston (2016) learn from language supervision in an interactive setting. In a related line of work (Wang et al., 2017), users teach a system high-level concepts via language.

Discussion
Recent progress in general-purpose language representation models like BERT opens up new opportunities to incorporate language into learning. In this work, we show that using these models with natural language explanations lets us leverage a richer set of explanations than if we were constrained to explanations that can be programmatically evaluated, e.g., through n-gram matching (BERT+Patterns) or semantic parsing (BERT+SemParser).
The ability to incorporate prior knowledge of the "right" inductive biases into model representations dangles the prospect of building models that are more robust. However, more work will need to be done to make this approach more broadly applicable. We outline two such avenues of future work. First, combining our ExpBERT approach with more complex state-of-the-art models can be conceptually straightforward (e.g., we could swap out BERT-base for a larger model) but can sometimes also require overcoming technical hurdles. For example, we do not fine-tune ExpBERT in this paper; doing so might boost performance, but fine-tuning through all of the explanations on each example is computationally intensive. Second, in this paper we provided a proof-of-concept for several relation extraction tasks, relying on the fact that models trained on existing natural language inference datasets (like MultiNLI) could be applied directly to the input sentence and explanation pair. Extending ExpBERT to other natural language tasks where this relationship might not hold is an open problem that would entail finding different ways of interpreting an explanation with respect to the input.

A Appendix
A.1 Implementation Details

Interpreting explanations. When interpreting an explanation ei on a particular example x = (s, o1, o2), we first substitute o1 and o2 into the placeholders in ei to produce an instance-level version of the explanation. For example, "{o1} and {o2} are a couple" might become "Jim Bob and Michelle Duggar are a couple".
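This substitution is a simple string fill; a minimal sketch (placeholder syntax as in the paper's explanations):

```python
def instantiate(explanation: str, o1: str, o2: str) -> str:
    """Fill the {o1}/{o2} placeholders with the entity mentions."""
    return explanation.replace("{o1}", o1).replace("{o2}", o2)

hyp = instantiate("{o1} and {o2} are a couple", "Jim Bob", "Michelle Duggar")
# hyp == "Jim Bob and Michelle Duggar are a couple"
```

The instantiated string is then used as the hypothesis fed to BERT alongside the input sentence.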
Model hyperparameters and evaluation. We use BERT-BASE-UNCASED for Spouse and TACRED, and SCIBERT-SCIVOCAB-UNCASED (Beltagy et al., 2019) for Disease. We fine-tune all our BERT models on MultiNLI with the Transformers library (https://huggingface.co/transformers/) using default parameters. The resulting BERT model is then frozen and used to produce features for our classifier. We use the following hyperparameters for our MLP classifier: number of feed-forward layers ∈ [0, 1], dimension of each layer ∈ [64, 256], and dropout ∈ [0.0, 0.3]. We optionally project the 768-dimensional BERT feature vector down to 64 dimensions. To train the classifier, we use the Adam optimizer (Kingma and Ba, 2014) with default parameters and batch size ∈ [32, 128].
We early-stop the classifier based on validation F1 and choose the hyperparameters that obtain the best early-stopped validation F1. For Spouse and Disease, we report the mean test F1 and 95% confidence interval over 5-10 runs. For TACRED, we follow Zhang et al. (2017) and report the test F1 of the run with the median validation F1 among 5 runs of the chosen hyperparameters.
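The TACRED reporting rule can be sketched as follows; the run scores below are made up for illustration:

```python
def median_run_test_f1(runs):
    """Given (val_f1, test_f1) pairs from an odd number of runs, report
    the test F1 of the run whose validation F1 is the median."""
    ordered = sorted(runs, key=lambda r: r[0])
    return ordered[len(ordered) // 2][1]

# Five hypothetical runs of the chosen hyperparameter configuration.
runs = [(0.68, 0.66), (0.71, 0.70), (0.69, 0.67), (0.73, 0.71), (0.70, 0.69)]
reported = median_run_test_f1(runs)
```

Selecting by median validation F1 (rather than averaging test F1) reduces sensitivity to outlier runs while still reporting a single test number.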

A.2 Datasets
Spouse and Disease preprocessed datasets were obtained directly from the codebase provided by Hancock et al. (2018) (https://worksheets.codalab.org/worksheets/0x900e7e41deaa4ec5b2fe41dc50594548/). We use the train, validation, and test split provided by Hancock et al. (2018) for Disease, and split the development set of Spouse randomly into validation and test sets (the split was done at the document level). To process TACRED, we use the default BERT tokenizer and indexing pipeline in the Transformers library (https://huggingface.co/transformers/).

A.3 Explanations
The explanations can be found in Tables 6 and 7 on the following page. We use 40 explanations for Spouse, 28 explanations for Disease, and 128 explanations for TACRED (in accompanying file). The explanations were written by the authors.