What-if I ask you to explain: Explaining the effects of perturbations in procedural text

Our goal is to explain the effects of perturbations in procedural text, e.g., given a passage describing a rabbit's life cycle, explain why illness (the perturbation) may reduce the rabbit population (the effect). Although modern systems are able to solve the original prediction task well (e.g., illness results in fewer rabbits), the explanation task, identifying the causal chain of events from perturbation to effect, remains largely unaddressed and is the goal of this research. We present QUARTET, a system that constructs such explanations from paragraphs by modeling the explanation task as a multitask learning problem. QUARTET constructs explanations from the sentences in the procedural text, achieving ~18 points better explanation accuracy compared to several strong baselines on a recent process comprehension benchmark. On an end task on this benchmark, we show the surprising finding that good explanations do not have to come at the expense of end-task performance; in fact, they lead to a 7% F1 improvement over SOTA.


Introduction
Procedural text is common in natural language (in recipes, how-to guides, etc.) and finds many applications, such as automatic execution of biology experiments (Mysore et al., 2019), cooking recipes (Bollini et al., 2012), and everyday activities (Yang and Nyberg, 2015). However, procedural text understanding in these settings remains a major challenge and requires two key abilities: (i) understanding the dynamics of the world inside a procedure, by tracking entities and the events that happen as the narrative unfolds; and (ii) understanding the dynamics of the world outside the procedure that can influence it.
While recent systems for procedural text comprehension have focused on understanding the dynamics of the world inside the process, such as tracking entities and answering questions about what events happen (e.g., Henaff et al., 2017), the extent to which they understand the influences of outside events remains unclear. In particular, if a system fully understands a process, it should be able to predict what would happen if it were perturbed in some way due to an event from the outside world. Such counterfactual reasoning is particularly challenging because, rather than asking what happened (described in the text), it asks about what would happen in an alternative world where the change occurred.

Figure 1: Given a procedural text, the task is to explain the effect of the perturbation using the input sentences.
Recently, Tandon et al. (2019) introduced the WIQA dataset that contains such problems, requiring prediction of the effect of perturbations in a procedural text. They also presented several strong models on this task. However, it is unclear whether those high scores indicate that the models fully understand the described procedures, i.e., that the models have knowledge of the causal chain from perturbation to effect. To test this, Tandon et al. (2019) also proposed an explanation task. While the general problem of synthesizing explanations is hard, they proposed a simplified version in which explanations were instead assembled from sentences in the input paragraph and qualitative indicators (more/less/unchanged). Although they introduced this explanation task and dataset, they did not present a model to address it. We fill this gap by proposing the first solution to this task. We present a model, QUARTET (QUAlitative Reasoning wiTh ExplanaTions), that takes as input a passage and a perturbation, and outputs the perturbation's qualitative effect together with an explanation structure over the passage. See Figure 1 for an example. The explanation structure includes up to two supporting sentences from the procedural text, together with the qualitative effect of the perturbation on the supporting sentences (more of or less of in Figure 1). QUARTET models this qualitative reasoning task as a multitask learning problem to explain the effect of a perturbation.
Our main contributions are:
• We present the first model that explains the effects of perturbations in procedural text. On a recent process comprehension benchmark, QUARTET generates better explanations compared to strong baselines.
• On an end task on this benchmark, we show that good explanations do not have to come at the expense of end-task performance; in fact, they lead to a 7% F1 improvement over SOTA (see §6). Prior work has found that optimizing for explanation can hurt end-task performance, so ours is a useful datapoint showing the opposite.[1]

Related work
Procedural text understanding: Machine reading has seen tremendous progress. With machines reaching human performance on standard QA benchmarks (Devlin et al., 2018; Rajpurkar et al., 2016), more challenging datasets have been proposed (Dua et al., 2019) that require background knowledge, commonsense reasoning (Talmor et al., 2019), and visual reasoning (Antol et al., 2015; Zellers et al., 2018). In the context of procedural text understanding, which has gained considerable attention recently, several works (Henaff et al., 2017, inter alia) address the task of tracking entity states throughout the text. Recently, Tandon et al. (2019) introduced the WIQA task to predict the effect of perturbations.

[1] All the code will be publicly shared upon acceptance.
Understanding the effects of perturbations, specifically qualitative change, has been studied using formal frameworks in the qualitative reasoning community (Forbus, 1984; Weld and De Kleer, 2013) and counterfactual reasoning in the logic community (Lewis, 2013). The WIQA dataset situates this task in terms of natural language rather than formal reasoning, by treating the task as a mixture of reading comprehension and commonsense reasoning. However, existing models do not explain the effects of perturbations.

Explanations: Despite large-scale QA benchmarks, high scores do not necessarily reflect understanding (Min et al., 2019). Current models may not be robust or may exploit annotation artifacts (Gururangan et al., 2018). This makes explanations desirable for interpretation (Selvaraju et al., 2017).
Attention-based explanation has been successfully used in vision tasks such as object detection (Petsiuk et al., 2018), because pixel information is explainable to humans. These and other token-level attention models used in NLP tasks (Wiegreffe and Pinter, 2019) do not provide full-sentence explanations of a model's decisions.
Recently, several datasets with natural language explanations have been introduced, e.g., in natural language inference (Camburu et al., 2018), visual question answering (Park et al., 2018), and multihop reading comprehension (the HotpotQA dataset). In contrast to these datasets, we explain the effects of perturbations in procedural text. HotpotQA contains explanations based on two sentences from a Wikipedia paragraph. Models built for HotpotQA would not be directly applicable to our task and would require substantial modification, for the following reasons: (i) HotpotQA models are not trained to predict the qualitative structure (more of or less of for the chosen explanation sentences in Figure 1). (ii) HotpotQA involves reasoning over named entities, whereas the current task focuses on common nouns and actions (models that work well on named entities need to be adapted to common nouns and actions (Sedghi and Sabharwal, 2018)). (iii) Explanation paragraphs in HotpotQA are not procedural, while the current input is procedural in nature with a specific chronological structure.

ears less protected → (MORE/+) sound enters the ear → (MORE/+) sound hits ear drum → (MORE/+) more sound detected
blood clotting disorder → (LESS/-) blood clots → (LESS/-) scab forms → (MORE/+) less scab formation
breathing exercise → (MORE/+) air enters lungs → (MORE/+) air enters windpipe → (MORE/+) oxygen enters bloodstream
squirrels store food → (MORE/+) squirrels eat more → (MORE/+) squirrels gain weight → (MORE/+) hard survival in winter
less trucks run → (LESS/-) trucks go to refineries → (LESS/-) trucks carry oil → (MORE/+) less fuel in gas stations
coal is expensive → (LESS/-) coal burns → (LESS/-) heat produced from coal → (LESS/-) electricity produced
legible address → (MORE/+) mailman reads address → (MORE/+) mail reaches destination → (MORE/+) on-time delivery
more water to roots → (MORE/+) roots attract water → (MORE/+) roots suck up water → (LESS/-) plants malnourished
in a quiet place → (LESS/-) sound enters the ear → (LESS/-) sound hits ear drum → (LESS/-) more sound detected
eagle hungry → (MORE/+) eagle swoops down → (MORE/+) eagle catches mouse → (MORE/+) eagle gets more food

Table 1: Examples of our model's predictions on the dev. set, in the format q_p → (d_i) x_i → (d_j) x_j → (d_e) q_e. Supporting sentences x_i, x_j are compressed, e.g., "the person has his ears less protected" → "ears less protected".
Another line of work provides more structure and organization to explanations, e.g., using scene graphs in computer vision (Ghosh et al., 2019). For elementary science questions, Jansen et al. (2018) use a science knowledge graph. These approaches rely on a knowledge structure or graph, but knowledge graphs are incomplete and costly to construct for every domain (Weikum and Theobald, 2010). There are trade-offs between unstructured and structured explanations. Unstructured explanations are abundantly available, while structured explanations need to be constructed and hence are less scalable (Camburu et al., 2018). Generating free-form (unstructured) explanations is difficult to evaluate (Cui et al., 2018; Zhang et al., 2019), and adding qualitative structure over them is nontrivial. Taking a middle ground between free-form and knowledge-graph-based explanations, we infer a qualitative structure over the sentences in the paragraph. This retains the rich interpretability and simpler evaluation of structured explanations, while leveraging the large-scale availability of sentences required for these explanations.
It is an open research problem whether requiring explanation helps or hurts the original task being explained. On the natural language inference task (e-SNLI), Camburu et al. (2018) observed that models generate correct explanations at the expense of good performance. On the CoS-E task, Rajani et al. (2019) recently showed that explanations help the end task. Our work extends this line of inquiry to a new task setting that involves perturbations, and enriches natural language explanations with qualitative structure.

Problem definition
We adopt the problem definition described in Tandon et al. (2019), and summarize it here.
Input: (1) A procedural text x_1 ... x_K, where x_k denotes step k (i.e., a sentence) in a procedural text comprising K steps. (2) A perturbation q_p to the procedural text and its likely candidate effect q_e.
Output: An explanation structure that explains the effect of the perturbation q_p:
• i: step id of the first supporting sentence.
• j: step id of the second supporting sentence.
• d_i: the qualitative direction (more/less/no effect) of the perturbation's effect on x_i (and, by assumption a2 below, on x_j).
• d_e: the qualitative direction of the effect on q_e, i.e., the answer label.
See Figure 1 for an example of the task, and Table 1 for examples of explanations.
An explanation consists of up to two (i.e., zero, one, or two) supporting sentences i, j along with their qualitative directions d_i, d_j; if there is only one supporting sentence, then i = j. While there can potentially be many correct explanation paths in a passage, the WIQA dataset contains only one gold explanation, the one considered best by human annotators. Our task is to predict that particular gold explanation.
Assumptions: In a procedural text, steps x_1 ... x_K are chronologically ordered and have a forward-flowing effect, i.e., if j > i, then more/increase of x_i will result in more/increase of x_j. Prior work on procedural text makes a similar assumption. Note that this assumption does not hold for cyclic processes; cyclic processes have already been flattened in the WIQA dataset. We make the following observations based on this forward-flow assumption:
a1: i <= j (forward-flow order).
a2: d_j = d_i (forward-flow assumption).[2]
a3: For the WIQA task, d_e is the answer label, because it is the end node in the explanation structure.
a4: If d_i = •, then the answer label is • (since q_p does not affect q_e, there is no valid explanation).
These assumptions reduce the number of predictions, removing d_j and the answer label (see a2, a3). Given x_1 ... x_K, q_p, q_e, the model must predict four labels: i, j, d_i, d_e.
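The forward-flow rules above can be sketched in a few lines. This is an illustrative sketch, not the authors' code; the string encoding of the qualitative directions is our own choice.

```python
def complete_explanation(i, j, d_i, d_e):
    """Return (d_j, answer_label) given the four predicted labels.

    Qualitative directions are encoded as '+' (more), '-' (less),
    and 'o' (no effect, written as a bullet in the paper).
    """
    assert i <= j                  # a1: forward-flow order
    d_j = d_i                      # a2: same qualitative direction
    answer = d_e                   # a3: d_e is the answer label
    if d_i == 'o':                 # a4: no valid explanation -> no effect
        answer = 'o'
    return d_j, answer
```

This is why the model only needs to predict the four labels i, j, d_i, d_e: the remaining fields follow deterministically.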

QUARTET model
We can solve the problem as a classification task, predicting four labels: i, j, d i , d e . If these predictions are performed independently, it requires several independent classifications and this can cause error propagation: prediction errors that are made in the initial stages cannot be fixed and can propagate into larger errors later on (Goldberg, 2017).
To avoid this, QUARTET predicts and explains the effect of q p as a multitask learning problem, where the representation layer is shared across different tasks. We apply the widely used parameter sharing approach, where a single representation layer is followed by task specific output layers (Baxter, 1997). This reduces the risk of overfitting to a single task and allows decisions on i, j, d i , d e to influence each other in the hidden layers of the network. We first describe our encoder and then the other layers on top, see Figure 2 for the model architecture.
Encoder: To encode x_1 ... x_K and the question q, we use the BERT architecture (Devlin et al., 2018), which has achieved state-of-the-art performance across several NLP tasks. Here, the question q = q_p ⊕ q_e (⊕ denotes concatenation). We start with a byte-pair tokenization (Sennrich et al., 2015). These byte-pair tokens are passed through a 12-layer Transformer network, resulting in a contextualized representation for every byte-pair token. From this contextualized representation we obtain the vector u = [u_1, ..., u_K, u_q], where u_k denotes the encoding of x_k and u_q denotes the question encoding. Let E_l be the embedding size resulting from the l-th transformer layer; in that l-th layer, [u_1, ..., u_K] ∈ R^(K×E_l). The hidden representations of all transformer layers are initialized with weights from a self-supervised pre-training phase, in line with contemporary research that uses pre-trained language models (Devlin et al., 2018).

[2] Note that this does not assume all sentences have the same directionality of influence. For example, a paragraph could include both positive and negative influences: "Predators arrive. Thus the rabbit population falls...". Rather, the d_j = d_i assumption is one of narrative coherence: the more predators arrive, the more the rabbit population falls. That is, within a paragraph, we assume enhancing one step will have enhanced effects (whether positive or negative) on future steps, a property of a coherently authored paragraph.
To compute the final logits, we add a linear layer over features from the different transformer layers in BERT that are the individual winners for the individual tasks in our multitask problem. For instance, out of the 12 transformer layers, lower layers (e.g., layer 2) are the best predictors for [i, j], while upper layers (layers 10 and 11) are the best predictors for [d_i, d_e]. Zhang et al. (2019) likewise found that the last layer is not necessarily the best performing layer. Different layers seem to learn complementary information, because their fusion helps. Combining different layers by weighted averaging has been attempted with mixed success (Zhang et al., 2019); we observed the same trend for a simple weighted transformation. However, we found that learning a linear layer over concatenated features from the winning layers improves performance. This is probably because very different information is encoded in a particular dimension across different layers, and concatenation preserves it better than simple weighted averaging.
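The layer-fusion idea above can be sketched with toy tensors. The layer indices and sizes below are assumptions for illustration, not the trained model's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)
K, E = 5, 8                                   # K sentences, hidden size E (toy)
# Stand-ins for per-sentence encodings from each of 12 transformer layers.
layer_out = [rng.normal(size=(K, E)) for _ in range(12)]

winners = [1, 9, 10]   # hypothetical "winning" layers for the subtasks

# Concatenation keeps each layer's per-dimension information intact,
# whereas a weighted average of layers would mix dimensions together.
feats = np.concatenate([layer_out[l] for l in winners], axis=-1)  # (K, 3*E)
W = rng.normal(size=(feats.shape[-1], E))     # learned linear layer (stand-in)
fused = feats @ W                             # (K, E) features for the logits
```

The learned linear map over the concatenated features is what replaces the weighted-averaging schemes tried in prior work.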
Classification tasks: To predict the first supporting sentence x_i, we obtain a softmax distribution s_i ∈ R^K over [u_1, ..., u_K]. From the forward-flow assumption made in the problem definition section, we know that i ≤ j, making it possible to model this as a span prediction x_{i:j}. In line with standard span-based prediction models (Seo et al., 2017), we use an attended sentence representation, where ⊙ denotes element-wise multiplication and ⊕ denotes concatenation.
For the classification of d_i (and d_j, since d_i = d_j), we use the representation of the first token (i.e., the CLS token ∈ R^(E_l)) and a linear layer followed by a softmax to predict d_i ∈ {+, −, •}. Classification of d_e is performed in exactly the same manner.
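A shape-level sketch of the four task heads over the shared encoder outputs may help; the parameters below are random stand-ins for illustration, not trained weights.

```python
import numpy as np

rng = np.random.default_rng(0)
K, E = 6, 16
u = rng.normal(size=(K, E))       # sentence encodings u_1 .. u_K (stand-ins)
u_cls = rng.normal(size=E)        # CLS-token representation (stand-in)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

w_i, w_j = rng.normal(size=E), rng.normal(size=E)  # sentence scorers
W_di = rng.normal(size=(E, 3))    # directions {+, -, o} for d_i (= d_j)
W_de = rng.normal(size=(E, 3))    # directions {+, -, o} for d_e

s_i = softmax(u @ w_i)            # distribution over the first support sentence
s_j = softmax(u @ w_j)            # distribution over the second support sentence
p_di = softmax(u_cls @ W_di)      # P(d_i) from the CLS representation
p_de = softmax(u_cls @ W_de)      # P(d_e), predicted the same way
```

All four heads read from the same shared representation, which is what lets the decisions on i, j, d_i, d_e influence each other during training.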
The network is trained end-to-end to minimize the sum of cross-entropy losses for the individual classification tasks i, j, d_i, d_e. At prediction time, we leverage the assumptions (a1-a4) to generate consistent predictions.

Experiments
Dataset: We train and evaluate QUARTET on the recently published WIQA dataset,[3] comprising 30,099 questions from 2,107 paragraphs with explanations (23K train, 5K dev, 2.5K test). The perturbations q_p are either a linguistic variation of a passage sentence (17% of examples; called in-para questions) or require commonsense reasoning to connect to a passage sentence (41% of examples; called out-of-para questions). Explanations are supported by up to two sentences from the passage: 52.7% have length 2, 5.5% length 1, and 41.8% length 0. Length-zero explanations indicate that d_e = • (called no-effect questions), and ensure that random guessing on explanations gets a low score on the end task.

[3] WIQA dataset link: http://data.allenai.org/wiqa/
Metrics: We evaluate on both explainability and the downstream end task (QA). For explainability, we define explanation accuracy as the average accuracy of the four components of the explanation: acc_expl = (1/4) · Σ_{c ∈ {i, j, d_i, d_e}} acc(c), and acc_qa = acc(d_e) (by assumption a3). The QA task is measured in terms of accuracy.
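The two metrics are straightforward to compute. In this minimal sketch, each prediction and gold label is a dict with keys 'i', 'j', 'd_i', 'd_e'; that representation is our choice for illustration, not prescribed by the paper.

```python
def explanation_accuracy(preds, golds):
    """acc_expl: mean accuracy over the four explanation components."""
    comps = ('i', 'j', 'd_i', 'd_e')
    correct = sum(p[c] == g[c] for p, g in zip(preds, golds) for c in comps)
    return correct / (4 * len(golds))

def qa_accuracy(preds, golds):
    """acc_qa = acc(d_e): by assumption a3, d_e is the answer label."""
    return sum(p['d_e'] == g['d_e'] for p, g in zip(preds, golds)) / len(golds)
```

For example, a prediction that gets i, d_i, and d_e right but j wrong scores 0.75 on acc_expl while still scoring 1.0 on acc_qa.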
Hyperparameters: QUARTET fine-tunes BERT, allowing us to re-use the same hyperparameters as BERT, with small adjustments in the recommended range (Devlin et al., 2018). We use the BERT-base-uncased version with a hidden size of 768. We use the standard Adam optimizer with a learning rate of 1e-05, weight decay of 0.01, and dropout of 0.2 across all layers. All models are trained on an NVIDIA V100 GPU.

Models:
We measure the performance of the following baselines (two non-neural and three neural).
• RANDOM: Randomly predicts one of the three labels {+, −, •} to guess [d_i, d_e]. Supporting sentences i and j are picked randomly from the |avg_sent| sentences.
• MAJORITY: Predicts the most frequent label (no effect, i.e., d_e = •, in the case of the WIQA dataset).
• q_e-ONLY: Inspired by existing work (Gururangan et al., 2018), this baseline exploits annotation artifacts (if any) in the explanation dataset by retraining QUARTET using only q_e while hiding the perturbation q_p in the question.
• TAGGING: We can reduce our task to a structured prediction task. An explanation (i, j, d_i, d_e) requires a span prediction x_{i:j} and labels on that span. So, for example, the explanation i = 1, j = 2, d_i = +, d_j = − for input x_1 ... x_5 can be expressed as a tag sequence. Formulating the task as sequence tagging allows us to use any standard sequence tagging model, such as a CRF, as a baseline. The decoder invalidates sequences that violate the assumptions (a1-a4). To make the encoder strong and yet comparable to our model, we use exactly the same BERT encoder as QUARTET. For each sentence representation u_k, we predict a tag ∈ T. A CRF over these local predictions additionally provides global consistency. The model is trained end-to-end by minimizing the negative log likelihood from the CRF layer.
• BERT-NO-EXPL: State-of-the-art BERT model (Tandon et al., 2019) that only predicts the final answer d e , but cannot predict the explanation.
• BERT-W/-EXPL: A standard BERT based approach to the explanation task that predicts the explanation structure. This model minimizes only the cross-entropy loss of the final answer d e , predicting an explanation that provides the best answer accuracy.
• DATAAUG: This baseline is adapted from Asai and Hajishirzi (2020), where a RoBERTa model is augmented with symbolic knowledge and an additional consistency-based regularizer. Compared to our model, this approach uses a more robustly pre-trained BERT (RoBERTa) with data augmentation, optimized for QA accuracy.
• QUARTET: our model described in §4 that optimizes for the best explanation structure.
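The TAGGING reduction above can be sketched by encoding an explanation as one tag per sentence. The BIO-style tag set here is our own hypothetical choice; the paper does not spell out its exact scheme.

```python
def explanation_to_tags(i, j, d_i, d_j, K):
    """Tag sequence for an explanation over a K-sentence passage."""
    tags = ['O'] * K              # 'O' = not a supporting sentence
    tags[i - 1] = 'B-' + d_i      # span start carries d_i
    if j > i:
        tags[j - 1] = 'I-' + d_j  # span end carries d_j
    return tags
```

Under this encoding, the example explanation i = 1, j = 2, d_i = +, d_j = − for a 5-sentence input becomes the sequence ['B-+', 'I--', 'O', 'O', 'O'], which a CRF decoder can then score while rejecting sequences that violate the forward-flow assumptions.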

Explanation accuracy
QUARTET is also the best model on explanation accuracy. Table 2 shows the performance on the explanation task: QUARTET outperforms the baselines on every component of the explanation. QUARTET performs better at predicting i than j. This trend correlates with human performance: picking the second supporting sentence is harder, because in a procedural text neighboring steps can have similar effects.
We found that the explanation dataset does not contain substantial annotation artifacts for the q_e-ONLY model to leverage (q_e-ONLY < MAJORITY). We also tried a simple bag-of-words and an embedding-vector-based alignment between q_p and x_i in order to pick the most similar x_i. These baselines perform worse than random, showing that aligning q_p and x_i involves commonsense reasoning that these models cannot address.

Downstream Task
In this section, we investigate whether a good explanation structure leads to better end-task performance. QUARTET advocates treating explanations as first-class citizens from which an answer can be derived.

Accuracy on a QA task
We compare against the existing SOTA on the WIQA no-explanation task. Table 3 shows that QUARTET improves over the previous SOTA, BERT-NO-EXPL, by 7%, achieving a new SOTA result. Both models are trained on the same dataset.[6] The major difference between BERT-NO-EXPL and QUARTET is that BERT-NO-EXPL solves only the QA task, whereas QUARTET solves explanations, and the answer to the QA task is derived from the explanation. Multi-tasking (i.e., explaining the answer) provides the gains for QUARTET. All models achieve strong improvements over RANDOM and MAJORITY. The weakest model is TAGGING. The space of possible sequences of correct labels is large, and we believe the current training data is sparse, so larger training data might help. QUARTET avoids this sparsity problem because, rather than a sequence, it learns over four separate explanation components. Table 4 presents accuracy by question type. QUARTET achieves large gains over BERT-NO-EXPL on the most challenging out-of-para questions. This suggests that QUARTET improves the alignment of q_p and x_i, which involves some commonsense reasoning.

Correlation between QA and Explanation
QUARTET not only improves QA accuracy but also explanation accuracy. We find that QA accuracy (acc_de in Table 2) is positively correlated (Pearson coefficient 0.98) with explanation accuracy (acc_expl). This shows that if a model is optimized for explanations, it achieves better performance on the end task. Thus, with this result we establish that (at least on our task) models can make better predictions when forced to generate a sensible explanation structure. An educational psychology study (Dunlosky et al., 2013) hypothesizes that student performance improves when students are asked to explain while learning. However, their hypothesis is not conclusively validated due to lack of evidence. The results in Table 2 hint that, at least on our task, machines that learn to explain ace the end task.

[6] We used the same code and parameters as provided by the authors of WIQA-BERT. The WIQA with-explanations dataset has about 20% fewer examples than the WIQA without-explanations dataset [http://data.allenai.org/wiqa/]. This is because the authors removed about 20% of instances with incorrect explanations (e.g., where turkers didn't have an agreement). So we trained both QUARTET and WIQA-BERT on exactly the same vetted dataset. This helped increase the score of WIQA-BERT by 1.5 points.

Error analysis
We analyze our model's errors (marked in red) over the dev set, and observe the following phenomena.

Multiple explanations:
As mentioned in Section 3, more than one explanation can be correct. 22% of the incorrect explanations were nevertheless reasonable, suggesting that the overall explanation accuracy scores might underestimate the explanation quality. The following example illustrates that while gathering firewood is appropriate when fire is needed for survival, one can argue that going to the wilderness is less precise but possibly correct.

Fig. 3 shows that the predicted and gold distributions of i and j are similar. Here, sentence id = −1 indicates no effect. The model has learned from the data to never predict j < i, without any hard constraints.
The model is generally good at predicting i and j, and in many cases when the model errs, the explanation still seems plausible. Perhaps for the same underlying reason, the human upper bound is not high on i (75.9%) or on j (66.1%). We show an example where i and j are incorrectly predicted (in red) but the explanation remains plausible. The following example shows an instance where '−' is misclassified as '+', implying that there is more scope for improvement here.
Gold: less seeds fall to the ground → (OPP/-) seed falls to the ground → (OPP/-) seeds germinate → (MORE/+) fewer plants
Pred: less seeds fall to the ground → (OPP/-) seed falls to the ground → (OPP/-) seeds germinate → (OPP/-) fewer plants

4. in-para vs. out-of-para: The model performs better on in-para questions (typically, linguistic variations) than on out-of-para questions (typically, commonsense reasoning). See empirical evidence of this in Table 4.
The model is challenged by questions involving commonsense reasoning, especially connecting q_p with x_i in out-of-para questions. For example, in the following passage, the model incorrectly predicts • (no effect) because it fails to draw a connection between sleep and noise:

Passage: Pack up your camping gear, food. Drive to your campsite. Set up your tent. Start a fire in the fire pit. Cook your food in the fire. Put the fire out when you are finished. Go to sleep. Wake up ...
q_p: less noise from outside
q_e: you will have more energy

Analogous to i and j, the model also makes more errors between the labels '+' and '−' on out-of-para questions compared to in-para questions (39.4% vs. 29.7%); see Table 6. Tandon et al. (2019) discuss that some in-para questions may involve commonsense reasoning similar to out-of-para questions. The following is an example of an in-para question where the model fails to predict d_i correctly, because it cannot find the connection between protected ears and the amount of sound entering.
Gold: ears less protected → (MORE/+) sound enters ear → (MORE/+) sound hits ear drum → (MORE/+) more sound detected
Pred: ears less protected → (OPP/-) sound enters the ear → (OPP/-) sound hits ear drum → (MORE/+) more sound detected

5. Injecting background knowledge: To study whether additional background knowledge can improve the model, we revisit the out-of-para question that the model failed on. The model fails to draw a connection between sleep and noise, leading to an incorrect (no effect) '•' prediction.
By adding the following relevant background knowledge sentence to the paragraph "sleep requires quietness and less noise", the model was able to correctly change probability mass from d e = ' • ' to '+'. This shows that providing commonsense through Web paragraphs and sentences is a useful direction.
Passage: Pack up your camping gear, food ... Sleeping requires quietness and less noise. Go to sleep. Wake up ...
q_p: less noise from outside
q_e: you will have more energy


Assumptions and Generality

QUARTET makes two simplifying assumptions: (1) explanations are assembled from the provided sentences (question + context), rather than generated, and (2) explanations are chains of qualitative, causal influences, describing how an end state is influenced by a perturbation. Although these (helpfully) bound this work, the scope of our solution is still quite general. Assumption (1) is a common approach in other work on multihop explanation (e.g., HotpotQA), where authoritative sentences support an answer; in our case, we are the first to apply the same idea to chains of influences. Assumption (2) bounds QUARTET to explaining the effects of qualitative, causal influences. However, this still covers a large class of problems, given the importance of causal and qualitative reasoning in AI. The WIQA dataset provides the first large-scale dataset that exemplifies this class: given a qualitative influence, assemble a causal chain of events leading to a qualitative outcome. Thus QUARTET offers a general solution within this class, as well as a specific demonstration on a particular dataset.

Conclusion
Explaining the effects of a perturbation is critical, and we have presented the first system that can do this reliably. QUARTET not only predicts meaningful explanations, but also achieves a new state-of-the-art on the end-task itself, leading to an interesting finding that models can make better predictions when forced to explain. Our work opens up new directions for future research: 1) Can additional background context from the Web improve explainable reasoning? 2) Can such structured explanations be applied to other NLP tasks? We look forward to future progress in this area.