Towards Interpretable Reasoning over Paragraph Effects in Situation

We focus on the task of reasoning over paragraph effects in situation, which requires a model to understand the cause and effect described in a background paragraph, and apply the knowledge to a novel situation. Existing works ignore the complicated reasoning process and solve it with a one-step "black box" model. Inspired by human cognitive processes, in this paper we propose a sequential approach for this task which explicitly models each step of the reasoning process with neural network modules. In particular, five reasoning modules are designed and learned in an end-to-end manner, which leads to a more interpretable model. Experimental results on the ROPES dataset demonstrate the effectiveness and explainability of our proposed approach.


Introduction
As a long-standing fundamental task of natural language processing, machine reading comprehension (MRC) has attracted remarkable attention recently, and various MRC datasets have been studied (Rajpurkar et al., 2018; Dua et al., 2019b; Choi et al., 2018; Yang et al., 2018). Among them, reasoning over paragraph effects in situation (ROPES for short) is a very challenging scenario that requires understanding knowledge from a background paragraph and applying it to answer questions about a novel situation. Table 1 shows an example from the ROPES dataset (Lin et al., 2019): the background passage states that developmental difficulties could usually be treated by using iodized salt, the situation passage describes two villages using different salt, and the questions ask which village has more/fewer people experiencing developmental difficulties.

Background
Before iodized salt was developed, some people experienced a number of developmental difficulties, including problems with thyroid gland function and mental retardation. In the 1920s, we learned that these conditions could usually be treated easily with the addition of iodide anion to the diet. One easy way to increase iodide intake was to add the anion to table salt.

Situation
People from two villages ate lots of salt. People from Salt village used regular salt, while people from Sand village used iodized salt in their diets, after talking to specialists.

Q&A
Q: Which village had more people experience developmental difficulties? A: Salt
Q: Which village had less people experience developmental difficulties? A: Sand

Table 1: An example from the ROPES dataset. Effect property tokens are highlighted in blue, cause property tokens in orange, and world tokens in green.
Almost all existing works (Lin et al., 2019; Khashabi et al., 2020; Dua et al., 2019a; Gardner et al., 2020) for this task adopt a standard deep-learning MRC approach in one step: the question and a pseudo passage constructed by concatenating the background and situation are fed into a large pre-trained model (e.g., RoBERTa-large), and the answer is predicted directly by the model. However, the ROPES task is more complicated than traditional MRC, since it requires a model not only to understand the causes and effects described in a background paragraph, but also to apply that knowledge to a novel situation. Ignoring the understanding and reasoning process hinders such models from achieving their best performance. Consequently, the best F1 (61.6%) achieved so far is far below human performance (89.0%). More importantly, such a one-step approach makes the reasoning process unexplainable, while explainability is of great importance for complicated reasoning tasks.
We observe that humans solve this kind of complicated reasoning task in a sequential manner with multiple steps (Evans, 1984; Sloman, 1996; Sheorey and Mokhtari, 2001; Mokhtari and Reichard, 2002; Mokhtari and Sheorey, 2002). As shown in Table 1, the background paragraph usually states the relationship between a cause property and an effect property, while the situation describes multiple worlds, each of which is associated with a specific value of the cause property. Humans usually reason in a multi-step process: (1) identifying the mentioned worlds, (2) identifying the cause and effect properties, (3) understanding the relationship between the cause and effect properties, (4) comparing the identified worlds in terms of the cause property, and (5) inferring the comparison of the mentioned worlds in terms of the effect property based on (3) and (4).
Inspired by human cognitive processes, in this paper we propose a sequential approach that leverages neural network modules to implement each step of the above process. Specifically, we define:
• a World Detection module to identify potential worlds,
• an Effect and Cause Detection module to identify the effect and cause properties,
• a Relation Classification module to understand the relationship between effect and cause,
• a Comparison module to compare the identified worlds in terms of the cause property, and
• a Reasoning module to infer the comparison of the mentioned worlds in terms of the effect property.
These modules are trained in an end-to-end manner, and auxiliary loss over intermediate latent decisions further boosts the model accuracy.
Explicitly modeling the sequential reasoning process has two advantages. First, it achieves better performance, since the complicated reasoning process is decomposed into more manageable sub-tasks and each module only needs to focus on a simple sub-task. Second, the intermediate outputs provide a better understanding of the reasoning process, making the learnt model more explainable.
Experimental results on the ROPES dataset demonstrate the effectiveness and explainability of our proposed approach. It surpasses the state-of-the-art model by a large margin (6% absolute difference) in the five-fold cross-validation setting. Furthermore, analyses of the intermediate outputs show that each module in our learnt model performs well on its corresponding sub-task and explains the reasoning process well.

Related Work
Neural network modules have been studied in several works. Andreas et al. (2016) propose neural module networks with a semantic parser for visual question answering. Jiang and Bansal (2019) apply a self-assembling modular network with only three modules (Find, Relocate, and Compare) to HotpotQA (Yang et al., 2018). Gupta et al. (2019) extend neural module networks to answer compositional questions against a paragraph of text as context, and perform symbolic reasoning on a self-pruned subset of DROP (Dua et al., 2019b). Compared with them, we focus on a more challenging MRC task, reasoning over paragraph effects in situation, which has rarely been investigated and needs more complex reasoning. So far as we know, the only two works on this topic (i.e., Lin et al. (2019) and Khashabi et al. (2020)) use a one-step "black box" model. Such an approach performs well on some questions at the expense of limited interpretability. Our work solves this task in a logical manner and exposes the intermediate reasoning steps, which improves performance and interpretability concurrently.

Methodology
As shown in Figure 1, our approach consists of three components: contextual encoding, interpretable reasoning, and answer prediction.

Contextual Encoding
We use RoBERTa (Devlin et al., 2019; Liu et al., 2019) to encode the background, situation, and question together and generate contextualized embeddings. Specifically, given a background passage B = {b_i}_{i=1..m}, a situation passage S = {s_j}_{j=1..n}, and a question Q = {q_k}_{k=1..l}, we concatenate them with special tokens as ⟨s⟩ q_1, …, q_l ⟨/s⟩⟨/s⟩ s_1, …, s_n; b_1, …, b_m ⟨/s⟩, which is then fed into the successive transformer blocks of RoBERTa. The outputs H_b ∈ R^{m×d}, H_s ∈ R^{n×d}, and H_q ∈ R^{l×d} are the contextual embeddings for the background, situation, and question, respectively, where d is the dimension of the hidden states.
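The input packing above can be sketched as a small helper; this is an illustrative function (the name build_input and the literal special-token strings are ours for clarity — in practice the tokenizer inserts these special tokens itself):

```python
def build_input(question, situation, background):
    # RoBERTa-style packing: <s> question </s></s> situation ; background </s>
    # (illustrative sketch; a real tokenizer adds the special tokens)
    return f"<s> {question} </s></s> {situation} ; {background} </s>"
```

The situation is placed before the background so that situation tokens, which the reasoning modules attend to most, sit closer to the question segment.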

World Detection
This module aims to identify the concerned worlds in the situation according to a question. Taking Table 1 as an example, the question cares about two worlds, Sand village and Salt village. To achieve that, we apply a multilayer perceptron (MLP) over the situation representations H_s and normalize the projected logits with a softmax function to get an attention vector over all situation tokens for each world: p^s_{w1} and p^s_{w2} are the attention vectors over the situation for the first and second world, and the θ's are the learnable parameters of the MLPs. Note that since most examples in the ROPES dataset involve two concerned worlds, we identify two worlds in our model. However, more worlds can be handled by simply extending the module with more MLPs.
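A minimal NumPy sketch of this module is given below, assuming a one-hidden-layer MLP per world (the exact MLP depth and the random toy inputs are our assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def world_attention(H, W1, b1, w2, b2):
    # score each situation token with a one-hidden-layer MLP, then softmax
    h = np.maximum(H @ W1 + b1, 0.0)      # (n, d) -> (n, d), ReLU hidden layer
    return softmax(h @ w2 + b2)           # (n,) attention over situation tokens

n, d = 6, 8                               # toy sizes
H_s = rng.standard_normal((n, d))         # contextual situation embeddings
params = [(rng.standard_normal((d, d)), np.zeros(d),
           rng.standard_normal(d), 0.0) for _ in range(2)]
p_w1, p_w2 = (world_attention(H_s, *p) for p in params)  # one head per world
```

Each head produces a proper distribution over situation tokens; extending to k worlds just means instantiating k parameter sets.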

Effect and Cause Detection
This module aims to identify the effect and cause properties described in the background. To achieve that, another MLP is used to identify the effect property. Here p^b_e is the attention vector over background tokens in terms of the effect property, which attends more to tokens of the effect property. Taking Table 1 as an example, p^b_e is the attention over background tokens, whose values are much larger for developmental difficulties than for other tokens.
Next, we apply a relocate operation which re-attends to the background based on the situation and is used to find the cause property in the background (e.g., shifting the attention from developmental difficulties to iodized salt in Table 1). This is achieved with the help of a situation-aware background-to-background attention matrix R, whose entries are computed from pairs of background token embeddings combined with an embedding s̄ of the whole situation; here [;] denotes concatenation, ⊙ is the Hadamard product, and w_relo ∈ R^{3d} is a learnable parameter vector. Each row of R is then normalized using the softmax operation. Finally, we obtain the attention vector over background tokens in terms of the cause property, p^b_c. Here p^b_c should attend more to the tokens of the cause property; for example, iodized salt will get a larger attention value than other tokens in the background.
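One plausible instantiation of the relocate step is sketched below; the exact feature combination inside R (which embeddings are concatenated and where the Hadamard product with s̄ is applied) is an assumption on our part, since the paper's equation is not reproduced here:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relocate(H_b, s_bar, p_b_e, w_relo):
    # situation-aware background-to-background attention:
    # R[i, j] scores token j as the relocation target for token i,
    # conditioned on the situation summary s_bar (assumed feature layout)
    m, d = H_b.shape
    R = np.empty((m, m))
    for i in range(m):
        for j in range(m):
            feat = np.concatenate([H_b[i], H_b[j], H_b[j] * s_bar])  # (3d,)
            R[i, j] = w_relo @ feat
    R = softmax(R, axis=-1)      # normalize each row over target tokens
    return p_b_e @ R             # shift effect attention toward cause tokens

rng = np.random.default_rng(1)
m, d = 5, 4
H_b = rng.standard_normal((m, d))
s_bar = rng.standard_normal(d)                 # e.g. mean situation embedding
p_b_e = softmax(rng.standard_normal(m))
p_b_c = relocate(H_b, s_bar, p_b_e, rng.standard_normal(3 * d))
```

Because each row of R is a distribution and p^b_e is a distribution, the relocated p^b_c is again a valid distribution over background tokens.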

Relation Classification
This module aims to predict the qualitative relation between the effect and cause properties. Taking Table 1 as an example, the cause property iodized salt and the effect property developmental difficulties are negatively correlated. To achieve that, we first derive and concatenate representations of the cause and effect properties by averaging the background representation H_b weighted by the corresponding attention vectors p^b_c and p^b_e. Next, we adopt another MLP stacked with a softmax to get the corresponding probabilities, where p_rel = [p_rel−, p_rel+] denotes the probabilities of a negative and a positive relation, and θ_rel denotes the learnable parameters of the MLP. In the example shown in Table 1, p_rel− is supposed to be larger than p_rel+.
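The computation described above can be sketched as follows; we use a single linear layer in place of the paper's MLP for brevity (an assumption), keeping the attention-weighted averaging and two-way softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def relation_probs(H_b, p_b_c, p_b_e, W, b):
    # attention-weighted summaries of cause and effect, then a 2-way classifier
    cause = H_b.T @ p_b_c             # (d,) expected cause embedding
    effect = H_b.T @ p_b_e            # (d,) expected effect embedding
    logits = W @ np.concatenate([cause, effect]) + b
    return softmax(logits)            # [p_rel-, p_rel+]

rng = np.random.default_rng(2)
m, d = 5, 4
H_b = rng.standard_normal((m, d))
p_c = softmax(rng.standard_normal(m))
p_e = softmax(rng.standard_normal(m))
p_rel = relation_probs(H_b, p_c, p_e, rng.standard_normal((2, 2 * d)), np.zeros(2))
```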

Comparison
This module aims to compare the worlds in terms of the cause property. For example, world 2 (Sand village) is more relevant to iodized salt than world 1 (Salt village) in Table 1, since people in Sand village use iodized salt while people in Salt village use regular salt. This is achieved in three steps. First, we derive the attention of the cause property over the situation, p^s_c, from p^b_c with a similarity matrix M ∈ R^{n×m} between the situation and the background, where W_sb ∈ R^{d×d} are learnable parameters. Second, we use p^s_w to mask out the irrelevant cause property for each world. This step ensures the alignment between each world and its cause property, which is critical when one situation contains multiple worlds.
Third, each world's cause property is evaluated by a bilinear function in terms of its relevance to the cause property in the background, which is further normalized into a probability with a softmax, where W_com ∈ R^{d×d} is a learnable matrix, H_b^T p^b_c represents the expected embedding of the cause property in the background, H_s^T p^s_{c,w_i} represents the expected embedding of the cause property for world i, and p_{com,w_i} denotes the probability that world i is more relevant to the cause property.
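The bilinear scoring of this third step can be sketched as below (the toy sizes and random inputs are illustrative; the bilinear form and softmax follow the description above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def compare_worlds(H_b, H_s, p_b_c, p_s_c_w, W_com):
    # relevance of each world's cause-property evidence to the background cause
    c_b = H_b.T @ p_b_c                        # expected background cause embedding
    scores = [(H_s.T @ p) @ W_com @ c_b        # bilinear score per world
              for p in p_s_c_w]
    return softmax(np.array(scores))           # p_com over the two worlds

rng = np.random.default_rng(3)
m, n, d = 5, 6, 4
H_b = rng.standard_normal((m, d))
H_s = rng.standard_normal((n, d))
p_b_c = softmax(rng.standard_normal(m))
p_s_c_w = [softmax(rng.standard_normal(n)) for _ in range(2)]
p_com = compare_worlds(H_b, H_s, p_b_c, p_s_c_w, rng.standard_normal((d, d)))
```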

Reasoning
Given the relationship between the effect and cause properties, p_rel+ and p_rel−, and the comparison between worlds in terms of the cause property, p_{com,w_i}, this module infers the comparison between the identified worlds in terms of the effect property. Taking Table 1 as an example, given the negative relationship between developmental difficulties and iodized salt, and that Sand village uses iodized salt while Salt village does not, we infer that people in Salt village are more likely to experience developmental difficulties.
To this end, we compute p_{e,w_1} = p_rel+ · p_{com,w_1} + p_rel− · p_{com,w_2} (and symmetrically for w_2), where p_{e,w_i} is the probability that world i is more relevant to the effect property.
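This soft combination can be written directly in code; under a positive relation the cause-side ordering carries over to the effect side, while under a negative relation it flips (this formula is our reconstruction consistent with the surrounding definitions):

```python
def effect_probs(p_rel_neg, p_rel_pos, p_com):
    # p_com[i]: probability that world i is more relevant to the cause property
    # positive relation preserves the ordering; negative relation flips it
    p_e_w1 = p_rel_pos * p_com[0] + p_rel_neg * p_com[1]
    p_e_w2 = p_rel_pos * p_com[1] + p_rel_neg * p_com[0]
    return p_e_w1, p_e_w2
```

For the Table 1 example, a confidently negative relation (p_rel− = 0.9) with world 1 strongly tied to the cause (p_com = [0.8, 0.2]) yields p_{e,w_2} > p_{e,w_1}, i.e. the other world is more relevant to the effect.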

Answer Prediction
Given the intermediate outputs from the interpretable reasoning component, this module predicts the final answer to a question. Specifically, we first convert these intermediate outputs into text spans or 0/1 classes as follows.
• We take two steps to convert an attention vector output by World Detection or Effect and Cause Detection into a text span. First, the token with the highest probability is selected. Then it is expanded with contiguous left and right neighbors whose token probabilities are larger than a threshold t. In our experiments we set t = 1/l, where l is the length of the paragraph.
• For Comparison, Relation Classification, and Reasoning, we select the class with the highest probability.
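The span-conversion procedure for the two span-producing modules can be sketched as follows (the function name is ours; the argmax-then-expand logic and the 1/l default threshold follow the description above):

```python
def attention_to_span(p, t=None):
    # p: per-token attention probabilities; threshold defaults to 1 / len(p)
    if t is None:
        t = 1.0 / len(p)
    i = max(range(len(p)), key=lambda k: p[k])   # highest-probability token
    lo = hi = i
    while lo > 0 and p[lo - 1] > t:              # expand left while above t
        lo -= 1
    while hi < len(p) - 1 and p[hi + 1] > t:     # expand right while above t
        hi += 1
    return lo, hi                                # inclusive token span
```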
Then we synthesize a sentence ŝ in the format ⟨world⟩ has larger/smaller ⟨effect property⟩ than ⟨other world⟩, where the choice between "larger" and "smaller" depends on the result from the Reasoning module. Taking Table 1 as an example, the synthetic sentence is Salt village has larger developmental difficulties than Sand village. Such synthetic text explicitly expresses the comparison between the identified worlds in terms of the effect property. Finally, we concatenate it with the situation s and question q as ⟨s⟩ q; s ⟨/s⟩⟨/s⟩ ŝ ⟨/s⟩, and feed them into RoBERTa, which directly predicts the start and end positions of the final answer.
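Sentence synthesis is a simple template fill; a sketch (the function name is illustrative):

```python
def synthesize(world1, world2, effect, world1_larger):
    # template: <world> has larger/smaller <effect property> than <other world>
    comp = "larger" if world1_larger else "smaller"
    return f"{world1} has {comp} {effect} than {world2}"
```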

Model Training
Two models (i.e., the interpretable reasoning model and the answer prediction model) are learned in our approach.
Interpretable Reasoning The final loss function for interpretable reasoning is a weighted sum of per-module losses, L = Σ_x α_x L_x, where L_x is the loss of module x against its gold labels (gold spans, or binary labels in {0, 1} for the classification modules), and α_x is the weight for module x.
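The weighted combination is straightforward; a sketch with the α_x values reported later in the paper used as an example (the per-module loss values here are placeholders):

```python
def reasoning_loss(module_losses, alphas):
    # weighted sum of per-module auxiliary losses: L = sum_x alpha_x * L_x
    return sum(alphas[name] * loss for name, loss in module_losses.items())
```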

Answer Prediction
The training objective of the answer prediction model is the cross-entropy loss over the start and end positions, where s, e ∈ R^{m+n+k} are the predicted probabilities of the start and end positions, k is the length of the synthetic sentence ŝ, and s̃, ẽ ∈ {0, 1}^{m+n+k} are the corresponding gold labels.
Experimental Setup

Dataset
We evaluate our proposed approach on the ROPES dataset (Lin et al., 2019). Different from other extractive MRC datasets, the train/dev/test sets in ROPES are split by annotator instead of context (Geva et al., 2019; Lin et al., 2019). This might pose a large data bias in each set. For example, as can be seen in Table 2, the dev and test sets have similar numbers of questions, while the vocabulary sizes of the background and situation in the test set are 2× and 2.7× larger than those in the dev set. The same holds for the question vocabulary size, which indicates a distribution gap between the train/dev and test sets and might lead to under- or over-estimating the performance of a model.

Cross Validation
Because of the limited size of the official dev set and the potential data bias between train/dev and test, we conduct 5-fold cross-validation to verify the effectiveness of the proposed approach. K-fold cross-validation assesses the predictive performance of a model and judges how it performs outside the sample on new data. It therefore helps ensure unbiased results, avoid over-fitting, and verify the generalization capability of a model (Burman, 1989; Browne, 2000; Raschka, 2018).

Implementation Details
Our model is built on the pre-trained language model RoBERTa-large in PyTorch. We train the five modules on one P100 16GB GPU and use four GPUs for predicting the final answer. We tune the parameters α_x according to the averaged performance of all modules, and set them to 0.05 for the span-based losses, 0.2 for the Comparison and Relation predictions, and 0.3 for the Reasoning prediction. The evaluation metrics are EM and F1, the same as those used for SQuAD. The detailed hyperparameters are described in Appendix A.

Baseline
We re-implemented the best model (RoBERTa) on the leaderboard as our baseline.

Experimental Results

Question Answering Performance
Table 3 shows the question answering performance of different models, where our approach outperforms the RoBERTa-large model by 8.4% and 6.4% in terms of EM and F1 scores, respectively. These results show that, compared to a one-step "black box" model, our interpretable approach, which mimics the human reasoning process, has a better capability of conducting such complex reasoning. Furthermore, we also list in Table 3 the performance of our approach and the baseline model when using only a randomly sampled 10% of the training data. That is, both the neural network modules and the answer prediction model in our approach are trained with only 1074 questions. As seen in the table, our model learned from 10% of the training examples achieves performance competitive with the baseline model learned from the full data (71.8% vs. 71.1% in terms of F1 score). In contrast, the performance of the baseline model drops dramatically, by 32%. This indicates that the traditional black-box approach requires much more training data, while our approach has better generalization ability and can learn the reasoning capability with much fewer examples.
We also implement a rule-based answer prediction approach (see the detailed description in Appendix E), whose rules are generated based on the same 10% of training examples used for the interpretable reasoning components. As shown in Table 3, the rule-based approach performs worse than the RoBERTa model, indicating the better generalization ability of pre-trained models.

Case Study of Interpretability
The most remarkable difference between our model and the one-step "black box" model is that our model outputs multiple intermediate predictions, which well explain the reasoning process. Note that all modules in our model output probabilities or attention over the input text, which are further fed into downstream modules for end-to-end learning. In order to explicitly visualize the output of each module, we take an approach similar to §3.3 to convert these probabilities into a text span or a 0/1 classification.

We demonstrate the reasoning process of our model with a running example shown in Table 4 (see more examples in Appendix D). Here the background states the relationship between CPU load and data volume, i.e., CPU load goes up when processing a larger volume of data. The situation describes that Tory stored different sizes of data at different times; for example, he stored 301 Gigabytes at 8 AM and went to sleep at 1 PM. Finally, the question asks to compare CPU loads between 8 AM and 1 PM. As shown in Table 4, our model outputs several intermediate results. First, it identifies the two concerned worlds, 8 AM and 1 PM, from the situation. Then it predicts the effect property, CPU load goes up, given which the cause property in the background (i.e., storing large volumes of data) and the corresponding values for the two worlds (i.e., 301 Gigabytes and sleep) are predicted. Next, it compares the two worlds in terms of the cause property and predicts that world 1 is larger than world 2. It also predicts that the cause property and effect property are positively related, i.e., the relation is classified as 1. Finally, it reasons that world 1 incurs higher CPU load than world 2. This example demonstrates that our approach not only predicts the final answer to the question, but also provides detailed explanations of the reasoning process.

Neural Network Module Performance
Taking the same approach as in §5.2, we convert the output of each module into a text span or a predicted class. We manually sampled another 5% of the training data, labeled it with gold outputs for each module, and evaluated the visualized results of all modules. Table 5 summarizes the performance of each module, where predicted text spans are measured by F1 score and classification predictions are measured by accuracy.
World Detection This module implements a capability similar to traditional extractive MRC, since both require detecting concerned text spans in a passage according to a question. Consequently, it achieves performance comparable to top models on the popular SQuAD dataset: our World Detection module reaches about 83% F1, while a single RoBERTa-large model on SQuAD gets about 89%.
The gap might come from different modeling styles.
Our model predicts the probability of each token being concerned, while SQuAD models directly predict the starting and end position of an answer, which performs better on boundary detection.

Effect and Cause Detection Compared with World Detection, the F1 score for this module decreases but is still acceptable (F1 = 67.6 for Effect_B, 57.6 for Cause_B). The most likely reason is that effects and causes are usually longer than world names. For example, the average length of a world name is 1.2 tokens, while those of effects and causes are 2.7 and 2.2, respectively. Longer text spans increase the difficulty of prediction.
For the above two span-related modules, we argue that since our model leverages the attention scores in a soft way, it is less sensitive to boundary errors. We therefore add a fuzzy F1 score for them: the fuzzy F1 of each question is set to 1 as long as its original F1 is larger than 0. As shown in Table 5, the fuzzy F1 scores of these two modules increase to 70%∼86%, indicating their good reasoning capability.
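The fuzzy F1 described above reduces to counting questions with any gold overlap; a sketch:

```python
def fuzzy_f1(f1_scores):
    # per-question fuzzy F1: 1.0 whenever the predicted span overlaps the
    # gold span at all (original F1 > 0), then averaged over questions
    return sum(1.0 if f > 0 else 0.0 for f in f1_scores) / len(f1_scores)
```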
Comparison, Relation Classification These two modules essentially require classification capability. The high accuracies (83.8% and 84.5%) indicate that our modeling approach effectively leverages the predictions of upstream modules and does a good job on both.
Reasoning Given the high accuracy of the Comparison and Relation Classification modules, the Reasoning module achieves 74% accuracy, which provides high-quality input for the final answer prediction.

Error Analysis
We randomly sampled 200 wrongly predicted questions for error analysis and find that they fall into two major types, described below (please see the complete questions, backgrounds, and situations in Appendix C).

Type One Error
Type one errors are caused by wrong module predictions, most of which occur in the following three modules.
Wrongly Predicted Worlds Such errors are mainly caused by the length imbalance between question and situation. Since the situation is usually much longer than the question, the World Detection module might make the same predictions for different questions about the same situation.

Wrongly Predicted Cause Property in Situation
Such errors are mainly caused by imbalanced descriptions of different worlds, where a situation describes one world in detail but mentions another world with very few words. In such cases, the Comparison module might assign the same cause property to different worlds in the situation.
Wrongly Predicted Comparison Results Such errors often occur when two worlds are described with similar words, e.g., "high" vs. "higher", or "smoking" vs. "not smoking", in which case the Comparison module might be confused by the similar expressions of the two worlds and fail to compare them.

Type Two Error
Type two errors occur when the proposed framework is not suitable for solving the questions. Here we list some example cases.
Missing Knowledge The background paragraph does not provide sufficient knowledge for reasoning. For example, a background paragraph only describes information about fish, while the question asks whether fertilization takes place inside or outside the mother's body for a mammal.
Implicit Worlds The concerned worlds in a question are not explicitly described in the situation. For example, a situation paragraph says that Mattew does intensive workouts, while the question asks whether his strength will increase or decrease when he stops working out. In such a case, the world in which Mattew stops working out is not explicitly described in the situation.
Additional Math Computation Answering such questions requires additional math computation (e.g., subtraction or addition). For example, a background states the speed of sound waves in air/water/iron, and the question asks how much faster a channel (with water) would be than another channel (with air).

Conclusion and Future Work
In this paper, we aim to answer ROPES questions in an interpretable way by leveraging five neural network modules. These modules are trained in an end-to-end manner, and each module provides transparent intermediate outputs. Experimental results demonstrate the effectiveness of each module, and analysis of the intermediate outputs shows good interpretability of the inference process, in contrast with "black box" models. Moreover, we find that, with explicitly designed compositional modeling of the inference process, our approach trained on a few examples achieves accuracy similar to that of strong baselines trained on the full data, which indicates better generalization capability. Meanwhile, extending these modules to a larger scope of question types or more complex scenarios remains a challenge, and we will further investigate the trade-off between explainability and scalability.

Background
As a cell grows, its volume increases more quickly than its surface area. If a cell were to get very large, the small surface area would not allow enough nutrients to enter the cell quickly enough for the cell's needs... Such cell types are found lining your small intestine, where they absorb nutrients from your food through protrusions called microvilli.

Situation
There are two cells inside a Petri dish in a laboratory, cell X and cell Z. These cells are from the same organism, but are not the same age. Cell X was created two weeks ago, and cell Z was created one month ago. Therefore, cell Z has had two extra weeks of growth compared to cell X.

2. Generate machine-readable labels automatically: We then use scripts to automatically transform the annotations into machine-readable form, i.e., we record the start and end character indices for all spans and keep the results for the comparison, relation, and reasoning modules in binary form.

C Error Cases for Modules
Table 9 lists several type 1 error cases mentioned in the Error Analysis section, while Table 10 lists type 2 error cases that are beyond the scope of our model.

D More Examples
We present more examples that are correctly answered by our model in Table 11.

E Heuristic Rules for Answer Prediction
We also implement a rule-based approach to predict the final answer, which contains the following steps: 1. Categorize questions based on the type of answer: By looking at the labeled training data, we find that answers can be divided into two types: 1. World Type, where the answer is one of the compared worlds; 2. Comparative Word Type, where the answer is a comparative word like "more" or "less".
2. For World Type, we filter out such questions by searching for question keywords; for example, questions starting with {What, Which, Who, Where, When} usually have World Type answers. We then determine the result based on the prediction obtained from the Reasoning module.
3. For Comparative Word Type, we further filter out this type of question from the remaining questions by defining a list of comparative word pairs like {'more': 'less', 'higher': 'lower', ...}. We then identify the primary world being compared in the question, associate it with the worlds identified by the World Detection module, and determine the comparative word for the primary compared world using the results from the Reasoning module.
4. For the remaining questions, we simply return the world with the higher effect-property probability from the Reasoning module as the final answer.
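The steps above can be sketched as a single dispatch function. This is a simplified illustration (the keyword lists are abridged from the description, and the primary-world matching uses a plain substring check, which is our simplification):

```python
WORLD_STARTERS = ("What", "Which", "Who", "Where", "When")
COMPARATIVES = {"more": "less", "higher": "lower", "larger": "smaller"}

def rule_based_answer(question, worlds, p_e):
    # worlds: the two detected world spans; p_e: per-world effect-property
    # probabilities from the Reasoning module (names here are illustrative)
    more_world = worlds[0] if p_e[0] >= p_e[1] else worlds[1]
    tokens = question.split()
    if tokens and tokens[0] in WORLD_STARTERS:
        return more_world                          # step 2: World Type answer
    # step 3: Comparative Word Type — find the primary world in the question
    primary = next((w for w in worlds if w in question), worlds[0])
    for pos, neg in COMPARATIVES.items():
        if pos in question and neg in question:    # e.g. "... higher or lower ..."
            return pos if primary == more_world else neg
    return more_world                              # step 4: fallback
```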

Figure 1 :
Figure 1: The left part is the architecture of our model. The middle part is the interpretable reasoning component. The right part summarizes the inputs and outputs flowing between the modules. The encoded contextual representations H_q, H_s, H_b serve as global variables for the interpretable reasoning component.

Table 2 :
ROPES dataset statistics. So far as we know, ROPES is the only dataset that requires reasoning over paragraph effects in situation: given a background paragraph that contains knowledge about the relations of causes and effects and a novel situation, questions about applying the knowledge to the novel situation need to be answered. Table 1 shows an example and Table 2 presents the statistics. Note that, different from other extractive MRC datasets, the train/dev/test sets in ROPES are split by annotator instead of context.

Table 3 :
Performance of different models on the ROPES dataset under cross-validation setting.

Table 4 :
A running example with visualized intermediate outputs of our approach.

Table 5 :
Performance of each module.

Table 7 :
Detailed hyperparameters used in Answer Prediction. We provide search bounds for each hyperparameter and list the hyperparameter combinations for our best model and the baseline model. Other unmentioned parameters are kept the same as those used in BERT.

Table 8 :
An example with auxiliary supervision labels.