Multi-Step Inference for Reasoning Over Paragraphs

Complex reasoning over text requires understanding and chaining together free-form predicates and logical connectives. Prior work has largely tried to do this either symbolically or with black-box transformers. We present a middle ground between these two extremes: a compositional model reminiscent of neural module networks that can perform chained logical reasoning. This model first finds relevant sentences in the context and then chains them together using neural modules. Our model gives significant performance improvements (up to 29% relative error reduction when combined with a reranker) on ROPES, a recently introduced complex reasoning dataset.


Introduction
Performing chained inference over natural language text is a long-standing goal in artificial intelligence (Grosz et al., 1986; Reddy, 2003). This kind of inference requires understanding how natural language statements fit together in a way that permits drawing conclusions. This is very challenging without a formal model of the semantics underlying the text, and when polarity needs to be tracked across many statements.
For instance, consider the example in Figure 1 from ROPES (Lin et al., 2019), a recently released reading comprehension dataset that requires applying information contained in a background paragraph to a new situation. To answer the question, one must associate each category of flowers with a polarity for having brightly colored petals. This must be done by going through the information about pollinators given in the situation and linking it to what was said about pollinators and brightly colored petals in the background paragraph, while tracking the polarity of those statements.

Prior work addressing this problem has largely either used symbolic reasoning, such as Markov logic networks (Khot et al., 2015) and integer linear programming (Khashabi et al., 2016), or black-box neural networks. Symbolic methods give some measure of interpretability and the ability to handle logical operators that track polarity, but they are brittle and unable to handle the variability of language. Neural networks often perform better on practical datasets, as they are more robust to paraphrase, but they lack any explicit notion of reasoning and are hard to interpret.

We present a model that is a middle ground between these two approaches: a compositional model, reminiscent of neural module networks, that can understand and chain together free-form predicates and logical connectives. The proposed model is inspired by neural module networks (NMNs), which were proposed for visual question answering (Andreas et al., 2016b,a). NMNs assemble a network from a collection of specialized modules, where each module performs some learnable function, such as locating a question word in an image or recognizing relationships between objects in the image.
The modules are composed in a way specific to what is asked in the question, and then executed to obtain an answer. We design general modules targeted at the reasoning necessary for ROPES and compose them together to answer questions.
We design three kinds of basic modules to learn neuro-symbolic multi-step inference over questions, situations, and background passages. The first module, SELECT, determines which information (in the form of spans) is essential to the question; the second, CHAIN, captures the interaction among multiple statements; the last, PREDICT, assigns confidence scores to potential answers. The three basic modules can be instantiated separately and freely combined.
In this paper, we investigate one possible combination of these modules for multi-step inference on ROPES. The results show that with multi-step inference, the model achieves significant performance improvements. Furthermore, when combined with a reranking architecture, the model achieves a relative error reduction of 29% and 8% on the dev and test sets of the ROPES benchmark, respectively. As ROPES is a relatively new benchmark, we also present some analysis of the data, showing that the official dev set is likely better treated as an in-domain test set, while the official test set is more of an out-of-domain test set.

Model
We first describe the baseline system, a typical QA span extractor built on ROBERTA (Liu et al., 2019), and then present the proposed system with multi-step inference. Finally, we introduce a reranker that applies multi-step inference to the output of the baseline system.

Baseline
Our baseline system is a span extractor built on top of ROBERTA. Given the passage representations from ROBERTA, $P_{\text{RoBERTa}} = [x_0, \ldots, x_{n-1}]$, two scores are generated for each token by a span scorer, giving the chance of being the start and the end of the answer span:
$$\bar{S}, \bar{E} = \text{QA\_score}(P_{\text{RoBERTa}}),$$
where $\bar{S} = [\bar{s}_0, \bar{s}_1, \ldots, \bar{s}_{n-1}]$ and $\bar{E} = [\bar{e}_0, \bar{e}_1, \ldots, \bar{e}_{n-1}]$ $(0 \le k < n)$ are the scores of the start and the end of answer spans, respectively. $\text{QA\_score}(\cdot): \mathbb{R}^{d_x} \to \mathbb{R}^2$ is a linear function, where $d_x$ is the output dimension of ROBERTA. The span with the highest combined start and end scores is extracted as the answer by the span extractor:
$$(i^*, j^*) = \operatorname*{arg\,max}_{i \le j} \; (\bar{s}_i + \bar{e}_j),$$
where the span $(i^*, j^*)$ is the answer.
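A minimal PyTorch sketch of this extraction step (the function name, signature, and masking details are our illustration, not the authors' code):

```python
import torch

def extract_span(token_reprs, qa_score, mask):
    """Score every token as a span start/end and return the best valid span.

    token_reprs: (n, d_x) encoded tokens from the passage.
    qa_score:    callable mapping (n, d_x) -> (n, 2) start/end scores
                 (a linear layer in the paper's setup).
    mask:        (n,) bool tensor, True where answers may appear
                 (situation and question tokens; background is masked out).
    """
    scores = qa_score(token_reprs)                        # (n, 2)
    start = scores[:, 0].masked_fill(~mask, float("-inf"))
    end = scores[:, 1].masked_fill(~mask, float("-inf"))
    joint = start.unsqueeze(1) + end.unsqueeze(0)         # joint[i, j] = s_i + e_j
    n = joint.size(0)
    lower = torch.tril(torch.ones(n, n, dtype=torch.bool), diagonal=-1)
    joint = joint.masked_fill(lower, float("-inf"))       # enforce i <= j
    i, j = divmod(joint.argmax().item(), n)
    return i, j
```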
1. Model code is available at https://github.com/LeonCrashCode/allennlp/blob/transf-
2. The answer spans always appear in the situation and question passages, so we mask the scores for the background passage.

Multi-Step Inference for ROPES
Instead of a simple span prediction head on top of ROBERTA, our proposed multi-step inference model uses a series of neural modules targeted at chained inference. Like the baseline, our model begins with encoded ROBERTA passage representations $P_{\text{RoBERTa}}$, but replaces the QA_score function with an MS-Inference function, which similarly outputs a span start and end score for each encoded token $x_k$:
$$\bar{S}, \bar{E} = \text{MS-Inference}(P_{\text{RoBERTa}}).$$
The MS-Inference(·) function consists of several modules. These modules SELECT relevant information from parts of the passage, CHAIN the selected text together, then PREDICT the answer to the question given the result of the chaining. These modules are applied to $P_{\text{RoBERTa}}$, which is decomposed into $B_{\text{RoBERTa}}$, $S_{\text{RoBERTa}}$, and $Q_{\text{RoBERTa}}$, denoting the ROBERTA token representations of the background, the situation, and the question, respectively.
As most of the questions in ROPES require the same basic reasoning steps, we use a fixed combination of these modules to answer every question, instead of trying to predict the module layout for each question, as was done in prior work (Hu et al., 2017). This combination is shown in Figure 2: we SELECT important parts of the question passage and CHAIN them with the background passage to find a likely part of the background that supports answering the question (marked in red). Then we SELECT important parts of the background passage, combine them with the previous results (marked in blue), and CHAIN the combined information to find relevant parts of the situation passage (marked in green); finally we PREDICT an answer (marked in black), which is most often found in the situation text. The intuition for how these modules piece together the information necessary to answer the question is shown in Figure 1. The actual operations performed by each module are described below.
SELECT The select module, $z = \text{SELECT}(Y)$, where $Y \in \mathbb{R}^{n \times d_x}$ and $z \in \mathbb{R}^{d_x}$, aims to find the important parts of its input and summarize them in a single vector. It first uses a learned linear scoring function, $f(\cdot): \mathbb{R}^{d_x} \to \mathbb{R}$, to determine which parts of its input are most important, then converts those scores into a probability distribution using a SOFTMAX operation, and computes a weighted sum of the inputs:
$$a = \text{softmax}(f(Y)), \qquad z = \sum_{k=0}^{n-1} a_k \, y_k.$$

Figure 2: Multi-step inference model, where ⊗ is the operation to collect multiple vectors as a list, Z, Y are the interfaces of the modules, and X is a token representation to be scored as start/end of the answer span in the QA systems, or a candidate span representation to be scored in the reranking systems.

CHAIN The chain module, $z = \text{CHAIN}(Y, Z)$, computes the interaction between an input matrix $Y$ and a list of input vectors $Z = [z_0, z_1, \ldots, z_{l-1}]$, where $Y \in \mathbb{R}^{n \times d_x}$, $z_k \in \mathbb{R}^{d_k}$, and $d_k$ is the dimension of the $k$th input vector $(0 \le k < l)$, and again outputs a summary vector of this interaction, $z \in \mathbb{R}^{d_x}$. Intuitively, this module is supposed to chain together the inputs $Y$ and $Z$ and return a summary of the result. This is done with the following operations:
$$\hat{z} = g(z_0; z_1; \ldots; z_{l-1}), \qquad z = \text{attention}(Y, \hat{z}),$$
where $g(\cdot): \mathbb{R}^{d_0 + d_1 + \ldots + d_{l-1}} \to \mathbb{R}^{d_x}$ is a linear function, $;$ is concatenation, and $\text{attention}(\cdot)$ is instantiated with multi-head attention, using $\hat{z}$ as the query and $Y$ as the keys and values.

PREDICT The predict module, $S = \text{PREDICT}(Z, X)$, takes the list of output vectors $Z = [z_0, z_1, \ldots, z_{m-1}]$ from previous modules, where $z_k \in \mathbb{R}^{d_k}$ $(0 \le k < m)$ and $m$ is the number of previous modules, along with the candidates $X$, where $n$ is the number of candidates, and produces scores for the candidates. In our base model, each candidate is a token in the situation or question, and the score is a pair of numbers representing span start and end probabilities for that token. When we use this module in a reranker (Section 2.3), the candidates $X$ are already encoded spans, and so we produce just one number for each span. The PREDICT module simply uses a linear scoring function on the concatenation of its inputs:
$$o_k = h(z_0; z_1; \ldots; z_{m-1}; x_k),$$
where $h(\cdot): \mathbb{R}^{d_0 + \ldots + d_{m-1} + d_x} \to \mathbb{R}^{r}$ is a linear function, $;$ is concatenation, and $r = 2$ if the module is used to extract spans, while $r = 1$ if the module is used to score candidates for the reranker.
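The three modules can be sketched in PyTorch roughly as follows (class names, shapes, and the query/key/value wiring of the multi-head attention are our reading of the equations above, not the authors' released code):

```python
import torch
import torch.nn as nn

class Select(nn.Module):
    """z = SELECT(Y): softmax-weighted summary of the rows of Y."""
    def __init__(self, d_x):
        super().__init__()
        self.f = nn.Linear(d_x, 1)                     # learned scoring function

    def forward(self, Y):                              # Y: (n, d_x)
        a = torch.softmax(self.f(Y).squeeze(-1), dim=0)  # (n,)
        return a @ Y                                   # weighted sum: (d_x,)

class Chain(nn.Module):
    """z = CHAIN(Y, Z): attend over Y with a query built from the list Z."""
    def __init__(self, d_in, d_x, n_heads=8):
        super().__init__()
        self.g = nn.Linear(d_in, d_x)                  # d_in = sum of z_k dims
        self.attn = nn.MultiheadAttention(d_x, n_heads)

    def forward(self, Y, Z):                           # Y: (n, d_x), Z: list of vectors
        q = self.g(torch.cat(Z)).view(1, 1, -1)        # query: (1, 1, d_x)
        kv = Y.unsqueeze(1)                            # keys/values: (n, 1, d_x)
        out, _ = self.attn(q, kv, kv)
        return out.view(-1)                            # summary vector: (d_x,)

class Predict(nn.Module):
    """S = PREDICT(Z, X): linear score of each candidate x_k given Z."""
    def __init__(self, d_in, r):
        super().__init__()
        self.h = nn.Linear(d_in, r)                    # r = 2 for spans, 1 for reranking

    def forward(self, Z, X):                           # X: (n, d_x)
        z = torch.cat(Z)                               # concatenated module outputs
        zX = torch.cat([z.expand(X.size(0), -1), X], dim=-1)
        return self.h(zX)                              # (n, r)
```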
Full model Our full model combines these modules to compute span start and end scores for each token, following the SELECT, CHAIN, and PREDICT sequence described above (depicted graphically in Figure 2).
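The full-model equation did not survive extraction; the following runnable sketch gives one plausible wiring that matches the prose description of Figure 2. Which intermediate vectors are passed on to PREDICT is our assumption:

```python
import torch

def ms_inference(q_select, b_chain, b_select, s_chain, predict, B, S, Q):
    """One plausible wiring of the modules, following Figure 2.

    B, S, Q: ROBERTA token representations of the background, situation,
    and question; the other arguments are instantiated module callables.
    """
    z_q = q_select(Q)                 # important parts of the question
    z_b = b_chain(B, [z_q])           # supporting part of the background (red)
    z_bs = b_select(B)                # important parts of the background (blue)
    z_s = s_chain(S, [z_b, z_bs])     # relevant part of the situation (green)
    X = torch.cat([S, Q], dim=0)      # answer candidates live in S and Q
    return predict([z_s], X)          # span start/end scores per token (black)
```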

Multi-Step Reranker
Most questions in ROPES have only two or three reasonable candidate answers (in Figure 1 these are "category A" and "category B"), and we find that the baseline model is able to reliably find these answers, though it has a hard time selecting between them. This suggests that a reranker that only focuses on deciding which of the candidates is correct could be effective. To do this, we take the top c spans output by the baseline system and score these candidates directly using our MS-Inference model instead of producing span start and end scores for each input token.
Scoring spans instead of tokens To feed the candidate spans into our multi-step inference model, we represent each span $(i, j)$ as a single vector by concatenating its endpoint tokens: $x_{i,j} = [x_i; x_j]$. We take all $c$ candidates and concatenate them together as $X$, instead of $X = [S; Q]$ as is done in our base model. Similarly, $\text{PREDICT}(Z, X)$ outputs a single score $\bar{o}$ per candidate instead of a pair of start and end probabilities.
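A small sketch of the span representation step, assuming PyTorch tensors (`span_repr` is our name for it):

```python
import torch

def span_repr(token_reprs, spans):
    """Represent each candidate span by concatenating its endpoint tokens:
    span (i, j) becomes [x_i; x_j], giving a (c, 2 * d_x) candidate matrix X.

    token_reprs: (n, d_x) encoded tokens; spans: list of (i, j) index pairs.
    """
    return torch.stack([torch.cat([token_reprs[i], token_reprs[j]])
                        for i, j in spans])
```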
Ensemble We additionally use an ensemble strategy for the reranker.
We train several rerankers and build a voting system in which each reranker votes for the candidate it ranks as the best answer. The candidate with the most votes is chosen as the final answer.
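A minimal sketch of the voting scheme; the tie-breaking rule (the answer first encountered wins) is our assumption, as the paper does not specify one:

```python
from collections import Counter

def ensemble_vote(predictions):
    """Majority vote over the answers predicted by several rerankers.

    predictions: list of answer strings, one per reranker. Ties are broken
    in favor of the answer first encountered (an assumption).
    """
    return Counter(predictions).most_common(1)[0][0]
```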

Data bias in ROPES
We experiment with ROPES (Lin et al., 2019), a recently proposed dataset which focuses on complex reasoning over paragraphs for document comprehension. We noticed a very severe drop in performance between the ROPES dev and test sets during initial experiments, and we performed an analysis of the data to figure out the cause. ROPES used an annotator split to separate the train, dev, and test sets in order to avoid annotator bias (Geva et al., 2019), but we discovered that this led to a large distributional shift between train/dev and test, which we explore in this section. In light of this analysis, we recommend treating the dev set as an in-domain test set, and the original test set as an out-of-domain test.
Answer types Our analysis is based on the syntactic category of the answer phrase. We use the syntactic parser of Kitaev and Klein (2018) to obtain constituent trees for the passages in ROPES, and take the constituent label of the lowest subtree that covers the answer span as the answer type.
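The "lowest covering constituent" operation can be illustrated with a toy tree representation (nested `(label, children)` tuples standing in for the parser's output; the actual analysis uses the Kitaev and Klein parser):

```python
def lowest_constituent(tree, start, end):
    """Label of the lowest subtree covering the token span [start, end).

    tree: a constituency tree as nested (label, children) tuples,
    where a leaf is a plain string token.
    """
    def n_leaves(t):
        return 1 if isinstance(t, str) else sum(n_leaves(c) for c in t[1])

    best = tree[0]

    def walk(t, offset):
        nonlocal best
        for child in t[1]:
            n = n_leaves(child)
            # Descend only into constituents that fully cover the span.
            if not isinstance(child, str) and offset <= start and end <= offset + n:
                best = child[0]
                walk(child, offset)
            offset += n

    walk(tree, 0)
    return best
```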
The four most frequent answer types in ROPES are noun phrases (NP), verb phrases (VP), adjective phrases (ADJP), and adverb phrases (ADVP). Table 1 shows examples of each type. Most NP answers come from the situation, while the other answer types typically come from the question.

Bias
The distribution of answer types in the train/dev/test sets of ROPES is shown in Table 2. We found that the distribution in the train set is similar to the development set, where most of the answers are NPs (85%), with ADJP being the second most frequent type. However, the test set has a very different distribution over answer types: less than half of the answers are NPs, and there are more VPs, ADJPs, ADVPs, and other types.
This distributional shift over answer types between train/dev and test raises challenges for reading comprehension systems; to perform well on test, the model must predict a significant number of answers from the question instead of from the situation, which only rarely happens in the training data. Given this distributional shift, it seems fair to characterize the official test as somewhat out-of-domain for the training data.

Experiments
In this section, we evaluate the performance of our proposed model relative to baselines on ROPES. Given the data bias discussed in Section 3, we additionally set up an experiment using the dev set as an in-domain test set, by partitioning the training set into train (9,824 questions) and train-dev (1,100 questions).
Training Following the settings of prior work (Lin et al., 2019), we fine-tune the ROBERTA-LARGE pre-trained transformer. The hidden sizes of all layers are set to 1024, the same as the output dimension of ROBERTA-LARGE, and the number of heads in the multi-head attention is 8. All the models share the same hyperparameters, which are shown in Table 4.

Metrics Though ROPES was released using both exact match (EM) and F1 as metrics, we only report EM here, as F1 has been shown to correlate poorly with human judgments on ROPES (Chen et al., 2019a). F1 assumes that answers sharing many overlapping words are likely similar; while this is largely true on SQuAD (Rajpurkar et al., 2016), where this particular F1 score was introduced, it is not true on ROPES, where answers like Village A and Village B are both plausible answers to a question. All systems are trained with three runs using different random seeds, and we report the average performance over the three runs.

Results Table 5 shows the performance of the three systems. The multi-step system and the multi-step reranker outperform the baseline system by 8.1% and 11.7% absolute EM accuracy on the dev set, respectively, and by 2.4% and 2.0% EM accuracy on the test set, respectively, showing that multi-step inference yields consistent improvements. With the ensemble, the multi-step reranker performs best on both the dev and test sets. As can be seen, the improvement of our model on the dev set is quite large. While performance is also better on the official test set, the gap is not nearly so large. To understand whether this was due to overfitting to the dev set or to the distributional shift discussed in Section 3, Table 5 also shows the results on dev-test, our split that treats the official dev set as a held-out test set.
Here, we still see large gains of 7.2% EM from our model, suggesting that it is indeed a distributional shift and not overfitting that is the cause of the difference in performance between the original dev and test sets. Properly handling the distributional shift in the ROPES test set is an interesting challenge for future work.

Analysis and Discussion
We conduct detailed analysis in this section, studying (1) the impact of various components of our model, (2) the gap between results on the development and test sets, (3) the strategy for sampling candidates for the reranker, and (4) the errors that the models make.

Ablation Study
We perform an ablation study on the multi-step system and the multi-step reranker. Table 6 shows the results on the dev set for various ablated systems. The performance of both systems drops when any one module is removed, reflecting the chained nature of the reasoning. The performance of the multi-step system drops more without Q SELECT or B CHAIN than without B SELECT or S CHAIN (around -2.1% EM), so the Q SELECT and B CHAIN modules play relatively more important roles. The performance of the multi-step reranker drops more without Q SELECT, B SELECT, or S CHAIN (around -5.9% EM) than without B CHAIN (-3.7% EM).
Answer Types We break down the overall accuracy by answer type, as shown in Table 7. All three systems perform substantially better on NP, ADJP, and ADVP questions than on VP questions. The main reason is that VP questions are associated with complex and long answers, e.g., acquire it from other organisms or make their own glucose. The major improvements happen on NP and ADVP questions, which explains the gap between the scores on the development set, with a large number of NP questions, and the test set, with relatively more VP questions. This analysis suggests future work on inference programs specialized to particular question types.
Candidate Sampling In order to train the reranker, we need training data with high-diversity candidates. However, a well-trained model does not generate candidates for the training set that are similar to those it generates for the dev and test sets, due to overfitting to the training set. In order to get useful candidates for the training set, we need a model that was not trained on the data for which it generates candidates. We investigate four strategies based on cross-validation to generate training-data candidates: 10-fold, 5-fold, 2-fold, and 3-turn. With the k-fold method, the training data is partitioned into k parts, and (k − 1) parts are used to train a model that generates candidate answers for the remaining part. With the k-turn method, the training data is partitioned into k parts, and the ith part is used to train a model that generates candidate answers for the (i + 1)th part.

Table 8 shows the average accuracy on training data:

    Method    EM
    10-fold   84.1
    5-fold    82.4
    2-fold    75.9
    3-turn    59.9

The accuracy on training data generated by the k-fold self-sampling methods is very high and not consistent with the dev and test sets. The accuracy on training data generated by the 3-turn self-sampling method is most similar to the accuracy of the baseline system on the dev set (59.7% EM) and test set (55.4% EM). We therefore adopt the 3-turn self-sampling method for our experiments.

Table 9 shows the oracle accuracy of the top k candidates on the train, development, and test sets. Because oracle scores are an upper bound for the reranker, there is a trade-off: the upper bound drops as fewer candidates are sampled, while the noise increases as more incorrect candidates are sampled. We found that top 3 provides a good trade-off for the reranker on the development set, giving a large jump over just two candidates, and this is what we used in our main experiments.
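The k-fold and k-turn partitioning schemes can be sketched as follows; the round-robin fold construction and the wrap-around for the last turn are our assumptions, since the paper does not specify how the parts are formed:

```python
def k_fold_assignments(n, k):
    """k-fold: candidates for each part come from a model trained on the
    other k-1 parts. Returns (train_indices, generate_indices) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [(sorted(set(range(n)) - set(f)), f) for f in folds]

def k_turn_assignments(n, k):
    """k-turn: part i trains the model that generates candidates for
    part i+1 (wrapping around at the end)."""
    folds = [list(range(i, n, k)) for i in range(k)]
    return [(folds[i], folds[(i + 1) % k]) for i in range(k)]
```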
Error Analysis and Future Work We analyze some errors that our proposed model made, aiming to discover the kinds of questions our model cannot yet handle. Table 10 shows some questions for which our proposed model gives incorrect answers. These questions require the model to extract numeric information from the passage, compare numeric relations (e.g., larger, smaller, and equal), and identify the effect of the relation in the background passage: a positive correlation between price and number sold in example 1, a positive correlation between tolerance degree and usage times in example 2, and a negative correlation between the crash rate and the number of cyclists in example 3. It seems that the model is not sensitive to numeric information and the reasoning it requires.
Also, the situations give more than two entities with their related information, and although the questions narrow the multiple choices down to two, the systems are still distracted by question-irrelevant entities. The distraction comes from the difficulty of associating the relevant information with the correct entities. Future work could design additional modules to deal with this phenomenon.

Related Work
Neural Module Networks were originally proposed for visual question answering tasks (Andreas et al., 2016b,a), and have recently been used on several reading comprehension tasks (Gupta et al., 2020), where module functions such as FIND and COMPARE are specialized to retrieve relevant entities, with or without supervised signals, for HotpotQA (Yang et al., 2018) or DROP (Dua et al., 2019). As ROPES is quite different from these datasets, the modules that we choose to use are also different, focusing on chained inference.

Multi-Hop Reasoning
There are several datasets constructed for multi-hop reasoning, e.g., HOTPOTQA (Yang et al., 2018; Min et al., 2019; Feldman and El-Yaniv, 2019), QANGAROO (Welbl et al., 2018; Chen et al., 2019b; Zhuang and Wang, 2019; Tu et al., 2019), and WIKIHOP (Welbl et al., 2018; Song et al., 2018; Das et al., 2019; Asai et al., 2019), which aim to find answers across documents. The term "multi-hop" reasoning on these datasets is akin to relational information retrieval, where one entity is bridged to another entity in one hop. In contrast, the multi-step reasoning on ROPES aims to reason over the effects described in the background and situation passages and then answer the question in the specific situation, without retrieval over the background passage.
Models beyond Pre-trained Transformers With the emergence of fully pre-trained transformers (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Radford et al.), most NLP benchmarks have seen new state-of-the-art results from models built on top of pre-trained transformers for specific tasks (e.g., syntactic parsing, semantic parsing, and GLUE) (Kitaev and Klein, 2018; Zhang et al., 2019; Tsai et al., 2019). Our work is in the same line, adopting the advantages of pre-trained transformers, which collect contextualized word representations from large amounts of data.

Conclusion
We propose a multi-step reading comprehension model that performs chained inference over natural language text. We have demonstrated that our model substantially outperforms prior work on ROPES, a challenging new reading comprehension dataset. We have additionally presented some analysis of ROPES that should inform future work on this dataset. While our model is not a neural module network, as it uses a single fixed layout instead of different layouts per question, we believe there are enough similarities that future work could explore combining our modules with those used in other neural module networks over text, leading to a single model that could perform the necessary reasoning for multiple different datasets.