multiPRover: Generating Multiple Proofs for Improved Interpretability in Rule Reasoning

We focus on a type of linguistic formal reasoning where the goal is to reason over explicit knowledge in the form of natural language facts and rules (Clark et al., 2020). A recent work, named PRover (Saha et al., 2020), performs such reasoning by answering a question and also generating a proof graph that explains the answer. However, compositional reasoning is not always unique and there may be multiple ways of reaching the correct answer. Thus, in our work, we address a new and challenging problem of generating multiple proof graphs for reasoning over natural language rule-bases. Each proof provides a different rationale for the answer, thereby improving the interpretability of such reasoning systems. In order to jointly learn from all proof graphs and exploit the correlations between multiple proofs for a question, we pose this task as a set generation problem over structured output spaces where each proof is represented as a directed graph. We propose two variants of a proof-set generation model, multiPRover. Our first model, Multilabel-multiPRover, generates a set of proofs via multi-label classification and implicit conditioning between the proofs; while the second model, Iterative-multiPRover, generates proofs iteratively by explicitly conditioning on the previously generated proofs. Experiments on multiple synthetic, zero-shot, and human-paraphrased datasets reveal that both multiPRover models significantly outperform PRover on datasets containing multiple gold proofs. Iterative-multiPRover obtains state-of-the-art proof F1 in zero-shot scenarios where all examples have single correct proofs. It also generalizes better to questions requiring higher depths of reasoning where multiple proofs are more frequent.


Introduction
Formal reasoning over explicit multi-sentence knowledge (Newell and Simon, 1956) has often proved to be challenging (Musen and Van Der Lei, 1988), owing to the difficulty of creating logical forms from such sentences, which restricts the application of semantic parsers (Zettlemoyer and Collins, 2005; Berant et al., 2013; Berant and Liang, 2014). Thus, in a recent work, Clark et al. (2020) bypass the creation of intermediate logical forms and show that transformers (Vaswani et al., 2017) can act as "soft theorem provers" by answering questions over natural language (English) rule-bases consisting of facts and rules. In order to reliably interpret these predicted answers, Saha et al. (2020) propose PROVER, a transformer-based model that generates the corresponding proof graph, thus emulating formal reasoning closely. Consider the two example rule-bases with two questions and corresponding proofs in Figure 1, where a proof is a directed graph consisting of the relevant facts and rules from the corresponding rule-base.
PROVER shows good single-proof generation accuracy but is designed and trained to generate only a single proof for each question. This is not ideal because formal proofs are not always unique and there may be multiple correct ways of arriving at the answer. For example, Q1 and Q2 in Figure 1 have three and four correct proofs respectively. Hence, in order to enhance the human-interpretability of linguistic formal reasoning systems, it is desirable to develop methods that can generate multiple proofs, each providing a different rationale for the predicted answer. Such interpretable methods, while possessing the flexibility of operating over natural language, can also aid in verifying claims when constructing proofs from scratch is tedious or infeasible.
We find that PROVER (Saha et al., 2020), when trained on all proofs as independent training examples (Eq. 2) and extended to generate top-p proofs during inference (Eq. 3), fails drastically, achieving a low proof precision of 34%. The subsequent proofs are often incorrect because the model is not trained jointly on all proofs and hence is unable to exploit the inter-proof correlations, nor does it learn the correct number of proofs for a question. Thus, we propose MULTIPROVER, a transformer-based model that can generate a set of proof graphs with appropriate cardinality for a given question. Since multiple proofs can be generated in any arbitrary order, we pose this task as a set generation problem over graphs and train MULTIPROVER jointly with a permutation-invariant Hungarian Loss (Zhang et al., 2019a,b) over all proofs.
A proof graph is generated through a node module, which selects the relevant facts and rules as part of the proof, and an edge module, which determines the edges between the chosen nodes. Similar to PROVER, we first enforce multiple structural constraints during training and inference to ensure that a generated proof is valid. Next, in order to generate a set of proofs jointly, we propose our first model, Multilabel-MULTIPROVER, a multi-label classification framework which performs implicit conditioning among the proofs and predicts p binary labels for each node and edge, denoting its presence or absence in each of the p proofs that we want to generate. It is efficient in terms of parameter count and training time and also achieves a better proof F1 than PROVER. However, the lack of explicit conditioning between the proofs is not ideal, because a question with multiple proofs often has certain sub-graphs common across the proofs; e.g., all three proofs for Q1 in Figure 1 share the sub-graph {F10 → R1}. Thus, in order to exploit these correlations, which Multilabel-MULTIPROVER cannot capture explicitly, we further propose an improved variant of MULTIPROVER, named Iterative-MULTIPROVER, which generates an appropriate number of proofs by stacking multiple node and edge encoders, each of which generates one proof at each time step by conditioning on the previously generated proofs. This enables the model to better learn the correlations between multiple proofs for a given question. To capture the set-based nature of the task, we train MULTIPROVER using a permutation-invariant Hungarian Loss (Sec. 3.5), which solves an assignment problem between the sets of predicted and gold proofs.
Empirical evaluation on synthetic and human-paraphrased QA rule-bases (Clark et al., 2020) shows that both of our MULTIPROVER models achieve a significantly higher proof F1 compared to PROVER while retaining the QA accuracy. Further, on a challenging hand-authored zero-shot dataset, where all examples have single gold proofs, Iterative-MULTIPROVER achieves state-of-the-art proof F1. It also generalizes better to questions requiring higher depths of reasoning, where multiple proofs are more frequent. Overall, our contributions are:
• We address a new and challenging problem of generating a set of multiple logical proof graphs for reasoning over natural language rule-bases by proposing two set-based joint models, Multilabel-MULTIPROVER and Iterative-MULTIPROVER.
• Iterative-MULTIPROVER's joint training and explicit conditioning help it to better learn the relative importance of rules and facts for a particular question and to uncover common subgraphs across multiple proofs. Thus, compared to Multilabel-MULTIPROVER and PROVER, it transfers well in zero-shot settings because it learns to assign a soft prior over the rule-base.
• Iterative-MULTIPROVER's conditional generation also enables it to generalize better to questions requiring higher depths of reasoning, where the presence of multiple proofs is frequent.

Related Work
The task of rule reasoning (Clark et al., 2020) is related to other recently proposed tasks on QA (Weston et al., 2015; Yang et al., 2018; Lin et al., 2019; Tafjord et al., 2019; Richardson et al., 2020) and NLI (MacCartney and Manning, 2014). However, most of these tasks require implicit reasoning rules as opposed to explicit ones, and the focus is either on broad language understanding or on single rule application. Below we discuss MULTIPROVER's relation to multiple areas of NLP and ML.
Generating Multiple Outputs: Generating a set of proofs can be viewed as a task of generating multiple structured outputs (Prasad et al., 2014). Multiple prior studies focus on generating diverse unstructured texts (Gimpel et al., 2013; Dai et al., 2017; Xu et al., 2018; Raffel et al., 2020), which broadly span two categories: (1) using improved decoding techniques like beam search with inter-sibling ranking penalty (Li et al., 2016), iterative beam search (Kulikov et al., 2018), diverse beam search (Vijayakumar et al., 2018), and sentence codes (Shu et al., 2019); (2) varying the hidden representations or using multiple decoders (Dai et al., 2017; Jain et al., 2017; Shen et al., 2019). Our baseline, PROVER-top-p, which extends PROVER to generate top-p proofs during inference, falls in the first category, while MULTIPROVER falls in the second, where the multiple node and edge encoders vary the node and edge representations for generating multiple proofs.
Machine Learning over Sets: Set-based ML models (Zaheer et al., 2017; Lee et al., 2018; Zhang et al., 2019a; Kosiorek et al., 2020) have a wide range of applications, including generating multiple image captions (Vinyals et al., 2015), generating diverse translations (Cho et al., 2014; Bahdanau et al., 2015), and enumerating rules in a logical inference system (Gao et al., 2019). Set problems are challenging because the number of valid orderings of a set of size n is n!, which grows faster than exponentially in n, and ignoring the set structure produces sub-optimal solutions (Zhang et al., 2019a). Thus, we use a set-based Hungarian Loss (Zhang et al., 2019a,b) to capture the permutation-invariant nature of generating a set of proofs.

Task Description and Notations
The input to our task is a tuple of the form (C, Q), where C is a rule-base context and Q is the question. We want to predict a binary answer A ∈ {True, False} for the question and generate a set of proof graphs P = {P_1, ..., P_p}, each of which provides a diverse rationale for the answer (see Figure 1). The context C consists of a set of facts and rules, denoted by F and R respectively. Facts F = {F_1, ..., F_f} are unambiguous statements, while rules R = {R_1, ..., R_r} are logical statements, which can be used in conjunction with the facts to arrive at a logical conclusion. Each proof is a directed graph over the relevant facts and rules. If a statement (e.g., "Anne is big") cannot be deduced from the context, then Negation as Failure (NAF) contains the negation of that statement (e.g., "Anne is not big"), which is considered true under a closed-world assumption. See appendix for more details on the syntax of proof graphs.


Baseline PROVER Model

PROVER consists of a QA module, a node module, and an edge module. The node module computes node embeddings N ∈ R^{k×d}, consisting of the representations of each fact, rule, and NAF, where d is the embedding dimension. The i-th row n_i of N denotes the embedding of node i. A node classifier takes in these embeddings to output the node probabilities np ∈ R^k of each fact, rule, and NAF being present in the proof. The edge module computes the edge embeddings E ∈ R^{k²×3d} for every edge (i, j) through the function φ(i, j) = [n_i; n_j; (n_i − n_j)], where ; is the concatenation operation, and outputs probabilities ep ∈ R^{k²} of each edge being present in the proof. PROVER is trained using the joint cross-entropy loss over the QA, node, and edge modules. The authors pose inference as an Integer Linear Program (ILP). Given a set of nodes and the edge probabilities from the trained model, the following global score over the edge probabilities is maximized, subject to multiple structural constraints S that ensure the validity of a proof graph (like checking for graph connectivity):

argmax_{e_{i,j} ∈ {0,1}, s ∈ S} Σ_{i,j : i≠j} ep_{i,j} · e_{i,j} + (1 − ep_{i,j}) · (1 − e_{i,j})    (1)

Extending PROVER to Generate Proof-Sets: Since Saha et al. (2020) focus on generating one proof per question, they also train their model with one gold proof per question. For multiple proof generation, an obvious extension is to treat each proof for a question as a separate training example. Formally, for each sample l, given a context C_l, a question Q_l, an answer A_l and a set of gold proofs P_i^l, where i ∈ {1, ..., p_l}, the extended training dataset can be defined as:

D = { (C_l, Q_l, A_l, P_i^l) : i ∈ {1, ..., p_l} }    (2)

Once PROVER is trained with this dataset, during inference, we generate top-p proofs by first selecting the top-p node sets according to Eqn. 3 and then choosing the corresponding edge sets using the optimization function in Eqn. 1:

argmax_{v_i ∈ {0,1}} Σ_i np_i · v_i + (1 − np_i) · (1 − v_i)    (3)
The top-p solutions of Eqn. 3 are v_1, ..., v_p, which indicate each node's presence or absence in the corresponding proofs. Although simple, this approach has two major issues. First, the lack of coupling between the proofs can potentially confuse the model, as there are multiple possible proofs for the same (question, context) pair. Second, inference is inflexible and always generates a fixed number of proofs for every example, thus leading to the generation of many incorrect proofs (Section 5.1). As shown in Figure 1, certain questions can have multiple possible proofs. Figure 2 demonstrates this phenomenon statistically: the datasets we experiment with (Clark et al., 2020) contain up to 13% of samples with more than one correct proof. Thus, in light of PROVER's limitations, we propose two novel architectures of a proof-set generation model, MULTIPROVER.
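To make the top-p node selection concrete: since the score in Eqn. 3 is separable per node, the best assignment thresholds each probability at 0.5, and lower-ranked assignments flip the least-confident nodes. The sketch below (an illustration, not PROVER's actual inference code) enumerates the p highest-scoring assignments with a best-first search over flip sets:

```python
import heapq
import numpy as np

def top_p_node_sets(node_probs, p=3):
    """Return the p highest-scoring binary node assignments v under
    score(v) = sum_i np_i*v_i + (1 - np_i)*(1 - v_i)."""
    probs = np.asarray(node_probs, dtype=float)
    best = (probs > 0.5).astype(int)       # independent per-node argmax
    margins = np.abs(2 * probs - 1)        # score lost by flipping node i
    order = np.argsort(margins)            # cheapest flips first
    m = margins[order]
    # Best-first enumeration of subsets of flipped nodes, in nondecreasing
    # order of total flip cost (the classic k-smallest-subset-sums scheme).
    heap = [(0.0, ())]
    solutions = []
    while heap and len(solutions) < p:
        cost, subset = heapq.heappop(heap)
        v = best.copy()
        for idx in subset:
            v[order[idx]] ^= 1             # flip the chosen nodes
        solutions.append(v.tolist())
        j = subset[-1] if subset else -1
        if j + 1 < len(m):
            # extend with the next-cheapest flip, or replace the last flip
            heapq.heappush(heap, (cost + m[j + 1], subset + (j + 1,)))
            if subset:
                heapq.heappush(heap, (cost - m[j] + m[j + 1], subset[:-1] + (j + 1,)))
    return solutions
```

The first solution is the thresholded assignment; each subsequent one gives up the smallest possible amount of score, which matches the intuition that the top-p node sets differ only in the nodes the model is least sure about.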

Multilabel-MULTIPROVER
As described in the previous section, a desired property for generating a set of proofs is to have the proofs conditioned on each other as opposed to treating them independently. Thus, we propose Multilabel-MULTIPROVER (see Figure 3), which poses the problem of generating a set of proofs as a multi-label classification task over all the nodes and edges corresponding to the set of p proofs. Each training example is a tuple (C_l, Q_l, A_l, {P_i^l}_{i=1}^{p_l}), consisting of a set of gold proofs per example. The model consists of a QA module, a node module, and an edge module. Following PROVER (Section 3.2), we obtain the node representations N ∈ R^{k×d} by mean-pooling over the constituent RoBERTa representations. These are then passed through a multi-label node classifier, which consists of two linear layers and produces the probabilities np_i ∈ R^p of a node being present in each of the p proofs. The node embeddings n_i and n_j for a pair of nodes are transformed by the function φ(i, j), described in Section 3.2, to output the edge embeddings E ∈ R^{k²×3d}. We also have a multi-label edge classifier, which takes in the edge embeddings to generate the probabilities ep_{i,j} ∈ R^p of an edge (i, j) being present in each of the p proofs. Lastly, a question answering module predicts a binary answer for the question. Following PROVER, during training, we mask certain impossible edges like fact-to-fact, rule-to-fact, and those involving non-nodes. Given the outputs from the three modules, we train our model jointly over all proofs using a set-based Hungarian Loss.
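As a shape-level sketch of these modules, with random weights standing in for trained parameters and plain numpy in place of RoBERTa and the actual classifier layers:

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, p = 5, 8, 3          # toy sizes: k nodes, embedding dim d, p proofs

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Node embeddings N (in the paper, mean-pooled RoBERTa representations).
N = rng.normal(size=(k, d))

# Multi-label node classifier: two linear layers (random stand-ins for
# trained parameters), emitting p probabilities per node.
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, p))
node_probs = sigmoid(np.maximum(N @ W1, 0) @ W2)        # shape (k, p)

# Edge embeddings via phi(i, j) = [n_i ; n_j ; n_i - n_j], followed by a
# multi-label edge classifier over all k^2 ordered node pairs.
E = np.stack([np.concatenate([N[i], N[j], N[i] - N[j]])
              for i in range(k) for j in range(k)])      # shape (k^2, 3d)
We1, We2 = rng.normal(size=(3 * d, d)), rng.normal(size=(d, p))
edge_probs = sigmoid(np.maximum(E @ We1, 0) @ We2)       # shape (k^2, p)
```

The key point is that a single forward pass yields p probabilities per node and per edge, so all p proofs are read off in parallel from shared embeddings.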
This model is advantageous because there is implicit conditioning between the proofs, as all the proofs are generated in parallel from the same node and edge embeddings. Thus, it has no additional time or memory overhead while also generating proof-sets better than PROVER (Section 5.1). However, it suffers from two major drawbacks. First, since the proofs are generated in parallel, the model is trained by padding the gold set with empty proof graphs. Hence, for higher values of p, the model has to learn more empty proofs, which makes the learning problem harder. Second, the proofs are not explicitly conditioned on each other. This motivates us to propose Iterative-MULTIPROVER.

Iterative-MULTIPROVER
As a motivating example for why explicit conditioning among proofs is necessary, consider the proofs for Q1 in Figure 1, where the sub-graph {F10 → R1} is common across all the proofs. F10 and R1 are essential for answering the question, and hence conditioning on the previously generated proofs helps the model adjust the relevance of nodes and edges in the subsequent proofs. Quantitatively, we find that about 75% of the samples with 4 proofs have at least one node and one edge common across all the proofs (see Figure 5). Thus, we propose Iterative-MULTIPROVER (see Figure 4), which broadly consists of a base PROVER architecture, as in Figure 3, and an additional p node and edge encoders for generating a maximum of p proofs. The proofs are generated iteratively until an empty graph is generated to denote the end.
The base PROVER architecture computes the first level of node embeddings N_1 ∈ R^{k×d} and edge embeddings E_1 ∈ R^{k²×d}. These are passed respectively through a node and an edge classifier to generate the node probabilities np_1 ∈ R^k and edge probabilities ep_1 ∈ R^{k²}, corresponding to the first proof. In the next iteration, two transformer encoders generate the node and edge embeddings corresponding to the second proof. Specifically, we condition the generation of the next node embeddings N_2 on the previous node (N_1) and edge (E_1) embeddings simultaneously. Conditioning on both is crucial because N_1 captures the relevance of nodes for the first proof, while E_1 contains information about the strength of the connections between these nodes. We condition E_2 only on E_1, because the edge embeddings corresponding to the nodes predicted by N_1 are already updated in E_1. Formally,

N_2 = f_N(N_1, E_1),   E_2 = f_E(E_1),

where f_N and f_E are the node and edge transformer encoders. This next set of embeddings, when passed through the respective node and edge classifiers, predicts the node probabilities np_2 ∈ R^k and edge probabilities ep_2 ∈ R^{k²}, denoting the likelihood of their presence in the second proof. We repeat this process of stacking node and edge encoders to generate a maximum of p proofs. Given the node and edge probabilities corresponding to each proof and a QA probability from the QA module, we train Iterative-MULTIPROVER jointly on all proofs using the Hungarian Loss, described below.
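A minimal sketch of one conditioning step, using a single linear+ReLU layer as a hypothetical stand-in for each transformer encoder (the edge embeddings are also simplified to dimension d, and the edge-summary pooling is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
k, d = 4, 6

def encoder(X, W):
    """Stand-in for a transformer encoder (one linear layer + ReLU)."""
    return np.maximum(X @ W, 0)

N1 = rng.normal(size=(k, d))          # node embeddings for proof 1
E1 = rng.normal(size=(k * k, d))      # edge embeddings for proof 1

# Condition N2 on both N1 and E1: summarize each node's outgoing edge
# embeddings (mean over all edges with that node as source) and concatenate
# the summary with the node's own embedding.
edge_summary = E1.reshape(k, k, d).mean(axis=1)          # shape (k, d)
W_n = rng.normal(size=(2 * d, d))
N2 = encoder(np.concatenate([N1, edge_summary], axis=1), W_n)

# Condition E2 only on E1, as in the text.
W_e = rng.normal(size=(d, d))
E2 = encoder(E1, W_e)
```

Stacking this step p times, each with its own encoder weights, yields the iterative proof-by-proof generation described above.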

Permutation-Invariant Hungarian Loss
Unlike words in text generation, proofs can be generated in any arbitrary order. Consequently, computing a cross-entropy loss between the i-th predicted proof and the i-th gold proof, i ∈ {1, ..., p}, would be sub-optimal. Thus, we use a permutation-invariant Hungarian Loss (Zhang et al., 2019a,b), which finds the optimal assignment between the predicted proofs and the gold proofs such that the overall loss is minimized. Formally, the Hungarian loss L_H and total loss L are:

L_H = Σ_{i=1}^{p} [ CE(np_i, y_n^{π(i)}) + CE(ep_i, y_e^{π(i)}) ],   L = L_H + L_QA,

where CE(., .) is the cross-entropy loss, np_i and ep_i are the respective node and edge probabilities for the i-th predicted proof, and y_n^{π(i)} ∈ {0,1}^k and y_e^{π(i)} ∈ {0,1}^{k²} are the respective true node and edge labels for the gold proof π(i), where π is the optimal permutation. The Hungarian Loss is implemented by first summing the node and edge cross-entropy loss matrices L_n ∈ R^{p×p} and L_e ∈ R^{p×p}, each entry (i, j) of which corresponds to the proof loss between the i-th predicted proof and the j-th gold proof (see Figures 3 and 4). Then we find the best assignment between the gold and predicted proofs through the Hungarian algorithm (Kuhn and Yaw, 1955). Our final loss sums the Hungarian proof loss and the QA loss.
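A compact sketch of this loss using scipy's implementation of the Hungarian algorithm (an illustration, not the authors' code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def proof_ce(prob, label, eps=1e-9):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    prob = np.clip(prob, eps, 1 - eps)
    return float(-np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob)))

def hungarian_proof_loss(pred_node_probs, pred_edge_probs, gold_nodes, gold_edges):
    """Permutation-invariant proof loss: build a p x p matrix of pairwise
    proof losses (node CE + edge CE), then take the minimum-cost assignment
    between predicted and gold proofs via the Hungarian algorithm."""
    p = len(pred_node_probs)
    cost = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            cost[i, j] = (proof_ce(pred_node_probs[i], gold_nodes[j])
                          + proof_ce(pred_edge_probs[i], gold_edges[j]))
    rows, cols = linear_sum_assignment(cost)   # the optimal permutation pi
    return float(cost[rows, cols].sum()), dict(zip(rows.tolist(), cols.tolist()))
```

Because the matching is recomputed per example, the model is never penalized for emitting correct proofs in a different order than the gold annotation.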

Integer Linear Program (ILP) Inference
Following PROVER, we generate valid proofs during inference using an ILP, subject to multiple global constraints (see Saha et al. (2020)). For each predicted proof, given the predicted nodes and the edge probabilities from MULTIPROVER, we obtain the corresponding predicted edges using Eqn. 1.
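Since the objective in Eqn. 1 is separable over edges, its unconstrained maximizer simply keeps each edge with ep_{i,j} > 0.5; the structural constraints then act as masks plus global conditions. Below is a simplified sketch that applies only the local masks (both endpoints selected, no edges into facts, no self-loops); the full ILP additionally enforces global connectivity, which requires a solver such as PuLP:

```python
import numpy as np

def select_edges(edge_probs, node_sel, is_fact):
    """Maximize sum_ij ep_ij*e_ij + (1-ep_ij)*(1-e_ij) over binary e_ij,
    subject to simple structural masks. Without the global connectivity
    constraints, the separable objective reduces to thresholding at 0.5."""
    k = len(node_sel)
    e = (np.asarray(edge_probs) > 0.5).astype(int)
    for i in range(k):
        for j in range(k):
            invalid = (i == j                    # no self-loops
                       or not node_sel[i]        # source not in the proof
                       or not node_sel[j]        # target not in the proof
                       or is_fact[j])            # edges may not point into a fact
            if invalid:
                e[i, j] = 0
    return e
```

This makes clear why high edge precision depends on good node selection: any edge touching an unselected node is masked out regardless of its probability.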
Experiments

Datasets: The five synthetic datasets DU0-DU5 consist of 100k questions with their own train, validation, and test splits (70/10/20) and reasoning depths up to D = 0, 1, 2, 3, 5 respectively. Each example in these datasets is annotated with all possible proofs. The second dataset, Birds-Electricity, consists of 5k hand-authored samples aimed at evaluating the zero-shot performance of the models. Unlike the previous datasets, all examples in this dataset have a unique gold proof. Third, ParaRules is a human-paraphrased dataset consisting of 40k examples with all possible proofs, where the facts and rules are paraphrased by crowdworkers. Further details of the datasets and model hyperparameters can be found in the appendix.
Evaluation Metrics: Following PROVER, QA evaluation is done through accuracy. For proofs, we compute the following metrics: (1) Node Precision, Recall, F1, (2) Edge Precision, Recall, F1, (3) Proof Precision, Recall, F1, and (4) Full Accuracy (FA). For each sample, given a set of gold proofs and predicted proofs, node precision is computed as the fraction of predicted proofs whose node set matches exactly with a gold proof's node set. Similarly, node recall for each sample is computed as the fraction of gold proofs whose node sets are matched exactly by a predicted proof. The overall node precision, recall, and F1 are the respective sample-wise scores averaged over all samples. Edge metrics are computed similarly but with respect to the edges only, and the proof metrics consider both nodes and edges in conjunction. Our final metric, full accuracy, evaluates a sample as a whole and is given by the fraction of samples where the answer and all corresponding proofs are exactly correct.

Baselines and Setup: Besides PROVER and PROVER-top-p, we also experiment with careful inference strategies over PROVER-top-p to predict the number of proofs to generate, i.e., we stop generating proofs when the score difference between two consecutive proofs exceeds a certain threshold (tuned on the validation set). All models are trained on the DU5 train set and tested on the corresponding test set. Based on Figure 2, which shows that 98% of the dataset contains samples with ≤ 3 proofs, we set max-proofs p = 3. 87% of the examples in the dataset have a single gold proof, thereby making PROVER a strong baseline.
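The node-level precision/recall computation described here can be sketched directly, treating each proof as its node set (a simplified illustration):

```python
def proof_node_prf(pred_proofs, gold_proofs):
    """Per-sample node metrics as described in the text: precision is the
    fraction of predicted proofs whose node set exactly matches some gold
    proof's node set; recall is the analogous fraction over gold proofs."""
    pred_sets = [frozenset(p) for p in pred_proofs]
    gold_sets = [frozenset(g) for g in gold_proofs]
    prec = sum(p in gold_sets for p in pred_sets) / len(pred_sets)
    rec = sum(g in pred_sets for g in gold_sets) / len(gold_sets)
    f1 = 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)
    return prec, rec, f1
```

For example, predicting {F1, R1} and {F2} against gold proofs {F1, R1}, {F3, R2}, and {F2} gives precision 1.0 but recall 2/3, since one gold proof is never produced; edge and full-proof metrics follow the same exact-match scheme.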
We observe that PROVER-all has a slightly lower proof F1 than PROVER, likely because the model gets confused by multiple possible proofs for the same context and question. PROVER-top-p's huge drop in precision is unsurprising because the subsequent non-empty proofs are always incorrect, causing full accuracy to drop to 0%. When we perform careful inference over PROVER, either by predicting the number of proofs or by thresholding, and do not generate a fixed number p of proofs for all examples, we observe a boost in precision over the vanilla top-p model, with very little drop in recall. However, PROVER continues to be a stronger baseline than all the top-p variants because of the large number of single-proof examples in the dataset.
Both MULTIPROVER models improve significantly on the state-of-the-art proof F1, while retaining near-perfect QA accuracy. IT-MULTIPROVER is a significantly stronger model because of its explicit conditioning mechanism and obtains up to a statistically significant (p < 0.001) 4% improvement on proof F1 and full accuracy. While our model is expected to improve proof recall compared to PROVER and PROVER-all because of the generation of multiple proofs, the improvement in precision is particularly important, as it shows that the proofs subsequently generated by IT-MULTIPROVER are mostly correct. Similarly, its improvement in proof recall compared to PROVER-top-p also shows the strength of the model, considering that PROVER-top-p generates the maximum number of proofs for every sample. Overall, IT-MULTIPROVER outperforms all other models on all metrics. In summary, careful inference strategies over a single-proof generation model like PROVER are largely ineffective for generating multiple proofs, and an effective proof-set generation model needs to exploit and learn the inter-proof correlations during the training phase itself. Our experiments on the ParaRules dataset demonstrate similar findings; details, along with the effect of varying p for MULTIPROVER, are in the appendix.
Iterative-MULTIPROVER performs equally well on the subset of questions where the context has negations, achieving a high proof F1 of 90.8. As part of error analysis, we find that 58% of Iterative-MULTIPROVER's wrongly predicted proofs have more nodes and edges than the gold proof, suggesting that our model tends to overestimate the essential rules and facts and their inter-connections. In the following subsections, we analyze MULTIPROVER's generalization capabilities in three different contexts: zero-shot settings, higher-depth questions, and training with less data.

Generalization to Zero-Shot Dataset with Single Gold Proofs
The Birds-Electricity test-only dataset evaluates zero-shot performance. It contains examples with single gold proofs; hence, if a multiple-proof generation model like MULTIPROVER transfers well to it, this indicates strong generalization capabilities, because along with generating correct proofs, it also needs to infer the correct number of proofs. With that motivation, in Table 2, we compare PROVER and PROVER-all, both trained on DU5 to generate a single proof, with our MULTIPROVER models, also trained on DU5, and find that IT-MULTIPROVER obtains state-of-the-art results on all proof-related metrics, while retaining the QA performance. Note that IT-MULTIPROVER has two important design choices which explain its good out-of-domain transfer: (1) it trains on all proofs jointly, and (2) it conditions explicitly between proofs. Combined, these enable it to learn the correlations between proofs and to identify the degree of relevance of facts and rules, ranging from essential to sometimes useful to irrelevant, for a given question. Thus, on out-of-domain test data, it assigns soft prior relevance scores over the context, which helps it better learn the significantly smaller space of correct proofs and be more accurate even on a single-proof dataset.

Generalization to Higher Depths
The DU5 dataset consists of questions requiring reasoning up to a maximum depth of 5. Thus, we test the generalization capabilities of the MULTI-PROVER models on higher depth questions. Specifically, in Table 3, we compare the DU5-trained models of PROVER-all, ML-MULTIPROVER and IT-MULTIPROVER on the subset of DU5 test examples with varying depths of reasoning (d). Each row also shows the percentage of examples with multiple gold proofs (MP) which, unsurprisingly, increases as the depth increases. We observe that much of IT-MULTIPROVER's improvement compared to ML-MULTIPROVER comes at higher depths where the presence of multiple proofs is a more frequent phenomenon. At depth-5, where 23% of the examples have > 1 correct proof, IT-MULTIPROVER obtains a 6% improvement over ML-MULTIPROVER. This shows that joint training with all proofs and explicit conditioning between them leads to better generalization at higher depths.

Generalization with Less Training Data
Collecting proofs for supervised training is expensive in most real-world scenarios. Hence, on top of the zero-shot and depth generalization results presented so far, we ask if our MULTIPROVER models can learn from less training data. Table 4 shows that these models obtain near-perfect QA accuracy with only 40% of the training data (30k examples). However, proof generation proves to be more challenging and only improves with sufficient training data. Another interesting observation is that while both MULTIPROVER models perform comparably with less training data, IT-MULTIPROVER starts to outperform ML-MULTIPROVER when trained with more examples. IT-MULTIPROVER has more trainable parameters because of its multiple node and edge encoders, which are learned better with more data. See appendix for the runtime and parameter counts of these models.

Comparison of MULTIPROVER with the Skyline Single-Proof Generation Model
We find that an ideal (skyline) single-proof generation model's proof recall on the DU5 dataset is upper-bounded by 92%, as the dataset contains about 87% single-proof examples. This bound is computed by crediting exactly one correct proof per question. Hence, we ask how well our MULTIPROVER models compare with this ideal performance (Figure 7). Our results are encouraging, not only because IT-MULTIPROVER generates more correct proofs than all other models but also because it almost matches the performance of the skyline single-proof generation model. The PROVER model is 9.2% worse than the skyline, while IT-MULTIPROVER reduces this gap to 3%. Given that the dataset mostly contains single-proof examples, the skyline is a strong upper bound on proof generation performance, and IT-MULTIPROVER significantly reduces the gap. See appendix for ablations of IT-MULTIPROVER, including the effect of the Hungarian Loss.
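To see where such an upper bound comes from: a perfect single-proof model recovers exactly one gold proof per sample, so its per-sample proof recall is 1/p for a sample with p gold proofs, and the dataset-level bound is the average of 1/p. The proof-count distribution below is hypothetical (the exact DU5 distribution is in Figure 2, not reproduced here), chosen only to illustrate the arithmetic:

```python
# Hypothetical fractions of samples with 1-4 gold proofs (87% single-proof,
# matching the text; the multi-proof split is illustrative only).
dist = {1: 0.87, 2: 0.09, 3: 0.03, 4: 0.01}

# A perfect single-proof model matches exactly one gold proof per sample,
# so its expected proof recall is the mean of 1/p over the distribution.
skyline_recall = sum(frac / p for p, frac in dist.items())
```

Under these illustrative numbers the bound comes out near 93%, in the same ballpark as the 92% figure reported for DU5.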
Qualitative Analysis of MULTIPROVER

Fig. 6 shows the sets of proofs correctly generated by Iterative-MULTIPROVER for two randomly chosen questions. For Q1, it generates all the possible proofs by identifying the common subgraph F6 → R9. Q2 is interesting because (i) the single-node proof F2 is significantly different from the other proofs in both structure and size, and (ii) the two larger proofs have two distinct common subgraphs. Here, PROVER performs a simple lookup in the rule-base to generate the proof F2, thereby limiting our understanding of its reasoning capabilities. MULTIPROVER, through its ability to also generate the larger and more complex proofs, enhances the transparency and verifiability of its reasoning, and hence is a crucial step towards bridging the gap between neural and symbolic approaches.

Conclusion
We proposed Multilabel-MULTIPROVER and Iterative-MULTIPROVER, two variants of a proof-set generation model, where the former performs implicit conditioning between the proofs to generate them in parallel while the latter generates a proof-set through explicit conditioning on the previously generated proofs. Both models obtain strong proof F1 improvements on synthetic and human-paraphrased datasets, and Iterative-MULTIPROVER also obtains state-of-the-art proof F1 on a zero-shot dataset with single proofs. MULTIPROVER's modeling is fairly generic, and similar methods can be used to generate a set of structured explanations for other NLP tasks like multi-hop QA.

Ethical Considerations
Despite the overwhelming success of pre-trained language models for various NLP tasks, a common criticism is their lack of interpretability. Generating structured proofs from such models allows us to explain their reasoning capabilities and also bridges the gap between neural and symbolic systems. In this work, we take a step towards improving the interpretability of rule-based reasoning by generating a set of multiple proofs, each providing a diverse rationale for the reasoning process. We experiment with a wide variety of rule-bases ranging from synthetic to hand-authored to human-paraphrased rule-bases. Our results show good generalization performance of our models across three different aspects: (1) zero-shot settings, (2) questions requiring higher depths of reasoning, and (3) availability of less training data. We hope our models and findings will inspire future work on generating multiple structured explanations for different compositional reasoning tasks in NLP.

A.1 Implementation Details

Our implementation uses the HuggingFace Transformers library (Wolf et al., 2020). Experiments with PROVER (Saha et al., 2020) are performed using their publicly released code and hyperparameters. All MULTIPROVER hyperparameters are chosen based on the best Full Accuracy on the corresponding validation sets. We use RoBERTa-large (Liu et al., 2019) as the pre-trained language model. The batch size and maximum sequence length are set to 8 and 300 respectively. We train all our models for a maximum of 7 epochs using an initial learning rate of 10^-5, a weight decay of 0.1, and a dropout probability of 0.1. We use a random seed of 42 across all our experiments. All experiments are performed on one V100 Volta GPU. Batch size and learning rate are manually tuned in the ranges {8, 16} and {10^-5, 2·10^-5} respectively. For inference, we use PROVER's ILP optimization code, which is modeled using PuLP. In all the datasets, the maximum number of facts and rules corresponding to a context is 25.

A.2 Datasets
Our experiments are conducted on the datasets introduced in Clark et al. (2020).

Birds-Electricity: The Birds-Electricity dataset comprises two test-only datasets where the contexts are about birds and electric circuits. The vocabulary of entities, attributes, and predicates, apart from is(), is entirely new at test time, thus providing a benchmark for testing the generalization capability of the models on out-of-distribution data. Another interesting aspect of this dataset is that every example is annotated with a unique gold proof.

ParaRules:
The ParaRules dataset is one where the facts and rules are paraphrased by humans into more natural language. It consists of 40k questions in total, with 28k, 4k, and 8k questions in the train, validation, and test splits, respectively. This dataset tests a model's ability to reason over more complex, human-like language. As in the synthetic datasets, each example is annotated with all possible proofs.

A.3 Syntax of Proof Graph
Each proof P_i = (V_i, E_i) is a directed graph with a set of nodes V_i ⊆ N and a set of edges E_i ⊆ V_i × V_i. Each node n ∈ N is either a fact F ∈ F or a rule R ∈ R from the context, or a special NAF node denoting "Negation as Failure". A NAF node in a proof asserts the truth of the negation of a statement that cannot be proved using the set of rules (under the closed-world assumption). Edges can be directed either from a fact (or NAF) to a rule or between two rules. An edge from a fact to a rule means that the rule applies to the fact to generate a new fact. Similarly, an edge from a rule R_1 ∈ R to another rule R_2 ∈ R denotes the application of R_2 to the fact generated by R_1. Proofs are either successful or failed. A successful proof is one where the question statement can be logically reached (and thus proved or disproved) using the given rule-base, while for a failed proof no conclusion can be reached, in which case the shallowest branch of the proof tree that fails is generated. For more details and examples of proofs, we refer the reader to prior work (Saha et al., 2020; Clark et al., 2020).
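The edge syntax above (fact-to-rule, NAF-to-rule, or rule-to-rule only) can be captured in a small validity check. This is an illustrative sketch with names of our own choosing, not code from the released implementation:

```python
FACT, RULE, NAF = "fact", "rule", "naf"

def check_proof_syntax(node_types, edges):
    """Validate the edge syntax of a proof graph.

    node_types: dict mapping node id (e.g. "F1", "R2", "NAF") to FACT/RULE/NAF.
    edges: iterable of (src, dst) pairs.
    Allowed edges: fact -> rule, NAF -> rule, rule -> rule.
    """
    for src, dst in edges:
        if node_types[dst] != RULE:
            return False  # every edge must point into a rule
        if node_types[src] not in (FACT, NAF, RULE):
            return False  # unknown node type on the source side
    return True

# Valid proof: fact F1 feeds rule R1, whose conclusion feeds rule R2.
ok = check_proof_syntax(
    {"F1": FACT, "R1": RULE, "R2": RULE},
    [("F1", "R1"), ("R1", "R2")],
)
# Invalid proof: an edge from a rule into a fact violates the syntax.
bad = check_proof_syntax({"F1": FACT, "R1": RULE}, [("R1", "F1")])
```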

A.4 Ablation Analysis
In Table 5, we compare our baselines PROVER, PROVER-all, and the PROVER-top-p variants with our MULTIPROVER models on the validation set of the DU5 dataset. We also show two ablations of IT-MULTIPROVER: in the first, we replace the Hungarian loss with a sequential loss, which computes the cross-entropy loss of the i-th predicted proof against the i-th gold proof; in the second, we condition the node embeddings only on the previous node embeddings instead of on both node and edge embeddings. All models, except PROVER and PROVER-all, generate a maximum of 3 proofs. PROVER-top-p suffers a huge drop in proof precision due to the generation of many incorrect proofs. Although carefully choosing the value of p, either by thresholding or through a classifier, helps boost the proof precision, PROVER remains a strong baseline on this dataset due to its high skew towards single-proof examples. ML-MULTIPROVER improves upon PROVER's proof F1 and full accuracy (FA), which are further improved by IT-MULTIPROVER, owing to its explicit conditioning mechanism between the proofs. Replacing the Hungarian loss with a sequential loss leads to a significant drop in proof F1, demonstrating the effectiveness of modeling multiple proof generation as a set generation problem. Finally, conditioning the node embeddings on both node and edge embeddings leads to a marginal improvement in proof F1. Overall, IT-MULTIPROVER outperforms all other models across all metrics.
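The contrast between the two losses can be illustrated concretely. The Hungarian loss matches each predicted proof to the gold proof that minimizes the total cost, making the loss permutation-invariant, whereas the sequential loss pairs proofs by position. For the small proof budgets used here (p ≤ 5), the optimal matching can even be brute-forced over permutations; the sketch below uses a generic per-pair cost in place of the paper's node/edge cross-entropy:

```python
from itertools import permutations

def hungarian_set_loss(pred_proofs, gold_proofs, pair_loss):
    """Minimum-cost bipartite matching between predicted and gold proofs.

    pred_proofs and gold_proofs are equal-length lists (gold padded with
    empty proofs if needed); pair_loss(pred, gold) returns a scalar cost.
    Brute force over permutations suffices for small p; the Hungarian
    algorithm solves the same problem in O(p^3).
    """
    p = len(pred_proofs)
    best = float("inf")
    for perm in permutations(range(p)):
        cost = sum(pair_loss(pred_proofs[i], gold_proofs[j])
                   for i, j in enumerate(perm))
        best = min(best, cost)
    return best

# Toy example: proofs as node sets, cost = size of the symmetric difference.
sym_diff = lambda a, b: len(a ^ b)
loss = hungarian_set_loss(
    [{"F1", "R1"}, {"F2"}],
    [{"F2"}, {"F1", "R1"}],  # same gold proofs, listed in a different order
    sym_diff,
)
# The Hungarian matching finds the order-swapped pairing with cost 0;
# a sequential (identity) pairing would incur a cost of 3 + 3 = 6.
```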

A.5 MULTIPROVER with Varying Maximum Number of Proofs
We analyze the effect of varying the maximum number of proofs p on ML-MULTIPROVER and IT-MULTIPROVER in Tables 6 and 7, respectively. All models are trained on the DU5 training set and evaluated on the corresponding validation set. Although all models maintain the QA accuracy, we find that the proof F1 of ML-MULTIPROVER decreases marginally as p increases. Note that this model is trained with padding of empty proof graphs, since it generates all p proofs in parallel. Thus, the amount of padding increases with p, leading to a harder learning problem, as the model needs to predict more empty graphs. IT-MULTIPROVER, on the other hand, is considerably more robust to such variations in p, because it generates proofs iteratively with a single empty graph at the end indicating the end of the set.
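The two target constructions can be contrasted in a short sketch (helper names are ours, for illustration only): parallel generation pads the gold proof set out to a fixed size p with empty graphs, while iterative generation appends just one empty graph as a stop marker, independent of p:

```python
EMPTY = frozenset()  # an empty proof graph (no nodes, no edges)

def pad_parallel(gold_proofs, p):
    """ML-MULTIPROVER-style target: exactly p slots, the rest padded empty."""
    assert len(gold_proofs) <= p
    return gold_proofs + [EMPTY] * (p - len(gold_proofs))

def targets_iterative(gold_proofs):
    """IT-MULTIPROVER-style target: proofs plus one empty 'end of set' graph."""
    return gold_proofs + [EMPTY]

gold = [frozenset({"F1", "R1"})]            # one gold proof
parallel = pad_parallel(gold, p=5)          # 1 real proof + 4 empty graphs
iterative = targets_iterative(gold)         # 1 real proof + 1 stop marker
```

As the sketch makes visible, raising p from 3 to 5 adds two more empty targets per single-proof example under parallel padding but changes nothing in the iterative targets, which is consistent with the robustness observed for IT-MULTIPROVER.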

A.6 Evaluation on Human-Paraphrased Rule-Bases
Following PROVER, we also test MULTIPROVER's effectiveness in generating proofs for more human-like, complex rule-bases. The ParaRules dataset is constructed by first creating a set of fact groups, where each fact group consists of all facts in the theory concerning a particular person, and then paraphrasing these fact groups into more complex language. For example, the fact group "Alan is blue. Alan is rough. Alan is young." may be reworded as "Alan is on the young side, but rough. He often feels rather blue." Thus, unlike the DU datasets or the Birds-Electricity dataset, where the proof graphs are composed of facts and rules, ParaRules proofs are composed of fact groups and rules. Following past work (Clark et al., 2020; Saha et al., 2020), we train our models on the combined DU3 and ParaRules train sets, and evaluate on the ParaRules validation and test sets in Tables 8 and 9, respectively. We find that conclusions similar to those on the DU5 dataset hold here as well: ML-MULTIPROVER achieves a better proof F1 and full accuracy than PROVER, which are further improved by IT-MULTIPROVER due to its explicit conditioning mechanism between the proofs.

A.7 Training Time and Size Comparison
Table 10 shows the number of trainable parameters and training times per epoch for the baseline model PROVER and our proposed models, ML-MULTIPROVER and IT-MULTIPROVER, across varying numbers of maximum proofs (p) per sample. Since ML-MULTIPROVER adopts the same PROVER architecture but with multi-label classification, it has the same number of parameters as PROVER, which also remains unchanged irrespective of the maximum number of proofs. The number of parameters for IT-MULTIPROVER, however, increases with p because of the presence of multiple node and edge encoders.

While IT-MULTIPROVER has more parameters than PROVER, our empirical findings reveal that merely training a similarly-sized, larger PROVER model is not sufficient; exploiting the correlations between multiple proofs with a permutation-invariant loss is necessary for the task of generating a set of multiple proofs. The training time of PROVER exceeds that of ML-MULTIPROVER because the former treats each proof as a separate example, increasing the training data size from 70k to 110k examples. ML-MULTIPROVER is the most time-efficient model, and its running time increases only marginally with p, owing to the additional node and edge classifications the model performs for each extra proof. Unsurprisingly, IT-MULTIPROVER takes longer to train but, encouragingly, for p ≤ 4 its running time remains comparable to PROVER's.