PRover: Proof Generation for Interpretable Reasoning over Rules

Recent work by Clark et al. (2020) shows that transformers can act as 'soft theorem provers' by answering questions over explicitly provided knowledge in natural language. In our work, we take a step closer to emulating formal theorem provers by proposing PROVER, an interpretable transformer-based model that jointly answers binary questions over rule-bases and generates the corresponding proofs. Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm. During inference, a valid proof, satisfying a set of global constraints, is generated. We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation, with strong generalization performance. First, PROVER generates proofs with an accuracy of 87%, while retaining or improving performance on the QA task, compared to RuleTakers (up to 6% improvement on zero-shot evaluation). Second, when trained on questions requiring lower depths of reasoning, it generalizes significantly better to higher depths (up to 15% improvement). Third, PROVER obtains near-perfect QA accuracy of 98% using only 40% of the training data. However, generating proofs for questions requiring higher depths of reasoning becomes challenging, and the accuracy drops to 65% for 'depth 5', indicating significant scope for future work. Our code and models are publicly available at https://github.com/swarnaHub/PRover


Introduction
Developing systems that can understand and reason over explicitly provided knowledge has been a fundamental goal of AI (Newell and Simon, 1956). Owing to the challenges posed in reasoning over formal representations (Musen and Van Der Lei, 1988), and backed by the recent successes of transformers (Vaswani et al., 2017) in NLP, Clark et al. (2020) propose a new version of the problem by replacing the formal representations of rule-bases with natural language (English). Specifically, their task requires predicting the truth value of a statement by reasoning over a set of facts and rules, all expressed in natural language. Figure 2 shows some examples of the task. Clark et al. (2020) propose RuleTakers, a fine-tuned RoBERTa model (Liu et al., 2019b), to show that transformers can act as "soft theorem provers" by predicting the final answer in such reasoning-based problems with high accuracy. We argue that to use transformers for natural language reasoning reliably, they should be able to generate proofs that provide rationales for the predicted answer. Proof generation is not only vital for emulating formal reasoning but also for moving towards human-interpretable models that alleviate concerns about the black-box nature of deep architectures (Rudin, 2019). Towards this, we present PROVER, a transformer-based model that jointly answers questions over natural language rule-bases and generates corresponding proofs. Figure 1 illustrates our method as a closer linguistic analog of formal reasoning, as it generates proofs along with answers. However, unlike formal reasoners, PROVER can operate on natural language text that provides the underlying theory, rather than rely on formal logical representations. Such methods that combine interpretability and flexibility in reasoning can have wide applications across domains.
PROVER's architecture consists of three modules that together generate answers along with proofs. In this work, proofs are represented as directed graphs consisting of the relevant rules and facts needed to prove or disprove the question statement. Section 3.1 contains details of this representation. A QA module predicts a binary answer for the question, a node module chooses which rules and facts are part of the proof, and an edge module predicts the presence and the direction of the edges between the chosen nodes. Model training minimizes a joint cross-entropy loss over the three modules. To guide the model to predict edges between valid nodes only, we enforce global constraints on the structure of the proof during training by masking out labels for impossible edges, resulting in a more efficient learning problem. During inference, PROVER generates valid proofs by solving an ILP over the edge potentials, subject to multiple semantic constraints, such as ensuring proof graph connectivity. Our contributions are:
• We present PROVER, an interpretable joint model that learns to reason over natural language rule-bases and generate corresponding proofs.
• PROVER matches or improves upon state-of-the-art QA accuracy for the task, with up to 6% improvement on zero-shot evaluation, and generates exact proofs at 87% accuracy. Unlike RuleTakers, it does not require additional fine-tuning on the RACE (Lai et al., 2017) dataset.
• PROVER demonstrates significantly better generalization: when trained on lower-depth questions, it shows better QA accuracy (up to 15%) on higher-depth ones.

Related Work
Our work is related to multiple bodies of previous work in NLP and formal reasoning.
QA and NLI: The rule reasoning task is related to reasoning tasks that have been proposed recently. These include tasks in the bAbI dataset (Weston et al., 2015), synthetically generated probe tasks (Richardson et al., 2020) or reading comprehension tasks in datasets such as QuaRTz (Tafjord et al., 2019) and ROPES (Lin et al., 2019). Unlike our task, most of these require reasoning over implicit rules, the focus being on language understanding and one step of rule application. Multi-hop QA datasets like HotpotQA (Yang et al., 2018) require multiple reasoning steps, but the inference rules needed are again implicitly inferred, rather than explicitly provided. Our task also bears similarity with Natural Language Inference (MacCartney and Manning, 2014), but NLI also allows unsupported inferences by filling gaps in explicitly stated knowledge (Dagan et al., 2013).
Formal Reasoning and Neural Theorem Proving: Semantic parsing (Zettlemoyer and Collins, 2005; Berant et al., 2013; Berant and Liang, 2014) of multi-sentence texts into logical forms has proved to be challenging, restricting the application of semantic parsers to formal reasoning systems (Kamath and Das, 2019). PROVER bypasses this expensive and error-prone process and attempts to solve the problem in an end-to-end manner, without any intermediate logical representations. Our approach is conceptually similar to a body of work on Neural Theorem Proving (Weber et al., 2019) that has focused on developing theorem provers by combining reasoning from symbolic techniques with the possibility of differentiable learning from neural networks. These include neuro-symbolic methods for table comprehension (Neelakantan et al., 2016), executing basic compositional programs (Reed and de Freitas, 2016), SAT solving (Selsam et al., 2019), formula embedding (Abdelaziz et al., 2020), approximate (DNF) model counting (Abboud et al., 2020), etc. However, PROVER diverges from these in working with free-form natural language input to generate proofs similar to formal reasoners.
Model Interpretability: PROVER follows a significant body of previous work on developing interpretable neural models for NLP tasks to foster explainability. Several approaches have focused on formalizing the notion of interpretability (Rudin, 2019; Doshi-Velez and Kim, 2017; Hase and Bansal, 2020), tweaking features for local model interpretability (Ribeiro et al., 2016, 2018) and exploring interpretability in latent spaces (Joshi et al., 2018; Samangouei et al., 2018). Our work can be seen as generating explanations in the form of proofs for an NLP task. While there has been prior work on generating natural language explanations, we generate explanations consisting of proof graphs that detail the chain of reasoning, starting from language. We use a max-flow ILP formulation for checking proof graph connectivity (Even and Tarjan, 1975). Multiple approaches for NLP tasks such as sentiment analysis and content selection (Pang and Lee, 2004; Barzilay and Lapata, 2005; Bansal et al., 2008) have been framed as optimal flow problems on graphs.

[Figure 2: Two example rule-bases with questions, answers, and proofs. Context 1 — Facts: F1: The bald eagle eats the lion. F2: The bald eagle sees the tiger. F3: The lion chases the bald eagle. F4: The lion eats the mouse. F5: The mouse eats the tiger. F6: The tiger eats the bald eagle. F7: The tiger is red. Rules: R1: If the lion is green and the lion is not kind then the lion sees the bald eagle. R2: If someone sees the lion then they eat the mouse. R3: If someone is kind and not green then they see the bald eagle. R4: If someone is rough then they see the lion. R5: If someone sees the lion and they do not eat the tiger then the tiger is rough. R6: If someone eats the bald eagle and the bald eagle is not kind then the bald eagle is rough. R7: If someone does not eat the lion then the lion is big. R8: If someone is kind then they do not eat the mouse. Context 2 — Facts: F1: The circuit includes the battery. F2: The wire is metal. F3: The circuit includes the bell. Rules: R1: If the circuit includes the battery and the battery is not flat then the circuit is powered. R2: If the circuit includes the switch and the switch is on then the circuit is complete. R3: If the circuit does not have the switch then the circuit is complete. R4: If the wire is metal then the wire is conducting. R5: If the wire is plastic then the wire is not conducting. R6: If the circuit is powered and the circuit is complete and the wire is conducting then the current runs through the circuit. R7: If the current runs through the circuit and the circuit includes the light bulb then the current runs through the light bulb. R8: If the current runs through the circuit and the circuit includes the bell then the current runs through the bell. R9: If the current runs through the circuit and the circuit includes the radio then the current runs through the radio. R10: If the current runs through the light bulb then the light bulb is glowing. R11: If the current runs through the bell then the bell is ringing. R12: If the current runs through the radio then the radio is playing. Q1: The wire is not conducting. (Answer: False)]
Program Synthesis with Transformers: Existing work shows that transformers already capture some knowledge from pre-training for algorithm emulation (Talmor et al., 2019) or can be fine-tuned for tasks like semantic parsing (He and Choi, 2020), translation (Wang et al., 2019), symbolic integration (Lample and Charton, 2020) and mathematics (Saxton et al., 2019). In our work, we also employ a transformer-based pre-trained language model (RoBERTa (Liu et al., 2019b)), but for the downstream task of rule-based reasoning.

Method
Each input to PROVER is a context C (consisting of facts F and rules R) and a question Q about the context. PROVER predicts the answer A ∈ {True, False} and generates a proof P.

Proof Representation
A proof, P = (N, E), is a directed graph with nodes n ∈ N and edges e ∈ E. Each node is either a fact f ∈ F, a rule r ∈ R, or a special NAF node (Negation As Failure, as described below). Edges in the proof are directed either from a fact (or NAF) to a rule or from a rule to another rule. These indicate that a fact (or NAF) is consumed by a rule, or that the output of a rule (a new fact) is consumed by another rule, respectively. We use these constraints both during PROVER's training and inference, as described later in the paper. Formally, we have:

N ⊆ F ∪ R ∪ {NAF},  E ⊆ ((F ∪ {NAF}) × R) ∪ (R × R)

Figure 2 shows examples of two contexts (consisting of facts and rules), five questions about the contexts, along with their answers and proofs. Each proof has a depth (Q1's proof has a depth of 1). The maximum proof depth in all the datasets considered in this work (Clark et al., 2020) is 5. Proofs in the datasets are of three types: Successful proof without NAF: The proof of Q1 in Figure 2 is one such example. F2 acts on R4 to prove that "The wire is conducting.", and hence the answer is false.
Successful proof with NAF: Given a statement s, NAF in logic programming is a non-monotonic inference rule used to derive "not s" (the negation of the statement) from failure to derive s. Hence, a proof may contain NAF node(s), representing the truthfulness of the negation of statement(s) that cannot be proved using the set of rules. Consider the proofs for Q4 and Q5, where the NAF node in Q4 represents "The bald eagle is not kind.".
Failed proof: This happens when a statement cannot be derived using the given rule-base, and the shallowest branch of the proof tree that fails is shown. Q3's proof in Figure 2 is an example, as "The radio is playing." cannot be proved. Note that a proof can have edges between two rules in both directions; e.g., consider the edges R4 → R5 and R5 → R4 in Q5's proof in Figure 2. A node can also have more than two incoming edges: the node R6 in Q2 has three incoming edges, from R1, R3, and R4.
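The structural constraints above (edges run only from facts or NAF into rules, or between rules, with no self-loops) can be sketched as a simple validity check. This is an illustrative sketch, not the authors' implementation; node naming conventions ("F1", "R1", "NAF") are our assumption.

```python
# Sketch of the proof-graph representation: nodes are fact ids ("F1"),
# rule ids ("R1"), or the special "NAF" node; edges are directed pairs.

def is_valid_edge(src: str, dst: str) -> bool:
    """Edges may go fact -> rule, NAF -> rule, or rule -> rule."""
    src_ok = src.startswith(("F", "R")) or src == "NAF"
    dst_ok = dst.startswith("R")        # edges always point into a rule
    return src_ok and dst_ok and src != dst   # and no self-loops

def is_valid_proof(nodes, edges):
    """All edges must connect chosen nodes and respect edge directions."""
    return all(s in nodes and d in nodes and is_valid_edge(s, d)
               for s, d in edges)

# Q1's proof from Figure 2: F2 acts on R4 ("The wire is conducting.")
proof_nodes = {"F2", "R4"}
proof_edges = [("F2", "R4")]
assert is_valid_proof(proof_nodes, proof_edges)   # fact -> rule: valid
assert not is_valid_edge("R4", "F2")              # rule -> fact: invalid
```

The check mirrors the set definition of E: only pairs in (F ∪ {NAF}) × R or R × R pass.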

Task Description
Each training example is a tuple consisting of a context (set of rules and facts), a question, the corresponding answer, and a proof. Generating a proof graph requires (1) identifying the nodes (set of relevant facts, NAF and rules) that are part of the proof, (2) identifying the edges connecting these nodes, and (3) verifying a set of global constraints such as proof connectivity that ensure a valid proof.
For the first, we predict a binary label for each rule, fact and NAF, denoting its presence or absence in the proof. For the second, we similarly predict binary labels denoting the presence or absence of each edge. For the third, we enforce constraints during both training and inference (Section 3.4). During training, we mask out the edge labels corresponding to (1) self-loops, (2) edges between absent nodes, and (3) edges from facts to facts and from rules to facts. This enforces a semantic constraint that the set of candidate edges in the ensuing proof is consistent with the chosen set of nodes, and also simplifies the learning problem, since a smaller number of edges need to be labeled.
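The training-time edge masking can be sketched as follows; this is our illustrative formulation (function and variable names are not the authors' code), keeping labels only for edges compatible with the three rules above.

```python
# Sketch of edge-label masking: the loss is computed only over the
# candidate edges returned here, shrinking the learning problem.

def edge_mask(elements, gold_nodes):
    """elements: list of ("fact"|"rule"|"naf", id) pairs in the context.
    gold_nodes: ids of nodes present in the gold proof.
    Returns the set of (src, dst) pairs whose edge labels are kept."""
    kept = set()
    for kind_s, s in elements:
        for kind_d, d in elements:
            if s == d:                            # (1) no self-loops
                continue
            if s not in gold_nodes or d not in gold_nodes:
                continue                          # (2) no edges between absent nodes
            if kind_d != "rule":                  # (3) nothing may point into a fact
                continue
            kept.add((s, d))
    return kept

elems = [("fact", "F2"), ("rule", "R4"), ("rule", "R6"), ("naf", "NAF")]
mask = edge_mask(elems, gold_nodes={"F2", "R4", "R6"})
assert ("F2", "R4") in mask and ("R4", "R6") in mask
assert ("R4", "F2") not in mask      # rule -> fact masked out
assert ("NAF", "R4") not in mask     # NAF absent from this proof's nodes
```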

PROVER: Joint QA and Proof Generation Model

QA Module: We concatenate the context and the question and encode them with RoBERTa. Using the [CLS] token embedding, we obtain the class-wise probability values P_QA using the softmax function σ.

Node Module: Suppose {w^(i)_j}^m_{j=1} denotes the m tokens corresponding to RF_i, the i-th rule or fact. Assuming the corresponding RoBERTa embeddings are denoted by {t_{w^(i)_j}}^m_{j=1}, we learn a representation t_{RF_i} for each RF_i by performing a mean pooling MP of the constituent token embeddings.
We also learn a representation of the NAF node, t_NAF, as a linear transformation of t_[CLS]. Note that due to the self-attention layers of RoBERTa, t_[CLS] summarizes the set of all derivable facts given the context and the question. We want the NAF node to encode information about all facts containing negation (e.g., "The bald eagle is not kind" in Q4's proof of Figure 2) in the context. These are taken to be true because their positive counterparts ("The bald eagle is kind") are non-derivable given the context. Thus, if a statement s cannot be derived from the facts and the rules in a context, the NAF node should infer that "not s" is true. We model this notion of the negation of all unprovable statements (given a context) by learning NAF as a function of everything provable in the context, encoded by t_[CLS]. The node classifier H_Node has a similar architecture to the QA classifier and predicts a presence and absence probability score for each node.
Edge Module: Now, given the representations of each fact, rule and NAF, we learn a representation for each edge between these. Formally, we define the edge embedding t_(RF_i, RF_j) from node RF_i to node RF_j by concatenating their individual embeddings t_{RF_i} and t_{RF_j} with their element-wise difference (which gives the directionality vector).
The above formulation also helps learn separate representations for edges RF i → RF j and RF j → RF i . This is essential for our task as a proof can have edges between two rules in both directions. In Section 4.7, we see that this formulation leads to a near perfect empirical performance in predicting the directionality of edges. The edge classifier H Edge outputs probability scores representing the presence and absence of each edge.
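The node and edge representations can be sketched as below, with random vectors standing in for RoBERTa token embeddings and a small hidden size; both are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # hidden size (768 for RoBERTa-base)

# Node module: mean-pool the m token embeddings of each rule/fact RF_i.
tokens_rf1 = rng.normal(size=(5, d))    # 5 tokens for RF_1
tokens_rf2 = rng.normal(size=(7, d))    # 7 tokens for RF_2
t_rf1 = tokens_rf1.mean(axis=0)
t_rf2 = tokens_rf2.mean(axis=0)

# Edge module: concatenate both node embeddings with their element-wise
# difference, which encodes the edge's direction.
def edge_embedding(t_i, t_j):
    return np.concatenate([t_i, t_j, t_i - t_j])

e_12 = edge_embedding(t_rf1, t_rf2)     # RF_1 -> RF_2
e_21 = edge_embedding(t_rf2, t_rf1)     # RF_2 -> RF_1
assert e_12.shape == (3 * d,)
assert not np.allclose(e_12, e_21)      # the two directions get distinct inputs
```

Because the difference term flips sign between the two orderings, the classifier sees distinct representations for RF_i → RF_j and RF_j → RF_i, which is what allows bidirectional rule-rule edges to be predicted.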
We train our model using binary cross-entropy loss for each of the three modules. If a proof contains multiple NAF statements, we collapse all the NAF nodes into a single node and learn a unified representation for them. Formally, if L_QA, L_Node and L_Edge denote the three losses, the overall loss L is given by:

L = L_QA + L_Node + L_Edge
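A minimal sketch of the joint objective, assuming an unweighted sum of the three binary cross-entropy terms (the combination weights are our assumption; all probabilities below are made up for illustration):

```python
import numpy as np

def bce(p, y, eps=1e-12):
    """Mean binary cross-entropy between predicted probs p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

p_qa, y_qa = np.array([0.9]), np.array([1.0])              # answer head
p_node, y_node = np.array([0.8, 0.2]), np.array([1., 0.])  # per-node labels
p_edge, y_edge = np.array([0.7]), np.array([1.0])          # unmasked edges only

# Overall loss: joint sum over the QA, node, and edge modules.
loss = bce(p_qa, y_qa) + bce(p_node, y_node) + bce(p_edge, y_edge)
assert loss > 0
```

Note the edge term is computed only over the unmasked candidate edges described in Section 3.2.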

ILP Inference for Global Constraints
As mentioned previously, during inference, we enforce additional constraints on the structure of the predicted proof graph. For this, we frame inference as an Integer Linear Program (ILP) optimization, which we describe next. We follow the generative process of a graph wherein the nodes are defined first, followed by the edges on that set of nodes. Thus, we first fix the nodes based on the predictions of the node module and maximize a global score over the set of edges only. This reduces the large search space and ensures that all constraints can be expressed as linear expressions.
Proof Connectivity Formulation: An important constraint is to ensure that the predicted proof graphs are connected. To check if a proof graph P is connected, we define an augmented graph P_aug = (N_aug, E_aug) with two added nodes, "source" and "sink". We add an edge from the source to any one of the nodes x in P. We also define edges from all nodes in P to the sink.
Having defined P aug , we can reduce the graph connectivity in P to a maximum flow problem (Leighton and Rao, 1999) in P aug (Even and Tarjan, 1975). For this, we define the capacity variable c (m,n) for each edge, m → n in P aug , as follows.
c_(source,x) = |N| and c_(x,source) = 0
∀n ∈ N: c_(n,sink) = 1 and c_(sink,n) = 0

Now, there can be a maximum total flow of |N| from the source to the sink if and only if the graph is connected. We use this flow formulation to provide additional constraints for our ILP inference procedure that ensure connectivity of proof graphs.
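A compact executable check of this formulation: we add a source → x edge of capacity |N| and node → sink edges of capacity 1, give proof edges a capacity of |N| in both directions (our simplification so that flow can traverse edges regardless of proof direction), and test whether the max flow equals |N|. The augmenting-path solver below is a generic Ford-Fulkerson sketch, not the authors' ILP.

```python
from collections import defaultdict, deque

def max_flow(cap, s, t):
    """BFS augmenting paths, pushing one unit each time (sink edges have
    capacity 1, so a unit bottleneck always suffices here)."""
    flow = 0
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return flow
        v = t
        while parent[v] is not None:     # augment along the found path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1

def is_connected(nodes, edges):
    n = len(nodes)
    cap = defaultdict(lambda: defaultdict(int))
    for m, d in edges:                   # proof edges, traversable both ways
        cap[m][d] = n
        cap[d][m] = n
    x = next(iter(nodes))
    cap["source"][x] = n                 # source feeds a single node x
    for v in nodes:
        cap[v]["sink"] = 1               # each node can absorb exactly 1 unit
    return max_flow(cap, "source", "sink") == n

assert is_connected({"F2", "R4"}, [("F2", "R4")])
assert not is_connected({"F2", "R4", "R6"}, [("F2", "R4")])  # R6 is isolated
```

A total flow of |N| is achievable exactly when every node is reachable from x, i.e., when the graph is connected.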
Final Optimization Problem: Let φ_(m,n) represent the probability that an edge m → n is present, as predicted by PROVER. We want to infer 0/1 assignments for our optimization variables e_(m,n) (a value of 1 means the edge is part of the proof, while 0 means it is not) such that the following objective is maximized, subject to the connectivity constraint and all other constraints that ensure a valid proof:

max Σ_(m,n) [ φ_(m,n) e_(m,n) + (1 − φ_(m,n)) (1 − e_(m,n)) ]    (1)

subject to constraints:

e_(m,n) = 0    ∀ m ∉ N or n ∉ N    (2)
e_(m,n) = 0    ∀ m, n ∈ F    (3)
e_(m,n) = 0    ∀ m ∈ R, n ∈ F    (4)
0 ≤ f_(m,n) ≤ c_(m,n)    ∀ m → n ∈ E_aug    (5)
Σ_m f_(m,n) = Σ_p f_(n,p)    ∀ n ∈ N    (6)
Σ_(n ∈ N) f_(n,sink) = |N|    (7)
f_(m,n) ≤ c_(m,n) e_(m,n)    ∀ m → n ∈ E    (8)

Note that N, F, and R refer to the set of predicted nodes (from the model), the set of facts, and the set of rules, respectively. Equations 2, 3 and 4 ensure that edges are present only when the corresponding nodes are present, and that there are no edges between two facts or from a rule to a fact. Next, to ensure proof connectivity, we first define the flow constraints in Equations 5 and 6 over the flow variables f_(m,n) for each edge m → n. These maintain the capacity constraints (the flow at each edge should be at most its capacity) and the flow conservation constraints (the total flow through the incoming edges at a node equals the total flow through the outgoing edges). Equation 7 ensures connectivity in the proof graph by enforcing the total flow to be |N|. Finally, we ensure that the proof connectivity is checked on the valid edges only (those which are part of the proof) through the last constraint, since a max-flow of |N| is achievable for any connected graph.
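To make the optimization concrete, here is a tiny brute-force stand-in for the ILP (a real implementation would use an ILP solver): it enumerates 0/1 edge assignments over the candidate edges, keeps those whose induced graph is connected, and picks the assignment maximizing the objective. Node names and edge probabilities are invented for illustration.

```python
from itertools import product

def connected(nodes, chosen):
    """Undirected reachability check over the chosen edges."""
    if not nodes:
        return True
    adj = {v: set() for v in nodes}
    for m, d in chosen:
        adj[m].add(d); adj[d].add(m)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v])
    return seen == set(nodes)

def best_proof_edges(nodes, phi):
    """Maximize sum of phi*e + (1-phi)*(1-e) over connected edge sets."""
    cands = list(phi)                     # candidate edges (already node-valid)
    best, best_score = None, float("-inf")
    for bits in product([0, 1], repeat=len(cands)):
        chosen = [e for e, b in zip(cands, bits) if b]
        if not connected(nodes, chosen):  # connectivity constraint
            continue
        score = sum(phi[e] if b else 1 - phi[e]
                    for e, b in zip(cands, bits))
        if score > best_score:
            best, best_score = chosen, score
    return best

# Predicted nodes and edge probabilities for a small example:
nodes = {"F2", "R4", "R6"}
phi = {("F2", "R4"): 0.9, ("F2", "R6"): 0.2, ("R4", "R6"): 0.8}
assert best_proof_edges(nodes, phi) == [("F2", "R4"), ("R4", "R6")]
```

Without the connectivity constraint, the best assignment could leave a predicted node isolated; here the low-probability edge ("F2", "R6") is correctly dropped because the remaining two edges already connect all three nodes.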

Experiments
Our experiments evaluate the effectiveness of PROVER (PR), our joint QA and proof model, against RuleTakers (RT). Details of our experimental setup are in the appendix.

Datasets and Evaluation Metrics
We conduct experiments on all three sets of datasets introduced in Clark et al. (2020), consisting of gold answers and proofs. Further details of the datasets can be found in the appendix. DU0-DU5: The first set consists of five datasets, each containing 100k questions with theories in synthetic language and requiring reasoning paths up to depth D (D = 0, 1, 2, 3, 5). We refer to these datasets as DU0, DU1, DU2, DU3 and DU5, where DU stands for "Depth Upto". Birds-Electricity: It consists of two test-only datasets of 5k samples used to evaluate the out-of-distribution performance of the models. ParaRules: ParaRules consists of 40k questions against 2k theories expressed in paraphrased natural language, obtained through crowdsourcing.
We evaluate QA performance through accuracy. For proofs, we report node accuracy and edge accuracy (whether the predicted node set and edge set exactly match the gold proof), proof accuracy (whether both the nodes and the edges of the proof are exactly correct), and full accuracy (whether the answer and the proof are both exactly correct).
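These metrics (QA, node, edge, proof, and full accuracy) reduce to exact-match comparisons per example; a sketch, with illustrative names (our formulation, not the evaluation script):

```python
def metrics(pred_ans, gold_ans, pred_nodes, gold_nodes, pred_edges, gold_edges):
    """Per-example exact-match metrics; averaging over a dataset gives the
    reported accuracies."""
    qa = pred_ans == gold_ans
    node = set(pred_nodes) == set(gold_nodes)   # exact node-set match
    edge = set(pred_edges) == set(gold_edges)   # exact edge-set match
    proof = node and edge                       # whole proof exactly correct
    full = qa and proof                         # answer AND proof correct
    return {"qa": qa, "node": node, "edge": edge, "proof": proof, "full": full}

m = metrics(False, False,
            {"F2", "R4"}, {"F2", "R4"},
            [("F2", "R4")], [("F2", "R4")])
assert m == {"qa": True, "node": True, "edge": True,
             "proof": True, "full": True}
```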

QA and Proof Results for Varying Depths
We first train and evaluate PROVER on the train and test splits of the DU5 dataset, and compare its QA performance with RuleTakers for questions of varying depths (D). Table 1 shows these results and the proof-related metrics for PROVER. The corresponding validation set results can be found in the appendix. Overall, and at each depth, PROVER matches the QA performance of RuleTakers. PROVER is also able to generate exact proofs fairly accurately, at 87%. Perhaps unsurprisingly, we find that edge prediction is a harder task than node prediction, and performance worsens with increasing depth due to an increasingly large number of edges to be labeled. The proof accuracy matches the edge accuracy at each depth, suggesting that proofs are almost always correct if the edges are correct. Similarly, the full accuracy matches the proof accuracy, showing that the predicted answer is almost always correct when the corresponding proof is correct. This points to an interesting observation: QA is easier than node prediction, which in turn is easier than edge prediction. All the datasets experimented with exhibit this behavior, as we also describe later. Proof generation becomes harder with increasing depth (and hence, more nodes and edges), as the exact proof generation accuracy drops to 65% for depth 5. On analyzing further, we find that on average, PROVER correctly predicts 6 out of 7 edges present in a depth-5 proof. Overall, PROVER is interpretable yet efficient, as it generates proofs fairly accurately without any loss in QA performance.

Zero-Shot Evaluation
Following previous work (Clark et al., 2020), we now test the out-of-distribution performance of PROVER on the Birds-Electricity dataset (Table 2). The DU5-trained model is tested on six datasets, two from the birds domain (B1, B2) and another four from the electricity domain (E1, E2, E3, E4). Overall, our model achieves a 6% QA improvement over RuleTakers. More importantly, PROVER outperforms RuleTakers by 8% on the hardest and largest E4 subset of the data. The proof accuracy is also fairly high, demonstrating good proof generation ability of our model for out-of-distribution data as well. Similar to the test results on DU5, the full accuracy matches the proof accuracy, demonstrating proof consistency with the predicted answers.
We show examples of proofs generated by PROVER in Figure 2 and in the appendix.

Generalization to Higher Depths
We evaluate the generalization ability of PROVER compared to RuleTakers by training models on the train splits of DU0, DU1, DU2 and DU3, and testing the QA performance on the overall test set for DU5, which includes questions of higher depth than seen during training. The corresponding validation set and proof-related results can be found in the appendix. As shown in Figure 4, PROVER trained on the lower-depth datasets generalizes significantly better to DU5 than RuleTakers, with up to 15% QA improvement. On DU3, both models show high and comparable performance. PROVER's superior generalization ability can be attributed to the extra training supervision incorporated in the form of proofs and an inductive bias for making proof-based predictions. While proof construction for supervised training is expensive, PROVER's superior QA results on out-of-distribution data (Table 2) and higher-depth questions are a potential first step towards showing that limited proof supervision can still lead to effective generalization.

Varying Training Data Size
We explore varying the amount of training data from 10k to 30k to all the examples (70k) in DU5. As shown in Table 4, when trained with only 40% of the data, PROVER obtains a near-perfect QA accuracy of 97.8%. Thus, for QA, PROVER's joint training with proofs can compensate for the lack of training data. Proof generation, however, is much harder and with increased training data, the rate of increase in proof accuracy is much more gradual.

Evaluation on Complex Language
We also test PROVER's ability to generate proofs for more human-like natural language theories. More details on the ParaRules dataset can be found in the appendix. Following Clark et al. (2020), we train a model by combining the DU3 and ParaRules training partitions and test on the ParaRules test partition. Table 3 again shows that PROVER matches the QA performance of RuleTakers, and also generates proofs with a high accuracy of 95%. Following previous trends, the proof accuracy drops as the depth increases, and QA performance is higher than for node prediction, which in turn is higher than for edge prediction.

Ablation and Error Analysis
We compare PROVER with several ablated variants; more details about these models are in the appendix. The QA accuracy is mostly unaffected in all our models, and all but "No NAF" have similar node accuracy. The "No NAF" model does not learn a representation for NAF, leading to a 5-6% drop in both node and edge accuracy. The 5-6% drop in edge and proof accuracy for the "Unconstrained Train + No ILP" model, compared to PROVER, shows that removing constraints results in a harder learning problem, and the model fails to learn all the constraints automatically. The proof accuracy improves slightly when we add constraints only during inference ("Unconstrained Train + ILP"). The connectivity constraint provides only marginal improvement, as our model mostly predicts connected proofs without any explicit supervision. Specifically, only 57 examples have disconnected proofs without this constraint. The overall PROVER model outperforms all variants in full accuracy.
To better understand the loss of accuracy for higher-depth proofs, we perform error analysis of PROVER on the depth-5 subset of DU5. We find that our NAF learning module is highly accurate: PROVER correctly predicts NAF in a proof 95% of the time. Among all examples with incorrectly predicted node sets, 42% are such that the predicted set is a subset of the gold set, while for 25% of examples it is a superset, demonstrating that our model tends to underestimate the number of essential rules and facts. PROVER almost perfectly identifies the direction of edges. We find only 1 example where the proof is incorrect solely due to the incorrect identification of directionality. Further, 21% of the incorrectly predicted edge sets are subsets of the gold sets, while 35% are supersets.

Discussion and Future Work
Graph-based Explanations: While we have presented PROVER as a model that can emulate formal reasoning, it has further potential use as an explanation generation system. PROVER generates compositional explanations in the form of graphs and QA systems, in general, can potentially benefit from generating such graphical explanations. For example, in multi-hop QA tasks, the node module can choose all the relevant sentences in the context and the edge module can identify the flow of information between these to arrive at the answer (in the presence of task-specific constraints). Graphical explanations, in contrast to natural language ones, are more structured and can allow explicit modeling of causality (and are easier to evaluate, as opposed to free-form natural language generation). We hope that PROVER will encourage further work towards developing interpretable NLP models with structured explanations.
QA and Proof Consistency: Currently, PROVER predicts the answer and generates the proof by jointly optimizing the QA, node and edge modules using a shared RoBERTa model. Another modeling choice could explicitly condition the QA module on the node and edge modules so that the answer is predicted from the proof. We empirically verify the consistency between the predicted answer and the generated proof by showing that the full accuracy matches the proof accuracy. However, in scenarios where questions have open-ended answers, generating an answer from a 'proof' in a consistent manner needs more exploration. PROVER's constraints, like ensuring connectivity, are necessary constraints for generating valid proofs in any graph-based explanation generation system. However, other tasks may require imposing additional constraints to ensure valid explanations. PROVER's inference mechanism can be extended to incorporate these.

Broader Implications in Formal Logic:
PROVER's framework is not conceptually constrained to a particular logic fragment. PROVER uses the idea that applying a rule to fact(s) can produce new fact(s). All logic fragments from formal logic fit this idea and may only differ in the nature of the graphs generated. For a fact "Robin is a bird" and a rule with universal quantification "All birds can fly", PROVER's graph will have an edge from the fact to the rule to generate "Robin can fly". We experiment with datasets which already contain negations in facts. While these datasets currently do not contain disjunctions, our graphical representations of proofs allow an easy extension in such scenarios. E.g., if there is a disjunction rule "If X or Y then Z" instead of a conjunction rule "If X and Y then Z", only the shape of the graph changes. In the former, Z is proved by either an edge from X or from Y to the rule, while in the latter, both edges have to be necessarily present. Inferences over modals like "might" and disjunction rules like "If X then Y or Z" will mean that both the answer and the proof will be probabilistic. In such scenarios, PROVER's unweighted proof graphs can be extended to weighted ones to represent this probabilistic nature.

Conclusion
We introduce PROVER, an interpretable joint model that answers binary questions over natural language rule-bases and generates corresponding proofs. The proofs are generated through the node and edge modules of the model in the presence of multiple global constraints during training and ILP inference. Our model improves state-of-the-art QA accuracy in the zero-shot scenario by 6% and generates proofs accurately. PROVER also generalizes much better to higher-depth questions, with up to 15% absolute improvement in QA performance over RuleTakers. PROVER's modeling is relatively generic, and similar proof generation methods can be explored in traditional multi-hop QA tasks. PROVER can also be a helpful aid to formal reasoners in scenarios where rules are fuzzy and creating rule-bases in a formal language is tedious or infeasible.

A.2 Dataset Details
Below we briefly describe the three sets of datasets we conduct experiments on. Each dataset has a train, validation and test split, except for the zero-shot test-only one. Further details about these can be found in Clark et al. (2020).
DU0-DU5: The first set consists of five datasets, each containing 100k questions with theories in synthetic language and requiring reasoning paths up to depth D (D = 0, 1, 2, 3, 5). For example, D = 0 means the true facts can be proved by simple lookup in the context. The samples are randomly split 70/10/20 into train/dev/test partitions such that there is no overlap of theories between the partitions.
Birds-Electricity: The second set consists of two test-only datasets used to evaluate robustness and out-of-distribution performance of the models.
The contexts are about birds and an electric circuit, and consist of 5k samples in total. The vocabulary of entities, attributes and predicates, apart from is() are all new at test time.
ParaRules: The final dataset, ParaRules consists of 40k questions against 2k theories expressed in paraphrased natural language, obtained through crowdsourcing. While the previous datasets contain synthetic language, ParaRules tests the models' ability to reason over more human-like paraphrased language.
A.3 QA and Proof Results for Varying Depths
Table 7 shows the DU5 validation set performance of PROVER trained on the training split of DU5. PROVER obtains a near-perfect QA accuracy and a proof accuracy of 88%. While the QA accuracy remains equally high at all depths, the proof accuracy drops with increasing depth. Full accuracy matches the proof accuracy, demonstrating consistency between the predicted answers and generated proofs.
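The three metrics reported in these tables can be sketched as below. This is an illustrative reconstruction (field names are assumed): full accuracy counts an example as correct only when both the answer and the proof are right, so it can never exceed either individual accuracy.

```python
def evaluate(preds, golds):
    """Compute QA, proof, and full accuracy (sketch). Each item is a
    dict with an 'answer' (bool) and a 'proof' (set of edges); the
    field names are assumptions for illustration."""
    qa = proof = full = 0
    for p, g in zip(preds, golds):
        a_ok = p["answer"] == g["answer"]
        p_ok = p["proof"] == g["proof"]
        qa += a_ok
        proof += p_ok
        full += a_ok and p_ok  # both must be correct
    n = len(golds)
    return qa / n, proof / n, full / n
```

When full accuracy equals proof accuracy, every example with a correct proof also has a correct answer, which is the consistency property noted above.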

A.4 Generalization to Higher Depths

A.5 Evaluation on Complex Language
In Table 8, we report the ParaRules validation set results of PROVER trained on the combination of DU3 and ParaRules training splits (following previous work (Clark et al., 2020)). ParaRules is created by first separating the fact groups (a fact group is the set of all facts in the theory concerning a particular person) and the rules from a theory, and then asking crowdworkers to paraphrase these in their own words. For example, a fact group "Alan is blue. Alan is rough. Alan is young.", may be reworded into "Alan is on the young side, but rough. He often feels rather blue.". Thus, unlike the previous datasets where the proof graphs are composed of facts and rules, ParaRules proofs are composed of fact groups and rules. PROVER obtains high QA and proof accuracy on complex human-paraphrased rule-bases, showing good generalization on such language. However, the proof accuracy again drops as the depth of the questions increases.
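The fact-group construction described above can be sketched as follows. This is a minimal illustration, assuming (as in the synthetic theories) that each fact begins with the person's name; it is not the dataset's actual preprocessing code:

```python
from collections import defaultdict

def group_facts_by_person(facts):
    """Group a theory's facts by the person they mention, mirroring
    ParaRules fact groups (sketch; assumes each fact starts with
    the person's name)."""
    groups = defaultdict(list)
    for fact in facts:
        person = fact.split()[0]  # "Alan is blue." -> "Alan"
        groups[person].append(fact)
    # Join each person's facts into a single fact-group string.
    return {p: " ".join(fs) for p, fs in groups.items()}
```

Each resulting fact-group string is what crowdworkers then paraphrase, and what serves as a single node in a ParaRules proof graph.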

A.6 Ablation Models and Simpler Baselines
We provide brief descriptions of our ablation models. We also experiment with simpler baselines for edge prediction, like training a Random Forest with lexical features (BLEU scores, length difference, word overlap, etc.), and this obtains a much lower edge accuracy of 47%. This fails primarily because (1) proof graphs can contain NAF, which accounts for 9% of the data, and edges from it cannot be learned without learning a latent representation; (2) overlap features are mostly symmetric and hence are not enough for learning directionality; (3) there is a lack of overall context information.
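A sketch of such lexical features for a candidate edge is shown below (illustrative, not the exact feature set used by the baseline). Note that the overlap-based features are symmetric in the sentence pair, which concretely illustrates point (2): they cannot distinguish an edge from its reverse.

```python
def edge_features(src, dst):
    """Lexical features for a candidate proof edge src -> dst
    (sketch; the actual baseline's feature set may differ)."""
    s = set(src.lower().split())
    d = set(dst.lower().split())
    overlap = len(s & d)
    return [
        overlap,                              # word overlap (symmetric)
        len(src.split()) - len(dst.split()),  # length difference
        overlap / max(len(s | d), 1),         # Jaccard similarity (symmetric)
    ]
```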

A.7 Critical Sentence Identification
Clark et al. (2020) provide an initial solution towards generating explanations for the predicted answers by using a post-hoc method: they remove each fact or rule from the theory and check if the predicted answer changes with the new theory. They define all such rules and facts which flip the answer as critical sentences. If an example has multiple gold proofs, a critical sentence is one which is present in all the proofs. We argue that this leave-one-out analysis is not ideal for multiple reasons: (1) it does not work if the theory has negations; (2) it only predicts the presence or absence of rules and facts, and does not look at the entire chain of reasoning, which our model achieves through the edge module. In our final experiment, we still apply the leave-one-out strategy on the no-negation subset of the DU5 test set for a direct comparison with RuleTakers. As shown in Table 9, our model identifies the exact critical sentences in an example in 78% of the cases, a 4% improvement over RuleTakers.
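The leave-one-out procedure above can be sketched in a few lines. Here `predict(theory, question)` is an assumed black-box QA function returning the model's True/False answer; the sketch simply ablates one sentence at a time and collects those whose removal flips the answer:

```python
def critical_sentences(sentences, question, predict):
    """Leave-one-out critical sentence identification, following
    Clark et al. (2020). `predict` is an assumed black-box model
    callable: predict(list_of_sentences, question) -> bool."""
    base = predict(sentences, question)
    critical = []
    for i, sent in enumerate(sentences):
        ablated = sentences[:i] + sentences[i + 1:]
        if predict(ablated, question) != base:  # removal flips the answer
            critical.append(sent)
    return critical
```

This requires one model call per sentence in the theory, and, as argued above, yields only a set of sentences rather than a chain of reasoning.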

A.8 Proofs Generated by PROVER
In Figure 5, we show two rule-bases, one about electric circuits and another about birds, from the Birds-Electricity dataset. PROVER not only answers the questions correctly but also generates the proofs accurately. These proofs are complex because of the presence of NAF and also the long chains of reasoning needed in the inference process. Figure 6 shows three more accurate proofs generated by PROVER for three questions from the DU5 dataset.
Facts: F1: The circuit has the battery. F2: The switch is on. F3: The circuit has the bell.
Rules: R1: If the circuit has the battery then the circuit is powered. R2: If the circuit does not have the battery then the circuit is dead. R3: If the circuit is dead then the bell is not ringing. R4: If the circuit is dead then the radio is not playing. R5: If the circuit is dead then the light bulb is not glowing. R6: If the circuit has the switch and the switch is on then the circuit is complete. R7: If the circuit does not have the switch then the circuit is complete. R8: If the circuit is powered and the circuit is complete then the current runs through the circuit. R9: If the current runs through the circuit and the circuit has the light bulb then the light bulb is glowing. R10: If the current runs through the circuit and the circuit has the bell then the bell is ringing. R11: If the current runs through the circuit and the circuit has the radio then the radio is playing.
Q1: The current runs through the circuit. [Answer: T]
Facts: F1: Arthur is a bird. F2: Arthur is not wounded. F3: Bill is an ostrich. F4: Colin is a bird. F5: Colin is wounded. F6: Dave is not an ostrich. F7: Dave is wounded.
Rules: R1: If someone is an ostrich then they are a bird. R2: If someone is an ostrich then they are abnormal. R3: If someone is an ostrich then they cannot fly. R4: If someone is a bird and wounded then they are abnormal. R5: If someone is wounded then they cannot fly. R6: If someone is a bird and not abnormal then they can fly.
Figure 5: Examples of proofs generated by PROVER for four questions on two rule-bases about electric circuits and birds from the Birds-Electricity dataset. PROVER not only answers the questions correctly but also accurately predicts the long reasoning chains with multiple branches.
Facts: F1: The bear visits the tiger. F2: The cat is kind. F3: The mouse is green. F4: The mouse is kind. F5: The mouse sees the tiger. F6: The tiger is rough. F7: The tiger visits the cat.
Rules: R1: If something visits the bear then it sees the bear. R2: If something sees the bear then the bear likes the cat. R3: If something visits the cat then the cat visits the bear. R4: If something sees the bear and the bear likes the cat then it is cold. R5: Cold things are rough. R6: If something is green and it likes the tiger then the tiger visits the mouse.
Figure 6: Examples of proofs generated by PROVER for three questions on a rule-base from the DU5 dataset. The proof corresponding to the last question is a failed case.