Autoregressive Reasoning over Chains of Facts with Transformers

This paper proposes an iterative inference algorithm for multi-hop explanation regeneration that retrieves relevant factual evidence, in the form of text snippets, given a natural language question and its answer. Combining multiple sources of evidence (facts) for multi-hop reasoning becomes increasingly hard as the number of sources needed to make an inference grows. Our algorithm copes with this by decomposing the selection of facts from a corpus autoregressively, conditioning the next iteration on previously selected facts. This also allows us to use a pairwise learning-to-rank loss. We validate our method on the datasets of the TextGraphs 2019 and 2020 Shared Tasks for explanation regeneration. Existing work on this task either evaluates facts in isolation or artificially limits the possible chains of facts, thus constraining multi-hop inference. We demonstrate that our algorithm, when used with a pre-trained transformer model, outperforms the previous state of the art in terms of precision, training time and inference efficiency.


Introduction
The task of multi-hop explanation regeneration has recently received interest as it could be a stepping stone towards general multi-hop inference over language. Multi-hop reasoning requires algorithms to combine multiple sources of evidence. This becomes increasingly hard as the number of required facts for an inference grows, because of the exploding number of combinations and phenomena such as semantic drift (Fried et al., 2015; Jansen, 2018). The WorldTree dataset was designed specifically for (>2)-fact inference (Xie et al., 2020): it consists of elementary science exam questions that can be explained by an average of 6 facts from a complementary dataset of textual facts. The explanation regeneration task as in the TextGraphs Shared Tasks (Jansen and Ustalov, 2019; Jansen and Ustalov, 2020) asks participants to retrieve and rank relevant facts (given one of these natural language questions and its answer as input) such that the top-ranked facts explain the answer to the question. An example is shown in the upper left part of fig. 1 (and more in appendix A).
As each question-answer pair potentially has many supporting facts, evaluating all combinations of facts is computationally prohibitive. Previous work remedies this by computing scores for facts in isolation, or by severely limiting the number of combinations of facts (Das et al., 2019;Banerjee, 2019;Chia et al., 2019). The latter is done by considering combinations of few facts only and/or by reranking combinations of only the top retrieved facts by a simpler method. Both approaches limit multi-hop reasoning as facts are not combined or too many facts are ignored by the simple first-stage retriever.
In this paper we propose a method to retrieve facts that does build long chains of facts, while being efficient. During training, a pre-trained neural language model encodes the question-answer pair as well as a randomly selected combination of ground-truth facts, before evaluating candidate facts. For efficiency, only a set of neighborhood facts (which we call 'visible' facts) is considered. This set is obtained by selecting the nearest facts by tf-idf distance to the question, answer and set of ground truth facts. During inference, facts are ranked iteratively: in each iteration, the highest scoring fact is selected and encoded together with previously selected facts and the question-answer pair, so the next ranking is always conditioned on the current set of chosen facts. At each inference step, the set of visible facts consists of the nearest facts to the already selected facts and question-answer pair. This autoregressive formulation and the definition of neighborhoods together enable the use of losses that incorporate interactions between different facts, like the pairwise RankNet loss from the learning-to-rank literature (Burges et al., 2005).
Most methods in earlier work use some form of rank-rerank set-up, but none dynamically expands the set of reranked facts, and the reranked set is usually too small (Das et al., 2019; Chia et al., 2019; Banerjee, 2019). Banerjee (2019) and Chia et al. (2019) use iterative schemes, but the former only reconsiders the top 15 initially retrieved facts and the latter uses a term-frequency-based approach. Das et al. (2019) consider chains of facts during training and inference, but only up to length 2, because of the explosive number of combinations. All of the above systems use pointwise losses only. While components of our method are inspired by previous work, we are the first to put them together in a principled way, so that they enable each other. Our neighborhoods enable both our iterative inference procedure, which efficiently builds chains of up to 10 facts, and the use of more informative losses. Our training is designed to support iterative inference and to close the gap between training and inference as much as possible.

Contributions. (1) By defining dynamically growing neighborhoods of facts for our model to operate in, we limit computational cost without severely limiting the range of our method; (2) we define a new autoregressive training and inference method to evaluate facts within the context of other facts; (3) we apply a learning-to-rank loss that successfully exploits interactions between facts, leading to an improvement in MAP score over previous baselines.

Related Work
Explanation regeneration was introduced as a TextGraphs 2019 Shared Task (Jansen and Ustalov, 2019). We briefly summarize the systems proposed in 2019. Most methods finetuned a pre-trained transformer like BERT (Devlin et al., 2019) in a learning-to-rank set-up, as a reranker with the question and answer as query and facts as documents, like Nogueira and Cho (2019). Das et al. (2019) use BERT to classify chains of 2 facts, drawn from facts initially retrieved by a tf-idf retriever and from facts with words in common with those facts. Their idea is similar to ours, but due to the high number of combinations, they are limited to explanations of only 2 facts. As a second approach, they use BERT like Nogueira and Cho (2019) to rank single facts, but Das et al. simply rank all facts instead of the top-T initially retrieved ones, which is computationally expensive for the larger 2020 dataset. Chia et al. (2019) use an iterative scheme where the tf-idf representation of the question is aggregated with the tf-idf representations of previously selected facts for retrieval. They compare this to a BERT-based approach as described above. In contrast, the selected facts and the question in our method are encoded as one paragraph with BERT in a trainable way, which allows more complex relations to be learnt. Banerjee (2019) adds gold facts as context during training and scores facts individually with BERT during inference. The top facts are reranked by iterative rescoring, which only uses the originally computed scores and precomputed (hence untrainable) sentence embeddings. D'Souza et al. (2019) propose a pairwise learning-to-rank approach with SVMs. Das et al. (2018) propose a retriever-reader system that iteratively retrieves and 'reads' relevant paragraphs from a large text corpus for open-domain QA. High-level analogies can be drawn between their system and ours, but their architecture involves separate query and document encoders, a recurrent reasoner and a specialized reading comprehension model. Their system is trained end-to-end for QA.
In conclusion, our work has similarity to the widely used retrieval-reranking paradigm, but the initially retrieved set of facts is dynamically extended based on selected facts. Our training and inference method to evaluate facts within the context of other facts successfully models interactions between different facts, framing the task more as a reading comprehension task.

Figure 1: An overview of ARCF during inference. In the depicted example, the query is "Q: An animal has six legs. What is it most likely to be? A: A fly"; the already selected facts are f_1 "A fly is a kind of insect" and f_2 "An insect has six legs"; all visible candidate facts f_c are scored and the candidate with the highest score is added as f_3. The computed score represents P(f_3 | f_{1,2}, q).

Proposed Approach
This section describes our proposed method, which we call 'Autoregressive Reasoning over Chains of Facts' (ARCF). ARCF consists of an initial retrieval component described in §3.3, a learning-to-rank training scheme and an iterative inference procedure (both in §3.4, fig. 1 shows the latter schematically).

Task Description
Given is a dataset D_f of facts in text form (of size |D_f|). Each fact f consists of word tokens: f = [w_1, w_2, ..., w_{Z_f}]. Further given is a training, validation and test dataset D_qa containing multiple-choice questions (of size |D_qa|). Each question is concatenated with its answer options: typically 4 multiple-choice answers, with the correct one marked. Question and answer(s) together form a query q = [w_1, w_2, ..., w_{Z_q}]. All queries in D_qa can be explained by 1 to 22 gold facts in D_f. Gold facts for a question are marked as 'grounding', 'central' or 'lexical glue', depending on their role in explaining the question: central facts are key in explaining the question, lexical glue links facts together (e.g., by synonymy or taxonomy relations) and grounding facts connect facts to the question.
The task for a given q ∈ D_qa is to rank all facts in D_f, with gold facts ranked higher than irrelevant facts. The answer in this context is the answer to a question, always encoded together with the question in q, and not the output target of the task. When we write 'query', we mean an instance of q, i.e., a question concatenated with its answer. The output target is the set of gold facts f*_{1,...,G} for a q. The mean average precision (MAP) of the gold facts in the computed ranking is used as evaluation metric. We use the notation f_{1,...,N} for an intermediate set of facts, f_{1,...,M} for a completed set and f*_{1,...,G} for the gold set of a q, with G the number of gold facts. We call the concatenation of a query with a number of facts a prefix: p = [q | f_{1,...,N}]. Appendix A shows example q's and gold facts.
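To make the evaluation metric concrete, the following is a minimal sketch (not the official shared-task scorer) of how MAP over gold facts can be computed for predicted rankings; the function and variable names are illustrative.

```python
def average_precision(ranked_fact_ids, gold_fact_ids):
    """Average precision of the gold facts within one predicted ranking."""
    gold = set(gold_fact_ids)
    hits, precision_sum = 0, 0.0
    for rank, fact_id in enumerate(ranked_fact_ids, start=1):
        if fact_id in gold:
            hits += 1
            precision_sum += hits / rank   # precision at this recall point
    return precision_sum / len(gold) if gold else 0.0

def mean_average_precision(rankings, gold_sets):
    """MAP over all queries; `rankings` and `gold_sets` are parallel lists."""
    return sum(average_precision(r, g) for r, g in zip(rankings, gold_sets)) / len(rankings)
```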

Model
We use a neural language encoder to compute a function f_θ : V^Z → ℝ. The input is the concatenation of a query and a number of facts, [q | f_{1,...,N}]. The output is a scalar score s, indicating how well the last of the concatenated facts (the candidate fact) fits in the explanation for the query. The process is iterated: a chosen fact is appended to f_{1,...,N} and a new score is computed. We use pre-trained transformer models such as BERT or RoBERTa (Devlin et al., 2019; Liu et al., 2019; Vaswani et al., 2017), because the task dataset is relatively small and this allows the reasoning to incorporate external knowledge.
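As a minimal sketch of such a scorer, assuming a Hugging Face Transformers encoder whose first-token embedding is projected to a scalar with a linear layer (the class, variable names and example inputs are illustrative, not the authors' exact implementation):

```python
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class FactScorer(nn.Module):
    """Scores how well the last (candidate) fact fits the explanation built so far."""

    def __init__(self, model_name="distilroberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.score_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        first_token = out.last_hidden_state[:, 0]        # CLS / <s> embedding
        return self.score_head(first_token).squeeze(-1)  # one scalar score per input

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
scorer = FactScorer()
# query + already selected facts as the first segment, candidate fact as the second
enc = tokenizer(
    "When does water start boiling? (answer) At 100 C. (explanation) "
    "boiling means changing from a liquid into a gas by adding heat energy",
    "water boils at 100 C",
    return_tensors="pt", truncation=True,
)
with torch.no_grad():
    score = scorer(enc["input_ids"], enc["attention_mask"])
```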

Fact-Question Neighborhoods
Neighborhoods. The first step of the method consists of computing neighborhoods of visible facts for each question and corresponding set of facts, denoted by vis_k : D_f ∪ D_qa → D_f^K. To retain tractability, only facts from the neighborhood vis_k(·) will later be ranked, while facts outside vis_k(·) are out of consideration. In contrast to classic rank-(neural-)rerank approaches, our neighborhood (the initially retrieved set that will be reranked) expands dynamically. Pairwise distances between all facts and between questions and facts are precomputed. A fact f_c is visible from [q | f_{1,...,N}] if it is one of the k nearest facts either to q or to one of the f_{1,...,N} (denoted as nearest_k). We use K_N ≤ (N + 1) · k to refer to the cardinality of vis_k(·). Formally:

vis_k(q, f_{1,...,N}) = ∪_{x ∈ {q, f_1, ..., f_N}} nearest_k(x).
For a given question and (possibly empty) set of facts, the algorithm should be able to retrieve the set of visible facts, but it is agnostic to how this set or the distances are computed. Given the limited size of D_f and D_qa, if we use an inexpensive distance metric, the overhead of computing all pairwise distances is small. For larger datasets or for distances that are expensive to compute, the overhead might not be negligible; approximate nearest-neighbor methods could then be used.

Distances. We tried computing the distances as the distance between tf-idf vectors, as Word Mover's Distance (Kusner et al., 2015), as the distance between sentence embeddings computed by a pre-trained BERT or Sentence-BERT (Reimers and Gurevych, 2019), and as the reciprocal of the number of words in common (lexical overlap). We compared the distance metrics by the fraction of gold facts, for a given k, that could be reached in an unlimited number of 'hops' via gold facts from vis_k(·), starting from a q ∈ D_qa. We only computed these fractions on the training data to prevent test set leakage. We found tf-idf to work best for all k: table 1 shows some indicative ratios (mean over q ∈ D_qa). That tf-idf works best here can be explained by the dataset being well curated, with uniform terminology across facts. Das et al. (2019) construct a connectivity graph between facts, which they use to extend the set of initially retrieved facts by tf-idf. They use lexical overlap as the criterion for being connected, resulting in a dense graph and an explosive number of chains. In contrast, the parameter k in our definition of vis_k allows for easy and precise control of the size of the neighborhoods.
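A sketch of how the tf-idf neighborhoods and vis_k could be precomputed with scikit-learn (stemming and stopword preprocessing are simplified, and names are illustrative rather than the authors' code):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def build_neighborhoods(facts, queries, k):
    """Precompute the k nearest facts (tf-idf cosine) to every query and to every fact."""
    vectorizer = TfidfVectorizer(stop_words="english")
    fact_vecs = vectorizer.fit_transform(facts)        # |D_f| x V, L2-normalised rows
    query_vecs = vectorizer.transform(queries)         # |D_qa| x V
    # dot products of normalised tf-idf vectors = cosine similarities;
    # for much larger corpora an approximate nearest-neighbour index could replace this
    sim_qf = (query_vecs @ fact_vecs.T).toarray()
    sim_ff = (fact_vecs @ fact_vecs.T).toarray()
    nearest_from_query = np.argsort(-sim_qf, axis=1)[:, :k]
    nearest_from_fact = np.argsort(-sim_ff, axis=1)[:, :k]   # a fact's own index may appear
    return nearest_from_query, nearest_from_fact

def visible_facts(q_idx, selected_fact_idxs, nearest_from_query, nearest_from_fact):
    """vis_k(q, f_1..N): union of the k nearest facts to q and to each selected fact."""
    visible = set(nearest_from_query[q_idx])
    for f_idx in selected_fact_idxs:
        visible.update(nearest_from_fact[f_idx])
    return visible
```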

Autoregressive Ranking of Candidate Facts
We propose to decompose the conditional probability distribution over rankings of facts autoregressively:

P(f_1, ..., f_M | q) = ∏_{i=1}^{M} P(f_i | f_{1,...,i−1}, q),    (1)

where f_1 is the highest ranked fact. At each iteration, we compute P(f_i | f_{1,...,i−1}, q) for all visible candidate facts f_i, and we select the fact with the highest probability to be ranked at position i. The unnormalized probability for P(f_i | f_{1,...,i−1}, q) is computed by our scoring function as f_θ([q | f_{1,...,i−1} | f_i]). For the task at hand, only the interclass order in the produced ranking is relevant: relevant facts should be ranked higher than irrelevant facts. The intraclass order, i.e., the relative order among relevant facts and among irrelevant facts, does not affect the MAP (§3.1). Eq. 1 can thus be understood as decomposing the joint selection of a set of facts into selecting one fact at a time, conditioned on previously selected facts. Scoring all combinations of facts is infeasible, while scoring facts independently is too simplistic; the decomposition aims to strike a balance between the two. Conditioning the selection of facts on previously selected facts brings several advantages compared to scoring facts independently as Nogueira and Cho (2019) do. First, it enables the incremental building of chains of reasoning. The role of many facts in explaining a question is not immediately apparent when they are looked at in isolation, and only becomes evident when they are considered as part of a larger explanation. Consider the question: "George warms his hands by rubbing them. Which skin surface produces the most heat? (A) dry palms". The relevance of "1: as moisture of an object decreases, the friction of that object against another object increases" is clearer when "2: friction causes the temperature of an object to increase" is also known. Without the latter, someone without world knowledge might, for instance, regard facts about any other physical property that varies with moisture level as equally relevant to the question as the former, while they are not.
Second, by processing multiple facts at once, we can leverage BERT as a reading comprehension model, rather than as a retrieval model. The task requires not merely retrieving facts that seem relevant, like a search-engine would, but gathering a set of facts from which the answer to a simple science question can be inferred. That clearly requires more reasoning than a traditional retrieval task. Research has shown that pre-trained transformers are able to infer knowledge from paragraphs of text, which is why they are more suited to handle this formulation of the task (Liu et al., 2019;Clark et al., 2019).
Since we only need to be able to retrieve the facts with the highest probabilities, we can avoid computing normalized probabilities and instead compute scores (i.e., unnormalized probabilities). During training (next §) we do compute probabilities, in order to compute losses, but only over subsets of facts.

Training
Samples. The training input is encoded as x = [q | f*_{1,...,N} | f_c], with q a query (question and answer), f*_{1,...,N} a set of gold facts, and f_c ∈ vis_k(q, f*_{1,...,N}) a candidate fact, which can be positive or negative, drawn from the visible neighborhood of [q | f*_{1,...,N}]. We train our scoring function f_θ with stochastic gradient descent to output high scores for positive candidate facts and low scores for negative candidate facts.
Training samples for one q are constructed by first uniformly sampling a number N ≤ G of gold facts f*_{1,...,N} from q's full set of gold facts. N is itself uniformly sampled: N ∼ U(0, G). The query and gold facts are concatenated into a prefix p = [q | f*_{1,...,N}]. For one given prefix, positive training samples x_p are constructed by concatenating p with each of the remaining visible gold facts. Negative training samples x_n are constructed by concatenating the same p with a number of uniformly sampled visible negative facts: we either use all visible negatives, or sample negatives until the minibatch is full. The process is repeated: multiple prefixes are constructed for every query in D_qa, and every prefix is appended with multiple visible gold and negative facts. We construct multiple prefixes per q so that we have multiple training samples per q and so that the model is trained with different explanation lengths.
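The prefix and sample construction described above could look roughly as follows (a sketch; `visible_fn`, the marker string and the sampling details are simplifications, not the authors' exact implementation):

```python
import random

def make_training_samples(q_text, gold_facts, visible_fn, num_negatives):
    """Builds one prefix and its positive/negative candidate samples for a query q.

    gold_facts: the gold fact texts for q.
    visible_fn(query, prefix_facts): returns the visible (neighborhood) fact texts.
    """
    # N ~ U(0, G): how many gold facts go into the prefix.
    n = random.randint(0, len(gold_facts))
    prefix_gold = random.sample(gold_facts, n)
    prefix = q_text + " (explanation) " + " ".join(prefix_gold)

    visible = set(visible_fn(q_text, prefix_gold))
    remaining_gold = [f for f in gold_facts if f not in prefix_gold and f in visible]
    visible_negatives = [f for f in visible if f not in gold_facts]
    negatives = random.sample(visible_negatives, min(num_negatives, len(visible_negatives)))

    positives = [(prefix, f) for f in remaining_gold]        # x_p samples
    negative_samples = [(prefix, f) for f in negatives]      # x_n samples
    return prefix, positives, negative_samples
```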
The prefix itself serves as a sample as well: the model is trained to score p highest when no more visible gold facts remain, i.e., when f*_{1,...,N} contains all gold facts or when the remaining gold facts are not visible. During inference, p getting the highest score in an iteration is a stopping condition: the gathered set of facts is then considered complete.
Losses. Because classical maximum likelihood training for eq. 1 would only allow backpropagating a loss after all candidate facts have been considered, i.e., after K_N forward passes, we resort to different loss functions. A simple loss is the pointwise binary cross-entropy loss (bXENT), which considers each input example x individually and trains the model to correctly classify the candidate f_c as relevant or irrelevant. We propose to use the pairwise RankNet loss (Burges et al., 2005):

L_RankNet(x_p, x_n; θ) = −log σ(f_θ(x_p) − f_θ(x_n)),    (2)

where σ is the logistic sigmoid function, and x_p and x_n are samples in which f_c is a positive and a negative fact, respectively. This loss is shown by Chen et al. (2009) to maximize a lower bound on the MAP. To further amplify between-fact interactions in the gradient, we also use the conditional ranking variant of Noise-Contrastive Estimation (NCE), which covers more than 2 facts at once (Ma and Collins, 2018; Gutmann and Hyvärinen, 2010):

L_NCE(x_1, ..., x_B; θ) = −log [ (exp(f_θ(x_1)) / P_n(x_1)) / (Σ_{j=1}^{B} exp(f_θ(x_j)) / P_n(x_j)) ],    (3)

where x_1 is positive and x_{>1} are negative, B is the batch size, and P_n is a negative sampling distribution over candidates: a uniform distribution over vis_k(p). This loss has been used for training word embeddings as a more efficient approximation to the negative log-likelihood loss (Mnih and Kavukcuoglu, 2013; Mikolov et al., 2013). When training with NCE and RankNet, samples in one batch share a common prefix p and only differ in their candidate fact f_c, so that all samples x in one loss term L(·; θ) (eqs. 2-3) differ only in f_c and hence the model is trained to score candidate facts rather than prefixes.
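Sketches of the two losses, assuming score tensors produced by the model for candidates that share the same prefix; with a uniform P_n the ranking NCE reduces to a softmax cross-entropy over the batch. This is an illustration under those assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn.functional as F

def ranknet_loss(pos_scores, neg_scores):
    """Pairwise RankNet loss, eq. (2): -log sigmoid(s_pos - s_neg),
    averaged over all positive/negative pairs that share the same prefix."""
    diff = pos_scores.unsqueeze(1) - neg_scores.unsqueeze(0)   # all (pos, neg) pairs
    return F.logsigmoid(diff).neg().mean()

def nce_ranking_loss(pos_score, neg_scores):
    """Ranking NCE, eq. (3), with a uniform P_n: the P_n terms cancel and the loss
    becomes a softmax cross-entropy of the positive against the in-batch negatives."""
    logits = torch.cat([pos_score.view(1), neg_scores])
    target = torch.zeros(1, dtype=torch.long, device=logits.device)  # positive at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)
```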
The proposed training scheme, in which a model is trained to predict the next gold element conditioned on previous gold elements, is reminiscent of training text generation models with teacher forcing. A known weakness of teacher forcing is exposure bias (Ranzato et al., 2016): models are conditioned on ground-truth data during training, as opposed to on their own outputs during inference. ARCF exhibits this discrepancy as well, which is why, during training, we try replacing a uniformly sampled fraction of the gold facts f*_{1,...,N} in the prefix with uniformly sampled negative facts from vis_k(p). This feature is called 'CN' (conditioned on negatives) later. The model is thus trained to be more robust to mistakes it makes during inference. A similar technique was proposed for text generation as scheduled sampling (Bengio et al., 2015).
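The CN prefix corruption could be implemented along these lines (a sketch; the 0.3 upper bound follows appendix B, the rest is illustrative):

```python
import random

def corrupt_prefix(prefix_gold_facts, visible_negatives, max_fraction=0.3):
    """'CN': replace a uniformly sampled fraction of the prefix's gold facts with
    visible negative facts, so the model also sees imperfect prefixes during training."""
    if not prefix_gold_facts or not visible_negatives:
        return list(prefix_gold_facts)
    fraction = random.uniform(0.0, max_fraction)
    n_replace = round(fraction * len(prefix_gold_facts))
    corrupted = list(prefix_gold_facts)
    for i in random.sample(range(len(corrupted)), n_replace):
        corrupted[i] = random.choice(visible_negatives)
    return corrupted
```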

Inference
At inference time, we incrementally build an explanation, i.e., a set of facts. The input follows the same encoding format, x = [q | f_{1,...,N} | f_c], where f_{1,...,N} are previously selected facts (and not gold facts as in training). At each iteration, we use the query q concatenated with the already selected facts f_{1,...,N} as the updated retrieval query, and rank all other visible facts f_c ∈ vis_k(q, f_{1,...,N}). The highest-scoring fact is appended to the query for the next iteration. Note that the set of visible facts is extended with the neighborhood of the selected fact in each iteration. The set of selected facts is considered complete when its cardinality N equals L or when the highest-scoring sample is the sample without a candidate appended (see previous §). Multiple rankings are thus made, one with each intermediate set of facts and the question as retrieval query. Algorithm 1 shows the procedure in pseudocode.
Algorithm 1: Inference procedure for one q.

When the stopping condition is met, we end up with an explanation p̂ = [q | f_{1,...,M}]. To convert the result to a ranking, f_{1,...,M} are ranked highest. Facts that were considered as candidate facts but not selected are ranked next; their relative order is determined by their scores in the last iteration. Finally, all facts that were never considered are ranked by a simple metric such as tf-idf distance from the computed explanation p̂. We experimented with beam search procedures, where the beams were intermediate sets of facts; this did not improve performance on the validation set, so we do not consider it further.
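As an illustration of the inference procedure, a sketch consistent with the description of Algorithm 1 above (stopping when the prefix scores highest or when L facts have been selected, with a minimum of L_min iterations); `scorer` and `visible_fn` are placeholder interfaces, not the authors' code:

```python
def arcf_inference(q_text, facts, scorer, visible_fn, L=9, L_min=3):
    """Iteratively selects facts for one query q and returns a ranked list of fact indices."""
    selected = []        # indices of selected facts, in selection order
    last_scores = {}     # most recent score of every candidate that was ever scored
    for _ in range(L):
        context = [facts[j] for j in selected]
        candidates = [i for i in visible_fn(q_text, context) if i not in selected]
        if not candidates:
            break
        cand_scores = {i: scorer(q_text, context, facts[i]) for i in candidates}
        last_scores.update(cand_scores)
        best = max(cand_scores, key=cand_scores.get)
        prefix_score = scorer(q_text, context, None)   # score of the prefix alone (stop option)
        if prefix_score > cand_scores[best] and len(selected) >= L_min:
            break                                      # explanation considered complete
        selected.append(best)
    # Ranking: selected facts first, then scored-but-unselected facts by their last score ('S2');
    # facts that were never scored would be appended by tf-idf distance to the explanation ('R3').
    rest = sorted((i for i in last_scores if i not in selected),
                  key=lambda i: last_scores[i], reverse=True)
    return selected + rest
```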
This iterative fact selection bears similarity to how token-per-token text generation is usually performed with neural networks. However, instead of computing a probability distribution over the vocabulary in one forward pass, our procedure requires a forward pass per score, i.e., per fact considered. To keep the required resources for inference reasonable, only neighborhoods of facts are considered instead of all facts, reducing the number of forward passes in one iteration from |D_f| = 9727 to K_N. Inference for a single q requires L + Σ_{l=1}^{L} K_{l−1} forward passes; since K_l ≤ (l + 1) · k (§3.3), this is at most L + Σ_{l=1}^{L} l · k = L + (L(L + 1)/2) · k = O(L²k), with L a chosen maximum number of iterations and k the neighborhood size. Parameter k controls the trade-off between completeness and efficiency.

Data & Preprocessing
In the 2020 version of the task, D_f contains 9727 facts, and D_qa^train, D_qa^val and D_qa^test contain 2206, 496 and 1664 questions respectively (Xie et al., 2020). The dataset has been extended w.r.t. the 2019 version of the task. For completeness, we also include results obtained with the baselines and our models on the 2019 data. The 2019 data includes 902, 214 and 541 questions for training, validation and testing respectively, along with 4950 facts. We remove incorrect answers from q (like Das et al. (2019)), mark the correct answer with "(answer)" and mark the start of the gold facts with "(explanation)". A tokenized input sample could look like: "[START] When does water start boiling? (answer) At 100 °C. (explanation) This is a gold fact. This is another gold fact. [SEP] This is a candidate fact that is gold or not [SEP]". Examples of q ∈ D_qa and their explaining facts are shown in Appendix A.

Baselines
As a simple baseline, we include a tf-idf vector retrieval model that ranks facts by the cosine similarity between their tf-idf representation and that of the q. We stem facts and q, and remove stopwords, before computing tf-idf vectors. The single-fact baseline concatenates a q and a single candidate fact, and computes a relevance score for every f_c ∈ D_f by encoding the concatenation [q | f_c] with BERT and projecting the final layer's CLS-token embedding to a scalar with a linear layer (Das et al., 2019; Nogueira and Cho, 2019). The model is trained with binary cross-entropy (relevant or not).
The highest score in the 2019 competition was obtained by an ensemble of the baselines single-fact and path-rerank (Das et al., 2019). The path-rerank model ranks facts for a q by first retrieving the top-T facts with the tf-idf retriever described above. This initial top-T set is extended with all facts that have at least one word in common with one of the top-T facts. Next, all combinations of up to C = 2 facts are taken from this extended set. A relevance score is computed for each combination (chain) in the same way as in single-fact or our models: concatenate q and the C facts, encode the concatenation with BERT and project the final CLS-token embedding to a score s with a linear layer. A fact's relevance score is the maximum score of any chain it appears in. The binary cross-entropy loss is used for training the model. We use the implementation of Das et al. (2019) for the single-fact and path-rerank baselines.
Complexities. Das et al. (2019) used single-fact and path-rerank for the smaller 2019 dataset, with fewer facts and fewer q. They already noted that single-fact does not scale to a large corpus of facts: for the 2020 data, |D_f| ≈ 10K forward passes are required to solve a single q during inference. One epoch of training with bXENT consists of 21M samples (|D_qa^train| · |D_f|; trained for 3 epochs). The path-rerank model uses T = 25 for training, which generates 7k chains of facts per q, resulting in 16M training samples (trained for 1 epoch). Using T = 50 during inference results in 16k chains (hence forward passes) per q. This number can be controlled by setting T, but setting T too small would leave too many relevant facts out of consideration. In contrast, the neighborhood in our method is less restrictive, as it depends on the selected facts and thus expands progressively as more facts are selected.
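For reference, a quick check of these counts using the dataset sizes given in the Data & Preprocessing section (the arithmetic is not stated explicitly in the text):

\[
|D^{\text{train}}_{qa}| \cdot |D_f| = 2206 \times 9727 \approx 2.1 \times 10^{7}, \qquad 2206 \times 7000 \approx 1.5 \times 10^{7}.
\]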

Experiments
We implemented our algorithm and the baselines using PyTorch and the Transformers library (Paszke et al., 2019; Wolf et al., 2019). The tf-idf baseline was implemented with SciKit Learn (Pedregosa et al., 2011). To keep comparisons fair, all results on 2019 data (baselines and ARCF) are obtained by finetuning the publicly available pre-trained bert-base-uncased (since this model is used in Das et al. (2019)). To reduce resource usage, all models on 2020 data were finetuned from the smaller pre-trained distilroberta-base. It can reasonably be expected that using bigger or more advanced pre-trained models would further improve results. We ran experiments on a single 16GB Nvidia Tesla P100 GPU. We used the Adam optimizer (Kingma and Ba, 2015), with learning rate 2e-5 and linear LR decay. We append samples to minibatches until they reach 5000 tokens. An overview of the used hyperparameters can be found in appendix B. For training we set the neighborhood size k = 180 (L only impacts inference); for inference we set the maximum and minimum number of iterations L = 9 and L_min = 3, and k = 290. Some hyperparameters were taken from Das et al. (2019), while others were tuned manually.

Footnote 7: It is worth noting that, technically, ARCF can run without the correct answer marked and without the incorrect answers removed, although a performance drop on the explanation regeneration task can be expected.
Footnote 8: Code available at https://github.com/ameyagodbole/multihop_inference_explanation_regeneration.
Footnote 9: Our code and trained models are publicly available at https://github.com/rubencart/LIIR-TextGraphs-14.
Footnote 10: When we finetuned bert-base-uncased on the 2020 data or distilroberta-base on the 2019 data, we obtained similar results.

In the remainder of this section, ARCF denotes our proposed method, SF refers to the single-fact baseline, PR is the path-rerank baseline, CN means 'conditioned on negatives', S2 means 'rank scored but not selected facts 2nd', and R3 means 'rank rest 3rd' (as opposed to omitting them altogether, see §3.4).

Results and Discussion
Test set results. Tables 2a and 2c show results on the hidden 2019 and 2020 test sets, total training time and inference time per sample, for the baselines and ARCF. The test scores are obtained by the models that achieved the highest validation score over 5 training runs with different random seeds, while the training times are averaged over these 5 runs. As can be seen, ARCF outperforms the baselines both in terms of obtained MAP and in efficiency. The highest 2020 MAP is 0.5815, which placed us second in the online competition. Appendix C shows examples of validation set questions and predicted facts.
On the 2020 test set, all our models obtain a higher MAP than all baselines. This is not the case on the 2019 data, which suggests that ARCF benefits more from additional training data. Including >2 facts in one loss term with NCE shows no benefit compared to the pairwise RankNet loss. Conditioning on negatives has no significant impact. Five 2020 test set evaluations of ARCF and SF show that the difference in scores is statistically significant: ARCF with RankNet scores higher on average than SF (1-tailed indep. t-test, p < 0.001).
Ablation study. As ARCF consists of several components, we perform an ablation study on the 2020 test and validation sets. Results are shown in table 2b. First, ARCF was trained with the pointwise bXENT loss (but still with prefixes and neighborhoods). The MAP drops, showing that pairwise information in the gradients improves learning w.r.t. pointwise information. Second, ARCF is compared to ARCF w SF train, which is trained like SF and uses algorithm 1 for inference. The large drop in MAP indicates that algorithm 1 only works for inference if the model is trained accordingly. Next, ARCF w SF inf is trained like ARCF and evaluated like SF. This result shows that ARCF training still improves performance w.r.t. SF even when facts are scored individually, and that algorithm 1 improves performance. ARCF w/o prefix, neighb is trained with the pairwise RankNet loss but without prefixes and neighborhoods: minibatches each contain 1 positive and B − 1 negatives that are uniformly sampled from all |D_f| − G negative facts, and evaluation is carried out as for SF. The score drops almost 10%, which emphasizes both the need for informative negatives when training with a pairwise loss (which our neighborhoods provide by returning nearby facts) and the importance of training with prefixes. It also confirms the gain of algorithm 1 for inference. Leaving facts that were never scored out of the ranking (w/o R3) has negligible impact. Also ignoring scored but unselected facts (w/o R3, S2) cuts performance by more than 10%: including them is thus essential. Visible facts are scored anyway, so including all of them in the ranking comes at no additional cost, except for sorting them once by their score. We also trained a randomly initialized version of distilroberta-base (instead of the pre-trained one). Although we did not extensively tune hyperparameters, the maximum obtained test MAP was about 25% of that of the pre-trained models.
Impact of neighborhood size k and maximum explanation length L. As tables 2d-e show, MAP increases with k and L, before flattening around k = 180 and L = 8. Inference time per sample increases approximately quadratically with L and linearly with k, which is in line with the number of forward passes for ARCF inference growing as O(L²k). Fig. 3 in appendix D also shows this. Table 2d shows that we could have taken k = 180 for inference, with virtually no loss in MAP and an almost 40% speedup, but the k = 290 we used showed better validation results (likewise for L = 6).
Performance on subsets of facts. Since the leaderboards only return a single test MAP value, we run a number of experiments on the smaller validation set; the results are therefore only indicative. Fig. 2a shows the number of q ∈ D_qa^val for different numbers of corresponding gold facts (G). Figs. 2b-c show MAP scores of ARCF (RankNet) vs. the SF baseline on subsets of facts as marked in the dataset. ARCF increases MAP on the Challenge subset by an absolute 10%. It significantly improves retrieval of facts of all roles, but most of all of lexical glue and grounding facts. These are facts that support central facts, which might be easier to detect when using the context of other facts, as our model does. This is a useful improvement, as the reasoning of explanations that contain lexical glue and grounding facts is easier to follow.
Figs. 2e-f show the MAP for facts that need an increasing number of hops to be reached (from q or from a fact to another fact, only via gold facts). In fig. 2e the hops are always to a fact in the current neighborhood vis_k(·), in fig. 2f to facts with lexical overlap, as in Das et al. (2019). The '∞' on the x-axis shows the precision for facts that cannot be reached from q in any number of hops (they can still be correctly retrieved by our method, but only via a negative fact). Surprisingly, the MAP of ARCF drops below that of SF for facts that are 3-4 neighborhood hops away. When lexical overlap hops are considered, ARCF performs better than the baseline for 'farther' facts. The precision values for figs. 2c-f are computed as by Jansen and Ustalov (2019): by first removing gold facts that have another role (or are not exactly h hops away) both from the gold set and from the predicted ranking, and then computing the MAP of the remaining predicted ranking w.r.t. the remaining gold set. Fig. 2d shows the MAP for q's with different numbers of gold facts G (the point at x = 2 shows the MAP on those q that have a total of G = 2 gold facts). ARCF gives a consistent improvement over the baseline for all G.

Conclusion
We have proposed a new method to retrieve relevant facts for an explanation regeneration task by iteratively evaluating candidate facts with respect to previously selected facts using a learning-to-rank approach. We have successfully evaluated our method on the TextGraphs 2019 and 2020 datasets and have performed several ablation experiments. We have analyzed the time complexity of our method and its performance on different subsets of facts. By selecting the nearest facts by tf-idf similarity, to the question as well as to already selected facts, only a subset of the facts is considered at each step, and ARCF outperforms previous state-of-the-art methods at higher efficiency.

Future work. Future work might expand on our approach by finding alternative methods to evaluate a fact w.r.t. other facts in an iterative inference procedure, or by designing better fact-question neighborhood methods. Additionally, the performance of our method when the correct answer to a question is not given could be evaluated; future work could then infer the answer from the retrieved facts in a downstream QA setting. Finally, one could improve the method by considering the removal of earlier chosen facts from the intermediate set of selected facts.
Appendix A Examples of TextGraphs-2020 data

Question 1
The influence of the Moon on the tides on Earth is greater than that of the Sun. Which best explains this? (answer) The Moon is closer to Earth than the Sun.
Fact 0 (central): the gravitational pull of the Moon on Earth's oceans causes the tides
Fact 1 (central): as distance from an object decreases, the pull of gravity on that object increases
Fact 2 (grounding): closer means lower; less; a decrease in distance
Fact 3 (grounding): a moon is a kind of celestial object; body
Fact 4 (central): the moon is the celestial object that is closest to the Earth
Fact 5 (grounding): the Sun is a kind of star
Fact 6 (grounding): a star is a kind of celestial object; celestial body
Fact 7 (central): the Moon is the celestial object that is closer to the Earth than the Sun
Fact 8 (grounding): Earth is a kind of planet
Fact 9 (grounding): a planet is a kind of celestial object; body
Fact 10 (lexglue): cause is similar to influence
Fact 11 (lexglue): gravity means gravitational pull; gravitational energy; gravitational force; gravitational attraction

Question 2
A student placed an ice cube on a plate in the sun. Ten minutes later, only water was on the plate. Which process caused the ice cube to change to water? (answer) melting.
Fact 0 (central): melting means matter; a substance changes from a solid into a liquid by increasing heat energy
Fact 1 (grounding): an ice cube is a kind of solid
Fact 2 (grounding): water is a kind of liquid at room temperature
Fact 3 (central): water is in the solid state, called ice, for temperatures between 0; -459; -273 and 273; 32; 0 K; F; C
Fact 4 (lexglue): heat means heat energy
Fact 5 (lexglue): adding heat means increasing temperature
Fact 6 (central): if an object; a substance; a location absorbs solar energy then that object; that substance will increase in temperature
Fact 7 (central): if an object; something is in the sunlight then that object; that something will absorb solar energy
Fact 8 (central): the sun is a source of light; light energy called sunlight
Fact 9 (lexglue): to be in the sun means to be in the sunlight
Fact 10 (central): melting is a kind of process

Appendix B Hyperparameters
All ARCF models were trained with an L2 weight decay coefficient of 0.01. We tried training baselines single-fact and path-rerank with the same weight decay, but validation and test set results were lower than without weight decay.
Tables 4a and 4b show the hyperparameters we used for ARCF training and inference. Hyperparameters for training with the different loss functions are largely the same; only the number of epochs trained may differ by 1 or 2. When conditioning on negatives (CN) was used for training, we first trained for 2 epochs as normal and then started replacing a uniformly sampled proportion between 0.0 and 0.3 of the gold facts in the prefix f*_{1,...,N} by uniformly sampled negatives from the visible neighborhood of the prefix as it was before the replacement, vis_k(p). Parameter L is only relevant for inference, as it is only used by algorithm 1.

Appendix C Examples of validation set predictions

Table 5 shows the 15 highest ranked facts by ARCF (RankNet) for the two validation questions in table 3. The first column of each row for gold facts (which are correctly ranked high) is colored blue. As can be seen in the predictions for Question 1, some wrongly predicted facts are clearly related to the question but not necessary for explaining the answer (e.g., Fact 0), while others (like Fact 2) could actually be used to explain the question as well: Fact 2 for Question 1 in table 5 could reasonably take the place of gold Fact 7 for the same question in table 3.

Question 1
The influence of the Moon on the tides on Earth is greater than that of the Sun. Which best explains this? (answer) The Moon is closer to Earth than the Sun.

Predicted ranking for Question 1 (15 highest ranked facts):
Fact 0: cause is similar to influence.
Fact 1: as the gravitational pull of the moon on the Earth decreases, the size of the tides on Earth decrease.
Fact 2: the Moon is closer to the Earth than the Sun.
Fact 3: closer means lower; less; a decrease in distance.
Fact 4: as the gravitational pull of the moon on the Earth decreases, the size of the tides on Earth decrease.
Fact 5: gravity means gravitational pull; gravitational energy; gravitational force; gravitational attraction.
Fact 6: as distance from an object decreases, the pull of gravity on that object increases.
Fact 7: as distance from an object increases, the pull of gravity on that object decrease.
Fact 8: an increase is the opposite of a decrease.
Fact 9: as the distance from an object increases, the force of gravity on that object will decrease.
Fact 10: a moon is a kind of celestial object; body.
Fact 11: the gravitational pull of the Moon on Earth's oceans causes the tides.
Fact 12: the gravitational pull of the Sun on Earth's oceans causes the tides.
Fact 13: less is similar to decrease.
Fact 14: to lower means to decrease.

Question 2
A student placed an ice cube on a plate in the sun. Ten minutes later, only water was on the plate. Which process caused the ice cube to change to water? (answer) melting.

Predicted ranking for Question 2 (15 highest ranked facts):
Fact 0: melting is a kind of process.
Fact 1: melting means matter; a substance changes from a solid into a liquid by increasing heat energy.
Fact 2: an ice cube is a kind of solid.
Fact 3: water is a kind of substance.
Fact 4: water is a kind of liquid at room temperature.
Fact 5: water is in the solid state, called ice, for temperatures between 0; -459; -273 and 273; 32; 0 K; F; C.
Fact 6: temperature is a measure of heat energy.
Fact 7: heat means heat energy.
Fact 8: water is in the liquid state, called liquid water, for temperatures between 273; 32; 0 and 373; 212; 100 K; F; C.
Fact 9: ice is a kind of solid.
Fact 10: if an object; a substance; a location absorbs solar energy then that object; that substance will increase in temperature.
Fact 11: melting is when solids are heated above their melting point.
Fact 12: adding heat means increasing temperature.
Fact 13: cooling; colder means removing; reducing; decreasing heat; temperature.
Fact 14: heating means adding heat.

Appendix D Extra plots