Breadth First Reasoning Graph for Multi-hop Question Answering

Recently, Graph Neural Networks (GNNs) have emerged as a promising tool for the multi-hop question answering task. However, unnecessary node updates and simplistic edge constructions prevent accurate answer span extraction in a direct and interpretable way. In this paper, we propose a novel model, the Breadth First Reasoning Graph (BFR-Graph), which introduces a new message passing scheme that better conforms to the reasoning process. In BFR-Graph, the reasoning message is required to start from the question node and pass to the next sentence nodes hop by hop until all the edges have been passed, which effectively prevents each node from over-smoothing or being updated multiple times unnecessarily. To introduce more semantics, we also define the reasoning graph as a weighted graph that considers the number of co-occurring entities and the distance between sentences. We then present a more direct and interpretable way to aggregate scores from different levels of granularity on top of the GNN. On the HotpotQA leaderboard, the proposed BFR-Graph achieves state-of-the-art performance on answer span prediction.


Introduction
A typical Question Answering (QA) or Reading Comprehension (RC) task aims to extract a desired answer from a single evidence document or paragraph. Recently, the more challenging multi-hop QA task, where the model must reason over multiple paragraphs to find the answer, has been attracting growing attention. One example from the HotpotQA dataset (Yang et al., 2018) is shown in Fig. 1.
One method for achieving multi-hop QA is to concatenate all the paragraphs and treat the result as a typical single-hop QA task (Yang et al., 2018), so that existing QA techniques can be applied. Although this solves multi-hop QA to some extent, it lacks interpretation of the reasoning process from one hop to the next.

Figure 1: One example from the HotpotQA dataset. "s1", "s2", ... denote the sentences in the paragraphs. The model needs to find the answer and supporting sentences by reasoning over multiple sentences and paragraphs. The reasoning clearly follows an ordered process from the question to "s1", "s2", and finally to "s4".
Graph Neural Networks (GNNs) are a natural way to represent the solving procedure of multi-hop QA. For instance, nodes in a GNN can represent sentences or entities in the paragraphs, and the updates propagated along edges exchange messages between them, which resembles the process of reasoning. Thus, a more principled approach is to construct a GNN that simulates the reasoning process across multiple paragraphs (Ding et al., 2019; Qiu et al., 2019; Tu et al., 2020). Promising performance has been reported by methods that design different types of nodes or edges for the GNN (De Cao et al., 2019; Tu et al., 2019, 2020; Fang et al., 2020), and the features generated by the GNN have also been combined with those from the context encoder in a latent way (Qiu et al., 2019; Fang et al., 2020).
Despite the success of GNNs in multi-hop QA, new problems arise. Firstly, current approaches update all the nodes, including unnecessary ones, together within each layer, which may cause the nodes to converge to similar values and lose their discriminating ability as more layers are stacked (Kipf and Welling, 2017). Secondly, although different types of edges have been designed, there is no finer-grained distinction between edges of the same type, ignoring other relational information between sentences. Thirdly, existing methods only latently fuse the hidden representations of the GNN and the context encoder, so the GNN does not contribute to answer span extraction in a direct and interpretable way.
To solve the aforementioned issues, we propose a novel model, the Breadth First Reasoning Graph (BFR-Graph), to effectively adapt GNNs to multi-hop QA. BFR-Graph is a weighted graph in which the weight of an edge is computed from relational information (e.g., co-occurring entities and distance) about the connected sentences. Inspired by the human reasoning mechanism and the Breadth First Search algorithm, the reasoning message in BFR-Graph starts from the question and passes to the next sentence nodes hop by hop until all the edges have been passed, effectively preventing each node from being updated unnecessarily or too many times. The reasoning result from BFR-Graph is then converted into sentence scores and paragraph scores that contribute to answer span extraction: the final answer span probability is the sum of the span score, the score of the sentence containing the answer, and the score of the paragraph containing the answer. Experimental results show that our method makes GNNs more powerful for multi-hop QA and achieves state-of-the-art performance on answer span prediction of HotpotQA.
The contributions of this paper are summarized as follows: • We propose BFR-Graph for multi-hop QA, which is more in line with the reasoning process than existing GNNs: the reasoning message starts at the question and then reasons to the next sentences hop by hop.
• Our BFR-Graph is a weighted graph that considers the number of co-occurring entities and the distance between sentences.
• To take advantage of the reasoning result from BFR-Graph, a multi-score mechanism is used for answer span extraction in a more direct and interpretable way.

Related Work

Multi-hop QA

Two widely used multi-hop QA datasets are WikiHop (Welbl et al., 2018) and HotpotQA (Yang et al., 2018). WikiHop provides candidate answers for selection, while HotpotQA requires finding an answer span over all paragraphs. Based on these datasets, several categories of multi-hop QA approaches have been proposed. Yang et al. (2018) proposed a baseline method based on RNNs, and Min et al. (2019) decomposed the multi-hop question into simpler single-hop sub-questions that can be answered by existing single-hop RC models. To better utilize multiple paragraphs, Nishida et al. (2019) proposed a Query Focused Extractor to sequentially summarize the context, and Asai et al. (2020) used a recurrent retrieval approach that learns to sequentially retrieve evidence paragraphs. Reasoning has also been modeled explicitly in multi-hop QA: Jiang and Bansal (2019) designed a neural modular network to perform distinct types of reasoning, and Chen et al. (2020) presented an extra hop attention that naturally hops across connected text sequences. Qiu et al. (2019) treated the task as a two-stage pipeline of paragraph selection and a downstream model, and Tu et al. (2020) further proposed a pairwise learning-to-rank loss for better interaction between paragraphs. Although the aforementioned methods are specifically designed for multi-hop QA with different structures, they lack an explicit scheme to show the reasoning process.

GNNs for Multi-hop QA
Recently, GNNs such as Graph Convolutional Networks (Kipf and Welling, 2017) and Graph Attention Networks (Veličković et al., 2018) have brought improvements to multi-hop QA, because GNN-based methods are more intuitive and explicit.
GNN models have been built for reasoning over entities and sentences, where each sentence node is summarized over token representations via a mixed attentive pooling mechanism. Furthermore, more complex graphs have also been designed. Qiu et al. (2019) proposed the Dynamically Fused Graph Network, which explores along the entity graph dynamically and finds supporting entities from the context. Fang et al. (2020) created a hierarchical graph over different levels of granularity to aggregate clues from texts scattered across multiple paragraphs. However, the GNNs in these methods update all the nodes together, including unnecessary ones.

Model
To solve the aforementioned issues, we propose a novel model, the Breadth First Reasoning Graph (BFR-Graph), for multi-hop QA. Different from existing GNN-based methods, BFR-Graph introduces new restrictions on message passing: the message starts only from the question and then passes to the later sentence nodes hop by hop. Besides, the graph is constructed as a weighted graph that considers the co-occurring entities and the distance between sentences. Moreover, multi-score answer prediction is designed to take advantage of the reasoning result from BFR-Graph. In short, we propose breadth first reasoning on a weighted graph and then combine multi-level scores for answer prediction in a multi-task joint training framework.
The diagram of our system is shown in Fig. 2. Given multiple paragraphs, we first filter out irrelevant paragraphs with paragraph selection (Sec. 3.1) and then use BERT for context encoding (Sec. 3.2). A weighted graph is constructed (Sec. 3.3) to reason over sentences (Sec. 3.4) and to calculate the sentence and paragraph scores. Finally, we use the multi-score mechanism to predict the answer span (Sec. 3.5).

Paragraph Selection
Although multiple candidate paragraphs are given for answering the question, not all of them are useful (i.e., relevant to the question). Following Qiu et al. (2019), we retrieve N useful paragraphs for each question in a straightforward way. Each candidate paragraph is concatenated with the question ("[CLS]" + question + "[SEP]" + paragraph + "[SEP]") and fed into BERT (Devlin et al., 2019) for binary classification. After training, we select the paragraphs with the top-N scores as the useful paragraphs, which are then concatenated together as the context C.
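As a sketch, the selection step reduces to ranking candidates by a relevance score and keeping the top N. Here score_fn stands in for the fine-tuned BERT binary classifier over "[CLS]" + question + "[SEP]" + paragraph + "[SEP]"; the word-overlap scorer in the example is purely illustrative, not the real model.

```python
def select_paragraphs(question, paragraphs, score_fn, n=3):
    """Keep the N candidate paragraphs most relevant to the question.

    score_fn(question, paragraph) -> float stands in for the positive-class
    score of the BERT binary classifier described in the text.
    """
    ranked = sorted(paragraphs, key=lambda p: score_fn(question, p), reverse=True)
    return ranked[:n]

# Illustrative scorer: word overlap with the question (NOT the real model).
def overlap_score(question, paragraph):
    return len(set(question.lower().split()) & set(paragraph.lower().split()))
```

The selected paragraphs are then concatenated, in ranked order, into the context C fed to the encoder.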

Context Encoding
Following Qiu et al. (2019), we concatenate each question Q with its corresponding context C and feed them into BERT followed by a bi-attention layer (Seo et al., 2017) to obtain the encoded representations of question and context. The output is a matrix of shape L × d, where L is the length of the input sequence (question concatenated with context) and d is the output dimension of the bi-attention layer (also the hidden dimension of BERT).

Figure 3: Message passing procedure of BFR-Graph and a typical GNN. An active node is one that is reachable for its neighbors, while a quiet node is the opposite. An active edge is a passable edge, while a quiet edge is the opposite.

To obtain sentence-level representations, we first take the token-level representations of each sentence, where s_i^start and s_i^end are the start and end positions of sentence i, and L_{s_i} is the length of sentence i. Note that the question is also treated as a sentence. Then, using the method of Rei and Søgaard (2019), we compute the sentence representation as a weighted sum of its token representations, s_i = Σ_k α_i^k t_i^k, where t_i^k is the representation of the k-th token of sentence i and α_i^k is its weight, obtained from a two-layer MLP (Multi-Layer Perceptron) with output size 1.
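The attention pooling above can be sketched as follows. The per-token weight comes from a two-layer MLP with output size 1, as in the text; the tanh activation and the softmax normalization are assumptions about details the text leaves implicit.

```python
import numpy as np

def sentence_representation(tokens, W1, b1, W2, b2):
    """Pool token vectors (L_s x d) of one sentence into a single vector (d,).

    tokens: token-level representations of the sentence from the encoder.
    W1, b1, W2, b2: parameters of the two-layer scoring MLP (output size 1).
    """
    h = np.tanh(tokens @ W1 + b1)            # (L_s, hidden) first MLP layer
    scores = (h @ W2 + b2).ravel()           # (L_s,) one scalar per token
    scores = scores - scores.max()           # numerically stable softmax
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ tokens                    # weighted sum over tokens
```

Because the weights alpha sum to one, the sentence vector is a convex combination of its token vectors.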

Weighted Graph Construction
The nodes in our weighted graph represent the question Q and the sentences in the context C. To better exploit the relational information between sentences, two types of correlation are defined: positive correlation and negative correlation. Although they can be designed in many ways, our design is as follows. (1) Positive correlation: an edge is added if the nodes representing sentences i and j share n (n ≥ 1) named entities, and the weight of the edge is an increasing function of n (Eq. 4). (2) Negative correlation: otherwise, an edge is added if the two nodes originally come from the same paragraph, and the weight of the edge is a decreasing function of the distance d between the two sentences (Eq. 5), where d = 1 if one sentence immediately follows the other in a paragraph, d = 2 if there is a sentence between them, etc. K_1 and K_2 are hyperparameters of the two weight functions.
To simplify the design, we treat our graph as a homogeneous graph, which contains a single type of nodes and edges.
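The construction can be sketched as below. Since Eqs. (4) and (5) are not reproduced here, the sigmoid forms (increasing in the number n of shared entities, decreasing in the sentence distance d, with hyperparameters K1 and K2) are an assumption about their shape, chosen to saturate smoothly; the tuple layout for sentences is likewise a hypothetical data format.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def build_weighted_graph(sentences, K1=0.0, K2=-2.0):
    """sentences: list of (para_id, pos_in_para, entity_set) tuples.

    Returns a dict mapping undirected edges (i, j) -> weight.
    Positive correlation: the two sentences share n >= 1 named entities.
    Negative correlation: otherwise, the two sentences come from the
    same paragraph, weighted by their distance d.
    """
    edges = {}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            (pi, qi, ei), (pj, qj, ej) = sentences[i], sentences[j]
            n = len(ei & ej)                      # co-occurring entities
            if n >= 1:
                edges[(i, j)] = sigmoid(n + K1)   # assumed form of Eq. (4)
            elif pi == pj:
                d = abs(qi - qj)                  # distance within a paragraph
                edges[(i, j)] = sigmoid(K2 * d)   # assumed form of Eq. (5)
    return edges
```

With K1 = 0 and K2 = -2 (the values reported in Appendix A), positive edges get weights above 0.5 and same-paragraph edges get weights that shrink with distance.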

Breadth First Reasoning
When we reason over paragraphs to answer a question, we start from the question and find the next sentence hop by hop. For a GNN whose nodes represent sentences, the following kinds of message passing are unnecessary and introduce disturbance from useless nodes: (1) passing from a later node back to a former node, and (2) a node updating other nodes before it has received the message from the question.
To prevent each node from being updated multiple times unnecessarily, the reasoning message in our BFR-Graph starts from the question node and passes to the next nodes hop by hop until all the edges have been passed. Note that a node is still allowed to be updated multiple times, depending on whether its connected edges have all been passed.
Algorithm 1: Message passing in BFR-Graph. E denotes the set of edges that have not been passed yet (dynamic); A denotes the set of active nodes (dynamic); N_i denotes the neighbors of node i (static); Ñ_i denotes the reachable neighbors of node i (dynamic). Input: the initial node representations S and the set of neighbors N.

Specifically, node i is updated by node j when the following conditions are met simultaneously: (1) node i and node j are neighbors, (2) node j is active, i.e., it was updated in the last layer, and (3) the edge between node i and node j has not been passed previously. The overall message passing procedure of BFR-Graph is given in Algorithm 1.
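The scheduling logic of Algorithm 1 can be sketched in pure Python under the three conditions above; the function and variable names are ours, not the paper's.

```python
def bfr_schedule(neighbors, question=0):
    """Compute, layer by layer, which node is updated by which senders.

    neighbors: dict node -> set of neighbor nodes (undirected graph).
    Returns a list of layers; each layer maps node i -> set of nodes j
    that update i, following conditions (1)-(3) in the text.
    """
    unpassed = {frozenset((i, j)) for i in neighbors for j in neighbors[i]}
    active = {question}                       # reasoning starts at the question
    layers = []
    while unpassed:
        layer = {}
        for j in active:                      # condition (2): sender is active
            for i in neighbors[j]:            # condition (1): i, j are neighbors
                e = frozenset((i, j))
                if e in unpassed:             # condition (3): edge not yet passed
                    layer.setdefault(i, set()).add(j)
        if not layer:                         # edges unreachable from the question
            break
        for i, senders in layer.items():
            for j in senders:
                unpassed.discard(frozenset((i, j)))
        active = set(layer)                   # nodes updated now become active
        layers.append(layer)
    return layers
```

On a chain Q-s1-s2-s3 this yields three layers, one hop per layer, and no node is ever updated through an already-passed edge.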
Inspired by Graph Attention Networks (Veličković et al., 2018), the updating function (or message passing function) lets node i attend over its set of reachable neighbors Ñ_i, calculated with Algorithm 1. Here f(s_i, s_j) = s_i W_1 W_2 s_j computes the attention score between nodes i and j; W, W_1 and W_2 are learnable parameters, and w_ij is the weight of edge (i, j) described in Sec. 3.3. For clarity, the updated representation is still written as s in the following.

Figure 4: Multi-score answer prediction. The example calculates the score for an answer span located in paragraph 1 ("p1") and sentence 5 ("s5").
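A minimal sketch of one such update follows. f(s_i, s_j) = s_i W1 W2 s_j is taken from the text; scaling the attention logits by the edge weight w_ij and the tanh nonlinearity are assumptions, since the exact update equation is not reproduced here.

```python
import numpy as np

def update_node(i, reachable, S, W, W1, W2, edge_w):
    """One attention-weighted update of node i from its reachable neighbors.

    S: (num_nodes, d) node representations; edge_w: dict (i, j) -> weight.
    reachable: the set of reachable neighbors of i from Algorithm 1.
    """
    js = sorted(reachable)
    # attention logit for each reachable neighbor, scaled by the edge weight
    logits = np.array([edge_w[(i, j)] * float(S[i] @ W1 @ W2 @ S[j]) for j in js])
    alpha = np.exp(logits - logits.max())
    alpha = alpha / alpha.sum()               # normalized attention over neighbors
    msg = sum(a * (W @ S[j]) for a, j in zip(alpha, js))
    return np.tanh(msg)                       # new representation of node i
```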

Multi-score Answer Prediction
The answer in the HotpotQA dataset is a span from the context. Existing works calculate the span probability only on the output of the encoder (e.g., BERT), or additionally concatenate the GNN's hidden output. In contrast, we use a more interpretable method that incorporates sentence and paragraph scores obtained from the GNN. An example is shown in Fig. 4. Conventionally, the score of the y-th word in the context being the start / end of the answer span is computed by a two-layer MLP with output size 1 on the encoder output. Then, we calculate the sentence score corresponding to each node in the GNN in the same way. Similarly, we calculate the paragraph score through a global max-pooling:

where s_i^{p_j} is the representation of the i-th sentence in paragraph p_j and L_{p_j} is the number of sentences in paragraph p_j. Max(·) is a max-pooling layer with pooling size L_{p_j} × 1, which is equivalent to taking the maximum hidden value in each dimension over all the sentence nodes of the paragraph.
Finally, the probability of the y-th word in the context being the start of the answer span is determined by summing its span start score, the score of sentence s_i and the score of paragraph p_j, where the y-th word is located in sentence s_i and paragraph p_j. The probability of the y-th word being the end of the answer span is calculated similarly.
In other words, if a sentence or paragraph has a higher score, the words located in it are more likely to be the answer.
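The summation can be sketched directly; the mapping dictionaries are a hypothetical bookkeeping format for "which sentence/paragraph contains token y".

```python
def start_scores(token_scores, sent_scores, para_scores, token2sent, sent2para):
    """Final start score of token y = its span score + the score of the
    sentence containing y + the score of the paragraph containing y.
    End scores are computed the same way with the end span scores.
    """
    return [
        g + sent_scores[token2sent[y]] + para_scores[sent2para[token2sent[y]]]
        for y, g in enumerate(token_scores)
    ]
```

A token in a high-scoring sentence and paragraph thus gets a boost over an equally good span candidate elsewhere, which is exactly the intuition stated above.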

Multi-task Joint Training
In addition to answer span prediction, there are two other training tasks in HotpotQA. One is answer type prediction: some answers cannot be retrieved from the context but are "Yes" or "No", so there are three types of answers (i.e., span, "Yes" and "No"). We use a global max-pooling similar to Eq. (11) to compress all the nodes in the GNN and predict the answer type through a two-layer MLP.
The other task is to predict whether a sentence in the context is a supporting sentence (also called a supporting fact in some papers), i.e., evidence for the answer. Following previous works (Tu et al., 2020), we use the output of the GNN to predict the supporting sentences with a two-layer MLP.
The tasks in HotpotQA are jointly performed through multi-task learning, and the loss function is:

L = L_CE(ŷ_start, y_start) + L_CE(ŷ_end, y_end) + λ1 · L_CE(ŷ_type, y_type) + λ2 · L_BCE(ŷ_sp, y_sp),

where L_CE and L_BCE denote the cross-entropy and binary cross-entropy losses, respectively. ŷ_start denotes the logits of the start position from Eq. (12) and y_start is the label. Similarly, ŷ_type and ŷ_sp are the logits of answer type prediction and supporting sentence prediction, respectively.
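The joint loss can be sketched as below; the default λ values follow Appendix A (λ1 = 1, λ2 = 5), and the single-example (unbatched) formulation is a simplification.

```python
import numpy as np

def ce(logits, y):
    """Cross entropy of a softmax distribution against class index y."""
    m = logits.max()
    return float(np.log(np.exp(logits - m).sum()) + m - logits[y])

def bce(logits, labels):
    """Mean binary cross entropy over per-sentence supporting-fact logits."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    t = np.asarray(labels, dtype=float)
    eps = 1e-12
    return float(-np.mean(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps)))

def joint_loss(start, y_s, end, y_e, typ, y_t, sp, y_sp, lam1=1.0, lam2=5.0):
    """L = CE(start) + CE(end) + lam1 * CE(type) + lam2 * BCE(sp)."""
    return ce(start, y_s) + ce(end, y_e) + lam1 * ce(typ, y_t) + lam2 * bce(sp, y_sp)
```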

Dataset
The HotpotQA dataset (Yang et al., 2018) is the first explainable multi-hop QA dataset with sentence-level evidence supervision. Each sample in the dataset contains 2 gold paragraphs and 8 distracting paragraphs. Three tasks are included for evaluation: (1) answer span prediction (denoted as "Ans"), which extracts a span from the paragraphs or generates "Yes"/"No"; (2) supporting sentence prediction (denoted as "Sup"), which determines which sentences are evidence for the answer; and (3) joint prediction (denoted as "Joint"). We submit our model to the official HotpotQA leaderboard and carry out ablation studies on the dev set.
We also apply the main idea of BFR-Graph to the WikiHop dataset (Welbl et al., 2018), which provides candidate answers for selection, whereas HotpotQA requires finding an answer span over all paragraphs.
Implementation details can be found in Appendix A.

Results
The experimental results on the HotpotQA dataset are shown in Table 1. As a reading comprehension task, the performance of answer prediction should be emphasized. Our model improves Ans-EM (Exact Match) by 0.84% over HGN-large, becoming the first model to break through 70% and achieving state-of-the-art on answer span prediction. On supporting sentence prediction and joint prediction, our model performs close to HGN-large, possibly because this paper builds on a standard GNN (homogeneous graph) for clarity of exposition: our aim is to show that the proposed algorithm improves the performance of the GNN itself. Existing GNN methods mostly construct elaborate graphs for a more granular expression of nodes, while our BFR-Graph solves the problem from another, novel perspective. Thus, BFR-Graph is general and can easily be applied to existing strong models (e.g., HGN) for better results, which provides a promising direction for future research.
We also compare our model with two state-of-the-art GNN models (i.e., SAE and HGN), shown in Table 2.

Table 1: Results on the HotpotQA leaderboard. "Ans", "Sup" and "Joint" denote answer span prediction, supporting sentence prediction and joint prediction, respectively.

Model      Layers    Edges         Intuitive
SAE        manual    3 types       false
HGN        manual    7 types       false
BFR-Graph  adaptive  fine-grained  true

Table 2: Comparison with state-of-the-art GNN models. "Layers" denotes how the number of GNN layers is set, "Edges" denotes how fine-grained the edges are, and "Intuitive" denotes whether the output of the GNN can be intuitively observed.

SAE and HGN manually fix the number of GNN layers, whereas BFR-Graph adapts it through breadth first reasoning, with an extremely low risk of over-smoothing (Kipf and Welling, 2017). SAE and HGN also set a fixed set of edge types, which is still not fine-grained enough, while BFR-Graph defines continuous edge weights (an unbounded number of distinct values, depending on the dataset) to distinguish connections at a finer granularity. Furthermore, in BFR-Graph the scores produced by the GNN can easily be observed in an intuitive way.
Besides, Table 3 shows the results on the WikiHop dev set. When we add breadth first reasoning and edge weights to Longformer (Beltagy et al., 2020), the performance improves slightly, showing that our method enables better reasoning.

Ablations and Analysis
In this section, we carry out ablation studies on the HotpotQA dev set. Table 4 shows the results of our full model and of variants without breadth first reasoning, weights, or multi-score; the results indicate that our methods clearly improve the performance of the GNN. Table 5 shows the results of gradually replacing BFR-Graph layers with standard GNN layers. In detail, "r/p 1 layer" denotes replacing the first layer with a standard GNN layer, "r/p 2 layers" denotes the same operation for the first and second layers, and so on. We observe that the more layers are replaced, the more severely the results drop; when we replace 4 layers, the joint F1 drops by about 6%, indicating over-smoothing. This also reflects a fundamental problem of typical GNNs: with more layers they over-smooth, while with fewer layers they cannot achieve long-path reasoning.

Evaluation on Breadth First Reasoning
To further analyze why message passing in a breadth first reasoning fashion should result in better reasoning, we calculate how many useful messages the answer sentence node receives from supporting sentences: precision = N_sp&rcv / N_rcv and recall = N_sp&rcv / N_sp, where N_rcv denotes how many nodes' messages the answer sentence node received, N_sp denotes the number of supporting sentences (including the question sentence here), and N_sp&rcv denotes how many supporting nodes' messages the answer sentence node received.
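The metric is a set computation and can be sketched directly from the definitions above.

```python
def message_prf(received, supporting):
    """received: nodes whose messages reached the answer sentence node;
    supporting: supporting-sentence nodes (question node included).

    precision = N_sp&rcv / N_rcv, recall = N_sp&rcv / N_sp, plus their F1.
    """
    both = len(received & supporting)
    p = both / len(received) if received else 0.0
    r = both / len(supporting) if supporting else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```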
The above-mentioned precision, recall and corresponding F1 on the dev set are shown in Table 6, where the typical GNN is a 2-layer GNN following previous works. With breadth first reasoning, the answer sentence receives messages from supporting sentences with a higher precision, meaning that it can focus on useful sentences and eliminate invalid distractions. Because of the restrictions on message passing, breadth first reasoning leads to a decrease in recall. It is hard to draw a PR curve or obtain different precision-recall trade-offs, because this is not a binary classification task in the usual sense; nevertheless, BFR-Graph shows a higher F1 than the typical GNN.

Table 4: General ablation study for our full model. "bfr" denotes a typical GNN without breadth first reasoning; "ws" and "ms" denote the weights and multi-score, respectively.

Table 7 (top) presents the results with and without the weights in the GNN. "-ent" denotes removing the entity weights (we set the weights to 0.5 rather than simply removing them) and "-dist" denotes removing the distance weights. When we remove the weights, although the answer F1 rises slightly, the supporting F1 falls to a greater extent. This shows that the proposed weights are beneficial to supporting sentence prediction, which is predicted directly from the GNN nodes. In our understanding, the model enhances the discrimination between edges by assigning them weights, which inevitably reduces the robustness of the model; fortunately, by the design of Eqs. (4) and (5), a quantitative error does not cause a weight to increase or decrease sharply, while the weights still distinguish the differences between sentences.

For multi-score, we evaluate how the results change if this way of exploiting the GNN's output is replaced by the traditional one. In Table 7 (bottom), "-sent" and "-para" denote removing the multi-score for sentences and paragraphs, respectively.

Table 7: Ablations on weights and multi-score.
The results indicate that both the sentence scores and the paragraph scores are beneficial to the performance.

Complexity Analysis
We also analyze the complexities of BFR-Graph and a typical GNN, summarized in Table 8. Firstly, in each layer of BFR-Graph, only some nodes are updated by active nodes, so the number of nodes updated in a BFR-Graph layer is at most that of a typical GNN (N_update ≤ N). Secondly, a node in a BFR-Graph layer is only updated by its reachable nodes (i.e., active neighbors), so the number of reachable nodes per node is also at most that of a typical GNN (M_reach ≤ M). Therefore, breadth first reasoning leads to lower complexity.
For GPU parallel training, we also report the actual time cost per epoch. BFR-Graph costs 158.6 minutes per epoch, while 2-layer and 3-layer typical GNNs cost 157.5 and 165.6 minutes, respectively. We find that BFR-Graph always unrolls to 4 layers on the HotpotQA dataset, yet it costs less time than a 3-layer typical GNN and is close to a 2-layer one.

Conclusion
In this paper, we proposed BFR-Graph, a novel GNN model in which the reasoning message starts from the question node and passes to the next sentence nodes hop by hop until all the edges have been passed. We also construct the reasoning graph as a weighted graph and present a more interpretable way to aggregate scores of different granularity levels from the GNN. On the HotpotQA leaderboard, BFR-Graph achieves state-of-the-art performance on answer span prediction.

A Implementation Details
We select N = 3 useful paragraphs in paragraph selection, which achieves 98.7% recall on the dev set. We use RoBERTa-large (Liu et al., 2019) for context encoding, with a maximum length of 512 tokens, and additionally fine-tune the model on the SQuAD dataset, similarly to Groeneveld et al. (2020). We use spaCy for named entity recognition, and we found that the balance factors K_1 = 0 and K_2 = −2 lead to better results. The manual weights of the loss function are λ1 = 1 and λ2 = 5 in this work. The number of sentences is limited to 30 and the maximum sequence length is set to 512 (the same as BERT). We use Adam with a learning rate of 1e-5, L2 weight decay of 0.01, learning rate warm-up over the first 1,000 steps, and linear decay to 0. Other hyperparameters mainly follow previous works (Fang et al., 2020). We implement our model in PyTorch and train it on RTX 2080ti GPUs.
The whole task consists of two training stages: the first stage is paragraph selection and the second stage is the rest of the pipeline. For the second stage, we train the model on the annotated gold paragraphs and take the predicted paragraphs from the first stage during evaluation.
More details of the dataset and metrics can be found in Yang et al. (2018). For the WikiHop dataset, we migrate the breadth first reasoning and the weights to a baseline model (we reimplement Longformer-base (Beltagy et al., 2020) as the baseline) and evaluate the models on the dev set.

B Case Study and Error Analysis
In Fig. 5, we provide an example for the case study. The reasoning chain in this case consists of two parts, Q→s1→s2→s5 and Q→s6→s5, which are finally combined to contribute to the final answer. The complex, long reasoning chain makes the question hard to answer.
As reported in Fang et al. (2020), HGN retrieves an incorrect answer span for this case, whereas our BFR-Graph effectively handles the complex reasoning and extracts a better answer through the long reasoning chain.

To provide an in-depth understanding of the weaknesses of our model, we carry out an error analysis. Following Fang et al. (2020), we randomly sample 100 examples from the dev set with an answer F1 of 0 and group the error cases into 6 categories: (1) Annotation: the reference answer is incorrect; (2) Multiple Answers: multiple correct answers can answer the question, but only one is provided in the dataset; (3) Discrete Reasoning: this type of error often appears in "comparison" questions, where discrete reasoning is required to answer the question; (4) External Knowledge: commonsense, external knowledge, or mathematical operations are required; (5) Multi-hop: the model fails to perform multi-hop reasoning and finds the final answer in the wrong paragraphs; (6) MRC: the model extracts the wrong answer span but correctly finds the supporting paragraphs and sentences. Table 9 shows the percentages of the 6 error categories for our BFR-Graph. Many errors are due to a wrong reference answer (10%) or multiple answers (22%), which arguably should not be counted as error cases. Among the remaining cases, the largest category comes from questions that need external knowledge (20%, including commonsense and mathematical operations), which is hard to handle without a knowledge base.
C A Case for Multi-score Prediction

Fig. 6 shows an example with specific scores when calculating the multi-scores. The RoBERTa-style tokens have been converted to BERT-style tokens for readability. "Token-idx" denotes the index of each token, "Para-score" and "Sent-score" denote the paragraph and sentence scores, respectively, and "Start-score" and "End-score" are the scores of being the start and end of the answer span.