Hierarchical Graph Network for Multi-hop Question Answering

In this paper, we present Hierarchical Graph Network (HGN) for multi-hop question answering. To aggregate clues from scattered texts across multiple paragraphs, a hierarchical graph is created by constructing nodes of different granularity levels (i.e., questions, paragraphs, sentences, and entities), whose representations are initialized with BERT-based context encoders. By weaving heterogeneous nodes into a single unified graph, this hierarchical differentiation of node granularity enables HGN to support different question answering sub-tasks simultaneously (e.g., paragraph selection, supporting facts extraction, and answer prediction). Given a constructed hierarchical graph for each question, the initial node representations are updated through graph propagation, and for each sub-task, multi-hop reasoning is performed by traversing graph edges. Extensive experiments on the HotpotQA benchmark demonstrate that the proposed HGN approach outperforms prior state-of-the-art methods by a large margin in both the Distractor and Fullwiki settings.

An example from HotpotQA is illustrated in Figure 1. In order to correctly answer the question ("The director of the romantic comedy 'Big Stone Gap' is based in what New York city"), the model first needs to identify P1 as a relevant paragraph, whose title contains keywords that appear in the question ("Big Stone Gap"). S1, the first sentence of P1, is then verified as supporting facts, which leads to the next-hop paragraph P2. From P2, the span "Greenwich Village, New York City" is selected as the predicted answer.
Most existing studies use a retriever to find paragraphs that potentially contain the right answer to the question (P1 and P2 in this case). A Machine Reading Comprehension (MRC) model is then applied to the selected paragraphs for answer prediction (Nishida et al., 2019; Min et al., 2019b). However, even after identifying a reasoning chain through multiple paragraphs, it remains a challenge to aggregate evidence from sources at different granularity levels (e.g., paragraphs, sentences, entities) for both answer and supporting facts prediction.
To tackle this challenge, some studies aggregate document information into an entity graph, based on which query-guided multi-hop reasoning is performed for answer/supporting facts prediction. Depending on the characteristics of the dataset, answers can be selected either from the entities in the constructed entity graph (Song et al., 2018; Dhingra et al., 2018; De Cao et al., 2019; Tu et al., 2019; Ding et al., 2019), or from spans of documents by fusing entity representations back into token-level document representations (Xiao et al., 2019). However, the constructed graph is often used for predicting answers only, and is insufficient for finding supporting facts. Also, reasoning through a simple entity graph (Ding et al., 2019) or a paragraph-entity hybrid graph (Tu et al., 2019) is not sufficient for handling complicated questions that require multi-hop reasoning.
Intuitively, given a question that requires multiple hops through a set of documents in order to derive the right answer, a natural sequence of actions follows: (i) identifying relevant paragraphs; (ii) determining supporting facts in those paragraphs; and (iii) pinpointing the right answer based on the gathered evidence. To this end, the message passing algorithm in graph neural network, which can pass on multi-hop information through graph propagation, has the potential of effectively predicting both supporting facts and answer jointly for multi-hop questions.
Motivated by this, we propose a Hierarchical Graph Network (HGN) for multi-hop question answering, which provides multi-level fine-grained graphs with a hierarchical structure for joint answer and evidence prediction. Instead of using only entities as nodes, we construct a hierarchical graph for each question to capture clues from sources on different levels of granularity. Specifically, we introduce four types of graph nodes: question, paragraphs, sentences, and entities (see Figure 2). To obtain contextualized representations for these hierarchical nodes, large-scale pre-trained language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are used for contextual encoding. These initial representations are then passed through a graph neural network for graph propagation. The updated representations of different nodes are used to perform different sub-tasks (e.g., paragraph selection, supporting facts prediction, and entity prediction) in a hierarchical manner. Since some answers may not be entities, a span prediction module is further introduced for final answer prediction.
The main contributions of this paper are threefold. (i) We propose a Hierarchical Graph Network (HGN) for multi-hop question answering, where heterogeneous nodes are woven into a single unified graph. (ii) Nodes from different granularity levels are utilized for different sub-tasks, providing effective supervision signals for both supporting facts extraction and final answer prediction. (iii) HGN achieves new state of the art in both the Distractor and Fullwiki settings on the HotpotQA benchmark, outperforming previous work by a significant margin.

Related Work
Multi-Hop QA Multi-hop question answering requires a model to aggregate scattered pieces of evidence across multiple documents to predict the right answer. WikiHop (Welbl et al., 2018) and HotpotQA are two recent datasets designed for this purpose. Specifically, WikiHop is constructed using the schema of the underlying knowledge bases, thus limiting answers to entities only. HotpotQA, on the other hand, consists of free-form text collected by Amazon Mechanical Turk workers, which results in significantly more diverse questions and answers. HotpotQA also focuses more on explainability, by requiring supporting facts as the reasoning chain for deriving the correct answer. Two settings are provided in HotpotQA: the Distractor setting requires techniques for multi-hop reading comprehension, while the Fullwiki setting is more focused on information retrieval.
Existing work on the HotpotQA Distractor setting tries to convert the multi-hop reasoning task into single-hop sub-problems. Specifically, QFE (Nishida et al., 2019) regards evidence extraction as a query-focused summarization task, and reformulates the query in each hop. DecompRC (Min et al., 2019b) decomposes a compositional question into simpler sub-questions, and then leverages single-hop MRC models to answer the sub-questions. A neural modular network is also proposed in Jiang and Bansal (2019b), where carefully designed neural modules are dynamically assembled for more interpretable multi-hop reasoning. Although the task is multi-hop by nature, recent studies (Min et al., 2019a) also observed that models achieving high performance may not necessarily perform the expected multi-hop reasoning procedure, and may merely be leveraging reasoning shortcuts (Jiang and Bansal, 2019a).

[Figure 2: Model architecture of the proposed Hierarchical Graph Network. The constructed graph corresponds to the example in Figure 1. Green, blue, orange, and brown represent paragraph, sentence, entity, and question nodes, respectively. Some entities and hyperlinks are omitted for illustration simplicity.]
Graph Neural Network for Multi-hop QA Besides the work mentioned above, recent studies on multi-hop QA also focus on building graphs based on entities, and reasoning over the constructed graph using graph neural networks (Kipf and Welling, 2017; Veličković et al., 2018). For example, MHQA-GRN (Song et al., 2018) and Coref-GRN (Dhingra et al., 2018) construct an entity graph based on co-reference resolution or sliding windows. Entity-GCN (De Cao et al., 2019) considers three different types of edges that connect different entities in the entity graph. HDE-Graph (Tu et al., 2019) enriches information in the entity graph by adding document nodes and creating interactions among documents, entities and answer candidates. Cognitive Graph QA (Ding et al., 2019) employs an MRC model to predict answer spans and possible next-hop spans, and then organizes them into a cognitive graph. DFGN (Xiao et al., 2019) constructs a dynamic entity graph, where in each reasoning step irrelevant entities are softly masked out, and a fusion module is designed to improve the interaction between the entity graph and the documents. Different from the above methods, our proposed model constructs a hierarchical graph, effectively exploring relations among clues of different granularities and employing different nodes to perform different tasks, such as supporting facts prediction and entity prediction.

Hierarchical Graph Network
As illustrated in Figure 2, the proposed Hierarchical Graph Network (HGN) consists of four main components: (i) Graph Construction Module (Sec. 3.1), through which a hierarchical graph is constructed to connect clues from different sources; (ii) Context Encoding Module (Sec. 3.2), where initial representations of graph nodes are obtained via a BERT-based encoder; (iii) Graph Reasoning Module (Sec. 3.3), where a graph-attention-based message-passing algorithm is applied to jointly update node representations; and (iv) Multi-task Prediction Module (Sec. 3.4), where multiple sub-tasks, including paragraph selection, supporting facts prediction, entity prediction, and answer span extraction, are performed simultaneously. The following sub-sections describe each component in detail.

Graph Construction
The hierarchical graph is constructed in two steps: (i) identifying relevant multi-hop paragraphs; and (ii) adding edges that represent connections between sentences and entities within the selected paragraphs.
Paragraph Selection Starting from the question, the first step is to identify relevant paragraphs (i.e., the first hop). We first retrieve those documents whose titles match the question. If multiple paragraphs are found, only the two paragraphs with the highest ranking scores are selected. If title matching returns no relevant documents, we further search for paragraphs that contain entities appearing in the question. If this also fails, a BERT-based paragraph ranker (described below) is used to select the paragraph with the highest ranking score. The number of first-hop paragraphs is at most two.
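The cascaded first-hop selection described above can be sketched as follows (all function and variable names are illustrative, not taken from the authors' code; the ranker scores stand in for the BERT-based paragraph ranker):

```python
def select_first_hop(title_matches, entity_matches, ranker_scores):
    """Cascaded first-hop paragraph selection (illustrative sketch).

    title_matches: paragraphs whose title matches the question.
    entity_matches: paragraphs containing an entity from the question.
    ranker_scores: {paragraph_id: score} from a paragraph ranker.
    Returns at most two first-hop paragraphs.
    """
    if title_matches:
        # Keep the two title matches with the highest ranking scores.
        return sorted(title_matches,
                      key=lambda p: ranker_scores.get(p, 0.0),
                      reverse=True)[:2]
    if entity_matches:
        # Fall back to paragraphs mentioning a question entity.
        return sorted(entity_matches,
                      key=lambda p: ranker_scores.get(p, 0.0),
                      reverse=True)[:2]
    # Last resort: the single best paragraph according to the ranker.
    return [max(ranker_scores, key=ranker_scores.get)] if ranker_scores else []
```

At each stage the candidate pool shrinks, so the output never exceeds two paragraphs, matching the constraint stated above.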
Once the first-hop paragraphs are identified, the next step is to find facts and entities within the paragraphs that can lead to other relevant paragraphs (i.e., the second hop). Instead of relying on entity linking, which could be noisy, we use hyperlinks (provided by Wikipedia) in the first-hop paragraphs to discover second-hop paragraphs. Once the links are selected, we add edges between the sentences containing these links (source) and the paragraphs that the hyperlinks refer to (target), as illustrated by the dashed orange line in Figure 2. In order to allow information flow in both directions, the edges are considered bidirectional.
Through this two-hop selection process, we are able to obtain several candidate paragraphs. To reduce the noise introduced in this process, we use a paragraph ranking model to select the paragraphs with the top-N ranking scores in each step. This paragraph ranking model consists of a pre-trained BERT encoder followed by a binary classification layer, which predicts whether an input paragraph contains the ground-truth supporting facts.
Hierarchical Graph Construction Paragraphs are comprised of sentences, and each sentence contains multiple entities. This natural hierarchical structure motivates how we construct the hierarchical graph. For each paragraph node, we add an edge between the node and each sentence in the paragraph, with each sentence corresponding to a sentence node. For each sentence node, we extract all the entities in the sentence and add edges between the sentence node and these entity nodes. Optionally, edges between paragraphs and edges between sentences can also be included in the final graph.
Each type of these nodes captures semantics from different information sources. Thus, the proposed hierarchical graph effectively exploits the structural information across all the different granularity levels to learn fine-grained representations, which can locate supporting facts and answers more accurately than simpler graphs with homogeneous nodes.
An example hierarchical graph is illustrated in Figure 2. We define different types of edges as follows: (i) edges between the question node and paragraph nodes; (ii) edges between the question node and its corresponding entity nodes (entities appearing in the question, not shown for simplicity); (iii) edges between paragraph nodes and their corresponding sentence nodes (sentences within the paragraph); (iv) edges between sentence nodes and their linked paragraph nodes (linked through hyperlinks); (v) edges between sentence nodes and their corresponding entity nodes (entities appearing in the sentences); (vi) edges between paragraph nodes; and (vii) edges between sentence nodes that appear in the same paragraph. Note that a sentence is only connected to its previous and next neighboring sentences. The final graph consists of these seven types of edges as well as four types of nodes, linking the question to paragraphs, sentences, and entities in a hierarchical way.
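The seven edge types can be enumerated mechanically from the selected paragraphs. The sketch below builds an edge list from a toy hierarchy; the data layout and identifiers are hypothetical, chosen only to mirror the edge-type definitions above:

```python
def build_hierarchical_edges(paragraphs, hyperlinks, question_entities):
    """Enumerate the seven edge types of the hierarchical graph (sketch).

    paragraphs: {p_id: {"sentences": {s_id: [entity_ids]}}}
    hyperlinks: list of (source_sentence_id, target_paragraph_id)
    question_entities: entity ids appearing in the question.
    Returns a list of (edge_type, u, v) edges; "Q" is the question node.
    """
    edges = []
    p_ids = list(paragraphs)
    for p_id, p in paragraphs.items():
        edges.append(("q-p", "Q", p_id))                # type (i)
        s_ids = list(p["sentences"])
        for s_id, ents in p["sentences"].items():
            edges.append(("p-s", p_id, s_id))           # type (iii)
            for e_id in ents:
                edges.append(("s-e", s_id, e_id))       # type (v)
        for a, b in zip(s_ids, s_ids[1:]):              # type (vii): adjacent
            edges.append(("s-s", a, b))                 # sentences only
    for e_id in question_entities:
        edges.append(("q-e", "Q", e_id))                # type (ii)
    for s_id, target in hyperlinks:
        edges.append(("s-p", s_id, target))             # type (iv)
    for i in range(len(p_ids)):
        for j in range(i + 1, len(p_ids)):
            edges.append(("p-p", p_ids[i], p_ids[j]))   # type (vi)
    return edges
```

Note that sentence-sentence edges are only added between adjacent sentences, matching the constraint stated above.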

Context Encoding
Given the constructed hierarchical graph, the next step is to obtain the initial representations of all the graph nodes. To this end, we first combine all the selected paragraphs into context C, which is concatenated with the question Q and fed into pre-trained BERT (Devlin et al., 2019), followed by a bi-attention layer (Seo et al., 2017). We denote the encoded question representation as Q = {q_0, q_1, ..., q_{m−1}} ∈ ℝ^{m×d}, and the encoded context representation as C = {c_0, c_1, ..., c_{n−1}} ∈ ℝ^{n×d}, where m and n are the lengths of the question and the context, respectively, and each q_i, c_j ∈ ℝ^d.
A shared BiLSTM is applied on top of the context representation C, and the representations of different nodes are extracted from the output of the BiLSTM, denoted as M ∈ ℝ^{n×2d}. For entity/sentence/paragraph nodes, which are spans of the context, the representation is calculated from: (i) the hidden state of the backward LSTM at the start position, and (ii) the hidden state of the forward LSTM at the end position. For the question node, a max-pooling layer is used to obtain its representation. Specifically,

p_i = MLP_1([M^{(b)}_{start(p_i)}; M^{(f)}_{end(p_i)}]),
s_i = MLP_2([M^{(b)}_{start(s_i)}; M^{(f)}_{end(s_i)}]),
e_i = MLP_3([M^{(b)}_{start(e_i)}; M^{(f)}_{end(e_i)}]),
q = max-pooling(M),

where M^{(b)} and M^{(f)} denote the backward and forward halves of the BiLSTM output, and [·; ·] denotes the concatenation of two vectors. In summary, after context encoding, each p_i, s_i, e_i ∈ ℝ^d serves as the representation of the i-th paragraph/sentence/entity node, and the question node is represented as q ∈ ℝ^d.
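As a rough illustration of the span-node extraction described above, the sketch below assumes the BiLSTM output stores forward states in the first d dimensions and backward states in the last d (an assumption about the memory layout, not stated in the paper; the subsequent MLP projection is omitted):

```python
import numpy as np

def span_node_repr(M, start, end, d):
    """Extract a span node representation from BiLSTM output (sketch).

    M: (n, 2d) BiLSTM output; assumed layout: forward states in M[:, :d],
    backward states in M[:, d:]. The node representation concatenates the
    backward state at the span start (which has read the whole span) with
    the forward state at the span end (likewise), as described in the text.
    """
    backward_start = M[start, d:]   # backward LSTM state at span start
    forward_end = M[end, :d]        # forward LSTM state at span end
    return np.concatenate([backward_start, forward_end])
```

In the model this 2d-dimensional vector would then be projected to ℝ^d by a node-type-specific MLP.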

Graph Reasoning
After context encoding, HGN performs reasoning over the hierarchical graph, where the contextualized representations of all the graph nodes are transformed into higher-level features via a graph neural network. Specifically, let P = {p_1, ..., p_{n_p}}, S = {s_1, ..., s_{n_s}}, and E = {e_1, ..., e_{n_e}}, where n_p, n_s and n_e denote the number of paragraph/sentence/entity nodes in a graph. In experiments, we set n_p = 4, n_s = 40 and n_e = 60 (padded where necessary), and denote H = {q, P, S, E} ∈ ℝ^{N×d}, where N = n_p + n_s + n_e + 1, and d is the feature dimension of each node.
For graph propagation, we use Graph Attention Network (GAT) (Veličković et al., 2018) to perform message passing over the hierarchical graph. Specifically, GAT takes all the nodes as input, and updates node feature h_i through its neighbors N_i in the graph. Formally,

h'_i = σ( Σ_{j∈N_i} α_{ij} W h_j ),

where W ∈ ℝ^{d×d} is a weight matrix to be learned, σ(·) denotes an activation function, and α_{ij} is the attention coefficient, which can be calculated by:

α_{ij} = exp( f( W_{e_{ij}} [h_i; h_j] ) ) / Σ_{k∈N_i} exp( f( W_{e_{ik}} [h_i; h_k] ) ),

where W_{e_{ij}} is the weight matrix corresponding to the edge type e_{ij} between the i-th and j-th nodes, and f(·) denotes the LeakyReLU activation function. In summary, after graph reasoning, we obtain H' = {h'_0, h'_1, ..., h'_N} ∈ ℝ^{N×d}, from which the updated representations for each type of node can be obtained, i.e., P' ∈ ℝ^{n_p×d}, S' ∈ ℝ^{n_s×d}, E' ∈ ℝ^{n_e×d}, and q' ∈ ℝ^d.
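A minimal NumPy sketch of one GAT message-passing step with edge-type-specific attention weights, as described above (matrix shapes, the ReLU output nonlinearity, and the exact scoring form are assumptions for illustration):

```python
import numpy as np

def gat_layer(H, edges, edge_type, W, W_e, leaky=0.2):
    """One GAT message-passing step over a heterogeneous graph (sketch).

    H: (N, d) node features; edges: {i: [neighbor ids]};
    edge_type[(i, j)]: type id of edge i->j; W: (d, d) shared weight;
    W_e[t]: (1, 2d) attention weight for edge type t.
    """
    H_new = np.zeros_like(H)
    for i, nbrs in edges.items():
        if not nbrs:
            continue
        # Attention logits: LeakyReLU over a type-specific linear score
        # of the concatenated node pair, as in the alpha_ij formula.
        logits = []
        for j in nbrs:
            z = W_e[edge_type[(i, j)]] @ np.concatenate([H[i], H[j]])
            logits.append(np.where(z > 0, z, leaky * z).item())
        logits = np.array(logits)
        alpha = np.exp(logits - logits.max())
        alpha /= alpha.sum()                      # softmax over neighbors
        # Aggregate neighbor messages, then apply the nonlinearity sigma.
        msg = sum(a * (H[j] @ W) for a, j in zip(alpha, nbrs))
        H_new[i] = np.maximum(msg, 0.0)           # sigma = ReLU here
    return H_new
```

Stacking such layers lets information propagate multiple hops through the hierarchy, which is what enables joint reasoning over paragraph, sentence, and entity nodes.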

Multi-task Prediction
In this module, the updated node representations after graph reasoning are exploited for different sub-tasks of QA: (i) paragraph selection, based on paragraph nodes; (ii) supporting facts prediction, based on sentence nodes; and (iii) answer prediction, based on entity nodes. Since the answer may not reside in an entity node, the loss from (iii) only serves as a regularization term, and the encoded context representation M is directly used for answer span extraction. Similar to Xiao et al. (2019), we use a cascade structure to resolve the output dependency, and jointly perform all the tasks in a multi-task manner. The final objective is specified as:

L_joint = L_start + L_end + λ_1 L_para + λ_2 L_sent + λ_3 L_entity + λ_4 L_type,

where λ_1, λ_2, λ_3, and λ_4 are hyper-parameters, and each loss function is a cross-entropy loss, calculated over the logits described below. For both paragraph selection (L_para) and supporting facts prediction (L_sent), we use a two-layer MLP as the binary classifier:

o_para = MLP_4(P'),  o_sent = MLP_5(S'),

where o_sent ∈ ℝ^{n_s} represents the logits of sentences being selected as supporting facts, and o_para ∈ ℝ^{n_p} represents the logits of paragraphs containing the ground-truth supporting facts.
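The weighted multi-task objective can be written down directly. In the sketch below, the default lambda values follow the hyper-parameters reported in the experimental setup (λ_1 = 1, λ_2 = 5, λ_3 = 1, λ_4 = 1), with the ordering of the weights assumed:

```python
def joint_loss(L_start, L_end, L_para, L_sent, L_entity, L_type,
               lambdas=(1.0, 5.0, 1.0, 1.0)):
    """Weighted sum of the sub-task cross-entropy losses (sketch).

    lambdas = (lambda_1, ..., lambda_4) weight the paragraph selection,
    supporting-facts, entity prediction, and answer-type losses; the
    ordering is an assumption matching the order they appear in the text.
    """
    l1, l2, l3, l4 = lambdas
    return (L_start + L_end
            + l1 * L_para + l2 * L_sent
            + l3 * L_entity + l4 * L_type)
```

Each input would be a scalar cross-entropy loss computed over the corresponding logits; gradients from all sub-tasks then flow jointly through the shared encoder and graph layers.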
We treat entity prediction (L_entity) as a multi-class classification problem. Candidate entities include all the entities in the constructed graph, plus an additional dummy entity indicating that the ground-truth answer does not exist among the entity nodes. Specifically, o_entity = MLP_6(E').
During inference, the above entity loss only serves as a regularization term, and the final answer is predicted by the answer span extraction module. For answer span extraction, a two-layer MLP on top of a BiLSTM is used to calculate the logits of every position being the start or end point of the ground-truth span:

o_start = MLP_7(M'),  o_end = MLP_8(M'),

where M' denotes the BiLSTM output over the context representation, o_start, o_end ∈ ℝ^n, and n is the number of tokens in the context.
For answer-type classification (L_type), we use a two-layer MLP on top of the BiLSTM output for multi-class classification:

o_type = MLP_9(M').

The final cross-entropy loss (L_joint) used for training is defined over all the above logits, as specified at the beginning of this sub-section.

Experiments
In this section, we describe our experiments on the HotpotQA dataset, comparing HGN with state-of-the-art approaches and providing detailed analysis to validate the effectiveness of our proposed model.

Experimental Setup
Dataset We use the HotpotQA dataset for evaluation, which has become a popular benchmark for multi-hop QA. Specifically, two sub-tasks are included in this dataset: (i) answer prediction; and (ii) supporting facts prediction. For each sub-task, exact match (EM) and partial match (F1) are used to evaluate model performance, and joint EM and F1 scores are used to measure the final performance, which encourages the model to take both answer and evidence prediction into consideration.
In addition, there are two settings in HotpotQA: the Distractor and Fullwiki settings. In the Distractor setting, for each question, two gold paragraphs with ground-truth answers and supporting facts are provided, along with 8 'distractor' paragraphs collected via a bi-gram TF-IDF retriever (Chen et al., 2017). The Fullwiki setting is more challenging: it contains the same questions as the Distractor setting, but does not provide relevant paragraphs; to obtain the right answer and supporting facts, the entire Wikipedia can be used to find relevant documents.

Implementation Details Our implementation is based on the Transformers library (Wolf et al., 2019), and we use BERT-wwm (whole word masking) or RoBERTa (Liu et al., 2019) for context encoding. To construct the proposed hierarchical graph, we use spaCy to extract entities from both questions and sentences. The numbers of entities, sentences, and paragraphs in one graph are limited to 60, 40, and 4, respectively. Since HotpotQA only requires two-hop reasoning, up to two paragraphs are connected to each question. Our paragraph ranking model is a binary classifier based on the BERT-base model. For the Fullwiki setting, we leverage the retrieved paragraphs and the paragraph ranker provided by Yixin Nie (2019). The hyper-parameters λ_1, λ_2, λ_3 and λ_4 are set to 1, 5, 1 and 1, respectively.

Baselines For the Distractor setting, we compare with DFGN (Xiao et al., 2019), QFE (Nishida et al., 2019), the official baseline, and DecompRC (Min et al., 2019b). Unpublished work includes TAP2, EPS+BERT, SAE, P-BERT, LQR-net (Anonymous, 2020a), and ChainEx. For the Fullwiki setting, the published baselines include SemanticRetrievalMRS (Yixin Nie, 2019), Entity-centric BERT (Godbole et al., 2019), GoldEn Retriever (Qi et al., 2019), Cognitive Graph (Ding et al., 2019), MUPPET (Feldman and El-Yaniv, 2019), QFE (Nishida et al., 2019), and the official baseline.
Unpublished work includes Graph-based Recurrent Retriever (Anonymous, 2020b), MIR+EPS+BERT, Transformer-XH (Anonymous, 2020c), PR-BERT, and TPReasoner (Xiong et al., 2019).

Effectiveness of Paragraph Selection
The proposed HGN relies on effective paragraph selection to find relevant multi-hop paragraphs. Table 3 shows the performance of paragraph selection on the dev set of HotpotQA. In DFGN, paragraphs are selected based on a threshold to maintain high recall (98.27%), leading to low precision (60.28%). Compared to both threshold-based and pure top-N-based paragraph selection, our two-step paragraph selection process is more accurate, achieving 94.53% precision and 94.53% recall. Besides these two top-ranked paragraphs, we also include two other paragraphs with the next highest ranking scores, to obtain higher coverage of potential answers while slightly sacrificing precision. For the compared models, we use their released code to allow fine-tuning of BERT. Results show that our paragraph selection method outperforms the threshold-based one in both models.
Effectiveness of Hierarchical Graph As described in Section 3.1, we construct our graph with four types of nodes and seven types of edges. For ablation study, we build the graph step by step. First, we only consider edges from the question to paragraphs, and from paragraphs to sentences, i.e., only edge types (i), (iii) and (iv) are considered. We call this the PS Graph. Based on this, entity nodes and the edges related to each entity node (corresponding to edge types (ii) and (v)) are added. We call this the PSE Graph. Lastly, edge types (vi) and (vii) are added, resulting in the final hierarchical graph. As shown in Table 5, the use of the PS Graph improves the joint F1 score over the plain RoBERTa model by 1.59 points. By further adding entity nodes, the joint F1 increases by 0.18 points. This indicates that the addition of entity nodes is helpful, but may also bring in noise, thus leading to only limited performance improvement. By including edges among sentences and paragraphs, our final hierarchical graph provides an additional improvement of 0.22 points. We hypothesize that this is due to the explicit connections between sentences leading to better representations.

Effectiveness of Multi-task Loss As described in Section 3.4, different node representations are utilized for different downstream sub-tasks. Table 6 shows the ablation study results on the paragraph selection loss L_para and the entity prediction loss L_entity. The span extraction loss L_span and the supporting facts prediction loss L_sent are not ablated, since they are the essential final sub-tasks on which we evaluate the model. As shown in the table, using the paragraph selection and entity prediction losses further improves the joint F1 by 0.31 points, which demonstrates the effectiveness of optimizing all the losses jointly.

Effectiveness of Pre-trained Language Model
To isolate the effects of pre-trained language models, we compare our HGN with prior state-of-the-art methods using the same pre-trained language models. Results in Table 7 show that our HGN variants outperform DFGN and EPS, indicating that the performance gain comes from better model design.

Case Study
We provide two example questions for case study.
To answer the question in Figure 3 (left), Q needs to be linked with P1. Subsequently, the sentence S4 within P1 is connected to P2 through the hyperlink ("John Surtees") in S4. A plain BERT model without the constructed graph missed S7 as an additional supporting fact, while our HGN discovers and utilizes both pieces of evidence, as the connections among S4, P2 and S7 are explicitly encoded in the hierarchical graph.

For the question in Figure 3 (right), the inference chain is Q → P1 → S1 → S2 → P2 → S3. The plain BERT model infers the evidence sentences S2 and S3 correctly. However, it fails to predict S1 as a supporting fact, while HGN succeeds, potentially due to the explicit connections between sentences in the constructed graph.

Conclusion
In this paper, we propose a new approach, Hierarchical Graph Network (HGN), for multi-hop question answering. To capture clues from different granularity levels, our HGN model weaves heterogeneous nodes into a single unified graph. Experiments with detailed analysis demonstrate the effectiveness of the proposed model, which achieves state-of-the-art performance on the HotpotQA benchmark. Currently, in the Fullwiki setting, an off-the-shelf paragraph retriever is adopted for selecting relevant context from a large corpus of text. Future work includes investigating the interaction and joint training between HGN and the paragraph retriever for further performance improvement.