Dynamically Fused Graph Network for Multi-hop Reasoning

Text-based question answering (TBQA) has been studied extensively in recent years. Most existing approaches focus on finding the answer to a question within a single paragraph. However, many difficult questions require multiple pieces of supporting evidence scattered across two or more documents. In this paper, we propose the Dynamically Fused Graph Network (DFGN), a novel method for answering questions that require gathering multiple scattered pieces of evidence and reasoning over them. Inspired by the step-by-step reasoning behavior of humans, DFGN includes a dynamic fusion layer that starts from the entities mentioned in the given query, explores along the entity graph dynamically built from the text, and gradually finds relevant supporting entities in the given documents. We evaluate DFGN on HotpotQA, a public TBQA dataset requiring multi-hop reasoning. DFGN achieves competitive results on the public leaderboard. Furthermore, our analysis shows that DFGN produces interpretable reasoning chains.


Introduction
Question answering (QA) has been a popular topic in natural language processing. QA provides a quantifiable way to evaluate an NLP system's capability for language understanding and reasoning (Hermann et al., 2015; Rajpurkar et al., 2016, 2018). Most previous work focuses on finding evidence and answers from a single paragraph (Seo et al., 2016; Liu et al., 2017; Wang et al., 2017), which rarely tests the deep reasoning capabilities of the underlying model. In fact, Min et al. (2018) observe that most questions in existing QA benchmarks can be answered by retrieving a small set of sentences without reasoning. To address this issue, several recently proposed QA datasets are designed specifically to evaluate a system's multi-hop reasoning capabilities, including WikiHop (Welbl et al., 2018), ComplexWebQuestions (Talmor and Berant, 2018), and HotpotQA.
† These authors contributed equally. The order of authorship is decided through dice rolling. Work done while Lin Qiu was a research intern in ByteDance AI Lab.
Figure 1: Example of multi-hop text-based QA. One question and three document paragraphs are given. Our proposed DFGN conducts multi-step reasoning over the facts by constructing an entity graph from multiple paragraphs, predicting a dynamic mask to select a subgraph, propagating information along the graph, and finally transferring the information from the graph back to the text in order to localize the answer. Nodes are entity occurrences, with the color denoting the underlying entity. Edges are constructed from co-occurrences. The gray circles are selected by DFGN in each step. (Panels: Input Paragraphs, Original Entity Graph, First Mask Applied, Second Mask Applied.)
In this paper, we study the problem of multi-hop text-based QA, which requires multi-hop reasoning among evidence scattered across multiple raw documents. In particular, a query utterance and a set of accompanying documents are given, but not all of the documents are relevant. The answer can only be obtained by selecting two or more pieces of evidence from the documents and reasoning over them (see Figure 1 for an example). This setup is versatile and does not rely on any additional predefined knowledge base. Therefore the models are expected to generalize well and to answer questions in open domains.
There are two main challenges in answering questions of this kind. Firstly, since not every document contains relevant information, multi-hop text-based QA requires filtering out noise from multiple paragraphs and extracting useful information. To address this, recent studies propose to build entity graphs from input paragraphs and apply graph neural networks (GNNs) to aggregate the information through entity graphs (Dhingra et al., 2018; De Cao et al., 2018; Song et al., 2018a). However, all of the existing work applies GNNs to a static global entity graph for each QA pair, which can be considered as performing implicit reasoning. In contrast, we argue that query-guided multi-hop reasoning should be explicitly performed on a dynamic local entity graph tailored to the query.
Secondly, previous work on multi-hop QA (e.g. WikiHop) usually aggregates document information into an entity graph, and answers are then directly selected from the entities of that graph. However, in a more realistic setting, the answers may not even reside in entities of the extracted entity graph. Thus, existing approaches can hardly be applied directly to open-domain multi-hop QA tasks like HotpotQA.
In this paper, we propose the Dynamically Fused Graph Network (DFGN), a novel method to address the aforementioned concerns for multi-hop text-based QA. For the first challenge, DFGN constructs a dynamic entity graph based on entity mentions in the query and documents. This process iterates over multiple rounds to achieve multi-hop reasoning. In each round, DFGN generates and reasons on a dynamic graph in which a mask prediction module masks out irrelevant entities and preserves only the reasoning sources. Figure 1 shows how DFGN works on a multi-hop text-based QA example in HotpotQA. The mask prediction module is learned in an end-to-end fashion, alleviating the error propagation problem.
To address the second challenge, that of unrestricted answers, we propose a fusion process in DFGN. We not only aggregate information from documents to the entity graph (doc2graph), but also propagate the information of the entity graph back to document representations (graph2doc). The fusion process is performed iteratively at each hop through the document tokens and entities, and the final answer is then obtained from document tokens. The doc2graph and graph2doc fusion process, together with the dynamic entity graph, improves the interaction between the document information and the entity graph, leading to a less noisy entity graph and thus more accurate answers.
As one merit, DFGN's predicted masks implicitly induce reasoning chains, which can explain the reasoning results. Since the ground-truth reasoning chain is very hard to define and label for an open-domain corpus, we propose a feasible way to weakly supervise the mask learning. We also propose a new metric to evaluate the quality of predicted reasoning chains and constructed entity graphs.
Our contributions are summarized as follows:
• We propose DFGN, a novel method for the multi-hop text-based QA problem.
• We provide a way to explain and evaluate the reasoning chains by interpreting the entity graph masks predicted by DFGN. The mask prediction module is additionally trained with weak supervision.
• We provide an experimental study on a public dataset (HotpotQA) to demonstrate that our proposed DFGN is competitive against state-of-the-art unpublished works.

Related work
Text-based Question Answering. Depending on whether the supporting information is structured or not, QA tasks can be categorized into knowledge-based (KBQA), text-based (TBQA), mixed, and others. In KBQA, the supporting information is from structured knowledge bases (KBs), while the queries can be either structured or natural language utterances. For example, SimpleQuestions is one large-scale dataset of this kind (Bordes et al., 2015). In contrast, TBQA's supporting information is raw text, and hence the query is also text. SQuAD (Rajpurkar et al., 2016) and HotpotQA are two such datasets. There are also mixed QA tasks which combine both text and KBs, e.g. WikiHop (Welbl et al., 2018) and ComplexWebQuestions (Talmor and Berant, 2018). In this paper, we focus on TBQA, since TBQA tests a system's end-to-end capability of extracting relevant facts from raw language and reasoning about them.
Depending on the complexity of the underlying reasoning, QA problems can be categorized into single-hop and multi-hop ones. Single-hop QA only requires one fact extracted from the underlying information, whether structured or unstructured, e.g. "which city is the capital of California". The SQuAD dataset belongs to this type (Rajpurkar et al., 2016). In contrast, multi-hop QA requires identifying multiple related facts and reasoning about them, e.g. "what is the capital city of the largest state in U.S.". Example tasks and benchmarks of this kind include WikiHop, ComplexWebQuestions, and HotpotQA. Many IR techniques can be applied to answer single-hop questions (Rajpurkar et al., 2016). However, these IR techniques are hard to apply to multi-hop QA, since a single fact can only partially match a question.
Note that the existing multi-hop QA datasets WikiHop and ComplexWebQuestions are constructed using existing KBs and are constrained by the schema of the KBs they use. For example, in WikiHop the answers are limited to entities rather than free text. In this work, we focus on multi-hop text-based QA, so we only evaluate on HotpotQA.
Coref-GRN extracts and aggregates entity information in different references from scattered paragraphs (Dhingra et al., 2018). Coref-GRN utilizes co-reference resolution to detect different mentions of the same entity. These mentions are combined with a graph recurrent neural network (GRN) (Song et al., 2018b) to produce aggregated entity representations. MHQA-GRN (Song et al., 2018a) follows Coref-GRN and refines the graph construction procedure with more connections (sliding-window, same entity, and co-reference), which yields further improvements. Entity-GCN (De Cao et al., 2018) proposes to distinguish different relations in the graphs through a relational graph convolutional neural network (GCN) (Kipf and Welling, 2017). Coref-GRN, MHQA-GRN and Entity-GCN explore the graph construction problem in answering real-world questions. However, how to reason effectively over the constructed graphs has yet to be investigated, and this is the main problem studied in this work.
Another group of sequential models deals with multi-hop reasoning following Memory Networks (Sukhbaatar et al., 2015). Such models construct representations for queries and memory cells for contexts, then let them interact in a multi-hop manner. Munkhdalai and Yu (2017) and Onishi et al. (2016) incorporate a hypothesis testing loop to update the query representation at each reasoning step and select the best answer among the candidate entities at the last step. IR-Net (Zhou et al., 2018) generates a subject state and a relation state at each step, computing the similarity score between all the entities and relations given by the dataset KB. The ones with the highest score at each time step are linked together to form an interpretable reasoning chain. However, these models perform reasoning on simple synthetic datasets with a limited number of entities and relations, which are quite different from large-scale QA datasets with complex questions. Also, the supervision of entity-level reasoning chains in synthetic datasets can easily be derived from patterns, whereas it is not available in HotpotQA.

Dynamically Fused Graph Network
We describe the Dynamically Fused Graph Network (DFGN) in this section. Our intuition is drawn from the human reasoning process for QA. One starts from an entity of interest in the query, focuses on the words surrounding the start entity, connects to some related entity either found in the neighborhood or linked by the same surface mention, repeats the step to form a reasoning chain, and lands on some entity or snippet likely to be the answer. To mimic this human reasoning behavior, we develop five components in our proposed QA system (Fig. 2): a paragraph selection sub-network, a module for entity graph construction, an encoding layer, a fusion block for multi-hop reasoning, and a final prediction layer.

Paragraph Selection
For each question, we assume that $N_p$ paragraphs are given (e.g. $N_p = 10$ in HotpotQA). Since not every piece of text is relevant to the question, we train a sub-network to select relevant paragraphs. The sub-network is based on a pre-trained BERT model (Devlin et al., 2018) followed by a sentence classification layer with sigmoid prediction. The selector network takes a query $Q$ and a paragraph as input and outputs a relevance score between 0 and 1. Training labels are constructed by assigning 1 to the paragraphs with at least one supporting sentence for each Q&A pair. During inference, paragraphs with a predicted score greater than $\eta$ ($\eta = 0.1$ in our experiments) are selected and concatenated together as the context $C$. $\eta$ is chosen so that the selector reaches a high recall of relevant paragraphs. $Q$ and $C$ are further processed by upper layers.
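The inference-time selection rule can be illustrated with a minimal sketch. The function name `select_paragraphs` and the hard-coded scores are hypothetical stand-ins for the outputs of the BERT-based selector; joining the kept paragraphs with a single space is also an assumption of this sketch.

```python
from typing import List, Tuple


def select_paragraphs(scored: List[Tuple[str, float]], eta: float = 0.1) -> str:
    """Keep paragraphs whose relevance score is strictly greater than eta
    and concatenate them (in their original order) into the context C."""
    kept = [text for text, score in scored if score > eta]
    return " ".join(kept)


# Hypothetical selector outputs: (paragraph text, relevance score in [0, 1]).
scored_paragraphs = [
    ("Paragraph A about the query entity.", 0.92),
    ("Unrelated paragraph B.", 0.03),
    ("Paragraph C with a bridge entity.", 0.41),
]
context = select_paragraphs(scored_paragraphs)
```

A low threshold such as 0.1 trades precision for recall: a distractor that slips through only adds noise, whereas a dropped gold paragraph makes the question unanswerable.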

Constructing Entity Graph
We do not assume a global knowledge base. Instead, we use the Stanford CoreNLP toolkit (Manning et al., 2014) to recognize named entities from the context $C$. We only adopt POL entities (Person, Organization, and Location) because they are more important. The number of extracted POL entities is denoted as $N$. The entity graph is constructed with the entities as nodes and edges built as follows. The edges are added 1. for every pair of entities that appear in the same sentence in $C$ (sentence-level links); 2. for every pair of entities with the same mention text in $C$ (context-level links); and 3. between a central entity node and other entities within the same paragraph (paragraph-level links). The central entities are extracted from the title sentence of each paragraph. Notice that the context-level links ensure that entities across multiple documents are connected in a certain way. We do not apply co-reference resolution for pronouns because it introduces both additional useful links and erroneous ones.
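The three edge rules above can be sketched as follows. This is only an illustration: the `Mention` record, its field names, and the toy mentions are hypothetical, and mention texts are compared by exact string equality, which is an assumption of this sketch.

```python
from itertools import combinations
from typing import List, NamedTuple, Set, Tuple


class Mention(NamedTuple):
    text: str        # surface form of the entity mention
    paragraph: int   # index of the paragraph inside the context C
    sentence: int    # index of the sentence inside that paragraph
    central: bool    # True if the mention comes from the paragraph title


def build_entity_graph(mentions: List[Mention]) -> Set[Tuple[int, int]]:
    """Return undirected edges (i, j) with i < j between mention indices."""
    edges = set()
    for (i, a), (j, b) in combinations(enumerate(mentions), 2):
        same_paragraph = a.paragraph == b.paragraph
        sentence_link = same_paragraph and a.sentence == b.sentence
        context_link = a.text == b.text  # same mention text anywhere in C
        paragraph_link = same_paragraph and (a.central or b.central)
        if sentence_link or context_link or paragraph_link:
            edges.add((i, j))
    return edges


# Toy context with two paragraphs; entity names are placeholders.
mentions = [
    Mention("Org X", 0, 0, True),      # 0: title entity of paragraph 0
    Mention("City Y", 0, 1, False),    # 1
    Mention("Person Z", 0, 1, False),  # 2
    Mention("Org X", 1, 2, False),     # 3: same text as mention 0
    Mention("City W", 1, 2, False),    # 4
]
edges = build_entity_graph(mentions)
```

In this toy graph the only connection between the two paragraphs is the context-level link (0, 3), which is the kind of bridge that multi-hop reasoning depends on.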

Encoding Query and Context
We concatenate the query $Q$ with the context $C$ and pass the resulting sequence to a pre-trained BERT model to obtain representations $Q = [q_1, \ldots, q_L] \in \mathbb{R}^{L \times d_1}$ and $C = [c_1, \ldots, c_M] \in \mathbb{R}^{M \times d_1}$, where $L$ and $M$ are the lengths of the query and the context, and $d_1$ is the size of the BERT hidden states.
In experiments, we find that concatenating queries and contexts performs better than passing them separately to BERT. The representations are further passed through a bi-attention layer (Seo et al., 2016) to enhance cross interactions between the query and the context. In practice, we find that adding the bi-attention layer achieves better performance than using the BERT encoding alone. The output representations are $Q_0 \in \mathbb{R}^{L \times d_2}$ and $C_0 \in \mathbb{R}^{M \times d_2}$, where $d_2$ is the output embedding size.

Reasoning with the Fusion Block
With the embeddings calculated for the query $Q$ and context $C$, the remaining challenge is how to identify supporting entities and the text span of potential answers. We propose a fusion block to mimic the one-step reasoning behavior of humans: starting from $Q_0$ and $C_0$ and finding one-step supporting entities. A fusion block achieves the following: 1. passing information from tokens to entities by computing entity embeddings from tokens (Doc2Graph flow); 2. propagating information on the entity graph; and 3. passing information from the entity graph to document tokens, since the final prediction is made on tokens (Graph2Doc flow). Fig. 3 depicts the internal structure of the fusion block in DFGN.
Document to Graph Flow. Since each entity is recognized by an NER tool, the text spans associated with the entities are used to compute entity embeddings (Doc2Graph). To this end, we construct a binary matrix $M$, where $M_{i,j}$ is 1 if the $i$-th token in the context is within the span of the $j$-th entity. $M$ is used to select the text span associated with an entity. The token embeddings calculated in the above section (a matrix containing only the selected token embeddings of $C_{t-1}$) are passed into a mean-max pooling layer to calculate entity embeddings $E_{t-1} = [e_{t-1,1}, \ldots, e_{t-1,N}]$. $E_{t-1}$ is of size $2d_2 \times N$, where $N$ is the number of entities, and each of the $d_2$ embedding dimensions yields both a mean-pooling and a max-pooling result. This module is denoted as Tok2Ent.
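A minimal sketch of mean-max pooling over entity spans is given below, using plain Python lists. The function name `tok2ent` and the toy values are hypothetical. The sketch returns one row per entity (the transpose of the $2d_2 \times N$ layout used in the text) and assumes every entity covers at least one token.

```python
from typing import List


def tok2ent(tokens: List[List[float]], span_matrix: List[List[int]]) -> List[List[float]]:
    """tokens: one d-dimensional embedding per context token.
    span_matrix[i][j] == 1 iff token i lies inside the span of entity j.
    Returns one 2d-dimensional vector per entity: [mean-pooled ; max-pooled]."""
    num_entities = len(span_matrix[0])
    entities = []
    for j in range(num_entities):
        # Select the token embeddings that belong to entity j.
        span = [tok for tok, row in zip(tokens, span_matrix) if row[j] == 1]
        mean_part = [sum(col) / len(span) for col in zip(*span)]
        max_part = [max(col) for col in zip(*span)]
        entities.append(mean_part + max_part)
    return entities


# Three tokens with d = 2; entity 0 spans tokens 0-1, entity 1 spans token 2.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
span_matrix = [[1, 0], [1, 0], [0, 1]]
entity_embeddings = tok2ent(tokens, span_matrix)
```

For a single-token entity the mean and max halves coincide, so the doubling of the dimension only carries extra information for multi-token mentions.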
Dynamic Graph Attention. After obtaining entity embeddings from the input context $C_{t-1}$, we apply a graph neural network to propagate node information to neighboring nodes. We propose a dynamic graph attention mechanism to mimic the step-by-step exploring and reasoning behavior of humans. In each reasoning step, we assume every node has some information to disseminate to its neighbors. The more relevant a node is to the query, the more information its neighbors receive from it. We first identify nodes relevant to the query by creating a soft mask on entities. It serves as an information gatekeeper: only those entity nodes pertaining to the query are allowed to disseminate information. We use an attention network between the query embeddings and the entity embeddings to predict a soft mask $m_t$, which aims to signify the start entities in the $t$-th reasoning step:
$$\tilde{q}_{t-1} = \mathrm{MeanPooling}(Q_{t-1}) \tag{1}$$
$$\gamma_{t,i} = \frac{\tilde{q}_{t-1} V_t \, e_{t-1,i}}{\sqrt{d_2}} \tag{2}$$
$$m_t = \sigma([\gamma_{t,1}, \ldots, \gamma_{t,N}]) \tag{3}$$
$$\tilde{E}_{t-1} = [m_{t,1} e_{t-1,1}, \ldots, m_{t,N} e_{t-1,N}] \tag{4}$$
where $V_t$ is a linear projection matrix, and $\sigma$ is the sigmoid function. By multiplying the soft mask and the initial entity embeddings, the desired start entities will be encouraged and others will be penalized. As a result, this step of information propagation is restricted to a dynamic sub-part of the entity graph. The next step is to disseminate information across the dynamic sub-graph.
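The soft-mask step can be sketched in a few lines of plain Python. The sketch assumes the query is mean-pooled into a single vector and that the relevance score is a bilinear form with $V_t$ scaled by $\sqrt{d_2}$; the function names and toy numbers are hypothetical, and a real implementation would use learned parameters and batched tensor operations.

```python
import math
from typing import List


def soft_mask(query: List[List[float]], entities: List[List[float]],
              V: List[List[float]]) -> List[float]:
    """query: L x d2 token embeddings; entities: N x 2*d2; V: d2 x 2*d2.
    Returns one mask value in (0, 1) per entity."""
    d2 = len(query[0])
    # Mean-pool the query tokens into a single d2-dimensional vector.
    q_mean = [sum(col) / len(query) for col in zip(*query)]
    # Project the pooled query into the entity space: q_mean @ V.
    q_proj = [sum(q_mean[r] * V[r][c] for r in range(d2)) for c in range(len(V[0]))]
    mask = []
    for e in entities:
        gamma = sum(a * b for a, b in zip(q_proj, e)) / math.sqrt(d2)
        mask.append(1.0 / (1.0 + math.exp(-gamma)))  # sigmoid
    return mask


def apply_mask(entities: List[List[float]], mask: List[float]) -> List[List[float]]:
    """Scale each entity embedding by its mask value."""
    return [[m * x for x in e] for e, m in zip(entities, mask)]


query = [[1.0, 0.0], [1.0, 2.0]]           # L = 2, d2 = 2
V = [[1, 0, 0, 0], [0, 1, 0, 0]]            # toy projection, d2 x 2*d2
entities = [[0.0, 0.0, 5.0, 5.0],           # orthogonal to the projected query
            [10.0, 10.0, 0.0, 0.0]]         # strongly aligned with it
mask = soft_mask(query, entities, V)
masked_entities = apply_mask(entities, mask)
```

Because the mask is soft, an entity that is unrelated to the query is damped rather than removed, which keeps the whole step differentiable.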
Inspired by GAT (Veličković et al., 2018), we compute the attention score $\alpha$ between two entities by:
$$h_{t,i} = U_t \tilde{e}_{t-1,i} + b_t \tag{5}$$
$$\beta_{t,i,j} = \mathrm{LeakyReLU}(W_t^{\top} [h_{t,i}, h_{t,j}]) \tag{6}$$
$$\alpha_{t,i,j} = \frac{\exp(\beta_{t,i,j})}{\sum_k \exp(\beta_{t,i,k})} \tag{7}$$
where $\tilde{e}_{t-1,i}$ is the masked embedding of entity $i$, $U_t \in \mathbb{R}^{d_2 \times 2d_2}$ and $W_t \in \mathbb{R}^{2d_2}$ are linear projection parameters, and $b_t$ is a bias. Here the $i$-th row of $\alpha$ represents the proportion of information that will be assigned to the neighbors of entity $i$. Note that the information flow in our model is different from most previous GATs. In dynamic graph attention, each node sums over its column, which forms a new entity state containing the total information it received from the neighbors:
$$e_{t,i} = \mathrm{ReLU}\Big(\sum_{j \in B_i} \alpha_{t,j,i} \, h_{t,j}\Big) \tag{8}$$
where $B_i$ is the set of neighbors of entity $i$. Then we obtain the updated entity embeddings $E_t = [e_{t,1}, \ldots, e_{t,N}]$.
Updating Query. A reasoning chain contains multiple steps, and the entities newly visited in one step will be the start entities of the next step. In order to predict the expected start entities for the next step, we introduce a query update mechanism, where the query embeddings are updated by the entity embeddings of the current step. In our implementation, we utilize a bi-attention network (Seo et al., 2016) to update the query embeddings:
$$Q_t = \mathrm{Bi\text{-}Attention}(Q_{t-1}, E_t) \tag{9}$$
Graph to Document Flow. Using Tok2Ent and dynamic graph attention, we realize a reasoning step at the entity level. However, the unrestricted answer still cannot be traced back. To address this, we develop a Graph2Doc module to keep information flowing from entities back to tokens in the context, so that the text span pertaining to the answer can be localized in the context. Using the same binary matrix $M$ as described above, the previous token embeddings in $C_{t-1}$ are concatenated with the associated entity embedding corresponding to the token. Each row in $M$ corresponds to one token, therefore we use it to select one entity's embedding from $E_t$ if the token participates in the entity's mention. This information is further processed with an LSTM layer (Hochreiter and Schmidhuber, 1997) to produce the next-level context representation:
$$C_t = \mathrm{LSTM}([C_{t-1}; M E_t^{\top}]) \tag{10}$$
where $;$ refers to concatenation and $C_t \in \mathbb{R}^{M \times d_2}$ serves as the input of the next fusion block. At this point, the reasoning information of the current sub-graph has been propagated onto the whole context.
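The row-wise normalisation and column-wise aggregation of the dynamic graph attention step can be sketched as follows. Several choices are assumptions of this sketch rather than details taken from the text: the softmax is restricted to the neighbors of each node, the LeakyReLU slope is 0.2, the projected states `h` are given directly, and the graph is undirected.

```python
import math
from typing import Dict, List, Tuple


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def dynamic_graph_attention(
    h: List[List[float]], neighbors: Dict[int, List[int]], W: List[float]
) -> Tuple[List[List[float]], List[List[float]]]:
    """h: N x d projected entity states; W: attention vector of length 2*d.
    Returns (alpha, new_states). Row i of alpha says how node i splits its
    information among its neighbors; node i's new state sums column i."""
    n, d = len(h), len(h[0])
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if not neighbors[i]:
            continue  # isolated node: nothing to send
        beta = {j: leaky_relu(sum(w * x for w, x in zip(W, h[i] + h[j])))
                for j in neighbors[i]}
        z = sum(math.exp(b) for b in beta.values())
        for j, b in beta.items():
            alpha[i][j] = math.exp(b) / z  # row i sums to 1
    new_states = []
    for i in range(n):
        received = [sum(alpha[j][i] * h[j][k] for j in neighbors[i]) for k in range(d)]
        new_states.append([max(0.0, v) for v in received])  # ReLU
    return alpha, new_states


# Star graph: node 0 is linked to nodes 1 and 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [1, 2], 1: [0], 2: [0]}
alpha, new_states = dynamic_graph_attention(h, neighbors, W=[0.0, 0.0, 0.0, 0.0])
```

With a zero attention vector every node splits its information evenly, so the hub node 0 receives the full state of both leaves while each leaf receives half of the hub's state. The amount a node receives is therefore not normalised, which is the difference from a standard GAT noted in the text.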

Prediction
We follow the same structure of prediction layer as . The framework has four output dimensions, including 1. supporting sentences, 2. start position of the answer, 3. end position of the answer, and 4. answer type. We use a cascade structure to solve the output dependency, where four isomorphic LSTMs $F_i$ are stacked layer by layer. The context representation of the last fusion block is sent to the first LSTM $F_0$. Each $F_i$ outputs a logit $O \in \mathbb{R}^{M \times d_2}$ and computes a cross entropy loss over these logits.
We jointly optimize these four cross entropy losses. Each loss term is weighted by a coefficient.
Weak Supervision. In addition, we introduce a weakly supervised signal to induce the soft masks at each fusion block to match the heuristic masks. For each training case, the heuristic masks contain a start mask detected from the query, and additional BFS masks obtained by applying breadth-first search (BFS) on the adjacency matrices given the start mask. A binary cross entropy loss between the predicted soft masks and the heuristic masks is then added to the objective. We skip those cases whose start masks cannot be detected from the queries.
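One possible way to derive such heuristic masks is sketched below. The text does not say whether a BFS mask marks only the newly reached entities or all entities reached so far; this sketch assumes the former (the BFS frontier), and the function names are hypothetical.

```python
import math
from typing import List, Set


def bfs_masks(adj: List[List[int]], start: Set[int], hops: int) -> List[List[int]]:
    """adj: N x N 0/1 adjacency matrix; start: entity indices found in the query.
    Returns `hops` binary masks: the start mask, then one frontier per BFS level."""
    n = len(adj)
    visited = set(start)
    frontier = set(start)
    masks = [[1 if i in frontier else 0 for i in range(n)]]
    for _ in range(hops - 1):
        frontier = {j for i in frontier for j in range(n)
                    if adj[i][j] and j not in visited}
        visited |= frontier
        masks.append([1 if i in frontier else 0 for i in range(n)])
    return masks


def binary_cross_entropy(pred: List[float], target: List[int]) -> float:
    """Mean BCE between a predicted soft mask and a heuristic 0/1 mask."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for p, t in zip(pred, target)]
    return sum(terms) / len(terms)


# Chain graph 0 - 1 - 2 - 3 with entity 0 detected in the query.
adj = [[0, 1, 0, 0],
       [1, 0, 1, 0],
       [0, 1, 0, 1],
       [0, 0, 1, 0]]
heuristic = bfs_masks(adj, start={0}, hops=2)
loss = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], heuristic[0])
```

A case with an empty start set would produce all-zero masks, which matches the remark that such cases are skipped rather than used for supervision.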

Experiments
We evaluate our Dynamically Fused Graph Network (DFGN) on HotpotQA in the distractor setting. In the full wiki setting, where the entire set of Wikipedia articles is given as input, we consider information retrieval to be the bottleneck, so we do not include the full wiki setting in our experiments.

Implementation Details
In the paragraph selection stage, we use the uncased version of the BERT tokenizer to tokenize all passages and questions. The encoding vectors of sentence pairs are generated from a pre-trained bert-base-uncased model. We set a relatively low threshold during selection to keep a high recall (97%) and a reasonable precision (69%) on supporting facts.
In the graph construction stage, we use a pre-trained NER model from the Stanford CoreNLP toolkit (Manning et al., 2014) to extract named entities. The maximum number of entities in a graph is set to 40. Entity nodes in the graphs have an average degree of 3.52.
In the encoding stage, we also choose the bert-base-uncased model as the encoder, thus $d_1$ is 768. All the hidden state dimensions $d_2$ are set to 300. We set the dropout rate for all hidden units of the LSTM and the dynamic graph attention to 0.3 and 0.5 respectively. For optimization, we use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of $10^{-4}$.

Main Results
We first present a comparison between baseline models and our DFGN. Table 1 shows the performance of different models on the private test set of HotpotQA. From the table we can see that our model achieves the second best result on the leaderboard (as of March 1st). Besides, the answer performance and the joint performance of our model are competitive against state-of-the-art unpublished models.
To evaluate the performance of different components in our DFGN, we perform an ablation study on both model components and dataset segments. Here we follow the experiment setting of Yang et al.; the results are shown in Table 2. From the table we can see that each of our model components provides a 1% to 2% relative gain in QA performance. In particular, using a 1-layer fusion block leads to an obvious performance loss, which indicates the importance of performing multi-hop reasoning in HotpotQA. Besides, the dataset ablation results show that our model is not very sensitive to the noisy paragraphs in the distractor setting, compared with the baseline model, which achieves more than 5% performance gain when restricted to gold paragraphs and supporting facts.

Evaluation on Graph Construction and Reasoning Chain
The chain of reasoning is a directed path on the entity graph, so high-quality entity graphs are the basis of good reasoning. Owing to the limited accuracy of the NER model and the incompleteness of our graph construction, 31.3% of the cases in the development set cannot support a complete reasoning process: at least one supporting sentence is not reachable through the entity graph, i.e. no entity is recognized by the NER model in this sentence. We refer to such cases as "missing supporting entity", and the ratio of such cases can be used to evaluate the quality of graph construction. We focus on the remaining 68.7% of good cases in the following analysis. Below we give several definitions before presenting ESP (Entity-level Support) scores.
Path. A path is a sequence of entities visited by the fusion blocks, denoted as $P = [e_{p_1}, \ldots, e_{p_{t+1}}]$ (assuming $t$ fusion blocks).
Path Score. The score of a path is obtained by multiplying the corresponding soft masks and attention scores along the path, i.e. $\mathrm{score}(P) = \prod_{i=1}^{t} m_{i,p_i} \, \alpha_{i,p_i,p_{i+1}}$ (Eq. (3), (7)).
Hit. Given a path and a supporting sentence, if at least one entity of the supporting sentence is visited by the path, we say that this supporting sentence is hit.
Given a case with $m$ supporting sentences, we select the top-$k$ paths with the highest scores as the predicted reasoning chains. For each case, we use the $k$ paths to calculate how many supporting sentences are hit.
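Path scoring and top-k selection can be sketched as below. The exhaustive enumeration over all entity sequences is an assumption made for brevity; it is only practical because the graphs are small and there are few fusion blocks. Paths that use a non-edge have zero attention and are dropped. All names and numbers are hypothetical.

```python
import heapq
from itertools import product
from typing import List, Sequence, Tuple


def path_score(path: Sequence[int], masks: List[List[float]],
               alphas: List[List[List[float]]]) -> float:
    """Product over steps i of mask_i[p_i] * alpha_i[p_i][p_{i+1}]."""
    score = 1.0
    for i in range(len(path) - 1):
        score *= masks[i][path[i]] * alphas[i][path[i]][path[i + 1]]
    return score


def top_k_paths(masks: List[List[float]], alphas: List[List[List[float]]],
                k: int) -> List[Tuple[float, Tuple[int, ...]]]:
    """Enumerate every entity sequence of length t + 1 and keep the k best."""
    n, t = len(masks[0]), len(masks)
    scored = [(path_score(p, masks, alphas), p) for p in product(range(n), repeat=t + 1)]
    return heapq.nlargest(k, [sp for sp in scored if sp[0] > 0])


# Two fusion blocks (t = 2) on a 3-entity chain 0 - 1 - 2.
masks = [[0.9, 0.1, 0.1],
         [0.2, 0.8, 0.1]]
alphas = [
    [[0.0, 1.0, 0.0], [0.5, 0.0, 0.5], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [0.25, 0.0, 0.75], [0.0, 1.0, 0.0]],
]
best = top_k_paths(masks, alphas, k=2)
```

Here the best path starts at the entity with the highest first mask and moves through the entity with the highest second mask, i.e. (0, 1, 2).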
In the following we introduce two scores to evaluate the quality of multi-hop reasoning through entity-level support (ESP).
ESP EM (Exact Match). For a case with $m$ supporting sentences, if all $m$ sentences are hit, we say that this case is an exact match. The ESP EM score is the ratio of exactly matched cases.
ESP Recall. For a case with $m$ supporting sentences of which $h$ are hit, the recall score is $h/m$. The average recall score over the whole dataset is the ESP Recall.
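Both scores can be computed with a few lines of code. The sketch below assumes that a supporting sentence counts as hit if any of the top-k paths visits one of its entities, and it represents each supporting sentence by the set of entity indices it contains; the data layout and names are hypothetical.

```python
from typing import List, Sequence, Set, Tuple

Case = Tuple[List[Sequence[int]], List[Set[int]]]  # (top-k paths, supporting sentences)


def count_hits(paths: List[Sequence[int]], supporting: List[Set[int]]) -> int:
    """Number of supporting sentences with at least one entity on some path."""
    visited = {entity for path in paths for entity in path}
    return sum(1 for sentence in supporting if visited & sentence)


def esp_scores(cases: List[Case]) -> Tuple[float, float]:
    """Return (ESP EM, ESP Recall) averaged over all cases."""
    exact, recall = 0, 0.0
    for paths, supporting in cases:
        h, m = count_hits(paths, supporting), len(supporting)
        exact += int(h == m)
        recall += h / m
    return exact / len(cases), recall / len(cases)


cases = [
    ([(0, 1, 2)], [{0}, {2}]),   # both supporting sentences are hit
    ([(3, 4, 5)], [{3}, {7}]),   # only the first one is hit
]
esp_em, esp_recall = esp_scores(cases)
```

Cases in which a supporting sentence contains no recognized entity can never be exact matches under this definition, which is why the analysis restricts itself to cases without missing supporting entities.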
We train a DFGN with 2 fusion blocks to select paths with top-$k$ scores. On the development set, the average number of paths of length 2 is 174.7. We choose $k$ as 1, 2, 5, 10 to compute ESP EM and ESP Recall scores. As we can see in Table 3, regarding the supporting sentences as the ground truth of reasoning chains, our framework can predict reliable information flow. The most informative flow can cover the supporting facts and help produce reliable reasoning results.
Figure 4 illustrates the reasoning process in a DFGN with 2-layer fusion blocks. At the first step, by comparing the query with entities, our model generates Mask1 as the start entity mask of reasoning, where "Barrack" and "British Army Lynx" are detected as the start entities of two reasoning chains. Information from the two start entities is then passed to their neighbors on the entity graph. At the second step, mentions of the same entity "IRA" are detected by Mask2, serving as a bridge for propagating information across two paragraphs. Finally, the two reasoning chains are linked together by the bridge entity IRA, which is exactly the answer.

Case Study
The lower part in Figure 4 shows a bad case in our experiments. Due to the malfunction of the NER module, the only start entity, "Farrukhzad Khosrau V", was not successfully detected. Without the start entities, the reasoning chains cannot be established, and further information flow on entity graph is blocked at the first step.

Conclusion
We introduce the Dynamically Fused Graph Network (DFGN) to address multi-hop reasoning. Specifically, we propose a dynamic fusion reasoning block based on graph neural networks. Different from previous approaches in QA, DFGN is capable of predicting the sub-graphs dynamically at each reasoning step, and the entity-level reasoning is fused with token-level context. We evaluate DFGN on HotpotQA and achieve leading results. Besides, our analysis shows DFGN can produce reliable and explainable reasoning chains. In the future, we may incorporate new advances in building entity graphs from texts, and solve harder reasoning problems, e.g. "comparison" in HotpotQA.