Modelling Long-distance Node Relations for KBQA with Global Dynamic Graph

The structural information of Knowledge Bases (KBs) has proven effective for Question Answering (QA). Previous studies rely on deep graph neural networks (GNNs) to capture rich structural information, but such networks may fail to model relations between nodes that are far apart because of the oversmoothing issue. To address this challenge, we propose a novel framework, GlobalGraph, which models long-distance node relations from two views: 1) Node type similarity: GlobalGraph assigns each node a global type label and models long-distance node relations through the similarity of these labels; 2) Correlation between nodes and questions: we learn similarity scores between nodes and the question, and model long-distance node relations through the sum of the two nodes' scores. We conduct extensive experiments on two widely used multi-hop KBQA datasets to demonstrate the effectiveness of our method.


Introduction
Knowledge bases have become critical resources in a variety of natural language processing applications. A KB such as Freebase (Bollacker et al., 2008) typically contains millions of facts in the form of subject-predicate-object triples, each of which describes a relation between two entities. Such rich structural information has proven effective in KB-based Question Answering (KBQA) tasks, which aim to find the answer entities to a factoid question using facts in the target KB (Zhou et al., 2018; Zhang et al., 2018).
Early studies on KBQA are mainly based on neural network models (Dong et al., 2015; Das et al., 2017), which model the similarity between the factoid question and the entities in the KB. Although these methods are effective, they do not fully utilize the structural information in the KB, which is essential in the reasoning process (Sun et al., 2018). To address this limitation, recent studies (Sun et al., 2019; Xiong et al., 2019) focus on graph neural networks (GNNs), which update nodes by aggregating their neighbor information in graphs. This update pattern allows GNNs to capture structural information. However, since a GNN is a special form of Laplacian smoothing (Li et al., 2018), stacking multiple GNN layers may oversmooth node features and reduce the discriminative power of the graph embedding. Because of this weakness, conventional GNNs are poor at modeling long-distance node relations, which are essential for GNN reasoning (Wu et al., 2019).
In this paper, to address the above limitations, we propose a novel framework, GlobalGraph, which models long-distance node relations from two views: 1) modeling node relations by predicting whether two nodes have the same type label; 2) modeling node relations by predicting whether two nodes are both correlated with the question. For the 1st view, we assign a global type label to each node according to its neighbor relation information, and then model the long-distance node relations by their global label similarity. Relations carry node label information, and the relations around nodes of the same type should be similar. For example, as shown in Figure 1, there are two triples: ($N_3$, directed by, $N_1$) and ($N_4$, directed by, $N_5$). Based on the relation "directed by" in these two triples, we regard $N_1$ as a person, the same as $N_5$. The type labels of the two nodes are the same, so we connect the two nodes. Based on the new graph, the GNN propagates information across long-distance nodes.
Figure 1: a) We display the original graph with different types of relations (shown with different colored edges). Taking the relation "directed by" as an example, we can infer that the type of $N_1$ and $N_5$, which are connected to this relation, is "person". The two nodes are therefore marked with the same color, indicating that they have the same label. The same holds for $N_2$ and $N_6$, and for $N_3$ and $N_4$. Through this process, we assign each node a global label through its neighbor relation set. It is worth noting that the larger the neighbor relation set, the more confident the model is in inferring the similarity of two nodes' labels. b) The confidence score is represented by the thickness of the dashed lines.
For the 2nd view, connecting only nodes with the same label is insufficient, because a normal KB contains a huge number of long-distance node pairs with different labels, which would not be utilized for GNN reasoning. For a specific node pair, even if the node labels are not similar, information propagation between the two nodes is still useful for reasoning when both are related to the question. Based on this, we dynamically select nodes related to the current question and fully connect them to construct a dynamic graph. Finally, we apply a GNN to perform information propagation and reasoning. Using the two views, we model global node features through long-distance nodes, and then combine them with the local node features of conventional GNNs to perform answer prediction.
The main contributions of this paper can be summarized as follows:
• We propose a novel idea to assign type labels to nodes based on their neighbor relation information, and introduce a novel model to enable GNNs to capture long-distance node information from two views: 1) node type similarity; 2) correlation between nodes and questions, which overcomes the shallow node representation in GNNs.
• We conduct extensive experiments on MetaQA (Zhang et al., 2018) and PQL (Zhou et al., 2018), and the results demonstrate the effectiveness of our model.

Neural Network-based Question Answering
Neural network-based KBQA can be divided into two categories: single-hop QA and multi-hop QA. Single-hop models (Bordes et al., 2014; Xu et al., 2016) predict the answer from one fact triple, which can be retrieved by judging the similarity between the question and the relations in triples. Although these models perform well in answer prediction, they are insufficient for multi-hop QA tasks, because multi-hop QA involves complex questions that require reasoning across multiple triples to reach the answers. To perform reasoning, Jiang and Bansal (2019) propose a self-assembling network to assemble the reasoning modules; Yavuz et al. (2017) consider a continuous checking mechanism to judge the correctness of answer evidence; Zhang et al. (2018) utilize a variational learning algorithm for multi-hop reasoning; Wang et al. (2019b) explore additional knowledge bases to improve natural language inference; Mitra et al. (2019) translate the question and the KB into a logical representation and then use logical reasoning. However, these models do not consider graph structural information, which is important for multi-hop reasoning.

Graph Neural Networks based Question Answering
Supported by a number of studies on graph representation learning (Kipf and Welling, 2017; Schlichtkrull et al., 2018; Wang et al., 2019a), graph neural networks (GNNs) have shown a powerful ability in graph analysis. A large number of GNN-based algorithms have been designed to perform graph reasoning, such as R-GCN (Schlichtkrull et al., 2018), GRAFT-Net (Sun et al., 2018), HGMAN (Wang et al., 2020) and BAG (Cao et al., 2019), in which nodes update themselves by aggregating the information of neighboring nodes. A node can capture information from unconnected nodes through multiple GNN layers. Since a GNN is a special form of Laplacian smoothing, stacking multiple GNN layers may oversmooth the features of nodes from different clusters and reduce the discriminative power of the graph embedding (Li et al., 2018). Therefore, most GNN models have fewer than two layers. Because information propagates through only a limited number of layers, conventional GNNs perform poorly in modeling long-distance node relations. Xiao et al. (2019) and Zhuang and Ma (2018) attempt to model long-distance node relations under the guidance of pre-defined node type labels. However, most KBQA datasets do not provide pre-defined node type labels, which makes the above methods inapplicable. Different from the previous work, we first assign a global label to each node by modeling its surrounding relation structure, and then derive the long-distance node relations from the global labels.
Figure 2: a) The source graph without node labels. b) We assign a global label to each node based on the connected relations. c) The information propagation based on label similarity. d) The information propagation based on the question-aware subgraph. Through (b, c, d), the model outputs the long-distance propagation results. We combine them with the Conventional GNNs Information Propagation results to predict the answers.

Task Definition
Let $K = (V, E, R)$ denote a knowledge graph, where $V$ is the set of entities and $R$ is the set of relations in the KB. $E$ consists of a set of triples $(e_h, r, e_t)$, each of which states that the relation $r \in R$ holds between $e_h \in V$ and $e_t \in V$. Given a natural language question $Q = (w_1, w_2, \ldots, w_{|q|})$, where $w_i$ denotes the $i$th word, the model needs to extract its answer from $V$. The overview of our model is shown in Figure 2.
The rest of the Model Section is organized as follows: Subsection 3.2 discusses how to encode the factoid question and knowledge graph. Subsection 3.3 describes the information propagation method of conventional GNNs. Subsections 3.4.1 and 3.4.2 discuss how to assign global type labels to each node and propagate information among nodes with similar labels. Subsection 3.4.3 explains the construction of the question-aware dynamic graph. Finally, Subsection 3.5 discusses the answer prediction.

Input Encoder
The input encoder maps the given natural language question and all the candidate entities (in the KB) to vector representations.
Question Initialization. We pass the word sequence of the question $Q$ to a long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997):
$$q = \mathrm{LSTM}(w_1, w_2, \ldots, w_{|q|}) \tag{1}$$
where $q \in \mathbb{R}^m$ is the last state of the LSTM output and $m$ is the hidden state size. We use $q$ to represent the question.
Node Initialization. First, all of the nodes are represented by pre-trained word vectors or randomly initialized vectors. The vector of node $e_v$ is denoted $w_v \in \mathbb{R}^n$, where $n$ is the embedding size. The seed nodes are nodes that can be connected to the question through entity linking (Sun et al., 2018). The input encoder also embeds the average distance from the node $e_v$ to the seed nodes as $d_v \in \mathbb{R}^n$. For simplicity, $d_v$ can be represented with the embedding of the words "0", "1", "2", etc. With the distance embedding $d_v$ and word embedding $w_v$, the node $e_v$ is represented as $n_v$, which is defined as:
$$n_v = [w_v; d_v]\, W_N \tag{2}$$
where $[;]$ is column-wise concatenation and $W_N \in \mathbb{R}^{2n \times n}$ is a learned parameter matrix. By adding distance information, nodes can better update themselves according to the number of hops needed to infer the answers to the current question.

Conventional GNNs Information Propagation
Similar to previous works (Sun et al., 2018; Xiong et al., 2019), we use conventional GNN methods to capture local information. A node captures its local information by aggregating the information of its real neighbors in the source graph.
To enable each node to capture the current question information, we concatenate each node representation $n_v$ with the question $q$, giving $h_v^0 = [n_v; q]$. The node then updates itself by aggregating its neighbors' information, which is defined as:
$$u_v^{l+1} = \sum_{r \in R} \sum_{j \in N_v^r} \frac{1}{c_{v,r}} W_r^l h_j^l \tag{3}$$
where $N_v^r$ represents the set of neighbor indices of node $v$ under relation $r \in R$. $c_{v,r}$ is a normalization constant that can be learned or set directly, such as $c_{v,r} = |N_v^r|$. $W_r^l \in \mathbb{R}^{d_{l+1} \times d_l}$ is a learnable parameter matrix, $0 \le l < L$, and $L$ is the number of layers in the model. $h_j^l$ denotes the hidden state of node $e_j$ at the $l$th layer.
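The relation-wise aggregation described above can be sketched in plain Python. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary-based data layout, and the choice that a tail node receives messages from its head nodes are all assumptions, and the normalization uses the option $c_{v,r} = |N_v^r|$ mentioned in the text.

```python
from collections import defaultdict

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def aggregate(triples, h, W):
    """Compute an update message u_v for every node.

    triples: list of (head, relation, tail); assumption: the tail
             node receives a message from the head node.
    h:       dict node -> hidden state (list of floats).
    W:       dict relation -> weight matrix (list of rows).
    """
    # N_v^r: neighbours of node v under relation r.
    neighbours = defaultdict(list)
    for head, rel, tail in triples:
        neighbours[(tail, rel)].append(head)

    dim_out = len(next(iter(W.values())))
    u = {v: [0.0] * dim_out for v in h}
    for (v, rel), js in neighbours.items():
        c = len(js)  # normalization constant c_{v,r} = |N_v^r|
        for j in js:
            msg = matvec(W[rel], h[j])
            u[v] = [a + m / c for a, m in zip(u[v], msg)]
    return u

# Toy example: two movies directed by the same person.
triples = [("m1", "directed_by", "p1"), ("m2", "directed_by", "p1")]
h = {"m1": [1.0, 0.0], "m2": [0.0, 1.0], "p1": [0.0, 0.0]}
W = {"directed_by": [[1.0, 0.0], [0.0, 1.0]]}
u = aggregate(triples, h, W)
```

In this toy graph the person node receives the average of the two movie states, while the movie nodes receive no message because no edge points to them.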
A gate mechanism decides how much of the update message $u_v^{l+1}$ propagates to the next layer. Gate levels are computed with a linear function $f_a$. Ultimately, the next-layer representation $h_v^{l+1}$ of the node $e_v$ is a gated combination of the previous representation $h_v^l$ and a non-linear transformation of the update message, where $\phi(\cdot)$ is any nonlinear function and $\odot$ stands for element-wise multiplication.
The model stacks $L$ such layers. Through $L$ rounds of convolution, each node repeatedly updates its own state, which simulates the reasoning process. Finally, we obtain the node representation $h_v^L$. However, such GNNs cannot propagate information between two long-distance nodes because the number of layers is limited. To overcome this challenge, in the next section, we introduce how to capture the long-distance node relations and propagate information based on them.
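One common way to write such a gate is shown below, as a hedged reading of the description rather than the authors' exact formulation. The gate symbol $\alpha_v^l$, the sigmoid $\sigma$, and the choice to feed the concatenation of the update message and the previous state into $f_a$ are assumptions; the form also assumes that $u_v^{l+1}$ and $h_v^l$ have the same dimension.

```latex
\begin{align}
\alpha_v^{l} &= \sigma\!\left(f_a\!\left([\,u_v^{l+1};\, h_v^{l}\,]\right)\right), \\
h_v^{l+1} &= \phi\!\left(u_v^{l+1}\right) \odot \alpha_v^{l}
            + h_v^{l} \odot \left(1 - \alpha_v^{l}\right).
\end{align}
```

Under this reading, a gate value near 1 lets the transformed update message dominate, while a value near 0 keeps the previous representation largely unchanged.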

Model Global Node Type Labels
In this section, we introduce how to build a global label for a node from the relations connected to it, based on the observation that a relation implies the types of the nodes it connects. For example, in the movie domain, for a triple ($N_1$, directed by, $N_2$) whose relation is "directed by", we can infer that $N_2$ is a person and $N_1$ is a movie. The same holds for another triple ($N_3$, directed by, $N_4$). We therefore conclude that $N_2$ and $N_4$ have the same type label. Following this process, we first collect the connected relation set $Set_v$ of each node $e_v$, which is composed of $Set_v^{in}$, the set of relations pointing to node $e_v$, and $Set_v^{out}$, the set of relations pointing out from node $e_v$. We need to take the relation direction into account because, in the above example, $N_1$ and $N_2$ are both connected to the relation "directed by" but their labels are obviously different. Finally, we regard $Set_v$ as the global type label of node $e_v$.
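The label construction can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `global_type_labels` and the encoding of direction as an `("in", relation)` or `("out", relation)` pair are hypothetical choices, not taken from the paper.

```python
from collections import defaultdict

def global_type_labels(triples):
    """Map each node to its direction-aware relation set.

    triples: list of (head, relation, tail).
    A relation is stored as ("out", r) for the head node and
    ("in", r) for the tail node, so direction is preserved.
    """
    labels = defaultdict(set)
    for head, rel, tail in triples:
        labels[head].add(("out", rel))
        labels[tail].add(("in", rel))
    return dict(labels)

# The example from the text: two "directed by" triples.
triples = [("N1", "directed by", "N2"), ("N3", "directed by", "N4")]
labels = global_type_labels(triples)
```

With this encoding, `N2` and `N4` receive identical labels, while `N1` and `N2` differ even though both touch the same relation, which is the reason the text gives for keeping the direction.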

Information Propagation Based on Label Similarity
With the global node type labels, we calculate the similarity $s_{ij}$ between two nodes $e_i$ and $e_j$ from the overlap of their label sets $Set_i$ and $Set_j$, where $* \cap *$ represents the intersection of two sets and len($*$) is the number of elements in a set. We thus obtain the node similarity matrix $S \in \mathbb{R}^{|V| \times |V|}$, where $|V|$ is the number of nodes. Based on the node similarity matrix $S$, similar to Equation 3, we use a graph convolutional network (GCN) to perform information propagation, where $t_j^l$ denotes the hidden state of node $e_j$ at the $l$th layer and $t_j^0 = n_j$ (Equation 2). The update message $g_v^{l+1}$ passes through the gate mechanism (similar to Equations 4 and 5) to give the next-layer representation $t_v^{l+1}$. The model stacks $K$ such layers. Finally, we obtain the last-layer representation $t_v^K$ of the node $e_v$.
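A small sketch of building the similarity matrix from label sets is given below. The text only states that the score uses the set intersection and the set size; normalizing by the size of the union (Jaccard similarity) is an assumption made here for illustration, and the function names are hypothetical.

```python
def label_similarity(set_i, set_j):
    """Overlap of two label sets, normalized by their union.

    Assumption: the union is used as the normalizer; the source
    text does not state which normalizer is applied.
    """
    union = set_i | set_j
    if not union:
        return 0.0
    return len(set_i & set_j) / len(union)

def similarity_matrix(labels):
    """Return the sorted node list and the |V| x |V| matrix S."""
    nodes = sorted(labels)
    S = [[label_similarity(labels[a], labels[b]) for b in nodes]
         for a in nodes]
    return nodes, S

# Toy labels in the ("direction", relation) encoding.
labels = {
    "N1": {("out", "directed by")},
    "N2": {("in", "directed by")},
    "N4": {("in", "directed by"), ("out", "born in")},
}
nodes, S = similarity_matrix(labels)
```

Here `N2` and `N4` share one of two distinct relations and get a score of 0.5, while `N1` and `N2` get 0 because their relation directions differ.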

Information Propagation Based on Dynamic Question-aware Subgraph
In the above section, we assume that node pairs with higher label similarity are related. However, some node pairs are related to the current question even though their label similarity is low. The information propagation between them can play a positive role in predicting answers. In this section, we first select the nodes related to the factoid question, and then fully connect these nodes to construct a dynamic question-aware graph. With the dynamic graph, the model performs information propagation to capture question-related information.
We first obtain a representation of node $e_v$ through a learnable parameter matrix $W$. We then calculate the similarity $sq_v^l$ between node $e_v$ and the question $Q$ at layer $l$ using a question vector $q^l$, where $q^0 = q$ and $q^l$ is updated by summing the seed nodes' vectors of the $(l-1)$th layer. $sq_v^l \in [0, 1]$ represents the similarity confidence. We select the nodes whose $sq_v^l$ is greater than the threshold $t_q$ to form a node set, and then fully connect the nodes in this set to construct the question-aware dynamic graph. In the dynamic graph, the edge weight between node $e_i$ and node $e_j$ is the average of $sq_i^l$ and $sq_j^l$. Similar to Equation 10, we perform GCN on the dynamic graph and stack $J$ such layers. Finally, we obtain the last-layer representation $m_v^J$ of the node $e_v$.
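The graph construction step can be sketched as follows, assuming the node-question similarity scores have already been computed. The function name `build_dynamic_graph` and the dictionary layout are hypothetical; the thresholding, full connection, and averaged edge weights follow the description above.

```python
from itertools import combinations

def build_dynamic_graph(scores, threshold):
    """Build the question-aware dynamic graph.

    scores:    dict node -> similarity score in [0, 1].
    threshold: nodes with a score strictly greater than this
               value are kept (the threshold t_q in the text).
    Returns the selected nodes and a dict mapping each
    undirected edge (i, j) to its weight.
    """
    selected = sorted(v for v, s in scores.items() if s > threshold)
    edges = {}
    for i, j in combinations(selected, 2):
        # Edge weight is the average of the two node scores.
        edges[(i, j)] = (scores[i] + scores[j]) / 2.0
    return selected, edges

scores = {"a": 1.0, "b": 0.5, "c": 0.25}
selected, edges = build_dynamic_graph(scores, threshold=0.4)
```

In this toy case node `c` falls below the threshold and is dropped, leaving a single edge between `a` and `b` with weight 0.75.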

Answer Prediction
We concatenate the local propagation result $h_v^L$ and the global propagation result $m_v^J$ of each entity and pass the concatenation through a linear layer $f_{out}$ followed by the sigmoid function $\sigma$ to predict the answer distribution. $f_{out}$ reduces the dimension to 1.
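The prediction step, together with the binary cross-entropy loss used for training (described in the Loss section), can be sketched as below. The function names, the explicit weight vector standing in for $f_{out}$, the averaging of the loss over nodes, and the clipping constant `eps` are assumptions made for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict(h_local, m_global, weights, bias=0.0):
    """Score one entity from its local and global representations.

    The two vectors are concatenated and mapped to a single
    value by a linear layer (weights, bias), then squashed.
    """
    features = h_local + m_global  # list concatenation
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def bce_loss(probs, gold, eps=1e-12):
    """Binary cross-entropy averaged over all nodes (assumption)."""
    total = 0.0
    for p, y in zip(probs, gold):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)

p = predict([0.0], [0.0], weights=[1.0, 1.0])
loss = bce_loss([0.5, 0.5], [1, 0])
```

With zero inputs and zero bias the predicted probability is 0.5, and a prediction of 0.5 for every node gives a loss of $\ln 2$ regardless of the gold labels.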

Loss
The training loss is the binary cross-entropy loss of the final answer prediction, where $\theta$ represents the model parameters, $y$ is the gold distribution over entities, and $n$ is the number of nodes.
Entity linking is performed on the two datasets, MetaQA and PQL. We follow Xiong et al. (2019) and use simple surface-level matching to make fair comparisons. The statistics of the two datasets are shown in Table 1.

Baselines
We compare our proposed model with the following models: (1) Key-Value Memory Network (KVMem) (Miller et al., 2016), an end-to-end memory network that can be used for KBQA. (2) IRN (Zhou et al., 2018), an interpretable reasoning model for knowledge graph question answering. (3) VRN (Zhang et al., 2018), an end-to-end variational learning algorithm, which not only addresses the noise in questions but also performs effective multi-hop reasoning. (4) GraftNet (Sun et al., 2018), a model which treats documents as a special type of node in the KB and uses a graph convolution network to aggregate the information. (5) SGReader (Xiong et al., 2019), a model that addresses incomplete knowledge graphs by using text information, applying graph attention to aggregate the information of each entity from its linked neighbors.

Training Details
We run the experiments on a P40 GPU with 24 GB of memory. Throughout the experiments, for all of the baselines and the proposed model, we apply 300-dimensional TransE embeddings (Bordes et al., 2013) to initialize entity states and 300-dimensional GloVe embeddings (Pennington et al., 2014) to initialize word states in questions. The hidden dimension of the LSTM is 300. The hidden dimension of all GCN ...

Main Results and Discussion
Table 2 shows the comparison with state-of-the-art models on the MetaQA dataset. As shown in Table 2, our model achieves the best Hits@1 and F1. Specifically, on MetaQA 1-Hop, our model improves Hits@1 and F1 by 1.2% and 1.6% respectively, and on MetaQA 2-Hop, our model is 0.7% and 3.2% higher than the second-best model on Hits@1 and F1 respectively. Similarly, our model achieves the best performance on MetaQA 3-Hop.
We show the experimental results on the PQL dataset in Table 3. Each question in the PQL dataset has only one answer, so we only adopt Hits@1 for evaluation. On the Hits@1 metric, our model achieves the best results, with improvements of 3.5% and 2.8% on 2-Hop and 3-Hop, respectively.
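For reference, the two metrics can be computed per question as in the sketch below. The function names are hypothetical, and deriving the predicted answer set by thresholding scores at 0.5 is an assumption of this example, since the text does not state how the answer set for F1 is selected.

```python
def hits_at_1(scores, gold):
    """1.0 if the top-scoring entity is a gold answer, else 0.0."""
    best = max(scores, key=scores.get)
    return 1.0 if best in gold else 0.0

def f1_score(predicted, gold):
    """F1 between a predicted answer set and the gold answer set."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

scores = {"a": 0.9, "b": 0.6, "c": 0.1}
gold = {"a", "c"}
# Assumption: entities scoring above 0.5 form the predicted set.
predicted = {e for e, s in scores.items() if s > 0.5}
hits = hits_at_1(scores, gold)
f1 = f1_score(predicted, gold)
```

When a question has a single answer, as in PQL, Hits@1 already captures whether that answer was found, which matches the decision to report only Hits@1 there.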
Our method performs well for the following reasons: 1) it uses a graph neural network (GNN) to model the structural information of the knowledge graph, which enhances the reasoning ability; 2) it captures long-distance node similarity by modeling the labels of each node, which is not considered in previous GNN-based KBQA models; 3) it captures more question-related information by constructing the question-aware dynamic graph.

Ablation Experiment
We compare our model with a few variants. R-GCN (Schlichtkrull et al., 2018) considers the influence of different types of connected relations when aggregating neighbors' information. GAT (Velickovic et al., 2018) implements a weight-based neighbor aggregation method. In the experiment, we combine R-GCN and GAT, and name the result R-GAT. R-GAT and R-GCN do not consider long-distance node relations, and only perform information propagation based on the real neighbors of nodes. As shown in Table 4, our model achieves the best performance, and the biggest difference between our model and these two models lies in considering long-distance node relations, which demonstrates the effectiveness of our proposed model.
We also conduct experiments to evaluate the contribution of different components of our model. GlobalGraph w/o q-aware subgraph does not construct the question-aware subgraph. GlobalGraph w/o label similarity does not propagate information between nodes with the same label, and only performs local and question-aware information propagation. As shown in Table 4, the performance of the model declines without these components, which demonstrates the effectiveness of both components in our model.

Analysis of Question-Aware Graph
In the proposed model, we construct a question-aware dynamic graph to enhance the relevance between nodes and the given question. In this section, we analyze the effectiveness of this method by showing the model performance for different values of the threshold $t_q$. As shown in Figure 3, if the threshold is set too low (threshold=0), the model performance drops, probably because there are too many question-irrelevant nodes in the graph. The information propagation between these nodes reduces the reasoning performance. As the threshold increases (from 0 to 0.8), the performance of the model improves, which demonstrates the validity of the question-aware subgraph. If the threshold is too large (threshold=0.8), the performance of the model also drops because too many nodes are discarded, resulting in information loss.

Case Study of Modeling Long-distance Node Similarity
To demonstrate the validity of modeling long-distance node similarity based on the global labels, we give examples from PQL 3-Hop. Figure 4 (b) shows the adjacency matrix of the real graph and the similarity matrix of node labels. From the adjacency matrix of the real graph, we find that node "pulse" and node "i miss you" are not connected. However, their labels are the same, which can be inferred from their surrounding relations in the given KB. This relation is captured correctly by the similarity matrix, which demonstrates the validity of our method. Figure 4 (a) shows a special knowledge graph in which all nodes are connected to only one node. In this case, the adjacency matrix cannot capture the relations between the other nodes. By using the constructed similarity matrix, we can capture richer relation information.
Figure 4: Example (a) and Example (b) display two KB examples respectively, in which the left heatmap is the original adjacency matrix, and the right heatmap is the constructed long-distance node relations. A deeper color means a stronger correlation between two nodes. From the comparison of the two heatmaps in an example, we find that the constructed relation matrix can capture the long-distance node relations.

Conclusion
In this paper, we propose a novel KBQA model based on graph neural networks, which captures long-distance node relations by modeling the relation features of each node and then judging the similarity of these features. Moreover, our model constructs a dynamic question-aware subgraph, retains the nodes related to the question, and propagates messages among these nodes to improve the reasoning ability. Experiments on two open datasets demonstrate our model's ability to perform answer prediction. Ablation experiments demonstrate the validity of each part of the model. The case study demonstrates our model's ability to capture long-distance node relations. In the future, we will explore other ways to capture the relations between distant nodes and improve the current model.