Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings

Knowledge Graphs (KG) are multi-relational graphs consisting of entities as nodes and relations among them as typed edges. The goal of the Question Answering over KG (KGQA) task is to answer natural language queries posed over the KG. Multi-hop KGQA requires reasoning over multiple edges of the KG to arrive at the right answer. KGs are often incomplete with many missing links, posing additional challenges for KGQA, especially for multi-hop KGQA. Recent research on multi-hop KGQA has attempted to handle KG sparsity using relevant external text, which is not always readily available. In a separate line of research, KG embedding methods have been proposed to reduce KG sparsity by performing missing link prediction. Such KG embedding methods, even though highly relevant, have not been explored for multi-hop KGQA so far. We fill this gap in this paper and propose EmbedKGQA. EmbedKGQA is particularly effective in performing multi-hop KGQA over sparse KGs. EmbedKGQA also relaxes the requirement of answer selection from a pre-specified neighborhood, a sub-optimal constraint enforced by previous multi-hop KGQA methods. Through extensive experiments on multiple benchmark datasets, we demonstrate EmbedKGQA's effectiveness over other state-of-the-art baselines.


Introduction
Knowledge Graphs (KG) are multi-relational graphs consisting of millions of entities (e.g., San Jose, California, etc.) and relationships among them (e.g., San Jose-cityInState-California). Examples of a few large KGs include Wikidata (Google, 2013), DBPedia (Lehmann et al., 2015), Yago (Suchanek et al., 2007), and NELL (Mitchell et al., 2018). Question Answering over Knowledge Graphs (KGQA) has emerged as an important research area over the last few years (Sun et al., 2019a). In KGQA systems, given a natural language (NL) question and a KG, the right answer is derived based on analysis of the question in the context of the KG.
Figure 1: Challenges with Multi-hop QA over Knowledge Graphs (KGQA) in sparse and incomplete KGs: Absence of the edge has genre(Gangster No. 1, Crime) in the incomplete KG makes it much harder to answer the input NL question, as the KGQA model potentially needs to reason over a longer path over the KG (marked by bold edges). Existing multi-hop KGQA methods also impose heuristic neighborhood limits (shaded region in the figure), which often makes the true answer (Crime in this example) out of reach. EmbedKGQA, our proposed method, overcomes these limitations by utilizing embeddings of the input KG during multi-hop KGQA. For more details, please refer to Figure 2 and Section 4.
In multi-hop KGQA, the system needs to perform reasoning over multiple edges of the KG to infer the right answer. KGs are often incomplete, which creates additional challenges for KGQA systems, especially in the case of multi-hop KGQA. Recent methods have used an external text corpus to handle KG sparsity (Sun et al., 2018, 2019a). For example, the method proposed in (Sun et al., 2019a) constructs a question-specific sub-graph from the KG, which is then augmented with supporting text documents. Graph CNN (Kipf and Welling, 2016) is then applied over this augmented sub-graph to arrive at the final answer. Unfortunately, the availability and identification of relevant text corpora is a challenge of its own, which limits the broad-coverage applicability of such methods. Moreover, such methods also impose a pre-specified heuristic neighborhood size limit from which the true answer needs to be selected. This often puts the true answer out of the model's reach.
EmbedKGQA's use of embeddings makes it more effective in handling KG sparsity. Moreover, since EmbedKGQA considers all entities as candidate answers, it doesn't suffer from the limited neighborhood out-of-reach issues of existing multi-hop KGQA methods. Please refer to Section 4 for a detailed description of EmbedKGQA.
To illustrate these points, consider the example shown in Figure 1. In this example, Louis Mellis is the head entity in the input NL question, and Crime is the true answer we expect the model to select. If the edge has genre(Gangster No. 1, Crime) were present in the KG, then the question could have been answered rather easily. However, since this edge is missing from the KG, as is often the case with similar incomplete and sparse KGs, the KGQA model potentially has to reason over a longer path over the KG (marked by bold edges in the graph). Moreover, the KGQA model imposes a neighborhood size of 3 hops, which puts the true answer Crime out of reach.
In a separate line of research, there has been a large body of work that utilizes KG embeddings to predict missing links in the KG, thereby reducing KG sparsity (Bordes et al., 2013; Trouillon et al., 2016; Yang et al., 2014a; Nickel et al., 2011). KG embedding methods learn high-dimensional embeddings for entities and relations in the KG, which are then used for link prediction. In spite of their high relevance, KG embedding methods have not been used for multi-hop KGQA; we fill this gap in this paper. In particular, we propose EmbedKGQA, a novel system which leverages KG embeddings to perform multi-hop KGQA. We make the following contributions in this paper:
1. We propose EmbedKGQA, a novel method for the multi-hop KGQA task. To the best of our knowledge, EmbedKGQA is the first method to use KG embeddings for this task. EmbedKGQA is particularly effective in performing multi-hop KGQA over sparse KGs.
2. EmbedKGQA relaxes the requirement of answer selection from a pre-specified local neighborhood, an undesirable constraint imposed by previous methods for this task.
3. Through extensive experiments on multiple real-world datasets, we demonstrate EmbedKGQA's effectiveness over state-of-the-art baselines.
We have made EmbedKGQA's source code available to encourage reproducibility.

Related Work
KGQA: In prior work (Li et al., 2018), TransE (Bordes et al., 2013) embeddings have been used to answer factoid-based questions. However, this requires ground truth relation labeling for each question and does not work for multi-hop question answering. In another line of work, (Yih et al., 2015) and (Bao et al., 2016) proposed extracting a particular sub-graph to answer the question. In the method presented in (Bordes et al., 2014a), the sub-graph generated for a head entity is projected into a high-dimensional space for question answering. Memory Networks have also been used to learn high-dimensional embeddings of the facts present in the KG to perform QA (Bordes et al., 2015). Methods like (Bordes et al., 2014b) learn a similarity function between the question and the corresponding triple during training, and score the question against all the candidate triples at test time. (Yang et al., 2014b) and (Yang et al., 2015) utilize embedding-based methods to map natural language questions to logical forms. Methods like (Dai et al., 2016; Dong et al., 2015; Hao et al., 2017; Lukovnikov et al., 2017; Yin et al., 2016) utilize neural networks to learn a scoring function to rank the candidate answers. Some works like (Mohammed et al., 2017; Ture and Jojic, 2016) consider each relation as a label and model the QA task as a classification problem. Extending these kinds of approaches to multi-hop question answering is non-trivial.
Recently, there has been some work in which a text corpus is incorporated as a knowledge source in addition to the KG to answer complex questions on KGs (Sun et al., 2018, 2019a). Such approaches are useful when the KG is incomplete. However, this adds another level of complexity to the QA system, and text corpora might not always be available.
KG completion methods: Link prediction in Knowledge Graphs using KG embeddings has become a popular area of research in recent years. The general framework is to define a score function for a set of triples (h, r, t) in a KG and constrain it in such a way that the score for a correct triple is higher than the score for an incorrect triple.
RESCAL (Nickel et al., 2011) and DistMult (Yang et al., 2015) learn a score function containing a bi-linear product between head entity and tail entity vectors and a relation matrix. ComplEx (Trouillon et al., 2016) represents entity vectors and relation matrices in the complex space. SimplE (Kazemi and Poole, 2018) and TuckER (Balažević et al., 2019) are based on Canonical Polyadic (CP) decomposition (Hitchcock, 1927) and Tucker decomposition (Tucker, 1966) respectively.
TransE (Bordes et al., 2013) embeds entities in a high-dimensional real space and relations as translations between the head and the tail entities. RotatE (Sun et al., 2019b), on the other hand, projects entities into complex space and represents relations as rotations in the complex plane.
ConvE (Dettmers et al., 2018) utilizes Convolutional Neural Networks to learn a scoring function between the head entity, tail entity and relation. InteractE (Vashishth et al., 2019) improves upon ConvE by increasing feature interaction.

Background
In this section, we formally define a Knowledge Graph (KG) and then describe the link prediction task on incomplete KGs. We then describe KG embeddings and explain the ComplEx embedding model.

Knowledge Graph
Given a set of entities E and relations R, a Knowledge Graph G is a set of triples K such that K ⊆ E × R × E . A triple is represented as (h, r, t), with h, t ∈ E denoting subject and object entities respectively and r ∈ R the relation between them.
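As a small illustration of this definition, the sketch below stores a KG as a set of (h, r, t) triples and derives E and R from it. The second triple and the helper name `is_known_fact` are hypothetical and exist only for this example.

```python
from typing import NamedTuple, Set


class Triple(NamedTuple):
    head: str
    relation: str
    tail: str


# Toy KG K, a subset of E x R x E. The first triple comes from the
# introduction; the second one is a made-up fact for illustration.
K: Set[Triple] = {
    Triple("San Jose", "cityInState", "California"),
    Triple("California", "stateInCountry", "USA"),  # hypothetical
}

# Entity set E and relation set R induced by the triples.
E = {t.head for t in K} | {t.tail for t in K}
R = {t.relation for t in K}


def is_known_fact(h: str, r: str, t: str) -> bool:
    """Return True if (h, r, t) is an observed triple in K."""
    return Triple(h, r, t) in K
```

A triple that is absent from K is not necessarily false; in an incomplete KG it may simply be a missing link, which is what link prediction tries to recover.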

Link Prediction
In link prediction, given an incomplete Knowledge Graph, the task is to predict which unknown links are valid. KG embedding models achieve this through a scoring function $\phi$ that assigns a score $s = \phi(h, r, t) \in \mathbb{R}$, which indicates whether a triple is true, with the goal of being able to score all missing triples correctly.

Knowledge Graph Embeddings
For each $e \in E$ and $r \in R$, Knowledge Graph Embedding (KGE) models generate $e_e \in \mathbb{R}^{d_e}$ and $e_r \in \mathbb{R}^{d_r}$, where $e_e$ and $e_r$ are $d_e$- and $d_r$-dimensional vectors respectively. Each embedding method also has a scoring function $\phi : E \times R \times E \to \mathbb{R}$ that assigns a score $\phi(h, r, t)$ to a possible triple $(h, r, t)$, with $h, t \in E$ and $r \in R$. Models are trained such that for every correct triple $(h, r, t) \in K$ and incorrect triple $(h', r', t') \notin K$, the model assigns scores with $\phi(h, r, t) > 0$ and $\phi(h', r', t') < 0$. A scoring function is generally a function of $(e_h, e_r, e_t)$.

ComplEx Embeddings
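ComplEx (Trouillon et al., 2016) is a tensor factorization approach that embeds relations and entities in complex space. Given $h, t \in E$ and $r \in R$, ComplEx generates $e_h, e_r, e_t \in \mathbb{C}^d$ and defines a scoring function:
$$\phi(h, r, t) = \mathrm{Re}(\langle e_h, e_r, \bar{e}_t \rangle) = \mathrm{Re}\Big(\sum_{k=1}^{d} e_h^{(k)} \, e_r^{(k)} \, \bar{e}_t^{(k)}\Big) \qquad (1)$$
such that $\phi(h, r, t) > 0$ for all true triples, and $\phi(h, r, t) < 0$ for false triples. Re denotes the real part of a complex number.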
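The ComplEx score can be sketched in a few lines using Python's built-in complex numbers. This is a minimal illustration of the standard ComplEx trilinear product, not the released implementation; the function name `complex_score` and the toy vectors are assumptions made for the example.

```python
from typing import Sequence


def complex_score(e_h: Sequence[complex],
                  e_r: Sequence[complex],
                  e_t: Sequence[complex]) -> float:
    """Real part of sum_k e_h[k] * e_r[k] * conj(e_t[k])."""
    if not (len(e_h) == len(e_r) == len(e_t)):
        raise ValueError("all embeddings must share the same dimension d")
    total = sum(h * r * t.conjugate() for h, r, t in zip(e_h, e_r, e_t))
    return total.real


# Toy embeddings with d = 2 (values chosen arbitrarily).
e_h = [1 + 1j, 2 + 0j]
e_r = [0 + 1j, 1 + 0j]
e_t = [1 + 0j, 0 + 1j]

forward = complex_score(e_h, e_r, e_t)   # score of (h, r, t)
backward = complex_score(e_t, e_r, e_h)  # score of (t, r, h)
```

Because the tail embedding is conjugated, swapping head and tail generally changes the score, which is what lets ComplEx model asymmetric relations.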

EmbedKGQA: Proposed Method
In this section, we first define the problem of KGQA and then describe our model.

Problem Statement
Let $E$ and $R$ be the sets of all entities and relations respectively in a KG $G$, and let $K \subseteq E \times R \times E$ be the set of all available KG facts. In KGQA, given a natural language question $q$ and a topic entity $e_h \in E$ present in the question, the task is to extract an entity $e_t \in E$ that correctly answers the question $q$.

EmbedKGQA Overview
We work in a setting where there is no fine-grained annotation present in the dataset, such as the question type or the exact logical reasoning steps. For example, co-actor is a combination of the starred actor$^{-1}$ and starred actor relations, but our model does not require this annotation. EmbedKGQA uses Knowledge Graph embeddings to answer multi-hop natural language questions. First, it learns a representation of the KG in an embedding space. Then, given a question, it learns a question embedding. Finally, it combines these embeddings to predict the answer.
In the following sections, we introduce the EmbedKGQA model. It consists of 3 modules: a KG embedding module, a question embedding module, and an answer selection module.

KG Embedding Module
ComplEx embeddings are trained for all $h, t \in E$ and all $r \in R$ in the KG such that $e_h, e_r, e_t \in \mathbb{C}^d$. The entity embeddings are used for learning a triple scoring function between the head entity, question, and answer entity. Based on the coverage of the KG entities in the QA training set, the entity embeddings learned here are either kept frozen or allowed to be fine-tuned in the subsequent steps.

Question Embedding Module
This module embeds the natural language question $q$ to a fixed-dimension vector $e_q \in \mathbb{C}^d$. This is done using a feed-forward neural network that first embeds the question $q$ using RoBERTa (Liu et al., 2019) into a 768-dimensional vector. This is then passed through 4 fully connected linear layers with ReLU activation and finally projected onto the complex space $\mathbb{C}^d$. Given a question $q$, topic entity $h \in E$ and set of answer entities $A \subseteq E$, it learns the question embedding such that
$$\phi(e_h, e_q, e_a) > 0 \quad \forall a \in A, \qquad \phi(e_h, e_q, e_{\bar{a}}) < 0 \quad \forall \bar{a} \notin A$$
where $\phi$ is the ComplEx scoring function (1) and $e_a$, $e_{\bar{a}}$ are entity embeddings learnt in the previous step.
For each question, the score $\phi(\cdot)$ is calculated with all the candidate answer entities $a \in E$. The model is learned by minimizing the binary cross-entropy loss between the sigmoid of the scores and the target labels, where the target label is 1 for the correct answers and 0 otherwise. Label smoothing is done when the total number of entities is large.
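The training objective described above can be sketched as follows, assuming the question embedding takes the place of the relation embedding in the ComplEx score. The function names, the toy embeddings, and in particular the smoothing form `(1 - eps) * y + eps / N` are assumptions for illustration; the text does not specify how label smoothing is applied.

```python
import math
from typing import Dict, Sequence, Set


def complex_score(e_h: Sequence[complex], e_q: Sequence[complex],
                  e_a: Sequence[complex]) -> float:
    """ComplEx score with the question embedding in the relation slot."""
    return sum(h * q * a.conjugate() for h, q, a in zip(e_h, e_q, e_a)).real


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def qa_loss(e_h: Sequence[complex], e_q: Sequence[complex],
            entity_emb: Dict[str, Sequence[complex]],
            answers: Set[str], smoothing: float = 0.0) -> float:
    """Mean binary cross-entropy over all entities in the KG."""
    n = len(entity_emb)
    loss = 0.0
    for name, e_a in entity_emb.items():
        target = 1.0 if name in answers else 0.0
        # Assumed smoothing form; not taken from the paper.
        target = (1.0 - smoothing) * target + smoothing / n
        p = sigmoid(complex_score(e_h, e_q, e_a))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        loss -= target * math.log(p) + (1.0 - target) * math.log(1.0 - p)
    return loss / n


# Toy setup with d = 1: entity "a" scores +2, entity "b" scores -2.
e_h, e_q = [1 + 0j], [1 + 0j]
entity_emb = {"a": [2 + 0j], "b": [-2 + 0j]}
loss_if_a_is_answer = qa_loss(e_h, e_q, entity_emb, {"a"})
loss_if_b_is_answer = qa_loss(e_h, e_q, entity_emb, {"b"})
```

In this toy case the loss is small when the high-scoring entity is the labelled answer and large otherwise, which is the signal that drives the question embedding during training.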

Answer Selection Module
At inference, the model scores the (head, question) pair against all possible answers $a \in E$. For relatively small KGs like MetaQA, we simply select the entity with the highest score. However, if the knowledge graph is large, pruning the candidate entities can significantly improve the performance of EmbedKGQA. The pruning strategy is described in the following section.

Relation matching
Similar to PullNet (Sun et al., 2019a), we learn a scoring function $S(r, q)$ which ranks each relation $r \in R$ for a given question $q$. Let $h_r$ be the embedding of a relation $r$ and $q' = (\langle s \rangle, w_1, \ldots, w_{|q|}, \langle /s \rangle)$ be the sequence of words in question $q$ which are input to RoBERTa. The scoring function is defined as the sigmoid of the dot product of the final output of the last hidden layer of RoBERTa ($h_q$) and the embedding of relation $r$ ($h_r$):
$$h_q = \mathrm{RoBERTa}(q'), \qquad S(r, q) = \mathrm{sigmoid}(h_q^{\top} h_r)$$
Among all the relations, we select those which have a score greater than 0.5. This set is denoted $R_a$. For each candidate entity $a'$ that we have obtained so far (Section 4.4), we find the relations on the shortest path between head entity $h$ and $a'$. Let this set of relations be $R_{a'}$. The relation score for each candidate answer entity is then defined as the size of their intersection:
$$\mathrm{RelScore}_{a'} = |R_a \cap R_{a'}|$$
We use a linear combination of the relation score and the ComplEx score to find the answer entity:
$$e_{ans} = \arg\max_{a'} \; \phi(e_h, e_q, e_{a'}) + \gamma \cdot \mathrm{RelScore}_{a'}$$
where $\gamma$ is a tunable hyperparameter.
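The combination of relation matching and ComplEx scoring can be sketched as below. The sketch assumes that the relation probabilities S(r, q), the per-candidate ComplEx scores, and the shortest-path relation sets have already been computed; all function names, argument names, and toy values are hypothetical.

```python
from typing import Dict, Set


def rel_score(question_relations: Set[str], path_relations: Set[str]) -> int:
    """Size of the overlap between the relations predicted for the question
    and the relations on the shortest path to a candidate."""
    return len(question_relations & path_relations)


def select_answer(complex_scores: Dict[str, float],
                  path_relations: Dict[str, Set[str]],
                  relation_probs: Dict[str, float],
                  gamma: float = 1.0,
                  threshold: float = 0.5) -> str:
    """Pick the candidate maximizing ComplEx score + gamma * RelScore."""
    # Relations whose predicted probability S(r, q) exceeds the threshold.
    question_relations = {r for r, p in relation_probs.items() if p > threshold}

    def combined(candidate: str) -> float:
        overlap = rel_score(question_relations,
                            path_relations.get(candidate, set()))
        return complex_scores[candidate] + gamma * overlap

    return max(complex_scores, key=combined)


# Toy example: "x" has the higher ComplEx score, but only "y" is reached
# through a relation that matches the question.
complex_scores = {"x": 1.0, "y": 0.8}
path_relations = {"x": {"r2"}, "y": {"r1"}}
relation_probs = {"r1": 0.9, "r2": 0.2}

without_matching = select_answer(complex_scores, path_relations,
                                 relation_probs, gamma=0.0)
with_matching = select_answer(complex_scores, path_relations,
                              relation_probs, gamma=1.0)
```

With `gamma=0.0` the choice reduces to the plain highest-scoring entity; a positive `gamma` lets relation evidence override a small gap in embedding scores.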

Experimental Details
In this section, we first describe the datasets that we evaluated our method on, and then explain the experimental setup and the results.

Baselines
We compare our model with the Key-Value Memory Network (Miller et al., 2016), GraftNet (Sun et al., 2018), and PullNet (Sun et al., 2019a).
• PullNet (Sun et al., 2019a) also creates a question-specific sub-graph, but instead of using heuristics, it learns to "pull" facts and sentences from the data to create a more relevant sub-graph. It also uses a graph CNN approach to perform reasoning.
The complete KG setting is the easiest setting for QA because the datasets are created in such a way that the answer always exists in the KG, and there is no missing link in the path. However, it is not a realistic setting, and the QA model should also be able to work on an incomplete KG. So we simulate an incomplete KG by randomly removing half of the triples in the KG (we randomly drop each fact with probability 0.5). We call this setting KG-50, and we call the full KG setting KG-Full in the text.
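The KG-50 construction can be sketched as an independent coin flip per triple. The function name, the fixed seed, and the synthetic chain-shaped KG are assumptions made for reproducibility of the example and are not taken from the paper.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]


def drop_triples(triples: List[Triple], drop_prob: float = 0.5,
                 seed: int = 0) -> List[Triple]:
    """Keep each triple independently with probability 1 - drop_prob."""
    rng = random.Random(seed)  # local RNG so the result is repeatable
    return [t for t in triples if rng.random() >= drop_prob]


# Synthetic KG with 1000 facts, used only to exercise the function.
kg_full = [(f"e{i}", "r", f"e{i + 1}") for i in range(1000)]
kg_50 = drop_triples(kg_full, drop_prob=0.5)
```

Because each fact is dropped independently, the resulting KG contains roughly, not exactly, half of the original triples.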
In the next section we will answer the following questions:
Q1. Can Knowledge Graph embeddings be used to perform multi-hop KGQA? (Section 5.3)
Q2. Can EmbedKGQA be used to answer questions when there is no direct path between the head entity and the answer entity? (Section 5.4)
Q3. How much does the answer selection module help in the final performance of our model? (Section 5.5)

KGQA results
In this section, we compare our model with baseline models on the MetaQA and WebQuestionsSP datasets.

Analysis on MetaQA
MetaQA has different partitions of the dataset for 1-hop, 2-hop, and 3-hop questions. In the full KG setting (MetaQA KG-Full), our model is comparable to the state-of-the-art for 2-hop questions and establishes the state-of-the-art for 3-hop questions. EmbedKGQA performs similarly to the state-of-the-art on 1-hop questions, which is expected because the answer node is directly connected to the head node and the model is able to learn the corresponding relation embedding from the question. On the other hand, performance on 2-hop and 3-hop questions suggests that EmbedKGQA is able to infer the correct relation from the neighboring edges because the KG embeddings can model composition of relations. PullNet and GraftNet also perform similarly well because the answer entity lies in the question sub-graph most of the time.
We have also tested our method on the incomplete KG setting, as explained in the previous section. Here we find that the accuracy of all baselines decreases significantly compared to the full KG setting, while EmbedKGQA achieves state-of-the-art performance. This is because the MetaQA KG is fairly sparse, with only 135k triples for 43k entities. So when 50% of the triples are removed (as is done in MetaQA KG-50), the graph becomes very sparse, with an average of only 1.66 links per entity node. This causes many head entity nodes of questions to have much longer paths (>3) to their answer node. Hence models that require question-specific sub-graph construction (GraftNet, PullNet) are unable to recall the answer entity in their generated sub-graph and therefore perform poorly. Their performance improves only after additional text corpora are included. On the other hand, EmbedKGQA does not limit itself to a sub-graph and, by utilizing the link prediction properties of the KG embeddings, is able to infer the relation on missing links.

Analysis on WebQuestionsSP
WebQuestionsSP has a relatively small number of training examples but uses a large KG (Freebase) as background knowledge. This makes multi-hop KGQA much harder. Since not all the entities of the KG are covered in the training set, freezing the entity embeddings learned in the KG embedding module is necessary.
Our method on WebQSP KG-50 outperforms all baselines including PullNet, which uses extra textual information and is the state-of-the-art model. Even though WebQuestionsSP has fewer questions, EmbedKGQA is able to learn good question embeddings that can infer missing links in the KG. This can be attributed to the fact that relevant and necessary information is being captured implicitly through the KG embeddings.

QA on KG with missing links
State-of-the-art KGQA models like PullNet and GraftNet require a path between the head entity and the answer entity to be present in the Knowledge Graph to answer the question. For example, in PullNet, the answer is restricted to be one of the entities present in the extracted question sub-graph. For the incomplete KG case where only 50% of the original triples are present, PullNet (Sun et al., 2019a) reports a recall of 0.544 on the MetaQA 1-hop dataset. This means that for only 54.4 percent of the questions are all the answer entities present in the extracted question sub-graph, and this puts a hard limit on how many questions the model can answer in this setting.
EmbedKGQA, on the other hand, uses Knowledge Graph Embeddings rather than a localized sub-graph to answer the question. It uses the head embedding and question embedding, which implicitly capture the knowledge of all observed and unobserved links around the head node. This is possible because of the link prediction property of Knowledge Graph Embeddings. So unlike other QA systems, even if there is no path between the head and answer entity, our model should be able to answer the question if there is sufficient information in the KG to be able to predict that path (see Fig. 1).
We design an experiment to test this capability of our model. For all questions in the validation set of the MetaQA 1-hop dataset, we removed all the triples from the Knowledge Graph that can be directly used to answer the question. For example, given the question 'what language is [PK] in' in the validation set, we removed the triple (PK, in language, Hindi) from the KG. The dataset also contains paraphrases of the same question, e.g., 'what language is the movie [PK] in' and 'what is the language spoken in the movie [PK]'. We also removed all paraphrases of validation set questions from the training dataset since we only want to evaluate the KG completion property of our model and not its linguistic generalization.
In such a setting, we expect models that rely only on sub-graph retrieval to achieve 0 hits@1. However, our model delivers a significantly better 29.9 hits@1 in this setting. This shows that our model can capture the KG completion property of ComplEx embeddings and apply it to answer questions that would otherwise be impossible to answer.
Further, if we know the relation corresponding to each question, then the problem of 1-hop KGQA is the same as KG completion in an incomplete Knowledge Graph. Using the same training KG as above and using the removed triples as the test set, we do tail prediction using KG embeddings. Here we obtain 20.1 hits@1. The lower score can be attributed to the fact that the ComplEx embedding uses only the KG, while our model uses the QA data as well, which in itself represents knowledge. Our model is first trained on the KG and then uses these embeddings to train the QA model, and thus it can leverage the knowledge present in both the KG and the QA data.
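For reference, the hits@1 numbers quoted in this section can be read as the percentage of questions whose top-ranked entity is a correct answer. The sketch below assumes that definition; the function name and the second toy question are hypothetical.

```python
from typing import Dict, Set


def hits_at_1(predictions: Dict[str, str], gold: Dict[str, Set[str]]) -> float:
    """Percentage of questions whose top prediction is a gold answer."""
    if not gold:
        raise ValueError("gold answers must not be empty")
    hits = sum(1 for question, answers in gold.items()
               if predictions.get(question) in answers)
    return 100.0 * hits / len(gold)


# Toy evaluation with two questions; the second one is made up.
gold = {
    "what language is [PK] in": {"Hindi"},
    "what language is [Some Movie] in": {"English"},  # hypothetical
}
predictions = {
    "what language is [PK] in": "Hindi",
    "what language is [Some Movie] in": "French",
}
score = hits_at_1(predictions, gold)
```

A model that can only return entities from a retrieved sub-graph would score 0 under this metric whenever the answer entity is unreachable, which is the situation the experiment above constructs.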

Effect of Answer Selection Module
We analyse the effect of the answer selection module (Section 4.4) on EmbedKGQA in the WebQuestionsSP dataset by ablating the relation matching module. Furthermore, in order to compare with other methods that restrict the answer to a neighbourhood in the KG (Sun et al. (2019a), Sun et al. (2018)), we experimented with restricting the candidate set of answer entities to only the 2-hop neighbourhood of the head entity. The results can be seen in Table 5. As we can see, relation matching has a significant impact on the performance of EmbedKGQA in both the WebQSP KG-Full and WebQSP KG-50 settings.
Also, as mentioned earlier, WebQSP KG (Freebase subset) has an order of magnitude more entities than MetaQA (1.8M versus 134k in MetaQA) and the number of possible answers is large. So reducing the set of answers to a 2-hop neighbourhood of the head entity showed improved performance in the case of WebQSP KG-Full. However, this caused a degradation in performance on WebQSP KG-50. This is because restricting the answer to a 2-hop neighbourhood on an incomplete KG may cause the answer to not be present in the candidates (please refer to Figure 1).
In summary, we find that relation matching is an important part of EmbedKGQA. Moreover, we suggest that n-hop filtering during answer selection may be included on top of EmbedKGQA for KGs which are reasonably complete.
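The n-hop filtering discussed above can be sketched as a breadth-first search from the head entity. Treating edges as undirected is an assumption of this sketch, as are the function name and the toy chain graph.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple


def n_hop_neighbourhood(triples: Iterable[Tuple[str, str, str]],
                        head: str, n: int = 2) -> Set[str]:
    """Entities reachable from `head` within n edges (head excluded)."""
    adjacency = defaultdict(set)
    for h, _, t in triples:
        adjacency[h].add(t)
        adjacency[t].add(h)  # assumption: edges can be traversed both ways

    seen = {head}
    frontier = deque([(head, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == n:
            continue  # do not expand beyond n hops
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    seen.discard(head)
    return seen


# Toy chain a - b - c - d, and the same chain with one edge missing.
full_kg = [("a", "r", "b"), ("b", "r", "c"), ("c", "r", "d")]
sparse_kg = [("a", "r", "b"), ("c", "r", "d")]

candidates_full = n_hop_neighbourhood(full_kg, "a", n=2)
candidates_sparse = n_hop_neighbourhood(sparse_kg, "a", n=2)
```

In the toy chain, removing a single edge shrinks the candidate set so that `c` is no longer selectable, which mirrors why the 2-hop filter helps on KG-Full but hurts on KG-50.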

Conclusion
In this paper, we propose EmbedKGQA, a novel method for multi-hop KGQA. KGs are often incomplete and sparse, which poses additional challenges for multi-hop KGQA methods. Recent methods for this problem have tried to address the incompleteness problem by utilizing an additional text corpus. However, the availability of a relevant text corpus is often limited, thereby reducing the broad-coverage applicability of such methods. In a separate line of research, KG embedding methods have been proposed to reduce KG sparsity by performing missing link prediction. EmbedKGQA utilizes the link prediction properties of KG embeddings to mitigate the KG incompleteness problem without using any additional data. It trains the KG entity embeddings and uses them to learn question embeddings, and during evaluation, it scores the (head entity, question) pair against all entities, and the highest-scoring entity is selected as the answer. EmbedKGQA also overcomes the shortcomings due to the limited neighborhood size constraint imposed by existing multi-hop KGQA methods. EmbedKGQA achieves state-of-the-art performance in multiple KGQA settings, suggesting that the link prediction properties of KG embeddings can be utilized to mitigate the KG incompleteness problem in multi-hop KGQA.