Learning Representation Mapping for Relation Detection in Knowledge Base Question Answering

Relation detection is a core step in many natural language processing applications, including knowledge base question answering. Previous efforts show that single-fact questions can be answered with high accuracy. However, one critical problem is that current approaches only achieve high accuracy on questions whose relations have been seen in the training data; for unseen relations, the performance drops rapidly. The main reason for this problem is that the representations for unseen relations are missing. In this paper, we propose a simple mapping method, named the representation adapter, to learn a representation mapping for both seen and unseen relations based on previously learned relation embeddings. We employ an adversarial objective and a reconstruction objective to improve the mapping performance. We re-organize the popular SimpleQuestion dataset to reveal and evaluate the problem of detecting unseen relations. Experiments show that our method greatly improves the performance on unseen relations, while the performance on seen relations remains comparable to the state of the art.


Introduction
The task of Knowledge Base Question Answering (KBQA) has been well developed in recent years (Berant et al., 2013;Bordes et al., 2014;Yao and Van Durme, 2014). It answers questions using an open-domain knowledge base, such as Freebase (Bollacker et al., 2008), DBpedia (Lehmann et al., 2015) or NELL (Carlson et al., 2010). The knowledge base usually contains a large set of triples.
Each triple is in the form of ⟨subject, relation, object⟩, indicating the relation between the subject entity and the object entity.
Typical KBQA systems (Yao and Van Durme, 2014;Yin et al., 2016;Dai et al., 2016;Yu et al., 2017;Hao et al., 2018) can be divided into two steps: the entity linking step first identifies the target entity of the question, which corresponds to the subject of the triple; the relation detection step then determines the relation that the question asks from a set of candidate relations. After the two steps, the answer could be obtained by extracting the corresponding triple from the knowledge base (as shown in Figure 1).
Our main focus in this paper is the relation detection step, which is more challenging because it needs to consider the meaning of the whole question sentence (e.g., the pattern "where was ... born"), as well as the meaning of the candidate relation (e.g., "place of birth"). For comparison, the entity linking step benefits more from the matching of surface forms between the words in the question and subject entity (e.g., "Mark Mifsud").
In recent deep learning based relation detection approaches, each word or relation is represented by a dense vector representation, called an embedding, which is usually learned automatically while optimizing the relation detection objective. The inference process of these approaches is then executed by neural network computations. Such approaches enjoy great success on common KBQA datasets, such as SimpleQuestion (Bordes et al., 2015), achieving over 90% accuracy in relation detection. In the words of Petrochuk and Zettlemoyer (2018), "SimpleQuestion is nearly solved."

Figure 1: A KBQA example. The bold words in the question are the target entity, identified in the entity linking step. The relation detection step selects the correct relation (marked with bold font) from a set of candidate relations. The answer of this question is the object entity of the triple extracted from the knowledge base.

However, we notice that in the common split of the SimpleQuestion dataset, 99% of the relations in the test set also exist in the training data, which means their embeddings could be learned well during training. On the contrary, for relations never seen in the training data (called unseen relations), the embeddings have never been updated since initialization. As a result, the corresponding detection performance can be arbitrary, a problem that has not been carefully studied.
We emphasize that the detection of these unseen relations is critical, because it is infeasible to build training data for all the relations in a large-scale knowledge base. For example, SimpleQuestion is a large-scale human-annotated dataset, which contains 108,442 natural language questions for 1,837 relations sampled from FB2M (Bordes et al., 2015). FB2M is a subset of Freebase (Bollacker et al., 2008) which has 2 million entities and 6,700 relations. A large portion of these relations cannot be covered by a human-annotated dataset such as SimpleQuestion. Therefore, for building a practical KBQA system that can answer questions based on FB2M or other large-scale knowledge bases, dealing with unseen relations is very important and challenging. This problem can be considered a zero-shot learning problem (Palatucci et al., 2009), where the labels of test instances are unseen in the training dataset.
In this paper, we present a detailed study of this zero-shot relation detection problem. Our contributions can be summarized as follows:

1. Instead of learning the relation representations solely from the training data, we employ methods to learn the representations from the whole knowledge graph, which has much wider coverage.

2. We propose a mapping mechanism, called the representation adapter, or simply adapter, to incorporate the learned representations into the relation detection model. We start with a simple mean square error loss for the non-trivial training of the adapter, and propose to incorporate adversarial and reconstruction objectives to improve the training process.

3. We re-organize the SimpleQuestion dataset as SimpleQuestion-Balance to evaluate the performance on seen and unseen relations separately.

4. We present experiments showing that our proposed method brings a great improvement to the detection of unseen relations, while staying comparable to the state-of-the-art method on the seen relations.

Motivation
Representation learning of human annotated data is limited by the size and coverage of the training data. In our case, because the unseen relations and their corresponding questions do not occur in the training data, their representations cannot be properly trained, leading to poor detection performance. A possible solution for this problem is to employ a large number of unannotated data, which may be much easier to obtain, to provide better coverage.
Usually, pre-trained representations are not directly applicable to specific tasks. One popular way to utilize these representations is using them as initialization. These initialized representations are then fine-tuned on the labeled training data, with a task specific objective. However, with the above mentioned coverage issues, the representations of unseen relations will not be updated properly during fine-tuning, leading to poor test performance.
To solve this problem, we keep the representations unchanged during training, and propose a representation adapter to bridge the gap between general-purpose representations and task-specific ones. We first present the basic adapter framework, then introduce the adversarial adapter and the reconstruction objective as enhancements.
Throughout this paper, we use the following notations: let r denote a single relation; S and U denote the sets of seen and unseen relations, respectively; e(r) or e denotes the embedding of r; specifically, we use e_g to denote the general pre-trained embedding.

Figure 2: On the left is the basic adapter; in the middle is the adversarial adapter; on the right is the adapter with the reconstruction loss. Adver. and recon. are abbreviations of adversarial and reconstruction, respectively.

Basic Adapter
Pseudo Target Representations The basic idea is to use a neural network representation adapter to perform the mapping from the general-purpose representation to the task-specific one. The input of the adapter is the embedding learned from the knowledge base. However, the output of the adapter is undecided, because there is no oracle representation for the relation detection task. Therefore, we first train a traditional relation detection model similar to Yu et al. (2017). During training, the representations of relations in the training set (seen relations) are updated for the relation detection task. We use these representations as pseudo target representations, denoted as ê, for training the adapter.
Linear Mapping Inspired by Mikolov et al. (2013), which shows that the representation spaces of similar languages can be related by a linear mapping, we also employ a linear mapping function G(·) to map the general embedding e_g to the task-specific (pseudo target) representation ê (Figure 2, left). The major difference between our adapter and an extra neural network layer is that specific losses are designed to train the adapter, instead of implicitly learning the adapter as a part of the whole network. We train the adapter to optimize the following objective function on the seen relations:

L_adapter = Σ_{r∈S} loss(ê, G(e_g)).   (1)

Here the loss function can be any metric that evaluates the difference between the two representations. The most common and simple one is the mean square error loss, which we employ in our basic adapter:

loss_mse(ê, G(e_g)) = ||ê − G(e_g)||²₂.   (2)

We will discuss other possibilities in the following sub-sections.
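As an illustration, the basic adapter amounts to a linear map trained with gradient descent on the MSE objective over seen relations. The following is a toy sketch with hypothetical 2-d embeddings and hand-rolled gradient descent, not the paper's actual training setup:

```python
# Toy sketch of the basic adapter: a linear map G (matrix W) trained to
# bring general embeddings e_g of *seen* relations close to their pseudo
# target representations e_hat via the MSE loss. All data is hypothetical.

def mat_vec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def train_adapter(seen_pairs, dim=2, lr=0.1, steps=500):
    # W initialised to identity; seen_pairs holds (e_g, e_hat) tuples.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(dim)]
        for e_g, e_hat in seen_pairs:
            diff = [m - t for m, t in zip(mat_vec(W, e_g), e_hat)]
            for i in range(dim):
                for j in range(dim):
                    grad[i][j] += 2.0 * diff[i] * e_g[j]
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= lr * grad[i][j] / len(seen_pairs)
    return W

# Toy data: the "task-specific" space is the general space rotated 90 degrees.
seen = [((1.0, 0.0), (0.0, 1.0)),
        ((0.0, 1.0), (-1.0, 0.0))]
W = train_adapter(seen)

# Once trained on seen relations, G can also map an *unseen* relation's
# general embedding into the task-specific space.
mapped = mat_vec(W, (1.0, 1.0))
```

Because the map is learned from seen relations only, its usefulness for unseen relations rests on the assumption that one global linear transformation relates the two spaces, the same assumption Mikolov et al. (2013) exploit for bilingual embeddings.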

Adversarial Adapter
The mean square error loss only measures the absolute distance between two embedding vectors. Inspired by the popular generative adversarial networks (GAN) (Goodfellow et al., 2014; Arjovsky et al., 2017) and previous work in unsupervised machine translation (Conneau et al., 2018; Zhang et al., 2017a,b), we use a discriminator to provide an adversarial loss to guide the training (Figure 2, middle). This is a different way to minimize the difference between G(e_g) and ê. In detail, we train a discriminator, D(·), to discriminate the "real" representation, i.e., the fine-tuned relation embedding ê, from the "fake" representation, which is the output of the adapter. The adapter G(·) acts as the generator in GAN, which tries to generate a representation that is similar to the "real" one. We use Wasserstein GAN (Arjovsky et al., 2017) to train our adapter. For relations sampled from the training set, the objective functions for the discriminator loss L_D and the generator loss L_G are:

L_D = Σ_{r∈S} [ D(G(e_g)) − D(ê) ],   (3)

L_G = − Σ_{r∈S} D(G(e_g)).   (4)

Here, for D(·), we use a feed-forward neural network without the sigmoid function in the last layer (Arjovsky et al., 2017).
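A minimal sketch of these two objectives, assuming toy linear forms for both the critic D(·) and the generator G(·) (the actual model uses a feed-forward critic and trains the two networks alternately):

```python
# Sketch of the WGAN-style objectives for the adversarial adapter.
# The critic D has no final sigmoid (WGAN); G is the linear adapter.
# Critic weights, adapter matrix, and the batch are all hypothetical.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def critic(w, x):          # D(x): toy linear critic, no sigmoid
    return dot(w, x)

def generator(W, e_g):     # G(e_g): linear mapping, as in the basic adapter
    return [dot(row, e_g) for row in W]

def wgan_losses(w, W, batch):
    """batch: list of (e_g, e_hat) pairs for sampled seen relations.
    The critic loss pushes D(e_hat) up and D(G(e_g)) down; the
    generator loss pushes D(G(e_g)) up, i.e. makes fakes look real."""
    d_loss = sum(critic(w, generator(W, e_g)) - critic(w, e_hat)
                 for e_g, e_hat in batch) / len(batch)
    g_loss = -sum(critic(w, generator(W, e_g)) for e_g, _ in batch) / len(batch)
    return d_loss, g_loss

w_critic = [1.0, 0.0]                    # toy critic weights
W_adapter = [[1.0, 0.0], [0.0, 1.0]]     # toy adapter (identity)
batch = [((1.0, 0.0), (0.0, 1.0))]       # one (e_g, e_hat) pair
d_loss, g_loss = wgan_losses(w_critic, W_adapter, batch)
```

In practice the two losses are minimized in alternation (several critic steps per generator step is common WGAN practice), with weight clipping or a gradient penalty keeping the critic approximately Lipschitz.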

Reconstruction Loss
The adapter can only learn the mapping from the representations of seen relations, which neglects the potentially large set of unseen relations.
Here we propose to use an additional reconstruction loss to augment the adapter (Figure 2, right). More specifically, we employ a reversed adapter G′(·), mapping the representation G(e_g) back to e_g. The advantage of introducing the reversed training is two-fold. On the one hand, the reversed adapter can be trained with the representations of all the relations, both seen and unseen. On the other hand, the reversed mapping can also serve as an extra constraint for regularizing the forward mapping.

For the reversed adapter G′(·), we simply use a linear mapping function similar to G(·), and train it with the mean square error loss:

L_recon = Σ_{r∈S∪U} ||G′(G(e_g)) − e_g||²₂.   (5)

Please note that, unlike the previous loss functions, this reconstruction loss is defined over both seen and unseen relations.
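The reconstruction objective can be illustrated as follows; the 2-d rotation matrices here are hypothetical stand-ins for the learned mappings G(·) and G′(·):

```python
# Toy sketch of the reconstruction objective: a reversed linear adapter
# G' maps G(e_g) back to e_g, and the loss is averaged over *all*
# relations, seen and unseen alike. Matrices and embeddings are made up.

def mat_vec(W, v):
    return [sum(W[i][j] * v[j] for j in range(len(v))) for i in range(len(W))]

def reconstruction_loss(W, W_rev, embeddings):
    total = 0.0
    for e_g in embeddings:               # seen AND unseen relations
        rec = mat_vec(W_rev, mat_vec(W, e_g))
        total += sum((r - x) ** 2 for r, x in zip(rec, e_g))
    return total / len(embeddings)

# If G' is the exact inverse of G, reconstruction is perfect (loss 0).
G     = [[0.0, -1.0], [1.0, 0.0]]        # rotate +90 degrees
G_rev = [[0.0, 1.0], [-1.0, 0.0]]        # rotate -90 degrees
emb   = [(1.0, 0.0), (0.0, 1.0), (0.7, -0.3)]
loss = reconstruction_loss(G, G_rev, emb)
```

Since no pseudo target ê appears in this loss, it is the only term that lets unseen relations influence the training, acting as a cycle-consistency regularizer on the forward map.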

Relation Detection with the Adapter
We integrate our adapter into the state-of-the-art relation detection framework of Yu et al. (2017), the Hierarchical Residual BiLSTM (HR-BiLSTM).
Framework The framework uses a question network to encode the question sentence as a vector q_f and a relation network to encode the relation as a vector r_f. Both networks are based on a Bi-LSTM with a max-pooling operation. Then, the cosine similarity between q_f and r_f determines the detection result. Our adapter is an additional module in the relation network that enhances this framework (Figure 3).
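The scoring step of this framework can be sketched as follows, with fixed toy vectors standing in for the max-pooled Bi-LSTM outputs; the candidate relation names and all numbers are illustrative only:

```python
# Sketch of relation detection as nearest-neighbour search under cosine
# similarity: q_f is the encoded question, each candidate relation has
# an encoded vector r_f, and the highest-scoring candidate wins.
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

q_f = [0.9, 0.1, 0.4]                      # stand-in for the question encoding
candidates = {                             # stand-ins for relation encodings
    "people.person.place_of_birth": [0.8, 0.2, 0.5],
    "music.recording.artist":       [-0.3, 0.9, 0.1],
}
best = max(candidates, key=lambda r: cosine(q_f, candidates[r]))
```

The adapter only changes how the relation-side vectors are produced; this final cosine-scoring step is unchanged from HR-BiLSTM.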

Adapting the Relation Representation
The relation network proposed in Yu et al. (2017) has two parts of relation representations: one at the word level and the other at the relation level. The two parts are fed into the relation network to generate the final relation representation.
Different from previous approaches, we employ the proposed adapter G(·) on the relation-level representations to solve the unseen relation detection problem. There are several approaches for embedding the relations of a knowledge base into a universal space (Bordes et al., 2013; Wang et al., 2014; Han et al., 2018). In practice, we use the JointNRE embedding (Han et al., 2018), because its word and relation representations are in the same space.
Training Following Yu et al. (2017), the relation detection model is trained with a hinge loss (Bengio et al., 2003) that tries to separate the score of each negative relation from the positive relation by a margin:

L_detect = Σ max(0, γ − s(q_f, r_f⁺) + s(q_f, r_f⁻)),   (6)

where γ is the margin; r_f⁺ is the positive relation from the annotated training data; r_f⁻ is a negative relation sampled from the remaining relations; s(·, ·) is the cosine similarity between q_f and r_f.
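A minimal sketch of this margin objective, using hypothetical similarity scores in place of the model's cosine similarities:

```python
# Sketch of the margin-based hinge loss: for one question, the positive
# relation's score must exceed each sampled negative's score by at least
# gamma, and only violations of that margin contribute to the loss.

def hinge_loss(score_pos, scores_neg, gamma=0.1):
    return sum(max(0.0, gamma - score_pos + s_neg) for s_neg in scores_neg)

# Toy scores standing in for s(q_f, r_f) of one positive and three
# negatively sampled relations; only the 0.75 negative violates the margin.
loss = hinge_loss(0.8, [0.5, 0.75, 0.2], gamma=0.1)
```

With γ = 0.1 as in the experiments, negatives scoring more than 0.1 below the positive contribute nothing, so training focuses on the hard negatives closest to the decision boundary.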
The basic relation detection model is pre-trained to get the pseudo target representations. Then, the adapter is incorporated into the training process and jointly optimized with the relation detection model. For the adversarial adapter, the generator and the discriminator are trained alternately, following common practice.

SimpleQuestion-Balance (SQB)
As mentioned before, SimpleQuestion (SQ) is a large-scale KBQA dataset. Each sample in SQ includes a human-annotated question and the corresponding knowledge triple. However, the distribution of relations in the test set is unbalanced: most of the relations in the test set have been seen in the training data. To better evaluate the performance of unseen relation detection, we re-organize the SQ dataset to balance the numbers of seen and unseen relations in the development and test sets, and denote the new dataset as SimpleQuestion-Balance (SQB).
The re-organization is performed by randomly shuffling the samples and splitting them into five sets, i.e., Train, Dev-seen, Dev-unseen, Test-seen and Test-unseen, while ensuring that the relations in Dev-unseen and Test-unseen do not occur in the training set.

More specifically, the dimension of the relation representation is 300. The dimension of the hidden state of the Bi-LSTM is set to 256. Parameters in the neural models are initialized using uniform sampling. The number of negatively sampled relations is 256. The γ in the hinge loss (Equation (6)) is set to 0.1.
Evaluation To evaluate the performance of relation detection, we assume that the results of entity linking are correct. Two metrics are employed. Micro average accuracy (Tsoumakas et al., 2010) is the average accuracy over all samples, which is the metric used in previous work. Macro average accuracy (Sebastiani, 2002; Manning et al., 2008; Tsoumakas et al., 2010) is the average of the per-relation accuracies.
Please note that because different relations may correspond to different numbers of samples in the test set, the micro average accuracy may be affected by the distribution of unseen relations in the test set. In this case, the macro average accuracy serves as an alternative indicator.
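The difference between the two metrics can be sketched as follows; the toy gold/predicted relation pairs are hypothetical:

```python
# Sketch of micro vs. macro average accuracy: micro averages over all
# samples, macro first computes per-relation accuracy and then averages
# over relations, so rare (e.g. unseen) relations carry equal weight.
from collections import defaultdict

def micro_macro(samples):
    """samples: list of (gold_relation, predicted_relation) pairs."""
    correct = sum(1 for g, p in samples if g == p)
    micro = correct / len(samples)
    per_rel = defaultdict(lambda: [0, 0])     # relation -> [correct, total]
    for g, p in samples:
        per_rel[g][1] += 1
        per_rel[g][0] += int(g == p)
    macro = sum(c / t for c, t in per_rel.values()) / len(per_rel)
    return micro, macro

# A frequent relation answered well plus a rare one answered badly:
data = [("r1", "r1")] * 9 + [("r2", "r0")]
micro, macro = micro_macro(data)
```

Here micro accuracy is dominated by the frequent relation while macro accuracy exposes the failure on the rare one, which is exactly why macro averaging matters for unseen relations.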
We report the average and standard deviation (std) over 10-fold cross-validation to reduce the effect of randomness.

Main Results
Main results for the baseline and the proposed model under different settings are listed in Table 2. The detailed comparison is as follows: Baseline The baseline HR-BiLSTM (line 1) shows the best performance on Test-seen, but performs much worse on Test-unseen. For comparison, training the model without fine-tuning (line 2) achieves much better results on Test-unseen, supporting our motivation that the embeddings are the reason for the weak performance on unseen relations, and fine-tuning makes them worse.
Using Adapters Line 3 shows the results of adding an extra neural network mapping layer between the pre-trained embeddings and the relation detection network, without any adapter loss. Although it is, in principle, possible to learn the mapping implicitly during training, in practice this does not lead to a better result (line 3 vs. line 2).
While keeping performance on Test-seen similar to HR-BiLSTM, all the models using the representation adapter achieve great improvements on the Test-unseen set. With the simplest form of the adapter (line 4), the accuracy on Test-unseen improves to 76.0% / 69.5%. This shows that our model can predict unseen relations with much better accuracy.
Using the adversarial adapter (line 6) further improves the performance on Test-unseen in both micro and macro average accuracy.
Using Reconstruction Loss Adding the reconstruction loss to the basic adapter also improves the performance slightly (line 5 vs. line 4). A similar improvement is obtained for the adversarial adapter in micro average accuracy (line 7 vs. line 6).
Finally, using all the techniques together (line 7) achieves 77.3% / 73.0% on Test-unseen, and 84.9% / 81.1% on the union of Test-seen and Test-unseen, in micro / macro average accuracy, respectively. We use this model as our final model for further comparison and analysis.
We notice that the results of our model on Test-seen are slightly lower than those of HR-BiLSTM. This is because we use the mapped representations for the seen relations instead of the directly fine-tuned representations. This drop is negligible compared with the improvement on the unseen relations.
Integration into KBQA To confirm the influence of unseen relation detection on the entire KBQA task, we integrate our relation detection model into a prototype KBQA framework. During the entity linking step, we use FocusPrune (Dai et al., 2016) to get the mentions of questions. Then, the candidate mentions are linked to entities in the knowledge base. Because the FreeBase API has been deprecated, we restrict entity linking to exact match for simplicity. The candidate relations are the set of relations linked with the candidate subjects. We evaluate the KBQA results using the micro average accuracy introduced in Bordes et al. (2015), which considers a prediction correct if both the subject and the relation are correct.
As shown in Table 3, the proposed adapter method improves KBQA accuracy from 48.5% to 63.7%. Comparing with the results of relation detection, we find that the boost in relation detection indeed leads to an improvement of the KBQA system.


Analysis
Seen Relation Bias We compute, in a macro-averaged way, the percentage of Test-unseen instances whose relations are wrongly predicted to be a seen relation. We call this indicator the seen rate; the lower, the better. Because the seen relations are better learned after fine-tuning while the representations of unseen relations are not properly updated, the relation detection model may have a strong tendency to select seen relations as the answer. The results in Table 4 show that our adapter weakens this tendency, which helps to promote a fair choice between seen and unseen relations.
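The seen-rate diagnostic can be sketched as follows (the relation names and predictions are hypothetical, and the macro averaging over gold relations mirrors the macro accuracy above):

```python
# Sketch of the "seen rate": among Test-unseen instances, the fraction
# whose predicted relation is a seen relation, macro-averaged over the
# gold (unseen) relations. Lower means less bias toward seen relations.
from collections import defaultdict

def seen_rate(samples, seen_relations):
    """samples: (gold_unseen_relation, predicted_relation) pairs."""
    per_rel = defaultdict(lambda: [0, 0])   # gold -> [pred_seen, total]
    for gold, pred in samples:
        per_rel[gold][1] += 1
        per_rel[gold][0] += int(pred in seen_relations)
    return sum(s / t for s, t in per_rel.values()) / len(per_rel)

seen = {"r_seen_a", "r_seen_b"}             # hypothetical seen relations
preds = [("u1", "r_seen_a"),                # unseen gold, seen prediction
         ("u1", "u1"),                      # unseen gold, unseen prediction
         ("u2", "r_seen_b")]
rate = seen_rate(preds, seen)
```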

Influence of Number of Relations for Training
We discuss the influence of the number of relations in the training set on our adapter. Our adapter is trained mainly on the seen relations, because we can obtain pseudo target representations for them. In this experiment, we randomly sample 60,000 samples from the training set to perform the training, and plot the accuracy against different numbers of relations used for training. We report the macro average accuracy on Test-unseen. As shown in Figure 4, with different numbers of relations, our model still performs better than HR-BiLSTM. Note that our adapter can beat HR-BiLSTM even with a smaller number of seen relations. When more relations are available for training, the performance improves as expected.

Relation Representation Analysis
We visualize the relation representations in JointNRE, HR-BiLSTM and the output of our final adapter by principal component analysis (PCA), with the help of TensorBoard. Yellow (light) points represent seen relations, and blue (dark) points represent unseen relations.
As shown in Figure 5a, the JointNRE representation is pre-trained by the interaction between the knowledge graph and text. Because this pre-training knows nothing about the relation detection task, seen and unseen relations are randomly distributed. (We also notice a big cluster of relations on the left-hand side, which is presumably the set of less-updated relations.)

After training with HR-BiLSTM (Figure 5b), the seen and unseen relations are easily separated, because the training objective is to discriminate the seen relations from the other relations for the corresponding question. Although the embeddings of unseen relations are also updated due to negative sampling, they are never updated towards their correct positions in the embedding space. As a result, the relation detection accuracy for the unseen relations is poor.
The training of our final model uses the adapter to fit the training data, instead of directly updating the embeddings. Despite comparable performance on seen relations, the distribution of seen and unseen relations (Figure 5c) is much more similar to the original JointNRE, which is the core reason for its better results on unseen relations.
Adapting JointNRE Interestingly, we notice that JointNRE trains the relation embeddings with a text corpus that may not cover all the relations, which is also a process that needs an adapter. As a simple solution, we use a similar adapter to adapt the representations from TransE (Lin et al., 2015) for the training of JointNRE. With the resulting relation embeddings, denoted as JointNRE*, we train the baseline and final relation detection models, denoted as HR-BiLSTM* and Final*, respectively.
We visualize the relation representations in these models again. Clearly, the distribution of seen and unseen relations in JointNRE* (Figure 5d) looks more reasonable than before. This distribution is disrupted by the fine-tuning process of HR-BiLSTM* (Figure 5e), while it is retained by our adapter model (Figure 5f).
Furthermore, as shown in Table 5, the models trained with JointNRE* achieve better performance, indicating that the adapted embeddings provide better representations for unseen relations.

Table 6: Case studies for relation detection using different models. For each question, the gold relation is marked with bold font; the gold target entity of the question is marked with italic font. The models and notations are the same as in Table 2.
Case Study In the first case of Table 6, Twenty One is the subject of the question. "music.recording.producer" is the gold relation, but it is unseen. The baseline model predicts "music.recording.artist", because this relation is seen and perhaps relevant in the training set. A dig into the set of relations shows that there is a seen relation, "music.recording.engineer", which happens to be the closest relation to the gold one in the mapped representation space. It is possible that the knowledge graph embedding is able to capture the relatedness between the two relations.
In the second case, although the gold relation "people.person.profession" is unseen, both the baseline and our model predict the correct answer because of strong lexical evidence: the word "profession".
In the last case, both the gold relation and the wrongly predicted relation are unseen. "Hud county place" refers to the name of a town in a county, while "location.location.contains" has a broader meaning. When asked about a "village", "location.location.contains" is more appropriate. This case shows that our model still cannot handle minor semantic differences between words. We leave this for future work.

Related Work
Relation Detection in KBQA Yu et al. (2017) first noticed the zero-shot problem in KBQA relation detection. They split each relation into a word sequence and use it as a part of the relation representation. In this paper, we push this line further and present the first in-depth discussion of this zero-shot problem. We propose the first relation-level solution and present a re-organized dataset for evaluation as well.
Embedding Mapping Our main idea of embedding mapping is inspired by previous work on learning mappings of bilingual word embeddings. Mikolov et al. (2013) observed the linear relation between bilingual word embeddings, and used a small starting dictionary to learn this mapping. Zhang et al. (2017a) use Generative Adversarial Nets (Goodfellow et al., 2014) to learn the mapping of bilingual word embeddings in an unsupervised manner. Different from this line of work, which maps words in different languages, we perform mappings between representations generated from heterogeneous data, i.e., a knowledge base and question-triple pairs.
Zero-Shot Learning Zero-shot learning has been studied in the area of natural language processing. Hamaguchi et al. (2017) use a neighborhood knowledge graph as a bridge between out-of-knowledge-base entities and the knowledge graph. Levy et al. (2017) connect natural language questions with relation queries to tackle the zero-shot relation extraction problem. Elsahar et al. (2018) extend copy actions (Luong et al., 2015) to solve the rare word problem in text generation. Some attempts have been made to build machine translation systems for language pairs without direct parallel data, relying on one or more other languages as a pivot (Firat et al., 2016; Ha et al., 2016; Chen et al., 2017). In this paper, we use knowledge graph embeddings as a bridge between seen and unseen relations, which shares the same spirit with previous work. However, little such study has been done for relation detection.

Conclusion
In this paper, we discuss unseen relation detection in KBQA, where the main problem lies in the learning of representations. We re-organize the SimpleQuestion dataset as SimpleQuestion-Balance to reveal and evaluate the problem, and propose an adapter which significantly improves the results.
We emphasize that for any other task which contains a large number of unseen samples, training and fine-tuning the model according to the performance on the seen samples alone is not fair. Similar problems may exist in other NLP tasks, which will be interesting to investigate in the future.