Zero-Shot Question Generation from Knowledge Graphs for Unseen Predicates and Entity Types

We present a neural model for question generation from knowledge graph triples in a “Zero-Shot” setup, that is, generating questions for predicates, subject types or object types that were not seen at training time. Our model leverages occurrences of triples in a natural language corpus within an encoder-decoder architecture, paired with an original part-of-speech copy action mechanism to generate questions. Benchmark and human evaluations show that our model outperforms the state of the art on this task.


Introduction
Question Generation (QG) from Knowledge Graphs is the task of generating natural language questions given an input knowledge base (KB) triple (Serban et al., 2016). QG from knowledge graphs has been shown to improve the performance of existing factoid question answering (QA) systems, either by dual training or by augmenting existing training datasets (Dong et al., 2017; Khapra et al., 2017). Those methods rely on large-scale annotated datasets such as SimpleQuestions (Bordes et al., 2015). Building such datasets is tedious in practice, especially when the goal is an unbiased dataset, i.e. one that covers a large number of KB triples equally. In practice, many of the predicates and entity types in a KB are not covered by those annotated datasets. For example, 75.6% of Freebase predicates are not covered by the SimpleQuestions dataset (this observation can be replicated at http://bit.ly/2GvVHae). Among those are important predicates such as fb:food/beer/country, fb:location/country/national_anthem and fb:astronomy/star_system/stars.
One challenge for QG from knowledge graphs is to adapt to predicates and entity types that were not seen at training time (Zero-Shot Question Generation). Since state-of-the-art systems in factoid QA rely on the tremendous effort made to create SimpleQuestions, these systems can only process questions on the 24.4% of Freebase predicates covered by SimpleQuestions. Previous work on factoid QG (Serban et al., 2016) claims to solve the issue of small QA datasets. However, when such a system encounters an unseen predicate or entity type, it generates essentially random text for the out-of-vocabulary predicates it has never seen. We go beyond this state of the art by providing an original and non-trivial solution for creating a much broader set of questions for unseen predicates and entity types. Ultimately, generating questions for predicates and entity types unseen at training time will allow QA systems to cover predicates and entity types that would not have been used for QA otherwise.
Intuitively, a human who is asked to write a question about a fact offered by a KB would read natural language sentences in which the entity or the predicate of the fact occurs, and build questions that are aligned, both lexically and grammatically, with what they read. In this paper, we propose a model for Zero-Shot Question Generation that follows this intuitive process. In addition to the input KB triple, we feed our model with a set of textual contexts paired with the input KB triple through distant supervision. Our model builds on an encoder-decoder architecture, in which the encoder encodes the input KB triple, along with a set of textual contexts, into hidden representations. Those hidden representations are fed to a decoder equipped with an attention mechanism to generate an output question. In the Zero-Shot setup, the emergence of new predicates and new class types at test time requires new lexicalizations to express these predicates and classes in the output question. These lexicalizations might not be encountered by the model during training, and hence are either absent from the model vocabulary or seen too few times for the model to learn a good representation of them. Recent work on text generation tackles the rare/unknown word problem using copy actions (Luong et al., 2015; Gülçehre et al., 2016): words at a specific position are copied from the source text to the output text, although this process is blind to the role and nature of the word in the source text. Inspired by research in open information extraction (Fader et al., 2011) and structure-content neural language models (Kiros et al., 2014), in which part-of-speech tags represent a distinctive feature when representing relations in text, we extend these positional copy actions. Instead of copying a word at a specific position in the source text, our model copies a word with a specific part-of-speech tag from the input text; we refer to those as part-of-speech copy actions. Experiments show that our model using contexts through distant supervision significantly outperforms the strongest of six baselines (+2.04 BLEU-4 score). Adding our copy action mechanism further increases this improvement (+2.39). Additionally, a human evaluation complements our understanding of the model on edge cases; it supports the claim that the improvement brought by our copy action mechanism is even more significant than what the BLEU score suggests.

Related Work
QG has become an essential component in many applications such as education (Heilman and Smith, 2010), tutoring (Graesser et al., 2004; Evens and Michael, 2006) and dialogue systems (Shang et al., 2015). In our paper we focus on the problem of QG from a structured KB and how we can generalize it to unseen predicates and entity types. Seyler et al. (2015) generate quiz questions from KB triples. Verbalization of entities and predicates relies on their existing labels in the KB and a dictionary. Serban et al. (2016) use an encoder-decoder architecture with an attention mechanism trained on the SimpleQuestions dataset (Bordes et al., 2015). Dong et al. (2017) generate paraphrases of given questions to increase the performance of QA systems; paraphrases are generated relying on paraphrase datasets, neural machine translation and rule mining. Khapra et al. (2017) generate a set of QA pairs given a KB entity. They model the problem of QG as a sequence-to-sequence problem by converting all the KB entities to a set of keywords. None of the previous work in QG from KBs addresses the question of generalizing to unseen predicates and entity types. Textual information has been used before in Zero-Shot learning. Socher et al. (2013) use information in pretrained word vectors for Zero-Shot visual object recognition. Levy et al. (2017) incorporate a natural language question into the relation query to tackle the Zero-Shot relation extraction problem.
Previous work in machine translation dealt with the rare or unseen word problem when translating names and numbers in text. Luong et al. (2015) propose a model that generates positional placeholders pointing to some words in the source sentence and copies them to the target sentence (copy actions). Gülçehre et al. (2016) and Gu et al. (2016) introduce separate trainable modules for copy actions to adapt to highly variable input sequences, for text summarization. For text generation from tables, Lebret et al. (2016) extend positional copy actions to copy values from fields in the given table. For QG, Serban et al. (2016) use a placeholder for the subject entity in the question to generalize to unseen entities. Their work is limited to unseen entities and does not study how to generalize to unseen predicates and entity types.

Model
Let $F = \{s, p, o\}$ be the input fact provided to our model, consisting of a subject $s$, a predicate $p$ and an object $o$, and let $C$ be the set of textual contexts associated with this fact. Our goal is to learn a model that generates a sequence of $T$ tokens $Y = y_1, y_2, \ldots, y_T$ representing a question about the subject $s$, where the object $o$ is the correct answer. Our model approximates the conditional probability of the output question given an input fact, $p(Y \mid F)$, by the probability of the output question given the input fact and the additional textual contexts $C$, modelled as follows:

$$p(Y \mid F) = \prod_{t=1}^{T} p(y_t \mid y_{<t}, F, C) \qquad (1)$$

where $y_{<t}$ represents all tokens generated before time step $t$. Additional textual contexts are natural language representations of the triples that can be drawn from a corpus. Our model is generic to any textual contexts that can be additionally provided, though we describe in Section 4.1 how to create such texts from Wikipedia. Our model builds on the encoder-decoder architecture of (Sutskever et al., 2014; Bahdanau et al., 2014) with two encoding modules: a feed-forward architecture that encodes the input triple (sec. 3.1) and a set of recurrent neural networks (RNNs) that encode each textual context (sec. 3.2). Our model has two attention modules (Bahdanau et al., 2014): one acts over the input triple and another acts over the input textual contexts (sec. 3.4). The decoder (sec. 3.3) is another RNN that generates the output question. At each time step, the decoder chooses to output either a word from the vocabulary or a special token indicating a copy action (sec. 3.5) from any of the textual contexts.
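
To make the factorization above concrete, the following minimal sketch sums per-step log-probabilities into the log-probability of a whole question. The function name and the per-step probabilities are hypothetical values chosen for illustration, not outputs of the model.

```python
import math

def sequence_log_prob(step_probs):
    """Chain rule: log p(Y | F) = sum over t of log p(y_t | y_<t, F, C)."""
    return sum(math.log(p) for p in step_probs)

# Hypothetical probabilities assigned to each token of a 4-token question.
step_probs = [0.9, 0.5, 0.8, 0.95]
log_p = sequence_log_prob(step_probs)

# Training minimises the negative log-likelihood of the reference question.
neg_log_likelihood = -log_p
```

Working in log space avoids numerical underflow when multiplying many small probabilities.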

Fact Encoder
Given an input fact $F = \{s, p, o\}$, let each of $e_s$, $e_p$ and $e_o$ be a 1-hot vector of size $K$. The fact encoder encodes each 1-hot vector into a fixed-size vector, $h_s = E_f e_s$, $h_p = E_f e_p$ and $h_o = E_f e_o$, where $E_f \in \mathbb{R}^{H_k \times K}$ is the KB embedding matrix, $H_k$ is the size of the KB embedding and $K$ is the size of the KB vocabulary. The encoded fact $h_f \in \mathbb{R}^{3H_k}$ is the concatenation of those three vectors, and we use it to initialize the decoder:

$$h_f = [h_s; h_p; h_o] \qquad (2)$$

Following (Serban et al., 2016), we learn $E_f$ using TransE (Bordes et al., 2015). We fix its weights and do not update them during training.
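
As an illustration of the fact encoder, the sketch below treats the product of the embedding matrix with a 1-hot vector as a plain index lookup and concatenates the three results. The function name, the toy embedding table and the ids are assumptions made for this example; in the paper the embeddings are pretrained with TransE and have size 200.

```python
def encode_fact(E_f, s_id, p_id, o_id):
    """Return h_f = [h_s; h_p; h_o] as one flat list.

    Multiplying the embedding matrix by a 1-hot vector selects a single
    embedding, so with E_f stored as a list of K vectors it is an index lookup.
    """
    return list(E_f[s_id]) + list(E_f[p_id]) + list(E_f[o_id])

# Toy embedding table: K = 4 KB symbols, embedding size H_k = 2.
E_f = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Hypothetical ids for subject, predicate and object.
h_f = encode_fact(E_f, s_id=0, p_id=2, o_id=3)
```

The resulting vector has length 3 * H_k, matching the dimensionality stated for the encoded fact.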

Textual Context Encoder
Given a set of $n$ textual contexts $C = \{c_1, c_2, \ldots, c_n : c_j = (x^j_1, x^j_2, \ldots, x^j_{|c_j|})\}$, where $x^j_i$ represents the 1-hot vector of the $i$-th token in the $j$-th textual context $c_j$, and $|c_j|$ is the length of the $j$-th context. We use a set of $n$ Gated Recurrent Neural Networks (GRU) to encode each of the textual contexts separately:

$$h^{c_j}_i = \mathrm{GRU}_j\left(E_c x^j_i, h^{c_j}_{i-1}\right) \qquad (3)$$

where $h^{c_j}_i \in \mathbb{R}^{H_c}$ is the hidden state of the GRU that corresponds to $x^j_i$ and is of size $H_c$. $E_c$ is the input word embedding matrix. The encoded context represents the encoding of all the textual contexts; it is calculated as the concatenation of the final states of all the encoded contexts:

$$h_c = \left[h^{c_1}_{|c_1|}; h^{c_2}_{|c_2|}; \ldots; h^{c_n}_{|c_n|}\right] \qquad (4)$$

Decoder
For the decoder we use another GRU with an attention mechanism (Bahdanau et al., 2014), in which the decoder hidden state $s_t \in \mathbb{R}^{H_d}$ at each time step $t$ is calculated from the embedding of the previously generated token, the previous hidden state and the two attention outputs:

$$s_t = \mathrm{GRU}\left(E_w y_{t-1}, s_{t-1}, a^f_t, a^c_t\right)$$

where $E_w \in \mathbb{R}^{m \times V}$ is the word embedding matrix, $m$ is the word embedding size and $H_d$ is the size of the decoder hidden state. $a^f_t$ and $a^c_t$ are the outputs of the fact attention and the context attention modules respectively, detailed in the following subsection. In order to encourage the model to pair output words with words from the textual inputs, we couple the word embedding matrices of both the decoder $E_w$ and the textual context encoder $E_c$ (eq. (3)). We initialize them with GloVe embeddings (Pennington et al., 2014) and allow the network to tune them. The first hidden state of the decoder, $s_0 = [h_f; h_c]$, is initialized using a concatenation of the encoded fact (eq. (2)) and the encoded context (eq. (4)). At each time step $t$, after calculating the hidden state of the decoder, the conditional probability distribution over each token $y_t$ of the generated question is computed as $\mathrm{softmax}(W_o s_t)$ over all the entries in the output vocabulary, where $W_o \in \mathbb{R}^{H_d \times V}$ is the weight matrix of the output layer of the decoder.

Attention
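Our model has two attention modules:
Triple attention over the input triple, which determines at each time step $t$ an attention-based encoding of the input fact $a^f_t \in \mathbb{R}^{H_k}$:

$$a^f_t = \alpha_{s,t} h_s + \alpha_{p,t} h_p + \alpha_{o,t} h_o$$

where $h_s$, $h_p$ and $h_o$ are the encoded subject, predicate and object, and $\alpha_{s,t}$, $\alpha_{p,t}$, $\alpha_{o,t}$ are scalar values calculated by the attention mechanism to determine at each time step which of the encoded subject, predicate, or object the decoder should attend to.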
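Textual contexts attention over all the hidden states of all the textual contexts, $a^c_t \in \mathbb{R}^{H_c}$:

$$a^c_t = \sum_{i=1}^{n} \sum_{j=1}^{|c_i|} \alpha^{c_i}_{t,j} h^{c_i}_j$$

where $h^{c_i}_j$ is the encoder hidden state of the $j$-th word in the $i$-th context $c_i$, and $\alpha^{c_i}_{t,j}$ is a scalar value determining the weight of that word at time step $t$.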
Given a set of encoded input vectors $I = \{h_1, h_2, \ldots, h_k\}$ and the decoder previous hidden state $s_{t-1}$, the attention mechanism calculates $\alpha_t = \alpha_{1,t}, \ldots, \alpha_{k,t}$ as a vector of scalar weights, where each $\alpha_{i,t}$ determines the weight of its corresponding encoded input vector $h_i$:

$$e_{i,t} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h_i), \qquad \alpha_{i,t} = \frac{\exp(e_{i,t})}{\sum_{j=1}^{k} \exp(e_{j,t})}$$

where $v_a$, $W_a$, $U_a$ are trainable weight matrices of the attention modules. It is important to notice here that we encode each textual context separately using a different GRU, but we calculate an overall attention over all tokens in all textual contexts: at each time step the decoder should ideally attend to only one word from all the input contexts.
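
The following sketch illustrates additive attention of the kind described above, computed jointly over the token states of several contexts. It is a minimal pure-Python illustration under stated assumptions: the helper names, the toy dimensions and all weight values are hypothetical, and the real model learns the weights during training.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def additive_attention(s_prev, inputs, v_a, W_a, U_a):
    """Return (weights, context) for additive attention over `inputs`."""
    Ws = matvec(W_a, s_prev)
    scores = []
    for h in inputs:
        hidden = [math.tanh(a + b) for a, b in zip(Ws, matvec(U_a, h))]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    # Softmax over all inputs, shifted by the max score for stability.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(inputs[0])
    context = [sum(w * h[d] for w, h in zip(weights, inputs)) for d in range(dim)]
    return weights, context

# Two toy contexts of different lengths; each inner list is one token state.
contexts = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]]
# One attention distribution is computed over the tokens of all contexts.
all_states = [h for ctx in contexts for h in ctx]

s_prev = [0.2, -0.1, 0.4]                   # previous decoder state (size 3)
W_a = [[0.1, 0.0, 0.2], [0.0, 0.3, -0.1]]   # attention size 2 x decoder size 3
U_a = [[0.5, -0.2], [0.1, 0.4]]             # attention size 2 x state size 2
v_a = [1.0, -1.0]
weights, a_c = additive_attention(s_prev, all_states, v_a, W_a, U_a)
```

Flattening the token states before the softmax is what makes the weights compete across contexts rather than within each context separately.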

Part-Of-Speech Copy Actions
We use the method of (Luong et al., 2015), modeling all the copy actions at the data level through an annotation scheme. This method treats the model as a black box, which makes it adaptable to any text generation model. Instead of using positional copy actions, we use part-of-speech information to decide the alignment between the input and output texts of the model. Each word in every input textual context is replaced by a special token containing a combination of its context id (e.g. C1) and its POS tag (e.g. NOUN). Then, if a word in the output question matches a word in a textual context, it is replaced with its corresponding tag, as shown in Table 1. Unlike (Serban et al., 2016; Lebret et al., 2016), we model the copy actions at both the input and the output levels. Our model does not have the drawback of losing semantic information when replacing words with generic placeholders, since we provide the model with the input triple through the fact encoder. During inference the model chooses to output either words from the vocabulary or special tokens to copy from the textual contexts. In a post-processing step those special tokens are replaced with their original words from the textual contexts.
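
The post-processing step described above can be sketched as a dictionary lookup. The token spelling with underscores (C1_NOUN), the function name and the example mapping are assumptions made for illustration.

```python
def restore_copied_words(output_tokens, tag_to_word):
    """Replace special copy tokens with the source words they stand for.

    Tokens that are not in the mapping are ordinary vocabulary words and
    are left unchanged.
    """
    return [tag_to_word.get(tok, tok) for tok in output_tokens]

# Hypothetical mapping recorded when the textual contexts were annotated.
tag_to_word = {"C1_NOUN": "manufacturer", "C2_NOUN": "spacecraft"}

# Hypothetical decoder output mixing vocabulary words and copy tokens.
decoded = ["what", "is", "the", "C1_NOUN", "of", "the", "C2_NOUN", "[S]", "?"]
question = " ".join(restore_copied_words(decoded, tag_to_word))
```

Because the replacement is purely lexical, it cannot repair a copied word that does not fit the sentence structure, which is the limitation discussed in the human evaluation.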

Textual contexts dataset
As a source of questions paired with KB triples we use the SimpleQuestions dataset (Bordes et al., 2015). It consists of 100K questions with their corresponding triples from Freebase, and was created manually through crowdsourcing. When asked to form a question from an input triple, human annotators tend to focus mainly on expressing the predicate of the input triple. For example, given a triple with the predicate fb:spacecraft/manufacturer the user may ask "What is the manufacturer of [S]?". Annotators may also specify the entity type of the subject or the object of the triple: "What is the manufacturer of the spacecraft [S]?" or "Which company manufactures [S]?". Motivated by this example, we chose to associate each input triple with three textual contexts of three different types. The first is a phrase containing a lexicalization of the predicate of the triple. The second and the third are two phrases containing the entity types of the subject and the object of the triple. In what follows we describe the collection and preprocessing of those textual contexts.

Collection of Textual Contexts
We extend the set of triples given in the SimpleQuestions dataset by using the FB5M (Bordes et al., 2015) subset of Freebase. As a source of text documents, we rely on Wikipedia articles.
Predicate textual contexts: In order to collect textual contexts associated with the SimpleQuestions triples, we follow the distant supervision setup for relation extraction (Mintz et al., 2009). The distant supervision assumption has been effective in creating training data for relation extraction and has been shown to be 87% correct (Riedel et al., 2010) on Wikipedia text. First, we align each triple in the FB5M KB to sentences in Wikipedia if the subject and the object of this triple co-occur in the same sentence. We use a simple string matching heuristic to find entity mentions in text. Afterwards we reduce the [...] Table 2 shows examples of predicates and their corresponding textual contexts.

Sub-Type and Obj-Type textual contexts:
We use the labels of the entity types as the sub-type and obj-type textual contexts. We collect the list of entity types of each entity in FB5M through the predicate fb:type/instance. If an entity has multiple entity types, we pick the entity type that is mentioned the most in the first sentence of each Wikipedia article. Thus the textual contexts favor entity types that appear more naturally in free text, and therefore in questions.
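
The alignment step used for the predicate textual contexts can be sketched as follows: a triple is paired with a sentence when both its subject label and its object label occur in that sentence. Case-insensitive substring matching, the function name and the toy data (including the predicate name) are assumptions for illustration; the exact matching heuristic is not specified beyond "simple string matching".

```python
def align_triples(triples, sentences):
    """Map each (subject, predicate, object) triple to the sentences that
    mention both the subject label and the object label."""
    aligned = {}
    for s, p, o in triples:
        hits = [sent for sent in sentences
                if s.lower() in sent.lower() and o.lower() in sent.lower()]
        if hits:
            aligned[(s, p, o)] = hits
    return aligned

# Toy triple and sentences; the predicate name is only illustrative.
triples = [("Berlin", "fb:location/containedby", "Germany")]
sentences = [
    "Berlin is the capital of Germany.",
    "Paris is the capital of France.",
]
aligned = align_triples(triples, sentences)
```

Under the distant supervision assumption, every sentence returned here is treated as expressing the predicate of the triple, which is why the resulting alignments are noisy.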

Generation of Special tokens
To generate the special tokens for copy actions (sec. 3.5) we run POS tagging on each of the input textual contexts. We replace every word in each textual context with a combination of its context id (e.g. C1) and its POS tag (e.g. NOUN). If the same POS tag appears multiple times in the textual context, it is given an additional id (e.g. C1 NOUN 2). If a word in the output question overlaps with a word in the input textual context, this word is replaced by its corresponding tag. For sentence and word tokenization we use the Regex tokenizer from the NLTK toolkit (Bird, 2006), and for POS tagging and dependency parsing we use the spaCy implementation (see footnote 4).
Table 3: Dataset statistics across 10 folds for each experiment
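
A minimal sketch of the token generation described above is given below. It takes contexts that are already POS-tagged, so no tagger is required to run it. The underscore spelling of the tokens, the choice to leave the first occurrence of a tag unnumbered, and all names and example words are assumptions made for illustration.

```python
def make_special_tokens(tagged_context, context_id):
    """Turn (word, POS) pairs into tokens such as C1_NOUN or C1_NOUN_2.

    Returns the token sequence and a word -> token mapping.
    """
    seen = {}
    tokens, word_to_token = [], {}
    for word, pos in tagged_context:
        seen[pos] = seen.get(pos, 0) + 1
        token = f"{context_id}_{pos}"
        if seen[pos] > 1:
            # Repeated POS tags within one context get an additional id.
            token = f"{token}_{seen[pos]}"
        tokens.append(token)
        word_to_token.setdefault(word.lower(), token)
    return tokens, word_to_token

def annotate_question(question_tokens, word_to_token):
    """Replace question words that also occur in the context by their token."""
    return [word_to_token.get(tok.lower(), tok) for tok in question_tokens]

# Hypothetical pre-tagged textual context with id C1.
tagged = [("record", "NOUN"), ("label", "NOUN"), ("of", "ADP")]
tokens, mapping = make_special_tokens(tagged, "C1")
annotated = annotate_question(["what", "label", "released", "[S]", "?"], mapping)
```

Applying the same mapping to both the context and the reference question is what lets the decoder learn to emit a copy token instead of a rare word.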

Zero-Shot Setups
We develop three setups that follow the same procedure as (Levy et al., 2017) for Zero-Shot relation extraction to evaluate how our model generalizes to: 1) unseen predicates, 2) unseen sub-types and 3) unseen obj-types. For the unseen predicates setup we group all the samples in SimpleQuestions by the predicate of the input triple, and keep groups that contain at least 50 samples. Afterwards we randomly split those groups into mutually exclusive sets of 70% train, 10% valid and 20% test. This guarantees that if the predicate fb:person/place_of_birth, for example, appears at test time, the training and validation sets will not contain any input triples having this predicate. We repeat this process to create 10 cross-validation folds; in our evaluation we report the mean and standard deviation of the results across those 10 folds. While doing this we make sure that the number of samples in each fold, not only the number of unique predicates, follows the same 70%, 10%, 20% distribution. We repeat the same process for the subject entity types and object entity types (answer types) individually. Similarly, in the unseen object-type setup for example, the question "Which artist was born in Berlin?" appearing in the test set means that there is no question in the training set having an entity of type artist. Table 3 shows the mean number of samples, predicates, sub-types and obj-types across the 10 folds for each experiment setup.
4 https://spacy.io/
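
The group-wise split described above can be sketched as follows. This is a simplified illustration: it splits the groups by the stated ratios but does not additionally balance the number of samples per split, which the procedure above also enforces. The function name, its parameters and the toy data are assumptions.

```python
import random
from collections import defaultdict

def zero_shot_split(samples, key, min_group_size=50,
                    ratios=(0.7, 0.1, 0.2), seed=0):
    """Split samples so that no group key is shared between splits."""
    groups = defaultdict(list)
    for sample in samples:
        groups[key(sample)].append(sample)
    # Keep only groups with enough samples, then shuffle the group keys.
    kept = sorted(k for k, g in groups.items() if len(g) >= min_group_size)
    random.Random(seed).shuffle(kept)
    n_train = round(ratios[0] * len(kept))
    n_valid = round(ratios[1] * len(kept))
    parts = {
        "train": kept[:n_train],
        "valid": kept[n_train:n_train + n_valid],
        "test": kept[n_train + n_valid:],
    }
    return {name: [s for k in keys for s in groups[k]]
            for name, keys in parts.items()}

# Toy data: 10 hypothetical predicates with 50 (subject, predicate, object)
# samples each.
samples = [(f"s{i}", f"p{i % 10}", f"o{i}") for i in range(500)]
split = zero_shot_split(samples, key=lambda triple: triple[1])
```

Passing a different `key` function (for instance one returning the subject or object type) gives the unseen sub-type and obj-type setups with the same code.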

Baselines
SELECT is a baseline built from (Serban et al., 2016) and adapted for the Zero-Shot setup. At test time, given a fact F, this baseline picks a fact F c from the training set and outputs the question that corresponds to it. When evaluating unseen predicates, F c has the same answer type (obj-type) as F. When evaluating unseen sub-types or obj-types, F c and F have the same predicate.

R-TRANSE is an extension that we propose for SELECT. The input triple is encoded using the concatenation of the TransE embeddings of the subject, predicate and object. At test time, R-TRANSE picks the fact from the training set that is closest to the input fact using cosine similarity and outputs the question that corresponds to it. We provide two versions of this baseline: R-TRANSE, which indexes and retrieves raw questions with only a single placeholder for the subject label, as in (Serban et al., 2016), and R-TRANSE copy, which indexes and retrieves questions using our copy actions mechanism (sec. 3.5).
IR is an information retrieval baseline. Information retrieval has been used before as a baseline for QG from text input (Rush et al., 2015; Du et al., 2017). We rely on the textual context of each input triple as the search keyword for retrieval. First, the IR baseline encodes each question in the training set as a vector of TF-IDF weights (Joachims, 1997) and then performs dimensionality reduction through LSA (Halko et al., 2011). At test time the textual context of the input triple is converted into a dense vector using the same process, and the question with the smallest cosine distance to the input is retrieved. We provide two versions of this baseline: IR on raw text and IR copy on text with our placeholders for copy actions.
Encoder-Decoder. Finally, we compare our model to the Encoder-Decoder model with a single placeholder, the best performing model from (Serban et al., 2016). We initialize the encoder with TransE embeddings and the decoder with GloVe word embeddings. Although this model was not originally built to generalize to unseen predicates and entity types, it has some generalization abilities thanks to the information encoded in the pre-trained embeddings. Pretrained KB term and word embeddings encode relations between entities or between words as translations in the vector space. Thus the model might be able to map new classes or predicates in the input fact to new words in the output question.
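
The retrieval step of the R-TRANSE baseline described above can be sketched as a nearest-neighbour search under cosine similarity. The vectors below are small hypothetical stand-ins for concatenated TransE embeddings, and the function names and questions are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve_question(query_vec, train_vecs, train_questions):
    """Return the training question whose fact vector is closest to the query."""
    best = max(range(len(train_vecs)),
               key=lambda i: cosine(query_vec, train_vecs[i]))
    return train_questions[best]

# Hypothetical fact encodings and their paired training questions.
train_vecs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.1]]
train_questions = ["what language is [S] in ?", "where was [S] born ?"]

# Encoding of an unseen test fact that lies close to the first training fact.
query = [0.9, 0.1, 0.3]
retrieved = retrieve_question(query, train_vecs, train_questions)
```

The same retrieval loop applies to the IR baseline if the fact vectors are replaced by reduced TF-IDF vectors of the textual contexts.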

Training & Implementation Details
To train the neural network models we optimize the negative log-likelihood of the training data with respect to all the model parameters. For that we use the RMSProp optimization algorithm with a decreasing learning rate of 0.001, a mini-batch size of 200, and clipping of gradients with norms larger than 0.1. We use the same vocabulary for both the textual context encoders and the decoder outputs. We limit our vocabulary to the top 30,000 words including the special tokens. For the word embeddings we chose GloVe (Pennington et al., 2014) pretrained embeddings of size 100. We train TransE embeddings of size H k = 200 on the FB5M dataset (Bordes et al., 2015) using the TransE model implementation from (Lin et al., 2015). We set the GRU hidden size of the decoder to H d = 500, and that of the textual encoder to H c = 200. The network hyperparameters are set with respect to the final BLEU-4 score over the validation set. All neural networks are implemented using Tensorflow (Abadi et al., 2015). The source code of all experiments and models is publicly available for the sake of reproducibility.

Automatic Evaluation Metrics
To evaluate the quality of the generated questions, we compare the original questions labeled by human annotators to the ones generated by each variation of our model and by the baselines. We rely on a set of well-established evaluation metrics for text generation: BLEU-1, BLEU-2, BLEU-3, BLEU-4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE-L (Lin, 2004).
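
As an illustration of what the BLEU family measures, the sketch below computes the clipped (modified) n-gram precision between one candidate and one reference. This is only one ingredient of BLEU: the full metric also combines several n-gram orders with a geometric mean and applies a brevity penalty, both omitted here. The function names and example questions are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of `candidate` against a single reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

reference = "what is the genre of [S] ?".split()
candidate = "what is the language of [S] ?".split()
p1 = modified_precision(candidate, reference, 1)  # 6 of 7 unigrams match
p2 = modified_precision(candidate, reference, 2)  # 4 of 6 bigrams match
```

The candidate scores highly on n-gram overlap even though it expresses a different predicate than the reference, which is the kind of limitation that motivates a complementary human evaluation.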

Human Evaluation
Automatic metrics for evaluating text generation such as BLEU and METEOR give a measure of how close the generated questions are to the target correct labels. However, they still suffer from many limitations (Novikova et al., 2017). Automatic metrics might not be able to evaluate directly whether a specific predicate was explicitly mentioned in the generated text or not.
As an example, consider a target question and two corresponding generated questions A and B. Sentence A has a better BLEU score than B although it is not able to express the correct target predicate (film genre). For that reason we decided to run two further human evaluations to directly measure the following:
Predicate identification: annotators were asked to indicate whether the generated question contains the given predicate in the fact or not, either directly or implicitly.
Naturalness: following (Ngomo et al., 2013), we measure the comprehensibility and readability of the generated questions. Each annotator was asked to rate each generated question on a scale from 1 to 5, where: (5) perfectly clear and natural, (3) artificial but understandable, and (1) completely not understandable.
We ran our studies on 100 randomly sampled input facts along with the corresponding questions generated by each of the systems, with the help of 4 annotators.

Results & Discussion
Automatic Evaluation
Table 4 shows the results of our model compared to all other baselines across all evaluation metrics. Our model that encodes the KB fact and textual contexts achieves a significant improvement over all the baselines in all evaluation metrics, with +2.04 BLEU-4 points over the Encoder-Decoder baseline. Incorporating the part-of-speech copy actions further increases this improvement to +2.39 BLEU-4 points. Among all baselines, the Encoder-Decoder baseline and the R-TRANSE baseline performed the best. This shows that TransE embeddings encode intra-predicate information and intra-class-type information to a great extent, and can generalize to some extent to unseen predicates and class types. Similar patterns can be seen in the evaluation on unseen sub-types and obj-types (Table 5). Our model with copy actions was able to outperform all the other systems. The majority of systems report significantly higher BLEU-4 scores in these two tasks than when generalizing to unseen predicates (+12 and +8 BLEU-4 points respectively). This indicates that these tasks are relatively easier, and hence our models achieve relatively smaller improvements over the baselines. Table 6 shows how well different variations of our system can express the unseen predicate in the target question in comparison to the Encoder-Decoder baseline.

Human Evaluation
Our proposed copy actions yield a significant improvement in the identification of unseen predicates, with up to +40% over the best performing baseline and over our model version without the copy actions.
By examining some of the generated questions (Table 7) we see that models without copy actions can generalize only to unseen predicates that have a very similar Freebase predicate in the training set. For example, for fb:tv_program/language and fb:film/language, if one of those predicates exists in the training set the model can use the same questions for the other at test time. Copy actions from the sub-type and the obj-type textual contexts can generalize to a great extent to unseen predicates because of the overlap between the predicate and the object type in many questions (Example 2, Table 7). Adding the predicate context to our model has improved model performance for expressing unseen predicates by +9% (Table 6). However, we can see that it has affected the naturalness of the question. The post-processing step does not take into consideration that some verbs and prepositions do not fit in the sentence structure, or that some words already exist in the question words (Example 4, Table 7). This does not happen as much with copy actions from the sub-type and the obj-type contexts because they are mainly formed of nouns, which are more interchangeable than verbs or prepositions. A post-processing step that reforms the question instead of copying directly from the input source is left for future work.

Conclusion
In this paper we presented a new neural model for question generation from knowledge bases, with a main focus on predicates, subject types or object types that were not seen at the training phase (Zero-Shot Question Generation). Our model is based on an encoder-decoder architecture that leverages textual contexts of triples, two attention layers for triples and textual contexts, and finally a part-of-speech copy action mechanism. Our method exhibits significantly better results for Zero-Shot QG than a set of strong baselines, including the state-of-the-art question generation from KB. Additionally, a complementary human evaluation helps show that the improvement brought by our part-of-speech copy action mechanism is even more significant than what the automatic evaluation suggests. The source code and the collected textual contexts are provided for the community.
Table 7: Examples of generated questions from different systems in comparison