Knowledge-enriched, Type-constrained and Grammar-guided Question Generation over Knowledge Bases

Question generation over knowledge bases (KBQG) aims at generating natural-language questions about a subgraph, i.e. a set of triples. Two main challenges still face the current crop of encoder-decoder-based methods, especially on small subgraphs: (1) low diversity and poor fluency due to the limited information contained in the subgraphs, and (2) semantic drift due to the decoder losing track of the semantics of the answer entity. We propose an innovative knowledge-enriched, type-constrained and grammar-guided KBQG model, named KTG, to address the above challenges. In our model, the encoder is equipped with auxiliary information from the KB, and the decoder is constrained with word types during generation. Specifically, entity domain and description, as well as relation hierarchy information, are considered to construct question contexts, while a conditional copy mechanism is incorporated to modulate question semantics according to current word types. In addition, a novel reward function featuring grammatical similarity is designed to improve both generative richness and syntactic correctness via reinforcement learning. Extensive experiments show that our proposed model outperforms existing methods by a significant margin on two widely-used benchmark datasets, SimpleQuestion and PathQuestion.


Introduction
Question Generation over Knowledge Bases (KBQG) aims to generate natural-language questions given a subgraph in the KB, i.e. a set of connected triples of the form <subject, predicate, object>. KBQG has a wide range of applications and is attracting increasing attention from both academia and industry. For example, KBQG can improve factoid-based question answering (QA) systems through either dual training of QA and QG or data augmentation of training corpora. As another example, KBQG can play a critical role in developing a chat-bot that asks KB-based questions in conversational settings. Table 1 illustrates a real scenario of KBQG, in which three questions are generated from two connected triples, along with the answer entity Ohio. Among the three questions, Q3 "Where was the high school of LeBron James, the American basketball player, located in?" is not only correct in grammar and semantics but also more diverse than Q1 because it contains the description information of LeBron James. Meanwhile, Q2 suffers from a semantic drift problem due to the mismatch between the wrong interrogative "when" and Ohio, whose entity type is location.

Table 1: Example questions generated from two connected triples (answer entity: Ohio).
Q1: Where was LeBron James's high school located in?
Q2 (correct grammar ✓, wrong semantics ✗): When was the high school of LeBron James, the American basketball player, located in?
Q3 (correct grammar ✓, correct semantics ✓): Where was the high school of LeBron James, the American basketball player, located in?

Recent works on neural KBQG follow an encoder-decoder architecture that takes KB subgraphs as input to yield questions (Serban et al., 2016). In the case of a single input triple, Elsahar et al. (2018) enriched the encoder with extra contexts and equipped the decoder with attention and copy mechanisms to improve generated questions. To express given predicates and answers adequately, Liu et al. (2019) presented a new encoder-decoder framework that incorporates diversified off-the-shelf contexts and an answer-aware loss function.
In the case of multiple input triples, Chen et al. (2020) applied a bidirectional Graph2Seq model to generate questions from a KB subgraph concerning a target answer. While these KBQG solutions have achieved noticeable success, two critical research challenges (RCs) remain under-explored to date. RC-1: Limited Information. Questions are often generated from a small KB subgraph consisting of one or a few triples, so the information it contains may not be sufficient to produce a diverse and fluent question. For instance, as shown in Table 1, Q1 is a plain question, just a simple combination of the two connected triples, whereas Q2 and Q3 are much more informative in terms of descriptive detail. To address this challenge, Liu et al. (2019) expanded given triples with additional contextual information including type and range. However, the information-inadequacy issue persists, leading to rigid and unfluent questions.
RC-2: Semantic Drift. The semantic drift problem occurs when the semantics of a generated question becomes incompatible with the given triples and/or answers. As shown in Table 1, Q2 starts with the wrong interrogative "when" even though the answer is a location ("Ohio"). By contrast, Q3 starts with the correct interrogative "where". One possible reason is that the KBQG model is trained under teacher forcing without any high-level semantic regularization, so the resulting model may be only loosely grounded in the given triples and answers.
In this paper, we propose a novel knowledge-enriched, type-constrained, and grammar-guided KBQG model, denoted as KTG, to address the two challenges and generate correct, diverse, and fluent natural-language questions. For the first challenge, we augment the encoder input by linking both entities and relations in the source subgraph to an external KB, Wikidata (Vrandečić and Krötzsch, 2014), thereby introducing auxiliary knowledge such as entity descriptions and domains. In general, the auxiliary knowledge provides additional background information that improves the diversity and fluency of generated questions. For the second challenge, we label each word of a question with one of four types: interrogative, entity word, relation word, or ordinary word, on which the decoder output is conditioned. At each decoding step, we first estimate a distribution over word types and then compute multiple type-specific generation probabilities for the entire vocabulary. Meanwhile, we use a conditional copy mechanism to transfer content from different sources according to the current word type. Furthermore, we conjecture that the semantics of a generated question depends mainly on its interrogative. Therefore, we leverage the entity types of answers to help determine proper interrogatives. For instance, if the entity type of a target answer is time, then a reasonable interrogative for the generated question is "when". In addition, previous studies have utilized reinforcement learning to encourage structural conformity between generated and ground-truth questions (Du et al., 2017; Kumar et al., 2018). This objective is achieved by promoting higher degrees of text matching, typically evaluated by a rigid reward measure such as BLEU or ROUGE. This study instead designs an innovative reward function based on the dependency parse tree (DPT), which enables the proposed model to benefit from the semantic structure similarity between generated and ground-truth questions.
The main contributions of this paper are summarized as follows.
• We augment the source subgraph with auxiliary information to enrich the encoder input, which improves the diversity of generated questions.
• We propose to incorporate word types in generated questions and make the decoder output conditioned on these types, which alleviates the semantic drift issue.
• In a reinforcement learning framework, we design a DPT-based evaluator that encourages structural conformity without rigidly enforcing subsequence matching.
• We conduct extensive experiments on two benchmark datasets, SimpleQuestion (Bordes et al., 2015) and PathQuestion, using both standard evaluation metrics and human evaluation. Results demonstrate that our model outperforms state-of-the-art methods by a significant margin and that it generates questions that are more correct, diverse, and fluent.

Related Work
Our work is inspired by recent work on KBQG based on encoder-decoder frameworks. Owing to the development of neural networks, the encoder-decoder model was initially proposed for text generation (Sutskever et al., 2014) and has achieved strong performance. Building on this success, Serban et al. (2016) first proposed a neural network for mapping KB fact triples into corresponding natural-language questions and created the 30M Factoid Question-Answer corpus. However, their approach requires a large number of fact-question pairs as training data, which is not necessarily available for each domain. To address this challenge, Song et al. (2016) proposed an unsupervised system that generates questions from a domain-specific KB without requiring any labeled data; moreover, the types of generated questions are more diverse, without any restrictions. Indurthi et al. (2017) proposed an RNN-based question generation model to generate natural-language question-answer pairs from a knowledge graph. To generalize KBQG to unseen predicates and entity types, Elsahar et al. (2018) leveraged other contexts in a natural-language corpus within an encoder-decoder architecture, paired with an original part-of-speech copy-action mechanism to generate questions. However, such contexts may make it difficult to generate questions that express the given predicate and associate with a definitive answer. Thus, Liu et al. (2019) presented a neural encoder-decoder model that integrates diversified off-the-shelf contexts and an answer-aware loss, obtaining significant improvements. Based on the Transformer (Vaswani et al., 2017) architecture, Kumar et al. (2019a) proposed an end-to-end neural-network-based model for generating complex multi-hop and difficulty-controllable questions over knowledge graphs. Instead of using a single KB triple, Chen et al. (2020) applied a bidirectional Graph2Seq model to generate questions from a KB subgraph and target answers.
Nevertheless, we observe that two important research issues remain poorly addressed or even neglected, as mentioned in Section 1. We therefore focus on these two issues: generating diverse and fluent questions, and solving the semantic drift problem during question generation.
Our model is also inspired by text generation with reinforcement learning (RL), which has been successfully applied to the question generation task. Pan et al. (2019) proposed a reinforced dynamic reasoning network, which is based on the general encoder-decoder framework but incorporates a dynamic reasoning component to better generate conversational questions via an RL mechanism. Kumar et al. (2019b) proposed two novel QG-specific reward functions for the text conformity and answer conformity of the generated question. Our work is also related to copy mechanisms. To handle rare or unknown words by copying from the input, Gu et al. (2016) first incorporated a copy mechanism into neural-network-based Seq2Seq learning and proposed a new encoder-decoder model called CopyNet. Bao et al. (2018) proposed KB copy to copy elements from a table (KB). Different from the above methods, Li et al. (2019) designed a dual copy mechanism that copies from two sources with two gates to maintain the informativeness and faithfulness of generated questions.

Methodology
In this section, we present the details of our model. The overall architecture is shown in Figure 1. Our model consists of a knowledge-augmented fact encoder, a typed decoder, and a grammar-guided evaluator within a reinforcement learning framework. The knowledge-augmented fact encoder takes the given entities, relations, and corresponding auxiliary knowledge, i.e. entity descriptions and domains, as input and learns a knowledge-augmented fact representation. The learned representation is passed to the typed decoder for question generation. The evaluator rewards each generated question using the grammatical similarity between it and the ground-truth question. Based on the reward assigned by the evaluator, the encoder-decoder module updates its parameters and improves its generation.


Problem Formulation
In this paper, we leverage auxiliary knowledge about the input triples to generate questions over a background KB. We assume a collection of triples (i.e. facts) F as input. F consists of two parts E and R, where E = {e_1, · · · , e_n} denotes a set of entities (i.e. subjects or objects) and R = {r_1, · · · , r_{n−1}} denotes all the predicates (i.e. relations) connecting these entities. Moreover, e_n ∈ E denotes the answer entity. Note that these facts form an answer path, a sequence of entities and relations in the KB that starts from the subject and ends with the answer:

e_1 --r_1--> e_2 --r_2--> · · · --r_{n−1}--> e_n.

Given the above definitions, the task of KBQG can be formalized as generating the question Y that maximizes the conditional probability

P(Y | F, K) = ∏_{t=1}^{|Y|} P(y_t | y_{<t}, F, K).   (1)

Here K = (D, O) represents auxiliary knowledge, where D = {x_1, · · · , x_n} denotes the set of entity descriptions and O = {o_1, · · · , o_n} denotes the domains (i.e. types) of the entities. Y = (y_1, · · · , y_{|Y|}) is the generated question, and y_{<t} denotes all generated question words before time-step t.

Knowledge-augmented Fact Encoder
Contrary to conventional encoders, our model takes as input not only triples but also the corresponding auxiliary knowledge as described above. We design a multi-level encoder to obtain the representation of knowledge-augmented facts. We describe our multi-level encoder below, which consists of entity encoder, relation memory, and knowledge-augmented fact encoder.

Entity Encoder
Facts in F provide only the most pertinent information about entities and relations, which is not sufficient to generate a diverse question, especially when F is small. In this paper, we link each entity in E to its respective Wikidata page and obtain corresponding auxiliary knowledge, including a brief description and a domain definition, to enrich the source input. For example, for the entity "LeBron James" in Table 1, the description and domain are "American basketball player" and "human", respectively.
We leverage label, description and domain information to represent each entity. Since the label information of an entity e i is a single token, we obtain the label embedding l i ∈ R d from a KB embedding matrix E f ∈ R k×d , where k represents the size of KB vocabulary.
Both the description and the domain information are sequences of words, and we employ a two-layer bidirectional LSTM (BiLSTM) network to encode each of them. Given an entity e_i, its description X_i = {x^i_1, · · · , x^i_m} is a sequence of m words. The BiLSTM encoder calculates the hidden state at time-step t by h_t = BiLSTM(h_{t−1}, x^i_t). We take the hidden state of the final time-step, h_m, as the description embedding x_i. The domain embedding o_i is calculated in the same way. The entity embedding e_i is then the concatenation of the label, description, and domain embeddings: e_i = [l_i; x_i; o_i].
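The entity-embedding construction above can be sketched as follows. This is a minimal illustration only: mean pooling stands in for the paper's two-layer BiLSTM sequence encoder, the 2-dimensional toy vectors are invented for the example, and `mean_pool` and `entity_embedding` are hypothetical helper names.

```python
def mean_pool(token_vectors):
    """Stand-in for the two-layer BiLSTM sequence encoder:
    average the token vectors to obtain one fixed-size embedding."""
    dim = len(token_vectors[0])
    return [sum(v[j] for v in token_vectors) / len(token_vectors) for j in range(dim)]

def entity_embedding(label_vec, desc_token_vecs, domain_token_vecs):
    """e_i = [l_i ; x_i ; o_i]: concatenate the label embedding with the
    pooled description and domain embeddings, as in the entity encoder."""
    x_i = mean_pool(desc_token_vecs)    # description embedding
    o_i = mean_pool(domain_token_vecs)  # domain embedding
    return label_vec + x_i + o_i        # list concatenation = vector concat

# toy 2-d vectors for the entity "LeBron James"
label = [1.0, 0.0]
desc = [[0.5, 0.5], [1.5, 0.5]]   # "American", "basketball player" (toy values)
domain = [[0.0, 1.0]]             # "human"
e = entity_embedding(label, desc, domain)
```

The concatenated embedding has three times the base dimension, so the paper's linear transformation layer (Section 3.2.3) is needed to project it back to the shared embedding size.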

Global Relation Encoder
Relations in a knowledge base are typically organised hierarchically, e.g. root/people/deceased person/place of death. The global relation encoder exploits this hierarchical structure through an N-ary Tree-LSTM (Tai et al., 2015) to encode these relations. Each LSTM unit in the relation encoder can incorporate information from multiple child units, where N is the branching factor of the tree. Each unit (indexed by j) contains an input gate i_j and an output gate o_j, a memory cell c_j, and a hidden state h_j. Instead of a single forget gate, the N-ary Tree-LSTM unit contains one forget gate f_{jk} for each child k, k = 1, 2, · · · , N, where the hidden state and memory cell of the k-th child are h_{jk} and c_{jk}, respectively. Given the input r_j, the N-ary Tree-LSTM computes:

i_j = σ(W^(i) r_j + Σ_{ℓ=1}^{N} U^(i)_ℓ h_{jℓ} + b^(i)),
f_{jk} = σ(W^(f) r_j + Σ_{ℓ=1}^{N} U^(f)_{kℓ} h_{jℓ} + b^(f)),
o_j = σ(W^(o) r_j + Σ_{ℓ=1}^{N} U^(o)_ℓ h_{jℓ} + b^(o)),
u_j = tanh(W^(u) r_j + Σ_{ℓ=1}^{N} U^(u)_ℓ h_{jℓ} + b^(u)),
c_j = i_j ⊙ u_j + Σ_{ℓ=1}^{N} f_{jℓ} ⊙ c_{jℓ},
h_j = o_j ⊙ tanh(c_j).

Finally, we use the hidden state h_j of each node to represent the corresponding relation embedding r_j. In this way, the encoding is performed once, and the relation embeddings are updated through backpropagation during training.
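A minimal sketch of one N-ary Tree-LSTM unit follows, assuming scalar (1-dimensional) states for readability; in the actual model all states are vectors and the parameters are matrices. The function name and the toy parameter values are invented for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nary_tree_lstm_unit(x, children, W, U, b):
    """One N-ary Tree-LSTM unit (Tai et al., 2015) with scalar states.
    children: list of N (h_k, c_k) pairs from the child nodes.
    Gates: input i, one forget gate f_k per child, output o, candidate u."""
    N = len(children)
    hs = [h for h, _ in children]
    i = sigmoid(W["i"] * x + sum(U["i"][k] * hs[k] for k in range(N)) + b["i"])
    o = sigmoid(W["o"] * x + sum(U["o"][k] * hs[k] for k in range(N)) + b["o"])
    u = math.tanh(W["u"] * x + sum(U["u"][k] * hs[k] for k in range(N)) + b["u"])
    # a separate forget gate for each child k, each seeing all child states
    f = [sigmoid(W["f"] * x + sum(U["f"][k][l] * hs[l] for l in range(N)) + b["f"])
         for k in range(N)]
    c = i * u + sum(f[k] * c_k for k, (_, c_k) in enumerate(children))
    h = o * math.tanh(c)
    return h, c

# toy binary node (N = 2) with made-up parameters
params_W = {"i": 1.0, "o": 1.0, "u": 1.0, "f": 1.0}
params_U = {"i": [0.5, 0.5], "o": [0.5, 0.5], "u": [0.5, 0.5],
            "f": [[0.5, 0.5], [0.5, 0.5]]}
params_b = {"i": 0.0, "o": 0.0, "u": 0.0, "f": 0.0}
h, c = nary_tree_lstm_unit(1.0, [(0.2, 0.1), (0.3, 0.2)], params_W, params_U, params_b)
```

The per-child forget gates let a node retain or discard the memory of each child independently, which is what makes the unit suitable for hierarchical relation paths.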

Knowledge-augmented Fact Encoder
With knowledge-augmented embeddings of all entities and relations, we encode the triples F using a two-layer bidirectional LSTM network over the input sequence (e_1, r_1, e_2, · · · , r_{n−1}, e_n), where each e_i and r_j is obtained as described in Sections 3.2.1 and 3.2.2, respectively. Note that we use a linear layer to transform embeddings so as to keep the embedding sizes consistent. We regard the hidden states as semantic representations, obtaining the entity representations (h_1, h_3, h_5, . . . , h_{2n−1}) and the relation representations (h_2, h_4, h_6, . . . , h_{2n−2}). The last hidden state of the BiLSTM is the knowledge-augmented fact representation F, which is fed into the decoder for question generation.
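The interleaved input sequence above can be built as follows; `interleave_facts` is a hypothetical helper name, and strings stand in for the actual embedding vectors.

```python
def interleave_facts(entities, relations):
    """Build the encoder input sequence (e_1, r_1, e_2, ..., r_{n-1}, e_n)
    from n entity embeddings and n-1 relation embeddings."""
    assert len(relations) == len(entities) - 1
    seq = []
    for i, e in enumerate(entities):
        seq.append(("e", e))
        if i < len(relations):
            seq.append(("r", relations[i]))
    return seq

seq = interleave_facts(["LeBron James", "high school", "Ohio"],
                       ["education", "location"])
# in the BiLSTM output, entity hidden states sit at odd positions
# h_1, h_3, h_5 (1-indexed) and relation hidden states at h_2, h_4
```

This positional convention is what lets the decoder's attention separate entity representations from relation representations later on.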

Typed Decoder
To generate questions that are consistent with the input subgraph, and inspired by previous work (Du et al., 2017), we employ an LSTM-based typed decoder that calculates type-specific word probability distributions under the assumption that each word has a latent type from the set {interrogative, entity word, relation word, ordinary word}. In conjunction, we employ a conditional copy mechanism to allow copying from either the entity input or the relation input.
At the t-th time-step, the decoder reads the previously generated word embedding y_{t−1} and the previous hidden state s_{t−1} to produce the current hidden state s_t = LSTM(s_{t−1}, y_{t−1}). Since the first token of the generated question is the interrogative, which is vital for the semantic consistency of the generated question, we use the answer embedding, instead of the special start-of-sequence token <SOS> embedding, at the first time-step of the decoder. The answer embedding is the embedding of entity e_n, obtained by the entity encoder, and contains label, description, and domain information. With an explicit answer embedding, the generated interrogative is more accurate, thus alleviating the semantic drift problem.
For conditional copying from the entity and relation source inputs, we leverage a gated attention mechanism to jointly attend to the entity and relation representations. For the entity representations (h_1, h_3, h_5, . . . , h_{2n−1}), the entity context vector c^e_t is calculated by the attention mechanism:

α^e_{t,i} = softmax(s_t^⊤ W_α h_{2i−1}),   c^e_t = Σ_i α^e_{t,i} h_{2i−1},

where W_α is a trainable weight parameter. Similarly, the relation context vector c^r_t is obtained from the relation representations. A gating mechanism then controls the information flow from the two sources:

g_t = σ(W_g [c^e_t; c^r_t] + b_g),   c_t = g_t ⊙ c^e_t + (1 − g_t) ⊙ c^r_t.

Generally, the predicted probability distribution over the vocabulary V is calculated as

P_V = softmax(W_v [s_t; c_t] + b_v).

Different from a conventional decoder, our typed decoder calculates type-specific generation distributions. Once the interrogative has been generated, the word types in the subsequent decoding steps are restricted to {entity word, relation word, ordinary word}. We first estimate a distribution over word types and decide whether to copy or generate according to the word type. If the word is an entity or relation word, we copy the token from the entity or relation source input, respectively. If the word is an ordinary word, we use the type-specific generation distribution over the whole vocabulary. Finally, the generation probability is a mixture of the type-specific generation/copy distributions, whose coefficients are the type probabilities.
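The gated attention over the two sources can be sketched as below. This is a simplified illustration: a plain dot product stands in for the bilinear form with W_α, and the gate value is passed in directly rather than predicted from the decoder state; all function names and toy vectors are invented for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(s_t, reps):
    """Score each source representation against the decoder state s_t,
    then return the attention-weighted context vector and the weights."""
    alphas = softmax([dot(s_t, h) for h in reps])
    dim = len(s_t)
    ctx = [sum(a * h[j] for a, h in zip(alphas, reps)) for j in range(dim)]
    return ctx, alphas

def gated_context(s_t, ent_reps, rel_reps, gate):
    """Mix the entity context c^e_t and relation context c^r_t with a
    gate in [0, 1] (in the model the gate is predicted from s_t)."""
    c_e, a_e = attend(s_t, ent_reps)
    c_r, a_r = attend(s_t, rel_reps)
    c_t = [gate * e + (1.0 - gate) * r for e, r in zip(c_e, c_r)]
    return c_t, a_e, a_r

s = [1.0, 0.0]
c, a_e, a_r = gated_context(s, [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]], 0.5)
```

Note that the attention weights `a_e` and `a_r` are reused later to define the copy distributions over entity and relation tokens.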
We reuse the attention scores α^e_{t,i} and α^r_{t,i} to derive the copy probabilities over entities and relations:

P_E(y_t) = Σ_{i: e_i = y_t} α^e_{t,i},   P_R(y_t) = Σ_{j: r_j = y_t} α^r_{t,j}.

The final generation distribution P(y_t | y_{<t}, F, K), from which a word is sampled, is computed by

P(y_t | y_{<t}, F, K) = Σ_i P(τ_{y_t} = g_i | y_{<t}, F, K) · P(y_t | τ_{y_t} = g_i, y_{<t}, F, K),

where τ_{y_t} is the word type at time-step t and g_i ranges over the three word types {g_e, g_r, g_o}.
Each word can take any of the three types, with probabilities depending on the current context. The probability distribution over the three word types is calculated by

P(τ_{y_t} | y_{<t}, F, K) = softmax(W_0 s_t + b_0),

where W_0 ∈ R^{3×d} and d is the dimension of the hidden state. The type-specific probability distributions are

P(y_t | τ_{y_t} = g_e, y_{<t}, F, K) = P_E,   P(y_t | τ_{y_t} = g_r, y_{<t}, F, K) = P_R,   P(y_t | τ_{y_t} = g_o, y_{<t}, F, K) = P_V.   (5)
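The type-probability-weighted mixture described above can be sketched as follows; the toy distributions over a four-word vocabulary and the function name are invented for the example.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def typed_mixture(type_logits, p_entity, p_relation, p_vocab):
    """Final generation distribution: the three type-specific
    distributions P_E, P_R, P_V mixed with the type probabilities
    P(tau = g_e), P(tau = g_r), P(tau = g_o) as coefficients."""
    t_e, t_r, t_o = softmax(type_logits)  # distribution over word types
    vocab = set(p_entity) | set(p_relation) | set(p_vocab)
    return {w: t_e * p_entity.get(w, 0.0)
               + t_r * p_relation.get(w, 0.0)
               + t_o * p_vocab.get(w, 0.0) for w in vocab}

# toy distributions: each input distribution sums to 1
P_E = {"lebron": 1.0}             # copy distribution over entity words
P_R = {"school": 1.0}             # copy distribution over relation words
P_V = {"where": 0.7, "the": 0.3}  # generation distribution over the vocabulary
P = typed_mixture([0.0, 0.0, 0.0], P_E, P_R, P_V)
```

Because the mixture coefficients sum to one and each component is itself a distribution, the result is again a valid probability distribution over the union of the three vocabularies.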

Evaluator
We employ a reinforcement learning framework in which the evaluator fine-tunes the parameters of the encoder-decoder module by optimizing task-specific reward functions through policy gradient. Previous works directly use final evaluation metrics such as BLEU, GLEU, and ROUGE-L (Du et al., 2017; Kumar et al., 2019b) as rewards. Kumar et al. (2019b) also proposed the question sentence overlap score (QSS), the number of common n-grams between the predicted question and the source sentence, as a reward function. Consequently, these methods tend to reward generated questions with large n-gram overlap with the ground-truth question or the source context, which may result in the generation of highly similar but unvaried questions. We therefore present a new reward function specifically designed to improve the variety of generated questions.
DPTS Reward. A dependency parse tree (DPT) provides the grammatical structure of a sentence by annotating edges with dependency types. We propose DPTS, Dependency Parse Tree Similarity, between the generated question and the ground-truth question as our reward function. DPTS encourages the generation of syntactically and semantically valid questions and further improves their diversity, since it is not defined over n-gram overlap. To calculate DPTS, we leverage the ACVT (Attention Constituency Vector Tree) kernel (Quan et al., 2019), which efficiently computes similarity based on the number of common substructures between two trees. To apply the DPTS reward, we employ the self-critical sequence training (SCST) algorithm (Rennie et al., 2017). At each training iteration, the model generates two output sequences: the sampled output Y^s, in which each word y^s_t is sampled according to the likelihood P(y_t | y_{<t}, F, K) predicted by the generator, and the baseline output Ŷ, obtained by greedy search. Let r(Y) denote the DPTS reward of an output sequence Y; the loss function is defined as

L_rl = (r(Ŷ) − r(Y^s)) Σ_t log P(y^s_t | y^s_{<t}, F, K).
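The SCST loss above reduces to a few lines once the two rewards and the sampled sequence's log-probabilities are available. The sketch below uses made-up reward values in place of actual DPTS scores; `scst_loss` is a hypothetical helper name.

```python
import math

def scst_loss(log_probs_sampled, reward_sampled, reward_baseline):
    """Self-critical sequence training loss (Rennie et al., 2017):
    L_rl = (r(Y_hat) - r(Y_s)) * sum_t log P(y_t^s | ...).
    When the sampled sequence scores above the greedy baseline, the
    coefficient is negative and (since log-probs are negative) the loss
    is positive; minimising it raises the sampled sequence's likelihood."""
    return (reward_baseline - reward_sampled) * sum(log_probs_sampled)

# toy numbers: the sampled question beats the greedy baseline on DPTS
loss = scst_loss([math.log(0.5), math.log(0.4)],
                 reward_sampled=0.8, reward_baseline=0.6)
```

Using the model's own greedy output as the baseline removes the need to learn a separate value function, which is the main appeal of SCST here.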

Inference and Optimization
Apart from the loss in the evaluator, we adopt a negative log-likelihood loss on the generated words and apply supervision on the mixture weights of word types:

L_cl = −Σ_t log P(ŷ_t | ŷ_{<t}, F, K),   L_wl = −Σ_t log P(τ_{ŷ_t} | ŷ_{<t}, F, K),

where ŷ_t is the reference word and τ_{ŷ_t} is the reference word type at time-step t. The overall loss function is defined as L = L_cl + αL_wl + βL_rl, where α and β are two factors balancing the three loss terms.
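The combined objective can be illustrated with a short sketch; the probabilities assigned to the reference words and types, the RL loss value, and the weights α and β below are all toy values, and the function names are invented for the example.

```python
import math

def nll(probs_of_targets):
    """Negative log-likelihood of the reference tokens (or types)."""
    return -sum(math.log(p) for p in probs_of_targets)

def total_loss(word_probs, type_probs, l_rl, alpha, beta):
    """L = L_cl + alpha * L_wl + beta * L_rl: word-level NLL, word-type
    supervision, and the RL loss, balanced by alpha and beta."""
    return nll(word_probs) + alpha * nll(type_probs) + beta * l_rl

# probabilities the model assigns to each reference word / word type (toy)
L = total_loss([0.9, 0.8], [0.95, 0.85], l_rl=0.3, alpha=0.5, beta=0.1)
```

Supervising the type distribution alongside the words gives the typed decoder an explicit training signal for the latent word types rather than leaving them to emerge implicitly.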

Experiment
In this section we present the evaluation of our KBQG model KTG. The main experiments compare our model to a number of baseline models in two settings: automatic evaluation using standard metrics, as well as human evaluation over a number of criteria. We also conduct an ablation analysis to examine the effect of various components on model performance.

Datasets and Preprocessing
We conduct experiments on two widely-used benchmark datasets: SimpleQuestion (Bordes et al., 2015) and PathQuestion. To obtain auxiliary knowledge, we link each entity and predicate in an input subgraph to Wikidata (Vrandečić and Krötzsch, 2014), an open knowledge base, and retrieve the corresponding entity description and domain, as well as the predicate hierarchy. In SimpleQuestion, entities are represented by their Freebase IDs, so we first map these Freebase IDs to Wikidata IDs and then retrieve auxiliary knowledge by Wikidata ID. PathQuestion contains verbalized entities and predicates, which can be used directly to link auxiliary knowledge. For both datasets, we add auxiliary knowledge to questions in parentheses; as shown in Figure 1, the italic and bold words in each question are auxiliary knowledge. SimpleQuestion consists of over 108,000 samples and PathQuestion of over 11,700 samples. We randomly select 70% of the samples for training, 10% for validation, and 20% for testing.

Experimental Settings
The sizes of the KB embeddings and word embeddings are both set to 300, as is the hidden vector size of the BiLSTM. The Adam (Kingma and Ba, 2015) optimizer is used for training, with the learning rate set to 2e-5. The batch size and dropout rate are set to 64 and 0.5, respectively. We stop training when the performance difference between two consecutive iterations is smaller than 1e-6.

Baseline Models
We compare our method with the following baseline models. RNN-based: an RNN-based question generation model that generates natural-language question-answer pairs from a knowledge graph (Indurthi et al., 2017). Zero-Shot: a zero-shot KBQG model for unseen predicates and entity types (Elsahar et al., 2018). Multi-hop: an end-to-end neural-network-based method for automatic generation of complex multi-hop questions over knowledge graphs (Kumar et al., 2019a). Ans-aware: a KBQG model using diversified contexts and an answer-aware loss (Liu et al., 2019). BiGraph2Seq: a bidirectional Graph2Seq model that generates questions from a KB subgraph and target answers (Chen et al., 2020).

Evaluation Metrics
Following previous KBQG work, we rely on a set of well-established metrics for question generation: BLEU-4 (B-4) (Papineni et al., 2002), METEOR (ME.) (Denkowski and Lavie, 2014) and ROUGE-L (R-L) (Lin, 2004) for automatic evaluation. Moreover, we conduct human evaluations on 50 randomly chosen questions from the test set of each dataset. Two human annotators were asked to judge each question on the following three criteria on a Likert scale of 1-5, with 1 being the worst and 5 the best. Naturalness (Nat.) rates the fluency and comprehensibility of the generated question. Diversity (Div.) indicates whether the generated question contains diverse information. Correctness (Cor.) measures whether the question is free of grammatical errors.

Results and Discussion
The results of all evaluations are shown in Table 2. On the automatic evaluations, our model considerably outperforms all baselines on all metrics across both datasets. The BLEU-4 score of our full model KTG (last row) increases by 6.93 percentage points on SimpleQuestion and 5.7 percentage points on PathQuestion compared with BiGraph2Seq, the strongest baseline. Similar improvements can be observed for METEOR and ROUGE-L. It is worth noting that the results of the variants KTG⊕BLEU, KTG⊕ROUGE and KTG⊕QSS are highly similar, and all outperform the baseline models; yet our full model KTG attains superior performance, which demonstrates the effectiveness of the DPTS reward for question generation. On human evaluation, our model also consistently achieves the best performance, generating significantly more natural, diverse, and correct questions, with the highest naturalness, diversity, and correctness scores among all compared models.

Ablation Test
We conduct an ablation test to examine the effectiveness of our model components by removing the auxiliary knowledge, the typed decoder, and reinforcement learning one at a time. Several important observations can be made from the results in Table 3. Both the auxiliary knowledge in the encoder and the typed decoder contribute significantly, and to a similar degree, to the overall model performance: removing either causes a marked fall on all metrics. However, some subtle nuances in their contributions can be observed from the human evaluation. Removing the auxiliary knowledge results in the biggest reduction in both naturalness and diversity. This is consistent with the purpose of the component, as it is designed to equip the model with more information for generating more varied questions. Similarly, replacing the typed decoder with a general decoder causes a larger drop in correctness than removing the auxiliary knowledge. This again validates the effectiveness of the typed decoder, which is designed to mitigate the semantic drift problem by generating correct interrogatives.
Finally, reinforcement learning improves naturalness, diversity, and correctness. This is because the DPTS-based evaluator rewards high grammatical conformity (thus improving correctness) without enforcing n-gram similarity (thus improving naturalness and diversity).

Table 3: Ablation test removing each main component one at a time, where "w/o knowledge" removes auxiliary knowledge from the model input, "w/o type" replaces the typed decoder with a general LSTM decoder that does not classify word types, and "w/o RL" denotes our model trained without reinforcement learning (and hence without the DPTS-based reward).

Table 4.7 lists questions generated by the various models for the same subgraph, providing an intuitive illustration of how our model improves question generation. Compared with the two baseline models Ans-aware and BiGraph2Seq, all our model variants generate questions of much higher quality. Among the variants, without auxiliary knowledge, KTG-knowledge generates only a plain question lacking the additional information "American actress". Without the typed decoder, KTG-type generates a wrong interrogative ("what" instead of "where"). Lastly, with reinforcement learning incorporated, the full model generates a syntactically and semantically valid question.

Conclusion
In this paper, we tackle two crucial challenges for the task of question generation over knowledge bases (KBQG): insufficient source input and the semantic drift problem. We enrich the encoder input with auxiliary knowledge, including entity descriptions and predicate domains, to improve question diversity. We employ a typed decoder with a conditional copy mechanism to further improve the semantic consistency of generated questions. We additionally optimize model performance through reinforcement learning, designing a novel reward function based on grammatical similarity rather than n-gram overlap. This reward encourages the generation of syntactically and semantically valid questions while allowing more diversity and fluency. Experimental results on two benchmark datasets show that our model achieves significant improvements over state-of-the-art models on all automatic and human evaluation metrics. The source code will be released at https://github.com/bisheng/KTG4KBQG to encourage reproducibility and further research.