Learning to Ask Questions in Open-domain Conversational Systems with Typed Decoders

Asking good questions in open-domain conversational systems is quite significant but rather untouched. This task, substantially different from traditional question generation, requires to question not only with various patterns but also on diverse and relevant topics. We observe that a good question is a natural composition of interrogatives, topic words, and ordinary words. Interrogatives lexicalize the pattern of questioning, topic words address the key information for topic transition in dialogue, and ordinary words play syntactical and grammatical roles in making a natural sentence. We devise two typed decoders (soft typed decoder and hard typed decoder) in which a type distribution over the three types is estimated and the type distribution is used to modulate the final generation distribution. Extensive experiments show that the typed decoders outperform state-of-the-art baselines and can generate more meaningful questions.


Introduction
Learning to ask questions (or, question generation) aims to generate a question to a given input.Deciding what to ask and how is an indicator of machine understanding (Mostafazadeh et al., 2016), as demonstrated in machine comprehension (Du et al., 2017;Zhou et al., 2017b;Yuan et al., 2017) and question answering (Tang et al., 2017;Wang et al., 2017).Raising good questions is essential to conversational systems because a good system can well interact with users by asking and responding (Li et al., 2016).Furthermore, asking questions is one of the important proactive behaviors that can drive dialogues to go deeper and further (Yu et al., 2016).
Question generation (QG) in open-domain conversational systems differs substantially from the traditional QG tasks.The ultimate goal of this task is to enhance the interactiveness and persistence of human-machine interactions, while for traditional QG tasks, seeking information through a generated question is the major purpose.The response to a generated question will be supplied in the following conversations, which may be novel but not necessarily occur in the input as that in traditional QG (Du et al., 2017;Yuan et al., 2017;Tang et al., 2017;Wang et al., 2017;Mostafazadeh et al., 2016).Thus, the purpose of this task is to spark novel yet related information to drive the interactions to continue.
Due to the different purposes, this task is unique in two aspects: it requires to question not only in various patterns but also about diverse yet relevant topics.First, there are various questioning patterns for the same input, such as Yes-no questions and Wh-questions with different interrogatives.Diversified questioning patterns make dialogue interactions richer and more flexible.Instead, traditional QG tasks can be roughly addressed by syntactic transformation (Andrenucci and Sneiders, 2005;Popowich and Winne, 2013), or implicitly modeled by neural models (Du et al., 2017).In such tasks, the information questioned on is pre-specified and usually determines the pattern of questioning.For instance, asking Whoquestion for a given person, or Where-question for a given location.
Second, this task requires to address much more transitional topics of a given input, which is the nature of conversational systems.For instance, for the input "I went to dinner with my friends", we may question about topics such as friend, cuisine, price, place and taste.Thus, this task generally requires scene understanding to imagine and comprehend a scenario (e.g., dining at a restaurant) that can be interpreted by topics related to the input.However, in traditional QG tasks, the core information to be questioned on is pre-specified and rather static, and paraphrasing is more required.Undoubtedly, asking good questions in conversational systems needs to address the above issues (questioning with diversified patterns, and addressing transitional topics naturally in a generated question).As shown in Figure 1, a good question is a natural composition of interrogatives, topic words, and ordinary words.Interrogatives indicate the pattern of questioning, topic words address the key information of topic transition, and ordinary words play syntactical and grammatical roles in making a natural sentence.
We thus classify the words in a question into three types: interrogative, topic word, and ordinary word automatically.We then devise two decoders, Soft Typed Decoder (STD) and Hard Typed Decoder (HTD), for question generation in conversational systems1 .STD deals with word types in a latent and implicit manner, while HTD in a more explicit way.At each decoding position, we firstly estimate a type distribution over word types.STD applies a mixture of type-specific generation distributions where type probabilities are the coefficients.By contrast, HTD reshapes the type distribution by Gumbel-softmax and modulates the generation distribution by type probabilities.Our contributions are as follows: • To the best of our knowledge, this is the first study on question generation in the setting of conversational systems.We analyze the key differences between this new task and other traditional question generation tasks.
• We devise soft and hard typed decoders to ask good questions by capturing different roles of different word types.Such typed decoders may be applicable to other generation tasks if word semantic types can be identified.

Related Work
Traditional question generation can be seen in task-oriented dialogue system (Curto et al., 2012), sentence transformation (Vanderwende, 2008), machine comprehension (Du et al., 2017;Zhou et al., 2017b;Yuan et al., 2017;Subramanian et al., 2017), question answering (Qin, 2015;Tang et al., 2017;Wang et al., 2017;Song et al., 2017), and visual question answering (Mostafazadeh et al., 2016).In such tasks, the answer is known and is part of the input to the generated question.Meanwhile, the generation tasks are not required to predict additional topics since all the information has been provided in the input.They are applicable in scenarios such as designing questions for reading comprehension (Du et al., 2017;Zhou et al., 2017a;Yuan et al., 2017), and justifying the visual understanding by generating questions to a given image (video) (Mostafazadeh et al., 2016).In general, traditional QG tasks can be addressed by the heuristic rule-based reordering methods (Andrenucci and Sneiders, 2005;Ali et al., 2010;Heilman and Smith, 2010), slotfilling with question templates (Popowich and Winne, 2013;Chali and Golestanirad, 2016;Labutov et al., 2015), or implicitly modeled by recent neural models (Du et al., 2017;Zhou et al., 2017b;Yuan et al., 2017;Song et al., 2017;Subramanian et al., 2017).These tasks generally do not require to generate a question with various patterns: for a given answer and a supporting text, the question type is usually decided by the input.
Question generation in large-scale, opendomain dialogue systems is relatively unexplored.Li et al. (2016) showed that asking questions in task-oriented dialogues can offer useful feedback to facilitate learning through interactions.Several questioning mechanisms were devised with handcrafted templates, but unfortunately not applicable to open-domain conversational systems.Similar to our goal, a visual QG task is proposed to generate a question to interact with other people, given an image as input (Mostafazadeh et al., 2016).

Overview
The task of question generation in conversational systems can be formalized as follows: given a user post X = x 1 x 2 • • • x m , the system should generate a natural and meaningful question Y = y 1 y 2 • • • y n to interact with the user, formally as As aforementioned, asking good questions in conversational systems requires to question with diversified patterns and address transitional topics naturally in a question.To this end, we classify the words in a sentence into three types: interrogative, topic word, and ordinary word, as shown in Figure 1.During training, the type of each word in a question is decided automatically2 .We manually collected about 20 interrogatives.The verbs and nouns in a question are treated as topic words, and all the other words as ordinary words.During test, we resort to PMI (Church and Hanks, 1990) to predict a few topic words for a given post.
On top of an encoder-decoder framework, we propose two decoders to effectively use word types in question generation.The first model is soft typed decoder (STD).It estimates a type distribution over word types and three type-specific generation distributions over the vocabulary, and then obtains a mixture of type-specific distributions for word generation.
The second one is a hard form of STD, hard typed decoder (HTD), in which we can control the decoding process more explicitly by approximating the operation of argmax with Gumbel-softmax (Jang et al., 2016).In both decoders, the final generation probability of a word is modulated by its word type.

Encoder-Decoder Framework
Our model is based on the general encoderdecoder framework (Cho et al., 2014;Sutskever et al., 2014).Formally, the model encodes an input sequence X = x 1 x 2 • • • x m into a sequence of hidden states h i , as follows, where GRU denotes gated recurrent units (Cho et al., 2014), and e(x) is the word vector of word x.The decoder generates a word sequence by sampling from the probability P(y t |y <t , X) (y <t = y 1 y 2 • • • y t−1 , the generated subsequence) which can be computed via P(y t |y <t , X) = MLP(s t , e(y t−1 ), c t ), where s t is the state of the decoder at the time step t, and this GRU has different parameters with the one of the encoder.The context vector c t is an attentive read of the hidden states of the encoder as c t = T i=1 α t,i h i , where the weight α t,i is scored by another MLP(s t−1 , h i ) network.

Soft Typed Decoder (STD)
In a general encoder-decoder model, the decoder tends to generate universal, meaningless questions like "What's up?" and "So what?".In order to generate more meaningful questions, we propose a soft typed decoder.It assumes that each word has a latent type among the set {interrogative, topic word, ordinary word}.The soft typed decoder firstly estimates a word type distribution over latent types in the given context, and then computes type-specific generation distributions over the entire vocabulary for different word types.The final probability of generating a word is a mixture of type-specific generation distributions where the coefficients are type probabilities.
The final generation distribution P(y t |y <t , X) from which a word can be sampled, is given by where ty t denotes the word type at time step t and c i is a word type.Apparently, this formulation states that the final generation probability is a mixture of the type-specific generation probabilities P(y t |ty t = c i , y <t , X), weighted by the probability of the type distribution P(ty t = c i |y <t , X).We name this decoder as soft typed decoder.In this model, word type is latent because we do not need to specify the type of a word explicitly.In other words, each word can belong to any of the three types, but with different probabilities given the current context.
The probability distribution over word types as type distribution) is given by where s t is the hidden state of the decoder at time step t, W 0 ∈ R k×d , and d is the dimension of the hidden state.
The type-specific generation distribution is given by where W c i ∈ R |V |×d and |V | is the size of the entire vocabulary.Note that the type-specific generation distribution is parameterized by W c i , indicating that the distribution for each word type has its own parameters.
Instead of using a single distribution P(y t |y <t , X) as in a general Seq2Seq decoder, our soft typed decoder enriches the model by applying multiple type-specific generation distributions.This enables the model to express more information about the next word to be generated.Also note that the generation distribution is over the same vocabulary, and therefore there is no need to specify word types explicitly.

Hard Typed Decoder (HTD)
In the soft typed decoder, we assume that each word is a distribution over the word types.In this sense, the type of a word is implicit.We do not need to specify the type of each word explicitly.In the hard typed decoder, words in the entire vocabulary are dynamically classified into three types for each post, and the decoder first estimates a type distribution at each position and then generates a word with the highest type probability.This pro-cess can be formulated as follows: This is essentially the hard form of Eq. 1, which just selects the type with the maximal probability.However, this argmax process may cause two problems.First, such a cascaded decision process (firstly selecting the most probable word type and secondly choosing a word from that type) may lead to severe grammatical errors if the first selection is wrong.Second, argmax is discrete and nondifferentiable, and it breaks the back-propagation path during training.
To make best use of word types in hard typed decoder, we address the above issues by applying Gumbel-Softmax (Jang et al., 2016) to approximate the operation of argmax.There are several steps in the decoder (see Figure 2): First, the type of each word (interrogative, topic, or ordinary) in a question is decided automatically during training, as aforementioned.
Second, the generation probability distribution is estimated as usual, Further, the type probability distribution at each decoding position is estimated as follows, Third, the generation probability for each word is modulated by its corresponding type probabil-ity: where c(y t ) looks up the word type of word y t , and c * is the type with the highest probability as defined in Eq. 3.This formulation has exactly the effect of argmax, where the decoder will only generate words of type with the highest probability.
To make P * (y t |y <t , X) a distribution, we normalize these values by a normalization factor Z: Z = 1 yt∈V P (y t |y <t , X) where V is the decoding vocabulary.Then, the final probability can be denoted by As mentioned, in order to have an effect of argmax but still maintain the differentiability, we resort to Gumbel-Softmax (Jang et al., 2016), which is a differentiable surrogate to the argmax function.The type probability distribution is then adjusted to the following form: m(y t ) = GS(P(ty t = c(y t )|y <t , X)),

GS(π
where π 1 , π 2 , • • • , π k represents the probabilities of the original categorical distribution, g j are i.i.d samples drawn from Gumbel(0,1) 3 and τ is a constant that controls the smoothness of the distribution.When τ → 0, Gumbel-Softmax performs like argmax, while if τ → ∞, Gumbel-Softmax performs like a uniform distribution.In our experiments, we set τ a constant between 0 and 1, making Gumbel-Softmax smoother than argmax, but sharper than normal softmax.Note that in HTD, we apply dynamic vocabularies for different responses during training.The words in a response are classified into the three types dynamically.A specific type probability will only affect the words of that type.During test, for each post, topic words are predicted with PMI, interrogatives are picked from a small dictionary, and the rest of words in the vocabulary are treated as ordinary words.

Loss Function
We adopt negative data likelihood (equivalent to cross entropy) as the loss function, and additionally, we apply supervision on the mixture weights of word types, formally as follows: where ty t represents the reference word type and ỹt represents the reference word at time t.λ is a factor to balance the two loss terms, and we set λ=0.8 in our experiments.
Note that for HTD, we substitute P * (y t = w j |y <t , X) (as defined by Eq. 8) into Eq.10.

Topic Word Prediction
The only difference between training and inference is the means of choosing topic words.During training, we identify the nouns and verbs in a response as topic words; whereas during inference, we adopt PMI (Church and Hanks, 1990) and Rel(k i , X) to predict a set of topic words k i for an input post X, as defined below: where p 1 (w)/p 2 (w) represent the probability of word w occurring in a post/response, respectively, and p(w x , w y ) is the probability of word w x occurring in a post and w y in a response.
During inference, we predict at most 20 topic words for an input post.Too few words will affect the grammaticality since the predicted set contains infrequent topic words, while too many words introduce more common topics leading to more general responses.

Experiment 4.1 Dataset
To estimate the probabilities in PMI, we collected about 9 million post-response pairs from Weibo.To train our question generation models, we distilled the pairs whereby the responses are in question form with the help of around 20 hand-crafted templates.The templates contain a list of interrogatives and other implicit questioning patterns.Such patterns detect sentences led by words like what, how many, how about or sentences ended with a question mark.After that, we removed the pairs whose responses are universal questions that can be used to reply many different posts.This is a simple yet effective way to avoid situations where the type probability distribution is dominated by interrogatives and ordinary words.
Ultimately, we obtained the dataset comprising about 491,000 post-response pairs.We randomly selected 5,000 pairs for testing and another 5,000 for validation.The average number of words in post/response is 8.3/9.3 respectively.The dataset contains 66,547 different words, and 18,717 words appear more than 10 times.The dataset is available at: http://coai.cs.tsinghua.edu.cn/hml/dataset/.

Baselines
We compared the proposed decoders with four state-of-the-art baselines.Seq2Seq: A simple encoder-decoder with attention mechanisms (Luong et al., 2015).MA: The mechanism-aware (MA) model applies multiple responding mechanisms represented by real-valued vectors (Zhou et al., 2017a).The number of mechanisms is set to 4 and we randomly picked one response from the generated responses for evaluation to avoid selection bias.TA: The topic-aware (TA) model generates informative responses by incorporating topic words predicted from the input post (Xing et al., 2017).ERM: Elastic responding machine (ERM) adaptively selects a subset of responding mechanisms using reinforcement learning (Zhou et al., 2018a).The settings are the same as the original paper.

Experiment Settings
Parameters were set as follows: we set the vocabulary size to 20, 000 and the dimension of word vectors as 100.The word vectors were pretrained with around 9 million post-response pairs from Weibo and were being updated during the training of the decoders.We applied the 4-layer GRU units (hidden states have 512 dimensions).These settings were also applied to all the baselines.λ in Eq. 12 is 0.8.We set different values of τ in Gumbel-softmax at different stages of training.At the early stage, we set τ to a small value (0.6) to obtain a sharper reformed distri-bution (more like argmax).After several steps, we set τ to a larger value (0.8) to apply a more smoothing distribution.Our codes are available at: https://github.com/victorywys/Learning2Ask_TypedDecoder.

Automatic Evaluation
We conducted automatic evaluation over the 5, 000 test posts.For each post, we obtained responses from the six models, and there are 30, 000 post-response pairs in total.

Evaluation Metrics
We adopted perplexity to quantify how well a model fits the data.Smaller values indicate better performance.To evaluate the diversity of the responses, we employed distinct-1 and distinct-2 (Li et al., 2015).These two metrics calculates the proportion of the total number of distinct unigrams or bigrams to the total number of generated tokens in all the generated responses.
Further, we calculated the proportion of the responses containing at least one topic word in the list predicted by PMI.This is to evaluate the ability of addressing topic words in response.We term this metric as topical response ratio (TRR).We predicted 20 topic words with PMI for each post.

Results
Comparative results are presented in Table 1.STD and HTD perform fairly well with lower perplexities, higher distinct-1 and distinct-2 scores, and remarkably better topical response ratio (TRR).Note that MA has the lowest perplexity because the model tends to generate more universal responses.Our decoders have better distinct-1 and distinct-2 scores than baselines do, and HTD performs much better than the strongest baseline TA.Noticeably, the means of using topic information in our models differs substantially from that in TA.Our decoders predict whether a topic word should be decoded at each position, whereas TA takes as input topic word embeddings at all decoding positions.
Our decoders have remarkably better topic response ratios (TRR), indicating that they are more likely to include topic words in generation.

Manual Evaluation
We resorted to a crowdsourcing service for manual annotation.500 posts were sampled for manual annotation4 .We conducted pair-wise comparison between two responses generated by two models for the same post.In total, there are 4,500 pairs to be compared.For each response pair, five judges were hired to give a preference between the two responses, in terms of the following three metrics.Tie was allowed, and system identifiers were masked during annotation.

Evaluation Metrics
Each of the following metrics is evaluated independently on each pair-wise comparison: Appropriateness: measures whether a question is reasonable in logic and content, and whether it is questioning on the key information.Inappropriate questions are either irrelevant to the post, or have grammatical errors, or universal questions.Richness: measures whether a response contains topic words that are relevant to a given post.Willingness to respond: measures whether a user will respond to a generated question.This metric is to justify how likely the generated questions can elicit further interactions.If people are willing to respond, the interactions can go further.

Results
The label of each pair-wise comparison is decided by majority voting from five annotators.Results shown in Table 2 indicate that STD and HTD outperform all the baselines in terms of all the metrics.This demonstrates that our decoders produce more appropriate questions, with richer topics.Particularly, our decoders have substantially better willingness scores, indicating that questions generated by our models are more likely to elicit further interactions.Noticeably, HTD outperforms STD significantly, indicating that it is beneficial to specify word types explicitly and apply dynamic vocabularies in generation.
We also observed that STD outperforms Seq2Seq and TA, but the differences are not significant in appropriateness.This is because STD generated about 7% non-question responses which were judged as inappropriate, while Seq2Seq and TA generated universal questions (inappropriate too but beat STD in annotation) to these posts.

Annotation Statistics
The proportion of the pair-wise annotations in which at least three of five annotators assign the same label to a record is 90.57%/93.11%/96.62% for appropriateness/ richness/willingness, respectively.The values show that we have fairly good agreements with majority voting.

Questioning Pattern Distribution
To analyze whether the model can question with various patterns, we manually annotated the questioning patterns of the responses to 100 sampled posts.The patterns are classified into 11 types including Yes-No, How-, Why-, What-, When-, and Who-questions.We then calculated the KL diver-gence between the pattern type distribution by a model and that by human (i.e., gold responses).
Results in Table 3 show that the pattern distribution by our model is closer to that in humanwritten responses, indicating that our decoders can better learn questioning patterns from human language.Further investigation reveals that the baselines tend to generate simple questions like What?(什 么 ？) or Really?(真 的 吗), and constantly focus on using one or two question patterns whereas our decoders use more diversified patterns as appeared in the human language.tioning on a particular topic word (效 果-effect) of the input.While for Post-3, the questions are asking about transitional topics of the input (上 班-work → 部门-department; 体育中心-sports center → 体育学院-college of Physical Education), indicating a typical case of topic transition in our task (also seen in Post-4, 寿司-sushi →日式 料理-Japanese food).This example also demonstrates that for the same input, there are various questioning patterns: a How-question asked by human, a Which-question by STD, and a Yes-No question by HTD.As for Post-4, the gold question requires a background that is only shared between the poster and responder, while STD and HTD tend to raise more general questions due to the lack of such shared knowledge.

Visualization of Type Distribution
To gain more insights into how a word type influence the generation process, we visualized the type probability at each decoding position in HTD.This example (Figure 3) shows that the model can capture word types well at different positions.For instance, at the first and second positions, ordinary words have the highest probabilities for generating 你-you and 喜欢-like, and at the third position, a topic word 兔子-rabbit is predicted while the last two positions are for interrogatives (a particle and a question mark).The generated question is "你喜欢兔子吗？do you like rabbit?".EOS means end of sentence.

Error Analysis
We presented error type distribution by manually analyzing 100 bad responses sampled from STD and HTD respectively, where bad means the response by our model is worse than that by some baseline during the pair-wise annotation.
There are 4 typical error types: no topic words (NoT) in a response (mainly universal questions), wrong topics (WrT) where topic words are irrelevant, type generation error (TGE) where a wrong word type is predicted (See Eq. 2) and it causes grammatical errors, and other errors.
The error distribution is shown in Table 6.For STD, most of the errors are attributed to no topic or wrong topics, while for HTD, the majority of errors fall into wrong topics.

    
Post-1: 今天好开心啊!I am so happy today!STD: 你怎 怎 怎么 么 么知道?How do you know ? WrT The poster is great and we look forward to our first cooperation with Ruoxi.HTD: 你海报怎 怎 怎么 么 么样啊?How about your poster ?
Post-3: 又生病啦?吃点药就好了。 Got sick again?Just take some medicine and you'll be fine soon.
Table 7: Cases for the error types with interrogative words bolded and topic words underlined.
There are typical cases for these error types: (1) Posts such as "I am so happy today!" contains no topic words or rare topic words.In this case, our method is unable to predict the topic words so that the models tend to generate universal questions.This happens more frequently in STD because the topic words are not specified explicitly.
(2) Posts contains multiple topic words, but the model sometimes focuses on an inappropriate one.For instance, for Post-2 in Table 7, HTD focused on 海报-poster but 合作-cooperation is a proper one to be focused on.(3) For complex posts, the models failed to predict the correct word type in response.For Post-3, STD generated a declarative sentence and HTD generated a question which, however, is not adequate within the context.
These cases show that controlling the questioning patterns and the informativeness of the content faces with the compatibility issue, which is challenging in language generation.These errors are also partially due to the imperfect ability of topic word prediction by PMI, which is challenging itself in open-domain conversational systems.

Conclusion and Future Work
We present two typed decoders to generate questions in open-domain conversational systems.The decoders firstly estimate a type distribution over word types, and then use the type distribution to modulate the final word generation distribution.Through modeling the word types in language generation, the proposed decoders are able to question with various patterns and address novel yet related transitional topics in a generated question.Results show that our models can generate more appropriate questions, with richer topics, thereby more likely to elicit further interactions.
The work can be extended to multi-turn conversation generation by including an additional detector predicting when to ask a question.The detector can be implemented by a classifier or some heuristics.Furthermore, the typed decoders are applicable to the settings where word types can be easily obtained, such as in emotional text generation (Ghosh et al., 2017;Zhou et al., 2018b).

Figure 1 :
Figure 1: Good questions in conversational systems are a natural composition of interrogatives, topic words, and ordinary words.

Figure 2 :
Figure2: Illustration of STD and HTD.STD applies a mixture of type-specific generation distributions where type probabilities are the coefficients.In HTD, the type probability distribution is reshaped by Gumbel-softmax and then used to modulate the generation distribution.In STD, the generation distribution is over the same vocabulary whereas dynamic vocabularies are applied in HTD.

Figure 3 :
Figure 3: Type distribution examples from HTD.The generated question is "你喜欢兔子吗？do you like rabbit?".EOS means end of sentence.

Table 1 :
Results of automatic evaluation.

Table 2 :
Annotation results.Win for "A vs. B" means A is better than B. Significance tests with Z-test were conducted.Values marked with * means p-value < 0.05, and * * for p-value < 0.01.

Table 3 :
KL divergence between the questioning pattern distribution by a model and that by human.

Table 5 :
Examples for typical questioning patterns.Interrogative words in response are bolded and topic words are underlined.