Generating Informative Responses with Controlled Sentence Function

Sentence function is a significant factor in achieving the purpose of the speaker, but it has not yet been addressed in large-scale conversation generation. In this paper, we present a model to generate informative responses with controlled sentence function. Our model utilizes a continuous latent variable to capture various word patterns that realize the expected sentence function, and introduces a type controller to deal with the compatibility of controlling sentence function and generating informative content. Conditioned on the latent variable, the type controller determines the type (i.e., function-related, topic, or ordinary word) of the word to be generated at each decoding position. Experiments show that our model outperforms state-of-the-art baselines and is able to generate responses with both controlled sentence function and informative content.


Introduction
Sentence function is an important linguistic feature and a typical taxonomy in terms of the purpose of the speaker (Rozakis, 2003). There are four major function types in language, namely interrogative, declarative, imperative, and exclamatory (Rozakis, 2003). Each sentence function possesses its own structure, and transformation between sentence functions requires a series of changes in word order, syntactic patterns, and other aspects (Akmajian, 1984; Yule, 2010).
Since sentence function concerns the purpose of the speaker, it can be a significant factor indicating the conversational purpose during interaction. Interrogative responses can be used to acquire further information from the user; imperative responses make requests, give directions, instructions, or invitations to elicit further interactions; and declarative responses commonly make statements to state or explain something. Interrogative and imperative responses can be used to avoid stalemates (Li et al., 2016b), which can be viewed as important proactive behaviors in conversation (Yu et al., 2016). Thus, a conversational system equipped with the ability to control sentence function can adjust its strategy for different purposes within different contexts, behave more proactively, and may lead the dialogue further. Generating responses with controlled sentence functions differs significantly from other tasks on controllable text generation (Hu et al., 2017; Ficler and Goldberg, 2017; Asghar et al., 2017; Ghosh et al., 2017; Zhou and Wang, 2017; Dong et al., 2017; Murakami et al., 2017). These studies, involving the control of sentiment polarity, emotion, or tense, fall more or less into local control, because the controlled variable can be locally reflected by decoding variable-related words, e.g., terrible for negative sentiment (Hu et al., 2017; Ghosh et al., 2017), glad for happy emotion (Zhou et al., 2018; Zhou and Wang, 2017), and was for past tense (Hu et al., 2017). By contrast, sentence function is a global attribute of text, and controlling it is more challenging in that it requires adjusting the global structure of the entire text, including word order and word patterns.
Controlling sentence function in conversational systems faces another challenge: in order to generate informative and meaningful responses, the model has to deal with the compatibility of the sentence function and the content. Like most existing neural conversation models (Li et al., 2016a; Mou et al., 2016), we also struggle with universal and meaningless responses for different sentence functions, e.g., "Is that right?" for interrogative responses, "Please!" for imperative responses, and "Me, too." for declarative responses. The lack of meaningful topics in a response degrades the utility of the sentence function, so that the desired conversational purpose cannot be achieved. Thus, the task requires generating responses with both informative content and controllable sentence functions.

In this paper, we propose a conversation generation model that deals with the global control of sentence function and the compatibility of controlling sentence function and generating informative content. We devise an encoder-decoder structure equipped with a latent variable in a conditional variational autoencoder (CVAE) (Sohn et al., 2015), which can not only project different sentence functions into different regions of a latent space, but also capture various word patterns within each sentence function. The latent variable, supervised by a discriminator with the expected function label, is also used to realize the global control of sentence function. To address the compatibility issue, we use a type controller which lexicalizes the sentence function and the content explicitly. The type controller estimates a distribution over three word types, i.e., function-related, topic, and ordinary words. During decoding, the word type distribution modulates the generation distribution in the decoder. The type sequence of a response can be viewed as an abstract representation of sentence function. By this means, the model has an explicit and strong control on the function and the content.

Our contributions are as follows:

• We investigate how to control sentence functions to achieve different conversational purposes in open-domain dialogue systems, and analyze the difference between this task and other controllable generation tasks.
• We devise a structure equipped with a latent variable and a type controller to achieve global control of sentence function and to deal with the compatibility of controlling sentence function and generating informative content. Experiments show the effectiveness of the model.

Related Work
Recently, language generation in conversational systems has been widely studied with sequence-to-sequence (seq2seq) learning (Sutskever et al., 2014; Bahdanau et al., 2015; Vinyals and Le, 2015; Shang et al., 2015; Serban et al., 2016, 2017). A variety of methods have been proposed to address the issue of content quality, including enhancing the diversity (Li et al., 2016a; Zhou et al., 2017) and informativeness (Mou et al., 2016) of the generated responses. In addition to content quality, controllability is a critical problem in text generation. Various methods have been used to generate texts with controllable variables such as sentiment polarity, emotion, or tense (Hu et al., 2017; Ghosh et al., 2017; Zhou and Wang, 2017; Zhou et al., 2018). There are mainly two approaches to controllable text generation. First, the variables to be controlled are embedded into vectors which are fed into the model to reflect the characteristics of the variables (Ghosh et al., 2017; Zhou et al., 2018). Second, latent variables are used to capture the information of controllable attributes, as in variational autoencoders (VAE) (Zhou and Wang, 2017). Hu et al. (2017) combined the two techniques by disentangling a latent variable into a categorical code and a random part to better control the attributes of the generated text.
Figure 2: Model overview. During training, the latent variable z is sampled from the recognition network, which is supervised by the function label in the discriminator. In the type controller, the latent variable and the decoder's state are used to estimate a type distribution which modulates the final generation distribution. During test, z is sampled from the prior network, whose input is only the post. The response encoder in the dotted box appears only in training.

The task in this paper differs from the above tasks in two aspects: (1) Unlike other tasks that realize controllable text generation by decoding attribute-related words locally, our task requires not only decoding function-related words, but also planning the words globally to realize the function type to be controlled.
(2) The compatibility of controllable variables and content quality is less studied in the literature. The most similar work is (Zhao et al., 2017), which proposed to control the dialog act of a response, also a global attribute. However, that model controls the dialog act by directly feeding a latent variable into the decoder; in contrast, our model exerts stronger control over the generation process via a type controller in which words of different types are modeled explicitly.

Task Definition and Model Overview
Our problem is formulated as follows: given a post $X = x_1 x_2 \cdots x_n$ and a sentence function category $l$, the task is to generate a response $Y = y_1 y_2 \cdots y_m$ that is not only coherent with the specified function category $l$ but also informative in content. We denote by $c$ the concatenation of all the input information, i.e., $c = [X; l]$. Essentially, the goal is to estimate the conditional probability:

$$P(Y, z \mid c) = P(z \mid c)\, P(Y \mid z, c).$$

The latent variable $z$ is used to capture the sentence function of a response. $P(z|c)$, parameterized as the prior network in our model, describes the sampling process of $z$, i.e., drawing $z$ from $P(z|c)$. And $P(Y|z, c) = \prod_{t=1}^{m} P(y_t \mid y_{<t}, z, c)$ models the generation of the response $Y$ conditioned on the latent variable $z$ and the input $c$, which is implemented by the decoder in our model.

Figure 2 shows the overview of our model. As aforementioned, the model is constructed in the encoder-decoder framework. The encoder takes a post and a response as input and obtains hidden representations of the input. The recognition network and the prior network, adopted from the CVAE framework (Sohn et al., 2015), sample a latent variable $z$ from two normal distributions, respectively. Supervised by a discriminator with the function label, the latent variable encodes meaningful information for realizing a sentence function. The latent variable, along with the decoder's state, is also used to control the type of each word in generation via the type controller. In the decoder, the final generation distribution is a mixture modulated by the type distribution obtained from the type controller. By this means, the latent variable encodes information not only about the sentence function but also about word types, and in return, the decoder and the type controller can deal with the compatibility of realizing sentence function and informative content in generation.

Encoder-Decoder Framework
The encoder-decoder framework has been widely used in language generation (Sutskever et al., 2014; Vinyals and Le, 2015). The encoder transforms the post sequence $X = x_1 x_2 \cdots x_n$ into hidden representations $H = h_1 h_2 \cdots h_n$ as follows:

$$h_t = \mathrm{GRU}(h_{t-1}, e(x_t)),$$

where GRU is the gated recurrent unit (Cho et al., 2014), and $e(x_t)$ denotes the embedding of the word $x_t$. The decoder first updates its hidden states $S = s_1 s_2 \cdots s_m$ and then generates the target sequence $Y = y_1 y_2 \cdots y_m$ as follows:

$$s_t = \mathrm{GRU}(s_{t-1}, [cv_{t-1}; e(y_{t-1})]), \qquad P(y_t \mid y_{<t}) = \mathrm{softmax}(W s_t),$$

where this GRU does not share parameters with the encoder's network. The context vector $cv_{t-1}$ is a dynamic weighted sum of the encoder's hidden states, i.e., $cv_{t-1} = \sum_{i=1}^{n} \alpha_i^{t-1} h_i$, where $\alpha_i^{t-1}$ scores the relevance between the decoder's state $s_{t-1}$ and the encoder's state $h_i$ (Bahdanau et al., 2015).
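As a rough illustration of the attention step above, the NumPy sketch below computes the context vector $cv_{t-1}$ from a decoder state and the encoder states; the bilinear scoring function and all variable names are illustrative assumptions (the paper follows Bahdanau et al., 2015), and the GRU cells themselves are omitted:

```python
import numpy as np

def attention_context(decoder_state, encoder_states, W_a):
    """Compute the context vector cv_{t-1} as a weighted sum of encoder states.

    decoder_state:  (d_dec,)        previous decoder hidden state s_{t-1}
    encoder_states: (n, d_enc)      encoder hidden states h_1 .. h_n
    W_a:            (d_dec, d_enc)  bilinear scoring matrix (illustrative choice)
    """
    # Relevance score between s_{t-1} and each h_i
    scores = encoder_states @ (W_a.T @ decoder_state)   # (n,)
    # Normalize the scores into attention weights alpha^{t-1}_i
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Weighted sum of encoder states
    return alpha @ encoder_states                        # (d_enc,)

# Toy usage with the dimensions reported in the paper (bidirectional 256-cell encoder -> 512)
rng = np.random.default_rng(0)
h = rng.normal(size=(10, 512))     # 10 source positions
s = rng.normal(size=(512,))        # decoder state
cv = attention_context(s, h, rng.normal(size=(512, 512)))
```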

Recognition/Prior Network
On top of the encoder-decoder structure, our model introduces the recognition network and the prior network of the CVAE framework, and uses the two networks to draw latent variable samples during training and test, respectively. The latent variable can project different sentence functions into different regions of a latent space, and also capture various word patterns within a sentence function.
In the training process, our model needs to sample the latent variable from the posterior distribution $P(z|Y, c)$, which is intractable. Thus, the recognition network $q_\phi(z|Y, c)$ is introduced to approximate the true posterior distribution so that we can sample $z$ from this deterministic parameterized model. We assume that $z$ follows a multivariate Gaussian distribution whose covariance matrix is diagonal, i.e., $q_\phi(z|Y, c) \sim \mathcal{N}(\mu, \sigma^2 \mathbf{I})$.
Under this assumption, the recognition network can be parameterized by a deep neural network such as a multi-layer perceptron (MLP):

$$[\mu, \log \sigma^2] = \mathrm{MLP}_{\mathrm{recog}}(Y, c).$$

During test, we use the prior network $p_\theta(z|c) \sim \mathcal{N}(\mu', \sigma'^2 \mathbf{I})$ instead to draw latent variable samples, which can be implemented in a similar way:

$$[\mu', \log \sigma'^2] = \mathrm{MLP}_{\mathrm{prior}}(c).$$

To bridge the gap between the recognition and the prior networks, we add to the loss function the KL divergence term that should be minimized:

$$\mathcal{L}_1 = \mathrm{KL}\big(q_\phi(z|Y, c) \,\|\, p_\theta(z|c)\big).$$
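The sketch below illustrates the reparameterized sampling of $z$ and the closed-form KL divergence between two diagonal Gaussians; the use of log-variances and all variable names are assumptions made for illustration:

```python
import numpy as np

def sample_z(mu, logvar, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(sigma_q^2)) || N(mu_p, diag(sigma_p^2)) ), summed over dimensions."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

rng = np.random.default_rng(0)
mu_q, logvar_q = rng.normal(size=128), 0.1 * rng.normal(size=128)  # recognition network output
mu_p, logvar_p = rng.normal(size=128), 0.1 * rng.normal(size=128)  # prior network output
z = sample_z(mu_q, logvar_q, rng)                       # used during training
kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)  # the L_1 term
```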

Discriminator
The discriminator supervises $z$ to encode function-related information of a response with supervision signals. It takes $z$ as input instead of the generated response $Y$ to avoid the vanishing gradient of $z$, and predicts the function category conditioned on $z$:

$$P(l \mid z) = \mathrm{softmax}\big(\mathrm{MLP}_{\mathrm{dis}}(z)\big).$$

This formulation enforces $z$ to capture the features of sentence function and enhances the influence of $z$ in word generation. The loss function of the discriminator is the negative log-likelihood of the expected function label:

$$\mathcal{L}_2 = -\mathbb{E}_{q_\phi(z|Y, c)}\big[\log P(l \mid z)\big].$$
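A minimal sketch of the discriminator, treated here as a single softmax layer over $z$ with a cross-entropy loss; the single-layer form and the names are assumptions, since the text only specifies that the function category is predicted from $z$:

```python
import numpy as np

def discriminator_loss(z, label, W_d, b_d):
    """Cross-entropy loss of predicting the function category from the latent variable z.

    z:     (d_z,)          latent variable sample
    label: int             gold function category
    W_d:   (n_class, d_z)  classifier weights (a single linear layer for illustration)
    b_d:   (n_class,)      classifier bias
    """
    logits = W_d @ z + b_d
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[label])                  # contribution to L_2 for one example

rng = np.random.default_rng(0)
loss = discriminator_loss(rng.normal(size=128), label=1,
                          W_d=rng.normal(size=(3, 128)), b_d=np.zeros(3))
```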

Type Controller
The type controller is designed to deal with the compatibility issue of controlling sentence function and generating informative content. As aforementioned, we classify the words in a response into three types: function-related, topic, and ordinary words. The type controller estimates a distribution over the word types at each decoding position, and the type distribution is then used in the mixture model of the decoder for final word generation. During the decoding process, the decoder's state $s_t$ and the latent variable $z$ are taken as input to estimate the type distribution:

$$P(wt_t \mid s_t, z) = \mathrm{softmax}\big(W_0 \cdot [s_t; z]\big),$$

where $wt_t$ denotes the type of the word to be generated at position $t$.

Decoder
Compared with the traditional decoder described in the Encoder-Decoder Framework section, our decoder updates the hidden state $s_t$ with both the input information $c$ and the latent variable $z$, and generates the response from a mixture of distributions modulated by the type distribution obtained from the type controller:

$$P(y_t \mid y_{<t}, c, z) = \sum_{i=1}^{3} P(wt = i \mid s_t, z)\, P(y_t \mid y_{t-1}, s_t, c, z, wt = i),$$

where $wt = 1, 2, 3$ stands for function-related words, topic words, and ordinary words, respectively. The probability of choosing word type $i$ at time $t$, $P(wt = i \mid s_t, z)$, is obtained from the type controller described above. The probabilities of choosing words of different types are introduced as follows:

Function-related Word: Function-related words are the typical words for each sentence function, e.g., what for interrogative responses and please for imperative responses. To select function-related words at each position, we simultaneously consider the decoder's state $s_t$, the latent variable $z$, and the function category $l$:

$$P(y_t \mid y_{t-1}, s_t, c, z, wt = 1) = \mathrm{softmax}\big(W_1 \cdot [s_t; z; e(l)]\big),$$

where $e(l)$ is the embedding vector of the function label. Under the control of $z$, our model can learn to decode function-related words at proper positions automatically.

Topic Word: Topic words are crucial for generating an informative response. The probability of selecting a topic word at each decoding position depends on the current hidden state $s_t$:

$$P(y_t \mid y_{t-1}, s_t, c, z, wt = 2) = \mathrm{softmax}(W_2 \cdot s_t).$$

This distribution is over the topic words predicted for a post; the Topic Word Prediction section describes the details.

Ordinary Word: Ordinary words play a functional role in making a sentence natural and grammatical. The probability of generating ordinary words is estimated as:

$$P(y_t \mid y_{t-1}, s_t, c, z, wt = 3) = \mathrm{softmax}(W_3 \cdot s_t).$$

The generation loss of the decoder is the negative log-likelihood of the response:

$$\mathcal{L}_3 = -\mathbb{E}_{q_\phi(z|Y, c)}\Big[\sum_{t=1}^{m} \log P(y_t \mid y_{<t}, c, z)\Big].$$
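To make the mixture concrete, the sketch below computes the final generation distribution at one decoding step by combining the three type-specific distributions with the type distribution. The weight shapes and names are hypothetical, and the topic-word distribution is taken over the full vocabulary here for simplicity, whereas the paper restricts it to the topic words predicted for the post:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def mixture_step(s_t, z, e_l, W_type, W1, W2, W3):
    """One decoding step of the type-controlled mixture.

    s_t: (d_s,) decoder state   z: (d_z,) latent variable   e_l: (d_l,) function embedding
    W_type: (3, d_s + d_z)          type controller weights
    W1:     (V, d_s + d_z + d_l)    function-related word projection
    W2, W3: (V, d_s)                topic / ordinary word projections
    """
    type_dist = softmax(W_type @ np.concatenate([s_t, z]))   # P(wt | s_t, z)
    p_func  = softmax(W1 @ np.concatenate([s_t, z, e_l]))    # wt = 1: function-related words
    p_topic = softmax(W2 @ s_t)                               # wt = 2: topic words
    p_ord   = softmax(W3 @ s_t)                               # wt = 3: ordinary words
    # Final distribution: weighted sum of the type-specific distributions
    return type_dist[0] * p_func + type_dist[1] * p_topic + type_dist[2] * p_ord

# Toy usage with the dimensions reported in the experiment settings
rng = np.random.default_rng(0)
V, d_s, d_z, d_l = 1000, 512, 128, 100   # small toy vocabulary instead of 40,000
p = mixture_step(rng.normal(size=d_s), rng.normal(size=d_z), rng.normal(size=d_l),
                 rng.normal(size=(3, d_s + d_z)), rng.normal(size=(V, d_s + d_z + d_l)),
                 rng.normal(size=(V, d_s)), rng.normal(size=(V, d_s)))
assert abs(p.sum() - 1.0) < 1e-6          # the mixture is a valid distribution
```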

Loss Function
The overall loss $\mathcal{L}$ is a linear combination of the KL term $\mathcal{L}_1$, the classification loss of the discriminator $\mathcal{L}_2$, and the generation loss of the decoder $\mathcal{L}_3$:

$$\mathcal{L} = \alpha \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3.$$

We let $\alpha$ gradually increase from 0 to 1. This KL cost annealing technique addresses the optimization challenge of vanishing latent variables in the RNN encoder-decoder (Bowman et al., 2016).
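A small sketch of the annealed combination; the linear annealing schedule and the unit weights on $\mathcal{L}_2$ and $\mathcal{L}_3$ are assumptions, since the text only states that $\alpha$ increases gradually from 0 to 1:

```python
def kl_weight(step, anneal_steps=10000):
    """Linearly anneal alpha from 0 to 1 over the first anneal_steps updates (assumed schedule)."""
    return min(1.0, step / anneal_steps)

def total_loss(kl_term, disc_loss, gen_loss, step):
    """Overall loss: alpha * L_1 + L_2 + L_3, with alpha annealed during training."""
    return kl_weight(step) * kl_term + disc_loss + gen_loss

# Example: early in training the KL term is nearly switched off
print(total_loss(kl_term=2.0, disc_loss=0.5, gen_loss=3.0, step=100))   # ~3.52
```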

Topic Word Prediction
Topic words play a key role in generating an informative response. We resort to pointwise mutual information (PMI) (Church and Hanks, 1990) to predict a list of topic words relevant to a post. Let $x$ and $y$ denote a word in a post $X$ and in its response $Y$, respectively. PMI is computed as follows:

$$\mathrm{PMI}(x, y) = \log \frac{P(x, y)}{P(x) P(y)}.$$

Then, the relevance score of a candidate topic word $y$ to a given post $x_1 x_2 \cdots x_n$ can be approximated as follows, similar to (Mou et al., 2016):

$$\mathrm{REL}(X, y) \approx \sum_{i=1}^{n} \mathrm{PMI}(x_i, y).$$

During training, the words in a response with high REL scores to the post are treated as topic words. During test, we use REL to select the top-ranked words as topic words for a post.
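A sketch of the PMI-based scoring from co-occurrence counts over post-response pairs; the counting scheme and function names are illustrative, and the summation over post words mirrors the approximation above:

```python
import math
from collections import Counter
from itertools import product

def build_pmi(pairs):
    """Estimate PMI(x, y) from a corpus of (post_tokens, response_tokens) pairs."""
    cx, cy, cxy, n = Counter(), Counter(), Counter(), len(pairs)
    for post, resp in pairs:
        for x in set(post):
            cx[x] += 1
        for y in set(resp):
            cy[y] += 1
        for x, y in product(set(post), set(resp)):
            cxy[(x, y)] += 1
    def pmi(x, y):
        if cxy[(x, y)] == 0:
            return float("-inf")
        return math.log(cxy[(x, y)] * n / (cx[x] * cy[y]))
    return pmi

def rel(post_tokens, candidate, pmi):
    """Relevance of a candidate topic word to the whole post (sum of word-level PMI)."""
    return sum(pmi(x, candidate) for x in set(post_tokens))

# Toy usage
pairs = [(["cat", "food"], ["fish", "tasty"]), (["cat"], ["fish"])]
pmi = build_pmi(pairs)
print(rel(["cat", "food"], "fish", pmi))
```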

Data Preparation
We collected a Chinese dialogue dataset from Weibo, crawling about 10 million post-response pairs. Since our model needs a sentence function label for each pair, we built a classifier to predict the sentence function automatically and thus construct large-scale labeled data. We sampled about 2,000 pairs from the original dataset and annotated them manually with four categories, i.e., interrogative, imperative, declarative, and other. This small dataset was partitioned into training, validation, and test sets with a ratio of 6:1:1. Three classifiers, including an LSTM (Hochreiter and Schmidhuber, 1997), a Bi-LSTM (Graves et al., 2005), and a self-attentive model, were tried on this dataset. The results in Table 1 show that the self-attentive classifier outperforms the other models and achieves the best accuracy of 0.78 on the test set.
Table 1: Accuracy of sentence function classification on the 2,000 post-response pairs.

Model            Accuracy
LSTM             0.60
Bi-LSTM          0.75
Self-Attentive   0.78
We then applied the self-attentive classifier to annotate the large dataset and obtained a dialogue dataset with noisy sentence function labels. To balance the distribution of sentence functions, we randomly sampled about 0.6 million pairs for each sentence function to construct the final dataset. The statistics of this dataset are shown in Table 2.

Experiment Settings
Our model was implemented with TensorFlow. We applied a bidirectional GRU with 256 cells to the encoder and a GRU with 512 cells to the decoder. The dimensions of the word embedding and the function category embedding were both set to 100, and the dimension of the latent variable to 128. The vocabulary size was set to 40,000. Stochastic gradient descent with momentum (Qian, 1999) was used to optimize the model, with a learning rate of 0.1, a decay rate of 0.9995, and a momentum of 0.9. The batch size was set to 128. Our code is available at https://github.com/kepei1106/SentenceFunction.

We chose several state-of-the-art baselines, implemented with the settings provided in the original papers:

Conditional Seq2Seq (c-seq2seq): A seq2seq variant which takes the category (i.e., function type) embedding as additional input at each decoding position (Ficler and Goldberg, 2017).

Mechanism-aware (MA): This model assumes that there are multiple latent responding mechanisms. The number of responding mechanisms is set to 3, equal to the number of function types.

Knowledge-guided CVAE (KgCVAE): A modified CVAE which aims to control the dialog act of a generated response (Zhao et al., 2017).

Automatic Evaluation
Metrics: We adopted Perplexity (PPL) (Vinyals and Le, 2015), Distinct-1 (Dist-1), Distinct-2 (Dist-2) (Li et al., 2016a), and Accuracy (ACC) to evaluate the models at the content and function level. Perplexity measures the grammaticality of generated responses. Distinct-1/Distinct-2 is the proportion of distinct unigrams/bigrams among all the generated tokens. Accuracy measures how accurately the sentence function can be controlled: we compared the prespecified function (given as input to the model) with the function of the generated response, as predicted by the self-attentive classifier (see Data Preparation).

Table 3: Automatic evaluation with perplexity (PPL), distinct-1 (Dist-1), distinct-2 (Dist-2), and accuracy (ACC). The integers in the Dist-* cells denote the total number of distinct n-grams.

Results: Our model has lower perplexity than c-seq2seq and KgCVAE, indicating that it is comparable with the other models in generating grammatical responses. Note that MA has the lowest perplexity because it tends to generate generic responses.
As for distinct-1 and distinct-2, our model generates remarkably more distinct unigrams and bigrams than the baselines, indicating that our model can generate more diverse and informative responses compared to the baselines.
In terms of sentence function accuracy, our model outperforms all the baselines and achieves the best accuracy of 0.992, indicating that our model can control the sentence function more precisely. MA has a very low score because it offers no direct way to control sentence function; instead, the mapping is learned automatically from the data.
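For reference, a small sketch of how the Dist-1/Dist-2 metrics are typically computed (following the standard definition of Li et al., 2016a; the exact tokenization is an assumption):

```python
def distinct_n(responses, n):
    """Ratio of distinct n-grams to the total number of n-grams in all generated responses."""
    total, unique = 0, set()
    for tokens in responses:
        ngrams = list(zip(*[tokens[i:] for i in range(n)]))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0

responses = [["i", "like", "cats"], ["i", "like", "dogs"]]
print(distinct_n(responses, 1), distinct_n(responses, 2))   # 0.666..., 0.75
```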

Manual Evaluation
To evaluate the generation quality and how well the models can control sentence function, we conducted pair-wise comparison. 200 posts were randomly sampled from the test set, and each model was required to generate responses with the three function types for each post. For each pair of responses (one by our model and the other by a baseline, along with the post), annotators were hired to give a preference (win, lose, or tie). The total annotation amounts to 200×3×3×3 = 5,400 comparisons, since we have three baselines, three function types, and three metrics. We resorted to a crowdsourcing service for annotation, and each pair-wise comparison was judged by 5 curators.

Metrics: We designed three metrics to evaluate the models from the perspectives of sentence function and content: grammaticality (whether a response is grammatical and coherent with the prespecified sentence function), appropriateness (whether a response is a logical and appropriate reply to its post), and informativeness (whether a response provides meaningful information via topic words relevant to the post). The three metrics were evaluated separately.

Table 4: Manual evaluation. The scores indicate the percentages that our model wins the baselines after removing tie pairs. The scores of our model marked with * are significantly better than the competitors (Sign Test, p-value < 0.05).

Results: The scores in Table 4 represent the percentages that our model wins a baseline after removing tie pairs. A value larger than 0.5 indicates that our model outperforms its competitor. Our model outperforms the baselines significantly in most cases (Sign Test, p-value < 0.05). Among the three function types, our model performs significantly better than the baselines when generating declarative and imperative responses. For interrogative responses, our model is better but the difference is not significant in some settings. This is because interrogative patterns are more apparent and easier to learn, so all the models can capture some of these patterns to generate grammatical and appropriate responses, resulting in more ties. By contrast, declarative and imperative responses have less apparent patterns, whereas our model is better at capturing global patterns by modeling word types explicitly.
We also see that our model obtains particularly high scores in informativeness. This demonstrates that our model generates more informative responses while controlling the sentence function at the same time.
The annotation statistics are shown in Table 5. The percentage of annotations where at least 4 of the 5 judges assigned the same label (at least 4/5 agreement) is larger than 50%, and the percentage for at least 3/5 agreement is about 90%, indicating that the annotators reached a moderate agreement.

Table 5: Annotation statistics. At least n/5 means that no fewer than n judges assigned the same label to a record during annotation.

Words and Patterns in Function Control
To further analyze how our model realizes the global control of sentence function, we examined frequent words and frequent word patterns within each function. Specifically, we counted the frequency of each function-related word in the generated responses, where the type of a word is predicted by the type controller. Further, we replaced the ordinary and topic words of a generated response with variables and treated each response as a sequence of function-related words and variables. We then used the Apriori algorithm (Agrawal and Srikant, 1994) to mine frequent patterns in these sequences, retaining patterns that consist of at most 5 words and appear in at least 2% of the generated responses.

Figure 3: Frequent function-related words and frequent patterns containing at least 3 function-related words. The letters denote the variables which replace ordinary and topic words in the generated responses. The underlined words in responses are those occurring in patterns.

Figure 3 presents the most frequent words (the second and third columns) and patterns (the fourth and fifth columns) for each function type. Note that the word patterns can be viewed as an abstract representation of sentence function. We observed the following. First, function-related words are distributed at multiple positions of a sentence, indicating that realizing a sentence function needs global control: not only predicting the word types but also planning words of different types properly. Second, the frequent words clearly reveal the difference between function types.
For instance, interrogative words like 什么 (what), ？ (?), and 吗 (particle) are commonly seen in interrogative responses, while words like 请 (please), 来 (come), and 要 (will) occur frequently in imperative responses. Further, word patterns in different function types differ significantly (see the fourth and fifth columns), indicating that the model is able to learn function-specific word patterns. Third, interrogative and imperative responses have explicit patterns, while declarative responses are more implicit and divergent in pattern. Interrogative responses fall into Yes-No, Wh-, How-, and other questions. Imperative responses generally start with the base form of a verb or with imperative words such as 请 (please). Our model succeeds in capturing two typical forms of declarative responses: adversative and progressive structures.
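The pattern mining itself can be approximated with a simple counting procedure. The sketch below abstracts each response by word type and counts contiguous patterns under the stated length and support thresholds; this is a simplification of the Apriori mining used in the paper, and all names are illustrative:

```python
from collections import Counter

def abstract(tokens, types):
    """Keep function-related words (type 1) and replace topic/ordinary words with a variable X."""
    return tuple(w if t == 1 else "X" for w, t in zip(tokens, types))

def frequent_patterns(abstracted, max_len=5, min_support=0.02, min_func_words=3):
    """Count contiguous patterns over type-abstracted responses (a simplification of Apriori)."""
    counts = Counter()
    for seq in abstracted:
        seen = set()
        for size in range(1, max_len + 1):
            for i in range(len(seq) - size + 1):
                seen.add(seq[i:i + size])
        counts.update(seen)                      # count each pattern at most once per response
    threshold = min_support * len(abstracted)
    return {p: c for p, c in counts.items()
            if c >= threshold and sum(tok != "X" for tok in p) >= min_func_words}

# Toy usage: two abstracted interrogative responses
abstracted = [abstract(["what", "X", "X", "do", "?"], [1, 2, 3, 1, 1]),
              abstract(["what", "X", "do", "?"], [1, 3, 1, 1])]
print(frequent_patterns(abstracted, min_support=0.5))
```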

Figure 4: Generated responses of all the models for different sentence functions. In the responses of our model, function-related words are in red and topic words in blue. The word type is predicted by the type controller.

Post: What would you do if I suddenly broke up with you someday?
MA: I don't know what to do, but I'm in fear.
KgCVAE: I think I'm a good man, too.
Our Model: I would think that I was stupid and I would be blamed by my mother.
We present an example in Figure 4 to show that our model generates responses of different function types better than the baselines. Each function type is realized by a natural composition of function-related words (in red) and topic words (in blue). Moreover, the function-related words differ and are placed at different positions across function types, indicating that the model learns function-specific word patterns. The example also shows that the compatibility issue of controlling sentence function and generating informative content is well addressed by properly planning function-related and topic words.

Figure 5: Different patterns of interrogative responses generated by our model.

Post: What would you do if I suddenly broke up with you someday?
Interrogative Response #1: Do you mean that it's my fault?
Interrogative Response #2: Can you speak normally?
Interrogative Response #3: What do you think I should do? Shall I break up with you?
Furthermore, we verified the ability of our model to capture fine-grained patterns within a sentence function. We took interrogative responses as an example and obtained responses by repeatedly drawing latent variable samples. Figure 5 shows interrogative responses with different patterns generated by our model given the same post. The model generates several Yes-No questions led by words such as 吗 (do), 会 (can), and 要 (shall), and a Wh-question led by 怎样 (what). This example shows that the latent variable can capture fine-grained patterns and improve the diversity of responses within a function.

Conclusion
We present a model to generate responses with both controllable sentence function and informative content. To deal with the global control of sentence function, we utilize a latent variable to capture the various patterns for different sentence functions. To address the compatibility issue, we devise a type controller to handle function-related and topic words explicitly. The model is thus able to control sentence function and generate informative content simultaneously. Extensive experiments show that our model performs better than several state-of-the-art baselines.
As for future work, we will investigate how to apply the technique to multi-turn conversational systems, provided that the most appropriate sentence function can be predicted for a given conversation context.