TSDG: Content-aware Neural Response Generation with Two-stage Decoding Process

Neural response generation models have achieved remarkable progress in recent years but tend to yield irrelevant and uninformative responses. One reason is that encoder-decoder based models always use a single decoder to generate a complete response in a single pass, which tends to produce high-frequency function words with less semantic information rather than low-frequency content words with more semantic information. To address this issue, we propose a content-aware model with a two-stage decoding process, named Two-stage Dialogue Generation (TSDG). We separate the decoding of content words and function words so that content words can be generated independently, without interference from function words. Experimental results on two datasets indicate that our model significantly outperforms several competitive generative models in terms of both automatic and human evaluation.


Introduction
With the development of deep learning, open-domain neural response generation has achieved remarkable progress in recent years (Serban et al., 2017b; Chen et al., 2019). At present, most generative models are based on the encoder-decoder framework (Cho et al., 2014; Shang et al., 2015). In the decoding process, these models always use a single decoder to generate the final response in a single pass, in a left-to-right manner. However, we find it hard for these methods to model the semantic dependency between post and response, which leads to irrelevant and uninformative responses. We analyze this problem from a linguistic perspective as follows.
In linguistics, a sentence is formed from two different types of words, namely content words (words which have substantive lexical content) and function words (words which essentially serve to mark grammatical properties) (Hill, 1952). For the response "I am going to read an interesting book." in Figure 1, the content words "read, interesting, book" carry the most important semantic information, which establishes the semantic dependency with the post, while the function words "I, am, going, to, an" are used to stitch the content words together. High-quality content words are a critical component of a relevant and informative response. Although function words are small in number (less than 0.04% of our vocabulary), they account for over half of the words used in our daily speech (Rochon et al., 2000). Therefore, function words are high-frequency relative to content words.
Vanilla encoder-decoder models use a single decoder to generate a complete response in a single pass. When the decoder generates content words and function words together, it tends to produce high-frequency function words with less semantic information rather than low-frequency content words with more semantic information. Since function words carry very little substantive meaning, they are not only redundant for understanding the semantic dependency but also make that dependency sparse. Therefore, generating content words and function words in a single pass makes it difficult to learn the semantic dependency between the post and response.
To address the aforementioned issue, we propose a novel content-aware TSDG model with a two-stage decoding process. As shown in Figure 1, the key idea is to separate the decoding of content words and function words so that content words can be generated independently, without interference from function words. In the first decoding stage, the first decoder focuses on generating a content word sequence according to the post. In the second decoding stage, the second decoder expands the content word sequence into a complete and fluent response. Through this stage, our model obtains a final fluent response containing more relevant and informative content words.
Our contributions in this paper are two-fold: (1) We analyze the limitation of encoder-decoder models that use a single decoder to generate a complete response in a single pass. To address this limitation, we propose a content-aware TSDG model that generates more informative and relevant responses.
(2) Experimental results on two datasets demonstrate that our model can generate more appropriate content words and significantly outperforms several competitive generative models in terms of automatic evaluation and human evaluation.

Related Work
Open-domain conversation has long attracted the attention of researchers. Generative models have shown great potential in terms of flexibility and have become a research hotspot. Most generative models are based on the encoder-decoder framework (Cho et al., 2014; Shang et al., 2015). However, traditional encoder-decoder models tend to generate short and uninformative responses, known as "safe responses".
Many models have been proposed to address this issue: (1) modifying the objective function to penalize the generation probability of safe responses; (2) generating from latent variables to increase the diversity of responses (Serban et al., 2017b); (3) using additional topic content (Xing et al., 2017); (4) content-introducing methods (Mou et al., 2016; Yao et al., 2017); (5) knowledge-based methods (Zhou et al., 2018; Tong et al., 2019), which take external knowledge into account to facilitate conversation understanding. In the decoding process, all of these models use a single decoder to generate the final response in a single pass, in a left-to-right manner.

Model
The architecture of the proposed TSDG model is illustrated in Figure 2. It consists of an encoding process and a two-stage decoding process. Given a post U = u_1, u_2, ..., u_I as input, our model first uses a Self-Attention Encoder (SAE) to encode it into hidden vectors. Then, the first decoding stage decodes these hidden vectors into a content word sequence C = c_1, c_2, ..., c_K without the influence of function words. Finally, the second decoding stage expands the content word sequence into a complete response R = r_1, r_2, ..., r_J.

Encoding process
In the encoding process, we use SAEs to encode the utterances. An SAE is a Transformer encoder (Vaswani et al., 2017). There are two encoders in the encoding process: a Post Self-Attention Encoder (PSAE) and a Content-word Self-Attention Encoder (CSAE), which encode the post utterance and the content word sequence generated by the first decoding stage, respectively. The input (In_s) of each encoder is a sequence of word embeddings with positional encodings added (Vaswani et al., 2017). We use PSAE(U) to denote the encoding of the post utterance and CSAE(C) to denote the encoding of the content word sequence.
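The positional encoding added to the encoder inputs is the standard sinusoidal scheme of Vaswani et al. (2017). A minimal sketch of its computation (function name and numpy implementation are ours, not from the paper):

```python
import numpy as np

def positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding (Vaswani et al., 2017):
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]          # shape (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dims get sine
    pe[:, 1::2] = np.cos(angles)               # odd dims get cosine
    return pe

# The encoder input In_s is then word embeddings plus these encodings,
# e.g. in_s = embeddings + positional_encoding(seq_len, 512)
```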

Two-stage decoding process
First decoding stage: Based on the hidden vectors encoded by PSAE(U), the first decoding stage uses a Transformer decoder (Vaswani et al., 2017) to generate the content words of the response. When generating the i-th content word c_i, we take the already generated words c_{<=i-1} as input and use In_c^{(i-1)} to denote their matrix representation. The probability of the content word c_i decoded by the first decoding stage is

P(c_i | c_{<=i-1}, U) = softmax(W_c · Dec_1(In_c^{(i-1)}, PSAE(U))),

where Dec_1 denotes the first-stage decoder and W_c is an output projection matrix. The loss of the first decoding stage is the negative log-likelihood

L_1 = - Σ_{i=1}^{K} log P(c_i | c_{<=i-1}, U).

In the training process, we apply a rule-based content word extractor to automatically extract content words from the response based on Part-of-Speech features and a stop-word list. By the nature of content words, their Part-of-Speech should be noun, verb, adjective, or adverb, and they should not appear in the stop-word list. We then take this content word sequence as the ground truth to train the first decoding stage.
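The rule-based extractor can be sketched as a simple filter over POS-tagged tokens. This is an illustrative reconstruction, not the authors' released code; the tag set, the external tagger, and the stop-word list shown are assumptions:

```python
# Keep tokens whose POS tag is noun/verb/adjective/adverb and that are
# not stop words. POS tags are assumed to come from an external tagger.
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def extract_content_words(tagged_tokens, stop_words):
    """tagged_tokens: list of (token, pos_tag) pairs for one response."""
    return [tok for tok, pos in tagged_tokens
            if pos in CONTENT_POS and tok not in stop_words]

# Example: "I am going to read an interesting book ."
tagged = [("I", "PRON"), ("am", "AUX"), ("going", "VERB"), ("to", "PART"),
          ("read", "VERB"), ("an", "DET"), ("interesting", "ADJ"),
          ("book", "NOUN"), (".", "PUNCT")]
stop = {"going", "be"}  # illustrative stop-word list
print(extract_content_words(tagged, stop))  # ['read', 'interesting', 'book']
```

Note how "going" is tagged as a verb but filtered out by the stop-word list, matching the paper's example where only "read, interesting, book" survive as content words.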
Second decoding stage: The second decoding stage aims to expand the content word sequence into a complete response. To capture the information of both the content word sequence and the post, we propose a multi-layer multi-head attention decoder.

When generating the i-th word r_i of the response, we take the already generated words r_{<=i-1} as input and use In_r^{(i-1)} to denote their matrix representation.

Each layer consists of four sub-layers. The first sub-layer is multi-head self-attention:

H_1 = MultiHead(In_r^{(i-1)}, In_r^{(i-1)}, In_r^{(i-1)}).

The second sub-layer is content-word multi-head attention:

H_2 = MultiHead(H_1, CSAE(C), CSAE(C)).

The third sub-layer is post multi-head attention:

H_3 = MultiHead(H_2, PSAE(U), PSAE(U)).

The fourth sub-layer is a position-wise fully connected feed-forward network:

H_4 = FFN(H_3).

We use softmax to obtain the probability of the word decoded by the second decoding stage:

P(r_i | r_{<=i-1}, C, U) = softmax(W_r · H_4),

where r_i is the i-th word of the response. The loss of the second decoding stage is

L_2 = - Σ_{i=1}^{J} log P(r_i | r_{<=i-1}, C, U).

Note that residual connections and layer normalization are used in each sub-layer; they are omitted in the presentation for simplicity.
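The chain of sub-layers can be sketched in numpy as follows. For readability this sketch uses single-head attention and omits the causal mask, residual connections, and layer normalization (the last two are also omitted in the paper's presentation); all function and variable names are ours:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, for readability)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

def second_stage_layer(in_r, csae_c, psae_u, w1, b1, w2, b2):
    """One layer of the second-stage decoder sketch."""
    h1 = attention(in_r, in_r, in_r)      # 1) self-attention over r_{<=i-1}
    h2 = attention(h1, csae_c, csae_c)    # 2) attention over content words
    h3 = attention(h2, psae_u, psae_u)    # 3) attention over the post
    # 4) position-wise feed-forward network with ReLU
    return np.maximum(0.0, h3 @ w1 + b1) @ w2 + b2
```

The ordering matters: the decoder first attends to its own prefix, then grounds that representation in the content word sequence, and only then in the full post.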
During the training process, the total loss of our model is a combination of L_1 and L_2:

L = L_1 + λ L_2,

where λ (λ > 0) acts as a trade-off between the two terms. We set λ to 1 in our experiments.

Dataset
We conduct experiments on two datasets, namely the STC-SeFun dataset and the Weibo dataset.

STC-SeFun: a short-text conversation dataset in which each sentence segment in the query-response pairs is labeled with its sentence function. There are 45,022 post-response pairs for training, 9,590 for validation, and another unseen 9,590 samples for testing.
Weibo: high-quality Weibo conversation pairs pre-processed from the benchmark dataset (Shang et al., 2015). We use 50,000 post-response pairs to train the model, and another unseen 997 and 800 samples for validation and testing, respectively.
For pre-processing, we remove duplicate pairs and pairs whose post or response has fewer than 2 words. We also truncate sentences longer than fifty characters.
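The pre-processing steps above can be sketched as a single pass over the corpus (a minimal reconstruction using the thresholds stated in the text; function and parameter names are ours):

```python
def preprocess(pairs, min_words=2, max_chars=50):
    """Remove duplicate pairs, drop pairs whose post or response has
    fewer than `min_words` words, and truncate sentences longer than
    `max_chars` characters."""
    seen, cleaned = set(), []
    for post, resp in pairs:
        if (post, resp) in seen:       # skip exact duplicates
            continue
        seen.add((post, resp))
        if len(post.split()) < min_words or len(resp.split()) < min_words:
            continue                   # too short to be informative
        cleaned.append((post[:max_chars], resp[:max_chars]))
    return cleaned
```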
Baselines

We compare TSDG with the following baselines:

• MrRNN: a content-introducing model based on Seq2Seq (Serban et al., 2017a). We re-implemented this work to obtain the results.
• MMPMS: the model with state-of-the-art performance on the Short-Text Conversation (STC) task (Chen et al., 2019). We re-ran the released code to obtain results on our datasets.
• Skeleton: a model (Cai et al., 2019) that enhances generative models with information retrieval technologies for dialogue response generation. We re-ran the released code to obtain results on our datasets.
• Seq2Seq-trans: an ablated model of TSDG. We replace the two-stage decoding process in TSDG with a basic transformer decoder to directly generate the response.

Implementation Details
In our experiments, we use OpenNMT-py (Klein et al., 2017) as the code framework of TSDG. The number of layers in both the encoder and the decoder is set to 3. The number of attention heads in multi-head attention is 8, and the filter size is 2048. The dimension of word embeddings is set to 512 empirically. We use Adam for optimization. When decoding in both stages, the beam size is set to 5. The experiments are conducted on an NVIDIA 2080 Ti GPU.
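These hyper-parameters could be summarized in an OpenNMT-py-style configuration. This fragment is illustrative only (the authors did not release their config); it uses option names from OpenNMT-py:

```yaml
# Illustrative OpenNMT-py hyper-parameters for TSDG (not the released config)
enc_layers: 3          # encoder layers
dec_layers: 3          # decoder layers (per decoding stage)
heads: 8               # attention heads in multi-head attention
transformer_ff: 2048   # filter (feed-forward) size
word_vec_size: 512     # word embedding dimension
optim: adam            # optimizer
beam_size: 5           # beam search width in both decoding stages
```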

Automatic and Human Evaluation
Automatic Evaluation: We measure the content words in generated responses with the content word score (CWS):

CWS = (1/T) Σ_{i=1}^{T} n_i,

where T denotes the size of the test set and n_i denotes the number of content words in the i-th predicted response.
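The CWS computation is a simple average over the test set. A one-function sketch (function and argument names are ours):

```python
def content_word_score(content_counts):
    """CWS = (1/T) * sum_i n_i, where content_counts holds n_i, the
    number of content words in the i-th predicted response, and
    T = len(content_counts) is the test-set size."""
    return sum(content_counts) / len(content_counts)

print(content_word_score([3, 2, 4, 3]))  # 3.0
```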
Human Evaluation: Human evaluation is essential for response generation. We randomly sampled 100 utterances from the test set and asked 3 experienced annotators to score the fluency, relevance, and informativeness of the responses.

Table 1 shows the results of the automatic and human evaluation. TSDG outperforms all baseline methods on both automatic and human evaluation, and the improvement is statistically significant (p-value < 0.01). This indicates that TSDG generates more appropriate responses in terms of fluency, informativeness, and relevance. The ablated model (Seq2Seq-trans) suffers from the ablation, which demonstrates that the two-stage decoding process is essential for TSDG.

Results and Analysis
We find that the CWS of each baseline is significantly lower than the CWS of the ground truth. The significant improvement in CWS indicates that TSDG can effectively increase the proportion of content words in the response. We also compare the content words generated by our model with those of other models side-by-side on 100 test cases randomly picked from the STC-SeFun dataset. The human evaluation results are shown in Table 2. Note that our model consistently outperforms the comparison models by a large margin. This superior performance confirms that our model can generate more appropriate content words.

To further evaluate the relevance between the two decoding stages, we use the content word accuracy (CWA):

CWA = (1/T) Σ_{i=1}^{T} k_i / l_i,

where T denotes the size of the test set, k_i is the number of content words that appear in both the i-th content word sequence and the i-th predicted response, and l_i is the length of the i-th content word sequence predicted by the first stage. The higher the CWA, the higher the relevance between the two stages. On the two datasets, our model achieves CWA values of 0.9252 and 0.9152, respectively. Both are higher than 0.9, which verifies that our model makes good use of the first-stage content, although there is still room for improvement.
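The CWA metric averages per-example overlap ratios. A short sketch (function and argument names are ours):

```python
def content_word_acc(overlaps, seq_lens):
    """CWA = (1/T) * sum_i (k_i / l_i): overlaps holds k_i, the number of
    content words shared between the i-th predicted content sequence and
    the i-th response; seq_lens holds l_i, the length of the i-th content
    sequence predicted by the first stage."""
    ratios = [k / l for k, l in zip(overlaps, seq_lens)]
    return sum(ratios) / len(ratios)

print(content_word_acc([2, 3], [2, 4]))  # (1.0 + 0.75) / 2 = 0.875
```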
To show the influence of the content word sequence more clearly, we feed different content word sequences into the second decoding stage and compare the generated responses. The results are shown in Table 3. These examples demonstrate that the content words generated by the first decoding stage play an important role in the generation of the final response.

Conclusion
In this paper, we analyze a limitation of current generative models in the decoding process. To address it, we propose a content-aware neural response generation model with a two-stage decoding process. Evaluation results on two datasets indicate that our model generates more appropriate content words and significantly outperforms several competitive models in terms of automatic and human evaluation. There is still room for improvement; in future work, we will refine the two decoding stages independently.