An Auto-Encoder Matching Model for Learning Utterance-Level Semantic Dependency in Dialogue Generation

Generating semantically coherent responses is still a major challenge in dialogue generation. Different from conventional text generation tasks, the mapping between inputs and responses in conversations is more complicated, which highly demands the understanding of utterance-level semantic dependency, a relation between the whole meanings of inputs and outputs. To address this problem, we propose an Auto-Encoder Matching (AEM) model to learn such dependency. The model contains two auto-encoders and one mapping module. The auto-encoders learn the semantic representations of inputs and responses, and the mapping module learns to connect the utterance-level representations. Experimental results from automatic and human evaluations demonstrate that our model is capable of generating responses of high coherence and fluency compared to baseline models.


Introduction
Automatic dialogue generation task is of great importance to many applications, ranging from open-domain chatbots (Higashinaka et al., 2014;Vinyals and Le, 2015;Li et al., 2016Li et al., , 2017a to goal-oriented technical support agents (Bordes and Weston, 2016;Asri et al., 2017). Recently there is an increasing amount of studies about purely datadriven dialogue models, which learn from large corpora of human conversations without handcrafted rules or templates. Most of them are based on the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014) that maximizes the probability of gold responses given the previous dialogue turn. Although such methods offer great * Equal Contribution 1 The code is available at https://github.com/ lancopku/AMM promise for generating fluent responses, they still suffer from the poor semantic relevance between inputs and responses . For example, given "What's your name" as the input, the models generate "I like it" as the output.
Recently, the neural attention mechanism (Luong et al., 2015;Vaswani et al., 2017) has been proved successful in many tasks including neural machine translation (Ma et al., 2018b) and abstractive summarization , for its ability of capturing word-level dependency by associating a generated word with relevant words in the source-side context. Recent studies (Mei et al., 2017;Serban et al., 2017) have applied the attention mechanism to dialogue generation to improve the dialogue coherence. However, conversation generation is a much more complex and flexible task as there are less "word-to-words" relations between inputs and responses. For example, given "Try not to take on more than you can handle" as the input and "You are right" as the response, each response word can not find any aligned words from the input. In fact, this task requires the model to understand the utterance-level dependency, a relation between the whole meanings of inputs and outputs. Due to the lack of utterance-level semantic dependency, the conventional attention-based methods that simply capture the word-level dependency achieve less satisfying performance in generating high-quality responses.
To address this problem, we propose a novel Auto-Encoder Matching model to learn utterance-level dependency. First, motivated by Ma et al. (2018a), we use two auto-encoders to learn the semantic representations of inputs and responses in an unsupervised style. Second, given the utterance-level representations, the mapping module is taught to learn the utterance-level dependency. The advantage is that by explicitly separating representation learning and dependency learning, the model has a stronger modeling ability compared to traditional Seq2Seq models. Experimental results show that our model substantially outperforms baseline methods in generating highquality responses. Our contributions are listed as follows: • To promote coherence in dialogue generation, we propose a novel Auto-Encoder Matching model to learn the utterance-level dependency.
• In our proposed model, we explicitly separate utterance representation learning and dependency learning for a better expressive ability.
• Experimental results on automatic evaluation and human evaluation show that our model can generate much more coherent text compared to baseline models.

Approach
In this section, we introduce our proposed model. An overview is presented in Section 2.1. The details of the modules are shown in Sections 2.2, 2.3 and 2.4. The training method is introduced in Section 2.5.

Overview
The proposed model contains three modules: an encoder, a decoder, and a mapping module, as shown in Figure 1.
In general, our model is different from the conventional sequence-to-sequence models. The encoder and decoder are both implemented as autoencoders (Baldi, 2012). They learn the internal representations of inputs and target responses, respectively. In addition, a mapping module is built to map the internal representations of the input and the response.

Encoder
The encoder E θ is an unsupervised auto-encoder based on Long Short Term Memory Networks (LSTM) (Hochreiter and Schmidhuber, 1996). As it is essentially a LSTM-based Seq2Seq model, we name the encoder and decoder of the autoencoder "source-encoder" and "source-decoder". To be specific, the encoder E θ receives the source text x = {x 1 , x 2 , ..., x n }, and encodes it to an internal representation h, and then decodes h to a new sequencex = {x 1 ,x 2 , ...,x n } for the reconstruction of the input. We extract the hidden state h as the semantic representation. The encoder E θ is trained to reduce the reconstruction loss, whose loss function is defined as follows: where θ refers to the parameters of the encoder E θ .

Decoder
Similar to the encoder, our decoder D φ is also a LSTM-based auto-encoder. However, as there is no target text provided in the testing stage, we propose the customized implementation, which is illustrated in Section 2.5. Here in the introduction of the decoder, we do not provide the testing details. Similarly, we name the encoder and decoder of the auto-encoder "target-encoder" and "targetdecoder". The target-encoder receives the target y = {y 1 , y 2 , ..., y n } and encodes it to a utterancelevel semantic representation s, and then decodes s to a new sequence to approximate the target text. The loss function is identical to that of the encoder:

Mapping Module
As our model is constructed for dialogue generation, we design the mapping module to ensure that the generated response is semantically consistent with the source. There are many matching models that can be used to learn such dependency relations (Hu et al., 2014;Pang et al., 2016;. For simplicity, we only use a simple feedforward network for implementation. The mapping module M γ transforms the source semantic representation h to a new representation t. To be specific, we implement a multilayer perceptron (MLP) g(·) for M γ and train it by minimizing the L2-norm loss J 3 (γ) of the transformed representation t and the semantic representation of target response s: (3)

Training and Testing
In the testing stage, given an input utterance, the encoder E θ , the decoder D φ , and the matching module M γ work together to produce a dialogue response. The source-encoder first receives the input x and encodes it to a semantic representation h of the source utterance. Then, the mapping module transforms h to t, a target response representation. Finally, t is sent to the target-decoder for response generation.
In the training stage, besides the auto-encoder loss and the mapping loss, we also use an end-toend loss J 4 (θ, φ, γ): log P (y t |x, y 1..t−1 ; θ, φ, γ) where x is the source input, y is the target response, and T is the length of response sequence. The model learns to generateỹ to approximate y by minimizing the reconstruction losses J 1 (θ) and J 2 (φ), the mapping loss J 3 (γ), and the end-to-end loss J 4 (θ, φ, γ). The details are illustrated below: where J refers to the total loss, and λ 1 , λ 2 , and λ 3 are hyperparameters.

Experiment
We conduct experiments on a high-quality dialogue dataset called DailyDialog built by Li et al. (2017b). The dialogues in the dataset reflect our daily communication and cover various topics about our daily life. We split the dataset into three parts with 36.3K pairs for training, 11.1K pairs for validation, and 11.1K pairs for testing.

Experimental Details
For dialogue generation, we set the maximum length to 15 words for each generated sentence. Based on the performance on the validation set, we set the hidden size to 512, embedding size to 64 and vocabulary size to 40K for baseline models and the proposed model. The parameters are updated by the Adam algorithm (Kingma and Ba, 2014) and initialized by sampling from the uniform distribution ([−0.1, 0.1]). The initial learning rate is 0.002 and the model is trained in minibatches with a batch size of 256. λ 1 and λ 3 are set to 1 and λ 2 is set to 0.01 in Equation (6). It is important to note that for a fair comparison, we reimplement the baseline models with the best settings on the validation set. After fixing the hyperparameters, we combine the training and validation sets together as a larger training set to produce the final model.

Results
We use BLEU (Papineni et al., 2002), to compare the performance of different models, and use the widely-used BLEU-4 as our main BLEU score. The results are shown in Table 1. The proposed AEM model significantly outperforms the Seq2Seq model. It demonstrates the effectiveness of utterance-level dependency on improving the quality of generated text. Furthermore, we find that the utterance-level dependency also benefits the learning of word-level dependency. The improvement from the AEM model to the AEM+Attention model 2 is 0.68 BLEU-4 point. It is much more obvious than the improvement from the Seq2Seq model to the Seq2Seq+Attention, which is 0.29 BLEU-4 point. We also report the diversity of the generated responses by calculating the number of distinct unigrams, bigrams, and trigrams. The results are shown in Table 2. We find that the AEM model achieves significant improvement on the diversity of generated text. The number of unique trigram of the AEM model is almost six times more   than that of the Seq2Seq model. Also, it should be noticed that the attention mechanism performs almost the same compared to the AEM model (31.2K vs. 34.6K in terms of Dist-3), which indicates that the utterance-level dependency and the word-level dependency are both indispensable for dialogue generation. Therefore, by combining the two dependencies together, the AEM+Attention model achieves the best results. Such improvements are expected. With the increase of the relevance of the generated text, it gets harder for the model to generate repeated responses. In our experimental results, the number of repetitive "I don't know" in the AEM+Attention model is reduced by 50% compared to the Seq2Seq model. For dialogue generation, human evaluation is more convincing, so we also report human evaluation results on the test set. We randomly choose 100 utterances in daily communication style for the human evaluation, each of which is sent to different models to generate responses. The results are distributed to the annotators who have no knowledge about which model the sentence is from. All annotators have linguistic background. They are asked to score the generated responses in terms of fluency and coherence. Fluency rep-  resents whether each sentence is in correct grammar. Coherence evaluates whether the generated response is relevant to the input. The score ranges from 1 to 10 (1 is very bad and 10 is very good). To evaluate the overall performance, we use the geometric mean of fluency and coherence as the final evaluation metric. Table 3 shows the results of human evaluation. The inter-annotator agreement is satisfactory considering the difficulty of human evaluation. The Pearson's correlation coefficient is 0.69 on coherence and 0.57 on fluency, with p < 0.0001. First, it is clear that the AEM model outperforms the Seq2Seq model with a large margin, which proves the effectiveness of the AEM model on generating high quality responses. Second, it is interesting to note that with the attention mechanism, the coherence is decreased slightly in the Seq2Seq model but increased significantly in the AEM model. It suggests that the utterance-level dependency greatly benefits the learning of wordlevel dependency. Therefore, it is expected that the AEM+Attention model achieves the best G-score. Table 4 shows the examples generated by the AEM model and the Seq2Seq model. For easy questions (ex. 4 and ex. 5), they both perform well. For hard questions (ex. 1 and ex. 2), the proposed model obviously outperforms the Seq2Seq model. It shows that the utterance-level dependency learned by the proposed model is useful for handling complex inputs.

Error Analysis
Although our model achieves the best performance, there are still several failure cases. We find that the model performs badly for the inputs with unseen words. For instance, given "Bonjour" as the input, it generates "Stay out of here" as the output. It shows that the proposed model is sensitive to the unseen utterance representations. Therefore, we would like to explore more approaches to address this problem in the future work. For example, the auto-encoders can be replaced by variational auto-encoders to ensure that the distribution of utterance representations is normal, which has a better generalization ability.

Conclusion
In this work, we propose an Auto-Encoder Matching model to learn the utterance-level semantic dependency, a critical dependency relation for generating coherent and fluent responses. The model contains two auto-encoders that learn the utterance representations in an unsupervised way, and a mapping module that builds the mapping between the input representation and response representation. Experimental results show that the proposed model significantly improves the quality of generated responses according to automatic evaluation and human evaluation, especially in coherence.