DLGNet: A Transformer-based Model for Dialogue Response Generation

Neural dialogue models, despite their successes, still suffer from lack of relevance, diversity, and in many cases coherence in their generated responses. On the other hand, transformer-based models such as GPT-2 have demonstrated an excellent ability to capture long-range structures in language modeling tasks. In this paper, we present DLGNet, a transformer-based model for dialogue modeling. We specifically examine the use of DLGNet for multi-turn dialogue response generation. In our experiments, we evaluate DLGNet on the open-domain Movie Triples dataset and the closed-domain Ubuntu Dialogue dataset. DLGNet models, although trained with only the maximum likelihood objective, achieve significant improvements over state-of-the-art multi-turn dialogue models. They also produce the best performance to date on the two datasets based on several metrics, including BLEU, ROUGE, and distinct n-gram. Our analysis shows that the performance improvement is mostly due to the combination of (1) the long-range transformer architecture with (2) the injection of random informative paddings. Other contributing factors include the joint modeling of dialogue context and response, and the 100% tokenization coverage from the byte pair encoding (BPE).


Introduction
Recent successes of pretrained transformer-based language models, such as BERT (Devlin et al., 2019), GPT(-2) (Radford et al., 2018; Radford et al., 2019), Transformer-XL (Dai et al., 2019), XLNet (Yang et al., 2019), and ERNIE(2.0) (Sun et al., 2019a,b), have led to state-of-the-art performance on many natural language understanding (NLU) tasks, including sentence classification, named entity recognition, sentence similarity, and question answering. The exceptional performance of transformer-based language models is due to their ability to capture long-term temporal dependencies in the input sequence. This attribute should be very beneficial to dialogue modeling, especially in multi-turn scenarios. Most of the existing neural dialogue response generation models are based on recurrent neural networks (Sutskever et al., 2014; Vinyals and Le, 2015; Li et al., 2016a; Serban et al., 2016; Xing et al., 2017; Serban et al., 2017a,b; Li et al., 2016b; Zhang et al., 2018a; Olabiyi et al., 2018, 2019a).

Figure 1: Positional entropy for the Movie and Ubuntu datasets. Applying a greedy training objective to the original and BPE datasets can achieve low overall entropy just by overfitting to low-entropy regions, resulting in short and generic responses. Injecting random paddings into the data does not suffer from this problem and can be used to train transformer architectures due to their lack of recurrent propagations.
These models have yielded promising results by generating mostly coherent responses given the dialogue context. However, most of them, including the state-of-the-art models trained with naturalistic dialogue data, still perform well below the human level. Generated responses tend to be either generic, out-of-context, or disproportionately short. Previous work points to some causes of these limitations: i) Training data: The presence of high-frequency generic utterances (utterance-level semantic redundancy), such as "I don't know" and "I'm not sure", and high-frequency generic n-gram tokens (word-level syntactic redundancy), such as "I" and "I am", leads to the concave positional entropy profile of dialogue datasets (see Fig. 1), which makes learning difficult and results in short and generic responses. ii) Short-range model architecture: Short-range model architectures capture only limited temporal dependencies. iii) Out-of-vocabulary problem: Less frequent (usually more informative) words are mapped to the out-of-vocabulary token <UNK>, leading to the generation of a large number of <UNK> tokens. iv) Exposure bias: The discrepancy in model behavior between training and inference limits the informativeness of the responses. v) Training objective: The maximum likelihood training objective has known limitations.
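The positional entropy profile in Fig. 1 can be computed directly from a tokenized corpus. The sketch below (plain Python; the toy corpus and function name are illustrative, not from the paper) measures the Shannon entropy of the token distribution at each utterance position, which exposes the low-entropy early positions that generic openers such as "I don't know" create:

```python
import math
from collections import Counter

def positional_entropy(utterances, max_len=10):
    """Shannon entropy (bits) of the token distribution at each position.

    Low entropy at early positions (many responses starting the same way)
    is the word-level syntactic redundancy described above.
    """
    entropies = []
    for pos in range(max_len):
        counts = Counter(u[pos] for u in utterances if len(u) > pos)
        total = sum(counts.values())
        if total == 0:  # no utterance is this long
            break
        probs = [c / total for c in counts.values()]
        entropies.append(-sum(p * math.log2(p) for p in probs))
    return entropies

# Toy corpus dominated by generic "i ..." openers.
corpus = [
    "i don't know .".split(),
    "i don't think so .".split(),
    "i am not sure .".split(),
    "maybe tomorrow then .".split(),
]
h = positional_entropy(corpus)
```

With this toy corpus, entropy is lowest at position 0 (three of four utterances start with "i") and rises at later positions, mirroring the concave profile discussed above.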
In this paper, we propose DLGNet, a transformer-based model for multi-turn dialogue modeling that addresses some of the problems highlighted above. The use of a transformer architecture allows DLGNet to capture long-term temporal dependencies in the dialogue data better than existing RNN-based architectures (Vaswani et al., 2017). However, applying a vanilla Seq2Seq transformer (Vaswani et al., 2017) or its multi-turn variants, such as ReCoSa (Zhang et al., 2019), to dialogue modeling does not work well because of the semantic redundancy in dialogue data. To overcome this, DLGNet models the joint distribution of the context and response instead of the conditional distribution of the response given the context usually employed in Seq2Seq frameworks (Vinyals and Le, 2015; Serban et al., 2016; Olabiyi et al., 2018; Vaswani et al., 2017). DLGNet also addresses the syntactic redundancy in dialogue data by appending random paddings before and after the input data. This helps to break down the learning barrier posed by the concave entropy profile of human conversation data, as shown in Fig. 1. The flattening of the entropy profile also provides regularization during training and even reduces the extent of the exposure bias problem. Finally, to avoid the out-of-vocabulary problem, DLGNet uses byte pair encoding (BPE), similar to GPT-2 (Radford et al., 2019), to provide 100% coverage for any Unicode input and output texts. Given all these proposed changes, we train DLGNet models using only the maximum likelihood objective. DLGNet models, despite being trained with only the maximum likelihood objective, demonstrate state-of-the-art performance on the Movie and Ubuntu datasets, as measured in terms of BLEU, ROUGE, and distinct n-gram scores.

Table 1: Sample responses from DLGNet models on the Ubuntu dataset ("Random" denotes a model with random initialization, i.e., without pretraining).
Context 0: The netboot one is suppose to download packages from the net.
Context 1: like the ones to be installed? or the installed to be run?
Groundtruth: Installed. The netbook also features the non-graphical installer.
DLGNet-117M: the installed to be run.
DLGNet-345M: the ones to be installed.
DLGNet-117M Random: I think the netboot one is the one that is installed to the net.
DLGNet-345M Random: the ones to be installed to.

Task Description
Consider a dialogue sample consisting of a sequence of N utterances, x = (x_1, x_2, ..., x_N), where each utterance x_i contains a variable-length sequence of M_i word tokens such that x_i^j ∈ V for vocabulary V. At any time step i, the dialogue history is given by x_i = (x_1, x_2, ..., x_i). The dialogue response generation task can be defined as follows: given a dialogue history x_i, generate a response y_i = (y_i^1, y_i^2, ..., y_i^{T_i}), where T_i is the number of generated tokens, such that the distribution of the generated response P(y_i) is indistinguishable from that of the ground truth P(x_{i+1}). The distribution of the model output sequence can be factored by the product rule:

P(y_i | x_i) = \prod_{j=1}^{T_i} P(y_i^j | y_i^{1:j-1}, x_i),    (1)

where y_i^{1:j-1} = (y_i^1, ..., y_i^{j-1}). The MLE objective based on the conditional distribution of (1) can be expressed as

L_{MLE}(θ) = -\sum_{j=1}^{T_i} \log P_θ(y_i^j | y_i^{1:j-1}, x_i),    (2)

where θ are the model parameters. This formulation, known as Seq2Seq, originated from machine translation (Sutskever et al., 2014) and assumes that the context-response pairs in the training examples are fairly unique. Seq2Seq is the basis of most of the previous work on dialogue modeling. The framework, however, does not account for the semantic and syntactic redundancy in human conversations, as pointed out by Li et al. (2016a).

Figure 2: An example of DLGNet input and output consisting of a 3-turn conversation sample separated by [TSEP] tokens, combined with random informative paddings before and after. Paddings and conversations are separated by [CSEP] tokens.
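In practice, the factorization and MLE objective above reduce to summing per-token negative log-likelihoods. A minimal numeric sketch (the probabilities are toy values standing in for model outputs, not real model scores):

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of a response under the autoregressive
    factorization P(y_i | x_i) = prod_j P(y_i^j | y_i^{1:j-1}, x_i).

    step_probs[j] is the probability the model assigns to the j-th
    ground-truth token, given the context and all previous tokens.
    """
    return -sum(math.log(p) for p in step_probs)

# A 3-token response whose tokens the model scores at 0.5, 0.25, 0.5:
# NLL = -(ln 0.5 + ln 0.25 + ln 0.5) = ln 16.
nll = sequence_nll([0.5, 0.25, 0.5])
```

Minimizing this quantity over all context-response pairs is exactly the MLE objective in (2).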

DLGNet Model Description
In order to address the semantic redundancy, we propose to jointly model both the context and the response as an alternative to the mutual information objective (Li et al., 2016a; Zhang et al., 2018b). The resulting distribution and the objective function can then be respectively expressed as

P(y_i, x_i) = P(y_i | x_i) P(x_i),    (3)

L_{Joint}(θ) = -\log P_θ(y_i, x_i).    (4)

While (3) addresses the semantic redundancy, it does not address the syntactic redundancy coming from the concave positional entropy profile of dialogue data. To circumvent this, we append random informative paddings (sampled from the dataset) before (x_i^b) and after (x_i^a) the dialogue example of interest, leading to

P(x_i^b, x_i, y_i, x_i^a) = P(x_i^b) P(y_i, x_i) P(x_i^a),    (5)

since x_i^b and x_i^a are independent of (y_i, x_i). As we see from the resulting entropy profile in Fig. 1, appending random paddings circumvents the adverse effect of syntactic redundancy in dialogue data on model training. The conditional distribution P(y_i | x_i) in (1) is then just an inference on the joint distribution of (5).
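A sketch of how a training example might be assembled under (5): random utterances sampled from the corpus are placed before and after the target conversation, delimited by [CSEP] and [TSEP] as in Fig. 2. The exact packing heuristic below (alternating pre/post fill) is an illustrative assumption; the paper does not specify the sampling scheme in this detail:

```python
import random

TSEP, CSEP = "[TSEP]", "[CSEP]"

def build_joint_example(context_turns, response, corpus, max_len=32, seed=0):
    """Assemble one DLGNet-style training example: random informative
    paddings before and after the target conversation, with [CSEP]
    separating paddings from the conversation and [TSEP] separating
    turns. Layout follows Fig. 2; the fill strategy is an assumption.
    """
    rng = random.Random(seed)
    convo = []
    for turn in context_turns + [response]:
        convo += turn + [TSEP]
    convo[-1] = CSEP  # conversation is closed off by a [CSEP]
    pad_budget = max_len - len(convo) - 1  # one more [CSEP] before convo
    pre, post = [], []
    while len(pre) + len(post) < pad_budget:
        filler = rng.choice(corpus)  # informative padding from the data
        target = pre if len(pre) <= len(post) else post
        target += filler[: pad_budget - len(pre) - len(post)]
    return pre + [CSEP] + convo + post

corpus = [["how", "are", "you", "?"], ["fine", ",", "thanks", "."]]
ex = build_joint_example([["hello", "there"]], ["hi", "!"], corpus)
```

Because the paddings are real utterances rather than a repeated pad token, every position in the packed sequence carries a non-degenerate token distribution, which is what flattens the entropy profile in Fig. 1.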
DLGNet adopts GPT-2's autoregressive transformer architecture (Radford et al., 2019), using only the decoder part of the original transformer architecture (Vaswani et al., 2017), since there is no need for a separate encoder network (see Fig. 2). Autoregressive transformer models use multiple layers of masked multi-head self-attention to map a sequence of input tokens to a sequence of output tokens (i.e., the input sequence shifted one position to the right). During inference, at each step, the model is autoregressive, consuming the previously generated token as additional input when generating the next. There are some basic conceptual differences between autoregressive architectures based on transformers and those based on recurrent neural networks (RNNs). For instance, while the output of an RNN layer depends only on the immediately previous output, a transformer layer output consists of attention over all previous outputs. Due to this lack of inherent ordering in transformer architectures, a position representation is usually passed along with the input tokens into the model (Vaswani et al., 2017).
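The "attention over all previous outputs" is enforced with a causal mask during training. A minimal sketch of the lower-triangular mask used in masked self-attention (a generic construction, not DLGNet-specific code):

```python
def causal_mask(seq_len):
    """Lower-triangular attention mask for masked self-attention.

    Position i may attend to positions 0..i (all previous outputs),
    in contrast to an RNN step, which only sees the immediately
    preceding hidden state. mask[i][j] == 1 means i may attend to j.
    """
    return [[1 if j <= i else 0 for j in range(seq_len)]
            for i in range(seq_len)]

mask = causal_mask(4)
```

In a real implementation this mask is added (as -inf on the zeros) to the attention logits before the softmax, so future positions receive zero attention weight.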
In order to take advantage of pretrained parameters and evaluate their impact, we use two model configurations: (i) DLGNet-117M, with 117M parameters, 12 attention layers, and a hidden state size of 768; and (ii) DLGNet-345M, with 345M parameters, 24 attention layers, and a hidden state size of 1024, similar to the publicly available GPT-2 models (Radford et al., 2019).

Model Training
We trained the small DLGNet-117M and the medium DLGNet-345M models on multi-turn dialogue datasets initialized with either random noise or pretrained language model parameters. The models are trained end-to-end using the Adaptive Moment Estimation (Adam) stochastic gradient descent algorithm with a learning rate of 0.001. The maximum sequence length is 1024. Due to GPU memory limitations, we use a batch size of 2 and accumulate gradients over 5 iterations, making the effective batch size 10. Both models are trained until the training perplexity on the dialogue datasets reaches a steady state. Finally, the models are implemented, trained, and evaluated using Python and the TensorFlow deep learning framework.
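The effective-batch-size arithmetic above (micro-batch size 2, gradients accumulated over 5 iterations, effective batch 10) can be sketched with a scalar stand-in for backpropagation. The helper below simply averages micro-batch gradients before a single parameter update; names and the toy gradient function are illustrative:

```python
def accumulate_gradients(micro_batches, grad_fn, accum_steps=5):
    """Average gradients over accum_steps micro-batches before one
    optimizer step, so micro-batch size 2 x 5 steps behaves like an
    effective batch of 10, matching the schedule described above.
    """
    accum = 0.0
    for batch in micro_batches[:accum_steps]:
        accum += grad_fn(batch)
    return accum / accum_steps  # averaged gradient for a single update

# Toy gradient: mean of d/dw (w - x)^2 at w = 0 is -2 * mean(x).
grad_fn = lambda batch: sum(-2 * x for x in batch) / len(batch)
g = accumulate_gradients([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], grad_fn)
```

The averaged accumulated gradient equals the gradient of the full effective batch, which is why accumulation is a faithful workaround for GPU memory limits.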

Setup
We evaluated DLGNet models on the Movie Triples and Ubuntu Dialogue corpora, randomly split into training, validation, and test sets using 90%, 5%, and 5% proportions. Since we use BPE with 100% tokenization coverage, we performed no preprocessing of the datasets whatsoever. For each training example, however, we randomly sample a target conversation and two independent padding chunks from the dataset to fill up the maximum input sequence length. We append the paddings to the target conversation, one before and one after, separated by the token [CSEP]. The target conversation in each training example in turn consists of utterances that are separated by the token [TSEP], as shown in Fig. 2. The Movie dataset (Serban et al., 2016) spans a wide range of topics with few spelling mistakes and contains about 240,000 dialogue triples, which makes it suitable for studying the relevance-diversity tradeoff in multi-turn conversations (Zhang et al., 2018b). The Ubuntu Dialogue dataset, extracted from the Ubuntu Relay Chat Channel (Serban et al., 2017b), contains about 1.85 million conversations with an average of 5 utterances per conversation. This dataset is ideal for training dialogue models that can provide expert knowledge/recommendation in domain-specific conversations.
All models are evaluated in autoregressive mode, i.e., we pass a multi-turn dialogue context to the model inputs and the models generate a sequence of response tokens using the context and all the previously generated tokens until the end-of-sequence token is reached. All models are greedily sampled to generate the model outputs. It is worth noting that, for DLGNet models, we search for the optimum top k between 0 and 20 inclusive that maximizes the overall BLEU-2 (relevance) score of the validation set using the top k sampling strategy (Radford et al., 2019). It turns out that for all DLGNet models, the optimum top k is 1 across datasets, which is equivalent to greedy sampling.
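Top k sampling with k = 1 is exactly greedy decoding, which is why the validation search above collapses to greedy sampling. A self-contained sketch (toy probability table; renormalizing over the truncated support is one common implementation choice, not necessarily the paper's):

```python
import random

def top_k_sample(probs_by_token, k, rng=None):
    """Sample from the k most probable next tokens (top-k sampling,
    Radford et al., 2019). With k = 1 this reduces to greedy decoding.
    probs_by_token maps candidate tokens to model probabilities.
    """
    rng = rng or random.Random(0)
    top = sorted(probs_by_token.items(),
                 key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(w for _, w in top)  # renormalise over truncated support
    r = rng.random() * total
    for token, w in top:
        r -= w
        if r <= 0:
            return token
    return top[-1][0]

probs = {"yes": 0.6, "no": 0.3, "maybe": 0.1}
token = top_k_sample(probs, k=1)  # greedy: always the argmax
```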
Results and Discussion

Quantitative Evaluation
We report the quantitative measures in Table 2. The transformer-based DLGNet provides a significant improvement in response generation performance over existing methods such as (V)HRED, hredGAN, DAIM, and adversarial bootstrapping (aBoots), all of which are based on recurrent neural networks. In fact, DLGNet achieves the best performance to date on the Movie Triples and Ubuntu Dialogue datasets in terms of BLEU, ROUGE, and distinct n-gram scores. This indicates that, despite being trained only with the maximum likelihood objective, the autoregressive transformer architecture, in conjunction with the random padding injection, is able to overcome some of the problems that have plagued existing dialogue models, such as semantic and syntactic redundancy and exposure bias. Also contributing to the models' performance improvement is the 100% input coverage from the BPE encoding, which eliminates the generation of <UNK> tokens (this is especially helpful for the Ubuntu dataset, with its large number of out-of-vocabulary tokens), as well as the joint modeling of the context and response. Also, in contrast to existing work reporting a trade-off between relevance and diversity (Zhang et al., 2018b; Li et al., 2016a,b), we observe that relevance performance improves with diversity performance in DLGNet models. It is worth pointing out, however, that DLGNet models tend to generate shorter responses than adversarially trained models (hredGAN and aBoots). This indicates that the models still suffer from the impact of using only the maximum likelihood training objective. Alleviating this problem with an adversarial training objective similar to aBoots and/or hredGAN should further improve performance and will be considered in our future work.

Qualitative Evaluation
Random samples of the model outputs are shown in Tables 1 and 4. One striking observation is the high level of coherence in the generated responses from DLGNet models. The models are able to capture both short- and long-term temporal dependencies in their responses. The models give responses that are relevant to the topic of the discussion, and are able to answer posed questions with answer choices. Also, they don't simply generate the all-too-common phrase "I'm not sure" like existing models; they are able to point to areas of the context they are uncertain about (see the Ubuntu section of Table 1).

Ablation Studies on DLGNet Models with Random Informative Padding
In this section, we carry out a more detailed analysis and discussion of different configurations of DL-GNet models as well as their performance across datasets, using the evaluation results in Table 2.

Open vs. Closed Domain Dataset
From Table 2, we observe that the performance improvement achieved by DLGNet models over existing models is higher for the open-domain Movie Triples dataset than for the closed-domain Ubuntu Dialogue dataset with or without pretraining. While the performance difference could be due to the size of the dataset, it could also indicate that closed-domain dialogue responses are inherently more difficult to learn, even for large and expressive models such as the DLGNet transformer.

Effect of Model Pretraining
Although models with pretraining generally perform better than ones trained with random initialization, we observe that the performance difference is not significant. This shows that the performance of DLGNet is mostly due to the multi-layer self-attention model architecture rather than the scaffolding achieved from language model pretraining.
We observe similar behavior across datasets. However, pretraining seems to be consistently more helpful for open-domain datasets than for closed-domain datasets. This might be because the distribution of the language data used for pretraining is similar to the open-domain dataset but different from the closed-domain dataset. Also, models without pretraining tend to generate longer responses on average compared to those with pretraining. This indicates that model pretraining also plays a role in the relevance-diversity tradeoff.

Effect of Model Size
We also compare the small (DLGNet-117M) and large (DLGNet-345M) models. We observe that there is a significant performance improvement of the larger over the smaller model on the Movie dataset (about 50%), but a smaller performance improvement on the Ubuntu dataset. It's also surprising that the larger model doesn't overfit to the Movie dataset. Overfitting might have been prevented by the injection of random padding into the input data, which regularizes the model training by artificially inducing high entropy into the data.

Relevance vs. Diversity Tradeoff
The results in Table 2 show state-of-the-art relevance performance with some compromise on the response length. Here, we explore the possibility of generating longer and more diverse responses with the trained models and estimate the effect on the relevance scores. For this experiment, we chose the larger DLGNet-345M models of both datasets and tried two sampling techniques, i.e., the top k (Radford et al., 2019) and top p nucleus (Zellers et al., 2019) sampling strategies, on the validation sets. The trajectories of the evaluation metrics with increasing top k and top p values are shown in Figs. 3 and 4 respectively. With top k sampling, increasing the top k value increases the response length at the expense of relevance metrics like BLEU for both datasets, as expected. However, the response length increase is more significant on the Ubuntu dataset than the Movie dataset. It is also surprising that the ROUGE-2 score for Ubuntu increases with increasing top k value, which is the reverse of the case for the Movie dataset. Also, Fig. 3 shows that it is more advantageous to trade off relevance for diversity on the Ubuntu dataset compared to the Movie dataset. This is probably due to the size and closed-domain nature of the Ubuntu dataset, which makes it more difficult to learn with maximum likelihood estimation only. We observe a similar pattern with top p nucleus sampling in Fig. 4. This reinforces the fact that greedy sampling may be sufficient for open-domain datasets such as Movie.
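For reference, nucleus (top p) sampling keeps the smallest set of most-probable tokens whose cumulative mass reaches p, so larger p admits rarer tokens and yields longer, more diverse responses. A minimal sketch with a toy distribution (implementation details such as the renormalization are common practice, not taken from the paper):

```python
import random

def top_p_sample(probs_by_token, p, rng=None):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, then sample from that set.
    Larger p admits rarer tokens, trading relevance for diversity.
    """
    rng = rng or random.Random(0)
    ranked = sorted(probs_by_token.items(),
                    key=lambda kv: kv[1], reverse=True)
    nucleus, cum = [], 0.0
    for token, q in ranked:
        nucleus.append((token, q))
        cum += q
        if cum >= p:  # smallest prefix reaching mass p
            break
    r = rng.random() * cum  # renormalise over the nucleus
    for token, q in nucleus:
        r -= q
        if r <= 0:
            return token
    return nucleus[-1][0]

probs = {"the": 0.5, "a": 0.3, "ubuntu": 0.15, "xorg": 0.05}
```

With p small enough that the top token alone reaches the threshold, nucleus sampling also reduces to greedy decoding, mirroring the top k = 1 case above.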

Further Ablation Studies on DLGNet Models
We also set out to analyze the features of DLGNet that make it suitable for multi-turn dialogue modeling. We train both DLGNet-117M and DLGNet-345M models on both datasets, but replace the random informative paddings with static paddings using a pad token. Below are the definitions of the model configuration factors considered: 1.) Multi-turn Data (M): Training data is variable-length multi-turn data padded to a fixed length. This helps to evaluate the effect of using random informative padding.
2.) Single-turn Data (S): Training data is variable-length single-turn data padded to a fixed length. This helps to evaluate the effect of number of turns.
3.) Joint model (Joint): DLGNet models are trained to model the joint distribution of the dialogue context and response, as proposed in this paper.
4.) Conditional model (Cond): DLGNet models are trained in the traditional sequence-to-sequence mode with a bidirectional encoder and an autoregressive decoder for conditional modeling of the dialogue response given the context (Vaswani et al., 2017; Zhang et al., 2019).

5.) Basic Tokenizer:
We use a basic tokenization traditionally used in dialogue modeling instead of BPE tokenization to evaluate the effect of tokenization coverage. It also provides an apples-to-apples comparison between the transformer-based and RNN-based architectures.

Effect of Random Padding Injection
The results in Table 3 are from models trained with static paddings. The models perform significantly worse than those of Table 2. Without random padding injection, the models quickly overfit to the low-entropy regions of the training data, which leads to generic and/or short responses.

Single Turn vs. Multi-turn
We also observe that the multi-turn models perform better than single-turn models on BPE-tokenized data. This is expected, because the multi-turn models capture longer temporal dependencies in the input data. It is also worth mentioning that the single-turn performance is further hurt by BPE tokenization, since BPE tends to work better with long input sequences.

Joint vs. Conditional Models
For multi-turn models, the joint modeling architecture yields better performance than the conditional Seq2Seq architecture. This trend is, however, reversed for single-turn models. This is because a model that jointly models both the context and the response benefits more from longer contextual information than a model that only models the conditional distribution of the response given the context. Therefore, multi-turn dialogue models should employ the joint structure rather than the conditional Seq2Seq structure.

Effect of Tokenization Coverage
For a fairer comparison with previous work on multi-turn dialogue that uses neither random padding injection nor 100% BPE tokenization, we trained the DLGNet models on multi-turn data with basic tokenization. The tokenization coverages of the basic tokenizer used are 83.9% and 4.19% for the Movie and Ubuntu datasets respectively; that is, most of the Ubuntu tokens are mapped to the <UNK> token. In comparison with previous work on HRED, the results in Table 3 show that the transformer-based DLGNet models under the same conditions perform better than the basic HRED model but worse than the improved HRED models (such as VHRED, hredGAN, and aBoots). In comparison with other transformer-based configurations, the smaller multi-turn models perform better than their BPE counterparts, but the larger models perform worse. This is probably due to overfitting of the larger models.

Conclusion
In this paper, we have proposed DLGNet, an extension of autoregressive transformer models such as GPT-2 for multi-turn dialogue modeling. Our experiments show that DLGNet models perform better than existing state-of-the-art multi-turn dialogue models. They also achieve the best performance to date on the open-domain Movie and closed-domain Ubuntu datasets based on BLEU, ROUGE, and distinct n-gram scores. The experiments further reveal that the combination of (i) the transformer architecture with (ii) the injection of random paddings exploiting the large maximum input sequence length is responsible for the performance improvement over existing methods. Other contributing factors include the joint modeling of dialogue context and response, and the 100% tokenization coverage from the byte pair encoding (BPE). Our analysis also reveals some tradeoffs between response relevance and response length, and we showed how different sampling strategies can be used to make an informed decision about such relevance-diversity compromises. In our future work, we plan to investigate how to improve the length of the generated responses without necessarily sacrificing their coherence and their relevance to the dialogue context.