Improving Neural Conversational Models with Entropy-Based Data Filtering

Current neural network-based conversational models lack diversity and generate boring responses to open-ended utterances. Priors such as persona, emotion, or topic provide additional information to dialog models to aid response generation, but annotating a dataset with priors is expensive and such annotations are rarely available. While previous methods for improving the quality of open-domain response generation focused on either the underlying model or the training objective, we present a method of filtering dialog datasets by removing generic utterances from training data using a simple entropy-based approach that does not require human supervision. We conduct extensive experiments with different variations of our method, and compare dialog models across 17 evaluation metrics to show that training on datasets filtered this way results in better conversational quality as chatbots learn to output more diverse responses.


Introduction
Current open-domain neural conversational models (NCM) are trained on pairs of source and target utterances in an effort to maximize the likelihood of each target given the source (Vinyals and Le, 2015). However, real-world conversations are much more complex, and a plethora of suitable targets (responses) can be adequate for a given input. We propose a data filtering approach where the "most open-ended" inputs, determined by calculating the entropy of the distribution over target utterances, are excluded from the training set. We show that dialog models can be improved using this simple unsupervised method, which can be applied to any conversational dataset. We conduct several experiments to uncover how some of the current open-domain dialog evaluation methods behave with respect to overfitting and random data. Our software for filtering dialog data and automatic evaluation using 17 metrics is released on GitHub under an MIT license. This paper exists in poster and blog post form as well.

Background
Most open-domain NCMs are based on neural network architectures developed for machine translation (MT, Sutskever et al. (2014); Cho et al. (2014); Vaswani et al. (2017)). Conversational data differs from MT data in that targets to the same source may vary not only grammatically but also semantically (Tandon et al., 2017): consider plausible replies to the question What did you do today?. Dialog datasets also contain generic responses, e.g. yes, no and i don't know, that appear in a large and diverse set of contexts (Mou et al., 2016). Following the approach of modeling conversation as a sequence to sequence (seq2seq, Sutskever et al. (2014)) transduction of single dialog turns, these issues can be referred to as the one-to-many and many-to-one problems. seq2seq architectures are not suited to deal with the ambiguous nature of dialogs since they are inherently deterministic, meaning that once trained they cannot output different sequences to the same input. Consequently they tend to produce boring and generic responses (Li et al., 2016a; Shao et al., 2017; Zhang et al., 2018a).
Previous approaches to the one-to-many, many-to-one problem can be grouped into three categories. One approach involves feeding extra information to the dialog model such as dialog history, categorical information like persona (Li et al., 2016b; Joshi et al., 2017; Zhang et al., 2018b), mood/emotion (Li et al., 2017c), and topic (Xing et al., 2017; Baheti et al., 2018), or through knowledge-bases (Dinan et al., 2019; Ghazvininejad et al., 2018; Zhu et al., 2017; Moghe et al., 2018). A downside to these approaches is that they require annotated datasets which are not always available, or might be smaller in size. Augmenting the model itself, with e.g. latent variable sampling (Serban et al., 2017b; Gu et al., 2019; Park et al., 2018; Shen et al., 2018b; Gao et al., 2019), or improving the decoding process (Shao et al., 2017; Kulikov et al., 2018) is also a popular approach.
Sampling provides a way to generate more diverse responses; however, such models are more likely to output ungrammatical or irrelevant responses. Finally, directly modifying the loss function (Li et al., 2016a), or training by reinforcement (Li et al., 2016d; Serban et al., 2017a; Li et al., 2016c; Lipton et al., 2018; Lewis et al., 2017) or adversarial learning (Li et al., 2017b; Ludwig, 2017; Olabiyi et al., 2018; Zhang et al., 2018c) has also been proposed, but this is still an open research problem, as it is far from trivial to construct objective functions that capture conversational goals better than cross-entropy loss.
Improving dataset quality through filtering is frequently used in the machine learning literature (Sedoc et al., 2018; Ghazvininejad et al., 2018; Wojciechowski and Zakrzewicz, 2002) and data distillation methods in general are used both in machine translation and dialog systems (Axelrod et al., 2011; Li et al., 2017a). Xu et al. (2018b) introduced coherence for measuring the similarity between contexts and responses, and then filtered out pairs with low coherence. This improves datasets from a different aspect and could be combined with our present approach. However, natural conversations allow many adequate responses that are not similar to the context, thus it is not intuitively clear why filtering these should improve dialog models. Our experiments also further support that cross-entropy is not an adequate loss function (shown qualitatively by Csaky (2019) and Tandon et al. (2017)), by showing that many automatic metrics continue to improve after the validation loss reaches its minimum and starts increasing. However, we found that the metrics steadily improve even after we can be certain that the model overfitted (not just according to the loss function). Further research is required to determine whether this indicates that overfitted model responses are truly better or if it's a shortcoming of the metrics that they prefer such models.
Currently, there is no well-defined automatic evaluation method (Liu et al., 2016), and while some metrics that correlate more with human judgment have been proposed recently (Li et al., 2017b; Lowe et al., 2017; Tao et al., 2018), they are harder to measure than simpler automatic metrics like perplexity or BLEU (Papineni et al., 2002). Furthermore, even human evaluation has its downsides, like high variance, high cost, and difficulty of replicating experimental setups (Zhang et al., 2018b; Tao et al., 2018). Some works resort to human evaluations (Krause et al., 2017; Fang et al., 2018), others use automatic metrics only (Olabiyi et al., 2018; Xing and Fernández, 2018; Kandasamy et al., 2017; Shalyminov et al., 2018; Xu et al., 2018b), and some use both (Shen et al., 2018a; Baheti et al., 2018; Ram et al., 2018). While extensive human evaluation of the methods presented here is left for future work, we do conduct an especially thorough automatic evaluation both at the validation loss minimum and of overfitted models. We believe our experiments also shed light on the limitations of frequently used automatic metrics.

Intuition
We approach the one-to-many, many-to-one problem from a relatively new perspective: instead of adding more complexity to NCMs, we reduce the complexity of the dataset by filtering out a fraction of utterance pairs that we assume are primarily responsible for generic/uninteresting responses. Of the 72 000 unique source utterances in the DailyDialog dataset (see Section 4.1 for details), 60 000 occur with a single target only. For these it seems straightforward to maximize the conditional probability P(T|S), S and T denoting a specific source and target utterance. However, in the case of sources that appear with multiple targets (one-to-many), models are forced to learn some "average" of observed responses.
The entropy of the response distribution of an utterance s is a natural measure of the amount of "confusion" introduced by s. For example, the context What did you do today? has high entropy, since it is paired with many different responses in the data, but What color is the sky? has low entropy since it's observed with few responses. The many-to-one scenario can be similarly formulated, where a diverse set of source utterances are observed with the same target (e.g. I don't know has high entropy). While this may be a less prominent issue in training NCMs, we shall still experiment with excluding such generic targets, as dialog models tend to generate them frequently (see Section 2).

Clustering Methods and Filtering
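We refer to the following entropy computation method as IDENTITY. For each source utterance s in the dataset we calculate the entropy of the conditional distribution T|S = s, i.e. given a dataset D of source-target pairs, we define the target entropy of s as
$$H_{tgt}(s, D) = -\sum_{(s, t_i) \in D} p(t_i \mid s) \log_2 p(t_i \mid s) \tag{1}$$
where the sum runs over the distinct targets t_i observed with s. Similarly, source entropy of a target utterance is
$$H_{src}(t, D) = -\sum_{(s_i, t) \in D} p(s_i \mid t) \log_2 p(s_i \mid t) \tag{2}$$
The probabilities are based on the observed relative frequency of utterance pairs in the data.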
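The IDENTITY computation described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the released implementation: utterances are compared as exact strings, and the function names and toy pairs are hypothetical.

```python
import math
from collections import Counter, defaultdict


def target_entropies(pairs):
    """Map each source utterance to the entropy (in bits) of its observed targets."""
    targets_by_source = defaultdict(Counter)
    for source, target in pairs:
        targets_by_source[source][target] += 1

    entropies = {}
    for source, counts in targets_by_source.items():
        total = sum(counts.values())
        # Relative frequencies n / total stand in for p(t | s);
        # p * log2(1 / p) is the same term as -p * log2(p).
        entropies[source] = sum(
            (n / total) * math.log2(total / n) for n in counts.values()
        )
    return entropies


def source_entropies(pairs):
    """Same computation with the roles of source and target swapped."""
    return target_entropies([(target, source) for source, target in pairs])


# Hypothetical toy dataset of (source, target) pairs.
toy_pairs = [
    ("what did you do today ?", "i went shopping ."),
    ("what did you do today ?", "nothing much ."),
    ("what color is the sky ?", "blue ."),
    ("are you sure ?", "i don't know ."),
    ("where is he ?", "i don't know ."),
]

tgt_h = target_entropies(toy_pairs)
src_h = source_entropies(toy_pairs)
```

In this toy data the open-ended question and the generic reply both receive an entropy of 1 bit, while the sky question receives 0.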
For the purposes of this entropy-based filtering, we considered the possibility of also including some form of similarity measure between utterances that would allow us to detect whether a set of responses is truly diverse, as in the case of a question like What did you do today?, or diverse only on the surface, such as in the case of a question like How old are you? (since answers to the latter are semantically close). Measuring the entropy of semantic clusters as opposed to individual utterances may improve our method by reducing data sparsity. For example How are you? can appear in many forms, like How are you <name>? (see Section 4.2). While the individual forms have low entropy (because they have low frequency), we may decide to filter them all if together they form a high-entropy cluster.
To this end we performed the filtering based not only on the set of all utterances, as in the case of IDENTITY, but also on clusters of utterances established by clustering their vector representations using the Mean Shift algorithm (Fukunaga and Hostetler, 1975). Source and target utterances are clustered separately. In the AVG-EMBEDDING setup the representation R(U) of utterance U is computed by taking the average word embedding weighted by the smooth inverse frequency of words (Arora et al., 2017),
$$R(U) = \frac{1}{|U|} \sum_{w \in U} \frac{0.001}{0.001 + p(w)} E(w)$$
where E(w) and p(w) are the embedding and the probability of word w respectively. We also experiment with SENT2VEC, a more sophisticated sentence embedding approach, which can be thought of as an extension of word2vec to sentences (Pagliardini et al., 2018).
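A minimal sketch of the AVG-EMBEDDING representation follows. It assumes the smooth inverse frequency weight takes the form a / (a + p(w)) with a = 0.001, as in Arora et al. (2017), and that every token has both an embedding and a probability. The function name and the toy vectors are hypothetical.

```python
def sif_average(tokens, embeddings, word_prob, a=0.001):
    """Average the vectors of `tokens`, each weighted by a / (a + p(w)).

    Assumption: `tokens` is non-empty and every token is present in both
    `embeddings` and `word_prob`.
    """
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in tokens:
        weight = a / (a + word_prob[word])  # rare words get larger weights
        for i, value in enumerate(embeddings[word]):
            total[i] += weight * value
    return [value / len(tokens) for value in total]


# Hypothetical 2-dimensional embeddings and unigram probabilities.
toy_embeddings = {"how": [1.0, 0.0], "are": [0.0, 1.0]}
toy_probs = {"how": 0.001, "are": 0.009}
representation = sif_average(["how", "are"], toy_embeddings, toy_probs)
```

With these toy values the rarer word "how" receives weight 0.5 and the more frequent "are" roughly 0.1, so frequent words contribute less to the utterance vector.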
The target entropy of a source cluster c_s is
$$H_{tgt}(c_s, C) = -\sum_{c_i \in C} p(c_i \mid c_s) \log_2 p(c_i \mid c_s) \tag{3}$$
where C is the set of all clusters and p(c_i | c_s) is the conditional probability of observing an utterance from cluster i after an utterance from cluster s. In the context of these methods, the entropy of an utterance will mean the entropy of its cluster. Note that IDENTITY is a special case of this cluster-based entropy computation method, since in IDENTITY a "cluster" is comprised of multiple examples of one unique utterance. Thus a target cluster's entropy is computed similarly to Equation 2, but using clusters as in Equation 3.
Entropy values obtained with each of these methods were used to filter dialog data in three ways. The SOURCE approach filters utterance pairs in which the source utterance has high entropy, TARGET filters those with a high entropy target, and finally the BOTH strategy filters all utterance pairs that are filtered by either SOURCE or TARGET.
Some additional techniques did not yield meaningful improvement and were excluded from further evaluation. Clustering based on the Jaccard similarity of the bag of words of utterances only added noise to IDENTITY and resulted in much worse clusters than SENT2VEC. Clustering single occurrences of each unique utterance (as opposed to datasets with multiplicity) led to less useful clusters than when clustering the whole dataset, probably because it resulted in less weight being given to the frequent utterances that we want to filter out. K-means proved inferior to the Mean Shift algorithm, which is a density-based clustering algorithm and seems to work better for clustering vectors of sentences. Filtering stop words before clustering did not improve the quality of clusters, probably because many utterances that we want to filter out contain a large number of stop words.
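The cluster-level entropy and the SOURCE, TARGET and BOTH strategies described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the released code: cluster assignments are taken as given (the clustering step itself is not shown), a pair is dropped when the relevant entropy is strictly above the threshold, and all names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def _entropy(counts):
    """Entropy in bits of the relative frequencies in a Counter."""
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def cluster_entropies(pairs, source_cluster, target_cluster):
    """Return (entropy per source cluster, entropy per target cluster)."""
    after = defaultdict(Counter)   # source cluster -> following target clusters
    before = defaultdict(Counter)  # target cluster -> preceding source clusters
    for source, target in pairs:
        cs, ct = source_cluster[source], target_cluster[target]
        after[cs][ct] += 1
        before[ct][cs] += 1
    return (
        {c: _entropy(row) for c, row in after.items()},
        {c: _entropy(row) for c, row in before.items()},
    )


def filter_pairs(pairs, source_cluster, target_cluster, src_h, tgt_h,
                 threshold=1.0, mode="BOTH"):
    """Keep pairs that the chosen strategy does not filter out."""
    kept = []
    for source, target in pairs:
        high_source = src_h[source_cluster[source]] > threshold
        high_target = tgt_h[target_cluster[target]] > threshold
        if mode == "SOURCE":
            drop = high_source
        elif mode == "TARGET":
            drop = high_target
        elif mode == "BOTH":
            drop = high_source or high_target
        else:
            raise ValueError(f"unknown mode: {mode}")
        if not drop:
            kept.append((source, target))
    return kept


# Hypothetical data: three surface forms of one question share a source cluster.
pairs = [
    ("how are you ?", "fine ."),
    ("how are you john ?", "not bad ."),
    ("how are you mary ?", "tired ."),
    ("what color is the sky ?", "blue ."),
]
source_cluster = {"how are you ?": 0, "how are you john ?": 0,
                  "how are you mary ?": 0, "what color is the sky ?": 1}
target_cluster = {"fine .": 0, "not bad .": 1, "tired .": 2, "blue .": 3}

src_h, tgt_h = cluster_entropies(pairs, source_cluster, target_cluster)
kept_source = filter_pairs(pairs, source_cluster, target_cluster, src_h, tgt_h, mode="SOURCE")
kept_target = filter_pairs(pairs, source_cluster, target_cluster, src_h, tgt_h, mode="TARGET")
```

Each variant of the greeting occurs once and would have zero entropy on its own, but the shared cluster has an entropy of log2 3, so SOURCE filtering removes all three pairs. Using one cluster per unique utterance recovers the IDENTITY setting.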

Dataset
With 90 000 utterances in 13 000 dialogs, DailyDialog (Li et al., 2017c), our primary dataset, is comparable in size with the Cornell Movie-Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011), but contains real-world conversations. Using the IDENTITY approach, about 87% of utterances have 0 entropy (i.e. they do not appear with more than one target), 5% have an entropy of 1 (e.g. they appear twice, with different targets), and the remaining values rise sharply to 7. This distribution is similar for source and target utterances. Entropy is clearly proportional to utterance frequency (Figure 1), but has a wide range of values among utterances of equal frequency. For example, utterances with a frequency of 3 can have entropies ranging from 0 to $\log_2 3 \approx 1.58$, the latter of which would be over our filtering threshold of 1 (see Section 5.1 for details on selecting thresholds). Since high-entropy utterances are relatively short, we also examined the relationship between entropy and utterance length (Figure 2). Given the relationship between frequency and entropy, it comes as no surprise that longer utterances have lower entropy.
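As a worked illustration of the range quoted above, a source utterance observed three times can be paired with one, two, or three distinct targets. Under relative-frequency probabilities the three cases give the following entropies (the case labels are ours):

```latex
% Source utterance with frequency 3; entropy in bits.
\begin{align*}
\text{one target, counts } (3):\quad
  & H = -1 \cdot \log_2 1 = 0 \\
\text{two targets, counts } (2, 1):\quad
  & H = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} \approx 0.918 \\
\text{three targets, counts } (1, 1, 1):\quad
  & H = -3 \cdot \tfrac{1}{3}\log_2\tfrac{1}{3} = \log_2 3 \approx 1.585
\end{align*}
```

Only the last case exceeds a threshold of 1, so at this frequency an utterance is filtered only if all three observed targets differ.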

Clustering Results
Compared to IDENTITY, both SENT2VEC and AVG-EMBEDDING produce a much lower number of clusters with 0 entropy, but also a huge cluster with more than 5000 elements (the size of the second largest cluster is below 500), which we didn't filter since it clearly doesn't group utterances with similar meaning. Generally, clusters were formed of similar utterances with the occasional exception of longer outlier utterances clustered together (instead of creating a separate cluster for each outlier), which can be attributed to the nature of the clustering algorithm. Overall, SENT2VEC appeared to produce better clusters than AVG-EMBEDDING, as reflected in the evaluation in Section 5.
We experimented with different bandwidth values for the Mean Shift algorithm to produce clusters with as many elements as possible while also keeping the elements semantically similar. In an example cluster (Figure 3) we can see that the clustering was able to group together several variants of How are you?, in particular, those with different names. In general, we noticed that both in the case of IDENTITY and the clustering methods, utterances labeled with the highest entropy are indeed those generic sources and replies which we hoped to eliminate. See Appendix A.1 for a selection of high entropy utterances and clusters.

Experiments
In this section the model and parameter setups are presented along with 17 evaluation metrics. Limitations of these metrics are discussed and a comparison between our filtering methods is presented on DailyDialog (Section 5.3), and other datasets (Section 5.4). We use transformer (Vaswani et al., 2017) as our dialog model, an encoder-decoder architecture relying solely on attention mechanisms (Bahdanau et al., 2015). transformer has already been applied to a plethora of natural language processing tasks, including dialog modeling (Dinan et al., 2019; Mazare et al., 2018; Devlin et al., 2018). We used the official implementation (https://github.com/tensorflow/tensor2tensor; see Appendix A.2 for a report of hyperparameters).

Model and Parameters
The vocabulary for DailyDialog was limited to the most frequent 16 384 words, and train / validation / test splits contained 71 517 / 9 027 / 9 318 examples, respectively.

Clustering and Filtering. For AVG-EMBEDDING fastText embeddings were used. The bandwidth of Mean Shift was set to 0.7 and 3.5 for AVG-EMBEDDING and SENT2VEC, which produced 40 135 and 23 616 clusters, respectively. Entropy thresholds and amount of data filtered can be found in Table 1. Generally we set the threshold so that the amount of filtered data is similar to the DailyDialog IDENTITY scenario. We also set a threshold for the maximum average utterance length (15 and 20 for AVG-EMBEDDING and SENT2VEC) in clusters that we considered for filtering, excluding outliers from the filtering process (see Section 4.2).
Training and Decoding. Word embeddings of size 512 were randomly initialized, batch size was set to 2048 tokens, and we used the Adam optimizer (Kingma and Ba, 2014). We experimented with various beam sizes (Graves, 2012), but greedy decoding performed better according to all metrics, as also observed previously (Asghar et al., 2017; Shao et al., 2017; Tandon et al., 2017).

Evaluation Metrics
As mentioned in Section 2, automatic evaluation of chatbots is an open research problem. In order to get as complete a picture as possible, we use 17 metrics that have been applied to dialog models over the past years, briefly described below. These metrics assess different aspects of response quality, thus models should be compared on the whole set of metrics.
Word and utterance entropy. The per-word entropy $H_w = -\frac{1}{|U|} \sum_{w \in U} \log_2 p(w)$ of responses is measured to determine their non-genericness (Serban et al., 2017b). Probabilities are calculated based on frequencies observed in the training data. We introduce the bigram version of this metric, to measure diversity at the bigram level as well. Utterance entropy is the product of $H_w$ and $|U|$, also reported at the bigram level.
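The unigram version of the two entropy metrics can be sketched as below. The sketch assumes whitespace tokenization and that every response word was seen in the training data. Handling of unseen words is not specified here, so it is left out. Function names and the toy training set are hypothetical.

```python
import math
from collections import Counter


def unigram_probs(training_utterances):
    """Relative word frequencies observed in the training data."""
    counts = Counter(w for u in training_utterances for w in u.split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def word_entropy(response, probs):
    """Per-word entropy H_w of a response, in bits."""
    words = response.split()
    # Assumption: every word is in `probs`; unseen words would raise KeyError.
    return -sum(math.log2(probs[w]) for w in words) / len(words)


def utterance_entropy(response, probs):
    """Utterance entropy: H_w multiplied by the response length |U|."""
    return word_entropy(response, probs) * len(response.split())


# Hypothetical training data: "i" has probability 1/3, every other word 1/6.
probs = unigram_probs(["i don't know", "i like jazz"])
h_w = word_entropy("i know", probs)
h_u = utterance_entropy("i know", probs)
```

Responses built from frequent words score low on both metrics, which is why higher values are read as less generic. A bigram variant would apply the same functions to bigram tokens and bigram probabilities.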
KL divergence. We use the KL divergence between model and ground truth (GT) response sets to measure how well a model can approximate the GT distribution of words. Specifically, we define distributions $p_{gt}$ and $p_m$ based on each set of responses and calculate the KL divergence $D_{kl} = \frac{1}{|U_{gt}|} \sum_{w \in U_{gt}} \log_2 \frac{p_{gt}(w)}{p_m(w)}$ for each GT response. The bigram version of this metric is also reported.
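A small sketch of the per-response divergence follows. One detail has to be assumed: when a ground truth word never occurs in the model responses its model probability is zero, so the sketch clamps it to a small floor. That floor, the function names, and the toy responses are our assumptions and are not taken from the text.

```python
import math
from collections import Counter


def unigram_distribution(responses):
    """Relative word frequencies over a set of responses."""
    counts = Counter(w for r in responses for w in r.split())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}


def kl_divergence(gt_response, p_gt, p_m, floor=1e-9):
    """Average of log2(p_gt(w) / p_m(w)) over the words of one GT response."""
    words = gt_response.split()
    return sum(
        # Assumption: unseen model words are clamped to `floor` to avoid
        # division by zero.
        math.log2(p_gt[w] / max(p_m.get(w, 0.0), floor))
        for w in words
    ) / len(words)


# Hypothetical ground truth and model response sets.
gt_responses = ["i like jazz", "i know"]
model_responses = ["i know", "i know"]
p_gt = unigram_distribution(gt_responses)
p_m = unigram_distribution(model_responses)
scores = [kl_divergence(r, p_gt, p_m) for r in gt_responses]
```

The first ground truth response contains words the toy model never produces, so its score is much larger than that of the second. Note that a single response can score below zero when the model over-produces its words.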
Embedding metrics. Embedding average, extrema, and greedy are widely used metrics (Liu et al., 2016; Serban et al., 2017b; Zhang et al., 2018c). average measures the cosine similarity between the averages of word vectors of response and target utterances. extrema constructs a representation by taking the greatest absolute value for each dimension among the word vectors in the response and target utterances and measures the cosine similarity between them. Finally, greedy matches each response token to a target token (and vice versa) based on the cosine similarity between their embeddings and averages the total score across all words. For word embeddings and average word embedding representations, we used the same setup as in AVG-EMBEDDING.
Coherence. We measure the cosine similarity between pairs of input and response (Xu et al., 2018b). Although a coherence value of 1 would indicate that input and response are the same, generally a higher value seems better as model responses tend to have lower coherence than targets.
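The three embedding metrics and coherence can be sketched on plain lists of word vectors. This is a simplified illustration: it uses an unweighted mean where the text uses the AVG-EMBEDDING representation, it assumes non-empty utterances, and all function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def extrema_vector(vectors):
    """Per dimension, keep the value with the greatest absolute value."""
    return [max(column, key=abs) for column in zip(*vectors)]


def embedding_average(response, target):
    return cosine(mean_vector(response), mean_vector(target))


def embedding_extrema(response, target):
    return cosine(extrema_vector(response), extrema_vector(target))


def embedding_greedy(response, target):
    """Match every token to its closest counterpart, in both directions."""
    def one_way(a, b):
        return sum(max(cosine(x, y) for y in b) for x in a) / len(a)
    return (one_way(response, target) + one_way(target, response)) / 2


def coherence(input_vectors, response_vectors):
    # Simplification: unweighted mean instead of the weighted average.
    return cosine(mean_vector(input_vectors), mean_vector(response_vectors))


# Hypothetical 2-dimensional word vectors for a response and a target.
response = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0]]
avg_score = embedding_average(response, target)
ext_score = embedding_extrema(response, target)
greedy_score = embedding_greedy(response, target)
```

In this toy case average and extrema both come out at roughly 0.71, and greedy at 0.75 because one of the two response tokens has no close match in the target.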
Distinct metrics. Distinct-1 and distinct-2 are widely used in the literature (Li et al., 2016a; Shen et al., 2018a; Xu et al., 2018b), measuring the ratio of unique unigrams/bigrams to the total number of unigrams/bigrams in a set of responses. However, they are very sensitive to the test data size, since increasing the number of examples in itself lowers their value. While the number of total words increases linearly, the number of unique words is limited by the vocabulary, and we found that the ratio decreases even in human data (see Appendix A.3 for details). It is therefore important to only compare distinct metrics computed on the same test data.
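The size sensitivity noted above is easy to reproduce with a small sketch of distinct-n. Whitespace tokenization, the function name and the toy responses are assumptions made for the illustration.

```python
def distinct_n(responses, n=1):
    """Ratio of unique n-grams to total n-grams over a set of responses."""
    ngrams = []
    for response in responses:
        words = response.split()
        ngrams.extend(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


# Hypothetical response set, then the same set repeated twice.
responses = ["i don't know", "i know"]
d1_small = distinct_n(responses, n=1)
d2_small = distinct_n(responses, n=2)
d1_large = distinct_n(responses * 2, n=1)
```

Repeating the same responses adds no new unigrams but doubles the total count, so distinct-1 halves from 0.6 to 0.3 even though the responses are no less diverse.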
BLEU. Measuring n-gram overlap between response and target is widely used in the machine learning and dialog literature (Shen et al., 2018a; Xu et al., 2018b). We report BLEU-1, BLEU-2, BLEU-3, and BLEU-4 computed with the 4th smoothing algorithm described in Chen and Cherry (2014).
Normally metrics are computed at the validation loss minimum of a model; however, in the case of chatbot models loss may not be a good indicator of response quality (Section 2), thus we also looked at how our metrics progress during training. Figure 4 shows how coherence and the 3 embedding metrics saturate after about 80-100k steps, and never decrease (we ran the training for 300k steps, roughly 640 epochs). Most metrics show a similar trend of increasing until 100k steps, and then stagnating (see Appendix A.3 for more figures).
In contrast, validation loss for the same training reaches its minimum after about 10-20k steps (Figure 5). This again suggests the inadequacy of the loss function, but it also questions the validity of these metrics, as they seem to favor a model that overfitted the training data, which we can assume after 640 epochs. This could be due to the many identical inputs in train and test splits, because of the nature of dialog data. Most interesting are embedding metrics and BLEU scores (Section 5.3), since they show that even after overfitting responses do not get farther from targets. This is in line with other findings reporting that qualitatively responses are better after overfitting (Csaky, 2019; Tandon et al., 2017); however, occasionally they also tend to be too specific and irrelevant. We leave it for future work to conduct human evaluation between non-overfitted and overfitted models to solidify these claims. In light of these issues, we compare trainings on the DailyDialog dataset both at the validation loss minimum and at an overfitted point (150 epochs).

DailyDialog Results
We compute metrics on the unfiltered test set to show that filtered trainings perform better even on utterances that would have been filtered from the training data. TRF, the baseline transformer model trained on unfiltered data, is compared to the 9 trainings on filtered data. In all tables the 17 metrics from left to right are: response length, unigram and bigram entropy, unigram and bigram utterance entropy, unigram and bigram KL divergence, embedding average, extrema and greedy, coherence, distinct-1 and distinct-2, and finally, BLEU-1, BLEU-2, BLEU-3 and BLEU-4 (see Section 5.2).
Evaluating at the minimum validation loss (Table 2) clearly shows that models trained on data filtered by IDENTITY and SENT2VEC are better than the baseline. IDENTITY performs best among the three methods, surpassing the baseline on all but the distinct-1 metric. SENT2VEC is a close second, getting higher values on fewer metrics than IDENTITY, but mostly improving on the baseline. Finally, AVG-EMBEDDING is inferior to the baseline, as it didn't produce clusters as meaningful as SENT2VEC, and thus produced a lower quality training set. It seems like filtering high entropy targets (both in the case of IDENTITY and SENT2VEC) is more beneficial than filtering sources, and BOTH falls mostly in the middle as expected, since it combines the two filtering types. By removing example responses that are boring and generic from the dataset the model learns to improve response quality. Finding such utterances is useful for a number of purposes, but earlier it has been done mainly manually (Li et al., 2016d; Shen et al., 2017), whereas we provide an automatic, unsupervised method of detecting them based on entropy.
Every value is higher after 150 epochs of training than at the validation loss minimum (Table 3). The most striking change is in the unigram KL divergence, which is now an order of magnitude lower. IDENTITY still performs best, falling behind the baseline on only the two distinct metrics. Interestingly this time BOTH filtering was better than the TARGET filtering. SENT2VEC still mostly improves the baseline and AVG-EMBEDDING now also performs better or at least as good as the baseline on most metrics. In some cases the best performing model gets quite close to the ground truth performance. On metrics that evaluate utterances without context (i.e. entropy, divergence, distinct), randomly selected responses achieve similar values as the ground truth, which is expected. However, on embedding metrics, coherence, and BLEU, random responses are significantly worse than those of any model evaluated.
Computing the unigram and bigram KL divergence with a uniform distribution instead of the model yields a value of 4.35 and 1.87, respectively. Thus, all models learned a much better distribution, suggesting that this is indeed a useful metric. We believe the main reason that clustering methods perform worse than IDENTITY is that clustering adds some noise to the filtering process. Conducting a good clustering of sentence vectors is a hard task. This could be remedied by filtering only utterances instead of whole clusters, thus combining IDENTITY and the clustering methods. In this scenario, the entropy of individual utterances is computed based on the clustered data. The intuition behind this approach would be that the noise in the clusters based on which we compute entropy is less harmful than the noise in clusters which we consider for filtering. Finally, Table 4 shows responses from the baseline and the best performing model to 3 randomly selected inputs from the test set (which we made sure are not present in the training set) to show that training on filtered data does not degrade response quality. We show more example responses in Appendix A.3.

Cornell and Twitter Results
To further solidify our claims we tested the two best performing variants of IDENTITY (BOTH and TARGET) on the Cornell Movie-Dialogs Corpus and on a subset of 220k examples from the Twitter corpus (see footnote 10). Results are reported in Table 5 and Table 6, respectively. On these noisier datasets our simple IDENTITY method still managed to improve over the baseline, but the impact is not as pronounced and, in contrast to DailyDialog, BOTH and TARGET perform best on nearly the same number of metrics. On these noisier datasets the clustering methods might work better; this is left for future work. Compared to DailyDialog there are some important distinctions that also underline that these datasets are of lesser quality. The COHERENCE metric is worse on the ground truth responses than on model responses (Table 5), and some embedding metrics and BLEU scores are better on randomly selected responses than on model responses (Table 6).

Conclusion
We proposed a simple unsupervised entropy-based approach that can be applied to any conversational dataset for filtering generic sources/targets that cause "confusion" during the training of open-domain dialog models. We compared various setups in an extensive quantitative evaluation, and showed that the best approach is measuring the entropy of individual utterances and filtering pairs based on the entropy of target, but not source utterances. Some limitations of current automatic metrics and the loss function have also been shown, by examining their behavior on random data and with overfitting.
10 https://github.com/Marsan-Ma/chat_corpus/
In the future, we plan to explore several additional ideas. As mentioned in Section 5.3, we want to extend our clustering experiments combining the ideas behind IDENTITY and the clustering methods to make them more robust to noise. We wish to conduct clustering experiments on noisier datasets and try other sentence representations (Devlin et al., 2018). We also plan to combine our method with coherence-based filtering (Xu et al., 2018b). Furthermore, we intend to perform a direct quantitative evaluation of our method based on human evaluation. Finally, we believe our method is general enough that it could also be applied to datasets in other similar NLP tasks, such as machine translation, which could open another interesting line of future research.

A.3 Evaluation Metrics and Examples
Table 9: Responses to randomly selected test inputs which we made sure were not in the training data (DailyDialog). Unfiltered is the model trained on unfiltered data, and IDENTITY TARGET is the model trained on IDENTITY, TARGET filtered data. Overfitted means that the respective model is evaluated at an overfitted point.
you might want to change your plans . with the cold and wet weather that we re supposed to have tomorrow the roads are going to be rather slippery .
ok . i will take it .
thank you very much .
we will try our best to do our best to help you .
i understand . i understand that you are right .
ok i will . i look forward to it .
ok . any time on friday will be ok with me .
ok . see you then . i think so . i 'll be expecting you then .
great ! no wonder you can control your voice so well . you are a professional singer .
i am not a singer . i would n't mind a position that would be great .
i am a singer . i think i can be a singer in the lead .
i would n't like to be a singer but i love to sing that dream .
when can i get high speed internet installed ?
you can make a list of the company and make sure that you do n't have to use the machine .
you 'll need to make an appointment before the company starts .
you can get up at the airport and you 'll have to make an appointment .
you 'll have to make an appointment before we can come in and install it .
i like those kinds of programmes too . they re very informative . i think that many people underrate the education value of tv .