Data Augmentation for Neural Online Chats Response Selection

Data augmentation seeks to manipulate the available training data to improve the generalization ability of models. We investigate two data augmentation proxies, permutation and flipping, for the neural dialog response selection task on various models over multiple datasets in both Chinese and English. Unlike standard data augmentation techniques, our method combines the original and synthesized data for prediction. Empirical results show that our approach gains 1 to 3 recall-at-1 points over baseline models in both full-scale and small-scale settings.


Introduction
Building machines that are capable of conversing like humans is one of the primary goals of artificial intelligence. Traditional rule-based systems typically require extensive manual labor, which limits their scalability across multiple domains. With the success of machine learning, the quest to build data-driven dialog systems has come into focus over the past few years (Ritter et al., 2011). Existing approaches in this area can be categorized into generation-based methods and retrieval-based methods. While generation-based methods are still far from reliably generating informative responses, retrieval-based methods have the advantage of fluency and groundedness, since they select responses from existing data. We concentrate on retrieval-based methods in this paper, though we believe the proposed techniques could also improve generation-based models.
While current state-of-the-art results for dialog models are achieved by deep learning approaches, the performance of neural models largely depends on the amount of training data. However, acquiring conversational data can be difficult at times.
On the other hand, even with thousands of data points, it is unclear whether these models can optimally benefit from them. Therefore, data augmentation and its efficient use become an important problem. Our main contribution is an investigation of new ways to manipulate chat data and neural model architectures to improve performance. To our knowledge, we are the first to evaluate data augmentation on different types of neural conversation models over multiple domains and languages.

Data Augmentation
Recent studies (Adi et al., 2016; Khandelwal et al., 2018) have shown that recurrent neural networks (RNN), especially long short-term memory networks (LSTM), are sensitive to word order when encoding contextual information. However, for the response selection task, it is so far unclear to what extent word order is important. The question is complicated by the following language phenomena, which we observed in existing chat data:
1. Broken continuity. Simultaneous conversations happen very often in multi-party dialogs (Elsner and Charniak, 2008), resulting in some utterances not responding to their immediately preceding ones. Even in conversations between only two people, continuity may still break when one person switches topic before the other responds. See Table 1.
2. ... and their orderings are not that important. We found this to be very common in online live chats. See Table 2 for examples.
3. Long utterances. Some utterances contain multiple sentences. Some are a single compound sentence with multiple clauses. See Table 3 for examples.
To summarize, the critical information for responding, which can be a single word, a phrase, or a full sentence, may have varying relative positions in the context. Therefore, we hypothesize that there exist alternative orderings of utterances and intra-utterance arguments in chat data that can help select responses, given recurrent neural models' sensitivity to word order. In this paper, our main goal is to seek improvement by creating variations in the ordering of utterances and arguments. We aim for generic methods that bypass the need for discourse and syntactic parsing as an intermediate step. Since online chats are typically noisy, with spelling errors and ungrammaticality, a relative lack of precision may actually help. We therefore propose the following ways to manipulate chat data:
Permutation simply swaps the positions of two messages in the context. This may help recover the continuity or create alternative orderings of parallel arguments.
Flipping breaks an utterance into two parts and concatenates them in reversed order. The break point is the punctuation mark closest to the middle of the utterance, if there is one. Otherwise, we break the utterance at the middle.
As illustrated in Table 4, the proposed transformations change neither the implication of the contexts nor the appropriateness of the responses.
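As a minimal sketch of the two transformations, assuming utterances are already tokenized into lists of strings, the Python below swaps two messages in a context and flips a single utterance. The function names, the punctuation set, and the choice to keep the punctuation mark with the first part are illustrative assumptions rather than details of the method described above.

```python
import string

# Assumption: an illustrative punctuation set (ASCII plus a few common
# full-width Chinese marks); the actual set used is not specified.
PUNCTUATION = set(string.punctuation) | set(",。?!;:、")


def permute(context, i, j):
    """Return a copy of the context (a list of messages) with messages i and j swapped."""
    swapped = list(context)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


def flip(tokens):
    """Split a tokenized utterance in two and concatenate the parts in reversed order.

    The split falls right after the punctuation token closest to the middle;
    if there is no usable punctuation, the split falls at the middle.
    """
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    middle = len(tokens) / 2
    # A split at the very end is skipped: it would leave the utterance unchanged.
    splits = [k + 1 for k, tok in enumerate(tokens)
              if tok in PUNCTUATION and k + 1 < len(tokens)]
    if splits:
        split = min(splits, key=lambda s: abs(s - middle))
    else:
        split = len(tokens) // 2
    return tokens[split:] + tokens[:split]
```

For example, `permute(context, -1, -2)` swaps the last and the penultimate message, and flipping the unpunctuated utterance `how much is shipping` yields `is shipping how much`.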

Data
We describe the four datasets used to evaluate our proposed methods:
Taobao chat log was collected by a vendor of pajamas between 2013 and 2015. The conversations took place on Taobao, one of the largest Chinese e-commerce websites. The website allows two-way conversations between customers and agents in individual sessions.
Ubuntu dialog corpus (Lowe et al., 2015) is the first large dataset of online chats made available. It contains multi-party chat logs from the Ubuntu chat room, where people help each other solve technical problems related to Ubuntu.
Douban conversation corpus is a collection of web forum post discussions from Douban, a Chinese internet community. It covers a wide range of topics and is hence open-domain in nature.
Frames dataset was collected by (Asri et al., 2017) in a Wizard-of-Oz setting. The chats are about booking flights. The wizard has access to a database to answer domain-specific questions. Unlike the datasets mentioned above, the conversations in Frames are highly controlled, so the language is perfect and the chats have perfect turn exchanges.

Model Overview
We first give a high-level abstraction of the neural models we investigate. Given a context and candidate responses, the models score each candidate, and the one with the highest score is selected. The models are trained by maximizing the likelihood of the labels. To build training data, one negative example is sampled from the corpus for each pair of context and true response. We group the models into the following two categories:
Dual-Encoder Model (DE) As first proposed in (Lowe et al., 2015), DE models encode context m and response r into $v(m) \in \mathbb{R}^{l}$ and $v(r) \in \mathbb{R}^{m}$, respectively. Then the score of r given m is
$$s(m, r) = \sigma\big(v(m)^{\top} M \, v(r)\big)$$
where σ is the sigmoid function and $M \in \mathbb{R}^{l \times m}$. In this paper, the response encoder is an LSTM. We consider two choices of context encoder. One is a word-level LSTM encoder only (LSTM-DE), which takes the concatenated messages as input. The other is a hierarchical recurrent encoder (HRE-DE). For HRE, we encode each message with a word-level LSTM encoder and then feed the last hidden states from the word-level encoder to an utterance-level encoder, which is also an LSTM. We concatenate the last hidden state of the utterance-level encoder with that of the word-level encoder on the concatenated messages to form the final context encoding. Note that HRE-DE is a simplified version of the model in .
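The DE scoring and selection step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used in the experiments: the encodings are taken as given lists of floats (the LSTM encoders are omitted), the score is assumed to be the bilinear form of (Lowe et al., 2015) without a bias term, and the names `de_score` and `select_response` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear(v_m, M, v_r):
    """Compute v_m^T M v_r for a length-l vector, an l x m matrix, and a length-m vector."""
    return sum(v_m[a] * M[a][b] * v_r[b]
               for a in range(len(v_m)) for b in range(len(v_r)))


def de_score(v_m, M, v_r):
    """Score one (context, response) pair; assumed form, no bias term."""
    return sigmoid(bilinear(v_m, M, v_r))


def select_response(v_m, M, candidate_encodings):
    """Score every candidate and return (index of the best candidate, all scores)."""
    scores = [de_score(v_m, M, v_r) for v_r in candidate_encodings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

Keeping the encoders outside the sketch makes the point that LSTM-DE and HRE-DE differ only in how the context encoding is produced; the scoring step is the same for both.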
Sequential Matching Network (SMN) Unlike DE models, SMN finds the affinity between context messages and responses as a first step. Given messages $m_k$, where $k = 1, \ldots, n$, and response r, SMN first extracts a feature $u(m_k, r) \in \mathbb{R}^{p}$ describing how related the two utterances are, and then accumulates these features with an RNN:
$$v(m, r) = \mathrm{RNN}\big(u(m_k, r)\big), \quad k = 1, \ldots, n$$
$$s(m, r) = \sigma\big(w^{\top} v(m, r)\big)$$
where $v(m, r), w \in \mathbb{R}^{q}$.

Combining Transformed Data
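Let $\pi_i$ be the applicable transformations, including the identity. For context m and response r, let $m_i = \pi_i(m)$ and $r_j = \pi_j(r)$. For DE models, we use the same encoders as for m and r to encode $m_i$ and $r_j$. Then we combine the encodings and predict by
$$s(m, r) = \sigma\Big(\sum_{i,j} v(m_i)^{\top} M_{ij} \, v(r_j)\Big)$$
where $M_{ij} \in \mathbb{R}^{l \times m}$. Similarly, for SMN, the predicted score is
$$s(m, r) = \sigma\Big(\sum_{i,j} w_{ij}^{\top} \, v(m_i, r_j)\Big)$$
where $w_{ij} \in \mathbb{R}^{q}$. Note that this score function allows augmentation to be applied at test time for prediction. Additionally, we add the squared distance between the encodings of the original data and the transformed data to the loss, to encourage the models to learn similar representations for them. We assume that the transformations do not drastically change the meanings of contexts and responses, even though they are not exactly label-preserving. Empirically, we found that adding this regularization term helps. The training loss for DE models becomes
$$\sum_{(m,r)} \Big[ -\log P(y \mid m, r) + t \sum_{i} \lVert v(m_i) - v(m) \rVert^{2} + t \sum_{j} \lVert v(r_j) - v(r) \rVert^{2} \Big]$$
and the one for SMN becomes
$$\sum_{(m,r)} \Big[ -\log P(y \mid m, r) + t \sum_{i,j} \lVert v(m_i, r_j) - v(m, r) \rVert^{2} \Big]$$
where $P(y \mid m, r)$ is the likelihood of the label y under the predicted score and t is a hyper-parameter. We tuned it on the validation set in [0.01, 0.1].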
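To make the combination concrete for DE models, the sketch below sums one bilinear term per pair of transformed context and response, applies the sigmoid, and adds t times the squared distances between original and transformed encodings to the negative log-likelihood. The exact form of the loss, the convention that index 0 holds the identity transformation, the default `t=0.05`, and all function names are assumptions made for illustration; the encoders themselves are again omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bilinear(u, M, v):
    """Compute u^T M v with plain Python lists."""
    return sum(u[a] * M[a][b] * v[b]
               for a in range(len(u)) for b in range(len(v)))


def sq_dist(u, v):
    """Squared Euclidean distance between two encodings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def combined_score(ctx_encs, resp_encs, matrices):
    """ctx_encs[i] = v(m_i), resp_encs[j] = v(r_j), matrices[i][j] = M_ij.

    Index 0 is assumed to be the identity transformation (the original data).
    """
    total = sum(bilinear(ctx_encs[i], matrices[i][j], resp_encs[j])
                for i in range(len(ctx_encs))
                for j in range(len(resp_encs)))
    return sigmoid(total)


def regularized_loss(label, ctx_encs, resp_encs, matrices, t=0.05):
    """Negative log-likelihood of the label plus t times the squared distances
    between each transformed encoding and the original one (assumed form)."""
    p = combined_score(ctx_encs, resp_encs, matrices)
    nll = -math.log(p if label == 1 else 1.0 - p)
    reg = (sum(sq_dist(e, ctx_encs[0]) for e in ctx_encs[1:])
           + sum(sq_dist(e, resp_encs[0]) for e in resp_encs[1:]))
    return nll + t * reg
```

Because `combined_score` only needs the encodings of the transformed inputs, the same function can be called at test time, which is what distinguishes this setup from augmentation applied to the training set alone.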

Setup
We evaluate our method on the datasets mentioned in Section 3. For the Ubuntu dataset, we use the version shared by (Xu et al., 2016). For Douban, we discard the test set provided by the authors, since the responses are not from the same domain, and re-split the training set. Negative responses are randomly sampled. For Frames, we select negative responses from those that have different slot types and values from the true responses. We also conduct an experiment with a smaller amount of training data on the three large datasets, Ubuntu, Douban, and Taobao, in which 1% of the training set is randomly selected for training. Following (Lowe et al., 2015), we evaluate model performance with recall-at-1. We experiment with two types of permutation: the first swaps the last and the penultimate message in the context, and the second swaps the penultimate and the third-to-last message. We only apply the first type of permutation for SMN, since SMN seems to be insensitive to permutation. We flip all messages in contexts and responses for SMN, and only flip context messages for DE models.
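The evaluation metric and the small-scale setting can be sketched as follows. The snippet assumes each test context comes with a list of candidate scores in which the true response sits at a known index; counting ties as misses, the fixed seed, and the function names are illustrative choices, not details taken from the setup above.

```python
import random


def recall_at_1(score_lists, true_index=0):
    """Fraction of contexts whose true response receives the strictly highest score.

    score_lists holds one list of candidate scores per context, with the true
    response at position true_index. Ties count as misses (an assumption).
    """
    hits = 0
    for scores in score_lists:
        others = [s for k, s in enumerate(scores) if k != true_index]
        hits += int(all(scores[true_index] > s for s in others))
    return hits / len(score_lists)


def subsample(examples, fraction=0.01, seed=0):
    """Randomly keep a fraction of the training examples (at least one)."""
    rng = random.Random(seed)
    k = max(1, round(len(examples) * fraction))
    return rng.sample(examples, k)
```

With `fraction=0.01`, `subsample` mirrors the 1% setting: a training set of 1,000 examples is reduced to 10.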

Training
We initialize word embeddings using the results of word2vec (Mikolov et al., 2013) trained on the whole corpus. The size of word embeddings is 300 for LSTM-DE and HRE-DE, and 200 for SMN. For LSTM-DE and HRE-DE, each LSTM layer has a hidden size of 300. We use the same hyper-parameters for SMN as in . All models are trained with the Adam optimizer with ... dropout (Srivastava et al., 2014) with rate 0.5 to all recurrent layers. As a side note, we find that dropout does not affect the results in any significant way under the full-scale setting.
[Table: columns Ubuntu (100%, 1%), Taobao (100%, 1%), Douban (100%, 1%), Frames (100%); rows by model, starting with LSTM-DE.]

Related Work
Data augmentation has been widely adopted in computer vision and speech recognition (Krizhevsky et al., 2012; Ko et al., 2015). In image processing, label-preserving transformations such as tilting and flipping are used, but in NLP, finding transformations that exactly preserve meaning is difficult. Language data is discrete in nature, and a minor perturbation may change the meaning. The most commonly used techniques include word substitution (Fadaee et al., 2017) and paraphrasing (Dong et al., 2017). These methods may require heavy external resources, which makes them difficult to apply across multiple languages and domains.
Recently, there has been a surge of interest in adversarial training (Goodfellow et al., 2014). For text data, one class of methods generates adversarial examples by moving word embeddings along the opposite direction of the gradient of the loss function (Wu et al., 2017; Yasunaga et al., 2017), hence applying small perturbations in the continuous space of word vectors. Another class of methods aims to create genuinely new examples. Syntactic and semantic variations have been added to training data based on grammar rules and thesauri. (Xie et al., 2017) add noise to data by blanking out or substituting words for language modeling. (Yang et al., 2017) adopt a seq2seq model to generate questions based on paragraphs and answers within their generative adversarial framework. One main difference between these methods and our approach is that, while adversarial training only manipulates training data, we additionally apply transformations to data at test time to help prediction. This is closer to (Dong et al., 2017) in spirit.

Conclusion
We proposed a general method to improve dialog response selection by manipulating existing data, which can be applied to different models. Our results show that for both open-domain and task-oriented dialogs, and for both English and Chinese, at least one of the proposed augmentation methods is effective, and the methods rarely hurt. We deliberately chose a diverse set of domains and models in order to understand the contribution of data augmentation. Thus, even for new datasets and new models, data augmentation appears to be a valuable addition that will likely improve results. Being more specific about when augmentation works is harder. One future research direction would be to apply data transformations situationally, based on the discourse structure of dialogs. In our experiments, we tried combining permutation and flipping but found no advantage over using only one type of transformation. We believe a more sophisticated method of combination could further improve the results, and leave it to future work.