Dialogue Distillation: Open-domain Dialogue Augmentation Using Unpaired Data

Recent advances in open-domain dialogue systems rely on the success of neural models that are trained on large-scale data. However, collecting large-scale dialogue data is usually time-consuming and labor-intensive. To address this data dilemma, we propose a novel data augmentation method for training open-domain dialogue models by utilizing unpaired data. Specifically, a data-level distillation process is first proposed to construct augmented dialogues where both post and response are retrieved from the unpaired data. A ranking module is employed to filter out low-quality dialogues. Further, a model-level distillation process is employed to distill a teacher model trained on high-quality paired data to augmented dialogue pairs, thereby preventing dialogue models from being affected by the noise in the augmented data. Automatic and manual evaluation indicates that our method can produce high-quality dialogue pairs with diverse contents, and the proposed data-level and model-level dialogue distillation can improve the performance of competitive baselines.


Introduction
Open-domain dialogue systems have attracted much research attention (Shum et al., 2018), thanks to the success of neural generation models trained on large-scale data. Existing research has addressed various aspects of dialogue systems, such as modeling persona (Qian et al., 2018; Zheng et al., 2019; Zhang et al., 2018), expressing emotion (Zhou et al., 2018a), or generating knowledge-grounded dialogues (Ghazvininejad et al., 2018; Zhou et al., 2018b).

Figure 1: Process of constructing augmented post-response pairs (panels: unpaired data, anchor pair, augmented post-response pairs 1 to K). The sentence in the blue rectangle is used to match the anchor pair, and the corresponding response is then used to retrieve similar sentences in the unpaired data. Each augmented pair contains two sentences, both from the unpaired data.
In general, training neural open-domain dialogue models requires a large amount of high-quality paired data, e.g., post-response pairs, which are usually labor-intensive and time-consuming to collect. A feasible solution to this data dilemma is to use data augmentation techniques, which are popular in various research areas such as computer vision (Cubuk et al., 2019) or machine translation (Sennrich et al., 2016). Nevertheless, this technique is rarely investigated in the study of open-domain dialogues, and the few existing approaches are specifically designed for either generation-based dialogue models or retrieval-based dialogue models (Du and Black, 2018). Moreover, existing data augmentation approaches take only a set of paired data as input and do not consider utilizing unpaired data.
As a matter of fact, high-quality unpaired data, i.e., non-conversational texts, are generally easier to collect than high-quality dialogue pairs. Specifically, these unpaired data provide a rich bank of alternative expressions for different contents. It is thus feasible to augment the training dialogue pairs utilizing sentences extracted from the unpaired data. As shown in Figure 1, we can extract various sentences from the unpaired data that are similar to a given post-response pair (i.e., anchor pair). Augmented pairs that carry richer expressions can then be constructed by combining these extracted sentences. To the best of our knowledge, no previous study on open-domain dialogues has tried to construct augmented dialogue pairs by utilizing retrieved unpaired data.
In this paper, we propose a novel data augmentation method, "Dialogue Distillation", to improve the performance of open-domain dialogue models by utilizing unpaired data. Our method involves two phases of distillation. The first phase is at the data level, as it constructs (i.e., distills) post-response pairs by matching sentences retrieved from a set of unpaired data. Specifically, given a set of training pairs $\{\langle x_i, y_i \rangle\}$, a sentence $s$ randomly selected from the unpaired data is first used as a query to retrieve the most similar posts $x_i$, and then the corresponding responses $y_i$ are used as queries to retrieve similar sentences $s_i$ from the unpaired data. Augmented pairs $\langle s, s_i \rangle$ are then constructed and filtered using a ranking module. Note that, different from previous approaches, the post and response sentences that constitute an augmented pair are both from the unpaired data, which are human-written and thereby fluent and content-rich. The second phase is at the model level, as it distills a teacher model using the augmented data. Specifically, we borrow the idea of knowledge distillation (Hinton et al., 2015) to first train a teacher model on a set of high-quality dialogue pairs, and then distill the dialogue model by mimicking the distribution produced by the teacher model on the augmented data, which prevents the final dialogue models from being affected by the noise in the augmented data.
Automatic and manual evaluation results indicate that our data-level distillation process can produce high-quality post-response pairs that are content-rich, and our model-level distillation process can better utilize these augmented data to improve the performance of both retrieval-based and generation-based open-domain dialogue models.
Our contributions are summarized as follows: 1) We propose a data-level and model-level distillation method for open-domain dialogue models. The data-level distillation constructs new post-response pairs where both post and response are retrieved from unpaired data, and the model-level distillation distills a teacher model trained on high-quality paired data to augmented pairs. To the best of our knowledge, this is the first attempt to augment open-domain dialogue pairs by utilizing retrieved unpaired data.
2) Automatic and manual evaluation shows that the augmented pairs produced by our method are content-rich, and these augmented data can be used to improve the performance of both generation-based and retrieval-based dialogue models.

Related Work
There are two major categories of open-domain dialogue models: 1) retrieval-based models, which retrieve the best matching response from pre-collected dialogues (Lu and Li, 2013); and 2) generation-based models, which decode responses from a learned distribution (Sutskever et al., 2014; Vinyals and Le, 2015). Recent advances in these two categories all focus on DNN-based data-driven methods.
Data augmentation is an effective approach to boost the performance of neural models. It has been explored in various NLP tasks, such as text classification (Wei and Zou, 2019; Zheng et al., 2020a), machine reading comprehension (Yu et al., 2018) and machine translation (Sennrich et al., 2016). Although proved to be effective, this technique is rarely investigated in open-domain dialogue models. The few existing approaches are restricted to taking dialogue pairs as their inputs (Zhao et al., 2017), whereas unpaired texts, i.e., sentences without replies, are not utilized.
Note that the pre-training based methods (Devlin et al., 2019;Radford et al., 2019;Golovanov et al., 2019;Zheng et al., 2020b) share a similar motivation with our study, i.e., to boost the performance of neural NLP models utilizing unlabeled (i.e., unpaired) texts. Nevertheless, the data augmentation method proposed in our study can be regarded as a supplement to these pre-training approaches. Experiments demonstrate that our method can be used to improve the performance of dialogue models even if these models are initialized with strong pretrained models.
Figure 2: The framework of data-level distillation. (1) The sentence $S$ is randomly selected from the unpaired data $D_u$. (2) A set of posts $X_1, \ldots, X_n$ that are similar to $S$ are retrieved from the paired data $D_p$. (3) Each corresponding response $Y_i$ is then used to retrieve $m$ sentences $S_{i1}, \ldots, S_{im}$ that are similar to $Y_i$ from $D_u$. (4) Then $n \times m$ candidate pairs can be formed by grouping $S$ with each sentence: $\langle S, S_{ij} \rangle$ ($i = 1, \ldots, n$; $j = 1, \ldots, m$). (5) A ranking module is used to rank these candidate pairs, and the top-1 scored pair is kept.

Our study is also related to the knowledge distillation method (Hinton et al., 2015), which also employs a teacher model and tries to minimize the KL divergence between the teacher distribution and the model distribution. The most related work in this branch is that of Kim and Rush (2016). However, their method does not utilize unpaired data, and their augmented data are decoded from a probability model using beam search. In contrast, our method utilizes unpaired data, and the augmented data are generated by aligning human-produced sentences.
There are also works that try to utilize retrieved non-conversational texts to improve the diversity of dialogue models (Cai et al., 2019). However, most of these studies focus on extracting templates from these non-conversational texts rather than generating augmented pairs, and they typically use specifically designed model structures. In contrast, the data augmentation method proposed in our study can be used in combination with any dialogue model to improve its performance.

Data-level Distillation
The data-level distillation in our method aims at constructing a set of new post-response pairs $D_a$ by matching non-parallel sentences retrieved from the unpaired data $D_u$. Specifically, $D_p$ consists of $N$ post-response pairs: $D_p = \{\langle X_i, Y_i \rangle\}_{i=1}^{N}$, in which $X_i$ and $Y_i$ are the post and response, respectively, and $D_u$ consists of $M$ non-parallel sentences: $D_u = \{S_i\}_{i=1}^{M}$. Note that $M$ is usually much larger than $N$ because non-parallel sentences are generally easier to collect.
Further, the output of our data-level distillation process is a set of augmented post-response pairs: $D_a = \{\langle X_i, Y_i \rangle\}_{i=1}^{K}$, in which both the post and response come from the unpaired dataset $D_u$, i.e., $X_i \in D_u$ and $Y_i \in D_u$ for $i = 1, \ldots, K$.
The data-level distillation involves two major processes: 1) constructing candidate pairs and 2) filtering low-quality candidates. The whole framework is shown in Figure 2 and detailed below.

Constructing Candidate Pairs
We first construct candidate dialogue pairs with the help of some post-response pairs $\langle X_i, Y_i \rangle$ selected from $D_p$. The basic intuition is that sentences similar to a post $X_i$ can usually be answered with sentences similar to the corresponding response $Y_i$. Candidate dialogue pairs can then be constructed by anchoring sentences in $D_u$ using $\langle X_i, Y_i \rangle$.
The construction of candidate pairs starts by randomly selecting a sentence $S$ from the unpaired dataset $D_u$. We then treat $S$ as a candidate post and use it to retrieve $n$ posts $X_i$ ($1 \le i \le n$) that are similar to $S$ from the paired data $D_p$. In this study, the sentence retrieval process is implemented with the Okapi BM25 algorithm, which scores the similarity of input sentences using bag-of-words features. Then the corresponding $n$ post-response pairs $\langle X_i, Y_i \rangle$ ($1 \le i \le n$) are extracted from $D_p$. For each response $Y_i$, we further retrieve $m$ sentences $S_{ij}$ ($1 \le j \le m$) that are similar to $Y_i$ from the unpaired dataset $D_u$. These sentences $S_{ij}$ can then serve as candidate responses to the original sentence $S$, and therefore $n \times m$ candidate pairs $\langle S, S_{ij} \rangle$ ($1 \le i \le n$, $1 \le j \le m$) are generated. Moreover, for each candidate pair $\langle S, S_{ij} \rangle$, we name the post-response pair $\langle X_i, Y_i \rangle$ in $D_p$ that is used to produce $\langle S, S_{ij} \rangle$ its "anchor pair", since it anchors the sentences $S$ and $S_{ij}$ from $D_u$.
Note that we have explored other variants of the above process, such as treating the initial sentence $S$ as a candidate response rather than a candidate post, or utilizing more advanced text retrieval methods to extract similar sentences. However, we notice little difference in either the quality of the final augmented pairs or the performance improvement brought to the dialogue models.
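The BM25-based retrieval used in this step can be illustrated with a minimal scorer. The sketch below is not the Lucene setup used in the experiments: whitespace tokenization, the parameter values k1 = 1.2 and b = 0.75, the smoothed IDF form, and the names `bm25_scores` and `retrieve_top` are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25.

    Assumptions (not from the paper): whitespace tokenization, the k1/b
    defaults, and the smoothed IDF log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(query.split()):
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def retrieve_top(query, docs, top_n):
    """Return the indices of the `top_n` highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]


# Toy posts standing in for the post side of the paired data.
posts = [
    "my flight is delayed again",
    "the weather is nice today",
    "is your flight on time",
]
top_posts = retrieve_top("today's flight is not delayed", posts, top_n=2)
```

Because the scorer only counts shared tokens, a sentence can be retrieved for a query it does not actually answer, which is one reason a separate ranking step is applied to the candidate pairs afterwards.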

Filtering Candidate Pairs
In order to enhance the quality of the augmented data, we propose to filter out low-quality pairs using a ranking module, which calculates a score for each candidate pair obtained above. Specifically, high-quality pairs that are fluent and coherent are expected to receive high scores. In this study, we implement the score function as a text matching model, which is built by fine-tuning a pre-trained BERT model on the paired dataset $D_p$. Negative samples are constructed by replacing the original responses with randomly sampled sentences from $D_p$. The ranking score for each input pair is calculated as the matching score produced by the matching model.
In this study, we follow a strict policy to select the final augmented pairs in $D_a$. For each sentence $S$ sampled from $D_u$, we only extract the top-1 scored pair $\langle S, S_{ij} \rangle$ among all its $n \times m$ candidate pairs, and $\langle S, S_{ij} \rangle$ is added to $D_a$ only when its matching score exceeds a certain threshold $\eta$ ($\eta \ge 0.9$). We repeat the above procedure with newly sampled sentences from $D_u$ until the desired number of augmented pairs in $D_a$ is obtained. The whole data-level distillation process in our method is summarized in Algorithm 1.
Note that the matching model used in the ranking process can also be directly used to align sentences from the unpaired dataset $D_u$. Specifically, for a sampled sentence $S$ from $D_u$, we can treat all other sentences in $D_u$ as its candidate responses and select an augmented pair by ranking all these candidates. Although theoretically possible, this approach is practically infeasible considering the large number of sentences in $D_u$ and the tremendous computational load of ranking these candidates. Note that previous works on effective ranking (such as Henderson et al. (2017, 2020)) cannot be directly adapted to this study because our ranking model does not use a dot-product scoring function.
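The construction of training examples for the matching model can be sketched as follows. This is a minimal illustration under stated assumptions: the number of negatives per positive, the fixed random seed, and the function name `build_matching_examples` are hypothetical, since the text only states that negatives are obtained by swapping in randomly sampled sentences from the paired data.

```python
import random


def build_matching_examples(pairs, num_negatives=1, seed=0):
    """Turn (post, response) pairs into (post, response, label) examples.

    Label 1 marks an original pair; label 0 marks a pair whose response was
    replaced by a randomly sampled response from the same dataset.
    `num_negatives` is an assumption; the paper does not state the ratio.
    """
    rng = random.Random(seed)
    responses = [resp for _, resp in pairs]
    examples = []
    for post, resp in pairs:
        examples.append((post, resp, 1))
        # Candidate negatives: every response except the original one.
        pool = [r for r in responses if r != resp]
        for neg in rng.sample(pool, min(num_negatives, len(pool))):
            examples.append((post, neg, 0))
    return examples
```

The resulting triples are what a binary classifier, such as a fine-tuned BERT matcher, would be trained on.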

Model-level Distillation
A straightforward way to improve a dialogue model with the augmented dialogue data is to directly merge the original paired data $D_p$ with $D_a$. However, this naive approach may lead to sub-optimal performance since the augmented pairs in $D_a$ might not be as high-quality as the human-crafted pairs in $D_p$. In this study, we apply the model-level distillation in the training process to prevent the dialogue models from being affected by the noise in $D_a$. This approach can be used in both retrieval-based and generation-based dialogue models.

Algorithm 1 Data-level distillation process.
 1: $D_a \leftarrow$ empty set
 2: while $|D_a| < K$ do
 3:   $\tilde{D}_a \leftarrow$ empty set
 4:   Sample a sentence $S \sim D_u$.
 5:   Retrieve $n$ posts $\{X_i\}_{i=1}^{n}$ that are similar to $S$ in $D_p$.
 6:   Get the responses $\{Y_i\}_{i=1}^{n}$ for $\{X_i\}_{i=1}^{n}$ from $D_p$.
 7:   for each $Y_i$ do
 8:     Retrieve $m$ sentences $\{S_{ij}\}_{j=1}^{m}$ that are similar to $Y_i$ in $D_u$.
 9:     $\tilde{D}_a \leftarrow \tilde{D}_a \cup \{\langle S, S_{ij} \rangle\}_{j=1}^{m}$
10:   end for
11:   Calculate the ranking score for each pair in $\tilde{D}_a$.
12:   Extract the top-1 scored pair $\langle S, S_{ij} \rangle$ from $\tilde{D}_a$.
13:   if the ranking score of $\langle S, S_{ij} \rangle$ exceeds $\eta$ then
14:     $D_a \leftarrow D_a \cup \{\langle S, S_{ij} \rangle\}$
15:   end if
16: end while
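The loop of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only: `overlap_retrieve` and `toy_score` are hypothetical stand-ins for the BM25 retriever and the BERT-based ranking module, and the bounded loop (`max_trials`), the skipping of $S$ as its own response, and the de-duplication check are assumptions added to keep the toy example well behaved.

```python
import random


def overlap_retrieve(query, corpus, top_n):
    """Toy retriever (stand-in for BM25): rank corpus entries by word overlap."""
    q = set(query.split())
    ranked = sorted(range(len(corpus)),
                    key=lambda i: len(q & set(corpus[i].split())),
                    reverse=True)
    return ranked[:top_n]


def toy_score(post, response):
    """Toy ranking module (stand-in for the fine-tuned BERT matcher)."""
    return 0.99 if "delayed" in post and "delayed" in response else 0.10


def data_level_distillation(unpaired, paired, retrieve, score,
                            n=5, m=5, eta=0.95, k=2, max_trials=1000, seed=0):
    """Build up to `k` augmented pairs whose sentences both come from `unpaired`."""
    rng = random.Random(seed)
    posts = [post for post, _ in paired]
    augmented = []                                 # D_a
    for _ in range(max_trials):                    # assumption: bounded loop
        if len(augmented) >= k:
            break
        s = rng.choice(unpaired)                   # sample S from D_u
        candidates = []                            # candidate pairs for this S
        for i in retrieve(s, posts, n):            # n posts X_i similar to S
            y_i = paired[i][1]                     # response Y_i of the anchor pair
            for j in retrieve(y_i, unpaired, m):   # m sentences S_ij similar to Y_i
                if unpaired[j] != s:               # assumption: skip S itself
                    candidates.append((s, unpaired[j]))
        if not candidates:
            continue
        best = max(candidates, key=lambda pair: score(*pair))   # top-1 pair
        if score(*best) > eta and best not in augmented:         # threshold eta
            augmented.append(best)
    return augmented


unpaired = [
    "today's flight is not delayed",
    "sounds nice , but mine is delayed",
    "i love noodles",
    "the train is delayed again",
]
paired = [
    ("is your flight delayed", "lucky for you , mine is delayed"),
    ("what do you eat", "noodles"),
]
result = data_level_distillation(unpaired, paired, overlap_retrieve, toy_score,
                                 n=1, m=2, eta=0.9, k=2)
```

Passing the retriever and scorer as arguments keeps the loop independent of any particular retrieval library or matching model.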

Retrieval-based Dialogue Model
A retrieval-based dialogue model produces responses by retrieving the best matching sentence from a pre-collected dialogue dataset. Its key component is a matching function $P_\theta(l \mid X, Y)$ that predicts whether a response $Y$ matches a given post $X$. Specifically, $l \in \{0, 1\}$ is a matching label, where $l = 1$ means $Y$ is a proper response for $X$ and $l = 0$ otherwise. The model parameters $\theta$ can be learned by optimizing a negative log likelihood (NLL) loss defined as
$$\mathcal{L}_{m\text{-}nll}(\theta) = -\log P_\theta(l \mid X, Y). \tag{1}$$
In this study, we formalize the matching function using the BERT model (Devlin et al., 2019; Whang et al., 2020). A teacher model $P_{\theta_t}(l \mid X, Y)$ is first obtained by optimizing the NLL loss $\mathcal{L}_{m\text{-}nll}(\theta_t)$ on the paired dataset $D_p$. After the training is completed, the teacher model is fixed and used to compute a knowledge distillation (KD) loss (Kim and Rush, 2016) as
$$\mathcal{L}_{m\text{-}kd}(\theta) = -\sum_{i=0}^{1} P_{\theta_t}(l = i \mid X, Y) \cdot \log P_\theta(l = i \mid X, Y). \tag{2}$$
The final matching model is trained on the following loss:
$$\mathcal{L}_M(\theta) = \mathcal{L}_{m\text{-}nll}(\theta) + \alpha_m \mathcal{L}_{m\text{-}kd}(\theta), \tag{3}$$
where the loss $\mathcal{L}_M(\theta)$ is evaluated using $D_p \cup D_a$ and $\alpha_m$ is used to balance these two losses.
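The combination of the NLL and KD terms for the matching model can be illustrated numerically. The sketch below assumes that the KD term is the cross-entropy between the fixed teacher's distribution and the student's distribution over the two labels, as is common in knowledge distillation; the function names and the scalar-probability interface are hypothetical.

```python
import math


def matching_nll(p_student, label):
    """NLL of the gold label; `p_student` is the student's P(l = 1 | X, Y)."""
    p_gold = p_student if label == 1 else 1.0 - p_student
    return -math.log(p_gold)


def matching_kd(p_student, p_teacher):
    """Cross-entropy between the teacher and student over l in {0, 1}."""
    return -(p_teacher * math.log(p_student)
             + (1.0 - p_teacher) * math.log(1.0 - p_student))


def matching_loss(p_student, p_teacher, label, alpha_m=1.0):
    """NLL on the label plus alpha_m times the KD term."""
    return (matching_nll(p_student, label)
            + alpha_m * matching_kd(p_student, p_teacher))


# A noisy augmented pair labeled 1 that the teacher considers doubtful (0.3):
# the KD term is smallest when the student matches the teacher, which
# counteracts the pull of the hard label toward 1.
loss_follow_label = matching_loss(p_student=0.9, p_teacher=0.3, label=1)
loss_follow_teacher = matching_loss(p_student=0.5, p_teacher=0.3, label=1)
```

In this toy case the student that moves toward the teacher's estimate incurs the lower combined loss, which is the intended effect of distillation on noisy pairs.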

Generation-based Dialogue Model
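A generation-based dialogue model tries to capture the distribution of the response sentence $Y$ given the post sentence $X$, i.e., $P_\phi(Y \mid X)$, which can be formalized as
$$P_\phi(Y \mid X) = \prod_{i=1}^{|Y|} P_\phi(y_i \mid y_{<i}, X), \tag{4}$$
where $|Y|$ is the length of $Y$ and $y_{<i} = y_1 \cdots y_{i-1}$ is the token sequence before $y_i$. The model parameters $\phi$ can be learned by optimizing the NLL loss:
$$\mathcal{L}_{g\text{-}nll}(\phi) = -\sum_{i=1}^{|Y|} \log P_\phi(y_i \mid y_{<i}, X). \tag{5}$$
In this study, we parameterize the dialogue generation model using the Transformer-based encoder-decoder framework (Vaswani et al., 2017; Golovanov et al., 2019; Zheng et al., 2020b). Similar to the retrieval-based approach, a teacher model is first obtained by optimizing the NLL loss $\mathcal{L}_{g\text{-}nll}$ on the paired dataset $D_p$, and the trained teacher model is used to compute a KD loss as
$$\mathcal{L}_{g\text{-}kd}(\phi) = -\sum_{i=1}^{|Y|} \sum_{j=1}^{|\mathcal{V}|} P_{\phi_t}(y_i = j \mid y_{<i}, X) \cdot \log P_\phi(y_i = j \mid y_{<i}, X), \tag{6}$$
where $|\mathcal{V}|$ denotes the size of the vocabulary and $\phi_t$ is the parameter of the teacher model, which is fixed.
The final loss for the generation model is:
$$\mathcal{L}_G(\phi) = \mathcal{L}_{g\text{-}nll}(\phi) + \alpha_g \mathcal{L}_{g\text{-}kd}(\phi), \tag{7}$$
where the loss $\mathcal{L}_G(\phi)$ is evaluated using $D_p \cup D_a$ and $\alpha_g$ is used to balance these two losses.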
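The token-level version of the same idea can be sketched as below. This is a minimal illustration that assumes the KD term is the per-position cross-entropy between the fixed teacher's and the student's distributions over the vocabulary; the list-of-lists interface and the name `generation_loss` are hypothetical, and a real implementation would operate on batched logits.

```python
import math


def generation_loss(student_probs, teacher_probs, target_ids, alpha_g=1.0):
    """Token-level NLL plus alpha_g times the KD term for one response.

    `student_probs[i][j]` and `teacher_probs[i][j]` are the probabilities that
    position i holds vocabulary item j; `target_ids[i]` is the reference token.
    """
    nll = 0.0
    kd = 0.0
    for p_student, p_teacher, gold in zip(student_probs, teacher_probs, target_ids):
        nll -= math.log(p_student[gold])
        # Skip zero teacher mass so that 0 * log(...) is never evaluated.
        kd -= sum(q * math.log(p) for p, q in zip(p_student, p_teacher) if q > 0.0)
    return nll + alpha_g * kd


# One-token response over a two-word vocabulary.
example_loss = generation_loss(
    student_probs=[[0.5, 0.5]],
    teacher_probs=[[1.0, 0.0]],
    target_ids=[0],
)
```

Here both terms equal log 2, so the combined loss is 2 log 2 when the balancing weight is 1.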

Dataset
The evaluation of our method is performed on a corpus collected from Weibo. Specifically, the paired data $D_p$ contains 300K post-response pairs, which are made up of Weibo posts and their following replies. All these pairs are manually filtered by annotators, who remove ungrammatical sentences and incoherent dialogues. The unpaired data $D_u$ contains about 2 million posts on Weibo that do not have replies. Non-fluent sentences in $D_u$ are filtered out using a set of heuristic rules. Further, two additional sets of paired data are also prepared to validate and test the dialogue models, with 10K and 5K pairs respectively. These dialogue pairs are collected and manually filtered using the same criterion as $D_p$.

Implementation Details
Data-level Distillation: We implement the retrieval module in Section 3.1 using the Lucene library, and set the value of both $n$ and $m$ to 5. The matching model used in Section 3.2 is fine-tuned with $D_p$ for three epochs based on the pre-trained BERT-base model (Devlin et al., 2019). The hyperparameter setting of the matching model follows the work of Devlin et al. (2019).
Model-level Distillation: For the retrieval-based dialogue model, the matching model used in Section 3.2 is directly used as the teacher model to calculate the KD loss (Eq. 2). The final retrieval-based dialogue model is initialized with the pre-trained BERT-base weights and fine-tuned using the loss in Eq. 3 for 2 epochs on $D_p \cup D_a$. The value of $\alpha_m$ in Eq. 3 is set to 1.
For the generation-based dialogue model, the encoder and decoder share the same set of parameters, which is initialized using a pre-trained GPT model (Wang et al., 2020). The teacher model uses the same architecture and is fine-tuned on the paired dataset $D_p$ for 15 epochs with the NLL loss (Eq. 5). The final generative dialogue model is first initialized using the pre-trained GPT weights and then fine-tuned using the loss in Eq. 7 for 50 epochs on $D_p$ and $D_a$. The value of $\alpha_g$ in Eq. 7 is set to 1. Moreover, the GPT model used in the initialization phase is trained on a corpus collected from various Chinese novels. This corpus contains about 0.5 billion tokens and a character-level vocabulary of size 13,084.
See Appendix A for more details of the model setting and reproduction guidance. The data and code for all experiments can be downloaded from the link in footnote 3.

Baselines
We first evaluate the quality of the augmented pairs generated by our Data-Level (DL) distillation process. Three different matching thresholds $\eta$ in Algorithm 1 are tested, i.e., $\eta$ = 0.90, 0.95, 0.99. Several strong baselines are also compared:
CVAE: A CVAE-based model is trained on the paired data $D_p$. Augmented pairs are generated by sampling different latent codes.
BT: Augmented pairs are generated by Back Translating (i.e., translate Chinese to English and then translate back to Chinese) the post sentences of the dialogue pairs in D p . The translation is done via the Google Translate API.
SP: A variant of our method is implemented by first Sampling a post-response Pair $\langle X, Y \rangle$ from $D_p$, and then retrieving a best-matching post and response from the unpaired data $D_u$ using $X$ and $Y$ as the query, respectively. An augmented pair is constructed by pairing the retrieved post and response sentences without the ranking process.
Note that there are two major differences between the baseline SP and our data-level distillation process: 1) the baseline SP starts with a dialogue pair $\langle X, Y \rangle$ sampled from $D_p$ rather than a candidate post sampled from $D_u$; 2) the ranking process is not used in the baseline SP to further filter the candidate pairs.

Metrics
The automatic evaluation of augmented dialogue pairs uses the following metrics: 1) Distinct (Li et al., 2016) is used to measure the proportion of unique n-grams in the augmented dialogue pairs (n=1,2,3,4); 2) Novelty (Wang and Wan, 2018) is used to measure the proportion of new n-grams in the augmented dialogue pairs (n=1,2,3,4), i.e., n-grams that are covered by the augmented dialogue pairs but do not appear in the paired dataset $D_p$. A higher novelty score means the augmented dialogue pairs contain more "novel" contents.
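The two automatic metrics can be sketched as follows. This is one plausible reading of the definitions above rather than the exact evaluation script: whitespace tokenization and counting novelty over n-gram occurrences (rather than over unique n-grams) are assumptions, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(sentences, n):
    """Number of unique n-grams divided by the total number of n-grams."""
    all_ngrams = [g for sent in sentences for g in ngrams(sent.split(), n)]
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0


def novelty_n(augmented, reference, n):
    """Fraction of n-grams in `augmented` that never occur in `reference`."""
    seen = {g for sent in reference for g in ngrams(sent.split(), n)}
    aug = [g for sent in augmented for g in ngrams(sent.split(), n)]
    return sum(g not in seen for g in aug) / len(aug) if aug else 0.0
```

For example, `distinct_n(["a b a b"], 1)` gives 0.5, and `novelty_n(["a b c"], ["a b"], 2)` gives 0.5 because only the bigram `(b, c)` is absent from the reference.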
Manual evaluation is also used to evaluate the quality of augmented dialogue pairs. Three annotators are employed to rate these pairs from two aspects: 1) Fluency (Flu.): whether the augmented pairs are fluent; 2) Coherency (Coh.): whether the response is coherent with the post so that they make a plausible dialogue pair. Each measure is rated on a three-point scale (0, 1, 2), in which 0 means worst and 2 best.
3 https://github.com/njuzrs/dialogue_distillation

Results
Each data augmentation method introduced above is used to generate 300K augmented dialogue pairs, on which automatic evaluation is performed. Further, manual evaluation is carried out on 200 dialogue pairs that are randomly sampled from these augmented data, and the inter-rater agreement between annotators is measured using Fleiss's kappa $\kappa$ (Randolph, 2005). The $\kappa$ value for Fluency and Coherency is 0.69 (substantial agreement) and 0.42 (moderate agreement), respectively. Note that this evaluation concerns only the augmented dialogue data, without considering any dialogue model training.
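For readers unfamiliar with the agreement statistic, a minimal sketch of a multi-rater kappa is given below. It implements the standard fixed-marginal Fleiss' kappa and, as an option, the free-marginal variant in which chance agreement is 1 divided by the number of categories. Which variant was used in the evaluation above is not stated, so the `free_marginal` flag and the function name are assumptions for illustration.

```python
def fleiss_kappa(ratings, num_categories, free_marginal=False):
    """Multi-rater kappa for `ratings[i]` = list of labels given to item i.

    Every item must be rated by the same number of annotators, and labels
    are integers in range(num_categories). The result is undefined when
    chance agreement equals 1 (all ratings fall in a single category).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][c]: how many annotators assigned category c to item i.
    counts = [[row.count(c) for c in range(num_categories)] for row in ratings]
    # Observed agreement, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    if free_marginal:
        p_exp = 1.0 / num_categories
    else:
        # Chance agreement from the pooled category proportions.
        totals = [sum(row[c] for row in counts) for c in range(num_categories)]
        p_exp = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_obs - p_exp) / (1.0 - p_exp)
```

With three annotators and the (0, 1, 2) scale described above, `ratings` would hold one three-element list per evaluated pair.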
The evaluation results in Table 1 demonstrate that the augmented dialogue data produced by our method outperform all the baselines in almost all the metrics. We can further observe that: 1) Our method obtains similar scores on all the metrics compared to these human-produced and filtered dialogue pairs in D p . This indicates that the augmented dialogue pairs generated by our method are of high quality. We present some examples of the augmented pairs together with their associated anchor pairs in Table 2.
2) The matching threshold η can be used to trade off between the coherency and diversity of the augmented dialogue pairs. Specifically, a higher η value improves Fluency and Coherency scores but hurts Distinct and Novelty scores of the augmented pairs.

Baselines
We evaluate the benefit of the augmented dialogue data in both retrieval-based and generation-based dialogue models. Specifically, 300K augmented dialogue pairs are generated using each of the three baselines introduced in Section 5.3.1, and the model-level distillation process introduced in Section 4 is used to train the dialogue models. We denote these three dialogue model baselines as CVAE+ML, BT+ML, and SP+ML, respectively. Note that the notation "ML" means that the Model-Level distillation is used. Moreover, besides comparing to the different data augmentation methods introduced in Section 5.3.1, several other competitive dialogue model baselines are also tested:
Teacher: Training the dialogue models on the paired data $D_p$ with the NLL loss. Note that this setting produces the teacher models used in Section 4.
AP: Training dialogue models only on the Augmented Pairs D a with the NLL loss.
UP+PreT: First fine-tuning the pre-trained GPT (with the NLL loss in Eq. 5) or BERT-base model (with the MLM loss (Devlin et al., 2019)) on the UnPaired Data D u , and then using these fine-tuned weights to initialize the dialogue models, which are further fine-tuned on D p with the NLL loss.
NP+ML: Sampling 300K pairs from a set of Weibo dialogues that are not manually filtered and using these "Noisy Pairs" as the augmented pairs. The model-level distillation process introduced in Section 4 is used to train this baseline.
We denote our method as DL+ML since it trains the dialogue model using both the data-level and model-level distillation. The threshold $\eta$ in Algorithm 1 is set to 0.95 for a better trade-off between the coherency and diversity of the augmented data. Further, we also test another way of working with the data-level distillation (i.e., utilizing $D_a \cup D_p$): DL+PreT, i.e., first pre-training the dialogue model on $D_a$ and then fine-tuning on $D_p$ with the NLL loss.
Further, we also perform several ablation tests on our method to validate the effect of each component: 1) training dialogue models on $D_p \cup D_a$ using only the NLL loss, i.e., without the model-level distillation (w/o ML); 2) training dialogue models only on the paired data $D_p$ using $\mathcal{L}_M(\theta)$ or $\mathcal{L}_G(\phi)$, i.e., the data-level distillation is not used (w/o DL); 3) training dialogue models on the augmented data $D_a$ using $\mathcal{L}_M(\theta)$ or $\mathcal{L}_G(\phi)$, i.e., the paired data $D_p$ are not used (w/o PD); 4) generating $D_a$ without the ranking module (w/o Ranking), i.e., the candidate pairs are used as the augmented data without filtering.
Note that all the baselines and ablation models are initialized with pre-trained GPT or BERT-base weights.

Metrics
The retrieval-based dialogue models are evaluated using the following metrics: 1) Mean Average Precision (MAP): the mean over test posts of the average precision of the ranked candidates, which reduces to the reciprocal rank of the reference response when each post has a single reference; 2) $R_{10}@k$: the recall of the reference response being in the top-$k$ ranked candidates ($k$=1,2,5) when given 10 candidates in total.
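These ranking metrics can be sketched as below, under the assumption that each test post has exactly one reference response among its candidates, in which case average precision equals the reciprocal of the reference's rank. The function names are hypothetical.

```python
def rank_of_reference(scores, reference_index):
    """1-based rank of the reference among candidates (higher score is better)."""
    ref = scores[reference_index]
    return 1 + sum(s > ref for s in scores)


def mean_average_precision(ranks):
    """With one relevant candidate per post, average precision is 1 / rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def recall_at_k(ranks, k):
    """Fraction of posts whose reference response is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Ranks of the reference response for three hypothetical test posts.
ranks = [1, 2, 5]
map_score = mean_average_precision(ranks)   # (1 + 1/2 + 1/5) / 3
r_at_1 = recall_at_k(ranks, 1)
r_at_2 = recall_at_k(ranks, 2)
r_at_5 = recall_at_k(ranks, 5)
```

Ties are resolved optimistically here: a candidate only outranks the reference if its score is strictly higher.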
The generation-based dialogue models are evaluated both automatically and manually. Specifically, the following automatic metrics are used: 1) Perplexity (PPL), which measures how well the model fits the test data; 2) BLEU, which evaluates the overlap of n-grams (n=1,2) between the generated and reference responses; 3) Distinct (Dist.), which measures the proportion of unique n-grams in the generated responses (n=1,2). Manual evaluation is also performed for the generated dialogue responses following the same protocol as introduced in Section 5.3.2.
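As a reminder of how perplexity relates to the NLL loss, a minimal sketch follows. It assumes token-level normalization (the exponential of the average negative log-likelihood per token over the whole test set); the function name and input format are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token.

    `token_log_probs[s][i]` is the natural-log probability the model assigns
    to the i-th reference token of test response s.
    """
    total = sum(lp for sent in token_log_probs for lp in sent)
    count = sum(len(sent) for sent in token_log_probs)
    return math.exp(-total / count)


# A model that assigns probability 0.5 to every reference token.
ppl = perplexity([[math.log(0.5), math.log(0.5)], [math.log(0.5)]])
```

A model that assigns probability 0.5 to every reference token therefore has a perplexity of 2; lower values indicate a better fit to the test data.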

Table 3: Automatic evaluation for retrieval-based dialogue models with different training and data augmentation methods. Columns: Model, MAP, $R_{10}@1$, $R_{10}@2$, $R_{10}@5$.

Results
Automatic evaluation for each dialogue model is performed on 5K test data (see Table 3 and Table 4 for the results), and manual evaluation is performed using 200 pairs that are randomly sampled from these test data (see Table 5 for the results). The κ value for the Fluency and Coherency annotation is 0.9 (substantial agreement) and 0.56 (moderate agreement), respectively.
Our method outperforms all the baselines in almost all the metrics for both retrieval-based and generation-based dialogue models. We can further observe that: 1) The dialogue models that utilize unpaired data $D_u$ (e.g., DL+ML, DL+PreT, UP+PreT) generally outperform the models that are only trained on $D_p$ (e.g., Teacher, CVAE+ML). This demonstrates that utilizing unpaired data is more effective at improving the performance of dialogue models; 2) Training the dialogue models on the merged data $D_p \cup D_a$ without utilizing the model-level distillation (i.e., w/o ML) brings little or no performance improvement compared to directly training on $D_p$ (i.e., Teacher). This verifies the effectiveness of the model-level distillation process proposed in our method; 3) When the model-level distillation is employed, the augmented data produced by our data-level distillation process (i.e., DL+ML) can better improve the performance of dialogue models compared to the augmented data produced by other data augmentation methods (e.g., CVAE+ML, NP+ML, SP+ML, BT+ML). This verifies the effectiveness of the data-level distillation process proposed in our study.

Table 4: Automatic evaluation results for generation-based dialogue models with different training and data augmentation methods. Significance tests between the best model and others were performed using t-test with bootstrap resampling (Koehn, 2004). † and ‡ indicate p-value < 0.005 and 0.001, respectively.

Conclusion
This paper presents a novel dialogue distillation method that consists of two processes, i.e., 1) a data augmentation process to construct new post-response pairs from unpaired data and 2) a model distillation process that distills a teacher model trained on the original data to the augmented data. Automatic and manual evaluation shows that our method can produce high-quality post-response pairs that are both coherent and content-rich, which can be further used to improve the performance of competitive baselines. Our method may inspire other research in low-resource NLP tasks.