Group-wise Contrastive Learning for Neural Dialogue Generation

Neural dialogue response generation has gained much popularity in recent years. The Maximum Likelihood Estimation (MLE) objective is widely adopted in existing dialogue model learning. However, models trained with the MLE objective are plagued by low diversity in the open-domain conversational setting. Inspired by the observation that humans not only learn from positive signals but also benefit from correcting undesirable behaviors, we introduce contrastive learning into dialogue generation, where the model explicitly perceives the difference between well-chosen positive and negative utterances. Specifically, we employ a pretrained baseline model as a reference. During contrastive learning, the target dialogue model is trained to give higher conditional probabilities to positive samples, and lower conditional probabilities to negative samples, than the reference model does. To manage the multi-mapping relations prevalent in human conversation, we augment contrastive dialogue learning with group-wise dual sampling. Extensive experimental results show that the proposed group-wise contrastive learning framework is suited for training a wide range of neural dialogue generation models, with very favorable performance over the baseline training approaches.


Introduction
Open-domain human-machine dialogue systems, especially generation-based ones, have attracted extensive attention recently. Typically, following the neural encoder-decoder paradigm, contemporary dialogue generation models (Shang et al., 2015; Serban et al., 2016; Xing et al., 2017; Yan, 2018; Huang et al., 2020) are, more often than not, trained with the Maximum Likelihood Estimation (MLE) principle to mimic human context-response pairs in the training corpus. While notable gains have been achieved under this learning framework, prior art (Li et al., 2016a, 2017; Zhang et al., 2018a) suggests that the naive MLE objective used for training neural dialogue generation models is not sufficiently effective and tends to result in issues like dull response generation. By optimizing the likelihood of training dialogues, neural models are inclined to assign high probabilities to "safe" responses, because vacuous responses like "I don't know" occur with relatively high frequency in conversational datasets (Li et al., 2016a).
One promising training framework for neural dialogue generation is adversarial learning (Goodfellow et al., 2014a; Li et al., 2017), where a discriminator provides rewards for the generator by contrastively distinguishing dialogues as human-generated or machine-generated. However, the learning ability of GANs in text is drastically limited due to training instability and model collapse (Nie et al., 2019; Caccia et al., 2020). First, the discriminator is usually hard to fool, and the generator can hardly learn from the resulting ineffective rewards. Second, the generator is sometimes encouraged to mimic the high-frequency generic responses in the training corpus, because in some cases the discriminator fails to distinguish a good response from a bad one: it can easily recognize contentful but less grammatical responses as machine-generated, yet treat human-generated dull responses as the oracle.
In this paper, we introduce contrastive learning (Hadsell et al., 2006; Gutmann and Hyvärinen, 2012) into dialogue generation, where the model explicitly perceives the difference between well-chosen positive and negative utterances. From the perspective of contrastive learning, the discriminator in adversarial learning considers human-generated responses as positive utterances and synthetic ones as negative samples. Instead, this work deems highly-matched context-response pairs as positive samples and mismatched training pairs as negative samples. In particular, we utilize a pretrained baseline model as a reference. During contrastive learning, for context $c$ and its response $r$, the target dialogue model is trained to give higher conditional probabilities $p(r|c)$ for the positive samples, and lower conditional probabilities for the negative samples, compared to the reference model. This training paradigm encourages the model to pull the positive data points together and push apart the negative samples, as exemplified in Figure 1. As a result, our proposed training scheme explicitly takes the semantic associations and differences among training examples into account for dialogue modeling. Besides, by contrastively characterizing the distinctions relative to a strong reference, our method implicitly enhances the distinctiveness of the generated responses as well, and ensures that the overall performance of the target model is not inferior to the reference.
Contrastively learning from one pair of positive and negative samples is quite straightforward. However, multi-mapping relations prevail in human-human conversations: there exist multiple appropriate responses for a given context, and a response sometimes fits several contexts well, known as one-to-many and many-to-one relations. Such complex multi-mapping relations are overlooked in previous learning frameworks, which hampers effective dialogue response learning. Furthermore, if a potentially highly-matched utterance pair is treated as a negative sample, or an outlier is used as a positive sample, the model may be confused. Therefore, to account for the multi-mapping phenomenon in human conversations, remedy potentially problematic false learning samples, and enhance training stability, we augment contrastive learning with group-wise dual sampling, where groups of positive and negative instances are sampled with respect to both the context and the response. To further capture subtle differences between instances in a group, we adapt the importance of each instance according to its matching score, and optimize the weighted loss.
We illustrate our learning framework with the example in Figure 1. Given a training context-response pair $(c, r)$, for the context "What are your hobbies? I love to cook.", multiple highly-matched responses are organized as the positive samples $r^+$, and the mismatched utterances are deemed the negatives $r^-$. In the dual direction, for the response "Reading is my favorite hobby.", multiple sampled context utterances are similarly divided into $c^+$ and $c^-$. Compared with the reference baseline, the target dialogue model is trained to give higher generation probabilities to the positive instances $(c, r^+)$ and $(c^+, r)$, and lower probabilities to the negatives $(c, r^-)$ and $(c^-, r)$. In this way, the target model is induced to pull the positive sample pairs together and push the mismatched pairs apart, and thus learns from the distinctions between the positives and negatives.
The proposed group-wise contrastive learning framework is suited for training various neural dialogue generation models. We conduct extensive studies on three large-scale conversation datasets using four popular dialogue models to assess the proposed approach. The experimental results confirm the effectiveness of our learning framework, with very favorable performance over the baseline training approaches.¹

Dialogue Learning by Comparison
Given training data $\mathcal{D}$ containing context-response pairs $\{(c, r)_i\}_{i=1}^N$, a dialogue model parameterized by $\theta$ aims to map from the input context $c$ to the output response $r$. To achieve this, conventional dialogue learning approaches search for the parameter $\theta$ by maximizing the conditional probability $p_\theta(r|c)$ over the training samples. MLE maximizes the log-likelihood of training pairs, while adversarial approaches rely on the discriminator to distinguish between good responses and bad ones. To combat the aforementioned drawbacks of traditional training approaches in dialogue learning, we advocate the use of contrastive learning to explicitly perceive the difference between the positive and negative samples. Inspired by Gutmann and Hyvärinen (2012) and Dai and Lin (2017), we utilize a pretrained baseline model $p_n(\cdot; \phi)$ to provide the target dialogue model $p_m(\cdot; \theta)$ with a strong reference when contrasting the positive samples and the negatives. Humans not only learn from positive signals but also benefit from correcting undesirable behaviors. Intuitively, the target dialogue model is expected to give higher conditional probabilities $p(r|c)$ for the positive samples, and lower conditional probabilities for the negative samples, compared to the reference model. Towards this end, we define the difference between $p_m(r|c, \theta)$ and $p_n(r|c, \phi)$ as $D((c, r); \theta, \phi)$ (Eq. (1)). We wish that $D((c, r); \theta, \phi) > 0$ for any positive pair and $D((c, r); \theta, \phi) < 0$ for any negative pair. Concretely, we minimize the loss function in Eq. (2), where $\sigma(\cdot)$ is the sigmoid activation function, the given training pair $(c, r)$ can be used as the positive sample $(c, r)^+$, and the negative sample $(c, r)^-$ can be obtained through negative sampling using the given instance $(c, r)$.
¹ Code is available at https://github.com/hengyicai/ContrastiveLearning4Dialogue
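For concreteness, one form of the difference function and the pairwise loss that is consistent with the description above can be written as follows. This is only a sketch under stated assumptions: $D$ is taken to be the log-likelihood difference between the target and the reference model (the form used by Dai and Lin (2017), which the text cites as inspiration), and the loss is taken to be a binary logistic loss on $D$ averaged over the $N$ training pairs. The exact normalization is an assumption.

```latex
% Assumed form of the difference function: log-ratio of target to reference likelihood.
D\big((c, r); \theta, \phi\big) = \log \frac{p_m(r \mid c, \theta)}{p_n(r \mid c, \phi)}

% Assumed form of the pairwise loss: logistic loss on D, averaged over N training pairs.
\mathcal{L}'(\theta; \mathcal{D}, \phi) =
  - \frac{1}{N} \sum_{(c, r) \in \mathcal{D}} \log \sigma\Big(D\big((c, r)^{+}; \theta, \phi\big)\Big)
  - \frac{1}{N} \sum_{(c, r) \in \mathcal{D}} \log \Big[1 - \sigma\Big(D\big((c, r)^{-}; \theta, \phi\big)\Big)\Big]
```

Under this reading, the first term rewards $D > 0$ on positive pairs and the second rewards $D < 0$ on negative pairs, which matches the stated goal.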
Optimizing the dialogue model with the above objective function is reminiscent of the nonlinear logistic regression in Noise-Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2012). However, the underlying motivations of our formulation and NCE are essentially different. The reference model in our work is utilized to constrain the behavior of the target model, rather than to serve as a noise distribution that provides noise data. Another difference is that, instead of using the log-ratio between $p_m(\cdot; \theta)$ and $p_n(\cdot; \phi)$ to compute posterior classification probabilities as in NCE, we introduce the function $D((c, r); \theta, \phi)$ to characterize the distinctions of intrinsic dialogue properties relative to the reference, and encourage the generation of positive samples while penalizing the negative ones by minimizing the loss in Eq. (2). Besides, by contrastively characterizing the distinctions relative to a strong reference, our method implicitly enhances the distinctiveness of the generated response as well, and ensures that the overall performance of the target model is not inferior to the reference.
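The pairwise objective discussed above can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: it assumes that $D$ is the difference of sequence log-likelihoods under the target and reference models, and the function names (`log_sigmoid`, `difference`, `pairwise_contrastive_loss`) are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def difference(logp_target: float, logp_reference: float) -> float:
    """D((c, r)): assumed here to be log p_m(r|c) - log p_n(r|c)."""
    return logp_target - logp_reference


def pairwise_contrastive_loss(pos, neg) -> float:
    """Loss for one positive and one negative pair.

    pos, neg: (logp_target, logp_reference) tuples of sequence log-likelihoods.
    """
    d_pos = difference(*pos)
    d_neg = difference(*neg)
    # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
    return -log_sigmoid(d_pos) - log_sigmoid(-d_neg)
```

When the target model agrees with the reference on both pairs, the loss equals 2 log 2; it decreases as the target assigns more probability than the reference to the positive pair and less to the negative pair.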

Contrastive Dual Sampling
Nevertheless, in the presence of multi-mapping relations in human dialogues, effectively sampling positive and negative pairs in conversation is not straightforward and even runs the risk of introducing false learning samples. To manage the complex multi-mapping phenomenon in human conversations and enhance training stability, we augment the contrastive learning with group-wise dual sampling, where groups of positive and negative instances are sampled with respect to both the context and the response. Concretely, for each training instance $(c, r)$, we find a group of positive examples $\{(c, r^+)_i\}_{i=1}^k$ with the highest matching degree and a group of negative examples $\{(c, r^-)_i\}_{i=1}^k$ with the lowest matching degree, using an off-the-shelf pretrained matching model to compute the matching scores between the given context and candidate responses. Similarly, $\{(c^+, r)_i\}_{i=1}^k$ and $\{(c^-, r)_i\}_{i=1}^k$ are also retrieved from the training set to serve as the context-side contrastive examples, as shown in Figure 2(a). In this work, we adopt MSN (Yuan et al., 2019), a context-response matching network based on multi-hop selection, as the off-the-shelf matching model. Note that other sophisticated matching models can also be applied, e.g., the deep attention matching network (Zhou et al., 2018).
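The dual sampling step can be sketched as a ranking problem: score every candidate against the anchor and keep the $k$ best and $k$ worst. The snippet below is an illustrative sketch only. The `toy_match` function is a hypothetical stand-in (token overlap) for a learned matching model such as MSN, and the candidate pools are assumed to have been retrieved beforehand.

```python
from typing import Callable, Dict, List, Tuple


def toy_match(context: str, response: str) -> float:
    """Hypothetical matcher: Jaccard overlap of lowercase tokens."""
    a, b = set(context.lower().split()), set(response.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def sample_group(
    anchor: str,
    candidates: List[str],
    score: Callable[[str, str], float],
    k: int,
) -> Tuple[List[str], List[str]]:
    """Return the k best-matching and k worst-matching candidates for the anchor."""
    ranked = sorted(candidates, key=lambda cand: score(anchor, cand), reverse=True)
    positives = ranked[:k]
    negatives = ranked[-k:] if k > 0 else []
    return positives, negatives


def dual_sample(
    context: str,
    response: str,
    cand_responses: List[str],
    cand_contexts: List[str],
    match: Callable[[str, str], float],
    k: int,
) -> Dict[str, List[str]]:
    """Build response-side and context-side contrastive groups for one (c, r) pair."""
    # Response side: rank candidate responses against the given context.
    r_pos, r_neg = sample_group(context, cand_responses, match, k)
    # Context side: rank candidate contexts against the given response.
    c_pos, c_neg = sample_group(response, cand_contexts, lambda r, c: match(c, r), k)
    return {"r+": r_pos, "r-": r_neg, "c+": c_pos, "c-": c_neg}
```

Note that the positive and negative groups overlap if a candidate pool holds fewer than $2k$ items, so in practice the pool should be comfortably larger than the group size.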

Group-wise Contrastive Learning
For each training instance $(c, r)$, as described in §2.2, we sample $k$ different positive and negative pairs with respect to both the dialogue context and its response, to manage multi-mapping relations in conversation and stabilize model training. The resulting well-chosen samples are composed of the positive samples, $\{(c, r^+)_i\}_{i=1}^k$ and $\{(c^+, r)_i\}_{i=1}^k$, and the negatives, $\{(c, r^-)_i\}_{i=1}^k$ and $\{(c^-, r)_i\}_{i=1}^k$. The loss function is then updated to the group-wise form in Eq. (3).
Given the varied matching degrees of the collected context-response pairs in open-domain dialogue, training on such data indiscriminately prevents the model from perceiving intra-group differences among these samples. We thus utilize the matching score $s$ attached to each sample to adapt its effect on the group-wise contrastive dialogue learning. Specifically, for a given training example $(c, r)$, the matching score $s^+$ of a positive pair lies in $(0, 1]$ and the score $s^-$ of a negative pair lies in $[-1, 0]$. To induce the model to learn discriminately from sample pairs with varied matching degrees, the final loss function $\mathcal{L}(\theta)$ is defined in Eq. (4). $\mathcal{L}(\theta)$ reaches its lower bound when the positive and negative pairs can be perfectly distinguished, i.e., $p_m(r|c, \theta) \gg p_n(r|c, \phi)$ for the positive samples and $p_m(r|c, \theta) \ll p_n(r|c, \phi)$ for the negatives, which indicates that the target dialogue model is able to clearly contrast a group of positive candidates from the negative ones and generate highly distinctive responses for the given contexts.
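A score-weighted group-wise loss for a single training pair might look like the sketch below. It rests on assumptions that should be checked against the released code: $D$ is a log-likelihood difference, each matching score $s$ multiplies $D$ inside the sigmoid (so a negative score in $[-1, 0]$ flips the sign of $D$ for negative samples), and each group is averaged. All names are hypothetical.

```python
import math
from typing import List, Tuple

# (logp_target, logp_reference, matching_score) for one sampled pair.
Sample = Tuple[float, float, float]


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def group_term(samples: List[Sample]) -> float:
    """Mean of log sigmoid(s * D) over a group; 0.0 for an empty group."""
    if not samples:
        return 0.0
    total = sum(log_sigmoid(s * (lp_t - lp_ref)) for lp_t, lp_ref, s in samples)
    return total / len(samples)


def weighted_group_loss(
    anchor: Tuple[float, float],
    positives: List[Sample],
    negatives: List[Sample],
) -> float:
    """Loss for one (c, r) pair plus its positive and negative groups.

    positives carry scores in (0, 1]; negatives carry scores in [-1, 0].
    """
    anchor_d = anchor[0] - anchor[1]
    return -log_sigmoid(anchor_d) - group_term(positives) - group_term(negatives)
```

With this form, a sample whose score is close to zero contributes an almost constant term, so weakly matched or weakly mismatched samples have little influence on the gradient, while samples with scores near 1 or -1 dominate.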

Discussion
Neural sequence-to-sequence models trained with the MLE objective function are plagued by the low-diversity issue in the open-domain conversational setting, in which bland and generic utterances usually dominate the data distribution. Since the objective of MLE is to maximize only the probabilities of ground-truth context-response pairs, it fails to capture the multi-mapping nature of human dialogues, not to mention the semantic differences among various candidates for a given example. In contrast, the proposed group-wise contrastive learning framework explicitly explores multiple variants of a given dialogue example by leveraging an off-the-shelf matching model, and implicitly guarantees the ground-truth generation probabilities through the contrastive constraints in Eq. (4).
Adversarial learning approaches and our proposed framework both involve an auxiliary model during the training process. However, GANs are learned via a competition between the target generator and the counteracting discriminator, which needs careful tuning to prevent model collapse in text modeling (Caccia et al., 2020). In our framework, by contrast, the auxiliary reference system models conversation data in the same direction as the target dialogue model, and is stable during the learning procedure.

Experiment Settings
Datasets We perform experiments on three conversation datasets: PersonaChat (Zhang et al., 2018b), Douban Corpus (Wu et al., 2017) and OpenSubtitles (Lison and Tiedemann, 2016). PersonaChat, an English-language dataset, contains multi-turn dialogues between pairs of speakers, collected via Amazon Mechanical Turk. Douban consists of daily conversations from a popular social networking service in China, Douban group.² OpenSubtitles contains human-human conversations converted from movie transcripts in English. Data statistics are listed in Table 1.
Experimental Models We apply the proposed group-wise contrastive learning framework to several state-of-the-art models, including (i) SEQ2SEQ: an LSTM-based sequence-to-sequence model with attention mechanisms (Bahdanau et al., 2015), (ii) HRED: a hierarchical recurrent neural dialogue generation model (Serban et al., 2016), (iii) TRANSFORMER: an encoder-decoder architecture relying solely on attention mechanisms (Vaswani et al., 2017), and (iv) HRAN: a hierarchical recurrent attention network for multi-turn response generation (Xing et al., 2018). Each model is trained using two protocols: the vanilla MLE training procedure and the proposed group-wise contrastive learning procedure, keeping other configurations the same.
² https://www.douban.com/group
Baselines We compare our group-wise contrastive learning framework against the following dialogue learning approaches: (i) ADVERSARIAL: an adversarial training approach for response generation (Li et al., 2017), (ii) MMI: a training objective which maximizes the mutual information between the dialogue context and its response (Li et al., 2016a; Zhang et al., 2018c), (iii) DEEPRL: a reinforcement learning framework for neural response generation with heuristic reward functions to boost response quality (Li et al., 2016b), (iv) CVAE: a conditional variational auto-encoder learning framework to maximize the data likelihood, augmented with the KL-annealing technique (Bowman et al., 2016) and a BOW loss (Zhao et al., 2017), and (v) DIALOGWAE: a conditional Wasserstein auto-encoder framework, modeling the distribution of dialogues by training a GAN within the latent variable space (Gu et al., 2019).

Implementation and Reproducibility
We implement our models in ParlAI (Miller et al., 2017) and train them on Nvidia P40 GPUs. All the models use pretrained word embeddings produced by fastText (Bojanowski et al., 2017), and the dimensionality of word vectors is 300. For the experimental models, a 2-layer LSTM-based encoder and decoder with hidden size 256 are used in SEQ2SEQ. We use the base Transformer configuration described in Vaswani et al. (2017) [...] response decoder for both HRED and HRAN. The GRU hidden size is set to 256. For models trained under both procedures, we first pretrain them with MLE, and the resulting checkpoint is adopted as the reference model used in our framework. We employ BM25 (Robertson and Zaragoza, 2009) to construct the index used during the contrastive dual sampling procedure. The group size $k$ is set to 3. For the comparison models, we adopt the default configurations used in the original papers. We optimize models with Adam (Kingma and Ba, 2015) with an initial learning rate of 0.001 and a batch size of 128. All systems are trained until the validation loss fails to decrease for 5 consecutive checkpoints. We compute the loss on the validation set every 0.5 epochs and save the parameters of the best model. Evaluation scores on the test set from the saved model are finally reported.
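The stopping rule described above (stop once the validation loss has failed to decrease for 5 checkpoints, keep the best checkpoint) can be sketched as follows. This is a generic illustration, not ParlAI's own logic; the function name and its interface are hypothetical, and the losses are assumed to arrive in checkpoint order (one every 0.5 epochs).

```python
from typing import Iterable, Optional, Tuple


def early_stopping(
    val_losses: Iterable[float], patience: int = 5
) -> Tuple[Optional[int], float]:
    """Scan validation losses in checkpoint order.

    Returns (index of the best checkpoint, its loss). Scanning stops once the
    loss has failed to improve for `patience` consecutive checkpoints.
    """
    best_idx: Optional[int] = None
    best_loss = float("inf")
    bad_checkpoints = 0
    for idx, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best checkpoint: this is where parameters would be saved.
            best_idx, best_loss, bad_checkpoints = idx, loss, 0
        else:
            bad_checkpoints += 1
            if bad_checkpoints >= patience:
                break
    return best_idx, best_loss
```

For example, with losses `[3.0, 2.5, 2.4, 2.6, 2.5, 2.45, 2.41, 2.7, 1.0]` and a patience of 5, training stops at the eighth checkpoint and the third checkpoint (loss 2.4) is the one kept; the later value 1.0 is never reached.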

Evaluation Results
Performance on Experimental Models We instantiate the proposed framework on several state-of-the-art models for dialogue generation.
Comparison with Baseline Approaches We compare our proposed framework with existing learning approaches designed for the dialogue generation task. Table 3 summarizes the evaluation results. We observe that our learning framework outperforms previous approaches on the majority of evaluation metrics. It is worth noting that the proposed framework brings a relatively large improvement in both response diversity and conversation coherence, indicating that our approach helps the dialogue model to generate responses that are not only informative but also relevant to the context, which confirms our hypothesis that group-wise contrastive learning encourages distinctiveness.

Human Evaluation
We further conduct human evaluations to assess the proposed learning framework. We choose PersonaChat as our evaluation corpus since its expressions are closer to the style of daily chat and are easier for the annotators to judge. Three crowd-sourced graduate students are employed to evaluate the quality of generated responses for 100 randomly sampled input contexts. During the evaluation, the annotators are requested to select a preferred response, or vote a tie, considering the following aspects of response quality: fluency, informativeness, coherence and engagingness. Table 4 summarizes the evaluation results, together with the Cohen's kappa scores (Cohen, 1960) that measure inter-rater reliability. We observe that our learning framework yields more preferable replies than the competitors. This indicates that training the dialogue model with the proposed group-wise contrastive learning framework does improve the response quality.
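For readers unfamiliar with the agreement statistic, Cohen's kappa for two annotators compares the observed agreement with the agreement expected by chance from each annotator's label frequencies. The sketch below is a plain-Python illustration with a hypothetical function name. How the three annotators' judgments are paired is not stated in the text; averaging over annotator pairs is one common option and is only an assumption here.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both raters used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance.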

Effect of the Group-wise Learning Strategy
To manage the multi-mapping relations in human-human conversation and stabilize the model training with noisy data, the dialogue model is induced to contrast a group of positive samples from the negative ones, pulling the matched sample pairs together and pushing the mismatched pairs apart. To verify the effectiveness of group-wise learning, we ablate it from the framework by using only one pair of positive and negative samples. As presented in Table 5 (a), disabling group-wise learning hurts performance significantly on all evaluation metrics. Note that ablating either the group-wise positive sampling (Table 5 (b)) or the group-wise negative sampling (Table 5 (c)) also leads to a performance drop on the evaluation metrics. This demonstrates that the group-wise learning strategy plays a key role in achieving strong performance.
Effect of the Dual Sampling In our framework, the contrastive samples can be organized with respect to either the dialogue context or the response, allowing the dialogue model to explore both the many-to-one and one-to-many relations in human conversation. We investigate different sampling strategies in Table 5 (d) and (e). We notice that when the response-side and context-side samplings are incorporated together into the learning framework, the model achieves its best performance, verifying the effectiveness of the contrastive dual sampling.

Impact of Matching Scores
To discriminatively exploit the sampled context-response pairs with varied matching degrees, we utilize the matching score attached to each sample to adapt its effect on the model training. We ablate this learning strategy by simply discarding the matching scores, as in Eq. (3). As shown in Table 5 (f), training without considering the matching degrees of samples leads to a consistent performance drop, which suggests that the system benefits from perceiving fine-grained differences within the group during contrastive learning.

Impact of Group Size
We explore the impact of different group sizes $k$ in our group-wise contrastive learning framework in Figure 3. We observe that increasing the group size $k$ leads to continuous improvement on the Distinct metric, while the reference-based metrics achieve their best results at a moderate group size. We conjecture that a larger group size allows the dialogue model to learn from more diverse expressions, but it also risks introducing more utterances that are inconsistent with the references.
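The Distinct metric is not defined in this section. A common definition, assumed here, is the number of unique n-grams divided by the total number of n-grams across all generated responses. The sketch below follows that assumed definition and uses a hypothetical function name; the exact tokenization and aggregation used in the experiments may differ.

```python
from typing import Iterable, List


def distinct_n(responses: Iterable[List[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over tokenized responses."""
    total = 0
    unique = set()
    for tokens in responses:
        # Slide a window of length n over each response.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0
```

Under this definition, a system that keeps producing the same generic reply scores close to zero, while more varied output pushes the score towards one.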

Related Work
Learning Methods for Dialogue Generation Typically, state-of-the-art neural dialogue generation models adopt Maximum Likelihood Estimation (MLE) as their learning approach, maximizing the log-likelihood of the training data with respect to the model parameters. Though MLE is effective, prior art (Li et al., 2016a; Zhao et al., 2017) reports well-known issues with dialogue models trained this way, including the notorious problem of generic, dull responses.
Alternative dialogue learning approaches have been proposed to tackle these issues. Li et al. (2016a) and Zhang et al. (2018c) introduce Maximum Mutual Information as the objective function to promote response diversity. Techniques of reinforcement learning (RL) and adversarial learning have been introduced into dialogue generation by Li et al. (2016b, 2017) to better approximate the optimization goal of dialogue models. The conditional variational framework (Zhao et al., 2017; Shen et al., 2017; Park et al., 2018) has also shown promise in dialogue generation. Gu et al. (2019) further introduce a conditional Wasserstein autoencoder that employs a GAN (Goodfellow et al., 2014b) to model the multimodal latent structures. Cai et al. (2020) design a multi-curriculum learning framework to facilitate dialogue model training. In contrast to existing learning methods for dialogue generation, the framework proposed in this work encourages the model to learn from the difference between well-chosen contrastive pairs, which explicitly models the multi-mapping relations in conversation and at the same time promotes the distinctiveness of the generated responses.

Contrastive Learning
The concept of learning by contrasting positive pairs against negative pairs (Hadsell et al., 2006; Gutmann and Hyvärinen, 2012) has been successfully adopted in many tasks. For example, contrastive learning in the language modeling task (Mnih and Teh, 2012; Vaswani et al., 2013; Baltescu and Blunsom, 2015) aims to approximate the negative log-likelihood by training the model to correctly classify between generated noise samples and words observed in the training data. Contrastive visual representation learning (van den Oord et al., 2018; Chen et al., 2020) trains a generative model to score real data points higher than negative samples. Dai and Lin (2017) propose to use contrastive learning for image captioning. Clark et al. (2020) use contrastive learning to train a discriminative model for language representation learning. Compared with existing work, the samples used in this paper, instead of being sampled randomly, are carefully chosen to exhibit particular properties of human dialogues. Another difference is that we manage the multi-mapping relations prevalent in human conversation using many positives and many negatives, which captures both the intra-group and inter-group variability.

Conclusion
In this work, we propose a group-wise contrastive dialogue learning approach that explicitly perceives the difference between well-chosen positive and negative utterances, and simultaneously manages the multi-mapping relations in human conversations. Given a training instance, the proposed learning framework first organizes a group of positive samples and a group of negative samples according to context-response matching degrees, and then trains the target dialogue model to give higher conditional probabilities to the positive pairs and lower probabilities to the negatives. Extensive experimental results show that the proposed learning framework brings a solid performance boost over various strong baseline approaches.

Figure 1: An illustration of group-wise contrastive learning. For a given training instance, the proposed framework explicitly considers the multi-mapping relations in human conversations, by encouraging the dialogue generation model to pull the matched sample pairs together and push the mismatched pairs apart in the latent space.

Figure 2: A demonstration of the proposed group-wise contrastive dialogue learning pipeline. For each training pair, it first samples a group of highly-matched examples and another group of most mismatched utterances regarding both the context and response to build the contrastive samples, using an off-the-shelf conversation matching model (§2.2). The target dialogue model is then trained with group-wise contrastive learning (§2.3).

Figure 3: Evaluation results (%) with different group size k on the validation set of PersonaChat using the proposed framework instantiated on SEQ2SEQ. BLEU-1 and Dist-1 are denoted as "BLEU" and "Distinct", respectively.

Table 2: Automatic evaluation results (%) on the test set of three datasets: (a) PersonaChat, (b) Douban and (c) OpenSubtitles. " " denotes that the model is trained using our proposed framework. The metrics Average, Extrema, Greedy and Coherence are abbreviated as Avg., Ext., Gre. and Coh., respectively. The best results in each group are highlighted in bold.

Table 3: Performance (%) of our approach instantiated on naive SEQ2SEQ and baseline approaches on PersonaChat.

Table 4: The results of human evaluation on the test set of PersonaChat.