A Simple and Efficient Multi-Task Learning Approach for Conditioned Dialogue Generation

Conditioned dialogue generation suffers from the scarcity of labeled responses. In this work, we exploit labeled non-dialogue text data related to the condition, which are much easier to collect. We propose a multi-task learning approach to leverage both labeled dialogue and text data. The three tasks jointly optimize the same pre-trained Transformer: the conditioned dialogue generation task on the labeled dialogue data, and the conditioned language encoding and conditioned language generation tasks on the labeled text data. Experimental results show that our approach outperforms state-of-the-art models by leveraging the labeled texts, and it also obtains larger performance improvement than previous methods for leveraging text data.


Introduction
General conversational models pre-trained on large text data (Radford et al., 2018; Devlin et al., 2018) or human-to-human conversation data (Bao et al., 2019) have shown excellent performance in generating fluent and diverse responses. Beyond general conversation, we increasingly face the problem of conditioned conversation, which tunes the dialogue toward a specific style or domain. For example, we might specify a condition as the vocabulary frequently used by a person and ask the system to mimic that person's speaking style, or as a topic-related vocabulary and ask the chatbot to discuss the given topic.
Conditioned response generation has been extensively explored using RNN-based sequence-to-sequence models under different conditions, e.g. persona (Li et al., 2016b), topic (Xing et al., 2017), emotion, situations (Sato et al., 2017), and so on. However, only a few existing studies considered pre-training based models (Zheng et al., 2019; Lin et al., 2019). The basic idea in these previous works is to use a parametric vector to represent a condition and then use it in the decoder for conditioned generation. However, the key issue in conditioned dialogue generation is the availability of labeled responses (Zhou and Wang, 2018), and pre-training on unlabeled text or dialogue data does not help much. Therefore, the motivation of this work is to leverage labeled text (non-dialogue) data, which are much easier to collect than labeled dialogue data, as a supplement. These data can be, for example, texts written by the same person (for a persona condition), within the same topic domain (for a topic condition), etc. The idea is inspired by response style transfer (Luan et al., 2017; Niu and Bansal, 2018), which uses a text corpus to learn a style and then transfers the style to dialogue. Based on their success, we assume that labeled text data can contribute to better representations of conditions and better utilization of conditions in natural language generation.
In this work, we propose a multi-task learning approach to leverage both labeled dialogue and text data. We use three tasks to jointly optimize the same pre-trained Transformer: the conditioned dialogue generation task on the labeled dialogue data, and the conditioned language encoding and conditioned language generation tasks on the labeled text data. Our assumption is that the two other tasks can help our final goal of conditioned dialogue generation: conditioned language generation is the basis of conditioned response generation, and conditioned language encoding with bi-directional attention can efficiently encode condition-related expressions and lead to better condition representations. We apply different input representations, self-attention masks, and random mask strategies to differentiate the three tasks. Despite these differences, the training objectives of the tasks are essentially the same, i.e. masked language modeling, and thus we can mix the 2 types of data / 3 tasks in one training batch, which prevents the catastrophic forgetting problem (Phang et al., 2018).
To efficiently leverage labeled data, first, our approach incorporates all types of data within the same framework, avoiding the ad hoc model components that are usually needed by response style transfer methods to leverage extra texts. Second, we propose TF-IDF based masking, which selects more condition-related tokens to mask, so that the model exploits the labeled text data for condition-related expressions rather than the general language features already captured by the pre-trained models. Third, for conditioned generation, we propose a non-parametric attention-based gating mechanism, which chooses between generating a general word (necessary for general function words) or a condition-related word at each position. We expect it to be more efficient than a parametric gating. Experimental results show that these approaches all bring improvements.
Our approach is generalizable. Despite the many different possible labels, a condition essentially specifies preferences on words, phrases, and sentence structures in the generated responses. Thus, the general approach can be instantiated to a specific case as long as the corresponding labeled dialogue data are available. We will run experiments with two instantiated models for persona- and topic-related dialogue. Additionally, we will empirically show that our approach is robust and can even work with condition labels predicted by a classification model, e.g. LDA for topic labels.
The contributions of this work are as follows: • We propose a simple and efficient multi-task learning approach based on a pre-trained Transformer that leverages different labeled data, i.e., dialogue and text, for conditioned response generation.
• The experiments under two different conditions, persona- and topic-based dialogue, show that our approach outperforms the state-of-the-art models by leveraging labeled texts, even when the labels are predicted by a model.
• Our approach obtains larger performance improvement than existing methods for leveraging text data, which are based on an extra auto-encoder or sequential fine-tuning.
Related Work

Conditioned Dialogue Generation
We categorize existing related work into three categories.
(2) Loosely-conditioned response generation, where a label designating the type of the response is required. For example, persona labels (Li et al., 2016b) designate the speaking styles of responses, and topic labels (Xing et al., 2017; Dziri et al., 2019) or emotion labels (Li et al., 2017; Rashkin et al., 2019) specify topic-related or emotion-related vocabularies. These studies usually encode a label with a parametric vector, which is then used in the decoder to guide the generation.
(3) Strictly-conditioned response generation, where extra knowledge is required to determine the content of the response, such as a persona profile, a situation description (Rashkin et al., 2018), or a Wikipedia paragraph, which is used to ground the response. The ability of strictly-conditioned generation is important, but such dialogues only account for a small fraction of open-domain conversation (Zheng et al., 2019). In many other cases, we are in the situation of loosely-conditioned dialogue. Furthermore, the state-of-the-art strictly-conditioned method (Wolf et al., 2019), which simply concatenates the extra knowledge with the dialogue history as the model input, can easily be added to other models as well (Madotto et al., 2020).
In this work, we focus on loosely-conditioned response generation. We will show that our approach is robust and can work with different types of labels, including those predicted by a classification model, e.g. LDA for topic labels. Therefore, our method is compatible with generation conditioned on latent variables by borrowing the power of a classification model. We do not touch on strictly-conditioned generation; however, this ability can easily be added as mentioned above.

Response Style Transfer
Style transfer in dialogue aims to learn the style of a text corpus and then incorporate that style into dialogue generation. The transfer is usually between two styles, e.g. rude and polite, or adds a style to general dialogues. To leverage the text corpus, Luan et al. (2017) jointly train a seq2seq response generator and an extra auto-encoder, and Niu and Bansal (2018) first train a style classifier to guide the response generator through reinforcement learning.
These works show that text data contain rich information about how to generate a specific type of text, which inspires us to exploit labeled text data in conditioned dialogue generation to alleviate the data scarcity issue. Style transfer is usually between two given styles; in contrast, conditioned dialogue generation may work with hundreds of condition labels simultaneously. As we will show in our experiments, style transfer methods that rely on additional models, e.g. an auto-encoder, to leverage a text corpus are unscalable and inefficient for conditioned dialogue. In contrast, our approach leverages labeled text data without ad hoc models and integrates it tightly with the labeled dialogue data, so it can more directly impact conditioned dialogue generation.

Method
We assume that we have two types of training data: a labeled dialogue corpus containing (dialogue history, condition, response) samples, and a labeled text corpus consisting of (condition, text) samples. Note that the "condition" is any categorical label that indicates a type of responses or texts. Our goal is to generate a response y that exhibits the desired characteristics of that type given a dialogue history x and a condition c, i.e. to model P(y | x, c). The Transformer in our work uses bi-directional attention on the source side to encode the dialogue history, and left-to-right attention on the target side to generate the response. Such a Transformer can be initialized from BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019), UniLM (Dong et al., 2019), or models pre-trained on large-scale unlabeled dialogue data, e.g. PLATO (Bao et al., 2019) and Blender (Roller et al., 2020). In this work, we focus on efficiently leveraging labeled data, i.e. dialogue and text. Figure 1 (Left) shows the overview of our approach.
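Concretely, with left-to-right decoding on the target side, the conditioned generation objective follows the standard auto-regressive factorization (written out here for completeness):

```latex
P(y \mid x, c) \;=\; \prod_{t=1}^{|y|} P\!\left(y_t \mid y_{<t},\, x,\, c\right)
```

where y_{<t} denotes the tokens generated before position t.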

Masked Multi-Head Attention
In this subsection, we introduce the basic components of the Transformer; masked multi-head attention is also applied in our condition-aware transformer block. The input representation H^0 ∈ R^{n×d_h}, where n is the input length and d_h = 768 is the hidden dimension, is the sum of the token embedding, position embedding, and type embedding at each position. We apply type embeddings to separate the source side from the target side, as shown in Figure 1 (Left), in order to warrant different treatments in the model. Then, H^0 is encoded into the hidden representations of the i-th layer, H^i = [h^i_1, ..., h^i_n], using multi-layer transformer blocks. The core component of a transformer block is masked multi-head attention, whose outputs are contextualized representations. A self-attention mask M determines whether a position can attend to other positions: M_ij = 0 allows the i-th position to attend to the j-th position, and M_ij = −∞ prevents it. Our approach jointly optimizes three tasks that apply different self-attention masks, as shown in Figure 1 (Left). For the conditioned dialogue generation task, the self-attention mask allows bi-directional attention on the source side to fully encode the dialogue history, and left-to-right attention on the target side to generate conditioned responses. For the labeled text data, we randomly choose between the conditioned language encoding and conditioned language generation tasks, which use bi-directional and left-to-right attention respectively. The language encoding objective, i.e. Masked Language Modeling (MLM), is used in BERT and has shown stronger ability than the auto-regressive objective used in GPT (Devlin et al., 2018). Therefore, we expect conditioned language encoding to be more helpful for learning condition-related expressions (especially with the TF-IDF masking strategy which we will introduce) than the two generation tasks that employ the auto-regressive objective.

Figure 1: (a) Overview of our multi-task learning approach.
Labeled dialogue and text data are mixed, and they are processed using the same pre-trained Transformer with data/task-adaptive input representations, self-attention masks, and random mask strategies. (b) Detailed structures of a condition-aware transformer block, i.e. a C-Transformer Block.
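The three task-specific self-attention masks can be sketched as follows, using the 0 / −∞ convention above. This is a simplified illustration (special-token layout and batching details are omitted and are assumptions):

```python
import numpy as np

NEG_INF = float("-inf")

def dialogue_generation_mask(src_len: int, tgt_len: int) -> np.ndarray:
    """Conditioned dialogue generation: bi-directional attention on the
    source (dialogue history), left-to-right attention on the target.
    M[i, j] = 0 lets position i attend to position j; -inf blocks it."""
    n = src_len + tgt_len
    m = np.full((n, n), NEG_INF)
    m[:, :src_len] = 0.0                     # every position sees the full source
    for i in range(src_len, n):
        m[i, src_len:i + 1] = 0.0            # target positions see only their prefix
    return m

def encoding_mask(n: int) -> np.ndarray:
    """Conditioned language encoding: fully bi-directional attention."""
    return np.zeros((n, n))

def generation_mask(n: int) -> np.ndarray:
    """Conditioned language generation: left-to-right attention."""
    m = np.full((n, n), NEG_INF)
    m[np.tril_indices(n)] = 0.0              # lower triangle (incl. diagonal) open
    return m
```

The mask is added to the attention logits before the softmax, so −∞ entries receive zero attention weight.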

Condition-aware Transformer Block
In this subsection, we introduce a position-wise condition bias that determines how much condition information should be used to bias the word generation probability at each position. The core component for calculating the bias is a non-parametric attention-based gating mechanism, as shown in Figure 1 (Right). Other gating mechanisms usually employ parametric linear layers to calculate weights. We assume that a non-parametric attention-based method can be more training-efficient, which is important since labeled data are usually limited. We will empirically confirm its effectiveness compared to other gating methods.
Specifically, given a training sample (x, c, y) or (c, text), the condition label c is encoded using two sets of parameters: one parametric vector serves as the key k_c ∈ R^{d_h} and another as the value v_c ∈ R^{d_h}. Additionally, there is a general condition label g with a parametric vector k_g as its key and a zero vector v_g as its value. The former corresponds to conditioned generation, while the latter corresponds to general dialogue that generates words only based on the dialogue history. At each position, the model computes an attention weight for each choice; more attention to c means that the position is more tuned to the condition. More specifically, for each condition-aware transformer block, as shown in Figure 1 (Right), given the per-position queries C^i = [c^i_1, ..., c^i_n], the condition biases are obtained by attending over the keys {k_c, k_g} and taking the weighted sum of the values {v_c, v_g}; the calculation is non-parametric. We use a mask matrix M_b ∈ R^{n×2} to prevent adding condition bias to positions on the source side, because the condition only influences the target side (the labeled response or text).
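A minimal sketch of the attention gating described above. One detail is an assumption on our part: to zero out the bias on source-side positions, M_b places −∞ on the condition key's score, forcing all attention onto the general key g, whose value is a zero vector:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerically stable softmax
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def condition_bias(queries, k_c, v_c, k_g, m_b):
    """Non-parametric attention gating (sketch of Figure 1, right).

    queries: (n, d_h) per-position query vectors C^i
    k_c, v_c: key/value vectors for the condition label c
    k_g:      key vector for the general label g (value v_g is zero)
    m_b:      (n, 2) additive mask; -inf in the condition column on
              source-side rows blocks the condition bias there
    """
    v_g = np.zeros_like(v_c)
    keys = np.stack([k_c, k_g])               # (2, d_h)
    values = np.stack([v_c, v_g])             # (2, d_h)
    scores = queries @ keys.T + m_b           # (n, 2) attention logits
    weights = softmax(scores, axis=-1)        # attention over {c, g}
    return weights @ values                   # (n, d_h) position-wise bias
```

The only learned parameters here are k_c, v_c, and k_g themselves; the gate itself introduces no extra weight matrices, which is the point of the non-parametric design.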

Objectives
We jointly optimize three tasks: conditioned dialogue generation on the labeled dialogue data, and conditioned language encoding and conditioned language generation on the labeled text data. As discussed in Section 3.1, conditioned language encoding is expected to be very helpful for learning condition-related expressions. A specific self-attention mask is required for each task, while the objectives of the three tasks are essentially the same: some tokens on the target side (labeled response or text) are randomly masked, and the final hidden vectors H^L corresponding to the masked tokens are fed into an output softmax over the vocabulary to predict the expected tokens. Therefore, we can mix the 2 types of data (3 different tasks) in one training batch, and the loss is averaged over the batch. This prevents the catastrophic forgetting problem (Phang et al., 2018), which is usually observed with a sequential fine-tuning process, i.e. first fine-tuning on labeled texts and then on conditioned dialogue data, where each stage erases the effect of the previous steps of training.
When using labeled dialogue data, we want the model to learn to generate conditioned, but more importantly coherent, responses. Thus, we uniformly sample the tokens on the target side to mask. In contrast, when exploiting labeled text data, we only want the model to learn condition-related expressions. Therefore, we introduce TF-IDF based masking for the labeled text data to speed up the learning process: we sample tokens to mask according to their TF-IDF values computed on the entire corpus. We will empirically show its effectiveness.
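The idea can be sketched as follows. The exact TF-IDF variant and sampling scheme are not spelled out above, so this is one plausible instantiation (term frequency counted over the whole corpus, a standard log inverse-document-frequency, and weighted sampling without replacement):

```python
import math
import random
from collections import Counter

def tfidf_weights(corpus):
    """Corpus-level TF-IDF per token (one simple variant)."""
    tf, df = Counter(), Counter()
    for text in corpus:                       # each text is a token list
        tf.update(text)
        df.update(set(text))
    n_docs = len(corpus)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

def sample_mask_positions(tokens, weights, mask_rate=0.25, rng=random):
    """Pick ~mask_rate of the positions, biased toward high-TF-IDF tokens,
    so condition-related words are masked (and thus predicted) more often."""
    k = max(1, round(mask_rate * len(tokens)))
    w = [weights.get(t, 0.0) + 1e-8 for t in tokens]  # epsilon keeps all > 0
    positions = list(range(len(tokens)))
    chosen = set()
    while len(chosen) < k:
        chosen.add(rng.choices(positions, weights=w, k=1)[0])
    return sorted(chosen)
```

Ubiquitous words (appearing in every text) get weight ~0, so the model spends its masked-prediction budget on condition-specific vocabulary rather than on general function words.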

Datasets
We use two labeled dialogue datasets, and we also created two smaller training sets (500K labeled texts and 250K labeled dialogues); they are summarized in Table 1. We anticipate that when labeled dialogue data are limited, the benefit of leveraging labeled text data will be larger.
Persona Reddit We filtered the Reddit data from 2015 to 2019 provided by a third party (https://files.pushshift.io/reddit/). Reddit data is a natural source of dialogue with multiple users: a post may have multiple comments by different users. Following Li et al. (2016b), we consider each user as a distinct persona. We extract (post, user, comment) tuples, where "user" is the label of the user who wrote the "comment". We further filtered the data by sentence length and user: sentences with more than 30 words or fewer than 4 words are removed, and we only keep comments from the 2000 most active users so that we can collect enough data for each user. As a result, each user has 1291 samples (comments) on average. To build the labeled text corpus, we collect extra posts or comments on Reddit from the same users that have no overlap with the dialogue data; these extra texts are intended to reflect the general writing style of each user.
Topic-related Dialogue Dziri et al. (2019) provide a high-quality 3-turn conversational dataset for topic-aware response generation (https://github.com/nouhadziri/THRED). Along with each (history, target) pair, there is a topic label and dozens of topic words predicted by an LDA model. The dataset contains 9.2M samples, from which we sample 3M (history, topic, target) tuples as the labeled dialogue corpus. To construct the labeled text data, we sample another 3M tuples and keep only their (topic, target) parts. Note that the topic labels are generated by LDA, and thus it is difficult to obtain labeled text data from other sources.
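The Persona Reddit filtering above can be sketched as a two-pass procedure (whether the length filter also applies to posts is not stated, so this version filters comments only):

```python
from collections import Counter

def filter_persona_tuples(tuples, min_len=4, max_len=30, top_users=2000):
    """Keep (post, user, comment) tuples whose comment has 4-30 words,
    then restrict to the `top_users` most active users."""
    def length_ok(text):
        n = len(text.split())
        return min_len <= n <= max_len

    kept = [(p, u, c) for p, u, c in tuples if length_ok(c)]
    activity = Counter(u for _, u, _ in kept)
    active = {u for u, _ in activity.most_common(top_users)}
    return [(p, u, c) for p, u, c in kept if u in active]
```

The activity count is taken after the length filter, so "most active" means most surviving comments; counting before filtering would be an equally plausible reading.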

Baselines
We choose two strong baselines specifically designed for personalized response generation, two others for topic-aware generation, and some state-of-the-art pre-trained Transformers.
Speaker Model (Li et al., 2016b) a seq2seq model using four LSTM layers. Given a user label, the decoder transforms it into a user embedding and uses it to generate a personalized response.
MT-Speaker an approach that jointly trains a Speaker Model and a conditioned auto-encoder with shared decoder parameters, adapted from a style transfer approach (Luan et al., 2017). This approach also leverages the labeled text data.
TA-Seq2Seq (Xing et al., 2017) and THRED (Dziri et al., 2019) these models utilize topic words instead of the topic labels predicted by the LDA model. TA-Seq2Seq leverages the topic information through a joint attention mechanism and a biased generation probability. THRED is built on HRED and incorporates topic words via a hierarchical joint attention mechanism.
C-Trans-ED and C-Trans-Dec conditioned variants of pre-trained Transformer models with encoder-decoder and decoder-only architectures, respectively. We add a condition embedding to the input representation to enable conditioned generation.
BERT fine-tuning the pre-trained model (Devlin et al., 2018) on the dialogue datasets. The encoder and decoder share parameters: when encoding, the model uses bi-directional attention; when decoding, it uses left-to-right attention.

Implementation Details
We implement the Speaker Model and MT-Speaker models based on OpenNMT. Other models are taken directly from the available open-source code.
Hyper-parameters are set following the original papers. Since our baselines utilize GPT or BERT, we use BERT (base, uncased) to initialize our model for a fair comparison. It is, however, possible to build our model upon more powerful pre-trained models such as RoBERTa (Liu et al., 2019). We perform hyper-parameter search based on perplexity on the validation set for: the number of condition-aware transformer blocks in {2, 6, 12}, the mix-up rate of labeled dialogues and texts in {3:1, 1:1}, and whether to use the conditioned language encoding task. We report experimental results with 2 blocks, a 3:1 rate, and the conditioned language encoding task enabled. The warm-up proportion is set to 0.1, and 25% of the tokens on the target side are randomly masked. During decoding, the beam size is 10, and we prevent duplicated bigrams. We fine-tune all parameters end-to-end for four epochs on two P100 GPUs. With 6M training samples in total, each epoch takes twelve hours. The fine-tuned model only has (2C + 1) × d_h additional parameters, where C is the number of distinct condition labels. Other details are given in Appendix A.
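The (2C + 1) × d_h parameter count follows directly from the gating design: each condition contributes a key and a value vector, and the general label contributes only a key (its value is a fixed zero vector). A quick sanity check with the Persona Reddit setting (C = 2000 users, d_h = 768):

```python
def extra_params(num_conditions: int, d_h: int = 768) -> int:
    """Additional parameters of the fine-tuned model: a key k_c and a
    value v_c per condition (2 * d_h each), plus one general key k_g."""
    return (2 * num_conditions + 1) * d_h

# Persona Reddit: 2000 user labels -> 3,072,768 extra parameters (~3.1M),
# tiny compared to the ~110M parameters of BERT-base.
print(extra_params(2000))
```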

Evaluation
Automatic Metrics We choose several widely used metrics from the literature: BLEU (Papineni et al., 2002) with n=1,2,3; ROUGE-L, based on longest-common-subsequence statistics; CIDEr (Vedantam et al., 2015), which applies TF-IDF weighting to each n-gram; and Distinct (Li et al., 2016a), the proportion of unique n-grams (n=1,2) in the entire set of generated responses, which evaluates response diversity. A two-sided t-test is used for statistical significance testing.
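Of these metrics, Distinct is simple enough to state exactly. Following the definition above (unique n-grams over total n-grams, pooled across all generated responses), a minimal whitespace-tokenized implementation:

```python
def distinct_n(responses, n):
    """Distinct-n (Li et al., 2016a): the ratio of unique n-grams to
    total n-grams over the whole set of generated responses."""
    ngrams = []
    for r in responses:
        toks = r.split()
        ngrams += [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

Higher values indicate more diverse output; a model that repeats the same bland response across the test set scores near zero.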
Response Appropriateness Furthermore, we conduct manual evaluation of the best models according to the automatic metrics; we only manually evaluate model performance on the large-scale datasets. We ask human evaluators to rate a response in {0, 1, 2}. A score of 0 means the response has flaws in fluency or logic, or is incoherent; special cases include copying the dialogue history verbatim as the output and bland responses such as "I don't know what you mean". A score of 1 represents a coherent but generic response, and 2 a coherent and informative response. We also perform a pair-wise evaluation that compares two models and indicates which one is better. The evaluation is based on 200 random samples, and each generated response is rated by three annotators. The inter-rater agreement in Cohen's kappa (Cohen, 1960) is 0.441 on average, which indicates moderate agreement.
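Cohen's kappa is defined for a pair of raters as (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance from each rater's label distribution; with three annotators, the "average" above presumably means averaging over the three rater pairs (an assumption). A minimal pairwise implementation:

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two raters over the same items:
    kappa = (p_o - p_e) / (1 - p_e)."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    ca, cb = Counter(ratings_a), Counter(ratings_b)
    # chance agreement from the raters' marginal label frequencies
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

By the common Landis-and-Koch reading, values in (0.4, 0.6] are "moderate" agreement, which matches the 0.441 reported above.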

Condition Consistency
We observe that the automatic metrics fail to evaluate condition consistency, since BERT, which does not consider conditions, outperforms C-Trans-ED and C-Trans-Dec. Thus, we perform manual evaluation of condition consistency. A generated response is rated in {0, 1, 2}: the scores mean, respectively, that the response is inconsistent with the condition, somewhat related, or consistent. However, if the response has flaws in fluency or logic, it gets a score of 0. For Topic Dialogue, it is easy to judge whether a generated response is on topic. For persona consistency, however, it is difficult for a human evaluator to know the speaking style of each user. Thus, before the evaluation, we automatically determine the words frequently used by each user in responses and show them to the annotators to aid their judgments.

Analysis
Tables 2 and 3 give the automatic evaluation results, and Table 4 gives the human evaluation results. Appendix B shows some generated responses. The results can be summarized as follows.

BERT vs. Trans-ED & Trans-Dec C-Trans-Dec has a clear advantage over C-Trans-ED in almost all automatic metrics, which can also be observed in their generated responses. Fine-tuning BERT without considering conditions outperforms C-Trans-Dec on most similarity metrics such as BLEU. We explain this by the fact that bi-directional attention enables a model to better encode the dialogue history, and thus to generate responses more similar to the ground truth. The ablation model w/o ctext fine-tunes C-BERT (BERT with our condition-aware transformer blocks) on the labeled dialogue data only. The performance of w/o ctext is similar to that of C-Trans-Dec, with a slight advantage in condition consistency and a small disadvantage in response appropriateness. These results show that our approach is built upon a strong base model. As mentioned, other pre-trained models can also be used.

Table 2: Evaluation results on large-scale (upper half) and small-scale (lower half) Persona Reddit. Two-Step FT means using our model architecture but applying sequential fine-tuning. w/o ctext is without leveraging conditioned text data. w/o tf-idf means without applying TF-IDF based masking. * (p < 0.05) or ** (p < 0.01) indicate statistically significant differences from our model by two-sided t-test.

Table 4: Human evaluation of generated responses on appropriateness and condition consistency. Pair-wise comparisons show the winning percentages of (baseline, ours).
Figure 2: Performance comparison between sequential fine-tuning and our approach given 1M labeled text data and different size of labeled dialogue data.
Nevertheless, similarly, with large-scale data C-BERT outperforms BERT in all metrics, but when only small-scale labeled dialogue data are available, C-BERT performs worse than BERT in terms of BLEU. The result again shows the importance of exploiting labeled texts, and our approach is the best on small-scale Topic Dialogue.
Leveraging Labeled Texts In general, our approach significantly outperforms all baselines and w/o ctext, which do not exploit labeled text data, with both large-scale and small-scale data. With small-scale data, our approach outperforms BERT while w/o ctext itself cannot, which shows that conditioned dialogue generation can be helped by extra labeled text data. On Topic Dialogue, even with noisy labels, our model leveraging the labeled texts still produces the best performance, which confirms the robustness of our multi-task learning approach to different types of labels. The human evaluation of appropriateness and condition consistency further confirms the effectiveness of our approach. Not all methods utilizing extra labeled text obtain such performance improvements: MT-Speaker does not gain much improvement over the Speaker Model (Sp-Model). This result shows that using additional model components to leverage labeled texts is inefficient for conditioned dialogue generation. Furthermore, Two-Step FT, which first fine-tunes on labeled texts and then on labeled dialogue data, does not always produce good performance. It achieves performance comparable to our approach on the large-scale datasets, but on the small-scale datasets it can even perform worse than w/o ctext (Table 2). This result shows that with a small-scale dataset it is better to avoid sequential fine-tuning, because first fine-tuning on labeled texts erases the effect of the previous step of pre-training. Furthermore, we investigate how the ratio of the size of labeled text data to the size of dialogue data influences model performance. As shown in Figure 2, given 1M labeled text data, when the ratio is less than 6.7, our approach performs better than Two-Step FT. However, when the labeled text corpus is much larger than the dialogue corpus, sequential fine-tuning is better. We assume that with a large labeled text corpus, the pre-trained language model can be fully tuned to conditioned language generation.
Besides, the final task in sequential fine-tuning is purely conditioned dialogue generation, which is expected to achieve better performance on dialogue than a multi-task learning approach. However, in real applications, one cannot always expect a large labeled text corpus to be available as a supplement to the dialogue data.

TF-IDF Masking and Attention Gating
We assumed that general language features have already been captured by the pre-trained models. Thus, to better utilize the labeled text data, we mask more condition-related words using TF-IDF based masking. Our ablation study confirms that TF-IDF masking brings improvement in almost all automatic metrics, although the improvement is not always statistically significant.
Our attention gating is a non-parametric gating mechanism that fuses the condition into the decoder. We expected it to be efficient, which is particularly important when labeled data are limited. Here, we compare it with two common parametric gating mechanisms: 1) a single gate on C^i producing one weight; 2) gates on both C^i and v_c producing two weights. In both cases, the weighted C^i and v_c are then combined to obtain the condition bias, as in our attention gating. Experimental results in Table 5 confirm that our method is more efficient: when only small-scale labeled data are available, the model with attention gating generates responses that are significantly more similar to the ground truth.

Conclusion
In this paper, we examined the data scarcity issue of conditioned dialogue generation. Pre-training on unlabeled text or dialogue data is not helpful for conditioned generation. Thus, we exploited labeled text data, which are easier to collect than labeled dialogues. We expected these data to contribute to better representations of conditions and better use of conditions in natural language generation, complementing what is lacking in the pre-trained models.
To leverage these two types of data, we proposed a simple and efficient multi-task learning approach. Three tasks are considered: the conditioned dialogue generation task on the labeled dialogue data, and the conditioned language encoding and conditioned language generation tasks on the labeled text data. We conducted experiments under persona and topic conditions. Experimental results show that our approach outperforms the state-of-the-art models by leveraging labeled texts, and it also obtains larger performance improvement than previous methods for leveraging text data.

A More Implementation Details
For the small-scale datasets, we trained the models until performance stopped increasing on the validation set to prevent over-fitting. For the large-scale datasets, we fine-tuned the models for four epochs. In Table 6, the average runtime is measured on a 1080Ti GPU, with the batch size set to fill the GPU memory. TA-Seq2Seq and THRED are implemented in TensorFlow; the other models are implemented in PyTorch. Note that the runtime is influenced by the code implementation in addition to the model structure. When experimenting with the small-scale Persona Reddit dataset, we decrease the number of parameters of the Sp-Model and MT-Speaker models to 48M and 52M respectively in order to avoid over-fitting. C-Trans-ED loads the pre-training results of GPT; in the original paper, the authors pre-trained it themselves on a Chinese corpus, which cannot be used in our experiments.

Additionally, we explored the idea proposed in Zeng and Nie (2020) to decrease the fine-tuning/generation discrepancy introduced by the MLM training objective. Nevertheless, the conditioned language encoding task cannot employ this method because it applies bi-directional self-attention. Experimental results on small-scale Persona Reddit show that eliminating this discrepancy helps decrease perplexity from 55 to 52.