On Task-Level Dialogue Composition of Generative Transformer Model

Task-oriented dialogue systems help users accomplish tasks such as booking a movie ticket or ordering food via conversation. Generative models parameterized by deep neural networks are widely used for next-turn response generation in such systems. It is natural for users to want to accomplish multiple tasks within the same conversation, but the ability of generative models to compose multiple tasks is not well studied. In this work, we begin by studying the effect of training with human-human multiple task dialogues on the ability of Transformer generative models to compose multiple tasks. We then propose and explore two solutions: (1) creating synthetic multiple task dialogue data for training from human-human single task dialogues and (2) forcing the encoder representation to be invariant to single and multiple task dialogues using an auxiliary loss. Our experimental results highlight the difficulty that even sophisticated variants of the Transformer model have in learning to compose multiple tasks from single task dialogues.


Introduction
Recent years have seen a tremendous surge in the application of deep learning methods to dialogue in general (Vinyals and Le, 2015; Rojas-Barahona et al., 2017; Budzianowski et al., 2018; Lewis et al., 2017) and task-oriented dialogue (Wen et al., 2015; Einolghozati et al., 2019; Neelakantan et al., 2019) specifically. Task-oriented dialogue systems help users accomplish tasks such as booking a movie ticket or ordering food via conversation. Generative models are a popular choice for next-turn response generation in such systems (Rojas-Barahona et al., 2017; Wen et al., 2017; Eric and Manning, 2017). These models are typically learned using large amounts of dialogue data for every task (Budzianowski et al., 2018; Byrne et al., 2019). It is natural for users of a task-oriented dialogue system to want to accomplish multiple tasks within the same conversation, e.g., booking a movie ticket and ordering a taxi to the movie theater in a single conversation. The brute-force solution would require collecting dialogue data for every task combination, which might be practically infeasible given the combinatorially many possibilities.
While the ability of generative dialogue models to compose multiple tasks has not yet been studied in the literature, there has been some investigation into the compositionality skills of deep neural networks. Lake and Baroni (2017) propose a suite of tasks to evaluate a method's compositionality skills and find that deep neural networks generalize to unseen compositions only in a limited way. Kottur et al. (2017) analyze whether the language that emerges when multiple generative models interact with each other is compositional, and conclude that compositionality arises only with strong regularization.
Motivated by the practical infeasibility of collecting data for combinatorially many task compositions, we focus on the task-level compositionality of text response generation models. We begin by studying the effect of the amount of human-human multiple task dialogue training data on the performance of Transformer (Vaswani et al., 2017) generative models. Next, we explore two solutions to improve task-level compositionality. First, we propose a data augmentation approach (Simard et al., 2003; Schmidhuber, 2012; Krizhevsky et al., 2012; Baird, 1992; Sennrich et al., 2016) where we create synthetic multiple task dialogues for training from human-human single task dialogues: we add a portion of one dialogue as a prefix to another to simulate multiple task dialogues during training. As a second solution, we draw inspiration from the domain adaptation literature (Ganin and Lempitsky, 2015; Tzeng et al., 2015; Xu and Yang, 2017; Chen et al., 2016; Xu et al., 2017; Sun et al., 2018) and encourage the model, via an auxiliary loss, to learn representations that are invariant to single and multiple task dialogues.
We conduct our experiments on the MultiWOZ dataset (Budzianowski et al., 2018). The dataset contains both single and multiple task dialogues for training and evaluation. In MultiWOZ, the tasks in multiple task dialogues are combinations of the tasks in single task dialogues, which makes the dataset an appropriate benchmark for our experiments.
To summarize, our key findings are:

• We study the task-level compositionality of text response generation models and find that they rely heavily on multiple task conversations at train time to do well on such conversations at test time.
• We explore two novel unsupervised solutions to improve task-level compositionality: (1) creating synthetic multiple task dialogue data from human-human single task dialogues and (2) forcing the encoder representation to be invariant to single and multiple task dialogues using an auxiliary loss.
• We highlight the difficulty of composing tasks in generative dialogue with experiments on the MultiWOZ dataset, where both methods combined yield only an 8.5% BLEU (Papineni et al., 2002) score improvement when zero-shot evaluated on multiple task dialogues.

Background
Let $d_1, d_2, \ldots, d_M$ be the dialogues in the training set, where every dialogue $d_m = ((u_m^1, a_m^1), (u_m^2, a_m^2), \ldots, (u_m^{n_m}, a_m^{n_m}))$ (for all $m \in \{1, 2, \ldots, M\}$) consists of $n_m$ turns each of user and assistant. Each user and assistant turn consists of a sequence of word tokens. An individual dialogue can be either single task or multiple task depending on the number of tasks being accomplished in the dialogue.
The response generation model is trained to generate each assistant turn given the conversation history. The generative model learns the distribution $P(a^i \mid (u^1, a^1), \ldots, (u^{i-1}, a^{i-1}), u^i)$; we drop the index $m$ that denotes a particular training example for simplicity. The assistant turn $a^i$ consists of a sequence of word tokens, $a^i = (w_1^i, w_2^i, \ldots, w_{l_i}^i)$. The response generation model factorizes the joint distribution left-to-right:

$$P(a^i \mid x^i) = \prod_{k=1}^{l_i} P(w_k^i \mid w_1^i, \ldots, w_{k-1}^i, x^i),$$

where $x^i = ((u^1, a^1), \ldots, (u^{i-1}, a^{i-1}), u^i)$ refers to the conversation history up to the $i$-th turn.
We use a Transformer (Vaswani et al., 2017) sequence-to-sequence model to parameterize the above distribution. Given a training set of dialogues, the parameters of the Transformer model are learned to minimize the conditional language modelling objective

$$\mathcal{L}_{LM}(\Theta) = -\sum_{m=1}^{M} \sum_{i=1}^{n_m} \log P(a_m^i \mid x_m^i; \Theta), \quad (1)$$

where $\Theta$ refers to the parameters of the Transformer model.
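The objective above can be sketched in plain Python. This is a toy illustration, not the paper's implementation: the `uniform_model` stand-in replaces the Transformer, and the data layout (a dialogue as a list of (user, assistant) turn pairs) is an assumption.

```python
import math

def uniform_model(history, prefix, vocab):
    """Hypothetical stand-in for P(w | history, prefix): uniform over the vocab."""
    return {w: 1.0 / len(vocab) for w in vocab}

def turn_log_likelihood(model, history, assistant_turn, vocab):
    """log P(a^i | x^i) = sum_k log P(w_k | w_1..w_{k-1}, x^i)."""
    log_p = 0.0
    prefix = []
    for w in assistant_turn:
        probs = model(history, prefix, vocab)
        log_p += math.log(probs[w])
        prefix.append(w)
    return log_p

def dialogue_nll(model, dialogue, vocab):
    """Negative log-likelihood summed over all assistant turns of one dialogue."""
    nll, history = 0.0, []
    for user_turn, assistant_turn in dialogue:
        history.append(user_turn)  # x^i includes the i-th user turn
        nll -= turn_log_likelihood(model, history, assistant_turn, vocab)
        history.append(assistant_turn)
    return nll
```

Summing `dialogue_nll` over all training dialogues gives the training loss; a real system would replace `uniform_model` with the Transformer's softmax output.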

Data Augmentation
The first solution we explore for task compositionality generates synthetic multiple task dialogues for training from human-human single task dialogues.¹ Here, we sample two dialogues from the training set and add a portion of one dialogue as a prefix to the other. While this procedure might not create dialogues of a quality equivalent to human-human multiple task dialogues, it is an unsupervised way to create approximate multiple task dialogues from which the model could benefit. Concretely, we randomly sample two single task dialogues $d_i$ and $d_j$ from the training set and create a noisy multiple task dialogue by adding a fraction of dialogue $d_j$ as a prefix to dialogue $d_i$. The fraction of dialogue taken from $d_j$ is given by the hyperparameter augment_fraction. The number of times dialogue $d_i$ is augmented by a randomly sampled dialogue is given by the hyperparameter augment_fold.
We consider two strategies for sampling the dialogue $d_j$. In Random Augment, the dialogue is sampled uniformly at random from the remainder of the training set. A potential issue with the random strategy is that it might create spurious task combinations, and the model might fit to this noise. Motivated by this, we consider a second sampling strategy, Targeted Augment, where we create synthetic multiple task dialogues only for task combinations that exist in the development set. Here, $d_j$ is sampled from the set of dialogues whose task is compatible with the task of dialogue $d_i$. The Transformer model is then trained on the augmented training set using the objective function given in Equation 1. The effect of the sampling strategy and the hyperparameters on model performance is discussed in the experiments section (Section 5).
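The two sampling strategies can be sketched as follows. This is a minimal sketch under our assumptions: a dialogue is a list of (user, assistant) turn pairs, and `task_of` and `dev_task_pairs` are hypothetical stand-ins for the paper's task labels and development-set task combinations.

```python
import random

def make_synthetic(d_i, d_j, augment_fraction):
    """Prefix a fraction of dialogue d_j onto dialogue d_i."""
    k = max(1, int(len(d_j) * augment_fraction))
    return d_j[:k] + d_i

def augment(train_set, task_of, augment_fraction=0.5, augment_fold=1,
            targeted=False, dev_task_pairs=None, seed=0):
    """Create synthetic multiple task dialogues from single task ones."""
    rng = random.Random(seed)
    synthetic = []
    for i, d_i in enumerate(train_set):
        candidates = [d for j, d in enumerate(train_set) if j != i]
        if targeted:
            # Targeted Augment: keep only task combinations seen in the dev set.
            candidates = [d for d in candidates
                          if frozenset({task_of(d), task_of(d_i)}) in dev_task_pairs]
        for _ in range(augment_fold):  # augment_fold synthetic copies per dialogue
            if candidates:
                synthetic.append(
                    make_synthetic(d_i, rng.choice(candidates), augment_fraction))
    return synthetic
```

With `targeted=False` this implements Random Augment; passing the development set's task combinations restricts sampling as in Targeted Augment.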

Domain Invariant Transformer
We propose the Domain Invariant Transformer model (Figure 1) to obtain a domain invariant encoder representation by training the encoder on an auxiliary task. The auxiliary task is to predict a label $l^i$ denoting the type of task (single or multi-task) in the encoded conversation history. The model takes as input a sequence of byte pair encoded tokens, which are encoded into hidden states by the multi-head, multi-layer attention mechanism of the Transformer. The conditional language model (Equation 1) is learnt by a Transformer decoder on top that attends over the encoder states.
The discriminator task network is trained on the average pooling of the encoder summary over the attention heads $h_j$, as shown in Equation 2:

$$\bar{h} = \frac{1}{H} \sum_{j=1}^{H} h_j. \quad (2)$$
The average pooled encoder summary is passed as input to a two-layer feed-forward discriminator. The discriminator network has a dropout (Srivastava et al., 2014) layer in between the two fully connected layers $f_1$ and $f_2$:

$$\hat{y}^i = \sigma\big(f_2(\mathrm{dropout}(f_1(\bar{h}^i)))\big). \quad (3)$$
The binary cross-entropy loss $\mathcal{L}_{disc}$ for the predicted label $\hat{y}^i$ of an input context $i$ is computed as in Equation 4:

$$\mathcal{L}_{disc} = -\frac{1}{N} \sum_{i=1}^{N} \big( l^i \log \hat{y}^i + (1 - l^i) \log (1 - \hat{y}^i) \big). \quad (4)$$

The Domain Invariant Transformer model optimizes a convex combination of the two losses as shown in Equation 5:

$$\mathcal{L} = \alpha \mathcal{L}_{disc} + (1 - \alpha) \mathcal{L}_{LM}. \quad (5)$$
The language model loss ensures that the model learns to generate the next utterance, while the discriminator loss makes the model aware of the nature of the task. To understand the effect of the auxiliary loss, we experiment with different values of $\alpha$ (see Appendix).
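The auxiliary head can be sketched with numpy. This is an illustrative stand-in for Equations 2-5, not the paper's implementation: the ReLU nonlinearity, the sigmoid output, and the assumption that α weights the discriminator loss are our reading of the text, and dropout is omitted for clarity.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator(encoder_states, W1, b1, W2, b2):
    """Average-pool the encoder summary (Eq. 2), then a two-layer FFN (Eq. 3)."""
    h_bar = encoder_states.mean(axis=0)        # average pooling over heads
    hidden = np.maximum(0.0, W1 @ h_bar + b1)  # f1 with ReLU
    return sigmoid(W2 @ hidden + b2)           # f2 -> P(multi-task | context)

def bce(y_hat, y):
    """Binary cross-entropy for one context (Eq. 4)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def total_loss(lm_loss, disc_loss, alpha):
    """Convex combination of the two losses (Eq. 5)."""
    return alpha * disc_loss + (1.0 - alpha) * lm_loss
```

With a small α such as 0.001 (the best value reported in Table 4), the combined loss is dominated by the language model term, and the discriminator acts as a light regularizer.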

Importance of multiple task dialogues
We measure the importance of multiple task dialogues on the overall performance of the Transformer by training the model with varying amounts of multiple task dialogues while keeping the task distribution between multiple and single domain dialogues roughly similar across experiments. We increase the number of multiple task dialogues while reducing the number of single task dialogues so that the total number of dialogues stays constant at 2,150. The model should be able to generalize to multiple tasks, as the set of tasks is the same between the train and test sets; only the manner in which the tasks are posed by the user differs. We use the Tensor2Tensor (Vaswani et al., 2018) framework to run our experiments, with the (tiny) hyper-parameter setting in the framework. As shown in Table 1, the quality of the model improves significantly as the number of multiple task dialogues increases. Interestingly, even though the total number of dialogues is kept fixed, the overall validation BLEU score also improves as the number of multiple task dialogues in the training set increases. The results suggest that the models may be better at decomposing than composing in the domain of goal-oriented dialogues, or that the model at best only mimics the surface-level token distribution (Appendix B). Though training with more multi-task dialogues can potentially improve performance, it is not a scalable solution. We test two techniques to improve task-level compositionality in the following section.

Zero-shot Compositionality Experiments
We evaluate the performance of the Transformer on zero-shot compositional tasks by training the baseline model only on single task dialogues, both with and without the proposed data augmentation techniques. The results are shown in Table 2. The reason for only a minor BLEU improvement could be noise in the generation process. Although the task distributions are matched, the token-level distributions appear to be significantly different between single and multiple-task dialogues. The results suggest that the method may inject more noise into the token-level distribution, thereby not improving model performance significantly.

Domain Invariant Transformer
We compared the proposed architecture with the baseline Transformer model to understand the effect of a domain invariant encoder representation on language generation in multi-task dialogues; the results are shown in Table 3. The poor performance of the data augmentation techniques may be due to the overwhelming noise in the token distribution of the input contexts, which skews the language model that is learned.

Conclusion
We studied the problem of composing multiple dialogue tasks to predict the next utterance in a single multiple-task dialogue. We found that even powerful Transformer models do not naturally compose multiple tasks, and that performance relies heavily on multiple task dialogues at training time. We explored two solutions, which only further demonstrated the difficulty of composing multiple dialogue tasks.
The difficulty of generalizing to zero-shot composition observed in our experiments hints at the possibility that the Transformer model merely mimics surface-level tokens without understanding the underlying task. The token overlap distribution in Appendix B supports this possibility.

A Preprocessing
The MultiWoZ 2.0 dataset has JSON metadata that maintains a dictionary of the slot-value pairs provided by the user to the agent in every utterance. We use this metadata to construct local and global knowledge of the slot-values shared by the user, and relabel the dataset into single domain and multi-domain dialogues; this preprocessing step removes noise in the labeling of dialogues. We use this approach to keep a test set of multi-domain dialogues to evaluate model performance on compositional tasks. On the clean split of single domain dialogues, we generate synthetic multi-domain dialogues using two different approaches.

A.1 Random Synthetic (RS)

In this approach, we pick a single task dialogue $D_{SNG}^i$ and randomly select a set of $K$ single task dialogues, noiseDialogues. With a hyperparameter, percentCopy, we select the number of utterances to be copied from every dialogue in noiseDialogues and add them as a prefix to $D_{SNG}^i$. This results in $K$ negative samples of synthetic multi-domain dialogues, $\{D_{MUL\text{-}RS}^{i,k}\}_{k=1}^{K}$, for every single domain dialogue in the dataset.
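The RS procedure can be sketched as follows; the list-of-utterances data layout is an assumption, and `percent_copy` mirrors the percentCopy hyperparameter above.

```python
import random

def random_synthetic(d_sng, pool, K, percent_copy, rng=None):
    """Create K synthetic multi-domain dialogues for one single task dialogue."""
    rng = rng or random.Random(0)
    noise_dialogues = rng.sample(pool, K)  # K other single task dialogues
    synthetic = []
    for d_noise in noise_dialogues:
        n = max(1, int(len(d_noise) * percent_copy))  # utterances to copy
        synthetic.append(d_noise[:n] + d_sng)         # prefix, then the original
    return synthetic
```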

A.2 Targeted Synthetic (TS)
We bucket the single domain dialogues by conversation domain (taxi, hotel, attraction, etc.). Similarly, we bucket the multi-task dialogues in the training set to measure the distribution of domain combinations in multi-task dialogues. Using the computed distribution of composite tasks in true multi-domain dialogues and the domain label of every $D_{SNG}^i$, we constrain the selection of random dialogues to conform to the distribution of true composite tasks in the training set. The hyperparameters and the remainder of the procedure are similar to RS, except that when combining single domain dialogues from two different domains $Dom_i$ and $Dom_j$, we inject topic change exchanges randomly sampled from $TC(Dom_j, Dom_i)$.
For training the proposed Domain Invariant Transformer model, we create the labels for the auxiliary task using the same preprocessing steps used to split the dataset into single and multi-domain dialogues.

B Token distribution
We analyze the token distribution in the dataset to further understand the negative result. We observe that although the task distributions are matched, the underlying token distributions in the different setups are not (Table 5). We look at the overlap of the distribution of 4-grams in conversations across the different splits used for training. We find that the multi-task dialogues (MUL) training set has as much 4-gram overlap with the MUL validation and SNG (single task dialogues) validation sets as the combined (SNG + MUL) training data. This analysis casts doubt on the performance of the Transformer model with an increased number of MUL training dialogues: the improvement may not come from the model's ability to decompose multiple tasks, but simply from the fact that MUL train has higher 4-gram overlap with SNG Valid and MUL Valid. It suggests that, despite the rich task information in task-oriented dialogues, the model at best mimics the surface-level token distribution. Hence, it is not clear whether the Transformer model generalizes to multi-task dialogues with an understanding of the underlying task structure.
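The %Unseen statistic from this analysis can be computed as sketched below; whitespace tokenization and per-conversation 4-gram counting are assumptions about the exact setup.

```python
from collections import Counter

def ngrams(text, n=4):
    """Count the n-grams of a whitespace-tokenized conversation."""
    toks = text.split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def percent_unseen(train_texts, valid_texts, n=4):
    """Percentage of validation n-gram occurrences never seen in training."""
    train_grams = set()
    for t in train_texts:
        train_grams.update(ngrams(t, n))
    valid_counter = Counter()
    for t in valid_texts:
        valid_counter.update(ngrams(t, n))
    total = sum(valid_counter.values())
    unseen = sum(c for g, c in valid_counter.items() if g not in train_grams)
    return 100.0 * unseen / total if total else 0.0
```

Computing this quantity for each (train split, validation split) pair reproduces the kind of comparison reported in Table 5.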

Figure 2: An example of combining two single-task dialogues (shown in two colors) to form a single multi-task dialogue.

Table 1: Ablation study to understand the usefulness of multiple task dialogues.

Table 2: Zero-shot evaluation on multiple task dialogues. The Targeted Augment technique increases performance on multiple-task dialogues by 8.5% BLEU, while the model's performance on all dialogues slightly drops.
Table 3 shows that neither the Domain Invariant Transformer nor the baseline Transformer model generalizes from few-shot multi-task dialogues, and the data augmentation techniques also do not appear to improve performance. The Domain Invariant Transformer does improve the BLEU score when trained on all of the training data, though this was not the intended objective. Even so, the model remains heavily reliant on human-human multiple domain dialogues, and zero-shot or few-shot generalization in compositional dialogues appears quite difficult to achieve.

Table 3: 0.5 and 1.0 correspond to half and all of the multi-task samples, respectively, used during training. Synthetic refers to Targeted Augment dialogues.

Table 4: Varying α to understand the effect of the discriminator on decoder performance.

We experimented with different values of α to understand the influence of the discriminator loss. The results in Table 4 show that the Domain Invariant Transformer performed best when α is 0.001.
USER: I'm looking for the Lensfield Hotel. Can you help me find it?
ASSISTANT: Yes of course, it is located in 53-57 lensfield road. Would you like me to book you a reservation?
USER: No thank you. Could you give me the postcode and tell me if they have free parking?
ASSISTANT: Yes, they have free parking. The postcode is cb21en. Is there anything else you'd like to know about this hotel?
USER: No, that's all I needed. Thanks! Bye!
ASSISTANT: You are welcome. Goodbye!

USER: I'd like to find a train that leaves after 17:00 on Saturday.
ASSISTANT: Absolutely, I need departure and destination information and a time you need to arrive by.
USER: I will be departing from Broxbourne and I'd like to go to Cambridge.
ASSISTANT: I have 7 available. The first leaves at 17:32, and arrives at 18:32. Would you like me to book that for you?
USER: I am not looking to book at the moment, can I just get the train ID? Thanks.
ASSISTANT: Of course! TR0811 leaves at 17:32, arrives at 18:32, with a duration of 60 minutes. The price is 14.32 pounds. May I assist you with anything else?
USER: That is all I need for today. Thank you. Good bye.
ASSISTANT: Okay glad I could help you. Enjoy your trip.

USER: I'm looking for the Lensfield Hotel. Can you help me find it?
ASSISTANT: Yes of course, it is located in 53-57 lensfield road. Would you like me to book you a reservation?
USER: No thank you. Could you give me the postcode and tell me if they have free parking?
ASSISTANT: Yes, they have free parking. The postcode is cb21en. Is there anything else you'd like to know about this hotel?
USER: I'd like to find a train that leaves after 17:00 on Saturday.
ASSISTANT: Absolutely, I need departure and destination information and a time you need to arrive by.
USER: I will be departing from Broxbourne and I'd like to go to Cambridge.
ASSISTANT: I have 7 available. The first leaves at 17:32, and arrives at 18:32. Would you like me to book that for you?
USER: I am not looking to book at the moment, can I just get the train ID? Thanks.
ASSISTANT: Of course! TR0811 leaves at 17:32, arrives at 18:32, with a duration of 60 minutes. The price is 14.32 pounds. May I assist you with anything else?
USER: That is all I need for today. Thank you. Good bye.
ASSISTANT: Okay glad I could help you. Enjoy your trip.

Table 5: Analysis of 4-gram overlap across the different combinations of train and validation splits used in the experiments. The analysis shows that the %Unseen in the validation set is higher when training with SNG (single domain dialogues) but considerably lower when training with MUL. The composition task requires models to understand the underlying task structure, but the data distribution and the Transformer's performance correlate strongly, suggesting that the Transformer model at best mimics the surface-level token distribution rather than understanding the nature of the task.