The Dialogue Dodecathlon: Open-Domain Knowledge and Image Grounded Conversational Agents

We introduce dodecaDialogue: a set of 12 tasks that measures if a conversational agent can communicate engagingly with personality and empathy, ask questions, answer questions by utilizing knowledge resources, discuss topics and situations, and perceive and converse about images. By multi-tasking on such a broad large-scale set of data, we hope to both move towards and measure progress in producing a single unified agent that can perceive, reason and converse with humans in an open-domain setting. We show that such multi-tasking improves over a BERT pre-trained baseline, largely due to multi-tasking with very large dialogue datasets in a similar domain, and that the multi-tasking in general provides gains to both text and image-based tasks using several metrics in both the fine-tune and task transfer settings. We obtain state-of-the-art results on many of the tasks, providing a strong baseline for this challenge.


Introduction
One of the goals of AI is to build a seeing, talking agent that can discuss, reason, empathize, and provide advice: in short, a system that can perform natural communication displaying many of the properties expected when speaking to a human partner. Ideally, it should be knowledgeable and personable, expert and engaging, serious or humorous, depending on the situation. It should be capable of answering questions, asking questions, responding to statements, having its own persona, and grounding itself on external information and images.
While no single task exists that can train an agent or measure its ability on all of these skills at once, a number of distinct large-scale datasets measuring subsets of these skills have recently become available. We thus collect together these disparate tasks and form a single challenge: dodecaDialogue, consisting of 12 subtasks. Each contains both training data to build the skills we desire for our agent, and validation and test sets to measure our agent's ability at that skill. The overall goal is a single agent that can display all these skills. As some of the subtasks have very large datasets, e.g. 2200M utterances, they can possibly help the agent with other skills too.
We thus build a model capable of training and multi-tasking on all these sources. We employ a transformer-based architecture (Vaswani et al., 2017) which accepts both an image and dialogue history as input, and generates a response for a given dialogue turn. Practically, by pre-training on the largest of the subtasks and then multi-tasking on all of them, we obtain state-of-the-art results, compared to independently reported current performance, on all 10 of the 12 subtasks that have previous comparable results. We thus set a strong baseline for this challenge. While many existing approaches use large pre-training on general text corpora, we show that using dialogue datasets instead, which are more closely linked to the desired agent's goal, is a strong alternative.
However, many challenges remain. While multi-tasking performs well and has clear benefits, as shown in other works (Liu et al., 2015; Raffel et al., 2019), we typically obtain small losses when compared to fine-tuning of the same system. Zero-shot transfer to left-out tasks is also demanding for current approaches. We analyze these aspects, along with our model's ability to ground on external knowledge and images in conjunction with the dialogue context, the impact of decoding algorithms, the weighting of tasks during multi-tasking, and cross-task transfer ability, in order to shed light and make progress on this challenging topic.

The dodecaDialogue Task
The dodecaDialogue task is intended to assemble important aspects of an engaging conversational agent into a single collection, where each subtask covers some of those goals. Such an agent should be able to get to know you when you first talk to it (ConvAI2), discuss everyday topics (DailyDialog, pushshift.io Reddit, Twitter, Cornell Movie), speak knowledgeably at depth (Wizard of Wikipedia, Ubuntu) and answer questions on such topics (ELI5). It must be able to handle situated conversations and apply empathy (Empathetic Dialog, LIGHT). It can also discuss images, as this is a vital part of human connection (Image Chat, IGC).
The overall statistics of the subtasks are given in Table 1. We now discuss each in turn.

ConvAI2
ConvAI2 is a challenge dataset used at the NeurIPS 2018 competition of the same name, and is based on PersonaChat (Zhang et al., 2018). The training data involves paired crowdworkers having a conversation where they get to know each other, where each is given a role to play based on sentences describing their persona, which were also separately crowdsourced. It thus involves asking and answering questions, responding in kind, getting to know the other speaker and engaging them in friendly conversation, all useful skills for an open-domain conversational agent.
DailyDialog Li et al. (2017) built a dialogue dataset intended to reflect conversations occurring in daily life. It covers ten categories ranging from holidays to financial topics, rather than focusing on one domain. Compared to ConvAI2, these conversations seem more in keeping with partners who already know each other and want to discuss typical life details, again useful skills for a conversational agent. The dataset is also annotated with topic, emotion and utterance acts, but here we ignore these annotations and learn only from the utterances in the dialogue turns.

Wizard of Wikipedia
This task involves discussing a given topic in depth, where the goal is to engage the partner as well as display expert knowledge. The training set consists of 1247 topics, and a retrieval system over Wikipedia from which the dialogues were grounded during the human-human crowdsourced conversations. The topics were also crowdsourced and range from e-books to toga parties to showers. A model can thus learn to perform a similar retrieval and grounding at test time to potentially discuss any topic, if it can generalize. We use the gold knowledge version of the task, where the input is the same as selected by the crowdworkers (checked sentence only), and use the seen test set of 533 topics. We see this skill as a core component of an agent being able to not just chitchat, but actually engage a user in discussing real information about the world, e.g. by retrieving over documents from the internet.

Empathetic Dialogues
Rashkin et al. (2019) constructed a dataset of crowdworker conversations grounded in an emotional situation. In each dialogue, one speaker describes a personal situation and the other plays a "listener" role, displaying empathy during the discussion. The dataset contains descriptions of the situations being discussed with an attached emotion label, but these are not used here. Trained models are measured playing the part of the empathetic listener, an important feature of an agent humans wish to speak to. Average BLEU is used as a main metric.
Cornell Movie Danescu-Niculescu-Mizil and Lee (2011) constructed a corpus containing a collection of fictional conversations from movie scripts, thus covering a large diversity of topics and emotional states.
LIGHT LIGHT involves situated interactions between characters in a text adventure game. Similar to ConvAI2, personas for each character are given, with the training set including conversations between crowdworkers playing the roles of 1369 possible characters. Unlike ConvAI2, it also includes emotes and actions grounded within the game world (e.g. picking up and giving objects). As such, it measures the ability of a conversational agent to ground its discussion to a dynamic environment.
ELI5 ELI5 involves long-form question answering grounded on multiple retrieved documents in order to answer common questions which people ask on the popular ELI5 subreddit. As such, the answers are in a conversational form applicable to a dialogue agent.
Ubuntu Lowe et al. (2015) built a dataset that involves in-depth discussions in solving Ubuntu problems. This studies the ability of an agent on a very focused single topic, and is also a standard benchmark in the field.
Twitter We use a variant of Twitter discussions (text-only), which have been used in many existing studies, e.g. Sordoni et al. (2015); See et al. (2019). This data naturally involves everyday discussions about topics that people care about. The public forum makes them different from the more personal discussions of some of the other tasks. This is the second largest dataset in the collection, and we thus measure in experiments its ability to help performance on other tasks.
pushshift.io Reddit We use a variant of Reddit discussions (text-only), which has also been used in several existing studies, see e.g. Yang et al. (2018); Mazaré et al. (2018); Keskar et al. (2019). Following prior work, we use a previously existing Reddit dataset extracted and obtained by a third party and made available on pushshift.io, training to generate a comment conditioned on the full thread leading up to the comment, spanning 2200M training examples. This is the largest dataset in the collection, much larger than the others. The subreddits cover a vast range of topics, and hence it is also a strong candidate for helping improve performance on other tasks via pre-training and multi-tasking.
Image Chat Shuster et al. (2018b) collected a crowdsourced dataset of human-human conversations about an image with a given personality, where the goal is to engage the other speaker. As such, it covers natural conversational responses, including displays of emotion and humor.
Image Grounded Conversations (IGC) IGC (Mostafazadeh et al., 2017) similarly involves two speakers discussing an image, here focusing on questions and responses. It only includes a validation and test set, and so we converted most of the validation set to form a small training set.

Evaluation
Metrics For all tasks, we measure the following metrics: perplexity (PPL), BLEU, ROUGE-1, -2 and -L, and F1. We also pick the metric most used in the literature as that subtask's 'Score' to compare to existing work.
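As a rough illustration of two of the metrics listed above, the following sketch computes a word-level F1 and a perplexity value. It assumes whitespace tokenization, lowercasing and natural-log token probabilities; the function names `unigram_f1` and `perplexity` are illustrative and are not taken from the evaluation code used in the paper, whose tokenization and normalization may differ.

```python
import math
from collections import Counter
from typing import Sequence


def unigram_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between a generated response and a reference.

    Assumption: tokens are lowercased, whitespace-separated words.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared word at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```

Because perplexity depends on the token dictionary, values computed this way are only comparable between models that share the same vocabulary.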
Multi-tasking As we are interested in building a single conversational agent, we measure the ability of multi-tasked models that can perform all twelve tasks at once.
Single-Task Fine-tuning We can still compare such multi-tasked models to single-task fine-tuned baselines to assess if we have gained or lost performance. Like other works (Liu et al., 2015; Raffel et al., 2019) we also consider a multi-task followed by fine-tune setup in order to see if this produces better models. The latter tests if multi-tasking still proves useful in the single-task setting.
Zero-shot Transfer Finally, we consider a leave-one-out zero-shot setting whereby training is constrained to be on all the training data except for the task being evaluated. This evaluates the performance on truly new unseen tasks, an important behavior given there are always new tasks.

Existing Models and Results
Where possible, we have tried to track the best existing results for each task and provided a comparison in our final results tables.
As ConvAI2 was a competition, a number of competitors built strong models on it. The best results were obtained by large pre-trained transformers. In particular, Wolf et al. (2019b) pre-trained via the method of Radford et al. (2018) using the BooksCorpus dataset, resulting in the best perplexities and F1 scores. Since then, results have improved further with the advent of better and larger pre-training (Lewis et al., 2019), which we compare to here; the same work also reports strong results on ELI5.
He et al. (2019) recently obtained strong results on the DailyDialog and Cornell Movie tasks in terms of perplexity by pre-training on 10% of CC-NEWS (Bakhtin et al., 2019), thus using 100 million sentences (2.7 billion words), and then fine-tuning a transformer-based model with a multi-task strategy.
Overall, large pre-trained transformers indeed provide strong existing results on many of the tasks. Several large language modeling projects have been undertaken in order to show prowess in multi-tasking ability (Radford et al., 2019; Keskar et al., 2019), and transformer-based approaches have been adapted to language and vision tasks as well (Lu et al., 2019; Tan and Bansal, 2019; Li et al., 2019a; Shuster et al., 2018a). As well as citing the relevant papers' results where possible in the experiments section, we also train a BERT-based (Devlin et al., 2018) generative model as an additional baseline.

Related Tasks and Collections
In the interests of feasibility, there are tasks we did not include in dodecaDialogue. For example, there are additional knowledge tasks (Qin et al., 2019; Moghe et al., 2018) and image-based datasets (Das et al., 2017) one could use. There are also a large number of QA tasks we did not include, e.g. Rajpurkar et al. (2016); Choi et al. (2018). In general, our choices were made based on tasks that after training might produce an engaging dialogue agent that humans naturally would want to talk to, which means either natural datasets or crowdsourced datasets where crowdworkers were encouraged to engage one another. As computational resources and ambitions scale, it would be interesting to add more tasks as well, while retaining the twelve we have chosen here in order to continue to evaluate their success, whilst extending the ambitions of the entire system.
All the subtasks in the collection we use here already exist. Other research projects have also built such collection-based tasks before. In particular, the NLP decathlon (McCann et al., 2018), which inspired the name of this paper, collects together a diverse set of NLP tasks, from sentiment detection to parsing. Talmor and Berant (2019) collect a set of 10 QA datasets and build MULTIQA. Recently, Raffel et al. (2019) similarly multi-tasked a large set of NLP tasks, on an even bigger scale. Our work differs from these in that it is more focused on tasks which naturally group together to form a conversational agent.

Models
BERT baseline. We implement a generative baseline using BERT by adapting the model to use a standard auto-regressive loss. We concatenate the context and the current generation and provide them as input to the model, using BERT's sentence embeddings to distinguish the roles in the network. Although BERT is trained to predict masked tokens, we find that fine-tuning can easily adjust its behavior to predicting the next token. Our BERT baseline is roughly equivalent to the model of Wolf et al. (2019b), but does not have a classification loss term. The implementation relies on HuggingFace Transformers (Wolf et al., 2019a). We fine-tune this model for each of our tasks, except Image Chat and IGC, which require images as input.
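The input construction described for the BERT baseline can be pictured with the minimal sketch below. It is not the paper's implementation: the helper names, the choice to score only response tokens, the `-100` ignore value and the lower-triangular attention mask are all assumptions made for illustration, since the text only states that context and generation are concatenated, distinguished by sentence embeddings, and trained with an auto-regressive loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # assumed marker for positions excluded from the loss


def build_generative_example(
    context_ids: List[int], response_ids: List[int]
) -> Dict[str, List[int]]:
    """Concatenate context and response and derive next-token targets."""
    input_ids = context_ids + response_ids
    # Segment 0 marks the dialogue context, segment 1 the response.
    segment_ids = [0] * len(context_ids) + [1] * len(response_ids)
    labels = []
    for i in range(len(input_ids)):
        nxt = i + 1
        # Position i predicts token i + 1; only response tokens are scored
        # (an assumption, the paper does not specify this detail).
        if nxt < len(input_ids) and nxt >= len(context_ids):
            labels.append(input_ids[nxt])
        else:
            labels.append(IGNORE_INDEX)
    return {"input_ids": input_ids, "segment_ids": segment_ids, "labels": labels}


def causal_mask(length: int) -> List[List[int]]:
    """Lower-triangular mask: position i may attend to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(length)] for i in range(length)]
```

The token ids passed in would come from whatever tokenizer the model uses; plain integers are enough to see how segments and targets line up.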
Image+Seq2seq. We use a modification of a transformer Seq2Seq architecture (Vaswani et al., 2017), additionally adding image features to the encoder. Our model has an 8-layer encoder and an 8-layer decoder with 512-dimensional embeddings and 16 attention heads, and is based on the ParlAI implementation (Miller et al., 2017). We use BPE following

Experiments
Task Training We employ the ParlAI framework (Miller et al., 2017) for training on single tasks or for multi-tasking, as many of the tasks are already implemented there, along with a (multitask) training and evaluation framework for such models.
Pre-training As pushshift.io Reddit and (to some extent) Twitter are much larger than our other tasks, we try pre-training the Seq2Seq module of our Image+Seq2Seq networks with those datasets, before multi-tasking on all of the tasks, or before evaluating single-task fine-tuning. For Reddit, the model was trained to generate a comment conditioned on the full thread leading up to the comment.
[...] all of them (using the validation set), reporting perplexity. The results are reported in Table 3. They show that training on pushshift.io Reddit alone, a huge dataset, is effective at transfer to other tasks, but never as effective as fine-tuning on the task itself. Moreover, fine-tuning on most of the smaller tasks actually provides improvements over pushshift.io Reddit training alone at transfer, likely because the three tasks selected are more similar to each other than to Reddit. Finally, training on all four tasks is the most effective strategy averaged over all tasks compared to any other single model, although this does not beat switching between different fine-tuned models on a per-task basis.
Comparison of Pre-training + Fine-tuning strategies Across all 12 tasks, we compare several pre-training strategies: using BERT, no pre-training at all, only initializing via fastText (Joulin et al., 2016), and using Twitter and pushshift.io Reddit pre-training with our Image+Seq2Seq architecture. For each variant we tune the learning rate, number of layers, number of heads and embedding size, with less pre-training typically requiring smaller capacity models. We then only fine-tune on a single task in these experiments, and report perplexity for that task alone, over all 12 tasks. The results, computed on the validation set, are given in Table 2. They show a clear reduction in perplexity with more pre-training, as expected. This is most easily seen in the dodecaScore (last row), the mean perplexity over all 12 tasks, which decreases from 49.5 (from-scratch models) down to 17.1 with pushshift.io Reddit pre-training. fastText (45.7) and Twitter (35.6) initializations help, but nowhere near as much. BERT fares better, but is still clearly worse than pushshift.io Reddit pre-training. The hypothesis here is that pushshift.io Reddit yields much more effective transfer as it is a dialogue task like our others, whereas non-dialogue corpora such as Wikipedia are not. This was previously observed for retrieval models. Note that we do not report results for the image dialogue tasks for BERT as that architecture does not deal with images. Finally, as pushshift.io Reddit is so effective, we also compare to pushshift.io Reddit training only, with no fine-tuning at all across all tasks, similar to our initial study in Table 3. The performance is impressive, with some tasks yielding lower perplexity than BERT pre-training + single-task fine-tuning. However, it still lags significantly behind fine-tuning applied after pushshift.io Reddit pre-training.
Table 5: Validation perplexity on select dodecaDialogue tasks comparing relative weights of tasks during multi-tasking, followed by fine-tuning (row below). The relative task weight is the ratio of examples from that task compared to others presented during multi-tasking. ∞ indicates single-task training.
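The dodecaScore used above is simply the mean of the per-task perplexities. A minimal sketch follows; the task list is taken from the task descriptions in this paper, while the function name `dodeca_score` and the requirement that all 12 tasks be present are illustrative assumptions (recall that the BERT column has no image-task results, so a real comparison would need a rule for missing entries).

```python
from typing import Mapping

# The twelve subtasks of dodecaDialogue, as named in the task descriptions.
TASKS = [
    "ConvAI2", "DailyDialog", "Wizard of Wikipedia", "Empathetic Dialogues",
    "Cornell Movie", "LIGHT", "ELI5", "Ubuntu", "Twitter",
    "pushshift.io Reddit", "Image Chat", "IGC",
]


def dodeca_score(perplexities: Mapping[str, float]) -> float:
    """Mean validation perplexity over all twelve tasks (lower is better)."""
    missing = [task for task in TASKS if task not in perplexities]
    if missing:
        # Assumption: refuse to average over an incomplete set of tasks.
        raise ValueError(f"missing perplexity for: {missing}")
    return sum(perplexities[task] for task in TASKS) / len(TASKS)
```

Since the score is an unweighted mean, a task with very high perplexity can dominate it, which is worth bearing in mind when comparing columns of Table 2.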
Image and Knowledge Grounding Some of our tasks involve grounding on knowledge or images. To show such grounding helps, we report results with and without grounding on those tasks in Table 4, reporting perplexity. Particularly for Wizard of Wikipedia (knowledge) and Image Chat (images) such grounding has a clear effect.
Multi-Task Results Next, we perform multi-task training across all tasks, which is our ultimate goal in order to obtain an open-domain conversational agent. We optimize over the same set of hyperparameters as before, including multi-tasking weights for tasks, where one samples during training with differing probabilities. However, in the end we did not obtain clear improvements beyond pre-training with pushshift.io Reddit and then equally sampling from all tasks. We report that final model's validation performance in terms of perplexity in Table 2 (second to last column, "All Tasks MT"). It achieves a dodecaScore of 19.1, superior to all pre-train fine-tune approaches except pushshift.io Reddit pre-training followed by fine-tuning, and is also superior to a single pushshift.io Reddit model. However, comparing across tasks, while most are close to the corresponding best fine-tuned model, many are just slightly worse. This is an expected result and is often reported in multi-task systems (Raffel et al., 2019). We look upon this result as both positive (we can obtain a single model doing well on all tasks, which a fine-tuned model cannot) and a remaining challenge to the community: can one find architectures that leverage multi-tasking even better?
Table 7: Test performance for various metrics on the dodecaDialogue tasks comparing our multi-task and multi-task + fine-tuned methods to existing approaches (cited). Dashes mean the metric was not provided. * was reported on validation only. Score is defined on a per-task basis in the metric column.
Table 8: Test performance for various metrics on the dodecaDialogue tasks comparing our multi-task and multi-task + fine-tuned methods.
Multi-Task followed by Fine-Tuning As also performed by Liu et al. (2015) and Raffel et al. (2019), we can train in a multi-task manner on all tasks before fine-tuning on a single task, building a separate model with this procedure for each task, in an attempt to improve single-task results further. Using this approach, one is free to perform hyperparameter search differently for each task. Here, we found that applying relative task up-weighting during multi-task training made a clear difference to the final quality of the fine-tuned target task model, see Table 5. Generally, better results come from assigning most of the multi-task weight towards the task that is to be fine-tuned. Using such an approach we can get marginally better results than fine-tuning alone, although the differences are generally small. The final best models per task are shown compared to other approaches in Table 2 (third to last column, "MT All Tasks + FT Single Task"). The final validation dodecaScore is 16.8, only slightly below 17.1 for fine-tuning.
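Relative task up-weighting can be pictured as sampling the source task of each training batch with unequal probabilities. The sketch below is one possible reading, not the training code used here: it assumes the target task receives weight `relative_weight` while every other task receives weight 1, and the names `task_sampling_probs` and `sample_task_schedule` are hypothetical.

```python
import random
from typing import Dict, List, Sequence


def task_sampling_probs(
    tasks: Sequence[str], target: str, relative_weight: float
) -> Dict[str, float]:
    """Per-task sampling probability with the target task up-weighted."""
    # Assumption: every non-target task has weight 1.
    weights = {t: (relative_weight if t == target else 1.0) for t in tasks}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def sample_task_schedule(
    probs: Dict[str, float], num_batches: int, seed: int = 0
) -> List[str]:
    """Draw the task that supplies each training batch."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=num_batches)
```

Setting `relative_weight` to 1 recovers the equal sampling used for the final multi-task model, and letting it grow without bound approaches single-task training, the ∞ setting in Table 5.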
Decoding Strategies So far, we have only been measuring perplexity, but we are actually interested in generation, which requires us to decode. We consider several standard approaches: greedy, beam search (with beam size, and minimum and maximum output length hyperparameters; see footnote 2), beam search with beam blocking (blocking n-grams, we use n = 3) (Paulus et al., 2017) and nucleus sampling (with parameter p) (Holtzman et al., 2019). We show the effect of these choices in Table 6 for ConvAI2 and Wizard of Wikipedia (WoW).
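Two of the decoding components named above can be sketched in a few lines of plain Python. This is an illustrative sketch over string tokens and an explicit probability dictionary, not the decoder used in the experiments; the function names are hypothetical, and a real implementation would operate on model logits and integrate the blocking step into beam search.

```python
import random
from typing import Dict, Sequence, Set


def nucleus_filter(probs: Dict[str, float], p: float) -> Dict[str, float]:
    """Keep the most probable tokens until their mass reaches p, renormalized."""
    kept: Dict[str, float] = {}
    mass = 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = prob
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept.items()}


def sample_nucleus(probs: Dict[str, float], p: float, rng: random.Random) -> str:
    """Sample one token from the truncated, renormalized distribution."""
    filtered = nucleus_filter(probs, p)
    tokens = list(filtered)
    return rng.choices(tokens, weights=[filtered[t] for t in tokens], k=1)[0]


def blocked_next_tokens(prefix: Sequence[str], n: int = 3) -> Set[str]:
    """Tokens that would repeat an n-gram already present in the prefix."""
    if len(prefix) < n - 1:
        return set()
    tail = tuple(prefix[len(prefix) - (n - 1):])
    blocked: Set[str] = set()
    # Look for earlier occurrences of the last n - 1 tokens and block
    # whichever token followed them.
    for i in range(len(prefix) - n + 1):
        if tuple(prefix[i:i + n - 1]) == tail:
            blocked.add(prefix[i + n - 1])
    return blocked
```

During beam search with blocking, any candidate whose next token is in the blocked set for its hypothesis would be discarded or given zero probability, which prevents repeated trigrams when n = 3.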
Final Systems The final test performance, reporting perplexity and decoding-based metrics for our best multi-task and fine-tuned (via multi-task followed by fine-tuning) systems, is reported in Table 8. Their corresponding validation performance is reported in Table 9. Here, for the multi-task model we have fine-tuned the decoding hyperparameters per task. For results with a single set of decoding hyperparameters, see Table 10. Across all metrics we generally find a similar story as before when comparing fine-tuning with multi-tasking: multi-tasking is successful, but the challenge is still to do better.
Comparison to Existing Systems We compare to existing state-of-the-art results previously published for each task. Results are given in Table 7. As existing works report different metrics per task, we report perplexity where possible (but note, they may be computed on a different dictionary), and choose the sequence decoding-based metric that is commonly reported per task (listed in column 'Metric'), where the 'Score' column reports its value. We compare these to our best fine-tuned and multi-tasked models. Our multi-task model outperforms all available existing results, with 2 of the 12 tasks having no previous result. It is only surpassed by our fine-tuned model which also outperforms all available existing results. Overall, our methods set a strong challenge to future approaches.
Example Outputs We show some example outputs of our multi-task model for some of the tasks in Section B. Our model is able to leverage images, knowledge, and given personality attributes to produce engaging dialogue with a large amount of variety, depending on the situation.
Footnote 2: The length parameters are important for ELI5.
Leave-One-Out Zero-Shot Performance Last, but not least, we evaluate the performance of a multi-task model at zero-shot transfer to a new dialogue task. This is performed by training on all but one of the tasks, and reporting performance on the left out one, repeating this experiment for all tasks.
Our best performing models in that regard are reported in Table 2 (last column). First, it is reassuring that the overall scores are reasonable, outperforming a pushshift.io Reddit only model on every task except Reddit itself. This means that multitasking across many tasks helps transfer learning. However, the gap between zero-shot performance and multi-task or fine-tuning performance means there is still a significant challenge in improving these results. Finally, we believe that reporting results in this regime in addition to multi-tasking results may help avoid the temptation to "cheat" at multi-tasking by trying to detect the task and then apply a separate fine-tuned classifier, as presumably that approach will not truly leverage reasoning and skills between tasks, which transfer may help measure.
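The leave-one-out protocol amounts to enumerating twelve train/evaluate splits. The sketch below only generates the splits; the task names come from the task descriptions in this paper, and the function name `leave_one_out_splits` is illustrative rather than part of any released code.

```python
from typing import Iterator, List, Sequence, Tuple

TASKS = [
    "ConvAI2", "DailyDialog", "Wizard of Wikipedia", "Empathetic Dialogues",
    "Cornell Movie", "LIGHT", "ELI5", "Ubuntu", "Twitter",
    "pushshift.io Reddit", "Image Chat", "IGC",
]


def leave_one_out_splits(
    tasks: Sequence[str],
) -> Iterator[Tuple[List[str], str]]:
    """Yield (training tasks, held-out task) pairs, one per task."""
    for held_out in tasks:
        train_tasks = [task for task in tasks if task != held_out]
        yield train_tasks, held_out


for train_tasks, held_out in leave_one_out_splits(TASKS):
    # A model would be multi-task trained on train_tasks here and then
    # evaluated, without further training, on the held-out task.
    print(f"train on {len(train_tasks)} tasks, zero-shot evaluate on {held_out}")
```

Each split requires a separate multi-task training run, so the full protocol costs roughly twelve times a single multi-task run.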

Discussion
We have introduced the dodecaDialogue task, and provide strong baseline results leveraging multi-modal image+seq2seq transformers trained across all tasks. The goal of introducing this task is not just to offer another challenge test set, but to further motivate building and evaluating conversational agents capable of multiple skills, one of the core goals of AI. We believe current systems are closer to that goal than ever before, but we still have a long way to go. Current results show that such systems can be reasonably competitive compared to humans in particular domains and for short conversations (Li et al., 2019b; Shuster et al., 2018a). This work tries to bridge the gap, avoiding agents with niche skills and moving towards evaluating an open-domain set of skills. Still, despite leveraging 12 tasks, there are many skills not included in our set. Examples include longer conversations involving memory (Moon et al., 2019) and mixing open-domain conversation with more task-oriented goals (Anonymous, 2020).
When and if we arrive at an agent trained from these resources that is engaging enough for humans to want to significantly interact with, several opportunities will arise. In particular, this leaves open the possibility of studying continual learning and "self-feeding", whereby the agent can learn more through such interactions (Hancock et al., 2019). Such a system would also be more naturally and readily amenable to human evaluation, which future work should address. In the short term we plan to perform crowdworker side-by-side evaluations to assess our current approaches (Li et al., 2019b).
Table 11: Best decoding parameters for each task, based on metric. Scores are from the best performing task-specific multi-task + fine-tuned model on validation sets.