Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems

Developing Video-Grounded Dialogue Systems (VGDS), where a dialogue is conducted based on visual and audio aspects of a given video, is significantly more challenging than traditional image- or text-grounded dialogue systems because (1) the feature space of videos spans multiple picture frames, making it difficult to obtain semantic information; and (2) a dialogue agent must perceive and process information from different modalities (audio, video, caption, etc.) to obtain a comprehensive understanding. Most existing work is based on RNNs and sequence-to-sequence architectures, which are not very effective at capturing complex long-term dependencies (such as those in videos). To overcome this, we propose Multimodal Transformer Networks (MTN) to encode videos and incorporate information from different modalities. We also propose query-aware attention through an auto-encoder to extract query-aware features from non-text modalities. We develop a training procedure that simulates token-level decoding to improve the quality of generated responses during inference. We achieve state-of-the-art performance on the Dialogue System Technology Challenge 7 (DSTC7). Our model also generalizes to another multimodal visual-grounded dialogue task, and obtains promising performance.


Introduction
A video-grounded dialogue system (VGDS) generates appropriate conversational responses to human queries by not only keeping track of the relevant dialogue context, but also understanding the relevance of the query in the context of a given video (knowledge grounded in a video) (Hori et al., 2018). An example dialogue exchange can be seen in Figure 1. Developing such systems has recently received interest from the research community (e.g. the DSTC7 challenge (Yoshino et al., 2018)). This task is much more challenging than traditional text-grounded or image-grounded dialogue systems because: (1) the feature space of videos is larger and more complex than text-based or image-based features because of diverse information, such as background noise, human speech, flow of actions, etc. across multiple video frames; and (2) a conversational agent must be able to perceive and comprehend information from different modalities (text from dialogue history and human queries, visual and audio features from the video) and semantically shape a meaningful response to humans.

Figure 1: A sample dialogue from the DSTC7 Video Scene-aware Dialogue training set with 4 example video scenes. C: video caption, S: video summary, Qi: i-th turn question, Ai: i-th turn answer.
C: a man is standing in a kitchen putting groceries away. He closes the cabinet when finished, walks over to a table and pulls out a chair and sits down.
S: a man puts away his groceries and then sits at a kitchen table and stares out the window.
Q1: how many people are in the video?
A1: there is just one person
Q2: is there sound to the video?
A2: yes there is audio but no one is talking
...
Q10: is he happy or sad?
A10: he appears to be neutral in expression
Most existing approaches for multi-modal dialogue systems are based on RNNs as the sequence processing unit and a sequence-to-sequence network as the overall architecture to model the sequential information in text (Das et al., 2017a,b; Hori et al., 2018; Kottur et al., 2018). Some efforts adopted query-aware attention to allow the models to focus on the specific parts of the features most relevant to the dialogue context (Hori et al., 2018; Kottur et al., 2018). Despite promising results, these methods are not very effective or efficient for processing video frames, due to the complexity of long-term sequential information from multiple modalities. We propose Multimodal Transformer Networks (MTN), which model the complex sequential information from video frames and also incorporate information from different modalities. MTNs allow for complex reasoning over multimodal data such as videos by jointly attending to information in different representation subspaces, and make it easier (than RNNs) to fuse information from different modalities. Inspired by the success of Transformers (Vaswani et al., 2017) for text, we propose novel neural architectures for VGDS: (1) We propose to capture complex sequential information from video frames using multi-head attention layers. Multi-head attention is applied across several modalities (visual, audio, captions) repeatedly. This works like a memory network to allow the models to comprehensively reason over the video to answer human queries; (2) We propose an auto-encoder component, designed as a query-aware attention layer, to further improve the reasoning capability of the models on the non-text features of the input videos; and (3) We employ a training approach to improve the generated responses by simulating token-level decoding during training.
We evaluated MTN on a video-grounded dialogue dataset (released through DSTC7 (Yoshino et al., 2018)). In each dialogue, video features such as audio, visual, and video caption are available, which have to be processed and understood to hold a conversation. We conduct comprehensive experiments to validate our approach, including automatic evaluations, ablations, and qualitative analysis of our results. We also validate our approach on the visual-grounded dialogue task (Das et al., 2017a), and show that MTN can generalize to other multimodal dialogue systems.

Related Work
The majority of work in dialogues is formulated as either open-domain dialogues (Shang et al., 2015; Vinyals and Le, 2015; Yao et al., 2015; Li et al., 2016a,b; Serban et al., 2017, 2016) or task-oriented dialogues (Henderson et al., 2014; Bordes and Weston, 2016; Fatemi et al., 2016; Liu and Lane, 2017; Lei et al., 2018; Madotto et al., 2018). Some recent efforts develop conversational agents that ground their responses on external knowledge, e.g. online encyclopedias (Dinan et al., 2018), social networks, or user recommendation sites (Ghazvininejad et al., 2018). The agent generates a response that can relate to the current dialogue context as well as exploit the information source. Recent dialogue systems use Transformer principles (Vaswani et al., 2017) for incorporating attention and focus on different dialogue settings, e.g. text-only or response selection settings (Zhu et al., 2018; Mazaré et al., 2018; Dinan et al., 2018). These approaches consider the knowledge to be grounded in text, whereas in VGDS, the knowledge is grounded in videos (with multimodal sources of information).
There are a few efforts in the NLP domain where multimodal information needs to be incorporated for the task. Popular research areas include image captioning (Vinyals et al., 2015; Xu et al., 2015), video captioning (Hori et al., 2017; Li et al., 2018) and visual question-answering (QA) (Antol et al., 2015; Goyal et al., 2017). Image captioning and video captioning tasks require producing a descriptive sentence about the content of an image or video respectively. This requires the models to be able to process certain visual features (and audio features in video captioning) and generate a reasonable description. Visual QA involves generating a correct response to a factual question about a given image. The recently proposed movie QA task (Tapaswi et al., 2016) is similar to visual QA but the answers are grounded in movie videos. However, all of these methods are restricted to answering specific queries, and do not maintain a dialogue context, unlike what we aim to achieve in VGDS. We focus on generating dialogue responses rather than selecting from a set of candidates. This requires the dialogue agents to model the semantics of the visual and/or audio contents to output appropriate responses.
Another related task is visual dialogues (Das et al., 2017a,b; Kottur et al., 2018). This is similar to visual QA but the conversational agent needs to track the dialogue context to generate a response. However, the knowledge is grounded in images. In contrast, we focus on knowledge grounded in videos, which is more complex, considering the large feature space spanning multiple video frames and modalities that need to be understood.

Multimodal Transformer Networks
Given an input video $V$, its caption $C$, a dialogue context of $(t-1)$ turns, each including a pair of (question, answer) $(Q_1, A_1), ..., (Q_{t-1}, A_{t-1})$, and a factual query $Q_t$ on the video content, the goal of a VGDS is to generate an appropriate dialogue response $A_t$. We follow the attention-based principle of the Transformer network (Vaswani et al., 2017) and propose a novel architecture, Multimodal Transformer Networks, to fuse feature representations from different modalities. MTN enables complex reasoning over long video sequences by attending to important feature representations in different modalities.
MTN comprises 3 major components: encoder, decoder, and auto-encoder layers. (i) Encoder layers encode text sequences and input video into continuous representations. Positional encoding is used to inject the sequential characteristics of input text and video features at token and video-frame level respectively; (ii) Decoder layers project the target sequences and perform reasoning over multiple encoded features through a multi-head attention mechanism. Attention layers coupled with feed-forward and residual connections process the projected target sequence over $N$ attention steps before passing it to a generative component to generate a response; (iii) Auto-encoder layers enhance video features with query-aware attention on the visual and audio aspects of the input video. A network of multi-head attention layers is employed as a query auto-encoder to learn the attention in an unsupervised manner. We combine these modules as a Multimodal Transformer Network (MTN) model and jointly train the model end-to-end. An overview of the MTN architecture is shown in Figure 2. Next, we discuss the details of each of these components.

Encoder Layers
Text Sequence Encoders. The encoder layers map each sequence of tokens $(x_1, ..., x_n)$ to a sequence of continuous representations $z = (z_1, ..., z_n)$ with each $z_i \in \mathbb{R}^d$. An overview of the text sequence encoder can be seen in Figure 3. The encoder is composed of a token-level learned embedding, a fixed positional encoding layer, and layer normalization. We use the positional encoding to incorporate sequential information of the source sequences. The token-level positional encoding is added on top of the embedding layer by element-wise summation. Both the learned embedding and the positional encoding have the same dimension $d$. We use sine and cosine functions for the positional encoding, as in (Vaswani et al., 2017). Unlike a Transformer encoder, we do not use a stack of encoder layers with self-attention to encode source sequences. Instead, we only use layer normalization (Ba et al., 2016) on top of the embedding. We also experimented with stacked Transformer encoder blocks, consisting of self-attention and feed-forward layers, and compare them with our approach (see Table 4, Rows A and B-1). The target sequence $A_t = (y_1, ..., y_m)$ is offset by one position to ensure that the prediction at decoding step $i$ is auto-regressive only on the previous positions $1, ..., (i-1)$. We share the embedding weights of the encoders for the source sequences, i.e. query, video caption, and dialogue history.
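The text encoder described above (embedding lookup, fixed sinusoidal positional encoding, element-wise sum, layer normalization) can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' PyTorch implementation: the function names, the toy embedding table, and the omission of the learned gain and bias in layer normalization are all assumptions made for brevity. The sine/cosine formula is the one from Vaswani et al. (2017).

```python
import math

def positional_encoding(n_positions, d):
    """Fixed sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pe = []
    for pos in range(n_positions):
        row = []
        for j in range(d):
            angle = pos / (10000 ** (2 * (j // 2) / d))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def layer_norm(vec, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def encode_text(token_ids, embedding):
    """Embedding lookup + positional encoding (element-wise sum) + layer norm."""
    d = len(embedding[0])
    pe = positional_encoding(len(token_ids), d)
    return [layer_norm([e + p for e, p in zip(embedding[tok], pe[i])])
            for i, tok in enumerate(token_ids)]

# Hypothetical 2-word vocabulary with d = 4.
toy_embedding = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.1, 0.2, 0.3]]
z = encode_text([1, 0, 1], toy_embedding)  # one d-dimensional vector per token
```

Note that there is no self-attention here: each token is encoded independently of its neighbours apart from the position signal, which matches the design choice stated in the text.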
Video Encoders. For a given video $V$, its features are extracted with a sliding window of $n$-video-frame length. This results in a modality feature vector $f_m \in \mathbb{R}^{numSeqs \times d_m}$ for a modality $m$. Each row of $f_m$ represents the features for a sequence of $n$ video frames. Here we consider both visual and audio features, $M = (v, a)$. We use pretrained feature extractors and keep the weights of the extractors fixed during training. For a set of scene sequences $s_1, ..., s_v$, the extracted features for modality $m$ are $f_m = (f_1, ..., f_v)$. We apply a linear network with ReLU activation to transform the feature vectors from $d_m$- to $d$-dimensional space. We then employ the same positional encoding as before to inject sequential information into $f_m$. Refer to Figure 3 for an overview of the video encoder.

Decoder Layers
Given the continuous representation $z^s$ for each source sequence $x^s$ and $z^t$ for the offset target sequence, the decoder generates an output sequence $(y_2, ..., y_m)$ (the first token is always an sos token). The decoder is composed of a stack of $N$ identical layers. Each layer has 4 + M sub-layers, each of which performs attention on an individual encoded input: the offset target sequence $z^t$, dialogue history $z^{his}$, video caption $z^{cap}$, user query $z^{que}$, and video non-text features $\{f_a, f_v\}$. Each sub-layer consists of a multi-head attention mechanism and a position-wise feed-forward layer. Each feed-forward network consists of 2 linear transformations with a ReLU activation in between. We employ residual connections (He et al., 2016) and layer normalization (Ba et al., 2016) around each attention block. The multi-head attention on $z^s$ is defined as:
$$m^{s}_{att} = \mathrm{Concat}(h_1, \ldots, h_h)\, W^O \tag{1}$$
$$h_i = \mathrm{Attn}(z^{dec}_{out} W^Q_i,\; z^s W^K_i,\; z^s W^V_i) \tag{2}$$
$$\mathrm{Attn}(q, k, v) = \mathrm{softmax}\!\left(\frac{q k^T}{\sqrt{d_k}}\right) v \tag{3}$$
where $W^Q_i \in \mathbb{R}^{d \times d_k}$, $W^K_i \in \mathbb{R}^{d \times d_k}$, $W^V_i \in \mathbb{R}^{d \times d_v}$, $W^O \in \mathbb{R}^{h d_v \times d}$ (the superscripts of $s$ and $t$ are not presented for each $W$ for simplicity). $z^{dec}_{out}$ is the output of the previous sub-layer.
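To make the decoder's attention sub-layer concrete, the sketch below implements scaled dot-product attention and a multi-head wrapper in pure Python, in the standard form of Vaswani et al. (2017) that the text follows. It is an illustration under assumptions: matrices are row-major lists of lists, the helper names (`matmul`, `attention`, `multi_head`) are hypothetical, and residual connections, layer normalization, masking, and the feed-forward layer are left out.

```python
import math

def matmul(a, b):
    """Multiply two row-major list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)

def multi_head(z_dec, z_src, heads, w_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head; w_o: output projection."""
    outs = [attention(matmul(z_dec, w_q), matmul(z_src, w_k), matmul(z_src, w_v))
            for w_q, w_k, w_v in heads]
    # Concatenate the per-head outputs position by position, then project.
    concat = [sum((o[i] for o in outs), []) for i in range(len(z_dec))]
    return matmul(concat, w_o)
```

In the decoder, `z_dec` would be the output of the previous sub-layer and `z_src` one encoded input (dialogue history, caption, query, or a video modality), so each input gets its own attention block rather than sharing a single one.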
The multi-head attention allows the model to attend to text sequence features at different positions of the sequences. By using multi-head attention on visual and audio features, the model can attend to frame sequences to project and extract information from different parts of the video. Using separate attention blocks for different input components also allows the model to attend differently to each input rather than using the same attention network for all. We also experimented with concatenating the input sequences and using only one attention block in each decoding layer, similar to a Transformer decoder (see the appendix Section B).

Auto-Encoder Layers
As the multi-head attentions allow dynamic attention on different input components, the essential interaction between the input query and the non-text features of the input video is not fully implemented. While a residual connection is employed and the video attention block is placed at the end of the decoder layer, the attention on video features might not be optimal. We consider adding query-aware attention on video features as a separate component. We design it as a query auto-encoder to allow the model to focus on query-related features of the video in an unsupervised manner. The auto-encoder is composed of a stack of $N$ layers, each of which includes a query self-attention and query-aware attention on video features. Hence, the number of sub-layers is 1 + M. For self-attention, the output of the previous sub-layer $z^{ae}_{out}$ (or $z^{que}$ in the case of the first auto-encoder layer) is used identically as $q$, $k$ and $v$ in Equation 3, while for query-aware attention, $z^{ae}_{out}$ is used as $q$ and $f_m$ is used as $k$ and $v$. For the $n$-th auto-encoder layer, each output of the query-aware attention on video features, $f^{att}_{m,n}$, is passed to the video attention module of the corresponding $n$-th decoder layer. Each video attention head $i$ for a given modality $m$ at the $n$-th decoding layer is defined as:
$$h_i = \mathrm{Attn}(z^{dec}_{out} W^Q_i,\; f^{att}_{m,n} W^K_i,\; f^{att}_{m,n} W^V_i) \tag{4}$$
where Attn is the attention function of Equation 3. The decoder and auto-encoder create a network similar to the One-to-Many setting in (Luong et al., 2015), as the encoded query features are shared between the two modules. We also consider using the auto-encoder as stacked query-aware encoder layers, i.e. using query self-attention and query-based attention on video features and passing the output of the final layer at the $N$-th block to the decoder. A comparison of the performance (see Table 4, Rows C-5 and D) shows that adopting an auto-encoder architecture is more effective in capturing relevant video features.
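The data flow of one auto-encoder layer can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses single-head attention without learned projections, omits residual connections, layer normalization and feed-forward layers, and assumes the modalities are processed one after another in dictionary order (the text does not specify the order). The names `attend`, `auto_encoder_layer` and the modality keys are hypothetical.

```python
import math

def attend(q, k, v):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(k[0])
    out = []
    for q_vec in q:
        scores = [sum(a * b for a, b in zip(q_vec, k_vec)) / math.sqrt(d_k)
                  for k_vec in k]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v_vec[j] for e, v_vec in zip(exps, v))
                    for j in range(len(v[0]))])
    return out

def auto_encoder_layer(z_ae, video_feats):
    """One layer: query self-attention, then one query-aware attention per modality."""
    z_ae = attend(z_ae, z_ae, z_ae)          # q = k = v = previous sub-layer output
    f_att = {}
    for name, f_m in video_feats.items():    # e.g. {"audio": ..., "visual": ...}
        z_ae = attend(z_ae, f_m, f_m)        # q from the query side, k = v = f_m
        f_att[name] = z_ae                   # passed to the matching decoder layer
    return z_ae, f_att

# Toy inputs: 3 query tokens, 4 audio segments, 2 visual segments, d = 2.
z_que = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
feats = {"audio": [[0.5, 0.5]] * 4, "visual": [[1.0, 2.0], [1.0, 2.0]]}
z_out, f_att = auto_encoder_layer(z_que, feats)
```

One thing the sketch makes visible is that each attended feature `f_att[name]` has one row per query token rather than one row per video segment, so the decoder's video attention operates over a query-shaped summary of the video.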

Generative Network
Similar to sequence generative models (Sutskever et al., 2014; Manning and Eric, 2017), we use a linear transformation layer with a softmax function on the decoder output to predict the probabilities of the next token. In the auto-encoder, the same architecture is used to re-generate the query sequence. We use separate weight matrices for the source sequence embedding, the output embedding, and the pre-softmax linear transformation.
Simulated Token-level Decoding. Unlike training, decoding at test time is an auto-regressive process in which the decoder generates the sentence token by token. We aim to simulate this process during training with the following procedure:
• Rather than always using the full target sequence of length $L$, the token-level decoding simulation does the following:
  • With a probability $p$ (e.g. $p = 0.5$, i.e. 50% of the time), crop the target sequence at a uniformly randomly selected position $i$, where $i = 2, ..., (L-1)$, and keep the left sequence as the target sequence, e.g. sos there is just one person eos → sos there is just one
  • As before, the target sequence is offset by one position as input to the decoder
We employ this approach to reduce the mismatch between the decoder input at training and test time and hence improve the quality of the generated responses. We apply this procedure only to the target sequences fed to the decoder, not to the query auto-encoder.
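The cropping procedure above can be sketched as a small preprocessing function applied to each target sequence during training. This is a minimal sketch under stated assumptions: "crop at position i and keep the left sequence" is read as keeping the first i tokens, the one-position offset is implemented as the usual input/label shift, and the function name and the `rng` parameter are hypothetical.

```python
import random

def simulate_token_level_target(target, p=0.5, rng=random):
    """With probability p, crop the target at a uniformly sampled position i in 2..L-1.

    Returns (decoder_input, decoder_labels), offset from each other by one position.
    """
    length = len(target)
    if length >= 3 and rng.random() < p:
        i = rng.randint(2, length - 1)   # randint is inclusive on both ends
        target = target[:i]              # keep the left part (first i tokens)
    decoder_input = target[:-1]          # what the decoder sees
    decoder_labels = target[1:]          # what it must predict, shifted by one
    return decoder_input, decoder_labels

tokens = "sos there is just one person eos".split()
full_input, full_labels = simulate_token_level_target(tokens, p=0.0)  # never cropped
```

With `p=0.0` the function reduces to ordinary teacher forcing on the full sequence; with `p=1.0` every target is a fragment, which is the setting the experiments later report as harmful.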

Data
We used the dataset from DSTC7 (Yoshino et al., 2018), which consists of multi-modal dialogues grounded on the Charades videos (Sigurdsson et al., 2016). Table 1 summarizes the dataset and Figure 1 shows a training example. We used the audio and visual feature extractors pre-trained on YouTube videos and the Kinetics dataset (Kay et al., 2017) (refer to (Hori et al., 2018) for details of the video features). Specifically, we used the 2048-dimensional I3D flow features from the "Mixed 5c" layer of the I3D network (Carreira and Zisserman, 2017) for visual features and the 128-dimensional Audio Set VGGish features (Hershey et al., 2017) for audio features. We concatenated the provided caption and summary for each video from the DSTC7 dataset as the default video caption, Cap+Sum. Other data pre-processing procedures are described in the appendix Section A.1.

Training
We use the standard objective function, the log-likelihood of the target sequence $T$ given the dialogue history $H$, user query $Q$, video features $V$, and video caption $C$. The log-likelihood of the re-generated query is also added when QAE is used:
$$L = L(T) + L(Q) = \sum_{m} \log P(y_m \mid y_{m-1}, \ldots, y_1, H, Q, V, C) + \sum_{n} \log P(x_n \mid x_{n-1}, \ldots, x_1, Q, V)$$
The probability $p$ for simulating token-level decoding is 0.5. We trained each model for up to 17 epochs. We used the Adam optimizer (Kingma and Ba, 2014). The learning rate is varied over the course of training with a strategy similar to that of (Vaswani et al., 2017), with 9660 warmup steps. We employed dropout (Srivastava et al., 2014) of 0.1 at all sub-layers and embeddings. Label smoothing (Szegedy et al., 2016) is also applied during training. For all models, we select the latest checkpoint that achieves the lowest perplexity on the validation set. We used beam search with beam size 5 and a length penalty of 1.0. The maximum output length during inference is 30 tokens. All models were implemented using PyTorch (Paszke et al., 2017).
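For reference, the learning-rate schedule of Vaswani et al. (2017) increases the rate linearly during warmup and then decays it with the inverse square root of the step number. The sketch below assumes that exact formula with the Base model dimension and the 9660 warmup steps stated above; the paper only says its strategy is similar, so any additional scaling factor the authors may have used is not reflected here, and the function name is hypothetical.

```python
def transformer_lr(step, d_model=512, warmup_steps=9660):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), for step >= 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate peaks exactly at the end of warmup and decays afterwards.
peak_lr = transformer_lr(9660)
```

The two arguments of `min` are equal at `step == warmup_steps`, which is why the peak sits there.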
The results were computed based on one reference ground-truth response per test dialogue in the test set. As can be seen in Table 3, both Base and Large MTN models outperform the baseline (Hori et al., 2018) in all metrics. Our Large model outperforms the best previously reported models in the challenge across all the metrics. Even our Base model, with fewer parameters, outperforms most of the previous results, except for entry1, which we outperform in the BLEU1-3 and METEOR measures. While some of the models submitted to the challenge utilized external data or ensemble techniques (Alamri et al., 2018), we only use the given training data from the DSTC7 dataset, as does the baseline (Hori et al., 2018).
Impact of Token-level Decoding Simulation. We consider text-only dialogues (no visual or audio features) to study the impact of the token-level decoding simulation component. We also remove the auto-encoder module, i.e. MTN w/o QAE. We study how performance changes as the simulation probability $p$ varies over 0, 0.1, ..., 1. Here $p = 0$ is equivalent to always keeping the target sequences whole, and $p = 1$ crops all target sequences at random points during training. As shown in Figure 4, adding the simulation improves performance in most cases where $0 < p < 1$. At $p = 1$, performance suffers as the decoder receives only fragmented sequences during training.
Ablation Study. We tested variants of our models with different combinations of data input in Table 4. With text-only input, compared to our approach (Row B-1), using encoder layers with self-attention blocks (Row A) does not perform well. The self-attention encoders also make it hard to optimize the model, as noted by (Liu et al., 2018). When we remove the video caption from the input (hence, no caption attention layers) and use either visual or audio video features, we observe that the proposed auto-encoder with query-aware attention results in better responses. For example, with audio features, adding the auto-encoder component (Row C-1) increases the BLEU4 and CIDEr measures compared to the case where no auto-encoder is used (Row B-2). When using both caption and video features, the proposed auto-encoder (Row C-5) improves all metrics over the decoder-only model (Row B-4). We also consider using the auto-encoder structure as an encoder (i.e. without the generative component to re-generate the query) and decoupling it from the decoder stacks (i.e. the output of the $N$-th encoder layer is used as input to the 1st decoder layer) (Row D). The results show that an auto-encoder structure is superior to stacked encoder layers. Our architecture is also better in terms of computation speed, as the decoder and auto-encoder are processed in parallel, layer by layer. Results of other model variants are available in the appendix Section B.

Visual Dialogues
We also test whether MTN can generalize to other multi-modal dialogue settings. We experiment on the visually grounded dialogue task with the VisDial dataset (Das et al., 2017a). The training dataset is much larger than the DSTC7 dataset, with more than 1.2 million training dialogue turns grounded on images from the COCO dataset (Lin et al., 2014). This task aims to select a response from a set of 100 candidates rather than generating a new complete response. Here we still keep the generative component and maximize the log-likelihood of the ground-truth responses during training. During testing, we use the log-likelihood scores to rank the candidates. We also remove the positional encoding component from the encoder used for image features, as these features do not have sequential characteristics. All other components and parameters remain unchanged. We trained MTN with the Base parameters on the Visual Dialogue v1.0 training data and evaluate on the test-std v1.0 set. The image features are extracted by a pre-trained object detection model (refer to the appendix Section A.2 for data preprocessing). We evaluate our model with the Normalized Discounted Cumulative Gain (NDCG) score by submitting the predicted ranks of the response candidates to the evaluation server (as the ground truth for the test-std v1.0 split is not published). We keep all the training procedures unchanged from the video-grounded dialogue task. Table 2 shows that our proposed MTN is able to generalize to the visually grounded dialogue setting. It is interesting that our generative model outperforms other retrieval-based approaches in NDCG without any task-specific fine-tuning. There are other submissions with higher NDCG scores on the leaderboard, but the approaches of these submissions are not described in enough detail to compare with.
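Using a generative model for a ranking task, as described above, amounts to scoring each candidate response by its log-likelihood under the decoder and sorting. The sketch below shows that step together with a generic NDCG computation. It is illustrative only: the scores and relevance values are made up, whether the log-likelihood is length-normalized is not stated in the text, and the official VisDial evaluation server may define the cutoff and relevance differently from this textbook NDCG. The function names are hypothetical.

```python
import math

def rank_candidates(log_likelihoods):
    """Return candidate indices sorted from most to least likely."""
    return sorted(range(len(log_likelihoods)), key=lambda i: -log_likelihoods[i])

def ndcg(ranking, relevance, k=None):
    """Textbook NDCG of a ranking given per-candidate relevance scores."""
    k = k or len(ranking)

    def dcg(order):
        return sum(relevance[idx] / math.log2(pos + 2)
                   for pos, idx in enumerate(order[:k]))

    ideal = sorted(range(len(relevance)), key=lambda i: -relevance[i])
    best = dcg(ideal)
    return dcg(ranking) / best if best > 0 else 0.0

# Three hypothetical candidates with made-up model scores and relevance labels.
scores = [-3.2, -0.5, -1.7]
ranking = rank_candidates(scores)
quality = ndcg(ranking, relevance=[0.0, 1.0, 0.5])
```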

Qualitative Analysis
Figure 6 shows some samples of the predicted test dialogue responses of our model compared to the baseline (Hori et al., 2018). Our generated responses answer human queries more accurately than the baseline. Some of our generated responses are more elaborate, e.g. "with a cloth in her hand". Our responses can correctly describe single actions (e.g. "cleaning the table", "stays in the same place") or a series of actions (e.g. "walks over to a closet and takes off her jacket"). This shows that our MTN approach can reason over complex features coming from multiple modalities. Figure 5 summarizes the CIDEr measures of the responses generated by our Base model and the baseline (Hori et al., 2018) by their position in the dialogue, e.g. 1st ... 10th turn. It shows that our responses are better across all dialogue turns, from the 1st to the 10th. Figure 5 also shows that MTN generally performs better at shorter dialogue lengths, e.g. 1-turn, 2-turn and 3-turn, and that the performance could be further improved for longer dialogues.

Conclusion
In this paper, we showed that MTN, a multi-head attention-based neural network, can generate good conversational responses in multimodal settings. Our MTN models outperform the reported baseline and other submission entries to DSTC7. We also adapted our approach to a visual dialogue task and achieved excellent performance. A possible improvement to our work is adding pre-trained embeddings such as BERT (Devlin et al., 2018) or image-grounded word embeddings (Kiros et al., 2018) to improve the semantic understanding capability of the models.

Figure 3: 2 types of encoders are used: text-sequence encoders (left) and video encoders (right). Text-sequence encoders are used on text input, i.e. dialogue history, video caption, query, and output sequence. Video encoders are used on visual and audio features of the input video.
We train MTN models in two settings: Base and Large. The Base parameters are $N = 6$, $h = 8$, $d = 512$, $d_k = d_v = d/h = 64$, and the Large parameters are $N = 10$, $h = 16$, $d = 1024$,

Figure 4: Impact of simulation probability $p$ on the BLEU4 measure on the test data. At $p = 0.4$ to $0.6$, the improvement in BLEU4 scores is more significant.

Figure 5: Comparison of CIDEr measures on the test data between MTN (Base) and the baseline (Hori et al., 2018) across different turn positions of the generated responses. Our model outperforms the baseline at all dialogue turn positions.

Figure 6: Example test dialogue responses extracted from the ground truth (A_ref) and generated by MTN (Base) (A_ours) and the baseline (Hori et al., 2018) (A_base). For simplicity, the dialogue history is not presented and only parts of the video caption C are shown. Our model provides answers that are more accurate than the baseline, capturing a single human action or a series of actions in the videos.

Figure 2: Our MTN architecture includes 3 major components: (i) encoder layers encode text sequences and video features; (ii) decoder layers (D) project the target sequence and attend on multiple inputs; and (iii) Query-Aware Auto-Encoder layers (QAE) attend on non-text modalities from query features. For simplicity, Feed Forward, Residual Connection and Layer Normalization layers are not presented. Best viewed in color.

Table 2: Comparison of MTN (Base) to state-of-the-art visual dialogue models on the test-std v1.0. The best measure is highlighted in bold.

Table 3: Evaluated on the test data, the proposed approach achieves better objective measures than the baselines and the submissions to the challenge. The best result in each metric is highlighted in bold.

Table 4: Ablation analysis of MTN evaluated on the test data. The video features used are either VGGish for audio features (A) or I3D-Flow for visual features (V). All models are trained with the Base parameters. The best result in each metric is highlighted in bold.

Figure 6 example dialogues:
C: man walks over to laptop...he sits down and wipes and scratches his face with his hands...
Q: does he ever stand back up ?
A_ref: no , the video ends with him sitted
A_ours: no , he stays in the same place the whole time
A_base: no he stays in the same position the whole time

C: a guy is sitting at his desk writing in a notebook . he seems very enthralled with what he is writing...
Q: does he pick up a pencil at the start of the video ?
A_ref: he is already writing when the video begins
A_ours: yes he has it in his hand
A_base: yes , he is looking at his phone

C: a table is being wiped with a cloth ...someone sits down and takes a drink from a cup.
Q: what is she cleaning ?
A_ref: she is cleaning off a table
A_ours: she is cleaning the table with a cloth in her hand
A_base: she is holding a book

C: ...after she temporarily puts the paper down for a moment so she can take her jacket off ...
Q: what is happening in the video ?
A_ref: a lady walks over to a closet with papers in hand and then takes her jacket off
A_ours: a woman walks over to a closet and takes off her jacket
A_base: a man walks into the room