Learning Question-Guided Video Representation for Multi-Turn Video Question Answering

Understanding and conversing about dynamic scenes is one of the key capabilities of AI agents that navigate the environment and convey useful information to humans. Video question answering is a specific scenario of such AI-human interaction where an agent generates a natural language response to a question regarding the video of a dynamic scene. Incorporating features from multiple modalities, which often provide supplementary information, is one of the challenging aspects of video question answering. Furthermore, a question often concerns only a small segment of the video, hence encoding the entire video sequence using a recurrent neural network is not computationally efficient. Our proposed question-guided video representation module efficiently generates the token-level video summary guided by each word in the question. The learned representations are then fused with the question to generate the answer. Through empirical evaluation on the Audio Visual Scene-aware Dialog (AVSD) dataset, our proposed models in single-turn and multi-turn question answering achieve state-of-the-art performance on several automatic natural language generation evaluation metrics.


Introduction
Nowadays dialogue systems are becoming more and more ubiquitous in our lives. It is essential for such systems to perceive the environment, gather data and convey useful information to humans in an accessible fashion. Video question answering (VideoQA) systems provide a convenient way for humans to acquire visual information about the environment. If a user wants to obtain information about a dynamic scene, one can simply ask the VideoQA system a question in natural language, and the system generates a natural-language answer.

Figure 1: An example from the AVSD dataset. Each example contains a video and its associated question answering dialogue regarding the video scene.
User: Can you tell me what is happening in the video?
System: A person is packing a bag and then looking into the mirror.
User: Is the person a woman?
System: No, the person is a young man.
User: What room is this person in?
System: It looks like a bedroom or a dorm room.
User: What color are the walls?
System: The walls look like light purple.

The task of a VideoQA dialogue system in this paper is described as follows. Given a video as grounding evidence, in each dialogue turn, the system is presented a question and is required to generate an answer in natural language. Figure 1 shows an example of multi-turn VideoQA. It is composed of a video clip and a dialogue, where the dialogue contains open-ended question answer pairs regarding the scene in the video. In order to answer the questions correctly, the system needs to be effective at understanding the question, the video and the dialogue context altogether. Recent work on VideoQA has shown promising performance using multi-modal attention fusion for combination of features from different modalities (Xu et al., 2017; Zeng et al., 2017; Zhao et al., 2018; Gao et al., 2018). However, one of the challenges is that the length of the video sequence can be very long and the question may concern only a small segment in the video. Therefore, it may be time inefficient to encode the entire video sequence using a recurrent neural network.
In this work, we present the question-guided video representation module which learns 1) to summarize the video frame features efficiently using an attention mechanism and 2) to perform feature selection through a gating mechanism. The learned question-guided video representation is a compact video summary for each token in the question. The video summary and question information are then fused to create multi-modal representations. The multi-modal representations and the dialogue context are then passed as input to a sequence-to-sequence model with attention to generate the answer (Section 3). We empirically demonstrate the effectiveness of the proposed methods using the AVSD dataset (Alamri et al., 2019a) for evaluation (Section 4). The experiments show that our model for single-turn VideoQA achieves state-of-the-art performance, and our multi-turn VideoQA model shows competitive performance, in comparison with existing approaches (Section 5).

Related Work
In recent years, research on visual question answering has accelerated following the release of multiple publicly available datasets. These datasets include COCO-QA (Ren et al., 2015a), VQA (Agrawal et al., 2017), and Visual Madlibs (Yu et al., 2015) for image question answering, and MovieQA (Tapaswi et al., 2016), TGIF-QA (Jang et al., 2017), and TVQA for video question answering.

Image Question Answering
The goal of image question answering is to infer the correct answer, given a natural language question related to the visual content of an image. It assesses the system's capability of multi-modal understanding and reasoning regarding multiple aspects of humans and objects, such as their appearance, counting, relationships and interactions. State-of-the-art image question answering models make use of spatial attention to obtain a fixed-length, question-dependent embedded representation of the image, which is then combined with the question feature to predict the answer (Xu and Saenko, 2016; Kazemi and Elqursh, 2017; Anderson et al., 2018).

Video Question Answering
VideoQA is a more complex task. As a video is a sequence of images, it contains not only appearance information but also motion and transitions. Therefore, VideoQA requires spatial and temporal aggregation of image features to encode the video into a question-relevant representation. Hence, temporal frame-level attention is utilized to model the temporal dynamics, where frame-level attribute detection and unified video representation are learned jointly (Ye et al., 2017; Xu et al., 2017; Mun et al., 2017). Similarly, Faster R-CNN (Ren et al., 2015b) trained with the Visual Genome (Krishna et al., 2017) dataset has been used to detect object and attribute regions in each frame, which are used as input features to the question answering model. Previous works also adopt various forms of external memory (Sukhbaatar et al., 2015; Kumar et al., 2016; Graves et al., 2016) to store question information, which allows multiple iterations of question-conditioned inference on the video features (Na et al., 2017; Zeng et al., 2017; Gao et al., 2018; Chenyou Fan, 2019).

Video Question Answering Dialogue
Recently in DSTC7, Alamri et al. (2019a) introduce the Audio-Visual Scene-aware Dialog (AVSD) dataset for multi-turn VideoQA. In addition to the challenge of integrating the questions and the dynamic scene information, the dialogue system also needs to effectively incorporate the dialogue context for coreference resolution to fully understand the user's questions across turns. To this end, Alamri et al. (2019b) use the two-stream inflated 3D ConvNet (I3D) model (Carreira and Zisserman, 2017) to extract spatiotemporal visual frame features (I3D-RGB features for RGB input and I3D-flow features for optical flow input), and propose the Naïve Fusion method to combine multi-modal inputs based on the hierarchical recurrent encoder (HRE) architecture (Das et al., 2017). Hori et al. (2018) extend the Naïve Fusion approach and propose the Attentional Fusion method, which learns multi-modal attention weights to fuse features from different modalities. Zhuang et al. (2019) also explore various attention mechanisms to incorporate the different modal inputs, such as hierarchical attention (Libovickỳ and Helcl, 2017) and cross attention. For modeling visual features, dynamic memory networks (Kumar et al., 2016) have been proposed, and Nguyen et al. (2019) propose to use feature-wise linear modulation layers (Perez et al., 2018).

Approach
We formulate the multi-turn VideoQA task as follows. Given a sequence of raw video frames $f$, the embedded question sentence $x = \{x_1, \ldots, x_K\}$ and the single concatenated embedded sentence of the dialogue context $d = \{d_1, \ldots, d_M\}$, the output is an answer sentence $y = \{y_1, \ldots, y_N\}$. The architecture of our proposed approach is illustrated in Figure 2. First, the Video Frame Feature Extraction Module extracts the I3D-RGB frame features from the video frames (Section 3.1). The Question-Guided Video Representation Module takes as input the embedded question sentence and the I3D-RGB features, and generates a compact video representation for each token in the question sentence (Section 3.2). In the Video-Augmented Question Encoder, the question tokens are first augmented by their corresponding per-token video representations and then encoded by a bidirectional LSTM (Section 3.3). Similarly, in the Dialogue Context Encoder, the dialogue context is encoded by a bidirectional LSTM (Section 3.4). Finally, in the Answer Decoder, the outputs from the Video-Augmented Question Encoder and the Dialogue Context Encoder are used as attention memory for the LSTM decoder to predict the answer sentence (Section 3.5). Our encoders and decoder work in the same way as the multi-source sequence-to-sequence models with attention (Zoph and Knight, 2016; Firat et al., 2016).

Video Frame Feature Extraction Module
In this work, we make use of the I3D-RGB frame features as the visual modality input, which are pre-extracted and provided in the AVSD dataset (Alamri et al., 2019a). Here we briefly describe the I3D-RGB feature extraction process, and we refer readers to Carreira and Zisserman (2017) for more details of the I3D model. Two-stream Inflated 3D ConvNet (I3D) is a state-of-the-art action recognition model which operates on video inputs. The I3D model takes as input two streams of video frames: RGB frames and optical flow frames. Each stream is passed to its own 3D ConvNet, which is inflated from 2D ConvNets to incorporate the temporal dimension. The two 3D ConvNets produce two sequences of spatiotemporal features, which are jointly used to predict the action class. The I3D-RGB features provided in the AVSD dataset are intermediate spatiotemporal representations from the "Mixed 5c" layer of the RGB stream's 3D ConvNet. The AVSD dataset uses the I3D model parameters pre-trained on the Kinetics dataset (Kay et al., 2017). To reduce the number of parameters in our model, we use a trainable linear projection layer to reduce the dimensionality of the I3D-RGB features from 2048 to 256. Extracted from the video frames $f$ and projected to a lower dimension, the sequence of dimension-reduced I3D-RGB frame features is denoted by $r = \{r_1, \ldots, r_L\}$, where $r_i \in \mathbb{R}^{256}$, $\forall i$.

Question-Guided Video Representation Module
We use a bidirectional LSTM network to encode the sequence of question token embeddings $x = \{x_1, \ldots, x_K\}$. The token-level intermediate representations are denoted by $x^{tok} = \{x^{tok}_1, \ldots, x^{tok}_K\}$, and the embedded representation of the entire question is denoted by $x^{sen}$. These outputs will be used to guide the video representation.
In this encoder, $\oplus$ denotes vector concatenation, and $\overrightarrow{h}$ and $\overleftarrow{h}$ represent the local forward and backward LSTM hidden states.
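A bidirectional encoder of this kind can be written out as the sketch below. It is only a reading consistent with the definitions above: the cell names $\mathrm{LSTM}^{forw}_{guide}$ and $\mathrm{LSTM}^{back}_{guide}$, the zero initial states, and the choice of the two final states for $x^{sen}$ are assumptions rather than details stated in the text.

```latex
\begin{align}
\overrightarrow{h}_0 &= \overleftarrow{h}_{K+1} = 0 \\
\overrightarrow{h}_k &= \mathrm{LSTM}^{forw}_{guide}(x_k, \overrightarrow{h}_{k-1}) \\
\overleftarrow{h}_k &= \mathrm{LSTM}^{back}_{guide}(x_k, \overleftarrow{h}_{k+1}) \\
x^{tok}_k &= \overrightarrow{h}_k \oplus \overleftarrow{h}_k \quad \forall k \in \{1, \ldots, K\} \\
x^{sen} &= \overrightarrow{h}_K \oplus \overleftarrow{h}_1
\end{align}
```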

Per-Token Visual Feature Summarization
Generally the sequence length of the video frame features is quite large, as shown in Table 1. Therefore it is not computationally efficient to encode the video features using a recurrent neural network. We propose to use the attention mechanism to generate a context vector to efficiently summarize the I3D-RGB features. We use the trilinear function as a similarity measure to identify the frames most similar to the question tokens. For each question token $x_k$, we compute the similarity scores of its encoded representation $x^{tok}_k$ with each of the I3D-RGB features $r$. The similarity scores $s_k$ are converted to an attention distribution $w^{att}_k$ over the I3D-RGB features by the softmax function. The video summary $v_k$ corresponding to the question token $x_k$ is defined as the attention-weighted linear combination of the I3D-RGB features. We also explored using the dot product for computing similarity and found empirically that it yields suboptimal results.
Here $\odot$ denotes element-wise multiplication, and $W_{sim}$ is a trainable variable.
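The summarization step can be illustrated with a small NumPy sketch. It assumes the common trilinear form $s_{k,l} = W_{sim}[x^{tok}_k \oplus r_l \oplus (x^{tok}_k \odot r_l)]$ and assumes that token encodings and frame features share the same dimension $d$, which the element-wise product requires; the function names, toy sizes and random inputs are illustrative and not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def per_token_summary(x_tok, r, w_sim):
    """x_tok: (K, d) token encodings; r: (L, d) frame features; w_sim: (3d,)."""
    d = x_tok.shape[1]
    w_x, w_r, w_xr = w_sim[:d], w_sim[d:2 * d], w_sim[2 * d:]
    # Trilinear score s[k, l] = w_sim . [x_k ; r_l ; x_k * r_l], split into
    # three terms so that no (K, L, 3d) tensor has to be materialised.
    s = (x_tok @ w_x)[:, None] + (r @ w_r)[None, :] + (x_tok * w_xr) @ r.T
    w_att = softmax(s, axis=1)  # (K, L): one distribution over frames per token
    v = w_att @ r               # (K, d): attention-weighted frame summaries
    return v, w_att

rng = np.random.default_rng(0)
K, L, d = 5, 120, 8             # toy sizes (assumed), not the paper's
x_tok = rng.normal(size=(K, d))
r = rng.normal(size=(L, d))
w_sim = rng.normal(size=3 * d)

v, w_att = per_token_summary(x_tok, r, w_sim)
print(v.shape, w_att.shape)
```

Every score is computed with matrix products over all frames at once, so nothing in this step depends on processing the frames sequentially.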

Visual Feature Gating
Not all details in the video are important for answering a question. Attention helps in discarding the unimportant frames in the time dimension. We propose a gating mechanism which enables us to perform feature selection within each frame. We project the sentence-level question representation $x^{sen}$ through fully-connected layers with ReLU nonlinearity to generate a gate vector $g$. For each question token $x_k$, its corresponding video summary $v_k$ is then multiplied element-wise with the gate vector $g$ to generate a gated visual summary $v^g_k$. We also experimented with applying gating on the dimension-reduced I3D-RGB features $r$, prior to the per-token visual feature summarization step, but it resulted in inferior performance.
$$g = \mathrm{sigmoid}(W_{g,1}\,\mathrm{ReLU}(W_{g,2}\,x^{sen} + b_{g,2}) + b_{g,1})$$
$$v^g_k = v_k \odot g \quad \forall k \in \{1, \ldots, K\}$$
where $W_{g,1}$, $b_{g,1}$, $W_{g,2}$, $b_{g,2}$ are trainable variables.
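As a rough illustration of the gating step, the NumPy sketch below builds a gate from a sentence-level vector and applies it to a set of per-token summaries. The layer sizes, the random parameters and the stand-in summaries are assumptions chosen only to make the example run.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def question_gate(x_sen, W_g2, b_g2, W_g1, b_g1):
    """Map the sentence-level question vector to a gate with entries in [0, 1]."""
    hidden = np.maximum(0.0, W_g2 @ x_sen + b_g2)  # inner layer with ReLU
    return sigmoid(W_g1 @ hidden + b_g1)           # outer layer with sigmoid

rng = np.random.default_rng(0)
K, d, d_sen, d_hid = 5, 8, 16, 12                  # toy sizes (assumed)
x_sen = rng.normal(size=d_sen)
W_g2, b_g2 = rng.normal(size=(d_hid, d_sen)), np.zeros(d_hid)
W_g1, b_g1 = rng.normal(size=(d, d_hid)), np.zeros(d)

g = question_gate(x_sen, W_g2, b_g2, W_g1, b_g1)   # shape (d,)
v = rng.normal(size=(K, d))                        # stand-in per-token summaries
v_gated = v * g                                    # one gate broadcast over all K tokens
print(g.shape, v_gated.shape)
```

The gate depends only on the whole question, so the same feature dimensions are emphasised or suppressed for every token's summary.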

Video-Augmented Question Encoder
Given the sequence of per-token gated visual summaries $v^g = \{v^g_1, \ldots, v^g_K\}$, we augment the question features by concatenating the embedded question tokens $x = \{x_1, \ldots, x_K\}$ with their associated per-token video summaries. The augmented question features are then encoded using a bidirectional LSTM. The token-level video-augmented question features are denoted by $q^{tok} = \{q^{tok}_1, \ldots, q^{tok}_K\}$, and the sentence-level feature is denoted by $q^{sen}$.
In this encoder, $\overrightarrow{h}$ and $\overleftarrow{h}$ represent the local forward and backward LSTM hidden states.

Dialogue Context Encoder
Similar to the video-augmented question encoder, we encode the embedded dialogue context tokens $d = \{d_1, \ldots, d_M\}$ using a bidirectional LSTM. The embedded token-level representations are denoted by $d^{tok} = \{d^{tok}_1, \ldots, d^{tok}_M\}$.
In this encoder, $\overrightarrow{h}$ and $\overleftarrow{h}$ represent the local forward and backward LSTM hidden states.

Answer Decoder
The final states of the forward and backward LSTM units of the question encoder are used to initialize the state of the answer decoder. Let $y_n$ be the output of the decoder at step $n$, where $1 \le n \le N$, let $y_0$ be the special start-of-sentence token, and let $y^{emb}_n$ be the embedded representation of $y_n$. At decoder step $n$, the previous decoder hidden state $h_{n-1}$ is used to attend over $q^{tok}$ and $d^{tok}$ to get the attention vectors $h^{att,q}_n$ and $h^{att,d}_n$ respectively. These two vectors retrieve the relevant features from the intermediate representations of the video-augmented question encoder and the dialogue context encoder, both of which are useful for generating the next token of the answer. At each decoder step, the decoder hidden state $h_n$ is used to generate a distribution over the vocabulary. The decoder output $y^*_n$ is defined to be $\arg\max_{y_n} p(y_n \mid y_{\le n-1})$. Here $h$ represents the local LSTM hidden states, and $W_{ans,q}$, $W_{ans,d}$, $W_{ans}$, $b_{ans}$ are trainable variables.
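One decoder step as described above might be sketched as follows. This is a hedged reading, not a statement of the exact model: the generic $\mathrm{Attention}(\cdot)$ operator stands in for the attention scoring function, and the initialization $h_0 = q^{sen}$ and the way the two attention vectors are concatenated with the previous token embedding as LSTM input are assumptions.

```latex
\begin{align}
h_0 &= q^{sen} \\
h^{att,q}_n &= \mathrm{Attention}(h_{n-1}, q^{tok}; W_{ans,q}) \\
h^{att,d}_n &= \mathrm{Attention}(h_{n-1}, d^{tok}; W_{ans,d}) \\
h_n &= \mathrm{LSTM}_{ans}(y^{emb}_{n-1} \oplus h^{att,q}_n \oplus h^{att,d}_n,\; h_{n-1}) \\
p(y_n \mid y_{\le n-1}) &= \mathrm{softmax}(W_{ans} h_n + b_{ans}) \\
y^*_n &= \arg\max_{y_n} p(y_n \mid y_{\le n-1})
\end{align}
```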

Dataset
We consider the Audio-Visual Scene-aware Dialog (AVSD) dataset (Alamri et al., 2019a) for evaluating our proposed model in single-turn and multi-turn VideoQA. We use the official release of the train set for training, and the public (i.e., prototype) validation and test sets for inference. The AVSD dataset is a collection of text-based human-human question answering dialogues based on the video clips from the CHARADES dataset (Sigurdsson et al., 2016). The CHARADES dataset contains video clips of daily indoor human activities, originally purposed for research in video activity classification and localization. Along with the video clips and associated question answering dialogues, the AVSD dataset also provides the pre-extracted I3D-RGB visual frame features using a pre-trained two-stream inflated 3D ConvNet (I3D) model (Carreira and Zisserman, 2017). The pre-trained I3D model was trained on the Kinetics dataset (Kay et al., 2017) for human action recognition.

Table 1: Data statistics of the AVSD dataset. We use the official training set, and the public (i.e., prototype) validation and test sets. We also present the average length of the question token sequences and the I3D-RGB frame feature sequences to highlight the importance of time efficient video encoding without using a recurrent neural network. The sequence lengths of the questions and I3D-RGB frame features are denoted by K and L respectively in the model description (Section 3).

In Table 1, we present the statistics of the AVSD dataset. Given that the I3D-RGB frame feature sequences are more than 20 times longer than the questions, using a recurrent neural network to encode the visual feature sequences would be very time consuming, as the visual frames are processed sequentially. Our proposed question-guided video representation module summarizes the video sequence efficiently: it aggregates the visual features by question-guided attention and weighted summation, and performs gating with a question-guided gate vector, both of which can be done in parallel across all frames.

Experimental Setup
We implement our models using the Tensor2Tensor framework (Vaswani et al., 2018). The question and dialogue context tokens are both embedded with the same randomly-initialized word embedding matrix, which is also shared with the answer decoder's output embedding. The dimension of the word embedding is 256, the same dimension to which the I3D-RGB features are transformed. All of our LSTM encoders and decoder have 1 hidden layer. The Bahdanau attention mechanism (Bahdanau et al., 2015) is used in the answer decoder. During training, we apply a dropout rate of 0.2 in the encoder and decoder cells. We use the ADAM optimizer (Kingma and Ba, 2015) with $\alpha = 2 \times 10^{-4}$, $\beta_1 = 0.85$, $\beta_2 = 0.997$, $\epsilon = 10^{-6}$, and clip the gradient with L2 norm threshold 2.0 (Pascanu et al., 2013). The models are trained up to 100K steps with early stopping on the validation BLEU-4 score using batch size 1024 on a single GPU. During inference, we use beam search decoding with beam width 3. We experimented with word embedding dimension {256, 512}, dropout rate {0, 0.2}, Luong and Bahdanau attention mechanisms, and {1, 2} hidden layer(s) for both encoders and the decoder. We found the aforementioned setting worked best for most models.
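For reference, the reported setting can be collected into a single configuration object, as in the sketch below. The class and field names are hypothetical and do not correspond to Tensor2Tensor hyperparameter names; only the values come from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    embedding_dim: int = 256
    video_feature_dim: int = 256   # I3D-RGB features projected down from 2048
    lstm_layers: int = 1
    attention: str = "bahdanau"
    dropout: float = 0.2
    learning_rate: float = 2e-4    # Adam alpha
    adam_beta1: float = 0.85
    adam_beta2: float = 0.997
    adam_epsilon: float = 1e-6
    grad_clip_norm: float = 2.0    # L2 norm threshold
    max_steps: int = 100_000
    batch_size: int = 1024
    beam_width: int = 3            # used at inference only

cfg = TrainConfig()
print(cfg)
```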

Comparison with Existing Methods
We evaluate our proposed approach using the same natural language generation evaluation toolkit NLGEval (Sharma et al., 2017) as the previous approaches. We report the corpus-wide scores of the following unsupervised automated metrics: BLEU-1 through BLEU-4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin and Och, 2004) and CIDEr (Vedantam et al., 2015). The results of our models in comparison with the previous approaches are shown in Table 2. We report the mean and standard deviation scores of 5 runs using random initialization and early stopping on the public (prototype) validation set. We apply our model in two scenarios: single-turn and multi-turn VideoQA. The only difference is that in single-turn VideoQA, the dialogue context encoder is excluded from the model.
First, we observe that our proposed multi-turn VideoQA model significantly outperforms the single-turn VideoQA model. This suggests that the additional dialogue context input provides information supplementary to the question and visual features, and thus is helpful for generating the correct answer. Second, comparing the single-turn VideoQA models, our approach outperforms the existing approaches across all automatic evaluation metrics. This suggests the effectiveness of our proposed question-guided video representations for VideoQA. When comparing with previous multi-turn VideoQA models, our approach that uses the dialogue context (questions and answers in previous turns) yields state-of-the-art performance on the BLEU-3, BLEU-4, ROUGE-L and CIDEr metrics and competitive results on BLEU-1, BLEU-2 and METEOR. It is worth mentioning that our model does not use pre-trained word embeddings or audio features as in the previous hierarchical attention approach (Le et al., 2019).

Table 2: Comparison with existing approaches: Naïve Fusion (Alamri et al., 2019b; Zhuang et al., 2019), Attentional Fusion (Hori et al., 2018; Zhuang et al., 2019), Multi-Source Sequence-to-Sequence model (Pasunuru and Bansal, 2019), Modified Attentional Fusion with Maximum Mutual Information objective (Zhuang et al., 2019) and Hierarchical Attention with pre-trained embedding (Le et al., 2019), on the AVSD public test set. For each approach, we report its corpus-wide scores on BLEU-1 through BLEU-4, METEOR, ROUGE-L and CIDEr. We report the mean and standard deviation scores of 5 runs using random initialization and early stopping on the public (prototype) validation set.

Ablation Study and Weights Visualization
We perform ablation experiments on the validation set in the multi-turn VideoQA scenario to analyze the effectiveness of the two techniques in the question-guided video representation module. The results are shown in Table 3.

Question-Guided Per-Token Visual Feature Summarization (TokSumm)
Instead of using the token-level question representations $x^{tok} = \{x^{tok}_1, \ldots, x^{tok}_K\}$ to generate the per-token video summary $v = \{v_1, \ldots, v_K\}$, we experiment with using the sentence-level representation of the question $x^{sen}$ as the query vector to attend over the I3D-RGB visual features to create a single visual summary $v$, and use $v$ to augment each of the question tokens in the video-augmented question encoder. The similarity scores are computed as $s_l = \mathrm{trilinear}(x^{sen}, r_l)$, $\forall l \in \{1, \ldots, L\}$. We observe that the performance degrades when the sentence-level video summary is used instead of the token-level video summary. Figure 3 shows an example of the attention weights in the question-guided per-token visual feature summarization. We can see that for different question tokens, the attention weights shift to focus on different segments in the sequence of the video frame features.

For the gating ablation, the non-gated video summary is used to augment the question information in the video-augmented question encoder. We observe that the model's performance declines when the question-guided gating is not applied on the video summary feature. Removing both the per-token visual feature summarization and the gating mechanism results in further degradation in the model performance. Figure 4 illustrates the question-guided gate weights $g$ of several example questions. We observe that the gate vectors corresponding to questions regarding similar subjects assign weights on similar dimensions of the visual feature. Although many of the visual feature dimensions have low weights across different questions, the feature dimensions with higher gate weights still exhibit certain topic-specific patterns.

Conclusion and Future Work
In this paper, we present an end-to-end trainable model for single-turn and multi-turn VideoQA. Our proposed framework takes the question, I3D-RGB video frame features and dialogue context as input. Using the question information as guidance, the video features are summarized as compact representations to augment the question information, which are jointly used with the dialogue context to generate a natural language answer to the question. Specifically, our proposed question-guided video representation module is able to summarize the video features efficiently for each question token using an attention mechanism and perform feature selection through a gating mechanism. In empirical evaluation, our proposed models for single-turn and multi-turn VideoQA outperform existing approaches on several automatic natural language generation evaluation metrics. Our analyses show that the model effectively attends to relevant frames in the video feature sequence for summarization, and that the gating mechanism exhibits topic-specific patterns in the feature dimension selection within a frame. In future work, we plan to extend the models to incorporate audio features and experiment with more advanced techniques to incorporate the dialogue context with the question and video information, such as hierarchical attention and co-attention mechanisms. We also plan to apply our model to TVQA, a larger scale VideoQA dataset.