TVQA: Localized, Compositional Video Question Answering

Recent years have witnessed an increasing interest in image-based question-answering (QA) tasks. However, due to data limitations, there has been much less work on video-based QA. In this paper, we present TVQA, a large-scale video QA dataset based on 6 popular TV shows. TVQA consists of 152,545 QA pairs from 21,793 clips, spanning over 460 hours of video. Questions are designed to be compositional in nature, requiring systems to jointly localize relevant moments within a clip, comprehend subtitle-based dialogue, and recognize relevant visual concepts. We provide analyses of this new dataset as well as several baselines and a multi-stream end-to-end trainable neural network framework for the TVQA task. The dataset is publicly available at http://tvqa.cs.unc.edu.


Introduction
Now that algorithms have started to produce relevant and realistic natural language that can describe images and videos, we would like to understand what these models truly comprehend. The Visual Question Answering (VQA) task provides a nice tool for fine-grained evaluation of such multimodal algorithms. VQA systems take as input an image (or video) along with relevant natural language questions, and produce answers to those questions. By asking algorithms to answer different types of questions, ranging from object identification, counting, or appearance, to more complex questions about interactions, social relationships, or inferences about why or how something is occurring, we can evaluate different aspects of a model's multimodal semantic understanding.
With video-QA in particular, as opposed to image-QA, the video itself often comes with associated natural language in the form of (subtitle) dialogue. We argue that this is an important area to study because it reflects the real world, where people interact through language, and where many computational systems like robots or other intelligent agents will ultimately have to operate. As such, systems will need to combine information from what they see with what they hear, to pose and answer questions about what is happening.
We aim to provide a dataset that merges the best qualities from all of the previous datasets as well as focus on multimodal compositionality. In particular, we collect a new large-scale dataset that is built on natural video content with rich dynamics and realistic social interactions, where question-answer pairs are written by people observing both videos and their accompanying dialogues, encouraging the questions to require both vision and language understanding to answer. To further encourage this multimodal-QA quality, we ask people to write compositional questions consisting of two parts, a main question part, e.g. "What are Leonard and Sheldon arguing about" and a grounding part, e.g. "when they are sitting on the couch". This also leads to an interesting secondary task of QA temporal localization.

Figure 1: Examples from the TVQA dataset. All questions and answers are attached to 60-90 seconds long clips. For visualization purposes, we only show a few of the most relevant frames here. As illustrated above, some questions can be answered using subtitles or videos alone, while some require information from both modalities. Example questions shown in the figure:
What is on the couch behind Joey when he is at the counter? (A) A chick (B) A soccer ball (C) A duck (D) A pillow (E) Janice's coat
What is Janice holding on to after Chandler sends Joey to his room? (A) Chandler's tie (B) Chandler's hands (C) Her breakfast (D) Her coat (E) Chandler's coffee cup
Why does Joey want Chandler to kiss Janice when they are in the kitchen? (A) Because Joey is glad that Chandler is happy (B) Because Joey likes to watch people kiss (C) Because then she will leave (D) Because Joey thinks Janice is hot (E) Because then Chandler will move away from the toast
Our contribution is the TVQA dataset, built on 6 popular TV shows spanning 3 genres: medical dramas, sitcoms, and crime shows. On this data, we collected 152.5K human-written QA pairs (examples shown in Fig. 1). There are 4 salient advantages of our dataset. First, it is large-scale and natural, containing 21,793 video clips from 925 episodes. On average, each show has 7.3 seasons, providing long-range character interactions and evolving relationships. Each video clip is associated with 7 questions, with 5 answers (1 correct) for each question. Second, our video clips are relatively long (60-90 seconds), thereby containing more social interactions and activities, making video understanding more challenging. Third, we provide the dialogue (character name + subtitle) for each QA video clip. Understanding the relationship between the provided dialogue and the question-answer pairs is crucial for correctly answering many of the collected questions. Fourth, our questions are compositional, requiring algorithms to localize relevant moments (START and END points are provided for each question).
With the above rich annotation, our dataset supports three tasks: QA on the grounded clip, question-driven moment localization, and QA on the full video clip. We provide baseline experiments on both QA tasks and introduce a state-of-the-art language and vision-based model (leaving moment localization for future work).
With the same goal in mind, Rajpurkar et al. (2016) introduced the SQuAD dataset, but their answers are specific spans from long passages. Other work designed a set of tasks with automatically generated QAs to evaluate the textual reasoning ability of artificial agents, and Hermann et al. (2015) and Hill et al. (2015) constructed cloze datasets on top of an existing corpus. While questions in these text QA datasets are specifically designed for language understanding, TVQA questions require both vision understanding and language understanding. Although methods developed for text QA are not directly applicable to TVQA tasks, they can provide inspiration for designing suitable models.

Natural Language Object Retrieval: Language grounding addresses the task of object or moment localization in an image or video from a natural language description. For image-based object grounding, there has been much work on phrase grounding (Plummer et al., 2015; Wang et al., 2016b) and referring expression comprehension (Yu et al., 2016; Nagaraja et al., 2016; Yu et al., 2017, 2018b). Recent work (Vasudevan et al., 2018) extends the grounding task to the video domain. Most recently, moment localization was proposed in (Hendricks et al., 2017; Gao et al., 2017), where the goal is to localize a short moment from a long video sequence given a query description. Accurate temporal grounding is a necessary step to answering our compositional questions.

Dataset Collection
We collected our dataset on 6 long-running TV shows from 3 genres: 1) sitcoms: The Big Bang Theory, How I Met Your Mother, Friends, 2) medical dramas: Grey's Anatomy, House, 3) crime drama: Castle. There are in total 925 episodes spanning 461 hours. Each episode was then segmented into short clips. We first created clips every 60/90 seconds, then shifted temporal boundaries to avoid splitting subtitle sentences between clips. Shows that are mainly conversation-based, e.g., The Big Bang Theory, were segmented into 60-second clips, while shows that are less cerebral, e.g. Castle, were segmented into 90-second clips. In the end, 21,793 clips were prepared for QA collection, accompanied by subtitles and aligned with transcripts to add character names. A sample clip is shown in Fig. 1.
Amazon Mechanical Turk was used for VQA collection on video clips, where workers were presented with both videos and aligned named subtitles, to encourage multimodal questions requiring both vision and language understanding to answer. Workers were asked to create questions using a compositional-question format: [What/How/Where/Why/...] ___ [when/before/after] ___. The second part of each question serves to localize the relevant video moment within a clip, while the first part poses a question about that moment. This compositional format also serves to encourage questions that require both visual and language understanding to answer, since people often naturally use visual signals to ground questions in time, e.g. "What was House saying before he leaned over the bed?" During data collection, we only used prompt words (when/before/after) to encourage workers to propose the desired, complex compositional questions. There were no additional template constraints. Therefore, most of the language in the questions is relatively free-form and complex.
Ultimately, workers posed 7 different questions for each video clip. For each question, we asked workers to annotate the exact video portion required to answer the question by marking the START and END timestamps as in Krishna et al. (2017). In addition, they provided 1 correct and 4 wrong answers for each question. Workers were paid $1.30 for a single video clip annotation. The whole collection process took around 3 months.
To ensure the quality of the questions and answers, we set up an online checker in our collection interface to verify the question format, allowing only questions that reflect our two-step format to be submitted. The collection was done in batches of 500 videos. For each harvested batch, we sampled 3 pairs of submitted QAs from each worker and checked the semantic correctness of the questions, answers, and timestamps.
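As an illustration of the kind of format check described above, the following is a minimal sketch and not the checker used in the collection interface. The regular expression, the list of accepted question words, and the function name `check_question` are assumptions made for this example; the text only states that submitted questions must follow the two-part format with a when/before/after prompt word.

```python
import re

# Assumed set of leading question words; the source lists "What/How/Where/Why/...".
QUESTION_WORDS = ("what", "how", "where", "why", "who", "which")

# Hypothetical pattern: <question word> <main part> <when|before|after> <grounding part>[?]
_PATTERN = re.compile(
    r"^(?P<qword>\w+)\s+(?P<main>.+?)\s+(?P<prompt>when|before|after)\s+(?P<ground>.+?)\??$",
    re.IGNORECASE,
)


def check_question(question: str):
    """Return (main_part, prompt_word, grounding_part), or None if the
    question does not follow the two-part compositional format."""
    match = _PATTERN.match(question.strip())
    if match is None or match.group("qword").lower() not in QUESTION_WORDS:
        return None
    main_part = f"{match.group('qword')} {match.group('main')}"
    return main_part, match.group("prompt").lower(), match.group("ground")


if __name__ == "__main__":
    print(check_question("What was House saying before he leaned over the bed?"))
    print(check_question("What is Joey eating?"))  # no grounding part
```

The split is made at the first prompt word that appears as a separate token, so a question whose main part itself contains "when", "before", or "after" would be divided earlier than intended; a production checker would need to handle that case.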

Dataset Analysis
Multiple Choice QAs: Our QAs are multiple choice questions with 5 candidate answers for each question, of which only one is correct. Table 1 provides statistics of the QAs based on the first question word. On average, our questions contain 13.5 words, which is fairly long compared to other datasets. In general, correct answers tend to be slightly longer than wrong answers. Fig. 2 shows the distribution of different question types. Note that "what" (Abstract, Object, Action), "who" (Person), "why" (Reasoning) and "where" (Location) questions form a large part of our data.

Figure 2: Distribution of question types based on answer types.

The negative answers in TVQA are written by human annotators, who are instructed to write false but relevant answers to make the negatives challenging. Alternative methods include sampling negative answers from other questions' correct answers, either based on semantic similarity (Das et al., 2017; Jang et al., 2017) or randomly (Antol et al., 2015; Das et al., 2017). The former is prone to introducing paraphrases of the ground-truth answer. The latter avoids the problem of paraphrasing, but generally produces irrelevant negative choices. We show in Table 8 that our human-written negatives are more challenging than randomly sampled negatives.

Moment Localization: The second part of our question is used to localize the most relevant video portion to answer the question. The prompts "when", "after", and "before" account for 60.03%, 30.19% and 9.78% of our dataset, respectively. TVQA provides the annotated START and END timestamps for each QA. We show the annotated segment lengths in Fig. 3. We found most of the questions rely on relatively short moments (less than 15 secs) within a longer clip (60-90 secs).

Differences among our 6 TV Shows: The videos used in our dataset are from 6 different TV shows. [...] contains "game" and "laptop" while HIMYM (How I Met Your Mother) contains "bar" and "beer", indicating the different major activities and topics in each show. Additionally, questions about different characters also mention different words, as shown in Table 4.

Q. Src. = Question Sources, indicating where the questions are raised from. The TVQA dataset is unique since its questions are based on both text and video, with additional timestamp annotation for each of them. It is also significantly larger than previous datasets in terms of total length of videos.

Comparison with Other Datasets: Table 5 presents a comparison of our dataset to some recently proposed video question answering datasets. In terms of total length of videos, TVQA is the largest, with a total of 461.2 hours of videos.
MovieQA (Tapaswi et al., 2016) is most similar to our dataset, with both multiple choice questions and timestamp annotation. However, their questions and answers are constructed by people posing questions from a provided plot summary, then later aligned to the video clips, which makes most of their questions text oriented.
Human Evaluation on Usefulness of Video and Subtitle in Dataset: To gain a better understanding of the roles of videos and subtitles in our dataset, we perform a human study, asking different groups of workers to complete the QA task while observing different sources (subsets) of information:
• Question only.
• Video and Question.
• Subtitle and Question.
• Video, Subtitle, and Question.
We made sure that the workers who had written the questions did not participate in this study and that workers saw only one of the above settings for answering each question. Human accuracy on our test set under these 4 settings is reported in Table 5. As expected, compared to human accuracy based only on question-answer pairs (Q), adding videos (V+Q) or subtitles (S+Q) significantly improves human performance. Adding both videos and subtitles (V+S+Q) brings the accuracy to 89.41%. This indicates that in order to answer the questions correctly, both visual and textual understanding are essential. We also observe that workers obtain 31.84% accuracy given question-answer pairs only, which is higher than random guessing (20%). We ascribe this to people's prior knowledge about the shows. Note that timestamp annotations are not provided in these experiments.

Methods
We introduce a multi-stream end-to-end trainable neural network for Multi-Modal Video Question Answering. Fig. 4 gives an overview of our model. Formally, we define the inputs to the model as: a 60-90 second video clip $V$, a subtitle $S$, a question $q$, and five candidate answers $\{a_i\}_{i=0}^{4}$.

Video Features
Frames are extracted at 3 fps. We run Faster R-CNN (Ren et al., 2015b) trained on Visual Genome (Krishna et al., 2017) to detect object and attribute regions in each frame. Both regional features and predicted detection labels can be used as model inputs. We also use ResNet101 (He et al., 2016) trained on ImageNet (Deng et al., 2009) to extract whole image features.

Figure 4: Illustration of our multi-stream model for Multi-Modal Video QA. Our full model takes different contextual sources (regional visual features, visual concept features, and subtitles) along with question-answer pair as inputs to each stream. For brevity, we only show regional visual features (upper) and subtitle (bottom) streams.

Figure 5: Faster R-CNN detection example. Detected labels in the example frame: brown door, gold sign, red sign, woman, white shorts, green sweater, man, blue shirt, white basket, woman, gray pants, gray door, standing man, gray shirt, black pants. The detected object labels and attributes can be viewed as a description of the frame, which is potentially helpful for answering a visual question.

Regional Visual Features: On average, our videos contain 229 frames, with 16 detections per frame. It is not trivial to model such long sequences. For simplicity, we follow (Anderson et al., 2018; Karpathy and Fei-Fei, 2015) in selecting the top-K regions from each detected label across all frames (based on cross-validation, we find K=6 to perform best). Their regional features are L2-normalized and stacked together to form our visual representation $V^{reg} \in \mathbb{R}^{n_{reg} \times 2048}$. Here $n_{reg}$ is the number of selected regions.

Visual Concept Features: Recent work (Yin and Ordonez, 2017) found that using detected object labels as input to an image captioning system gave comparable performance to using CNN features directly. Inspired by this work, we also experiment with using detected labels as visual inputs. As shown in Fig. 5, we are able to detect rich visual concepts, including both objects and attributes, e.g. "white basket", which could be used to answer "What is Sheldon holding in his hand when everyone is at the door". We first gather detected concepts over all the frames to represent concept presence. After removing duplicate concepts, we use GloVe (Pennington et al., 2014) to embed the words. The resulting video representation is denoted as $V^{cpt} \in \mathbb{R}^{n_{cpt} \times 300}$, where $n_{cpt}$ is the number of unique concepts.

ImageNet Features: We extract the pooled 2048D feature of the last block of ResNet101. Features from the same video clip are L2-normalized and stacked, denoted as $V^{img} \in \mathbb{R}^{n_{img} \times 2048}$, where $n_{img}$ is the number of frames extracted from the video clip.
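The regional feature construction (top-K regions per detected label, L2 normalization, stacking) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the input layout `(label, score, feature)`, ranking regions by detection score, and the function names are all hypothetical, and plain Python lists stand in for 2048-dimensional feature arrays.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def select_top_k_regions(detections, k=6):
    """detections: iterable of (label, score, feature) pooled over all frames.

    Keeps the k highest-scoring regions for each label (ranking by detection
    score is an assumption) and returns their L2-normalized features stacked
    as a list of rows, i.e. an n_reg x feature_dim matrix.
    """
    by_label = defaultdict(list)
    for label, score, feature in detections:
        by_label[label].append((score, feature))

    stacked = []
    for label in sorted(by_label):  # fixed label order for reproducibility
        top = sorted(by_label[label], key=lambda item: item[0], reverse=True)[:k]
        stacked.extend(l2_normalize(feature) for _, feature in top)
    return stacked


if __name__ == "__main__":
    toy = [
        ("man", 0.9, [3.0, 4.0]),
        ("man", 0.5, [1.0, 0.0]),
        ("man", 0.7, [0.0, 2.0]),
        ("door", 0.8, [0.0, 5.0]),
    ]
    for row in select_top_k_regions(toy, k=2):
        print(row)
```

With K=6 and many distinct labels, the number of selected regions grows with the number of unique labels rather than with the number of frames, which is what makes the sequence tractable for the encoder.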

LSTM Encoders for Video and Text
We use a bi-directional LSTM (BiLSTM) to encode both textual and visual sequences. A subtitle $S$, which contains a set of sentences, is flattened into a long sequence of words and GloVe (Pennington et al., 2014) is used to embed the words. We stack the hidden states of the BiLSTM from both directions at each timestep to obtain the subtitle representation $H^{S} \in \mathbb{R}^{n_S \times 2d}$, where $n_S$ is the number of subtitle words and $d$ is the hidden size of the BiLSTM (set to 150 in our experiments). Similarly, we encode the question as $H^{q} \in \mathbb{R}^{n_q \times 2d}$, the candidate answers as $H^{a_i} \in \mathbb{R}^{n_{a_i} \times 2d}$, and the visual concepts as $H^{cpt} \in \mathbb{R}^{n_{cpt} \times 2d}$. Here $n_q$ and $n_{a_i}$ are the number of words in the question and in answer $a_i$, respectively. Regional features $V^{reg}$ and ImageNet features $V^{img}$ are first projected into word vector space using a non-linear layer with tanh activation, then encoded using the same BiLSTM to obtain the representations $H^{reg} \in \mathbb{R}^{n_{reg} \times 2d}$ and $H^{img} \in \mathbb{R}^{n_{img} \times 2d}$, respectively.

Joint Modeling of Context and Query
We use a context matching module and BiLSTM to jointly model the contextual inputs (subtitle, video) and query (question-answer pair). The context matching module is adopted from the context-query attention layer from previous work (Yu et al., 2018a). It takes context vectors and query vectors as inputs and produces a set of context-aware query vectors based on the similarity between each context-query pair. Take the regional visual feature stream as an example (Fig. 4 upper stream), where $H^{reg}$ is used as the context input. The question embedding, $H^{q}$, and answer embedding, $H^{a_i}$, are used as queries. After feeding context-query pairs into the context matching module, we obtain a video-aware-question representation, $G^{reg,q} \in \mathbb{R}^{n_{reg} \times 2d}$, and a video-aware-answer representation, $G^{reg,a_i} \in \mathbb{R}^{n_{reg} \times 2d}$, which are then fused with the video context:

$$M^{reg,a_i} = [H^{reg}; G^{reg,q}; G^{reg,a_i}; H^{reg} \odot G^{reg,q}; H^{reg} \odot G^{reg,a_i}],$$

where $\odot$ is the element-wise product. The fused feature, $M^{reg,a_i} \in \mathbb{R}^{n_{reg} \times 10d}$, is fed into another BiLSTM. Its hidden states, $U^{reg,a_i} \in \mathbb{R}^{n_{reg} \times 10d}$, are max-pooled temporally to get the final vector, $u^{reg,a_i} \in \mathbb{R}^{10d}$, for answer $a_i$. We use a linear layer with softmax to convert $\{u^{reg,a_i}\}_{i=0}^{4}$ into answer probabilities. Similarly, we can compute the answer probabilities given the subtitle as context (Fig. 4 bottom stream). When multiple streams are used, we simply sum up the scores from each stream as the final score (Wang et al., 2016a).
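To make the data flow of one stream concrete, here is a minimal pure-Python sketch of context matching, fusion, temporal max-pooling, and score summation across streams. It rests on several assumptions that the text does not confirm: dot-product similarity with a softmax over query positions inside the matching module, and a fusion that concatenates the context, the two context-aware query representations, and their element-wise products with the context (five blocks of width 2d, which is consistent with the stated 10d). The BiLSTM applied between fusion and pooling and the final linear layer are omitted, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def context_match(context, query):
    """For each context vector, attend over the query vectors (dot-product
    similarity, an assumption) and return the attention-weighted sum of the
    query vectors. Output has one row per context position."""
    out = []
    for c in context:
        weights = softmax([sum(ci * qi for ci, qi in zip(c, q)) for q in query])
        out.append([sum(w * q[j] for w, q in zip(weights, query)) for j in range(len(c))])
    return out


def fuse(h, g_q, g_a):
    """Row-wise concatenation [H; G_q; G_a; H*G_q; H*G_a], 5x the input width."""
    fused = []
    for hr, qr, ar in zip(h, g_q, g_a):
        fused.append(
            list(hr) + list(qr) + list(ar)
            + [x * y for x, y in zip(hr, qr)]
            + [x * y for x, y in zip(hr, ar)]
        )
    return fused


def max_pool_over_time(rows):
    """Column-wise maximum over all positions of the sequence."""
    return [max(col) for col in zip(*rows)]


def combine_streams(scores_per_stream):
    """Sum per-answer scores across streams (late fusion)."""
    return [sum(scores) for scores in zip(*scores_per_stream)]


if __name__ == "__main__":
    h_ctx = [[1.0, 0.0], [0.0, 1.0]]  # 2 context positions, width 2d = 2
    h_q = [[1.0, 2.0]]                # 1 question token
    h_a = [[0.5, 0.5], [0.5, 0.5]]    # 2 answer tokens
    m = fuse(h_ctx, context_match(h_ctx, h_q), context_match(h_ctx, h_a))
    print(len(m[0]), max_pool_over_time(m))
```

In a trainable model the pooled vector for each of the five candidate answers would be mapped to a scalar score, and the softmax over those five scores gives the answer distribution for the stream.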
In all experiments, setup is as follows. We split the TVQA dataset into 80% training, 10% validation, and 10% testing splits such that videos and their corresponding QA pairs appear in only one split. This results in 122,039 QA pairs for training, 15,253 QA pairs for validation, and 15,253 QA pairs for testing. We evaluate each model using multiple-choice question answering accuracy.
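A split in which each video and its QA pairs land in exactly one partition can be produced by shuffling clip identifiers rather than individual QA pairs. The sketch below is illustrative only: the `clip_id` field name, the fixed seed, and splitting at the clip level (as opposed to, say, the episode level) are assumptions, and it is not expected to reproduce the exact counts reported above.

```python
import random


def split_by_clip(qa_pairs, seed=0, train_frac=0.8, val_frac=0.1):
    """qa_pairs: list of dicts, each with a 'clip_id' key (hypothetical field name).

    Clips are shuffled and partitioned, then every QA pair follows its clip,
    so no clip contributes QA pairs to more than one split.
    """
    clip_ids = sorted({qa["clip_id"] for qa in qa_pairs})
    random.Random(seed).shuffle(clip_ids)

    n_train = int(len(clip_ids) * train_frac)
    n_val = int(len(clip_ids) * val_frac)
    assignment = {}
    for i, clip_id in enumerate(clip_ids):
        if i < n_train:
            assignment[clip_id] = "train"
        elif i < n_train + n_val:
            assignment[clip_id] = "val"
        else:
            assignment[clip_id] = "test"

    splits = {"train": [], "val": [], "test": []}
    for qa in qa_pairs:
        splits[assignment[qa["clip_id"]]].append(qa)
    return splits


if __name__ == "__main__":
    toy = [{"clip_id": f"clip_{c}", "q": f"question {i}"} for c in range(10) for i in range(7)]
    parts = split_by_clip(toy)
    print({name: len(items) for name, items in parts.items()})
```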

Baselines
Longest Answer: Table 1 indicates that the average length of the correct answers is longer than the wrong ones; thus, our first baseline simply selects the longest answer for each question.

Nearest Neighbor Search: In this baseline, we use Nearest Neighbor Search (NNS) to compute the closest answer to our question or subtitle. We embed sentences into vectors using TFIDF, SkipThought (Kiros et al., 2015), or averaged GloVe (Pennington et al., 2014) word vectors, then compute the cosine similarity for each question-answer pair or subtitle-answer pair. For TFIDF, we use bag-of-words to represent the sentences, assigning a TFIDF value for each word.

Retrieval: Due to the size of TVQA, there may exist similar questions and answers in the dataset. Thus, we also implement a baseline two-step retrieval approach: given a question and a set of candidate answers, we first retrieve the most relevant question in the training set, then pick the candidate answer that is closest to the retrieved question's correct answer. Similar approaches have also been used in dialogue systems (Jafarpour and Burges, 2010; Leuski and Traum, 2011), picking the appropriate responses to an utterance from a predefined human conversational corpus. Similar to NNS, we use TFIDF, SkipThought, and GloVe vectors with cosine similarity.

Table 6 shows results from baseline methods and our proposed neural model. Our main results are obtained by using full-length video clips and subtitles, without using timestamps (w/o ts). We also run the same experiments using the localized video and subtitle segment specified by the ground truth timestamps (w/ ts). If not indicated explicitly, the numbers described below are from the experiments on full-length video clips and subtitles.

Table 6: Accuracy for different methods on TVQA test set. Q = Question, S = Subtitle, V = Video, img = ImageNet features, reg = regional visual features, cpt = visual concept features, ts = timestamp annotation. Human performance without timestamp annotation is reported in Table 5.
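The Longest Answer and TFIDF nearest-neighbor baselines described above can be sketched in a few lines. This is a simplified illustration rather than the evaluated implementation: whitespace tokenization, measuring length in tokens, the smoothed IDF formula, and fitting the IDF statistics only on the context and its candidate answers are all assumptions, since the text does not specify them.

```python
import math
from collections import Counter


def longest_answer(answers):
    """Index of the candidate with the most whitespace-separated tokens."""
    return max(range(len(answers)), key=lambda i: len(answers[i].split()))


def tfidf_vectors(texts):
    """Bag-of-words TF-IDF vectors (as dicts) for a small list of texts."""
    docs = [t.lower().split() for t in texts]
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    # Smoothed IDF; the exact weighting used in the paper is not stated.
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1.0 for w, c in doc_freq.items()}
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nns_answer(context, answers):
    """Pick the answer whose TF-IDF vector is closest to the context,
    where the context is either the question or the subtitle text."""
    vectors = tfidf_vectors([context] + list(answers))
    sims = [cosine(vectors[0], v) for v in vectors[1:]]
    return max(range(len(answers)), key=sims.__getitem__)


if __name__ == "__main__":
    candidates = ["A duck", "A soccer ball", "A pillow"]
    print(longest_answer(candidates))
    print(nns_answer("Joey : I left the soccer ball on the couch", candidates))
```

The subtitle line in the usage example is made up for illustration; it shows why word overlap between subtitles and answers can carry this baseline a long way when answers name objects mentioned in the dialogue.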

Results
(compared to random chance at 20%). As expected, the retrieval-based methods (rows 2-4) and the answer-question similarity based methods (rows 5-7) perform rather poorly, since no contexts (video or subtitle) are considered. When using subtitle-answer similarity to choose correct answers, GloVe, SkipThought, and TFIDF based approaches (rows 8-10) all achieve significant improvement over question-answer similarity. Notably, TFIDF (row 10) answers 49.94% of the questions correctly. Since our questions are raised by people watching the videos, it is natural for them to ask questions about specific and unique objects, locations, etc., mentioned in the subtitle. Thus, it is not surprising that TFIDF based similarity between answer and subtitle performs so well.

Table 7: Accuracy of each question type using different models (w/ ts) on TVQA Validation set. Q = Question, S = Subtitle, V = Video, img = ImageNet features, reg = regional visual features, cpt = visual concept features. The percentage of each question type is shown in brackets.

their ability to answer visual questions. Overall, the best performance is achieved by using all the contextual sources, including subtitles and videos (using concept features, row 18).

Comparison with Human Performance: Human performance without timestamp annotation is shown in Table 5. When using only questions (Table 6 row 11), our model outperforms humans (43.34% vs 31.84%) as it has access to all statistics of the questions and answers. When using videos or subtitles or both, humans perform significantly better than the models.

Models with Timestamp Annotation: Columns under w/o ts and w/ ts show a comparison between the same model using full-length videos/subtitles and using timestamp localized videos/subtitles. With timestamp annotation, the models perform consistently better than their counterparts without this information, indicating that localization is helpful for question answering.

Accuracy for Different Question Types: To gain further insight, we examined the accuracy of our models on different question types on the validation set (results in Table 7), with all models using timestamp annotation. Compared to the S+Q model, S+V+Q models get the most improvements on "what" and "where" questions, indicating that these questions require additional visual information. On the other hand, adding video features did not improve S+Q performance on questions relying more on textual reasoning, e.g., "how" questions.

Human-Written Negatives vs. Randomly-Sampled Negatives: For comparison, we create a new answer set by replacing the original human-written negative answers with randomly sampled negative answers. To produce relevant negative answers, for each question, negatives are sampled (from the other QA pairs) within the same show. Results are shown in Table 8. Performance on randomly sampled negatives is much higher than that on human-written negatives, indicating that human-written negatives are more challenging.

Table 8: Accuracy on TVQA validation set with negative answers collected using different strategies. Negative Answer Source (N.A. Src.) indicates the collection method of the negative answers. Q = Question, S = Subtitle, V = Video, cpt = visual concept features, ts = timestamp annotation. All the experiments are conducted using the proposed multi-stream neural model.

Qualitative Analysis: Fig. 6 shows example predictions from our S+V+Q model (row 18) using full-length video and subtitle. Fig. 6a and Fig. 6b demonstrate its ability to solve both grounded visual questions and textual reasoning questions. The bottom row shows two incorrect predictions. We found that wrong inferences are mainly due to incorrect language inferences and the model's lack of common sense knowledge. For example, in Fig. 6c, the characters are talking about radiology, and the model is distracted into believing they are in the radiology department, while Fig. 6d shows a case of questions that need common sense to answer, rather than simply textual or visual cues.
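The randomly sampled negative set used for this comparison could be built along the following lines. This is a hedged sketch, not the procedure actually used: the field names `show` and `correct` are hypothetical, and drawing negatives from the correct answers of other QA pairs (rather than from all of their candidate answers) is an assumption, since the text only says that negatives come from other QA pairs within the same show.

```python
import random


def sample_random_negatives(qa_pairs, num_negatives=4, seed=0):
    """qa_pairs: list of dicts with hypothetical keys 'show' and 'correct'.

    For each QA pair, draw negatives from the correct answers of other QA
    pairs in the same show, skipping answers identical to its own correct one.
    Returns one list of negatives per QA pair, in input order.
    """
    rng = random.Random(seed)
    indices_by_show = {}
    for idx, qa in enumerate(qa_pairs):
        indices_by_show.setdefault(qa["show"], []).append(idx)

    negatives = []
    for idx, qa in enumerate(qa_pairs):
        pool = sorted(
            {
                qa_pairs[j]["correct"]
                for j in indices_by_show[qa["show"]]
                if j != idx and qa_pairs[j]["correct"] != qa["correct"]
            }
        )
        # Fewer negatives are returned if the show has too few distinct answers.
        negatives.append(rng.sample(pool, min(num_negatives, len(pool))))
    return negatives


if __name__ == "__main__":
    toy = [{"show": "friends", "correct": f"answer {i}"} for i in range(6)]
    toy.append({"show": "house", "correct": "a cane"})
    for qa, negs in zip(toy, sample_random_negatives(toy)):
        print(qa["correct"], "->", negs)
```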

Conclusion
We presented the TVQA dataset, a large-scale, localized, compositional video question answering dataset. We also proposed two QA tasks (with/without timestamps) and provided baseline experiments as a benchmark for future comparison. Our experiments show that both visual and textual understanding are necessary for TVQA. There is still a significant gap between the proposed baselines and human performance in QA accuracy. We hope this novel multimodal dataset and the baselines will encourage the community to develop stronger models in future work. To narrow the gap, one possible direction is to enhance the interactions between videos and subtitles to improve multimodal reasoning ability. Another direction is to exploit human-object relations in the video and subtitle, as we observe that a large number of questions involve such relations. Additionally, temporal reasoning is crucial for answering the TVQA questions. Thus, future work also includes integrating better temporal cues.