Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Videos convey rich information. Dynamic spatio-temporal relationships between people/objects and diverse multimodal events are present in a video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions on videos is one task that can evaluate such AI abilities. In this paper, we propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, and hence give the model useful extra information (in explicit textual format to allow easier matching) for answering questions. Moreover, our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier. Finally, we also cast the frame selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word-, object-, and frame-level visualization studies.


Introduction
Recent years have witnessed a paradigm shift in the way we get our information, and a lot of it is related to watching and listening to videos that are shared in huge amounts via the internet and new high-speed networks. (Our code is publicly available at: https://github.com/hyounghk/VideoQADenseCapFrameGate-ACL2020) Videos convey a diverse breadth of rich information, such as dynamic spatio-temporal relationships between people/objects, as well as events. Hence, it has become important to develop automated models that can accurately extract such precise multimodal information from videos (Tapaswi et al., 2016; Maharaj et al., 2017; Jang et al., 2017; Gao et al., 2017; Anne Hendricks et al., 2017; Lei et al., 2018, 2020). Video question answering is a representative AI task through which we can evaluate the abilities of an AI agent to understand, retrieve, and return desired information from given video clips. In this paper, we propose a model that effectively integrates multimodal information and locates the relevant frames from diverse, complex video clips such as those from the video+dialogue TVQA dataset (Lei et al., 2018), which contains questions that need both the video and the subtitles to answer. When given a video clip and a natural language question based on the video, the natural first step is to compare the question with the content (objects and keywords) of the video frames and subtitles, and then combine information from different video frames and subtitles to answer the question. Analogous to this process, we apply dual-level attention in which a question and the video/subtitle are first aligned at the word/object level, and then the aligned features from the video and subtitle are aligned a second time at the frame level to integrate information for answering the question. Among the aligned frames (which now contain aggregated video and subtitle information), only those which contain relevant information for answering the question are needed.
Hence, we also apply gating mechanisms to each frame feature to select the most informative frames before feeding them to the classifier.
Next, in order to make the frame selection more effective, we cast the frame selection sub-task as a multi-label classification task. To convert the time span annotation into a label for each frame, we assign a positive label ('1') to frames between the start and end points and a negative label ('0') to the others, then train with the binary cross-entropy loss. Moreover, for enhanced supervision from the human importance annotation, we also introduce a new loss function, In-and-Out Frame Score Margin (IOFSM), which is the difference in average scores between in-frames (which are inside the time span) and out-frames (which are outside the time span). We empirically show that these two losses are complementary when used together. We also introduce a way of applying binary cross-entropy to the unbalanced dataset. Since we treat each frame as a training example (positive or negative), we have far more negative examples than positive ones. To counter this bias, we calculate normalized scores by averaging the loss separately for each label. This modification, which we call balanced binary cross-entropy (BBCE), helps adjust the imbalance and further improves the performance of our model. Finally, we also employ dense captions to help further improve the temporal localization of our video-QA model. Captions have proven helpful for vision-language tasks (Wu et al., 2019; Li et al., 2019; Kim and Bansal, 2019) by providing additional, complementary information to the primary task in descriptive textual format. We employ dense captions as an extra input to our model since dense captions describe the diverse salient regions of an image in object-level detail, and hence they give more useful clues for question answering than single, non-dense image captions.
Empirically, our first basic model (with dual-level attention and frame-selection gates) outperforms the state-of-the-art models on the TVQA validation dataset (72.53% as compared to the 71.13% previous state-of-the-art), and with the additional supervision via the two new loss functions and the employment of dense captions, our model gives further improved results (73.34% and 74.20%, respectively). These improvements from each of our model components (i.e., new loss functions, dense captions) are statistically significant. Overall, our full model's test-public score substantially outperforms the state-of-the-art score by a large margin of 3.57% (74.09% as compared to 70.52%). Also, our model's scores across all 6 TV shows are more balanced than those of other models in the TVQA leaderboard, implying that our model should be more consistent and robust over different genres/domains that might have different characteristics from each other.
Our contributions are four-fold: (1) we present an effective model architecture for the video question answering task using dual-level attention and gates which fuse and select useful spatial-temporal information, (2) we employ dense captions as salient-region information and integrate them into a joint model to enhance videoQA performance by locating the proper information both spatially and temporally in rich textual semi-symbolic format, (3) we cast the frame selection sub-task as a multi-label classification task and introduce two new loss functions (IOFSM and BBCE) for enhanced supervision from human importance annotations (which could also be useful in other multi-label classification settings), and (4) our model's score on the test-public dataset is 74.09%, which is around 3.6% higher than the state-of-the-art result on the TVQA leaderboard (and our model's scores are more balanced/consistent across the diverse TV show genres). We also present several ablation and visualization analyses of our model components (e.g., the word/object-level and the frame-level attention).

Related Work

Video Question Answering Video question answering requires an understanding of temporal information as well as spatial information, so it is more challenging than single-image question answering.
Temporal Localization Temporal localization is a task widely explored for event/object detection in the video context. There has been work that solely processes visual information to detect objects/actions/activities (Gaidon et al., 2013; Weinzaepfel et al., 2015; Shou et al., 2016; Dai et al., 2017; Shou et al., 2017). Meanwhile, natural-language-related temporal localization is less explored, with recent work focusing on retrieving a certain moment in a video from a natural language query (Anne Hendricks et al., 2017; Gao et al., 2017). With deliberately designed gating and attention mechanisms, our work contributes to the task of temporal localization, especially in natural-language and multimodal contexts.
Dense Image Captioning Image captioning is another direction for understanding visual and language information jointly. Single-sentence captions (Karpathy and Fei-Fei, 2015; Anderson et al., 2018) capture the main concept of an image to describe it in a single sentence. However, an image can contain multiple aspects that are important/useful in different ways. Dense captions (Johnson et al., 2016) and paragraph captions (Krause et al., 2017; Liang et al., 2017; Melas-Kyriazi et al., 2018) have been introduced to densely and broadly capture the diverse aspects and salient regions of an image. In particular, dense captions describe an image at the object level and give useful salient regional information about objects, such as attributes and actions. In this paper, we take advantage of this ability of dense captions to help our video QA model understand an image better for answering questions.

Model
Our model consists of two parts: feature fusion and frame selection. For feature fusion, we introduce dual-level (word/object and frame level) attention, and we cast the frame selection problem as a multi-label classification task, introducing two new loss functions for enhanced supervision (Figure 1).

Features
We follow the same approach as Lei et al. (2020) to obtain features from the video, question-answer pairs, and subtitle inputs and to encode them. We sample frames at 0.5 fps and extract object features from each frame via Faster R-CNN (Girshick, 2015). We then use PCA to obtain 300-dimensional features from the top-20 object proposals. We also create five hypotheses by concatenating the question feature with each of the five answer features, and we pair each visual frame feature with its temporally neighboring subtitles. We encode all the features using a convolutional encoder.
φ_en(x) = g_n(h_N), with h_t = (f_{L,t} ∘ ... ∘ f_{1,t})(h_{t-1}) and h_0 = x + E_pos, where E_pos denotes the positional encoding, f_{i,t} is a convolution preceded by layer normalization and followed by a ReLU activation, and g_n is the final layer normalization. The encoder is composed of N block iterations; in each iteration, the encoded inputs are transformed by L convolutions. We set L to 2 and N to 1 in our experiments (Figure 2).
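The encoder above can be sketched in NumPy as follows (a minimal sketch; the kernel width, initialization, and residual connections here are illustrative assumptions, not the paper's exact configuration):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector (g_n in the text).
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def conv1d(x, w):
    # Same-padded 1D convolution over the sequence axis.
    # x: (T, d_in), w: (k, d_in, d_out)
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                     for t in range(x.shape[0])])

def positional_encoding(T, d):
    # Standard sinusoidal positional encoding (E_pos).
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def encode(x, weights, N=1, L=2):
    # phi_en(x): N block iterations, each applying L convolutions f_{i,t},
    # each preceded by layer normalization and followed by ReLU.
    # The residual connection inside the loop is an assumption of this sketch.
    h = x + positional_encoding(*x.shape)
    for n in range(N):
        for l in range(L):
            h = h + np.maximum(conv1d(layer_norm(h), weights[n][l]), 0.0)
    return layer_norm(h)

rng = np.random.default_rng(0)
d, T, k = 128, 6, 3
weights = [[rng.normal(0, 0.05, (k, d, d)) for _ in range(2)]]
out = encode(rng.normal(size=(T, d)), weights)  # (T, d) encoded sequence
```

With L = 2 and N = 1 as in the experiments, the encoder applies two LayerNorm-conv-ReLU transformations over the positionally encoded input.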

Dual-Level Attention
In dual-level attention, features are sequentially aligned at the word/object level and the frame level (Figure 3).

Word/Object-Level Attention The QA feature, qa = {qa_0, qa_1, ..., qa_{T_qa}}, is combined with the subtitle feature, s_t = {s_t,0, s_t,1, ..., s_t,T_st}, and the visual feature, v_t = {v_t,0, v_t,1, ..., v_t,T_vt}, of the t-th frame, respectively, via word/object-level attention. Specifically, following Seo et al. (2017)'s approach, we calculate similarity matrices, S^s_t ∈ R^{T_qa×T_st} and S^v_t ∈ R^{T_qa×T_vt}, from the QA/subtitle and QA/visual features, respectively. From the similarity matrices, attended subtitle features are obtained and combined with the QA features by concatenation followed by a transformation function. Then, a max-pooling operation is applied word-wise to reduce the dimension.
where f_1 is a fully-connected layer followed by a ReLU non-linearity. The same process is applied to the QA features.
The fused features from the two directions are integrated by concatenating them and feeding the result to a function f_2, which is the same function as f_1 with unshared parameters. This whole process is also applied to the visual features to obtain word/object-level attended features.

Figure 3: Dual-Level Attention. Our model performs two levels of attention (word/object and frame level) sequentially. In the word/object-level attention, each word/object is aligned to relevant words or objects. In the frame-level attention, each frame (which has integrated information from the word/object-level attention) is aligned to relevant frames.
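The word/object-level attention can be sketched for one frame's subtitle words as follows (a minimal sketch: the bidirectional-attention-style similarity of Seo et al. (2017) is simplified to a plain dot-product similarity, and W1 is a stand-in for the parameters of f_1):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def word_level_attention(qa, sub, W1):
    # qa: (T_qa, d) QA-hypothesis words; sub: (T_s, d) subtitle words.
    S = qa @ sub.T                       # similarity matrix (T_qa, T_s)
    attended = softmax(S, axis=1) @ sub  # subtitle words attended per QA word
    fused = np.concatenate([qa, attended], axis=-1)  # (T_qa, 2d)
    fused = np.maximum(fused @ W1, 0.0)  # f_1: FC + ReLU -> (T_qa, d)
    return fused.max(axis=0)             # word-wise max-pool -> (d,)

rng = np.random.default_rng(0)
d = 128
W1 = rng.normal(0, 0.05, (2 * d, d))
frame_feat = word_level_attention(rng.normal(size=(8, d)),
                                  rng.normal(size=(12, d)), W1)
```

Running this per frame yields one fused feature vector per frame, which the frame-level attention then aligns.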

Frame-Level Attention
The fused features from word/object-level attention are integrated frame-wise via frame-level attention. Similar to the word/object-level attention, a similarity matrix, S ∈ R^{T_F×T_F}, is calculated, where T_F is the number of frames. From this similarity matrix, attended frame-level features are calculated.
where f_3 is the same function as f_1 and f_2 with unshared parameters. The frame-wise attended features are added to obtain an integrated feature.
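The frame-level step mirrors the word/object-level one, but over frames (a minimal sketch with dot-product similarity; adding the attended features back in is the integration step described above):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def frame_level_attention(frames_a, frames_b):
    # frames_a, frames_b: (T_F, d) per-frame features from two sources
    # (e.g., QA-attended subtitle features and QA-attended video features).
    S = frames_a @ frames_b.T                 # (T_F, T_F) frame similarity
    attended = softmax(S, axis=1) @ frames_b  # frames_b attended per frame of a
    return frames_a + attended                # add to integrate

rng = np.random.default_rng(0)
u = frame_level_attention(rng.normal(size=(20, 128)),
                          rng.normal(size=(20, 128)))
```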

Video and Dense Caption Integration
We also employ dense captions to help further improve the temporal localization of our video-QA model. They provide more diverse salient regional information (than the usual single non-dense image captions) about object-level details of the image frames in a video clip, and also allow the model to explicitly (in textual/semi-symbolic form) match keywords/patterns between dense captions and questions to find relevant locations/frames. We apply the same procedure to the dense caption feature, substituting dense caption features for video features, to obtain u_sd. To integrate u_sv and u_sd, we employ multi-head self-attention (Figure 4). Specifically, we concatenate u_sv and u_sd frame-wise and then feed them to the self-attention function.
where g_a denotes self-attention.
In this way, u_sv and u_sd attend to themselves while simultaneously attending to each other. We split the output u_svd into two parts with the same shape as the inputs, then add the two.
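A minimal sketch of this integration, with a single scaled dot-product attention head standing in for the paper's multi-head attention g_a:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attend(x, Wq, Wk, Wv):
    # Scaled dot-product self-attention (one head as a stand-in for g_a).
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1) @ v

def integrate(u_sv, u_sd, Wq, Wk, Wv):
    # Concatenate video and dense-caption streams frame-wise so each
    # position attends to itself and to the other stream, then split
    # the output back into two parts and add them.
    x = np.concatenate([u_sv, u_sd], axis=0)   # (2*T_F, d)
    u_svd = self_attend(x, Wq, Wk, Wv)
    T_F = u_sv.shape[0]
    return u_svd[:T_F] + u_svd[T_F:]           # (T_F, d)

rng = np.random.default_rng(0)
d, T_F = 128, 20
Wq, Wk, Wv = (rng.normal(0, 0.05, (d, d)) for _ in range(3))
u = integrate(rng.normal(size=(T_F, d)),
              rng.normal(size=(T_F, d)), Wq, Wk, Wv)
```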

Frame-Selection Gates
To select appropriate information from the frame-length features, we employ max-pooling and gates. Features from the video-dense caption integration are fed to the CNN encoder. A fully-connected layer and a sigmoid function are applied sequentially to the output feature to obtain frame scores that indicate how relevant each frame is for answering a given question. We obtain weighted features by multiplying the output feature from the CNN encoder with the scores. We calculate another set of frame scores with a different function f_G to obtain another weighted feature.
The three features (from the local gate, the global gate, and max-pooling, respectively) are then concatenated and fed to the classifier, which produces a score for each candidate answer.
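The gating step can be sketched as follows (a minimal sketch: the gate parameters are simple vectors here, and max-pooling each weighted stream over frames before concatenation is an assumption about the exact reduction):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def frame_selection(z, w_L, w_G):
    # z: (T_F, d) frame features from the video/dense-caption integration.
    # Local gate: per-frame relevance scores, later supervised by
    # the human time-span annotations.
    s_local = sigmoid(z @ w_L)           # (T_F,) frame scores
    z_local = z * s_local[:, None]       # gated features
    # Global gate: a second, independently parameterized scoring (f_G).
    s_global = sigmoid(z @ w_G)
    z_global = z * s_global[:, None]
    # Reduce each stream over frames (max-pool) and concatenate the
    # three summaries fed to the answer classifier.
    summary = np.concatenate([z_local.max(0), z_global.max(0), z.max(0)])
    return summary, s_local

rng = np.random.default_rng(0)
z = rng.normal(size=(20, 128))
summary, scores = frame_selection(z, rng.normal(0, 0.1, 128),
                                  rng.normal(0, 0.1, 128))
# summary: (3*128,) feature for the classifier; scores: (20,) frame scores
```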
We obtain the logits for the five candidate answers and choose the answer with the highest value as the prediction. The answer classification loss is the cross-entropy over these logits: L_ans = -log(exp(s_g) / Σ_i exp(s_i)), where s_g is the logit of the ground-truth answer.

Novel Frame-Selection Supervision Loss Functions
We cast frame selection as a multi-label classification task. The frame scores from the local gate, g_L, are supervised by human importance annotations, which are time spans (start-end point pairs) that annotators consider necessary for selecting the correct answers. To this end, we transform the time span into ground-truth frame scores: if a frame is within the time span, it has '1' as its label, and a frame outside the span gets '0'. In this way, we can assign a label to each frame, and each frame's score should be as close as possible to its ground-truth label. We train the local gate network with the binary cross-entropy (BCE) loss.
L_BCE = -Σ_i [ y_i log s_f_i + (1 - y_i) log(1 - s_f_i) ], where s_f_i is the frame score of the i-th frame and y_i is the corresponding ground-truth label.
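The span-to-label conversion and the BCE supervision can be sketched as:

```python
import numpy as np

def span_to_labels(start, end, T_F):
    # Frames inside the annotated time span get label 1, the rest 0.
    y = np.zeros(T_F)
    y[start:end + 1] = 1.0
    return y

def bce_loss(scores, labels, eps=1e-7):
    # Standard binary cross-entropy averaged over all frames.
    s = np.clip(scores, eps, 1 - eps)
    return float(-np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s)))

y = span_to_labels(3, 6, 10)   # frames 3..6 are "in-frames"
scores = np.full(10, 0.5)
loss = bce_loss(scores, y)     # = -log(0.5) ~ 0.693 for uniform 0.5 scores
```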
In-and-Out Frame Score Margin For additional supervision beyond the binary cross-entropy loss, we create a novel loss function, In-and-Out Frame Score Margin (IOFSM).
L_IOFSM = 1 + avg(OFS) - avg(IFS), where OFS (Out-Frame Score) denotes the scores of frames whose labels are '0' and IFS (In-Frame Score) the scores of frames whose labels are '1'.
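One plausible formulation consistent with this description, as a sketch (the constant offset that keeps the loss non-negative is an assumption):

```python
import numpy as np

def iofsm_loss(scores, labels):
    # In-and-Out Frame Score Margin: push the average in-frame score up
    # and the average out-frame score down. The "1 +" offset keeping the
    # loss non-negative is an assumption of this sketch.
    ifs = scores[labels == 1].mean()   # average in-frame score (IFS)
    ofs = scores[labels == 0].mean()   # average out-frame score (OFS)
    return float(1.0 + ofs - ifs)

labels = np.array([0, 0, 1, 1, 1, 0])
good = iofsm_loss(np.array([0.1, 0.2, 0.9, 0.8, 0.9, 0.1]), labels)
bad = iofsm_loss(np.array([0.8, 0.7, 0.3, 0.2, 0.3, 0.9]), labels)
# good < bad: well-separated in/out scores yield a smaller margin loss
```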
Balanced Binary Cross-Entropy In our multi-label classification setting, each frame can be considered one training example, so there are far more negative (out-frame) examples than positive (in-frame) ones. To balance this, we average the cross-entropy separately over each label: L_BBCE = -(1/T_F_in) Σ_i log s_f_in_i - (1/T_F_out) Σ_j log(1 - s_f_out_j), where s_f_in_i and s_f_out_j are the i-th in-frame score and j-th out-frame score, respectively, and T_F_in and T_F_out are the numbers of in-frames and out-frames, respectively.
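A sketch of averaging the loss separately for each label, so the many out-frames do not dominate:

```python
import numpy as np

def bbce_loss(scores, labels, eps=1e-7):
    # Balanced BCE: average the cross-entropy separately over positive
    # (in-frame) and negative (out-frame) examples, then sum, so both
    # sides contribute equally regardless of their counts.
    s = np.clip(scores, eps, 1 - eps)
    pos = -np.log(s[labels == 1]).mean()        # in-frame term
    neg = -np.log(1 - s[labels == 0]).mean()    # out-frame term
    return float(pos + neg)

# 2 in-frames vs. 8 out-frames: plain BCE would be dominated by the
# out-frame term; BBCE weighs both sides equally.
labels = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
scores = np.full(10, 0.5)
loss = bbce_loss(scores, labels)   # = 2 * ln 2 ~ 1.386 at uniform 0.5
```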
Thus, the total loss is the sum of the answer classification loss and the frame-selection losses (BCE/BBCE and IOFSM).

Experimental Setup

TVQA Dataset The TVQA dataset (Lei et al., 2018) consists of video frames, subtitles, and question-answer pairs from 6 TV shows. The train/validation/test-public splits contain 122,039/15,253/7,623 examples. Each example has five candidate answers, one of which is the ground-truth. TVQA is thus a classification task in which models select one of the five candidate answers and are evaluated with the accuracy metric. (At the time of the ACL2020 submission deadline, the publicly visible rank-1 entry was 70.52%. Since then, two more entries have appeared in the leaderboard; however, our method still outperforms their scores by a large margin: 71.48% and 71.13% versus 74.09%.)
Dense Captions We use a pretrained dense-captioning model to extract dense captions from each video frame. We extract the dense captions in advance and use them as extra input data to the model.

Training Details We use GloVe (Pennington et al., 2014) word vectors of dimension 300 and RoBERTa (Liu et al., 2019) features of dimension 768. The dimension of the visual feature is 300, and the base hidden size of the whole model is 128. We use Adam (Kingma and Ba, 2015) as the optimizer, set the initial learning rate to 0.001, and drop it to 0.0002 after 10 epochs. For dropout, we use a probability of 0.1.

Results and Ablation Analysis
As seen from Table 1, our model outperforms the state-of-the-art models in the TVQA leaderboard. In particular, our model achieves balanced scores across all the TV shows, while some other models have high variance across shows. As seen from Table 2, the standard deviation and max-min gap of our model's per-show scores are 0.65 and 1.83, respectively, which are the lowest values among all models in the list. This low variance suggests that our model is more consistent and robust across all the TV shows. (Two more entries have appeared in the leaderboard since the ACL2020 submission deadline; however, our scores are still more balanced than theirs across all TV shows: std. 2.11 and 2.40 versus our 0.65, and max-min 5.50 and 7.38 versus our 1.83.)

Table 1: Our model outperforms the state-of-the-art models by a large margin. Moreover, the scores of our model across all the TV shows are more balanced than the scores from other models, which means our model is more consistent/robust and not biased to the data from specific TV shows.

Table 2: Average, standard deviation, and max-min of each model's per-TV-show scores (row 1: jacobssy (anonymous), avg. 66.37, std. 2.01, max-min 5.48; row 2: multi-stream, ...).

Model Ablations As shown in Table 3, our basic dual-attention and frame-selection-gates model shows a substantial improvement over the strong single-attention and frame-span baseline (row 4 vs. 1: p < 0.0001), which is from the best published model (Lei et al., 2020). Each of dual-attention and frame-selection gates alone shows a small improvement over the baseline (row 3 vs. 1 and row 2 vs. 1, respectively). However, when they are applied together, the model works much better. They are more effective together because the frame-selection gates select frames based on the useful information in each frame feature, and our dual-attention helps this selection by routing more relevant information to each frame through the frame-level attention. Next, our new loss functions significantly help over the dual-attention and frame-selection-gates model by providing enhanced supervision (row 5 vs. 4: p < 0.0001; row 7 vs. 6: p < 0.005). Our RoBERTa version is also significantly better than the GloVe model (row 6 vs. 4: p < 0.0005; row 7 vs. 5: p < 0.01). Finally, employing dense captions further improves the performance via useful textual clue/keyword matching (row 8 vs. 7: p < 0.005). Statistical significance is computed using the bootstrap test (Efron and Tibshirani, 1994).

IOFSM and BCE Loss Functions Ablation and Analysis To see how the In-and-Out Frame Score Margin (IOFSM) and binary cross-entropy (BCE) losses affect the frame selection task, we compare the model's performance/behavior for different combinations of IOFSM and BCE. As shown in Table 4, applying IOFSM on top of BCE gives a better result. When we compare rows 1 and 3 of Table 4, the average in-frame score of BCE+IOFSM is higher than BCE's, while the average out-frame scores of both are almost the same. This can mean two things: (1) IOFSM helps increase the scores of in-frames, and (2) increased in-frame scores help improve the model's performance. On the other hand, when we compare rows 1 and 2, the average in-frame score of IOFSM is higher than BCE's, but the average out-frame score of IOFSM is also much higher than BCE's. This can mean that out-frame scores, like in-frame scores, have a large impact on performance. This is intuitively reasonable: because information from out-frames also flows to the next layer (i.e., the classifier) after being multiplied by the frame scores, the score for the 'negative' label also has a direct impact on performance, so making those scores as small as possible is also important. Also, when we compare row 2 and the others (2 vs. 1 and 3), the gap between in-frame scores is much larger than the gap between out-frame scores. But considering that the scores are averages, and that the number of out-frames is usually much larger than the number of in-frames, the difference between out-frame scores has a larger effect than the gap itself suggests.
Balanced BCE Analysis Rows 1 and 4 of Table 4 show the effect of applying BBCE. This could mean that a model with IOFSM has an unstable scoring behavior, which could affect the performance. As seen from row 5, applying BBCE and IOFSM together gives a further improvement, possibly due to the increased in-frame scores and decreased out-frame scores while staying at a similar standard-deviation value.

Visualizations
In this section, we visualize the dual-level attention (word/object and frame level) and the change in frame scores from applying the new losses (for all these attention examples, our model predicts the correct answers).
Word/Object-Level Attention We visualize word-level attention in Figure 5. In the top example, the question and answer pair is "Where sat Rachel when holding a cup?" - "Rachel sat on a couch". Our word/object-level attention between the QA pair and the dense captions attends to a relevant description like 'holding a glass' to help answer the question. In the middle example, the question and answer pair is "How did Lance react after Mandy insulted his character?" - "Lance said he would be insulted if Mandy actually knew anything about acting". Our word/object-level attention between the QA pair and the subtitle properly attends to the most relevant words, such as 'insulted', 'knew', and 'acting', to answer the question. In the bottom example, the question and answer pair is "What is Cathy doing with her hand after she introduces her fiance to Ted?" - "She is doing sign language". From the scores of our word/object-level attention, the model aligns the word 'sign' to the woman's hand to answer the question.

Frame-Level Attention As shown in Figure 6, our frame-level attention can align relevant frames from different features. In the example, the question and answer pair is "Where did Esposito search after he searched Carol's house downstairs?" - "Upstairs". To answer this question, the model needs to find a frame in which 'he (Esposito) searched Carol's house downstairs', then find a frame which has a clue for 'where did Esposito search'. Our frame-level attention properly aligns the information fragments from different features (Frames 20 and 25) to help answer the question.

Frame Score Enhancement by New Losses
As seen in Figure 7, applying our new losses (IOFSM+BBCE) changes the score distribution over all frames. Before applying our losses (left), overall scores are relatively low. After applying the losses (right), overall scores increase, and in particular, scores around in-frames get much higher.

Figure 7: Visualization of the change in frame-selection score distribution. Left: the score distribution before applying the new losses (IOFSM+BBCE). Right: the score distribution after applying the losses. Scores neighboring in-frames (gray) are increased. For this example, the model does not predict the right answer before applying the losses, but after training with the losses, the model chooses the correct answer.

Conclusion
We presented our dual-level attention and frame-selection gates model along with novel losses for more effective frame selection. Furthermore, we employed dense captions to help the model better find clues from salient regions for answering questions. Each component added to our base model architecture (the proposed loss functions and the adoption of dense captions) significantly improves the model's performance. Overall, our model outperforms the state-of-the-art models on the TVQA leaderboard, while showing more balanced scores across the diverse TV show genres.