DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization

Leveraging large-scale unlabeled web videos such as instructional videos for pre-training followed by task-specific finetuning has become the de facto approach for many video-and-language tasks. However, these instructional videos are very noisy: the accompanying ASR narrations are often incomplete, and can be irrelevant to or temporally misaligned with the visual content, limiting the performance of models trained on such data. To address these issues, we propose an improved video-and-language pre-training method that first adds automatically-extracted dense region captions from the video frames as auxiliary text input, to provide informative visual cues for learning better video and language associations. Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss, which encourages the model to automatically focus on the correct caption from a pool of candidate ASR captions. Our overall approach is named DeCEMBERT (Dense Captions and Entropy Minimization). Comprehensive experiments on three video-and-language tasks (text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate that our approach outperforms previous state-of-the-art methods. Ablation studies on pre-training and downstream tasks show that adding dense captions and the constrained attention loss improves model performance. Lastly, we provide attention visualizations to show the effect of applying the proposed constrained attention loss.


Introduction
Video and language are ubiquitous in the world we live in. The ability to understand the interplay of video and language is thus essential for intelligent agents operating in real-world scenarios. Past success in video-and-language has mostly been driven by supervised learning, where models are learned on manually labeled data for a particular task (e.g., text-to-video retrieval). However, manually annotating video and language data is very expensive, which limits the scale of such datasets and consequently the performance of models trained on them. The self-supervised pre-train-then-finetune paradigm offers an easy and generic solution to this dilemma: models are first pre-trained on large-scale unlabeled data by performing various "proxy tasks", and then finetuned on downstream tasks where data is often limited.
Figure 1: An example from HowTo100M (Miech et al., 2019). We show three clips and their corresponding ASR captions and dense captions. We use a green box to indicate the correctly matched ASR caption for the middle clip, and highlight a semantically misaligned ASR caption in pink. As can be seen from this example, the ASR captions are often incomplete and unpunctuated, and are semantically or temporally misaligned with their corresponding clips. In contrast, dense captions typically capture key objects, attributes, and actions in the clips.
Recent advances in language pre-training (Devlin et al., 2019) demonstrate the effectiveness of this approach, where transformer-based (Vaswani et al., 2017) models pre-trained on large-scale unlabeled text corpora have been shown to perform remarkably well across a wide range of natural language tasks (Rajpurkar et al., 2016;Williams et al., 2017;Zellers et al., 2018;Wang et al., 2018). Following this momentum, multimodal pre-training (Tan and Bansal, 2019;Su et al., 2019;Cho et al., 2021;Sun et al., 2019;Li et al., 2020c;Zhu and Yang, 2020;Miech et al., 2020;Li et al., 2020b) on large-scale image-text corpora (Sharma et al., 2018;Chen et al., 2015;Krishna et al., 2017) and video-text corpora (Miech et al., 2019;Sun et al., 2019) has also been shown to outperform existing approaches (Anderson et al., 2018;Yu et al., 2018a;Lei et al., 2020a,b) on vision-and-language tasks (Antol et al., 2015;Xu et al., 2016;Yu et al., 2018a;Suhr et al., 2019;Lei et al., 2020b). The most commonly used "proxy tasks" for multimodal pre-training are masked language modeling (Devlin et al., 2019) (MLM) and cross-modal matching (Tan and Bansal, 2019;Zhu and Yang, 2020) (e.g., video-text matching), where MLM aims to learn a better language model in the presence of the extra
vision modality, and the matching objective encourages better association and alignment between relevant image-text or video-text pairs. Existing video-text pre-training models (Sun et al., 2019;Miech et al., 2020;Zhu and Yang, 2020) are typically trained on large-scale instructional video datasets such as HowTo100M (Miech et al., 2019). The dataset contains 1.2 million videos with 136 million clips that are automatically harvested from YouTube. Each clip is paired with text transcribed from the video narrations via an automatic speech recognition (ASR) system. While models trained on HowTo100M have shown promising results, they suffer from a few inherent drawbacks of the dataset: (i) Semantic misalignment: the narration words are sometimes irrelevant to the visual content (e.g., credits or other non-visual words; see Figure 1, text highlighted in pink), and vice versa, i.e., some important visual objects and actions are not described by words. (ii) Temporal misalignment: the videos and the captions are far from perfectly aligned, i.e., people might talk about something before or after they actually demonstrate it. For example, in Figure 1 the caption "cross" is spoken after the action has happened. Miech et al. (2019) reported that around 50% of the clip-caption pairs in HowTo100M suffer from these two misalignments, both of which cause difficulties in optimizing the video-text matching objective. (iii) Furthermore, the ASR captions are generally noisy, incomplete, and unpunctuated (Tilk and Alumäe, 2015) (e.g., in Figure 1, "taking pieces paper go"), which limits the language modeling ability of systems trained on such text.
To address the aforementioned issues, we propose to add Dense Captions (Johnson et al., 2016;Yang et al., 2017) as a complementary text input to the ASR captions. Beyond serving as an extra language input for better language modeling, dense captions also describe important object, attribute, and action details for several salient regions in the video frames, providing useful signals for video-text matching. In addition to their use in the pre-training stage, these dense captions also provide helpful clues for downstream tasks such as video question answering.
In parallel, to alleviate the temporal misalignment issue, we propose a constrained attention loss that encourages the model to automatically focus on the relevant ASR caption from a pool of continuous caption candidates. Instead of using only the single paired ASR caption for each clip, we also use the captions from its neighboring clips, expecting that one of these neighboring captions semantically aligns with the clip. To encourage the alignment between the clip and its relevant caption, we employ a constrained attention loss that encourages the attention mass from the video features to the captions to be concentrated mostly on one of the captions, by minimizing the entropy of the attention scores.
We evaluate our DECEMBERT (Dense Captions and Entropy Minimization) model on a wide range of video-and-language tasks, including video question answering, text-to-video retrieval (Xu et al., 2016), and video captioning (Xu et al., 2016), where our approach outperforms previous state-of-the-art methods. To better understand the underlying factors that contribute to this success, we present comprehensive analyses of each of the added components. To summarize, our contribution is three-fold: (i) We propose incorporating automatically extracted dense captions as an extra text input for video-text pre-training. (ii) We propose an entropy minimization-based constrained attention loss to encourage the model to dynamically select the best-matched caption from a pool of neighboring captions, alleviating the inherent misalignment between the ASR captions and the videos. (iii) Extensive experiments on three video-and-language tasks (text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate the effectiveness of our approach. Furthermore, we provide comprehensive ablation studies and visualizations to quantitatively and qualitatively examine the effect of using dense captions and the proposed constrained attention loss.

Related Work
Since the birth of BERT (Devlin et al., 2019), transformer-based (Vaswani et al., 2017) language pre-training models (Lan et al., 2020;Dong et al., 2019;Song et al., 2019;Raffel et al., 2020;Clark et al., 2020), which perform unsupervised pre-training followed by downstream task-specific finetuning, have become the de facto approach for various natural language understanding tasks (Rajpurkar et al., 2016;Williams et al., 2017;Zellers et al., 2018;Wang et al., 2018). Following this success, image-and-language pre-training models (Tan and Bansal, 2019;Li et al., 2020a) and video-and-language pre-training models (Sun et al., 2019;Miech et al., 2019;Zhu and Yang, 2020;Miech et al., 2020;Li et al., 2020b;Luo et al., 2020;Stroud et al., 2020) have also shown promising results on many vision-and-language tasks (Antol et al., 2015;Xu et al., 2016). For video-and-language pre-training in particular, most existing work (Sun et al., 2019;Miech et al., 2019;Zhu and Yang, 2020;Miech et al., 2020;Li et al., 2020b;Luo et al., 2020) is trained on large-scale unlabeled instructional videos, such as HowTo100M (Miech et al., 2019) videos. However, the ASR captions associated with these videos are noisy, i.e., they are often temporally or semantically misaligned with the video content. Miech et al. (2020) propose Multiple Instance Learning Noise Contrastive Estimation (MIL-NCE) to address the temporal misalignment issue, but the semantic misalignment still remains. Moreover, MIL-NCE requires computing a separate similarity score from the target clip to each of the ASR caption candidates, making it unsuitable for the prevailing single-stream transformer pre-training architecture due to the linearly increased computation cost.
Inspired by recent work (Kim and Bansal, 2019;Kim et al., 2020) that uses dense captions (Johnson et al., 2016;Yang et al., 2017) to improve image and video QA models, we propose to add dense captions as an auxiliary text input that provides aligned visual cues, easing the difficulty of learning a video-text matching objective from often temporally and semantically misaligned ASR captions. In addition, we also propose a constrained attention loss, which applies an entropy minimization-based regularization (Tanaka et al., 2018;Yi and Wu, 2019) to the model to encourage higher attention scores from the video to the correctly matched caption among a pool of ASR caption candidates.

Method
In this section, we describe the details of DECEMBERT, including its architecture, pre-training objectives, dense caption inputs, and the constrained attention loss. Figure 2 shows an overview of DECEMBERT.
Input Representations. Input text (e.g., ASR captions) is tokenized and represented as a sequence of WordPiece (Wu et al., 2016) tokens. We use a trainable word embedding layer to encode the tokens into feature representations. We use appearance and motion features to represent videos. For appearance, we use a ResNet-152 (He et al., 2016) model pre-trained on ImageNet (Deng et al., 2009) to extract 2D video features at 1 FPS. Similarly, for motion, we use a 3D ResNeXt (Xie et al., 2017;Hara et al., 2018;Kataoka et al., 2020) to extract 3D video features at 1 FPS. The temporally aligned appearance and motion features are L2-normalized and concatenated along the feature dimension. We then apply a two-layer MLP to map them to the same dimension as the word embeddings. Next, we add learned positional embeddings and token type embeddings (Devlin et al., 2019) to the video and text representations to encode position and token type information. The video and text representations are then concatenated into a single sequence as input to a 12-layer transformer encoder for pre-training and downstream task finetuning.
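As a rough PyTorch-style sketch of the input pipeline described above; the module name `VideoTextInputEmbedder`, the 2048-d appearance/motion feature sizes, and the 768-d hidden size are illustrative assumptions rather than details quoted from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextInputEmbedder(nn.Module):
    """Illustrative input embedder: 2D+3D video features and WordPiece tokens are
    mapped to a shared space, given position/type embeddings, and concatenated."""
    def __init__(self, vocab_size, video_feat_dim=2048 + 2048, hidden_dim=768,
                 max_position=512, num_token_types=2):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, hidden_dim)
        # Two-layer MLP maps the concatenated appearance+motion features to hidden_dim.
        self.video_mlp = nn.Sequential(
            nn.Linear(video_feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim))
        self.pos_emb = nn.Embedding(max_position, hidden_dim)
        self.type_emb = nn.Embedding(num_token_types, hidden_dim)  # 0: video, 1: text

    def forward(self, appearance_feats, motion_feats, token_ids):
        # L2-normalize each stream, then concatenate along the feature dimension.
        video = torch.cat([F.normalize(appearance_feats, dim=-1),
                           F.normalize(motion_feats, dim=-1)], dim=-1)
        video = self.video_mlp(video)                      # (B, T_v, hidden_dim)
        text = self.word_emb(token_ids)                    # (B, T_t, hidden_dim)
        seq = torch.cat([video, text], dim=1)              # single input sequence
        positions = torch.arange(seq.size(1), device=seq.device)
        types = torch.cat([
            torch.zeros(video.size(1), dtype=torch.long, device=seq.device),
            torch.ones(text.size(1), dtype=torch.long, device=seq.device)])
        return seq + self.pos_emb(positions) + self.type_emb(types)
```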
Dense Captions. The original captions from ASR systems might not describe a video with rich content well, or can even be irrelevant to the video, as discussed in Section 1. Moreover, as ASR captions are often incomplete and unpunctuated, they might also be sub-optimal for language modeling. Therefore, we use dense captions (Johnson et al., 2016) automatically extracted by an off-the-shelf image dense captioning model (Yang et al., 2017) as additional language input for the model. This dense captioning model is pre-trained on Visual Genome (Krishna et al., 2017) regional captions.
To obtain video-level captions, we extract dense captions from frames sampled every two seconds. There are on average 4.4 dense captions per frame; we sample two of them from each frame at each training step to avoid redundant information and to reduce memory and computation cost. Note that the other dense captions might still be sampled at another training step. The sampled dense captions are then concatenated together as video-level captions for training. These extracted dense captions provide rich and comprehensive information regarding the salient objects, attributes, and actions (see examples in Figure 1 and Figure 2), which helps to optimize the video-text matching objective during pre-training and provides essential visual clues for many downstream tasks such as video question answering. Meanwhile, because the dense captions are text input with diverse semantics, they complement the typically short and incomplete ASR captions as an additional resource for better language modeling. We observe in our ablation study that adding dense captions improves both MLM accuracy and video-text matching accuracy, demonstrating the effectiveness of using them as extra inputs.
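As an illustration, a minimal sketch of the per-step dense caption sampling, assuming region captions are kept as plain strings grouped by frame (the helper name `sample_video_level_dense_caption` is ours):

```python
import random

def sample_video_level_dense_caption(dense_captions_per_frame, captions_per_frame=2):
    """dense_captions_per_frame: list over frames (sampled every 2 seconds), each
    entry a list of region captions (~4.4 on average per frame).
    Returns a single video-level caption string for this training step."""
    sampled = []
    for frame_captions in dense_captions_per_frame:
        k = min(captions_per_frame, len(frame_captions))
        # Different captions may be drawn at the next training step, so redundancy
        # is reduced without permanently discarding any caption.
        sampled.extend(random.sample(frame_captions, k))
    return " ".join(sampled)
```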
Pre-Training Objectives. During pre-training, we use masked language modeling (Devlin et al., 2019) (MLM) and cross-modality matching (Tan and Bansal, 2019;Miech et al., 2019;Zhu and Yang, 2020) (also referred to as video-text matching in our context) as our objectives to learn model parameters. For masked language modeling, the goal is to learn a better language model conditioned on the bidirectional text context and the video. We set a probability of 0.20 to replace an input language token with [MASK]. When dense captions are used as extra text input, we also perform masked language modeling on them with the same masking probability as for the ASR captions.
For video-text matching, with a probability of 0.50, we replace the original ASR captions with randomly sampled captions from other videos or clips as negatives. Of the sampled negative ASR captions, 50% are from different videos, while the other 50% are from the same video but different clips. Text from the same video is likely to have the same theme or similar context, and thus can serve as hard samples to improve the model's ability to do fine-grained matching. We do not designate a [CLS] token before the start of the input caption; instead, we take the mean pooling of the output sequence hidden states to perform binary classification for video-text matching. Empirically, we found this approach works better than using a [CLS] token, possibly because the matching objective requires a different learning rate than text token prediction.
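The following sketch illustrates how a pre-training example could be corrupted under these two objectives; the corpus helpers (`random_caption_from_other_video`, `random_caption_from_same_video`) and the [MASK] token id are illustrative assumptions rather than the authors' code:

```python
import random

MASK_TOKEN_ID = 103  # [MASK] id in a standard BERT WordPiece vocabulary (assumed)

def mask_tokens(token_ids, mask_prob=0.20):
    """Replace each token with [MASK] with probability mask_prob; return the
    corrupted sequence and the MLM labels (-100 = position not predicted)."""
    corrupted, labels = [], []
    for t in token_ids:
        if random.random() < mask_prob:
            corrupted.append(MASK_TOKEN_ID)
            labels.append(t)
        else:
            corrupted.append(t)
            labels.append(-100)
    return corrupted, labels

def sample_matching_pair(clip, asr_caption, corpus):
    """With probability 0.5 keep the true (clip, caption) pair; otherwise swap in a
    negative caption, half of the time from another video and half of the time from
    a different clip of the same video (a harder negative)."""
    if random.random() < 0.5:
        return clip, asr_caption, 1                                   # positive pair
    if random.random() < 0.5:
        return clip, corpus.random_caption_from_other_video(clip), 0  # easy negative
    return clip, corpus.random_caption_from_same_video(clip), 0       # hard negative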

Constrained Attention Loss. The ASR captions are often temporally misaligned with their corresponding clips, and simply pre-training a model over these misaligned clip-text pairs may lead to sub-optimal performance. Recent work MIL-NCE (Miech et al., 2020) addresses this problem with a multiple instance learning based objective, but it is computationally expensive, as scores need to be computed for multiple different clip-caption pairs. Within a single-stream transformer architecture, this means adding additional input sequences, which effectively increases the computation cost by a factor of N, where N is the number of neighboring captions and negative captions used. In our formulation, we only need to process a single sequence that combines the neighboring captions, without the need for negative captions. Since a video clip and the ASR caption extracted at the corresponding time do not exactly match in every sample, we address this issue by allowing the model to automatically select the matching caption: we propose a constrained attention loss that encourages the model to select the best-matched ASR caption from a pool of continuous caption candidates, which is achieved by minimizing the entropy of the attention from the video to the ASR captions.
Formally, we denote an input video V as [c_1, c_2, ..., c_N] and its corresponding ASR captions as [s_1, s_2, ..., s_N], where c_i is the i-th clip of V, s_i is the ASR caption of c_i, and N is the total number of clips in the video. For a clip c_i, instead of only inputting its associated caption s_i, we also include the captions from its two neighboring clips, i.e., s_{i-1} and s_{i+1}. In most cases, the correct matched caption for the clip is among these three captions. While our approach works for an arbitrary number of neighbors, we use two neighbors to illustrate the idea for simplicity; in fact, we found that, for 100 randomly sampled videos, using two neighbors already covers 95% of the videos with at least one positively matched ASR caption.
We denote X = [X_{c_i}; X_{s_{i-1}}; X_{s_i}; X_{s_{i+1}}] ∈ R^{l×d} as the generalized input sequence to each transformer layer (dense captions are omitted for simplicity), where X_{c_i}, X_{s_{i-1}}, X_{s_i}, X_{s_{i+1}} are the embedding matrices corresponding to the input clip and the three captions. We further simplify the notation as X = [X_0; X_1; X_2; X_3]. A single-head self-attention operation in the transformer encoder layers can then be expressed as
A = softmax(XX^T, dim=1) X,    (1)
where softmax(·, dim=1) denotes applying softmax over the second dimension of the input matrix, and A is the attention output matrix. When multiple attention heads are used, the formulation is similar. We use S to denote the similarity matrix computed by XX^T; it can be expressed using block matrices:
S_{q,r} = X_q X_r^T,  q, r ∈ {0, 1, 2, 3}.    (2)
Our goal is to encourage the model to focus on the correct matched caption for an input clip, i.e., the attention mass from the video clip to the correct matched caption should be higher than to the others. To achieve this, we first define the maximum response between the video hidden states X_0 and the hidden states X_j of the j-th ASR caption, aggregated over the video positions:
m_j = (1 / l_0) Σ_{p=1}^{l_0} max_q [S_{0,j}]_{p,q},  j ∈ {1, 2, 3},    (3)
where l_0 is the number of video elements. For a single example, we define its constrained attention loss as the entropy of the normalized responses:
L_ca = − Σ_{j=1}^{3} α_j log α_j,  where α = softmax([m_1, m_2, m_3]).    (4)
This loss formulation is based on entropy minimization (Tanaka et al., 2018;Yi and Wu, 2019): it forces the model to assign high attention scores to only one of the ASR captions, i.e., to peak at one caption rather than being flat, because the one-hot distribution has the smallest entropy. Figure 3 shows an overview of applying the constrained attention loss. During pre-training, we apply this loss to each of the attention heads across all layers, and add these losses along with the MLM loss and the video-text matching loss for joint optimization. Meanwhile, as the similarity matrix S is symmetric, the entropy minimization objective also encourages the correct matched ASR caption to have a higher similarity to the video, while forcing the mismatched captions to put more of their attention on the other ASR captions rather than on the video.
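To make Eqs. (3)-(4) concrete, below is a minimal PyTorch-style sketch of the loss for a single example and a single attention head. The aggregation over video positions (a mean of per-position maximum responses) and the helper names are our assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def constrained_attention_loss(hidden_states, segment_lengths):
    """hidden_states: (L, d) layer input X = [X_0; X_1; X_2; X_3] for one example,
    where X_0 holds the clip and X_1..X_3 the three candidate ASR captions.
    segment_lengths: lengths [l_0, l_1, l_2, l_3] of the four blocks."""
    splits = torch.split(hidden_states, segment_lengths, dim=0)
    video = splits[0]                                    # X_0: (l_0, d)
    responses = []
    for caption in splits[1:]:                           # X_1, X_2, X_3
        sim = video @ caption.t()                        # block S_{0,j}: (l_0, l_j)
        # For each video position, take its maximum response to this caption,
        # then average over video positions to get a clip-to-caption score m_j.
        responses.append(sim.max(dim=1).values.mean())
    alpha = F.softmax(torch.stack(responses), dim=0)     # distribution over 3 captions
    # Entropy of the distribution: minimized when attention peaks on one caption.
    return -(alpha * torch.log(alpha + 1e-8)).sum()
```

During pre-training, this scalar would be computed for each attention head in every layer and summed with the MLM and video-text matching losses, as described above.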

Experiments
In this section, we compare our model with state-of-the-art methods on three video-and-language downstream tasks (i.e., video captioning, text-to-video retrieval, and video question answering) across five datasets. We then present a comprehensive ablation study, where we show that each of our proposed components helps improve both pre-training task performance and downstream task performance. Lastly, we also provide an attention visualization example to demonstrate the effect of applying our proposed constrained attention loss.

Datasets and Tasks
Pre-training. We use HowTo100M (Miech et al., 2019) for pre-training. It contains 1.22 million YouTube instructional videos that cover 23.6K instruction tasks (e.g., making peanut butter, pruning a tree). Each video is associated with an English narration automatically transcribed by an Automatic Speech Recognition (ASR) system. On average, each video has 110 clip-caption pairs, with an average duration of 4 seconds per clip and 4 words per caption. We reserve 10K videos for validation, and use the rest of the videos for pre-training.
Video Captioning. We evaluate video captioning on the MSRVTT (Xu et al., 2016) and YouCook2 (Zhou et al., 2018) datasets. The task is to generate a text description (a single sentence or a paragraph of multiple sentences) for a given video.
Text-to-Video Retrieval. We evaluate text-to-video retrieval on the MSRVTT and YouCook2 datasets, where the goal is to retrieve a relevant video from a gallery of videos given a text query.
(i) MSRVTT is the same dataset as the captioning task. We follow previous work (Yu et al., 2018b;Miech et al., 2019) to use the 7k train+val videos for training and report results on the 1K test set sampled by Yu et al. (2018b). (ii) YouCook2 is the same dataset as the captioning task. We evaluate our model on the clip retrieval task as in previous work (Miech et al., 2019;Zhu and Yang, 2020).

Implementation Details
We use the BERT-base (Devlin et al., 2019) architecture for our 12-layer transformer encoder.

Comparison to State-of-the-Art
We present our results on three downstream tasks across five datasets, and compare against state-of-the-art methods. All downstream results are obtained by fine-tuning the same model pre-trained with dense captions and the constrained attention loss.
Tables 1 and 2: Video captioning results on MSRVTT and YouCook2. Compared methods include (Zhou et al., 2018), MART (Lei et al., 2020a), COOT (Ging et al., 2020), and MIL-NCE (Miech et al., 2020). PT indicates models with pre-training on HowTo100M.
Video Captioning. We follow Vaswani et al. (2017) to train auto-regressive captioning models, by only allowing the text tokens to attend to tokens that precede them during training. At inference time, we use beam search with beam size 5 to generate captions. For MSRVTT, we evaluate captioning performance at the sentence level. For YouCook2, we follow previous work (Lei et al., 2020a;Ging et al., 2020) and evaluate performance at the paragraph level, where single-segment captions are concatenated into a paragraph for evaluation. We use the standard metrics BLEU@4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), Rouge-L (Lin, 2004), and CIDEr-D (Vedantam et al., 2015) to report performance. Table 1 shows the comparison on MSRVTT, where our DECEMBERT model achieves a significant performance gain over the previous state of the art. Notably, DECEMBERT outperforms ORG-TRL by 1.6% BLEU@4, 2.6% Rouge-L, and 1.4% CIDEr-D, even though ORG-TRL uses a set of strong visual features (appearance, motion, and object) together with a sophisticated graph encoder network and external language model supervision. Table 2 shows the results on the YouCook2 captioning task. Overall, DECEMBERT outperforms previous methods across all metrics. Compared to the strong baseline MART+COOT+MIL-NCE (Lei et al., 2020a;Ging et al., 2020;Miech et al., 2020) PT, which uses HowTo100M videos for pre-training followed by a dedicated hierarchical modeling training, our approach still shows better performance with a reasonable margin. This shows the effectiveness of our pre-training strategy.
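The sketch below shows one way to build the attention mask implied by this training setup, with video tokens visible to all positions and text tokens restricted to preceding text; the function name and tensor layout are illustrative assumptions:

```python
import torch

def captioning_attention_mask(num_video_tokens, num_text_tokens):
    """Boolean mask (True = may attend). Each text token attends to all video tokens
    and only to the text tokens that precede it (plus itself), enabling
    auto-regressive caption generation."""
    total = num_video_tokens + num_text_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:, :num_video_tokens] = True                         # every position sees the video
    text = torch.tril(torch.ones(num_text_tokens, num_text_tokens, dtype=torch.bool))
    mask[num_video_tokens:, num_video_tokens:] = text         # causal mask over text tokens
    return mask
```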
Text-to-video Retrieval. We train text-to-video retrieval models similarly to the way we perform video-text matching, where we sample a negative caption 50% of the time. We use average recall at K (R@K) and median rank (MdR) to report performance on the retrieval tasks.
Table 3: Text-to-video retrieval results on the MSRVTT 1K test set (Yu et al., 2018b). PT indicates models with pre-training on HowTo100M (or on HowTo100M plus TV shows (Liu et al., 2020a) for HERO). We gray out models that use extra ASR features, for a fair comparison.
Table 4: Text-to-video retrieval results on YouCook2, compared with (Klein et al., 2015), HowTo (Miech et al., 2019), COOT (Ging et al., 2020), and MIL-NCE (Miech et al., 2020).
We show the MSRVTT text-to-video retrieval results in Table 3. Overall, our approach achieves the best performance. Compared to the pre-trained models HowTo (Miech et al., 2019), ActBERT (Zhu and Yang, 2020), and HERO (Li et al., 2020b), DECEMBERT achieves strong performance with a reasonable margin. It outperforms HERO by 0.7% R@1; note that HERO is pre-trained with extra TV show videos (Liu et al., 2020a) in addition to the HowTo100M videos that we use. Moreover, DECEMBERT is also quite competitive compared to the HERO w/ ASR model, which uses additional ASR features during finetuning. For the YouCook2 text-to-video retrieval results shown in Table 4, our approach also shows better performance compared to the pre-trained models HowTo and COOT+MIL-NCE. Notably, it outperforms the previous state-of-the-art COOT+MIL-NCE by 7.5% R@10.
Table 5: Question answering accuracy on MSRVTT-QA.
Method                                     Accuracy
ST-VQA (Jang et al., 2017)                 30.9
Co-Memory (Gao et al., 2018)               32.0
AMU                                        32.5
Heterogeneous Memory (Fan et al., 2019)    33.0
HCRN (Le et al., 2020)                     35.6
DECEMBERT                                  37.4
Video Question Answering. We use a two-layer MLP followed by a softmax layer for open-ended question answering, where we optimize the probability of choosing the correct answer from a large pool of candidate answers. We report accuracy to measure QA performance. We show the MSRVTT-QA results in Table 5, where our approach outperforms all the baseline methods by a large margin. Compared to HCRN (Le et al., 2020), which employs a complicated hierarchical reasoning module, our approach achieves a 1.8% performance gain, setting a new state of the art for the task.
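For reference, a QA head along these lines could look like the following sketch; the mean pooling over the encoder states and the answer-vocabulary size are our assumptions for illustration, not details given in the paper:

```python
import torch.nn as nn

class OpenEndedQAHead(nn.Module):
    """Two-layer MLP over pooled encoder states, followed by a softmax over a fixed
    vocabulary of candidate answers (trained with a cross-entropy objective)."""
    def __init__(self, hidden_dim=768, num_answers=1500):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers))

    def forward(self, encoder_states):           # (B, T, hidden_dim)
        pooled = encoder_states.mean(dim=1)      # mean pooling over the sequence (assumed)
        return self.mlp(pooled).softmax(dim=-1)  # probabilities over candidate answers
```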

Analysis
Ablation Study. We present an ablation study of our pre-training strategies, on both the pre-training tasks and the MSRVTT captioning and QA downstream tasks. We report ablation results on our 10K held-out HowTo100M videos for the pre-training tasks, i.e., masked language modeling (MLM) accuracy and video-text matching accuracy. Because we use MLM for both the dense captions and the original ASR captions, we report their accuracies separately. The results are shown in Table 6. To understand how the pre-training strategies affect downstream performance, we also perform downstream finetuning from models pre-trained with these different strategies. The results are shown in Table 7. Compared to the basic model that uses only a single paired ASR caption with each clip for training, we observe that the variant that takes three ASR captions achieves significantly higher accuracy in MLM and video-text matching. Adding dense captions and the constrained attention loss further improves the performance. Overall, the same trend also holds for the downstream performance on the MSRVTT captioning and QA tasks. The best captioning and QA models are finetuned from the model pre-trained using both the dense captions and the constrained attention loss. Compared to the basic model with only MLM and video-text matching, our best models achieve a significant performance gain: e.g., 3.3% BLEU@4 and 3.1% CIDEr-D for captioning, and 2.3% accuracy for QA.
Qualitative Results. During pre-training, we apply our proposed constrained attention loss to every attention head across all layers. In Figure 4, we compare the attention maps from models pre-trained with and without the proposed constrained attention loss. As we found that the attention weight distributions (not absolute values) on different layers look similar to each other, we randomly chose the 10-th layer to showcase the effect of adding the constrained attention loss. We observe that after adding the constrained attention loss as a regularization, the attention mass concentrates on the best-matched ASR caption rather than being distributed over all the captions.
Figure 4: Attention visualization for models with and without the constrained attention loss. After adding constrained attention, the attention mass concentrates on the ASR caption (e.g., AC3) that best matches the video content and the dense captions. These attention maps are taken from an attention head of the 10-th layer of the transformer model.

Conclusion
In this work, we propose DECEMBERT as an improved pre-training method for learning from noisy, unlabeled instructional videos. Specifically, we propose adding automatically-extracted frame-level dense captions as an auxiliary text input for learning better video and language associations. We also propose a constrained attention loss that forces the model to automatically focus on the best-matched caption from a pool of potentially misaligned caption candidates via entropy minimization. Comprehensive experiments on three popular video-and-language tasks (i.e., text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate the effectiveness of DECEMBERT compared to existing approaches. We also provide detailed ablation studies and visualizations to quantitatively and qualitatively examine the impact of our added components.