HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

We present HERO, a Hierarchical EncodeR for Omni-representation learning, for large-scale video+language pre-training. HERO encodes multimodal inputs in a hierarchical fashion, where local textual context of a video frame is captured by a Cross-modal Transformer via multimodal fusion, and global video context is captured by a Temporal Transformer. Besides standard Masked Language Modeling (MLM) and Masked Frame Modeling (MFM) objectives, we design two new pre-training tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the model predicts the right order of shuffled video frames. Different from previous work that mostly focused on cooking or narrated instructional videos, HERO is jointly trained on HowTo100M and large-scale TV show datasets to learn complex social scenes, dynamic backdrop transitions and multi-character interactions. Extensive experiments demonstrate that HERO achieves new state of the art on both text-based video moment retrieval and video question answering tasks across different domains.


Introduction
Inspired by BERT (Devlin et al., 2019), large-scale multimodal pre-training has prevailed in the arena of vision-and-language research (Lu et al., 2019a; Tan and Bansal, 2019; Chen et al., 2019b). However, most existing models are tailored for static images, not dynamic videos. VideoBERT (Sun et al., 2019b) was the first to apply BERT to learn joint embedding for video-text pairs. But as only discrete tokens are used to represent video frames, rich video frame features are not fully utilized. To remedy this, CBT (Sun et al., 2019a) uses a contrastive loss, but still mainly for video representation learning alone, with text input only considered as side information. UniViLM (Luo et al., 2020) takes a step further and considers both understanding (e.g., text-based video retrieval) and generation (i.e., video captioning) tasks.
Several limitations constrain the scope of existing models. (i) Most model designs are direct adaptations of BERT, without considering the unique characteristics of video+text input. Subtitle sentences and visual frames are usually concatenated, which loses the temporal alignment between the two modalities. (ii) Pre-training tasks are directly borrowed from image+text pre-training, without exploiting the sequential nature of video input. (iii) Compared to diverse image domains, video datasets investigated in existing models are restricted to cooking or narrated instructional videos (Miech et al., 2019), excluding video sources that contain dynamic scene transitions and multi-character interactions.
To address these challenges, we present a new video-and-language large-scale pre-training approach, HERO (Hierarchical EncodeR for Omni-representation learning). As illustrated in Figure 1, HERO takes as input video clip frames and their accompanying subtitle sentences. Instead of adopting a flat BERT-like encoder, HERO encodes multimodal inputs in a hierarchical fashion, with (i) a Cross-modal Transformer to fuse a subtitle sentence and its accompanying local video frames, followed by (ii) a Temporal Transformer to obtain a sequentially contextualized embedding for each video frame, using all the surrounding frames as global context. The proposed hierarchical model first absorbs visual and textual local context on frame level, which is then transferred to a global clip-level temporal context. Experiments show that this novel model design achieves better performance than a flat BERT-like architecture.
Four pre-training tasks are designed for HERO: (i) Masked Language Modeling (MLM); (ii) Masked Frame Modeling (MFM); (iii) Video-Subtitle Matching (VSM); and (iv) Frame Order Modeling (FOM). Compared to previous work, the key novelties are VSM and FOM, which encourage explicit temporal alignment between modalities as well as full-scale exploitation of the sequential nature of video input. In VSM, the model considers not only global alignment, by predicting whether a subtitle matches the input video clip, but also local temporal alignment, by retrieving the moment where the subtitle should be localized in the video clip. In FOM, we randomly select and shuffle a subset of video frames, and the model is trained to restore their original order. Extensive ablation studies demonstrate that both VSM and FOM play a critical role in video+language pre-training.
To empower the model with richer knowledge, such as contextual understanding of dynamic social interactions between multiple characters and dramatic scene/event evolvement, we jointly train HERO on two diverse datasets: the HowTo100M dataset (containing 1.22 million narrated instructional videos) (Miech et al., 2019) and a large-scale TV dataset (containing about 22k video clips from TV episodes spanning different genres). Compared to factual and instructional descriptions in HowTo100M, the TV dataset contains more complex plots that require comprehensive interpretation of human emotions, social relations and causal relations of events, which makes it a valuable supplement to HowTo100M and a closer approximation to real-life scenarios.
Previous models pre-trained on HowTo100M are evaluated on YouCook2 (Zhou et al., 2018a) and MSR-VTT (Xu et al., 2016) datasets. YouCook2 focuses on cooking videos only, and the captions in MSR-VTT are very simple. To evaluate our model on more challenging benchmarks, we collect two new datasets on video moment retrieval (HowTo100M-R) and question answering (HowTo100M-QA). We also evaluate on TVR and TVQA, with extensive ablation studies on pre-training settings.
Our main contributions are summarized as follows. (i) We present HERO, a hierarchical Transformer-based encoder for video+language representation learning. (ii) We propose new pre-training tasks VSM and FOM, which complement MLM and MFM objectives by better capturing temporal alignment between modalities in both global and local contexts. (iii) Different from previous work that mainly relies on HowTo100M, we include additional large-scale TV show datasets for pre-training, encouraging the model to learn from richer and more diverse visual content. (iv) We also collect two new datasets based on HowTo100M for video moment retrieval/QA, and will release the new benchmarks to foster future studies. HERO achieves new state of the art across all the evaluated tasks.
Related Work
Branching out from language processing towards multimodality, subsequent studies have emerged in the vision+language space. Pioneering works such as ViLBERT (Lu et al., 2019a) and LXMERT (Tan and Bansal, 2019) propose to encode image and text modalities by two separate Transformers, with a third Transformer for later multimodal fusion. Compared to this two-stream architecture, VL-BERT (Su et al., 2019), Unicoder-VL (Li et al., 2019a), B2T2 (Alberti et al., 2019), VisualBERT, and UNITER advocate a single-stream architecture, where image and text signals are fused together at an early stage. More recently, ViLBERT is enhanced by multi-task learning (Lu et al., 2019b), Oscar enhances pre-training with image tags, and Pixel-BERT proposes to align image pixels (instead of bottom-up features (Anderson et al., 2018)) with text.
In contrast to the boom in other areas, video+language pre-training is still in its infancy. VideoBERT (Sun et al., 2019b), CBT (Sun et al., 2019a) and UniViLM (Luo et al., 2020) are the only existing works exploring this space. In this paper, we aim to propel video+language omni-representation learning in four dimensions: (i) better model architecture design; (ii) better pre-training task design; (iii) diversification of training corpora; and (iv) new high-quality benchmarks for downstream evaluation.

Video+Language Tasks
Text-based video moment retrieval is one of the most popular video+language tasks currently studied. Anne Hendricks et al. (2017) and Gao et al. (2017) introduce the task of Single Video Moment Retrieval (SVMR), which aims at retrieving a moment from a single video via a natural language query. Escorcia et al. (2019) extend SVMR to Video Corpus Moment Retrieval (VCMR), expanding the search pool from a single video to a large video corpus. TVR defines a new task, Video-Subtitle Corpus Moment Retrieval, which provides temporally aligned subtitle sentences along with the videos as inputs. For this new task, XML is proposed to compute similarity scores between the query and each modality separately (visual frames, subtitles) and then sum them together for final prediction.
Video question answering (QA) aims to predict answers to natural language questions given a video as context. Some tasks collect QA pairs based on one modality only. For example, MovieFIB (Maharaj et al., 2017) focuses on visual concepts, MovieQA (Tapaswi et al., 2016) is based on text summaries, and TGIF-QA (Jang et al., 2017) uses predefined templates for question generation on short GIFs. TVQA designed a more realistic multimodal setting: human-written QA pairs are collected along with their associated video segments, with both video clips and accompanying subtitles provided to annotators. Later on, TVQA+ augmented TVQA with frame-level bounding box annotations for spatial-temporal video QA, and introduced the STAGE framework to jointly localize moments, ground objects, and answer questions.

Hierarchical Video+Language Encoder
In this section, we introduce the proposed HERO architecture (Sec. 3.1) and explain the four pre-training tasks in detail (Sec. 3.2).

Model Architecture
Model architecture of HERO is illustrated in Figure 1. HERO takes in the visual frames of a video clip and the textual tokens of subtitle sentences as inputs. First, the inputs are fed into a Video Embedder and a Text Embedder to extract their respective embeddings. Second, HERO computes contextualized video embeddings in a hierarchical fashion. The local textual context of each visual frame is captured by a Cross-modal Transformer, while a Temporal Transformer takes global video context into consideration. To be more specific, the Cross-modal Transformer computes contextualized multi-modal embeddings between a subtitle sentence and its associated visual frames. The encoded frame embeddings within the whole video clip are then collected, and fed into the Temporal Transformer to obtain the final contextualized video embeddings.
Frame-Subtitle Pairing Given a pair of video clip and its associated subtitle, we first extract a sequence of visual frames $\mathbf{v} = \{v_i\}_{i=1}^{N_v}$ at a fixed frame rate ($N_v$ is the number of visual frames in a video clip). The subtitle is parsed into sentences $\mathbf{s} = \{s_i\}_{i=1}^{N_s}$ ($N_s$ is the number of sentences in each subtitle). Note that $N_v \geq N_s$ in most cases, since a subtitle sentence may last for several visual frames. We then align the subtitle sentences temporally with the visual frames. Specifically, for each subtitle sentence $s_i$, we pair it with a sequence of visual frames whose timestamps overlap with the subtitle timestamp, and denote these visual frames as $\mathbf{v}_{s_i} = \{v_{s_i}^{j}\}_{j=1}^{K}$ ($K$ is the number of frames overlapping with $s_i$). In the case that multiple sentences overlap with the same visual frame, we always pair the frame with the sentence that has maximal temporal Intersection over Union (tIoU) to avoid duplication. It is possible that a subtitle sentence is not paired with any visual frame; in this case, we concatenate it to the neighboring sentences to avoid information loss.
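The pairing rule above can be illustrated with a minimal Python sketch. It assumes that each frame and each subtitle sentence is represented by a (start, end) interval in seconds; the function names `tiou` and `pair_frames_with_subtitles` are hypothetical and not part of any released HERO code.

```python
def tiou(a, b):
    """Temporal IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def pair_frames_with_subtitles(frame_spans, subtitle_spans):
    """Assign every frame to the overlapping sentence with maximal tIoU.

    Returns a dict mapping sentence index -> list of frame indices.
    A sentence may end up with no frames (empty list).
    """
    pairs = {i: [] for i in range(len(subtitle_spans))}
    for j, frame in enumerate(frame_spans):
        scores = [tiou(frame, sub) for sub in subtitle_spans]
        if not scores:
            continue
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] > 0:  # only pair frames that overlap some sentence
            pairs[best].append(j)
    return pairs


# Four 2-second frames and two subtitle sentences (illustrative values).
frames = [(0, 2), (2, 4), (4, 6), (6, 8)]
subs = [(0, 3.5), (3.5, 8)]
pairs = pair_frames_with_subtitles(frames, subs)
```

In this toy example the frame spanning seconds 2 to 4 overlaps both sentences and is assigned to the first one, which has the larger tIoU. Merging unpaired sentences into their neighbors is left out of the sketch.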
Input Embedder For Text Embedder, we follow prior work and tokenize a subtitle sentence $s_i$ into a sequence of WordPieces sub-word tokens, i.e., $\mathbf{w}_{s_i} = \{w_{s_i}^{j}\}_{j=1}^{L}$ ($L$ is the number of tokens in $s_i$). The final representation for each sub-word token is obtained by summing up its token embedding and position embedding, followed by a layer normalization (LN) layer.
Figure 1: Overview of HERO model (best viewed in color), consisting of Cross-modal Transformer and Temporal Transformer, learned via four pre-training tasks hierarchically. Initial frame features are obtained by SlowFast and ResNet feature extractors, and initial word embeddings are learned via an embedding layer initialized from RoBERTa. During training, we sample one task per mini-batch to prevent different tasks from corrupting each other's inputs. Sec. 3 provides more detailed descriptions of the model architecture and each pre-training task.

For Video Embedder, we first use ResNet (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) and SlowFast (Feichtenhofer et al., 2019) pre-trained on Kinetics (Kay et al., 2017) to extract 2D and 3D visual features for each video frame. The 2D and 3D features are concatenated as our visual features, which are fed through a fully-connected (FC) layer to be projected into the same lower-dimensional space as token embeddings. Since video frames are sequential, their position embeddings can be calculated in the same way as in Text Embedder. The final embedding of a visual frame is obtained by summing up FC outputs and position embeddings and then passing through an LN layer. In summary, after Input Embedder, the token embeddings and visual frame embeddings corresponding to $\mathbf{w}_{s_i}$ and $\mathbf{v}_{s_i}$ are denoted as $\mathbf{W}^{emb}_{s_i} \in \mathbb{R}^{L \times d}$ and $\mathbf{V}^{emb}_{s_i} \in \mathbb{R}^{K \times d}$ ($d$ is the hidden size).
Cross-modal Transformer To utilize the inherent alignment between subtitles and video frames, for each subtitle sentence $s_i$, we first learn contextualized embeddings between the corresponding tokens $\mathbf{w}_{s_i}$ and their associated visual frames $\mathbf{v}_{s_i}$ through cross-modal attention. Inspired by the recent success (Lu et al., 2019a) of using Transformer (Vaswani et al., 2017) for multimodal fusion, we also use a multi-layer Transformer here. The output of the Cross-modal Transformer is a sequence of contextualized embeddings for each subtitle token and each video frame:
$$\mathbf{V}^{cross}_{s_i}, \mathbf{W}^{cross}_{s_i} = f_{cross}(\mathbf{V}^{emb}_{s_i}, \mathbf{W}^{emb}_{s_i}),$$
where $f_{cross}(\cdot, \cdot)$ denotes the Cross-modal Transformer, $\mathbf{V}^{cross}_{s_i} \in \mathbb{R}^{K \times d}$ and $\mathbf{W}^{cross}_{s_i} \in \mathbb{R}^{L \times d}$.
Temporal Transformer After collecting all the visual frame embeddings $\mathbf{V}^{cross} = \{\mathbf{V}^{cross}_{s_i}\}_{i=1}^{N_s} \in \mathbb{R}^{N_v \times d}$ from the output of the Cross-modal Transformer, we use another Transformer as temporal attention to learn contextualized video embeddings from the global context of a video clip. To avoid losing positional information, we use a residual connection (He et al., 2016) to add back $\mathbf{V}^{emb} \in \mathbb{R}^{N_v \times d}$. The final contextualized video embeddings are calculated as:
$$\mathbf{V}^{temp} = f_{temp}(\mathbf{V}^{emb} + \mathbf{V}^{cross}),$$
where $f_{temp}(\cdot)$ denotes the Temporal Transformer, and $\mathbf{V}^{temp} \in \mathbb{R}^{N_v \times d}$.
Compared with a flat BERT-like encoder, which directly concatenates all the textual tokens and visual frames as model inputs, the proposed model effectively utilizes the temporal alignment between subtitle sentences and video frames for multi-modal fusion in a more fine-grained manner. In the experiments, we show our model design far outperforms a flat BERT-like baseline.
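To make the two-level data flow concrete, the following Python sketch traces the shapes through the hierarchy. It is only an illustration: `mean_mix` is a toy stand-in for self-attention, and `f_cross`, `f_temp` and `hero_encode` are hypothetical placeholders rather than the actual Transformer layers used in HERO.

```python
def mean_mix(vectors):
    """Toy stand-in for self-attention: add the mean of all vectors to each one."""
    d = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]
    return [[x + m for x, m in zip(v, mean)] for v in vectors]


def f_cross(v_emb, w_emb):
    """Placeholder Cross-modal Transformer over one sentence and its K frames."""
    mixed = mean_mix(v_emb + w_emb)
    return mixed[:len(v_emb)], mixed[len(v_emb):]


def f_temp(v):
    """Placeholder Temporal Transformer over all N_v frames of the clip."""
    return mean_mix(v)


def hero_encode(clip):
    """clip: list of (V_emb_si, W_emb_si) pairs, one per subtitle sentence."""
    v_emb, v_cross = [], []
    for v_si, w_si in clip:
        v_cross_si, _ = f_cross(v_si, w_si)  # local fusion within one sentence
        v_emb.extend(v_si)
        v_cross.extend(v_cross_si)
    # Residual connection: add the frame embeddings back before temporal attention.
    fused = [[a + b for a, b in zip(e, c)] for e, c in zip(v_emb, v_cross)]
    return f_temp(fused)


clip = [
    ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]),   # sentence 1: K=2 frames, L=1 token
    ([[1.0, 1.0]], [[0.0, 0.0], [1.0, 0.0]]),   # sentence 2: K=1 frame, L=2 tokens
]
v_temp = hero_encode(clip)  # one d-dimensional vector per frame (N_v = 3)
```

The point of the sketch is the ordering of operations: fusion happens per sentence first, and only the frame outputs are carried into the clip-level pass.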

Pre-training Tasks
We introduce four main tasks to pre-train our model: Masked Language Modeling (MLM), Masked Frame Modeling (MFM) (with two variants), Video-Subtitle Matching (VSM), and Frame Order Modeling (FOM). As shown in Figure 1, MFM and MLM are in analogy to BERT (Devlin et al., 2019). Word masking is realized by replacing the word with a special token [MASK], and frame masking by replacing the visual feature vector of a frame with zeros. Following Chen et al. (2019b), we only mask one modality each time while keeping the other modality intact. VSM is designed to learn both local alignment (between visual frames and a subtitle sentence) and global alignment (between a video clip and a sequence of subtitle sentences). FOM is designed to model sequential characteristics of video clips, by learning the original order of randomly reordered frames.

Masked Language Modeling
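The inputs for MLM include: (i) sub-word tokens from the $i$-th subtitle sentence $\mathbf{w}_{s_i}$; (ii) visual frames $\mathbf{v}_{s_i}$ aligned with $\mathbf{w}_{s_i}$; and (iii) mask indices $\mathbf{m} \in \mathbb{N}^{M}$. In MLM, we randomly mask out input words with a probability of 15%, and replace the masked tokens $\mathbf{w}^{\mathbf{m}}_{s_i}$ with special tokens [MASK]. The goal is to predict these masked words based on the observation of their surrounding words $\mathbf{w}^{\setminus \mathbf{m}}_{s_i}$ and the visual frames aligned with the sentence $\mathbf{v}_{s_i}$, by minimizing the negative log-likelihood:
$$\mathcal{L}_{MLM}(\theta) = -\mathbb{E}_{D} \log P_{\theta}(\mathbf{w}^{\mathbf{m}}_{s_i} \mid \mathbf{w}^{\setminus \mathbf{m}}_{s_i}, \mathbf{v}_{s_i}),$$
where $\theta$ denotes the trainable parameters, and each pair $(\mathbf{w}_{s_i}, \mathbf{v}_{s_i})$ is sampled from the whole training set $D$.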
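As a rough illustration of the masking step and the negative log-likelihood, consider the following Python sketch. It assumes independent per-token masking with probability 0.15 and always substitutes [MASK]; the helper names `mask_tokens` and `mlm_loss` are hypothetical, and the model that produces the probabilities is not shown.

```python
import math
import random

MASK = "[MASK]"


def mask_tokens(tokens, prob=0.15, rng=None):
    """Replace each token by [MASK] with probability `prob`.

    Returns the masked sequence and the list of masked positions (mask indices).
    """
    rng = rng or random.Random()
    masked, indices = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < prob:
            masked[i] = MASK
            indices.append(i)
    return masked, indices


def mlm_loss(true_word_probs):
    """Negative log-likelihood, given the probability the model assigns
    to the true word at each masked position."""
    return -sum(math.log(p) for p in true_word_probs)


tokens = ["the", "man", "opens", "the", "door"]
masked, mask_indices = mask_tokens(tokens, rng=random.Random(0))
```

The probabilities passed to `mlm_loss` would come from a softmax over the vocabulary at each masked position, conditioned on the unmasked words and the aligned frames.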

Masked Frame Modeling
Similar to MLM, we also sample frames and mask their visual features with a probability of 15%. However, the difference is that MLM is performed on a local context (i.e., the output of Cross-modal Transformer), while MFM is performed on the global context (i.e., the output of Temporal Transformer). The model is trained to reconstruct masked frames $\mathbf{v}_{\mathbf{m}}$, given the remaining frames $\mathbf{v}_{\setminus \mathbf{m}}$ and all the subtitle sentences $\mathbf{s}$. The visual features of masked frames are replaced by zeros. Unlike textual tokens that are represented as discrete labels, visual features are high-dimensional and continuous, and thus cannot be supervised via class likelihood. Instead, we propose two variants for MFM, which share the same objective base:
$$\mathcal{L}_{MFM}(\theta) = \mathbb{E}_{D} f_{\theta}(\mathbf{v}_{\mathbf{m}} \mid \mathbf{v}_{\setminus \mathbf{m}}, \mathbf{s}).$$
Masked Frame Feature Regression (MFFR) MFFR learns to regress the output on each masked frame $v^{(i)}_{\mathbf{m}}$ to its visual features. Specifically, we apply an FC layer to convert its output into a vector $h_{\theta}(v^{(i)}_{\mathbf{m}})$ of the same dimension as the input visual feature $r(v^{(i)}_{\mathbf{m}})$, and apply L2 regression between the two: $f_{MFFR}(\mathbf{v}_{\mathbf{m}} \mid \mathbf{v}_{\setminus \mathbf{m}}, \mathbf{s}) = \sum_{i=1}^{M} \| h_{\theta}(v^{(i)}_{\mathbf{m}}) - r(v^{(i)}_{\mathbf{m}}) \|_2^2$.
Masked Frame Modeling with Noise Contrastive Estimation (MNCE) Instead of directly regressing the real values of masked visual features, we use the softmax version of the Noise Contrastive Estimation (NCE) loss (Jozefowicz et al., 2016), which has been widely adopted in self-supervised representation learning (Sun et al., 2019a; Hjelm et al., 2019; Oord et al., 2018). NCE loss encourages the model to learn to identify the correct frame (given the context) compared to a set of negative distractors.
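Similar to MFFR, we feed the output of the masked frames $\mathbf{v}_{\mathbf{m}}$ through an FC layer before computing the NCE loss against the negative distractors.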
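The softmax form of the NCE loss can be sketched in a few lines of Python. The sketch assumes that similarity is a plain dot product between the projected output at a masked position and candidate frame vectors; how the positive target and the negative distractors are produced is an assumption of this illustration, and `softmax_nce_loss` is a hypothetical name.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax_nce_loss(pred, positive, negatives):
    """Negative log softmax score of the positive among positive + negatives.

    pred:      projected model output at a masked frame position
    positive:  target vector of the true frame
    negatives: list of distractor vectors
    """
    logits = [dot(pred, positive)] + [dot(pred, n) for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


loss_match = softmax_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_mismatch = softmax_nce_loss([0.0, 1.0], [1.0, 0.0], [[0.0, 1.0]])
```

The loss is small when the prediction is closer to the true frame than to any distractor, and grows when a distractor scores higher.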

Video-Subtitle Matching
The inputs to VSM are: (i) a sampled query $s_q$ from all subtitle sentences, (ii) the whole video clip $\mathbf{v}$, and (iii) the remaining subtitle sentences $\mathbf{s}_{\setminus q}$ for the video clip. We expect the model to learn: (i) local alignment, i.e., the start and end indices $y_{st}, y_{ed} \in \{1, ..., N_v\}$, indicating the span of visual frames aligned with the query; and (ii) global alignment, i.e., which video the sampled query is matched to.
In VSM, we follow XML to compute the matching scores between query and visual frames at both local and global levels. Specifically, we extract the output of Temporal Transformer as the final visual frame representation $\mathbf{V}^{temp} \in \mathbb{R}^{N_v \times d}$. The query is fed into Cross-modal Transformer to compute its textual representations $\mathbf{W}^{cross}_{s_q} = f_{cross}(\mathbf{0}, \mathbf{W}^{embed}_{s_q})$. Based on this, we use a query encoder, consisting of a self-attention layer, two linear layers and an LN layer, to obtain the final query vector $\mathbf{q} \in \mathbb{R}^{d}$ from $\mathbf{W}^{cross}_{s_q}$.
Local Alignment The local query-video matching score is computed using dot product:
$$S_{local}(s_q, \mathbf{v}) = \mathbf{V}^{temp} \mathbf{q} \in \mathbb{R}^{N_v}.$$
Two trainable 1D convolution filters are applied to the scores, followed by a softmax layer, to generate two probability vectors $\mathbf{p}_{st}, \mathbf{p}_{ed} \in \mathbb{R}^{N_v}$, representing the probabilities of every position being the start and end of the ground-truth span. During training, we sample 15% of the subtitle sentences as queries for each video, and use the cross-entropy loss to predict the start and end index for local alignment:
$$\mathcal{L}_{local} = -\mathbb{E}_{D} \left[ \log(\mathbf{p}_{st}[y_{st}]) + \log(\mathbf{p}_{ed}[y_{ed}]) \right],$$
where $\mathbf{p}[y]$ denotes indexing the $y$-th element of the vector $\mathbf{p}$. Note that XML computes the query-video matching score for each modality separately, and the final matching score is the sum of the two scores. In our HERO model, multi-modal fusion is performed at a much earlier stage.
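A minimal Python sketch of the local alignment computation follows. It assumes odd-length 1D filters with zero padding and 0-based indices (the text uses 1-based indices); the kernel size and the names `conv1d_same`, `local_alignment` and `local_loss` are hypothetical choices for illustration.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def conv1d_same(xs, kernel):
    """1D cross-correlation with zero padding; odd kernel length assumed."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(xs) + [0.0] * pad
    return [sum(w * padded[i + j] for j, w in enumerate(kernel))
            for i in range(len(xs))]


def local_alignment(v_temp, q, k_st, k_ed):
    """Dot-product scores per frame, then start/end probability vectors."""
    scores = [sum(a * b for a, b in zip(frame, q)) for frame in v_temp]
    return softmax(conv1d_same(scores, k_st)), softmax(conv1d_same(scores, k_ed))


def local_loss(p_st, p_ed, y_st, y_ed):
    """Cross-entropy on the ground-truth start and end positions (0-based)."""
    return -(math.log(p_st[y_st]) + math.log(p_ed[y_ed]))


v_temp = [[0.0, 0.0], [5.0, 0.0], [5.0, 0.0], [0.0, 0.0]]  # N_v = 4, d = 2
q = [1.0, 0.0]
identity = [0.0, 1.0, 0.0]  # placeholder for a trained filter
p_st, p_ed = local_alignment(v_temp, q, identity, identity)
```

With the identity filter used here, the two middle frames receive most of the probability mass, so a span starting at index 1 and ending at index 2 incurs a lower loss than one covering the whole clip.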

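Global Alignment The global matching score is computed by max-pooling the cosine similarities between each frame and the query:
$$S_{global}(s_q, \mathbf{v}) = \max\left( \frac{\mathbf{V}^{temp}}{\|\mathbf{V}^{temp}\|} \frac{\mathbf{q}}{\|\mathbf{q}\|} \right).$$
We use a combined hinge loss $\mathcal{L}_h$ over positive and negative query-video pairs. For each positive pair $(s_q, \mathbf{v})$, we replace $\mathbf{v}$ or $s_q$ with one from other samples in the same mini-batch to construct two sets of negative examples, $(s_q, \hat{\mathbf{v}})$ and $(\hat{s}_q, \mathbf{v})$. The hinge loss is $\mathcal{L}_h(S_{pos}, S_{neg}) = \max(0, \delta + S_{neg} - S_{pos})$, where $\delta$ is the margin hyper-parameter, and the training loss sums it over the two sets of negatives:
$$\mathcal{L}_{global} = \mathbb{E}_{D} \left[ \mathcal{L}_h(S_{global}(s_q, \mathbf{v}), S_{global}(\hat{s}_q, \mathbf{v})) + \mathcal{L}_h(S_{global}(s_q, \mathbf{v}), S_{global}(s_q, \hat{\mathbf{v}})) \right].$$
The final loss is $\mathcal{L}_{VSM} = \mathcal{L}_{local} + \lambda \mathcal{L}_{global}$, where $\lambda$ is a hyper-parameter that balances the two terms.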
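The global score and the hinge loss over the two kinds of negatives can be sketched as follows in Python. The margin value 0.1 is an arbitrary placeholder, not a value reported in the paper, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def global_score(v_temp, q):
    """Max-pool the cosine similarity between the query and every frame."""
    return max(cosine(frame, q) for frame in v_temp)


def hinge(s_pos, s_neg, delta=0.1):
    return max(0.0, delta + s_neg - s_pos)


def global_loss(v, q, v_neg, q_neg, delta=0.1):
    """Hinge loss against a negative video and a negative query."""
    s_pos = global_score(v, q)
    return (hinge(s_pos, global_score(v_neg, q), delta)
            + hinge(s_pos, global_score(v, q_neg), delta))


v = [[1.0, 0.0], [0.0, 1.0]]      # frames of the matching clip
q = [1.0, 0.0]                    # query vector
v_neg = [[0.0, 1.0]]              # frames of another clip in the mini-batch
q_neg = [-1.0, 0.0]               # query from another clip in the mini-batch
loss = global_loss(v, q, v_neg, q_neg)
```

In this toy case the positive pair already beats both negatives by more than the margin, so the loss is zero and no gradient would flow.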

Frame Order Modeling
The inputs for FOM are: (i) all subtitle sentences $\mathbf{s}$, (ii) visual frames $\mathbf{v}$, and (iii) the reorder indices $\mathbf{r} = \{r_i\}_{i=1}^{R} \in \mathbb{N}^{R}$. We randomly select 15% of the frames to be shuffled, and the goal is to reconstruct their original timestamps, denoted as $\mathbf{t} = \{t_i\}_{i=1}^{R}$, where $t_i \in \{1, ..., N_v\}$. We formulate FOM as a classification problem, where $\mathbf{t}$ gives the ground-truth labels of the reordered frames.
Specifically, reordering happens after the multimodal fusion of subtitle and visual frames, and is therefore applied to the input of Temporal Transformer. The reordered features are fed into Temporal Transformer to produce reordered visual frame embeddings $\mathbf{V}^{temp}_{r}$. These embeddings are transformed through an FC layer, followed by a softmax layer, to produce a probability matrix $\mathbf{P} \in \mathbb{R}^{N_v \times N_v}$, where each column $\mathbf{p}_i \in \mathbb{R}^{N_v}$ represents the scores of the $N_v$ timestamp classes that the $i$-th frame may belong to. The final objective is to minimize the negative log-likelihood (cross-entropy loss):
$$\mathcal{L}_{FOM} = -\mathbb{E}_{D} \sum_{i=1}^{R} \log \mathbf{P}[r_i, t_i],$$
where $r_i$ is the $i$-th reorder index, $t_i$ is its ground-truth timestamp, and $R$ is the number of reordered frames.
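The construction of FOM inputs and labels can be sketched in Python as below. The sketch assumes 0-based indices, selects at least two frames so that a shuffle is possible, and stores the probability matrix row-wise as `P[position][timestamp]`; these are choices made for illustration, and the names `shuffle_frames` and `fom_loss` are hypothetical.

```python
import math
import random


def shuffle_frames(n_frames, ratio=0.15, rng=None):
    """Pick a subset of positions and permute the frames among them.

    Returns (order, positions, targets):
      order[k]     index of the original frame now placed at position k
      positions    the reorder indices r (sorted)
      targets      targets[i] is the original timestamp of the frame at positions[i]
    """
    rng = rng or random.Random()
    n_sel = min(n_frames, max(2, round(ratio * n_frames)))
    positions = sorted(rng.sample(range(n_frames), n_sel))
    targets = positions[:]
    rng.shuffle(targets)
    order = list(range(n_frames))
    for pos, src in zip(positions, targets):
        order[pos] = src
    return order, positions, targets


def fom_loss(P, positions, targets):
    """Cross-entropy over the reordered positions only."""
    return -sum(math.log(P[r][t]) for r, t in zip(positions, targets))


order, r, t = shuffle_frames(20, rng=random.Random(0))
uniform = [[1.0 / 20] * 20 for _ in range(20)]  # an untrained, uniform predictor
loss = fom_loss(uniform, r, t)
```

For a uniform predictor the loss equals the number of reordered frames times log N_v, which gives a simple reference value when monitoring training.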

Downstream Adaptation
The pre-trained model can be readily adapted to downstream video+language tasks through end-to-end finetuning. Below, we describe the detailed adaptation approach to two popular tasks: (i) text-based video moment retrieval, and (ii) video question answering.

Text-based Video Moment Retrieval
The input video clip with its accompanying subtitles is encoded by HERO. The input query is encoded by the query encoder from the VSM pre-training task. We follow the same procedure as in VSM to compute query-video matching scores both locally (frame-level) and globally (clip-level). The model is finetuned end-to-end using loss $\mathcal{L}_{VSM}$.
Video Question Answering For Video QA, we consider the multiple-choice setting. For each answer candidate, the corresponding QA pair is appended to each of the subtitle sentences and fed into the Cross-modal Transformer to perform early fusion with local textual context. In addition, these QA pairs are also appended to the input of Temporal Transformer to be fused with global video context. We use a simple attention layer to compute the weighted-sum-across-time of the QA-aware frame representations from the Temporal Transformer output. These final QA-aware global representations are then fed through an MLP and softmax layer to obtain the probability score $\mathbf{p}^{(i)}_{ans}$ of all the answers for question $i$. The training objective is
$$\mathcal{L}_{QA} = -\frac{1}{N} \sum_{i=1}^{N} \log \mathbf{p}^{(i)}_{ans}[y_i],$$
where $y_i$ is the index of the ground-truth answer for question $i$ and $N$ is the number of questions. When supervision is available (see footnote 6), we also include a span prediction loss over the start and end indices of the relevant moment.
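The pooling and answer-scoring steps can be illustrated with a short Python sketch. The form of the "simple attention layer" is not specified in the text, so the sketch assumes a single learned vector `w` scored against each frame by dot product; `attention_pool` and `qa_loss` are hypothetical names, and the MLP that produces the answer logits is omitted.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Weighted sum across time; weights are softmax(frame . w)."""
    alphas = softmax([sum(a * b for a, b in zip(f, w)) for f in frames])
    d = len(frames[0])
    return [sum(alpha * f[k] for alpha, f in zip(alphas, frames)) for k in range(d)]


def qa_loss(answer_logits, y):
    """Cross-entropy over the answer candidates of one question."""
    return -math.log(softmax(answer_logits)[y])


frames = [[1.0, 0.0], [0.0, 1.0]]          # QA-aware frame representations
pooled = attention_pool(frames, [0.0, 0.0])  # zero vector gives uniform weights
loss = qa_loss([0.0, 0.0, 0.0, 0.0], 2)      # four equally scored candidates
```

In practice one pooled vector would be computed per answer candidate, and the per-candidate scores would form the logits passed to `qa_loss`.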

Experiments
In this section, we describe experiments on different downstream tasks that validate the effectiveness of the representations learned by HERO. Detailed ablation studies also provide in-depth analysis of different pre-training settings.

Pre-training Datasets
Our pre-training dataset is composed of videos from TV and Howto100M datasets. We exclude all the videos that appeared in the downstream tasks to avoid contamination in evaluation. The full pretraining dataset contains 680k video clips with their accompanying subtitles.
TV Dataset was built on 6 popular TV shows across 3 genres: medical dramas, sitcoms and crime shows. It contains 21,793 video clips from 925 episodes. Each video clip is 60-90 seconds long, covering long-range scenes with complex character interactions and social/professional activities. Dialogue for each video clip (in the format of "character name: subtitle") is also provided.
Footnote 6: Some existing Video QA tasks require localizing 'frames of interest' for the question, e.g., TVQA+.
Howto100M Dataset (Miech et al., 2019) was collected from YouTube with mostly instructional videos that teach diverse tasks. It contains 1.22 million videos, with activities falling into 12 categories (e.g., Food & Entertaining, Home & Garden, Hobbies & Crafts). Each video is associated with a narration as subtitles that are either written manually or generated by an Automatic Speech Recognition (ASR) system. The average duration of videos in Howto100M is 6.5 minutes. We cut the videos into 60-second clips to make them consistent with the TV dataset, and exclude videos in non-English languages. These pre-processing steps result in a subset of 660k video clips, accompanied by English subtitles.

Data Collection
Existing datasets for video moment retrieval and video QA are built on videos from either a single domain or a single modality. In order to evaluate on datasets not only containing diverse video content but also reflecting multimodalities of videos, we collect two new datasets based on Howto100M as additional benchmarks.
We use Amazon Mechanical Turk (AMT) to collect annotations on Howto100M videos. Figure 2 in Appendix shows the interface for video moment retrieval data collection. We randomly sample 29,843 60-second clips from 9,421 videos and present each clip to the annotators, who are asked to select a video segment containing a single, self-contained scene. After video segments are selected, another group of workers is asked to write captions that describe the displayed segment. Narrations are not provided to workers for some video clips, to ensure we include queries that are related to video only. On average, selected video segments are 10-20 seconds long. The length (number of words) of queries is diverse, ranging from 8 to 20.
We also present the selected video segments to another group of AMT workers for QA annotations (interface shown in Figure 3 in Appendix). Each worker is assigned one video segment and asked to write one question, one correct answer and 3 wrong answers. Similarly, some narrations are hidden to ensure we include QA pairs that are based on video only and not biased by subtitles. In practice, we observe that human-written negative answers suffer from serious bias (i.e., models can learn to predict the correct answer without even absorbing information from the video or subtitles). Therefore, we use adversarial matching (Zellers et al., 2019) to construct negative answers, by selecting a correct answer (from another question) that is most relevant to the current question. We replace one out of three written negative answers in this way. Detailed statistics about the collected datasets are provided in Appendix.
Table 1: Evaluation on pre-training tasks and datasets using TVR, TVQA, Howto100M-R and Howto100M-QA validation sets as benchmarks. Dark and light grey colors highlight the top and second best results across all the tasks trained with TV Dataset. The best results are in bold. For simplicity, we only report video moment retrieval results for TVR and Howto100M-R.

Downstream Tasks
To validate the effectiveness of HERO, we evaluate on four different downstream tasks. This subsection describes each task and the corresponding evaluation metrics.
TVR is built upon the TV dataset, split into 80% train, 10% val, 5% test-public and 5% test-private. On average, 5 queries were collected for each video clip. Among them, 74.2% of queries are related to video only, 9.1% to text only, and 16.6% to both video and text.
TVQA was first introduced along with the TV dataset, where given a video clip and the accompanying subtitles, the goal is to answer a multiple-choice question about the video. Each video clip is annotated with 7 questions and 5 answers per question. The start and end points of relevant moments are also provided for each question. The train, val and test video splits are the same as in the TVR dataset.
Howto100M-R In total, we have collected 67,542 queries for 29,843 60-second clips from 9,421 videos in HowTo100M, on average 2-3 queries per clip. We split the video clips and their associated queries into 80% train, 10% val and 10% test.
Howto100M-QA is collected under multi-choice QA setting. For the same video clips used in Howto100M-R, each is annotated with 2 questions on average and 4 answers per question. Similar to TVQA, we also provide the start and end points for the relevant moment for each question. We split data into 80% train, 10% val and 10% test.

Evaluation Metrics Text-based Video Moment Retrieval can be decomposed into two sub-tasks: (i) Video Retrieval: retrieve the most relevant video clip described by the query; (ii) Moment Retrieval: given the query, localize the correct moment from the most relevant video clip. Model performance on video moment retrieval is measured on these two sub-tasks. A model prediction is correct if: (i) its predicted video matches the ground-truth (in video retrieval); and (ii) its predicted span has high overlap with the ground-truth (in moment retrieval). Average recall at K (R@K) over all queries is used as the evaluation metric for both TVR and Howto100M-R. Temporal Intersection over Union (tIoU) is also used to measure the overlap between the predicted span and the ground-truth span.
TVQA and Howto100M-QA include 3 sub-tasks: QA on the grounded clip, question-driven moment localization, and QA on the full video clip. We only consider QA on the full video clip, as it is the most challenging setting among the three. Accuracy is used to measure model performance.
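The retrieval metric can be made concrete with the following Python sketch of R@K under a tIoU criterion. The threshold of 0.7 is an assumed placeholder, since the text only requires "high overlap", and the data layout and function names are hypothetical.

```python
def tiou(pred, gt):
    """Temporal IoU of two (start, end) spans in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(predictions, ground_truths, k=1, tiou_threshold=0.7):
    """Fraction of queries with a correct prediction among the top k.

    predictions:   per query, a ranked list of (video_id, start, end)
    ground_truths: per query, one (video_id, start, end)
    A prediction counts as correct if the video matches and the span
    overlaps the ground truth with tIoU >= tiou_threshold.
    """
    if not ground_truths:
        return 0.0
    hits = 0
    for ranked, (gt_vid, gt_st, gt_ed) in zip(predictions, ground_truths):
        for vid, st, ed in ranked[:k]:
            if vid == gt_vid and tiou((st, ed), (gt_st, gt_ed)) >= tiou_threshold:
                hits += 1
                break
    return hits / len(ground_truths)


preds = [
    [("a", 0, 10)],
    [("b", 0, 10), ("c", 5, 15)],
]
gts = [("a", 0, 10), ("c", 5, 15)]
r1 = recall_at_k(preds, gts, k=1)
r2 = recall_at_k(preds, gts, k=2)
```

Here the second query is only answered correctly at rank 2, so recall rises from 0.5 at K=1 to 1.0 at K=2.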

Ablation Study
We analyze the effectiveness of our model design, especially with different combinations of pre-training tasks, through ablation studies over downstream tasks. The results in Table 1 indicate that FOM, which models sequential characteristics of video frames, can effectively benefit downstream tasks that rely on temporal reasoning (such as QA tasks).

Pre-training Tasks and Datasets
The best performance is achieved by MLM + MNCE + FOM + VSM (L4). We observe a significant performance lift from adding VSM. The local and global alignments between subtitle and visual frames learned through VSM are especially effective on TVR and Howto100M-R. Adding MFFR on top (L5) achieves slightly worse results. Our observation is that MFFR competes with (instead of being complementary to) MNCE during pre-training, which renders the effect of adding MFFR negligible.
Lastly, we study the effects of pre-training datasets, by augmenting TV dataset with Howto100M dataset and pre-training HERO with the optimal combination of MLM + MNCE + FOM + VSM. The learned model continues to improve over all tasks except TVR. We hypothesize that the comparable result on TVR is due to the domain difference between the augmented videos and TV dataset.
Model Design To validate the effectiveness of the Cross-modal Transformer in HERO, we compare our model with a Hierarchical Transformer (H-TRM) baseline under two settings: (i) without pre-training; (ii) with optimal pre-training (MLM + MNCE + FOM + VSM) over TV dataset. H-TRM is constructed by simply replacing the Cross-modal Transformer with a RoBERTa model and encoding subtitles only. This way, the inputs to the Temporal Transformer in H-TRM are the summation of initial frame embeddings and max-pooled subtitle embeddings from RoBERTa. We also compare HERO with a flat BERT-like encoder (F-TRM) where no pre-training is applied. F-TRM takes as input a single sequence formed by concatenating the embeddings of visual frames and all subtitle sentences, and encodes them through one multi-layer Transformer, as used in previous pre-training methods.
Results are summarized in Table 2. (i) When no pre-training is applied, F-TRM is much worse than HERO on both tasks. H-TRM achieves comparable results to HERO on TVR, but worse on TVQA. Unlike F-TRM, H-TRM and HERO explicitly utilize the inherent temporal alignment between the two modalities of videos, which is uniquely important for video+language tasks. (ii) With pre-training, HERO shows significant improvement over H-TRM. Our hypothesis is that with the hierarchical design, HERO can capture cross-modal interactions between visual frames and their local textual context better than H-TRM. Such cross-modality joint understanding of visual and textual contexts is critical for video-based retrieval and QA tasks. (iii) Pre-training lifts HERO performance by a large margin, but is not very helpful for H-TRM. These results provide strong evidence that cross-modal interactions and temporal alignments between visual frames and their local textual context learned by HERO are essential for these video+language tasks.

Comparison with SOTA Models
We compare our model with task-specific state-of-the-art models in Table 3. First, we compare with XML on text-based video moment retrieval tasks (TVR and Howto100M-R). Results show that our model consistently outperforms XML on both TVR and Howto100M-R, with or without pre-training.
Second, we compare with SOTA models on video QA tasks (TVQA and Howto100M-QA). Note that for TVQA, STAGE is trained with additional supervision on spatial grounding, which requires region-level features for each frame of the video. Results show that without additional supervision on spatial grounding or fine-grained region-level features, our HERO model is able to achieve better performance than STAGE on the TVQA dataset. We also observe that pre-training significantly boosts the performance of HERO across TVR, Howto100M-R and TVQA tasks. On Howto100M-QA, since STAGE was specifically designed to leverage region-level features, we cannot directly apply STAGE. Thus, we only compare model performance of HERO without and with pre-training. Results exhibit the same pattern observed on other downstream tasks: pre-training achieves better performance than no pre-training. Overall, HERO achieves state-of-the-art results on all four downstream tasks.

Conclusion
In this paper, we present a hierarchical encoder for video+language omni-representation pre-training. Our HERO model presents a hierarchical architecture, consisting of Cross-modal Transformer and Temporal Transformer for multi-modal fusion. Novel pre-training tasks are proposed to capture temporal alignment both locally and globally. Pre-trained on two large-scale video datasets, HERO exceeds state of the art by a significant margin when transferred to multiple video-and-language tasks. Two new datasets on text-based video moment retrieval and video QA are introduced to serve as additional benchmarks for downstream evaluation. We consider extension of our model to other video-and-language tasks as future work, as well as developing better-designed pre-training tasks.

A Appendix
A.1 Data Analysis on Howto100M-R and Howto100M-QA
Data Collection Interface Figures 2 and 3 present the interfaces we used for collecting Howto100M-R and Howto100M-QA, respectively. For Howto100M-R, the annotator is asked to first select a video segment from the presented video clip using the sliding bar, and then enter a description of the selected video segment in the text box shown at the bottom of Figure 2. For Howto100M-QA, we reuse the selected video segments collected for Howto100M-R. The annotators are asked to write a question, a correct answer and 3 wrong answers in the five text boxes shown in Figure 3.
Video Segment Length Distribution The length distribution of selected video segments is presented in Figure 4. The lengths of video segments vary from 5 to more than 30 seconds. The majority of them are shorter than 15 seconds.
Howto100M-QA Question and Answer Distribution Figure 6 and Figure 7 show the length distribution (in number of words) of collected questions and answers in Howto100M-QA. Questions are relatively longer, with more than 10 words on average. Answers are relatively shorter; most of them have fewer than 7 words.
In addition, we analyze the types of collected questions by showing the distribution of their leading words in Figure 8. In total, we collected questions of 7 different types. The majority of them start with "what", "why" and "when".