Beyond Instructional Videos: Probing for More Diverse Visual-Textual Grounding on YouTube

Pretraining from unlabelled web videos has quickly become the de-facto means of achieving high performance on many video understanding tasks. Features are learned via prediction of grounded relationships between visual content and automatic speech recognition (ASR) tokens. However, prior pretraining work has been limited to only instructional videos, a domain that, a priori, we expect to be relatively "easy:" speakers in instructional videos will often reference the literal objects/actions being depicted. Because instructional videos make up only a fraction of the web's diverse video content, we ask: can similar models be trained on broader corpora? And, if so, what types of videos are "grounded" and what types are not? We examine the diverse YouTube8M corpus, first verifying that it contains many non-instructional videos via crowd labeling. We pretrain a representative model on YouTube8M and study its success and failure cases. We find that visual-textual grounding is indeed possible across previously unexplored video categories, and that pretraining on a more diverse set still results in representations that generalize to both non-instructional and instructional domains.


Introduction
Self-supervised pretraining approaches have recently been adapted to web videos (Sun et al., 2019a,b; Miech et al., 2019, 2020; Zhu and Yang, 2020; Amrani et al., 2020); the resulting models have achieved state-of-the-art performance on a wide range of video understanding tasks, e.g., dense caption generation, action localization, etc.
In general, the pretraining step requires a large, unlabelled corpus of web videos. The training objective aligns visual content (i.e., video segments) with automatic speech recognition (ASR) tokens, and the resulting representations are fine-tuned for downstream tasks. The assumption underlying this family of approaches is that the words spoken in a video scene have some consistent relationship with the temporally corresponding visual content.
However, in contrast to the highly diverse corpora utilized for text-based pretraining (Wikipedia, Common Crawl, etc.), pretraining for web videos so far has been limited to instructional videos. This domain restriction is motivated by the commonly-accepted notion that "procedural knowledge tends to be inherently multimodal" (Malmaud et al., 2015): we expect that the semantic information in video frames and ASR tokens is readily correlated in instructional videos. But corpus diversity brings significant benefits: in the text-only case, models can effectively represent diverse real-world entities (Roberts et al., 2020) precisely because pretraining is not restricted to, e.g., only fictional stories (Zhu et al., 2015).
In search of more general representations, our main question is: does video-ASR pretraining "work" for more diverse pretraining corpora? Are certain categories of non-instructional videos "groundable," thus enabling diverse representation learning? Or are some types too difficult, only acting as training noise? We conclude that: 1) grounding is indeed possible in a wide range of yet-to-be-computationally-exploited YouTube video categories, e.g., walk-throughs, vehicles, tech reviews, etc., with some harder than others; 2) for the model we consider, not much representational power is gained or lost by switching from a pure instructional training set to a diverse one, which may additionally provide more versatility.

Related Work
ASR is known to be a useful signal source in various instructional video understanding tasks (Gupta et al., 2017; Huang et al., 2017; Huang* et al., 2018; Moriya et al., 2019), e.g., action detection/classification (Yu et al., 2014; Alayrac et al., 2017; Chang et al., 2019; Kuehne et al., 2019), segmentation/captioning (Sener et al., 2015), and instruction alignment (Malmaud et al., 2015; Alayrac et al., 2016). A number of multimodal instructional video datasets have been proposed (Wang et al., 2019; Tang et al., 2019; Sanabria et al., 2018). A notable recent example of a non-instructional video corpus is Ignat et al. (2019), who analyze grounded-ness in a "lifestyle vlogs" corpus. Fouhey et al. (2018) highlight the difference between keyword search and implicitly mining action data of interest from a broader corpus (e.g., Bregler (1997); Gu et al. (2018)).
Operational grounding. Our work builds upon prior operational notions of grounding: if an algorithm is able to consistently predict specific visual-textual relationships, then that relationship is said to be "grounded" (Lu et al., 2008; Berg et al., 2010; Parikh and Grauman, 2011; Hill and Korhonen, 2014; Hessel et al., 2018). For example, Yanai and Barnard (2005) rank "substrings of text by how well their occurrence can be predicted from visual features."

Video-ASR pretraining + our model
Recent work in designing pretraining objectives assumes that: 1) ASR tokens have, on average, some correspondence to temporally co-occurring video frames within the same video; and 2) video clips lacking ASR tokens can be ignored, i.e., models are not expected to align or reason about clips that lack ASR.
We build and analyze a model that encapsulates both of these assumptions. 1 The model is a slight simplification of Miech et al. (2019), where a joint embedding for the visual content and ASR tokens is learned. In the supplementary material, we detail replications of key experiments of their work: our model achieves comparable results on a difficult action localization task, CrossTask, when pretrained on the same corpus of 1M instructional videos.
Model details. The similarity between clip i and ASR caption j, s_{i,j}, is estimated by computing the cosine similarity between their corresponding embeddings in the joint space. 2 During training, temporally corresponding (clip, ASR caption) pairs are sampled ("Positive" cases). For each positive case, a set of mismatched "Negative" cases is also sampled both from other videos and from the same video in equal proportion. In contrast to Miech et al. (2019), we control for clip length, and sample temporally fixed-length segments: this simplifying choice makes our error analysis significantly more straightforward, and results in minimal performance change. 3 A hinge loss with margin δ is minimized (Eq. 1).
Measuring visual-textual alignment. Viewed through the lens of link prediction between truly co-occurring (clip, ASR) pairs, Eq. 1 can be seen as a differentiable approximation of AUC (Rendle et al., 2009). Thus, we propose to operationally measure groundedness using intra-video AUC: a single score is assigned to each video, rewarding the model if it is able to successfully align temporal pairs within the same video (and penalizing it if not). Fig. 1 presents a visualization of this method.
Figure 1: Intra-video AUC metric: the model scores all possible links between clips and ASR captions within a single video; the model is rewarded for assigning higher similarity to temporally-aligned segments versus mismatched ones.

A More Diverse Corpus
YouTube-600K. YouTube8M (Abu-El-Haija et al., 2016) is a dataset of 6.1M YouTube videos, 4 where each video is labeled across 3K categories, ranging from "cooking" to "games" to "nature." It is among the largest and most diverse publicly available datasets of YouTube videos. Due to user deletions and videos without detected spoken words, we are able to collect ASR via the YouTube API for 1.4M (29%) videos; we further filtered to 817K videos tagged with English ASR.
Maintaining the train / validation split of the original data release yields 639K training videos (henceforth referred to as YouTube-600K) and 167K validation-set videos.
1 While more complicated models are possible, our goal is to conduct an error analysis of a simple, representative model, not necessarily to achieve state-of-the-art results.
2 Joint embedding models are parameterized using gated multi-layer feedforward networks. Visual features are: frame-wise 2D Inception-v3 pretrained for object detection (Szegedy et al., 2016; Sun et al., 2017) and 3D CNN S3D-G features pretrained for action recognition (Xie et al., 2018; Kay et al., 2017).
3 We use 5 second segments, and initially randomly sample 256 segments per video before discarding ones that have no temporally-accompanying ASR. Segments may overlap.
4 v3 of the dataset is smaller than v1/v2, due to videos becoming unavailable over time and other refinements.
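The fixed-length segment sampling described in footnote 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_segments`, the representation of ASR as `(time_in_seconds, token)` pairs, and the rule that a token belongs to a segment when its timestamp falls inside the window are all hypothetical choices made for the example.

```python
import random


def sample_segments(video_duration, asr_tokens, num_candidates=256,
                    segment_len=5.0, seed=0):
    """Sample fixed-length candidate segments and keep those with ASR.

    asr_tokens is a list of (time_in_seconds, token) pairs (assumed format).
    Returns a list of (start, end, tokens) tuples; segments may overlap.
    """
    rng = random.Random(seed)
    # Latest start time that keeps the whole window inside the video.
    max_start = max(video_duration - segment_len, 0.0)
    kept = []
    for _ in range(num_candidates):
        start = rng.uniform(0.0, max_start)
        end = start + segment_len
        # Tokens whose timestamp falls inside [start, end).
        tokens = [tok for t, tok in asr_tokens if start <= t < end]
        if tokens:  # discard segments with no accompanying ASR
            kept.append((start, end, tokens))
    return kept


if __name__ == "__main__":
    toy_asr = [(1.0, "first"), (2.5, "prepare"), (30.0, "enjoy")]
    segments = sample_segments(60.0, toy_asr)
    print(len(segments), "of 256 candidates kept")
```

Because candidates without ASR are discarded, the number of retained segments varies from video to video, while every retained segment has the same duration.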
Human annotation of "Is-it-instructional." While a qualitative examination of YouTube8M reveals clear topical and stylistic diversity compared to domain-restricted corpora, we quantitatively verify that YouTube8M does not consist of mostly instructional videos.
To that end, we sample 6.8K videos with English ASR from the validation set for human labeling. Each video is shown to three paid annotators, who must each provide a Yes/No answer to the question: "Does this video focus on real-world human actions accompanied by procedural language that explains what is happening on screen in reasonable detail?" 5 After a few iterations over the guidelines and examples, the annotators reach high agreement: in 96% of cases, all three judges are unanimous. Based on these annotations, we estimate that around 74% of the videos in the YouTube-600K corpus are not instructional.
Which categories are easiest/hardest? We train our model on YouTube-600K, and compute intra-video AUC.
5 Note that our definition for "is-instructional" was intended to include the usual "how-to" videos, but we also attempted to capture a more general notion of "instructional-ness". For instance, an un-boxing video where parts of a product are taken out and assembled along with corresponding narration should receive "Yes", whereas a video showing only a product from different angles should receive "No", due to a lack of narrated human actions.
We next ask: are instructional videos indeed easier to ground? While human judgements of instructional-ness and intra-video AUC are positively correlated, ρ = .20 (p ≈ 0), the low magnitude of this correlation provides additional empirical confirmation that other types of videos are
6 To make sure that the model is not succeeding simply because a category happened to be frequent in the dataset, we note the correlation between category AUC and category frequency is essentially zero (ρ = .02, p > .58).
7 These meta-categories are called "verticals," and are released with YouTube8M.
Within-category observations. To this point, we have identified broad categories of YouTube videos that are more groundable than others. However, it is not yet clear why, e.g., the algorithm gets 64 AUC on "Action Figure," or 55 AUC on "Call of Duty" (a first-person shooter game). We now define a segment-level AUC metric, analogous to the intra-video AUC metric previously defined: it quantifies how readily individual ASR captions are temporally localized by the model within the same video (see Menon and Elkan (2011) for a description of different AUC variants).
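The two AUC variants can be sketched as follows. This is one plausible reading of the metrics rather than the authors' exact procedure. It assumes a hypothetical square similarity matrix `sim` for a single video, where `sim[i][j]` is the model's score for clip i and ASR caption j, and where clip i is temporally aligned with caption i only (the real setup allows overlapping segments). Ties are counted as half a win, and scores are returned on a 0 to 1 scale, whereas the text reports them multiplied by 100.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def intra_video_auc(sim):
    """One score per video: aligned links versus all mismatched links."""
    n = len(sim)
    pos = [sim[i][i] for i in range(n)]
    neg = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    return pairwise_auc(pos, neg)


def segment_auc(sim, j):
    """One score per ASR caption j: its aligned clip versus the other clips."""
    pos = [sim[j][j]]
    neg = [sim[i][j] for i in range(len(sim)) if i != j]
    return pairwise_auc(pos, neg)


if __name__ == "__main__":
    # Toy 3-clip video; rows are clips, columns are ASR captions.
    toy_sim = [
        [0.90, 0.10, 0.20],
        [0.30, 0.80, 0.10],
        [0.20, 0.85, 0.40],
    ]
    print(intra_video_auc(toy_sim))  # video-level score
    print(segment_auc(toy_sim, 1))   # caption 1 is confused with clip 2
```

In the toy matrix, caption 1 scores higher with the mismatched clip 2 than with its own clip, so its segment-level AUC is lower than that of caption 0 even though the video-level score remains high.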
While visual/textual content plays a role in groundability, contextual factors must be considered too. Fig. 3 illustrates clear relationships 1) between ASR caption placement within a video and segment AUC (segments at the very beginning and very end of videos tend to be easier); and 2) between the number of tokens in an ASR caption and segment AUC. For "Action Figure," ASR segments with more words are easier (this is the case with most categories), but for "Call of Duty," the opposite is true.
Additionally, we train OLS regression models to predict segment AUC from lexical unigram features, while controlling for timing/length features. Lexical features add predictive capacity (p < .01, F-test). While we find some patterns shared across both categories, e.g., intro/outro language ("hey", "welcome", "peace") is predictive of segment AUC, we also observe topical patterns, e.g., several unigrams associated with specific action figure body parts ("knee", "shoulder", "joint", etc.) are positively associated with segment AUC.

Implications for Pretraining
While self-grounding is possible for a diverse set of domains, do we gain anything by training on a more diverse corpus? Or do difficult-to-ground videos introduce noise, rendering the resulting representations useless for downstream tasks? We compare two versions of our model: one with parameters learned from training on the diverse YouTube-600K corpus (M_Diverse), and one with parameters learned from a domain-specific corpus of 1M instructional videos (M_Instructional).
First, we evaluate each model's capacity to localize instructional steps on the CrossTask dataset. Both models have similar performance for this instructional video task: macro-average task recall drops by only 15% when swapping from M_Instructional to M_Diverse, even though YouTube-600K only contains 26% instructional videos (full results in supplementary).
We next evaluate each model's performance on the same-video clip alignment task over a diverse set of videos: the sample of 6.8K human-annotated videos from the YouTube8M validation set. In terms of intra-video AUC, M_Diverse outperforms M_Instructional on 61% of videos. If we split the data across the "Is-it-instructional" human judgements and compare the two models in each subset, M_Instructional "wins" in 57% of the instructional videos, whereas M_Diverse "wins" in 65% of non-instructional cases.
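The per-subset comparison can be sketched as a simple win-rate computation over per-video intra-video AUC scores. This is an illustrative sketch: the function names, the toy scores, and the convention that a tie counts as a win for neither model are assumptions made for the example, not details given in the text.

```python
def win_rate(auc_a, auc_b):
    """Fraction of videos on which model A has strictly higher AUC than B."""
    if not auc_a:
        return None  # empty subset: win rate is undefined
    wins = sum(1 for a, b in zip(auc_a, auc_b) if a > b)
    return wins / len(auc_a)


def win_rates_by_label(auc_diverse, auc_instructional, is_instructional):
    """Win rate of the diverse model overall and within each human label."""
    rows = list(zip(auc_diverse, auc_instructional, is_instructional))
    subsets = {
        "all": rows,
        "instructional": [r for r in rows if r[2]],
        "non_instructional": [r for r in rows if not r[2]],
    }
    return {
        name: win_rate([r[0] for r in sub], [r[1] for r in sub])
        for name, sub in subsets.items()
    }


if __name__ == "__main__":
    # Hypothetical per-video AUCs for four videos (not results from the paper).
    diverse = [0.70, 0.60, 0.55, 0.80]
    instructional = [0.65, 0.70, 0.50, 0.60]
    labels = [False, True, False, True]
    print(win_rates_by_label(diverse, instructional, labels))
```

Reporting the win rate separately for each label makes it visible when the overall figure is driven mainly by the larger, non-instructional subset.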
In short, both models achieve reasonable performance under instructional vs. non-instructional train/test domain mismatch. This is a promising result for future pretraining work with more diverse corpora: at least for these evaluations, good performance on an instructional video grounding task is still possible under domain shift. While the comparison of intra-video AUC is not necessarily definitive, it suggests that diverse corpora may provide more versatility, and we look forward to exploring this further in future work.

Conclusion
Peeking through the lens of a joint embedding model, we probe into learning visual-textual grounding over a more diverse corpus of YouTube videos vs. prior work. We find that learning visual-textual grounding is possible across many yet-to-be-explored categories of YouTube videos, and that it's possible to learn generalizable representations from a more diverse video set.
Our model is a slight simplification of Miech et al. (2019)'s joint embedding model that pre-trains by aligning ASR tokens with corresponding video frames. The main difference between our implementation and theirs is how we generated (clip, ASR caption) pairs. While we considered generating clips according to their methodology, we ran into two problems. First, in early experiments, we found that the interpretability of our error analysis was significantly impacted by varying clip length. For example: we were worried that it might not be consistent to compare the model's ability to temporally ground a 1-second clip vs. a 15-second clip. There was also high correlation between caption length and temporal clip duration, which further complicated interpretation. Sampling clips of uniform duration solved these problems. Second, Miech et al. (2019)'s temporal segmentation was generated by relying on the scrolling timing of the ASR tokens on YouTube, i.e., the time that YouTube decides to generate a linebreak, removing a line of caption from the screen. Via manual inspection, we found that scrolling time was temporally unreliable, e.g., the time at which ASR captions scroll on YouTube often differs significantly from when particular words were said. Instead, we sample 256 candidate 5-second segments uniformly at random from the video, and then discard segments that have no corresponding ASR.
Visual features. Following Xu et al. (2016) and Miech et al. (2019), we extract clip features from both 2D and 3D convolutional networks. For 2D features, we sample frames at 1FPS from all of the videos in our corpus, resize frames to be 256 by 256, and pass them through Inception-v3 (Szegedy et al., 2016) pretrained on JFT (Sun et al., 2017). For 3D features, we follow a procedure similar to Sun et al. (2019): we sample frames at 30FPS, aggregate frames into non-overlapping clips of 1 second each, and run an S3D-G (Xie et al., 2018) network that is pretrained on the Kinetics action recognition dataset (Kay et al., 2017). Both 2D and 3D features are L2 normalized. The result of this process is a 2524-D feature vector for each second of video in our corpus. To extract a single feature vector for each clip, following Miech et al. (2019), we max pool the per-second feature vectors.
ASR features. To compute ASR caption features for a given segment, after decapitalizing and tokenizing, each vocabulary item is assigned a 300 dimensional embedding parameter; these are finetuned during the training process. Once again following Miech et al. (2019), we max pool token embeddings to achieve a single embedding for an ASR caption. 1
1 When training on YouTube-600K the vocabulary size is 61K.
Joint embedding. The similarity score for a (clip, caption) pair is estimated by the cosine similarity between the corresponding embeddings in a joint space. To project into this joint space, clip/ASR caption embedding models are parameterized using gated multi-layer perceptrons, as described in Miech et al. (2019); these parameters are trained to optimize the training objective.
Comparison to HowTo100M. To verify that our model simplifications didn't significantly hinder performance, we attempted to replicate key experiments from Miech et al. (2019). In particular, we sought to gather the pretraining corpus they used, HowTo100M, which consists of 1.22M videos. Because of, e.g., users deleting videos, we were able to gather features for 87% of the original set, 1.06M videos.
We trained with the Adam optimizer (Kingma and Ba, 2014) starting with a learning rate of .001, with the margin parameter set to .1, but didn't undertake significant hyperparameter optimization. We terminate training after 300K steps.
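The scoring and loss computation for a single positive pair can be sketched as follows. This is a hedged illustration that assumes a standard max-margin ranking form (each negative is penalized when its score comes within the margin of the positive score); the exact form of Eq. 1 may differ. The function names and toy embeddings are hypothetical; only the cosine similarity scoring and the margin of .1 come from the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors in the joint space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def hinge_loss(pos_score, neg_scores, margin=0.1):
    """Sum of max(0, margin + negative - positive) over sampled negatives."""
    return sum(max(0.0, margin + n - pos_score) for n in neg_scores)


if __name__ == "__main__":
    # Toy joint-space embeddings (assumed values, for illustration only).
    clip = [0.6, 0.8]
    aligned_caption = [0.5, 0.9]
    mismatched_captions = [[0.9, 0.1], [0.6, 0.7]]

    pos = cosine_similarity(clip, aligned_caption)
    negs = [cosine_similarity(clip, c) for c in mismatched_captions]
    print(hinge_loss(pos, negs))
```

Under this form, a negative contributes nothing to the loss once its similarity falls more than the margin below the positive pair's similarity, so training effort concentrates on the mismatched pairs the model still confuses.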
We verify the performance of our model is comparable using the CrossTask localization task. While we defer details to the original paper, the goal is to temporally localize a set of procedural steps for a task in an unlabelled/unsegmented video depicting that task. An algorithm's performance is evaluated with a recall metric (higher is better). We follow the evaluation procedure given in Miech et al. (2019), except instead of embedding each frame individually, we embed a sliding 5-second window of video clips. We also experimented with Zhukov et al. (2019)'s dynamic programming postprocessing method, and found that it usually resulted in a small performance increase. Our simplified model performs only slightly worse (5%) than Miech et al. (2019)'s. While we argue that our model is certainly still representative, there are several reasons why this gap might exist. For example, there may be a regularizing effect when the model is allowed to view clips of varying length. Furthermore, our feature set was different: we used different base neural networks for feature extraction. Also, our model is trained on less data due to authors deleting their videos. Finally, we didn't tune the training hyperparameters for our model/implementation, e.g., hinge size, learning rate, batch size, etc.
Stability of results. To ensure the results in our paper related to intra-video AUC were insensitive to the particular choice of model checkpoint, we re-did the experiments in §4 using a version of our model checkpointed at 140K iterations vs. the 300K presented in the main paper; these experiments were conducted over 21K dev videos instead of the full 167K dev videos presented in the main paper. Figures and tables in that section were consistent with the presented results, and the qualitative observations about the "Action Figure" category held.