What is More Likely to Happen Next? Video-and-Language Future Event Prediction

Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. In order to promote the collection of non-trivial challenging examples, we employ an adversarial human-and-model-in-the-loop data collection procedure. We also present a strong baseline incorporating information from video, dialogue, and commonsense knowledge. Experiments show that each type of information is useful for this challenging task, and that compared to the high human performance on VLEP, our model provides a good starting point but leaves large room for future work. Our dataset and code are available at: https://github.com/jayleicn/VideoLanguageFuturePred


Introduction
Given a video clip (premise event), humans can often describe logical events that might happen next (future events), and interestingly, people tend to agree on which future events are more likely to happen than others. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of multimodal commonsense knowledge about the world. In Figure 1 (top), for example, one needs to integrate information from the dialogue, video, and commonsense world knowledge, and then make a sound and practical judgment about the future, by choosing the more likely event from two provided possible future events. The VLEP dataset contains 28,726 examples from 10,234 short video clips. Each example (see Figure 1) consists of a Premise Event (a short video clip with dialogue), a Premise Summary (a text summary of the premise event), and two potential natural language Future Events (along with Rationales) written by people. These clips are on average 6.1 seconds long and are harvested from diverse event-rich sources, i.e., TV show and YouTube Lifestyle Vlog videos.
Collecting such a dataset is a non-trivial task, as crowd-workers may write trivial negatives (less-likely events) that contain biases or annotation artifacts (Gururangan et al., 2018), such as negation (e.g., 'says nothing') or impolite actions (e.g., 'hit someone in the face'), as shown in Table 1.
To mitigate this, we combine two recent effective approaches, adversarial human-and-model-in-the-loop data collection (Nie et al., 2020) and adversarial matching (Zellers et al., 2019a), to build a larger, more challenging, and less biased dataset. Specifically, 50% of the examples in VLEP are directly annotated by humans over two rounds: round one of standard data collection, i.e., crowd-workers perform the annotations with no model feedback, and round two of adversarial data collection, i.e., crowd-workers perform the annotations with the goal of fooling our basic models trained on round one data (thus avoiding obvious biases). Our analysis shows that adversarial data collection helps mitigate dataset bias (reducing trivial negatives): a premise-oblivious model (one that does not see the premise event) performs worse on data collected in round two than on data from round one. The other 50% of the examples are obtained by performing adversarial matching on the human-annotated positive events (more-likely events), i.e., for each premise event, we sample a positive from other premises as a negative, such that the sampled negative is relevant to the current premise while not being overly similar to the true positive. Overall, our dataset is collected via 3 methods (standard-human, adversarial-human, adversarial-matching), hence maintaining a balance between easy and hard examples while reducing potential biases.
To provide a strong baseline for this challenging multimodal future-prediction task, we propose a transformer-based model that incorporates both visual and textual information from the premise event. We also inject commonsense knowledge from the ATOMIC dataset (Sap et al., 2019) into our model. Our ablation study shows that each part of our model, i.e., video understanding, dialogue understanding, and commonsense knowledge, is useful for multimodal event prediction. Though our model shows promising results, it is still not comparable to human performance (67.46% vs. 90.50%), indicating the challenging nature of the multimodal event prediction task and the large scope for interesting future work on our VLEP dataset.
To summarize, our contributions are 3-fold: (1) We propose a new task, Video-and-Language Event Prediction, which requires a model to make fine-grained, multimodal predictions regarding which future event is more likely to happen following a premise video. (2) We introduce a new dataset, VLEP, for the task, and use two approaches to gather natural hard-negative future events: adversarial data collection and adversarial matching. This helps mitigate potential annotation artifacts and biases in the dataset. A detailed analysis of VLEP is provided. (3) We present a strong baseline method to benchmark the proposed dataset, and show that incorporating commonsense knowledge improves performance, indicating future directions for this new task (with a large model-human performance gap).

Related Work
Video-and-Language Understanding. Various datasets and tasks have been introduced in this area, such as video captioning (Xu et al., 2016; Rohrbach et al., 2017; Wang et al., 2019; Lei et al., 2020c), video QA (Tapaswi et al., 2016; Jang et al., 2017; Lei et al., 2018), and moment retrieval (Hendricks et al., 2017; Gao et al., 2017a; Lei et al., 2020c). Recently, Liu et al. (2020) propose the video-and-language inference task, where a model needs to infer whether a statement is entailed or contradicted by a video. While that task requires verifying a statement against existing events, our task requires predicting future events.
Commonsense Reasoning. Recently, commonsense reasoning has emerged as an important topic in both the language (Zellers et al., 2018, 2019b; Sap et al., 2019) and vision (Vedantam et al., 2015b; Zellers et al., 2019a; Zadeh et al., 2019; Fang et al., 2020) communities. Zellers et al. (2018, 2019b) build multiple-choice QA datasets for commonsense inference with text context; Zellers et al. (2019a) and Park et al. (2020) propose datasets for commonsense-based QA and captioning on still images; Fang et al. (2020) augment MSRVTT (Xu et al., 2016) videos with commonsense captions and QAs. In this work, we focus on a more complex type of context (video with dialogue) and a future prediction task, posing challenges for both video-and-dialogue understanding and commonsense reasoning.
Video Forecasting. Predicting the future is a popular research area in the vision community. It covers a wide spectrum of topics, including predicting future frames (Vondrick et al., 2016b; Liang et al., 2017), future action labels (Vondrick et al., 2016a; Gao et al., 2017b; Shi et al., 2018; Epstein et al., 2020), future human motions (Fragkiadaki et al., 2015; Mao et al., 2019), etc. While these works mostly study low-level vision or semantic concept prediction (e.g., action labels), we focus on predicting high-level future events from video and dialogue.

Bias in Datasets.
It is known that biases or annotation artifacts (Goyal et al., 2017; Gururangan et al., 2018; McCoy et al., 2019; Tsuchiya, 2018; Poliak et al., 2018; Zellers et al., 2019a) exist in standard human-annotated datasets (Bowman et al., 2015; Williams et al., 2018; Antol et al., 2015; Tapaswi et al., 2016; Jang et al., 2017; Kim et al., 2017; Lei et al., 2018). For example, negation words such as nobody, no, and never are strong indicators of contradictions (Gururangan et al., 2018) in MNLI (Williams et al., 2018). Such superficial patterns are easy for models to exploit, resulting in an overestimate of task performance (Goyal et al., 2017; Gururangan et al., 2018). Zellers et al. (2019a) propose Adversarial Matching to mitigate biases in QA, where positive answers are recycled to serve as negatives for other questions. Nie et al. (2020) propose a Human-And-Model-in-the-Loop Entailment Training (HAMLET) adversarial data collection strategy to gather challenging examples for NLI. In this work, we adopt both approaches to construct a less biased and more challenging dataset for the multimodal video+dialogue setting.

Dataset
The VLEP dataset contains 28,726 examples from 10,234 TV show and YouTube Lifestyle Vlog video clips. Of these, 50% are collected directly from human annotators over two rounds: (1) round one: standard data collection; (2) round two: adversarial data collection. We collect human examples using Amazon Mechanical Turk (AMT), with an average cost of $1.10 per example. More details about the annotators and quality checks are presented in Appendix Section A.2. The other 50% are obtained from human-annotated examples via Adversarial Matching (Zellers et al., 2019a). Hence, overall we build our dataset with 3 collection methods (standard-human, adversarial-human, adversarial-matching), allowing a balance between easy and hard examples while reducing potential biases.

Video and Language Source
VLEP is built using videos (with English dialogues) from two sources: TV shows and YouTube vlogs. Both types of videos contain rich physical interactions and dialogues between people, and are thus ideal sources for collecting interesting events. We do not use videos from sources like ActivityNet (Caba Heilbron et al., 2015), since they do not have dialogues and typically contain fewer events.
TV Show Videos. We use TV show clips from TVQA (Lei et al., 2018). The clips are typically 60-90 seconds long, and are from 6 popular TV shows of 3 genres: 1) sitcom: The Big Bang Theory, How I Met Your Mother, Friends, 2) medical drama: Grey's Anatomy, House, 3) crime drama: Castle.
YouTube Lifestyle Vlogs. While TV shows are good video sources with rich inter-human interactions, they may focus more on scripted content (Lei et al., 2020b). Thus, we also collect a set of YouTube lifestyle vlogs as an additional source, which are typically more natural and interactive. We first manually identify a list of YouTube channels that contain videos with rich human interactions and dialogues (in English). We filter out channels with instructional videos (Miech et al., 2019) or routine videos (Ignat et al., 2019; Fouhey et al., 2018), as they focus more on a single person performing actions, while we desire videos with richer multi-person interactions and dialogues. In addition, the actors in instructional or routine videos typically follow rigid steps (e.g., in cooking videos, they usually follow recipes) to finish a particular task, making it much easier to predict future events. In the end, we identified 9 channels that contain a diverse set of lifestyle vlog videos on various topics: travel, food, daily life and family, etc. We downloaded all videos from these channels published after 2017, which were then manually checked to ensure high quality. The resulting pool contains 971 videos of 10-30 minutes in length. Each video is associated with aligned dialogue text, either written by humans or generated by YouTube's Automatic Speech Recognition (ASR) system. We segment the videos into 60-second clips. For each video, we drop the first and the last clip, as most of them are high-level introductions or subscription pleas.
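The clip segmentation step above can be sketched as follows; this is a minimal illustration assuming each video is identified by an id with a known duration in seconds (the function and variable names are ours, not from the released code):

```python
def segment_video(video_id, duration, clip_len=60):
    """Split a video into fixed-length clips and drop the first and last clip,
    which are often high-level introductions or subscription pleas."""
    n_clips = int(duration // clip_len)
    clips = [(video_id, i * clip_len, (i + 1) * clip_len)  # (id, start, end)
             for i in range(n_clips)]
    return clips[1:-1]
```

For a 15-minute video, this keeps 13 of the 15 possible 60-second clips.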

Round One: Standard Data Collection
As our task is video event prediction, we aim to collect a set of videos annotated with future event pairs (i.e., more-likely events and less-likely events, also referred to as positive events and negative events) that are likely to happen right after the 'premise' video. Each event is written in natural language (English), and we require the positive event to be more likely to happen than the negative event.
With this goal in mind, we create our first annotation task on AMT. We present workers (human writers) with a 60-90 second long video with aligned dialogue subtitles, to encourage them to write events that are related to both the visual content and the dialogue. Workers are required to first select an interesting event from the video with timestamps, similar to previous work (Lei et al., 2018, 2020c). This event is defined as the premise event. We also require workers to write a premise summary, a natural language (English) description summarizing the premise event. Following Lei et al. (2018, 2020c), to refer to a specific person in the video, workers are instructed to either use character names (e.g., 'Sheldon') if they are available in the dialogues, or provide a referring expression (Kazemzadeh et al., 2014) (e.g., 'the man in blue top') that uniquely refers to a person in the video. Next, given the premise event, workers are required to write two future events, one more likely (>50% chance) to happen after the premise event, and one less likely (<50% chance). For example, in Figure 1, the correct answers are the more-likely events while the wrong answers are the less-likely events. To encourage workers to write more reasonable future events that are grounded in the premise event, we also require them to provide a rationale as to why each event is more or less likely. As this is not the focus of this work, we will release these rationales to support research on textual explanation generation/classification tasks (Huk Park et al., 2018; Zellers et al., 2019a).

Table 1: Examples of annotation artifacts.
Type: Negation. Premise Summary: Amy picks up her phone and reads a text message. More-likely: Amy tells her friends what the text message says. Less-likely: Amy says nothing at all to her friends.
Type: Impolite Actions. Premise Summary: Chandler finds out that Joey used his toothbrush. More-likely: Chandler starts arguing with Joey for using his toothbrush. Less-likely: Chandler hits Joey in the face with a punch.
Each collected example is verified by three human verifiers, who rank the future events conditioned on the premise event. We only accept an example if at least three out of four users (one writer + three verifiers) reach an agreement, following Hendricks et al. (2017) and Nie et al. (2020). In addition, we discard examples if one of the verifiers finds the events violate our instructions (e.g., wrong person reference). In total, we collected 6,458 verified examples from 2,329 TV show clips. We split them into 70% training, 15% development, and 15% testing splits such that videos and their corresponding examples only appear in one split.

Round Two: Adversarial Data Collection
While efficient, we found that the negative events collected in round one are sometimes simple and contain biases or annotation artifacts (Gururangan et al., 2018). In Table 1, we show typical examples of annotation artifacts. For example, we found workers tend to use negation when writing the less-likely event. This particular type is similar to the visual priming bias (Zhang et al., 2016) for yes/no questions in VQA (Antol et al., 2015) and the negation word bias (Gururangan et al., 2018) in MNLI (Williams et al., 2018). To quantitatively study the effect of these annotation artifacts, we fine-tune a RoBERTa-base (Liu et al., 2019) model to classify which event is more likely to happen, using only the future events from round one's training data, i.e., the model has no access to the premise event. On round one's Dev. split, this premise-oblivious model obtains 75.34% accuracy, which is much higher than chance (50%).
Hence, in order to collect harder and less biased negatives, we make use of an adversarial collection procedure (see Figure 2) in a human-and-model-in-the-loop process (Nie et al., 2020), where models are used to provide real-time feedback to crowd-workers during data collection. Specifically, each submitted result is sent to the model for evaluation, and writers are prompted to rewrite their negative event if our model predicts a much higher probability for the more-likely event (p_m) than for the less-likely event (p_l), i.e., p_m − p_l > ∆, where ∆ is a hyperparameter that controls how difficult we want the collected examples to be, and is set to 0.1 empirically. This can be seen as a soft-adversarial strategy, unlike Nie et al. (2020), where feedback decisions are made directly from hard model predictions (a special case of our soft-adversarial strategy with ∆ = 0). In addition to controlling the difficulty of the collected examples, it also helps us reduce the prediction noise from imperfect models and avoid forcing workers to write abnormal events in order to fool the model.

Figure 2: Illustration of our adversarial data collection procedure. Step 1: write events; Step 2: get model feedback; Step 3: verify events; Step 4: retrain the model for the next round. p_m and p_l are the probabilities of the more-likely and the less-likely event happening, respectively. ∆ is a hyperparameter that controls how hard we want the collected examples to be; it also helps to reduce prediction noise from imperfect models. ∆ is set to 0.1 in our experiment. Collection for an example ends when the model's check passes ('No' rewrite needed) or the number of trials reaches the maximum limit of three.
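The soft-adversarial feedback rule can be sketched as below; `score_fn` stands in for the RoBERTa-based feedback model and `rewrite_fn` for the worker's revision step (both names are hypothetical, and the exact trial accounting is a simplification):

```python
DELTA = 0.1  # difficulty threshold from the paper

def needs_rewrite(p_more, p_less, delta=DELTA):
    """Prompt a rewrite when the model finds the example too easy, i.e., it
    assigns a much higher probability to the more-likely event."""
    return (p_more - p_less) > delta

def adversarial_loop(score_fn, rewrite_fn, example, max_trials=3):
    """Soft-adversarial collection loop: accept the example once the model is
    (nearly) fooled, or after the maximum number of trials."""
    for trial in range(max_trials):
        p_more, p_less = score_fn(example)
        if not needs_rewrite(p_more, p_less):
            return example, trial + 1   # accepted as a hard example
        example = rewrite_fn(example)   # worker revises the negative event
    return example, max_trials          # sent to verification anyway
```

Note that setting `delta=0` recovers the hard-feedback special case attributed to Nie et al. (2020) in the text.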
We use two models to provide feedback to the writers: a future event only model that focuses primarily on reducing the aforementioned annotation artifacts, and a premise summary + future event model that can additionally detect, and thus reduce, simple negatives created as contradictions of the premise. For example, with the premise summary 'Howard tells Bernadette that he has a dominant personality', the negative event 'Howard will say that he doesn't have a dominant personality' is relatively simple, as it directly contradicts the premise. Both models are fine-tuned on round one's training data as a sequence classification task, using a pre-trained RoBERTa-base model. (Empirically, RoBERTa-large does not yield better performance but has a longer response time that affects user experience.) The objective is to maximize the probability of the positive event being the correct answer. For the future event only model, we use only the future event for classification, ignoring the premise. For the premise summary + future event model, we concatenate the premise summary and the future event text as a single sequence for classification. Note that we use the premise summary as an overall proxy for both video and dialogue content to build our adversarial models, considering that video and dialogue understanding is still an open research problem in itself. (In Appendix Section A.3, we show that an oracle model that uses the premise summary as auxiliary input significantly outperforms our video+dialogue model.) The accuracies of these two models on round one's Dev. split are 75.34% and 76.68%, respectively. During collection, we randomly pick one of these two models to provide feedback to users. This is similar to the approach of Nie et al. (2020), where one model is randomly picked from a set of models trained with different random seeds. The difference is that we use a set of two models with different inputs (architectures), while Nie et al. (2020) use the same architecture with varying random seeds.

This strategy can be seen as constructing a pseudo-ensemble model, which provides diverse adversarial feedback to the crowd-workers and helps avoid annotators exploiting vulnerabilities of a single model (Nie et al., 2020), while reducing server load (as we only need to run one model instead of multiple models in a standard ensemble approach). In round two, with our adversarial collection procedure, we collected 7,905 verified examples from 4,418 TV show clips and 3,487 YouTube clips. Similar to round one, we split them into 70% training, 15% development, and 15% testing splits.

Adversarial Matching
With adversarial data collection, we are able to collect harder and less biased examples. However, this approach is not scalable due to its high cost: on average, each verified example in round two costs $1.70. Inspired by Zellers et al. (2019a), who proposed Adversarial Matching to obtain less biased negatives, we use a similar strategy to create additional examples for our dataset. Given a premise event and its positive event, the goal of adversarial matching is to find a negative from other premise events' positives, such that the matched negative is very relevant to the premise event (so that it is still hard for machines) and, at the same time, not overly similar to the true positive (in case it incidentally becomes a positive event for the premise). Specifically, we use BERTScore (Zhang et al., 2020) with the recommended RoBERTa-Large model fine-tuned on MNLI (Williams et al., 2018) to calculate a similarity score S_sim(e_i, e_j) between two events e_i and e_j. For relevance, we use a RoBERTa-base model that takes as input the concatenation of a premise summary p_i and a future event e_j and outputs a relevance score S_rel(p_i, e_j). This model is trained to distinguish positive events from randomly sampled events. Next, given dataset examples {(p_i, e_i)}, i = 1, ..., N, we obtain a negative future event for each premise p_i with maximum-weight bipartite matching (Munkres, 1957; Jonker and Volgenant, 1987) on a weight matrix W ∈ R^{N×N}:

W_{ij} = S_rel(p_i, e_j) − α · S_sim(e_i, e_j) − λ · 1(p_i, e_j),

where α = 0.1 is a hyperparameter that controls the tradeoff between relevance and similarity, and the indicator 1(p_i, e_j) equals 1 if p_i and e_j are from different sources (e.g., different TV shows), otherwise 0. Thus, λ serves as a regularization term that penalizes e_j if it is from a different video source than that of p_i, as e_j could otherwise be an easy negative distinguishable from superficial clues such as character names in different shows.
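The matching step can be sketched as follows. The weight form mirrors the tradeoff described above (reward relevance, penalize similarity to the true positive, penalize cross-source pairs); the value of `lam` and the toy scores are illustrative assumptions. Brute force over permutations is used here for clarity; in practice, one would use a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment`.

```python
from itertools import permutations

def adversarial_match(S_rel, S_sim, same_source, alpha=0.1, lam=10.0):
    """Match each premise i to a negative event j != i via maximum-weight
    bipartite matching on W[i][j] = S_rel[i][j] - alpha * S_sim[i][j]
    - lam * [i and j come from different sources]."""
    n = len(S_rel)
    W = [[S_rel[i][j] - alpha * S_sim[i][j]
          - (0.0 if same_source[i][j] else lam)
          - (1e9 if i == j else 0.0)  # an event cannot be its own negative
          for j in range(n)] for i in range(n)]
    best = max(permutations(range(n)),
               key=lambda p: sum(W[i][p[i]] for i in range(n)))
    return list(best)  # best[i] = index of the matched negative for premise i
```

The large `1e9` diagonal penalty plays the role of forbidding self-matches without needing infinite weights.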

Method
Given a video with dialogue text and two future event candidates e_i, i ∈ {1, 2}, our goal is to predict which future event is more likely to happen. In the following, we introduce our baseline approach (see the model overview in Figure 4).

Figure 4: Model overview. We first separately encode video and text, and then use a multimodal transformer encoder to encode information from both modalities. Please see the text for details.

Text Representations. We first fine-tune a pre-trained RoBERTa-base model on commonsense data (details in the paragraph below) and then use the resulting model for feature encoding. Note that this model is end-to-end trainable during training. We concatenate the dialogue and a future event candidate as input to the transformer layers; special tokens such as [CLS] (Devlin et al., 2019) are also added in this process. We use the extracted token embeddings from the last layer, denoted as E_t^i ∈ R^{L_i × d}, i ∈ {1, 2}, where L_i is the sequence length (#tokens, including added special tokens). Similar to how we encode video, the resulting text representation is further encoded using a transformer encoder. Without ambiguity, we use the same notation E_t^i ∈ R^{L_i × d}, i ∈ {1, 2}, to denote the outputs.

Commonsense-based Text Representations. Addressing our challenging future event prediction task requires general world knowledge beyond basic visual and language semantic understanding. Thus, we propose to inject commonsense from the ATOMIC dataset (Sap et al., 2019) into our framework in a simple way. ATOMIC contains events with if-then inferences, e.g., if X meets Y at the station, then X wants to give Y a ride home (see the example in Figure 3). We extract 406K event inferences from the dataset and replace the person tokens X and Y with names from our dataset (Mitra et al., 2019). We then use the extracted event inference sentences to fine-tune the pre-trained RoBERTa-base model. The fine-tuned model is then used to encode our text inputs.
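The ATOMIC-to-sentence conversion can be sketched as below. The relation phrasings and default names are illustrative assumptions: ATOMIC's relations include types such as xIntent and xWant, but the exact templates here are not the paper's.

```python
# Hypothetical relation-to-phrase templates; not the paper's exact wording.
RELATION_PHRASE = {
    "xWant": "wants to",
    "xIntent": "intends to",
    "xReact": "feels",
}

def atomic_to_sentence(head, relation, tail, names=("Sheldon", "Amy")):
    """Turn one ATOMIC (head event, relation, tail) triple into a natural
    training sentence, replacing the PersonX/PersonY placeholders with
    character names from the dataset."""
    x, y = names

    def sub(s):
        return s.replace("PersonX", x).replace("PersonY", y)

    return f"if {sub(head)}, then {x} {RELATION_PHRASE[relation]} {sub(tail)}"
```

With these assumed templates, the running example from the text becomes "if Sheldon meets Amy at the station, then Sheldon wants to give Amy a ride home".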

Multimodal Encoding and Event Classification.
To obtain the joint multimodal representation, we concatenate the encoded video E_v and text E_t and use a transformer encoder to encode the concatenated representations. This encoder allows information exchange between the two modalities. We use the representation from the [CLS] token as the joint representation of video, dialogue, and future event e_i, denoted as g_i ∈ R^d, i ∈ {1, 2}. We gather the joint representation vectors for all future event candidates and pass them through a two-layer MLP with a softmax layer for classification. We train the model using a cross-entropy loss that maximizes the scores of the more-likely future events.
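The fusion-and-classification step can be sketched in PyTorch as below. The hidden size d=768, the single transformer layer, and the two-layer MLP follow the paper's setup; the head count, the pooling position, and everything else are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MultimodalFusionHead(nn.Module):
    """Concatenate video and text token features, fuse them with a transformer
    encoder, and score each future event candidate with a two-layer MLP."""

    def __init__(self, d=768, nhead=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=nhead,
                                           batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=1)
        self.classifier = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                        nn.Linear(d, 1))

    def forward(self, E_v, E_t_list):
        """E_v: (B, Lv, d) video features; E_t_list: one (B, Lt_i, d) tensor
        per future event candidate. Returns (B, num_candidates) logits."""
        scores = []
        for E_t in E_t_list:
            joint = self.fusion(torch.cat([E_v, E_t], dim=1))
            g = joint[:, 0]  # first-position token as the joint summary,
                             # standing in for the paper's [CLS] pooling
            scores.append(self.classifier(g))
        return torch.cat(scores, dim=1)
```

At training time the logits are passed through softmax with a cross-entropy loss, e.g. `nn.CrossEntropyLoss()(logits, labels)`.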

Implementation Details
Our models are implemented in PyTorch (Paszke et al., 2017). To speed up training, we use NVIDIA Apex for mixed-precision training. We set the hidden size d to 768 and use a single transformer layer for all our transformer encoders. We use the Adam (Kingma and Ba, 2015) optimizer with β_1 = 0.9, β_2 = 0.999. Since our model has a pre-trained component (RoBERTa), we use a two-phase training strategy. Specifically, we first freeze RoBERTa's weights up to the second-to-last layer and pre-train the rest of the model for 3 epochs with an initial learning rate of 1e-4, warming up the learning rate over the first 10% of steps and then linearly decaying it to 0. We then unfreeze all the weights and fine-tune the whole model for 3 epochs with a learning rate of 5e-5, linearly decayed to 0. We train the model on a single RTX 2080Ti GPU with batch size 16. We report multiple-choice question answering accuracy.
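The two-phase schedule can be sketched as below, assuming a HuggingFace-style RoBERTa module that exposes `encoder.layer`; leaving only the last encoder layer trainable in phase one is our interpretation of "freeze up to the second-to-last layer", and the warmup/decay schedulers are omitted for brevity.

```python
import torch

def two_phase_optimizers(model, roberta, phase1_lr=1e-4, phase2_lr=5e-5):
    """Phase 1: freeze RoBERTa up to its second-to-last layer and train the
    rest at 1e-4. Phase 2 (via the returned `unfreeze` callback): unfreeze
    everything and fine-tune the whole model at 5e-5."""
    for layer in roberta.encoder.layer[:-1]:
        for p in layer.parameters():
            p.requires_grad = False
    phase1 = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad),
        lr=phase1_lr, betas=(0.9, 0.999))

    def unfreeze():
        for p in model.parameters():
            p.requires_grad = True
        return torch.optim.Adam(model.parameters(), lr=phase2_lr,
                                betas=(0.9, 0.999))

    return phase1, unfreeze
```

The caller runs 3 epochs with the phase-one optimizer, calls `unfreeze()`, and then runs 3 more epochs with the returned optimizer.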

Results
Are video and dialogue modalities useful? Table 5 shows the results with different input contexts. The model using only the future event text as input achieves 58.09% accuracy, higher than random chance (50%), suggesting that a slight bias remains even with our deliberate adversarial collection and matching, though it is tolerable. Adding video or dialogue as additional input improves the accuracy to 59.03% and 66.63%, respectively. The best performance is achieved when using both video and dialogue, with an accuracy of 67.46%. In Appendix Section A.3, we also present an oracle model with the premise summary as auxiliary input.
Human Performance. To obtain human performance, we randomly sampled 400 examples from our test set. We present a premise event (a video with dialogue subtitles, or dialogue subtitles only) and its two corresponding future events to a new set of AMT workers, and ask them to select which one is more likely to happen after the premise. Each example is answered by 10 different workers to reduce crowdworker variance (Rajpurkar et al., 2018). The final answer is selected by majority vote. Table 5 shows the results. We observe that human performance without video (i.e., only dialogue+future) is 76.25%, while showing the video improves performance to 90.5%, which indicates that video information is important for getting the correct answer.
Compared with the best model result (67.46%), there is still a large gap (23%) to be closed by future community work on our challenging multimodal event prediction task.
Does commonsense knowledge help? In Table 6, we show a model variant that uses text features from a RoBERTa model not fine-tuned on ATOMIC sentences. This variant has a lower accuracy (66.96%) than the ATOMIC fine-tuned model (67.46%), suggesting that commonsense knowledge helps.
Impact of Data Collection Method. Table 7 shows the model performance breakdown by different collection methods. For human-annotated data, we show performance on round one (R1, standard data collection) and round two (R2, adversarial data collection). First, we observe that the accuracy of the future only model matches chance on adversarially matched data while being higher on human-annotated data; the main reason is that the matched data has fewer artificial biases than the human-annotated data. Second, for human-annotated data, across all models, the performance on the round two subset is significantly lower than on round one, demonstrating the effectiveness of our adversarial collection procedure. Gururangan et al. (2018) show that lexical choice is a strong indicator of the inference class in NLI. To check how our adversarial collection affects word use, we use pointwise mutual information (PMI), as in Gururangan et al. (2018). In Table 8, we show the top words associated with the negative class (less-likely events) under standard collection, versus their values under our adversarial collection process. We find that the PMI values of these top negative words (e.g., 'throws', 'without', which frequently occur in less-likely events) clearly drop under adversarial collection, e.g., 'throws' drops from 1.38 to 0.83, making it less indicative of the negative class.

Figure 5 example: Premise Event. Which event is most likely to happen right after the premise? A. Zach will cherish the photo in his hand. (correct) B. Zach will give the picture back to Esposito.
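The PMI computation can be sketched as below; the add-k smoothing and whitespace tokenization are our simplifications of the Gururangan et al. (2018) recipe:

```python
import math
from collections import Counter

def pmi_with_class(events, labels, target_label="neg", k=1.0):
    """PMI(word, class) computed as log p(class | word) / p(class) over
    (event sentence, label) pairs, with add-k smoothing."""
    word_counts, word_class_counts = Counter(), Counter()
    for sent, label in zip(events, labels):
        for w in set(sent.lower().split()):  # count each word once per event
            word_counts[w] += 1
            if label == target_label:
                word_class_counts[w] += 1
    p_class = labels.count(target_label) / len(labels)
    return {w: math.log(((word_class_counts[w] + k) / (word_counts[w] + k))
                        / p_class)
            for w in word_counts}
```

Words with high PMI for the less-likely class are exactly the artifact indicators the adversarial collection is designed to suppress.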
Qualitative Examples. We show 4 prediction examples from our best model (video + dialogue + future) in Figure 5. The top row shows two correct predictions, where our model is able to predict basic human intention and reaction. The bottom row shows two incorrect predictions, mainly caused by a lack of commonsense. For example, to correctly pick the more likely event in Figure 5(c), the model needs to understand that the 'photo' is evidence in a police investigation. Figure 5(d) shows an example that requires the model to infer that the food is not ready yet. More examples are presented in Appendix Section A.4.

Conclusion
We introduce a new task, Video-and-Language Event Prediction (VLEP): given a video with aligned dialogue and two future events, an AI system is required to predict which event is more likely to happen. To support this task, we collect the VLEP dataset. We present a strong transformer-based baseline that incorporates information from video, dialogue, and commonsense knowledge, each of which proves useful for this challenging task.

A.1 Additional Data Analysis
Our videos are curated from two sources, TV shows and YouTube lifestyle vlogs, across five major categories: sitcom, medical, crime, travel-food, and family-daily. Events generally vary by genre. One way to show the difference is by checking the top unique nouns in each genre. To obtain the top unique nouns, we first tokenize and lemmatize the future event sentences. Each resulting token is also tagged with a part-of-speech tag. Next, for each genre, we take the top unique nouns as the ones that are among the most frequent 100 nouns in that genre but do not appear among those from the other genres combined. We show the top unique nouns in each genre in Table 9. Interestingly, the top unique nouns in the crime genre are closer to crime and violence, while in family-daily, the top unique nouns are relatively more family-relevant.

Most of the premise events require both video and dialogue understanding. Figure 6 (right) shows the distribution of examples by commonsense reasoning type. We categorize commonsense reasoning into three types by examining the relation between the premise event and the positive future event: (1) intention, e.g., if X brings two cups of coffee, then X (intends to) give Y a cup of coffee.
(2) reaction, e.g., if X hands Y a form and describes a procedure, then Y signs the form and hands it back. (3) causal, e.g., if X says they hit a bump, then X gets unbalanced and falls off the boat. The two distributions are obtained by manually annotating 100 randomly sampled examples from VLEP Dev. split.
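The top-unique-nouns procedure described above can be sketched as follows, assuming tokenization, lemmatization, and POS tagging have already produced a list of nouns per genre:

```python
from collections import Counter

def top_unique_nouns(genre_nouns, top_k=100):
    """For each genre, keep the nouns that are among its own top_k most
    frequent nouns but absent from the other genres' top_k lists combined."""
    top = {g: [w for w, _ in Counter(ns).most_common(top_k)]
           for g, ns in genre_nouns.items()}
    unique = {}
    for g in top:
        others = set().union(*(set(top[h]) for h in top if h != g))
        unique[g] = [w for w in top[g] if w not in others]
    return unique
```

The upstream tagging step (e.g., with an off-the-shelf NLP toolkit) is omitted here, since any tokenizer/lemmatizer/POS tagger would do.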
Next, we show the distribution of premise event length and premise summary length in Figure 7 and Figure 8, respectively. In addition, we also show the distribution of positive future event length and negative event length in Figure 9 and Figure 10.

A.2 Additional Data Collection Details
We hire workers from Amazon Mechanical Turk (AMT) to annotate our data. To ensure data quality, we only allow workers from English-speaking countries to participate in our task. We require workers to have at least 500 approved HITs with an approval rate of at least 95%. Furthermore, we design a qualification test with 10 multiple-choice questions to ensure that workers fully understand our annotation requirements. We show an example question from our qualification test in Figure 11. Workers have to correctly answer at least 7 questions to pass the test. In total, 518 workers took the test, with a pass rate of 56%. During data collection, we set up an automatic tool to check whether all required annotations have been performed. We also manually review the submitted results and provide prompt feedback to workers, encouraging better annotation.
Our data collection instructions and interface for round two (adversarial data collection) are shown in Figure 12 and Figure 13, respectively. Round one collection details are similar to those of round two, except that we do not require workers to fool our basic models (robot). In our annotation process, the actual future events in the videos are not hidden from the workers, to ease collection. Workers can either write the actual future event as the more likely event, or hypothesize one when the actual future event in the given video is surprising/rare (such as some events in sitcoms).
To ensure the quality of the examples, we conduct a strict filtering step in which each example is verified by three additional workers.

A.3 More Results
Oracle Premise Results. As an oracle test, we provide the collected premise summary as an auxiliary input to the model, removing some of the video-dialogue understanding burden from our baseline. We show this oracle model's performance in Table 10. Our model with the premise summary (oracle) achieves 75.64%, significantly higher than the model without it (67.46%), indicating the need for better video-dialogue understanding.
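One way to realize this oracle setting is to append the human-written premise summary as an extra text segment in the model's input sequence. The sketch below is an assumption for illustration (the helper name, special tokens, and segment order are hypothetical, not our exact input format):

```python
def build_text_input(dialogue, future_event, premise_summary=None,
                     cls_tok="<s>", sep_tok="</s>"):
    # Hypothetical input formatter: the oracle variant simply adds the
    # human-written premise summary as one more separator-delimited segment.
    parts = [cls_tok, dialogue, sep_tok, future_event, sep_tok]
    if premise_summary is not None:
        parts += [premise_summary, sep_tok]
    return " ".join(parts)
```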
Future Event Generation Results. Given the videos, we can also set up an alternative task of using a captioning-style model to generate future event descriptions. We use the MultiModal Transformer from Lei et al. (2020c) as our baseline for this task. This model uses a standard transformer encoder-decoder architecture for caption generation. The video embeddings and dialogue embeddings are concatenated as inputs (Lei et al., 2020a) to the transformer encoder. We use the default model and training configurations from Lei et al. (2020c). With this system, we evaluate generation performance with video and dialogue as inputs.
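The concatenate-then-encode design can be sketched as below. This is a minimal sketch, not the exact MMT implementation from Lei et al. (2020c): the class name is hypothetical, and the dimensions and layer counts are illustrative rather than the defaults used in our experiments.

```python
import torch
import torch.nn as nn

class FutureEventCaptioner(nn.Module):
    # Minimal sketch of a captioning-style future-event generator: video
    # features and dialogue token embeddings are concatenated along the
    # sequence axis and fed to a standard transformer encoder-decoder.
    def __init__(self, vocab_size, d_model=16, video_dim=8):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, d_model)  # project video features
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=32, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, dialogue_ids, target_ids):
        # Concatenate the two modalities into one encoder input sequence.
        src = torch.cat([self.video_proj(video_feats),
                         self.text_embed(dialogue_ids)], dim=1)
        tgt = self.text_embed(target_ids)
        hidden = self.transformer(src, tgt)
        return self.out(hidden)  # (batch, tgt_len, vocab_size) logits
```

At inference time, the decoder would generate the future-event sentence token by token from the fused video-dialogue encoding.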
Our video+dialogue model achieves CIDEr-D (Vedantam et al., 2015a): 19.57, BLEU@4 (Papineni et al., 2002): 1.80, ROUGE-L (Lin, 2004): 16.42, and METEOR (Denkowski and Lavie, 2014): 7.58. Note that we only use this generation task to demonstrate that it is possible to generate future event sentences from videos. It may not be as suitable a benchmark as our default multiple-choice setup, since generation is known to be more difficult to evaluate (Liu et al., 2016) and requires multiple references (Vedantam et al., 2015a) for accurate evaluation. Therefore, we recommend that future work pursuing a generation-based setup on our dataset use human evaluation.

A.4 More Qualitative Examples
We show more correct and incorrect predictions from our best model (video + dialogue + future) in Figure 14 and Figure 15, respectively.

Figure 11: Example question from our qualification test. Workers have to correctly answer 7 out of 10 questions in the test to participate in our annotation task. The figure shows a premise event clip with dialogue and asks which of two candidate future events is most likely to happen right after the premise.