What is More Likely to Happen Next? Video-and-Language Future Event Prediction

Jie Lei, Licheng Yu, Tamara Berg, Mohit Bansal


Abstract
Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. In order to promote the collection of non-trivial challenging examples, we employ an adversarial human-and-model-in-the-loop data collection procedure. We also present a strong baseline incorporating information from video, dialogue, and commonsense knowledge. Experiments show that each type of information is useful for this challenging task, and that compared to the high human performance on VLEP, our model provides a good starting point but leaves large room for future work.
Anthology ID:
2020.emnlp-main.706
Volume:
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Month:
November
Year:
2020
Address:
Online
Editors:
Bonnie Webber, Trevor Cohn, Yulan He, Yang Liu
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
8769–8784
URL:
https://aclanthology.org/2020.emnlp-main.706
DOI:
10.18653/v1/2020.emnlp-main.706
Cite (ACL):
Jie Lei, Licheng Yu, Tamara Berg, and Mohit Bansal. 2020. What is More Likely to Happen Next? Video-and-Language Future Event Prediction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8769–8784, Online. Association for Computational Linguistics.
Cite (Informal):
What is More Likely to Happen Next? Video-and-Language Future Event Prediction (Lei et al., EMNLP 2020)
PDF:
https://aclanthology.org/2020.emnlp-main.706.pdf
Video:
https://slideslive.com/38939207
Code:
jayleicn/VideoLanguageFuturePred
Data:
VLEP, MultiNLI