DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization

Zineng Tang, Jie Lei, Mohit Bansal


Abstract
Leveraging large-scale unlabeled web videos such as instructional videos for pre-training followed by task-specific finetuning has become the de facto approach for many video-and-language tasks. However, these instructional videos are very noisy: the accompanying ASR captions are often incomplete, and can be irrelevant to or temporally misaligned with the visual content, limiting the performance of models trained on such data. To address these issues, we propose an improved video-and-language pre-training method that first adds automatically-extracted dense region captions from the video frames as auxiliary text input, to provide informative visual cues for learning better video and language associations. Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss, which encourages the model to automatically focus on the correct caption from a pool of candidate ASR captions. Our overall approach is named DeCEMBERT (Dense Captions and Entropy Minimization). Comprehensive experiments on three video-and-language tasks (text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate that our approach outperforms previous state-of-the-art methods. Ablation studies on pre-training and downstream tasks show that adding dense captions and the constrained attention loss helps improve model performance. Lastly, we also provide attention visualizations to show the effect of applying the proposed constrained attention loss.
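The entropy minimization-based constrained attention loss described above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code: the tensor shapes and the function name are hypothetical, and the exact attention formulation in DeCEMBERT may differ.

```python
import torch
import torch.nn.functional as F

def constrained_attention_loss(attn_logits: torch.Tensor) -> torch.Tensor:
    """Entropy of the attention distribution over candidate ASR captions.

    attn_logits: (batch, num_candidates) video-to-caption attention scores
    over a pool of candidate ASR captions (hypothetical shape).
    Minimizing this loss pushes the distribution toward low entropy,
    encouraging the model to commit to one caption from the pool.
    """
    probs = F.softmax(attn_logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=-1)
    return entropy.mean()
```

Adding this term to the pre-training objective penalizes flat (uncertain) attention over misaligned candidate captions while leaving confident, peaked attention nearly unpenalized.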
Anthology ID:
2021.naacl-main.193
Volume:
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
June
Year:
2021
Address:
Online
Editors:
Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, Yichao Zhou
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2415–2426
URL:
https://aclanthology.org/2021.naacl-main.193
DOI:
10.18653/v1/2021.naacl-main.193
Cite (ACL):
Zineng Tang, Jie Lei, and Mohit Bansal. 2021. DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2415–2426, Online. Association for Computational Linguistics.
Cite (Informal):
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization (Tang et al., NAACL 2021)
PDF:
https://aclanthology.org/2021.naacl-main.193.pdf
Video:
https://aclanthology.org/2021.naacl-main.193.mp4
Code
zinengtang/decembert
Data
HowTo100M, YouCook2