A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

Jack Hessel, Bo Pang, Zhenhai Zhu, Radu Soricut


Abstract
Instructional videos attract high traffic on video-sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., “heat the oil in the pan”) improves user experiences. However, current automatic annotation methods based on visual features alone perform only slightly better than constant prediction. Taking cues from prior work, we show that we can improve performance significantly by considering automatic speech recognition (ASR) tokens as input. Furthermore, jointly modeling ASR tokens and visual features results in higher performance compared to training individually on either modality. We find that unstated background information is better explained by visual features, whereas fine-grained distinctions (e.g., “add oil” vs. “add olive oil”) are disambiguated more easily via ASR tokens.
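The paper's central idea is to condition caption generation on both ASR tokens and visual frame features rather than either modality alone. Below is a minimal, illustrative sketch of that kind of joint conditioning; it is not the authors' architecture, and the module names, dimensions, and the simple mean-pool/concatenate fusion are all assumptions made for clarity.

```python
import torch
import torch.nn as nn


class AsrVisualFusionCaptioner(nn.Module):
    """Toy captioner conditioned on fused ASR-token and visual-frame features."""

    def __init__(self, vocab_size, asr_vocab_size, d_model=256, vis_dim=1024):
        super().__init__()
        # ASR branch: embed recognized speech tokens, mean-pool over time.
        self.asr_embed = nn.Embedding(asr_vocab_size, d_model)
        # Visual branch: project pre-extracted frame features (e.g., vis_dim-d).
        self.vis_proj = nn.Linear(vis_dim, d_model)
        # Fuse the two pooled modality vectors into one video representation.
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Autoregressive decoder over caption tokens, conditioned on the
        # fused representation via its initial hidden state.
        self.cap_embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, asr_tokens, frame_feats, caption_in):
        # asr_tokens: (B, T_asr) ids; frame_feats: (B, T_vis, vis_dim)
        asr_vec = self.asr_embed(asr_tokens).mean(dim=1)          # (B, d_model)
        vis_vec = self.vis_proj(frame_feats).mean(dim=1)          # (B, d_model)
        fused = torch.tanh(self.fuse(torch.cat([asr_vec, vis_vec], dim=-1)))
        h0 = fused.unsqueeze(0)                                   # (1, B, d_model)
        dec_out, _ = self.decoder(self.cap_embed(caption_in), h0)
        return self.out(dec_out)                                  # (B, T_cap, vocab)


# Toy usage with random inputs (hypothetical vocabulary/feature sizes).
model = AsrVisualFusionCaptioner(vocab_size=5000, asr_vocab_size=8000)
logits = model(
    torch.randint(0, 8000, (2, 40)),   # ASR token ids per clip
    torch.randn(2, 16, 1024),          # 16 frame feature vectors per clip
    torch.randint(0, 5000, (2, 12)),   # caption prefix ids (teacher forcing)
)
print(logits.shape)  # torch.Size([2, 12, 5000])
```

Ablating either branch (zeroing the ASR or visual vector) gives a rough analogue of the paper's single-modality comparison, where the joint model outperforms training on either input alone.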
Anthology ID: K19-1039
Volume: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)
Month: November
Year: 2019
Address: Hong Kong, China
Editors: Mohit Bansal, Aline Villavicencio
Venue: CoNLL
SIG: SIGNLL
Publisher: Association for Computational Linguistics
Pages: 419–429
URL: https://aclanthology.org/K19-1039
DOI: 10.18653/v1/K19-1039
Cite (ACL):
Jack Hessel, Bo Pang, Zhenhai Zhu, and Radu Soricut. 2019. A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 419–429, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions (Hessel et al., CoNLL 2019)
PDF: https://aclanthology.org/K19-1039.pdf
Data: MSR-VTT, YouCook2