A hierarchical approach to vision-based language generation: from simple sentences to complex natural language

Simion-Vlad Bogolin, Ioana Croitoru, Marius Leordeanu


Abstract
Automatically describing videos in natural language is an ambitious problem, which could bridge our understanding of vision and language. We propose a hierarchical approach that first generates video descriptions as sequences of simple sentences and then, at the next level, produces a more complex and fluent description in natural language. While the simple sentences describe simple actions in the form of (subject, verb, object), the second-level paragraph descriptions, which indirectly use information from the first-level description, present the visual content in a more compact, coherent and semantically rich manner. To this end, we introduce the first video dataset in the literature that is annotated with captions at two levels of linguistic complexity. We perform extensive tests that demonstrate that our hierarchical linguistic representation, from simple to complex language, allows us to train a two-stage network that is able to generate significantly more complex paragraphs than current one-stage approaches.
Anthology ID:
2020.coling-main.220
Volume:
Proceedings of the 28th International Conference on Computational Linguistics
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Editors:
Donia Scott, Nuria Bel, Chengqing Zong
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
2436–2447
URL:
https://aclanthology.org/2020.coling-main.220
DOI:
10.18653/v1/2020.coling-main.220
Cite (ACL):
Simion-Vlad Bogolin, Ioana Croitoru, and Marius Leordeanu. 2020. A hierarchical approach to vision-based language generation: from simple sentences to complex natural language. In Proceedings of the 28th International Conference on Computational Linguistics, pages 2436–2447, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Cite (Informal):
A hierarchical approach to vision-based language generation: from simple sentences to complex natural language (Bogolin et al., COLING 2020)
PDF:
https://aclanthology.org/2020.coling-main.220.pdf