Connecting Language and Vision to Actions

A long-term goal of AI research is to build intelligent agents that can see the rich visual environment around us, communicate this understanding in natural language to humans and other agents, and act in a physical or embodied environment. To this end, recent advances at the intersection of language and vision have made impressive progress, from generating natural language descriptions of images and videos, to answering questions about them, to holding free-form conversations about visual content. However, while these agents can passively describe images or answer (a sequence of) questions about them, they cannot act in the world (what if I cannot answer a question from my current view, or I am asked to move or manipulate something?). Thus, the challenge now is to extend this progress in language and vision to embodied agents that take actions and actively interact with their visual environments. To reduce the entry barrier for new researchers, this tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning, visual question answering (VQA) and visual dialog, presenting the key architectural building blocks (such as co-attention) and novel algorithms (such as cooperative/adversarial games) used to train models for these tasks. We will then discuss current and upcoming challenges of combining language, vision and actions, and introduce recently released interactive 3D simulation environments designed for this purpose.


Tutorial Overview
This tutorial will provide an overview of the growing number of multimodal tasks and datasets that combine textual and visual understanding. We will comprehensively review existing state-of-the-art approaches to selected tasks such as image captioning (Chen et al., 2015), visual question answering (VQA) (Antol et al., 2015; Goyal et al., 2017) and visual dialog (Das et al., 2017a,b), presenting the key architectural building blocks (such as co-attention (Lu et al., 2016)) and novel algorithms (such as cooperative/adversarial games (Das et al., 2017b)) used to train models for these tasks. We will then discuss some of the current and upcoming challenges of combining language, vision and actions, and introduce some recently-released interactive 3D simulation environments designed for this purpose (Anderson et al., 2018b; Das et al., 2018). The goal of this tutorial is to provide a comprehensive yet accessible overview of existing work and to reduce the entry barrier for new researchers.
In detail, we will first review the building blocks of the neural network architectures used for these tasks, starting from variants of recurrent sequence-to-sequence language models (Sutskever et al., 2014), applied to image captioning (Vinyals et al., 2015), optionally with visual attention mechanisms (Bahdanau et al., 2015; Xu et al., 2015; You et al., 2016; Anderson et al., 2018a). We will then look at evaluation metrics for image captioning (Anderson et al., 2016), before reviewing how these metrics can be optimized directly using reinforcement learning (RL) (Williams, 1992; Rennie et al., 2017).
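To make the visual attention mechanism concrete, here is a minimal NumPy sketch of additive (Bahdanau-style) soft attention over image region features, as used in attentional captioning models. All names, dimensions and weight matrices below are illustrative assumptions, not the parameterization of any specific paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def soft_attention(features, hidden, W_f, W_h, v):
    """Additive soft attention over K image region features.

    features: (K, D) visual features, one row per image region
    hidden:   (H,)   current decoder hidden state
    Returns (context, weights): a (D,) context vector and (K,) weights.
    """
    # Score each region against the current decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v   # (K,)
    # Softmax normalization: weights are positive and sum to 1.
    e = np.exp(scores - scores.max())
    weights = e / e.sum()
    # Context vector: expectation of region features under the weights.
    context = weights @ features                          # (D,)
    return context, weights

K, D, H, A = 6, 8, 5, 4   # regions, feature dim, hidden dim, attention dim
features = rng.normal(size=(K, D))
hidden = rng.normal(size=H)
W_f = rng.normal(size=(D, A))
W_h = rng.normal(size=(H, A))
v = rng.normal(size=A)

context, weights = soft_attention(features, hidden, W_f, W_h, v)
```

At each decoding step the caption generator would recompute these weights from its new hidden state, letting it "look at" different image regions as it emits each word.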
Next, on the topic of visual question answering, we will look at more sophisticated multimodal attention mechanisms, wherein the network simultaneously attends to visual and textual features (Fukui et al., 2016; Lu et al., 2016). We will see how to combine factual and commonsense reasoning from learnt memory representations (Sukhbaatar et al., 2015) and external knowledge bases, and approaches that use the question to dynamically compose the answering neural network from specialized modules (Andreas et al., 2016a,b; Johnson et al., 2017a,b; Hu et al., 2017).
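The idea of simultaneously attending to both modalities can be sketched as a simplified parallel co-attention step: an affinity matrix relates every question word to every image region, and pooling it in each direction yields attention weights over regions and over words. This is a loose, hypothetical simplification in the spirit of co-attention models, not a faithful reproduction of any published architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def co_attention(Q, V, W_b):
    """Simplified parallel co-attention over question words and image regions.

    Q:   (T, d) question word features
    V:   (K, d) image region features
    W_b: (d, d) learned bilinear affinity weights
    """
    C = np.tanh(Q @ W_b @ V.T)        # (T, K) word/region affinity matrix
    # Attend over regions: strongest affinity each region has to any word.
    a_v = softmax(C.max(axis=0))      # (K,)
    # Attend over words: strongest affinity each word has to any region.
    a_q = softmax(C.max(axis=1))      # (T,)
    v_hat = a_v @ V                   # attended image summary (d,)
    q_hat = a_q @ Q                   # attended question summary (d,)
    return q_hat, v_hat

T, K, d = 7, 5, 6
Q = rng.normal(size=(T, d))
V = rng.normal(size=(K, d))
W_b = rng.normal(size=(d, d))
q_hat, v_hat = co_attention(Q, V, W_b)
```

The two attended summaries would then be fused (e.g. concatenated or combined bilinearly) and fed to an answer classifier.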
Following the success of adversarial learning in visual recognition (Goodfellow et al., 2014), it has recently been gaining momentum in language modeling (Yu et al., 2016) and in multimodal tasks such as captioning (Dai et al., 2017) and dialog (Wu et al., 2018a). Within visual dialog, we will look at recent work that uses cooperative multi-agent tasks as a proxy for training effective visual conversational models via RL (Kottur et al., 2017; Das et al., 2017b).
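The cooperative multi-agent setup can be illustrated with a toy signaling game trained by REINFORCE: a speaker describes a hidden target with a discrete symbol, a listener guesses the target from the symbol, and both agents share the reward. This is a deliberately tiny stand-in for the questioner/answerer games in visual dialog, with made-up dimensions and learning rates:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n_targets, n_symbols = 2, 2
speaker = np.zeros((n_targets, n_symbols))   # speaker logits per target
listener = np.zeros((n_symbols, n_targets))  # listener logits per symbol
lr, baseline = 0.5, 0.0

rewards = []
for episode in range(2000):
    target = rng.integers(n_targets)
    # Speaker emits a discrete symbol describing the target.
    p_s = softmax(speaker[target])
    symbol = rng.choice(n_symbols, p=p_s)
    # Listener guesses the target from the symbol alone.
    p_g = softmax(listener[symbol])
    guess = rng.choice(n_targets, p=p_g)
    reward = float(guess == target)          # shared cooperative reward
    # REINFORCE with a running baseline; for a categorical policy,
    # grad log pi = onehot(sample) - probs.
    adv = reward - baseline
    speaker[target] += lr * adv * (np.eye(n_symbols)[symbol] - p_s)
    listener[symbol] += lr * adv * (np.eye(n_targets)[guess] - p_g)
    baseline = 0.99 * baseline + 0.01 * reward
    rewards.append(reward)
```

Because both agents are rewarded only for joint success, gradient updates push them toward a shared "protocol", the same pressure that drives the emergence of grounded dialog in the cooperative image-guessing games discussed above.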
Finally, as a move away from static datasets, we will cover recent work on building active RL environments for language-vision tasks. Although models that link vision, language and actions have a rich history (Tellex et al., 2011; Paul et al., 2016; Misra et al., 2017), we will focus primarily on embodied 3D environments (Anderson et al., 2018b), considering tasks such as visual navigation from natural language instructions (Anderson et al., 2018b), and question answering (Das et al., 2018; Gordon et al., 2018). We will position this work in the context of related simulators that also offer significant potential for grounded language learning (Beattie et al., 2016; Zhu et al., 2017). To finish, we will discuss some of the challenges in developing agents for these tasks, as they need to be able to combine active perception, language grounding, commonsense reasoning and appropriate long-term credit assignment to succeed.
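The interaction loop these embodied environments expose can be sketched with a toy instruction-following corridor world. The gym-style `reset`/`step` interface below is a hypothetical simplification for illustration; it is not the API of any real simulator:

```python
class InstructionGridWorld:
    """Toy 1-D corridor environment pairing a language instruction with
    an observation/action loop, mimicking the shape of embodied tasks."""

    ACTIONS = ("left", "right", "stop")

    def __init__(self, size=5, goal=4):
        self.size, self.goal = size, goal
        self.instruction = f"walk to cell {goal}"

    def reset(self):
        self.pos = 0
        return self._obs()

    def _obs(self):
        # Observation: agent position plus the language instruction.
        return {"position": self.pos, "instruction": self.instruction}

    def step(self, action):
        if action == "left":
            self.pos = max(0, self.pos - 1)
        elif action == "right":
            self.pos = min(self.size - 1, self.pos + 1)
        done = action == "stop"
        # Sparse reward: success only if the agent stops at the goal.
        reward = 1.0 if done and self.pos == self.goal else 0.0
        return self._obs(), reward, done

env = InstructionGridWorld()
obs = env.reset()
total = 0.0
# A hard-coded policy that happens to satisfy this instruction.
for action in ["right"] * 4 + ["stop"]:
    obs, reward, done = env.step(action)
    total += reward
```

The sparse, success-only reward at episode end is exactly what makes long-term credit assignment hard in these tasks: the agent must learn which of its many earlier actions were responsible for a single terminal signal.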

Structure
The following structure is based on an approximately three-hour time frame, with a break.

Peter Anderson
Peter Anderson is a final-year PhD candidate in Computer Science at the Australian National University, supervised by Dr Stephen Gould, and a researcher within the Australian Centre for Robotic Vision (ACRV). His PhD focuses on deep learning for visual understanding in natural language. He was an integral member of the team that won first place in the 2017 Visual Question Answering (VQA) challenge at CVPR, and he has made several contributions in image captioning, including achieving first place on the COCO leaderboard in July 2017. He has published at CVPR, ECCV, EMNLP and ICRA, and has spent time at numerous universities and research labs including Adelaide University, Macquarie University, and Microsoft Research. His research is currently focused on vision-and-language understanding in complex 3D environments.

Abhishek Das
Abhishek Das is a Computer Science PhD student at Georgia Institute of Technology, advised by Dhruv Batra, and working closely with Devi Parikh. He is interested in deep learning and its applications in building agents that can see (computer vision), think (reasoning and interpretability), talk (language modeling) and act (reinforcement learning). He is a recipient of an Adobe Research Fellowship and a Snap Research Fellowship. He has published at CVPR, ICCV, EMNLP, HCOMP and CVIU, co-organized the NIPS 2017 workshop on Visually-Grounded Interaction and Language, and has held visiting positions at Virginia Tech, Queensland Brain Institute and Facebook AI Research. He graduated from Indian Institute of Technology Roorkee in 2015 with a Bachelor's in Electrical Engineering.

Qi Wu
Dr. Qi Wu is a research fellow at the Australian Centre for Robotic Vision (ACRV) at the University of Adelaide. Before that, he was a postdoctoral researcher at the Australian Centre for Visual Technologies (ACVT) at the University of Adelaide. He obtained his PhD degree in 2015 and his MSc degree in 2011, both in Computer Science, from the University of Bath, United Kingdom. His research interests are mainly in Computer Vision and Machine Learning. Currently, he is working on the vision-to-language problem, with particular expertise in Image Captioning and Visual Question Answering (VQA). His attributes-based image captioning model took first place on the COCO Image Captioning Challenge leaderboard in October 2015. He has published several papers in prestigious conferences and journals, such as TPAMI, CVPR, ICCV, ECCV, IJCAI and AAAI.