Achieving Common Ground in Multi-modal Dialogue

All communication aims at achieving common ground (grounding): interlocutors can work together effectively only with mutual beliefs about what the state of the world is, about what their goals are, and about how they plan to make their goals a reality. Computational dialogue research offers some classic results on grounding, which unfortunately offer scant guidance to the design of grounding modules and behaviors in cutting-edge systems. In this tutorial, we focus on three main topic areas: 1) grounding in human-human communication; 2) grounding in dialogue systems; and 3) grounding in multi-modal interactive systems, including image-oriented conversations and human-robot interactions. We highlight a number of achievements of recent computational research in coordinating complex content, show how these results lead to rich and challenging opportunities for doing grounding in more flexible and powerful ways, and canvass relevant insights from the literature on human-human conversation. We expect that the tutorial will be of interest to researchers in dialogue systems, computational semantics and cognitive modeling, and hope that it will catalyze research and system building that more directly explores the creative, strategic ways conversational agents might be able to seek and offer evidence about their understanding of their interlocutors.

[Figure 1: Example grounding exchanges. A: "A green bike with tan handlebars." B: "Got it." (Manuvinakurike et al., 2017) A: "The green cup is called Bill." B: "Ok, the green cup is Bill."]
Grounding in human-human communication. Clark et al. (1991) argued that communication is accomplished in two phases. In the presentation phase, the speaker presents signals intended to specify the content of the contribution. In the second phase, the participants work together to establish mutual beliefs that serve the purposes of the conversation. The two phases together constitute a unit of communication: the contribution. Clark and Krych (2004) show how this model applies to coordinated action, while Stone and Stojnić (2015) apply the model to text-and-video presentations.
Coherence is key: participants give evidence of understanding through the way their contributions connect to what has come before.
Grounding in dialogue systems. Computer systems achieve grounding mechanistically by ensuring they get attention and feedback from their users, tracking user state, and planning actions with reinforcement learning to resolve problematic situations. We will review techniques for maintaining engagement (Sidner et al., 2005; Bohus and Horvitz, 2014; Foster et al., 2017) and problems that arise in joint attention (Kontogiorgos et al., 2018) and turn taking, such as incremental interpretation (DeVault and Stone, 2004; DeVault et al., 2011), ambiguity resolution (DeVault and Stone, 2009) and learning flexible dialogue management policies (Henderson et al., 2005). Similar questions have been studied in the context of instruction games (Perera et al., 2018; Thomason et al., 2019; Suhr and Artzi, 2018) and interactive tutoring systems (Yu et al., 2016; Wiggins et al., 2019).
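To make the reinforcement-learning view concrete, the following minimal sketch (a toy setup of our own, not taken from any cited system) shows a tabular agent learning when a clarification request pays off; the states, actions and reward values are all illustrative assumptions.

```python
import random

# Illustrative sketch (hypothetical environment, not any cited system):
# a tabular Q-learning dialogue manager that learns when grounding acts
# pay off. States are coarse ASR-confidence buckets; actions are
# "proceed" with the current hypothesis or "clarify" with the user.
STATES = ["low_conf", "high_conf"]
ACTIONS = ["proceed", "clarify"]

def simulate_turn(state, action):
    """Toy reward model: proceeding on a low-confidence hypothesis
    risks a dialogue breakdown; clarifying costs an extra turn but is safe."""
    if action == "clarify":
        return -1.0                        # small cost for the extra turn
    if state == "high_conf":
        return 2.0                         # task success
    return 2.0 if random.random() < 0.3 else -8.0  # likely breakdown

def train(episodes=5000, alpha=0.05, epsilon=0.1, seed=0):
    random.seed(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(episodes):
        s = random.choice(STATES)
        # epsilon-greedy action selection
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        r = simulate_turn(s, a)
        q[(s, a)] += alpha * (r - q[(s, a)])  # one-step (bandit-style) update
    return q

q = train()
# The learned policy clarifies on low confidence and proceeds on high:
policy = {s: max(ACTIONS, key=lambda act: q[(s, act)]) for s in STATES}
print(policy)
```

Even this bandit-style simplification illustrates the design question the tutorial addresses: when the expected cost of a misunderstanding outweighs the cost of an extra grounding turn, the learned policy asks for clarification.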
Grounding in multi-modal systems. Multi-modal systems can use signals such as nodding, certain hand gestures and gaze toward a speaker to communicate meaning and contribute to establishing common ground (Mavridis, 2015). However, multi-modal grounding is more than just using pointing to clarify: multi-modal systems have diverse opportunities to demonstrate understanding. For example, recent work has aimed to bridge vision, interactive learning and natural language understanding through language learning tasks based on natural images (Zhang et al., 2018; Kazemzadeh et al., 2014; De Vries et al., 2017a; Kim et al., 2020). The work on visual dialogue games (Geman et al., 2015) brings new resources and models for generating referring expressions for referents in images (Suhr et al., 2019; Shekhar et al., 2018), visually grounded spoken language communication (Roy, 2002; Gkatzia et al., 2015), and captioning (Levinboim et al., 2019; Alikhani and Stone, 2019), which can be used creatively to demonstrate how a system understands a user. Similarly, robots can demonstrate how they understand a task by carrying it out, as in research on interactive task learning in human-robot interaction (Zarrieß and Schlangen, 2018; Carlmeyer et al., 2018), as well as embodied agents performing interactive tasks (Gordon et al., 2018; Das et al., 2018) in physically simulated environments (Anderson et al., 2018; Tan and Bansal, 2018), often drawing on the successes of deep learning and reinforcement learning (Branavan et al., 2009; Liu and Chai, 2015). A lesson that can be learned from this line of research is that one main factor affecting grounding is the choice of medium of communication.
[Figure 2: Example multimodal resources. A natural-language query, "Show me a restaurant by the river, serving pasta/Italian food, highly rated and expensive, not child-friendly, located near Cafe Adriatic." (Novikova et al., 2016), and Crystal Island, an interactive narrative-]
Thus, researchers have developed different techniques and methods for data collection and modeling of multimodal communication (Alikhani et al., 2019;Novikova et al., 2016). Figure 2 shows two example resources that were put together using crowdsourcing and virtual reality systems. We will discuss the strengths and shortcomings of these methods.
Grounding in end-to-end language & vision systems. With current advances in neural modelling and the availability of large pretrained models in language and vision, multi-modal interaction is often enabled by neural end-to-end architectures with multimodal encodings, e.g. for answering questions about visual scenes (Antol et al., 2015; Das et al., 2017). It is argued that these shared representations help to ground word meanings. In this tutorial, we will discuss how this type of lexical grounding relates to grounding in dialogue from a theoretical perspective (Larsson, 2018), as well as within different interactive application scenarios, ranging from interactively identifying an object (De Vries et al., 2017b) to dialogue-based learning of word meanings (Yu et al., 2016). We then critically review existing datasets and shared tasks and showcase some of the shortcomings of current vision and language models, e.g. (Agarwal et al., 2018). In contrast to previous ACL tutorials on Multimodal Learning and Reasoning, we will concentrate on the different grounding phenomena identified in the first part of this tutorial.
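As a toy illustration of lexical grounding through shared representations (the data, feature dimensions and words below are all hypothetical), a word's "meaning" can be approximated by averaging the visual features it co-occurs with, after which new referents can be described by nearest-neighbour lookup:

```python
# Illustrative sketch (hypothetical toy data): lexical grounding as
# associating a word with the average visual feature vector of the
# scenes it was used to describe, then resolving a new referent by
# nearest-neighbour lookup in that grounded lexicon.
from math import dist

# Toy 3-d "visual features" (think colour channels) for labelled scenes.
observations = {
    "cup_red":    ([0.9, 0.1, 0.1], "red"),
    "bike_red":   ([0.8, 0.2, 0.1], "red"),
    "cup_green":  ([0.1, 0.9, 0.2], "green"),
    "bike_green": ([0.2, 0.8, 0.1], "green"),
}

def ground_words(obs):
    """Average the feature vectors observed alongside each word."""
    sums, counts = {}, {}
    for feats, word in obs.values():
        acc = sums.setdefault(word, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
        counts[word] = counts.get(word, 0) + 1
    return {w: [v / counts[w] for v in acc] for w, acc in sums.items()}

def describe(feats, lexicon):
    """Pick the word whose grounded meaning is closest to the features."""
    return min(lexicon, key=lambda w: dist(feats, lexicon[w]))

lexicon = ground_words(observations)
print(describe([0.15, 0.85, 0.15], lexicon))  # prints "green"
```

The averaging step stands in for the learned multimodal encoder in end-to-end systems; the point of the sketch is the contrast the tutorial draws: this kind of lexical grounding fixes word-to-world associations, whereas grounding in dialogue is an interactive process of exchanging evidence of understanding.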

Outline
We begin by discussing grounding in human-human communication (∼20 min). After that, we discuss the role of grounding in spoken dialogue systems (∼30 min) and visually grounded interactions, including grounding visual explanations in images and multimodal language grounding for human-robot collaboration (∼90 min). We then survey methods for developing and testing multimodal systems to study non-verbal grounding (∼20 min). We follow this by describing common solution concepts and barrier problems that cross application domains and interaction types (∼20 min).

Prerequisites and reading list
The tutorial will be self-contained. For further readings, we recommend the following publications that are central to the non-verbal grounding framework as of late 2019: