A Two-Level Interpretation of Modality in Human-Robot Dialogue

We analyze the use and interpretation of modal expressions in a corpus of situated human-robot dialogue and ask how to effectively represent these expressions for automatic learning. We present a two-level annotation scheme for modality that captures both content and intent, integrating a logic-based, semantic representation and a task-oriented, pragmatic representation that maps to our robot’s capabilities. Data from our annotation task reveals that the interpretation of modal expressions in human-robot dialogue is quite diverse, yet highly constrained by the physical environment and asymmetrical speaker/addressee relationship. We sketch a formal model of human-robot common ground in which modality can be grounded and dynamically interpreted.


Introduction
The interpretation of modal expressions is essential to meaningful human-robot dialogue: the ability to convey information about objects and events that are displaced in time, space, and actuality allows the human and robot to align their environmental perceptions and successfully collaborate (Liu and Chai, 2015). As an example, if a robot is sent to a remote location on a search and navigation mission, modally interpreted expressions such as "Tell me what you see" (uttered by the human) and "I can't see because of smoke" (uttered by the robot) are vital to information exchange. Similarly, a robot that has the ability to navigate obstacles (for example, by jumping or using LIDAR) can inform the human of this.
The learning of modal expressions for automatic understanding and use nevertheless presents a conversational paradox: while these expressions serve to communicate and align world knowledge, there is no obvious manner to ground them in the shared environment. Whereas objects and actions can be pointed to or modeled for grounded learning, modal expressions are grounded in the linguistic signal itself. Nevertheless, a basic understanding of modal meaning would allow non-human agents to reason about the possible uses of objects and better assess how certain actions and behaviors impact the task at hand.
In this paper, we document the range and nature of modally interpreted expressions used in human-robot dialogue, with the goal of making the interpretation of such expressions easy to automate in the future. We hypothesize that certain readings and scope preferences for modal operators are more salient in human-robot dialogue because of the unique makeup of the common ground (Poesio, 1993). We provide a mapping from formal semantic theories of modality related to participant beliefs and updates of the common ground (Portner, 2009), to a practical model of speech acts that translates into robot action for search and navigation task-oriented dialogue and an automated NLU and NLG system (Bonial et al., 2020). This mapping is formalized in an annotation scheme in which the use of modal expressions is mapped to their effect in dialogue, providing a model for the robot to learn the meaning of modal expressions (Chai et al., 2018). Our annotation task reveals surprisingly high inter-annotator agreement for a complex scheme; results indicate that our data is highly repetitive in the natural language used, and yet the interpretation of modal expressions is quite diverse and worth investigating further to foster effective human-robot communication in situated, task-oriented settings.
The paper is structured as follows. In Section 2 we motivate our annotation of modality and introduce the SCOUT corpus; we then situate formal semantic theories of modality in the context of human-robot dialogue. We describe our annotation scheme in Section 3, which covers both the type of modality used in an expression, and the speech act the expression conveys. We describe our results in Section 4, discussing implications for modal interpretation in human-robot dialogue and some linguistic issues that arose during the annotation process. In Section 5 we consider the implications of our results for a theory of modality and common ground in human-robot dialogue, before concluding in Section 6.
Background and Related Work

Annotating Modality
Though there is little previous work on annotating modality in dialogue, several annotation schemes exist for annotating modality in text. These annotation schemes often address modality and negation together as extra-propositional aspects of meaning, focusing on the tasks of detecting key linguistic markers and mapping their scope (Saurí et al., 2006; Morante and Daelemans, 2012). These tasks can then be leveraged to identify and analyze related concepts such as subjectivity, hedging, evidentiality, uncertainty, committed belief, and factuality (Morante and Sporleder, 2012). Automatic tagging of modality and negation and detection of related concepts has received little, though promising, attention (Baker et al., 2012; Prabhakaran et al., 2012; Marasović and Frank, 2016). The detection of the related concept of hedging (and its scope) was the focus of the CoNLL 2010 Shared Task (Farkas et al., 2010).
In the context of human-robot dialogue, modality and related concepts provide the basis for assessing speaker beliefs, commitments, and attitudes, thereby fostering understanding and coherent interaction. For example, the manner in which a speaker employs modal information can be used to assess trustworthiness (Su et al., 2010); this is important both for the human to trust the robot and work collaboratively, and for the robot to assess whether or not it should accept human instruction. Additionally, modal information allows both dialogue participants to assess the factuality of events and propositions (Saurí and Pustejovsky, 2009; Prabhakaran et al., 2015). Notably this is a complex process that requires the understanding of both fine-grained lexical semantics (e.g. a question to the robot, "Can you prevent fire?") and the interaction of scopal operators (e.g. a robot assertion, "I probably cannot fit there."). Our work is one step towards aligning participant perceptions of respective environments and both discourse and real-world events.

Human-Robot Dialogue
The data we annotate comes from the Situated Corpus of Understanding Transactions (SCOUT), a collection of dialogues from the robot navigation domain. 1 SCOUT was created to explore the natural diversity of communication strategies in situated human-robot dialogue (Marge et al., 2016; Marge et al., 2017). Data collection efforts leveraged "Wizard-of-Oz" experiment design (Riek, 2012), in which participants directed what they believed to be an autonomous robot to complete search and navigation tasks. The domain testbed for this data was collaborative exploration in a low-bandwidth environment, mimicking the conditions of a reconnaissance or search-and-navigation operation. For data collection, two "wizard" experimenters controlled the robot's dialogue processing and navigation capabilities behind the scenes. This design permitted participants to instruct the robot without imposing artificial restrictions on the language used. As more data was collected, increasing levels of automated dialogue processing were introduced (Lukin et al., 2018a). We discuss the impact of further design details in Sections 4 and 5.

Table 1 shows an example SCOUT interaction. The dialogues are divided into two conversational floors, each involving two interlocutors: the left conversational floor consists of dialogue between the participant and the dialogue manager (DM), and the right consists of dialogue between the DM and the robot navigator (RN). The participant and RN never speak directly to or hear each other; instead, the DM acts as an intermediary passing communication between the participant and the RN. Of interest to our work, the left conversational floor (that which mimics our desired human-robot communication) comprises several potential modal expressions: "move" as an imperative; "not sure" as a negated epistemic; and "can" expressing a circumstantial ability.
All SCOUT speech data (collected from the participant and RN) are transcribed and time-aligned with text messages produced by the DM. SCOUT also includes annotations of dialogue structure that allow for the characterization of distinct information states by way of sets of participants, participant roles, turn-taking and floor-holding, and other factors (Traum and Larsson, 2003). In total, SCOUT contains over 80 hours of human-robot dialogue from 83 participants.

Modal Expressions in Dialogue
As we are interested in modal meaning in context, we take a broad approach to the modal expressions we investigate, including modal verbs, attitude verbs, and imperatives. Most theories of modality in natural language take Kratzer (1981) as a starting point. Modal statements are interpreted relative to some modal force, e.g., necessity or various grades of possibility, and conversational backgrounds, e.g., realistic or normative. The traditional approach to attitude verbs treats them similarly to modals in a possible-worlds semantics (Hintikka, 1969): the verb specifies the set of accessible worlds (e.g., believe quantifies over worlds compatible with the beliefs of the attitude holder); quantification is taken to be universal.
As for imperatives, following Kaufmann (2019), imperatives should be treated similarly to modals; in fact, imperatives are modals. Any non-descriptive illocutionary force a modal proposition has comes from its context; the imperative modal operator presupposes that the context is non-descriptive. In contrast, Condoravdi and Lauer (2012) and Portner (2007) do not consider imperatives to be modals. Condoravdi and Lauer (2012) posit that each agent has an effective preference structure at any given world. Imperatives, then, are public commitments for the speaker's effective preference structure to be ordered in a certain way. Portner (2007), meanwhile, gives each interlocutor a To-Do List, a list of properties the agent is committed to making true of themselves. The use of an imperative adds a property to the addressee's To-Do List. We adapt and motivate Portner's approach further in developing our formal theory in Section 5.
Previous work on modal expressions in dialogue is analogous to our own in prioritizing the discourse effect of such potentially ambiguous expressions, particularly those involving operators that take scope (Heim, 1982; Poesio, 1993; Lascarides and Asher, 2003). Authors concur that the semantic ambiguity of scopal operators (modals included) is typically reduced or absent in the context of human-human dialogue. Little work has focused on this resolution process in human-robot dialogue, instead focusing on documenting naturally occurring human language in this setting (Lukin et al., 2018b).

A Two-Level Annotation Scheme
The motivation for a two-level annotation scheme comes from the need to bridge formal semantic theories of modality and its interpretation with models that are actionable in the context of human-robot dialogue and adequately model the discourse. In this section we discuss the development of our annotation scheme, drawing on both fine-grained annotation of modality (Section 3.1) and the identification of speech acts specific to human-robot dialogue (Section 3.2). We present our final annotation scheme in Section 3.3.

Level I: Fine-Grained Annotation of Modality
The first level of our annotation scheme is based on Rubinstein et al. (2013), who present a fine-grained annotation scheme of modal expressions and apply it to a subset of the MPQA corpus (Wiebe et al., 2005). The fine-grained nature of the annotation scheme results from the range of expressions the authors identify to carry modal meaning and the layers of information they annotate. We adapt the authors' understanding of modal expressions and their Modality Type category and accompanying values for our work, though we take into consideration the other elements they annotate. 2 A modal expression is understood in this scheme as (i) an expression used to describe alternative ways the world could be, (ii) that has some sort of propositional argument (referred to as the prejacent), and (iii) is not associated with an overt attitude holder. Modality Type specifically categorizes the type of modality a modal expression conveys in context. Seven fine-grained types are distinguished in Rubinstein et al.: Epistemic, Circumstantial, Ability, Deontic, Bouletic, Teleological, and Bouletic/Teleological. Before this classification is made, annotators first categorize each modal as belonging to one of two coarse-grained categories: Priority or Non-Priority. Priority picks out a conceptually motivated subclass of non-epistemic modalities: those that use some "priority" (a desire, a goal) to designate certain possibilities as better than others (Portner, 2009). For the MPQA corpus, annotators reliably agreed on only the highest level split between priority and non-priority interpretations (α=.89); Modality Type was quite challenging (α=.49).
The scheme we adapt for our Level I annotation is in Table 2. The modal expressions we target for annotation are broadly defined as any verb construction that conveys a modal meaning. Unlike the original scheme, we exclude modal nouns, adverbs, and adjectives and focus on verbs; we additionally annotate attitude verbs that have overt subjects (iii). This is both to provide coverage of different types of modal expressions we know to occur in our dialogue, as well as to simplify the annotation task, given the low annotator agreement of the original scheme. We additionally include the category imperative following work presented in Section 2.3, as a significant portion of our data is comprised of this type of utterance.

Level II: Speech Acts for Human-Robot Dialogue
The fine-grained annotation scheme developed by Rubinstein et al. (2013) is not sufficient for human-robot dialogue for two key reasons: (i) the scheme is geared towards modality in text, and thus does not consider how participant roles in spoken dialogue may impact modal meaning; and (ii) the shades of meaning the scheme pinpoints are not always meaningful in the context of achieving a specific task. Nevertheless, it is an ideal basis upon which to build a more complete understanding of modal interpretation in context.
The second level of our annotation thus encodes pragmatic information essential to successful interpretation of modal expressions in the context of dialogue. A robot first needs to understand whether the illocutionary force of a communication is (for example) a command, suggestion, or clarification, which may not be obvious from the surface form of the human utterance alone. Furthermore, a robot needs to understand specific instructions such as how far to go and when, evaluate whether or not these instructions are feasible, and communicate and discuss the status of a given task in relation to a larger goal.

Table 3: An overview of our annotation scheme.
To this end, we incorporate the speech act inventory of Bonial et al. (2020) and Dial-AMR, a collection of 1122 utterances from the SCOUT corpus annotated with speech acts tailored to the robot in the search and navigation domain. 3 In delineating and defining their speech acts, the authors focus on the effects of an utterance relating to belief and obligation within human-robot dialogue (Traum, 1999;Poesio and Traum, 1998). Belief and obligation are not mutually exclusive, and utterances can and do often convey both the commitment to a belief and evoke an obligation in either the speaker or the hearer. These pragmatic effects are critical for agents navigating dialogue: in planning, agents can choose to pursue either goals or obligations and must reason about these notions so that the choice can be explained. Mutual beliefs about the feasibility of actions and the intention of particular agents to perform parts of that action are captured in the notion of committed, a social commitment to a state of affairs rather than an individual one (Traum, 1999). Incorporating notions of speaker intent into our annotation scheme is thus both practical and crucial to disambiguate the multiple meanings a modal expression can have.
There are fourteen possible values for the interpretation level of our annotation, all of which we preserve (though we expected and found not all to be compatible with modal expressions). The values, their relation to speaker and addressee commitments and obligations, and examples are given in Table 8 in Appendix A. These values map onto a set of 24 robot concepts, which designate the primitive concepts in the robot's knowledge ontology and include categories such as ability, scene, environment, readiness, and help.

Final Annotation Scheme
The goal of our final annotation scheme is to identify the range of naturally occurring modal expressions in task-based human-robot dialogue and to provide information about the use and interpretation of these expressions in context. In addition to modal expressions, we annotate negation and quantification for the purpose of detecting scope relations and meaning in dialogue more broadly in future work. Our approach acknowledges both the semantic richness of how modals are assigned interpretations in context (Rubinstein et al., 2013), as well as the situational grounding of the role an expression is playing in the task-oriented dialogue (Sarathy et al., 2019;Roque et al., 2020;Bonial et al., 2020). For this reason, we have developed a two-level annotation scheme that separates out the basic modal value of an expression from its eventual interpretation within a context.
We introduce a number of constraints to help pinpoint the interpretation of modal expressions in dialogue and to make annotation feasible for non-experts. First, we reduce the number of modality type values from Rubinstein et al. (2013) from seven to six, eliminating the circumstantial and combined bouletic/teleological values and adding a value for imperative. Our adaptation forces annotators to select a single, most salient category of modality type. The addition of an imperative value is due to the preponderance of this form in our data, and we discuss its broader implications in Section 5.
We compensate for the elimination of the circumstantial modal value by adding an additional layer of annotation: temporal index. The temporal index (TI) fixes the temporal reference of the modal expression based on the interaction of the modal with the semantics of the expressions it combines with (Condoravdi, 2001). In so doing, it designates how the expression of interest relates to the common ground between the speakers. There are two possible values for TI: (i) Local TI signifies that the utterance applies only to the immediate context; and (ii) Global TI signifies that the utterance adds meaningful, new information to the common ground that speakers should be aware of throughout the dialogue. A good diagnostic for this value is to ask how the subsequent response or action contributes to the understanding of the utterance in context. For example, if a human commands "move forward two feet" to the robot, and the next action consists of the robot moving two feet forward, this is a local imperative (the task is completed and removed from the immediate context). Alternatively, if a human asks "Robot, do you speak Arabic?", both the question and its answer provide lasting useful information: an intrinsic ability of the robot. An overview of our final annotation scheme is seen in Table 3; Table 4 shows example annotations for the six modality types we annotate, as defined in Table 2, with targets in bold and scope in italics for each utterance.

A key question we aim to address with our scheme is the interaction of vagueness and ambiguity in natural language, or whether an utterance has one or many salient readings. The two primary levels of our annotation are comprised of linguistic categories well known to be ambiguous: a modal expression can be both bouletic and teleological ("I would like you to move forward so we can investigate the next room"), while a speech act such as "Why don't you ask for help?" can be interpreted as a question and/or a suggestion. Similarly, TI introduces room for ambiguity: a human asking the robot "Can you fit in that space?" can be understood as both temporally local, in the sense that the robot moving into the space will clear this question from the immediate context, and global, in the sense that the subsequent response or action still contributes to the common ground as lasting information about the robot's size and abilities. Given the combined possibility for ambiguity in our annotation, we wanted to see whether or not clear interpretive distinctions emerge from the data. This information allows us to evaluate the ease with which future work integrating modality can be conducted (and whether or not it is a worthwhile endeavor to begin with). It also builds on the work of Bonial et al. (2020), whose scheme disfavors multiple possible interpretations that may nevertheless add important information to the dialogue.
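To make the two annotation levels concrete, the scheme can be sketched as a simple record type. This is an illustration only: the field and value names below are our own shorthand, not the exact labels used in our annotation guidelines or tables.

```python
from dataclasses import dataclass
from enum import Enum

class ModalityType(Enum):
    # Level I values (adapted from Rubinstein et al. 2013, plus imperative)
    EPISTEMIC = "epistemic"
    ABILITY = "ability"
    DEONTIC = "deontic"
    BOULETIC = "bouletic"
    TELEOLOGICAL = "teleological"
    IMPERATIVE = "imperative"

class TemporalIndex(Enum):
    LOCAL = "local"    # applies only to the immediate context
    GLOBAL = "global"  # adds lasting information to the common ground

@dataclass
class ModalAnnotation:
    utterance: str
    target: str                    # the modal expression itself
    scope: str                     # proposition in the scope of the modal
    modality_type: ModalityType    # Level I: semantic content
    speech_act: str                # Level II: pragmatic interpretation
    temporal_index: TemporalIndex

# "Robot, do you speak Arabic?" -- an ability question whose answer
# contributes lasting information about the robot, hence a global TI.
ann = ModalAnnotation(
    utterance="Robot, do you speak Arabic?",
    target="do ... speak",
    scope="you speak Arabic",
    modality_type=ModalityType.ABILITY,
    speech_act="question",
    temporal_index=TemporalIndex.GLOBAL,
)
```

A record of this shape separates the basic modal value of an expression from its contextual interpretation, which is precisely the division of labor between the two levels of the scheme.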

Annotation Task
Our goal for the annotation task was two-fold: (i) to provide coverage of the data to quantitatively assess the kind and frequency with which modal expressions are used and interpreted by speaker type; and (ii) to qualitatively assess instances where modal usage is unexpected. This second goal is situated within a larger goal of understanding and automating the interpretation of scope in human-robot dialogue.
Four annotators were trained to apply the annotation scheme following the annotation guidelines and with two example annotated transcripts. Each annotator annotated 70 experimental transcripts, of which 16 transcripts overlapped with one of the other three annotators. In total, 248 transcripts were annotated: 32 by two annotators, and the remaining 216 by a single annotator. Annotators were instructed to only annotate the left conversation floor (Table 1), as this is designed to mimic automated human-robot dialogue. For each category of annotation except scope, annotators were provided with a drop-down menu that allowed them to easily restrict their choice of value; scope was manually annotated. For utterances that contained multiple types, annotators were instructed to annotate each type separately. There were a total of 48,168 utterances in the left conversation floor (22,259 human, and 25,909 robot) across all transcripts, for an average of 194.23 utterances per transcript. Examples of final annotations are given in Table 4.
To evaluate our annotation scheme, we calculated a number of inter-annotator agreement metrics on the 32 transcripts annotated by two annotators. First, we calculated the proportion of annotations for which the annotators agreed on the target. Among those annotations where the annotators agreed on the target, for each pair of annotators, we calculated the string overlap 4 between the scopes identified by each annotator, and Cohen's kappa (Cohen, 1960) for type, value, interpretation, and temporal index.
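These agreement computations can be sketched as follows. The toy label lists are invented for illustration (not SCOUT data), and `SequenceMatcher` stands in for one plausible definition of string overlap; the measure actually used is specified in footnote 4.

```python
from collections import Counter
from difflib import SequenceMatcher

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' paired category labels."""
    assert len(a) == len(b)
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n              # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2  # chance agreement
    if p_e == 1.0:
        return 1.0  # degenerate case: a single label used throughout
    return (p_o - p_e) / (1 - p_e)

def scope_overlap(s1, s2):
    """One possible string-overlap measure between two scope spans."""
    return SequenceMatcher(None, s1, s2).ratio()

# Toy example: two annotators labelling modality type on five shared targets.
ann1 = ["ability", "imperative", "imperative", "epistemic", "ability"]
ann2 = ["ability", "imperative", "teleological", "epistemic", "ability"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.722
```

Kappa corrects raw agreement for the agreement expected by chance given each annotator's label distribution, which matters here because a few modality types (notably imperative) dominate the data.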

Results
Of the 3,959 annotations in the 32 shared transcripts, annotators agreed on the target for 3,470 (87.65%). For each pair of annotators, the string overlap for scope ranged from 83.31% to 91.59% (median 85.97%). Table 5 reports inter-annotator agreement (Cohen's kappa, median and range) for each category.
After calculating IAA, we adjudicated the shared transcripts and combined them with our singly-annotated transcripts to form a gold standard. In total, 18,073 utterances (37.52%) contained one or more annotations. There were 19,456 total annotations, for an average of 1.08 annotations per annotated utterance. The distribution of modal expressions (including attitude verbs, imperatives, and other modal verbs), negation, and quantifiers is shown in Table 6. Additional result tables, describing the classification of modal expressions by speaker, value, interpretation, and temporal index, are presented in Appendix B.

Discussion
Several data points are of immediate interest from our annotation results. First, there are several asymmetries in how humans and 'robots' employ modality and illocutionary force. Humans use many more imperative modal forms than robots (13,257/120), 59.56% of their total utterances (Table 9); this finding correlates with humans using more command speech acts than robots (13,616/121) and more command speech acts than any other speech act type (Table 10), confirming findings from Marge et al. (2017). In contrast, the SCOUT robot employs teleological (1,591/432) and bouletic (184/19) modal values more frequently than the human; these tend to be in the form of making offers to perform certain actions (bouletic, Table 11) or assertions, promises, questions, and requests related to the task goals (teleological). For speech acts overall, the robot most commonly employs assertions (1,167) and promises (1,001).
Overall, our IAA scores are higher than we expected (Table 5). Though this is likely due in large part to the repetitive nature of the SCOUT data, it both validates our annotation scheme for future use and sheds light on the attested interaction of modal expressions and their interpretation. As expected, ability modals demonstrate the most flexibility in use in our data: they are employed for eight of the fourteen speech act values found. Teleological modals are also quite flexible: these are employed for ten of the fourteen speech act values (though only in single instances for two values). Epistemic modals pattern to either assertions or questions, while bouletic modals primarily comprise offers. With regards to TI, the majority of utterances are local and relevant to the immediate context rather than adding lasting information to the common ground; this imbalance is less pronounced, however, for ability and epistemic modals.
Other phenomena of interest from our data involve modal operators and their scope. For example, there were 298 utterances containing both a modal expression and a negation. Of those, a negation scopes over a modal in 227 ("I couldn't hear everything you said"), while in 53, a modal scopes over a negation ("Can you first scan the area you haven't scanned yet?"). We note that anaphora and coreference on one hand ("Do that again", "That sounds good"), and implicit arguments on the other ("Repeat ∅", "Yes I would ∅"), are quite challenging with regards to identifying the proposition in the scope of the modal operator. In contrast, we also find utterances where only the proposition is explicit, and the operator implicit ("45 degrees", "Picture"). These phenomena fall under the umbrella of underspecification, an enduring challenge of creating meaningful natural language representation that must nevertheless be actionable in settings like HRI. Finally, corrected or disjoint scope ("Can you turn 90 degrees left... I mean right") and coordination ("Can you go back inside and take a picture") also pose challenges to scope in dialogue, especially in the context of sentence-based meaning representation (Pustejovsky et al., 2019).
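The two scope orders can be made explicit in a possible-worlds notation. This is a sketch: ◇ abbreviates the relevant (circumstantial/ability) possibility modal, p its prejacent, and MB the modal base of the evaluation event, following the existential denotation we assume for can.

```latex
% Negation scoping over a modal: "I couldn't hear everything you said"
\neg \Diamond p \;\equiv\; \neg\, \exists w \,\big(w \in \mathit{MB}(e) \wedge p(w)\big)

% Modal scoping over negation: "Can you scan the area you haven't scanned yet?"
\Diamond \neg p \;\equiv\; \exists w \,\big(w \in \mathit{MB}(e) \wedge \neg p(w)\big)
```

The first denies that any accessible world verifies p; the second asserts that some accessible world verifies its negation, which is why the robot must resolve the relative order of the two operators before acting.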
Finally, we note some utterances that our annotation scheme alone cannot account for. These include conditional utterances such as "If you can turn around and take a photo so I can have a clear picture" interpreted as commands. We note for now that these utterances exemplify our ambiguity challenge: the modal can has both ability and teleological meanings, while the utterance can function as both a request (given its conditional nature) and a command (given that it is uttered by the human).

Towards a Formal Theory of Modality in Human-Robot Dialogue
Here, we sketch the beginnings of a formal theory of modality in human-robot dialogue built upon our annotation findings in Section 4. As stated before, this work takes a step towards automating the interpretation of frequently ambiguous expressions in context and mapping this interpretation to actionable representations in the human-robot context.

Desiderata
In typical human dialogue, there is a shared understanding of both an utterance meaning (content) and the speaker's meaning in the specific context (intent). This is what our annotation has captured. The ability to link these two dynamically is the act of situationally grounding meaning to the local context, or establishing the common ground between interlocutors (Stalnaker, 2002;Asher and Gillies, 2003;Tomasello and Carpenter, 2007). The common ground represents the mutual knowledge, beliefs, and assumptions of the participants that result from co-situatedness, co-perception, and co-intent. Robust human-robot dialogue requires a unique process of alignment to facilitate human-like interaction, including the recognition and generation of expressions through multiple modalities (language, gesture, vision, action); and the encoding of situated meaning (Dobnik et al., 2013;Pustejovsky et al., 2017;Krishnaswamy et al., 2017;Hunter et al., 2018). Specifically, this entails outlining three key aspects of common ground interpretation: (i) the situated grounding of expressions in context; (ii) an interpretation of the expression contextualized to the dynamics of the discourse; and (iii) an appreciation of the actions and consequences associated with objects in the environment. Here, we address (ii) first, before moving on to (i) and (iii).

Dynamic Interpretation of Modal Expressions
An account of how modal expressions are used in discourse needs to capture their "context change potential" (CCP), usually modeled as a function from input contexts to output contexts, as well as how this relates to an agent behaving rationally and cooperatively relative to their commitments (Section 3.2). An adequate model of the common ground in human-robot dialogue will especially require a satisfactory account of imperatives, as these are so frequent and directly impact goal achievement.
Here, we follow Portner (2007) in the idea that imperatives technically do not add to the common ground (and are technically not modals), while modals do (as they can be evaluated as true or false). Imperatives are instead evaluated relative to the addressee's To-Do List (TDL), a list of properties (not propositions). The TDL is nevertheless a contextual resource for the interpretation of priority modals, analogous to the common ground for epistemic modals. An imperative specifically adds an addressee-restricted property to a hearer's TDL such that the hearer should act so as to make as many items on the TDL true as feasible. This is based on a mutual assumption between the participants that each will try to bring it about that they have each of these properties. For example, if a given property corresponds to an action ([λwλx.x moves forward two feet in w]), the TDL represents the actions that an agent α is committed to taking. The TDL function T assigns to each α in the conversation a set of properties T(α). The canonical discourse function of an imperative clause φ_imp is then to add φ_imp to T(addressee), where C is a context of the form ⟨CG, Q, T⟩: C + φ_imp = ⟨CG, Q, T[addressee/(T(addressee) ∪ {φ_imp})]⟩. More details are in Appendix C.
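A minimal sketch of this context update, with TDL properties represented as plain strings rather than Portner's λ-abstracts; the class and method names are our own illustration, not an implemented system:

```python
from dataclasses import dataclass, field

@dataclass
class Context:
    """A context set <CG, Q, T>: common ground, questions under
    discussion, and the To-Do List function T (agent -> properties)."""
    cg: set = field(default_factory=set)    # CG: mutually accepted propositions
    q: list = field(default_factory=list)   # Q: questions under discussion
    tdl: dict = field(default_factory=dict) # T: maps each agent to a TDL

    def add_imperative(self, addressee, prop):
        """C + phi_imp: add the property to the addressee's TDL,
        leaving CG and Q untouched."""
        self.tdl.setdefault(addressee, set()).add(prop)

    def assert_prop(self, p):
        """A (modal) assertion, by contrast, updates the common ground."""
        self.cg.add(p)

c = Context()
c.add_imperative("robot", "move forward two feet")      # imperative -> TDL
c.assert_prop("the robot cannot see because of smoke")  # assertion  -> CG
```

The asymmetry the sketch encodes is the key point: imperatives and assertions target different components of the same context set, which is what lets the robot treat "move forward two feet" as a commitment to act rather than a proposition to believe.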
In other words, imperatives make reference to an additional component of the context set: the TDL, formalized by T(α). TDLs are structured with different "flavors" similar to how ordering sources differ for modals. Thus, each participant in a conversation possesses multiple TDLs that correspond to priority types: a teleological TDL represents goals; a bouletic TDL, desires; and a deontic TDL, obligations. In addition to assuming these, we propose another flavor of TDL specific to human-robot dialogue: a shared TDL that represents shared goals, desires, and obligations. Both individual and shared TDLs in  Table 7: Interpretive variation of modal type ability in relation to speaker and temporal index (TI) with corresponding mappings to logical representation in our proposed context set ⟨CG, Q, T ⟩.
our scheme ought to possess local and global temporal indices, reflecting our annotation as well as the discrete and continuous planning functions of robots to which they correspond (Chai et al., 2018). These intuitions are formalized in the interpretations in Table 7. We use ability modals as an example, as they demonstrate a range of flexibility in their illocutionary force in our data. For present purposes, we understand the denotation of the modal auxiliary can as: ⟦can⟧ = λq.∃w(w ∈ MB(e) ∧ q(w)), where MB (modal base) represents the set of states that are compatible with the utterance (Hacquard and Cournane, 2016). For example, the local (a, b) and global (c, d) temporal indices force circumstantial and epistemic interpretations, respectively. The additional interpretations in Table 7 fall out from our context set ⟨CG, Q, T⟩. The context set includes vital information such as the speaker/addressee relationship particular to the human-robot context, in which the human is endowed with more authority; the question or goal under discussion (Ginzburg, 1995); and other properties of the common ground, described in 5.3. The reverse mapping can be formalized as well: ability, bouletic, deontic, and teleological modals can all map onto request speech acts (Table 8), though their logical representations will differ. As Bonial et al. (2020) use AMR to represent their speech act inventory (footnote 3), we plan to extend our work by translating AMR into first-order logic for simpler mapping to robot action (Lai et al., 2020). Logical differences between modal categories will then be captured in our FOL translations and will assist the robot in understanding the mappings between modal expressions and speech acts.
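The existential semantics of can sketched above can be rendered in a few lines, with the modal base MB represented as an explicit set of accessible worlds. The world names, the local/global split, and the `can` helper are illustrative assumptions only, not a claim about the paper's implementation.

```python
from typing import Callable, Iterable

World = str
Prop = Callable[[World], bool]

def can(modal_base: Iterable[World], q: Prop) -> bool:
    """[[can]](q) is true iff some world in the modal base MB(e) satisfies q."""
    return any(q(w) for w in modal_base)

# A local modal base holds worlds compatible with the robot's immediate
# situation (circumstantial reading); a global one also covers its general
# capabilities (epistemic reading).
local_mb = {"w_door_open", "w_door_blocked"}
global_mb = local_mb | {"w_next_room"}

go_through_doorway: Prop = lambda w: w == "w_door_open"
```

`can(local_mb, go_through_doorway)` then answers the situational query "Can you go through the doorway?" affirmatively just in case some accessible world has the doorway passable.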

Situated Grounding and Modal Meaning
As noted in Section 2.2, dialogues in SCOUT were collected to mimic the setting of a low-bandwidth reconnaissance or search-and-navigation operation. A participant verbally instructs a robot at a remote location, guiding the robot to explore a physical space. The sensors and video camera on board the robot populate a map as it moves, enabling it to describe that environment and send photos at the participant's request, but the communications bandwidth prohibits real-time video streaming or direct tele-operation. The robot is assumed capable of performing low- to intermediate-level tasks, but not more complex tasks involving multiple or quantified goals without clear instruction. The experiment used a Clearpath Robotics Jackal, fitted with an RGB camera and LIDAR sensors, to operate in the environment (Marge et al., 2017).
Given this as background, we assume that both robot and human are aware of these capabilities and that these capabilities are part of the common ground entering into the dialogues under discussion. From the robot's perspective, the objects in the environment present opportunities for interaction, exploration, and manipulation: modally contingent actions that a situation presents to an agent by virtue of the objects it encounters. The contextual meaning of many modal expressions will be interpreted relative to such object knowledge.
For these reasons, it is useful to think of objects as providing habitats: situational contexts or environments that condition the object's affordances, which may be either "Gibsonian" affordances (Gibson et al., 1982) or "Telic" affordances (Pustejovsky, 1995). A habitat specifies how an object typically occupies a space (Pustejovsky, 2013). Affordances are treated as attached behaviors, which the object either facilitates through its geometry (Gibsonian) or for which it is intended to be used (Telic). For example, a Gibsonian affordance of [[CUP]] is "grasp," while its Telic affordance is "drink from." Similarly, in SCOUT's environment, a "doorway" affords passage to another room, unless it is blocked by an object or closed. Hence, when the robot is asked "Can you go through the doorway?", the modal force is taken as a query over its situational (or local) ability, given what the speaker already knows about the robot's navigation capabilities. An example representation of the affordances of a "doorway" is given in Appendix D. In a similar manner, the question "Do you speak Arabic?" is interpreted as a general ability modal, motivated by the situational awareness of Arabic script identified in the picture the robot sent. That is, linguistic signs afford decoding or interpretation, which prompts the modal reference to the ability to speak the language associated with the affording script (Sundar et al., 2010; Krippendorff, 2012).
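The habitat/affordance pairing can be sketched as a simple data structure in the spirit of this section. The class and field names below are our hypothetical choices for illustration; the actual representation appears in Appendix D.

```python
from dataclasses import dataclass

@dataclass
class Affordance:
    kind: str     # "gibsonian" (geometry-facilitated) or "telic" (intended use)
    action: str   # the behavior the object affords

@dataclass
class ObjectConcept:
    name: str
    habitat: frozenset    # situational conditions under which affordances hold
    affordances: tuple

doorway = ObjectConcept(
    name="doorway",
    habitat=frozenset({"open", "unblocked"}),
    affordances=(Affordance("gibsonian", "pass_through"),),
)

def affords(obj: ObjectConcept, action: str, situation: frozenset) -> bool:
    """An affordance is realizable only when the habitat's conditions hold."""
    return obj.habitat <= situation and any(a.action == action for a in obj.affordances)
```

On this sketch, a blocked doorway fails the habitat check, so the situational query "Can you go through the doorway?" comes out false even though passage remains a general affordance of doorways.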

Putting it All Together
We have sketched components that allow us to conceptualize how to formalize the key aspects of common ground interpretation outlined in 5.1. A proper treatment of modal expressions in human-robot dialogue will integrate the dynamic semantics of 5.2 as well as the way the grounding of objects explained in 5.3 shapes interpretation, by allowing the robot to reason about abilities, actions, and consequences. Nevertheless, work remains. The data we present support findings that humans tend towards a less verbose style of communication with robots than with other humans (Lukin et al., 2018b), and that humans spend less time updating beliefs and planning with robots than with other humans. In contrast, the surrogate 'robot' of our data orients its utterances towards goal completion and general cooperation, behaving in a more constructive and polite manner. If we expect future robots to learn behavior and language use through interaction, these results are problematic. This paradox suggests that other avenues for the learning of modal expressions ought to be explored, specifically those that leverage existing semantic representations and modal ontologies such as ours to endow the robot with semantic knowledge prior to interaction. From a practical standpoint, modal expressions allow a robot to determine the meaning of a natural language utterance, generate a goal representation with reference to existing goals, and produce an action sequence to achieve the new goal if possible (Dzifcak et al., 2009). From a social standpoint, modal expressions reflect and create participant relations, impacting factors such as trust and openness that indirectly foster successful collaboration (Lukin et al., 2018b; Lucas et al., 2018). Thus, the work we present here merits further exploration.
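The practical pipeline just described (utterance → goal representation → action sequence) can be caricatured in a few lines. The parse output, capability set, and action vocabulary below are invented placeholders for illustration, not the implementation of Dzifcak et al. (2009).

```python
from typing import Optional

def interpret(utterance: str) -> dict:
    """Stand-in for semantic parsing of a modal utterance into a goal."""
    # e.g. "Can you go through the doorway?" -> an ability-flavored goal
    return {"modal": "ability", "action": "go_through", "object": "doorway"}

def plan(goal: dict, capabilities: set, existing_goals: list) -> Optional[list]:
    """Adopt the goal if it is within the robot's capabilities; emit actions."""
    if goal["action"] not in capabilities:
        return None  # the robot should report inability rather than fail silently
    existing_goals.append(goal)
    return [f"navigate_to({goal['object']})", f"{goal['action']}({goal['object']})"]
```

The modal flavor carried by the goal is what lets the robot distinguish a query about ability from a directive to act, and to answer "I can't" when `plan` returns no sequence.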

Conclusion
In this paper, we present a two-level annotation scheme for modality as used in situated human-robot dialogues relating to search and navigation. Our annotation scheme captures both the semantic content of modal expressions and their pragmatic function relevant to speaker intent in discourse. Results from our annotation task demonstrate that our annotation scheme is valid and expressive, as well as both practical and transparent; it also gives us novel insight into the interaction between modality and illocutionary force in our setting. Our work can be extended to future, automated pipelines for human-robot dialogue that incorporate modal expressions within a formal common ground.

Table 8: Bonial et al. (2020), adapted here for Level II of our annotation. Examples are from the SCOUT corpus, with modal values in parentheses when applicable. Note: a response to a Request might consist of doing the action, rejecting it, accepting it, or discussing its desirability. Expressive types (Request and subsequent rows) are left unspecified as to the resulting obligations and some further commitments, since some derive as much from context and committed mental state as from the act itself, and some are culture-specific. For example, an acceptance of a Request generally commits the accepter to act, and an acceptance of an Offer generally commits the offerer to act.
Where C is a context of the form ⟨CG, Q, T⟩:
C + φ_imp = ⟨CG, Q, T[addressee/(T(addressee) ∪ {φ_imp})]⟩

4. Partial ordering of worlds by TDL, compatible with CG (⋂CG = context set): for any w₁, w₂ ∈ ⋂CG and any participant i, w₁ <ᵢ w₂ iff for some P ∈ T(i), P(w₂)(i) = 1 and P(w₁)(i) = 0, and for all Q ∈ T(i), if Q(w₁)(i) = 1, then Q(w₂)(i) = 1

5. Agent's commitment: for any participant i, the participants in the conversation mutually agree to deem i's actions rational and cooperative to the extent that those actions, in any world w₁ ∈ ⋂CG, tend to bring about a world w₂ such that w₁ <ᵢ w₂

D Example Representation for Dialogue Concept "Doorway"

(1)
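The TDL-induced partial ordering of worlds defined in Appendix C (item 4) can be sketched directly, with worlds encoded as maps from property names to truth values. This encoding, like the function name, is an illustrative assumption.

```python
def tdl_prefers(tdl_i: set, w1: dict, w2: dict) -> bool:
    """w1 <_i w2 iff some P in T(i) holds in w2 but not in w1, and every
    Q in T(i) that holds in w1 also holds in w2."""
    strictly_better = any(w2.get(p, False) and not w1.get(p, False) for p in tdl_i)
    no_regression = all(w2.get(q, False) for q in tdl_i if w1.get(q, False))
    return strictly_better and no_regression
```

Worlds higher in this order satisfy strictly more of the agent's TDL commitments, so acting toward them counts as rational and cooperative in the sense of item 5.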