Turn-Taking Strategies for Human-Robot Peer-Learning Dialogue

In this paper, we apply the contribution model of grounding to a corpus of human-human peer-mentoring dialogues. From this analysis, we propose effective turn-taking strategies for human-robot interaction with a teachable robot. Specifically, we focus on (1) how robots can encourage humans to present and (2) how robots can signal that they are going to begin a new presentation. We evaluate the strategies against a corpus of human-robot dialogues and offer three guidelines for teachable robots to follow to achieve more human-like collaborative dialogue.


Introduction
Grounding is the process by which two parties coordinate to come to a joint understanding or common ground in a joint project. This involves assuming mutual knowledge, beliefs, and assumptions (Clark, 1996). Since humans use grounding to collaborate in dialogue interactions, robots can look to human grounding patterns to mimic collaboration in a human-like way. In human-robot dialogue with a teachable robot, the robot often wants the human to take initiative in presenting material; at the same time, the robot wants to ensure that it can steer the conversation in a natural way. By analyzing a human-human peer-mentoring corpus, we identify turn-level grounding patterns that help achieve these two goals.
First, we observe peer-learning dialogues in a human-human corpus to model how human teachers and learners signal presentation and understanding. In this corpus, both teachers and learners alternately take the floor to offer presentations. While one speaker presents, the other speaker accepts the presentation by displaying evidence of understanding. Our first goal is to understand how a speaker signals to the other speaker to take the floor, such as a teacher encouraging a learner to present an idea, or a learner asking a question that leads the teacher to present an explanation.
Second, speakers may need to shift the floor towards themselves during a conversation. For example, a teacher may have a plan to offer feedback on the learner's work, or a learner may need to explain a problem that confused them. Therefore, our second goal is to understand how a speaker can effectively signal that they are taking the floor.
These two goals are also relevant to human-robot dialogue with a teachable robot: a robot that acts as a peer to a student and prompts the student to teach it the material (Jacq et al., 2016; Lubold et al., 2018b). Because humans engage more deeply with material when they teach it to someone else (Roscoe and Chi, 2007), we want a teachable robot to encourage humans to present material. At the same time, especially when interacting with children, the robot may not always understand or be able to process the human's speech and actions. To handle unexpected, degraded, or out-of-vocabulary input, the robot will sometimes need to take the floor and steer the conversation.
In Section 2 of this paper, we discuss related work. We introduce a human-human peer-mentoring corpus and detail our annotation process in Section 3. In Section 4, we analyze human-human grounding patterns with respect to the two goals: encouraging humans to present, and taking the floor. In Section 5, we introduce and analyze grounding in a corpus of dialogues with a teachable robot. We discuss similarities and differences in the two corpora in Section 6, and offer suggestions for improving human-robot dialogue.

Related Work
The contribution model of Clark and Schaefer is a widely used theory of conversational grounding (Clark and Schaefer, 1989; Clark, 1996). The model proposes that collaborative conversations be analyzed in terms of contribution units, where each contribution consists of a presentation phase followed by an acceptance phase. In the presentation phase, Speaker A, the presenter, presents a signal to Speaker B, the acceptor. In the acceptance phase, B acknowledges that they have understood the signal; this requires positive evidence of understanding from B. The speakers signal back and forth until they have reached closure, a sense of mutual understanding. Traum (1994, 1999) reformulated the contribution model for real-time use by collaborative dialogue agents. In this model, the units of analysis, called grounding acts, occur at the utterance level. In human-robot dialogues, Liu et al. (2013) found that incorporating an 'agent-present human-accept' dialogue pattern based on the contribution model into a robot's grounding algorithm led to improved reference resolution. Graesser et al. (2014) used a 'pump-hint-prompt-assertion' dialogue pattern in an intelligent tutoring system, finding learning outcomes comparable to those of human tutors.
Turn-taking in human-robot interaction involves understanding the cues that signal when it is appropriate for a robot to take a turn (Meena et al., 2014). Integrating factors such as robot gaze, head movement, parts of speech, and semantics into turn-taking models is an active area of research (Chao et al., 2011; Andrist et al., 2014; Johansson and Skantze, 2015), informed by studies of turn-taking in human-human dialogue (Gravano and Hirschberg, 2011). In human-human interaction, turn-taking behaviors vary considerably depending on the task. A better understanding of turn-taking in peer-learning dialogue will help inform the design of effective peer-learning robots.
Robot learning companions have the potential to teach broad populations of learners, but an important challenge is maintaining engagement and effectiveness over multiple sessions (Kanda et al., 2004). Social robotic learning companions can motivate students, encourage them to persist with a task, and even promote a growth mindset (Park et al., 2017). Recently, teachable robots have flipped the traditional teacher-learner roles, with the goal of improving learning and motivation (Hood et al., 2015). Most of these robots use spoken utterances as output but do not engage in conversational interaction around the human partner's utterances, if any exist. One exception is a robot that encourages students to think aloud, finding greater long-term learning gains when students articulate their thought process (Ramachandran et al., 2018).
Robots that are physically present have advantages over virtually-present robots and virtual agents. For example, in a game-playing setting with children, a co-present robot companion was found to be more enjoyable and have greater social presence than a virtual version of the same robot (Leite et al., 2008). In a puzzle-solving setting, students learned more with a co-present robot tutor than with a virtual version of the same robot (Leyzberg et al., 2012). A survey by Li (2015) found that in 73% of human-robot interaction studies surveyed, co-present robots were more persuasive, received more attention, and were perceived more positively than virtually-present robots and virtual agents. There may be tradeoffs to physical presence; in an interview setting, co-present robots were liked better than virtual agents, but participants disclosed less and remembered less with the co-present robot (Powers et al., 2007). Overall, the literature suggests that physically co-present robots are preferable for relationship-oriented tasks, for interaction with children, and for learning.

Peer-Mentoring Dialogue Corpus and Annotation
To develop dialogue strategies for a robot peer-learner to effectively shift the conversational floor, we examine the grounding patterns of human peer-teachers and peer-learners.
Corpus. The human-human peer-mentoring dialogue dataset consists of fifty 10-minute conversations, totaling approximately nine hours. Table 1 summarizes the dialogue durations and turn lengths in this dataset.

• Presentation (presenter): A signal or piece of information offered by the presenter.
• Probe (either): Questions such as "When are we meeting?", or a signal made without certainty of positive evidence from the other speaker, such as "You know that assignment..."
• Backchannel (acceptor): A short turn to signal understanding, such as "Mm-hmm", "Yeah", and in some cases, laughter.
• Uptake (acceptor): The acceptor's next relevant turn.
• Answer (acceptor): A signal to display understanding of the presenter's probe.
• Repetition (acceptor): A signal to confirm understanding.
• Paraphrase (acceptor): A signal to confirm understanding.
• Closure (either): Evidence of the conclusion of a joint project.

Table 2: Definitions of grounding labels and their associated roles.
Audio recordings were collected of conversations between undergraduate computer science students as part of a near-peer mentorship program. The mentees were enrolled in an introductory computer science course. The mentors were mid- and upper-level computer science students. Mentors had multiple mentees and met with each mentee individually each week over the course of a semester to give feedback on completed programming assignments. Because mentors received training on giving effective feedback and encouraging mentees to reflect on their work, we assume that all conversations are examples of effective mentoring. The dataset used in this paper is part of an ongoing data collection project with over 250 dialogues. The audio recordings of the dialogues were manually transcribed by a commercial transcription service. An excerpt below illustrates an interaction between a mentor and mentee, whom we will refer to in this paper as 'teachers' and 'learners' (punctuation is added for clarity).
TEACHER: So then you might have like a Point2D trunk start which would then update within that method down below
LEARNER: What do you mean by . . .
TEACHER: So like up here instead of putting say like public int tx1 you might write something like-
LEARNER: Oh you mean in uh as a parameter-
TEACHER: Yeah like just put 'public Point2D trunk start' and then you just end it
LEARNER: Yeah yeah I got that

Annotation. Our approach to annotation is motivated by the grounding actions proposed in Clark's model of collaborative dialogue (Clark, 1996), and also by the turn-level unit of analysis in Traum's model (see Section 2). The set of grounding labels, shown in Table 2, is designed to be applicable to both human-human and human-robot corpora. The annotation guidelines and the annotated data are publicly available 1 . In our annotation model, at any time, one speaker has the presenter role and the other is the acceptor. The roles are associated with a set of grounding actions, which characterize individual dialogue turns. Only the presenter's turns can be labeled as presentation 2 . Labels such as uptake, answer, and backchannel 3 typically indicate shorter signals to confirm understanding, and occur in turns by the acceptor. Two labels can occur with both presenters and acceptors: probe and closure. Each turn is labeled with one or sometimes two grounding labels.
We manually annotated each dialogue turn in the peer-mentoring corpus with one or two grounding labels as well as the identity of the current presenter. This annotation was performed by a single annotator. The counts of each grounding label for teachers and for learners are shown in Table 3. We note that presentation is the most frequent label for teachers, while backchannel is the most frequent label for learners.

Label          Teacher turns   Learner turns
presentation            2475             999
probe                    517             507
backchannel              957            1793
uptake                   356             701
answer                   125             357
repetition                12              26
paraphrase                 7              16
closure                  205             214
TOTAL                   4654            4613

Table 3: Grounding label counts for teacher turns and learner turns in the human-human peer-mentoring corpus.
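The per-role counting behind Table 3 is straightforward to express in code. The sketch below assumes a simple turn representation of (speaker, grounding labels); the toy dialogue is illustrative, not drawn from the corpus.

```python
from collections import Counter

# Hypothetical turn records: (speaker, grounding_labels) per dialogue turn.
# Labels follow the annotation scheme in Table 2; a turn may carry one or
# two labels. This toy dialogue is illustrative only.
dialogue = [
    ("teacher", ["presentation"]),
    ("learner", ["backchannel"]),
    ("teacher", ["presentation"]),
    ("learner", ["uptake", "presentation"]),
    ("teacher", ["backchannel"]),
    ("learner", ["closure"]),
]

def label_counts(turns):
    """Count grounding labels separately for each speaker role (cf. Table 3)."""
    counts = {"teacher": Counter(), "learner": Counter()}
    for speaker, labels in turns:
        counts[speaker].update(labels)
    return counts

counts = label_counts(dialogue)
print(counts["teacher"]["presentation"])  # 2
print(counts["learner"]["backchannel"])   # 1
```

Because a turn can carry two labels, the per-role totals count labels rather than turns, as in Table 3.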

Peer-Mentoring Dialogue Analysis
To support our goal of designing effective turn-taking strategies for a teachable robot, we use the corpus of human-human peer-mentoring dialogues to answer two questions: (1) how do humans encourage their partners to present? and (2) how do humans signal that they are going to shift the floor towards themselves? To frame the decision of whether to focus on teacher strategies, learner strategies, or both, we begin by examining initiative patterns in the corpus.

Initiative and presentation
Expecting that perceived initiative is closely related to the number of presentation turns, we label each dialogue in the peer-mentoring corpus with a perceived initiative score from 1 to 5 (1=high learner-initiative; 5=high teacher-initiative). We compare the initiative ratings with the count of each speaker's presentation turns as a proportion of their total turns in the dialogue. This is shown in Figure 1. For learners, the proportion of presentation turns is highest when they are perceived to have high initiative. However, teachers present for roughly the same proportion of turns regardless of initiative label. This analysis suggests that learners might assume greater initiative if they are encouraged to present.

Encouraging partner to present
To analyze how one speaker encourages their partner to present, we consider two cases: (a) when the partner does not currently have the floor, and (b) when the partner does currently have the floor.
To understand how human mentors and mentees encourage their partners to present when that partner does not hold the floor (i.e., to take the floor), we identify all turns with a presentation label that are at the start of a floor shift. A floor shift occurs when a presentation turn shifts the presenter role from one speaker to the other. We examine what the partner's grounding label was in the preceding turn. In other words, if Speaker B has taken the floor by beginning a presentation, what was Speaker A's last grounding action? An annotated example exchange is shown below.
A: But don't put it off because it's a big project (presentation)
B: I can tell cause it's broken down into two parts (uptake/presentation)
A: Mh-mmm (backchannel)

We find that when a speaker takes the floor, their partner is most frequently presenting in the preceding turn: 0.554 of the time for teachers and 0.618 for learners. The next most frequent grounding label is probe (see all values in Table 4, section (a)).
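The floor-shift tally described above can be sketched in a few lines of Python. The turn format (speaker, labels, annotated presenter) and the toy data are illustrative assumptions, not the corpus tooling.

```python
from collections import Counter

# Assumed turn format: (speaker, grounding_labels, current_presenter),
# where current_presenter is the annotated holder of the presenter role.
turns = [
    ("A", ["presentation"], "A"),
    ("B", ["uptake", "presentation"], "B"),  # floor shift: A -> B
    ("A", ["backchannel"], "B"),
    ("B", ["presentation"], "B"),            # no shift: B keeps the floor
]

def preceding_labels_at_floor_shifts(turns):
    """For each presentation turn that shifts the presenter role,
    tally the partner's grounding label(s) in the preceding turn."""
    preceding = Counter()
    for i in range(1, len(turns)):
        _, labels, presenter = turns[i]
        _, prev_labels, prev_presenter = turns[i - 1]
        if "presentation" in labels and presenter != prev_presenter:
            preceding.update(prev_labels)
    return preceding

print(preceding_labels_at_floor_shifts(turns))  # Counter({'presentation': 1})
```

Normalizing these tallies by the number of floor shifts yields proportions like the 0.554 and 0.618 reported above.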
To understand how human mentors and mentees encourage their partners to present when that partner already has the floor (i.e., to continue presenting), we identify all turns with a presentation label that are not at the start of a floor shift. We examine what the partner's grounding label was in the preceding turn. In other words, if Speaker B already has the floor and then has a presentation turn, what is Speaker A doing before B's presentation that encourages B to continue to present? An annotated example exchange is shown below.
B: It'll be the same problems (presentation)
A: Mh-mmm (backchannel)
B: So you should prepare in the same way you did last semester (presentation)

When there is no floor shift, we find, unsurprisingly, that the most frequent grounding label preceding presentation turns is a backchannel: 0.542 of the turns for teachers and 0.604 of the turns for learners. The next most frequent labels are uptakes and probes (see all values in Table 4, section (b)).
These data suggest that a robot should consider presenting or probing to encourage a partner who does not have the floor to present, and should consider backchannels to encourage a partner who already has the floor to continue presenting. We note, however, that these proportions are influenced by the overall frequency of each label. After considering next-turn probabilities conditioned on the preceding labels, we expect that probes might be more effective than presentations at encouraging a partner to take the floor.
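The conditioning step mentioned above amounts to estimating, for each label, how often the partner's next turn is a presentation. This is an illustrative sketch over a toy turn sequence, not the analysis code used for the corpus.

```python
from collections import Counter

def presentation_rate_by_preceding_label(turns):
    """Estimate P(partner presents next | preceding label).
    `turns` is a list of (speaker, labels) pairs; format is assumed."""
    used = Counter()      # occurrences of each label with a following turn
    followed = Counter()  # how often the next turn contains a presentation
    for (_, labels), (_, next_labels) in zip(turns, turns[1:]):
        for lab in labels:
            used[lab] += 1
            if "presentation" in next_labels:
                followed[lab] += 1
    return {lab: followed[lab] / used[lab] for lab in used}

# Toy sequence: probes and backchannels are each followed by a presentation.
turns = [
    ("A", ["probe"]),
    ("B", ["presentation"]),
    ("A", ["backchannel"]),
    ("B", ["presentation"]),
    ("A", ["closure"]),
    ("B", ["closure"]),
]
rates = presentation_rate_by_preceding_label(turns)
print(rates["probe"])    # 1.0
print(rates["closure"])  # 0.0
```

Comparing these conditional rates across labels, rather than raw co-occurrence counts, controls for how common each label is overall.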

Signaling taking the floor
To understand how human mentors and mentees naturally take the floor and become the presenter, we look at the grounding labels of dialogue turns at shifts in the conversational floor. All floor shifts begin with a presentation turn; most also have a second grounding label. If there is no accompanying grounding label, we report the grounding label of the speaker's previous turn.
We find that when a speaker takes the floor, the grounding label most frequently accompanying the presentation label is uptake: 0.596 and 0.547 for teachers and learners, respectively. The next most frequent grounding labels are answer and probe (see all values in Table 4, section (c)). This suggests that a robot that wants to take the floor might consider an uptake, answer, or probe in conjunction with its presentation.

Comparison with Human-Robot Dialogue Interaction
To understand if the grounding strategies we observed in the human-human corpus are effective in human-robot interaction, we perform a preliminary empirical analysis using dialogue data from a teachable robot interaction experiment conducted in a Wizard-of-Oz (WOZ) style. Section 5.1 describes the dialogue data; Section 5.2 presents our empirical analysis.

Human-robot dialogue data
The human-robot dialogue data consists of transcripts from a teachable robot interaction experiment where the robot was operated by a human Wizard. In this WOZ experiment, human students interacted in a learning-by-teaching context (Ploetzner et al., 1999) with Nico, a social, teachable, NAO robot. The human participants were peer teachers while Nico behaved as a peer learner, working to solve mathematics word problems. The human-robot corpus includes dialogue transcripts from twenty college-age participants who each engaged in four problem-solving dialogues with Nico in the WOZ experiment (Chaffey et al., 2018). Table 5 summarizes the dialogue durations and turn lengths in this human-robot dialogue corpus.
The WOZ experiment aided in the development of an autonomous version of the teachable robot aimed at middle-school students (Lubold et al., 2018a,b).

WOZ experiment overview. Participants were told that their goal was to help Nico solve a set of mathematics problems. Prior to the interaction, they received worked-out problem solutions. During the interaction, a tablet user interface displayed the problem, highlighting one step at a time. Nico, controlled by the Wizard, took initiative in leading the dialogue, asking for help with how to approach the problem sub-parts (e.g., "How do I figure out how much paint to mix?"). Participants responded by explaining their reasoning (e.g., "We want to have six cans of green paint so we mix three cans of yellow paint and three cans of blue paint because..."). Nico's actions included text-to-speech output, gestures such as scratching its head, and updates to values in the tablet interface. Figure 2 shows a student teaching Nico.

Wizard behavior. A human Wizard operated Nico behind the scenes, selecting dialogue responses and corresponding gesture movements from a pre-defined set. If necessary, they had the ability to input additional phrases. If the participant did not explain their reasoning, the Wizard prompted them to try again (e.g., "Could you explain that better?"). The Wizard was not instructed to model specific grounding behaviors.

Empirical analysis
We analyze the human-robot dialogue transcripts asking the same questions as in Section 4, but from the robot perspective: (1) how does the robot encourage the human to present, and (2) how does the robot signal that it is taking the floor?

Encouraging partner to present
Based on our analysis of the human-human dialogues, we hypothesize that effective strategies for a robot to use when encouraging its partner to present, e.g., to elaborate or to explain, are: presentation and probe if the partner does not have the floor, and backchannel if the partner already has the floor.
To evaluate the extent to which the human-robot dialogues reflect these strategies, we identify the following robot dialogue phrases (fixed phrases or templates, available to the Wizard):
• presentation: "Okay, we [perform math operation] 4 ", "So now we [perform math operation]?"
• probe: "How did we get that number?", "What do we do next?", "Could you give me a hint?"
• backchannel: "Okay"

For each grounding category (presentation, probe, and backchannel) we manually annotate 50 dialogue exchanges surrounding the queried phrases. Each exchange is five turns in length. We label each turn in the exchange with one or more grounding labels, as we did for the human-human corpus. For presentations and probes, the dialogue exchanges are in contexts where the human partner does not have the floor in the preceding turn. Two examples are shown in Appendix A. We test whether presentations and probes result in the human partner taking the floor. For backchannels, the dialogue exchanges are in contexts where the human partner has the floor in the preceding turn. We test whether backchannels result in the human partner keeping the floor.
Following presentations, 36% of the exchanges had a presentation in the human's first turn after the robot presentation. Following probes, 74% of the exchanges had a presentation in the human's first turn after the robot probe. Following backchannels, 68% of the exchanges had a presentation in the human's first turn after the robot backchannel. Table 6 summarizes these results, reports turn lengths, and reports the occurrence of presentations in the subsequent turn (if the first turn was not a presentation). Not only are probes more effective than presentations at getting the human to present; the subsequent human presentation turns are also longer.

Table 6: Success in encouraging the human to present in the first turn or second turn following robot presentations, probes, and backchannels; median human turn lengths for presentations.
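The success metric used above can be made concrete as the fraction of annotated exchanges whose first human turn carries a presentation label. The exchange format and data below are illustrative assumptions, not the annotated corpus.

```python
# Each exchange is the sequence of the human's turns after a queried robot
# phrase, with each turn given as a list of grounding labels.
def success_rate(exchanges):
    """Fraction of exchanges whose first human turn is a presentation."""
    hits = sum(1 for ex in exchanges if ex and "presentation" in ex[0])
    return hits / len(exchanges)

# Hypothetical exchanges following a robot probe.
after_probe = [
    [["presentation"], ["backchannel"]],
    [["presentation"], ["presentation"]],
    [["answer"], ["presentation"]],  # presents only in the second turn
    [["presentation"], ["closure"]],
]
print(success_rate(after_probe))  # 0.75
```

A second-turn variant of the same function (checking ex[1] when the first turn is not a presentation) would produce the subsequent-turn column of Table 6.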

Taking the floor
Based on our analysis of the human-human dialogues, we hypothesize that effective strategies for a robot to use when taking the floor from their partner are: uptake, answer, and probe.
To evaluate the extent to which the human-robot dialogues utilize these grounding acts, we identify four dialogue phrases that the robot uses to take the floor and steer the conversation. The first two phrases are navigation instructions, labelled as uptakes; in these, the robot takes the floor to explicitly steer the conversation towards the next problem step. We did not find any suitable robot phrases at floor shifts that we considered to be answers. The other two phrases are questions about the partner's attitudes towards the material; these are labelled as probes, and serve to indirectly steer the conversation away from the previous topic. The dialogue phrases are as follows:
• uptake: "Please tap the 'next' button for me so we can move on to the next step", "Please press the 'back' button"
• probe: "Do you like math?", "Have you done problems like this before?"

We manually annotate 45 dialogue exchanges surrounding each of the queried categories. As above, we label each turn in the exchange with one or more grounding labels. Two examples are shown in Appendix A.
We find that navigation instruction uptakes succeed in taking the floor immediately in 97.8% of the exchanges. For the probes about attitudes towards math, we evaluate their success in shifting the floor by reporting how long the partner continues answering the question that the robot posed, and how verbose those answers are (see Table 7). We find that in 35% of the exchanges, partners continue to answer the question for only one turn; in 60% of the exchanges they stay on-topic for two turns. The average length of these turns is 5.5 and 8.0, respectively.

Discussion
In the human-human peer-mentoring dialogue corpus, we find that human speakers encourage partners to take the floor most frequently via presentations or probes. In the human-robot dialogue corpus, we find that probes are more successful than presentations in getting partners to take the floor and also result in longer turn lengths. We note that our analysis is limited by the set of robot phrases queried. To more accurately assess the success of probes versus presentations in human-robot dialogue, we would need to annotate all instances of these two grounding actions in the corpus.
Speakers in the peer-mentoring dialogue corpus encourage partners to keep the floor most frequently by backchanneling. Providing a simple acknowledgement of the partner's signal therefore appears to be an effective way to ensure that they continue to present. In the human-robot dialogue corpus, we find that backchannels are successful in encouraging a partner to hold the floor: partners present within the next two turns 88% of the time. However, the robot's backchannels account for only 8.9% of its turns in a conversation on average, whereas learners in human-human conversations backchannel in 40.8% of their turns. By incorporating more backchannels into the robot's dialogues (see Kawahara et al. (2016)), we could encourage presentations more often, and also make the robot's dialogue more similar to that of human learners. Backchannels could also take non-verbal form, such as nodding. However, we should be cautious about using backchannels too liberally when they do not reflect true understanding, since doing so could break down trust between robot and human.
In the human-human corpus, we find that speakers use uptakes, answers, and probes as signals that they are taking the floor, with uptakes the most frequent grounding label in this regard. This reinforces the idea that taking the floor requires initiative: the speaker must produce a relevant turn without being explicitly prompted for it.
In the human-robot dialogue corpus, we find that uptakes in the form of instructions to the human partner are successful in shifting the floor. Due to the nature of the human-robot dialogue, we could not find instances of the robot using answers at floor shifts. Instead, the robot used probes to take the conversation floor. These are less successful than instructions in immediately shifting the floor, but this may be due to the unexpectedness of these questions; participants may have been caught off guard.
To achieve more human-like collaborative dialogue, we suggest that teachable robots consider using the following turn-taking strategies:
• When human partners are not taking initiative, probe partners to encourage them to talk more and take the floor.
• Backchannel more frequently while human partners are presenting to encourage partners to talk more and to better articulate their thoughts and explanations.
• Use uptakes, answers, and probes to take the floor. These can be useful when the conversation has gotten off-course and the robot wants to steer it to a different topic.

Conclusion
To inform turn-taking strategies for teachable robots, we annotate and analyze grounding patterns in a corpus of human-human peer-mentoring dialogues and a corpus of human-robot dialogues (Wizard-controlled). In the human-human dialogues, we identify grounding actions that may encourage dialogue partners to take initiative in teaching, while steering the conversation naturally. We find that some of these grounding actions are present in the corpus of human-robot dialogues, but that others are absent, or present to a lesser degree. This suggests future research to investigate whether student outcomes might improve if robot interactions could be designed to encourage more human-like collaborative dialogue.