Modelling Adaptive Presentations in Human-Robot Interaction using Behaviour Trees

In dialogue, speakers continuously adapt their speech to accommodate the listener, based on the feedback they receive. In this paper, we explore the modelling of such behaviours in the context of a robot presenting a painting. A Behaviour Tree is used to organise the behaviour on different levels, and allow the robot to adapt its behaviour in real-time; the tree organises engagement, joint attention, turn-taking, feedback and incremental speech processing. An initial implementation of the model is presented, and the system is evaluated in a user study, where the adaptive robot presenter is compared to a non-adaptive version. The adaptive version is found to be more engaging by the users, although no effects are found on the retention of the presented material.


Introduction
Speakers in dialogue cannot just assume that their speech is received by the addressee and understood as intended. They have to continuously monitor the addressee to verify that the information is attended to, perceived, understood and accepted (Clark, 1996). By keeping close track of verbal and non-verbal feedback from the addressee, speakers can alter their presentation in order to accommodate the listener.
In this paper, we explore how this process can be modelled in spoken human-robot interaction. As a test-bed, we have designed a scenario where a robot is presenting visual information (such as a poster or a piece of art) to a human, as seen in Figure 1. This setting allows us to explore how the presentation can be adapted to the audience's level of attention, understanding and engagement.
Modelling adaptive presentation in a human-robot interaction scenario is non-trivial, as the robot needs to pick up feedback from different modalities and continuously adapt its behaviour to accommodate the listener. It is also not obvious that such a system would be better in terms of teaching the presented material and user experience, compared to a fixed, non-adaptive presentation (such as audio-guides used in museums), as the robot is unlikely to exhibit the same level of adaptation as a human. This paper has two main contributions, which address these concerns. First, we explore the use of Behaviour Trees (Colledanchise and Ögren, 2018) for modelling the adaptive behaviour. Behaviour Trees, a specific formalism for decomposing a plan into a tree structure, have been applied extensively to video games and robotics (Hasegawa et al., 2017; Hu et al., 2015), and systems that break down an interaction or a dialogue into a tree are not new (Smith and Hipp, 1994; Boye, 2007; Bohus and Rudnicky, 2009). However, we are not aware of any previous attempts at applying Behaviour Trees specifically to real-time modelling of spoken interaction. Second, we present an experiment where we compare the adaptive robot presenter to a version where the presentation is statically executed, i.e., where the user's reactions are not taken into account.

Background
The scenario of a robot presenting information to an audience (one or several people) has been explored in earlier work (Jensen et al., 2005; Szafir and Mutlu, 2012; Ohya et al., 2006). However, these works have not focused on how the presentation can be adapted based on verbal and non-verbal feedback. Poster presentations between humans have been studied in order to analyse the gaze and backchannel behaviours of participants and presenters (Kawahara, 2012). Hashimoto et al. (2011) and Verner et al. (2016) have shown that more interactive robot teachers lead to better results in learning. Yousuf et al. (2012) and Eichner et al. (2007) show that users prefer presenting agents that adapt their grounding behaviour to their audience.

Grounding and Adaptation
According to Clark (1996) and Allwood et al. (1992), any coordinated action can be described as an action ladder, with each level requiring the cooperation of speaker and addressee. If the speaker A is presenting to the addressee B, then the levels of the action ladder, bottom-to-top, are attention (B must be paying attention to A's presentation), hearing (B must hear the words said by A), understanding (B must understand the meaning behind the words said by A) and acceptance (B must accept, and optionally be interested in, the concept proposed by A's presentation).
The addressee can give positive and negative evidence of each level (feedback), to signal completeness to the speaker. If negative evidence is signalled for a level, all levels above it have failed by extension. If positive evidence is signalled for a level, all levels below it have succeeded by extension. Feedback signals like these can then be used by the speaker to adapt the presentation (by explaining some information in more depth or by making the presentation more interesting) and thereby accommodate the listener. This process is referred to as grounding by Clark (1996). It is not possible to give positive evidence in response to every piece of a conversation; what matters is receiving enough evidence to meet the grounding criterion, where the amount of evidence required depends on how important the speakers deem the content of the presentation to be.
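The propagation of evidence along the action ladder can be illustrated with a small sketch. The names `Level` and `propagate` are hypothetical and chosen for illustration only; they are not taken from the implemented system, and the sketch only encodes the two propagation rules stated above.

```python
from enum import IntEnum
from typing import Dict, Optional


class Level(IntEnum):
    """Levels of the action ladder, ordered bottom-to-top."""
    ATTENTION = 0
    HEARING = 1
    UNDERSTANDING = 2
    ACCEPTANCE = 3


def propagate(level: Level, positive: bool) -> Dict[Level, Optional[bool]]:
    """Infer the status of every level from one piece of evidence.

    True means the level has succeeded, False that it has failed,
    and None that this evidence says nothing about it.
    """
    status: Dict[Level, Optional[bool]] = {}
    for other in Level:
        if positive and other <= level:
            # Positive evidence implies success on this level and all below.
            status[other] = True
        elif not positive and other >= level:
            # Negative evidence implies failure on this level and all above.
            status[other] = False
        else:
            status[other] = None
    return status
```

For instance, positive evidence of understanding marks attention, hearing and understanding as succeeded while leaving acceptance undetermined.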

Behaviour Trees
A Behaviour Tree, or BT, is a tree structure that models a plan, initially proposed by Mateas and Stern (2002). Behaviour Trees have been used in video games (Isla, 2005, 2008; Hasegawa et al., 2017) and to model robot behaviours (Hu et al., 2015; Colledanchise et al., 2016). There is previous work applying BTs to virtual agents (Sun et al., 2012; Fujita et al., 2003), but to our knowledge, so far they have not been used to model conversational agents or social behaviour.
The leaves of the tree are the tasks that are executed. All non-leaves are control flow nodes. Execution flows from the root down the tree, starting when some external process ticks the root. Each node in the tree returns one of three values to its parent: SUCCESS or FAILURE if the task has finished with either result, or RUNNING if it has not finished.
The two most common control flow nodes are Sequence and Selector nodes. Sequence nodes run their children in order from left to right until a FAILURE or RUNNING is encountered, at which point the sequence returns that value. If all child nodes succeed, the sequence returns SUCCESS. Selector nodes run their children from left to right until a SUCCESS or RUNNING is encountered, returning that value, or FAILURE if all children fail (Colledanchise and Ögren, 2018).
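The semantics of the two control flow nodes can be made concrete with a minimal sketch. This is an illustration of the node definitions above, not the implementation used in our system; the class names `Leaf`, `Sequence` and `Selector` are chosen for illustration.

```python
from enum import Enum
from typing import Callable, List


class Status(Enum):
    SUCCESS = "SUCCESS"
    FAILURE = "FAILURE"
    RUNNING = "RUNNING"


class Leaf:
    """A task node: wraps a callable that reports its own status."""

    def __init__(self, action: Callable[[], Status]) -> None:
        self.action = action

    def tick(self) -> Status:
        return self.action()


class Sequence:
    """Runs children left to right until one does not succeed."""

    def __init__(self, children: List) -> None:
        self.children = children

    def tick(self) -> Status:
        for child in self.children:
            status = child.tick()
            if status is not Status.SUCCESS:
                return status  # FAILURE or RUNNING stops the sequence
        return Status.SUCCESS


class Selector:
    """Runs children left to right until one does not fail."""

    def __init__(self, children: List) -> None:
        self.children = children

    def tick(self) -> Status:
        for child in self.children:
            status = child.tick()
            if status is not Status.FAILURE:
                return status  # SUCCESS or RUNNING stops the selector
        return Status.FAILURE
```

Because both node types return as soon as a child reports RUNNING, children further to the right are not reached on that tick, which is what lets earlier sub-trees take priority over later ones.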

Modelling the presentation
In this paper, we propose a Behaviour Tree to model the complex task of poster presentation while taking grounding and adaptation into account. The tree breaks down this complex task into smaller, independent tasks. As Section 4 describes, our initial implementations of these individual tasks are greatly simplified, as many of them are indeed challenging research problems in their own right. However, the decomposition into the behaviour tree allows us to start with simpler initial implementations of the individual tasks (some of which can be controlled through Wizard of Oz), and then gradually replace them with more complex models (e.g., through machine learning), without changing the structure of the tree, or the implementation of other tasks.
The abstract BT is shown in Figure 2. Whereas most traditional dialogue systems process the interaction utterance-by-utterance, the BT allows the system to process the interaction incrementally, in real time (in the vein of Schlangen and Skantze, 2009). Thus, the tree is designed to be ticked about 10 times per second. The root represents the entire task of presenting a poster. The tree contains both a sub-tree for finding and recruiting participants and a sub-tree for presenting to them, and thus will never return SUCCESS; the presentation is either going on (RUNNING) or impossible (FAILURE). The deeper levels of the tree are discussed, top-to-bottom and left-to-right, below.
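A fixed-rate tick loop of the kind described above could look roughly as follows. This is a minimal sketch under stated assumptions: the function name `run_at_rate`, the `max_ticks` bound and the use of plain strings for the return values are all hypothetical and serve only to keep the example self-contained.

```python
import time
from typing import Callable


def run_at_rate(tick: Callable[[], str],
                rate_hz: float = 10.0,
                max_ticks: int = 50) -> str:
    """Tick the root at a fixed rate until it fails or max_ticks is reached.

    `tick` stands in for ticking the root of the tree and is assumed to
    return "RUNNING" or "FAILURE" (the root never returns SUCCESS).
    """
    period = 1.0 / rate_hz
    status = "RUNNING"
    next_tick = time.monotonic()
    for _ in range(max_ticks):
        status = tick()
        if status == "FAILURE":
            break  # the presentation has become impossible
        # Sleep until the next scheduled tick, so that slow ticks do not
        # accumulate drift over the course of the presentation.
        next_tick += period
        time.sleep(max(0.0, next_tick - time.monotonic()))
    return status
```

Scheduling against a monotonic clock rather than sleeping for a fixed period keeps the average rate close to the target even when individual ticks take a variable amount of time.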
Dynamic information is not kept in the static tree; instead, the tree depends on external modules to keep track of the joint action ladder (a knowledge manager) and of where the agent is in its presentation (an agenda). These components are not discussed here, as they are less general than the tree.
The system needs to find a user to whom to present, which happens in the Establish engagement sub-tree at the top of the tree. After this sub-tree has succeeded at inviting or engaging a user into the presentation, which can be a more or less complicated task (Bohus and Horvitz, 2009, 2014), the system delivers its presentation through its Interact with user sub-tree.
This sub-tree handles turn-taking by offering the turn to the addressee if appropriate, which can be done in multiple ways (Meena et al., 2014; Ström and Seneff, 2000). As the tree runs at its rate of 10 Hz, the user's utterance is processed incrementally, and the system can deploy backchannels and gaze cues in response (Morency et al., 2008).
If the user does not have the turn, the robot either has or takes the turn through its Robot's initiative sub-tree, and executes the presentation. First, joint attention is ensured, or regained if it has been lost (see Yu et al., 2015); loss of attention can be sensed in multiple ways (Ba and Odobez, 2009; Sheikhi, 2014; Szafir and Mutlu, 2012).
If the system has the user's attention, it ensures hearing, understanding, and acceptance, in order, according to the respective grounding criteria. Once these sub-trees have had their chance to change the presentation agenda to address negative evidence of hearing, understanding and acceptance (see Vaufreydaz et al., 2016; Aly and Tapus, 2015; Sidner et al., 2006; Skantze et al., 2014, for examples of how to measure these), the system then speaks from the agenda, driving the presentation forward. Only if the tree reaches this leaf without any previous leaf returning RUNNING

Implementation
We developed an initial implementation of a system containing the Behaviour Tree model proposed in Section 3 as an extension to the IrisTK dialogue framework (Skantze and Al Moubayed, 2012). The Furhat robot head (shown in Figure 1) served as the robot platform (Al Moubayed et al., 2012).
The agenda of the implemented system tracked entire lines of the presentation's script. To adapt the presentation, evidence of understanding was thus tracked on a line-by-line basis, and the system could explain a line for which understanding had not been shown, by finding other lines that explained the misunderstood line.
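The line-by-line adaptation can be sketched as a small data structure. The names `ScriptLine`, `Agenda`, `explained_by` and `optional` are hypothetical, as is the policy of moving explanations to the front of the queue; the sketch only illustrates the idea of explaining a misunderstood line with other lines of the script, not the actual IrisTK-based implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ScriptLine:
    line_id: str
    text: str
    # Ids of other lines that explain this one (assumed annotation).
    explained_by: List[str] = field(default_factory=list)
    # Optional lines are only spoken when needed as explanations.
    optional: bool = False


class Agenda:
    def __init__(self, lines: List[ScriptLine]) -> None:
        self.lines: Dict[str, ScriptLine] = {l.line_id: l for l in lines}
        self.queue: List[str] = [l.line_id for l in lines if not l.optional]
        self.spoken: List[str] = []

    def next_line(self) -> Optional[ScriptLine]:
        """Return the next line to speak, or None when the agenda is empty."""
        if not self.queue:
            return None
        line_id = self.queue.pop(0)
        self.spoken.append(line_id)
        return self.lines[line_id]

    def report_negative_understanding(self, line_id: str) -> None:
        """Schedule not-yet-spoken explanations of a misunderstood line next."""
        explanations = [e for e in self.lines[line_id].explained_by
                        if e not in self.spoken]
        rest = [q for q in self.queue if q not in explanations]
        self.queue = explanations + rest
```

With this structure, the leaf that speaks from the agenda only ever calls `next_line`, while the sub-tree that ensures understanding changes what that call will return.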
The system modelled attention by treating users as attentive if they were looking at the system or the poster, using their head pose (estimated via Kinect) as a proxy of gaze direction. Upon inattention, the system would restart its current utterance, similar to the stop-and-restart method employed by Yousuf et al. (2012). A Wizard of Oz setup was used to tag positive and negative evidence of hearing, attention and acceptance.
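One way to treat head pose as a proxy for gaze is to test whether the head direction points towards the robot or the poster within some angular tolerance. The sketch below is an assumption-laden illustration: the function names, the 3D coordinate representation and the 20-degree tolerance are hypothetical and are not values from the implemented system.

```python
import math
from typing import Iterable, Tuple

Vec3 = Tuple[float, float, float]


def angle_between(u: Vec3, v: Vec3) -> float:
    """Angle in degrees between two 3D vectors (180 if either has zero length)."""
    norm_u = math.sqrt(sum(c * c for c in u))
    norm_v = math.sqrt(sum(c * c for c in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 180.0
    cosine = sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def is_attentive(head_pos: Vec3,
                 head_dir: Vec3,
                 targets: Iterable[Vec3],
                 tolerance_deg: float = 20.0) -> bool:
    """True if the head points at any target (e.g. robot or poster)."""
    for target in targets:
        to_target = (target[0] - head_pos[0],
                     target[1] - head_pos[1],
                     target[2] - head_pos[2])
        if angle_between(head_dir, to_target) <= tolerance_deg:
            return True
    return False
```

A tolerance is needed because head pose only approximates gaze direction; how wide it should be would have to be tuned for the sensor and the physical setup.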

Experiment
To evaluate the system and tree, we set up an experiment where the system described in Section 4 had two modes. In the adaptive mode, the system fully used its adaptive behaviour. In the non-adaptive mode, the system always assumed positive feedback on all four levels of the joint action ladder. The non-adaptive system also never yielded the turn to the user. The non-adaptive mode presented the same surface-level five-minute presentation every time, so a five-minute time limit was also set for the adaptive mode, which would end its presentation after that time. The agent's gaze behaviour was the same in both modes, shifting between the participant's head and the poster.
We used a within-subject experimental design, where each subject interacted with the two versions of the system. Two posters with 16th-century paintings were created: Gentile Bellini's Miracle of the Cross fallen into the channel of Saint Lawrence (Croce, for short), and Great Tower of Babel, by Pieter Bruegel the Elder. The orders of the two paintings and modes were both counterbalanced between subjects.
Thirty subjects participated in the experiment, 16 male and 14 female. A majority of participants were undergraduate university students. Participants were not told about the differences between the adaptive and non-adaptive modes, other than that only the adaptive mode could answer questions. Participants were otherwise encouraged to give active feedback to the agent regardless of condition (even though the non-adaptive version would actually ignore this feedback).
Conditions were evaluated immediately following the end of the respective presentation. Firstly, in order to evaluate retention of the information presented, participants were given an electronic form where they answered questions about the presentation and painting. Secondly, they were asked to fill in adapted versions of the Godspeed questionnaire by Bartneck et al. (2009), and the Networked Minds social presence questionnaire by Biocca and Harms (2011). Participants were rewarded with a cinema ticket.

Results
The results of two participants had to be excluded due to technical problems during the experiment, yielding 28 data points (16 male, 12 female), of which 14 indicated that they had previous experience with a social robot, two indicated that they had seen the Croce painting before, and eight indicated they had seen the Babel painting before.
The Wilcoxon paired signed-rank test (Wilcoxon, 1945) was used to compare the answers given in the Social Presence and Godspeed forms. The questions were grouped by categories in each test, and the answers to them were averaged. This compensated for the large number of questions.
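The analysis steps above can be sketched in plain Python. The sketch averages answers per category and then computes the Wilcoxon signed-rank statistic W for paired samples; it does not compute a p-value, for which a statistics package would normally be used. The function names and the data layout are assumptions made for illustration, not a description of the analysis scripts actually used.

```python
from statistics import mean
from typing import Dict, List, Sequence


def category_scores(answers: Dict[str, float],
                    categories: Dict[str, List[str]]) -> Dict[str, float]:
    """Average one participant's answers within each question category."""
    return {name: mean(answers[q] for q in questions)
            for name, questions in categories.items()}


def signed_rank_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Wilcoxon signed-rank statistic W = min(W+, W-) for paired samples.

    Zero differences are dropped and tied absolute differences receive
    the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied absolute differences.
        while (j + 1 < len(order)
               and abs(diffs[order[j + 1]]) == abs(diffs[order[i]])):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)
```

Here `x` and `y` would hold one category's averaged score per participant in the adaptive and the non-adaptive mode, respectively, so that one test is run per category rather than per question.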
For the analysis of the retention questionnaire, one additional subject had to be excluded due to technical problems. Eleven questions per poster were graded on a scale from zero to eleven based on correctness, normalising to only count questions that were possible to answer based on the presentation the user received. The answers in the Babel questionnaire (M = 6.938, Mdn = 7.542, SD = 1.989) were found to have a statistically significantly different distribution (p = .04235) from those in the Croce questionnaire (M = 6.270, Mdn = 6.758, SD = 1.771), but no statistically significant differences were found when comparing the adaptive mode and the non-adaptive mode (p = .449), or the first and second presentation participants received (p = .990).
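One plausible reading of the normalisation is that each question receives a correctness grade between 0 and 1, unanswerable questions are left out, and the mean grade is rescaled to the zero-to-eleven range. The sketch below encodes that reading; the exact grading scheme is an assumption, and `retention_score` is a hypothetical name.

```python
from typing import Sequence


def retention_score(grades: Sequence[float],
                    answerable: Sequence[bool],
                    scale: float = 11.0) -> float:
    """Rescaled mean grade over the questions that could be answered.

    `grades` holds one correctness grade in [0, 1] per question, and
    `answerable` marks whether the presentation the participant received
    actually covered that question.
    """
    counted = [g for g, ok in zip(grades, answerable) if ok]
    if not counted:
        return 0.0  # nothing could be answered; avoid division by zero
    return scale * sum(counted) / len(counted)
```

Normalising in this way keeps the adaptive mode, which may skip or add material within the time limit, comparable with the fixed non-adaptive presentation.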

Discussion
The results from the Social Presence and Godspeed questionnaires showed that the adaptive version was perceived to have a higher Animacy, Anthropomorphism, Safety, Emotional contagion, and Behavioural interdependence. These are all aspects that relate to higher interactivity, and are all associated with positive values, which indicates that an interactive presenter that takes the user's attention and understanding into account is indeed perceived to be more engaging. When asked about the difference between the two versions after the experiment, the subjects typically had a hard time identifying the exact difference in terms of interactivity. This is interesting, as it indicates that they were not aware of the specific reason why they preferred the adaptive version. The gaze behaviour of the robot, which followed users around even in the otherwise non-adaptive mode, may have led to the perception that the system was paying attention to the user even in this mode.
There was an unexpected difference between the first and second presentation, where the robot received a somewhat higher Likeability rating in the former, regardless of painting and mode. One potential explanation for this is that users were aware of the format of the evaluation the second time, and might have been more stressed about it.
However, no statistically significant differences were found in the users' retention of the two presentations. There was a large variation in how much the individual subjects remembered from each presentation. Certain participants remembered almost nothing of either presentation. Others were able to quote the robot on every question in both the adaptive and non-adaptive modes. This introduces noise and makes the comparison hard to perform, given the relatively small number of participants.

Future work
Although the agent developed in our initial implementation does adapt its presentation based on feedback from the user, this adaptation was mostly done on a semantic level (i.e., updating its agenda). In future studies, we will explore how the system could also adapt factors like turn length, speech rate, the frequency with which the agent would require evidence of understanding, and what the system would consider as evidence of understanding.
Classifying negative and positive evidence based on multi-modal signals is indeed a very challenging task, as these cues could be very subtle (e.g., facial expressions of boredom or interest). In this experiment, this classification was done by a human Wizard of Oz. The data collected through this experiment could potentially be used to train specific models for this, as they have already been partially annotated by the Wizard.
A natural extension of the model is to also allow several users to take part in the presentation. This would give rise to new challenges when it comes to determining who should be considered to be engaged in the presentation, and how to adapt the presentation, since the different users in the audience might show evidence of understanding to various degrees. Also, if a new user appears in the middle of the presentation, it is not clear how to proceed with the agenda.

Conclusions
This paper presents a first step towards a system that uses Behaviour Trees to create an adaptive presentation agent. Initial results show that users find a system that attempts to adapt its presentation to their reception of the presentation more positive along several dimensions. Our initial implementation of the proposed Behaviour Tree model is a promising first step towards a complex adaptive behaviour model for conversational interaction, where the complex task of making an adaptive presentation has been decomposed into smaller tasks, which can gradually be replaced by more and more sophisticated models.

A Appendices
The Godspeed forms included the questions as found at http://www.bartneck.de/2008/03/11/thegodspeed-questionnaire-series/. The Social Presence forms included the questions as referenced (Biocca and Harms, 2011), but the following questions were removed:
• I often felt as if (my partner) and I were in the same (room) together.
• I think (my partner) often felt as if we were in the same room together.
• I often felt as if we were in different places rather than together in same (room)
• I think (my partner) often felt as if we were in different places rather than together in the same (room).

An example Social Presence question is shown above Table 1. Godspeed questions were presented identically (with the same seven-point scale), but the ends of the scale were instead the two adjectives or adjective phrases connected to the specific Godspeed question.
The full questionnaires cannot be presented here because of space issues. Table 1 shows the retention-based questions that were part of the electronic questionnaire.

Question (Babel) | Question (Croce) | Answer type
Have you interacted with a social robot like the one in this experiment before? | Yes/No
In what context have you interacted with a system like the one used in the experiment? | Text
Had you seen the painting before the presentation?
There are many examples of small details in the painting: give some examples. | Text
The artist had relatives who also became artists: who were they? | Text

Table 2: Visualisation of numbers given in Section 6.