Coached Conversational Preference Elicitation: A Case Study in Understanding Movie Preferences

Conversational recommendation has recently attracted significant attention. As systems must understand users’ preferences, training them has called for conversational corpora, typically derived from task-oriented conversations. We observe that such corpora often do not reflect how people naturally describe preferences. We present a new approach to obtaining user preferences in dialogue: Coached Conversational Preference Elicitation. It allows collection of natural yet structured conversational preferences. Studying the dialogues in one domain, we present a brief quantitative analysis of how people describe movie preferences at scale. Demonstrating the methodology, we release the CCPE-M dataset to the community with over 500 movie preference dialogues expressing over 10,000 preferences.


Introduction
Conversational information seeking has repeatedly been identified as a research direction of particular importance (Allan et al., 2012; Culpepper et al., 2018). From a practical perspective, it is a common task for personal digital assistants in many recommendation domains including movies, restaurants, and travel. However, today's systems are often limited in what they understand. We observe that in many cases, the actions allowed and the utterances understood reflect available metadata, such as movie genres or restaurant food categories, which may mirror uncertain assumptions of how users would choose to characterize their needs in an unconstrained setting. This can lead to conversational systems with unnatural or tedious dialog design.
Developing systems supporting natural interactions requires understanding how users would choose to express preferences to an idealized assistant. It has been noted that a lack of suitable conversational datasets limits such research (Joho et al., 2018). Thus we ask: What properties matter most to users? How do real people describe their preferences when encouraged to do so naturally in a conversational setting?
We present a new robust approach for eliciting preferences, producing natural language that a conversational recommender should interpret, represent internally, and use in determining items to recommend. The semantic structure observed also provides new insights into how results could be described to users, to mirror their terminology.
We use a Wizard-of-Oz approach (WoZ): A human agent plays a digital assistant, and users are played by crowd-sourced workers. The human agent is given instructions specifically designed to elicit preferences, while keeping the conversation natural. We particularly focus on avoiding biases in prior approaches, yielding new insights into natural language processing challenges. Crucially, we argue that the focus should be preference elicitation, rather than standard task completion.
Although our approach is domain independent, we validate on movie preference elicitation, as it has received most past attention (Ricci et al., 2015). In particular, movies have high-quality metadata available (actors, directors, production dates, etc.), which is often used. We are able to ask which of these properties are actually normally mentioned by people, finding significant differences: Canonical attributes such as genre, leading actors and directors, paint an incomplete picture. Real users more often refer to less tangible and highly subjective aspects, like the plot style or attributes like violence. We argue that conversational recommender systems should take this into account when representing knowledge.
Our key contributions are three-fold. First, we present a new method for obtaining realistic conversational recommendation dialogues, addressing previous challenges in quantitative analysis of recommendation needs. Second, we release a dialog corpus that allows natural language understanding systems to assess how well they interpret user utterances in a conversational context, and that encourages such systems to mirror natural dialogue more closely. Finally, we present a brief analysis of user preferences in the movie domain.
Related Work

Dialog Systems
Dialog systems are generally classified as goal-driven or non-goal-driven. The latter, commonly chatbots, mimic human responses in open domain dialogues, often powered by neural networks trained end-to-end on large corpora (Sordoni et al., 2015; Serban et al., 2016). Goal-driven (a.k.a. task-oriented) systems aim to assist users with specific tasks (e.g., select products). The architecture typically consists of natural language understanding, state tracking, dialogue policy, and language generation, each often implemented and optimized individually (Young et al., 2013). There is a growing interest in end-to-end trainable task-oriented systems, yet most are restricted to narrow domains (Serban et al., 2018).
Commercial systems, like Google Assistant and Apple's SIRI, combine chat and task focus, supporting a hybrid of multi-domain task-oriented and open-domain chat. Yet user interaction is often relatively unnatural (Luger and Sellen, 2016). Combining task-based and chat modes of operation attracts active research (Akasaki and Kaji, 2017; Yan et al., 2017).

Conversational Recommendation
We focus on conversational recommendation, combining elements of chat, goal-oriented dialog, and question answering (Dodge et al., 2016; Li et al., 2018). Within the movie domain, a large body of prior work on models, test collections, and evaluation methodology exists (Ricci et al., 2015). Early work includes human-human movie recommendation, such as Johansson (2004), who focused on characterizing dialogue structure. Dodge et al. (2016) develop a synthetic dataset with the purpose of training end-to-end neural dialog systems. Their Movie dataset combines question answering, recommendation, and general dialog. It is generated using a fixed set of simple templates, and mining a Reddit online forum.
Closest to our work is the REDIAL dataset (Li et al., 2018), containing human-to-human conversations about movies. Similar to our work, the dialogues are conducted on a crowdsourcing platform, where one participant is seeking recommendations which the other party provides. However, their main focus is on algorithmic aspects, and the conversations are driven by the explicit goal of making recommendations. As such, workers are required to mention at least four specific movies in each conversation. Our interest is more broadly targeted to understand how people naturally express preferences in a conversational setting.
Other relevant conversational recommendation work includes Sun and Zhang (2018), who capture long term user preferences in a deep reinforcement learning framework by asking the user for information about particular facets.

Data Collection Approaches
Conversational recommendation system training data can be obtained in many ways. Serban et al. (2018) provide a comprehensive overview; here we summarize the most relevant past approaches. Implicit observations use logs from an existing system, e.g., for travel booking (Bennett and Rudnicky, 2002). It may be that the system is operated by humans (Hemphill et al., 1990). Such analysis is necessarily biased by current system policy, which drives user (re)actions. Past failures also influence logs, as they can create frustration (Kiseleva et al., 2016) after which users may avoid similar interactions.
Explicit preference observations are most commonly based on web review mining or mining online forums (Li et al., 2010; Dodge et al., 2016). Both suffer from population biases. More importantly, neither type of corpus necessarily represents what preferences would be expressed in a direct interaction with an intelligent assistant, nor how they would be stated.
Unstructured user studies produce richer yet smaller datasets. Participants express a need, which they refine through unstructured dialog. The objective is usually to characterize interaction behavior (Johansson, 2004; Trippas et al., 2017) and to understand users' attitude and expectations towards an automatic agent (Vtyurina et al., 2017).
Task-based user studies commonly create collections using WoZ methodology (Li et al., 2018). A participant engages in conversation for some task (e.g. schedule a bus ride). A wizard acts as intermediary to an existing non-conversational system. This frees dialog state tracking and conversation understanding from current practical limitations. Yet the conversations intend to solve tasks that discourage natural information flow (Serban et al., 2018). Moreover, the Wizard interacts with an existing system, which often strongly biases them toward the existing interface and its terminology.

Coached Wizard-of-Oz User Studies
As we have seen, most dialogues backed by real systems are biased by that existing system. These systems, in turn, are often biased by the metadata available rather than natural user preferences. For instance, if a Wizard is presented with an existing categorization of possible answers, it is normal for them to ask the user to select among these.
Meanwhile, we aim to understand desirable qualities of future conversational search and recommendation systems, and to understand natural user preferences. We ask which properties users express preferences about, and also in what way. Our methodology is thus closer to coaching the user, through questions that avoid suggesting particular terminology or answers. Rather, open-ended questions are used to obtain preferences, request examples, and ask what aspect of the expressed preferences or examples the system should pay attention to. By using a WoZ approach, with human operators simulating the system (whom we refer to as Wizards), we similarly allow for human-level natural language understanding. This renders linguistically rich utterances. We also design for "users" (whom we refer to as Requesters) to have an experience as consistent as possible with interacting with a fully automated digital assistant.
To make this concrete, we introduce our validation setting: movie preference elicitation. In each conversation, the Wizard was instructed to elicit the Requester's preferences following a general script, while keeping the exchanges as natural as possible. While the full instructions are presented in the Appendix, at a high level these are to:
1. Ask what sort of movies the Requester likes.
2. Ask for an example of a liked movie.
3. Ask what in particular was appealing.
4. Ask for an example of a disliked movie.
5. Ask what in particular was not appealing.
6. Select example movies, and for each:
(a) Ask if the user has heard of / seen it.
(b) If so, ask for similar preferences.
Importantly, the flow is permitted to evolve naturally and may be adapted to the Requester. Compared to existing corpora, the dialogues collected are not slot-filling, nor do they resemble "20 questions" with repetitive yes/no questions. They also differ from past unstructured dialogues, having clear preference structure. This makes our CCPE method unique in providing rich yet tractable conversational exchanges.

Methodology
The Wizard was provided the written dialog flow template, and given occasional feedback on their conversations. Unique to our setup among WoZ systems, the Wizard typed their input, which was played to the Requester using text-to-speech consistent with that used by a commercial digital assistant. Thus, from the perspective of the Requester, the system resembled today's speech-based digital assistants as closely as possible, aiming to preserve the distinctive nature of spoken dialogue (Chafe and Tannen, 1987).
The Requesters were paid crowd workers on a crowdsourcing platform, invited to talk about their movie preferences. They were informed that an assistant would guide them with questions. They spoke using a microphone, with the audio played directly to the Wizard.
To collect the corpus, each Wizard had a succession of conversations, matched to a sequence of Requesters. After each conversation, the Requester's audio was transcribed by a separate crowd worker, then combined with the known typed text of the Wizard. An example partial dialog is provided in the Appendix.
Elements that are not relevant to preference understanding were removed from the transcribed conversations. These include pleasantries, confirmation of the Requester's task, resolution of technical issues or task interruptions. On the other hand, the transcribed speech was kept as uttered, including filler words, disfluencies and discourse markers. Conversations that ended prematurely were kept (where of non-trivial length). While relatively rare, conversations where the Requester only gave single-word answers were removed as they only provided minimal insight into natural recommendation dialog. Finally, all utterances were annotated, as described below.
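As an illustration only, the conversation-level filtering described above could be sketched as follows. The `Turn` representation, the speaker labels, and the `min_turns` threshold for "non-trivial length" are assumptions made for this sketch and are not specified by our pipeline description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # "ASST" (Wizard) or "USER" (Requester); labels are assumed
    text: str


def is_single_word(text: str) -> bool:
    """True if the utterance has at most one whitespace-separated token."""
    return len(text.split()) <= 1


def keep_conversation(turns: List[Turn], min_turns: int = 4) -> bool:
    """Decide whether a transcribed conversation stays in the corpus.

    min_turns is a hypothetical stand-in for "non-trivial length".
    """
    if len(turns) < min_turns:
        return False
    user_turns = [t for t in turns if t.speaker == "USER"]
    if not user_turns:
        return False
    # Drop conversations where the Requester only gave single-word answers.
    if all(is_single_word(t.text) for t in user_turns):
        return False
    return True
```

Note that the sketch leaves the utterance text untouched, in line with keeping filler words and disfluencies as uttered.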

Methodological Notes
We briefly discuss three common challenges seen.
(1) Audio failures occurred at times, where one of the Wizard and Requester could not correctly hear the other. Other times, there was also out-of-context background communication.
(2) Some Requesters had poor engagement, with very short answers. While the Wizard attempted to elicit richer answers, this did not always succeed. We hypothesize that some crowd workers acted lazily, although perhaps some also did not have particular preferences to express.
(3) Undesirable prompting by the Wizard saw some Requesters prompted for specific properties. Other times, the Wizard interjected their own preferences. While this biased the Requester, it is also natural and sometimes led to richer exchanges. We therefore allowed it, but attempted to filter it in our analysis by associating each named item or attribute with the first speaker who mentioned it. We are thus able to differentiate prompted and unprompted terminology.
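The first-speaker attribution used to separate prompted from unprompted terminology can be sketched as below. This is a minimal illustration: the `(turn_index, speaker, item)` tuples, the speaker labels, and the lowercase normalization are assumptions of the sketch, not a description of our exact analysis code.

```python
from typing import Dict, Iterable, Set, Tuple

Mention = Tuple[int, str, str]  # (turn_index, speaker, item_text); assumed layout


def first_mentioner(mentions: Iterable[Mention]) -> Dict[str, str]:
    """Map each normalized item to the speaker who mentioned it first."""
    first: Dict[str, str] = {}
    for _, speaker, item in sorted(mentions, key=lambda m: m[0]):
        key = item.strip().lower()  # simple normalization, an assumption
        first.setdefault(key, speaker)  # keep only the earliest speaker
    return first


def unprompted_items(mentions: Iterable[Mention], user: str = "USER") -> Set[str]:
    """Items whose first mention came from the Requester."""
    return {item for item, spk in first_mentioner(mentions).items() if spk == user}
```

For example, if the Wizard names "Arrival" and the Requester later repeats it, the item counts as prompted, whereas an attribute such as "narrative" first raised by the Requester counts as unprompted.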

Semantic Annotation
Our key contribution is a methodology for preference elicitation. To better allow characterizing how users naturally express preferences in the example movie setting, we also annotated the dialogues by identifying preference statements.
As developing robust annotation guidelines that yield consistent labels is known to be complex, annotation was performed by the authors of this paper. In particular, we sub-sampled 510 of the dialogues collected to annotate. These have a median of 22 turns and median duration of 3 minutes and 36 seconds. During annotation, 8 conversations were identified as of too poor quality, yielding a final set of 502 conversations. The conversations consist of 11,972 utterances and were annotated with 15,646 annotations.

Annotation Ontology
In the corpus, we first annotate Anchor items: names of movies or series, genres or categories, people, and other entities. These provide the anchor points for preferences, i.e., what is being talked about.
Preferences by a Requester or Wizard were also annotated. These were partitioned by what the preference was about (matching the anchor items), and by the information conveyed, in three categories:
Preference statements about an anchor item indicate that the person does or does not like the relevant item, or some aspect of it. They most closely match the popular meaning of a preference.
Descriptions of an anchor item consist of neutral information about an anchor item. Bringing attention to specific parts of a movie (for instance), they tell us what this person finds as key characteristics.
Other statements about an anchor item convey relevant information but do not provide an explicit sentiment, such as "I haven't seen that." While not telling us if the user likes or dislikes the movie, these convey relevant information for a recommendation system.
In summary, the annotations identify statements that a conversational recommender should be able to interpret. See Appendix for an example.
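One way to picture the two-dimensional ontology is as a span label with an anchor type and a statement type. The sketch below is illustrative only: the class names, enum values, and character-offset representation are hypothetical and are not the schema of the released CCPE-M files.

```python
from dataclasses import dataclass
from enum import Enum


class AnchorType(Enum):
    """What the annotation is about (hypothetical names)."""
    MOVIE_OR_SERIES = "movie_or_series"
    GENRE_OR_CATEGORY = "genre_or_category"
    PERSON = "person"
    OTHER_ENTITY = "other_entity"


class StatementType(Enum):
    """What information the annotated span conveys (hypothetical names)."""
    ANCHOR = "anchor"            # the name of the item itself
    PREFERENCE = "preference"    # likes or dislikes the item or an aspect of it
    DESCRIPTION = "description"  # neutral information about the item
    OTHER = "other"              # e.g. "I haven't seen that."


@dataclass(frozen=True)
class Annotation:
    utterance_index: int
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    anchor_type: AnchorType
    statement_type: StatementType

    def text(self, utterance: str) -> str:
        """Return the annotated substring of the utterance."""
        return utterance[self.start:self.end]
```

Under this representation, the utterance "I haven't seen that." would carry a single annotation over the whole string with a movie anchor type and the `OTHER` statement type.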

Annotation Analysis
At least one movie was named in 99.6% of conversations, and at least one movie genre or category was named in 95%. A person was named in just 33% of conversations. Other statements, usually about whether the Requester had seen a movie, were present in 66% of conversations. We identified on average 12.5 preferences about specific movies, and 5.5 genre preferences in each, as well as 0.3 preferences about a person. Neutral descriptions of movies were found in 40% of conversations. In total, 6,297 movie preferences were found, along with 2,775 genre preferences, 2,545 movie names and 1,714 genre or category names.
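As a quick consistency check, the per-conversation averages follow from the totals and the 502 annotated conversations. The snippet below only redoes this arithmetic; the variable names are ours.

```python
# Totals reported in the text.
conversations = 502
totals = {
    "movie_preferences": 6297,
    "genre_preferences": 2775,
}

# Average number of preferences per conversation, rounded to one decimal.
per_conversation = {
    name: round(count / conversations, 1) for name, count in totals.items()
}

for name, avg in per_conversation.items():
    print(f"{name}: {avg} per conversation")
```

This gives 12.5 movie preferences and 5.5 genre preferences per conversation, matching the averages stated above.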

Inter-Judge Agreement
A random subset of 80 conversations (15%) were independently annotated by two annotators. As our ontology is on two dimensions, and spans between labels can overlap, Krippendorff's α_U does not apply (Artstein and Poesio, 2008). Due to space constraints, we report agreement uncorrected for chance agreement. In the 4,094 annotations, 58% matched exactly and 17% had one annotator select a substring of that selected by the other, with the same type. We thus find 75% inter-judge agreement (58% + 17%). A further 6% of annotations consisted of the same text being annotated with different labels, most often due to disagreement between neutral description and preference labels.
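The agreement categories above can be sketched as a span comparison. This is a simplified illustration under stated assumptions: annotations are modeled as `(utterance, start, end, label)` tuples, each annotation from either annotator is classified against the other annotator's set, and ties are resolved in the order exact, substring, label mismatch. Our actual matching procedure may differ in its details.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Span(NamedTuple):
    utterance: int
    start: int
    end: int
    label: str


def contains(outer: Span, inner: Span) -> bool:
    """True if inner lies within outer in the same utterance."""
    return (outer.utterance == inner.utterance
            and outer.start <= inner.start and inner.end <= outer.end)


def classify(span: Span, others: List[Span]) -> str:
    """Classify one annotation against the other annotator's annotations."""
    if span in others:
        return "exact"
    if any(o.label == span.label and (contains(o, span) or contains(span, o))
           for o in others):
        return "substring"
    if any(o[:3] == span[:3] for o in others):  # same text, different label
        return "label_mismatch"
    return "unmatched"


def agreement(a: List[Span], b: List[Span]) -> Dict[str, float]:
    """Fraction of all annotations (both annotators) in each category."""
    counts = Counter(classify(s, b) for s in a)
    counts.update(classify(s, a) for s in b)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}
```

The uncorrected agreement reported in the text corresponds to the sum of the "exact" and "substring" fractions.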

User-Generated Anchor Items
In one step, the Requesters were asked to name specific likes and dislikes. They did not find it difficult: Only 4% of Requesters did not provide any movies, while 70% named at least two. Analyzing the movies named, we find a heavy-tailed distribution: 643 distinct movies were named (1.3 distinct movies each). No movie was mentioned by more than 18 distinct Requesters, and all but 18 movies or series were mentioned 6 or fewer times. That is, Requesters often gave examples of less well-known movies, characterizing their uniqueness.
We find a similar heavy-tailed pattern among mentions of other named entities, such as people (actors, directors) and genres. However, people (actors or directors) are only mentioned in 33% of conversations. On the other hand, users often refer to fine-grained movie sub-genres.
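A heavy-tailed pattern of this kind can be examined by counting, for each item, how many distinct Requesters mentioned it. The sketch below assumes mentions are available as `(requester_id, title)` pairs with lowercase normalization of titles; both the input layout and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def requesters_per_movie(mentions: Iterable[Tuple[str, str]]) -> Counter:
    """Count distinct Requesters per movie from (requester_id, title) pairs."""
    unique = {(rid, title.strip().lower()) for rid, title in mentions}
    return Counter(title for _, title in unique)


def tail_summary(counts: Counter, threshold: int = 6) -> Dict[str, int]:
    """Summarize the distribution: size, head, and items above a threshold."""
    return {
        "distinct_movies": len(counts),
        "max_requesters": max(counts.values(), default=0),
        "movies_above_threshold": sum(1 for c in counts.values() if c > threshold),
    }
```

With `threshold=6`, the last field corresponds to the number of movies or series mentioned more than 6 times, i.e. the small head of the distribution.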

Conversational Preference Relationships
The dialog collection also illustrates how preferences build upon each other. E.g., consider:
ASST Have you seen the movie Arrival?
USER Yes.
ASST Did you like that movie?
USER Yes, I did.
ASST What did you enjoy about it?
USER I liked the narrative, I liked that it didn't pull punches and didn't have unnecessary action scenes. I thought [...]
To interpret each utterance, the full context needs to be taken into account. This also provides an opportunity to use the CCPE-M dataset to study contextual natural language understanding.

Non-rating preferences
In the above, we also see the user provide information that is not a rating of a movie. Rather, we first learn that the user has seen a given movie. In other conversations, we observe that a user has not heard of some classic movie, or has seen all the movies in some series. Such statements, known to be informative (Steck, 2010; Marlin et al., 2007), were seen in 66% of conversations.

Details Present in Preferences
We saw that when Requesters were asked an open-ended question about the type of movie that they like or dislike, they most often first characterized themselves by movie genre. These genres were sometimes expanded with details such as example movies, yet it is interesting to note that people were much more rarely mentioned here.

Disfluencies
We note that many spoken preferences are naturally disfluent. This requires flexible approaches to semantic interpretation. For example: "I really like the action and all that like the like I really like like the action in that movie was pretty great."

Final Observations
We find that in the movie domain, when users express preferences naturally, these are very rich. The items suggested by users follow a heavy-tailed distribution. The natural language observed is often both complex and disfluent, and requires the full conversational context to interpret. Preferences refer to rich properties, with emphasis on the story, plot, characters and acting.

Conclusion
This paper presented a new methodology for obtaining natural conversational preferences. By asking questions in a "coaching" format, where the assistant avoids prompting the user with specific terminology, the collected data allows a quantitative analysis of the structure of preferences. This analysis can then inform the design of conversational recommendation systems, providing a basis for realistic natural language understanding and natural language generation challenges.
This work opens a number of avenues. It identifies challenges in natural language understanding of realistic preference statements, and provides a dataset for addressing them. Assuming that the output of a system should reflect users' language, the methodology and data also provide guidance for development of future conversational systems. Finally, our method could be used to obtain similar datasets in other domains.
ASST What kind of movies do you like, and why do you like this type of movie?
USER I like science fiction movies. I like science fiction movies because they always have interesting stories, and they deal with crazy new technologies or futuristic technologies. Name of.