INSPIRED: Toward Sociable Recommendation Dialog Systems

In recommendation dialogs, humans commonly disclose their preferences and make recommendations in a friendly manner. However, this is a challenge when developing a sociable recommendation dialog system, due to the lack of dialog datasets annotated with such sociable strategies. Therefore, we present INSPIRED, a new dataset of 1,001 human-human dialogs for movie recommendation with measures for successful recommendations. To better understand how humans make recommendations in communication, we design an annotation scheme for recommendation strategies based on social science theories and annotate these dialogs. Our analysis shows that sociable recommendation strategies, such as sharing personal opinions or communicating with encouragement, more frequently lead to successful recommendations. Based on our dataset, we train end-to-end recommendation dialog systems with and without our strategy labels. In both automatic and human evaluation, our model with strategy incorporation outperforms the baseline model. This work is a first step toward building sociable recommendation dialog systems grounded in social science theories.


Introduction
Sociable conversational agents build rapport with users in order to gain their trust and favor. Social science researchers believe that such rapport makes a recommendation more persuasive, helping it successfully suggest an item that satisfies user needs (Yoo et al., 2012; Gkika and Lekakos; Pecune et al., 2019; Gretzel and Fesenmaier, 2006).
However, existing works on recommendation dialog systems lack a study of the communication strategies human speakers use to make successful and persuasive recommendations. They collect datasets in scenario-based settings or convert product review datasets into question-answering conversations (Reschke et al., 2013; Yan et al., 2017; Sun and Zhang, 2018; Kang et al., 2019; Li et al., 2018). Common issues with these types of datasets are: (1) homologous utterances, (2) mostly question-answering pairs, and (3) lack of user engagement.

Figure 1: An example snippet of a human-human recommendation dialog in INSPIRED. REC refers to the person who recommends a movie and SEEK to the person who looks for a recommendation. Above each recommender's utterance is the recommendation strategy annotated by human workers. Best seen in colors.
In this work, we aim to validate whether sociable recommendation strategies are effective for making a successful recommendation in a dialog. To do so, we propose INSPIRED, a recommendation dialog dataset of paired crowd-workers in a natural setting, with additional annotations for sociable recommendation strategies. The dataset consists of 1,001 dialogs, and each utterance is manually annotated with the sociable strategies based on social science theories. To encourage more natural dialog flow, we set no restrictions on the number or type of movies to recommend. Example annotated dialogs are shown in Tables 11 and 12 in the Appendix. Our analyses show that sociable recommendation strategies are correlated with successful recommendations in dialogs. These insights motivate us to build a more sociable recommendation dialog system to achieve better persuasion outcomes.

Table 1: Comparison of related recommendation dialog datasets. "QA" refers to question-answer pairs. "Mixed" indicates that the conversations contain both statements and question-answer pairs. CONVREC collected 385 human-curated dialogs, but only released 875,721 simulated dialogs.
For extrinsic evaluation, we build two end-to-end dialog systems trained on the INSPIRED dataset: one is encoded with recommendation strategies and the other is not. We find that the model encoded with our strategy annotations performs better in both automatic and human evaluation.
We believe that enriching the intersection between social science and computational linguistics in INSPIRED opens plenty of room for future studies on sociable recommendation dialog.

Related Work
Social science theories on recommendation. Psychological researchers believe that interactions with recommendation systems should not only be seen from a technical perspective but should also be examined from a social and emotional perspective (Zanker et al., 2006). Yoo et al. (2012) propose that credibility, likeability, friendliness, humor, and other language styles are significant factors for persuasive recommendations. Pecune et al. (2019) study modeling social explanation for movie recommendation, such as personal opinion and personal experience. Häubl and Murray (2003) find that more information on a recommendation may help consumers make better purchase decisions, but can also leave them overwhelmed. Inspired by these theories, we borrow such principles in the design of our sociable recommendation strategies.

Conversational recommendation systems. While studies on conversational recommendation systems exist, none of them focus on sociable recommendation strategies for persuasive outcomes. This is due to the lack of existing datasets for studying effective strategies in recommendation dialog. Table 1 compares different factors across the recommendation dialog datasets, including INSPIRED. Prior works on recommendation dialogs collect data based on template-based question-answering pairs from user reviews (Thompson et al., 2004; Reschke et al., 2013; Sun and Zhang, 2018; Zhang et al., 2018b). These datasets contain structured utterances where the recommender continuously asks for the seeker's product preference. Kang et al. (2019) collected goal-driven recommendation dialogs (GORECDIAL) in a gamified setting where both speakers are given a small set of movies with descriptions to find the best recommendation. This role-play game setting may not reflect real-world situations, since the seeker pretends to like the given movies.
The most similar work to ours is Li et al. (2018)'s REDIAL dataset, which consists of chit-chat for movie recommendation. However, the recommendations are conditioned on the movies mentioned in the dialog, not directly on the language used. Also, the dialogs tend to mention only movie names rather than discuss movie preferences in depth.
Our work is also closely related to Radlinski et al. (2019) on movie preference elicitation and Galetzka et al. (2020) on movie discussion in a dialog setting. Preference elicitation is an important step for the human recommender to comprehend the seeker's taste before recommending, but these datasets are not recommendation conversations.
Meanwhile, dialogs in INSPIRED have both stages: preference elicitation and recommendation. INSPIRED also captures sociable recommendation strategies in conversations and measures recommendation with ratings.
Sociability in dialog systems. In human-human conversations, people engage in talk that contains more than task-oriented topics (Bickmore and Cassell, 2005). Thus, sociability has attracted more attention in dialog systems as they become more sociable, engaging, and user-adaptive (Zhang et al., 2018a; Shi and Yu, 2018; Göker and Thompson, 2000). Zhang et al. (2018a) proposed a chit-chat dataset and presented the task of more personalized dialog systems conditioned on user profile information. Sociability leads to a more persuasive conversation (Yoo et al., 2012), so social skills are essential for dialog systems to make successful recommendations.
Communication strategies on specific tasks, such as donation and product price negotiation, have been found useful for task completion (Wang et al., 2019;Zhou et al., 2019). In this work, we connect different sociable strategies with recommendation in dialog settings and show that sociable strategies have a positive impact on recommendation success.

Movie Database Creation
To ensure that every recommended movie has a trailer and metadata information, we curate a database with all movie trailers from the Movieclips Trailers channel released between 2008 and 2020, and movies from the MovieLens dataset (Harper and Konstan, 2015). In total, we have 17,869 movies with trailers and metadata information. We design a simple movie search interface (Figure 2) to assist recommenders in searching for a movie.

Recommendation Task
We recruit crowd-workers from Amazon Mechanical Turk. In each conversation, two workers are randomly paired and assigned different roles: one as a recommender and another as a seeker. Our collection set-up is more realistic compared to prior works as (1) recommenders have no limitations of the number of movies to recommend, (2) seekers accept or reject a movie following their true preference, and (3) we record if seekers actually watch the video trailer or not.

Recommender. Recommenders' task is to recommend a movie successfully to the seeker. Before chatting, we show them tips for sociable recommendation strategies with example utterances. Then they chat with the seekers in two phases: user information gathering and movie recommendation. In the user information gathering phase, recommenders are asked to understand the seekers' movie tastes. In the recommendation phase, the recommenders can still request seekers' preferences while browsing movies to recommend. We encourage the recommenders to continue the conversation until seekers accept a movie.
Seeker. Seekers are asked to talk about movie recommendations without any strategy support. After the conversation, seekers can opt to accept or reject the provided movie recommendation. If they accept it, they can watch the entire recommended movie trailer, part of it, or simply skip it. We record how long seekers watch the recommended trailer and ask them to rate it on a 5-point Likert scale in the post-task survey.

Dialog Data Collection Details
We use the ParlAI platform (Miller et al., 2017) and hire 1,594 US crowd-workers from Amazon Mechanical Turk with a minimum 90% task acceptance rate. The dialog collection process lasted from November 2019 to March 2020. Workers first fill out questionnaires related to their personality traits and values before their conversations. The questionnaire consists of three personality trait models: the Big Five personality traits (15 questions) (Goldberg, 1993), the Schwartz Portrait Value (10 questions) (Schwartz, 2003), and the Decision Making Style (2 questions) (Hamilton et al., 2016). We also release this personality information in our dataset for future work. Then, recommenders start the conversation, and both workers chat for a minimum of 10 turns or until a recommendation is made. After the conversation ends, both workers answer a post-task survey with demographic questions such as age and gender. Seekers are asked to rate the trailer on a 5-point Likert scale and to provide the reason why they reject the recommendation or do not finish watching the video. Both workers receive a bonus of up to $2 if they complete the entire process, in addition to the base pay of $0.5. Dialog collection interfaces are shown in Appendix H. Table 2 presents statistics of the collected dataset. Even though our dataset has a relatively small number of samples compared to REDIAL or GORECDIAL, it has human annotations for each sociable strategy. Moreover, our dataset can be

Cases | #Dialogs
Accept (Rating 4-5) | 532 (53.1%)
Accept (Rating 3 or lower) | 45 (4.5%)
Accept (Other Reasons) | 289 (28.9%)
Accept Uninterested | 123 (12.3%)
Reject | 12 (1.2%)

Table 3: Statistics of dialogs when the seekers accept or reject the final recommended movie. "Accept (Rating 4-5)" means that the seekers accept the recommendation and give a rating of 4 or 5, and similarly for "Accept (Rating 3 or lower)". "Accept (Other Reasons)" means that the seeker gives other reasons for not finishing the video. "Accept Uninterested" means that the seekers accept the recommendation, do not finish watching the video, and explain in the post-task survey that they are not interested in the recommended video.
used in combination with other datasets in a semi-supervised setting, as shown in our implementation of recommendation dialog systems in §6. The statistics of accept and reject cases are shown in Table 3. We have a higher number of successful cases (79.7%) than failure cases. This shows that people tend to accept recommendations, which is not surprising since watching a video trailer is an entertaining, low-risk activity. For training the dialog model, we use the dialogs from all cases so that the dialog system is able to handle diverse responses.

Strategy Definition
After the conversations are collected, two experts with linguistics training develop an annotation scheme using the content analysis method (Krippendorff, 2004) and drawing on past studies of human behavior in making recommendations. Similar approaches have been used in prior work on persuasion (Wang et al., 2019) and negotiation (Zhou et al., 2019). We divide the recommendation strategies into two categories: sociable strategies and preference elicitation strategies. Sociable strategies are also derived from our literature study of social science theories.
The sociable category contains eight strategies related to the recommendation task, through which the recommenders try to build rapport with the seekers.
Table 4: Example utterances for each annotated category.

Category | Example
PERSONAL OPINION | "I really like Disney's more recent princesses"
PERSONAL EXPERIENCE | "I have Disney+ and watched it everyday!"
SIMILARITY | "Oh, I love Disney as well."
ENCOURAGEMENT | "You should definitely watch it!"
OFFERING HELP | "I'm here to help you find a trailer!"
PREFERENCE CONFIRMATION | "So do you like Disney movies in general?"
CREDIBILITY | "It's about a dog named Lady who runs away with a stray named Tramp"
SELF-MODELING | "We are planning to go see Maleficent, we heard it was a very good movie."
EXPERIENCE INQUIRY | "Have you seen the new Lady and the Tramp?"
OPINION INQUIRY | "What do you like about the Avengers: End-game?"
RECOMMENDATION | "You should check out Shazam!"

• Personal opinion refers to a condition where recommenders express a subjective opinion about a movie, including its plot, actors, or other movie attributes.
• Personal experience refers to sharing personal experience related to a movie. For example, recommenders may say that they have watched the movie several times to convince the seekers that the movie is good. Both personal opinion and personal experience are forms of self-disclosure, which leads to establishing rapport with the seekers (Altman, 1973).
• Similarity refers to a condition where the recommenders empathize and present themselves as like-minded with the seekers about their movie preference. Similarity is believed to increase the seekers' liking for the source, which leads them to trust the recommenders' judgment more (O'Keefe, 2004), following Lazarsfeld and Merton (1964)'s homophily theory, which states that humans like other people who are similar to them.
• Encouragement is the use of praise of the seekers' movie taste and encouragement to watch a recommended movie to build rapport and promote the recommended movie.
• Offering help is a strategy where the recommenders disclose an explicit intention to help the seeker, being transparent. It is part of the "transparency" strategy from Gretzel and Fesenmaier (2006).
• Preference confirmation is a strategy where the recommenders ask about or rephrase the seeker's preference. This strategy is also part of the "transparency" strategy, in which the recommenders disclose their thinking process of understanding the seekers' preference.
• Self-modeling is a strategy where the recommender becomes a role model by doing something first so that the seeker would follow (Dowrick, 1999).
• Credibility occurs when the recommender shows expertise and trustworthiness in providing information to persuade the seeker (Fogg, 2002; O'Keefe, 2004; Rhoads and Cialdini, 2002). In our study, a recommender makes a credibility appeal when they provide factual information about movie attributes, such as the plot, actors, or awards that the movie has received.
Preference elicitation inquiries are questions the recommenders ask to learn the seekers' movie tastes.
• Experience inquiry asks for the seeker's movie-watching experience, such as whether the seeker has watched a certain movie or not.
• Opinion inquiry asks for the seeker's opinion on movie-related attributes. Example answers for this inquiry are the seeker's explanation of what they like about the plot or whether they admire the actors' acting skills.
Other kinds of utterances, such as greetings or thanks, fall into the non-strategy category. We also label recommendation sentences. A recommendation is defined as the recommender suggesting a new movie title to the seeker for the first time. 30% of the recommendation sentences are "experience inquiries", 27% are "encouragement", and 14% are "personal opinion". Example annotated utterances are displayed in Table 4. Meanwhile, Table 5 shows the number of annotated utterances in INSPIRED.

Annotation Quality
To ensure annotation quality, we separate our annotation study into two steps. First, we hire two experts with linguistics training to perform annotation, in order to test the validity of the scheme. The two experts annotated 30 randomly selected conversations and reached a kappa agreement of 0.77, suggesting that our scheme is replicable. Our dataset contains more than 18k utterances, so it is too costly to hire experts to annotate all of them. In the second step, we hire US-based crowd-workers (95% task acceptance) from Amazon Mechanical Turk for the annotation tasks. In each task, a worker is given a tutorial on the annotation and then 10 dialogs to annotate. One of the dialogs, called the evaluation dialog, is already labeled by experts to calibrate the quality of the worker's annotation. Five workers work on the same task. We filter out workers whose score on the evaluation dialog is below a threshold of 0.60. To set this threshold at a reasonable value, we conducted the following study: we ran one task in which all the dialogs, including the evaluation dialogs, were already labeled by the experts. We found that if a worker's score on the evaluation dialog is above 0.60, their agreement score with the experts' annotation on the rest of the dialogs in the task is 0.77.
These selected high-quality crowd-workers annotate the rest of the dialogs, still with five workers per dialog. If more than one worker disagrees on an utterance's annotation, the experts are involved to annotate it as quality control. The inter-annotator majority agreement among all workers is 0.78 over all dialogs. The annotation scheme for the crowd-workers is provided in Figure 12 in the Appendix.
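For reference, the agreement scores above are Cohen's kappa. A minimal pure-Python sketch (the label sequences below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators' label sequences."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of utterances with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent annotators with the same marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0
    return (p_o - p_e) / (1 - p_e)

a = ["opinion", "opinion", "credibility", "encouragement", "no_strategy"]
b = ["opinion", "similarity", "credibility", "encouragement", "no_strategy"]
kappa = cohens_kappa(a, b)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over simple percent agreement for annotation studies like this one.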

Distribution of Strategies over Dialog
As shown in Figure 3, we observe that different sociable strategies are unequally distributed across conversation turns. Most notably, "offering help" and "similarity" often happen at the beginning, indicating that recommenders strategically attempt to build rapport with seekers at the early stages. Then, "credibility" and "personal opinion" frequently appear in the conversations, as recommenders seek to persuade. Moreover, "encouragement" mostly appears in the middle and at the end of conversations.

What Strategies Contribute to Successful Recommendations?
We study the association between sociable strategies and successful recommendations. A recommendation is considered successful if the seeker finishes watching a substantial portion of the recommended movie trailer and rates it with a high score (4 or 5 stars). We set the threshold at more than 50% of the video duration, since some videos have advertisements at the end. On the other hand, a recommendation is considered unsuccessful if the seeker rejects the recommendation ("Reject") or skips watching the trailer ("Accept Uninterested"). Thus, we use 532 successful dialogs and 135 unsuccessful dialogs for our analysis of the association between strategies and successful recommendations.
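The success criterion above can be sketched as a small labeling function. The dialog fields (`accepted`, `watch_fraction`, `rating`, `uninterested`) are hypothetical names for illustration, not the dataset's actual schema:

```python
def label_outcome(dialog):
    """Label a dialog per the criteria above.

    `dialog` is a hypothetical dict with keys: 'accepted' (bool),
    'watch_fraction' (share of the trailer watched, 0-1), 'rating' (1-5),
    and 'uninterested' (bool, from the post-task survey).
    Returns 'successful', 'unsuccessful', or None (excluded from analysis).
    """
    if not dialog["accepted"] or dialog["uninterested"]:
        return "unsuccessful"          # "Reject" or "Accept Uninterested"
    if dialog["watch_fraction"] > 0.5 and dialog["rating"] >= 4:
        return "successful"
    return None                        # other accept cases are excluded

outcome = label_outcome({"accepted": True, "uninterested": False,
                         "watch_fraction": 0.9, "rating": 5})
```

Note that middling accept cases (low rating or partial watch for other reasons) are excluded rather than counted as failures, matching the 532 vs. 135 split above.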
To analyze the effect of our sociable recommendation strategies on recommendation success, we run a logistic regression model to predict the success of the recommendation (1 = successful, 0 = unsuccessful), using the frequency of each strategy in a dialog as the feature value. Table 6 shows the coefficient of each strategy. We observe that the "personal opinion", "similarity", "encouragement", and "credibility" strategies have a significant positive effect on successful recommendations. This is consistent with previous studies showing that more sociable recommenders are more likely to make successful recommendations.
"Similarity" has the highest coefficient, which suggests that if the recommender conforms to the seeker's preference, the seeker is more likely to favor the recommendation. This also supports the theory in O'Keefe (2004) that likeability helps in recommendation. We also observe that none of the preference elicitation inquiries contribute significantly to successful recommendations. This does not mean that recommenders need not query seekers' preferences, since it is crucial to understand their tastes. However, a more sociable approach is necessary for a more successful recommendation.
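A minimal sketch of this kind of analysis, assuming per-dialog strategy counts as features. The data is a toy construction and the fit is plain batch gradient descent, not the paper's actual regression:

```python
import math

# Toy per-dialog frequencies of four strategies (columns) and whether the
# recommendation succeeded (1) or not (0). A real analysis would use the
# counts of all ten strategies over the 667 analyzed dialogs.
STRATEGIES = ["personal_opinion", "similarity", "encouragement", "credibility"]
X = [[3, 2, 1, 0], [2, 3, 2, 1], [0, 0, 1, 0],
     [1, 0, 0, 2], [4, 1, 2, 1], [0, 1, 0, 0]]
y = [1, 1, 0, 0, 1, 0]

def fit_logreg(X, y, lr=0.1, steps=3000):
    """Logistic regression fit by batch gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # sigmoid(z) - label
            gb += err
            gw = [gj + err * xj for gj, xj in zip(gw, xi)]
        b -= lr * gb / len(X)
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, gw)]
    return w, b

w, b = fit_logreg(X, y)
# A positive coefficient means the strategy's frequency is associated
# with recommendation success.
```

In the toy data, "personal opinion" counts are higher in successful dialogs, so its learned coefficient comes out positive, mirroring the sign pattern reported in Table 6 (significance testing is omitted here).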

Are Sociable Strategies Still Significant with the Presence of Movie Attributes?
In a recommendation task, a natural question is how big a role the recommended product itself plays in the acceptance of a recommendation. If the quality of the product matters more than how you recommend it, it makes more sense to improve the products rather than the recommendation skills. Therefore, we also analyze whether movie attributes, such as the genre, a recent release date, and the number of likes on the movie trailer, have an impact on successful recommendation along with the eight sociable strategies and two preference elicitation inquiries.
For popularity, we label the top 10% of movies in our database by number of likes as popular and the rest as non-popular. A movie is considered recent if it was released in 2019 or 2020. For genre, we select the five most popular genres in the movie database; 96% of the movies recommended in INSPIRED are covered by these top five genres.
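These attribute definitions can be sketched as binary indicator features. The field names and the genre list below are illustrative, not the dataset's actual top five:

```python
def movie_attribute_features(movie, like_cutoff, top_genres):
    """Binary movie-attribute features as defined above.

    `movie` is a hypothetical dict with 'likes', 'year', and 'genres';
    `like_cutoff` is the 90th-percentile like count in the database.
    """
    return {
        "popular": int(movie["likes"] >= like_cutoff),
        "recent": int(movie["year"] in (2019, 2020)),
        **{f"genre_{g}": int(g in movie["genres"]) for g in top_genres},
    }

feats = movie_attribute_features(
    {"likes": 120000, "year": 2019, "genres": ["Comedy", "Drama"]},
    like_cutoff=50000,
    top_genres=["Action", "Comedy", "Drama", "Horror", "Romance"],
)
```

These indicators can then be added alongside the strategy frequencies in the logistic regression described above.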
Results of the analysis of the strategies and movie attributes are shown in Table 8 in the Appendix. Sociable strategies remain significantly correlated with successful recommendations. Recommenders who use the "similarity" strategy, express "personal opinion", and show "encouragement" are more likely to successfully recommend a movie (p < 0.05). Surprisingly, none of the movie attributes has a significant effect on successful recommendations. A possible reason is that the seekers' movie tastes are so diverse that movie attributes such as genre do not have a significant impact on recommendation success.

Figure 4: The seeker's language model (Seeker LM) and the recommender's language model (Recommender LM) have separate memory. The Seeker LM input at turn t is the seeker's utterance S_t^utt, consisting of a sequence of tokens s_{t,0}, s_{t,1}, ..., s_{t,n}. The Recommender LM input at turn t is the recommender's utterance R_t^utt, consisting of a sequence of tokens r_{t,0}, r_{t,1}, ..., r_{t,n}. The strategy <strategy_t> is prepended as a special token. For the baseline, the recommender's input does not contain the strategies.

Recommendation Dialog Systems
To evaluate how the strategies in INSPIRED help create a more engaging and persuasive recommendation dialog, we develop a generative dialog model as our baseline to compare against our strategy-incorporated dialog system. We split the dialogs into 801/100/100 for train/validation/test. We use an external recommendation system from TMDB with heuristics to select the movies.
More details for heuristics and training set-up are in the Appendix.

Baseline Model
The baseline dialog model uses two separate Transformer-based pretrained language models (Vaswani et al., 2017; Radford et al., 2019; Wu et al., 2019) to learn the recommender's and seeker's language models separately in alternating order. Both language models are trained to maximize the likelihood of generating the ground-truth utterance on the alternating memory, as shown in Figure 4. The model is pretrained on a non-task-related corpus, WebText, and on task-related corpora: the recommendation dataset REDIAL (Li et al., 2018) and the movie preference elicitation dataset (Radlinski et al., 2019). Then, we fine-tune the model on INSPIRED.
We replace movie attributes such as titles, actors, and genres with indexed placeholders, because in a single conversation multiple attributes may be mentioned several times. The replacement with placeholders improves factual correctness, as we later substitute the original movie attributes back. At the end of the sentence, we append the attribute information.

For decoding, the model first generates five candidate sentences. Then, it randomly selects a generated candidate that either contains the "encouragement" strategy or has the greatest sentence length. In our experiments, we tried various combinations of the top three strategies (e.g., "encouragement" only, "encouragement" and "similarity"), and the "encouragement"-only model gave the best result. The sentence-length selection is based on our intuition from chatting with the system. This aligns with our findings that "encouragement" is the second most frequently used strategy when humans make recommendations (§4.1) and that "recommendation" is positively associated with successful recommendation (Table 8).
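The candidate-selection step can be sketched as follows. This is a deterministic simplification (the system samples randomly among qualifying candidates), and the strategy labels are illustrative:

```python
def select_candidate(candidates):
    """Pick a response from generated candidates: prefer one labeled with
    the "encouragement" strategy; otherwise fall back to the longest
    candidate. Each candidate is a (strategy_label, text) pair."""
    encouraging = [c for c in candidates if c[0] == "encouragement"]
    if encouraging:
        # Taking the first qualifying candidate for determinism; the
        # actual system samples among them.
        return encouraging[0][1]
    return max(candidates, key=lambda c: len(c[1]))[1]

cands = [
    ("credibility", "It stars Tom Hanks."),
    ("encouragement", "You should definitely check it out!"),
    ("no_strategy", "It came out last year and got a lot of buzz."),
]
chosen = select_candidate(cands)
```

The length fallback keeps responses informative when no candidate carries the preferred strategy.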
To decide whether a sentence is a recommendation or not, we train a BERT-based recommendation classifier that takes as input the recommender's current utterance and the seeker's utterances from the previous turn; it achieves 95.4% accuracy and a 91.2% F1-score. While the index in the placeholder may become a proxy for deciding whether the system needs to recommend a movie or not, it is not strictly supervised. Thus, if a generated sentence is labeled as "recommendation", we force our dialog system to recommend a new movie.

Model | PPL↓ | BLEU-4↑
Baseline | 9.28 | 5.11
Strategy | 8.93 | 6.63

Table 7: Automatic evaluation results on the test set.

Results
We compare the baseline dialog model without strategy supervision against our dialog model with strategy supervision, using both automatic metrics and human evaluation. For automatic metrics, we compute perplexity and BLEU scores (Papineni et al., 2002); as shown in Table 7, prepending strategies improves model performance on both. For human evaluation, twenty-eight participants chat with both models 2-3 times each for a more reliable judgment. We randomize which model they chat with first, in order to avoid exposure bias. After chatting, they are asked to decide which model is better on five aspects: fluency, consistency, naturalness, persuasiveness, and engagingness. If they are unable to distinguish the dialog systems, they may choose a "can't tell" option.
Results in Figure 5 suggest that human users prefer the model with strategy over the baseline in all aspects. Interestingly, although the strategy model is preferred on all metrics, people find that the two models differ the most in engagingness, followed by naturalness. This supports our hypothesis that human users find conversations more engaging and more natural when sociable strategies are incorporated into recommendation dialog systems.

Conclusion and Future Work
In this work, we have introduced INSPIRED, a new recommendation dialog dataset collected in a natural setting and annotated with sociable recommendation strategies. We analyze the connection between different strategies and recommendation results. Our findings show that sociable strategies have a positive impact on the acceptance of recommendations and on dialog quality. (We also ran an additional user study with five-point ratings on the five evaluation aspects; results are in Table 10.) This work opens up several directions for future studies in building sociable and personalized recommendation dialog systems. First, we will explore more ways of utilizing the strategies, including dynamic strategy selection after decoding. Then, we plan to investigate the strategy patterns of people with different personalities and movie preferences to make the dialog system more personalized. Finally, another interesting direction is to extend the model with jointly trainable movie recommendation and movie information modules.

A Movie Trailer Database Creation
For each movie, we obtain metadata information from YouTube and add other movie attributes, such as plot, actors, and genre, using the OMDB API.
We enrich the movies from the MovieLens dataset (Harper and Konstan, 2015) with more movie trailers by searching for the movie title plus "trailer" on YouTube with a duration restriction of less than 5 minutes, so that the crowd-workers do not have to spend a long time watching them. We use the first retrieved video that satisfies the duration constraint and remove movies without a retrieved trailer from our database. Our motivation for using MovieLens and including more trailers is to link our movie database with MovieLens user reviews, so that it can be used in future work on building recommendation systems.

B Heuristics for Recommendation System
Our heuristics for handling cold start in the recommendation system are as follows. If the seeker has never mentioned a movie and the generated text of the recommender dialog system is labeled as "recommendation", the most recent movie with the last-mentioned genre is recommended. If the seeker has already mentioned a movie, we query the recommendation system with the last-mentioned movie that has positive or neutral sentiment. Our dialog system chooses the first movie in the recommendation system's output; if that movie has already been recommended, we choose the next movie in the output list.
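A sketch of this cold-start heuristic, with a toy in-memory database and a stand-in for the external recommender. All function and field names here are hypothetical:

```python
def pick_movie(mentioned_movies, last_genre, db, already_recommended):
    """Sketch of the cold-start heuristic above.

    `mentioned_movies`: movies the seeker mentioned with positive/neutral
    sentiment, most recent last; `db`: hypothetical list of movie dicts
    sorted newest-first; `recommend_similar` stands in for the external
    TMDB-based recommender.
    """
    if not mentioned_movies:
        # Cold start: newest movie matching the last-mentioned genre.
        for movie in db:
            if last_genre in movie["genres"] and movie["title"] not in already_recommended:
                return movie["title"]
        return None
    # Otherwise query the recommender with the last liked movie and take
    # the first suggestion not yet recommended.
    for title in recommend_similar(mentioned_movies[-1]):
        if title not in already_recommended:
            return title
    return None

def recommend_similar(title):
    # Hypothetical stand-in for the external recommendation system.
    return {"Toy Story 4": ["Onward", "Soul"]}.get(title, [])

db = [{"title": "Soul", "genres": ["Animation"]},
      {"title": "Onward", "genres": ["Animation"]}]
first_pick = pick_movie([], "Animation", db, set())
```

Tracking `already_recommended` is what lets the dialog keep offering fresh titles when the seeker rejects a suggestion.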
To detect which movies are favored by the seekers, and to detect movie titles in a sentence, we use the modules from Liang et al. (2020). The sentiment classifier is a BERT-based model (Devlin et al., 2019) trained on the Stanford Sentiment dataset (Socher et al., 2013). For movie title detection, the model is a bidirectional LSTM-CRF with character-augmented word embeddings, combined with retrieval of similar movie titles from the movie database (TMDB). The model was trained on speech transcripts.
To detect movie genres in a sentence, we use regular expression matching for the following genres from the OMDB information in our database: Action, Animation, Biography, Comedy, Crime, Drama, Documentary, Fantasy, History, Horror, Mystery, Musical, News, Romance, Sport, Thriller, War, and Western. To detect movie actors, actresses, and directors, we use pattern matching on capitalized first letters and check whether the name exists in the TMDB people search.
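Genre detection by regular expression can be sketched as follows, using the genre list above (the exact pattern used in the system is not specified, so this is one plausible construction):

```python
import re

GENRES = ["Action", "Animation", "Biography", "Comedy", "Crime", "Drama",
          "Documentary", "Fantasy", "History", "Horror", "Mystery", "Musical",
          "News", "Romance", "Sport", "Thriller", "War", "Western"]

# Case-insensitive whole-word match for any genre name; \b prevents
# partial matches (e.g. "War" inside "Warcraft").
GENRE_RE = re.compile(r"\b(" + "|".join(GENRES) + r")\b", re.IGNORECASE)

def detect_genres(utterance):
    """Return the genres mentioned in an utterance, in canonical casing."""
    return sorted({m.group(1).title() for m in GENRE_RE.finditer(utterance)})

found = detect_genres("I love horror movies, but my wife prefers comedy.")
```

The word-boundary anchors and case folding keep the matcher robust to casual spellings in chat text.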

C Dialog Model
The dialog model p(d) of a dialog d with T turns is defined as

p(d) = ∏_{t=1}^{T} p_s(s_t | s_{<t}, r_{<t}) p_r(r_t | s_{≤t}, r_{<t}),

where s_t is the seeker's utterance at turn t, r_t is the recommender's utterance at turn t, and p_s(s_t | s_{<t}, r_{<t}) is the probability of generating the seeker's utterance given the dialog history (p_r is defined analogously for the recommender). The conversation history is represented by the query/key/value features using self-attention. Interested readers can refer to Wu et al. (2019) for more details.

D Training Set-up
We adopt GPT-2 small, a 12-head, 12-layer, 768-hidden-size Transformer with 117M parameters. We use the pre-trained GPT-2 Byte Pair Encoding (BPE) tokenizer with an extended vocabulary of 50,310 tokens to tokenize texts. The optimizer is AdamW (Loshchilov and Hutter, 2019) with 100 warm-up steps. The learning rate is set to 3 × 10^-5 and the dropout rate to 0.1. All experiments are run on an NVIDIA GeForce GTX 1080 Ti GPU. The movie information in the input data, such as actor/actress names, movie genre, and movie plot, is delexicalized into special tokens, and the real information (genre, movie title, etc.) is appended to the utterance. In addition, the strategy labels are also treated as special tokens.
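For reference, a single AdamW update for one scalar parameter, following Loshchilov and Hutter (2019): Adam moment estimates plus decoupled weight decay. The weight-decay value below is an assumption, as it is not stated here:

```python
import math

def adamw_step(theta, grad, state, lr=3e-5, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter."""
    state["t"] += 1
    # Exponential moving averages of the gradient and its square.
    state["m"] = betas[0] * state["m"] + (1 - betas[0]) * grad
    state["v"] = betas[1] * state["v"] + (1 - betas[1]) * grad ** 2
    # Bias correction for the zero-initialized moments.
    m_hat = state["m"] / (1 - betas[0] ** state["t"])
    v_hat = state["v"] / (1 - betas[1] ** state["t"])
    # Decoupled weight decay: applied to theta directly, not to the gradient.
    return theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)

state = {"t": 0, "m": 0.0, "v": 0.0}
theta = adamw_step(1.0, 0.5, state)   # a positive gradient shrinks theta
```

Decoupling the weight decay from the adaptive gradient scaling is what distinguishes AdamW from Adam with L2 regularization.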
We leverage the REDIAL (Li et al., 2018) and movie preference elicitation (Radlinski et al., 2019) datasets to conduct task-related pretraining. One pretraining epoch takes around 1.37 hours.
For training on the INSPIRED dataset, one epoch takes around 16 minutes. We train each model until it converges; the baseline model usually converges after the second epoch and the strategy-incorporated model after the third.
During inference, we combine top-k sampling and top-p (nucleus) sampling (Holtzman et al., 2019): we keep the highest-probability tokens whose cumulative probability mass exceeds the threshold p. We manually tune the temperature, p, and k to give each model its best performance. The temperature is set to 0.82 for the baseline and 0.8 for the strategy-incorporated model. For both models, k is set to 400 and the upper bound of p to 0.9.
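The combined filtering can be sketched over an explicit token distribution. Temperature scaling is omitted, and the composition order (top-k first, then nucleus truncation) is an assumption, since the exact combination is not specified here:

```python
def filter_topk_topp(probs, k=400, p=0.9):
    """Combined top-k / nucleus filtering: keep the k highest-probability
    tokens, truncate to the smallest prefix whose cumulative mass reaches
    p, then renormalize. `probs` maps token -> probability."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept, cum = [], 0.0
    for tok, pr in ranked:
        kept.append((tok, pr))
        cum += pr
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {tok: pr / total for tok, pr in kept}

dist = {"movie": 0.5, "film": 0.3, "trailer": 0.15, "book": 0.05}
filtered = filter_topk_topp(dist, k=3, p=0.9)
```

The next token would then be sampled from the renormalized `filtered` distribution, which drops the unreliable low-probability tail while preserving diversity among plausible tokens.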
For the strategy-incorporated model, the strategy is generated first and the utterance is then generated conditioned on the strategy. Although this is a loose constraint, the model learns categorical strategic patterns. For completeness, we also provide validation perplexity and BLEU-4 scores in Table 9.

Model | Test PPL | Test BLEU-4 | Valid PPL | Valid BLEU-4
Baseline | 9.28 | 5.11 | 9.21 | 5.09
Strategy | 8.93 | 6.63 | 8.90 | 7.55

Table 9: Results for automatic metrics on both validation and test data.

E Additional User Study
In addition to the comparison study by human users in §6.3, we conduct another user study in which each participant rates the system from 1 (worst) to 5 (best) on the same five aspects: fluency, consistency, naturalness, persuasiveness, and engagingness. For each model, 25 participants chat with it interactively (50 users in total). Unlike the study in §6.3, where one user interacts with both models, here each user interacts with only one model, since no comparison is needed. These participants are different from those in the comparison study (§6.3). From Table 10, we can see that the strategy model has higher ratings than the baseline model in all aspects.

F Example Human-Human Dialogs in INSPIRED
We include two annotated examples of human-human dialogs in Tables 11 and 12.

G Example Human-System Dialogs
We include an example dialog between a human seeker and the baseline model in Table 13, and an example with the strategy-incorporated dialog model in Table 14, both from the user study. In the user study evaluating the dialog systems, we do not set a minimum number of turns for the human user. Figures 6, 7, 9, and 11 show the dialog collection interface. Figures 12 and 13 are the dialog annotation interfaces for the crowd-workers.

Model | Fluency | Consistency | Naturalness | Persuasiveness | Engagingness

Table 10: Average scores for human ratings on a 5-point Likert scale. Note that the human-human dialogues were collected before the user study, and we did not measure fluency and consistency for the human recommender.

Table 14: Example dialog between human and system. REC SYS refers to the strategy-incorporated recommendation dialog system and SEEK to the human seeker.