Recommendation as a Communication Game: Self-Supervised Bot-Play for Goal-oriented Dialogue

Traditional recommendation systems produce static rather than interactive recommendations invariant to a user’s specific requests, clarifications, or current mood, and can suffer from the cold-start problem if their tastes are unknown. These issues can be alleviated by treating recommendation as an interactive dialogue task instead, where an expert recommender can sequentially ask about someone’s preferences, react to their requests, and recommend more appropriate items. In this work, we collect a goal-driven recommendation dialogue dataset (GoRecDial), which consists of 9,125 dialogue games and 81,260 conversation turns between pairs of human workers recommending movies to each other. The task is specifically designed as a cooperative game between two players working towards a quantifiable common goal. We leverage the dataset to develop an end-to-end dialogue system that can simultaneously converse and recommend. Models are first trained to imitate the behavior of human players without considering the task goal itself (supervised training). We then finetune our models on simulated bot-bot conversations between two paired pre-trained models (bot-play), in order to achieve the dialogue goal. Our experiments show that models finetuned with bot-play learn improved dialogue strategies, reach the dialogue goal more often when paired with a human, and are rated as more consistent by humans compared to models trained without bot-play. The dataset and code are publicly available through the ParlAI framework.


Introduction
Traditional recommendation systems factorize users' historical data (e.g., ratings on movies) to extract common preference patterns (Koren et al., 2009; He et al., 2017b). However, besides making it difficult to accommodate new users because of the cold-start problem, relying on aggregated history makes these systems static, and prevents users from making specific requests or exploring a temporary interest. For example, a user who usually likes horror movies, but is in the mood for a fantasy movie, has no way to indicate this preference to the system, and would likely get a recommendation that is not useful. Further, they cannot iterate upon initial recommendations with clarifications or modified requests, all of which are best specified in natural language.
Recommending through dialogue interactions (Reschke et al., 2013; Wärnestål, 2005) offers a promising solution to these problems, and recent work by Li et al. (2018) explores this approach in detail. However, the dataset introduced in that work does not capture higher-level strategic behaviors that can impact the quality of the recommendation made (for example, it may be better to elicit user preferences first, before making a recommendation). This makes it difficult for models trained on this data to learn optimal recommendation strategies. Additionally, the recommendations are not grounded in real observed movie preferences, which may make trained models less consistent with actual users. This paper aims to provide goal-driven recommendation dialogues grounded in real-world data. We collect a corpus of goal-driven dialogues grounded in real user movie preferences through a carefully designed gamified setup (see Figure 1) and show that models trained with that corpus can learn a successful recommendation dialogue strategy. The training is conducted in two stages: first, a supervised phase that trains the model to mimic human behavior on the task; second, a bot-play phase that improves the goal-directed strategy of the model.
The contribution of this work is thus twofold.

Figure 1: Recommendation as a dialogue game. We collect 81,260 recommendation utterances between pairs of human players (experts and seekers) with a collaborative goal: the expert must recommend the correct (blue) movie, avoiding incorrect (red) ones, and the seeker must accept it. A chatbot is then trained to play the expert in the game.
(1) We provide the first (to the best of our knowledge) large-scale goal-driven recommendation dialogue dataset with specific goals and reward signals, grounded in a real-world knowledge base.
(2) We propose a two-stage recommendation strategy learning framework and empirically validate that it leads to better recommendation conversation strategies.

Recommendation Dialogue Task Design
In this section, we first describe the motivation and design of the dialogue-based recommendation game that we created. We then describe the data collection environment and present detailed dataset statistics.

Dialogue Game: Expert and Seeker
The game is set up as a conversation between a seeker looking for a movie recommendation, and an expert recommending movies to the seeker. Figure 2 shows an example movie recommendation dialogue between two paired human workers on Amazon Mechanical Turk.
Game Setting. Each worker is given a set of five movies with a description (first paragraph from the Wikipedia page for the movie) including important features such as director's name, year, and genre. The seeker's set represents their watching history (movies they are supposed to have liked) for the game's sake. The expert's set consists of candidate movies to choose from when making recommendations, among which only one is the correct movie to recommend. The correct movie is chosen to be similar to the seeker's movie set (see Sec. 2.2), while the other four movies are dissimilar. The expert is not told by the system which of the five movies is the correct one. The expert's goal is to find the correct movie by chatting with the seeker and recommend it after a minimal number of dialogue turns. The seeker's goal is to accept or reject the recommendation from the expert based on whether they judge it to be similar to their set. The game ends when the expert has recommended the correct movie. The system then asks each player to rate the other for engagingness.
Justification. Players are asked to provide reasons for recommending, accepting, or rejecting a movie, so as to get insight into human recommendation strategies.
Gamification. Rewards and penalties are provided to players according to their decisions, to make the task more engaging and incentivize better strategies. Bonus money is given if the expert recommends the correct movie, or if the seeker accepts the correct movie or rejects an incorrect one.

Picking Expert and Seeker movie sets
This section describes how movie sets are selected for experts and seekers.

Pool of movies
To reflect movie preferences of real users, our dataset uses the MovieLens dataset, comprising 27M ratings applied to 58K movies by 280K real users. We obtain descriptive text for each movie from Wikipedia (i.e., the first paragraph). We also extract entity-level features (e.g., directors, actors, year) using the MovieWiki dataset (Miller et al., 2016) (see Figure 1). We filter out less frequent movies and user profiles (see Appendix), resulting in a set of 5,330 movies and 65,181 user profiles with their ratings.

Movie similarity metric
In order to simulate a natural setting, the movies in the seeker's set should be similar to each other, and the correct movie should be similar to these, according to a metric that reflects coherent empirical preferences. To compute such a metric, we train an embedding-driven recommendation model (Wu et al., 2018). Each movie is represented as an embedding, which is trained so that embeddings of movies watched by the same user are close to each other. The closeness metric between two movies is the cosine similarity of these trained embeddings. A movie is deemed close to a set of movies if its embedding is similar to the average of the movie embeddings in the set.
Movie Set Selection. Using these trained embeddings, we design seeker and expert sets based on the following criteria (see Figure 3):
• Seeker movies (grey) are a set of five movies which are close to each other, chosen from the set of all movies watched by a real user.
• The correct movie (light blue) is close to the average of the five embeddings of the seeker set.
• The expert's incorrect movies (light red) are far from the seeker set and the correct movie.
We filter out movie sets that are too difficult or easy for the recommendation task (see Appendix), and choose 10,000 pairs of seeker-expert movie sets at random.
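The closeness metric described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function names and the toy two-dimensional embeddings are assumptions made for the example, whereas the actual embeddings come from the trained recommendation model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def closeness_to_set(movie, movie_set):
    """Closeness of one movie to a set: similarity to the set's average embedding."""
    return cosine(movie, mean_vector(movie_set))

# Toy embeddings (made up for illustration; real ones are learned from ratings).
seeker_set = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
correct_movie = [0.95, 0.1]
incorrect_movie = [0.0, 1.0]
```

Under this sketch, a candidate qualifies as the correct movie when `closeness_to_set` is high, and as an incorrect movie when it is low; the thresholds used for the dataset are not reproduced here.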

Data Collection
For each dialogue game, a movie set is randomly chosen without duplication. We collect dialogues using ParlAI (Miller et al., 2017) to interface with Amazon Mechanical Turk. More details about data collection are included in the Appendix.
Table 1 shows detailed statistics of our dataset regarding the movie sets, the annotated dialogues, actions made by expert and seeker, dialogue games, and engagingness feedback.
The collected dialogues contain a wide variety of action sequences (recommendations and accept/reject decisions). Experts make an average of 1.16 incorrect recommendations, which indicates a reasonable difficulty level. Only 37.6% of dialogue games end at first recommendation, and 19.0% and 10.8% at second and third recommendations, respectively.
Figure 4 shows histogram distributions of (a) expert's decisions between speaking utterance and recommendation utterance and (b) correct and incorrect recommendations over the normalized turns of dialogue. In (a), recommendations increasingly occur after a sufficient number of speaking utterances. In (b), incorrect recommendations are much more frequent earlier in the dialogue, while the opposite is true later on.

Our Approach
In order to recommend the right movie in the role of the expert, a model needs to combine several perceptual and decision skills. We propose to conduct learning in two stages (see Figure 5): supervised multi-aspect learning and bot-play.

Supervised Multi-Aspect Learning
The supervised stage of training the expert model combines three sources of supervision, corresponding to the three following subtasks: (1) generate dialogue utterances to speak with the seeker in a way that matches the utterances of the human speaker, (2) predict the correct movie based on the dialogue history and the movie description representations, and (3) decide whether to recommend or speak in a way that matches the observed decision of the human expert.
Using an LSTM-based model (Hochreiter and Schmidhuber, 1997), we represent the dialogue history context h_t of utterances x_1 to x_t as the average of LSTM representations of x_1, …, x_t, and the description m_k of the k-th movie as the average of the bag-of-word representations of its description sentences. Let (x_{t+1}, y, d_{t+1}) denote the ground truth next utterance, correct movie index, and ground truth decision at time t+1, respectively. We cast the supervised problem as an end-to-end optimization of the following loss:
\[ L_{sup} = \alpha L_{gen} + \beta L_{predict} + (1 - \alpha - \beta) L_{decide} \qquad (1) \]
where α and β are weight hyperparameters optimized over the validation set, and L_predict, L_decide, L_gen are negative log-likelihoods of probability distributions matching each of the three subtasks:
\[ L_{gen} = -\log p_{gen}(x_{t+1} \mid h_t, m_1, \dots, m_K) \qquad (2) \]
\[ L_{predict} = -\log p(y \mid c_1, \dots, c_K) \qquad (3) \]
\[ L_{decide} = -\log p_{MLP}(d_{t+1} \mid c_1, \dots, c_K) \qquad (4) \]
with p_gen the output distribution of an attentive seq2seq generative model (Bahdanau et al., 2015), p a softmax distribution over dot products c_k = h_t · m_k that capture how aligned the dialogue history h_t is with the description m_k of the k-th movie, and p_MLP the output distribution of a multi-layer perceptron predictor that takes c_1, …, c_K as inputs.
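To make the prediction term concrete, the following Python sketch computes the softmax over dot products between a dialogue context vector and movie description vectors, and combines three loss values into one objective. It is a minimal sketch under stated assumptions: the vectors are toy values, the helper names are hypothetical, and weighting the decision loss by 1 − α − β is an assumption of this example rather than a detail confirmed by the surrounding text.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_loss(h_t, movies, y):
    """Negative log-likelihood of the correct movie index y.

    c_k = h_t . m_k measures how aligned the dialogue context is
    with the k-th movie description.
    """
    c = [dot(h_t, m_k) for m_k in movies]
    p = softmax(c)
    return -math.log(p[y])

def supervised_loss(l_gen, l_predict, l_decide, alpha, beta):
    """Weighted sum of the three losses.

    Assumption of this sketch: the decision loss receives the
    remaining weight 1 - alpha - beta.
    """
    return alpha * l_gen + beta * l_predict + (1 - alpha - beta) * l_decide
```

In a full model the three loss values would come from the seq2seq generator, the softmax predictor, and the decision MLP; here they are passed in as plain numbers to keep the example self-contained.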

Bot-Play
Motivated by the recent success of self-play in strategic games (Silver et al., 2017; Vinyals et al., 2019; OpenAI, 2018) and in negotiation dialogues (Lewis et al., 2017), we show in this section how we construct a reward function to perform bot-play between two bots in our setting, with the aim of developing a better expert dialogue agent for recommendation.
Plan optimizes long-term policies of the various aspects over multiple turns of the dialogue game by maximizing game-specific rewards. Let T_REC be the set of turns when the expert made a recommendation. We define the expert's reward as:
\[ r_t^{expert} = \frac{1}{|T_{REC}|} \sum_{t \in T_{REC}} \delta^{t-1} \, b_t \]
where δ is a discount factor to encourage earlier recommendations, b_t is the reward obtained at each recommendation made, and |T_REC| is the number of recommendations made. b_t is 0 unless the correct movie was recommended.
We define the reward function R as follows:
\[ R(x_t) = \gamma^{T-t} \left( r_t^{expert} - \mu_t \right) \]
where r_t^{expert} is the expert's reward, μ_t is the average of the rewards received by the expert until time t, and γ is a discount factor to diminish the reward of earlier actions. We optimize the expected reward for each turn of dialogue x_t and calculate its gradient using REINFORCE (Williams, 1992). The final role-playing objective L_RP is:
\[ L_{RP} = \sum_{t} E_{x_t \sim p_\theta} \left[ R(x_t) \right] \]
We optimize the role-playing objective with the pre-trained expert model's decision (L_decide) and generation (L_gen) objectives at the same time. To control the variance of the RL loss, we alternate optimizing the RL loss and the other two supervised losses for each step. We do not fine-tune the prediction loss, in order not to degrade the prediction performance during bot-play.
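The reward shaping described in this section can be illustrated with a short Python sketch. It is a sketch under explicit assumptions rather than the authors' implementation: the exponent t − 1 on δ, the value 1.0 for a correct recommendation, and the per-turn form γ^(T−t)(r_t − μ_t) are choices made for this example, and the function names are hypothetical.

```python
def expert_reward(rec_rewards, delta):
    """Discounted average reward over recommendation turns.

    rec_rewards maps a 1-based turn index t to b_t, the reward obtained
    for the recommendation made at that turn (0 unless it was correct).
    Assumption: each term is discounted by delta ** (t - 1).
    """
    if not rec_rewards:
        return 0.0
    total = sum(delta ** (t - 1) * b for t, b in rec_rewards.items())
    return total / len(rec_rewards)

def turn_returns(rewards, gamma):
    """Baseline-subtracted, discounted return for each expert turn.

    rewards[t - 1] is the expert reward at turn t. The baseline mu_t is
    the running average of rewards up to turn t, and gamma ** (T - t)
    diminishes the contribution of earlier actions.
    """
    T = len(rewards)
    returns = []
    running_sum = 0.0
    for t, r in enumerate(rewards, start=1):
        running_sum += r
        mu_t = running_sum / t
        returns.append(gamma ** (T - t) * (r - mu_t))
    return returns
```

In a REINFORCE update, each value from `turn_returns` would scale the gradient of the log-probability of the corresponding expert action; that step requires a trainable model and is omitted here.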

Experiments
We describe our experimental setup in §4.1. We then evaluate our supervised and bot-play models in §4.2 and §4.3, respectively.

Setup
We select 5% of the training corpus as a validation set in our training.
All hyper-parameters are chosen by sweeping different combinations and choosing the ones that perform best on the validation set. In the following, the values used for the sweep are given in brackets. Tokens of textual inputs are lower-cased and tokenized using byte-pair encoding (BPE) (Sennrich et al., 2016) or the spaCy tokenizer (https://spacy.io/). The seq-to-seq model uses 300-dimensional word embeddings initialized with GloVe (Pennington et al., 2014) or fastText (Joulin et al., 2017) embeddings, [1, 2] layers of [256, 512]-dimensional uni/bi-directional LSTMs (Hochreiter and Schmidhuber, 1997) with 0.1 dropout ratio, and soft attention (Bahdanau et al., 2015). At decoding, we use beam search with a beam of size 3, and choose the maximum likelihood output. For each turn, the initial movie text and all previous dialogue turns including seeker's and expert's replies are concatenated as input to the models.
Both supervised and bot-play learning use the Adam optimizer (Kingma and Ba, 2015) with batch size 32 and learning rates of [0.1, 0.01, 0.001] with 0.1 gradient clipping. The number of softmax layers (Yang et al., 2018) is [1, 2]. For each turn, the initial movie description and all previous dialogue utterances from the seeker and the expert are concatenated as input text to the other modules. Each movie textual description is truncated at 50 words for memory efficiency.
We use annealing to balance the different supervised objectives: we only optimize the generate loss for the first 5 epochs, and then gradually increase weights for the predict and decide losses. We use the same movie sets as in the supervised phase to fine-tune the expert model. Our models are implemented using PyTorch and ParlAI (Miller et al., 2017). Code and dataset will be made publicly available through ParlAI.

Evaluation of Supervised Models
Metrics. We first evaluate our supervised models on the three supervised tasks: dialogue generation, movie recommendation, and per-turn decision to speak or recommend. The dialogue generation is evaluated using the F1 score and BLEU (Papineni et al., 2002) comparing the predicted and ground-truth utterances. The F1 score is computed at token-level. The recommendation model is evaluated by calculating the percentage of times the correct movie is among the top k recommendations (hit@k). In order to see the usefulness of dialogue for recommendation, precision is measured per each expert turn of the dialogue (Turn@k) regardless of the decision to speak or recommend, and at the end of the dialogue (Chat@k).
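The two families of metrics above can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and lower-casing for the token-level F1 (the exact normalization used in the experiments is not specified here); the function names are hypothetical.

```python
from collections import Counter

def token_f1(predicted, reference):
    """Token-level F1 between a predicted and a ground-truth utterance.

    Assumption: tokens are obtained by lower-casing and splitting on
    whitespace; repeated tokens are matched with multiplicity.
    """
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def hit_at_k(ranked_lists, correct_indices, k):
    """Fraction of examples whose correct movie is in the top-k ranking.

    ranked_lists[i] holds movie indices sorted from best to worst for
    example i; correct_indices[i] is the ground-truth movie index.
    """
    hits = sum(1 for ranked, y in zip(ranked_lists, correct_indices)
               if y in ranked[:k])
    return hits / len(correct_indices)
```

Turn@k would apply `hit_at_k` to the rankings produced at every expert turn, and Chat@k to the ranking produced at the end of each dialogue.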
Models. We compare our models with Information Retrieval (IR) based models and recommendation-only models. The IR models retrieve the most relevant utterances from the set of candidate responses of the training data and rank them by comparing cosine similarities using TFIDF features or BERT (Devlin et al., 2019) encoder features. Note that IR models make no recommendation. The recommendation-only models always produce recommendation utterances following the template (e.g., "how about this movie, [MOVIE]?") where the [MOVIE] is chosen randomly or based on cosine similarities between dialogue contexts and the text descriptions of candidate movies. We use the pre-trained BERT encoder (Devlin et al., 2019) to encode dialogue contexts and movie text descriptions.

Table 2 columns: Generation (F1, BLEU), Recommendation (Turn@1, Turn@3, Chat@1, Chat@3), Decision (Acc).
We incrementally add each module to our base Generate model: Predict and Decide for supervised learning and Plan for bot-play fine-tuning. Each model is the best one found in our hyper-parameter sweep.
Results. Table 2 shows performance comparison on the test set. Note that only the full supervised model (+Decide) and the fine-tuned model (+Plan) can appropriately operate every function required of an expert agent such as producing utterances, recommending items, and deciding to speak or recommend.
Compared to recommendation-only models, our Predict module shows significant improvements over the recommendation baselines on both per-turn and per-chat recommendations: 52% on Turn@1 and 34% on Turn@3. Chat scores are always higher than Turn scores, indicating that recommendations get better as more dialogue context is provided. The Decide module yields additional improvements over the Predict model in both generation and recommendation, with 67.6% decision accuracy, suggesting that the supervised signal of decisions to speak or recommend can contribute to better overall representations.
In generation, our proposed models show performance comparable to the IR baseline models (e.g., BERTRanker). The +Decide model improves on the F1 generation score because it learns when to predict the templated recommendation utterance.
As expected, +Plan slightly hurts most metrics of supervised evaluation, because it optimizes a different objective (the game objective), which might not systematically align with the supervised metrics. For example, a system optimized to maximize the game objective should try to avoid incorrect recommendations even if humans made them. Game-related evaluations are shown in §4.3.

Analysis
We analyze how each of the supervised modules acts over the dialogue turns on the test set. Figure 6(a) shows a histogram of the rank of the ground-truth movie over turns. The rank of the ground-truth movie in the model's prediction is numerically high (i.e., poor) for the first few turns, then steadily decreases as more utterances are exchanged with the seeker. This indicates that the dialogue context is crucial for finding good recommendations.
The evolution of generation metrics (F1, BLEU) for each turn is shown in Fig. 6(b), and the (accumulated) recommendation and decision metrics (Turn@1/Accuracy) in Fig. 6(c). The accumulated recommendation and decision performance rises sharply at the end of the dialogue and its variance decreases. The generation performance increases, because longer dialogue contexts help predict the correct utterances.

Evaluation on Dialogue Games
Metrics. In the bot-play setting, we provide game-specific measures as well as human evaluations. We use three automatic game measures: Goal to measure the ratio of dialogue games where the goal is achieved (i.e., recommending the correct movie or not), Score to measure the total game score, and Turn2G to count the number of dialogue turns taken until the goal is achieved.
We conduct human evaluation by making the expert model play with human seekers. We measure automatic metrics as well as dialogue quality scores provided by the player: fluency, consistency, and engagingness (scored between 1 and 5) (Zhang et al., 2018). We use the full test set (i.e., 911 movie sets) for bot-bot games and use 20 random samples from the test set for {bot,human}-human games.
Results. Compared to the supervised model, the self-supervised model fine-tuned through bot-play with seeker models shows significant improvements in the game-related measures. In particular, the BERT-R model shows a +27.7% improvement in goal success ratio. Interestingly, the number of turns to reach the goal increases from 1.4 to 3.2, indicating that conducting longer dialogues seems to be a better strategy to achieve the game goal throughout our role-playing game.
In dialogue games with human seeker players, the bot-play model also outperforms the supervised one, even though it is still far behind human performance. When the expert bot plays with the human seeker, performance increases compared to playing with the bot seeker, because the human seeker produces utterances more relevant to their movie preferences, increasing overall game success.

Related Work
Recommendation systems often rely on matrix factorization (Koren et al., 2009; He et al., 2017b). Content (Mooney and Roy, 2000) and social relationship features (Ma et al., 2011) have also been used to help with the cold-start problem of new users. The idea of eliciting users' preference for certain content features through dialogue has led to several works. Wärnestål (2005) studies requirements for developing a conversational recommender system, e.g., accumulation of knowledge about user preferences and database content. Reschke et al. (2013) automatically produce template-based questions from user reviews. However, no conversational recommender systems have been built based on these works due to the lack of a large publicly available corpus of human recommendation behaviors.
Very recently, Li et al. (2018) collected the ReDial dataset, comprising 10K conversations of movie recommendations, and used it to train a generative encoder-decoder dialogue system. In that work, crowdsource workers freely talk about movies and are instructed to make a few movie recommendations before accepting one. Compared to ReDial, our dataset is grounded in real movie preferences (movie ratings from MovieLens), instead of relying on workers' hidden movie tastes. This allows us to make our task goal-directed rather than chit-chat; we can optimize prediction and recommendation strategy based on a known ground truth, and train the predict and plan modules of our system. That in turn allows for novel setups such as bot-play.
To the best of our knowledge, Bordes et al. (2016) is the only other goal-oriented dialogue benchmark grounded in a database that has been released with a large-scale publicly available dataset. Compared to that work, our database is made of real (not made-up) movies, and the choice of target movies is based on empirical distances between movies and movie features instead of being arbitrary. This, combined with the collaborative set-up, makes it possible to train a model for the seeker in the bot-play setting.
Our recommendation dialogue game is collaborative. Other dialogue settings with shared objectives have been explored, for example a collaborative graph prediction task (He et al., 2017a), and semi-cooperative negotiation tasks (Lewis et al., 2017; Yarats and Lewis, 2018; He et al., 2018).

Conclusion and Future Directions
In conclusion, we have posed recommendation as a goal-oriented game between an expert and a seeker, and provided a framework for training agents both in a supervised way, by learning to mimic a large set of collected human-human dialogues, and by bot-play between trained agents. We have shown that a combination of the two stages leads to learning better expert recommenders.
Our results suggest several promising directions. First, we noted that the recommendation performance linearly increases as more dialogue context is provided. An interesting question is how to learn to produce the best questions that will result in the most informative dialogue context.
Second, as the model becomes better at the game, we observe an increase in the length of dialogue. However, it remains shorter than the average length of human dialogues, possibly because our reward function is designed to minimize it, which worked better in experiments. A potential direction for future work is to study how different game objectives interact with each other.
Finally, our evaluation on movie recommendation is made only within the candidate set of movies given to the expert. Future work should evaluate if our training scheme generalizes to a fully open-ended recommendation system, thus making our task not only useful for research and model development, but a useful end-product in itself.

A Additional Notes on Data Preparation
We obtain a rating matrix of 265,905 users and 11,382 movies. We filter the data according to a few criteria:
• Users who watched fewer than 50 movies are filtered out.
• Movies watched by fewer than 50 users are filtered out.
• Movies filmed before 1950 are filtered out.
• Movies whose average rating is less than 2 and users whose average rating is less than 2 are filtered out.
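The filtering criteria above can be sketched as a single pass over a list of ratings. This is an illustrative sketch, not the authors' preprocessing script: the tuple layout, the `movie_year` lookup, and the choice to compute all statistics once on the unfiltered data (rather than iterating until stable) are assumptions of this example.

```python
from collections import defaultdict

def filter_ratings(ratings, movie_year, min_count=50, min_year=1950, min_avg=2.0):
    """Keep only ratings whose user and movie pass every criterion.

    ratings: list of (user_id, movie_id, rating) tuples, assuming each
    user rates a movie at most once.
    movie_year: dict mapping movie_id to release year (hypothetical input).
    """
    by_user = defaultdict(list)
    by_movie = defaultdict(list)
    for user, movie, rating in ratings:
        by_user[user].append(rating)
        by_movie[movie].append(rating)

    # Users need enough rated movies and a high enough average rating.
    keep_users = {u for u, rs in by_user.items()
                  if len(rs) >= min_count and sum(rs) / len(rs) >= min_avg}
    # Movies need enough raters, a high enough average, and a recent enough year.
    keep_movies = {m for m, rs in by_movie.items()
                   if len(rs) >= min_count
                   and sum(rs) / len(rs) >= min_avg
                   and movie_year[m] >= min_year}

    return [(u, m, r) for u, m, r in ratings
            if u in keep_users and m in keep_movies]
```

With the default arguments the thresholds match the criteria listed above; a smaller `min_count` is convenient for trying the function on toy data.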
We also remove some movie sets which are too difficult or too easy to predict based on their distance scores. For example, we filter out movie sets where the cosine similarity of the correct movie and the averaged incorrect movies is less than 0.75. After filtering, the remaining data comprises 5,330 movies, rated by 65,181 users.
We tested different types of embedding features such as movie IDs (i.e., MovieLens's ratings), movie text (i.e., Wiki-text), and knowledge base features (e.g., director's name). The movie ID features turn out to be the best performing for recommendation performance. After training, the model finds reasonable close neighbors; for example, for "Ice Age", the model identifies "Shrek 2", "Shrek", "Monsters Inc.", and "Finding Nemo" as close.

B Data Collection: Full Description
In our annotation interface, we provide action buttons for workers to click on in order to interact with the system. When a button is clicked, the corresponding system message is shown. For example, if an expert clicks on a movie button to recommend that movie, the system displays a recommendation message to the seeker, using a simple template. Similarly, if a seeker clicks to accept or reject the recommendation, a templated message with the decision is automatically delivered to the expert.
If an expert recommends the correct movie, a seeker accepts the correctly recommended movie, or a seeker rejects an incorrectly recommended movie, they receive a reward (points, which can translate into bonus money if enough points are earned); otherwise, the system encourages them to focus more on the task and get more points. The amount of reward points awarded is calculated based on the similarities between the average of the seeker's movie set and each candidate movie in the expert's set, using a softmax. The similarity scores are calculated using the Euclidean distance between movie embedding vectors (see Section C).
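One way such a point allocation could be computed is sketched below in Python. It is a hedged illustration only: using the negative Euclidean distance as the similarity score, the absence of a temperature parameter, and the total of 100 points are all assumptions of this example, and `reward_points` is a hypothetical name.

```python
import math

def reward_points(seeker_set, candidates, total_points=100.0):
    """Distribute points over candidate movies with a softmax.

    seeker_set: list of embedding vectors for the seeker's movies.
    candidates: list of embedding vectors for the expert's movies.
    Assumption: similarity is the negative Euclidean distance between a
    candidate and the average seeker embedding.
    """
    n = len(seeker_set)
    seeker_mean = [sum(col) / n for col in zip(*seeker_set)]
    sims = [-math.dist(seeker_mean, c) for c in candidates]
    # Numerically stable softmax over the similarity scores.
    m = max(sims)
    exps = [math.exp(s - m) for s in sims]
    z = sum(exps)
    return [total_points * e / z for e in exps]
```

Candidates closer to the seeker's average embedding receive a larger share of the points, so the correct movie earns the highest reward under this scheme.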
Overall, a total of 1,034 unique workers created 9,125 dialogues, over a duration of 2.5 weeks.

C Supervised training: Details
This section gives more details about the supervised training phase.
Encoding textual inputs: Textual inputs are encoded differently for the dialogue context and for the movie descriptions. The dialogue history context h_t for predicting utterance x_{t+1} comprises the history of all previous utterances x_1, …, x_t. Each utterance is encoded with an LSTM (Hochreiter and Schmidhuber, 1997). The dialogue context is then obtained by averaging over all utterances:
\[ h_t = \mathrm{AVG}(\mathrm{LSTM}(x_1), \dots, \mathrm{LSTM}(x_t)) \]
For the movies, we found that using bags of words instead worked better. We encode each sentence of a movie description as a bag of words, and then average all the resulting representations to obtain m_j, the representation of the j-th movie:
\[ E(m_j) = \mathrm{AVG}(\mathrm{BOW}(m_j)) \quad \text{for } j \in 1..K \qquad (10) \]
Aligning dialogue context and movie descriptions: we use dot-product attention (Chen et al., 2017) between the dialogue context and each of the movie descriptions:
\[ c_j = h_t \cdot m_j \quad \text{for } j \in 1..K \]
Generating utterances: Generate. The expert can produce two types of utterances, according to whether it is recommending a movie or asking for more input from the seeker. For Recommend, the response is produced by a template: "How about this movie, [MOVIE]?" where [MOVIE] is the movie that the expert is recommending. For Speak, the next utterance is generated by taking the dialogue context history h_t and the average of all movie representations M = AVG(m_1, …, m_K), and inputting them into a seq2seq generative model with attention (Bahdanau et al., 2015). The model is then trained to minimize the negative log likelihood of the true next utterance x_{t+1} according to
\[ L_{gen} = -\log p_{gen}(x_{t+1} \mid h_t, M) \]
We include Recommend utterances in the L_gen calculation; as a result, the generation loss is also a partial indicator of other aspects such as Decide and Predict, in addition to the corresponding specific losses (see below).
Predicting the correct movie to recommend: Predict. Let y denote the correct movie. The prediction module is trained by minimizing the negative log likelihood of y according to the distribution of a softmax predictor over the c_j inputs described above:
\[ L_{predict} = -\log p(y \mid c_1, \dots, c_K), \quad \text{where } p(y = j \mid c_1, \dots, c_K) = r_j = \frac{\exp(c_j)}{\sum_{k=1}^{K} \exp(c_k)} \qquad (14) \]
When making a recommendation, the expert recommends the top candidate: \arg\max_j \{r_1, …, r_K\}. We also experimented with using a soft representation for the target movie distribution, for example through a softmax over similarities. For instance, in Figure 2, the hard ground-truth movie distribution is {1, 0, 0, 0, 0}, and the soft version is {0.37, 0.15, 0.16, 0.16, 0.15}. But the hard version always outperformed the soft version in our experiments.
Deciding when to recommend: Decide. The expert needs to decide whether to recommend a movie or speak to elicit more information. We model this using a two-layer perceptron that takes the movie prediction distribution scores and the dialogue context as input, and predicts whether to make a recommendation or not. Training is conducted by minimizing the negative log likelihood of the ground truth decision:
\[ L_{decide} = -\log p_{MLP}(d_{t+1} \mid h_t, c_1, \dots, c_K) \]
We also experimented with other functions of the movie prediction distribution (e.g., skewness and kurtosis (Mardia, 1970)), but the multi-layer perceptron (MLP) always performed better.
Supervised loss of the overall system: The overall objective function of the full supervised system is as follows:
\[ L_{sup} = \alpha L_{gen} + \beta L_{predict} + (1 - \alpha - \beta) L_{decide} \]
where α and β are weight terms that control the balance between the different objectives and are empirically optimized on the validation set. For the predict and decide losses, we use annealing at the beginning of training, with all the weight being given to the generate loss, and the weights of the other two being gradually increased.
Figure 2: An example dialogue from our dataset of movie recommendation between two human workers: seeker (grey) and expert (blue). The goal is for the expert to find and recommend the correct movie (light blue), which is similar to the seeker movies, out of incorrect movies (light red). Best viewed in color.
Seeker movies: Rushmore (1998; Comedy, Drama), Reservoir Dogs (1992; Crime, Mystery, Thriller), Election (1999; Comedy), Big Fish (2003; Drama, Fantasy, Romance), Vanilla Sky (2001; Mystery, Romance, Sci-Fi).
Expert movies, with the number displayed next to each: American Beauty (1999; Drama, Romance), 37; Almost Famous (2000; Drama), 15; Metropolitan (1990; Comedy), 16; Unbreakable (2000; Drama, Sci-Fi), 16; Pathfinder (2007; Action, Adventure, Drama), 15.

Figure 3: Movie set selection: watched movies for seeker (grey) and correct (light blue) / incorrect (light red) movies for expert.

Figure 4: Histogram distribution of (a) experts' decisions of whether to speak or recommend and (b) correct/incorrect recommendations over the normalized dialogue turns.
Figure 5: (a) Supervised learning of the expert model M_expert and (b) bot-play game between the expert M_expert and the seeker M_seeker models. The former imitates multiple aspects of humans' behaviors in the task, while the latter fine-tunes the expert model with respect to the game goal (i.e., recommending the correct movie).
We first pre-train expert and seeker models individually: the expert model M_expert(θ) = min_θ L_sup is pre-trained by minimizing the supervised loss in Eq. 1, and the seeker model M_seeker(φ) is a retrieval-based model that retrieves seeker utterances from the training set based on cosine similarity of the preceding dialogue contexts encoded using the BERT pre-trained encoder. θ and φ are the model parameters of the expert and seeker models, respectively. Then, we make them chat with each other, and fine-tune the expert model by maximizing its reward in the game (see Figure 5, right). The dialogue game ends if the expert model recommends the correct movie, or a maximum dialogue length is reached, yielding T turns of dialogue g.
Figure 6: Analysis of the expert's model: as the dialogue continues (x-axis is either fraction of the full dialogue, or index of dialogue turn), y-axis is (a) rank of the correct recommendation (the lower the rank, the better) and (b,c) F1/BLEU/Turn@1/Decision Accuracy (the higher the better) with the variance shown in grey.

Table 2: Evaluation on supervised models. We incrementally add different aspects of modules: Generate, Predict, and Decide for supervised multi-aspect learning and Plan for bot-play fine-tuning.