Towards End-to-End Reinforcement Learning of Dialogue Agents for Information Access

This paper proposes KB-InfoBot - a multi-turn dialogue agent which helps users search Knowledge Bases (KBs) without composing complicated queries. Such goal-oriented dialogue agents typically need to interact with an external database to access real-world knowledge. Previous systems achieved this by issuing a symbolic query to the KB to retrieve entries based on their attributes. However, such symbolic operations break the differentiability of the system and prevent end-to-end training of neural dialogue agents. In this paper, we address this limitation by replacing symbolic queries with an induced “soft” posterior distribution over the KB that indicates which entities the user is interested in. Integrating the soft retrieval process with a reinforcement learner leads to higher task success rate and reward in both simulations and against real users. We also present a fully neural end-to-end agent, trained entirely from user feedback, and discuss its application towards personalized dialogue agents.


Introduction
The design of intelligent assistants which interact with users in natural language ranks high on the agenda of current NLP research. With an increasing focus on the use of statistical and machine learning based approaches (Young et al., 2013), the last few years have seen some truly remarkable conversational agents appear on the market (e.g. Apple Siri, Microsoft Cortana, Google Allo). These agents can perform simple tasks, answer factual questions, and sometimes also aimlessly chit-chat with the user, but they still lag far behind a human assistant in terms of both the variety and complexity of tasks they can perform. In particular, they lack the ability to learn from interactions with a user in order to improve and adapt with time. Recently, Reinforcement Learning (RL) has been explored to leverage user interactions to adapt various dialogue agents designed, respectively, for task completion (Gašić et al., 2013), information access (Wen et al., 2016b), and chitchat (Li et al., 2016a).
We focus on KB-InfoBots, a particular type of dialogue agent that helps users navigate a Knowledge Base (KB) in search of an entity, as illustrated by the example in Figure 1. Such agents must necessarily query databases in order to retrieve the requested information. This is usually done by performing semantic parsing on the input to construct a symbolic query representing the beliefs of the agent about the user goal, as in Wen et al. (2016b), Williams and Zweig (2016), and Li et al. (2017). We call such an operation a Hard-KB lookup. While natural, this approach has two drawbacks: (1) the retrieved results do not carry any information about uncertainty in semantic parsing, and (2) the retrieval operation is non-differentiable, and hence the parser and dialogue policy are trained separately. This makes online end-to-end learning from user feedback difficult once the system is deployed.
In this work, we propose a probabilistic framework for computing the posterior distribution of the user target over a knowledge base, which we term a Soft-KB lookup. This distribution is constructed from the agent's belief about the attributes of the entity being searched for. The dialogue policy network, which decides the next system action, receives as input this full distribution instead of a handful of retrieved results. We show in our experiments that this framework allows the agent to achieve a higher task success rate in fewer dialogue turns. Further, the retrieval process is differentiable, allowing us to construct an end-to-end trainable KB-InfoBot, all of whose components are updated online using RL.

Figure 1: An interaction between a user looking for a movie and the KB-InfoBot. An entity-centric knowledge base is shown above the KB-InfoBot (missing values denoted by X).
KB rows (Movie, Actor, Release Year): (Groundhog Day, Bill Murray, 1993); (Australia, Nicole Kidman, X); (Mad Max: Fury Road, X, 2015).
User: Find me the Bill Murray's movie.
KB-InfoBot: When was it released?
User: I think it came out in 1993.
KB-InfoBot: Groundhog Day is a Bill Murray movie which came out in 1993.
Agent belief after the exchange: Movie=? Actor=Bill Murray Release Year=1993

Reinforcement learners typically require an environment to interact with, and hence static dialogue corpora cannot be used for their training. Running experiments on human subjects, on the other hand, is unfortunately too expensive. A common workaround in the dialogue community (Young et al., 2013; Schatzmann et al., 2007b; Scheffler and Young, 2002) is to instead use user simulators which mimic the behavior of real users in a consistent manner. For training KB-InfoBot, we adapt the publicly available simulator described in Li et al. (2016b).
Evaluation of dialogue agents has been the subject of much research (Walker et al., 1997; Möller et al., 2006). While the metrics for evaluating an InfoBot are relatively clear (the agent should return the correct entity in a minimum number of turns), the environment for testing it is less so. Unlike previous KB-based QA systems, our focus is on multi-turn interactions, and as such there are no publicly available benchmarks for this problem. We evaluate several versions of KB-InfoBot with the simulator and on real users, and show that the proposed Soft-KB lookup helps the reinforcement learner discover better dialogue policies. Initial experiments on the end-to-end agent also demonstrate its strong learning capability.

Related Work
Our work is motivated by the neural GenQA (Yin et al., 2016a) and neural enquirer (Yin et al., 2016b) models for querying KBs via natural language in a fully "neuralized" way. However, the key difference is that these systems assume that users can compose a complicated, compositional natural language query that can uniquely identify the element/answer in the KB. The research task is to parse the query, i.e., turning the natural language query into a sequence of SQL-like operations. Instead we focus on how to query a KB interactively without composing such complicated queries in the first place. Our work is motivated by the observations that (1) users are more used to issuing simple queries of length less than 5 words (Spink et al., 2001); (2) in many cases, it is unreasonable to assume that users can construct compositional queries without prior knowledge of the structure of the KB to be queried.
Also related is the growing body of literature focused on building end-to-end dialogue systems, which combine feature extraction and policy optimization using deep neural networks. Wen et al. (2016b) introduced a modular neural dialogue agent, which uses a Hard-KB lookup, thus breaking the differentiability of the whole system. As a result, training of various components of the dialogue system is performed separately. The intent network and belief trackers are trained using supervised labels specifically collected for them, while the policy network and generation network are trained separately on the system utterances. We retain modularity of the network by keeping the belief trackers separate, but replace the hard lookup with a differentiable one.
Dialogue agents can also interface with the database by augmenting their output action space with predefined API calls (Williams and Zweig, 2016; Zhao and Eskenazi, 2016; Bordes and Weston, 2016; Li et al., 2017). The API calls modify a query hypothesis maintained outside the end-to-end system which is used to retrieve results from the KB. This framework does not deal with uncertainty in language understanding since the query hypothesis can only hold one slot-value at a time. Our approach, on the other hand, directly models the uncertainty to construct the posterior over the KB. Wu et al. (2015) presented an entropy minimization dialogue management strategy for InfoBots. The agent always asks for the value of the slot with maximum entropy over the remaining entries in the database, which is optimal in the absence of language understanding errors, and serves as a baseline against our approach. Reinforcement learning neural turing machines (RL-NTM) (Zaremba and Sutskever, 2015) also allow neural controllers to interact with discrete external interfaces. The interface considered in that work is a one-dimensional memory tape, while in our work it is an entity-centric KB.

Probabilistic KB Lookup
This section describes a probabilistic framework for querying a KB given the agent's beliefs over the fields in the KB.

Entity-Centric Knowledge Base (EC-KB)
A Knowledge Base consists of triples of the form $(h, r, t)$, which denotes that relation $r$ holds between the head $h$ and tail $t$. We assume that the KB-InfoBot has access to a domain-specific entity-centric knowledge base (EC-KB) (Zwicklbauer et al., 2013) where all head entities are of a particular type (such as movies or persons), and the relations correspond to attributes of these head entities. Such a KB can be converted to a table format whose rows correspond to the unique head entities, columns correspond to the unique relation types (slots henceforth), and some entries may be missing. An example is shown in Figure 1.

Notations and Assumptions
Let $T$ denote the KB table described above and $T_{i,j}$ denote the $j$th slot-value of the $i$th entity, where $1 \le i \le N$ and $1 \le j \le M$. We let $V^j$ denote the vocabulary of each slot, i.e. the set of all distinct values in the $j$-th column. We denote missing values from the table with a special token and write $T_{i,j} = \Psi$. $M_j = \{i : T_{i,j} = \Psi\}$ denotes the set of entities for which the value of slot $j$ is missing. Note that the user may still know the actual value of $T_{i,j}$, and we assume this lies in $V^j$. We do not deal with new entities or relations at test time.
We assume a uniform prior $G \sim U[\{1, \ldots, N\}]$ over the rows in the table $T$, and let binary random variables $\Phi_j \in \{0, 1\}$ indicate whether the user knows the value of slot $j$ or not. The agent maintains $M$ multinomial distributions $p_j^t(v)$ for $v \in V^j$ denoting the probability at turn $t$ that the user constraint for slot $j$ is $v$, given their utterances $U_1^t$ till that turn. The agent also maintains $M$ binomials $q_j^t = \Pr(\Phi_j = 1)$ which denote the probability that the user knows the value of slot $j$.
We assume that column values are distributed independently of each other. This is a strong assumption but it allows us to model the user goal for each slot independently, as opposed to modeling the user goal over KB entities directly. Typically $\max_j |V^j| < N$ and hence this assumption reduces the number of parameters in the model.

Soft-KB Lookup
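Let $p_T^t(i) = \Pr(G = i \mid U_1^t)$ be the posterior probability that the user is interested in row $i$ of the table, given the utterances up to turn $t$. We assume all probabilities are conditioned on user inputs $U_1^t$ and drop it from the notation below. From our assumption of independence of slot values, we can write

$$p_T^t(i) \propto \prod_{j=1}^{M} \Pr(G_j = i), \tag{1}$$

where $\Pr(G_j = i)$ denotes the posterior probability of user goal for slot $j$ pointing to $T_{i,j}$. Marginalizing this over $\Phi_j$ gives:

$$\Pr(G_j = i) = \sum_{\phi=0}^{1} \Pr(G_j = i, \Phi_j = \phi) = q_j^t \Pr(G_j = i \mid \Phi_j = 1) + (1 - q_j^t) \Pr(G_j = i \mid \Phi_j = 0). \tag{2}$$

For $\Phi_j = 0$, the user does not know the value of the slot, and from the prior:

$$\Pr(G_j = i \mid \Phi_j = 0) = \frac{1}{N}, \quad 1 \le i \le N.$$

For $\Phi_j = 1$, the user knows the value of slot $j$, but this may be missing from $T$, and we again have two cases:

$$\Pr(G_j = i \mid \Phi_j = 1) = \begin{cases} \dfrac{1}{N}, & i \in M_j \\[2ex] \dfrac{p_j^t(v)}{N_j(v)} \left(1 - \dfrac{|M_j|}{N}\right), & i \notin M_j \end{cases} \tag{3}$$

Here $v = T_{i,j}$ and $N_j(v)$ is the count of value $v$ in slot $j$ (see Appendix A for the derivation). Combining (1), (2), and (3) gives us the procedure for computing the posterior over KB entities.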
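As a rough illustration of the lookup, the following Python sketch computes the posterior for the three-row table of Figure 1. The function names, the use of None for the missing-value token, and the belief values are assumptions made for this example. The per-slot formula follows the derivation in Appendix A, where the belief mass of a non-missing value is split equally among the rows that hold it.

```python
from math import prod

MISSING = None  # stands in for the special missing-value token


def slot_posterior(column, p_j, q_j):
    """Return Pr(G_j = i) for every row i of one KB column.

    column: list of slot values (MISSING where the KB has no value)
    p_j:    dict mapping slot value -> belief that the user wants it
    q_j:    probability that the user knows the value of this slot
    """
    n = len(column)
    n_missing = sum(1 for v in column if v is MISSING)
    counts = {}
    for v in column:
        if v is not MISSING:
            counts[v] = counts.get(v, 0) + 1

    posterior = []
    for v in column:
        if v is MISSING:
            # Rows with a missing value are assumed equally likely.
            known = 1.0 / n
        else:
            # Belief mass for v is shared by all rows holding v.
            known = p_j.get(v, 0.0) / counts[v] * (1.0 - n_missing / n)
        unknown = 1.0 / n  # uniform prior when the user does not know the slot
        posterior.append(q_j * known + (1.0 - q_j) * unknown)
    return posterior


def soft_kb_lookup(table, beliefs, knows):
    """Multiply per-slot posteriors and normalize over the rows."""
    n_rows, n_slots = len(table), len(table[0])
    per_slot = [
        slot_posterior([row[j] for row in table], beliefs[j], knows[j])
        for j in range(n_slots)
    ]
    scores = [prod(per_slot[j][i] for j in range(n_slots)) for i in range(n_rows)]
    total = sum(scores)
    return [s / total for s in scores]


# Rows: Groundhog Day, Australia, Mad Max: Fury Road; slots: actor, release year.
table = [
    ("Bill Murray", 1993),
    ("Nicole Kidman", MISSING),
    (MISSING, 2015),
]
beliefs = [
    {"Bill Murray": 0.9, "Nicole Kidman": 0.1},  # illustrative belief values
    {1993: 0.8, 2015: 0.2},
]
knows = [0.9, 0.9]
posterior = soft_kb_lookup(table, beliefs, knows)
```

With these illustrative beliefs the first row (Groundhog Day) receives most of the mass, while the rows with missing values keep a non-zero probability instead of being filtered out as in a hard lookup.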
Towards an End-to-End-KB-InfoBot
We claim that the Soft-KB lookup method has two benefits over the Hard-KB method: (1) it helps the agent discover better dialogue policies by providing it more information from the language understanding unit, and (2) it allows end-to-end training of both dialogue policy and language understanding in an online setting. In this section we describe several agents to test these claims.


Overview
Figure 2 shows an overview of the components of the KB-InfoBot. At each turn, the agent receives a natural language utterance $u^t$ as input, and selects an action $a^t$ as output. The action space, denoted by $\mathcal{A}$, consists of $M + 1$ actions: request(slot=i) for $1 \le i \le M$ will ask the user for the value of slot $i$, and inform(I) will inform the user with an ordered list of results $I$ from the KB. The dialogue ends once the agent chooses inform.
We adopt a modular approach, typical to goal-oriented dialogue systems (Wen et al., 2016b), consisting of: a belief tracker module for identifying user intents, extracting associated slots, and tracking the dialogue state (Yao et al., 2014; Hakkani-Tür et al., 2016; Chen et al., 2016b; Henderson et al., 2014; Henderson, 2015); an interface with the database to query for relevant results (Soft-KB lookup); a summary module to summarize the state into a vector; a dialogue policy which selects the next system action based on current state (Young et al., 2013). We assume the agent only responds with dialogue acts. A template-based Natural Language Generator (NLG) can be easily constructed for converting dialogue acts into natural language.

Belief Trackers
The InfoBot consists of $M$ belief trackers, one for each slot, which get the user input $x^t$ and produce two outputs, $p_j^t$ and $q_j^t$, which we shall collectively call the belief state: $p_j^t$ is a multinomial distribution over the slot values $v$, and $q_j^t$ is a scalar probability of the user knowing the value of slot $j$. We describe two versions of the belief tracker.
Hand-Crafted Tracker: We first identify mentions of slot-names (such as "actor") or slot-values (such as "Bill Murray") from the user input $u^t$, using token-level keyword search. Let $\{w \in x\}$ denote the set of tokens in a string $x$, then for each slot in $1 \le j \le M$ and each value $v \in V^j$, we compute its matching score as follows:

$$s_j^t[v] = \frac{|\{w \in u^t\} \cap \{w \in v\}|}{|\{w \in v\}|}$$

A similar score $b_j^t$ is computed for the slot-names. A one-hot vector $req^t \in \{0, 1\}^M$ denotes the previously requested slot from the agent, if any. $q_j^t$ is set to 0 if $req^t[j]$ is 1 but $s_j^t[v] = 0 \;\forall v \in V^j$, i.e. the agent requested a slot but did not receive a valid value in return, else it is set to 1.
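Starting from a prior distribution $p_j^0$ (based on the counts of the values in the KB), $p_j^t[v]$ is updated as:

$$p_j^t[v] \propto p_j^{t-1}[v] + C \left( s_j^t[v] + b_j^t + \mathbb{1}(req^t[j] = 1) \right)$$

Here $C$ is a tuning parameter, and the normalization is given by setting the sum over $v$ to 1.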
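A minimal Python sketch of the keyword-matching tracker for a single slot follows. It assumes that the matching score is the fraction of a value's tokens found in the utterance and that beliefs are updated additively before renormalizing; the slot-name score and the requested-slot indicator are left out for brevity. The function names and the value C = 5 are hypothetical.

```python
def tokens(text):
    """Lower-cased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def matching_score(utterance, value):
    """Fraction of the value's tokens that occur in the utterance."""
    value_tokens = tokens(value)
    return len(tokens(utterance) & value_tokens) / len(value_tokens)


def update_belief(prior, utterance, c=5.0):
    """Add c * score to each value's belief, then renormalize to sum to 1."""
    raw = {v: p + c * matching_score(utterance, v) for v, p in prior.items()}
    total = sum(raw.values())
    return {v: mass / total for v, mass in raw.items()}


# Prior over the actor slot (illustrative values, not KB counts).
prior = {"Bill Murray": 0.5, "Nicole Kidman": 0.5}
belief = update_belief(prior, "find me the bill murray movie")
```

A partial mention such as "bill" alone yields a score of 0.5 for "Bill Murray", so the belief shifts less than for a full match.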
Neural Belief Tracker: For the neural tracker the user input $u^t$ is converted to a vector representation $x^t$, using a bag of n-grams (with $n = 2$) representation. Each element of $x^t$ is an integer indicating the count of a particular n-gram in $u^t$. We let $V^n$ denote the number of unique n-grams, hence $x^t \in \mathbb{N}_0^{V^n}$. Recurrent neural networks have been used for belief tracking (Henderson et al., 2014; Wen et al., 2016b) since the output distribution at turn $t$ depends on all user inputs till that turn. We use a Gated Recurrent Unit (GRU) (Cho et al., 2014) for each tracker, which, starting from $h_j^0 = 0$ computes $h_j^t = \text{GRU}(x^1, \ldots, x^t)$ (see Appendix B for details). $h_j^t \in \mathbb{R}^d$ can be interpreted as a summary of what the user has said about slot $j$ till turn $t$. The belief states are computed from this vector as follows:

$$p_j^t = \text{softmax}(W_j^p h_j^t + b_j^p)$$
$$q_j^t = \sigma(W_j^\Phi h_j^t + b_j^\Phi)$$

Here $W_j^p \in \mathbb{R}^{|V^j| \times d}$, $b_j^p \in \mathbb{R}^{|V^j|}$, $W_j^\Phi \in \mathbb{R}^d$, and $b_j^\Phi \in \mathbb{R}$ are trainable parameters.

Soft-KB Lookup + Summary
This module uses the Soft-KB lookup described in section 3.3 to compute the posterior $p_T^t \in \mathbb{R}^N$ over the EC-KB from the belief states ($p_j^t$, $q_j^t$).
Collectively, outputs of the belief trackers and the soft-KB lookup can be viewed as the current dialogue state internal to the KB-InfoBot. Let $s^t = [p_1^t, p_2^t, \ldots, p_M^t, q_1^t, q_2^t, \ldots, q_M^t, p_T^t]$ be the vector of size $\sum_j |V^j| + M + N$ denoting this state. It is possible for the agent to directly use this state vector to select its next action $a^t$. However, the large size of the state vector would lead to a large number of parameters in the policy network. To improve efficiency we extract summary statistics from the belief states, similar to Williams and Young (2005).
Each slot is summarized into an entropy statistic over a distribution $w_j^t$ computed from elements of the KB posterior $p_T^t$ as follows:

$$w_j^t(v) \propto \sum_{i: T_{i,j} = v} p_T^t(i) + p_j^0(v) \sum_{i: T_{i,j} = \Psi} p_T^t(i) \tag{8}$$

Here, $p_j^0$ is a prior distribution over the values of slot $j$, estimated using counts of each value in the KB. The probability mass of $v$ in this distribution is the agent's confidence that the user goal has value $v$ in slot $j$. The two terms in (8) correspond to rows in the KB which have value $v$, and rows whose value is unknown (weighted by the prior probability that an unknown might be $v$). Then the summary statistic for slot $j$ is the entropy $H(w_j^t)$. The KB posterior $p_T^t$ is also summarized into an entropy statistic $H(p_T^t)$. The scalar probabilities $q_j^t$ are passed as is to the dialogue policy, and the final summary vector is $\tilde{s}^t = [H(w_1^t), \ldots, H(w_M^t), q_1^t, \ldots, q_M^t, H(p_T^t)]$. Note that this vector has size $2M + 1$.
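The slot summary described above can be sketched in Python as follows. The function names, the use of None for a missing value, and the numbers are assumptions for illustration; entropy is computed in nats.

```python
from math import log


def entropy(dist):
    """Shannon entropy (in nats) of a dict of probabilities."""
    return -sum(p * log(p) for p in dist.values() if p > 0.0)


def slot_distribution(column, kb_posterior, prior):
    """Distribution over the values of one slot implied by the KB posterior.

    Rows holding value v contribute their posterior mass directly; rows
    whose value is missing contribute their mass weighted by the prior of v.
    """
    missing_mass = sum(p for v, p in zip(column, kb_posterior) if v is None)
    w = {}
    for value, prior_mass in prior.items():
        direct = sum(p for v, p in zip(column, kb_posterior) if v == value)
        w[value] = direct + prior_mass * missing_mass
    total = sum(w.values())
    return {value: mass / total for value, mass in w.items()}


column = ["Bill Murray", "Nicole Kidman", None]      # actor slot, one missing
kb_posterior = [0.78, 0.08, 0.14]                    # illustrative posterior
prior = {"Bill Murray": 0.5, "Nicole Kidman": 0.5}   # illustrative prior
w = slot_distribution(column, kb_posterior, prior)
summary = entropy(w)  # one entry of the summary vector
```

Replacing the full distributions by their entropies is what reduces the policy input from a vector that grows with the KB to one of size 2M + 1.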

Dialogue Policy
The dialogue policy's job is to select the next action based on the current summary state $\tilde{s}^t$ and the dialogue history. We present a hand-crafted baseline and a neural policy network.
Hand-Crafted Policy: The rule based policy is adapted from Wu et al. (2015). It asks for the slot $\hat{j} = \arg\min_j H(p_j^t)$ with the minimum entropy, except if (i) the KB posterior entropy $H(p_T^t) < \alpha_R$, (ii) $H(p_j^t) < \min(\alpha_T, \beta H(p_j^0))$, or (iii) slot $j$ has already been requested $Q$ times. $\alpha_R$, $\alpha_T$, $\beta$, $Q$ are tuned to maximize reward against the simulator.
Neural Policy Network: For the neural approach, similar to Williams and Zweig (2016) and Zhao and Eskenazi (2016), we use an RNN to allow the network to maintain an internal state of dialogue history. Specifically, we use a GRU unit followed by a fully-connected layer and softmax nonlinearity to model the policy $\pi$ over actions in $\mathcal{A}$:

$$h_\pi^t = \text{GRU}(\tilde{s}^1, \ldots, \tilde{s}^t)$$
$$\pi = \text{softmax}(W^\pi h_\pi^t + b^\pi)$$

where $W^\pi$ and $b^\pi$ are trainable parameters. During training, the agent samples its actions from the policy to encourage exploration. If this action is inform(), it must also provide an ordered set of entities indexed by $I = (i_1, i_2, \ldots, i_R)$ in the KB to the user. This is done by sampling $R$ items from the KB-posterior $p_T^t$. This mimics a search engine type setting, where $R$ may be the number of results on the first page.
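One way to draw the ordered result list for an inform action is sketched below in Python. Sampling without replacement (so that no entity is returned twice) is an assumption here, since the text only states that R items are sampled from the KB posterior; the function name and the example posterior are hypothetical. The function also returns the log-probability of the sampled list, which is the quantity a REINFORCE-style update of the retrieval step would need.

```python
import random
from math import log


def sample_results(kb_posterior, num_results, rng=random):
    """Draw an ordered list of distinct row indices and its log-probability."""
    remaining = dict(enumerate(kb_posterior))
    chosen, log_prob = [], 0.0
    for _ in range(min(num_results, len(remaining))):
        total = sum(remaining.values())
        indices = list(remaining)
        weights = [remaining[i] for i in indices]
        # Pick one row in proportion to its remaining posterior mass.
        pick = rng.choices(indices, weights=weights, k=1)[0]
        log_prob += log(remaining[pick] / total)
        chosen.append(pick)
        del remaining[pick]  # without replacement: renormalize over the rest
    return chosen, log_prob


rng = random.Random(0)  # seeded generator so the sketch is repeatable
results, log_mu = sample_results([0.7, 0.2, 0.1], num_results=2, rng=rng)
```

At evaluation time the sampling step would presumably be replaced by taking the top R rows of the posterior, which matches the statement that all versions inform the user with the top results.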

Training
Parameters of the neural components (denoted by $\theta$) are trained using the REINFORCE algorithm (Williams, 1992). We assume that the learner has access to a reward signal $r_t$ throughout the course of the dialogue, details of which are in the next section. We can write the expected discounted return of the agent under policy $\pi$ as $J(\theta) = \mathbb{E}_\pi\left[\sum_{t=0}^{H} \gamma^t r_t\right]$ ($\gamma$ is the discounting factor). We also use a baseline reward signal $b$, which is the average of all rewards in a batch, to reduce the variance in the updates (Greensmith et al., 2004). When only training the dialogue policy $\pi$ using this signal, updates are given by (details in Appendix C):

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\sum_{k=0}^{H} \nabla_\theta \log \pi_\theta(a_k) \sum_{t=0}^{H} \gamma^t (r_t - b)\right]$$

For end-to-end training we need to update both the dialogue policy and the belief trackers using the reinforcement signal, and we can view the retrieval as another policy $\mu_\theta$ (see Appendix C). The updates are given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{a \sim \pi,\, I \sim \mu}\left[\left(\nabla_\theta \log \mu_\theta(I) + \sum_{k=0}^{H} \nabla_\theta \log \pi_\theta(a_k)\right) \sum_{t=0}^{H} \gamma^t (r_t - b)\right]$$

In the case of end-to-end learning, we found that for a moderately sized KB, the agent almost always fails if starting from random initialization. In this case, credit assignment is difficult for the agent, since it does not know whether the failure is due to an incorrect sequence of actions or incorrect set of results from the KB. Hence, at the beginning of training we have an Imitation Learning (IL) phase where the belief trackers and policy network are trained to mimic the hand-crafted agents. Assume that $\hat{p}_j^t$ and $\hat{q}_j^t$ are the belief states from a rule-based agent, and $\hat{a}^t$ its action at turn $t$. Then the loss function for imitation learning is:

$$\mathcal{L}(\theta) = \mathbb{E}\left[ D(\hat{p}_j^t \,\|\, p_j^t(\theta)) + H(\hat{q}_j^t, q_j^t(\theta)) - \log \pi_\theta(\hat{a}^t) \right]$$

$D(p \| q)$ and $H(p, q)$ denote the KL divergence and cross-entropy between $p$ and $q$ respectively. The expectations are estimated using a minibatch of dialogues of size $B$. For RL we use RMSProp (Hinton et al., 2012) and for IL we use vanilla SGD updates to train the parameters $\theta$.
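The scalar that weights the log-probability gradients in a REINFORCE update with a baseline can be sketched as follows. This is a simplified Python illustration, not the training code: it assumes the baseline is the mean of all per-turn rewards in the minibatch and that it is subtracted from every reward before discounting. The function names and reward sequences are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the turns of one dialogue."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def batch_baseline(batch):
    """Average of all per-turn rewards in a batch of dialogues."""
    all_rewards = [r for rewards in batch for r in rewards]
    return sum(all_rewards) / len(all_rewards)


def reinforce_weights(batch, gamma):
    """Baseline-corrected discounted return for each dialogue in the batch."""
    b = batch_baseline(batch)
    return [discounted_return([r - b for r in rewards], gamma) for rewards in batch]


batch = [
    [-0.1, -0.1, 2.0],         # short dialogue ending in success
    [-0.1, -0.1, -0.1, -1.0],  # longer dialogue ending in failure
]
weights = reinforce_weights(batch, gamma=0.99)
```

Each weight multiplies the summed gradient of the log-probabilities of the actions taken in that dialogue, so actions from the successful dialogue are reinforced and those from the failed one are discouraged.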

Experiments and Results
Previous work in KB-based QA has focused on single-turn interactions and is not directly comparable to the present study. Instead we compare different versions of the KB-InfoBot described above to test our claims.

KB-InfoBot versions
We have described two belief trackers, (A) Hand-Crafted and (B) Neural, and two dialogue policies, (C) Hand-Crafted and (D) Neural.
Rule agents use the hand-crafted belief trackers and hand-crafted policy (A+C). RL agents use the hand-crafted belief trackers and the neural policy (A+D). We compare three variants of both sets of agents, which differ only in the inputs to the dialogue policy. The No-KB version only takes entropy $H(p_j^t)$ of each of the slot distributions. The Hard-KB version performs a hard-KB lookup and selects the next action based on the entropy of the slots over retrieved results. This is the same approach as in Wen et al. (2016b), except that we take entropy instead of summing probabilities. The Soft-KB version takes summary statistics of the slots and KB posterior described in Section 4. At the end of the dialogue, all versions inform the user with the top results from the KB posterior $p_T^t$, hence the difference only lies in the policy for action selection. Lastly, the E2E agent uses the neural belief tracker and the neural policy (B+D), with a Soft-KB lookup. For the RL agents, we also append $q_j^t$ and a one-hot encoding of the previous agent action to the policy network input. Hyperparameter details for the agents are provided in Appendix D.

User Simulator
Training reinforcement learners is challenging because they need an environment to operate in. In the dialogue community it is common to use simulated users for this purpose (Schatzmann et al., 2007a,b; Cuayáhuitl et al., 2005; Asri et al., 2016).
In this work we adapt the publicly-available user simulator presented in Li et al. (2016b) to follow a simple agenda while interacting with the KB-InfoBot, as well as produce natural language utterances. Details about the simulator are included in Appendix E. During training, the simulated user also provides a reward signal at the end of each dialogue. The dialogue is a success if the user target is in top $R = 5$ results returned by the agent; and the reward is computed as $\max(0, 2(1 - (r - 1)/R))$, where $r$ is the actual rank of the target. For a failed dialogue the agent receives a reward of $-1$, and at each turn it receives a reward of $-0.1$ to encourage short sessions. The maximum length of a dialogue is 10 turns beyond which it is deemed a failure.
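The end-of-dialogue reward can be written as a small Python function. The function name and the convention of passing None when the target was not returned are assumptions for this sketch; the per-turn penalty of -0.1 and the 10-turn limit are not modeled here.

```python
def final_reward(rank, num_results=5):
    """Reward at the end of a dialogue.

    rank is the 1-based position of the user target in the returned list,
    or None if the target was not returned at all.
    """
    if rank is None or rank > num_results:
        return -1.0  # failed dialogue
    return max(0.0, 2.0 * (1.0 - (rank - 1) / num_results))


rewards_by_rank = [final_reward(r) for r in range(1, 7)]
```

The reward falls linearly from 2 for a target ranked first to 0.4 for a target ranked fifth, and any lower rank counts as a failure.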

Movies-KB
We use a movie-centric KB constructed using the IMDBPy package. We constructed four different splits of the dataset, with increasing number of entities, whose statistics are given in Table 1. The original KB was modified to reduce the number of actors and directors in order to make the task more challenging. We randomly remove 20% of the values from the agent's copy of the KB to simulate a scenario where the KB may be incomplete. The user, however, may still know these values.

Simulated User Evaluation
We compare each of the discussed versions along three metrics: the average rewards obtained (R), success rate (S) (where success is defined as providing the user target among top $R$ results), and the average number of turns per dialogue (T). For the RL and E2E agents, during training we fix the model every 100 updates and run 2000 simulations with greedy action selection to evaluate its performance. Then after training we select the model with the highest average reward and run a further 5000 simulations and report the performance in Table 2. For reference we also show the performance of an agent which receives perfect information about the user target without any errors, and selects actions based on the entropy of the slots (Max). This can be considered as an upper bound on the performance of any agent (Wu et al., 2015).
In each case the Soft-KB versions achieve the highest average reward, which is the metric all agents optimize. In general, the trade-off between minimizing average turns and maximizing success rate can be controlled by changing the reward signal. Note that, except the E2E version, all versions share the same belief trackers, but by re-asking values of some slots they can have different posteriors $p_T^t$ to inform the results. This shows that having full information about the current state of beliefs over the KB helps the Soft-KB agent discover better policies. Further, reinforcement learning helps discover better policies than the handcrafted rule-based agents, and we see a higher reward for RL agents compared to Rule ones. This is due to the noisy natural language inputs; with perfect information the rule-based strategy is optimal. Interestingly, the RL-Hard agent has the minimum number of turns in 2 out of the 4 settings, at the cost of a lower success rate and average reward. This agent does not receive any information about the uncertainty in semantic parsing, and it tends to inform as soon as the number of retrieved results becomes small, even if they are incorrect. Among the Soft-KB agents, we see that E2E>RL>Rule, except for the X-Large KB. For E2E, the action space grows exponentially with the size of the KB, and hence credit assignment gets more difficult. Future work should focus on improving the E2E agent in this setting. The difficulty of a KB-split depends on the number of entities it has, as well as the number of unique values for each slot (more unique values make the problem easier). Hence we see that both the "Small" and "X-Large" settings lead to lower reward for the agents, since the number of unique slot values relative to the number of entities is small for them.

Human Evaluation
We further evaluate the KB-InfoBot versions trained using the simulator against real subjects, recruited from the authors' affiliations. In each session, in a typed interaction, the subject was first presented with a target movie from the "Medium" KB-split along with a subset of its associated slot-values from the KB. To simulate the scenario where end-users may not know slot values correctly, the subjects in our evaluation were presented multiple values for the slots from which they could choose any one while interacting with the agent. Subjects were asked to initiate the conversation by specifying some of these values, and respond to the agent's subsequent requests, all in natural language. We test RL-Hard and the three Soft-KB agents in this study, and in each session one of the agents was picked at random for testing. In total, we collected 433 dialogues, around 20 per subject. Figure 3 shows a comparison of these agents in terms of success rate and number of turns, and Figure 4 shows some sample dialogues from the user interactions with RL-Soft.
In comparing Hard-KB versus Soft-KB lookup methods we see that both Rule-Soft and RL-Soft agents achieve a higher success rate than RL-Hard, while E2E-Soft does comparably. They do so in an increased number of average turns, but achieve a higher average reward as well. Between RL-Soft and Rule-Soft agents, the success rate is similar, however the RL agent achieves that rate in a lower number of turns on average. RL-Soft achieves a success rate of 74% on the human evaluation and 80% against the simulated user, indicating minimal overfitting. However, all agents take a higher number of turns against real users as compared to the simulator, due to the noisier inputs.
The E2E agent gets the highest success rate against the simulator, however, when tested against real users it performs poorly with a lower success rate and a higher number of turns. Since it has more trainable components, this agent is also most prone to overfitting. In particular, the vocabulary of the simulator it is trained against is quite limited ($V^n = 3078$), and hence when real users provided inputs outside this vocabulary, it performed poorly. In the future we plan to fix this issue by employing a better architecture for the language understanding and belief tracker components (Hakkani-Tür et al., 2016; Liu and Lane, 2016; Chen et al., 2016b,a), as well as by pretraining on separate data. While its generalization performance is poor, the E2E system also exhibits the strongest learning capability. In Figure 5, we compare how different agents perform against the simulator as the temperature of the output softmax in its NLG is increased. A higher temperature means a more uniform output distribution, which leads to generic simulator responses irrelevant to the agent questions. This is a simple way of introducing noise in the utterances. The performance of all agents drops as the temperature is increased, but less so for the E2E agent, which can adapt its belief tracker to the inputs it receives. Such adaptation is key to the personalization of dialogue agents, which motivates us to introduce the E2E agent.

Conclusions and Discussion
This work is aimed at facilitating the move towards end-to-end trainable dialogue agents for information access. We propose a differentiable probabilistic framework for querying a database given the agent's beliefs over its fields (or slots). We show that such a framework allows the downstream reinforcement learner to discover better dialogue policies by providing it more information. We also present an E2E agent for the task, which demonstrates a strong learning capacity in simulations but suffers from overfitting when tested on real users. Given these results, we propose the following deployment strategy that allows a dialogue system to be tailored to specific users via learning from agent-user interactions. The system could start off with an RL-Soft agent (which gives good performance out-of-the-box). As the user interacts with this agent, the collected data can be used to train the E2E agent, which has a strong learning capability. Gradually, as more experience is collected, the system can switch from RL-Soft to the personalized E2E agent. Effective implementation of this, however, requires the E2E agent to learn quickly and this is the research direction we plan to focus on in the future.

A Posterior Derivation
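Here, we present a derivation for equation 3, i.e., the posterior over the KB slot when the user knows the value of that slot. For brevity, we drop $\Phi_j = 1$ from the condition in all probabilities below. For the case when $i \in M_j$, we can write:

$$\Pr(G_j = i) = \Pr(G_j \in M_j) \Pr(G_j = i \mid G_j \in M_j) = \frac{|M_j|}{N} \times \frac{1}{|M_j|} = \frac{1}{N},$$

where we assume all missing values to be equally likely, and estimate the prior probability of the goal being missing from the count of missing values in that slot. For the case when $T_{i,j} = v$ with $i \notin M_j$:

$$\Pr(G_j = i) = \Pr(G_j \notin M_j) \Pr(G_j = i \mid G_j \notin M_j) = \left(1 - \frac{|M_j|}{N}\right) \times \frac{p_j^t(v)}{N_j(v)},$$

where $N_j(v)$ is the number of rows with value $v$ in slot $j$, and the second term comes from taking the probability mass associated with $v$ in the belief tracker and dividing it equally among all rows with value $v$.
We can also verify that the above distribution is valid, i.e., it sums to 1:

$$\sum_i \Pr(G_j = i) = \sum_{i \in M_j} \frac{1}{N} + \sum_{i \notin M_j} \left(1 - \frac{|M_j|}{N}\right) \frac{p_j^t(T_{i,j})}{N_j(T_{i,j})} = \frac{|M_j|}{N} + \left(1 - \frac{|M_j|}{N}\right) \sum_{v \in V^j} N_j(v) \frac{p_j^t(v)}{N_j(v)} = \frac{|M_j|}{N} + \left(1 - \frac{|M_j|}{N}\right) = 1.$$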

B Gated Recurrent Units
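A Gated Recurrent Unit (GRU) (Cho et al., 2014) is a recurrent neural network which operates on an input sequence $x_1, \ldots, x_t$. Starting from an initial state $h_0$ (usually set to 0) it iteratively computes the final output $h_t$ as follows:

$$r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$$
$$z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$$
$$\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

Here $\sigma$ denotes the sigmoid nonlinearity, and $\odot$ an element-wise product.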
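For concreteness, a single GRU update can be sketched in plain Python with lists standing in for vectors and matrices. This follows the standard formulation of Cho et al. (2014); the convention that the update gate z weights the candidate state (rather than the previous state) is one of two common choices and is an assumption here, as are the function names and the parameter layout.

```python
from math import exp, tanh


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices and list vectors."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bi
        for W_row, U_row, bi in zip(W, U, b)
    ]


def gru_step(x, h_prev, params):
    """One GRU update; params maps 'r', 'z', 'h' to (W, U, b) triples."""
    W_r, U_r, b_r = params["r"]
    W_z, U_z, b_z = params["z"]
    W_h, U_h, b_h = params["h"]
    r = [sigmoid(a) for a in affine(W_r, x, U_r, h_prev, b_r)]   # reset gate
    z = [sigmoid(a) for a in affine(W_z, x, U_z, h_prev, b_z)]   # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    candidate = [tanh(a) for a in affine(W_h, x, U_h, gated, b_h)]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, candidate)]


def gru(inputs, params, hidden_size):
    """Run the GRU over a sequence, starting from a zero state."""
    h = [0.0] * hidden_size
    for x in inputs:
        h = gru_step(x, h, params)
    return h
```

In the belief tracker, the inputs would be the n-gram count vectors of the user utterances so far, and the final state would feed the softmax and sigmoid output layers.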

C REINFORCE updates
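We assume that the learner has access to a reward signal $r_t$ throughout the course of the dialogue, details of which are in the next section. We can write the expected discounted return of the agent under policy $\pi$ as follows:

$$J(\theta) = \mathbb{E}_\pi\left[\sum_{t=0}^{H} \gamma^t r_t\right]$$

Here, the expectation is over all possible trajectories $\tau$ of the dialogue, $\theta$ denotes the trainable parameters of the learner, $H$ is the maximum length of an episode, and $\gamma$ is the discounting factor. We can use the likelihood ratio trick (Glynn, 1990) to write the gradient of the objective as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau\left[\nabla_\theta \log p_\theta(\tau) \sum_{t=0}^{H} \gamma^t r_t\right], \tag{17}$$

where $p_\theta(\tau)$ is the probability of observing a particular trajectory under the current policy. With a Markovian assumption, we can write

$$p_\theta(\tau) = p(s_0) \prod_{k=0}^{H} p(s_{k+1} \mid s_k, a_k)\, \pi_\theta(a_k \mid s_k), \tag{18}$$

where $\theta$ denotes dependence on the neural network parameters. From (17) and (18) we obtain

$$\nabla_\theta J(\theta) = \mathbb{E}_\pi\left[\sum_{k=0}^{H} \nabla_\theta \log \pi_\theta(a_k) \sum_{t=0}^{H} \gamma^t r_t\right].$$

If we need to train both the policy network and the belief trackers using the reinforcement signal, we can view the KB posterior $p_T^t$ as another policy. During training then, to encourage exploration, when the agent selects the inform action we sample $R$ results from the following distribution to return to the user:

$$\mu(I) = p_T^t(i_1) \times \frac{p_T^t(i_2)}{1 - p_T^t(i_1)} \times \cdots$$

This formulation also leads to a modified version of the episodic REINFORCE update rule (Williams, 1992). Specifically, eq. 18 now becomes,

$$p_\theta(\tau) = \left[ p(s_0) \prod_{k=0}^{H} p(s_{k+1} \mid s_k, a_k)\, \pi_\theta(a_k \mid s_k) \right] \mu_\theta(I)$$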

D Hyperparameters
We use GRU hidden state size of $d = 50$ for the RL agents and $d = 100$ for the E2E, a learning rate of 0.05 for the imitation learning phase and 0.005 for the reinforcement learning phase, and minibatch size 128. For the rule agents, hyperparameters were tuned to maximize the average reward of each agent in simulations. For the E2E agent, imitation learning was performed for 500 updates, after which the agent switched to reinforcement learning. The input vocabulary is constructed from the NLG vocabulary and bigrams in the KB, and its size is 3078.

E User Simulator
At the beginning of each dialogue, the simulated user randomly samples a target entity from the EC-KB and a random combination of informable slots for which it knows the value of the target. The remaining slot-values are unknown to the user. The user initiates the dialogue by providing a subset of its informable slots to the agent and requesting an entity which matches them. In subsequent turns, if the agent requests the value of a slot, the user complies by providing it or informs the agent that it does not know that value. If the agent informs results from the KB, the simulator checks whether the target is among them and provides the reward.
We convert dialogue acts from the user into natural language utterances using a separately trained natural language generator (NLG). The NLG is trained in a sequence-to-sequence fashion, using conversations between humans collected by crowd-sourcing. It takes the dialogue actions (DAs) as input, and generates template-like sentences with slot placeholders via an LSTM decoder. Then, a post-processing scan is performed to replace the slot placeholders with their actual values, which is similar to the decoder module in (Wen et al., 2015, 2016a). In the LSTM decoder, we apply beam search, which iteratively considers the top $k$ best sentences up to time step $t$ when generating the token of the time step $t + 1$. As a trade-off between speed and performance, we use a beam size of 3 in the following experiments.
There are several sources of error in user utterances. Any value provided by the user may be corrupted by noise, or substituted completely with an incorrect value of the same type (e.g., "Bill Murray" might become just "Bill" or "Tom Cruise"). The NLG described above is inherently stochastic, and may sometimes generate utterances irrelevant to the agent request. By increasing the temperature of the output softmax in the NLG we can increase the noise in user utterances.
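The effect of the softmax temperature on the output distribution can be illustrated with a short Python sketch. The function name and the logits are hypothetical and only stand in for the scores an NLG decoder might assign to candidate tokens.

```python
from math import exp


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.1]                  # hypothetical token scores
sharp = softmax(logits, temperature=0.5)  # low temperature: peaked
flat = softmax(logits, temperature=5.0)   # high temperature: near uniform
```

At the higher temperature the most likely token is chosen far less often when sampling, which is how raising the temperature injects noise into the simulated utterances.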

Figure 2: High-level overview of the end-to-end KB-InfoBot. Components with trainable parameters are highlighted in gray.

Figure 3: Performance of KB-InfoBot versions when tested against real users. Left: Success rate, with the number of test dialogues indicated on each bar, and the p-values from a two-sided permutation test. Right: Distribution of the number of turns in each dialogue (differences in mean are significant with p < 0.01).

Figure 4: Sample dialogues between users and the KB-InfoBot (RL-Soft version). Each turn begins with a user utterance followed by the agent response. Rank denotes the rank of the target movie in the KB-posterior after each turn.

Figure 5: Average rewards against simulator as temperature of softmax in NLG output is increased. Higher temperature leads to more noise in output. Average over 5000 simulations after selecting the best model during training.

Table 2: Performance comparison. Average (±std error) for 5000 runs after choosing the best model during training. T: Average number of turns. S: Success rate. R: Average reward.