Dialog state tracking, a machine reading approach using Memory Network

In an end-to-end dialog system, the aim of dialog state tracking is to accurately estimate a compact representation of the current dialog status from a sequence of noisy observations produced by the speech recognition and natural language understanding modules. This paper introduces a novel method of dialog state tracking based on the general paradigm of machine reading and proposes to solve it using an End-to-End Memory Network (MemN2N), a memory-enhanced neural network architecture. We evaluate the proposed approach on the second Dialog State Tracking Challenge (DSTC-2) dataset. The corpus has been converted so as to frame the inference of hidden state variables as a question-answering task based on a sequence of utterances extracted from a dialog. We show that the proposed tracker gives encouraging results. We then propose to extend the DSTC-2 dataset with tasks requiring specific reasoning capabilities such as counting, list maintenance, yes-no question answering and indefinite knowledge management. Finally, we present encouraging results on these tasks using our proposed MemN2N-based tracking model.


Introduction
One of the core components of state-of-the-art and industrially deployed dialog systems is the dialog state tracker. Its purpose is to provide a compact representation of a dialog, called the dialog state, produced from past user inputs and system outputs. The dialog state summarizes the information needed to successfully maintain and finish a dialog, such as users' goals or requests. In the simplest case of a so-called slot-filling schema, the state is composed of a predefined set of variables with a predefined domain of expression for each of them. Even in the recent context of end-to-end trainable, machine-learnt dialog systems, state tracking remains a central element of such architectures. Current models, mainly based on the principle of discriminative learning, tend to share three common limitations. First, the tracking task is performed using a fixed window of past dialog utterances as support for the decision. Second, possible correlations between the tracked variables are not leveraged, and individual trackers tend to be learnt independently. Third, the tracking task is reduced to the capability of inferring values for a predefined set of latent variables. Starting from these observations, we propose to formalize the task of state tracking as a particular instance of a machine reading problem. This formalization and the proposed resolution model, called MemN2N, make it possible to define a tracker that decides at the utterance level on the basis of the entire dialog so far. The model learns to focus its attention on the parts of the dialog that are meaningful for the slot currently queried, and can potentially capture correlations between slots. To the best of our knowledge, this is the first attempt to explicitly frame the task of dialog state tracking as a machine reading problem.
Finally, such a formalization allows for the implementation of approximate reasoning capabilities, which have been shown to be crucial for machine reading tasks, while extending the task from slot instantiation to question answering. This paper is structured as follows. Section 2 recalls the main definitions associated with transactional dialogs and describes the associated problem of statistical dialog state tracking with both the generative and discriminative approaches. At the end of that section, the limitations of current models in terms of necessary annotations and reasoning capabilities are addressed. Section 3 then presents the proposed machine reading model for dialog state tracking and proposes to extend a state-of-the-art dialog state tracking dataset, DSTC-2, with several simple reasoning capabilities. Section 4 illustrates the approach with experimental results obtained on a state-of-the-art benchmark for dialog state tracking.
* Work carried out as an intern at XRCE.

Dialog State Tracking

Main Definitions
A dialog state tracking task is formalized as follows: at each turn of a dyadic dialog, the dialog agent chooses a dialog act $d$ to express and the user answers with an utterance $u$. In the simplest case, the dialog state at each turn is defined as a distribution over a set of predefined variables, which define the structure of the state (Williams et al., 2005). This classic state structure is commonly called slot filling or semantic frame. In this context, the state tracking task consists of estimating the value of a set of predefined variables in order to perform a procedure or transaction, which is the purpose of the dialog. Typically, a natural language understanding (NLU) module processes the user utterance and generates an N-best list $o = \{(d_1, f_1), \ldots, (d_n, f_n)\}$, where $d_i$ is the hypothesized user dialog act and $f_i$ is its confidence score. Various approaches have been proposed to define dialog state trackers. The traditional methods used in most commercial implementations rely on hand-crafted rules that typically use the most likely result from an NLU module (Yeh et al., 2014) and hardly model uncertainty. These rule-based systems are therefore prone to frequent errors, as the most likely result is not always the correct one (Williams, 2014).
More recent methods employ statistical approaches to estimate the posterior distribution over the dialog states, allowing them to leverage the uncertainty of the results of the NLU module. In the simplest case where no ASR and NLU modules are employed, as in a text-based dialog system (Henderson et al., 2013), the utterance is taken as the observation using a so-called bag-of-words representation. If an NLU module is available, standardized dialog act schemes can be considered as observations (Bunt et al., 2010). Furthermore, if prosodic information is available from the ASR component of the dialog system (Milone and Rubio, 2003), it can also be considered as part of the observation definition. A statistical dialog state tracker maintains, at each discrete time step $t$, the probability distribution over states, $b(s_t)$, which is the system's belief over the state. The actual slot-filling process is composed of the cyclic tasks of information gathering and integration, in other words, dialog state tracking. In such a framework, the purpose is to estimate the correct instantiation of each variable as early as possible in the course of a given dialog. In the following, we assume the state is represented as a set of variables with a set of known possible values associated with each of them. Furthermore, in the context of this paper, only the bag of words has been considered as an observation at a given turn, but dialog acts or detected named entities provided by an SLU module could also have been incorporated.
Two statistical approaches have been considered for maintaining the distribution over a state given sequential NLU output. First, the discriminative approach aims to model the posterior probability distribution of the state at time $t + 1$ given the state at time $t$ and the observations $z_{1:t}$. Second, the generative approach attempts to model the transition probability and the observation probability in order to exploit possible interdependencies between the hidden variables that comprise the dialog state.

Generative Dialog State Tracking
A generative approach to dialog state tracking computes the belief over the state using Bayes' rule, with the belief from the last turn $b(s_{t-1})$ as a prior and the likelihood of the user utterance hypotheses $p(z_t \mid s_t)$, where $z_t$ is the observation gathered at time $t$. In prior work (Williams et al., 2005), the likelihood is factored and some independence assumptions are made:
$$b_t \propto \sum_{s_{t-1}, z_t} p(s_t \mid z_t, s_{t-1})\, p(z_t \mid s_{t-1})\, b(s_{t-1}).$$
A typical generative model uses a factorial hidden Markov model (Ghahramani and Jordan, 1997). In this family of approaches, scalability is considered one of the main issues. One way to reduce the amount of computation is to group the states into partitions, as proposed in the Hidden Information State (HIS) model (Gasic and Young, 2011). Another way to cope with the scalability problem in dialog state tracking is to adopt a factored dynamic Bayesian network by making conditional independence assumptions among dialog state components, and then to use approximate inference algorithms such as loopy belief propagation (Thomson and Young, 2010) or blocked Gibbs sampling (Raux and Ma, 2011). To cope with such limitations, the discriminative methods of state tracking presented in the next part of this section aim to directly model the posterior distribution of the tracked state using a chosen parametric form.
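As an illustration of the Bayes'-rule update described above, the following is a minimal sketch of one belief-tracking step for a single discrete slot. It assumes a standard hidden-Markov-style filter, $b(s_t) \propto p(z_t \mid s_t) \sum_{s_{t-1}} p(s_t \mid s_{t-1})\, b(s_{t-1})$, which is a simplification and not the exact factorization of the cited work. The slot values, the transition matrix and the likelihood vector are hypothetical numbers chosen for the example.

```python
import numpy as np

def update_belief(prior, transition, likelihood):
    """One Bayesian belief update over a discrete slot.

    prior:      b(s_{t-1}), shape (S,)
    transition: p(s_t | s_{t-1}), shape (S, S), rows indexed by s_{t-1}
    likelihood: p(z_t | s_t) for the observed z_t, shape (S,)
    """
    predicted = prior @ transition         # marginalize over s_{t-1}
    unnormalized = likelihood * predicted  # numerator of Bayes' rule
    return unnormalized / unnormalized.sum()

# Hypothetical "area" slot with three possible values.
values = ["north", "south", "none"]
prior = np.array([0.2, 0.2, 0.6])
transition = np.array([
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.10, 0.10, 0.80],
])
# Hypothetical likelihood of the current observation under each value.
likelihood = np.array([0.7, 0.2, 0.1])

belief = update_belief(prior, transition, likelihood)
best_value = values[int(np.argmax(belief))]
```

The cost of the marginalization grows with the square of the number of joint states, which is one way to see why scalability is a concern for this family of approaches.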

Discriminative Dialog State Tracking
The discriminative approach to dialog state tracking computes the belief over a state via a parametric model that directly represents the belief $b(s_{t+1}) = p(s_{t+1} \mid s_t, z_t)$. For example, Maximum Entropy has been widely used in the discriminative approach (Metallinou et al., 2013). It formulates the belief as a log-linear model:
$$b(s) = P(s \mid x) = \eta\, e^{w^\top \phi(x, s)},$$
where $\eta$ is a normalizing constant and $x$ gathers the history of dialog acts at turns $i \in \{1, \ldots, t\}$ and the sequence of states leading to the current dialog turn at time $t$. Then, $\phi(\cdot)$ is a vector of feature functions on $x$ and $s$. Finally, $w$ is the set of model parameters to be learned from annotated dialog data. Deep neural models, operating on a sliding window of features extracted from previous user turns, have also been proposed (Henderson et al., 2014c). In the current literature, this family of approaches has proven to be the most effective on publicly available state tracking datasets. Recently, deep learning based models implementing this strategy (Henderson et al., 2014a; Williams et al., 2016) have shown state-of-the-art results. These approaches tend to leverage word representations trained in an unsupervised manner (Mikolov et al., 2013).
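To make the maximum entropy formulation concrete, here is a minimal sketch that assumes the usual log-linear form, in which the belief over candidate states is a softmax of the scores $w^\top \phi(x, s)$. The feature matrix and the weights are hypothetical values, not parameters learned from DSTC-2.

```python
import numpy as np

def maxent_belief(features, w):
    """Log-linear belief over candidate states.

    features: array of shape (S, F); row s holds phi(x, s)
    w:        array of shape (F,), the model parameters
    """
    scores = features @ w
    scores = scores - scores.max()   # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()   # division plays the role of eta

# Hypothetical features for three candidate values of one slot.
features = np.array([
    [1.0, 0.0],   # value mentioned in the current utterance
    [0.0, 1.0],   # value mentioned earlier in the dialog
    [0.0, 0.0],   # value never mentioned
])
w = np.array([2.0, 0.5])
maxent_b = maxent_belief(features, w)
```

In practice $w$ would be fitted by maximizing the conditional likelihood of the annotated states, a step that is omitted here.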

Current Limitations
Using error analysis (Henderson et al., 2014b), three limitations can be observed in the application of these inference approaches. First, current models tend to fail at handling the long-range dependencies that occur in dialogs. For example, coreferences, inter-utterance information and correlations between slots have been shown to be difficult to handle even with recurrent network models (Henderson et al., 2014a). To illustrate the inter-slot correlation, Figure 1 depicts the t-SNE (van der Maaten and Hinton, 2008) projection of the final dialog states of the DSTC-2 training set. Second, reasoning capabilities, as required in machine reading applications (Poon and Domingos, 2010; Etzioni et al., 2007; Berant et al., 2014), remain absent from these classic formalizations of dialog state tracking. Finally, the definition of tracking is limited to the capability of instantiating a predefined set of slots. In the next section, we present a model of dialog state tracking that aims to leverage the current advances of MemN2N, a memory-enhanced neural network, and its approximate reasoning capabilities, which seem particularly suited to the sequential, long-range-dependent and sparse nature of complex dialog state tracking tasks. Furthermore, this model makes it possible to relax the hypothesis of strict utterance-level annotation, which does not correspond to common practice in industrial applications of transactional conversational user interfaces, where annotations tend to be placed at a multi-utterance or full-dialog level only.

Machine Reading Formulation of Dialog State Tracking
We propose to formalize the dialog state tracking task as a machine reading problem (Etzioni et al., 2007; Berant et al., 2014). In this section, we recall the main definitions of the task of machine reading, then describe MemN2N, a memory-enhanced neural network architecture proposed to handle such tasks in the context of dialogs. Finally, we formalize the task of dialog state tracking as a machine reading problem and propose to solve it using a memory-enhanced neural inference architecture.

Machine Reading
The task of textual understanding has recently been formulated as a supervised learning problem (Kumar et al., 2015; Hermann et al., 2015). This task consists in estimating the conditional probability $p(a \mid d, q)$ of an answer $a$ to a question $q$, where $d$ is a document. Such an approach requires a large training corpus of {Document, Query, Answer} triples, and until now such corpora have been limited to hundreds of examples (Richardson et al., 2013). In the context of dialog state tracking, it can be understood as the capability of inferring a set of latent values $l$ associated with a set of variables $v$ related to a given dyadic or multi-party conversation $d$, from direct correlation and/or reasoning over the course of the exchange of utterances, $p(l \mid d, v)$. State updates at the utterance level are rarely provided off-the-shelf by a production environment. In these environments, annotation is often performed afterwards for the purpose of logging, monitoring or quality assessment. In the limiting case of human-to-human dialogs, dialog-level annotation remains common practice, especially in personal assistance, customer care dialogs and, more generally, industrial applications of transactional conversational user interfaces. Another frequent setting consists of informing the state after a given number of utterances exchanged between the speakers. An additional annotation effort is therefore often needed in order to train a state-of-the-art statistical state tracking model (Henderson et al., 2014b). In that sense, formalizing dialog state tracking at a sub-dialog level, in order to infer hidden state variables with respect to a list of utterances running from the first one to any given utterance of a dialog, seems particularly appropriate. In the context of dialog state tracking challenges, the DSTC-4 dialog corpus has been designed for this purpose but only consists of 22 dialogs.
Concerning the DSTC-2 corpus, the training data contains 2207 dialogs (15611 turns) and the test set consists of 1117 dialogs (Williams et al., 2016). This dataset is therefore more suitable for our experiments.
For these reasons, the machine reading paradigm is a promising formulation for the general problem of dialog state tracking. Furthermore, current approaches and available datasets for state tracking do not explicitly cover reasoning capabilities such as temporal and spatial reasoning, counting, sorting and deduction. We suggest that future datasets should include dialogs that exercise such specific abilities. In the last part of this section, we suggest several reasoning enhancements to the DSTC-2 dataset.

End-to-End Memory Networks
The MemN2N architecture consists of two main components: supporting memories and final answer prediction. Supporting memories are in turn composed of a set of input and output memory representations with memory cells. The input and output memory cells, denoted by $m_i$ and $c_i$, are obtained by transforming the input context $x_1, \ldots, x_n$ (i.e., a set of utterances) using two embedding matrices $A$ and $C$ (both of size $d \times |V|$, where $d$ is the embedding size and $|V|$ the vocabulary size) such that $m_i = A\Phi(x_i)$ and $c_i = C\Phi(x_i)$, where $\Phi(\cdot)$ is a function that maps the input into a bag-of-words vector of dimension $|V|$.
Similarly, the question $q$ is encoded using another embedding matrix $B \in \mathbb{R}^{d \times |V|}$, resulting in a question embedding $u = B\Phi(q)$. The input memories $\{m_i\}$, together with the embedding of the question $u$, are used to determine the relevance of each of the stories in the context, yielding a vector of attention weights
$$p_i = \mathrm{softmax}(u^\top m_i), \qquad (1)$$
where $\mathrm{softmax}(a_i) = e^{a_i} / \sum_j e^{a_j}$. Subsequently, the response $o$ from the output memory is constructed by the weighted sum
$$o = \sum_i p_i c_i. \qquad (2)$$
Other models of parametric encoding for the question and the document have been proposed in (Kumar et al., 2015). For the purpose of this presentation, we keep the definition of $\Phi$ given above. For more difficult tasks requiring multiple supporting memories, the model can be extended to include more than one set of input/output memories by stacking a number of memory layers. In this setting, each memory layer is called a hop, and the $(k + 1)$-th hop takes as input the output of the $k$-th hop:
$$u^{k+1} = o^k + u^k. \qquad (3)$$
Lastly, the final step, the prediction of the answer to the question $q$, is performed by
$$\hat{a} = \mathrm{softmax}(W(o^K + u^K)), \qquad (4)$$
where $\hat{a}$ is the predicted answer distribution, $W \in \mathbb{R}^{|V| \times d}$ is a parameter matrix for the model to learn and $K$ is the total number of hops. Two weight tying schemes for the embedding matrices have been introduced:
1. Adjacent: the output embedding matrix in the $k$-th hop is shared with the input embedding matrix in the $(k + 1)$-th hop, i.e., $A^{k+1} = C^k$ for $k \in \{1, \ldots, K - 1\}$. Also, the weight matrix $W$ in Equation (4) is shared with the output embedding matrix in the last memory hop, such that $W^\top = C^K$.
2. Layer-wise: all the weight matrices $A^k$ and $C^k$ are shared across different hops, i.e., $A^1 = A^2 = \ldots = A^K$ and $C^1 = C^2 = \ldots = C^K$.
In the next section, we show how the task of dialog state tracking can be formalized as a machine reading task and solved using such a memory-enhanced model.
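The forward pass described above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the implementation used in the experiments: it uses the bag-of-words encoding $\Phi$, adjacent weight tying, randomly initialized matrices, and a toy vocabulary; the function names `bag_of_words` and `memn2n_forward` are hypothetical.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())   # shift for numerical stability
    return e / e.sum()

def bag_of_words(tokens, vocab):
    """Phi(.): map a list of tokens to a count vector of dimension |V|."""
    vec = np.zeros(len(vocab))
    for tok in tokens:
        vec[vocab[tok]] += 1.0
    return vec

def memn2n_forward(utterances, question, vocab, embeddings, B, W):
    """K-hop forward pass with adjacent weight tying.

    embeddings: list of K + 1 matrices E_0 .. E_K, each of shape (d, |V|),
                so that hop k (1-based) uses A^k = E_{k-1} and C^k = E_k.
    B: question embedding matrix, shape (d, |V|)
    W: answer matrix, shape (|V|, d)
    """
    X = np.stack([bag_of_words(u, vocab) for u in utterances])  # (n, |V|)
    u = B @ bag_of_words(question, vocab)                       # (d,)
    for k in range(len(embeddings) - 1):
        m = X @ embeddings[k].T        # input memories,  shape (n, d)
        c = X @ embeddings[k + 1].T    # output memories, shape (n, d)
        p = softmax(m @ u)             # attention over utterances
        o = p @ c                      # weighted sum of output memories
        u = o + u                      # input to the next hop
    return softmax(W @ u)              # u now equals o^K + u^K

# Toy example with a hypothetical vocabulary and random parameters.
words = "i want chinese food in the north what is area".split()
vocab = {w: i for i, w in enumerate(words)}
rng = np.random.default_rng(0)
d, K = 4, 3
embeddings = [rng.normal(0.0, 0.1, (d, len(vocab))) for _ in range(K + 1)]
B = rng.normal(0.0, 0.1, (d, len(vocab)))
W = embeddings[-1].T                   # adjacent tying of the answer matrix

utterances = [["i", "want", "chinese", "food"], ["in", "the", "north"]]
question = ["what", "is", "the", "area"]
answer_dist = memn2n_forward(utterances, question, vocab, embeddings, B, W)
```

With untrained parameters the output distribution is close to uniform; training would adjust the embedding matrices so that attention concentrates on the utterances that support the answer.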

Dialog Reading Model for State Tracking
In this section, we formalize dialog state tracking using the paradigm of machine reading. To the best of our knowledge, this is the first attempt to apply this approach and to develop a specific dataset format, detailed in Section 4, from an existing and publicly available dialog state tracking challenge dataset. Assuming (1) a dyadic dialog $d$ composed of a list of utterances and (2) a state composed of (2a) a set of variables $v_i$ with $i \in \{1, \ldots, n\}$ and (2b) a set of corresponding assigned values $l_i$, one can define a question $q_{v_i}$ that corresponds to the specific querying of a variable in the context of a dialog, $p(l_i \mid q_{v_i}, d)$. In this context, a dialog state tracking task consists in determining, for each variable $v_i$, $l^* = \arg\max_{l_i \in L} p(l_i \mid q_{v_i}, d)$, with $L$ the specific domain of expression of the variable $v_i$.
In addition to the actual dataset, we propose to investigate four general reasoning tasks using the DSTC-2 dataset as a starting point. In this way, we leverage the DSTC-2 dataset to create more complex reasoning tasks than those present in its original dialogs, by performing rule-based modifications of the corpus. The goal is to develop resolution algorithms that are not dedicated to a specific reasoning task, but inference models that are as generic as possible. In the rest of the section, each of the reasoning tasks associated with dialog state tracking is described and the generation protocol is explained with examples.
Factoid Questions : This first task corresponds to the current formulation of dialog state tracking. It consists of questions where a previously given set of supporting facts, potentially amongst a set of other irrelevant facts, provides the answer. This kind of task was already employed in (Weston et al., 2014) in the context of a virtual world. In that sense, the results obtained on this task are comparable with state-of-the-art approaches.
Yes/No Questions : This task tests the ability of a model to answer true/false type questions like "Is the food italian ?". The conversion of a dialog to such a format is deterministic, given that the utterances and the corresponding true states are known at each utterance of a given dialog.
Indefinite Knowledge : This task tests a more complex natural language construction. It tests whether statements can be modeled in order to describe possibilities rather than certainties, as proposed in (Weston et al., 2014). In our case, the answer will be "maybe" to the question "Is the price-range required moderate ?" if the slot has not been mentioned yet in the current dialog. In the case of state tracking, this makes it possible to deal seamlessly with unknown information about the dialog state. Concretely, this set of questions and answers is generated as a superset of the Yes/No Questions set. First, sub-dialogs starting from the first utterance of a given dialog are extracted under the condition that a given slot is not informed in the corresponding annotation. Then, a question-answer pair is generated.
Counting and Lists/Sets : This last task tests the capacity of the model to perform simple counting operations, by asking about the number of objects with a certain property, e.g. "How many area are requested ?". Similarly, the ability to produce a set of single-word answers in the form of a list, e.g. "What are the area requested ?", is investigated. Table 1 gives an example of each of these question types on a dialog sample of the DSTC-2 corpus.
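The question types described above can be generated deterministically from an annotated state. The sketch below illustrates one possible way to do so; the state format (a dictionary mapping each slot to the list of values informed so far), the factoid question template and the function name `make_questions` are assumptions made for the example, while the other question templates follow the examples quoted in the text.

```python
def make_questions(state, slot, candidate_value):
    """Build (question, answer) pairs for one slot of an annotated state.

    state: dict mapping a slot name to the list of values informed so far
           (hypothetical format, not the DSTC-2 annotation schema)
    """
    values = state.get(slot, [])
    pairs = []

    # Factoid question: only when exactly one value is informed.
    if len(values) == 1:
        pairs.append((f"What is the {slot} ?", values[0]))

    # Yes/No question, extended with "maybe" for indefinite knowledge
    # when the slot has not been mentioned yet.
    if not values:
        answer = "maybe"
    elif candidate_value in values:
        answer = "yes"
    else:
        answer = "no"
    pairs.append((f"Is the {slot} {candidate_value} ?", answer))

    # Counting and list questions: the list answer is kept as a single
    # element of the answer set.
    if values:
        pairs.append((f"How many {slot} are requested ?", str(len(values))))
        pairs.append((f"What are the {slot} requested ?", ",".join(sorted(values))))
    return pairs

# Hypothetical state after a few utterances.
state = {"area": ["north", "west"], "food": ["chinese"]}
food_pairs = make_questions(state, "food", "italian")
area_pairs = make_questions(state, "area", "north")
price_pairs = make_questions(state, "pricerange", "moderate")
```

Here `price_pairs` contains a single pair whose answer is "maybe", since the price range has not been informed in this hypothetical state.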
Inference procedure: Concretely, the current set of utterances of a dialog is placed into the memory using sentence-based encoding, and the question is encoded as the controller state at $t = 1$. The answer is produced using a softmax operation over the answer vocabulary, which is assumed to be fixed. We consider this hypothesis valid in the case of factoid and list questions because the set of values for a given variable is often considered known. In the case of Yes/No and Indefinite Knowledge questions, {Yes, No, Maybe} are added to the output vocabulary. Following (Weston et al., 2014), a list-task answer is considered as a single element in the answer set, and so is the answer to a count question. A possible alternative would be to change the activation function used at the output of the MemN2N from a softmax to a logistic one and to use a categorical cross-entropy loss. A drawback of such an alternative would be the necessity of cross-validating a decision threshold in order to select the eligible answers. Concerning the individual numbers for the count question set, the numbers found in the training set are added to the vocabulary.
We believe more reasoning capabilities need to be explored in the future, such as spatial and temporal reasoning or deduction. However, this will probably require the development of a new dedicated resource. Another alternative could be to develop a question-answering annotation task based on a dialog corpus where reasoning tasks are present. The closest work to our proposal is (Bordes and Weston, 2016). In that paper, the authors define a so-called end-to-end learnable dialog system that infers an answer from a finite set of eligible answers with respect to the current list of utterances of the dialog. The authors generate 5 artificial dialog tasks. However, the reasoning capabilities are not explicitly addressed, and the authors explicitly state that the resulting dialog system is not yet satisfactory. We believe that a proper dialog state tracker, with a policy built on top of it, can guarantee dialog completion by properly optimizing a reward function through an explicitly learnt dialog policy. In the case of proper end-to-end systems, the objective function is still not explicitly defined (Serban et al., 2015), and the resulting systems tend to be used in the context of chat-oriented, non-goal-oriented dialog systems. In the next section, we present experimental details and results obtained on the basis of the DSTC-2 dataset and its conversion to the four reasoning tasks mentioned above.

Dataset and Data Preprocessing
In the DSTC-2 dialog corpus, a user queries a database of local restaurants by interacting with a dialog system. A dialog proceeds as follows: first, the user specifies constraints concerning the restaurant. Then, the system offers the name of a restaurant that satisfies the constraints. Finally, the user accepts the offer and requests additional information about the accepted restaurant. In this context, the dialog state tracker should be able to track several types of information that compose the state, such as the geographic area, the food type and the price range slots. In order to make the experiments comparable, sub-dialogs running from the first utterance to each utterance of each dialog of the corpus have been generated. The corresponding question-answer pairs have been generated using the annotated state of each sub-dialog. In the case of factoid questions, this setting allows for a fair comparison with the prior art at the utterance level of state tracking. The same protocol has been adopted for the generated reasoning tasks. In order to exhibit the reasoning capability of the proposed model in the context of dialog state tracking, three other datasets have been automatically generated from the dialog corpus in order to support the 3 reasoning capabilities described in Section 3.3. Dialog modification has been required for two reasoning tasks, List and Count. Two types of rules have been developed to automatically produce modified dialogs. First, string matching has been performed to determine the position of a slot value in a given utterance, and an alternative statement has been produced as a substitution. For example, the utterance "I'm looking for a chinese restaurant in the north" can be replaced by "I'm looking for a chinese restaurant in the north or the west of town". A second type of modification has been performed in an inter-utterance fashion.
For example, assuming a given value "north" has been informed in the current state of a given dialog, one can add later in the dialog a remark like "I would also accept a place east side of town". This kind of statement tends not to affect the overall flow of the dialog and makes it possible to add richer semantics to the dialog. In the future, we plan to develop a richer set of generation procedures to augment the dataset. Nevertheless, we believe this simple dialog augmentation strategy is sufficient to exhibit the competency of the proposed model beyond factoid questions.
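The two preprocessing steps described above, sub-dialog generation and rule-based substitution, can be sketched as follows. This is a simplified illustration: the function names `sub_dialogs` and `add_alternative` and the exact matching pattern are assumptions, and the actual rules applied to the corpus may differ.

```python
def sub_dialogs(utterances):
    """Return every prefix of a dialog, from the first utterance
    up to each utterance in turn."""
    return [utterances[:i] for i in range(1, len(utterances) + 1)]

def add_alternative(utterance, slot_value, alternative):
    """String-matching rule: extend the first mention of a slot value
    with an alternative value, leaving other utterances unchanged."""
    marker = f"the {slot_value}"
    if marker not in utterance:
        return utterance
    replacement = f"the {slot_value} or the {alternative} of town"
    return utterance.replace(marker, replacement, 1)

dialog = [
    "Hello, how may I help you?",
    "I'm looking for a chinese restaurant in the north",
    "What price range would you like?",
]
prefixes = sub_dialogs(dialog)
modified = add_alternative(dialog[1], "north", "west")
```

Each prefix would then be paired with the questions generated from the state annotated at its last utterance, with the annotation of the modified dialogs updated to include the added value.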

Training Details
As suggested in (Sukhbaatar et al., 2015), 10% of the training set was held out to form a validation set for hyperparameter tuning. Concerning the utterance encoding, we use the so-called Temporal Encoding technique, since reading tasks require some notion of temporal context. To enable the model to address this, the memory vector is modified such that $m_i = \sum_j A x_{ij} + T_A(i)$, where $T_A(i)$ is the $i$-th row of a dedicated matrix $T_A$ that encodes temporal information. The output embedding is augmented in the same way with a matrix $T_C$ (i.e., $c_i = \sum_j C x_{ij} + T_C(i)$). Both $T_A$ and $T_C$ are learned during training in an end-to-end fashion. They are also subject to the same sharing constraints as $A$ and $C$. The embedding matrices $A$ and $B$ are initialized using the GoogleNews word2vec embedding model (Mikolov et al., 2013). As also suggested in (Sukhbaatar et al., 2015), utterances are indexed in reverse order, reflecting their relative distance from the question, so that $x_1$ is the last sentence of the dialog. Furthermore, the adjacent weight tying scheme has been adopted. The learning rate $\eta$ is initially set to 0.005 and is halved every 25 epochs until 100 epochs are reached. Linear start is used in all our experiments, as proposed by (Sukhbaatar et al., 2015). More precisely, the softmax function in each memory layer is removed and re-inserted after 20 epochs. The batch size is set to 16, and gradients with an $L_2$ norm larger than 40 are divided by a scalar to have norm 40. All weights are initialized randomly from a Gaussian distribution with zero mean and $\sigma = 0.1$. In all our experiments, we have tested embedding sizes $d \in \{20, 40, 60\}$.
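Some of the training details above translate directly into small helper functions. The following sketch shows the learning rate schedule, the gradient norm clipping and the temporal encoding of the input memories; it assumes 0-indexed epochs and bag-of-words utterance vectors, and the function names are hypothetical.

```python
import numpy as np

def learning_rate(epoch, base=0.005, every=25):
    """Learning rate halved every `every` epochs (epochs are 0-indexed)."""
    return base / (2 ** (epoch // every))

def clip_gradient(grad, max_norm=40.0):
    """Rescale a gradient whose L2 norm exceeds max_norm to norm max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def temporal_memory(bags, A, T_A):
    """Input memories with temporal encoding: m_i = A Phi(x_i) + T_A(i).

    bags: bag-of-words vectors, shape (n, |V|), row 0 being the most
          recent utterance (reverse indexing)
    A:    embedding matrix, shape (d, |V|)
    T_A:  temporal matrix, shape (n_max, d) with n_max >= n
    """
    return bags @ A.T + T_A[: bags.shape[0]]

rates = [learning_rate(e) for e in (0, 24, 25, 50, 99)]
clipped = clip_gradient(np.array([30.0, 40.0]))
```

The output memories $c_i$ would be computed in the same way with $C$ and $T_C$ in place of $A$ and $T_A$.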
After validation, each model uses a 5-hop configuration. Table 4 presents the performance obtained for the four reasoning tasks. The results obtained lead us to think that MemN2N is a competitive alternative for the task of dialog state tracking, and that it also broadens the definition of the general dialog state tracking task to machine reading and reasoning. In the future, we believe new reasoning capabilities such as spatial and temporal reasoning and deduction should be explored on the basis of a specifically designed dataset.

Conclusion and Further Work
This paper describes a novel method of dialog state tracking based on the paradigm of machine reading and solved using MemN2N, a memory-enhanced neural network architecture. In this context, a dataset format inspired by the current datasets of machine reading tasks has been developed for this task. It is the first attempt to solve this classic sub-problem of dialog management in such a way. Beyond the experimental results presented in the experimental section, the proposed approach offers several advantages compared to state-of-the-art tracking methods. First, the proposed method makes it possible to perform tracking on the basis of segment-level or dialog-level annotation instead of utterance-level annotation, which is commonly accepted in academic datasets but tedious to produce in a large-scale industrial environment. Second, we propose to develop dialog corpora requiring reasoning capabilities in order to exhibit the potential of the proposed model. In future work, we plan to address more complex tasks such as spatial and temporal reasoning, sorting or deduction, and to experiment with other memory-enhanced inference models. In particular, we plan to compare the same approach with the Stack-Augmented Recurrent Neural Network (Joulin and Mikolov, 2015) and the Neural Turing Machine (Graves et al., 2014), which also seem promising for this family of reasoning tasks.