Recurrent Polynomial Network for Dialogue State Tracking with Mismatched Semantic Parsers

Recently, the constrained Markov Bayesian polynomial (CMBP) has been proposed as a data-driven rule-based model for dialogue state tracking (DST). CMBP is an approach to bridge rule-based models and statistical models. The Recurrent Polynomial Network (RPN) is a recent statistical framework that takes advantage of rule-based models and achieves state-of-the-art performance on the DSTC-3 corpus, outperforming all trackers submitted to DSTC-3, including RNN. It is widely acknowledged that the reliability of SLU greatly influences tracker performance, especially when the training SLU is poorly matched to the testing SLU. In this paper, this effect is analyzed in detail for RPN. Experiments show that RPN's tracking results are consistently the best among the rule-based and statistical models investigated on different SLUs, including mismatched ones, and demonstrate that RPN is very robust to mismatched semantic parsers.


Introduction
Dialogue management is the core of a spoken dialogue system. As a dialogue progresses, dialogue management usually accomplishes two missions. One mission is dialogue state tracking (DST), the process of estimating the distribution over dialogue states. The other mission, referred to as dialogue decision making, is to choose semantics-level machine dialogue acts that direct the dialogue given the information in the dialogue state. Due to unpredictable user behaviours and inevitable automatic speech recognition (ASR) and spoken language understanding (SLU) errors, dialogue state tracking and decision making are difficult (Williams and Young, 2007). Consequently, much research has been devoted to statistical dialogue management. In previous studies, dialogue state tracking and decision making were usually investigated together. In recent years, to advance research on statistical dialogue management, the DST problem has been separated from the statistical dialogue management framework so that a variety of models can be investigated for DST. Moreover, shared research tasks like the Dialog State Tracking Challenge (DSTC) (Williams et al., 2013; Henderson et al., 2014a; Henderson et al., 2014b) have provided a common testbed and evaluation suite to facilitate direct comparisons among DST models.
Two categories of DST models are broadly known, i.e., rule-based models and statistical models. Recent studies on the constrained Markov Bayesian polynomial (CMBP) framework took the first step towards bridging the gap between rule-based and statistical approaches for DST (Sun et al., 2014a). CMBP formulates rule-based DST in a general way and allows data-driven rules to be generated, so performance can be improved when training data is available. This enables CMBP to achieve performance competitive with state-of-the-art statistical approaches while keeping most of the advantages of rule-based models. Nevertheless, adding features to CMBP is not as easy as in most other statistical approaches, because additional prior knowledge must be added to keep the search space tractable (Sun et al., 2014a). For the same reason, increasing the model complexity is difficult. To tackle this weakness of CMBP, the recurrent polynomial network (RPN) was proposed to further bridge the gap between rule-based and statistical approaches for DST. RPN's unique structure gives the framework all the advantages of CMBP. Additionally, RPN has more properties of statistical approaches than CMBP. RPN uses gradient descent, whereas CMBP uses hill climbing. Hence RPN can train its parameters faster, and its parameter space is not limited to a grid in which parameters only take values that are multiples of a constant.
SLU is usually the input module of a tracker; hence its performance greatly affects state tracking performance. However, it is hard to design a reliable parser because of ASR errors and the difficulty of obtaining in-domain data. Further, in real-world end-to-end dialogue systems it is common for the SLU used on a tracker's training data to be very different from the SLU used on its testing data. Thus, RPN is evaluated on SLUs that vary widely in quality, and especially in the case where the SLU for training mismatches the SLU for testing. RPN consistently shows the best results among the trackers investigated on all SLUs.
The contribution of this paper is to investigate more complex RPN structures with deeper layers, multiple activation nodes and more features, and to evaluate RPN's performance under mismatched SLU conditions.
The rest of the paper is organized as follows. Section 2 introduces rule-based models and statistical models used in DST. Section 3 introduces two frameworks, CMBP and RPN, that bridge rule-based models and statistical models. Complex RPN structures are also introduced in this section. Section 4 discusses the influence of SLU on tracking and the SLU mismatch condition. Section 5 evaluates RPN with different structures and features, and these results are compared with state-of-the-art trackers in DSTC-3. The performance of rule-based models, statistical models and mixed models in cases where the testing parser mismatches the training parser is also compared. Finally, section 6 concludes the paper.

Rule-based and Statistical Models for DST
The results of the DSTCs demonstrated the power of statistical approaches, such as Maximum Entropy (MaxEnt) (Lee and Eskenazi, 2013), Conditional Random Field (Lee, 2013), Deep Neural Network (DNN) (Sun et al., 2014b), and Recurrent Neural Network (RNN) (Henderson et al., 2014d). However, statistical approaches have some disadvantages. For example, statistical approaches sometimes show large variation in performance and poor generalisation ability because of lack of data (Williams, 2012). Moreover, statistical models usually have a complex model structure and complex features, and thus can hardly achieve portability and interpretability. In addition to statistical approaches, rule-based approaches have also been investigated in DSTC due to their efficiency, portability and interpretability and some of them showed good performance and generalisation ability in DSTC (Zilka et al., 2013;Wang and Lemon, 2013).
However, the performance of rule-based models is usually not competitive with the best statistical approaches. Furthermore, a general way to design rule-based models with prior knowledge is lacking, and their performance can hardly be improved when training data is available.

Bridging Rule-based models and statistical models
There are two ways of bridging rule-based approaches and statistical approaches. One starts from rule-based models and uses data-driven approaches to find a good rule, while the other one is a statistical model taking advantage of prior knowledge and constraints.

Constrained Markov Bayesian Polynomial
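Constrained Markov Bayesian Polynomial (CMBP) (Sun et al., 2014a) takes the first way of bridging rule-based models and statistical models. Several probability features extracted from SLU results, shown below, are used in CMBP for each slot (Sun et al., 2014a):
• $P^+_t(v)$: sum of scores of SLU hypotheses informing or affirming value $v$ at turn $t$
• $P^-_t(v)$: sum of scores of SLU hypotheses denying or negating value $v$ at turn $t$
• $\tilde{P}^+_t(v) = \sum_{v' \notin \{v, \mathrm{None}\}} P^+_t(v')$
• $\tilde{P}^-_t(v) = \sum_{v' \notin \{v, \mathrm{None}\}} P^-_t(v')$
• $b^r_t$: belief of the value being None at turn $t$
• $b_t(v)$: belief of the value being $v$ at turn $t$
With these probability features, a CMBP model is defined by
$$b_t(v) = \mathcal{P}\left(P^+_t(v), P^-_t(v), \tilde{P}^+_t(v), \tilde{P}^-_t(v), b^r_{t-1}, b_{t-1}(v)\right) \quad \text{s.t. constraints} \qquad (1)$$
where $\mathcal{P}$ is a multivariate polynomial function defined as
$$\mathcal{P}(x_1, \dots, x_D) = \sum_{0 \le k_1 \le k_2 \le \dots \le k_n \le D} g_{k_1, \dots, k_n} \prod_{1 \le i \le n} x_{k_i} \qquad (2)$$
where $k_i$ is an index into the input variables, $n$, called the order of the CMBP, is the order of the polynomial, $D$ denotes the number of inputs with $x_0 = 1$, and $g$ is the parameter of CMBP. In CMBP, prior knowledge or intuition is encoded by the constraints in equation (1). For example, the intuition that the goal belief should be unchanged or positively correlated with the positive scores from SLU can be written as a constraint:
$$\frac{\partial \mathcal{P}\left(P^+_t(v), P^-_t(v), \tilde{P}^+_t(v), \tilde{P}^-_t(v), b^r_{t-1}, b_{t-1}(v)\right)}{\partial P^+_t(v)} \ge 0$$
Further, these constraints are approximated by linear forms (Sun et al., 2014a).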
With a set of linear constraints, integer linear programming can be used to obtain the integer parameters that satisfy the relaxed constraints. Then the tracking accuracy of each parameter setting can be evaluated and the best one is picked. Hill climbing can further be used to extend the best integer-coefficient CMBP to a real-coefficient CMBP.
Note that in practice order 3 ($n = 3$) is used to balance performance and complexity (Sun et al., 2014a). 3-order CMBP has achieved state-of-the-art performance on DSTC-2/3.
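As an illustration of how a polynomial belief update of this kind can be evaluated, the following minimal Python sketch enumerates all non-decreasing index tuples of a given order over the inputs, with a constant $x_0 = 1$ prepended. The function name `polynomial`, the dictionary layout of the coefficients and the toy rule are assumptions made for this example; they are not the coefficients or constraints of any trained CMBP.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial(inputs, coeffs, order=3):
    """Evaluate an order-`order` polynomial over `inputs`.

    x_0 = 1 is prepended so that lower-order monomials are covered too.
    `coeffs` maps a non-decreasing index tuple (k_1, ..., k_n) to its
    coefficient g; tuples that are absent count as zero.
    """
    x = [1.0] + list(inputs)
    total = 0.0
    for ks in combinations_with_replacement(range(len(x)), order):
        g = coeffs.get(ks, 0.0)
        if g:
            total += g * prod(x[k] for k in ks)
    return total


# Hypothetical toy rule with inputs (x_1, x_2) = (b_{t-1}, P^+_t):
#   b_t = b_{t-1} + P^+_t - b_{t-1} * P^+_t
toy_coeffs = {(0, 0, 1): 1.0, (0, 0, 2): 1.0, (0, 1, 2): -1.0}

belief = 0.0
for p_plus in [0.5, 0.5]:  # two turns with the same positive SLU score
    belief = polynomial([belief, p_plus], toy_coeffs)
```

With this toy rule the belief grows with each turn in which the value is informed, which is the kind of behaviour the constraints are meant to guarantee.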

Recurrent Polynomial Network
The Recurrent Polynomial Network takes the second way to bridge rule-based and statistical models. It is a computational network and a statistical framework, which takes advantage of prior knowledge by using CMBP for initialization.
RPN contains two types of nodes: input nodes and computational nodes. Every node $x$ has a value at every time $t$, denoted by $u^{(t)}_x$. The values of computational nodes at time $t$ are evaluated using the nodes' values at time $t$ and the nodes' values at time $t-1$ as inputs, just as in Recurrent Neural Networks (RNNs).
Two types of edges are introduced to denote the time relation between linked nodes. A node at time t takes the value of a node at time t − 1 as input when they are connected by type-1 edges, while type-2 edges indicate that a node at time t takes the value of a node at time t.
Let $I_x$ denote the set of nodes which are connected to node $x$ by type-1 edges. Similarly, let $\hat{I}_x$ denote the set of nodes which are connected to node $x$ by type-2 edges.
Generally, three types of computational node are used in RPN, which are sum node, product node and activation node.
• Sum node: For a sum node $x$ at time $t$, its value $u^{(t)}_x$ is the weighted sum of its inputs:
$$u^{(t)}_x = \sum_{y \in \hat{I}_x} \hat{w}_{x,y}\, u^{(t)}_y + \sum_{y \in I_x} w_{x,y}\, u^{(t-1)}_y$$
where $w_{x,y}, \hat{w}_{x,y} \in \mathbb{R}$ are the weights of the edges.
• Product node: For a product node $x$ at time $t$, its value $u^{(t)}_x$ is the product of its inputs. Note that there may be multiple edges connecting node $y$ to node $x$. Then node $y$'s value should be multiplied into $u^{(t)}_x$ multiple times. Formally, let $M_{x,y}$ and $\hat{M}_{x,y}$ be the multiplicity of the type-1 edge $\overrightarrow{yx}$ and the multiplicity of the type-2 edge $\overrightarrow{yx}$ respectively. Node $x$'s value $u^{(t)}_x$ is evaluated by
$$u^{(t)}_x = \prod_{y \in \hat{I}_x} \left(u^{(t)}_y\right)^{\hat{M}_{x,y}} \prod_{y \in I_x} \left(u^{(t-1)}_y\right)^{M_{x,y}}$$
• Activation node: As the values of product nodes and sum nodes are not bounded by any range while the output belief should lie in $[0, 1]$, activation functions are needed to map values from $\mathbb{R}$ to some interval such as $[0, 1]$. An activation function is a univariate function. If node $x$ is an activation node, there is only one type-2 edge linked to it. Several activation functions have been investigated in previous work, which proposed an ascending, continuous function softclip mapping from $\mathbb{R}$ to $[0, 1]$ that is linear on $[\epsilon, 1 - \epsilon]$, with $\epsilon$ being a small value.
Note that $w$ and $\hat{w}$ are the only parameters in RPN, while $M_{x,y}$ and $\hat{M}_{x,y}$ are constants given the structure of RPN. Each node can be used as an output node in RPN.
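The node definitions above can be sketched in a few lines of Python. The sketch assumes that a sum node adds its type-2 inputs (values at time t) and type-1 inputs (values at time t − 1), each scaled by its edge weight, and that a product node multiplies its inputs, each raised to the multiplicity of its edge. The dictionary-based representation and the node names `b`, `p_plus` and `one` are hypothetical and chosen only for illustration.

```python
from math import prod


def sum_node(prev, cur, w, w_hat):
    """Weighted sum over type-1 inputs (time t-1) and type-2 inputs (time t).

    `prev` / `cur` map node name -> value at time t-1 / t.
    `w` / `w_hat` map input node name -> weight of the type-1 / type-2 edge.
    """
    return (sum(w_hat[y] * cur[y] for y in w_hat)
            + sum(w[y] * prev[y] for y in w))


def product_node(prev, cur, m, m_hat):
    """Product of the inputs, each raised to the multiplicity of its edge."""
    return (prod(cur[y] ** m_hat[y] for y in m_hat)
            * prod(prev[y] ** m[y] for y in m))


# Hypothetical example: previous belief b (reached through a type-1 edge),
# current SLU score p_plus and the constant 1 (reached through type-2 edges).
prev = {"b": 0.5}
cur = {"p_plus": 0.4, "one": 1.0}

# Product node computing b_{t-1} * (P^+_t)^2.
u_prod = product_node(prev, cur, m={"b": 1}, m_hat={"p_plus": 2})

# Sum node computing 1 - 0.5 * b_{t-1}.
u_sum = sum_node(prev, cur, w={"b": -0.5}, w_hat={"one": 1.0})
```

Only the weights passed to `sum_node` would be trained; the multiplicities passed to `product_node` are fixed by the network structure.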

Basic Structure
A basic 3-layer RPN, shown in figure 1, is introduced here to help understand the correspondence between 3-order CMBP and RPN. For simplicity, $(l, i)$ is used to denote the index of the $i$-th node in the $l$-th layer. Then each layer is defined as follows:
• First layer / Input layer: In this layer, input nodes correspond to the variables in equation (1), i.e. the values of the 6 input nodes $u^{(t)}_{(1,0)}, \dots, u^{(t)}_{(1,5)}$ are the input features together with the constant 1. The feature $b^r_{t-1}$, which is the belief of the value at time $t-1$ being None, is not used here, to keep the RPN structure clear and compact. Experiments show that the performance of CMBP does not degrade without the feature $b^r_{t-1}$. It is not used by the CMBP mentioned in the rest of the paper either.
• Second layer: Every product node $x$ in the second layer corresponds to a monomial in equation (2). To express different monomials, each triple of input nodes $(1, k_1), (1, k_2), (1, k_3)$ with $0 \le k_1 \le k_2 \le k_3 \le 5$ is enumerated and linked to a product node $x = (2, i)$ in the second layer, so that $u^{(t)}_x = u^{(t)}_{(1,k_1)}\, u^{(t)}_{(1,k_2)}\, u^{(t)}_{(1,k_3)}$.
• Third layer: There is only one sum node $(3, 0)$ in the third layer, corresponding to the belief value calculated by a polynomial. With the parameters set according to $g_{k_1,k_2,k_3}$ in equation (2), the value $u^{(t)}_{(3,0)}$ is equal to the $b_t$ output by equation (1). It is the only output node in this structure.
From the explanation of the basic structure in this section, it can easily be observed that a CMBP can be used to initialize RPN, and thus RPN can achieve at least the same results as CMBP. Prior knowledge and constraints are thus used to find a suboptimal point in the RPN parameter space, and RPN, as a statistical approach, can further optimize its parameters. Hence, RPN is a way of bridging rule-based models and statistical models.

Complete Structure
It is easy to add features to RPN as a statistical model. In previous work, 4 more features about user dialogue acts and machine acts were introduced.
A new sum node $x = (3, 1)$ in the third layer is introduced to capture some property across turns, just like the belief $b_t$. Like the node $(3, 0)$ that outputs the belief in the same layer, node $(3, 1)$ takes input from every product node in the second layer and is used as an input feature at the next time step.
Further, to map the output belief to $[0, 1]$, activation nodes with softclip(·) as their activation function are introduced.
The complete structure with the activation function, 4 more features and the new recurrent connection is shown in figure 2. The relation between a 3-order CMBP and the basic structure is shown in section 3.2.1. Similarly, the complete structure can also be initialized using CMBP by setting the weights of edges that do not appear in the basic structure to 0.

Complex RPN Structure
We next examine RPN's ability to utilize more features, multiple activation functions and a deeper structure. Two explorations of the RPN structure are presented in this section. Although these extensions do not yield better results, they are covered here to show the flexibility of the RPN approach.

Complex Structure
Firstly, to express a 4-order polynomial, simply using the structure shown in figure 2 with the in-degree of nodes in the second layer increased to 4 would be sufficient. However, it can be expressed by a more compact RPN structure. To simplify the explanation, consider the example RPN in figure 3. In figure 3, the first layer is used for input, and the values of the product nodes in the second layer are equal to the products of two features, such as $(b_{t-1})^2$, $b_{t-1} P^+_t$, $(P^+_t)^2$ and so on. Every sum node in the third layer can express any 2-order polynomial of the features with the weights set accordingly. In figure 3, the values of the three sum nodes are $1 - (b_{t-1})^2$, $1 - (P^+_t)^2$ and 1 respectively. Then, similarly, with another layer of product nodes and another layer of sum nodes, the value of the output node in the last layer equals the value of the 4-order polynomial $(1 - (b_{t-1})^2)(1 - (P^+_t)^2)$. The complete RPN structure that expresses 4-order CMBPs, with the same features as in figure 2, the new recurrent connection and activation nodes, can be obtained similarly.
With limited sum nodes in the third layer, the complexity of the model is much smaller than using a structure shown in figure 2 with product node's in-degree increased to 4 and increasing the number of product nodes accordingly.
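The layered evaluation of the example polynomial can be traced with a small Python sketch. It assumes a five-layer arrangement (inputs, pairwise products, sum nodes, products of sum nodes, output sum node) and hard-codes the weights needed for $(1 - (b_{t-1})^2)(1 - (P^+_t)^2)$. The function name and variable names are hypothetical, and only the nodes needed for this one polynomial are written out.

```python
def compact_fourth_order(b_prev, p_plus):
    """Layered evaluation of (1 - b_prev^2) * (1 - p_plus^2)."""
    # Layer 1: inputs, including the constant 1.
    one = 1.0
    # Layer 2: product nodes, each the product of two inputs.
    b_sq = b_prev * b_prev
    p_sq = p_plus * p_plus
    const = one * one
    # Layer 3: sum nodes, each a 2-order polynomial of the inputs.
    s0 = 1.0 * const - 1.0 * b_sq
    s1 = 1.0 * const - 1.0 * p_sq
    # Layer 4: product node over two third-layer sum nodes.
    prod_node = s0 * s1
    # Layer 5: output sum node with weight 1 on that product node.
    return 1.0 * prod_node


value = compact_fourth_order(0.5, 0.5)
```

Expanding the same polynomial directly gives $1 - b^2 - p^2 + b^2 p^2$, which requires 4-order monomials; the layered form reaches the same value using only pairwise products.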

Complex Features
Secondly, the RNN proposed by Henderson et al. (2014c) uses n-gram features of ASR results and machine acts. Similarly, n-gram features of ASR results and machine acts are also investigated in RPN. Since the RPN used in this paper is a binary classification model and assumes slots are independent of each other, the n-gram features proposed by Henderson et al. (2014c) are modified in this paper by removing or merging some features to make the features independent of slots and values. When tracking slot s and value v, the sums of confidence scores of ASR hypotheses for the following cases are extracted:
• V: confidence score of ASR hypotheses where value v appears
n-gram features of machine acts about the tracked slot and value are also used as features.
For example, given machine acts hello() | inform(area=center) | inform(food=Chinese) | request(name), for slot food and value Chinese, the n-gram machine act features are hello, inform, request, inform+slot, inform+value, inform+slot+value, slot, value, slot+value. Acts such as request(name) are about the slot name, and hence request+slot is not in the feature list.
To combine RPN with the RNN proposed by Henderson et al. (2014c), the input nodes of these n-gram features are not linked to product nodes in the second layer. Instead, a layer of sum nodes followed by a layer of activation nodes with a sigmoid activation function, which together are equivalent to a layer of neurons, is introduced. These activation nodes are linked to the sum nodes in the third layer, just like the product nodes in the second layer. The structure is illustrated in figure 4. Experiments in section 5 show that these two structures do not yield better results when initialized randomly or initialized using 3-order CMBPs, although the model complexity increases considerably. This indicates the compactness and effectiveness of the simple structure shown in figure 2.
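The machine-act example can be reproduced with a short Python sketch. The extraction rule used here is an assumption inferred from that single example: every act type is a feature, and the placeholders `slot` and `value` are added only when the act mentions the tracked slot or value. The function name `machine_act_features` and the tuple representation of acts are hypothetical.

```python
def machine_act_features(acts, slot, value):
    """Slot- and value-independent n-gram features for one (slot, value) pair.

    `acts` is a list of (act_type, act_slot, act_value) tuples;
    parts that an act does not carry are given as None.
    """
    feats = set()
    for act_type, act_slot, act_value in acts:
        feats.add(act_type)
        has_slot = act_slot == slot
        has_value = act_value == value
        if has_slot:
            feats.update({"slot", f"{act_type}+slot"})
        if has_value:
            feats.update({"value", f"{act_type}+value"})
        if has_slot and has_value:
            feats.update({"slot+value", f"{act_type}+slot+value"})
    return feats


# hello() | inform(area=center) | inform(food=Chinese) | request(name)
acts = [
    ("hello", None, None),
    ("inform", "area", "center"),
    ("inform", "food", "Chinese"),
    ("request", "name", None),
]
features = machine_act_features(acts, slot="food", value="Chinese")
```

Because the features only say whether the tracked slot or value is mentioned, the same feature set can be shared across all slots and values.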

Uncertainty in SLU
In an end-to-end dialogue system, there are two challenges in spoken language understanding: ASR errors and insufficient in-domain dialogue data.
ASR errors cause information contained in the user's utterance to be distorted or even lost. Fortunately, statistical approaches to SLU, trained on labelled in-domain examples, have been shown to be relatively robust to ASR errors (Mairesse et al., 2009).
Even with an effective way to make SLU robust to ASR errors, it is hard to implement such SLUs for a new domain due to insufficient labelled data. In DSTC-3, only a small amount of data for the new dialogue domain is provided.
Following the work of Zhu et al. (2014), the following steps are used to handle the two challenges stated above:
• Data generation: with sufficient data in the restaurant domain of DSTC-2, data for the tourist domain using the ontology of DSTC-3 can be generated. Utterance patterns of data in the original domain are used to generate data for the new domain of DSTC-3. After preparing both the original data of DSTC-2 and the generated data of DSTC-3, a more general parser for these two domains can be built.
• ASR error simulation: after data generation, ASR error simulation (Zhu et al., 2014) is needed to make the prepared data resemble ASR output with speech recognition errors, so that a parser robust to ASR errors can be trained. With a simple mapping from transcription patterns to the corresponding patterns of ASR n-best hypotheses, learned from existing data, together with phone-based confusion for slot values, pseudo ASR n-best hypotheses can be obtained. Note that the methods proposed by Zhu et al. (2014) only perform ASR error simulation for the generated data in the DSTC-3 domain and leave the original DSTC-2 data in its original ASR form, which may introduce a difference between the distributions of training data and testing data on the two domains for the tracker. So in this paper ASR errors are simulated in the data of both domains instead.
• Training: Using the data obtained from the previous steps, a statistical parser can be trained (Henderson et al., 2012). By varying the fraction of simulated vs. real data, and the simulated error rate, prior expectations about operating conditions can be expressed.
Although a semantic parser with state-of-the-art techniques can achieve good performance to some degree, parsing without any error is impossible: a semantic parser typically performs well on speech patterns that exist in the training dataset, while it fails to predict the correct semantics for some utterances unseen in the training dataset. So it is common for SLU performance to differ significantly between training and test conditions in real-world end-to-end systems.
It has been widely observed that SLU influences state tracking greatly, because the confidence scores of SLU hypotheses are usually the key inputs for dialogue state tracking. When these confidence scores become unreliable, the performance of the tracker is sure to degrade. Studies have shown that it is possible to improve SLU accuracy compared to the live SLU in the DSTC data (Zhu et al., 2014; Sun et al., 2014b). Hence, most of the state-of-the-art results from DSTC-2 and DSTC-3 used a refined SLU, either by explicitly rebuilding an SLU component or by feeding the ASR hypotheses into the trackers (Williams, 2014; Sun et al., 2014b; Henderson et al., 2014d; Henderson et al., 2014c; Kadlec et al., 2014; Sun et al., 2014a). Kadlec et al. (2014) obtain a tracking accuracy improvement of 7.6% when they use an SLU refined by themselves instead of the organiser-provided live SLU.
Under a semantic parser mismatch, the accuracy of state tracking can degrade badly. The mismatched SLU problem is a main challenge in DST. Trackers under mismatched SLU conditions are investigated in this paper.

RPN with Different Structures
In this section, the performance of the three structures presented in this paper is compared, and RPN with the simple structure is evaluated on DSTC-3 and compared with the best submitted trackers. Only joint goal accuracy, which is the most difficult task of DSTC-3, is of interest. Note that the integer-coefficient CMBP with the best performance on DSTC-2 is used to initialize RPN. As stated in section 4, the SLU designed in this paper focuses on domain extension, so trackers are only evaluated on DSTC-3. The RPN structures that express 3-order CMBP, 4-order CMBP without n-gram features and 4-order CMBP with n-gram features are evaluated. Acc is the accuracy of the tracker's 1-best joint goal hypothesis; the larger the better. L2 is the L2 norm between the correct joint goal distribution and the distribution output by the tracker; the smaller the better.
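The two metrics can be illustrated for a single turn with a minimal Python sketch. It assumes the correct distribution is a one-hot vector on the true joint goal and that the tracker output is a dictionary from hypotheses to probabilities; the function name `acc_and_l2` and the example hypotheses are hypothetical, and the official DSTC scoring scripts may differ in details such as how unlisted hypotheses are handled.

```python
from math import sqrt


def acc_and_l2(distribution, correct):
    """Score one turn: 1-best accuracy and L2 distance to the one-hot label.

    `distribution` maps each hypothesis to its probability;
    `correct` is the true hypothesis.
    """
    best = max(distribution, key=distribution.get)
    acc = 1.0 if best == correct else 0.0
    labels = set(distribution) | {correct}
    l2 = sqrt(sum(
        (distribution.get(h, 0.0) - (1.0 if h == correct else 0.0)) ** 2
        for h in labels
    ))
    return acc, l2


acc, l2 = acc_and_l2(
    {"food=chinese": 0.7, "food=thai": 0.2, "none": 0.1},
    correct="food=chinese",
)
```

A tracker can have a correct 1-best hypothesis and still receive a poor L2 score if its probability mass is spread over wrong hypotheses, which is why both numbers are reported.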
It can be seen from table 1 that the simple structure yields the best result. Note that the parser used here is described in (Zhu et al., 2014). Experiments in the mismatched SLU case also use this SLU for training.
For DSTC-3, it can be seen from table 2 that RPN trained on DSTC-2 can achieve state-of-the-art performance on DSTC-3 without modifying the tracking method, outperforming all the submitted trackers in DSTC-3 including the RNN system.
Note that the simple structure is used here with the refined SLU described in section 4. We picked the best practical one on dstc2-test among the SLUs introduced in the following section as the training SLU and testing SLU.

RPN with Mismatched Semantic Parsers
As stated in section 4, SLU is the input module for dialogue state tracking; its confidence scores are usually used directly as probability features and hence have a tremendous effect on trackers. Handling mismatched semantic parsers is a main challenge in DST. In this section, different tracking methods are evaluated when there is a mismatch between training data and testing data. More specifically, different tracking models are trained with the same fixed SLU and tested with different SLUs.
Three main categories of tracking models are investigated: rule-based models, statistical models and mixed models.
MaxEnt (Sun et al., 2014b) is a statistical model. HWU baseline (Wang, 2013) is selected as a competitive rule-based model. CMBP and RPN are mixed models.
Four types of SLUs with different levels of performance are used. It has been shown that the organiser-provided live SLU can be improved upon, and so it is used as the worst SLU in the following comparison. Past work has shown that a trained parser gets a performance improvement when combined with the one the organiser provided (Zhu et al., 2014). Using transcriptions for parsing gives much more reliable SLU results than using ASR hypotheses. So, generally speaking, the performance of SLUs of different types differs considerably. Six different SLUs are used, and their performance is measured by precision, recall and ICE. Note that ASR error here is the percentage of training data with ASR error simulation when training the SLU. The Item Cross Entropy (ICE) (Thomson et al., 2008) between the N-best SLU hypotheses and the semantic label assesses the overall quality of the distribution over semantic items, and has been shown to give a consistent performance ranking for both the confidence scores and the overall correctness of the semantic parser (Zhu et al., 2014). An SLU with lower ICE has better performance.
Precision and recall are evaluated using only the SLU's 1-best hypothesis, whereas ICE takes all hypotheses and their confidence scores into consideration.
In results shown in figure 5, the training dataset for tracker is fixed, while testing dataset is outputted by different SLUs. The X-axis gives the SLU ICE and Y-axis gives the tracking accuracy on DSTC3-test. It can be observed that RPN achieves highest accuracy on every SLU among rule-based models, statistical models and mixed models. Thus, RPN shows its robustness on mismatched semantic parsers, which demonstrates the power of using both prior knowledge and being a statistical approach.
After evaluating the mismatched case, the matched case is also tested. When training dataset and testing dataset are outputted by the same SLU, RPN also outperforms all other models, shown in figure 6.
It can be observed that RPN achieves the highest accuracy among RPN, CMBP, MaxEnt, and HWU baseline whether there is a mismatch between training SLU and testing SLU or not.
Figure 6: Trackers' performances with matched semantic parser

Conclusion
The Recurrent Polynomial Network presented in this paper is a recent framework to bridge rule-based and statistical models. Several network structures are explored, and the simple structure outperforms the others. Experiments show that RPN outperforms many state-of-the-art trackers on DSTC-3 and that RPN performs best on all SLUs under mismatched SLU conditions.

Activation function
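The activation function softclip(·) is a combination of the logistic function and the clip function. Let $\epsilon$ denote a small value such as 0.01, and let $\delta$ denote the offset of the sigmoid function such that $\mathrm{sigmoid}(\epsilon - 0.5 + \delta) = \epsilon$. The sigmoid function here is defined as
$$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$
The softclip function is defined as
$$\mathrm{softclip}(x) = \begin{cases} \mathrm{sigmoid}(x - 0.5 + \delta) & \text{if } x \le \epsilon \\ x & \text{if } \epsilon < x < 1 - \epsilon \\ \mathrm{sigmoid}(x - 0.5 - \delta) & \text{if } x \ge 1 - \epsilon \end{cases}$$
It is a non-decreasing, continuous function, which is linear on $[\epsilon, 1 - \epsilon]$. Its derivative is defined as follows:
$$\frac{\partial\, \mathrm{softclip}(x)}{\partial x} = \begin{cases} \mathrm{softclip}(x)\,(1 - \mathrm{softclip}(x)) & \text{if } x \le \epsilon \text{ or } x \ge 1 - \epsilon \\ 1 & \text{if } \epsilon < x < 1 - \epsilon \end{cases}$$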
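A minimal Python sketch of such an activation is given below. It assumes the piecewise form suggested by the description: a shifted sigmoid below ε, the identity on the linear region, and a mirrored shifted sigmoid above 1 − ε, with the offset δ solved from sigmoid(ε − 0.5 + δ) = ε. The constant and function names are hypothetical, and the plain `exp` call is not guarded against overflow for very large negative inputs.

```python
from math import exp, log

EPS = 0.01
# Offset solved from sigmoid(EPS - 0.5 + DELTA) == EPS.
DELTA = log(EPS / (1.0 - EPS)) + 0.5 - EPS


def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))


def softclip(x):
    """Identity on (EPS, 1 - EPS), sigmoid tails outside that interval."""
    if x <= EPS:
        return sigmoid(x - 0.5 + DELTA)
    if x >= 1.0 - EPS:
        return sigmoid(x - 0.5 - DELTA)
    return x


def softclip_grad(x):
    """Derivative used in backpropagation: 1 inside, s * (1 - s) on the tails."""
    if EPS < x < 1.0 - EPS:
        return 1.0
    s = softclip(x)
    return s * (1.0 - s)
```

The two tails meet the linear region at ε and 1 − ε, so the function stays continuous while keeping a non-zero gradient for inputs outside [0, 1].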

Training
Backpropagation through time (BPTT) using mini-batch is used to train the network with batch size 50. Gradients of weights are calculated and accumulated within each batch. Gradients computed for each timestep are propagated to the first timestep. Mean squared error (MSE) is used as the criterion to measure the distance of the output belief to the correct belief distribution.

Derivative calculation
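Let $\delta^{(t)}_x$ be the partial derivative of the cost function with respect to the value of node $x$ at time $t$, i.e., $\delta^{(t)}_x = \frac{\partial L}{\partial u^{(t)}_x}$. Suppose node $x = (d, i)$ is a sum node. Then when node $x$ passes its error, the error of a child node $y \in \hat{I}_x$ is updated as
$$\delta^{(t)}_y \leftarrow \delta^{(t)}_y + \delta^{(t)}_x\, \hat{w}_{x,y}$$
Similarly, the error of a node $y \in I_x$ is updated as
$$\delta^{(t-1)}_y \leftarrow \delta^{(t-1)}_y + \delta^{(t)}_x\, w_{x,y}$$
Suppose node $x = (d, i)$ is a product node. Then when node $x$ passes its error, the error of a node $y \in \hat{I}_x$ is updated as
$$\delta^{(t)}_y \leftarrow \delta^{(t)}_y + \delta^{(t)}_x\, \hat{M}_{x,y}\, \frac{u^{(t)}_x}{u^{(t)}_y}$$
Similarly, the error of a node $y \in I_x$ is updated as
$$\delta^{(t-1)}_y \leftarrow \delta^{(t-1)}_y + \delta^{(t)}_x\, M_{x,y}\, \frac{u^{(t)}_x}{u^{(t-1)}_y}$$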
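The error passed through a product node can be checked numerically. The Python sketch below assumes the error sent to an input is the node's own error times the edge multiplicity times the ratio of the node value to the input value, which holds for non-zero inputs, and compares it against a central finite difference. All function and variable names are hypothetical.

```python
def product_forward(prev, cur, m, m_hat):
    """Value of a product node from type-1 (prev) and type-2 (cur) inputs."""
    out = 1.0
    for y, k in m_hat.items():
        out *= cur[y] ** k
    for y, k in m.items():
        out *= prev[y] ** k
    return out


def product_backward(delta_x, u_x, prev, cur, m, m_hat):
    """Errors passed from product node x to its inputs (assumed non-zero)."""
    d_cur = {y: delta_x * k * u_x / cur[y] for y, k in m_hat.items()}
    d_prev = {y: delta_x * k * u_x / prev[y] for y, k in m.items()}
    return d_prev, d_cur


# Node computing b_{t-1} * p_t^2 with hypothetical inputs b and p.
prev = {"b": 0.5}
cur = {"p": 0.4}
m, m_hat = {"b": 1}, {"p": 2}

u_x = product_forward(prev, cur, m, m_hat)
d_prev, d_cur = product_backward(1.0, u_x, prev, cur, m, m_hat)

# Central finite difference of the node value with respect to p.
h = 1e-6
numeric = (product_forward(prev, {"p": 0.4 + h}, m, m_hat)
           - product_forward(prev, {"p": 0.4 - h}, m, m_hat)) / (2 * h)
```

An implementation that must handle inputs equal to zero would need to compute the derivative without dividing by the input value, for example by multiplying the remaining factors directly.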