Nonparametric Bayesian Models for Spoken Language Understanding

In this paper, we propose a new generative approach for the semantic slot filling task in spoken language understanding using a nonparametric Bayesian formalism. Slot filling is typically formulated as a sequential labeling problem, which does not directly deal with the posterior distribution of possible slot values. We present a nonparametric Bayesian model involving the generation of arbitrary natural language phrases, which allows an explicit calculation of the distribution over an infinite set of slot values. We demonstrate that this approach significantly improves slot estimation accuracy compared to an existing sequential labeling algorithm.


Introduction
Spoken language understanding (SLU) refers to the challenge of recognizing a speaker's intent from a natural language utterance, which is typically defined as a slot filling task. For example, in the utterance "Remind me to call John at 9am tomorrow", the specified information {"time": "9am tomorrow"} and {"subject": "to call John"} should be extracted. The term slot refers to a variable such as the time or subject that is expected to be filled with a value provided through the user's utterance.
The slot filling task is typically formulated as a sequential labeling problem, as shown in Figure 1. This labeling scheme naturally represents the recognition of arbitrary phrases that appear in the transcription of an utterance. Formally speaking, when we assume a given set of slots {s_1, ..., s_M} and denote the corresponding slot values by {v_{s_1}, ..., v_{s_M}}, where v_{s_i} ∈ V_{s_i}, the domain of each slot value V_{s_i} is an infinite set of word sequences. In this paper, we use the term arbitrary slot filling task to refer to this implicit problem statement, which inherently underlies the sequential labeling formulation.
In contrast, a different line of work has explored the case where V_{s_i} is provided as a finite set of possible values that can be handled by a backend system (Henderson, 2015). We refer to this type of task as a categorical slot filling task. In this case, slot filling is regarded as a classification problem that explicitly considers value-based prediction, as shown in Figure 2. From this point of view, we can say that the distribution of slot values is actually concentrated in a small set of typical phrases, even in the arbitrary slot filling task, because users generally know what kind of function is offered by the system.
To reflect this observation, in this paper we explore a value-based formulation for the arbitrary slot filling task. Unlike the sequential labeling formulation, which is basically position-based label prediction, our method directly estimates the posterior distribution over the infinite set of possible values V_{s_i} for each slot. The distribution is represented using a Dirichlet process (Gershman and Blei, 2012), a nonparametric Bayesian formalism that generates a categorical distribution for any space. We demonstrate that this approach improves estimation accuracy in the arbitrary slot filling task compared with the conventional sequential labeling approach.

The remainder of this paper is organized as follows. In Section 2, we introduce related work. In Section 3, we present our nonparametric Bayesian formulation, the hierarchical Dirichlet process slot model (HDPSM), which directly models an infinite set of slot values. On the basis of the HDPSM, we develop a generative utterance model that allows us to compute the posterior probability of slot values in Section 4. In Section 5, we introduce a two-stage slot filling algorithm that consists of a candidate generation step and a candidate ranking step using the proposed model. In Section 6, we show experimental results on multiple datasets in different domains to demonstrate that the proposed algorithm performs better than the baseline sequential labeling method. We conclude in Section 7 with a brief summary.

Related Work
The difference between the categorical and arbitrary slot filling approaches has not been explicitly discussed in a comparative manner to date. In this section, we review existing work on both approaches. For the categorical slot filling approach, various algorithms that directly model the distribution of slot values have been proposed, including generative models (Williams, 2010), maximum entropy linear classifiers (Metallinou et al., 2013), and neural networks (Ren et al., 2014). However, none of these models is applicable to predicting a variable that ranges over an infinite set, and it is not straightforward to extend them suitably. In particular, a discriminative approach is not applicable to arbitrary slot filling tasks because it requires a fixed finite set of slot values over which to compute statistics.
The arbitrary slot filling approach is a natural application of shallow semantic parsing (Gildea, 2002), which is naturally formulated as a sequential labeling problem. Various sequential labeling algorithms have been applied to this task, including support vector machines, conditional random fields (CRFs) (Lafferty et al., 2001; Hahn et al., 2011), and deep neural networks (Mesnil et al., 2015; Xu and Sarikaya, 2013). Vukotic et al. (2015) reported that the CRF is still the most accurate, rapid, and stable method among them. Because the focus of this paper is arbitrary slot filling tasks, we use CRFs as our baseline method.
In this paper, we apply nonparametric Bayesian models (Gershman and Blei, 2012) to represent the distribution over arbitrary phrases for each slot. The effectiveness of this phrase modeling approach has been examined in various applications, including morphological analysis (Goldwater et al., 2011) and infinite-vocabulary topic models (Zhai and Boyd-Graber, 2013). Our method can be regarded as an application of this idea, although it is not straightforward to integrate it with the utterance generation process, as we explain later.
Consequently, our proposed method is categorized as a generative approach. Generative approaches have many inherent advantages that have been examined, including unsupervised SLU (Chen et al., 2015), automatic feature extraction (Tur et al., 2013), and integration with syntactic modeling (Lorenzo et al., 2013). Another convenient property of generative models is that prior knowledge can be integrated in an intuitive way (Raymond et al., 2006). This often leads to better performance with less training data compared with discriminative models trained completely from scratch (Komatani et al., 2010).

Hierarchical Dirichlet Process Slot Model
In this section, we present a nonparametric Bayesian formulation that directly models the distribution over an infinite set of possible values for each slot. Let S = {s_1, ..., s_{M_S}} be a given set of slots and M_S be the number of slots. We define each slot s_i as a random variable ranging over an infinite set of letter sequences V, which is represented as follows:

    V = {c^1 c^2 ... c^L | c^j ∈ C, L ≥ 0},

where C is a set of characters including the blank character and any other character that potentially appears in the transcription of an utterance. Consequently, we regard the set of slots S as also being a random variable that ranges over V^{M_S}. The objective of this section is to develop the formulation of the probability distribution p(S).

Dirichlet Process
We apply the Dirichlet process (DP) to model both the distribution for an individual slot p_i(s_i) and the joint distribution p(S). In this subsection, we review the definition and key properties of the DP with general notation for a target distribution G over a domain X. In the DP for the prior of p_i(s_i) described in Section 3.2, the domain X corresponds to the set of slot values V, e.g., "fen ditton", "new chesterton", and None. In the DP for p(S) presented in Section 3.3, X indicates the set of tuples of slot values V^{M_S}, e.g., ("restaurant", "new chesterton", "fast food") and ("restaurant", "fen ditton", None).

The DP is a probability distribution over the distribution G. A DP is parameterized by α_0 and G_0, where α_0 > 0 is a concentration parameter and G_0 is a base distribution over X. If G is drawn from DP(α_0, G_0) (i.e., G ~ DP(α_0, G_0)), then the following Dirichlet distributed property holds for any partition of X denoted by {A_1, ..., A_L}:

    (G(A_1), ..., G(A_L)) ~ Dir(α(A_1), ..., α(A_L)),

where α(A) = α_0 G_0(A), which is known as the base measure of the DP. Ferguson (1973) proved an important property of the posterior distribution given repeated i.i.d. samples x_1, ..., x_N drawn from G. Consider a countably infinite set of atoms φ = {φ_1, φ_2, ...} that are independently drawn from G_0. Let c_i ∈ N be the assignment of an atom for sample x_i, which is generated by a sequential draw with the following conditional probability:

    p(c_{N+1} = k | c_{1:N}) = n_k / (N + α_0)        for 1 ≤ k ≤ K,
    p(c_{N+1} = K + 1 | c_{1:N}) = α_0 / (N + α_0),

where n_k is the number of times that the kth atom appears in c_{1:N} and K is the number of different atoms in c_{1:N}. Given the assignment c_{1:N}, the predictive distribution of x_{N+1} ∈ X is represented in the following form:

    p(x_{N+1} | c_{1:N}, φ) = (1 / (N + α_0)) (Σ_{k=1}^{K} n_k δ_{φ_k}(x_{N+1}) + α_0 G_0(x_{N+1})).

The base distribution possibly generates an identical value for different atoms, such as (φ_1 = "fen ditton", φ_2 = "new chesterton", φ_3 = "fen ditton"). The assignment c_i is an auxiliary variable that indicates which of these atoms is assigned to the ith data point x_i; when x_i = "fen ditton", c_i can be 1 or 3.
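The sequential draw above is the Chinese restaurant process view of the DP. The following sketch (hypothetical helper names, not from the paper) simulates it: an existing atom k is reused with probability proportional to its count n_k, and a new atom is opened with probability proportional to α_0.

```python
import random

def crp_assign(counts, alpha0):
    """Sample an atom assignment under the Chinese restaurant process:
    atom k with probability proportional to its count n_k, a new atom
    with probability proportional to alpha0."""
    total = sum(counts) + alpha0
    r = random.uniform(0, total)
    for k, n_k in enumerate(counts):
        r -= n_k
        if r < 0:
            return k              # reuse existing atom k
    return len(counts)            # open a new atom (k = K + 1)

# Simulate 100 draws from a DP with concentration alpha0 = 1.0.
random.seed(0)
counts = []
for _ in range(100):
    k = crp_assign(counts, alpha0=1.0)
    if k == len(counts):
        counts.append(0)
    counts[k] += 1
```

The rich-get-richer dynamics concentrate most draws on a few atoms, which matches the observation that slot values cluster on a small set of typical phrases.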
The posterior distribution above depends on the atom frequencies n_k, not on the frequency of each value itself. The atoms φ and the assignments c are latent variables that must be determined at runtime.

Individual Slot Model
First we formulate the distribution for an individual slot s_i as a generative model that consists of two-step generation: generation of the phrase length 0 ≤ L_i ≤ L_max using a categorical distribution, and generation of a letter sequence s_i^{1:L_i} using an n-gram model. We define G_0^i as the joint distribution of these two models:

    G_0^i(s_i) = p(L_i | λ_i) Π_{ι=1}^{L_i} p(s_i^ι | s_i^{ι−n+1:ι−1}, η_i),    (1)

where λ_i and η_i are the parameters of the categorical distribution and the n-gram model for slot s_i, respectively. This explicit modeling of the length helps avoid the bias toward shorter phrases and leads to a better distribution, as reported by Zhai and Boyd-Graber (2013). G_0^i potentially generates an empty phrase with L_i = 0 to express the case that the slot value v_{s_i} is not provided by an utterance. Therefore, the distribution p_i(s_i) can naturally represent the probability of None, which is shown in Figure 2.
We consider prior distributions of the parameters λ_i and η_i to treat the n-gram characteristics of each slot in a fully Bayesian manner. p(λ_i) is given as an (L_max + 1)-dimensional symmetric Dirichlet distribution with parameter a, with one dimension for each length 0 ≤ L_i ≤ L_max. We also define |C|-dimensional symmetric Dirichlet distributions with parameter b for each n-gram context, since given the context, p(s_i^ι | s_i^{ι−n+1:ι−1}, η_i) is simply a categorical distribution ranging over C. Suppose we observe N phrases for slot s_i. Let n^L_{iι} be the number of phrases that have length ι and n^γ_{ih} be the number of times that letter s^ι = h appears after context s^{ι−n+1:ι−1} = γ. The predictive probability of a phrase is represented as follows:

    p(s_i | s_i^{(1:N)}) = ((n^L_{iL_i} + a) / (N + a(L_max + 1))) Π_{ι=1}^{L_i} ((n^γ_{ih} + b) / (Σ_{h'∈C} n^γ_{ih'} + b|C|)),

where γ and h denote the context s_i^{ι−n+1:ι−1} and the letter s_i^ι at each position ι.
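As a concrete illustration, the following sketch (a hypothetical `SlotPhraseModel` class, assuming a character bigram, i.e., n = 2, and lengths 0 ≤ L_i ≤ L_max) computes this collapsed predictive probability from raw counts:

```python
from collections import defaultdict

class SlotPhraseModel:
    """Collapsed predictive model for one slot: a Dirichlet-smoothed
    categorical over phrase lengths plus a Dirichlet-smoothed character
    bigram (n = 2) over the letters. Hypothetical illustration only."""
    def __init__(self, chars, l_max, a=1.0, b=1.0):
        self.chars, self.l_max, self.a, self.b = chars, l_max, a, b
        self.len_counts = defaultdict(int)                   # n^L_{i,l}
        self.ngram = defaultdict(lambda: defaultdict(int))   # n^gamma_{i,h}
        self.n_phrases = 0                                   # N

    def observe(self, phrase):
        self.len_counts[len(phrase)] += 1
        self.n_phrases += 1
        prev = "^"                       # start-of-phrase context symbol
        for ch in phrase:
            self.ngram[prev][ch] += 1
            prev = ch

    def predictive(self, phrase):
        # Length term: (n^L_l + a) / (N + a * (L_max + 1)).
        p = (self.len_counts[len(phrase)] + self.a) / (
            self.n_phrases + self.a * (self.l_max + 1))
        prev = "^"
        for ch in phrase:
            ctx = self.ngram[prev]
            # Letter term: (n^gamma_h + b) / (sum_h' n^gamma_h' + b * |C|).
            p *= (ctx[ch] + self.b) / (
                sum(ctx.values()) + self.b * len(self.chars))
            prev = ch
        return p

model = SlotPhraseModel(chars=set("abcdefghijklmnopqrstuvwxyz "), l_max=20)
for phrase in ["fen ditton", "fen ditton", "new chesterton"]:
    model.observe(phrase)
```

Observed phrases raise both the length and letter terms, so a previously seen value such as "fen ditton" scores far higher than an arbitrary string of the same length.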

Generative Model for a Set of Slot Values
A naive definition of the joint distribution p(S) is the product of all slot probabilities Π_{i=1}^{M_S} p_i(s_i), which makes an independence assumption. However, slot values are generally correlated with each other (Chen et al., 2015). To obtain a more accurate distribution, we formulate p(S) using another DP that recognizes frequent combinations of slot values, as p(S) ~ DP(α_1, G_2), where G_2 is a base distribution over V^{M_S}. We apply the naive independence assumption to G_2 as follows:

    G_2(S) = Π_{i=1}^{M_S} p_i(s_i).

The whole generation process of S involves two-layered DPs that share atoms among them. In this sense, this generative model is regarded as a hierarchical Dirichlet process (Teh et al., 2005).
Let G_1^i(s_i) = p_i(s_i) and G_3(S) = p(S) for consistent notation. In summary, we define the hierarchical Dirichlet process slot model (HDPSM) as a generative model with the following generation process:

    G_1^i ~ DP(α_0^i, G_0^i)    for each slot s_i,
    G_3 ~ DP(α_1, G_2)    with G_2(S) = Π_{i=1}^{M_S} G_1^i(s_i),
    S ~ G_3.

Inference of HDPSM
In a slot filling task, observations S_{1:T} = {S_1, ..., S_T} are available as training data. The inference of the HDPSM refers to the estimation of λ, η, and the atom assignments for each DP.
We formulate the HDPSM in the form of the Chinese restaurant franchise process, which is one of the explicit representations of hierarchical DPs, obtained by marginalizing out the base distributions. Teh et al. (2005) present a Gibbs sampler for this representation, which involves repeatedly resampling the atoms and assignments. In our method, we instead adopt a single-pass inference, which samples the assignment for each observation only once. Our preliminary experiments showed that the quality of inference is not degraded, because S is observed, unlike the settings in Teh et al. (2005).
We denote the atoms and the atom assignments in the first-level DP DP(α_1, G_2) by φ^1 and c^1_{1:N}, respectively. The posterior probability of the atom assignment for a new observation S_{N+1} is represented as follows:

    p(c^1_{N+1} = k | c^1_{1:N}, S_{N+1}) ∝ n^1_k δ_{φ^1_k}(S_{N+1})    for 1 ≤ k ≤ K,
    p(c^1_{N+1} = K + 1 | c^1_{1:N}, S_{N+1}) ∝ α_1 G_2(S_{N+1}),

where n^1_k is the number of times that the kth atom appears in c^1_{1:N} and K is the number of different atoms in c^1_{1:N}. Let φ^0_i and c^0_{i,1:K} denote the atoms and the assignments in the second-level DPs DP(α_0^i, G_0^i). The second-level DPs assign atoms to each first-level atom φ^1_k; i.e., a second-level atom φ^0_{it} is generated only when a new atom is assigned to S_{N+1} at the first level. The posterior probability of the atom assignment at the second level is:

    p(c^0_{i,K+1} = t | c^0_{i,1:K}, s_i) ∝ n^0_{it} δ_{φ^0_{it}}(s_i)    for 1 ≤ t ≤ T_i,
    p(c^0_{i,K+1} = T_i + 1 | c^0_{i,1:K}, s_i) ∝ α_0^i G_0^i(s_i),

where n^0_{it} is the number of times that the tth atom appears in c^0_{i,1:K} and T_i is the number of different atoms in c^0_{i,1:K}. The single-pass inference procedure is presented in Algorithm 1: for each training instance, it samples the first-level assignment once and, only when a new first-level atom is opened (k = K + 1), samples the second-level assignments t_i, opening new second-level atoms where t_i = T_i + 1. Given the atoms φ and the assignments c, the predictive distribution of S_{N+1} = {s_{N+1,1}, ..., s_{N+1,M_S}} is calculated as follows:

    p(S_{N+1} | c, φ) = (1 / (N + α_1)) (Σ_{k=1}^{K} n^1_k δ_{φ^1_k}(S_{N+1}) + α_1 G_2(S_{N+1})),    (2)

where G_2 is evaluated using the second-level predictive distributions in the same manner.
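A minimal sketch of the first-level part of this single-pass scheme is shown below (hypothetical function names; the full HDPSM inference also updates the second-level, per-slot DPs). Because S is observed, each training tuple is assigned once, here greedily to its maximum a posteriori atom: an existing atom scores n^1_k if it stores the identical tuple, and a new atom scores α_1 G_2(S), with `base_prob` playing the role of G_2.

```python
def single_pass_assign(observations, alpha1, base_prob):
    """Single-pass, first-level CRP assignment sketch: each observed
    slot tuple S is assigned once, to the highest-scoring atom."""
    atoms, counts = [], []
    for S in observations:
        # Existing atoms only match if they store the identical tuple.
        scores = [n if atom == S else 0.0 for atom, n in zip(atoms, counts)]
        scores.append(alpha1 * base_prob(S))   # score of opening a new atom
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(atoms):                    # new atom opened
            atoms.append(S)
            counts.append(0)
        counts[k] += 1
    return atoms, counts
```

Repeated tuples accumulate counts on a single atom, which is what lets the model recognize frequent combinations of slot values.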

Generative Model for an Utterance
We present a generative utterance model to derive a slot estimation algorithm given an utterance u. Figure 3 presents the basic concept of our generative model. In the proposed model, we formulate the distribution of the slot values as well as the distribution of the non-slot parts. In Figure 3, the phrases "hi we're in um" and "and we need a" should be removed to identify the slot information. We call these non-slot phrases functional fillers because they nevertheless serve some function in conveying information. Identifying the set of non-slot phrases is equivalent to identifying the set of slot phrases. Therefore, we define a generative model of functional fillers in the same way as for the slot values. Using this model, we attempt to find the best combination of the slot parts and the non-slot parts (i.e., the functional filler parts).

Functional Filler
We assume an utterance u is a concatenation of slot values S and functional fillers F. A functional filler is represented as a phrase that ranges over V. To derive the utterance model, we first formulate a generative model for functional fillers. In our observation, the distribution of a functional filler depends on its position in an utterance. For example, utterances often begin with typical phrases such as "Hello I'm looking for ..." or "Hi please find ...", which hardly ever appear at other positions. To reflect this observation, we introduce filler slots that model the functional fillers separately by position. Specifically, we define three filler slots: the beginning filler f_1, which precedes any slot value; the ending filler f_3, which appears at the end of an utterance; and the middle filler f_2, which is inserted between slot values. We use the term content slot to refer to S when we intend to explicitly distinguish it from a filler slot.
Let F = {f_1, f_2, f_3} be the set of filler slots and M_F = 3 be the number of filler slots. Each slot f_i is a random variable ranging over V, and F is a random variable over V^{M_F}. These notations for filler slots indicate compatibility with a content slot, which suggests that we can formulate F using HDPSMs, as follows:

    H_1^i ~ DP(β_0^i, H_0^i)    for each filler slot f_i,
    H_3 ~ DP(β_1, H_2),    F ~ H_3,

where H_0^i is an n-gram-based distribution over V that is defined in an identical way to (1), H_2(F) = Π_{i=1}^{M_F} H_1^i(f_i), and β_0^i and β_1 are the corresponding concentration parameters.

Figure 4 presents the graphical model of our utterance model. We assume that an utterance u is built with phrases provided by S and F. Therefore, the conditional distribution p(u|S, F) basically involves a distribution over the permutations of these slot values with two constraints: f_1 is placed first and f_3 has to be placed last. In our formulation, we simply adopt a uniform distribution over all possible permutations.

Utterance Model
For training the utterance model, we assume that a set of annotated utterances is available. Each training instance consists of an utterance u and annotated slot values S. Given u and S, we assume that the functional fillers F can be uniquely identified. For the example in Figure 3, we can identify the subsequences in u that correspond to the content slot values "restaurant" and "fen ditton". This matching result leads to the identification of the filler slot values. Consequently, a triple (u, S, F) is regarded as an observation. Because the HDPSMs of the content slots and of the filler slots are conditionally independent given S and F, we can separately apply Algorithm 1 to train each HDPSM.
For slot filling, we examine the posterior probability of content slot values S given u, which can be reformed as follows:

    p(S | u) ∝ Σ_F p(u | S, F) p(S) p(F).

In this equation, we can remove the summation over F because the filler slot values F are uniquely identified given u and S under our assumption. Additionally, we approximately regard p(u | S, F) as a constant if u can be built with S and F. With these assumptions, the posterior probability reduces to the following formula:

    p(S | u) ∝ p(S) p(F),    (3)

where F in this formula denotes the fillers identified given u and S. Consequently, the proposed method attempts to find the most likely combination of slot values and non-slot phrases, since every word in an utterance has to belong to one or the other. Using the trained HDPSMs (i.e., the posterior given all training data), p(S) and p(F) can be computed by (2).
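The reduced posterior can be scored candidate by candidate. The sketch below uses hypothetical stand-ins: `slot_model` and `filler_model` return P(S) and P(F) from the trained HDPSMs, and `identify_fillers` returns the filler tuple uniquely determined by (u, S), or None when u cannot be built from S.

```python
def score_candidate(S, u, slot_model, filler_model, identify_fillers):
    """Score one candidate slot tuple S for utterance u as P(S) * P(F).
    All callables are hypothetical stand-ins for the trained models."""
    F = identify_fillers(u, S)
    if F is None:
        return 0.0                  # u cannot be built with S and F
    return slot_model(S) * filler_model(F)

def best_candidate(candidates, u, slot_model, filler_model, identify_fillers):
    """Pick the candidate with the highest (unnormalized) posterior."""
    return max(candidates,
               key=lambda S: score_candidate(S, u, slot_model,
                                             filler_model, identify_fillers))
```

Because the score is an unnormalized posterior, only the argmax over the candidate list matters; normalization is unnecessary.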

Candidate Generation
For estimating slot values given u, we adopt a candidate generation approach (Williams, 2014) that leverages another slot filling algorithm to enumerate likely candidates. Specifically, we assume a candidate generation function g(u) that generates N candidates {S_1, ..., S_N} for u. Our slot filling algorithm computes the posterior probability by (3) for each candidate S_j and takes the candidate that has the highest posterior probability. In this estimation process, our utterance model works as a secondary filter that covers the errors of the primary analysis. Figure 5 provides an example of candidate generation using a sequential labeling algorithm with IOB tags. The subsequences to which the O tag is assigned can be regarded as functional fillers. The values for each filler slot are identified depending on the position of the subsequence, as the figure shows.
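A sketch of this conversion is shown below (hypothetical helper; assumes well-formed IOB tags and merges adjacent same-name spans). It turns one IOB-tagged candidate into content slot values plus position-dependent filler slot values f1, f2, and f3.

```python
def iob_to_candidate(tokens, tags):
    """Convert an IOB-tagged token sequence into (content slots, filler
    slots). O-tagged runs become the beginning (f1), middle (f2), or
    ending (f3) filler depending on their position, as in the text."""
    slots = {}
    fillers = {"f1": [], "f2": [], "f3": []}
    seen_slot = False
    i = 0
    while i < len(tokens):
        if tags[i] == "O":
            run = []
            while i < len(tokens) and tags[i] == "O":
                run.append(tokens[i]); i += 1
            if not seen_slot:
                key = "f1"          # precedes any slot value
            elif i == len(tokens):
                key = "f3"          # ends the utterance
            else:
                key = "f2"          # between slot values
            fillers[key].append(" ".join(run))
        else:
            name = tags[i][2:]      # strip the "B-"/"I-" prefix
            run = []
            while i < len(tokens) and tags[i] != "O" and tags[i][2:] == name:
                run.append(tokens[i]); i += 1
            slots[name] = " ".join(run)
            seen_slot = True
    return slots, fillers
```

Each of the N-best tag sequences from the CRF can be converted this way before scoring with the utterance model.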

Experiments
We evaluate the performance of the proposed generative model in an experiment using the algorithm described in Section 5. We adopt a conditional random field (CRF) as the candidate generation algorithm, which generates the N-best estimates as candidates. For the CRF, we apply commonly used features, including the unigrams and bigrams of the surface form and part of speech of each word. We used CRF++ as the CRF implementation.

Table 1: Summary of the datasets.
name     #utterances  #slots  max. diversity
DSTC     1,441        6       55
Weather  1,442        3       191

Dataset
The performance of our method is evaluated using two datasets from different languages, as summarized in Table 1. The first dataset is provided by the third Dialog State Tracking Challenge (Henderson, 2015), hereafter referred to as the DSTC corpus.
The DSTC corpus consists of dialogs in the tourist information domain. In our experiment, we use the user's first utterance in each dialog, which typically describes the user's query to the system. Utterances without any slot information are excluded. We manually modified the annotated slot values into "as-is" form to allow a sequential labeling method to extract the ground-truth values. This identification process can be done in a semi-automatic manner that involves no expert knowledge. We apply the part-of-speech tagger in NLTK for the CRF features. The second dataset is a weather corpus consisting of user utterances in an in-house corpus of human-machine dialogs in the weather domain. It contains 1,442 questions spoken in Japanese. In this corpus, the number of value types for each slot is higher than in the DSTC corpus, which indicates a more challenging task. We applied the Japanese morphological analyzer MeCab (Kudo et al., 2004) to segment the Japanese text into words before applying the CRF.
For both datasets, we examine the effect of the amount of available annotated utterances by varying the number of training examples over 25, 50, 75, 100, 200, 400, 800, and all.

Evaluation Metrics
The methods are compared in terms of slot estimation accuracy. Let n_c be the number of utterances for which the estimated slot S and the ground-truth slot Ŝ are perfectly matched, and let n_e be the number of utterances including an estimation error. The slot estimation accuracy is simply calculated as n_c / (n_c + n_e). All evaluation scores are calculated as the average of 10-fold cross validation. We also conduct a binomial test to examine the statistical significance of the improvement of the proposed algorithm over the CRF baseline.
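As a minimal sketch, the exact-match accuracy n_c / (n_c + n_e) can be computed as follows (hypothetical helper; assumes the predictions and ground truths are aligned lists of slot-value dictionaries, so that every utterance is either fully correct or counted as an error):

```python
def slot_accuracy(pred, gold):
    """Exact-match slot estimation accuracy n_c / (n_c + n_e): an
    utterance counts as correct only when all slot values match."""
    n_c = sum(1 for p, g in zip(pred, gold) if p == g)
    return n_c / len(pred)
```

This is a whole-utterance metric: a single wrong slot value makes the entire utterance count toward n_e.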

Results
Tables 2 and 3 present the slot estimation accuracy for the DSTC corpus and the Japanese weather corpus, respectively. The baseline (CRF best) is a method that takes only the one best output of the CRF for slot estimation. HDP with N = 5 and N = 300 denotes the proposed method, where N is the number of candidates generated by the CRF candidate generator. The asterisks (*) beside the HDP accuracy indicate statistical significance against CRF best, tested using the binomial test.
The results show that our proposed method performs significantly better than the CRF, especially when the amount of training data is limited. This property is attractive for practical speech recognition systems that offer many different functions: accurate recognition at an early stage of development allows a practitioner to launch a service early, which in turn makes it possible to quickly collect hundreds of real speech examples.
Since we use the CRF as the candidate generator, we expect the CRF N-best list to rank the correct answer high; in fact, the top five candidates cover almost all of the correct answers. The comparison between N = 5 and N = 300 therefore indicates the stability of the proposed method against the mostly noisy additional 295 candidates. Because the proposed algorithm makes no use of the original ranking order, N = 300 is a harder condition in which to identify the correct answer. Nevertheless, the results show that the drop in performance is limited; the accuracy is still significantly better than the baseline. This suggests that the proposed method is only weakly dependent on the performance of the candidate generator.

Table 4 presents some examples of the slot values estimated by CRF best and by HDP with N = 5 when the number of training utterances is 800. The first two are samples for which CRF best failed to predict the correct values. These errors are attributed to the infrequent sequential patterns of the rarely observed expressions "that serves fast food" and "moderate restaurant", because the CRF is a position-based classifier. The value-based formulation allows our model to learn that the phrase "fast food" is more likely to be a food name than a functional filler, and to reject the erroneous candidate.
The third example in Table 4 shows an error made by HDP, which extracted "chine chinese takeaway", a phrase that includes a reparandum of disfluency (Georgila et al., 2010). This error can be attributed to the fact that this kind of disfluency resembles the true slot value, which leads to a higher probability of "chine" in the food slot model than in the functional filler model. For this type of error, preliminary application of a disfluency detection method (Zayats et al., 2016) is a promising way to improve accuracy.
The execution time for training the proposed HDP utterance model with 1,297 training examples from the Japanese weather corpus was about 0.3 seconds. This compares favorably with the CRF training, which takes about 5.5 seconds. Moreover, the training of the proposed HDP model is scalable and works in an online manner because it is a single-pass algorithm. When we have a very large number of training examples, the bottleneck is the CRF training, which requires scanning the whole dataset repeatedly.

Conclusion
In this paper, we proposed an arbitrary slot filling method that directly deals with the posterior probability of slot values by using nonparametric Bayesian models. We presented a two-stage method that involves an N-best candidate generation step, which is typically done using a CRF. Experimental results show that our method significantly improves recognition accuracy. This empirical evidence suggests that the value-based formulation is a promising approach for arbitrary slot filling tasks, which is worth exploring further in future work.