Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding

Spoken dialogue systems (SDS) typically require a predefined semantic ontology to train a spoken language understanding (SLU) module. In addition to the annotation cost, a key challenge for designing such an ontology is to define a coherent slot set while considering their complex relations. This paper introduces a novel matrix factorization (MF) approach to learn latent feature vectors for utterances and semantic elements without the need of corpus annotations. Specifically, our model learns the semantic slots for a domain-specific SDS in an unsupervised fashion, and carries out semantic parsing using latent MF techniques. To further consider the global semantic structure, such as inter-word and inter-slot relations, we augment the latent MF-based model with a knowledge graph propagation model based on a slot-based semantic graph and a word-based lexical graph. Our experiments show that the proposed MF approaches produce better SLU models that are able to predict semantic slots and word patterns taking into account their relations and domain-specificity in a joint manner.


Introduction
A key component of a spoken dialogue system (SDS) is the spoken language understanding (SLU) module, which parses the users' utterances into semantic representations; for example, the utterance "find a cheap restaurant" can be parsed into (price=cheap, target=restaurant) (Pieraccini et al., 1992). To design the SLU module of an SDS, most previous studies relied on predefined slots for training the decoder (Seneff, 1992; Dowding et al., 1993; Gupta et al., 2006; Bohus and Rudnicky, 2009). However, these predefined semantic slots may bias the subsequent data collection process, and manually labeling utterances to update the ontology is expensive.
In recent years, this problem has led to the development of unsupervised SLU techniques (Chen et al., 2014b). In particular, a frame-semantics based framework was proposed for automatically inducing semantic slots given raw audio. However, these approaches generally do not explicitly learn the latent factor representations to model the measurement errors (Skrondal and Rabe-Hesketh, 2004), nor do they jointly consider the complex lexical, syntactic, and semantic relations among words, slots, and utterances.
Another challenge of SLU is the inference of the hidden semantics. Consider the user utterance "can i have a cheap restaurant": from its surface patterns, we can see that it includes explicit semantic information about "price (cheap)" and "target (restaurant)"; however, it also includes hidden semantic information, such as "food" and "seeking", since the SDS needs to infer that the user wants to "find" some cheap "food", even though these concepts are not directly observed in the surface patterns. Nonetheless, these implicit semantics are important semantic concepts for domain-specific SDSs. Traditional SLU models use discriminative classifiers (Henderson et al., 2012) to predict whether the predefined slots occur in the utterances or not, ignoring the unobserved concepts and the hidden semantic information.
In this paper, we take a rather radical approach: we propose a novel matrix factorization (MF) model for learning latent features for SLU, taking account of additional information such as the word relations, the induced slots, and the slot relations. To further consider the global coherence of induced slots, we combine the MF model with a knowledge graph propagation based model, fusing both a word-based lexical knowledge graph and a slot-based semantic graph. In fact, as shown in the Netflix challenge, MF is credited as the most useful technique for recommendation systems (Koren et al., 2009). Also, the MF model considers the unobserved patterns and estimates their probabilities instead of viewing them as negative examples. However, to the best of our knowledge, the MF technique is not yet well understood in the SLU and SDS communities, and it is not straightforward to use MF methods to learn latent feature representations for semantic parsing in SLU. To evaluate the performance of our model, we compare it to standard discriminative SLU baselines, and show that our MF-based model is able to produce strong results in semantic decoding, and that the knowledge graph propagation model further improves the performance. Our contributions are three-fold:
• We are among the first to study matrix factorization techniques for unsupervised SLU, taking account of additional information;
• We augment the MF model with a knowledge graph propagation model, increasing the global coherence of semantic decoding using induced slots;
• Our experimental results show that the MF-based unsupervised SLU outperforms strong discriminative baselines, obtaining promising results.
In the next section, we outline the related work in unsupervised SLU and latent variable modeling for spoken language processing. Section 3 introduces our framework. The detailed MF approach is explained in Section 4. We then introduce the global knowledge graphs for MF in Section 5. Section 6 shows the experimental results, and Section 7 concludes.

Related Work
Unsupervised SLU. Tur et al. (2011) were among the first to consider unsupervised approaches for SLU, where they exploited query logs for slot-filling. In a subsequent study, the Semantic Web was studied for an unsupervised intent detection problem in SLU, showing that results obtained from the unsupervised training process align well with the performance of traditional supervised learning. Following the success of unsupervised SLU, recent studies have also obtained interesting results on the tasks of relation detection (Chen et al., 2014a), entity extraction (Wang et al., 2014), and extending domain coverage (El-Kahky et al., 2014; Chen and Rudnicky, 2014). However, most of the studies above do not explicitly learn latent factor representations from the data, while we hypothesize that better robustness on noisy data can be achieved by explicitly modeling the measurement errors (usually produced by automatic speech recognizers (ASR)) using latent variable models and taking additional local and global semantic constraints into account.
Latent Variable Modeling in SLU. Early studies on latent variable modeling in speech included the classic hidden Markov model for statistical speech recognition (Jelinek, 1997). Recently, the intent detection problem was first studied using query logs and a discrete Bayesian latent variable model. In the field of dialogue modeling, the partially observable Markov decision process (POMDP) (Young et al., 2013) model is a popular technique for dialogue management, reducing the cost of handcrafted dialogue managers while producing robustness against speech recognition errors. More recently, a semi-supervised LDA model was used to show improvement on the slot filling task. Also, Zhai and Williams (2014) proposed an unsupervised model for connecting words with latent states in HMMs using topic models, obtaining interesting qualitative and quantitative results. However, for unsupervised learning for SLU, it is not obvious how to incorporate additional information in the HMMs. To the best of our knowledge, this paper is the first to consider MF techniques for learning latent feature representations in unsupervised SLU, taking various local and global lexical, syntactic, and semantic information into account.

The Proposed Framework
This paper introduces a matrix factorization technique for unsupervised SLU. The proposed framework is shown in Figure 1(a). Given the utterances, the task of the SLU model is to decode their surface patterns into semantic forms and, at the same time, differentiate the target semantic concepts from the generic semantic space for task-oriented SDSs. Note that our model does not require any human-defined slots or domain-specific semantic representations for utterances.
In the proposed model, we first build a feature matrix to represent the training utterances, where each row represents an utterance, and each column refers to an observed surface pattern or an induced slot candidate. Figure 1(b) illustrates an example of the matrix. Given a testing utterance, we convert it into a vector based on the observed surface patterns, and then fill in the missing values of the slots. In the first utterance in the figure, although the semantic slot food is not observed, the utterance implies the meaning facet food. The MF approach is able to learn the latent feature vectors for utterances and semantic elements, inferring implicit semantic concepts to improve the decoding process, namely by filling the matrix with probabilities (lower part of the matrix).
Figure 1(b): Our matrix factorization method completes a partially-missing matrix for implicit semantic parsing. Dark circles are observed facts, shaded circles are inferred facts. The slot induction maps (yellow arrow) observed surface patterns to semantic slot candidates. The word relation model (blue arrow) constructs correlations between surface patterns. The slot relation model (pink arrow) learns the slot-level correlations based on propagating the automatically derived semantic knowledge graphs. Reasoning with matrix factorization (gray arrow) incorporates these models jointly, and produces a coherent, domain-specific SLU model.
The feature model is built on the observed word patterns and slot candidates, where the slot candidates are obtained from the slot induction component through frame-semantic parsing (the yellow block in Figure 1(a)). Section 4.1 explains the detail of the feature model.
In order to consider the additional inter-word and inter-slot relations, we propose a knowledge graph propagation model based on two knowledge graphs, which includes a word relation model (blue block) and a slot relation model (pink block), described in Section 4.2. The method of automatic knowledge graph construction is introduced in Section 5, where we leverage distributed word embeddings associated with typed syntactic dependencies to model the relations (Mikolov et al., 2013b; Mikolov et al., 2013c; Levy and Goldberg, 2014; Chen et al., 2015).
Finally, we train the SLU model by learning latent feature vectors for utterances and slot candidates through MF techniques. Combined with a knowledge graph propagation model based on word/slot relations, the trained SLU model simultaneously estimates the probability that each semantic slot occurs in the testing utterance and how likely each slot is to be domain-specific. In other words, the SLU model is able to transform the testing utterances into domain-specific semantic representations without human involvement.

The Matrix Factorization Approach
Considering the benefits brought by MF techniques, including 1) modeling the noisy data, 2) modeling hidden semantics, and 3) modeling the long-range dependencies between observations, in this work we apply an MF approach to SLU modeling for SDSs. In our model, we use $U$ to denote the set of input utterances, $W$ the set of word patterns, and $S$ the set of semantic slots that we would like to predict. The pair of an utterance $u \in U$ and a word pattern/semantic slot $x \in W \cup S$, $\langle u, x \rangle$, is a fact. The input to our model is a set of observed facts $O$, and the observed facts for a given utterance are denoted by $\{\langle u, x \rangle \in O\}$. The goal of our model is to estimate, for a given utterance $u$ and a given word pattern/semantic slot $x$, the probability $p(M_{u,x} = 1)$, where $M_{u,x}$ is a binary random variable that is true if and only if $x$ is the word pattern/domain-specific semantic slot in the utterance $u$. We introduce a series of exponential family models that estimate the probability using a natural parameter $\theta_{u,x}$ and the logistic sigmoid function:
$$p(M_{u,x} = 1 \mid \theta_{u,x}) = \sigma(\theta_{u,x}) = \frac{1}{1 + \exp(-\theta_{u,x})}. \tag{1}$$
We construct a matrix $M_{|U| \times (|W|+|S|)}$ as observed facts for MF by integrating a feature model and a knowledge graph propagation model below.
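As a minimal sketch of the probability model above, the snippet below maps a natural parameter to a fact probability with the logistic sigmoid. Taking the natural parameter to be the dot product of an utterance vector and a column vector is an assumption made here for illustration (a common MF parameterization), and the names `fact_probability`, `utterance_vec`, and `slot_vec` are hypothetical.

```python
import math


def fact_probability(theta):
    """Logistic sigmoid: probability that the fact <u, x> holds."""
    return 1.0 / (1.0 + math.exp(-theta))


# Assumed parameterization: theta is the dot product of two latent vectors.
utterance_vec = [0.5, -1.0, 2.0]   # hypothetical latent vector for utterance u
slot_vec = [1.0, 0.5, 0.25]        # hypothetical latent vector for column x
theta = sum(a * b for a, b in zip(utterance_vec, slot_vec))
p = fact_probability(theta)
```

A natural parameter of zero corresponds to a probability of 0.5, and larger values push the fact toward being true.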

Feature Model
First, we build a word pattern matrix $F_w$ with binary values based on observations, where each row represents an utterance and each column refers to an observed unigram. In other words, $F_w$ carries the basic word vectors for the utterances, which is illustrated as the left part of the matrix in Figure 1(b).
To induce the semantic elements, we parse all ASR-decoded utterances in our corpus using SEMAFOR², a state-of-the-art semantic parser for frame-semantic parsing (Das et al., 2010; Das et al., 2013), and extract all frames from semantic parsing results as slot candidates (Dinarelli et al., 2009). Figure 2 shows an example of an ASR-decoded output parsed by SEMAFOR. Three FrameNet-defined frames (capability, expensiveness, and locale by use) are generated for the utterance, which we consider as slot candidates for a domain-specific dialogue system (Baker et al., 1998). Then we build a slot matrix $F_s$ with binary values based on the induced slots, which also denotes the slot features for the utterances (right part of the matrix in Figure 1(b)).
2 http://www.ark.cs.cmu.edu/SEMAFOR/
To build the feature model $M_F$, we concatenate the two matrices:
$$M_F = [\, F_w \;\; F_s \,], \tag{2}$$
which is the upper part of the matrix in Figure 1(b) for training utterances. Note that we do not use any annotations, so all slot candidates are included.
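The construction of the feature model can be sketched as follows: binary word and slot indicators are built per utterance and concatenated column-wise. This is an illustrative sketch, not the authors' code; the function name `build_feature_model`, the underscore in `locale_by_use`, and the second toy utterance with its slot are assumptions made for the example.

```python
def build_feature_model(utterance_words, utterance_slots):
    """Return (M_F, columns): binary rows [F_w | F_s] and the column labels."""
    words = sorted({w for ws in utterance_words for w in ws})
    slots = sorted({s for ss in utterance_slots for s in ss})
    rows = []
    for ws, ss in zip(utterance_words, utterance_slots):
        f_w = [1 if w in ws else 0 for w in words]   # word pattern part (F_w)
        f_s = [1 if s in ss else 0 for s in slots]   # slot candidate part (F_s)
        rows.append(f_w + f_s)                       # column-wise concatenation
    return rows, words + slots


# First utterance follows the running example; the second is a toy addition.
utterance_words = [
    {"can", "i", "have", "a", "cheap", "restaurant"},
    {"find", "a", "restaurant"},
]
utterance_slots = [
    {"capability", "expensiveness", "locale_by_use"},
    {"locale_by_use"},
]
M_F, columns = build_feature_model(utterance_words, utterance_slots)
```

Each row of `M_F` has one entry per observed unigram followed by one entry per induced slot candidate, so no annotation is needed to build it.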

Knowledge Graph Propagation Model
Since SEMAFOR was trained on FrameNet annotation, which has a more generic frame-semantic context, not all the frames from the parsing results can be used as the actual slots in domain-specific dialogue systems. For instance, in Figure 2, we see that the frames "expensiveness" and "locale by use" are essentially the key slots for the purpose of understanding in the restaurant query domain, whereas the "capability" frame does not convey particularly valuable information for SLU. Assuming that domain-specific concepts are usually related to each other, considering global relations between semantic slots induces a more coherent slot set. It has been shown that the relations on knowledge graphs help make decisions on domain-specific slots (Chen et al., 2015). We consider two directed graphs, a semantic and a lexical knowledge graph: each node in the semantic knowledge graph is a slot candidate $s_i$ generated by the frame-semantic parser, and each node in the lexical knowledge graph is a word $w_j$.
The edges connect two nodes in the graphs if there is a typed dependency between them. Figure 3 is a simplified example of a slot-based semantic knowledge graph. The structured graph helps define a coherent slot set. To model the relations between words/slots based on the knowledge graphs, we define two relation models below.
• Semantic Relation
For modeling word semantic relations, we compute a matrix $R^S_w = [\mathrm{Sim}(w_i, w_j)]_{|W| \times |W|}$, where $\mathrm{Sim}(w_i, w_j)$ is the cosine similarity between the dependency embeddings of the word patterns $w_i$ and $w_j$ after normalization. For slot semantic relations, we compute $R^S_s = [\mathrm{Sim}(s_i, s_j)]_{|S| \times |S|}$ similarly³. The matrices $R^S_w$ and $R^S_s$ model not only the semantic but also the functional similarity since we use dependency-based embeddings (Levy and Goldberg, 2014).

• Dependency Relation
Assuming that important semantic slots are usually mutually related to each other, that is, connected by syntactic dependencies, our automatically derived knowledge graphs are able to help model the dependency relations. For word dependency relations, we compute a matrix $R^D_w = [\hat{r}(w_i, w_j)]_{|W| \times |W|}$, where $\hat{r}(w_i, w_j)$ measures the dependency between two word patterns $w_i$ and $w_j$ based on the word-based lexical knowledge graph; the detail is described in Section 5. For slot dependency relations, we similarly compute $R^D_s = [\hat{r}(s_i, s_j)]_{|S| \times |S|}$ based on the slot-based semantic knowledge graph.
With the built word relation models ($R^S_w$ and $R^D_w$) and slot relation models ($R^S_s$ and $R^D_s$), we combine them as a knowledge graph propagation matrix $M_R$⁴:
$$M_R = \begin{bmatrix} R^{SD}_w & 0 \\ 0 & R^{SD}_s \end{bmatrix}, \tag{3}$$
where $R^{SD}_w = R^S_w + R^D_w$ and $R^{SD}_s = R^S_s + R^D_s$ integrate semantic and dependency relations. The goal of this matrix is to propagate scores between nodes according to different types of relations in the knowledge graphs (Chen and Metze, 2012).
3 For each column in $R^S_w$ and $R^S_s$, we only keep the top 10 highest values, which correspond to the top 10 semantically similar nodes.
4 The values on the diagonal of $M_R$ are 0 to model the propagation from other entries.
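The assembly of the propagation matrix can be sketched in plain Python as below. The block-diagonal layout (word relations in one block, slot relations in the other) is an assumption read off the surrounding description, and the helper names and toy relation values are hypothetical; only the top-k column filtering, the sum of semantic and dependency relations, and the zero diagonal come from the text.

```python
def keep_top_k_per_column(R, k):
    """Keep the k largest entries of each column of a square matrix, zero the rest."""
    n = len(R)
    kept = [[0.0] * n for _ in range(n)]
    for j in range(n):
        top_rows = sorted(range(n), key=lambda i: R[i][j], reverse=True)[:k]
        for i in top_rows:
            kept[i][j] = R[i][j]
    return kept


def add(A, B):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def propagation_matrix(R_w, R_s):
    """Place R_w and R_s on a block diagonal (assumed layout) and zero the diagonal."""
    n_w, n = len(R_w), len(R_w) + len(R_s)
    M_R = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(R_w):
        for j, v in enumerate(row):
            M_R[i][j] = v
    for i, row in enumerate(R_s):
        for j, v in enumerate(row):
            M_R[n_w + i][n_w + j] = v
    for i in range(n):
        M_R[i][i] = 0.0   # propagate only from other entries
    return M_R


# Toy relation matrices for two words and two slots (values are made up).
R_S_w = [[1.0, 0.5], [0.5, 1.0]]
R_D_w = [[0.0, 0.25], [0.25, 0.0]]
R_S_s = [[1.0, 0.75], [0.75, 1.0]]
R_D_s = [[0.0, 0.5], [0.5, 0.0]]

R_SD_w = add(keep_top_k_per_column(R_S_w, 10), R_D_w)
R_SD_s = add(keep_top_k_per_column(R_S_s, 10), R_D_s)
M_R = propagation_matrix(R_SD_w, R_SD_s)
```

With only two nodes per graph the top-10 filter keeps everything; on a realistic vocabulary it sparsifies the semantic relation matrices before they are combined.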

Integrated Model
With a feature model $M_F$ and a knowledge graph propagation model $M_R$, we integrate them into a single matrix:
$$M = M_F \cdot (\alpha I + \beta M_R) = [\, \alpha F_w + \beta F_w R_w \;\;\; \alpha F_s + \beta F_s R_s \,], \tag{4}$$
where $M$ is the final matrix, $I$ is the identity matrix, and $R_w$ and $R_s$ denote the word and slot blocks of $M_R$. $\alpha$ and $\beta$ are the weights for balancing original values and propagated values, where $\alpha + \beta = 1$. The matrix $M$ is similar to $M_F$, but some weights are enhanced through the knowledge graph propagation model, $M_R$. The word relations are built by $F_w R_w$, which is the matrix with internal weight propagation on the lexical knowledge graph (the blue arrow in Figure 1(b)). Similarly, $F_s R_s$ models the slot correlations, and can be treated as the matrix with internal weight propagation on the semantic knowledge graph (the pink arrow in Figure 1(b)). The propagation models can be treated as running a random walk algorithm on the graphs.
$F_s$ contains all slot candidates generated by SEMAFOR, which may include some generic slots (such as capability), so the original feature model cannot differentiate the domain-specific and generic concepts. By integrating with $R_s$, the semantic and dependency relations can be propagated via the knowledge graph, and the domain-specific concepts may have higher weights based on the assumption that the slots for dialogue systems are often mutually related (Chen et al., 2015). Hence, the structure information can be automatically involved in the matrix. The word relation model serves the same function at the word level. In conclusion, for each utterance, the integrated model not only predicts the probability that semantic slots occur but also considers whether the slots are domain-specific. The following sections describe the learning process.
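To make the integration step concrete, the sketch below multiplies a one-utterance feature matrix by the weighted sum of the identity and a propagation matrix. It assumes the integration takes the form of the feature matrix times (α·I + β·M_R), which is how the surrounding description reads; the function name `integrate` and the toy matrices are hypothetical.

```python
def integrate(M_F, M_R, alpha=0.5, beta=0.5):
    """Return M_F * (alpha * I + beta * M_R) using plain nested lists."""
    n = len(M_R)
    # P combines the original values (identity) with propagated values (M_R).
    P = [[beta * M_R[i][j] + (alpha if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return [[sum(row[k] * P[k][j] for k in range(n)) for j in range(n)]
            for row in M_F]


# Columns: two words (w1, w2) followed by two slots (s1, s2).
M_F = [[1, 0, 1, 0]]            # the utterance shows w1 and s1 only
M_R = [
    [0.0, 0.5, 0.0, 0.0],       # w1 <-> w2 relation
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],       # s1 <-> s2 relation
    [0.0, 0.0, 1.0, 0.0],
]
M = integrate(M_F, M_R)
```

In this toy case the unobserved slot s2 receives weight 0.5 purely through its relation to the observed slot s1, which is the kind of enhancement the propagation model is meant to provide.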

Parameter Estimation
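The proposed model is parameterized through weights and latent component vectors, where the parameters are estimated by maximizing the log likelihood of observed data (Collins et al., 2001):
$$\theta^* = \arg\max_{\theta} \prod_{u \in U} p(M_u \mid \theta) = \arg\max_{\theta} \sum_{u \in U} \ln p(M_u \mid \theta), \tag{5}$$
where $M_u$ is the vector corresponding to the utterance $u$ from $M_{u,x}$ in (1), and the likelihood factorizes over utterances because we assume that each utterance is independent of others.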
To avoid treating unobserved facts as designed negative facts, we consider our positive-only data as implicit feedback. Bayesian Personalized Ranking (BPR) is an optimization criterion that learns from implicit feedback for MF, which uses a variant of the ranking: giving observed true facts higher scores than unobserved (true or false) facts (Rendle et al., 2009). Riedel et al. (2013) also showed that BPR learns the implicit relations for improving the relation extraction task.

Objective Function
To estimate the parameters in (5), we create a dataset of ranked pairs from $M$ in (4): for each utterance $u$ and each observed fact $f^+ = \langle u, x^+ \rangle$, where $M_{u,x^+} \geq \delta$, we choose each word pattern/slot $x^-$ such that $f^- = \langle u, x^- \rangle$, where $M_{u,x^-} < \delta$, which refers to the word pattern/slot we have not observed to be in utterance $u$. That is, we construct the observed data $O$ from $M$. Then for each pair of facts $f^+$ and $f^-$, we want to model $p(f^+) > p(f^-)$ and hence $\theta_{f^+} > \theta_{f^-}$ according to (1). BPR maximizes the summation over all ranked pairs, where the objective is
$$\sum_{f^+ \in O} \sum_{f^- \notin O} \ln \sigma(\theta_{f^+} - \theta_{f^-}). \tag{6}$$
The BPR objective is an approximation to the per-utterance AUC (area under the ROC curve), which directly correlates with what we want to achieve: well-ranked semantic slots per utterance.
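The pair construction and the ranking objective can be sketched as follows. The sketch assumes the standard BPR form, a sum of log-sigmoid score differences over (observed, unobserved) pairs within the same utterance; `ranked_pairs`, `bpr_objective`, and the toy scores are hypothetical names and values.

```python
import math


def ranked_pairs(M, delta=0.5):
    """List (u, x_pos, x_neg) with M[u][x_pos] >= delta and M[u][x_neg] < delta."""
    pairs = []
    for u, row in enumerate(M):
        pos = [x for x, v in enumerate(row) if v >= delta]
        neg = [x for x, v in enumerate(row) if v < delta]
        pairs.extend((u, xp, xn) for xp in pos for xn in neg)
    return pairs


def bpr_objective(theta, pairs):
    """Sum of ln sigmoid(theta_pos - theta_neg) over all ranked pairs."""
    total = 0.0
    for u, xp, xn in pairs:
        diff = theta[u][xp] - theta[u][xn]
        total += math.log(1.0 / (1.0 + math.exp(-diff)))
    return total


M = [[0.5, 0.25, 0.5, 0.5]]            # one utterance, four columns
pairs = ranked_pairs(M, delta=0.5)

theta_flat = [[0.0, 0.0, 0.0, 0.0]]    # no preference between columns
theta_good = [[2.0, -2.0, 2.0, 2.0]]   # observed columns score higher
flat_value = bpr_objective(theta_flat, pairs)
good_value = bpr_objective(theta_good, pairs)
```

Scores that rank observed facts above unobserved ones yield a larger objective than uninformative scores, which is the property the training procedure exploits.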

Optimization
To maximize the objective in (6), we employ a stochastic gradient descent (SGD) algorithm (Rendle et al., 2009). For each randomly sampled observed fact $\langle u, x^+ \rangle$, we sample an unobserved fact $\langle u, x^- \rangle$, which results in $|O|$ fact pairs $\langle f^-, f^+ \rangle$. For each pair, we perform an SGD update using the gradient of the corresponding objective function for matrix factorization (Gantner et al., 2011).
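A minimal sketch of the sampling-based update is given below, in the spirit of BPR matrix factorization. It assumes that each score is the dot product of a latent utterance vector and a latent column vector, with L2 regularization; these modeling details, the hyperparameter values, and all function names are assumptions for illustration rather than the configuration used in the paper.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bpr_objective(P, Q, observed, n_cols):
    """Sum of ln sigmoid(score_pos - score_neg) over all ranked pairs."""
    total = 0.0
    for u, pos in observed.items():
        for xp in pos:
            for xn in range(n_cols):
                if xn not in pos:
                    total += math.log(sigmoid(dot(P[u], Q[xp]) - dot(P[u], Q[xn])))
    return total


def train_bpr(P, Q, observed, n_cols, rng, steps=2000, lr=0.1, reg=0.01):
    """Update P and Q in place with stochastic gradient ascent on the BPR objective."""
    facts = [(u, x) for u, pos in observed.items() for x in sorted(pos)]
    for _ in range(steps):
        u, xp = rng.choice(facts)                      # sampled observed fact
        xn = rng.choice([x for x in range(n_cols) if x not in observed[u]])
        g = 1.0 - sigmoid(dot(P[u], Q[xp]) - dot(P[u], Q[xn]))
        for k in range(len(P[u])):
            pu, qp, qn = P[u][k], Q[xp][k], Q[xn][k]
            P[u][k] += lr * (g * (qp - qn) - reg * pu)
            Q[xp][k] += lr * (g * pu - reg * qp)
            Q[xn][k] -= lr * (g * pu + reg * qn)


observed = {0: {0, 2}, 1: {1, 3}}   # toy data: utterance -> observed columns
n_cols, dim = 4, 2
rng = random.Random(0)
P = {u: [rng.gauss(0, 0.1) for _ in range(dim)] for u in observed}
Q = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(n_cols)]

before = bpr_objective(P, Q, observed, n_cols)
train_bpr(P, Q, observed, n_cols, rng)
after = bpr_objective(P, Q, observed, n_cols)
```

Comparing `before` and `after` gives a quick sanity check that the updates move the objective in the intended direction on the toy data.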

Knowledge Graph Construction
This section introduces the procedure of constructing knowledge graphs in order to estimate $\hat{r}(w_i, w_j)$ for building $R^D_w$ and $\hat{r}(s_i, s_j)$ for $R^D_s$ in Section 4.2. Considering the relations in the knowledge graphs, the edge weights for $E_{ww}$ and $E_{ss}$ are measured as $\hat{r}(w_i, w_j)$ and $\hat{r}(s_i, s_j)$ respectively, based on the dependency parsing results.
The example utterance "can i have a cheap restaurant" and its dependency parsing result are illustrated in Figure 4. The arrows denote the dependency relations from headwords to their dependents, and words on arcs denote types of the dependencies. All typed dependencies between two words are encoded in triples and form a word-based dependency set $T_w = \{\langle w_i, t, w_j \rangle\}$, where $t$ is the typed dependency between the headword $w_i$ and the dependent $w_j$. For example, Figure 4 generates $\langle$restaurant, AMOD, cheap$\rangle$, $\langle$restaurant, DOBJ, have$\rangle$, etc. for $T_w$. Similarly, we build a slot-based dependency set $T_s = \{\langle s_i, t, s_j \rangle\}$ by transforming dependencies between slot-fillers into ones between slots. For example, $\langle$restaurant, AMOD, cheap$\rangle$ from $T_w$ is transformed into $\langle$locale by use, AMOD, expensiveness$\rangle$ for building $T_s$, because both sides of the non-dotted line are parsed as slot-fillers by SEMAFOR.
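The transformation from word-level to slot-level dependencies can be sketched as a simple mapping over triples. The function name `slot_dependencies`, the `slot_of` dictionary, and the underscore spelling of `locale_by_use` are assumptions for illustration; the triples and slot-fillers follow the running example.

```python
def slot_dependencies(word_triples, slot_of):
    """Map (head, type, dependent) word triples to slot triples.

    A triple is kept only when both words are slot-fillers.
    """
    return [
        (slot_of[head], dep_type, slot_of[dep])
        for head, dep_type, dep in word_triples
        if head in slot_of and dep in slot_of
    ]


# Word-based dependency triples from the example utterance.
T_w = [
    ("restaurant", "AMOD", "cheap"),
    ("restaurant", "DOBJ", "have"),
]
# Slot-fillers identified by the frame-semantic parser in the example.
slot_of = {
    "can": "capability",
    "cheap": "expensiveness",
    "restaurant": "locale_by_use",
}
T_s = slot_dependencies(T_w, slot_of)
```

The DOBJ triple is dropped because "have" is not a slot-filler, so only the AMOD relation between the two slots survives.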

Relation Weight Estimation
For the edges in the knowledge graphs, we model the relations between two connected nodes $x_i$ and $x_j$ as $\hat{r}(x_i, x_j)$, where $x$ is either a slot $s$ or a word pattern $w$. Since the weights are measured based on the relations between nodes regardless of the directions, we combine the scores of two directional dependencies:
$$\hat{r}(x_i, x_j) = r(x_i \rightarrow x_j) + r(x_j \rightarrow x_i), \tag{7}$$
where $r(x_i \rightarrow x_j)$ is the score estimating the dependency including $x_i$ as a head and $x_j$ as a dependent. We propose a scoring function for $r(\cdot)$ using dependency-based embeddings.

Dependency-Based Embeddings
Most neural embeddings use linear bag-of-words contexts, where a window size is defined to produce contexts of the target words (Mikolov et al., 2013c; Mikolov et al., 2013b; Mikolov et al., 2013a). However, some important contexts may be missing due to smaller windows, while larger windows capture broad topical content. A dependency-based embedding approach was proposed to derive contexts based on the syntactic relations the word participates in for training embeddings, where the embeddings are less topical but offer more functional similarity compared to original embeddings (Levy and Goldberg, 2014). Table 1 shows the extracted dependency-based contexts for each target word from the example in Figure 4, where headwords and their dependents can form the contexts by following the arc on a word in the dependency tree, and the $-1$ superscript denotes the directionality of the dependency. After replacing original bag-of-words contexts with dependency-based contexts, we can train dependency-based embeddings for all target words (Yih et al., 2014; Bordes et al., 2011; Bordes et al., 2013).
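The extraction of dependency-based contexts can be sketched as below, following the convention of Levy and Goldberg (2014): a head takes its dependent and the relation type as context, and a dependent takes its head with an inverse marker. The string format (`word/TYPE` and a `-1` suffix) and the function name are choices made for this illustration.

```python
from collections import defaultdict


def dependency_contexts(triples):
    """Map each target word to its dependency-based contexts.

    For a triple (head, type, dependent), the head gets "dependent/type"
    and the dependent gets "head/type-1" (inverse direction).
    """
    contexts = defaultdict(list)
    for head, dep_type, dep in triples:
        contexts[head].append(f"{dep}/{dep_type}")
        contexts[dep].append(f"{head}/{dep_type}-1")
    return dict(contexts)


# Triples as listed for the example utterance.
triples = [
    ("restaurant", "AMOD", "cheap"),
    ("restaurant", "DOBJ", "have"),
]
contexts = dependency_contexts(triples)
```

The resulting (target, context) pairs would replace the window-based pairs when training embeddings.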
For training dependency-based word embeddings, each target $x$ is associated with a vector $v_x \in \mathbb{R}^d$ and each context $c$ is represented as a context vector $v_c \in \mathbb{R}^d$, where $d$ is the embedding dimensionality. We learn vector representations for both targets and contexts such that the dot product $v_x \cdot v_c$ associated with "good" target-context pairs belonging to the training data $D$ is maximized, leading to the objective function:
$$\arg\max_{v_x, v_c} \sum_{(x, c) \in D} \log \frac{1}{1 + \exp(-v_c \cdot v_x)}, \tag{8}$$
which can be trained using stochastic-gradient updates (Levy and Goldberg, 2014). Then we can obtain the dependency-based slot and word embeddings using $T_s$ and $T_w$ respectively.

Embedding-Based Scoring Function
With trained dependency-based embeddings, we estimate the probability $P(x_i \xrightarrow{t} x_j)$ that $x_i$ is the headword and $x_j$ is its dependent via the typed dependency $t$ (Equation (9)) in terms of $\mathrm{Sim}(x_i, x_j/t)$, the cosine similarity between word/slot embeddings $v_{x_i}$ and context embeddings $v_{x_j/t}$ after normalizing to $[0, 1]$.
Based on the dependency set $T_x$, we use $t^*_{x_i \rightarrow x_j}$ to denote the most likely typed dependency with $x_i$ as a head and $x_j$ as a dependent:
$$t^*_{x_i \rightarrow x_j} = \arg\max_{t} C(x_i \xrightarrow{t} x_j), \tag{10}$$
where $C(x_i \xrightarrow{t} x_j)$ counts how many times the dependency $\langle x_i, t, x_j \rangle$ occurs in the dependency set $T_x$. Then the scoring function $r(\cdot)$ in (7) that estimates the dependency $x_i \rightarrow x_j$ is measured as
$$r(x_i \rightarrow x_j) = C(x_i \xrightarrow{t^*_{x_i \rightarrow x_j}} x_j) \cdot P(x_i \xrightarrow{t^*_{x_i \rightarrow x_j}} x_j), \tag{11}$$
which is equal to the highest observed frequency of the dependency $x_i \rightarrow x_j$ among all types from $T_x$, additionally weighted by the estimated probability. The estimated probability smooths the observed frequency to avoid overfitting due to the smaller dataset. Figure 3 is a simplified example of an automatically derived semantic knowledge graph with the most likely typed dependencies as edges based on the estimated weights. Then the relation weights $\hat{r}(x_i, x_j)$ can be obtained by (7) in order to build the $R^D_w$ and $R^D_s$ matrices.
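The relation weight estimation can be sketched as follows, assuming that each directional score is the count of the most frequent dependency type multiplied by its estimated probability, and that the two directional scores are summed, as the description above reads. The function `prob` is a hypothetical stand-in for the embedding-based probability, and the dependency counts are toy values.

```python
from collections import Counter


def directed_score(counts, prob, head, dep):
    """Count of the most frequent type for head -> dep, weighted by its probability."""
    typed = {t: c for (h, t, d), c in counts.items() if h == head and d == dep}
    if not typed:
        return 0.0
    t_star = max(typed, key=typed.get)       # most frequent typed dependency
    return typed[t_star] * prob(head, t_star, dep)


def relation_weight(counts, prob, x_i, x_j):
    """Direction-independent weight: sum of the two directional scores."""
    return directed_score(counts, prob, x_i, x_j) + directed_score(counts, prob, x_j, x_i)


# Toy slot-based dependency set (counts and types are made up).
T_s = (
    [("locale_by_use", "AMOD", "expensiveness")] * 3
    + [("locale_by_use", "DEP", "expensiveness")]
    + [("expensiveness", "DEP", "locale_by_use")]
)
counts = Counter(T_s)


def prob(head, dep_type, dep):
    """Hypothetical stand-in for the embedding-based probability estimate."""
    return 0.5


forward = directed_score(counts, prob, "locale_by_use", "expensiveness")
backward = directed_score(counts, prob, "expensiveness", "locale_by_use")
weight = relation_weight(counts, prob, "locale_by_use", "expensiveness")
```

Because the two directions are summed, the weight is symmetric in its arguments, matching the direction-independent edges of the knowledge graphs.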

Experimental Setup
In this experiment, we used the Cambridge University SLU corpus, previously used on several other SLU tasks (Henderson et al., 2012). The domain of the corpus is about restaurant recommendation in Cambridge; subjects were asked to interact with multiple SDSs in an in-car setting. The corpus contains a total number of 2,166 dialogues, including 15,453 utterances (10,571 for self-training and 4,882 for testing). The data is gender-balanced, with slightly more native than non-native speakers. The vocabulary size is 1868. An ASR system was used to transcribe the speech; the word error rate was reported as 37%. There are 10 slots created by domain experts: addr, area, food, name, phone, postcode, price range, signature, task, and type.
Figure 5: The mappings from induced slots (within blocks) to reference slots (right sides of arrows).
For parameter setting, the weights for balancing feature models and propagation models, $\alpha$ and $\beta$, are set to 0.5 to give the same influence, and the threshold for defining the unobserved facts, $\delta$, is set to 0.5 for all experiments. We use the Stanford Parser⁵ to obtain the collapsed typed syntactic dependencies (Socher et al., 2013) and set the dimensionality of embeddings $d = 300$ in all experiments.

Evaluation Metrics
To evaluate the accuracy of the automatically decoded slots, we measure their quality as the proximity between predicted slots and reference slots. Figure 5 shows the mappings that indicate semantically related induced slots and reference slots.
To eliminate the influence of threshold selection when predicting semantic slots, in the following metrics we take the whole ranking list into account and evaluate the performance by metrics that are independent of the selected threshold. For each utterance, with the predicted probabilities of all slot candidates, we can compute an average precision (AP) to evaluate the performance of SLU by treating the slots with mappings as positive. AP scores the ranking result higher if the correct slots are ranked higher, which also approximates the area under the precision-recall curve (Boyd et al., 2012). Mean average precision (MAP) is the metric for evaluating all utterances. For all experiments, we perform a paired t-test on the AP scores of the results to test the significance.
Table 2 shows the MAP performance of predicted slots for all experiments on ASR and manual transcripts. For the first baseline using explicit semantics, we use the observed data to self-train models for predicting the probability of each semantic slot by support vector machine (SVM) with a linear kernel and multinomial logistic regression (MLR) (rows (a)-(b)) (Pedregosa et al., 2011; Henderson et al., 2012). SVM and MLR perform similarly, and MLR is slightly better than SVM because it has better capability of estimating probabilities. For modeling implicit semantics, two baselines are performed as references, Random (row (c)) and Majority (row (d)), where the former assigns random probabilities for all slots, and the latter assigns probabilities for the slots based on their frequency distribution. To improve probability estimation, we further integrate the results from implicit semantics with the better result from explicit approaches, MLR (row (b)), by averaging the probability distributions from the two results.
5 http://nlp.stanford.edu/software/lex-parser.shtml
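The average precision (AP) and mean average precision (MAP) metrics used in this section can be computed with a short sketch like the one below. The function names and the toy scores are hypothetical, and the sketch assumes that every positive slot appears among the scored candidates.

```python
def average_precision(scores, positives):
    """AP of a ranking given slot -> score and the set of positive slots."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hits, total = 0, 0.0
    for rank, slot in enumerate(ranked, start=1):
        if slot in positives:
            hits += 1
            total += hits / rank        # precision at this positive's rank
    return total / hits if hits else 0.0


def mean_average_precision(all_scores, all_positives):
    """MAP: mean of the per-utterance AP values."""
    aps = [average_precision(s, p) for s, p in zip(all_scores, all_positives)]
    return sum(aps) / len(aps)


# Toy predictions for two utterances (values are made up).
all_scores = [
    {"expensiveness": 0.9, "capability": 0.6, "locale_by_use": 0.4},
    {"capability": 0.2, "locale_by_use": 0.8},
]
all_positives = [
    {"expensiveness", "locale_by_use"},
    {"locale_by_use"},
]
ap_first = average_precision(all_scores[0], all_positives[0])
map_score = mean_average_precision(all_scores, all_positives)
```

In the first utterance the generic slot is ranked between the two positive slots, which lowers the AP below 1 without any threshold being chosen.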

Evaluation Results
Two baselines, Random and Majority, cannot model the implicit semantics, producing poor results. Integrating the results of Random with MLR significantly degrades the performance of MLR. Using only the feature model (row (e)) achieves 24.2% and 22.6% of MAP for ASR and manual results respectively, which are worse than the two baselines using explicit semantics. However, when combined with explicit semantics, using only the feature model significantly outperforms the baselines, where the performance improves from about 34.0% to 37.6% and from 38.8% to 45.3% for ASR and manual results respectively. Additionally integrating a knowledge graph propagation (KGP) model (row (e)) outperforms the baselines for both ASR and manual transcripts, and the performance is further improved by combining with explicit semantics (achieving MAP of 43.5% and 53.4%). The experiments show that the proposed MF models successfully learn the implicit semantics and consider the relations and domain-specificity simultaneously.
Table 3: The MAP of predicted slots using different types of relation models in $M_R$ (%); † indicates that the result is significantly better than the feature model (column (a)) with p < 0.05 in t-test. Columns: (a) feature model with no relations; knowledge graph propagation model with (b) Semantic, (c) Dependency, (d) Word, (e) Slot, and (f) All relations.

Discussion and Analysis
With promising results obtained by the proposed models, we analyze the detailed difference between different relation models in Table 3.

Effectiveness of Semantic and Dependency Relation Models
To evaluate the effectiveness of semantic and dependency relations, we consider each of them individually in $M_R$ of (3) (columns (b) and (c) in Table 3). Compared to the original model (column (a)), both modeling semantic relations and modeling dependency relations significantly improve the performance for ASR and manual results. This shows that semantic relations help the SLU model infer the implicit meaning, so that the prediction becomes more accurate. Also, dependency relations successfully differentiate the generic concepts from the domain-specific concepts, so that the SLU model is able to predict a more coherent set of semantic slots (Chen et al., 2015). Integrating the two types of relations (column (f)) further improves the performance.

Comparing Word/Slot Relation Models
To analyze the performance results from inter-word and inter-slot relations, columns (d) and (e) show the results considering only word relations and only slot relations respectively. It can be seen that the inter-slot relation model significantly improves the performance for both ASR and manual results. However, the inter-word relation model only performs slightly better for ASR output (from 37.6% to 39.2%), and there is no difference after applying the inter-word relation model on manual transcripts. The reason may be that inter-slot relations carry high-level semantics that align well with the structure of SDSs, but inter-word relations do not. Nevertheless, combining the two relations (column (f)) outperforms both results for ASR and manual transcripts, showing that different types of relations can complement each other and thereby benefit the SLU performance.

Conclusions
This paper presents an MF approach to self-train the SLU model for semantic decoding in an unsupervised way. The purpose of the proposed model is not only to predict the probability of each semantic slot but also to distinguish between generic semantic concepts and domain-specific concepts that are related to an SDS. The experiments show that the MF-based model obtains promising results, outperforming strong discriminative baselines.