LSTM-Based Mixture-of-Experts for Knowledge-Aware Dialogues

We introduce an LSTM-based method for dynamically integrating several word-prediction experts to obtain a conditional language model which can be good simultaneously at several subtasks. We illustrate this general approach with an application to dialogue where we integrate a neural chat model, good at conversational aspects, with a neural question-answering model, good at retrieving precise information from a knowledge-base, and show how the integration combines the strengths of the independent components. We hope that this focused contribution will draw attention to the benefits of using such mixtures of experts in NLP.


Introduction
The traditional architecture for virtual agents in dialogue systems (Jokinen and McTear, 2009) involves a combination of several components, which require a lot of expertise in the different technologies, considerable development and implementation effort to adapt each component to a new domain, and are only partially trainable (if at all). Recently, Vinyals and Le (2015), Serban et al. (2015), and Shang et al. (2015) proposed to replace this complex architecture by a single network (such as a Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997)) that predicts the agent's response from the dialogue history up to the point where it should be produced: this network can be seen as a form of conditional neural language model (LM), where the dialogue history provides the context for the production of the next agent's utterance.
Despite several advantages over the traditional architecture (learnability, adaptability, better approximations to human utterances), this approach is inferior in one dimension: it assumes that all the knowledge required for the next agent's utterance has to be implicitly present in the dialogues over which the network is trained, and to then be precisely memorized by the network, while the traditional approach allows this knowledge to be dynamically accessed from external knowledge-base (KB) sources, with guaranteed accuracy.
To address this issue, we propose the following approach. As in Vinyals and Le (2015), we first train a conditional neural LM based on existing dialogues, which we call our chat model; this model can be seen as an "expert" about the conversational patterns in the dialogue, but not about its knowledge-intensive aspects. In addition, we train another model, this time an expert about these knowledge aspects, which we call our QA model, due to its connections to Question Answering (QA). We then combine these two expert models through an LSTM-based integration model, which at each time step encodes the whole history into a vector and then uses a softmax layer to compute a probability mixture over the two models, from which the next token is then sampled.
While here we combine only two models in this way, this core contribution of our paper is immediately generalizable to several expert models, each competent on a specific task, where the (soft) choice between the models is made through the same kind of contextually-aware "attention" mechanism. Additional smaller contributions consist in the neural regime we adopt for training the QA model, and the way in which we reduce the memorization requirements on this model.

LSTM-based Mixture of Experts
The method is illustrated in Figure 1. Let $w_1^t = w_1 \ldots w_t$ be a history over words. We suppose that we have $K$ models, each of which can compute a distribution over its own vocabulary $V_k$: $p_k(w \in V_k \mid w_1^t)$, for $k \in [1, K]$. We use an LSTM to encode the history word-by-word into a vector $h_t$, which is the hidden state of the LSTM at time step $t$. We then use a softmax layer to compute the probabilities

$$p(k \mid w_1^t) = \frac{e^{u(k, h_t)}}{\sum_{k'=1}^{K} e^{u(k', h_t)}}$$

where $[u(1, h_t), \ldots, u(K, h_t)]^T = W h_t + b$, $W \in \mathbb{R}^{K \times \dim(h_t)}$, $b \in \mathbb{R}^K$. The final probability of the next word is then:

$$p(w \mid w_1^t) = \sum_{k=1}^{K} p(k \mid w_1^t)\, p_k(w \mid w_1^t). \quad (1)$$

Our proposal can be seen as bringing together two previous lines of research within an LSTM framework. Similar to the mixture-of-experts technique of Jacobs et al. (1991), we predict a label by using a "gating" neural network to mix the predictions of different experts based on the current situation, and similar to the approach of Florian and Yarowsky (1999), we dynamically combine distributions on words to produce an integrated LM. In our case, the labels are words, the gating neural network is an LSTM that stores a representation of a long textual prefix, and the combination mechanism is the mixture of word distributions in Equation (1). The integration framework we propose is generally applicable to situations where a pool of word-prediction "experts" compete for attention during the generation of text.
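The gating step and the mixture of Equation (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the hidden state, the weights `W` and `b`, and the two toy expert distributions are all made-up values, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def gate_scores(W, b, h):
    """Compute u = W h + b, with W given as a list of K rows."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) + b_k
            for row, b_k in zip(W, b)]

def mixture(gate_probs, expert_dists):
    """Equation (1): p(w) = sum_k p(k) * p_k(w) over the union of vocabularies.

    Each expert distribution is a dict word -> probability over its own
    vocabulary; an expert contributes 0 for words outside its vocabulary.
    """
    mixed = {}
    for p_k, dist in zip(gate_probs, expert_dists):
        for word, p in dist.items():
            mixed[word] = mixed.get(word, 0.0) + p_k * p
    return mixed

# Toy example (all numbers are assumptions): K = 2 experts, dim(h_t) = 3.
h_t = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]]
b = [0.0, 0.1]
expert_dists = [
    {"the": 0.6, "phone": 0.4},   # expert 1 over V_1
    {"8.0": 0.7, "phone": 0.3},   # expert 2 over V_2
]
gate_probs = softmax(gate_scores(W, b, h_t))
p_next = mixture(gate_probs, expert_dists)
```

Because the gate probabilities sum to one and each expert distribution sums to one over its own vocabulary, the mixed distribution sums to one over the union of the vocabularies.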

Data
Our corpus consists of 165k dialogues from a "tech company" in the domain of telephony support. We split them into train, development, and test sets whose sizes are 145k, 10k, and 10k dialogues. We then tokenize and lowercase each dialogue, and remove unused information such as head, tail, and chat time (Figure 2). For each response utterance found in a dialogue, we create a context-response pair whose context consists of all sentences appearing before the response. This process gives us 973k/74k/75k pairs for training/development/testing.
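The pair-extraction step can be sketched as follows. This is a hedged illustration only: the `(speaker, utterance)` representation, the speaker labels, the whitespace tokenization, and the choice to skip a response that has no preceding context are all assumptions, since the source does not specify them.

```python
def context_response_pairs(dialogue, responder="agent"):
    """Build (context, response) pairs from one dialogue.

    `dialogue` is a list of (speaker, utterance) tuples (assumed format).
    The context of a response is every utterance that precedes it.
    """
    # Lowercase and tokenize on whitespace (a simplifying assumption).
    turns = [(speaker, utterance.lower().split())
             for speaker, utterance in dialogue]
    pairs = []
    for i, (speaker, tokens) in enumerate(turns):
        if speaker == responder and i > 0:
            context = [t for _, t in turns[:i]]
            pairs.append((context, tokens))
    return pairs

# Hypothetical dialogue for illustration.
dialogue = [
    ("client", "Hi , how big is the screen ?"),
    ("agent", "Let me check that for you ."),
    ("client", "Thanks"),
    ("agent", "You are welcome !"),
]
pairs = context_response_pairs(dialogue)
```

Each agent utterance yields one pair, so a dialogue with many agent turns contributes many pairs, which is consistent with the corpus growing from 165k dialogues to roughly a million pairs.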

Knowledge-base
The KB we use in this work consists of 1,745k device-attribute-value triples, e.g., (Apple iPhone 5; camera megapixels; 8.0). There are 4729 devices and 608 attributes. Because we consider only numeric values, only triples that have numeric attributes are chosen, resulting in a set of 65k triples of 34 attributes.
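One simple way to restrict a KB to numeric triples is sketched below. It is an assumption that the filter tests whether the value parses as a number; the source only says that triples with numeric attributes are kept. Apart from the iPhone 5 camera entry quoted above, the triples are hypothetical.

```python
def is_numeric(value):
    """Return True if the value string parses as a floating-point number."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def numeric_triples(triples):
    """Keep only (device, attribute, value) triples whose value is numeric."""
    return [t for t in triples if is_numeric(t[2])]

kb = [
    ("apple iphone 5", "camera megapixels", "8.0"),
    ("apple iphone 5", "operating system", "ios"),   # hypothetical entry
    ("apple ipad 2", "internal ram", "0.5"),         # hypothetical entry
]
numeric_kb = numeric_triples(kb)
```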

Device-specification context-response pairs
Our target context-response pairs are those in which the client asks about numeric value attributes. We employ a simple heuristic to select target context-response pairs: a context-response pair is chosen if its response contains a number and one of the following keywords: cpu, processor, ghz, mhz, memory, mb(s), gb(s), byte, pixel, height, width, weigh, size, camera, mp, hour(s), mah. Using this heuristic, we collect 17.6k/1.3k/1.4k pairs for training/dev/testing. These sets are significantly smaller than those extracted above.
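The selection heuristic can be sketched as a small filter. The keyword list is taken from the text above; how keywords are matched is not specified in the source, so exact token matching is an assumption here (stem-like entries such as "weigh" may well have been matched as prefixes or substrings in the original).

```python
import re

# Keywords from the text; "mb(s)", "gb(s)", "hour(s)" expanded to both forms.
KEYWORDS = {
    "cpu", "processor", "ghz", "mhz", "memory", "mb", "mbs", "gb", "gbs",
    "byte", "pixel", "height", "width", "weigh", "size", "camera", "mp",
    "hour", "hours", "mah",
}

def is_device_spec_response(response):
    """True if the response contains a number and at least one keyword."""
    # Split into alphabetic tokens and numbers, so "1.2ghz" -> "1.2", "ghz".
    tokens = re.findall(r"[a-z]+|\d+(?:\.\d+)?", response.lower())
    has_number = any(t[0].isdigit() for t in tokens)
    has_keyword = any(t in KEYWORDS for t in tokens)
    return has_number and has_keyword
```

For example, `is_device_spec_response("the camera has 8.0 megapixels")` is true, while a response with a number but no keyword, or a keyword but no number, is rejected.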

Neural Chat Model
Our corpus is comparable to the one described in Vinyals and Le (2015)'s first experiment, and we use here a similar neural chat model.
Without going into details for lack of space, this model uses an LSTM to encode into a vector the sequence of words observed in a dialogue up to a certain point, and this vector is then used by another LSTM for generating the next utterance, also word-by-word. The approach is reminiscent of seq2seq models for machine translation such as Sutskever et al. (2014), where the role of "source sentence" is played by the dialogue prefix, and that of "target sentence" by the response utterance.

Neural Question Answering Model
In a standard setting, a question to query a KB must be formal (e.g., SQL). However, because a human-like QA system should take natural questions as input, we build a neural model to translate natural questions to formal queries. This model employs an LSTM to encode a natural question into a vector. It then uses two softmax layers to predict the device name and the attribute. This model is adequate here, since we focus on the QA situation where the client asks about device specifications. For more complex cases, more advanced QA models should be considered (e.g., Bordes et al. (2014), Yih et al. (2015)).
Given question $w_1^l$, the two softmax layers give us a distribution over devices $p_d(\cdot \mid w_1^l)$ and a distribution over attributes $p_a(\cdot \mid w_1^l)$. Using the KB, we can compute a distribution over the set $V_{qa}$ of all values found in the KB, by marginalizing over $d$, $a$:

$$p_{qa}(v \mid w_1^l) = \sum_{(d, a, v) \in T} p_d(d \mid w_1^l)\, p_a(a \mid w_1^l)$$

where $T$ is the set of all triples in the KB. Initial experiments showed that predicting values in this indirect way significantly improves the accuracy compared to employing a single softmax layer to predict values directly, because it minimizes the memorization requirements on the hidden states.
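The marginalization over KB triples can be sketched as follows. This is a minimal illustration, assuming the device and attribute distributions are available as dictionaries; the probabilities and all triples other than the iPhone 5 camera entry are made up for the example.

```python
def value_distribution(p_device, p_attribute, triples):
    """Sum p_d(d) * p_a(a) over all KB triples (d, a, v) sharing a value v."""
    p_value = {}
    for device, attribute, value in triples:
        p = p_device.get(device, 0.0) * p_attribute.get(attribute, 0.0)
        p_value[value] = p_value.get(value, 0.0) + p
    return p_value

# Hypothetical KB and softmax outputs.
triples = [
    ("apple iphone 5", "camera megapixels", "8.0"),
    ("apple iphone 5", "internal ram", "1.0"),
    ("apple ipad 2", "internal ram", "0.5"),
]
p_device = {"apple iphone 5": 0.9, "apple ipad 2": 0.1}
p_attribute = {"camera megapixels": 0.2, "internal ram": 0.8}
p_value = value_distribution(p_device, p_attribute, triples)
```

In this toy KB the pair (apple ipad 2, camera megapixels) has no triple, so the mass assigned to it is not carried over to any value and the result sums to slightly less than one. Whether and how the original system renormalizes in that case is not stated in the source.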
Data Generation. Although QA has been intensively studied recently, existing QA corpora and methods for generating data (e.g., Fader et al. (2013)) hardly meet our needs here. This is because our case is very different from (and somewhat more difficult than) traditional QA set-ups in which questions are independent. In our case several scenarios are possible, resulting from the chat interaction (e.g., in a chat, questions can be related as in Figure 3). We therefore propose a method for generating artificial QA data that can cover several scenarios.
For each tuple <device name, attribute>, we paraphrase the device name by randomly dropping some words (e.g., "apple iphone 4" becomes "iphone 4"), and paraphrase the attribute using a small handcrafted dictionary and also randomly dropping some words ("battery talk time" becomes "battery life" which can become "battery"). We then draw a sequence of l words from a vocabulary according to word frequency, where l ∼ Gamma(k, n) (e.g., "i what have"), and shuffle these words. The output of the final step is used as a training datapoint like: have iphone 4 what battery i → apple iphone 4 battery talk time. To make the data more realistic, we also generate complex questions by concatenating two simple ones. Such questions are used to cover the dialogue scenario where the client continues asking about another device and attribute. In this case, the system should focus on the latest device and attribute.
Using this method, we generate a training set of 7.6m datapoints and a development set of 10k.
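The generation procedure described above can be sketched as follows. Several details are assumptions rather than facts from the source: the keep probability when dropping words, the Gamma shape and scale, rounding the Gamma sample to an integer length, and shuffling the filler words together with the device and attribute phrases as whole chunks (suggested by the example, where "iphone 4" stays contiguous). The dictionary, vocabulary, and frequencies are hypothetical.

```python
import random

def drop_words(phrase, rng, keep_prob=0.7):
    """Randomly drop words from a phrase, always keeping at least one."""
    words = phrase.split()
    kept = [w for w in words if rng.random() < keep_prob]
    return " ".join(kept) if kept else rng.choice(words)

def make_datapoint(device, attribute, synonyms, vocab, freqs, rng,
                   shape=2.0, scale=1.5):
    """Return (noisy question, target) for one <device, attribute> tuple."""
    device_para = drop_words(device, rng)
    # Pick a dictionary paraphrase of the attribute, then drop words from it.
    attr_para = drop_words(rng.choice(synonyms.get(attribute, [attribute])), rng)
    # Number of filler words: Gamma sample rounded to an integer (assumption).
    n_filler = int(round(rng.gammavariate(shape, scale)))
    filler = rng.choices(vocab, weights=freqs, k=n_filler)
    chunks = filler + [device_para, attr_para]
    rng.shuffle(chunks)
    return " ".join(chunks), f"{device} {attribute}"

def make_complex(first, second):
    """Concatenate two simple questions; the target is the latest one."""
    (q1, _), (q2, target2) = first, second
    return f"{q1} {q2}", target2

rng = random.Random(0)
synonyms = {"battery talk time": ["battery talk time", "battery life"]}
vocab, freqs = ["i", "what", "have", "the"], [5, 3, 2, 1]
question, target = make_datapoint("apple iphone 4", "battery talk time",
                                  synonyms, vocab, freqs, rng)
```

The target always keeps the full device name and attribute, so the QA model learns to map noisy, partially specified questions back to canonical KB keys.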

Integration
We now show how we integrate the chat model with the QA model using the LSTM-based mixture-of-experts method. The intuition is the following: the chat model is in charge of generating smooth responses into which the QA model "inserts" values retrieved from the KB. Ideally, we should employ an independent LSTM for the purpose of computing mixture weights, as in Section 2. However, due to the lack of training data, our integration model makes use of the chat model's hidden state to compute these weights. Because this hidden state captures the uncertainty of generating the next word, it is also able to detect whether or not the next word should be generated by the chat model.
It is easy to see that the chat model is the backbone because most tokens should be generated by it. The QA model, on the other hand, is crucial since we want the system to generate correct values. (E.g., the chat model alone cannot perform the chat shown in Figure 3 precisely.) More importantly, in the future when new devices are released, we do not need to collect new chat data, which are often expensive, to retrain the chat model.
Let $C$ and $w_1^t$ be a context and the words generated up to this point. $p_c(\cdot \mid w_1^t, C)$ and $p_{qa}(\cdot \mid w_1^t, C)$ are given by the chat model and the QA model. We then compute the distribution $p(\cdot \mid w_1^t, C)$ over $V_c \cup V_{qa}$ as a mixture of $p_c$ and $p_{qa}$:

$$p(w \mid w_1^t, C) = \alpha\, p_c(w \mid w_1^t, C) + (1 - \alpha)\, p_{qa}(w \mid w_1^t, C)$$

where $\alpha = \sigma(w^T h_t^c + b)$, $h_t^c$ is the hidden state of the chat model, and $\sigma$ is the sigmoid function; $w \in \mathbb{R}^{\dim(h_t^c)}$ and $b \in \mathbb{R}$. Note that the sigmoid is equivalent to the softmax for two output units.
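The remark that the sigmoid is equivalent to a two-unit softmax can be checked with a short derivation. The symbols $u_1, u_2$, $W_1, W_2$ and $b_1, b_2$ below are introduced only for this illustration (the two rows of a softmax gating layer and their biases); they are not notation from the original text.

```latex
\begin{align*}
\frac{e^{u_1}}{e^{u_1} + e^{u_2}}
  &= \frac{1}{1 + e^{-(u_1 - u_2)}}
   = \sigma(u_1 - u_2), \\
u_1 - u_2
  &= (W_1 - W_2)\, h_t^c + (b_1 - b_2)
   = w^T h_t^c + b,
\end{align*}
% with w^T = W_1 - W_2 and b = b_1 - b_2, so the first softmax output is
% exactly \alpha and the second is 1 - \alpha.
```

A two-expert softmax gate therefore carries one redundant set of parameters, and the single vector $w$ and scalar $b$ lose no expressive power.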
Training. To train this integration model, we keep the chat model and the QA model frozen, and minimize a regularized, token-weighted training objective with respect to the mixture parameters $w$ and $b$.
Decoding. To find the most probable responses, our decoder employs the uniform-cost search algorithm (Russell and Norvig, 2003), which is guaranteed to find optimal solutions. We stipulate the constraint that a response answers no more than one question.
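Uniform-cost search over response prefixes can be sketched as below. This is an illustration of the general algorithm, not the authors' decoder: the cost of a prefix is taken to be its negative log-probability, the `next_token_probs` callback, the end-of-sequence symbol, and the length cap are assumptions, the toy model is invented, and the one-question-per-response constraint is not modelled.

```python
import heapq
import math

def uniform_cost_decode(next_token_probs, eos="</s>", max_len=20):
    """Return the most probable complete sequence and its probability.

    Prefixes are expanded in order of increasing cost (-log probability).
    Since every step has non-negative cost, the first complete sequence
    popped from the queue is optimal.
    """
    frontier = [(0.0, [])]  # (cost so far, prefix)
    while frontier:
        cost, prefix = heapq.heappop(frontier)
        if prefix and prefix[-1] == eos:
            return prefix, math.exp(-cost)
        if len(prefix) >= max_len:
            continue
        for token, p in next_token_probs(prefix).items():
            if p > 0.0:
                heapq.heappush(frontier, (cost - math.log(p), prefix + [token]))
    return None, 0.0

def toy_model(prefix):
    """Hypothetical next-token distributions for a tiny language."""
    table = {
        (): {"hello": 0.6, "hi": 0.4},
        ("hello",): {"there": 0.5, "</s>": 0.5},
        ("hello", "there"): {"</s>": 1.0},
        ("hi",): {"</s>": 0.9, "there": 0.1},
        ("hi", "there"): {"</s>": 1.0},
    }
    return table[tuple(prefix)]

best, prob = uniform_cost_decode(toy_model)
```

In this toy model a greedy decoder would start with "hello" (probability 0.6), but every completion of that prefix has probability 0.3, whereas "hi </s>" has probability 0.36. The search returns the latter.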

Experiments
We implement our models in C++ using the CUDA Toolkit. Since evaluating a conversation system is still challenging, we follow Vinyals and Le (2015) and use word perplexity only. In our experiments, every LSTM has 1024 hidden units and 1024 memory cells. The vocabulary of the chat model has 19.3k words, that of the QA model 12.7k words.
We first train the chat model on all chat data with a learning rate of 0.01, and continue training it on the device-specification training data with a smaller learning rate, 0.001. Using this smaller learning rate, we expect that the model will not forget what it has learnt on the whole chat corpus. Next, we train the QA model on the data generated in Section 4.2 with a learning rate of 0.01. Finally, we train the integration model on the device-specification training data, also with a learning rate of 0.01.

Chat example (Figure 3):
Client: i want to know about how fast is the processor of apple ipad 2
Machine: that 's an excellent question , i can definitely help you with that ! apple_ipad_2|||cpu_maximum_frequency ghz was there anything else i could help you out with today ?
Client: and its ram ?
Machine: apple_ipad_2|||internal_ram gigabytes
Client: thanks
Machine: was there anything else i could help you out with today ?

We report here initial results. Although the integration does not help to decrease the perplexity on all tokens (15.4, compared to 14.7 for the chat model), it does decrease perplexity by 38% on value tokens (46.8, compared to 75.8 for the chat model). This shows that using the QA model is helpful, especially for predicting the right values, which are informationally more critical to the user than general words. Figure 3 shows a chat example between us and our integration model. To examine whether the model retrieves precise information, outputs are only shown here in the format device|||attribute.
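For reference, word perplexity and the relative reduction quoted above can be computed as in the sketch below. The helper names are hypothetical; the perplexity values are the ones reported in the text.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

def relative_reduction(baseline, new):
    """Fractional decrease of `new` with respect to `baseline`."""
    return (baseline - new) / baseline

# Value tokens: 75.8 (chat model) vs. 46.8 (integration model).
value_token_gain = relative_reduction(75.8, 46.8)
# All tokens: 14.7 (chat model) vs. 15.4 (integration model).
all_token_gain = relative_reduction(14.7, 15.4)
```

The value-token figure corresponds to the 38% reduction stated in the text, while the all-token figure is negative, reflecting the small increase from 14.7 to 15.4.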

Conclusions
This short paper introduces a general LSTM-based mixture-of-experts method for language modelling and illustrates the approach by integrating a neural chat model with a neural QA model. The experimental results, while limited to measures of perplexity, do show that the integration model is capable of handling chats in which the user may ask about device specifications; a more thorough and convincing evaluation would require human assessments of the quality of the produced responses.
We believe that the proposed integration method has potential for a wide range of applications. It makes it possible to pool a number of different language models, each an expert in a specific domain or class of problems (possibly trained independently on the most appropriate data), and to generate the next word based on a competition between these models, under the supervision of an LSTM-based attention mechanism.
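In the training objective, the token weight $\beta(w)$ is set to a high value for $w \in V_{qa} \setminus V_c$ and to 1 otherwise. $\lambda$ is the regularization parameter and $D$ is the training set. We set $\beta(w \in V_{qa} \setminus V_c)$ high because we want the training phase to focus on those tokens representing values in the KB but not supported by the chat model.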

Figure 3: A chat with the integration model.