May I take your order? A Neural Model for Extracting Structured Information from Conversations

In this paper we tackle a unique and important problem of extracting a structured order from the conversation a customer has with an order taker at a restaurant. This is motivated by an actual system under development to assist in the order taking process. We develop a sequence-to-sequence model that is able to map from unstructured conversational input to the structured form that is conveyed to the kitchen and appears on the customer receipt. This problem is critically different from other tasks like machine translation where sequence-to-sequence models have been used: the input includes two sides of a conversation; the output is highly structured; and logical manipulations must be performed, for example when the customer changes his mind while ordering. We present a novel sequence-to-sequence model that incorporates a special attention-memory gating mechanism and conversational role markers. The proposed model improves performance over both a phrase-based machine translation approach and a standard sequence-to-sequence model.


Introduction
Extracting structured information from unstructured text is a critically important problem in natural language processing. In this paper, we attack a deceptively simple form of the problem: understanding what a customer wants when ordering at a restaurant. In this problem, we seek to convert the conversation between the customer and the order taker, i.e. the waiter or waitress, into the structured form that is conveyed to the kitchen to prepare the food, and which appears on the customer receipt.

Item           Size     Qty   Modifiers
Pizza          large    1     add pepperoni
Caesar Salad   small    1     side dressing
Diet Coke      medium   3

Table 1: An example of the structured data record corresponding to the conversation in Figure 1.

We develop this system to analyze real-time interactions with the aim of discovering errors in the order-entry process. Note that the objective is to analyze the interaction and suggest corrections to the human order-taker. Thus, we take both sides of the order-taking interaction as input, and are not attempting to predict the order-taker's side of the conversation.
While we focus on the restaurant domain in this work, this problem is relevant in any scenario in which a conversation results in the creation of structured information. Other examples include a sales interaction which results in a purchase order, a call to a help desk which results in a service record, or a conversation with a travel agent that results in an itinerary.
An example of the problem of interest is shown in Figure 1. The structured data record that corresponds to this conversation is shown in Table 1. There are several things to note about this example:
• The output is a stylized and structured representation of the input.
• The items in the structured order may appear in a different sequence than they are mentioned.
• Inference occurs across turns, for example that "medium" applies to the coke and not the pizza whose size was earlier specified.
• Logical manipulations must be done, for example changing the number of cokes from two to three.
• In contrast to machine translation, we do not wish to create a verbatim "translation" of the input, but instead a logical distillation of it.
To attack this problem, we implemented two baselines and several sequence-to-sequence models. The first baseline is an information-retrieval approach based on a TF-IDF match (Salton et al., 1975) which finds the most similar conversation in the training data and returns the associated order. The second uses phrase-based machine translation (Koehn et al., 2003) to "translate" from the conversational input to the tokens in the structured order. We compare these to a sequence-to-sequence (s2s) model with attention (Chan et al., 2016; Devlin et al., 2015; Sutskever et al., 2014; Mei et al., 2016), and then extend the s2s model with the addition of a gating mechanism on the attention memory and with an auxiliary input that indicates the conversational role of the speaker (customer or order-taker). We show that it is in fact possible to extract the orders from conversations recorded at a real restaurant, and achieve an F measure of over 70 from raw text and 65 from ASR transcriptions.

Problem Formulation
The precise problem setting in this paper is as follows. The training data consists of input/output pairs of examples $(X_1, Y_1), \ldots, (X_N, Y_N)$, where $X_k$ is a conversation consisting of several utterances, similar to the example shown in Figure 1, and $Y_k$ is the corresponding structured data record such as the one in Table 1.
Figure 2: An input unstructured conversation and the corresponding structured record.
Given a conversation $X_k$, the goal of our model is to extract the structured data record $Y_k$ so that:
$$Y_k = \arg\max_{Y} P(Y \mid X_k) \tag{1}$$
We cast this task as a sequence modeling problem which aims to map the sequence of words in a conversation $X_k$ to the sequence of tokens in the corresponding structured data record $Y_k$. The input sequence is formed by concatenating the utterances in the conversation, while the output sequence is formed by concatenating the rows in the structured data record. For example, the utterances in the conversation shown in Figure 1 are concatenated to predict the sequence
y = Pizza, size=large, qty=1, modifiers=(add pepperoni) | Diet Coke, size=medium, qty=3 | Caesar Salad, size=small, qty=1, modifiers=(side dressing)
which is derived from Table 1. Under this sequential model, the conditional probability of the structured data record $Y$ given the observed conversation $X$ can be written as
$$P(Y \mid X; \theta) = \prod_{t=1}^{|Y|} P(y_t \mid y_{1:t-1}, X; \theta) \tag{2}$$
where $y_{1:t-1}$ denotes the first $t-1$ terms in the structured data record, $|Y|$ is the number of tokens in the record, and $\theta$ represents the model parameters.
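The way a structured record is flattened into the target token sequence can be illustrated with a minimal sketch. The `OrderItem` type, the function names, and the comma separator between multiple modifiers are assumptions made for illustration; only the overall `item, size=..., qty=..., modifiers=(...)` layout and the ` | ` row separator come from the example sequence above.

```python
from typing import List, NamedTuple, Optional, Tuple


class OrderItem(NamedTuple):
    """One row of the structured record (hypothetical container type)."""
    name: str
    size: Optional[str] = None
    qty: int = 1
    modifiers: Tuple[str, ...] = ()


def serialize_item(item: OrderItem) -> str:
    """Flatten one row into the 'name, size=..., qty=..., modifiers=(...)' form."""
    fields = [item.name]
    if item.size:
        fields.append(f"size={item.size}")
    fields.append(f"qty={item.qty}")
    if item.modifiers:
        # Joining several modifiers with ", " is an assumption.
        fields.append("modifiers=(" + ", ".join(item.modifiers) + ")")
    return ", ".join(fields)


def serialize_order(items: List[OrderItem]) -> str:
    """Concatenate the rows of a record, separated by ' | '."""
    return " | ".join(serialize_item(item) for item in items)


order = [
    OrderItem("Pizza", "large", 1, ("add pepperoni",)),
    OrderItem("Diet Coke", "medium", 3),
    OrderItem("Caesar Salad", "small", 1, ("side dressing",)),
]
target = serialize_order(order)
print(target)
```

For the record in Table 1, this reproduces the target sequence y quoted above.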

Model
The proposed model is based on an encoder-decoder architecture with attention, as shown in Figure 2. The encoder network reads the input conversation $X$ one word at a time and updates its hidden state $h_t$ according to the current input $w_t$ and the previous hidden state $h_{t-1}$,
$$h_t = f_e(w_t, h_{t-1}) \tag{3}$$
where $f_e$ is a nonlinear function which is elaborated in the following section. After reading all the tokens, the encoder network yields a context vector $c$ as the representation of the entire conversation.
The decoder then processes this representation and generates a hypothesized structured data record $Y$ as an output sequence, word by word, given the context vector $c$ and all previously predicted tokens. The conditional probability can be expressed as follows:
$$P(y_t \mid y_{1:t-1}, c) = g(s_t, y_{t-1}, c) \tag{4}$$
$$s_t = f_d(s_{t-1}, y_{t-1}, c) \tag{5}$$
where $f_d$ and $g$ are nonlinear functions and $s_t$ is the hidden state of the decoder at time $t$. Critically, our decoder also utilizes an attention mechanism, which stores the intermediate encoder representations of each input word for use by the decoder.
Two improvements to the conventional encoder-decoder model architecture are proposed in this work. First, we incorporate gates controlled by the encoder into the neural attention memory to adaptively modulate the representations in the memory based on their semantic importance. Second, we propose a way to incorporate conversational role information into the model to reflect the fact that different participants in a multi-party interaction have different roles and the meaning of certain utterances may depend on the speaker's role.
A detailed illustration of the proposed model is shown in Figure 3. We elaborate on each component of this model in the following sections.

Encoder Network
The encoder network is designed to generate a semantically meaningful representation of unstructured conversations. Several neural network architectures have been proposed for this purpose, including CNNs (Kalchbrenner et al., 2014; Hu et al., 2014), RNNs (Sutskever et al., 2014) and LSTMs (Hochreiter and Schmidhuber, 1997). In this work, we use an encoder constructed from a recurrent neural network with gated recurrent units (GRU). The GRU has been shown to alleviate the vanishing gradient problem of RNNs, enabling the model to learn long-term dependencies in the input sequence. GRUs have been shown to perform comparably to LSTMs (Chung et al., 2014).
Figure 3: Graphical structure of the memory-gated encoder-decoder model with attention mechanism. $w_1$ represents the input; $\overrightarrow{h}_1$ and $\overleftarrow{h}_1$ are the hidden states of the forward and backward GRUs, respectively. $g_1$ and $\alpha_1$ represent the context gates and attention weights, respectively. A small dot node denotes an element-wise product.
At time $t$, the new state of a GRU is computed as follows:
$$z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{6}$$
$$r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{7}$$
$$\hat{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1})) \tag{8}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \hat{h}_t \tag{9}$$
where $\odot$ stands for element-wise multiplication and $\sigma$ is the sigmoid function. $W$ and $U$ (with their gate-specific counterparts) are weight matrices applied to the input and the previous hidden state, respectively. $h_t$ is a linear combination of the previous state $h_{t-1}$ and the hypothesis state $\hat{h}_t$, which is computed with new sequence information. The update gate, $z_t$, controls to what extent the past information is kept and how much new information is added. The reset gate, $r_t$, controls to what extent the history state contributes to the hypothesis state. If $r_t$ is zero, then the GRU ignores all the history information.
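The GRU update described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it follows the common formulation of Chung et al. (2014), omits bias terms, and assumes the convention in which the update gate weights the new hypothesis state. The parameter dictionary keys (`Wz`, `Uz`, `Wr`, `Ur`, `W`, `U`) are hypothetical names.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m: Matrix, v: Vector) -> Vector:
    """Matrix-vector product for row-major nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x: Vector, h_prev: Vector, p: Dict[str, Matrix]) -> Vector:
    """One GRU step: gates z and r, hypothesis state h_hat, new state h."""
    # Update gate: how much of the new hypothesis replaces the old state.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h_prev))]
    # Reset gate: how much history contributes to the hypothesis state.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h_prev))]
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_hat = [math.tanh(a + b) for a, b in zip(matvec(p["W"], x), matvec(p["U"], reset_h))]
    # Linear combination of the previous state and the hypothesis state.
    return [(1.0 - zi) * hp + zi * hh for zi, hp, hh in zip(z, h_prev, h_hat)]


# Toy usage: 1-dimensional input, 2-dimensional hidden state, all-zero weights.
zero_in = [[0.0], [0.0]]
zero_rec = [[0.0, 0.0], [0.0, 0.0]]
params = {"Wz": zero_in, "Uz": zero_rec, "Wr": zero_in, "Ur": zero_rec,
          "W": zero_in, "U": zero_rec}
h_new = gru_step([1.0], [0.4, -0.2], params)
print(h_new)
```

With all-zero weights both gates equal 0.5 and the hypothesis state is zero, so the new state is simply half of the previous state, which makes the interpolation role of the update gate easy to see.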
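The conversation encoding is obtained by concatenating the GRU hidden state vectors from the forward and backward directions. Thus the encoder operation can be summarized as follows
$$\begin{aligned} x_t &= W_e w_t \\ \overrightarrow{h}_t &= \overrightarrow{\mathrm{GRU}}(x_t, \overrightarrow{h}_{t-1}) \\ \overleftarrow{h}_t &= \overleftarrow{\mathrm{GRU}}(x_t, \overleftarrow{h}_{t+1}) \\ h^+_t &= [\overrightarrow{h}_t ; \overleftarrow{h}_t] \end{aligned} \tag{10}$$
where $w_t$ is the one-hot input vector, $W_e$ is the embedding matrix, and $x_t$ is the word embedding for $w_t$. $\overrightarrow{\mathrm{GRU}}$ and $\overleftarrow{\mathrm{GRU}}$ represent the GRU operating in the forward and backward directions, respectively, with processing defined by equations 6-9.
This produces a sequence of context vectors $h^+_t$, which are subsequently consumed by an attention mechanism in the decoder. We use the final context vector $h^+_T$ to initialize the hidden state of the decoder.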

Memory Gate
In most sequence-to-sequence tasks such as machine translation, every word in the input is important. However, in our scenario, where the input to the system is conversational speech, not all the words in the conversation contribute to the prediction of the structured data record. For example, it is reasonable to ignore the chit-chat that is present in many conversations. Further, in other tasks, gating mechanisms have been shown to be useful for dynamically selecting important information (Hochreiter and Schmidhuber, 1997; Tu et al., 2016).
In light of this, we propose the use of an additional memory gate to select important information from the memory vector. The memory gate we use consists of a single-layer feed-forward neural network
$$g_t = \sigma(W_g h^+_t + b_g) \tag{11}$$
where $\sigma$ is a sigmoid activation function, $W_g$ and $b_g$ are the weight matrix and bias, respectively, and $h^+_t$ is the context vector at time $t$ defined in equation 10. The gate is then applied to the context vector $h^+_t$ using an element-wise multiplication operation:
$$c_t = g_t \odot h^+_t \tag{12}$$
After applying the memory gate, the gated context vector $c_t$ is fed into the attention memory of the decoder network in place of the original context vector $h^+_t$. Figure 4 illustrates an example of the gating weights for a sample utterance. The darker colors indicate values close to 1 while the lighter colors indicate values close to 0. As the figure shows, the network learns to suppress semantically unimportant words.
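The memory gate described above amounts to a sigmoid layer followed by an element-wise product. The following is a minimal sketch assuming the gate has the same dimensionality as the context vector; the function and variable names are hypothetical, and the toy values are not taken from the paper.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def memory_gate(h_plus: Vector, W_g: Matrix, b_g: Vector) -> Tuple[Vector, Vector]:
    """Return the gate values and the gated context vector for one position."""
    # Single-layer feed-forward network with a sigmoid activation.
    gate = [sigmoid(sum(w * h for w, h in zip(row, h_plus)) + b)
            for row, b in zip(W_g, b_g)]
    # Element-wise product: dimensions with gate values near 0 are suppressed.
    gated = [g * h for g, h in zip(gate, h_plus)]
    return gate, gated


# Toy usage: zero weights and biases give a gate of 0.5 in every dimension.
gate, gated = memory_gate([2.0, -4.0], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
print(gate, gated)
```

Because the gate depends only on the encoder state at that position, it can be computed once per input word before decoding starts, rather than once per decoder step.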

Role Information
In many sequence-to-sequence models, there is no notion of different speakers with different roles. Inspired by the work in dialog generation and spoken language understanding (Hori et al., 2016), we propose the addition of speaker information into the encoder network to explicitly model the interaction patterns of the customer and order-taker.
Specifically, we learn separate word and role embeddings, and concatenate them to form the input. The input to the encoder network becomes:
$$x_t = W_e w_t \oplus W_\rho \rho_t \tag{13}$$
where $\rho_t$ is the one-hot vector indicating the role (customer or order-taker) of the speaker of word $w_t$, $W_\rho$ is the role embedding matrix, and $\oplus$ denotes concatenation.
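Concatenating word and role embeddings can be sketched as follows. The embedding tables, their dimensions and values, the role labels, and the `<unk>` token name are all toy assumptions for illustration; in the model these embeddings are learned.

```python
from typing import Dict, List, Tuple

# Hypothetical toy embedding tables (learned parameters in the real model).
WORD_EMB: Dict[str, List[float]] = {
    "two": [0.1, -0.3, 0.2],
    "cokes": [0.4, 0.0, -0.1],
    "anything": [0.2, 0.2, 0.2],
    "<unk>": [0.0, 0.0, 0.0],
}
ROLE_EMB: Dict[str, List[float]] = {
    "customer": [1.0, 0.0],
    "order_taker": [0.0, 1.0],
}


def encoder_inputs(turns: List[Tuple[str, str]]) -> List[List[float]]:
    """Build one input vector per word: word embedding followed by role embedding."""
    inputs = []
    for role, utterance in turns:
        for word in utterance.lower().split():
            word_vec = WORD_EMB.get(word, WORD_EMB["<unk>"])
            # Every word in a turn carries the role of the turn's speaker.
            inputs.append(word_vec + ROLE_EMB[role])
    return inputs


conversation = [("customer", "two cokes"), ("order_taker", "anything else")]
vectors = encoder_inputs(conversation)
print(len(vectors), len(vectors[0]))
```

The same word therefore yields a different encoder input depending on who said it, which is how the encoder can distinguish, for example, a customer's request from an order-taker's confirmation.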

Decoder Network
The decoder network is used to predict the next word given all the previously predicted words and the context vectors from the encoder network (Luong et al., 2015).
We use an RNN with GRU units to predict each word $y_t$ sequentially based on the previously predicted word $y_{t-1}$ and the output of the attention process $a_t$, which computes a weighted combination of the context vectors in memory.
If we define $s_t$ as the hidden layer of the decoder at time $t$, the decoder's operation can be expressed as
$$s_t = \overrightarrow{\mathrm{GRU}}(y_{t-1} \oplus a_t, s_{t-1}) \tag{14}$$
where $y_{t-1} \oplus a_t$ is the concatenation of the previously predicted output $y_{t-1}$ and the output of the attention process $a_t$, and $\overrightarrow{\mathrm{GRU}}(\cdot)$ is defined by equations 6-9, as before.
The attention vector $a_t$ is computed as a linear combination of the gated context vectors generated by the encoder network. This can be written as
$$a_t = \sum_{j=1}^{T} \alpha_{tj} c_j \tag{15}$$
where $T$ is the number of input words and the weights $\alpha_{tj}$ are computed as
$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})} \tag{16}$$
A single-layer feed-forward neural network is used to compute $e_{tj}$ as
$$e_{tj} = V_a^\top \tanh(W_a s_{t-1} + U_a c_j) \tag{17}$$
where $V_a$, $W_a$, and $U_a$ are weight matrices.
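The attention computation described above can be sketched in a few lines. This is an illustrative version that assumes the additive scoring form of a single-layer feed-forward network with a tanh nonlinearity, scored against the previous decoder state; the function names and the toy values are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attention(s_prev: Vector, memory: List[Vector],
              V_a: Vector, W_a: Matrix, U_a: Matrix) -> Tuple[Vector, Vector]:
    """Return attention weights over memory and the resulting attention vector."""
    projected_state = matvec(W_a, s_prev)
    scores = []
    for c_j in memory:
        projected_mem = matvec(U_a, c_j)
        # Additive score: V_a . tanh(W_a s_prev + U_a c_j)
        scores.append(sum(v * math.tanh(a + b)
                          for v, a, b in zip(V_a, projected_state, projected_mem)))
    # Softmax over input positions (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Weighted combination of the (gated) context vectors.
    dim = len(memory[0])
    a_t = [sum(alpha[j] * memory[j][d] for j in range(len(memory)))
           for d in range(dim)]
    return alpha, a_t


# Toy usage: with V_a = 0 every score is 0, so the weights are uniform.
memory = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
alpha, a_t = attention([0.5, -0.5], memory, [0.0, 0.0], identity, identity)
print(alpha, a_t)
```

In the gated variant, `memory` would hold the gated context vectors rather than the raw encoder states, so suppressed words contribute little even when they receive attention weight.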

Model Training
The model is trained to maximize the log probability of the structured data records given the corresponding conversation,
$$\max_{\theta} \sum_{(X_k, Y_k) \in D} \log P(Y_k \mid X_k; \theta) \tag{18}$$
where $D$ is the set containing all the training pairs and $P(Y_k \mid X_k)$ is computed with equation 2. The standard Adadelta algorithm (Zeiler, 2012) is used for parameter updates. Gradients are clipped to 1 to avoid exponentially increasing values (Pascanu et al., 2013).
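The two ingredients of the training procedure, the sequence log probability and gradient clipping, can be sketched as below. The text does not say whether clipping is applied to the gradient norm or to individual values; the sketch assumes norm-based rescaling in the style of Pascanu et al. (2013), and all names are hypothetical.

```python
import math
from typing import List, Tuple


def sequence_log_prob(token_probs: List[float]) -> float:
    """Log probability of a record: sum of log-probabilities of its tokens."""
    return sum(math.log(p) for p in token_probs)


def clip_by_global_norm(grads: List[List[float]],
                        max_norm: float = 1.0) -> Tuple[List[List[float]], float]:
    """Rescale all gradients so that their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if norm <= max_norm:
        return [list(vec) for vec in grads], norm
    scale = max_norm / norm
    return [[g * scale for g in vec] for vec in grads], norm


# Toy usage: a gradient with norm 5 is rescaled to norm 1.
clipped, original_norm = clip_by_global_norm([[3.0], [4.0]])
print(original_norm, clipped)
print(sequence_log_prob([0.5, 0.25]))
```

Rescaling by the norm keeps the direction of the update unchanged, whereas clipping each value independently would not.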

Experiments
In this section, we evaluate our proposed model on two data sets and compare performance with several baseline systems.

Data sets
We conducted experiments on a corpus of conversations between a customer and an order taker (waiter or waitress) captured in a real restaurant environment. The conversations were manually transcribed by professional annotators. There are 4823 examples in the training set, 543 in the development (dev) set, and 843 in the test set. There are approximately 260 unique items in the record and 150 unique modifiers on these items, but not all modifiers apply to all items. We experimented with two versions of the dev and test sets. The first is manually transcribed in the same manner as the training set, while the second is generated by a speech recognition decoder that was trained on the conversations in the training set. We denote the second set as ASR-dev and ASR-test. Table 2 lists the statistics of the data sets. Note that the audio of a conversation was collected as a single file and then automatically segmented into turns for ASR decoding. This process was not perfect and likely introduced some errors. Thus, the average length and number of turns differ between the ASR transcriptions and the manual transcriptions.

Experimental setup
All words are lower-cased and an unknown word token is used for words which appear fewer than four times in the training set. The word embedding matrix is initialized by randomly sampling from a normal distribution, and scaled by 0.01. The recurrent connections of the GRU are initialized with orthogonal matrices (Saxe et al., 2013) and biases are initialized to zero. A single-layer GRU is used for both the encoder and decoder. The network has 600 hidden units and uses 300-dimensional word embeddings. The dropout rate is set to 0.5. We did not tune hyper-parameters except for the dimension of the role embedding, which is selected from {3, 5, 10} on the dev set. During inference, we use beam search decoding with a beam of 5 to generate the structured records. In order to decode without a length bias, the log probability of decoded results is normalized by the number of tokens.
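Two of the preprocessing and decoding choices above, the unknown-word threshold and length-normalized scoring of beam hypotheses, can be illustrated with a short sketch. Whitespace tokenization, the `<unk>` token name, and the function names are assumptions; the threshold of four occurrences and the per-token normalization come from the setup described above.

```python
from collections import Counter
from typing import List, Set, Tuple


def build_vocab(conversations: List[str], min_count: int = 4) -> Set[str]:
    """Keep lower-cased words that appear at least min_count times."""
    counts = Counter(w for conv in conversations for w in conv.lower().split())
    return {w for w, c in counts.items() if c >= min_count}


def map_unknown(conversation: str, vocab: Set[str], unk: str = "<unk>") -> List[str]:
    """Replace out-of-vocabulary words with the unknown-word token."""
    return [w if w in vocab else unk for w in conversation.lower().split()]


def normalized_score(token_log_probs: List[float]) -> float:
    """Average log probability per token, which removes the bias toward short outputs."""
    return sum(token_log_probs) / len(token_log_probs)


def pick_best(hypotheses: List[Tuple[str, List[float]]]) -> str:
    """Choose the beam hypothesis with the highest length-normalized score."""
    return max(hypotheses, key=lambda h: normalized_score(h[1]))[0]


vocab = build_vocab(["a a a a b", "b b"])
tokens = map_unknown("A b c", vocab)

# The short hypothesis has the higher total log probability (-2.0 vs -3.0),
# but the long one has the higher per-token score (-0.5 vs -1.0).
beam = [("short", [-1.0, -1.0]), ("long", [-0.5] * 6)]
best = pick_best(beam)
print(tokens, best)
```

Without normalization the decoder would tend to favor records with fewer tokens, which in this task means dropping items or modifiers.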

Evaluation
A typical metric to evaluate a generation system is the BLEU score (Papineni et al., 2002), which uses n-gram overlap to quantify the degree to which a hypothesis matches the reference. However, our scenario is more demanding: order items are either correct or incorrect. Therefore, we adopt precision and recall at the item level as our evaluation metric. Note that an item is defined as a row in the structured data record and typically includes multiple fields. Using Table 1 as an example, there are three items to be scored. Only when the model produces an item that is exactly the same as the reference item do we count it as correct. As an additional measure, we report accuracy of the entire order, in which every item in an order must be correct for the order to be counted as correct.
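The item-level metrics described above can be sketched as follows. The text does not specify how counts are aggregated across orders or how repeated items are handled; this sketch assumes corpus-level (micro-averaged) counts and multiset matching, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Order = List[str]  # each string is one serialized item (one row of the record)


def evaluate(pairs: List[Tuple[Order, Order]]) -> Dict[str, float]:
    """Item-level precision/recall/F1 and whole-order accuracy.

    Each pair is (hypothesis items, reference items). An item counts as
    correct only if it matches a reference item exactly.
    """
    correct = n_hyp = n_ref = exact_orders = 0
    for hyp_items, ref_items in pairs:
        hyp, ref = Counter(hyp_items), Counter(ref_items)
        correct += sum((hyp & ref).values())  # multiset intersection
        n_hyp += sum(hyp.values())
        n_ref += sum(ref.values())
        exact_orders += int(hyp == ref)  # item order within a record is ignored
    precision = correct / n_hyp if n_hyp else 0.0
    recall = correct / n_ref if n_ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = exact_orders / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


reference = [
    "Pizza, size=large, qty=1, modifiers=(add pepperoni)",
    "Diet Coke, size=medium, qty=3",
    "Caesar Salad, size=small, qty=1, modifiers=(side dressing)",
]
# The hypothesis misses the correction from two cokes to three.
hypothesis = [
    "Pizza, size=large, qty=1, modifiers=(add pepperoni)",
    "Diet Coke, size=medium, qty=2",
    "Caesar Salad, size=small, qty=1, modifiers=(side dressing)",
]
scores = evaluate([(hypothesis, reference),
                   (["Diet Coke, size=small, qty=1"], ["Diet Coke, size=small, qty=1"])])
print(scores)
```

A single wrong field, such as the quantity here, makes the whole item incorrect and also makes the whole order incorrect, which is why order accuracy is the stricter of the two measures.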

Baseline systems
We compare the performance of our neural model with baseline models that employ information retrieval (IR) and phrase-based machine translation (PBMT) approaches.

IR: The IR method treats the training set of transcriptions as a collection of documents, each mapped to a corresponding order. The test conversation is used as a query to find the most similar training set conversation, and the corresponding order is returned as the estimated order. In our experiment, we use TF-IDF to compute the similarity score.
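The retrieval baseline can be sketched as a nearest-neighbor lookup under TF-IDF weighting. The exact weighting and similarity function are not specified in the text; this sketch assumes whitespace tokenization, a smoothed IDF, and cosine similarity, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

SparseVec = Dict[str, float]


def tfidf_vectors(docs: List[str]) -> Tuple[List[SparseVec], Dict[str, float]]:
    """Compute sparse TF-IDF vectors for the documents and the IDF table."""
    tokenized = [d.lower().split() for d in docs]
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    n = len(docs)
    # Smoothed IDF (an assumption; other variants are equally plausible).
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in doc_freq.items()}
    vecs = [{w: tf * idf[w] for w, tf in Counter(toks).items()} for toks in tokenized]
    return vecs, idf


def cosine(a: SparseVec, b: SparseVec) -> float:
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_order(query: str, train_convs: List[str], train_orders: List[str]) -> str:
    """Return the order attached to the most similar training conversation."""
    vecs, idf = tfidf_vectors(train_convs)
    # Query words unseen in training carry no weight.
    q = {w: tf * idf[w] for w, tf in Counter(query.lower().split()).items() if w in idf}
    best = max(range(len(vecs)), key=lambda i: cosine(q, vecs[i]))
    return train_orders[best]


train_convs = ["can i get a large pepperoni pizza", "two medium diet cokes please"]
train_orders = ["Pizza, size=large, qty=1, modifiers=(add pepperoni)",
                "Diet Coke, size=medium, qty=2"]
predicted = retrieve_order("i would like a large pizza", train_convs, train_orders)
print(predicted)
```

The example also shows the limitation of this baseline: the query did not mention pepperoni, but the retrieved order includes it, because the method can only return orders that already exist in the training set.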
PBMT: The goal of a phrase-based translation model is to map a conversation into its structured record with alignment and language models. In our experiments, we use the Moses decoder, a state-of-the-art phrase-based MT system available for research purposes. We use GIZA++ (Och and Ney, 2003) to learn word alignment and irstlm to learn the language model. The models are trained on the conversation/order pairs in the training set and used to predict the structured data record given a conversation.

Results
First we discuss the performance of our models on manually transcribed data and then examine the results on ASR-recognized data. Table 3 lists the experimental results on the manually transcribed dev and test sets. We refer to our model as the neural attention model (NAM). We see that the NAM is superior to both the IR and PBMT methods by a large margin. Both the proposed memory gate and role modifications yield improvements over the basic NAM. When combined, these produce the best performance in terms of accuracy on the dev set, and both F1 and accuracy on the test set. While there are only small differences in the scores among some of the NAM methods, we are unaware of a measure of statistical significance suitable for this task.
Though not reported, we also found that a basic encoder-decoder s2s model without attention performs poorly; it cannot summarize information across multiple turns into a single vector. The attention mechanism, acting on the entire encoding sequence, is critical in our task.
The ASR transcriptions have a word error rate around 25%. With this noisy data, we find that the memory gate and role additions consistently improve performance. When combined, both F1 and accuracy improved. Figure 6 shows a sample input and the output from each model. We see that the NAM augmented with memory gates and role information successfully captures the interaction and generates the correct record.

Qualitative analysis
To better understand the proposed model, we visualize the attention weight at each time step in Figure 5. The figure compares the attention weights produced by a conventional context memory and the proposed gated context memory. We see that both models are able to learn a good soft alignment between the input conversation and the output structured data record. However, the attention weights in 5(b), with our proposed gated attention mechanism, are sparser than those in 5(a) and better able to ignore uninformative terms in the input.

Related Work
There has been much work on information extraction from single utterances. Kate and Mooney (2006) proposed the use of SVM classifiers based on string kernels to parse natural language to a formal meaning representation. Wong and Mooney (2006) used a syntax-based statistical machine translation method to do semantic parsing. Translation of natural language to a formal meaning representation is captured by a synchronous context-free grammar in (Wu, 1997; Chiang et al., 2006). Quirk et al. (2015) created models to map natural language descriptions to executable code using productions from the formal language. Beltagy and Quirk (2016) improved the performance of semantic parsing on If-Then statements by using neural networks to model derivation trees, and leveraged several techniques such as synthetic training data from paraphrases and grammar combinations to improve generalization and reduce overfitting. In addition, other work focuses on text generation from structured data records. Angeli et al. (2010) proposed a domain-independent probabilistic approach to performing content selection and surface realization, treating text generation as a local decision process. Konstas and Lapata (2013) created a global model to generate text from structured records, which jointly modeled content selection and surface realization with a probabilistic context-free grammar. In contrast, in this paper we focus on generating structured data records from text descriptions.
Using spoken language understanding techniques, Mesnil et al. (2015) tag each word in a sentence with a predefined slot. A dialog modeling approach (Young et al., 2013) is also relevant to our task. However, this approach requires the definition of semantic slot names and human labeling of dialog acts in each utterance.
There are a number of relevant applications of neural attention models. Nallapati et al. (2016) proposed using a sequence-to-sequence model to summarize source code into natural language; they used an LSTM as the encoder and another attentional LSTM as the decoder to jointly learn content selection and realization. Dong and Lapata (2016) presented a sequence-to-sequence model with a tree-structured decoder to map natural language to its logical form. The tree-structured decoder shows superior performance on data that has nested output structure. Such models have also been used in other domains, including machine translation (Sutskever et al., 2014) and image caption generation. From this perspective, the most closely related work is that of Mei et al. (2016), who proposed using a sequence-to-sequence model to map navigational instructions in natural language to actions, which is conceptually similar to our work. However, we start from conversations and our structured data records are more complex.

Conclusion
In this paper we have presented an end-to-end method for extracting structured information from unstructured conversations using an encoder-decoder neural network. The restaurant-ordering domain we study is distinguished from past work by its conversational nature and by the need to handle user corrections and modifications. We incorporate a memory gate and role information into the encoder network to selectively keep important information and capture interaction patterns between conversation participants. Experimental results on both a human-transcribed data set and an ASR-recognized data set demonstrate the feasibility and effectiveness of our approach.