Slot Tagging for Task Oriented Spoken Language Understanding in Human-to-Human Conversation Scenarios

Task oriented language understanding (LU) in human-to-machine (H2M) conversations has been extensively studied for personal digital assistants. In this work, we extend the task oriented LU problem to human-to-human (H2H) conversations, focusing on the slot tagging task. Recent advances in LU for H2M conversations have shown accuracy improvements from adding encoded knowledge from different sources. Inspired by this, we explore several variants of a bidirectional LSTM architecture that rely on different knowledge sources, such as Web data, search engine click logs, expert feedback from H2M models, and previous utterances in the conversation. We also propose ensemble techniques that aggregate these different knowledge sources into a single model. Experimental evaluation on a four-turn Twitter dataset in the restaurant and music domains shows improvements in the slot tagging F1-score of up to 6.09% compared to existing approaches.


Introduction
Spoken Language Understanding (SLU) is the first component in digital assistants geared towards task completion, such as Amazon Alexa or Microsoft Cortana. The input to an SLU component is a natural language utterance from the user, and its output is a structured representation that can be used by the downstream dialog components to select the next action. The structured representation used by most standard dialog agents is a semantic frame consisting of domains, intents, and slots (Tur and De Mori, 2011). For example, the structured representation of "Find me a cheap Italian restaurant" is the domain Restaurant, the intent find place, and the slot [cheap] for price range.

* Work done while the author was at Microsoft Corporation.

Figure 1: Example of language understanding for task completion in an H2H conversation. In this work, our goal is to identify useful slots (marked with red rectangles).
We extend the task oriented SLU problem to human-to-human (H2H) conversations. A digital assistant can listen to the conversation between two or more humans and provide relevant information or suggest actions based on the structured representation captured with SLU. Figure 1 shows an example of capturing intents and slots expressed implicitly during a conversation between two humans. The digital assistant can show general information about the restaurant Mua, and provide the opening hours based on the captured structured representation. These types of H2H task completion scenarios may allow digital assistants to suggest useful information to users in advance without them needing to explicitly ask questions.
In this paper, we investigate SLU oriented towards task completion for H2H scenarios, with a specific focus on solving the slot tagging task. Some early conceptual ideas on this problem were presented in DARPA projects on developing cognitive assistants, such as CALO and RADAR. This work can be seen as an effort to formalize the problem and propose a practical framework. SLU for task completion in H2H conversations is a challenging problem. Firstly, since the problem has not been studied before, there are no existing datasets to use. Therefore, we built a multi-turn dataset for two H2H domains that we found to be prevalent in Twitter conversations: Music and Restaurants. The dataset is described in more detail in Section 4. Secondly, the task is harder than for H2M conversations in several respects. It is hard to identify the semantics of noisy H2H conversation text containing slang and abbreviations, and such conversations carry no explicit commands toward the digital assistant, requiring the assistant to infer users' intents indirectly.
In this work, we introduce a modular architecture with a core bidirectional LSTM network and additional network components that utilize knowledge from multiple sources: sentence embeddings that encode the semantics and intents of noisy texts using Web data and click logs, H2M based expert feedback, and contextual models relying on previous turns in the conversation. The idea of adding components is inspired by recent advances in H2M SLU that use additional encoded information (Su et al., 2018; Kim et al., 2017; Jha et al., 2018). However, these works each considered adding a component from only a single knowledge source. Since the additional components bring in information from different perspectives, we also experimented with deep learning based ensemble methods. Our best ensemble method outperforms existing methods by 6.09% for the Music domain and 2.62% for the Restaurant domain.
In summary, this paper makes the following contributions:

• A practical framework for slot tagging in task oriented SLU on H2H conversations using a bidirectional LSTM architecture.
• Extensions of the LSTM architecture utilizing knowledge from external sources (e.g., Web data, click logs, H2M expert feedback, and previous sentences) with deep learning based ensemble methods.

• A newly developed dataset for evaluating task oriented LU on H2H conversations.

We begin by describing our methods for H2H slot tagging in Section 3. We then describe the data used in our experiments in Section 4 and discuss results in Section 5. This is followed by a review of the related work and the conclusion.
Related Work

Recent advances in LU use additional encoded information to improve DNN based models. There have been several attempts to use data or models from existing domains. One direction is transfer learning. Kim et al. (2017) and Jha et al. (2018) utilized previously trained models relevant to the target domain as expert models, using the output of the expert models as additional input to add relevant knowledge while training for the target domain. Goyal et al. (2018) reused low-level features from previously trained models and retrained only the high-level layers to adapt to a new domain.
There have also been some attempts to use contextual information. Xu and Sarikaya (2014) used past predictions of domains and intents in the previous turn for predicting the current utterance. Later work expanded upon this by using a set of past utterances with a memory network (Sukhbaatar et al., 2015) and an attention model. Subsequent works attempted to use order and time information: Bapna et al. (2017) additionally used the chronological order of previous sentences, and Su et al. (2018) used time-decaying functions to add temporal information.
Our work trains a sentence embedding that encodes semantics and intents. DSSM and its variants (Huang et al., 2013; Shen et al., 2014; Palangi et al., 2016) are used for training the sentence embedding; these models were originally used for finding the relevance between a query and retrieved documents in a search engine. There have also been attempts to use sentence embeddings on data similar to ours (Twitter). Dhingra et al. (2016) trained an embedding for predicting the hashtags of a tweet using RNNs, and Vosoughi et al. (2016) used an encoder-decoder model for sentiment classification.

Figure 2: Overview of our slot tagging architecture. Our architecture consists of the core network (Section 3.1) and additional network components utilizing knowledge from multiple sources (discussed in Sections 3.2.1, 3.2.2, and 3.2.3). A network ensembling approach is applied on the additional components (Section 3.3); the figure shows the attention mechanism.
All of the previous methods have studied LU components for task completion in H2M conversations. In contrast, prior work on LU in H2H conversations has focused on dialog state detection and tracking for spoken dialog systems. Shi et al. (2017) used a CNN model, later extended to a multiple-channel model for a cross-language scenario (Shi et al., 2016). Jang et al. (2018) used an attention mechanism to focus on words with meaningful context, and Su et al. (2018) used a time decay model to incorporate temporal information.

Methods

Figure 2 shows the overview of our slot tagging architecture. Our modular architecture is a core LSTM-based network plus additional network components that encode knowledge from multiple sources. Slot prediction is done with a final feed forward layer whose input is the composition of the output of the core network and the additional components. We first describe our core network and then the additional network components, followed by our network ensembling approach.

Core Network
Our core network is a bidirectional model similar to Lample et al. (2016). The first character-level bidirectional LSTM layer extracts an encoding from the sequence of characters of each word. Each character c is represented with a character embedding e_c ∈ R^25, and the sequence of these embeddings is used as the input. The layer outputs forward and backward states f_c, b_c ∈ R^25 for each character; the final states for each word are passed to the next layer. The second word-level bidirectional LSTM layer extracts an encoding from the sequence of words of each sentence. For each word w_i, the input of the layer is g_i = f^c_i ⊕ b^c_i ⊕ e^w_i, where f^c_i and b^c_i are the final outputs of the previous layer for that word, e^w_i ∈ R^100 is the word embedding vector, and ⊕ is the vector concatenation operator. We use GloVe embeddings pre-trained on 2B tweets (Pennington et al., 2014) for the word embedding. The forward and backward word-level LSTMs produce the encoding h_i = f^w_i ⊕ b^w_i ∈ R^200 for each word, which is used as the input to the final feed forward layer. Our model is trained using stochastic gradient descent with the Adam optimizer (Kingma and Ba, 2015), with mini-batch size 64 and learning rate 0.7×10^-3. We also apply dropout (Srivastava et al., 2014) on the embeddings and other layers to avoid overfitting. The learning rate and dropout ratio were optimized using random search (Bergstra and Bengio, 2012). The core network can be used alone for slot tagging; in the following sections we discuss the additional network components that improve our architecture.
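The dimension bookkeeping for the word-level input described above can be sketched in plain Python. This is an illustration only, with hypothetical helper names; a real implementation would use an LSTM library for the recurrent layers.

```python
# Sketch of assembling the word-level LSTM input g_i = f^c_i ⊕ b^c_i ⊕ e^w_i.
# The vectors here are plain lists of floats; dimensions follow the paper.

CHAR_DIM, WORD_EMB_DIM = 25, 100

def concat(*vectors):
    """The ⊕ operator: concatenate vectors into one."""
    out = []
    for v in vectors:
        out.extend(v)
    return out

def word_input(f_c, b_c, e_w):
    """Build g_i for one word from the final char-LSTM states and the
    GloVe word embedding."""
    assert len(f_c) == CHAR_DIM and len(b_c) == CHAR_DIM
    assert len(e_w) == WORD_EMB_DIM
    return concat(f_c, b_c, e_w)

# The word-level bi-LSTM then maps each 150-d g_i to h_i ∈ R^200
# (100-d forward state ⊕ 100-d backward state).
g = word_input([0.0] * 25, [0.0] * 25, [0.0] * 100)
```

The 150-dimensional g_i and 200-dimensional h_i match the dimensions v_e ∈ R^200 used later for the expert-model encoding, since both are taken from the same word-level layer.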

Additional Network Components
In this section, we discuss additional network components that encode knowledge from different sources. Encoded vectors are used as additional input to the feed forward layer as shown in Figure 2.

Sentence Embedding for H2H Conversations
Texts from H2H conversations are noisy and contain slang and abbreviations, which can make identifying their semantics challenging. In addition, it can be challenging to infer their intents since there are no explicit commands toward the digital assistant. The upper part of Figure 3 shows part of a conversation from Twitter. The sentence lacks the semantics needed to fully understand "club and country". However, if we follow the URL in the original text, we can get additional information to assist with the understanding. For instance, the figure shows texts found from two sources: 1) the web page title of the URL in the tweet, and 2) web search engine queries that lead to the URL in the tweet. We use web search queries and click logs from a major commercial web search engine to find queries that lead to clicks on the URL. Using this information, we can infer from the web page title that the "club and country" referred to in the tweet are Atletico Madrid and Nigeria. Furthermore, the search queries from the search engine logs indicate possible user intents. In our approach, we encode knowledge found from these two sources based on the URL. In our dataset, we were able to gather 2.35M pairs of tweet text with a URL and web search engine queries that lead to the same URL, and 420K pairs of tweet text and web page titles of the URL. We then use this information to train a sentence embedding model that encodes the semantics and implicit intents of each H2H conversation sentence. Our approach is to train a model that projects texts from the H2H conversation and texts from each knowledge source into the same embedding space, keeping corresponding text pairs close to each other while non-relevant texts stay apart, as shown in Figure 3. The learned embedding model can then be used to represent any H2H sentence as a vector, with semantically similar texts (or similar intents) being projected close to each other in the embedding space.
Embeddings are used as additional component of our modular architecture, so that the semantic and intent information can be utilized in our slot tagging model.
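The construction of training pairs described above, joining tweets and search-engine queries on the URL they share, can be sketched as follows. The field layout is an assumption for illustration; the actual logs may be structured differently.

```python
# Sketch of building (tweet text, query) training pairs joined on a shared URL.
from collections import defaultdict

def build_pairs(tweets, query_log):
    """tweets: list of (tweet_text, url); query_log: list of (query, clicked_url).
    Returns (tweet_text, query) pairs for every query whose clicked URL
    appears in a tweet."""
    queries_by_url = defaultdict(list)
    for query, url in query_log:
        queries_by_url[url].append(query)
    pairs = []
    for text, url in tweets:
        for query in queries_by_url.get(url, []):
            pairs.append((text, query))
    return pairs

tweets = [("what a game for club and country!", "http://example.com/a")]
log = [("atletico madrid nigeria striker", "http://example.com/a"),
       ("unrelated query", "http://example.com/b")]
pairs = build_pairs(tweets, log)
```

The same join, with page titles in place of queries, yields the 420K (tweet, title) pairs; non-matching queries (the second log entry above) are simply dropped.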
We use the deep structured semantic model (DSSM) architecture (Huang et al., 2013) to train the sentence embedding encoder. DSSM uses letter-trigram word hashing, so it is capable of partially matching noisy spoken words, which gives us more robust sentence embeddings for H2H conversations. Let S be the set of sentences from the H2H conversations that contain a URL. For each sentence s ∈ S, we find the corresponding texts (the web page title of the URL, or web search engine queries leading to the URL) T+_s, and randomly choose non-related texts T-_s from the corresponding texts of other sentences (in other words, from different URLs). As in the original DSSM model, each sentence s, t+_s ∈ T+_s, and t-_s ∈ T-_s is initially encoded with a letter-trigram word hashing vector x ∈ R^1000, which is used as the input of two consecutive dense layers producing an encoding y ∈ R^300. We train the model to favor choosing t+_s ∈ T+_s over t-_s ∈ T-_s for each s, using the similarity sim(s, t) = cos(y_s, y_t), the cosine similarity of the two encoded vectors. Following DSSM, the posterior probability of a corresponding text given a sentence is a softmax over the similarities, P(t+_s | s) = exp(γ sim(s, t+_s)) / Σ_{t ∈ T_s} exp(γ sim(s, t)), where γ is a smoothing factor and T_s = T+_s ∪ T-_s, and the loss is the negative log likelihood -log Π_{(s, t+_s)} P(t+_s | s). Please refer to the original paper (Huang et al., 2013) for further details. The dropout ratio, learning rate, and γ are selected based on a random search (Bergstra and Bengio, 2012): 0.0275, 0.4035×10^-2, and 15, respectively. The output y of the second dense layer of the trained model is used as the sentence embedding: for each sentence we extract the sentence embedding v_s ∈ R^300.

Figure 3: Example of H2H conversation text with a URL link and the corresponding texts found by following the URL. We use these two sources of corresponding texts to train sentence embedding models. Each model projects the original text and its corresponding texts to close positions in the sentence embedding space, while non-relevant texts remain apart.
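The letter-trigram word hashing used by DSSM can be sketched as below. The trigram vocabulary here is a toy one for illustration; the real model uses a fixed vocabulary of roughly 30K trigrams (here we simply assume a small index).

```python
# Sketch of DSSM-style letter-trigram word hashing.
def letter_trigrams(word):
    """Decompose a word into letter trigrams with boundary markers,
    e.g. 'cat' -> ['#ca', 'cat', 'at#']."""
    padded = "#" + word + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def hash_sentence(sentence, trigram_index):
    """Bag-of-trigrams count vector over a fixed trigram vocabulary;
    unknown trigrams are ignored."""
    vec = [0] * len(trigram_index)
    for word in sentence.lower().split():
        for tri in letter_trigrams(word):
            if tri in trigram_index:
                vec[trigram_index[tri]] += 1
    return vec

tris = letter_trigrams("cat")
```

Because a misspelling like "restarant" still shares most trigrams with "restaurant", the hashed vectors overlap heavily, which is why this representation is robust to the noisy spellings common in H2H conversations.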

Contextual Information
Contextual information extracted from previous sentences is known to be useful for improving understanding of human spoken language in other scenarios (Xu and Sarikaya, 2014; Su et al., 2018). To obtain knowledge from previous sentences in the conversation, we extract a contextual encoded vector using a memory network, which takes the weighted sum of the outputs of the word-level bidirectional LSTM h in the core network (Section 3.1) over previous sentences. We did not consider a time decaying model (Su et al., 2018) since our data has a small number of turns.
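The weighted sum over previous-sentence encodings can be sketched as follows; with uniform weights, the best-performing configuration below, it reduces to the mean of the encodings. The vectors stand in for the word-level LSTM outputs h.

```python
# Sketch of the contextual encoded vector: a weighted sum of the
# encodings of previous sentences (uniform weights by default).
def contextual_vector(prev_sentence_encodings, weights=None):
    """prev_sentence_encodings: list of equal-length vectors, one per
    previous sentence. Returns their weighted sum."""
    n = len(prev_sentence_encodings)
    if weights is None:
        weights = [1.0 / n] * n  # uniform weighting
    dim = len(prev_sentence_encodings[0])
    v_c = [0.0] * dim
    for w, enc in zip(weights, prev_sentence_encodings):
        for j in range(dim):
            v_c[j] += w * enc[j]
    return v_c

# Two previous sentences with uniform weights -> their mean.
v_c = contextual_vector([[2.0, 4.0], [0.0, 0.0]])
```

An attention-based weighting would replace the uniform `weights` with softmax scores against the current sentence; in our experiments the uniform scheme over the previous two sentences worked best.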
We tested the model with some variations on 1) the number of previous sentences to use and 2) the weighting scheme (uniform or with attention), using the implementation from the original paper. From our experiments, the best result was achieved using the previous two sentences with a uniform weight. We use this model to extract the contextual encoded vector v_c ∈ R^100.

H2M Expert Feedback

Prior work has used previously trained models as experts to transfer knowledge to a target domain (Kim et al., 2017; Jha et al., 2018). We adopt this idea to take advantage of the massive amount of labeled data available for H2M conversations. Instead of transferring knowledge from domain to domain, we transfer the knowledge of different tasks within a similar domain. For example, we use the Places (H2M) domain for the Restaurant (H2H) domain, and the Entertainment (H2M) domain for the Music (H2H) domain. We use slot tagging models previously trained on H2M conversations in similar domains as our expert models; each has the same architecture as our core network (Section 3.1). These H2M models were originally used in the SLU component of a commercial digital assistant. The output of the word-level bidirectional LSTM h is extracted as the encoded vector from the H2M expert model, v_e ∈ R^200.

Network Ensemble Approaches
Since additional network components (sentence embedding v s , contextual information from previous turns of the conversation v c , and H2M based expert feedback v e ) bring information from different perspectives, we discuss how to compose them into a single vector k with various ensemble approaches.
• Concatenation: Here, we simply concatenate all encodings into a single vector, k = v_s ⊕ v_c ⊕ v_e.

• Mean: We first apply a separate dense layer to each encoded vector to match dimensions and transform them into the same latent space, and then take the arithmetic mean of the transformed vectors:
v'_{s,c,e} = W_{s,c,e} v_{s,c,e} + b_{s,c,e}    (11)

with k = (v'_s + v'_c + v'_e) / 3. In Figure 2, we denote the dense layer applied to each encoded vector v_{s,c,e} as D_{s,c,e} for simplicity of representation. Each transformed vector v'_{s,c,e} ∈ R^100, so k ∈ R^100.
• Attention: We apply an attention mechanism to put different weights on the encoded vectors for each sentence. For our problem, it is not straightforward to define a context vector for each sentence to calculate the importance of each encoded vector; therefore, we adopted the idea of using a global context vector (Yang et al., 2016). The global context vector u ∈ R^100 can be thought of as a fixed query of "finding the informative encoded vector for slot tagging" applied to each sentence. The weight of each encoded vector is calculated with the standard attention weight equation, the softmax of the dot product of the encoding and the context vector: a_i = exp(v'_i · u) / Σ_{j ∈ {s,c,e}} exp(v'_j · u) for i ∈ {s,c,e}, and k = Σ_i a_i v'_i, where v'_{s,c,e} are the same as in Equation 11.
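The three ensembling schemes above can be sketched in plain Python. The dense projections D_{s,c,e} are assumed to have already been applied to the mean and attention inputs, so all vectors share one dimension; the global context vector u is a learned parameter in the real model, passed in explicitly here.

```python
# Sketch of the three ensemble approaches over encoded vectors v_s, v_c, v_e.
import math

def concat_ensemble(vectors):
    """k = v_s ⊕ v_c ⊕ v_e."""
    k = []
    for v in vectors:
        k.extend(v)
    return k

def mean_ensemble(vectors):
    """Arithmetic mean of already-projected vectors (Equation 11)."""
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(dim)]

def attention_ensemble(vectors, u):
    """Softmax of each vector's dot product with the global context
    vector u, then a weighted sum of the vectors."""
    scores = [sum(a * b for a, b in zip(v, u)) for v in vectors]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors))
            for j in range(dim)]
```

With a context vector strongly aligned to one encoding, the attention ensemble reduces almost to selecting that encoding, which is the per-sentence flexibility the concatenation and mean schemes lack.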
The combined single vector k is then concatenated with the output of the core network h, and k ⊕ h is used as the input of the final feed forward layer as shown in Figure 2. The same hyperparameters (mini-batch size, learning rate, dropout ratio) and optimizer are used as stated for the baseline model (Section 3.1).

Data
Although some datasets with H2H conversations are available (Forsyth and Martell, 2007; Danescu-Niculescu-Mizil and Lee, 2011; Nio et al., 2014; Sordoni et al., 2015; Lowe et al., 2015; Li et al., 2017), they were not feasible to use for experimenting on our task. All datasets excluding the Ubuntu Dialogue corpus (Lowe et al., 2015) were collected without any restrictions on the domain and, as a result, contain insufficient training samples to train a slot tagging model for a specific domain. In addition, the Ubuntu Dialogue dataset focuses on questions related to the Ubuntu OS, which is not an attractive domain for an intelligent assistant that focuses on task completion rather than question answering.
Since there were no existing datasets that were sufficient for our task in H2H conversation, we built our own dataset for the experiments. It was difficult to acquire actual H2H conversations from instant messages due to privacy concerns. Therefore, we chose to use public conversations on Twitter and extracted sequences in which two users engage in a multi-turn conversation. Using this approach, we were able to collect 3.8 million sequences of four-turn conversations using Twitter Firehose.
We focused on two domains for our experiments: Restaurants and Music. To acquire the dataset for each domain, we first defined a set of key phrases and found candidate conversations containing at least one of those key phrases. The key phrases consisted of the top 100 most frequently used unigrams and bigrams in each relevant domain of the H2M conversation dataset. We used the H2M Places domain to find the top n-grams for the Restaurant domain and the H2M Entertainment domain to find the top n-grams for the Music domain. Places includes other types of places besides restaurants (e.g., tourist sights), and Entertainment likewise includes other genres (e.g., movies), so we manually replaced unigrams and bigrams that were not restaurant or music related, as well as some terms that were too general (e.g., time, call, find). Using the key phrases, we were able to gather 16K and 22K candidate conversations for the Restaurant and Music domains, respectively.
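The key-phrase extraction and candidate filtering above can be sketched as follows, on a toy corpus standing in for the H2M data (the real pipeline also applies the manual replacement step, which is omitted here).

```python
# Sketch of key-phrase mining and candidate-conversation filtering.
from collections import Counter

def top_ngrams(sentences, n=100):
    """Most frequent unigrams and bigrams across a corpus."""
    counts = Counter()
    for s in sentences:
        words = s.lower().split()
        counts.update(words)
        counts.update(" ".join(p) for p in zip(words, words[1:]))
    return [g for g, _ in counts.most_common(n)]

def matches(conversation, key_phrases):
    """True if any key phrase occurs (word-bounded) in the conversation."""
    padded = " " + " ".join(conversation).lower() + " "
    return any(" " + p + " " in padded for p in key_phrases)

h2m = ["find a restaurant", "a restaurant near me"]
phrases = top_ngrams(h2m, n=3)
```

A conversation is kept as a candidate if it matches at least one key phrase; candidates are then sent to annotators, as described in the next paragraph.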
We randomly sampled 10K conversations for each domain for annotating slots and domain. Annotation was done by managed judges who had been trained over time to annotate SLU components such as intents, slots, and domains. A guideline document was provided with precise definitions and annotated examples of each of the slots and intents. Agreement between judges and manual inspection of samples for quality assurance was done by a linguist trained in managing annotation tasks. We also ensured that judges did not attempt to guess at the underlying intents and slots, and annotated objectively within the context of the text. We only keep conversations that annotators labeled as relevant to each domain. Table 1 shows an example conversation from the dataset in each domain, and Table 2 shows the dataset statistics.

Experimental Setup
All experiments were done with 10-fold cross validation for the slot tagging task, generating training, development, and test datasets from 80%, 10%, and 10% of the data. The development dataset is used for hyperparameter tuning with random search (Bergstra and Bengio, 2012) and for early stopping. The baseline is the core network alone (Section 3.1). We evaluated the performance of each model with precision, recall, and F1. We checked for statistical significance over the baseline at p < 0.05 using the Wilcoxon signed-rank test.
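One common way to obtain 80/10/10 splits from 10-fold cross validation is to rotate one fold as test and an adjacent fold as development; the paper does not specify its exact scheme, so the rotation below is an assumption for illustration.

```python
# Sketch of 10-fold cross validation with rotating 80/10/10 splits.
def ten_fold_splits(items):
    """For each fold i: fold i is test, fold (i+1) mod 10 is dev,
    the remaining eight folds are train."""
    folds = [items[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        dev = folds[(i + 1) % 10]
        train = [x for j, f in enumerate(folds)
                 if j not in (i, (i + 1) % 10) for x in f]
        yield train, dev, test

splits = list(ten_fold_splits(list(range(100))))
```

Metrics are then averaged over the ten folds, and the per-fold scores provide the paired samples for the Wilcoxon signed-rank test against the baseline.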

Evaluation on Adding Sentence Embeddings for H2H Conversations
In this section, we evaluate adding the sentence embeddings introduced in Section 3.2.1 into our slot tagging architecture. Table 3 shows the results of adding sentence embeddings, compared with the baseline and existing sentence embedding methods. For our method, we extracted two months of recent tweets that had non-Twitter-domain URLs in the text. Below is a brief description of each method:

• DSSM (Deep Structured Semantic Model) (Huang et al., 2013): Pre-trained DSSM model from the authors, trained with pairs of (major commercial web search engine queries, clicked page titles).
• Tweet2Vec (Dhingra et al., 2016): The model was originally used to predict hashtags of a  tweet. We use the pre-trained model from the authors, which used 2M tweets for training.
• Ours (Tweets, Web Search Engine Queries): Our model trained with 2.35M pairs of (tweet text with a shared URL, web search engine queries that lead to the shared URL). We extracted the most frequent queries (up to eight) found in the major commercial web search engine query logs.
• Ours (Tweets, Web Page Titles): Trained our model with 420K pairs of (Tweet text with shared URL, web page title of URL).
The results show that adding our proposed sentence embedding network improves the slot tagging result compared to the baseline, while the other, previously published methods have a negative effect. This implies that 1) a sentence embedding specifically trained for H2H conversation texts is needed (compared with the original DSSM), and 2) our idea of embedding semantics and intents from web data and search engine query logs helps to improve the slot tagging task (compared to Tweet2Vec). Since our sentence embedding network trained with web page titles gives the most significant improvement, we use it for further evaluation.

Evaluation on Utilizing Knowledge Sources
We also tested adding the contextual information and H2M expert feedback network components to our slot tagging architecture. Table 4 shows the results of adding each. The results show that 1) adding a network component from each knowledge source leads to an improvement in at least one of the domains, and 2) the improvement from each method varies with the domain. Adding sentence embeddings and contextual information led to significant improvements for the Restaurant domain, while contextual information and H2M expert feedback led to significant improvements for the Music domain.

Evaluation on Network Ensemble Approaches
We also conducted an experiment including all network components to see if we could improve further by considering multiple knowledge sources together. The results are shown in the lower part (rows 5-7) of Table 4 for the different ensembling methods introduced in Section 3.3. They show that any of the ensemble approaches that add all of the network components leads to better results than adding any one of them individually.
This implies that each of the proposed components improves slot tagging from a different perspective, so all of them should be considered. Also, we see that attention gives the best results among the ensemble approaches, with a 2.62% higher F1 score for the Restaurant domain and 6.09% for the Music domain compared to the baseline. This implies the attention model can help find the best way to ensemble the additional components by predicting the importance of each component for each sentence. In particular, we observed a statistically significant improvement in the Music domain compared with the other methods. We believe this is because the improvement from each network component in the Music domain is more pronounced than in the Restaurant domain. We would like to test additional domains in future work.

Table 4: Comparison of adding additional network components from each knowledge source and of network ensemble approaches that add all components. P, R, F1 stand for precision, recall, and F1-score (%), respectively. * denotes that the F1 score is statistically significant compared to the baseline. ** denotes that the F1 score of the ensemble model is also statistically significant compared to the concatenation ensemble model.

Conclusion
We studied slot tagging in H2H online text conversations. Starting from a core network with a bidirectional LSTM, we proposed additional network components, and ensembles of them, to incorporate useful knowledge from multiple sources (web data, search engine click logs, H2M expert feedback, and previous utterances). Experiments with our four-turn Twitter dataset on the Restaurant and Music domains showed that our method achieves up to 6.09% higher F1 on slot tagging compared to existing approaches. For future work, we plan to study our model on domain and intent classification, and on additional domains.