Learning to Embed Semantic Correspondence for Natural Language Understanding

While learning embedding models has yielded fruitful results in several NLP subfields, most notably Word2Vec, embedding correspondence remains relatively unexplored, especially in the context of natural language understanding (NLU), a task that typically extracts structured semantic knowledge from text. An NLU embedding model can facilitate analyzing and understanding the relationships between unstructured texts and their corresponding structured semantic knowledge, which is essential for both researchers and practitioners of NLU. Toward this end, we propose a framework that learns to embed the semantic correspondence between a text and its extracted semantic knowledge, called a semantic frame. One key contributed technique is semantic frame reconstruction, used to derive a one-to-one mapping between embedded vectors and their corresponding semantic frames. Embedding instances into semantically meaningful vectors and computing their distances in vector space provides a simple but effective way to measure semantic similarity. With the proposed framework, we demonstrate three key areas where the embedding model can be effective: visualization, semantic search, and re-ranking.


Introduction
The goal of NLU is to extract meaning from natural language and infer the user's intention. NLU typically involves two tasks: identifying the user intent and extracting domain-specific entities, the latter of which is often referred to as slot-filling (Mesnil et al., 2013; Jeong and Lee, 2006; Kim et al., 2016). The NLU task can thus be viewed as the extraction of structured text from raw text. In the NLU literature, the structured form of the intent and filled slots is called a semantic frame.

Figure 1: Semantic vector learning framework and applications. We assume that a pair of corresponding text and semantic frame $(t, s)$, which carry the same meaning in the raw text domain $\chi_T$ and the semantic frame domain $\chi_S$, can be encoded to a vector $v$ in a shared embedding vector space $Z$. $R_T$ and $R_S$ are two reader functions that encode raw and structured text to a semantic vector. $W$ is a writer function that decodes a semantic vector to a symbolic semantic frame.
In this study, we aim to learn a meaningful distributed semantic representation, rather than focusing on building the NLU system itself. Once a reliable and reasonable semantic representation in vector form is obtained, many useful new applications can be devised around NLU (Figure 1). Because all instances of text and semantic frame are placed in a single vector space, we obtain a natural and direct distance measure between them. Using this distance measure, similar text or semantic frame instances can be searched directly and interchangeably by distance comparison. Moreover, multiple NLU results can be re-ranked without further learning by comparing the distances between the text and each predicted semantic frame. Converting symbols to vectors also makes visualization natural.
In this study, we assume that a reasonable semantic vector representation satisfies the following properties.
• Property (embedding correspondence): The distributed representation of a text should be the same as the distributed representation of its corresponding semantic frame.
• Property (reconstruction): The symbolic semantic frame should be recoverable from the learned semantic vector.
We herein introduce a novel semantic vector learning framework called ESC (Embedding Semantic Correspondence learning), which satisfies the assumed properties. The remainder of the paper is structured as follows: Section 2 describes the detailed structure of the framework; Section 3 introduces semantic vector applications in NLU; Section 4 describes the experimental settings and results; Section 5 discusses related work; finally, Section 6 presents the conclusion.

ESC Framework
Our framework consists of a text reader, a semantic frame reader, and a semantic frame writer. The text reader embeds a sequence of tokens into a distributed vector representation. The semantic frame reader reads the structured text and encodes it to a vector. $v_t$ denotes the semantic vector derived from the text reader, and $v_s$ denotes the semantic vector derived from the semantic frame reader. Finally, the semantic frame writer generates a symbolic semantic frame from a vector representation.

Text Reader
A text reader (Figure 2), implemented as a neural sentence encoder, reads a sequence of input tokens and encodes it to a vector. In this study, we used a long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997) to encode input sequences. The encoding process can be defined as

$\overrightarrow{h}_s = R_{text}(E_X(x_s), \overrightarrow{h}_{s-1}), \quad s \in \{1, 2, ..., S\}$

where $\overrightarrow{h}_s$ is the forward hidden state over the input sequence at time $s$; $R_{text}$ is an RNN cell; and $E_X(x_s)$ is the token embedding function, which returns a distributed vector representation of token $x_s$ at time $s$. The final RNN output $\overrightarrow{h}_S$ is taken as $v_t$, the semantic vector derived from the text.
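As a concrete illustration, the following is a minimal PyTorch sketch of such an LSTM text reader. The class name, the single LSTM layer, and the layer sizes are illustrative assumptions, not the paper's exact configuration (the actual hyperparameters are listed in Table 1).

```python
import torch
import torch.nn as nn

class TextReader(nn.Module):
    """LSTM sentence encoder: maps a token-id sequence to a semantic vector v_t."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)               # E_X: token -> vector
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)   # R_text

    def forward(self, token_ids):                 # token_ids: (batch, S) integer tensor
        embedded = self.embed(token_ids)          # (batch, S, emb_dim)
        _, (h_n, _) = self.lstm(embedded)
        return h_n[-1]                            # final hidden state h_S taken as v_t

reader = TextReader(vocab_size=1000)
v_t = reader(torch.randint(0, 1000, (2, 12)))     # two 12-token sentences -> (2, 200)
```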

Semantic Frame Reader
A semantic frame consists of structured tags, namely the intent tag, slot-tags, and slot-values. In this study, the intent tag is handled as a single symbol, while the slot-tags and slot-values are handled as sequences of symbols. For example, the sentence "Please list all flights from Dallas to Philadelphia on Monday." is handled as shown in Figure 2. The intent reader is a simple embedding function $v_{intent} = E_I(i)$, which returns a distributed vector representation of the intent tag $i$ of a sentence.
A stacked LSTM layer is used to read the sequences of slot-tags and slot-values. $E_S(o)$ is the slot-tag embedding function for a slot-tag token $o$, and $E_V(a)$ is the slot-value embedding function for a slot-value token $a$. The embedding results $E_S(o_m)$ and $E_V(a_m)$ are concatenated at time step $m$, and the merged vectors are fed to the stacked layer at each time step (Figure 2). $v_{tag,value}$, the reading result for the sequence of slot-tags and values, is taken from the final output of the RNN at time $M$. Finally, the intent and slot-tag/value encoded vectors are merged to construct a distributed semantic frame representation as

$v_s = [v_{intent}; v_{tag,value}]$

where $[;]$ denotes the vector concatenation operator. The dimension of $v_s$ is the same as that of $v_t$. All embedding weights are randomly initialized and learned during training.

Figure 2: Text reader, semantic frame reader, and semantic frame writer neural architecture. $E_X$ is the embedding function for an input text token $x$. $E_I$, $E_S$, and $E_V$ are the embedding functions for the intent tag, slot-tag, and slot-value, respectively. The figure also marks a vector concatenation operation, a cross-entropy calculation, an average calculation ($\oplus$), and a distance calculation. $\hat{y}_{intent}$ is the reference intent tag vector and $\hat{y}^{slot}_m$ is the reference slot tag vector at time $m$. $M$ is the number of slots in a sentence (in the above example, $M = 3$).
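To make the frame reader concrete, a minimal PyTorch sketch is given below. The embedding and hidden sizes are placeholder assumptions; the only structural constraint carried over from the text is that the dimension of the concatenated output should match that of $v_t$.

```python
import torch
import torch.nn as nn

class SemanticFrameReader(nn.Module):
    """Encodes an intent tag plus aligned slot-tag / slot-value sequences into v_s."""
    def __init__(self, n_intents, n_slot_tags, n_slot_values,
                 emb_dim=100, hidden_dim=150, intent_dim=50):
        super().__init__()
        self.intent_embed = nn.Embedding(n_intents, intent_dim)       # E_I
        self.tag_embed = nn.Embedding(n_slot_tags, emb_dim)           # E_S
        self.value_embed = nn.Embedding(n_slot_values, emb_dim)       # E_V
        # two stacked LSTM layers over concatenated tag/value embeddings
        self.lstm = nn.LSTM(2 * emb_dim, hidden_dim, num_layers=2, batch_first=True)

    def forward(self, intent_id, slot_tag_ids, slot_value_ids):
        v_intent = self.intent_embed(intent_id)                        # (batch, intent_dim)
        pair = torch.cat([self.tag_embed(slot_tag_ids),
                          self.value_embed(slot_value_ids)], dim=-1)   # (batch, M, 2*emb_dim)
        _, (h_n, _) = self.lstm(pair)
        v_tag_value = h_n[-1]                                          # final output at time M
        # intent_dim + hidden_dim is chosen to equal the text reader's hidden size,
        # so that v_s and v_t live in the same vector space
        return torch.cat([v_intent, v_tag_value], dim=-1)              # v_s = [v_intent; v_tag,value]
```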

Semantic Frame Writer and Loss Functions
One objective of this study is to learn semantically reasonable vector representations of a text and its related semantic frame. Hence, we defined the properties of a desirable semantic vector, and the loss functions are designed to satisfy those properties.

Loss for the "embedding correspondence" property: The distance loss measures the dissimilarity between the semantic vectors encoded by the text reader and those encoded by the semantic frame reader in the vector space. The loss is defined as

$L_{dist} = dist(v_t, v_s)$

where the dist function can be any vector distance measure; in this study, we employed the Euclidean distance and the cosine distance (defined as 1.0 minus the cosine similarity).
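A possible implementation of this loss, assuming both vectors are available as PyTorch tensors, might look as follows; both distance variants mentioned above are shown.

```python
import torch
import torch.nn.functional as F

def distance_loss(v_t, v_s, metric="cosine"):
    """L_dist between text and frame vectors: Euclidean, or cosine (1 - cosine similarity)."""
    if metric == "euclidean":
        return torch.norm(v_t - v_s, dim=-1).mean()
    return (1.0 - F.cosine_similarity(v_t, v_s, dim=-1)).mean()
```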
Loss for the "reconstruction" property: The content loss measures how much semantic information the semantic vector retains. Without the content loss, $v_t$ and $v_s$ tend to quickly converge to zero vectors, implying a failure to learn the semantic representation. To measure content retention, a symbolic semantic frame is generated from the semantic vector, and the difference between the original semantic frame and the generated one is calculated.
Because slot-values have a large vocabulary, generating them is difficult; a reduced semantic frame is therefore devised to ease the generation problem. A reduced semantic frame is created by simply dropping the slot-values from the corresponding semantic frame. For example, in Figure 2, the slot values [Dallas, Philadelphia, Monday] are removed to create the reduced semantic frame. The content loss is calculated on this reduced semantic frame. Another advantage of employing the reduced semantic frame is that the learned distributed semantic vectors gain more abstraction power, because they are less sensitive to the lexical vocabulary.
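The reduction step itself is trivial. The sketch below shows it on a hypothetical dictionary representation of the Figure 2 example; the tag names are illustrative placeholders, not necessarily the exact ATIS labels.

```python
def reduce_semantic_frame(frame):
    """Drop slot-values, keeping only the intent and slot-tags (hypothetical dict layout)."""
    return {"intent": frame["intent"], "slot_tags": list(frame["slot_tags"])}

frame = {"intent": "flight",
         "slot_tags": ["from_city", "to_city", "depart_day"],
         "slot_values": ["Dallas", "Philadelphia", "Monday"]}
print(reduce_semantic_frame(frame))   # {'intent': 'flight', 'slot_tags': ['from_city', 'to_city', 'depart_day']}
```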
For the content loss, the generation quality of the intent and slot-tags is measured. The intent generation network can be defined with a simple linear projection,

$y_{intent} = W'_I v + b_I$

where $v$ is the semantic vector and $y_{intent}$ is the output vector.
The slot-tag generation network is defined as

$h^g_m = R_G(v, h^g_{m-1}), \quad y^{slot}_m = W'_S h^g_m$

where $R_G$ is an RNN cell. The semantic vector $v$ is copied and fed into the RNN input at every time step, and the RNN outputs are projected onto the slot-tag space with $W'_S$. Figure 2 shows the intent and slot-tag generation networks and the corresponding loss calculations. The generation losses are defined with the cross-entropy between the generated tag vectors and the reference tag vectors as

$L_{intent} = \mathrm{CE}(y_{intent}, \hat{y}_{intent}), \quad L_{slot} = \frac{1}{M} \sum_{m=1}^{M} \mathrm{CE}(y^{slot}_m, \hat{y}^{slot}_m)$

where $\mathrm{CE}$ denotes the cross-entropy and $M$ is the number of slots in a sentence.
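Putting the two generators together, a semantic frame writer along these lines could be sketched in PyTorch as follows. The class layout and dimension names are assumptions; only the repeated feeding of $v$ into the RNN and the $W'_S$ projection are taken from the description above.

```python
import torch
import torch.nn as nn

class SemanticFrameWriter(nn.Module):
    """Generates the reduced semantic frame (intent + slot-tags) from a semantic vector v."""
    def __init__(self, sem_dim, n_intents, n_slot_tags, hidden_dim=150):
        super().__init__()
        self.intent_proj = nn.Linear(sem_dim, n_intents)                 # linear intent generator
        self.slot_rnn = nn.LSTM(sem_dim, hidden_dim, batch_first=True)   # R_G
        self.slot_proj = nn.Linear(hidden_dim, n_slot_tags)              # W'_S

    def forward(self, v, num_slots):
        intent_logits = self.intent_proj(v)                    # (batch, n_intents)
        repeated = v.unsqueeze(1).repeat(1, num_slots, 1)      # v copied M times, one per step
        rnn_out, _ = self.slot_rnn(repeated)
        slot_logits = self.slot_proj(rnn_out)                  # (batch, M, n_slot_tags)
        return intent_logits, slot_logits
```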
Combining the intent and slot losses, the content loss ($L_{content}$) for reconstructing a semantic frame from a semantic vector $v$ is defined as

$L_{content} = L_{intent} + L_{slot}$

Finally, the total loss ($L$) for learning the semantic frame representation is defined with the distance loss and the content loss as

$L = L_{dist} + L_{content}$

The hyperparameters of the proposed model are summarized in Table 1.
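Under the assumption that the losses are combined as unweighted sums and that the cosine distance is used, the full training objective could be computed roughly as in the following sketch; the tensor shapes in the comments are illustrative.

```python
import torch.nn.functional as F

def total_loss(v_t, v_s, intent_logits, slot_logits, intent_ref, slot_ref):
    """Total ESC loss: distance loss plus content loss (unweighted sum assumed).
    intent_ref: (batch,) gold intent ids; slot_ref: (batch, M) gold slot-tag ids."""
    l_dist = (1.0 - F.cosine_similarity(v_t, v_s, dim=-1)).mean()              # L_dist
    l_intent = F.cross_entropy(intent_logits, intent_ref)                       # L_intent
    l_slot = F.cross_entropy(slot_logits.reshape(-1, slot_logits.size(-1)),     # L_slot, averaged
                             slot_ref.reshape(-1))                              # over all slots
    return l_dist + (l_intent + l_slot)                                         # L = L_dist + L_content
```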

Visualization
With the vector semantic representation, we can visualize instances (sentences) in an easier and more natural way. Once a symbolic text or semantic frame is converted to a vector, vector visualization methods such as t-SNE (Maaten and Hinton, 2008) can be used directly to examine the relationships between instances or the distribution of the entire corpus.
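For instance, a small helper along the following lines (using scikit-learn's TSNE and matplotlib, with illustrative plotting choices) could produce such a plot from learned semantic vectors and their intent tags.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_semantic_vectors(vectors, intent_labels):
    """Project learned semantic vectors to 2-D with t-SNE and color points by intent tag."""
    coords = TSNE(n_components=2, random_state=0).fit_transform(np.asarray(vectors))
    for intent in sorted(set(intent_labels)):
        idx = [i for i, lab in enumerate(intent_labels) if lab == intent]
        plt.scatter(coords[idx, 0], coords[idx, 1], s=8, label=intent)
    plt.legend(fontsize=6)
    plt.show()
```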

Re-ranking Without Further Learning
Re-ranking the NLU results from multiple NLU modules is difficult but important if a robust NLU system is to be built. Typically, a choice is made by comparing the scores produced by each system. However, this technique is not always feasible because the scores are often on different scales, or are occasionally not provided at all (e.g., in purely rule-based NLU systems). The vector form of the semantic frame provides a clear and natural solution to the re-ranking problem. Figure 3 shows the flow of the re-ranking algorithm with the proposed vector semantic representation. In this study, we reordered the NLU results from multiple NLU systems according to the distance of each predicted $v_s$ to $v_t$. It is noteworthy that the proposed re-ranking algorithm requires no further learning for ranking, such as ensemble learning or learning-to-rank techniques. Further, the proposed method is applicable to any type of NLU system; even purely rule-based systems can be compared directly to purely statistical systems.

Figure 3: Re-ranking multiple NLU results using the semantic vector. The semantic vector from the text ($v_t$) functions as a pivot. Three different NLU systems are shown in this illustration.
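The core of this re-ranking procedure fits in a few lines. The sketch below assumes each NLU system and each trained ESC reader is exposed as a simple callable; these interfaces are assumptions for illustration, not the paper's actual code.

```python
def rerank(text, nlu_systems, text_reader, frame_reader, dist):
    """Select the NLU hypothesis whose frame vector v_s is closest to the text vector v_t.
    text_reader / frame_reader: trained ESC encoders; dist: e.g. cosine distance."""
    v_t = text_reader(text)                                   # pivot vector from the raw text
    hypotheses = [system(text) for system in nlu_systems]     # one semantic frame per system
    return min(hypotheses, key=lambda frame: dist(v_t, frame_reader(frame)))
```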

Experiments
For training and testing, we used the ATIS2 dataset (Price, 1990), an intent- and slot-annotated corpus for an air travel information search task. The ATIS2 dataset comes with a commonly used training/test split. For parameter tuning, we further split the training set into 90% training and 10% development sets.

Validity of Learned Semantic Vector with Visualization
The intuition behind the proposed method is that semantically similar instances will be grouped together if the semantic vector learning succeeds. Figure 4 supports this intuition. In the early stages of training, the instances are scattered randomly; as training progresses, semantically similar instances move closer to each other. We observed that the proposed framework groups the sentences by intent tag remarkably well.

Multi-form Distance Measurement
In our framework, instances of different forms (text or semantic frame) can be compared directly in the semantic vector space. To demonstrate that multi-form distance measurement works well, Table 2 shows sentence and semantic frame search results for both a sentence query and a semantic frame query. Table 2 shows that text-to-text search works very well with the learned vectors: the retrieved sentence patterns are similar to the given text, and the vocabulary is also similar. In the case of text-to-semantic-frame search, on the other hand, the sentence patterns are similar but content words such as city names are not. This is what we expected, because the content loss for the reconstruction property is measured on the reduced semantic frame, which does not include slot-values. Semantic-frame-to-text search shows similar behavior: the retrieved results have almost the same intent tag and slot-tags, but different city or airport names, which correspond to slot-values. If slot-value generation could be included in the reconstruction loss, given enough data, better multi-form semantic search results might be expected.
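Such a multi-form nearest-neighbour search only needs the encoded vectors, regardless of whether they came from the text reader or the frame reader. A rough NumPy sketch, using cosine similarity and hypothetical inputs, is shown below.

```python
import numpy as np

def nearest_instances(query_vec, corpus_vecs, corpus_items, k=5):
    """Return the k corpus instances (sentences or frames) whose semantic vectors
    are closest to the query vector under cosine similarity."""
    sims = corpus_vecs @ query_vec / (
        np.linalg.norm(corpus_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return [corpus_items[i] for i in np.argsort(-sims)[:k]]
```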
To measure quantitative search performance, precision at K is reported in Table 3. Precision at K corresponds to the fraction of the top K results that share the query's sentence pattern. From the search results, we conclude that the learned semantic vectors preserve sentence pattern (intent tag and slot-tag) information very well.
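As a reference point, precision at K over a set of queries could be computed as in the following sketch; the cosine similarity and the pattern-equality check are assumptions about the exact evaluation protocol.

```python
import numpy as np

def precision_at_k(query_vecs, corpus_vecs, query_patterns, corpus_patterns, k=5):
    """Average fraction of the top-k neighbours sharing the query's sentence pattern
    (intent tag + slot-tag sequence)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = q @ c.T                                     # cosine similarities, (n_queries, n_corpus)
    hits = []
    for i in range(len(q)):
        top_k = np.argsort(-sims[i])[:k]
        hits.append(np.mean([corpus_patterns[j] == query_patterns[i] for j in top_k]))
    return float(np.mean(hits))
```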

Re-ranking
We prepared 11 NLU systems for re-ranking: nine intent/slot-combined classifiers and two intent/slot joint classifiers. For the combined classifiers, three intent classifiers and three slot sequence classifiers were prepared and combined. For the joint classifiers, the models of Liu and Lane (2016) and Hakkani-Tür et al. (2016) were implemented. We did not heavily tune the NLU systems, as the purpose of this paper is to learn the semantic vector, not to build state-of-the-art NLU systems. A maximum-entropy (MaxEnt)-based and a support vector machine (SVM)-based intent classifier were implemented as traditional sentence classification methods; both share the same feature set (1-gram, 2-gram, 3-gram, and 4-gram features around each word). A convolutional neural network (CNN)-based (Kim, 2014) sentence classification method was also implemented.
A conditional random field (CRF)-based sequence classifier was implemented as a traditional slot classifier. In addition, an RNN-based and an RNN+CRF-based sequence classifier were implemented as deep learning methods. Bidirectional LSTMs were used to build the simple RNN-based classifier, and an RNN+CRF-based network was implemented by placing a CRF layer on top of the bidirectional LSTM network (Lee, 2017). Two joint NLU systems (Liu and Lane, 2016; Hakkani-Tür et al., 2016) were prepared by reusing their publicly available code. Table 4 summarizes the NLU systems we prepared and used for the re-ranking experiments.

Table 5 shows the performance of all the NLU systems, the proposed re-ranking algorithm, and the oracle. Typical choices for re-ranking NLU results are majority voting and score-based ranking. In the majority voting method, the semantic frame predicted by the largest number of NLU systems is selected. The score used for the NLU scoring method in Table 5 is the prediction probability: for the joint NLU classifiers (C10 and C11), the joint prediction probability is used, and for the combined NLU systems (C1 to C9), the product of the intent and slot prediction probabilities is used.

Table 5: NLU performance of multiple NLU systems and re-ranked results. Acc., prec., rec., and f-m stand for accuracy, precision, recall, and f-measure, respectively.
The proposed distance-based re-ranking method using the semantic vector shows superior selection performance on both the intent and slot-filling tasks. It is noteworthy that the re-ranked intent prediction performance (acc. 97.05) is close to the oracle intent performance (acc. 97.85), which is the upper bound. Compared to the baseline re-ranker (NLU score), the proposed re-ranker (cosine) achieves 33.25% and 7.07% relative error reduction on the intent prediction and slot-filling tasks, respectively.

Related Work
The task of spoken NLU consists of intent classification and domain-entity slot-filling. Traditionally, both tasks have been approached with statistical machine-learning methods (Schwartz et al., 1997; He and Young, 2005; Dietterich, 2002). More recently, with the advances in deep learning, RNN-based sequence encoding techniques have been used to detect the intent or utterance type (Ravuri and Stolcke, 2015), and RNN-based neural architectures have been employed for slot-filling (Mesnil et al., 2013, 2015). Combinations of CRFs and neural networks have also been explored by Xu and Sarikaya (2013).
Recent works have focused on enriching the representations used by neural NLU architectures. For example, Chen et al. leveraged substructure embeddings for joint semantic frame parsing. Kim et al. (2016) utilized several semantic lexicons, such as WordNet, PPDB, and the Macmillan dictionary, to enrich word embeddings, which were then used as the initial word representations for intent detection.
Previous NLU works have used statistical modeling for the intent and slot-filling tasks and for input representation, but none has represented both the text and the semantic frame in vector form simultaneously. To the best of our knowledge, this is the first presentation of a method for learning a distributed semantic vector for both text and semantic frame, together with its applications in NLU research.
In the general natural language processing literature, many studies have learned vector representations of raw text. Mikolov et al. (2013), Pennington et al. (2014), and Collobert et al. (2011) proposed word-to-vector techniques, while Mueller and Thyagarajan (2016) and Le and Mikolov (2014) introduced embedding methods at the sentence and document levels. Some of these works have shown that certain semantic information, such as analogy, antonymy, and gender, can be captured in the vector space.
Further, many structured-text-to-vector techniques have been introduced recently. Preller (2014) introduced a logic formula embedding method, while Bordes et al. (2013) and Do et al. (2018) proposed translation-based embeddings of symbolic structured knowledge such as WordNet and Freebase.
We herein introduce a novel semantic frame embedding method that performs raw-text-to-vector and structured-text-to-vector encoding simultaneously in a single framework, learning semantic representations more directly. In this framework, the text and the semantic frame are each projected into a vector space, and the distance loss between the vectors is minimized to satisfy embedding correspondence. Our research goes a step further to guarantee that the learned vector indeed keeps the semantic information, by checking that the symbolic semantic frame can be reconstructed from the vector.
In learning parameters by minimizing vector distances, this work is similar to a Siamese convolutional neural network (Chopra et al., 2005; Mueller and Thyagarajan, 2016) or an autoencoder (Hinton and Salakhutdinov, 2006); however, in our work the weights are neither shared nor transposed.

Conclusion
In this study, we have proposed a new method to learn a correspondence embedding model for NLU. To learn a valid and meaningful distributed semantic representation, two properties, embedding correspondence and reconstruction, are considered. By minimizing the distance between the semantic vectors output by the text reader and the semantic frame reader, semantically equivalent vectors are placed very close together in the vector space. In addition, reconstruction consistency from a semantic vector to the symbolic semantic frame was jointly enforced to prevent the method from learning trivial degenerate mappings (e.g., mapping everything to zero vectors).
Through various experiments on the ATIS2 dataset, we confirmed that the learned semantic vectors indeed contain semantic information. Semantic vector visualization and the results of similar text and semantic frame search showed that semantically similar instances are located near each other in the vector space. In addition, using the learned semantic vectors, re-ranking of multiple NLU systems can be implemented without further learning by comparing the semantic vectors of the text and the semantic frame.
Based on the results of this research, various future research directions can be considered. Semantic operations or an algebra on the vector space would be a promising research topic. Furthermore, with enough training data and appropriate modifications to our method, a text reconstruction constraint could be pursued, making it possible to generate text directly from a semantic vector, a setting that somewhat resembles neural machine translation.