Deep LSTM based Feature Mapping for Query Classification

Traditional convolutional neural network (CNN) based query classification uses linear feature mapping in its convolution operation. The recurrent neural network (RNN) differs from a CNN in that it represents a word sequence with its ordering information kept explicitly. We propose using a deep long-short-term-memory (DLSTM) based feature mapping to learn feature representations for the CNN. The DLSTM, which is a stack of LSTM units, captures a different order of feature representation at each depth. The bottom LSTM unit, equipped with input and output gates, extracts the first order feature representation from the current word. To extract higher order nonlinear feature representations, the LSTM unit at a higher position takes two inputs: the lower LSTM unit’s memory cell from the previous word, and the lower LSTM unit’s hidden output from the current word. In this way, the DLSTM captures the nonlinear nonconsecutive interaction within n-grams. Using an architecture that combines a stack of DLSTM layers with a traditional CNN layer, we have observed new state-of-the-art query classification accuracy on benchmark data sets for query classification.


Introduction
Convolutional neural networks (CNNs) have achieved significant improvements for query classification. CNNs capture the correlations of spatial or temporal structures with different resolutions using their temporal convolution operators. A pooling strategy on these local correlations extracts invariant regularities.
However, CNNs use simple linear operations on n-gram vectors that are formed by concatenating word vectors. The linear operation together with the concatenation may not be sufficient to model the nonconsecutive dependencies and interactions within n-grams. For example, in the query "not a total loss", the nonconsecutive dependency "not loss" is the key information, and it is not well captured by a linear operation on a simple concatenation.
In this paper, we propose to use deep long-short-term-memory (DLSTM) based feature mapping to capture high order nonlinear feature representations. The LSTM (Hochreiter and Schmidhuber, 1997) is a type of recurrent neural network (RNN) that has achieved remarkable performance in natural language processing and speech recognition (Sutskever et al., 2014; Graves et al., 2013).
The DLSTM is a stack of LSTM units in which units at different depths capture different orders of nonlinear feature representation. The bottom LSTM unit extracts the first order feature representation from the current word. An LSTM unit at a higher position captures a higher order feature representation from the outputs of the LSTM unit below it, specifically the memory cell of the lower LSTM unit at the previous word position and the hidden output of the lower LSTM unit at the current word position. Using the DLSTM, the linear feature mapping in a traditional CNN is extended to a nonlinear feature mapping. Moreover, the memory cell together with the different gates in the LSTM unit can model nonconsecutive feature interaction and context-dependent information decay. For example, in the query "not so good", the proposed DLSTM is expected to keep the information of "not" and "good" in the memory, and to decay the information about "so" via the forget gates.
Similar to CNNs, where multiple convolution operations are used, we propose to stack different DLSTM feature mappings together to model multiple levels of nonlinear feature representation. The bottom DLSTM layer takes the original word sequence as input. Each DLSTM layer feeds its output to the adjacent higher DLSTM layer. In the proposed models, the concatenation of the multiple level feature representations is further reduced by a pooling operation. The prediction output is finally made based on the reduced feature representation.
We evaluated the proposed method on three benchmark data sets: the Stanford Sentiment Treebank dataset (Socher et al., 2013), the TREC (Text Retrieval Conference) question type classification data set (Li and Roth, 2002) and the ATIS (Airline Travel Information Systems) dataset (Hemphill et al., 1990). On the Stanford Sentiment Treebank dataset, our model obtains 51.9% accuracy on fine-grained classification and 88.7% accuracy on binary classification. On the TREC question type classification dataset, the SVM based method, which uses a large number of engineered features, outperforms LSTM and RNN based methods. The DLSTM outperforms the other neural network based methods without using engineered features. On the ATIS data, the DLSTM achieves a 97.9% F1 score, which is better than the previous best F1 score of 95.6% obtained under the same data settings.
Due to their superior ability to memorize long distance dependencies, LSTMs have been applied to extract sentence-level continuous representations (Ravuri and Stolcke, 2015a; Tang et al., 2015; Tai et al., 2015). When an LSTM is applied to model a sentence, the memory cell at the ending word of the sentence carries the information of the whole sentence. The LSTM hidden vector at the ending word is used directly as the sentence feature representation in (Ravuri and Stolcke, 2015a). Alternatively, a sentence is represented by the average of the LSTM hidden vectors of its words (Tang et al., 2015). Inspired by recursive neural networks (Socher et al., 2011a), the LSTM has further been combined with a tree structure to model sentence representation (Tai et al., 2015).
CNNs were originally developed for image processing (Lecun et al., 1998). They were first applied to natural language processing tasks by Collobert et al. (2008), using a max-over-time pooling method to aggregate convolution layer vectors. CNNs have also been applied to spoken language understanding (Shi et al., 2015b), information retrieval (Shen et al., 2014) and semantic parsing (Yih et al., 2015). Kalchbrenner et al. (2014) proposed to extend the CNN max-over-time pooling to k-max pooling for sentence modeling. Remarkable query classification performance on different benchmark datasets has been achieved by integrating CNNs with different feature mapping channels and pre-trained word vectors (Zhang and Wallace, 2015; Kim, 2014). Recently, Mou et al. (2015) proposed to model sentences by tree structured CNNs.
CNNs and LSTMs are complementary in their modeling capabilities: CNNs are good at capturing local invariant regularities, and LSTMs are good at modeling temporal features. The combination of CNNs and LSTMs achieves improved performance in speech recognition (Sainath et al., 2015) and query classification (Tang et al., 2015; Zhou et al., 2015). In these models, the basic architecture is an LSTM that models the sequence representation from local features captured by CNNs. Different from the above methods, our method uses LSTM units to model the nonlinear and nonconsecutive local features. CNNs are placed on top of these local features for query classification. Our motivation is to use the LSTM to replace the linear feature mapping in the convolution operation, where the feature mapping is a multiplication of the word vectors with a filter matrix. Our proposed model is therefore still a CNN based model, but one that uses the DLSTM as the feature mapping for the convolution operation.
Our work is closely related to tensor product based CNNs that expand the CNN feature representation capacity with nonconsecutive n-grams. They improve query modeling in two respects. Firstly, tensor products enable nonlinear feature vector interactions between adjacent words. Secondly, an exponentially decaying weight is applied to represent nonconsecutive n-gram features. Instead of using tensor products as the feature mapping, we propose to apply the DLSTM to address these two aspects. Nonlinear feature mapping is achieved by the DLSTM, which is equipped with nonlinear activation functions. The nonconsecutive feature interaction is handled by the memory cell and the different gates in the LSTM unit. In particular, the forget gate is able to decay the information according to the context, rather than by the fixed decaying weight used in tensor product based CNNs.

Linear Feature Mapping in CNN
Let the $k$-dimensional vector $x_t \in \mathbb{R}^{k}$ be the continuous feature representation of the $t$th word in a sentence. A sentence with $l$ words is represented by $x_{0:l-1} = [x_0; x_1; \ldots; x_{l-1}]$, the concatenation of all its word vectors. The traditional CNN (Collobert et al., 2011; Kim, 2014) takes this sentence feature vector as input. Different filters $M_j \in \mathbb{R}^{nk \times h}$ are applied in the convolution operation to map each n-gram feature vector $x_{t:t+n-1}$, $t \in [0, l-n]$, to an $h$-dimensional feature vector $c_{t,j}$:
$$c_{t,j} = M_j^{\top} x_{t:t+n-1} + b_j \qquad (1)$$
where $b_j$ is the bias in filter $j$.
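The linear mapping described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `linear_ngram_features`, the toy sizes and the random inputs are hypothetical, and a single filter is shown rather than a bank of filters.

```python
import numpy as np

def linear_ngram_features(X, M, b, n):
    """Map every n-gram of a sentence to an h-dimensional vector.

    X: (l, k) word vectors, M: (n*k, h) filter matrix, b: (h,) bias.
    Returns an array of shape (l - n + 1, h).
    """
    l, k = X.shape
    feats = []
    for t in range(l - n + 1):
        x_ngram = X[t:t + n].reshape(-1)   # concatenate n consecutive word vectors
        feats.append(M.T @ x_ngram + b)    # linear map plus bias
    return np.stack(feats)

# Toy sizes (hypothetical): 5 words, 4-dim word vectors, trigrams, 6 features.
rng = np.random.default_rng(0)
l, k, n, h = 5, 4, 3, 6
X = rng.normal(size=(l, k))
M = rng.normal(size=(n * k, h))
b = np.zeros(h)
C = linear_ngram_features(X, M, b, n)
print(C.shape)  # (3, 6)
```

Because each output is a single matrix product over a concatenation, a word in the middle of the n-gram contributes additively and cannot modulate how the other words are combined, which is the limitation the text goes on to discuss.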
The resulting feature vectors $c_{t,j}$ are often passed through nonlinear element-wise transformations (e.g. the hyperbolic tangent and the rectified linear unit) as well as pooling operations. After aggregation or reduction by a pooling operation such as max-over-time pooling (Collobert et al., 2011; Kim, 2014) or average pooling, a feature vector of constant dimension is generated for sentences of various lengths.
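As a small illustration of how pooling removes the dependence on sentence length, the sketch below applies max-over-time and average pooling to two feature sequences of different lengths. The helper names and the toy values are assumptions made for the example only.

```python
import numpy as np

def max_over_time(C):
    """Keep, for each feature coordinate, its maximum over all positions."""
    return C.max(axis=0)

def average_pool(C):
    """Average each feature coordinate over all positions."""
    return C.mean(axis=0)

# Two toy feature sequences with 2 and 4 positions, 2 features each.
short = np.array([[0.1, -0.5], [0.7, 0.2]])
long = np.array([[0.3, 0.0], [-0.2, 0.9], [0.4, 0.1], [0.0, -0.3]])

for C in (short, long):
    activated = np.tanh(C)  # element-wise nonlinearity before pooling
    print(max_over_time(activated).shape, average_pool(activated).shape)
```

Both sequences yield vectors of the same dimension, so a single classifier can be applied regardless of query length.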
In traditional CNNs, the concatenated word vectors are mapped linearly to feature coordinates as shown in Equation (1). Such linear feature mapping can be improved in two respects. One is to extend the linear mapping to a nonlinear mapping. The other is to extend the consecutive feature mapping to a nonconsecutive feature mapping. For example, in the query "not a total loss", "not loss" is the key sentiment. With a nonconsecutive feature representation, the information about "not loss" can be captured. Tensor product based CNNs extend the linear feature mapping to a tensor based feature mapping. To model nonconsecutive n-grams, they apply a decaying weight to control the information carryover. In this paper, we propose to replace the linear feature mapping with a DLSTM that captures the nonlinear and nonconsecutive feature interaction within n-grams. Rather than setting a fixed decaying weight, the proposed architecture is able to control the information decay according to the context.

Feature Mapping Based on Deep Long Short Term Memory
Figure 1 gives the basic architecture of a three-order nonlinear feature mapping in the DLSTM. The bottom unit $\mathrm{LSTM}_0$ extracts the first order information from the word input vector $x_t$. It is equipped with an input gate and an output gate. The input gate automatically controls the information saved in the memory cell, which will be passed to the higher order LSTM unit. The output gate modifies the information from the memory cell to represent the current word.
On top of the bottom $\mathrm{LSTM}_0$ unit, we analogously stack two LSTM units, $\mathrm{LSTM}_1$ and $\mathrm{LSTM}_2$, to extract nonlinear feature representations from bigrams and trigrams, respectively. The $\mathrm{LSTM}_j$ is formulated as follows:

Due to the effect of the different gates that control the saving, expression and decay of information, $\mathrm{LSTM}_1$ and $\mathrm{LSTM}_2$ are able to model the nonconsecutive interaction in n-grams. Take "not so good" as an example. $\mathrm{LSTM}_0$ extracts the nonlinear feature mapping from the word "good" as $h_{0,2}$. $\mathrm{LSTM}_1$ takes $c_{0,1}$ (which carries the information from the word "so") and $h_{0,2}$ as input. Due to the effect of the forget gate, we expect the output $h_{1,2}$ from $\mathrm{LSTM}_1$ to put more weight on the word "good" than on "so". By further stacking $\mathrm{LSTM}_2$, the information about the words "not" and "good" should be emphasized by the proposed DLSTM. Note that the sum of the resulting outputs from these LSTM units is used as the high order feature representation of an n-gram ending with word $x_t$. The original sequence input $x_{0:l-1}$ is thus mapped to a sequence of feature vectors $z$.

The proposed DLSTM architecture is characterized by the following two features:

2. Memory Cell Interaction: To model the nonlinear feature interaction in n-gram vectors, the traditional LSTM unit is modified by Equation (10), in which the memory cell stores the interaction of memory cells of different orders. In this way, the feature interaction in n-grams is characterized by the memory cell interactions.
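The data flow described above can be sketched as follows. This is a rough illustration only, not the paper's formulation: the gate equations below are standard LSTM-style gates chosen as an assumption, the word vectors and hidden states are assumed to share one dimension `d` so that all units can share the same weight matrices, biases are omitted, and the memory cells before the first word are assumed to be zero. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dlstm_layer(X, W_i, W_f, W_o, W_g, order=3):
    """Illustrative DLSTM-style mapping with assumed gate equations.

    X: (l, d) word vectors. Returns Z: (l, d), where Z[t] is the sum of
    the hidden outputs of all `order` stacked units at position t.
    """
    l, d = X.shape
    # c_prev[j] holds the memory cell of unit j at the previous word
    # (assumed to be zero before the first word).
    c_prev = [np.zeros(d) for _ in range(order)]
    Z = np.zeros((l, d))
    for t in range(l):
        c_now = []
        u = X[t]  # the bottom unit reads the current word
        for j in range(order):
            i = sigmoid(W_i @ u)   # input gate
            o = sigmoid(W_o @ u)   # output gate
            g = np.tanh(W_g @ u)   # candidate content
            if j == 0:
                c = i * g          # bottom unit: input and output gates only
            else:
                f = sigmoid(W_f @ u)            # forget gate
                c = f * c_prev[j - 1] + i * g   # lower unit's cell from the previous word
            h = o * np.tanh(c)
            Z[t] += h              # sum of hidden outputs over depths
            c_now.append(c)
            u = h                  # the unit above reads this unit's hidden output
        c_prev = c_now
    return Z

rng = np.random.default_rng(0)
l, d = 5, 4
X = rng.normal(size=(l, d))
W_i, W_f, W_o, W_g = (rng.normal(scale=0.5, size=(d, d)) for _ in range(4))
Z = dlstm_layer(X, W_i, W_f, W_o, W_g)
print(Z.shape)  # (5, 4)
```

With three stacked units, each output in this sketch depends only on the current word and the two words before it, so it behaves like a trigram feature map in which the gates, rather than fixed weights, decide how much of each earlier word is kept.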
To stack LSTM units deeper, the depth-gated LSTM and the highway network (Srivastava et al., 2015) also allow the memory cell to flow across LSTM units at different depths. There are three basic differences between these architectures and the proposed DLSTM. Firstly, in their architectures, the LSTM units at different depths are different LSTMs with different weight matrices. In our model, the LSTM units in a DLSTM share weight matrices with each other. Secondly, in their architectures, the memory cell is carried over to the higher LSTM unit to facilitate model training, because network training becomes more difficult with increasing model depth. In our DLSTM, the LSTM unit at a higher position takes the memory cell from the lower LSTM unit mainly for feature interaction in n-grams. Finally, an additional "depth" gate is applied in their architectures to control the information flow across different layers. In our model, the input gate in the higher LSTM unit controls the interaction between the memory cells extracted from the previous word and the current word.

Figure 3 gives the whole architecture of the proposed query classification system. A DLSTM layer first maps the input sequence to a sequence of high order nonlinear feature representations $z^0$. Instead of being used directly for query classification, the feature representation $z^0$ is further processed by a stack of DLSTM layers of the kind illustrated in the previous section. In these stacked DLSTM layers, the output $z^i$ of the $i$th DLSTM layer is used as the input of the $(i+1)$th DLSTM layer, which is parameterized by a different set of weight matrices. As shown in Figure 3, the resulting feature representations $z^0, z^1, \ldots, z^d$ of all these layers are concatenated. Finally, average pooling is applied to reduce the sentence feature representation to a vector of fixed dimension, which is fed to a softmax function to obtain the prediction output.
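The overall pipeline (stack layers, concatenate their outputs, average-pool, apply softmax) can be sketched as below. To keep the example self-contained, each DLSTM layer is replaced by a simple stand-in, `np.tanh(z @ W)`; this stand-in, the function names and the toy sizes are assumptions for illustration and are not the DLSTM itself.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())  # subtract the max for numerical stability
    return e / e.sum()

def classify(X, layer_weights, W_out, b_out):
    """Stack layers, concatenate all layer outputs, pool, then softmax."""
    feats = []
    z = X
    for W in layer_weights:
        z = np.tanh(z @ W)   # stand-in for one DLSTM layer (assumption)
        feats.append(z)      # keep the output of every layer
    concat = np.concatenate(feats, axis=1)  # (l, num_layers * d)
    pooled = concat.mean(axis=0)            # average pooling over positions
    return softmax(W_out @ pooled + b_out)

rng = np.random.default_rng(0)
d, m, depth = 4, 3, 3  # feature size, number of classes, number of layers
layer_weights = [rng.normal(size=(d, d)) for _ in range(depth)]
W_out = rng.normal(size=(m, depth * d))
b_out = np.zeros(m)

for l in (3, 7):  # two sentences of different lengths
    y = classify(rng.normal(size=(l, d)), layer_weights, W_out, b_out)
    print(y.shape, round(float(y.sum()), 6))
```

Each layer has its own weight matrix here, mirroring the statement that stacked DLSTM layers are parameterized by different sets of weights.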

Learning and Regularization
In the classification layer, the prediction output $y$ is obtained by a softmax function, where $y$ is an $m$-dimensional vector. The model is trained by minimizing the cross-entropy on the given training data set. To avoid overfitting during training, L2 regularization and dropout (Hinton et al., 2012) are used. The L2 regularization constrains all weight matrices using the same regularization weight. Dropout is applied only to the output of each DLSTM layer.
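A minimal sketch of this training objective is given below, assuming a standard form in which the prediction is `softmax(W z + b)` for a pooled feature vector `z`. The symbols `W`, `b` and `z`, the inverted-dropout variant, the dropout rate and the regularization weight are all assumptions for illustration; here dropout is applied to a single feature vector as a stand-in for a DLSTM layer output.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def dropout(z, rate, rng):
    """Inverted dropout (assumed variant): zero some units, rescale the rest."""
    if rate == 0.0:
        return z
    mask = rng.random(z.shape) >= rate
    return z * mask / (1.0 - rate)

def regularized_loss(z, label, W, b, l2_weight):
    """Cross-entropy of the softmax prediction plus an L2 penalty on W."""
    y = softmax(W @ z + b)
    cross_entropy = -np.log(y[label])
    l2_penalty = l2_weight * np.sum(W ** 2)
    return cross_entropy + l2_penalty

rng = np.random.default_rng(0)
m, D = 5, 8  # number of classes, feature dimension (toy values)
z = dropout(rng.normal(size=D), rate=0.5, rng=rng)
W = np.zeros((m, D))
b = np.zeros(m)
loss = regularized_loss(z, label=2, W=W, b=b, l2_weight=1e-4)
print(np.isclose(loss, np.log(m)))  # True: zero weights give a uniform prediction
```

At test time dropout would be switched off (rate 0), which the inverted variant allows without any rescaling of the weights.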
During training, the model weights are updated using mini-batch stochastic gradient descent (SGD). We adopt a per-feature learning rate control method (AdaGrad) (Duchi et al., 2011) to tune the learning rate dynamically. Here $\alpha_{t,i}$ denotes the learning rate for weight $i$ at epoch $t$, and $\sum_{j=1}^{t} g_{j,i}^{2}$ sums the squares of all the historical gradients of weight $i$. A small positive $\varepsilon$ is applied to make AdaGrad robust; $\varepsilon$ is usually set to 1e−5.
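The per-weight learning rate can be illustrated with a standard AdaGrad step. This is a sketch under assumptions: the base learning rate `base_lr` is a hypothetical value not given in the text, and placing epsilon inside the square root is one common convention that may differ from the form used by the authors.

```python
import numpy as np

def adagrad_step(w, grad, hist, base_lr=0.01, eps=1e-5):
    """One AdaGrad update.

    hist accumulates squared gradients per weight, so weights with large
    past gradients receive smaller learning rates.
    """
    hist = hist + grad ** 2
    lr = base_lr / np.sqrt(hist + eps)  # per-weight learning rate
    return w - lr * grad, hist

w = np.array([1.0, 1.0])
hist = np.zeros(2)
grad = np.array([1.0, 0.0])
w, hist = adagrad_step(w, grad, hist)
print(w, hist)
```

A weight whose gradient is zero is left unchanged, while the accumulated history for the other weight shrinks its future step size.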

Datasets
We evaluate the proposed query classification models on sentence sentiment classification, question type categorization and query intent detection tasks.
For sentence sentiment classification, the Stanford Sentiment Treebank (Socher et al., 2013) is used. In this dataset, 11855 English sentences are annotated at both the sentence level and the phrase level with fine-grained labels (very positive, positive, neutral, negative and very negative). We use the provided data split, which has 8544 sentences for training, 1101 sentences for development and 2210 sentences for testing. This dataset also provides a binary classification variant that ignores the neutral sentences. The binary classification task has 6920 sentences for training, 872 sentences for development and 1821 sentences for testing. In total, there are 17835 unique running words in the fine-grained dataset and 16185 in the binary version.
For query intent detection, the ATIS (airline travel information system) dataset (Hemphill et al., 1990; Yao et al., 2014b) is used. This dataset mainly covers the air travel domain, with 26 different intents such as "flight", "ground service" and "city". There are 893 utterances for testing (ATIS-III, Nov93 and Dec94) and 4978 utterances for training (the rest of ATIS-III and ATIS-II). There are 899 unique running words and 22 intents in the training data.

The question type classification task is to classify a question into a specific type, which is a very important step in a question answering system. In the TREC (Text Retrieval Conference) data (Li and Roth, 2002), all questions are divided into 6 categories: "human", "entity", "location", "description", "abbreviation" and "numeric". The dataset has 5952 questions in total, 5452 of them for training and the rest for testing. The vocabulary size of the TREC dataset is 9592.
Following previous work (Iyyer et al., 2015; Tai et al., 2015), we used word vectors pre-trained on large unannotated corpora to achieve better generalization. In this paper, we used publicly available 300-dimensional GloVe word vectors trained on Common Crawl with 840B tokens and a vocabulary of 2.2M words.
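One way pre-trained vectors might be loaded into an embedding matrix is sketched below. It assumes the plain-text GloVe layout of one word followed by its space-separated values per line; the toy 3-dimensional vectors, the function name `load_vectors` and the small uniform initialization for words missing from the file are illustrative assumptions rather than details given in the text.

```python
import io
import numpy as np

def load_vectors(handle, vocab, dim, rng):
    """Build a (len(vocab), dim) embedding matrix from GloVe-style text lines.

    Words absent from the file keep a small random initialization
    (an assumption made for this sketch).
    """
    index = {word: idx for idx, word in enumerate(vocab)}
    emb = rng.uniform(-0.05, 0.05, size=(len(vocab), dim))
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in index and len(values) == dim:
            emb[index[word]] = [float(v) for v in values]
    return emb

# Toy in-memory "file" with 3-dimensional vectors (real GloVe vectors have 300).
toy_file = io.StringIO("not 0.1 0.2 0.3\ngood 0.4 0.5 0.6\n")
vocab = ["not", "so", "good"]
emb = load_vectors(toy_file, vocab, dim=3, rng=np.random.default_rng(0))
print(emb.shape)  # (3, 3)
```

Restricting the lookup to the task vocabulary keeps the matrix small even when the pre-trained file contains millions of words.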

Settings
We implemented our model using the Theano library (Bastien et al., 2012). All our models were trained on an Nvidia Tesla K40m.
We performed extensive hyperparameter selection on the validation data of the binary version of the Stanford Sentiment Treebank. The selected hyperparameters were used directly for all datasets. To investigate the robustness of the proposed method, we ran each configuration 10 times with different random initializations (random seeds ranging from 1 to 10).

Table 2: Query classification accuracy (%) on ATIS.

Model                                Acc
discriminative (Tur et al., 2010)    95.5
SVM (Shi et al., 2015b)              95.6
joint-RNN (Shi et al., 2015b)        95.2
ours                                 97.9

The top block of Table 1 shows that traditional methods, such as an SVM using n-gram features and a neural network using bag-of-words features (Nbow), perform much worse than Para-vec and DAN, which use word vectors pre-trained on a large amount of unlabeled data. Para-vec builds a logistic regression on top of paragraph vectors. DAN is a deep neural network that takes the average of word vectors as input.
In addition to pre-trained word vectors, syntactic compositional information can be used to improve sentiment classification accuracy. RAE is a tree structured autoencoder model based on word vectors pre-trained on Wikipedia. MVRNN further improves the recursive neural network by assigning each node a matrix to learn the meaning change of neighboring words and phrases. To address the large number of different vectors and matrices involved in MVRNN, RNTN proposed to use one single tensor based function to model all nodes. By making tree structured recursive neural networks deeper, DRNN achieved a significant improvement. To our knowledge, the best model based on compositional information is RLSTM, which combines the LSTM unit with a tree structure.
Comparing the classification accuracy of the second and third blocks, we see that CNN based models in general perform better than recursive neural network based methods. Another advantage of CNN based methods is that they can be generalized to any language without depending on compositional information. DCNN uses a dynamic k-max pooling operator in the CNN. To exploit both task specific word vectors and general word vectors pre-trained on a large news dataset, CNN-MC equips the CNN with two feature mapping channels. CNN-nostatic gives the results obtained by using only the general word vectors. The best published classification results are achieved by TCNN, a tensor based CNN. The method proposed in this paper is closely related to TCNN. Instead of using tensor products to replace the linear convolution operation, our method obtains the nonlinear feature mapping through the DLSTM. Rather than setting a specific decaying weight to model nonconsecutive n-gram features as in the tensor based CNN, the different gates automatically adjust how information is stored, removed and output according to the context.
Following the work on TCNN, to leverage the phrase level annotation in the Stanford Sentiment Treebank, all phrases and their corresponding labels are added to the training data as additional sequences. The bottom line of Table 1 shows that our models achieved state-of-the-art performance on the sentiment classification task.

Results on ATIS
The ATIS dataset is widely used to test spoken language understanding systems. As shown in Table 2, an SVM using n-grams performs better than simple RNN and CNN based approaches. joint-RNN is a model that jointly trains query classification and slot filling, in which a CNN is applied on top of a slot tagging RNN for query classification. In this way, joint-RNN implicitly makes use of slot tag information for query classification. However, joint-RNN does not take advantage of word vectors trained on a large amount of unlabeled data. Based on pre-trained word vectors, our models obtain more than 2% absolute improvement in classification accuracy over the best published model.
In the ATIS data, about 70% of the queries are categorized as the "flight" intent. Recent work using RNNs for utterance classification (Ravuri and Stolcke, 2015b; Ravuri and Stolcke, 2015a) simplifies the task to a "flight" vs. "others" binary classification. Using a word based LSTM, they achieve 97.55% classification accuracy. By using extra named entity features, their word based gated RNN obtains 98.42% classification accuracy.

Results on TREC Question Type Classification
Table 3 compares the TREC question type classification accuracy of our models with that of other baseline models. Different from the sentiment classification task, the shallow models using diverse engineered features perform better than CNN and LSTM based models.
The previous best classification results on the TREC data are achieved by an SVM using unigrams, bigrams, wh-words, head words, POS tags, hypernyms, WordNet synsets and a set of hand-coded rules. AdaSent is a self adaptive hierarchical sentence model based on gating networks with level pooling. As shown in Table 3, CNN and LSTM achieve similar performance on question type classification. Recently, CLSTM achieved a substantial improvement over previous neural network based methods. In CLSTM, a CNN is used to extract high level phrase representations, and these local segment representations are fed into an LSTM to model the whole sequence. Different from CLSTM, which is an LSTM based sequence model with a CNN for local feature extraction, our model is a CNN based model that uses the DLSTM for nonlinear feature mapping. Our model outperforms previous neural network based models without relying on task specific feature engineering.

Deep Architecture
One critical hyperparameter of the proposed method is the number of DLSTM layers. On the binary sentiment classification task, we ran our model 10 times with different random initializations, keeping all hyperparameters the same except the number of DLSTM layers. As observed from Figure 3, better performance is achieved by deeper architectures. Our model achieves the best classification result by stacking 3 DLSTM layers, which in effect uses 9 different LSTM units to extract nonlinear features from n-grams.

Figure 4 shows some examples and their sentiments as predicted by our model trained on the fine-grained classification data. To see how the nonlinear feature mapping captures the sentiment at each word position in the query, we follow the strategy in which the softmax function is applied directly to the concatenated feature mapping without passing through the average pooling layer. The sentiment distribution $p_t$ at the $t$th word is therefore computed as $p_t = \mathrm{softmax}(W^{\top}[z^0_t; z^1_t; \ldots; z^d_t])$. The expected value over this probability distribution, $\sum_{s=-2}^{2} s \cdot p_t(s)$, is used as the sentiment score plotted in Figure 4. In the figure, the sentiment score ranges from −2 to 2, where −2 means very negative, 2 means very positive and 0 means neutral.
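The per-word sentiment score described above can be sketched as an expectation over the five sentiment classes. The function names, the toy dimensions and the random inputs below are assumptions for illustration; the rows of `Z_concat` stand in for the concatenated layer outputs at each word position.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def sentiment_scores(Z_concat, W):
    """Expected sentiment in [-2, 2] at every word position.

    Z_concat: (l, D) concatenated features per word; W: (D, 5) classifier.
    """
    levels = np.arange(-2, 3)  # very negative ... very positive
    scores = []
    for z_t in Z_concat:
        p_t = softmax(W.T @ z_t)             # class distribution at this word
        scores.append(float(levels @ p_t))   # expected value over the classes
    return np.array(scores)

rng = np.random.default_rng(0)
l, D = 4, 6
Z_concat = rng.normal(size=(l, D))
W = rng.normal(size=(D, 5))
scores = sentiment_scores(Z_concat, W)
print(scores.shape)  # (4,)
```

Skipping the pooling layer is what makes one score per word possible, since pooling would otherwise collapse all positions into a single vector.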

Examples
Five examples are illustrated in the figure. The first row gives synthetic examples which show that our model is able to model the nonconsecutive interaction within n-grams. For example, in the query "hardly to be bad", even though the word "hardly" does not directly modify the word "bad", our model is still able to capture the sentiment change.
The second row of the figure shows examples from the fine-grained classification test data. Both examples show that our model can, to some degree, capture the sentiment of satire. In the last example in particular, our model gives a negative prediction even though no word in the query is negative by itself.

Conclusions
We have proposed a deep long-short-term-memory (DLSTM) nonlinear nonconsecutive feature mapping architecture to replace the traditional linear mapping in convolutional neural network based query classification. Each LSTM unit in the DLSTM is responsible for capturing a different order of feature representation from word segments. The bottom LSTM unit, equipped with an input gate and an output gate, extracts the nonlinear feature from a unigram. A higher LSTM unit in the DLSTM takes the outputs of the lower LSTM units as input. In this way, the higher LSTM unit is able to capture nonlinear feature representations of higher order n-grams. The sum of the outputs of the different LSTM units is used as the output of the DLSTM layer. Rather than being used directly as input to the convolutional neural network for query classification, the DLSTM output is passed through a stack of DLSTM layers. The query is finally represented by the concatenation of the outputs of the stacked DLSTM layers.
We evaluated the proposed models on three benchmark datasets: the Stanford Sentiment Treebank dataset, the TREC dataset and the ATIS dataset. On both the sentiment classification dataset and the ATIS dataset, our model achieved state-of-the-art performance. On TREC question type classification, the SVM based model using extra engineered features still performed better than our model, but the proposed method outperformed all the other neural network based approaches.