Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval

Deep neural network (DNN) methods have recently demonstrated superior performance on a number of natural language processing tasks. However, in most previous work, the models are learned based on either unsupervised objectives, which do not directly optimize the desired task, or single-task supervised objectives, which often suffer from insufficient training data. We develop a multi-task DNN for learning representations across multiple tasks, not only leveraging large amounts of cross-task data, but also benefiting from a regularization effect that leads to more general representations to help tasks in new domains. Our multi-task DNN approach combines tasks of multiple-domain classification (for query classification) and information retrieval (ranking for web search), and demonstrates significant gains over strong baselines in a comprehensive set of domain adaptation experiments.


Introduction
Recent advances in deep neural networks (DNNs) have demonstrated the importance of learning vector-space representations of text, e.g., words and sentences, for a number of natural language processing tasks. For example, the study reported in (Collobert et al., 2011) demonstrated significant accuracy gains in tagging, named entity recognition, and semantic role labeling when using vector space word representations learned from large corpora. Further, since these representations are usually in a low-dimensional vector space, they result in more compact models than those built from surface-form features. A recent successful example is the parser by (Chen and Manning, 2014), which is not only accurate but also fast.
* This research was conducted during the author's internship at Microsoft Research.
However, existing vector-space representation learning methods are far from optimal. Most previous methods are based on unsupervised objectives such as word prediction for training (Mikolov et al., 2013c; Pennington et al., 2014). Other methods use supervised training objectives on a single task, e.g., (Socher et al., 2013), and thus are often constrained by limited amounts of training data. Motivated by the success of multi-task learning (Caruana, 1997), we propose in this paper a multi-task DNN approach for representation learning that leverages supervised data from many tasks. In addition to the benefit of having more data for training, multi-task learning also profits from a regularization effect, i.e., it reduces overfitting to a specific task, thus making the learned representations universal across tasks.
Our contributions are twofold. First, we propose a multi-task deep neural network for representation learning, in particular focusing on semantic classification (query classification) and semantic information retrieval (ranking for web search) tasks. Our model learns to map arbitrary text queries and documents into semantic vector representations in a low-dimensional latent space. While the general concept of multi-task neural nets is not new, our model is novel in that it successfully combines tasks as disparate as classification and ranking.
Second, we demonstrate strong results on query classification and web search. Our multi-task representation learning consistently outperforms state-of-the-art baselines. Meanwhile, we show that our model is not only compact but also enables agile deployment into new domains. This is because the learned representations allow domain adaptation with substantially fewer in-domain labels.
2 Multi-Task Representation Learning

Preliminaries
Our multi-task model combines classification and ranking tasks. For concreteness, throughout this paper we will use query classification as the classification task and web search as the ranking task. These are important tasks in commercial search engines:
Query Classification: Given a search query Q, the model decides, in a binary fashion, whether Q belongs to a given domain of interest. For example, if the query Q is "Denver sushi", the classifier should decide that it belongs to the "Restaurant" domain. Accurate query classification enables a richer personalized user experience, since the search engine can tailor the interface and results. It is, however, challenging because queries tend to be short (Shen et al., 2006). Surface-form word features that are common in traditional document classification problems tend to be too sparse for query classification, so representation learning is a promising solution. In this study, we classify queries into four domains of interest: "Restaurant", "Hotel", "Flight", and "Nightlife". Because one query can belong to multiple domains, we build a set of binary classifiers, one per domain, and frame the problem as four binary classification tasks. Thus, for domain $C_t$, our goal is binary classification based on $P(C_t \mid Q)$ with $C_t \in \{0, 1\}$. For each domain $t$, we assume supervised data $(Q, y_t)$, where $y_t \in \{0, 1\}$ is the binary label.
Web Search: Given a search query Q and a document list L, the model ranks documents in the order of relevance. For example, if the query Q is "Denver sushi", the model returns a list of documents that satisfies this information need. Formally, we estimate $P(D_1 \mid Q), P(D_2 \mid Q), \ldots$ for each document $D_n$ and rank according to these probabilities. We assume that supervised data exist; i.e., there is at least one relevant document $D_n$ for each query Q.

The Proposed Multi-Task DNN Model
Briefly, our proposed model maps any arbitrary queries Q or documents D into fixed lowdimensional vector representations using DNNs. These vectors can then be used to perform query classification or web search. In contrast to existing representation learning methods which employ either unsupervised or single-task supervised objectives, our model learns these representations using multi-task objectives.
The architecture of our multi-task DNN model is shown in Figure 1. The lower layers are shared across different tasks, whereas the top layers represent task-specific outputs. Importantly, the input X (either a query or document), initially represented as a bag of words, is mapped to a vector ($l_2$) of dimension 300. This is the shared semantic representation that is trained by our multi-task objectives. In the following, we elaborate the model in detail:
Word Hash Layer ($l_1$): Traditionally, each word is represented by a one-hot word vector, where the dimensionality of the vector is the vocabulary size. However, due to the large size of the vocabulary in real-world tasks, it is very expensive to learn such models. To alleviate this problem, we adopt the word hashing method (Huang et al., 2013). We map a one-hot word vector, with an extremely high dimensionality, into a limited letter-trigram space (e.g., with the dimensionality as low as 50k). For example, the word cat is hashed as the bag of letter trigrams {#-c-a, c-a-t, a-t-#}, where # is a boundary symbol. Word hashing complements the one-hot vector representation in two aspects: 1) out-of-vocabulary words can be represented by letter-trigram vectors; 2) spelling variations of the same word can be mapped to points that are close to each other in the letter-trigram space.
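The letter-trigram hashing described above can be illustrated with a minimal sketch. The function names `letter_trigrams` and `hash_text`, the lowercasing step, and whitespace tokenization are assumptions made for illustration; they are not taken from the paper or from (Huang et al., 2013).

```python
from collections import Counter


def letter_trigrams(word, boundary="#"):
    """Return the letter trigrams of a word padded with a boundary symbol."""
    # Lowercasing is an assumption of this sketch, not stated in the paper.
    padded = f"{boundary}{word.lower()}{boundary}"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def hash_text(text):
    """Map a whitespace-tokenized text to a bag (multiset) of letter trigrams."""
    bag = Counter()
    for word in text.split():
        bag.update(letter_trigrams(word))
    return bag


print(letter_trigrams("cat"))       # ['#ca', 'cat', 'at#']
print(hash_text("Denver sushi"))

# Spelling variants share most of their trigrams, so they land close together.
shared = set(letter_trigrams("color")) & set(letter_trigrams("colour"))
print(sorted(shared))
```

An unseen word still decomposes into trigrams that were observed in other words, which is why out-of-vocabulary words remain representable in this space.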
Semantic-Representation Layer ($l_2$): This is a shared representation learned across different tasks. This layer maps the letter-trigram inputs into a 300-dimensional vector by

$$l_2 = f(W_1 \cdot l_1) \quad (1)$$

where $f(\cdot)$ is the tanh nonlinear activation $f(z) = \frac{1 - e^{-2z}}{1 + e^{-2z}}$. This 50k-by-300 matrix $W_1$ is responsible for generating the cross-task semantic representation for arbitrary text inputs (e.g., Q or D).
Task-Specific Representation ($l_3$): For each task, a nonlinear transformation maps the 300-dimension semantic representation $l_2$ into the 128-dimension task-specific representation by

$$l_3 = f(W_2^t \cdot l_2) \quad (2)$$

where $t$ denotes different tasks (query classification or web search).
Query Classification Output: Suppose $Q^{C_1} \equiv l_3 = f(W_2^{t=C_1} \cdot l_2)$ is the 128-dimension task-specific representation for a query Q. The probability that Q belongs to class $C_1$ is predicted by a logistic regression, with sigmoid $g(z) = \frac{1}{1 + e^{-z}}$:

$$P(C_1 \mid Q) = g(W_3^{t=C_1} \cdot Q^{C_1}) \quad (3)$$

Web Search Output: For the web search task, both the query Q and the document D are mapped into 128-dimension task-specific representations $Q^{S_q}$ and $D^{S_d}$. Then, the relevance score is computed by cosine similarity as:

$$R(Q, D) = \cos(Q^{S_q}, D^{S_d}) = \frac{Q^{S_q} \cdot D^{S_d}}{\|Q^{S_q}\| \, \|D^{S_d}\|} \quad (4)$$

Figure 1: Architecture of the Multi-task Deep Neural Network (DNN) for Representation Learning: The lower layers are shared across all tasks, while top layers are task-specific. The input X (either a query or document, with vocabulary size 500k) is first represented as a bag of words, then hashed into letter 3-grams $l_1$ (50k). Non-linear projection $W_1$ generates the shared semantic representation, a vector $l_2$ (dimension 300) that is trained to capture the essential characteristics of queries and documents. Finally, for each task, additional non-linear projections $W_2^t$ generate task-specific representations $l_3$ (dimension 128), followed by operations necessary for classification or ranking.

Algorithm 1: Training of the multi-task DNN. In each iteration:
1. Pick a task $t$ randomly
2. Pick sample(s) from task $t$: $(Q, y_t \in \{0, 1\})$ for query classification; $(Q, L)$ for web search
3. Compute loss $L(\Theta)$: Eq. 5 for query classification; Eq. 6 for web search
4. Compute gradient: $\nabla(\Theta)$
5. Update model $\Theta$ using the gradient
The task $t$ is one of the query classification tasks or the web search task, as shown in Figure 1. For query classification, each training sample includes one query and its category label. For web search, each training sample includes a query and a document list.
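As a rough illustration of the forward pass in this section, the following sketch wires a shared projection into task-specific projections, a sigmoid classification output, and a cosine relevance score. It is a toy under stated assumptions, not the authors' implementation: the dimensions are shrunk from 50k/300/128 to 20/8/4, weights are random, bias terms are omitted, and the task keys `"search_query"` and `"search_doc"` are hypothetical names for the query-side and document-side web search projections.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def tanh_layer(W, x):
    """Non-linear projection f(W x) with f = tanh."""
    return [math.tanh(z) for z in matvec(W, x)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes standing in for the 50k -> 300 -> 128 dimensions of the paper.
D_IN, D_SHARED, D_TASK = 20, 8, 4
rng = random.Random(0)
W1 = rand_matrix(D_SHARED, D_IN, rng)  # shared across all tasks
W2 = {t: rand_matrix(D_TASK, D_SHARED, rng)  # one projection per task
      for t in ("Restaurant", "search_query", "search_doc")}
W3 = {"Restaurant": [rng.uniform(-0.1, 0.1) for _ in range(D_TASK)]}


def represent(l1, task):
    l2 = tanh_layer(W1, l1)          # shared semantic representation
    return tanh_layer(W2[task], l2)  # task-specific representation


def classify(l1_query, domain):
    """Probability that the query belongs to the given domain."""
    q = represent(l1_query, domain)
    return sigmoid(sum(w * v for w, v in zip(W3[domain], q)))


def relevance(l1_query, l1_doc):
    """Cosine relevance score between a query and a document."""
    return cosine(represent(l1_query, "search_query"),
                  represent(l1_doc, "search_doc"))


query = [float(rng.randint(0, 2)) for _ in range(D_IN)]  # fake trigram counts
doc = [float(rng.randint(0, 2)) for _ in range(D_IN)]
print(classify(query, "Restaurant"), relevance(query, doc))
```

Only `W1` is touched by every task in this layout, which is what lets updates from the ranking task shape the representation used by the classifiers and vice versa.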

The Training Procedure
In order to learn the parameters of our model, we use mini-batch-based stochastic gradient descent (SGD) as shown in Algorithm 1. In each iteration, a task $t$ is selected randomly, and the model is updated according to the task-specific objective. This approximately optimizes the sum of all multi-task objectives. For query classification of class $C_t$, we use the cross-entropy loss as the objective:

$$-\{\, y_t \ln P(C_t \mid Q) + (1 - y_t) \ln(1 - P(C_t \mid Q)) \,\} \quad (5)$$

where $y_t \in \{0, 1\}$ is the label and the loss is summed over all samples in the mini-batch (1024 samples in experiments).
The objective for web search used in this paper follows the pair-wise learning-to-rank paradigm outlined in (Burges et al., 2005). Given a query Q, we obtain a list of documents L that includes a clicked document $D^+$ (positive sample) and $J$ randomly sampled non-clicked documents $\{D_j^-\}_{j=1,\ldots,J}$. We then minimize the negative log likelihood of the clicked document (defined in Eq. 7) given queries across the training data

$$-\log \prod_{(Q, D^+)} P(D^+ \mid Q) \quad (6)$$

where the probability of a given document $D^+$ is computed as

$$P(D^+ \mid Q) = \frac{\exp(\gamma R(Q, D^+))}{\sum_{D' \in L} \exp(\gamma R(Q, D'))} \quad (7)$$

Here, $R(Q, D)$ is the cosine relevance score between a query and a document, and $\gamma$ is a tuning factor determined on held-out data.
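The two objectives and the random task selection can be sketched as follows. This is a minimal illustration, assuming a softmax over the clicked document and the sampled non-clicked documents, and assuming uniform task sampling (the paper says only that a task is selected randomly). The function names, the `eps` guard, and the default `gamma=1.0` are hypothetical choices for this sketch.

```python
import math
import random


def cross_entropy(p, y):
    """Binary cross-entropy for one sample: predicted probability p, label y."""
    eps = 1e-12  # guards against log(0); an implementation detail of this sketch
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))


def clicked_doc_nll(score_pos, scores_neg, gamma=1.0):
    """Negative log-likelihood of the clicked document under a softmax over
    gamma-scaled relevance scores of the clicked and non-clicked documents."""
    logits = [gamma * score_pos] + [gamma * s for s in scores_neg]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[0]


def task_schedule(tasks, iterations, seed=0):
    """Pick one task per iteration, uniformly at random (an assumption)."""
    rng = random.Random(seed)
    return [rng.choice(tasks) for _ in range(iterations)]


tasks = ["Restaurant", "Hotel", "Flight", "Nightlife", "WebSearch"]
for t in task_schedule(tasks, 5):
    if t == "WebSearch":
        loss = clicked_doc_nll(0.8, [0.1, -0.2, 0.3], gamma=10.0)
    else:
        loss = cross_entropy(0.7, 1)
    print(t, round(loss, 4))
```

Because cosine scores are bounded in [-1, 1], a larger `gamma` sharpens the softmax; with `gamma=1.0` even a perfectly ranked list leaves a sizable loss, which is one reason such a factor would be tuned on held-out data.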

An Alternative View of the Multi-Task Model
Our proposed multi-task DNN (Figure 1) can be viewed as a combination of a standard DNN for classification and a Deep Structured Semantic Model (DSSM) for ranking, shown in Figure 2. Other ways to merge the models are possible. Figure 3 shows an alternative multi-task architecture, where only the query part is shared among all tasks and the DSSM retains independent parameters for computing the document representations. This is more similar to the original DSSM. We have attempted training this model using Algorithm 1, but it achieves good results on query classification at the expense of web search. This is likely due to unbalanced updates (i.e., parameters for queries are updated more often than those for documents), which implies that the amount of sharing is an important design choice in multi-task models.
3 Experimental Evaluation

Data Sets and Evaluation Metrics
We employ large-scale, real data sets in our evaluation. See Table 1 (Järvelin and Kekäläinen, 2000).

Results on Accuracy
First, we evaluate whether our model can robustly improve performance, measured as accuracy across multiple tasks. Table 2 summarizes the AUC scores for query classification, comparing the following classifiers:
• SVM-Word: an SVM model with unigram, bigram and trigram surface-form word features.
• SVM-Letter: an SVM model with letter trigram features (i.e., $l_1$ in Figure 1 as input to SVM).
• DNN: a single-task deep neural network (Figure 2).
• MT-DNN: our multi-task proposal (Figure 1).
The results show that the proposed MT-DNN performs best in all four domains. Further, we observe:
1. MT-DNN outperforms DNN, indicating the usefulness of the multi-task objective (that includes web search) over the single-task objective of query classification.
2. Both DNN and MT-DNN outperform SVM-Letter, which initially uses the same input features ($l_1$). This indicates the importance of learning a semantic representation $l_2$ on top of these letter trigrams.
3. Both DNN and MT-DNN outperform a strong SVM-Word baseline, which has a large feature set that consists of 3 billion features.
For web search, the compared models include:
• DSSM: single-task ranking model (Figure 2)
• MT-DNN: our multi-task proposal (Figure 1)
Again, we observe that MT-DNN performs best. For example, MT-DNN achieves NDCG@1=0.334, outperforming the current state-of-the-art single-task DSSM (0.327) and classic methods like PLSA (0.308) and BM25 (0.305). This is a statistically significant improvement (p < 0.05) over DSSM and other baselines.
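For readers less familiar with the two metrics reported above, here is a minimal sketch of how AUC and NDCG@k are commonly computed. The paper does not spell out its exact formulas, so the pairwise (rank-statistic) form of AUC and the exponential gain `2**rel - 1` with a log2 discount used below are assumptions; other NDCG variants use a linear gain.

```python
import math


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as half). Quadratic in the sample count; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dcg_at_k(rels, k):
    """Discounted cumulative gain of the top-k relevance grades."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))


def ndcg_at_k(rels, k):
    """DCG@k normalized by the DCG@k of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Relevance grades listed in the order the model ranked the documents.
print(auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]))
print(ndcg_at_k([2, 3, 0, 1], k=1), ndcg_at_k([2, 3, 0, 1], k=3))
```

NDCG@1 depends only on the grade of the top-ranked document relative to the best available grade, so it is a strict test of what the ranker puts first.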
To recap, our MT-DNN robustly outperforms strong baselines across all web search and query classification tasks. Further, due to the use of larger training data (from different domains) and the regularization effect discussed in Section 1, we confirm the advantage of multi-task models over single-task ones.
Compactness is particularly important for query classification, since one may desire to add new domains after discovering new needs from the query logs of an operational system. On the other hand, it is prohibitively expensive to collect labeled training data for new domains. Very often, we have only very little training data or even no training data.
To evaluate the models using the above criteria, we perform domain adaptation experiments on query classification using the following procedure: (1) Select one query classification task $t^*$. Train MT-DNN on the remaining tasks (including the web search task) to obtain a semantic representation ($l_2$); (2) Given a fixed $l_2$, train an SVM on the training data of $t^*$, using varying amounts of labels; (3) Evaluate the AUC on the test data of $t^*$.
We compare three SVM classifiers trained using different feature representations: (1) SemanticRepresentation uses the $l_2$ features generated according to the above procedure; (2) Word3gram uses unigram, bigram and trigram word features; (3) Letter3gram uses letter trigrams. Note that Word3gram and Letter3gram correspond to SVM-Word and SVM-Letter respectively in Table 2. The AUC results for different amounts of $t^*$ training data are shown in Figure 4. In the Hotel, Flight and Restaurant domains, we observe that our semantic representation dominates the other two feature representations (Word3gram and Letter3gram) in all cases except the extremely large-data regime (more than 1 million training samples in domain $t^*$). Given sufficient labels, an SVM is able to train well on sparse Word3gram features, but for most cases SemanticRepresentation is recommended.
In a further experiment, we compare the following two DNNs using the same domain adaptation procedure: (1) DNN1: a DNN where $W_1$ is randomly initialized and parameters $W_1$, $W_2$, $W_3^{t^*}$ are trained on varying amounts of data in $t^*$; (2) DNN2: a DNN where $W_1$ is obtained from other tasks (i.e., SemanticRepresentation) and fixed, while parameters $W_2$, $W_3^{t^*}$ are trained on varying amounts of data in $t^*$. The purpose is to see whether the shared semantic representation is useful even under a DNN architecture. Figure 5 shows the AUC results of DNN1 vs. DNN2 (the curve labeled SVM denotes the same system as SemanticRepresentation in Figure 4, plotted here for reference). We observe that when the training data is extremely large (millions of samples), one does best by training all parameters from scratch (DNN1). Otherwise, one is better off using a shared semantic representation trained by multi-task objectives. Comparing DNN2 and SVM with SemanticRepresentation, we note that SVM works best for training data of several thousand samples; DNN2 works best in the medium data range.
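The domain adaptation procedure (freeze the shared representation, then train only a light classifier on the new domain with varying amounts of labels) can be sketched as below. This is an illustration under loud assumptions: a logistic regression stands in for the SVM used in the paper, accuracy stands in for AUC, `W1` is a random placeholder rather than a matrix trained on other tasks, and the data and labels are synthetic. The printed numbers therefore say nothing about the paper's results.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def shared_features(x, W1):
    """Shared representation tanh(W1 x); W1 is held fixed throughout."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W1]


def predict(w, b, f):
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)


def train_logreg(feats, labels, epochs=100, lr=0.5):
    """Fit a classifier on top of fixed features by per-sample gradient descent."""
    w, b = [0.0] * len(feats[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(feats, labels):
            g = predict(w, b, f) - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def accuracy(w, b, feats, labels):
    hits = sum((predict(w, b, f) >= 0.5) == (y == 1)
               for f, y in zip(feats, labels))
    return hits / len(labels)


rng = random.Random(0)
D_IN, D_SHARED = 10, 5
# Placeholder for a W1 learned on the other tasks; random for illustration only.
W1 = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(D_SHARED)]


def make_data(n):
    xs = [[rng.uniform(-1, 1) for _ in range(D_IN)] for _ in range(n)]
    ys = [1 if x[0] + x[1] > 0 else 0 for x in xs]  # synthetic labels for t*
    return [shared_features(x, W1) for x in xs], ys


test_f, test_y = make_data(200)
for n_labels in (10, 50, 200):  # varying amounts of in-domain labels
    train_f, train_y = make_data(n_labels)
    w, b = train_logreg(train_f, train_y)
    print(n_labels, round(accuracy(w, b, test_f, test_y), 3))
```

Since only the small top-level classifier is fit per domain, adding a domain costs a handful of parameters on top of the shared matrix, which is the sense in which the approach is compact.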

Related Work
There is a large body of work on representation learning for natural language processing, sometimes using different terminologies for similar concepts; e.g., feature generation, dimensionality reduction, and vector space models. The main motivation is similar: to abstract away from surface forms in words, sentences, or documents, in order to alleviate sparsity and approximate semantics. Traditional techniques include LSA (Deerwester et al., 1990), ESA (Gabrilovich and Markovitch, 2007), PCA (Karhunen, 1998), and non-linear kernel variants (Schölkopf et al., 1998). Recently, learning-based approaches inspired by neural networks, especially DNNs, have gained in prominence, due to their favorable performance (Huang et al., 2013; Baroni et al., 2014; Milajevs et al., 2014).
Popular methods for learning word representations include (Collobert et al., 2011; Mikolov et al., 2013c; Mnih and Kavukcuoglu, 2013; Pennington et al., 2014): all are based on unsupervised objectives of predicting words or word frequencies from raw text. End-to-end neural network models for specific tasks (e.g., parsing) often use these word representations as initialization, which are then iteratively improved by optimizing a supervised objective (e.g., parsing accuracy). Successful applications of this approach include sequence labeling (Turian et al., 2010), parsing (Chen and Manning, 2014), sentiment (Socher et al., 2013), question answering (Iyyer et al., 2014) and translation modeling (Gao et al., 2014a).
Our model takes queries and documents as input, so it learns sentence/document representations. This is currently an open research question, the challenge being how to properly model semantic compositionality of words in vector space (Huang et al., 2013; M. Baroni and Zamparelli, 2013; Socher et al., 2013). While we adopt a bag-of-words approach for practical reasons (memory and run-time), our multi-task framework is extensible to other methods for sentence/document representations, such as those based on convolutional networks (Kalchbrenner et al., 2014; Shen et al., 2014; Gao et al., 2014b), parse tree structure (Irsoy and Cardie, 2014), and run-time inference (Le and Mikolov, 2014).
The synergy between multi-task learning and neural nets is quite natural; the general idea dates back to (Caruana, 1997). The main challenge is in designing the tasks and the network structure. For example, (Collobert et al., 2011) defined part-of-speech tagging, chunking, and named entity recognition as multiple tasks in a single sequence labeler; (Bordes et al., 2012) defined multiple data sources as tasks in their relation extraction system. While conceptually similar, our model is novel in that it combines tasks as disparate as classification and ranking. Further, considering that multi-task models often exhibit mixed results (i.e., gains in some tasks but degradation in others), our accuracy improvements across all tasks are a very satisfactory result.

Conclusion
In this work, we propose a robust and practical representation learning algorithm based on multi-task objectives. Our multi-task DNN model successfully combines tasks as disparate as classification and ranking, and the experimental results demonstrate that the model consistently outperforms strong baselines in various query classification and web search tasks. Meanwhile, we demonstrate the compactness of the model and the utility of the learned query/document representation for domain adaptation.
Our model can be viewed as a general method for learning semantic representations beyond the word level. Beyond query classification and web search, we believe there are many other knowledge sources (e.g. sentiment, paraphrase) that can be incorporated either as classification or ranking tasks. A comprehensive exploration will be pursued as future work.