Multi-Task Label Embedding for Text Classification

Multi-task learning in text classification leverages implicit correlations among related tasks to extract common features and yield performance gains. However, most previous works treat the labels of each task as independent and meaningless one-hot vectors, which causes a loss of potential label information and makes it difficult for these models to jointly learn three or more tasks. In this paper, we propose Multi-Task Label Embedding to convert labels in text classification into semantic vectors, thereby turning the original tasks into vector matching tasks. We implement unsupervised, supervised and semi-supervised models of Multi-Task Label Embedding, all of which utilize semantic correlations among tasks and are particularly convenient to scale and transfer as more tasks are involved. Extensive experiments on five benchmark datasets for text classification show that our models can effectively improve the performance of related tasks by combining semantic representations of labels with additional information from each other.


Introduction
Text classification is a common Natural Language Processing task that tries to infer the most appropriate label for a given sentence or document, for example, in sentiment analysis, topic classification and so on. With the development of Deep Learning (Bengio, Courville, and Vincent 2013), many neural network based models have been proposed in a large body of literature and achieved inspiring performance gains on various text classification tasks. These models require little manual feature engineering and can represent word sequences as fixed-length vectors with rich semantic information, which makes them notably well suited to subsequent NLP tasks.
Due to the numerous parameters to train, neural network based models rely heavily on adequate amounts of annotated corpora, which cannot always be guaranteed, as constructing large-scale high-quality labeled datasets is extremely time-consuming and labor-intensive. Multi-Task Learning solves this problem by jointly training multiple related tasks and leveraging potential correlations among them to implicitly increase corpora size, extract common features and yield classification improvements. Inspired by (Caruana 1997), many works are dedicated to multi-task learning with neural network based models (Collobert and Weston 2008; Liu et al. 2015b; Liu, Qiu, and Huang 2016a; Liu, Qiu, and Huang 2016b; Zhang et al. 2017). These models usually contain a pre-trained lookup layer that maps words into dense, low-dimensional, real-valued vectors with semantic implications, which is known as Word Embedding (Mikolov et al. 2013b), and utilize some lower layers to capture common features that are further fed to follow-up task-specific layers. However, most existing models have the following three disadvantages:
• Lack of Label Information. The labels of each task are represented by independent and meaningless one-hot vectors, for example, positive and negative in sentiment analysis encoded as [1, 0] and [0, 1], which may cause a loss of potential label information.
• Incapable of Scaling. Network structures are elaborately designed to model various correlations for multi-task learning, but most of them are structurally fixed and can only deal with interactions between two tasks, namely pair-wise interactions. When new tasks are introduced, the network structures have to be modified and the whole networks have to be trained again.
• Incapable of Transferring. As human beings, we can handle a completely new task without any further effort after learning several related tasks, a capability known as Transfer Learning (Ling et al. 2008). As discussed above, the network structures of most previous models are fixed, thus not compatible with and failing to tackle new tasks.
In this paper, we propose Multi-Task Label Embedding (MTLE) to map the labels of each task into semantic vectors as well, just as Word Embedding represents word sequences, thereby converting the original text classification tasks into vector matching tasks. Based on MTLE, we implement unsupervised, supervised and semi-supervised multi-task learning models for text classification, all utilizing semantic correlations among tasks and effectively solving the problems of scaling and transferring when new tasks are involved.
We conduct extensive experiments on five benchmark datasets for text classification. Compared to learning separately, jointly learning multiple related tasks based on MTLE demonstrates significant performance gains for each task.
Our contributions are fourfold:
• Our models efficiently leverage the potential label information of each task by mapping labels into dense, low-dimensional, real-valued vectors with semantic implications.
• It is particularly convenient for our models to scale when new tasks are involved. The network structures need no modifications and only data from the new tasks require training.
• After training on several related tasks, our models can also naturally transfer to deal with completely new tasks without any additional training, while still achieving appreciable performances.
• We consider different scenarios of multi-task learning and demonstrate strong results on several benchmark datasets for text classification. Our models outperform most state-of-the-art baselines.

Single-Task Learning
In a supervised text classification task, the input is a word sequence denoted by x = {x_1, x_2, ..., x_T} and the output is the class label y or its one-hot representation y. A pre-trained lookup layer is used to get the embedding vector x_t ∈ R^d for each word x_t. A text classification model f is trained to produce the predicted distribution ŷ for each input sequence, and the training objective is to minimize the total cross-entropy over all samples:

    L = -\sum_{i=1}^{N} \sum_{j=1}^{C} y_i^j \log(\hat{y}_i^j)
where N denotes the number of training samples and C is the class number.
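As a concrete sketch of this objective (a minimal plain-Python illustration, not the paper's implementation), the cross-entropy of a one-hot target against a predicted distribution can be computed and summed over samples:

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

def total_loss(targets, predictions):
    """Total cross-entropy over all N training samples."""
    return sum(cross_entropy(t, p) for t, p in zip(targets, predictions))

# A confident correct prediction yields a small loss; a wrong one a large loss.
loss_good = cross_entropy([1, 0], [0.9, 0.1])
loss_bad = cross_entropy([1, 0], [0.1, 0.9])
```

Minimizing this quantity over all N samples and C classes is exactly the single-task objective above.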

Multi-Task Learning
Given K supervised text classification tasks, T_1, T_2, ..., T_K, a multi-task learning model F is trained to transform each input sequence x^(k) from task T_k into its predicted distribution ŷ^(k), where only ŷ^(k) is used for loss computation. The overall training loss is a weighted linear combination of the costs for each task:

    \Phi = \sum_{k=1}^{K} \lambda_k \left( -\sum_{i=1}^{N_k} \sum_{j=1}^{C_k} y_i^{j(k)} \log(\hat{y}_i^{j(k)}) \right)
where λ k , N k and C k denote the linear weight, the number of samples and the class number for each task T k respectively.
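The weighted combination itself is a one-liner; the sketch below (illustrative names, not the paper's code) shows how per-task costs are merged into the overall objective:

```python
def multitask_loss(task_losses, weights):
    """Overall cost: weighted linear combination sum_k lambda_k * L_k."""
    assert len(task_losses) == len(weights)
    return sum(lam * loss for lam, loss in zip(weights, task_losses))

# Two tasks: the lambdas let larger or more important tasks dominate.
phi = multitask_loss([1.0, 2.0], [0.5, 0.25])
```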

Three Perspectives of Multi-Task Learning
Text classification tasks can differ in the characteristics of the input word sequence x or the output label y. There are many benchmark datasets for text classification, and three different perspectives of multi-task learning can be distinguished.
• Multi-Cardinality Tasks are similar apart from cardinalities, for example, movie review datasets with different average sequence lengths and class numbers.
• Multi-Domain Tasks differ in the domains of their corpora, for example, product review datasets on books, DVDs, electronics and kitchen appliances.
• Multi-Objective Tasks are targeted at different objectives, for example, sentiment analysis, topic classification and question type judgment.
The simplest multi-task learning scenario is that all tasks share the same cardinality, domain and objective, and merely come from different sources. On the contrary, when tasks vary in cardinality, domain and even objective, the correlations and interactions among them can be quite complicated and implicit. When implementing multi-task learning, both the model used and the tasks involved have significant influence on the possible performance gains for each task. We will further investigate the scaling and transferring capabilities of MTLE in different scenarios in the Experiment section.

Methodology
Neural network based models have attracted substantial interest in many NLP tasks for their capability to represent variable-length word sequences as fixed-length vectors, for example, Neural Bag-of-Words (NBOW), Recurrent Neural Networks (RNN), Recursive Neural Networks (RecNN) and Convolutional Neural Networks (CNN). These models mostly first map sequences of words, n-grams or other semantic units into embedding representations with a pre-trained lookup layer, then comprehend the vector sequences with neural networks of different structures and mechanisms, and finally utilize a softmax layer to predict the categorical distribution for a specific text classification task. An RNN absorbs input vectors one by one in a recurrent manner, which resembles the way human beings understand texts and makes RNN notably suitable for NLP tasks.

Recurrent Neural Network
RNN maintains an internal hidden state vector h_t that is recurrently updated by a transition function f. At each time step t, the hidden state h_t is updated according to the current input vector x_t and the previous hidden state h_{t-1}:

    h_t = f(x_t, h_{t-1})
where f is usually a composition of an element-wise nonlinearity with an affine transformation of both x t and h t−1 .
In this way, RNN can accept a word sequence of arbitrary length and produce a fixed-length vector, which is fed to a softmax layer for text classification or other NLP tasks. However, the gradients of f may grow or decay exponentially over long sequences during training, namely the gradient exploding or vanishing problems, which hinder RNN from effectively learning long-term dependencies and correlations. (Hochreiter and Schmidhuber 1997) proposed the Long Short-Term Memory Network (LSTM) to solve the above problems. Besides the internal hidden state h_t, LSTM also maintains an internal memory cell and three gating mechanisms. While there are numerous variants of the standard LSTM, in this paper we follow the implementation of (Graves 2013). At each time step t, the states of the LSTM can be fully described by five vectors in R^m, an input gate i_t, a forget gate f_t, an output gate o_t, the hidden state h_t and the memory cell c_t, which adhere to the following transition equations:

    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)
    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)
    c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)
    h_t = o_t \odot \tanh(c_t)
where x_t is the current input, σ denotes the logistic sigmoid function and ⊙ denotes element-wise multiplication. By strictly controlling how to accept x_t and the portions of c_t to update, forget and expose at each time step, LSTM can better understand long-term dependencies according to the labels of the whole sequences.
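For concreteness, the LSTM transition described above can be sketched in scalar form (hidden size m = 1) in plain Python; the weight layout is simplified for illustration and the values are arbitrary:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM transition: gates i, f, o, candidate g, then c_t and h_t.
    W maps gate name -> (w_x, w_h, b); scalar states for clarity (m = 1)."""
    def gate(name, act):
        w_x, w_h, b = W[name]
        return act(w_x * x_t + w_h * h_prev + b)
    i = gate("i", sigmoid)      # input gate: how much candidate to accept
    f = gate("f", sigmoid)      # forget gate: how much old memory to keep
    o = gate("o", sigmoid)      # output gate: how much memory to expose
    g = gate("g", math.tanh)    # candidate memory content
    c = f * c_prev + i * g      # update the memory cell
    h = o * math.tanh(c)        # expose the new hidden state
    return h, c

# Toy weights: every gate uses (w_x, w_h, b) = (0.5, 0.5, 0.0).
W = {k: (0.5, 0.5, 0.0) for k in "ifog"}
h, c = lstm_step(1.0, 0.0, 0.0, W)
```

Iterating `lstm_step` over a sequence of inputs yields the final hidden state, which serves as the fixed-length sequence representation.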

Multi-Task Label Embedding
The labels of text classification tasks are made up of word sequences as well, for example, positive and negative in binary sentiment classification, or very positive, positive, neutral, negative and very negative in 5-categorical sentiment classification. Inspired by Word Embedding, we propose Multi-Task Label Embedding (MTLE) to convert the labels of each task into dense, low-dimensional, real-valued vectors with semantic implications, thereby disclosing potential intra-task and inter-task label correlations. Figure 1 illustrates the general idea of MTLE for text classification, which mainly consists of three parts, the Input Encoder, the Label Encoder and the Matcher.
In the Input Encoder, each input sequence x^(k) = {x_1^(k), x_2^(k), ..., x_T^(k)} from task T_k is mapped into a sequence of embedding vectors by the Lookup Layer (Lu_I). The Learning Layer (Le_I) is applied to recurrently comprehend x^(k) and generate a fixed-length vector X^(k), which can be regarded as an overall representation of the original input sequence x^(k).
In the Label Encoder, the labels of each task are mapped and learned to produce fixed-length representations as well. There are C_k labels in T_k, namely y_j^(k) (1 ≤ j ≤ C_k). Each label is also a word sequence, for example, very positive, and is mapped into a sequence of embedding vectors by the Lookup Layer (Lu_L) and then encoded into a fixed-length vector Y_j^(k) by the Learning Layer (Le_L). In order to achieve the classification task for a sample x^(k) from T_k, the Matcher obtains the corresponding X^(k) from the Input Encoder and all Y_j^(k) (1 ≤ j ≤ C_k) from the Label Encoder, and then conducts vector matching to select the most appropriate class label.
Based on the idea of MTLE, we implement unsupervised, supervised and semi-supervised models to investigate and explore different possibilities of multi-task learning in text classification.

Model-I: Unsupervised
Suppose that for each task T_k, we only have the N_k input sequences and the C_k classification labels, but lack the specific annotation linking each input sequence to its corresponding label. In this case, we can only implement MTLE in an unsupervised manner.
Word Embedding (Mikolov et al. 2013b) leverages the contextual features of words and trains them into semantic vectors, so that words sharing synonymous meanings result in vectors of similar values. In the unsupervised model, we utilize all available input sequences and classification labels as the whole corpus and train an embedding model E_unsup (Mikolov et al. 2013a) that covers the contextual features of the different tasks. The embedding model is employed as both Lu_I and Lu_L.
We realize Le_I and Le_L simply by summing up the vectors in a sequence and calculating the average, since we do not have any supervised annotations. After obtaining X^(k) for each input sample and all Y_j^(k) for a certain task T_k, we apply unsupervised vector matching methods D(X^(k), Y_j^(k)), for example, Cosine Similarity or L2 Distance, to select the most appropriate Y_j^(k) for each X^(k). In conclusion, the unsupervised model of MTLE exploits contextual and semantic information from both the input sequences and the classification labels. Model-I may fail to achieve fully satisfactory performances due to its reliance on so many unsupervised methods, but it can still provide useful insights when no annotations are available at all.
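Model-I's averaging-and-matching pipeline can be sketched as follows (a toy lookup table and two-dimensional embeddings, purely for illustration):

```python
import math

def avg_embedding(tokens, lookup, dim):
    """Le_I / Le_L in Model-I: average the word vectors of a sequence."""
    vecs = [lookup[t] for t in tokens if t in lookup]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def cosine(u, v):
    """Cosine Similarity, one choice for the matching function D."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def match(x_vec, label_vecs):
    """Pick the label whose averaged embedding is closest to the input's."""
    return max(range(len(label_vecs)), key=lambda j: cosine(x_vec, label_vecs[j]))

# Hypothetical pre-trained embeddings standing in for E_unsup.
lookup = {"good": [1.0, 0.0], "bad": [0.0, 1.0],
          "great": [0.9, 0.1], "movie": [0.5, 0.5]}
X = avg_embedding(["great", "movie"], lookup, 2)
label_vecs = [avg_embedding(["good"], lookup, 2),
              avg_embedding(["bad"], lookup, 2)]
best = match(X, label_vecs)
```

Here the input "great movie" is matched against the labels "good" and "bad" with no supervision at all, relying only on the geometry of the embeddings.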

Model-II: Supervised
Given the specific annotations for each input sequence and its corresponding label, we can better train the Input Encoder and the Label Encoder in a supervised manner.
Lu_I and Lu_L are both fully connected layers with weights W_I and W_L, which are |V| × d matrices, where |V| denotes the vocabulary size and d is the embedding size. We can utilize the E_unsup obtained in Model-I or other pre-trained lookup tables to initialize W_I and W_L and further tune their weights during training.
Le_I and Le_L should be trainable models that can transform a vector sequence of arbitrary length into a fixed-length vector. We apply the implementation of (Graves 2013) and denote them by LSTM_I and LSTM_L with hidden size m. More sophisticated sequence learning models could also be tried, but in this paper we mainly focus on the idea and effects of MTLE, so we choose a common one for implementation and devote more effort to exploring MTLE itself.
We utilize another fully connected layer of size 2m × 1, denoted by M_{2m×1}, to realize the Matcher, which accepts outputs from Le_I and Le_L to produce a matching score. Given the matching scores for each label, we apply the idea of cross-entropy and calculate the loss function for a sample x^(k) from T_k as follows:

    s_j^{(k)} = M(X^{(k)} \oplus Y_j^{(k)}), \quad 1 \le j \le C_k
    L^{(k)}(x^{(k)}) = -\sum_{j=1}^{C_k} \tilde{y}_j^{(k)} \log\big(\mathrm{softmax}(s^{(k)})_j\big)
where ⊕ denotes vector concatenation and ỹ^(k) is the true label in one-hot representation for x^(k). The overall training objective is to minimize the weighted linear combination of costs for samples from all tasks:

    \Phi = \sum_{k=1}^{K} \lambda_k \sum_{i=1}^{N_k} L^{(k)}(x_i^{(k)})
where λ k and N k denote the linear weight and the number of samples for each task T k as explained in Eq.(4). The network structure of the supervised model for MTLE is illustrated in Figure 2.
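The Matcher's scoring and loss computation can be sketched in plain Python (a minimal scalar illustration; the weight values and dimensions are hypothetical, with m = 1 so the concatenated vector has length 4 here only because the toy vectors are 2-dimensional):

```python
import math

def matcher_scores(X, labels_Y, M_w, M_b):
    """Score each label: concatenate X with Y_j and apply the linear layer M."""
    scores = []
    for Y in labels_Y:
        z = X + Y  # list concatenation plays the role of the ⊕ operator
        scores.append(sum(w * v for w, v in zip(M_w, z)) + M_b)
    return scores

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matching_loss(X, labels_Y, true_idx, M_w, M_b, eps=1e-12):
    """Cross-entropy over the softmax of the matching scores."""
    probs = softmax(matcher_scores(X, labels_Y, M_w, M_b))
    return -math.log(probs[true_idx] + eps)

# Toy encoded input and label vectors, plus hand-picked Matcher weights.
X = [1.0, 0.0]
labels_Y = [[1.0, 0.0], [0.0, 1.0]]
M_w, M_b = [1.0, -1.0, 1.0, -1.0], 0.0
loss_true = matching_loss(X, labels_Y, 0, M_w, M_b)
loss_wrong = matching_loss(X, labels_Y, 1, M_w, M_b)
```

During training, gradients of this loss flow back through the Matcher into both encoders, which is what lets the label representations and input representations improve jointly.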
Model-II provides a simple and intuitive way to realize multi-task learning, where input sequences and classification labels from different tasks are jointly learned and compactly fused. During training, Lu_I and Lu_L develop a better understanding of word semantics across different tasks, while Le_I and Le_L obtain stronger capabilities of sequence representation. When new tasks are involved, it is extremely convenient for Model-II to scale, as the whole network structure needs no modifications. We can continue training Model-II and further tune the parameters on samples from the new tasks, which we define as Hot Update, or re-train Model-II from scratch on samples from all tasks, which is defined as Cold Update. We investigate the performances of these two scaling methods in detail in the Experiment section.

Model-III: Semi-Supervised
As human beings, we can handle a completely new task without any further effort and achieve appreciable performances after learning several related tasks, which we describe as the capability to transfer.
We propose Model-III for semi-supervised learning based on MTLE. The only difference between Model-II and Model-III is the way they deal with new tasks, annotated or not. If the new tasks are provided with annotations, we can choose to apply the Hot Update or Cold Update of Model-II. If the new tasks are completely unlabeled, we can still employ Model-II for vector mapping and find the best label for each input sequence without any further training, which we define as Zero Update. To avoid confusion, we specifically use Model-III to denote the cases where annotations of the new tasks are unavailable and only Zero Update is applicable, which corresponds to the transferring and semi-supervised learning capability of human beings. The differences among Hot Update, Cold Update and Zero Update are illustrated in Figure 3, where Before Update denotes the model trained on the old tasks before the new tasks are introduced. We further investigate these three updating methods in the Experiment section.
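Zero Update amounts to freezing the trained encoders and scoring the labels of the unseen task with no gradient steps at all. A minimal sketch (the encoder and lookup table below are toy stand-ins for the trained Le_I/Le_L and Lu layers, not the actual model):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def zero_update_predict(encode, score, x_tokens, new_labels):
    """Zero Update: classify a sample from a completely new task by reusing
    the frozen, already-trained encoders; no parameter updates happen."""
    X = encode(x_tokens)
    label_vecs = [encode(lbl) for lbl in new_labels]
    scores = [score(X, Y) for Y in label_vecs]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical pre-trained embeddings; averaging stands in for the encoder.
lookup = {"fun": [1.0, 0.0], "dull": [0.0, 1.0],
          "positive": [1.0, 0.0], "negative": [0.0, 1.0]}
def encode(tokens):
    return [sum(lookup[t][d] for t in tokens) / len(tokens) for d in (0, 1)]

# The labels "positive"/"negative" belong to a task never seen in training.
pred = zero_update_predict(encode, dot, ["fun"], [["positive"], ["negative"]])
```

Because the label set is an input to the model rather than a fixed output dimension, a new task only changes the data fed in, not the network shape, which is what makes this form of transfer possible.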

Experiment
In this section, we design extensive experiments on multi-task learning based on five benchmark datasets for text classification. We investigate the empirical performances of our models and compare them to existing state-of-the-art baselines.

Datasets
As Table 1 shows, we select five benchmark datasets for text classification and design three experiment scenarios to evaluate the performances of Model-I and Model-II.
• Multi-Cardinality Movie review datasets with different average sequence lengths and class numbers, including SST-1, SST-2 and IMDB.
• Multi-Domain Product review datasets on different domains, including Books, DVDs, Electronics and Kitchen.
• Multi-Objective Text classification datasets with different objectives, including IMDB, RN (Apté, Damerau, and Weiss 1994) and QC (Li and Roth 2002).

Hyperparameters and Training
Training of Model-II is conducted through back propagation with stochastic gradient descent (Amari 1993). For each iteration, we randomly select one task and choose an untrained batch from it, then calculate the gradient and update the parameters accordingly. All parameters of the neural layers are randomly initialized from a truncated normal distribution with zero mean and a fixed standard deviation. We apply 10-fold cross-validation and investigate different combinations of hyperparameters, the best of which is described in Table 2.

Results of Model-I and Model-II
We compare the performances of Model-I and Model-II with the implementation of (Graves 2013), as shown in Table 3. It is expected that Model-I falls behind (Graves 2013), as no annotations are available at all. However, with contextual information from both sequences and labels, Model-I still achieves considerable margins over random choice. Model-I performs better on tasks with shorter sequences, for example, SST-1 and SST-2, as it is difficult for unsupervised methods to learn long-term dependencies.
Model-II obtains significant performance gains with label information and additional correlations from related tasks. Multi-Domain, Multi-Cardinality and Multi-Objective benefit from MTLE with average improvements of 5.8%, 3.1% and 1.7%, as they contain increasingly weaker relevance among tasks. The result of Model-II for IMDB in Multi-Cardinality is slightly better than that in Multi-Objective (91.3 against 90.9), as SST-1 and SST-2 share more semantically useful information with IMDB than RN and QC.

Scaling and Transferring Capability of MTLE
In order to investigate the scaling and transferring capabilities of MTLE, we use A + B → C to denote the case where Model-II is trained on tasks A and B, while C is the newly involved one. We design three cases based on different scenarios and compare the influences of Hot Update, Cold Update and Zero Update on each task:
• Case 1 SST-1 + SST-2 → IMDB.
• Case 2 Books + DVDs + Electronics → Kitchen.
• Case 3 RN + QC → IMDB.
where in Zero Update, we ignore the training set of C and directly utilize the test set for evaluations.
As Table 4 shows, Before Update denotes the model trained on the old tasks before the new tasks are involved, so only evaluations on the old tasks are conducted, which outperform the Single Task in Table 3 by 3.1% on average. Zero Update provides inspiring possibilities for completely unlabeled tasks. There are no more annotations available for additional training from the new tasks, so we can only employ the models of Before Update for evaluations on the new tasks. Zero Update achieves competitive performances in Case 1 (89.9 for IMDB) and Case 2 (86.3 for Kitchen), as tasks from these two cases all belong to sentiment datasets of different cardinalities or domains that contain rich semantic correlations with each other. However, the result for IMDB in Case 3 is only 74.2, as sentiment shares less relevance with topic classification and question type judgment, thus resulting in poor transferring performances.

Multi-Task or Label Embedding
MTLE mainly consists of two parts, label embedding and multi-task learning, so both implicit information from labels and potential correlations from other tasks make differences. In this section, we conduct experiments to explore the respective contributions of label embedding and multi-task learning.
We choose the four tasks from the Multi-Domain scenario and train Model-II on each task separately. In this case, their performances are influenced only by label embedding. Then we re-train Model-II from scratch on every pair and every triple of these tasks, and record the performances of each task in the different cases, where both label embedding and multi-task learning matter.
The results are illustrated in Figure 4, where B, D, E, K are short for Books, DVDs, Electronics and Kitchen. The first three graphs show the results of Model-II trained on single tasks, pairs of tasks and triples of tasks. In the first graph, the four tasks are trained separately and achieve improvements of 3.2%, 3.3%, 3.5% and 2.5% respectively compared to the baseline (Graves 2013). As more tasks are involved step by step, Model-II produces increasing performance gains for each task and achieves an average improvement of 5.9% when all four tasks are trained together. It can therefore be concluded that label information as well as correlations from other tasks account for considerable parts of the contributions, and MTLE integrates both of them with the capabilities of scaling and transferring.
In the last graph, diagonal cells denote the improvements of single tasks, while off-diagonal cells denote the average improvements of pairs of tasks, so an off-diagonal cell of darker color indicates stronger correlations between the corresponding two tasks. An interesting finding is that Books is more related to DVDs and Electronics is more relevant to Kitchen. A possible reason may be that Books and DVDs are products targeted at reading or watching, while customers care more about appearance and functionality when talking about Electronics and Kitchen.

Comparisons with State-of-the-art Models
We compare Model-II against the following state-of-the-art models:
• NBOW Neural Bag-of-Words that sums up the embedding vectors of all words and applies a non-linearity followed by a softmax layer.
• PV Paragraph Vectors followed by logistic regression (Le and Mikolov 2014).
• MT-CNN Multi-task learning with Convolutional Neural Networks (Collobert and Weston 2008), where lookup tables are partially shared.
• MT-DNN Multi-task learning with Deep Neural Networks (Liu et al. 2015b) that utilizes bag-of-words representations and a shared hidden layer.
• DSM Deep multi-task learning with Shared Memory (Liu, Qiu, and Huang 2016a), where an external memory and a reading/writing mechanism are introduced.
• GRNN Gated Recursive Neural Network for sentence modeling and text classification (Chen et al. 2015).
As Table 5 shows, MTLE achieves competitive or better performances on all tasks except QC, as it contains fewer correlations with the other tasks. PV slightly surpasses MTLE on IMDB (91.7 against 91.3), as sentences from IMDB are much longer than those from SST and MDSD, which requires stronger capabilities of long-term dependency learning. In this paper, we mainly focus on the idea and effects of integrating label embedding with multi-task learning, so we simply apply (Graves 2013) to realize Le_I and Le_L, which could be replaced by other more effective sentence learning models (Liu et al. 2015a; Chen et al. 2015) to produce better performances.

Related Work
There is a large body of literature related to multi-task learning with neural networks in NLP (Collobert and Weston 2008; Liu et al. 2015b; Liu, Qiu, and Huang 2016a; Liu, Qiu, and Huang 2016b; Zhang et al. 2017). (Collobert and Weston 2008) utilizes a shared lookup layer for common features, followed by task-specific layers for several traditional NLP tasks including part-of-speech tagging and semantic parsing. They use a fixed-size window to handle variable-length input sequences, a problem that can be better addressed by RNN. (Liu et al. 2015b; Liu, Qiu, and Huang 2016a; Liu, Qiu, and Huang 2016b; Zhang et al. 2017) all investigate multi-task learning for text classification. (Liu et al. 2015b) applies bag-of-words representations, so information about word order is lost. (Liu, Qiu, and Huang 2016a) introduces an external memory for information sharing with a reading/writing mechanism for communication. (Liu, Qiu, and Huang 2016b) proposes three different models for multi-task learning with RNN, and (Zhang et al. 2017) constructs a generalized architecture for RNN based multi-task learning. However, the models in these papers ignore essential label information and mostly address only pair-wise interactions between two tasks. Their network structures are also fixed, thereby failing to scale or transfer when new tasks are involved.
Different from the above works, our models map the labels of text classification tasks into semantic vectors and provide a more intuitive way to realize multi-task learning with the capabilities of scaling and transferring. Input sequences from three or more tasks are jointly learned together with their labels, benefiting from each other and obtaining better sequence representations.

Conclusion
In this paper, we propose Multi-Task Label Embedding to map labels of text classification tasks into semantic vectors. Based on MTLE, we implement unsupervised, supervised and semi-supervised models to facilitate multi-task learning, all utilizing semantic correlations among tasks and effectively solving the problems of scaling and transferring when new tasks are involved. We explore three different scenarios of multi-task learning and our models can improve performances of most tasks with additional related information from others in all scenarios.
In future work, we would like to explore quantifications of task correlations and generalize MTLE to address other NLP tasks, for example, sequence labeling and sequence-to-sequence learning.