Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces

We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.


Introduction
Multi-task learning (MTL) and semi-supervised learning are both successful paradigms for learning in scenarios with limited labelled data and have in recent years been applied to almost all areas of NLP. Applications of MTL in NLP include, for example, partial parsing, text normalisation (Bollmann et al., 2017), neural machine translation (Luong et al., 2016), and keyphrase boundary classification (Augenstein and Søgaard, 2017). Contemporary work in MTL for NLP typically focuses on learning representations that are useful across tasks, often through hard parameter sharing of the hidden layers of neural networks (Collobert et al., 2011). If tasks share optimal hypothesis classes at the level of these representations, MTL leads to improvements (Baxter, 2000). However, while sharing hidden layers of neural networks is an effective regulariser, we potentially lose synergies between the classification functions trained to associate these representations with class labels. This paper sets out to build an architecture in which such synergies are exploited, with an application to pairwise sequence classification tasks. Doing so, we achieve a new state of the art on topic-based sentiment analysis. (The first two authors contributed equally.)
For many NLP tasks, disparate label sets are weakly correlated, e.g. part-of-speech tags correlate with dependencies (Hashimoto et al., 2017), sentiment correlates with emotion (Felbo et al., 2017; Eisner et al., 2016), etc. We thus propose to induce a joint label embedding space (visualised in Figure 2) using a Label Embedding Layer that allows us to model these relationships, which we show helps with learning.
In addition, for tasks where labels are closely related, we should be able to not only model their relationship, but also to directly estimate the corresponding label of the target task based on auxiliary predictions. To this end, we propose to train a Label Transfer Network (LTN) jointly with the model to produce pseudo-labels across tasks.
The LTN can be used to label unlabelled and auxiliary task data by utilising the 'dark knowledge' (Hinton et al., 2015) contained in auxiliary model predictions. This pseudo-labelled data is then incorporated into the model via semisupervised learning, leading to a natural combination of multi-task learning and semi-supervised learning. We additionally augment the LTN with data-specific diversity features (Ruder and Plank, 2017) that aid in learning.
Contributions Our contributions are: a) We model the relationships between labels by inducing a joint label space for multi-task learning. b) We propose a Label Transfer Network that learns to transfer labels between tasks and propose to use semi-supervised learning to leverage them for training. c) We evaluate MTL approaches on a variety of classification tasks and shed new light on settings where multi-task learning works. d) We perform an extensive ablation study of our model. e) We report state-of-the-art performance on topicbased sentiment analysis.

Related work
Learning task similarities Existing approaches for learning similarities between tasks enforce a clustering of tasks (Evgeniou et al., 2005; Jacob et al., 2009), induce a shared prior (Yu et al., 2005; Xue et al., 2007; Daumé III, 2009), or learn a grouping (Kang et al., 2011; Kumar and Daumé III, 2012). These approaches focus on homogeneous tasks and employ linear or Bayesian models. They can thus not be directly applied to our setting with tasks using disparate label sets.
Multi-task learning with neural networks Recent work in multi-task learning goes beyond hard parameter sharing (Caruana, 1993) and considers different sharing structures, e.g. sharing only at lower layers or inducing private and shared subspaces (Liu et al., 2017). These approaches, however, are not able to take into account relationships between labels that may aid in learning. Another related direction is to train on disparate annotations of the same task (Peng et al., 2017). In contrast, the different nature of our tasks requires a modelling of their label spaces.
Semi-supervised learning There exists a wide range of semi-supervised learning algorithms, e.g., self-training, co-training, tri-training, EM, and combinations thereof, several of which have also been used in NLP. Our approach is probably most closely related to an algorithm called co-forest (Li and Zhou, 2007). In co-forest, as here, each learner is improved with unlabelled instances labelled by the ensemble consisting of all the other learners. Note also that several researchers have proposed using auxiliary tasks that are unsupervised (Plank et al., 2016; Rei, 2017), which also leads to a form of semi-supervised learning.
Label transformations The idea of manually mapping between label sets or learning such a mapping to facilitate transfer is not new. Zhang et al. (2012) use distributional information to map from a language-specific tagset to a tagset used for other languages, in order to facilitate crosslingual transfer. More related to this work, Kim et al. (2015) use canonical correlation analysis to transfer between tasks with disparate label spaces. There has also been work on label transformations in the context of multi-label classification problems (Yeh et al., 2017).
Multi-task learning with disparate label spaces

Problem definition
In our multi-task learning scenario, we have access to labelled datasets for T tasks T_1, ..., T_T at training time, with a target task T_T that we particularly care about. Each task T_i has its own training dataset and a label set of size L_i.

Our base model is a deep neural network that performs classic hard parameter sharing (Caruana, 1993): it shares its parameters across tasks and has task-specific softmax output layers, which output a probability distribution p^{T_i} for task T_i according to the following equation:

p^{T_i} = softmax(W^{T_i} h + b^{T_i})

where W^{T_i} ∈ R^{L_i × h} and b^{T_i} ∈ R^{L_i} are the weight matrix and bias term of the output layer of task T_i respectively, h ∈ R^h is the jointly learned hidden representation, L_i is the number of labels for task T_i, and h is the dimensionality of h.
The MTL model is then trained to minimise the sum of the individual task losses:

L = λ_1 L_1 + ... + λ_T L_T

where L_i is the negative log-likelihood objective of task T_i and λ_i is a parameter that determines the weight of task T_i. In practice, we apply the same weight to all tasks. We show the full set-up in Figure 1a.
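As a concrete illustration, the hard-parameter-sharing set-up above can be sketched in a few lines of pure Python. The function names and toy dimensions are ours, not from the released code; a real implementation would use a neural network library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def task_output(h, W, b):
    """Task-specific softmax layer: p^{T_i} = softmax(W^{T_i} h + b^{T_i}).

    h: shared hidden representation (length d)
    W: L_i x d weight matrix; b: length-L_i bias (both task-specific).
    """
    logits = [sum(w_j * h_j for w_j, h_j in zip(row, h)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(logits)

def mtl_loss(task_predictions, weights):
    """Weighted sum of per-task negative log-likelihoods.

    task_predictions: {task: list of (distribution, gold_index) pairs}
    weights:          {task: lambda_i}
    """
    total = 0.0
    for task, preds in task_predictions.items():
        nll = -sum(math.log(p[y]) for p, y in preds) / len(preds)
        total += weights[task] * nll
    return total
```

Only the output layers (`W`, `b`) are task-specific; the representation `h` is shared, which is exactly what makes hard parameter sharing act as a regulariser.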

Label Embedding Layer
In order to learn the relationships between labels, we propose a Label Embedding Layer (LEL) that embeds the labels of all tasks in a joint space. Instead of training separate softmax output layers as above, we introduce a label compatibility function c(·, ·) that measures how similar a label with embedding l is to the hidden representation h:

c(l, h) = l · h

where · is the dot product. This is similar to the Universal Schema Latent Feature Model introduced by Riedel et al. (2013). In contrast to other models that use the dot product in the objective function, we do not have to rely on negative sampling and a hinge loss (Collobert and Weston, 2008), as negative instances (labels) are known. For efficiency purposes, we use matrix multiplication instead of a single dot product and softmax instead of sigmoid activations:

p^{T_i} = softmax(L h)

where L ∈ R^{(Σ_i L_i) × l} is the label embedding matrix for all tasks and l is the dimensionality of the label embeddings. In practice, we set l to the hidden dimensionality h. We use padding if l < h. We apply a task-specific mask to L in order to obtain a task-specific probability distribution p^{T_i}.

Figure 1: a) Multi-task learning (MTL) with hard parameter sharing and 3 tasks T_{1-3} and L_{1-3} labels per task. A shared representation h is used as input to task-specific softmax layers, which optimise cross-entropy losses L_{1-3}. b) MTL with the Label Embedding Layer (LEL) embeds task labels in a joint embedding space and uses these for prediction with a label compatibility function. c) Semi-supervised MTL with the Label Transfer Network (LTN) in addition optimises an unsupervised loss L_pseudo over pseudo-labels z^{T_T} on auxiliary/unlabelled data. The pseudo-labels z^{T_T} are produced by the LTN for the main task T_T using the concatenation of the auxiliary tasks' output label embeddings as input.
The LEL is shared across all tasks, which allows us to learn the relationships between the labels in the joint embedding space. We show MTL with the LEL in Figure 1b.
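A minimal sketch of the label compatibility function and the masked softmax over the joint label space follows. The names are illustrative; a real model would compute the matrix product L h with a tensor library rather than in loops.

```python
import math

def label_compatibility(label_emb, h):
    """c(l, h) = l . h : dot product between a label embedding and
    the hidden representation h."""
    return sum(l_j * h_j for l_j, h_j in zip(label_emb, h))

def lel_predict(L, h, task_label_indices):
    """Masked softmax over the joint label space.

    L: rows of the shared label embedding matrix (ALL tasks' labels)
    task_label_indices: indices of the labels belonging to task T_i
    Returns a probability distribution over just that task's labels.
    """
    scores = [label_compatibility(L[i], h) for i in task_label_indices]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Because every task scores its labels against the same embedding matrix `L`, gradients from all tasks shape a single joint label space, which is what lets related labels from different tasks cluster together.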

Label Transfer Network
The LEL allows us to learn the relationships between labels. In order to make use of these relationships, we would like to leverage the predictions of our auxiliary tasks to estimate a label for the target task. To this end, we introduce the Label Transfer Network (LTN). This network takes the auxiliary task outputs as input. In particular, we define the output label embedding o^{T_i} of task T_i as the sum of the task's label embeddings l_j weighted with their probability p_j^{T_i}:

o^{T_i} = Σ_{j=1}^{L_i} p_j^{T_i} l_j

The label embeddings l encode general relationships between labels, while the model's probability distribution p^{T_i} over its predictions encodes fine-grained information useful for learning (Hinton et al., 2015). The LTN is trained on labelled target task data. For each example, the corresponding output label embeddings of the auxiliary tasks are fed into a multi-layer perceptron (MLP), which is trained with a negative log-likelihood objective L_LTN to produce a pseudo-label z^{T_T} for the target task T_T:

z^{T_T} = MLP([o^{T_1}, ..., o^{T_{T-1}}])

where [·, ·] designates concatenation. The mapping of the tasks in the LTN yields another signal that can be useful for optimisation and act as a regulariser. The LTN can also be seen as a mixture-of-experts layer (Jacobs et al., 1991) where the experts are the auxiliary task models. As the label embeddings are learned jointly with the main model, the LTN is more sensitive to the relationships between labels than a separately learned mixture-of-experts model that relies only on the experts' output distributions. As such, the LTN can be directly used to produce predictions on unseen data.
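The two LTN computations above — the probability-weighted output label embeddings and the concatenated MLP input — can be sketched as follows. The `mlp` argument stands in for the trained network; all names are illustrative rather than taken from the released code.

```python
def output_label_embedding(p, label_embs):
    """o^{T_i} = sum_j p_j^{T_i} * l_j : the task's label embeddings
    weighted by the model's predicted probabilities."""
    dim = len(label_embs[0])
    o = [0.0] * dim
    for p_j, l in zip(p, label_embs):
        for k in range(dim):
            o[k] += p_j * l[k]
    return o

def ltn_pseudo_label(aux_outputs, mlp):
    """Concatenate the auxiliary tasks' output label embeddings
    [o^{T_1}, ..., o^{T_{T-1}}] and feed them to an MLP that produces
    a pseudo-label z^{T_T} for the target task."""
    x = [v for o in aux_outputs for v in o]  # concatenation
    return mlp(x)
```

Note that `output_label_embedding` uses the full softmax distribution, not the argmax, so the 'dark knowledge' in low-probability labels is preserved in the LTN's input.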

Semi-supervised MTL
The downside of the LTN is that it requires additional parameters and relies on the predictions of the auxiliary models, which impacts the runtime during testing. Instead of using the LTN for prediction directly, we can use it to provide pseudo-labels for unlabelled or auxiliary task data by utilising auxiliary predictions for semi-supervised learning.
We train the target task model on the pseudo-labelled data to minimise the squared error between the model predictions p^{T_T} and the pseudo-labels z^{T_T} produced by the LTN:

L_pseudo = ||p^{T_T} - z^{T_T}||^2

We add this loss term to the MTL loss in Equation 2. As the LTN is learned together with the MTL model, pseudo-labels produced early during training will likely not be helpful, as they are based on unreliable auxiliary predictions. For this reason, we first train the base MTL model until convergence and then augment it with the LTN. We show the full semi-supervised learning procedure in Figure 1c.
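A sketch of the resulting objective, assuming the squared-error term is summed over the pseudo-labelled examples (our reading of the set-up, not the released code):

```python
def pseudo_label_loss(p, z):
    """Squared error between the main model's prediction p^{T_T} and
    the LTN's pseudo-label z^{T_T} for one unlabelled / auxiliary example."""
    return sum((p_k - z_k) ** 2 for p_k, z_k in zip(p, z))

def semi_supervised_loss(supervised_loss, pseudo_pairs):
    """Total objective: the supervised MTL loss plus the unsupervised
    L_pseudo term over pseudo-labelled (prediction, pseudo-label) pairs."""
    return supervised_loss + sum(pseudo_label_loss(p, z)
                                 for p, z in pseudo_pairs)
```

Using squared error against a full distribution, rather than cross-entropy against a hard label, lets the soft pseudo-labels carry their uncertainty into the target model.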

Data-specific features
When there is a domain shift between the datasets of different tasks, as is common for instance when learning NER models with different label sets, the output label embeddings might not contain sufficient information to bridge the domain gap.
To mitigate this discrepancy, we augment the LTN's input with features that have been found useful for transfer learning (Ruder and Plank, 2017). In particular, we use the number of word types, type-token ratio, entropy, Simpson's index, and Rényi entropy as diversity features. We calculate each feature for each example. The features are then concatenated with the input of the LTN.
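The diversity features can be computed per example as follows. The formulations below are the standard definitions of these quantities and may differ in detail (e.g. normalisation, choice of Rényi order) from those used by Ruder and Plank (2017).

```python
import math
from collections import Counter

def diversity_features(tokens, alpha=0.5):
    """Data-specific diversity features over one example's tokens:
    number of word types, type-token ratio, Shannon entropy,
    Simpson's index, and Renyi entropy of order alpha.

    alpha=0.5 is an illustrative choice, not the paper's setting.
    """
    counts = Counter(tokens)
    n = len(tokens)
    probs = [c / n for c in counts.values()]
    num_types = len(counts)
    ttr = num_types / n                               # type-token ratio
    entropy = -sum(p * math.log(p) for p in probs)    # Shannon entropy
    simpson = sum(p * p for p in probs)               # Simpson's index
    renyi = math.log(sum(p ** alpha for p in probs)) / (1 - alpha)
    return [num_types, ttr, entropy, simpson, renyi]
```

For a uniform distribution over word types, Shannon and Rényi entropy coincide at log of the number of types, which is a quick sanity check on the implementation.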

Other multi-task improvements
Hard parameter sharing can be overly restrictive and provide a regularisation that is too heavy when jointly learning many tasks. For this reason, we propose several additional improvements that seek to alleviate this burden: We use skip-connections, which have been shown to be useful for multi-task learning in recent work. Furthermore, we add a task-specific layer before the output layer, which is useful for learning task-specific transformations of the shared representations.

Experiments
For our experiments, we evaluate on a wide range of text classification tasks. In particular, we choose pairwise classification tasks-i.e. those that condition the reading of one sequence on another sequence-as we are interested in understanding if knowledge can be transferred even for these more complex interactions. To the best of our knowledge, this is the first work on transfer learning between such pairwise sequence classification tasks. We implement all our models in Tensorflow (Abadi et al., 2016) and release the code at https://github.com/coastalcph/mtl-disparate.

Tasks and datasets
We use the following tasks and datasets for our experiments, show task statistics in Table 1, and summarise examples in Table 2: Topic-based sentiment analysis Topic-based sentiment analysis aims to estimate the sentiment of a tweet known to be about a given topic. We use the data from SemEval-2016 Task 4 Subtasks B and C (Nakov et al., 2016) for predicting on a two-point scale of positive and negative (Topic-2) and a five-point scale ranging from highly negative to highly positive (Topic-5), respectively. An example from this dataset would be to classify the tweet "No power at home, sat in the dark listening to AC/DC in the hope it'll make the electricity come back again", known to be about the topic "AC/DC", which is labelled as a positive sentiment. The evaluation metrics for Topic-2 and Topic-5 are macro-averaged recall (ρ^{PN}) and macro-averaged mean absolute error (MAE^M) respectively, which are both averaged across topics.
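For concreteness, macro-averaged recall — the basis of the ρ^{PN} metric — can be computed per topic as follows. This is a sketch of the standard definition, assuming every class occurs in the gold labels; the shared task additionally averages the score across topics.

```python
def macro_avg_recall(gold, pred, labels=("positive", "negative")):
    """Macro-averaged recall: the unweighted mean of per-class recall
    over the given label set, computed for a single topic."""
    recalls = []
    for label in labels:
        # per-class recall: fraction of gold instances of this class
        # that the model predicted correctly
        hits = [g == p for g, p in zip(gold, pred) if g == label]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / len(recalls)
```

Unlike accuracy, this metric weights both classes equally, so a model cannot score well on a skewed topic by always predicting the majority sentiment.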
Target-dependent sentiment analysis Target-dependent sentiment analysis (Target) seeks to classify the sentiment of a text's author towards an entity that occurs in the text as positive, negative, or neutral. We use the data from Dong et al. (2014). An example instance is the expression "how do you like settlers of catan for the wii?", which is labelled as neutral towards the target "wii". The evaluation metric is macro-averaged F1 (F1^M).
Aspect-based sentiment analysis Aspect-based sentiment analysis is the task of identifying whether an aspect, i.e. a particular property of an item is associated with a positive, negative, or neutral sentiment (Ruder et al., 2016). We use the data of SemEval-2016 Task 5 Subtask 1 Slot 3 (Pontiki et al., 2016) for the laptops (ABSA-L) and restaurants (ABSA-R) domains. An example is the sentence "For the price, you cannot eat this well in Manhattan", labelled as positive towards both the aspects "restaurant prices" and "food quality". The evaluation metric for both domains is accuracy (Acc).
Stance detection Stance detection (Stance) requires a model, given a text and a target entity which might not appear in the text, to predict whether the author of the text is in favour of or against the target, or whether neither inference is likely. We use the data of SemEval-2016 Task 6 Subtask B (Mohammad et al., 2016). An example from this dataset would be to predict the stance of the tweet "Be prepared - if we continue the policies of the liberal left, we will be #Greece" towards the topic "Donald Trump", labelled as "favor". The evaluation metric is the macro-averaged F1 score of the "favour" and "against" classes (F1^{FA}).

Fake news detection The goal of fake news detection in the context of the Fake News Challenge is to estimate whether the body of a news article agrees, disagrees, discusses, or is unrelated towards a headline. We use the data from the first stage of the Fake News Challenge (FNC-1). An example for this dataset is the document "Dino Ferrari hooked the whopper wels catfish, (...), which could be the biggest in the world." with the headline "Fisherman lands 19 STONE catfish which could be the biggest in the world to be hooked", labelled as "agree". The evaluation metric is accuracy (Acc).
Natural language inference Natural language inference is the task of predicting whether one sentence entails, contradicts, or is neutral towards another one. We use the Multi-Genre NLI corpus (MultiNLI) from the RepEval 2017 shared task (Nangia et al., 2017). An example instance would be the sentence pair "Fun for only children" and "Fun for adults and children", which are in a "contradiction" relationship. The evaluation metric is accuracy (Acc).

Table 3: Comparison of our best performing models on the test set against a single-task baseline and the state of the art, with task-specific metrics. *: lower is better. Bold: best. Underlined: second-best.

Base model
Our base model is the Bidirectional Encoding model, a state-of-the-art model for stance detection that conditions a bidirectional LSTM (BiLSTM) encoding of a text on the BiLSTM encoding of the target. Unlike that work, we do not pre-train word embeddings on a larger set of unlabelled in-domain text for each task, as we are mainly interested in exploring the benefit of multi-task learning for generalisation.

Training settings
We use BiLSTMs with one hidden layer of 100 dimensions, 100-dimensional randomly initialised word embeddings, and a label embedding size of 100. We train our models with RMSProp, a learning rate of 0.001, a batch size of 128, and early stopping on the validation set of the main task with a patience of 3.

Results
Our main results are shown in Table 3, with a comparison against the state of the art. We present the results of our multi-task learning network with label embeddings (MTL + LEL), multi-task learning with label transfer (MTL + LEL + LTN), and the semi-supervised extension of this model. On 7/8 tasks, at least one of our architectures is better than single-task learning; and in 4/8, all our architectures are much better than single-task learning.
The state-of-the-art systems we compare against are often highly specialised, task-dependent architectures. Our architectures, in contrast, have not been optimised to compare favourably against the state of the art, as our main objective is to develop a novel approach to multi-task learning leveraging synergies between label sets and knowledge of marginal distributions from unlabelled data. For example, we do not use pre-trained word embeddings (Palogiannidi et al., 2016; Vo and Zhang, 2015), class weighting to deal with label imbalance (Balikas and Amini, 2016), or domain-specific sentiment lexicons (Brun et al., 2016; Kumar et al., 2016). Nevertheless, our approach outperforms the state of the art on two-way topic-based sentiment analysis (Topic-2). The poor performance compared to the state of the art on FNC and MultiNLI is expected: as we alternate among the tasks during training, our model only sees a comparatively small number of examples from both corpora, which are one and two orders of magnitude larger than the other datasets. For this reason, we do not achieve good performance on these tasks as main tasks, but they are still useful as auxiliary tasks, as seen in Table 4.

Label Embeddings
Our results above show that, indeed, modelling the similarity between tasks using label embeddings sometimes leads to much better performance. Figure 2 shows why: it visualises the label embeddings of an MTL + LEL model trained on all tasks, using PCA. As we can see, similar labels are clustered together across tasks. The visualisation also provides us with a picture of which auxiliary tasks are beneficial, and to what extent we can expect synergies from multi-task learning. For instance, the notion of positive sentiment appears to be very similar across the topic-based and aspect-based tasks, while the conceptions of negative and neutral sentiment differ. In addition, we can see that the model has failed to learn a relationship between MultiNLI labels and those of other tasks, possibly accounting for its poor performance on the inference task. We did not evaluate the correlation between label embeddings and task performance, but Bjerva (2017) recently suggested that the mutual information of target and auxiliary task label sets is a good predictor of gains from multi-task learning.

Auxiliary Tasks
For each task, we show the auxiliary tasks that achieved the best performance on the development data in Table 4. In contrast to most existing work, we did not restrict ourselves to performing multi-task learning with only one auxiliary task. Indeed, we find that most often a combination of auxiliary tasks achieves the best performance. In-domain tasks are used less than we assumed; only Target is consistently used by all Twitter main tasks. In addition, tasks with a higher number of labels, e.g. Topic-5, are used more often. Such tasks provide a more fine-grained reward signal, which may help in learning representations that generalise better. Finally, tasks with large amounts of training data, such as FNC-1 and MultiNLI, are also used more often. Even if not directly related, the larger amount of training data that can be indirectly leveraged via multi-task learning may help the model focus on relevant parts of the representation space (Caruana, 1993). These observations shed additional light on when multi-task learning may be useful, going beyond existing studies.

Main task | Auxiliary tasks
Topic-2   | FNC-1, MultiNLI, Target
Topic-5   | FNC-1, MultiNLI, ABSA-L, Target
Target    | FNC-1, MultiNLI, Topic-5
Stance    | FNC-1, MultiNLI, Target
ABSA-L    | Topic-5
ABSA-R    | Topic-5, ABSA-L, Target
FNC-1     | Stance, MultiNLI, Topic-5, ABSA-R, Target
MultiNLI  | Topic-5

Table 4: Best auxiliary tasks for each main task, selected on the development data.

Ablation analysis
We now perform a detailed ablation analysis of our model, the results of which are shown in Table 5. We ablate whether to use the LEL (+ LEL), whether to use the LTN (+ LTN), whether to use the LEL output or the main model output for prediction (main model output is indicated by ", main model"), and whether to use the LTN as a regulariser or for semi-supervised learning (semi-supervised learning is indicated by "+ semi"). We further test whether to use diversity features ("- diversity feats") and whether to use main model predictions for the LTN ("+ main model feats").
Overall, the addition of the Label Embedding Layer improves the performance over regular MTL in almost all cases.

Label transfer network
To understand the performance of the LTN, we analyse learning curves of the relabelling function vs. the main model. Examples for all tasks without semi-supervised learning are shown in Figure 3. One can observe that the relabelling model does not take long to converge, as it has fewer parameters than the main model. Once the relabelling model is learned alongside the main model, the main model's performance first stagnates, then starts to increase again. For some of the tasks, the main model ends up with a higher task score than the relabelling model. We hypothesise that the softmax predictions of other, even highly related, tasks are less helpful for predicting main labels than the output layer of the main task model. At best, learning the relabelling model alongside the main model might act as a regulariser to the main model and thus improve the main model's performance over a baseline MTL model, as is the case for Topic-5 (see Table 5). To further analyse the performance of the LTN, we examine to what degree the predictions of the main model and the relabelling model for individual instances are complementary to one another. That is, we measure the percentage of correct predictions made only by the relabelling model or only by the main model, relative to the number of correct predictions overall. Results for each task are shown in Table 6 for the LTN with and without semi-supervised learning. One can observe that, even though the relabelling function overall contributes to the score to a lesser degree than the main model, a substantial number of correct predictions are made by the relabelling function that are missed by the main model. This is most pronounced for ABSA-R, where the proportion is 14.6%.
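The complementarity measure described above can be sketched as follows (an illustrative helper, not from the released code): the fraction of all correctly predicted instances that only one of the two models gets right.

```python
def complementary_fraction(gold, main_pred, relabel_pred):
    """Fraction of correctly predicted instances that are predicted
    correctly ONLY by the relabelling model (LTN) and missed by the
    main model, relative to instances either model gets right."""
    only_relabel = sum(1 for g, m, r in zip(gold, main_pred, relabel_pred)
                       if r == g and m != g)
    correct_any = sum(1 for g, m, r in zip(gold, main_pred, relabel_pred)
                      if m == g or r == g)
    return only_relabel / correct_any if correct_any else 0.0
```

Swapping the roles of the two prediction lists gives the complementary statistic for the main model.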

Conclusion
We have presented a multi-task learning architecture that (i) leverages potential synergies between classifier functions relating shared representations to disparate label spaces and (ii) enables learning from mixtures of labelled and unlabelled data. We have presented experiments with combinations of eight pairwise sequence classification tasks. Our results show that leveraging synergies between label spaces sometimes leads to big improvements, and we have presented a new state of the art for topic-based sentiment analysis. Our analysis further showed that (a) the learned label embeddings were indicative of gains from multi-task learning, (b) auxiliary tasks were often beneficial across domains, and (c) label embeddings almost always led to better performance. We also investigated the dynamics of the label transfer network we use for exploiting the synergies between disparate label spaces.