Quasi-Multitask Learning: an Efficient Surrogate for Obtaining Model Ensembles

We propose the technique of quasi-multitask learning (Q-MTL), a simple and easy-to-implement modification of standard multitask learning in which the tasks to be modeled are identical. With this minor modification of a standard neural classifier, we can obtain benefits similar to those of an ensemble of classifiers at a fraction of the resources required. We illustrate through a series of sequence labeling experiments over a diverse set of languages that applying Q-MTL consistently increases the generalization ability of the applied models. The proposed architecture can be regarded as a new regularization technique that encourages the model to develop an internal representation of the problem at hand that is beneficial to multiple output units of the classifier at the same time. Our experiments corroborate that by relying on the proposed algorithm, we can approximate the quality of an ensemble of classifiers at a fraction of the computational resources required. Additionally, our results suggest that Q-MTL handles the presence of noisy training labels better than ensembles do.


Introduction
Ensemble methods are frequently used in machine learning applications due to their tendency to increase model performance. While improved prediction performance is undoubtedly an important aspect when training a model, it should not be forgotten that the gains of ensembling come at the price of training multiple models for the same task.
The question that we tackle in this paper is the following: Can we enjoy the benefits of ensemble learning while avoiding its overhead of training models from scratch multiple times? This question is highly relevant these days, since state-of-the-art neural models tend to be extremely resource-intensive on their own (Strubell et al., 2019), prohibiting their inclusion in a traditional ensemble setting.
Our proposed architecture simultaneously offers the benefits of ensemble learning while avoiding its drawback of training multiple models. The method introduced here employs a special form of multitask learning (MTL). Caruana (1997) argues in his seminal work that MTL can be a useful source of inductive bias in machine learning models. Standard MTL has been shown to be fruitfully applicable to a series of NLP tasks: Collobert and Weston (2008); Plank et al. (2016); Rei (2017); Kiperwasser and Ballesteros (2018); Sanh et al. (2018), inter alia. We introduce quasi-multitask learning (Q-MTL), where the goal is to simultaneously learn multiple neural models that solve identical tasks while relying on a shared representation layer.
Besides the considerable speedup that comes with the proposed technique, we additionally argue that applying multiple output units on top of a shared parameter set is beneficial in itself, as it avoids convergence to degenerate internal representations that are highly tailored to a particular classification model. In that sense, Q-MTL can also be viewed as an implicit regularizer.
Our experiments with Q-MTL illustrate that the multiple classifier layers trained for the same task affect each other positively, similarly to ensemble learning, without the additional overhead of actually training multiple models.
A similar technique, Pseudo-Task Augmentation (Meyerson and Miikkulainen, 2018), has already been derived from MTL; it builds on the same idea of a common representation, but manages its tasks differently. We conducted experiments comparing the two methods for a better understanding of the differences.

Applied models
We release all the source code used in our experiments at https://github.com/N0rbi/Quasi-Multitask-Learning/. Our models are based on the sequence classification framework of Plank et al. (2016) implemented in DyNet (Neubig et al., 2017). Figure 1 provides a visual summary of the different architectures we implemented. Figure 1b highlights that Q-MTL has the benefit of training multiple classification models over the same internal representation, as opposed to a traditional ensemble model, which requires training multiple sets of LSTM parameters as well (cf. Figure 1c).

Baseline architecture
Our baseline classifier is a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) incorporating character- and word-level embeddings. We first compute the input embedding for the network at position i as
$$e_i = w_i \oplus \overrightarrow{c}_i \oplus \overleftarrow{c}_i, \quad (1)$$
where ⊕ is the concatenation operator, w_i denotes the word embedding, and →c_i and ←c_i refer to the left-to-right and right-to-left character-based embeddings, respectively. We subsequently feed e_i into a bi-LSTM, which determines a hidden representation h_i ∈ R^m for every token position as
$$h_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i,$$
the concatenation of the hidden states of the two LSTMs processing the input from its beginning to the end and in the reverse direction.
The final output of the network for token position i is computed as
$$o_i = \mathrm{softmax}\left(W\,\mathrm{ReLU}(V h_i + b_V) + b_W\right), \quad (2)$$
with V ∈ R^{d×m} and b_V ∈ R^d denoting the weight matrix and the bias of a regular perceptron layer with d outputs, whereas W ∈ R^{c×d} and b_W ∈ R^c are the parameters of the layer performing classification over the c target classes.
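A minimal NumPy sketch of this output computation may be helpful; the dimensions mirror the experimental setup described later, while the variable names and the tag count are illustrative rather than taken from the released code:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over a logit vector
    e = np.exp(z - z.max())
    return e / e.sum()

def stl_output(h, V, b_V, W, b_W):
    """Single-task output: a ReLU perceptron over the bi-LSTM state h,
    followed by a softmax classification layer."""
    z = np.maximum(V @ h + b_V, 0.0)   # intermediate representation with ReLU
    return softmax(W @ z + b_W)        # distribution over the c target classes

# toy dimensions: m = 200 (hidden state), d = 20 (perceptron), c = 17 (tags)
rng = np.random.default_rng(0)
h = rng.normal(size=200)
V, b_V = rng.normal(size=(20, 200)), np.zeros(20)
W, b_W = rng.normal(size=(17, 20)), np.zeros(17)
o = stl_output(h, V, b_V, W, b_W)
```

The output `o` is a proper probability distribution over the target classes, as required for the cross-entropy loss used below.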

Q-MTL architecture
The Q-MTL network behaves similarly to the model introduced in Section 2.1, with the notable exception that it trains k distinct classification models, all of which operate over the same hidden representation as input obtained from a single bi-LSTM unit.
More concretely, we replace the single prediction of standard single-task learning (STL) with the k distinct predictions
$$o_i^{(j)} = \mathrm{softmax}\left(W^{(j)}\,\mathrm{ReLU}(V^{(j)} h_i + b_V^{(j)}) + b_W^{(j)}\right), \quad j \in \{1, \ldots, k\}.$$
As argued before, this approach is computationally efficient, as it relies on a shared representation h_i for all the k classification units.
The loss of the network for token position i and gold standard class label y*_i can be conveniently generalized as
$$\ell_i = \sum_{j=1}^{k} \mathrm{CE}\left(o_i^{(j)}, y_i^*\right),$$
where CE denotes the categorical cross-entropy loss and k is the number of (identical) tasks in the Q-MTL model, with the special case of k = 1 resulting in standard STL. Losses from the different outputs can be efficiently aggregated for backpropagation, hence the shared LSTM cell benefits from multiple error signals without the need for multiple individual forward and backward passes.
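The aggregation can be sketched as follows; this is a hypothetical NumPy rendering in which each head is a (V, b_V, W, b_W) tuple and the per-head cross-entropies over the shared hidden state are simply summed:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def head_output(h, V, b_V, W, b_W):
    # one classification head: ReLU perceptron followed by softmax
    return softmax(W @ np.maximum(V @ h + b_V, 0.0) + b_W)

def qmtl_loss(h, heads, y):
    # the encoder runs once to produce h; only the cheap heads repeat,
    # and their cross-entropy losses are summed before backpropagation
    return sum(-np.log(head_output(h, *p)[y]) for p in heads)

rng = np.random.default_rng(1)
h = rng.normal(size=200)
make_head = lambda: (rng.normal(size=(20, 200)), np.zeros(20),
                     rng.normal(size=(17, 20)), np.zeros(17))
heads = [make_head() for _ in range(10)]
loss = qmtl_loss(h, heads, y=3)
```

With a single head the expression reduces to the ordinary STL cross-entropy, matching the k = 1 special case.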
Q-MTL outputs k predictions via its prediction units; however, we can also derive a combined prediction from the distinct outputs of Q-MTL according to
$$\bar{o}_i = \frac{1}{k} \sum_{j=1}^{k} o_i^{(j)}, \quad (3)$$
the average of the predicted probability distributions of the distinct models. As introducing averaging at the model level would eliminate the diversity of the individual classifiers (Lee et al., 2015), this averaging takes place in a post-hoc manner, only when making predictions.
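Concretely, the combination is a single mean over the per-head distributions; a toy illustration of Eq. 3 with made-up head outputs:

```python
import numpy as np

def combined_prediction(head_probs):
    """Eq. 3: average the k per-head class distributions post hoc;
    head_probs has shape (k, c)."""
    return np.asarray(head_probs).mean(axis=0)

# three heads that individually disagree on a 3-class token
head_probs = np.array([[0.6, 0.3, 0.1],
                       [0.2, 0.5, 0.3],
                       [0.5, 0.4, 0.1]])
combined = combined_prediction(head_probs)  # class 0 wins after averaging
```

Because each head is trained on its own loss and only the predictions are averaged, the diversity of the individual classifiers is preserved during training.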

Traditional ensemble model
As an additional model, we also employ a traditional ensemble of k independently trained STL models. We define the prediction of the ensemble model by averaging the predictions of the k independent models as
$$\bar{o}_i = \frac{1}{k} \sum_{j=1}^{k} \mathrm{softmax}\left(W^{(j)}\,\mathrm{ReLU}(V^{(j)} h_i^{(j)} + b_V^{(j)}) + b_W^{(j)}\right). \quad (4)$$
The distinctive difference between Eq. 4 and the Q-MTL formulation in Eq. 3 is that ensembling relies on the hidden representations originating from k independently trained LSTM models, as denoted by the superscripts of the hidden states h_i^{(j)}. Such an ensemble necessarily requires approximately k times as many computational resources as Q-MTL, since the LSTM models are trained in total isolation. For this reason, ensembling is a strictly more expensive way of training a model, and we therefore regard its performance as a glass ceiling for Q-MTL.
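The cost asymmetry between the two formulations can be made concrete by counting encoder invocations. In the hypothetical sketch below the bi-LSTM is replaced by a cheap stand-in, but the call pattern is the point: the ensemble runs one encoder pass per member, while Q-MTL runs a single shared pass:

```python
import numpy as np

calls = {"encoder": 0}

def encoder(x, params):
    # stand-in for a full bi-LSTM forward pass (the expensive part)
    calls["encoder"] += 1
    return np.tanh(params @ x)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ensemble_predict(x, models):
    # Eq. 4: every member recomputes its own hidden state h^(j)
    return np.mean([softmax(W @ encoder(x, E)) for E, W in models], axis=0)

def qmtl_predict(x, shared_encoder_params, heads):
    # Eq. 3: a single shared hidden state feeds all k heads
    h = encoder(x, shared_encoder_params)
    return np.mean([softmax(W @ h) for W in heads], axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
models = [(rng.normal(size=(6, 8)), rng.normal(size=(4, 6))) for _ in range(5)]

ensemble_predict(x, models)
ensemble_calls = calls["encoder"]       # one encoder pass per member
calls["encoder"] = 0
qmtl_predict(x, models[0][0], [W for _, W in models])
qmtl_calls = calls["encoder"]           # a single shared encoder pass
```

The same asymmetry holds at training time, where each ensemble member additionally needs its own backward pass through the encoder.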

Experiments
Our model uses character embeddings of 100 dimensions, and the word representations are initialized with the 64-dimensional pre-trained polyglot word embeddings (Al-Rfou et al., 2013), as suggested by Plank and Agić (2018). We use the bi-LSTM introduced in the previous section, whose hidden representation we refer to as h_i ∈ R^200, the concatenation of the hidden states of the two directional LSTMs. Instead of directly applying a fully-connected layer to perform classification based on h_i, we first transform h_i by an intermediate perceptron unit with ReLU activation, as shown in Eq. 2. The perceptron transforms h_i into 20 dimensions, that is, we have V ∈ R^{20×200}. Our motivation for the extra non-linearity introduced by the ReLU is to encourage increased diversity in the behavior of the different output units.
Upon training the LSTMs, we used the default architectural settings of Plank et al. (2016), i.e., a word dropout rate of 0.25 (Kiperwasser and Goldberg, 2016) and additive Gaussian noise (with σ = 0.2) over the input embeddings. We trained all our models for 20 epochs using stochastic gradient descent with a batch size of 1. First, we assess the quality of Q-MTL on POS tagging; we then evaluate it on named entity recognition as well.
When comparing the performance of the different approaches, Q-MTL models are compared against the average performance of k STL models, where k denotes the number of tasks in the Q-MTL model. The same k STL models are also used to derive the single prediction of the ensemble model.

Experiments with the number of tasks
We first investigate how changing the value of k, i.e., the number of simultaneously learned tasks, affects the performance of Q-MTL. We experimented with k ∈ {1, 10, 30}. Based on the results in Table 1, we set the number of tasks to be employed as k = 10 for all upcoming experiments. In order to choose k without overfitting to the training data, this experiment was conducted on the development set.

Comparing Q-MTL with STL
Following the recommendation of Dodge et al. (2019), we report learning curves over the development set as a function of the number of epochs in Figure 2. As a general observation, Q-MTL tends to perform consistently better than STL models right from the beginning of training.

Directly comparing the classifiers One benefit of Q-MTL is that it learns k different classification models during training with only a marginal computational overhead compared to training an STL baseline, since all the tasks share a common internal representation. As discussed earlier, we can combine the predictions of the k classifiers of Q-MTL according to Eq. 3. It is also possible, however, to use the k distinct predictions of Q-MTL individually. In what follows, we compare the performance of the k STL models we train to that of the k classifiers incorporated within a Q-MTL model.
When comparing the performance of a Q-MTL classifier with an STL model, we made sure that the overlapping parameters (the matrices V and W) were initialized with the same values and that the models received the training instances in the exact same order. This way, the performance achieved by the i-th output of Q-MTL is directly comparable with that of the i-th STL baseline. A comparison of the results of the individual outputs of Q-MTL and their corresponding STL counterparts is included in Figure 3.
Training a Q-MTL model with k tasks simultaneously is not only faster than training k distinct STL models separately, but the individual Q-MTL classifiers also typically outperform their baseline counterparts on both the development and the test data.
The regularizing effect of Q-MTL We have argued earlier that Q-MTL has an implicit regularizing effect. Alongside more recent techniques such as dropout (Srivastava et al., 2014), weight decay (Krogh and Hertz, 1992) is one of the most typical forms of regularization for fostering the generalization capability of learned models. When employing weight decay, we add an extra term penalizing the magnitude of the values learned by the model, which results in an overall shrinkage of the model parameters. Figure 4 illustrates that the effect of employing Q-MTL is similar to applying weight decay, as the Frobenius norms of the parameter matrices of the Q-MTL classifiers are substantially smaller than those of the STL classifiers. This observation holds for both of the parameter sets V and W. Recall that the initial values of these matrices were identical for Q-MTL and STL.
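The norm comparison itself is straightforward to reproduce; a short sketch with placeholder matrices (the actual values in Figure 4 come from trained models, the halving below merely stands in for the observed shrinkage):

```python
import numpy as np

def frobenius(M):
    # Frobenius norm: square root of the sum of squared entries
    return float(np.sqrt(np.square(M).sum()))

rng = np.random.default_rng(0)
V_stl = rng.normal(size=(20, 200))   # placeholder STL parameter matrix
V_qmtl = 0.5 * V_stl                 # stand-in for the shrunken Q-MTL matrix
shrinkage = frobenius(V_qmtl) / frobenius(V_stl)
```

Since both models start from identical matrices, a smaller final Frobenius norm directly indicates a weight-decay-like shrinkage induced by training.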

Comparison to an ensemble of classifiers
We next compared the Q-MTL technique with ensemble learning. Our comparison additionally assesses the sensitivity of the different approaches to the presence of noisily labeled tokens during training. To do so, we conducted multiple experiments for each language in which we randomly replaced the true class label of a token with some predefined probability p ∈ {0, 0.1, 0.2, 0.3}. During the random replacement of the class labels, we ensured that the same tokens got relabeled with the same random label for the different approaches. Figure 5 contains the performance of the three different models for the different amounts of label noise introduced into the training set. We can observe from Figure 5 that Q-MTL outperforms STL irrespective of the amount of noisy tokens encountered during training. Figure 5 further reveals that the performance of the ensemble models, which are based on the predictions of the STL classifiers, is consistently better than the average performance of the individual STL models. When mislabeled tokens are not present in the training data at all, the ensemble also has a slight advantage over Q-MTL; however, this advantage of the ensemble model gradually fades as the proportion of noisy training labels increases. Indeed, when 30% of the training labels are randomly replaced, the performance of Q-MTL reaches that of the ensemble model. The proposed approach has the additional benefit over the ensemble model that it requires a fraction of the computational resources, as we demonstrate in Section 3.1.5.
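The label corruption protocol can be sketched as follows; the function and parameter names are hypothetical, and the fixed seed is what guarantees that the same tokens receive the same random labels across the compared approaches:

```python
import numpy as np

def corrupt_labels(labels, tagset, p, seed=42):
    """Replace each gold label with a random tag with probability p.
    The fixed seed makes the corruption deterministic, so every
    compared approach trains on the identical noisy corpus."""
    rng = np.random.default_rng(seed)
    corrupted = []
    for y in labels:
        if rng.random() < p:
            corrupted.append(tagset[rng.integers(len(tagset))])
        else:
            corrupted.append(y)
    return corrupted

tagset = ["NOUN", "VERB", "DET", "ADJ"]
gold = ["NOUN", "VERB", "DET"] * 5
noisy = corrupt_labels(gold, tagset, p=0.3)
```

Note that a replacement label may coincide with the original one, so the effective noise rate is slightly below p.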

Comparison to Pseudo-Task Augmentation
The Pseudo-Task Augmentation (PTA) architecture (Meyerson and Miikkulainen, 2018) is similar to Q-MTL in that it aims at a better representation of a task by fitting multiple outputs to the same task. PTA makes a series of predictions according to
$$o_i^{(j)} = \mathrm{softmax}\left(W^{(j)} h_i + b_W^{(j)}\right), \quad j \in \{1, \ldots, k\}. \quad (5)$$
PTA introduces two special subroutines, named DecInit and DecUpdate. These subroutines implement various heuristics with the goal of encouraging the different decoders to behave differently.
DecInit DecInit is called right before the start of training and can use any of three methods: PTA-I initializes the decoder weight matrices independently at random, PTA-F freezes the decoder weights after initialization so that only the shared representation is trained, and PTA-D applies dropout to the decoders.

DecUpdate DecUpdate introduces the so-called meta-iteration into the learning process. A meta-iteration is invoked after every M-th gradient update. The methods used in DecUpdate all require a ranking of the tasks based on their dev set performance, which makes an evaluation step necessary at the beginning of each meta-iteration. The goal of the ranking is to identify the best task (BT), i.e., the one with the highest dev set performance. PTA introduces three methods for DecUpdate as well. PTA-P perturbs the weight matrices of all tasks except BT. Hyperturb (PTA-H) modifies the tasks in the same manner, but instead of adding noise to the weight matrices, noise is added to the hyperparameters of the tasks (in our case, the dropout probability preceding the softmax layers). The remaining method, greedy (PTA-G), takes the parameters of BT and overwrites with them the parameters of all remaining k − 1 decoders.
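These update heuristics can be rendered in a few lines. The sketch below is a simplified, illustrative version of a DecUpdate step (not the authors' implementation), with each decoder reduced to a weight matrix and a dropout rate:

```python
import numpy as np

def dec_update(decoders, dev_scores, method, rng, noise=0.01):
    """Illustrative DecUpdate step: `decoders` is a list of dicts with a
    weight matrix "W" and a "dropout" hyperparameter; `dev_scores` ranks
    the tasks by dev set performance."""
    bt = int(np.argmax(dev_scores))          # best task on the dev set
    for j, dec in enumerate(decoders):
        if j == bt:
            continue                          # the best task is left untouched
        if method == "perturb":               # PTA-P: noise on the weights
            dec["W"] = dec["W"] + rng.normal(scale=noise, size=dec["W"].shape)
        elif method == "hyperturb":           # PTA-H: noise on a hyperparameter
            dec["dropout"] = float(np.clip(dec["dropout"] + rng.normal(scale=noise), 0.0, 1.0))
        elif method == "greedy":              # PTA-G: copy the best task's weights
            dec["W"] = decoders[bt]["W"].copy()
    return decoders

rng = np.random.default_rng(0)
decoders = [{"W": np.zeros((4, 6)), "dropout": 0.1} for _ in range(3)]
dec_update(decoders, dev_scores=[0.80, 0.85, 0.79], method="perturb", rng=rng)
```

Combinations such as PTA-HGD simply apply several of these heuristics in the same meta-iteration.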
The PTA method most similar to Q-MTL is PTA-I, with the main difference that Q-MTL applies an extra transformation and a ReLU non-linearity over the hidden representation of the LSTM (cf. Eq. 2 and Eq. 5 for Q-MTL and PTA-I, respectively).
Another key difference is that PTA uses model selection (BEST), whereas Q-MTL relies on model averaging (AVG). This means that during inference PTA makes predictions for test instances based on the model that achieved the best dev set accuracy at the end of the training phase. Q-MTL, on the other hand, aggregates all the models according to Eq. 3. Figure 6 shows the effects of the different combinations of inference strategies (BEST/AVG) and of the usage of a multi-layer perceptron (MLP) in the model (0 MLP/20 MLP). In these experiments, 0 MLP means that we do not add the extra layer before the output. Note that the AVG inference strategy used in conjunction with the 0 MLP architecture is essentially equivalent to the PTA-I architecture. Figure 6 demonstrates that the Q-MTL model with its MLP layer facilitates the use of the model averaging in Eq. 3, as it outperforms Q-MTL using model selection (BEST @ 20 MLP). On the other hand, using AVG is indeed discouraged when no MLP is applied, as BEST often outperforms AVG in the 0 MLP case. Interestingly, when the training set contains high label noise, the latter observation seems to pivot towards the ensemble of linear classifiers. Additionally, we can see that the MLP layer improves the tolerance of the models to increasing label noise, as the model with the extra ReLU non-linearity outperforms the one without it on 7 out of 10 treebanks.
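The two inference strategies reduce to a few lines of code; a sketch with made-up head distributions and dev scores:

```python
import numpy as np

def infer(head_probs, dev_scores, strategy):
    """head_probs: (k, c) class distributions of the k heads for one token;
    dev_scores: end-of-training dev accuracy of each head."""
    if strategy == "BEST":            # PTA-style model selection
        return head_probs[int(np.argmax(dev_scores))]
    if strategy == "AVG":             # Q-MTL model averaging (Eq. 3)
        return head_probs.mean(axis=0)
    raise ValueError(f"unknown strategy: {strategy}")

head_probs = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.8, 0.1]])
dev_scores = [0.91, 0.88]
best = infer(head_probs, dev_scores, "BEST")  # only head 0 is consulted
avg = infer(head_probs, dev_scores, "AVG")    # all heads contribute
```

Note that the two strategies can even disagree on the predicted class when the non-selected heads jointly prefer a different label, as in this toy example.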
As an interesting note, Q-MTL shows improved performance for Indonesian as the amount of noisy training labels increases. A possible explanation is that corrupting the class labels of the training data can be viewed as an alternative form of label smoothing (Szegedy et al., 2016), which is known to increase the generalization ability of neural models.
After the detailed comparison between PTA-I and Q-MTL, we also compare Q-MTL to the more complex PTA variants introduced in Meyerson and Miikkulainen (2018). We conducted these experiments for English only because of the computational overhead introduced by the meta-iterations that are part of the PTA approach. When there is more than one letter after the PTA prefix, the keyword refers to a combination of multiple approaches (e.g., PTA-HGD: hyperturb, greedy, dropout). Table 2 shows that while most PTA architectures slightly underperform Q-MTL, two variants of PTA, namely freeze (F) and freeze combined with perturb (FP), performed substantially worse.
These experiments have also shown that the meta-iterations of PTA, which are responsible for the non-gradient-based updates, create a considerable computational overhead compared to traditional SGD without external heuristics, as noted above. We used M = 100 for our POS tagging classifiers, meaning that every 100 training samples are followed by an evaluation on the dev set. This made the training phase take five and a half hours on average for the different PTA models, while our method took slightly less than two hours to finish training. We do not report the performance of all eight methods due to this training-time overhead.

Comparison of training times
One of the main benefits of Q-MTL resides in its training efficiency compared to traditional ensemble models, as demonstrated by Figure 7, which includes the training times for the different approaches. We plot the training times on a logarithmic scale for better readability, for both k = 10 and k = 30. We can see that the training times of STL and Q-MTL practically coincide, whereas the overall cost of ensembling exceeds the training time of the STL and Q-MTL models by a factor of k.
The training times reported in Figure 7 were obtained without GPU acceleration, on an Intel Xeon E7-4820 CPU, in order to simulate a setting with limited computational resources. We also repeated training on a TITAN Xp GPU. The GPU-based training was 3 to 10 times faster depending on the language, but the relative performance of the different approaches remained the same, i.e., STL and Q-MTL training times did not differ substantially, whereas the ensemble model took k times as long to create.
This training overhead is explained by the number of excess parameters in the ensemble and Q-MTL models. For a k = 5 English model, the ensemble has 5 times as many parameters as STL, while Q-MTL has only 1.003 times the number of STL parameters.
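The arithmetic behind this comparison can be checked with a short sketch; the encoder size below is an assumed stand-in (the real count depends on the embedding tables and the bi-LSTM), so the exact ratio is illustrative, but the structure of the comparison holds:

```python
def param_counts(p_encoder, p_head, k):
    """Parameter totals for STL, Q-MTL with k heads, and a k-model ensemble."""
    stl = p_encoder + p_head          # one encoder, one head
    qmtl = p_encoder + k * p_head     # one shared encoder, k cheap heads
    ensemble = k * stl                # k fully independent models
    return stl, qmtl, ensemble

# one head: V (20x200) + b_V (20) + W (17x20) + b_W (17) parameters
p_head = 20 * 200 + 20 + 17 * 20 + 17
p_encoder = 5_000_000                 # assumed size of embeddings + bi-LSTM
stl, qmtl, ens = param_counts(p_encoder, p_head, k=5)
```

Because the heads are tiny relative to the encoder, Q-MTL's total stays within a fraction of a percent of the STL count, while the ensemble scales linearly with k.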

Evaluation on Named Entity Recognition
We also conducted experiments on the CoNLL 2002/2003 shared task data on named entity recognition (NER) in English, Spanish and Dutch (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). For these experiments, we report performance in terms of overall F1 scores calculated by the official scorer of the shared task. We trained models with k = 10 and compared the average performance of the individual STL models to the performance of the Q-MTL and ensemble models. Table 3a shows the results for NER over the different languages, corroborating our previous observation that Q-MTL is capable of closing the gap between the performance of the STL models and the much more resource-intensive ensemble model derived from k independent models.
In our POS tagging experiments, we trained models on treebanks of radically differing sizes, whereas in our NER experiments, we had access to training sets of comparable sizes (ranging between 218K and 273K tokens). In order to simulate the effects of limited training data on NER as well, we artificially relied on only 10% of the available training sets.
These results for the limited training data setting are included in Table 3b, from which we can see that Q-MTL manages to preserve more of its original performance (87.5% on average) than the ensemble and STL models do.

Related work

Bingel and Søgaard (2017) show that better performing models can be trained by introducing multiple auxiliary tasks. Rei (2017) proposes an auxiliary task for NLP sequence labeling, in which the model predicts the previous and next word in the sequence. Our results complement these findings by showing that this generalization property holds even if the tasks are identical. Meyerson and Miikkulainen (2018) introduced Pseudo-Task Augmentation, a similar architecture that aims to build a robust internal representation from multiple classifier units optimized for the same task in the same network; Section 3.1.4 describes the similarities and differences to our method. The PTA architecture is evaluated in multitask settings as well, while our work only considers single tasks at the moment. Ruder and Plank (2018) have shown that self-learning and tri-training can be adapted to deep neural nets in the semi-supervised regime. Their tri-training architecture resembles our approach in that it utilizes multiple classifier units built on top of a common representation layer for providing labels to previously unlabeled data.
Cross-view training (CVT) (Clark et al., 2018) resembles Q-MTL in that it also employs a shared bi-LSTM layer used by multiple output layers. The main difference between CVT and Q-MTL is that we utilize a bi-LSTM to solve the same task multiple times in a supervised setting, whereas Clark et al. used it to solve different tasks in a semi-supervised scenario.
A series of studies have made use of ensemble learning in the context of deep learning (Hansen and Salamon, 1990; Krogh and Vedelsby, 1995; Lee et al., 2015; Huang et al., 2017). Our proposed model is also related to the line of research on mixtures of experts proposed by Jacobs et al. (1991), which has already been applied successfully in NLP (Le et al., 2016). The main difference in our proposed architecture is that the internal LSTM representation is shared across the classifiers, hence a more efficient training can be achieved, as opposed to training multiple independent expert models as in Shazeer et al. (2017).
Model distillation (Hinton et al., 2015) is an alternative approach for making computationally demanding models more efficient during inference; however, it still requires training a "cumbersome" model first.

Conclusions
We proposed quasi-multitask learning (Q-MTL), which can be viewed as an efficiently trainable alternative to traditional ensembles. We additionally demonstrated that it acts as an implicit form of regularization. In our experiments, Q-MTL consistently outperformed the single-task learning (STL) baseline for both POS tagging and NER. We have also illustrated that Q-MTL generalizes better on smaller and noisy datasets compared to both STL and ensemble models.
The computational overhead of the additional classification units in Q-MTL is negligible due to the efficient aggregation of the losses and the recurrent unit shared between the identical tasks. Although we evaluated Q-MTL over an LSTM, the idea can be applied to more resource-heavy architectures, such as transformer-based models (Vaswani et al., 2017), where training an ensemble would be too expensive. This is the future direction of our research.