Multi-Task Supervised Pretraining for Neural Domain Adaptation

Two prevalent transfer learning approaches are used in recent works to improve the performance of neural networks on domains with small amounts of annotated data: Multi-task learning, which involves training the task of interest with related auxiliary tasks to exploit their underlying similarities, and Mono-task fine-tuning, where the model's weights are initialized with weights pretrained on a large-scale labeled source domain and then fine-tuned with labeled data of the target domain (domain of interest). In this paper, we propose a new approach that combines the advantages of both by learning a hierarchical model trained across multiple tasks from a source domain, which is then fine-tuned on multiple tasks of the target domain. Our experiments on four tasks applied to the social media domain show that our proposed approach leads to significant improvements on all tasks compared to both approaches.


Introduction
Deep learning approaches are powerful when dealing with large amounts of annotated data. However, such data are only available for a few languages and domains, due to the cost of manual annotation (Duong, 2017). In particular, despite the valuable importance of social media content for a variety of applications (e.g. public security, health monitoring, or trend highlighting), this large domain is still poor in terms of annotated data. Furthermore, to build systems for such applications, the machine must handle many low- and high-level tasks. Therefore, a model able to recognise as many linguistic properties as possible from a given sentence is required.
Many attempts have been made to improve the performance of NLP models in low-resource scenarios. In particular, two dominant approaches of neural Transfer Learning (TL) (Pan and Yang, 2010) are used in the State-Of-The-Art (SOTA): 1) Mono-Task Pretraining (MTP): Sequential Transfer Learning (STL) (Ruder, 2019) performed in two stages, pretraining on a rich source domain with enough training examples and then fine-tuning on the few available target-domain examples. This approach has proved powerful in many NLP tasks, outperforming the classic supervised learning paradigm, because it benefits from pre-learned knowledge. 2) Multi-Task Learning (MTL) (Caruana, 1997), which has shown benefits in several NLP tasks and applications, consists of training different tasks simultaneously, leveraging knowledge learned from related problems and resulting in richer representations with better generalisation (Collobert and Weston, 2008).
We introduce in this paper a novel method, which we call Multi-Task Supervised Pre-training and Adaptation (MuTSPad), that unifies both approaches discussed above. MuTSPad benefits from both by learning a hierarchical multi-task model trained across multiple tasks from the source domain and further fine-tuned on multiple tasks from the target domain. Hence, in addition to the diverse linguistic properties learned from various supervised NLP tasks, MuTSPad takes advantage of the knowledge pre-learned from the high-resource source domain.
We demonstrate the effectiveness of our approach on domain adaptation from the high-resource News domain to the low-resource Tweets domain. We carry out experiments on four NLP tasks, from the low-level Part-of-Speech tagging (POS) and Chunking (CK) to the higher-level Named Entity Recognition (NER) and Dependency Parsing (DP). MuTSPad exhibits significantly better performance than both TL approaches and is highly competitive with the best SOTA methods.
Furthermore, to the best of our knowledge, there are no available common datasets containing annotations for all the above-mentioned tasks, neither for the News domain nor for the Tweets domain. Moreover, several earlier works have highlighted the intricacy of multi-task training from heterogeneous datasets (Subramanian et al., 2018). Thus, we propose to build multi-task datasets for the News and Tweets domains by unifying the aforementioned independent task-specific datasets.

Related Work
Our work is related to two lines of research: Sequential Transfer Learning and Multi-Task Learning. In the following, we briefly present the SOTA of each, then discuss works from the literature whose ideas are loosely related to multi-task pretraining and fine-tuning.

Sequential Transfer Learning (STL) is a TL setting performed in two stages: pretraining and adaptation. The purposes behind using STL techniques for NLP can be divided into two main research areas: "universal representations" and "domain adaptation". The former aims to build neural features that are transferable and beneficial to a wide range of NLP tasks and domains, e.g. ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), etc. The latter aims to harness the knowledge encoded in features learned on a source domain (high-resourced in most cases) to improve learning on a target domain (low-resourced in most cases) (Zennaki et al., 2016, 2019). The source and target problems may differ in task, language or domain. For instance, cross-lingual adaptation has been explored for sentiment analysis (Chen et al., 2016), and cross-domain adaptation has been applied to adapt POS models from the News to the Tweets domain (Gui et al., 2017; Meftah et al., 2017). Our research falls into the second area, since we aim, as the last two works, to transfer the knowledge learned when training on the News domain to improve training on the Tweets domain.

Multi-Task Learning (MTL) consists in jointly learning related tasks, thus leveraging the training signals generated by the different tasks (Caruana, 1997). The advantage of MTL over independent task learning has been shown in many NLP tasks and applications (Lin et al., 2018). The processing (training) order of examples from different tasks (datasets), also called scheduling, has been particularly studied in the literature; it can be implicit or explicit (Jean et al., 2018). The former includes, for instance, assigning different learning rates to task-specific parameters, while the latter modifies the importance of each task statically or dynamically, e.g. Kiperwasser and Ballesteros (2018) proposed variable schedules that increasingly favour the principal task over batches, and Jean et al. (2018) proposed adaptive schedules that vary according to the validation performance of each task during training.

Multi-Task Pretraining and Fine-tuning: multi-task pretraining has especially been explored for learning universal representations (Conneau et al., 2017; Ahmad et al., 2018), and multi-task fine-tuning was recently explored to fine-tune the pretrained BERT model in a multi-task fashion on multiple tasks (Liu et al., 2019). Furthermore, in terms of using multi-task features for domain adaptation, Søgaard and Goldberg (2016) showed the benefit of multi-task learning for domain adaptation from the News domain to the Weblogs domain on the CK task, when CK supervision is available only for the source domain and lower-level POS supervision for the target domain. Finally, in terms of unifying multi-task learning and fine-tuning, Kiperwasser and Ballesteros (2018) proposed to improve machine translation with the help of POS and DEP tasks by scheduling tasks during training: the model starts with multi-tasking of the principal task with auxiliary lower-level tasks (POS and DEP), and as training progresses, it trains only on the main task.
However, to the best of our knowledge, performing pretraining and fine-tuning on multi-task models for domain adaptation has not been explored in the literature.

Sequence Labelling Architecture
Regarding the exact architecture of each task: POS, CK and NER are Sequence Labelling (SL) tasks. Given an input sentence of $n$ successive tokens $[w_1, \ldots, w_n]$, SL predicts the tag $c_i \in \mathcal{C}$ of every $w_i$, with $\mathcal{C}$ being the tag-set.
We followed the literature (Ma and Hovy, 2016) and used a common SL architecture with three main components: (i) a Word Representation Extractor (WRE), (ii) a Features Extractor (FE) and (iii) a Classifier (Cl). WRE computes, for each token $w_i$, a word-level embedding $we_i = WE(w_i)$ and a character-level biLSTM encoder-based embedding $ce_i = CE(w_i)$, and concatenates them to get the final representation $x_i = (we_i, ce_i)$. WRE's outputs $[x_1, \ldots, x_n]$ are fed into the FE, a single biLSTM layer that iteratively passes through the sentence in both directions and outputs a context-sensitive representation $h_i$ for each token. Finally, Cl consists of a fully-connected layer (denoted $\Psi$) followed by a softmax that classifies each token from its contextual representation:
$$\hat{y}_i = \mathrm{softmax}\big(\Psi(h_i)\big). \quad (1)$$
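For illustration, a minimal PyTorch sketch of such a sequence-labelling branch could look as follows; module names, dimensions and the use of the final LSTM hidden states for the character encoder are our assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Character-level biLSTM encoder: returns one vector per token."""
    def __init__(self, n_chars, char_dim=50, hidden=100):
        super().__init__()
        self.emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)

    def forward(self, char_ids):                 # (n_tokens, max_chars)
        _, (h, _) = self.lstm(self.emb(char_ids))
        return torch.cat([h[0], h[1]], dim=-1)   # (n_tokens, 2*hidden)

class SeqLabeller(nn.Module):
    def __init__(self, n_words, n_chars, n_tags, word_dim=300, fe_dim=200):
        super().__init__()
        self.word_emb = nn.Embedding(n_words, word_dim, padding_idx=0)
        self.char_enc = CharEncoder(n_chars)
        self.fe = nn.LSTM(word_dim + 200, fe_dim, bidirectional=True, batch_first=True)
        self.cl = nn.Linear(2 * fe_dim, n_tags)  # the fully-connected layer Psi

    def forward(self, word_ids, char_ids):
        # WRE: x_i = (we_i, ce_i)
        we = self.word_emb(word_ids)                          # (batch, n, word_dim)
        ce = self.char_enc(char_ids.view(-1, char_ids.size(-1)))
        ce = ce.view(word_ids.size(0), word_ids.size(1), -1)  # (batch, n, 200)
        x = torch.cat([we, ce], dim=-1)
        # FE: contextual representation h_i
        h, _ = self.fe(x)
        # Cl: tag scores; the softmax is applied inside the cross-entropy loss
        return self.cl(h)
```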

Dependency Parsing Architecture
For the DP branch, a similar procedure is applied, except that, compared to the previous tasks, DP is a harder problem and thus requires a more complex model. We followed Qi et al. (2018) and used their "neural arc-factored graph-based dependency parser", which is based on the "Deep biAffine parser" (Dozat and Manning, 2016). Given an input sentence of $n$ successive tokens $[w_1, \ldots, w_n]$, the goal of DP is twofold: 1) identifying, for each $w_i$, its head $w_j$ in the sentence (the tokens $w_i$ and $w_j$ are called the dependent and the head, respectively); then 2) predicting the class $r_{ij} \in \mathcal{R}_{dp}$ of the syntactic dependency relation linking each dependent-head pair, with $\mathcal{R}_{dp}$ being the set of dependency relations. More precisely, for each token $w_i$ we predict its out-going labelled arc $(w_i, w_j, r_{ij})$, thus constructing a syntactic tree structure of the sentence, where words are treated as nodes in a graph connected by labelled directed arcs.
Hence, as in the SL models, the DP architecture is composed of a WRE followed by a $FE^{dp}$ and a $Cl^{dp}$, except that $Cl^{dp}$ first produces four distinct vectors representing each word: (i) as a dependent seeking its head; (ii) as a head seeking all its dependents; (iii) as a dependent deciding on its relation; and (iv) as a head deciding on the labels of its dependents. These representations are then passed to biAffine softmax classifiers.
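For illustration, below is a hedged PyTorch sketch of one such biAffine classifier, the unlabelled arc scorer; the relation classifier follows the same pattern with one bilinear term per relation class. Projection sizes and module names are our assumptions, not the parser's actual code:

```python
import torch
import torch.nn as nn

class BiAffineArcScorer(nn.Module):
    """Scores every (dependent, head) pair with a bilinear + linear term."""
    def __init__(self, in_dim=400, arc_dim=500):
        super().__init__()
        self.dep_mlp = nn.Sequential(nn.Linear(in_dim, arc_dim), nn.ReLU())   # word as dependent
        self.head_mlp = nn.Sequential(nn.Linear(in_dim, arc_dim), nn.ReLU())  # word as head
        self.U = nn.Parameter(torch.empty(arc_dim, arc_dim))
        self.b = nn.Parameter(torch.zeros(arc_dim))
        nn.init.xavier_uniform_(self.U)

    def forward(self, h):                        # h: (batch, n, in_dim), output of FE_dp
        dep = self.dep_mlp(h)                    # (batch, n, arc_dim)
        head = self.head_mlp(h)                  # (batch, n, arc_dim)
        # s[b, i, j] = dep_i^T U head_j + b^T head_j : score of arc (dependent i, head j)
        bilinear = torch.einsum('bid,de,bje->bij', dep, self.U, head)
        bias = torch.einsum('bjd,d->bj', head, self.b).unsqueeze(1)
        return bilinear + bias                   # softmax over j gives word i's head distribution
```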

Hierarchical Multi-Task Architecture
As mentioned above, POS, CK, NER and DP are the four tasks considered in this work. Since we aim to learn a multi-task model where the four tasks are learned jointly, the architecture of our model contains a common branch as well as four exits, one per task. Also, as the tasks are hierarchically related to each other, we adopted a hierarchical architecture (similar to Hashimoto et al. (2017) and Sanh et al. (2019)). More specifically, we organised the four tasks from low-level to high-level, with each task being fed with a shared word embedding as well as the outputs of all the lower tasks. To construct this hierarchy of tasks, we followed linguistic hints from the literature. Indeed, many works have shown that POS improves CK (Yang et al., 2017; Ruder et al., 2019); NER benefits from POS (Ruder, 2019) and CK (Collobert and Weston, 2008); and DP profits from POS and CK (Hashimoto et al., 2017). In simple terms, POS and CK are considered "universal helpers" (Changpinyo et al., 2018). Thus, based on these linguistic hierarchy observations, we feed POS features to CK, then POS and CK features to both NER and DP.
An illustration of our multi-task hierarchical model is given in Fig. 1. More specifically, WRE is shared across all tasks, and its output, $x_i = \mathrm{WRE}_{shared}(w_i)$, is fed to all branches. The lower component of the POS tagging branch ($FE^{pos}$) is fed with the shared embedding and outputs biLSTM features $h^{pos}_i$, which are then fed into the POS classifier $C^{pos}$ to compute the predictions:
$$\hat{y}^{pos}_i = C^{pos}(h^{pos}_i).$$
These POS features $h^{pos}_i$, as well as the shared embeddings $x_i$, are then fed to the CK branch, which outputs a probability distribution over the CK tag-set:
$$\hat{y}^{ck}_i = C^{ck}(h^{ck}_i), \quad \text{with } h^{ck}_i = FE^{ck}\big([x_i, T^{pos}(h^{pos}_i)]\big).$$
Note that, rather than directly using $FE^{pos}$'s output, we first reduce its dimensionality by applying a learnable FC-layer transformation denoted $T^{pos}$.
In the same vein, following our hierarchy, the shared embedding plus the output features of POS ($h^{pos}_i$) and of CK ($h^{ck}_i$) are fed to the NER branch, which outputs one class probability per named-entity tag:
$$\hat{y}^{ner}_i = C^{ner}(h^{ner}_i), \quad \text{with } h^{ner}_i = FE^{ner}\big([x_i, h^{pos}_i, h^{ck}_i]\big).$$
For DP, similarly to the NER branch, the shared embedding plus the output features of POS and CK are fed to $FE^{dp}$, followed by $C^{dp}$, which outputs the arc and relation scores:
$$\hat{y}^{dp}_i = C^{dp}(h^{dp}_i), \quad \text{with } h^{dp}_i = FE^{dp}\big([x_i, h^{pos}_i, h^{ck}_i]\big).$$
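For clarity, here is a minimal PyTorch sketch of how the branches could be wired together following this hierarchy. Module names mirror the notation above; all sizes, the illustrative tag-set sizes, and the exact points where dimensionality reduction is applied are our assumptions:

```python
import torch
import torch.nn as nn

class HierarchicalMTL(nn.Module):
    def __init__(self, wre, d_x=500, d_h=400, d_red=100,
                 n_pos=45, n_ck=23, n_ner=9):
        super().__init__()
        self.wre = wre                                   # shared word representation extractor
        self.fe_pos = nn.LSTM(d_x, d_h // 2, bidirectional=True, batch_first=True)
        self.fe_ck  = nn.LSTM(d_x + d_red, d_h // 2, bidirectional=True, batch_first=True)
        self.fe_ner = nn.LSTM(d_x + 2 * d_h, d_h // 2, bidirectional=True, batch_first=True)
        self.t_pos = nn.Linear(d_h, d_red)               # dimensionality reduction T_pos
        self.c_pos = nn.Linear(d_h, n_pos)
        self.c_ck  = nn.Linear(d_h, n_ck)
        self.c_ner = nn.Linear(d_h, n_ner)
        # the DP branch (FE_dp + biAffine classifiers) would be wired like NER:
        # its input is [x, h_pos, h_ck]

    def forward(self, batch):
        x = self.wre(batch)                              # (batch, n, d_x)
        h_pos, _ = self.fe_pos(x)
        y_pos = self.c_pos(h_pos)
        h_ck, _ = self.fe_ck(torch.cat([x, self.t_pos(h_pos)], dim=-1))
        y_ck = self.c_ck(h_ck)
        h_ner, _ = self.fe_ner(torch.cat([x, h_pos, h_ck], dim=-1))
        y_ner = self.c_ner(h_ner)
        return y_pos, y_ck, y_ner
```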

Our Approach
In low-resource scenarios, two approaches are commonly used to alleviate the lack of training data: 1. "Mono-Task Pre-training": pre-training the neural model (in a supervised or unsupervised way) on a rich source task before updating its weights on the task of interest (target task); 2. "Multi-task learning": training the task of interest jointly with other auxiliary tasks with labelled data, which might force the network to learn useful features.
The first approach has been known to work very well for a while and has yielded impressive results in recent years, whereas the second still struggles with the scarcity of annotated data.
In this paper, to make sure that we learn useful features that are relevant for the tasks of interest, we propose an approach that combines pretraining and multi-task learning, and thus benefits from the rich source domain, especially all its available annotated data and tasks. We call our method Multi-Task Supervised Pre-training and Adaptation (MuTSPad). It is described in detail in Sec. 4.1; we then describe in Sec. 4.2 how we alleviate the problem of dataset heterogeneity in multi-task learning (i.e., only mono-task labelled datasets are available).

MuTSPad: Multi-Task Supervised Pretraining and Adaptation
MuTSPad roughly consists in pretraining on a large annotated multi-task source dataset and then fine-tuning on the multi-task target dataset corresponding to the domain of interest. Like supervised and unsupervised pretraining, MuTSPad alleviates the lack of annotated data by taking benefit from rich source domains. However, compared to them, it performs the pretraining on multiple tasks rather than only one. This brings more real supervision to the network and thus a better chance of ending up with richer features. Also importantly, as source domains are usually richer than target domains, we can generally find source datasets that are labelled with exactly the tasks we want to solve in the target domain. This forces the network to learn only features that are likely to be relevant to our tasks of interest, and thus avoids filling the network with irrelevant features. More specifically, we consider four commonly used tasks in this work: POS, CK, NER and DP. The classical multi-task scenario assumes a single dataset in which words and sentences are labelled for all the tasks; in reality, such data are very hard to encounter. Let us therefore consider a more realistic scenario: a set of datasets is given, each labelled for only one task. For instance, one dataset is labelled with POS and another with NER; note that, although the first is labelled with POS only, it certainly contains named entities. We call this scenario "heterogeneous multi-task learning". In our case, we have two such collections, one for the News domain and one for the Tweets domain, both heterogeneous, with one task annotation per sentence. The difficulties of learning in such a scenario are described in detail in Sec. 4.2.
Back to MuTSPad, let us assume a set of tasks $\mathcal{T}$ and one dataset $D_i$ per task $T_i$. A source multi-task model $M_s$ (described in Fig. 1) is first trained on these heterogeneous source datasets $D_S$. The set of all parameters learned for each task $T_i$, as well as the task-agnostic ones, is then used to initialise the target multi-task model, denoted $M_t$. All the weights of $M_t$ are then adapted on the set of target datasets $D_T$. Note that, although the sets of target tasks $\mathcal{T}_T$ and source tasks $\mathcal{T}_S$ might be the same, their label-sets might differ; thus, as in classical fine-tuning, the weights of each task classifier are randomly initialised. However, for the labels that are shared, we initialise the weights of the target task classifiers with those pretrained on the source domain.
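As a hedged illustration, the initialisation of the target model $M_t$ from the pretrained source model $M_s$ could be sketched as follows; the state-dict names, classifier prefixes and shared-label mapping are hypothetical and not the authors' actual code:

```python
import torch

def init_target_from_source(m_t, m_s,
                            classifier_prefixes=('c_pos', 'c_ck', 'c_ner', 'c_dp'),
                            shared_labels=None):
    """Copy all task-agnostic and task-specific weights from M_s to M_t,
    except the classifiers, whose rows are copied only for shared labels."""
    src, tgt = m_s.state_dict(), m_t.state_dict()
    with torch.no_grad():
        for name, w_src in src.items():
            if not name.startswith(classifier_prefixes):
                tgt[name].copy_(w_src)            # WRE, FE_* and T_* weights are fully reused
        # shared_labels: {classifier name: [(target_row, source_row), ...]}
        for clf, pairs in (shared_labels or {}).items():
            for t_row, s_row in pairs:
                tgt[f'{clf}.weight'][t_row].copy_(src[f'{clf}.weight'][s_row])
                tgt[f'{clf}.bias'][t_row].copy_(src[f'{clf}.bias'][s_row])
    m_t.load_state_dict(tgt)
```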
In terms of loss functions, as in classical multi-task learning, we minimise the weighted sum of the individual task losses:
$$\mathcal{L} = \sum_{j=1}^{N} \alpha_{task_j}\, \mathcal{L}_{task_j},$$
where $\alpha_{task_j}$ is the weight of task $j$ and $N$ is the number of tasks. As we use a hierarchical model for the reasons given in Sec. 3.3, we propose to focus the early stage of training on low-level tasks and to progressively increase the focus on higher-level ones (along the same line of thought as Kiperwasser and Ballesteros (2018)). However, unlike Kiperwasser and Ballesteros (2018), who modified task sampling during training, we tune the weights of the loss. During training, the task weights $\alpha_{pos}$, $\alpha_{ck}$, $\alpha_{ner}$ and $\alpha_{dp}$ are initially set to $[1, 0.5, 0.25, 0.25]$, respectively, and each is doubled at every epoch until $\alpha = 1$.
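A small Python sketch of this schedule, assuming the initial weights above and a per-epoch doubling capped at 1 (helper names are ours):

```python
INIT_ALPHAS = {'pos': 1.0, 'ck': 0.5, 'ner': 0.25, 'dp': 0.25}

def task_weights(epoch):
    # weight of each task, doubled at every epoch and capped at 1
    return {t: min(1.0, w * (2 ** epoch)) for t, w in INIT_ALPHAS.items()}

def multitask_loss(task_losses, epoch):
    # L = sum_j alpha_j * L_j  (weighted sum over the N tasks)
    alphas = task_weights(epoch)
    return sum(alphas[t] * l for t, l in task_losses.items())
```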

Heterogeneous Multi-Task Learning
As mentioned in the previous section, we mostly face the heterogeneous multi-task learning scenario, where each dataset is labelled for only one task. In that case, the classical multi-task learning approach is not directly applicable; thus we propose to use a "scheduling process" (Zaremoodi et al., 2018; Lu et al., 2019), described in the following paragraph. However, since training with a different dataset for each task remains difficult (Subramanian et al., 2018), we also propose "Dataset Unification", a much simpler method that is easier to learn in this scenario.

Tasks Scheduling Procedure
To deal with this heterogeneous aspect, we first use a simple frozen uniform scheduling, which we call "one task per batch", similar to (Zaremoodi et al., 2018) and (Lu et al., 2019), where at each iteration of the training process a single task is selected to train. Specifically, the basic steps of the "one task per mini-batch" scheduling process are as follows: 1) picking a mini-batch of samples from only one particular task, and 2) updating only the parameters corresponding to the selected task, as well as those of the components it relies on (including the task-agnostic parameters). Thus, at every step only one task is trained. We successively pick all the tasks following a constant ordering strategy "from lower-level to higher-level tasks" (Hashimoto et al., 2017): POS, then CK, then NER, then DP. Thus, every 4 steps, the model sees all the tasks once and learns their corresponding parameters once.
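A minimal sketch of this scheduling loop in PyTorch style; the loader and model interfaces (e.g. `model.task_loss`) are hypothetical:

```python
from itertools import cycle

TASK_ORDER = ['pos', 'ck', 'ner', 'dp']

def train_epoch(model, loaders, optimizer, n_steps):
    # assumes each task's loader can provide at least n_steps / 4 mini-batches
    iters = {t: iter(loaders[t]) for t in TASK_ORDER}
    order = cycle(TASK_ORDER)                 # constant low-to-high ordering
    for _ in range(n_steps):
        task = next(order)
        batch = next(iters[task])
        loss = model.task_loss(task, batch)   # forward through the selected branch
        optimizer.zero_grad()
        loss.backward()                       # gradients reach only this branch, the
        optimizer.step()                      # lower-level features it consumes and the shared WRE
```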

Datasets Unification
To avoid the intricacy of the task-scheduling process, we propose to construct a unified dataset by combining several sources of independent textual annotations. Furthermore, since we are interested in benefiting from both pretraining and fine-tuning, we apply the unification process to both the source and the target domains.
These datasets contain samples with a broad range of heterogeneous annotations in a variety of contexts (initially, sentences are labelled with only one task rather than all), making multi-task training challenging. To circumvent this problem, we propose to unify the Twitter-domain datasets into a single multi-task Tweets dataset that we call TweetAll. We do the same with the standard News-domain datasets to form a unified multi-task dataset that we name EnglishAll. Concretely, we enrich the gold annotations of each task's training set with automatic annotations obtained by applying our baseline Mono-Task Learning models of the other three tasks to it. In the end, we obtain two unified datasets, one for Tweets (TweetAll) and one for English News (EnglishAll), in which each sentence is labelled with all tasks (one label is the initial manual annotation and three are generated automatically). Consequently, using our unified datasets brings us back to the classical multi-task scenario, where each sentence is annotated with all tasks; thus, at each iteration, all tasks are learned and all the multi-task model's parameters are updated.
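A hedged sketch of how such a unified dataset could be built, assuming one `predict` interface per mono-task baseline model (data structures are illustrative):

```python
TASKS = ['pos', 'ck', 'ner', 'dp']

def unify(datasets, mono_task_models):
    """datasets[t]: list of (tokens, gold_labels) pairs for task t."""
    unified = []
    for gold_task, examples in datasets.items():
        for tokens, gold in examples:
            labels = {gold_task: gold}                                   # one manual annotation
            for t in TASKS:
                if t != gold_task:
                    labels[t] = mono_task_models[t].predict(tokens)      # silver annotation
            unified.append((tokens, labels))
    return unified   # every sentence is now labelled with all four tasks
```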

Domains, Tasks and Datasets
As mentioned above, we conducted experiments on four tasks: two low-level tasks (POS and CK) and two higher-level ones (NER and DP). In terms of domains, data and annotations, for the source datasets we used the standard English domain and chose the following datasets: the WSJ part of the Penn TreeBank (PTB) (Marcus et al., 1993) for POS, annotated with the PTB tag-set; CoNLL-2003 (Sang and De Meulder, 2003) for NER; CoNLL-2000 (Sang et al., 2000) for CK; and finally UD-English-EWT (Nivre et al., 2016) for DP. In the same vein, for the target datasets we used the Tweets domain and the following datasets: the recent TweeBank (Liu et al., 2018) for POS, annotated with the PTB universal tag-set; WNUT-17 from the emerging entity detection shared task (Derczynski et al., 2017) for NER; TChunk (Ritter et al., 2011) for CK; and the data annotated with Universal Dependency relations in the TweeBank dataset for DP. Detailed statistics of all the datasets are summarised in Table 1. To evaluate our models, we use accuracy (acc) for POS, exact-match F1 (Li et al., 2020) for NER and CK, and the labelled attachment score (LAS) (Nivre et al., 2004) for DP.

Baselines
We compare our method to multiple baselines, which we separate into four categories according to the pretraining method.
Without Pretraining: training from scratch on the Tweets (target-domain) datasets.
• Mono-Task Learning: independent training of our mono-task models (one model per task) on each target task separately.
• Multi-Task Learning: joint training of our multi-task model described in Sec. 3.3 (one model for all the tasks) on all the tasks of the target domain.
Mono-Task Pre-Training: supervised pretraining of the network on the source domain on each task independently, followed by fine-tuning on the same task in the Tweets domain. A variant of it, marked with *, consists of pretraining only, i.e. without fine-tuning. Note that this variant is possible only when the target dataset has the same tag-set as the source dataset.
Adversarial Pretraining (Ganin et al., 2016; Gui et al., 2017): particularly used for domain adaptation, it aims to reduce the shift between the source and target domains at the pretraining stage. Precisely, in parallel to the task objective trained on supervised annotations from the source domain, an adversarial objective with respect to a domain discriminator is trained on unlabelled target data to minimise the distance between the source and target representations. This is followed by fine-tuning on the same task in the Tweets domain.

State-Of-The-Art (SOTA)
We compare our approach to the best SOTA performance for each task:
• BiAffine (Dozat and Manning, 2016): we report the LAS score for DP reported by Liu et al. (2018). Note that, in addition to the word-level and character embeddings which we use in our model to represent words, they use predicted POS labels and lemmas as input.
• Flairs (Akbik et al., 2019): for NER, a BiLSTM-CRF sequence-labelling architecture fed with Pooled Contextual Embeddings, pre-trained on character-level language models.

Table 2: Results on the TweetAll test-sets for the four tasks. The second column describes the pretraining type (none, supervised, unsupervised, adversarial and our multi-task supervised). The last column (mNRG, which stands for median Normalised Relative Gain) aggregates the scores of the methods across tasks.

Method | PreTraining | POS (acc) | DP (LAS) | NER (F1) | CK (F1) | mNRG
SOTA - BiAffine (Dozat et al., 2017) | n/a | n/a | 77.7 | n/a | n/a | n/a
SOTA - PretRand (Meftah et al., 2019) | n/a | 94.95 | n/a | n/a | n/a | n/a
SOTA - Flairs (Akbik et al., 2019) | n/a | n/a | n/a | 49.59 | n/a | n/a
SOTA - MDMT (Mishra, 2019) | n/a | 92.44 | n/a | 49.86 | 87.85 | n/a
SOTA - DA (LSTM) (Gu and Yu, 2020) | n/a | n/a | n/a | n/a | 84.58 | n/a
SOTA - DA (BERT-BASE) (Gu and Yu, 2020) | n/a | n/a | n/a | n/a | 87.03 | n/a
SOTA - DA (BERT-LARGE) (Gu and Yu, 2020) | n/a | n/a | n/a | n/a | 87. | n/a
• PretRand (Meftah et al., 2019): a neural model based on a transfer learning approach; it improves the Mono-Task Pre-Training baseline by reducing the bias occurring in the pretrained neurons.
• Data Annealing (DA) (Gu and Yu, 2020): a fine-tuning approach similar to the Mono-Task Pre-Training baseline, but the passage from pretraining to fine-tuning is performed gradually, i.e. training starts with only formal text data (News); then, the proportion of informal text data (Tweets) is gradually increased during the training process.

Implementation details
The hyper-parameters (HP) we used are as follows. For the task-agnostic WRE: the character embedding dimension is 50, the hidden states of the character-level biLSTM have dimension 100, and the word-level embeddings have dimension 300 (pre-loaded from GloVe pre-trained vectors (Pennington et al., 2014) and fine-tuned during training). For the sequence labelling branches: we use a single-layer biLSTM (token-level feature extractor) with dimension 200. DP branch HP: we follow the Stanford parser's HP configuration (github.com/stanfordnlp/stanfordnlp). Global HP: in all experiments, SGD was used for training with early stopping, and mini-batches were set to 16 sentences.
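For reference, the same hyper-parameters gathered into a single configuration sketch (key names are ours; values are those reported above):

```python
CONFIG = {
    'char_emb_dim': 50,
    'char_lstm_hidden': 100,
    'word_emb_dim': 300,                     # initialised from GloVe and fine-tuned
    'sl_bilstm_dim': 200,                    # token-level feature extractor of the SL branches
    'dp_branch': 'stanfordnlp default HP',   # github.com/stanfordnlp/stanfordnlp
    'optimizer': 'SGD',
    'batch_size': 16,                        # sentences per mini-batch
    'early_stopping': True,
}
```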

Comparison to SOTA & baselines
Our experimental results are reported in Table 2. Clearly, MuTSPad strongly outperforms the baselines and is very competitive with the best SOTA results. We detail our main findings below.
Multi-Task Learning baseline: it improves performance on all tasks compared to mono-task learning. It is most beneficial for DP (∼+3.5%), since DP is highly influenced by POS labels, while it is least beneficial for POS (∼+0.5%).
Unsupervised pretraining: clearly, incorporating pre-trained ELMo representations performs better than mono-task learning, particularly for the NER task, with ∼+8% using ELMo-large. We also find that it improves the other tasks, but not to the same extent as NER, which we mainly attribute to the fact that contextual representations pre-trained on language modelling capture more semantic features. In particular, we find that DP gains the least from ELMo compared to the other syntactic tasks.
Comparing MuTSPad to baselines: MuTSPad outperforms both TL methods, Multi-Task Learning and Mono-Task Fine-Tuning, on all datasets, by ∼+26 and ∼+10 mNRG points, respectively (mNRG, the median Normalised Relative Gain, is a well-suited metric for multi-task evaluation). Compared to unsupervised pretraining, we observe that MuTSPad outperforms ELMo on POS, CK and DP, whereas ELMo-large brought higher performance for NER. Note that ELMo is complementary to our approach; hence, we expect higher performance when incorporating ELMo-large into MuTSPad.
Comparing MuTSPad to SOTA: our score on DP is about ∼2.5% higher than the best SOTA score. For POS, CK and NER, we achieve lower scores than SOTA. It is noteworthy that, first, contrary to our approach, all these methods are mono-task models (except MDMT), i.e. unable to solve other tasks. Second, the best SOTA for NER and CK use pretrained contextualised representations that boost performance, namely Flair embeddings by Akbik et al. (2019), ELMo by Mishra (2019) and BERT by Gu and Yu (2020).

Impact of Datasets Unification
We report in Tab. 3 MuTSPad's results: 1) w/o unif.: training on independent datasets, using the "one task per batch" scheduling rule, in both stages (pretraining and fine-tuning); 2) w/ source unif.: in the pretraining stage, training is performed on the unified EnglishAll dataset, while in fine-tuning, training is performed on independent datasets; 3) w/ source+target unif.: in both the pretraining and fine-tuning stages, training is performed on the unified EnglishAll and TweetAll datasets, respectively.
Clearly, pretraining on unified source datasets (w/ source unif.) slightly improves performance on all tasks. Nevertheless, fine-tuning on unified target datasets (w/ source+target unif.) is beneficial only for the POS and DP tasks, while it strongly hurts NER's performance. We mainly attribute this to the low performance of the NER "Mono-Task Learning" model on Tweets, which leads to noisy automatic NER annotations. It is noteworthy that the unified datasets are easier to train on, making convergence faster.

Figure 2: Maximum drop in class score for each task when ablating individual units from the POS Feature Extractor output ($h^{pos}$). Dark/light blue: high/low drop. One can see that the POS task is the most impacted by the POS units.

Low-Level Tasks Importance Analysis
In this section, we investigate how low-level tasks impact high-level tasks in our hierarchical multi-task model (see Fig. 1). Specifically, we focus on the impact of $h^{pos}$, the representation encoded by the POS task, on the CK, NER and DP tasks.
For this purpose, we quantify the importance of $h^{pos}$'s individual units for POS, CK, NER and DP performance. Assuming that ablating the most important units for a task should bring a higher drop in performance than ablating the least important units, we perform an individual ablation (also called pruning) of $h^{pos}$'s units (neurons), as in (Zhou et al., 2018; Dalvi et al., 2019).
Given the already trained target multi-task model $M_t$, we set to zero the weights relating each unit $i$ of $h^{pos}$ to the rest of the model, i.e. the $T^{pos}$ weights for CK, NER and DP, and the $Cl^{pos}$ weights for POS. Hence, the ablated unit no longer contributes to the final prediction for any input word. Then, with one unit ablated at a time, we run inference on each task's dev-set and compute the resulting score drop for each class, leading to one matrix per task, $A^{task} \in \mathcal{M}_{d,m}(\mathbb{R})$, where $d$ is $h^{pos}$'s dimension and $m$ is the number of the task's classes. This matrix can be summarised in a max-class-score-drop vector $v^{task} \in \mathbb{R}^d$, where each entry $v^{task}_i$ represents the maximum class-score drop when ablating unit $i$ of $h^{pos}$.
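A hedged sketch of this ablation loop, assuming a per-class evaluation helper and the module names used in the earlier sketches (`t_pos`, `c_pos`); it is illustrative, not the authors' implementation:

```python
import copy
import torch

def max_class_score_drop(model, dev_sets, eval_per_class, d_pos):
    """Returns one max-class-score-drop vector per task (v_task in R^d_pos)."""
    base = {t: eval_per_class(model, dev_sets[t]) for t in dev_sets}  # class -> score, per task
    v = {t: torch.zeros(d_pos) for t in dev_sets}
    for i in range(d_pos):
        ablated = copy.deepcopy(model)
        with torch.no_grad():
            # zero the weights relating unit i of h_pos to the rest of the model:
            # T_pos (used by CK, NER and DP) and the POS classifier Cl_pos
            ablated.t_pos.weight[:, i] = 0.0
            ablated.c_pos.weight[:, i] = 0.0
        for t in dev_sets:
            scores = eval_per_class(ablated, dev_sets[t])
            v[t][i] = max(base[t][c] - scores[c] for c in scores)     # max drop over classes
    return v
```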
Applying this method to POS, CK, NER and DP leads to four max-class-score-drop vectors, one per task ($v^{pos}$, $v^{ck}$, $v^{ner}$ and $v^{dp}$), which we plot in Fig. 2. We can observe that $h^{pos}$'s units are more important for CK and DP than for NER. Moreover, we attempt to peek inside some of the $h^{pos}$ units whose ablation begets the highest drops in CK, DP and NER class scores. Specifically, we report in Tab. 4 the top-10 words activating some of these units, as in (Kádár et al., 2017; Meftah et al., 2019).

Table 4: Top-10 words activating positively (red) or negatively (blue) (since LSTMs generate both positive and negative activations) some units from $h^{pos}$ that are the most important for different classes of CK, DP and NER.

Expectedly, we find that some of the POS units fire on, and are thus specialised in, patterns that are beneficial for higher-level tasks. For instance, Unit-99, specialised in adjectives ending with the suffix "ly", is highly important for the CK class "B-ADJP" (beginning of an adjectival phrase). Also, Unit-115 fires on person names, a valuable pattern for the "B-person" class of NER. Interestingly, we find some units that are beneficial for multiple tasks; e.g. Unit-112, which is specific to interjections, is also important for both the "discourse" class of DP and the "B-INTJ" class (beginning of an interjection phrase) of CK.

Conclusion
In this research, we have proposed MuTSPad, a new approach based on Transfer Learning (TL) for domain adaptation, with three main contributions: 1) consolidating two TL approaches, sequential transfer learning and multi-task learning, by pretraining on a resource-rich domain and fine-tuning on a low-resource domain in a multi-task fashion; 2) unifying independent datasets to overcome the intricacy of multi-task training from different datasets; and 3) conducting a set of individual-unit ablations, refining our understanding of how the individual neurons of lower-level tasks impact high-level tasks. We showed through empirical results on domain adaptation from News to Tweets that the proposed method MuTSPad allows a simultaneous benefit from similarities between domains and tasks, yielding better transfer learning performance on four NLP tasks and outperforming the state of the art on the Dependency Parsing task.
This study leaves several important open directions for future work. First, we plan to explore soft multi-task architectures. Second, we expect to explore the combination of supervised and unsupervised multi-tasking. We also plan to explore the benefit of MuTSPad's learned representations for higher-level NLP applications such as machine translation and sentiment analysis.