How Transferable are Neural Networks in NLP Applications?

Transfer learning is aimed to make use of valuable knowledge in a source domain to help the model performance in a target domain. It is particularly important to neural networks because neural models are very likely to be overfitting. In some fields like image processing, many studies have shown the effectiveness of neural network-based transfer learning. For neural NLP, however, existing studies have only casually applied transfer learning, and conclusions are inconsistent. In this paper, we conduct a series of empirical studies and provide an illuminating picture on the transferability of neural networks in NLP.


Introduction
Transfer learning, or sometimes known as domain adaptation,2 plays an important role in various natural language processing (NLP) applications, especially when we do not have a large enough dataset for the task of interest (called the target task T ).In such scenarios, we would like to transfer or adapt knowledge in datasets from other domains (called the source domains/tasks S) so as to mitigate the problem of overfitting and to improve performance.For traditional feature-rich or kernel-based models, researchers have developed a variety of elegant methods for domain adaptation; examples include EasyAdapt (Daumé III, 2007;Daumé III et al., 2010), instance weighting (Jiang and Zhai, 2007;Foster et al., 2010), struc-Recently, deep neural networks are emerging as the prevailing technical solution to almost every field in NLP.Although capable of learning highly nonlinear features, deep neural networks are very prone to overfitting (Peng et al., 2015) compared with traditional methods.Transfer learning therefore becomes even more important to neural models.Fortunately, neural networks can be trained in a transferable way by their incremental learning nature: we can directly use trained (tuned) parameters from a source task to initialize the network in the target task; alternatively, we may also train two tasks simultaneously with some parameters shared.But analysis is needed regarding the performance of neural transfer learning.
Existing studies have already shown some evidence of the transferability of neural features.For example, in image processing, low-level neural layers resemble much to Gabor filters or color blobs (Zeiler and Fergus, 2014;Krizhevsky et al., 2012); they can be transferred well to different tasks, e.g., different image classification tasks.Donahue et al. (2014) suggest that high-level layers are also transferable in general visual recognition; Yosinski et al. (2014) further investigate the transferability of neural layers in different levels of abstraction.
Although transfer learning is promising in image processing, conclusions appear to be less clear in NLP applications.Image pixels are low-level signals, which are generally continuous and less related to semantics.By contrast, natural language tokens are discrete: each word well reflects the thought of humans, but neighboring words do not share as much information as pixels in images do.Previous neural NLP studies have casually applied transferring techniques, but their results are not consistent.Collobert and Weston (2008) ap-ply multi-task learning to SRL, NER, POS, and CHK,3 but obtain only 0.04-0.21%error reduction4 (out of a base error rate of 16-18%).Bowman et al. (2015), however, improve a natural language inference task from an accuracy of 71.3% to 80.8% by initializing parameters with an additional dataset of 550,000 samples.Therefore, more systematic studies are needed to shed light on transferring neural networks in NLP applications.

Our Contributions
In this paper, we investigate the question "How transferable are neural networks in NLP applications?" We distinguish two scenarios of transfer: (1) transferring knowledge to a semantically equivalent task but with a different dataset, which is semantically equivalent in general; (2) transferring knowledge to a task that is semantically different but shares the same neural topology/architecture so that neural parameters can be transferred.We further distinguish two transfer methods: (1) using the parameters trained on S to initialize T (INIT), and (2) multi-task learning (MULT), i.e., training S and T simultaneously.(Please see Sections 2 and 4).
Specifically, our study mainly focuses on the following research questions: RQ1: How transferable are neural networks between two tasks with similar or different semantics in NLP applications?RQ2: How transferable are different layers of NLP neural models?RQ3: How transferable are INIT and MULT, respectively?What is the effect of combining these two methods?
We conducted extensive experiments over three datasets on classifying sentence pairs.We leveraged the widely-studied convolutional neural network (CNN) as our neural model.
Based on our experimental results, we have the following main observations, some of which are unexpected.
• Whether a neural network is transferable in NLP depends largely on how semantically similar the tasks are.The rest of this paper is organized as follows.Section 2 introduces the datasets that our model is transferred across; Section 3 details the neural architecture and experimental settings.We describe two approaches (INIT and MULT) for transfer learning in Section 4. We present experimental results in Sections 5-6, and have concluding remarks in Section 7.

Datasets
In our study, we used three open datasets as follows.
• Stanford Natural Language Inference (SNLI): a newly-released large dataset containing more than 550,000 sentence pairs.5 The task is to recognize whether a sentence can be entailed (E) from another sentence, or the two sentences are contradictory (C) to each other.Besides, there is also a third target label, indicating the two sentences are irrelevant, denoted as neutral (N).It should be noticed that in image or speech processing (Yosinski et al., 2014;Wang and Zheng, 2015), the input of neural networks is pretty much consists of raw signals; hence, low-level feature detectors are almost always transferable, even if Yosinski et al. (2014) manually distinguish artificial objects and natural ones in an image classification task.
Distinguishing semantic relatedness-which emerges from very low layers of either word embeddings or its successive layer-is specific to NLP and also a new insight of our paper.As we shall see in Sections 5 and 6 , the two scenarios lead to different results.

The Neural Model and Settings
As all the above datasets can be viewed as a classification task over sentence pairs, we may use a single neural model to solve the three problems in a unified manner.That is to say, the neural architecture is the same among different datasets, which makes it possible to investigate transfer learning regardless of whether the tasks are semantically equivalent.
Figure 1 depicts the basic neural network in our study.We leverage the widely-studied convolutional neural network (CNN) to capture the semantics of a single sentence; the convolution window size is 5. Then the vector representations of two sentences are combined by concatenation and fed to a hidden layer before the softmax output.Such a two-step strategy is called the "Siamese" structure (Bromley et al., 1993), and also used for various sentence pair modeling applications (Hu et al., 2014;Mou et al., 2015b).
In our experiments, the embeddings were pretrained by word2vec (Mikolov et al., 2013); all embeddings and hidden layers were 100 dimensional as in Bowman et al. (2015); we designated the relatively small dimension because of computational concerns.
We applied stochastic gradient descent with a mini-batch size of 50 for optimization.In each setting, we tuned the parameters as follows: learning rate from {1, 0.3, 0.1, 0.03}, power decay of learning rate from {fast, moderate, low} (defined by how much, after one epoch, the learning rate residual is: 0.1x, 0.3x, 0.9x, resp).We regularized our network by dropout with a rate from {0, 0.1, 0.2, 0.3}.Note that we might not run nonsensical settings, e.g., a larger dropout rate if the network has already been underfitting (i.e., accuracy has decreased when the dropout rate increases).However, we might try additional smaller learning rates, if needed, especially in transfer learning by INIT.We report the test performance associated with the highest validation accuracy.
To setup a baseline, we trained our model without transfer 5 times by different random parameter initializations.Results are reported in Table 2. Our model has achieved reasonable performance that is comparable to similar networks reported in the literature with all three datasets.Therefore, our implementation is fair and suitable for further study of transfer learning.Table 2: Accuracy (%) without transfer.We also include related models for comparison (Hu et al., 2014;Bowman et al., 2015), showing that our model has achieved comparable results; thus we are ready to investigate into transfer learning.
It should be mentioned that the goal of this paper is not to outperform state-of-the-art results; instead, we would like to conduct a fair comparison of different methods and settings for transfer learning in NLP.

Transfer Methods
Transfer learning aims to use knowledge in a source domain to aid the target domain.As neural networks are usually trained incrementally with gradient descent (or variant), it is straightforward to use gradient information in both source and target domains for optimization so as to accomplish knowledge transfer.Depending on how samples in source and target domains are scheduled, there are two main approaches to neural network-based transfer learning: • Parameter initialization (INIT).The INIT approach first trains the network on S, and then directly uses the tuned parameters to initialize the network for T .In the INIT approach, we may fix ( ♂ ) the parameters in the target domain (Glorot et al., 2011), i.e., no training is performed on T .But when labeled data are available in T , it would be better to finetune ( ) the parameters like Bowman et al. (2015).INIT is also related to unsupervised pretraining such as word embedding learning (Mikolov et al., 2013) and autoencoders (Bengio et al., 2007).In these approaches, parameters that are (pre)trained in an unsupervised way are transferred to initialize the model for a supervised task (Plank and Moschitti, 2013).However, our paper focuses on "supervised pretraining," which means we transfer knowledge from a labeled source domain.
• Multi-task learning (MULT).MULT, on the other hand, simultaneously trains samples in both domains (Collobert and Weston, 2008;Liu et al., 2016).The overall cost function is given by (1) where J T and J S are the individual cost function of each domain.(Both J T and J S are normalized by the number of training samples.)λ ∈ (0, 1) is a hyperparameter balancing the two domains.It is nontrivial to optimize Equation 1 in practice by gradient-based methods.One may take the partial derivative of J and thus λ goes to the learning rate (Liu et al., 2016), but the model is then vulnerable because it is likely to blowup for large learning rates (multiplied by λ or 1 − λ) and be stuck in local optima for small ones.Collobert and Weston (2008) alternatively choose a data sample from either domain with a certain probability (controlled by λ) and take the derivative for the data sample.In this way, domain transfer is independent of learning rates, but we may not be able to fully use the entire dataset of S if λ is large.Due to the limitation of time and space, we adopted the latter approach in our experiment for simplicity.(More in-depth analysis may be needed in future work.)Formally, our multi-task learning strategy is as follows.
1 Switch to T by prob.λ, or to S by prob. 1 − λ. 2 Compute the gradient of the next data sample in the particular domain.Further, INIT and MULT can be combined straightforwardly, and we obtain the third setting: • Combination (MULT+INIT).We first pretrain on the source domain S for parameter initialization, and then train S and T simultaneously.
From a theoretical perspective, INIT and MULT work in different ways.In the MULT approach, the source domain regulates the model by "aliasing" the error surface of the target domain; hence the neural network is less prone to overfitting.In INIT, T 's error surface remains intact.Before training on the target dataset, the parameters are initialized in such a meaningful way that they contain additional knowledge in the source domain.However, in an extreme case where T 's error surface is convex, INIT is ineffective because the parameters can reach the global optimum regardless of their initialization.Fortunately, deep neural networks usually have highly complicated, non-convex error surfaces.By properly initializing parameters with the knowledge of S, we can reasonably expect that the parameters are in a better "catchment basin," and that the INIT approach can transfer knowledge from S to T .

Results of Transferring by INIT
We first analyze how INIT behaves in NLP-based transfer learning.In addition to two different transfer scenarios regarding semantic relatedness as described in Section 2, we further evaluated two settings: (1) fine-tuning parameters , and (2) freezing parameters after transfer ♂ .Existing evidence shows that frozen parameters would generally hurt the performance (Peng et al., 2015), but it provides a more direct understanding how transferable the features are (because the factor of target domain optimization is ruled out).Therefore, we included this setting in our experiments.Moreover, we transferred parameters layer by layer to answer our second research question.Notice that the E and E H O settings 8 are inapplicable to SNLI→MSRP, because the output targets do not share same meanings and numbers of target classes.
Through Subsections 5.1-5.3,we initialized the parameters of SICK and MSRP with the ones corresponding to the highest validation accuracy of SNLI.In Subsection 5.4, we further investigated when the parameters are ready to be transferred during the training on S.

Overall Performance
Table 3 shows the main results of INIT.A quick observation is that using SNLI information significantly improves the SICK performance, which is not surprising and also reported in Bowman et al. (2015).
For MSRP, however, there is nearly no improvement in performance regardless of how we transfer the parameters.Although in previous studies, researchers have mainly drawn positive conclusions about transfer learning, we find a negative result similar to ours upon careful examination of Collobert and Weston (2008).In that paper, the authors report transferring NER, POS, CHK, and pretrained word embeddings improves the SRL task by 1.91-3.90%accuracy (out of 16.54-18.40%error rate), with gain mainly due to word embeddings.In the settings that use pretrained 8 Please refer to the caption of Table 3 Collobert and Weston (2008).
word embeddings (which is common in NLP), NER, POS, and CHK improve the SRL accuracy by 0.04-0.21%.Likewise, in the SNLI→MSRP experiment, the E H O setting yields a degradation of 0.2% (∼.5x std and not statistically significant).The incapability of transferring is also proved by locking embeddings and hidden layers (E ♂ H ♂ O ).We see in this setting, the test performance is even worse than majority-class guess.Further examining its training accuracy, which is 65.5%, we conclude that extracted features of SNLI by our CNN model are irrelevant to MSRP.
The above results are rather frustrating, indicating for RQ1 that neural networks may not be transferable to NLP tasks of different semantics.These results are different from those in the image processing domain, where feature detectors are almost always transferable (Donahue et al., 2014;Yosinski et al., 2014).

Layer-by-Layer Analysis
To answer RQ2, we next analyze the transferability of each layer.Since SNLI→MSRP is nontransferable, we mainly focus on SNLI→SICK in this section.
First ), we obtain an accuracy similar to the baseline as long as the output is not frozen.Likewise, in the finetuning setting, hidden-layer transfer improves the accuracy by another 5.3% on the basis of transferring word embeddings (which improves the accuracy by 0.1%).
Regarding MSRP, the embeddings are the only parameters that have been observed to be transferable, although the improvement (0.9% in accuracy, 1.8x std) is not large either.As the above studies use word2vec to pretrain word embeddings (E ) on the Wikipedia corpus (which is a common practice in the literature), a curious question is whether word embeddings are transferable, provided that they are randomly initialized.
To answer this question, we first trained SNLI again with word embeddings being randomly initialized, and then transferred the pretrained word embeddings (in a supervised manner on SNLI) to MSRP.Table 4 compares the result of transferring word embeddings (pretrained by supervised/unsupervied approaches) with nontransferring.As is seen, random initialization of word embeddings works fine for large datasets like SNLI, but performs very poorly in MSRP.
Transferring word embeddings that are pretrained purely by supervised objective on SNLI significantly improves the performance, but is still worse than word2vec pretraining by 1.8x std.Therefore, we conclude Word embeddings (purely pretrained by supervised approaches on S) are indeed transferable across semantically different tasks.When large unlabeled corpora are available, however, unsupervised pretraining of word embeddings like word2vec is preferable.

How does learning rate affect transfer?
Bowman et al. ( 2015) suggest that after transferring, a large learning rate may damage the knowledge stored in the parameters; in their paper, they transfer the learning rate information (AdaDelta) from S and T in addition to the parameters.Although the rule of the thumb is to choose all hyperparameters-including the learning rate-by validation, we are curious whether the above conjecture holds.Estimating a rough range of sensible hyperparameters can ease the burden of model selection; it also provides evidence to better understand how transfer learning actually works.
We plot the learning curves of different learning rates α in Figure 2 (SNLI→SICK, E H O ).(In the figure, no learning rate decay is applied.)As we see, with a large learning rate like α = 0.3, the accuracy increases fast and peaks at the 10 th epoch.Training with a small learning rate (e.g., α = 0.01) is slow, but its peak performance is comparable to large learning rates when iterated by, say, 100 epochs.Aside from our main conclusions, we have the following additional finding: In INIT, transferring learning rate information is not necessarily useful.A large learning rate does not damage the knowledge stored in the pretrained hyperparameters, but accelerates the training process to a large extent.In all, we may need perform validation to choose the learning rate if computational resources are available.

When is it ready to transfer?
In the above experiments, we transfer the parameters when they achieve the highest validation performance on S.This is a straightforward and intuitive practice.However, we may imagine that the parameters well-tuned to the source dataset may be too specific to it, i.e., the model overfits S and thus may underfit T .Another advantage of early transfer lies in computational concerns.If we manage to transfer model parameters after one or a few epochs on S, we can save much time especially when S is large.
We therefore made efforts in studying when the neural model is ready to be transferred.Figure 3a plots the learning curve of the source task SNLI.The accuracy increases sharply from epochs 1-5; later, it reaches a plateau but is still growing slowly until the 23 rd epoch gives the highest validation accuracy.(Results related to the validation set are not plotted in the figure).
We then transferred the parameters at different stages (epochs) of SNLI's training to SICK and MSRP (also with the setting E H O ).Their accuracies are plotted in Figure 3b.
Transferring SNLI to MSRP at any epoch appears to be frustratingly ineffective.No matter how the model performs on SNLI, it can hardly fit MSRP better than without transfer.The results rule out some undesirable factors which may cause the failure to transfer from SNLI to MSRP in Subsection 5.1.
The SNLI→SICK experiment produces interesting yet undesirable results.Using the second epoch of SNLI's training yields the highest transfer performance on SICK, i.e., 78.98%.However, the SNLI performance itself is comparatively low at that time (72.65% v.s. 76.26% at epoch 23).Later, the performance decreases gradually by 1-2%.Although the degradation is not significant, the tendency is reasonably perceptible, showing that by fitting S too well the parameters may be ineffective for T more or less.More evidence is needed in order to draw conclusions.Nevertheless, from Figure 3, we are reasonably safe to conclude When we apply INIT for transferring, only a few epochs over the source dataset are sufficient to capture transferable knowledge, although the source task performance has not been optimal.

MULT, and its Combination with INIT
To answer RQ3, we investigate how multi-task learning performs in transfer learning; we also analyze the effect of the combination of MULT and INIT.In this section, we applied the setting: sharing embeddings and hidden layers (denoted as E♥H♥O ), analogous to E H O in INIT.Note that sharing all parameters E♥H♥O♥ is not applicable to MSRP because of different output objectives; thus we did not apply this setting.
We also tried the combination of MULT and INIT, i.e., we used the pretrained parameters on SNLI to initialize the multi-task training of SNLI and SICK/MSRP.This setting could be visually represented by E ♥H ♥O .
In both MULT and MULT+INIT, we had a hyperparameter λ ∈ (0, 1) balancing the source and target tasks (defined in Section 4).λ was tuned with a granularity of 0.1.As a friendly reminder, λ = 1 refers to using T only; λ = 0 refers to using S only.After finding that λ = 0.1 yields the highest performance of MULT in the SNLI+SICK experiment (the thick blue line in Figure 4a), we further tuned the λ from 0.01 to 0.09 with a finegrained granularity of 0.02.
The results are shown in Figure 4. From the green curves in the second subplot, we see MULT (with or without INIT of SNLI) does not improve the accuracy of MSRP; its inability to transfer is cross-checked with INIT in Section 5.For SICK, on the other hand, transferability of the neural model is also consistently positive (blue curves in Figure 4a), supporting our conclusion to RQ1 that NLP neural transfer learning depends largely on how similar in semantics the source and target datasets are.
Moreover, we see that the peak performance of MULT is 79.6% when λ is 0.03, slightly higher than 79.0% by INIT in Figure 3b.If λ is large, (e.g., ≥ 0.5), the performance is generally similar to without transfer.Furthermore, we find that when λ is very small (say 0.01 in MULT or 0.01 and 0.03 in MULT+INIT), the performance drops sharply, as it fits S too much and thus underfits T.
The transfer performance of MULT+INIT (E ♥H ♥O ) remains high for different values of λ.As the parameters given by INIT have conveyed sufficient information about the source task, MULT+INIT consistently outperforms nontransferring (70.3%) by a large margin.Its peak performance 77.6% is achieved when λ = 0.6, which is sightly higher than for merely applying INIT (76.3% in the E H O setting), but sightly lower than for MULT only.As these qualitative results are not very significant and may vary for different tasks and models, we answer our RQ3 as follows.
In our experiment, MULT and INIT are generally comparable, and MULT is slightly better than INIT.We do not obtain further gain by combining MULT and INIT.9

Concluding Remarks
In this paper, we addressed the problem of transfer learning in neural network-based NLP applications.We conducted experiments on three datasets, showing that the transferability of neural NLP models depends on the semantic relatedness of the source and target tasks.We analyzed the behavior of different neural layers: word embeddings are transferable but the results are similar to (or worse than) word2vec; hidden layers are the main transferable features that improve the performance; and the output layer is generally unable to transfer.We also experimented with two transfer methods: parameter initialization (INIT) and multi-task learning (MULT); they generally perform similarly in our experiment.Additional findings are boxed in Section 5 and not repeated here.Our study provides insight on the transferability of neural NLP models; the results also help to better understand neural features in general.
How transferable are the conclusions in this paper?Although we had in total more than 3000 separate runs of experiments in this paper, we have to concede that empirical studies are subject to a variety of factors (e.g., models, tasks, datasets), and that conclusions may vary in different scenarios.Therefore, along with analyzing our own experimental data, we have also carefully collected related results in the literature, serving as additional evidence in answer to our research questions.As our results are generally consistent with previously reported ones, we think the generality of this work is fair and that the conclusions can be generalized to similar scenarios.
Future work.Our work also points out some future directions of research.We would like to analyze different MULT strategies; more efforts are also needed in developing an effective yet robust method for multi-task learning.

Figure 1 :
Figure 1: The neural model in our study.

Figure 3 :
Figure 3: (a) Learning curve of SNLI.(b) Accuracies of SICK and MSRP when parameters are transferred at a certain epoch during the training of SNLI.Dotted lines refer to non-transfer, which can be equivalently viewed as transferring before training on S, i.e., epoch = 0. Note that the x-axis shares across the two subplots.

Figure 4 :
Figure 4: Results of MULT and MULT+INIT, where we share word embeddings and hidden layers.The tasks are (a) SNLI+SICK, and (b) SNLI+MSRP.Dotted lines are the nontransfer setting; dashed lines are the INIT setting E H O , transferred at the peak performance of SNLI.

•
The output layer is mainly specific to the dataset and not transferable.