Soft Representation Learning for Sparse Transfer

Transfer learning is effective for improving performance on related tasks, with multi-task learning (MTL) and cross-lingual learning (CLL) as important instances. This paper argues that hard-parameter sharing, which hard-codes the layers shared across different tasks or languages, cannot generalize well when sharing with a loosely related task. Such a case, which we call sparse transfer, may actually hurt performance, a phenomenon known as negative transfer. Our contribution is to use adversarial training across tasks to "soft-code" shared and private spaces, preventing the shared space from becoming too sparse. In CLL, our proposed architecture also addresses the additional challenge of dealing with low-quality input.


Introduction
Transfer learning in neural networks has been applied in recent years to improve the performance of related tasks, for example, 1) multi-task learning (MTL) over different tasks (with labeled data available for all tasks) and 2) cross-lingual learning (CLL) over different languages (but the same task), with labeled data available only in the source language. In both settings, one of the most common strategies is hard-parameter sharing, as shown in Figure 1a, which shares the hidden layers, which we will call the shared layer, across tasks. This approach works well when tasks are closely related, i.e., when most features are invariant across tasks. Otherwise, in what we call sparse transfer, transferring between loosely related tasks often hurts performance, a phenomenon known as negative transfer. We elaborate on this problem in the MTL and CLL scenarios.
First, for MTL, the shared space is reported to be sparse in an architecture with one shared encoder (Sachan and Neubig, 2018), when shared by K (e.g., K > 2) loosely related tasks. To address this problem, as shown in Figure 1b, recent models (Liu et al., 2017; Lin et al., 2018) divide the features of different tasks into task-invariant and task-dependent latent spaces, which we will call shared and private spaces from this point on. However, since such an approach still hard-codes shared and private features, deciding which subsets of tasks should share encoders in many-task settings, among all possible combinations of tasks, is a non-trivial design problem (Sanh et al., 2019).
Second, for CLL, the given task in the source language (with rich resources) is transferred to target languages without training resources. For the latter, machine-translated resources are fed to the shared encoder instead (Schwenk and Douze, 2017; Conneau et al., 2018). When translation is perfect, the shared space would be dense: for example, the English training pair with an entailment relationship, "Because it looked so formidable" and "It really did look wonderful", can be translated into Chinese sentences with the same meaning, preserving the label. Meanwhile, its translation into "因为它看起来那么可怕" (Because it looks so scary) and "它真的看起来很棒" (It really looks great) fails to preserve the entailment relationship, and makes the shared space sparse.
As a unified solution for both problems, we propose soft-coding approaches that adapt in the following novel ways.
First, for MTL, we propose Task-Adaptive Representation learning using Soft-coding, namely TARS, in which shared and private features are both mixtures of features. Specifically, as shown in Figure 1c, TARS begins as a generic sharing framework using one common shared encoder, but also feeds its paired task-specific layers into a Mixture-of-Experts (MoE) module (Shazeer et al., 2017; Guo et al., 2018), which captures soft-private features as a weighted combination of all task-dependent features, where a gating network G in Figure 1c decides on output weights for each task. On top of this basic architecture, TARS softly shares features balanced by two conflicting auxiliary losses: one eliminates private features from the shared space, which decreases generalization across tasks, while the other keeps the shared space "dense" with soft-private features, a form of adversarial training. This balancing prevents the shared space from becoming too sparse to generalize to every task, even when K > 2.
Second, for CLL, we propose Cross-lingual AdverSarial Examples, namely CASE. Compared to Figure 1c, task-specific private layers no longer exist in Figure 1d, because CLL deals with a single task across multiple languages. Instead, to address the additional challenge of refining low-quality input, we add a Refiner. Specifically, once the source language is translated into the target language, CASE moves the noisy representation on the target side back toward the source-side space, in the form of an adversarial example, and uses this as additional training data for the task classifier. However, this refinement may have adverse effects (Yeo et al., 2018), for which a policy network P in Figure 1d decides whether or not to refine.
To demonstrate the effectiveness and flexibility of our soft-coding approaches, we evaluate TARS on five different datasets covering diverse scenarios and CASE on the cross-lingual natural language inference (XNLI) dataset with 15 languages (including low-resource languages such as Swahili and Urdu), and show that TARS and CASE outperform existing hard-coding approaches.

Problem Statement
Formally, we assume the existence of K datasets $\{D_k\}_{k=1}^{K}$, where each $D_k$ contains $|D_k|$ data samples for classification task k. Specifically,

$$D_k = \{(x_i^k, y_i^k)\}_{i=1}^{|D_k|}, \quad (1)$$

where $x_i^k$ and $y_i^k$ denote a sentence (or pair) and its corresponding label for task k. In CLL, $D_k$ is given only for one language, for which we create a new dataset $\tilde{D}_k = \{(\tilde{x}_i^k, y_i^k)\}$, where $\tilde{x}_i^k$ is translated, using neural machine translation (NMT), for training task k in another language. Transfer learning aims to improve classification by learning these K tasks in parallel. Thus, our objective is to learn a sentence (or pair) representation $x^k$ per task k, while taking into account the correlation among related tasks.
Specifically, given an input sequence $x^k = \{w_1^k, w_2^k, ..., w_T^k\}$ with length T, we aim to learn a sentence representation $x^k$ for the entire sequence as follows: $x^k = \text{Encoder}(\{w_1^k, w_2^k, ..., w_T^k\})$. Following (Conneau et al., 2017), the final output representation $x^k$ is ultimately fed into a corresponding classifier, which consists of multiple fully connected layers culminating in a softmax layer, i.e., $\hat{y}^k = \text{softmax}(W^k x^k + b^k)$. The parameters of the network are trained to minimize the loss $L_{task}$ between the predicted and true distributions on all the tasks as follows:

$$L_{task} = \sum_{k=1}^{K} L(\hat{y}^k, y^k), \quad (2)$$

where $L(\hat{y}^k, y^k)$ denotes a typical cross-entropy loss for each task k.
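As a minimal sketch of the objective above (our illustration, not the released code; shapes and the encoder are placeholders), the per-task softmax head and the summed cross-entropy can be written in NumPy as follows:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multitask_loss(reps, labels, params):
    # L_task = sum_k L(y_hat^k, y^k): cross-entropy summed over the K tasks,
    # each task scoring its own sentence representation with its own head
    total = 0.0
    for x, y, (W, b) in zip(reps, labels, params):
        y_hat = softmax(W @ x + b)   # y_hat^k = softmax(W^k x^k + b^k)
        total += -np.log(y_hat[y])   # cross-entropy for the true label
    return total
```

In practice each `x` in `reps` would come from the shared/private encoders described next; here it is just a vector.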

Baseline: Hard-code Approach
As overviewed in Section 1, the success of transfer learning depends on the sharing scheme in the latent feature space. Existing architectures differ in how they group the shared features to maximize sharing, as illustrated in Figure 1. We group the existing approaches into the following two categories.
Base I: Fully-Shared Model (FS) As shown in Figure 1a, the Fully-Shared (FS) model adopts a single shared encoder S-Encoder to extract features generalized for all the tasks. For example, given two tasks k and m, all features $s^k$ of task k are expected to be shared by task m and vice versa, i.e., $s^k = \text{S-Encoder}(\{w_1^k, w_2^k, ..., w_T^k\}; \theta_s)$, where $\theta_s$ represents the parameters of the shared encoder. In the FS model, $s^k$ is equivalent to the $x^k$ fed into the classifiers.
Base II: Shared-Private Model (SP) As Figure 1b shows, the Shared-Private (SP) model consists of two modules: (1) the underlying shared encoder S-Encoder, responsible for capturing task-invariant features, and (2) the private encoder P-Encoder, which extracts task-dependent features, i.e., $p^k = \text{P-Encoder}(\{w_1^k, w_2^k, ..., w_T^k\}; \theta_p^k)$, where $\theta_p^k$ represents the parameters of each private encoder. Then, both the shared representation $s^k$ and the private representation $p^k$ are concatenated to construct the final sentence representation: $x^k = s^k \oplus p^k$. These hard-code approaches greatly reduce the risk of overfitting when capturing all of the tasks simultaneously, but have the caveat that the ability of the shared space to model task-invariant features can be significantly reduced (Sachan and Neubig, 2018). We empirically show that our observations are consistent with this in Section 5.2.
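To make the SP wiring concrete, here is a toy NumPy sketch of the shared/private split and the final concatenation; the mean-pool-and-project encoder is a deliberate simplification of the paper's BiLSTM, and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(tokens, W):
    # stand-in for S-Encoder / P-Encoder: mean-pool the tokens, then project
    return np.tanh(W @ tokens.mean(axis=0))

T, d_word, d_hid = 5, 8, 4
tokens = rng.normal(size=(T, d_word))   # one sentence of task k
W_s = rng.normal(size=(d_hid, d_word))  # theta_s: shared across all tasks
W_p = rng.normal(size=(d_hid, d_word))  # theta_p^k: private to task k

s_k = encode(tokens, W_s)               # task-invariant features
p_k = encode(tokens, W_p)               # task-dependent features
x_k = np.concatenate([s_k, p_k])        # x^k = s^k (+) p^k, fed to the classifier
```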

Soft-code Approach for MTL: TARS
Motivated by the limitations of hard-coding approaches, our proposed model, TARS, begins with the FS model but progressively adapts to task characteristics, as shown in Figure 1c.
Soft-Private Module TARS first models the multiple tasks as an MoE, where each task has an individual expert network, and weighs the experts for different task examples. To be specific, TARS feeds the shared features $s^k$ into an individual P-Encoder for each task, to encode task-dependent features as follows:

$$p^k = \text{P-Encoder}(s^k; \theta_p^k). \quad (3)$$

Simultaneously, a gating network decides on output weights for each expert (i.e., each individual P-Encoder). Specifically, the gating network G, parameterized by $\theta_g$, maps the shared representation of the current task to the correct expert, so that each expert learns task-dependent features for that task, estimating the task label of $s^k$:

$$G(s^k) = \text{softmax}(W_g s^k + b_g), \quad (4)$$

where $W_g$ and $b_g$ are a trainable weight matrix and a bias, respectively. Based on the above, the final soft-private representation $p(s^k)$ is a mixture of all expert outputs with respect to $s^k$:

$$p(s^k) = \sum_{m=1}^{K} G(s^k)_m \, p^m. \quad (5)$$

Soft-Shared Module In order to learn task-invariant features, inspired by (Liu et al., 2017), TARS adopts an adversarial network, which contains a feature extractor and a task discriminator D. The basic idea is to learn features that cannot be distinguished by D. Specifically, D aims to discriminate which task a feature comes from, while the feature extractor (e.g., S-Encoder) tries to fool D so that it cannot identify the task of the feature, which is hence task-invariant. More formally,

$$L_{adv} = \min_{\theta_s} \left( \lambda \max_{\theta_d} \sum_{k=1}^{K} \sum_{i=1}^{|D_k|} d_i^k \log D(s_i^k; \theta_d) \right), \quad (6)$$

where $d_i^k$ is the ground-truth task label, $\theta_d$ is the parameter of the task discriminator D, and $\lambda$ is a hyperparameter. As mentioned before, such adversarial learning has been verified to be very effective for extracting task-invariant features. However, keeping the shared space too pure inevitably leads to sparseness, for which we additionally introduce the density constraint $L_{dense}$.
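The soft-private gating and mixture described above can be sketched in NumPy as follows (our illustration under assumed shapes; linear-`tanh` experts stand in for the paper's P-Encoders):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def soft_private(s_k, experts, W_g, b_g):
    # gate G(s_k) = softmax(W_g s_k + b_g) produces one weight per expert;
    # p(s_k) is the convex combination of the per-task expert outputs
    g = softmax(W_g @ s_k + b_g)
    outs = np.stack([np.tanh(W_e @ s_k) for W_e in experts])
    return g, g @ outs   # mixture: sum_m g_m * expert_m(s_k)

d, K = 4, 3
s_k = rng.normal(size=d)
experts = [rng.normal(size=(d, d)) for _ in range(K)]
W_g, b_g = rng.normal(size=(K, d)), np.zeros(K)
g, p_sk = soft_private(s_k, experts, W_g, b_g)
```

Because the gate always outputs a full distribution over experts, an unseen task still receives a (soft) combination of all trained experts, which is what enables the zero-shot behavior discussed later.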
Specifically, the objective of the density constraint $L_{dense}$ is to push the soft-private features from the private embeddings closer to the shared ones, so that the shared space is encouraged to be dense rather than too sparse, resolving the sparseness of the shared space. The soft-shared features thus become more informative. Formally,

$$L_{dense} = \sum_{k=1}^{K} \| s^k - p(s^k) \|_2, \quad (7)$$

where $\| \cdot \|_2$ is the mean squared L2 norm.
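A minimal sketch of this constraint (our reading of the "mean squared L2 norm" as a per-task mean squared distance, summed over tasks):

```python
import numpy as np

def l_dense(shared, soft_private):
    # density constraint: mean squared distance pulling each soft-private
    # mixture p(s^k) toward the shared features s^k, summed over the K tasks
    return sum(np.mean((s - p) ** 2) for s, p in zip(shared, soft_private))
```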
Training and Inference Lastly, the soft-private and soft-shared representations $p(s^k)$ and $s^k$ are concatenated, i.e., $x^k = s^k \oplus p(s^k)$, and fed to all the networks in TARS with the following loss:

$$L_{TARS} = L_{task} + L_{adv} + L_{dense}. \quad (8)$$

TARS is trained with backpropagation, and adopts a gradient reversal layer (Ganin and Lempitsky, 2015) to address the minimax optimization problem. Note that, unlike hard-code approaches, zero-shot learning is also possible, since TARS can adapt to a new target task (e.g., cross-domain or cross-lingual) at inference time by aligning it with the trained expert gate, which decides what combination of the experts to use in Eq. (4) and Eq. (5).

Soft-code Approach for CLL: CASE
This section revises $L_{dense}$ in Eq. (7) for the CLL scenario. Note that, in CLL, the sparse space corresponds to mistranslated low-resource language input, which we call a pseudo-sentence. The goal of $L_{dense}$ is thus replaced by softly correcting the representation to align better ($L_{align}$) while preserving the semantics ($L_{sim}$). For that purpose, we propose a Refiner replacing $L_{dense}$ with these two new losses.

Refinement by Perturbation We first discuss how to refine pseudo-sentences by a perturbation Δ for higher learning effectiveness. Related ideas ensure the robustness of a model by finding a Δ that changes a prediction, i.e., f(x) = y while f(x + Δ) ≠ y (Goodfellow et al., 2015). Inspired by this, CASE explores whether incorrect translations that may cause wrong predictions in the target language can be moved back to change predictions. To this end, based on the basic architecture of the variational auto-encoder (VAE) (Kingma and Welling, 2013), CASE models a neural refiner to refine low-quality representations. Specifically, as shown in Figure 1d, CASE first encodes pseudo-parallel sentences into the shared space, e.g., $(x, \tilde{x})$. Then, the refiner, which consists of two encoding feed-forward networks, converts the representation into two distribution variables $\mu(\tilde{x})$ and $\sigma(\tilde{x})$, the mean and standard deviation for the pseudo representation. Unlike a traditional VAE minimizing the latent loss that measures how closely the latent variables match a unit Gaussian, i.e., $KL(\mathcal{N}(\mu(\tilde{x}), \sigma(\tilde{x})), \mathcal{N}(0, 1))$, CASE enhances the latent loss with the pseudo-parallel representation, to generate a pseudo-adversarial example $\tilde{z}$ that roughly follows a representation x from the resource-rich space:

$$L_{align} = KL(\mathcal{N}(\mu(\tilde{x}), \sigma(\tilde{x})), \mathcal{N}(x, 1)). \quad (9)$$

In order to optimize the KL divergence, CASE applies a simple reparameterization trick (Kingma and Welling, 2013). Using this trick, the pseudo-adversarial example $\tilde{z}$ is generated from the mean and standard deviation vectors, i.e., $\tilde{z} = \mu(\tilde{x}) + \sigma(\tilde{x}) \cdot \epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$.
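The two ingredients above, the reparameterized sample and the closed-form KL between diagonal Gaussians, can be sketched in NumPy as follows (our illustration of the standard formulas, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma):
    # z~ = mu(x~) + sigma(x~) * eps with eps ~ N(0, 1): sampling stays
    # differentiable w.r.t. mu and sigma (Kingma and Welling, 2013)
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_diag_gaussian(mu_q, sigma_q, mu_p, sigma_p):
    # closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) for the
    # diagonal case, summed over dimensions
    var_q, var_p = sigma_q ** 2, sigma_p ** 2
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
```

Setting `mu_p = x` and `sigma_p = 1` recovers the alignment term; setting `mu_p = 0` recovers the traditional VAE prior term.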
This constraint not only allows us to generate an informative representation, but also improves the generalization of our network toward x (e.g., English) with higher confidence. CASE then aims at preserving the original semantics in the latent space, for which it includes a reconstruction loss, a mean squared error that measures how accurately the pseudo-adversarial example $\tilde{z}$ preserves its original semantics, i.e., $L_{sim} = \sum_{|\tilde{D}|} \| \tilde{z} - \tilde{x} \|_2$. As a result, $\tilde{z}$ is fed into the classifier, and the overall loss of CASE is defined as follows:

$$L_{CASE} = L_{task} + L_{adv} + L_{align} + L_{sim}. \quad (10)$$

Selective Refinement Lastly, CASE aims to refine only when the perturbation can improve the translation. In other words, if the translation is already good, CASE avoids refinement by parameterizing the refinement with a weight α set to be near zero. Not refining correct translations is important: more than half of translations are correct, as reported by (Yeo et al., 2018), so refinement may lower their quality. For computing α, CASE adopts a policy network P, which consists of a feed-forward network $P(x; \theta_p) = \text{softmax}(W_p x + b_p)$, to identify wrong translations by capturing the difference in domain distribution. The policy is then calculated as follows:

$$\alpha = KL(P(\tilde{x}), P(x)), \quad (11)$$

in which P(x) outputs a domain distribution of x, and CASE estimates α as the difference between the two distributions (i.e., a KL divergence). The final loss function is defined factoring in α: $L_{CASE} = L_{task} + L_{adv} + \alpha(L_{align} + L_{sim})$.
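A minimal sketch of the selective-refinement weight (our illustration; the linear policy network and its shapes are assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def refinement_weight(x, x_pseudo, W_p, b_p):
    # policy network P(x) = softmax(W_p x + b_p) outputs a domain distribution;
    # alpha = KL(P(x_pseudo) || P(x)) is near 0 when the translation already
    # looks like the source domain, gating off the refinement losses
    p_src = softmax(W_p @ x + b_p)
    p_pse = softmax(W_p @ x_pseudo + b_p)
    return float(np.sum(p_pse * np.log(p_pse / p_src)))
```

The non-negativity of KL makes α a natural gate: it is exactly zero for representations the policy cannot tell apart.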

Experimental Settings
To show the effectiveness of our proposed approaches, we conduct experiments in both multi-task and cross-lingual settings.

Multi-task Dataset For multi-task learning, we use five different datasets on Natural Language Inference (NLI) and Paraphrase Identification (PI) tasks: SNLI (Bowman et al., 2015), MNLI, and CNLI, for single-domain English, multi-domain English, and Chinese NLI respectively; and QQP (Csernai et al., 2017) and LCQMC (Liu et al., 2018) for English and Chinese PI.

Cross-lingual Dataset We use the cross-lingual natural language inference (XNLI) dataset (Conneau et al., 2018), covering 15 different languages, for cross-lingual learning. The dataset is a version of MNLI where the 2,500 dev and 5,000 test sets have been translated (by humans) into 14 languages. For the training datasets, the English training data is translated into each target language by NMT.

Implementation Details For all encoders, we adopt the BiLSTM-max (Conneau et al., 2017) model, and the pre-trained word embeddings we use are 300-dimensional fastText word embeddings (Bojanowski et al., 2017). Following (Conneau et al., 2018), the BiLSTM hidden state size is set to 256, and the Adam optimizer with a learning rate of 0.001 is applied. The learning rate is decreased by a factor of 0.85 when the target dev accuracy does not improve. As in (Conneau et al., 2018), for the text classification networks, we use a feed-forward neural network with one hidden layer of 128 hidden units and a dropout rate of 0.1, to measure the relatedness of a given premise and hypothesis. The hyperparameter λ is empirically set to 0.005. All our implementation is available at github.com/haejupark/soft.

Table 1: Accuracy over MTL with two-source tasks
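The dev-accuracy-triggered decay described in the implementation details can be sketched as follows (a hypothetical helper of ours, not taken from the released code):

```python
def decayed_lr(lr, dev_acc_history, factor=0.85):
    # decay the learning rate by `factor` whenever the latest target dev
    # accuracy fails to beat the best accuracy seen before it
    if len(dev_acc_history) >= 2 and dev_acc_history[-1] <= max(dev_acc_history[:-1]):
        return lr * factor
    return lr
```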

Experimental Result I: MTL
Using (Liu et al., 2017) as hard-code baselines, we apply adversarial training (and the so-called orthogonality constraints) to the FS and SP models, namely AFS and ASP. These techniques enhance the distinct nature of shared and private features.

Two-source MTL Table 1 shows the performance on three text classification tasks. The first row shows the results of "single task", and the other rows show the results of "multiple tasks" from the corresponding MTL models trained with two source tasks. More concretely, (SNLI+MNLI) and (*NLI+QQP) are for cross-domain and cross-task classification, respectively. In this table, we can see that TARS achieves higher accuracy than all sharing-scheme baselines in all scenarios, surpassing multi-task learning (i.e., ASP) as well as single-task learning. These results show that our soft-code approach also works well in typical MTL settings with two source tasks, though they are not our targeted sparse scenario.

Three-source MTL In Table 2, MTL models use three source tasks (SNLI+MNLI+QQP), where the first row shows the results of "single task". We first test SNLI, MNLI, and QQP as supervised target tasks. From the results, we can see that TARS outperforms all baselines, including MoE, a variant of TARS excluding the two auxiliary losses. We also include the recent work of (Ma et al., 2018), which explicitly models task relationships with an expert per task (which is not desirable for a new task). This suggests that the synergetic effect of the soft-private and soft-shared modules in TARS is critical to outperforming the other baselines. Specifically, AFS and ASP show "negative transfer", an inherent challenge of MTL. For example, ASP with three source tasks achieves 82.23% and 66.92% accuracy in SNLI and MNLI respectively, which is lower than the 82.28% and 67.39% accuracy of its best two-source performance.
In contrast, TARS overcomes such challenges, for example, 83.12% > 82.67% and 68.24% > 67.79% in SNLI and MNLI, except for QQP, which can be further improved by asymmetric MTL techniques (Lee et al., 2016).
To investigate how TARS helps transfer knowledge across tasks, Figures 2a and 2b contrast the feature representations of the shared space in ASP and TARS, in the two- and three-source settings respectively. First, for two sources, ASP and TARS are comparable, capturing nearly identical distributions of the two tasks, which is desirable for transfer learning. Second, for three sources, the shared space of ASP shows two quite distinct distributions (task-dependent), while TARS keeps the two distributions comparable (and task-invariant).

In Table 2, we also test zero-shot learning with two target tasks, CNLI and LCQMC, excluding their own training data (except for the first-row single task). As ASP requires target-task labels to train its private encoders, we compare TARS only with AFS and MoE; TARS shows the best performance. As shown in Figure 3, TARS covers sentences in CNLI and LCQMC using its gating network, which identifies that the unknown target tasks are most similar to SNLI and QQP, respectively: the highest weights are assigned to these two, but the other source tasks also contribute, with non-zero weights.

Table 3 shows our results on 14 XNLI languages. Following (Conneau et al., 2018), we divide the models into the following three categories: 1) Translate train, where the English NLI training set is translated into each XNLI language and a language-specific NLI classifier is trained for each language; 2) Translate test, where all dev and test sets of XNLI are translated into English and the English NLI classifier is applied; and 3) Zero-shot learning, where the English classifier is directly applied to the target language without any translation. We also report the results of the XNLI baselines (Conneau et al., 2018), a supervised cross-lingual MTL model that combines the $L_{adv}$ loss using pseudo-parallel data (Liu et al., 2017), the multilingual BERT (Devlin et al., 2018), and the recent work of (Artetxe and Schwenk, 2018).
Table 3: Accuracy over 14 XNLI languages (test set accuracy). We report results for translation baselines, multi-task learning baselines, and zero-shot baselines. Overall best results are in bold, and the best in each group is underlined. All results marked * are from https://github.com/google-research/bert/blob/master/multilingual.md.

Experimental Result II: CLL
First, in Table 3, we can see that the BiLSTM model (Conneau et al., 2018) performs consistently better in Translate test than in Translate train for all languages, which means a single English model works better than training each target model with translated data. In contrast, multilingual BERT (Devlin et al., 2018) achieves its best results on Translate train, outperforming on most languages, suggesting that the generalization of BERT across languages is significantly better than that of the BiLSTM model.
Meanwhile, CASE significantly outperforms the BiLSTM and BiLSTM+MTL models in Translate train for all languages, and even outperforms BiLSTM in Translate test. Compared to the best-performing MTL baseline, CASE achieves improvements of 1.7% and 9.5% in Bulgarian (bg) and Urdu (ur), respectively. From these results, we observe that: 1) the improvements on low-resource languages (e.g., Swahili and Urdu) are more substantial than those on other languages; and 2) the selective refinement strategy consistently contributes to the performance improvement. These results show that CASE, by incorporating pseudo-adversarial examples as an additional resource, contributes to the robustness and generalization of the model.
Lastly, we show that CASE with the multilingual BERT model achieves the state of the art, and even significantly outperforms the supervised approach of (Artetxe and Schwenk, 2018), which enjoys the unfair advantage of extremely large amounts of parallel sentences. These results show that CASE, with the help of strong baselines, gets a significant boost in performance, particularly for the low-resource languages Swahili and Urdu, achieving improvements of 9.4% and 10.3%, respectively.

Robustness Analysis
In order to verify whether CASE is robust, inspired by (Goodfellow et al., 2015), we test whether models keep their predictions after changes to a sentence, as long as the meaning remains unchanged. For example, a given sentence can be paraphrased by replacing some words with their synonyms, and the models should give the same answer for the paraphrase.
Meanwhile, existing models, especially those overfitted to surface forms, are sensitive to such "semantic-preserving" perturbations. As human annotation for such perturbations is expensive, an automated approach (Alzantot et al., 2018) was studied for English, generating semantic-preserving adversaries that fool well-trained sentiment analysis and NLI models with success rates of 97% and 70%, respectively. In our problem setting of XNLI, we need such a generator (or generated resources) for each language. To this end, we identify three research questions:
• (RQ1) How hard is it to build a generator for a new language?
• (RQ2) Do the automatically generated perturbations preserve semantics?
• (RQ3) Is CASE more robust to such adversaries than the baselines?
Specifically, in this paper we focus on Chinese, as we could hire native-speaking volunteers to validate whether the automatically generated perturbations indeed preserve semantics. First, for RQ1, we leverage Chinese synonyms and antonyms to build counter-fitting vectors, as in (Mrkšić et al., 2016), to ensure the selected words are synonyms. Then, we slightly modify (Alzantot et al., 2018) to automatically generate Chinese perturbations for the NLI task. Following the convention of (Alzantot et al., 2018) for the NLI problem, we only add perturbations to the hypothesis, excluding the premise, and aim to divert the prediction result from entailment to contradiction, and vice versa. Table 4 shows an example of a generated adversarial example.

Table 4: Example of a generated adversarial example for the Chinese natural language inference task.
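The hypothesis-only substitution step can be sketched as follows; this toy version with a hand-given `synonym_map` (a hypothetical stand-in) only illustrates the convention, whereas the real generator selects replacements via counter-fitted vectors and a search procedure (Alzantot et al., 2018):

```python
def perturb_hypothesis(hypothesis, synonym_map):
    # toy semantic-preserving perturbation: swap each word for a synonym when
    # one is available; following the NLI convention, only the hypothesis is
    # perturbed, never the premise
    return [synonym_map.get(word, word) for word in hypothesis]
```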
For RQ2, we validate the automatically generated perturbations with native-speaking volunteers. First, we show the volunteers 500 samples to label as contradiction, neutral, or entailment; 84 percent of the responses matched the original ground truth. Second, we sample 500 pairs, each consisting of an original sentence and the corresponding adversarial example. Volunteers were asked to judge the similarity of each pair on a scale from 1 (very different) to 4 (very similar). The average rating is 2.12, which shows the performance of our implementation for Chinese perturbation is also competitive.
Lastly, for RQ3, we show the attack success rates on the generated adversarial examples in Table 5. For comparison, we include the single-task and MTL baselines. As shown in Table 5, CASE achieves a higher defense rate (or lower attack success rate) of 36.6%, while the baselines obtained 15.7% and 21.4%, respectively, which demonstrates that incorporating pseudo-adversarial examples is indeed helpful to the robustness of the model.


Related Work

Transfer Learning: Transfer learning enables effective knowledge transfer from the source to the target task. Early works mainly focused on shared representation methods (Liu et al., 2017; Tong et al., 2018; Lin et al., 2018), using a single shared encoder between all tasks while keeping several task-dependent output layers. However, the sparseness of the shared space, when shared by K tasks, was observed (Sachan and Neubig, 2018). In this paper, we study a soft-coding approach to overcome sparsity, leading to performance gains in MTL and CLL tasks. Closely related work is MMoE (Ma et al., 2018), which explicitly learns the task relationship by modeling a gating network for each task. Such work does not consider which combination of networks to use for a new task, while we differentiate ourselves by deciding such a combination for a new task based on its similarity to the source tasks.
Adversarial Example: Despite the success of deep neural networks, neural models are still brittle to adversarial examples (Goodfellow et al., 2015). Recently, adversarial examples have been widely incorporated into training to improve the generalization and robustness of models, using back-translated paraphrases (Iyyer et al., 2018), machine-generated rules (Ribeiro et al., 2018), black-box attacks (Alzantot et al., 2018), and white-box attacks (Ebrahimi et al., 2018). Inspired by these, we study pseudo-adversarial examples in the latent space to improve the robustness of the model. To the best of our knowledge, we are the first to propose pseudo-adversarial training in the latent space for transfer learning.

Conclusion
In this paper, we study the limitations of hard-parameter sharing in sparse transfer learning. We propose soft-code approaches to avoid the sparseness observed in MTL and CLL. We have demonstrated the effectiveness and flexibility of our soft-code approaches in extensive evaluations over MTL and CLL scenarios.