Sparse and Decorrelated Representations for Stable Zero-shot NMT

Using a single encoder and decoder for all directions and training with English-centric data is a popular scheme for multilingual NMT. However, zero-shot translation under this scheme is vulnerable to changes in training conditions, as the model degenerates by decoding non-English texts into English regardless of the target specifier token. We show that enforcing both sparsity and decorrelation on intermediate encoder representations with the SLNI regularizer (Aljundi et al., 2019) effectively mitigates this problem, without performance loss in supervised directions. Notably, the effect of SLNI turns out to be unrelated to promoting language-invariance in encoder representations.


Introduction
As massive numbers of language pairs are supported in recent work on neural machine translation (NMT) (Aharoni et al., 2019; Arivazhagan et al., 2019b), obtaining training data becomes more of an issue. Due to the limited availability of parallel corpora, datasets for multilingual NMT are in many cases English-centric (English is on either the source side or the target side), or at least missing several pairs among the supported language set. This leads to a conspicuous need for models to support zero-shot translation, i.e., translating between language pairs for which no parallel training data exists.
A popular scheme for multilingual NMT is to share one encoder and one decoder across all trained directions, and to prepend a reserved token to the source text to indicate the target language. This model is capable of zero-shot translation: setting a target token that was unpaired with the source at training time still works (Wu et al., 2016; Ha et al., 2016; Johnson et al., 2017). However, while parameter-efficient, this scheme suffers from an exposure bias when trained with English-centric data: as non-English languages are always trained to be translated into English, they are wrongly decoded into English in zero-shot directions (Ha et al., 2016, 2017). In fact, zero-shot NMT under this scheme is extremely sensitive to hyperparameters, including batch size, dropout, and weight initialization (Gu et al., 2019). Fixing hyperparameters favorable to zero-shot directions would not be desirable, however, if such conditions hurt performance in supervised directions.
We utilize the Sparse coding through Local Neural Inhibition (SLNI) regularizer (Aljundi et al., 2019) to make the representations more robust to hyperparameters. SLNI was originally proposed as a continual learning technique that enforces representation sparsity and decorrelation. Here, we deviate from its previous use and focus on its single-stage effects during the joint multitask training of multiple language pairs. We show that enforcing representation sparsity and decorrelation together stabilizes zero-shot performance across various training conditions, without hurting performance in supervised directions.


Related Work
Gu et al. (2019) pointed out that target-language-specific characteristics should be determined only by the target indicator token, and that wrongly entangling them with source semantics causes degeneracy. To directly counter this issue, Ha et al. (2017) filtered entries other than the target language from the vocabulary. Gu et al. (2019) proposed back-translation as a way to explicitly avoid this entanglement by exposing the model to non-English sources paired with non-English targets. They also pretrained the decoder as a multilingual language model, which approximates marginalizing over all possible source sentences.
Such multi-stage methods are effective but can be burdensome; in contrast, our method involves no additional stage such as post-processing, pretraining, or dataset augmentation.
Meanwhile, Arivazhagan et al. (2019a) noted that regularizing the model to be language-invariant empirically alleviates degeneration. They aligned non-English latent representations to English ones by minimizing the cosine distance between parallel instances. Ji et al. (2019) built a universal encoder for both source and pivot languages, so that the encoder can handle zero-shot directions the way it handles pivot-target data. Pham et al. (2019) followed a similar track by learning language-invariant features, though via regularizing the decoder.
We also utilize a regularizer, but its effects turn out to be unrelated to producing language-invariant representations (see Section 5.2 for details).

Methods
SLNI (Aljundi et al., 2019) is a regularizer that promotes sparse and decorrelated representations by penalizing correlation between neurons. Inspired by lateral inhibition in biological neurons, this penalty is weighted by a Gaussian over the distance between neuron indices, so that each neuron mostly inhibits its local neighbors. SLNI was originally proposed as a continual learning technique to avoid catastrophic forgetting, the idea being that sparsity leaves enough free neurons that can change without disturbing the activations already learned.
With a batch of N inputs and 1 ≤ i, j ≤ C_l such that i ≠ j, where C_l is the dimension of a hidden layer l, the layer representation H^l = {h_i^{(n)}} is penalized by

\mathcal{L}_{\mathrm{SLNI}} = \sum_{l=1}^{L} \frac{1}{N} \sum_{i \neq j} \exp\!\left(-\frac{(i-j)^2}{2\sigma^2}\right) \left| \sum_{n=1}^{N} h_i^{(n)} h_j^{(n)} \right|,

where σ is the scale at which dimensions can affect each other, thus controlling sparsity, and L is the number of regularized layers. Combined with the canonical negative log-likelihood loss

\mathcal{L}_{\mathrm{MLE}} = -\frac{1}{N} \sum_{n} \log P(y^{(n)} \mid x^{(n)}),

the final objective to minimize is

\mathcal{L} = \mathcal{L}_{\mathrm{MLE}} + \lambda\, \mathcal{L}_{\mathrm{SLNI}},

where λ is the coefficient hyperparameter.
Adapting SLNI to Transformers. SLNI was originally applied to toy datasets in the vision domain and to rather simple models. Here, we adapt it to the real-world language domain and to Transformers (Vaswani et al., 2017). We apply SLNI on the encoder side. Outputs of every layer normalization (after both the self-attention and position-wise feed-forward sublayers) are subject to regularization. Unlike images, inputs for NMT have a time dimension; we flatten the batch and time dimensions into N, so that representations are regularized at the token level.
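The token-level adaptation above can be sketched as follows. This is a minimal NumPy illustration (not the authors' released code; the function name and shapes are our own), computing the Gaussian-weighted correlation penalty for one regularized layer:

```python
import numpy as np

def slni_penalty(H, sigma=1.0):
    """SLNI penalty for encoder activations H of shape (batch, time, channels):
    flatten batch and time so each token is one instance, then penalize
    Gaussian-weighted absolute correlations between distinct channels."""
    B, T, C = H.shape
    X = H.reshape(B * T, C)                    # token-level instances, N = B*T
    corr = X.T @ X / (B * T)                   # (C, C) correlation estimate
    idx = np.arange(C)
    # Gaussian locality weights: neuron i mainly inhibits nearby neurons j
    w = np.exp(-((idx[:, None] - idx[None, :]) ** 2) / (2.0 * sigma ** 2))
    np.fill_diagonal(w, 0.0)                   # exclude i == j variance terms
    return float(np.sum(w * np.abs(corr)))
```

One such penalty would be computed per regularized layer-norm output, summed over layers, and added to the MLE loss with weight λ.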

Settings
Dataset. We use only the English-centric parallel data from IWSLT2017, with English on one side and one of four languages {German (De), Italian (It), Dutch (Nl), Romanian (Ro)} on the other. This is a popular but potentially problematic scheme due to exposure bias: while non-English languages are always translated into English at training time, they have to be decoded into different languages (zero-shot) at inference time.
Training conditions. We experiment with four training conditions. The top three conditions are taken from Gu et al. (2019), where naive models reportedly degenerate under the latter two. We add the last condition as it improves performance on supervised directions.

Results
We show the translation quality of zero-shot and supervised NMT under all training conditions in Table 1. All results are generated with beam search, using beam size 4 and length penalty 1. Unlike the naive model, our model trained with SLNI shows stable performance across all training conditions, including the Compound setting, where the naive model completely degenerates. Furthermore, there is no evident performance decrease in supervised directions. As Table 2 shows, we can even pick the condition with the best supervised performance at a zero-shot cost of less than 1 BLEU (15.75, versus the 16.59 achievable under an alternative training condition).
This effect is consistently observed across multiple coefficients (Table 2): even with a small λ = 0.01, the largest drop (15.10, versus 16.02 under the Default setting) is less than 1 BLEU.
Exposure bias. To confirm that the BLEU decrease in zero-shot directions comes from the wrong-target-language problem, we measure the ratio of outputs wrongly decoded into English (En ratio in Table 2). We use an off-the-shelf fastText (Bojanowski et al., 2017) language identification model to determine which language each decoded output belongs to. En ratio aligns well with the BLEU decrease in naive models, and SLNI models consistently keep a low En ratio across all conditions.
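The En-ratio computation itself is straightforward once each output is labeled; a sketch (our own helper, assuming language codes have already been obtained, e.g. from fastText's pretrained language-identification model via its `predict` method):

```python
def english_ratio(predicted_langs):
    """Fraction of decoded outputs whose detected language is English.
    `predicted_langs` holds one language code per output sentence,
    e.g. as produced by a fastText language-identification model."""
    if not predicted_langs:
        return 0.0
    return sum(lang == "en" for lang in predicted_langs) / len(predicted_langs)
```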
To figure out whether SLNI has effects other than preventing the wrong target language, we also measure sentence-level BLEU for outputs correctly generated in the specified target language (Table 3). While sentence-level BLEU scores are in principle not directly comparable, the scores with and without SLNI are not drastically different from each other. This suggests that exposure bias is the very problem that our technique handles.

Neither Sparsity nor Decorrelation Suffices
We investigate the individual effects of sparsity and decorrelation. To promote sparsity only, we use an L1 penalty on the representations; to promote decorrelation only, we use the Decov (Cogswell et al., 2016) regularizer. Given the covariance matrix C of the representation values in a batch, Decov penalizes the squared Frobenius norm of C, subtracting the diagonal entries holding the variances to avoid driving individual representation values to zero (hence, no sparsity). Table 2 shows the results. Neither regularizer harms performance in supervised directions, and both show zero-shot performance competitive with the naive and SLNI models under the Default setting. Under alternative training conditions, however, Decov degenerates severely in all directions and for all coefficients. Results for L1 are more modest, but it still degrades at least under the Compound setting, even with the most favorable coefficient λ = 0.1.
These results suggest that zero-shot stabilizing effects of SLNI are compound effects of representation sparsity and decorrelation.
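For reference, the Decov baseline described above can be sketched as follows (our own NumPy illustration of the formulation in Cogswell et al., 2016; the function name is ours):

```python
import numpy as np

def decov_penalty(H):
    """Decov regularizer: squared Frobenius norm of the batch covariance
    matrix minus its diagonal, so only cross-dimension covariances
    (not variances) are penalized -- decorrelation without sparsity."""
    Hc = H - H.mean(axis=0, keepdims=True)    # center each dimension
    C = Hc.T @ Hc / H.shape[0]                # (C, C) covariance estimate
    return 0.5 * (np.sum(C ** 2) - np.sum(np.diag(C) ** 2))
```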

Effects on Encoder Representations
An implicit hypothesis of previous works that explicitly made the model invariant to the source language (Arivazhagan et al., 2019a; Ji et al., 2019; Pham et al., 2019) is that, given the same target language token, encoder representations of non-English inputs should be similar to those of English; if they are highly distinguishable, the decoder is more prone to instant degeneration, as it may easily decode non-English sources into English.
However, under the various conditions we experimented with, language-invariance of encoder representations does not appear to be the real key to proper zero-shot NMT. We ran the model of Arivazhagan et al. (2019a) with λ = 0.001, as they set on this dataset, and observed zero-shot degeneration under non-Default settings (Table 4). We therefore investigate whether SLNI enhances interlingual representation similarity. The results are negative, implying that SLNI's resolution of the entanglement issue does not involve learning language-invariant features.
Instance similarity. We use Singular Value Canonical Correlation Analysis (SVCCA) (Raghu et al., 2017), a technique for comparing vector representations in a way that is invariant to affine transformations. Following Kudugunta et al. (2019), we perform SVCCA on the encoder final outputs, mean-pooled over timesteps, using a multi-parallel evaluation set.
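The core of this comparison can be sketched compactly (our own simplified version; a faithful SVCCA would use the released implementation of Raghu et al., but the essence is SVD-based reduction of each view followed by CCA):

```python
import numpy as np

def svcca_score(X, Y, k=2):
    """Mean canonical correlation between two representation matrices
    (instances x dims), after SVD-reducing each to its top-k directions."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Xr = Ux[:, :k] * Sx[:k]                   # top-k SVD view of X
    Yr = Uy[:, :k] * Sy[:k]
    # CCA: canonical correlations are the singular values of Qx^T Qy,
    # where Qx, Qy are orthonormal bases of the reduced views
    Qx, _ = np.linalg.qr(Xr)
    Qy, _ = np.linalg.qr(Yr)
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(rho.mean())
```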
Space similarity. We use Representational Similarity Analysis (RSA) (Kriegeskorte et al., 2008) to compare the geometry of non-English encoder representations to that of English, given the same target language, again taking the encoder final outputs. In both tests, there is no evident difference across models: similarity scores of SLNI are not higher than those of other models, and no coherent pattern between SVCCA/RSA and BLEU scores is observed.
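RSA reduces to correlating pairwise-distance structure; a minimal sketch (our own, using Euclidean distances and Pearson correlation, whereas Kriegeskorte et al. also consider rank correlations):

```python
import numpy as np

def rsa_score(X, Y):
    """Pearson correlation between the representational dissimilarity
    matrices (pairwise Euclidean distances) of two sets of vectors."""
    def rdm(Z):
        d = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        iu = np.triu_indices(len(Z), k=1)     # upper triangle, no diagonal
        return d[iu]
    return float(np.corrcoef(rdm(X), rdm(Y))[0, 1])
```

Because only relative geometry is compared, the score is invariant to uniform scaling of either representation space.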

Conclusion
Without a specifically adjusted training condition, a single encoder-decoder model trained with English-centric data suffers from an exposure bias in which target-language specifier tokens are ignored. We resolve this problem with the SLNI regularizer, which enforces sparse and decorrelated representations. We show its effect as a silver-bullet technique that preserves performance over all language pairs, both zero-shot and supervised. The grounds for this success appear to be orthogonal to those of previous studies, proposing a new context to be incorporated for a more complete picture of robust zero-shot NMT.

A Experiment and Dataset Details
We use the FairSeq (Ott et al., 2019) framework to implement all models. We use the default settings of the Adam optimizer (Kingma and Ba, 2015) and the learning-rate schedule of Vaswani et al. (2017), with 8K warmup steps and 120K training steps. Label smoothing is applied with a rate of 0.1.
For the Default and AttDrop settings, all models are trained on a single NVIDIA Tesla V100 GPU. For the LargeBatch and Compound settings, we conduct distributed training on 4 GPUs; this means regularizer losses are computed on per-GPU batches of at most 2,400 tokens, not 9,600.
For SVCCA, we use the top 128 singular values of the 512 dimensions, as they explain over 50% of the variance.
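This cutoff corresponds to a variance-explained criterion on the singular-value spectrum; the selection can be sketched as (our own helper, taking squared singular values as the variance):

```python
import numpy as np

def top_k_for_variance(singular_values, threshold=0.5):
    """Smallest k such that the top-k singular values account for at
    least `threshold` of the total variance (squared spectrum)."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]
    var = s ** 2 / np.sum(s ** 2)
    return int(np.searchsorted(np.cumsum(var), threshold) + 1)
```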
We use a joint vocabulary for all languages, consisting of 40K BPE tokens (Sennrich et al., 2016) constructed with the SentencePiece package (Kudo and Richardson, 2018), following Al-Shedivat and Parikh (2019).

We obtain similar results when SLNI is applied to encoder layer-level outputs, i.e., only after the feed-forward layer normalizations. Still, as the best scores across all conditions fall below those of our designated locations for both zero-shot (16.59) and supervised (30.19) directions (and for generalizability as well), we conduct further experiments applying SLNI after both layer normalizations in the encoder layers.
Applying SLNI on the decoder side does not show the stabilizing effects.