Compact Personalized Models for Neural Machine Translation

We propose and compare methods for gradient-based domain adaptation of self-attentive neural machine translation models. We demonstrate that a large proportion of model parameters can be frozen during adaptation with minimal or no reduction in translation quality by encouraging structured sparsity in the set of offset tensors during learning via group lasso regularization. We evaluate this technique for both batch and incremental adaptation across multiple data sets and language pairs. Our system architecture, which combines a state-of-the-art self-attentive model with compact domain adaptation, provides high-quality personalized machine translation that is both space and time efficient.


Introduction
Professional translators typically translate a collection of related documents drawn from a domain for which they have a set of previously translated examples. Domain adaptation is critical to providing high quality suggestions for interactive machine translation and post-editing interfaces. When many translators use the same shared service, the system must train and apply a personalized adapted model for each user. We describe a system architecture and training method that achieve high space efficiency, time efficiency, and translation performance by encouraging structured sparsity in the set of offset tensors stored for each user.
Effective model personalization requires both batch adaptation to an in-domain training set and incremental adaptation to the test set. Batch adaptation is applied when a user uploads relevant translated documents before starting to work. Incremental adaptation is applied when a user provides a correct translation of each segment just after receiving machine translation suggestions, and the system is able to train on that correction before generating suggestions for the next segment. This is referred to as a posteriori adaptation by Turchi et al. (2017). Our experiments compare both types of adaptation. There are cases for which incremental adaptation achieves better performance using fewer examples, as examples drawn directly from the test set are often highly relevant to subsequent parts of that test set. There are also cases for which the gains from both types of domain adaptation are additive.
In a personalized translation service, both translation time and adaptation time must be minimal. Interactive translation requires suggestions to be generated at typing speed, and incremental adaptation must occur within a few hundred milliseconds to keep up with a translator's typical workflow. The service can be expected to store models for a large number of users and to dynamically load and adapt models for many active users concurrently. Therefore, minimizing the number of parameters stored for each user's personalized model is important for reducing both storage requirements and latency. We achieve space and time efficiency by representing each user's model as an offset from the unadapted baseline parameters and encouraging most offset tensors to be zero during adaptation.
We show that group lasso regularization can be applied to a self-attentive Transformer model to freeze up to 75% of the parameters with minimal or no loss of adapted translation quality across experiments on four English→German data sets. We confirm these findings for six additional language pairs.

Related Work
There is extensive work on incremental adaptation from human post edits or simulated post edits, both for statistical machine translation (Green et al., 2013; Denkowski et al., 2014a,b; Wuebker et al., 2015) and neural machine translation (Peris et al., 2017; Turchi et al., 2017; Karimova et al., 2017). Both Turchi et al. (2017) and Karimova et al. (2017) apply vanilla fine-tuning algorithms. In addition to fine-tuning towards user corrections, the former applies a priori adaptation to retrieved data that is similar to the incoming source sentences. Peris et al. (2017) propose a variant of fine-tuning with passive-aggressive learning algorithms. In contrast to these papers, where all model parameters are possibly altered during training, this work focuses on space efficiency of the adapted models.
Regularization methods that promote or enforce sparsity have been previously used in the context of sparse feature models for SMT: Duh et al. (2010) presented an application of multi-task learning via $\ell_1/\ell_2$ regularization for feature selection in an N-best reranking task. A similar approach, employing $\ell_1/\ell_2$ regularization for feature selection and multi-task learning, was developed by Simianer et al. (2012) and Simianer and Riezler (2013) for tuning of SMT systems. Both works report improvements from regularization.
Techniques for enforcing sparse models using $\ell_1$ regularization during stochastic gradient descent optimization were previously developed for linear models (Tsuruoka et al., 2009).
An extremely space-efficient method for personalized model adaptation is presented by Michel and Neubig (2018). Here, adaptation is performed solely on the output vocabulary bias vector. Another notable approach for creating compact models is student-teacher training, or knowledge distillation (Kim and Rush, 2016). To the best of our knowledge, this has not been applied in a domain adaptation setting.

Self-Attentive Translation Model
The neural machine translation systems used in this work are based on the Transformer model introduced by Vaswani et al. (2017), which uses self-attention rather than recurrent or convolutional layers to aggregate information across words. In addition to its superior performance, its main practical advantage over recurrent models is faster training.
The Transformer follows the encoder-decoder paradigm. Source word vectors $x_1, \ldots, x_m$ are chosen from an embedding matrix $X_e$. A series of stacked encoder layers generates intermediate representations $z_1, \ldots, z_m$. Each layer of the encoder consists of two sub-layers: a multi-head self-attention layer that uses scaled dot-product attention over all source positions, followed by a feed-forward filter layer. Layer normalization (Ba et al., 2016), dropout (Srivastava et al., 2014), and residual connections (He et al., 2016) are applied to each sub-layer.
A series of stacked decoder layers produces a sequence of target word vectors $y_1, \ldots, y_n$. Each decoder layer has three sub-layers: self-attention, encoder-attention, and a filter. For target position $j$, the self-attention layer can attend to any previous target position $j' \in [1, j]$, with target words offset by one so that representations at $j$ can observe word $j-1$, but not word $j$. The encoder-attention layer can attend to the final encoder state $z_i$ for any source position $i \in [1, m]$. Observed target word vectors are chosen from an embedding matrix $Y_e$, and target word $j$ is predicted from $y_j$ via a softmax layer parameterized by an output projection matrix $Y_o$.
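The attention constraint on decoder self-attention and the one-position offset of target words can be illustrated with a minimal sketch. The function names `causal_mask` and `shift_right`, and the start-of-sentence id `bos_id`, are hypothetical and chosen only for illustration; they are not taken from the implementation described here.

```python
def causal_mask(n):
    """Return an n x n boolean mask over 1-based target positions.

    mask[j - 1][k - 1] is True when position j may attend to position k,
    i.e. when 1 <= k <= j.
    """
    return [[k <= j for k in range(1, n + 1)] for j in range(1, n + 1)]


def shift_right(target_ids, bos_id=0):
    """Offset decoder inputs by one position.

    The input at position j is word j-1, so the representation at j can
    observe word j-1 but not word j. bos_id is an assumed start symbol.
    """
    return [bos_id] + list(target_ids[:-1])


mask = causal_mask(3)
decoder_inputs = shift_right([5, 6, 7])
```

Each row of `mask` allows one more position than the row before it, which yields the lower-triangular pattern usually used for this purpose.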
The encoders in this work have six layers that have a self-attention sub-layer size of 256 and a filter sub-layer size of 512. Each filter performs two linear transformations and a ReLU activation:

$$f(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

The decoders in this work have three layers, and all sub-layer sizes are 256. The decoder sub-layers are simplified versions of those described in Vaswani et al. (2017): the filter sub-layers perform only a single linear transformation, and layer normalization is applied only once per decoder layer, after the filter sub-layer.
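The difference between the encoder and decoder filter sub-layers can be sketched in a few lines of plain Python. This is an illustration only: placing the ReLU between the two linear transformations follows the usual Transformer feed-forward layout and is an assumption here, and the helper names (`linear`, `relu`, `encoder_filter`, `decoder_filter`) are hypothetical.

```python
def linear(x, W, b):
    """Affine map: x has d_in entries, W is d_in x d_out, b has d_out entries."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def relu(v):
    """Element-wise max(0, v)."""
    return [max(0.0, a) for a in v]


def encoder_filter(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU in between (assumed ordering)."""
    return linear(relu(linear(x, W1, b1)), W2, b2)


def decoder_filter(x, W, b):
    """Simplified decoder filter: a single linear transformation."""
    return linear(x, W, b)


identity = [[1.0, 0.0], [0.0, 1.0]]
enc_out = encoder_filter([1.0, -2.0], identity, [0.0, 0.0], identity, [0.5, 0.5])
dec_out = decoder_filter([1.0, -2.0], identity, [0.0, 0.0])
```

With identity weights, the encoder filter zeroes the negative input before adding the second bias, while the decoder filter passes the input through unchanged.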
Unlike in Vaswani et al. (2017), none of $X_e$, $Y_e$, or $Y_o$ share parameters in our TensorFlow implementation. Baseline models are optimized with Adam (Kingma and Ba, 2015). For fine-tuning, stochastic gradient descent (SGD) is the most effective optimizer (Turchi et al., 2017). In our experiments, batch adaptation uses a batch size of 7000 words for 10 epochs, a fixed learning rate of 0.1, dropout of 0.1, and label smoothing with $\epsilon_{ls} = 0.1$ (Szegedy et al., 2016).
Incremental adaptation uses a batch size of one and a learning rate of 0.01. To ensure a strong adaptation effect within a single document, we set dropout and label smoothing to zero and perform up to three SGD updates on each segment. After each update, we measure the model perplexity on the current training example and continue with another update if the perplexity is still above 1.5.
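The per-segment update rule (at most three SGD updates, stopping early once perplexity on the segment is no longer above 1.5) can be sketched as a small control loop. The callbacks `sgd_step` and `perplexity`, and the `FakeModel` stand-in that replays a scripted perplexity sequence, are hypothetical and exist only to make the sketch runnable; a real system would call into the translation model instead.

```python
def adapt_on_segment(sgd_step, perplexity, max_updates=3, ppl_threshold=1.5):
    """Run up to max_updates SGD updates on one corrected segment.

    sgd_step():   performs one update on the segment (assumed callback).
    perplexity(): returns model perplexity on that same segment.
    Returns the number of updates that were performed.
    """
    updates = 0
    while updates < max_updates:
        sgd_step()
        updates += 1
        # Continue only while the perplexity is still above the threshold.
        if perplexity() <= ppl_threshold:
            break
    return updates


class FakeModel:
    """Stand-in whose perplexity follows a scripted sequence (demo only)."""

    def __init__(self, perplexities):
        self._ppls = list(perplexities)
        self._step = -1

    def sgd_step(self):
        self._step += 1

    def perplexity(self):
        return self._ppls[self._step]


fast = FakeModel([4.0, 1.2, 1.0])   # drops below 1.5 after two updates
slow = FakeModel([9.0, 6.0, 3.0])   # never drops below 1.5
fast_updates = adapt_on_segment(fast.sgd_step, fast.perplexity)
slow_updates = adapt_on_segment(slow.sgd_step, slow.perplexity)
```

Capping the number of updates bounds the adaptation latency per segment, which matters given the few-hundred-millisecond budget mentioned in the introduction.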

Offset Tensors
In a personalized translation service, adapted models need to be loaded quickly, so a space-efficient representation is critical for time efficiency as well. Production speed requirements using contemporary cloud hardware limit model sizes to roughly 10M parameters per user, while a high-quality baseline Transformer model typically requires 35M parameters or more. We propose to store the parameters of an adapted model as an offset from the baseline model. Each tensor is a sum $W = W_b + W_u$, where $W_b$ is from the baseline model and is shared across all adapted models, while the offset $W_u$ is specific to an individual user domain. Space efficiency is achieved by only storing $W_u$ for a subset of tensors and setting the rest of the offset tensors to zero.
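The offset representation can be sketched with dictionaries of flat parameter lists. This is a minimal illustration assuming tensors are identified by name; the names `apply_offsets`, `enc.layer0` and `dec.layer0` are hypothetical.

```python
def apply_offsets(baseline, offsets):
    """Compute W = W_b + W_u for every tensor of the adapted model.

    baseline: dict mapping tensor name -> flat list of floats (shared W_b).
    offsets:  dict mapping tensor name -> flat list of floats (user W_u),
              holding only the tensors whose offsets are stored.
    Tensors without a stored offset are treated as having a zero offset.
    """
    adapted = {}
    for name, w_b in baseline.items():
        w_u = offsets.get(name)
        if w_u is None:
            adapted[name] = list(w_b)
        else:
            adapted[name] = [b + u for b, u in zip(w_b, w_u)]
    return adapted


baseline = {"enc.layer0": [1.0, 2.0], "dec.layer0": [3.0]}
user_offsets = {"dec.layer0": [0.5]}          # only one tensor is stored
adapted = apply_offsets(baseline, user_offsets)
stored_params = sum(len(w) for w in user_offsets.values())
```

Only `user_offsets` has to be stored and loaded per user; the baseline is shared by all adapted models.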
One approach to achieving model sparsity is to manually partition the network into a small number of regions and systematically evaluate translation performance when storing offsets for only one region. We define five distinct regions, which are evaluated in isolation: outer layers (the first and last layers of both encoder and decoder), inner layers (all the remaining layers), the two embedding matrices $X_e$ and $Y_e$, and the output projection matrix $Y_o$. The latter three are each stored as a single matrix, and each contributes 10.3M parameters to the full model size in English→German. During adaptation, the embedding matrices are only updated for vocabulary present in the training examples, and so the offsets can be stored efficiently as a sparse collection of columns. The same principle can be applied to the output projection matrix by only updating parameters corresponding to vocabulary items that appear in the adaptation examples (denoted Sparse Output Proj. in Table 1).
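Storing vocabulary-matrix offsets as a sparse collection of columns can be sketched as a mapping from vocabulary id to offset column. The layout (one list per vocabulary item) and the function names `sparse_column_offsets` and `lookup_column` are assumptions made for illustration.

```python
def sparse_column_offsets(baseline_cols, adapted_cols, seen_ids):
    """Keep offset columns only for vocabulary ids seen in the adaptation data.

    baseline_cols / adapted_cols: lists of columns, one per vocabulary id.
    seen_ids: iterable of vocabulary ids that occur in the adaptation examples.
    """
    return {
        i: [a - b for a, b in zip(adapted_cols[i], baseline_cols[i])]
        for i in sorted(set(seen_ids))
    }


def lookup_column(baseline_cols, offsets, i):
    """Return the adapted column for id i, falling back to the baseline."""
    off = offsets.get(i)
    if off is None:
        return list(baseline_cols[i])
    return [b + o for b, o in zip(baseline_cols[i], off)]


baseline_cols = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
adapted_cols = [[1.0, 1.0], [2.5, 1.5], [3.0, 3.0]]
col_offsets = sparse_column_offsets(baseline_cols, adapted_cols, seen_ids=[1, 1])
```

The storage cost then grows with the number of distinct vocabulary items in the user's data rather than with the full vocabulary size.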
A second approach to achieving model sparsity is to use a procedure to select the subset of offset tensors that are stored. For example, we evaluate a simple policy that stores an offset for all tensors whose average change in parameter values is higher than a threshold. This set is selected on a development domain and held fixed for all other domains. We refer to this method as fixed adaptation.
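The threshold-based selection policy can be sketched as follows. The text does not specify how the average change is computed, so this sketch assumes the mean absolute difference between adapted and baseline parameters; the function name `select_tensors` and the tensor names are hypothetical.

```python
def select_tensors(baseline, adapted, threshold):
    """Return names of tensors whose mean absolute change exceeds threshold.

    baseline / adapted: dicts mapping tensor name -> flat list of floats.
    Using the mean absolute change is an assumption of this sketch.
    """
    selected = []
    for name, w_b in baseline.items():
        w_a = adapted[name]
        mean_change = sum(abs(a - b) for a, b in zip(w_a, w_b)) / len(w_b)
        if mean_change > threshold:
            selected.append(name)
    return selected


baseline = {"enc.layer0": [0.0, 0.0], "enc.layer1": [0.0, 0.0]}
adapted = {"enc.layer0": [0.01, -0.01], "enc.layer1": [0.001, 0.0]}
# Selected once on a development domain, then held fixed for other domains.
fixed_set = select_tensors(baseline, adapted, threshold=0.002)
```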

Tensor Selection via Group Lasso
A group sparse regularization penalty such as group lasso can be applied to the offset tensors for simultaneous regularization and tensor selection. This penalty drives entire offset tensors to zero, so that they do not need to be stored or loaded. We add a group lasso regularization term, weighted by $\lambda$, to the loss function (Scardapane et al., 2017). Here, each tensor corresponds to one group. $T$ denotes the set of all tensors in the model, $\tau \in T$ the set of all weights within a single tensor, and $\Delta\tau$ the size of the offset for $\tau$. Note that we are regularizing the difference between the parameters of the adapted model and the baseline model, rather than regularizing the full network parameters directly. In this way, we maintain the expressive power of the full network while minimizing the size of the adapted models. Group lasso regularization is equivalent to $\ell_1$ regularization when the group size is 1. Sparsity among groups is encouraged because the $\ell_1$ norm serves as a convex proxy for the $\ell_0$ norm, which would explicitly penalize the number of non-zero elements (Yuan and Lin, 2006).

To facilitate tensor selection, we define a threshold $\vartheta$ and clip to zero all offset tensors $\Delta T$ with average weight $\frac{1}{|T|} \sum_{\tau \in T} \Delta\tau < \vartheta$. Both the threshold $\vartheta$ and the regularization weight $\lambda$ were manually tuned on a development domain and set to $\vartheta = 10^{-4}$ and $\lambda = 10^{-6}$. We apply clipping to all tensors except the embedding and output projection matrices $X_e$, $Y_e$ and $Y_o$. As our production constraints allow us to retain only one of the three, we pre-select the sparse output projection as part of the model and exclude the embedding matrices from adaptation. This method will be denoted as Lasso.

Table 1: We evaluate changes to each region of the network separately. In combination with sparse output projection, we also evaluate a fixed selection of parameters chosen by thresholding and a set selected dynamically for each data set using group lasso. The two bottom rows show repetition rates in % for the source and target sides of the test data.
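The group lasso penalty on offset tensors and the subsequent clipping step can be sketched as below. The exact form of the penalty is an assumption: the sketch uses the common formulation $\lambda \sum_{\tau} \sqrt{|\tau|}\, \lVert \Delta\tau \rVert_2$, where the square-root group-size scaling is taken from the usual group lasso definition rather than from this text. The mean absolute offset used for clipping, the `keep` argument, and all function and tensor names are likewise hypothetical.

```python
import math


def group_lasso_penalty(offsets, lam):
    """Assumed penalty: lam * sum over tensors of sqrt(size) * L2 norm of the offset.

    offsets: dict mapping tensor name -> flat list of offset values.
    """
    total = 0.0
    for delta in offsets.values():
        l2_norm = math.sqrt(sum(d * d for d in delta))
        total += math.sqrt(len(delta)) * l2_norm
    return lam * total


def clip_offsets(offsets, threshold, keep=()):
    """Drop offset tensors whose mean absolute value is below threshold.

    Names listed in keep are never clipped (for example, a vocabulary
    matrix that is handled separately). Dropped tensors count as zero.
    """
    kept = {}
    for name, delta in offsets.items():
        mean_abs = sum(abs(d) for d in delta) / len(delta)
        if name in keep or mean_abs >= threshold:
            kept[name] = delta
    return kept


offsets = {"enc.layer0": [3.0, 4.0], "dec.layer0": [0.0, 0.0]}
penalty = group_lasso_penalty(offsets, lam=1.0)
stored = clip_offsets(offsets, threshold=1e-4)
stored_with_keep = clip_offsets(offsets, threshold=1e-4, keep=("dec.layer0",))
```

Because the L2 norm is not squared, the penalty tends to shrink whole groups to zero together, so thresholding afterwards removes entire tensors rather than scattered individual weights.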

Data
We first evaluate all techniques on an English→German Transformer network trained on 98M parallel sentence pairs. We apply byte pair encoding (Sennrich et al., 2016) separately to each language and obtain vocabularies with 40K unique tokens each. We refer to the unadapted model as Baseline. We evaluate on four domains. For development, we use a data set labeled User1 that was gathered from a user of the browser-based CAT (computer-aided translation) tool Lilt (https://lilt.com) and contains documents from the financial domain with 48K segments for batch adaptation and 1790 segments for testing and incremental adaptation. We further evaluate on a second user test set User2 (technical support, 31K segments for batch adaptation, 1000 test segments); the public Autodesk corpus (https://autodesk.app.box.com/Autodesk-PostEditing), where we select the first 20K segments for batch adaptation and the next 1000 segments for testing; and the IWSLT corpus (http://workshop2017.iwslt.org/; semi-technical talks), where we use all provided 206K sentences for batch adaptation and the dev2010 set (888 sentences) for testing.
The overall best performing compact adaptation technique, group lasso regularization, is further evaluated on six other language pairs trained using production data sets collected from Lilt's user base: English↔French, English↔Russian and English↔Chinese. Adaptation is performed on user data from various domains (technical manuals, finance, legal), each with 8K-10K segments for batch adaptation and 2000 segments for testing and incremental adaptation. Translation quality is evaluated using the cased BLEU (Papineni et al., 2002) measure.

Results
Table 1 shows English→German results. Full model adaptation, where all offsets are stored, improves over the baseline in all cases to various degrees. This full model contains only 25.8M parameters, as offsets for both embedding matrices are stored as sparse collections of columns for the vocabulary present in the adaptation data.
Next, we evaluate the impact of storing offsets only for one region at a time. We observe that among the three vocabulary matrices, the output projection $Y_o$ has the strongest impact on quality, which is not diminished by storing a sparse variant that is restricted only to observed vocabulary.

Table 2: Experimental results in BLEU (%) on six production language pairs (en→fr, fr→en, en→ru, ru→en, en→zh, zh→en, and their average). We compare the unadapted baseline model with a full model and the model with sparse output projection and group lasso, both with application of batch and incremental adaptation.
In addition, we evaluate two methods of choosing a subset of tensors procedurally. We first experiment with a fixed subset of tensor offsets that was chosen by selecting all tensors for which parameters were offset by more than 0.002 on average after batch adaptation on the User1 data set. This simple procedure approaches the performance of full model adaptation, but stores only 27% of its parameters. Dynamically selecting tensor offsets for each data set using group lasso regularization improves performance on 6 out of 8 data conditions. The combination of batch and incremental adaptation yields further improvements, with the exception of the User2 and IWSLT tasks, where incremental adaptation overall does not perform as well as batch adaptation. For these tasks, both test sets exhibit lower repetition rates (see footnote 5) than the test sets for the two other tasks (see the two bottom rows in Table 1). The User2 test set is furthermore a random sample of non-consecutive text from a translation memory, which is suboptimal for incremental learning.
Altogether, we are able to achieve translation performance similar to full model adaptation with 25% of the total network parameters. Note that due to the selection of entire tensors with group-wise regularization, there is nearly zero space overhead incurred by storing a sparse set of offset tensors. Table 2 confirms our main findings on six other language pairs. We observe average improvements of 14.3 BLEU with our final compact model, which compares to 15.5 BLEU for full model adaptation.

Conclusion
We describe an efficient approach to personalized machine translation that stores a sparse set of tensor offsets for each user domain. Group lasso regularization applied to the offsets during adaptation achieves high space and time efficiency while yielding translation performance close to a full adapted model, for both batch and incremental adaptation and their combination.

5 Repetition rates have been confirmed to be a suitable indicator for gains through incremental adaptation in numerous works (Wuebker et al., 2015; …).