Extreme Adaptation for Personalized Neural Machine Translation

Every person speaks or writes their own flavor of their native language, influenced by a number of factors: the content they tend to talk about, their gender, their social status, or their geographical origin. When attempting to perform Machine Translation (MT), these variations have a significant effect on how the system should perform translation, but this is not captured well by standard one-size-fits-all models. In this paper, we propose a simple and parameter-efficient adaptation technique that only requires adapting the bias of the output softmax to each particular user of the MT system, either directly or through a factored approximation. Experiments on TED talks in three languages demonstrate improvements in translation accuracy, and better reflection of speaker traits in the target text.


Introduction
The production of language varies depending on the speaker or author, be it to reflect personal traits (e.g. job, gender, role, dialect) or the topics that tend to be discussed (e.g. technology, law, religion). Current Neural Machine Translation (NMT) systems do not incorporate any explicit information about the speaker, and this forces the model to learn these traits implicitly. This is a difficult and indirect way to capture inter-personal variations, and in some cases it is impossible without external context (Table 1, Mirkin et al. (2015)).
Recent work has incorporated side information about the author such as personality (Mirkin et al., 2015), gender (Rabinovich et al., 2017) or politeness (Sennrich et al., 2016a), but these methods can only handle phenomena where there are explicit labels for the traits. Our work investigates how we can efficiently model speaker-related variations to improve NMT models.
In particular, we are interested in improving our NMT system given few training examples for any particular speaker. We propose to approach this task as a domain adaptation problem with an extremely large number of domains and little data for each domain, a setting where we may expect traditional approaches to domain adaptation that adjust all model parameters to be sub-optimal (§2). Our proposed solution involves modeling the speaker-specific variations as an additional bias vector in the softmax layer, where we either learn this bias directly, or through a factored model that treats each user as a mixture of a few prototypical bias vectors (§3).
We construct a new dataset of Speaker Annotated TED talks (SATED, §4) to validate our approach. Adaptation experiments (§5) show that explicitly incorporating speaker information into the model improves translation quality and accuracy with respect to speaker traits. 1

Problem Formulation and Baselines
In the rest of this paper, we refer to the person producing the source sentence (speaker, author, etc.) generically as the speaker. We denote as S the set of all speakers.

1 Data/code publicly available at http://www.cs.cmu.edu/~pmichel1/sated/ and https://github.com/neulab/extreme-adaptation-for-personalized-translation respectively.

arXiv:1805.01817v1 [cs.CL] 4 May 2018
The usual objective of NMT is to find parameters θ of the conditional distribution p(y | x; θ) to maximize the empirical likelihood. We argue that personal variations in language warrant decomposing the empirical distribution into |S| speaker-specific domains D_s and learning a different set of parameters θ_s for each. This setting exhibits specific traits that set it apart from common domain adaptation settings:
1. The number of speakers is very large. Our particular setting deals with |S| ≈ 1800 but our approaches should be able to accommodate orders of magnitude more speakers.
2. There is very little data (even monolingual, let alone bilingual or parallel) for each speaker, compared to millions of sentences usually used in NMT.
3. As a consequence of 1, we can assume that many speakers share similar characteristics, such as gender or social status, and as such may have similar associated domains.

Baseline NMT model
All of our experiments are based on a standard neural sequence-to-sequence model. We use one-layer LSTMs as the encoder and decoder and the concat attention mechanism described in Luong and Manning (2015). We share the parameters in the embedding and softmax matrix of the decoder as proposed in Press and Wolf (2017). All the layers have dimension 512 except for the attention layer (dimension 256). To make our baseline competitive, we apply several regularization techniques such as dropout (Srivastava et al., 2014) in the output layer and within the LSTM (using the variant presented in Gal and Ghahramani, 2016). We also drop words in the target sentence with probability 0.1 according to Iyyer et al. (2015) and implement label smoothing as proposed in Szegedy et al. (2016) with coefficient 0.1. Appendix A provides a more thorough description of the baseline model.
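As an illustration of the label smoothing mentioned above, the following minimal sketch builds the smoothed target distribution in the style of Szegedy et al. (2016): the gold word keeps most of the probability mass and the coefficient (0.1 here) is spread uniformly over the vocabulary. The function names are hypothetical, and the exact variant used in the baseline (described in Appendix A) may differ.

```python
def smoothed_targets(gold_index, vocab_size, epsilon=0.1):
    """Return a target distribution mixing the one-hot gold label
    with a uniform distribution over the vocabulary (assumed variant)."""
    uniform_mass = epsilon / vocab_size
    dist = [uniform_mass] * vocab_size
    dist[gold_index] += 1.0 - epsilon
    return dist


def cross_entropy(target_dist, log_probs):
    """Cross-entropy between a target distribution and model log-probabilities."""
    return -sum(q * lp for q, lp in zip(target_dist, log_probs))


# Toy vocabulary of 4 words, gold word at index 1.
targets = smoothed_targets(gold_index=1, vocab_size=4, epsilon=0.1)
print(targets)
```

With a realistic vocabulary of 40,000 words, each non-gold word receives a very small share of the mass, which mainly discourages the model from becoming over-confident.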

Baseline adaptation strategy
As mentioned in §2, our goal is to learn a separate conditional distribution p(y | x, s) and parametrization θ_s to improve translation for speaker s. The usual way of adapting from general domain parameters θ to θ_s is to retrain the full model on the domain-specific data (Luong and Manning, 2015). Naively applying this approach in the context of personalizing a model for each speaker, however, has two main drawbacks:
Parameter cost: Maintaining a set of model parameters for each speaker is expensive. For example, the model in §2.1 has ≈47M parameters when the vocabulary size is 40k, as is the case in our experiments in §5. Assuming each parameter is stored as a 32-bit float, every speaker-specific model costs ≈188MB. In a production environment with thousands to billions of speakers, this is impractical.
Overfitting: Training each speaker model with very little data is a challenge, necessitating careful and heavy regularization (Miceli Barone et al., 2017) and an early stopping procedure.
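The parameter cost quoted above follows from simple arithmetic, sketched here in Python. The parameter count and float width come from the text; the function name and the example speaker counts are only illustrative.

```python
PARAMS_PER_MODEL = 47_000_000  # ~47M parameters with a 40k vocabulary (from the text)
BYTES_PER_PARAM = 4            # one 32-bit float per parameter


def full_model_storage_bytes(num_speakers):
    """Storage needed if every speaker gets a full copy of the model."""
    return num_speakers * PARAMS_PER_MODEL * BYTES_PER_PARAM


for num_speakers in (1, 1_800, 1_000_000):
    megabytes = full_model_storage_bytes(num_speakers) / 1e6
    print(f"{num_speakers:>9} speakers: {megabytes:,.0f} MB")
```

Even at the scale of this paper's dataset (about 1800 speakers), storing one full model per speaker already requires hundreds of gigabytes.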

Domain Token
A more efficient domain adaptation technique is the domain token idea used in Sennrich et al. (2016a) and Chu et al. (2017): introduce an additional token marking the domain in the source and/or the target sentence. In our experiments, we add a token indicating the speaker at the start of the target sentence for each speaker. We refer to this method as the spk token method in the following. Note that in this case the only speaker-specific parameters are a single embedding vector (of dimension 512 in our experiments) per speaker. However, the resulting domain embeddings are non-trivial to interpret (i.e. it is not clear what they tell us about the domain or speaker itself).
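The spk token idea can be sketched as a small preprocessing step. The token format `<spk:...>` and the helper names below are assumptions made for illustration; the text only specifies that a speaker token is added at the start of the target sentence.

```python
def speaker_token(speaker_id):
    """Hypothetical token format marking a speaker."""
    return f"<spk:{speaker_id}>"


def add_speaker_token(target_tokens, speaker_id):
    """Prepend the speaker token to a tokenized target sentence."""
    return [speaker_token(speaker_id)] + list(target_tokens)


def speaker_token_vocab(speaker_ids):
    """Map each speaker token to an index, so it can receive its own embedding."""
    tokens = sorted({speaker_token(s) for s in speaker_ids})
    return {tok: i for i, tok in enumerate(tokens)}


example = add_speaker_token(["bonjour", "à", "tous", "."], "jane_doe")
print(example)
```

Each speaker token then gets one row in the decoder embedding matrix, which is where the per-speaker embedding vector mentioned above comes from.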

Speaker-specific Vocabulary Bias
In NMT models, the final choice of which word to use in the next step t of translation is generally performed by the following softmax equation:

$$p(w_t) = \mathrm{softmax}(E_T o_t + b_T) \tag{1}$$

where o_t is predicted in a context-sensitive manner by the NMT system and E_T and b_T are the weight matrix and bias vector parameters respectively. Importantly, b_T governs the overall likelihood that the NMT model will choose particular vocabulary. In this section, we describe our proposed methods for making this bias term speaker-specific, which provides an efficient way to allow for speaker-specific vocabulary choice.

Figure 1: Graphical representation of our different adaptation models for the softmax layer. From top to bottom is the base softmax, the full bias softmax and the fact bias softmax.

Full speaker bias
We first propose to learn speaker-specific parameters for the bias term in the output softmax only. This means changing Eq. 1 to

$$p(w_t) = \mathrm{softmax}(E_T o_t + b_T + b_s) \tag{2}$$

for speaker s. This only requires learning and storing a vector equal to the size of the vocabulary, which is a mere 0.09% of the parameters in the full model in our experiments. In effect, this greatly reduces the parameter cost and the concerns of overfitting cited in §2.2. This model is also easy to interpret as each coordinate of the bias vector corresponds to a log-probability on the target vocabulary. We refer to this variant as full bias.
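To make the effect of the speaker bias concrete, here is a minimal pure-Python sketch of an output softmax with an optional speaker-specific bias added to the logits. The matrix and vector values are toy numbers chosen for illustration, and the function names are hypothetical; this is not the DyNet implementation used in the experiments.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(E_T, o_t, b_T, b_s=None):
    """softmax(E_T o_t + b_T), plus the speaker bias b_s when it is given."""
    logits = []
    for i, row in enumerate(E_T):
        z = sum(w * h for w, h in zip(row, o_t)) + b_T[i]
        if b_s is not None:
            z += b_s[i]
        logits.append(z)
    return softmax(logits)


# Toy setting: 3-word vocabulary, 2-dimensional decoder output.
E_T = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b_T = [0.0, 0.0, 0.0]
o_t = [1.0, 2.0]
b_s = [0.0, 0.0, 1.5]  # this speaker favours word 2

p_base = output_distribution(E_T, o_t, b_T)
p_spk = output_distribution(E_T, o_t, b_T, b_s)
print(p_base)
print(p_spk)
```

Because the bias is added in log space, it rescales the probability of each word by a speaker-specific factor while leaving the relative odds between unaffected words unchanged.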

Factored speaker bias
The biases for a set of speakers S on a vocabulary V can be represented as a matrix:

$$B = \begin{bmatrix} b_1 \\ \vdots \\ b_{|S|} \end{bmatrix} \in \mathbb{R}^{|S| \times |V|} \tag{3}$$

where each row of B is one speaker bias b_s. In this formulation, the |S| rows are still linearly independent, meaning that B is high rank. In practical terms, this means that we cannot share information among users about how their vocabulary choices co-vary. We therefore factor B into the product of two low-rank matrices:

$$B = S \tilde{B} \tag{4}$$

where S is a matrix of speaker vectors of low dimension r (one row per speaker) and $\tilde{B}$ is a matrix of r speaker-independent biases. Here, the bias for each speaker is a mixture of r "centroid" biases (the rows of $\tilde{B}$) with r speaker "weights". This reduces the total number of parameters allocated to speaker adaptation from |S||V| to r(|S| + |V|). In our experiments, this corresponds to using between 99.38 and 99.45% fewer parameters than the full bias model depending on the language pair, with r parameters per speaker. In this work, we will use r = 10.
We provide a graphical summary of our proposed approaches in Figure 1.
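The factored bias and its parameter savings can be illustrated with a short sketch. The speaker bias is computed as a weighted sum of r centroid biases, and the parameter counts use |S| ≈ 1800, |V| = 40,000 and r = 10 as approximate values taken from the text; the exact speaker and vocabulary counts differ per language pair, so the resulting percentage is only indicative. All function names are hypothetical.

```python
def speaker_bias(speaker_weights, centroid_biases):
    """b_s as a mixture of r centroid biases weighted by r speaker weights."""
    vocab_size = len(centroid_biases[0])
    return [
        sum(w * centroid[j] for w, centroid in zip(speaker_weights, centroid_biases))
        for j in range(vocab_size)
    ]


def adaptation_param_counts(num_speakers, vocab_size, rank):
    """Speaker-adaptation parameters for the full and factored bias models."""
    full = num_speakers * vocab_size
    factored = rank * (num_speakers + vocab_size)
    return full, factored


# Toy mixture: r = 2 centroids over a 2-word vocabulary.
toy_bias = speaker_bias([0.5, 0.5], [[1.0, 0.0], [0.0, 2.0]])

# Approximate setting of the paper (assumed round numbers).
full, factored = adaptation_param_counts(num_speakers=1_800, vocab_size=40_000, rank=10)
savings = 1 - factored / full
print(toy_bias, full, factored, f"{savings:.2%}")
```

With these round numbers, the factored model uses roughly 0.6% of the adaptation parameters of the full bias model, in line with the range reported above.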

Speaker Annotated TED Talks Dataset
In order to evaluate the effectiveness of our proposed methods, we construct a new dataset, Speaker Annotated TED (SATED), based on TED talks, with three language pairs: English-French (en-fr), English-German (en-de) and English-Spanish (en-es), along with speaker annotation.
The dataset consists of transcripts directly collected from https://www.ted.com/talks, and contains roughly 271K sentences in each language distributed among 2324 talks. We pre-process the data by removing sentences that don't have any translation or are longer than 60 words, lowercasing, and tokenizing (using the Moses tokenizer (Koehn et al., 2007)). Some talks are only partially translated, or not translated at all, in some of the languages (in particular there are fewer translations in German than in French or Spanish); we therefore remove any talk with fewer than 10 translated sentences in each language pair.
The data is then partitioned into training, validation and test sets. We split the corpus such that the test and validation splits each contain 2 sentence pairs from each talk, thus ensuring that all talks are present in every split. Each sentence pair is annotated with the name of the talk and the speaker. Table 2 lists statistics on the three language pairs. This data is made available under the Creative Commons license, Attribution-Non Commercial-No Derivatives (or the CC BY-NC-ND 4.0 International, https://creativecommons.org/licenses/by-nc-nd/4.0/legalcode). All credit for the content goes to the TED organization and the respective authors of the talks. The data itself can be found at http://www.cs.cmu.edu/~pmichel1/sated/.
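The per-talk split described above can be sketched as follows. The text states only that validation and test each receive 2 sentence pairs per talk; choosing those pairs at random with a fixed seed, and the dictionary layout of each sentence pair, are assumptions made for this illustration.

```python
import random
from collections import defaultdict


def split_by_talk(pairs, held_out_per_talk=2, seed=0):
    """Split sentence pairs so that every talk appears in train, valid and test.

    `pairs` is a list of dicts with at least a "talk" key (assumed layout).
    """
    rng = random.Random(seed)
    by_talk = defaultdict(list)
    for pair in pairs:
        by_talk[pair["talk"]].append(pair)

    train, valid, test = [], [], []
    for talk in sorted(by_talk):
        items = by_talk[talk][:]
        rng.shuffle(items)  # random choice of held-out pairs (assumption)
        valid.extend(items[:held_out_per_talk])
        test.extend(items[held_out_per_talk:2 * held_out_per_talk])
        train.extend(items[2 * held_out_per_talk:])
    return train, valid, test


# Toy corpus: 3 talks with 10 sentence pairs each.
toy_pairs = [
    {"talk": f"talk{t}", "src": f"src {t}-{i}", "tgt": f"tgt {t}-{i}"}
    for t in range(3)
    for i in range(10)
]
train, valid, test = split_by_talk(toy_pairs)
print(len(train), len(valid), len(test))
```

Since talks with fewer than 10 translated sentences are removed beforehand, every remaining talk keeps at least 6 sentence pairs for training under this scheme.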

Experiments
We run a set of experiments to validate the ability of our proposed approach to model speaker-induced variations in translation.

Experimental setup
We test three models: base (a baseline ignoring speaker labels), full bias and fact bias. During training, we limit our vocabulary to the 40,000 most frequent words. Additionally, we discard any word appearing fewer than 2 times. Any word that doesn't satisfy those conditions is replaced with an UNK token. 5 All our models are implemented with the DyNet framework, and unless specified we use the default settings therein. We refer to Appendix B for a detailed explanation of the training process. We translate the test set using beam search with beam size 5.
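The vocabulary restriction described above can be sketched with a frequency counter. The `<unk>` spelling and the helper names are assumptions; whether the UNK token itself counts toward the 40,000-word limit is not specified in the text, and here it is added on top.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the UNK token


def build_vocab(tokenized_sentences, max_size=40_000, min_count=2):
    """Keep the `max_size` most frequent words that occur at least `min_count` times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    kept = [word for word, count in counts.most_common(max_size) if count >= min_count]
    return set(kept) | {UNK}


def apply_vocab(sentence, vocab):
    """Replace every out-of-vocabulary word with the UNK token."""
    return [tok if tok in vocab else UNK for tok in sentence]


toy_corpus = [["a", "b", "a"], ["a", "c", "b"]]
vocab = build_vocab(toy_corpus)
print(apply_vocab(["a", "c", "d"], vocab))
```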

Does explicitly modeling speaker-related variation improve translation quality?
Table 3 shows final test scores for each model with statistical significance measured with paired bootstrap resampling (Koehn, 2004). As shown in the table, both proposed methods give significant improvements in BLEU score, with the biggest gains in English to French (+0.99) and smaller gains in German and Spanish (+0.74 and +0.40 respectively). Reducing the number of parameters with fact bias gives slightly better (en-fr) or worse (en-de) BLEU score, but in those cases the results are still significantly better than the baseline.

Table 3: Test BLEU. Scores significantly (p < 0.05) better than the baseline are written in bold.

5 Recent NMT systems also commonly use sub-word units (Sennrich et al., 2016b). This may influence the result, either negatively (less direct control over high-frequency words) or positively (more capacity to adapt to high-frequency words). We leave a careful examination of these effects for future work.
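The significance test mentioned above, paired bootstrap resampling (Koehn, 2004), can be sketched generically: test sets are resampled with replacement, both systems are scored on the same resample, and the fraction of resamples where the candidate fails to beat the baseline serves as an estimate of the p-value. In this sketch the corpus-level metric is passed in as a function, and the demo uses a simple average of made-up per-sentence scores as a stand-in; it is not BLEU, and the number of resamples is an arbitrary choice.

```python
import random


def paired_bootstrap(stats_sys, stats_base, corpus_metric, num_samples=1000, seed=0):
    """Estimate how often the candidate system fails to beat the baseline.

    `stats_sys` and `stats_base` hold per-sentence statistics for the same
    test sentences; `corpus_metric` turns a list of them into one score.
    """
    assert len(stats_sys) == len(stats_base)
    rng = random.Random(seed)
    n = len(stats_sys)
    failures = 0
    for _ in range(num_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        score_sys = corpus_metric([stats_sys[i] for i in idx])
        score_base = corpus_metric([stats_base[i] for i in idx])
        if score_sys <= score_base:
            failures += 1
    return failures / num_samples


def average(scores):
    """Stand-in corpus metric for the demo (not BLEU)."""
    return sum(scores) / len(scores)


# Made-up per-sentence scores: the candidate is better on every sentence.
base_scores = [3, 5, 2, 4, 6, 1, 3, 4]
sys_scores = [s + 1 for s in base_scores]
p_value = paired_bootstrap(sys_scores, base_scores, average)
print(p_value)
```

For BLEU, the per-sentence statistics would be n-gram match counts and lengths, and `corpus_metric` would aggregate them into a corpus-level score before comparing the two systems.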
However, BLEU is not a perfect evaluation metric. In particular, we are interested in evaluating how much of the personal traits of each speaker our models capture. To gain more insight into this aspect of the MT results, we devise a simple experiment. For every language pair, we train a classifier (continuous bag-of-n-grams; details in Appendix C) to predict the author of each sentence on the target language part of the training set. We then evaluate the classifier on the ground truth and the outputs from our 3 models (base, full bias and fact bias).
The results are reported in Figure 2. As can be seen from the figure, it is easier to predict the author of a sentence from the output of speaker-specific models than from the baseline. This demonstrates that explicitly incorporating information about the author of a sentence allows for better transfer of personal traits during translation, although the difference from the ground truth demonstrates that this problem is still far from solved. Appendix D shows qualitative examples of our model improving over the baseline.

Further experiments on the Europarl corpus
One of the quirks of the TED talks is that the speaker annotation correlates with the topic of their talk to a high degree. Although the topics that a speaker talks about can be considered as a manifestation of speaker traits, we also perform a control experiment on a different dataset to verify that our model is indeed learning more than just topical information. Specifically, we train our models on a speaker-annotated version of the Europarl corpus (Rabinovich et al., 2017), on the en-de language pair. We use roughly the same training procedure as the one described in §5.1, with a random train/dev/test split since none is provided in the original dataset. Note that in this case, the number of speakers is much lower (747) whereas the total size of the dataset is bigger (≈300k).
We report the results in Table 4. Although the difference is less salient than in the case of SATED, our factored bias model still performs significantly better than the baseline (+0.83 BLEU). This suggests that even outside the context of TED talks, our proposed method is capable of improvements over a speaker-agnostic model.

Related work
Domain adaptation techniques for MT often rely on data selection (Moore and Lewis, 2010; Li et al., 2010; Wang et al., 2017), tuning (Luong and Manning, 2015; Miceli Barone et al., 2017), or adding domain tags to NMT input (Chu et al., 2017). There are also methods that fine-tune parameters of the model on each sentence in the test set (Li et al., 2016), and methods that adapt based on human post-edits (Turchi et al., 2017), although these follow our baseline adaptation strategy of tuning all parameters. There are also partial update methods for transfer learning, albeit for the very different task of transfer between language pairs (Zoph et al., 2016).
Pioneering work by Mima et al. (1997) introduced ways to incorporate information about speaker role, rank, gender, and dialog domain for rule-based MT systems. In the context of data-driven systems, previous work has treated specific traits such as politeness or gender as a "domain" in domain adaptation models and applied adaptation techniques such as adding a "politeness tag" to moderate politeness (Sennrich et al., 2016a), or doing data selection to create gender-specific corpora for training (Rabinovich et al., 2017). The aforementioned methods differ from ours in that they require an explicit signal (gender, politeness, etc.) for which labeling (manual or automatic) is needed, and also handle a limited number of "domains" (≈ 2), whereas our method only requires annotation of the speaker, and must scale to a much larger number of "domains" (≈ 1,800).

Table 4: Test BLEU on the Europarl corpus. Scores significantly (p < 0.05) better than the baseline are written in bold.

Conclusion
In this paper, we have explained and motivated the challenge of modeling the speaker explicitly in NMT systems, then proposed two models to do so in a parameter-efficient way. We cast this problem as an extreme form of domain adaptation and showed that adapting even a small proportion of parameters (the softmax bias, < 0.1% of all parameters) allowed the model to better reflect personal linguistic variations through translation.
We further showed that the number of parameters specific to any person could be reduced to as low as 10 while still retaining better scores than a baseline for some language pairs, making it viable in a real-world application with potentially millions of different users.