Towards Modeling the Style of Translators in Neural Machine Translation

One key ingredient of neural machine translation is the use of large datasets from different domains and resources (e.g. Europarl, TED talks). These datasets contain documents translated by professional translators using different but consistent translation styles. Despite that, the model is usually trained in a way that neither explicitly captures the variety of translation styles present in the data nor translates new data in different and controllable styles. In this work, we investigate methods to augment the state-of-the-art Transformer model with translator information that is available in part of the training data. We show that our style-augmented translation models are able to capture the style variations of translators and to generate translations with different styles on new data. Indeed, the generated variations differ significantly, by up to 4.5 BLEU points. Despite that, human evaluation confirms that the translations are of the same quality.


Introduction
Translators often translate the original content following provided style guidelines (see https://www.ted.com/participate/translate/guidelines for an example). However, such guidelines are meant to be high-level rather than comprehensive. Personal stylistic choices are thus welcome as a creative part of the translator's job, as long as the translation style remains consistent within the task. By contrast, although neural machine translation (NMT) models (Cho et al., 2014; Sutskever et al., 2014) are trained on these human translations (e.g. Europarl, TED Talks), the models do not explicitly learn to capture the rich variety of translators' styles in the data. This limits their capability to creatively translate new data with different and consistent styles as translators do. We believe that modeling the style of translators is an important yet overlooked aspect of NMT. Our contribution, to the best of our knowledge, is to fill this gap for the first time.
* Y. Wang carried out this work during an internship with Amazon AI.
In particular, our work investigates ways to integrate translator information into NMT, with an emphasis on mimicking the translator's style. Our study uses the TED talk dataset, covering four language pairs with translator annotations. We present and compare a set of different methods of using a discrete translator token to model and control translator-related stylistic variations in translation. Note that using a discrete token is a common approach to model and control not only specific traits in translation such as verbosity, politeness and speaker-related variances (Sennrich et al., 2016a; Michel and Neubig, 2018) but also other aspects of NMT such as language ids (Johnson et al., 2017; Fan et al., 2020). However, our study is the first to use such a discrete token to model the style of translators. It also provides several insights regarding translation style modeling, as follows.
First, we show that the state-of-the-art Transformer model implicitly learns the style of translators only to a limited extent. Moreover, methods that add translator information to the decoder surprisingly result in NMT that fully ignores the additional knowledge. This is regardless of whether the token is added to the bottom (i.e. the embedding layer) or to the top (i.e. the softmax layer) of the decoder. Meanwhile, methods that add the information to the encoder seem to model the translator's style effectively.
Second, we show that our best style-augmented NMT method is able to control the generation of translation in a way that mimics the translator's style, e.g. lexical and grammatical preferences and verbosity. While output produced by the style-augmented NMT can vary significantly with the translator-token values, with BLEU score variations of up to 4.5 points, a human evaluation confirms that the observed differences are all about style and not translation quality.
Our study focuses on capturing the personal style of translators. The closest work to ours is thus that of Michel and Neubig (2018), who instead study the effects of using speaker information in NMT. In our results, we show that translator information indeed has more impact on NMT than speaker information.
Finally, another distantly related research line tries to improve the diversity of the top-ranked translations of an input (Li et al., 2016; Shen et al., 2019; Agrawal and Carpuat, 2020). In fact, adding translator information to NMT also provides a means to generate translations with significantly different stylistic variations.

NMT with Translator Information
NMT reads an input sequence $x = x_1, \ldots, x_n$ in the source language with an encoder and then produces an output sequence $y = y_1, \ldots, y_m$ in the target language. Generation proceeds token by token, and its probability factorizes as $\prod_{j=1}^{m} P(y_j \mid y_{<j}, x)$, where $y_{<j}$ denotes the sub-sequence preceding the $j$-th token. The prediction for each token over the vocabulary $V$ is based on a softmax function as follows:

$$P(y_j \mid y_{<j}, x) = \mathrm{softmax}(W_V o_j + b_V) \tag{1}$$

Here, $o_j \in \mathbb{R}^{d}$ is an output vector of size $d$ (e.g. 512 or 1024), encoding both the context from the encoder and the state of the decoder at time $j$. Meanwhile, $W_V \in \mathbb{R}^{|V| \times d}$ and $b_V \in \mathbb{R}^{|V|}$ are a trainable projection matrix and bias vector. We adjust NMT in the different ways described below to let it mimic and control the translator's style.
Source Token. In our first approach, we insert the translator token $T$ at the beginning of each input sentence. The translator token is thus assigned an embedding vector like any other source token. Hence, the embedding sequence $E_{enc}$ for the MT encoder becomes:

$$E_{enc} = [e(T), e(x_1), \ldots, e(x_n)] \tag{2}$$

where $e(\cdot)$ is an embedding lookup function.

Token Embedding. We also consider adding the embedded translator token $e(T)$ to every token embedding in the encoder and/or decoder as follows:

$$E_{enc} = [e(x_1) + e(T), \ldots, e(x_n) + e(T)] \tag{3}$$

$$E_{dec} = [e(y_1) + e(T), \ldots, e(y_m) + e(T)] \tag{4}$$

Our motivation is to reinforce the influence of the translator token in MT.

Output Bias. Following Michel and Neubig (2018), we add the translator token information to the output bias at the final layer of the decoder (FULL-BIAS variant). Specifically, the method directly modulates the word probability over the vocabulary $V$ as follows:

$$P(y_j \mid y_{<j}, x, T) = \mathrm{softmax}(W_V o_j + b_V + b_T) \tag{5}$$

Here, $b_T \in \mathbb{R}^{|V|}$ is the translator-specific bias vector, which can be thought of as a translator-token embedding with dimension $|V|$ rather than $d$. We also explore another variant, named FACT-BIAS, as in Michel and Neubig (2018). This variant instead learns the translator bias through the factorization:

$$b_T = W s_T \tag{6}$$

with parameters $W \in \mathbb{R}^{|V| \times k}$ and $s_T \in \mathbb{R}^{k \times 1}$, where $k \ll |V|$. Note that while the previous methods digest the translator token at an earlier stage, this one consumes translator signals in a late-fusion manner.
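The integration points described above can be illustrated with a minimal, framework-free sketch. It is not the implementation used in the experiments: the `<T3>`-style token format and all function names are hypothetical, and vectors are plain Python lists purely for readability.

```python
import math

def add_source_token(src_tokens, translator):
    """Source Token: prepend a translator token to the source sentence.
    The "<...>" token format is an assumption made for illustration."""
    return [f"<{translator}>"] + list(src_tokens)

def add_token_embedding(embeddings, translator_emb):
    """Token Embedding: add e(T) to every token embedding in the sequence."""
    return [[a + b for a, b in zip(vec, translator_emb)] for vec in embeddings]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def factored_bias(W, s_T):
    """FACT-BIAS: translator bias b_T = W s_T, with W of shape |V| x k."""
    return matvec(W, s_T)

def output_distribution(W_V, o_j, b_V, b_T=None):
    """Word distribution softmax(W_V o_j + b_V [+ b_T]).
    Passing b_T gives the output-bias variants; omitting it gives the baseline."""
    logits = [l + b for l, b in zip(matvec(W_V, o_j), b_V)]
    if b_T is not None:
        logits = [l + b for l, b in zip(logits, b_T)]
    return softmax(logits)
```

In this sketch the translator bias only shifts the logits of the final layer, which is what makes the output-bias variants a late-fusion approach, whereas the source token and token-embedding variants change what the network sees from the first layer onward.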

Dataset and Models
We run experiments with the WIT3 public dataset of TED talks (Cettolo et al., 2012), with four language pairs: English-German (en-de), English-French (en-fr), English-Italian (en-it) and English-Spanish (en-es). The dataset contains both speaker and translator information for each talk and translation, which allows us to measure the effects of translators and speakers.

[Figure 1: Top 10 translators (T1-T10) per language pair. Panel "Top 10 en-de translators": 90, 54, 45, 42, 40, 37, 31, 31, 31, 24. Panel "Top 10 en-fr translators": 302, 78, 65, 48, 42, 34, 30, 26, 25, 24.]

We construct training, validation and test sets for each translation direction as follows. We first extract all talks that are translated by the 10 most popular translators (see Figure 1) and split them into parallel sentences. From the data of each translator, we then sample 500 sentences for testing and, from the remaining data, 90% for training and 10% for validation. All training, testing, and validation sentence pairs are put together and annotated with translator and speaker labels. Table 1 shows the data statistics for the four language pairs. For preprocessing, we employ the Moses toolkit (Koehn et al., 2007) for tokenization and apply subword-nmt (Sennrich et al., 2016b) to learn subword representations.
We choose the Transformer (Vaswani et al., 2017) as the baseline and employ Fairseq for our implementations. Our Transformer model consists of a 6-layer encoder-decoder network, where each layer contains 16 heads with a self-attention hidden state of size 1024 and a feed-forward hidden state of size 4096. We employ the Adam optimizer (Kingma and Ba, 2015) to update model parameters. We warm up the model by linearly increasing the learning rate from $1 \times 10^{-7}$ to $5 \times 10^{-4}$ over 4000 updates and then decay it with the inverse square root of the remaining training steps by a rate of $1 \times 10^{-4}$. We apply a dropout of 0.3 for en-de and 0.1 for both en-fr and en-it.
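The warm-up and decay described above can be sketched as a small function. This is a minimal illustration, assuming the common inverse square-root form in which the rate after warm-up is the peak rate scaled by sqrt(warmup / step); the function and parameter names are hypothetical, and the additional rate of $1 \times 10^{-4}$ mentioned in the text is not modeled here because its exact role is not specified.

```python
import math

def learning_rate(step, warmup_updates=4000, init_lr=1e-7, peak_lr=5e-4):
    """Linear warm-up followed by inverse square-root decay (assumed form)."""
    if step < warmup_updates:
        # Linear interpolation from init_lr to peak_lr during warm-up.
        return init_lr + (peak_lr - init_lr) * step / warmup_updates
    # After warm-up, decay proportionally to 1 / sqrt(step).
    return peak_lr * math.sqrt(warmup_updates / step)

# The rate peaks at the end of warm-up and halves once the step count quadruples.
schedule = [learning_rate(s) for s in (0, 2000, 4000, 16000)]
```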
For all MT systems, we load weights from pretrained models to obtain a better model initialization. Specifically, we employ models pretrained on WMT data for en-de and en-fr (Ott et al., 2018), and pretrain models for en-it and en-es using our large in-house out-of-domain data, as there are no existing pretrained models for these pairs. We finetune models on TED talk data for 10 epochs and select the best model based on the validation loss.
During inference, we employ beam search with a beam size of 4 and add a length penalty of 0.4.
We use the BLEU score (Papineni et al., 2002) to evaluate translation accuracy.

Adding Translator Token
We first compare methods to integrate the translator token into the Transformer. We report the performance of each model in two settings: (i) when fed with the oracle translator label (as at training time) and (ii) when fed with randomly assigned labels. Intuitively, if a model really leverages the translator information, we expect to see a performance drop in the randomized setting. Results are shown in Table 2.
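The two evaluation settings can be sketched as follows. This is an illustrative sketch only: it assumes labels are drawn uniformly at random with a fixed seed (the text does not specify the sampling procedure) and shows the inputs in the source-token format, with hypothetical helper names and a hypothetical `<T3>`-style token.

```python
import random

def randomize_labels(labels, translators, seed=0):
    """Replace each oracle label with a label drawn uniformly at random.
    Uniform sampling and the fixed seed are assumptions of this sketch."""
    rng = random.Random(seed)
    return [rng.choice(translators) for _ in labels]

def build_inputs(sentences, labels):
    """Prepend the (oracle or randomized) translator token to each source sentence."""
    return [f"<{label}> {sentence}" for sentence, label in zip(sentences, labels)]

translators = [f"T{i}" for i in range(1, 11)]
sentences = ["I had just tweeted .", "Pray for Egypt ."]
oracle_labels = ["T3", "T7"]

oracle_inputs = build_inputs(sentences, oracle_labels)
random_inputs = build_inputs(sentences, randomize_labels(oracle_labels, translators))
```

Translating both input sets with the same model and comparing BLEU then indicates whether the model relies on the label: a model that ignores the token should score the same in both settings.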
Our findings are as follows. First, it is surprisingly ineffective to add the translator token into the decoder, whether to the input (DEC-EMB) or to the softmax (FULL-BIAS, FACT-BIAS). In most cases, our randomization experiment shows that the model simply ignores the information.
Second, methods adding the token to the encoder (SRC-TOK, ENC-EMB) are significantly more effective. Translation accuracy is also consistently better (by up to 0.4 BLEU) than with the Transformer baseline, indicating that the translator token is useful. For those models, randomizing translator labels results in visible drops in BLEU score (up to 1.0 BLEU), indicating that the translator information has an important effect on the model.

Style Imitation
Following common practice in evaluating style imitation (see e.g. Michel and Neubig, 2018; Hovy et al., 2020), we train a classifier to predict the translator style of the output of various models. We employ a Logistic Regression classifier based on both unigram and bigram word features. The classifier, trained on the NMT training data, is applied to the outputs of the NMT models. Figure 2 shows the results of this experiment. As can be seen, the standard Transformer learns the style of translators only to a limited extent. The style of its translation outputs is less consistent with the original translator's style (i.e. accuracy is between 20% and 35%). Meanwhile, the classification accuracy is significantly higher (up to +12% relative) under SRC-TOK and ENC-EMB. This confirms that explicitly incorporating translator information at the sentence level allows for transferring some of the translator's personal traits into the translations.
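A style classifier of this kind can be sketched in a few lines. The version below is a toy, dependency-free illustration rather than the classifier used in the experiments: it uses unigram and bigram counts as features and trains a multinomial logistic regression with plain stochastic gradient descent, without regularization or a bias term. All function names and hyperparameters are assumptions.

```python
import math
from collections import Counter, defaultdict

def ngram_features(sentence):
    """Unigram and bigram counts of a whitespace-tokenized, lowercased sentence."""
    tokens = sentence.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)

def predict_proba(weights, features):
    """Softmax over per-translator linear scores."""
    scores = {c: sum(w[f] * v for f, v in features.items()) for c, w in weights.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def train_logreg(sentences, labels, epochs=50, lr=0.1):
    """Train one weight vector per translator with stochastic gradient descent."""
    weights = {c: defaultdict(float) for c in sorted(set(labels))}
    data = [(ngram_features(s), y) for s, y in zip(sentences, labels)]
    for _ in range(epochs):
        for features, gold in data:
            probs = predict_proba(weights, features)
            for c in weights:
                # Gradient of the cross-entropy loss with respect to the score of c.
                grad = probs[c] - (1.0 if c == gold else 0.0)
                for f, v in features.items():
                    weights[c][f] -= lr * grad * v
    return weights

def predict(weights, sentence):
    """Return the most probable translator label for a sentence."""
    probs = predict_proba(weights, ngram_features(sentence))
    return max(probs, key=probs.get)

def accuracy(weights, sentences, labels):
    """Fraction of sentences whose predicted translator matches the label."""
    hits = sum(predict(weights, s) == y for s, y in zip(sentences, labels))
    return hits / len(labels)
```

To mirror the setup above, `train_logreg` would be fit on reference translations paired with their translator labels, and `accuracy` would then be computed on system outputs paired with the labels they were generated with. In practice a library such as scikit-learn (`CountVectorizer` with `ngram_range=(1, 2)` and `LogisticRegression`) would be a natural choice; the text does not state which implementation was used.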
Meanwhile, we notice that higher accuracy is achieved with the reference translations (e.g. 42% for en-es), suggesting there is room for improvement.

Stylistic Variations
We analyze stylistic variations obtained with different translator token labels. In particular, we evaluate model outputs on en-fr after translating the entire test set with the same translator token label. As shown in Table 3, translator-informed NMT can produce quite different outputs, resulting in BLEU score variations of up to 4.5 points (i.e. between T7 and T3, T8, T10). We also observe differences in BLEU (albeit smaller) when testing on the WMT 2014 test set, where BLEU scores vary by up to 0.84 points between T7 and T5. We also compute the symmetric-BLEU distances between every pair of translators using their predictions on both the TED and WMT test sets, and visualize them as heatmaps in Figure 3. We observe similar BLEU distances between translators in both test sets. Moreover, T7 is farther from the other translators, although the gap is smaller on WMT than on TED. These findings support the consistency of translator styles across data from different domains.
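The pairwise comparison between translator labels can be sketched as follows. The exact definition of symmetric BLEU is not given in the text, so this sketch assumes it is the average of BLEU computed in both directions between two sets of outputs. The BLEU function is a simplified, unsmoothed corpus-level version on whitespace tokens, written only for illustration; it is not a replacement for a standard BLEU tool.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU (0-100) with one reference per sentence, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: each hypothesis n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)

def symmetric_bleu(outputs_a, outputs_b):
    """Assumed definition: mean of BLEU(a against b) and BLEU(b against a)."""
    return 0.5 * (corpus_bleu(outputs_a, outputs_b) + corpus_bleu(outputs_b, outputs_a))
```

Computing `symmetric_bleu` for every pair of translator labels on the same test set yields a symmetric matrix that can be visualized as a heatmap, where higher values mean the two sets of outputs are more similar.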
Then, we asked 3 professional translators to grade the quality of the translations produced with the labels T7 and T3 on the TED talks. The evaluation is on a 1-6 scale (higher is better) on a random sample of 100 sentences. This resulted in average scores of 4.867 and 4.860 for T3 and T7, respectively. A similar human evaluation with the T7 and T5 labels was also run on a random sample of 100 sentences of the WMT 2014 test set. It led to the same conclusion: average scores are very similar, 4.99 and 5.0 for T5 and T7, respectively. Both evaluations confirm that there is no difference in translation quality when using different token labels, i.e. the low BLEU score of T7 is only an effect of stylistic differences. Table 4 shows examples of translations generated with labels T3 and T7. As we can observe, the translations differ in grammar, word choice and verbosity.

[Table 3: BLEU scores per translator label (T1-T10) on the TED and WMT test sets.]

[Figure 3: Heatmaps of symmetric-BLEU distances between translators (T1-T10) on the TED and WMT test sets.]

Table 4: Examples of stylistic differences: T3 and T7 have different preferences in grammar and word choice. Their translations also differ in verbosity (using T7 results in consistently less verbose output than using T3), which is indeed also how translations by T3 and T7 differ in the training data.
Grammar
Src: I had just tweeted, "Pray for Egypt".
T3: J'avais tweeté : "Priez pour l'Egypte".
T7: Je venais de tweeter, "Priez pour l'Egypte."

Translator vs. Speaker Effects
Finally, we compared the effect of the translator token with that of the speaker token, which was proposed in Michel and Neubig (2018) to perform extreme personalization. Results on all four directions (see Table 5) show that the translator token has more impact. One probable reason is that the speaker signal is sparser than the translator signal, i.e. each speaker is represented by one TED talk, while translators are represented by multiple talks. Given that speaker and author style has received much more attention in the literature, we hope that this final result will spark more interest in the style of translators.

Conclusion
We designed various ways of incorporating translator information into NMT, in order to model and control the generation of translations with different translator styles. We show that the resulting style-augmented NMT produces significantly different stylistic variations, mimicking professional translators. Human evaluation confirms that the generated variations are all of the same translation quality.