Human or Neural Translation?

Deep neural models have tremendously improved machine translation. In this context, we investigate whether distinguishing machine from human translations is still feasible. We trained and applied 18 classifiers under two settings: a monolingual task, in which the classifier only looks at the translation; and a bilingual task, in which the source text is also taken into consideration. We report on extensive experiments involving 4 neural MT systems (Google Translate, DeepL, as well as two systems we trained) and varying the domain of texts. We show that the bilingual task is the easier one and that transfer-based deep-learning classifiers perform best, with mean accuracies around 85% in-domain and 75% out-of-domain.


Introduction
This work addresses the task of distinguishing between translations produced by humans and machines. Practical applications for this include: improving machine translation systems (Li et al., 2015), filtering parallel data mined from the Web (Arase and Zhou, 2013) and evaluating machine translation quality without reference translations (Aharoni et al., 2014). In our case, we are more interested in tracing the origin of translations outsourced by a large institutional translation service.
Our work aims at distinguishing between human and neural machine translations at the sentence level. We consider two settings: a monolingual task, where only the target sentence is considered; and a bilingual task, where both the source text and its translation are available. We compare feature-based approaches with several deep learning methods, investigating the impact of text domains and MT systems (in-house neural engines, Google Translate, DeepL). We pay particular attention to cases where the translation engine at test time is different from the one used for training, a situation we found to be rarely studied in related work. We show that identifying machine translation is still feasible nowadays. On the bilingual task, the best transfer learning method we tested recorded an in-domain accuracy of 87.6% and out-of-domain performances ranging between 65.4% and 84.2%, depending on the domain of texts and MT system considered. We analyze why our classifiers manage to do better than chance even though translations produced automatically seem to us of very good quality overall. We believe our study offers many new data points, and hope it will foster research on this timely topic.
After reviewing related work in Section 2, we describe our dataset and experimental setting in Section 3, the neural MT systems we used in our experiments in Section 4 and the classifiers we tested in Section 5. We present our experimental results in Section 6 and propose a deeper analysis in Section 7.

Related Work
Most studies on identifying machine translation were conducted at a time when MT systems were fraught with problems that rendered their identification somewhat easy. Current neural MT systems deliver translations that are sometimes bafflingly fluent. We are not aware of much work addressing MT identification with these newer systems. One notable exception is a recent study by Nguyen-Son et al. (2019) on distinguishing original sentences from translations produced by Google Translate (GT). The authors build on the interesting intuition that back-translations of automatically translated texts should contain fewer variations (word usage, structure) than back-translations of human translations. They report an accuracy of 75% with an SVM classifier on a small corpus of 1200 sentences selected from the Europarl corpus that are either original (human) or translated with GT. In their experiments, they use the same translation engine for producing the automatic translation of test sentences and the back-translations used by the classifier. In a real-world scenario, we are not expected to know which system has been used for producing a translation (we do not even know if a translation has been produced automatically), and the impact of producing back-translations with a different translation engine remains to be investigated.
In earlier work on MT identification, approaches and evaluations vary greatly from one study to the other. For instance, Li et al. (2015) use features extracted from the parse tree of the sentence, as well as features capturing the density of some function words (with the help of a part-of-speech tagger), and some features dedicated to out-of-vocabulary words. They also use features aimed at capturing emotion agreement inside a sentence, using a dictionary of emotion words. They gathered a balanced dataset of human and machine translations from the Europarl corpus (including French-English, German-English, Italian-English and Danish-English language pairs) using a statistical machine translation (SMT) engine trained in-domain with Moses (Koehn et al., 2007). They report an accuracy of 74.2%. However, they do not analyze which features are the most beneficial to the task. Arase and Zhou (2013) investigate the use of features to capture the fluency of the text, such as part-of-speech and word-based n-gram language models, as well as features aimed at detecting so-called phrase-salad phenomena (Lopez, 2008), i.e. poor inter-phrasal coherence often observed in SMT output. On a collection of public texts crawled over the Web, they report an impressive accuracy of 95.8% when distinguishing human versus automatic translations for the English-Japanese language pair. The best performance was observed when combining all the features, and surpasses that of humans performing the same task (88.2%). The authors did not report the quality of the automatic translations, but mentioned that it was pretty low compared to the translations produced by native speakers and professional translators. It is therefore questionable how their approach would do on good quality automatic translations. Aharoni et al. (2014) use features capturing the presence or absence of part-of-speech tags and function words taken from LIWC (Pennebaker et al., 1999) appearing at least 10 times in the training material. On a corpus extracted from the Canadian Hansards, and using various translation engines, they report accuracies at detecting machine versus human translations (under a monolingual scenario tested on the English language) which are inversely correlated with the quality of the MT system used. For the best translation engines, they report an accuracy slightly over 60%.

Data
All our experiments are centered around one very large dataset: the translation memory of a large institutional translation service. This data collection (called TM hereafter) contains the English and French versions of over 1.8 million documents, covering over 200 broad domains (military, health, etc.), for a total close to 140 million sentence pairs. Since the vast majority of translations in the TM are into French, we focus on this language direction.
Our goal is to build classifiers that determine if a translation is human or machine-made. For this, we need training data that contains both types of translations. We create such data by machine translating a subset of 530k sentence pairs, randomly sampled from the TM. These machine translations are performed using two different neural MT systems, themselves trained using a distinct subset of 5.8M sentence pairs, also randomly sampled from the TM. These two MT systems, one based on XLM (Lample and Conneau, 2019) and one on FairSeq (Ott et al., 2018), are detailed in Section 4. Thus, two distinct classifier training sets are created, one from each MT system: each contains 530k human translations and 530k machine translations, totalling 1.06M examples.
We proceed similarly to produce test sets to evaluate the performance of our classifiers: we randomly sample 10k sentence pairs from the TM, machine translate the English versions into French using our XLM and FairSeq MT systems, thus creating two test sets of 20k examples (10k human translations + 10k machine translations) each. We call these X-TM (for XLM) and F-TM (for FairSeq).
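The construction of such balanced sets can be sketched as follows. This is a minimal illustration, not the code used in our experiments: the function name, the label convention and the `translate` callable (a stand-in for an MT system) are all assumptions made for the example.

```python
import random

HUMAN, MACHINE = 0, 1  # label convention assumed for this sketch


def build_balanced_set(pairs, translate, seed=0):
    """Return shuffled (source, translation, label) examples.

    pairs     -- list of (source, human_translation) tuples sampled from the TM
    translate -- hypothetical callable standing in for an MT system
    """
    examples = []
    for source, human in pairs:
        # Each sampled pair contributes one human and one machine example.
        examples.append((source, human, HUMAN))
        examples.append((source, translate(source), MACHINE))
    random.Random(seed).shuffle(examples)
    return examples


# Toy usage: the "MT system" is a placeholder that upper-cases the source.
toy_pairs = [("Hello.", "Bonjour."), ("Thank you.", "Merci.")]
toy_set = build_balanced_set(toy_pairs, translate=str.upper)
```

With 10k sampled pairs, this yields the 20k-example test sets described above; a monolingual classifier would simply ignore the source field.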
These two test sets can be seen as "in-domain" relative to our classifiers, not only because they share the same source as the training data (the TM), but also because the machine translations were produced using the same MT systems. To test the ability of our classifiers to handle different text domains and translations produced by different MT engines, we also created "out-of-domain" test sets: we used two online translation platforms, DeepL (D) and Google Translate (GT), to translate 10k sentences from each of four publicly available data sets: Europarl (EURO), Canadian Hansard (HANS), the News Commentaries (NEWS) available through the WMT conference, and the Common Crawl corpus (CRAWL), also available through WMT. Again, these were mixed in equal parts with human translations. In what follows, each test set is named based on the system used to produce automatic translations, and the domain of the material.
We further translated another excerpt of (previously unused) 10k sentences from the TM, using the DeepL translation API with a private account, to produce a test set we call D-TM. The TM being a proprietary translation memory, we did not submit it to the GT platform.

NMT systems
As noted above, to produce the training data for our classifiers, we first created two transformer-based NMT systems using English-French texts from the TM. We provide the details of this process here.

Cross-lingual Language Model (XLM)
In Lample and Conneau (2019), the authors propose three models: two unsupervised ones that do not use sentence pairs in translation relation, and a supervised one that does. We focus on the third model, called Translation Language Modeling (TLM), which tackles cross-lingual pre-training in a way similar to the BERT model (Devlin et al., 2018a), with notable differences. First, XLM is based on a shared source-target vocabulary using Byte Pair Encoding (BPE) (Sennrich et al., 2016). We used the 60k BPE vocabulary which comes with the pre-trained language model. Second, XLM is trained to predict both source and target masked words, leveraging both source and target contexts, encouraging the model to align the source and target representations. Third, XLM stores the ID for the language and the token order (i.e., positional encoding) in both languages, which builds a relationship between related tokens in the two languages.
During training and when translating, we use a beam search of width 6 and a length penalty of 1. XLM is implemented in PyTorch and supports distributed training on multiple GPUs. The original distribution does not include beam search for translating (but does for training), so we modified it accordingly. Also, we modified the pre-processing code such that XLM accepts a parallel corpus for training TLM.

Scaling Neural Machine Translation (FairSeq)
Scaling NMT (Ott et al., 2018) is a novel transformer model that showcased an improvement in training efficiency while maintaining state-of-the-art accuracy by lowering the precision of computations, increasing the batch size and enhancing the learning rate regimen. The architecture uses the big-transformer model with 6 blocks in encoder and decoder networks. The half-precision training reduced the training time by 65%. Scaling NMT is implemented in PyTorch and is part of the fairseq-py toolkit. We use the default 40k vocabulary with a shared source and target BPE factorization. During training and for translating, we use a beam search of width 4 and a length penalty of 0.6. For translation, we average the last five checkpoints.

Post-processing
Translating the classifier training data (Section 3) with the XLM engine took approximately 10 hours on a computer equipped with a V100-SXM2 GPU, and 26 hours for the FairSeq system. By inspection, we noticed small issues with the translations produced by both systems, such as punctuation misplacements, extra spaces, and inconsistencies in the use of single and double quotes. Since those issues would ease the identification of machine-translated material, we normalized the translations in a post-processing step, using 12 very conservative regular expressions that we applied to both the human and machine translations. We observe in Table 1
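The kind of normalization described in this section can be sketched as below. The four rules shown are hypothetical examples of "conservative" patterns addressing the issues listed above (extra spaces, quote inconsistencies, punctuation placement); they are not the 12 expressions actually used, which are not listed here.

```python
import re

# Hypothetical rules for illustration only; applied in order.
RULES = [
    (re.compile(r"[ \t]+"), " "),      # collapse runs of spaces and tabs
    (re.compile(r"[‘’]"), "'"),        # unify curly single quotes
    (re.compile(r"[“”]"), '"'),        # unify curly double quotes
    (re.compile(r" ([,.])"), r"\1"),   # no space before a comma or period
]


def normalize(sentence):
    """Apply every rule in turn, then trim surrounding whitespace."""
    for pattern, replacement in RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence.strip()
```

The important design point is that the same function is applied to both human and machine translations, so that the normalization itself cannot become a cue for the classifiers.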

MT Identification
We experimented with two strategies for building classifiers: feature-based models trained from scratch, as well as deep learning ones making use of pre-trained representations.

Feature-based Classifiers
We considered three supervised classifiers informed by different feature sets. We tested various classifiers (random forest, support vector machines and logistic regression), but obtained more stable results with random forest classifiers trained with scikit-learn (Pedregosa et al., 2011). In all our experiments, we fixed the number of trees in the forest to 1000 with a maximum depth of 40 and a minimum number of samples required to split an internal node set to 10.
n-GRAM We reproduce the approach of Cavnar and Trenkle (1994) where we define a vector space on the 30k most frequent character n-grams in the MT output of our training material, with n ranging from 2 to 7. Each sentence is then encoded by the frequency of the terms in this vocabulary, thus leading to a large sparse representation which is passed to a classifier. In the bilingual task, we also consider the top 30k n-grams of the source-language version of the training corpus, leading to representations of 60k dimensions.
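The encoding just described can be sketched with the standard library alone. This is an illustrative sketch, assuming raw (unnormalized) counts and a dictionary-based sparse vector; function names are ours, and the resulting vectors would still have to be converted to whatever matrix format the classifier expects.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=7):
    """Yield every character n-gram of text with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def build_vocabulary(sentences, size=30_000):
    """Map the `size` most frequent n-grams to column indices."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_ngrams(sentence))
    return {gram: idx for idx, (gram, _) in enumerate(counts.most_common(size))}


def encode(sentence, vocabulary):
    """Sparse frequency vector, as a {column index: count} dictionary."""
    counts = Counter(g for g in char_ngrams(sentence) if g in vocabulary)
    return {vocabulary[g]: c for g, c in counts.items()}
```

In the bilingual setting, a second vocabulary built on the source side would be encoded the same way, with its column indices shifted by 30k to obtain the 60k-dimensional representation.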
KENLM As a point of comparison, in the monolingual task, we experimented with features extracted from four {3, 4}-gram word language models trained with the kenLM package (Heafield et al., 2013) on the machine-translated material of our training corpus: two left-to-right models, and two right-to-left ones. We computed 18 features: ratios of min and max logprob over the (target) sentence per model (four features), the number of tokens with a logprob less than {mean, max, −6} (three features per model), as well as the logprob of the full sentence given by the left-to-right models (two features).
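To make the feature count concrete, here is a sketch that derives the 18 features from per-token log-probabilities, however they were obtained from the language models. It assumes one reading of the description above, namely a single min/max ratio per model; the function names, the zero-division guard and the convention that the first two models are the left-to-right ones are assumptions of this example.

```python
def model_features(logprobs, threshold=-6.0):
    """Four features from one model's per-token log-probabilities."""
    lo, hi = min(logprobs), max(logprobs)
    mean = sum(logprobs) / len(logprobs)
    ratio = lo / hi if hi != 0 else 0.0  # guard is an assumption of this sketch
    return [
        ratio,                                    # min/max log-probability ratio
        sum(lp < mean for lp in logprobs),        # tokens below the mean
        sum(lp < hi for lp in logprobs),          # tokens below the max
        sum(lp < threshold for lp in logprobs),   # tokens below -6
    ]


def sentence_features(per_model_logprobs, left_to_right=(0, 1)):
    """per_model_logprobs: one list of token log-probabilities per model."""
    features = []
    for logprobs in per_model_logprobs:
        features.extend(model_features(logprobs))
    # Sentence log-probability (sum over tokens) for left-to-right models only.
    for idx in left_to_right:
        features.append(sum(per_model_logprobs[idx]))
    return features
```

With four models, this gives 4 x 4 = 16 per-model features plus 2 sentence-level scores, hence 18 features per sentence.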
T-MOP T-MOP (Jalili Sabet et al., 2016) is a translation memory cleaning tool which computes 27 features for detecting spurious sentence pairs, including broad features (such as length ratio) adapted from Barbu (2015), some based on IBM models computed by MGIZA++ (Gao and Vogel, 2008), as well as some features based on multilingual word embeddings, using the method proposed by Søgaard et al. (2015). While in T-MOP, those features are aggregated in an unsupervised way (that is, with rules), we instead pass them to a random forest classifier trained specifically to distinguish human from machine translations. Because of the nature of the feature set, we only deploy this classifier in the bilingual task.

Deep Learning Classifiers
bi-LSTM We re-implemented the method of Grégoire and Langlais (2018) for recognizing whether two sentences are translations of each other: two bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) encode the source and target sentences into two continuous vector representations, which are then fed into a Feed-Forward Neural Network with two layers (one in the original paper): one of dimension 150 to process the continuous representation, and one of dimension 75. The output of each network is finally passed to the sigmoid function.
In the original paper, the authors used 512-dimensional word embeddings and 512-dimensional recurrent states since they learn the word embedding from scratch. We found it easier (faster, and slightly better) to adapt pre-trained fastText word embeddings (Bojanowski et al., 2016) of dimension 300. Also, the authors tie the parameters of the two encoders, while we do not. We use two hidden layers before the sigmoid function because we are mapping from 300 values to 1 and, intuitively, it is better to do it smoothly. We trained our classifier with the Adadelta optimizer (Zeiler, 2012) with gradient clipping (clip value 5) to avoid exploding gradients and batch size 300, whereas the original architecture uses the Adam optimizer with a learning rate of 0.0002 and a mini-batch of 128. We use a similar setting for the monolingual task, except that we only use one bidirectional LSTM whose output we directly pass to the hidden layer of dimension 150, then a layer of dimension 75 and finally the sigmoid function.
LASER The LASER toolkit (Artetxe and Schwenk, 2019) released by Facebook provides a pre-trained sentence encoder that handles 92 different languages. Sentences from all the languages are mapped together into the same embedding space with a bi-LSTM 512-dimensional encoder, such that the embeddings from different languages are comparable.
For the bilingual detection task, we extract the representation of the source and target sentences and tie them into one vector by taking their absolute difference and dot product, and adding them. This tied representation is then passed through 3 hidden layers of size 512, 150 and 75 respectively, with dropout (Srivastava et al., 2014) of 50%, and then fed into a ReLU (Nair and Hinton, 2010) activation function, whose output is finally passed to the sigmoid function. For the monolingual task, we just use the LASER French (target) representation of the sentence and pass it through the very same architecture. We train the classifiers with the Adadelta optimizer with gradient clipping (clip value 3).
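The way the two sentence embeddings are tied together can be illustrated as follows. Since the result must remain a vector that can feed a 512-unit layer, this sketch assumes that the "dot product" is taken coordinate by coordinate (an element-wise product); that reading, and the function name, are assumptions of the example.

```python
def combine(src, tgt):
    """Tie two embeddings into one vector: |src - tgt| + src * tgt, per coordinate.

    Assumes the product is element-wise, so the output keeps the input dimension.
    """
    if len(src) != len(tgt):
        raise ValueError("embeddings must have the same dimension")
    return [abs(s - t) + s * t for s, t in zip(src, tgt)]
```

The combination is symmetric in its two arguments, so the resulting vector does not encode which sentence is the source and which is the translation.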
Transformer-based Classifiers The use of pre-trained language models in a transfer learning setting is ubiquitous and has shown substantial improvements in various NLP tasks. Therefore, we also considered various representations trained either solely on French data (CamemBERT, FlauBERT) or on multiple languages (XLM-ROBERTA, XLM, and mBERT). We experiment with different pre-trained transformer models, using the Python module simpletransformers, based on the HuggingFace library, which provides a sequence classification head (a linear layer on top of the pooled output). Our classifiers were fine-tuned using the ClassificationModel class and evaluated with its eval_model method. We have maintained the same parameters for all the transformer models: sequence length of 256, batch size of 32, Adam optimizer (Kingma and Ba, 2014).
[...] (Kudo and Richardson, 2018) and performs whole word masking, which has been shown to be preferable. The architecture of the base model is a multi-layer bidirectional transformer (Devlin et al., 2018b; Vaswani et al., 2017) with 12 transformer blocks of hidden size 768 and 12 self-attention heads.
FlauBERT (Le et al., 2019) The base model we used is trained on 71GB of publicly available French data, which was pre-processed and tokenized using a basic French tokenizer (Koehn et al., 2007). The model was trained with the MLM training objective.
mBERT (Devlin et al., 2018b) is very similar to the original BERT model with 12 layers of bidirectional transformers, but released as a single language model trained on 104 separate languages from Wikipedia pages, with a shared word piece vocabulary. The model does not use any marker for the input language, and the pre-trained model is not explicitly trained so that translation pairs have similar representations. The tokenization splits words into multiple pieces, and the prediction of the first piece is taken as the prediction for the word. The model is fine-tuned to minimize cross-entropy loss.

Experiments
We trained all classifiers described above using training data produced with XLM and FairSeq MT systems. Overall, classifiers trained with FairSeq translations performed very marginally better on out-of-domain data, with an average accuracy of 64.5%, compared to 64.3% for classifiers trained with XLM translations. In this section, unless otherwise specified, we report the results of classifiers trained with FairSeq translations, but both training sets produce very comparable results.

Monolingual task
Results on the monolingual task are reported in Table 2. Most accuracies are over the 50% that would be obtained by a random guess, albeit by a small margin on some conditions. As expected, the best performances are observed on in-domain data (TM), in which machine translations were produced by the same MT systems used to produce the classifiers' training data. Which of XLM or FairSeq was used to produce test translations has little to no impact on performance, however. The highest accuracy (84.3%) is obtained on TM data by fine-tuning the FlauBERT pre-trained representations on the training material produced with XLM. Using this configuration to classify translations produced by DeepL only slightly reduces performance (82.4%), but for most other approaches, including other BERT-inspired solutions, it does lead to a notable decrease in accuracy. HANS and EURO are the hardest test sets, where performances are often close to the random guess baseline. This suggests that translations produced by GT and DeepL on those datasets are very good and hard to distinguish from human translations. Part of this poor performance may be attributed, to some extent, to the mismatch between the system used to translate the classifiers' training material and the one used for testing. The lowest performances overall are recorded when classifying sentences produced by GT on the HANS dataset, where the best classifier only succeeds at a rate of 57.9%. Around 15% of automatic translations in this test set are identical to the reference one (see column 1 of Table 4), and the same percentage are very close to the reference (see columns 2 and 3). It is well known that GT has been trained on Hansards, further complicating the task. We were however surprised by the low percentage of automatic translations close to the reference one that we measured on the EURO test set. Inspection did not reveal anything particular.
If we set apart those two test sets, we observe that BERT-like models provide better results than bi-LSTM and LASER ones. BERT models are systematically better at classifying DeepL translations than those produced by GT. We do not have a clear explanation for this yet.
The n-GRAM feature-based classifier is competitive with the LASER and bi-LSTM classifiers, but is slightly behind BERT-inspired classifiers. KENLM is clearly overfitting, delivering impressive results for such a simple device on in-domain data and systems, but failing to generalize to other settings.
The good performance we obtained on TM when distinguishing translations produced by DeepL may be of interest to the language service provider that provided us with the data. Such a classifier could, for instance, be used to identify translation providers that rely heavily on this system to produce their translations. The performance obtained on the NEWS and CRAWL test sets indicates that the automatic translations do have a signature that we can recognize to some extent, without even looking at the source sentence.

Bilingual Task
Table 3 shows accuracies obtained in the bilingual task, that is, when both the source sentence and the translation are considered. With very few exceptions, all configurations benefit from the extra input. For settings where the monolingual accuracies are high, the gains can be modest (for instance less than 2 points for FlauBERT on in-domain test sets), but otherwise, clear improvements are observable. For instance, on the HANS test sets, gains close to 20 points can be observed for some Transformer-based classifiers.
The more challenging datasets are now handled with an accuracy around 70% or above, while for the other test sets, the best performances are over 80%. Similarly to the monolingual task, Transformer-based classifiers are the best performers. The T-MOP classifier overall underperforms the bi-LSTM and LASER ones. The n-GRAM classifier shows signs of overfitting, and delivers disappointing results on out-of-domain data.
Table 4: Accuracy of best classifier (in percentage) for each test set, in the monolingual and bilingual tasks, as a function of the (normalized) BLEU score. Except for underlined scores, classifier training data were produced with XLM. The best classifier is specified in parentheses next to its accuracy. Column "=ref %" indicates the percentage of sentences for which MT output is identical to the reference, while "x-edit%" columns indicate the percentage of translations which differ from the reference translation by exactly one or two edit distance operations.

Analysis

Quantitative Analysis
Table 4 shows the accuracy of the best performing classifiers for each test set, alongside the BLEU score of the respective translation engine (F-, X-, GT-, D-) for that set. We anticipated that poor quality MT would be easier to detect, but BLEU score does not seem to correlate strongly with the classification performance, which contradicts the observation in Aharoni et al. (2014). What is noticeable, however, is that in-domain performances (data from TM, and classifiers trained with the same translation engine used for producing test-sentence translations) are systematically higher than out-of-domain ones. Also, the bilingual task is unquestionably easier to tackle and for many test sets, including out-of-domain ones, the best classifier achieves an accuracy over 80%, a rather decent level of performance we did not anticipate at first, considering the relatively high quality of current NMT output.
Figure 1 shows the cumulative accuracy (y-axis) in the bilingual task calculated over the number of target sentences, sorted by the length of sentences (number of tokens). For all test sets and all classifiers, we observe that the longer the translation, the better the accuracy. This corroborates the findings of Arase and Zhou (2013), that longer sentences are easier to classify. This is likely explained by the fact that translations of short sentences are more likely to be similar to the human translation, and longer sentences likely contain more problems, further easing detection.
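The cumulative accuracy curve discussed above can be computed along the following lines. This is a sketch under stated assumptions: sentences are sorted from shortest to longest, tokens are obtained by whitespace splitting, and the function name and input format are ours.

```python
def cumulative_accuracy(examples):
    """examples: iterable of (translation, gold_label, predicted_label).

    Returns a list of (length, accuracy over all sentences up to that rank),
    with sentences sorted by whitespace token count, shortest first.
    """
    ranked = sorted(examples, key=lambda ex: len(ex[0].split()))
    curve, correct = [], 0
    for rank, (text, gold, pred) in enumerate(ranked, start=1):
        correct += int(gold == pred)
        curve.append((len(text.split()), correct / rank))
    return curve
```

A curve that rises as longer sentences are added reflects the trend reported here, namely that classifiers are more often right on long translations than on short ones.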

Qualitative Analysis
We inspected the decisions made by our classifiers on some examples. We did notice machine translations involving problems with proper names and acronyms, as in example i) of Figure 2. We also occasionally found syntax problems in machine translations, such as example ii), which involves a failure in long-distance number agreement as well as a bad choice of pronoun. Also, we observed a strong tendency of machine translations to mimic the structure of the source sentence, as can be seen in most examples of Figure 2. This suggests that alignment features in the bilingual task could be useful. T-MOP explicitly captures alignment information, but does not seem to make good use of it. We were otherwise impressed by the overall quality of the MT, and rapidly realized how difficult it would be for human annotators to achieve a decent level of performance on this task. This is in line with the observations of Arase and Zhou (2013), who report lower performances for humans than for machines at detecting translations produced by statistical phrase-based MT.
To better understand the type of information our classifiers base their decisions on, we inspected cases where our classifiers predominantly classified the human translations as such, and the machine translation counterpart is predominantly recognized as a machine translation. For 32 such cases randomly selected, we manually produced minimal pairs (3 on average), that is, as small as possible variants of the automatic translation, to see at which point the classifiers were changing their decisions from machine to human, thus allowing us to see which signals they react to. For instance, we produced 7 variants of example i) in Figure 2, including the 3 reported.
We found that in half of the cases, modifying only a few words (often only one) of the automatic translation is enough for the classifier to reverse its decision. Some cases involved normalizations that our post-processing script (see Section 4.3) fails to take into account. Among those, we noted the presence of a hyphen symbol produced by DeepL on the NEWS data set, different from the one used in human translations. We also noted a few cases involving typographical preferences. For instance, on the EURO test set, removing a space in section numbering produced by XLM (e.g. "5 c)" versus "5c)") sometimes suffices to make our classifiers believe the translation is human. Also, removing a capital letter (or sometimes adding one) may reverse the classifier's decision.
Of course, such normalization issues are in a way deceptive since, although they do help decision making, they do not have much to do with translation quality. In any case, the most frequent situation involves lexical choices. For instance, in example v) of Figure 2, replacing the future tense enverra with the infinitive form doit envoyer significantly reduces (from 18 to 4) the number of classifiers believing the translation is an automatic one. Further replacing the preposition pour with en vue de reduces this number to 2. Sometimes, it is easy to blame the translation engine for a different lexical choice, as with the underlined wording in example iii), but sometimes it is less so, as in example iv), where se retire might be a correct translation of step down. Clearly, more analysis is required to better appreciate the type of information captured by our classifiers.

Conclusion
In this study, we implemented 18 classifiers to detect machine-translated texts, and evaluated their performance on several test sets, containing translations produced by different state-of-the-art NMT systems.

i) SRC 6c) Were you informed about the ADR process at the CHRC?
HUM 6c) Vous a-t-on informé du processus relatif au RAD de la CCDP ?
NMT 6c) Avez-vous été informé du processus de MARC à la CCDP ? (XLM, TM)
VAR 6c) Avez-vous été informé du processus de RAD à la CCDP ? (14)
VAR 6c) Vous a-t-on informé du processus de MARC de la CCDP ? (9)
VAR 6c) Vous a-t-on informé du processus au RAD de la CCDP ?
ii) SRC Are there any specific services being requested by SMEs that you are not able to provide for them or that you feel lie outside of your mandate?
HUM Les PME vous demandent-elles de leur fournir des services que vous ne pouvez leur donner ou qui, selon vous, echappent à votre mandat
NMT Y a-t-il des services particuliers demandés par les PME que vous ne pouvez pas leur fournir ou que, selon vous, ne cadre pas avec votre mandat ? (XLM, TM)
iii) [...]
VAR Mesure Baki doit envoyer un rappel pour l'inspection d'octobre. (2)
Figure 2: Examples of human and automatic translations. Underlined passages identify problems and bold ones their corresponding parts. Examples i) and v) come with manually edited variants of the automatic translation (modifications are in italic), followed in parentheses by the number of classifiers (among 18) that identify an automatic translation.

Overall, we found that classifiers with access to both the source sentence and the translation perform better than those with access to the translation alone. Our classifiers achieve accuracies above 80% on several test sets and always surpass a random baseline. Our analysis reveals that, despite our efforts to normalize translations, artifacts still exist in the data that could explain in part our relatively high classifier accuracies. But in general, it appears that NMT systems do elicit signatures that can be recognized by automatic methods. Often, a single lexical choice gives away the automatic nature of the translation, even when the translation looks fluent from a language model point of view. While we had the opportunity to work on a large, high quality professional translation memory, we realize that our results cannot be replicated exactly: by nature, large professional TMs are proprietary and not easily shared. We argue however that one can easily reproduce (Drummond, 2009) our experiments in another setting.
In future work, we hope to produce better MT detectors by creating training data using a wider variety of MT systems. Another question we would like to examine is to what extent it is possible to detect post-edited translations, i.e. machine translations manually edited by human translators.