The Effect of Translationese in Machine Translation Test Sets

The effect of translationese has been studied in the field of machine translation (MT), mostly with respect to training data. We study in depth the effect of translationese on test data, using the test sets from the last three editions of WMT’s news shared task, containing 17 translation directions. We show evidence that (i) the use of translationese in test sets results in inflated human evaluation scores for MT systems; (ii) in some cases system rankings do change and (iii) the impact translationese has on a translation direction is inversely correlated to the translation quality attainable by state-of-the-art MT systems for that direction.


Introduction
Translated texts in a human language exhibit unique characteristics that set them apart from texts originally written in that language. It is therefore common to refer to translated texts with the term translationese. The characteristics of translationese can be grouped along the so-called universal features of translation or translation universals (Baker, 1993), namely simplification, normalisation and explicitation. In addition to these three, interference is recognised as a fundamental law of translation (Toury, 2012): "phenomena pertaining to the make-up of the source text tend to be transferred to the target text". In a nutshell, compared to original texts, translations tend to be simpler, more standardised and more explicit, and they retain some characteristics that pertain to the source language.
The effect of translationese has been studied in machine translation (MT) during the last decade, mainly with respect to the training data. Previous work has found that an MT system performs better when trained on parallel data whose source side is original and whose target side is translationese, rather than the opposite (Kurokawa et al., 2009; Lembersky, 2013).
A recent paper has studied the effect of translationese on test sets (Toral et al., 2018), in the context of assessing the claim of human parity made on WMT's 2017 Chinese-to-English test set (Hassan et al., 2018). The source side of this test set, as is common in WMT (Bojar et al., 2016, 2017, 2018), was half original and half translationese. It was found that the translationese part was artificially easier to translate, which resulted in inflated scores for MT systems.
Noting that this finding was based on one test set for a single translation direction, we explore this topic in more depth, studying the effect of translationese in all the language pairs of the news shared task of WMT 2016 to 2018. Our research questions (RQs) are the following:
• RQ1. Does the use of translationese in the source side of MT test sets unfairly favour MT systems in general, or is this just an artifact of the Chinese-to-English test set from WMT 2017?
• RQ2. If the answer to RQ1 is yes, does this effect of translationese have an impact on WMT's system rankings? In other words, would removing the part of the test set whose source side is translationese result in any change in the rankings?
• RQ3. If the answer to RQ1 is yes, are some language pairs more affected than others, e.g. depending on how closely related the two languages involved are?
The remainder of the paper is organized as follows. Section 2 provides an overview of previous work on the effect of translationese in MT. Next, Section 3 describes the data sets used in our research. This is followed by Sections 4, 5 and 6, where we conduct the experiments for RQ1, RQ2 and RQ3, respectively. Finally, Section 7 outlines our conclusions and lines of future work.

Related Work
Previous research in MT has looked at the impact of translationese mostly on training data, although some work has also focused on tuning and test sets.
The pioneering work on this topic by Kurokawa et al. (2009) showed that French-to-English statistical MT systems trained on human translations from French to English (original source and translationese target, henceforth referred to as O→T) outperformed systems trained on human translations in the opposite direction (i.e. translationese source and original target, henceforth referred to as T→O). These findings were corroborated by Lembersky (2013), who also adapted phrase tables to translationese, which resulted in further improvements. Lembersky et al. (2012) focused on the monolingual data used to train the language model of a statistical MT system and found that using translated texts led to better translation quality than relying on original texts. Stymne (2017) investigated the effect of translationese on tuning for statistical MT, using data from WMT 2008-2013 (Bojar et al., 2013) for three language pairs. The results using O→T and T→O tuning texts were compared; the former led to a better length ratio and better translations in terms of automatic evaluation metrics.
Finally, Toral et al. (2018) investigated the effect of translationese on the Chinese→English (ZH→EN) test set from WMT's 2017 news shared task. They hypothesized that the sentences originally written in EN are easier to translate than those originally written in ZH, due to the simplification principle of translationese, namely that translated sentences tend to be simpler than their original counterparts (Laviosa-Braithwaite, 1998). Two additional universal principles of translation, explicitation and normalisation, would also indicate that a ZH text originally written in EN would be easier to translate. In fact, they looked at a human translation and the translation by an MT system (Hassan et al., 2018) and observed that the human translation outperforms the MT system when the input text is original (written in ZH), but that the difference between the two is not significant when the input is translationese (ZH input originally written in EN). Therefore, they concluded that the use of translationese as the source language in test sets distorts the results in favour of MT systems.

Data Sets
We use the test data from the WMT16, WMT17 and WMT18 news translation tasks (newstest2016, newstest2017 and newstest2018) exclusively, because they provide results using the direct assessment (DA) score (Graham et al., 2013, 2014, 2017), which is the metric we use in our experiments. DA is a crowd-sourced human evaluation metric for MT quality. After participants submit the translations produced by their MT systems, a human evaluation campaign is run to assess the translation quality of the systems and to rank them accordingly. Human evaluation scores are provided via crowdsourcing and/or by participants, using Appraise (Federmann, 2012). Human assessors are asked to rate a given candidate translation by how adequately it expresses the meaning of the corresponding reference translation, thus avoiding the use of the source texts and therefore not requiring bilingual speakers. The rating is done on an analogue scale, which corresponds to an absolute 0-100 scale.
To compensate for differences in scoring strategies between human assessors, the human assessment scores for translations are standardized according to each individual assessor's overall mean and standard deviation; the result is reported as the z-score in the WMT findings papers. The standardized scores are first averaged for each individual segment belonging to a given system, and the final overall DA score for that system is then computed as the average of its standardized segment scores.
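The standardization and averaging steps just described can be sketched as follows. This is a minimal illustration under our own assumptions, not the WMT implementation: the tuple layout, the function names, the use of the population standard deviation and the decision to skip assessors with constant scores are all hypothetical choices, and the ratings are toy values.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_assessor(ratings):
    """Convert raw 0-100 ratings to z-scores per assessor.

    `ratings` is a list of (assessor, system, segment_id, raw_score) tuples.
    Returns a list of (system, segment_id, z_score) tuples.
    """
    raw_by_assessor = defaultdict(list)
    for assessor, _, _, score in ratings:
        raw_by_assessor[assessor].append(score)
    # Assumption: population standard deviation; the source does not say which.
    stats = {a: (mean(s), pstdev(s)) for a, s in raw_by_assessor.items()}

    standardized = []
    for assessor, system, segment_id, score in ratings:
        mu, sigma = stats[assessor]
        if sigma == 0:
            continue  # Assumption: skip assessors whose scores never vary.
        standardized.append((system, segment_id, (score - mu) / sigma))
    return standardized


def system_da_scores(ratings):
    """Average z-scores per segment first, then per system."""
    per_segment = defaultdict(list)
    for system, segment_id, z in standardize_by_assessor(ratings):
        per_segment[(system, segment_id)].append(z)
    per_system = defaultdict(list)
    for (system, _), zs in per_segment.items():
        per_system[system].append(mean(zs))
    return {system: mean(scores) for system, scores in per_system.items()}


# Toy ratings: assessor "a2" is harsher than "a1" but ranks the systems alike.
toy_ratings = [
    ("a1", "sysA", "seg1", 80), ("a1", "sysB", "seg1", 60),
    ("a2", "sysA", "seg1", 50), ("a2", "sysB", "seg1", 30),
]
toy_da = system_da_scores(toy_ratings)
```

In this toy case the two assessors use different parts of the scale, yet after standardization they contribute identical z-scores to each system, which is the purpose of the normalization.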
Finally, systems are ranked to produce the shared task results. Since some systems may obtain similar scores, systems are also grouped into clusters by comparing the scores they obtained: systems that do not significantly outperform one another according to the Wilcoxon rank-sum test are placed in the same cluster, whereas a system that significantly outperforms others is placed in a higher cluster than them (Bojar et al., 2016, 2017, 2018).
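To make the clustering idea concrete, here is a rough sketch in Python. It is not the WMT procedure: the rank-sum test uses a plain normal approximation without tie correction, and the clustering rule (a cluster boundary is drawn wherever every system above it significantly outperforms every system below it) is a simplifying assumption of ours. All names and scores are hypothetical.

```python
from statistics import NormalDist


def rank_sum_p_value(xs, ys):
    """One-sided Wilcoxon rank-sum p-value for 'xs tends to exceed ys'.

    Normal approximation, mid-ranks for ties, no tie correction of the variance.
    """
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = mid_rank
        i = j + 1
    n1, n2 = len(xs), len(ys)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    return 1 - NormalDist().cdf((w - mu) / sigma)


def cluster_systems(scores, alpha=0.05):
    """Group systems (dict: name -> list of segment scores) into ordered clusters."""
    order = sorted(scores, key=lambda s: sum(scores[s]) / len(scores[s]), reverse=True)
    clusters, current = [], [order[0]]
    for idx in range(1, len(order)):
        upper, lower = order[:idx], order[idx:]
        separated = all(
            rank_sum_p_value(scores[u], scores[l]) < alpha
            for u in upper for l in lower
        )
        if separated:
            clusters.append(current)
            current = []
        current.append(order[idx])
    clusters.append(current)
    return clusters


# Toy standardized segment scores: sysA and sysB are close, sysC is far behind.
toy_scores = {
    "sysA": [0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6],
    "sysB": [0.85, 0.95, 1.05, 1.15, 1.25, 1.35, 1.45, 1.55],
    "sysC": [-1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3],
}
toy_clusters = cluster_systems(toy_scores)
```

With these toy scores, sysA and sysB are not significantly different and share a cluster, while sysC falls into a lower one.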
Table 1: Datasets used in this study (DA scores from the WMT16-18 news translation tasks). Columns contain (from left to right) the number of submitted systems (# sys.), the total number of segments prior to quality control (# seg.), and the total number of assessments carried out by human assessors (# assess.).
Table 1 provides an overview of the number of systems, segments and assessments in the previously mentioned editions of WMT for all available language directions. These are the datasets that we use in this work.

Effect of Translationese on Direct Assessment Scores
The test sets used by Bojar et al. (2016, 2017, 2018) are bilingual, thus having two sides: source text and reference translation. The source is written in the language that is to be translated from (source language), while the reference is written in the language into which the source text is to be translated (target language). In all the test sets used in our experiments English is one of the two languages involved, being either the source or the target.
Take as an example the WMT 2017 test set for Chinese-to-English, which contains 2,001 sentence pairs. Out of these, 1,000 sentences were originally written in Chinese and translated by a human translator into English, hence the target text is translationese. The remaining 1,001 sentences were originally written in English and translated by a human translator into Chinese, hence the source text is translationese in this subset. A graphical depiction of this can be found in Figure 1. The advantage of this procedure is that the same test set can be used for the English-to-Chinese direction, thus halving the cost of creating test sets.
Source and reference files contain documents, each of which is provided with a label indicating in which language it was originally written.In our experiments we compute the DA scores for each test set (i) on the whole test set, which corresponds to the results reported in WMT, (ii) on the subset for which the source text was originally written in the source language (referred to as ORG in our experiments) and (iii) on the remaining subset, for which the source text was originally written in the target language, and is thus translationese (referred to as TRS in our experiments).
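The three evaluation conditions can be illustrated with a small Python sketch. The field names (`orig_lang`, `z`) and the scores are hypothetical and serve only to show the partitioning; they are not taken from the WMT data format or from our results.

```python
from statistics import mean


def split_by_origin(segments, source_lang):
    """Partition segments into ORG (source is original) and TRS (source is translationese)."""
    org = [s for s in segments if s["orig_lang"] == source_lang]
    trs = [s for s in segments if s["orig_lang"] != source_lang]
    return org, trs


def subset_scores(segments, source_lang):
    """Mean standardized score on the whole set (WMT) and on each subset."""
    org, trs = split_by_origin(segments, source_lang)
    return {
        "WMT": mean(s["z"] for s in segments),
        "ORG": mean(s["z"] for s in org),
        "TRS": mean(s["z"] for s in trs),
    }


# Toy segment-level scores for one system in a ZH->EN test set.
# "orig_lang" is the document-level label of the language of original writing.
toy_segments = [
    {"orig_lang": "zh", "z": -0.2},
    {"orig_lang": "zh", "z": 0.0},
    {"orig_lang": "en", "z": 0.4},
    {"orig_lang": "en", "z": 0.6},
]
toy_subset_scores = subset_scores(toy_segments, source_lang="zh")
```

Because the whole-set score is an average over both subsets, it necessarily lies between the ORG and TRS scores; the question studied here is how far apart the two subsets are.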
Table 2 shows the absolute difference in DA score for the ORG and TRS subsets for the best MT system in each translation direction, taking the DA score on the whole test set (WMT) as the starting point for the comparison. We observe a clear and common trend: using original input results in a lower DA score, while using translationese input increases the DA score. This trend is consistent for all the 17 translation directions considered and for all the 3 years of WMT studied, thus providing enough evidence to answer RQ1: the use of translationese as input of test sets results in higher DA scores for MT systems.

Effect of Translationese on Rankings
We compute Kendall's τ to give an overview of the degree to which rankings change for each translation direction. The τ coefficient is obtained by comparing the WMT rankings to the rankings that result if only the ORG subset is used as input. Since systems can share the same cluster, and thus the same rank, we compute Kendall's τ both with and without ties. With ties, all systems in the same cluster are considered to occupy the same rank, hence the correlation with ties is sensitive only to changes that go beyond clusters, e.g. if a system moves from the second cluster to the first one. In contrast, without ties all the ranking changes are considered, even if a system changes position but remains within the same cluster.
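The difference between the two variants can be seen in a small Python sketch. We assume here the τ-b formulation, which handles tied ranks; the text above does not state which variant of Kendall's τ was used, and the rankings below are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b between two rankings given as equal-length lists of ranks."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both rankings: contributes to neither term
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied_in_x = concordant + discordant + tied_y_only
    untied_in_y = concordant + discordant + tied_x_only
    denominator = sqrt(untied_in_x * untied_in_y)
    return (concordant - discordant) / denominator if denominator else float("nan")


# Five hypothetical systems. Without ties, each system has its own rank;
# two pairs of neighbouring systems swap places on the ORG subset.
wmt_ranks = [1, 2, 3, 4, 5]
org_ranks = [2, 1, 3, 5, 4]
tau_without_ties = kendall_tau_b(wmt_ranks, org_ranks)

# With ties, systems are represented by their cluster rank. Both swaps happen
# inside a cluster, so the cluster-level rankings are identical.
wmt_clusters = [1, 1, 2, 3, 3]
org_clusters = [1, 1, 2, 3, 3]
tau_with_ties = kendall_tau_b(wmt_clusters, org_clusters)
```

In this toy case the correlation without ties drops below 1 because of the two swaps, while the correlation with ties stays at 1 since no system leaves its cluster.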
Table 3 shows the Kendall's τ correlations for all translation directions between the rankings on the whole test set (WMT) and on the ORG subset. Some of the translation directions have a τ coefficient of 1, which means that the agreement between the two rankings is perfect, i.e. the rankings in WMT and ORG are exactly the same. However, we observe that few systems were submitted to such translation directions (e.g. τ = 1 for Romanian→English in 2017, for which 7 systems were submitted, see Table 1). Apart from those, the other language directions show at least slight rank changes between the WMT rankings and the ORG rankings. Looking at the translation directions with the lowest correlations, we observe that some have τ coefficients much closer to 0, especially without ties, such as German→English in WMT 2017 (τ = 0.345). This means that some rankings have only a weak correlation.
Probably related to the differences in DA scores between WMT and ORG (RQ1), we also find that systems' rankings change for most language pairs when comparing WMT and ORG rankings. There is no perfect correlation between rankings, apart from a few language directions for which only a few systems were submitted. This indicates that the rankings do change to a certain degree. Computing Kendall's τ with ties results in higher correlation coefficients than without ties, implying that systems do shift, but tend to stay in the same cluster they occupied in the WMT ranking. In some editions of WMT, the rankings for certain language pairs change considerably. The biggest change in terms of ranking takes place for PROMT's rule-based RU→EN system in WMT16. This system advances four positions in the ranking when only original source text is considered, going from rank 5 to rank 1 (although tied with several other systems). It is worth noting that while the DA score for the majority of systems decreases when using original source text, the opposite happens for PROMT's system.
Thus far we have looked at a single result per translation direction and year, based on the best system in Table 2, and on the correlation between systems in Table 3. Now we zoom in on a single translation direction: Chinese→English. Table 4 shows how DA scores change between the whole test set (WMT) and the subsets ORG and TRS, both in terms of raw and standardized scores. In addition, the table shows how many positions a system goes up or down in the ranking.
In the table we observe consistently that the DA score for ORG input is lower than that for WMT, while that for TRS is higher than that for WMT.It is also worth noting that most top scoring systems change in rankings, and that system clusters shift.Due to limited space we provide equivalent tables to Table 4 for the remaining 16 translation directions as an appendix.

Effect of Translationese on Different Language Pairs
We aim to find out not only whether translationese has an effect on test sets (RQ1 and RQ2), but also whether some language pairs are more affected than others (RQ3). Two hypotheses in this regard are as follows: (i) the degree of translationese's impact has to do with the translation quality attainable for a translation direction, as represented by the DA score of the best MT system submitted; (ii) the degree of translationese's impact has to do with how related the two languages involved are.
In order to test the second hypothesis, the degree of similarity between languages has to be quantified. We make use of the lang2vec tool (Littell et al., 2017), which draws on the URIEL Typological Database (Littell et al., 2016), to compute the similarity between pairs of languages. Similar to the approach of Berzak et al. (2017), all the 103 available morphosyntactic features in URIEL are obtained; these are derived from the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013), Syntactic Structures of the World's Languages (SSWL) (Collins and Kayne, 2009) and Ethnologue (Lewis et al., 2009). Missing feature values are filled with a prediction from a k-nearest neighbors classifier. We also extract URIEL's 3,718 language family features derived from Glottolog (Hammarström et al., 2019). Each of these features represents membership in a branch of Glottolog's world language tree. After removing the features that have the same value for all the languages present in our study, 87 features remain, consisting of 60 syntactic features and 27 family tree features. We then measure the level of relatedness between two languages using the linguistic similarity (LS) by Berzak et al. (2017) (Equation 1), i.e. the cosine similarity between the URIEL feature vectors $v_y$ and $v_{y'}$ of the two languages:
$$\mathrm{LS}(y, y') = \frac{v_y \cdot v_{y'}}{\lVert v_y \rVert \, \lVert v_{y'} \rVert} \qquad (1)$$
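The feature pruning and the similarity computation can be sketched in a few lines of Python. The sketch does not call lang2vec; the binary feature vectors and the language labels (`lang_a`, `lang_b`, `lang_c`) are invented placeholders, and the function names are ours.

```python
from math import sqrt


def drop_constant_features(vectors):
    """Remove feature positions whose value is identical for every language.

    `vectors` maps a language label to a list of feature values of equal length.
    """
    languages = list(vectors)
    n_features = len(vectors[languages[0]])
    keep = [
        i for i in range(n_features)
        if len({vectors[lang][i] for lang in languages}) > 1
    ]
    return {lang: [vectors[lang][i] for i in keep] for lang in languages}


def linguistic_similarity(v_y, v_y_prime):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(v_y, v_y_prime))
    norm = sqrt(sum(a * a for a in v_y)) * sqrt(sum(b * b for b in v_y_prime))
    return dot / norm


# Placeholder feature vectors (not real URIEL values). The first feature has
# the same value everywhere, so it carries no information and is removed.
toy_vectors = {
    "lang_a": [1, 0, 1, 1],
    "lang_b": [1, 0, 1, 0],
    "lang_c": [1, 1, 0, 0],
}
pruned = drop_constant_features(toy_vectors)
ls_a_b = linguistic_similarity(pruned["lang_a"], pruned["lang_b"])
ls_a_c = linguistic_similarity(pruned["lang_a"], pruned["lang_c"])
```

Pruning constant features before computing the cosine matters: a feature shared by all languages would otherwise add the same amount to every dot product and push all similarities towards each other.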
Together with the LS for a language direction, we take the best system of the most recent year in our data set, WMT18, for that language direction. The motivation behind this choice is that a top performing system from the most recent campaign should be representative of the current state of the art in machine translation for the translation direction it was submitted to.
To look into the effect of translationese across different language pairs, we present two approaches, following the hypotheses put forward at the beginning of this section: (i) compare the DA score of the best system for each translation direction on subset ORG to the relative or absolute difference in DA score for that system between subset ORG and the whole set (WMT); (ii) compare the LS of the two languages in each translation direction to the relative or absolute difference in DA scores for the best system between subset ORG and the whole set (WMT).
Figure 2 shows the Pearson correlation and 95% confidence region of the DA score of the best scoring system for each language direction on subset ORG against the absolute and relative difference of the DA scores of those systems between WMT input and ORG input. We observe an interesting trend: higher scoring systems tend to have lower differences in score, which indicates that translationese has less effect on them. Considering either relative or absolute differences, the correlations are in both cases significant and strong (p < 0.001, |R| > 0.75).
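As an illustration of approach (i), the sketch below computes absolute and relative differences and correlates them with the ORG score. The definition of the relative difference (difference divided by the ORG score, in percent) is our assumption, and the DA scores are invented toy values rather than results from our experiments.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def absolute_difference(wmt, org):
    return wmt - org


def relative_difference(wmt, org):
    # Assumption: expressed as a percentage of the ORG score.
    return (wmt - org) / org * 100


# Toy DA scores of the best system in four hypothetical translation directions.
org_scores = [50, 60, 70, 80]
wmt_scores = [55, 64, 73, 82]

abs_diffs = [absolute_difference(w, o) for w, o in zip(wmt_scores, org_scores)]
rel_diffs = [relative_difference(w, o) for w, o in zip(wmt_scores, org_scores)]

r_abs = pearson_r(org_scores, abs_diffs)
r_rel = pearson_r(org_scores, rel_diffs)
```

A negative coefficient in this setup means that directions with higher-scoring systems show a smaller gap between WMT and ORG input, which is the pattern described for Figure 2.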
Figure 3 shows the Pearson correlation and 95% confidence region of the LS of a language pair (English compared to another language in our data sets) against the absolute and relative difference of the DA scores of the best system for each translation direction between WMT input and ORG input. Here, we see a less obvious trend, and in fact both correlations are very weak and non-significant (for the relative difference, R = −0.15, p = 0.61). However, just as in the previous figure, we can see that most of the out-of-English systems tend to have a higher relative and absolute difference than systems that translate into English.
On a side note, we also computed LS with different combinations of the features mentioned earlier. Apart from syntactic and family tree features, phonological features are also present in URIEL. However, these other combinations did not seem to alter the LS scores noticeably compared to the features used in our experimental setup.

Conclusion and Future Work
This paper has looked in depth at the effect of translationese in bidirectional test sets, commonly used in machine translation shared tasks, by conducting a series of experiments on data sets for 17 translation directions in the last three editions of the news shared task from WMT. Specifically, we have recomputed the direct assessment (DA) scores separately for the whole test set (WMT), and for the subsets whose source side contains original language (ORG) and translationese (TRS). Results show that using original language input lowers the DA scores and translationese input increases them (RQ1) and, perhaps more importantly, that system rankings do change (RQ2). We have also investigated the degree to which these rankings change, by measuring the correlation between the rankings with a non-parametric correlation metric that supports ties (Kendall's τ). Results show that systems do change in absolute ranking, but tend to stay in the same cluster as before.
Last, we looked at whether the effect of translationese correlates with certain characteristics of translation directions. We did not find a correlation between the effect of translationese and the level of relatedness of the two languages involved, but we did find a correlation between the effect of translationese and the translation quality attainable for translation directions (RQ3). In other words, human evaluation for better performing systems would seem to be less affected by translationese. Relatedly, we observe that translation directions that contain an under-resourced language tend to obtain low DA scores. Hence, we could say that the effect of translationese tends to be high especially when an under-resourced language is present, which could distort (inflate) the expectations in terms of translation quality for these languages.
As for future work, we plan to study what the characteristics of translationese are, i.e. which traits set apart the language used in original test sets from that used in translationese test sets.
All the code and data used in our experiments are available on GitHub.

A Supplemental Material
These are the supplementary tables for the paper "The Effect of Translationese in Machine Translation Test Sets". They cover the remaining 16 translation directions, with one table per direction. These tables have the same structure as Table 4 in the paper.

Figure 1: Example of a WMT test set for the English (EN) → Chinese (ZH) translation direction, where English is translated into Chinese, and Chinese into English. The subscript indicates the original language; red marks original language and blue marks translationese.
Figure 2: Pearson correlation between the DA scores of the best system for each translation direction at WMT18 and the relative (left) and absolute (right) difference in DA score (%) of comparing WMT input and ORG input. The languages are abbreviated using ISO 639-1 codes (Byrum, 1999).

Figure 3: Pearson correlation between Linguistic Similarity for each language direction and the relative (left) and absolute (right) difference (%) in DA score of comparing WMT input and ORG input. The languages are abbreviated using ISO 639-1 codes (Byrum, 1999).

Table 2: Absolute difference in DA score for the ORG and TRS subsets, taking the DA scores for the best MT system for each translation direction of WMT's 2016-2018 news translation shared task. Columns ORG and TRS show the absolute difference of the DA scores in those subsets compared to the whole test set (WMT).

Table 3: Kendall's τ coefficient for each translation direction and year. The coefficient is obtained by comparing WMT's ranking with the ranking obtained if only original language is used as input (subset ORG), with and without ties. An asterisk (*) indicates significance at p ≤ 0.05. Language directions are sorted by their mean Kendall's τ. A † indicates that the mean is computed over a single year.

Table 4: Results of the Chinese→English language direction with WMT, ORG, and TRS input. Systems are ordered by standardized mean DA score. If no rank is shown for a system, it shares the same cluster as the system above it. Clusters are obtained according to the Wilcoxon rank-sum test at p ≤ 0.05. The [↑↓] column indicates the changes in absolute ranking (i.e. how many positions a system goes up or down).