Statistical Machine Translation with Automatic Identification of Translationese

Translated texts (in any language) are so markedly different from original ones that text classification techniques can be used to tease them apart. Previous work has shown that awareness of these differences can significantly improve statistical machine translation. These results, however, required meta-information on the ontological status of texts (original or translated), which is typically unavailable. In this work we show that the predictions of translationese classifiers are as good as meta-information. First, when a monolingual corpus in the target language is given, to be used for constructing a language model, predicting the translated portions of the corpus, and using only them for the language model, is as good as using the entire corpus. Second, identifying the portions of a parallel corpus that are translated in the direction of the translation task, and using only them for the translation model, is as good as using the entire corpus. We present results from several language pairs and various data sets, indicating that these results are robust and general.


Introduction
Research in Translation Studies suggests that translated texts are considerably different from original texts, constituting a sublanguage known as Translationese (Gellerstam, 1986). Awareness of translationese can significantly improve statistical machine translation (SMT). Kurokawa et al. (2009) showed that French-to-English SMT systems whose translation models were constructed from human translations from French to English yielded better translation quality than ones created from translations in the other direction. These results were corroborated by Lembersky et al. (2012a, 2013), who showed that translation models can be adapted to translationese, thereby improving the quality of SMT even further. Awareness of translationese also benefits the language models used in SMT: Lembersky et al. (2011, 2012b) showed that language models compiled from translated texts better fit the reference sets in terms of perplexity, and that SMT systems constructed from such language models perform much better than those constructed from original texts.
To benefit from these results, however, one has to know whether the texts used for training SMT systems are original or translated, and previous work indeed used such meta-information. Unfortunately, annotation reflecting the status of texts, or the direction of translation, is typically unavailable. The research question we investigate in this work is whether the predictions of translationese classifiers can replace manual annotation. In a variety of evaluation scenarios, we demonstrate that this is indeed the case. When a monolingual corpus in the target language is given for constructing a language model for SMT, we show that automatically identifying the translated portions of the corpus, and using only them for the language model, is as good as using the entire corpus. Similarly, when a parallel corpus is given, we show that automatically identifying the portions of the corpus that are translated in the direction of the translation task, and using only them for training the translation model, is again as good as using the entire corpus. We present results from several language pairs and various data sets, indicating that the approach we advocate is general and robust.
The main contribution of this work is a general approach that, provided labeled data for training classifiers, can be applied to any corpus before it is used for constructing SMT systems, resulting in systems that are as good as (or better than) those that use the entire corpus, but that rely on significantly smaller language and translation models.
We briefly review related work in Section 2. Section 3 describes our methodology and experimental setup. Section 4 details the experiments and their results. We conclude with an analysis of the results and suggestions for future research.

Related work
Until recently, SMT systems were agnostic to the ontological status of a text (as original vs. translated). Several recent works, however, underscore the relevance of translationese for SMT. Kurokawa et al. (2009) were the first to show that translationese matters for SMT. They defined two translation tasks, English-to-French and French-to-English, and used a parallel corpus in which the translation direction of each text was indicated. They showed that for the English-to-French task, translation models compiled from English-translated-to-French texts were better than translation models compiled from texts translated in the reverse direction; and the same holds for the reverse translation task. These results were corroborated by Lembersky et al. (2012a, 2013), who further demonstrated that translation models can be adapted to translationese, thereby improving the quality of SMT even further. Lembersky et al. (2011, 2012b) focused on the language model (LM). They built several SMT systems for several pairs of languages. For each language pair they built two systems, one in which the LM was compiled from original English text, and another in which the LM was compiled from text translated to English from each of the languages. They showed that LMs compiled from translated texts better fit the reference set in terms of perplexity. Moreover, SMT systems that were constructed from translationese-based LMs perform much better than those constructed from original LMs. In fact, an original corpus must be as much as ten times larger in order to yield the same translation quality as a translated corpus.
To benefit from these results, one has to know whether the texts used for training SMT systems are original or translated; such meta-information is typically unavailable. Due to the unique properties of translationese, however, this information can be determined automatically using text-classification techniques. Several works address this task, using various feature sets, and reporting excellent accuracy (Baroni and Bernardini, 2006; van Halteren, 2008; Ilisei et al., 2010; Eetemadi and Toutanova, 2014). Some of these works, however, only conduct in-domain evaluation; much evidence suggests that out-of-domain accuracy is much lower (Koppel and Ordan, 2011; Islam and Hoenen, 2013; Avner et al., Forthcoming).
A thorough investigation was conducted by Volansky et al. (2015), who focused on the features of translationese (in English) from a translation theory perspective. They defined several classifiers based on various linguistically informed features, implementing several hypotheses of Translation Studies. We adopt some of their best-performing classifiers in this work.

Experimental setup
The experiments we describe in Section 4 consist of three parts:
1. Training classifiers to tease apart original from translated texts.
2. Constructing SMT systems with language models compiled from the predicted translations, comparing them with similar SMT systems whose language models consist of the entire monolingual corpora.
3. Constructing SMT systems with translation models compiled from bitexts that are predicted as translated in the same direction as the direction of the SMT task, comparing them with similar SMT systems whose translation models consist of the entire parallel corpora.
In this section we describe the language resources and tools required for performing these experiments.

Tools
Our first task is text classification; to ensure that the length of each text does not influence the classification, we partition the training corpus in most experiments into chunks of approximately 2,000 tokens (ending on a sentence boundary). We henceforth use chunk units to define the size of a sub-corpus. Our major experiments involve 2,500 chunks (of approximately 2,000 tokens each, hence 5M tokens). To detect sentence boundaries, we use the UIUC CCG tool. We use MOSES for tokenization and case normalization. Part-of-speech (POS) tagging is done with OpenNLP for English and the Stanford tagger for French. For classification we use Weka (Hall et al., 2009) with the SMO algorithm, a support-vector machine with a linear kernel, in its default configuration.
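The chunking step can be illustrated with a minimal sketch. It assumes that sentences are already tokenized, that a chunk is closed at the first sentence boundary at which it reaches the target size, and that a trailing partial chunk is discarded; the function name `chunk_sentences` and these choices are illustrative assumptions, not a description of the tooling actually used.

```python
from typing import Iterable, List

Sentence = List[str]


def chunk_sentences(sentences: Iterable[Sentence],
                    target_tokens: int = 2000) -> List[List[Sentence]]:
    """Group tokenized sentences into chunks of roughly target_tokens tokens.

    A chunk is closed at the first sentence boundary at which its token
    count reaches target_tokens, so no sentence is ever split.
    """
    chunks: List[List[Sentence]] = []
    current: List[Sentence] = []
    size = 0
    for sentence in sentences:
        current.append(sentence)
        size += len(sentence)
        if size >= target_tokens:
            chunks.append(current)
            current, size = [], 0
    # Assumption: a trailing chunk shorter than target_tokens is dropped,
    # so that all chunks have comparable length.
    return chunks
```

Under this scheme every chunk is slightly longer than the target, by at most one sentence, which keeps text length nearly constant across classification instances.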
To construct language models and measure perplexity, we use SRILM (Stolcke, 2002) with interpolated modified Kneser-Ney discounting (Chen and Goodman, 1996). We limit language models to a fixed vocabulary and map out-of-vocabulary (OOV) tokens to a unique symbol, both to overcome sparsity and to better control the OOV rates across corpora.
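The fixed-vocabulary treatment can be sketched as follows. How the vocabulary was selected is not specified here, so taking the most frequent tokens is an assumption made for illustration, as are the symbol `<unk>` and the function names.

```python
from collections import Counter

UNK = "<unk>"  # illustrative choice of the unique OOV symbol


def build_vocabulary(sentences, max_size):
    """Keep the max_size most frequent tokens (an assumed selection rule)."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(max_size)}


def map_oov(sentences, vocabulary, unk=UNK):
    """Replace every token outside the vocabulary with the OOV symbol."""
    return [[tok if tok in vocabulary else unk for tok in sent]
            for sent in sentences]


def oov_rate(sentences, vocabulary):
    """Fraction of tokens that fall outside the vocabulary."""
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        return 0.0
    return sum(tok not in vocabulary for tok in tokens) / len(tokens)
```

Applying the same vocabulary to every training corpus and to the reference set makes the resulting perplexities comparable, since all models then assign probability mass to the same event space.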
We train and build the SMT systems using MOSES. For evaluation we use MultEval (Clark et al., 2011), which takes machine translation hypotheses from several runs of an optimizer and provides three popular metric scores (BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011), and TER (Snover et al., 2006)), as well as standard deviations and p-values.

Corpora
To construct SMT systems we need both monolingual corpora (for the language model) and bilingual ones (for the translation model). The main corpora we use are Europarl (Koehn, 2005) and the Canadian Hansard. Europarl is a multilingual corpus recording the proceedings of the European Parliament. Some portions of the corpus are annotated with the original language of the utterances, and we use the method of Lembersky et al. (2012a) to identify the source language of other segments. The Hansard is a parallel corpus consisting of transcriptions of the Canadian parliament in English and (Canadian) French from 2001-2009. We use a version that is annotated with the original language of each parallel sentence. We also use the News Commentary corpus, a French-English corpus in the domain of politics, economics and science. The direction of translation of this corpus is not annotated.

Language model experiments
Our main experiments focus on French translated to English (FR→EN), and we define a classifier that can identify English translationese. However, to further establish the robustness of our approach, we also experiment with German translated to English (DE→EN) and with English translated to French (EN→FR). We also conduct cross-corpus experiments in which we train a translationese classifier on one corpus (Europarl) and test its contribution to SMT on another (Hansard, News). These experiments are crucial for evaluating the robustness of our approach, in light of the findings that translationese classification is much less accurate outside the training domain.
From the Europarl corpus we use several portions, collected over the years 1996 to 1999 and 2001 to 2009. In all experiments, the split of the monolingual corpora into translated vs. original texts is balanced (in terms of chunks). The parallel corpora are divided into two sections according to the direction of the translation (when it is known). For example, for the French-to-English translation task, we divide the Europarl corpus into a French-original section (FR→EN) and an English-original section. We also use portions of Europarl to define reference sets for evaluating the perplexity of LMs. For this task we only use translated texts.
For constructing translation models we use parallel corpora. For the FR→EN and EN→FR tasks we use original French text, aligned with its translation to English (FR→EN). For the DE→EN translation task we use original German text, aligned with its translation to English (DE→EN). The parallel portions we use are disjoint from those used for the language model and are evenly balanced between the original text and the aligned translated text. From Europarl we use portions from the period of January to September 2000.
To tune and evaluate SMT systems we use reference sets that are extracted from a parallel, aligned corpus. These include 1,000 sentence pairs for tuning and 1,000 (different) sentence pairs for evaluation. The sentences are randomly extracted from another portion of the Europarl corpus, collected over the period of October to December 2000, and another portion of Hansard. All tuning and reference sets are disjoint from the training materials.

Translation model experiments
In this set of experiments we focus again on FR→EN systems, but also experiment with DE→EN and EN→FR. We conduct in-domain experiments using the Europarl corpus, and a cross-corpus experiment in which we train on one corpus and test on another. From the Europarl corpus we use several portions, collected over the years 1996 to 1999 and 2001 to 2009. To construct language models for the in-domain experiments we use Europarl portions from the period of January to September 2000 (this is the English/French side of the training data used for building the translation model in the language model experiments). For cross-corpus experiments we use the LM built from translated texts that we use in the Hansard language model experiments. For tuning and evaluation we use the same sets used in the language model experiments.

Language models experiments
We build several SMT systems that use the same translation model, but differ in their language models. This involves three tasks detailed below.

Classification of translationese
The first task is to train a classifier to detect translationese. This has been done before, and we adapt some of the classifiers of Volansky et al. (2015). Specifically, our classifier is based on contextual function words: we use counts of (contiguous) trigrams ⟨w1, w2, w3⟩, where each element wi is either a word or its part of speech (POS), at least two of the elements are function words, and at most one is a POS tag. An example feature is the triple ⟨in, the, Noun⟩. This feature set combines lexical and shallow syntactic information in a way that has proven useful for identifying translationese. We also add counts of punctuation marks, another feature set that was shown to be accurate. (The code for feature generation will be released.) We evaluate the accuracy of this classifier intrinsically, using tenfold cross-validation.
Then, we use the predictions of the classifier to determine whether test texts are original or translated. The classifier thus defines a partition of the training corpus into (predicted) originals vs. translations. Based on the classifier's predictions, we build language models from the sub-corpus determined as translated. We then evaluate the fitness of this sub-corpus to the reference set, in terms of perplexity. Specifically, we train 1-, 2-, 3-, and 4-gram LMs for this sub-corpus and measure their perplexity on the reference set. This provides an extrinsic evaluation for the quality of the classifier.
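To make the contextual function word features concrete, the following sketch counts such trigrams for one POS-tagged chunk. It rests on one reading of the definition above: a function word is kept as a (lowercased) word, any other token is replaced by its POS tag, and a trigram is counted only if at most one position is a POS tag. The function word list is a tiny illustrative subset, and the function name is hypothetical; this is not the released feature-generation code.

```python
from collections import Counter

# Illustrative subset only; a real function word list is far longer.
FUNCTION_WORDS = {"in", "the", "of", "to", "and", "a", "that", "is"}


def contextual_function_word_features(tagged_tokens,
                                      function_words=FUNCTION_WORDS):
    """Count trigrams with >= 2 function words and <= 1 POS tag.

    tagged_tokens is a list of (word, pos) pairs for one chunk.
    """
    features = Counter()
    for i in range(len(tagged_tokens) - 2):
        elements = []
        pos_slots = 0
        for word, pos in tagged_tokens[i:i + 3]:
            lowered = word.lower()
            if lowered in function_words:
                elements.append(lowered)  # keep the function word itself
            else:
                elements.append(pos)      # back off to the POS tag
                pos_slots += 1
        if pos_slots <= 1:                # hence at least two function words
            features[tuple(elements)] += 1
    return features
```

For the tagged phrase "In the house", this yields the feature ("in", "the", "Noun"), matching the example triple given above. The resulting counts, together with punctuation counts, would form the feature vector of a chunk.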
The results are reported in Table 1. Replicating the results of Volansky et al. (2015), we demonstrate that the classifier is indeed excellent. Not surprisingly, good classification yields good language models. The rightmost columns of Table 1 list the perplexity of language models trained on the sub-corpus that was predicted as translated, when applied to the reference set. For comparison, Table 1 also lists the perplexity of language models compiled from the entire training set; from the actual (as opposed to predicted) translated texts; and from the actual original texts. Clearly, and consistently with the results of Lembersky et al. (2012b), the original texts yield the worst language models (highest perplexity), whereas the actual translated texts yield an upper bound on model quality (lowest perplexity). Still, due to the high accuracy of the classifier, the perplexity of the models built from its predictions is very close to that of this upper bound. The model built from all texts, both original and translated, is trained on a corpus twice as large as that used for the other models, hence its lower perplexity.
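For reference, perplexity is the exponentiated average negative log-probability that a language model assigns to the tokens of the reference set. The sketch below assumes natural-log token probabilities are already available from some model; the function name is illustrative, and the computation is the standard definition rather than any particular toolkit's implementation.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not log_probs:
        raise ValueError("need at least one token probability")
    return math.exp(-sum(log_probs) / len(log_probs))
```

A model that assigns every reference token probability 0.25 has perplexity 4; lower values indicate a better fit of the training sub-corpus to the reference set.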
To further establish the robustness of these results, we repeat the experiments with other corpora, this time consisting of German translated to English (DE→EN), and also English translated to French (EN→FR). We only report results for the 4-gram LMs (Table 2). The accuracies of the classifiers are high, comparable to the case of FR→EN. Moreover, the perplexities of the induced language models are very close to the upper bound obtained by taking actual translated texts.

Language models compiled from predicted translationese
We established the fact that translated texts can be identified with high accuracy, and that language models compiled from predicted translations fit the reference sets well. Next, we construct SMT systems with these language models. Our hypothesis is that language models compiled from (predicted) translationese will perform as well as (or even better than) language models compiled from the entire corpus. We evaluate this hypothesis in several scenarios: when the corpus used for the language model is the same corpus used for training the classifiers; or a different one, but of the same type; or from a completely different domain.
Table 2: Accuracy of the classification, and fitness of language models compiled from texts predicted as translated to the reference set, DE→EN and EN→FR
We begin with a French-to-English translation task. We use the same (4-gram) language models described in Section 4.1.1, constructed from the predictions of the classifier. We also fix a single translation model, compiled from the parallel portion of the training corpus (Section 3.2). We then train a French-to-English SMT system with the (predicted) LM. As a baseline, we build an SMT system that uses the entire training corpus for its language model; we refer to this system as All. As an upper bound (for a system that uses only a portion of the corpus), we build a system that uses the (actual) translated texts for its LM. We also report results on a system that uses only original texts for its LM. All systems are tuned on the same tuning set of 1000 parallel sentences, and are tested on the same reference set of 1000 parallel sentences.
We evaluate the quality of each of the SMT systems using MultEval (Section 3.1). The results are presented in Table 3, reporting the BLEU, METEOR (MET), and TER evaluation measures, as well as the p-value defining the statistical significance with which the system is different from the baseline (with respect to the BLEU score only).
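One common way to obtain p-values of this kind in MT evaluation is a paired approximate randomization test. The sketch below is a generic version that operates on hypothetical per-sentence scores of two systems and compares their means; it is not MultEval's implementation, which works with corpus-level metrics and several optimizer runs. The function name, the number of trials, and the seed are illustrative assumptions.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test.

    scores_a and scores_b hold per-sentence scores of two systems on the
    same test sentences. Returns an estimated p-value for the observed
    difference in mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Randomly swap the two systems' outputs for this sentence.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A large p-value means that the difference between a system and the baseline could easily arise from randomly exchanging their outputs, which is the sense in which two systems are called statistically indistinguishable.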
Replicating some of the results of Lembersky et al. (2011, 2012b), we find that using only translated texts for the language model is not inferior to using the entire corpus (although the size of the latter is double the size of the former). In terms of BLEU scores, both yield the same score, 29.1. Similarly, as reported by Lembersky et al. (2011, 2012b), using only original texts is markedly worse, with a BLEU score of 27.8. The main novelty of our current results, however, is the observation that the language model that only uses predicted, rather than actual, translated texts performs just as well.
For completeness, we repeat the same experiments with two more language pairs: German to English and English to French. The setup is identical, and we report the same evaluation metrics. The results are presented in Table 4. The emerging pattern is identical to that of French to English. The results of all the experiments confirm our hypothesis: SMT systems built from predicted translationese language models perform as well as SMT systems built from (actual) translated language models, and similarly to (twice as large) mixed language models.

Cross-corpus experiments
The experiments discussed above all use the same type of corpus both for training the translationese classifiers and for training the SMT systems (the actual portions differ, but all are taken from the same corpus). In a typical translation scenario, a monolingual corpus is available for constructing a language model, but the status of its texts (original or translated) is unknown, and has to be predicted by a classifier that was trained on a potentially different corpus.
As a first experiment, we use an (English) translationese classifier that is trained on the Europarl training data, but use the Hansard training data for constructing the SMT system. In this experiment, we do not use the meta-information of the Hansard corpus, but instead use the predictions of the classifier. Based on these predictions, we define a partition of the Hansard training corpus into (predicted) originals vs. translations and use the text chunks that were classified as translated to build 4-gram language models. Again, as in the in-domain experiment, we construct a single, fixed translation model from the parallel portion of the (Hansard) corpus. We then train a French-to-English SMT system with the (predicted) LM. As a baseline, we build an SMT system that uses the entire Hansard training corpus for its language model (All). As an upper bound, we build a system that uses the (real) translated texts for its LM. We also report results on a system that uses only original texts for its LM. All systems are tuned and tested on the same tuning and evaluation reference sets.
The results (Table 5) are consistent with the findings of the in-domain experiments. Although the classifier only performs at 78% accuracy, its predictions are sufficient for defining a language model whose BLEU score (37.8) is statistically indistinguishable from the score (38.0) of systems whose LMs are based on real translations or on the entire corpus.
We repeat the cross-corpus experiments with the News Commentary corpus, a French-English parallel corpus for which the direction of translation is not annotated; we only use its English side. Presumably, most of the texts in this corpus consist of original English, but we hypothesize that the classifier may be able to select chunks with translationese-like features and consequently provide a better SMT system. Additionally, as the News Commentary corpus is a collection of editorials, we partition the corpus into (not necessarily equal-length) articles, rather than into 2,000-token chunks, to maintain the coherence of chunks.
The results (Table 6) reveal the same pattern: the predicted-translationese system yields a BLEU score of 27.0, a statistically insignificant difference from the All system that uses the entire corpus (27.2). This is obtained with a much smaller corpus, only 1,470 chunks (58% of the entire corpus of 2,527 chunks).

Translation model experiments
We now move to experiments that address the translation model. We build SMT systems that use a fixed language model but differ in their translation model training data. For all systems we use fixed tuning and evaluation sets.

Translation models compiled from predicted translationese
We first train a classifier to detect the direction of the translation (FR→EN vs. EN→FR). We classify the English side of the parallel corpus; for the FR→EN and DE→EN tasks, chunks predicted as translated are assumed to be translated in the right direction (S → T). For the EN→FR task, chunks predicted as original are assumed to be translated in the right direction. Then, we use the prediction of the classifier to construct translation models: we only use the chunks predicted as translated in the right direction. For each partition, we match the English with the aligned French (or German) sentences, thereby defining the SMT training data.
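The selection rule described above can be sketched as follows. The sketch assumes that each chunk is a list of aligned (English, foreign) sentence pairs and that the classifier emits the label "translated" or "original" for the English side of each chunk; the function name, the label strings, and the data layout are assumptions made for illustration.

```python
def select_training_pairs(chunks, predictions, target_is_english):
    """Keep sentence pairs from chunks predicted to be in the S -> T direction.

    chunks:            list of chunks, each a list of (english, foreign) pairs
    predictions:       one label per chunk for its English side,
                       either "translated" or "original"
    target_is_english: True for FR->EN or DE->EN, False for EN->FR
    """
    # If English is the target, S -> T chunks have translated English;
    # if English is the source, S -> T chunks have original English.
    wanted = "translated" if target_is_english else "original"
    selected = []
    for chunk, label in zip(chunks, predictions):
        if label == wanted:
            selected.extend(chunk)
    return selected
```

The same classifier output thus serves both translation directions: only the label that counts as the right direction changes with the task.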
We hypothesize that translation models built from such training data are better for SMT. To explore this hypothesis we fix a single language model (Section 3.2), and train an SMT system with the (predicted) partitions and their aligned sentences. As a baseline, we build an SMT system, All, that uses the entire training corpus for its translation model. As an upper bound, we build a system that uses for its translation model the portion of the parallel corpus that was indeed translated in the right direction (S → T). We also report results on a system that uses only the portion of the parallel corpus that was translated in the opposite direction (T → S) for its translation model. All systems are tuned on the same tuning set and are tested on the same reference set.
The results are presented in Table 7. They are consistent with previous works that showed that SMT systems trained on S → T parallel texts outperformed systems trained on T → S texts (Kurokawa et al., 2009; Lembersky et al., 2012a, 2013). Indeed, the best-performing systems use either (actual) S → T texts (BLEU score of 31.3), or the entire corpus (31.3); the worst system uses (actual) T → S texts (28.4). What we add to previous results is the corroboration of the hypothesis that a predicted-translationese system performs just as well as the actual ones.
As in the language model experiments, we repeat the same experiments with two more translation tasks: German to English and English to French. The setup is identical, and we report the same evaluation metrics. The emerging pattern (Table 7) confirms our hypothesis: SMT systems built from predicted S → T texts perform as well as SMT systems built from the entire corpus.

Cross-corpus experiments
The above results are not very surprising given the high accuracy of the translationese classifier. The question we investigate in this section is whether a classifier trained on texts in one domain is useful for predicting translationese in a different domain.
We train an (English) translationese classifier on the Europarl training data, but use the Hansard corpus for the translation model. We apply the classifier to the English side of the Hansard corpus, and based on its predictions, define a partition of the Hansard training corpus to use for the translation model. As in the in-domain experiment, we construct a single, fixed language model from a portion of the (Hansard) corpus. We then train a French-to-English SMT system with the (predicted) translation model, comparing it to systems that use the entire Hansard training corpus, the (actual) S → T texts and the actual T → S texts.
Table 8 reports the results. The best-performing systems use either actual S → T texts or the entire corpus (BLEU score of 37.3). The system based on the classifier's predictions performs worse, at 36.3, but still much better than the system that is based on T → S texts. The gap should be attributed to the very small number of chunks predicted by the classifier as S → T.

Conclusion
Two fundamental insights, motivated by research in Translation Studies, drive our work:
1. Direction matters. When constructing translation models from parallel texts it is important to identify which side of the bitext is the source and which is the target. Translation models built from texts translated from the source of the SMT task to its target are consistently better than those built from texts translated in the reverse direction. In fact, direction itself was utilized as a feature for classification of translationese, by selecting alignment patterns from O to T and vice versa (Eetemadi and Toutanova, 2014, 2015).
2. Translationese matters. When constructing language models, translated texts (especially from the source language, but not only) are preferable to texts written originally in the target language of the task at hand.
Our main hypothesis was that these benefits to SMT still hold when meta-information on the status of the texts is unavailable, and has to be predicted, especially in light of the deterioration in the accuracy of translationese classifiers in the face of out-of-domain texts. We trained classifiers to identify translationese, and then used their predictions to construct language and translation models for SMT, demonstrating that attention to translationese can yield state-of-the-art translation quality with only a fraction of the corpora. We find that one can generally rely on classifiers that identify at least half of the data as translated for both the language model and the translation model.
In future work we would like to improve our classifiers such that smaller chunks of text suffice for accurate identification of translationese. We also believe that combining various feature sets is a key to improving the accuracy, and especially the robustness, of translationese classifiers. In this work we combined two complementary feature sets; more work should be done in this direction. In particular, there is ample evidence that features should be sensitive to language family, as translations from similar languages look more similar than translations from unrelated languages (Pym and Chrupała, 2005; Koppel and Ordan, 2011). To further improve generality and domain independence, we are currently experimenting with unsupervised classification of translationese, with very encouraging preliminary results (Rabinovich and Wintner, 2015).
Finally, we mainly experimented with English and French in this work, but we are confident that many language pairs can benefit from the methodology we propose.