Information Density and Quality Estimation Features as Translationese Indicators for Human Translation Classification

This paper introduces information density and machine translation quality estimation inspired features to automatically detect and classify human translated texts. We investigate two settings: discriminating between translations and comparable originally authored texts, and distinguishing two levels of translation professionalism. Our framework is based on delexicalised sentence-level dense feature vector representations combined with a supervised machine learning approach. The results show state-of-the-art performance for mixed-domain translationese detection with information density and quality estimation based features, while results on translation expertise classification are mixed.


Introduction
Translations, regardless of the method they were produced with, are different from their source texts and from originally authored comparable texts in the target language. This has been confirmed by many linguistic studies on translation properties commonly called translationese (Gellerstam, 1986). These studies show that translations tend to share a set of lexical, syntactic and/or textual features distinguishing them from non-translated texts. As most of these features can be measured quantitatively, we are able to automatically distinguish translations from originals (Baroni and Bernardini, 2006; Ozdowska and Way, 2009; Kurokawa et al., 2009). This is useful for Statistical Machine Translation (SMT), as language and translation models can be improved if the translation direction and status of the data (translation or original) is known (Lembersky, 2013).
Research on translationese has recently focused on exploring features capturing aspects of translationese such as simplification, explicitation, convergence, normalisation and shining-through (Volansky, 2012; Ilisei, 2012). Here we extend this work as follows: (i) we investigate the impact of information density and surprisal features, (ii) we explore the use of features used in machine translation quality estimation (Blatz et al., 2003), (iii) we explore classification between originally authored text and trainee and professional translation, as well as between professional and trainee translation. In order to avoid biasing classification by topic content, throughout our experiments we use fully delexicalised features, resulting in dense vector representations (rather than sparse vectors, where the size of the vectors can be up to and in fact exceed the size of the vocabulary). We show that information theory as well as translation quality estimation inspired features achieve state-of-the-art results in mixed-domain original vs. human translation classification.
Languages provide speakers with a large number of possibilities of how they may encode messages. These include the choice of phonemes, words, syntactic structures, as well as arranging sentences in discourse. Speakers' decisions regarding these choices are influenced by diverse factors: cognitive processing limitations can impact variation in linguistic encoding across all linguistic levels. Text production conditions, including monolingual vs. multilingual settings, can influence this variation: in translation, choices can be shaped by aspects of both the source and the target language.
Contrastive studies have shown that information density is distributed differently in English and German (Doherty, 2006;Fabricius-Hansen, 1996). These contrasts may impact translation, and in case of source language shining through 1 , we would expect to observe differences between translations and comparable originals in terms of information density. Additionally, translations are often more specialised and more conventionalised than originals (excluding translation of fictional texts). In this paper we investigate whether and to what extent information density based features are useful in human translation classification.
Quality estimation (QE) (Blatz et al., 2004; Ueffing and Ney, 2005) is the attempt to learn models that predict machine translation quality without access to a reference translation at prediction time. Translation, manual or automatic, is always a process of transforming a source into a target text. This process is prone to error. In this paper we explore whether and to what extent the extensive research on QE can be brought to bear on the problem of human translation vs. originals classification, and in particular the discrimination between novice and professional translation output.
Below we explore the ability of our features to distinguish between 1) non-translated texts and translations by professionals, 2) non-translated texts and translations by translator trainees, and 3) the two translation varieties that diverge in the degree of translation experience. We report results in terms of accuracy and f-score, and provide a feature analysis in order to understand the role of the information density and QE inspired features in the task.
The paper is organised as follows: related work is presented in Section 2. The experimental setup is detailed in Section 3, followed by the results and analysis in Section 4. A discussion about our results compared to previous work is given in Section 5. Finally, conclusion and future work are provided in Section 6.

Related Work
We briefly review previous work on translationese, information density, machine translation quality estimation and studies on human translation expertise.

Translationese
A number of corpus-based studies on translation have shown that it is possible to automatically predict whether a text is an original or a translation (Baroni and Bernardini, 2006; Koppel and Ordan, 2011). These approaches are based on the concept of translationese, a term coined by Gellerstam (1986) to capture the specific language of translations. The idea is that translations exhibit properties which distinguish them from original texts, both the source texts of the translation and comparable texts originally authored in the target language. Baker (1993; 1995) claimed these properties to be universal, i.e. (source) language-independent, emphasising general effects of the process of translation.
However, translationese includes features involving both source and target language. Most linguistic studies distinguish explicitation (a tendency to spell things out rather than leave them implicit) and implicitation (the opposite effect), simplification (a tendency to simplify the language used in translation), normalisation (a tendency to exaggerate features of the target language and to conform to its typical patterns), levelling out or convergence (a relatively higher level of homogeneity of translated texts compared to non-translated ones), and interference or shining through (e.g. Teich (2003)). While simple lexicalised features including word tokens and character n-grams can produce near perfect classification results for in-domain data (Avner et al., 2014), a significant amount of work has gone into devising features that can capture presumed linguistic aspects of translationese (Volansky, 2012). Other work explores unsupervised discrimination of translations based on principal component analysis for dimensionality reduction followed by a clustering step. The method is robust to unbalanced and heterogeneous datasets, which may be useful for handling mixed domain, genre and source of data, a common situation when training language and translation models.
Automatic classification of original vs. translated texts has applications in machine translation, especially in studies showing the impact of the nature (original vs. translation) of the text in translation and language models used in SMT. Kurokawa et al. (2009) show that taking directionality into account when training an English-to-French phrase-based SMT system leads to improved translation performance. Ozdowska & Way (2009) analyse the same language pair and demonstrate that the nature of the original source language has an impact on the quality of SMT output. Lembersky et al. (2012) show that BLEU scores can be improved by language models compiled from translated texts rather than from comparable originally authored ones.

Information Density
In a natural communication situation, speakers tend to exploit variations in their linguistic encoding, modulating the order, density and specificity of their expressions to avoid informational peaks and troughs that may result in inefficient communication. This is often referred to as the uniform information density hypothesis (Frank and Jaeger, 2008). The information conveyed by an expression can be quantified by its surprisal, a measure of how predictable an expression is given its context. Simplification and explicitation may impact the average information density measured on translated texts compared to comparable originally authored ones in the same language. Source language interference should result in peaks of measured surprisal values in translated texts, while the information density may remain uniform in originals. According to Hale (2001), a surprisal model allows the estimation of the probability of a parse tree given a sentence prefix. Levy (2008) showed that a lexical surprisal measure can be obtained by computing the negative log-probability of a word given its preceding context: S = -log P(w_{k+1} | w_1 ... w_k). Following Demberg et al. (2013), we estimate surprisal in three ways, at the word, part-of-speech and syntax levels, based on n-gram language models trained on words, on unlexicalised part-of-speech sequences and on flattened syntactic trees. Note that the resulting feature vectors do not represent lexical information but information theoretic surprisal measures.
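The lexical surprisal measure above can be sketched with a smoothed bigram language model. The add-alpha smoothing and the toy corpus are illustrative assumptions, not the paper's SRILM-based setup:

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over tokenised sentences."""
    uni, bi = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    return uni, bi

def surprisal(word, prev, uni, bi, vocab_size, alpha=1.0):
    """S(w) = -log2 P(w | prev), with add-alpha smoothing."""
    p = (bi[(prev, word)] + alpha) / (uni[prev] + alpha * vocab_size)
    return -math.log2(p)
```

Summing per-token surprisal over a sentence and dividing by its length gives the kind of sentence-level density estimate these features build on.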

Quality Estimation
Machine translation QE is the process of estimating how accurate an automatic translation is through characteristic features of the source and target texts, and (possibly) also the translation engine, with a supervised machine learning setting to estimate quality scores. QE can be applied at the word, sentence and document level (Gandrabur and Foster, 2003; Ueffing et al., 2003; Blatz et al., 2003; Scarton and Specia, 2014).
Many different delexicalised dense features have been explored in previous work on QE, including language and topic models, n-best lists, etc. (Quirk, 2004; Ueffing and Ney, 2004; Specia and Gimenez, 2010; Rubino et al., 2013a). It has been shown that the performance of a supervised classifier in distinguishing between originals and automatic translations is correlated with the quality of the machine translated texts (Aharoni et al., 2014): low-quality translations, containing grammatical and syntactic errors as well as incorrect lexical choices, are robust indicators of automatic translations. In the case of human translation, to the best of our knowledge, there are no empirical studies on the level of professional expertise in the translation process and its correlation with the performance of a translationese classifier.

Translator Experience
Jääskeläinen (1997) describes translational behaviour of professionals and non-professionals who perform translation from English into Finnish. Carl and Buch-Kromann (2010) apply psycholinguistic methods in their analysis. They present a study of translation phases and processes for student and professional translators, relating translators' eye movements and keystrokes to the quality of the translations produced. They show that the translation behaviour of novice and professional translators differs with respect to how they use the translation phases. Englund Dimitrova (2005) develops a combined process and product analysis and compares translators with different levels of translation experience, but concentrates only on cohesive explicitness.
Most of these works are process-oriented rather than product-oriented, which means that features of translated texts are rarely taken into account. However, some of the findings are valuable for the analysis of translated texts. For instance, Göpferich & Jääskeläinen (2009) find that with increasing translation competence, translators focus on larger translation units, which can impact the choice of linguistic encoding translators use.

Experimental Setup
Our experiments are designed to investigate underexplored topics focusing on (i) information theoretic and (ii) machine translation QE features in translation classification. We use dense vector representations with fully delexicalised features and investigate three hypotheses:

Supervised Classification
In order to train a classifier and predict binary labels on unseen data, we use a dense vector sentence-level representation associated with a class (x_i, y_i), i = 1, ..., l (l is the number of training instances), with x_i ∈ R^n (n is the size of a dense vector) and y ∈ {-1, 1}^l. We train classification models with a support vector machine (the C-SVC implementation in LIBSVM (Chang and Lin, 2011)), solved as a quadratic optimization problem in which a kernel function φ projects the training data into a higher dimensional space. We use the radial basis function (RBF) kernel, as it produced the best empirical results compared to linear and polynomial kernels. The class of an unseen instance x is predicted as sign(Σ_{i=1..l} α_i y_i φ(x_i, x) + b), where the α_i are the learned support vector coefficients and b is the bias.

Table 1: Details of the corpora used to train language and n-gram frequency models for originally authored texts and translations.
Two hyper-parameters have to be set for C-SVC with the RBF kernel: the regularisation parameter (or penalty) C and the kernel parameter γ. We use grid-search to find optimal values, performing a 5-fold cross-validation on the training data. To avoid over-fitting, we use a held-out development set to evaluate the models obtained.
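The setup above can be approximated with scikit-learn's C-SVC wrapper around LIBSVM. The grid ranges and synthetic feature vectors below are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy dense vectors standing in for the delexicalised sentence representations.
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(1, 1, (100, 10))])
y = np.array([-1] * 100 + [1] * 100)

# Grid-search over C and gamma with 5-fold cross-validation, as described above.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=5,
)
grid.fit(X, y)
pred = grid.predict(X)
```

In practice the selected model would then be evaluated on a held-out development set before testing, mirroring the paper's protocol.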

Datasets
The datasets used in our experiments are separated into two subsets: corpora used to extract features and corpora used to train, tune and test our classifiers. The former are taken from publicly available bilingual English-German parallel corpora consisting of parliamentary proceedings, literary works and political commentary, compiled by . These corpora are used individually to train language models and compute n-gram frequency distributions. Basic corpus statistics are presented in Table 1. The latter are composed of German texts, taken from the VARTRA corpus (Lapshinova-Koltunski, 2013), which were either originally written in German (originals) or translated from English (translations).
Originals and translations belong to the same genres and registers and can be considered comparable. They include a mixture of literary, tourism and popular-scientific texts, instruction manuals, commercial letters and political essays and speeches. The VARTRA translations are split into two sets: one produced by professional translators, and one produced by translator trainees. Details are presented in Table 2. We extract balanced subsets of training, tuning and testing data containing three, one and two thousand sentences, respectively, of each type.

Feature Sets
For classification, input text is represented as a set of feature vectors. The features capture aspects of information density and translation QE. Throughout, we use unlexicalised lower-dimensional dense vectors rather than high-dimensional lexicalised sparse vectors to minimise the influence of specific content on classification results. We extract a total of 778 features 2 and separate them into four subsets corresponding to broad but distinct characteristics of original and translated sentences: surface and distortion features are related to QE, while surprisal and complexity features are inspired by information theory.
Surface Features -13 surface features based on meta representations of sentences' lexical form.
Features include sentence and average word length, the number of word tokens and number of punctuation marks. Three case-based features capture the number of upper-cased letters and words, and a binary feature indicates whether a sentence starts with an upper-case character. Another binary value encodes whether the sentence ends with a period. Two features are obtained from the ratio between the number of upper-cased and lower-cased letters, the number of punctuation marks and the length of the sentence. Finally two features capture the number of periods merged with words and words with mixed-case characters.
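As an illustration, a sketch of such surface features is given below; the exact feature definitions and punctuation inventory in the paper may differ from these reconstructions:

```python
def surface_features(sentence):
    """Illustrative reconstruction of a subset of the 13 surface features."""
    tokens = sentence.split()
    n_words = len(tokens)
    feats = {
        "sent_len": len(sentence),
        "n_words": n_words,
        "avg_word_len": sum(len(t) for t in tokens) / max(n_words, 1),
        "n_punct": sum(1 for c in sentence if c in ".,;:!?"),
        "n_upper_chars": sum(1 for c in sentence if c.isupper()),
        "n_upper_words": sum(1 for t in tokens if t[0].isupper()),
        "starts_upper": int(sentence[:1].isupper()),
        "ends_period": int(sentence.rstrip().endswith(".")),
    }
    # Ratio features, guarded against division by zero.
    feats["upper_lower_ratio"] = feats["n_upper_chars"] / max(
        sum(1 for c in sentence if c.islower()), 1)
    feats["punct_len_ratio"] = feats["n_punct"] / max(len(sentence), 1)
    return feats
```

Each sentence is mapped to a fixed-length dense vector of such values, independent of its actual vocabulary.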
Surprisal Features -225 features based on the surprisal measure presented in Section 2.2 are extracted using language models trained on words, delexicalised part-of-speech sequences and flattened syntactic trees. The language models are trained on individual 3 corpora presented in Table 1. We extract n-gram (n ∈ [1; 5]) log-probabilities and perplexities, with and without the tags indicating the beginning and ending of sentences, using the SRILM toolkit (Stolcke et al., 2011).
Complexity Features -315 features based on n-gram frequencies, indicating how frequent the lexical choices, part-of-speech and flattened syntactic sequences present in the text to be classified are. As for the surprisal features, we use the same originally authored and translated texts individually to extract n-gram frequency quartiles. We extract the percentage of n-grams (n ∈ [1; 5]) occurring in each quartile. Frequency percentages are averaged at the sentence level, leading to 4 features per sentence (one per quartile) given a value of n, for each corpus used to define the frequency quartiles. This approach allows us to avoid encoding raw n-gram features and keep a dense vector representation (Blatz et al., 2003).
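A minimal sketch of the quartile-based complexity features, assuming reference n-gram counts are available (the paper derives them from the corpora in Table 1):

```python
from collections import Counter

def frequency_quartiles(reference_ngrams):
    """Split the reference n-grams into four quartiles by frequency rank."""
    ranked = [ng for ng, _ in reference_ngrams.most_common()]
    q = max(len(ranked) // 4, 1)
    return [set(ranked[i * q:(i + 1) * q if i < 3 else None]) for i in range(4)]

def complexity_features(sent_ngrams, quartiles):
    """Percentage of the sentence's n-grams falling in each frequency quartile."""
    total = max(len(sent_ngrams), 1)
    return [sum(1 for ng in sent_ngrams if ng in qs) / total for qs in quartiles]
```

Because only four percentages per n-gram order and reference corpus are kept, the representation stays dense regardless of vocabulary size.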
Distortion Features -225 features based on the possible distortion in lexical, part-of-speech and syntactic structures observed between originals and translations, as well as between different levels of translation experience. These features are extracted in the same way as the surprisal features, but based on language models trained on sentence-level reversed text. Backward language model features are popular in translation quality estimation studies and show interesting results (Duchateau et al., 2002; Rubino et al., 2013b).
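A hedged sketch of the preprocessing behind the distortion features: reversing token order per sentence, so that a standard forward n-gram model trained on the output behaves as a backward language model. The bigram counting is illustrative, not the paper's SRILM pipeline:

```python
from collections import Counter

def reverse_corpus(sentences):
    """Reverse token order per sentence; an n-gram LM trained on this
    output is a backward LM, the basis of the distortion features."""
    return [list(reversed(s)) for s in sentences]

def backward_bigrams(sentences):
    """Bigram counts over reversed sentences, i.e. P(w_k | w_{k+1}) events."""
    counts = Counter()
    for s in reverse_corpus(sentences):
        counts.update(zip(s, s[1:]))
    return counts
```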
The flattened syntactic trees are then delexicalised in order to remove all surface forms from the representations.

Results and Analysis
Below we provide details on discriminating between originally authored texts and translations, followed by the prediction of translation experience comparing professional translators and students. Finally, we evaluate feature importance with individual and ensemble feature selection techniques.

Original vs Translated Texts
Two sets of experiments are conducted to discriminate between originals and professional translations (Table 3) and between originals and student translations (Table 4). For each classification task, we evaluate feature groups on the test set containing 4,000 unseen sentences balanced over the two classes, reporting overall accuracy as well as precision, recall and f-score. Finally, a classification model is trained and evaluated combining all features.
Classification of originals vs. professional translations reaches a maximum accuracy of 70.0% using the distortion feature set, with surprisal a close second at 69.2%. The difference is not statistically significant (bootstrap resampling at p < 0.05). Both outperform the other types of features, as well as the combination of all feature types. Per-class evaluation shows a similar trend for the best performing feature sets. The results show that originals and professional translations exhibit differences in terms of sequences of words, part-of-speech and syntactic tags which are captured by language model based features.

Table 4: Accuracy, precision, recall and F-measure obtained on the originals versus student translations classification task. Best results in bold and statistically significant winner marked with (p < 0.05).
The classification of originals and student translations shows that the combination of the four feature types leads to the most accurate classifier, followed by the distortion and the surprisal sets (with equivalent accuracy results at p < 0.05). These two feature sets are the best performing ones overall across the two classification tasks. Comparing the two tasks, originally authored texts are closer to professional translations and more distant from student translations, which validates two of our hypotheses listed in Section 3.

Translation Expertise
In order to investigate whether our third assumption is correct, we perform binary classification between professional and student translations (Table 5). The results, barely above the 50% baseline, show the proximity of the two types of translations according to our feature sets, which supports our third assumption. The combination of four feature types reaches the highest accuracy, followed by the distortion and complexity sets. However, the surprisal features do not seem to be helpful in differentiating between the professional and the student translations, compared to the two previous binary classification tasks.
This result indicates that the surprisal measure is a reliable source of information to determine whether a sentence is originally authored or a translation, but it is not reliable for separating two translations produced by translators with different levels of expertise. The features inspired by translation quality estimation do not reach high accuracy results: it seems that the difference between professional and student translations cannot be tied to properties of the surface level or lexical choices of the human translators as indirectly captured by our features.

Table 5: Accuracy, precision, recall and F-measure obtained on the professional versus student translations classification task. Best results in bold and statistically significant winner marked with (p < 0.05).

Table 6 shows the confusion matrix obtained with the classifier trained on the combination of the four feature sets. This classifier reaches third position overall in terms of accuracy, behind the distortion and surprisal sets in first and second positions, respectively. This ranking of classifiers trained on different feature sets follows the trend observed in the originals versus professional translations binary classification task.

Table 6: Confusion matrix obtained using a classifier trained on the four feature sets for the multi-class task, separating originals, professional and student translations.

3-way Classification
The training method chosen for the multi-class problem is one-against-one, where individual classifiers are first trained on each binary classification task before being combined to form the final multi-class classifier (Hsu and Lin, 2002). The results indicate that our feature sets distinguish originally authored texts from professional and student translations (first line of the matrix), while the professional translations are more difficult to separate from the two other types of text. Also, student translations have characteristics differing from originals and professional translations, which can be captured with our feature sets (last line of the matrix). However, the columns of the confusion matrix show that originals are not necessarily closer to professional translations, as indicated by the first column, where a larger number of gold originals are incorrectly classified as student translations. The same trend is observable in the last column. These results go against the hypothesis that originals and student translations are easier to separate, a phenomenon which does not appear in the binary classification task (originals vs. student translations).
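The one-against-one scheme can be illustrated with scikit-learn's SVC, which implements exactly this pairwise decomposition; the three-class synthetic data below stands in for the originals, professional and student classes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Three well-separated synthetic classes as stand-ins for the real data.
X = np.vstack([rng.normal(m, 0.3, (50, 5)) for m in (0, 1, 2)])
y = np.repeat(["original", "professional", "student"], 50)

clf = SVC(kernel="rbf", decision_function_shape="ovo")
clf.fit(X, y)
# Three classes -> 3 * (3 - 1) / 2 = 3 pairwise decision values per instance.
n_pairwise = clf.decision_function(X[:1]).shape[1]
```

Each of the three pairwise classifiers votes, and the class with the most votes is predicted, following Hsu & Lin (2002).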

Feature Performance
Evaluating the performance of our feature sets is done by calculating the discriminative power of each feature individually, which allows us to rank features according to their correlation with a class given a classification task. We follow the "f-score" measure (1) as proposed by Chen and Lin (2006):

F(i) = [ (x̄_i^(+) - x̄_i)^2 + (x̄_i^(-) - x̄_i)^2 ] / [ 1/(n_+ - 1) Σ_{k=1..n_+} (x^(+)_{k,i} - x̄_i^(+))^2 + 1/(n_- - 1) Σ_{k=1..n_-} (x^(-)_{k,i} - x̄_i^(-))^2 ]   (1)

where x̄_i, x̄_i^(+) and x̄_i^(-) are the averages of the i-th feature over the whole training set, the n_+ positive instances and the n_- negative instances respectively, and x^(+)_{k,i} (resp. x^(-)_{k,i}) is the i-th feature of the k-th positive (resp. negative) instance. The measure indicates how discriminative a feature is given a binary classification task. A drawback of the f-score is that it does not take into account possible feature complementarity.
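The f-score measure can be implemented directly; this sketch assumes a NumPy feature matrix with labels in {-1, 1}:

```python
import numpy as np

def fisher_fscore(X, y):
    """F-score of Chen & Lin (2006) per feature: squared distances of the
    class means from the overall mean, over the within-class variances."""
    Xp, Xn = X[y == 1], X[y == -1]
    mean, mp, mn = X.mean(0), Xp.mean(0), Xn.mean(0)
    num = (mp - mean) ** 2 + (mn - mean) ** 2
    den = Xp.var(0, ddof=1) + Xn.var(0, ddof=1)  # 1/(n-1) normalisation
    return num / np.maximum(den, 1e-12)
```

Sorting features by this score gives the individual discriminative ranking, with the caveat noted above that feature complementarity is ignored.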
We report the distribution of the top 25 features amongst the three levels of analysis, lexical, POS and syntax (Figure 2a), and amongst the four feature types, surface, surprisal, complexity and distortion (Figure 2b). The results show that POS features are not ranked as the most discriminant ones when evaluated individually, while syntactic features are the most important ones for the originals vs. professional translations task and lexical features have the highest discriminative power for the two other tasks. Looking at the feature types, we see that complexity features, based on n-gram frequencies, are the most discriminant for the three tasks, followed by the surprisal features, while the distortion and surface features do not have strong discriminative power. Most of the top n-gram based features rely on sequences between 1 and 3 words, indicating that higher-order n-grams are not important features when considered individually. Surprisal, distortion and complexity features are based on external resources (detailed in Table 1), and the corpus of political texts translated into German is the most useful one for extracting the complexity and surprisal features, which can be explained by the presence of political speeches and essays in the VARTRA corpus.
The results obtained on individual feature discriminative power do not reflect the ones obtained using features grouped by type. Individually, features indicating complexity based on n-gram frequencies are ranked highest. However, only a few of the distortion features appear in the discriminative ranking, while this feature type reaches the highest accuracy scores on the three binary classification tasks. These results indicate that features are highly complementary within a group of a particular type, but also between different types. To capture possible relationships between features, we conduct a non-linear feature selection using the forest of randomised trees approach (Geurts et al., 2006) and present the results for the top 25 features in Figure 2 (right hatched bars).
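The forest-of-randomised-trees ranking can be reproduced with scikit-learn's ExtraTreesClassifier; the synthetic data below, where only two of five features carry signal, is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (300, 5))
# Only features 0 and 1 determine the label; 2-4 are noise.
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
# Importances account for interactions, unlike the individual f-score ranking.
ranking = np.argsort(forest.feature_importances_)[::-1]
```

Unlike the f-score, this ranking rewards features that are useful in combination, which is why it surfaces the complementarity discussed above.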
The tree-based feature ranking method shows the complementarity of words and POS features, while the syntactic ones appear in the top 25 for the original vs. translation tasks for both levels of expertise. When looking at the feature types, the originals vs. professional task relies mainly on a mixture of distortion and complexity features, and surprisal indicators are totally absent from the top 25 for the professional vs. student task. For both tasks involving student translation, the complexity features are the most important ones, and simple surface features are useful, such as the average words occurrence per sentence or the ratio between the number of punctuation marks and the sentence length. The most useful external resource used to extract n-gram based features is again the political corpus, indicating once more the domain proximity of our datasets.
Individually, syntactic features appear to be highly discriminant when classifying between originals and translations (regardless of expertise), which may indicate two translationese phenomena: simplification (translators use less complex constructions) and interference or shining through (source syntax shines through in translated texts). The ensemble ranking shows that surprisal and distortion, although not as important as complexity, are important indicators of translationese, as they appear in both tasks where originals are classified against translations. These feature types are not present in the top 25 when only translated texts are classified.

Discussion
Previous research (Baroni and Bernardini, 2006; Volansky, 2012) has shown that high classification accuracy (> 80%) can be achieved using lexicalised token n-gram sparse feature vectors. As a sanity check, we conduct a set of experiments for each of our classification tasks using token unigram frequencies as features, normalised by the segment length. The vocabulary defining the feature vector dimensionality is taken from the training sets, using the data presented in Table 2 only, leading to 25,561 features. The same classification setup as presented in Section 3 is used, and we observe accuracy results reaching 78.0%, 83.3% and 65.2% for originals vs. professional, originals vs. student and professional vs. student classification respectively. For the three-way task, an accuracy score of 62.7% is reached. These results are substantially lower than the ones reported by Volansky (2012), mostly because of the size of the text chunks, which has a strong impact on performance as shown by . In our work, we classify each sentence individually as it appears naturally in the corpus, while most previous studies are based on artificial chunks of approximately 2,000 tokens. Another explanation for the low performance of the unigram-based features is related to our mixed-domain setting, as it has been shown that classifier performance drops drastically when trained on this type of features and tested on out-of-domain data .
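The unigram baseline can be sketched as follows; the two toy segments and the whitespace token pattern are illustrative assumptions, not the VARTRA data or the paper's exact tokenisation:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "die Katze sitzt auf der Matte"]
# Whitespace tokenisation; vocabulary is fixed by the training documents.
vec = CountVectorizer(lowercase=True, token_pattern=r"\S+")
X = vec.fit_transform(docs).astype(float).toarray()
X /= X.sum(axis=1, keepdims=True)  # normalise counts by segment length
```

The resulting sparse, vocabulary-sized vectors contrast with the 778-dimensional dense representation used in the main experiments.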

Conclusion
This paper presented a first step in using information density, and especially surprisal and complexity inspired features, as well as features used in translation quality estimation, as indicators of translationese for originally authored and manually translated text classification. We focused on separating originals and translations produced by humans with different levels of expertise and showed that translationese features based on information density and quality estimation are useful indicators of whether a text was manually translated or originally produced. We conducted experiments in a mixed-domain setting, including literary, tourism and scientific texts, as well as instruction manuals, commercial letters and political essays and speeches.
Our experiments on feature type evaluation show that the best performing set is one of quality estimation inspired distortion indicators, extracted from backward language models trained on originally authored and translated texts. When features are evaluated individually according to the "f-score" measure (Chen and Lin, 2006), the most discriminative ones are from the complexity subset, extracted from n-gram frequency quartiles, followed by surprisal features, both extracted at the lexical and syntactic levels. The ensemble feature evaluation based on randomised trees reveals feature complementarity and shows that extracting complexity and distortion indicators at the lexical and POS levels leads to the highest performing sets.
The features used in our experiments are extracted at the word level. As future work, we plan to extend our feature sets with information theoretic aspects of character-level indicators, such as character n-gram frequencies and language models, encoding complexity and surprisal respectively. This approach would allow us to capture sub-word information density indicators, such as morphological information (Avner et al., 2014).