Detecting Syntactic Features of Translated Chinese

We present a machine learning approach to distinguish texts translated to Chinese (by humans) from texts originally written in Chinese, with a focus on a wide range of syntactic features. Using Support Vector Machines (SVMs) as the classifier on a genre-balanced corpus from translation studies of Chinese, we find that constituent parse trees and dependency triples as features without lexical information perform very well on the task, with an F-measure above 90%, close to the results of lexical n-gram features, without the risk of learning topic information rather than translation features. Thus, we claim syntactic features alone can accurately distinguish translated from original Chinese. Translated Chinese exhibits an increased use of determiners, subject position pronouns, NP + "的" as NP modifiers, multiple NPs or VPs conjoined by "、", among other structures. We also interpret the syntactic features with reference to previous translation studies in Chinese, particularly the usage of pronouns.


Introduction
Work in translation studies has shown that translated texts differ significantly, in subtle and not so subtle ways, from original, non-translated texts. For example, Volansky et al. (2013) show that the prefix mono- is more frequent in Greek-to-English translations because etymologically it originates from Greek. Also, the structure of modal verb, infinitive, and past participle (e.g. must be taken) is more prevalent in translated English from 10 source languages.
We also know that machine learning based approaches can distinguish translated from original texts with high accuracy for Indo-European languages such as Italian (Baroni and Bernardini, 2005), Spanish (Ilisei et al., 2010), and English (Volansky et al., 2013; Lembersky et al., 2012; Koppel and Ordan, 2011). Features used in those studies include common bag-of-words features, such as word n-grams, as well as part-of-speech (POS) n-grams, function words, etc. Although such surface features yield very high accuracy (in the high nineties), they do not contain much deeper syntactic information, which is key in interpreting textual styles. Furthermore, despite the large amount of research on Indo-European languages, few studies have quantitatively investigated either lexical or syntactic features of translated Chinese, and to our knowledge, no automatic classification experiments have been conducted for this language.
Thus the purpose of this paper is two-fold: First, we perform translated vs. original text classification on a balanced corpus of Chinese, in order to verify whether translationese in Chinese is as real as it is in Indo-European languages, and to discover which structures are prominent in translated but not original Chinese texts. Second, we show that syntactic features without any lexical information, such as context-free grammar (CFG) rules, subtrees of constituent parses, and dependency triples, perform almost as well as lexical n-gram features, confirming the translationese hypothesis from a purely syntactic point of view. These features are also easily interpretable for linguists interested in syntactic styles of translated Chinese. We analyze the top syntactic features ranked by a common feature selection algorithm, and interpret them with reference to previous studies on translationese features in Chinese.

Translated vs. Original Classification
The pioneering work of Baroni and Bernardini (2005) is one of the first to use machine learning methods to distinguish translated from original (Italian) texts. They experimented with word/lemma/POS n-grams and mixed representations and reached an F-measure of 86% using recall-maximizing combinations of SVM classifiers. In the mixed n-gram representation, they used inflected wordforms for function words, but replaced content words with their POS tags. The high F-measure (85.2%) with such features shows that "function word distributions and shallow syntactic patterns" without any lexical information can already account for many of the characteristics of translated text. Volansky et al. (2013) is a very comprehensive study that investigated translationese in English by looking at original and translated English from 10 source languages, in a European parliament corpus. While they mainly aimed to test translational universals, e.g. simplification, explicitation, etc., the classification accuracy with SVMs using features such as POS trigrams (98%), function words (96%), and function word n-grams (100%) provided more evidence that function words and surface syntactic structures may be enough for the identification of translated text.
For Chinese, however, there are very few quantitative studies on translationese (apart from Xiao and Hu, 2015; Hu, 2010, etc.). Xiao and Hu (2015) built a comparable corpus containing 500 original and 500 translated Chinese texts, from four genres. They used statistical tests (log-likelihood tests) to find statistical differences between translated and original Chinese with regard to the frequency of mostly lexical features. They discovered, for example, that translated texts use significantly more pronouns than the original texts, across all genres. But they were unable to investigate the syntactic contexts in which those overused pronouns occur most often.
For them, syntactic features were examined through word n-grams, similar to previous studies on Indo-European languages, but no text classification task was carried out.

Syntactic Features in Text Classification
Although n-gram features are more prevalent in text-classification tasks, deep syntactic features have been found useful as well. In the Native Language Identification (NLI) literature, which in many respects is similar to the task of detecting translations, various forms of context-free grammar (CFG) rules are often used as features (Bykh and Meurers, 2014; Wong and Dras, 2011). Bykh and Meurers (2014) showed that using a form of normalized counts of lexicalized CFG rules plus n-grams as features in an ensemble model performed better than all other previous systems. Wong and Dras (2011) reported that using unlexicalized CFG rules (except for function words) from two parsers yielded statistically higher accuracy than simple lexical features (function words, character and POS n-grams).
Other approaches have used rules of tree substitution grammar (TSG) (Post and Bergsma, 2013; Swanson and Charniak, 2012) in NLI. Swanson and Charniak (2012) compared the results of CFG rules and two variants of TSG rules and showed that TSG rules obtained through Bayesian methods reached the best results.
Nevertheless, such deep syntactic features are rarely used, if at all, in the identification of translated texts. This is the gap that we hope to fill.

Dataset
We use the comparable corpus by Xiao and Hu (2015), which is composed of 500 original Chinese texts from the Lancaster Corpus of Modern Chinese (LCMC), and another 500 human translated Chinese texts from the Zhejiang-University Corpus of Translated Chinese (ZCTC). All texts are of similar lengths (~2000 words), and from different genres. There are four broad genres: news, general prose, science, and fiction (see Table 1), and 15 second-level categories. We exclude texts from the second-level categories "science fiction" and "humor" (both under fiction) since they only have 6 and 9 texts respectively, which is not enough for a classification task.
LCMC (McEnery and Xiao, 2004) was originally designed for "synchronic studies of Chinese and the contrastive studies of Chinese and English" (see Xiao and Hu, 2015, chapter 4.2). It includes written Chinese sampled from 1989 to 1993, amounting to about one million words. ZCTC was created specifically for translation studies "as a comparable counterpart of translated Chinese" to LCMC (Xiao and Hu, 2015, p. 48), with the same genre distribution and also one million words in total. The texts in ZCTC were sampled in 2001, all translated by human translators, with 99% originally written in English (p. 50).
Both corpora contain texts that are segmented and POS tagged, processed by the corpus developers using the 2008 version of ICTCLAS (Zhang et al., 2003), a common tagger used in Chinese NLP research. However, only the segmentation is used in this study since our parser uses a different POS tagset.
In this study, we perform 5-fold cross validation on the whole dataset and then evaluate on the full set of 970 texts.
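The fold construction can be illustrated with a minimal sketch. It assumes that each text carries a class label and a genre label and that folds are balanced over (class, genre) pairs; the function name `stratified_folds` and the toy data are hypothetical and not taken from the corpus or the authors' code.

```python
import random
from collections import defaultdict

def stratified_folds(labels, genres, k=5, seed=0):
    """Assign each text index to one of k folds, balancing (label, genre) strata."""
    strata = defaultdict(list)
    for idx, key in enumerate(zip(labels, genres)):
        strata[key].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        # Deal texts round-robin; the running offset keeps small strata
        # from always starting at fold 0.
        for i, idx in enumerate(members):
            folds[(i + offset) % k].append(idx)
        offset += len(members)
    return folds

# Toy data (hypothetical): 2 classes x 2 genres, 10 texts per stratum.
labels = ["original"] * 20 + ["translated"] * 20
genres = (["news"] * 10 + ["fiction"] * 10) * 2
folds = stratified_folds(labels, genres)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    print(len(train_idx), len(test_idx))
```

Each fold serves once as test data while the remaining four folds form the training data.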

Features
Character and word n-gram features can be considered both an upper bound and a baseline. On the one hand, they have been used extensively (see Section 2), but on the other hand, they partially encode topic information rather than stylistic differences because of their lexical nature. Consequently, while they are very informative in the current setup, they may not be useful if we want to use the trained model on other texts.
For syntactic features, we use various forms of constituent and dependency parses of the sentences. We extract the following features based on either type of parse using the CoreNLP parser with its pre-trained parsing model.

Context-Free Grammar
Context-free grammar rules (CFGR) We use the count of each CFG rule extracted from the parse trees.
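As a minimal sketch of how such rule counts can be obtained, the following reads a Penn-style bracketed parse and counts its unlexicalized rules. The helper names `parse_bracketed` and `cfg_rules` and the example parse are hypothetical; words are discarded on reading, so POS tags act as leaves.

```python
import re
from collections import Counter

def parse_bracketed(text):
    """Parse a bracketed tree into (label, children); words are dropped,
    so a POS tag ends up with an empty child list."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        label = tokens[pos + 1]
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                pos += 1  # skip the word to keep features unlexicalized
        pos += 1  # consume ")"
        return (label, children)

    return read()

def cfg_rules(tree):
    """Yield one 'LHS -> RHS' string per internal node of the tree."""
    label, children = tree
    if children:
        yield label + " -> " + " ".join(child[0] for child in children)
        for child in children:
            yield from cfg_rules(child)

# Illustrative parse of "美国 的 政治" (US DE politics).
parse = "(NP (DNP (NP (NR 美国)) (DEG 的)) (NP (NN 政治)))"
counts = Counter(cfg_rules(parse_bracketed(parse)))
for rule, n in sorted(counts.items()):
    print(rule, n)
```

Summing such counters over all sentences of a text gives one count vector per text.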
Subtrees Subtrees are defined as any part of the constituent tree of any depth, closely following the data-oriented parsing (DOP) paradigm (Bod et al., 2003; Goodman, 1998). Our subtrees differ from those used in related work (e.g., Sangati and Zuidema, 2011; Swanson and Charniak, 2012) in that we do not include any lexical information in order to exclude topical influence from content words. Thus no lexical rules are considered, and POS tags are considered to be the leaf nodes (Figure 2).
We experiment with subtrees of depth up to 3 since the number of subtrees grows exponentially as the depth increases. With depth 3, we are already facing more than 1 billion features, and performing subtree extraction and feature selection becomes difficult and time-consuming. Also note that CFGRs are essentially subtrees of depth 1, so with increasing maximum depth of subtrees, we test relations in constituent parses that are less local. In the future, we plan to use Bayesian methods (Post and Gildea, 2009) to sample from all the subtrees.
We also conduct separate experiments using subtrees headed by a specific label (we only look at NP, VP, IP, and CP, since they are the most frequent types of subtrees). For example, using NP subtrees as features will inform us how important the noun phrase structure is in identifying translationese.
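The subtree features can be illustrated with a small enumeration sketch. It assumes a DOP-style definition in which a fragment expands a node with all of its children, and each child is either left as a frontier label or expanded further. Trees are nested tuples with POS tags as leaves, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def fragments(node, max_depth):
    """Bracketed fragments rooted at `node` with depth 1..max_depth."""
    label, children = node[0], node[1:]
    if not children or max_depth == 0:
        return []
    options = []
    for child in children:
        # Each child is either a frontier label or a deeper fragment.
        options.append([child[0]] + fragments(child, max_depth - 1))
    return ["(" + label + " " + " ".join(combo) + ")"
            for combo in product(*options)]

def all_fragments(tree, max_depth=3):
    """Count fragments rooted at every node of the tree."""
    counts = Counter(fragments(tree, max_depth))
    for child in tree[1:]:
        counts.update(all_fragments(child, max_depth))
    return counts

# Unlexicalized toy tree: a pronoun subject and a verb with a noun object.
tree = ("IP", ("NP", ("PN",)), ("VP", ("VV",), ("NP", ("NN",))))

depth1 = all_fragments(tree, max_depth=1)  # same as the CFG rules
depth3 = all_fragments(tree, max_depth=3)
for frag in sorted(depth3):
    print(frag)
```

On this toy tree the depth-1 fragments coincide with the four CFG rules, while raising the maximum depth to 3 already yields ten fragments, which illustrates how quickly the feature space grows. Restricting the result to fragments whose root is NP, VP, IP, or CP gives the label-specific feature sets.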

Dependency Graphs
Dependency relations, as well as the head and the dependent, are extracted to construct the following features.
depTriple We combine the POS of a head and its dependent along with the dependency relation, e.g., [VV, nsubj, PN] describes a dependency relation of a nominal subject (nsubj) between a verb (VV) and a pronoun (PN).
depPOS Here only the POS tags of the head and dependent are used, e.g., [VV, PN].
depTripleFuncLex Same as depTriple, except when the word is a function word, we use the lexical item instead of the POS. e.g. [VV, nsubj, 我们] where "我们" (we) is a function word (Figure 3).
It should be noted that no lexical information is included in our syntactic features, except for the function words in depTripleFuncLex.
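The three dependency feature types can be sketched as follows. The input format (one tuple per dependency arc) and the set `FUNCTION_POS`, which decides what counts as a function word, are assumptions made for illustration; the text above does not specify how function words are identified.

```python
from collections import Counter

# Assumption: function words are approximated by closed-class POS tags.
FUNCTION_POS = {"PN", "DT", "P", "CC", "DEG", "PU"}

def dep_features(arcs, function_pos=FUNCTION_POS):
    """Build depTriple, depPOS and depTripleFuncLex counts from arcs given
    as (head_word, head_pos, relation, dep_word, dep_pos) tuples."""
    triple, pos_pair, func_lex = Counter(), Counter(), Counter()
    for head_word, head_pos, rel, dep_word, dep_pos in arcs:
        triple[(head_pos, rel, dep_pos)] += 1
        pos_pair[(head_pos, dep_pos)] += 1
        # Function words keep their lexical form; all others stay as POS.
        head = head_word if head_pos in function_pos else head_pos
        dep = dep_word if dep_pos in function_pos else dep_pos
        func_lex[(head, rel, dep)] += 1
    return triple, pos_pair, func_lex

# Toy sentence "我们 喜欢 书" (we like books).
arcs = [
    ("喜欢", "VV", "nsubj", "我们", "PN"),
    ("喜欢", "VV", "dobj", "书", "NN"),
]
triple, pos_pair, func_lex = dep_features(arcs)
print(triple)
print(pos_pair)
print(func_lex)
```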

Combination of Features
If combined feature sets work significantly better than one feature set alone, we can draw the conclusion that they model different characteristics of translationese. We experiment with combinations of CFGR/subtree and depTriple features.

Classifier and Feature Selection
For the machine learning experiments, we use support vector machines, in the implementation of the svm.SVC classifier in scikit-learn (Pedregosa et al., 2011). We perform 5-fold cross validation and average over the results. When extracting the folds, we perform stratified sampling across genres so that both training and test data are balanced. Since the number of CFGR/subtree features is much greater than the number of training texts, we perform feature selection by filtering using information gain (Liu et al., 2016; Wong and Dras, 2011) to choose the most discriminative features. Information gain has been shown to select highly discriminative, frequent features for similar tasks (Liu et al., 2014). We experiment with different numbers of features: 100, 1,000, 10,000, and 50,000.

Features              F-measure (%)
char n-grams (1-3)    95.3
word n-grams (1-3)    94.3
POS n-grams (1-3)     93.9
Table 2: Results for the lexical and POS features
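Information gain filtering can be sketched in a few lines. The sketch makes a simplifying assumption: each feature is reduced to presence or absence in a text, whereas the features described above are counts. The toy documents and the function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(docs, labels, feature):
    """Reduction in class entropy from knowing whether `feature` occurs."""
    present = [y for d, y in zip(docs, labels) if feature in d]
    absent = [y for d, y in zip(docs, labels) if feature not in d]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (present, absent) if part)
    return entropy(labels) - remainder

def select_top_k(docs, labels, k):
    """Return the k features with the highest information gain."""
    vocab = set().union(*docs)
    return sorted(vocab,
                  key=lambda f: (-information_gain(docs, labels, f), f))[:k]

# Toy texts, each represented as the set of CFG rules it contains.
docs = [
    {"NP -> PN", "NP -> NN"},
    {"NP -> PN", "NP -> DP NP"},
    {"NP -> NN", "NP -> NN PU NN"},
    {"NP -> NN"},
]
labels = ["translated", "translated", "original", "original"]

print(select_top_k(docs, labels, k=2))
```

In a cross-validation setting, the selection would be fitted on the training folds only, so that the test fold does not influence which features are kept.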

Empirical Evaluation
First we report the results based on lexical and POS features in Table 2 (F-measure).
Character n-grams perform the best, achieving an F-measure of 95.3%, followed by word n-grams with an F-measure of 94.3%. Both settings include content words that indicate the source language. In fact, out of the top 30 character n-gram features that predict translations, 4 are punctuation marks, e.g., the first and family name delimiter "·" in the translations of English names and parentheses "()"; 11 are function words, e.g. "的" (particle), "可能" (maybe), "在" (in/at), and many pronouns (he, I, it, she, they); all others are content words, where "斯" (s) and "尔" (r) are at the very top, mainly because they are common transliterations of foreign names involving "s" and "r", followed by "公司" (company), "美国" (US), "英国" (UK), etc. Lexical features have been extensively analyzed in Xiao and Hu (2015), and they reveal little concerning syntactic styles of translated text; thus we will refrain from analyzing them here.
POS n-grams also produce good results (F-measure of 93.9%), confirming previous research on Indo-European languages (Baroni and Bernardini, 2005; Koppel and Ordan, 2011). Since they are not lexicalized and thus avoid a topical bias, they provide a better comparison to syntactic features.
Syntactic features: Table 3 presents the results for the syntactic features described in Section 3.3. The best performing unlexicalized syntactic features can reliably classify texts into "original" and "translated", with F-measures greater than 90%, close to the results for the lexical features in Table 2. This suggests that although lexical features do achieve slightly better results, syntactic features alone can capture most of the differences between original and translated texts. Note that when we increase the depth of constituent parses from 1 (CFGR) to subtrees of depth 3, the F-measure increases by 2 percent, which is a highly significant difference (McNemar's test (McNemar, 1947), significant at the 0.001 level). Thus, including deeper constituent information proves helpful in detecting the syntactic styles of texts.
However, combining different types of syntactic features does not increase the accuracy over the dependency results. Adding syntactic features to POS n-gram or character n-gram features decreases the POS n-gram results slightly, thus indicating that both types of features cover the same information, and POS n-grams are a good approximation of shallow syntax. The lack of improvement when adding syntactic features may also be attributed to their unlexicalized nature in this study. Our syntactic features are completely unlexicalized, whereas research in NLI has shown that CFGR features need to include at least the function words to give higher accuracy (Wong and Dras, 2011). Although this suggests that in terms of classification accuracy, unlexicalized syntactic features cannot provide more information than n-gram features, we can still draw some very interesting observations about styles of translated and original texts, many of which are not possible with simple n-gram features. We will discuss those in the following sections.
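For readers unfamiliar with McNemar's test, the following sketch shows how two classifiers evaluated on the same texts can be compared. It assumes the continuity-corrected chi-square variant with one degree of freedom; which variant was used for the result above is not stated, and the toy predictions are hypothetical.

```python
import math

def mcnemar(gold, pred_a, pred_b):
    """Continuity-corrected McNemar test on paired predictions.
    Returns (chi-square statistic, p-value) with 1 degree of freedom."""
    # Only texts on which exactly one system is correct carry information.
    only_a = sum(1 for g, a, b in zip(gold, pred_a, pred_b)
                 if a == g and b != g)
    only_b = sum(1 for g, a, b in zip(gold, pred_a, pred_b)
                 if a != g and b == g)
    if only_a + only_b == 0:
        return 0.0, 1.0
    chi2 = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Toy data: 100 texts; system A alone is right on 5, system B alone on 25.
gold = [1] * 100
pred_a = [1] * 75 + [0] * 25
pred_b = [1] * 70 + [0] * 5 + [1] * 25

chi2, p_value = mcnemar(gold, pred_a, pred_b)
print(round(chi2, 3), p_value < 0.001)
```

The test ignores texts that both systems classify correctly or both misclassify, which is why a modest difference in F-measure can still be highly significant when the disagreements are one-sided.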

Constituency Features
The top ranking CFG features are shown in Table 5. The top three features in the translated section (bottom half) of the table tell us that pronouns (PN) and determiners (DT) are indicative of translated text. We will discuss pronouns in Section 5; as for determiners, dependency graph features in Table 7 further show that among them, "该" (this), "这些" (these) and "那些" (those) are the most prominent. The parenthesis rule (PRN) captures another common feature of translation, i.e., giving the original English form of proper nouns ("加州大学洛杉矶分校(UCLA)") or putting translator's notes in parentheses. Furthermore, the prominence of the two rules NP → DNP NP and DNP → NP DEG in translation indicates that when an NP is modified by another NP, translators tend to add the particle "的" (DE; DEG for DE Genitive) between the two NPs, for example:
• (NP (DNP (NP 美国) (DEG 的)) (NP 政治)). Gloss: "US DE politics", i.e. US politics
• (NP (DNP (NP 舆论) (DEG 的)) (NP 谴责)). Gloss: "media DE criticism", i.e. criticism from the media
• (NP (DNP (NP 脑) (DEG 的)) (NP 供血)). Gloss: "brain DE blood supply", i.e. cerebral circulation
In all three cases above, "的" can be dropped, and the phrases remain grammatical. But there are many cases where "的" is mandatory in the "NP modifying NP" structure. Thus, it is easier to use "的", since it is almost always grammatical, but decisions when to drop "的" are much more subtle. Translators seem to make the safer decision by always using the particle after the NP modifiers, thus making the structure more frequent.
Table 5: Top 20 CFGR features; rank averaged across 5-fold CV
Now we turn to features of subtrees rooted in specific syntactic categories. The classification results are shown in Table 4. Using only NP-headed rules gives us an F-measure of 86.4%. Larger subtrees fare slightly worse, probably indicating data sparsity. However, these results mean that noun phrases alone often provide enough information to determine whether the text is translated.
Table 6 shows the top 20 CFGR features headed by an NP. This gives us an idea of the distinctive structures of noun phrases in original and translated texts. Apart from the obvious over-use of pronouns (PN) and determiner phrases (DP) for NPs in translated text, there are other very interesting patterns: For original Chinese, nouns inside a complex noun phrase tend to be conjoined by the Chinese-specific punctuation mark "、" (similar to the comma in "I like apples, oranges, bananas, etc."), indicated by the high ranking of NP rules involving PU. This punctuation mark is most often used to separate elements in a list, and a check using Tregex (Levy and Andrew, 2006) for the parsed sentences retrieves many phrases like the following from the LCMC corpus: "全院医生、护士最先挖掘的..." (doctors, nurses from the hospital first dug out...).
In contrast, in translated Chinese, those nouns are more likely to be conjoined by a conjunction (CC), exemplified by the following example from the ZCTC corpus: "对经济和股市非常敏感" (very sensitive to the economy and the stock market).
Here, to conjoin doctors and nurses, or the economy and the stock market, either "、" or "and" is grammatical, but original texts favor the former while the translated text, probably influenced by English, prefers the conjunction.

Dependency Features
Features based on dependency parses have similar F-measures, but should be easier to obtain than subtrees of depth greater than 1. Using the lexical items for function words (depTripleFuncLex) can further improve the results, showing that the choice of function words is indeed very indicative of translationese. A selection of top ranking depTripleFuncLex features is shown in Table 7.
Chinese-specific punctuation marks such as "、" predict original Chinese text, as we have already seen, but notice that "、" is also often used to conjoin verbs (VV PUNCT 、). Translated texts, in contrast, use more determiners (these, such, those) and pronouns (he, they, etc.), which will be discussed in more detail in the following section. These results are in accordance with previous research on translationese in Chinese (He, 2008; Xiao and Hu, 2015).

Analyzing Features: Pronouns
In this section, we discuss one example where syntactic features provide unique information about the stylistic differences between original and translated Chinese that cannot be extracted from lexical sequences, yielding new insights into translationese in Chinese: We take a closer look at the use of pronouns. For this investigation, we examine the top 100 subtrees with depth 2, selected by information gain.
Our results not only confirm the previous finding that pronoun usage is more prominent in translated Chinese (He, 2008; Xiao and Hu, 2015, among others, see Section 2.1), but also provide more insight into the details of pronoun usage in translated Chinese, by looking at the syntactic structures that involve a pronoun (PN) and their ranking after applying the feature ranking algorithm (see Table 8).
The high ranking of pronoun-related features (4 out of the top 10 features involve pronouns) confirms the distinguishing power of pronoun usage.
Crucially, it appears that pronouns in subject position or as a genitive (as part of a DNP phrase such as 他的书, his book) are more prominent than pronouns in object position in translated texts. In fact, pronouns as the object of a preposition (captured by the subtree "(PP P (NP PN))") rank only about 93rd among all features. Also, a pronoun as the object of a verb shows up only once in the top 100 features, in the structure "(VP VV (NP PN) IP)". When searching for sentences with such structures (using Tregex), we almost always encounter phrases similar to "make + pronoun + V.", e.g. "让 他们 懂得 ..." (make them understand ...), where the pronoun is both the object of "make" and the subject of "understand". All this shows that the over-usage of pronouns in translated texts is more likely to occur in subject positions, or in a genitive complement, rather than as the direct object of a verb. Even when a pronoun appears in object position, it tends to play the roles of both subject and object. To our knowledge, this characteristic has not been discussed in previous studies on translationese.
If we examine the dependency features, we see the same pattern. Pronouns serving as the subject of verbs rank very high (ranks 5.4, 10, 17, 24, and 35.6; see Table 7), whereas pronouns as the object of verbs are not in the top 100 features (the highest-ranking one, VV DOBJ 它 "it", is at rank 191). Thus we see the two types of syntactic features (constituent trees and dependency trees) converging on the same conclusion. If we look at the pronoun issue from the opposite side, a reasonable consequence would be that in original texts, more common nouns should serve as the subject, which is indeed what we find: VV NSUBJ NN predicts "original" and ranks 94.8.
The conclusion concerning pronoun usage drawn from the ranking of syntactic features coincides with the observation of (non-)pro-drop in English and Chinese, i.e., Chinese is a pro-drop language while English is not. Thus, the overuse of pronouns in Chinese texts translated from English is an example of the interference effect (Toury, 1979), where translators are likely to carry over linguistic features of the source language to the target language. A further observation is that, in Chinese, subject pro-drop seems to be more frequent than object pro-drop. The reason is that subject pro-drop does not require much context, while object-drop generally requires the dropped object to be discourse old (cf. Li and Thompson, 1981). This explains why pronoun overuse occurs more often in subject position in translated text: object pro-drop is itself less common in original Chinese text.
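Fractional ranks such as 5.4 or 35.6 arise because ranks are averaged over the five cross-validation folds. A minimal sketch of that averaging is given below; it assumes every fold yields a complete ranking and drops features missing from any fold. The feature names and rankings are hypothetical.

```python
from collections import defaultdict

def average_ranks(fold_rankings):
    """Average each feature's 1-based rank over folds; every ranking is a
    list ordered from most to least informative."""
    ranks = defaultdict(list)
    for ranking in fold_rankings:
        for rank, feature in enumerate(ranking, start=1):
            ranks[feature].append(rank)
    n_folds = len(fold_rankings)
    # Assumption: features that are not ranked in every fold are dropped.
    return {f: sum(r) / n_folds for f, r in ranks.items()
            if len(r) == n_folds}

# Hypothetical rankings from five folds.
a, b, c = "VV nsubj PN", "VV nsubj NN", "VV dobj PN"
fold_rankings = [[a, b, c], [a, b, c], [b, a, c], [a, b, c], [b, a, c]]

avg = average_ranks(fold_rankings)
for feature, rank in sorted(avg.items(), key=lambda item: item[1]):
    print(feature, rank)
```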
We are not trying to imply that lexical features should not be used. Rather, we want to stress that syntactic features offer a more in-depth and comprehensive picture to linguists interested in the style of translated text. The pronoun analysis presented above is only one such example. We can perform such analyses for any feature of interest and gain a deeper understanding of how they occur in both types of text.

Conclusion and Future Work
To our knowledge, the current study is the first machine learning experiment on translated vs. original Chinese. We find that translationese can be identified with roughly the same high accuracy using either lexical n-gram features or syntactic features. More importantly, we show how syntactic features can yield linguistically meaningful features that can help decipher differences in styles of translated and original texts. For example, translated Chinese features more determiners, subject-position pronouns, NP modifiers involving "的", and multiple NPs or VPs conjoined by the Chinese-specific punctuation "、". Our methodology can, in principle, be applied to any stylistic comparisons in the digital humanities, and can yield stylistic insights much deeper than the pioneering work of Mosteller and Wallace (1963).
In future work, we will investigate tree substitution grammar (TSG), which extracts even deeper constituent trees (cf. Post and Gildea, 2009), and detailed feature interpretation for phrases headed by other tags (ADJP, PP, etc.) and for specific genres. It is also desirable to improve the accuracy of constituent parsers for Chinese, along the lines of Wang et al. (2013), Wang and Xue (2014), and Hu et al. (2017), since accurate syntactic trees are the prerequisite for accurate feature interpretation. While the parser in this study works well, better parsers will undoubtedly be a plus.