Consistent Translation of Repeated Nouns using Syntactic and Semantic Cues

We propose a method to decide whether two occurrences of the same noun in a source text should be translated consistently, i.e. using the same noun in the target text as well. We train and test classifiers that predict consistent translations based on lexical, syntactic, and semantic features. We first evaluate our classifiers intrinsically, in terms of the accuracy of their consistency predictions, over a subset of the UN Corpus. Then, we also evaluate them in combination with phrase-based statistical MT systems for Chinese-to-English and German-to-English. We compare the automatic post-editing of noun translations with the re-ranking of the translation hypotheses based on the classifiers’ output, and also use these methods in combination. This improves over the baseline and closes up to 50% of the gap in BLEU scores between the baseline and an oracle classifier.


Introduction
The repetition of a noun in a text may be due to co-reference, i.e. repeated mentions of the same entity, or to mentions of two entities of the same type. But in other cases, two occurrences of the same noun may simply convey different meanings. The translation of repeated nouns depends, among other things, on the conveyed meanings: in case of co-reference or identical senses, they should likely be translated with the same word, while otherwise they should be translated with different words, if the target language distinguishes the two meanings. State-of-the-art machine translation systems do not address this challenge systematically, and translate two occurrences of the same noun independently, thus potentially introducing unwanted variations in translation.
We exemplify this issue in Figure 1 for Chinese-to-English and German-to-English translations, with examples of inconsistent translations of a repeated source noun by a baseline SMT system, as opposed to consistent translations in the reference. In Example 1, the system's translation of the second occurrence of Politik is mistaken and should be replaced by the first one (policy, not politics). In Example 2, although the first translation differs from the reference, it could be acceptable as a legitimate variation, although the second one (identity documents) is more idiomatic and more frequent. Of course, in addition to these two examples, there are other configurations of the six nouns involved in a consistency relation across source, candidate and reference translations, but they will be discussed below when designing the training and test data for our problem.
In this paper, we aim to improve the translation of repeated nouns by designing a classifier which predicts, for every pair of repeated nouns in a source text, whether they should be translated by the same noun, i.e. consistently, and if that is the case, which of the two candidate translations generated by an MT system should replace the other one. We thus address one kind of long-range dependencies between words in texts; such dependencies have been the target of an increasing number of studies, presented briefly in Section 2.
To learn a consistency classifier from the data, we consider a corpus with source texts and reference translations, from the parallel UN Corpora in Chinese, German and English. As we explain in Section 3, we mine the corpus for pairs of repeated nouns in the source texts, and examine human and machine translations in order to learn to predict whether the machine translation of the first noun must replace the second one, or vice-versa, or no change should be made. In Section 4, we present the lexical, syntactic and semantic features used by the classifiers. When presented with previously unseen source texts and baseline MT output, the decisions of the classifiers serve to post-edit or re-rank the repeated nouns of the MT baseline.
As shown in Section 5, the new end-to-end MT system generates improved Chinese-English and German-English translations, with larger improvements on the latter pair. Syntactic features appear to be more useful than semantic ones, for reasons that will be discussed. The case of more than two consecutive occurrences of the same noun will be briefly examined. Finally, a combined re-ranking and post-editing approach appears to be the most effective, covering about 50% of the gap in BLEU scores between the baseline MT and the use of an oracle classifier.

Related Work
This study is related to several research topics in MT: lexical consistency, caching, co-reference, and long-range dependencies between words in general. Our proposal aims to improve the consistency of noun translation, and thus has a narrower scope than the "one translation per discourse" hypothesis (Carpuat, 2009; Carpuat and Simard, 2012), which aimed to implement for MT the broader hypothesis of "one sense per discourse" (Gale et al., 1992).
We focus on nouns because of their referential properties, which are a strong requirement for consistency in case of co-reference, although in many cases consistency should not be blindly enforced, in order to avoid the "trouble with MT consistency" (Carpuat and Simard, 2012) which may induce translation errors. As indicated in that study, MT systems trained on small datasets are often more consistent but of lower quality than systems trained on larger and more diverse data sets. In any case, in our study, we never alter consistent translations, but we address inconsistencies, which are often translation errors (Carpuat and Simard, 2012), and attempt to find those that can be corrected simply by enforcing consistency.
Similarly, our scope is narrower than the caching approach (Tiedemann, 2010; Gong et al., 2011), which encourages a priori consistent translations of any word, with the risk of propagating cached incorrect translations. In our study, the first and second translation in a pair have equal status.
Noun phrase consistency is often due to coreference. Several recent studies consider coreference to improve pronoun resolution, but none of them exploits noun phrase co-reference, likely due to an insufficient accuracy of co-reference resolution systems (?; ?). The improvement of pronoun translation was only marginal with respect to a baseline SMT system in a 2015 shared task (Hardmeier et al., 2015), while the 2016 shared task (Guillou et al., 2016) somewhat shifted its focus to pronoun prediction in a lemmatized reference translation.
This study builds upon and extends our previous work on the translation of compounds (Mascarell et al., 2014; Pu et al., 2015), which constrained the translation of the head of a compound when it was repeated separately after it. The present study is considerably more general, as it makes no assumption on either of the repeated nouns, i.e. it does not require them to be part of compounds.
Our study contributes to a growing corpus of research on modeling longer-range dependencies than those modeled in phrase-based SMT or neural MT, often across different sentences of a document. Ture et al. (2012) used cross-sentence consistency features in a translation model, while Hardmeier (2012) designed the Docent decoder, which can use document-level features to improve the coherence across sentences of a translated document. Our classifier for repeated nouns outputs decisions that can serve as features in Docent, but as the frequency of repeated nouns in documents is quite low, we use here post-editing and/or reranking rather than Docent.

Overview of the Method
Our method flexibly enforces noun consistency in discourse to improve noun phrase translation. We first detect two neighboring occurrences of the same noun in the source text, i.e. closer than a fixed distance, and which satisfy some basic conditions. Then, we consider their baseline translations by a phrase-based statistical MT system, which are identified from word-level alignments. If the two baseline translations of the repeated noun differ, then our classifier uses the source and target nouns and a large set of features (presented in Section 4) to decide whether one of the translations should be edited, and how. This decision will serve to post-edit and/or re-rank the baseline MT's output (Section 4.4). To design the classifier, we train machine-learning classifiers over examples that are extracted from parallel data and from a baseline MT system, as described in Section 3.3. A separate subset of unseen examples will be used to test classifiers, first intrinsically and then in combination with MT.

Corpora and Pre-processing
Our data comes from WIT 3 Corpus 1 (Cettolo et al., 2012), a collection of transcripts of TED talks, and the UN Corpora, 2 a collection of documents from the United Nations. The experiments are on Chinese-to-English and German-to-English.
We first build a phrase-based SMT system for each language pair with Moses (Koehn et al., 2007), with its default settings. Both MT systems are trained on the WIT 3 data, and are used to generate candidate translations of the UN Corpora. Then, the ML classifiers are trained on noun pairs extracted from the UN Corpora, using semantic and syntactic features extracted from both source and target sides. The test sets also come from the UN Corpora, with the same features on the source side. Table 1 presents statistics about the data.

Extraction of Training/Testing Instances
At this stage, the goal is to automatically extract for training the pairs of repeated nouns in the source texts, noted N . . . N , which are translated differently by the SMT baseline, noted T 1 . . . T 2 , with T 1 ≠ T 2 . Indeed, when the translations are identical, the 1-best translation offers no alternative with which to post-edit them, so we do not consider such pairs. We examine the reference translations of T 1 and T 2 , noted RT 1 and RT 2 , from which we derive the answer we expect from the classifiers (as specified below), and which will be used for supervised learning. We obtain the T i and RT i values using word-alignment with GIZA++.
Prior to the identification of repeated nouns in the source text, we tokenize the texts and identify parts-of-speech (POS) using the Stanford NLP tools 3 . In particular, as Chinese texts are not word-segmented, we first perform this operation and then identify multi-character nouns. We then consider each noun in turn, and look for a second occurrence of the same noun in what follows, limiting the search to the same sentence for Chinese, and to the same and next three sentences for German. The difference in the distance settings is based on observations of the Chinese vs. German datasets: average length of sentences, average distance of repeated nouns, and sentence segmentation issues.
Once the pairs of repeated nouns have been identified, we check the SMT translations of each pair, and if the two translations are different, we include the pair in our dataset. For instance, in Figure 1, the noun 证件 appears twice in the sentence, and the baseline translations of the two occurrences are papers and document; therefore, this pair is included in our dataset. We extracted from the UN Corpora 3,301 pairs for training and 647 pairs for testing on ZH/EN, and 11,289 pairs for training and 695 pairs for testing on DE/EN. We selected a smaller number of noun pairs for ZH/EN than for DE/EN for reasons of availability, because the DE/EN dataset is more than 10 times larger than the ZH/EN one. We kept similar test set sizes to enable comparison.
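The search for repeated nouns described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name find_repeated_nouns, the (word, POS) input format, and the test for nouns (a tag starting with "NN") are assumptions, and only the next occurrence of each noun is paired.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); hypothetical input format


def find_repeated_nouns(sentences: List[List[Token]], max_sent_gap: int):
    """Pair each noun with its next occurrence within max_sent_gap sentences.

    max_sent_gap = 0 restricts the search to the same sentence (the setting
    described for Chinese); max_sent_gap = 3 also covers the next three
    sentences (the setting described for German).
    """
    # Flatten all nouns in reading order as (sentence index, token index, word).
    nouns = [
        (s, t, word)
        for s, sentence in enumerate(sentences)
        for t, (word, pos) in enumerate(sentence)
        if pos.startswith("NN")  # assumption: noun tags start with "NN"
    ]
    pairs = []
    for i, (s1, t1, w1) in enumerate(nouns):
        for s2, t2, w2 in nouns[i + 1:]:
            if s2 - s1 > max_sent_gap:
                break  # beyond the allowed distance
            if w2 == w1:
                pairs.append(((s1, t1), (s2, t2), w1))
                break  # keep only the next occurrence
    return pairs
```

With the same two German sentences, a gap of 3 finds the pair while a gap of 0 does not, which mirrors the difference between the two distance settings.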
The word-aligned reference translations are used to set the ground-truth class (or decision) for training the classifiers, as follows. With the notations above (baseline translations of N noted T 1 and T 2 , with T 1 ≠ T 2 ), if the reference translations differ (RT 1 ≠ RT 2 ), then we label the pair as 'none', i.e. none of T 1 and T 2 should be post-edited and changed into the other, because this would not help to reach the reference translation anyway (recall that the only possible actions knowing the SMT baseline are replacing T 1 by T 2 or vice-versa).
If the reference translations are the same (RT 1 = RT 2 ), then we examine this word, noted RT . If this word is equal to one of the baseline translations (T 1 = RT or T 2 = RT ), then this value should be given to the other baseline translation (e.g., if T 1 = RT ≠ T 2 , then T 2 := T 1 ). For classification, we simply label these examples with the index of the word that must be used, 1 or 2. However, if the reference differs from both baseline translations, then the label is again 'none', because we cannot infer which of them is a better translation.
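The labeling rules above can be summarized in a few lines. The following is a sketch under the assumption that each translation is a single aligned word given as a string; the function name label_pair is hypothetical.

```python
def label_pair(t1: str, t2: str, rt1: str, rt2: str) -> str:
    """Return the ground-truth class '1', '2' or 'none' for one noun pair.

    t1, t2  : baseline translations of the two occurrences (must differ)
    rt1, rt2: the corresponding reference translations
    """
    if t1 == t2:
        raise ValueError("only pairs with different baseline translations are used")
    if rt1 != rt2:
        return "none"  # the reference itself is not consistent
    if t1 == rt1:
        return "1"     # T1 matches the reference: T2 := T1
    if t2 == rt1:
        return "2"     # T2 matches the reference: T1 := T2
    return "none"      # neither baseline translation matches the reference
```

For instance, label_pair("policy", "politics", "policy", "policy") yields '1'.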
After labeling all the pairs, we extract the features in an attribute/value format to be used for machine learning.

Role and Nature of the Classifiers
We describe here the machine learning classifiers that are trained to predict one of the three classes ('1', '2' or 'none') for each pair of identical source nouns with different baseline SMT translations. The sense of the predicted classes is the following: '1' means that T 1 should replace T 2 , '2' means the opposite, and 'none' means translations should be left unchanged. For instance, if Example 2 in Figure 1 was classified as '2', we would replace the translation of the first occurrence (papers) with the second one (documents).
We use the WEKA environment 4 to train and test several different learning algorithms: SVMs (Cortes and Vapnik, 1995), C4.5 Decision Trees (noted J48 in Weka) (Quinlan, 1993), and Random Forests (Breiman, 2001). We use 10-fold cross validation on the training set, and then test them once on the test set, and later on in combination with MT. For performance reasons, we used the Maximum Entropy classifier (Manning and Klein, 2003) from Stanford 5 instead of WEKA's Logistic Regression.
The hyper-parameters of the above classifiers were set as follows, mostly following the default settings from WEKA, and setting others on the cross-validation sets (not the unseen test sets). For SVMs, the round-off error is ε = 10^-12. For Decision Trees, we set the minimal number of instances per leaf ('minNumObj') at 2 and the confidence factor used for pruning to 0.25. For Random Forests, we defined the number of trees to be generated ('numTree') as 100 and set their maximal depth ('maxDepth') as unlimited. Finally, we set the MaxEnt smoothing (σ) to 1.0, and the tolerance used for convergence in parameter optimization to 10^-5.
We evaluate our proposal in two ways. First, we measure classification performance in terms of accuracy and kappa (κ) agreement (Cohen, 1960) with the correct class, either in 10-fold cross-validation experiments, or on the test set. Second, we compare the updated translations with the reference, to check if we obtain a result that is closer to it, using the popular BLEU measure (Papineni et al., 2002).
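For reference, Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance from the class distributions: κ = (p_o − p_e) / (1 − p_e). The sketch below computes both measures for a list of gold and predicted classes; the function name and the handling of the degenerate case p_e = 1 are choices made for this illustration.

```python
from collections import Counter


def accuracy_and_kappa(gold, pred):
    """Return (accuracy, Cohen's kappa) for two equally long label lists."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, pred)) / n  # p_o
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    # p_e: probability that gold and prediction agree by chance
    expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    if expected == 1.0:
        return observed, 0.0  # kappa is undefined here; 0.0 is a convention of this sketch
    return observed, (observed - expected) / (1.0 - expected)
```

With three balanced classes, p_e is close to 1/3, which is why an accuracy well above 33% also yields a clearly positive kappa.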

Syntactic Features
We defined 19 syntactic features, mainly with the assumption that out of a pair of repeated source nouns N . . . N , the occurrence which is embedded in a more complex local parse tree, i.e. has more information syntactically bound to it, is more "determined" and has a higher probability of being translated correctly by the baseline MT system, since this information can help the system to disambiguate it. The results tend to confirm this assumption.
The features are listed in Figure 2, left side, with an explicit description of each feature and its value on a Chinese text (top of the figure). In the last line of the table we show the ground-truth class of this example.
The sentences are parsed using the Stanford parser, 6 and the values of the features are obtained from the parse trees, using the sizes (in nodes or words) of the sibling and ancestor sub-trees of each analyzed noun. In the sample parse trees on the right side of Figure 2, the first NP ancestor is marked with a red rectangle, and the values of the features are computed.
We can distinguish three subsets of features. The first subset includes lexical and positional features: the original noun, the baseline MT translations of both occurrences, and the distance between the sentences that contain the two nouns. The second subset includes features that capture the size of the siblings in the parse trees of each of the two nouns. The third subset includes the size of the sub-tree of the closest noun phrase ancestor of each analyzed noun, and also the depth distance to that noun phrase ancestor.
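To make the second and third subsets more concrete, the sketch below derives such sizes from a toy constituency tree. It is an illustration only: the nested-tuple tree format, the function and feature names, and the exact counting conventions (labelled nodes only, distance counted in edges from the pre-terminal) are assumptions, since the precise definitions are given in Figure 2.

```python
def is_preterminal(node):
    """A pre-terminal is a (POS, word) pair, e.g. ("NN", "policy")."""
    return isinstance(node[1], str)


def count_nodes(node):
    """Number of labelled nodes in the sub-tree (words are not counted)."""
    return 1 if is_preterminal(node) else 1 + sum(count_nodes(c) for c in node[1:])


def count_words(node):
    """Number of words (leaves) covered by the sub-tree."""
    return 1 if is_preterminal(node) else sum(count_words(c) for c in node[1:])


def find_path(node, target, offset=0):
    """Return (path, n_words); path lists the nodes from `node` down to the
    pre-terminal of the word with index `target`, or is None if not below."""
    if is_preterminal(node):
        return ([node] if offset == target else None), 1
    seen, found = 0, None
    for child in node[1:]:
        path, n_words = find_path(child, target, offset + seen)
        if path is not None and found is None:
            found = [node] + path
        seen += n_words
    return found, seen


def syntactic_features(tree, target):
    """Sibling and NP-ancestor sizes for the word with index `target`."""
    path, _ = find_path(tree, target)
    if path is None or len(path) < 2:
        raise ValueError("target word not found below a parent node")
    parent = path[-2]
    np_ancestor, distance = None, 0
    for steps, node in enumerate(reversed(path[:-1]), start=1):
        if node[0] == "NP":  # closest NP ancestor
            np_ancestor, distance = node, steps
            break
    return {
        "n_siblings": len(parent) - 2,  # children of the parent, minus the noun
        "n_words_with_siblings": count_words(parent),
        "np_nodes": count_nodes(np_ancestor) if np_ancestor else 0,
        "np_words": count_words(np_ancestor) if np_ancestor else 0,
        "np_distance": distance,
    }
```

Computing these values for both occurrences and taking the sign of each difference gives the comparative features (+1, 0, −1) of the kind listed in Figure 2.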

Semantic Features
The semantic features, to be used independently or in combination with the syntactic ones, are divided into two groups: discourse vs. local context features, which differ by the amount of context they take into account. On the one hand, local context features represent the immediate context of each of the nouns in the pair and their translations, i.e. three words to their left and three words to their right in both source and MT output, always within the same sentence.
On the other hand, discourse features capture those cases where the inconsistent translations of a noun might be due to a disambiguation problem of the source noun, and semantic similarity can be leveraged to decide which of the two translations best matches the context. To compute the discourse features, we use the word2vec word vector representations generated from a large corpus (Mikolov et al., 2013), which have been successfully used in the recent past to compute similarity between words (Schnabel et al., 2015). Specifically, we employ the model trained on the English Google News corpus 7 with about 100 billion words.
6 http://nlp.stanford.edu/software/lex-parser.html
For each pair of inconsistent translations (T 1 , T 2 ) of a source noun N , we compute the cosine similarities c 1 and c 2 between the vector representation of each translation and the mean vector of their contexts. These mean vectors, noted v 1 and v 2 , are computed by averaging all vectors of the words in the respective contexts of T 1 and T 2 . Here, the contexts consist of 20 words to the left and 20 words to the right of each T i , possibly crossing sentence boundaries. Writing t i for the vector representation of T i , the cosine similarities c 1 and c 2 are thus:
c_i = \cos(\mathbf{t}_i, \mathbf{v}_i) = \frac{\mathbf{t}_i \cdot \mathbf{v}_i}{\lVert \mathbf{t}_i \rVert \, \lVert \mathbf{v}_i \rVert}, \quad i \in \{1, 2\}
The two values c 1 and c 2 are used as features, allowing classifiers to learn that, in principle, higher values indicate a better translation in the sense of its semantic similarity with the context. In Example 1 from Figure 1, the German word Politik is translated into the English words policy and then politics. The semantic similarity between the word politics and its context (c 2 ) is lower than the similarity between policy and its context (c 1 ), which we consider to be an indication that the first occurrence, namely policy, has better chances to be the correct translation, which is actually the case in this example.
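The computation of c 1 and c 2 can be sketched in plain Python as follows. The embeddings argument is assumed to be a dictionary from words to vectors (for example loaded from a word2vec model); the function names, and the choices of skipping out-of-vocabulary context words and returning 0.0 when no similarity can be computed, are assumptions of this illustration.

```python
import math


def mean_vector(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def discourse_feature(tokens, index, embeddings, window=20):
    """Similarity c_i between the word at `index` and the mean of its context.

    The context is made of `window` words on each side of the target word,
    taken from the token list of the whole document (so it may cross
    sentence boundaries).
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    context = [embeddings[w] for w in left + right if w in embeddings]
    if tokens[index] not in embeddings or not context:
        return 0.0
    return cosine(embeddings[tokens[index]], mean_vector(context))
```

Calling discourse_feature once at the position of T 1 and once at the position of T 2 in the MT output gives the two feature values.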

Integration with the MT System
The classifier outputs a post-editing decision for each pair of repeated nouns: replace T 1 with T 2 , replace T 2 with T 1 , or do nothing. This decision can be directly executed, or it can be combined in a more nuanced fashion with the MT system. We therefore propose and test three approaches for using it in an MT system:
Post-editing: directly edit the translations T 1 or T 2 depending on the classifier's decision.
Re-ranking: search among the translation hypotheses provided by the SMT system (in practice, the first 10,000 ones) for those where T 1 and T 2 are translated as predicted by the classifier, and select the highest ranking one as the new translation. If none is found, the baseline 1-best hypothesis is kept.
Re-ranking + Post-editing: after applying re-ranking, if no hypothesis conforms to the prediction of the classifier, instead of keeping the baseline translation we post-edit it as in the first approach.
Figure 2 (left side, partial): syntactic features and their values on the Chinese example.
Feature | Value
Number of sibling nodes of the 2nd occurrence | 2
Sign of the difference between the above (+1, 0, −1) | 1
Number of words of the 1st occurrence and its siblings | 2
Number of words of the 2nd occurrence and its siblings | 1
Sign of the difference between the above (+1, 0, −1) | 1
Number of nodes in the first NP ancestor of the 1st occ. | 15
Number of nodes in the first NP ancestor of the 2nd occ. | 7
Sign of the difference between the above (+1, 0, −1) | 1
Number of words in the first NP ancestor of the 1st occ. | 6
Number of words in the first NP ancestor of the 2nd occ. | 2
Sign of the difference between the above (+1, 0, −1) | 1
Distance between the first NP ancestor and the 1st occ. | 3
Distance between the first NP ancestor and the 2nd occ. | 3
Sign of the difference between the above (+1, 0, −1) | 0
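The three approaches can be sketched as follows. This is a simplified illustration rather than the actual system: it assumes single-token noun translations, assumes that the positions of T 1 and T 2 in each hypothesis are already known from word alignment, and represents the n-best list as a Python list of (tokens, pos1, pos2) triples with the baseline 1-best first. All function names are hypothetical.

```python
def post_edit(tokens, pos1, pos2, decision):
    """Apply the decision '1', '2' or 'none' directly to one hypothesis."""
    edited = list(tokens)
    if decision == "1":
        edited[pos2] = edited[pos1]  # T1 replaces T2
    elif decision == "2":
        edited[pos1] = edited[pos2]  # T2 replaces T1
    return edited


def rerank(nbest, decision):
    """Return (tokens, found): the best hypothesis in which both occurrences
    carry the noun chosen by the classifier, else the baseline 1-best."""
    base_tokens, pos1, pos2 = nbest[0]
    if decision == "none":
        return list(base_tokens), True
    target = base_tokens[pos1] if decision == "1" else base_tokens[pos2]
    for tokens, q1, q2 in nbest:  # hypotheses are sorted by model score
        if tokens[q1] == target and tokens[q2] == target:
            return list(tokens), True
    return list(base_tokens), False


def rerank_then_post_edit(nbest, decision):
    """Re-rank first; fall back to post-editing the baseline if needed."""
    tokens, found = rerank(nbest, decision)
    if found:
        return tokens
    _, pos1, pos2 = nbest[0]
    return post_edit(tokens, pos1, pos2, decision)
```

The combined function only differs from re-ranking in its fallback, which is where the two methods are expected to complement each other.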

Results and Analysis
We first present the results of the classification task, i.e. the prediction of the correct translation variant (1st / 2nd / None), for Chinese-English and German-English translation respectively in Tables 2 and 3, with 10-fold cross-validation on the training sets. Then, we present the scores on the test sets for both the classification task and its combination with MT.
Table 3: Prediction of the correct translation (1st / 2nd / None) for repeated nouns in German, in terms of accuracy (%) and kappa scores, on the development set with 10-fold cross-validation. Methods are sorted by average accuracy over the three feature sets. The best scores are in bold.

Best Scores of Classification and MT
The classification accuracy is above 80% when applying 10-fold cross-validation, for both language pairs, and reaches 74-78% on the test sets. As the classes are quite balanced, a random baseline would reach around 33% only. Kappa values reach 0.75 on the dev sets and 0.60-0.67 on the test sets. The performances of the classifiers appear thus to be well above chance, and the comparable performances achieved on the unseen test sets indicate that over-fitting is unlikely. The ordering of methods by performance is remarkably stable: Decision Trees (J48) and SVMs get the lowest scores, followed by Random Forests, and then by the MaxEnt classifier. The ordering {J48, SVM} < RF < MaxEnt is observed over both language pairs, over the three types of features, and the four datasets, with 1-2 exceptions only. Overall, the best configuration of our method found on the training sets is, for both languages, the MaxEnt classifier with all features.
There is a visible rank correlation between the increase in classification accuracy and the increase in BLEU score, for all languages, features, classifiers, and combination methods with MT. The best configurations found on the training sets bring the following BLEU improvements: for ZH/EN, from 11.07 to 11.36, and for DE/EN, from 17.10 to 17.67. In fact, syntactic features turn out to reach an even higher value on the test set, at 17.75. To interpret these improvements, they should be compared to the oracle BLEU scores obtained by using a "perfect" classifier, which are 11.64 for ZH/EN and 17.99 for DE/EN. Our method thus bridges 51% of the BLEU gap between baseline and oracle on ZH/EN and 64% on DE/EN, a significant improvement.
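The share of the gap that is bridged follows directly from the three BLEU scores of each language pair, as (system − baseline) / (oracle − baseline). The short sketch below reproduces the arithmetic with the scores quoted above; the function name is hypothetical.

```python
def gap_closed(baseline: float, system: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle BLEU gap recovered by the system."""
    return (system - baseline) / (oracle - baseline)


zh_en = gap_closed(11.07, 11.36, 11.64)  # 0.29 / 0.57
de_en = gap_closed(17.10, 17.67, 17.99)  # 0.57 / 0.89
print(f"ZH/EN: {zh_en:.0%}, DE/EN: {de_en:.0%}")
```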
The BLEU scores of the three different methods for using classification for MT (Tables 4 and 5) clearly show that the combined method outperforms both post-editing and re-ranking alone, for all languages and features. Post-editing, the easiest one to implement, has little consideration for the words surrounding the nouns, while re-ranking works on MT hypotheses and thus ensures that a better global translation is found that is also consistent. However, in some cases, no hypothesis conforms to the consistency decision, and in this case post-editing the best hypothesis appears to be beneficial.

Feature Analysis: Syntax vs. Semantics
On the training sets, syntactic features always outperform the semantic ones when using the MaxEnt classifier, and their joint use outperforms their separate uses. For the other classifiers (not the best ones on the training sets), on ZH/EN, adding semantic features to syntactic ones decreases the performance. Indeed, semantic features (specifically the discourse ones) are intended to disambiguate nouns based on contexts, but here, manual inspection of the data showed that these are similar for T 1 and T 2 , which makes prediction difficult.
Table 6 (partial): Top ten syntactic features ranked by information gain for each language pair.
#nodes in the first NP ancestor of the 2nd occ. | 0.037
#words of the 1st occ. and its siblings | 0.023
#words of the 2nd occ. and its siblings | 0.037
Semantic features appear to be more useful in German compared to Chinese. We hypothesize that this is because translation ambiguities of Chinese nouns, i.e. cases when the same noun can be translated into English with two very different words, are less frequent and less semantically divergent than in German. In other words, semantic features are less useful in Chinese because cases of strong polysemy or homonymy seem to be less frequent than in German. Such a characteristic is suggested for English vs. Chinese by Huang (1995), and we believe it extends to German.
These facts might also explain the results obtained when using all features, for German and Chinese. As semantic features are less helpful in Chinese, and given also the limited amount of data, combining them with syntactic features actually lowers the performance below that of the syntactic features used alone. In contrast, semantic features are more helpful on the German dataset, and also improve results when combined with the syntactic ones.
Table 6 shows the top ten syntactic features for ZH/EN and for DE/EN, ranked by information gain computed using Weka. These features include both lexical information and properties of the parse tree. The analysis shows that lexical features are significantly more important than purely syntactic ones, for both languages. However, the syntactic ones are not negligible.
Table 7 shows an analysis of the effect of the semantic features on different training sets in terms of accuracy and kappa scores. These training sets are built according to the cosine similarity between T 1 and T 2 , as follows: for each training instance (pair of nouns), we compute the cosine similarity between the vector representation of T 1 and T 2 ; then, we group instances by intervals and carry out 10-fold cross-validation classification experiments for each subset. The lower the range values, the more dissimilar the translation pairs T 1 and T 2 , and the better the scores of discourse features. Specifically, when the translations are dissimilar, the classifier makes better predictions with the discourse features, i.e. considering a larger context. However, the more similar the words are, the better the local context features, i.e. the surrounding words.
Table 7 (caption, partial): The scores with discourse features increase as similarity between T 1 and T 2 decreases.
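As a reminder of what the ranking in Table 6 measures, the information gain of a feature is the reduction in class entropy obtained by knowing its value: IG = H(class) − H(class | feature). The sketch below computes it for a nominal feature; it is a generic illustration, not Weka's implementation (which, among other things, has its own handling of numeric attributes), and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """H(labels) minus the weighted entropy of labels within each feature value."""
    groups = defaultdict(list)
    for value, label in zip(feature_values, labels):
        groups[value].append(label)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A sign feature (+1, 0, −1) that perfectly separates classes '1' and '2' would obtain the maximal gain, whereas a feature with the same value on every instance obtains zero.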

Extension to Triples of Repeated Nouns
Finally, we consider briefly the case of nouns that appear more than twice. Using our dataset, we identified them as noun pairs that share the same word, i.e. triples of repeated nouns, to which we limit our investigation. There are 129 ZH/EN triples and 138 DE/EN ones.
We defined the following method to determine the translation of such nouns when their baseline translations are different across the two pairs. If T 1 , T 2 and T 3 are the translation candidates, we aim to find the consistent translation T c as follows. If two of the T i are identical, we use this value as T c , but if they all differ, then we compare the syntactic features of the three source occurrences, and select the one with the highest number of features with highest values, and use its value as T c . Going back to our classifier, if the decision for a particular instance pair is not 'none', then we replace the translations of the instance pairs with T c .
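The selection of T c can be sketched as follows. The sketch assumes that the syntactic features of each source occurrence are given as a list of numbers in the same order; the function name, the counting of ties in favor of every tied occurrence, and the preference for the earliest occurrence in case of a final tie are assumptions, as these details are not specified above.

```python
from collections import Counter


def consistent_translation(translations, feature_vectors):
    """Choose T_c among the candidate translations of a repeated noun.

    translations    : e.g. [T1, T2, T3]
    feature_vectors : one list of numeric syntactic features per occurrence
    """
    value, count = Counter(translations).most_common(1)[0]
    if count >= 2:
        return value  # at least two candidates agree
    # All candidates differ: count, for each occurrence, the number of
    # features on which it has the highest value.
    wins = [0] * len(translations)
    for column in zip(*feature_vectors):
        best = max(column)
        for i, v in enumerate(column):
            if v == best:
                wins[i] += 1
    return translations[wins.index(max(wins))]
```

The returned T c is then used for the instance pairs whose classifier decision is not 'none'.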
We tested the method with the three feature types and the four classifiers, i.e. 12 cases per language. On ZH/EN, a small increase of BLEU is observed in 5 cases (0.01), a decrease in two cases (0.02), and no variation in 5 cases. On DE/EN, half of the cases show a small improvement (up to 0.03) and the rest stay the same. The method appears to work better on DE/EN, possibly because the initial accuracy on pairs is lower, but improvements are overall very small. The main conclusion from experimenting with triples, and considering also longer lexical chains of consistent nouns, is that the pairwise method should be replaced by a different type of consistency predictor, which remains to be found.

Conclusion and Perspectives
We presented a method for flexibly enforcing consistent translations of repeated nouns, by using a machine learning approach with syntactic and semantic features to decide when it should be enforced. We experimented with Chinese-English and German-English data. To build our datasets, we detected source-side nouns which appeared twice within a fixed distance and were translated differently by MT. Syntactic features were defined based on the complexity of the parse trees containing the nouns, thus capturing which of the two occurrences of a noun is more syntactically bound, while semantic features focused on the similarity between each translated noun and its context. The trained classifiers have shown that they can predict consistent translations above chance, and that, when combined with MT, they bridge 50-60% of the gap between the baseline and an oracle classifier.
In future work, we will consider whether neural MT is prone to similar consistency problems, and whether they can be addressed by a similar method. The answer is likely positive, because both PBSMT and NMT assume that consistency simply results from correct individual translations, whereas human translators often take consistency into account for lexical choice. Moreover, a better consideration of legitimate lexical variation, e.g. using multiple references or human evaluators, should improve the assessment of consistency enforcement strategies.