Fixing Translation Divergences in Parallel Corpora for Neural MT

Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered/corrected corpora actually improves MT performance.


Introduction
Parallel sentence pairs are the only necessary resource to build Machine Translation (MT) systems. In the case of neural MT, a large neural network is trained by maximising a proxy of translation performance on a parallel corpus. Therefore, the quality of MT engines depends heavily on both the amount and the quality of the available parallel sentences. (Recent work on neural MT (Lample et al., 2018; Artetxe et al., 2018) completely dispenses with parallel data, using unsupervised methods to obtain performance improvements over word-by-word statistical MT. These systems however lag far behind supervised systems, as considered in this work.) Parallel texts are, unfortunately, scarce resources: there are relatively few language pairs for which large parallel corpora exist, and even for those pairs, available corpora only cover a few restricted domains. To alleviate the lack of parallel data, several approaches have been developed over the years. They range from methods using non-parallel, or comparable, data (Zhao and Vogel, 2002; Fung and Cheung, 2004; Munteanu and Marcu, 2005; Grégoire and Langlais, 2018; Grover and Mitra, 2017; Schwenk, 2018) to techniques that produce synthetic parallel data from monolingual corpora (Sennrich et al., 2016a; Chinea-Rios et al., 2017), using automated alignment/translation engines that are prone to introducing noise in the resulting parallel sentences. Mismatches in parallel sentences extracted from translated texts have also been reported (Tiedemann, 2011; Xu and Yvon, 2016). This problem is mostly ignored in MT, where parallel sentences are considered to convey the exact same meaning; yet it seems particularly important for neural MT engines (Chen et al., 2016). Table 1 gives some examples of English-French parallel sentences that are not completely semantically equivalent, extracted from the OpenSubtitles corpus (Lison and Tiedemann, 2016).
Multiple types of translation divergences are found in parallel corpora: additional segments are included on either side of the parallel sentences (first and second rows), most likely due to errors in sentence segmentation; some translations may be completely uncorrelated (third row); inaccurate translations also exist (fourth row). Note that divergent translations can be due to various reasons (Li et al., 2014), the study of which is beyond the scope of this paper.
In this work, we present an unsupervised method for building cross-lingual sentence embeddings based on modelling word similarity, relying on a neural architecture (see § 3) that is able to identify several types of common cross-lingual divergences. The resulting embeddings are then used to measure semantic equivalence between sentences. To evaluate our method, we show in § 4 that translation accuracy can be improved after filtering out divergent sentence pairs in an English-to-French and an English-to-German translation task. We also show that in some cases, divergent sentences can be fixed by removing divergent segments, further increasing translation quality. All the code used in this paper is freely available.

Related Work
Attempts to measure the impact of translation divergences in MT have focused on the introduction of noise in sentence alignments (Goutte et al., 2012), showing that statistical MT is highly robust to noise, and that performance only degrades seriously at very high noise levels. In contrast, neural MT systems seem to be more sensitive to noise (Chen et al., 2016), as they tend to assign high probabilities to rare events (Hassan et al., 2018). Efforts devoted to characterising the degree of semantic equivalence between two snippets of text in the same or different languages are presented in Agirre et al. (2016). Mueller and Thyagarajan (2016) propose a monolingual sentence similarity network, making use of a simple LSTM layer to compute sentence representations. The authors show that a simple SVM classifier exploiting such sentence representations achieves state-of-the-art results in a textual entailment task. With the same objective, the system of He and Lin (2016) uses multiple convolutional layers and models pairwise word interactions.
Our work is inspired by Carpuat et al. (2017), who train an SVM-based cross-lingual divergence detector using word alignments and sentence length features. Their work shows that an NMT system trained only on non-divergent sentences yields slightly better translation scores, while requiring less training time. A follow-up study by the same authors (Vyas et al., 2018) achieves even better results, using the neural architecture of He and Lin (2016). Our work differs from theirs in that we make use of a network with a different, arguably simpler, topology. We model sentence similarity by optimising a loss function based on word alignments. Furthermore, the network predicts word similarity scores that we further use to correct divergent sentences.

Neural Divergence Classifier
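The architecture of our network is inspired by the work on word alignment of Legrand et al. (2016), using however contextual, rather than fixed, word embeddings (see Figure 1). It computes the similarity of any source-target sentence pair $(s, t)$, where $s = (s_1, ..., s_I)$ and $t = (t_1, ..., t_J)$. The model is composed of two bi-directional LSTM subnetworks, $net_s$ and $net_t$, which respectively encode source and target sentences. Since both $net_s$ and $net_t$ take the same form, we only describe the former network: it outputs forward and backward hidden states, $\overrightarrow{h}_i^{src}$ and $\overleftarrow{h}_i^{src}$, which are then concatenated into a vector encoding the $i$-th source word as $h_i^{src} = [\overrightarrow{h}_i^{src}; \overleftarrow{h}_i^{src}]$. In addition, the last forward/backward hidden states (in dark grey on Figure 1) are also concatenated to represent whole sentences, $h^{src} = [\overrightarrow{h}_I^{src}; \overleftarrow{h}_1^{src}]$. The similarity between sentence pairs can then be obtained using e.g. the cosine similarity:

$$\mathrm{sim}(h^{src}, h^{tgt}) = \frac{h^{src} \cdot h^{tgt}}{\|h^{src}\| \, \|h^{tgt}\|} \tag{1}$$

Our model is trained to maximize word alignment scores between words in both sentences, using aggregation functions that summarise the alignment scores for each source/target word. Similar to Legrand et al. (2016), alignment scores $S(i, j)$ are given by the dot-product $S(i, j) = h_i^{src} \cdot h_j^{tgt}$, further aggregated as follows:

$$\mathrm{aggr}_s(i, S) = \frac{1}{r} \log \sum_{j=1}^{J} e^{r S(i, j)}, \qquad \mathrm{aggr}_t(j, S) = \frac{1}{r} \log \sum_{i=1}^{I} e^{r S(i, j)} \tag{2}$$

The training loss function is then defined as:

$$\mathcal{L}(src, tgt) = \sum_{i=1}^{I} \log\left(1 + e^{\mathrm{aggr}_s(i, S) \cdot sign_i}\right) + \sum_{j=1}^{J} \log\left(1 + e^{\mathrm{aggr}_t(j, S) \cdot sign_j}\right) \tag{3}$$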
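The scoring pipeline described above can be illustrated with a minimal NumPy sketch. It assumes a log-sum-exp aggregation with sharpness parameter r and a soft-margin loss in which sign = -1 marks parallel words and sign = +1 marks divergent ones; the function names are hypothetical, and random vectors stand in for the bi-LSTM hidden states.

```python
import numpy as np

def cosine_similarity(h_src, h_tgt):
    """Cosine similarity between two sentence vectors."""
    return float(h_src @ h_tgt / (np.linalg.norm(h_src) * np.linalg.norm(h_tgt)))

def alignment_scores(H_src, H_tgt):
    """S[i, j] = dot product between source word i and target word j."""
    return H_src @ H_tgt.T

def aggregate(S, r=1.0, axis=1):
    """Assumed aggregation: (1/r) * log(sum(exp(r * S))) along `axis`."""
    m = S.max(axis=axis, keepdims=True)          # subtract the max for stability
    lse = m + np.log(np.exp(r * (S - m)).sum(axis=axis, keepdims=True)) / r
    return lse.squeeze(axis)

def divergence_loss(S, sign_src, sign_tgt, r=1.0):
    """Assumed soft-margin loss; sign = -1 for parallel words, +1 for divergent."""
    aggr_s = aggregate(S, r, axis=1)             # one score per source word
    aggr_t = aggregate(S, r, axis=0)             # one score per target word
    # np.logaddexp(0, x) computes log(1 + exp(x)) without overflow
    return float(np.logaddexp(0.0, aggr_s * sign_src).sum()
                 + np.logaddexp(0.0, aggr_t * sign_tgt).sum())

# Toy usage: random vectors standing in for contextual word encodings
rng = np.random.default_rng(0)
H_src, H_tgt = rng.normal(size=(5, 8)), rng.normal(size=(3, 8))
S = alignment_scores(H_src, H_tgt)
loss = divergence_loss(S, -np.ones(5), -np.ones(3))
```

With sign = -1, minimizing the loss pushes the aggregated score of a word upwards, so parallel words end up with positive scores and divergent words with negative ones.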

Training with Negative Examples
Training is performed by minimizing Eq. (3), for which annotated examples of source ($sign_i$) and target ($sign_j$) words are needed. As positive examples, we use paired sentences of a parallel corpus; all words in such sentences are labelled as parallel ($\forall i, j$: $sign_i = sign_j = -1$). We consider three types of negative instances. The basic case uses random unpaired sentences; in this case, all words are labelled as divergent ($\forall i, j$: $sign_i = sign_j = +1$). Since such negative pairs may be very easy to classify and we want our network to detect less obvious divergences, we further create more difficult negative examples as follows.
We first replace random sequences of words in source or target by a sequence of words with the same part-of-speech tags. Words that are not replaced are deemed parallel ($sign_i = -1$) while those replaced are annotated as $sign_i = +1$. Words aligned to some replaced words are also assigned the divergent label ($sign_j = +1$). For instance, given the original sentence pair:
src: What do you feel ?
tgt: Que ressentez-vous ?
we may replace 'you feel', with part-of-speech tags 'PRP VB', by another sequence with the same tags (e.g. 'we want'), yielding a new negative instance in which 'we want' and 'ressentez-vous' are divergent:
src: What do we want ?
tgt: Que ressentez-vous ?
Note that we need word alignments to identify as divergent the sequence 'ressentez-vous', which was aligned to 'you feel' in the original sentence. Finally, motivated by sentence segmentation errors observed in many corpora, we also build negative examples by inserting a sentence at the beginning (or end) of the source (or target) sentence. Words in the original sentence pair are annotated $sign_i = -1$, while the newly inserted words are considered divergent ($sign_i = +1$). Given the same sentence pair as above, a negative example is created by inserting the sentence 'Not .' at the end of the original source:
src: What do you feel ? Not .
tgt: Que ressentez-vous ?
Finally, to avoid generating easy negative sentence pairs with a large difference in sentence length, we restrict negative examples to have a length ratio < 2.0 (3.0 for the shortest sentences).
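The labelling scheme for the Replace and Insert negatives can be sketched as follows. This is only an illustration under stated assumptions: the helper names are hypothetical, the replacement sequence is assumed to have been chosen elsewhere with matching part-of-speech tags, and the word alignment is supplied as a set of 0-based (source index, target index) pairs.

```python
PARALLEL, DIVERGENT = -1, +1

def replace_example(src, tgt, start, end, replacement, alignment):
    """Replace src[start:end] and mark the affected words on both sides."""
    new_src = src[:start] + replacement + src[end:]
    sign_src = ([PARALLEL] * start + [DIVERGENT] * len(replacement)
                + [PARALLEL] * (len(src) - end))
    # Target words aligned to a replaced source word become divergent too
    hit = {j for (i, j) in alignment if start <= i < end}
    sign_tgt = [DIVERGENT if j in hit else PARALLEL for j in range(len(tgt))]
    return new_src, tgt, sign_src, sign_tgt

def insert_example(src, tgt, extra, at_end=True):
    """Insert the tokens of another sentence on the source side."""
    if at_end:
        new_src = src + extra
        sign_src = [PARALLEL] * len(src) + [DIVERGENT] * len(extra)
    else:
        new_src = extra + src
        sign_src = [DIVERGENT] * len(extra) + [PARALLEL] * len(src)
    return new_src, tgt, sign_src, [PARALLEL] * len(tgt)

def length_ratio_ok(src, tgt, max_ratio=2.0):
    """Keep a pair only if longer/shorter length is below max_ratio."""
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    return shorter > 0 and longer / shorter < max_ratio

src = "What do you feel ?".split()
tgt = "Que ressentez-vous ?".split()
alignment = {(0, 0), (2, 1), (3, 1), (4, 2)}   # hypothetical word alignment

replaced = replace_example(src, tgt, 2, 4, ["we", "want"], alignment)
inserted = insert_example(src, tgt, ["Not", "."])
```

The threshold used for the shortest sentences (3.0) is passed through `max_ratio`; how short a sentence must be to qualify is not specified in the text, so that decision is left to the caller.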

Divergence Correction
Our training corpora contain many divergent sentences that follow a common pattern, consisting of adding some extra leading/trailing words. Accordingly, we implemented a simple algorithm that discards sequences of leading/trailing words on both sides. To find optimal source $(u, v)$ and target $(x, y)$ indices that enclose parallel segments within the original sentence, we compute:

$$\underset{u, v, x, y}{\arg\max} \; \sum_{u \le i \le v} \; \max_{x \le j \le y} \{ S(i, j) \}$$

The $N$-best sequences $(s_u^v, t_x^y)$ are considered as likely corrections, among which we use the one having the highest similarity score to replace the original $(s_1^I, t_1^J)$. Note that short sentences are not considered and we enforce $v - u > \tau$ and $y - x > \tau$. Figure 2 (left) displays an example of an alignment matrix $S(i, j)$. An acceptable correction is: Que ressentez-vous ? ⇔ What do you feel ?, corresponding to $u = 1$, $v = 5$, $x = 1$ and $y = 3$.
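A brute-force version of this span search might look like the sketch below. It assumes the objective sums, over the source words kept, the best alignment score found within the target span; indices are 0-based and inclusive, and `best_spans` is a hypothetical name. The final re-ranking of the N-best candidates by sentence similarity is not shown.

```python
import numpy as np

def best_spans(S, tau=3, n_best=20):
    """Return the n_best (score, u, v, x, y) candidates, highest score first."""
    I, J = S.shape
    candidates = []
    for x in range(J):
        for y in range(x + tau + 1, J):              # enforces y - x > tau
            row_max = S[:, x:y + 1].max(axis=1)      # best target match per source word
            prefix = np.concatenate(([0.0], np.cumsum(row_max)))
            for u in range(I):
                for v in range(u + tau + 1, I):      # enforces v - u > tau
                    score = prefix[v + 1] - prefix[u]  # sum of row_max[u..v]
                    candidates.append((float(score), u, v, x, y))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return candidates[:n_best]

# Toy matrix: five well-aligned words followed by two divergent trailing words
S = -np.ones((7, 7))
for k in range(5):
    S[k, k] = 2.0
top = best_spans(S, tau=3, n_best=5)
```

Under this assumed objective, extending the target span never lowers the score, so several target spans can tie; this is one reason a second selection step over the N-best list, such as the similarity-based choice described above, is useful.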

Corpora
We filter out divergences from the English-French OpenSubtitles corpus (Lison and Tiedemann, 2016), which consists of a collection of movie and TV subtitles. We also use the very noisy English-German Paracrawl 4 corpus. Both corpora present many potential divergences. To evaluate English-French translation performance, we use the En-Fr Microsoft Spoken Language Translation corpus, created from actual Skype conversations (Federmann and Lewis, 2016). English-German performance is evaluated on the publicly available Newstest-2017 (Bojar et al., 2017), corresponding to news stories selected from online sources.
In order to better assess the quality of our classifier when facing different word divergences, we also collected from the original OpenSubtitles corpus 500 sentences containing different types of examples: 200 paired sentences; 100 unpaired sentences; 100 sentences with replace examples; and 100 sentences with insert examples (see § 3.1). All data is preprocessed with OpenNMT 5 , performing minimal tokenisation. After tokenisation, each out-of-vocabulary word is mapped to a special UNK token, assuming a vocabulary containing the 50,000 most frequent words.

Neural Divergence
Word embeddings of $E_s = E_t = 256$ cells are initialised using fastText (https://github.com/facebookresearch/fastText), further aligned by means of MUSE (https://github.com/facebookresearch/MUSE) following the unsupervised method of Lample et al. (2018). Both bi-LSTMs use 256-dimensional hidden representations ($E = 512$). Network optimization is done using SGD with gradient clipping (Pascanu et al., 2013). For each epoch, we randomly select 1 million sentence pairs that we place in batches of 32 examples. We run 10 epochs and, once the loss on the validation set increases, decay the learning rate by a factor of 0.8 at each epoch. Divergence is computed as in Equation (1), setting $r = 1.0$ in the aggregation functions. For divergence correction, we use $N = 20$ and $\tau = 3$. The same number of examples is always generated for each type of example (Paired, Unpaired, Replace and Insert). Alignments needed for the Replace and Insert methods are computed using fast_align.
4 http://paracrawl.eu/
5 http://opennmt.net
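One possible reading of the decay schedule is sketched below: the learning rate stays constant until the validation loss first increases, and is then multiplied by 0.8 after every subsequent epoch. The initial learning rate is not given in the text, so `initial_lr` is a placeholder, and the function name is hypothetical.

```python
def learning_rates(val_losses, initial_lr=1.0, decay=0.8):
    """Learning rate used at each epoch, given per-epoch validation losses."""
    rates, lr, decaying, previous = [], initial_lr, False, None
    for loss in val_losses:
        rates.append(lr)
        if previous is not None and loss > previous:
            decaying = True          # first increase triggers the decay
        if decaying:
            lr *= decay              # applied at every epoch from then on
        previous = loss
    return rates

schedule = learning_rates([5.0, 4.0, 3.0, 3.5, 3.2])
```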

Neural Translation
In addition to the basic tokenisation detailed above, we perform Byte-Pair Encoding (Sennrich et al., 2016b) with 30,000 merge operations learned jointly on both language sides. Neural systems are based on the open-source project OpenNMT, using a Transformer model similar to the model of Vaswani et al. (2017): both encoder and decoder have 6 layers; multi-head attention is performed over 8 heads; the hidden layer size is 512; and the inner layer of the feed-forward network is of size 2048. Word embeddings have 512 cells. We set the dropout probability to 0.1 and the batch size to 3072. The optimiser is Lazy Adam with $\beta_1 = 0.9$, $\beta_2 = 0.98$, $\epsilon = 10^{-9}$ and 4000 warmup steps. Training stops after 30 epochs.
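The warmup setting suggests the learning-rate schedule of Vaswani et al. (2017), in which the rate grows linearly for the first warmup steps and then decays with the inverse square root of the step number. The sketch below assumes that schedule is the one used here, which the text does not state explicitly; `noam_lr` and its `scale` parameter are hypothetical names.

```python
def noam_lr(step, d_model=512, warmup=4000, scale=1.0):
    """Learning rate at a given training step under the assumed schedule."""
    step = max(step, 1)              # avoid division by zero at step 0
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

peak = noam_lr(4000)                 # the rate is highest at the end of warmup
```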

Results
We first evaluate the ability of our divergence classifier to predict different types of divergences at the level of words. We use the test set manually annotated for that purpose and train our model on the OpenSubtitles corpus. A word is considered divergent when associated with a negative aggregation score (see Equation (2)). Accuracies obtained for various combinations of negative examples show that non-divergent words in parallel and unpaired sentences (columns P and U) are easy to spot, as long as the model has seen these types of examples in training. However, the accuracy drops dramatically when the model is not trained with unpaired sentences (rows PR, PI and PRI). Regarding columns R and I, accuracies are lower since these sentences contain a mix of divergent and non-divergent words. Models that were trained with the matching examples (R and I) obtain the highest accuracies (in bold letters). Column PURI gives results for the complete test set, mixing all types of examples. As expected, the best accuracy is also obtained when training on all types of examples.
Figure 2 illustrates the output of our network when trained using PU examples (right) and PURI examples (left). The former (right) fails to predict some divergences, most likely because its training set does not contain sentences mixing divergent and non-divergent words. Furthermore, the network trained with PURI examples correctly assigns a lower similarity score to this pair, as both sentences do not convey the exact same meaning.
Finally, BLEU scores obtained with varying training data configurations are given in Table 3, including the entire data sets (all; Paracrawl contains more than 100M sentences, which we reduced to 22.2M using standard filtering techniques) and the most similar pairs retained after filtering (sim). Results obtained after filtering sentence pairs (sim) clearly outperform the baseline (all) by +0.94 and +2.25 BLEU respectively.
Regarding OpenSubtitles, when fixing 2.5M sentences (4th row) the gain is further boosted to +2.01, whereas the same sentence pairs do not show any improvement when added in their original form (3rd row). Similar results are obtained for the Paracrawl corpus: results after fixing 2.5M sentences (4th row) outperform those obtained with their original form (3rd row).
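The two evaluation steps used in this section, word-level divergence accuracy and similarity-based corpus filtering, can be sketched as below. The sketch assumes that aggregation scores and sentence similarity scores have already been computed by the model; function names and the toy data are hypothetical.

```python
import numpy as np

def word_accuracy(aggr_scores, gold_signs):
    """A word is predicted divergent (+1) when its aggregation score is negative."""
    predicted = np.where(np.asarray(aggr_scores) < 0, 1, -1)
    return float((predicted == np.asarray(gold_signs)).mean())

def filter_most_similar(pairs, similarities, keep):
    """Keep the `keep` pairs with the highest similarity, in corpus order."""
    ranked = np.argsort(similarities)[::-1][:keep]
    return [pairs[int(k)] for k in sorted(ranked)]

accuracy = word_accuracy([2.0, -1.0, 0.5], [-1, 1, 1])
kept = filter_most_similar(["a", "b", "c", "d"], [0.1, 0.9, 0.5, 0.7], keep=2)
```

The number of pairs to keep is a free parameter of this sketch; the text does not state how the cut-off was chosen.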

Conclusions and outlook
We presented an unsupervised method based on deep neural networks for detecting translation divergences in parallel corpora. Our model optimizes word alignments, and computes a fine-grained divergence prediction at the level of words. Misaligned/divergent words can then be filtered out, yielding larger and better training sets. Our experiments on two machine translation tasks show significant improvements in comparison to training with the entire data set.
We plan to use our model to predict sentence embeddings over monolingual corpora, which would allow us to collect parallel pairs through vector similarity measures. In addition, we would like to measure the performance of our model after applying subword tokenisation, as well as when using multiple LSTM layers, a technique well known to capture hierarchical structure in the context of MT.