Identifying Semantic Divergences in Parallel Text without Annotations

Recognizing that even correct translations are not always semantically equivalent, we automatically detect meaning divergences in parallel sentence pairs with a deep neural model of bilingual semantic similarity which can be trained for any parallel corpus without any manual annotation. We show that our semantic model detects divergences more accurately than models based on surface features derived from word alignments, and that these divergences matter for neural machine translation.


Introduction
Parallel sentence pairs are sentences that are translations of each other, and are therefore often assumed to convey the same meaning in the source and target language. Occasional mismatches between source and target have been primarily viewed as alignment noise (Goutte et al., 2012) due to imperfect sentence alignment tools in parallel corpora drawn from translated texts (Tiedemann, 2011; Xu and Yvon, 2016), or the noisy process of extracting parallel segments from non-parallel corpora (Fung and Cheung, 2004; Munteanu and Marcu, 2005).
In contrast, we view translation as a process that inherently introduces meaning mismatches, so that even correctly aligned sentence pairs are not necessarily semantically equivalent. This can happen for many reasons: translation lexical choice often involves selecting between near synonyms that introduce language-specific nuances (Hirst, 1995), typological divergences lead to structural mismatches (Dorr, 1994), and differences in discourse organization can make it impossible to find one-to-one sentence alignments (Li et al., 2014). Cross-linguistic variations in other discourse phenomena such as coreference, discourse relation and modality (Lapshinova-Koltunski, 2015) compounded with translation effects that distinguish "translationese" from original text (Koppel and Ordan, 2011) might also lead to meaning mismatches between source and target.
In this paper, we aim to provide empirical evidence that semantic divergences exist in parallel corpora and matter for downstream applications. This requires an automatic method to distinguish semantically equivalent sentence pairs from semantically divergent pairs, so that parallel corpora can be used more judiciously in downstream cross-lingual NLP applications. We propose a semantic model to automatically detect whether a sentence pair is semantically divergent (Section 3). While prior work relied on surface cues to detect misalignments, our approach focuses on comparing the meaning of words and overlapping text spans using bilingual word embeddings (Luong et al., 2015) and a deep convolutional neural network (He and Lin, 2016). Crucially, training this model requires no manual annotation. Noisy supervision is obtained automatically by borrowing techniques developed for parallel sentence extraction (Munteanu and Marcu, 2005). Our model can thus easily be trained to detect semantic divergences in any parallel corpus.
We extensively evaluate our semantically motivated models on intrinsic and extrinsic tasks: detection of divergent examples in two parallel English-French data sets (Section 5), and data selection for English-French and Vietnamese-English machine translation (MT) (Section 6). The semantic models significantly outperform other methods on the intrinsic task, and help select data to train neural machine translation faster with no loss in translation quality. Taken together, these results provide empirical evidence that sentence alignment does not necessarily imply semantic equivalence, and that this distinction matters in practice for a downstream NLP application.
Translation Divergences We use the term semantic divergences to refer to bilingual sentence pairs, including translations, that do not have the same meaning. These differ from typological divergences, which have been defined as structural differences between sentences that convey the same meaning (Dorr, 1994; Habash and Dorr, 2002), and reflect the fact that languages do not encode the same information in the same way.
Non-Parallel Corpora Mismatches in bilingual sentence pairs have previously been studied to extract parallel segments from non-parallel corpora, and augment MT training data (Fung and Cheung, 2004; Munteanu and Marcu, 2005, 2006; Abdul-Rauf and Schwenk, 2009; Smith et al., 2010; Riesa and Marcu, 2012, inter alia). Methods for parallel sentence extraction rely primarily on surface features based on translation lexicons and word alignment patterns (Munteanu and Marcu, 2005, 2006). Similar features have proved to be useful for the related task of translation quality estimation (Specia et al., 2010, 2016), which aims to detect divergences introduced by MT errors, rather than human translation. Recently, sentence embeddings have also been used to detect parallelism (España-Bonet et al., 2017; Schwenk and Douze, 2017). Although embeddings capture semantic generalizations, these models are trained with neural MT objectives, which do not distinguish semantically equivalent segments from divergent parallel segments.
Cross-Lingual Sentence Semantics Cross-lingual semantic textual similarity (Agirre et al., 2016) and cross-lingual textual entailment (Negri and Mehdad, 2010; Negri et al., 2012, 2013) seek to characterize semantic relations between sentences in two different languages beyond translation equivalence, and are therefore directly relevant to our goal. While the human judgments obtained for each task differ, they all take inputs of the same form (two segments in different languages) and output a prediction that can be interpreted as indicating whether they are equivalent in meaning or not. Models share core intuitions, relying either on MT to transfer the cross-lingual task into its monolingual equivalent (Jimenez et al., 2012; Zhao et al., 2013), or on features derived from MT components such as translation dictionaries and word alignments (Turchi and Negri, 2013; Lo et al., 2016). Training requires manually annotated examples, either bilingual, or monolingual when using MT for language transfer.

Impact of mismatched sentence pairs on MT
Prior MT work has focused on noise in sentence alignment rather than semantic divergence. Goutte et al. (2012) show that phrase-based systems are remarkably robust to noise in parallel segments. When introducing noise by permuting the target side of parallel pairs, as many as 30% of training examples had to be permuted to degrade BLEU significantly. While such artificial noise does not necessarily capture naturally occurring divergences, there is evidence that data cleaning to remove real noise can benefit MT in low-resource settings (Matthews et al., 2014).
Neural MT quality appears to be more sensitive to the nature of training examples than that of phrase-based systems. Chen et al. (2016) suggest that neural MT systems are sensitive to sentence pair permutations in domain adaptation settings. Belinkov and Bisk (2018) demonstrate the brittleness of character-level neural MT when exposed to synthetic noise (random permutations of words and characters) as well as natural human errors. Concurrently with our work, Hassan et al. (2018) claim that even small amounts of noise can have adverse effects on neural MT models, as they tend to assign high probabilities to rare events. They filter out noise and select relevant in-domain examples jointly, using similarities between sentence embeddings obtained from the encoder of a bidirectional neural MT system trained on clean in-domain data. In contrast, we detect semantic divergence with dedicated models that require only 5000 parallel examples (see Section 5).
This work builds on our initial study of semantic divergences (Carpuat et al., 2017), where we provide a framework for evaluating the impact of meaning mismatches in parallel segments on MT via data selection: we show that filtering out the most divergent segments in a training corpus improves translation quality. However, we previously detected mismatches using a cross-lingual entailment classifier, which is based on surface features only, and requires manually annotated training examples (Negri et al., 2012, 2013). In this paper, we detect divergences using a semantically motivated model that can be trained given any existing parallel corpus without manual intervention.
We introduce our approach to detecting divergence in parallel sentences, with the goal of (1) detecting differences ranging from large mismatches to subtle nuances, (2) without manual annotation.
Cross-Lingual Semantic Similarity Model We address the first requirement using a neural model that compares the meaning of sentences using a range of granularities. We repurpose the Very Deep Pairwise Interaction (VDPWI) model, which has previously been used to detect semantic textual similarity (STS) between English sentence pairs (He and Lin, 2016). It achieved competitive performance on data from the STS 2014 shared task (Agirre et al., 2014), and outperformed previous approaches on sentence classification tasks (He et al., 2015; Tai et al., 2015), with fewer parameters, faster training, and without requiring expensive external resources such as WordNet.
The VDPWI model was designed for comparing the meaning of sentences in the same language, based not only on word-to-word similarity comparisons, but also on comparisons between overlapping spans of the two sentences, as learned by a deep convolutional neural network. We adapt the model to our cross-lingual task by initializing it with bilingual embeddings. To the best of our knowledge, this is the first time this model has been used for cross-lingual tasks in such a way. We give a brief overview of the resulting model here and refer the reader to the original paper for details. Given sentences e and f, VDPWI models the semantic similarity between them using a pipeline consisting of five components:
1. Bilingual word embeddings: Each word in e and f is represented as a vector using pretrained, bilingual embeddings.
2. BiLSTM for contextualizing words: Contextualized representations for words in e and f are obtained by choosing the output vectors at each time step obtained by running a bidirectional LSTM (Schuster and Paliwal, 1997) on each sentence.
3. Word similarity cube: The contextualized representations are used to calculate various similarity scores between each word in e and each word in f. Each score thus forms a matrix and all such matrices are stacked to form a similarity cube tensor.
4. Similarity focus layer: The similarity cube is fed to a similarity focus layer that reweights the similarities in the cube to focus on highly similar word pairs, by decreasing the weights of pairs which are less similar. This output is called the focus cube.
5. Deep convolutional network: The focus cube is treated as an "image" and passed to a deep neural network, the likes of which have been used to detect patterns in images.
The network consists of repeating convolution and pooling layers. Each repetition consists of a spatial convolutional layer, a Rectified Linear Unit (Nair and Hinton, 2010), and a max pooling layer. The repetitions are followed by fully connected layers and a softmax to obtain the final output.
The entire architecture is trained end-to-end to minimize the Kullback-Leibler divergence (Kullback, 1959) between the output similarity score and gold similarity score.
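To make the word similarity cube described above more concrete, the following is a minimal sketch in plain Python. It assumes three similarity measures (cosine, dot product, and Euclidean distance) and uses toy two-dimensional vectors in place of real BiLSTM outputs; the function names and the choice of measures are illustrative assumptions, not the VDPWI implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def dot_product(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def similarity_cube(e_vectors, f_vectors,
                    measures=(cosine, dot_product, l2_distance)):
    """Return cube[k][i][j]: measure k between word i of e and word j of f.

    Each measure yields one |e| x |f| matrix; stacking them gives the cube.
    """
    return [[[m(u, v) for v in f_vectors] for u in e_vectors]
            for m in measures]

# Toy contextualized word vectors (hypothetical values).
e_vectors = [[1.0, 0.0], [0.0, 1.0]]               # 2 words in e
f_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]   # 3 words in f
cube = similarity_cube(e_vectors, f_vectors)
```

The resulting cube has one 2 x 3 matrix per measure; a focus layer and convolutional network would then operate on this stacked tensor.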
Noisy Synthetic Supervision How can we obtain gold similarity scores as supervision for our task? We automatically construct examples of semantically divergent and equivalent sentences as follows. Since a large number of parallel sentence pairs are semantically equivalent, we use parallel sentences as positive examples. Synthetic negative examples are generated following the protocol introduced by Munteanu and Marcu (2005) to distinguish parallel from non-parallel segments. Specifically, candidate negative examples are generated starting from the positive examples $\{(e_i, f_i)\ \forall i\}$ and taking the Cartesian product of the two sides of the positive examples, $\{(e_i, f_j)\ \forall i, j \text{ s.t. } i \neq j\}$. This candidate set is filtered to ensure that negative examples are not too easy to identify: we only retain pairs that are close to each other in length (a length ratio of at most 1:2), and have enough words (at least half) which have a translation in the other sentence according to a bilingual dictionary derived from automatic word alignments.
This process yields positive and negative examples that are a noisy source of supervision for our task, as some of the positive examples might not be fully equivalent in meaning.However, experiments will show that, in aggregate, they provide a useful signal for the VDPWI model to learn to detect semantic distinctions (Section 5).
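The negative example protocol above can be sketched as follows. This is a minimal illustration under stated assumptions: sentences are token lists, the bilingual dictionary is given as two hypothetical mappings (`e2f`, `f2e`) from a word to its set of translations, and the "at least half" coverage test is applied in both directions, which the text does not specify.

```python
from itertools import product

def length_ratio_ok(e_tokens, f_tokens, max_ratio=2.0):
    # Keep pairs whose lengths differ by at most a 1:2 ratio.
    longer = max(len(e_tokens), len(f_tokens))
    shorter = min(len(e_tokens), len(f_tokens))
    return shorter > 0 and longer / shorter <= max_ratio

def coverage(src_tokens, tgt_tokens, lexicon):
    """Fraction of src tokens with a dictionary translation present in tgt."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for w in src_tokens if lexicon.get(w, set()) & tgt)
    return hits / len(src_tokens)

def candidate_negatives(pairs, e2f, f2e, min_coverage=0.5):
    """Cross every e_i with every f_j (i != j) and keep the hard candidates."""
    negatives = []
    for (i, (e, _)), (j, (_, f)) in product(enumerate(pairs), repeat=2):
        if i == j:
            continue  # (e_i, f_i) is a positive example
        if not length_ratio_ok(e, f):
            continue
        # Assumption: coverage must hold in both directions.
        if (coverage(e, f, e2f) >= min_coverage
                and coverage(f, e, f2e) >= min_coverage):
            negatives.append((e, f))
    return negatives

# Toy positive examples and dictionary (hypothetical data).
pairs = [
    (["the", "cat", "sleeps"], ["le", "chat", "dort"]),
    (["the", "dog", "sleeps"], ["le", "chien", "dort"]),
    (["hello"], ["bonjour", "à", "tous", "mes", "amis"]),
]
e2f = {"the": {"le"}, "cat": {"chat"}, "dog": {"chien"},
       "sleeps": {"dort"}, "hello": {"bonjour"}}
f2e = {"le": {"the"}, "chat": {"cat"}, "chien": {"dog"},
       "dort": {"sleeps"}, "bonjour": {"hello"}}
negatives = candidate_negatives(pairs, e2f, f2e)
```

On this toy data only the cat/dog cross pairs survive the filters: they share most of their vocabulary, so they are hard to tell apart from true translations, which is the purpose of the filtering step.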
We crowdsource annotations of English-French sentence pairs to construct test beds for evaluating our models, and also to assess how frequent semantic divergences are in parallel corpora.

Data Selection
We draw examples for annotation randomly from two English-French corpora, using a resource-rich and well-studied language pair for which bilingual annotators can easily be found. The OpenSubtitles corpus contains 33M sentence pairs based on translations of movie subtitles. The sentence pairs are expected to not be completely parallel given the many constraints imposed on translations that should fit on a screen and be synchronized with a movie (Tiedemann, 2007; Lison and Tiedemann, 2016), and the use of more informal registers which might require frequent non-literal translations of figurative language. The Common Crawl corpus contains sentence-aligned parallel documents automatically mined from the Internet. Parallel documents are discovered using, e.g., URLs containing language code patterns, and sentences are automatically aligned after structural cleaning of HTML. The resulting corpus of 3M sentence pairs is noisy, yet extremely useful to improve translation quality for multiple language pairs and domains (Smith et al., 2013).
Annotation Protocol Divergence annotations are obtained via Crowdflower.¹ Since this task requires good command of both French and English, we rely on a combination of strategies to obtain good quality annotations, including Crowdflower's internal worker proficiency ratings, geo-restriction, reference annotations by a bilingual speaker in our lab, and instructions that alternate between the two languages (Agirre et al., 2016).
Annotators are shown an English-French sentence pair, and asked whether they agree or disagree with the statement "the French and English text convey the same information." We do not use the term "divergent", and let the annotators' intuitions about what constitutes the same take precedence. We set up two distinct annotation tasks, one for each corpus, so that workers only see examples sampled from a given corpus in a given job. Each example is shown to five distinct annotators.
¹ http://crowdflower.com
Annotation Analysis Forcing an assignment of divergent or equivalent labels by majority vote yields 43.6% divergent examples in OpenSubtitles, and 38.4% in Common Crawl. Fleiss' Kappa indicates moderate agreement between annotators (0.41 for OpenSubtitles and 0.49 for Common Crawl). This suggests that the annotation protocol can be improved, perhaps by using graded judgments as in Semantic Textual Similarity tasks (Agirre et al., 2016), or for sentence alignment confidence evaluation (Xu and Yvon, 2016). Current annotations are nevertheless useful, and different degrees of agreement reveal nuances in the nature of divergences (Table 1). Examples labeled as divergent with high confidence (lowest block of the table) are either unrelated or one language misses significant information that is present in the other. Examples labeled divergent with lower confidence contain more subtle differences (e.g., "what does it mean" in English vs. "what are the advantages" in French).
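The majority vote and agreement statistic used in this analysis can be illustrated with a short sketch. It assumes each example carries exactly five votes drawn from two labels, and it implements the standard Fleiss' Kappa formula; the vote data below is invented for illustration and is not taken from our annotations.

```python
from collections import Counter

def majority_label(votes):
    # With five voters and two labels there are no ties.
    return Counter(votes).most_common(1)[0][0]

def fleiss_kappa(all_votes, categories=("equivalent", "divergent")):
    """Fleiss' Kappa for items rated by the same number of annotators.

    Undefined (division by zero) if every vote falls in a single category.
    """
    n_items = len(all_votes)
    n_raters = len(all_votes[0])
    counts = [[votes.count(c) for c in categories] for votes in all_votes]
    # Per-item agreement: share of annotator pairs that agree.
    p_items = [(sum(c * c for c in row) - n_raters)
               / (n_raters * (n_raters - 1)) for row in counts]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(len(categories))]
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical votes for four sentence pairs, five annotators each.
votes = [
    ["equivalent"] * 5,
    ["divergent"] * 5,
    ["equivalent"] * 3 + ["divergent"] * 2,
    ["divergent"] * 4 + ["equivalent"],
]
labels = [majority_label(v) for v in votes]
kappa = fleiss_kappa(votes)
```

The number of annotators agreeing with the majority label (3, 4 or 5) is what separates the low and high agreement blocks discussed in the text.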

Divergence Detection Evaluation
Using the two test sets obtained above, we can evaluate the accuracy of our cross-lingual semantic divergence detector, and compare it against a diverse set of baselines in controlled settings. We test our hypothesis that semantic divergences are more than alignment mismatches by comparing the semantic divergence detector with models that capture mis-alignment (Section 5.2) or translation (Section 5.3). Then, we compare the deep convolutional architecture of the semantic divergence model with a much simpler model that directly compares bilingual sentence embeddings (Section 5.4). Finally, we compare our model trained on synthetic examples with a supervised classifier used in prior work to predict finer-grained textual entailment categories based on manually created training examples (Section 5.5). Except for the entailment classifier which uses external resources, all models are trained on the exact same parallel corpora (OpenSubtitles or CommonCrawl for evaluating on the corresponding test bed).

Neural Semantic Divergence Detection
Equivalent with High Agreement (n = 5)
subs
en: the epidemic took my wife, my stepson.
fr: l'épidémie a touché ma femme, mon beau-fils.
gl: the epidemic touched my wife, my stepson.
Equivalent with Low Agreement (n = 3)
cc
en: cancellation policy: if cancelled up to 28 days before date of arrival, no fee will be charged.
fr: conditions d'annulation : en cas d'annulation jusqu'à 28 jours avant la date d'arrivée, l'hôtel ne prélève pas de frais sur la carte de crédit fournie.
gl: cancellation conditions: in case of cancellation up to 28 days before arrival date, the hotel does not charge fees from the credit card given.
Divergent with Low Agreement (n = 3)
cc
en: what does it mean when food is "low in ash" or "low in magnesium"?
fr: quels sont les avantages d'une nourriture "réduite en cendres" et "faible en magnésium" ?
gl: what are the advantages of a food "low in ash" or "low in magnesium"?
Divergent with High Agreement (n = 5)

Model and Training Settings We use the publicly available implementation of the VDPWI model.² We initialize with 200 dimensional BiVec French-English word embeddings (Luong et al., 2015), trained on the parallel corpus from which our test set is drawn. We use the default setting for all other VDPWI parameters. The model is trained for 25 epochs and the model that achieves the best Pearson correlation coefficient on the validation set is chosen. At test time, VDPWI outputs a score ∈ [0, 1], where a higher value indicates less divergence. We tune a threshold on development data to convert the real-valued score to binary predictions.
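Threshold tuning on development data can be sketched as below. The criterion being maximized is not stated in the text, so this sketch assumes accuracy; the function names and the toy scores are hypothetical.

```python
def predict(scores, threshold):
    # True = equivalent (score at or above threshold), False = divergent.
    return [s >= threshold for s in scores]

def tune_threshold(dev_scores, dev_gold):
    """Pick the observed score that maximizes accuracy when used as threshold.

    Assumption: accuracy is the tuning criterion; ties keep the lowest
    threshold encountered.
    """
    best_threshold, best_accuracy = None, -1.0
    for t in sorted(set(dev_scores)):
        preds = predict(dev_scores, t)
        accuracy = sum(p == g for p, g in zip(preds, dev_gold)) / len(dev_gold)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = t, accuracy
    return best_threshold, best_accuracy

# Hypothetical development scores and gold labels (True = equivalent).
dev_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
dev_gold = [True, True, False, True, False, False]
threshold, accuracy = tune_threshold(dev_scores, dev_gold)
test_predictions = predict([0.7, 0.3], threshold)
```

Only observed development scores are tried as thresholds, which is sufficient because accuracy can only change at those values.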

Synthetic Data Generation
The synthetic training data is constructed using a random sample of 5000 sentences from the training parallel corpus as positive examples. We generate negative examples automatically as described in Section 3, and sample a subset to maintain a 1:5 ratio of positive to negative examples.³

Parallel vs. Non-parallel Classifier
Are divergences observed in parallel corpora more than alignment errors? To answer this question, we reimplement the model proposed by Munteanu and Marcu (2005). It discriminates parallel pairs from non-parallel pairs in comparable corpora using a supervised linear classifier with length, word alignment, and dictionary features computed for each sentence pair (e, f). Training uses the exact same synthetic examples as the semantic divergence model (Section 3).
³ We experimented with other ratios and found that the results only slightly degraded while using a more balanced ratio (1:1, 1:2), but severely degraded with a skewed ratio (1:9).

Neural MT
If divergent examples are nothing more than bad translations, a neural MT system should assign lower scores to divergent segment pairs than to those that are equivalent in meaning. We test this empirically using neural MT systems trained for a single epoch, and use the system to score each of the sentence pairs in the test sets. We tune a threshold on the development set to convert scores to binary predictions. The system architecture and training settings are described in detail in the later MT section (Section 6.2). Preliminary experiments showed that training for more than one epoch does not help divergence detection.

Bilingual Sentence Embeddings
Our semantic divergence model introduces a large number of parameters to combine the pairwise word comparisons into a single sentence-level prediction. This baseline tests whether a simpler model would suffice. We detect semantic divergence by computing the cosine similarity between sentence embeddings in a bilingual space. The sentence embeddings are bag-of-words representations, built by taking the mean of bilingual word embeddings for each word in the sentence. This approach has been shown to be effective, despite ignoring fundamental linguistic information such as word order and syntax (Mitchell and Lapata, 2010). We use the same 200 dimensional BiVec word embeddings (Luong et al., 2015), trained on OpenSubtitles and CommonCrawl respectively.
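A minimal sketch of this baseline follows. It assumes a single dictionary holding both languages' word vectors in the shared space, uses toy two-dimensional vectors rather than 200 dimensional BiVec embeddings, and skips out-of-vocabulary words, which is an assumption made here for illustration.

```python
import math

def sentence_embedding(tokens, embeddings):
    """Mean of the word vectors of all in-vocabulary tokens (None if empty)."""
    vectors = [embeddings[w] for w in tokens if w in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bilingual_similarity(e_tokens, f_tokens, embeddings):
    e_vec = sentence_embedding(e_tokens, embeddings)
    f_vec = sentence_embedding(f_tokens, embeddings)
    if e_vec is None or f_vec is None:
        return 0.0
    return cosine(e_vec, f_vec)

# Toy bilingual embedding space (hypothetical values).
toy_embeddings = {
    "cat": [1.0, 0.0], "chat": [0.9, 0.1],
    "sleeps": [0.0, 1.0], "dort": [0.1, 0.9],
}
sim_parallel = bilingual_similarity(["cat", "sleeps"], ["chat", "dort"],
                                    toy_embeddings)
sim_mismatch = bilingual_similarity(["cat"], ["dort"], toy_embeddings)
```

Because the mean discards word order, any reordering of the same words receives the same score, which is the limitation the similarity cube is meant to address.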

Textual Entailment Classifier
Our final baseline replicates our previous system (Carpuat et al., 2017) where we repurposed annotations and models designed for the task of Cross-Lingual Textual Entailment (CLTE) to detect semantic divergences. This baseline also helps us understand how the synthetic training data compares to training examples generated manually, for a related cross-lingual task. Using CLTE datasets from SemEval (Negri et al., 2012, 2013), we train a supervised linear classifier that can distinguish sentence pairs that entail each other, from pairs where entailment does not hold in at least one direction. The features of the classifier are based on word alignments and sentence lengths.⁶

Intrinsic Evaluation Results
Table 2 shows that the semantic similarity model is most successful at distinguishing equivalent from divergent examples. The breakdown per class shows that both equivalent and divergent examples are better detected. The improvement is larger for divergent examples, with gains of about 10 points in F-score for the divergent class when compared to the next-best scores.
Among the baseline methods, the non-entailment model is the weakest. While it uses manually constructed training examples, these examples are drawn from distant domains, and the categories annotated do not exactly match the task at hand. In contrast, the other models benefit from training on examples drawn from the same corpus as each test set.
Next, the machine translation based model and the sentence embedding model, both of which are unsupervised, are significantly weaker than the two supervised models trained on synthetic data, highlighting the benefits of the automatically constructed divergence examples. The strength of the semantic similarity model compared to the sentence embeddings model highlights the benefits of the fine-grained representation of bilingual sentence pairs as a similarity cube, as opposed to the bag-of-words sentence embedding representation.
Finally, despite training on the same noisy synthetic data as the parallel vs. non-parallel system, the semantic similarity model is better able to detect meaning divergences. This highlights the benefits of (1) meaning comparison between words in a shared embedding space, over the discrete translation dictionary used by the baseline, and of (2) the deep convolutional neural network which enables the semantic comparison of overlapping spans in sentence pairs, as opposed to more local word alignment features.
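For reference, the per-class precision, recall and F-score used in this evaluation can be computed as in the sketch below. The labels "+" (equivalent) and "-" (divergent) follow the Table 2 caption; the gold and predicted labels are invented for illustration.

```python
def precision_recall_f(gold, pred, label):
    """Precision, recall and F-score for one class, treated as the positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical gold and predicted labels for six sentence pairs.
gold = ["+", "+", "-", "-", "-", "+"]
pred = ["+", "-", "-", "-", "+", "+"]
equivalent_prf = precision_recall_f(gold, pred, "+")
divergent_prf = precision_recall_f(gold, pred, "-")
```

Reporting the two classes separately matters here because a detector can score well on the equivalent class while missing most divergent pairs.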

Analysis
We manually examine the 13-15% of examples in each test set that are correctly detected as divergent by semantic similarity and mis-classified by the non-parallel detector.
On OpenSubtitles, most of these examples are true divergences rather than noisy alignments (i.e., sentences that are not translations of each other). The non-parallel detector weighs length features highly, and is fooled by sentence pairs of similar length that share little content and therefore have very sparse word alignments. The remaining sentence pairs are plausible translations in some context that still contain inherent divergences, such as details missing or added in one language. The non-parallel detector views these pairs as non-divergent since most words can be aligned. The semantic similarity model can identify subtle meaning differences, and correctly classify them as divergent. As a result, the non-parallel detector has a higher false positive rate (22%) than the semantic similarity classifier (14%), while having similar false negative rates (11% vs. 12%).
On the CommonCrawl test set, the examples with disagreement are more diverse, ranging from noisy segments that should not be aligned to sentences with subtle divergences.

Table 2: Intrinsic evaluation on crowdsourced semantic equivalence vs. divergence test sets. We report overall F-score, as well as precision (P), recall (R) and F-score (F) for the equivalent (+) and divergent (-) classes separately. Semantic similarity yields better results across the board, with larger improvements on the divergent class.

Machine Translation Evaluation
Having established the effectiveness of the semantic divergence detector, we now measure the impact of divergences on a downstream task, machine translation. As in our prior work (Carpuat et al., 2017), we take a data selection approach, selecting the least divergent examples in a parallel corpus based on a range of divergence detectors, and comparing the translation quality of the resulting neural MT systems.

Translation Tasks
English-French We evaluate on 4867 sentences from the Microsoft Spoken Language Translation dataset (Federmann and Lewis, 2016) as well as on 1300 sentences from TED talks (Cettolo et al., 2012), as in past work (Carpuat et al., 2017).
Training examples are drawn from OpenSubtitles, which contains ~28M examples after deduplication. We discard 50% of examples for data selection.
Vietnamese-English Since the SEMANTIC SIMILARITY model was designed to be easily portable to new language pairs, we also test its impact on the IWSLT Vietnamese-English TED task, which comes with ~120,000 and 1268 in-domain sentences for training and testing respectively (Farajian et al., 2016). This is a more challenging translation task as Vietnamese and English are more distant languages, there is little training data, and the sentence pairs are expected to be cleaner and more parallel than those from OpenSubtitles. In these lower resource settings, we discard 10% of examples for data selection.
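The data selection step itself is simple to express. The sketch below assumes each training pair has a detector score where higher means less divergent (as with the VDPWI output), discards the given fraction of lowest-scoring pairs, and preserves corpus order; the function name and rounding behavior are assumptions.

```python
def select_least_divergent(pairs, scores, discard_fraction):
    """Keep the highest-scoring pairs, dropping `discard_fraction` of them.

    Higher score = less divergent. The original corpus order is preserved.
    """
    n_keep = len(pairs) - int(len(pairs) * discard_fraction)
    ranked = sorted(zip(scores, range(len(pairs))), reverse=True)
    keep = sorted(index for _, index in ranked[:n_keep])
    return [pairs[i] for i in keep]

# Hypothetical corpus of four pairs with detector scores.
corpus = ["pair_a", "pair_b", "pair_c", "pair_d"]
scores = [0.9, 0.1, 0.6, 0.4]
selected = select_least_divergent(corpus, scores, discard_fraction=0.5)
```

With `discard_fraction=0.5` this mirrors the English-French setting, and `discard_fraction=0.1` the Vietnamese-English one.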

Neural MT System
We use the attentional encoder-decoder model (Bahdanau et al., 2015) implemented in the Sockeye toolkit (Hieber et al., 2017). Encoders and decoders are single-layer GRUs (Cho et al., 2014) with 1000 hidden units. Source and target word embeddings have size 512. Using byte-pair encoding (Sennrich et al., 2016), the vocabulary size is 50000. Maximum sequence length is set to 50. We optimize the standard cross-entropy loss using Adam (Kingma and Ba, 2014), until validation perplexity does not decrease for 8 checkpoints. The learning rate is set to 0.0003 and is halved when the validation perplexity does not decrease for 3 checkpoints. The batch size is set to 80. At decoding time, we construct a new model by averaging the parameters for the 8 checkpoints with best validation perplexity, and decode with a beam of 5. All experiments are run thrice with distinct random seeds.
Note that the baseline in this work is much stronger than in our prior work (> 5 BLEU). This is due to multiple factors that have been recommended as best practices for neural MT and have been incorporated in the present baseline: deduplication of the training data, ensemble decoding using multiple random runs, use of Adam as the optimizer instead of AdaDelta (Bahar et al., 2017; Denkowski and Neubig, 2017), checkpoint averaging (Bahar et al., 2017), and a more recent neural modeling toolkit.
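Checkpoint averaging, as used at decoding time, can be illustrated with the following sketch. It represents each checkpoint as a pair of validation perplexity and a dictionary of flat parameter lists; this data layout and the function name are assumptions for illustration and do not reflect the toolkit's own checkpoint format.

```python
def average_best_checkpoints(checkpoints, k=8):
    """Average parameters over the k checkpoints with lowest perplexity.

    checkpoints: list of (validation_perplexity, params), where params maps
    a parameter name to a flat list of floats (hypothetical layout).
    """
    best = sorted(checkpoints, key=lambda c: c[0])[:k]
    names = best[0][1].keys()
    return {
        name: [sum(values) / len(best)
               for values in zip(*(params[name] for _, params in best))]
        for name in names
    }

# Three toy checkpoints; averaging the best two by validation perplexity.
checkpoints = [
    (12.0, {"w": [1.0, 2.0]}),
    (10.0, {"w": [3.0, 4.0]}),
    (11.0, {"w": [5.0, 8.0]}),
]
averaged = average_best_checkpoints(checkpoints, k=2)
```

The averaged parameters define a single new model, so decoding cost is the same as for any individual checkpoint.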

English-French Results
We train English-French neural MT systems by selecting the least divergent half of the training corpus with the following criteria:
Learning curves (Figure 1) show that data selected using SEMANTIC SIMILARITY yields better
The SEMANTIC SIMILARITY model also achieves significantly better translation quality than the ENTAILMENT model used in our prior work. Surprisingly, the ENTAILMENT model performs worse than the ALL baseline, unlike in our prior work. We attribute this different behavior to several factors: the strength of the new baseline (Section 6.2), the use of Adam instead of AdaDelta, which results in stronger BLEU scores at the beginning of the learning curves for all models, and finally the deduplication of the training data. In our prior systems, the training corpus was not deduplicated. Data selection had a side effect of reducing the ratio of duplicated examples. When the ENTAILMENT model was used, longer sentence pairs with more balanced length were selected, yielding longer translations with a better BLEU brevity penalty than the baseline. With the new systems, these advantages vanish. We further analyze output lengths in Section 6.5.

Vietnamese-English Results
Trends from English-French carry over to Vietnamese-English, as the SEMANTIC SIMILARITY model compares favorably to ALL while reducing the number of training updates by 10%. SEMANTIC SIMILARITY also yields better BLEU than RANDOM, with the differences being statistically significant. While differences in score here are smaller, these results are encouraging since they demonstrate that our semantic divergence models port to more distant low-resource language pairs.

Analysis
We break down the results seen in Figure 1 and Table 3, with a focus on the behavior of the ENTAILMENT and ALL models. We start by analyzing the BLEU brevity penalty trends observed on the validation set during training (Figure 2). We observe that both the ENTAILMENT and SEMANTIC SIMILARITY based models have similar brevity penalties despite having performances that are at opposite ends of the spectrum in terms of BLEU. This implies that translations generated by the SEMANTIC SIMILARITY model have better n-gram overlap with the reference, but are much shorter. Manual examination of the translations suggests that the ENTAILMENT model often fails by under-translating sentences, either dropping segments from the beginning or the end of source sentences (Table 5).
The PARALLEL model consistently produces translations that are longer than the reference.⁷ This is partially due to the model's propensity to generate a sequence of garbage tokens at the beginning of a sentence, especially while translating shorter sentences. In our test set, almost 12% of the translated sentences were found to begin with the garbage text shown in Table 5. Only a small fraction (< 0.02%) of the French sentences in our training data begin with these tokens, but the tendency of PARALLEL to promote divergent examples above non-divergent ones seems to exaggerate the generation of this sequence.
⁷ The brevity penalty does not penalize translations that are longer than the reference.
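The behavior of the brevity penalty noted above follows directly from its standard definition in BLEU, sketched here with corpus-level lengths as inputs. The function name is our own; the formula is the usual one (1 when the candidate is longer than the reference, exp(1 - r/c) otherwise).

```python
import math

def brevity_penalty(candidate_length, reference_length):
    """Standard BLEU brevity penalty from total candidate/reference lengths."""
    if candidate_length == 0:
        return 0.0
    if candidate_length > reference_length:
        return 1.0  # longer output is never penalized by this term
    return math.exp(1 - reference_length / candidate_length)

too_long = brevity_penalty(12, 10)    # output longer than the reference
same_length = brevity_penalty(10, 10)
too_short = brevity_penalty(5, 10)    # under-translation is penalized
```

This is why overly long PARALLEL output does not show up in the brevity penalty curves, while under-translation by ENTAILMENT does.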

Conclusion
We conducted an extensive empirical study of semantic divergences in parallel corpora. Our crowdsourced annotations confirm that correctly aligned sentences are not necessarily meaning equivalent. We introduced an approach based on neural semantic similarity that detects such divergences much more accurately than shallower translation or alignment based models. Importantly, our model does not require manual annotation, and can be trained for any language pair and domain with a parallel corpus. Finally, we show that filtering out divergent examples helps speed up the convergence of neural machine translation training without loss in translation quality, for two language pairs and data conditions. New datasets and models introduced in this work are available at http://github.com/yogarshi/SemDiverge.
These findings open several avenues for future work: How can we improve divergence detection further? Can we characterize the nature of the divergences beyond binary predictions? How do divergent examples impact other applications, including cross-lingual NLP applications and semantic models induced from parallel corpora, as well as tools for human translators and second language learners?

• Length features: |f|, |e|, |f|/|e|, and |e|/|f|
• Alignment features (for each of e and f):⁴
  - Count and ratio of unaligned words
  - Top three largest fertilities
  - Longest contiguous unaligned and aligned sequence lengths
• Dictionary features:⁵ fraction of words in e that have a translation in f and vice-versa.
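The feature list above can be turned into a feature vector roughly as follows. This is a sketch under assumptions: the word alignment is given as a list of (i, j) index links, fertility is taken to be the number of links per word, and the dictionaries `e2f` and `f2e` are hypothetical word-to-translations mappings. It is not the original Munteanu and Marcu (2005) implementation.

```python
from collections import Counter

def length_features(e, f):
    return [len(f), len(e), len(f) / len(e), len(e) / len(f)]

def longest_run(flags, value):
    # Length of the longest contiguous run of `value` in `flags`.
    best = run = 0
    for flag in flags:
        run = run + 1 if flag == value else 0
        best = max(best, run)
    return best

def alignment_features(n_tokens, links):
    """links: aligned positions on this side, one entry per alignment link."""
    fertility = Counter(links)
    aligned = [i in fertility for i in range(n_tokens)]
    n_unaligned = aligned.count(False)
    top3 = sorted(fertility.values(), reverse=True)[:3]
    top3 += [0] * (3 - len(top3))  # pad when fewer than three aligned words
    return [n_unaligned, n_unaligned / n_tokens, *top3,
            longest_run(aligned, False), longest_run(aligned, True)]

def dictionary_coverage(src, tgt, lexicon):
    tgt_set = set(tgt)
    return sum(1 for w in src if lexicon.get(w, set()) & tgt_set) / len(src)

def features(e, f, alignment, e2f, f2e):
    e_links = [i for i, _ in alignment]
    f_links = [j for _, j in alignment]
    return (length_features(e, f)
            + alignment_features(len(e), e_links)
            + alignment_features(len(f), f_links)
            + [dictionary_coverage(e, f, e2f), dictionary_coverage(f, e, f2e)])

# Toy sentence pair, alignment and dictionaries (hypothetical data).
e = ["the", "black", "cat", "sleeps"]
f = ["le", "chat", "noir", "dort", "bien"]
alignment = [(0, 0), (1, 2), (2, 1), (3, 3)]
e2f = {"the": {"le"}, "black": {"noir"}, "cat": {"chat"}, "sleeps": {"dort"}}
f2e = {"le": {"the"}, "noir": {"black"}, "chat": {"cat"}, "dort": {"sleeps"}}
vector = features(e, f, alignment, e2f, f2e)
```

All of these features are computed from lengths, alignment links and dictionary lookups, with no notion of word meaning beyond the discrete dictionary.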

Figure 1: Learning curves on the validation set for English-French models (mean of 3 runs/model).The SEMANTIC SIMILARITY model outperforms other models throughout training, including the one trained on all data.

Figure 2: Brevity penalties on the validation set for English-French models (single run).
ENTAILMENT is inadequate due to under-translation
Source: he's a very impressive man and still goes out to do digs.
Reference: c'est un homme très impressionnant et il fait encore des fouilles.
ENTAILMENT: c'est un homme très impressionnant.
Source: when the Heat first won.
Reference: lorsque les Heat ont gagné pour la première fois.
ENTAILMENT: quand le Heat a gagné.
PARALLEL produces garbage tokens
Source: alright.
Reference: d'accord.
ENTAILMENT: {\pos(192,210)} d'accord.

Table 1: Randomly selected sentence pairs (English (en), French (fr) and gloss of French (gl)) annotated as divergent or equivalent, with high and low degrees of agreement between the 5 annotators. Examples are taken either from the OpenSubtitles (subs) or Common Crawl (cc) corpus.
Divergent with High Agreement (n = 5)
en: rabbit? if i told you it was a chicken, you wouldn't know the difference.
fr: vous croirez manger du poulet.
gl: you think eat chicken

Table 4: Vietnamese-English decoding results: dropping 10% of the data based on SEMANTIC SIMILARITY does not hurt BLEU, and performs significantly (p < 0.05) better than RANDOM selection.

Table 5: Selected translation examples from the ensemble systems of the various models.