Detecting Fine-Grained Cross-Lingual Semantic Divergences without Supervision by Learning to Rank

Detecting fine-grained differences in content conveyed in different languages matters for cross-lingual NLP and multilingual corpora analysis, but it is a challenging machine learning problem since annotation is expensive and hard to scale. This work improves the prediction and annotation of fine-grained semantic divergences. We introduce a training strategy for multilingual BERT models by learning to rank synthetic divergent examples of varying granularity. We evaluate our models on the Rationalized English-French Semantic Divergences, a new dataset released with this work, consisting of English-French sentence-pairs annotated with semantic divergence classes and token-level rationales. Learning to rank helps detect fine-grained sentence-level divergences more accurately than a strong sentence-level similarity model, while token-level predictions have the potential of further distinguishing between coarse and fine-grained divergences.


Introduction
Comparing and contrasting the meaning of text conveyed in different languages is a fundamental NLP task. It can be used to curate clean parallel corpora for downstream tasks such as machine translation (Koehn et al., 2018), cross-lingual transfer learning, or semantic modeling (Ganitkevitch et al., 2013; Conneau and Lample, 2019), and it is also useful to directly analyze multilingual corpora. For instance, detecting the commonalities and divergences between sentences drawn from English and French Wikipedia articles about the same topic would help analyze language bias (Bao et al., 2012; Massa and Scrinzi, 2012), or mitigate differences in coverage and usage across languages (Yeung et al., 2011; Wulczyn et al., 2016; Lemmerich et al., 2019). This requires not only detecting coarse content mismatches, but also fine-grained differences in sentences that overlap in content. Consider the following English and French sentences, sampled from the WikiMatrix parallel corpus. While they share important content, highlighted words convey meaning missing from the other language:

EN Alexander Muir's "The Maple Leaf Forever" served for many years as an unofficial Canadian national anthem.
FR Alexander Muir compose The Maple Leaf Forever (en) qui est un chant patriotique pro canadien anglais.
GLOSS Alexander Muir composes The Maple Leaf Forever which is an English Canadian patriotic song.
We show that explicitly considering diverse types of semantic divergences in bilingual text benefits both the annotation and prediction of cross-lingual semantic divergences. We create and release the Rationalized English-French Semantic Divergences corpus (REFRESD), based on a novel divergence annotation protocol that exploits rationales to improve annotator agreement. We introduce Divergent mBERT, a BERT-based model that detects fine-grained semantic divergences without supervision by learning to rank synthetic divergences of varying granularity. Experiments on REFRESD show that our model distinguishes semantically equivalent from divergent examples much better than a strong sentence similarity baseline, and that unsupervised token-level divergence tagging offers promise to refine distinctions among divergent instances. We make our code and data publicly available.1

Background
Following Vyas et al. (2018), we use the term cross-lingual semantic divergences to refer to differences in meaning between sentences written in two languages. Semantic divergences differ from typological divergences, which reflect different ways of encoding the same information across languages (Dorr, 1994). In sentence pairs drawn from comparable documents (written independently in each language but sharing a topic), sentences that contain translated fragments are rarely exactly equivalent (Fung and Cheung, 2004; Munteanu and Marcu, 2005), and sentence alignment errors yield coarse mismatches in meaning (Goutte et al., 2012). In translated sentence pairs, differences in discourse structure across languages (Li et al., 2014) can lead to sentence-level divergences or discrepancies in the translation of pronouns (Lapshinova-Koltunski and Hardmeier, 2017; Šoštarić et al., 2018); translation lexical choice requires selecting between near-synonyms that introduce language-specific nuances (Hirst, 1995); typological divergences lead to structural mismatches (Dorr, 1994); and non-literal translation processes can lead to semantic drifts (Zhai et al., 2018).
Despite this broad spectrum of phenomena, recent work has effectively focused on coarse-grained divergences: Vyas et al. (2018) work on subtitles and Common Crawl corpora where sentence alignment errors abound, and Pham et al. (2018) focus on fixing divergences where content is appended to one side of a translation pair. By contrast, Zhai et al. (2018, 2019) introduce token-level annotations that capture the meaning changes introduced by human translators during the translation process (Molina and Hurtado Albir, 2002). However, this expensive annotation process does not scale easily.
When processing bilingual corpora, any meaning mismatches between the two languages are primarily viewed as noise for the downstream task. In shared tasks for filtering web-crawled parallel corpora (Koehn et al., 2018, 2019), the best performing systems rely on translation models or cross-lingual sentence embeddings to place bilingual sentences on a clean-to-noisy scale (Junczys-Dowmunt, 2018; Sánchez-Cartagena et al., 2018; Lu et al., 2018; Chaudhary et al., 2019). When mining parallel segments in Wikipedia for the WikiMatrix corpus (Schwenk et al., 2019), examples are ranked using the LASER score (Artetxe and Schwenk, 2019), which computes cross-lingual similarity in a language-agnostic sentence embedding space. While this approach yields a very useful corpus of 135M parallel sentences in 1,620 language pairs, we show that LASER fails to detect many semantic divergences in WikiMatrix.

Unsupervised Divergence Detection
We introduce a model based on multilingual BERT (mBERT) to distinguish divergent from equivalent sentence-pairs (Section 3.1). In the absence of annotated training data, we derive synthetic divergent samples from parallel corpora (Section 3.2) and train via learning to rank to exploit the diversity and varying granularity of the resulting samples (Section 3.3). We also show how our model can be extended to label tokens within sentences (Section 3.4).

Divergent mBERT Model
Following prior work (Vyas et al., 2018), we frame divergence detection as binary classification (equivalence vs. divergence) given two inputs: an English sentence x_e and a French sentence x_f. Given the success of multilingual masked language models like mBERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019), and XLM-R (Conneau et al., 2020) on cross-lingual understanding tasks, we build our classifier on top of multilingual BERT in a standard fashion: we create a sequence x by concatenating x_e and x_f with helper delimiter tokens: [CLS] x_e [SEP] x_f [SEP]. The [CLS] token encoding serves as the representation of the sentence-pair x, and is passed through a feed-forward network F to get the score F(x). Finally, we convert the score F(x) into the probability of x belonging to the equivalent class.
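The pair packing and scoring head above can be sketched as follows. This is a minimal illustration, not the fine-tuned model: `pack_pair` and `equivalence_probability` are our own names, and the scalar `score` stands in for the output of the feed-forward layer F applied to the [CLS] encoding.

```python
import math

def pack_pair(tokens_en, tokens_fr):
    """Concatenate the two sentences with BERT delimiter tokens:
    [CLS] x_e [SEP] x_f [SEP]."""
    return ["[CLS]"] + tokens_en + ["[SEP]"] + tokens_fr + ["[SEP]"]

def equivalence_probability(score):
    """Map the scalar score F(x) to the probability that the
    sentence-pair belongs to the equivalent class (sigmoid)."""
    return 1.0 / (1.0 + math.exp(-score))

x = pack_pair(["a", "cat"], ["un", "chat"])
```

In the actual model, the packed sequence would be tokenized into subwords and encoded by mBERT; only the [CLS] position feeds the classification head.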

Generating Synthetic Divergences
We devise three ways of creating training instances that mimic divergences of varying granularity by perturbing seed equivalent samples from parallel corpora (Table 1):

Subtree Deletion We mimic semantic divergences due to content included only in one language by deleting a randomly selected subtree in the dependency parse of the English sentence, or French words aligned to English words in that subtree. We use subtrees that are not leaves, and that cover less than half of the sentence length. Durán et al. (2014) and Cardon and Grabar (2020) successfully use this approach to compare sentences in the same language.

Phrase Replacement Following Pham et al. (2018), we introduce divergences that mimic phrasal edits or mistranslations by substituting random source or target sequences with another sequence of words with matching POS tags (to keep generated sentences as grammatical as possible).

Lexical Substitution We mimic particularization and generalization translation operations (Zhai et al., 2019) by substituting English words with hypernyms or hyponyms from WordNet. The replacement word is the highest-scoring WordNet candidate in context, according to a BERT language model (Zhou et al., 2019; Qiang et al., 2019).
We call all these divergent examples contrastive because each divergent example contrasts with a specific equivalent sample from the seed set. The three transformations above create divergences of varying granularity and induce an implicit ranking over divergent examples based on the range of edit operations: from a single token for lexical substitution, to local short phrases for phrase replacement, and up to half the words in a sentence for subtree deletion.
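A minimal sketch of the contrastive-pair construction, using random contiguous-span deletion as a simplified stand-in for the dependency-subtree deletion described above (the real transform operates on a parse tree; all function names here are ours):

```python
import random

def delete_span(tokens, rng):
    """Delete a random contiguous span covering less than half of the
    sentence, approximating subtree deletion without a parser."""
    max_len = max(1, len(tokens) // 2 - 1)
    length = rng.randint(1, max_len)
    start = rng.randint(0, len(tokens) - length)
    return tokens[:start] + tokens[start + length:]

def make_contrastive_pair(en_tokens, fr_tokens, rng):
    """Pair a seed equivalent with its synthetic divergent version;
    this pairing is what makes the samples 'contrastive'."""
    divergent_en = delete_span(en_tokens, rng)
    return (en_tokens, fr_tokens), (divergent_en, fr_tokens)
```

The resulting (equivalent, divergent) pairs feed the ranking loss of Section 3.3 rather than being treated as independent training examples.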

Learning to Rank Contrastive Samples
We train the Divergent mBERT model by learning to rank synthetic divergences. Instead of treating equivalent and divergent samples independently, we exploit their contrastive nature by explicitly pairing divergent samples with their seed equivalent sample when computing the loss. Intuitively, lexical substitution samples should rank higher than phrase replacement and subtree deletion and lower than seed equivalents: we exploit this intuition by enforcing a margin between the scores of increasingly divergent samples.
Formally, let x denote an English-French sentence-pair and y a contrastive pair, with x > y indicating that the divergence in x is finer-grained than in y. For instance, we assume that x > y if x is generated by lexical substitution and y by subtree deletion.
At training time, given a set of contrastive pairs D = {(x, y)}, the model is trained to rank the score of the first instance higher than that of the second by minimizing the following margin-based loss:

L(x, y) = max(0, ξ − (F(x) − F(y)))    (1)

where ξ is a hyperparameter margin that controls the score difference between the sentence-pairs x and y. This ranking loss has proved useful in supervised English semantic analysis tasks (Li et al., 2019), and we show that it also helps with our cross-lingual synthetic data.
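The margin-based loss described above reduces to a one-line function over the two scalar scores; the function name and the default value of `xi` are our own:

```python
def margin_ranking_loss(score_x, score_y, xi=5.0):
    """Zero when F(x) exceeds F(y) by at least the margin xi;
    otherwise grows linearly with the violation."""
    return max(0.0, xi - (score_x - score_y))
```

Note the loss is zero as soon as the finer-grained pair x out-scores y by the margin, so well-separated pairs stop contributing gradient.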

Divergent mBERT for Token Tagging
We introduce an extension of Divergent mBERT which, given a bilingual sentence-pair, produces a) a sentence-level prediction of equivalence vs. divergence and b) a label sequence with one EQ/DIV tag per input token. EQ and DIV refer to token-level tags of equivalence and divergence, respectively. Motivated by annotation rationales, we adopt a multi-task framework to train our model on a set of triplets D = {(x, y, z)}, still using only synthetic supervision (Figure 1). As in Section 3.3, we assume x > y, while z is the sequence of labels for the second encoded sentence-pair y, such that, at time t, z_t ∈ {EQ, DIV} is the label of y_t. Since Divergent mBERT operates on sequences of subwords, we assign an EQ or DIV label to a word token if at least one of its subword units is assigned that label.
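The subword-to-word projection rule above (a word is DIV if any of its subwords is DIV) can be sketched as follows; the helper name and the `word_ids` mapping format are our own:

```python
def word_labels(subword_tags, word_ids):
    """subword_tags: per-subword 'EQ'/'DIV' tags;
    word_ids: index of the word each subword belongs to.
    A word is DIV if at least one of its subwords is DIV."""
    n_words = max(word_ids) + 1
    labels = ["EQ"] * n_words
    for tag, w in zip(subword_tags, word_ids):
        if tag == "DIV":
            labels[w] = "DIV"
    return labels
```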
For the token prediction task, the final hidden state h_t of each token y_t is passed through a feed-forward layer and a softmax layer to produce the probability P_{y_t} of the token belonging to the EQ class. For the sentence task, the model learns to rank x > y, as in Section 3.3. We then minimize the sum of the sentence-level margin loss and the average token-level cross-entropy loss (L_CE) across all tokens of y, as defined in Equation 2:

L(x, y, z) = max(0, ξ − (F(x) − F(y))) + (1/|y|) Σ_t L_CE(z_t, P_{y_t})    (2)
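The multi-task objective described above (sentence-level margin loss plus the token-level cross-entropy averaged over the tokens of y) can be sketched with scalar probabilities; all names are ours, and `p_eq` stands in for the per-token softmax probability of the EQ class:

```python
import math

def token_ce(p_eq, labels):
    """Average cross-entropy over tokens; p_eq[t] is the predicted
    probability of EQ, and labels[t] is the gold 'EQ'/'DIV' tag."""
    losses = [-math.log(p if z == "EQ" else 1.0 - p)
              for p, z in zip(p_eq, labels)]
    return sum(losses) / len(losses)

def multitask_loss(score_x, score_y, p_eq, labels, xi=5.0):
    """Sum of the sentence-level margin loss and the averaged
    token-level cross-entropy, as in Equation 2."""
    margin = max(0.0, xi - (score_x - score_y))
    return margin + token_ce(p_eq, labels)
```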
Similar multi-task models have been used for Machine Translation Quality Estimation (Kim et al., 2019a,b), albeit with human-annotated training samples and a standard cross-entropy loss for both word-level and sentence-level sub-tasks.

Rationalized English-French Semantic Divergences
We introduce the Rationalized English-French Semantic Divergences (REFRESD) dataset, which consists of 1,039 English-French sentence-pairs annotated with sentence-level divergence judgments and token-level rationales. Figure 2 shows an example drawn from our corpus. Our annotation protocol is designed to encourage annotators' sensitivity to semantic divergences other than misalignments, without requiring expert knowledge beyond competence in the languages of interest. We use two strategies for this purpose: (1) we explicitly introduce distinct divergence categories for unrelated sentences and sentences that overlap in meaning; and (2) we ask for annotation rationales (Zaidan et al., 2007) by requiring annotators to highlight tokens indicative of meaning differences in each sentence-pair. Thus, our approach strikes a balance between coarsely annotating sentences with binary distinctions that are fully based on annotators' intuitions (Vyas et al., 2018), and exhaustively annotating all spans of a sentence-pair with fine-grained labels of translation processes (Zhai et al., 2018). We describe the annotation process and analysis of the collected instances based on the data statement protocols described by Bender and Friedman (2018) and Gebru et al. (2018). We include more information in Appendix A.4.
Task Description An annotation instance consists of an English-French sentence-pair. Bilingual participants are asked to read both sentences and highlight tokens in each sentence that convey meaning not found in the other language. For each highlighted span, they pick whether this span conveys added information ("Added"), information that is present in the other language but not an exact match ("Changed"), or some other type ("Other"). These fine-grained classes are added to improve consistency across annotators and encourage them to read and compare the text closely. Finally, participants are asked to make a sentence-level judgment by selecting one of the following classes: "No meaning difference", "Some meaning difference", "Unrelated". Participants are not given specific instructions on how to use span annotations to make sentence-level decisions. Furthermore, participants have the option of using a text box to provide any comments or feedback on the example and their decisions. A summary of the different span and sentence labels along with the instructions given to participants can be found in Appendix A.3.

Figure 1: Divergent mBERT training strategy: given a triplet (x, y, z), the model minimizes the sum of a margin-based loss via ranking a contrastive pair x > y and a token-level cross-entropy loss on sequence labels z.
Curation rationale Examples are drawn from the English-French section of the WikiMatrix corpus (Schwenk et al., 2019). We choose this resource because (1) it is likely to contain diverse, interesting divergence types, since it consists of mined parallel sentences on diverse topics that are not necessarily generated by (human) translation, and (2) Wikipedia and WikiMatrix are widely used resources to train semantic representations and perform cross-lingual transfer in NLP. We exclude obviously noisy samples by filtering out sentence-pairs that a) are too short or too long, b) consist mostly of numbers, or c) have a small token-level edit difference. The filtered version of the corpus consists of 2,437,108 sentence-pairs.

Quality Control We implement quality control strategies at every step. We build a dedicated task interface using the BRAT annotation toolkit (Stenetorp et al., 2012) (Figure 2). We recruit participants from an educational institution and ensure they are proficient in both languages of interest. Specifically, participants are either bilingual speakers or graduate students pursuing a Translation Studies degree. We run a pilot study where participants annotate a sample containing both duplicated and reference sentence-pairs previously annotated by one of the authors. All annotators are found to be internally consistent on duplicated instances and agree with the reference annotations more than 60% of the time. We solicit feedback from participants to finalize the instructions.
Inter-annotator Agreement (IAA) We compute IAA for sentence-level annotations, as well as for the token- and span-level rationales (Table 2). We report a Krippendorff's α coefficient of 0.60 for sentence classes, which indicates "moderate" agreement between annotators (Landis and Koch, 1977). This constitutes a significant improvement over the 0.41 and 0.49 agreement coefficients reported by prior work (Vyas et al., 2018) on crowdsourced equivalence vs. divergence annotations of English-French parallel sentences drawn from the OpenSubtitles and CommonCrawl corpora.
Disagreements mainly occur between the "No meaning difference" and "Some meaning difference" classes, which we expect, as annotators may draw the line differently on which differences matter. We only observed 3 examples where all 3 annotators disagreed (tridisagreements), which indicates that the "Unrelated" and "No meaning difference" categories are more clear-cut. The rare instances with tridisagreements, and with bidisagreements where the disagreement spans the two extreme classes, were excluded from the final dataset. Examples of REFRESD corresponding to different levels of IAA are included in Appendix A.5.
Quantifying agreement between rationales requires different metrics. At the span-level, we compute macro F1 score for each sentence-pair following DeYoung et al. (2020), where we treat one set of annotations as the reference standard and the other set as predictions. We count a prediction as a match if its token-level Intersection Over Union (IOU) with any of the reference spans overlaps by more than some threshold (here, 0.5). We report average span-level and token-level macro F1 scores, computed across all different pairs of annotators. Average statistics indicate that our annotation protocol enabled the collection of a high-quality dataset.
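The span-matching criterion described above can be sketched as follows, with spans as (start, end) token offsets (end exclusive); the helper names are ours, and this computes the F1 for a single sentence-pair given one annotator as reference and another as prediction:

```python
def iou(a, b):
    """Token-level Intersection Over Union of two (start, end) spans."""
    sa, sb = set(range(*a)), set(range(*b))
    return len(sa & sb) / len(sa | sb)

def span_f1(pred, ref, threshold=0.5):
    """A predicted span counts as a match if its IOU with any
    reference span exceeds the threshold."""
    tp = sum(any(iou(p, r) > threshold for r in ref) for p in pred)
    precision = tp / len(pred) if pred else 0.0
    matched_refs = sum(any(iou(r, p) > threshold for p in pred) for r in ref)
    recall = matched_refs / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In the paper's setting, this score would be macro-averaged over sentence-pairs and over all pairs of annotators.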
Dataset Statistics Sentence-level annotations were aggregated by majority vote, yielding 252, 418, and 369 instances for the "Unrelated", "Some meaning difference", and "No meaning difference" classes, respectively. In other words, 64% of samples are divergent and 40% of samples contain fine-grained meaning divergences, confirming that divergences vary in granularity and are too frequent to be ignored even in a corpus viewed as parallel.

Experimental Setup
Data We normalize English and French text in WikiMatrix consistently using the Moses toolkit (Koehn et al., 2007), and tokenize into subword units using the "BertTokenizer". Specifically, our pre-processing pipeline consists of a) replacement of Unicode punctuation, b) normalization of punctuation, c) removal of non-printing characters, and d) tokenization.2 We align English to French bitext using the Berkeley word aligner.3 We filter out obviously noisy parallel sentences, as described in Section 4, Curation Rationale. The top 5,500 samples ranked by LASER similarity score are treated as (noisy) equivalent samples and seed the generation of synthetic divergent examples.4 We split the seed set into 5,000 training instances and 500 development instances, consistently across experiments. Results on development sets for each experiment are included in Appendix A.7.
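The noisy-pair filters (length bounds, mostly-numeric sentences, near-copy pairs with a small token-level edit difference) can be sketched as below. The concrete thresholds and the symmetric-set-difference proxy for edit difference are our own illustrative choices, not the values used to build REFRESD:

```python
def keep_pair(en, fr, min_len=5, max_len=80, digit_ratio=0.5, min_edit=3):
    """Return True if the tokenized pair (en, fr) passes the filters."""
    for toks in (en, fr):
        if not (min_len <= len(toks) <= max_len):
            return False                  # too short or too long
        if sum(t.isdigit() for t in toks) / len(toks) > digit_ratio:
            return False                  # consists mostly of numbers
    # near-copies (e.g., untranslated text on both sides) have a small
    # token-level edit difference; symmetric set difference as a proxy
    if len(set(en) ^ set(fr)) < min_edit:
        return False
    return True
```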
Models Our models are based on the HuggingFace Transformers library (Wolf et al., 2019).5 We fine-tune the "BERT-Base Multilingual Cased" model (Devlin et al., 2019),6 and perform a grid search on the margin hyperparameter using the synthetic development set. Further details on model and training settings can be found in Appendix A.1.
Evaluation We evaluate all models on our new REFRESD dataset using Precision, Recall, F1 for each class, and Weighted overall F1 score as computed by scikit-learn (Pedregosa et al., 2011). 7

Binary Divergence Detection
We evaluate Divergent mBERT's ability to detect divergent sentence pairs in REFRESD.

Experimental Conditions
LASER baseline This baseline distinguishes equivalent from divergent samples via a threshold on the LASER score; we use the same threshold as Schwenk et al. (2019).

To test the impact of contrastive training samples, we fine-tune Divergent mBERT using 1. the Cross-Entropy (CE) loss on randomly selected synthetic divergences; 2. the CE loss on paired equivalent and divergent samples, treated as independent; 3. the proposed training strategy with a Margin loss to explicitly compare contrastive pairs.
Given the fixed set of seed equivalent samples (Section 5, Data), we vary the combinations of divergent samples:

1. Single divergence type: we pair each seed equivalent with its corresponding divergent sample of that type, yielding a single contrastive pair;
2. Balanced sampling: we randomly pair each seed equivalent with one of its corresponding divergent types, yielding a single contrastive pair;
3. Concatenation: we pair each seed equivalent with one of each synthetic divergence type, yielding four contrastive pairs;
4. Divergence ranking: we learn to rank pairs of close divergence types (equivalent vs. lexical substitution, lexical substitution vs. phrase replacement, phrase replacement vs. subtree deletion), yielding four contrastive pairs.8
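The Divergence Ranking condition can be sketched by pairing adjacent granularity levels so the model learns equivalent > lexical substitution > phrase replacement > subtree deletion. This is one plausible pairing consistent with the description above; the exact set of contrastive pairs built per seed may differ, and all names are ours:

```python
# Granularity levels from finest (highest-scoring) to coarsest.
RANKING = ["equivalent", "lexical_substitution",
           "phrase_replacement", "subtree_deletion"]

def ranking_pairs(samples):
    """samples maps each granularity level to its sentence-pair;
    returns (higher, lower) pairs of adjacent levels for the
    margin ranking loss."""
    return [(samples[hi], samples[lo])
            for hi, lo in zip(RANKING, RANKING[1:])]
```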

Results
All Divergent mBERT models outperform the LASER baseline by a large margin (Table 3). The proposed training strategy performs best, improving over LASER by 31 F1 points. Ablation experiments and analysis further show the benefits of diverse contrastive samples and learning to rank.

Table 3: Intrinsic evaluation of Divergent mBERT and its ablation variants on the REFRESD dataset. We report Precision (P), Recall (R), and F1 for the equivalent (+) and divergent (−) classes separately, as well as for both classes (All). Divergence Ranking yields the best F1 scores across the board.

Contrastive Samples With the CE loss, independent contrastive samples improve over randomly sampled synthetic instances overall (+8.7 F1+ points on average), at the cost of a smaller drop for the divergent class (−5.3 F1− points) for models trained on a single type of divergence. Using the margin loss helps models recover from this drop.
Divergence Types All types improve over the LASER baseline. When using a single divergence type, Subtree Deletion performs best, even matching the overall F1 score of a system trained on all types of divergences (Balanced Sampling). Training on the Concatenation of all divergence types yields poor performance. We suspect that the model is overwhelmed by negative instances at training time, which biases it toward predicting the divergent class too often and hurts the F1+ score for the equivalent class.
Divergence Ranking How does divergence ranking improve predictions? Figure 3 shows model score distributions for the 3 classes annotated in REFRESD. Divergence Ranking particularly improves divergence predictions for the "Some meaning difference" class: the score distribution for this class is more skewed toward negative values than when training on contrastive Subtree Deletion samples.

Table 4: Evaluation of different models on the token-level prediction task for the "Some meaning difference" class of REFRESD. Divergence Ranking yields the best results across the board.

Finer-Grained Divergence Detection
While we cast divergence detection as binary classification in Section 6, human judges separated divergent samples into "Unrelated" and "Some meaning difference" classes in the REFRESD dataset. Can we predict this distinction automatically? In the absence of annotated training data, we cannot cast this problem as a 3-way classification, since it is not clear how the synthetic divergence types map to the 3 classes of interest. Instead, we test the hypothesis that token-level divergence predictions can help discriminate between divergence granularities at the sentence-level, inspired by humans' use of rationales to ground sentence-level judgments.

Experimental Conditions
Models We fine-tune the multi-task mBERT model that makes token and sentence predictions jointly, as described in Section 3.4. We contrast against a sequence labeling mBERT model trained independently with the CE loss (Token-only). Finally, we run a random baseline where each token is labeled EQ or DIV uniformly at random.
Training Data We tag tokens edited when generating synthetic divergences as DIV (e.g., highlighted tokens in Table 1), and others as EQ. Since edit operations are made on the English side, we tag aligned French tokens using the Berkeley aligner.

Evaluation
We expect token-level annotations in REFRESD to be noisy since they are produced as rationales for sentence-level rather than token-level tags. We therefore consider three methods to aggregate rationales into token labels: a token is labeled DIV if it is highlighted by at least one (Union), two (Pair-wise Union), or all three annotators (Intersection). We report F1 on the DIV and EQ classes, and F1-Mul as their product, for each of the three label aggregation methods.
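The three aggregation methods reduce to a single vote-count threshold k over the annotators' highlights: k = 1 (Union), k = 2 (Pair-wise Union), k = 3 (Intersection). A sketch, with each annotator's rationale given as a set of highlighted token indices (names are ours):

```python
from collections import Counter

def aggregate(highlight_sets, n_tokens, k):
    """Label token i as DIV if at least k annotators highlighted it."""
    counts = Counter(t for s in highlight_sets for t in s)
    return ["DIV" if counts[i] >= k else "EQ" for i in range(n_tokens)]
```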

Results
Token Labeling We evaluate token labeling on REFRESD samples from the "Some meaning difference" class, where we expect the more subtle differences in meaning to be found, and the token-level annotation to be most challenging (Table 4). Examples of Divergent mBERT's token-level predictions are given in Appendix A.6. The Token-only model outperforms the Random Baseline across all metrics, showing the benefits of training even with noisy token labels derived from rationales. Multi-task training further improves over Token-only predictions on almost all metrics. Divergence Ranking of contrastive instances yields the best results across the board. Also, on the auxiliary sentence-level task, the Multi-task model matches the F1 of the standalone Divergence Ranking model.

From Token to Sentence Predictions
We compute the % of DIV predictions within a sentence-pair. The multi-task model makes more DIV predictions for the divergent classes, as its % distribution is more skewed toward greater values (Figure 4 (d) vs. (e)). We then show that the % of DIV predictions of the Divergence Ranking model can be used as an indicator for distinguishing between divergences of different granularity: intuitively, a sentence-pair with more DIV tokens should map to a coarser-grained divergence at the sentence level. Table 5 shows that thresholding the % of DIV tokens could be an effective discrimination strategy, which we will explore further in future work.
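The thresholding idea above can be sketched in a few lines; the 0.5 cutoff and the function name are our own illustrative choices, not a value tuned on REFRESD:

```python
def divergence_granularity(token_tags, threshold=0.5):
    """Map the share of DIV token predictions in a sentence-pair to a
    coarse ('unrelated') vs. fine-grained ('some meaning difference')
    sentence-level divergence."""
    ratio = token_tags.count("DIV") / len(token_tags)
    return "unrelated" if ratio > threshold else "some meaning difference"
```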

Conclusion
We show that explicitly considering diverse semantic divergence types benefits both the annotation and prediction of divergences between texts in different languages. We contribute REFRESD, a new dataset of WikiMatrix sentence-pairs in English and French, annotated with semantic divergence classes and token-level rationales that justify the sentence-level annotation. 64% of samples are annotated as divergent, and 40% of samples contain fine-grained meaning divergences, confirming that divergences are too frequent to ignore even in parallel corpora. We show that these divergences can be detected by an mBERT model fine-tuned without annotated samples, by learning to rank synthetic divergences of varying granularity.
Inspired by the rationale-based annotation process, we show that predicting token-level and sentence-level divergences jointly is a promising direction for further distinguishing between coarser and finer-grained divergences.

A Appendices
A.1 Implementation Details Training setup We employ the Adam optimizer with initial learning rate η = 2e−5, fine-tune for at most 5 epochs, and use early-stopping to select the best model. We use a batch size of 32 for experiments that do not use contrastive training and a batch size of 16 for those using contrastive training to establish a fair comparison.
Model setup All of our models are based on the "Multilingual BERT-base" model, consisting of 12 layers, hidden size 768, 12 attention heads, and 110M parameters.

Average Runtime & Computing Infrastructure
Each experiment is run on a single GeForce GTX 1080 GPU. For experiments run on either a single type of divergence (e.g., Subtree Deletion) or using Balanced sampling, the average training time is ∼0.4 hours. For the Divergence Ranking and Concatenation sampling methods, training takes ∼2 hours to complete.
Hyperparameter search on margin We perform a grid search on the margin parameter for each experiment that employs contrastive training. We experiment with values {3, 4, 5, 6, 7, 8} and pick the one corresponding to the best Weighted-F1 score on a synthetic development set. Table 6 shows mean and variance results on both the development and the REFRESD dataset for different ξ values. In general, we observe that our model's performance on REFRESD is not sensitive to the choice of margin, as reflected by the small variances in the REFRESD Weighted-F1.

A.2 Very Deep Pair-Wise Interaction baseline
We compare against the Very Deep Pair-Wise Interaction (VDPWI) model repurposed by Vyas et al. (2018) to identify cross-lingual semantic divergence vs. equivalence. We fine-tune mBERT models on coarsely-defined synthetic semantic divergent pairs, similarly to the authors. We report results on two crowdsourced datasets, consisting of equivalence vs. divergence labels for 300 sentence-pairs drawn from the noisy OpenSubtitles and CommonCrawl corpora. The two evaluation datasets are available at: https://github.com/yogarshi/SemDiverge/tree/master/dataset.

Table 7: Performance comparison between mBERT and VDPWI trained on coarsely-generated semantic divergences. We report overall F1 results (F1) and F1+/F1− scores for the two classes, on the crowdsourced OpenSubtitles and CommonCrawl datasets.

Table 7 presents results on the OpenSubtitles and CommonCrawl testbeds. We observe that mBERT trained on similarly defined coarse divergences performs better than cross-lingual VDPWI.

A.3 REFRESD: Annotation Guidelines
Below we include the annotation guidelines given to participants: "You are asked to compare the meaning of English and French text excerpts. You will be presented with one pair of texts at a time (about a sentence in English and a sentence in French). For each pair, you are asked to do the following:

1. Read the two sentences carefully. Since the sentences are provided out of context, your understanding of content should only rely on the information available in the sentences. There is no need to guess what additional information might be available in the documents the excerpts come from.

2. Highlight the text spans that convey different meaning in the two sentences. After highlighting a span of text, you will be asked to further characterize it as: ADDED (the highlighted span corresponds to a piece of information that does not exist in the other sentence); CHANGED (the highlighted span corresponds to a piece of information that exists in the other sentence, but their meaning is not the exact same); OTHER (none of the above holds). You can highlight as many spans as needed. You can optionally provide an explanation for your assessment in the text form under the Notes section (e.g., literal translation of idiom).

3. Compare the meaning of the two sentences by picking one of the three classes: UNRELATED (the two sentences are completely unrelated, or have a few words in common but convey unrelated information); SOME MEANING DIFFERENCE (the two sentences convey mostly the same information, except for differences in some details or nuances, e.g., some information is added and/or missing on either or both sides; some English words have a more general or specific translation in French); NO MEANING DIFFERENCE (the two sentences have the exact same meaning)."

A.4 Annotation Procedures
We run 8 online annotation sessions. Each session consists of 120 instances, annotated by 3 participants, and lasts about 2 hours. Participants are allowed to take breaks during the process. Participants are rewarded with Amazon gift cards at a rate of $2 per 10 examples, with bonuses of $5 and $10 for completing the first and additional sessions, respectively. Table 8 includes examples of annotated instances drawn from REFRESD, corresponding to different levels of inter-annotator agreement.

A.5 Annotated examples in REFRESD
A.6 Token predictions of Divergent mBERT
Table 9 shows randomly selected instances from REFRESD along with token tags predicted by our best-performing system (Divergence Ranking).
A.7 Results on synthetic development sets
Tables 10 and 11 report results on the development sets for each experiment included in Tables 3 and 4, respectively.
No meaning difference with high sentence-level agreement and high span overlap (n=3)
EN The plan was revised in 1916 to concentrate the main US naval fleet in New England, and from there defend the US from the German navy.

Some meaning difference with high sentence-level agreement and high span overlap (n=3)
EN After an intermediate period during which Stefano Piani edited the stories, in 2004 a major rework of the series went through.
FR Après une période intermédiaire pendant laquelle Stefano Piani édita les histoires, une refonte majeure de la série fut faite en 2004 en réponse à une baisse notable des ventes.

Unrelated with high sentence-level agreement and high span overlap (n=3)
EN To reduce vibration, all helicopters have rotor adjustments for height and weight.
FR En vol, le régime du compresseur Tous les compresseurs ont un taux de compression lié à la vitesse de rotation et au nombre d'étages.

No meaning difference with high sentence-level agreement and high span overlap (n=3)
EN One can see two sunflowers on the main façade and three smaller ones on the first floor above ground just above the entrance arcade.

Some meaning difference with high sentence-level agreement and low span overlap (n=3)
EN On November 10, 2014, CTV ordered a fourth season of Saving Hope that consisted of eighteen episodes, and premiered on September 24.
FR Le 10 novembre 2014, CTV a renouvelé la série pour une quatrième saison de 18 épisodes diffusée depuis le 24 septembre 2015.

Unrelated with high sentence-level agreement and low span overlap (n=3)
EN He talks about Jay Gatsby, the most hopeful man he had ever met.
FR Il côtoie notamment Giuseppe Meazza qui dira de lui Il fut le joueur le plus fantastique que j'aie eu l'occasion de voir.

No meaning difference with moderate sentence-level agreement (n=2)
EN Nine of these revised BB LMs were built by Ferrari in 1979, while a further refined series of sixteen were built from 1980 to 1982.

Some meaning difference with moderate sentence-level agreement (n=2)
EN From 1479, the Counts of Foix became Kings of Navarre and the last of them, made Henri IV of France, annexed his Pyrrenean lands to France.

Unrelated with moderate sentence-level agreement (n=2)
EN The operating principle was the same as that used in the Model 07/12 Schwarzlose machine gun used by Austria-Hungary during the First World War.
EN He experimented with silk vests resembling medieval gambesons, which used 18 to 30 layers of silk fabric to protect the wearers from penetration.

EN Even though this made Armenia a client kingdom, various contemporary Roman sources thought that Nero had de facto ceded Armenia to the Parthian Empire.

EN The Photo League was a cooperative of photographers in New York who banded together around a range of common social and creative causes.

EN She made a courtesy call to the Hawaiian Islands at the end of the year and proceeded thence to Puget Sound where she arrived on 2 February 1852.

EN Recognizing Nishikaichi and his plane as Japanese, Kaleohano thought it prudent to relieve the pilot of his pistol and papers before the dazed airman could react.

EN At the same time, the mortality rate increased slightly from 8.9 per 1,000 inhabitants in 1981 to 9.6 per 1,000 inhabitants in 2003.

EN They called for a state convention on September 17 in Columbia to nominate a statewide ticket.

EN His plants are still in the apartment and the two take all of the plants with them back to their place.
FR Il reste donc chez lui et les deux soeurs s'occupent du show toutes seules.
GLOSS So he stays at home and the two sisters handle the show all by themselves.

Table 9: REFRESD examples, along with Divergent mBERT's predictions. In the original table, each sentence appears twice: once with gold-standard divergence labels from the annotators highlighted in red (darker red denoting higher agreement across the three annotators), and once with the DIV tokens predicted by Divergent mBERT highlighted in green.