Fully Unsupervised Crosslingual Semantic Textual Similarity Metric Based on BERT for Identifying Parallel Data

We present a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT – Bidirectional Encoder Representations from Transformers (Devlin et al., 2019). The goal of crosslingual STS is to measure to what degree two segments of text in different languages express the same meaning. Not only is it a key task in crosslingual natural language understanding (XLU), it is also particularly useful for identifying parallel resources for training and evaluating downstream multilingual natural language processing (NLP) applications, such as machine translation. Most previous crosslingual STS methods relied heavily on existing parallel resources, thus leading to a circular dependency problem. With the advent of massively multilingual context representation models such as BERT, which are trained on the concatenation of non-parallel data from each language, we show that the deadlock around parallel resources can be broken. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised crosslingual STS metric using BERT without fine-tuning achieves performance on par with supervised or weakly supervised approaches.


Introduction
Crosslingual semantic textual similarity (STS) (Agirre et al., 2016a; Cer et al., 2017) aims at measuring the degree of meaning overlap between two texts written in different languages. It is a key task in crosslingual natural language understanding (XLU), with applications in crosslingual information retrieval (Franco-Salvador et al., 2014; Vulić and Moens, 2015), crosslingual plagiarism detection (Franco-Salvador et al., 2016a,b), etc. It is also particularly useful for identifying parallel resources (Resnik and Smith, 2003; Aziz and Specia, 2011) for training and evaluating downstream multilingual NLP applications, such as machine translation systems.
Unlike in crosslingual textual entailment (Negri et al., 2013) or crosslingual natural language inference (XNLI) (Conneau et al., 2018), which are directional classification tasks, in crosslingual STS, continuous values are produced, to reflect a range of similarity that goes from complete semantic unrelatedness to complete semantic equivalence. Machine translation quality estimation (MTQE) (Specia et al., 2018) is perhaps the field of work that is the most related to crosslingual STS: in MTQE, one tries to estimate translation quality, by comparing an original source-language text with its machine translation. In contrast, in crosslingual STS, neither the direction nor the origin (human or machine) of the translation is taken into account. Furthermore, MTQE also typically considers the fluency and grammaticality of the target text; these aspects are usually not perceived as relevant for crosslingual STS.
Many previous crosslingual STS methods rely heavily on existing parallel resources to first build a machine translation (MT) system and translate one of the test sentences into the other language for applying monolingual STS methods (Brychcín and Svoboda, 2016). Methods that do not rely explicitly on MT, such as that in Lo et al. (2018), still require parallel resources to build bilingual word representations for evaluating crosslingual lexical semantic similarity. It is clear that there is a circular dependency problem on parallel resources.
Massively multilingual context representation models, such as MUSE (Conneau et al., 2017), BERT (Devlin et al., 2019), and XLM (Lample and Conneau, 2019), that are trained in an unsupervised manner with non-parallel data from each language, have shown improved performance in XNLI classification tasks using task-specific finetuning.
In this paper, we propose a crosslingual STS metric based on fully unsupervised contextual embeddings extracted from BERT without finetuning. In an intrinsic crosslingual STS evaluation and extrinsic parallel corpus filtering and human translation error detection tasks, we show that our BERT-based metric achieves performance on par with similar metrics based on supervised or weakly supervised approaches. With the availability of the multilingual context representation models, we show that the deadlock around parallel resources for crosslingual textual similarity can be broken.

Crosslingual STS metric
Our crosslingual STS metric is based on YiSi (Lo, 2019). YiSi is a unified adequacy-oriented MT quality evaluation and estimation metric for languages with different levels of available resources. Lo et al. (2018) showed that YiSi-2, the crosslingual MT quality estimation metric, performed almost as well as the "MT + monolingual MT evaluation metric (YiSi-1)" pipeline for identifying parallel sentence pairs from a noisy web-crawled corpus in the Parallel Corpus Filtering task of WMT 2018 (Koehn et al., 2018b).
To measure semantic similarity between pairs of segments, YiSi-2 proceeds by finding alignments between the words of these segments that maximize semantic similarity at the lexical level. For evaluating crosslingual lexical semantic similarity, it relies on a crosslingual embedding model, using cosine similarity of the embeddings from the crosslingual lexical representation model. Following the approach of Corley and Mihalcea (2005), these lexical semantic similarities are weighted by lexical specificity using inverse document frequency (IDF) collected from each side of the tested corpus.
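The alignment step described above can be illustrated with a minimal sketch. It assumes each lexical unit is already represented by an embedding vector and an IDF weight; the function names `cosine` and `weighted_alignment_score` are hypothetical and are not taken from the YiSi implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def weighted_alignment_score(vecs_e, vecs_f, weights_e):
    """IDF-weighted average, over the units of e, of the best cosine match in f.

    vecs_e, vecs_f: lists of embedding vectors, one per lexical unit.
    weights_e: IDF weights for the units of e (same order as vecs_e).
    """
    total = 0.0
    for v_e, w in zip(vecs_e, weights_e):
        # Align each unit of e to the unit of f that maximizes similarity.
        best = max(cosine(v_e, v_f) for v_f in vecs_f)
        total += w * best
    return total / sum(weights_e)
```

Calling the function a second time with the roles of the two segments swapped gives the score in the other direction.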
As an MTQE metric, YiSi-2 also takes into account the fluency and grammaticality of each side of the sentence pairs, using bag-of-ngrams and the semantic parses of the tested sentence pairs. But since crosslingual STS focuses primarily on measuring the meaning similarity between the tested sentence pairs, here we set the size of ngrams to 1 and opt not to use semantic parses in YiSi-2. In addition, rather than compute IDF weights w(e) and w(f) for lexical units e and f in each language directly on the texts under consideration, we rely on precomputed weights from monolingual corpora E and F of the two tested languages.
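As an illustration of precomputing such weights, the sketch below derives IDF values from a tokenized monolingual corpus, treating each sentence as a document. The smoothed logarithmic formula and the name `idf_weights` are assumptions made for this example; the exact IDF variant is not specified in this section.

```python
import math
from collections import Counter

def idf_weights(corpus):
    """Map each lexical unit to a smoothed IDF weight.

    corpus: list of tokenized sentences (each a list of strings).
    Assumption: IDF is log(1 + (N + 1) / (df + 1)), where N is the number
    of sentences and df is the number of sentences containing the unit.
    """
    n_sents = len(corpus)
    doc_freq = Counter()
    for sent in corpus:
        doc_freq.update(set(sent))  # count each unit once per sentence
    return {unit: math.log(1 + (n_sents + 1) / (df + 1))
            for unit, df in doc_freq.items()}
```

Units that occur in few sentences receive larger weights, so rare, content-bearing words contribute more to the similarity than frequent function words.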
The YiSi metrics are formulated as an F-score: by viewing the source text as a "query" and the target as an "answer", precision and recall can be computed. Depending on the intended application, precision and recall can be weighted differently. For example, in MT evaluation applications, we typically assign more weight to recall ("every word in the source should find an equivalent in the target"). For this application, we give equal weights to precision and recall.
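The F-score formulation described above can be written out as in the following sketch. It is reconstructed from the description in this section and is not a verbatim definition: the symbols P and R, the bold sentence symbols, and the assignment of the two directions to precision and recall are notational assumptions. With equal weights the score is symmetric, so that assignment does not affect the result.

```latex
\begin{align*}
P &= \frac{\sum_{e \in \mathbf{e}} w(e)\, \max_{f \in \mathbf{f}} s(e, f)}
          {\sum_{e \in \mathbf{e}} w(e)} \\[4pt]
R &= \frac{\sum_{f \in \mathbf{f}} w(f)\, \max_{e \in \mathbf{e}} s(e, f)}
          {\sum_{f \in \mathbf{f}} w(f)} \\[4pt]
\text{YiSi-2}(\mathbf{e}, \mathbf{f}) &= \frac{2\, P\, R}{P + R}
\end{align*}
```

Here w(e) and w(f) are the IDF weights of the lexical units, and s(e, f) is their crosslingual lexical similarity.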
Thus, in this work, the crosslingual STS of sentences e and f using YiSi-2 is the equally weighted F-score of IDF-weighted precision and recall over their aligned lexical units, where the lexical similarity s(e, f) is the cosine similarity of the vector representations v(e) and v(f) in the bilingual embeddings model. In the following, we present the approaches we experimented with to obtain the crosslingual embedding space in supervised, weakly supervised and unsupervised manners.

Supervised crosslingual word embeddings with BiSkip
Luong et al. (2015) proposed BiSkip (with open source implementation bivec) to jointly learn bilingual representations from the context cooccurrence information in the monolingual data and the meaning equivalence signals in the parallel data. It trains bilingual word embeddings with the objective of preserving the clustering structures of words in each language. We train our crosslingual word embeddings using bivec on the parallel resources as described in each experiment.

Weakly supervised crosslingual word embeddings with vecmap
Artetxe et al. (2016) generalized a framework to learn the linear transformation between two monolingual word embedding spaces by minimizing the distances between equivalences listed in a collection of bilingual lexicons (with open source implementation vecmap). We train our monolingual word embeddings using word2vec (Mikolov et al., 2013) on the monolingual resources and then learn the linear transformation between the two monolingual embedding spaces using vecmap on the dictionary entries as described in each experiment.

Unsupervised crosslingual contextual embeddings with multilingual BERT
The two embedding models described above produce static word embeddings that capture the semantic space representing the training data. The shortcoming of these static embedding models is that they provide the same embedding for a given word regardless of the context in which it is used in different sentences. In contrast, BERT (Devlin et al., 2019) uses a bidirectional transformer encoder (Vaswani et al., 2017) to capture the sentence context in the output embeddings, such that the embedding for the same word unit differs from one sentence to another and is better represented in the embedding space. The multilingual BERT model is trained on the Wikipedia pages of 104 languages with a shared subword vocabulary. Pires et al. (2019) showed that multilingual BERT works well on different monolingual NLP tasks across different languages. Following the recommendation in Devlin et al. (2019), we use embeddings extracted from the ninth layer of the pretrained multilingual cased BERT-Base model to represent the subword units of the two sentences being assessed for crosslingual lexical semantic similarity.

Experiment on crosslingual STS
We first evaluate the performance of YiSi-2 on the intrinsic crosslingual STS task, before testing its ability on the downstream task of identifying parallel data.

Setup
We use data from the SemEval-2016 Semantic Textual Similarity (STS) evaluation's crosslingual track (Task 1) (Agirre et al., 2016b), in which the goal was to estimate the degree of equivalence between pairs of Spanish-English bilingual fragments of text. The test data is partitioned into two evaluation sets: the News data set has 301 pairs, manually harvested from comparable Spanish and English news sources; the Multi-source data set consists of 294 pairs, sampled from English pairs of snippets used in the SemEval-2016 monolingual STS task, translated into Spanish.
We apply YiSi-2 directly to these pairs of text fragments, using bilingual word embeddings (BWE's) obtained under three different conditions: bivec, vecmap and BERT (details of the training sets are given in Table 1). In the BERT condition, BWE's are obtained from pre-trained multilingual BERT models. We compare the YiSi-2 approach to direct cosine computations on sums of bilingual word embeddings (bivec sum, vecmap sum and bert sum). We also compare our approach to an MT-based approach, in which each Spanish fragment is first machine-translated into English, then compared to the original English fragment, using English word embeddings produced with word2vec trained on the WMT 2019 news translation task monolingual data. Similarity is measured either as the cosine of the sums of word vectors from each fragment (w2v sum), or with YiSi using monolingual embeddings as if they were bilingual (YiSi-1 w2v).
The MT system used is a phrase-based SMT system, trained using standard resources (Europarl, Common Crawl (CC) and News & Commentary (NC)) totaling approximately 110M words in each language. We bias the SMT decoder to produce a translation that is as close as possible on the surface to the English sentence. This is done by means of log-linear model features that aim at maximizing n-gram precision between the MT output and the English sentence. More details on this method can be found in Lo et al. (2016).
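The sentence-level baselines in this setup (the "sum" systems) compare two fragments through the cosine of their summed word vectors. A minimal sketch follows, assuming each fragment is given as a list of word vectors of equal dimension; the function names are hypothetical.

```python
import math

def sum_vectors(vectors):
    """Component-wise sum of a list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sum_cosine_similarity(vecs_e, vecs_f):
    """Cosine of the summed word vectors of two text fragments."""
    return cosine(sum_vectors(vecs_e), sum_vectors(vecs_f))
```

Unlike the alignment-based YiSi score, this baseline collapses each fragment into a single vector before comparison, so word-level correspondences and IDF weights play no role.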

Results
The results of these experiments are presented in Table 2, where performance is measured in terms of Pearson's correlation with the test sets' gold standard annotations. For reference, we also include results obtained by the UWB system (Brychcín and Svoboda, 2016), which was the best performing system in the SemEval 2016 crosslingual STS shared task. The UWB system is an MT-based system with an STS system trained on assorted lexical, syntactic and semantic features. Globally, using the YiSi metric to measure semantic similarity performs much better than sentence-level cosine ("* sum" systems). On the News dataset, the best results are obtained by combining an MT-based approach with YiSi-1 using monolingual word embeddings (MT+YiSi-1 w2v), reflecting the in-domain nature of the text for the MT system. However, this is followed very closely by both the supervised BWE's (YiSi-2 bivec) and BERT (YiSi-2 bert), which yield very similar results and clearly outperform weakly supervised BWE's (YiSi-2 vecmap). The nature of the Multi-source translations appears to be quite different from what supervised BWE's and the MT system have been exposed to in training (YiSi-2 bivec and MT+YiSi-1 w2v), which possibly explains their much poorer performance on this dataset. In contrast, weakly supervised BWE's and BERT behave much more reliably on this data.
Overall, while MT and supervised BWE's seem to work best with YiSi when large quantities of in-domain training data are available, the fully unsupervised alternative of using a pretrained BERT model comes very close, and behaves much better in the face of out-of-domain data.
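The evaluation measure used in this section, Pearson's correlation between metric scores and gold standard annotations, can be computed as in the following sketch. The function name is illustrative, and the code assumes that neither list is constant.

```python
import math

def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances, left unnormalized: the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

Because the coefficient is invariant to linear rescaling, metric scores do not need to be mapped onto the gold annotation scale before evaluation.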

Experiment on Parallel Corpus Filtering
Next, we evaluate YiSi on the task of Parallel Corpus Filtering (PCF). The quality, or "cleanliness", of parallel training data for MT has been shown to affect MT quality to different degrees, and various characteristics of the data (parallelism of the sentence pairs and the grammaticality of target-language data) impact MT systems in different ways (Goutte et al., 2012; Simard, 2014; Khayrallah and Koehn, 2018).
Here, we use data from the WMT19 shared task on PCF. In this shared task, participants were challenged to find good quality translations from noisy sentence-aligned parallel corpora, for the purpose of training MT systems for translating from two low-resource languages, Nepali and Sinhala, into English. Both corpora were crawled from the web, using ParaCrawl (Koehn et al., 2018a). Specifically, the task is to produce a score for each sentence pair in these noisy corpora, reflecting the quality of that pair. The scoring schemes are evaluated by extracting the top-scoring sentence pairs from each corpus, then using them to train MT systems; these systems are run on test sets of Wikipedia articles (Guzmán et al., 2019), and the results are evaluated using BLEU (Papineni et al., 2002). In addition to the noisy corpora, participants are allowed to use a few small sets of parallel data, covering different domains, for each of the two low-resource languages, as well as a third, related language, Hindi (which uses the same script as Nepali). The provided data also included much larger monolingual corpora for each of English, Hindi, Nepali and Sinhala.

Setup
In these experiments, we focus on the Nepali-English corpus, and perform PCF in three steps:
1. pre-filtering: apply ad hoc filters to remove sentences that are exact duplicates (after masking numbers, emails and web addresses), that contain mismatching numbers, that are in the wrong language according to the pyCLD2 language detector, or that are excessively long (either side has more than 150 tokens). We also filter out all pairs where over 50% of the Nepali text consists of English, numbers or punctuation.
2. scoring: score each remaining sentence pair with YiSi-2, as described below.
3. re-ranking: to optimize vocabulary coverage in the resulting MT system, we apply a form of re-ranking: going down the ranked list of scored sentence pairs, we apply a 20% penalty to the pair's score if it does not contain at least one "new" source-language word bigram, i.e., a pair of consecutive source-language tokens not observed in previous (higher-scoring) sentence pairs. This has the effect of down-ranking sentences that are too similar to previously selected sentences.
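The re-ranking step can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: scores are assumed to be positive, the bigrams of every visited pair (penalized or not) are recorded as seen, and a sentence with fewer than two tokens counts as having no new bigram. The function name `rerank` and the input layout are hypothetical.

```python
def rerank(scored_pairs, penalty=0.2):
    """Penalize pairs whose source side brings no new word bigram.

    scored_pairs: list of (score, source_tokens) tuples.
    Returns (adjusted_score, source_tokens) tuples, best first.
    """
    seen_bigrams = set()
    adjusted = []
    # Go down the list from the highest-scoring pair to the lowest.
    for score, tokens in sorted(scored_pairs, key=lambda p: p[0], reverse=True):
        bigrams = set(zip(tokens, tokens[1:]))
        if not (bigrams - seen_bigrams):
            # No unseen bigram: apply the 20% penalty.
            score *= 1.0 - penalty
        seen_bigrams |= bigrams
        adjusted.append((score, tokens))
    # Re-sort, since penalized pairs may now rank below later ones.
    adjusted.sort(key=lambda p: p[0], reverse=True)
    return adjusted
```

In this sketch, a near-duplicate of a higher-ranked sentence drops below lower-scoring sentences that contribute new vocabulary contexts.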
The scoring step is performed with YiSi-2, using bilingual word embeddings obtained under three different conditions (details of the various training sets used can be found in Table 3):
bivec: supervised BWE's produced using bivec, trained on the WMT19 PCF (clean) parallel data.
vecmap: weakly supervised BWE's produced with vecmap, trained on all monolingual WMT19 PCF data, using Wikititles and the provided dictionary entries as bilingual lexicon.
BERT: BWE's obtained from pretrained multilingual BERT models.
As in the WMT19 PCF shared task, we evaluate the quality of our scoring by training MT systems and measuring their performance on the official test set. We used the provided software to extract the 1M-word and 5M-word samples from the original test corpora, using the scores of each of our systems in turn. We then trained MT systems using the extracted data: our MT systems are standard phrase-based SMT systems, with components and parameters similar to the German-English SMT system in Williams et al. (2016).

Results
BLEU scores of the resulting MT systems are shown in Table 4. For comparison, we present the results of random scoring, as well as results obtained by the Zipporah PCF method (Xu and Koehn, 2017). Zipporah combines fluency and adequacy features to score sentence pairs; adequacy features are derived from existing parallel corpora, and the feature combination (logistic regression) is optimized on in-domain parallel data. Therefore, Zipporah can be seen as a fully supervised method. The Zipporah-based MT systems were trained similarly to other systems in the results reported here.
All systems produced with YiSi-2 yield similar results. Interestingly, the MT systems produced with YiSi-2 in the 5M-word condition are not better than those of the 1M-word condition. This is possibly explained by the large quantity of noisy data in the WMT19 Nepali-English corpus: it is not even clear that there are 5M words of proper translations in that corpus. In such harsh conditions, pre- and post-processing steps become crucially important, and deduplicating the data may even turn out to be harmful, if that means allowing more space for noise. The MT systems produced with Zipporah all achieve higher BLEU scores than YiSi-2, which may be explained by Zipporah's explicit modeling of target-language fluency. This is especially apparent in the 5M-word condition, but it may explain Zipporah's slightly better performance in the 1M-word condition as well. Overall, the benefits of supervised and weakly supervised approaches over using a pre-trained BERT model for PCF appear to be minimal, even in very low-resource conditions such as this.

Experiments on Translation Equivalence Error Detection
Given a text and its translation, Translation Equivalence Error Detection (TEED) is the task of identifying pairs of corresponding text segments whose meanings are not strictly equivalent. Note that, while in practice "translation errors" can take many forms, here, we are strictly focusing on meaning errors. In this formulation of the problem, we are also assuming that the source and target texts have been properly segmented into sentences and aligned. The TEED problem is essentially the same as that of Parallel Corpus Filtering (PCF), discussed in the previous section. However, the usage scenario is quite different: in PCF, one is typically dealing with a very large collection of segment pairs, only a fraction of which are true translations; the PCF task is then to filter out pairs which are not proper translations, possibly with some tolerance for pairs of segments that do share partial meaning. In TEED, the data is mostly expected to be high-quality translations; the task is then to identify those pairs that deviate from this norm, even on small details.

Setup
We experiment with the TEED task using a data set obtained from the Canadian government's Public Service Commission (PSC). As part of its mandate, the PSC periodically audits Canadian government job ads, to ensure that they conform with Canada's Official Languages Act: as such, job ads must be posted in both of Canada's official languages, English and French, and both versions must be equivalent in meaning.
Our PSC data set consists of 175 000 "Statement of merit criteria" paragraphs, identifying any skill, ability, academic specialization, relevant experience or any other essential or asset criteria required for a position to be filled. Of these, 3521 have been manually annotated for equivalence errors by PSC auditors. Out of the 3521 pairs, 164 (4.6%) were reported to contain equivalence errors. The majority of these errors result from missing information in one language or the other (45%). In a slightly smaller proportion (43%), we find pairs of segments that don't express exactly the same meaning; a surprisingly large proportion of this last group consists of cases where the word "and" is translated as "or" or vice-versa. The rest consist of terminology issues and untranslated segments.
We experimented with applying the YiSi-2 metric to this task, using bilingual word embeddings obtained under four different conditions:
PSC.bivec: BWE's trained on domain-specific parallel data.
PSC.vecmap: vector-mapped BWE's trained with in-domain data.
WMT: BWE's trained on large quantities of generic, out-of-domain parallel data.
BERT: BWE's obtained from pretrained multilingual BERT models.
Details about the training data can be found in Table 5.

Results
For these experiments, we considered an application scenario in which a text and its translation, in the form of pairs of matching segments, are scored using YiSi-2, and presented to a user, ranked in increasing order of score, so that pairs most likely to contain a translation error are presented first. The performance of the system can then be measured in terms of true and false positive rates, precision and recall, over subsets of increasing sizes of the test set. In Table 6, we report results in terms of mean F-score, with β = 1 and β = 2, and in terms of the area under the ROC curve (ROC AUC), which can be interpreted as the probability that a system will score a randomly chosen faulty translation lower than a randomly chosen good translation. The ROC curves themselves can be seen in Figure 1.
Globally, YiSi-2 clearly performs best at this task when using BWE's trained on domain-specific parallel data (PSC.bivec), even when there are only very limited quantities of such data, as is the case here. However, BERT models perform comparably to vector-mapped BWE's trained with in-domain data (PSC.vecmap), and substantially better than BWE's trained on large quantities of generic, out-of-domain parallel data (WMT). We conclude that, in the absence of in-domain parallel data, for TEED applications, an unsupervised YiSi-2 method will perform at least as well as supervised methods trained on out-of-domain data.
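The two evaluation measures used in this section can be sketched as follows. The AUC function implements the probabilistic interpretation given above by direct pairwise comparison, counting ties as one half (an assumption), and the F-score function computes F_β at a single cutoff rather than the mean over subsets reported in Table 6. Both function names are illustrative.

```python
def roc_auc(scores_faulty, scores_good):
    """Probability that a random faulty pair scores lower than a random good pair."""
    wins = 0.0
    for s_faulty in scores_faulty:
        for s_good in scores_good:
            if s_faulty < s_good:
                wins += 1.0
            elif s_faulty == s_good:
                wins += 0.5  # assumption: ties count as half a win
    return wins / (len(scores_faulty) * len(scores_good))

def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

An AUC of 0.5 corresponds to a ranking no better than chance, while 1.0 means every faulty pair is ranked ahead of every good pair.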

Conclusion
We presented a fully unsupervised crosslingual semantic textual similarity (STS) metric, based on contextual embeddings extracted from BERT without fine-tuning. We perform intrinsic evaluations on crosslingual STS data sets and extrinsic evaluations on parallel corpus filtering and human translation equivalence assessment tasks. Our results show that the unsupervised metric we propose achieves performance on par with supervised or weakly supervised approaches. We show that the circular dependency on the existence of parallel resources for using crosslingual STS to identify parallel data can be broken.
In this paper, we have only experimented with the contextual embeddings extracted from a pretrained multilingual BERT model. For domain-specific applications, such as the job advertisement domain in the PSC translation equivalence error detection task, the performance of YiSi-2 could potentially be improved by fine-tuning BERT with in-domain data, something we plan to examine in the near future. We will also want to explore the use of other multilingual context representation models, such as MUSE (Conneau et al., 2017), XLM (Lample and Conneau, 2019), etc.