Simple and Effective Paraphrastic Similarity from Parallel Translations

We present a model and methodology for learning paraphrastic sentence embeddings directly from bitext, removing the time-consuming intermediate step of creating paraphrase corpora. Further, we show that the resulting model can be applied to cross-lingual tasks where it both outperforms and is orders of magnitude faster than more complex state-of-the-art baselines.


Introduction
Measuring sentence similarity is a core task in semantics (Cer et al., 2017), and prior work has achieved strong results by training similarity models on datasets of paraphrase pairs (Dolan et al., 2004). However, such datasets are not produced naturally at scale and therefore must be created either through costly manual annotation or by leveraging natural annotation in specific domains, like Simple English Wikipedia (Coster and Kauchak, 2011) or Twitter (Lan et al., 2017).
One of the most promising approaches for inducing paraphrase datasets is via manipulation of large bilingual corpora. Examples include bilingual pivoting over phrases (Callison-Burch et al., 2006; Ganitkevitch et al., 2013), and automatic translation of one side of the bitext (Wieting and Gimpel, 2018; Hu et al., 2019). However, this is costly: Wieting and Gimpel (2018) report that their large-scale database of sentential paraphrases required 10,000 GPU hours to generate.
In this paper, we propose a method that trains highly performant sentence embeddings (Pham et al., 2015; Hill et al., 2016; Pagliardini et al., 2017; McCann et al., 2017; Conneau et al., 2017) directly on bitext, obviating these intermediate steps and avoiding the noise and error propagation from automatic dataset preparation methods. This approach eases data collection, since bitext occurs naturally more often than paraphrase data and, further, has the additional benefit of creating cross-lingual representations that are useful for tasks such as mining or filtering parallel data and cross-lingual retrieval.
Most previous work for cross-lingual representations has focused on models based on encoders from neural machine translation (Espana-Bonet et al., 2017; Schwenk and Douze, 2017; Schwenk, 2018) or deep architectures using a contrastive loss (Grégoire and Langlais, 2018; Guo et al., 2018; Chidambaram et al., 2018). However, the paraphrastic sentence embedding literature has observed that simple models such as pooling word embeddings generalize significantly better than complex architectures (Wieting et al., 2016b). Here, we find a similar effect in the bilingual setting. We propose a simple model that not only produces state-of-the-art monolingual and bilingual sentence representations, but also encodes sentences hundreds of times faster, an important factor when applying these representations for mining or filtering large amounts of bitext. Our approach forms the simplest method to date that is able to achieve state-of-the-art results on multiple monolingual and cross-lingual semantic textual similarity (STS) and parallel corpora mining tasks. Lastly, since bitext is available for so many language pairs, we analyze how the choice of language pair affects the performance of English paraphrastic representations, finding that using related languages yields the best results.
We first describe our objective function and then describe our encoder, in addition to several baseline encoders. The methodology proposed here borrows much from past work (Wieting and Gimpel, 2018; Guo et al., 2018; Grégoire and Langlais, 2018; Singla et al., 2018), but this specific setting has not been explored and, as we show in our experiments, is surprisingly effective.
Training. The training data consists of a sequence of parallel sentence pairs $(s_i, t_i)$ in source and target languages respectively. For each sentence pair, we randomly choose a negative target sentence $t'_i$ during training that is not a translation of $s_i$. Our objective is to have source and target sentences be more similar than source and negative target examples by a margin $\delta$:

$$\min_{\theta_{src}, \theta_{tgt}} \sum_i \left[ \delta - f_\theta(s_i, t_i) + f_\theta(s_i, t'_i) \right]_+$$

The similarity function is defined as:

$$f_\theta(s, t) = \cos\left( g(s; \theta_{src}),\; g(t; \theta_{tgt}) \right)$$

where $g$ is the sentence encoder with parameters for each language $\theta = (\theta_{src}, \theta_{tgt})$. To select $t'_i$ we choose the most similar sentence in some set according to the current model parameters, i.e., the one with the highest cosine similarity.
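The margin objective can be illustrated on toy vectors. The following is a minimal sketch, assuming sentence embeddings are plain Python lists; the function names `cosine` and `margin_loss` and the toy vectors are illustrative assumptions, and the default margin of 0.4 is the value reported in the hyperparameter section.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def margin_loss(src, tgt, neg, delta=0.4):
    """Hinge loss for one example: zero once cos(src, tgt) exceeds
    cos(src, neg) by at least the margin delta."""
    return max(0.0, delta - cosine(src, tgt) + cosine(src, neg))

# Toy 2-d "sentence embeddings" (hypothetical values).
s = [1.0, 0.0]
t = [1.0, 0.0]        # same direction as s, cosine = 1
t_neg = [0.0, 1.0]    # orthogonal to s, cosine = 0

loss_ok = margin_loss(s, t, t_neg)    # margin satisfied
loss_bad = margin_loss(s, t_neg, t)   # roles swapped, margin violated
print(loss_ok, loss_bad)
```

In a real training loop the loss would be summed over the batch and differentiated with respect to the encoder parameters; the sketch only shows the per-example value.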
Negative Sampling. The described objective can also be applied to monolingual paraphrase data, which we explore in our experiments. To select negative examples, we use the mega-batching procedure of Wieting and Gimpel (2018), which aggregates M mini-batches to create one mega-batch and selects negative examples therefrom. Once each pair in the mega-batch has a negative example, the mega-batch is split back up into M mini-batches for training.
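The mega-batching step can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes embeddings have already been computed with the current model, takes negatives only from the target side, and uses the hypothetical helper names `hardest_negatives` and `megabatch_negatives`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hardest_negatives(src_vecs, tgt_vecs):
    """For each source i, return the index j != i of the most similar target."""
    negatives = []
    for i, s in enumerate(src_vecs):
        best_j, best_sim = None, -math.inf
        for j, t in enumerate(tgt_vecs):
            if j == i:
                continue  # skip the true translation
            sim = cosine(s, t)
            if sim > best_sim:
                best_j, best_sim = j, sim
        negatives.append(best_j)
    return negatives

def megabatch_negatives(minibatches):
    """Merge M mini-batches of (src, tgt) pairs into one mega-batch, pick a
    negative for every pair, then split back into the original M mini-batches
    of (src, tgt, neg) triples."""
    mega = [pair for batch in minibatches for pair in batch]
    src = [p[0] for p in mega]
    tgt = [p[1] for p in mega]
    neg_idx = hardest_negatives(src, tgt)
    triples = [(src[i], tgt[i], tgt[neg_idx[i]]) for i in range(len(mega))]
    out, start = [], 0
    for batch in minibatches:
        out.append(triples[start:start + len(batch)])
        start += len(batch)
    return out

# Two toy mini-batches of two pairs each (hypothetical vectors).
batches = [
    [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])],
    [([1.0, 1.0], [1.0, 1.0]), ([1.0, 0.1], [1.0, 0.2])],
]
with_negs = megabatch_negatives(batches)
print(with_negs[0][0][2])  # negative for the first pair, drawn from the second mini-batch
```

Pooling M mini-batches enlarges the candidate set, so the selected negative tends to be harder than one drawn from a single mini-batch.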
Encoders. Our primary sentence encoder simply averages the embeddings of subword units generated by sentencepiece (Kudo and Richardson, 2018); we refer to it as SP. This means that the sentencepiece embeddings themselves are the only learned parameters of this model. As baselines we explore averaging character trigrams (TRIGRAM) (Wieting et al., 2016a) and words (WORD). SP provides a compromise between averaging words and character trigrams, combining the more distinct semantic units of words with the coverage of character trigrams. We also use a bidirectional long short-term memory (BLSTM) encoder (Hochreiter and Schmidhuber, 1997), with LSTM parameters fully shared between languages, as well as BLSTM-SP, which uses sentence pieces instead of words as the input tokens. For all encoders, when the vocabularies of the source and target languages overlap, the corresponding encoder embedding parameters are shared. As a result, language pairs with more lexical overlap share more parameters.
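A minimal sketch of an averaging encoder in the spirit of SP is given below. It assumes sentences have already been segmented into subword pieces (in practice the pieces would come from a sentencepiece model); the class name `AveragingEncoder`, the toy vocabulary, and the zero-vector fallback for unknown input are illustrative assumptions. A single embedding table is used, so a piece that occurs in both languages maps to the same parameters.

```python
import random

class AveragingEncoder:
    """Sentence embedding = mean of the embeddings of its subword pieces."""

    def __init__(self, vocab, dim=300, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # One shared table: pieces common to both languages share a row.
        self.emb = {piece: [rng.gauss(0.0, 1.0) for _ in range(dim)]
                    for piece in sorted(vocab)}

    def encode(self, pieces):
        known = [self.emb[p] for p in pieces if p in self.emb]
        if not known:
            return [0.0] * self.dim  # assumption: zero vector if nothing is known
        return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical pre-segmented pieces; "_" stands in for a word-boundary marker.
vocab = {"_the", "_cat", "_sat", "_ko", "cka"}
enc = AveragingEncoder(vocab, dim=4)
vec = enc.encode(["_the", "_cat"])
print(vec)
```

Because the encoder is a table lookup followed by a mean, word order is ignored and encoding cost is linear in the number of pieces.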
We utilize several regularization methods (Wieting and Gimpel, 2017) including dropout (Srivastava et al., 2014) and shuffling the words in the sentence when training BLSTM-SP. Additionally, we find that annealing the mega-batch size by slowly increasing it during training improves performance by a significant margin for all models, but especially for BLSTM-SP.

Experiments
Our experiments are split into two groups. First, we compare training on parallel data to training on back-translated parallel data. We evaluate these models on the 2012-2016 SemEval Semantic Textual Similarity (STS) shared tasks (Agirre et al., 2012, 2013, 2014, 2015, 2016), which predict the degree to which sentences have the same meaning as measured by human judges. The evaluation metric is Pearson's r with the gold labels. We use the small STS English-English dataset from Cer et al. (2017) for model selection. Second, we compare our best model, SP, on two semantic cross-lingual tasks: the 2017 SemEval STS task (Cer et al., 2017), which consists of monolingual and cross-lingual datasets, and the 2018 Building and Using Parallel Corpora (BUCC) shared bitext mining task (Zweigenbaum et al., 2018).
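The evaluation metric can be computed directly from its definition. The sketch below is illustrative only: `pearson_r` is a hypothetical helper name, and the predicted similarities and gold scores are made-up toy values, not results from any STS dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy example: model cosine similarities vs. gold scores on a 0-5 scale.
predicted = [0.9, 0.1, 0.5]
gold = [5.0, 0.0, 2.5]
r = pearson_r(predicted, gold)
print(round(r, 6))
```

Pearson's r is invariant to affine rescaling, so cosine similarities in [-1, 1] can be compared against gold scores on a different scale without normalization.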

Hyperparameters and Optimization
Unless otherwise specified, we fix the hyperparameters in our model to the following: mega-batch size to 60, margin δ to 0.4, annealing rate to 150, dropout to 0.3, shuffling rate for BLSTM-SP to 0.3, and the size of the sentencepiece vocabulary to 20,000. For WORD and TRIGRAM, we limit the vocabulary to the 200,000 most frequent types in the training data. We optimize our models using Adam (Kingma and Ba, 2014) with a learning rate of 0.001 and train the models for 10 epochs.

Back-Translated Text vs. Parallel Text
We first compare sentence encoders and sentence embedding quality between models trained on back-translated text and those trained on bitext directly. As our bitext, we use the Czeng1.6 English-Czech parallel corpus (Bojar et al., 2016). We compare it to training on ParaNMT (Wieting and Gimpel, 2018), a corpus of 50 million paraphrases obtained from automatically translating the Czech side of Czeng1.6 into English. We sample 1 million examples from ParaNMT and Czeng1.6 and evaluate on all 25 datasets from the STS tasks from 2012-2016. Since the models see two full English sentences for every example when training on ParaNMT, but only one when training on bitext, we also experiment with sampling twice the amount of bitext data to keep fixed the number of English training sentences. The results in Table 1 support two observations. First, models trained on en-en, in contrast to those trained on en-cs, have higher correlation for all encoders except SP. However, when the same number of English sentences is used, models trained on bitext have greater than or equal performance across all encoders. Second, SP has the best overall performance in the en-cs setting. It also has fewer parameters and is faster to train than BLSTM-SP and TRIGRAM. Further, it is faster at encoding new sentences at test time.

Monolingual and Cross-Lingual Similarity
We evaluate on the cross-lingual STS tasks from SemEval 2017. This evaluation contains Arabic-Arabic, Arabic-English, Spanish-Spanish, Spanish-English, and Turkish-English STS datasets. These datasets were created by translating one or both sentences of an English STS pair into Arabic (ar), Spanish (es), or Turkish (tr).
Baselines. We compare to several models from prior work (Guo et al., 2018; Chidambaram et al., 2018). A fair comparison to other models is difficult due to different training setups. Therefore, we perform a variety of experiments at different scales to demonstrate that even with much less data, our method has the best performance. In the case of Schwenk (2018), we replicate their setting in order to do a fair comparison. As another baseline, we analyze the performance of averaging randomly initialized embeddings.
We experiment with SP having sentencepiece vocabulary sizes of 20,000 and 40,000 tokens as well as TRIGRAM with a maximum vocabulary size of 200,000. The embeddings have 300 dimensions and are initialized from a normal distribution with mean 0 and variance 1.
Results. The results are shown in Table 2. We make several observations. The first is that the 1024-dimensional SP model trained on the 2016 OpenSubtitles Corpus (Lison and Tiedemann, 2016) outperforms prior work on 4 of the 6 STS datasets. This result outperforms the baselines from the literature as well, all of which use deep architectures. Our SP model trained on Europarl (EP) also surpasses the model from Schwenk (2018), which is trained on the same corpus. Since that model is based on many-to-many translation, Schwenk (2018) trains on nine (related) languages in Europarl. We only train on the splits of interest (en-es for STS and en-de/en-fr for the BUCC tasks) in our experiments. Secondly, we find that SP outperforms TRIGRAM overall. This seems to be especially true when the languages have more sentencepiece tokens in common.
Lastly, we find that random encoders, especially random TRIGRAM, perform strongly in the monolingual setting. In fact, the random encoders are competitive with or outperform all three models from the literature in these cases. For cross-lingual similarity, however, random encoders lag behind because they are essentially measuring the lexical overlap in the two sentences and there is little lexical overlap in the cross-lingual setting, especially for distantly related languages like Arabic and English.

Mining Bitext
Lastly, we evaluate on the BUCC shared task on mining bitext. This task consists of finding the gold aligned parallel sentences given two large corpora in two distinct languages. Typically, only about 2.5% of the sentences are aligned. Following Schwenk (2018), we train our models on Europarl and evaluate on the publicly available BUCC data. Results in Table 3 on the French and German mining tasks demonstrate that the proposed model outperforms Schwenk (2018), although the gap is substantially smaller than on the STS tasks. The reason for this is likely the domain mismatch between the STS data (image captions) and the training data (Europarl). We suspect that the deep NMT encoders of Schwenk (2018) overfit to the domain more than the simpler SP model, and the BUCC task uses news data which is closer to Europarl than image captions.
Note on the baselines from the literature in the STS comparison: they include a 3-layer transformer trained on a constructed parallel corpus (Chidambaram et al., 2018), a bidirectional gated recurrent unit (GRU) network trained on a collection of parallel corpora using en-es, en-ar, and ar-es bitext (Espana-Bonet et al., 2017), and a 3-layer bidirectional LSTM trained on 9 languages in Europarl (Schwenk, 2018).
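To make the mining task concrete, the sketch below pairs each source sentence with its nearest target sentence by cosine similarity and keeps the pair only if the similarity clears a threshold. The text does not specify the retrieval or filtering procedure, so nearest-neighbor search, the threshold value of 0.8, and the name `mine_pairs` are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_pairs(src_vecs, tgt_vecs, threshold=0.8):
    """Return (src_index, tgt_index, similarity) for every source sentence
    whose nearest target is at least `threshold` similar (assumed cutoff)."""
    pairs = []
    for i, s in enumerate(src_vecs):
        sims = [cosine(s, t) for t in tgt_vecs]
        j = max(range(len(sims)), key=sims.__getitem__)
        if sims[j] >= threshold:
            pairs.append((i, j, sims[j]))
    return pairs

# Toy embeddings (hypothetical): only the first source has a close target.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [-1.0, 0.2]]
mined = mine_pairs(src, tgt)
print(mined)
```

Since only a small fraction of sentences have a true counterpart, the threshold is what keeps unaligned sentences from being paired; an exhaustive search like this is quadratic, which is why encoding and comparison speed matter at corpus scale.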

Analysis
We next conduct experiments on encoding speed and analyze the effect of language choice.

Figure 1: Plot of average performance on the 2012-2016 STS tasks compared to SP overlap and language distance as defined by Littell et al. (2017).

Encoding Speed
In addition to outperforming more complex models (Schwenk, 2018; Chidambaram et al., 2018), the simple SP models are much faster at encoding sentences. Since implementations to encode sentences are publicly available for several baselines, we are able to test their encoding speed and compare to SP. To do so, we randomly select 128,000 English sentences from the English-Spanish Europarl corpus. We then encode these sentences in batches of 128 on an Nvidia Quadro GP100 GPU. The number of sentences encoded per second is shown in Table 4, showing that SP is hundreds of times faster.
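A throughput measurement of this kind can be sketched as a simple timing loop. The code below is a CPU-only illustration, not the benchmark used for Table 4: `sentences_per_second` and `toy_encode_batch` are hypothetical names, and the toy encoder stands in for a real model.

```python
import time

def sentences_per_second(encode_batch, sentences, batch_size=128):
    """Encode all sentences in fixed-size batches and return the throughput."""
    start = time.perf_counter()
    for i in range(0, len(sentences), batch_size):
        encode_batch(sentences[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(sentences) / max(elapsed, 1e-9)  # guard against a zero reading

seen_batch_sizes = []

def toy_encode_batch(batch):
    """Stand-in encoder: records the batch size and returns 1-d 'embeddings'."""
    seen_batch_sizes.append(len(batch))
    return [[float(len(s.split()))] for s in batch]

sentences = ["this is sentence number %d" % i for i in range(300)]
rate = sentences_per_second(toy_encode_batch, sentences)
print(f"{rate:.0f} sentences/second")
```

When timing a GPU model, the device would also need to be synchronized before reading the clock; that step is omitted here because the sketch runs on the CPU.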

Does Language Choice Matter?
We next investigate the impact of the non-English language in the bitext when training English paraphrastic sentence embeddings. We took all 46 languages with at least 100k parallel sentence pairs in the 2016 OpenSubtitles Corpus (Lison and Tiedemann, 2016) and made a plot of their average STS performance on the 2012-2016 English datasets compared to their SP overlap and language distance. We segmented the languages separately and trained the models for 10 epochs using the 2017 en-en task for model selection.
The plot, shown in Figure 1, shows that sentencepiece (SP) overlap is highly correlated with STS score. There are also two clusters in the plot, languages that have a similar alphabet to English and those that do not. In each cluster we find that performance is negatively correlated with language distance. Therefore, languages similar to English yield better performance. The Spearman's correlations (multiplied by 100) for all languages and these two clusters are shown in Table 5. When choosing a language to pair up with English for learning paraphrastic embeddings, ideally there will be a lot of SP overlap. However, beyond or below a certain threshold (approximately 0.3 judging by the plot), the linguistic distance to English is more predictive of performance. Of the factors in URIEL, syntactic distance was the feature most correlated with STS performance in the two clusters with correlations of -56.1 and -29.0 for the low and high overlap clusters respectively. This indicates that languages with similar syntax to English helped performance. One hypothesis to explain this relationship is that translation quality is higher for related languages, especially if the languages have the same syntax, resulting in a cleaner training signal.
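The two quantities in this analysis can be sketched in a few lines. The definition of SP overlap is not given in the surviving text, so the code assumes it is the fraction of English sentencepiece types that also occur in the other language's vocabulary; the helper names and all numeric inputs are illustrative toy values, not figures from the paper.

```python
import math

def sp_overlap(en_pieces, other_pieces):
    """Assumed definition: share of English piece types also found in the other language."""
    en, other = set(en_pieces), set(other_pieces)
    return len(en & other) / len(en)

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))

def ranks(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

overlap = sp_overlap(["_a", "_b", "_c", "_d"], ["_c", "_d", "_e"])
rho = spearman([0.1, 0.4, 0.7], [1.0, 2.0, 3.0])  # toy overlaps vs. toy scores
print(overlap, rho)
```

Spearman's correlation depends only on the ordering of the values, which makes it a reasonable choice when the relationship between overlap and STS score is monotonic but not necessarily linear.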
We also hypothesize that having high SP overlap is correlated with improved performance because the English SP embeddings are being updated more frequently during training. To investigate the effect, we again learned segmentations separately for both languages, then prefixed all tokens in the non-English text with a marker to ensure that there would be no shared parameters between the two languages. Results showed that SP overlap was still correlated (correlation of 24.9) and language distance was still negatively correlated with performance, albeit significantly less so at -10.1. Of all the linguistic features, syntactic distance was again the most strongly correlated at -37.5.
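The marker trick amounts to renaming every non-English token so that the two vocabularies become disjoint. A minimal sketch follows; the marker string `<xx>` and the function name `mark_foreign` are hypothetical choices, and the token lists are toy data.

```python
def mark_foreign(pieces, marker="<xx>"):
    """Prefix every piece so it can never collide with an English piece."""
    return [marker + p for p in pieces]

en_pieces = ["_taxi", "_is", "_here"]
other_pieces = ["_taxi", "_je", "_tady"]

shared_before = set(en_pieces) & set(other_pieces)
shared_after = set(en_pieces) & set(mark_foreign(other_pieces))
print(shared_before, shared_after)
```

With the marker applied, identical surface strings map to separate embedding rows, so any remaining correlation with SP overlap cannot be due to shared parameters.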
We have shown that automatic dataset preparation methods such as pivoting or back-translation are not needed to create higher performing sentence embeddings. Moreover, by using the bitext directly, our approach also produces strong paraphrastic cross-lingual representations as a byproduct. Our approach is much faster than comparable methods and yields stronger performance on cross-lingual and monolingual semantic similarity and cross-lingual bitext mining tasks.