Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

We extend the work of Wieting et al. (2017), back-translating a large parallel corpus to produce a dataset of more than 51 million English-English sentential paraphrase pairs, which we call ParaNMT-50M. We find that this corpus covers many domains and styles of text and is rich in paraphrases with different sentence structure, and we release it to the community. To show its utility, we use it to train paraphrastic sentence embeddings using only minor changes to the framework of Wieting et al. (2016b). The resulting embeddings outperform all supervised systems on every SemEval semantic textual similarity (STS) competition, and are a significant improvement in capturing paraphrastic similarity over all other sentence embeddings. We also show that our embeddings perform competitively on general sentence embedding tasks. We release the corpus, pretrained sentence embeddings, and code to generate them. We believe the corpus can be a valuable resource for automatic paraphrase generation and can provide a rich source of semantic information to improve downstream natural language understanding tasks.


Introduction
The natural language processing (NLP) community has benefited greatly from using pretrained word embeddings in downstream tasks. Recently, attention has been shifting from word embeddings to sentence embeddings. Most of this work is centered on general-purpose embeddings that can be used for any task, when they are used as features for a linear classifier or a shallow neural network (Kiros et al., 2015; Conneau et al., 2017). Our recent work has focused on paraphrastic sentence representations, where sentences that have the same meaning lie close together in a vector space (Wieting et al., 2015, 2016b; Wieting and Gimpel, 2017).
In this paper, we make two contributions. First, we scale up the work of Wieting et al. (2017). We use neural machine translation (NMT) to generate an automatic paraphrase corpus of over 51 million sentential paraphrase pairs. We call this dataset PARANMT-50M. We judge this corpus to be of very high quality, based both on the state-of-the-art sentence embeddings that can be trained from it and on visual inspection. The dataset has the potential to be used for many tasks, from linguistically controlled paraphrase generation, style transfer, and sentence simplification to core NLP problems like neural machine translation. The corpus contains many rich paraphrases illustrating many paraphrase phenomena; we show examples and discuss its potential in Section 5.
Second, we show the utility of the corpus by using it to train state-of-the-art paraphrastic sentence embeddings. We perform a thorough search of the model space, motivated by our prior work (Wieting et al., 2015, 2016b; Wieting and Gimpel, 2017), and find that many neural architectures can be used to create strong embeddings. Moreover, given the large size of the paraphrase corpus, we found benefit from making a small change to the training procedure: we increase the search space of negative examples during training. This presumably helps by keeping the loss difficult to minimize as the models improve over the course of learning. Our sentence embeddings would win every SemEval semantic textual similarity (STS) competition held from 2012 to 2016. These STS tasks have been a central task of SemEval for the past 5 years, drawing dozens of participating institutions with over a hundred submissions each year. Since so many domains are covered in these datasets, they form a demanding evaluation for a general-purpose sentence embedding model. Our result is notable because most STS systems use curated lexical resources and supervised training data with manually annotated similarities. Moreover, we do not even model the sentences jointly (i.e., using attention or features produced from both embeddings in the sentence pair); we simply encode each sentence separately and then compute the cosine similarity between the embeddings.
We release our back-translated data and trained models for two main purposes. First, we hope that the data can motivate new research directions and help create new and interesting NLP models, while adding robustness to existing ones by incorporating this paraphrase knowledge. It is the largest collection of sentential paraphrases released to date. Second, since our pretrained sentence embeddings are state-of-the-art by a significant margin, we hope that these embeddings can be useful for many applications, both as a sentence representation and as a general similarity metric. We are actively researching both directions to find downstream applications for these new resources.

Related Work
We describe related work in learning general-purpose sentence embeddings, work in automatically generating or discovering paraphrases, and finally prior work in leveraging neural machine translation for embedding learning.
Paraphrastic sentence embeddings. Our learning and evaluation setting is the same as that considered by Wieting et al. (2016b) and Wieting et al. (2016a), in which the goal is to learn paraphrastic sentence embeddings that can be used for downstream tasks. They trained models on PPDB and evaluated them using a suite of semantic textual similarity (STS) tasks and supervised semantic tasks. Others have begun to consider this setting as well (Arora et al., 2017).
The most relevant work uses bilingual corpora, e.g., Bannard and Callison-Burch (2005), culminating in the Paraphrase Database (PPDB; Ganitkevitch et al., 2013). Our goals are similar to those of PPDB, which has likewise been generated for many languages (Ganitkevitch and Callison-Burch, 2014) since it only needs parallel text.
Prior work has shown that PPDB can be used for learning embeddings for words and phrases (Faruqui et al., 2015; Wieting et al., 2015). However, when learning sentence embeddings, Wieting and Gimpel (2017) showed that PPDB is not as effective as sentential paraphrases, especially for recurrent networks. These results are intuitive because the phrases in PPDB are short and often cut across constituent boundaries. For sentential paraphrases, Wieting and Gimpel (2017) used a dataset developed for text simplification by Coster and Kauchak (2011). It was created by aligning sentences from Simple English and standard English Wikipedia.
Neural machine translation for paraphrastic embedding learning. Sutskever et al. (2014) trained NMT systems and visualized part of the space of the source language encoder for their English→French system. Hill et al. (2016) evaluated the encoders of English-to-X NMT systems as sentence representations, finding them to perform poorly compared to several other methods based on unlabeled data.  adapted trained NMT models to produce sentence similarity scores in semantic evaluations. They used pairs of NMT systems, one to translate an English sentence into multiple foreign translations and the other to then translate back to English.  only uses the NMT system to generate training data for training sentence embeddings, rather than using it in a similarity model itself. This permitted us to decouple decisions made in designing the NMT architecture from decisions about which models we will use for learning sentence embeddings. Thus we can benefit from orthogonal work in both NMT and in designing neural architectures to embed sentences. In other recent work, Suzuki et al. (2017) similarly use NMT to generate paraphrase corpora. Other work has used neural MT architectures and training settings to obtain better word embeddings (Hill et al., 2014a,b).
This work extends Wieting et al. (2017), scaling it up to a larger corpus of machine translations and producing state-of-the-art paraphrastic sentence embeddings.

Neural Machine Translation
To create the data used for training our models, we back-translated the Czeng1.6 corpus (Bojar et al., 2016). We chose this corpus over other parallel text because it has many desirable properties. First, the sentences are short, meaning that NMT systems can generate correct outputs for many sentences in the corpus. Second, there is a lot of diversity in the corpus. A large portion of it is movie subtitles, which tend to use a wide vocabulary and have a diversity of sentence structures; however, other domains are included as well. Lastly, there is a lot of data. PARANMT-50M consists of over 51 million training examples, so there is plenty of data to train a strong in-domain translation system. It took over 10,000 GPU hours to back-translate all of this data.
We used the pre-trained models from the neural machine translation system of  to translate the Czech sentences into English. The model was trained on this data along with data from some smaller Czech sources (160k from Common-crawl, 650k from Europarl, and 190k from News). We used beam search with a beam size of 12 and selected the highest scoring translation from the beam to be the translation.  investigated methods to filter back-translated text. We experimented with simple approaches like using length cutoffs, which were found to be very effective, and more sophisticated ones like using a classifier to discern generated text from source text and using this classifier as an indicator of the quality of a translation.
We focused on extracting quality paraphrases from the back-translated text instead of trying to find paraphrases that are both high in quality and lexically different. We experimented with three simple approaches. First, we used the translation scores from decoding. Second, we used trigram overlap filtering as was done by . Lastly, we experimented with using the PARAGRAM-PHRASE embeddings from Wieting et al. (2015), with higher scoring pairs indicating a stronger paraphrase relationship between the sentences. We then proceeded to find the quality training data. To do this, we ranked all 51M+ paraphrase pairs in the corpus by whichever of the three metrics we were using, and then split the data into tenths (so the first tenth contains the bottom 10% scored paraphrases, the second contains those in the bottom 10-20%, etc.). We then used the models to select the optimal score ranges on a small sample of labeled data. For more details about the corpus, including statistics, comparisons to the other parallel corpora, and ablations on the data selection, see Section 5.

Models
We wish to embed a word sequence $s$ into a fixed-length vector. We denote the $t$th word in $s$ as $s_t$, and we denote its word embedding by $x_t$. We focus on three main models in this paper, though we also experiment with combining them in various ways.
The first model, which we call AVG, simply averages the embeddings $x_t$ of all words in $s$. The only parameters learned in this model are those in the word embeddings themselves, which are stored in the word embedding matrix $W_w$. This model was found by Wieting et al. (2016b) to perform very strongly for semantic similarity tasks.
The second model is similar to word averaging, but instead of averaging word embeddings, we average character trigram embeddings (Huang et al., 2013). Wieting et al. (2016a) found this to perform strongly for sentence embeddings compared to other n-gram orders and compared to word averaging.
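The two averaging models can be illustrated with a minimal sketch. The function names, the dictionary-based embedding lookup, whitespace tokenization, and the single-space padding used when extracting character trigrams are all assumptions made for illustration; they are not taken from the released code.

```python
import numpy as np

def word_avg(sentence, emb, dim):
    """AVG: mean of the embeddings of the (whitespace-separated) words."""
    vecs = [emb[w] for w in sentence.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def char_trigrams(sentence):
    """Character trigrams of the sentence, padded with one space per side
    (the padding scheme is an assumption)."""
    padded = " " + sentence + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def trigram_avg(sentence, emb, dim):
    """Trigram model: mean of the embeddings of the character trigrams."""
    vecs = [emb[t] for t in char_trigrams(sentence) if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)
```

In both cases the only learned parameters are the rows of the embedding table, which is why the trigram variant can still produce a representation for a word that never appeared in the word vocabulary.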
The third model family consists of various LSTM architectures (Hochreiter and Schmidhuber, 1997). We mainly experimented with LSTMs where we average the hidden states, with some small amount of scrambling of the input words. This combination was proposed by Wieting and Gimpel (2017), who found that averaging the LSTM hidden states was significantly more effective than using the last hidden state for paraphrastic transfer learning, and that scrambling led to significant gains as well. We also experiment with bidirectional LSTMs (BiLSTMs), averaging all hidden states. Unlike Conneau et al. (2017), we found this to outperform max-pooling for both paraphrastic similarity and general sentence embedding tasks. We also experimented with concatenating the hidden states prior to averaging, as was also done by Conneau et al. (2017). However, we found that for a given fixed output dimension, this was not as effective as just averaging the hidden states.

Training
We follow the training procedure of prior work (Wieting et al., 2015, 2016b), with one modification. We introduce "mega-batching" for selecting negative examples, which results in significant performance gains. The training data is a set $S$ of paraphrastic pairs $\langle s_1, s_2 \rangle$ and we optimize a margin-based loss:
$$\min_{W_c, W_w} \frac{1}{|S|} \Bigg( \sum_{\langle s_1, s_2 \rangle \in S} \max\big(0, \delta - \cos(g(s_1), g(s_2)) + \cos(g(s_1), g(t_1))\big) + \max\big(0, \delta - \cos(g(s_1), g(s_2)) + \cos(g(s_2), g(t_2))\big) \Bigg) + \lambda_c \lVert W_c \rVert^2 + \lambda_w \lVert W_{w_{initial}} - W_w \rVert^2$$
where $g$ is the model (AVG or GRAN), $W_c$ denotes any model parameters other than the word embeddings, $\delta$ is the margin, $\lambda_c$ and $\lambda_w$ are regularization parameters, $W_{w_{initial}}$ is the initial word embedding matrix, and $t_1$ and $t_2$ are "negative examples" taken from a mini-batch during optimization. The intuition is that we want the two texts to be more similar to each other ($\cos(g(s_1), g(s_2))$) than either is to their respective negative examples $t_1$ and $t_2$, by a margin of at least $\delta$. To select $t_1$ and $t_2$, we choose the most similar sentence in some set (other than those in the given pair). For simplicity we use the mini-batch for this set, i.e., we choose $t_1$ for a given $\langle s_1, s_2 \rangle$ as follows:
$$t_1 = \operatorname*{argmax}_{t : \langle t, \cdot \rangle \in S_b \setminus \{\langle s_1, s_2 \rangle\}} \cos(g(s_1), g(t))$$
where $S_b \subseteq S$ is the current mini-batch. That is, we want to choose a negative example $t_i$ that is similar to $s_i$ according to the current model. The downside is that we may occasionally choose a phrase $t_i$ that is actually a true paraphrase of $s_i$.
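The following NumPy sketch illustrates the margin-based loss and the selection of the most similar negative example within a mini-batch. It is an illustration under stated assumptions, not the released implementation: the function names are hypothetical, the candidate pool is taken to be every sentence in the mini-batch outside the given pair, the regularization terms are omitted (both regularization weights are set to 0 in the experiments), and a mini-batch must contain at least two pairs.

```python
import numpy as np

def _unit(x):
    """Scale each row to unit length so dot products become cosines."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def hardest_negative_indices(anchors, pool, own_pair):
    """For each anchor, index of the most similar pool row outside its own pair."""
    sims = _unit(anchors) @ _unit(pool).T
    for i, banned in enumerate(own_pair):
        sims[i, list(banned)] = -np.inf  # never pick the pair itself
    return sims.argmax(axis=1)

def margin_loss(g1, g2, delta=0.4):
    """Mean hinge loss over a mini-batch of embedded pairs (g1[i], g2[i])."""
    n = g1.shape[0]
    pool = np.vstack([g1, g2])  # assumed candidate set: the whole mini-batch
    own_pair = [(i, n + i) for i in range(n)]
    t1 = pool[hardest_negative_indices(g1, pool, own_pair)]
    t2 = pool[hardest_negative_indices(g2, pool, own_pair)]
    cos = lambda a, b: np.sum(_unit(a) * _unit(b), axis=1)
    pos = cos(g1, g2)
    per_pair = (np.maximum(0.0, delta - pos + cos(g1, t1))
                + np.maximum(0.0, delta - pos + cos(g2, t2)))
    return float(per_pair.mean())
```

When every pair is already more similar than its hardest negative by at least the margin, the loss is zero and the batch contributes no gradient, which is the motivation for searching a larger pool for harder negatives.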

Mega-batching
Because we train on a large number of sentence pairs, we find that the models require stronger negative examples to learn more discriminative paraphrastic sentence embeddings. However, when the candidate set $S_b \subseteq S$ is tied to the mini-batch, obtaining stronger negatives requires very large mini-batches, which is not optimal for optimization. Therefore, we propose mega-batching, in which a parameter $M$ controls how many mini-batches are gathered together, and the negative examples are then selected from this pool of batches all at once. Once every example in every batch has a negative example, the mega-batch is split back up into individual mini-batches and training proceeds, this time with more challenging negative examples that help the model learn better paraphrastic representations of text. This can be seen in Ta
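A minimal sketch of the mega-batching procedure follows. The helper names are hypothetical, and the choice of candidate pool (all sentences in the mega-batch except the pair itself) is an assumption; the sketch only shows the three steps described above: gathering M mini-batches, selecting negatives over the whole pool, and splitting back into mini-batches for the parameter updates.

```python
import numpy as np

def iter_mega_batches(num_pairs, batch_size, M):
    """Yield index arrays that each cover up to M consecutive mini-batches."""
    step = batch_size * M
    for start in range(0, num_pairs, step):
        yield np.arange(start, min(start + step, num_pairs))

def select_negatives(g1, g2):
    """Pick the most similar sentence in the pool for every s1 and every s2.

    g1, g2: (n, d) embeddings of the n pairs in the mega-batch.
    Returns (t1_idx, t2_idx), indices into np.vstack([g1, g2]).
    """
    n = g1.shape[0]
    pool = np.vstack([g1, g2])
    pool = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    sims = pool @ pool.T
    idx = np.arange(n)
    for offset in (0, n):  # rows for the s1 side, then the s2 side
        sims[idx + offset, idx] = -np.inf      # mask s1 of the same pair
        sims[idx + offset, idx + n] = -np.inf  # mask s2 of the same pair
    nearest = sims.argmax(axis=1)
    return nearest[:n], nearest[n:]

def split_into_batches(indices, batch_size):
    """Split a mega-batch back into mini-batches for the gradient steps."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]
```

With M = 1 this reduces to ordinary in-batch negative selection; larger M enlarges the search pool without changing the size of the mini-batches used for optimization.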

Generated Paraphrase Corpus
One of the keys to the success of this work, and of any future work that uses back-translated text, is finding the right data to use. We claim that desirable data will have (1) relatively short sentences, so that the NMT system produces quality outputs, (2) copious data, so that the learned NMT models are of high quality, and (3) a diverse vocabulary and sentence structure.
To show why Czeng1.6 (Bojar et al., 2016) has many of these properties in comparison to other available corpora, we compute statistics for it and for other possible data sources for creating a paraphrase corpus. We compare with Europarl, Common-crawl, and News for Czech.
For each corpus, we sampled 100k source sentences and computed the average sentence length, the average IDF (inverse document frequency, computed from Wikipedia), and the entropy of the vocabularies and of the constituent parse trees (computed using the Stanford Parser). To prevent sparsity in the parse tree statistics, we used only the top 2 levels of the parse tree (i.e., 2 levels after the ROOT). The results are shown in Table 2.
As the statistics indicate, Czeng1.6 isn't especially more diverse in terms of vocabulary, though this could largely be due to all the rare entity tokens in the News and Common-crawl datasets. All corpora, however, are significantly more diverse than Europarl, which tends to be very repetitive. Czeng1.6 does have significantly shorter sentences than the other corpora, and its structure is more diverse, as shown by its having the highest constituent parse entropy.
As another experiment, we used these small samples from each corpus to train 3 different models: Word Averaging, Trigram Averaging, and LSTM Averaging. In Table 3, the results on Czeng1.6 are the best for each model. We hypothesize that the core strength of Czeng1.6 is due more to having better translations than to any other innate property of the corpus. It is decently diverse in terms of vocabulary and has very diverse sentence structure, but the fact that the corpus has significantly shorter sentences than the other corpora, and has much more training data, means that its translations are going to be much better than those of the other corpora. In , we noted that in their generated corpora, sentence length was the most important factor in filtering quality paraphrase data, presumably because NMT performance worsens with longer sentences. Therefore, as obvious as it sounds, the better the translation of a corpus, the better the quality of the generated paraphrases.
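The corpus statistics discussed in this section (average sentence length, average IDF, and vocabulary entropy) can be sketched as below. The exact definitions used for the reported numbers are not given in the text, so the choices here are assumptions: whitespace tokenization, Shannon entropy in bits over the unigram distribution, and add-one smoothed IDF from externally supplied document frequencies. The same entropy function could be applied to a list of truncated parse-tree strings instead of tokens.

```python
import math
from collections import Counter

def avg_sentence_length(sentences):
    """Mean number of whitespace-separated tokens per sentence."""
    return sum(len(s.split()) for s in sentences) / len(sentences)

def entropy(items):
    """Shannon entropy (bits) of the empirical distribution over items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def vocab_entropy(sentences):
    """Entropy of the unigram token distribution of a corpus sample."""
    return entropy(tok for s in sentences for tok in s.split())

def avg_idf(sentences, doc_freq, num_docs):
    """Mean IDF over tokens; doc_freq maps token -> document count.
    Add-one smoothing for unseen tokens is an assumption."""
    toks = [tok for s in sentences for tok in s.split()]
    return sum(math.log((num_docs + 1) / (doc_freq.get(t, 0) + 1))
               for t in toks) / len(toks)
```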
We also show some illustrative examples from the corpus to demonstrate the diversity of paraphrase pairs present in the data. While the corpus contains noise, we note that its quality can increase as NMT performance steadily improves. These examples are shown in Table 1, where we show each source sentence along with its translation.

Experiments
We now investigate how best to use our generated paraphrase data for training paraphrastic sentence embeddings, as well as general-purpose sentence embeddings to be used as features for a wide range of tasks, as has been done by Kiros et al. (2015) and Conneau et al. (2017).

Paraphrastic Sentence Embedding Evaluation
We evaluate the quality of a paraphrase dataset by using the experimental setting of Wieting et al. (2016b). We use the paraphrases as training data to create paraphrastic sentence embeddings, then evaluate the embeddings on the SemEval semantic textual similarity (STS) tasks from 2012 to 2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016; Cer et al., 2017). Lastly, we evaluate on the STS Benchmark (Cer et al., 2017). The STS Benchmark was designed to be a standard for comparing sentence representations across a multitude of domains and across multiple modeling approaches (i.e., separate tracks for using the training data, modeling each sentence individually, etc.). Given two sentences, the aim of the STS tasks is to predict their similarity on a 0-5 scale, where 0 indicates the sentences are on different topics and 5 indicates that they are completely equivalent. On our test sets, we report the average Pearson's r for each year of these sentence similarity tasks from 2012 to 2016. We use the very small (250 examples) English track from SemEval 2017 as a development set.
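The evaluation protocol, in which each sentence is encoded separately, pairs are scored by cosine similarity, and the scores are compared to the gold annotations with Pearson's r, can be sketched as follows. The function names are hypothetical, and the embeddings are assumed to be precomputed NumPy vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pearson_r(x, y):
    """Pearson correlation coefficient between two score lists."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def sts_score(emb1, emb2, gold):
    """Pearson's r between cosine predictions and gold 0-5 similarities."""
    preds = [cosine(u, v) for u, v in zip(emb1, emb2)]
    return pearson_r(preds, gold)

def yearly_average(dataset_scores):
    """Unweighted mean of Pearson's r over the datasets of one STS year."""
    return float(np.mean(dataset_scores))
```

Because Pearson's r is invariant to linear rescaling, the cosine scores do not need to be mapped onto the 0-5 range before comparison.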

Sentence Embedding Evaluation
We also evaluate our sentence embeddings on a range of tasks that have previously been used for evaluating sentence representations. These include sentiment analysis (MR, CR, SST), classifying the objectivity/subjectivity of a piece of text (SUBJ), opinion polarity (MPQA), question classification (TREC), paraphrase detection (MRPC), semantic relatedness (SICK), and textual entailment (SICK).

Experimental Setup
We use PARAGRAM-SL999 embeddings (Wieting et al., 2015) to initialize the word embedding matrix ($W_w$) for both models. For all experiments, we fix the mini-batch size to 100, $\lambda_w$ to 0, $\lambda_c$ to 0, and the margin $\delta$ to 0.4. We train all models for 5 epochs. For optimization we use Adam (Kingma and Ba, 2014) with a learning rate of 0.001.
For the recurrent neural networks, we set the scrambling rate to 0.3. We found, as in Wieting and Gimpel (2017), that scrambling significantly improves the results even though we are using far more sentence-level data than was used in those previous experiments. We found that a smaller value than the 0.5 primarily used by Wieting and Gimpel (2017) worked better, most likely due to the larger amount of training data.
Table 5: Results on the STS tasks of word-averaging and BiLSTM models with different sizes of mega-batches.

Filtering Data
In order to find the optimal training data for our models, we first ranked all 51M+ generated paraphrase pairs from the NMT system (see Section 3 for details) using 3 techniques: semantic similarity using the PARAGRAM-PHRASE model (Wieting et al., 2016b), trigram overlap, and translation score (the latter two were both used in ). Table 4 shows the results of the different filtering methods. We trained an LSTM Avg. model for a single epoch on 1M examples sampled from each of the ten folds for each filtering criterion. From the results, filtering by word similarity produces the highest quality data. We then randomly selected 5M examples, in all of which both sentences had no more than 30 tokens, from the top two scoring folds. These 5M examples were the training data for our experiments.
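The data selection steps above (rank by a score, split into ten folds, then sample length-limited pairs from the chosen folds) can be sketched as follows. The function names are hypothetical, whitespace tokenization is assumed for the length limit, and which folds to draw from is left as an input because it is determined empirically from the per-fold results.

```python
import random

def split_into_tenths(pairs, scores):
    """Sort pairs by ascending score and split them into ten folds,
    so fold 0 holds the lowest-scored 10% of the pairs."""
    order = sorted(range(len(pairs)), key=lambda i: scores[i])
    n = len(order)
    return [[pairs[i] for i in order[k * n // 10:(k + 1) * n // 10]]
            for k in range(10)]

def select_training_data(folds, fold_ids, max_len, sample_size, seed=0):
    """Randomly sample pairs from the chosen folds, keeping only pairs in
    which both sentences have at most max_len tokens."""
    pool = [p for k in fold_ids for p in folds[k]
            if all(len(s.split()) <= max_len for s in p)]
    rng = random.Random(seed)
    return rng.sample(pool, min(sample_size, len(pool)))
```

For example, `select_training_data(folds, fold_ids=[8, 9], max_len=30, sample_size=5_000_000)` would mirror the 30-token limit and 5M sample size described above, with the fold indices being placeholders for whichever two folds scored best.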

Effect of Mega-Batching
The effect mega-batching has on our results is shown in Table 5. We trained on our 5M example training set for 5 epochs.
The results show that for both Word Avg. and BiLSTM Avg., larger mega-batches increase performance. There isn't much gain moving from 20 to 40, but these high megabatch sizes are
Mega-batch size | Source Sentence | Negative Example
1 | sir, i'm just trying to protect. | i mean, colonel...
1 | i'm looking at him, you know? | they know that ive been looking for her.
1 | iil let it go a couple of rounds. | sometimes the ball doesn't go down.
20 | sir, i'm just trying to protect. | i only ask that the baby be safe.
20 | i'm looking at him, you know? | i'm keeping him.
20 | iil let it go a couple of rounds. | i'll take two.
40 | sir, i'm just trying to protect. | just trying to survive. on instinct.
40 | i'm looking at him, you know? | i looked at him with wonder.
40 | iil let it go a couple of rounds. | i want you to sit out a couple of rounds, all right?
in the megabatch) for different megabatch sizes for two sentences.
As can be seen, higher mega-batch values produce more interesting negative examples, and hence the model learns richer representations. This is likely more important when training the model on sentences, since they have so much more room for diverse meaning, unlike prior work on learning from text snippets (Wieting et al., 2015, 2016b; Pham et al., 2015).

Simple Models
We first investigate how well simple models (Word Avg., Trigram Avg., and LSTM Avg.) do on their own. Table 7 shows the results on the STS tasks from 2012-2016. Table 9 shows results on the STS Benchmark.
Next we investigate how well our embeddings do on general sentence embedding tasks. Note that all of these simple models have 300 dimensions, which is far fewer than the several thousands (often 2400-4800) of dimensions used in most prior work.

Larger Recurrent Networks
Since the LSTM Avg. model performed so well, we also investigated the effectiveness of scaling up the dimension of the output embedding and using BiLSTMs.
Interestingly, we found that higher dimensions did not increase performance on the STS tasks, but they did help on the general sentence embedding tasks.
8 All examples except the sentence itself and its pair.

Mixed Models
We next experimented with combining some of our models after noting that they have very different performance on different STS tasks. Table 10 shows the average absolute difference in Pearson's r over all 25 datasets. Hence, combining models and training them jointly could result in overall improvement. From the table, it seems that the Trigram Avg. model complements the two models using word embeddings, as the Trigram Avg. model can provide information on unknown words, which are a limitation of both Word Avg. and LSTM Avg.
We experimented with two natural ways of combining the embeddings. The first was to simply train the models jointly and form the final embedding by adding the output of each model. This would leave us with a fixed 300 dimensional embedding. The second approach was to simply concatenate the output embeddings of each model, creating a final vector that had n * 300 dimensions where n is the number of models being jointly trained.
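The two ways of combining model outputs described above can be sketched as below. The function names are hypothetical and the inputs are assumed to be the per-model output embeddings of a single sentence; joint training of the component models is not shown.

```python
import numpy as np

def combine_add(embeddings):
    """Sum the outputs of the models; the dimension is unchanged."""
    return np.sum(embeddings, axis=0)

def combine_concat(embeddings):
    """Concatenate the outputs of n models into one n * d vector."""
    return np.concatenate(embeddings, axis=-1)
```

Addition keeps the embedding at a fixed size regardless of how many models are combined, whereas concatenation lets each model keep its own coordinates at the cost of a larger vector.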
The mixed models show significant improvement over the individual models as well as over the larger LSTMs and BiLSTMs. Interestingly, combining Trigram Avg. and Word Avg. results in the best performance. We experimented with all combinations, but in the subsequent results we only show the combination of Trigram Avg. and Word Avg. and the combination of all three models.

Discussion
There are several noteworthy facts about our experiments. First, it is interesting to note the effect of embedding dimensionality on performance in the various evaluations. For semantic similarity, the dimensionality of the embeddings does not appear to have a strong effect. However, for the general sentence embedding evaluation, larger dimensions nearly always improve overall performance. This is presumably due to the fact that having more features leads to more trainable parameters in the subsequent classifier, increasing the ability of the classifier to linearly separate the data.
Secondly, while InferSent has strong performance across all tasks, our model obtains better results on semantic similarity. There is a considerable gap between our work and all previous work (outside of our own). Our models consistently perform around 10 points higher for each year of the STS tasks as well as on the STS Benchmark.
Table 7: Results on the STS tasks of our simplest models. We compare to the top performing systems from each SemEval. Note that we are reporting the mean result, not the weighted mean used in the competition. Our best performing overall model, Trigram-Word, is in bold.
(Hill et al., 2016) 100 -- 63 --
DictRep (Hill et al., 2016) 500 -- 67 --
CaptionRep (Hill et al., 2016) 500 -- 46 --
SkipThought (Kiros et al., 2015) 4800
Regarding the supervised tasks, we note that some result trends appear to be influenced by data domain. InferSent is trained on a dataset of mostly captions, especially the model trained on just SNLI. Therefore, the datasets for the SICK relatedness and entailment evaluations are similar in domain to the training domain of InferSent, especially entailment, since the target tasks are aligned. Our results on MRPC and entailment are significantly better than those of Skip-Thought vectors, and on tasks that are not on caption data (MRPC), we have performance competitive with that of InferSent.
To measure the effects of domain on the performance of InferSent, we performed additional experiments. We first compared the performance of both our model and their SNLI and AllNLI models on all STS tasks from 2012-2016. We then compared the overall mean with that of the three caption STS datasets within the collection. The results are shown in Table 11. The InferSent models are much closer to the Trigram-Word model on the caption datasets than overall, and InferSent trained on SNLI shows the largest difference between its overall performance and its performance on caption data.
Model MR CR SUBJ MPQA SST TREC MRPC SICK-R SICK-E
Unsupervised (Unordered Sentences)
Unigram-TFIDF (Hill et al., 2016)
(Zhao et al., 2015) 83
We also compared the performance of these models on the STS Benchmark under several conditions, as shown in Table 12. We compare the models' unsupervised performance on the entire dataset, on the entire dataset without caption data, and on just the caption data. We also include the performance in the supervised setting for these cases. From the table, it is clear that relative to our models, InferSent does much better on the caption data than on the non-caption data, providing evidence of a bias. Note that this bias is smaller for the model trained on AllNLI, as it incorporates other domains.
Overall, we believe that there are many avenues to training sentence embeddings, each with its own strengths and weaknesses. Ours are specialized for semantic similarity, entailment, and paraphrasing. However, they also do well on general tasks when scaled up, outperforming Skip-Thought and all of the earlier work on all tasks with the exception of X and Y. InferSent does well on many of these tasks, better than any published system to date.
We would also like to emphasize that our embeddings only require cheap structured data. There are millions to billions of pairs of parallel text, and more are created on a continual basis. We do not require data that needs human generation and labeling, which can be very expensive and well outside the budget of most research groups. This makes our models, even in the case of general sentence embeddings, very useful for other languages where such resources do not exist. Our model can also be extended to general similarity embeddings that are inter-lingual, with just a simple change to how the negative sampling is done. (The only change needed is that the negative sampling be done cross-lingually. However, doing it without regard to language would be an interesting experiment.)
It would be interesting to combine the best of these models, which have various specializations, in a single model. It seems that the model of Radford et al. (2017), InferSent, and our model are a natural fit, each with their own specializations. Combining these in a clever way could make truly capable sentence embeddings that show strong performance on nearly all NLP tasks. It would be interesting to see if adding in this extra semantic information even improves the paraphrastic performance of the final embeddings.
Pham et al., 2015) 63.9
(Pennington et al., 2014) 40.6
(Mikolov et al., 2013) 56.5
(Pagliardini et al., 2017) 75.5
Supervised
(Tai et al., 2015) LSTM 70.5
(Tai et al., 2015) BiLSTM 71.1
(Tai et al., 2015) Dep. Tree LSTM 71.2
(Tai et al., 2015) Const. Tree LSTM 71.9
(Shao, 2017) 78.4

Conclusion
This paper makes two valuable contributions. First, we release PARANMT-50M, a corpus of over 51M back-translations from the Czeng1.6 corpus, for general use. Second, we release code and trained models for state-of-the-art paraphrastic sentence embeddings, which also exhibit strong performance as general-purpose sentence representations for a multitude of tasks. We hope that this corpus, along with our improved paraphrastic sentence representations, can significantly improve the state of the art in many research areas, such as machine translation, paraphrase generation, and question answering, among many others.