Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.


Introduction
The ability to understand and produce paraphrases is a basic competency task, one that is often used as a teaching aid to validate if a student understands a statement or a concept. Current deep learning systems struggle with this task, exhibiting brittleness to both understanding and producing paraphrastic expressions (Iyyer et al., 2018).
One crucial factor behind this incompetence is the dearth of sentential paraphrastic data. Many works have sought to leverage the relative abundance of sub-sentential paraphrastic resources in paraphrase detection or generation (Napoles et al., 2016). Yet, they fail to capture contextualized word choices or syntactic variations, as word- or phrase-level resources cannot incorporate information from the whole input sentence.
Recent works have focused on leveraging bilingual resources to create large sentence-level paraphrastic collections using translation-based methods (Hu et al., 2019). However, these works are confined to using beam search in decoding, which tends not to produce diverse candidates. One approach to force diverse translations is the use of hard lexical constraints at inference time (Hu et al., 2019). While effective in some cases, current approaches to automatic selection of such constraints are based on heuristics and task-oriented trial-and-error.
We present a novel resource with accurate and collectively diverse paraphrases, generated using stochastic decoding and clustering. By collectively diverse, we mean that the paraphrases of a given sentence cover a wide lexical and syntactic spectrum. Given a bilingual input pair, our core idea is to sample a large space of outputs from a translation system, cluster the results according to a notion of token-sequence similarity, score them with two translation models (one in each direction), and then select the best item from each cluster. We believe that sampling from the word distribution at each decoder time-step better preserves the decoder's level of uncertainty, which is intrinsic to the goals of paraphrasing. We also sample ancillary lexical constraints to discourage, instead of explicitly prohibiting (Hu et al., 2019), certain words from being used by the decoder. While our experiment produces a large-scale English resource, our approach is dependent only on the availability of large bitexts and so is language-agnostic. We chose to build an English resource from CzEng to enable a direct comparison with PARANMT and PARABANK (Hu et al., 2019).
Our contributions include:
• A large, high-quality paraphrase collection with up to 5 paraphrases per reference, close to 100 million pairs in total, which are more diverse than prior work in two distinct ways, as measured by standardized metrics;
• An evaluation of semantic similarity, lexical and syntactic diversity, compared against prior works, along with results on the Semantic Textual Similarity (STS) Benchmark;
• Experiments on how our resource can be leveraged to improve performance on a set of language tasks.

Paraphrase generation pipeline
Prior works in constructing sentential paraphrastic resources have worked from large collections of bitext, producing translations of the foreign language sentence which, when paired with the target-language reference, constitute a set of paraphrases. Working from the very large CzEng parallel corpus, PARANMT was built by producing a single paraphrase for each English sentence, translated from the Czech source. Hu et al. (2019) expanded on this by translating the Czech sentence several times, using positive or negative constraints obtained from the English reference. In terms of producing diverse paraphrases, both approaches are limited because they rely on beam search. There are potentially billions of paraphrases of a sentence (Dreyer and Marcu, 2012), yet beam search with recurrent models can only explore a small subset of them, bounded by the beam size. There are techniques for producing more diverse paraphrases, such as the use of positive and negative constraints (Hu et al., 2019) or syntactic fragments (Iyyer et al., 2018), but these require the user to manually specify them, which can be cumbersome and unreliable.
We follow these prior works in working with CzEng, a Czech-English dataset (Bojar et al., 2016b), due to its size, diverse domain coverage, and rich syntactic variations, and to allow for a direct comparison in methodologies. However, we propose a new approach to paraphrase generation designed to increase paraphrastic diversity, using a multi-step process: the first part of the pipeline generates a large number of candidate paraphrases through a random process, and the second part whittles them down to a much shorter list. For each {source, target} input pair, we run the following pipeline:
1. Constrained sampling. We sample translations using a source→target translation model with lexical constraints. We obtain negative constraints by randomly selecting a set of tokens from the "source", so that they are not allowed to appear in the translations. Then, we decode each translation by sampling from only the top-k most probable tokens at each time step, after excluding constrained tokens (§2.1).
2. Dual scoring. The set of samples is then scored against the original source input using a target→source translation model. The scores from the forward and backward models are summed (§2.2).
3. Clustering. The samples are then clustered. The best item from each cluster (according to the summed score) is then returned (§2.3).

Constrained sampling
Sampling is a more effective way to explore model search space than beam search, particularly in auto-regressive models that do not permit dynamic programming. We introduce two means by which we can expand the hypothesis space, and produce a more diverse set of paraphrases, relative to straightforward beam search.
Top-k sampling In auto-regressive neural MT, the standard sampling approach would be to choose a word $w_t$ at each decoder timestep $t$ by sampling from the distribution $P(w_t \mid w_{1 \ldots t-1})$. This approach has been found effective over 1-best beam search in generating source sentences in back-translation (Edunov et al., 2018). However, for paraphrasing, this is not ideal, since words that are not semantically licensed by the source may be selected. Instead, we propose top-k sampling, in which we choose $w_t$ from the top $k$ most-probable tokens at each time step. This way, we allow the model to sample flexibly, vastly opening up the hypothesis space, without creating a large risk of producing nonsensical translations.
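To make the top-k step concrete, here is a minimal Python sketch of a single decoding step. It is not the decoder used in this work: the function name `top_k_sample`, the toy distribution, and the `banned` argument (standing in for negatively constrained tokens) are all illustrative assumptions.

```python
import math
import random

def top_k_sample(logprobs, k, rng, banned=frozenset()):
    """Sample one token from the k most probable non-banned tokens.

    logprobs: dict mapping token -> log-probability at the current time step.
    banned:   tokens removed before the top-k cut (hypothetical stand-in
              for negative constraints).
    """
    allowed = [(tok, lp) for tok, lp in logprobs.items() if tok not in banned]
    # Keep only the k highest-scoring tokens.
    top = sorted(allowed, key=lambda item: item[1], reverse=True)[:k]
    tokens = [tok for tok, _ in top]
    # random.choices treats the weights as relative, so the surviving
    # probabilities are implicitly renormalized.
    weights = [math.exp(lp) for _, lp in top]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy next-token distribution (made-up numbers).
step = {
    "big": math.log(0.5),
    "large": math.log(0.3),
    "huge": math.log(0.15),
    "xylophone": math.log(0.05),
}
rng = random.Random(0)
samples = {top_k_sample(step, k=3, rng=rng) for _ in range(200)}
```

With k = 3 the low-probability token "xylophone" can never be drawn, which is the point of truncating the distribution before sampling.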
Randomized negative constraints Negative constraints are tokens that are not permitted in the decoder output. They are not formally described in the literature, but an implementation was provided with the associated positive constraints (Post and Vilar, 2018). Negative constraints can be provided as tokens or phrases; the decoder tracks the progress of generation through each constraint and adds an infinite cost to the final word of any constraint, precluding its selection in both sampling and beam search.
In order to further increase sample diversity when generating the hypotheses (§2.1), we obtain negative constraints from the source by randomly choosing a subset of tokens. We do this independently multiple times for each input sentence. This provides new sets of constraints for the inputs, independent of the decoding.
Note that we use subword regularization (Kudo, 2018) during training, causing different subword segmentations to be applied to training data types each time they are encountered and helping to build more robust models. We only constrain on the Viterbi segmentation, effectively discouraging negatively constrained words from appearing in the output, instead of prohibiting them, since there are often ways for the model to produce a word by generating a different decomposition.
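The random selection of constraint sets can be sketched as follows. The text does not say how many tokens are drawn per set, so the `max_size` parameter and the uniform choice of subset size are assumptions made only for illustration, as is the function name `sample_constraint_sets`.

```python
import random

def sample_constraint_sets(tokens, n_sets, max_size, rng):
    """Draw n_sets independent random subsets of the distinct input tokens.

    The subset-size policy (uniform between 1 and max_size) is an
    assumption; the paper only states that subsets are chosen at random.
    """
    vocab = sorted(set(tokens))
    sets = []
    for _ in range(n_sets):
        size = rng.randint(1, min(max_size, len(vocab)))
        sets.append(frozenset(rng.sample(vocab, size)))
    return sets

# Example sentence; whitespace tokens stand in for real subword pieces.
sentence = "real life is sometimes thoughtless and mean".split()
constraint_sets = sample_constraint_sets(
    sentence, n_sets=5, max_size=3, rng=random.Random(13)
)
```

Each set would then be passed to a separate decoding run, so that the same input is translated under several different vocabularies of discouraged words.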

Back-translation likelihoods
Some semantic changes during paraphrasing, especially omission, are not well-reflected by the (forward) probability $p_{\text{generate}}$ from the generating model. However, a model running in the other direction can penalize this omission, as found by Goto and Tanaka (2017). Thus, we obtain the back-translation probability $p_{\text{back}}$ of each sampled candidate paraphrase, and define the final score for each candidate paraphrase as the joint probability $p^* = p_{\text{generate}} \cdot p_{\text{back}}$, which is equivalent to summing the negative log-likelihoods from the two models.
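The equivalence between multiplying the two probabilities and summing their negative log-likelihoods can be checked with a few lines of Python. The candidate sentences and all numbers below are invented for illustration; `combined_score` is a hypothetical helper name.

```python
import math

def combined_score(nll_forward, nll_backward):
    """Sum of the two negative log-likelihoods; lower is better."""
    return nll_forward + nll_backward

# -log(p_generate * p_back) == -log(p_generate) + -log(p_back)
p_generate, p_back = 0.20, 0.05
score = combined_score(-math.log(p_generate), -math.log(p_back))

# Made-up (candidate, forward NLL, backward NLL) triples: the first drops
# content, which the forward model tolerates but the backward model penalizes.
candidates = [
    ("Stay!", 1.2, 4.0),
    ("Stay where you are!", 1.5, 1.4),
]
best = min(candidates, key=lambda c: combined_score(c[1], c[2]))
```

In this toy case the forward score alone would prefer the shortened candidate, while the summed score prefers the one that keeps the full content.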

Edit-distance-based clustering
The above process produces a large set of translations of the source sentence. Many of them will be minor variants of one another, but we expect that there will be a lot of variety in the large pool. The task now is to reduce this pool to a small set of collectively diverse paraphrastic candidates.
We address this problem with k-means clustering via Levenshtein (or edit) distance (Miller et al., 2009). We compute this on lowercased, segmented candidates, after stripping punctuation. Clusters are initialized with the k furthest candidates measured by edit-distance. We also add the reference sentence as the centroid of an additional cluster and skip the re-centering for that cluster. This improves the chance of the k clusters congregating candidates different from the reference in different ways. When the clustering has converged, we take the candidate with the best score from each cluster (except for the one with the reference sentence), rank them by score, and take the best n as the final output.
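The clustering and selection step might be sketched as below. Several details are assumptions rather than the authors' implementation: candidates are split on whitespace instead of subword pieces, initialization uses a greedy farthest-first pass seeded by the reference, and each cluster is re-centered on its medoid (the member with the smallest total distance to the others), since edit distance has no natural mean. Scores are hypothetical, with lower meaning better.

```python
import string

def normalize(sentence):
    """Lowercase, strip punctuation, and split on whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return tuple(sentence.lower().translate(table).split())

def levenshtein(a, b):
    """Token-level edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def cluster_and_select(reference, scored, k, n, max_iter=20):
    """scored: list of (sentence, score) pairs, lower scores being better."""
    items = [(normalize(s), s, sc) for s, sc in scored]
    # Cluster 0 is centered on the reference and is never re-centered.
    centroids = [normalize(reference)]
    # Farthest-first initialization of the k remaining centroids (assumption).
    while len(centroids) < k + 1:
        far = max(items, key=lambda it: min(levenshtein(it[0], c) for c in centroids))
        centroids.append(far[0])
    assign = None
    for _ in range(max_iter):
        new = [min(range(k + 1), key=lambda c: levenshtein(it[0], centroids[c]))
               for it in items]
        if new == assign:
            break  # converged: assignments no longer change
        assign = new
        for c in range(1, k + 1):
            members = [it[0] for it, a in zip(items, assign) if a == c]
            if members:
                # Medoid update (assumption standing in for a k-means mean).
                centroids[c] = min(
                    members, key=lambda m: sum(levenshtein(m, o) for o in members)
                )
    # Best-scoring candidate per non-reference cluster, then the top n overall.
    best = {}
    for it, a in zip(items, assign):
        if a != 0 and (a not in best or it[2] < best[a][1]):
            best[a] = (it[1], it[2])
    return sorted(best.values(), key=lambda pair: pair[1])[:n]

candidates = [
    ("True life is sometimes ruthless and cruel.", 2.1),
    ("Actual life is sometimes ruthless and cruel.", 2.4),
    ("Sometimes real life is ruthless and cruel.", 2.2),
    ("Real life can be inconsiderate, cruel sometimes.", 2.9),
    ("Real life is sometimes thoughtless and mean!", 1.0),
]
picked = cluster_and_select(
    "Real life is sometimes thoughtless and mean.", candidates, k=2, n=2
)
```

The near copy of the reference has the best score, but it falls into the reference cluster and is discarded, which illustrates why the extra fixed cluster is useful.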

Data
All of our experiments are based on the CzEng 1.7 corpus, a subset of CzEng 1.6 (Bojar et al., 2016b) that has been chosen for higher quality. Based on experience with data quality issues in neural MT (Ott et al., 2018; Junczys-Dowmunt, 2018), we decided to further clean the corpus. First, we normalize Unicode punctuation, and keep only bilingual pairs whose English side can be encoded with latin-1 and Czech side with latin-2. We then filter the data with dual cross-entropy filtering (Junczys-Dowmunt, 2018). We use Sockeye (Hieber et al., 2017) to train two NMT models, CS-EN and EN-CS, on a relatively clean subset of the data provided for WMT 2018 (Bojar et al., 2016a): Europarl, Wiki titles, and news commentary. We use 4-layer Transformer models (Vaswani et al., 2017) trained to convergence, with held-out likelihood evaluated on a random 500-sentence subset of the WMT16 and WMT17 news test data. These models are then used to score all the remaining CzEng data after deduplication. We kept all sentences with a model score (negative log-likelihood) of less than 3.5. After applying the above two filters, we keep 19,723,003 out of the 57,065,358 pairs in CzEng 1.7.
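The encoding-based filter is simple to express in Python. This is only a sketch of that one step, under the assumption that "can be encoded" means every character exists in the codec; the helper names and example pairs are illustrative, and the Unicode punctuation normalization that precedes it in the text is not shown.

```python
def encodable(text, encoding):
    """True if every character of text exists in the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

def keep_pair(english, czech):
    """Keep a pair only if English fits latin-1 and Czech fits latin-2."""
    return encodable(english, "latin-1") and encodable(czech, "iso-8859-2")

# Made-up sentence pairs for illustration.
pairs = [
    ("Stay put!", "Zůstaň na místě!"),
    ("Price: 5 €", "Cena: 5 €"),       # the euro sign is not in latin-1
    ("Hello “friend”", "Ahoj"),        # curly quotes are not in latin-1
]
kept = [pair for pair in pairs if keep_pair(*pair)]
```

The third pair shows why punctuation is normalized first: without that step, otherwise clean sentences with typographic quotes would be dropped.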

Translation models
We train two new translation models on the filtered data, the CS-EN generation model (for generating English candidates via sampling) and the EN-CS scoring model (for providing backwards scores of the candidates). Both are Transformer models built with AWS SOCKEYE. The generation model is a 12-layer Transformer with a model and embedding size of 768, 12 attention heads, and a feed-forward layer size of 3072. The scoring model has 6 layers, model and embedding size of 512, 8 attention heads, and a feed-forward layer size of 2048.
All training data is pre-processed with subword sampling using SentencePiece (Kudo, 2018) with a vocabulary size of 20k and character coverage of 0.9999. We used separate models for Czech and English. At inference time, we use the Viterbi segmentation of each input sentence, for both the generation and scoring models.

Parameters
There are a few parameters involved in the sample-score-cluster pipeline. For each Czech input sentence, we generate 5 sets of random constraints (§2.1), creating 5 variants of the input. From each of these inputs, we generate 30 samples using top-k sampling with k = 10 (i.e., at each timestep, the model randomly chooses from the top 10 most probable words, according to their scaled distribution, and excluding negatively constrained words). The resulting 150 sentences are scored, and anything with a combined score greater than 3.5 is thrown out. The remaining sentences are clustered into 8 clusters, one of them centered on the English reference. The reference cluster is thrown out, and a list of the best-scoring translation from the remaining 7 clusters is constructed. From this list, the top 5 translations are returned as hypotheses.

Setup
We follow the evaluation framework of Hu et al. (2019), which judged semantic similarity between paraphrases and their reference through human evaluation, and lexical diversity via automatic metrics. We use the evaluation result made public by Hu et al. (2019) to enable a direct comparison. Rather than focusing on improving semantic similarity, which is limited by the quality of the bilingual resource, we seek to build a resource that contains both more lexical and more syntactic diversity.
We obtained the evaluation set from Hu et al. (2019), which contains 400 English sentences from CzEng. Due to additional filtering, 24 out of 400 (6%) reference sentences are not in PARABANK 2 and are therefore excluded from this evaluation.
We set the output size n = 5. After sorting the candidates by negative log-likelihood for each reference, we treat candidates at each rank as an individual system to investigate the expected quality of paraphrases under our approach. For references that produce fewer than 5 paraphrases, the paraphrase with the highest negative log-likelihood is duplicated to fill in ranks that otherwise would be empty. We also artificially pick the paraphrase with the maximum, minimum, and median human semantic similarity judgment under each reference as three additional oracle systems.

Semantic similarity via human judgments
For a fair comparison, we used the evaluation setup released by Hu et al. (2019), which uses the interface from EASL (Sakaguchi and Van Durme, 2018) to collect semantic similarity and grammaticality judgments. Each human annotator is presented with a reference sentence and five paraphrases from different sources. Annotators use a slider bar under each paraphrase to rate the semantic similarity from 0 (Opposite/Irrelevant) to 100 (Identical Meaning). Annotators are also asked to comment on whether the paraphrase is ungrammatical or nonsensical. The reference sentence is repeated next to the paraphrase for easier visual comparison.
Each paraphrase receives at least 3 independent judgments. Following Hu et al. (2019), we randomly add in the reference sentence as a paraphrase and filter out annotators who fail to score it 100 in more than 10% of such encounters. The result includes only annotators who contributed at least 25 judgments and is shown in Tab. 1.
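The two annotator filters can be written down as a short Python sketch. The data layout (tuples of annotator, control flag, score) and the function name are hypothetical, and it is an assumption that control items count toward the 25-judgment minimum; the text does not specify this.

```python
from collections import defaultdict

def reliable_annotators(judgments, max_fail_rate=0.10, min_judgments=25):
    """judgments: iterable of (annotator, is_control, score) tuples.

    A control item is the reference shown as its own paraphrase, which a
    careful annotator is expected to score 100.
    """
    total = defaultdict(int)
    controls = defaultdict(int)
    failures = defaultdict(int)
    for annotator, is_control, score in judgments:
        total[annotator] += 1
        if is_control:
            controls[annotator] += 1
            failures[annotator] += score != 100
    keep = set()
    for annotator, count in total.items():
        seen = controls[annotator]
        fail_rate = failures[annotator] / seen if seen else 0.0
        if count >= min_judgments and fail_rate <= max_fail_rate:
            keep.add(annotator)
    return keep

# Made-up judgments: A is careful, B misses 2 of 10 controls, C is too sparse.
judgments = (
    [("A", i < 10, 100 if i < 10 else 70) for i in range(30)]
    + [("B", i < 10, 90 if i < 2 else (100 if i < 10 else 60)) for i in range(30)]
    + [("C", False, 80) for _ in range(10)]
)
kept_annotators = reliable_annotators(judgments)
```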

Paraphrastic diversity
BLEU has been a successful metric in evaluating MT systems. However, as noted earlier, monolingual paraphrasing has inherently different objectives than cross-lingual translation. BLEU, in tandem with human evaluation of semantic similarity, makes a good metric for paraphrastic diversity. Here, we use 1-BLEU to measure how different the paraphrases are from the references. We generate 5 paraphrases for each reference sentence using the approach outlined in this work. To account for randomness, we average over two independent runs in the result, shown in Tab. 1.
Table 1: Paraphrastic diversity measured by (1-BLEU)×100, bag-of-word intersection/union score×100, and Tree edit-distance. Systems from this work that receive the best human judgments, worst human judgments, and the median, are included in the table. A higher 1-BLEU suggests higher paraphrastic diversity; a lower Intersection/Union score suggests a higher lexical diversity; a higher Tree edit-distance suggests a higher syntactic diversity. Best in each column, excluding oracle systems, is in bold. * denotes best oracle systems.
We consider two sources of paraphrastic diversity: 1) lexical diversity, the use of different words; and 2) syntactic diversity, the change of sentence or phrasal structure. We separately measure them using bag-of-word Intersection/Union scores and parse-tree edit-distances, respectively.
Lexical diversity A sentence is lexically different from the reference when it uses lexical paraphrases (e.g., synonyms) to convey similar meanings. We calculate the case-insensitive piece Intersection/Union score after stripping punctuation and the SentencePiece white space symbol. All pieces are put to lowercase and into a set. The more pieces the two sentences share, the higher the score will be. The Intersection/Union scores between the reference and the paraphrases are shown in Tab. 1.
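A minimal Python sketch of this Intersection/Union score follows. The piece lists are a made-up segmentation, not actual SentencePiece output, and `piece_iou` is an illustrative name; the behavior on two empty inputs (returning 1.0) is an arbitrary choice.

```python
import string

# Remove ASCII punctuation and the SentencePiece whitespace marker (U+2581).
_STRIP = str.maketrans("", "", string.punctuation + "\u2581")

def piece_iou(pieces_a, pieces_b):
    """Intersection/Union of two sets of lowercased, cleaned pieces."""
    def clean(pieces):
        cleaned = (piece.translate(_STRIP).lower() for piece in pieces)
        return {piece for piece in cleaned if piece}
    a, b = clean(pieces_a), clean(pieces_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

# Hypothetical segmentations of a reference and a paraphrase.
reference = ["\u2581Stay", "\u2581put", "!"]
paraphrase = ["\u2581Stay", "\u2581where", "\u2581you", "\u2581are", "!"]
iou = piece_iou(reference, paraphrase)  # one shared piece out of five
```

A score near 1 means the paraphrase reuses most of the reference's pieces, so lower values indicate more lexical variation.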

Syntactic diversity
We consider the edit-distance between the parse trees of the reference and the paraphrase as a metric of syntactic diversity. Parse tree edit-distance is considered a useful feature in NLP tasks (Yao et al., 2013). The more syntactic variations there are between two sentences, the larger the tree edit-distance will be. We consider only the top 3 levels of the parse trees, excluding any terminals. Sentences are parsed with Stanford CoreNLP (Manning et al., 2014); the tree edit-distance is calculated with the APTED (Pawlik and Augsten, 2015a,b) algorithm. The average tree edit-distance for each system is shown in Tab. 1.

Diversity among paraphrases
Hu et al. (2019) produced multiple paraphrases for each reference. While these were shown to be diverse compared to the reference, the authors did not investigate whether these paraphrases are trivial rewrites of one another, as is likely the case with beam search under a few lexical constraints. Our clustering step is specifically designed to retrieve collectively diverse paraphrases. We use the same metrics to evaluate pairs of systems from our work and compare them against PARABANK (Hu et al., 2019), as shown in Tab. 2. The max/min/median systems are oracle systems derived from human semantic similarity judgment scores. The human judgments from Tab. 1 show our paraphrases are of comparable quality to PARABANK, while maintaining a much higher degree of diversity among paraphrases of the same reference, as shown by automatic metrics.

Semantic similarity on STS Benchmark
In addition to evaluating via human judgments, we consider the same evaluation mechanism as PARANMT: the use of paraphrase corpora as training data for the Semantic Textual Similarity (STS) task. STS aims to measure the degree of equivalence in meaning or semantics between a pair of sentences. Notably, it has been a part of the SemEval workshop (2012-2017) (Agirre et al., 2016). The evaluation consists of human annotated English sentence pairs, scored on a scale of 0 to 5 to quantify similarity of meaning, with 0 being the least, and 5 the most similar. Prior work compared three encoding mechanisms: WORD, TRIGRAM and LSTM. The WORD model (Wieting et al., 2016) averages the embedding for each word in the sentence into a fixed length vector embedding for the sentence; the TRIGRAM model (Huang et al., 2013) averages over character trigrams; and the LSTM (Hochreiter and Schmidhuber, 1997) approach averages over the final hidden states to obtain the sentence embedding.
Table 2: Collective diversity within our work compared to PARABANK, as measured by (1-BLEU)×100, intersection/union score×100, and parse tree edit-distance.
Encoders are trained on paraphrase pairs $(s, s')$ with a margin-based loss function
$$l(s, s', t, t') = \max(0, \delta - \cos(g(s), g(s')) + \cos(g(s), g(t))) + \max(0, \delta - \cos(g(s), g(s')) + \cos(g(s'), g(t')))$$
where $g$ is one of (WORD, TRIGRAM, LSTM) and $(t, t')$ are negative samples selected from a megabatch, an aggregation of $m$ minibatches. We evaluate the WORD model trained on PARANMT, PARABANK and PARABANK 2 (our work). We retrieved the paraphrases from PARABANK and our work that share the same references as PARANMT-5M. Our work is evaluated as 5 systems, based on the rank in the output; the last available paraphrase is used when lower ranks are empty. We also include a system that uses a pair of paraphrases, instead of a reference and a paraphrase. We keep PARABANK paraphrases that have a bag-of-word intersection/union score of 0.7 or less, and use the 1-best based on regression scores. In Tab. 3, we report Pearson's r and Spearman's r on the STS'16 test set. Sentence embeddings trained on our work exhibit higher correlation with human judgments, which reflects the superior paraphrastic diversity of the corpus.
Footnote 3: We confirmed with Wieting and Gimpel that this loss captures their open implementation, which we employ. Wieting and Gimpel (2018) described their loss as $\max(0, \delta - \cos(g(s), g(s')) + \cos(g(s), g(t)))$, which is equivalent under their assumption that the paraphrases are equivalent.
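As a worked illustration of the margin loss, the following Python sketch evaluates it on toy vectors that stand in for encoder outputs g(·). The first hinge term is the one printed in the text; the mirrored second term, which uses s' as the anchor and t' as its negative, is an assumption about the symmetric form, and the margin value 0.4 is an arbitrary placeholder rather than the setting used in the experiments.

```python
import math

def cos(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def hinge(anchor, positive, negative, delta):
    """max(0, delta - cos(anchor, positive) + cos(anchor, negative))"""
    return max(0.0, delta - cos(anchor, positive) + cos(anchor, negative))

def margin_loss(gs, gs_prime, gt, gt_prime, delta=0.4):
    # First term: as printed in the text, with s as anchor and t as negative.
    # Second term: assumed mirror image, with s' as anchor and t' as negative.
    return hinge(gs, gs_prime, gt, delta) + hinge(gs_prime, gs, gt_prime, delta)

# Toy embeddings: an identical paraphrase pair with orthogonal negatives
# incurs no loss; a negative that coincides with the anchor incurs delta.
easy = margin_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
hard = margin_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

The loss is zero once each paraphrase pair is more similar than its negative by at least the margin, so only confusable negatives contribute to training.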

Improving contextualized encoders with paraphrastic data
Paraphrastic data can be used to fine-tune contextualized encoders such as BERT (Devlin et al., 2018). We frame the fine-tuning task as paraphrase identification (Das and Smith, 2009), where given a pair of sentences, the task is to classify them as paraphrases or non-paraphrases.
To generate the training data, we extract, for each sentence in PARANMT-5M, the sentence embeddings generated by the WORD model trained in §3.7. For each sentence s, we then find the (approximate) nearest neighbour n which is not s', among all of the sentences. We thus obtain two pairs, where (s, s') is a paraphrase pair, and (s, n) is a non-paraphrase pair. We use these to train a binary classifier with cross-entropy loss. We then use this BERT fine-tuned on paraphrases (henceforth pBERT) for fine-tuning on SQuAD 2.0 (Rajpurkar et al., 2018) and 4 NLP tasks present in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2019): Quora Question Pairs (QQP) (Chen et al., 2017), Multi-Genre Natural Language Inference (MNLI) (Williams et al., 2018), the Semantic Textual Similarity Benchmark (STS-B) (Agirre et al., 2016), and the Microsoft Research Paraphrase Corpus (MRPC) (Dolan et al., 2004). Following the model formulation, hyper-parameter selection and training procedure specified in Devlin et al. (2018), we add a single task-specific, randomly initialized output layer for the classifier.
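The construction of positive and negative training pairs can be sketched in Python as follows. For simplicity this uses an exact brute-force search by cosine similarity, whereas the text mentions an approximate nearest-neighbour search; the toy embeddings, sentence labels, and function names are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def build_pairs(embeddings, paraphrase_of):
    """Return (sentence_1, sentence_2, label) examples.

    embeddings:    dict sentence -> vector
    paraphrase_of: dict s -> s' (its known paraphrase)
    Label 1 marks a paraphrase pair, label 0 a hard non-paraphrase pair.
    """
    examples = []
    for s, s_prime in paraphrase_of.items():
        # Nearest neighbour of s among all sentences except s and s'.
        others = [t for t in embeddings if t not in (s, s_prime)]
        n = max(others, key=lambda t: cosine(embeddings[s], embeddings[t]))
        examples.append((s, s_prime, 1))
        examples.append((s, n, 0))
    return examples

# Toy 2-d embeddings standing in for WORD-model sentence vectors.
embeddings = {
    "stay put": [1.0, 0.0],
    "stay where you are": [0.9, 0.1],
    "stay tuned": [0.8, 0.6],
    "real life is cruel": [0.0, 1.0],
}
examples = build_pairs(embeddings, {"stay put": "stay where you are"})
```

Choosing the nearest non-paraphrase as the negative yields pairs that look similar on the surface, which makes the classification task harder than random negatives would.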
We present our results in Tab. 4 and Tab. 5. We observe gains for STS-B, MRPC and QQP, tasks strongly related to paraphrase identification. Fine-tuning on our paraphrase corpus also improves performance on SQuAD, a question-answering task, while slightly degrading performance on MNLI. Overall, simple fine-tuning of BERT on our corpus leads to improvements on downstream tasks, in particular when the task is related to paraphrase detection.

Related works

Paraphrastic resources
Paraphrastic resources exist across different scopes (i.e., lexical, phrasal, sentential) and different creation strategies (i.e., manually curated, automatically generated). For a more comprehensive survey on data-driven approaches to paraphrasing, please refer to Madnani and Dorr (2010).
Sub-sentential resources WordNet (Miller, 1995), FrameNet (Baker et al., 1998), and VerbNet (Schuler, 2006) can be used to extract paraphrastic expressions at lexical levels. They contain the grouping of words or phrases that share similar semantics and sometimes entailment relations.
While FrameNet and VerbNet do have example sentences or frames where lexical units are put into contexts, there are no explicit paraphrastic relations among these examples. Also, these datasets tend to be small, as they were curated manually. There have been efforts to augment such resources with automatic methods (Snow et al., 2006; Pavlick et al., 2015b), but they are still confined to the lexical level and sometimes require the use of other paraphrastic resources (Pavlick et al., 2015b).
PPDB (Ganitkevitch et al., 2013; Pavlick et al., 2015a) automated the generation of lexical paraphrases via bilingual pivoting, taking advantage of the relative abundance of bilingual corpora. While significantly larger and more informative (e.g., ranking, entailment relations, etc.) than the above manually curated resources, PPDB suffers from ambiguity as words or phrases are removed from their sentential contexts.
Sentential resources There exist multiple human translations in the same language for some classic readings. Barzilay and McKeown (2001) sought to extract lexical paraphrastic expressions from such sources. Unfortunately, such resources, along with those manually constructed for text generation research (Robin, 1995; Pang et al., 2003), are small and limited in domain.
PARANMT and PARABANK are two much larger sentential paraphrastic resources created through back-translation.

Table 6: Selected examples from our work, compared to paraphrastic resources with prior approaches. Our work has paraphrases that are not only different from the reference, but also diverse among themselves.

Example 1
Reference: Real life is sometimes thoughtless and mean.
PARANMT: real life is sometimes reckless and cruel .
PARABANK: The real life is occasionally ruthless and cruel. / The real world is occasionally ruthless and cruel. / The real life is sometimes reckless and cruel.
Our work: True life is sometimes ruthless and cruel. / Actual life is sometimes ruthless and cruel. / Sometimes real life is ruthless and cruel. / Real life can be inconsiderate, cruel sometimes. / Real living is a harsh and unscrupulous one, at times.

Example 2
Reference: Hey, stop right there!
PARANMT: hey , stop .
PARABANK: Stay where you are!
Our work: Hold your position! / Stay where you are! / Stay in position! / Remain where you are! / Stay put!

Translation-based Approaches
PARANMT is an automatically generated sentential paraphrastic resource through back-translating bilingual resources. It leveraged the imperfect ability of Neural Machine Translation (NMT) to recreate the translation target by conditioning on the source side of the bitext. PARABANK took a similar approach but with the inclusion of lexical constraints from the target side of the bitext. This step allows for multiple translations from one bilingual sentence pair and promotes lexical diversity. Their work, despite being larger and shown to be less noisy than PARANMT, relies on heuristics to produce hard constraints on the decoder, which often causes unintended changes in semantics or grammar.
Both works largely follow standard approaches in NMT, generating 1-best hypotheses given a source text and a set of constraints using beam search. Sentential paraphrasing, nevertheless, has fundamentally different objectives than MT. The latter strives to find the best elicitation that is both fluent and semantically close to the foreign text to convey information across languages. The former, on the other hand, seeks syntactically and lexically diverse expressions that convey the same meaning, with the goal of capturing the intrinsic flexibility and uncertainty of human communications. This work attempts to adapt the methodology to these objectives of monolingual paraphrasing.

Leveraging paraphrases in NLP
In the context of semantic parsing, Berant and Liang (2014) use a paraphrase classification module to determine the match between a canonical utterance and a logical form, both using a phrase table and distributed representations. To improve question answering (QA), Duboue and Chu-Carroll (2006) generate paraphrases of a given question using back-translation, and optionally replace the original question with the most relevant paraphrase. Dong et al. (2017) tackle QA by marginalizing the probability of an answer over a set of paraphrases, generated using rule-based and NMT-based methods. Fader et al. (2013) use a corpus of questions with paraphrases, to construct a corpus of semantically equivalent queries.
The task of paraphrase identification, which we use as a fine-tuning objective, has been studied as a task in itself. Das and Smith (2009) use grammars to perform generative modeling of paraphrases. Madnani et al. (2012) identify paraphrases by relying only on MT metrics as features. Ferreira et al. (2018) feed sentence similarity measured with hand-crafted features to machine learning algorithms. Convolutional neural networks have been introduced by Yin and Schütze (2015) and Chen et al. (2018), and further augmented with LSTMs (Kubal and Nimkar, 2018) and attention mechanisms (Fan et al., 2018).

Conclusions and future work
A presumed goal for building a sentential paraphrase resource is to capture different ways of expressing the same thing: diversity matters. Previous work on paraphrastic resource creation relied on decoding techniques from NMT using bilingual corpora, with limited success in promoting diverse expressions. We have presented a new community resource produced by sampling and clustering. We evaluated our method against prior works (Hu et al., 2019) and found significant gains in both lexical and syntactic diversity. Further, we have shown how straightforward fine-tuning of a state-of-the-art contextual encoder on our resource can improve performance on a variety of language tasks.