Unsupervised Paraphrasing by Simulated Annealing

We propose UPSA, a novel approach that accomplishes Unsupervised Paraphrasing by Simulated Annealing. We model paraphrase generation as an optimization problem and propose a sophisticated objective function involving the semantic similarity, expression diversity, and language fluency of paraphrases. UPSA searches the sentence space towards this objective by performing a sequence of local edits. We evaluate our approach on four datasets, namely, Quora, Wikianswers, MSCOCO, and Twitter. Extensive results show that UPSA achieves state-of-the-art performance compared with previous unsupervised methods in terms of both automatic and human evaluation. Further, our approach outperforms most existing domain-adapted supervised models, showing the generalizability of UPSA.


Introduction
Paraphrasing aims to restate one sentence as another with the same meaning but different wording. It constitutes a cornerstone in many NLP tasks, such as question answering (Mckeown, 1983), information retrieval (Knight and Marcu, 2000), and dialogue systems (Shah et al., 2018). However, automatically generating accurate and different-appearing paraphrases remains a challenging research problem, due to the complexity of natural language.
Conventional approaches (Prakash et al., 2016; Gupta et al., 2018) model paraphrase generation as a supervised encoding-decoding problem, inspired by machine translation systems. Usually, such models require massive parallel samples for training. In machine translation, for example, the WMT 2014 English-German dataset contains 4.5M sentence pairs (Neidert et al., 2014).

Figure 1: UPSA generates a paraphrase by a series of editing operations (i.e., insertion, replacement, and deletion). At each step, UPSA proposes a candidate modification of the sentence, which is accepted or rejected according to a certain acceptance rate (only accepted modifications are shown). Although sentences are discrete, we make an analogue on the continuous real x-axis, where the distance between two sentences is roughly given by the number of edits.
However, the training corpora for paraphrasing are usually small. The widely used Quora dataset only contains 140K pairs of paraphrases; constructing such human-written paraphrase pairs is expensive and labor-intensive. Further, existing paraphrase datasets are domain-specific: the Quora dataset only contains question sentences, and thus supervised paraphrase models do not generalize well to new domains. On the other hand, researchers synthesize pseudo-paraphrase pairs by clustering news events (Barzilay and Lee, 2003), crawling tweets of the same topic (Lan et al., 2017), or translating bilingual datasets (Wieting and Gimpel, 2017), but these methods typically yield noisy training sets, leading to low paraphrasing performance (Li et al., 2018).
As a result, unsupervised methods would largely benefit paraphrase generation, as no parallel data are needed. With the help of deep learning, researchers are able to generate paraphrases by sampling from a neural network-defined probability distribution, either in a continuous latent space (Bowman et al., 2016; Bao et al., 2019) or directly in the word space (Miao et al., 2019). However, the meaning preservation and expression diversity of the generated paraphrases are less "controllable" in such probabilistic sampling procedures.
To this end, we propose a novel approach to Unsupervised Paraphrasing by Simulated Annealing (UPSA). Simulated annealing (SA) is a stochastic searching algorithm towards an objective function, which can be flexibly defined. In our work, we design a sophisticated objective function, considering the semantic preservation, expression diversity, and language fluency of paraphrases. SA searches towards this objective by performing a sequence of local editing steps, namely, word replacement, insertion, deletion, and copy. At each step, UPSA first proposes a potential edit, and then accepts or rejects the proposal based on sample quality. In general, a better sentence (higher scored by the objective) is always accepted, while a worse sentence is likely to be rejected, but could also be accepted (controlled by an annealing temperature) to explore the search space in a less greedy fashion. At the beginning, the temperature is usually high, and worse sentences are more likely to be accepted, pushing SA out of local optima. The temperature is cooled down as the optimization proceeds, letting the model settle down to some optimum. We evaluate the effectiveness of our model on four paraphrasing datasets, namely, Quora, Wikianswers, MSCOCO, and Twitter. Experimental results show that UPSA achieves a new state-of-the-art unsupervised performance in terms of both automatic metrics and human evaluation.
In summary, our contributions are as follows:
• We propose the novel UPSA framework that addresses Unsupervised Paraphrasing by Simulated Annealing.
• We design a searching objective function for paraphrasing that not only considers language fluency and semantic similarity, but also explicitly models the expression diversity between a paraphrase and the input.
• We propose a copy mechanism as one of our simulated annealing search actions to address rare words.
• We achieve state-of-the-art performance on four benchmark datasets compared with previous unsupervised paraphrase generators, largely reducing the performance gap between unsupervised and supervised paraphrasing. We outperform most domain-adapted paraphrase generators, and even a supervised one on the Wikianswers dataset.

Related Work
In early years, paraphrasing was typically accomplished by exploiting linguistic knowledge (Mckeown, 1983; Ellsworth and Janin, 2007; Narayan et al., 2016) and statistical machine translation methods. Recently, deep neural networks have become a prevailing approach to text generation, where paraphrasing is often formulated as a supervised encoding-decoding problem, for example, using stacked residual LSTMs (Prakash et al., 2016) and the Transformer model (Wang et al., 2019). Unsupervised paraphrasing is an emerging research direction in the field of NLP. The variational autoencoder (VAE) can be intuitively applied to paraphrase generation in an unsupervised fashion, as we can sample sentences from a learned latent space (Bowman et al., 2016; Zhang et al., 2019; Bao et al., 2019). But the generated sentences are less controllable and suffer from the error accumulation problem in VAE's decoding phase (Miao et al., 2019). Roy and Grangier (2019) introduce an unsupervised model based on vector-quantized autoencoders (Van den Oord et al., 2017), but their work mainly focuses on generating sentences for data augmentation instead of paraphrasing itself. Miao et al. (2019) use Metropolis-Hastings sampling (Metropolis et al., 1953) for constrained sentence generation, achieving the state-of-the-art unsupervised paraphrasing performance. The main difference between their work and ours is that UPSA imposes the annealing temperature on the sampling process for better convergence to an optimum. In addition, we define our searching objective to involve not only semantic similarity and language fluency, but also expression diversity; we further propose a copy mechanism in our searching process.
Recently, a few studies have applied editing-based approaches to sentence generation. Guu et al. (2018) propose a heuristic delete-retrieve-generate component for a supervised sequence-to-sequence (Seq2Seq) model. Dong et al. (2019) learn deletion and insertion operations for text simplification in a supervised way, where the ground-truth operations are obtained by a dynamic programming algorithm. By contrast, our editing operations (insertion, deletion, and replacement) are the search actions of unsupervised simulated annealing.
Regarding discrete optimization/searching, a naïve approach is hill climbing (Edelkamp and Schroedl, 2011; Schumann et al., 2020; Kumar et al., 2020), which is in fact a greedy algorithm. In NLP, beam search (BS; Tillmann et al., 1997) is widely applied to sentence generation. BS maintains a k-best list in a partially greedy fashion during left-to-right (or right-to-left) decoding (Anderson et al., 2017; Zhou and Rush, 2019). By contrast, UPSA performs local search with edits distributed over the entire sentence. Moreover, UPSA is able to use the original sentence as an initial state of searching, whereas BS usually works in the decoder of a Seq2Seq model and is not applicable to unsupervised paraphrasing.

Approach
In this section, we present our novel UPSA framework that uses simulated annealing (SA) for unsupervised paraphrasing. In particular, we first present the general SA algorithm and then design our searching objective and searching actions (i.e., candidate sentence generator) for paraphrasing.

The Simulated Annealing Algorithm
Simulated Annealing (SA) is an effective and general metaheuristic of searching, especially for a large discrete or continuous space (Kirkpatrick et al., 1983).
Let X be a (huge) search space of sentences, and f(x) be an objective function. The goal is to search for a sentence x that maximizes f(x). At a searching step t, SA keeps a current sentence x_t and proposes a new candidate x^* by local editing. If the new candidate is better scored by f, i.e., f(x^*) > f(x_t), then SA accepts the proposal. Otherwise, SA tends to reject the proposal x^*, but may still accept it with a small probability e^{(f(x^*) − f(x_t))/T}, controlled by an annealing temperature T. In other words, the probability of accepting the proposal is

p_accept = min{ 1, e^{(f(x^*) − f(x_t))/T} }.     (1)

If the proposal is accepted, x_{t+1} = x^*; otherwise, x_{t+1} = x_t. Inspired by annealing in chemistry, the temperature T is usually high at the beginning of searching, leading to a high acceptance probability even if x^* is worse than x_t. Then, the temperature is decreased gradually as the search proceeds. In our work, we adopt a linear annealing schedule, given by T = max(0, T_init − C · t), where T_init is the initial temperature and C is the decreasing rate.
The high initial temperature of SA makes the algorithm less greedy compared with hill climbing, whereas the decreasing of temperature enables the algorithm to better settle down to a certain optimum.
Theoretically, simulated annealing is guaranteed to converge to the global optimum in a finite search space if the proposal and the temperature satisfy some mild conditions (Granville et al., 1994). Although such convergence may be slower than exhaustive search and the sentence space is, in fact, potentially infinite, simulated annealing is still a widely applied search algorithm, especially for discrete optimization. Readers may refer to Hwang (1988) for details of the SA algorithm.
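The acceptance rule in Eqn. (1) and the linear schedule can be sketched in a few lines of Python (a minimal illustration; the function names are ours, and the default T_init and C values anticipate the hyperparameter settings reported later in the paper):

```python
import math

def accept_probability(f_new, f_old, T):
    """Eqn. (1): always accept an improvement; accept a worse candidate
    with probability exp((f_new - f_old) / T), which shrinks as T cools."""
    if f_new > f_old:
        return 1.0
    if T <= 0:  # zero temperature degenerates to hill climbing
        return 0.0
    return math.exp((f_new - f_old) / T)

def temperature(t, T_init=3e-2, C=3e-4):
    """Linear annealing schedule T = max(0, T_init - C * t)."""
    return max(0.0, T_init - C * t)
```

At t = 0 the temperature equals T_init; it reaches zero after T_init/C steps, after which the search is purely greedy.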

Objective Function
Simulated annealing maximizes an objective function, which can be flexibly specified in different applications. In particular, our UPSA objective considers multiple aspects of a candidate paraphrase, including semantic preservation f_sem, expression diversity f_exp, and language fluency f_flu. Thus, our searching objective is to maximize

f(x) = f_sem(x, x_0) · f_exp(x, x_0) · f_flu(x),     (2)

where x_0 is the input sentence.

Semantic Preservation. A paraphrase is expected to capture all the key semantics of the original sentence. Thus, we leverage the cosine similarity of keyword embeddings to measure whether the key focus of the candidate paraphrase is the same as the input. Specifically, we extract the keywords of the input sentence x_0 by the Rake system (Rose et al., 2010) and embed them by GloVe (Pennington et al., 2014). For each keyword, we find the closest word in the candidate paraphrase x^* in terms of cosine similarity. Our keyword-based semantic preservation score is given by the lowest cosine similarity among all the keywords, i.e., the least matched keyword:

f_sem,key(x^*, x_0) = min_{e ∈ keywords(x_0)} max_j cos(w_{*,j}, e),     (3)

where w_{*,j} is the jth word in the sentence x^*, and e is an extracted keyword of x_0. Bold letters indicate embedding vectors.
In addition to keyword embeddings, we also adopt a sentence-level similarity function based on Sent2Vec embeddings (Pagliardini et al., 2017). Sent2Vec learns n-gram embeddings and computes the average of the n-gram embeddings as the sentence vector. It has been shown to yield significant improvements over other unsupervised sentence embedding methods in similarity evaluation tasks (Pagliardini et al., 2017). Let x^* and x_0 be the Sent2Vec embeddings of the candidate paraphrase and the input sentence, respectively. Our sentence-based semantic preservation scoring function is

f_sem,sen(x^*, x_0) = cos(x^*, x_0).     (4)

To sum up, the overall semantic preservation scoring function of UPSA is given by

f_sem(x^*, x_0) = f_sem,key(x^*, x_0)^P · f_sem,sen(x^*, x_0)^Q,

where P and Q are hyperparameters balancing the importance of the two factors. Here, we use power weights because the scoring functions are multiplicative.

Expression Diversity. The expression diversity scoring function computes the lexical difference between two sentences. We adopt a BLEU-induced function to penalize the repetition of the words and phrases of the input sentence:

f_exp(x^*, x_0) = (1 − BLEU(x^*, x_0))^S,     (5)

where the BLEU score (Papineni et al., 2002) computes a length-penalized geometric mean of n-gram precisions (n = 1, · · · , 4), and the power S coordinates the importance of f_exp in the objective function (2).
Language Fluency. Besides semantic preservation and expression diversity, the candidate paraphrase should be a fluent sentence by itself. We use a separately trained (forward) language model (denoted as →LM) to compute the likelihood of the candidate paraphrase as our fluency scoring function:

f_flu(x^*) = ∏_{i=1}^{l_*} p_→LM(w_{*,i} | w_{*,1}, . . . , w_{*,i−1}),

where l_* is the length of x^* and w_{*,1}, . . . , w_{*,l_*} are the words of x^*. Here, we use a dataset-specific language model, trained on non-parallel sentences. Notice that a weighting hyperparameter is not needed for f_flu, because the relative weights of the factors in Eqn. (2) are given by the powers in f_sem,key, f_sem,sen, and f_exp.
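The scoring functions above can be sketched as follows (a simplified Python illustration with our own function names; the real system uses Rake keywords, GloVe and Sent2Vec embeddings, and an LSTM language model, all of which we abstract away here as precomputed vectors and scores):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def f_sem_key(keyword_vecs, cand_word_vecs):
    """Keyword score: each keyword of x0 is matched to its closest word
    in x*; the score is that of the least-matched keyword (the min)."""
    return min(max(cosine(e, w) for w in cand_word_vecs)
               for e in keyword_vecs)

def objective(f_key, f_sen, one_minus_bleu, f_flu, P=8, Q=1, S=1):
    """Multiplicative objective (2): power weights P, Q, S balance keyword
    similarity, sentence similarity, and expression diversity."""
    return (f_key ** P) * (f_sen ** Q) * (one_minus_bleu ** S) * f_flu
```

Note how the multiplicative form makes powers (rather than additive weights) the natural way to balance the factors: raising a score in [0, 1] to a larger power penalizes low values of that factor more heavily.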

Candidate Sentence Generator
As mentioned, simulated annealing proposes a candidate sentence, given by different search actions. Since each action yields a new sentence x * from x t , we call it a candidate sentence generator. While the proposal of candidate sentences does not affect convergence in theory (if some mild conditions are satisfied), it may largely influence the efficiency of SA searching.
In our work, we mostly adopt the word-level editing in Miao et al. (2019) as our searching actions, but we differ in sampling distributions and further propose a copy mechanism for editing.
At each step t, the candidate sentence generator randomly samples an editing position k and an editing operation, namely, replacement, insertion, or deletion. For replacement and insertion, the candidate sentence generator also samples a candidate word. Let the current sentence be x_t = (w_{t,1}, . . . , w_{t,k−1}, w_{t,k}, w_{t,k+1}, . . . , w_{t,l_t}). If the replacement operation proposes a candidate word w^* for the kth position, the resulting candidate sentence becomes x^* = (w_{t,1}, . . . , w_{t,k−1}, w^*, w_{t,k+1}, . . . , w_{t,l_t}). The insertion operation works similarly.
Here, the candidate word is sampled from a probabilistic distribution induced by the objective function (2):

p(w^* | x_t) = f(x^*_{w^*}) / Z,  with  Z = Σ_{w ∈ W} f(x^*_w),

where x^*_w denotes the candidate sentence with w filled in at position k, W is the sampling vocabulary, and Z is known as the normalizing factor (noticing that our scoring functions are nonnegative). We observe that sampling from such an objective-induced distribution typically yields a meaningful candidate sentence, which enables SA to explore the search space more efficiently.
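This objective-induced sampling amounts to a standard weighted draw, sketched below (a hypothetical helper; in practice, scores[w] would be the objective f evaluated on the sentence with w substituted at position k):

```python
import random

def sample_word(words, scores, rng=random):
    """Draw w* with probability f(x*_w) / Z, where Z sums the
    (nonnegative) objective scores over the sampling vocabulary."""
    Z = sum(scores[w] for w in words)
    r = rng.random() * Z
    acc = 0.0
    for w in words:
        acc += scores[w]
        if r < acc:
            return w
    return words[-1]  # guard against floating-point round-off
```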
It is also noted that sampling a word from the entire vocabulary involves re-evaluating (2) for each candidate word. Therefore, we also follow Miao et al. (2019) and only sample from the top-K words, given by jointly considering a forward language model and a backward language model. The replacement operator, for example, suggests the top-K word vocabulary

W_{t,replace} = top-K_{w^*} [ p_→LM(w^* | w_{t,1}, . . . , w_{t,k−1}) · p_←LM(w^* | w_{t,k+1}, . . . , w_{t,l_t}) ].

For word insertion, the top-K vocabulary W_{t,insert} is computed in a similar way (except that the position of w^* is slightly different). Details are not repeated. In our experiments, K is set to 50.

Algorithm 1 UPSA
1: Input: Original sentence x_0
2: for t ∈ {1, . . . , N} do
3:   T = max{T_init − C · t, 0}
4:   Randomly choose an editing operation and a position k
5:   Obtain a candidate x^* by the candidate sentence generator
6:   Compute the acceptance probability p_accept by Eqn. (1)
7:   With probability p_accept, x_{t+1} = x^*
8:   With probability 1 − p_accept, x_{t+1} = x_t
9: end for
10: return x_τ s.t. τ = argmax_{τ ∈ {1,...,N}} f(x_τ)
Copy Mechanism. We observe that named entities and rare words are sometimes deleted or replaced during SA's stochastic sampling. They are difficult to recover because they usually have a low language model-suggested probability.
Therefore, we propose a copy mechanism for SA sampling, inspired by that in Seq2Seq learning (Gu et al., 2016). Specifically, we allow the candidate sentence generator to copy words from the original sentence x_0 for word replacement and insertion. This essentially enlarges the top-K sampling vocabulary with the words in x_0:

W̃_{t,op} = W_{t,op} ∪ {w_{0,1}, . . . , w_{0,l_0}},  where op ∈ {replace, insert}.

Thus, W̃_{t,op} is the actual vocabulary from which SA samples the word w^* for the replacement and insertion operations. While such a vocabulary reduces the proposal space, it works well empirically, because other low-ranked candidate words are either irrelevant or make the sentence disfluent; they usually have low objective scores and are likely to be rejected even if sampled.
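Building the enlarged vocabulary can be sketched as follows (a minimal illustration; names are ours, and lm_scores stands for the product of forward and backward LM probabilities at the editing position):

```python
def copy_enlarged_vocabulary(lm_scores, original_words, K=50):
    """Take the top-K words ranked by the forward*backward LM score,
    then add every word of the original sentence x0 (copy mechanism)."""
    top_k = sorted(lm_scores, key=lm_scores.get, reverse=True)[:K]
    return set(top_k) | set(original_words)
```

For example, with K = 2 and LM scores {'what': 0.5, 'how': 0.3, 'why': 0.1}, an input containing the rare token 'upsa' keeps 'upsa' available for replacement and insertion even though the language models would never rank it highly.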

Overall Optimization Process
We summarize our UPSA algorithm in Algorithm 1.
Given an input x_0, UPSA searches the sentence space to maximize our objective f(x), which involves semantic preservation, expression diversity, and language fluency. UPSA starts from x_0 itself. At each step, it randomly selects a search action (namely, word insertion, deletion, or replacement) at a position k (Line 4); if insertion or replacement is selected, UPSA also proposes a candidate word, so that a candidate paraphrase x^* is formed (Line 5). Then, UPSA computes an acceptance probability p_accept based on the increment of f and the temperature T (Line 6). The sentence x_{t+1} for the next step becomes x^* if the proposal is accepted, or remains x_t if the proposal is rejected. After the maximum number of search iterations, we choose the sentence x_τ that yields the highest score.
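The overall loop of Algorithm 1 can be rendered end-to-end as a short sketch (a toy Python version with our own names; `propose` stands in for the candidate sentence generator, and the state may be any object the objective can score):

```python
import math
import random

def upsa(x0, objective, propose, n_iter=100, T_init=3e-2, C=3e-4, rng=random):
    """Anneal from x0 and return the best-scoring state ever visited."""
    x = x0
    best, best_score = x0, objective(x0)
    for t in range(1, n_iter + 1):
        T = max(0.0, T_init - C * t)   # Line 3: linear schedule
        cand = propose(x, rng)         # Lines 4-5: candidate generator
        delta = objective(cand) - objective(x)
        # Lines 6-8: accept improvements; accept worse states w.p. e^(delta/T)
        if delta > 0 or (T > 0 and rng.random() < math.exp(delta / T)):
            x = cand
        if objective(x) > best_score:
            best, best_score = x, objective(x)
    return best                        # Line 10: argmax over visited states
```

As a toy usage example, maximizing -(x - 5)**2 over the integers with propose = lambda x, rng: x + rng.choice([-1, 1]) drifts toward 5, with occasional uphill rejections once the temperature cools.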

Datasets
Quora. The Quora question pair dataset contains 140K parallel paraphrases and an additional 260K pairs of non-parallel sentences. We follow the unsupervised setting in Miao et al. (2019), where 3K and 20K pairs are used for validation and testing, respectively. Wikianswers.
The original Wikianswers dataset (Fader et al., 2013) contains 2.3M pairs of question paraphrases from the Wikianswers website. Since our model only involves training a language model, we randomly selected 500K non-parallel sentences for training. For evaluation, we followed the same protocol as previous work and randomly sampled 5K pairs for validation and 20K for testing. Although the exact data split in previous work is not available, our results are comparable to previous ones in the statistical sense.
MSCOCO. The MSCOCO dataset contains 500K+ paraphrase pairs for ∼120K image captions (Lin et al., 2014). We follow the standard split (Lin et al., 2014) and the evaluation protocol in Prakash et al. (2016), where only image captions with fewer than 15 words are considered, since some captions are extremely long (e.g., 60 words).
Twitter. The Twitter URL paraphrasing corpus (Lan et al., 2017) was originally constructed for paraphrase identification. We follow the standard train/test split, but take 10% of the training data as the validation set. The remaining samples are used to train our language model. For the test set, we only consider sentence pairs labeled as "paraphrases," which results in 566 test cases.

Competing Methods and Metrics
Unsupervised paraphrasing is an emerging research topic. We compare UPSA with recent discrete and continuous sampling-based paraphrase generators, namely, VAE, Lag VAE (He et al., 2019), and CGMH (Miao et al., 2019). Early work on unsupervised paraphrasing typically adopts rule-based methods (Mckeown, 1983; Barzilay and Lee, 2003). Their performance could not be verified on the above datasets, since the extracted rules are not available; therefore, we are unable to compare with them in this paper. Also, rule-based systems usually do not generalize well to different domains. In the following, we describe our competing methods. VAE. We train a variational autoencoder (VAE) with two-layer, 300-dimensional LSTM units. The VAE is trained on non-parallel corpora by maximizing the variational lower bound of the log-likelihood; during inference, sentences are sampled from the learned variational latent space (Bowman et al., 2016).
Lag VAE. He et al. (2019) propose to aggressively optimize the inference process of VAE with more updates to address the posterior collapse problem (Chen et al., 2017). This method has been reported to be the state-of-the-art VAE. We adopted the published source code and generated paraphrases for comparison.
CGMH. Miao et al. (2019) use Metropolis-Hastings sampling in the word space for constrained sentence generation. It is shown to outperform latent space sampling as in VAE, and is the state-of-the-art unsupervised paraphrasing approach. We also adopted the published source code and generated paraphrases for comparison.
We further compare UPSA with supervised Seq2Seq paraphrase generators: ResidualLSTM (Prakash et al., 2016), VAE-SVG-eq (Gupta et al., 2018), Pointer-generator (See et al., 2017), the Transformer (Vaswani et al., 2017), and the decomposable neural paraphrase generator (DNPG). DNPG has been reported as the state-of-the-art supervised paraphrase generator. To better compare UPSA with all paraphrasing settings, we also include domain-adapted supervised paraphrase generators that are trained on a source domain but tested on a target domain, including shallow fusion (Gulcehre et al., 2015) and multi-task learning (MTL; Domhan and Hieber, 2017).
We adopt BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) scores as automatic metrics to evaluate model performance. Sun and Zhou (2012) observe that BLEU and ROUGE cannot measure the diversity between the generated and the original sentences, and propose the iBLEU variant, which penalizes the similarity with the original sentence. Therefore, we regard the iBLEU score as our major metric, as also adopted in previous work. In addition, we conduct human evaluation in our experiments (detailed later).
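The iBLEU variant can be sketched as follows (a minimal rendering; the balancing weight `alpha` and its default value are our assumption, following the range suggested by Sun and Zhou (2012), and the two BLEU scores are assumed to be computed externally, e.g. by NLTK or sacrebleu):

```python
def ibleu(bleu_ref, bleu_src, alpha=0.9):
    """iBLEU = alpha * BLEU(candidate, references)
             - (1 - alpha) * BLEU(candidate, source).
    Rewards closeness to the reference paraphrases while penalizing
    n-gram overlap with the source sentence."""
    return alpha * bleu_ref - (1 - alpha) * bleu_src
```

A candidate identical to the source scores BLEU(candidate, source) = 1 and is thus penalized even when it also happens to match a reference, which is exactly the diversity failure mode that plain BLEU ignores.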

Implementation Details
Our method involves unsupervised language modeling (forward and backward), realized by two-layer LSTM with 300 hidden units and trained specifically on each dataset with non-parallel sentences.
For hyperparameter tuning, we applied a grid search procedure on the validation set of the Quora dataset using the iBLEU metric. The power weights P, Q, and S in the objective were 8, 1, and 1, respectively, chosen from {0.5, 1, 2, . . . , 8}.
The initial temperature T_init was chosen from {0.5, 1, 3, 5, 7, 9} × 10^−2 and set to T_init = 3 × 10^−2 by validation. The magnitude of T_init appears small here, but in fact depends on the scale of the objective function. The annealing rate C was set to T_init / #Iteration = 3 × 10^−4, where our number of iterations (#Iteration) was 100.
We should emphasize that all SA hyperparameters were validated only on the Quora dataset, and we did not perform any tuning on the other datasets (except for the language model). This shows the robustness of UPSA and its hyperparameters.

Table 1 presents the performance of all competing methods on the Quora and Wikianswers datasets. The unsupervised methods are trained only on non-parallel sentences. The supervised models were trained on 100K paraphrase pairs for Quora and 500K pairs for Wikianswers. The domain-adapted supervised methods are trained on one dataset (Quora or Wikianswers), adapted using non-parallel text on the other (Wikianswers or Quora), and eventually tested on the latter domain.

Results
We observe in Table 1 that, among the unsupervised approaches, VAE and Lag VAE achieve the worst performance on both datasets, indicating that paraphrasing by latent-space sampling is inferior to word editing. We further observe that UPSA yields significantly better results than CGMH: the iBLEU score of UPSA is higher than that of CGMH by 2-5 points. This shows that paraphrase generation is better modeled as an optimization process than as sampling from a distribution.
It is interesting to see how our unsupervised paraphrase generator compares with supervised ones, should large-scale parallel data be available. Admittedly, supervised approaches generally outperform UPSA, as they can learn from massive parallel data. UPSA nevertheless achieves results comparable to the recent ResidualLSTM model (Prakash et al., 2016), reducing the gap between supervised and unsupervised paraphrasing.
In addition, UPSA can easily be applied to new datasets and new domains, whereas the supervised setting does not generalize well. This is shown by a domain adaptation experiment, where a supervised model is trained on one domain but tested on the other. We notice in Table 1 that the performance of supervised models (e.g., Transformer+Copy) decreases drastically on out-of-domain sentences, even though both Quora and Wikianswers consist of question sentences. The performance is expected to decrease further if the source and target domains are more different. UPSA outperforms all supervised domain-adapted paraphrase generators (except DNPG on the Wikianswers dataset).

Table 2 shows model performance on the MSCOCO and Twitter corpora. These datasets are less used for paraphrase generation than Quora and Wikianswers, and thus we could only compare unsupervised approaches by running existing codebases. Again, we see the same trend as in Table 1: UPSA achieves the best performance, CGMH second, and the VAEs worst. It is also noted that the Twitter corpus yields lower iBLEU scores for all models, largely due to the noise of Twitter utterances (Lan et al., 2017). However, the consistent results demonstrate that UPSA is robust and generalizes to different domains (without hyperparameter re-tuning).
Human Evaluation. We also conducted human evaluation on the generated paraphrases. Due to the limited budget and resources, we sampled 300 sentences from the Quora test set and only compared the unsupervised methods (the main focus of our work). Selecting a subset of models and data samples is a common practice for human evaluation (Wang et al., 2019). We asked three human annotators to evaluate the generated paraphrases in terms of relevance and fluency; each aspect was scored from 1 to 5. We report the average human scores and Cohen's kappa (Cohen, 1960). It should be emphasized that our human evaluation was conducted in a blind fashion.

Table 3 shows that UPSA achieves the highest human satisfaction scores in terms of both relevance and fluency, and the kappa scores indicate moderate inter-annotator agreement (Landis and Koch, 1977). The results are also consistent with the automatic metrics in Tables 1 and 2. We further conducted two-sided Wilcoxon signed-rank tests. The improvement of UPSA is statistically significant with p < 0.01 in both aspects, compared with both competing methods.

Model Analysis
We analyze UPSA in more detail on the most widely-used Quora dataset, with a test subset of 2000 samples.
Ablation Study. We first evaluate the searching objective function (2) in Lines 1-4 of Table 4. The results show that each component of our objective (namely, keyword similarity, sentence similarity, and expression diversity) does play its role in paraphrase generation.
Line 5 of Table 4 shows the effect of our copy mechanism, which is used in word replacement and insertion. Keeping the words of the original sentence available for sampling yields roughly one iBLEU point of improvement.
Finally, we test the effect of temperature decay in SA. Line 6 shows the performance if we fix the temperature at its initial value during the whole searching process, which is similar to Metropolis-Hastings sampling. The result shows the importance of the annealing schedule. It also verifies our intuition that sentence generation (in particular, paraphrasing in this paper) is better modeled as a searching problem than as a sampling problem.

Analysis of the Initial Temperature. We fixed the decreasing rate to C = 1 × 10^−4 and chose the initial temperature T_init from {0, 0.5, 1, 3, 5, 7, 9, 11, 15, 21} × 10^−2. In particular, T_init = 0 is equivalent to hill climbing (greedy search). The trend is plotted in Figure 2.
It is seen that a high temperature yields worse performance (with the other hyperparameters fixed), because in this case UPSA accepts more worse sentences and is less likely to settle down. On the other hand, a low temperature makes UPSA greedier, also resulting in worse performance. In particular, simulated annealing largely outperforms greedy search (temperature 0).
We further observe that BLEU and iBLEU peak at different values of the initial temperature. This is because a lower temperature indicates a greedier strategy with less editing, and if the input sentence is not changed much, we may indeed obtain a higher BLEU score. But our major metric iBLEU penalizes the similarity to the input and thus prefers a higher temperature. We chose T_init = 0.03 by validating on iBLEU.

Table 5: Example paraphrases generated by different methods on the Quora dataset. The average score given by three annotators is shown at the end of each generated sentence.
Case Study. We showcase several generated paraphrases in Table 5. Qualitatively, UPSA produces more reasonable paraphrases than the other methods in terms of both closeness in meaning and difference in expression, and it can make non-local transformations. For example, "places for spring snowboarding in the US" is paraphrased as "places in the US for snowboarding." Admittedly, such samples are relatively rare, and our current UPSA mainly synthesizes paraphrases by editing words in the sentence, whereas the syntax is mostly preserved. This is partially due to the difficulty of exploring the entire (discrete) sentence space even with simulated annealing, and partially due to the insensitivity of the similarity objective given two very different sentences.

Conclusion and Future Work
In this paper, we proposed UPSA, a novel unsupervised approach that generates paraphrases by simulated annealing. Experiments on four datasets show that UPSA outperforms previous state-of-the-art unsupervised methods by a large margin.
In the future, we plan to apply the SA framework on syntactic parse trees in hopes of generating more syntactically different sentences (motivated by our case study).