Multilingual Whispers: Generating Paraphrases with Translation

Naturally occurring paraphrase data, such as multiple news stories about the same event, is a useful but rare resource. This paper compares translation-based paraphrase gathering using human, automatic, or hybrid techniques to monolingual paraphrasing by experts and non-experts. We gather translations, paraphrases, and empirical human quality assessments of these approaches. Neural machine translation techniques, especially when pivoting through related languages, provide a relatively robust source of paraphrases with diversity comparable to expert human paraphrases. Surprisingly, human translators do not reliably outperform neural systems. The resulting data release will not only be a useful test set, but will also allow additional explorations in translation and paraphrase quality assessments and relationships.


Introduction
Humans naturally paraphrase. These paraphrases are often a byproduct: when we can't recall the exact words, we can often generate approximately the same meaning with a different surface realization. Recognizing and generating paraphrases are key challenges in many tasks, including translation, information retrieval, question answering, and semantic parsing. Large collections of sentential paraphrase corpora could benefit such systems. (Expanding beyond the sentence boundary is also very important, though we do not explore cross-sentence phenomena in this paper.) Yet when we ask humans to generate paraphrases of a given sentence, they are often a bit stuck: how much should be changed? Annotators tend to preserve the reference expression, a safe choice, since the only truly equivalent representation is to leave the text unchanged. Each time we replace a word with a synonym, some shades of meaning change; some connotations or even denotations shift. One path around the obstacle of reference bias is to provide a non-linguistic input, then ask humans to describe this input in language. For instance, crowd-sourced descriptions of videos provide a rich source of paraphrase data that is grounded in visual phenomena (Chen and Dolan, 2011). Such visual grounding helps users focus on a clear and specific activity without imparting a bias toward particular lexical realizations. Unfortunately, these paraphrases are limited to phenomena that can be realized visually. Another path is to find multiple news stories describing the same event (Dolan et al., 2004), or multiple commentaries about the same news story (Lan et al., 2017). Although this provides a rich and growing set of paraphrases, the language is again biased, this time toward events commonly reported in the news.
An alternative is to provide input in a foreign language. Nearly anything expressible in one human language can be written in another language. When users translate content, some variation in lexical realization occurs. To gather monolingual paraphrases, we can first translate a source sentence into a variety of target languages, then translate back into the source language, using either humans or machines. This provides naturalistic variation in language, centered around a common yet relatively unconstrained starting point. Although several research threads have explored this possibility (e.g., Wieting and Gimpel, 2018), we have seen few if any comparative evaluations of the quality of this approach.
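To make the round-trip procedure concrete, here is a minimal sketch in Python. The `translate` function is a stand-in for whatever human workflow or machine translation service performs each hop; its name and signature are our own illustration, not anything specified in this paper.

```python
# Minimal sketch of round-trip (pivot) paraphrasing. The `translate`
# callable is hypothetical: any MT system or human workflow mapping
# (text, source_lang, target_lang) -> text would fit.
from typing import Callable, List

def round_trip_paraphrase(
    sentence: str,
    pivots: List[str],
    translate: Callable[[str, str, str], str],
) -> List[str]:
    """Translate `sentence` from English into each pivot language and
    back, yielding one candidate paraphrase per pivot."""
    candidates = []
    for pivot in pivots:
        forward = translate(sentence, "en", pivot)  # English -> pivot
        back = translate(forward, pivot, "en")      # pivot -> English
        candidates.append(back)
    return candidates

# Usage with the six pivot languages studied later in the paper:
# round_trip_paraphrase(src, ["ar", "zh", "fr", "de", "ja", "ru"], translate)
```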
Our primary contribution is to evaluate various methods of constructing paraphrase corpora, including monolingual methods with experts and non-experts as well as automated, semi-automated, and manual translation-based approaches. Each paraphrasing method is evaluated for fluency ("does the resulting paraphrase sound not only grammatical but natural?") and adequacy ("does the paraphrase accurately convey the original meaning of the source?") using human direct assessment, inspired by effective techniques in machine translation evaluation (Federmann, 2018).
In addition, we measure the degree of change between the original and rewritten sentence using both edit distance and BLEU (Papineni et al., 2002). Somewhat surprisingly, fully automatic neural machine translation actually outperforms manual human translation in terms of adequacy. The semi-automatic method of post-editing neural machine translation output with human editors leads to fluency improvements while retaining diversity and adequacy. Although none of the translation-based approaches outperform monolingual rewrites in terms of adequacy or fluency, they do produce greater diversity. Human editors, particularly non-experts, tend toward small edits rather than substantial rewrites. We conclude that round-tripping with neural machine translation is a cheap and effective means of gathering diverse paraphrases.
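As an aside on how the BLEU half of this measurement can be computed: the paper does not name an implementation, so the use of sacrebleu below is our assumption.

```python
# Sketch of a BLEU-based diversity check; the sacrebleu library is our
# implementation choice, not the paper's. A lower score against the
# original indicates a more substantial rewrite.
import sacrebleu

def self_bleu(original: str, paraphrase: str) -> float:
    """Sentence-level BLEU of the paraphrase, with the original as reference."""
    return sacrebleu.sentence_bleu(paraphrase, [original]).score

# A near-copy scores high; a heavy rewrite scores low:
# self_bleu("The cold and rain couldn't effect my enjoyment.",
#           "Cold and rain dont satisfy me.")
```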
Our second contribution is a unique data release. As a byproduct of this evaluation, we have compiled a data set consisting of paraphrases gathered using monolingual rewrites and translation paraphrases generated through human translation, neural machine translation, and human post-edited neural machine translation. These 500 source sentences, together with all rewrites and intermediate translations, comprise a rare and interesting multilingual data set, useful for both monolingual and translation tasks. We include all human quality assessments for adequacy (semantic equivalence) and fluency of paraphrases, as well as translation adequacy assessments. Data is publicly available at https://aka.ms/MultilingualWhispers.

Related Work
Translation as a means of generating paraphrases has been explored for decades. Paraphrase corpora can be extracted from multiple translations of the same source material (Barzilay and McKeown, 2001). Sub-sentential paraphrases (mostly phrasal replacements) can be gathered from these multiple translations. Alternatively, one can create a large body of phrasal replacements by pivoting on the phrase tables used by phrase-based statistical machine translation (Bannard and Callison-Burch, 2005; Ganitkevitch et al., 2013; Pavlick et al., 2015).
Recent work has also explored using neural machine translation to generate paraphrases via pivoting (Prakash et al., 2016; Mallinson et al., 2017). One can also use neural MT systems to generate large monolingual paraphrase corpora: translating the Czech side of a Czech-English parallel corpus into English produced 50 million words of English paraphrase data (Wieting and Gimpel, 2018). Not only can the system generate interesting paraphrases, but embeddings trained on the resulting data set prove useful in sentence similarity tasks. When added to a paraphrase system, constraints obtained from a semantic parser can reduce the semantic drift encountered during rewrites (Wang et al., 2018). Adding lexical constraints to the output can also increase diversity (Hu et al., 2019).
Past research has also explored effective methods for gathering paraphrases from the crowd (Jiang et al., 2017). However, to the best of our knowledge, no prior work has compared the efficacy of human experts, crowd workers, human post-editing approaches, and machine translation systems at gathering high-quality paraphrases.

Methodology
To run a comprehensive evaluation of paraphrase techniques, we create many paraphrases of a common data set using multiple methods, then evaluate using human direct assessment as well as automatic diversity measurements.

Data
Input data was sampled from two sources: Reddit provides volumes of casual online conversations, while the Enron email corpus represents communication in the professional world. Both are noisier than usual NMT training data; traditionally, such noise has been challenging for NMT systems (Michel and Neubig, 2018) and should provide a lower bound on their performance. It would certainly be valuable, albeit expensive, to rerun our experiments on a cleaner data source. As an initial filtering step, we ran automatic grammar and spell-checking in order to select sentences that exhibit some disfluency or clear error. Additionally, we asked crowd workers to discard sentences containing personally identifiable information, URLs, code, XML, or Markdown, as well as non-English sentences. The crowd workers were also encouraged to select noisy sentences containing slang, run-ons, contractions, and other behavior observed in informal communications.
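A minimal sketch of the automatic portion of this filtering step appears below. The paper does not specify the tooling, so the regular expressions and the `has_error_flag` input (standing in for the grammar/spell checker's verdict) are our assumptions; the PII and language checks performed by crowd workers are not modeled.

```python
# Illustrative filter for candidate sentences (our assumptions, not the
# paper's exact criteria): keep sentences flagged as containing some
# error, and drop anything with URLs or markup-like content.
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MARKUP_RE = re.compile(r"<[^>]+>|\[[^\]]+\]\([^)]+\)")  # crude XML / Markdown test

def keep_sentence(sentence: str, has_error_flag: bool) -> bool:
    """Return True for noisy-but-usable candidates."""
    if not has_error_flag:  # select sentences exhibiting some disfluency
        return False
    if URL_RE.search(sentence) or MARKUP_RE.search(sentence):
        return False
    return True
```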

Paraphrase techniques
Expert human monolingual paraphrase. We hired trained linguists (native speakers of English) to provide paraphrases of the given source sentences, targeting the highest-quality rewrites. These linguists were also encouraged to fix any misspellings, grammatical errors, or disfluencies.
Crowd-worker monolingual paraphrase. As a less expensive and more realistic setting, we asked native English-speaking crowd workers who passed a qualification test to perform the same task.
Human round-trip translation. For the first set of translation-based paraphrases, we employed human translators who translated the source text from English into some pivot language and back again. The translations were provided by a human translation service, potentially using multiple different translators (though the exact number was not visible to us). In our experiments we focused on a diverse set of pivot languages, namely: Arabic, Chinese, French, German, Japanese, and Russian.
While French and German might seem like better choices for translation from and back into English, given English's membership in the Germanic language family and its substantial shared vocabulary with French, we hypothesize that more distant pivot languages may yield greater diversity in the back-translated output.
We employed professional translators, native in the chosen target language, who were instructed to generate translations from scratch, without the use of any online translation tools. Translation from English into the pivot languages and back into English was conducted in separate phases, by different translators.

Post-edited round-trip translation. Second, we created round-trip translation output based on human post-editing of neural machine translation output. Given the much lower cost of post-editing, we hypothesize that the results contain only minimal edits, mostly improving fluency but not necessarily fixing problems with translation adequacy.
Neural machine translation. We kept the NMT output used to generate the post-editing-based paraphrases, without further human modification. Because this output receives no human supervision, we hypothesize that it may be closer to the source syntactically (though hopefully more diverse lexically), especially for source sentences that a human editor would consider incomplete or low quality.
Crowd-worker monolingual paraphrase grounded by translation. Finally, we also use a variant of the crowd-worker monolingual paraphrase technique in which the crowd worker is grounded by a translation-based paraphrase output. The crowd worker is then asked to modify the translation-based paraphrase to make it more fluent than the source while keeping it equally adequate.
Intuitively, one assumes that human translation output should achieve both the highest adequacy and the highest fluency scores, while post-editing should result in higher adequacy than raw neural machine translation output.
Considering translation fluency scores, NMT output should be close to both post-editing and human translation output, as neural MT models usually achieve high levels of fluency (Bojar et al., 2016; Castilho et al., 2017; Läubli et al., 2018).
We hypothesize that translation helps to increase the diversity of the resulting back-translation output, irrespective of the specific method.

Assessments
We measure four dimensions of quality: paraphrase adequacy (Par_A), paraphrase fluency (Par_F), translation adequacy (NMT_A), and paraphrase diversity (Par_D).

Table 3: Priming questions used for human evaluation of paraphrase adequacy (Par_A), paraphrase fluency (Par_F), and translation adequacy (NMT_A). Paraphrase evaluation campaigns referred to source and candidate text as "candidate A" and "B", respectively. Translation evaluation campaigns used "source" and "candidate text" instead.
Paraphrase adequacy For adequacy, we ask annotators to assess semantic similarity between source and candidate text, labeled as "candidate A" and "B", respectively. The annotation interface implements a slider widget to encode perceived similarity as a value x ∈ [0, 100]. Note that the exact value is hidden from the human, and can only be guessed based on the positioning of the slider. Candidates are displayed in random order, preventing bias.
Paraphrase fluency For fluency, we use a different priming question, implicitly asking the human annotators to assess fluency of candidate "B" relative to that of candidate "A". We collect scores x ∈ [−50, 50], with −50 encoding that candidate "A" is much more fluent than "B", while a value of 50 denotes the polar opposite. Intuitively, the middle value 0 encodes that the annotator could not determine a meaningful difference in fluency between the two candidates. Note that this may mean one of two things: 1. the candidates are semantically equivalent and similarly fluent (or similarly non-fluent); or 2. the candidates have different semantics.
We observe that annotators have a tendency to fall back to "neutral" x = 0 scoring whenever they are confused, e.g., when semantic similarity of both candidates is considered low.
Translation adequacy We measure translation adequacy using our own implementation of source-based direct assessment. Annotators do not know that the source text shown might be translated content, and they do not know about the actual goal of using back-translated output for paraphrase generation. Except for the labels for source and candidate text, the priming question is identical to the one used for paraphrase adequacy evaluation. Notably, we have to employ bilingual annotators to collect these assessments. Scores for translation adequacy are again collected as x ∈ [0, 100].
Paraphrase diversity Additionally, we measure diversity of all paraphrases (both monolingual and based on translation) by computing the average number of token edits between source and candidate texts. To focus our attention on meaningful changes as opposed to minor function word rewrites, we normalize both source and candidate by lower-casing and excluding any punctuation and stop words using NLTK (Bird et al., 2009).
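A sketch of this diversity computation, assuming NLTK's tokenizer, English stop-word list, and edit-distance implementation; the precise normalization pipeline is our reading of the description above.

```python
# Token-level edit distance between normalized source and candidate,
# approximating the diversity measure described above.
import string
from typing import List

import nltk
from nltk.corpus import stopwords
from nltk.metrics.distance import edit_distance
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def normalize(text: str) -> List[str]:
    """Lower-case, tokenize, and drop punctuation and stop words."""
    tokens = word_tokenize(text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS and t not in string.punctuation]

def paraphrase_diversity(source: str, candidate: str) -> int:
    """Number of token insertions, deletions, and substitutions."""
    return edit_distance(normalize(source), normalize(candidate))
```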
We adopt source-based direct assessment (src-DA) for human evaluation of adequacy and fluency. The original DA approach (Graham et al., 2013, 2014) is reference-based and thus needs to be adapted for use in our paraphrase assessment and translation scoring scenarios. In both cases, we can use the source sentence to guide annotators in their assessment. Of course, this makes translation evaluation more difficult, as we require bilingual annotators. Src-DA has been used previously (e.g., Cettolo et al., 2017; Bojar et al., 2018).
Direct assessment initializes mental context for annotators by asking a priming question. The user interface shows two sentences: the source (for src-DA; the reference otherwise) and the candidate output.
Annotators read the priming question and both sentences, then assign a score x ∈ [0, 100] to the candidate shown. The interpretation of this score considers the context defined by the priming question, effectively allowing us to use the same annotation method to collect human assessments with respect to the different dimensions of quality defined above. Our priming questions are shown in Table 3.

Profanity handling
Some source segments from Reddit contain profanities, which may have affected results reported in this paper. While a detailed investigation of such effects is outside the scope of this work, we want to highlight two potential issues which could be introduced by profanity in the source text: 1. profanity may have caused additional monolingual rewrites (in an attempt to clean the resulting paraphrase), possibly inflating diversity scores; 2. human translators may have performed similar cleanup, increasing the likelihood of back translations having a lower adequacy score.

Table 4: Results by paraphrasing method. Adequacy (Par_A) and fluency (Par_F) are human assessments of paraphrases; paraphrase diversity (Par_D) is measured by the average string edit distance between source and paraphrase (higher means greater diversity); NMT_A is a human assessment of translation quality.
All data collected in this work is publicly released. This includes paraphrases as well as assessments of adequacy, fluency, and translation adequacy. Human scores are based on two evaluation campaigns, one for adequacy and the other for fluency, with t = 27 annotation tasks, a = 54 human annotators, r = 4 redundancy, and tpa = 2 tasks per annotator, resulting in a total of t * r = a * tpa = 108 annotated tasks. This is equivalent to at least 9,504 assessments per campaign (more in case of duplicates in the set of paraphrases to be evaluated), based on the alternate HIT structure with the 88:12 candidates-vs-controls setting described in Bojar et al. (2018).

Results

Table 5 organizes results by the pivot languages used. "Multi-Hop NMT" refers to an experiment in which we created paraphrases by translating via two non-English pivot languages, namely Chinese and Japanese (the "Two-Pivot NMT" condition of Table 6). French and German perform best as pivot languages, while Chinese-Japanese achieves the best diversity.

Table 6 shows results from our grounded paraphrasing experiment, in which we compared how different translation methods affect monolingual rewriting quality. Based on the results in Table 5, we focus on French and German as our pivot languages. We also keep the Chinese-Japanese "Two-Pivot NMT" condition to see how additional pivot languages may affect the resulting paraphrase diversity.

Figure 2 shows the convergence of adequacy scores for the grounded paraphrasing experiment over time, and Figure 3 shows the convergence of relative fluency scores. Note how the clustering reported in Table 6 appears after only a few hundred annotations. The clusters denote sets of systems that are not statistically significantly different.
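The campaign arithmetic above can be checked directly. Note that the 100-item task size (88 candidates plus 12 quality-control items) is our inference from the 88:12 split of Bojar et al. (2018); only the ratio and the totals are stated in the text.

```python
# Reproducing the annotation-campaign totals reported above. The 100-item
# task size (88 candidates + 12 controls) is inferred, not stated.
t, r = 27, 4       # annotation tasks, redundancy
a, tpa = 54, 2     # annotators, tasks per annotator
assert t * r == a * tpa == 108            # annotated tasks per campaign

candidates_per_task = 88                  # remaining 12 items are controls
assert 108 * candidates_per_task == 9504  # minimum assessments per campaign
```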

Error Analysis
While neural machine translation based paraphrases achieve surprising results in terms of diversity compared to paraphrases generated by human Non-Experts, NMT does not reach the adequacy or fluency level provided by Expert paraphrases. The examples in Table 7 provide a flavor of the outputs from each method and demonstrate some of the error cases.
Partially paraphrasing entities and common expressions. NMT systems often mangle multiword units, rewriting parts of non-compositional phrases in ways that change meaning ("Material Design" → "hardware design") or decrease fluency.

Informal language. Inadequate or disfluent paraphrases are also caused by typos, slang, and other informal patterns. As prior work has noted (Michel and Neubig, 2018), NMT models often corrupt these inputs, leading to bad paraphrases.

Table 6: Results for translation-based rewriting, ordered by decreasing average adequacy (Par_A). Horizontal lines between methods denote significance cluster boundaries. Edits measures the average number of edits needed to create the rewrite (higher means greater diversity). BLEU measures overlap with the original sentence (lower means greater diversity). Labelling time is measured in seconds, with a maximum timeout of two minutes. P25 and P75 refer to the 25th and 75th percentiles of observed labelling time, respectively; StdDev to its standard deviation.
Negation handling. One classic struggle for machine translation approaches is negation: losing or adding negation is a common error type. Paraphrases generated through NMT are no exception.
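As an illustration only, and not something the paper proposes, a simple heuristic of our own can flag potential negation drift by comparing counts of common English negation markers on both sides of a rewrite.

```python
# A cheap, hypothetical check (ours, not the paper's) for negation drift:
# if the number of negation markers differs between source and paraphrase,
# a negation may have been lost or added during rewriting.
import re

NEGATION_RE = re.compile(r"\b(?:not|no|never|none|neither|nor)\b|n't",
                         re.IGNORECASE)

def negation_mismatch(source: str, paraphrase: str) -> bool:
    return len(NEGATION_RE.findall(source)) != len(NEGATION_RE.findall(paraphrase))

# Catches "I could not go" vs. "I went", but misses rewrites that negate an
# antonym ("not happy" vs. "sad"), so it is only a rough signal.
```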

Key findings
Given our experimental results, we formulate the following empirical conclusions:

"Monolingual is better" Human rewriting achieves higher adequacy and fluency scores than all tested translation methods. This comes at a relatively high cost, though.
"Non-experts more adequate..." Human experts appear worse than non-experts in adequacy. We have empirically identified a way to either save or produce more paraphrases for the same budget.
"...but less diverse" Non-expert paraphrases are not as diverse as those created by experts. Expert rewrites also fix source text issues such as profanity.
"MT is not bad" Neural machine translation performs surprisingly well, creating more diverse output than human experts.
"Post-editing is better" Paraphrase adequacy, paraphrase fluency and translation adequacy benefit from human post-editing. In our experiments, this method achieved best performance of all tested translation methods.
"Human translations are expensive and less adequate" While humans achieve high translation adequacy scores and good paraphrase diversity, the corresponding paraphrase adequacy values are worst among all tested methods (except two-pivot NMT, which solves a harder problem).
"Related languages are better..." Generating paraphrases by translation works better when pivot languages are closely related.
"Use neural MT for cheap, large data!" Seems good enough to work for constrained budgets, can be improved with post-editing as needed. Specifically, we have empirically proven that you can increase paraphrase diversity by using NMT pivot translation, combined with non-expert rewriting.

Conclusions
Somewhat surprisingly, strong neural machine translation is more effective at paraphrase generation than humans: it is cheap, adequate, and diverse. In contrast, crowd workers cost more and produced more adequate paraphrases, but with only trivial edits. Although neural MT also produced less fluent outputs, post-editing could improve the quality with little additional expenditure. Expert linguists produced the highest-quality paraphrases, but at substantially greater cost. Translation-based paraphrases are more diverse.
One limitation of this survey is the input data selection: nearly all input sentences contained some kind of error. This may benefit some techniques; humans in particular can navigate these errors easily. Also, the casual data used often included profanity and idiomatic expressions. Translators often rewrote profane expressions, perhaps decreasing adequacy. Future work on different data sets could further quantify such data effects.

Table 7: Example paraphrases generated by several monolingual and bilingual methods. Note how Non-Expert rewrites tend to be the most conservative, except when clearly informal language is rewritten or corrected.

ORIGINAL: Is it actually more benefitial/safe to do this many exercises a day?
EXPERT: Is it actually more beneficial and safe to do so many exercises in a day?
NMT CHINESE-JAPANESE: Tell me if daily practice is good?
PE GERMAN: Is it actually more safe and important to do this many exercises a day?
PE FRENCH: Is it actually more benefitial/safe to do as many exercises a day?
HT FRENCH: Is it really more beneficial/safe to do so much exercise per day?
HT GERMAN: Is it really more beneficial / safer to do so many exercises per day?
NON-EXPERT: Is it actually more beneficial and safe to do this many exercises a day?
NMT FRENCH: Is it actually more benefitial/safe to do this many exercises per day?
NMT GERMAN: Is it actually beneditialat/sure to do these many exercises a day?

ORIGINAL: The cold and rain couldn't effect my enjoyment.
EXPERT: The cold and rain could not affect my enjoyment.
NMT CHINESE-JAPANESE: Cold and rain can not detract from my enjoyment.
PE GERMAN: The cold and rain will not affect my enjoyment.
PE FRENCH: The cold and rain could not effect my enjoyment.
HT FRENCH: Cold and rain dont satisfy me.
HT GERMAN: The cold and rain couldnt spoil my enjoyment.
NON-EXPERT: The cold and the rain couldn't affect my happiness.
NMT FRENCH: The cold and the rain could not affect my pleasure.
NMT GERMAN: The cold and rain couldn't affect my enjoyment.

Figure 2: Convergence of adequacy scores over time. Despite the lack of an absolute standard of system assessment, a diverse set of judges rapidly converges to a consistent ranking of system quality. Within 100 to 200 judgements, the rating has essentially stabilized, though we continue to assess the whole set for greatest stability and confidence in the ranking. We note, however, that readers should take caution in an absolute reading of these ratings; they should instead be read as a relative quality assessment among the approaches under consideration.

Figure 3: Convergence of relative fluency scores over time. These assessments reflect the same trends as adequacy: raters rapidly converge on a relative assessment of distinct systems.