Polly Want a Cracker: Analyzing Performance of Parroting on Paraphrase Generation Datasets

Paraphrase generation is a challenging NLP task with numerous practical applications. In this paper, we analyze datasets commonly used for paraphrase generation research, and show that simply parroting input sentences surpasses state-of-the-art models in the literature when evaluated on standard metrics. Our findings illustrate that a model could appear adept at generating paraphrases despite making only trivial changes to the input sentence, or none at all.


Introduction
The task of paraphrase generation has many important applications in NLP. It can be used to generate adversarial examples of input text, which can then be used to train neural networks so that they become less susceptible to adversarial attack (Iyyer et al., 2018). For knowledge-based QA systems, a paraphrasing step can produce multiple variations of a user query and match them with knowledge base assertions, enhancing recall (Yin et al., 2015; Fader et al., 2014). Relation extraction can also benefit from incorporating paraphrase generation into its processing pipeline (Romano et al., 2006). Manually annotating translation references is expensive, and automatically generating references through paraphrasing has been shown to be effective for evaluation of machine translation (Zhou et al., 2006; Kauchak and Barzilay, 2006).
In this paper, we find that simply using the input sentence as output in an unsupervised manner (i.e. fully parroting the input) significantly outperforms the state-of-the-art on two metrics for TWITTER, and on one metric for QUORA. Even after changing part of the input sentence (i.e. partially parroting the input), state-of-the-art metric scores can still be surpassed.
Consequently, for future paraphrase generation work that achieves good evaluation scores, we suggest investigating whether the proposed methods or models behave differently from simple parroting.

Method Description
Given an input sentence i, the goal of paraphrase generation is to generate an output sentence o which is semantically identical to i, but contains variations in lexicon or syntax. Full parroting simply uses the input as output (o = i).
Paraphrase generation models may not parrot the input sentence word for word, but they may modify only a few words of the input. We therefore also experiment with simple methods of modifying i, such as replacing or cutting words from the head, from the tail, or from random positions.
Both full parroting and the forms of partial parroting we use are fully unsupervised.
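The cutting and replacement operations described above can be sketched as follows. This is a minimal illustration under our own assumptions; the function names and the `<unk>` placeholder token are ours, not taken from the paper's implementation:

```python
import random

def cut_words(sentence, ratio, mode="head", seed=0):
    """Remove a fraction of words from the head, tail, or random positions."""
    words = sentence.split()
    n_cut = int(len(words) * ratio)
    if mode == "head":
        kept = words[n_cut:]
    elif mode == "tail":
        kept = words[:len(words) - n_cut]
    else:  # sample positions uniformly at random
        rng = random.Random(seed)
        drop = set(rng.sample(range(len(words)), n_cut))
        kept = [w for i, w in enumerate(words) if i not in drop]
    return " ".join(kept)

def replace_words(sentence, ratio, oov_token="<unk>", mode="head", seed=0):
    """Substitute a fraction of words with an out-of-vocabulary token."""
    words = sentence.split()
    n_rep = int(len(words) * ratio)
    if mode == "head":
        idx = set(range(n_rep))
    elif mode == "tail":
        idx = set(range(len(words) - n_rep, len(words)))
    else:
        rng = random.Random(seed)
        idx = set(rng.sample(range(len(words)), n_rep))
    return " ".join(oov_token if i in idx else w for i, w in enumerate(words))
```

For example, `cut_words("a b c d e", 0.4, "head")` keeps only the last three words, modeling the "cut from the head" condition.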

Datasets
QUORA. The QUORA dataset contains 149,263 paraphrase sentence pairs (positive examples) and 255,027 non-paraphrase sentence pairs (negative examples). Having both positive and negative examples makes it appealing for research on paraphrase generation (Gupta et al., 2018; Li et al., 2018) and identification (Lan and Xu, 2018). After processing the dataset, there are 149,650 unique sentences that have reference paraphrases. Gupta et al. (2018) sampled 4K sentences as their test set, and Li et al. (2018) sampled 30K, but neither specified which sentences they used. To avoid selecting a subset of data that is biased in favor of our method, we perform evaluation on the entire QUORA dataset. Although we evaluate on the entire dataset, the size of our training set is zero due to the fully unsupervised nature of full and partial parroting.
We group sentences by the number of reference paraphrases they have, and plot the relative counts in Appendix A. It can be seen that over 64% of entries have only a single reference paraphrase, which is problematic: even if a good-quality paraphrase is generated for one of these entries, its BLEU, METEOR and TER scores could still be poor if it differs too much from the single reference paraphrase. Previous paraphrase generation work on QUORA (Gupta et al., 2018; Li et al., 2018) did not mention removing these entries, so we include them in our experiments for fair comparison. However, we strongly recommend that future work wishing to use BLEU, METEOR and TER as evaluation metrics only consider entries that have multiple reference paraphrases.
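The grouping described above can be sketched as follows, assuming the dataset's positive pairs are available as two-element tuples (the helper name is ours, not from the paper):

```python
from collections import defaultdict, Counter

def reference_count_distribution(pairs):
    """Map each sentence to its set of reference paraphrases, then count
    how many sentences have 1, 2, 3, ... references.

    pairs: iterable of (sentence_a, sentence_b) positive examples.
    Returns a Counter mapping #references -> #sentences with that many.
    """
    refs = defaultdict(set)
    for a, b in pairs:
        refs[a].add(b)
        refs[b].add(a)  # the paraphrase relation is symmetric
    return Counter(len(v) for v in refs.values())
```

The fraction of single-reference entries is then `dist[1] / sum(dist.values())`, which is the quantity plotted in Appendix A.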
TWITTER. There are 114,025 paraphrase sentence pairs in TWITTER, which were acquired by collecting tweets which contain identical URLs (Lan et al., 2017). As with QUORA, prior paraphrase generation work on this dataset (Li et al., 2018) did not provide their sampled test set sentences, so we evaluate parroting on the entire dataset to avoid bias. We follow the same data processing steps as QUORA, and plot the number of reference paraphrases in Appendix A.
MSCOCO. This is an image captioning dataset, with multiple captions provided for a single image (Lin et al., 2014). Multiple works have used it as a paraphrase generation dataset by treating captions of the same image as paraphrases (Wang et al., 2019; Gupta et al., 2018; Prakash et al., 2016). The training and testing sets are available, containing 331,163 and 162,016 input sentences respectively. However, captions of the same image receive a relevance score of only 3.38 out of 5 under human evaluation (in contrast, the score is 4.82 for QUORA) (Gupta et al., 2018), because different captions for the same image often differ in the semantic information they convey. This makes the use of MSCOCO as a paraphrase generation dataset questionable.
We plot the number of reference paraphrases in Appendix A.

Experiments
We evaluate the performance of full parroting on all three datasets and compare with state-of-the-art models.
We also study the performance of partial parroting. Whereas full parroting does not modify the input sentence, partial parroting replaces or cuts some of the input words. We try three different modes of choosing words to be cut or replaced: from the sentence head, from the tail or sampled randomly.

Evaluation
Following prior paraphrase generation research which used QUORA, TWITTER and MSCOCO, we use BLEU, METEOR and TER as evaluation metrics. When calculating metric scores, all available reference paraphrases for a given input sentence are considered.
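As a rough illustration of how multi-reference BLEU rewards a parroted hypothesis, here is a minimal pure-Python sketch of clipped n-gram precision with a brevity penalty. This is not the evaluation code used in the paper; real toolkits add smoothing and proper tokenization:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=4):
    """Minimal multi-reference BLEU: geometric mean of clipped n-gram
    precisions, times a brevity penalty against the closest reference."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = ngrams(hyp, n)
        if not hyp_ngrams:
            return 0.0
        # clip each hypothesis n-gram count by its max count in any reference
        max_ref = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_ngrams.items())
        if clipped == 0:
            return 0.0
        log_prec += math.log(clipped / sum(hyp_ngrams.values()))
    # brevity penalty uses the reference length closest to the hypothesis
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) > ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(log_prec / max_n)
```

Because all references are pooled, a parroted hypothesis only needs its n-grams to appear in *some* reference to score well, which is exactly the effect our experiments exploit.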

Results
Full parroting. Our results are organized in Tables 1, 2 and 3. We see that for TWITTER, parroting outperforms the state-of-the-art by significant margins on both BLEU and METEOR scores; for QUORA, parroting outperforms the state-of-the-art appreciably on METEOR while having comparable performance on BLEU.
The poor performance of full parroting on MSCOCO is due to higher edit distances between input sentences and their reference paraphrases. TER measures the edit distance of a sentence to a reference sentence, normalized by the average length of all references (Snover et al., 2006):

TER = (# of edits) / (average # of reference words)

We see that the TER score of full parroting is particularly high on MSCOCO compared to the other two datasets. Correspondingly, the BLEU and METEOR scores are lower by a wide margin.
For further investigation of parroting on QUORA and TWITTER, we plot parroting performance versus the number of reference paraphrases available for a given input sentence (Figures 1 and 2). If the number of references is not too high, metric scores generally improve as the number of references rises. Once the number of references exceeds a certain threshold, we do not observe a clear correlation, showing that the probability of finding a reference sentence which bears higher resemblance to the input does not increase proportionally with the number of references.
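The normalization behind TER can be sketched as follows. This simplified version uses plain word-level edit distance, whereas real TER also counts shifts of word blocks as single edits; the function names are ours:

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance (insertions, deletions,
    substitutions) between token lists a and b, using one rolling row."""
    dp = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, wb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete wa
                                     dp[j - 1] + 1,    # insert wb
                                     prev + (wa != wb))  # substitute
    return dp[len(b)]

def ter(hypothesis, references):
    """Edits to the closest reference, normalized by the average
    reference length, matching the definition quoted above."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    avg_len = sum(len(r) for r in refs) / len(refs)
    return min(edit_distance(hyp, r) for r in refs) / avg_len
```

A parroted hypothesis scores a TER of zero only when the input itself appears among the references; on MSCOCO, where captions diverge heavily, the minimum edit distance stays large and TER is correspondingly high.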
The choice of testing on the entire dataset for the QUORA and TWITTER experiments was made to avoid bias in favor of parroting. Nevertheless, we also randomly sampled test sets of size 4K for QUORA in the same manner as Gupta et al. (2018), and test sets of size 5K for TWITTER in the same manner as Li et al. (2018) (who hold the state-of-the-art records on TWITTER). In total, 1200 test sets of size 4K were sampled for QUORA and 250 test sets of size 5K were sampled for TWITTER. Parroting performance on these sampled test sets can be found in Table 4. It can be observed that the average metric scores for QUORA are similar to the scores in Table 1, whereas the average scores for TWITTER are noticeably better than those in Table 2. Furthermore, the score deviation between different samples is small. Consequently, although the exact test sets used by Gupta et al. (2018) and Li et al. (2018) are not available, it is reasonable to assume that parroting performance would still exceed or be on par with the state-of-the-art on those test sets.

Table 4: Performance of full parroting on randomly sampled test sets. The test set size and sampling method are the same as those described in prior state-of-the-art work. Scores for sampled QUORA test sets are similar to those of the full dataset, and scores for TWITTER test sets are better than those achieved on the full dataset.
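The repeated-sampling procedure can be sketched as follows. This is a hypothetical helper of our own; `metric` stands in for any corpus-level scorer (BLEU, METEOR or TER) applied to a sampled batch of (input, references) entries:

```python
import random
import statistics

def sample_score_stats(entries, metric, test_size, n_trials, seed=0):
    """Repeatedly sample test sets and report the mean and standard
    deviation of a metric across the samples.

    entries: list of (input_sentence, reference_list) pairs.
    metric:  function mapping a list of such pairs to one score.
    """
    rng = random.Random(seed)
    scores = [metric(rng.sample(entries, test_size))
              for _ in range(n_trials)]
    return statistics.mean(scores), statistics.stdev(scores)
```

A small standard deviation across samples, as we observe, indicates that the unknown test sets of prior work would likely yield similar parroting scores.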
Partial parroting. We also introduce lexical variation into our parroting method by replacing or cutting words of the input sentence. For replacement, we substitute input words with an out-of-vocabulary word not found in any of the input sentence's reference paraphrases. Paraphrase generation models are usually allowed to generate words which exist in reference paraphrases; we purposely use out-of-vocabulary words to give harsher scores to our method.
Figures 3 and 4 show the performance of cutting words from the start of input sentences. For QUORA, when over 10% of the input sentence has been modified by being cut off, partial parroting underperforms the state-of-the-art by only 3.8% on METEOR. For TWITTER, the same form of partial parroting still outperforms the state-of-the-art on BLEU when input sentences are modified by 42%, and does the same on METEOR when the input is modified by 56%. Additionally, we experiment with cutting words at other positions, and also with replacing words rather than cutting them. The results can be found in Appendix B.

Related Work
Earlier work using QUORA and TWITTER (Gupta et al., 2018; Li et al., 2018) only provided a few examples of output paraphrases, and did not study in detail the paraphrasing behavior of their models. This makes it unclear whether those models achieve qualitatively better results than our simple rule-based parroting techniques, given that the evaluation scores of the two are similar. We recommend that future research perform such an analysis if its metric scores are close to those of parroting. Work on paraphrase generation using other datasets can also be found; methods include lexical substitution (Hassan et al., 2007; Bolshakov and Gelbukh, 2004), back-translation (Wieting and Gimpel, 2018) and sequence-to-sequence neural networks (Iyyer et al., 2018).
It is worth noting that paraphrase generation serves practical purposes, such as augmenting training data for NLP models to decrease their susceptibility to adversarial attack (Iyyer et al., 2018), or enhancing recall for QA systems (Yin et al., 2015;Fader et al., 2014). Improvement of downstream model performance is a valid evaluation metric for paraphrase generation, and future work wishing to use QUORA entries which only have a single reference paraphrase could choose such an evaluation metric instead of BLEU, METEOR or TER.
As a sidenote, we also ran experiments in which BLEU scores were calculated using non-reference dataset sentences. The results are in Appendix C.

Conclusion
In this work, we discover that various forms of simple parroting outperform state-of-the-art results on QUORA and TWITTER when evaluated using BLEU and METEOR. One interpretation is that current models could simply be parroting input sentences, and researchers should perform qualitative analysis of such behavior. Another interpretation is that BLEU and METEOR are inappropriate for evaluating paraphrase generation models, in which case other metrics, such as effectiveness of data augmentation (Iyyer et al., 2018), may be used instead.

Figure 9: Metric scores vs. ratio of text that is cut from the end of the input sentence (Twitter)
Figure 10: Metric scores vs. ratio of text that is cut randomly from the input sentence (Quora)
Figure 11: Metric scores vs. ratio of text that is cut randomly from the input sentence (Twitter)
Figure 12: Metric scores vs. ratio of text that is replaced from the start of the input sentence (Quora)
Figure 13: Metric scores vs. ratio of text that is replaced from the start of the input sentence (Twitter)
Figure 14: Metric scores vs. ratio of text that is replaced from the end of the input sentence (Quora)
Figure 15: Metric scores vs. ratio of text that is replaced from the end of the input sentence (Twitter)
Figure 16: Metric scores vs. ratio of text that is replaced randomly from the input sentence (Quora)
Figure 17: Metric scores vs. ratio of text that is replaced randomly from the input sentence (Twitter)

C Calculating BLEU score using non-reference sentences
For an input sentence, BLEU scores are usually calculated by comparing the input sentence with a number of reference sentences. We ran experiments in which 5 reference and 100 randomly sampled non-reference sentences were used, and show part of our results below. It can be seen that sentences with higher BLEU scores are more similar to the input sentence, which is to be expected.