Negative Lexically Constrained Decoding for Paraphrase Generation

Paraphrase generation can be regarded as monolingual translation. Unlike bilingual machine translation, paraphrase generation rewrites only a limited portion of an input sentence. Hence, previous methods based on machine translation often perform conservatively and fail to make necessary rewrites. To solve this problem, we propose a neural model for paraphrase generation that first identifies words in the source sentence that should be paraphrased. These words are then paraphrased by negative lexically constrained decoding, which avoids outputting them as they are. Experiments on text simplification and formality transfer show that our model improves the quality of paraphrasing by making necessary rewrites to an input sentence.


Introduction
Paraphrase generation is a generic term for tasks that generate sentences semantically equivalent to input sentences. These techniques make it possible to control information other than the meaning of the text. Typical paraphrase generation tasks include subtasks such as text simplification to control complexity, formality transfer to control formality, grammatical error correction to control fluency, and sentence compression to control sentence length. These paraphrase generation applications not only support communication and language learning but also contribute to the performance improvement of other natural language processing applications (Evans, 2011; Stajner and Popović, 2016).
Paraphrase generation can be considered a monolingual machine translation problem. Sentential paraphrases with different complexities (Coster and Kauchak, 2011; Xu et al., 2015) and formalities (Rao and Tetreault, 2018) were created manually, and parallel corpora specialized for each subtask were constructed. As in the field of machine translation, phrase-based (Coster and Kauchak, 2011; Xu et al., 2012) and syntax-based (Zhu et al., 2010; Xu et al., 2016) methods were proposed early on. In recent years, encoder-decoder models based on the attention mechanism (Nisioi et al., 2017; Zhang and Lapata, 2017; Jhamtani et al., 2017; Niu et al., 2018) have been studied, inspired by the success of neural machine translation (Bahdanau et al., 2015).
In machine translation, all words appearing in an input sentence must be rewritten in the target language. Paraphrase generation, however, does not require rewriting all words. Given some criterion, words in the input sentence that do not satisfy it are identified and rewritten. For example, the criterion for text simplification is textual complexity, and the task rewrites complex words into simpler synonyms. Owing to this characteristic of the task, where only a limited portion of an input sentence needs to be rewritten, previous methods based on machine translation often perform conservatively and fail to produce necessary rewrites (Zhang and Lapata, 2017; Niu et al., 2018). To solve this problem of conservative paraphrasing, which copies many parts of the input sentence, we propose a neural model for paraphrase generation that first identifies the words in the source sentence requiring paraphrasing. These words are then paraphrased by negative lexically constrained decoding, which avoids outputting them as they are.
We evaluate the performance of the proposed method with two major paraphrase generation tasks.
Experiments on text simplification (Xu et al., 2015) and formality transfer (Rao and Tetreault, 2018) show that our model improves the quality of paraphrasing by performing necessary rewrites to an input sentence.

Proposed Method
To improve on the conservative rewriting of neural paraphrase generation, we first identify the words to be paraphrased for a given input sentence (Section 2.1). Next, we paraphrase the input sentence using a pretrained paraphrase generation model, selecting output sentences that do not include those words by adding negative lexically constrained decoding to the beam search (Section 2.2). Because our method only changes the beam search, it can be applied to various paraphrase generation models, and model retraining is not necessary.

Identification of Words to be Paraphrased
We extract words strongly related to the source style included in the input sentence s_i as the vocabulary V_i to be paraphrased. Point-wise mutual information (PMI) is used to estimate the relatedness between each word w ∈ s_i and style z ∈ {x, y} (Pavlick and Nenkova, 2015), i.e., PMI(w, z) = log p(w, z) / (p(w) p(z)). Here, x and y are the source style (e.g., informal) and the target style (e.g., formal), respectively.
We define the vocabulary V_i to be paraphrased using the threshold θ as follows:

V_i = {w | w ∈ s_i, PMI(w, x) ≥ θ}

After extracting the vocabulary V_i to be paraphrased for each input sentence s_i, we generate a paraphrased sentence using it as a hard constraint. Note that the PMI scores are calculated using the training parallel corpus for paraphrase generation.
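The identification step can be sketched as follows. This is a minimal illustration assuming whitespace tokenization and a simple occurrence-count estimate of PMI over the two sides of the training parallel corpus; the function names (`pmi_scores`, `constrained_vocab`) are ours, not from any released implementation, and the paper's exact estimation may differ.

```python
import math
from collections import Counter

def pmi_scores(src_sents, tgt_sents):
    """PMI(w, x) = log p(w, x) / (p(w) p(x)), where x is the source style.

    Probabilities are estimated from token counts over the source-style
    and target-style sides of the training parallel corpus.
    """
    src_counts = Counter(w for s in src_sents for w in s.split())
    tgt_counts = Counter(w for s in tgt_sents for w in s.split())
    n_src, n_tgt = sum(src_counts.values()), sum(tgt_counts.values())
    n_all = n_src + n_tgt
    p_x = n_src / n_all  # p(z = source style)
    scores = {}
    for w in src_counts:  # only source-side words can be constrained
        p_w = (src_counts[w] + tgt_counts[w]) / n_all
        p_wx = src_counts[w] / n_all
        scores[w] = math.log(p_wx / (p_w * p_x))
    return scores

def constrained_vocab(sentence, scores, theta):
    """V_i = {w in s_i | PMI(w, x) >= theta}: words to ban from the output."""
    return {w for w in sentence.split() if scores.get(w, float("-inf")) >= theta}
```

Words that appear only on the source-style side receive high PMI scores and become negative constraints; style-neutral words shared by both sides score near or below zero and are left unconstrained.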

Negative Lexically Constrained Decoding
Lexically constrained decoding (Anderson et al., 2017; Hokamp and Liu, 2017; Post and Vilar, 2018) adds constraints to the beam search to force the output text to include certain words. The effectiveness of these methods has been demonstrated in image captioning using given image tags (Anderson et al., 2017) and in the post-editing of machine translation (Hokamp and Liu, 2017). In paraphrase generation, however, there is no situation in which words to be included in the output sentence are given. Therefore, the positive lexical constraints used in image captioning and post-editing of machine translation cannot be applied to this task as they are. Meanwhile, negative lexical constraints, which force the output sentence not to include certain words, are promising for paraphrase generation. This is because, for example, text simplification is a task of generating a sentential paraphrase without using the complex words that appear in the source sentence.
In this study, we add negative lexical constraints to beam search using dynamic beam allocation (Post and Vilar, 2018), the fastest lexically constrained decoding algorithm. Under negative lexical constraints, we exclude hypotheses that include the given words during beam search. Consequently, the words identified in Section 2.1 do not appear in our generated sentences.

Experiment
We evaluate the performance of the proposed method on two major paraphrase generation tasks: text simplification and formality transfer, using the datasets shown in Table 1. For text simplification, we identify complex words in the input sentence and generate a simple paraphrased sentence without using these complex words. Similarly, for formality transfer, we identify informal words in the input sentence and generate a formal paraphrased sentence without using these informal words.

Setup
For text simplification, we used the Newsela dataset (Xu et al., 2015) split and tokenized with the same settings as the previous study (Zhang and Lapata, 2017).
For formality transfer, we used the GYAFC dataset (Rao and Tetreault, 2018) normalized and tokenized using the Moses toolkit. For each task, we used byte-pair encoding (Sennrich et al., 2016) to limit the number of token types to 16,000. For the GYAFC dataset, a correlation between manual evaluation and automatic evaluation using BLEU has been reported only when paraphrasing from the informal style to the formal style (Rao and Tetreault, 2018); therefore, we experiment only with this setting. For lexical constraints, we identified words with a PMI score above the threshold θ. We selected the threshold θ ∈ {0.0, 0.1, 0.2, ..., 0.7} that maximizes the BLEU score between the output sentences and the reference sentences on the development dataset. We calculated PMI scores using each training dataset shown in Table 1.
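The threshold selection reduces to a small grid search over development BLEU, sketched below. `outputs_for` (the constrained decoder applied to the dev set) and `corpus_bleu` (any BLEU implementation) are placeholders, not names from the paper.

```python
def select_threshold(candidates, outputs_for, references, corpus_bleu):
    """Grid search over PMI thresholds.

    candidates:   iterable of thresholds, e.g. [0.0, 0.1, ..., 0.7]
    outputs_for:  theta -> list of constrained dev-set outputs
    corpus_bleu:  (hypotheses, references) -> score to maximize
    """
    return max(candidates, key=lambda th: corpus_bleu(outputs_for(th), references))
```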
As paraphrase generation models, we constructed recurrent neural network (RNN) and self-attention network (SAN) models using the Sockeye toolkit (Hieber et al., 2017; https://github.com/awslabs/sockeye). Our RNN model uses a single LSTM with a layer size of 512 for both the encoder and decoder, and MLP attention with a layer size of 512. Our SAN model uses a six-layer transformer with a model size of 512 and a single attention head. We used 512-dimensional word embeddings, tying the source, target, and output layer weight matrices. We applied dropout to the embeddings and hidden layers with probability 0.2. In addition, we used layer normalization and label smoothing for regularization. We trained with the Adam optimizer (Kingma and Ba, 2014) with a batch size of 4,096 tokens and checkpointed the model every 1,000 updates. Training stopped after five checkpoints without improvement in validation perplexity.
BLEU (Papineni et al., 2002) is our primary evaluation metric; SARI (Xu et al., 2016) is also used for text simplification. For a more detailed comparison of the models, we evaluated the F1 scores of the words that are added (Add), kept (Keep), and deleted (Del) by the models. Our proposed method is compared with previous methods trained only on the datasets shown in Table 1. For detailed analysis, we chose methods whose model outputs are published. Among these, Dress-LS (Zhang and Lapata, 2017) and BiFT-Ens (Niu et al., 2018), which have the highest BLEU score in each task, are compared with our model. Following BiFT-Ens, we also used a bidirectional domain-mixed ensemble model for the formality transfer task.
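As a rough illustration of the Add/Keep/Del comparison, a word-set version of these F1 scores can be computed as below. This is our own sketch: the exact definition used in the paper may differ (e.g., it may operate over n-grams, as in SARI).

```python
def f1(pred, gold):
    """F1 between predicted and gold word sets."""
    if not pred or not gold:
        return 1.0 if pred == gold else 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def add_keep_del(source, output, reference):
    """F1 of words added to, kept from, and deleted from the source."""
    s, o, r = set(source.split()), set(output.split()), set(reference.split())
    add = f1(o - s, r - s)       # words newly introduced by the model
    keep = f1(o & s, r & s)      # source words retained
    delete = f1(s - o, s - r)    # source words removed
    return add, keep, delete
```

Add rewards the model for introducing the same new words as the reference, Del for removing the same source words, and Keep for retaining the rest, so conservative copying shows up as low Add and Del despite a high Keep.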
We also experimented with an Oracle setting that properly identifies the words to be paraphrased. In this setting, we used as lexical constraints all words in the input sentence that did not appear in the reference sentence.
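The Oracle constraints described here amount to a simple set difference, sketched below under the assumption of whitespace tokenization:

```python
def oracle_constraints(source, reference):
    """Oracle negative constraints: every source word absent from the reference."""
    ref_words = set(reference.split())
    return {w for w in source.split() if w not in ref_words}
```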

Results
The experimental results are shown in Table 2. Across both the RNN and SAN architectures and all three datasets, our PMI-based method consistently improves over the Base method, which does not use constraints, in both the BLEU and SARI metrics. A detailed analysis of the model outputs shows that our PMI method always improves over the Base method in terms of Add and Del for both model architectures. These results indicate that our proposed method promotes active rewriting as expected. In addition, since the Oracle method shows higher performance, further improving PMI-based identification is worthwhile. In this study, we identified the words to be paraphrased using the training corpus for paraphrase generation; in future work, we plan to identify these words using not only a parallel corpus but also larger data.

Table 3: Comparison with previous models on text simplification (Newsela dataset) and formality transfer (GYAFC dataset). Our models achieved the best BLEU scores across styles and domains.

GYAFC-E&M: Informal → Formal

Source: mama so ugly, she scares buzzards off of a meat wagon.
Reference: Your mother is so unattractive she scared buzzards off of a meat wagon.
SAN-BASE: mama is so ugly, she scares buzzards off of a meat wagon.
SAN-PMI: The mother is so unattractive that she scares buzzards off of a meat wagon.

Source: Well, if the one boy picks on you, why like him?
Reference: Well, if that one boy bullies you, why the attraction to him?
SAN-BASE: If the one boy picks on you, why like him?
SAN-PMI: Well, if the one boy teases you, why like him?

Table 4: Examples of generated paraphrases in the formality transfer task. We succeeded in the following paraphrases: mama → mother, picks on → teases.

Table 3 shows a comparison between our models and the comparative models. Whereas Dress-LS has a higher SARI score because it directly optimizes SARI using reinforcement learning, our models achieved the best BLEU scores across styles and domains. Table 4 shows examples of generated paraphrases in the formality transfer task. We succeeded in identifying the informal expressions mama and picks on, and paraphrased them successfully. Our proposed method avoids these informal words during beam search and outputs their synonymous formal expressions, i.e., mother and teases.

Figure 1 shows the sensitivity of the quality of generated paraphrases to the PMI threshold θ on the development dataset. Thresholds that are too low produce a large number of constraints, which adversely affects paraphrase quality; with a high threshold, however, the proposed method achieves high performance stably. Finally, we used a threshold of θ = 0.5, which maximizes the BLEU score on the development dataset, for the formality transfer task and, similarly, θ = 0.2 for the text simplification task.

Related Work

Style-Sensitive Paraphrase Acquisition

Pavlick and Nenkova (2015) worked on style-sensitive paraphrase acquisition. They used a large-scale raw corpus in each style to calculate PMI scores for each word or phrase and assigned style scores to paraphrase pairs in the paraphrase database (Ganitkevitch et al., 2013). Later work further improved style-sensitive paraphrase acquisition based on supervised learning with additional features such as frequency and word embeddings. In this study, as in these previous studies, we identified words that are strongly related to a particular style.
Furthermore, we used these words to control the neural paraphrase generation model and improved the performance of sentential paraphrase generation.

Lexically Constrained Paraphrasing
Hu et al. (2019b) automatically constructed a large-scale paraphrase corpus via lexically constrained machine translation. In a Czech-English bilingual corpus, sentence pairs consisting of a Czech-to-English machine translation and its English reference can be regarded as automatically generated sentential paraphrase pairs (Wieting and Gimpel, 2018). They used words in the reference sentences as positive or negative constraints and succeeded in generating diverse paraphrases via machine translation. In addition, recent work (Hu et al., 2019a) has used lexically constrained paraphrase generation for data augmentation and improved performance in some NLP applications. Unlike these previous studies, we focused on paraphrase generation itself as the application. Furthermore, we have shown that negative lexical constraints consistently improve the performance of paraphrase generation applications such as text simplification and formality transfer.

Conclusion
To improve on the conservative rewriting of paraphrase generation models, we proposed identifying the words to be paraphrased and adding negative lexical constraints to the beam search. Experimental results on English text simplification and formality transfer indicated that the proposed method consistently improved the quality of paraphrase generation for both RNN and SAN models across styles and domains. Our proposed method deleted complex or informal words appearing in the source sentences and promoted the addition of simple or formal words to the paraphrased sentences.