An Empirical Comparison on Imitation Learning and Reinforcement Learning for Paraphrase Generation

Generating paraphrases from given sentences involves decoding words step by step from a large vocabulary. To learn a decoder, supervised learning that maximizes the likelihood of tokens suffers from exposure bias. Although both reinforcement learning (RL) and imitation learning (IL) have been widely used to alleviate this bias, the lack of direct comparison leaves only a partial picture of their benefits. In this work, we present an empirical study of how RL and IL can help boost the performance of paraphrase generation, with the pointer-generator as a base model. Experiments on benchmark datasets show that (1) imitation learning consistently outperforms reinforcement learning; and (2) pointer-generator models trained with imitation learning outperform state-of-the-art methods by a large margin.


Introduction
Generating paraphrases is a fundamental research problem that could benefit many other NLP tasks, such as machine translation, text generation (Radford et al., 2019), document summarization (Chopra et al., 2016), and question answering (McCann et al., 2018). Although various methods have been developed (Zhao et al., 2009; Quirk et al., 2004; Barzilay and Lee, 2003), recent progress on paraphrase generation mainly comes from neural network modeling (Prakash et al., 2016). Particularly, the encoder-decoder framework is widely adopted (Cho et al., 2015), where the encoder takes source sentences as inputs and the decoder generates the corresponding paraphrase for each input sentence.
In supervised learning, a well-known challenge of generating paraphrases is the exposure bias: the current prediction is conditioned on the ground truth during training but on previous predictions during decoding, so errors may accumulate and propagate when generating text. To address this challenge, prior work (Li et al., 2018) suggests utilizing the exploration strategy of reinforcement learning (RL). However, training with RL algorithms is not trivial and often hardly works in practice (Dayan and Niv, 2008). A typical way of using RL in practice is to first train the model with supervised learning (Ranzato et al., 2015; Shen et al., 2015; Bahdanau et al., 2016), which leverages the supervision in the training data and alleviates the exposure bias to some extent. In the middle ground between RL and supervised learning, a well-known category is imitation learning (IL) (Daumé et al., 2009; Ross et al., 2011), which has been used in structured prediction (Bagnell et al., 2007) and other sequential prediction tasks.

In this work, we conduct an empirical comparison between RL and IL to demonstrate the pros and cons of using them for paraphrase generation. We first propose a unified framework that includes some popular learning algorithms as special cases, such as the REINFORCE algorithm (Williams, 1992) in RL and the DAGGER algorithm (Ross et al., 2011) in IL. To better understand the value of different learning techniques, we further offer several variant learning algorithms based on the RL framework. Experiments on the benchmark datasets show that: (1) the DAGGER algorithm is better than the REINFORCE algorithm and its variants on paraphrase generation; (2) the DAGGER algorithm with a certain setting gives the best results, which outperform the previous state-of-the-art by about 13% on the average evaluation score. We expect this work to shed light on how to choose between RL and IL, and how to alleviate the exposure bias for other text generation tasks.

Method
Given an input sentence $x = (x_1, x_2, \dots, x_S)$ with length $S$, a paraphrase generation model outputs a new sentence $y = (y_1, y_2, \dots, y_T)$ with length $T$ that shares the same meaning as $x$. The widely adopted framework for paraphrase generation is the encoder-decoder framework. The encoder reads sentence $x$ and represents it as a single numeric vector or a set of numeric vectors. The decoder defines a probability function $p(y_t \mid y_{<t}, x; \theta)$, where $y_{<t} = (y_1, y_2, \dots, y_{t-1})$ and $\theta$ is the collection of model parameters, typically computed with a nonlinear transition function $f$ and a parameter matrix $W \in \theta$. We use the pointer-generator model (See et al., 2017) as the base model, which is the state-of-the-art model for paraphrase generation (Li et al., 2018). We omit the detailed description of this model; please refer to See et al. (2017) for further information.
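For intuition, the pointer-generator's key idea is to mix a generation distribution over the vocabulary with a copy distribution induced by attention over the source: $P(w) = p_{\text{gen}} P_{\text{vocab}}(w) + (1 - p_{\text{gen}}) \sum_{i: x_i = w} a_i$ (See et al., 2017). The following sketch illustrates this mixture with plain lists; the function name and list-based interface are ours for illustration, not the authors' implementation.

```python
def final_distribution(p_gen, p_vocab, attn, src_ids):
    """Mix the generation and copy distributions, pointer-generator style.

    p_vocab: generation distribution over the vocabulary (a plain list here).
    attn:    attention weights over source positions (sums to 1).
    src_ids: vocabulary id of the source token at each position.
    """
    # Generation path, scaled by p_gen.
    p = [p_gen * pv for pv in p_vocab]
    # Copy path: route attention mass to each source token's vocabulary id.
    for a, wid in zip(attn, src_ids):
        p[wid] += (1.0 - p_gen) * a
    return p
```

Because both input distributions sum to one, the mixture is again a valid distribution, which is what lets the model smoothly interpolate between generating and copying.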

Basic Learning Algorithms
To facilitate the comparison between RL and IL, we propose a unified framework with the following objective function. Given a training example $(x, y)$, the objective function is defined as

  $L(\theta) = r(\tilde{y}, y) \sum_{t=1}^{T} \log \pi_\theta(\tilde{y}_t \mid h_t)$  (2)

Following the terminology in RL and IL, we rename $P(\tilde{y}_t \mid \tilde{y}_{<t}, x; \theta)$ as the policy function $\pi_\theta(\tilde{y}_t \mid h_t)$, which amounts to taking an action based on the current observation, where the action is picking a word $\tilde{y}_t$ from the vocabulary $V$. $r(\tilde{y}, y)$ is a reward function with $r(\tilde{y}, y) = 1$ if $\tilde{y} = y$. In our experiments, we use the ROUGE-2 score (Lin, 2004) as the reward function. Algorithm 1 presents how to optimize $L(\theta)$ in an online learning fashion. As shown in the pseudocode, the schedule rates $(\alpha, \beta)$ and the decoding function $\text{Decode}(\cdot)$ are the keys to understanding the special cases of this unified framework.

The REINFORCE Algorithm. When $\alpha = 0$, $\beta = 0$, and $\text{Decode}(\pi(y \mid h_t))$ is defined as sampling from the policy:

  $\hat{y}_t \sim \pi_\theta(y \mid h_t)$  (3)

the model takes its own previous predictions as inputs and samples every output token, with the log-likelihood weighted by the reward $r(\tilde{y}, y)$, which recovers the REINFORCE algorithm (Williams, 1992).
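Since Algorithm 1 is not reproduced here, the per-step behavior controlled by the schedule rates can be sketched as follows. This is our own simplified rendering: `sample_fn`, the token arguments, and the function name are hypothetical stand-ins for the real decoder state and vocabulary.

```python
import random

def unified_decode_step(y_prev_gold, y_prev_decoded, y_gold, sample_fn,
                        alpha, beta, rng):
    """One decoding step of the unified framework (a sketch)."""
    # Input selection: feed the ground-truth previous token with
    # probability alpha, otherwise the model's own previous prediction
    # (the latter exposes the model to its own errors).
    y_in = y_prev_gold if rng.random() < alpha else y_prev_decoded
    # Output selection: take the ground-truth token with probability
    # beta, otherwise a token drawn from the policy.
    y_out = y_gold if rng.random() < beta else sample_fn(y_in)
    return y_in, y_out

# Special cases recovered by fixing (alpha, beta):
#   MLE:       alpha = 1, beta = 1  (teacher forcing everywhere)
#   DAGGER:    0 < alpha < 1, beta = 1
#   REINFORCE: the model rolls out on its own samples
```

Setting the two rates independently is what lets a single training loop cover MLE, DAGGER, REINFORCE, and the variants discussed below.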
The DAGGER Algorithm. When $0 < \alpha < 1$, $\beta = 1$, and $\text{Decode}(\pi(y \mid h_t))$ is defined as greedy decoding:

  $\hat{y}_t = \arg\max_{y \in V} \pi_\theta(y \mid h_t)$  (4)

Depending on the value of $\alpha$, $\tilde{y}_{t-1}$ is chosen between the ground truth $y_{t-1}$ and the decoded value $\hat{y}_{t-1}$ from Equation 4. On the other hand, $\tilde{y}_t$ always takes the ground truth $y_t$ since $\beta = 1$. As $\tilde{y} = y$, we have $r(\tilde{y}, y) = 1$ and the reward can be dropped from Equation 2. In imitation learning, the ground-truth sequence $y$ is called the expert actions. The DAGGER algorithm (Ross et al., 2011) is also known as scheduled sampling (Bengio et al., 2015) in the recent deep learning literature. To be precise, in DAGGER and scheduled sampling, $\alpha$ is changed dynamically during training: it typically starts at 1 and gradually decays to a certain value over the iterations. As shown in our experiments, the choice of decay scheme has a large impact on model performance.
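A common way to anneal a schedule rate is multiplicative exponential decay clipped at a target value. The sketch below uses our own illustrative names (`rate0`, `k`, `floor`); the paper decays $\alpha$ from 1 toward a target value (0 by default).

```python
def decayed_rate(rate0, k, step, floor=0.0):
    # Exponential decay toward `floor`: rate_step = max(floor, rate0 * k**step).
    # With k close to 1 (e.g. 0.9999), the rate shrinks slowly, so the
    # model is weaned off ground-truth inputs gradually.
    return max(floor, rate0 * (k ** step))
```

A slow decay keeps training close to teacher forcing early on, while a fast decay hands control to the model's own predictions sooner; as noted above, this choice materially affects final performance.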
The MLE Algorithm. Besides, there is a trivial case when $\alpha = 1$, $\beta = 1$. In this case, $\tilde{y}_{t-1}$ and $\tilde{y}_t$ are equal to $y_{t-1}$ and $y_t$ respectively, and $r(\tilde{y}, y) = 1$. Optimizing the objective function in Equation 2 then reduces to maximum likelihood estimation (MLE).

Other Variant Algorithms
Inspired by the previous three special cases, we offer other algorithm variants with different combinations of $(\alpha, \beta)$, while the decoding function $\text{Decode}(\pi(y \mid h_t))$ is the same as in Equation 3 for all of the following variants.
• REINFORCE-GTI (REINFORCE with Ground Truth Input): $\alpha = 1$, $\beta = 0$. Unlike the REINFORCE algorithm, REINFORCE-GTI restricts the input to the decoder to ground-truth words only, i.e., $\tilde{y}_{t-1} = y_{t-1}$. This is a popular implementation of deep reinforcement learning for Seq2Seq models (Keneshloo et al., 2018).
• REINFORCE-SO (REINFORCE with Sampled Output): $\alpha = 1$, $0 < \beta < 1$. In choosing the value of $\tilde{y}_t$ as output from the decoder, REINFORCE-SO allows $\tilde{y}_t$ to take the ground truth $y_t$ with probability $\beta$.
• REINFORCE-SIO (REINFORCE with Sampled Input and Output): $0 < \alpha < 1$, $0 < \beta < 1$. Instead of always taking the ground truth $y_{t-1}$ as input, REINFORCE-SIO further relaxes the constraint in REINFORCE-SO and allows $\tilde{y}_{t-1}$ to be the decoded value $\hat{y}_{t-1}$ with probability $1 - \alpha$.
Unless specified explicitly, an additional requirement when $0 < \alpha, \beta < 1$ is that the rate decays to a certain target value during training, which by default is 0.
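The $(\alpha, \beta)$ settings above can be summarized in a small lookup table. This is our own illustrative summary: the string "decay" marks a rate that is annealed during training, and the REINFORCE row assumes the model rolls out entirely on its own samples ($\alpha = \beta = 0$).

```python
# Schedule-rate settings for each algorithm discussed above.
# "decay" marks a rate annealed during training (target 0 by default).
VARIANT_RATES = {
    "MLE":           {"alpha": 1.0,     "beta": 1.0},
    "DAGGER":        {"alpha": "decay", "beta": 1.0},
    "REINFORCE":     {"alpha": 0.0,     "beta": 0.0},   # assumed setting
    "REINFORCE-GTI": {"alpha": 1.0,     "beta": 0.0},
    "REINFORCE-SO":  {"alpha": 1.0,     "beta": "decay"},
    "REINFORCE-SIO": {"alpha": "decay", "beta": "decay"},
}
```

Reading the table row by row makes the design space explicit: each variant differs only in how much it trusts the ground truth on the input side ($\alpha$) and on the output side ($\beta$).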

Experiments
Dataset and Evaluation Metrics. We evaluate our models on the Quora Question Pair Dataset and the Twitter URL Paraphrasing Dataset (Lan et al., 2017), following prior work (Li et al., 2018; Patro et al., 2018). For the Quora dataset, we follow the configuration of Li et al. (2018) and split the data into 100K training pairs, 30K testing pairs, and 3K validation pairs. For the Twitter dataset, since our model cannot handle negative examples as Li et al. (2018) do, we obtain the 2,869,657 candidate pairs collected over one year from https://languagenet.github.io and filter out all negative examples. We then divide the remaining data into 110K training pairs, 3K testing pairs, and 1K validation pairs.
We use the following evaluation metrics to compare our models with other state-of-the-art neural networks: ROUGE-1 and ROUGE-2 (Lin, 2004), and BLEU with up to bigrams (Papineni et al., 2002). For convenience of comparison, we also report the average of the scores.
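For intuition about what the reward in Section 2 and the evaluation above measure, here is a minimal sketch of ROUGE-2 recall: the clipped fraction of reference bigrams that also appear in the candidate (single reference, no stemming; the actual evaluation uses the standard implementations).

```python
from collections import Counter

def bigrams(tokens):
    # Consecutive token pairs, e.g. ["a","b","c"] -> [("a","b"), ("b","c")].
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2_recall(candidate, reference):
    """ROUGE-2 recall sketch with clipped bigram counts."""
    ref = Counter(bigrams(reference))
    cand = Counter(bigrams(candidate))
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each bigram's credit at its count in the candidate.
    overlap = sum(min(c, cand[b]) for b, c in ref.items())
    return overlap / total
```

BLEU works in the opposite direction (n-gram precision over the candidate with a brevity penalty), which is why reporting both, plus their average, gives a more balanced picture.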
Competitive Systems. We compare our results with four competitive systems on paraphrase generation: the sequence-to-sequence model (Seq2seq; Bahdanau et al., 2014), the Reinforced by Matching framework (RbM; Li et al., 2018), the Residual LSTM (Res-LSTM; Prakash et al., 2016), and the Discriminator LSTM model (Dis-LSTM; Patro et al., 2018). Among these, RbM (Li et al., 2018) is most closely related to our work, since it also uses the pointer-generator as the base model and applies reinforcement learning algorithms for policy learning.
Experimental Setup. We first pre-train the pointer-generator model with MLE, then fine-tune it with the various algorithms proposed in Section 2. Pre-training is critical for making the REINFORCE algorithm and some of its variants work. More implementation details are provided in Appendix A.
Result Analysis. Table 1 reports model performance on the Quora test set, and Table 2 on the Twitter test set. For the Quora dataset, all our models outperform the competitive systems by a large margin. We suspect this is because we evaluate on the development set on the fly during training, which differs from the experimental setup used in Li et al. (2018).
For both datasets, we find that DAGGER with a fixed $(\alpha, \beta)$ gives the best performance among all the algorithm variants. The difference between DAGGER and DAGGER* is that in DAGGER we apply a decay to $\alpha$ at each iteration, $\alpha \leftarrow k \cdot \alpha$ with $k = 0.9999$. In our experiments, we also try different decay rates and report the best results obtained (more details are provided in Table B). The choice of $\alpha$ depends on the task: for the Quora dataset, $\alpha = 0.5$ gives the optimal policy; for the Twitter dataset, $\alpha = 0.2$ does.
As shown in lines 6-11 of Table 1, additional training with any of the variant algorithms clearly improves generation performance over the pre-trained model (line 5). This observation is consistent with many previous works using RL/IL in NLP. However, we also notice that the improvement from the REINFORCE algorithm (line 6) is very small, only 0.18 on the average score.
As shown in lines 2-7 of Table 2, additional training with the variant algorithms also improves performance over the pre-trained model (line 1). However, paraphrase generation on the Twitter dataset is more difficult for the pointer-generator model: in the Twitter dataset, one source sentence has several different paraphrases, while in the Quora dataset, one source sentence corresponds to only one paraphrase. This explains why the average improvement on the Twitter dataset is not as large as on the Quora dataset. Besides, Table 2 also shows that IL (lines 6-7) outperforms RL (lines 2-3), which is consistent with the results in Table 1.
Overall, in this particular setting of paraphrase generation, we find DAGGER much easier to use than the REINFORCE algorithm, as it always takes the ground truth (expert actions) as its outputs, although picking a good decay schedule for $\alpha$ can be tricky. The REINFORCE algorithm (together with its variants), on the other hand, only outperforms the pre-trained baseline by a small margin.

Related Work
Paraphrase generation has the potential to benefit many other NLP research topics, such as machine translation (Madnani et al., 2007) and question answering (Buck et al., 2017; Dong et al., 2017). Early work mainly focused on extracting paraphrases from parallel monolingual texts (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Pang et al., 2003). Later, Quirk et al. (2004) proposed to use statistical machine translation for generating paraphrases directly. Regardless of the particular MT system used in their work, the idea is very similar to the recent use of encoder-decoder frameworks for paraphrase generation (Prakash et al., 2016). In addition, Prakash et al. (2016) extend the encoder-decoder framework with a stacked residual LSTM for paraphrase generation. Li et al. (2018) propose to use the pointer-generator model (See et al., 2017) and train it with an actor-critic RL algorithm. In this work, we also adopt the pointer-generator as the base model, but our learning algorithms are developed by uncovering the connection between RL and IL.
Besides paraphrase generation, many other NLP problems have used RL or IL algorithms to improve performance. For example, structured prediction has more than a decade of history with imitation learning (Daumé et al., 2009; Chang et al., 2015; Vlachos, 2013). In addition, scheduled sampling (as another form of DAGGER) has been used in sequence prediction ever since it was proposed (Bengio et al., 2015). Similar to IL, reinforcement learning, particularly with neural network models, has been widely used in many domains, such as coreference resolution (Yin et al., 2018), document summarization (Chen and Bansal, 2018), and machine translation (Wu et al., 2018).

Conclusion
In this paper, we performed an empirical study of reinforcement learning and imitation learning algorithms for paraphrase generation. We proposed a unified framework that includes the DAGGER and REINFORCE algorithms as special cases and further presented several variant learning algorithms. The experiments demonstrated the benefits and limitations of these algorithms and provided state-of-the-art results on the Quora dataset.