Transformer and seq2seq model for Paraphrase Generation

Paraphrase generation aims to improve the clarity of a sentence by using different wording that conveys similar meaning. To improve the quality of generated paraphrases, we propose a framework that combines the effectiveness of two models: the transformer and the sequence-to-sequence (seq2seq) model. We design a two-layer stack of encoders. The first layer is a transformer model containing 6 stacked identical layers with multi-head self-attention, while the second layer is a seq2seq model with gated recurrent units (GRU-RNN). The transformer encoder layer learns to capture long-term dependencies, together with syntactic and semantic properties of the input sentence. This rich vector representation learned by the transformer serves as input to the GRU-RNN encoder, which is responsible for producing the state vector for decoding. Experimental results on two datasets, QUORA and MSCOCO, show that our framework sets a new benchmark for paraphrase generation.


Introduction
Paraphrasing is a key abstraction technique used in Natural Language Processing (NLP). While capable of generating novel words, it also learns to compress or remove unnecessary words along the way. It thus lends itself to abstractive summarization (Chen and Bansal, 2018; Gehrmann et al., 2018) and question generation (Song et al., 2018) for machine reading comprehension (MRC) (Dong et al., 2017). Paraphrases can also be used as simpler alternatives to input sentences for machine translation (MT) (Callison-Burch et al., 2006) as well as for the evaluation of natural language generation (NLG) texts (Apidianaki et al., 2018).
In this paper, we propose a novel framework for paraphrase generation that utilizes the transformer model of Vaswani et al. (2017) and the seq2seq model of Sutskever et al. (2014), specifically with GRU. The multi-head self-attention of the transformer complements the seq2seq model with its ability to learn long-range dependencies in the input sequence. Also, the individual attention heads in the transformer model mimic behavior related to the syntactic and semantic structure of the sentence (Vaswani et al., 2017, 2018), which is key in paraphrase generation. Furthermore, we use the GRU to obtain a fixed-size state vector for decoding into variable-length sequences, given the richer vector representations learned by the transformer.
The main contributions of this work are:
• We propose a novel framework for the task of paraphrase generation that produces quality paraphrases of its source sentence.
• For in-depth analysis of our results, in addition to using BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), which are based on word overlap, we further evaluate our model using qualitative metrics such as Embedding Average Cosine Similarity (EACS) and Greedy Matching Score (GMS) from Sharma et al. (2017), and METEOR (Banerjee and Lavie, 2005), which correlate more strongly with human judgment.

Task Definition
Given an input sentence $S = (s_1, \ldots, s_n)$ with $n$ words, the task is to generate an alternative output sentence $Y = (y_1, \ldots, y_m) \mid \exists\, y_m \in S$ with $m$ words that conveys similar semantics to $S$, where preferably, but not necessarily, $m < n$.

Method
In this section, we present our framework for paraphrase generation. It follows the popular encode-decode paradigm, but with two stacked layers of encoders. The first encoding layer is a transformer encoder, while the second encoding layer is a GRU-RNN encoder. The paraphrase of a given sentence is generated by a GRU-RNN decoder.

Encoder - TRANSFORMER
We use the transformer encoder as a kind of pre-training module for our input sentence. The goal is to learn a richer representation of the input vector, one that better handles long-term dependencies and captures syntactic and semantic properties, before obtaining a fixed-state representation for decoding into the desired output sentence. The transformer contains 6 stacked identical layers driven mainly by self-attention, as implemented by Vaswani et al. (2017, 2018).
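For reference, the self-attention that drives each transformer layer can be written as below. This is the definition given by Vaswani et al. (2017), not a formula stated in the present text: $Q$, $K$ and $V$ are the query, key and value matrices, $d_k$ is the key dimension, $h$ is the number of heads, and $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, $W^{O}$ are learned projections. The number of heads used in this work is not specified here.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
\]
```

In an encoder layer, $Q$, $K$ and $V$ are all derived from the same input sequence, which is what lets every position attend to every other position regardless of distance.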

Encoder - GRU-RNN
Table 2: Example paraphrases on the MSCOCO dataset (S: source, G: generated, R: reference).
S: Three dimensional rendering of a kitchen area with various appliances.
G: a series of photographs of a kitchen
R: A series of photographs of a tiny model kitchen
S: a young boy in a soccer uniform kicking a ball
G: a young boy kicking a soccer ball
R: A young boy kicking a soccer ball on a green field.
S: The dog is wearing a Santa Claus hat.
G: a dog poses with santa hat
R: A dog poses while wearing a santa hat.
S: the people are sampling wine at a wine tasting.
G: a group of people wine tasting.
R: Group of people tasting wine next to some barrels.

Our architecture uses a single-layer uni-directional GRU-RNN whose input is the output of the transformer. The GRU-RNN encoder (Chung et al., 2014) produces a fixed-state vector representation of the transformed input sequence using the GRU update equations, where $r$ and $z$ are the reset and update gates respectively, $W$ and $U$ are the network's parameters, $s_t$ is the hidden state vector at timestep $t$, $x_t$ is the input vector and $\odot$ represents the Hadamard product.
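One common way of writing the GRU update, using the symbols defined above, is sketched below. This is an assumed standard formulation, not necessarily the authors' exact equations or numbering: the superscripts on $U$ and $W$, the logistic sigmoid $\sigma$, and the candidate state $h$ are notation introduced here for illustration, and some formulations swap the roles of $z$ and $1 - z$ in the last line.

```latex
\begin{align*}
z   &= \sigma\left(x_t U^{z} + s_{t-1} W^{z}\right) \\
r   &= \sigma\left(x_t U^{r} + s_{t-1} W^{r}\right) \\
h   &= \tanh\left(x_t U^{h} + (s_{t-1} \odot r)\, W^{h}\right) \\
s_t &= (1 - z) \odot h + z \odot s_{t-1}
\end{align*}
```

Under this form, the reset gate $r$ controls how much of the previous state enters the candidate $h$, and the update gate $z$ interpolates between the candidate and the previous state. The state after the last input position serves as the fixed-state vector.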

Decoder - GRU-RNN
The fixed-state vector representation produced by the GRU-RNN encoder is used as the initial state for the decoder. At each time step $t$, the decoder receives the previously generated word $y_{t-1}$ and the hidden state $s_{t-1}$ from time step $t-1$. The output word $y_t$ at each time step is selected using a softmax probability distribution over the set of vocabulary words $V$, computed from the vector in equation 3.
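The decoding step described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the GRU update uses a standard formulation, the output projection `Wout` that maps the hidden state to vocabulary scores is assumed, and all function and parameter names (`gru_step`, `decoder_step`, `init_params`, `Uz`, `Wz`, and so on) are hypothetical.

```python
import math
import random


def matvec(x, M):
    """Row vector x (length n) times matrix M (n rows, k columns)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def gru_step(x, s_prev, p):
    """One GRU update (standard formulation, assumed here)."""
    z = sigmoid([a + b for a, b in zip(matvec(x, p["Uz"]), matvec(s_prev, p["Wz"]))])
    r = sigmoid([a + b for a, b in zip(matvec(x, p["Ur"]), matvec(s_prev, p["Wr"]))])
    gated = [s * g for s, g in zip(s_prev, r)]  # Hadamard product of s_{t-1} and r
    h = [math.tanh(a + b) for a, b in zip(matvec(x, p["Uh"]), matvec(gated, p["Wh"]))]
    return [(1.0 - zi) * hi + zi * si for zi, hi, si in zip(z, h, s_prev)]


def decoder_step(y_prev_emb, s_prev, p):
    """Return (probability distribution over V, new hidden state s_t)."""
    s_t = gru_step(y_prev_emb, s_prev, p)
    return softmax(matvec(s_t, p["Wout"])), s_t  # Wout is an assumed projection


def init_params(emb_dim, hidden, vocab_size, seed=0):
    """Small random weights; names and shapes are illustrative only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    p = {name: mat(emb_dim, hidden) for name in ("Uz", "Ur", "Uh")}
    p.update({name: mat(hidden, hidden) for name in ("Wz", "Wr", "Wh")})
    p["Wout"] = mat(hidden, vocab_size)
    return p
```

For example, `decoder_step(embedding_of_previous_word, encoder_state, init_params(300, 300, 15000))` would return a distribution over a 15,000-word vocabulary, although a pure-Python loop at that size is only practical as a teaching aid.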

Experiments
We describe baselines, our implementation settings, datasets and evaluation of our proposed model.

Baselines
We compare our model with very recent models (Gupta et al., 2018; Li et al., 2018; Prakash et al., 2016), including the current state-of-the-art (Gupta et al., 2018) in the field. To further highlight the gain of stacking two encoders, we use each component, Transformer (TRANS) and seq2seq (SEQ), as a baseline.
• VAE-SVG-EQ (Gupta et al., 2018): This is the current state-of-the-art in the field, with a variational autoencoder as its main component.
• RbM-SL (Li et al., 2018): Different from the encoder-decoder framework, this is a generator-evaluator framework, with the evaluator trained by reinforcement learning.
• TRANS: Encoder-decoder framework as described in Section 3 but with a single transformer encoder layer.
• SEQ: Encoder-decoder framework as described in Section 3 but with a single GRU-RNN encoder layer.

Implementation
We used pre-trained 300-dimensional GloVe¹ word embeddings (Pennington et al., 2014) as the distributed representation of our input sentences. We set the maximum sentence length to 15 and 10 for our input and target sentences respectively, following the statistics of our dataset.
Table 4: Performance of our model against various models on the MSCOCO dataset. R-L refers to the ROUGE-L F1 score with 95% confidence interval.
For the transformer encoder, we used the transformer_base hyperparameter setting from the tensor2tensor library (Vaswani et al., 2018)², but set the hidden size to 300. We set dropout to 0.0 and 0.7 for the MSCOCO and QUORA datasets respectively. We used a large dropout for QUORA because the model tends to over-fit to the training set. Both the GRU-RNN encoder and decoder contain 300 hidden units. We pre-process our datasets ourselves, and do not use the pre-processed/tokenized versions of the datasets from the tensor2tensor library. Our target vocabulary is a set of approximately 15,000 words. It contains the words in our target training and test sets that occur at least twice. Using this subset of vocabulary words, as opposed to the over 320,000 vocabulary words contained in GloVe, improves both the training time and the performance of the model.
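The vocabulary construction described above (keep words that occur at least twice, then truncate sentences to a maximum length) can be sketched as follows. This is an illustration under assumptions: whitespace tokenization, lowercasing, the special tokens, and the function names `build_vocab` and `encode` are all hypothetical choices, since the pre-processing details are not given in the text.

```python
from collections import Counter


def build_vocab(sentences, min_count=2,
                specials=("<pad>", "<unk>", "<s>", "</s>")):
    """Map each word seen at least `min_count` times to an integer id.

    The special tokens are an assumption for this sketch.
    """
    counts = Counter(tok for sent in sentences for tok in sent.lower().split())
    words = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(list(specials) + words)}


def encode(sentence, vocab, max_len):
    """Convert a sentence to ids, truncating and padding to `max_len`."""
    unk, pad = vocab["<unk>"], vocab["<pad>"]
    ids = [vocab.get(tok, unk) for tok in sentence.lower().split()][:max_len]
    return ids + [pad] * (max_len - len(ids))
```

With the settings reported above, input sentences would be encoded with `max_len=15` and target sentences with `max_len=10`. Words below the frequency threshold fall back to the unknown token, which is what keeps the output layer at roughly 15,000 entries instead of the full embedding vocabulary.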
We train with a fixed learning rate of 0.0005, evaluate our model after each epoch, and stop training when the validation loss has not decreased for 5 epochs.
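The stopping rule can be sketched as a simple patience counter. This is one reading of the rule, assuming that "5 epochs" means five evaluations in a row without a new best validation loss. The callables `run_epoch` and `validate`, and the `max_epochs` cap, are hypothetical placeholders for the actual training and evaluation code.

```python
def train_with_early_stopping(run_epoch, validate, patience=5, max_epochs=100):
    """Train until the validation loss fails to improve `patience` times in a row.

    run_epoch: callable that trains for one epoch (placeholder).
    validate:  callable that returns the current validation loss (placeholder).
    Returns the best validation loss and the per-epoch loss history.
    """
    best = float("inf")
    stale = 0  # evaluations since the last improvement
    history = []
    for _ in range(max_epochs):
        run_epoch()
        loss = validate()
        history.append(loss)
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best, history
```

In practice one would also save a checkpoint whenever `best` improves, so that the model used for testing is the one with the lowest validation loss rather than the last one trained.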
The model learns to minimize the seq2seq loss implemented in the TensorFlow API³ with AdamOptimizer. We use greedy decoding during training and validation, and set the maximum number of decoding iterations to 5 times the target sentence length. For testing/inference we use beam-search decoding.
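Greedy decoding with an iteration cap can be sketched as below. This is an illustrative loop, not the TensorFlow decoder the authors used: `step_fn` is a hypothetical callable that takes the previous token id and the decoder state and returns a probability list over the vocabulary together with the new state, and the start and end token ids are assumptions.

```python
def greedy_decode(step_fn, init_state, start_id, end_id, target_len=10, factor=5):
    """Pick the most probable word at every step until the end token appears.

    Decoding is cut off after `factor * target_len` iterations, mirroring the
    cap of 5 times the target sentence length described above.
    """
    max_iters = factor * target_len
    state, prev, output = init_state, start_id, []
    for _ in range(max_iters):
        probs, state = step_fn(prev, state)
        prev = max(range(len(probs)), key=probs.__getitem__)  # argmax over V
        if prev == end_id:
            break
        output.append(prev)
    return output
```

Beam search, used at test time, differs in that it keeps several partial hypotheses at each step instead of committing to the single most probable word.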

Datasets
We evaluate our model on two standard datasets for paraphrase generation, QUORA⁴ and MSCOCO (Lin et al., 2014), as described in Gupta et al. (2018), and use similar settings. The QUORA dataset contains over 120k examples, with an 80k and 40k split between the training and test sets respectively. As seen in Tables 1 and 2, while the QUORA dataset contains question pairs, MSCOCO contains free-form texts which are human annotations of images. Subjective observation of the MSCOCO dataset reveals that most of its paraphrase pairs contain more novel words and syntactic manipulations than the QUORA pairs, making it a more interesting paraphrase generation corpus. We split the QUORA dataset into 50k, 100k and 150k training samples and 4k testing samples in order to align with baseline models for comparative purposes.
² https://github.com/tensorflow/tensor2tensor
³ https://www.tensorflow.org/api_docs/python/tf/contrib/seq2seq/sequence_loss
⁴ https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

Evaluation
For quantitative analysis of our model, we use popular automatic metrics such as BLEU, ROUGE and METEOR. Since BLEU and ROUGE both measure n-gram word overlap, differing in the brevity penalty, we report just the ROUGE-L value. We also use two additional recent metrics, GMS and EACS, from Sharma et al. (2017)⁵, which measure the similarity between the reference and generated paraphrases based on the cosine similarity of their embeddings at the word and sentence levels respectively.
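The two embedding-based metrics can be sketched as follows. This is a simplified illustration of the general idea (sentence-level cosine of averaged word vectors for EACS, word-level best-match cosine averaged in both directions for GMS) and may differ in detail from the implementation of Sharma et al. (2017). The function names and the `emb` dictionary of word vectors are hypothetical, and real use would rely on pre-trained embeddings.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def sentence_vector(tokens, emb):
    """Average the vectors of the tokens that have an embedding."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def eacs(hyp, ref, emb):
    """Sentence level: cosine similarity of the averaged embeddings."""
    h, r = sentence_vector(hyp, emb), sentence_vector(ref, emb)
    return cosine(h, r) if h is not None and r is not None else 0.0


def _greedy_one_way(a, b, emb):
    a_vecs = [emb[t] for t in a if t in emb]
    b_vecs = [emb[t] for t in b if t in emb]
    if not a_vecs or not b_vecs:
        return 0.0
    # Each word in `a` is matched to its most similar word in `b`.
    return sum(max(cosine(u, v) for v in b_vecs) for u in a_vecs) / len(a_vecs)


def gms(hyp, ref, emb):
    """Word level: greedy matching averaged over both directions."""
    return 0.5 * (_greedy_one_way(hyp, ref, emb) + _greedy_one_way(ref, hyp, emb))
```

Because both scores compare vectors rather than surface strings, a generated word such as "stupidest" can still receive credit against a reference containing "dumbest", which word-overlap metrics would count as a miss.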

Result Analysis
Tables 3 and 4 report the scores of our model on both datasets. Our model pushes the benchmark on all evaluation metrics when compared against the current top published models evaluated on the same datasets. Since several words can connote similar meaning, it is more logical to evaluate with metrics that compare embedding vectors, which are capable of measuring this similarity. Hence we also report GMS and EACS scores as a basis of comparison for future work in this direction.
Besides quantitative values, Tables 1 and 2 show that our paraphrases are well formed, abstractive (e.g., "dumbest" becomes "stupidest", "dog is wearing" becomes "dog poses"), capable of performing syntactic manipulations (e.g., "in a soccer uniform kicking a ball" becomes "kicking a soccer ball") and compression. Some of our paraphrased sentences are even briefer than the reference and still remain very meaningful.

Related Work
Our baseline models, VAE-SVG-EQ (Gupta et al., 2018) and RbM-SL (Li et al., 2018), are both deep learning models. While the former uses a variational autoencoder and is capable of generating multiple paraphrases of a given sentence, the latter uses deep reinforcement learning. In line with the seq2seq part of our approach, there exist many models with interesting variants: residual LSTM (Prakash et al., 2016), bi-directional GRU with attention and special decoding tweaks (Cao et al., 2017), and attention from the perspective of semantic parsing (Su and Yan, 2017).
MT has been widely used to generate paraphrases (Quirk et al., 2004; Zhao et al., 2008) due to the availability of large corpora, while much earlier works explored the use of manually drafted rules (Hassan et al., 2007; Kozlowski et al., 2003). Similar to our model architecture, transformers and RNN-based encoders have been combined for MT. Zhao et al. (2018) recently used the transformer model for paraphrasing on different datasets. We experimented using solely a transformer but got better results with TRANSEQ. To the best of our knowledge, our work is the first to cross-breed the transformer and seq2seq for the task of paraphrase generation.

Conclusions
We proposed a novel framework, TRANSEQ, that combines the efficiency of a transformer and a seq2seq model and improves the current state-of-the-art on the QUORA and MSCOCO paraphrasing datasets. Besides quantitative results, we presented examples that highlight the syntactic and semantic quality of our generated paraphrases.
In the future, it will be interesting to apply this framework to the task of abstractive text summarization and other NLG-related problems.