Revisiting Recurrent Networks for Paraphrastic Sentence Embeddings

We consider the problem of learning general-purpose, paraphrastic sentence embeddings, revisiting the setting of Wieting et al. (2016b). While they found LSTM recurrent networks to underperform word averaging, we present several developments that together produce the opposite conclusion. These include training on sentence pairs rather than phrase pairs, averaging states to represent sequences, and regularizing aggressively. These improve LSTMs in both transfer learning and supervised settings. We also introduce a new recurrent architecture, the Gated Recurrent Averaging Network, that is inspired by averaging and LSTMs while outperforming them both. We analyze our learned models, finding evidence of preferences for particular parts of speech and dependency relations.


Introduction
Modeling sentential compositionality is a fundamental aspect of natural language semantics. Researchers have proposed a broad range of compositional functional architectures (Mitchell and Lapata, 2008; Socher et al., 2011; Kalchbrenner et al., 2014) and evaluated them on a large variety of applications. Our goal is to learn a general-purpose sentence embedding function that can be used unmodified for measuring semantic textual similarity (STS) (Agirre et al., 2012) and can also serve as a useful initialization for downstream tasks. We wish to learn this embedding function such that sentences with high semantic similarity have high cosine similarity in the embedding space. In particular, we focus on the setting of Wieting et al. (2016b), in which models are trained on noisy paraphrase pairs and evaluated on both STS and supervised semantic tasks.
Surprisingly, Wieting et al. found that simple embedding functions (those based on averaging word vectors) outperform more powerful architectures based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). In this paper, we revisit their experimental setting and present several techniques that together improve the performance of the LSTM to be superior to word averaging.
We first change data sources: rather than train on noisy phrase pairs from the Paraphrase Database (PPDB; Ganitkevitch et al., 2013), we use noisy sentence pairs obtained automatically by aligning Simple English to standard English Wikipedia (Coster and Kauchak, 2011). Even though this data was intended for use by text simplification systems, we find it to be efficient and effective for learning sentence embeddings, outperforming much larger sets of examples from PPDB.
We then show how we can modify and regularize the LSTM to further improve its performance. The main modification is to simply average the hidden states instead of using the final one. For regularization, we experiment with two kinds of dropout and also with randomly scrambling the words in each input sequence. We find that these techniques help in the transfer learning setting and on two supervised semantic similarity datasets as well. Further gains are obtained on the supervised tasks by initializing with our models from the transfer setting.
Inspired by the strong performance of both averaging and LSTMs, we introduce a novel recurrent neural network architecture which we call the GATED RECURRENT AVERAGING NETWORK (GRAN). The GRAN outperforms averaging and the LSTM in both the transfer and supervised learning settings, forming a promising new recurrent architecture for semantic modeling.

Related Work
Modeling sentential compositionality has received a great deal of attention in recent years.
Most work cited above uses a supervised learning framework, so the composition function is learned discriminatively for a particular task. In this paper, we are primarily interested in creating general-purpose, domain-independent embeddings for word sequences. Several others have pursued this goal (Socher et al., 2011; Le and Mikolov, 2014; Pham et al., 2015; Kiros et al., 2015; Hill et al., 2016; Arora et al., 2017; Pagliardini et al., 2017), though usually with the intent to extract useful features for supervised sentence tasks rather than to capture semantic similarity.
An exception is the work of Wieting et al. (2016b). We closely follow their experimental setup and directly address some outstanding questions in their experimental results. Here we briefly summarize their main findings and their attempts at explaining them. They made the surprising discovery that word averaging outperforms LSTMs by a wide margin in the transfer learning setting. They proposed several hypotheses for why this occurs. They first considered that the LSTM was unable to adapt to the differences in sequence length between phrases in training and sentences in test. This was ruled out by showing that neither model showed any strong correlation between sequence length and performance on the test data.
They next examined whether the LSTM was overfitting on the training data, but then showed that both models achieve similar values of the training objective and similar performance on in-domain held-out test sets. Lastly, they considered whether their hyperparameters were inadequately tuned, but extensive hyperparameter tuning did not change the story. Therefore, the reason for the performance gap, and how to correct it, was left as an open problem. This paper takes steps toward addressing that problem.

Models
Our goal is to embed a word sequence $s$ into a fixed-length vector. We focus on three compositional models in this paper, all of which use words as the smallest unit of compositionality. We denote the $t$th word in $s$ as $s_t$, and we denote its word embedding by $x_t$.
Our first two models have been well-studied in prior work, so we describe them briefly. The first, which we call AVG, simply averages the embeddings $x_t$ of all words in $s$. The only parameters learned in this model are those in the word embeddings themselves, which are stored in the word embedding matrix $W_w$. This model was found by Wieting et al. (2016b) to perform very strongly for semantic similarity tasks.
Our second model uses a long short-term memory (LSTM) recurrent neural network (Hochreiter and Schmidhuber, 1997) to embed $s$. We use the LSTM variant from Gers et al. (2003) including its "peephole" connections. We consider two ways to obtain a sentence embedding from the LSTM. The first uses the final hidden vector, which we denote $h_{-1}$. The second, denoted LSTMAVG, averages all hidden vectors of the LSTM. In both variants, the learnable parameters include both the LSTM parameters $W_c$ and the word embeddings $W_w$.
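The three sentence representations described so far differ only in which vectors are pooled and how. The following is a minimal sketch of that pooling step, assuming the word embeddings and LSTM hidden vectors are already available as plain Python lists; the function names `mean_pool` and `last_state` and all numeric values are illustrative and not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty sequence of equal-length vectors, coordinate by coordinate."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def last_state(hidden: List[Vector]) -> Vector:
    """Final-state representation: keep only the last hidden vector."""
    return hidden[-1]


# Toy values (assumed for illustration), not real embeddings or LSTM outputs.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hidden_states = [[0.2, 0.4], [0.6, 0.0], [0.4, 0.8]]

avg_embedding = mean_pool(word_vectors)       # AVG: mean of word embeddings
lstm_embedding = last_state(hidden_states)    # LSTM: final hidden vector
lstmavg_embedding = mean_pool(hidden_states)  # LSTMAVG: mean of all hidden vectors
```

In this sketch the final-state variant depends on the earlier words only through the last hidden vector, whereas the averaged variant gives every time step an equal share of the result.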
Inspired by the success of the two models above, we propose a third model, which we call the GATED RECURRENT AVERAGING NETWORK (GRAN). The GRAN combines the benefits of AVG and LSTMs. In fact it reduces to AVG if the output of the gate is all ones. We first use an LSTM to generate a hidden vector, $h_t$, for each word $s_t$ in $s$. Then we use $h_t$ to compute a gate that will be elementwise-multiplied with $x_t$, resulting in a new, gated hidden vector $a_t$ for each step $t$:
$$a_t = x_t \odot \sigma(W_x x_t + W_h h_t + b) \qquad (1)$$
where $W_x$ and $W_h$ are parameter matrices, $b$ is a parameter vector, and $\sigma$ is the elementwise logistic sigmoid function. After all $a_t$ have been generated for a sentence, they are averaged to produce the embedding for that sentence. This model includes as learnable parameters those of the LSTM, the word embeddings, and the additional parameters in Eq. (1). For both the LSTM and GRAN models, we use $W_c$ to denote the "compositional" parameters, i.e., all parameters other than the word embeddings.
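A minimal sketch of the gating-and-averaging step may help make the GRAN concrete. It assumes the hidden vectors have already been produced by some LSTM (not implemented here) and that the gate has the form sigmoid(W_x x_t + W_h h_t + b); the helper names `matvec` and `gran_embed` and the toy values are assumptions made for illustration only.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sigmoid(z: float) -> float:
    """Logistic sigmoid for a single pre-activation value."""
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product, one output coordinate per row of W."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def gran_embed(xs: List[Vector], hs: List[Vector],
               W_x: Matrix, W_h: Matrix, b: Vector) -> Vector:
    """Gate each word embedding elementwise, then average the gated vectors."""
    dim = len(xs[0])
    total = [0.0] * dim
    for x_t, h_t in zip(xs, hs):
        pre = [p + q + r for p, q, r in zip(matvec(W_x, x_t), matvec(W_h, h_t), b)]
        gate = [sigmoid(z) for z in pre]
        for i in range(dim):
            total[i] += x_t[i] * gate[i]  # a_t = x_t * gate, accumulated for the mean
    return [value / len(xs) for value in total]


# Toy inputs: two words with 2-d embeddings and 1-d hidden vectors (assumed shapes).
xs = [[1.0, 2.0], [3.0, 4.0]]
hs = [[0.5], [-0.5]]
zero_W_x = [[0.0, 0.0], [0.0, 0.0]]
zero_W_h = [[0.0], [0.0]]

saturated = gran_embed(xs, hs, zero_W_x, zero_W_h, [50.0, 50.0])  # gate close to 1
half_open = gran_embed(xs, hs, zero_W_x, zero_W_h, [0.0, 0.0])    # gate equal to 0.5
```

With zero weight matrices and a large positive bias the gate saturates near one, so `saturated` approaches the plain word average; this is the reduction to AVG described in the text.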
The motivation for the GRAN is that we are contextualizing the word embeddings prior to averaging. The gate can be seen as an attention, attending to the prior context of the sentence. We also experiment with four other variations of this model, though they generally were more complex and showed inferior performance. In the first, GRAN-2, the gate is applied to $h_t$ (rather than $x_t$) to produce $a_t$, and then these $a_t$ are averaged as before.
GRAN-3 and GRAN-4 use two gates: one applied to $x_t$ and one applied to $a_{t-1}$. We tried two different ways of computing these gates, one for each model. In both, the sum of the two gated terms comprised $a_t$. In these models, the last average hidden state, $a_{-1}$, was used as the sentence embedding after dividing it by the length of the sequence. In these models, we are additionally keeping a running average of the embeddings that is being modified by the context at every time step. In GRAN-4, this running average is also considered when producing the contextualized word embedding.
Lastly, we experimented with a fifth GRAN, GRAN-5, in which we use two gates, calculated by $\sigma(W_{x_i} x_t + W_{h_i} h_t + b_i)$ for each gate $i$. The first is applied to $x_t$ and the second is applied to $h_t$. The output of these gates is then summed. Therefore GRAN-5 can be reduced to either word averaging or averaging LSTM states, depending on the behavior of the gates. If the first gate is all ones and the second all zeros throughout the sequence, the model is equivalent to word averaging. Conversely, if the first gate is all zeros and the second is all ones throughout the sequence, the model is equivalent to averaging the LSTM states. Further analysis of these models is included in Section 4.

Training
We follow the training procedure of Wieting et al. (2015) and Wieting et al. (2016b), described below. The training data consists of a set $S$ of phrase or sentence pairs $\langle s_1, s_2 \rangle$ from either the Paraphrase Database (PPDB; Ganitkevitch et al., 2013) or the aligned Wikipedia sentences (Coster and Kauchak, 2011) where $s_1$ and $s_2$ are assumed to be paraphrases. We optimize a margin-based loss:
$$\min_{W_c, W_w} \frac{1}{|S|} \Bigg( \sum_{\langle s_1, s_2 \rangle \in S} \max(0, \delta - \cos(g(s_1), g(s_2)) + \cos(g(s_1), g(t_1))) + \max(0, \delta - \cos(g(s_1), g(s_2)) + \cos(g(s_2), g(t_2))) \Bigg) + \lambda_c \|W_c\|^2 + \lambda_w \|W_{w_{\mathrm{initial}}} - W_w\|^2 \qquad (2)$$
where $g$ is the model in use (e.g., AVG or LSTM), $\delta$ is the margin, $\lambda_c$ and $\lambda_w$ are regularization parameters, $W_{w_{\mathrm{initial}}}$ is the initial word embedding matrix, and $t_1$ and $t_2$ are carefully-selected negative examples taken from a mini-batch during optimization. The intuition is that we want the two phrases to be more similar to each other ($\cos(g(s_1), g(s_2))$) than either is to their respective negative examples $t_1$ and $t_2$, by a margin of at least $\delta$.
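The margin-based objective can be sketched in a few lines once the embeddings have been computed. The code below is an illustrative reading of the description above, not the authors' implementation: it assumes each training example is already a tuple of four embedding vectors (for s1, s2, t1, t2) and that the parameters are flattened into lists; the names `pair_hinge` and `objective` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, Vector, Vector, Vector]  # g(s1), g(s2), g(t1), g(t2)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_hinge(g_s1: Vector, g_s2: Vector, g_t1: Vector, g_t2: Vector,
               delta: float) -> float:
    """Two hinge terms: each side of the pair against its own negative example."""
    pos = cosine(g_s1, g_s2)
    return (max(0.0, delta - pos + cosine(g_s1, g_t1))
            + max(0.0, delta - pos + cosine(g_s2, g_t2)))


def objective(examples: List[Example], delta: float,
              lambda_c: float, W_c: Sequence[float],
              lambda_w: float, W_w: Sequence[float],
              W_w_initial: Sequence[float]) -> float:
    """Mean hinge loss plus L2 penalties on W_c and on the drift of W_w."""
    hinge = sum(pair_hinge(*ex, delta) for ex in examples) / len(examples)
    reg_c = lambda_c * sum(w * w for w in W_c)
    reg_w = lambda_w * sum((a - b) ** 2 for a, b in zip(W_w_initial, W_w))
    return hinge + reg_c + reg_w


# Toy examples: one already separated by the margin, one maximally violated.
easy = ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
hard = ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
```

The `easy` example contributes nothing because the positive pair already beats both negatives by more than the margin, while the `hard` example is penalized on both hinge terms.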

Selecting Negative Examples
To select $t_1$ and $t_2$ in Eq. (2), we simply choose the most similar phrase in some set of phrases (other than those in the given phrase pair). For simplicity we use the mini-batch for this set, but it could be a different set. That is, we choose $t_1$ for a given $\langle s_1, s_2 \rangle$ as follows:
$$t_1 = \operatorname*{argmax}_{t : \langle t, \cdot \rangle \in S_b \setminus \{\langle s_1, s_2 \rangle\}} \cos(g(s_1), g(t)) \qquad (3)$$
where $S_b \subseteq S$ is the current mini-batch. That is, we want to choose a negative example $t_i$ that is similar to $s_i$ according to the current model. The downside is that we may occasionally choose a phrase $t_i$ that is actually a true paraphrase of $s_i$.
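The selection rule described above can be sketched as a search over the mini-batch. This is a hedged illustration that assumes the mini-batch is given as a list of already-embedded pairs and that both sides of every other pair are eligible candidates; the function name `select_negatives` and the returned (pair index, side) encoding are assumptions for this example.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pick = Tuple[int, int]  # (index of the pair in the batch, side 0 or 1)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_negatives(batch: List[Tuple[Vector, Vector]]) -> List[Tuple[Pick, Pick]]:
    """For each pair, pick the most similar phrase from the other pairs for each side.

    Requires at least two pairs in the batch, otherwise there are no candidates.
    """
    picks = []
    for k, (g_s1, g_s2) in enumerate(batch):
        # Candidates are all phrases in the batch except the two in pair k.
        candidates = [(j, side) for j in range(len(batch)) if j != k for side in (0, 1)]
        t1 = max(candidates, key=lambda c: cosine(g_s1, batch[c[0]][c[1]]))
        t2 = max(candidates, key=lambda c: cosine(g_s2, batch[c[0]][c[1]]))
        picks.append((t1, t2))
    return picks


# Toy mini-batch of three embedded pairs (values chosen for illustration).
toy_batch = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([0.9, 0.1], [0.0, 1.0]),
    ([-1.0, 0.0], [0.0, -1.0]),
]
negatives = select_negatives(toy_batch)
```

Because the search uses the current model's embeddings, the chosen negatives change as training progresses, which is what makes them hard examples.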
Our experiments are designed to address the empirical question posed by Wieting et al. (2016b): why do LSTMs underperform AVG for transfer learning? In Sections 4.1.2-4.2, we make progress on this question by presenting methods that bridge the gap between the two models in the transfer setting. We then apply these same techniques to improve performance in the supervised setting, described in Section 4.3. In both settings we also evaluate our novel GRAN architecture, finding it to consistently outperform both AVG and the LSTM.

Datasets and Tasks
We train on large sets of noisy paraphrase pairs and evaluate on a diverse set of 22 textual similarity datasets, including all datasets from every SemEval semantic textual similarity (STS) task from 2012 to 2015. We also evaluate on the SemEval 2015 Twitter task (Xu et al., 2015) and the SemEval 2014 SICK Semantic Relatedness task (Marelli et al., 2014). Given two sentences, the aim of the STS tasks is to predict their similarity on a 0-5 scale, where 0 indicates the sentences are on different topics and 5 indicates that they are completely equivalent. We report the average Pearson's r over these 22 sentence similarity tasks. Each STS task consists of 4-6 datasets covering a wide variety of domains, including newswire, tweets, glosses, machine translation outputs, web forums, news headlines, image and video captions, among others. Further details are provided in the official task descriptions (Agirre et al., 2012, 2013, 2014, 2015).
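The reported metric, Pearson's r averaged over datasets, can be sketched as follows. The code assumes each dataset is represented as a pair of equal-length lists (model similarities and gold scores); the function names and the toy numbers are illustrative and are not results from the paper.

```python
import math
from typing import Dict, List, Tuple


def pearson_r(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; assumes both lists have non-zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_pearson(datasets: Dict[str, Tuple[List[float], List[float]]]) -> float:
    """Unweighted mean of per-dataset Pearson's r, so every dataset counts equally."""
    scores = [pearson_r(pred, gold) for pred, gold in datasets.values()]
    return sum(scores) / len(scores)


# Made-up predictions (cosine similarities) and gold scores on the 0-5 scale.
toy_datasets = {
    "dataset_a": ([0.1, 0.5, 0.9], [1.0, 3.0, 5.0]),
    "dataset_b": ([0.9, 0.5, 0.1], [1.0, 3.0, 5.0]),
}
```

Pearson's r is invariant to the scale of the predictions, which is why raw cosine similarities can be compared directly against gold scores on a 0-5 scale.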

Experiments with Data Sources
We first investigate how different sources of training data affect the results. We try two data sources. The first is phrase pairs from the Paraphrase Database (PPDB). PPDB comes in different sizes (S, M, L, XL, XXL, and XXXL), where each larger size subsumes all smaller ones. The pairs in PPDB are sorted by a confidence measure and so the smaller sets contain higher precision paraphrases. PPDB is derived automatically from naturally-occurring bilingual text, and versions of PPDB have been released for many languages without the need for any manual annotation (Ganitkevitch and Callison-Burch, 2014).
The second source of data is a set of sentence pairs automatically extracted from Simple English Wikipedia and English Wikipedia articles by Coster and Kauchak (2011). This data was extracted for developing text simplification systems, where each instance pairs a simple and complex sentence representing approximately the same information. Though the data was obtained for simplification, we use it as a source of training data for learning paraphrastic sentence embeddings. The dataset, which we call SimpWiki, consists of 167,689 sentence pairs.
To ensure a fair comparison, we select a sample of pairs from PPDB XL such that the number of tokens is approximately the same as the number of tokens in the SimpWiki sentences. We use PARAGRAM-SL999 embeddings (Wieting et al., 2015) to initialize the word embedding matrix ($W_w$) for all models. For all experiments, we fix the mini-batch size to 100, and $\lambda_c$ to 0. We tune the margin $\delta$ over $\{0.4, 0.6, 0.8\}$ and $\lambda_w$ over $\{10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}, 10^{-8}, 0\}$. We train AVG for 7 epochs, and the LSTM for 3, since it converges much faster and does not benefit from 7 epochs. For optimization we use Adam (Kingma and Ba, 2015) with a learning rate of 0.001. We use the 2016 STS tasks (Agirre et al., 2016) for model selection, where we average the Pearson's r over its 5 datasets. We refer to this type of model selection as test. For evaluation, we report the average Pearson's r over the 22 other sentence similarity tasks.
The results are shown in Table 1. We first note that, when training on PPDB, we find the same result as Wieting et al. (2016b): AVG outperforms the LSTM by more than 13 points. However, when training both on sentence pairs, the gap shrinks to about 9 points. It appears that part of the inferior performance for the LSTM in prior work was due to training on phrase pairs rather than on sentence pairs. The AVG model also benefits from training on sentences, but not nearly as much as the LSTM. Our hypothesis explaining this result is that in PPDB, the phrase pairs are short fragments of text which are not necessarily constituents or phrases in any syntactic sense. Therefore, the sentences in the STS test sets are quite different from the fragments seen during training. We hypothesize that while word-averaging is relatively unaffected by this difference, the recurrent models are much more sensitive to overall characteristics of the word sequences, and the difference between train and test matters much more.
These results also suggest that the SimpWiki data, even though it was developed for text simplification, may be useful for other researchers working on semantic textual similarity tasks.

Experiments with LSTM Variations
We next compare LSTM and LSTMAVG. The latter consists of averaging the hidden vectors of the LSTM rather than using the final hidden vector as in prior work (Wieting et al., 2016b). We hypothesize that the LSTM may put more emphasis on the words at the end of the sentence than those at the beginning. By averaging the hidden states, the impact of all words in the sequence is better taken into account. Averaging also makes the LSTM more like AVG, which we know to perform strongly in this setting.
The results on AVG and the LSTM models are shown in Table 1. When training on PPDB, moving from LSTM to LSTMAVG improves performance by 10 points, closing most of the gap with AVG. We also find that LSTMAVG improves by moving from PPDB to SimpWiki, though in both cases it still lags behind AVG.
We experimented with adding EOS tags at the end of training and test sentences, SOS tags at the start of training and test sentences, adding both, and adding neither. We treated adding these tags as hyperparameters and tuned over these four settings along with the other hyperparameters in the original experiment. Interestingly, we found that adding these tags, especially EOS, had a large effect on the LSTM when training on SimpWiki, improving performance by 6 points. When training on PPDB, adding EOS tags only improved performance by 1.6 points.
The addition of the tags had a smaller effect on LSTMAVG. Adding EOS tags improved performance by 0.3 points on SimpWiki and adding SOS tags on PPDB improved performance by 0.9 points.

Experiments with Regularization
We next experiment with various forms of regularization. Previous work (Wieting et al., 2016b,a) only used $L_2$ regularization. Wieting et al. (2016b) also regularized the word embeddings back to their initial values. Here we use $L_2$ regularization as well as several additional regularization methods we describe below.
We try two forms of dropout. The first is just standard dropout (Srivastava et al., 2014) on the word embeddings. The second is "word dropout", which drops out entire word embeddings with some probability (Iyyer et al., 2015).
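The difference between the two forms of dropout is whether noise is applied per coordinate or per word. The sketch below assumes inverted scaling for standard dropout and assumes that word dropout zeroes the dropped embedding rather than deleting the token; both choices, along with the function names, are assumptions for illustration and may differ from the setup used in the experiments.

```python
import random
from typing import List

Vector = List[float]


def dropout(x: Vector, rate: float, rng: random.Random) -> Vector:
    """Standard dropout: zero each coordinate independently with probability `rate`.

    Surviving coordinates are rescaled by 1 / (1 - rate); requires 0 <= rate < 1.
    """
    keep = 1.0 - rate
    return [xi / keep if rng.random() >= rate else 0.0 for xi in x]


def word_dropout(xs: List[Vector], rate: float, rng: random.Random) -> List[Vector]:
    """Word dropout: zero an entire word embedding with probability `rate`."""
    return [list(x) if rng.random() >= rate else [0.0] * len(x) for x in xs]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sentence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy word embeddings

noisy_coordinates = [dropout(x, 0.5, rng) for x in sentence]
noisy_words = word_dropout(sentence, 0.5, rng)
```

Both functions apply noise only at training time in the usual setup; at test time the embeddings would be passed through unchanged.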
We also experiment with scrambling the inputs. For a given mini-batch, we go through each sentence pair and, with some probability, we shuffle the words in each sentence in the pair. When scrambling a sentence pair, we always shuffle both sentences in the pair. We do this before selecting negative examples for the mini-batch. The motivation for scrambling is to make it more difficult for the LSTM to memorize the sequences in the training data, forcing it to focus more on the identities of the words and less on word order. Hence it will be expected to behave more like the word averaging model. We also experiment with combining scrambling and dropout. In this setting, we tune over scrambling with either word dropout or dropout.
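The scrambling procedure can be sketched directly from the description: one random draw per pair decides whether both sentences are shuffled. The function name `scramble_batch` and the representation of sentences as lists of tokens are assumptions for this example.

```python
import random
from typing import List, Tuple

Sentence = List[str]
Pair = Tuple[Sentence, Sentence]


def scramble_batch(batch: List[Pair], rate: float, rng: random.Random) -> List[Pair]:
    """With probability `rate`, shuffle the words of both sentences in a pair.

    The decision is made once per pair, so either both sentences are shuffled
    or neither is. The input batch is left unmodified.
    """
    scrambled = []
    for s1, s2 in batch:
        if rng.random() < rate:
            # rng.sample with k == len(s) returns a shuffled copy of the list.
            s1 = rng.sample(s1, len(s1))
            s2 = rng.sample(s2, len(s2))
        scrambled.append((s1, s2))
    return scrambled


rng = random.Random(0)  # fixed seed so the sketch is repeatable
toy_batch = [
    (["the", "cat", "sat"], ["a", "cat", "was", "sitting"]),
    (["he", "left"], ["he", "went", "away"]),
]
scrambled_batch = scramble_batch(toy_batch, 0.5, rng)
```

Scrambling happens before negative examples are selected, so in a full training loop this function would run on the raw mini-batch prior to embedding it.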
The settings for these experiments are largely the same as those of the previous section with the exception that we tune $\lambda_w$ over a smaller set of values: $\{10^{-5}, 0\}$. When using $L_2$ regularization, we tune $\lambda_c$ over $\{10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}\}$. When using dropout, we tune the dropout rate over $\{0.2, 0.4, 0.6\}$. When using scrambling, we tune the scrambling rate over $\{0.25, 0.5, 0.75\}$. We also include a bidirectional model ("Bi") for both LSTMAVG and the GATED RECURRENT AVERAGING NETWORK. We tune over two ways to combine the forward and backward hidden states; the first simply adds them together and the second uses a single feedforward layer with a tanh activation.
We try two approaches for model selection. The first, test, is the same as in the experiments with data sources above, where we use the average Pearson's r on the 5 2016 STS datasets. The second tunes based on the average Pearson's r of all 22 datasets in our evaluation. We refer to this as oracle.
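The two ways of combining forward and backward hidden states can be sketched as below. The feedforward variant here assumes the two states are concatenated before the layer is applied; the text does not specify this detail, so it is an assumption, as are the function names and the toy weights.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def combine_add(h_fwd: Vector, h_bwd: Vector) -> Vector:
    """First option: add the forward and backward hidden states elementwise."""
    return [f + b for f, b in zip(h_fwd, h_bwd)]


def combine_ff(h_fwd: Vector, h_bwd: Vector, W: Matrix, b: Vector) -> Vector:
    """Second option: one feedforward layer with tanh over the concatenated states.

    Concatenation is an assumption of this sketch; W has one row per output unit
    and len(h_fwd) + len(h_bwd) columns.
    """
    concat = h_fwd + h_bwd
    return [math.tanh(sum(w * c for w, c in zip(row, concat)) + b_i)
            for row, b_i in zip(W, b)]


# Toy 2-d states and a 2x4 weight matrix (illustrative values only).
h_forward = [1.0, 2.0]
h_backward = [3.0, 4.0]
W_toy = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
b_toy = [0.0, 0.0]

added = combine_add(h_forward, h_backward)
projected = combine_ff(h_forward, h_backward, W_toy, b_toy)
```

Addition introduces no new parameters, while the feedforward option adds a weight matrix and bias that would be counted among the compositional parameters.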
The results are shown in Table 2. They show that dropping entire word embeddings and scrambling input sequences are very effective in improving the result of the LSTM, while neither type of dropout improves AVG. Moreover, averaging the hidden states of the LSTM is the most effective modification to the LSTM in improving performance. All of these modifications can be combined to significantly improve the LSTM, finally allowing it to overtake AVG. In Table 3, we compare the various GRAN architectures. We find that the GRAN provides a small improvement over the best LSTM configuration, possibly because of its similarity to AVG. It also outperforms the other GRAN models, despite being the simplest.
In Table 4

Supervised Text Similarity
We also investigate if these techniques can improve LSTM performance on supervised semantic textual similarity tasks. We evaluate on two supervised datasets. For the first, we start with the 20 SemEval STS datasets from 2012-2015 and then use 40% of each dataset for training, 10% for validation, and the remaining 50% for testing. There are 4,481 examples in training, 1,207 in validation, and 6,060 in the test set. The second is the SICK 2014 dataset, using its standard training, validation, and test sets. There are 4,500 sentence pairs in the training set, 500 in the development set, and 4,927 in the test set. The SICK task is an easier learning problem since the training examples are all drawn from the same distribution, and they are mostly shorter and use simpler language. As these are supervised tasks, the sentence pairs in the training set contain manually-annotated semantic similarity scores.
We minimize the loss function from Tai et al. (2015). Given a score for a sentence pair in the range $[1, K]$, where $K$ is an integer, with sentence representations $h_L$ and $h_R$, and model parameters $\theta$, they first compute:
$$h_\times = h_L \odot h_R, \quad h_+ = |h_L - h_R|, \quad h_s = \sigma(W^{(\times)} h_\times + W^{(+)} h_+ + b^{(h)}), \quad \hat{p}_\theta = \mathrm{softmax}(W^{(p)} h_s + b^{(p)}), \quad \hat{y} = r^T \hat{p}_\theta$$
where $r^T = [1\ 2\ \ldots\ K]$. They then define a sparse target distribution $p$ that satisfies $y = r^T p$:
$$p_i = \begin{cases} y - \lfloor y \rfloor, & i = \lfloor y \rfloor + 1 \\ \lfloor y \rfloor - y + 1, & i = \lfloor y \rfloor \\ 0, & \text{otherwise} \end{cases}$$
for $1 \le i \le K$. Then they use the following loss, the regularized KL-divergence between $p$ and $\hat{p}_\theta$:
$$J(\theta) = \frac{1}{m} \sum_{k=1}^{m} \mathrm{KL}\big(p^{(k)} \,\|\, \hat{p}_\theta^{(k)}\big) + \frac{\lambda}{2} \|\theta\|_2^2$$
where $m$ is the number of training pairs and $\lambda$ is the regularization strength.
We experiment with the LSTM, LSTMAVG, and AVG models with dropout, word dropout, and scrambling, tuning over the same hyperparameters as in Section 4.2. We again regularize the word embeddings back to their initial state, tuning $\lambda_w$ over $\{10^{-5}, 0\}$. We used the validation set for each respective dataset for model selection.
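The sparse target distribution is easiest to see with a worked example. The sketch below builds the target for a gold score and evaluates the KL term against a predicted distribution, following the construction attributed above to Tai et al. (2015); the function names and the toy prediction are assumptions for illustration, and the regularization term is omitted.

```python
import math
from typing import List


def sparse_target(y: float, K: int) -> List[float]:
    """Spread the gold score y in [1, K] over its two neighboring integer labels."""
    floor_y = math.floor(y)
    p = [0.0] * K
    for i in range(1, K + 1):          # labels are 1-based, list indices 0-based
        if i == floor_y + 1:
            p[i - 1] = y - floor_y
        elif i == floor_y:
            p[i - 1] = floor_y - y + 1
    return p


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q); terms with p_i == 0 contribute nothing. Assumes q_i > 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def expected_score(dist: List[float]) -> float:
    """Inner product of [1, 2, ..., K] with the distribution."""
    return sum((i + 1) * d for i, d in enumerate(dist))


target = sparse_target(3.6, 5)                 # mass split between labels 3 and 4
prediction = [0.05, 0.05, 0.3, 0.5, 0.1]       # a made-up softmax output
loss_term = kl_divergence(target, prediction)
```

For a gold score of 3.6 with K = 5, the target puts roughly 0.4 on label 3 and 0.6 on label 4, so its expected score recovers 3.6.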
The results are shown in Table 5. The GATED RECURRENT AVERAGING NETWORK has the best performance on both datasets. Dropout helps the word-averaging model in the STS task, unlike in the transfer learning setting. The LSTM benefits slightly from dropout, scrambling, and averaging individually, with the exception of word dropout on both datasets and averaging on the SICK dataset. However, when combined, these modifications are able to significantly improve the performance of the LSTM, bringing it much closer in performance to AVG. This experiment indicates that these modifications when training LSTMs are beneficial outside the transfer learning setting, and can potentially be used to improve performance for the broad range of problems that use LSTMs to model sentences.
In Table 6 we compare the various GRAN architectures under the same settings as the previous experiment. We find that the GRAN still has the best overall performance.
We also experiment with initializing the supervised models using our pretrained sentence model parameters, for the AVG model (no regularization), LSTMAVG (dropout, scrambling), and GATED RECURRENT AVERAGING NETWORK (dropout, scrambling) models from Table 2 and Table 3. We both initialize and then regularize back to these initial values, referring to this setting as "universal". The results are shown in Table 8. Initializing and regularizing to the pretrained models significantly improves the performance for all three models, justifying our claim that these models serve a dual purpose: they can be used as a black-box semantic similarity function, and they possess rich knowledge that can be used to improve the performance of downstream tasks.
We compared the predicted similarities of LSTMAVG and AVG to the gold similarities. We analyzed instances in which each model would tend to overestimate or underestimate the gold similarity relative to the other. These are illustrated in Table 7.
We find that AVG tends to overestimate the semantic similarity of a sentence pair, relative to LSTMAVG, when the two sentences have a lot of word or synonym overlap, but have either important differences in key semantic roles or where one sentence has significantly more content than the other. These phenomena are shown in examples 1 and 2 in Table 7. Conversely, AVG tends to underestimate similarity when there are one-word-to-multiword paraphrases between the two sentences as shown in examples 3 and 4.
LSTMAVG tends to overestimate similarity when the two inputs have similar sequences of syntactic categories, but the meanings of the sentences are different (examples 5, 6, and 7). Instances of LSTMAVG underestimating the similarity relative to AVG are relatively rare, and those that we found did not have any systematic patterns.

GRAN Gate Analysis
We also investigate what is learned by the gating function of the GATED RECURRENT AVERAGING NETWORK. We are interested to see whether its estimates of importance correlate with those of traditional syntactic and (shallow) semantic analysis.
We use the oracle-trained GATED RECURRENT AVERAGING NETWORK from Table 3 and calculate the $L_1$ norm of the gate after embedding 10,000 sentences from English Wikipedia. We also automatically tag and parse these sentences using the Stanford dependency parser (Manning et al., 2014). We then compute the average gate $L_1$ norms for particular part-of-speech tags, dependency arc labels, and their conjunction. Table 9 shows the highest/lowest average norm tags and dependency labels. The network prefers nouns, especially proper nouns, as well as cardinal numbers, which is sensible as these are among the most discriminative features of a sentence.
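The aggregation step of this analysis can be sketched as a group-by over tags. The code assumes the per-word gate vectors and their part-of-speech tags are already available as parallel lists; the function name and the toy gate values are hypothetical and do not reproduce the figures in Table 9.

```python
from collections import defaultdict
from typing import Dict, List


def mean_gate_norm_by_tag(gates: List[List[float]], tags: List[str]) -> Dict[str, float]:
    """Average the L1 norm of each word's gate vector, grouped by the word's tag."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for gate, tag in zip(gates, tags):
        sums[tag] += sum(abs(g) for g in gate)  # L1 norm of this word's gate
        counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}


# Made-up gate activations for three words and their tags.
toy_gates = [[0.5, 0.5], [1.0, 1.0], [0.1, 0.1]]
toy_tags = ["NN", "NN", "DT"]
norms = mean_gate_norm_by_tag(toy_gates, toy_tags)
```

The same function works for dependency labels or tag-label conjunctions by passing a different list of group keys, for example strings such as "NN/dobj".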
Analyzing the dependency relations, we find that nouns in the object position tend to have higher weight than nouns in the subject position. This may relate to topic and focus; the object may be more likely to be the "new" information related by the sentence, which would then make it more likely to be matched by the other sentence in the paraphrase pair.
We find that the weights of adjectives depend on their position in the sentence, as shown in Table 10. The highest norms appear when an adjective is an xcomp, acomp, or root; this typically means it is residing in an object-like position in its clause. Adjectives that modify a noun (amod) have medium weight, and those that modify another adjective or verb (advmod) have low weight.
Lastly, we analyze words tagged as VBG, a highly ambiguous tag that can serve many syntactic roles in a sentence. As shown in Table 11, we find that when they are used to modify a noun (amod) or in the object position of a clause (xcomp, pcomp) they have high weight. Medium weight appears when used in verb phrases (root, vmod) and low weight when used as prepositions or auxiliary verbs (prep, auxpass).
Table 11: Average $L_1$ norms for words with the tag VBG with selected dependency labels.

Conclusion
We showed how to modify and regularize LSTMs to improve their performance for learning paraphrastic sentence embeddings in both transfer and supervised settings. We also introduced a new recurrent network, the GATED RECURRENT AVERAGING NETWORK, that improves upon both AVG and LSTMs for these tasks, and we release our code and trained models. Furthermore, we analyzed the different errors produced by AVG and the recurrent methods and found that the recurrent methods were learning composition that wasn't being captured by AVG. We also investigated the GRAN in order to better understand the compositional phenomena it was learning by analyzing the $L_1$ norm of its gate over various inputs.
Future work will explore additional data sources, including from aligning different translations of novels (Barzilay and McKeown, 2001), aligning news articles of the same topic (Dolan et al., 2004), or even possibly using machine translation systems to translate bilingual text into paraphrastic sentence pairs. Our new techniques, combined with the promise of new data sources, offer a great deal of potential for improved universal paraphrastic sentence embeddings.