Learning Probabilistic Sentence Representations from Paraphrases

Probabilistic word embeddings have shown effectiveness in capturing notions of generality and entailment, but there is very little work on doing the analogous type of investigation for sentences. In this paper we define probabilistic models that produce distributions for sentences. Our best-performing model treats each word as a linear transformation operator applied to a multivariate Gaussian distribution. We train our models on paraphrases and demonstrate that they naturally capture sentence specificity. While our proposed model achieves the best performance overall, we also show that specificity is represented by simpler architectures via the norm of the sentence vectors. Qualitative analysis shows that our probabilistic model captures sentential entailment and provides ways to analyze the specificity and preciseness of individual words.


Introduction
Probabilistic word embeddings have been shown to be useful for capturing notions of generality and entailment (Vilnis and McCallum, 2014; Athiwaratkun and Wilson, 2017; Athiwaratkun et al., 2018). In particular, researchers have found that the entropy of a word roughly encodes its generality, even though there is no training signal explicitly targeting this effect. For example, hypernyms tend to have larger variance than their corresponding hyponyms (Vilnis and McCallum, 2014). However, there is very little work on doing the analogous type of investigation for sentences.
In this paper, we define probabilistic models that produce distributions for sentences. In particular, we choose a simple and interpretable probabilistic model that treats each word as an operator that translates and scales a Gaussian random variable representing the sentence. Our models are able to capture sentence specificity as measured by the annotated datasets of Li and Nenkova (2015) and Ko et al. (2019) by training solely on noisy paraphrase pairs. While our "word-operator" model yields the strongest performance, we also show that specificity is represented by simpler architectures via the norm of the sentence vectors. Qualitative analysis shows that our models represent sentences in ways that correspond to the entailment relationship and that individual word parameters can be analyzed to find words with varied and precise meanings.

Proposed Methods
We propose a model that uses ideas from flow-based variational autoencoders (VAEs) (Rezende and Mohamed, 2015; Kingma et al., 2016) by treating each word as an "operator". Intuitively, we assume there is a random variable z associated with each sentence s = {w_1, w_2, ..., w_n}. The random variable initially follows a standard multivariate Gaussian distribution. Then, each word in the sentence transforms the random variable sequentially, leading to a random variable that encodes its semantic information.
Our word linear operator model (WLO) has two types of parameters for each word w_i: a scaling factor A_i ∈ R^k and a translation factor B_i ∈ R^k. The word operators produce a sequence of random variables z_0, z_1, ..., z_n with z_0 ∼ N(0, I_k), where I_k is the k × k identity matrix, and the operations are defined as

z_i = A_i ⊙ z_{i−1} + B_i

where ⊙ denotes elementwise multiplication. Since each operation is affine, every z_i remains Gaussian, and the means and variances are computed as follows:

μ_i = A_i ⊙ μ_{i−1} + B_i,    Σ_i = diag(A_i) Σ_{i−1} diag(A_i)

with μ_0 = 0 and Σ_0 = I_k. For computational efficiency, we only consider diagonal covariance matrices, so the equations above can be further simplified: the variance vector is updated as σ_i^2 = A_i^2 ⊙ σ_{i−1}^2.
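As a minimal sketch of this forward pass (function and variable names are ours, not from the paper), the WLO recursion with diagonal covariances reduces to two vector updates per word:

```python
import numpy as np

def wlo_encode(A_list, B_list, k):
    """Sketch of the WLO forward pass: start from N(0, I_k) and apply each
    word's scaling (A_i) and translation (B_i) in sequence. With diagonal
    covariances, we track the variance as a vector rather than a matrix."""
    mu = np.zeros(k)          # mean of z_0
    var = np.ones(k)          # diagonal of I_k
    for A, B in zip(A_list, B_list):
        mu = A * mu + B       # E[A ⊙ z + B] = A ⊙ E[z] + B
        var = (A ** 2) * var  # Var[A ⊙ z + B] = A² ⊙ Var[z] (diagonal case)
    return mu, var
```

In training, A_list and B_list would be looked up from learned per-word parameter tables for the words of the sentence.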

Learning
Following Wieting and Gimpel (2018), all of our models are trained with a margin-based loss on paraphrase pairs (s_1, s_2):

loss(s_1, s_2) = max(0, δ − d(s_1, s_2) + d(s_1, n_1)) + max(0, δ − d(s_1, s_2) + d(s_2, n_2))

where δ is the margin and d is a similarity function that takes a pair of sentences and outputs a scalar denoting their similarity. The similarity function is maximized over a subset of examples (typically, the mini-batch) to choose negative examples n_1 and n_2. When doing so, we use "mega-batching" (Wieting and Gimpel, 2018) and fix the mega-batch size at 20. For deterministic models, d is cosine similarity, while for probabilistic models, we use the expected inner product of Gaussians.
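A minimal sketch of this criterion with batch-hard negative selection (names are ours; cosine is shown as the deterministic choice of d, and "batch" stands in for the mega-batch of candidate negatives):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity between two sentence vectors
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def margin_loss(s1, s2, batch, d=cosine, delta=0.4):
    """Margin-based loss on a paraphrase pair (s1, s2): negatives n1, n2
    are the most similar non-paraphrase candidates in the batch."""
    n1 = max((c for c in batch if c is not s1 and c is not s2),
             key=lambda c: d(s1, c))
    n2 = max((c for c in batch if c is not s1 and c is not s2),
             key=lambda c: d(s2, c))
    return (max(0.0, delta - d(s1, s2) + d(s1, n1))
            + max(0.0, delta - d(s1, s2) + d(s2, n2)))
```

The loss is zero once each paraphrase pair is more similar than its hardest negative by at least the margin δ.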

Expected Inner Product of Gaussians
Let μ_1, μ_2 be the mean vectors and Σ_1, Σ_2 be the variances predicted by models for a pair of input sentences. For the choice of d, following Vilnis and McCallum (2014), we use the (log) expected inner product of Gaussian distributions:

d(s_1, s_2) = log ∫ N(x; μ_1, Σ_1) N(x; μ_2, Σ_2) dx = log N(μ_1; μ_2, Σ_1 + Σ_2)

For diagonal matrices Σ_1 and Σ_2, the expression above can be computed analytically.
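For diagonal covariances the closed form factorizes over dimensions; a sketch (names ours, variances passed as vectors):

```python
import numpy as np

def log_expected_inner_product(mu1, var1, mu2, var2):
    """log ∫ N(x; mu1, Σ1) N(x; mu2, Σ2) dx = log N(mu1; mu2, Σ1 + Σ2),
    computed analytically when Σ1, Σ2 are diagonal (variance vectors)."""
    v = var1 + var2
    diff = mu1 - mu2
    # log-density of a diagonal Gaussian evaluated at mu1 with mean mu2
    return float(-0.5 * np.sum(np.log(2 * np.pi * v) + diff ** 2 / v))
```

This score grows as the means approach each other, and shrinks as the combined variance grows.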

Regularization
To prevent the mean or variance of the Gaussian distributions from becoming unbounded during training, resulting in degenerate solutions, we impose prior constraints on the operators introduced above. We force the transformed distribution after each operator to be relatively close to N(0, I_k), which can be thought of as our "prior" knowledge of the operator. Our training then additionally minimizes

λ Σ_{s ∈ {s_1, s_2, n_1, n_2}} Σ_{w ∈ s} KL(N(μ(w), Σ(w)) || N(0, I))

where λ is a hyperparameter tuned based on performance on the 2017 semantic textual similarity (STS; Cer et al., 2017) data. We found prior regularization to be very important, as will be shown in our results. For a fair comparison, we also add L2 regularization to the baseline models.
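The per-word KL term has a standard closed form for diagonal Gaussians; a sketch (names ours, variance passed as a vector):

```python
import numpy as np

def kl_to_standard_normal(mu, var):
    """KL(N(mu, diag(var)) || N(0, I)) in closed form:
    0.5 * sum_j (σ_j² + μ_j² − 1 − log σ_j²)."""
    return float(0.5 * np.sum(var + mu ** 2 - 1.0 - np.log(var)))
```

The penalty is zero exactly when the operator leaves the standard Gaussian unchanged, and grows as the mean drifts from zero or the variance from one.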

Baseline Methods
We consider two baselines that have shown strong results on sentence similarity tasks (Wieting and Gimpel, 2018). The first, word averaging (WORDAVG), simply averages the word embeddings in the sentence. The second, long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) averaging (LSTMAVG), uses an LSTM to encode the sentence and averages the hidden vectors. Inspired by sentence VAEs (Bowman et al., 2016), we consider an LSTM-based probabilistic baseline (LSTMGAUSSIAN) which builds upon LSTMAVG and uses separate linear transformations on the averaged hidden states to produce the mean and variance of a Gaussian distribution. We also benchmark several pretrained models, including GloVe (Pennington et al., 2014), Skipthought (Kiros et al., 2015), InferSent (Conneau et al., 2017), BERT (Devlin et al., 2019), and ELMo (Peters et al., 2018). When using GloVe, we either sum embeddings (GloVe SUM) or average them (GloVe AVG) to produce a sentence vector. Similarly, for ELMo, we either sum the outputs from the last layer (ELMo SUM) or average them (ELMo AVG). For BERT, we take the representation for the "[CLS]" token.

Datasets
We use the preprocessed version of ParaNMT-50M (Wieting and Gimpel, 2018) as our training set, which consists of 5 million paraphrase pairs.
For evaluating sentence specificity, we use human-annotated test sets from four domains, including news, Twitter, Yelp reviews, and movie reviews, from Li and Nenkova (2015) and Ko et al. (2019). For the news dataset, labels are either "general" or "specific" and there is additionally a training set. For the other datasets, labels are real values indicating specificity. Statistics for these datasets are shown in Table 1.
For analysis we also use the semantic textual similarity (STS) benchmark test set (Cer et al., 2017) and the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015).

Specificity Prediction Setup
For predicting specificity in the news domain, we threshold the predictions based either on the entropy of the Gaussian distributions produced by the probabilistic models or on the norm of the vectors produced by the deterministic models, which include all of the pretrained models. The threshold is tuned on the training set, but no other training or tuning is done for this task with any of our models. For prediction in the other domains, we simply compute Spearman correlations between the entropy/norm and the labels.

Intuitively, longer sentences tend to be more specific. So, we report baselines ("Length") that predict specificity solely from length, by thresholding the sentence length for news (choosing the threshold using the training set) or simply returning the length for the other domains. The latter results are taken from Ko et al. (2019). We also consider baselines that average or sum the ranks of word frequencies within a sentence ("Word Freq. AVG" and "Word Freq. SUM").

Table 2 shows results on the sentence specificity tasks. We compare to the best-performing models reported by Li and Nenkova (2015) and Ko et al. (2019). Their models are specifically designed for predicting sentence specificity, and both use labeled training data from the news domain.
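The entropy-based prediction rule can be sketched as follows (names and the threshold value are ours; for diagonal Gaussians the differential entropy has a closed form):

```python
import numpy as np

def gaussian_entropy(var):
    # differential entropy of N(mu, diag(var)): 0.5 * sum_j log(2πe σ_j²);
    # the mean does not affect the entropy
    return float(0.5 * np.sum(np.log(2 * np.pi * np.e * var)))

def predict_label(var, threshold):
    # news domain: sentences whose entropy exceeds the tuned
    # threshold are predicted "general", otherwise "specific"
    return "general" if gaussian_entropy(var) > threshold else "specific"
```

For the other domains, the same entropy (or vector norm) is correlated directly with the real-valued specificity labels instead of being thresholded.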

Sentence Specificity
Our averaging-based models (WORDAVG, LSTMAVG) failed on this task, either giving the majority class accuracy or negative correlations. So, we also evaluate WORDSUM, which sums word embeddings instead of averaging and shows strong performance compared to the other models.
While the model from Li and Nenkova (2015) performs quite well in the news domain, its performance drops on the other domains, indicating some amount of overfitting. On the other hand, WORDSUM and WLO, which are trained on a large number of paraphrases, perform consistently across the four domains and both outperform the supervised models on Yelp. Additionally, our WLO model outperforms all of our other models, achieving performance comparable to the supervised methods.

Among pretrained models, BERT, Skipthought, ELMo SUM, and GloVe SUM show slight correlations with specificity, while InferSent performs strongly across domains. InferSent uses supervised training on a large manually-annotated dataset (SNLI), while WORDSUM and WLO are trained on automatically-generated paraphrases and still show results comparable to InferSent.
To control for effects due to sentence length, we design another experiment in which sentences from the News training and test sets are grouped by length, and thresholds are tuned on the group of length k and tested on the group of length k − 1, for all k, leading to a pool of 3582 test sentences. Table 3 shows the results. In this length-normalized experiment, the averaging models perform much better and even outperform WORDSUM, but WLO still performs best.

We also test models on the SNLI test set, assuming that for a given premise p and hypothesis h, p is more specific than h for entailing sentence pairs. To avoid effects due to sentence length, we only consider (p, h) pairs of the same length. After this filtering, the entailment/neutral/contradiction categories have 120/192/208 instances respectively. We encode each sentence and calculate the percentage of cases in which the hypothesis has larger entropy (or smaller norm for non-probabilistic models) than the premise. An ideal model would do this for 100% of entailing pairs while showing chance-level results (50%) for the other two categories.
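The SNLI direction test reduces to a simple counting measure; a sketch (names ours, with any per-sentence generality score, e.g. entropy or negated norm, passed in as `score_fn`):

```python
def hypothesis_more_general_rate(pairs, score_fn):
    """Fraction of (premise, hypothesis) pairs for which the hypothesis
    scores as more general than the premise (larger entropy for
    probabilistic models, smaller norm for deterministic ones)."""
    hits = sum(1 for p, h in pairs if score_fn(h) > score_fn(p))
    return hits / len(pairs)
```

An ideal model would score near 1.0 on entailing pairs and near 0.5 on neutral and contradiction pairs.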
As shown in Table 4, our best paraphrase-trained models show similar trends to InferSent, achieving around 75% accuracy in the entailment category and around 50% accuracy in the other categories. Although ELMo can also achieve similar accuracy in the entailment category, it seems to conflate entailment with contradiction, where it shows the highest percentage of all models. Other models, including BERT, GloVe, and Skipthought, are much closer to random (50%) for entailing pairs.

Lexical Analysis
WLO associates translation and scaling parameters with each word, allowing us to analyze the impact of words on sentence representations. We ranked words under several criteria based on their translation parameter norms and single-word sentence entropies. Table 5 shows the top 20 words under each criterion. Words with small norm and small absolute entropy have little effect, both in terms of meaning and specificity; they are mostly function words. Words with large norm and small entropy have a large impact on the sentence while also making it more specific. They are organization names (cenelec) or technical terms found in medical or scientific literature. When they appear in a sentence, they are very likely to appear in its paraphrase.
Words with large norm and small absolute entropy contribute to the sentence semantics but do not make it more specific. Words like microwave and synthetic appear in many contexts and have multiple senses. Names (trent, alison) also appear in many contexts. Words like these often appear in a sentence's paraphrase, but can also appear in many other sentences in different contexts.
Words with small norm and small entropy make sentences more specific but do not lend themselves to a precise characterization. They affect sentence meaning, but can be expressed in many ways. For example, when beneficiaries appears in a sentence, its paraphrase often has a synonym like beneficiary, heirs, or grantees. These words may have multiple senses, but it appears more that they correspond to concepts with many valid ways of expression.

Sentential Analysis
We subsample the ParaNMT training set and group sentences by length. For each model and length, we pick the sentence with either highest/lowest entropy or largest/smallest norm values. Table 6 shows some examples. WORDSUM tends to choose conversational sentences as general and those with many rare words as specific. WLO favors literary and technical/scientific sentences as most specific, and bureaucratic/official language as most general.

Effect of Prior Regularization
As shown in Table 7, there is a large performance improvement after adding prior regularization for avoiding degenerate solutions.

Semantic Textual Similarity
Although semantic textual similarity is not our target task, we include the performance of our models on the STS benchmark test set in Table 8 to show that they are competitive with standard strong baselines. When using the probabilistic models to predict sentence similarity at test time, we let v_1 = concat(μ_1, Σ_1) and v_2 = concat(μ_2, Σ_2), where concat denotes concatenation, and predict similarity via cosine(v_1, v_2), since we find this performs better than using the mean vectors alone. The two probabilistic models, LSTMGAUSSIAN and WLO, slightly outperform the baselines.
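The test-time scoring described above is a short computation; a sketch (names ours, diagonal variances passed as vectors):

```python
import numpy as np

def prob_sentence_similarity(mu1, var1, mu2, var2):
    """STS score for probabilistic models: cosine similarity over the
    concatenation of each sentence's mean and (diagonal) variance."""
    v1 = np.concatenate([mu1, var1])
    v2 = np.concatenate([mu2, var2])
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

Concatenating the variance lets the score reflect both where the distributions are centered and how spread out they are.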

Related Work
Our models are related to work on learning probabilistic word embeddings (Vilnis and McCallum, 2014; Athiwaratkun and Wilson, 2017; Athiwaratkun et al., 2018) and text-based VAEs (Miao et al., 2016; Bowman et al., 2016; Yang et al., 2017; Kim et al., 2018; Xu and Durrett, 2018, inter alia). The WLO model is also related to flow-based VAEs (Rezende and Mohamed, 2015; Kingma et al., 2016), where hidden layers are viewed as operators over the density function of latent variables. Previous work on sentence specificity relies on hand-crafted features or direct training on annotated data (Louis and Nenkova, 2011; Li and Nenkova, 2015). Recently, Ko et al. (2019) used domain adaptation for this problem when only the source domain has annotations. Our work also relates to learning sentence embeddings from paraphrase pairs (Wieting et al., 2016; Wieting and Gimpel, 2018).

Conclusion
We trained sentence models on paraphrase pairs and showed that they naturally capture specificity and entailment. Our proposed WLO model, which treats each word as a linear transformation operator, achieves the best performance and lends itself to analysis.