Exploring Semantic Properties of Sentence Embeddings

Neural vector representations are ubiquitous throughout all subfields of NLP. While word vectors have been studied in much detail, thus far only little light has been shed on the properties of sentence embeddings. In this paper, we assess to what extent prominent sentence embedding methods exhibit select semantic properties. We propose a framework that generates triplets of sentences to explore how changes in the syntactic structure or semantics of a given sentence affect the similarities obtained between their sentence embeddings.


Introduction
Neural vector representations have become ubiquitous in all subfields of natural language processing. For the case of word vectors, important properties of the representations have been studied, including their linear substructures (Mikolov et al., 2013; Levy and Goldberg, 2014), the linear superposition of word senses (Arora et al., 2016b), and the nexus to pointwise mutual information scores between co-occurring words (Arora et al., 2016a).
However, thus far, only little is known about the properties of sentence embeddings. Sentence embedding methods attempt to encode a variable-length input sentence into a fixed-length vector. A number of such sentence embedding methods have been proposed in recent years (Le and Mikolov, 2014; Kiros et al., 2015; Wieting et al., 2015; Conneau et al., 2017; Arora et al., 2017).
Sentence embeddings have mainly been evaluated in terms of how well their cosine similarities mirror human judgments of semantic relatedness, typically with respect to the SemEval Semantic Textual Similarity competitions. The SICK dataset (Marelli et al., 2014) was created to better benchmark the effectiveness of different models across a broad range of challenging lexical, syntactic, and semantic phenomena, in terms of both similarities and the ability to be predictive of entailment. However, even on SICK, very shallow methods often prove effective at obtaining fairly competitive results (Wieting et al., 2015). Adi et al. (2016, 2017) investigated to what extent different embedding methods are predictive of i) the occurrence of words in the original sentence, ii) the order of words in the original sentence, and iii) the length of the original sentence. Other work inspected neural machine translation systems with regard to their ability to acquire morphology, while Shi et al. (2016) investigated to what extent they learn source-side syntax. Wang et al. (2016) argue that the latent representations of advanced neural reading comprehension architectures encode information about predication. Finally, sentence embeddings have also often been investigated in classification tasks such as sentiment polarity or question type classification (Kiros et al., 2015). Concurrently with our research, Conneau et al. (2018) investigated to what extent one can learn to classify specific syntactic and semantic properties of sentences using large amounts of training data (100,000 instances) for each property.
Overall, still, remarkably little is known about what specific semantic properties are directly reflected by such embeddings. In this paper, we specifically focus on a few select aspects of sentence semantics and inspect to what extent prominent sentence embedding methods are able to capture them. Our framework generates triplets of sentences to explore how changes in the syntactic structure or semantics of a given sentence affect the similarities obtained between their sentence embeddings.

Analysis
To conduct our analysis, we proceed by generating new phenomena-specific evaluation datasets.
Our starting point is that even minor alterations of a sentence may lead to notable shifts in meaning. For instance, a sentence S such as A rabbit is jumping over the fence and a sentence S * such as A rabbit is not jumping over the fence diverge with respect to many of the inferences that they warrant. Even if sentence S * is somewhat less idiomatic than alternative wordings such as There are no rabbits jumping over the fence, we nevertheless expect sentence embedding methods to interpret both correctly, just as humans do.
Despite the semantic differences between the two sentences due to the negation, we still expect the cosine similarity between their respective embeddings to be fairly high, in light of their semantic relatedness in touching on similar themes.
Hence, only comparing the similarity between sentence pairs of this sort does not easily lend itself to insightful automated analyses. Instead, we draw on another key idea. It is common for two sentences to be semantically close despite differences in their specific linguistic realizations. Building on the previous example, we can construct a further contrasting sentence S + such as A rabbit is hopping over the fence. This sentence is very close in meaning to sentence S, despite minor differences in the choice of words. In this case, we would want for the semantic relatedness between sentences S and S + to be assessed as higher than between sentence S and sentence S * .
We refer to this sort of scheme as sentence triplets. We rely on simple transformations to generate several different sets of sentence triplets.

Sentence Modification Schemes
In the following, we first describe the kinds of transformations we apply to generate altered sentences. Subsequently, in Section 2.2, we shall consider how to assemble such sentences into sentence triplets of various kinds so as to assess different semantic properties of sentence embeddings.
Not-Negation. We negate the original sentence A by inserting the negation marker not before the first verb to generate a new sentence B, including contractions as appropriate, or removing negations when they are already present, as in:
A: The young boy is climbing the wall made of rock.
B: The young boy isn't climbing the wall made of rock.
Quantifier-Negation. We prepend the quantifier expression there is no to original sentences beginning with A to generate new sentences.
A: A girl is cutting butter into two pieces.
B: There is no girl cutting butter into two pieces.
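The Quantifier-Negation rule can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `quantifier_negate` is hypothetical, and the regular expression assumes a sentence of the form "A <subject> is <rest>", which covers the example above but not every sentence beginning with A.

```python
import re

# Hypothetical helper: assumes the pattern "A <subject> is <rest>".
_PATTERN = re.compile(r"^A (.+?) is (.+)$")

def quantifier_negate(sentence):
    """Return the 'There is no ...' variant, or None if the pattern does not apply."""
    match = _PATTERN.match(sentence)
    if match is None:
        return None
    subject, rest = match.groups()
    # Drop the auxiliary "is" and prepend the quantifier expression.
    return f"There is no {subject} {rest}"

print(quantifier_negate("A girl is cutting butter into two pieces."))
```

Sentences that do not match the assumed pattern yield None, so they can simply be skipped or handed to manual review.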
Synonym Substitution. We substitute the verb in the original sentence with an appropriate synonym to generate a new sentence B.
A: The man is talking on the telephone.
B: The man is chatting on the telephone.
Embedded Clause Extraction. For those sentences containing verbs such as say, think with embedded clauses, we extract the clauses as the new sentence.
A: Octel said the purchase was expected.
B: The purchase was expected.
Passivization. Sentences that are expressed in active voice are changed to passive voice.
A: Harley asked Abigail to bake some muffins.
B: Abigail is asked to bake some muffins.
Argument Reordering. For sentences matching the structure "somebody verb somebody to do something", we swap the subject and object of the original sentence A to generate a new sentence B.
A: Matilda encouraged Sophia to compete in a match.
B: Sophia encouraged Matilda to compete in a match.
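The swap behind Argument Reordering can be sketched as follows. The sketch is illustrative only: the function name `reorder_arguments` is hypothetical, and the pattern assumes that subject, verb, and object are each a single token, as in the example above. Real sentences would likely require a parser.

```python
import re

# Assumed pattern: "<subject> <verb> <object> to <rest>", one token per slot.
_PATTERN = re.compile(r"^(\w+) (\w+) (\w+) (to .+)$")

def reorder_arguments(sentence):
    """Swap subject and object; return None if the sentence does not match."""
    match = _PATTERN.match(sentence)
    if match is None:
        return None
    subject, verb, obj, rest = match.groups()
    return f"{obj} {verb} {subject} {rest}"

print(reorder_arguments("Matilda encouraged Sophia to compete in a match."))
```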
Fixed Point Inversion. We select a word in the sentence as the pivot and invert the order of words before and after the pivot. The intuition here is that this simple corruption is likely to result in a new sentence that does not properly convey the original meaning, despite sharing the original words in common with it. Hence, these sorts of corruptions can serve as a useful diagnostic.
A: A dog is running on concrete and is holding a blue ball.
B: concrete and is holding a blue ball a dog is running on.
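Fixed Point Inversion amounts to swapping the two segments around a pivot. The sketch below is one possible reading, not the authors' code: the function name and the `pivot_index` parameter are hypothetical, and keeping the pivot attached to the preceding segment is an assumption inferred from the example above, where the pivot appears to be the word on. Unlike sentence B above, the sketch leaves capitalization untouched.

```python
def fixed_point_inversion(sentence, pivot_index):
    """Swap the segment up to and including the pivot with the segment after it."""
    tokens = sentence.rstrip(".").split()
    # The pivot must leave at least one token on its right-hand side.
    if not 0 <= pivot_index < len(tokens) - 1:
        raise ValueError("pivot_index must leave tokens on both sides")
    head = tokens[: pivot_index + 1]   # words before the pivot, plus the pivot
    tail = tokens[pivot_index + 1 :]   # words after the pivot
    return " ".join(tail + head)

# Pivot "on" is the token at index 4.
print(fixed_point_inversion("A dog is running on concrete and is holding a blue ball", 4))
```

Because only the order changes, the output contains exactly the same bag of words as the input, which is what makes this corruption a useful diagnostic for order-insensitive embeddings.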

Sentence Triplet Generation
Given the above forms of modified sentences, we induce five evaluation datasets, consisting of triplets of sentences as follows.
1. Negation Detection: Original sentence, Synonym Substitution, Not-Negation
With this dataset, we seek to explore how well sentence embeddings can distinguish sentences with similar structure and opposite meaning, while using Synonym Substitution as the contrast set. We would want the similarity between the original sentence and the negated sentence to be lower than that between the original sentence and its synonym version.

2. Negation Variants: Quantifier-Negation, Not-Negation, Original sentence
In the second dataset, we aim to investigate how well the sentence embeddings reflect negation quantifiers. We posit that the similarity between the Quantifier-Negation and Not-Negation versions should be a bit higher than between either the Not-Negation or the Quantifier-Negation and original sentences.
3. Clause Relatedness: Original sentence, Embedded Clause Extraction, Not-Negation
In this third set, we want to explore whether the similarity between a sentence and its embedded clause is higher than between a sentence and its negation.

4. Argument Sensitivity: Original sentence, Passivization, Argument Reordering
With this test, we wish to ascertain whether the sentence embeddings succeed in distinguishing semantic information from structural information. Consider, for instance, the following triplet.
Here, S and S + mostly share the same meaning, whereas S + and S * have a similar word order, but do not possess the same specific meaning. If the sentence embeddings focus more on semantic cues, then the similarity between S and S + ought to be larger than that between S + and S * . If the sentence embedding however is easily misled by matching sentence structures, the opposite will be the case.
5. Fixed Point Reorder: Original sentence, Semantically equivalent sentence, Fixed Point Inversion
With this dataset, our objective is to explore how well the sentence embeddings account for shifts in meaning due to the word order in a sentence. We select sentence pairs from the SICK dataset according to their semantic relatedness score and entailment labeling. Sentence pairs with a high relatedness score and the Entailment tag are considered semantically similar sentences. We rely on the Levenshtein Distance as a filter to ensure a structural similarity between the two sentences, i.e., sentence pairs whose Levenshtein Distance is sufficiently high are regarded as eligible.
Additionally, we use the Fixed Point Inversion technique to generate a contrastive sentence. The resulting sentence likely no longer adequately reflects the original meaning. Hence, we expect that, on average, the similarity between the original sentence and the semantically similar sentence should be higher than that between the original sentence and the contrastive version.
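The Levenshtein filter used for the Fixed Point Reorder dataset could be sketched as follows. This is a minimal sketch and not the authors' implementation: computing the distance over word tokens rather than characters, the function names, and the `min_distance` threshold are all assumptions, since the text only states that the distance must be sufficiently high.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def is_eligible(sentence_1, sentence_2, min_distance):
    """Keep a pair if its token-level distance reaches the (hypothetical) threshold."""
    return levenshtein(sentence_1.split(), sentence_2.split()) >= min_distance
```

The same `levenshtein` function works on plain strings, so a character-level variant only requires dropping the calls to `split()`.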

Experiments
We now proceed to describe our experimental evaluation based on this paradigm.

Datasets
Using the aforementioned triplet generation methods, we create the evaluation datasets listed in Table 1, drawing on source sentences from SICK, the Penn Treebank WSJ, and the MSR Paraphrase corpus. Although the process to modify the sentences is automatic, we rely on human annotators to double-check the results for grammaticality and semantics. This is particularly important for synonym substitution, for which we relied on WordNet (Fellbaum, 1998). Unfortunately, not all synonyms are suitable as replacements in a given context.

Embedding Methods
In our experiments, we compare the following prominent sentence embedding methods:
1. GloVe Averaging (GloVe Avg.): The simple approach of taking the average of the word vectors for all words in a sentence. Although this method neglects the order of words entirely, it can fare reasonably well on some of the most commonly invoked forms of evaluation (Wieting et al., 2015; Arora et al., 2017). Note that we here rely on regular unweighted GloVe vectors (Pennington et al., 2014) instead of fine-tuned or weighted word vectors.
3. Sent2Vec: Pagliardini et al. (2018) proposed a method to learn word and n-gram embeddings such that the average of all words and n-grams in a sentence can serve as a high-quality sentence vector.
4. SkipThought: The Skip-Thought Vector approach by Kiros et al. (2015) applies the neighbour prediction intuitions of the word2vec Skip-Gram model at the level of entire sentences, as encoded and decoded via recurrent neural networks. The method trains an encoder to process an input sentence such that the resulting latent representation is optimized for predicting neighbouring sentences via the decoder.
5. InferSent: InferSent (Conneau et al., 2017) is based on supervision from an auxiliary task, namely the Stanford NLI dataset.
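To make the averaging baseline concrete, the sketch below computes an unweighted mean of word vectors and the cosine similarity between two such means. It is a toy illustration under stated assumptions: the `vectors` dictionary stands in for pretrained GloVe vectors, the function names are hypothetical, and silently skipping out-of-vocabulary words is an assumption rather than something the text specifies.

```python
import math

def average_embedding(tokens, vectors):
    """Unweighted mean of the vectors of all in-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]  # assumption: skip OOV words
    if not known:
        raise ValueError("no token has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy two-dimensional vectors, purely illustrative.
vectors = {"rabbit": [1.0, 0.0], "jumps": [0.0, 1.0]}
s1 = average_embedding(["rabbit", "jumps"], vectors)
s2 = average_embedding(["jumps", "rabbit"], vectors)
print(cosine(s1, s2))
```

Since the mean does not depend on token order, any reordering of a sentence yields the same embedding, which is why this baseline cannot separate a sentence from its Fixed Point Inversion.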

Results and Discussion
Negation Detection. Table 2 lists the results for the Negation Detection dataset, where S, S + , S * refer to the original, Synonym Substitution, and Not-Negation versions of the sentences, respectively. For each of the considered embedding methods, we first report the average cosine similarity scores between all relevant sorts of pairings of two sentences, i.e. between the original and the Synonym-Substitution sentences (S and S + ), between original and Not-Negated (S and S * ), and between Not-Negated and Synonym-Substitution (S + and S * ). Finally, in the last column, we report the Accuracy, computed as the percentage of sentence triplets for which the proximity relationships were as desired, i.e., the cosine similarity between the original and synonym-substituted versions was higher than the similarity between that same original and its Not-Negation version.
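The Accuracy measure just described can be written down in a few lines. This is a sketch for illustration only: the function name and the input format (one pair of precomputed cosine similarities per triplet) are assumptions, and the numbers in the usage example are made up rather than taken from the tables.

```python
def triplet_accuracy(similarity_pairs):
    """Percentage of triplets with sim(S, S+) strictly greater than sim(S, S*).

    similarity_pairs: iterable of (sim_s_splus, sim_s_sstar) tuples.
    """
    pairs = list(similarity_pairs)
    if not pairs:
        raise ValueError("no triplets given")
    hits = sum(1 for sim_plus, sim_star in pairs if sim_plus > sim_star)
    return 100.0 * hits / len(pairs)

# Made-up similarities for four triplets; two of them satisfy the condition.
print(triplet_accuracy([(0.9, 0.8), (0.7, 0.75), (0.6, 0.5), (0.4, 0.4)]))  # 50.0
```

Ties count as failures here because the criterion is stated as a strict inequality.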
On this dataset, we observe that GloVe Avg. is more often than not misled by the introduction of synonyms, although the corresponding word vector typically has a high cosine similarity with the original word's embedding. In contrast, both InferSent and SkipThought succeed in distinguishing unnegated sentences from negated ones.
Negation Variants. In Table 3, S, S + , S * refer to the original, Not-Negation, and Quantifier-Negation versions of a sentence, respectively. Accuracy in this problem is defined as the percentage of sentence triplets for which the similarity between S + and S * is higher than both the similarity between S and S + and the similarity between S and S * . The results of both the averaging of word embeddings and SkipThought are dismal in terms of accuracy. InferSent, in contrast, appears to have acquired a better understanding of negation quantifiers, as these are commonplace in many NLI datasets.
Clause Relatedness. In Table 4, S, S + , S * refer to the original, Embedded Clause Extraction, and Not-Negation versions, respectively. Although not much more accurate than random guessing, Sent2Vec fares best among the considered approaches in distinguishing the embedded clause of a sentence from a negation of that sentence.
b) "We made our own decision," he said. - "We didn't make our own decision," he said. - We made our own decision.
For cases resembling a), the average SkipThought similarity between the sentence and its Not-Negation version is 79.90%, while for cases resembling b), it is 26.71%. The accuracy of SkipThought on cases resembling a) is 36.90%, and its accuracy on cases resembling b) is only 0.75%. It seems plausible that SkipThought is more sensitive to word order due to its recurrent architecture. InferSent also achieved better performance on sentences resembling a) than on sentences resembling b); its accuracy on these two structures is 28.37% and 15.73%, respectively.
Argument Sensitivity. In Table 5, S, S + , S * refer to the original sentence, its Passivization form, and the Argument Reordering version, respectively. Although recurrent architectures are able to consider the order of words, unfortunately, none of the analysed approaches prove adept at distinguishing the semantic information from structural information in this case.
Fixed Point Reorder. In Table 6, S, S + , S * refer to the original sentence, its semantically equivalent counterpart, and the Fixed Point Inversion version, respectively. As Table 6 indicates, sentence embeddings based on means (GloVe averages), weighted means (Sent2Vec), or concatenation of p-mean embeddings (P-Means) are unable to distinguish the fixed point inverted sentence from the semantically equivalent one, as they do not encode sufficient word order information into the sentence embeddings. Sent2Vec does consider n-grams, but these do not affect the results sufficiently. SkipThought and InferSent did well when the original sentence and its semantically equivalent counterpart share a similar structure.

Conclusion
This paper proposes a simple method to inspect sentence embeddings with respect to their semantic properties, analysing three popular embedding methods. We find that both SkipThought and InferSent distinguish negation of a sentence from synonymy. InferSent fares better at identifying semantic equivalence regardless of the order of words and copes better with quantifiers. SkipThought is more suitable for tasks in which the semantics of the sentence corresponds to its structure, but it often fails to identify sentences with different word order yet similar meaning. In almost all cases, dedicated sentence embeddings from the hidden states of a neural network outperform a simple averaging of word embeddings.