Sentence Analogies: Linguistic Regularities in Sentence Embeddings

While important properties of word vector representations have been studied extensively, far less is known about the properties of sentence vector representations. Word vectors are often evaluated by assessing to what degree they exhibit regularities with regard to relationships of the sort considered in word analogies. In this paper, we investigate to what extent commonly used sentence vector representation spaces also reflect certain kinds of regularities. We propose a number of schemes to induce evaluation data, based on lexical analogy data as well as semantic relationships between sentences. Our experiments consider a wide range of sentence embedding methods, including ones based on BERT-style contextual embeddings. We find that different models differ substantially in their ability to reflect such regularities.


Introduction
Sentence embeddings are dense vectors that reflect salient semantic properties of a sentence. Similar to how commonly used word embedding methods such as word2vec (Mikolov et al., 2013a) capture semantic relationships between words, sentence embeddings are expected to encode semantic relationships between sentences. A number of different sentence embedding methods have been proposed (cf. Section 2.1 for an overview). In recent years, pretrained language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), XLNet, and RoBERTa have become the method of choice when encoding text. Thus, such models are also often invoked to represent sentences by means of individual embeddings. While important properties of word vector representations have been studied extensively, far less is known about the properties of sentence vector representations. A particularly prominent aspect of word vector representations induced by methods such as word2vec is that the vector space exhibits certain kinds of regularities. Many of these are of the sort considered in word analogies. Proportional analogies take the form A is to B as C is to D, e.g., Paris is to France as Berlin is to Germany. Rumelhart and Abrahamson (1973) first proposed identifying such analogies using vector representations in a Euclidean space. Given vector representations of concepts, derived from human similarity judgments using a multi-dimensional scaling algorithm, they proposed representing analogical relationships in terms of the difference vectors. Turney and Littman (2005) investigated identifying such analogies using bag-of-words vector space models. Mikolov et al. (2013b) showed that word2vec's word vector representations reflect certain kinds of word analogies surprisingly well. The widely used word analogy task that they proposed takes the following form.
Given embeddings $v_A$, $v_B$, $v_C$, $v_D$ for words A, B, C, D in an analogy of the above form, the task consists in identifying the correct word D given A, B, and C. Most commonly, this is achieved by optimizing

$$\arg\max_{D} \; \mathrm{sim}(v_D,\; v_B - v_A + v_C) \qquad (1)$$

where $\mathrm{sim}(v_1, v_2)$ typically denotes cosine similarity between two vectors. This sort of analogy task is one of the most commonly invoked means of assessing the quality of word vector induction techniques. However, little is known about the topology of vector representation spaces for entire sentences. In this paper we fill this gap, considering models with a dedicated sentence embedding objective as well as BERT-style pretrained embedding models. We study whether such sentence representation spaces also exhibit regularities with regard to certain kinds of relationships. To this end, we devise new datasets that are similar to typical word analogy datasets (Mikolov et al., 2013b). These allow us to empirically assess whether existing sentence embedding models reflect analogical relationships between sentences.
Background and Related Work

Sentence Embedding Methods
To move from word vector representations towards representations for entire sentences, a simple baseline is to average the word embeddings of all words in a sentence. Although this method neglects the order of words, it performs surprisingly well in many downstream tasks. Pagliardini et al. (2017) proposed a method to learn word and n-gram embeddings such that the average of all words and n-grams in a sentence can serve as a high-quality sentence vector. Rücklé et al. (2018) improved the average pooling method by concatenating different power means of word embeddings. Almarwani et al. (2019) proposed the use of a Discrete Cosine Transform (DCT) to compress word vectors into sentence embeddings, while retaining word order information.
Several methods have been proposed to directly learn representations of sentences. The Skip-Thought Vector approach (Kiros et al., 2015), inspired by the skip-gram word2vec approach (Mikolov et al., 2013a), attempts to learn representations that enable the prediction of neighbouring sentences. It relies on an encoder-decoder structure based on Gated Recurrent Units. The Quick-Thought Vector approach (Logeswaran and Lee, 2018) improves both the efficiency and performance of Skip-Thought Vectors by replacing the decoder with a simple classifier that selects the correct sentence among a set of candidates. InferSent (Conneau et al., 2017) learns sentence representations by auxiliary supervised learning on Natural Language Inference (NLI) data, outperforming prior methods on tasks that require detailed semantic understanding (Zhu et al., 2018). Subramanian et al. (2018) proposed methods to learn general purpose sentence representations via Multi-Task Learning.
In recent years, contextualized word embeddings have drawn considerable attention in light of the formidable gains that they achieve across a wide range of NLP and IR tasks. The pioneering work on ELMo (Peters et al., 2018) showed that significant gains can be achieved across a range of NLP tasks by considering the intermediate layers of a deep BiLSTM-based language model. Instead of standard bidirectional language modeling as in ELMo, the BERT approach (Devlin et al., 2019) developed at Google uses a training regimen considering Cloze-style masked language modeling, in which both sides of the context are simultaneously used to reconstruct an artificially masked word, along with an additional next sentence prediction task. XLNet is an auto-regressive model based on Transformer-XL that uses permutation language modeling as the training task. XLNet outperforms BERT on various downstream tasks when the two share the same number of model parameters and training corpus size. RoBERTa improves the pre-training of the original BERT model by removing the Next Sentence Prediction task and randomly generating different masks for words in a sentence. It also improves the performance of BERT by adding more training data. Reimers and Gurevych (2019) proposed Sentence-BERT, which utilizes Siamese and Triplet Networks to fine-tune BERT on NLI and Semantic Textual Similarity (STS) data to obtain more semantically meaningful sentence embeddings that can be compared using cosine similarity.

Analysing Linguistic Representations
Whereas in the field of computer vision, there has been prominent work on understanding what is happening inside popular kinds of models (Zeiler and Fergus, 2014), the latent representations of recent NLP models have long remained impervious and opaque, in the sense that it is not well-understood how they represent the relevant properties of language. While recently there has been substantial research on assessing the capabilities of BERT-like architectures (Rogers et al., 2020), this research for the most part does not shed sufficient light on the topological properties of the representation space.
The most well-known way to inspect the capabilities of sentence embeddings has been via what has been dubbed probing, i.e., supervised training of models that predict specific linguistic phenomena given embeddings as input. Kiros et al. (2015) evaluated the quality of their embeddings by using them for supervised downstream tasks such as sentiment polarity and question type classification. Adi et al. (2016) attempted to gain more specific insights by predicting word occurrences, word order, and sentence lengths. Bacon and Regier (2018) considered this approach to predict verb tense. Ettinger et al. (2018) trained classifiers for semantic roles and negation detection. Dasgupta et al. (2018) studied the argument sensitivity of the InferSent model by probing with respect to an NLI classification (contradiction vs. entailment). Kann et al. (2019) investigated verb alternation acceptability classifications. Conneau et al. (2018) predicted a wide range of mostly syntactic phenomena such as major syntactic constituents, the depth of the syntactic tree, grammatical number of the subject, and grammatical number of the object. For each probing task, they provide 100,000 training instances.
Probing provides important insights about whether sufficient signals needed for a given downstream task are available if one has sufficient supervision. However, training on 100,000 instances does not reveal whether these signals are genuinely present in the sentence representations, as opposed to just being learnable from the training data. For instance, consider an email spam classification task trained using a simple bag-of-words TF-IDF vector representation. With 100,000 training examples, a model will likely be able to learn to recognize salient kinds of spam emails with an accuracy significantly above the level of chance. However, this does not license the conclusion that the bag-of-words representation inherently captures some notion of spamicity, as it were.
Hence, an important complementary endeavour is to study the topology of the representation space. Zhu et al. (2018) proposed assessing sentence embeddings from a relational perspective in terms of proximity. In this paper, we specifically examine to what extent analogical relationships are reflected in terms of regularities in the representation space. Diallo et al. (2019) explored analogical embeddings for the relationship between questions and answers in question answering. Zhang and Baldwin (2019) investigated to what extent analogical reasoning can be used for relationships between documents.

Methodology
Our goal is to explore to what extent different sentence embedding spaces reflect analogical regularities of the form A is to B as C is to D. In the remainder of this paper, we shall invoke the notation A : B :: C : D to refer to this sort of relationship. We will assess such relationships using the same methods as considered for word vectors. A typical choice is the method given by Eq. 1 (see Section 4.1 for further discussion).
To be able to perform our analysis, we induce two kinds of data. In Section 3.1, we create sentence analogies based on lexical analogies. In Section 3.2, we induce sentence analogies based on predefined relationships between sentences.

Sentence Analogies from Lexical Analogy
The first kind of sentence-level analogy data considered in our paper is induced based on lexical analogy data. Specifically, we consult Google's word analogy dataset (Mikolov et al., 2013a) and use it to construct five categories of semantic sentence analogies and five categories of syntactic sentence analogies.

Semantic Relationships
For semantic instances, we first create general-purpose sentence templates. Then, we replace a certain word in the template with words from Google's word analogy dataset. We consider the following categories of relationships.
• Common Capital Cities. We first consult corpora to extract sentence templates such as "I'm not sure if they can travel to France." Then we replace the word "France" in the template with words from the Google Analogy dataset to create sentence pairs, as shown in Table 1.
• All Capital Cities. We create sentence templates such as "I've never been to Thimphu." For each word analogy pair in the Google dataset, we replace the word "Thimphu" with pertinent words from the pairs to obtain sentence pairs.
• Currencies. For currency-country pairs in the Google dataset, we create different templates for currency and country, respectively. Then, we replace the target word in the currency and country templates with word pairs to generate sentence pairs as shown in Table 1.
• City in State. We create a unified template for both city and state. Sentence pairs are then created by replacing a target word in the template with a city or state name from the Google dataset.
• Gender. Google's word analogy dataset provides stereotypical male-female pairs (e.g., son-daughter), disregarding other gender identities. For these pairs, we again create templates, but invoke them in more intricate ways. For example, given a template "My grandpa makes wooden crafts and arts.", we can replace the word "grandpa" with any word describing family members such as "grandma", "father", and "mother". However, when the candidate word is a word that describes an occupation, we replace the word "My" in the original template with "The". When the candidate word is a pronoun such as "he" or "she", we omit the word "My" from the template.
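The template-filling procedure described in the list above can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the `kind` labels, the re-capitalization after dropping "My", and the choice to form analogies from every ordered combination of two distinct sentence pairs are all assumptions made for the example.

```python
from itertools import permutations


def build_sentence_pairs(template, slot, word_pairs):
    """Fill one template with both members of each word pair."""
    return [(template.replace(slot, first), template.replace(slot, second))
            for first, second in word_pairs]


def build_analogies(sentence_pairs):
    """Combine two distinct sentence pairs into A : B :: C : D tuples.

    Assumption: every ordered combination of two pairs is used.
    """
    return [(a, b, c, d) for (a, b), (c, d) in permutations(sentence_pairs, 2)]


def fill_gender_template(template, slot, word, kind):
    """Apply the special cases described for the Gender category.

    kind is a hypothetical label: "family", "occupation", or "pronoun".
    """
    sentence = template.replace(slot, word)
    if kind == "occupation":
        # Occupations take "The" instead of "My".
        sentence = sentence.replace("My ", "The ", 1)
    elif kind == "pronoun":
        # Pronouns drop "My"; re-capitalizing is an assumption of this sketch.
        sentence = sentence.replace("My ", "", 1)
        sentence = sentence[0].upper() + sentence[1:]
    return sentence


pairs = build_sentence_pairs("I've never been to Thimphu.", "Thimphu",
                             [("Amman", "Jordan"), ("Thimphu", "Bhutan")])
analogies = build_analogies(pairs)
```

Keeping the slot word explicit makes it easy to reuse one template for both sides of a word pair, which is what the City in State category relies on.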

Syntactic Instances
For (morpho-)syntactic questions, we first perform part-of-speech tagging and dependency parsing to analyze the structure of the sentences in the MNLI dataset (Williams et al., 2018a) and extract sentences that correspond to a certain structure. Subsequently, we invoke a set of rules to generate new sentences from the original ones. The specific sentence generation schemes invoked to generate the evaluation data for the syntactic categories are as follows.
• Comparative. We first find a sentence containing a comparative adjective followed by "than", and then replace the comparative adjective with its original form and remove the noun or clause after "than" to obtain comparative sentence pairs, as again exemplified in Table 1.
• Nationality Adjectives. We create templates for nationalities and their corresponding adjectives. We then replace a target word in the template with the nationality designation or with the adjective word.
• Opposites. We first find a sentence containing an adjective and then replace the adjective with its antonym to obtain sentence pairs. Note that these antonym pairs generally bear a derivational connection, e.g., efficient-inefficient.
• Plurals. We retrieve a sentence with a plural noun preceded by a numeral word, and then replace the plural noun with its singular form and replace the numeral word with "one", "a", or "an".
• Verb Conjugation. We first identify sentences containing an auxiliary verb followed by a verb and then remove the auxiliary verb and replace the verb in the sentence with its inflected form.

Analogy based on Relationships Between Sentences
In addition to our sentence analogy data derived from word analogies, we also create new diagnostic sentence analogy data based on specific forms of relationships between sentences. We start off with sentences extracted from NLI datasets, including SNLI (Bowman et al., 2015), Multi-NLI (Williams et al., 2018b), and SICK (Marelli et al., 2014).

Relationships
Entailment. Given two sentence pairs $(S_A, S_B)$ and $(S_C, S_D)$, an entailment analogy holds between these two sentence pairs if the respective relationships between $S_A$ and $S_B$ and between $S_C$ and $S_D$ are both entailment, as annotated in the NLI data.

Table 1
Category | First sentence | Second sentence
Common Capital Cities | … | They took a trip to Cuba.
All Capital Cities | I've never been to Amman. | I've never been to Jordan.
Currencies | The economy in Japan was great. | The yen appreciated due to the strong economy.
City in State | They go down to Chandler. | They go down to Arizona.
Man-Woman | The man makes wooden crafts and arts. | The woman makes wooden crafts and arts.
Comparative | The second article was long. | The second article was longer than the first one.
Nationality Adjective | The man from Egypt tapped his cheek. | The Egyptian man tapped his cheek.
Opposites | It's possible to measure it. | It's impossible to measure it.
Plurals | The Harvard data examined one city. | The Harvard data examined six cities.
Verb Conjugation | Duke will play better this year. | Duke plays better this year.

Negation. We consider a sentence pair $(S_A, S_B)$ as standing in a negation relationship if one has a negated meaning compared to the other. Some of the negation pairs are extracted from the SICK dataset, but most of our negation pairs are created by dependency parsing and rule-based transformations. Given two sentence pairs $(S_A, S_B)$ and $(S_C, S_D)$, a negation analogy holds between these sentence pairs if the respective relationships between $S_A$ and $S_B$ and between $S_C$ and $S_D$ are both negation. An example is given in Table 2.
Passivization. A sentence pair $(S_A, S_B)$ is regarded as standing in a passivization relationship if $S_B$ is a passive form of $S_A$. Given two sentence pairs $(S_A, S_B)$ and $(S_C, S_D)$, a passivization analogy holds between these sentence pairs if the respective relationships between $S_A$ and $S_B$ and between $S_C$ and $S_D$ are both passivization. We create this data from NLI sentences using a dependency-driven rule-based transformation. Our passivization data is extracted from the argument sensitivity dataset from Zhu et al. (2018), which contains suitable passivization pairs.
Objective Clause. Given sentence pairs $(S_A, S_B)$ and $(S_C, S_D)$, an objective clause analogy holds between these two sentence pairs if $S_A$ and $S_C$ include a verb that is indicative of stating a fact or opinion (e.g., say, tell, think, etc.), followed by clauses $S_B$ and $S_D$, respectively.

Predicative Adjective Conversion. We deem a predicative adjective conversion relationship as holding between sentences $S_A$ and $S_B$ if $S_A$ contains an adjective, while $S_B$ contains a predicative clause that shares the same meaning with the adjective. Given sentence pairs $(S_A, S_B)$ and $(S_C, S_D)$, a predicative adjective conversion analogy holds between them if the respective relationships between $S_A$ and $S_B$ and between $S_C$ and $S_D$ are both of this form.

Candidate Sets
For a given hypothesis, there may be a multitude of valid premises that entail it. Given an analogy of the form $S_A : S_B :: S_C : S_D$, we need to restrict the scope of candidate sentences for $S_D$ so as to ensure the uniqueness of the correct answer. Instead of considering the entire corpus as a candidate sentence set, our candidate sentence sets for the relationships in Section 3.2.1 consist of one true candidate and several challenging distractor candidates that are similar to the true candidate at a superficial level but modified to be semantically different. Table 2 provides examples for a brief overview of the resulting task. In the following, we explain the distractor generation in further detail.
Not Negation. We insert the negation marker not after the first auxiliary verb in the original true target sentence to generate a new distractor sentence. If the sentence already contains the negation marker not, we instead remove it. Not Negation aims to detect whether a sentence embedding model is misled by a negation of the sentence relation caused by adding the word not.
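The Not Negation rule can be sketched as below. This is a simplified illustration under stated assumptions: it uses whitespace tokenization and a small hand-written auxiliary list (`AUXILIARIES`, hypothetical), and it does not handle contracted auxiliaries such as "It's". A real pipeline would more plausibly rely on a part-of-speech tagger to find the first auxiliary verb.

```python
# Hypothetical, non-exhaustive list of auxiliary verbs for this sketch.
AUXILIARIES = {
    "am", "is", "are", "was", "were", "be", "been", "do", "does", "did",
    "has", "have", "had", "can", "could", "will", "would", "shall",
    "should", "may", "might", "must",
}


def not_negation(sentence):
    """Toggle the negation marker "not" in a whitespace-tokenized sentence.

    Returns None if the sentence has neither "not" nor a known auxiliary.
    """
    tokens = sentence.split()
    if "not" in tokens:
        # The sentence is already negated: remove the first "not".
        tokens.remove("not")
        return " ".join(tokens)
    for i, token in enumerate(tokens):
        if token.lower() in AUXILIARIES:
            # Insert "not" directly after the first auxiliary verb.
            return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:])
    return None
```

For instance, applying the function to "Duke will play better this year." inserts the marker after "will", while applying it to the result removes it again.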
Random Deletion. We randomly delete words in the original sentence with a probability of 20% to generate a new sentence. If the length of the sentence is less than 5, we delete at least one word. However, simply deleting arbitrary words in the original sentence may not always affect the semantic relationship. For example, consider the hypothesis "John ate a yummy sandwich." vs. the premise "John ate a delicious sandwich." If we delete the word delicious from the premise sentence, the relation between the hypothesis and the new sentence is also entailment. In order to avoid this situation, we pick words for deletion that are not adjectives, adverbs, determiners, or auxiliary verbs.
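A minimal sketch of Random Deletion with the part-of-speech restriction follows. It assumes the sentence arrives as (token, tag) pairs with Universal POS tags produced by some external tagger; the tag names in `PROTECTED_TAGS` and the function signature are assumptions of this example, not details given in the text.

```python
import random

# Tags that must never be deleted (assumed Universal POS tag names).
PROTECTED_TAGS = {"ADJ", "ADV", "DET", "AUX"}


def random_deletion(tagged_tokens, p=0.2, min_len=5, rng=None):
    """Delete each unprotected token with probability p.

    For sentences shorter than min_len, at least one unprotected token
    is deleted whenever such a token exists.
    """
    rng = rng if rng is not None else random.Random()
    deletable = [i for i, (_, tag) in enumerate(tagged_tokens)
                 if tag not in PROTECTED_TAGS]
    drop = {i for i in deletable if rng.random() < p}
    if len(tagged_tokens) < min_len and not drop and deletable:
        drop = {rng.choice(deletable)}
    return [token for i, (token, _) in enumerate(tagged_tokens)
            if i not in drop]
```

In the "John ate a delicious sandwich" example, "a" and "delicious" are protected, so any deletion removes a content-bearing noun or verb and therefore changes the relation to the hypothesis.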
Random Masking. Following BERT (Devlin et al., 2019), we randomly replace tokens in the original sentence with BERT's special "[MASK]" token, where the probability of a certain token being masked is 20%. For sentence embedding methods that are not based on BERT, the "[MASK]" token is treated as an "UNK" token, which represents unknown words. The purpose of the random masking task is to explore whether replacing a word with a special meaningless token affects a model's performance in judging a semantic relation.
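Random Masking amounts to an independent per-token replacement, as in the sketch below. The function name and parameters are illustrative assumptions; passing `mask_token="UNK"` mirrors the treatment described for models that are not based on BERT. Whether a sentence must contain at least one masked token is not stated, so this sketch does not enforce it.

```python
import random


def random_masking(tokens, p=0.2, mask_token="[MASK]", rng=None):
    """Replace each token independently with mask_token with probability p."""
    rng = rng if rng is not None else random.Random()
    return [mask_token if rng.random() < p else token for token in tokens]
```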
Span Deletion. A number of text spans are sampled, with span lengths drawn from a Poisson distribution with λ = 3. Each text span is deleted from the original sentence. Span Deletion is inspired by the text infilling operation in BART. The only difference is that BART replaces text spans with a "[MASK]" token instead of deleting them. One difference between random deletion and span deletion is that we do not pose any restrictions on the word spans to be deleted. Since a continuous text span in a sentence often represents a phrase or a sub-clause, in most cases the meaning of the generated sentence and its relationship differ from those of the original sentence. But there may also be some exceptions to this. Hence, we rely on manual checking to avoid such issues.
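Span Deletion can be sketched with the standard library alone. Several details here are assumptions, since the text does not specify them: the number of spans (`n_spans`), the uniform choice of the span start, and the cap that keeps at least one token in the sentence. The Poisson sampler uses Knuth's multiplication method.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def span_deletion(tokens, n_spans=1, lam=3.0, rng=None):
    """Delete n_spans contiguous spans with Poisson(lam)-distributed lengths."""
    rng = rng if rng is not None else random.Random()
    tokens = list(tokens)
    for _ in range(n_spans):
        # Cap the span length so that at least one token always survives.
        length = min(sample_poisson(lam, rng), len(tokens) - 1)
        if length <= 0:
            continue
        start = rng.randrange(len(tokens) - length + 1)
        del tokens[start:start + length]
    return tokens
```

Because no part-of-speech restriction is applied, some outputs may preserve the original relation, which is why the manual check mentioned above remains necessary.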
Word Reordering. We randomly choose a word in the original sentence as a pivot, and then swap the words before and after the pivot to obtain a new sentence, which is likely grammatically incorrect and not an appropriate target to be selected. We invoke this sort of word reordering to test whether an embedding model is sensitive to semantic relation changes caused by changes of the word order. Clearly, a simple averaging of word embeddings is not able to distinguish this sort of example from the true target sentence, but it is not yet known to what extent more sophisticated sentence embedding models may suffer from this issue.
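One reading of the Word Reordering rule is sketched below: the tokens before the pivot and the tokens after it trade places while the pivot stays between them. This interpretation, the whitespace-level tokens, and the optional `pivot` argument are assumptions of the example.

```python
import random


def word_reordering(tokens, pivot=None, rng=None):
    """Swap the token blocks before and after a pivot position."""
    rng = rng if rng is not None else random.Random()
    if pivot is None:
        pivot = rng.randrange(len(tokens))
    return tokens[pivot + 1:] + [tokens[pivot]] + tokens[:pivot]
```

The output contains exactly the same multiset of words as the input, so any order-insensitive pooling such as a plain average of word embeddings assigns both sentences the same vector.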

Experiments
In a sentence analogy task, we are given two pairs of sentences sharing a relation. For example, "He is very enamored with culture in Egypt" : "He is very enamored with Egyptian culture", and "He is very enamored with culture in Bulgaria" : "He is very enamored with Bulgarian culture". The goal is to identify the fourth sentence given the first three sentences. The kind of analogical relationship sought for the sentence pairs is not explicitly provided. The number of sentence pairs and analogies in each category of our analogy dataset is given in Table 3.

Evaluation Metric
In word analogy tasks, the offset between word vectors is often used to determine relations between words. For example, in order to solve man is to woman as king is to W, we find a word W for which the corresponding vector is the closest to $v_{woman} - v_{man} + v_{king}$. This amounts to optimizing Eq. 1. Levy and Goldberg (2014) studied this in more detail, referring to the aforementioned method as 3CosAdd, while introducing a multiplicative variant called 3CosMul, which often yields better empirical results. Linzen (2016), Schluter (2018), and Nissim et al. (2019) highlighted the significance of excluding the other analogy words in Eq. 1. Given a word analogy problem of the form A : B :: C : D, the standard procedure is to disregard any D that is equal to A, B, or C. This constraint drastically improves the performance of word embedding models on word analogy datasets such as the Google dataset (Mikolov et al., 2013a), but may also lead to biased results.
In our experiments, we consider both 3CosAdd and 3CosMul, and evaluate these both with the additional constraint (3CosAdd, 3CosMul) and without it (3CosAdd-U, 3CosMul-U), where the suffix -U denotes an unconstrained evaluation.
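The four evaluation variants can be sketched as a single scoring function. This is an illustrative sketch, not the evaluation code used in the experiments: the additive score is written in the decomposed form sim(D, B) − sim(D, A) + sim(D, C), the multiplicative score follows the formulation of Levy and Goldberg (2014) with cosines shifted to [0, 1] and a small ε, and the function names, the toy vectors, and the use of vector equality to detect "D equals A, B, or C" are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, candidates, method="3CosAdd",
                  constrained=True, eps=1e-3):
    """Return the index of the candidate selected as D for A : B :: C : D."""
    best_index, best_score = None, -math.inf
    for i, d in enumerate(candidates):
        if constrained and d in (a, b, c):
            continue  # the constrained variants never return A, B, or C
        sim_a, sim_b, sim_c = cosine(d, a), cosine(d, b), cosine(d, c)
        if method == "3CosAdd":
            score = sim_b - sim_a + sim_c
        elif method == "3CosMul":
            # Shift cosines from [-1, 1] to [0, 1] before multiplying.
            pos_a, pos_b, pos_c = ((s + 1.0) / 2.0 for s in (sim_a, sim_b, sim_c))
            score = pos_b * pos_c / (pos_a + eps)
        else:
            raise ValueError("unknown method: " + method)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Toy vectors for illustration only; index 3 holds the intended answer D.
A, B, C, D, E = [1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 0]
choice = solve_analogy(A, B, C, [A, B, C, D, E], method="3CosAdd", constrained=False)
```

Setting `constrained=False` corresponds to the -U variants, in which A, B, and C themselves remain eligible answers.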

Embedding Methods
In our experiments, we consider a number of embedding models. These include simple word vector aggregation methods such as the Average of GloVe embeddings (abbreviated as GloVe). The Discrete Cosine Transform (DCT) embeddings are generated by concatenating the first k DCT coefficients. In our experiment, k ranges from 0 to 6, and for space reasons, we report the best-performing result. For sentence embeddings based on RNNs such as Skip-Thought Vectors (SkipThought), Quick-Thought vectors (QuickThought), and the General Purpose Sentence Encoder by Subramanian et al. (2018) (GenSen), we use the hidden state of the final RNN cell as the sentence embedding. For InferSent, we use max-pooling over all hidden states of RNN cells to produce sentence embeddings. Facebook released two versions of the InferSent model: the earlier version (InferSentV1) is trained based on GloVe word embeddings, while the second version (InferSentV2) is trained using fastText word embeddings. The Universal Sentence Encoder (Cer et al., 2018) comes in two versions, the first one based on Deep Average Networks (USE-DAN), the second based on Transformer networks (USE-Transformer).
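As a rough illustration of the DCT-based embedding, the sketch below applies a DCT-II along the word axis of each embedding dimension and concatenates the first k coefficient vectors. The orthonormal scaling and the zero padding for sentences with fewer than k words are assumptions of this sketch and may differ from the configuration of Almarwani et al. (2019).

```python
import math


def dct_sentence_embedding(word_vectors, k):
    """Concatenate the first k DCT-II coefficient vectors of a sentence.

    word_vectors is a list of n word embeddings, each of dimension dim;
    the result has k * dim entries.
    """
    n = len(word_vectors)
    dim = len(word_vectors[0])
    embedding = []
    for j in range(k):
        if j >= n:
            # Assumption: pad with zeros when the sentence has fewer than k words.
            embedding.extend([0.0] * dim)
            continue
        scale = math.sqrt((1.0 if j == 0 else 2.0) / n)  # orthonormal scaling
        for d in range(dim):
            coeff = sum(word_vectors[t][d] * math.cos(math.pi / n * (t + 0.5) * j)
                        for t in range(n))
            embedding.append(scale * coeff)
    return embedding
```

The zeroth coefficient is proportional to the plain average of the word vectors, so it is order-invariant; the higher coefficients change when words are permuted, which is how this representation retains word order information.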
For contextual embedding models such as BERT, XLNet, RoBERTa, along with Sentence-BERT (Reimers and Gurevych, 2019), we consider two popular methods to generate a sentence embedding. The first one (-CLS) consists in using the embedding of the special "[CLS]" token in the sentence, followed by a linear transformation and a tanh activation layer. Another method (-AVG) involves computing the element-wise sum of contextual word representations $w_1, w_2, w_3, \dots, w_n$ at the top level of the Transformer encoder and dividing it by the square root of the sentence length. For a given model, we only show the pooling method that obtained the highest accuracy in the experimental results. We consider different versions of the models (-Base, -Large) as released by the original authors. In our result tables, the Sentence-BERT models by Reimers and Gurevych (2019) based on BERT and RoBERTa are referred to as SBERT and SRoBERTa, respectively.
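The two pooling schemes can be written down directly. The sketch below operates on plain Python lists standing in for the encoder's top-layer outputs; the function names, and the `weight` and `bias` parameters of the linear layer, are hypothetical placeholders rather than the API of any particular library.

```python
import math


def avg_pool(token_vectors):
    """-AVG: element-wise sum of token vectors divided by sqrt(sentence length)."""
    n = len(token_vectors)
    return [sum(column) / math.sqrt(n) for column in zip(*token_vectors)]


def cls_pool(cls_vector, weight, bias):
    """-CLS: linear transformation of the [CLS] vector followed by tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, cls_vector)) + b)
            for row, b in zip(weight, bias)]
```

Note that dividing by the square root of the length, rather than by the length itself, only rescales the vector, so cosine-based comparisons are unaffected by that choice.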

Results and Analysis
Sentence Analogy from Lexical Analogy. Table 4 provides the overall aggregate results for lexical analogy-based pairs, while Table 5 specifically assesses the semantic analogy and syntactic analogy categories. With an unconstrained evaluation, we observe that all considered sentence embedding models show relatively poor success rates. We conjecture that this is because sim(D, B) − sim(D, A) tends to be fairly small in most cases, so 3CosAdd-U and 3CosMul-U often degenerate mainly to evaluating sim(D, C).
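The conjecture can be made explicit for the additive case. The following sketch assumes unit-normalized embeddings, for which the 3CosAdd objective decomposes into three cosine terms (Levy and Goldberg, 2014); the symbols score and Δ are introduced here purely for illustration.

```latex
\begin{align*}
\mathrm{score}(D) &= \mathrm{sim}(D, B) - \mathrm{sim}(D, A) + \mathrm{sim}(D, C)
   = \Delta(D) + \mathrm{sim}(D, C),
   \qquad \Delta(D) := \mathrm{sim}(D, B) - \mathrm{sim}(D, A), \\
\mathrm{score}(C) &= \Delta(C) + 1
   \qquad \text{since } \mathrm{sim}(C, C) = 1, \\
\mathrm{score}(D) > \mathrm{score}(C)
   &\iff \Delta(D) - \Delta(C) > 1 - \mathrm{sim}(D, C).
\end{align*}
```

Under this reading, the correct D beats the trivial answer C in an unconstrained evaluation only if the offset term separates the two by more than 1 − sim(D, C). When Δ is small for all candidates, the ranking is dominated by sim(D, C), which C itself maximizes.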
With a traditional constrained evaluation, several methods show substantial regularities. Averages of GloVe vectors draw on the linguistic regularities inherent in the GloVe vectors. The Discrete Cosine Transform method outperforms other sentence embedding methods for most of the categories. InferSent outperforms contextual embedding methods, with XLNet-Large obtaining the weakest results across all considered models. Despite the strength of InferSent, fine-tuning BERT on NLI datasets does not improve the performance of the model on lexical analogy based tasks. SBERT and SRoBERTa obtained a lower accuracy than BERT and RoBERTa, respectively.
From the results of Table 5, we find that capturing the kinds of semantic analogies considered in our study is more challenging than capturing the syntactic analogies. Most of the sentence embedding models we tested (except XLNet) excelled at solving syntactic question pairs using the 3CosAdd metric, while few of them perform well on semantic analogy pairs. In particular, contextual embedding-based models appear capable of reflecting syntactic phenomena, but do not appear to yield semantic knowledge at the same level as word embedding models such as GloVe, although they are known to be able to emit world knowledge when evaluated in language modeling settings (Petroni et al., 2019).
Sentence Analogy from Relationships Between Sentences. In Figure 1, we provide the results on the set of all relation-based sentence analogies from Section 3.2. The first row represents the accuracy, while the following rows show the probability that a certain adversarial candidate was chosen by the method. Note that because our relation-based analogy contains entailment and a sentence always entails itself, we omit the unconstrained versions of 3CosAdd and 3CosMul in this evaluation. The InferSentV2 and GenSen models achieve the highest accuracy on the relation-based analogy tasks when using 3CosAdd and 3CosMul, respectively, while the large version of XLNet achieves the lowest accuracy. By comparing the performance of the base and large versions of the pre-trained models based on Transformers, we find that large models have a tendency to confuse the real premise sentence with the not-negated form of the original sentence, which hampers their performance in this sort of evaluation. Another interesting finding is that BERT-based models improve their performance on distinguishing the actual premise from the negated version of the hypothesis after fine-tuning on the SNLI dataset. Yet, they have a higher probability of being misled by adversarial candidates created by Span Deletion and Word Reordering.
Figure 2 shows the accuracy of pre-trained sentence embeddings broken down by particular sentence relation-based analogy forms. We observe that the difficulty of some relation-based analogy tasks is substantially higher than for others. Most of the models, except for XLNet-Large and GloVe, achieve relatively high accuracy on Entailment analogy, while none of the models perform well on the proposed Objective Clause analogy. In addition, Transformer-driven models have made great progress at capturing syntactic analogies such as Passivization, Objective Clauses, and Predicative Adjective Conversion, but their ability to identify the Negation relation appears limited in this sort of evaluation. We also find that the performance of the pre-trained models on relational analogy tasks might be affected by the network architecture. For example, sentence embedding models built on RNNs fare better at recognizing Entailment and Negation analogy, but their performance on distinguishing Passivization and Objective Clause is not as good as that of Transformer-driven models.

Conclusion
This paper presents several new datasets to test to what extent existing sentence embedding models exhibit regularities with regard to sentence analogies. Most of the sentence embedding models we tested succeeded in recognizing syntactic analogies based on lexical ones, but had a harder time capturing semantic regularities by means of an analogy task. Moreover, the remarkable success of BERT-style contextual embeddings does not always translate into better regularities in the vector space of fixed-length sentence embeddings. Likewise, more training data and more model parameters do not necessarily yield better results. In many cases, word vector averages or a Discrete Cosine Transform of word embeddings outperform more complex sentence embedding models. Resources related to this study are available online at http://sentence.embeddings.org.