Empirical Linguistic Study of Sentence Embeddings

The purpose of the research is to answer the question whether linguistic information is retained in vector representations of sentences. We introduce a method of analysing the content of sentence embeddings based on universal probing tasks, along with the classification datasets for two contrasting languages. We perform a series of probing and downstream experiments with different types of sentence embeddings, followed by a thorough analysis of the experimental results. Aside from dependency parser-based embeddings, linguistic information is retained best in the recently proposed LASER sentence embeddings.


Introduction
Modelling natural language with neural networks has been an extensively researched area for several years now. On the one hand, deep learning enormously reduced the cost of feature engineering. On the other hand, we are largely unaware of the features that are used in estimating a neural model and, therefore, of the kinds of information that a trained neural model relies on most heavily. Since neural network-based models work very well in many NLP tasks and often provide state-of-the-art results, it is extremely interesting and desirable to understand which properties of words, phrases or sentences are retained in their embeddings. An approach to investigate whether linguistic properties of English sentences are encoded in their embeddings is proposed by Shi et al. (2016), Adi et al. (2017), and . It consists in designing a series of classification problems focusing on linguistic properties of sentences, so-called probing tasks . In a probing task, sentences are labelled according to a particular linguistic property. Given a model that generates an embedding vector for any sentence, the model is applied to the probing sentences. A classifier is then trained with the resulting embeddings as inputs and probing labels as targets. The performance of the resulting classifier is considered a proxy for how well the probing property is retained in the sentence embeddings.
We propose an extension and generalisation of the methodology of the probing tasks-based experiments. First, the current experiments are conducted on two typologically and genetically different languages: English, which is an isolating Germanic language, and Polish, which is a fusional Slavic one. Our motivation for conducting experiments on two contrasting languages is as follows. English is undoubtedly the most prominent language with multiple resources and tools. However, English language processing is only a part of NLP in general. Methods designed for English are not guaranteed to be universal. In order to verify whether an NLP algorithm is powerful, it is not enough to evaluate it solely on English. Evaluation on additional languages can shed light on an investigated method. We select Polish as our contrasting language for pragmatic reasons, i.e. there is a Polish dataset, CDSCorpus (Wróblewska and Krasnowska-Kieraś, 2017), which is comparable to the SICK relatedness/entailment corpus (Bentivogli et al., 2014). Both datasets are used in downstream evaluation.
Second, the designed probing tests are universal for both tested languages. For syntactic processing of both languages, we use the Universal Dependencies schema (UD, Nivre et al., 2016). 1 Since we use automatically parsed UD trees for generating probing datasets, analogous tests can be generated for any language with a UD treebank on which a parser can be trained.
1 The Universal Dependencies initiative aims at developing a cross-linguistically consistent morphosyntactic annotation schema and at building a large multilingual collection of treebanks annotated according to this schema. It is worth noting that the UD schema has become the de facto standard for syntactic annotation in recent years.
The contributions of this work are twofold.
(1) We introduce a method of analysing the content of sentence embeddings based on universal probing tasks, along with the classification datasets for two contrasting languages. (2) We carry out a series of empirical experiments based on publicly released probing datasets 2 created within the described work and the obtainable downstream task datasets with different types of sentence embeddings, followed by a thorough analysis of the experimental results.
We test sentence embeddings obtained with max-pooling and mean-pooling operations over word embeddings or contextualised word embeddings, sentence embeddings estimated on small corpora, and sentence embeddings estimated on large monolingual or multilingual corpora.

Experimental Methodology
The purpose of the research is to answer the question whether linguistic information is retained in vector representations of sentences. Assessment of the linguistic content in sentence embeddings is not a trivial task and we verify whether it is possible with a probing task-based method (see Section 2.1). Probing sentence embeddings for individual linguistic properties does not examine the overall performance of embeddings in composing the meaning of the represented sentence. We therefore provide two downstream tasks for a general evaluation (see Section 2.2).

Probing Task-based Method
A probing task can be defined as "a classification problem that focuses on simple linguistic properties of sentences" . A probing dataset contains pairs of sentences and their categories. For example, the dataset for the Passive probing task (a binary classification) consists of two types of pairs: (a passive voice sentence, 1) and (a non-passive (active) voice sentence, 0). The sentence-category pairs are automatically extracted from a corpus of dependency parsed sentences. The extraction procedure is based on a set of rules compatible with the Universal Dependencies annotation schema. The proposed rules of creating the probing task datasets are thus universal for languages with UD-style dependency treebanks.
A classifier is trained and tested on vector representations of the probing sentences generated with a sentence embedding model. If a linguistic property is encoded in the sentence embeddings and the classifier learns how this property is encoded, it will correctly classify the test sentence embeddings. The efficiency of the classifiers for each probing task is measured with accuracy. The probing tasks are described in Section 3.
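The procedure above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: `toy_embed` is a hypothetical stand-in for a real sentence embedding model, and a nearest-centroid classifier replaces the Multilayer Perceptron used in the experiments so that the example runs with the Python standard library alone.

```python
from collections import defaultdict

def toy_embed(sentence):
    """Hypothetical embedding model: two hand-made features per sentence."""
    tokens = sentence.lower().split()
    n_aux = sum(t in {"was", "were", "been"} for t in tokens)
    return [float(n_aux), float(len(tokens))]

def train_centroids(vectors, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] += 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}

def predict(centroids, vec):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def probing_accuracy(embed, train, test):
    """Embed the probing sentences, train on one part, report accuracy on the other."""
    centroids = train_centroids([embed(s) for s, _ in train], [y for _, y in train])
    hits = sum(predict(centroids, embed(s)) == y for s, y in test)
    return hits / len(test)

# Toy Passive probing data: 1 = passive voice, 0 = active voice.
TRAIN = [("The book was written by her", 1), ("The door was opened by him", 1),
         ("She wrote the book", 0), ("He opened the door", 0)]
TEST = [("The cake was eaten by them", 1), ("They ate the cake", 0)]

print(probing_accuracy(toy_embed, TRAIN, TEST))
```

A high accuracy here only shows that the toy features separate the toy classes; with a real model, the same loop would indicate how recoverable the property is from the embeddings.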

Downstream Task-based Method
Two downstream tasks are proposed in our experiments: Relatedness and Entailment. The semantic relatedness 3 task is to measure the degree of any kind of lexical or functional association between two terms, phrases or sentences. The efficiency of the classifier for semantic relatedness is measured with Pearson's r and Spearman's ρ coefficients. The textual entailment task is to assess whether the meaning of one sentence is entailed by the meaning of another sentence. There are three entailment classes: entailment, contradiction, and neutral. The efficiency of the classifier for entailment, in turn, is measured with accuracy.
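For readers unfamiliar with the two relatedness measures, the following sketch computes Pearson's r and Spearman's ρ between gold and predicted scores in plain Python. The function names and the example scores are illustrative assumptions, not part of the evaluation toolkit; Spearman's ρ is computed here as Pearson's r over average ranks.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho as Pearson's r computed over the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))

# Illustrative scores only: predictions that are monotone but not linear in the gold scores.
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 4.0, 9.0, 16.0, 25.0]
print(pearson_r(gold, pred), spearman_rho(gold, pred))
```

The example shows why both coefficients are reported: a monotone but non-linear relation gives a perfect rank correlation while Pearson's r stays below 1.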

Probing Tasks
The point of reference for designing our probing tasks is the work by . The authors propose several probing tasks and divide them into those pertaining to surface, syntactic and semantic phenomena. However, we decide to discard the 'syntactic versus semantic' distinction and consider all tasks either surface (see Section 3.1) or compositional (see Section 3.2). This decision is motivated by the fact that both syntactic and semantic principles are undoubtedly compositional by their nature. The syntax admitting well-formed expressions on the basis of the lexicon works in tandem with the semantics. According to Jacobson's notion of Direct Compositionality (Jacobson, 2014, 43), "each syntactic rule which predicts the existence of some well-formed expression (as output) is paired with a semantic rule which gives the meaning of the output expression in terms of the meaning(s) of the input expressions".

Tests on Surface Properties
The tests investigate whether surface properties of sentences (i.e. sentence length and lexical content) are retained in their embeddings. We follow the definition of surface probing tasks and the procedure of preparing training data as described by .
[Figure 1: dependency tree of the sentence "She has starred with many leading actors."]
WC (word content) This task consists in a 750-way classification of sentences containing exactly one of 750 pre-selected target words (i.e. the categories correspond to the 750 words). The words are selected based on their frequency ranking in the corpus from which the probing datasets were extracted: the top 2000 words are discarded and the next 750 words are used as task categories. 4
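The selection of WC categories can be sketched as follows. This is only an illustration under assumptions: the function names are hypothetical, frequencies are counted over raw tokens, and "exactly one" is read as exactly one distinct target word per sentence. The tiny corpus uses scaled-down parameters so that the result is easy to follow.

```python
from collections import Counter

def select_target_words(tokenised_sentences, skip_top=2000, n_targets=750):
    """Rank words by corpus frequency, drop the most frequent, keep the next ones."""
    freq = Counter(tok for sent in tokenised_sentences for tok in sent)
    ranked = [word for word, _ in freq.most_common()]
    return ranked[skip_top:skip_top + n_targets]

def wc_instances(tokenised_sentences, targets):
    """Yield (sentence, target word) for sentences with exactly one target word."""
    targets = set(targets)
    for sent in tokenised_sentences:
        present = targets.intersection(sent)
        if len(present) == 1:
            yield sent, next(iter(present))

# Toy corpus; real experiments would use skip_top=2000 and n_targets=750.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
targets = select_target_words(corpus, skip_top=1, n_targets=2)
for sentence, label in wc_instances(corpus, targets):
    print(label, sentence)
```

The first sentence is skipped because it contains two target words, which would make its category ambiguous.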

Compositional Tests
The tests on compositional principles are significantly modified (e.g. TreeDepth, TopDeps, Tense) with respect to  or designed anew (i.e. Passive and SentType), because the basis for preparing probing datasets is constituted by dependency trees. 5
4  use 1000 target words selected in a similar manner, but since our datasets are smaller, we proportionally decreased this number in order to maintain the same number of training/validation/testing instances per target word.
5 We reject the bigram shift task (BShift) as it is applicable only for isolating languages and practically useless for fusional languages with relatively free word order. This task consists in detecting sentences with two random, adjacent words switched. According to , such shift generally leads to an erroneous utterance (acceptable sentences can be generated accidentally). However, given a language with less strict word order, the intuition is that the BShift procedure could produce too many correct sentences. A very preliminary case study involving several shift strategies and one sentence (Autorka we wszystkich książkach każe bohaterom szukać tożsamości. 'The author tells the characters in her all books to look for identity.', lit. 'The author in her all books tells the characters to look for identity.') confirmed this intuition, as most of BShift-modified sentences were accepted by Polish speakers.
TreeDepth (dependency tree depth) This task consists in classifying sentences based on the depth of the corresponding dependency trees. The task is defined similarly to , but dependency trees are used instead of constituent trees. Similarly to the original TreeDepth task, the data is decorrelated with respect to sentence length. Example: The dependency tree in Figure 1 has a depth of 3, because the path from the root node to any token node contains 3 tokens at most.
TopDeps (top dependency schema) The idea of this task is based on TopConst task 6 , but adapted to dependency trees. The task consists in predicting a multiset of the dependency types labelling the relations between the top-most node (the ROOT's only dependent) and all its children, barring punct relations. The position of a phrase in an English sentence largely determines its grammatical function. In Polish, in turn, word order is relatively free and therefore not a strong determinant of grammatical functions. We thus extract multisets of dependency types, not taking into account the text order of their respective phrases. The extracted multisets roughly correspond to predicate-argument structures. There are 20 classes for each language: 19 most common top dependency schemata and the class {OTHER}. Example: The TopDeps class of the sentence in Figure 1 is {aux nsubj obl}.
6 In the original TopConst task, the classifier learns to detect one of 19 most common top constructions or <OTHER>, e.g. the top construction sequence of the tree for [Then][very dark gray letters on a black screen][appeared] [.] consists of four constituent labels: <ADVP NP VP .>.
Passive (passive voice) This is a binary classification task where the goal is to predict whether a sentence embedding represents a passive voice sentence (the class 1) or an active sentence (the class 0). In case of complex sentences only the voice of the matrix (main) clause is detected. 7 In order to identify passive voice sentences, we adhere to the following procedure: the predicate of a passive voice sentence governs an auxiliary verb and the relation is labelled aux:pass. Furthermore, the predicate (part-of-speech VERB or ADJ) has the features Voice=Pass and VerbForm=Part. The dependency nsubj:pass (passive nominal subject) can be helpful, but as the subject may be dropped in Polish, it is not sufficient. Example: The active voice sentence in Figure 1 is classified as 0.
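The three tree-based rules described so far (TreeDepth, TopDeps and Passive) can be sketched over a simplified token representation. This is an illustrative reading of the rules, not the extraction code used in the experiments: the `tok` helper and the dict layout are hypothetical, and the heads and labels assigned to "with", "many" and "leading" in the Figure 1 sentence are assumptions chosen to be consistent with the depth and TopDeps class stated in the text.

```python
def tok(tid, form, head, deprel, upos, **feats):
    """Build one token as a plain dict (a simplified CoNLL-U row)."""
    return {"id": tid, "form": form, "head": head,
            "deprel": deprel, "upos": upos, "feats": feats}

def main_predicate(tokens):
    """The top-most node: the only dependent of ROOT (head == 0)."""
    return next(t for t in tokens if t["head"] == 0)

def tree_depth(tokens):
    """Maximum number of tokens on a path from ROOT down to any token."""
    heads = {t["id"]: t["head"] for t in tokens}
    def tokens_on_path(tid):
        n = 0
        while tid != 0:
            n += 1
            tid = heads[tid]
        return n
    return max(tokens_on_path(t["id"]) for t in tokens)

def top_deps(tokens):
    """Sorted multiset of relations between the main predicate and its children."""
    root = main_predicate(tokens)
    return sorted(t["deprel"] for t in tokens
                  if t["head"] == root["id"] and t["deprel"] != "punct")

def is_passive(tokens):
    """Passive rule: aux:pass child plus Voice=Pass and VerbForm=Part on the predicate."""
    root = main_predicate(tokens)
    has_aux_pass = any(t["head"] == root["id"] and t["deprel"] == "aux:pass"
                       for t in tokens)
    feats = root["feats"]
    return bool(has_aux_pass and root["upos"] in {"VERB", "ADJ"}
                and feats.get("Voice") == "Pass" and feats.get("VerbForm") == "Part")

# Figure 1 sentence; attachments of 'with', 'many', 'leading' are assumed.
FIG1 = [
    tok(1, "She", 3, "nsubj", "PRON"),
    tok(2, "has", 3, "aux", "AUX"),
    tok(3, "starred", 0, "root", "VERB", Tense="Past", VerbForm="Part"),
    tok(4, "with", 7, "case", "ADP"),
    tok(5, "many", 7, "amod", "ADJ"),
    tok(6, "leading", 7, "amod", "ADJ"),
    tok(7, "actors", 3, "obl", "NOUN", Number="Plur"),
    tok(8, ".", 3, "punct", "PUNCT"),
]
# A constructed passive sentence for contrast (not from the paper).
PASSIVE = [
    tok(1, "The", 2, "det", "DET"),
    tok(2, "book", 4, "nsubj:pass", "NOUN", Number="Sing"),
    tok(3, "was", 4, "aux:pass", "AUX", Tense="Past"),
    tok(4, "written", 0, "root", "VERB", Voice="Pass", VerbForm="Part", Tense="Past"),
    tok(5, ".", 4, "punct", "PUNCT"),
]

print(tree_depth(FIG1), top_deps(FIG1), is_passive(FIG1), is_passive(PASSIVE))
```

Sorting the relation labels makes the TopDeps class independent of word order, which matches the motivation given for Polish.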
Tense (grammatical tense) This is a binary classification of sentences by the grammatical tense of their main predicates. The sentence predicates can be marked for the present (the pres class) or past (the past class) grammatical tense. The present tense predicates have the following properties: the UD POS tag VERB and the feature Tense=Pres. The past tense predicates have the following properties: the UD POS tag VERB and the feature Tense=Past. Example: The sentence in Figure 1 is classified as past.
SubjNum (grammatical number of subjects) In this binary classification task, sentences are classified by the grammatical number of nominal subjects (marked with the UD label nsubj) of main predicates. There are two classes: sing (the UD POS tag NOUN and the feature Number=Sing) and plur (the UD POS tag NOUN and the feature Number=Plur). Example: The sentence in Figure 1 is categorised as sing.
ObjNum (grammatical number of objects) This binary classification task is analogous to the one above, but this time sentences are classified by the grammatical number of direct objects of main predicates. The classes are again sing to represent the singular nominal objects (the obj label, the NOUN tag, and the feature Number=Sing), and plur for the plural/mass ones (the obj label, the NOUN tag, and the feature Number=Plur).
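The Tense, SubjNum and ObjNum rules all inspect the features of a single node, so they can be sketched with one small lookup each. The token layout, the `tok` helper and the example sentence below are illustrative assumptions; returning `None` marks a sentence as irrelevant for the given task.

```python
def tok(tid, form, head, deprel, upos, **feats):
    """Build one token as a plain dict (a simplified CoNLL-U row)."""
    return {"id": tid, "form": form, "head": head,
            "deprel": deprel, "upos": upos, "feats": feats}

def main_predicate(tokens):
    """The main predicate is the token attached directly to ROOT (head == 0)."""
    return next(t for t in tokens if t["head"] == 0)

def tense_label(tokens):
    """'pres' or 'past' for a VERB predicate with Tense=Pres/Past, else None."""
    root = main_predicate(tokens)
    if root["upos"] != "VERB":
        return None
    return {"Pres": "pres", "Past": "past"}.get(root["feats"].get("Tense"))

def number_label(tokens, deprel):
    """'sing' or 'plur' for a NOUN child of the predicate with the given relation."""
    root = main_predicate(tokens)
    for t in tokens:
        if t["head"] == root["id"] and t["deprel"] == deprel and t["upos"] == "NOUN":
            return {"Sing": "sing", "Plur": "plur"}.get(t["feats"].get("Number"))
    return None

# Constructed example sentence (not from the paper).
SENT = [
    tok(1, "Cats", 2, "nsubj", "NOUN", Number="Plur"),
    tok(2, "chase", 0, "root", "VERB", Tense="Pres"),
    tok(3, "the", 4, "det", "DET"),
    tok(4, "mouse", 2, "obj", "NOUN", Number="Sing"),
    tok(5, ".", 2, "punct", "PUNCT"),
]

print(tense_label(SENT), number_label(SENT, "nsubj"), number_label(SENT, "obj"))
```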
SentType (sentence type) This is a new probing task consisting in classifying sentences by their types. There are three classes: inter for interrogative sentences (e.g. Do you like him?), imper for imperative sentences (e.g. Get out of here!), and other for declarative sentences (e.g. He likes her.) and exclamatory sentences (e.g. What a liar!).

SentEval Toolkit
We use the SentEval toolkit (Conneau and Kiela, 2018) in our experiments. The toolkit provides utilities for testing any vector representation of sentences in probing and downstream scenarios. Given a function f mapping a list of sentences to a list of vectors (serving as an interface to the tested sentence embedding model), a task and a dataset (with sentences or pairs of sentences as input data), SentEval performs evaluation in the context of the task. More specifically, it generates vectors for the dataset sentences using f, trains a classifier with vectors as inputs and task-specific labels as outputs, and evaluates it. Applying an identical evaluation procedure with the same dataset to different sentence embedding models provides a meaningful comparison of the models.
For the purpose of our tests, the probing datasets provided with the toolkit are replaced with our own, the CDS downstream task dataset is added and the SICK dataset is retained. Other SentEval downstream tasks are not used, having no Polish counterparts. In all experiments we use SentEval's Multilayer Perceptron classifier. 8

Probing Datasets
For English and Polish, 9 probing datasets are extracted from Paralela 9 (Pęzik, 2016), the largest Polish-English parallel corpus with nearly 4M sentence pairs. An important objective is to make the probing datasets in both languages maximally similar. The choice of a parallel corpus as their source allows us to draw probing sentences from collections of texts that have analogous distributions of genre, style, sentence complexity etc. Note that we do not extract parallel sentence pairs (sharing common target classes) for individual probing datasets (sentences are often not translated literally), but we construct English and Polish datasets separately.
The sentences are tokenised with UDPipe 10 (Straka and Straková, 2017) and POS-tagged and dependency parsed with COMBO 11 (Rybak and Wróblewska, 2018). The UDPipe and COMBO models are trained on the UD English-EWT treebank 12 (Silveira et al., 2014) with 16k trees (254k tokens) and on the Polish PDB-UD treebank 13 (Wróblewska, 2018) with 22k trees (351k tokens). The set of UD-based rules is applied to dependency-parsed sentences to extract the final probing datasets for both languages.
Following , for the probing tasks constructed by determining selected properties of a certain dependency tree node (e.g. main predicate's tense, direct object's number, etc.), the division into training, validation and test sets ensures that all data instances, where the relevant token of the sentence (target token) bears the same word form, are not distributed into different sets. For example, all SubjNum instances, where the subject phrase is headed by the token cats (and the plur class is determined based on the features of this token), are assigned into the same set.
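A split that keeps all instances of one target word form in the same set can be sketched as below. The function name, the instance layout (sentence, target form, label) and the greedy assignment of whole groups are assumptions made for illustration; the 75 : 7.5 : 7.5 proportions mirror the dataset sizes reported for the experiments.

```python
import random
from collections import defaultdict

def split_by_target(instances, sizes=(75.0, 7.5, 7.5), seed=0):
    """Split (sentence, target_form, label) triples so that no form spans two sets."""
    groups = defaultdict(list)
    for inst in instances:
        groups[inst[1]].append(inst)
    forms = sorted(groups)
    random.Random(seed).shuffle(forms)  # fixed seed keeps the split reproducible
    total = len(instances)
    train_end = total * sizes[0] / sum(sizes)
    valid_end = total * (sizes[0] + sizes[1]) / sum(sizes)
    train, valid, test = [], [], []
    seen = 0
    for form in forms:
        # Whole groups are assigned at once, so set sizes are only approximate.
        bucket = train if seen < train_end else valid if seen < valid_end else test
        bucket.extend(groups[form])
        seen += len(groups[form])
    return train, valid, test

# Toy SubjNum-style data: five sentences per subject form.
FORMS = [("cats", "plur"), ("dogs", "plur"), ("birds", "plur"),
         ("cat", "sing"), ("dog", "sing"), ("bird", "sing")]
DATA = [(f"Sentence {i} with subject {form}", form, label)
        for form, label in FORMS for i in range(5)]
train, valid, test = split_by_target(DATA)
print(len(train), len(valid), len(test))
```

Grouping by form prevents a classifier from scoring well merely by memorising which word forms go with which class.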
For each probing dataset, only relevant sentences are included (sentences with no subject are irrelevant for SubjNum, utterances with no main predicate in present/past tense are irrelevant for Tense etc.). Moreover, the target tokens are filtered based on their frequency (most and least frequent are discarded) and the number of occurrences of any target token is limited (to prevent the more frequent ones from dominating the datasets). Finally, the datasets are balanced with relation to the target class.
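Two of the filtering steps above, limiting the occurrences of each target token and balancing the classes, can be sketched as follows. The function names, the cap value and the downsampling strategy are illustrative assumptions, since the exact thresholds are not stated in the text.

```python
import random
from collections import Counter, defaultdict

def cap_per_target(instances, max_per_target):
    """Keep at most max_per_target instances for each target token."""
    seen = Counter()
    kept = []
    for inst in instances:
        if seen[inst[1]] < max_per_target:
            seen[inst[1]] += 1
            kept.append(inst)
    return kept

def balance_classes(instances, seed=0):
    """Downsample every class to the size of the smallest one."""
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst[2]].append(inst)
    size = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], size))
    return balanced

# Toy (sentence, target token, class) triples with a dominant token 'cats'.
RAW = ([(f"s{i}", "cats", "plur") for i in range(5)]
       + [(f"s{i}", "dogs", "plur") for i in range(2)]
       + [(f"s{i}", "cat", "sing") for i in range(3)])
capped = cap_per_target(RAW, max_per_target=3)
balanced = balance_classes(capped)
print(Counter(label for _, _, label in balanced))
```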
With the above restrictions implemented, we are able to extract datasets consisting of 90k examples each (75k for training, and 7.5k each for validation and testing). The dataset sizes are smaller than the 120k examples proposed by , but remain in the same order of magnitude. The lower number of examples per dataset is due to the fact that we strive to build comparable datasets for both investigated languages based on the parallel corpus.

Downstream Datasets
Two datasets for evaluation of compositional distributional semantic models are used in our experiments. The SICK corpus 14 (Bentivogli et al., 2014) consists of 10k pairs of English sentences. Each sentence pair is human-annotated for relatedness in meaning and entailment. The relatedness score indicates the extent to which meanings of two sentences are related and is calculated as the average of ten human ratings collected for this sentence pair on the 5-point Likert scale. The entailment relation between two sentences, in turn, is labelled with entailment, contradiction, or neutral, selected by the majority of human annotators. CDSCorpus 15 (Wróblewska and Krasnowska-Kieraś, 2017) is a comparable corpus of 10k pairs of Polish sentences human-annotated for relatedness and entailment. The degree of semantic relatedness between two sentences is calculated as the average of six human ratings on the 0-5-point scale. As an entailment relation between two sentences doesn't have to be symmetric, sentence pairs are annotated with bi-directional entailment labels, i.e. pairs of entailment, contradiction, and neutral.
10 https://github.com/ufal/udpipe/releases/tag/v1.2.0
11 https://github.com/360er0/COMBO
12 https://github.com/UniversalDependencies/UD_English-EWT
13 http://git.nlp.ipipan.waw.pl/alina/PDBUD

Sentence Embeddings
Three types of sentence embeddings are tested in our experiments: (1) sentence embeddings obtained with max-pooling and mean-pooling over pre-trained word embeddings or contextualised word embeddings, (2) sentence embeddings estimated on small comparable corpora, and (3) pretrained sentence embeddings estimated on large monolingual or multilingual corpora.

Max/Mean-pool Sentence Embeddings
Words can be represented as continuous vectors in a low-dimensional space, i.e. word embeddings. Word embeddings are assumed to capture linguistic (e.g. morphological, syntactic, semantic) properties of words. Recently, they are often learnt as part of a neural network trained on an unsupervised or semi-supervised objective task using massive amounts of data (e.g. Mikolov et al., 2013; Grave et al., 2018). 16 In our experiments, we test FASTTEXT embeddings 17 (Grave et al., 2018) and BERT embeddings (Devlin et al., 2018) for English and Polish. 19
Apart from the FASTTEXT and BERT models, we use parts of the dependency parsing models of COMBO to generate sentence embeddings. COMBO has a BiLSTM-based module that produces contextualised word embeddings based on concatenations of word-level embeddings and character-level embeddings. As the contextualised word embeddings are originally used to predict dependency trees, they should be rich in linguistic information. Since there is some overlap between the PDB-UD treebank (used to train the COMBO parsing model for Polish) and CDSCorpus (the source of downstream datasets for Polish), a separate COMBO model 20 is trained on PDB-UD data without the overlapping sentences. The model is used to obtain the embeddings for both probing and downstream evaluations. 21
For all three models listed above, sentence embeddings are obtained by mean or max pooling over individual word embeddings. For FASTTEXT and COMBO, the UDPipe tokenisation of the probing sentences is used and a sequence of embedding vectors is obtained by model lookup and by reading the outputs of the parser's BiLSTM module, respectively. In the case of BERT (which uses its own tokenisation mechanism), whole sentences are passed to the module and outputs of its penultimate layer are treated as token embeddings.
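Mean and max pooling reduce a variable-length sequence of token vectors to a single fixed-size sentence vector by aggregating each dimension independently. A minimal sketch in plain Python, with hypothetical function names and made-up token vectors:

```python
def mean_pool(token_vectors):
    """Average each dimension over all tokens of the sentence."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def max_pool(token_vectors):
    """Take the maximum of each dimension over all tokens of the sentence."""
    return [max(column) for column in zip(*token_vectors)]

# Two tokens with 2-dimensional embeddings (illustrative values only).
tokens = [[1.0, 4.0], [3.0, 2.0]]
print(mean_pool(tokens), max_pool(tokens))
```

Both operations are order-invariant, so any word order information has to come from the token vectors themselves, as is the case for contextualised embeddings.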

Small Corpora-based Sentence Embeddings
English and Polish sentence embeddings are estimated on the Paralela corpus. The sentences that are included in any probing dataset are excluded from the data used for training sentence embeddings. Furthermore, the Paralela corpus contains not only 1-to-1 sentence alignments, but also 1-to-many or even many-to-many alignments. As we aim at estimating sentence embedding models, only proper sentences are selected from the corpus. English and Polish sentence embedding models are trained on 3M sentences with the SENT2VEC library 22 (Pagliardini et al., 2018). The SENT2VEC models are estimated with a neural architecture which resembles the CBOW model architecture by Mikolov et al. (2013). The tested models (SENT2VEC NS ) are estimated on unigrams and bigrams with the loss function coupled with negative sampling, to improve training efficiency.

Pre-trained Sentence Embeddings
We test English sentence embeddings provided by the pretrained SENT2VEC and USE models, and multilingual sentence embeddings generated by the LASER model. The SENT2VEC ORIG model 23 trained on the Toronto Book corpus 24 (70M sentences) outputs 700-dimensional sentence embeddings. The Universal Sentence Encoder model 25 (USE, Cer et al., 2018) was estimated in a multi-task learning scenario on a variety of data sources 26 with a Transformer encoder. It takes a variable length English text (e.g. sentence, phrase, or short paragraph) as input and produces a 512-dimensional vector. The Language-Agnostic SEntence Representations model 27 (LASER, Artetxe and Schwenk, 2018) was trained on 223M parallel sentences (93 languages) from various sources. The encoder is implemented as a 5-layer BiLSTM network that represents a sentence as a 1,024-dimensional vector (max-pooling over the last hidden states of the BiLSTM).

Results
Results reported by SentEval are summarised in Table 1. The best result for each task in each language is highlighted in grey. For almost all probing tasks, the most accurate embedding is one of the two COMBO-based representations. This is not surprising as the contextualised vector representations produced by COMBO are learnt in the context of dependency parsing. Moreover, the target classes in the probing tasks are derived from trees produced by a parser that uses virtually the same neural model, which can be considered a kind of information leak. With COMBO models excluded from the ranking due to this unfair advantage, the best-performing sentence embeddings (shown in boldface) for 17 of the 22 task-language pairs are yielded by LASER. The exceptions are ObjNum and SentType for Polish (where the advantage of BERT MEAN is so small it might be insignificant), Relatedness for English (suggesting that a comparable USE model could beat LASER in the Polish version of the task as well) and WC (where SENT2VEC performs visibly better than all others, even though it is trained on a relatively small corpus).
An interesting observation is that among the pooled embeddings, the MEAN variants quite consistently outperform their MAX counterparts. Figure 2 visualises the results yielded by selected models in the particular tasks. The models shown are BERT MEAN (the best pooled model), SENT2VEC NS (trained on Paralela corpus) and LASER (best-performing apart from COMBO, pre-trained on massive multilingual data). The plots are very similar in shape, the only striking difference being the discrepancy in WC results, with LASER and SENT2VEC NS faring similarly (and better than BERT MEAN ) for English and SENT2VEC NS yielding visibly best results for Polish.
We also measure the correlations between results for Polish and English in two ways. First, for each embedding model we compare the results it yielded in all Polish tasks and all English tasks. Second, for each task type we compare the results obtained using all models in the Polish and English variant of the task. 28 The corresponding correlation coefficients are plotted in Figure 3.
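The two ways of correlating Polish and English results can be sketched over a models-by-tasks table of scores. The sketch assumes Pearson's r as the coefficient, which the text does not specify, and all model names, task names and scores below are invented placeholders, not results from Table 1.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's r between two equally long lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sqrt(sum((x - mx) ** 2 for x in xs)) *
                  sqrt(sum((y - my) ** 2 for y in ys)))

def per_model(results_pl, results_en):
    """For each model: correlate its scores across all tasks (PL vs EN)."""
    out = {}
    for model in results_pl:
        tasks = sorted(results_pl[model])
        out[model] = pearson([results_pl[model][t] for t in tasks],
                             [results_en[model][t] for t in tasks])
    return out

def per_task(results_pl, results_en):
    """For each task: correlate the scores of all models (PL vs EN)."""
    models = sorted(results_pl)
    tasks = sorted(next(iter(results_pl.values())))
    return {t: pearson([results_pl[m][t] for m in models],
                       [results_en[m][t] for m in models]) for t in tasks}

# Placeholder scores: {model: {task: score}} for each language.
PL = {"model_a": {"t1": 50.0, "t2": 70.0, "t3": 90.0},
      "model_b": {"t1": 60.0, "t2": 80.0, "t3": 85.0},
      "model_c": {"t1": 40.0, "t2": 65.0, "t3": 95.0}}
EN = {"model_a": {"t1": 55.0, "t2": 75.0, "t3": 95.0},
      "model_b": {"t1": 62.0, "t2": 78.0, "t3": 88.0},
      "model_c": {"t1": 45.0, "t2": 60.0, "t3": 99.0}}

print(per_model(PL, EN))
print(per_task(PL, EN))
```

The per-model view reads the table row by row and the per-task view column by column, which is why the two sets of coefficients answer different questions.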
All the per-model correlations are high, which strongly suggests that given embeddings encode a given property similarly well (or poorly) relative to other properties regardless of the language. In the case of per-task correlations, there are three tasks with visibly lower correlations: SentType and the two downstream tasks. Therefore, for these tasks, the relative performance of individual models differs more between languages. For the downstream tasks this might be partially due to the fact that their respective datasets were created entirely independently and are expected to differ more. As far as SentType is concerned, the accuracies obtained for this task are generally very high and most of them fit within a small range.

Related Work
Our study follows a research trend in exploring sentence embeddings by means of probing methods, initiated by Shi et al. (2016) and Adi et al. (2017), and continued by .
Investigating NMT systems, Shi et al. (2016) found that LSTM-based encoders can learn source-language syntax, storing different syntactic properties (e.g. voice, tense, top level constituents, part-of-speech tags) in different layers of NMT models. Adi et al. (2017) designed probing tasks for surface properties of sentences (i.e. sentence length, word content, and word order). Two types of sentence embeddings were tested: averaging of CBOW word embeddings and sentence representation output by an LSTM encoder.  carried out a series of large-scale experiments on understanding English sentence embeddings with human-validated upper bounds for all probing tasks. They designed 10 probing tasks capturing simple linguistic properties of sentences, tested various sentence encoding architectures (i.e. BiLSTM and gated convolutional network), and various training objectives (e.g. neural machine translation, autoencoding, SkipThought). Following the mentioned approaches, we examine how much linguistic information is retained in sentence embeddings using 9 similar probing tasks. However, Universal Dependency trees instead of constituent trees are the core of our probing tasks. Furthermore, our experiments are carried out on two contrasting languages, to verify the validity of the evaluation method proposed for English in an experimental scenario with another language. Ettinger et al. (2018) considered a very important aspect of sentence meaning: composition. They proposed a method of assessing compositional meaning content in sentence embeddings on the examples of semantic role and negation phenomena. This study has drawn our attention to the compositional dimension of our probing tasks.
Related works by Linzen et al. (2016) and Warstadt and Bowman (2019) proposed evaluation of sentence encoders (e.g. LSTM, transformers) in terms of their ability to learn grammatical information, e.g. to assess sentences as grammatically correct or not (i.e. acceptability judgments).
Finally, several studies were devoted to exploring morphosyntactic properties of sentence embeddings in neural machine translation systems (e.g. Shi et al., 2016).

Conclusion
We presented a methodology of empirical research on retention of linguistic information in sentence embeddings using probing and downstream tasks. In the probing-based scenario, a set of language-independent tests was designed and probing datasets were generated for two contrasting languages, English and Polish. The procedure of generating probing datasets is based on the Universal Dependency schema. It is thereby universal for all languages with a UD treebank on which a natural language pre-processing system can be trained. In the downstream-based scenario, the publicly available datasets for semantic relatedness and entailment were used.
We performed a series of probing and downstream experiments with three types of sentence embeddings in the SentEval environment, followed by a thorough analysis of the linguistic content of sentence embeddings. We found that the COMBO-based embeddings designed to convey morphosyntax encode linguistic information most accurately. Aside from COMBO embeddings, linguistic information is retained best in the recently proposed LASER sentence embeddings, provided by an encoder designed with a relatively simple BiLSTM architecture, but estimated on a very large amount of multilingual data. Further research is required to find out what the success of LASER embeddings stems from: the embedding size, the magnitude of training data, or perhaps the multitude of languages used.