How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation

Sentence encoders map sentences to real-valued vectors for use in downstream applications. To peek into these representations (e.g., to increase the interpretability of their results), probing tasks have been designed which query them for linguistic knowledge. However, designing probing tasks for lesser-resourced languages is tricky, because these often lack the large-scale annotated data or (high-quality) dependency parsers that probing task design in English takes as a prerequisite. To find out how to probe sentence embeddings in such cases, we investigate the sensitivity of probing task results to structural design choices, conducting the first such large-scale study. We show that design choices like the size of the annotated probing dataset and the type of classifier used for evaluation do (sometimes substantially) influence probing outcomes. We then probe embeddings in a multilingual setup with design choices that lie in a 'stable region', as we identify for English, and find that results on English do not transfer to other languages. Fairer and more comprehensive sentence-level probing evaluation should thus be carried out on multiple languages in the future.


Introduction
Extending the concept of word embeddings to the sentence level, sentence embeddings (a.k.a. sentence encoders) have become ubiquitous in NLP (Kiros et al., 2015; Conneau et al., 2017). In the context of recent efforts to open the black box of deep learning models and representations (Linzen et al., 2019), it has also become fashionable to probe sentence embeddings for the linguistic information signals they contain (Perone et al., 2018), as this may not be clear from their performances in downstream tasks. Such probes are linguistic micro tasks, like detecting the length of a sentence or its dependency tree depth, that have to be solved by a classifier using given representations.
Table 1: Schematic illustration of our concept of stability across two dimensions (classifier and training size). Here, three encoders, dubbed A, B, C, are ranked. The region of stability is given by those settings that support the majority ranking of encoders, which is (A, B, C).
The majority of approaches for probing sentence embeddings target English, but recently some works have also addressed other languages such as Polish, Russian, or Spanish in a multi- and cross-lingual setup (Krasnowska-Kieraś and Wróblewska, 2019; Ravishankar et al., 2019). Motivations for considering a multi-lingual analysis include knowing whether findings from English transfer to other languages and determining a universal set of probing tasks that suits multiple languages, e.g., with richer morphology and freer word order.
Our work is also inspired by probing sentence encoders in multiple (particularly low-resource) languages. We are especially interested in the formal structure of probing task design in this context. Namely, when designing probing tasks for low-resource languages, some questions arise naturally that are less critical in English. One of them is the size of training data for probing tasks, as this training data typically needs to be (automatically or manually) annotated, an inherent obstacle in low-resource settings. Thus, we first ask what training data size is required for obtaining reliable probing task results. This question is also relevant for English: on the one hand, Conneau et al. (2018) claim that training data for a probing task should be plentiful, as otherwise (highly parametrized) classifiers on top of representations may be unable to extract the relevant information signals; on the other hand, Hewitt and Liang (2019) note that a sufficiently powerful classifier with enough training data can in principle learn any task, without this necessarily allowing the conclusion that the representations adequately store the linguistic signal under scrutiny. Second, we ask how stable probing task results are across different classifiers (e.g., MLP vs. Naive Bayes). This question is closely related to the question about size, since different classifiers have different sensitivities to data size; deep models in particular are claimed to require more training data.
We evaluate the sensitivity of probing task results to the two outlined parameters (which are mere machine learning design choices that do not affect the linguistic content stored in the sentence representations under scrutiny) and then determine a 'region of stability' for English (en), where outcomes are predicted to be similar for the majority of parameter choices. Table 1 illustrates this. Using parameter choices within our region of stability, we turn to three lower-resource languages, viz.: Turkish (tr), Russian (ru), and Georgian (ka). tr is a Turkic language written in Latin script which makes exhaustive use of agglutination. ru is a Slavic language written in Cyrillic script, characterized by strong inflection and rich morphology. ka is a South-Caucasian language using its own script called Mkhedruli. It makes use of both agglutination and inflection. For these, our main research question is whether probing task results transfer from English to the other languages.
Overall, our research questions are:
• (i) How reliable are probing task results across machine learning design choices?
• (ii) Will encoder performances correlate across languages, even though the languages and their linguistic properties may differ?
• (iii) Will probing task performances correlate across languages?
• (iv) Will the correlation between probing and downstream tasks be the same across languages?
These questions are important because they indicate whether or not probing tasks (and their relation to downstream tasks) have to be re-evaluated in languages other than en. Our results strongly suggest that re-evaluation is required and that claims of superiority of sentence encoders on en data do not transfer to other languages.

Related work
The goal of this work is to probe for sentence-level linguistic knowledge encoded in sentence embeddings (Perone et al., 2018) in a multilingual setup which marginalizes out the effects of probing task design choices when comparing sentence representations.
Sentence embeddings have become central for representing texts beyond the word level, e.g., in small data scenarios, where it is difficult to induce good higher-level text representations from word embeddings (Subramanian et al., 2018), or for clustering or text retrieval applications (Reimers and Gurevych, 2019). To standardize the comparison of sentence embeddings, Conneau and Kiela (2018) proposed the SentEval framework for evaluating the quality of sentence embeddings on a range of downstream and 10 probing tasks.
Probing tasks are used to introspect embeddings for linguistic knowledge, by taking "probes" as dedicated syntactic or semantic micro tasks (Köhn, 2016). As opposed to an evaluation in downstream applications or benchmarks like GLUE (Wang et al., 2018), probing tasks target very specific linguistic knowledge which may otherwise be confounded in downstream applications. Since they are artificial tasks, they can also be better controlled to avoid dataset biases and artifacts. Probing is typically executed either on type/token (word) level (Tenney et al., 2019) or on sentence level (Adi et al., 2017). For sentence-level evaluation, SentEval thus far only includes en data. Each probing task in SentEval is balanced and has 100k train, 10k dev, and 10k test instances. The effects of these design choices are unclear, which is why our work addresses their influence systematically.
In the multilingual setting, Sahin et al. (2019) propose 15 token and type level probing tasks. Their probing task data is sourced from UniMorph 2.0 (Kirov et al., 2018), Universal Dependency treebanks (McCarthy et al., 2018) and Wikipedia word frequency lists. To deal with lower-resourced languages, they only use 10k samples per probing task/language pair (7k/2k/1k for train/dev/test) and exclude task/language pairs for which this amount cannot be generated. Their final experiments are carried out on five languages (Finnish, German, Spanish, ru, tr), for which enough training data is available. They find that for morphologically rich (agglutinative) languages, several probing tasks positively correlate with downstream applications. This finding is clearly tied to the fact that they tested on word level: probes are easier to solve in agglutinative languages, which encode more linguistic information in a single word. Our work also investigates the correlation between probing and downstream performance, but we do so on sentence level.
On sentence level, Ravishankar et al. (2019) train an InferSent-like encoder on en and map this encoder to four languages (ru, French, German, Spanish) using parallel data. Subsequently, they probe the encoders on the probing tasks proposed by Conneau et al. (2018) on Wikipedia data for each language. They use the same size of probing task data as in SentEval, i.e., 100k/10k/10k for train/dev/test. Their interest is in whether probing task results are higher or lower compared to en scores. They find particularly the ru probing scores to be low, which they speculate to be an artifact of cross-lingual word embedding induction and the language distance of ru to en. In contrast to us, their focus is particularly on the effect of transferring sentence representations from en to other languages. The problem of such an analysis is that results may be affected by the nature of the cross-lingual mapping techniques.
Krasnowska-Kieraś and Wróblewska (2019) probe sentence encoders in en and Polish (pl). They use tasks defined in Conneau et al. (2018) but slightly modify them (e.g., replacing dependency with constituency trees), reject some tasks (BigramShift, as word order may play a minor role in pl), and add two new tasks (Voice and Sentence Type). Since pl data is less abundant, they shrink the size of the pl datasets to 75k/7.5k/7.5k for train/dev/test and, for consistency, do the same for en. They extract probing datasets from an en-pl parallel corpus using COMBO for dependency parsing (Rybak and Wróblewska, 2018). They find that en and pl probing results mostly agree, i.e., encoders store the same linguistic information across the two languages.

Approach
In the absence of ground truth, our main interest is in a 'stable' structural setup for probing task design, with the end goal of applying this design to multilingual probing analyses (keeping their restrictions, e.g., small data sizes, in mind). To this end, we consider a two-dimensional space $\mathcal{X}$ comprising probing data size and classifier choice for probing tasks. For a selected set of points $p_0, p_1, \ldots$ in $\mathcal{X}$, we evaluate all our encoders on $p_i$, and determine the 'outcomes' $O_i$ (e.g., ranking) of the encoders at $p_i$. We consider a setup $p_i$ as stable if outcome $O_i$ is shared by a majority of other settings $p_j$. This can be considered a region of agreement, similar to inter-annotator agreement (Artstein and Poesio, 2008). In other words, we identify 'ideal' test conditions by minimizing the influence of parameters $p_i$ on the outcome $O_i$. Below, we will approximate these intuitions using correlation.
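The majority notion described above can be illustrated with a minimal sketch. The settings and rankings below are made up for illustration, and exact-match voting is a simplification: the analysis later in the paper approximates agreement via correlation instead.

```python
from collections import Counter

def stable_settings(outcomes):
    """Return the majority outcome and the settings that support it.

    `outcomes` maps a setting, e.g. (classifier, size), to an encoder ranking.
    """
    counts = Counter(outcomes.values())
    majority, _ = counts.most_common(1)[0]
    stable = sorted(s for s, o in outcomes.items() if o == majority)
    return majority, stable

# Hypothetical rankings of three encoders A, B, C (illustration only).
outcomes = {
    ("LR", "high"): ("A", "B", "C"),
    ("LR", "low"): ("A", "B", "C"),
    ("NB", "high"): ("C", "B", "A"),
    ("NB", "low"): ("B", "A", "C"),
    ("RF", "high"): ("A", "B", "C"),
}
majority, stable = stable_settings(outcomes)
```

Here the majority ranking is (A, B, C), and the three settings that produce it form the region of agreement.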

Embeddings
We consider two types of sentence encoders: non-parametric methods, which combine word embeddings in elementary ways, without training; and parametric methods, which tune parameters on top of word embeddings. As non-parametric methods, we consider: (i) average word embeddings as a popular baseline; (ii) the concatenation of average, min and max pooling (pmeans) (Rücklé et al., 2018); and (iii) Random LSTMs (Conneau et al., 2017; Wieting and Kiela, 2019), which feed word embeddings to randomly initialized LSTMs, then apply a pooling operation across time-steps. As parametric methods, we consider: InferSent (Conneau et al., 2017), which induces a sentence representation by learning a semantic entailment relationship between two sentences; QuickThought (Logeswaran and Lee, 2018), as a supervised improvement over the popular SkipThought model (Kiros et al., 2015); LASER (Artetxe and Schwenk, 2019), derived from massively multilingual machine translation models; and BERT base (Devlin et al., 2019), where we average token embeddings of the last layer for a sentence representation. Dimensionalities of encoders are listed in the Appendix.
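As a minimal sketch of the non-parametric pooling encoders, the following concatenates average, min and max pooling over a sentence's word vectors. The function name and the toy vectors are illustrative assumptions, not the implementation used in the experiments.

```python
def pool_sentence(word_vectors):
    """Concatenate average, min and max pooling over a sentence's word vectors."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    n = len(word_vectors)
    columns = list(zip(*word_vectors))  # one tuple per embedding dimension
    avg = [sum(col) / n for col in columns]
    lo = [min(col) for col in columns]
    hi = [max(col) for col in columns]
    return avg + lo + hi

# Toy 2-dimensional "word embeddings" for a three-word sentence (made up).
sentence = [[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]]
embedding = pool_sentence(sentence)
```

The resulting sentence vector has three times the dimensionality of the word embeddings; using only the `avg` part gives the average-embedding baseline.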

Probing Tasks
Following Conneau et al. (2018), we consider the following probing tasks: BigramShift (en, tr, ru, ka), TreeDepth (en), Length (en, tr, ru, ka), Subject Number (en, tr, ru), WordContent (en, tr, ru, ka), and TopConstituents (en). We choose Length, BigramShift and WordContent because they are unsupervised tasks that require no labeled data and thus can be easily implemented across different languages; they also represent three different types of elementary probing tasks: surface, syntactic and semantic/lexical. We further include Subject Number for all languages where the required annotation is available, because number marking is extremely common across languages and comparatively easy to identify. We adopt Voice (en, tr, ru, ka) from Krasnowska-Kieraś and Wróblewska (2019). For en, we additionally evaluate on TreeDepth and TopConstituents as hard syntactic tasks. We add two tasks not present in the canon of probing tasks listed in SentEval: Subject-Verb-Agreement (SV-Agree) (en, tr, ru, ka) and Subject-Verb-Distance (SV-Dist) (en, tr, ru).
We probe representations for these properties because we suspect that agreement between subject and verb is a difficult task which requires inferring a relationship between pairs of words which may stand in a long-distance relationship (Gulordava et al., 2018). Moreover, we assume this task to be particularly hard in morphologically rich languages with free word order; thus it could be a good predictor for performance in downstream tasks.
To implement the probing tasks, for en, we use the probing task datasets defined in Conneau and Kiela (2018), and we apply SpaCy to sentences extracted from Wikipedia for our newly added probing tasks Voice, SV-Dist and SV-Agree. For tr, ru, and ka, we do not rely on dependency parsers because of quality issues and unavailability for ka. Instead, we use Universal Dependencies (UD) (Nivre et al., 2016) and manual rules for sentences extracted from Wikipedia. In particular, for SV-Agree, we create a list of frequently occurring verbs together with their corresponding present tense conjugations for each individual language. We check each individual sentence from Wikipedia for the presence of a verb form in the list. If no such verb form is present, we exclude the sentence from consideration. Otherwise, we randomly replace the verb form by a different conjugation in 50% of the cases. For SV-Dist, we use the information from UD to determine the dependency distance between the main verb and the subject. Instead of predicting the exact distances, we predict binned classes: [1], [2,4], [5,7], [8,12], [13,∞). This task could not be implemented for ka, due to missing dependency information in the UD. We omit Subject Number for ka for the same reason.
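The SV-Agree construction and the SV-Dist binning described above might be sketched as follows. The conjugation table is a tiny hypothetical stand-in for the per-language verb lists, which are not reproduced in the text, and the function names are illustrative.

```python
import random

# Hypothetical conjugation table; the actual verb lists are not given in the text.
CONJUGATIONS = {
    "is": ["is", "are"], "are": ["is", "are"],
    "has": ["has", "have"], "have": ["has", "have"],
}

def make_sv_agree_example(tokens, rng):
    """Return (tokens, label), or None if no listed verb form occurs.

    label is 1 if a verb form was replaced by a different conjugation, else 0.
    """
    positions = [i for i, tok in enumerate(tokens) if tok in CONJUGATIONS]
    if not positions:
        return None  # the sentence is excluded from consideration
    if rng.random() >= 0.5:
        return list(tokens), 0  # keep the sentence unchanged in half of the cases
    i = rng.choice(positions)
    alternatives = [f for f in CONJUGATIONS[tokens[i]] if f != tokens[i]]
    corrupted = list(tokens)
    corrupted[i] = rng.choice(alternatives)
    return corrupted, 1

def sv_dist_bin(distance):
    """Map a subject-verb dependency distance to the binned classes from the text."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    for label, upper in (("[1]", 1), ("[2,4]", 4), ("[5,7]", 7), ("[8,12]", 12)):
        if distance <= upper:
            return label
    return "[13,inf)"
```

For example, `make_sv_agree_example("the cat is here".split(), random.Random(1))` yields either the original sentence with label 0 or the variant with "are" and label 1, depending on the random draw.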
An overview of the probing tasks, along with descriptions and examples, is given in Table 2.

Downstream Tasks
In addition to probing tasks, we test the embeddings in downstream applications. Our focus is on a diverse set of high-level sentence classification tasks. We choose Argument Mining (AM) (Stab et al., 2018), sentiment analysis and TREC question answering. Required training data for languages other than en has been machine translated using Google Translate for AM and TREC. Sentiment analysis uses original datasets with 2 to 3 sentiment classes. Details of the training procedure and the tasks themselves can be found in the Appendix. Statistics are reported in Table 6.
Experimental Setup. To the SentEval toolkit (Conneau and Kiela, 2018), which addresses both probing and downstream tasks and offers Logistic Regression (LR) and MLP classifiers on top of representations, we added implementations of Random Forest (RF) and Naive Bayes (NB) from scikit-learn as other popular but 'simple' classifiers. SentEval defines specific model validation techniques for each task. For all probing tasks and TREC, we use predefined splits. For AM and sentiment analysis, we use 10-fold inner cross-validation. Following SentEval, we tune the size of the hidden layer in $\{50, 100, 200\}$, dropout in $\{0.0, 0.1, 0.2\}$ and $L_2$ regularization in $\{10^{-5}, 10^{-1}\}$ when training an MLP. For RF, we tune maximum tree depth in $\{10, 50, 100, \infty\}$. For LR, we tune $L_2$ regularization in $\{10^{-5}, 10^{-1}\}$. We did not tune any hyperparameters for NB.
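To make the size of the search concrete, the grids listed above can be enumerated as in the sketch below. The dictionary keys are illustrative names rather than SentEval or scikit-learn parameter names, and `None` is used here to stand for unlimited tree depth.

```python
from itertools import product

# Grids as listed in the text; key names are illustrative assumptions.
GRIDS = {
    "MLP": {"hidden": [50, 100, 200], "dropout": [0.0, 0.1, 0.2], "l2": [1e-5, 1e-1]},
    "RF": {"max_depth": [10, 50, 100, None]},  # None stands for unlimited depth
    "LR": {"l2": [1e-5, 1e-1]},
    "NB": {},  # no hyperparameters tuned
}

def configurations(grid):
    """Enumerate every combination of values in a hyperparameter grid."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

counts = {name: len(configurations(grid)) for name, grid in GRIDS.items()}
```

Under these grids, an MLP is trained in 18 configurations per task and setting, RF in 4, LR in 2, and NB in a single default configuration.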

Probing task design in en
In our design, we consider (a) four well-known and popular classifiers (LR, MLP, NB, RF) on top of sentence representations, and (b) six different training data sizes (between 2k and 100k). We perform an exhaustive grid search over size and classifier design, considering all combinations.
Size. For each classifier, we obtain results (on 10k test instances) when varying the training data size $s$ over 2k, 5k, 10k, 20k, 30k, 100k. Downsampling was implemented by selecting, for each class, the same percentage of samples that appears in the full dataset. We then report average Spearman/Pearson correlations $\rho$ between any two training set sizes $s$ and $t$ over all 9 probing tasks:
$$\mathrm{sim}_c(s, t) = \frac{1}{n_p} \sum_{i=1}^{n_p} \rho\big(\mathbf{c}_i(s), \mathbf{c}_i(t)\big) \qquad (1)$$
where $n_p$ is the number of probing tasks ($n_p = 9$ for en), and $\mathbf{c}_i(s)$ is the vector that holds scores for each of the 7 sentence encoders in our experiments, given training size $s$, for probing task $i$ and classifier $c$. We set correlations to zero if the p-value exceeds 0.2. In Table 3, we then report the minimum and average scores $\min_{(s,t)} \mathrm{sim}_c(s, t)$ and $\frac{1}{N} \sum_{(s,t)} \mathrm{sim}_c(s, t)$, respectively, per classifier $c$. We observe that the minimum values are small to moderate correlations between 0.2 (for NB) and 0.6 (for RF). The average correlations are moderate to high, ranging from 0.6 (for NB) to above 0.8 (for the others).
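The class-preserving downsampling mentioned above can be sketched as follows. This is a minimal illustration under the assumption that per-class counts are simply rounded; the function name and signature are hypothetical.

```python
import random
from collections import defaultdict

def stratified_downsample(examples, labels, size, seed=0):
    """Sample about `size` items, keeping each class's share of the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for label in sorted(by_class):
        members = by_class[label]
        # Same percentage per class as in the full dataset (rounded).
        k = round(size * len(members) / len(labels))
        chosen.extend(rng.sample(members, k))
    chosen.sort()
    return [examples[i] for i in chosen], [labels[i] for i in chosen]
```

Because of rounding, the returned sample can differ from `size` by a few items when class proportions do not divide evenly.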
In Figure 1 (top), we show all the values $\mathrm{sim}_c(s, t)$ for $c$ = LR, NB. We observe that, indeed, LR has high correlations between training sizes, especially starting from 10k training data points. The corresponding correlations of NB are much lower in comparison.
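A minimal sketch of how the averaged correlation between two training sizes, and its minimum and average over size pairs, might be computed for one classifier. Two simplifications are assumptions of this sketch: the p-value filter is omitted (a statistics library such as SciPy would be needed to obtain p-values), and the minimum and average are taken over pairs of distinct sizes.

```python
from itertools import combinations

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def spearman(x, y):
    """Spearman correlation as the Pearson correlation of ranks."""
    return pearson(ranks(x), ranks(y))

def sim(scores, s, t, corr=spearman):
    """Average correlation over probing tasks between training sizes s and t.

    scores[task][size] is the list of encoder scores for one fixed classifier.
    """
    tasks = list(scores)
    return sum(corr(scores[p][s], scores[p][t]) for p in tasks) / len(tasks)

def min_and_avg(scores, sizes, corr=spearman):
    """Minimum and average of sim over all pairs of distinct sizes."""
    values = [sim(scores, s, t, corr) for s, t in combinations(sizes, 2)]
    return min(values), sum(values) / len(values)

# Made-up scores for three encoders on two tasks (illustration only).
scores = {
    "Length": {"2k": [0.5, 0.6, 0.7], "10k": [0.6, 0.7, 0.8]},
    "BigramShift": {"2k": [0.5, 0.6, 0.7], "10k": [0.8, 0.7, 0.6]},
}
value = sim(scores, "2k", "10k")
```

In the toy data, the encoder ranking is preserved on one task and reversed on the other, so the two task-level correlations cancel out in the average.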
In Figure 2, we plot the stability of each training data size $s$ for all of our classifiers $c$, i.e., the average $\frac{1}{N} \sum_{t} \mathrm{sim}_c(s, t)$ over training sizes $t$, where $N$ is a normalizer ($N = 6$ in our case).
Classifier. Next, we add the classifier choice as a second dimension: we examine whether correlations (Spearman/Pearson) between vectors $\mathbf{c}$ (holding scores for each of 7 sentence encoders for a classifier $c$) and $\mathbf{d}$ (holding the same scores for a classifier $d$) are similar in the same sense as in Eq. (1):
$$\mathrm{sim}_{c,d}(s, t) = \frac{1}{n_p} \sum_{i=1}^{n_p} \rho\big(\mathbf{c}_i(s), \mathbf{d}_i(t)\big)$$
Again, we average across all probing tasks, and set correlation values to zero if the p-value exceeds 0.2.
In Table 5, we give min/avg values across data set sizes in this setup. We observe that LR and MLP most strongly agree. They have acceptable average agreement with RF, but low agreement with NB, on average, and, in the worst cases, even negative correlations with NB.
In Figure 1 (bottom), we illustrate correlations between three classifiers, comparing LR with NB and RF across all possible training set sizes. We observe that as the training data set sizes for RF and NB become larger, these two classifiers agree more strongly with LR. RF starts to have acceptable agreement with LR from 10k training instances onwards, while NB has acceptable agreement with LR only in the case of 100k training instances.
We now operationalize our intuition of 'region of stability' outlined in Table 1. For each of nine probing tasks, we compute the following. Let $r_j$ be a specific ranking of encoders, given by a fixed permutation $\zeta$ of the encoders. Let $r_{(c,s)}$ be the ranking of encoders according to the classifier, size combination $(c, s)$. We compute the Spearman correlation $\tau_{(c,s,j)}$ between $r_{(c,s)}$ and $r_j$. For each possible ranking $r_j$ of our 7 encoders, we then determine its support as the average over all values $\tau_{(c,s,j)}$ and then find the ranking $r_{\max}$ with most support according to this definition. Finally, we assign a score to the combination $(c, s)$ not only when $r_{(c,s)}$ equals $r_{\max}$, but also when $r_{(c,s)}$ is close to $r_{\max}$: we again use the Spearman correlation between $r_{(c,s)}$ and $r_{\max}$ as a measure of closeness (we require a closeness of at least 0.75). The resulting score for $(c, s)$ is denoted $\mu$.
Overall, we answer our first research question (i) as follows: probing task results can be rather unreliable and may vary greatly among machine learning parameter choices. The standard training set size of SentEval, 100k, appears to be less stable. As region of stability, we postulate especially the setting with 10k training instances for the LR classifier.
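The support computation described above can be sketched for a single probing task as follows. The function names are illustrative, the example rankings are made up, and the sketch stops at the thresholded closeness to the best-supported ranking; it does not reproduce the final aggregated score.

```python
from itertools import permutations

def spearman_rankings(r1, r2):
    """Spearman correlation between two tie-free rankings (encoders, best first)."""
    n = len(r1)
    pos2 = {enc: i for i, enc in enumerate(r2)}
    d2 = sum((i - pos2[enc]) ** 2 for i, enc in enumerate(r1))
    return 1 - 6 * d2 / (n * (n * n - 1))

def most_supported_ranking(observed):
    """Find the candidate ranking with the highest average correlation.

    `observed` maps (classifier, size) to the encoder ranking at that setting.
    """
    encoders = sorted(next(iter(observed.values())))
    best, best_support = None, float("-inf")
    for candidate in permutations(encoders):
        support = sum(
            spearman_rankings(r, candidate) for r in observed.values()
        ) / len(observed)
        if support > best_support:
            best, best_support = candidate, support
    return best, best_support

def closeness(observed, threshold=0.75):
    """Correlation of each setting with r_max, and whether it meets the threshold."""
    r_max, _ = most_supported_ranking(observed)
    result = {}
    for setting, ranking in observed.items():
        rho = spearman_rankings(ranking, r_max)
        result[setting] = (rho, rho >= threshold)
    return r_max, result

# Made-up rankings of three encoders for one probing task (illustration only).
observed = {
    ("LR", "10k"): ("A", "B", "C"),
    ("MLP", "10k"): ("A", "B", "C"),
    ("LR", "100k"): ("A", "B", "C"),
    ("NB", "2k"): ("C", "B", "A"),
    ("RF", "2k"): ("A", "C", "B"),
}
r_max, close = closeness(observed)
```

With 7 encoders there are 5,040 candidate rankings, so exhaustive enumeration remains cheap. Ties in support are broken here by enumeration order, which is an arbitrary choice of this sketch.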

Multi-lingual results
Experimental Setup. Given our results for en, we choose the LR classifier with a size of roughly 10k instances overall. Table 6 provides more details about the datasets. In line with SentEval (and partly supported by our results in the appendix), we aim for label distributions that are as balanced as possible.
Because of the small test sizes, we use inner 5-fold CV for all tasks except for SubjNumber, where we use pre-defined train/dev/test splits as in Conneau et al. (2018) to avoid leaking lexical information from train to test splits.
We obtain average and pmeans embeddings through pooling over pre-trained FastText embeddings (Grave et al., 2018). The same embeddings are used for the random LSTM. For average BERT, we use the base-multilingual-cased model. We machine translate the AllNLI corpus into tr, ru and ka to obtain training data for InferSent. The models are then trained using default hyperparameters and pre-trained FastText embeddings. Compared to en, we modify the WC probing task in the multilingual setting to only predict 30 mid-frequency words instead of 1000. This is more appropriate for our much smaller data sizes.

Probing tasks
Results are shown in Figures 3 and 4.
(ii) Will encoder performances correlate across languages? For each encoder e, we correlate performances of e between en and the other languages on 5 (for ka) and 7 (for tr, ru) probing tasks (using 10k dataset size and LR for all involved languages, including en). In Figure 3, we observe that correlations between en and other languages are generally either zero or weakly positive. Only average embeddings have 2 positive correlation scores across the 3 language combinations with en. Among low-resource languages, there are no negative correlations and fewer zero correlations. All of the low-resource languages correlate more among themselves than with en. This makes sense from a linguistic point of view, since en is clearly the outlier in our sample given its minimal inflection and fixed word order. Thus, the answer to this research question is that our results support the view that transfer is better for typologically similar languages.
(iii) Will probing task performances correlate across languages? For each probing task π, we report Pearson correlations, between all language pairs, of vectors holding scores of 7 encoders on π. Figure 4 shows the results. The pattern is overall similar to that for (ii) in that there are many zero correlations between en and the other languages. tr and ka also have negative correlations with en for selected tasks. Only BigramShift has positive correlations throughout. Low-resource languages correlate better among themselves than with en. Our conclusions are the same as for question (ii).
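Questions (ii) and (iii) correlate the same score table along two different axes, which the following sketch illustrates. The scores are made up, the function names are illustrative, and significance filtering is omitted.

```python
def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def shared_tasks(scores, lang_a, lang_b):
    """Probing tasks available in both languages, in a fixed order."""
    return sorted(set(scores[lang_a]) & set(scores[lang_b]))

def task_correlations(scores, lang_a, lang_b):
    """Question (iii): per task, correlate encoder-score vectors across languages."""
    return {
        task: pearson(scores[lang_a][task], scores[lang_b][task])
        for task in shared_tasks(scores, lang_a, lang_b)
    }

def encoder_correlations(scores, lang_a, lang_b):
    """Question (ii): per encoder, correlate its task scores across languages."""
    tasks = shared_tasks(scores, lang_a, lang_b)
    n_enc = len(scores[lang_a][tasks[0]])
    return [
        pearson(
            [scores[lang_a][t][e] for t in tasks],
            [scores[lang_b][t][e] for t in tasks],
        )
        for e in range(n_enc)
    ]

# scores[language][task] = scores of three encoders (made-up numbers).
scores = {
    "en": {
        "Length": [0.9, 0.8, 0.7],
        "BigramShift": [0.5, 0.6, 0.7],
        "TreeDepth": [0.3, 0.3, 0.4],
    },
    "tr": {"Length": [0.6, 0.7, 0.8], "BigramShift": [0.55, 0.65, 0.75]},
}
per_task = task_correlations(scores, "en", "tr")
per_encoder = encoder_correlations(scores, "en", "tr")
```

Tasks that exist in only one of the two languages (TreeDepth in the toy data) are dropped before correlating, mirroring the fact that fewer tasks are available for some languages.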
Note that our findings contrast with Krasnowska-Kieraś and Wróblewska (2019), who report that probing results for en and pl are mostly the same. Our results are more intuitively plausible: e.g., a good encoder should store linguistic information relevant for a particular language.

Downstream Tasks
Results are shown in Figure 5.
(iv) Will the correlation between probing and downstream tasks be the same across languages? For each of our languages, we correlate probing and downstream tasks. The results show that the answer to research question (iv) is clearly negative. In particular, en behaves differently from the other languages, while, again, ru and tr behave more similarly. ka is the only language with negative correlations for Length, en the only one with positive scores. For the sentiment task, Word Content correlates positively for all languages except ka. The AM task correlates only in en and ka, but with different probing tasks. SV-Agree correlates positively with TREC and sentiment in all languages but en. Predicting the performances of embeddings in downstream tasks via probing
Table 6: Probing and downstream tasks. We report the balance between the class with the most and the least samples. For downstream tasks, the evaluation measure is given in brackets.

Concluding Remarks
We investigated formal aspects of probing task design, including probing data size and classifier choice, in order to determine structural conditions for multilingual (low-resource) probing. We showed that probing task results are at best partly stable even for en and that the rankings of encoders vary with design choices. However, we identified a partial region of stability where results are supported by a majority of settings, even though this must not be mistaken for a region of 'truth'. This region was identified in en, which has most resources available. Our further findings then showed that probing and downstream results do not transfer well from English to our other languages, which in turn challenges our identified region of stability. Overall, our results have partly negative implications for current practices of probing task design, as they may mean that probing tasks are to some degree unreliable as tools for introspecting linguistic information contained in sentence encoders. Their relation to downstream tasks is also unclear, as our multilingual results show. This is supported by recent findings giving contradictory claims regarding, e.g., the importance of the Word Content probing task for downstream performances (Eger et al., 2019; Wang and Kuo, 2020; Perone et al., 2018; Conneau et al., 2018). An important aspect to keep in mind in this context is that results may heavily depend on the selection of encoders involved in the analysis; in our case, we selected a number of recently proposed state-of-the-art models in conjunction with weaker baseline models, for a diverse collection of encoders. Another clear limitation of our approach is the small number of encoders we examined; nonetheless, many of our results are significant (at relatively large p-values).
To the degree that the supervised probing tasks examined here will remain important tools for interpretation of sentence encoders in the future, our results indicate that multilingual probing is important for a fairer and more comprehensive comparison of encoders with respect to the linguistic information signals that they store.
con-, or non-arguments for eight different topics. A sentence only qualifies as pro or con argument when it both expresses a stance towards the topic and gives a reason for that stance. The classifier input is a concatenation of the sentence embedding and the topic encoding. In total, there are about 25,000 sentences.
Sentiment Analysis. As opposed to AM, sentiment analysis only determines the opinion flavor of a statement. Since sentiment analysis is a very established NLP task, we did not machine translate en training data, but used original data for en, ru and tr and created a novel dataset for ka. For en, we use the US Airline Twitter Sentiment dataset, consisting of 14,148 tweets labeled in three sentiment classes. For tr, we took the Turkish Twitter Sentiment Dataset with 6,172 examples and three classes. For ru, we used the Russian Twitter Corpus (RuTweetCorp), which we reduced to 30,000 examples in two classes. For ka, we followed the approach by Choudhary et al. (2018) and crawled sentiment-flavored tweets in a distant supervision manner. Emojis were used as distant signals to indicate sentiment on preselected tweets from the Twitter API. After post-processing, we were able to collect 11,513 Georgian tweets in three sentiment classes. The dataset will be made publicly available, including more details on the creation process.
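The idea of using emojis as distant sentiment signals might look as follows in a minimal sketch. The emoji sets, the label names and the handling of conflicting signals are assumptions made for illustration; the sketch covers only two polarities and does not model how the third sentiment class of the ka dataset was obtained.

```python
# Hypothetical emoji inventories; the actual signals used are not listed in the text.
POSITIVE = {"😀", "😊", "😍"}
NEGATIVE = {"😠", "😢", "😡"}

def distant_label(tweet):
    """Return (text_without_emojis, label), or None if the signal is absent or mixed."""
    has_pos = any(e in tweet for e in POSITIVE)
    has_neg = any(e in tweet for e in NEGATIVE)
    if has_pos == has_neg:
        return None  # no emoji signal, or conflicting signals
    label = "positive" if has_pos else "negative"
    text = tweet
    for e in POSITIVE | NEGATIVE:
        text = text.replace(e, "")  # strip the signal so it cannot leak into the input
    return " ".join(text.split()), label
```

Removing the emojis from the text keeps a classifier from simply memorizing the labeling signal.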
TREC Question Type Detection. Question type detection is an important part of Question-Answering systems. The Text Retrieval Conference (TREC) dataset consists of a set of questions labeled with their respective question types (six labels including, e.g., "description" or "location") and is part of the SentEval benchmark (Conneau and Kiela, 2018). We used the data as provided in SentEval, yielding 5,952 instances.

Quality of Machine Translated Data
We automatically translated the input data for the AM and TREC downstream tasks. To estimate the quality of the machine translated data, we measured the performance of the service used to translate the data with the help of the JW300 corpus (Agić and Vulić, 2019; Tiedemann, 2012). For each of the language pairs en-ka, en-tr, and en-ru, we translated the first 10,000 sentences of the respective bitext files from JW300 and measured their quality in terms of BLEU, METEOR and MoverScore (Zhao et al., 2019). Results are summarized in Table 7. They show that, with the exception of en-ka, all language pairs have high-quality translations. We thus expect the influence of errors of the machine translated data to be minimal in tr and ru. For ka, this is not necessarily the case.

A.3 Class Imbalance
In addition to the classifier type and size, we also tested the influence of the class (im)balance of the training data. In particular, for the four binary probing tasks BigramShift, SubjNumber, SV-Agree, and Voice, we examine the effect of imbalancing with ratios of 1:5 and 1:10. We use LR with sizes of 10k, 20k, and 30k training instances and correlate the results for imbalanced datasets with the standard balanced datasets. We find that (i) for two tasks (BigramShift, SV-Agree) there is typically high correlation (0.6-0.8) between the balanced and imbalanced setting, while for the other two tasks the correlation is typically zero; (ii) correlation to the setting 1:1 (slightly) diminishes as we increase the class imbalance from 1:5 to 1:10. Thus, the scenarios 1:5 and 1:10 do not strongly correlate with 1:1 (as used in all our other experiments).
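One simple way to derive an imbalanced variant of a balanced binary dataset is sketched below. It is an assumption of this sketch that the minority class is subsampled while the majority class is kept whole, which changes the total size; the experiments above instead use fixed training sizes, so the actual procedure may differ.

```python
import random

def imbalance(examples, labels, minority_label, ratio, seed=0):
    """Subsample one class so that minority:majority is about 1:ratio."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    keep = min(len(minority), len(majority) // ratio)
    chosen = sorted(majority + rng.sample(minority, keep))
    return [examples[i] for i in chosen], [labels[i] for i in chosen]
```

Calling `imbalance(x, y, minority_label=1, ratio=5)` and `ratio=10` would give the 1:5 and 1:10 scenarios for a task with labels 0 and 1.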

Figure 1: Top: Average correlations $\mathrm{sim}_c(s, t)$ for LR (left) and NB (right). Bottom: Average correlations $\mathrm{sim}_{c,d}(s, t)$ for $c$ = LR and $d$ = RF (left) and $c$ = LR and $d$ = NB (right).

Figure 3: Pearson correlations across languages for different encoders.

Figure 4: Pearson correlations across languages for different probing tasks.

Figure 5: Pearson correlation among probing task and downstream performance for all languages.

Table 2: Probing tasks, their description and illustration. Top tasks are defined as in SentEval.

| Task | Description | Example |
| --- | --- | --- |
| Bigram Shift | Whether two words in a sentence are inverted | This is my Eve Christmas. → True |
| Tree Depth | Longest path from root to leaf in constituent tree | "One hand here , one hand there , that 's it" → 5 |
| Length | Number of tokens | I like cats → 1-4 words |
| Subject Number | Whether the subject is in singular or plural | They work together → Plural |
| Word Content | Which mid-frequency word a sentence contains | Everybody should step back → everybody |
| Top Constituents | Classification task where classes are given by the 19 most common top constituent sequences in the corpus | Did he buy anything from Troy → VDP NP VP |

Table 4 shows (classifier, size) combinations with the highest $\mu$ scores. LR and MLP are at the top, along with RF in the setting of 100k training data size. LR with size 10k is most stable overall, but the distance to the other top settings is small. Least stable (not shown) is NB.

Table 5: Min/Avg values $\mathrm{sim}_{c,d}(s, t)$ across $(s, t)$ (using Pearson) between classifiers $c$ and $d$.

Table 7: Quality of the machine translation service used to translate training data for downstream tasks on reference datasets.

A.2 Sentence Encoder Dimensions
Table 8 shows the full list of encoders used in our study and their dimensionalities.