Intrinsic Probing through Dimension Selection

Most modern NLP systems make use of pre-trained contextual representations that attain astonishingly high performance on a variety of tasks. Such high performance should not be possible unless some form of linguistic structure inheres in these representations, and a wealth of research has sprung up on probing for it. In this paper, we draw a distinction between intrinsic probing, which examines how linguistic information is structured within a representation, and the extrinsic probing popular in prior work, which only argues for the presence of such information by showing that it can be successfully extracted. To enable intrinsic probing, we propose a novel framework based on a decomposable multivariate Gaussian probe that allows us to determine whether the linguistic information in word embeddings is dispersed or focal. We then probe fastText and BERT for various morphosyntactic attributes across 36 languages. We find that most attributes are reliably encoded by only a few neurons, with fastText concentrating its linguistic structure more than BERT.


Introduction
Natural language processing (NLP) is enamored of contextual word representations, and for good reason! Contextual word embedders, e.g. BERT (Devlin et al., 2019) and ELMo (Peters et al., 2018), have bolstered NLP model performance on myriad tasks, such as syntactic parsing (Kitaev et al., 2019), coreference resolution (Joshi et al., 2019), morphological tagging (Kondratyuk, 2019), and text generation (Zellers et al., 2019). Given the large empirical gains observed when they are employed, it is all but certain that word representations derived from neural networks encode some continuous analogue of linguistic structures. (Code and data are available at https://github.com/rycolab/intrinsic-probing.) Exactly what these representations encode about linguistic structure, however, remains little understood. Researchers have studied this question by attributing function to specific network cells with visualization methods (Karpathy et al., 2015; Li et al., 2016) and by probing (Alain and Bengio, 2017; Belinkov and Glass, 2019), which seeks to extract structure from the representations. Recent work has probed various representations for correlates of morphological (Belinkov et al., 2017; Giulianelli et al., 2018), syntactic (Hupkes et al., 2018; Zhang and Bowman, 2018; Hewitt and Manning, 2019; Lin et al., 2019), and semantic (Kim et al., 2019) structure.
Most current probing efforts focus on what we term extrinsic probing, where the goal is to determine whether the posited linguistic structure is predictable from the learned representation. Generally, extrinsic probing works argue for the presence of linguistic structure by showing that it is extractable from the representations using a machine learning model. In contrast, we focus on intrinsic probing, whose goals are a proper superset of the goals of extrinsic probing. In intrinsic probing, one seeks to determine not only whether a signature of linguistic structure can be found, but also how it is encoded in the representations. In short, we aim to discover which particular "neurons" (a.k.a. dimensions) in the representations correlate with a given linguistic structure. Intrinsic probing also has ancillary benefits that extrinsic probing lacks; it can facilitate manual analyses of representations and potentially yield a more nuanced view of the information encoded by them.
The technical portion of our paper focuses on developing a novel framework for intrinsic probing: we scan a word representation for sets of dimensions, or neurons, that correlate with target linguistic properties. We show that when intrinsically probing high-dimensional representations, the present probing paradigm is insufficient (§2). Current probes are too slow to be used under our framework, which invariably leads to low-resolution scans that can only look at one or a few neurons at a time. Instead, we introduce decomposable probes, which can be trained once on the whole representation and henceforth be used to scan any selection of neurons. To that end, we describe one such probe that leverages the multivariate Gaussian distribution's inherent decomposability, and evaluate its performance on a large-scale, multilingual, morphosyntactic probing task (§3).
We experiment on 36 languages (see App. F for a list) from the Universal Dependencies treebanks (Nivre et al., 2017). We find that all the morphosyntactic features we considered are encoded by a relatively small selection of neurons. In some cases, very few neurons are needed; for instance, for multilingual BERT English representations, we see that, with two neurons, we can largely separate past and present tense in Fig. 1. In this, our work is closest to Lakretz et al. (2019), except that we extend the investigation beyond individual neurons, a move which is only made tractable by decomposable probing. We also provide analyses on morphological features beyond number and tense. Across all languages, 35 out of 768 neurons on average suffice to reach a reasonable amount of encoded information, and adding more yields diminishing returns (see Fig. 2). Interestingly, in our head-to-head comparison of BERT and fastText, we find that fastText almost always encodes information about morphosyntactic properties using fewer dimensions.

Probing through Dimension Selection
The goal of intrinsic probing is to reveal how "knowledge" of a target linguistic property is structured within a neural network-derived representation. If said property can be predicted from the representations, we expect that this is because the neural network encodes this property (Giulianelli et al., 2018). We can then determine whether a probe requires a large or a small subset of dimensions to predict the target property reliably. Particularly small subsets could be used to manually analyze a network and its decision process, and potentially reveal something about how specific neural architectures learn to encode linguistic information.
To formally describe our framework, we first define the necessary notation. We consider the probing of a word representation h ∈ R^d for morphosyntax. Our goal is to find a subset of dimensions C ⊆ D = {1, . . . , d} such that the corresponding subvector h_C contains only the dimensions that are necessary to predict the target morphosyntactic property we are probing for. For all possible subsets of dimensions C ⊆ D, and some random variable Π that ranges over P property values {π_1, . . . , π_P}, we consider a general probabilistic probe p_{θ_C}(Π = π | h_C); note that the model is conditioned on h_C, not on h. Our goal is to select a subset of dimensions using the log-likelihood of held-out data. We term this type of probing dimension selection. One can express dimension selection as the following combinatorial optimization problem:

$$C^{\star} = \operatorname*{argmax}_{C \subseteq D,\; |C| = k} \; \sum_{(h, \pi) \in \mathcal{D}} \log p_{\theta_C}(\Pi = \pi \mid h_C) \qquad (1)$$

where $\mathcal{D}$ is a held-out dataset. Importantly, for complicated models we will require a different parameter set θ_C for each subset C ⊆ D.
In the general case, solving a subset selection problem such as eq. (1) is NP-hard (Binshtok et al., 2007). Indeed, without knowing more about the structure of p_{θ_C}, we would have to rely on enumeration to solve this problem exactly. As there are $\binom{d}{k}$ possible subsets, it takes a prohibitively long time to enumerate them all for even small d and k.
Greed is not Enough. A natural first approach to approximate a solution to eq. (1) is a greedy algorithm (Kleinberg and Tardos, 2005, Chapter 4). Such an algorithm chooses the dimension that results in the largest increase to the objective at every iteration. However, some probes, such as neural network probes, need to be trained with a gradient-based method for many epochs. In such a case, even a greedy approximation is prohibitively expensive. For example, to select the first dimension, we train d probes and take the best. To select the second dimension, we train d − 1 probes and take the best. This requires training O(dk) networks! In the case of BERT, we have d = 768 and we would generally like to consider k at least up to 50. Training on the order of 38400 neural models to probe for just one morphosyntactic property is generally not practical. What we require is a decomposable probe, which can be trained once on all dimensions and then be used to evaluate the log-likelihood of any subset of dimensions in constant or near-constant time. To the best of our knowledge, no probes in the literature exhibit this property; the primary technical contribution of the paper is the development of such a probe in §3.
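To make the cost contrast concrete, the following is a minimal sketch of the greedy loop, assuming a hypothetical `probe` object whose `log_likelihood` method can score any subset of dimensions after a single training pass (the decomposability property defined above); all names here are illustrative, not the paper's actual implementation.

```python
def greedy_select(probe, H_dev, v_dev, d, k=50):
    """Greedily grow a subset of k dimensions that maximizes held-out
    log-likelihood. With a decomposable probe, each `log_likelihood`
    call is cheap; with a neural probe, every call would require
    training a fresh model, i.e., O(dk) trainings in total."""
    selected, remaining = [], set(range(d))
    for _ in range(k):
        # Score every one-dimension extension of the current subset.
        scores = {j: probe.log_likelihood(H_dev, v_dev, dims=selected + [j])
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```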
Other Selection Criteria. Our exposition above uses the log-likelihood of held-out data as a selection criterion for a subset of dimensions; however, any function that scores a subset of dimensions is suitable. For example, much of the current probing literature relies on accuracy to evaluate probes (Conneau et al., 2018; Liu et al., 2019, inter alia), and two recent papers motivate a probabilistic evaluation with information theory (Pimentel et al., 2020b; Voita and Titov, 2020). One could select based on accuracy, mutual information, or anything else within our framework. In fact, recent work in intrinsic probing by Dalvi et al. (2019) could be recast into our framework if we chose a dimension selection criterion based on the magnitude of the weights of a linear probe. However, we suspect that a performance-based dimension selection criterion (e.g., log-likelihood) should be more robust given that a weight-based approach is sensitive to feature collinearity, variance and regularization. As we mentioned before, performance-based selection requires a probe to be decomposable, and to the best of our knowledge, this is not the case for the linear probe of Dalvi et al. (2019).

A Decomposable Probe for Morphosyntactic Properties
Using the framework introduced above, our goal is to probe for morphosyntactic properties in word representations. We first describe the multivariate Gaussian distribution as it is responsible for our probe's decomposability ( §3.1), and provide some more notation ( §3.2). We then describe our model ( §3.3) and a Bayesian formulation ( §3.4).

Properties of the Gaussian
The multivariate Gaussian distribution is defined as

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \, |\Sigma|}} \exp\!\left(-\frac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\right)$$

where µ is the mean of the distribution and Σ is the covariance matrix. We review the multivariate Gaussian with emphasis on the properties that make it ideal for intrinsic morphosyntactic probing. Firstly, it is decomposable. Given a multivariate Gaussian distribution over the concatenation $x = [x_1^{\top}, x_2^{\top}]^{\top}$ with

$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$

the marginals for x_1 and x_2 may be computed as

$$x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}), \qquad x_2 \sim \mathcal{N}(\mu_2, \Sigma_{22})$$

This means that if we know µ and Σ, we can obtain the parameters for any subset of dimensions of x by selecting the appropriate subvector of µ and submatrix of Σ. As we will see in §3.3, this property is the very centerpiece of our probe. Secondly, the Gaussian distribution is the maximum entropy distribution over the reals given a finite mean and covariance and no further information. Thus, barring additional information, the Gaussian is a good default choice. Jaynes (2003, Chapter 7) famously argued in favor of the Gaussian because it is the real-valued distribution with support (−∞, ∞) that makes the fewest assumptions about the data (beyond its first two moments).
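For illustration, here is a minimal numpy/scipy sketch of this decomposability: once µ and Σ are estimated on all d dimensions, the exact marginal over any subset C is obtained by plain indexing. The specific dimensions and the randomly generated parameters below are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal

d = 768
rng = np.random.default_rng(0)
mu = rng.normal(size=d)            # full mean, estimated once
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)    # full covariance, estimated once

C = [3, 41, 500]                   # any subset of dimensions
marginal = multivariate_normal(mean=mu[C], cov=Sigma[np.ix_(C, C)])

h_C = rng.normal(size=len(C))      # a (random) subvector of an embedding
print(marginal.logpdf(h_C))        # exact marginal density, no refitting
```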

Notation for Morphosyntactic Probing
We now provide some notation for our morphosyntactic probe. Let {h^(1), . . . , h^(N)} be word representation vectors in R^d for N words {w^(1), . . . , w^(N)} from a corpus. For example, these could be embeddings output by fastText (Bojanowski et al., 2017), or contextual representations according to ELMo (Peters et al., 2018) or BERT (Devlin et al., 2019). Furthermore, let {m^(1), . . . , m^(N)} be the morphosyntactic tags associated with each of those words in the sentential context in which they were found. Let A = {a_1, . . . , a_|A|} be a universal set of morphosyntactic attributes in a language, e.g. PERSON, TENSE, NUMBER, etc. For each attribute a ∈ A, let V_a be that attribute's universal set of possible values. For instance, we have V_PERSON = {1, 2, 3} for most languages. For this task, we will further decompose each morphosyntactic tag as a set of attribute-value pairs m^(i) = {a_1 = v_1, . . . , a_|m^(i)| = v_|m^(i)|}, where each attribute a_j is taken from the universal set of attributes A, and each value v_j is taken from a set V_{a_j} of universal values specific to that attribute. For example, the morphosyntactic tag m for the English verb "has" would be {PERSON = 3, NUMBER = SG, TENSE = PRS}.

Our Decomposable Generative Probe
We now present our decomposable probabilistic probe. We model the joint distribution between embeddings and a specific attribute's values as

$$p(h, v) = p(h \mid v)\, p(v)$$

where we define

$$p(h \mid v) = \mathcal{N}(h; \mu_v, \Sigma_v)$$

where µ_v and Σ_v are the value-specific mean and covariance. We further define

$$p(v) = \pi_v, \qquad \sum_{v \in V_a} \pi_v = 1$$

This allows each value to have a different probability of occurring. This is important since our probe should be able to model that, e.g. the 3rd person is more prevalent than the 1st person in corpora derived from Wikipedia. We can then probe with

$$p(v \mid h) = \frac{p(h \mid v)\, p(v)}{\sum_{v' \in V_a} p(h \mid v')\, p(v')}$$

which can be computed quickly as |V_a| is small. This model is also known as quadratic discriminant analysis (Murphy, 2012, Chapter 4). Another interpretation of our model is that it amounts to a generative classifier where, given some specific morphosyntactic attribute, we first sample one of its possible values v, and then sample an embedding from a value-specific Gaussian. Compared to a linear probe (e.g. Hewitt and Liang 2019), whose decision boundary is linear for two values, the decision boundary of this model generalizes to conic sections, including parabolas, hyperbolas and ellipses (Murphy, 2012, Chapter 4). This formulation allows us to model the word representations of each attribute's value as a separate Gaussian. Since the Gaussian distribution is decomposable (§3.1), we can train a single model and from it obtain a probe for any subset of dimensions in O(1) time. To the best of our knowledge, no other probes in the literature possess this desirable property, which is what enables us to intrinsically probe representations for morphosyntax.
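As a sketch of how such a probe can be evaluated on any dimension subset: the value-specific parameters are fit once in full dimensionality, and restriction to a subset C is pure indexing followed by Bayes' rule. All function and variable names below are illustrative, not the paper's actual implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_posterior(h, dims, mus, Sigmas, log_priors):
    """log p(v | h_C) for every value v of one attribute.

    mus[v] and Sigmas[v] are the full d-dimensional value-specific
    mean and covariance; log_priors[v] = log p(v). Restricting to
    `dims` only requires slicing the stored parameters."""
    block = np.ix_(dims, dims)
    log_joint = np.array([
        multivariate_normal(mus[v][dims], Sigmas[v][block]).logpdf(h[dims])
        + log_priors[v]
        for v in range(len(mus))
    ])
    return log_joint - logsumexp(log_joint)  # Bayes' rule, normalized over v
```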

Bayesically Done Now
All that is left now is to obtain the value-specific Gaussian parameters θ_v = (µ_v, Σ_v). Let D^(v) = {h^(1), . . . , h^(N_v)} be a sample of d-dimensional word representations for a value v for some language. One simple approach is to use maximum-likelihood estimation (MLE) to estimate θ_v; this amounts to computing the empirical mean and covariance matrix of D^(v). However, in preliminary experiments we found that a Bayesian approach is advantageous since it precludes degenerate Gaussians when there are more dimensions under consideration than training datapoints (Srivastava et al., 2007).
Under the Bayesian framework, we seek to compute the posterior distribution over the probe's parameters given our training data

$$p(\theta_v \mid D^{(v)}) \propto p(D^{(v)} \mid \theta_v)\, p(\theta_v)$$

where p(θ_v) is our Bayesian prior. The prior encodes our a priori belief about the parameters in the absence of any data, and p(D^(v) | θ_v) is the likelihood of the data under our model given a parameterization θ_v. In the case of a Gaussian-inverse-Wishart prior, there is an exact expression for the posterior.
The GIW prior has hyperparameters µ_0, k_0, Λ_0, ν_0, where the inverse-Wishart distribution (IW, see App. B) defines a distribution over covariance matrices (Murphy, 2012, Chapter 4), and the Gaussian defines a distribution over the mean. As this prior is conjugate to the multivariate Gaussian distribution, our posterior over the parameters after observing D^(v) is itself a GIW distribution, with updated hyperparameters µ_n, k_n, Λ_n, ν_n (see App. A). We did not perform full Bayesian inference as we found a maximum a posteriori (MAP) estimate to be sufficient for our purposes. MAP estimation uses the parameters at the posterior mode, which are (Murphy, 2012, Chapter 4)

$$\hat{\mu}_v = \mu_n, \qquad \hat{\Sigma}_v = \frac{\Lambda_n}{\nu_n + d + 2}$$

where d is the dimensionality of the Gaussian.
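The MAP estimate can therefore be computed in closed form. Below is a small sketch of the conjugate updates (spelled out in App. A) followed by the posterior mode; it tracks the standard formulas in Murphy (2012, Chapter 4), with illustrative names.

```python
import numpy as np

def map_estimate(X, mu0, k0, Lambda0, nu0):
    """MAP estimate of a value-specific Gaussian under a GIW prior.
    X has shape (N_v, d): the representations observed for one value."""
    N, d = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)              # scatter matrix

    k_n = k0 + N
    nu_n = nu0 + N
    mu_n = (k0 * mu0 + N * xbar) / k_n
    diff = (xbar - mu0).reshape(-1, 1)
    Lambda_n = Lambda0 + S + (k0 * N / k_n) * (diff @ diff.T)

    # Posterior mode of mean and covariance (Murphy, 2012, Chapter 4).
    return mu_n, Lambda_n / (nu_n + d + 2)
```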

Probing Metrics
In this section, we describe the metrics that we compute. We track both accuracy ( §4.1) and mutual information ( §4.2).

Accuracy
As with most probes in the literature, we compute the accuracy of our model on held-out data. We report the lower-bound accuracy (LBA) of a set of dimensions C, which is defined as the highest accuracy achieved by any subset of dimensions C′ ⊆ C. This metric counteracts a decrease in performance due to the model overfitting on certain dimensions: if a model achieves a higher score using fewer dimensions, then in principle there exists a model that is at least as effective on any superset of those dimensions.
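In practice, over the nested subsets produced by greedy selection, LBA reduces to a running maximum over the per-size accuracies; a one-line sketch with hypothetical names:

```python
import numpy as np

def lower_bound_accuracy(accuracies):
    """accuracies[k]: held-out accuracy using the first k+1 greedily
    selected dimensions. LBA takes the best subset seen so far, so the
    resulting curve is monotonically non-decreasing."""
    return np.maximum.accumulate(np.asarray(accuracies))
```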
Despite its popularity, accuracy also has its downsides. In particular, we found it to be misleading when a majority-class baseline is not taken into account, which complicates comparisons. For example, for both fastText and BERT on Latin (lat), our probe achieved slightly over 65% accuracy when averaging over attributes. This appears high, but 65% is also the average majority-class baseline accuracy. Conversely, LBNMI (see §4.2) is roughly zero, which more intuitively reflects performance. Hence, we prioritize mutual information in our analysis.

Mutual Information
Recent work has advocated for information-theoretic metrics in probing (Voita and Titov, 2020; Pimentel et al., 2020b). One such metric, mutual information (MI), measures how predictable the occurrence of one random variable is given another.
We estimate the MI between representations and particular attributes using a method similar to the one proposed by Pimentel et al. (2019) (refer to App. D for an extended derivation). Let V_a be a V_a-valued random variable denoting the value of a morphosyntactic attribute, and H be an R^d-valued random variable for the word representation.
The mutual information between V_a and H is

$$I(V_a; H) = H(V_a) - H(V_a \mid H) \qquad (15)$$

The attribute's entropy, H(V_a), depends on the true distribution over values p(v). For this, we use the plug-in approximation p̂(v), which is estimated from held-out data. The conditional entropy, H(V_a | H), is trickier to compute, as it also depends on the true distribution of embeddings given a value, p(h | v), which is high-dimensional and poorly sampled in our data. However, we can obtain an upper-bound if we use our probe q(v | h) ≈ p(v | h) and compute the cross-entropy (Brown et al., 1992) on held-out data D̃ = {(h̃^(n), ṽ^(n))}_{n=1}^{N}:

$$H(V_a \mid H) \leq -\frac{1}{N} \sum_{n=1}^{N} \log q(\tilde{v}^{(n)} \mid \tilde{h}^{(n)})$$

Incidentally, this is equivalent to computing the average negative log-likelihood of the probe on held-out data. Using these estimates in eq. (15), we obtain an empirical lower-bound on the MI.
For ease of comparison across languages and morphosyntactic attributes, we define two metrics associated with MI. The lower-bound MI (LBMI) of any set of neurons C is defined as the highest MI estimate obtained by any subset of those neurons C′ ⊆ C. While the true MI can never decrease upon adding a variable, our estimate may decrease due to overfitting in our model, or due to its inability to capture the complexity of p(h | v). LBMI offers a way to counteract this limitation by using the very best estimate at our disposal for any set of dimensions. In practice, we report the lower-bound normalized MI (LBNMI), which normalizes LBMI by the entropy of V_a, because normalizing MI estimates drawn from different samples enables them to be compared (Gates et al., 2019).
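As a sketch under the definitions above, the NMI estimate for one subset of dimensions can be computed from the probe's held-out log-probabilities; LBNMI is then the running maximum of this quantity over the nested subsets. Names below are illustrative.

```python
import numpy as np

def normalized_mi_estimate(log_probs, labels, n_values):
    """log_probs[n] = log q(v^(n) | h^(n)) on held-out data; labels are
    integer value indices. The plug-in entropy of the attribute minus
    the probe's cross-entropy lower-bounds the MI; dividing by the
    entropy normalizes it."""
    p = np.bincount(labels, minlength=n_values) / len(labels)
    entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))    # plug-in H(V_a)
    cross_entropy = -np.mean(log_probs)               # upper-bounds H(V_a | H)
    return max(0.0, entropy - cross_entropy) / entropy
```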

Experimental Setup
In this section we outline our experimental setup.
Selection Criterion. We use log-likelihood as our greedy selection criterion. We select 50 dimensions, and keep selecting even if the estimate has decreased.

Data. We map the UD v2.1 treebanks (Nivre et al., 2017) to the UniMorph schema (Kirov et al., 2018; Sylak-Glassman, 2016) using the mapping by McCarthy et al. (2018). We keep only the "main" treebank for a language (e.g. UD_Portuguese as opposed to UD_Portuguese_PUD). We remove any sentences that would have a sub-token length greater than 512, the maximum allowed for our BERT model (of the 419943 sentences in the treebanks, only 4 were removed). We assign any tags from the constituents of a contraction to the contracted word form (e.g., for Portuguese, we copy annotations from de and a to the contracted word form da). We use the UD train split to train a probe for each attribute, the validation split to choose which dimensions to select using our greedy scheme, and the test split to evaluate the performance of the probe after dimension selection.
We do not include in our estimates any morphological attribute-value pairs with fewer than 100 word types in any of our splits, as we might not be able to model or evaluate them accurately. This removes certain constructions that mostly pertain to function words (e.g. as definiteness is marked only on articles in Portuguese, the attribute is dropped), but we found it also removed rare inflected forms in our data, which may be due to inherent biases in the domain of text found in the treebanks (e.g. the future tense in Spanish). We use all the words that have been tagged with one of the filtered attribute-value pairs (this includes both function and content words). Finally, we apply some minor post-processing to the annotations (App. C).
Word Representations. We probe the multilingual fastText vectors (we use the implementation by Grave et al., 2018), and the final layer of the multilingual release of BERT (we use the implementation by Wolf et al., 2020). We compute word-level embeddings for BERT by averaging over sub-token representations as in Pimentel et al. (2020b). We use the tokenization in the UD treebanks.
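A minimal sketch of the sub-token averaging step, assuming the alignment between UD tokens and BERT sub-tokens is already available as (start, end) spans; the names are illustrative.

```python
import numpy as np

def word_level_embeddings(subtoken_vecs, word_spans):
    """subtoken_vecs: (n_subtokens, d) array of final-layer BERT outputs.
    word_spans: one (start, end) sub-token range per UD token.
    Each word embedding is the mean of its sub-token vectors."""
    return np.stack([subtoken_vecs[s:e].mean(axis=0) for s, e in word_spans])
```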
Hyperparameters. Our model has four hyperparameters, which control the Gaussian-inverse-Wishart prior. We choose hyperparameter settings that have been shown to work well in the literature (Fraley and Raftery, 2007; Murphy, 2012). We set µ_0 to the empirical mean, Λ_0 to a diagonalized version of the empirical covariance, ν_0 = d + 2, and k_0 = 0.01. We note that the resulting prior is degenerate if the data contains only one datapoint, since the covariance is not well-defined. However, since we do not consider attribute-value pairs with fewer than 100 word types, this does not affect our experiments.

Results and Discussion
Overall, our results strongly suggest that morphosyntactic information tends to be highly focal (concentrated in a small set of dimensions) in fastText, whereas in BERT it is more dispersed. Averaging across all languages and attributes (Fig. 2), fastText has on average 0.306 LBNMI at two dimensions, which is around twice as much as BERT at the same dimensionality. However, the difference between the two becomes progressively smaller, reducing to 0.053 at 50 dimensions. A similar trend holds for LBA (§4.1), with an even smaller difference at higher dimensions. On the whole, roughly 10 dimensions are required to encode any morphosyntactic attribute we probed fastText for, compared to around 35 dimensions for BERT.
The pattern above holds across attributes (Fig. 3), and languages (Fig. 4). There is little improvement in fastText performance when adding more than 10 dimensions and, in some cases, two fastText dimensions can explain half of the information achieved when selecting 50. In contrast, while BERT also displays highly informative dimensions, a substantial increase in LBNMI can be obtained by going from 2 selected dimensions to 10 and 50. Among languages, the only exceptions to this are the Indic languages, where BERT concentrates more morphological information than fastText already at 2 dimensions. Interestingly, when looking at attributes, our results suggest that fastText encodes most attributes better than BERT (when considering the 50 most informative dimensions), except animacy, gender and number. These findings also hold for LBA, where we additionally find little to no gain when comparing LBA after 50 dimensions to accuracy on the full vector.
Visualizing the most informative dimensions for BERT and fastText may give some intuition for how this trend manifests. Fig. 5 shows a scatter plot of the two most informative dimensions selected by our probe for English tense in fastText and BERT. We observed similar patterns for other morphosyntactic attributes. Both embeddings have dimensions that induce some separability in English tense, but this is more pronounced in fastText than BERT. We cannot clearly plot more than two dimensions at a time, but based on the trend depicted in Fig. 2, we can intuit that BERT makes up for at least part of the gap by inducing more separability as dimensions are added.

Limitations
The generative nature of our probe means that adequately modeling the embedding distribution p(h | v) is of paramount importance. We choose a Gaussian model in order to assume as little as possible about the distribution of BERT and fastText embeddings; however, as one reviewer pointed out, the embedding distribution is unlikely to be Gaussian (see Fig. 6 for an example). This results in a looser bound on the mutual information for dimensions in which the Gaussian assumption does not hold, which leads to decreasing mutual information estimates after a certain number of dimensions are selected (see Fig. 7). As we compute and report an empirical lower-bound on the mutual information for any subset of dimensions (LBMI), we have evidence that there is at least that amount of information for any given subset of dimensions. However, we expect that better modeling of the embedding distribution should improve our bound on the mutual information and thus yield a better probe (Pimentel et al., 2020b).

Related Work
There has been a growing interest in understanding what information is in NLP models' internal representations. Studies vary widely, from detailed analyses of particular scenarios and linguistic phenomena (Linzen et al., 2016; Gulordava et al., 2018; Ravfogel et al., 2018; Krasnowska-Kieraś and Wróblewska, 2019; Wallace et al., 2019; Warstadt et al., 2019; Sorodoc et al., 2020) to extensive investigations across a wealth of tasks (Tenney et al., 2018; Conneau et al., 2018; Liu et al., 2019). In computer vision, related work identifies units whose local, peak activations correlate with features in an image (e.g., material, door presence), shows that ablation of these units has a disproportionately big impact on the classification of their respective features, and that they can be manually controlled, with interpretable effects. Most similar to our analysis is LINSPECTOR (Şahin et al., 2020), a suite of probing tasks that includes probing for morphosyntax. Our work differs in two respects. Firstly, whereas LINSPECTOR focuses on extrinsic probing, we probe intrinsically. Secondly, the scope of our morphosyntactic study is more typologically diverse (36 vs. 5 languages), although they consider more varieties of word representations, such as GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018), but not BERT.

Conclusion
In this paper, we introduce an alternative framework for intrinsic probing, which we term dimension selection. The idea is to use probe performance on different subsets of dimensions as a gauge for how much information about a linguistic property different subsets of dimensions jointly encode. We show that current probes are unsuitable for intrinsic probing through dimension selection, as they are not inherently decomposable, which is required to make the procedure computationally tractable. Therefore, we present a decomposable probe based on the Gaussian distribution, and evaluate its effectiveness by probing BERT and fastText for morphosyntax across 36 languages. Overall, we find that fastText is more focal than BERT, requiring fewer dimensions to capture most of the information pertaining to a morphosyntactic property.
Future Work. Future work will be separated into two strands. The first will focus on how to better model the distribution of embeddings given a morphosyntactic attribute; as mentioned above, this should yield a better probe overall. The second strand of work pertains to a deeper analysis of our results, and expansion to other probing tasks.

A Gaussian-inverse-Wishart Posterior Parameters

Using the notation introduced in §3.4, the parameters of the Gaussian-inverse-Wishart posterior GIW(µ_v, Σ_v | µ_n, k_n, Λ_n, ν_n) are (Murphy, 2012)

$$\mu_n = \frac{k_0 \mu_0 + N_v \bar{h}}{k_0 + N_v}, \qquad k_n = k_0 + N_v, \qquad \nu_n = \nu_0 + N_v$$

$$\Lambda_n = \Lambda_0 + S + \frac{k_0 N_v}{k_0 + N_v} (\bar{h} - \mu_0)(\bar{h} - \mu_0)^{\top}$$

where h̄ is the empirical mean of D^(v) and S is the scatter matrix

$$S = \sum_{i=1}^{N_v} (h^{(i)} - \bar{h})(h^{(i)} - \bar{h})^{\top}$$

B Inverse-Wishart Distribution
The inverse-Wishart distribution is defined as (Murphy, 2007)

$$\mathcal{IW}(\Sigma; \Lambda, \nu) = \frac{|\Lambda|^{\nu/2}}{2^{\nu d / 2}\, \Gamma_d(\nu/2)} \, |\Sigma|^{-(\nu + d + 1)/2} \exp\!\left(-\frac{1}{2} \operatorname{tr}\!\left(\Lambda \Sigma^{-1}\right)\right)$$

where Σ is a positive-definite d × d matrix, and Γ_d is the multivariate Gamma function.

C Changes to UD Annotations
We apply some post-processing to canonicalize the automatically-converted UniMorph annotations.
The changes we make include:

4. We canonicalize conjunctive features by sorting them alphabetically, ensuring they all belong to the same morphological attribute, and joining them into a new feature, so the annotation "MASC+FEM" becomes "FEM+MASC".

5. We discard language-specific annotations, as this is not a canonical UniMorph dimension.

D Mutual Information Approximation
Let V_a be a V_a-valued random variable denoting the value of a morphosyntactic attribute, and H be an R^d-valued random variable for the word representation. The mutual information between V_a and H is

$$I(V_a; H) = H(V_a) - H(V_a \mid H)$$

To compute the entropy H(V_a), we would ideally need the true attribute distribution p(v) for a language. We can empirically approximate it using p̂(v), which has been computed from held-out data:

$$H(V_a) \approx -\sum_{v \in V_a} \hat{p}(v) \log \hat{p}(v)$$

Computing H(V_a | H) is trickier, as it relies on the true distribution of the representations for a value, p(h | v), which is hard to estimate as it is high-dimensional and poorly sampled in our data.
Note that by using an approximation q(v | h) ≈ p(v | h) instead (a.k.a. our probe), we obtain an upper bound on the true conditional entropy (Brown et al., 1992):

$$H(V_a \mid H) \leq -\sum_{v \in V_a} p(v) \int p(h \mid v) \log q(v \mid h)\, \mathrm{d}h = \sum_{v \in V_a} p(v)\, I_v$$

where we define $I_v = -\int p(h \mid v) \log q(v \mid h)\, \mathrm{d}h$. While the approximation q(v | h) should be reasonable for our purposes, the integral I_v is intractable as it still depends on p(h | v). However, we can use held-out data to approximate I_v (Pimentel et al., 2019):

$$I_v \approx -\frac{1}{N_v} \sum_{i=1}^{N_v} \log q(v \mid h^{(i)})$$

where {h^(i)}_{i=1}^{N_v} are held-out word representations for a value v, and we thus obtain an empirical upper-bound on H(V_a | H).

E Reproducibility Details
All experiments were run on an AWS p2.xlarge instance, with 1 Tesla K80 GPU, 4 CPU cores, and 61 GB of RAM. The total runtime of the experiments was 2 days, 18 hours, 42 minutes and 14 seconds.
In total, when considering a d-dimensional word representation, this model has

$$d + \frac{d(d+1)}{2}$$

parameters per attribute value: d for the mean and d(d + 1)/2 for the symmetric covariance matrix. In practice, this means that for every value, a fastText Gaussian we fit has 300 + (300 · 301)/2 = 45450 parameters, whereas a BERT Gaussian has 768 + (768 · 769)/2 = 296064 parameters.

F Probed Attributes by Language
Tab. 1 shows a list of all languages that were probed, which attributes were probed, and which values were considered. The number of example words for a value in the train/validation/test split is shown in parentheses.