Finding Universal Grammatical Relations in Multilingual BERT

Recent work has found evidence that Multilingual BERT (mBERT), a transformer-based multilingual masked language model, is capable of zero-shot cross-lingual transfer, suggesting that some aspects of its representations are shared cross-lingually. To better understand this overlap, we extend recent work on finding syntactic trees in neural networks’ internal representations to the multilingual setting. We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English, and that these subspaces are approximately shared across languages. Motivated by these results, we present an unsupervised analysis method that provides evidence mBERT learns representations of syntactic dependency labels, in the form of clusters which largely agree with the Universal Dependencies taxonomy. This evidence suggests that even without explicit supervision, multilingual masked language models learn certain linguistic universals.


Introduction
Past work (Tenney et al., 2019a,b) has found that masked language models such as BERT (Devlin et al., 2019) learn a surprising amount of linguistic structure, despite a lack of direct linguistic supervision. Recently, large multilingual masked language models such as Multilingual BERT (mBERT) and XLM have shown strong cross-lingual performance on tasks like XNLI (Williams et al., 2018) and dependency parsing (Wu and Dredze, 2019). Much previous analysis has been motivated by a desire to explain why BERT-like models perform so well on downstream applications in the monolingual setting, which raises the question: what properties of these models make them so cross-lingually effective?

Figure 1: t-SNE visualization of head-dependent dependency pairs belonging to selected dependencies in English and French, projected into a syntactic subspace of Multilingual BERT, as learned on English syntax trees. Colors correspond to gold UD dependency type labels. Although neither mBERT nor our probe was ever trained on UD dependency labels, English and French dependencies exhibit cross-lingual clustering that largely agrees with UD dependency labels.
In this paper, we examine the extent to which Multilingual BERT learns a cross-lingual representation of syntactic structure. We extend probing methodology, in which a simple supervised model is used to predict linguistic properties from a model's representations. In a key departure from past work, we not only evaluate a probe's performance (on recreating dependency tree structure), but also use the probe as a window into understanding aspects of the representation that the probe was not trained on (i.e. dependency labels; Figure 1). In particular, we use the structural probing method of Hewitt and Manning (2019), which probes for syntactic trees by finding a linear transformation under which the squared distance between two words' representation vectors approximates their distance in the dependency parse. After evaluating whether such transformations recover syntactic tree distances across languages in mBERT, we turn to analyzing the transformed vector representations themselves.
We interpret the linear transformation of the structural probe as defining a syntactic subspace (Figure 2), which intuitively may focus on syntactic aspects of the mBERT representations. Since the subspace is optimized to recreate syntactic tree distances, it has no supervision about edge labels (such as adjectival modifier or nominal subject). This allows us to analyze, without supervision, how representations of head-dependent pairs in syntactic trees cluster, and to qualitatively discuss how these clusters relate to linguistic notions of grammatical relations.
We make the following contributions:

• We find that structural probes extract considerably more syntax from mBERT than baselines in 10 languages, extending the structural probe result to a multilingual setting.
• We demonstrate that mBERT represents some syntactic features in syntactic subspaces that overlap between languages. We find that structural probes trained on one language can recover syntax in other languages (zero-shot), demonstrating that the syntactic subspace found for each language picks up on features that mBERT uses across languages.
• Representing a dependency by the difference of the head and dependent vectors in the syntactic space, we show that mBERT represents dependency clusters that largely overlap with the dependency taxonomy of Universal Dependencies (UD) (Nivre et al., 2020); see Figure 1. Our method allows for fine-grained analysis of the distinctions made by mBERT that disagree with UD, one way of moving past probing's limitation of detecting only linguistic properties we have training data for rather than properties inherent to the model.
Our analysis sheds light on the cross-lingual properties of Multilingual BERT, through both zero-shot cross-lingual structural probe experiments and novel unsupervised dependency label discovery experiments which treat the probe's syntactic subspace as an object of study. We find evidence that mBERT induces universal grammatical relations without any explicit supervision, which largely agree with the dependency labels of Universal Dependencies. 1

Figure 2: The structural probe recovers syntax by finding a syntactic subspace in which all syntactic trees' distances are approximately encoded as squared L2 distance (Hewitt and Manning, 2019).

Methodology
We present a brief overview of Hewitt and Manning (2019)'s structural probe, closely following their derivation. The method represents each dependency tree T as a distance metric where the distance between two words d_T(w_i, w_j) is the number of edges in the path between them in T. It attempts to find a single linear transformation of the model's word representation vector space under which squared distance recreates tree distance in any sentence. Formally, let h_{1:n} be a sequence of n representations produced by a model from a sequence of n words w_{1:n} composing a sentence s. Given a matrix B ∈ R^{k×m} which specifies the probe parameters, we define a squared distance metric d_B as the squared L2 distance after transformation by B:

d_B(h_i, h_j) = (B(h_i − h_j))^T (B(h_i − h_j))

We optimize to find a B that recreates the tree distance d_T between all pairs of words (w_i, w_j) in all sentences s in the training set of a parsed corpus. Specifically, we optimize by gradient descent:

min_B Σ_s (1 / |s|²) Σ_{i,j} | d_T(w_i, w_j) − d_B(h_i, h_j) |

For more details, see Hewitt and Manning (2019).
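To make the objective concrete, here is a minimal NumPy sketch of the squared distance metric d_B and the per-sentence loss above. The function names and array shapes are our own illustration, not the authors' released code; in practice the loss is minimized over B by gradient descent on a parsed corpus.

```python
import numpy as np

def probe_sq_distance(B, h_i, h_j):
    """Squared L2 distance after transformation by B:
    d_B(h_i, h_j) = (B(h_i - h_j))^T (B(h_i - h_j))."""
    diff = B @ (h_i - h_j)
    return float(diff @ diff)

def sentence_loss(B, H, D_tree):
    """L1 loss between tree distances and probe distances for one
    sentence, normalized by the squared sentence length |s|^2.
    H: (n, m) word representations; D_tree: (n, n) parse tree distances."""
    n = H.shape[0]
    total = sum(abs(D_tree[i, j] - probe_sq_distance(B, H[i], H[j]))
                for i in range(n) for j in range(n))
    return total / n ** 2

# Toy usage: a rank-2 probe over 4-dimensional "representations".
rng = np.random.default_rng(0)
B = rng.normal(size=(2, 4))          # B in R^{k x m}, k=2, m=4
H = rng.normal(size=(3, 4))          # one 3-word sentence
D_tree = np.array([[0., 1., 2.],     # chain tree: w1 - w2 - w3
                   [1., 0., 1.],
                   [2., 1., 0.]])
loss = sentence_loss(B, H, D_tree)   # the quantity minimized over B
```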
Departing from prior work, we view the probe-transformed word vectors Bh themselves, not just the distances between them, as objects of study.
The rows of B are a basis that defines a subspace of R^m, which we call the syntactic subspace, and may focus only on parts of the original BERT representations. A vector Bh corresponds to a point in that space; the value of each dimension equals the dot product of h with one of the basis vectors. 2

Experimental Settings
These settings apply to all experiments using the structural probe throughout this paper.
Data Multilingual BERT is pretrained on corpora in 104 languages; however, we probe the performance of the model in 11 languages (Arabic, Chinese, Czech, English, Farsi, Finnish, French, German, Indonesian, Latvian, and Spanish). 3,4 Specifically, we probe the model on trees encoded in the Universal Dependencies v2 formalism (Nivre et al., 2020).
Model In all our experiments, we investigate the 110M-parameter pre-trained weights of the BERT-Base, Multilingual Cased model. 5

Baselines We use the following baselines: 6

• MBERTRAND: A model with the same parametrization as mBERT but no training. Specifically, all of the contextual attention layers are reinitialized from a normal distribution with the same mean and variance as the original parameters. However, the subword embeddings and positional encoding layers remain unchanged. As randomly initialized ELMo layers are a surprisingly competitive baseline for syntactic parsing, we also expect this to be the case for BERT. In our experiments, we find that this baseline performs approximately equally across layers, so we always draw from Layer 7.
• LINEAR: All sentences are given an exclusively left-to-right chain dependency analysis.
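The moment-matched reinitialization behind MBERTRAND can be sketched as follows. This is a generic illustration on raw weight arrays (the name `reinit_like` and the toy shapes are ours), not the actual mBERT loading code:

```python
import numpy as np

def reinit_like(weights, rng):
    """Resample a weight array from a normal distribution with the
    same mean and standard deviation as the original parameters,
    destroying the learned structure while keeping the moments."""
    return rng.normal(weights.mean(), weights.std(), size=weights.shape)

rng = np.random.default_rng(0)
# Stand-in for one attention weight matrix of BERT-Base (hidden size 768).
attention_weights = rng.normal(loc=0.02, scale=0.05, size=(768, 768))
random_weights = reinit_like(attention_weights, rng)
```

The subword embedding and positional encoding arrays would simply be left untouched, as described above.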
Evaluation To evaluate transfer accuracy, we use both of the evaluation metrics of Hewitt and Manning (2019). That is, we report the Spearman correlation between predicted and true word pair distances (DSpr.). 7 We also construct an undirected minimum spanning tree from said distances, and evaluate this tree on undirected, unlabeled attachment score (UUAS), the percentage of undirected edges placed correctly when compared to the gold tree.
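Both metrics can be sketched in a few lines of SciPy; this is a minimal re-implementation of our reading of the metrics, not the authors' evaluation code:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import spearmanr

def uuas(pred_dist, gold_edges):
    """Undirected unlabeled attachment score: fraction of gold tree
    edges recovered by the MST over predicted pairwise distances."""
    mst = minimum_spanning_tree(pred_dist).toarray()
    pred_edges = {frozenset(e) for e in zip(*np.nonzero(mst))}
    gold_edges = {frozenset(e) for e in gold_edges}
    return len(pred_edges & gold_edges) / len(gold_edges)

def dspr(pred_dist, gold_dist):
    """Average, over words i, of the Spearman correlation between
    predicted and true distances from i to all other words."""
    n = len(pred_dist)
    rhos = [spearmanr(np.delete(pred_dist[i], i),
                      np.delete(gold_dist[i], i)).correlation
            for i in range(n)]
    return float(np.mean(rhos))

# A 4-word chain w1 - w2 - w3 - w4: gold path distances and edges.
gold = np.array([[0, 1, 2, 3],
                 [1, 0, 1, 2],
                 [2, 1, 0, 1],
                 [3, 2, 1, 0]], dtype=float)
gold_edges = [(0, 1), (1, 2), (2, 3)]
```

A probe that predicts the gold distances exactly scores UUAS 1.0 and DSpr. 1.0 on this sentence.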
Does mBERT Build a Syntactic Subspace for Each Language?
We first investigate whether mBERT builds syntactic subspaces, potentially private to each language, for a subset of the languages it was trained on; this is a prerequisite for the existence of a shared, cross-lingual syntactic subspace. Specifically, we train the structural probe to recover tree distances in each of our eleven languages. We experiment with training syntactic probes of various ranks, as well as on embeddings from all 12 layers of mBERT.

Results
We find that the structural probe recovers syntactic trees across all the languages we investigate, achieving on average an improvement of 22 points UUAS and 0.175 DSpr. over both baselines (Table 1, section IN-LANGUAGE). 8 Additionally, the probe achieves significantly higher UUAS (on average, 9.3 points better on absolute performance and 6.7 points better on improvement over baseline) on Western European languages. 9 Such languages have been shown to have better performance in recent shared task results on multilingual parsing (e.g. Zeman et al., 2018). However, we do not find a comparably large improvement when evaluating on DSpr. (0.041 DSpr. absolute, -0.013 relative).
We find that across all languages we examine, the structural probe most effectively recovers tree structure from the 7th or 8th mBERT layer (Figure 4). Furthermore, increasing the probe's maximum rank beyond approximately 64 or 128 gives no further gains, implying that the syntactic subspace is a small part of the overall mBERT representation, which has dimension 768 (Figure 3).
These results closely correspond to the results found by Hewitt and Manning (2019) for an equivalently sized monolingual English model trained and evaluated on the Penn Treebank (Marcus et al., 1993), suggesting that mBERT behaves similarly to monolingual BERT in representing syntax.

Transfer Experiments
We now evaluate the extent to which Multilingual BERT's syntactic subspaces are similar across languages. To do this, we evaluate the performance of a structural probe when evaluated on a language unseen at training time. If a probe trained to predict syntax from representations in language i also predicts syntax in language j, this is evidence that mBERT's syntactic subspace for language i also encodes syntax in language j, and thus that syntax is encoded similarly between the two languages.
Specifically, we evaluate the performance of the structural probe in the following contexts: • Direct transfer, where we train on language i and evaluate on language j.
• Hold-one-out transfer, where we train on all languages other than j and evaluate on language j.

Joint Syntactic Subspace
Building off these cross-lingual transfer experiments, we investigate whether there exists a single joint syntactic subspace that encodes syntax in all languages, and if so, the degree to which it does so.
To do so, we train a probe on the concatenation of data from all languages, evaluating it on the concatenation of validation data from all languages.

Results
We find that mBERT's syntactic subspaces are transferable across all of the languages we examine. Specifically, transfer from the best source language (chosen post hoc per language) achieves on average an improvement of 14 points UUAS and 0.128 DSpr. over the best baseline (Table 1, section SINGLETRAN). 10 Additionally, our results demonstrate the existence of a cross-lingual syntactic subspace; on average, a holdout subspace trained on all languages but the evaluation language achieves an improvement of 16 points UUAS and 0.137 DSpr. over baseline, while a joint ALLLANGS subspace trained on a concatenation of data from all source languages achieves an improvement of 19 points UUAS and 0.156 DSpr. (Table 1, sections HOLDOUT and ALLLANGS). Furthermore, for most languages, syntactic information embedded in the post hoc best cross-lingual subspace accounts for 62.3% of the total possible improvement in UUAS (73.1% DSpr.) in recovering syntactic trees over the baseline (as represented by in-language supervision). Holdout transfer represents on average 70.5% of improvement in UUAS (79% DSpr.) over the best baseline, while evaluating on a joint syntactic subspace accounts for 88% of improvement in UUAS (89% DSpr.). These results demonstrate the degree to which the cross-lingual syntactic space represents syntax cross-lingually.

Footnote 10: For full results, consult Appendix Table 1.

Subspace Similarity
Our experiments attempt to evaluate syntactic overlap through zero-shot evaluation of structural probes. In an effort to measure more directly the degree to which the syntactic subspaces of mBERT overlap, we calculate the average principal angle 11 between the subspaces parametrized by each language we evaluate, to test the hypothesis that syntactic subspaces which are closer in angle have closer syntactic properties (Table 4).
We evaluate this hypothesis by asking whether closer subspaces (as measured by lower average principal angle) correlate with better cross-lingual transfer performance. For each language i, we first compute an ordering of all other languages j by increasing probing transfer performance trained on j and evaluated on i. We then compute the Spearman correlation between this ordering and the ordering given by decreasing subspace angle. Averaged across all languages, the Spearman correlation is 0.78 with UUAS, and 0.82 with DSpr., showing that transfer probe performance is substantially correlated with subspace similarity.
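This comparison can be sketched with `scipy.linalg.subspace_angles`, which computes principal angles between the column spaces of two bases; since each probe's subspace is spanned by the rows of its B matrix, we pass the transposes. The helper name and the toy angle/transfer numbers below are our own illustration:

```python
import numpy as np
from scipy.linalg import subspace_angles
from scipy.stats import spearmanr

def mean_principal_angle(B_i, B_j):
    """Average principal angle (radians) between the row spaces of
    two probe matrices B_i, B_j in R^{k x m}."""
    return float(np.mean(subspace_angles(B_i.T, B_j.T)))

# Sanity checks: identical subspaces are 0 apart, orthogonal ones pi/2.
B_en = np.array([[1., 0., 0., 0.],
                 [0., 1., 0., 0.]])
B_orth = np.array([[0., 0., 1., 0.],
                   [0., 0., 0., 1.]])

# Hypothetical angles and transfer scores for three target languages,
# to show the ordering correlation we report: smaller angle, better transfer.
angles = np.array([0.2, 0.5, 0.9])
transfer_uuas = np.array([0.80, 0.65, 0.50])
rho = spearmanr(-angles, transfer_uuas).correlation
```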

Extrapolation Testing
To get a finer-grained understanding of how syntax is shared cross-lingually, we aim to understand whether less common syntactic features are embedded in the same cross-lingual space as syntactic features common to all languages. To this end, we examine two syntactic relations, prenominal and postnominal adjectives, which appear in some of our languages but not others. We train syntactic probes to learn a subspace on languages that primarily use only one ordering (i.e. the majority class is greater than 95% of all adjectives), then evaluate their UUAS score solely on adjectives of the other ordering. Specifically, we evaluate on French, which has a mix (69.8% prenominal) of both orderings, in the hope that evaluating both orderings in the same language may help correct for biases in pairwise language similarity. Since the evaluation ordering is out-of-domain for the probe, predicting evaluation-order dependencies successfully suggests that the learned subspace is capable of generalizing between both kinds of adjectives.
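The prenominal/postnominal split follows directly from comparing word and head positions of amod dependencies in a UD treebank. A sketch under a simplified token format of our own (full CoNLL-U parsing omitted):

```python
def split_amod_by_order(sentences):
    """Partition amod dependencies into prenominal (adjective precedes
    its head noun) and postnominal (adjective follows it).
    Each sentence is a list of (position, head_position, deprel)
    triples with 1-indexed positions, a simplified stand-in for CoNLL-U."""
    prenominal, postnominal = [], []
    for sent in sentences:
        for pos, head, rel in sent:
            if rel == "amod":
                (prenominal if pos < head else postnominal).append((pos, head))
    return prenominal, postnominal

# en "the red car": "red" (2) modifies "car" (3)  -> prenominal
# fr "une ville occidentale": "occidentale" (3) modifies "ville" (2) -> postnominal
en = [(1, 3, "det"), (2, 3, "amod"), (3, 0, "root")]
fr = [(1, 2, "det"), (2, 0, "root"), (3, 2, "amod")]
pre, post = split_amod_by_order([en, fr])
```

A language counts as primarily prenominal if `len(pre)` exceeds 95% of all amod instances, and analogously for postnominal.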
We find that for both categories of languages, accuracy does not differ significantly between prenominal and postnominal adjectives. Specifically, for both primarily-prenominal and primarily-postnominal training languages, postnominal adjectives score on average approximately 2 points better than prenominal adjectives (Table 2).

Methodology
Given the previous evidence that mBERT shares syntactic representations cross-lingually, we aim to more qualitatively examine the nature of syntactic dependencies in syntactic subspaces. Let D be a dataset of parsed sentences, and let the linear transformation B ∈ R^{k×m} define a k-dimensional syntactic subspace. For every non-root word, and hence every syntactic dependency, in D (since every word is a dependent of some other word or an added ROOT symbol), we calculate the k-dimensional head-dependent vector between the head and the dependent after projection by B. Specifically, for all head-dependent pairs (w_head, w_dep), we compute

v_diff = B(h_head − h_dep)

We then visualize all differences over all sentences in two dimensions using t-SNE (van der Maaten and Hinton, 2008).
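Concretely, the head-dependent vectors can be computed and embedded as follows. This sketch uses random stand-ins for the mBERT representations and a trained probe, and scikit-learn's t-SNE rather than the authors' visualization code:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
k, m = 32, 768                      # probe rank, mBERT hidden size
B = rng.normal(size=(k, m))         # trained probe parameters (random stand-in)

# Stand-in (head, dependent) representation pairs for 50 dependencies.
heads = rng.normal(size=(50, m))
deps = rng.normal(size=(50, m))

# v_diff = B(h_head - h_dep): one k-dimensional vector per dependency.
v_diff = (heads - deps) @ B.T

# Embed the difference vectors in 2D for visualization.
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(v_diff)
```

With real mBERT vectors, each 2D point would then be colored by its gold UD dependency label, as in Figure 1.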

Experiments
As with multilingual probing, one can visualize head-dependent vectors in several ways; we present the following experiments:

• dependencies from one language, projected into a different language's space (Figure 1)

• dependencies from one language, projected into a holdout syntactic space trained on all other languages (Figure 5)

• dependencies from all languages, projected into a joint syntactic space trained on all languages (Figure 6)

Figure 5: t-SNE visualization of syntactic differences in Spanish projected into a holdout subspace (learned by a probe trained to recover syntax trees in languages other than Spanish). Despite never seeing a Spanish sentence during probe training, the subspace captures a surprisingly fine-grained view of Spanish dependencies.

For all these experiments, we project into 32-dimensional syntactic spaces. 12 Additionally, we expose a web interface for visualization in our GitHub repository. 13

Results
When projected into a syntactic subspace determined by a structural probe, we find that difference vectors separate into clusters reflecting linguistic characteristics of the dependencies. The cluster identities largely overlap with (but do not exactly agree with) dependency labels as defined by Universal Dependencies (Figure 6). Additionally, the clusters found by mBERT are highly multilingual. When dependencies from several languages are projected into the same syntactic subspace, whether trained monolingually or cross-lingually, we find that dependencies of the same label share the same cluster (e.g. Figure 1, which presents both English and French syntactic difference vectors projected into an English subspace).

Footnote 12: We reduce the dimensionality of the subspaces here as compared to our previous experiments to match t-SNE suggestions and more aggressively filter non-syntactic information.

Footnote 13: https://github.com/ethanachi/multilingual-probing-visualization/blob/master/visualization.md

Figure 6: t-SNE visualization of 100,000 syntactic difference vectors projected into the cross-lingual syntactic subspace of Multilingual BERT. We exclude punct and visualize the top 11 dependencies remaining, which are collectively responsible for 79.36% of the dependencies in our dataset. Clusters of interest highlighted in yellow; linguistically interesting clusters labeled. Example sentences (trimmed for clarity; heads in bold, dependents in bold italic):

(b) Postnominal adjectives: fr Le gaz développe ses applications domestiques. id Film lain yang menerima penghargaan istimewa. fa

(c) Genitives: en The assortment of customers adds entertainment. es Con la recuperación de la democracia y las libertades lv Svešiniece piecēlās, atvadījās no vecā vīra

(j) Definite articles: en The value of the highest bid fr Merak est une ville d'Indonésie sur la côte occidentale. de Selbst mitten in der Woche war das Lokal gut besucht.

Finer-Grained Analysis
Visualizing syntactic differences in the syntactic space provides a surprisingly nuanced view of the native distinctions made by mBERT. In Figure 6, these differences are colored by gold UD dependency labels. A brief summary is as follows:

Adjectives Universal Dependencies categorizes all adjectival noun modifiers under the amod relation. However, we find that mBERT splits adjectives into two groups: prenominal adjectives in cluster (b) (e.g., Chinese 獨特的地理) and postnominal adjectives in cluster (u) (e.g., French applications domestiques).
Nominal arguments mBERT maintains the UD distinction between subject (nsubj) and object (obj). Indirect objects (iobj) cluster with direct objects. Interestingly, mBERT generally groups adjunct arguments (obl) with nsubj if near the beginning of a sentence and obj otherwise.
Relative clauses In the languages in our dataset, there are two major ways of forming relative clauses. Relative pronouns (e.g., English the man who is hungry) are classed by Universal Dependencies as nsubj dependents, while subordinating markers (e.g., English I know that she saw me) are classed as dependents of a mark relation. However, mBERT groups both of these relations together, clustering them distinctly from most nsubj and mark relations.
Negatives Negative adverbial modifiers (English not, Farsi, Chinese 不) are not clustered with other adverbial syntactic relations (advmod), but form their own group (h). 14

Determiners The linguistic category of determiners (det) is split into definite articles (i), indefinite articles (e), possessives (f), and demonstratives (g). Sentence-initial definite articles (k) cluster separately from other definite articles (j).
Expletive subjects Just as in UD, with its separate relation expl, expletive subjects, or third-person pronouns with no syntactic meaning (e.g. English It is cold, French Il faudrait, Indonesian Yang menjadi masalah kemudian), cluster separately (k) from other nsubj relations (small cluster in the bottom left).
Overall, mBERT draws slightly different distinctions from Universal Dependencies. Although some are more fine-grained than UD, others appear to be more influenced by word order, separating relations that most linguists would group together. Still others are valid linguistic distinctions not distinguished by the UD standard.

Discussion
Previous work has found that it is possible to recover dependency labels from mBERT embeddings, in the form of very high accuracy on dependency label probes (Tenney et al., 2019b). However, although we know that dependency label probes are able to use supervision to map from mBERT's representations to UD dependency labels, this does not provide full insight into the nature of (or existence of) latent dependency label structure in mBERT. By contrast, in the structural probe, B is optimized such that ||v_diff||² ≈ 1, but no supervision as to dependency label is given. The contribution of our method is thus to provide a view into mBERT's "own" dependency label representation. In Appendix A, Figure 8, we provide a similar visualization as applied to MBERTRAND, finding much less cluster coherence.

Probing as a window into representations
Our head-dependent vector visualization uses a supervised probe, but its objects of study are properties of the representation other than those relating to the probe supervision signal. Because the probe never sees supervision on the task we visualize for, the visualized behavior cannot be the result of the probe memorizing the task, a problem in probing methodology (Hewitt and Liang, 2019). Instead, it is an example of using probe supervision to focus in on aspects that may be drowned out in the original representation. However, the probe's linear transformation may not pick up on aspects that are of causal influence to the model.

Footnote 14: Stanford Dependencies and Universal Dependencies v1 had a separate neg dependency, but it was eliminated in UDv2.

Related Work
Cross-lingual embedding alignment Prior work finds that independently trained monolingual word embedding spaces in ELMo are isometric under rotation. Similarly, Schuster et al. (2019) and Wang et al. (2019) geometrically align contextualized word embeddings trained independently. Wu et al. (2019) find that cross-lingual transfer in mBERT is possible even without shared vocabulary tokens, which they attribute to this isometricity. In concurrent work, Cao et al. (2020) demonstrate that mBERT embeddings of similar words in similar sentences across languages are already approximately aligned, suggesting that mBERT also aligns semantics across languages. K et al. (2020) demonstrate that strong cross-lingual transfer is possible without any word piece overlap at all.
Analysis with the structural probe In a monolingual study, Reif et al. (2019) also use the structural probe of Hewitt and Manning (2019) as a tool for understanding the syntax of BERT. They plot the words of individual sentences in a 2dimensional PCA projection of the structural probe distances, for a geometric visualization of individual syntax trees. Further, they find that distances in the mBERT space separate clusters of word senses for the same word type. Pires et al. (2019) find that cross-lingual BERT representations share a common subspace representing useful linguistic information. Libovickỳ et al. (2019) find that mBERT representations are composed of a language-specific component and a languageneutral component. Both Libovickỳ et al. (2019) and Kudugunta et al. (2019) perform SVCCA on LM representations extracted from mBERT and a massively multilingual transformer-based NMT model, finding language family-like clusters.

Understanding representations
Li and Eisner (2019) present a study in syntactically motivated dimensionality reduction; they find that after being passed through an information bottleneck and dimensionality reduction via t-SNE, ELMo representations cluster naturally by UD part of speech tags. Unlike our syntactic dimensionality reduction process, the information bottleneck is directly supervised on POS tags, whereas our process receives no linguistic supervision other than unlabeled tree structure. In addition, the reduction process, a feed-forward neural network, is more complex than our linear transformation. Singh et al. (2019) evaluate the similarity of mBERT representations using Canonical Correlation Analysis (CCA), finding that overlap among subword tokens accounts for much of the representational similarity of mBERT. However, they analyze cross-lingual overlap across all components of the mBERT representation, whereas we evaluate solely the overlap of syntactic subspaces. Since syntactic subspaces are at most a small part of the total BERT space, these are not necessarily mutually contradictory with our results. In concurrent work, Michael et al. (2020) also extend probing methodology, extracting latent ontologies from contextual representations without direct supervision.

Discussion
Language models trained on large amounts of text have been shown to develop surprising emergent properties; of particular interest is the emergence of non-trivial, easily accessible linguistic properties seemingly far removed from the training objective. For example, it would be a reasonable strategy for mBERT to share little representation space between languages, effectively learning a private model for each language and avoiding destructive interference. Instead, our transfer experiments provide evidence that at a syntactic level, mBERT shares portions of its representation space between languages. Perhaps more surprisingly, we find evidence for fine-grained, cross-lingual syntactic distinctions in these representations. Even though our method for identifying these distinctions lacks dependency label supervision, we still identify that mBERT has a cross-linguistic clustering of grammatical relations that qualitatively overlaps considerably with the Universal Dependencies formalism.
The UUAS metric We note that the UUAS metric alone is insufficient for evaluating the accuracy of the structural probe. While the probe is optimized to directly recreate parse distances (that is, d_B(h_i, h_j) ≈ d_T(w_i, w_j)), a perfect UUAS score under the minimum spanning tree construction can be achieved by ensuring that d_B(h_i, h_j) is small if there is an edge between w_i and w_j, and large otherwise, instead of accurately recreating distances between words connected by longer paths. By evaluating the Spearman correlation between all pairs of words, one directly evaluates the extent to which the ordering of words j by distance to each word i is correctly predicted, a key notion of the geometric interpretation of the structural probe. See Maudslay et al. (2020) for further discussion.
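This failure mode is easy to exhibit on a toy example: predicted distances that are small exactly on tree edges yield a perfect MST (UUAS 1.0) even when longer-path distances are mis-ordered, which DSpr. penalizes. A self-contained sketch, with the two metrics re-implemented inline for illustration:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import spearmanr

# Gold tree: chain w1 - w2 - w3 - w4, with path distances below.
gold = np.array([[0, 1, 2, 3],
                 [1, 0, 1, 2],
                 [2, 1, 0, 1],
                 [3, 2, 1, 0]], dtype=float)
gold_edges = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}

# Predicted distances: correct (small) on tree edges, but longer-path
# distances are mis-ordered, e.g. d(w1, w3) > d(w1, w4).
pred = np.array([[0, 1, 5, 4],
                 [1, 0, 1, 5],
                 [5, 1, 0, 1],
                 [4, 5, 1, 0]], dtype=float)

mst = minimum_spanning_tree(pred).toarray()
pred_edges = {frozenset(e) for e in zip(*np.nonzero(mst))}
uuas = len(pred_edges & gold_edges) / len(gold_edges)   # 1.0: tree is "perfect"

rhos = [spearmanr(np.delete(pred[i], i), np.delete(gold[i], i)).correlation
        for i in range(4)]
dspr = float(np.mean(rhos))                             # 0.75: distances are not
```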
Limitations Our methods are unable to tease apart, for all pairs of languages, whether transfer performance is caused by subword overlap (Singh et al., 2019) or by a more fundamental sharing of parameters, though we do note that language pairs with minimal subword overlap do exhibit nonzero transfer, both in our experiments and in others (K et al., 2020). Moreover, while we quantitatively evaluate cross-lingual transfer in recovering dependency distances, we only conduct a qualitative study in the unsupervised emergence of dependency labels via t-SNE. Future work could extend this analysis to include quantitative results on the extent of agreement with UD. We acknowledge as well issues in interpreting t-SNE plots (Wattenberg et al., 2016), and include multiple plots with various hyperparameter settings to hedge against this confounder in Figure 11.
Future work should explore other multilingual models like XLM and XLM-RoBERTa, and attempt to come to an understanding of the extent to which the properties we have discovered have causal implications for the decisions made by the model, a claim our methods cannot support.

A.1 Visualization of All Relations
In our t-SNE visualization of syntactic difference vectors projected into the cross-lingual syntactic subspace of Multilingual BERT (Figure 6), we only visualize the top 11 relations, excluding punct. This represents 79.36% of the dependencies in our dataset. In Figure 7, we visualize all 36 relations in the dataset.

A.2 Visualization with Randomly-Initialized Baseline
In Figure 8, we present a visualization akin to Figure 1; however, both the head-dependent representations and the syntactic subspace are derived from MBERTRAND. Clusters around the edges of the figure are primarily type-based (e.g. one cluster for the word for and another for pour), and there is little overlap between clusters with parallel syntactic functions from different languages.

B Alternative Dimensionality Reduction Strategies
To check how robust the clusters of dependency types which emerge from syntactic difference vectors are, we examine simpler strategies for dimensionality reduction.

B.1 PCA for Visualization Reduction
As previously, we project difference vectors into a 32-dimensional syntactic subspace; however, we visualize in 2 dimensions using PCA instead of t-SNE. No significant trends are evident.

B.2 PCA for Dimensionality Reduction
Instead of projecting difference vectors into our syntactic subspace, we first reduce them to a 32-dimensional representation using PCA, 15 then reduce to 2 dimensions using t-SNE as previously. We find that projected under PCA, syntactic difference vectors still cluster into major groups, and major trends are still evident (Figure 10). In addition, many finer-grained distinctions are still apparent (e.g. the division between common nouns and pronouns). However, in some cases, the clusters are motivated less by syntax and more by semantics or language identity. For example:

• The nsubj and obj clusters overlap, unlike in our syntactically-projected visualization, where there is clearer separation.
• Postnominal adjectives, which form a single coherent cluster under our original visualization scheme, are split into several different clusters, each primarily composed of words from one specific language.
• There are several small monolingual clusters without any common syntactic meaning, mainly composed of languages parsed more poorly by BERT (i.e. Chinese, Arabic, Farsi, Indonesian).

Figure 10: t-SNE visualization of syntactic differences in all languages we study, projected to 32 dimensions using PCA.
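The alternative pipeline reduces the raw difference vectors with PCA before t-SNE, skipping the probe projection entirely. A sketch with random stand-in vectors (the scikit-learn calls are real; the data is not):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Raw (unprojected) head-minus-dependent difference vectors in R^768.
raw_diffs = rng.normal(size=(60, 768))

# First reduce to 32 dimensions with PCA instead of the probe's B...
reduced = PCA(n_components=32, random_state=0).fit_transform(raw_diffs)

# ...then to 2 dimensions with t-SNE, as in the main experiments.
coords = TSNE(n_components=2, perplexity=5, init="random",
              random_state=0).fit_transform(reduced)
```

Unlike the probe projection, the PCA step has no notion of syntax, which is consistent with the more semantics- and language-identity-driven clusters noted above.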

C.1 Pairwise Transfer
We present full pairwise transfer results in Table 3. Each experiment was run 3 times with different random seeds; experiment settings with range in UUAS greater than 2 points are labeled with an asterisk (*). Table 4 presents the average principal angle between the subspaces parametrized by each language we evaluate. Table 5 contains the per-language Spearman correlation between the ordering given by (negative) subspace angle and structural probe transfer accuracy, reported both on UUAS and DSpr.

Table 4: Subspace angle overlap as evaluated by the pairwise mean principal angle between subspaces.

E t-SNE reproducibility
Previous work (Wattenberg et al., 2016) has investigated issues in the interpretability of t-SNE plots. Given the qualitative nature of our experiments, to avoid this confounder, we include multiple plots with various settings of the perplexity hyperparameter in Figure 11.

Table 5: The Spearman correlation between two orderings of all languages for each language i. The first ordering of languages is given by (negative) subspace angle between the B matrix of language i and that of all languages. The second ordering is given by the structural probe transfer accuracy from all languages (including i) to i. This is repeated for each of the two structural probe evaluation metrics.

Figure 11: t-SNE visualization of head-dependent dependency pairs belonging to selected dependencies in English and French, projected into a syntactic subspace of Multilingual BERT, as learned on English syntax trees. Colors correspond to gold UD dependency type labels, as in Figure 1, varying the perplexity (PPL) t-SNE hyperparameter. From left to right, figures correspond to PPL 5, 10, 30, 50, spanning the range of PPL suggested by van der Maaten and Hinton (2008). Cross-lingual dependency label clusters are exhibited across all four figures.