The Secret Is in the Spectra: Predicting Cross-lingual Task Performance with Spectral Similarity Measures

Performance in cross-lingual NLP tasks is impacted by the (dis)similarity of languages at hand: e.g., previous work has suggested there is a connection between the expected success of bilingual lexicon induction (BLI) and the assumption of (approximate) isomorphism between monolingual embedding spaces. In this work we present a large-scale study focused on the correlations between monolingual embedding space similarity and task performance, covering thousands of language pairs and four different tasks: BLI, parsing, POS tagging and MT. We hypothesize that statistics of the spectrum of each monolingual embedding space indicate how well they can be aligned. We then introduce several isomorphism measures between two embedding spaces, based on the relevant statistics of their individual spectra. We empirically show that 1) language similarity scores derived from such spectral isomorphism measures are strongly associated with performance observed in different cross-lingual tasks, and 2) our spectral-based measures consistently outperform previous standard isomorphism measures, while being computationally more tractable and easier to interpret. Finally, our measures capture complementary information to typologically driven language distance measures, and the combination of measures from the two families yields even higher task performance correlations.


Introduction
The effectiveness of joint multilingual modeling and cross-lingual transfer in cross-lingual NLP is critically impacted by the actual languages under consideration (Bender, 2011; Ponti et al., 2019). Characterizing, measuring, and understanding this cross-language variation is often the first step towards the development of more robust multilingually applicable NLP technology (O'Horan et al., 2016; Bjerva et al., 2019; Ponti et al., 2019). For instance, selecting suitable source languages is a prerequisite for successful cross-lingual transfer of dependency parsers or POS taggers (Naseem et al., 2012; de Lhoneux et al., 2018). As another example, with all other factors kept similar (e.g., training data size, domain similarity), the quality of machine translation also depends heavily on the properties and proximity of the actual language pair (Kudugunta et al., 2019).
In this work, we contribute to this research endeavor by proposing a suite of spectral-based measures that capture the degree of isomorphism between the monolingual embedding spaces of two languages. Our main hypothesis is that the potential to align two embedding spaces and learn transfer functions can be estimated through the differences between the monolingual embeddings' spectra. We therefore discuss representative statistics of the spectrum of an embedding space (i.e., the set of the singular values of the embedding matrix), such as its condition number or its sorted list of singular values. We then derive measures for the isomorphism between two embedding spaces based on these statistics.
To validate our hypothesis, we perform an extensive empirical evaluation with a range of cross-lingual NLP tasks. This analysis reveals that our proposed spectrum-based isomorphism measures correlate better with task performance, and explain more of its variance, than previous isomorphism measures (Patra et al., 2019). In addition, our measures also outperform standard approaches based on linguistic information (Littell et al., 2017). The first part of our empirical analysis targets bilingual lexicon induction (BLI), a cross-lingual task that has received plenty of attention, in particular as a case study for investigating the impact of cross-language variation on task performance (Artetxe et al., 2018). Its popularity stems from its simple task formulation and reduced resource requirements, which make it widely applicable across a large number of language pairs (Ruder et al., 2019b).
Prior work has empirically verified that BLI performs remarkably well for some language pairs, and rather poorly for others. It attempted to explain this variance in performance by grounding it in the differences between the monolingual embedding spaces themselves. These studies introduced the notion of approximate isomorphism, and argued that it is easier to learn a mapping function (Mikolov et al., 2013; Ruder et al., 2019b) between language pairs whose embeddings are approximately isomorphic than between language pairs without this property (Barone, 2016). Subsequently, novel methods to quantify the degree of isomorphism were proposed, and were shown to significantly correlate with BLI scores (Zhang et al., 2017; Patra et al., 2019).
In this work, we report much higher correlations with BLI scores than existing isomorphism measures, across a variety of state-of-the-art BLI approaches. While previous work was limited only to coarse-grained analysis with a small number of language pairs (i.e., < 10), our study is the first large-scale analysis that is focused on the relationship between quantifiable isomorphism and BLI performance. Our analysis covers hundreds of diverse language pairs, focusing on typologically, geographically and phylogenetically distant pairs as well as on similar languages.
We further show that our findings generalize beyond BLI, to cross-lingual transfer in dependency parsing and POS tagging, and we also demonstrate strong correlations with machine translation (MT) performance. Finally, our spectral-based measures can be combined with typologically driven language distance measures to achieve further correlation improvements. This indicates the complementary nature of the implicit knowledge coded in continuous semantic spaces (and captured by our spectral measures) and the discrete linguistic information from typological databases (captured by the typologically driven measures).

Quantifying Isomorphism with Spectral Statistics
Following the distributional hypothesis (Harris, 1954; Firth, 1957), word embedding models learn the meaning of words from their co-occurrence patterns. Hence, the word embedding space of a language whose words are used in diverse contexts is intuitively expected to encode richer information and greater variance than that of a language with more restrictive word usage patterns. The difference between two monolingual embedding spaces may also stem from other sources, such as differences between the training corpora on which the embedding induction algorithm is run, and the degree to which this algorithm accounts for the linguistic properties of each language. While the exact combination of factors that governs the difference between the embedding spaces of different languages is hard to pin down, this difference is likely to be indicative of the quality of cross-lingual transfer, particularly when the embedding spaces are used directly by cross-lingual transfer algorithms. Our core hypothesis is that the difference between two monolingual spaces can be quantified by spectral statistics of the two spaces.

Spectrum Statistics
Given a d-dimensional embedding matrix X, we perform Singular Value Decomposition (SVD) and obtain a diagonal matrix Σ whose main diagonal comprises the d singular values σ_1, σ_2, . . . , σ_d, sorted in descending order. 1 Our aim is to quantify the difference between two embedding spaces by comparing statistics of their singular values. We next describe such statistics, and in §2.2 use them to measure the isomorphism between the spaces.
Condition Number. In numerical analysis, the condition number of a function measures how strongly the function's output changes in response to a small change in its input (Blum, 2014). Consider a mapping ϕ : X → Y, where X and Y are two embedding spaces mapped via ϕ. The condition number κ(X) represents the degree to which small perturbations in the input X are amplified in the output ϕ(X). Following Higham et al. (2015), we compute the condition number of an input matrix X with d singular values as the ratio between its first (largest) and last (smallest) singular values:

κ(X) = σ_1 / σ_d.    (1)

Why is it a relevant statistic? A smaller condition number denotes a more "stable" matrix that is less sensitive to perturbations. Consequently, learning a transfer function ϕ from one embedding space to another is more robust to noise when the spaces involved have smaller κ(X). We thus expect that embedding matrices with high condition numbers might impede the learning of good transfer functions in cross-lingual NLP: a function learnt on an embedding space that is sensitive to small perturbations may not generalize well.
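As a concrete illustration, the condition number of Eq. (1) can be read directly off the singular values of a matrix. The sketch below uses NumPy; the function name is ours, and the toy matrices stand in for real embedding matrices:

```python
import numpy as np

def condition_number(X):
    """Eq. (1): ratio of the largest to the smallest singular value of X."""
    s = np.linalg.svd(X, compute_uv=False)  # singular values, sorted descending
    return s[0] / s[-1]

# An orthogonal matrix has all singular values equal to 1, so kappa = 1
# (best possible conditioning); stretching one axis inflates kappa.
```

For a matrix with singular values (10, 2, 1), for instance, this returns 10.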
Are small singular values reliable? Small singular values are associated with noise, or with the least important information, and many noise reduction techniques remove them (Ford, 2015). If the smallest singular value indeed captures noise, this might affect the condition number (Eq. (1)). It is thus crucial to distinguish between "small but significant" and "small and insignificant" singular values. This is what we do below.
Effective Rank. Given sorted singular values, how can we determine the last significant singular value? For a matrix with d singular values σ_1 ≥ σ_2 ≥ · · · ≥ σ_d ≥ 0, the ε-numerical rank can be defined as the number of singular values that are at least ε: r_ε = max{r : σ_r ≥ ε}. In effect, singular values below the threshold ε are discarded. However, this formulation introduces a dependency on the hyper-parameter ε. To avoid this, Roy and Vetterli (2007) proposed an alternative method that considers the full spectrum of singular values to compute the so-called effective rank of the input matrix X:

erank(X) = exp( H(σ̄_1, σ̄_2, . . . , σ̄_d) ),    (2)

where H is the Shannon entropy of X's normalized singular value distribution σ̄_i = σ_i / Σ_{j=1}^{d} σ_j, i.e., H = −Σ_{i=1}^{d} σ̄_i log σ̄_i. The effective rank, rounded down to the nearest integer, yields the index of the last singular value that is considered significant, and is interpreted as the effective dimensionality, or rank, of the matrix X. If d is the dimensionality of the embedding space X, and we assume that the number of word vectors in X is typically much larger than d, it then holds that (see Roy and Vetterli (2007)): 1 ≤ erank(X) ≤ rank(X) ≤ d.

The dimensionality of an embedding space is intuitively assumed to equal the dimensionality of its constituent vectors: the matrix rank. Effective rank undermines this assumption: matrices with the same 'initial dimensionality' can have very different 'true dimensionalities' (Yin and Shen, 2018). Effective rank has been used for various problems outside NLP, such as source localization for acoustic (Tourbabin and Rafaely, 2015) and seismic (Leeuwenburgh and Arts, 2014) waves, video compression (Bhaskaranand and Gibson, 2010), and the evaluation of implicit regularization in neural matrix factorization (Arora et al., 2019). We propose to use it to inform and improve the estimation of the condition number.
Effective Condition Number. We replace σ_d in Eq. (1) with the singular value at the position of X's effective rank (see Eq. (2)), and compute the effective condition number κ_ecn as follows:

κ_ecn(X) = σ_1 / σ_⌊erank(X)⌋.    (3)

In §5 we empirically validate the quality of the effective condition number in comparison to the standard condition number.
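Both statistics can be sketched in a few lines of NumPy. The function names are ours; following Roy and Vetterli (2007), the entropy uses the natural logarithm, and we assume strictly positive singular values (true for full-rank embedding matrices):

```python
import numpy as np

def effective_rank(X):
    """Eq. (2): exp of the entropy of the normalized singular value distribution."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()                       # normalized spectrum (sums to 1)
    H = -np.sum(p * np.log(p))            # Shannon entropy, natural log
    return np.exp(H)

def effective_condition_number(X):
    """sigma_1 divided by the singular value at the (floored) effective rank."""
    s = np.linalg.svd(X, compute_uv=False)
    k = int(np.floor(effective_rank(X)))  # index of last "significant" value
    return s[0] / s[k - 1]                # k is 1-based, s is 0-based
```

A flat spectrum (e.g., the identity matrix) has effective rank d, while a spectrum dominated by one large singular value has effective rank close to 1, so the effective condition number ignores the noisy tail that would inflate κ(X).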
Having defined spectral statistics of an embedding space, we move to define means of comparing two spaces using these statistics.

Spectral-Based Isomorphism Measures
The statistics described in §2.1 capture properties of a single embedding space, but it is not immediately obvious how to employ them to quantify the similarity between two distinct embedding spaces.
In what follows, we introduce isomorphism measures based on the spectral statistics.
Let us assume two embedding matrices X_1 and X_2, with condition numbers κ(X_1) and κ(X_2). We combine the two numbers using the harmonic mean (HM) to derive an isomorphism measure between two embedding spaces, COND-HM(X_1, X_2):

COND-HM(X_1, X_2) = 2 · κ(X_1) · κ(X_2) / (κ(X_1) + κ(X_2)).    (4)

We similarly define the ECOND-HM measure over κ_ecn(X_1) and κ_ecn(X_2). Why harmonic mean? The higher the (effective) condition number of an embedding space, the higher its sensitivity to perturbations (i.e., the performance of transfer functions will be low). We view the condition number as a constraining factor on transferability, but what is the right way to evaluate the 'transferability potential' of two spaces via their condition numbers? There are multiple ways to combine two condition numbers, but we have empirically validated (§5) that HM is a robust choice that outperforms other natural possibilities (e.g., the arithmetic mean). We hypothesize this is because HM treats large discrepancies between two numbers in a manner that leans towards the smaller one (unlike, e.g., the arithmetic mean). Two noisy embedding spaces would have a high HM, two stable ones a low HM, and a noisy space paired with a stable one would have an HM that leans towards the stable one. 2 Our results suggest that embedding spaces with small condition numbers can often tolerate noisy mappings from embedding spaces with high condition numbers, which might result from the improved stability of the former spaces.
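The harmonic mean combination behind COND-HM is a one-liner; the toy numbers below illustrate how it leans towards the smaller of the two condition numbers (the function name is ours):

```python
def cond_hm(kappa1, kappa2):
    """Harmonic mean of two (effective) condition numbers: the COND-HM measure."""
    return 2.0 * kappa1 * kappa2 / (kappa1 + kappa2)

# One stable space (kappa = 10) paired with one noisy space (kappa = 1000):
# the harmonic mean stays close to the stable space's condition number,
# whereas the arithmetic mean (505) would be dominated by the noisy one.
paired = cond_hm(10.0, 1000.0)  # ~19.8
```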
Singular Value Gap. In addition to COND-HM and ECOND-HM, we introduce another measure that quantifies the divergence between the full spectral information of two embedding spaces. This measure captures the gap between the singular values of the matrices X_1 and X_2, sorted in descending order. We define the Singular Value Gap (SVG) between two d-dimensional spaces X_1 and X_2 as the squared Euclidean distance between the corresponding sorted singular values after a log transform:

SVG(X_1, X_2) = Σ_{i=1}^{d} ( log σ^1_i − log σ^2_i )²,    (5)

where σ^1_i and σ^2_i, i = 1, . . . , d, are the sorted singular values of the two embedding matrices X_1 and X_2. The intuition here is that two embedding spaces with similar singular values at the same indices are more isomorphic, and therefore easier to align into a shared space, enabling more effective cross-lingual transfer.
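The SVG measure can be sketched in NumPy as follows; the optional k parameter for truncating the spectrum is our addition, mirroring the top-40 variant used for BLI in §4:

```python
import numpy as np

def svg(X1, X2, k=None):
    """Squared Euclidean distance between the two sorted log-spectra (SVG)."""
    s1 = np.linalg.svd(X1, compute_uv=False)   # sorted descending
    s2 = np.linalg.svd(X2, compute_uv=False)
    if k is not None:                          # optionally keep only the top-k values
        s1, s2 = s1[:k], s2[:k]
    return float(np.sum((np.log(s1) - np.log(s2)) ** 2))

# Identical spectra give SVG = 0; uniformly scaling one space by a factor c
# shifts every log singular value by log(c), giving SVG = d * log(c)**2.
```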
In summary, this section has presented methods that estimate the degree of isomorphism between any given pair of embedding spaces, which may differ in their language, training corpus, embedding induction algorithm, or other factors. While the focus of our empirical analysis (§4, §5) is cross-language learning and transfer, we note that the scope of our methods may be wider, as they have not been developed solely with cross-lingual learning in mind.

Related Work and Baselines
We now provide an overview of prior research that focused on two relevant themes: 1) measuring approximate isomorphism between two embedding spaces, and 2) more generally, quantifying the (dis)similarity between languages, going beyond isomorphism measures. The discussed approaches will also be used as the main baselines later in §5.
Measuring Approximate Isomorphism. We focus on two standard isomorphism measures from prior work which are most similar to ours, and use them as our main baselines. The first measure, termed Isospectrality (IS), is likewise based on spectral analysis, but of the Laplacian eigenvalues of the nearest neighborhood graphs that originate from the initial embedding spaces X_1 and X_2 (for further technical details see Appendix A). Its proponents argue that these eigenvalues are compact representations of the graph Laplacian, and that their comparison reveals the degree of (approximate) isomorphism. Although similar in spirit to our approach, constructing nearest neighborhood graphs (and then analyzing their eigenvalues) discards useful information on the interaction between all vectors in the initial space, which our spectral method retains.
The second measure is the Gromov-Hausdorff distance (GH) introduced by Patra et al. (2019). It measures the maximum distance of a set of points to the nearest point in another set, or in other words the worst-case distance between two metric spaces X and Y (for further technical details see again Appendix A). Patra et al. (2019) propose this distance to test how well two language embedding spaces can be aligned under an isometric transformation.
While both IS and GH were reported to have strong correlations with BLI performance in prior work, they have not been evaluated in large-scale experiments before. In fact, the correlations were computed on a very small number of language pairs (IS: 8 pairs; GH: 10 pairs). Further, neither measure scales well computationally. For tractability, the scores are therefore computed only on sub-matrices spanning the most frequent words of the full embedding spaces (IS: 10k words; GH: 5k words). In this work, we provide full-fledged empirical analyses of the two measures on a much larger number of pairs from diverse languages, and compare them against the spectral-based measures introduced in §2. The fact that the proposed spectral-based methods are grounded in linear algebra theory (cf. §2.1) also arguably provides a more intuitive understanding of their theoretical underpinning than what is currently offered in the relevant prior work.
Measuring Language Similarity. At the same time, distances between language pairs can also be captured through (dis)similarities in their discrete linguistic properties, such as overlap in syntactic features, or proximity in the phylogenetic language tree. The properties are typically hand-crafted, and are extracted from available typological databases such as the World Atlas of Language Structures (WALS) (Dryer and Haspelmath, 2013) or URIEL (Littell et al., 2017), among others (O'Horan et al., 2016; Ponti et al., 2019). Such distances were found useful in guiding and informing cross-lingual transfer tasks (Cotterell and Heigold, 2017; Agić, 2017; Lin et al., 2019; Ponti et al., 2019).
In particular, we compare against three precomputed measures of language distance based on the URIEL typological database (Littell et al., 2017). Phylogenetic distance (PHY) is derived from the hypothesized phylogenetic tree of language descent. Typological distance (TYP) is computed based on the overlap in syntactic features of languages from the WALS database (Dryer and Haspelmath, 2013). Geographic distance (GEO) is obtained from the locations where languages are spoken; see the work of Littell et al. (2017) for more details.
We use these isomorphism measures and linguistic measures as language distance measures. We simply compute the language distance between two languages L_1 and L_2 as LDist(L_1, L_2) = D(X_1, X_2), where D is one of {SVG, COND-HM, ECOND-HM, GH, IS, PHY, TYP, GEO}. Later, in §5, we show that "proxy" language distances originating from the proposed spectral-based isomorphism measures (see §2.2) correlate better with cross-lingual transfer scores across several tasks than the language distances based on discrete linguistic properties. We also verify that implicit knowledge coded in continuous embedding spaces and linguistic knowledge explicitly coded in external databases often complement each other.

Experimental Setup
Our empirical analyses can be divided into two major parts. First, we run large-scale BLI analyses across several hundred language pairs from dozens of languages, comparing the correlations of the spectral-based isomorphism measures (§2.2) and all baselines (§3) with the performance of a wide spectrum of state-of-the-art BLI methods. Second, we run further correlation analyses against performance in cross-lingual downstream tasks: dependency parsing, POS tagging, and MT. We first provide the details of the experiments that are shared between the two parts, and then give further specifics of each experimental part.
Monolingual Word Embeddings. For all isomorphism measures (SVG, COND-HM, ECOND-HM, GH and IS) and all languages in our analyses, we use publicly available 300-dim monolingual fastText word embeddings pretrained on Wikipedia with exactly the same default settings (see Bojanowski et al. (2017)), length-normalized and trimmed to the 200k most frequent words. 3

Isomorphism Measures: Technical Details. For our spectral-based measures, we compute a full SVD decomposition (i.e., no dimensionality reduction) of the embedding space. We compute SVG scores for BLI based on the first 40 singular values, which we empirically found to produce slightly better results; 4 for the other tasks we use all singular values. For IS and GH, we replicate the experimental setup from prior work: we compute the IS score over the top 10k most frequent words in each monolingual space, while the GH score is computed over the top 5k words from each monolingual space. 5

Bilingual Lexicon Induction
We conduct correlation analyses of the results from previous studies that report BLI scores for a large number of language pairs. On top of that, we complement the existing results from previous research with new results obtained with state-of-the-art BLI methods, applied to additional language pairs.

BLI Setups and Scores. Prior work ran BLI experiments on 210 language pairs, spanning 15 diverse languages. The training and test dictionaries (5k and 2k translation pairs) are derived from PanLex (Baldwin et al., 2010; Kamholz et al., 2014). We complement the original 210 pairs with an additional 210 language pairs of 15 closely related (European) languages, using dictionaries extracted from PanLex following the same procedure. With the additional language set, the aim is to probe whether isomorphism measures can also capture more subtle and smaller language differences. 6 We also analyze the BLI results of 108 language pairs from MUSE (Conneau et al., 2018). This dataset systematically covers English, with 88 translation pairs that involve English as either the source or target language. Finally, we analyze available BLI results based on dictionaries obtained from Google Translate (referred to as GTrans), which include 28 language pairs spanning 8 different languages. For the full list of language pairs involved in previous BLI studies, we refer the reader to prior work (Conneau et al., 2018).

BLI Methods in Comparison. The scores in each BLI setup were computed by several state-of-the-art BLI methods based on cross-lingual word embeddings, briefly described here. 1) SUP is the standard supervised method (Artetxe et al., 2016; Smith et al., 2017) that learns a mapping between two embedding spaces based on a training dictionary by solving the orthogonal Procrustes problem (Schönemann, 1966).
2) SUP+ is another standard supervised method that additionally applies a variety of pre-processing and post-processing steps (e.g., whitening, de-whitening, symmetric reweighting) before and after learning the mapping matrix (Artetxe et al., 2018). 3) UNSUP is a fully unsupervised method based on the "similarity of monolingual similarities" heuristic to extract a seed dictionary from monolingual data alone. It then uses an iterative self-learning procedure to improve on the initial noisy dictionary (Artetxe et al., 2018). For more technical details on the fully unsupervised model, we refer the reader to prior work (Ruder et al., 2019a; Vulić et al., 2019). 7 In sum, our analyses are conducted in three BLI setups (PanLex, MUSE, GTrans) and examine three types of state-of-the-art mapping-based methods, both supervised and unsupervised (SUP, SUP+, UNSUP). Altogether, these span 556 language pairs, and cover both related and distant languages. 8 Following prior work, our BLI evaluation measure is Mean Reciprocal Rank (MRR). We note that identical findings emerge from running the correlation analyses based on Precision@1 scores in lieu of MRR.

6 The initial language set comprises Bulgarian, Catalan, Esperanto, Estonian, Basque, Finnish, Hebrew, Hungarian, Indonesian, Georgian, Korean, Lithuanian, Norwegian, Thai, and Turkish. The additional 210 language pairs are composed of Germanic, Romance and Slavic languages only. For a full list of the languages see Table 4 in the appendix.

7 The SUP+ and UNSUP methods are based on the VecMap framework (github.com/artetxem/vecmap), which showed very competitive and robust BLI performance across a wide range of language pairs in recent comparative analyses (Doval et al., 2019).

Downstream Tasks
Following the large-scale nature of our BLI analyses, we run similar correlation analyses on several downstream tasks that comprise a large number of (both similar and distant) language pairs. 9 We rely on results from a recent study of Lin et al. (2019) that focused on cross-lingual transfer performance in MT, dependency parsing, and POS tagging. 10

Machine Translation. Lin et al. (2019) report BLEU scores when translating from 54 source languages into English as the target language. We report correlations between the different language distance measures and these 54 BLEU scores.
Dependency Parsing. We base our analysis on the cross-lingual zero-shot parser transfer results of Lin et al. (2019): the standard biaffine dependency parser is trained on the training portions of Universal Dependencies (UD) treebanks from 31 languages (Nivre et al., 2018), and is then used to parse the test treebank of each other language, used as the target language. We report correlations between the language distance measures and the Labeled Attachment Scores (LAS) for all combinations of the 31 languages, resulting in 930 pairs.

POS Tagging. We use the POS tagging accuracy scores reported by Lin et al. (2019). These scores span 26 low-resource target languages and 60 source languages, and measure the utility of each source language for each of the 26 target languages in POS tagging. We use a sample of 840 language pairs for the correlation analysis, as 16 low-resource target languages and 49 source languages have readily available pretrained fastText vectors.

8 We report all results for each BLI method, dictionary and language pair in the supplementary material (and also here: https://tinyurl.com/skn5cf7). We also report scores with another method, RCSLS (Joulin et al., 2018), benchmarked in the GTrans BLI setup (see Table 1).

9 For the full list of languages analyzed throughout our experiments, see Table 4 in the appendix.

10 For full details regarding the models used to compute the scores for each downstream task, we refer the interested reader to the work of Lin et al. (2019) and the accompanying repository: https://github.com/neulab/langrank. We note that the scores for each language pair in each task have been produced with the same task architectures.

Correlation Analyses and Statistical Tests
All scores from isomorphism measures and all BLI scores were log-transformed 11 prior to any correlation computation. We report Pearson correlation coefficients for all tasks. This allows us to investigate which of the individual measures is most predictive of task performance.
Regression Analyses. The individual (i.e., single-variable) analyses are not sufficient to account for the complex interdependencies between the distance measures themselves, and for how they interact with task performance when combined. Therefore, we also use a standard linear stepwise regression model (Hocking, 1976; Draper and Smith, 1998):

Y = β_0 + β_1 x_1 + β_2 x_2 + · · · + β_n x_n + ε.

Here, task performance Y is predicted using a set of regressors x_1, . . . , x_n (i.e., SVG, COND-HM, ECOND-HM, GH, IS, PHY, TYP, GEO) that are added to the model incrementally, and only if their marginal contribution to predicting Y is statistically significant (p < .01). This method is useful for finding the variables (i.e., in our case, distance measures) with a maximal and unique contribution to the explanation of Y, when the variables themselves are strongly cross-correlated, as in our case. The model is able to: (a) discern which variables overlap in their information; (b) detect variables that complement each other; and (c) evaluate their joint contribution in predicting task performance. We compute the regression model's R² score over all statistically significant variables, and report its square root, r̄. Importantly, r̄ is not a one-number description of a language, but rather an illustrative quantification of the joint contribution of several different distance measures to the explanation of Y. Its goal is to investigate potential gains achieved through the combination of several distance measures, as opposed to using a single best measure. The distance measures that are found statistically significant in the regression analyses are marked by superscripts over r̄ (see later in Tables 1 and 2).
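To make the selection procedure concrete, here is a simplified forward-stepwise sketch in NumPy. It greedily adds the regressor with the largest marginal R² gain; the fixed gain threshold is our stand-in for the significance test (p < .01), and the function names are ours, not an exact reproduction of the procedure used in the experiments:

```python
import numpy as np

def r_squared(y, X):
    """R^2 of an ordinary least squares fit of y on X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(y, X, min_gain=0.01):
    """Add regressors one at a time while the marginal R^2 gain is large enough."""
    selected, best = [], 0.0
    while len(selected) < X.shape[1]:
        gains = {j: r_squared(y, X[:, selected + [j]]) - best
                 for j in range(X.shape[1]) if j not in selected}
        j = max(gains, key=gains.get)     # candidate with the largest R^2 gain
        if gains[j] < min_gain:           # no candidate contributes enough: stop
            break
        selected.append(j)
        best += gains[j]
    return selected, best
```

With a target that is fully explained by one column, the sketch selects that column and stops, mirroring how the stepwise model discards redundant distance measures.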

Analyses and Results
The results are summarized in Tables 1 and 2. The first main finding is that our proposed spectral-based isomorphism measures are strongly correlated with performance across all tasks and settings. 12 In fact, they show the strongest individual correlations with task performance among all isomorphism measures and linguistic distances alike. The only exception is the MT task, where our measures fall short of TYP (see Table 2), although we note that they still hold a strong advantage over the baseline GH and IS isomorphism measures, which do not seem to capture any useful language similarity properties needed for the MT task.
ECOND-HM outperforms COND-HM on 2 of 3 BLI datasets and 2 of 3 downstream tasks, supporting our assumption that discarding the smallest singular values reduces noise. Additionally, SVG shows greater stability across tasks and datasets than ECOND-HM. A general finding across all tasks is that our spectral measures are the most robust isomorphism measures: they substantially outperform the widely used baselines GH and IS.
As stepwise regression discerns between overlapping and complementing variables (see §4.3), another finding indicates that our spectral measures are complemented by linguistically driven language distances. Indeed, their combination achieves very high correlation scores. The results demonstrate this across all tasks and settings (see bottom rows of the tables). For instance, when combining spectral measures with the linguistic distances, the regression model reaches outstanding correlation scores up to r = .91 on PanLex BLI (Table 1); with 420 language pairs, PanLex is the most comprehensive BLI dataset in our study. In addition, GH and IS are not chosen as significant regressors in the stepwise regression model, which indicates that they capture less information than our spectral methods. 13 Overall, the regression results support the notion that conceptually different distances (i.e., isomorphism-based versus measures based on linguistic properties) capture different properties of similarities between languages, which has a synergistic effect when they are combined.
Concerning individual tasks, we note that our spectral-based measures outperform the baselines. Additional results and analyses are provided in Appendix B. They further demonstrate that our measures also indicate the transfer quality of different target languages for a given source language, and the transfer quality of different source languages for a given target language, for the tasks discussed in this paper.

Further Discussion and Conclusion
This work introduces two spectral-based measures, SVG and ECOND-HM, that excel in predicting performance on a variety of cross-lingual tasks. Both measures leverage information from singular values in different ways: ECOND-HM uses the ratio between two singular values, and is grounded in linear algebra and numerical analysis (Blum, 2014;Roy and Vetterli, 2007), while SVG directly utilizes the full range of singular values. We suspect that the use of the full range of singular values is what makes SVG more robust across different tasks and datasets, compared to ECOND-HM that shows greater variance, as observed in our results above.
While the spectral methods are computed solely on word vectors from Wikipedia, the results in the downstream tasks are computed with different sets of embeddings (e.g., multilingual embeddings for dependency parsing), or the embeddings are learnt during training (for POS tagging and MT). We believe that this discrepancy does not constitute a shortcoming of our analyses, but rather the opposite: spectral-based methods maintain their high correlations in the downstream tasks as well, and this supports the notion that these measures might indeed capture deeper linguistic information than mere similarities between embedding spaces.
Our use of the effective rank to improve the condition number (via the effective condition number) is also inspired by recent work that aimed to automatically detect the true dimensionality of embedding spaces. However, previous work has taken an empirical approach, simply tuning embedding dimensionality to the evaluation tasks at hand (Wang, 2019; Raunak et al., 2019; Carrington et al., 2019). Our intention, in contrast, is to extract the true embedding dimensionality directly from the embedding space. Another recent study (Yin and Shen, 2018) employed perturbation analysis to study the robustness of embedding spaces to noise in monolingual settings, and established that this robustness is also related to the effective dimensionality of the embedding space. These findings inspired us to replace the standard matrix rank with the effective rank when computing the condition number, and to introduce the effective condition number statistic in §2.1.
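The effective rank of Roy and Vetterli (2007) is the exponential of the Shannon entropy of the normalized singular value distribution, and can be computed directly from the spectrum. The sketch below also shows one plausible way to turn it into an effective condition number; the exact construction used in §2.1 may differ.

```python
import numpy as np

def effective_rank(X):
    """Effective rank (Roy and Vetterli, 2007): exp of the Shannon
    entropy of the singular values normalized to sum to one."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / s.sum()
    return float(np.exp(-np.sum(p * np.log(p))))

def effective_condition_number(X):
    """One plausible effective condition number: the ratio of the
    largest singular value to the singular value at the (rounded)
    effective rank, rather than to the smallest singular value."""
    s = np.linalg.svd(X, compute_uv=False)
    k = int(round(effective_rank(X)))
    return float(s[0] / s[k - 1])
```

For a matrix with a flat spectrum the effective rank equals the full rank; a fast-decaying spectrum yields a much smaller value, which is what makes the effective condition number robust to near-zero trailing singular values.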
Our study is also the first to compare language distance measures based on discrete linguistic information (Littell et al., 2017) with measures of isomorphism (i.e., our spectral-based measures, IS, GH), which can also serve as proxy language distance measures. Our findings, suggesting that it is possible to effectively combine these two types of language distance measures, call for further research that will advance our understanding of: 1) what knowledge is captured in monolingual and cross-lingual embedding spaces (Gerz et al., 2018; Pires et al., 2019; Artetxe et al., 2020); 2) how that knowledge complements or overlaps with linguistic knowledge compiled into lexical-semantic and typological databases (Dryer and Haspelmath, 2013; Wichmann et al., 2018; Ponti et al., 2019); and 3) how to use the combined knowledge for more effective transfer in cross-lingual NLP applications (Eisenschlos et al., 2019).
The differences in embedding spaces of different languages depend not only on linguistic properties of the languages under consideration, but also on other factors such as the chosen training algorithm, the underlying training domain, or training data size and quality (Arora et al., 2019; Vulić et al., 2020). In future research we also plan an in-depth study of these factors and their relation to our spectral analysis.
We believe that the main insights from this study will inform and guide cross-lingual transfer learning methods and scenarios in future work, ranging from choosing source languages for transfer in low-data regimes, through monolingual word vector induction guided by spectral statistics, to more effective hyperparameter search.
A IS and GH

Isospectrality (IS) After length-normalizing the vectors, the method computes the nearest-neighbor graphs of a subset of the top N most frequent words in each space, and then calculates the Laplacian matrices L_{P1} and L_{P2} of each graph. For L_{P1}, the smallest k_1 is sought such that the sum of its k_1 largest eigenvalues, \sum_{i=1}^{k_1} \lambda_{1i}, is at least 90% of the sum of all its eigenvalues; the same procedure yields k_2, and then k = \min(k_1, k_2). The final IS measure \Delta is the sum of the squared differences of the k largest Laplacian eigenvalues:

\Delta = \sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2.

The lower \Delta, the more similar the graphs and, consequently, the more isomorphic the two embedding spaces.
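The IS procedure described above can be sketched as follows. This is a simplified sketch: it builds an unnormalized Laplacian from a symmetrized kNN graph of already-selected vectors, and omits the selection of the top N most frequent words.

```python
import numpy as np

def laplacian_spectrum(E, n_neighbors=5):
    """Eigenvalues (descending) of the unnormalized Laplacian of a
    symmetrized nearest-neighbor graph over length-normalized vectors."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = E @ E.T
    np.fill_diagonal(sim, -np.inf)        # exclude self-neighbors
    A = np.zeros_like(sim)
    idx = np.argsort(-sim, axis=1)[:, :n_neighbors]
    for i, nbrs in enumerate(idx):
        A[i, nbrs] = 1.0
    A = np.maximum(A, A.T)                # symmetrize the graph
    L = np.diag(A.sum(axis=1)) - A        # unnormalized Laplacian
    return np.sort(np.linalg.eigvalsh(L))[::-1]

def isospectrality(E1, E2, n_neighbors=5):
    """Sum of squared differences of the k largest Laplacian
    eigenvalues, with k chosen to cover 90% of each spectrum's mass."""
    ev1, ev2 = (laplacian_spectrum(E, n_neighbors) for E in (E1, E2))
    def k_for(ev):                        # smallest k with >= 90% mass
        c = np.cumsum(ev) / ev.sum()
        return int(np.searchsorted(c, 0.9) + 1)
    k = min(k_for(ev1), k_for(ev2))
    return float(np.sum((ev1[:k] - ev2[:k]) ** 2))
```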
Gromov-Hausdorff Distance (GH) The Hausdorff distance measures the worst-case distance between two metric spaces X and Y with a distance function m:

\mathcal{H}(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} m(x, y), \; \sup_{y \in Y} \inf_{x \in X} m(x, y) \right\},

i.e., it computes the distance between the nearest neighbors that are farthest apart. The Gromov-Hausdorff distance then minimizes this distance over all isometric transforms f and g of X and Y:

GH(X, Y) = \inf_{f, g} \mathcal{H}(f(X), g(Y)).

Computing GH directly is computationally intractable in practice, but it can be tractably approximated by computing the Bottleneck distance between the metric spaces (Chazal et al., 2009).
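The inner Hausdorff distance is straightforward to compute for fixed point clouds, e.g., with SciPy; it is only the outer minimization over isometries that requires the Bottleneck-distance approximation. A minimal sketch:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(X, Y):
    """Symmetric Hausdorff distance between two point clouds in a
    shared Euclidean space: the larger of the two directed distances
    sup_x inf_y m(x, y) and sup_y inf_x m(x, y). The full GH distance
    would additionally minimize this over all isometric transforms."""
    return max(directed_hausdorff(X, Y)[0], directed_hausdorff(Y, X)[0])
```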

B Source and Target Selection Analysis
In addition to the correlation analyses in the main text that aggregate all language pairs, some tasks and datasets also support analyses where one language is fixed as a target language (i.e., source-language selection analysis) or as a source language (i.e., target-language selection analysis). Such analyses can inform us how to choose the right transfer language: given a target language one would like to transfer to, which is the best source language to transfer from, and vice versa. These analyses are conducted for the tasks with sufficient language pairs: BLI with PanLex, parsing, and POS tagging. For these analyses we report the average correlation: across target languages in the source-language selection analysis, or across source languages in the target-language selection analysis. We also provide the percentage of times each compared measure scored the highest for a particular task and setting.
Stepwise regression analysis is not suitable for the selection analyses due to the limited number of language pairs in each language selection setup (e.g., PanLex offers 14 language pairs for each source- or target-language selection analysis). These conditions impede the statistical power of the significance tests which stepwise regression requires. We therefore opt for a standard multiple linear regression model instead; the regressors include the isomorphism measure with the highest individual correlation combined with the linguistic measures. As in the stepwise analysis, we report the unified correlation coefficient, r.
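Such a multiple regression model can be sketched in a few lines. This is a minimal sketch assuming the unified coefficient r is the Pearson correlation between the model's fitted and observed task scores; the paper's exact definition may differ.

```python
import numpy as np

def unified_r(features, scores):
    """Fit ordinary least squares of task scores on the chosen
    regressors (one isomorphism measure plus linguistic distances)
    and return the correlation between fitted and observed scores."""
    X = np.column_stack([np.ones(len(scores)), features])  # intercept
    beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
    preds = X @ beta
    return float(np.corrcoef(preds, scores)[0, 1])
```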
We observe that the same findings reported for the aggregated correlation analyses (Tables 1 and 2 in the main text) also hold for the language selection analyses (Table 3 below). This indicates that our spectral measures have an applicative value as they can facilitate better cross-language transfer.
We also observe interesting patterns in the selection analyses for the POS tagging task in Table 3: while the results in the target-language selection analysis largely follow the main-text results, the same does not hold for source-language selection (Table 3, POS Target and Source columns). We speculate that this is in fact an artefact of the experimental design of Lin et al. (2019). Their set of target languages deliberately comprises only truly low-resource languages, and such languages are expected to have lower-quality embedding spaces. Transferring to such languages is bound to fail with most source languages regardless of the actual source-target language similarity. The difficulty of this setting is reflected in the actual scores: the average accuracy score for the best source-target combination is 0.55 in the source-language selection analysis, versus 0.92 for target-language selection.

C Single and Combined Analysis
We show (Figure 1) a single experimental condition (the SUP method on the PanLex BLI dataset, leftmost column in Table 1 in the main text) to illustrate the data distribution and the correlation for spectral-based measures (e.g., ECOND-HM), and the improvement once this measure is combined with linguistic measures through regression analysis.

Table 3: Correlation scores in source-language (Source) and target-language (Target) selection analyses. The best distance measure per column is shown in bold. The percentage of cases in which a measure topped the others is shown in superscript (see details in Appendix B). r refers to the unified correlation coefficient from the multiple regression model (see details in Appendix B).

Figure 1: A single experimental condition (SUP method on the PanLex BLI dataset; leftmost column in Table 1 of the main paper). The left panel presents results for the best single isomorphism measure, ECOND-HM. The right panel presents results for the combined unified model based on the regression analysis r that includes linguistic measures. r's sign was flipped (right panel) to make the graphs directly comparable.