Bridging Linguistic Typology and Multilingual Machine Translation with Multi-view Language Representations

Sparse language vectors from linguistic typology databases and learned embeddings from tasks like multilingual machine translation have been investigated in isolation, without analysing how they could benefit from each other's language characterisation. We propose to fuse both views using singular vector canonical correlation analysis and study what kind of information is induced from each source. By inferring typological features and language phylogenies, we observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy in tasks that require information about language similarities, such as language clustering and ranking candidates for multilingual transfer. With our method, we can easily project and assess new languages without expensive retraining of massive multilingual or ranking models, which are major disadvantages of related approaches.


Introduction
Recent surveys consider linguistic typology a potential source of knowledge to support multilingual natural language processing (NLP) tasks (O'Horan et al., 2016; Ponti et al., 2019). Linguistic typology studies language variation in terms of functional processes (Comrie, 1989). Several typological knowledge bases (KB) have been crafted, from which we can extract categorical language features (Littell et al., 2017). Nevertheless, their sparsity and reduced coverage present a challenge for an end-to-end integration into NLP algorithms. For example, the World Atlas of Language Structures (WALS; Dryer and Haspelmath, 2013) encodes 143 features for 2,679 languages, but the mean coverage per language is barely around 14%.
Dense and data-driven language representations have emerged in response. They are computed from multilingual settings of language modelling (Östling and Tiedemann, 2017) and neural machine translation (NMT) (Malaviya et al., 2017). However, the language diversity in the corpus-based representations is limited. The language coverage could be broadened with other knowledge, such as that encoded in WALS, to distinguish even more language properties. Therefore, to obtain the best of both views (KB and task-learned) with minimal information loss, we project a shared space of discrete and continuous features using a variant of canonical correlation analysis (Raghu et al., 2017).
For our study, we fuse language-level embeddings from multilingual machine translation with syntactic features of WALS. We inspect how much typological knowledge is present by predicting features for new languages. Then, we infer language phylogenies and inspect whether specific relationships are induced from the task-learned vectors.
Furthermore, to demonstrate that our approach has practical benefits in NLP, we apply our language vectors in multilingual NMT with language clustering (Tan et al., 2019) and adapt the ranking of related languages for multilingual transfer (Lin et al., 2019). As a side outcome, we identify that there is an ideal setting to encode language relationships in language embeddings from NMT. Finally, we are releasing a simple tool to allow everyone to fuse their own representations for clustering, ranking and more.

Multi-view language representations
Our primary goal is to fuse parallel representations of the same language in one shared space, and canonical correlation analysis (CCA) allows us to find a projection of two views for a given set of data. With CCA, we iteratively look for linear combinations that maximise the correlation of the two sources in each coordinate (Hardoon et al., 2004). After training, we can apply the learned transformation to a new sample from either view to obtain a CCA-based language representation. 1 CCA considers all dimensions of the two views as equally important. However, our sources are potentially redundant: KB features are mostly one-hot encoded, whereas task-learned ones inherit the high dimensionality of the embedding layer. Moreover, few samples and sparsity could make convergence harder. For the redundancy issue, singular value decomposition (SVD) is an appealing alternative. With SVD, we factorise the source data matrix to compute the principal components and singular values. Furthermore, to deal with sparsity, we adopt a truncated SVD approximation, which is also known as latent semantic analysis in the context of linear dimensionality reduction for term-count matrices (Dumais, 2004).
The two-step transformation of SVD followed by CCA is called singular vector canonical correlation analysis (SVCCA; Raghu et al., 2017) in the context of understanding the representation learning throughout neural network layers. That being said, we use SVCCA to get language representations and not to inspect a neural architecture. 2

Methodology and research questions
To embed linguistic typology knowledge in dense representations for a broad set of languages, we employ SVCCA (§2) with the following sources:

KB view. We employ the language vectors from the URIEL and lang2vec database (Littell et al., 2017). Specifically, we work with the k-NN vectors of the Syntax feature class (U S ; 103 feats.), which are composed of binary features encoded from WALS (Dryer and Haspelmath, 2013).
(NMT) Learned view. Firstly, we exploit the NMT-learned embeddings from the Bible (L B ; 512 dim.) (Malaviya et al., 2017). Up to 731 entries are available in lang2vec that intersect with U S . They were trained in a many-to-English NMT model with a pseudo-token identifying the source language at the beginning of every input sentence.

1 With language representations, we refer to an annotated or unsupervised characterisation of a language itself (e.g. Spanish or English), and not to word or sentence-level representations, as the term is used in the recent NLP literature.

2 As the SVD step performs a dimensionality reduction while preserving as much of the explained variance as possible, we consider two additional parameters (one per view): a threshold value in the [0.5, 1.0] range, with 0.05 incremental steps, for the explained variance ratio of each view. With a value equal to 1, we bypass SVD and compute CCA only. We then tuned these thresholds for all our following experiments (see Appendix C for details).
Secondly, we take the many-to-English language embeddings learned for the language clustering task on multilingual NMT (L W ; 256 dim.) (Tan et al., 2019), where they use 23 languages of the WIT 3 corpus (Cettolo et al., 2012).
One main difference of the latter is the use of factors in the architecture, meaning that the embedding of every input token was concatenated with the embedded pseudo-token that identifies the source language. The second difference is the neural architecture used to extract the embeddings: the former uses a recurrent neural network, whereas the latter uses a small transformer model (Vaswani et al., 2017).
Finally, we train a new set of embeddings (L T ) extracted from the 53 languages of the TED corpus (many-to-English) processed by Qi et al. (2018).

What knowledge do we represent? Each source embeds specialised knowledge to assess language relatedness. The KB vectors can measure typological similarity, whereas task-learned embeddings correlate with other kinds of language relationships (e.g. genetic) (Bjerva et al., 2019b). To analyse whether each kind of knowledge is induced with SVCCA, we assess the tasks of typological feature prediction (§4) and reconstruction of a language phylogeny (§5).
What is the benefit for multilingual NMT (and NLP)? Language-level representations can evaluate the distance between languages in a vector space. We can then assess their applicability to multilingual NMT tasks that require guidance from language relationships. Therefore, language clustering and ranking related partner languages for (multilingual) transfer are our case studies (§6).

Prediction of typological features
An example of a typological feature is a word order specification, such as whether the adjective is predominantly placed before or after the noun (features #24 and #25 of U S ). Our task consists of predicting syntactic features (U S ), leaving one language and one language family out to control for phylogenetic relationships (Bjerva et al., 2019a). Previous work has shown that task-learned embeddings are potential candidates for predicting features of a linguistic typology KB (Malaviya et al., 2017), and our goal is to evaluate whether SVCCA can enhance the NMT-learned language embeddings with typological knowledge from their KB parallel view.
Experimental setup. We use a Logistic Regression classifier per U S feature, which is trained with the NMT-learned or SVCCA representations in both one-language-out and one-language-family-out settings. For prediction, we use the original embedding or its SVCCA projection as input.
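Our reading of this setup can be sketched as follows: one logistic-regression classifier per binary feature, evaluated by holding one language out at a time. The data here is random and toy-sized, and choices like `max_iter` are our own; the paper's exact hyper-parameters are not specified in this section.

```python
# Per-feature probing with leave-one-language-out evaluation (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_langs, emb_dim, n_feats = 30, 16, 5
embeddings = rng.normal(size=(n_langs, emb_dim))        # SVCCA or NMT vectors
features = rng.integers(0, 2, size=(n_langs, n_feats))  # binary WALS-style feats

accs = []
for f in range(n_feats):
    correct = 0
    for held_out in range(n_langs):            # leave one language out
        train = [i for i in range(n_langs) if i != held_out]
        y_train = features[train, f]
        if len(set(y_train)) < 2:              # skip degenerate folds
            continue
        clf = LogisticRegression(max_iter=1000).fit(embeddings[train], y_train)
        correct += int(clf.predict(embeddings[[held_out]])[0]
                       == features[held_out, f])
    accs.append(correct / n_langs)

print(f"mean LOO accuracy over {n_feats} features: {np.mean(accs):.2f}")
```

The one-language-family-out variant only changes the fold definition: all languages of the held-out family leave the training set together.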
Results. In Table 1, we observe that SVCCA outperformed the NMT-learned counterparts for L W and L T , where the performance is significantly better in the one-language-out setting. In the case of L B (with 731 entries), we notice that the overall performance drops, and the SVCCA transformation cannot improve it. We argue that a potential reason for the accuracy drop is the method used to extract the NMT-learned embeddings (initial pseudo-token instead of factors: §7), which could diminish the information embedded about each language and, consequently, impact the SVCCA projection. 4 In conclusion, we notice that specific typological knowledge is usually hard to learn in an unsupervised way, and fusing task-learned embeddings with KB vectors using SVCCA is feasible for inducing information about linguistic typology in some scenarios.

Language phylogeny analysis
According to Bjerva et al. (2019b), there is a positive correlation between the language distances in a phylogenetic tree and a pairwise distance matrix of task-learned representations. Our goal therefore is to investigate whether fusing linguistic typology with SVCCA can preserve or enhance the embedded relationship information. For that reason, we examine how well a language phylogeny can be reconstructed from language representations (§5.1), and also study the correlation (Appendix B).

4 In other words, for SVCCA, it is difficult to deal with the noise in the learned embeddings. In Figures 6a and 6b of the Appendix, we observe noisy agglomerations in the dendrograms (obtained by clustering different language representations), which are preserved after fusing with the KB vectors through SVCCA, as we can see in Fig. 6c.

Inference of a phylogenetic tree
Experimental design. Based on previous work (Rabinovich et al., 2017), we take a tree of 17 Indo-European languages (Serva and Petroni, 2008) as a Gold Standard (GS), which is shown in Figure 1a. 5 We also use agglomerative clustering with variance minimisation (Ward Jr, 1963) as linkage, but we employ cosine similarity, as in Bjerva et al. (2019b).
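The tree-building step can be sketched with SciPy. One caveat: SciPy's Ward linkage assumes Euclidean geometry, so a common workaround for combining it with cosine similarity is to L2-normalise the vectors first (Euclidean distance between unit vectors is a monotone function of cosine distance); this workaround is our assumption, not necessarily the paper's exact recipe, and the language vectors below are random stand-ins.

```python
# Ward-linkage agglomeration over L2-normalised (toy) language vectors.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

rng = np.random.default_rng(2)
langs = ["eng", "deu", "spa", "fra", "rus", "bul"]   # illustrative subset
vecs = rng.normal(size=(len(langs), 32))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit length

Z = linkage(vecs, method="ward")  # (n-1) merge steps define the dendrogram
tree = to_tree(Z)                 # root node of the inferred binary tree
print(tree.count)                 # number of leaves under the root
```

The resulting `tree` is the inferred phylogeny that is then compared against the GS with a tree edit distance.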
We also consider a concatenation (⊕) of the KB and NMT-learned views as a baseline.
It is essential to highlight that neither the NMT-learned nor the ⊕ vectors have all 17 language entries of the GS. Therefore, we can already see one of the significant advantages of the SVCCA vectors, as we are able to represent "unknown" languages using one of the views. The NMT-learned views lack English, since they were extracted from the source side of a many-to-English system, but we were able to project the KB English vector into the shared space. 6 In addition, we project four other languages (Swedish, Danish, Latvian, Lithuanian) to complete the L W embeddings of Tan et al. (2019) and Latvian to complete our own L T set.
Evaluation metric. We differ from previous studies and use a tree edit distance metric, which is defined as the minimum cost of transforming one tree into another by inserting, deleting or modifying (the label of) a node. Specifically, we used the All Path Tree Edit Distance algorithm (APTED; Pawlik and Augsten, 2015, 2016), which is novel for this task. We chose an edit-distance method as it is more transparent for assessing the impact of a single change of linkage in the GS.
As we need to compare inferred pruned trees with different numbers of nodes, we propose a normalised version given by:

nAPTED(τ, GS) = APTED(τ, GS) / (|τ| + |GS|),

where τ is the inferred tree, and |.| indicates the number of nodes. The denominator then is the maximum possible cost of deleting all nodes of τ and inserting each GS node.

Table 2: APTED and nAPTED scores (↓) between the GS and inferred trees from all scenarios. APTED ranges from 0 (no difference) to the size of the tree at most. NMT-learned and concatenation (⊕) can only reconstruct pruned trees of 16 (L B ), 12 (L W ) and 15 (L T ) languages.
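The normalisation is straightforward to express in code: the worst case between two trees is deleting every node of one and inserting every node of the other, so we divide by the sum of the node counts. The APTED score itself would come from an external implementation; here it is simply an input.

```python
# Normalised tree edit distance, as defined above.
def napted(apted_score: float, n_inferred: int, n_gold: int) -> float:
    """nAPTED in [0, 1]: APTED divided by the worst-case cost of
    deleting all inferred-tree nodes and inserting all GS nodes."""
    return apted_score / (n_inferred + n_gold)

# A binary tree over 17 languages has 17 leaves + 16 internal = 33 nodes,
# so two full trees give a worst case of 66 edits. E.g. the paper's best
# SVCCA tree needs 10 edits:
print(round(napted(10, 33, 33), 3))  # ≈ 0.152
```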
Results. Table 2 shows the results for all settings, where the single-view scores are meagre in most cases. For instance, the U S inferred tree (Fig. 1c) requires 30 edits to match the GS. The exception is L T (Fig. 1d), which requires half the edits, although it is incomplete. We observe that the best absolute and normalised scores are obtained by fusing U S and L T with SVCCA (Fig. 1b). English is projected in the Germanic branch, although Latvian is separated from the Balto-Slavic group. The latter case is similar for Bulgarian, which is misplaced in the original L T tree as well. Nevertheless, we only require ten edits to match the GS (where 66 is the maximum cost possible), confirming that our approach is a robust alternative for completing language entries and inferring a language phylogeny. 7 In conclusion, we observe that using typological knowledge with SVCCA enhances the language relationships encoded in the NMT-learned embeddings. In Appendix B, we further discuss what kind of relationship we are representing in the NMT-learned embeddings and SVCCA, and study their correlation.

Application in multilingual NMT
With multilingual NMT, we can translate several language pairs using a single model. Low-resource languages usually benefit through multilingual transfer, which resembles a simultaneous training of the parent(s) and child models. Therefore, we want to take advantage of a language-level vector space in the following two tasks.

7 In further analysis, we confirmed that the inferred tree with only 12 languages of SVCCA (without the projection of extra entries) is comparable or better against the rest of the baselines.

Language clustering. The main idea is to obtain smaller multilingual NMT models as an intermediate point between maintaining many pairwise systems and a single massive multilingual model. With limited resources, it is challenging to support the first scenario, whereas the advantages of the massive setting are also very appealing (e.g. a simplified training process, translation improvement for low-resource languages, or zero-shot translation (Johnson et al., 2017)). Therefore, to address the task, Tan et al. (2019) trained a factored multilingual NMT model of 23 languages from Cettolo et al. (2012), where the language embedding is concatenated to every input token. Then, they performed hierarchical clustering with the representations, and selected a number of clusters guided by the Elbow method. Finally, they compared the systems against individual, massive and language-family-based cluster models.
In a practical multilingual NMT system, it is not only necessary to choose the right clustering; the ability to easily add new languages is also important. With this in mind, we apply our multi-view representations to compute a set of clusters, and we also address the question: do we need to train the massive model again if we want to add one or more new languages to our setting?
Language ranking. The original goal of LANGRANK is to choose a parent language for transfer learning in different tasks, NMT included. To achieve this, Lin et al. (2019) trained a model based on the performance of several hundred pairwise MT systems using the dataset of Qi et al. (2018). For the input features, they considered linguistically-informed vectors from lang2vec (Littell et al., 2017) and corpus-based statistics, such as word/sub-word overlap and the ratio of token types or data size between the target child and potential candidates, where the latter features were among the most relevant.
Considering the transfer capabilities within multilingual NMT and the possibility of obtaining a ranked list of candidates from LANGRANK, we propose an adapted task of choosing the k most related languages for multilingual transfer. We then use our multi-view representations to rank related languages in the vector space, as they embed information about typological and lexical relationships. This is similar to the features that Lin et al. (2019) consider, but without training a ranking model fed with scores from pairwise MT systems.

Experimental setup
We focus on the many-to-one (English) multilingual NMT setting to simplify the evaluation in both tasks. However, similar experiments could be performed in a one-to-many direction.
Dataset. We use the dataset processed and tokenised by Qi et al. (2018) of 53 languages (TED-53), from which we learned our L T embeddings. We opted for TED-53 to better evaluate the extensibility of clusters and because it is also used to train the LANGRANK model. The list of languages, set sizes and other details are included in Appendix A. Before preprocessing the text, we drop any sentences from the training sets that overlap with any of the test sets. Since we are building many-to-English multilingual systems, this is important, as any such overlap would bias the results.
Model and training. Similar to Tan et al. (2019), we train small transformer models (Vaswani et al., 2017). We jointly learn 90k shared sub-words with the byte pair encoding algorithm (Sennrich et al., 2016) built into SentencePiece (Kudo and Richardson, 2018). We also oversample all the training data of the less-resourced languages in each cluster, and shuffle them proportionally in all batches.
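One way to read the oversampling step is that, within a cluster, the training data of each less-resourced language is repeated until every language contributes roughly the size of the largest one. This is our interpretation for illustration; the exact ratios are not specified here, and the sentence counts below are made up.

```python
# Oversampling sketch: repeat smaller languages' data up to the size of
# the largest language in the cluster (toy sentence counts).
cluster = {"glg": 10_000, "por": 180_000, "spa": 200_000}

target = max(cluster.values())
oversampled = {lang: n * round(target / n) for lang, n in cluster.items()}
print(oversampled)
```

After oversampling, batches can be shuffled proportionally so every language appears in roughly balanced amounts during training.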
We use Nematus (Sennrich et al., 2017) only to extract the factored language embeddings from the TED-53 corpus (L T ). Given the large number of experiments, we choose the efficient Marian NMT (Junczys-Dowmunt et al., 2018) toolkit for training the rest of the systems. With Marian NMT, we only use the basic pseudo-token setting for identifying the source language, as we did not need to retrieve new language embeddings after training. In addition, we allow the Marian NMT framework to automatically determine the mini-batch size given the sentence length and available memory (mini-batch-fit parameter).
Clustering settings. We first list the baselines and our approaches, with the number of clusters/models in brackets: 1.

With the last setting, we are asking whether SVCCA is a useful method for rapidly increasing the number of languages without retraining massive models when new entries would otherwise require their NMT-learned embeddings for clustering. Similar to Tan et al. (2019), we use hierarchical agglomeration with average linkage and cosine similarity. However, we choose a different criterion for selecting the optimal number of clusters.
Selection of the number of clusters. The Elbow criterion has been suggested for this purpose (Tan et al., 2019); however, as we can see in Figure 2, it might be ambiguous. Thus, we propose using the Silhouette heuristic (Rousseeuw, 1987), which returns a score in the [-1, 1] range. A cluster whose samples have silhouettes close to 1 is cohesive and well separated. Using the average silhouette over all samples, we vary the number of clusters and look for the peak value above two.
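The silhouette sweep can be sketched as follows: build one average-linkage hierarchy with cosine distances, cut it at each candidate k, and keep the k with the highest mean silhouette. The three-blob toy data stands in for the real language vectors.

```python
# Silhouette-based selection of the number of clusters (toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
centers = rng.normal(size=(3, 8))              # three "family" directions
X = np.vstack([c + rng.normal(scale=0.05, size=(12, 8)) for c in centers])

Z = linkage(X, method="average", metric="cosine")
scores = {}
for k in range(2, 8):
    labels = fcluster(Z, t=k, criterion="maxclust")   # cut hierarchy at k
    scores[k] = silhouette_score(X, labels, metric="cosine")

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 2))
```

With real language vectors the curve is noisier than on this toy example, which is precisely why a single-peak criterion is easier to apply than the Elbow heuristic.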
Ranking settings. We focus on five low-resource languages from TED-53: Bosnian (bos, Indo-European/Balto-Slavic), Galician (glg, Indo-European/Italic), Malay (zlm, Austronesian), Estonian (est, Uralic) and Georgian (kat, Kartvelian). They have between 5k and 13k sentences translated with English, and we chose them because they achieved the most significant improvement from the individual to the massive setting. We then identified the top-3 related languages using LANGRANK, which gives us a multilingual training set of around 500 thousand sentences in each case. Given that LANGRANK usually prefers candidates with a larger data size (Lin et al., 2019), for a fair comparison, we use SVCCA and cosine similarity to choose the k closest languages that agglomerate a similar amount of parallel sentences.
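The budgeted selection can be sketched as a greedy loop: sort candidates by cosine similarity to the child language and keep adding the closest ones until the pooled training data reaches the target budget. The similarities and corpus sizes below are hypothetical values for illustration only.

```python
# Greedy budget-matched ranking of related languages (hypothetical data).
def rank_by_budget(similarities, sizes, budget):
    """similarities/sizes: dicts keyed by language code.
    Returns the chosen languages and their pooled training size."""
    chosen, total = [], 0
    for lang in sorted(similarities, key=similarities.get, reverse=True):
        if total >= budget:
            break
        chosen.append(lang)
        total += sizes[lang]
    return chosen, total

sims = {"hrv": 0.93, "srp": 0.91, "slv": 0.85, "rus": 0.70}
sizes = {"hrv": 120_000, "srp": 140_000, "slv": 20_000, "rus": 210_000}
langs, total = rank_by_budget(sims, sizes, budget=450_000)
print(langs, total)
```

In practice the similarities come from the SVCCA space (cosine similarity to the child language), and the budget is set to match the data size that LANGRANK's top-3 candidates would accumulate.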

Language clustering results
We first briefly discuss the composition of clusters obtained by SVCCA. Then, we analyse the results grouped by training size bins. We complement the analysis by family groups in Appendix D.
Cluster composition: In Figure 2, we observe that SVCCA-53 (Fig. 2a) has adopted ten clusters with a proportionally distributed number of languages (the smallest one is Greek-Arabic-Hebrew, and the largest one has seven entries). Moreover, the languages are usually grouped by phylogenetic or geographical criteria. These agglomeration trends are adopted from both the KB (Fig. 2c) and NMT-learned (Fig. 2d) sources.
From a more detailed inspection, there are entries that do not correspond to their respective family branches, although the single-view sources might induce the bias. For instance, the L T phylogenetic tree (Fig. 1d) "misplaced" Bulgarian within the Italic languages. Nevertheless, the unexpected agglomerations stem from the features encoded in the KB or from the NMT learning process, and we expect they can uncover surprising clusters that avoid isolating languages without close relatives (e.g. Basque, or even Japanese as the only Japonic member in the set). Another benefit is noticeable in the SVCCA-23 clusters (Fig. 2b), which resemble the SVCCA-53 agglomeration despite using only 23 languages to compute the shared space.
Training size bins: We manually define the upper bounds of the bins as [10, 75, 175, 215] thousands of training sentences, which results in groups composed of [14, 14, 13, 12] languages. Figure 3 shows the box plots of BLEU, from which we can analyse each distribution (mean, variance).
Throughout all the bins, we observe that both SVCCA-53 and SVCCA-23 accomplish a comparable accuracy with the best setting in each group. In other words, their clusters provide stable performance for both low or high-resource languages.
In the first bin of the smallest corpora, the Massive baseline and the large clusters of U S barely surpass the SVCCA schemes. Nevertheless, SVCCA contributes a notable advantage if we want to train a multilingual NMT model for a specific low-resource language and we do not have the resources to train a massive system. We further analyse this scenario in §6.3.
In the rightmost bin, for the highest-resource languages, the Massive and U S settings performed worse than SVCCA. Furthermore, we show a competitive accuracy for the Individual and Family approaches. The former's clusters have steady performance across most of the bins as well. Nevertheless, they double the number of clusters that we have in both SVCCA settings, and more than half of the "clusters" contain only one language.
Other approaches, like using the NMT-learned embeddings (L T ) as in Tan et al. (2019) or the concatenation baseline, obtain similar translation results in the last three bins. However, those methods first require obtaining the NMT-learned embeddings (from a 53-language massive model). Using SVCCA and a pre-trained smaller set of language embeddings is enough for projecting new representations, as we show with our SVCCA-23 approach.

Language ranking results
After discussing overall translation accuracy for all the languages, we now focus on five specific low-resource cases and how multilingual transfer enhances their performance. Table 3 shows the BLEU scores of the translation into English for the smaller multilingual models that group each child language with its candidates ranked by LANGRANK and our SVCCA-53 representations.
We also include the results of the individual and massive MT systems. Even though the latter baseline provides a significant improvement over the former, we observe that many of the smaller multilingual models outperform the translation accuracy of the massive system. The result suggests that the amount of data is not the most important factor for supporting multilingual transfer in a low-resource language, which is aligned with the literature (Wang and Neubig, 2019).
Comparing the two ranking approaches, we observe that SVCCA approximates the performance of LANGRANK in most cases. We note that LANGRANK prefers related languages with large datasets, as it only requires three candidates to group around half a million training samples, whereas SVCCA suggests including from three to ten languages to reach a similar amount of parallel sentences. However, increasing the number of languages could impact the multilingual transfer negatively (see the case of Georgian, or kat), as it is analogous to adding different "out-of-domain" samples. To alleviate this, we could bypass candidate languages that do not possess a specific amount of training samples.
We argue that our representations still provide a robust alternative for determining which languages are suitable for multilingual transfer learning. The notable advantage is that we do not need to pre-train MT systems on a specific dataset, and we can easily extend the coverage of languages without re-training the ranking model to consider new language entries. 8

Factors over initial pseudo-tokens
We additionally argue that the configuration used to compute the language embeddings impacts what relationships they can learn. For the analysis, we extract an alternative set of 53 language embeddings (L T * ), but using the initial pseudo-token setting instead of factors. Then, we perform a silhouette analysis to identify whether we can build cohesive and well-separated clusters of languages. Figure 4 shows the silhouette analysis for the aforementioned embeddings (L T * ) together with the Bible embeddings (L B ) that were trained with the same configuration. We observe that the silhouette score never exceeds 0.2, and the curve keeps degrading as we examine a higher number of clusters, which contrasts with the trend shown in Figure 2. The pattern shows that the vectors are not suitable for clustering (the hierarchies are shown in Figure 6 in the Appendix), and they might only encode enough information to perform a classification task in multilingual NMT training and inference. For that reason, we consider it essential to use language embeddings from factors for extracting language relationships.

Our approach is most similar to Bjerva et al. (2019a), as they build a generative model from typological features and use language embeddings, extracted from factored language modelling at the character level, as a prior of the model to extend the language coverage. However, our method differs as it is primarily based on linear algebra, encodes information from both sources from the start, and can deal with a small number of shared entries (e.g. 23 from L W ) to compute robust representations.

Related work
There has been very little work on adopting typological knowledge for NMT. There is no deep integration of the two topics yet (Ponti et al., 2019), but one shallow and prominent case is the ranking method (Lin et al., 2019) that we analysed in §6.
Finally, CCA and its variants have been previously used to derive embeddings at the word level (Faruqui and Dyer, 2014; Dhillon et al., 2015; Osborne et al., 2016). Kudugunta et al. (2019) also used SVCCA, but to inspect sentence-level representations, where they uncover relevant insights about language similarity that are aligned with our results in §5. However, as far as we know, this is the first time a CCA-based method has been used to compute language-level representations.

Takeaways and practical tool
We summarise our key findings as follows:

• SVCCA can fuse linguistic typology KB entries with NMT-learned embeddings without diminishing the originally encoded typological and genetic similarity of languages.

• Our method is a robust alternative for identifying clusters and choosing related languages for multilingual transfer in NMT. The advantage is notable when it is not feasible to pre-train a ranking model or learn embeddings from a massive multilingual system. Assessing new languages is an important ability, given that most of them do not even have enough monolingual corpora to learn embeddings from multilingual language modelling (Joshi et al., 2020).

• Factored language embeddings encode more information for agglomerating related languages than the initial pseudo-token setting.

Furthermore, we make our code available as an open-source tool 9 , together with our L T factored embeddings, to compute multi-view language representations using SVCCA. We enable the option to use other language vectors from lang2vec (Phonology or Phonetic Inventory) as the KB source, and to upload new task-learned embeddings from different settings, such as one-to-many or many-to-many NMT, and also multilingual language modelling. Besides, given a list of languages to assess, our method will project new language representations when they are only available in the KB view. Finally, we include the tasks of language clustering and ranking candidates, which could benefit multilingual NLP studies that involve massive datasets of hundreds of languages.

Conclusion
We compute multi-view language representations with SVCCA using two sources: KB and NMT-learned vectors. With a typological feature prediction task and the inference of phylogenetic trees, we showed that the knowledge and language relationships encoded in both sources are preserved in the combined representation. Moreover, our approach offers important advantages because we can evaluate projected languages with entries in only one of the views and can easily extend the language coverage. The benefits are noticeable in multilingual NMT tasks, like language clustering and ranking related languages for multilingual transfer. We plan to study how to deeply incorporate our typologically-enriched embeddings into multilingual NMT, where there are promising avenues in parameter selection (Sachan and Neubig, 2018) and generation (Platanios et al., 2018).

Acknowledgments
This work was supported by funding from the European Union's Horizon 2020 research and innovation programme under grant agreements No 825299 (GoURMET) and the EPSRC fellowship grant EP/S001271/1 (MTStretch). Also, it was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (http://www.csd3.cam.ac.uk/), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk). We express our thanks to Kenneth Heafield and Rico Sennrich, who provided us with access to the computing resources.
Last but not least, we thank the organisers and participants of the First Workshop on Typology for Polyglot NLP, and the members of the Statistical Machine Translation group at the University of Edinburgh, who provided relevant feedback at an early stage of the study.

A Languages and individual BLEU scores
We work with 53 languages pre-processed by Qi et al. (2018), from which we mapped the ISO 639-1 codes to the ISO 639-2 standard. However, we needed to manually correct the mapping of some codes to identify the correct language vector in the URIEL (Littell et al., 2017) library:

• zh (zho, Chinese macro-language) mapped to cmn (Mandarin Chinese).

• fa (fas, Persian inclusive code for 11 dialects) mapped to pes (Western/Iranian Persian).

• ar (ara, Arabic) mapped to arb (Standard Arabic).

We disregard artificial languages like Esperanto (eo) and variants like Brazilian Portuguese (pt-br) and Canadian French (fr-ca). Table 4 presents the list of all the languages with the following details: ISO 639-2 code, language family, size of the training set in thousands of sentences (with the respective training size bin) and the individual BLEU score obtained per clustering approach and other baselines.
B Correlation of SVCCA with genetic similarity
Bjerva et al. (2019b) argued that raw language embeddings from language modelling correlate with genetic and structural similarity 10 . For the former, they showed that part of the genetic information is preserved. Concerning structural similarity, they computed a distance matrix using syntax-dependency-tag counts per language from annotated treebanks. We leave this analysis for further work.
The list is a crafted set of concepts for comparative linguistics (e.g. I, eye, sleep), and it is usually processed with lexicostatistical methods to study language relationships through time. Therefore, we prefer to argue that corpus-based embeddings could partially encode the lexical similarity of languages.
We perform a Spearman correlation between the cophenetic matrix 11 of the GS and the pairwise cosine-distance matrices of U S , L T and SVCCA(U S ,L T ), where we obtain correlation coefficients of 0.48, 0.68 and 0.80, respectively (p-values<0.001). Our conclusion is that typological knowledge strengthens the representation of lexical similarity within NMT-learned embeddings.
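The comparison above can be sketched as follows: correlate the condensed cophenetic distances of a reference tree with the pairwise cosine distances of a set of language vectors. All arrays here are random stand-ins for the paper's actual trees and embeddings, so no correlation values are implied.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def distance_correlation(cophenetic_condensed, embeddings):
    """Spearman correlation between a condensed cophenetic distance
    vector (as returned by scipy.cluster.hierarchy.cophenet) and the
    pairwise cosine distances of language embeddings."""
    cosine_condensed = pdist(embeddings, metric="cosine")
    rho, pval = spearmanr(cophenetic_condensed, cosine_condensed)
    return rho, pval

# Toy illustration with random data (not the paper's vectors):
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 16))            # 10 languages, 16-dim vectors
coph = pdist(rng.normal(size=(10, 4)))     # stand-in for cophenetic distances
rho, p = distance_correlation(coph, emb)
assert -1.0 <= rho <= 1.0
```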

C SVD explained variance selection
To compute SVCCA, we first transform each source space with SVD, where we can choose to preserve the number of dimensions that represents a given cumulative explained variance of the original dataset.
For that reason, we perform a parameter sweep between 0.5 and 1.0 in 0.05 incremental steps. For a fair comparison, we also transform the single-view spaces (KB or learned) with SVD and search for the optimal threshold.
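The SVD step with a cumulative explained-variance cut-off can be sketched as below. This is a generic implementation under stated assumptions (centred data, variance measured from squared singular values); the function name and the random input are illustrative, not the paper's code.

```python
import numpy as np

def svd_reduce(X, variance_threshold=0.65):
    """Keep the smallest number of singular directions whose cumulative
    explained variance reaches `variance_threshold`."""
    Xc = X - X.mean(axis=0)                     # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (s ** 2) / np.sum(s ** 2)       # variance per direction
    k = int(np.searchsorted(np.cumsum(explained), variance_threshold) + 1)
    return Xc @ Vt[:k].T                        # reduced representation

# Parameter sweep from 0.5 to 1.0 in 0.05 steps, as described above:
X = np.random.default_rng(1).normal(size=(53, 128))
for t in np.arange(0.5, 1.0001, 0.05):
    Xr = svd_reduce(X, t)
    assert Xr.shape[0] == 53 and Xr.shape[1] <= 128
```

A higher threshold always keeps at least as many dimensions, so the sweep trades compactness against reconstruction fidelity.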
Prediction of typological features. We selected a 0.5 threshold for the NMT-learned vectors of L B and L W , and 0.7 for L T . In the case of the SVCCA representation, L T uses [0.75,0.70], whereas L B and L W employ [0.95,0.50] values. The parameter values are the same for both the one-language-out and one-family-out settings. We can argue that there is redundancy in the NMT-learned embeddings, as the prediction of typological features with Logistic Regression always prefers a dimensionality-reduced version instead of the original data (threshold at 1.0).
Language phylogeny inference. In Table 5, we report the optimal value for the SVD explained variance ratio in each single and multi-view (concatenation and SVCCA) setting.
Language clustering (and ranking). We cannot perform an exhaustive analysis of the explained-variance-ratio threshold per view. As our main goal is to increase the coverage of languages steadily, we must determine which configuration allows a stable growth of the hierarchy. We thereupon take inspiration from bootstrap clustering (Nerbonne et al., 2008), and increase the number of language entries from a few (e.g. 10) to 53 by resample bootstrapping using each of the source vectors: U S , L T and L W . Afterwards, we search for the threshold value that preserves a stable number of clusters given the peak silhouette value. Our heuristic looks for the least variability throughout the incremental bootstrapping (Fig. 5).
11 Pairwise distances of the hierarchy's leaves (languages).
Figure 5: Analysis of the number of clusters (blue) and the ratio of the number of clusters per total languages (red) given the chosen thresholds of explained variance ratio. We show the confidence interval computed from the bootstrapping, and we observe that the number of clusters is stable from 42 and 38 languages for the U S and L T vectors, respectively.
We found that 0.65 is the most stable value for U S , whereas 0.60 is the best one for both L T and L W ; we thereupon fix SVCCA-53 and SVCCA-23 to [0.65,0.60]. We also apply the chosen thresholds to the concatenation baseline for a fair comparison. In the single-view cases, the transformations with the tuned variance ratio do not outperform their non-optimised counterparts.
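The bootstrapping heuristic above can be sketched as: repeatedly subsample languages at increasing sizes, pick the cluster count with the peak silhouette at each size, and inspect how stable that count is. The function names, cluster-count range, and random vectors below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_range):
    """Return the number of clusters with the peak silhouette value."""
    scores = {}
    for k in k_range:
        labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        scores[k] = silhouette_score(X, labels, metric="cosine")
    return max(scores, key=scores.get)

def bootstrap_cluster_counts(X, sizes, seed=0):
    """Resample from `sizes[0]` languages up to all rows of X and record
    the silhouette-optimal cluster count at each size; a stable sequence
    indicates a stable hierarchy under the chosen threshold."""
    rng = np.random.default_rng(seed)
    counts = []
    for n in sizes:
        idx = rng.choice(len(X), size=n, replace=False)
        counts.append(best_k_by_silhouette(X[idx], range(2, min(8, n))))
    return counts

# Toy run with random language vectors (stand-ins for U_S / L_T / L_W):
X = np.random.default_rng(2).normal(size=(53, 16))
counts = bootstrap_cluster_counts(X, sizes=[10, 20, 40, 53])
assert len(counts) == 4 and all(c >= 2 for c in counts)
```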

D Language clustering results by language families
Following a guide for evaluating multilingual benchmarks (Anastasopoulos, 2019), we also group the scores by language family. Table 6 includes the overall weighted average per number of languages in each family branch. We observe that most of the approaches obtain clusters with similar overall translation accuracy. The individual models are the only ones that significantly underperform. Their poor performance is transferred to the Family baseline, as most of the groups contain only one language given the low language diversity of the dataset. The U S vectors obtain the highest overall accuracy, mostly from their few large clusters (see Fig. 2c). Meanwhile, SVCCA-53 achieves the second-best overall result, by a minimal margin, and with 3 to 7 languages per cluster, which usually converge faster. Besides, the massive model, the L T embeddings and the concatenation baseline are competitive as well. However, the first requires more resources to train until convergence, whereas the last two need the 53 pre-trained embeddings from a previous massive system. In contrast, SVCCA-23 is a faster alternative if we want to target specific new languages (see Fig. 2b). We only require a small group of language embeddings (e.g. L W with 23 entries) and project the rest with SVCCA, using a set of KB vectors as a side view. For instance, if we need to deploy a translation model for Basque or Thai, we could reach accuracy comparable to or better than a massive model with the SVCCA-23 chosen clusters of only 3 (Arabic, Hebrew) or 5 (Chinese, Indonesian, Vietnamese, Malay) languages, respectively.