Multilingual Factor Analysis

In this work we approach the task of learning multilingual word representations in an offline manner by fitting a generative latent variable model to a multilingual dictionary. We model equivalent words in different languages as different views of the same word, generated by a common latent variable representing their shared lexical meaning. We perform alignment by querying the fitted model for multilingual embeddings, achieving competitive results across a variety of tasks. The proposed model is robust to noise in the embedding space, making it a suitable method for distributed representations learned from noisy corpora.


Introduction
Popular approaches for multilingual alignment of word embeddings build on the observation of (Mikolov et al., 2013a) that continuous word embedding spaces (Mikolov et al., 2013b; Pennington et al., 2014; Bojanowski et al., 2017; Joulin et al., 2017) exhibit similar structures across languages. This observation has led to multiple successful methods in which a direct linear mapping between the two spaces is learned through a least-squares objective (Mikolov et al., 2013a; Smith et al., 2017; Xing et al., 2015) using a paired bilingual dictionary.
An alternate set of approaches based on Canonical Correlation Analysis (CCA) (Knapp, 1978) seek to project monolingual embeddings into a shared multilingual space (Faruqui and Dyer, 2014b;Lu et al., 2015). Both these methods aim to exploit the correlations between the monolingual vector spaces when projecting into the aligned multilingual space. The multilingual embeddings from (Faruqui and Dyer, 2014b;Lu et al., 2015) are shown to improve on word level semantic tasks, which sustains the authors' claim that multilingual information enhances semantic spaces.
In this paper we present a new non-iterative method based on variants of factor analysis (Browne, 1979;McDonald, 1970;Browne, 1980) for aligning monolingual representations into a multilingual space. Our generative modelling assumes that a single word translation pair is generated by an embedding representing the lexical meaning of the underlying concept. We achieve competitive results across a wide range of tasks compared to state-of-the-art methods, and we conjecture that our multilingual latent variable model has sound generative properties that match those of psycholinguistic theories of the bilingual mind (Weinreich, 1953). Furthermore, we show how our model extends to more than two languages within the generative framework which is something that previous alignment models are not naturally suited to, instead resorting to combining bilingual models with a pivot as in (Ammar et al., 2016).
Additionally, a general benefit of the probabilistic setup, as discussed in (Tipping and Bishop, 1999), is that it offers the potential to extend the scope of conventional alignment methods to model and exploit linguistic structure more accurately. One example would be modelling how corresponding word translations can be generated by more than a single latent concept. This assumption can be encoded by a mixture of factor analysers (Ghahramani et al., 1996) to model word polysemy in a similar fashion to (Athiwaratkun and Wilson, 2017), where mixtures of Gaussians are used to reflect the different meanings of a word.
The main contribution of this work is the application of a well-studied graphical model to a novel domain, outperforming previous approaches on word and sentence-level translation retrieval tasks. We put the model through a battery of tests, showing it aligns embeddings across languages well, while retaining performance on monolingual word-level and sentence-level tasks. Finally, we apply a natural extension of this model to more languages in order to align three languages into a single common space.

Background
Previous work on the topic of embedding alignment has assumed that alignment is a directed procedure, i.e. aligning French embeddings to English embeddings. However, another approach is to align both to a common latent space, which need not coincide with either of the original spaces. This motivates applying a well-studied latent variable model to the problem.

Factor Analysis
Factor analysis (Spearman, 1904; Thurstone, 1931) is a technique originally developed in psychology to study the effect of latent factors z ∈ R^k on observed measurements x ∈ R^d. Formally:

p(z) = N(z; 0, I),    p(x|z) = N(x; Wz + µ, Ψ),

where W ∈ R^{d×k} is the loading matrix and Ψ is a diagonal noise covariance. In order to learn the parameters W, Ψ of the model we maximise the marginal likelihood p(x|W, Ψ) = N(x; µ, WW^T + Ψ) with respect to W, Ψ. The maximum likelihood estimates can then be used to obtain a latent representation of a given observation as the posterior mean E_p(z|x)[z]. Such projections have been shown to generalise principal component analysis (Pearson, 1901), as studied in (Tipping and Bishop, 1999).
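As a concrete, entirely synthetic illustration of this setup, the following numpy sketch draws data from the generative model above and recovers latent representations via the posterior mean; all parameter values are illustrative, not fitted to real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
k, d, n = 2, 10, 5000

# Generative model:  z ~ N(0, I_k),  x = W z + mu + eps,  eps ~ N(0, Psi).
W = rng.normal(size=(d, k))
mu = rng.normal(size=d)
Psi = np.diag(rng.uniform(0.05, 0.2, size=d))   # diagonal noise covariance
Z = rng.normal(size=(n, k))
X = Z @ W.T + mu + rng.normal(size=(n, d)) @ np.sqrt(Psi)

# Marginal likelihood:  p(x) = N(x; mu, W W^T + Psi).
Sigma = W @ W.T + Psi

# Latent representation of each observation: the posterior mean
# E[z|x] = W^T (W W^T + Psi)^{-1} (x - mu).
Ez = (X - mu) @ np.linalg.solve(Sigma, W)
print(Ez.shape)  # (5000, 2)
```

With the true parameters, the recovered posterior means correlate strongly with the latent factors that generated the data.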

Inter-Battery Factor Analysis
Inter-Battery Factor Analysis (IBFA) (Tucker, 1958; Browne, 1979) is an extension of factor analysis that adapts it to two sets of variables x ∈ R^d, y ∈ R^d (i.e. embeddings of two languages). In this setting it is assumed that pairs of observations are generated by a shared latent variable z:

p(z) = N(z; 0, I),
p(x|z) = N(x; W_x z + µ_x, Ψ_x),    (1)
p(y|z) = N(y; W_y z + µ_y, Ψ_y).

As in traditional factor analysis, we seek to estimate the parameters that maximise the marginal likelihood

arg max_{W_x, W_y, Ψ_x ≻ 0, Ψ_y ≻ 0} ∏_k p(x_k, y_k | W_x, W_y, Ψ_x, Ψ_y),    (2)

where the joint marginal is p(x, y) = N([x; y]; [µ_x; µ_y], WW^T + Ψ), with W = [W_x; W_y], Ψ = blockdiag(Ψ_x, Ψ_y), and Ψ ≻ 0 meaning Ψ is positive definite. Maximising the likelihood as in Equation 2 finds the optimal parameters for the generative process described in Figure 1, where one latent z is responsible for generating a pair x, y. This makes it a suitable objective for aligning the vector spaces of x and y in the latent space. In contrast to the discriminative directed methods in (Mikolov et al., 2013a; Smith et al., 2017; Xing et al., 2015), IBFA has the capacity to model noise.

Figure 2: Graphical model for MBFA. The latent space z represents the aligned shared space between the multiple vector spaces {x_j}, j = 1, ..., v.
We can re-interpret the logarithm of Equation 2 (as shown in Appendix D) as

log p(x_k, y_k) = log N(x_k; µ_x, W_x W_x^T + Ψ_x) + log N(y_k; W_y E_p(z|x_k)[z] + µ_y, Σ_y|x).

The exact expression for Σ_y|x is given in the same appendix. This interpretation shows that, for each pair of points, the objective is to minimise the reconstruction errors of x and y given a projection into the latent space, E_p(z|x_k)[z]. By the symmetry of Equation 2, the converse holds as well: maximising the joint probability also minimises the reconstruction loss given the latent projections E_p(z|y_k)[z]. This forces the latent embeddings of x_k and y_k to be close in the latent space, which provides intuition as to why embedding into this common latent space is a good alignment procedure.
In (Browne, 1979; Bach and Jordan, 2005) it is shown that the maximum likelihood estimates of {Ψ_i, W_i} can be attained in closed form:

W_x = Σ_xx U_x P^{1/2},  W_y = Σ_yy U_y P^{1/2},
Ψ_x = Σ_xx − W_x W_x^T,  Ψ_y = Σ_yy − W_y W_y^T,    (3)

where Σ_xx, Σ_yy are the sample covariances of the two views, U_x, U_y contain the first k canonical correlation directions, and P is the diagonal matrix of the corresponding canonical correlations. The projections into the latent space from x are given by (as proved in Appendix B)

E_p(z|x)[z] = W_x^T (W_x W_x^T + Ψ_x)^{-1} (x − µ_x).    (4)

Evaluated at the MLE, (Bach and Jordan, 2005) show that Equation 4 reduces to

E_p(z|x)[z] = P^{1/2} U_x^T (x − µ_x).    (5)
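The closed-form fit can be sketched in a few lines of numpy. The snippet below follows the CCA-based construction of (Bach and Jordan, 2005) under the common choice M_i = P^{1/2}; function and variable names are ours, and the `eps` jitter is a numerical-stability assumption rather than part of the model.

```python
import numpy as np

def fit_ibfa(X, Y, k, eps=1e-8):
    """Closed-form MLE for IBFA via CCA (Bach & Jordan, 2005).

    X, Y: paired (n, d) views. Returns the loading matrices and
    projectors onto the posterior means E[z|x], E[z|y] at the MLE.
    """
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n

    def inv_sqrt(S):
        vals, vecs = np.linalg.eigh(S)
        return vecs @ np.diag(vals ** -0.5) @ vecs.T

    # Whitened cross-covariance; its SVD yields the canonical
    # directions U_x, U_y and canonical correlations P.
    Wh_x, Wh_y = inv_sqrt(Sxx), inv_sqrt(Syy)
    U, P, Vt = np.linalg.svd(Wh_x @ Sxy @ Wh_y)
    U1, U2, P = Wh_x @ U[:, :k], Wh_y @ Vt.T[:, :k], P[:k]

    # MLE loadings: W_i = S_ii U_i P^{1/2}  (Equation 3).
    Wx = Sxx @ U1 * np.sqrt(P)
    Wy = Syy @ U2 * np.sqrt(P)
    # Posterior means at the MLE (Equation 5): E[z|x] = P^{1/2} U_x^T (x - mu_x).
    proj_x = lambda Xq: (Xq - mu_x) @ U1 * np.sqrt(P)
    proj_y = lambda Yq: (Yq - mu_y) @ U2 * np.sqrt(P)
    return Wx, Wy, proj_x, proj_y
```

On synthetic paired views generated from a shared latent variable, the two projections land close together for paired points and far apart for random pairs, which is exactly the alignment behaviour argued for above.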

Multiple-Battery Factor Analysis
Multiple-Battery Factor Analysis (MBFA) (McDonald, 1970; Browne, 1980) is a natural extension of IBFA that models more than two views of observables (i.e. multiple languages), as shown in Figure 2.
Formally, for a set of views {x_1, ..., x_v}, we can write the model as

p(z) = N(z; 0, I),    p(x_j|z) = N(x_j; W_j z + µ_j, Ψ_j),  j = 1, ..., v.

Similar to IBFA, the projections to the latent space are given by Equation 4, and the marginal takes a very similar form: the concatenation m = [x_1; ...; x_v] is marginally N(m; µ, WW^T + Ψ), with W = [W_1; ...; W_v] and block-diagonal Ψ = blockdiag(Ψ_1, ..., Ψ_v). Unlike for IBFA, a closed-form solution maximising the marginal likelihood of MBFA is unknown. Because of this, we resort to iterative approaches as in (Browne, 1980), such as the natural extension of the EM algorithm proposed by (Bach and Jordan, 2005). Defining G = (I + W_t^T Ψ_t^{-1} W_t)^{-1} and B = G W_t^T Ψ_t^{-1}, the EM updates are given by

W_{t+1} = S B^T (G + B S B^T)^{-1},
Ψ_{t+1} = blockdiag(S − W_{t+1} B S),

where S is the sample covariance matrix of the concatenated views (derivation provided in Appendix E). (Browne, 1980) shows that the MLE of the parameters of MBFA is uniquely identifiable (up to a rotation that does not affect the method's performance). We observed this in an empirical study: the solutions we converge to are always a rotation away from each other, irrespective of the parameters' initialisation. This strongly suggests that any optimum is a global optimum, and we therefore restrict ourselves to reporting results obtained from a single initialisation, chosen as in Equation (3.25) of (Browne, 1980).

Multilingual Factor Analysis
We coin the term Multilingual Factor Analysis for the application of methods based on IBFA and MBFA to model the generation of multilingual tuples from a shared latent space. We motivate our generative process with the compound model for language association presented by (Weinreich, 1953). In this model a lexical meaning entity (a concept) is responsible for associating the corresponding words in the two different languages. We note that the structure in Figure 3 is very similar to our graphical model for IBFA specified in Figure 1. We can interpret our latent variable as the latent lexical concept responsible for associating (generating) the multilingual language pairs. Most theories that explain the interconnections between languages in the bilingual mind assume that "while phonological and morphosyntactic forms differ across languages, meanings and/or concepts are largely, if not completely, shared" (Pavlenko, 2009). This shows that our generative modelling is supported by established models of language interconnectedness in the bilingual mind.
Intuitively, our approach can be summarised as transforming monolingual representations by mapping them to a concept space in which lexical meaning across languages is aligned and then performing retrieval, translation and similarity-based tasks in that aligned concept space.

Comparison to Direct Methods
Methods that learn a direct linear transformation from x to y, such as (Mikolov et al., 2013a; Artetxe et al., 2016; Smith et al., 2017; Lample et al., 2018), can also be interpreted as maximising the conditional likelihood

p(y|x) = N(y; Wx + µ, Ψ).

As shown in Appendix F, the maximum likelihood estimate of W does not depend on the noise term Ψ. In addition, even if one were to fit Ψ, it is not clear how to utilise it at prediction time, as the conditional expectation E[y|x] = Wx + µ does not depend on the noise parameters. As this method is therefore not robust to noise, previous work has used extensive regularisation (e.g. constraining W to be orthogonal) to avoid overfitting.
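The Ψ-independence of the MLE can be checked numerically. The sketch below (our own construction, on synthetic data) fits W by ordinary least squares and by a whitened "generalized" least-squares regression under an arbitrary positive-definite Ψ; the two solutions coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 6
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, d)) + rng.normal(size=(n, d))

# Ordinary least squares: W solves the normal equations  X^T X W^T = X^T Y.
W_ols = np.linalg.solve(X.T @ X, X.T @ Y).T

# Generalized least squares with an arbitrary positive-definite Psi:
# minimise sum_k (y_k - W x_k)^T Psi^{-1} (y_k - W x_k).  Whitening the
# targets by Psi^{-1/2} reduces this to OLS, and mapping back recovers
# the same W whatever Psi is.
A = rng.normal(size=(d, d))
Psi = A @ A.T + d * np.eye(d)
L = np.linalg.cholesky(Psi)
Yw = np.linalg.solve(L, Y.T).T                # whitened targets  Y L^{-T}
W_gls = np.linalg.solve(X.T @ X, X.T @ Yw).T  # W in the whitened space
W_gls = L @ W_gls                             # map back to the original space
print(np.allclose(W_ols, W_gls))  # True
```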

Relation to CCA
CCA is a popular method for multilingual alignment which is very closely related to IBFA, as detailed in (Bach and Jordan, 2005). (Barber, 2012) shows that CCA can be recovered as a limiting case of IBFA with constrained diagonal covariances Ψ_x = σ_x^2 I, Ψ_y = σ_y^2 I, as σ_x^2, σ_y^2 → 0. CCA thus assumes that the emissions from the latent space to the observables are deterministic. This is a strong and unrealistic assumption, given that word embeddings are learned from noisy corpora with stochastic learning algorithms.

Experiments
In this section, we empirically demonstrate the effectiveness of our generative approach on several benchmarks and compare it with state-of-the-art methods. We first present cross-lingual (word-translation) evaluation tasks to assess the quality of our multilingual word embeddings. As a follow-up to the word retrieval task, we also run experiments on cross-lingual sentence retrieval. We further demonstrate the quality of our multilingual word embeddings on monolingual word- and sentence-level similarity tasks from (Faruqui and Dyer, 2014b), which we believe provides empirical evidence that the aligned embeddings preserve, and potentially even enhance, their monolingual quality.

Word Translation
This task is concerned with retrieving the translations of a given set of source words. We reproduce results in the same environment as (Lample et al., 2018) for a fair comparison. We perform an ablation study to assess the effectiveness of our method in the Italian to English (it-en) setting of (Smith et al., 2017; Dinu et al., 2014).
In these experiments we are interested in studying the effectiveness of our method compared to that of the Procrustes-based fitting used in (Smith et al., 2017), without any post-processing steps to address the hubness problem (Dinu et al., 2014).
In Table 1 we observe that our model is competitive with the results in (Lample et al., 2018) and outperforms them in most cases. Given an expert dictionary, our method performs best among all compared methods on all tasks except English to Russian (en-ru) translation, where it remains competitive. Notably, in the semi-supervised setting, IBFA closes the gap to the method proposed in (Lample et al., 2018) on languages where the dictionary of identical tokens across languages (i.e. the pseudo-dictionary from (Smith et al., 2017)) is richer. However, even though it significantly outperforms SVD using the pseudo-dictionary, it cannot match the performance of the adversarial approach for more distant language pairs like English and Chinese (en-zh).

Detailed Comparison to Basic SVD
We present a more detailed comparison to the SVD method described in (Smith et al., 2017). We focus on the methods in their base form, that is, without post-processing techniques such as cross-domain similarity local scaling (CSLS). We significantly outperform both SVD and CCA, especially when using the pseudo-dictionaries.
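For reference, CSLS itself is straightforward to implement. The sketch below is our own, following the definition in (Lample et al., 2018); it penalises "hub" vectors that are close to everything, which mitigates the hubness problem in nearest-neighbour retrieval.

```python
import numpy as np

def csls_retrieve(src, tgt, k=10):
    """CSLS retrieval:  CSLS(x, y) = 2 cos(x, y) - r_T(x) - r_S(y),
    where r_T(x) is the mean cosine of x to its k nearest target
    neighbours (and r_S(y) symmetrically for the source side)."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                              # (n_src, n_tgt)
    r_src = np.sort(cos, axis=1)[:, -k:].mean(1)   # r_T(x) per source
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(0)   # r_S(y) per target
    csls = 2 * cos - r_src[:, None] - r_tgt[None, :]
    return csls.argmax(axis=1)                     # best target per source
```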

Word Similarity Tasks
This task assesses the monolingual quality of word embeddings. In this experiment, we fit both considered methods (CCA and IBFA) on the entire available dictionary of around 100k word pairs. We compare to CCA as used in (Faruqui and Dyer, 2014b). These tasks consist of English word pairs that have been assigned ground-truth similarity scores by humans. We use the test-suite provided by (Faruqui and Dyer, 2014a) (https://github.com/mfaruqui/eval-word-vectors) to evaluate our multilingual embeddings on these datasets. The test-suite computes word similarity as cosine similarity in the representation space and reports Spearman correlation with the human similarity scores. As shown in Table 4, we observe a performance gain over both CCA and the monolingual word embeddings, suggesting that we not only preserve the monolingual quality of the embeddings but also enhance it.

Monolingual Sentence Similarity Tasks
Semantic Textual Similarity (STS) is a standard benchmark used to assess sentence similarity metrics (Agirre et al., 2012, 2013, 2014, 2015, 2016). In this work, we use it to show that our alignment procedure does not degrade the quality of the embeddings at the sentence level. For both IBFA and CCA, we align English and one other language (French, Spanish, or German) using the entire dictionaries (of about 100k word pairs each) provided by (Lample et al., 2018). We then use the procedure defined in (Arora et al., 2016) to create sentence embeddings, and use cosine similarity between those embeddings as the sentence similarity. The method's performance on each set of embeddings is assessed using Spearman correlation with human-produced expert similarity scores. As evidenced by the results shown in Table 5, IBFA remains competitive using any of the three languages considered, while CCA shows a performance decrease.

Crosslingual Sentence Similarity Tasks
Europarl (Koehn, 2005) is a parallel corpus of sentences taken from the proceedings of the European parliament. In this set of experiments, we focus on its English-Italian (en-it) sub-corpus in order to compare to previous methods. We report results under the framework of (Lample et al., 2018): sentence embeddings are formed as the average of the tf-idf weighted word embeddings in the bag-of-words representation of the sentence. Performance is averaged over 2,000 randomly chosen source sentence queries and 200k target sentences for each language pair. Note that this is a different setup from the one in (Smith et al., 2017), in which an unweighted average is used. The results are reported in Table 6. As we can see, IBFA outperforms all prior methods, both with nearest neighbour retrieval, where it gains 20 percent absolute over SVD, and with the CSLS retrieval metric.
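The tf-idf weighted bag-of-words sentence embedding used above can be sketched as follows; the whitespace tokenisation and the particular idf formula here are simple illustrative choices, not necessarily those of the original evaluation framework.

```python
import numpy as np
from collections import Counter
from math import log

def tfidf_sentence_embeddings(sentences, word_vecs):
    """Sentence embeddings as tf-idf weighted averages of word vectors
    over the bag-of-words representation.  `word_vecs` maps token ->
    vector; tokens without a vector are skipped."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))   # document frequencies
    dim = len(next(iter(word_vecs.values())))
    embs = np.zeros((n, dim))
    for i, d in enumerate(docs):
        tf = Counter(d)
        weight_sum = 0.0
        for w, c in tf.items():
            if w not in word_vecs:
                continue
            weight = c * log(n / df[w])             # tf-idf weight
            embs[i] += weight * word_vecs[w]
            weight_sum += weight
        if weight_sum > 0:
            embs[i] /= weight_sum                   # weighted average
    return embs
```

Retrieval then reduces to cosine similarity (or CSLS) between source and target sentence embeddings.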

Alignment of three languages
In an ideal scenario with v languages, we would not want to train a transformation between each pair, as that would involve storing O(v^2) matrices. One way to overcome this problem is to align all embeddings to a common space. In this exploratory experiment, we constrain ourselves to aligning three languages at the same time, but the same methodology applies to an arbitrary number of languages. MBFA, the extension of IBFA described in Section 2.2.1, naturally lends itself to this task. Training it requires a dictionary of word triples across the three languages considered. We construct such a dictionary by taking the intersection of all 6 pairs of bilingual dictionaries for the three languages provided by (Lample et al., 2018). We then train MBFA for 20,000 iterations of EM (a brief analysis of convergence is provided in Appendix G). Alternatively, with direct methods like (Smith et al., 2017; Lample et al., 2018), one could align all languages to English and treat that as the common space.
We compare both approaches and present their results in Table 7. Both methods experience a decrease in overall performance compared to models fitted on just a pair of languages; however, MBFA performs better overall. The direct approaches preserve their performance on translation to and from English, but translation from French to Italian degrades significantly. Meanwhile, MBFA suffers a decrease on each pair of languages, but retains competitive performance with the direct methods on translation involving English. It is worth noting that as the number of aligned languages v increases, there are O(v) pairs of languages involving English and O(v^2) pairs in which English does not participate. This suggests that MBFA may generalise better than the direct methods beyond three simultaneously aligned languages.

Generating Random Word Pairs
We explore the generative process of IBFA by synthesising word pairs from noise, using a trained English-Spanish IBFA model. We follow the generative process specified in Equation 1 to generate 2,000 word vector pairs, then find the nearest neighbour of each vector in the corresponding vocabulary and display the resulting words. We rank these 2,000 pairs according to their joint probability under the model and present the top 28 samples in Table 8. Among the generated pairs are dreadful and despair; frightening and brutality; crazed and merry; unrealistic and questioning; misguided and conceal; reactionary and conservatism.
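The sampling procedure can be sketched as below, assuming a fitted IBFA model; all parameter and vocabulary names are illustrative, not the paper's code.

```python
import numpy as np

def sample_word_pairs(Wx, Wy, mu_x, mu_y, Psi_x, Psi_y,
                      vocab_x, vecs_x, vocab_y, vecs_y,
                      n_samples=2000, seed=0):
    """Sample pairs from the IBFA generative process (Equation 1) and
    decode each vector to its nearest vocabulary word."""
    rng = np.random.default_rng(seed)
    k = Wx.shape[1]
    Z = rng.normal(size=(n_samples, k))                       # z ~ N(0, I)
    Ex = rng.multivariate_normal(np.zeros(len(mu_x)), Psi_x, n_samples)
    Ey = rng.multivariate_normal(np.zeros(len(mu_y)), Psi_y, n_samples)
    X = Z @ Wx.T + mu_x + Ex                                  # x | z
    Y = Z @ Wy.T + mu_y + Ey                                  # y | z

    # Nearest-neighbour decoding by inner product against unit-norm
    # vocabulary vectors.
    ix = np.argmax(X @ vecs_x.T, axis=1)
    iy = np.argmax(Y @ vecs_y.T, axis=1)
    return [(vocab_x[i], vocab_y[j]) for i, j in zip(ix, iy)]
```

Ranking the decoded pairs by their joint probability under the fitted marginal then surfaces the most model-plausible samples.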

Conclusion
We have introduced a cross-lingual embedding alignment procedure based on a probabilistic latent variable model, that increases performance across various tasks compared to previous methods using both nearest neighbour retrieval, as well as the CSLS criterion. We have shown that the resulting embeddings in this aligned space preserve their quality by presenting results on tasks that assess word and sentence-level monolingual similarity correlation with human scores. The resulting embeddings also significantly increase the precision of sentence retrieval in multilingual settings. Finally, the preliminary results we have shown on aligning more than two languages at the same time provide an exciting path for future research.

A Joint Distribution
We show the form of the joint distribution for two views. Concatenating our data and parameters as

m = [x; y],  µ = [µ_x; µ_y],  W = [W_x; W_y],  Ψ = blockdiag(Ψ_x, Ψ_y),

we can use Equation (3) of (Ghahramani et al., 1996) to write the joint distribution over observations and latents as

p(m, z) = N([m; z]; [µ; 0], [[WW^T + Ψ, W], [W^T, I]]).

C Derivation for the Marginal Likelihood
We want to compute p(x, y|θ) so that we can learn the parameters θ = {θ_x, θ_y}, θ_i = {µ_i, W_i, Ψ_i}, by maximising the marginal likelihood as is done in factor analysis. From the joint p(m, z|θ), again using the rules from (Petersen et al., 2008), Section (8.1.2), we get

p(m|θ) = N(m; µ, WW^T + Ψ).

For the case of two views, the joint probability can be factored as

p(x, y|θ) = ∫ p(x|z, θ_x) p(y|z, θ_y) p(z) dz,

which we can re-parametrise as a single factor-analysis model over the concatenation m = [x; y], with W = [W_x; W_y] and Ψ = blockdiag(Ψ_x, Ψ_y). This follows the same form as regular factor analysis, but with a block-diagonal constraint on Ψ. Thus, by Equations (5) and (6) of (Ghahramani et al., 1996), we apply EM as follows.

E-step: Compute E[z|m] and E[zz^T|m] given the parameters θ_t = {W_t, Ψ_t}:

E[z|m] = B(m − µ),    E[zz^T|m] = G + E[z|m] E[z|m]^T,

where G = (I + W_t^T Ψ_t^{-1} W_t)^{-1} and B = G W_t^T Ψ_t^{-1}.

M-step: Update the parameters as follows:

W_{t+1} = (Σ_k (m_k − µ) E[z|m_k]^T) (Σ_k E[zz^T|m_k])^{-1}.

Imposing the block-diagonal constraint,

Ψ_{t+1} = blockdiag(S − W_{t+1} B S),

where S is the sample covariance matrix of the concatenated views.

F Independence to Noise in Direct Methods
We are maximising the following quantity with respect to θ = {W, µ, Ψ}:

log p(Y|X, θ) = Σ_k log N(y_k; W x_k + µ, Ψ).

The maximum likelihood is achieved when

∂ log p(Y|X, θ) / ∂W = Ψ^{-1} Σ_k (y_k − W x_k − µ) x_k^T = 0,

and since Ψ^{-1} is invertible (its inverse being Ψ), this implies

Σ_k (y_k − W x_k − µ) x_k^T = 0.

It is clear from here that the MLE of W does not depend on Ψ, and thus we can conclude that adding a noise parameter to this directed linear model has no effect on its predictions.

Figure 4: Training curve of the EM algorithm over the first 5,000 iterations. The procedure quickly finds a good approximation to the optimal parameters and then slowly converges to the true optimum. The top panel shows the entire training curve; the bottom panel starts from iteration 100.
G Learning curve of EM

Figure 4 shows the negative log-likelihood of the three-language model over the first 5,000 iterations. The precision of the learned model is very close when evaluated at iteration 1,000 and at iteration 20,000, as seen in Table 9. This suggests that the model need not be trained to full convergence to work well.