Learning Geometric Word Meta-Embeddings

We propose a geometric framework for learning meta-embeddings of words from different embedding sources. Our framework transforms the embeddings into a common latent space where, for example, simple averaging of the different embeddings (of a given word) becomes more meaningful. The proposed latent space arises from two geometric transformations - orthogonal rotations and Mahalanobis metric scaling. Empirical results on several word similarity and word analogy benchmarks illustrate the efficacy of the proposed framework.


Introduction
Word embeddings have become an integral part of modern NLP. They capture semantic and syntactic similarities and are typically used as features in training NLP models for diverse tasks like named entity tagging, sentiment analysis, and classification, to name a few. Word embeddings are learnt in an unsupervised manner from large text corpora, and a number of pre-trained embeddings are readily available. The quality of the word embeddings, however, depends on various factors like the size and genre of the training corpora as well as the training method used. This has led to ensemble approaches for creating meta-embeddings from different original embeddings (Yin and Schütze, 2016; Coates and Bollegala, 2018; Bao and Bollegala, 2018; O'Neill and Bollegala, 2020). Meta-embeddings are appealing because: (a) they can improve the quality of embeddings on account of noise cancellation and the diversity of data sources and algorithms, (b) they require no retraining of the underlying models, (c) they can be built even when the original corpus is unavailable, and (d) they may increase vocabulary coverage.
Various approaches have been proposed to learn meta-embeddings, and they can be broadly classified into two categories: (a) simple linear methods like averaging or concatenation, or a low-dimensional projection via singular value decomposition (Yin and Schütze, 2016; Coates and Bollegala, 2018), and (b) non-linear methods that aim to learn meta-embeddings as a shared representation using auto-encoders or transformations between a common representation and each embedding set (Muromägi et al., 2017; Bao and Bollegala, 2018; O'Neill and Bollegala, 2020).
In this work, we focus on simple linear methods such as averaging and concatenation for computing meta-embeddings, which are very easy to implement and have shown highly competitive performance (Yin and Schütze, 2016; Coates and Bollegala, 2018). Due to the nature of the underlying embedding-generation algorithms (Mikolov et al., 2013; Pennington et al., 2014), correspondences between dimensions, e.g., of two embeddings x ∈ R^d and z ∈ R^d of the same word, are usually not known. Hence, averaging may be detrimental in cases where the dimensions are negatively correlated. Consider the scenario where z := −x.
Here, simple averaging of x and z would result in the zero vector. Similarly, when z is a (dimension-wise) permutation of x, simple averaging would result in a worse meta-embedding than averaging properly aligned embeddings. Therefore, we propose to align the embeddings (of a given word) as an important first step towards generating meta-embeddings.
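These two failure modes can be illustrated with a toy example (the 3-dimensional vectors below are hypothetical, not taken from any real embedding set):

```python
import numpy as np

# Two hypothetical 3-dimensional embeddings of the same word.
x = np.array([0.5, -1.0, 2.0])

# Case 1: the second embedding is a negated copy of the first.
# Plain averaging cancels all information, yielding the zero vector.
z_neg = -x
avg_neg = (x + z_neg) / 2

# Case 2: the second embedding is a dimension-wise permutation of the first.
perm = np.array([2, 0, 1])
z_perm = x[perm]                    # z_perm[i] = x[perm[i]]

naive_avg = (x + z_perm) / 2        # mixes unrelated dimensions
aligned = z_perm[np.argsort(perm)]  # undo the permutation: recovers x
aligned_avg = (x + aligned) / 2     # averaging aligned copies recovers x
```

In Case 2 the alignment is a known permutation; in practice the correspondence between dimensions is unknown, which is why the paper learns the alignment instead.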
To this end, we develop a geometric framework for learning meta-embeddings by aligning different embeddings in a common latent space, where the dimensions of different embeddings (of a given word) are in coherence. Mathematically, we apply different orthogonal transformations to the source embeddings to learn a latent space, along with a Mahalanobis metric that scales the features appropriately. The meta-embeddings are subsequently learned in the latent space, e.g., using averaging or concatenation. Empirical results on word similarity and word analogy tasks show that the proposed geometrically aligned meta-embeddings outperform strong baselines such as the plain averaging and plain concatenation models.

Proposed Geometric Modeling
Consider two (monolingual) embeddings x_i ∈ R^d and z_i ∈ R^d of a given word i in a d-dimensional space. As discussed earlier, embeddings generated by different algorithms (Mikolov et al., 2013; Pennington et al., 2014) may express different characteristics (of the same word). Hence, the goal of learning a meta-embedding w_i (corresponding to word i) is to generate a representation that inherits the properties of the different source embeddings (e.g., x_i and z_i).
Our framework imposes orthogonal transformations on the given source embeddings to enable alignment. In this latent space, we additionally induce a Mahalanobis metric to incorporate feature-correlation information (Jawanpuria et al., 2019). The Mahalanobis similarity generalizes the cosine similarity measure that is commonly used for evaluating relatedness between word embeddings. The combination of orthogonal transformations and Mahalanobis metric learning makes it possible to capture any affine relationship between the different available source embeddings of a given word (Bonnabel and Sepulchre, 2009; Mishra et al., 2014).
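A small numerical check of this generalization (illustrative only; the U, V, and B below are random, not learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
x, z = rng.standard_normal(d), rng.standard_normal(d)

# Random orthogonal transformations via QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))

# A random symmetric positive-definite Mahalanobis metric B.
A = rng.standard_normal((d, d))
B = A @ A.T + d * np.eye(d)

# Matrix square root of B via its eigen-decomposition.
w, Q = np.linalg.eigh(B)
B_half = Q @ np.diag(np.sqrt(w)) @ Q.T

# Mahalanobis similarity in the latent space ...
sim = (U @ x) @ B @ (V @ z)
# ... equals the plain scalar product of the B^{1/2}-transformed vectors.
sim_latent = (B_half @ U @ x) @ (B_half @ V @ z)

# With U = V = B = I, it reduces to the ordinary dot product.
sim_plain = (np.eye(d) @ x) @ np.eye(d) @ (np.eye(d) @ z)
```

Choosing identity transformations thus recovers the standard (unnormalized) cosine-style similarity between the raw embeddings.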
Overall, we formulate the problem of learning the geometric transformations - the orthogonal rotations and the metric scaling - via a binary classification problem. The meta-embeddings are subsequently computed using these transformations. The following sections formalize the proposed latent space and meta-embedding models.

Learning the Latent Space
In this section, we learn the latent space using geometric transformations.
Let U ∈ M_d and V ∈ M_d be orthogonal transformations for the embeddings x_i and z_i, respectively, for all words i. Here M_d represents the set of d × d orthogonal matrices. The aligned embeddings in the latent space corresponding to x_i and z_i can then be expressed as Ux_i and Vz_i, respectively. We next induce the Mahalanobis metric B in this (aligned) latent space, where B is a d × d symmetric positive-definite matrix. In this latent space, the similarity between the two embeddings x_i and z_i is given by the expression (Ux_i)^⊤B(Vz_i). An equivalent interpretation is that (Ux_i)^⊤B(Vz_i) boils down to the standard scalar product (cosine similarity) between B^{1/2}Ux_i and B^{1/2}Vz_i, since (Ux_i)^⊤B(Vz_i) = (B^{1/2}Ux_i)^⊤(B^{1/2}Vz_i).

The orthogonal transformations as well as the Mahalanobis metric are learned via the following binary classification problem: pairs of word embeddings {x_i, z_i} of the same word i belong to the positive class, while pairs {x_i, z_j} belong to the negative class (for i ≠ j). We take the similarity between the two embeddings in the latent space as the decision function of this binary classification problem. Let X = [x_1, . . . , x_n] ∈ R^{d×n} and Z = [z_1, . . . , z_n] ∈ R^{d×n} be the word embedding matrices for n words, where the columns correspond to different words. In addition, let Y denote the label matrix, with Y_ii = 1 for i = 1, . . . , n and Y_ij = 0 for i ≠ j. The proposed optimization problem employs the square loss function, which is simple to optimize:

min_{U,V ∈ M_d, B ≻ 0} ‖Y − (UX)^⊤B(VZ)‖^2 + C‖B‖^2,    (1)

where ‖·‖ is the Frobenius norm and C > 0 is the regularization parameter.
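A minimal NumPy sketch of this square-loss objective (the function name and the default value of C are illustrative; the constraints on U, V, and B are enforced by the optimizer, not by this function):

```python
import numpy as np

def meta_alignment_loss(U, V, B, X, Z, C=0.1):
    """Square-loss objective: matched pairs (x_i, z_i) should score 1,
    mismatched pairs (x_i, z_j), i != j, should score 0."""
    n = X.shape[1]
    Y = np.eye(n)                  # Y_ii = 1, Y_ij = 0 for i != j
    S = (U @ X).T @ B @ (V @ Z)    # S_ij = (U x_i)^T B (V z_j)
    return np.sum((Y - S) ** 2) + C * np.sum(B ** 2)
```

At the trivial point U = V = B = I with X = Z, the score matrix S is the Gram matrix of the embeddings, and the loss measures how far it is from the identity labels plus the metric regularizer.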

Averaging and Concatenation in Latent Space
Meta-embeddings constructed by averaging or concatenating the given word embeddings have been shown to obtain highly competitive performance (Yin and Schütze, 2016; Coates and Bollegala, 2018). Hence, we propose to learn meta-embeddings as averaging or concatenation in the learned latent space.

Geometry-Aware Averaging
The meta-embedding w_i of a word i is generated as an average of the (aligned) word embeddings in the latent space. The latent space representation of x_i, as a function of the orthogonal transformation U and the metric B, is B^{1/2}Ux_i; similarly, z_i is represented as B^{1/2}Vz_i. The meta-embedding is therefore computed as w_i = (B^{1/2}Ux_i + B^{1/2}Vz_i)/2. It should be noted that the proposed geometry-aware averaging approach generalizes (Coates and Bollegala, 2018), which is recovered as a particular case of our framework by choosing U, V, and B as identity matrices. Our framework easily generalizes to the case of more than two source embeddings, by learning a different orthogonal transformation for each source embedding and a common Mahalanobis metric.
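A sketch of geometry-aware averaging, assuming learned U, V, and a symmetric positive-definite B are given (the function name is illustrative):

```python
import numpy as np

def geo_avg(x, z, U, V, B):
    """Geometry-aware average: mean of the aligned latent representations
    B^{1/2} U x and B^{1/2} V z."""
    w, Q = np.linalg.eigh(B)                # B is symmetric positive-definite
    B_half = Q @ np.diag(np.sqrt(w)) @ Q.T  # matrix square root of B
    return 0.5 * (B_half @ U @ x + B_half @ V @ z)
```

With U, V, and B set to identity matrices, this reduces to the plain averaging of Coates and Bollegala (2018), matching the special case noted above.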

Geometry-Aware Concatenation
We next propose to concatenate the aligned embeddings in the learned latent space. For a given word i, with x_i and z_i as the source embeddings, the meta-embedding w_i learned by the proposed geometry-aware concatenation model is w_i = concatenation(B^{1/2}Ux_i, B^{1/2}Vz_i).
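Analogously to averaging, a sketch of geometry-aware concatenation (again assuming learned U, V, B; note the output is 2d-dimensional):

```python
import numpy as np

def geo_conc(x, z, U, V, B):
    """Geometry-aware concatenation of the aligned latent representations."""
    w, Q = np.linalg.eigh(B)
    B_half = Q @ np.diag(np.sqrt(w)) @ Q.T
    return np.concatenate([B_half @ U @ x, B_half @ V @ z])
```

With identity transformations this reduces to the plain concatenation baseline of Yin and Schütze (2016).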

Optimization
The proposed optimization problem (1) employs a square loss function and ℓ2-norm regularization, both of which are well studied in the literature. In addition, the problem involves optimization over smooth constraint sets, namely the set of symmetric positive-definite matrices and the set of orthogonal matrices. Such sets have a well-known Riemannian manifold structure (Lee, 2003) that enables computationally efficient iterative optimization algorithms. We employ the popular Riemannian optimization framework (Absil et al., 2008) to solve (1). Recently, Jawanpuria et al. (2019) studied a similar optimization problem in the context of learning cross-lingual word embeddings.
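The objective of (1) can be evaluated without materializing the n × n score matrix, by expanding ‖Y − M‖² = ‖Y‖² − 2⟨Y, M⟩ + ‖M‖² and using the cyclic property of the trace; this is what makes an O(nd² + d³) evaluation possible. A sketch of this identity, with a naive version included only to verify it (function names are illustrative):

```python
import numpy as np

def loss_naive(U, V, B, X, Z, C):
    """Direct evaluation: forms the full n x n score matrix."""
    n = X.shape[1]
    M = (U @ X).T @ B @ (V @ Z)
    return np.sum((np.eye(n) - M) ** 2) + C * np.sum(B ** 2)

def loss_fast(U, V, B, X, Z, C):
    """Same value in O(n d^2 + d^3) time: M is never formed."""
    n = X.shape[1]
    P = B @ (U @ X)       # d x n, so that M = P^T Q
    Q = V @ Z             # d x n
    # <Y, M> = tr(M) = sum_i P[:, i] . Q[:, i]
    tr_M = np.einsum('ki,ki->', P, Q)
    # ||M||^2 = tr(M^T M) = tr((P P^T)(Q Q^T)), a product of d x d matrices
    norm_M2 = np.sum((P @ P.T) * (Q @ Q.T))
    # ||Y||^2 = n since Y has ones exactly on the diagonal
    return n - 2.0 * tr_M + norm_M2 + C * np.sum(B ** 2)
```

This kind of rewriting is a standard trick for least-squares objectives; the cost then scales linearly in the vocabulary size n, consistent with the implementation cost quoted below.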
Our implementation is done using the Pymanopt toolbox (Townsend et al., 2016), a publicly available Python toolbox for Riemannian optimization algorithms. In particular, we use the conjugate gradient algorithm of Pymanopt. For this, we need only supply the objective function of (1). This can be done efficiently, as the numerical cost of computing the objective function is O(nd^2). The overall computational cost of our implementation scales linearly with the number of words in the vocabulary sets.

Table 1: Generalization performance of the meta-embedding algorithms on the Word Similarity (WS) and Word Analogy (WA) tasks. The columns 'Avg.(WS)' and 'Avg.(WA)' correspond to the average performance on the WS and the WA tasks, respectively. The rows marked 'indv.' correspond to the performance of the individual source embeddings CBOW, GloVe, and fastText. The rows marked 'Gl.+CB.' correspond to the performance of meta-embedding algorithms with GloVe and CBOW embeddings as input. Similarly, 'Gl.+fa.' corresponds to GloVe and fastText embeddings, and 'CB.+fa.' to CBOW and fastText embeddings. A meta-embedding result is highlighted if it obtains the best result on a dataset when compared with the corresponding source embeddings as well as other meta-embedding algorithms employing the same source embeddings. We observe that the best overall performance on both tasks, word similarity and word analogy, is obtained by the proposed geometry-aware models for every pair of input source embeddings.

Experiments
In this section, we evaluate the performance of the proposed meta-embedding models.

Evaluation Tasks and Datasets
We consider the following standard evaluation tasks (Yin and Schütze, 2016; Coates and Bollegala, 2018): • Word similarity: in this task, we compare the human-annotated similarity scores between pairs of words with the corresponding cosine similarity computed via the constructed meta-embeddings. We report results on the following benchmark datasets: RG (Rubenstein and Goodenough, 1965), MC (Miller and Charles, 1991), WS (Finkelstein et al., 2001), MTurk (Halawi et al., 2012), RW (Luong et al., 2013), and SL (Hill et al., 2015). Following previous works (Yin and Schütze, 2016; Coates and Bollegala, 2018; O'Neill and Bollegala, 2020), we report the Spearman correlation score (higher is better) between the cosine similarity (computed via meta-embeddings) and the human scores.
• Word analogy: in this task, the aim is to answer questions of the form "A is to B as C is to ?" (Mikolov et al., 2013). After generating the meta-embeddings a, b, and c (corresponding to terms A, B, and C, respectively), the answer is chosen to be the term whose meta-embedding has the maximum cosine similarity with (b − a + c) (Mikolov et al., 2013). The benchmark datasets include MSR (Gao et al., 2014), GL (Mikolov et al., 2013), and SemEval (Jurgens et al., 2012). Following previous works (Yin and Schütze, 2016; Coates and Bollegala, 2018; O'Neill and Bollegala, 2020), we report the percentage of correct answers for the MSR and GL datasets, and the Spearman correlation score for SemEval. In both cases, a higher score implies better performance.
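The word-similarity protocol above can be sketched in pure NumPy as follows (the rank transform ignores ties, which is adequate for an illustration; function names are ours):

```python
import numpy as np

def _ranks(v):
    """Rank transform of a 1-d array (no tie handling)."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v), dtype=float)
    return r

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    ra, rb = _ranks(np.asarray(a, float)), _ranks(np.asarray(b, float))
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def word_similarity_score(meta, pairs, human_scores):
    """meta: dict word -> meta-embedding; pairs: list of (w1, w2) tuples."""
    model_scores = [cosine(meta[w1], meta[w2]) for w1, w2 in pairs]
    return spearman(model_scores, human_scores)
```

In practice one would use a library routine with proper tie handling (e.g. scipy.stats.spearmanr), but the computation is exactly this rank correlation.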
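Likewise, the analogy protocol can be sketched with hypothetical toy embeddings (the query words are excluded from the candidate set, as is standard practice):

```python
import numpy as np

def solve_analogy(emb, a, b, c):
    """Answer "a is to b as c is to ?" by maximizing cosine similarity
    with (b - a + c) over the vocabulary, excluding the query words."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = float(vec @ target / np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best
```

With the toy vocabulary below, "man is to king as woman is to ?" resolves to "queen", since king − man + woman coincides with the queen vector.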
We learn the meta-embeddings from the following publicly available 300-dimensional pre-trained word embeddings for English.
• GloVe (Pennington et al., 2014): has 1 917 494 word embeddings trained on 42B tokens of web data from Common Crawl.
The meta-embeddings are learned on the common set of words from different pairs of the source embeddings. The number of common words between various source embeddings pairs are as follows: 154 077 (GloVe ∩ CBOW), 552 168 (GloVe ∩ fastText), and 641 885 (CBOW ∩ fastText).

Results and Discussion
The performance of our geometry-aware averaging and concatenation models, henceforth termed Geo-AVG and Geo-CONC, respectively, is reported in Table 1. We also report the performance of the meta-embedding models AVG (Coates and Bollegala, 2018) and CONC (Yin and Schütze, 2016), which perform plain averaging and concatenation, respectively. In addition, we report the performance of the individual source embeddings (CBOW, GloVe, and fastText), which serves as a benchmark that meta-embedding algorithms should ideally surpass in order to justify their usage. We observe that the proposed geometry-aware models, Geo-AVG and Geo-CONC, outperform the individual source embeddings on all the datasets. The proposed models also easily surpass the AVG and CONC models on both the word similarity and the word analogy tasks. This shows that aligning the word embedding spaces with orthogonal rotations and the Mahalanobis metric improves the overall quality of the meta-embeddings.

Conclusion
We propose a geometric framework for learning meta-embeddings of words from various sources of word embeddings. Our framework aligns the embeddings in a common latent space. The importance of learning the latent space is shown on several benchmark datasets, where the proposed algorithms (Geo-AVG and Geo-CONC) outperform the plain averaging and plain concatenation models. Extending the proposed framework to generating sentence meta-embeddings remains a direction for future research.