Frustratingly Easy Meta-Embedding – Computing Meta-Embeddings by Averaging Source Word Embeddings

Creating accurate meta-embeddings from pre-trained source embeddings has received attention lately. Methods based on global and locally linear transformation, and on concatenation, have been shown to produce accurate meta-embeddings. In this paper, we show that the arithmetic mean of two distinct word embedding sets yields a meta-embedding that is comparable to, or better than, those produced by more complex meta-embedding learning methods. The result seems counter-intuitive, given that the vector spaces of different source embeddings are not directly comparable and therefore should not simply be averaged. We give insight into why averaging can still produce accurate meta-embeddings despite this incomparability of the source vector spaces.


Introduction
Distributed vector representations of words, henceforth referred to as word embeddings, have been shown to exhibit strong performance on a variety of NLP tasks (Turian et al., 2010; Zou et al., 2013). Methods for producing word embedding sets exploit the distributional hypothesis to infer semantic similarity between words from large bodies of text; in the process, they have been found to additionally capture more complex linguistic regularities, such as analogical relationships (Mikolov et al., 2013c). A variety of methods now exist for the production of word embeddings (Collobert and Weston, 2008; Mnih and Hinton, 2009; Huang et al., 2012; Pennington et al., 2014; Mikolov et al., 2013a). Comparative work has illustrated a variation in performance between methods across evaluation tasks (Chen et al., 2013; Yin and Schütze, 2016).
Methods of "meta-embedding", as first proposed by Yin and Schütze (2016), aim to conduct a complementary combination of information from an ensemble of distinct word embedding sets, each trained using different methods, and resources, to yield an embedding set with improved overall quality.
Several such methods have been proposed. 1TON (Yin and Schütze, 2016) takes an ensemble of K pre-trained word embedding sets and employs a linear neural network to learn a set of meta-embeddings along with K global projection matrices, such that, through projection, for every word in the meta-embedding set we can recover its corresponding vector within each source word embedding set. 1TON+ (Yin and Schütze, 2016) extends this method by predicting embeddings for words not present within the intersection of the source word embedding sets. An unsupervised locally-linear meta-embedding approach has since been proposed (Bollegala et al., 2017): for each word in each source embedding set, a representation as a linear combination of its nearest neighbours is learnt. The local reconstructions within each source embedding set are then projected to a common meta-embedding space.
The simplest approach considered to date has been to concatenate the word embeddings across the source sets (Yin and Schütze, 2016). Despite its simplicity, concatenation provides a good baseline of performance for meta-embedding.
A method which has not yet been proposed is direct averaging of embeddings across the source sets. The validity of this approach may not seem obvious, since no correspondence exists between the dimensions of separately trained word embedding sets. In this paper we first provide analysis and justification that, despite this lack of correspondence, averaging can approximate the performance of concatenation without increasing the dimension of the embeddings. We then give empirical results demonstrating the quality of averaged meta-embeddings. We compare against concatenation because it is the most comparable method in terms of simplicity, whilst also providing a good baseline of performance on evaluation tasks. Our aim is to establish the validity of averaging across distinct word embedding sets, such that it may be considered as a tool in future meta-embedding work.

Analysis
To evaluate semantic similarity between word embeddings we consider the Euclidean distance measure. For ℓ2-normalised word embeddings, Euclidean distance is a monotonically decreasing function of cosine similarity (made explicit below), which is a popular choice in NLP tasks that use word embeddings, such as semantic similarity prediction and analogy detection (Levy et al., 2015; Levy and Goldberg, 2014). We defer the analysis of other distance measures to future work. By evaluating the relationship between the Euclidean distances of pairs of words in the source embedding sets and the corresponding Euclidean distances in the meta-embedding space, we can obtain a view of how each meta-embedding procedure combines semantic information. We begin by examining concatenation through this lens, before moving on to averaging.
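Concretely, for ℓ2-normalised embeddings $u$ and $v$, the monotonic relationship noted above follows from the standard identity
$$\|u - v\|_2 = \sqrt{\|u\|_2^2 + \|v\|_2^2 - 2u^{\top}v} = \sqrt{2 - 2\cos(u, v)},$$
so ranking word pairs by increasing Euclidean distance is equivalent to ranking them by decreasing cosine similarity.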

Concatenation
We can express concatenation by first zero-padding our source embeddings and then combining them through addition.
Without loss of generality, and for ease of exposition, we consider both concatenation and averaging over only two source word embedding sets. Let $S_1$ and $S_2$ be distinct sets of real-valued continuous word embeddings. We make no assumption that $S_1$ and $S_2$ were trained using the same method or resources. Consider two semantically similar words $u$ and $v$ such that $u, v \in S_1 \cap S_2$. Let $u_{S_1}$ and $v_{S_1}$, and $u_{S_2}$ and $v_{S_2}$, denote the embeddings of $u$ and $v$ within $S_1$ and $S_2$ respectively.
Let the dimensionalities of $S_1$ and $S_2$ be denoted $d_{S_1}$ and $d_{S_2}$ respectively. We zero-pad embeddings from $S_1$ by front-loading $d_{S_2}$ zero entries onto each word embedding vector. In contrast, we zero-pad embeddings from $S_2$ by appending $d_{S_1}$ zero entries to the end of each embedding vector. The resulting embeddings from $S_1$ and $S_2$ now share a common dimensionality of $d_{S_1} + d_{S_2}$. Denote the resulting embeddings of any word $u \in S_1 \cap S_2$ by $u^{zero}_{S_1}$ and $u^{zero}_{S_2}$ respectively. Combining these zero-padded source embeddings through addition, we obtain equivalency to concatenation.
Note that the zero-padded vectors from $S_1$ and $S_2$ are orthogonal. Let the Euclidean distances between $u$ and $v$ in the source embedding sets be denoted by $E_{S_1} = \|u_{S_1} - v_{S_1}\|_2$ and $E_{S_2} = \|u_{S_2} - v_{S_2}\|_2$. Note that for any vector $u \in \mathbb{R}^n$ the addition of zero-valued dimensions does not affect the value of its $\ell_2$-norm. So we have
$$\|u^{zero}_{S_1} - v^{zero}_{S_1}\|_2 = E_{S_1}, \quad (1)$$
$$\|u^{zero}_{S_2} - v^{zero}_{S_2}\|_2 = E_{S_2}. \quad (2)$$
Consider the Euclidean distance between $u$ and $v$ after concatenation. Because the zero-padded difference vectors are orthogonal, the cross term in the expansion vanishes and we obtain
$$\left\|\left(u^{zero}_{S_1} + u^{zero}_{S_2}\right) - \left(v^{zero}_{S_1} + v^{zero}_{S_2}\right)\right\|_2 = \sqrt{E_{S_1}^2 + E_{S_2}^2}. \quad (3)$$
Thus, for any two words in the meta-embedding obtained by concatenation, the distance between them in the resultant space is the root of the sum of squares of the Euclidean distances between those words in $S_1$ and $S_2$.
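A small numerical sketch of this equivalence (using NumPy; the dimensionalities and vectors are illustrative toys, not taken from the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2 = 100, 300                       # toy dimensionalities for S1 and S2

# toy embeddings of words u and v in each source set
u_s1, v_s1 = rng.normal(size=d1), rng.normal(size=d1)
u_s2, v_s2 = rng.normal(size=d2), rng.normal(size=d2)

def zero_pad_s1(x):
    # front-load d2 zero entries onto an S1 embedding
    return np.concatenate([np.zeros(d2), x])

def zero_pad_s2(x):
    # append d1 zero entries to an S2 embedding
    return np.concatenate([x, np.zeros(d1)])

# concatenation expressed as the addition of orthogonal zero-padded vectors
u_cat = zero_pad_s1(u_s1) + zero_pad_s2(u_s2)
v_cat = zero_pad_s1(v_s1) + zero_pad_s2(v_s2)
assert np.allclose(u_cat, np.concatenate([u_s2, u_s1]))

# the concatenated distance is the root of the sum of squared source distances, as in (3)
E1 = np.linalg.norm(u_s1 - v_s1)
E2 = np.linalg.norm(u_s2 - v_s2)
assert np.isclose(np.linalg.norm(u_cat - v_cat), np.sqrt(E1**2 + E2**2))
```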

Average word embeddings
We now make the assumption that $S_1$ and $S_2$ have a common dimensionality $d$. Despite there being no obvious correspondence between the dimensions of $S_1$ and $S_2$, we can show that the averaged embedding set retains semantic information through approximate preservation of the relative distances between words.
Consider the positioning of the words $u$ and $v$ after performing a word-wise average of the source embedding sets, so that each word $w \in S_1 \cap S_2$ receives the meta-embedding $w^{AVG} = \frac{1}{2}(w_{S_1} + w_{S_2})$. The Euclidean distance between $u$ and $v$ in the resultant meta-embedding is given by
$$\|u^{AVG} - v^{AVG}\|_2 = \frac{1}{2}\left\|(u_{S_1} - v_{S_1}) + (u_{S_2} - v_{S_2})\right\|_2 = \frac{1}{2}\sqrt{E_{S_1}^2 + E_{S_2}^2 - 2E_{S_1}E_{S_2}\cos\theta},$$
where $\theta$ is the angle between $(u_{S_1} - v_{S_1})$ and $(v_{S_2} - u_{S_2})$. In this case, unlike concatenation, we have not constructed our source embedding sets to be orthogonal to each other, so we are left with a term dependent on this angle. However, Cai et al. (2013) showed that, if $X$ is a set of random points in $\mathbb{R}^n$ with cardinality $|X|$, then the limiting distribution of angles between pairs of elements of $X$, as $|X| \to \infty$, is Gaussian with mean $\pi/2$. In addition, Cai et al. (2013) showed that the variance of this distribution shrinks as the dimensionality $n$ increases.
Word embedding sets typically contain in the order of tens of thousands of points or more, and are typically of relatively high dimensionality. Moreover, assuming that the difference vector between any two words in an embedding set is sufficiently random, the angles between such difference vectors may be expected to approximate the limiting Gaussian distribution described by Cai et al. (2013). In that case the expectation is that $(u_{S_1} - v_{S_1})$ and $(v_{S_2} - u_{S_2})$ are orthogonal, leading to the following result:
$$\|u^{AVG} - v^{AVG}\|_2 \approx \frac{1}{2}\sqrt{E_{S_1}^2 + E_{S_2}^2}. \quad (4)$$
To summarise, if the difference vectors between word pairs across the source embedding sets are approximately orthogonal, then averaging preserves, up to a constant factor of one half, the same distance information as concatenation, without increasing the dimensionality of the embeddings.
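The following sketch (again NumPy; random vectors stand in for trained embeddings) illustrates the approximation in (4): when the difference vectors from the two sources are close to orthogonal, the averaged distance is close to half the concatenated distance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 300                                  # common dimensionality assumed for S1 and S2

# random stand-ins for the embeddings of words u and v in S1 and S2
u_s1, v_s1 = rng.normal(size=d), rng.normal(size=d)
u_s2, v_s2 = rng.normal(size=d), rng.normal(size=d)

# word-wise averaged meta-embeddings
u_avg = 0.5 * (u_s1 + u_s2)
v_avg = 0.5 * (v_s1 + v_s2)

E1 = np.linalg.norm(u_s1 - v_s1)
E2 = np.linalg.norm(u_s2 - v_s2)

exact = np.linalg.norm(u_avg - v_avg)    # exact averaged distance
approx = 0.5 * np.sqrt(E1**2 + E2**2)    # result (4), assuming orthogonal difference vectors

# the relative error is small in high dimensions because the difference vectors
# (u_s1 - v_s1) and (v_s2 - u_s2) are nearly orthogonal
print(abs(exact - approx) / exact)
```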

Experiments
We first empirically test our theoretical claim that word embeddings are sufficiently random and high dimensional that the difference vectors between word pairs are approximately orthogonal across source embedding sets. We then present an empirical evaluation of the performance of the meta-embeddings produced through averaging, and compare against concatenation.

Datasets
We use the following pre-trained embedding sets, which have been used in prior work on meta-embedding learning (Yin and Schütze, 2016; Bollegala et al., 2017): GloVe (Pennington et al., 2014), CBOW (Mikolov et al., 2013a), and HLBL (Mnih and Hinton, 2009).
Note that the purpose of this experiment is not to compare against previously proposed meta-embedding learning methods, but to empirically verify averaging as a meta-embedding method and to validate the assumptions behind the theoretical analysis. By using three pre-trained word embedding sets with different dimensionalities and empirical accuracies, we can evaluate averaging-based meta-embeddings in a robust manner. We pad HLBL embeddings to the rear with 200 zero entries to bring their dimensionality up to 300. For GloVe, we ℓ2-normalise each dimension of the embedding across the vocabulary, as recommended by the authors. Every individual word embedding from each embedding set is then ℓ2-normalised. The proposed averaging operation, as well as concatenation, operates only on the intersection of the vocabularies. The intersections GloVe ∩ CBOW, GloVe ∩ HLBL, and CBOW ∩ HLBL contain 154,076, 90,254, and 140,479 word embeddings respectively.
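As a concrete illustration, the preprocessing and averaging steps described above might be implemented as follows (a minimal sketch; glove, cbow, and hlbl are assumed word-to-vector dictionaries loaded elsewhere, and the helper names are hypothetical):

```python
import numpy as np

def pad_to(vec, dim):
    # Pad an embedding to the rear with zero entries up to `dim` dimensions (used for HLBL).
    return np.concatenate([vec, np.zeros(dim - len(vec))])

def normalise_words(emb):
    # l2-normalise every individual word embedding.
    return {w: v / np.linalg.norm(v) for w, v in emb.items()}

def normalise_dimensions(emb):
    # l2-normalise each dimension of the embedding across the vocabulary (applied to GloVe).
    words = list(emb)
    matrix = np.stack([emb[w] for w in words])
    matrix = matrix / np.linalg.norm(matrix, axis=0, keepdims=True)
    return dict(zip(words, matrix))

def average_meta_embedding(emb_a, emb_b):
    # Word-wise average over the intersection of the two vocabularies.
    common = emb_a.keys() & emb_b.keys()
    return {w: 0.5 * (emb_a[w] + emb_b[w]) for w in common}

# Hypothetical usage, assuming glove, cbow, hlbl are dicts mapping word -> np.ndarray:
# glove = normalise_words(normalise_dimensions(glove))
# cbow  = normalise_words(cbow)
# hlbl  = normalise_words({w: pad_to(v, 300) for w, v in hlbl.items()})
# meta  = average_meta_embedding(glove, cbow)
```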

Empirical distribution analysis
We conduct an empirical analysis of the distribution of the angle between $(u_{S_1} - v_{S_1})$ and $(v_{S_2} - u_{S_2})$ for each pair of embedding sets. Table 1 shows the mean and variance of these distributions. Figure 1 shows a normalised histogram of the angles for GloVe ∩ CBOW, along with a normal distribution characterised by the sample mean and variance. The GloVe ∩ HLBL and CBOW ∩ HLBL plots are not shown due to space limitations, but are similarly normally distributed. This result shows that the pre-trained word embeddings approximately satisfy the predictions made by Cai et al. (2013), thereby empirically justifying the assumption made in the derivation of (4).
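The angle statistics above can be estimated with a sketch of the following form (emb1 and emb2 are assumed word-to-vector dictionaries; the pair-sampling scheme and sample size are illustrative, not taken from the paper):

```python
import numpy as np

def angle_sample(emb1, emb2, n_pairs=100_000, seed=0):
    """Sample word pairs (u, v) from the shared vocabulary and return the angles
    (in radians) between (u_S1 - v_S1) and (v_S2 - u_S2)."""
    rng = np.random.default_rng(seed)
    vocab = sorted(emb1.keys() & emb2.keys())
    angles = []
    for _ in range(n_pairs):
        u, v = rng.choice(vocab, size=2, replace=False)
        a = emb1[u] - emb1[v]
        b = emb2[v] - emb2[u]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)

# angles = angle_sample(glove, cbow)
# print(angles.mean(), angles.var())   # expected to be close to pi/2 with small variance
```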

Semantic Similarity
We measure the similarity between words by calculating the cosine similarity between their embeddings; we then calculate Spearman correlation against human similarity scores. The following datasets are used: RG (Rubenstein and Goodenough, 1965), MC (Miller and Charles, 1991), WS (Finkelstein et al., 2001), RW (Luong et al., 2013), and SL (Hill et al., 2015).
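A sketch of this evaluation protocol (using SciPy for the Spearman correlation; meta is an assumed word-to-vector dictionary and dataset an assumed list of (word1, word2, human_score) triples):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(meta, dataset):
    # Spearman correlation between cosine similarities and human similarity ratings,
    # restricted to word pairs covered by the meta-embedding vocabulary.
    predicted, gold = [], []
    for w1, w2, human_score in dataset:
        if w1 in meta and w2 in meta:
            predicted.append(cosine(meta[w1], meta[w2]))
            gold.append(human_score)
    return spearmanr(predicted, gold).correlation
```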

Word Analogy
Using the Google analogy dataset GL (Mikolov et al., 2013b), we answer questions of the form "a is to b as c is to ?" using the CosAdd method (Mikolov et al., 2013c) shown in (5). Specifically, we determine a fourth word $d$ such that the cosine similarity between $(b - a + c)$ and $d$ is maximised:
$$d = \arg\max_{d'} \cos\left(b - a + c,\; d'\right), \quad (5)$$
where $d'$ ranges over the vocabulary.
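A sketch of CosAdd as used here (emb is an assumed dictionary of ℓ2-normalised embeddings; excluding the question words from the candidates is common practice, though not stated above):

```python
import numpy as np

def cos_add(emb, a, b, c):
    """Answer 'a is to b as c is to ?' by maximising cos(b - a + c, d) over the vocabulary."""
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best_word, best_score = None, -np.inf
    for word, vec in emb.items():
        if word in (a, b, c):                     # exclude the question words
            continue
        score = target @ (vec / np.linalg.norm(vec))
        if score > best_score:
            best_word, best_score = word, score
    return best_word
```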
Discussion of results

Table 2 shows task performance for each source embedding set, and for both meta-embedding methods on every pair of source embedding sets. In our experiments concatenation obtains better overall performance. However, averaging offers improvements over the source embedding sets on the semantic similarity task SL and the word analogy task GL for the combination of CBOW and GloVe. HLBL has a negative effect on CBOW and GloVe, but the performance of averaging remains close to that of concatenation. An advantage of averaging over concatenation is that the dimensionality of the produced meta-embedding does not grow beyond the maximum dimensionality present within the source embeddings, resulting in a meta-embedding that is easier to process and store.

Conclusion
We have presented an argument for averaging as a valid meta-embedding technique, and found its experimental performance to be close to, or in some cases better than, that of concatenation, with the additional benefit of reduced dimensionality. We propose that when conducting meta-embedding, both concatenation and averaging should be considered as methods of combining embedding spaces, each with its own advantages.