Learning Word Meta-Embeddings

Word embeddings – distributed representations of words – in deep learning are beneficial for many tasks in NLP. However, different embedding sets vary greatly in quality and characteristics of the captured information. Instead of relying on a more advanced algorithm for embedding learning, this paper proposes an ensemble approach of combining different public embedding sets with the aim of learning metaembeddings. Experiments on word similarity and analogy tasks and on part-of-speech tagging show better performance of metaembeddings compared to individual embedding sets. One advantage of metaembeddings is the increased vocabulary coverage. We release our metaembeddings publicly at http://cistern.cis.lmu.de/meta-emb.

Some prior work has studied differences in performance of different embedding sets. For example, Chen et al. (2013) show that the embedding sets HLBL (Mnih and Hinton, 2009), SENNA (Collobert and Weston, 2008), Turian (Turian et al., 2010) and Huang (Huang et al., 2012) have great variance in quality and characteristics of the semantics captured. Hill et al. (2014; 2015a) show that embeddings learned by NN machine translation models can outperform three representative monolingual embedding sets: word2vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014) and CW (Collobert and Weston, 2008). Bansal et al. (2014) find that Brown clustering, SENNA, CW, Huang and word2vec yield significant gains for dependency parsing. Moreover, using these representations together achieved the best results, suggesting their complementarity. These prior studies motivate us to explore an ensemble approach. Each embedding set is trained by a different NN on a different corpus, hence can be treated as a distinct description of words. We want to leverage this diversity to learn better-performing word embeddings. Our expectation is that the ensemble contains more information than each component embedding set.
The ensemble approach has two benefits. First, enhancement of the representations: metaembeddings perform better than the individual embedding sets. Second, coverage: metaembeddings cover more words than the individual embedding sets. The first three ensemble methods we introduce, CONC, SVD and 1TON, directly provide only the benefit of enhancement: they learn metaembeddings on the overlapping vocabulary of the embedding sets. CONC concatenates the vectors of a word from the different embedding sets. SVD performs dimensionality reduction on this concatenation. 1TON assumes that a metaembedding for the word exists, e.g., a randomly initialized vector in the beginning, and uses this metaembedding to predict representations of the word in the individual embedding sets by projections; the resulting fine-tuned metaembedding is expected to contain knowledge from all individual embedding sets.
To also address the objective of increased vocabulary coverage, we introduce 1TON+, a modification of 1TON that learns metaembeddings for all words in the vocabulary union in one step. Let an out-of-vocabulary (OOV) word w of embedding set ES be a word that is not covered by ES (i.e., ES does not contain an embedding for w). 1TON+ first randomly initializes the embeddings for OOVs and the metaembeddings, then uses a prediction setup similar to 1TON to update metaembeddings as well as OOV embeddings. Thus, 1TON+ simultaneously achieves two goals: learning metaembeddings and extending the vocabulary (for both the metaembeddings and the individual embedding sets).
An alternative method that increases coverage is MUTUALLEARNING. MUTUALLEARNING learns the embedding of a word that is an OOV in one embedding set from its embeddings in the other embedding sets. We will use MUTUALLEARNING to increase coverage for CONC, SVD and 1TON, so that these three methods (when used together with MUTUALLEARNING) have the advantages of both performance enhancement and increased coverage.
In summary, metaembeddings have two benefits compared to individual embedding sets: enhancement of performance and improved coverage of the vocabulary. Below, we demonstrate this experimentally for three tasks: word similarity, word analogy and POS tagging.
If we simply view metaembeddings as a way of coming up with better embeddings, then the alternative is to develop a single embedding learning algorithm that produces better embeddings. Some previously proposed improvements have the disadvantage of substantially increasing the training time of embedding learning; e.g., the NNLM presented by Bengio et al. (2003) is an order of magnitude less efficient than an algorithm like word2vec and, more generally, replacing a linear objective function with a nonlinear one increases training time. Similarly, fine-tuning the hyperparameters of the embedding learning algorithm is complex and time consuming. In terms of coverage, one might argue that we can retrain an existing algorithm like word2vec on a bigger corpus. However, that needs much longer training time than our simple ensemble approaches, which achieve coverage as well as enhancement with less effort. In many cases, it is not possible to retrain using a different algorithm because the corpus is not publicly available. But even if these obstacles could be overcome, it is unlikely that there will ever be a single "best" embedding learning algorithm. So the current situation, in which multiple embedding sets with different properties are available, is likely to persist for the foreseeable future. Metaembedding learning is a simple and efficient way of taking advantage of this diversity. As we will show below, it combines several complementary embedding sets, and the resulting metaembeddings are stronger than each individual set.

Related Work
Related work has focused on improving performance on specific tasks by using several embedding sets simultaneously. To our knowledge, there is no work that aims to learn generally useful metaembeddings from individual embedding sets. Tsuboi (2014) incorporates word2vec and GloVe embeddings into a POS tagging system and finds that using the two embedding sets together is better than using them individually. Similarly, Turian et al. (2010) find that using Brown clusters, CW embeddings and HLBL embeddings together for Named Entity Recognition and chunking tasks gives better performance than using these representations individually. Luo et al. (2014) adapt CBOW (Mikolov et al., 2013a) to train word embeddings on different datasets (a Wikipedia corpus, search clickthrough data and user query data) for web search ranking and for word similarity. They show that using these embeddings together gives stronger results than using them individually.
Both Yin and Schütze (2015) and Zhang et al. (2016) incorporate multiple embedding sets as channels of a convolutional neural network for sentence classification tasks. The improved performance again hints at the complementarity of the component embedding sets; however, this kind of incorporation introduces a large number of training parameters.
In sum, these papers show that using multiple embedding sets is beneficial. However, they either use embedding sets trained on the same corpus (Turian et al., 2010), or enhance embedding sets by more training data rather than by innovative learning algorithms (Luo et al., 2014), or make the system architectures more complicated (Yin and Schütze, 2015; Zhang et al., 2016). In our work, we can leverage any publicly available embedding set learned by any learning algorithm. Our metaembeddings (i) do not require access to resources such as large computing infrastructures or proprietary corpora; (ii) are derived by fast and simple ensemble learning from existing embedding sets; and (iii) have much lower dimensionality than a simple concatenation, greatly reducing the number of parameters in any system that uses them. An alternative to learning metaembeddings from embeddings is the MVLSA method, which learns powerful embeddings directly from multiple data sources (Rastogi et al., 2015). Rastogi et al. (2015) combine a large number of data sources and also run two experiments on the embedding sets GloVe and word2vec. In contrast, our focus is on metaembeddings, i.e., embeddings that are exclusively based on embeddings.

Table 1: Summary of the five embedding sets.

    Embedding set                      Vocab size   Dim   Training data
    HLBL (Mnih and Hinton, 2009)       246,122      100   Reuters English newswire, August 1996 – August 1997
    Huang (Huang et al., 2012)         100,232      50    April 2010 snapshot of Wikipedia
    GloVe (Pennington et al., 2014)    1,193,514    300   42 billion tokens of web data, from Common Crawl
    CW (Collobert and Weston, 2008)    268,810      200   Reuters English newswire, August 1996 – August 1997
    word2vec (Mikolov et al., 2013b)   929,022      300   About 100 billion tokens from Google News
The advantages of metaembeddings are that they outperform individual embeddings in our experiments, that few computational resources are needed, that no access to the original data is required and that embeddings learned by new powerful (including nonlinear) embedding learning algorithms in the future can be immediately taken advantage of without any changes being necessary to our basic framework. In future work, we hope to compare MVLSA and metaembeddings in effectiveness (Is using the original corpus better than using embeddings in some cases?) and efficiency (Is using SGD or SVD more efficient and in what circumstances?).

Experimental Embedding Sets
In this work, we use five released embedding sets; Table 1 gives a summary of them.
The intersection of the five vocabularies has size 35,965, the union has size 2,788,636.
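These counts come from simple set operations over the five vocabularies; a toy sketch with hypothetical three-word vocabularies (in practice, each vocabulary is read from the released embedding files):

```python
# Toy sketch: intersection and union of embedding vocabularies.
# `vocabs` uses small hypothetical vocabularies for illustration.
vocabs = {
    "hlbl":     {"house", "cat", "run"},
    "glove":    {"house", "cat", "run", "zeitgeist"},
    "word2vec": {"house", "cat", "jump"},
}

intersection = set.intersection(*vocabs.values())
union = set.union(*vocabs.values())
print(len(intersection), len(union))  # 2 5
```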

Ensemble Methods
This section introduces the four ensemble methods: CONC, SVD, 1TON and 1TON + .

CONC: Concatenation
In CONC, the metaembedding of w is the concatenation of five embeddings, one each from the five embedding sets. For GloVe, we perform L2 normalization for each dimension across the vocabulary as recommended by the GloVe authors. Then each embedding of each embedding set is L2-normalized. This ensures that each embedding set contributes equally (a value between -1 and 1) when we compute similarity via dot product.
We would like to make use of prior knowledge and give more weight to well-performing embedding sets. In this work, we give weight i > 1 to GloVe and word2vec and weight 1 to the other three embedding sets. We use MC30 (Miller and Charles, 1991) as the dev set, since all embedding sets fully cover it. We set i = 8, the value in Figure 1 at which performance reaches a plateau. After L2 normalization, the GloVe and word2vec embeddings are multiplied by i and the remaining embedding sets are left unchanged.
The dimensionality of CONC metaembeddings is k = 100 + 50 + 300 + 200 + 300 = 950. We also tried equal weighting, but the results were much worse, so we do not report them. This suggests that simple concatenation, without accounting for the differences among embedding sets, is unlikely to achieve enhancement. The main disadvantage of simple concatenation is that word embeddings are commonly used to initialize words in DNN systems; the high dimensionality of concatenated embeddings thus causes a great increase in the number of training parameters.
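A minimal numpy sketch of the weighted concatenation; toy random matrices stand in for the real embedding sets, and GloVe's additional per-dimension normalization is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for the five embedding sets (a 4-word vocabulary each),
# using the real dimensionalities 100/50/300/200/300.
sets = {
    "hlbl": rng.normal(size=(4, 100)),
    "huang": rng.normal(size=(4, 50)),
    "glove": rng.normal(size=(4, 300)),
    "cw": rng.normal(size=(4, 200)),
    "word2vec": rng.normal(size=(4, 300)),
}
weights = {"glove": 8.0, "word2vec": 8.0}  # i = 8 from the dev set

def l2_rows(m):
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# L2-normalize each embedding, scale the well-performing sets, concatenate.
conc = np.hstack([weights.get(name, 1.0) * l2_rows(m) for name, m in sets.items()])
print(conc.shape)  # (4, 950)
```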

SVD: Singular Value Decomposition
We perform SVD on the weighted concatenation vectors of dimension k = 950 described above.
Given a set of CONC representations for n words, each of dimensionality k, we compute an SVD decomposition C = U S V^T of the corresponding n × k matrix C. We then use U_d, the first d dimensions (columns) of U, as the SVD metaembeddings of the n words. We apply L2 normalization to the embeddings; similarities of SVD vectors are computed as dot products.
d denotes the dimensionality of the metaembeddings in SVD, 1TON and 1TON+. We use d = 200 throughout and investigate the impact of d below. Let c be the number of embedding sets under consideration.
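The SVD step can be sketched as follows, with a random matrix standing in for the CONC representations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, d = 400, 950, 200           # words, CONC dimensionality, target dim
C = rng.normal(size=(n, k))       # stand-in for the CONC matrix

# Thin SVD C = U S V^T; the first d columns of U are the metaembeddings.
U, S, Vt = np.linalg.svd(C, full_matrices=False)
meta = U[:, :d]
meta = meta / np.linalg.norm(meta, axis=1, keepdims=True)  # L2-normalize rows
print(meta.shape)  # (400, 200)
```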

1TON
1TON assumes that a metaembedding m_w of each word w exists, e.g., a randomly initialized vector in the beginning, and treats the individual embeddings w_i (i = 1, . . . , c) as projections of it:

    ŵ_i = M_i m_w,  i = 1, . . . , c    (1)

where M_i is a learned projection matrix for embedding set i. The training objective is:

    E = Σ_{i=1}^{c} k_i |ŵ_i − w_i|² + l_2 |θ|²    (2)

In Equation 2, k_i is the weight scalar of the i-th embedding set, determined in Section 4.1, i.e., k_i = 8 for the GloVe and word2vec embedding sets and k_i = 1 otherwise; l_2 is the weight of the L2 regularization term over the parameters θ. The principle of 1TON is that we treat each individual embedding as a projection of the metaembedding, similar to principal component analysis. An embedding is a description of the word based on the corpus and the model that were used to create it. The metaembedding tries to recover a more comprehensive description of the word when it is trained to predict the individual descriptions.
1TON can also be understood as a sentence modeling process, similar to DBOW (Le and Mikolov, 2014). The embedding of each word in a sentence s is a partial description of s. DBOW combines all partial descriptions to form a comprehensive description of s. DBOW initializes the sentence representation randomly, then uses this representation to predict the representations of individual words. The sentence representation of s corresponds to the metaembedding in 1TON; and the representations of the words in s correspond to the five embeddings for a word in 1TON.
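The training procedure can be sketched in a few lines. This is a toy illustration with random data, plain gradient descent and no regularization; the exact optimizer, learning rate and dimensionalities are assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 20                      # words, metaembedding dimensionality
dims = [10, 5, 15]                 # toy dimensionalities of c = 3 sets
k = [8.0, 1.0, 1.0]                # per-set weights, as in Equation 2
embs = [rng.normal(size=(n, di)) for di in dims]   # individual embeddings

meta = rng.normal(scale=0.1, size=(n, d))                 # metaembeddings
M = [rng.normal(scale=0.1, size=(di, d)) for di in dims]  # projections

def loss():
    return sum(ki * np.sum((meta @ Mi.T - Ei) ** 2)
               for ki, Mi, Ei in zip(k, M, embs))

before, lr = loss(), 1e-3
for _ in range(200):               # plain gradient descent on the objective
    for i, (ki, Ei) in enumerate(zip(k, embs)):
        resid = meta @ M[i].T - Ei            # prediction error, n x dims[i]
        M[i] -= lr * 2 * ki * resid.T @ meta
        meta -= lr * 2 * ki * resid @ M[i]
print(loss() < before)  # the objective decreases
```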

1TON +
Recall that an OOV (with respect to embedding set ES) is defined as a word unknown in ES. 1TON+ is an extension of 1TON that learns embeddings for OOVs; thus, it does not have the limitation that it can only be run on the overlapping vocabulary. In Figure 2, we assume that the current word is an OOV in embedding sets 3 and 5. Hence, in the new learning task, embeddings 1, 2 and 4 are known, and embeddings 3 and 5 as well as the metaembedding are targets to learn.
We initialize all OOV representations and metaembeddings randomly and use the same mapping formula as for 1TON to connect a metaembedding with the individual embeddings. Both the metaembeddings and the initialized OOV embeddings are updated during training.
Each embedding set contains information about only a part of the overall vocabulary. However, it can predict what the remaining part should look like by comparing words it knows with the information other embedding sets provide about these words. Thus, 1TON + learns a model of the dependencies between the individual embedding sets and can use these dependencies to infer what the embedding of an OOV should look like.
CONC, SVD and 1TON compute metaembeddings only for the intersection vocabulary. 1TON + computes metaembeddings for the union of all individual vocabularies, thus greatly increasing the coverage of individual embedding sets.
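A minimal sketch of this idea, with hypothetical coverage masks and toy dimensionalities: embeddings of words a set covers are held fixed, while OOV rows and the metaembeddings are free parameters; the update schedule and the absence of regularization are simplifications:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, dims = 6, 4, [3, 5]          # tiny union vocabulary, 2 embedding sets
known = [np.array([1, 1, 1, 0, 1, 0], bool),   # words covered by set 1
         np.array([1, 0, 1, 1, 1, 1], bool)]   # words covered by set 2
embs = [rng.normal(size=(n, di)) for di in dims]  # OOV rows start random
meta = rng.normal(scale=0.1, size=(n, d))
M = [rng.normal(scale=0.1, size=(di, d)) for di in dims]

def loss():  # reconstruction error on the known (gold) embeddings only
    return sum(np.sum(((meta @ M[i].T - embs[i])[known[i]]) ** 2)
               for i in range(2))

before, lr = loss(), 1e-2
for _ in range(500):
    for i in range(2):
        resid = meta @ M[i].T - embs[i]
        M[i] -= lr * resid.T @ meta
        meta -= lr * resid @ M[i]
        # OOV rows are free parameters: move them toward the prediction;
        # known rows stay fixed, since they are the gold embeddings.
        oov = ~known[i]
        embs[i][oov] += lr * resid[oov]
print(loss() < before)
```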

MUTUALLEARNING
MUTUALLEARNING is a method that extends CONC, SVD and 1TON such that they have increased vocabulary coverage. With MUTUALLEARNING, all four ensemble methods (CONC, SVD, 1TON and 1TON+) have the benefits of both performance enhancement and increased coverage, and we can use criteria like performance, compactness and efficiency of training to select the best ensemble method for a particular application. MUTUALLEARNING is applied to learn OOV embeddings for all c embedding sets; however, for ease of exposition, let us assume we want to compute embeddings for OOVs of embedding set j only, based on known embeddings in the other c − 1 embedding sets, with indexes i ∈ {1 . . . j − 1, j + 1 . . . c}. We do this by learning c − 1 mappings f_ij, each a projection from embedding set E_i to embedding set E_j.
Similar to Section 4.3, we train each mapping f_ij on the intersection V_i ∩ V_j of the vocabularies covered by the two embedding sets. The training loss has the same form as Equation 2, except that there is no weighted sum "Σ_{i=1}^{c} k_i" over embedding sets. A total of c − 1 projections f_ij are trained to learn OOV embeddings for embedding set j.
Let w be a word outside the vocabulary V_j of embedding set j, but inside V_1, V_2, . . . , V_k. To compute an embedding for w in V_j, we first compute the k projections f_1j(w_1), f_2j(w_2), . . . , f_kj(w_k) from the source spaces V_1, V_2, . . . , V_k to the target space V_j. Then the element-wise average of f_1j(w_1), f_2j(w_2), . . . , f_kj(w_k) is taken as the representation of w in V_j. Our motivation is that, assuming there is a true representation of w in V_j and assuming the projections were learned well, we would expect all of the projected vectors to be close to this true representation. Also, each source space contributes potentially complementary information. Hence averaging them balances the knowledge from all source spaces.
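As an illustration, the projection-and-average step might look as follows; the least-squares fit stands in for the SGD training of f_ij described above, and all data here is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
n_shared = 40                      # words shared with the target set
# Two toy source spaces; words 0..39 are also in the target vocabulary V_j.
src = [rng.normal(size=(n_shared + 10, 8)) for _ in range(2)]
W = rng.normal(size=(8, 6))
tgt = src[0][:n_shared] @ W        # synthetic target embeddings (shared words)

# Fit each projection f_ij on the shared vocabulary by least squares
# (a stand-in for the SGD training described in the text).
maps = [np.linalg.lstsq(s[:n_shared], tgt, rcond=None)[0] for s in src]

# For a word OOV in the target set, average the projections from all sources.
w = 41                             # index of a word missing from the target set
predicted = np.mean([s[w] @ m for s, m in zip(src, maps)], axis=0)
print(predicted.shape)  # (6,)
```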
We report results on three tasks: word similarity, word analogy and POS tagging.

Table 3: Results on five word similarity tasks (Spearman correlation metric) and analogical reasoning (accuracy). The number of OOVs is given in parentheses for each result. "ind-full/ind-overlap": individual embedding sets with their respective full/overlapping vocabulary; "ensemble": ensemble results using all five embedding sets; "discard": one of the five embedding sets is removed. A result is bolded if it is better than all methods in "ind-overlap". Significant improvement over the best baseline in "ind-overlap" is underlined (online toolkit from http://vassarstats.net/index.html for the Spearman correlation metric, test of equal proportions for accuracy, p < .05).

Word Similarity and Analogy Tasks
We evaluate on SimLex-999 (Hill et al., 2015b), WordSim353 (Finkelstein et al., 2001), MEN (Bruni et al., 2014) and RW (Luong et al., 2013). For completeness, we also show results for MC30, the validation set. The word analogy task proposed by Mikolov et al. (2013b) consists of questions like "a is to b as c is to ?". The dataset contains 19,544 such questions, divided into a semantic subset of size 8,869 and a syntactic subset of size 10,675. Accuracy is reported.
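For concreteness, a minimal version of the standard vector-offset method for answering such questions, on a tiny hand-built embedding space (as in the real evaluation, the three question words are excluded as candidates):

```python
import numpy as np

def analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' via the offset b - a + c,
    returning the most cosine-similar word (question words excluded)."""
    q = emb[b] - emb[a] + emb[c]
    q = q / np.linalg.norm(q)
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in (a, b, c):
            continue
        sim = (v @ q) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Tiny hand-built space where the gender offset lines up exactly.
emb = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
    "apple": np.array([0.0, 3.0]),
}
print(analogy(emb, "man", "woman", "king"))  # queen
```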
We also give the state-of-the-art result for each task: SimLex-999 (Wieting et al., 2015), WS353 (Halawi et al., 2012). Not all state-of-the-art results are included in Table 3. One reason is that a fair comparison is only possible on the shared vocabulary, so methods without released embeddings cannot be included. In addition, some prior systems could potentially achieve better performance, but the literature reports lower results than ours because of different hyperparameter setups, such as smaller word embedding dimensionality, or because of different evaluation metrics. In any case, our main contribution is to present ensemble frameworks which show that a combination of complementary embedding sets produces better-performing metaembeddings.

Table 3 reports results on similarity and analogy. Numbers in parentheses are the counts of words in the datasets that are not covered by the intersection vocabulary; for a fair comparison, we exclude them. Block "ind-full" (1-5) lists the performance of individual embedding sets on the full vocabulary. Results on lines 6-34 are for the intersection vocabulary of the five embedding sets: "ind-overlap" contains the performance of individual embedding sets, "ensemble" the performance of our four ensemble methods and "discard" the performance when one component set is removed.
These results demonstrate that ensemble methods using multiple embedding sets produce stronger embeddings. However, this does not mean that more embedding sets are always better: whether an embedding set helps depends on the complementarity of the sets and on the task.
CONC, the simplest ensemble, has robust performance. However, size-950 embeddings as input mean a large number of parameters to tune for DNNs. The other three methods (SVD, 1TON, 1TON+) have the advantage of smaller dimensionality. SVD reduces CONC's dimensionality dramatically and is still competitive, especially on word similarity. 1TON is competitive on analogy, but weak on word similarity. 1TON+ performs consistently strongly on word similarity and analogy. Table 3 uses the metaembeddings of the intersection vocabulary, so it directly shows the quality enhancement achieved by our ensemble approaches; this enhancement is not due to bigger coverage.
System comparison of learning OOV embeddings. In Table 4, we extend the vocabularies of each individual embedding set ("ind" block) and of our ensemble approaches ("ensemble" block) to the vocabulary union, reporting results on RW and analogy, the tasks containing the most OOVs. As both word2vec and GloVe have full coverage on analogy, we do not re-report them in this table. This subtask specifically tests the "coverage" property: MUTUALLEARNING and 1TON+ can cover the union vocabulary, which is bigger than that of each individual embedding set. The more important question, however, is whether embedding quality is kept or even improved compared to the original embeddings of the component sets.
Table 5: POS tagging results on six target domains. "baselines" lists representative systems for this task, including FLORS. "+indiv / +meta": FLORS with individual embedding sets / metaembeddings. Bold means higher than "baselines" and "+indiv".

For each embedding set, we can compute the representation of an OOV (i) as a randomly initialized vector (RND); (ii) as the average of the embeddings of all known words (AVG); (iii) by MUTUALLEARNING (ml); and (iv) by 1TON+. 1TON+ learns OOV embeddings for individual embedding sets and metaembeddings simultaneously, and it would not make sense to replace these OOV embeddings computed by 1TON+ with embeddings computed by RND/AVG/ml. Hence, we do not report RND/AVG/ml results for 1TON+.

Table 4 shows four interesting aspects. (i) MUTUALLEARNING helps much if an embedding set has many OOVs in a task; e.g., MUTUALLEARNING is much better than AVG and RND on RW, and outperforms RND considerably for CONC, SVD and 1TON on analogy. However, it cannot make a big difference for HLBL/CW on analogy, probably because these two embedding sets have far fewer OOVs, in which case AVG and RND work well enough. (ii) AVG produces bad results for CONC, SVD and 1TON on analogy, especially on the syntactic subtask. These systems have large numbers of OOVs in the word analogy task. If for an analogy "a is to b as c is to d" all four of a, b, c, d are OOVs, then they are represented by the same average vector; the similarity between b − a + c and each OOV is then 1.0, and it is almost impossible to predict the correct answer d. Unfortunately, CONC, SVD and 1TON have many OOVs, resulting in the low numbers in Table 4. (iii) MUTUALLEARNING learns very effective embeddings for OOVs. CONC-ml, 1TON-ml and SVD-ml all achieve better results than word2vec and GloVe on analogy (e.g., for semantic analogy: 88.1, 87.3, 88.2 vs. 81.4 for GloVe). Considering further their bigger vocabulary, these ensemble methods are very strong representation learning algorithms. (iv) The performance of 1TON+ for learning OOV embeddings is competitive with MUTUALLEARNING. For HLBL/Huang/CW, 1TON+ performs slightly better than MUTUALLEARNING on all four metrics. Comparing 1TON-ml with 1TON+, 1TON+ is better on RW and the semantic task, while performing worse on the syntactic task.

Figure 4 shows the influence of the dimensionality d for SVD, 1TON and 1TON+. Peak performance for different data sets and methods is reached for d ∈ [100, 500].
There are no big differences in the averages across data sets and methods for high enough d, roughly in the interval [150,500]. In summary, as long as d is chosen to be large enough (e.g., ≥ 150), performance is robust.

Domain Adaptation for POS Tagging
In this section, we test the quality of the individual embedding sets and of our metaembeddings in a part-of-speech (POS) tagging task. For POS tagging, we add word embeddings to FLORS (Schnabel and Schütze, 2014; available at cistern.cis.lmu.de/flors), the state-of-the-art POS tagger for unsupervised domain adaptation.
FLORS tagger. FLORS treats POS tagging as a window-based (as opposed to sequence-based), multilabel classification problem using LIBLINEAR (liblinear.bwaldvogel.de; Fan et al., 2008), a linear SVM. A word's representation consists of four feature vectors: one each for its suffix, its shape and its left and right distributional neighbors. Suffix and shape features are standard features used in the literature; our use of them in FLORS is exactly as described by Schnabel and Schütze (2014).
Let f(w) be the concatenation of the two distributional vectors and the suffix and shape vectors of word w. Then FLORS represents token v_i as follows:

    f(v_{i-2}) ⊕ f(v_{i-1}) ⊕ f(v_i) ⊕ f(v_{i+1}) ⊕ f(v_{i+2})

where ⊕ is vector concatenation. Thus, token v_i is tagged based on a 5-word window.
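A sketch of this windowed representation; the per-word feature function f here is a stand-in (real FLORS concatenates suffix, shape and distributional vectors), and zero-padding at sentence boundaries is our simplifying assumption:

```python
import numpy as np

def token_representation(f, sentence, i):
    """Concatenate the feature vectors f(w) of a 5-word window around
    position i; zero-padding at sentence boundaries is a simplification."""
    parts = []
    for j in range(i - 2, i + 3):
        if 0 <= j < len(sentence):
            parts.append(f(sentence[j]))
        else:
            parts.append(np.zeros_like(f(sentence[i])))
    return np.concatenate(parts)

# Hypothetical per-word feature function; real FLORS uses suffix, shape
# and distributional vectors, plus the word embedding as a fifth vector.
f = lambda w: np.full(3, float(len(w)))
sent = ["the", "cat", "sat", "on", "mats"]
print(token_representation(f, sent, 2).shape)  # (15,)
```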
FLORS is trained on sections 2-21 of the Wall Street Journal (WSJ) and evaluated on the development sets of six different target domains: five SANCL (Petrov and McDonald, 2012) domains (newsgroups, weblogs, reviews, answers, emails) and sections 22-23 of WSJ for in-domain testing.
Original FLORS mainly depends on distributional features. We add the word's embedding as a fifth feature vector and test whether this additional feature helps the task. All embedding sets (except for 1TON+) are extended to the union vocabulary by MUTUALLEARNING. Table 5 gives results for some representative systems ("baselines"), FLORS with individual embedding sets ("+indiv") and FLORS with metaembeddings ("+meta"). The following conclusions can be drawn. (i) Not all individual embedding sets are beneficial in this task; e.g., HLBL embeddings make FLORS perform worse in 11 out of 12 cases. (ii) However, in most cases embeddings improve system performance, which is consistent with prior work on using embeddings for this type of task (Xiao and Guo, 2013; Yang and Eisenstein, 2014; Tsuboi, 2014). (iii) Metaembeddings generally help more than the individual embedding sets, except for SVD (which performs better in only 3 out of 12 cases).

Conclusion
This work presented four ensemble methods for learning metaembeddings from multiple embedding sets: CONC, SVD, 1TON and 1TON+. Experiments on word similarity, word analogy and POS tagging show the high quality of the metaembeddings; e.g., they outperform GloVe and word2vec on analogy. The ensemble methods have the added advantage of increasing vocabulary coverage. We make our metaembeddings available at http://cistern.cis.lmu.de/meta-emb.