Continuous Word Embedding Fusion via Spectral Decomposition

Word embeddings have become a mainstream tool in statistical natural language processing. Practitioners often use pre-trained word vectors, which were trained on large generic text corpora and are readily available on the web. However, pre-trained word vectors often lack important words from specific domains. It is therefore often desirable to extend the vocabulary and embed new words into a set of pre-trained word vectors. In this paper, we present an efficient method for including new words from a specialized corpus into pre-trained generic word embeddings. We build on the established view of word embeddings as matrix factorizations to present a spectral algorithm for this task. Experiments on several domain-specific corpora with specialized vocabularies demonstrate that our method is able to embed the new words efficiently into the original embedding space. Compared to competing methods, our method is faster, parameter-free, and deterministic.


Introduction
There has been a recent surge of neural word embedding models (Mikolov et al., 2013a,b). These models have been shown to perform well in a variety of NLP problems, such as word similarity and relational analogy tasks. Word embeddings play a crucial role in diverse fields such as computer vision (Hwang and Sigal, 2014), news classification (Kenter and De Rijke, 2015; Phung and De Vine, 2015), and machine translation (Zou et al., 2013; Wu et al., 2014), and have been extended in various ways (Rudolph et al., 2016; Bamler and Mandt, 2017; Peters et al., 2018).
Instead of training word embeddings from scratch, practitioners often resort to high-quality, pre-trained word embeddings which can be downloaded from the web. These embeddings were trained on massive corpora, such as online news. The downside is that their vocabulary is often restricted, and by nature, is very generic. An open question remains of how to optimally include new words from highly specialized text corpora into an existing word embedding fit. Such transfer learning has several advantages. First, one saves the computational burden of learning high-quality word vectors from scratch. Second, one can already rely on the fact that the majority of word embedding vectors are semantically meaningful. Third, as we show in this paper, there are deterministic and parameter-free approaches that fulfill this goal, making the scheme robust and reproducible.
For a practical application, imagine that we are given a small corpus, such as a collection of scientific articles, and our goal is to include the associated vocabulary into a pre-trained set of word vectors which were learned on Google Books or Wikipedia. The scientific corpus contains both common words (e.g., "propose", "experiment") and domain-specific words, such as "submodular" and "sublinear". However, this specialized corpus can be safely assumed to be too small to train a word embedding model from scratch. Alternatively, we could merge the domain-specific corpus with a large generic corpus and train the entire word-embedding from scratch, but this would be computationally demanding and non-reproducible due to the non-convexity of the underlying optimization problem. In this paper, we show how to include the specialized vocabulary into the generic set of word vectors without having to re-train the model on the large vocabulary, simply relying on linear algebra.
A naive baseline method is to fix the pre-trained word vectors and only update the new ones. We found that this approach suffers from local optima and is sensitive to hyper-parameters; it is therefore not reliable in practice. In contrast, our approach is not based on gradient descent and is therefore more robust, deterministic, and parameter-free.
In this paper, we propose a Spectral Online Word Embedding (SOWE) algorithm for integrating new words into a pre-trained word embedding fit. Our approach is based on online matrix factorization. In more detail, our main contributions are as follows:
• We propose a Spectral Online Word Embedding (SOWE) method to include new words into a pre-trained set of word vectors. By construction, this approach does not suffer from optimization problems such as initialization, parameter tuning, and local optima.
• Our approach approximately reduces to an online matrix factorization problem. We provide a bound on the approximation error and show that this error does not scale with the vocabulary size (Theorem 1).
• The complexity of the proposed method scales linearly with the size of the vocabulary and quadratically with the embedding dimension, making the approach feasible in large-scale applications.
• We evaluate our method on two domain-specific corpora. Experimental results show that our method is able to embed new vocabulary faster than the baseline method while obtaining more meaningful embeddings. It is also parameter-free and deterministic, making the approach easily reproducible.

Related Work
Our paper starts from word embeddings learned via the skip-gram method and shows how the vocabulary can be extended in an online fashion, using methods from linear algebra. As such, our approach relates to word embeddings, online learning, and the singular value decomposition (SVD).
Skip-Gram Model Our model builds on word embeddings trained via the skip-gram model with negative sampling (SGNS), proposed by Mikolov et al. (2013a,b). These papers proposed a scalable training algorithm based on negative sampling. The model predicts a target word in the middle of a sentence based on its surrounding words (contexts). Each target word / context word is associated with a feature vector whose entries are latent, and are treated as parameters to be learned. This model is efficient to train via stochastic gradient descent; its resulting word vectors provide state-of-the-art results on various linguistic tasks (Pickhardt et al., 2014). The skip-gram model was influential both in the machine learning and related communities. Levy and Goldberg (2014) showed that word2vec can be viewed as an implicit matrix factorization of the pointwise mutual information matrix (PMI) of word distributions. The authors present a closed-form solution based on a singular value decomposition of a sparse version of this matrix, termed SPPMI. In this paper, we extend this view and present an efficient online learning algorithm based on a decomposition of the SPPMI matrix which starts from pre-trained word embeddings.
Kiros et al. (2015) propose adding new words into an existing embedding space using a projection method to warm-start learning the new words. The authors assumed that there already exist well-trained word vectors from a large underlying vocabulary, and present a method that projects word vectors from an old space to a new space, where the projection matrix is learned from known words. Bojanowski et al. (2016) exploited character-level features. Concretely, they train a character-n-gram model to locate the new word vector near existing words with a similar root. Le and Mikolov (2014) introduced paragraph-level vectors (instead of word-level ones), fixed-length feature representations for variable-length texts. When embedding new paragraphs, old paragraph vectors are frozen, and the new ones are updated. Furthermore, Luo et al. (2015) proposed an efficient online method to address the memory issue encountered when learning word embeddings based on nonnegative matrix factorization.

Online Word Embeddings
Online SVD We already discussed word embeddings via implicit matrix factorization (IMF) above. This method is based on a truncated SVD of a square matrix whose size is the vocabulary size. As this paper combines this idea with online learning, we review related work on online singular value decompositions.
Online SVD (incremental SVD) is a classical problem in numerical linear algebra (Datta, 2010), and is used intensively in recommendation systems (Sarwar et al., 2002; Brand, 2003) and subspace learning (Li, 2004). Online SVD admits only an approximate solution. Recently, methods based on iterative learning have been proposed to reduce the approximation error involved (Shamir, 2015; Allen-Zhu and Li, 2016). In this paper, we use the same online SVD method as Sarwar et al. (2002), which has a closed-form solution.

Method
We present our spectral word embedding method to efficiently insert new words from an extended vocabulary into pre-trained word embeddings, without having to re-train the model on the extended vocabulary. We first introduce some relevant background with respect to word embedding via implicit matrix factorization (Section 3.1) before presenting our method (Section 3.2) and theoretical consideration (Section 3.3).
Notation In this paper, the vocabulary of the existing pre-trained word embedding is called the base vocabulary; its size is $m$. After adding new words, we call the whole vocabulary the extended vocabulary; its size is $n \equiv m + m'$, where $m'$ is the number of unique new words. We assume $m' \ll m$, so $O(n) = O(m)$. Furthermore, let $d$ denote the embedding dimension. As will be explained below, $S_0$ and $S_{\text{full}}$ denote the SPPMI matrices of the base and extended vocabularies, respectively. The subscript "full" thus always refers to the extended vocabulary.

Background: Word Embedding via Implicit Matrix Factorization (IMF)
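The basis of our approach is the skip-gram model with negative sampling (SGNS), also called word2vec (Mikolov et al., 2013a,b). Let $D$ denote the set of all observed word-context pairs. Furthermore, $\#(w, c)$ denotes the number of times the pair $(w, c)$ appears in $D$, with marginal counts $\#(w) = \sum_c \#(w, c)$ and $\#(c) = \sum_w \#(w, c)$, and $\mathbf{w}$ and $\mathbf{c}$ are the word and context embeddings. The objective that SGNS minimizes is
$$\ell = -\sum_{w}\sum_{c} \#(w, c)\left(\log\sigma(\mathbf{w}^\top\mathbf{c}) + k\,\mathbb{E}_{c_N \sim P_D}\left[\log\sigma(-\mathbf{w}^\top\mathbf{c}_N)\right]\right), \qquad (1)$$
where $\sigma$ is the logistic sigmoid, $k$ is the number of negative samples, and negative contexts $c_N$ are drawn from the empirical unigram distribution $P_D(c) = \#(c)/|D|$.
In the limit of large $d$, Levy and Goldberg (2014) found the following closed-form solution to Eq. 1:
$$\mathbf{w}^\top\mathbf{c} = \log\frac{\#(w, c)\,|D|}{\#(w)\,\#(c)} - \log k.$$
The first term can be seen as an empirical estimate of Pointwise Mutual Information (PMI):
$$\mathrm{PMI}(w, c) = \log\frac{\#(w, c)\,|D|}{\#(w)\,\#(c)}.$$
Levy and Goldberg (2014) suggested a sparse and consistent alternative called Shifted Positive PMI (SPPMI):
$$\mathrm{SPPMI}(w, c) = \max\left(\mathrm{PMI}(w, c) - \log k,\; 0\right). \qquad (2)$$
Levy and Goldberg (2014) showed that, using such a sparse representation, word and context embeddings can be efficiently obtained using a truncated singular value decomposition.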
As will be explained in the next section, our approach builds on the intuition that word2vec implicitly factorizes the SPPMI matrix. Given pre-trained word and context vectors, we ask for an efficient way of extending the vocabulary and re-adjusting these vectors accordingly.
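To make the SPPMI construction concrete, the following is a minimal sketch that computes max(PMI(w, c) − log k, 0) from a small dense co-occurrence matrix. The function name `sppmi`, the dense NumPy representation, and the toy counts are illustrative assumptions; a realistic vocabulary would call for a sparse matrix.

```python
import numpy as np


def sppmi(counts, k=5):
    """Shifted positive PMI of a dense word-by-context count matrix.

    counts[i, j] plays the role of #(w_i, c_j); k is the number of
    negative samples, so the shift is log k. (Hypothetical helper.)
    """
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()                          # |D|
    w_freq = counts.sum(axis=1, keepdims=True)    # #(w), one per row
    c_freq = counts.sum(axis=0, keepdims=True)    # #(c), one per column
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(counts * total / (w_freq * c_freq))
    # Unobserved pairs give log(0) or 0/0; treat them as -inf so that the
    # positive part below maps them to exactly zero.
    pmi[~np.isfinite(pmi)] = -np.inf
    return np.maximum(pmi - np.log(k), 0.0)


toy_counts = np.array([[10, 0], [0, 10]])
print(sppmi(toy_counts, k=1))
```

Because only positive entries survive the shift, a larger k makes the resulting matrix sparser.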

SOWE: Spectral Online Word Embedding
Our method takes advantage of the implicit matrix factorization view to efficiently embed previously unseen words. Given a pre-trained word embedding, we first transform it into SVD form. Using this form, under a mild approximation, we can utilize efficient online SVD (Sarwar et al., 2002) to obtain the word embeddings for the extended vocabulary. Figure 1 presents a sketch of the problem that we want to solve. We start from an $(m+m') \times (m+m')$ matrix in the extended vocabulary space, for which we seek an approximate factorization. We assume that we already have a factorization of the upper-left submatrix of size $m \times m$, implicitly obtained from word2vec or related word embedding algorithms. We seek an efficient linear algebra algorithm that, given the block structure, results in a factorization of the whole matrix in time linear in $m$. This will be detailed below.
Algorithm 1: Spectral Online Word Embedding (SOWE)
Input: Old word/context vectors $W$ and $C$ with $S_0 \equiv W C^\top$; co-occurrence matrices involving the new vocabulary, $x'$, $y'$, and $z'$ (see Fig. 1).
Output: Word/context vectors $W_{\text{full}}$, $C_{\text{full}}$ for the extended vocabulary.
Overview The following steps summarize our overall proposed procedure.
• We start by assuming that our pre-trained word embedding came from a factorization of the pointwise mutual information matrix of word frequencies. Thus, $S_0 \approx W C^\top$.
• In order to make use of efficient online SVD, we need to convert this matrix product into SVD form. This can be done in $O(m)$ time and results in $W C^\top = U \Sigma V^\top$ (Section 3.2.1).
• Next, we need to estimate all elements of the extended pointwise mutual information matrix (see Figure 1). We first estimate the dominant block (Section 3.2.2 (i)) and show that it can be approximated by our previously obtained SVD. We then estimate the remaining blocks (Section 3.2.2 (ii)). For the latter, we need to estimate the frequencies of the new words relative to the old words.
• We are now in a position to efficiently compute a new SVD of the extended pointwise mutual information matrix, $U_{\text{full}} \Sigma_{\text{full}} V_{\text{full}}^\top$, using online SVD. The computational cost is still $O(m)$ (Section 3.2.2 (iii)).
• Finally, we define our new embedding matrices as $W_{\text{full}} = U_{\text{full}} \sqrt{\Sigma_{\text{full}}}$ and $C_{\text{full}} = V_{\text{full}} \sqrt{\Sigma_{\text{full}}}$, which completes our algorithm.
These steps will be explained in more detail below.

SVD from Word-Context Vectors
The first step in our algorithm is to obtain a singular value decomposition (SVD) of the old vocabulary's approximate PMI matrix $S_0$. Our working hypothesis is that our pre-trained word and context embedding matrices $W$ and $C$ already approximately factorize this matrix,
$$S_0 \approx W C^\top. \qquad (3)$$
Levy and Goldberg (2014) showed that this factorization is exact in the limit of a large enough embedding dimension $d$, but is only approximately true otherwise. In this paper, we will use Eq. 3 as a working hypothesis.
Computing an SVD of $S_0$ directly would usually cost $O(m^2)$ and would thus scale quadratically in the vocabulary size. Such a factorization would not be practical, since $m$ is typically on the order of hundreds of thousands. Instead, we show next that a low-rank factorization of $S_0$ in terms of $W$ and $C$ renders this cost linear in the vocabulary size, making the approach practical. The following procedure corresponds to steps 1-6 in Algorithm 1.
A truncated SVD (tSVD) of $S_0$ with rank $d$ can be obtained from QR decompositions (Golub and Van Loan, 1996) of $W$ and $C$ as follows:
$$W = U' R_1, \qquad C = V' R_2,$$
where $U', V' \in \mathbb{R}^{m \times d}$ have orthonormal columns and $R_1, R_2 \in \mathbb{R}^{d \times d}$ are upper triangular. This results in $S_0 = U' R_1 R_2^\top V'^\top$. In a second step, we apply an SVD to $R_1 R_2^\top$:
$$R_1 R_2^\top = U'' \Sigma V''^\top.$$
The cost of this step is small, as $R_1$ and $R_2$ are $d \times d$ matrices. Since the product of a matrix with orthonormal columns and an orthogonal matrix again has orthonormal columns, we obtain the SVD of $S_0$ with $U = U' U''$ and $V = V' V''$. Note that this transformation is exact, since in our approximation $S_0 = W C^\top$ was already of rank $d$. Thus, $W C^\top = U \Sigma V^\top$. The complexity of this operation is $O(md^2)$, which concludes the first step.
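The QR-based conversion described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `svd_from_factors` and the random stand-ins for the pre-trained matrices W and C are assumptions.

```python
import numpy as np


def svd_from_factors(W, C):
    """Return U, s, V such that W @ C.T == U @ diag(s) @ V.T.

    W and C are m x d matrices; the cost is O(m d^2) because the only
    full SVD is taken of a d x d matrix.
    """
    U1, R1 = np.linalg.qr(W)                 # W = U1 R1, U1 is m x d
    V1, R2 = np.linalg.qr(C)                 # C = V1 R2, V1 is m x d
    U2, s, V2t = np.linalg.svd(R1 @ R2.T)    # small d x d problem
    return U1 @ U2, s, V1 @ V2t.T


# Hypothetical stand-ins for pre-trained word and context vectors.
rng = np.random.default_rng(0)
W = rng.normal(size=(200, 8))
C = rng.normal(size=(200, 8))

U, s, V = svd_from_factors(W, C)
# Balanced embeddings that reproduce the same product W @ C.T.
W_new, C_new = U * np.sqrt(s), V * np.sqrt(s)
```

The full m × m product is never formed here; only the m × d factors and d × d matrices are touched.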

Utilizing Online SVD to Embed Extended Vocabulary
The next steps amount to adding new words to the old embeddings by adding rows and columns to the original SPPMI matrix, and efficiently factorizing it via online SVD.
Given a representation of the old block of the PMI matrix in terms of an SVD, our next task is to compute the new elements of this matrix that correspond to the $m'$ new words of the extended vocabulary, with $m' \ll m$. We denote this matrix $S_{\text{full}} \in \mathbb{R}^{(m+m') \times (m+m')}$, and it has the following block structure (see also Fig. 1):
$$S_{\text{full}} = \begin{pmatrix} S & x' \\ y'^\top & z' \end{pmatrix}. \qquad (4)$$
In the following three steps, we describe how to estimate and efficiently factorize this matrix.
(i) Approximating the main block In a first step, we approximate the main block S of the SPPMI matrix (Eq. 4). We show that to a first approximation, this is just the SPPMI matrix of the original vocabulary, hence S ≈ S 0 .
Algorithm 2: Recap: Online SVD (OSVD)
To set up the SPPMI matrix, the following formula has to be applied to the observed co-occurrence counts $\#(w, c)$ between all words and context words in the extended vocabulary:
$$\mathrm{SPPMI}(w, c) = \max\left(\log\frac{\#(w, c)\,|D|}{\#(w)\,\#(c)} - \log k,\; 0\right), \qquad (5)$$
where $k$ is the number of negative samples. Besides the co-occurrence counts, this also involves the absolute frequencies $\#(w)$ and $\#(c)$ of words and contexts in the extended vocabulary, as well as the total number of counts $|D|$. Note that all these quantities enter only on a logarithmic scale. The co-occurrence counts $\#(w, c)$ are the same for the SPPMI matrices of the original and full vocabularies. What differs slightly are the absolute counts $\#(w)$, $\#(c)$, and $|D|$ (these are slightly higher in the extended corpus). However, since we assumed that the original training corpus was much bigger than the corpus containing the new words, we can safely assume that the change in $\log \#(w)$, $\log \#(c)$, and $\log |D|$ is negligible (we further specify and analyze this approximation in Section 3.3). Thus, $S \approx S_0$. Furthermore, since we have shown in Section 3.2.1 that $S_0 = U \Sigma V^\top$, this results in
$$\hat{S}_{\text{full}} = \begin{pmatrix} U \Sigma V^\top & x' \\ y'^\top & z' \end{pmatrix}. \qquad (6)$$
(ii) Adding rows and columns The matrices $x', y' \in \mathbb{R}^{m \times m'}$ in Eq. 6 are tall-and-skinny matrices that contain information about cross co-occurrences between old and new words, and $z' \in \mathbb{R}^{m' \times m'}$ contains the co-occurrences of new words in the new vocabulary. Next, we describe how to estimate these quantities, taking into account that we do not have access to the original training corpus that was used to learn the word embeddings of the old vocabulary.
In the following, we focus on $x'$ as an example (estimating $y'$ works analogously). In this case, we observe the co-occurrence counts $\#(w, c)$ between words $w$ from the base vocabulary and context words $c$ from the new vocabulary. To compute the SPPMI (Eq. 2), we then apply Eq. 5 to all obtained counts. This results in $x'$. The remaining problem is that $\#(w)$ and $|D|$ are unknown to us, and some heuristics have to be found to circumvent this problem.
First, notice that $\#(w)/|D|$ corresponds to the word frequencies in the original corpus. Furthermore, we are only interested in $\log(\#(w)/|D|)$. Because this quantity enters only through its logarithm, the result is comparatively insensitive to errors in its estimate.
When using pre-trained word embeddings, the embedding vectors are typically sorted according to word frequency. We estimated the word frequencies based on their frequencies in the smaller corpus. For words from the old vocabulary that are not present in the new corpus, we interpolated using an exponential model, taking their frequency rankings into account.
Another heuristic has to be found to approximate $z'$, in which case $\#(w)$ and $\#(c)$ are available, but $|D|$ is unknown. Here, we assume that the new words are about as rare as the rarest words in the old vocabulary, setting $\#(w)/|D|$ to the frequency of the least frequent word in the corpus. This specifies the extended SPPMI matrix. Next, we show how to efficiently re-factorize it.
(iii) Factorizing the Extended SPPMI Matrix Finally, the approximated SPPMI matrix is efficiently factorized using online SVD.
This cannot be carried out in a single step, because the online SVD method sketched in Algorithm 2 only supports the addition of either rows or columns (here presented for columns). Thus, we first concatenate $U \Sigma V^\top$ and $x'$ horizontally and perform a rank-$d$ truncated SVD. In a second step, we concatenate the resulting singular value decomposition vertically with the concatenation of $y'^\top$ and $z'$ to obtain a truncated SVD of the full SPPMI matrix. Algorithm 2 gives the details; for more details we refer to Sarwar et al. (2002). This results in an approximate SVD of the full matrix. Word and context embeddings can then be read off directly from the SVD.
Finally, let us discuss the complexity of the method. The online SVD subroutine dominates the complexity of our approach, as it scales as $O(nd^2)$. In all steps, the cost remains linear in the vocabulary size. This makes our approach scalable and convenient to use. In contrast, carrying out an SVD from scratch to compute the word and context embeddings would scale quadratically in the vocabulary size, which would be impractical.
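The two-step update (first append columns, then append rows) can be sketched as follows. Since the body of Algorithm 2 is not reproduced here, this sketch uses a standard column-append update in the style of Brand's incremental SVD; it may differ in detail from the variant of Sarwar et al. (2002) that the paper uses. The helper names `append_columns` and `extend_svd` and the synthetic low-rank test matrix are assumptions.

```python
import numpy as np


def append_columns(U, s, V, B, d):
    """Rank-d SVD of [U diag(s) V.T, B], where B holds b new columns."""
    r, b = s.size, B.shape[1]
    P = U.T @ B                          # part of B inside span(U)
    Q, R = np.linalg.qr(B - U @ P)       # basis and coefficients of the rest
    # Small (r + b) x (r + b) core with [U, Q] @ K @ V_pad.T == [A, B].
    K = np.block([[np.diag(s), P], [np.zeros((b, r)), R]])
    Uk, sk, Vkt = np.linalg.svd(K)
    V_pad = np.block([[V, np.zeros((V.shape[0], b))],
                      [np.zeros((b, r)), np.eye(b)]])
    return np.hstack([U, Q]) @ Uk[:, :d], sk[:d], V_pad @ Vkt[:d].T


def extend_svd(U, s, V, x, y, z, d):
    """Approximate rank-d SVD of [[U diag(s) V.T, x], [y.T, z]]."""
    U1, s1, V1 = append_columns(U, s, V, x, d)    # step 1: add columns x
    bottom = np.hstack([y.T, z])                  # new rows [y.T, z]
    # Step 2: appending rows equals appending columns to the transpose,
    # so the roles of the left and right factors are swapped.
    V2, s2, U2 = append_columns(V1, s1, U1, bottom.T, d)
    return U2, s2, V2


# Synthetic check on an exactly rank-d matrix, where the update is exact.
rng = np.random.default_rng(0)
m, m_new, d = 30, 4, 5
F = rng.normal(size=(m + m_new, d)) @ rng.normal(size=(d, m + m_new))
U0, s0, V0t = np.linalg.svd(F[:m, :m])
U_full, s_full, V_full = extend_svd(
    U0[:, :d], s0[:d], V0t[:d].T, F[:m, m:], F[m:, :m].T, F[m:, m:], d)
W_full = U_full * np.sqrt(s_full)    # extended word vectors
C_full = V_full * np.sqrt(s_full)    # extended context vectors
```

Each call only decomposes a matrix of size (d + m') × (d + m') and multiplies tall matrices by it, which is where the linear scaling in the vocabulary size comes from. On real data the extended matrix is not exactly of rank d, so the truncation after each step introduces an approximation error.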

Theoretical Analysis
In this section, we show that under certain assumptions, the difference between the approximate SPPMI matrix $\hat{S}_{\text{full}}$ (Equation 6) and the SPPMI matrix $S_{\text{full}}$ (Equation 4) is bounded. This justifies the previous assumption that we can substitute $\hat{S}_{\text{full}}$ for $S_{\text{full}}$. First, we make some assumptions.
Assumption 1. There exists a constant $c_1 > 0$ such that the number of nonzero (nnz) entries in the $m \times m$ SPPMI matrix can be upper-bounded by $c_1 m$, i.e., $\mathrm{nnz}(\mathrm{SPPMI}) \le c_1 m$.
Remark 1. It is reasonable to assume that in the Shifted Positive PMI matrix, most words are closely related to only a small number of other words.
Assumption 2. Every entry in the co-occurrence matrix can be bounded by $c_2 > 0$, i.e., $\#(w, c) \le c_2$ for all $w, c$. This is always satisfied, since the number of observed co-occurrences is always bounded.
Assumption 3. For the $m \times m$ SPPMI matrix, there exists a constant $c_3 > 0$ such that every word $w$ and context $c$ occurs in the corpus $D$ at least $c_3 m$ times, i.e., $\#(w) \ge c_3 m$ and $\#(c) \ge c_3 m$.
Under these assumptions, Theorem 1 implies that the difference between the SPPMI matrix $S_{\text{full}}$ and its approximation $\hat{S}_{\text{full}}$ is bounded by a constant independent of the vocabulary size. Since in large-scale word embedding models the size of the vocabulary is typically $10^5$ or even $10^6$, the relative difference between these two matrices can be negligible. We provide the proof of the theorem in the supplementary materials.
Table 1: Performance on NIPS Abstracts and Economy News. Running times are in seconds. We report the average over 10 independent runs for FOUN, with standard deviations in brackets only for "loss for new". For FOUN, "loss for all words" is the sum of "loss for new words" and the loss related only to old words (which is a constant). The standard deviation of "loss for all" is therefore equal to that of "loss for new".

Experiment
In this section, we show empirical results where we compare our proposed SOWE with other continuous word embedding algorithms. We first present some generic settings of our experiments, followed by quantitative and qualitative baseline comparisons. Compared to the baselines, we find that our method is more efficient and finds more semantically meaningful embeddings. Our method takes less than one minute to insert about 1,000 domain-specific words (e.g., machine learning related words from NIPS abstracts) into a pre-trained embedding model with more than 180,000 words.
Experimental Setup Our approach starts from pre-trained word vectors from a generic training corpus. We downloaded publicly available pre-trained word embeddings and two small text corpora from specific domains. Our goal is to efficiently insert the domain-specific words that do not already appear in the original vocabulary into the embeddings. First, we report on basic settings for our experiments. The pre-trained embedding model based on English Wikipedia is available online¹. It contains 183,870 words. The embedding dimension is 300. The small, domain-specific corpora that we considered were the following: (1) "NIPS Abstracts": this data set contains abstracts of all the NIPS papers from 2014 to 2016. The data set contains 981 new words. (2) "Economy News": this data set contains news articles with 868 new words. These two corpora are much smaller than the base corpus. For both the base corpus and the new corpus, the text was pre-processed using a window of 5 tokens on each side of the target word, where stop-words and non-textual elements were removed, and sentence splitting and tokenization were applied.
Baselines: FOUN and FOUN+annealing A natural idea for inserting new words into a pre-existing word embedding fit is to fix the old word/context vectors and to update only the new ones. This is our baseline method, referred to in the following as "Fix Old, Update New" (FOUN). The approach uses the word2vec objective and employs stochastic gradient descent for training. We employ Robbins-Monro learning schedules (Robbins and Monro, 1951), setting the stepsize at the t-th step to $a(t + \gamma)^{-0.51}$. We used grid search to find optimal parameters on all considered data sets, and found k = 5, γ = 1e4, a = 1/10 (for different tasks) to be optimal. Due to the randomness involved in the baseline approach, we conducted 10 independent trials using different random seeds for each result and report the averages. For "FOUN+annealing" we used the same settings as in FOUN, but added random zero-mean Gaussian noise to the gradient. To this end, we employed a version of Stochastic Gradient Langevin Dynamics (Welling and Teh, 2011), where we scaled down the noise by a factor of 0.01.
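For reference, the baseline's stepsize schedule can be written out as a one-line function. The function name `robbins_monro_step` is a hypothetical label; the defaults are the values a = 1/10 and γ = 1e4 reported above. An exponent in (0.5, 1] keeps the stepsizes non-summable while their squares remain summable, which is the usual Robbins-Monro requirement.

```python
def robbins_monro_step(t, a=0.1, gamma=1e4, power=0.51):
    """Stepsize a * (t + gamma) ** (-power) at iteration t."""
    return a * (t + gamma) ** (-power)


# Stepsizes at a few iterations, to show how slowly the schedule decays.
schedule = [robbins_monro_step(t) for t in (0, 10_000, 1_000_000)]
print(schedule)
```

With a large offset γ, the stepsize stays nearly constant during the first few thousand updates and only then begins to decay noticeably.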

Loss minimization and runtime
We considered the word2vec loss on the extended vocabulary and evaluated the loss function on the embedding vectors obtained from the different methods under consideration. The associated loss values and runtimes are reported in Table 1 for "NIPS Abstracts" and "Economy News". We found that both approaches yield similar values of the loss function. (In our experiments below, however, we will show that our word vectors seem to reflect the semantics of the original corpus better.) As a clear improvement, we found that our method is faster than the baseline, yielding a speedup of a factor of 8. We consider the baseline method to have converged when the loss value of the current epoch is close to that of the previous epoch (their difference is smaller than a threshold), where the threshold is 1e2 for NIPS Abstracts and Economy News.
Table 6: Performance on the word analogy task using existing embedding results and the text8 dataset with different fold splits. In the row "Ideal", we use the well-trained word embeddings downloaded from the Internet. FOUN and FOUN+annealing are the baselines that we compare against.
Qualitative Nearest Neighbor Test To test whether the learned embedding vectors are semantically meaningful, we chose some words from the new vocabulary and reported their nearest neighbors in the extended vocabulary. We expect the nearest neighbors to have closely related meanings. We used cosine similarity to measure the proximity of words. We chose the word "eurodollars" as a query for "Economy News" and "submodular" for "NIPS Abstracts". The results are reported in Tables 2 and 3. In the case of economics, we see that our algorithm recovered meaningful words such as "midcap" and "ultralow". The baseline methods failed to return meaningful results with respect to the query. One possible reason is that the baseline's underlying optimization algorithm got trapped in a poor local optimum. In the case of NIPS abstracts (Table 3), our SOWE method results in words such as "nonsmooth" and "coreset", which are highly related to the query "submodular", while the FOUN-based methods fail. Our approach thus outperforms the baseline in providing meaningful relationships between the pre-trained word vectors and the newly embedded ones. More examples are provided in the Appendix.
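A nearest-neighbor query under cosine similarity can be sketched as follows. The function name `nearest_neighbors` and the three-word toy vocabulary are assumptions for illustration; the sketch also assumes that no embedding vector is all zeros.

```python
import numpy as np


def nearest_neighbors(query, words, vectors, topn=5):
    """Return the topn words closest to `query` by cosine similarity."""
    idx = words.index(query)
    # Normalize rows so that dot products equal cosine similarities.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    order = np.argsort(-sims)                       # most similar first
    return [(words[i], float(sims[i])) for i in order if i != idx][:topn]


# Hypothetical toy vocabulary and two-dimensional embeddings.
toy_words = ["a", "b", "c"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(nearest_neighbors("a", toy_words, toy_vectors, topn=2))
```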
Evaluations on NLP Tasks Additionally, we evaluated the proposed method on some downstream NLP tasks, such as pairwise word similarity. To this end, we used datasets that contain word pairs associated with human-assigned similarity scores². The word vectors are evaluated by ranking the pairs according to their cosine similarities, and measuring the correlation (Spearman's ρ) with the human ratings.
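The evaluation protocol just described can be sketched as below. The helper names, the dictionary-based embedding lookup, and the toy data are assumptions; for brevity the rank correlation assumes there are no ties among the scores, which a library routine such as SciPy's `spearmanr` would handle properly.

```python
import numpy as np


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def spearman_rho(x, y):
    """Spearman correlation as the Pearson correlation of ranks (no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return float(np.corrcoef(rank_x, rank_y)[0, 1])


def evaluate_similarity(pairs, human_scores, emb):
    """Correlate model cosine similarities with human ratings."""
    model_scores = [cosine(emb[a], emb[b]) for a, b in pairs]
    return spearman_rho(model_scores, human_scores)


# Hypothetical toy embeddings and ratings.
emb = {"a": np.array([1.0, 0.0]), "b": np.array([1.0, 0.1]),
       "c": np.array([1.0, 1.0]), "d": np.array([0.0, 1.0])}
pairs = [("a", "b"), ("a", "c"), ("a", "d")]
print(evaluate_similarity(pairs, [9.0, 5.0, 1.0], emb))
```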
We excluded the word pairs of the similarity test from the original vocabulary and trained word2vec with the associated reduced vocabulary on several corpora. We then added the test words using the three competing methods (SOWE, FOUN, and FOUN+anneal). The fourth algorithm "Ideal" amounts to evaluating the test on the generic pre-trained word embeddings from the web. The results on the word similarity task are shown in Table 4. Our method obtains the best performance on four out of five word similarity tasks.
Word analogy tests consist of questions of the form "a is to a* as b is to b*", where b* must be completed (Mikolov et al., 2013b). We performed such word analogy tests; our results are reported in Table 5. We observe that SOWE outperforms the FOUN-based methods in two out of four cases.
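One common way to answer such a question is the additive rule that picks the word whose vector is closest to a* − a + b, excluding the three question words. The text does not state which scoring rule was used in the experiments, so the rule below, the function name `solve_analogy`, and the toy vectors are all assumptions.

```python
import numpy as np


def solve_analogy(a, a_star, b, words, vectors):
    """Return the word b* that best completes 'a : a* :: b : b*'."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}
    target = unit[index[a_star]] - unit[index[a]] + unit[index[b]]
    # Rows of `unit` have norm one, so ranking by this dot product is the
    # same as ranking by cosine similarity to the target vector.
    scores = unit @ target
    for w in (a, a_star, b):          # the question words cannot be answers
        scores[index[w]] = -np.inf
    return words[int(np.argmax(scores))]


# Hypothetical toy vocabulary: axis 0 ~ royalty, axis 1 ~ gender.
toy_vocab = ["man", "woman", "king", "queen", "apple"]
toy_emb = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0],
                    [1.0, 1.0, 1.0], [-1.0, -1.0, 0.1]])
print(solve_analogy("man", "woman", "king", toy_vocab, toy_emb))
```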
Word Analogy Analysis with Varying Number of Folds We further split the corpus into folds to evaluate the word analogy task, where we varied the size of the folds. Here, we choose the most frequent 20,000 words in text8 and then split the vocabulary into k ∈ {5, 10, 20} folds. All folds but one were considered as base vocabulary, and one fold was considered as new vocabulary. We used implicit matrix factorization on all but one fold, and added the last fold's vocabulary using the different methods under comparison (FOUN, FOUN+anneal and SOWE). We repeated this procedure k times and report means and standard deviations in Table 6. Our method achieves results comparable with the baselines.