Tiny Word Embeddings Using Globally Informed Reconstruction

We reduce the model size of pre-trained word embeddings by a factor of 200 while preserving their quality. Previous studies in this direction created smaller word embedding models by reconstructing pre-trained word representations from those of subwords, so that only a small number of subword embeddings need to be stored in memory. However, previous studies that train the reconstruction models using only the target words cannot reduce the model size drastically while preserving quality. Inspired by the observation that words with similar meanings have similar embeddings, our reconstruction training learns the global relationships among words and can be employed in various models for word embedding reconstruction. Experimental results on word similarity benchmarks show that the proposed method improves the performance of all subword-based reconstruction models.


Introduction
Word embeddings form the basis for many natural language processing (NLP) applications, e.g., text classification (Shen et al., 2018) and machine translation (Qi et al., 2018). However, widely used pre-trained word embeddings such as fastText (Bojanowski et al., 2017) are considerably large, which makes it difficult to develop NLP applications in memory-limited environments such as mobile devices. For example, fastText (crawl-300d-2M-subword) requires approximately 2 GB of memory.
In previous studies, the model size has been reduced by reconstructing word embeddings from characters (Pinter et al., 2017; Kim et al., 2018) and character N-grams (Zhao et al., 2018; Sasaki et al., 2019). As the number of characters or character N-grams is significantly smaller than that of words, accurately reconstructing word embeddings from these subwords can reduce the model size while preserving the performance of applications. As shown in Figure 1, existing methods reconstruct word embeddings from subword embeddings so as to mimic the corresponding pre-trained word embeddings. These methods rely only on local information, namely the subwords of the target word and its pre-trained embedding.
To improve the performance of word embedding reconstruction, we propose a global loss function that uses words other than the target word as clues. Inspired by the observation that words with similar meanings have similar pre-trained embeddings, our reconstruction training learns the similarity among word embeddings. Our method can be easily applied to any method for reconstructing a word embedding from subwords, regardless of the subword unit or the network structure.
Experimental results on word similarity tasks (Faruqui and Dyer, 2014) show that our global loss function improves the performance of word embedding reconstruction for all previous methods. When the proposed method was applied to the method based on a self-attention mechanism for reconstructing word embeddings from character N-grams (Sasaki et al., 2019), the model size was reduced to 74 MB (1/30) while preserving 97% of the quality of the original word embeddings. Furthermore, even when the model size was reduced to 12 MB (1/200), 86% of the quality was preserved.
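The stated compression ratios can be checked with simple arithmetic. The sketch below assumes that the original model stores 2M vectors of 300 dimensions (as the name crawl-300d-2M-subword suggests) as 4-byte floats and that sizes are measured in decimal megabytes; the storage format is an assumption, and `embedding_size_mb` is a hypothetical helper.

```python
def embedding_size_mb(vocab_size, dim, bytes_per_value=4):
    """Size of a dense embedding table in decimal megabytes.

    bytes_per_value=4 assumes 32-bit floats, which is an assumption
    rather than something stated in the text.
    """
    return vocab_size * dim * bytes_per_value / 1_000_000


# Assumed shape of the original fastText table: 2M words x 300 dimensions.
original_mb = embedding_size_mb(2_000_000, 300)

# Reported sizes of the reconstructed N-gram SAM models.
for name, size_mb in [("Medium", 74), ("Small", 12)]:
    ratio = original_mb / size_mb
    print(f"{name}: {size_mb} MB, about 1/{ratio:.0f} of {original_mb:.0f} MB")
```

Under these assumptions the original table is about 2.4 GB, so 12 MB corresponds to a factor of 200 and 74 MB to a factor of roughly 32, in line with the approximate ratios reported.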

Word Embeddings Reconstruction Based on Global Similarity Loss
We denote the pre-trained embedding of a word $w \in W$ and the randomly initialized embedding of a subword $s \in S$ as $e_w$ and $v_s$, respectively. As in previous studies (Pinter et al., 2017; Kim et al., 2018; Zhao et al., 2018; Sasaki et al., 2019), we reconstruct the word embedding $e_w$ of word $w$ as $\hat{e}_w$ from its set of subwords $\phi(w)$ by minimizing the following loss function.
where $f(\cdot)$ is the function for reconstructing a word embedding, i.e., the reconstruction network shown in Figure 1, and $d_w$ is the dimension of the pre-trained word embeddings. As reconstruction networks, a recurrent neural network (RNN) (Pinter et al., 2017), a convolutional neural network (CNN) (Kim et al., 2018), and a self-attention mechanism (Sasaki et al., 2019) were used in the previous studies.
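The local reconstruction objective can be illustrated with a minimal sketch. It assumes that the subword set is made of character N-grams with boundary markers, that the reconstruction function simply averages subword embeddings, and that the loss is the squared error averaged over the embedding dimensions. All three are simplifying assumptions: the networks used in the cited studies are an RNN, a CNN, and a self-attention mechanism, and the names `char_ngrams`, `reconstruct`, and `local_loss` are hypothetical.

```python
import random


def char_ngrams(word, n_min=3, n_max=4):
    """Hypothetical phi(w): character N-grams of the word with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def reconstruct(subwords, subword_emb):
    """Toy f(.): average of the subword embeddings (an assumption, not the
    RNN, CNN, or self-attention networks of the cited studies)."""
    dim = len(subword_emb[subwords[0]])
    total = [0.0] * dim
    for s in subwords:
        for i, x in enumerate(subword_emb[s]):
            total[i] += x
    return [x / len(subwords) for x in total]


def local_loss(e_w, e_hat):
    """Assumed local loss: squared error averaged over the d_w dimensions."""
    d_w = len(e_w)
    return sum((a - b) ** 2 for a, b in zip(e_w, e_hat)) / d_w


# Usage with randomly initialized subword embeddings v_s (4 dimensions for brevity).
rng = random.Random(0)
grams = char_ngrams("cat")
v = {s: [rng.uniform(-0.1, 0.1) for _ in range(4)] for s in grams}
e_hat = reconstruct(grams, v)
loss = local_loss([0.1, 0.2, 0.3, 0.4], e_hat)
```

Training would adjust the subword embeddings (and any network parameters) to reduce this loss for every target word.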
To improve word embedding reconstruction, we employ a loss function based on global information in addition to the loss function of Equation (1), which depends only on the local information of the target word w. As shown in Figure 2, we consider not only the relationship between the reconstructed and pre-trained embeddings of the target word but also the relationship between the reconstructed embedding of the target word and the pre-trained embeddings of other words. We define the global loss function using the cosine similarity among word embeddings as follows:
In this study, we sample n words from W in each training batch. To balance the similarity distribution among the selected n words, we first select the top-n/2 words that have a high cosine similarity to the target word, and then randomly select the rest from the training batch. Finally, we minimize the loss function that combines Equations (1) and (2), as given below:
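The sampling scheme and the combined objective described above can be sketched as follows. The exact form of the losses is an assumption here: the global term is taken to be the mean squared difference between the cosine similarity of the pre-trained target embedding to each sampled word and that of the reconstructed target embedding to the same word, the local term is a squared error averaged over dimensions, and the two are summed without weighting. The sketch also assumes that both halves of the sample are drawn from the training batch and that the target word itself is excluded. All function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def sample_context_words(target, batch, emb, n, rng):
    """Pick the n/2 batch words most similar to the target, then fill the
    remaining slots at random from the rest of the batch."""
    candidates = [w for w in batch if w != target]
    n = min(n, len(candidates))
    ranked = sorted(candidates,
                    key=lambda w: cosine(emb[target], emb[w]),
                    reverse=True)
    top = ranked[: n // 2]
    rest = rng.sample(ranked[n // 2:], n - len(top))
    return top + rest


def global_loss(e_w, e_hat, context_embs):
    """Assumed global loss: mean squared gap between cosine similarities
    measured from the pre-trained and from the reconstructed embedding."""
    if not context_embs:
        return 0.0
    gaps = [(cosine(e_w, c) - cosine(e_hat, c)) ** 2 for c in context_embs]
    return sum(gaps) / len(gaps)


def combined_loss(e_w, e_hat, context_embs):
    """Assumed combination: unweighted sum of the local and global terms."""
    local = sum((a - b) ** 2 for a, b in zip(e_w, e_hat)) / len(e_w)
    return local + global_loss(e_w, e_hat, context_embs)


# Usage on a toy batch of pre-trained embeddings.
emb = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0],
       "d": [-1.0, 0.0], "e": [0.9, 0.5]}
context = sample_context_words("a", list(emb), emb, n=2, rng=random.Random(0))
loss = combined_loss(emb["a"], [0.8, 0.2], [emb[w] for w in context])
```

Selecting half of the sample by similarity ensures that each target word sees some close neighbors, while the random half exposes it to dissimilar words as well.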

Implementation Details
We employed the 300-dimensional pre-trained fastText (Bojanowski et al., 2017) as the original word embeddings. Each reconstruction network was trained using 100k words selected in descending order of frequency. Words containing hyphens and numbers were excluded, and we only targeted words consisting of the 26 lowercase Latin letters. We sampled n = 10 words for the global loss calculation. To minimize the loss, we adopted Adam (Kingma and Ba, 2015) ($\alpha = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$) with a batch size of 50. The training was stopped after 5 epochs without improvement in the training loss.
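The vocabulary selection and the stopping criterion can be sketched as follows. The sketch assumes that the character filter is applied before the 100k cutoff and that "no improvement" means the last five epochs never fall below the best earlier training loss; the text does not specify either detail, and `select_training_words` and `should_stop` are hypothetical helpers.

```python
import re

# Words made up only of the 26 lowercase Latin letters.
LOWERCASE_WORD = re.compile(r"[a-z]+")


def select_training_words(words_by_frequency, limit=100_000):
    """Keep the first `limit` purely lowercase alphabetic words.

    `words_by_frequency` is assumed to be sorted by descending frequency.
    """
    selected = []
    for word in words_by_frequency:
        if LOWERCASE_WORD.fullmatch(word):
            selected.append(word)
            if len(selected) == limit:
                break
    return selected


def should_stop(epoch_losses, patience=5):
    """Return True once the last `patience` epochs brought no improvement
    over the best training loss seen before them."""
    if len(epoch_losses) <= patience:
        return False
    best_before = min(epoch_losses[:-patience])
    return min(epoch_losses[-patience:]) >= best_before


# Usage: hyphenated, numeric, and capitalized entries are skipped.
words = select_training_words(["the", "co-op", "x2", "London", "cat", "dog"], limit=2)
```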

Baseline Methods
We applied the global loss function to the following four methods that reconstruct pre-trained word embeddings from subwords.
Character RNN (Pinter et al., 2017): This method uses characters as subwords. It employs a bidirectional long short-term memory (BiLSTM) network with 512 hidden dimensions on top of a 32-dimensional character embedding layer.

Experimental Results
According to Table 1, the proposed method improves the performance of word similarity estimation in all settings. In particular, in the "Small" setting of the N-gram SAM model, the proposed method improves the performance by 25%. These experimental results indicate the effectiveness of the proposed method, which considers global relationships among words. For the "Medium" setting of the N-gram SAM model, the model size can be reduced to 74 MB, which is approximately 1/30 (3.3%) of that of the original fastText, while preserving 97% of its performance. Further, for the "Small" setting, the model size can be reduced by a factor of 200 to 12 MB, which is approximately 0.5% of that of the original model. Nonetheless, the model preserves 86% of the performance of the original model.
Table 4: Nearest neighbors of the words "london" (upper section) and "flu" (lower section).

Effect of Sample Size
In Table 1, we sampled n = 10 words for the global loss calculation. In this section, we investigate the effect of the number of sampled words on the performance of word similarity estimation. Table 2 shows the performance when the sample size was set to n = 10, 20, and 50. Note that n = 0 is the baseline model, which does not perform the global loss calculation.
According to Table 2, a large sample size for global loss calculation is not always effective. Overall, the n = 20 model performs better than the n = 10 model, but there is no large improvement beyond n = 20. Nevertheless, the proposed method consistently performs better than the n = 0 baseline model regardless of the sample size.

Analysis of Nearest Neighbors
To analyze the reconstructed embeddings, we searched for words with similar meanings among the 100k high-frequency words in fastText. Table 3 shows the precision@5 for similar-word searches. As the correct similar words, we used the five words with the highest cosine similarity to the query in the original fastText. In this analysis, the sample size for the global loss calculation is n = 10, and the methods based on character N-grams use the "Small" settings. Table 3 shows that the proposed method always improves the performance of searching for similar words.
Table 4 lists the results of similar-word searches for the words "london" and "flu." According to the upper section of the table, the original fastText returns large cities of the United Kingdom, such as "glasgow" and "birmingham," as neighbors of "london." However, with the N-gram SAM (n = 0) in the "Small" setting, words such as "lon" and "lond," which are similar on the surface but significantly different in meaning, appear at the top. With the N-gram SAM (n = 10), which considers the relationships among words, semantically similar words such as "glasgow" and "edinburgh" acquire embeddings similar to that of "london." The lower section of the table shows the example of "flu." The baseline model fails in reconstruction and lists irrelevant words, whereas the proposed method successfully finds words that are semantically similar to "flu," such as "influenza" and "pneumonia."
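The precision@5 measure used in this analysis can be sketched as follows. The sketch assumes that the predicted neighbors are retrieved among the reconstructed embeddings of the same vocabulary and compared against the neighbors retrieved among the original embeddings; `nearest_neighbors` and `precision_at_k` are hypothetical helper names, and the toy vectors are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbors(query, emb, k=5):
    """The k words most similar to the query, excluding the query itself."""
    scored = [(cosine(emb[query], vec), w) for w, vec in emb.items() if w != query]
    scored.sort(key=lambda t: (-t[0], t[1]))  # ties broken alphabetically
    return [w for _, w in scored[:k]]


def precision_at_k(query, original_emb, reconstructed_emb, k=5):
    """Fraction of the top-k reconstructed neighbors that are also among
    the top-k neighbors in the original embeddings."""
    gold = set(nearest_neighbors(query, original_emb, k))
    predicted = nearest_neighbors(query, reconstructed_emb, k)
    return sum(w in gold for w in predicted) / k


# Usage on toy two-dimensional embeddings with k = 2.
original = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.8, 0.3],
            "d": [0.0, 1.0], "e": [-1.0, 0.0]}
reconstructed = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [-1.0, 0.1],
                 "d": [1.0, 0.05], "e": [-1.0, 0.0]}
score = precision_at_k("a", original, reconstructed, k=2)
```

Averaging this score over all query words gives a single precision@5 value per model.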

Conclusion
We proposed a loss function that considers global relationships among words for the reconstruction of pre-trained word embeddings from subword embeddings. Experimental results on word similarity benchmarks show that the proposed method improves the performance of all the reconstruction networks. By applying the proposed method, we can compress the 2 GB fastText model to 12 MB while preserving 86% of the quality of the original word embeddings. This method will help develop NLP applications in memory-limited environments, e.g., mobile devices.