Learning Word Embeddings for Low-Resource Languages by PU Learning

Word embeddings are a key component in many downstream applications for processing natural languages. Existing approaches often assume that a large collection of text is available for learning effective word embeddings. However, such a corpus may not be available for some low-resource languages. In this paper, we study how to effectively learn a word embedding model on a corpus with only a few million tokens. In such a situation, the co-occurrence matrix is sparse because the co-occurrences of many word pairs are unobserved. In contrast to existing approaches, which often sample only a few unobserved word pairs as negative examples, we argue that the zero entries in the co-occurrence matrix also provide valuable information. We then design a Positive-Unlabeled Learning (PU-Learning) approach to factorize the co-occurrence matrix and validate the proposed approach in four different languages.


Introduction
Learning word representations has become a fundamental problem in processing natural languages. These semantic representations, which map a word into a point in a linear space, have been widely applied in downstream applications, including named entity recognition (Guo et al., 2014), document ranking (Nalisnick et al., 2016), sentiment analysis (Irsoy and Cardie, 2014), question answering (Antol et al., 2015), and image captioning (Karpathy and Fei-Fei, 2015).
Over the past few years, various approaches have been proposed to learn word vectors (e.g., Pennington et al., 2014; Mikolov et al., 2013a; Levy and Goldberg, 2014b; Ji et al., 2015) based on co-occurrence information between words observed in the training corpus. The intuition is to assign similar vectors to words that appear in similar contexts. To learn a good word embedding, most approaches assume that a large collection of text is freely available, such that the estimation of word co-occurrences is accurate. For example, the Google Word2Vec model (Mikolov et al., 2013a) is trained on the Google News dataset, which contains around 100 billion tokens, and the GloVe embedding (Pennington et al., 2014) is trained on a crawled corpus that contains 840 billion tokens in total. However, such an assumption may not hold for low-resource languages such as Inuit or Sindhi, which are not spoken by many people or have not been put into a digital format. For those languages, usually only a limited-size corpus is available, and training word vectors under such a setting is a challenging problem.
One key restriction of existing approaches is that they mainly rely on the word pairs that are observed to co-occur in the training data. When the text corpus is small, most word pairs are unobserved, resulting in an extremely sparse co-occurrence matrix (i.e., most entries are zero). For example, the text8 corpus has about 17,000,000 tokens and 71,000 distinct words. The corresponding co-occurrence matrix has more than five billion entries, but only about 45,000,000 are non-zero (observed in the training corpus). Most existing approaches, such as GloVe and Skip-gram, cannot handle a vast number of zero entries in the co-occurrence matrix; therefore, they only sub-sample a small subset of zero entries during training.
In contrast, we argue that the unobserved word pairs can provide valuable information for training a word embedding model, especially when the co-occurrence matrix is very sparse. Inspired by the success of Positive-Unlabeled Learning (PU-Learning) in collaborative filtering applications (Pan et al., 2008;Hu et al., 2008;Pan and Scholz, 2009;Qin et al., 2010;Paquet and Koenigstein, 2013;Hsieh et al., 2015), we design an algorithm to effectively learn word embeddings from both positive (observed terms) and unlabeled (unobserved/zero terms) examples. Essentially, by using the square loss to model the unobserved terms and designing an efficient update rule based on linear algebra operations, the proposed PU-Learning framework can be trained efficiently and effectively.
We evaluate the performance of the proposed approach on English and three other resource-scarce languages. We collected unlabeled corpora from Wikipedia and compared the proposed approach with two popular approaches for training word embeddings, the GloVe and Skip-gram models. The experimental results show that our approach significantly outperforms the baseline models, especially when the size of the training corpus is small.
Our key contributions are summarized below.
• We propose a PU-Learning framework for learning word embedding.
• We tailor the coordinate descent algorithm (Yu et al., 2017b) for solving the corresponding optimization problem.
• Our experimental results show that PU-Learning improves the word embedding training in the low-resource setting.

Related work
Learning word vectors. The idea of learning word representations can be traced back to Latent Semantic Analysis (LSA) (Deerwester et al., 1990) and Hyperspace Analogue to Language (HAL) (Lund and Burgess, 1996), where word vectors are generated by factorizing a word-document and a word-word co-occurrence matrix, respectively. Similar approaches can be extended to learn other types of relations between words (Yih et al., 2012; Chang et al., 2013) or entities (Chang et al., 2014). However, due to their reliance on principal component analysis, these approaches are often less flexible. Besides, directly factorizing the co-occurrence matrix may let frequent words dominate the training objective.
In the past decade, various approaches have been proposed to improve the training of word embeddings. For example, instead of factorizing the co-occurrence count matrix, Bullinaria and Levy (2007) and Levy and Goldberg (2014b) proposed to factorize point-wise mutual information (PMI) and positive PMI (PPMI) matrices, as these metrics rescale the co-occurrence counts. The Skip-gram model with negative sampling (SGNS) and the Continuous Bag-of-Words model (Mikolov et al., 2013b) were proposed for training word vectors at a large scale without consuming a large amount of memory. GloVe (Pennington et al., 2014) was proposed as an alternative that decomposes a weighted log co-occurrence matrix with a bias term added to each word. More recently, the WordRank model (Ji et al., 2015) was proposed to minimize a ranking loss, which naturally fits tasks with ranking-based evaluation metrics. Stratos et al. (2015) also proposed a CCA (canonical correlation analysis)-based word embedding, which shows competitive performance. All these approaches focus on situations where a large text corpus is available.
Positive and Unlabeled (PU) Learning: Positive and Unlabeled (PU) learning (Li and Liu, 2005) was proposed for training a model when the positive instances are partially labeled and the unlabeled instances are mostly negative. Recently, PU learning has been used in many classification and collaborative filtering applications due to the nature of "implicit feedback" in many recommendation systems: users usually only provide positive feedback (e.g., purchases, clicks), and it is very hard to collect negative feedback.
To resolve this problem, a series of PU matrix completion algorithms has been proposed (Pan et al., 2008; Hu et al., 2008; Pan and Scholz, 2009; Qin et al., 2010; Paquet and Koenigstein, 2013; Hsieh et al., 2015; Yu et al., 2017b). The main idea is to assign a small uniform weight to all the missing or zero entries and factorize the corresponding matrix. Among them, Yu et al. (2017b) proposed an efficient algorithm for matrix factorization with PU-learning, such that the weighted matrix is constructed implicitly. In this paper, we design a new approach for training word vectors by leveraging the PU-Learning framework and existing word embedding techniques. To the best of our knowledge, this is the first work to train word embedding models using the PU-learning framework.

Table 1: Notation.

  W, C          vocabularies of central and context words
  m, n          vocabulary sizes
  k             dimension of word vectors
  W, H          m × k and n × k latent matrices
  C_ij          weight for the (i, j) entry
  A_ij          value of the PPMI matrix
  Q_ij          value of the co-occurrence matrix
  w_i, h_j      i-th row of W and j-th row of H
  b_i, b~_j     bias terms
  λ_i, λ_j      regularization parameters
  |·|           the size of a set
  Ω             set of possible word-context pairs
  Ω+            set of observed word-context pairs
  Ω−            set of unobserved word-context pairs

PU-Learning for Word Embedding
Similar to GloVe and other word embedding learning algorithms, the proposed approach consists of three steps. The first step is to construct a co-occurrence matrix. Following the literature (Levy and Goldberg, 2014a), we use the PPMI metric to measure the co-occurrence between words. In the second step, a PU-Learning approach is applied to factorize the co-occurrence matrix and generate word vectors and context vectors. Finally, a post-processing step generates the final embedding for each word by combining its word vector and context vector. We summarize the notations used in this paper in Table 1 and describe the details of each step in the remainder of this section.

Building the Co-Occurrence Matrix
Various metrics can be used for estimating the co-occurrence between words in a corpus. The PPMI metric stems from point-wise mutual information (PMI), which has been widely used as a measure of word association in NLP (Church and Hanks, 1990). In our case, each entry PMI(w, c) measures the association between a word w and a context word c by the ratio between their joint probability (the chance that they appear together in a local context window) and their marginal probabilities (the chance that they appear independently) (Levy and Goldberg, 2014b). More specifically, each entry of the PMI matrix is defined as

  PMI(w, c) = log [ P^(w, c) / ( P^(w) P^(c) ) ],

where P^(w), P^(c), and P^(w, c) are the estimated frequencies of word w, word c, and the word pair (w, c), respectively. The PMI matrix can be computed from the co-occurrence counts of word pairs, and it is an information-theoretic association measure that effectively eliminates the large differences in magnitude among entries in the co-occurrence matrix.
Extending the PMI metric, the PPMI metric replaces all the negative entries in the PMI matrix by 0:

  PPMI(w, c) = max( PMI(w, c), 0 ).

The intuition is that people usually perceive positive associations between words (e.g., "ice" and "snow"), whereas negative associations are hard to define (Levy and Goldberg, 2014b). Therefore, it is reasonable to replace the negative entries in the PMI matrix by 0, so that negative associations are treated as "uninformative". Empirically, several existing works (Levy et al., 2015; Bullinaria and Levy, 2007) showed that the PPMI metric achieves good performance on various semantic similarity tasks. In practice, we follow the pipeline described in Levy et al. (2015) to build the PPMI matrix and apply several useful tricks to improve its quality. First, we apply a context distribution smoothing mechanism to enlarge the probability of sampling a rare context. In particular, all context counts are scaled to the power of α:

  P^_α(c) = #(c)^α / Σ_c' #(c')^α,

where #(c) denotes the number of times context word c appears. This smoothing mechanism effectively alleviates PPMI's bias towards rare words (Levy et al., 2015).
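As a concrete illustration, a minimal NumPy sketch of the PPMI construction with context distribution smoothing might look as follows (the function name and array layout are ours; the actual pipeline follows Levy et al. (2015)):

```python
import numpy as np

def ppmi_matrix(counts, alpha=0.75):
    """Build a PPMI matrix from raw co-occurrence counts.

    counts: (m, n) array, counts[i, j] = #(w_i, c_j).
    alpha:  context distribution smoothing exponent.
    """
    total = counts.sum()
    p_wc = counts / total                 # joint probability P^(w, c)
    p_w = counts.sum(axis=1) / total      # word marginal P^(w)
    # smoothed context marginal: #(c)^alpha / sum_c' #(c')^alpha
    ctx = counts.sum(axis=0) ** alpha
    p_c = ctx / ctx.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w[:, None] * p_c[None, :]))
    pmi[~np.isfinite(pmi)] = 0.0          # unobserved pairs -> 0
    return np.maximum(pmi, 0.0)           # PPMI: clip negative entries
```

Unobserved pairs produce log 0 = -inf in the PMI matrix; setting them to 0 before clipping matches the convention that zero co-occurrence yields a zero PPMI entry.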
Next, previous studies show that words that occur too frequently often dominate the training objective (Levy et al., 2015) and degrade the performance of word embeddings. To avoid this issue, we follow Levy et al. (2015) and sub-sample words whose frequency f(w) exceeds a threshold t, discarding each occurrence with probability

  p = 1 − sqrt( t / f(w) ).
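For illustration, the sub-sampling rule can be sketched in a few lines (a simplified single-pass version; the function name and interface are ours):

```python
import math
import random

def subsample(tokens, t=1e-4, rng=random.random):
    """Drop frequent tokens: a token w with corpus frequency f(w) > t
    is discarded with probability p = 1 - sqrt(t / f(w))."""
    total = len(tokens)
    freq = {}
    for w in tokens:
        freq[w] = freq.get(w, 0) + 1
    kept = []
    for w in tokens:
        f = freq[w] / total
        p_discard = 1.0 - math.sqrt(t / f) if f > t else 0.0
        if rng() >= p_discard:            # keep with probability 1 - p
            kept.append(w)
    return kept
```

Injecting `rng` makes the behavior deterministic for testing; in practice the default `random.random` is used.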

PU-Learning for Matrix Factorization
We propose a matrix-factorization-based word embedding model that aims to minimize the reconstruction error on the PPMI matrix. The low-rank embeddings are obtained by solving the following optimization problem:

  min_{W,H}  Σ_{(i,j)∈Ω} C_ij ( A_ij − w_i^T h_j )^2 + Σ_i λ_i ||w_i||^2 + Σ_j λ_j ||h_j||^2,    (3)

where W and H are m × k and n × k latent matrices, representing words and context words, respectively.
The first term in Eq. (3) minimizes the reconstruction error, and the second and third terms are regularization terms with weights λ_i and λ_j; these are hyper-parameters that need to be tuned. A zero entry in the co-occurrence matrix means that two words never appear together in the current corpus; we refer to such entries as unobserved terms. An unobserved term can be either a true zero (the two words do not co-occur even in a very large corpus) or simply missing from the small corpus. In contrast to SGNS, which sub-samples a small set of zero entries as negative samples, our model uses the information from all the zero entries.
The set Ω includes all |W| × |C| entries, both positive and zero:

  Ω = { (w, c) | w ∈ W, c ∈ C } = Ω+ ∪ Ω−.

Note that we define the positive samples Ω+ to be all the (w, c) pairs that appear at least once in the corpus, and the negative samples Ω− to be the word pairs that never appear in the corpus.
Weighting function. Eq. (3) is very similar to the objective used in previous matrix factorization approaches such as GloVe, but we propose a new way to set the weights C_ij. If we set equal weights for all entries (C_ij = constant), the model is very similar to conducting SVD on the PPMI matrix, which has been shown to suffer from poor performance (Pennington et al., 2014). More advanced methods, such as GloVe, set non-uniform weights for observed entries to reflect their confidence. However, the time complexity of their algorithm is proportional to the number of nonzero weights (|{(i, j) | C_ij ≠ 0}|), so they have to set zero weights for all the unobserved entries (C_ij = 0 for Ω−), or incorporate a small set of unobserved entries by negative sampling. We propose to set the weights for Ω+ and Ω− differently, using the following scheme:

  C_ij = (Q_ij / x_max)^α   if (i, j) ∈ Ω+ and Q_ij < x_max,
  C_ij = 1                  if (i, j) ∈ Ω+ and Q_ij ≥ x_max,    (5)
  C_ij = ρ                  if (i, j) ∈ Ω−.

Here x_max and α are re-weighting parameters, and ρ is the unified weight for unobserved terms; we discuss them later. For entries in Ω+, we set non-uniform weights as in GloVe (Pennington et al., 2014), which assigns a larger weight to a context word that appears more often with the given word while avoiding overwhelming the other terms. For entries in Ω−, instead of setting their weights to 0, we assign a small constant weight ρ. The main idea comes from the PU-learning literature (Hu et al., 2008; Hsieh et al., 2015): although missing entries are highly uncertain, they are still likely to be true zeros, so we should incorporate them in the learning process, but with a smaller weight reflecting the uncertainty. Therefore, ρ in (5) reflects how confident we are in the zero entries.
In our experiments, we set x_max = 10 and α = 3/4 following (Pennington et al., 2014), and leave ρ as a parameter to tune. Experiments show that adding the weighting function substantially improves the performance, especially on analogy tasks.
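Putting the cases of Eq. (5) together with these parameter values, the weighting scheme amounts to a small helper function (a sketch; the function name is ours):

```python
def pu_weight(q, x_max=10.0, alpha=0.75, rho=0.0625):
    """Weight C_ij for one entry of the co-occurrence matrix Q.

    Observed pairs (q > 0) get the GloVe-style weight
    min((q / x_max)^alpha, 1); unobserved pairs (q == 0) get the
    small uniform PU weight rho instead of being dropped."""
    if q <= 0:
        return rho
    return min((q / x_max) ** alpha, 1.0)
```

Setting `rho=0` recovers GloVe's behavior of ignoring unobserved entries entirely.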
Bias term. Unlike previous work on PU matrix completion (Yu et al., 2017b; Hsieh et al., 2015), we add bias terms for the word and context word vectors. Instead of directly using w_i^T h_j to approximate A_ij, we use

  A_ij ≈ w_i^T h_j + b_i + b~_j,

where b_i and b~_j are the bias terms for the word and the context word, respectively. Yu et al. (2017b) designed an efficient column-wise coordinate descent algorithm for solving the PU matrix factorization problem; however, they do not consider bias terms in their implementation. To incorporate the bias terms into (3), we propose the following training algorithm based on the coordinate descent approach. Our algorithm does not introduce much overhead compared to that of Yu et al. (2017b).
We augment each w_i, h_j ∈ R^k into the following (k + 2)-dimensional vectors:

  w_i' = [ w_i, 1, b_i ],    h_j' = [ h_j, b~_j, 1 ].

Therefore, for each word and context vector we have the equality

  w_i'^T h_j' = w_i^T h_j + b~_j + b_i,

which means the loss function in (3) can be written in terms of the augmented vectors. We also denote W' = [w_1', w_2', . . . , w_m']^T and H' = [h_1', h_2', . . . , h_n']^T. In the column-wise coordinate descent method, at each iteration we pick a t ∈ {1, . . . , k + 2} and update the t-th columns of W' and H'. The updates can be derived for the following two cases: a. When t ≤ k, the elements of the t-th column are w_1t, . . . , w_mt, and we can directly use the update rule derived in Yu et al. (2017b) to update them.
b. When t = k + 1, we do not update the corresponding column of W' since its elements are all 1, and we use a similar coordinate descent update for the (k + 1)-th column of H' (corresponding to b~_1, . . . , b~_n). When t = k + 2, we do not update the corresponding column of H' (its elements are all 1), and we update the (k + 2)-th column of W' (corresponding to b_1, . . . , b_m) using coordinate descent.
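The augmentation trick can be checked numerically in a few lines (the helper name is ours):

```python
import numpy as np

def augment(w, h, b_w, b_h):
    """Fold the bias terms into the factorization by augmenting each
    k-dim vector to k + 2 dims:
        w' = [w, 1, b_w],  h' = [h, b_h, 1],
    so that w' . h' = w . h + b_h + b_w."""
    w_aug = np.concatenate([w, [1.0, b_w]])
    h_aug = np.concatenate([h, [b_h, 1.0]])
    return w_aug, h_aug
```

This is why the plain coordinate descent updates still apply: the biases are just two extra coordinates, with the constant-1 columns held fixed.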
With some further derivation, we can show that the algorithm requires only O(nnz(A) + nk) time to update each column, so the overall complexity is O(nnz(A)k + nk^2) time per epoch, which is proportional only to the number of nonzero terms in A. Therefore, with the same time complexity as GloVe, we can utilize the information from all the zero entries in A instead of only sub-sampling a small set of them.
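To make the role of the uniform weight ρ concrete, the following naive full-gradient sketch optimizes the weighted objective by splitting C into a dense ρ part plus a sparse correction on the observed entries, which is the same decomposition the efficient coordinate descent solver exploits. Unlike that solver, this sketch materializes the full residual and therefore costs O(mnk) per step; all names are ours.

```python
import numpy as np

def pu_objective(W, H, A, Cpos, rho, lam):
    """PU objective: sum_ij C_ij (A_ij - w_i . h_j)^2 + lam(|W|^2 + |H|^2),
    where C_ij = Cpos_ij on observed (nonzero) entries of A and rho elsewhere."""
    E = W @ H.T - A
    C = np.where(A != 0, Cpos, rho)
    return (C * E ** 2).sum() + lam * ((W ** 2).sum() + (H ** 2).sum())

def pu_gradient_step(W, H, A, Cpos, rho, lam, lr=0.01):
    """One naive full-gradient step. Writing C = rho + (Cpos - rho) * 1[A != 0]
    splits C * E into a dense rho part and a sparse correction; the efficient
    solver evaluates the dense part through the k x k Gram matrices W^T W and
    H^T H instead of forming W H^T, which brings the cost down to
    O(nnz(A) k + n k^2) per epoch."""
    E = W @ H.T - A
    G = rho * E + (Cpos - rho) * E * (A != 0)   # elementwise C * E, via the split
    W_new = W - lr * (2 * G @ H + 2 * lam * W)
    H_new = H - lr * (2 * G.T @ W + 2 * lam * H)
    return W_new, H_new
```

A few hundred such steps on a toy matrix already drive the objective down, illustrating that the zero entries are being fit (with weight ρ) rather than ignored.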

Interpretation of Parameters
In the PU-Learning formulation, ρ represents the unified weight assigned to the unobserved terms. Intuitively, ρ reflects our confidence in the unobserved entries: a larger ρ means we are quite certain about the zeros, while a small ρ indicates that many of the unobserved pairs are not truly zero. When ρ = 0, the PU-Learning approach reduces to a model similar to GloVe, which discards all the unobserved terms. In practice, ρ is an important parameter to tune, and we find that ρ = 0.0625 achieves the best results in general. The other parameter, λ, is the regularization weight that prevents the embedding model from overfitting. In practice, we found the performance is not very sensitive to λ as long as it is reasonably small. More discussion of the parameter settings can be found in Section 5.
Post-processing of Word/Context Vectors. The PU-Learning framework factorizes the PPMI matrix and generates two vectors for each word i, w_i ∈ R^k and h_i ∈ R^k. The former represents the word when it is the central word, and the latter represents it when it appears in a context. Levy et al. (2015) show that combining these two vectors (u_i^avg = w_i + h_i) leads to consistently better performance. The same trick for constructing word vectors is also used in GloVe. Therefore, in the experiments, we evaluate all models with u^avg.

Experimental Setup
Our goal in this paper is to train word embedding models for low-resource languages. In this section, we describe the experimental designs to evaluate the proposed PU-learning approach. We first describe the data sets and the evaluation metrics. Then, we provide details of parameter tuning.

Evaluation tasks
We consider two widely used tasks for evaluating word embeddings: the word similarity task and the word analogy task. In the word similarity task, each question contains a word pair and an annotated similarity score, and the goal is to predict the similarity between the two words from the inner product of the corresponding word vectors. The performance is measured by Spearman's rank correlation coefficient, which estimates the correlation between the model predictions and human annotations. Following the literature, the experiments are conducted on five data sets: WordSim353 (Finkelstein et al., 2001), WordSim Similarity (Zesch et al., 2008), WordSim Relatedness (Agirre et al., 2009), Mechanical Turk (Radinsky et al., 2011), and MEN (Bruni et al., 2012). In the word analogy task, we aim at solving analogy puzzles like "man is to woman as king is to ?", where the expected answer is "queen." We consider two approaches for generating answers to the puzzles, 3CosAdd and 3CosMul (see Levy and Goldberg (2014a) for details). We evaluate performance on the Google analogy dataset (Mikolov et al., 2013a), which contains 8,860 semantic and 10,675 syntactic questions. For the analogy task, only an answer that exactly matches the annotated answer is counted as correct. As a result, the analogy task is more difficult than the similarity task: the evaluation metric is stricter, and it requires algorithms to differentiate between words with similar meanings to find the right answer.
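For reference, the 3CosMul scoring rule can be sketched as follows (a toy implementation over a dictionary vocabulary; the names are ours, and the vectors are assumed to be unit-normalized):

```python
import numpy as np

def cos_mul_analogy(a, a_star, b, vocab, eps=1e-3):
    """3CosMul: answer 'a is to a_star as b is to ?' by maximizing
    cos(x, a_star) * cos(x, b) / (cos(x, a) + eps) over the vocabulary.

    vocab: dict mapping word -> unit-normalized vector."""
    def cos(u, v):
        return (u @ v + 1.0) / 2.0   # shift cosine into [0, 1]
    scores = {}
    for w, x in vocab.items():
        if w in (a, a_star, b):
            continue                 # exclude the question words
        scores[w] = (cos(x, vocab[a_star]) * cos(x, vocab[b])
                     / (cos(x, vocab[a]) + eps))
    return max(scores, key=scores.get)
```

The cosine is shifted to be non-negative before multiplying, following the formulation of Levy and Goldberg (2014a); 3CosAdd instead sums the (unshifted) similarities.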
To evaluate the models in the low-resource setting, we train word embedding models on Dutch, Danish, Czech, and English data sets collected from Wikipedia. The original Wikipedia corpora in Dutch, Danish, Czech, and English contain 216 million, 47 million, 92 million, and 1.8 billion tokens, respectively. To simulate the low-resource setting, we sub-sample the Wikipedia corpora and create a subset of 64 million tokens for Dutch and Czech and a subset of 32 million tokens for English. We will demonstrate how the size of the corpus affects the performance of the embedding models in the experiments.
To evaluate the performance of word embeddings in Czech, Danish, and Dutch, we translate the English similarity and analogy test sets into the other languages using the Google Cloud Translation API. However, an English word may be translated to multiple words in another language (e.g., compound nouns); we discard questions containing such words (see Table 3 for details). Because all approaches are compared on the same test set for each language, the comparisons are fair.

Implementation and Parameter Setting
We compare the proposed approach with two baseline methods, GloVe and SGNS. The implementations of GloVe and SGNS are provided by the original authors, and we apply the default settings when appropriate. The proposed PU-Learning framework is implemented based on Yu et al. (2017a). With the implementation of efficient update rules, our model requires less than 500 seconds to perform one iteration over the entire text8 corpus, which consists of 17 million tokens. All the models are implemented in C++. We follow Levy et al. (2015) and set the window size to 15, the minimal count to 5, and the dimension of word vectors to 300 in the experiments. Training word embedding models involves selecting several hyper-parameters. However, as word embeddings are usually evaluated in an unsupervised setting (i.e., the evaluation data sets are not seen during training), the parameters should not be tuned on each dataset. To conduct a fair comparison, we tune the hyper-parameters on the text8 dataset. For the GloVe model, we tune the discount parameter x_max and find that x_max = 10 performs the best. SGNS has a natural parameter k denoting the number of negative samples; as in Levy et al. (2015), we found that setting k to 5 leads to the best performance. For the PU-learning model, ρ and λ are two important parameters, denoting the unified weight of zero entries and the weight of the regularization terms, respectively. We tune ρ in the range from 2^-1 to 2^-14 and λ in the range from 2^0 to 2^-10, and we analyze the sensitivity of the model to these hyper-parameters in the experimental results section. The best performance of each model on the text8 dataset is shown in Table 2; the PU-learning model outperforms both baseline models.

Experimental Results
We compared the proposed PU-Learning framework with two popular word embedding models, SGNS (Mikolov et al., 2013b) and GloVe (Pennington et al., 2014), on English and three other languages. The experimental results are reported in Table 4. The results show that the proposed PU-Learning framework outperforms the two baseline approaches significantly on most datasets. These results confirm that the unobserved word pairs carry important information and that the PU-Learning model leverages such information to achieve better performance. To better understand the model, we conduct detailed analyses as follows.

Figure 1: Performance change as the corpus size grows (a) on the Google word analogy task (left) and (b) on the WS353 word similarity task (right), for Dutch, Danish, Czech, and English. The PU-Learning model consistently outperforms SGNS and GloVe when the size of the corpus is small.
Performance vs. corpus size. We investigate the performance of our algorithm with respect to the corpus size and plot the results in Figure 1. The results on the analogy task are obtained with the 3CosMul method (Levy and Goldberg, 2014a). As the corpus size grows, the performance of all models improves, and the PU-learning model consistently outperforms the other methods in all tasks. However, as the corpus size increases, the difference becomes smaller. This is reasonable: with a larger corpus, fewer entries of the co-occurrence matrix are zero, and the PU-learning approach increasingly resembles GloVe.
Impacts of ρ and λ. We investigate how sensitive the model is to the hyper-parameters ρ and λ. Figure 2 shows the performance for various values of λ and ρ when training on the text8 corpus; note that the x-axis is in log scale. When ρ is fixed, a large λ degrades the performance of the model significantly, because the model underfits. The model is less sensitive when λ is small, and in general λ = 2^-11 achieves consistently good performance.
When λ is fixed, we observe that a relatively large ρ (e.g., ρ ≈ 2^-4) leads to better performance. As ρ represents the weight assigned to the unobserved terms, this result confirms that the model benefits from using the zero terms in the co-occurrence matrix.

Conclusion
In this paper, we presented a PU-Learning framework for learning word embeddings for low-resource languages. We evaluated the proposed approach on English and three other languages and showed that it outperforms the baselines by effectively leveraging the information from unobserved word pairs.
In the future, we would like to conduct experiments on other languages for which text corpora are relatively hard to obtain. We are also interested in applying the proposed approach to domains where the amount of accessible data is small, such as legal documents and clinical notes. In addition, we plan to study how to leverage other information to facilitate the training of word embeddings in the low-resource setting.