Evaluating Word Embeddings on Low-Resource Languages

The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a 41 million token dataset, and the smallest (Old Gujarati) has only 1813 tokens.


Introduction
Imagine you're given the task of training a text classification model for Middle English. This form of English was spoken in the Middle Ages from 1066-1500 CE. It is significantly different from modern English (Chamonikolasová, 2014), and only a handful of historians speak this language today.
A natural first step would be to train word embeddings. So you use the Classical Language ToolKit (CLTK) (Johnson, 2014) to download the largest corpus of known Middle English documents (only 7 million tokens, 0.3 million unique tokens), and Gensim (Řehůřek and Sojka, 2010) to train the embeddings. To evaluate the embeddings, you follow the current standard practice established by Mikolov et al. (2013) of using an analogy test set. Of course, you can't use Mikolov et al. (2013)'s test set: it is in Modern English, and Middle English is not Modern English. But you can't even use translations of their test set: many of the analogy concepts simply didn't exist in the Middle Ages. For example, the analogy London is to England as Paris is to France can be translated perfectly well into Middle English, but the concept of nations and capitals didn't exist in the Middle Ages, so the analogy is not semantically meaningful. To create a meaningful analogy test set, you hire a historian fluent in Middle English, and with considerable effort and research she creates custom analogies that make sense in Middle Age England. With this analogy test set in hand, you train dozens of models with varying hyperparameters. Unfortunately, all of these models get 0 accuracy on your test set. You simply don't have enough data to get good results on the analogy task. As Figure 1 shows, the analogy task requires a large training dataset before it begins getting non-zero results (details provided in Section 3.1 below). But that does not mean that you cannot train word embeddings on Middle English.

[Figure 1: The analogy task (Mikolov et al., 2013) fails to measure the quality of word embeddings trained on small datasets, but our novel OddOneOut and Topk tasks succeed in this regime.]
In this paper, we introduce the OddOneOut and Topk tasks for evaluating word vectors on low-resource languages, successfully train word embeddings for 16 extinct languages (including Middle English), and perform a low-resource emoji embedding task. To get a sense of scale, the original word2vec paper trained English word embeddings on a dataset with 6 billion tokens (Mikolov et al., 2013) with subsequent work improving performance by training on datasets as large as 630 billion tokens (Grave et al., 2018). In this paper, the largest dataset we consider has 41 million tokens, and the smallest only 1813 tokens. We argue that different evaluation techniques are needed for datasets like ours that are more than 1000 times smaller. Figure 1 shows that the OddOneOut and Topk tasks are better suited than the analogy task to measure improvement in embedding quality with datasets like these.

Related Work
Other work in the low-resource regime has focused on developing new training methods rather than evaluation methods. Specifically, the goal has been to reduce the sample complexity of word embedding models by adding new regularizers (Adams et al., 2017; Jiang et al., 2018; Gupta et al., 2019; Jungmaier et al., 2020). A common thread of this work is the difficulty of evaluation. Unfortunately, each of these works evaluates its method only in a simulated low-resource environment using Modern English text and not on any actual low-resource languages. They do this specifically because no evaluation metrics were available that were suitable for their low-resource target languages. More theoretical work has shown that these simulated low-resource design methodologies give biased hyperparameter estimates which systematically overestimate model performance (Kann et al., 2019). This highlights the need for new evaluation methods like ours that are suitable for the low-resource regime.
From an evaluation standpoint, analogies are not the only metric available to tune the hyperparameters of low-resource embedding models. Other work has focused on similarity tasks, establishing evaluation benchmarks based on human annotation of English language word pairs (Finkelstein et al., 2001; Radinsky et al., 2011; Bruni et al., 2012). Compared to the analogy task, these methods are more sensitive in low-resource experimental designs; however, they suffer from the high overhead costs associated with manually generating test datasets.
In contrast, our tasks leverage the Wikidata knowledge base to automate the process of creating custom test sets while still maintaining sensitivity to low-resource settings.
High-resource languages also directly benefit from our methods in two ways. First, we help automate evaluation on many languages. Grave et al. (2018) trained FastText embeddings on 157 languages using data from the Common Crawl project. But they were only able to explicitly evaluate 10 of these language models using the analogy task due to the expense required in developing appropriate test sets. A major advantage of our OddOneOut and Topk methods is that test sets can be generated for them automatically in any of Wikidata's 581 supported languages (including extinct languages like Middle English).
Second, many applications of word embeddings investigate low-resource subsets of high-resource languages. There is a growing body of digital humanities work where English language text is subdivided into smaller corpora based on time periods (e.g. Kulkarni et al., 2015; Hamilton et al., 2016b,a; Dubossarsky et al., 2017; Szymanski, 2017; Chen et al., 2017; Liang et al., 2018; Tang, 2018; Kutuzov et al., 2018; Kozlowski et al., 2019) or different political ideologies (Azarbonyad et al., 2017). Word embeddings are then trained on these smaller corpora, and differences in the resulting embeddings are used to track changes in word usage. Our evaluation methods can be used to improve the evaluation of this work as well.

Contributions
Our contributions can be summarized with the following three points.
1. We introduce the first word embedding evaluation tasks designed specifically for the low-resource setting, OddOneOut and Topk. Code for computing these metrics is released as an open source Python library.
2. We introduce a method for automatically generating test datasets for the OddOneOut and Topk tasks in the 581 languages supported by the Wikidata project.
3. We perform the largest existing multilingual evaluation on low-resource languages using 16 extinct languages from the Classical Language ToolKit (CLTK) (Johnson, 2014). Specifically, we provide word embeddings for 16 of the 18 languages with corpora in the CLTK library (Johnson, 2014) and introduce the Language Comparison Task (LCT) to investigate which topics are included in these classical language corpora.
The remainder of the paper is organized as follows. Section 2 formally defines the Topk and OddOneOut tasks. Section 3 empirically demonstrates that these tasks are better than the analogy task in low-resource settings. We use a synthetic English language experimental design common in previous work, and demonstrate the versatility of our evaluation metrics by applying them to an emoji embedding task for which the analogy task is not even well defined. Section 4 computes word embeddings for 16 extinct languages and introduces our technique to automatically generate test sets for the OddOneOut, Topk, and LCT tasks using Wikidata. We also provide a semantic analysis of the topics covered in each of the 16 language corpora. Section 5 concludes by discussing how extensions to this work could serve communities working with low-resource languages.

Evaluation Methods
The OddOneOut and Topk tasks are simple and widely applicable. Both tasks require a test set consisting of a list of categories, where each category contains a list of words belonging to that category. Figure 4 shows some example categories generated through a fully automated process (described in Section 4.1). The Topk task measures a model's ability to identify words that are related to each category, and conversely the OddOneOut task measures a model's ability to identify words that are unrelated to each category. Formally, assume that there are m categories, that each category has n words, that there are v total words in the vocabulary, and that the words are embedded into R^d. The method of generating the embedding (e.g. word2vec, GloVe, fastText) does not matter. Let c_{i,j} be the jth word in category i, and let C_i = {c_{i,1}, c_{i,2}, ..., c_{i,n}} be the ith category.

The Topk method
Let Sim(k, w) return the k most similar words in the vocabulary to w. We use the cosine distance in all our experiments, but any distance metric can be used. Next, define the Topk score for category i to be

    Topk(k, i) = (1/n) * sum_{j=1}^{n} |Sim(k, c_{i,j}) ∩ C_i| / k

and the Topk score for the entire evaluation set to be

    Topk(k) = (1/m) * sum_{i=1}^{m} Topk(k, i).

Typically k is small (we recommend k = 3, which we use in our experiments), and so the runtime is linear in all of the interesting parameters. In particular, it is linear in the size of our vocabulary, the number of categories in the test set, and the size of the categories.
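The Topk score above can be sketched in a few lines of NumPy. This is a minimal illustration under our own assumptions, not the released library: `embeddings` is assumed to be a word-to-vector dict, the nearest-neighbour search is brute force, and we exclude a word from its own neighbour list.

```python
import numpy as np

def topk_score(embeddings, categories, k=3):
    """Sketch of the Topk metric: the mean fraction of each word's k
    nearest neighbours (cosine similarity) that fall in its own category."""
    words = list(embeddings)
    mat = np.array([embeddings[w] for w in words])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit vectors

    def sim(k, w):
        # k most similar vocabulary words to w, excluding w itself
        scores = mat @ (embeddings[w] / np.linalg.norm(embeddings[w]))
        order = np.argsort(-scores)
        return [words[i] for i in order if words[i] != w][:k]

    per_category = []
    for cat in categories:
        hits = [len(set(sim(k, w)) & (set(cat) - {w})) / k for w in cat]
        per_category.append(sum(hits) / len(cat))
    return sum(per_category) / len(per_category)
```

With two well-separated clusters of words, every neighbour list stays inside its own category and the score is 1; as embeddings degrade, the score falls toward chance.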

The OddOneOut method
Define the OddOneOut score of a set S of k words with respect to a word w ∉ S as

    OddOneOut(S, w) = 1 if argmin_{x ∈ S ∪ {w}} sim(x, µ) = w, and 0 otherwise,

where µ = (1/(k+1)) * sum_{x ∈ S ∪ {w}} x is the mean of the embeddings of S ∪ {w}. We define the kth order OddOneOut score of a category i to be

    OddOneOut(k, i) = (1/|P|) * sum_{(S,w) ∈ P} OddOneOut(S, w),

where P = {(S, w) : S ⊆ C_i, |S| = k, w ∉ C_i}. The total number of values that S can take is (n choose k) = O(n^k), and the total number of values that w can take is O(v), so |P| = O(n^k v). Finally, we define the kth order OddOneOut score of the entire evaluation set to be

    OddOneOut(k) = (1/m) * sum_{i=1}^{m} OddOneOut(k, i).

This exponential dependence on k is very bad. In practice, we used k = 3 in all of our experiments, but even this small value required prohibitively long run times.
To solve this problem, we use a sampling strategy. Let P̂ denote a set of p samples drawn without replacement from the set P. Then we approximate OddOneOut(k, i) as

    OddOneOut(k, i) ≈ (1/p) * sum_{(S,w) ∈ P̂} OddOneOut(S, w),

which is linear in all the parameters of interest. In our experiments, we found p = 1000 to give sufficiently accurate results without taking too much computation.
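The sampled OddOneOut computation can be sketched as follows. The "least similar to the group mean" decision rule and the function signature are our illustrative assumptions, not the released library's API.

```python
import random
import numpy as np

def odd_one_out_sampled(embeddings, category, vocab, k=3, p=1000, seed=0):
    """Sampled OddOneOut score for one category: draw p (S, w) pairs,
    where S is k in-category words and w is an out-of-category intruder,
    and count how often the intruder is least similar to the group mean."""
    rng = random.Random(seed)
    outsiders = [w for w in vocab if w not in set(category)]
    correct = 0
    for _ in range(p):
        S = rng.sample(category, k)          # k words from the category
        w = rng.choice(outsiders)            # one intruder word
        group = S + [w]
        vecs = np.array([embeddings[x] for x in group])
        vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        mu = vecs.mean(axis=0)
        # guess the odd one out: least cosine-similar to the mean vector
        guess = group[int(np.argmin(vecs @ mu))]
        correct += (guess == w)
    return correct / p
```

Because each draw touches only k + 1 vectors, the cost is O(p k) regardless of how large (n choose k) * v is.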

Experiments
We demonstrate the usefulness of our evaluation metrics with two experiments. First, we show that the OddOneOut and Topk metrics are better measures of word embedding quality than the analogy metric in the low-resource regime using simulated English data. Second, we show that the OddOneOut and Topk metrics are useful for model selection in an emoji embedding task where the analogy task is not well defined. This experiment also demonstrates that the OddOneOut and Topk metrics correlate with downstream task performance.

English Experiments
This experiment measures the performance of the OddOneOut, Topk, and analogy metrics as a function of dataset size.
For training data, we use a 2017 dump of the English-language Wikipedia that contains 2 billion total tokens and 2 million unique tokens. The dataset is freely distributed with the popular GenSim library (Řehůřek and Sojka, 2010) for training word embeddings, and it is therefore widely used. State-of-the-art embeddings are trained on significantly larger datasets (for example, datasets based on the Common Crawl contain hundreds of billions of tokens even for non-English languages (Buck et al., 2014; Grave et al., 2018)), but since our emphasis is on the low-resource setting, this 2 billion token dataset is sufficient.

[Figure 2: There does not appear to be any correlation between the performance of the three tasks, indicating that each task is measuring a different aspect of linguistic knowledge.]
Using the Wikipedia dataset, we generate a series of synthetic low-resource datasets of varying size. First, we sort the articles in the Wikipedia dataset randomly. Then, each dataset i contains the first 2^i tokens in the randomly ordered Wikipedia dump.
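The nested-prefix construction can be sketched in a few lines; the helper name and the whitespace tokenization are our illustrative assumptions.

```python
import random

def nested_datasets(articles, sizes=(2**10, 2**14, 2**18), seed=42):
    """Shuffle articles once, flatten to a token stream, and return one
    prefix per requested size, so each smaller dataset is contained in
    every larger one."""
    articles = list(articles)
    random.Random(seed).shuffle(articles)  # fixed seed: same ordering every run
    tokens = [tok for art in articles for tok in art.split()]
    return {n: tokens[:n] for n in sizes}
```

Because every dataset is a prefix of the next, differences in evaluation scores across sizes reflect dataset size alone rather than sampling noise.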
On each of these low-resource datasets, we train a word2vec skipgram model with GenSim's default hyperparameters (embedding dimension 100, 1 epoch, learning rate 0.025, window size 5, min count 5), which are known to work well in many contexts. Importantly, we do not tune these hyperparameters for each low-resource dataset. Instead, we use the same hyperparameters because our goal is to isolate the effects of dataset size on the three evaluation metrics.
For the analogy task, we use the standard Google Analogy test set introduced by Mikolov et al. (2013). This test set contains 14 sets of analogies, and each analogy set contains 2 categories that are being compared. We generate test sets for the OddOneOut and Topk tasks from all 28 categories in the Google test set. For example, the countries-capitals analogy set has analogies like

London is to England as Paris is to France
To convert this analogy to work with the OddOneOut and Topk tasks, we use the set of all capitals and the set of all countries as two separate categories. Applying this same method to each analogy pair in the original Google Analogy test set results in an evaluation dataset that is compatible with the OddOneOut and Topk tasks. Note that this dataset conversion explicitly loses information about how these categories relate to each other. While the analogy task tests a model's knowledge of the relationship between categories, the OddOneOut and Topk tasks only test a model's knowledge about each category individually. Since these tasks test less knowledge, it makes sense that they achieve better performance on smaller datasets. This intuition is confirmed in the results shown in Figure 1. The accuracy broken down by category for our model trained on the full dataset is shown in Figure 2. How a category performs on one task does not seem correlated with how it performs on the other tasks, which indicates that all three tasks are in fact measuring different aspects of linguistic knowledge (and not just representing the same knowledge scaled differently).
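The conversion from analogy pairs to category lists can be sketched against the Google test set's plain-text format (section headers beginning with ":", then lines of four words "a b c d" where a and c belong to one category and b and d to the other). The function name is illustrative.

```python
def analogy_file_to_categories(lines):
    """Convert Google-analogy-style lines into per-section category pairs:
    for each analogy "a b c d", a and c join one category, b and d the other."""
    categories = {}
    section = None
    for line in lines:
        line = line.strip()
        if line.startswith(":"):            # e.g. ": capital-common-countries"
            section = line[1:].strip()
            categories[section] = (set(), set())
        elif line and section is not None:
            a, b, c, d = line.split()
            first, second = categories[section]
            first.update((a, c))            # e.g. capitals
            second.update((b, d))           # e.g. countries
    return categories
```

Each of the 14 analogy sections thus yields two word categories, recovering the 28 categories used for the OddOneOut and Topk test sets.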

Emoji Experiments
This second experiment demonstrates the versatility of our methods by applying them to the domain of emoji embeddings. We show that our generic Topk and OddOneOut metrics perform as well as a custom-designed emoji evaluation metric. That is, we don't sacrifice performance by choosing our easy-to-use and widely applicable metrics over harder-to-use domain-specific metrics.

[Figure 3: Performance of the three metrics and Eisner et al. (2016)'s emoji-specific model selection task as the learning rate varies. All three tasks can be used to estimate the optimal learning rate for the downstream sentiment classification task.]
Emoji embeddings are an important topic of study because they are used to improve the performance of sentiment analysis systems (e.g. Eisner et al., 2016; Felbo et al., 2017; Barbieri et al., 2017; Ai et al., 2017; Wijeratne et al., 2017; Al-Halah et al., 2019). Unfortunately, the standard analogy task is not suitable for evaluating the quality of emoji embeddings for two reasons. First, emoji embeddings are inherently low-resource (only 3000 unique emojis exist in the Unicode standard), and thus evaluation techniques specifically designed for the low-resource setting will be more effective. Second, the semantics of most emojis do not allow them to be used in any analogy task. In particular, the original emoji2vec paper (Eisner et al., 2016) identifies only 6 semantically meaningful emoji analogies.
In order to tune their emoji embeddings, Eisner et al. (2016) therefore do not use the analogy task, and instead introduce an ad-hoc "emoji-description classification" metric that required the creation of a test set with manually labeled emoji-description pairs. Due to the expense of manually creating this test set, only 1661 of the 3000 Unicode emojis are included. The Topk and OddOneOut metrics improve on the "emoji-description classification" metric because they are able to evaluate the quality of all emojis and require no manual test set creation. For our test set categories, we use the categories that the Unicode standard provides for each emoji. To test the performance of the three metrics, we use them to tune the hyperparameters of an emoji2vec model. To ensure the fairest comparison possible, we use the original emoji2vec code for training and model selection, changing only the function call to the metric used. In particular, this means we are only embedding and evaluating on the subset of 1661 emojis supported by the "emoji-description classification" custom metric. The code allows tuning of the model's learning rate, dimension, epochs, and three other hyperparameters unique to Eisner et al. (2016)'s custom metric. We found that the learning rate was the only hyperparameter to have a significant impact. Figure 3 shows how it affects performance on the three evaluation metrics and a downstream sentiment analysis task. All metrics show optimal performance with a learning rate of approximately 8 × 10^-4, which also results in the best performance on the downstream task. This indicates that our Topk and OddOneOut metrics select the same models as the specialized "emoji-description classification" metric, but our metrics have the advantage of being simpler, more widely applicable, and easier to generate test data for.
Note that it is incorrect to conclude that the Eisner et al. (2016) metric is better than the OddOneOut and Topk metrics because it achieves higher accuracy in Figure 3. When evaluating a model selection metric, the important point to consider is the location where the metric is maximized, not the maximum value itself. The location determines the optimal hyperparameter, and all three metrics have maximal performance at the same location.

Multilingual Content Analysis
In this section we perform the first highly multilingual analysis of word embeddings for low-resource languages. We analyze 18 languages provided by the Classical Language ToolKit (CLTK) library (Johnson, 2014). Each is "extinct" in the sense that no new native text will ever be generated in these languages. That is not to say that CLTK is necessarily comprehensive in its coverage, nor that it's impossible that new data sources in these languages will be discovered. Rather, we claim that the data representing them in their historical context (and consequently the theoretical amount of information we are able to extract) is capped. It's true that some classical languages are studied and used in modern times; however, this is almost always motivated by the need to extract meaning from historical corpora, not to add to them. For these reasons, the prospect of improving models on extinct languages through additional data collection seems unlikely. Instead, we must develop better techniques for the low-resource setting.
Among our datasets, the largest is Ancient Hebrew, with 41 million tokens, and the smallest is Old Gujarati with only 1813 tokens. Our techniques successfully let us choose hyperparameters for 16 of the 18 languages under consideration, including the tiny Old Gujarati model.
First we describe a procedure for automatically generating test set data for the OddOneOut and Topk tasks using Wikidata. Then, we describe our model training and selection procedure for each language. Finally, we perform the LCT task and an interlanguage analysis of the corpora's content.

Test Set Generation with Wikidata
One of the most difficult and time consuming steps of evaluating word embeddings on a new language is generating a high quality test set in that language. We now present the first fully automatic way to generate these test sets. Our method uses Wikidata (https://wikidata.org), which is the knowledge base that powers Wikipedia and contains millions of items and their semantic relationships. Wikidata supports 581 languages, and our test set generation method works for all of them. This method does not work for generating analogy test sets; it only generates the category test sets needed for our OddOneOut and Topk tasks. We implement this process using the Wikidata Query Service and SPARQL in our open source Python library, making it easy to generate arbitrary test sets. The idea is straightforward. Some items in Wikidata represent categories, and other items can be an "instance of" or "facet of" these categories. For example, item Q9181 (the Biblical patriarch "Abraham") is an instance of item Q20643955 (the category of "Human Biblical Figures"). We can then generate a category of "Human Biblical Figures" by gathering all the items that are an instance of this category, and extracting the translation of these items in our chosen language(s).
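A sketch of the kind of SPARQL query involved, using only the standard library: the helper names are our own, and the query covers only the "instance of" (P31) case, not "facet of". The released library's interface may differ.

```python
import json
import urllib.parse
import urllib.request

WDQS = "https://query.wikidata.org/sparql"

def category_query(category_qid, lang):
    """SPARQL for the labels, in `lang`, of all items that are an
    'instance of' (P31) the given category item."""
    return f"""
    SELECT ?itemLabel WHERE {{
      ?item wdt:P31 wd:{category_qid} .
      ?item rdfs:label ?itemLabel .
      FILTER(LANG(?itemLabel) = "{lang}")
    }}"""

def fetch_category(category_qid, lang="en"):
    """Run the query against the Wikidata Query Service (network access)."""
    url = WDQS + "?" + urllib.parse.urlencode(
        {"query": category_query(category_qid, lang), "format": "json"})
    req = urllib.request.Request(url, headers={"User-Agent": "embedding-eval-sketch"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    return [b["itemLabel"]["value"] for b in data["results"]["bindings"]]

# e.g. fetch_category("Q20643955", "en")  # Human Biblical Figures
```

Changing the `lang` argument is all that is needed to retarget the same category to any of Wikidata's supported languages.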
There are two minor complications to the process above. First, the item may not have a translation into all languages. Item Q3276278 (the minor Hebrew prophet Hilkiah), for example, does not have a translation into any of the languages we are studying except for Hebrew. We replace all of these words with a special out-of-vocabulary token. During the evaluation tasks, the models will be guaranteed to miss any test case involving these words. This will cause the model's performance to decrease, but it will not cause the optimal set of hyperparameters to change because the model's performance will decrease uniformly for all hyperparameters. This is acceptable because our primary goal with the OddOneOut and Topk metrics is model selection.
Second, the item may have a multi-word translation. Item Q43600 is translated into English as "Matthew the Apostle", which is guaranteed to be out of vocabulary because our embeddings are only for individual words and not phrases. We handle this case simply by treating phrases the same way we would treat any other out-of-vocabulary word for these models. For a word2vec model, these phrases are commonly replaced by a single out-of-vocabulary token, but fastText models incorporate sub-word information and are therefore able to generate reasonable vectors that are good representations of these phrases. We therefore expect fastText models to perform better on datasets with many multi-word translations.
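Both complications can be handled by one normalization step. The token name and function signature below are illustrative assumptions, not the library's actual API.

```python
OOV = "<oov>"

def normalize_label(label, vocab, subword_model=False):
    """Map a Wikidata label to a test-set token. Missing translations
    become a shared OOV token; multi-word labels also become OOV for
    word2vec-style models, while subword models (e.g. fastText) can still
    build a vector for them and so keep the raw phrase."""
    if label is None:
        return OOV                      # no translation in this language
    if " " in label:
        return label if subword_model else OOV
    return label if (label in vocab or subword_model) else OOV
```

Because the same labels map to OOV for every hyperparameter setting, scores drop uniformly and the argmax over hyperparameters is unaffected.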
Given the relative ease with which categories of similar items can be generated from Wikidata, it's natural to wonder why our method could not be adapted to extract analogies instead. In some cases it is possible; however, the task requires more effort and is difficult to generalize. For example, we could generate items that follow the state-capital analogy with a query that first returns all instances of US states and then finds all items that are related to those instances through the "capital" property P36. However, if we consider another popular analogy relationship such as singular-plural, we run into trouble. Since there is no Wikidata property relating singular nouns to their plural forms, it's not clear how to generate a test set for this analogy relationship. Even if the lack of appropriate Wikidata properties could be overcome by developing more complicated queries, it's unlikely that these would generalize to other analogy relationships. Compared to OddOneOut and Topk, extracting test sets for the analogy task is much less automatic.
Using the method described above, we evaluate on a broad set of 18 categories, which includes the semantic categories from the Google Analogy set along with others such as Fruit, Sports, and Ancient Cities. Given the historical nature of many of the languages we are studying, we also choose three Wikidata categories dealing with religion: Q20643955 (Human Biblical Figures), Q748 (Buddhism), and Q9089 (Hinduism), for a subsequent qualitative evaluation of corpus content using the LCT. Table 4 shows representative examples from these religious categories. Since the categories are large, with between 100 and 400 items each, we do not reproduce them in full.

Model Selection
There are 7 hyperparameters for our language models, and we use the random search method (Bergstra and Bengio, 2012) to tune these hyperparameters. Random search is simple to implement, computationally efficient, easy to parallelize, and avoids the curse of dimensionality inherent to grid search and Bayesian optimization methods. The search space is:

    Model:         word2vec, fastText
    Type:          CBOW, skipgram
    Dimension:     10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500
    Window Size:   3, 4, 5, 6, 7, 8, 9, 10, 11
    Learning Rate: 10^-3, 10^-2, 10^-1
    Min Count:     3, 4, 5, 6, 7, 8, 9, 10, 11
    Lemmatization: True, False

These hyperparameters correspond to arguments to GenSim's model training functions, with the exception of lemmatization. This boolean argument indicates whether we used CLTK's built-in lemmatization tools to preprocess the datasets before training with GenSim. In theory, lemmatization can improve sample efficiency by giving words with the same stem the same word embedding. The results in Table 2, however, show that this was not the case in practice, which indicates that the lemmatization models built in to CLTK likely have very high error rates.
To ensure a fair comparison, we randomly sample 100 sets of hyperparameter combinations. We then train each language on these same sets of hyperparameters. The best results are reported in Table 1. The optimal set of hyperparameters is different for each language, which underscores the importance of proper model tuning in the low-resource regime. In particular, we note that there is no pattern regarding whether the word2vec model is better than the fastText model, or whether the CBOW type is better than the skipgram type.
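The random search loop can be sketched as follows. The search space is abridged, and `train_and_score` is a stand-in for training a model and returning, e.g., its Topk score; both names are our illustrative assumptions. Fixing the seed yields the same 100 hyperparameter sets for every language, which is what makes the cross-language comparison fair.

```python
import random

SEARCH_SPACE = {
    "model": ["word2vec", "fasttext"],
    "sg": [0, 1],                                     # CBOW vs skipgram
    "dimension": [10, 25, 50, 100, 200, 300, 500],    # abridged grid
    "window": list(range(3, 12)),
    "alpha": [1e-3, 1e-2, 1e-1],
    "min_count": list(range(3, 12)),
    "lemmatize": [True, False],
}

def random_search(train_and_score, n_trials=100, seed=0):
    """Draw n_trials random hyperparameter combinations and keep the one
    with the best evaluation score."""
    rng = random.Random(seed)
    trials = [{k: rng.choice(v) for k, v in SEARCH_SPACE.items()}
              for _ in range(n_trials)]
    scored = [(train_and_score(params), params) for params in trials]
    return max(scored, key=lambda t: t[0])
```

In practice `train_and_score` would call GenSim's training functions with these arguments and evaluate with Topk or OddOneOut.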
In previous work, Al-Rfou et al. (2013) trained word2vec embeddings on 100 different languages using Wikipedia as the training data, and Grave et al. (2018) extended this work by training FastText embeddings on 157 languages using data from the Common Crawl project. In both cases, the researchers tuned the model's hyperparameters on only a single language (due to the difficulty of adapting an analogy test set to so many different languages), and then applied the same set of hyperparameters to all languages. Our results here suggest that performance could be improved if each model's hyperparameters were tuned individually, and our Wikidata technique would make this a realistic option.

Language Comparison
We now break down the performance of each of our language models on the three religious categories shown in Figure 4. The goal is to better understand which topics are discussed in each of the ancient languages. Figure 5 shows the results. There are three interesting results in this visualization.

3. Latin, and to a lesser extent Hebrew, were the only non-Indic languages to perform well on the Buddhism category. Investigation revealed that the Latin model had success almost exclusively on comparisons containing more generic words like meditatio and orsa.

Conclusion
In this paper we introduce the first word vector evaluation methods designed specifically for the low-resource domain, OddOneOut and Topk, along with a method for automatically producing test sets in Wikidata's 581 supported languages. We believe Wikidata is an underutilized resource in the NLP evaluation community, and in particular, that its massively multilingual support can be used to better serve under-resourced languages.