A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity

Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language—i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a network measure of the strength of clusters in a graph. Modularity has a moderate to strong correlation with three downstream tasks, even though modularity is based only on the structure of embeddings and does not require any external resources. We show through experiments that modularity can serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in low-resource settings.

Typically, the quality of cross-lingual word embeddings is measured with respect to how well they improve a downstream task. However, it is sometimes not possible to evaluate embeddings on a specific downstream task, for example, a future task that does not yet have data or a rare language that lacks resources to support traditional evaluation. In such settings, it is useful to have an intrinsic evaluation metric: a metric that looks at the embedding space itself to judge whether the embedding is good, without resorting to an extrinsic task. While extrinsic tasks are the ultimate arbiter of whether cross-lingual word embeddings work, intrinsic metrics are useful for low-resource languages, where one often lacks the annotated data that would make an extrinsic evaluation possible.
However, few intrinsic measures exist for cross-lingual word embeddings, and those that do exist require external linguistic resources (e.g., sense-aligned corpora in Ammar et al. (2016)). The requirement of language resources makes this approach limited or impossible for low-resource languages, which are the languages where intrinsic evaluations are most needed. Moreover, requiring language resources can bias the evaluation toward words in the resources rather than evaluating the embedding space as a whole.
Our solution is a graph-based metric that considers the characteristics of the embedding space without using linguistic resources. To sketch the idea, imagine a cross-lingual word embedding space in which it is possible to draw a hyperplane that separates all word vectors in one language from all vectors in another. Without knowing anything about the languages, it is easy to see that this is a problematic embedding: the representations of the two languages lie in distinct parts of the space rather than sharing it. While this example is exaggerated, the same characteristic, vectors clustered by language, often appears within smaller neighborhoods of the embedding space; we want to discover these clusters.
To measure how well word embeddings are mixed across languages, we draw on concepts from network science. Specifically, some cross-lingual word embeddings are modular by language: vectors in one language are consistently closer to each other than to vectors in another language (Figure 1). When embeddings are modular, they often fail on downstream tasks (Section 2). Modularity is a concept from network theory (Section 3); because network theory applies to graphs, we turn our word embeddings into a graph by connecting nearest neighbors, determined by vector similarity, to each other. Our hypothesis is that modularity will predict how useful the embedding is in downstream tasks: low-modularity embeddings should work better.
We explore the relationship between modularity and three downstream tasks (Section 4) that use cross-lingual word embeddings differently: (i) cross-lingual document classification; (ii) bilingual lexical induction in Italian, Japanese, Spanish, and Danish; and (iii) low-resource document retrieval in Hungarian and Amharic, finding moderate to strong negative correlations between modularity and performance. Furthermore, using modularity as a validation metric (Section 5) makes MUSE (Conneau et al., 2018), an unsupervised model, more robust on distant language pairs. Compared to other existing intrinsic evaluation metrics, modularity captures complementary properties and is more predictive of downstream performance despite needing no external resources (Section 6).

Background: Cross-Lingual Word Embeddings and their Evaluation
There are many approaches to training cross-lingual word embeddings. This section reviews the embeddings we consider in this paper, along with existing work on evaluating those embeddings.

Cross-Lingual Word Embeddings
We focus on methods that learn a cross-lingual vector space through a post-hoc mapping between independently constructed monolingual embeddings (Mikolov et al., 2013a; Vulić and Korhonen, 2016). Given two separate monolingual embeddings and a bilingual seed lexicon, a projection matrix can map translation pairs in a given bilingual lexicon to be near each other in a shared embedding space. A key assumption is that cross-lingually coherent words have "similar geometric arrangements" (Mikolov et al., 2013a) in the embedding space, enabling "knowledge transfer between languages" (Ruder et al., 2017). We focus on mapping-based approaches for two reasons. First, these approaches are applicable to low-resource languages because they do not require large bilingual dictionaries or parallel corpora (Artetxe et al., 2017; Conneau et al., 2018). Second, this focus separates the word embedding task from the cross-lingual mapping, which allows us to evaluate the specific multilingual component in Section 4.

Evaluating Cross-Lingual Embeddings
Most work on evaluating cross-lingual embeddings focuses on extrinsic evaluation on downstream tasks (Upadhyay et al., 2016; Glavaš et al., 2019). However, intrinsic evaluations are crucial since many low-resource languages lack the annotations needed for downstream tasks. Thus, our goal is to develop an intrinsic measure that correlates with downstream tasks without using any external resources. This section summarizes existing work on intrinsic methods of evaluation for cross-lingual embeddings.
One widely used intrinsic measure for evaluating the coherence of monolingual embeddings is QVEC (Tsvetkov et al., 2015). Ammar et al. (2016) extend QVEC by using canonical correlation analysis (QVEC-CCA) to make the scores comparable across embeddings with different dimensions. However, while both QVEC and QVEC-CCA can be extended to cross-lingual word embeddings, they are limited: they require external annotated corpora. This is problematic in cross-lingual settings, since the annotation must be consistent across languages (Ammar et al., 2016).
Other intrinsic metrics do not require external resources, but they consider only part of the embedding space. Conneau et al. (2018) and Artetxe et al. (2018a) use a validation metric that calculates similarities of cross-lingual neighbors to conduct model selection. Our approach differs in that we consider whether cross-lingual nearest neighbors are relatively closer than intra-lingual nearest neighbors. Søgaard et al. (2018) use the similarities of intra-lingual neighbors and compute graph similarity between two monolingual lexical subgraphs built from subsampled words in a bilingual lexicon. They further show that the resulting graph similarity has a high correlation with bilingual lexical induction on MUSE (Conneau et al., 2018). However, their graph similarity still uses only intra-lingual similarities, not cross-lingual similarities.
These existing metrics are limited by either requiring external resources or considering only part of the embedding structure (e.g., intra-lingual but not cross-lingual neighbors). In contrast, our work develops an intrinsic metric that is highly correlated with multiple downstream tasks, does not require external resources, and considers both intra- and cross-lingual neighbors.
Related Work A related line of work is intrinsic evaluation of probabilistic topic models, which are another low-dimensional representation of words, similar to word embeddings. Metrics based on word co-occurrences have been developed for measuring the monolingual coherence of topics (Newman et al., 2010; Mimno et al., 2011; Lau et al., 2014). Less work has studied evaluation of cross-lingual topics (Mimno et al., 2009). Some researchers have measured the overlap of direct translations across topics (Boyd-Graber and Blei, 2009), while Hao et al. (2018) propose a metric based on co-occurrences across languages that is more general than direct translations.

Approach: Graph-Based Diagnostics for Detecting Clustering by Language
This section describes our graph-based approach to measure the intrinsic quality of a cross-lingual embedding space.

Embeddings as Lexical Graphs
We posit that we can understand the quality of cross-lingual embeddings by analyzing characteristics of a lexical graph (Pelevina et al., 2016; Hamilton et al., 2016). The lexical graph has words as nodes and edges weighted by their similarity in the embedding space. Given a pair of words (i, j) and associated word vectors (v_i, v_j), we compute the similarity between the two words by their vector similarity and encode it in a weighted adjacency matrix A:

A_ij = max(0, cos(v_i, v_j)).

However, nodes are only connected to their k-nearest neighbors (Section 6.2 examines the sensitivity to k); all other edges become zero. Finally, each node i has a label g_i indicating the word's language.
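As a concrete sketch of this construction (a dense cosine computation; the paper's implementation instead uses approximate nearest-neighbor search for speed), the k-nearest-neighbor lexical graph can be built as follows. The max-symmetrization at the end is our own simplifying assumption, not necessarily the paper's choice:

```python
import numpy as np

def knn_lexical_graph(vectors, k=3):
    """Weighted adjacency matrix of a k-nearest-neighbor lexical graph.

    Edge weights are cosine similarities clipped at zero; each node keeps
    edges only to its k nearest neighbors.
    """
    X = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = X @ X.T                      # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)     # no self-loops
    A = np.zeros_like(sim)
    for i in range(len(sim)):
        nn = np.argsort(sim[i])[-k:]   # indices of the k nearest neighbors
        A[i, nn] = np.maximum(0.0, sim[i, nn])
    return np.maximum(A, A.T)          # symmetrize into an undirected graph
```

The language labels g_i are kept separately, alongside the matrix.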

Clustering by Language
We focus on a phenomenon that we call "clustering by language", when word vectors in the embedding space tend to be more similar to words in the same language than to words in the other language. For example, in Figure 2, the intra-lingual nearest neighbors of "slow" have higher similarity in the embedding space than semantically related cross-lingual words. This indicates that words are represented differently across the two languages; thus, our hypothesis is that clustering by language degrades the quality of cross-lingual embeddings when used in downstream tasks.

Modularity of Lexical Graphs
With a labeled graph, we can now ask whether the graph is modular (Newman, 2010). In a cross-lingual lexical graph, modularity is the degree to which words are more similar to words in the same language than to words in a different language. High modularity is undesirable, because it means the representation of words is not transferred across languages.
If the nearest neighbors of words are mostly within the same language, then the languages are not mapped into the cross-lingual space consistently. In our setting, the language l of each word defines its group, and high modularity indicates that embeddings are more similar within languages than across languages (Newman, 2003; Newman and Girvan, 2004). In other words, good embeddings should have low modularity.

Conceptually, the modularity of a lexical graph is the difference between the proportion of edges in the graph that connect two nodes from the same language and the expected proportion of such edges in a randomly connected lexical graph. If edges were random, the number of edges starting from node i within the same language would be the degree of node i, d_i = Σ_j A_ij for a weighted graph (Newman, 2004), times the proportion of words in that language. Summing over all nodes gives the expected proportion of edges within language l,

a_l = (1 / 2m) Σ_i d_i 1[g_i = l], (1)

where m is the number of edges, g_i is the language label of node i, and 1[·] is an indicator function that evaluates to 1 if the argument is true and 0 otherwise. Next, we count the fraction of edges e_ll that connect words of the same language:

e_ll = (1 / 2m) Σ_ij A_ij 1[g_i = l] 1[g_j = l]. (2)

Given L different languages, we calculate the overall modularity Q by taking the difference between e_ll and a_l^2 for all languages:

Q = Σ_{l=1}^{L} (e_ll − a_l^2). (3)

Since Q does not necessarily have a maximum value of 1, we normalize modularity:

Q_norm = Q / Q_max, where Q_max = 1 − Σ_{l=1}^{L} (a_l^2). (4)

The higher the modularity, the more words from the same language appear as nearest neighbors. Figure 1 shows examples of a lexical subgraph with low modularity (left, Q_norm = 0.143) and high modularity (right, Q_norm = 0.672). In Figure 1b, the lexical graph is modular since "firefox" does not encode the same sense in both languages.
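Equations 1–4 translate directly into code. Below is a minimal dense implementation for illustration (the paper's actual implementation uses random projection trees for the neighbor graph; function and variable names here are ours):

```python
import numpy as np

def modularity(A, labels):
    """Normalized modularity Q_norm of a weighted, symmetric graph (Eqs. 1-4).

    A: symmetric weighted adjacency matrix; labels: language of each node.
    """
    two_m = A.sum()                    # = 2m: each undirected edge counted twice
    d = A.sum(axis=1)                  # weighted degrees d_i
    Q, sum_a_sq = 0.0, 0.0
    for l in set(labels):
        mask = np.array([g == l for g in labels])
        e_ll = A[np.ix_(mask, mask)].sum() / two_m   # within-language fraction (Eq. 2)
        a_l = d[mask].sum() / two_m                  # expected fraction (Eq. 1)
        Q += e_ll - a_l ** 2                         # Eq. 3
        sum_a_sq += a_l ** 2
    return Q / (1.0 - sum_a_sq)                      # Eq. 4
```

A graph whose edges stay entirely within one language scores Q_norm = 1; a graph whose edges all cross languages scores below zero.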
Our hypothesis is that cross-lingual word embeddings with lower modularity will be more successful in downstream tasks.If this hypothesis holds, then modularity could be a useful metric for cross-lingual evaluation.

Experiments: Correlation of Modularity with Downstream Success
We now investigate whether modularity can predict the effectiveness of cross-lingual word embeddings on three downstream tasks: (i) cross-lingual document classification, (ii) bilingual lexical induction, and (iii) document retrieval in low-resource languages.If modularity correlates with task performance, it can characterize embedding quality.

Data
To investigate the relationship between embedding effectiveness and modularity, we explore five different cross-lingual word embeddings on six language pairs (Table 1).
Monolingual Word Embeddings All monolingual embeddings are trained using a skip-gram model with negative sampling (Mikolov et al., 2013b). The dimension size is 100 or 200. All other hyperparameters are the defaults in Gensim (Řehůřek and Sojka, 2010). News articles except for Amharic are from the Leipzig Corpora (Goldhahn et al., 2012). For Amharic, we use documents from LORELEI (Strassel and Tracey, 2016). We use MeCab (Kudo et al., 2004) to tokenize Japanese text.

Cross-Lingual Mappings CCA (Faruqui and Dyer, 2014) maps two monolingual embeddings into a shared space by maximizing the correlation between translation pairs in a seed lexicon. Conneau et al. (2018, MUSE) use language-adversarial learning (Ganin et al., 2016) to induce the initial bilingual seed lexicon, followed by a refinement step, which iteratively solves the orthogonal Procrustes problem (Schönemann, 1966; Artetxe et al., 2017), aligning embeddings without an external bilingual lexicon. Like MSE+Orth, vectors are unit length normalized and mean centered. Since MUSE is unstable (Artetxe et al., 2018a; Søgaard et al., 2018), we report the best of five runs. Artetxe et al. (2018a, VECMAP) induce an initial bilingual seed lexicon by aligning intra-lingual similarity matrices computed from each monolingual embedding. We report the best of five runs to address uncertainty from the initial dictionary.
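The orthogonal Procrustes subproblem solved during refinement has a closed-form solution via SVD (Schönemann, 1966). A minimal sketch, independent of MUSE's actual code:

```python
import numpy as np

def procrustes_map(X, Y):
    """Orthogonal Procrustes: the orthogonal W minimizing ||XW - Y||_F.

    Rows of X and Y are the source- and target-language vectors of
    seed-lexicon translation pairs; W maps source vectors into the
    target space.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt
```

The orthogonality constraint preserves distances within the source embedding, which is why refinement cannot distort a good monolingual space.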

Modularity Implementation
We implement modularity using random projection trees (Dasgupta and Freund, 2008) to speed up the extraction of k-nearest neighbors, tuning k = 3 on the German Rcv2 dataset (Section 6.2).

Task 1: Document Classification
We now explore the correlation of modularity and accuracy on cross-lingual document classification. We classify documents from the Reuters Rcv1 and Rcv2 corpora (Lewis et al., 2004). Documents have one of four labels (Corporate/Industrial, Economics, Government/Social, Markets). We follow Klementiev et al. (2012), except we use all EN training documents and the documents in each target language (DA, ES, IT, and JA) as tuning and test data. After removing out-of-vocabulary words, we split the documents in each target language into 10% tuning data and 90% test data. We use a deep averaging network (DAN) as the classifier; the DAN had better accuracy than the averaged perceptron (Collins, 2002) used by Klementiev et al. (2012).

Results
We report the correlation computed from the data points in Figure 3: modularity has a negative correlation with classification accuracy. MUSE has the best average accuracy (Table 2), reflected by its low modularity.
Error Analysis A common error in EN → JA classification is predicting Corporate/Industrial for documents labeled Markets. One cause is documents containing 終値 "closing price"; this word has few market-based English neighbors (Table 3). As a result, the model fails to transfer across languages.

Task 2: Bilingual Lexical Induction (BLI)
Our second downstream task explores the correlation between modularity and bilingual lexical induction (BLI). We evaluate on the test set from Conneau et al. (2018), but we remove pairs that appear in the seed lexicon from Rolston and Kirchhoff (2016). The result is 2,099 translation pairs for ES, 1,358 for IT, 450 for DA, and 973 for JA. We report precision@1 (P@1) for retrieving cross-lingual nearest neighbors by cross-domain similarity local scaling (Conneau et al., 2018, CSLS).
Results Although this task ignores intra-lingual nearest neighbors when retrieving translations, modularity still has a high correlation (ρ = −0.785) with P@1 (Figure 4). MUSE and VECMAP, which have the lowest modularity, beat the three supervised methods; VECMAP scores the best P@1, which is captured by its low modularity (Table 4). P@1 is low compared to other work on the MUSE test set (e.g., Conneau et al. (2018)) because we filter out translation pairs that appear in the large training lexicon compiled by Rolston and Kirchhoff (2016), and because the raw corpora used to train the monolingual embeddings (Table 1) are relatively small compared to Wikipedia.

Task 3: Document Retrieval in Low-Resource Languages
As a third downstream task, we turn to an important task for low-resource languages: lexicon expansion (Gupta and Manning, 2015; Hamilton et al., 2016) for document retrieval. Specifically, we start with a set of EN seed words relevant to a particular concept, then find related words in a target language for which a comprehensive bilingual lexicon does not exist. We focus on the disaster domain, where events may require immediate NLP analysis (e.g., sorting SMS messages to first responders).
We induce keywords in a target language by taking the n nearest neighbors of the English seed words in a cross-lingual word embedding. We manually select sixteen disaster-related English seed words from the Wikipedia articles "Natural hazard" and "Anthropogenic hazard". Examples of seed terms include "earthquake" and "flood". Using the extracted terms, we retrieve disaster-related documents by keyword matching and assess the coverage and relevance of the terms by the area under the precision-recall curve (AUC) with varying n.
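The expansion-then-match pipeline can be sketched as follows; the function and variable names are hypothetical, and row-normalized vectors are assumed:

```python
import numpy as np

def expand_keywords(seed_idx, src_vecs, tgt_vecs, tgt_words, n=5):
    """Take the n nearest target-language neighbors of each English seed word."""
    S = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = S[seed_idx] @ T.T            # cosine similarity: seeds x target vocab
    keywords = set()
    for row in sim:
        for j in np.argsort(row)[-n:]:  # indices of the n most similar targets
            keywords.add(tgt_words[j])
    return keywords

def retrieve(docs, keywords):
    """Keyword matching: flag a document if it contains any induced keyword."""
    return [any(w in doc for w in keywords) for doc in docs]
```

Sweeping n and scoring the flagged documents against the gold labels yields the precision-recall curve summarized by AUC.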
Test Corpora As positively labeled documents, we use documents from the LORELEI project (Strassel and Tracey, 2016) containing any disaster-related annotation. There are 64 disaster-related documents in Amharic and 117 in Hungarian. We construct a set of negatively labeled documents from the Bible: because the LORELEI corpus does not include negative documents and the Bible is available in all of our languages (Christodouloupoulos and Steedman, 2015), we take the chapters of the gospels (89 documents), which do not discuss disasters, and treat them as non-disaster-related documents.
Results Modularity has a moderate correlation with AUC (ρ = −0.378, Table 5). While modularity considers the entire vocabulary of the cross-lingual word embeddings, this task focuses on a small, specific subset (disaster-relevant words), which may explain the lower correlation compared to BLI or document classification.

Use Case: Model Selection for MUSE
A common use case of intrinsic measures is model selection. We focus on MUSE (Conneau et al., 2018) since it is unstable, especially on distant language pairs (Artetxe et al., 2018a; Søgaard et al., 2018; Hoshen and Wolf, 2018), and therefore requires an effective metric for model selection. MUSE uses a validation metric in its two steps: (1) the language-adversarial step, and (2) the refinement step. First, the algorithm selects an optimal mapping W, obtained from language-adversarial learning (Ganin et al., 2016), using a validation metric. Then the selected mapping W from the language-adversarial step is passed on to the refinement step (Artetxe et al., 2017), which re-selects the optimal mapping W using the same validation metric after each epoch of solving the orthogonal Procrustes problem (Schönemann, 1966). Normally, MUSE uses an intrinsic metric, CSLS over the 10K most frequent words (Conneau et al., 2018, CSLS-10K). Given word vectors s, t ∈ R^n from a source and a target embedding, CSLS is a cross-lingual similarity metric,

CSLS(Ws, t) = 2 cos(Ws, t) − r(Ws) − r(t), (5)

where W is the trained mapping after each epoch, and r(x) is the average cosine similarity of the top 10 cross-lingual nearest neighbors of a word x.
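Equation 5 vectorizes naturally over whole vocabularies. A sketch of ours (not MUSE's code), assuming unit-length rows so that dot products are cosines, and with a configurable neighborhood size n:

```python
import numpy as np

def csls_matrix(S, T, n=10):
    """CSLS scores (Eq. 5) between all mapped source rows S (= Ws) and targets T.

    r(x) is the mean cosine similarity of x's top-n cross-lingual neighbors.
    """
    cos = S @ T.T
    r_s = np.sort(cos, axis=1)[:, -n:].mean(axis=1)   # r(Ws) per source word
    r_t = np.sort(cos, axis=0)[-n:, :].mean(axis=0)   # r(t) per target word
    return 2 * cos - r_s[:, None] - r_t[None, :]
```

Translation retrieval, as used for BLI in Section 4, then reduces to `csls_matrix(S, T).argmax(axis=1)`.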
What if we use modularity instead? To test modularity as a validation metric for MUSE, we compute modularity on the lexical graph of the 10K most frequent words (Mod-10K; we use 10K for consistency with CSLS-10K on the same words) after each epoch of the adversarial step and the refinement step, and select the best mapping.
The important difference between these two metrics is that Mod-10K considers the relative similarities between intra- and cross-lingual neighbors, while CSLS-10K only considers the similarities of cross-lingual nearest neighbors.

Experiment Setup We use the pre-trained fastText vectors (Bojanowski et al., 2017) to be comparable with prior work. Following Artetxe et al. (2018a), all vectors are unit length normalized, mean centered, and then unit length normalized again. We use the test lexicon by Conneau et al. (2018). We run MUSE ten times with the same random seeds and hyperparameters but with different validation metrics. Since MUSE is unstable on distant language pairs (Artetxe et al., 2018a; Søgaard et al., 2018; Hoshen and Wolf, 2018), we test it on English to languages from diverse language families: Indo-European languages such as Danish (DA), German (DE), Spanish (ES), Farsi (FA), Italian (IT), Hindi (HI), and Bengali (BN), and non-Indo-European languages such as Finnish (FI), Hungarian (HU), Japanese (JA), Chinese (ZH), Korean (KO), Arabic (AR), Indonesian (ID), and Vietnamese (VI).

Results Table 6 shows P@1 on BLI for each target language using English as the source language. Mod-10K improves P@1 over the default validation metric for diverse languages, especially the average P@1 for non-Germanic languages such as JA (+18.00%) and ZH (+5.74%), and the best P@1 for KO (+1.85%). These include pairs (EN-JA and EN-HI) that are difficult for MUSE (Hoshen and Wolf, 2018). Improvements in JA come from selecting a better mapping during the refinement step, which the default validation metric misses. For ZH, HI, and KO, the improvement comes from selecting better mappings during the adversarial step. However, modularity does not improve all languages (e.g., VI), which Hoshen and Wolf (2018) also report as a failure case.
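Schematically, validation-based model selection wraps training as below; `train_epoch` and `validation_metric` are hypothetical stand-ins. With Mod-10K, one would maximize negated modularity, since lower modularity is better:

```python
def select_mapping(train_epoch, validation_metric, init_W, epochs=5):
    """Keep the mapping W that scores best on the validation metric after
    each epoch. MUSE's default metric is CSLS-10K; Mod-10K plugs in here
    unchanged (as the negated modularity of the 10K-word lexical graph)."""
    best_W, best_score = init_W, validation_metric(init_W)
    W = init_W
    for _ in range(epochs):
        W = train_epoch(W)              # adversarial or refinement update
        score = validation_metric(W)
        if score > best_score:
            best_W, best_score = W, score
    return best_W
```

The metric thus only changes which epoch's mapping is kept, not how the mapping is trained.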

Analysis: Understanding Modularity as an Evaluation Metric
The experiments so far show that modularity captures whether an embedding is useful, which suggests that modularity could serve as an intrinsic evaluation or validation metric. Here, we investigate whether modularity captures information distinct from existing evaluation measures: QVEC-CCA (Ammar et al., 2016), CSLS (Conneau et al., 2018), and cosine similarity between translation pairs (Section 6.1). We also analyze the effect of the number of nearest neighbors k (Section 6.2).

Ablation Study Using Linear Regression
We fit a linear regression model to predict classification accuracy given four intrinsic measures: QVEC-CCA, CSLS, average cosine similarity of translations, and modularity. We standardize the values of the four features, then ablate each measure in turn, fitting linear regression for two target languages (IT and DA) on the task of cross-lingual document classification (Figure 3). We limit the analysis to IT and DA because the supersense annotations aligned to EN ones (Miller et al., 1993), required for QVEC-CCA, are only available in those languages (Montemagni et al., 2003; Martínez Alonso et al., 2015; Martínez Alonso et al., 2016; Ammar et al., 2016).
Omitting modularity hurts accuracy prediction on cross-lingual document classification substantially, while omitting the other three measures has smaller effects (Figure 5).Thus, modularity complements the other measures and is more predictive of classification accuracy.
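The ablation itself is just repeated least-squares fits. A self-contained sketch with hypothetical feature names (real inputs would be the standardized intrinsic scores and the classification accuracies):

```python
import numpy as np

def ablation_r2(X, y, names):
    """R^2 of an ordinary-least-squares fit when each feature is left out in turn.

    X: (samples, features) matrix of standardized intrinsic scores;
    y: downstream accuracies. Returns {left-out feature: R^2 without it}.
    """
    def r2(Xs):
        Xb = np.column_stack([np.ones(len(Xs)), Xs])      # add intercept
        beta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
        resid = y - Xb @ beta
        return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return {names[j]: r2(np.delete(X, j, axis=1)) for j in range(X.shape[1])}
```

A large drop in R^2 when a feature is removed indicates that it carries information the remaining features do not.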

Hyperparameter Sensitivity
While modularity itself does not have any adjustable hyperparameters, our approach to constructing the lexical graph has two hyperparameters: the number of nearest neighbors (k) and the number of trees (t) for approximating the k-nearest neighbors using random projection trees.We conduct a grid search for k ∈ {1, 3, 5, 10, 50, 100, 150, 200} and t ∈ {50, 100, 150, 200, 250, 300, 350, 400, 450, 500} using the German Rcv2 corpus as the held-out language to tune hyperparameters.
The number of neighbors k has a much larger effect on modularity than t, so we focus on analyzing the effect of k, using the optimal t = 450. Our earlier experiments all use k = 3 since it gives the highest Pearson's and Spearman's correlation on the tuning dataset (Figure 6). The absolute correlation with the downstream task decreases when setting k > 3, indicating that nearest neighbors beyond k = 3 only contribute noise.

Discussion: What Modularity Can and Cannot Do
This work focuses on modularity as a diagnostic tool: it is cheap and effective at discovering which embeddings are likely to falter on downstream tasks. Thus, practitioners should consider including it as a metric for evaluating the quality of their embeddings. Additionally, we believe that modularity could serve as a useful prior for algorithms that learn cross-lingual word embeddings: during learning, prefer updates that avoid increasing modularity if all else is equal.

Nevertheless, we recognize limitations of modularity. Consider the following cross-lingual word embedding "algorithm": for each word, select a random point on the unit hypersphere. This is a horrible distributed representation: the position of a word's embedding has no relationship to its underlying meaning. Nevertheless, this representation will have very low modularity. Thus, while modularity can identify bad embeddings, once vectors are well mixed, this metric, unlike QVEC or QVEC-CCA, cannot identify whether the meanings make sense. Future work should investigate how to combine techniques that use both word meaning and nearest neighbors for a more robust, semi-supervised cross-lingual evaluation.

Figure 1 :
Figure 1: An example of a low-modularity (languages mixed) and a high-modularity cross-lingual word embedding lexical graph, using the k-nearest neighbors of "eat" (left) and "firefox" (right) in English and Japanese.

Figure 2 :
Figure 2: Local t-SNE (van der Maaten and Hinton, 2008) of an EN-JA cross-lingual word embedding, which shows an example of "clustering by language".

Figure 5 :
Figure 5: We predict the cross-lingual document classification results for DA and IT from Figure 3 using three out of four evaluation metrics. Ablating modularity causes by far the largest decrease in R² (R² = 0.814 when using all four features), showing that it captures information complementary to the other metrics.

Figure 6 :
Figure 6: Correlation between modularity and classification performance (EN→DE) with different numbers of neighbors k. Correlations are computed in the same setting as Figure 3 using the supervised methods. We use this to set k = 3.


Table 1 :
Dataset statistics (source and number of tokens) for each language including both Indo-European and non-Indo-European languages.
Mean-squared error (MSE; Mikolov et al., 2013a) minimizes the mean-squared error of bilingual entries in a seed lexicon to learn a projection between the two embeddings. We use the implementation by Artetxe et al. (2016).

Table 2 :
Average classification accuracy on (EN → DA, ES, IT, JA) along with the average modularity of five cross-lingual word embeddings. MUSE has the best accuracy, captured by its low modularity.

Table 3 :
Nearest neighbors in an EN-JA embedding. Unlike the JA word "market", the JA word "closing price" has no EN vector nearby.

Table 4 :
Average precision@1 on (EN → DA, ES, IT, JA) along with the average modularity of the cross-lingual word embeddings trained with different methods. VECMAP scores the best P@1, which is captured by its low modularity.

Table 6 :
Bold values are mappings that are not shared between the two validation metrics. Mod-10K improves the robustness of MUSE on distant language pairs.