Revisiting the Context Window for Cross-lingual Word Embeddings

Existing approaches to mapping-based cross-lingual word embeddings are based on the assumption that the source and target embedding spaces are structurally similar. The structures of embedding spaces largely depend on the co-occurrence statistics of each word, which the choice of context window determines. Despite this obvious connection between the context window and mapping-based cross-lingual embeddings, their relationship has been underexplored in prior work. In this work, we provide a thorough evaluation, in various languages, domains, and tasks, of bilingual embeddings trained with different context windows. The highlight of our findings is that increasing both the source and target window sizes improves the performance of bilingual lexicon induction, especially on frequent nouns.


Introduction
Cross-lingual word embeddings can capture word semantics invariant among multiple languages, and facilitate cross-lingual transfer for low-resource languages. Recent research has focused on mapping-based methods, which find a linear transformation from the source to target embedding spaces (Mikolov et al., 2013b; Artetxe et al., 2016; Lample et al., 2018). Learning a linear transformation is based on a strong assumption that the two embedding spaces are structurally similar or isometric.
The structure of word embeddings heavily depends on the co-occurrence information of words (Turney and Pantel, 2010; Baroni et al., 2014), i.e., word embeddings are computed by counting other words that appear in a specific context window of each word. The choice of context window changes the co-occurrence statistics of words and thus is crucial to determine the structure of an embedding space. For example, it has been known that an embedding space trained with a smaller linear window captures functional similarities, while a larger window captures topical similarities (Levy and Goldberg, 2014a). Despite this important relationship between the choice of context window and the structure of embedding space, how the choice of context window affects the structural similarity of two embedding spaces has not been fully explored yet.
In this paper, we attempt to deepen the understanding of cross-lingual word embeddings from the perspective of the choice of the context window through carefully designed experiments. We experiment with a variety of settings, with different domains and languages. We train monolingual word embeddings with varying context window sizes, align them with a mapping-based method, and then evaluate them with both intrinsic and downstream cross-lingual transfer tasks. Our research questions and a summary of the findings are as follows:
RQ1: What kind of context window produces a better alignment of two embedding spaces? Our results show that increasing the window sizes of both the source and target embeddings consistently improves the accuracy of bilingual lexicon induction, regardless of the domains of the source and target corpora. Our fine-grained analysis reveals that frequent nouns benefit the most from larger context sizes.
RQ2: In downstream cross-lingual transfer, do the context windows that perform well on the source language also perform well on the target languages? No. We find that even when a context window performs well on the source language task, it is often not the best choice for the target language. The general tendency is that broader context windows produce better performance for the target languages.

Context Window of Word Embeddings
Word embeddings are computed from the co-occurrence information of words, i.e., context words that appear around a given word. The embedding algorithm used in this work is the skip-gram with negative sampling (Mikolov et al., 2013c). In the skip-gram model, each word $w$ in the vocabulary $W$ is associated with a word vector $v_w$ and a context vector $c_w$ (see footnote 1). The objective is to maximize the dot-product $v_{w_t} \cdot c_{w_c}$ for the observed word-context pairs $(w_t, w_c)$, and to minimize the dot-product for negative examples.
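For reference, the per-pair objective of skip-gram with negative sampling is commonly written as below. This follows the standard formulation from Mikolov et al. (2013c) and is not a detail specific to this work; the symbols $\sigma$ (the logistic sigmoid), $n$ (the number of negative samples), and $P_n$ (the noise distribution from which negative context words are drawn) are introduced here for illustration.

```latex
\ell(w_t, w_c) = \log \sigma\!\left(v_{w_t} \cdot c_{w_c}\right)
  + \sum_{i=1}^{n} \mathbb{E}_{w_i \sim P_n}\!\left[\log \sigma\!\left(-v_{w_t} \cdot c_{w_i}\right)\right]
```

Maximizing the first term raises the dot-product for observed pairs, and maximizing the second term lowers it for sampled negative pairs.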
The most common type of context is a linear window. When the window size is set to $k$, the context words of a target word $w_t$ are the $k$ words on either side of it, $w_{t-k}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k}$. The choice of context is crucial to the resulting embeddings as it changes the co-occurrence statistics associated with each target word. Table 1 demonstrates the effect of the context window size on the nearest neighbor structure of the embedding space: with a small window size, the resulting embeddings capture functional similarity, while with a larger window size, the embeddings capture topical similarity.
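As a minimal sketch of what a linear window of size k means in practice, the following Python snippet enumerates the (target, context) pairs of a toy sentence. The function name `window_pairs` and the example sentence are hypothetical and serve only as illustration.

```python
from typing import Iterator, List, Tuple


def window_pairs(sentence: List[str], k: int) -> Iterator[Tuple[str, str]]:
    """Yield (target, context) pairs for a symmetric linear window of size k."""
    for t, target in enumerate(sentence):
        lo = max(0, t - k)                  # clip the window at the sentence start
        hi = min(len(sentence), t + k + 1)  # clip the window at the sentence end
        for c in range(lo, hi):
            if c != t:                      # a word is not its own context
                yield target, sentence[c]


sentence = "the cat sat on the mat".split()
pairs_k1 = list(window_pairs(sentence, 1))
pairs_k2 = list(window_pairs(sentence, 2))
```

With k = 1 the word "cat" only sees its immediate neighbors, whereas with k = 2 it also sees "on"; enlarging k therefore changes the co-occurrence statistics that the embedding is trained on.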
Among the other types of context windows that have been explored by researchers are linear windows enriched with positional information (Levy and Goldberg, 2014b; Ling et al., 2015a; Li et al., 2017), syntactically informed context windows based on dependency trees (Levy and Goldberg, 2014a; Li et al., 2017), and one that dynamically weights the surrounding words with the attention mechanism (Ling et al., 2015b). In this paper, we mainly discuss the most common linear window and investigate how the choice of the window size affects the isomorphism of two embedding spaces and the performance of cross-lingual transfer.

Cross-lingual Word Embeddings
Cross-lingual word embeddings aim to learn a shared semantic space in multiple languages. One promising solution is to jointly train the source and target embeddings (so-called joint methods) by exploiting cross-lingual supervision signals in the form of word dictionaries (Duong et al., 2016), parallel corpora (Gouws et al., 2015; Luong et al., 2015), or document-aligned corpora (Vulic and Moens, 2016). Another line of research is off-line mapping-based approaches, where monolingual embeddings are independently trained in multiple languages, and a post-hoc alignment matrix is learned to align the embedding spaces with a seed word dictionary (Mikolov et al., 2013b; Xing et al., 2015; Artetxe et al., 2016), with only a little supervision such as identical strings or numerals (Artetxe et al., 2017; Smith et al., 2017), or even in a completely unsupervised manner (Lample et al., 2018; Artetxe et al., 2018). Mapping-based approaches have recently been popularized by their cheaper computational cost compared to joint approaches, as they can make use of pre-trained monolingual word embeddings.
Footnote 1: Conceptually, the word and context vocabularies are regarded as separate, but for simplicity, we assume that they share the vocabulary.
The assumption behind the mapping-based methods is the isomorphism of monolingual embedding spaces, i.e., the embedding spaces are structurally similar, or the nearest neighbor graphs from the different languages are approximately isomorphic (Søgaard et al., 2018). Considering that the structures of the monolingual embedding spaces are closely related to the choice of the context window, it is natural to expect that the context window has a considerable impact on the performance of mapping-based bilingual word embeddings.
However, most existing work has not provided empirical results on the effect of the context window on cross-lingual embeddings, as their focus is on how to learn a mapping between the two embedding spaces. In order to shed light on the effect of the context window on cross-lingual embeddings, we trained cross-lingual embeddings with different context windows, and carefully analyzed the implications of their varying performance on both intrinsic and extrinsic tasks.

Training Monolingual Embeddings
The experiment is designed to cover multiple settings in order to fully understand the effect of the context window.
Languages. As the target language, we choose English (En) because of its richness of resources, and as the source languages, we choose French (Fr), German (De), Russian (Ru), and Japanese (Ja), taking into account typological variety and the availability of evaluation resources.
Note that the language pairs analyzed in this paper are limited to those including English, and there is a possibility that some results may not generalize to other language pairs.
Corpus for Training Word Embeddings. To train the monolingual embeddings, we use the Wikipedia Comparable Corpora 2. We choose comparable corpora for the main analysis in order to accentuate the effect of the context window by setting up an ideal situation for training cross-lingual embeddings.
We also experiment with different domain settings, where we use corpora from the news domain 3 for the source languages, because the isomorphism assumption is shown to be very sensitive to the domains of the source and target corpora (Søgaard et al., 2018). We refer to those results when we are interested in whether the same trend with respect to context window can be observed in the different domain settings.
For the size of the data, to simulate the setting of transferring from a low-resource language to a high-resource language, we use 5M sentences for the target language (English), and 1M sentences for the source languages. 4
Context Window. Since we want to measure the effect of the context window size, we vary the window size among 1, 2, 3, 4, 5, 7, 10, 15, and 20.
Besides the linear window, we also experimented with the unbound dependency context (Li et al., 2017), where we extract context words that are the head, modifiers, and siblings in a dependency tree. Our initial motivation was that, while the linear context is directly affected by different word orders, the dependency context can mitigate the effect of language differences, and thus may produce better cross-lingual embeddings. However, the performance of the dependency context always fell between that of the smaller and larger linear windows, and we found nothing notable. Therefore, the following analysis focuses only on the results of the linear context window.
Implementation of Word2Vec. Note that some common existing implementations of the skip-gram may obfuscate the effect of the window size. The original C implementation of word2vec and its Python implementation Gensim 5 adopt a dynamic window mechanism where the window size is uniformly sampled between 1 and the specified window size for each target word (Mikolov et al., 2013a). Also, those implementations remove frequent tokens by subsampling before extracting word-context pairs (so-called "dirty" subsampling) (Levy et al., 2015), which in effect enlarges the context size. Our experiment is based on word2vecf, 6 which takes arbitrary word-context pairs as input. We extract word-context pairs from a fixed window size and perform subsampling afterward.
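The difference between these implementation choices can be sketched as follows. This is an illustrative Python sketch, not the word2vec, Gensim, or word2vecf code: the function names are hypothetical, and the subsampling is made deterministic (a fixed set of dropped words) so that the effect is easy to inspect, whereas real subsampling drops tokens at random with a frequency-dependent probability. Filtering pairs after extraction is one plausible reading of performing subsampling afterward.

```python
import random
from typing import List, Optional, Set, Tuple

Pair = Tuple[str, str]


def window_pairs(sentence: List[str], k: int,
                 rng: Optional[random.Random] = None) -> List[Pair]:
    """Extract (target, context) pairs from a linear window.

    With rng given, the window size is re-sampled uniformly from 1..k for
    every target word (dynamic window); otherwise it is fixed at k.
    """
    pairs = []
    for t, target in enumerate(sentence):
        size = rng.randint(1, k) if rng is not None else k
        for c in range(max(0, t - size), min(len(sentence), t + size + 1)):
            if c != t:
                pairs.append((target, sentence[c]))
    return pairs


def dirty_subsample(sentence: List[str], dropped: Set[str], k: int) -> List[Pair]:
    """Drop tokens first, then extract pairs: survivors move closer together."""
    kept = [w for w in sentence if w not in dropped]
    return window_pairs(kept, k)


def clean_subsample(sentence: List[str], dropped: Set[str], k: int) -> List[Pair]:
    """Extract pairs from the intact sentence, then discard dropped words."""
    return [(w, c) for w, c in window_pairs(sentence, k)
            if w not in dropped and c not in dropped]


sentence = "the cat sat on the mat".split()
dropped = {"the", "on"}  # stand-in for frequent words removed by subsampling
dirty = dirty_subsample(sentence, dropped, 1)
clean = clean_subsample(sentence, dropped, 1)
fixed = window_pairs(sentence, 3)
dynamic = window_pairs(sentence, 3, random.Random(0))
```

In this toy case, dirty subsampling with k = 1 pairs "sat" with "mat" even though they are three tokens apart in the original sentence, while extracting pairs first keeps the nominal window size meaningful. The dynamic window only ever yields a subset of the pairs that the fixed window yields.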
We train 300-dimensional embeddings. For details on the hyperparameters, we refer the readers to Appendix A.

Aligning Monolingual Embeddings
After training monolingual embeddings in the source and target languages, we align them with a mapping-based algorithm. To induce an alignment matrix $W$ for the source and target embeddings $x$, $y$, we use a simple supervised method of solving the Procrustes problem $\arg\min_W \sum_i \|W x_i - y_i\|^2$ over the dictionary pairs $(x_i, y_i)$ (Mikolov et al., 2013b), with the orthogonality constraint on $W$, and with length normalization and mean-centering as preprocessing for the source and target embeddings (Artetxe et al., 2016).
Figure 1: BLI performance in the comparable setting. The target window size is fixed and the source window size is varied.
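A minimal NumPy sketch of the orthogonal Procrustes alignment is given below. It assumes that the rows of `X` and `Y` are the embeddings of the dictionary pairs in matching order; the function names and the exact order of the preprocessing steps are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def preprocess(E: np.ndarray) -> np.ndarray:
    """Length-normalize the rows, mean-center them, then re-normalize.

    The order of these steps is an assumption of this sketch.
    """
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    E = E - E.mean(axis=0, keepdims=True)
    return E / np.linalg.norm(E, axis=1, keepdims=True)


def procrustes(X: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Return the orthogonal W minimizing sum_i ||W x_i - y_i||^2.

    The closed-form solution is W = U V^T, where U S V^T is the SVD of Y^T X.
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt


# Toy check: recover a known rotation Q from exact "translation" pairs.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
X = rng.normal(size=(50, 5))                  # toy source embeddings
Y = X @ Q.T                                   # y_i = Q x_i
W = procrustes(preprocess(X), preprocess(Y))
mapped = preprocess(X) @ W.T                  # source vectors in the target space
```

Because `W` is orthogonal, the mapping preserves dot-products and distances within the source space, which is why the quality of the alignment depends on how structurally similar the two monolingual spaces already are.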
The word dictionaries are automatically created by using Google Translate. 7 We translate all words in our English vocabulary into the source languages and filter out words that do not exist in the source vocabularies. We also perform this process in the opposite direction (translated from the source languages into English), and take the union of the two corresponding dictionaries. We then randomly select 5K tuples for training and 2K for testing. Although using word dictionaries automatically derived from a system is currently a common practice in this field, it should be acknowledged that this may sometimes pose problems: the generated dictionaries are noisy, and the definition of word translation is unclear (e.g., how do we handle polysemy?). It can hinder valid comparisons between systems or detailed analysis of them, and should be addressed in future research.
For each setting, we train three pairs of aligned embeddings with different random seeds in the monolingual embedding training, as training word embeddings is known to be unstable and different runs result in different nearest neighbors (Wendlandt et al., 2018). The following results are presented with their averages and standard deviations.

Bilingual Lexicon Induction
We first evaluate the learned bilingual embeddings with bilingual lexicon induction (BLI). The task is to retrieve the target translations of source words by searching for nearest neighbors with cosine similarity in the bilingual embedding space. The evaluation metric used in prior work is usually top-k precision, but here we use a more informative measure, mean reciprocal rank (MRR), as recommended by Glavaš et al. (2019).
Fixed Target Context Window Settings. First, we consider the settings where the target context size is fixed, and the source context size is configurable. This setting assumes common situations where the embedding of the target language is available in the form of pre-trained embeddings. Figure 1 shows the results for the four languages. Firstly, we observe that overly small windows (1 to 3) for source embeddings do not yield good performance, probably because the small windows capture too few training word-context pairs for the model to learn accurate word embeddings.
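The BLI evaluation described above, cosine nearest-neighbor retrieval scored with MRR, can be sketched as follows. The function name, the toy vectors, and the choice to score a query by its best-ranked gold translation when several exist are assumptions of this sketch.

```python
from typing import List, Set

import numpy as np


def mean_reciprocal_rank(src: np.ndarray, tgt: np.ndarray,
                         gold: List[Set[int]]) -> float:
    """MRR of gold translations under cosine nearest-neighbor retrieval.

    src:  (n, d) mapped source vectors, one per query word
    tgt:  (V, d) target vocabulary vectors
    gold: for each query, the set of indices of correct target words
    """
    S = src / np.linalg.norm(src, axis=1, keepdims=True)
    T = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    order = np.argsort(-(S @ T.T), axis=1)  # target indices, most similar first
    reciprocal_ranks = []
    for i, answers in enumerate(gold):
        hits = np.nonzero(np.isin(order[i], list(answers)))[0]
        reciprocal_ranks.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(reciprocal_ranks))


tgt = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
src = np.array([[1.0, 0.1], [0.2, 1.0]])
gold = [{0}, {2}]
mrr = mean_reciprocal_rank(src, tgt, gold)
```

In this toy example the first query ranks its translation first and the second query ranks it second, so the MRR is (1 + 1/2) / 2 = 0.75. Unlike top-1 precision, MRR still gives partial credit to the second query.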
At first, this result may seem to contradict the result of Søgaard et al. (2018). They trained English and Spanish embeddings with fasttext (Bojanowski et al., 2017) and a window size of 2, and then aligned them with an unsupervised mapping algorithm (Lample et al., 2018). When they changed the window size of the Spanish embedding to 10, they only observed a very slight drop in top-1 precision (from 81.89 to 81.28). We suspect that the discrepancy with our result is due to the different settings. First of all, fasttext adopts a dynamic window mechanism, which may obfuscate the difference in the context window. Also, they trained embeddings with full Wikipedia articles, which is an order of magnitude larger than ours; the fasttext algorithm, which takes into account the character n-gram information of words, can exploit a nontrivial amount of subword overlap between the quite similar languages.
Overall, we observe that the best context window size for the source embeddings increases as the target context size increases, and increasing the context sizes of both the source and target embeddings seems beneficial to the BLI performance.
Configurable Source/Target Context Window Settings. Hereafter, we present the results where both the source and target sizes are configurable and set to the same value. Figure 3 summarizes the results of the same-domain setting.
As we expected from the observation of the settings where the target window size is fixed, the performance consistently improves as the source and target context sizes increase. Given that larger context windows tend to capture topical similarities of words, we hypothesize that the more topical the embeddings are, the easier they are to align. Topics are invariant across different languages to some extent as long as the corpora are comparable. It is natural to think that topic-oriented embeddings capture language-agnostic semantics of words and thus are easier to align among different languages. This hypothesis can be further supported by looking at the metrics for each part-of-speech (PoS). Intuitively, nouns tend to be more representative of topics than other PoS, and thus are expected to show a high correlation with the window size. Figure 2 shows the scores for each PoS. 8 In all languages, nouns and adjectives show a stronger (almost perfect) correlation with the window size than verbs and adverbs.
Different-domain Settings. The results so far are obtained in the settings where the source and target corpora are comparable. When the corpora are comparable, it is natural that topical embeddings are easier to align, as comparable corpora share their topics. In order to see if the observations from the comparable settings hold true for different-domain settings, we also present the results from the different-domain (news) source corpora in Figure 4.
Firstly, compared to the same-domain settings (Figure 3), the scores are lower by around 0.1 to 0.2 points across the languages and context windows, even with the same amount of training data. This result confirms previous findings showing that domain consistency is important to the isomorphism assumption (Søgaard et al., 2018).
As to the relation between the BLI performance and the context window, we observe a similar trend to the comparable settings: increasing the context window size generally improves the performance. Figure 5 summarizes the results for each PoS. The performance on nouns and adjectives still accounts for much of the correlation with the window size. This suggests that even when the source and target domains are different, some domain-invariant topics are captured by larger-context embeddings for nouns and adjectives.
Frequency Analysis. To gain further insight into what kind of words benefit from larger context windows, we analyze the effect of word frequency. We extract the 500 most frequent and the 500 least frequent words 9 from the test vocabularies and evaluate the performance on each group separately.
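The frequency split can be sketched as follows. How frequency is counted is not specified here, so counting occurrences in a token list is an assumption of this sketch, as are the function name and the alphabetical tie-breaking.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def frequency_buckets(test_words: Iterable[str], corpus_tokens: Iterable[str],
                      n: int = 500) -> Tuple[List[str], List[str]]:
    """Return the n most frequent and n least frequent test words."""
    counts = Counter(corpus_tokens)
    # Sort by descending count; break ties alphabetically for determinism.
    ranked = sorted(set(test_words), key=lambda w: (-counts[w], w))
    return ranked[:n], ranked[-n:]


corpus = "a a a b b c d d d d".split()
top, bottom = frequency_buckets(["a", "b", "c", "d"], corpus, n=2)
```

The BLI metric is then computed separately on the two groups, so that any gain from a larger window can be attributed to frequent or rare words.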
The results of the comparable setting in each language are shown in Figure 6. The scores for the frequent words (top500) are notably higher than those for the rare words (bottom500). This confirms previous empirical results that existing mapping-based methods perform significantly worse for rare words (Braune et al., 2018; Czarnowska et al., 2019).
With respect to the relation with the context size, both frequent and rare words benefit from larger window sizes, although the gain in the rare words is less obvious in some languages (Ja and Ru).
In the different-domain settings, as shown in Figure 7, the rare words instead suffer from larger window sizes, especially for Fr and Ru, but the performance on frequent words still improves as the context window increases.
We conjecture that when training a skip-gram model, frequent words observe many context words, and that would mitigate the effect of irrelevant words (noise) caused by a larger window size and result in high-quality topical embeddings; however, rare words have to rely on a limited number of context words, and larger windows just amplify the noise and domain difference to result in an inaccurate alignment of them.

Downstream Tasks
Although BLI is a common evaluation method for bilingual embeddings, good performance on BLI does not necessarily generalize to downstream tasks (Glavaš et al., 2019). To gain further insight into the effect of the context size on bilingual embeddings, we evaluate the embeddings with three downstream tasks: 1) sentiment analysis; 2) document classification; 3) dependency parsing. Here, we briefly describe the dataset and model used for each task.
Sentiment Analysis (SA). We use the Webis-CLS-10 corpus 10 (Prettenhofer and Stein, 2010), which comprises Amazon product reviews in four languages: English, German, French, and Japanese (no Russian data available). We cast sentiment analysis as a binary classification task, where we label reviews with scores of 1 or 2 as negative and reviews with scores of 4 or 5 as positive. For the model, we employ a simple CNN encoder followed by a multi-layer perceptron classifier.
Document Classification (DC). MLDoc 11 (Schwenk and Li, 2018) is compiled from the Reuters corpus for eight languages including all the languages used in this paper. The task is a four-way classification of the news article topics: Corporate/Industrial, Economics, Government/Social, and Markets. We use the same model architecture as for sentiment analysis.
Dependency Parsing (DP). We train deep biaffine parsers (Dozat and Manning, 2017) with the UD English EWT dataset 12 (Silveira et al., 2014). We use the PUD treebanks 13 as test data.
The hyperparameters used in this experiment are shown in Appendix B.
Evaluation Setup. We evaluate in a cross-lingual transfer setup how well the bilingual embeddings trained with different context windows transfer lexical knowledge across languages. Here, we focus on the settings where both the source and target context sizes are varied.
For each task, we train models with our pretrained English embeddings. We do not update the parameters of the embedding during training. Then, we evaluate the model with the test data in other languages available in the dataset. At test time, we feed the model with the word embeddings of the test language aligned to the training English embeddings.
We train nine models in total for each setting with different random seeds and English embeddings, and we present their average scores and standard deviations.
Result and Discussion. The results from all three tasks are presented in Figure 8. For sentiment analysis and document classification, we observe a similar trend where the best window size is around 3 to 5 for the source English task, but for the test languages, larger context windows achieve better results. The only deviation is the Japanese document classification, where the score does not show a significant correlation. We attribute this to low-quality alignments due to the large typological difference between English and Japanese.
For dependency parsing, embeddings with smaller context windows perform better in the source English task, which is consistent with the observation that smaller context windows tend to produce syntax-oriented embeddings (Levy and Goldberg, 2014a). However, the performance of the small-window embeddings does not transfer to the test languages. The best context window for the English development data (a size of 1) performs the worst for all the test languages, and the transferred accuracy seems to benefit from larger context sizes, although it does not always correlate with the window size. This observation highlights the difficulty of transferring syntactic knowledge across languages. Word embeddings trained with small windows capture more grammatical aspects of words in each language, which, as different languages have different grammars, makes the source and target embedding spaces so different that it is difficult to align them.
Footnote 11: https://github.com/facebookresearch/MLDoc
Footnote 12: https://universaldependencies.org/treebanks/en_ewt/index.html
Footnote 13: https://universaldependencies.org/conll17/
In summary, a general trend we observe here is that good context windows in the source language task do not necessarily produce good transferable bilingual embeddings. In practice, it seems better to choose a context window that aligns the source and target well, rather than using the window size that just performs the best for the source language.

Conclusion and Future Work
Despite their obvious connection, the relation between the choice of context window and the structural similarity of two embedding spaces has not been fully investigated in prior work. In this study, we have offered the first thorough empirical results on the relation between the context window size and bilingual embeddings, and shed new light on the properties of bilingual embeddings. In summary, we have shown that:
• larger context windows for both the source and target facilitate the alignment of words, especially nouns.
• for cross-lingual transfer, the best context window for the source task is often not the best for test languages. Especially for dependency parsing, the smallest context size produces the best result for the source task, but performs the worst for test languages.
We hope that our study will provide insights into ways to improve cross-lingual embeddings by not only mapping methods but also the properties of monolingual embedding spaces.