One of these words is not like the other: a reproduction of outlier identification using non-contextual word representations

Word embeddings are an active topic in the NLP research community. State-of-the-art neural models achieve high performance on downstream tasks, albeit at the cost of computationally expensive training. Cost-aware solutions require cheaper models that still achieve good performance. We present several reproduction studies of intrinsic evaluation tasks that evaluate non-contextual word representations in multiple languages. Furthermore, we present 50-8-8, a new data set for the outlier identification task, which avoids limitations of the original data set, such as ambiguous words, infrequent words, and multi-word tokens, while increasing the number of test cases. The data set is expanded to contain semantic and syntactic tests and is multilingual (English, German, and Italian). We provide an in-depth analysis of word embedding models with a range of hyper-parameters. Our analysis shows the suitability of different models and hyper-parameters for different tasks and the greater difficulty of representing German and Italian.


Introduction
Unsupervised word embeddings have largely replaced language-specific hand-designed representations of syntax and semantics (Mikolov et al., 2013a; Levy and Goldberg, 2014a; Devlin et al., 2019). Models based on deep neural networks such as the BERT family (Devlin et al., 2019; Liu et al., 2019; Sanh et al., 2019) construct contextualized word vector representations. While showing state-of-the-art results on benchmarks such as GLUE (Wang et al., 2018), they are computationally expensive for both training and inference (Devlin et al., 2019; You et al., 2020), with a significant cost for the environment (Strubell et al., 2019). In this paper, we turn our attention back to the non-contextual, less resource-hungry word representations of the word2vec family (Mikolov et al., 2013a; Levy and Goldberg, 2014a).
We contribute reproduction studies on the quality of non-contextual word representations using outlier identification (Camacho-Collados and Navigli, 2016) and the classic word analogy task (Mikolov et al., 2013a). Replicability and reproducibility have gained increasing importance in the NLP community: a focus on publishing code and data with papers, special sections in leading journals (Branco et al., 2017), and dedicated shared tasks (Branco et al., 2020). Unfortunately, there exist opposing definitions of the terms reproduction and replication (e.g., Branco et al. (2017) and Chris (2009)), while others propose a spectrum of reproducibility (Peng, 2011). While we aim to reproduce the experiments in our target papers closely, we go beyond a straightforward reproduction and address further questions, such as the effect of hyper-parameters, of linear contexts (CBOW vs. skip-gram), and of non-linear dependency-based contexts (word2vecf).
We also propose 50-8-8, an alternative to the 8-8-8 outlier identification data set (Camacho-Collados and Navigli, 2016) that is several times larger, includes both semantic and syntactic evaluations, and addresses result variance issues that affect the original 8-8-8 data set. Finally, our 50-8-8 data set is multilingual, covering English (EN), German (DE), and Italian (IT). The three languages are challenging for word representations due to their large vocabulary, heavy reliance on word compounding (DE), and complex grammar and sentence structure (DE and IT).
In our paper, we contribute:
• Reproduction studies of outlier identification and word analogy (Camacho-Collados and Navigli, 2016; Köper et al., 2015; Berardi et al., 2015; Mikolov et al., 2013a), through which we find that most evaluations are reproducible, albeit some, namely outlier identification, only after taking variance into account.
• 50-8-8, an improved outlier identification data set that addresses issues with the 8-8-8 data set used in the original outlier identification paper. 50-8-8 is multiple times larger than 8-8-8, multilingual (English, German, and Italian), excludes polysemous and rare words, and contains both semantic and syntactic tests.
• Comparative study and analysis of CBOW, skip-gram, word2vecf, and word2vecf without relation-suffixes, on multiple corpora and languages (English, German, and Italian), for multiple hyper-parameters, on outlier identification and analogy reasoning tasks (both semantic and syntactic). All results are based upon multiple instances of the models and quantify variation in results.

Related Work
Contextualized neural word embeddings (Devlin et al., 2019; Liu et al., 2019) show impressive performance in downstream NLP tasks, at the cost of training time; pre-training the base version of BERT took four days on 16 TPU chips (Devlin et al., 2019). Efforts to reduce the training time still require significant computing power on dedicated hardware (You et al., 2020), with a high environmental cost (Strubell et al., 2019). Reductions of memory usage (Sanh et al., 2019) or of both training time and memory usage (Lan et al., 2020) still do not eliminate the high resource consumption. As such, less computationally expensive models, such as word2vec (Mikolov et al., 2013a), word2vecf (Levy and Goldberg, 2014a), FastText (Bojanowski et al., 2017), and GloVe (Pennington et al., 2014), are attractive when they show good performance on NLP tasks. Computationally cheaper models, like word2vec, share an evaluation drawback with their more complicated and expensive counterparts: there is no generally agreed-upon evaluation. Ghannay et al. (2016) compare word2vec and word2vecf on attributional similarity, extended by Li et al. (2017) to combinations of context representations and context types for CBOW, skip-gram, and GloVe. However, Faruqui et al. (2016) and Batchkarov et al. (2016) note that attributional similarity is subjective, lacks statistical significance, and has a low correlation with extrinsic evaluation, making it inconsistent and not necessarily indicative of model properties. Meanwhile, Schnabel et al. (2015) argue that different extrinsic evaluation tasks prefer different embeddings, suggesting that extrinsic tasks might not be indicators of general embedding quality either.
The outlier identification task (Camacho-Collados and Navigli, 2016) avoids subjective similarity measurements. Instead, it employs relative word vector similarity to identify an outlier from a group of otherwise semantically related words. Blair et al. (2017) expanded the outlier identification data set algorithmically based on Wikidata. However, the automatic approach has several limitations, including ambiguous, infrequent, or duplicate words in the same category, and word variants in the same category, likely due to hierarchy inconsistencies in Wikidata (Brasileiro et al., 2016). In this paper, we return to manually curated data sets with controlled quality and difficulty.
In light of recent findings on the instability of word2vec (Antoniak and Mimno, 2018), we reproduce several word vector evaluations. We find that the original 8-8-8 data set used in the outlier identification evaluation leads to high result variance. We address this issue by proposing an expanded evaluation data set we call 50-8-8. Neither the original outlier identification publication (Camacho-Collados and Navigli, 2016) nor the word similarity publications (Ghannay et al., 2016; Li et al., 2017) fully explore the effects of hyper-parameters and randomness. We systematically evaluate models and hyper-parameters over ten training runs and measure average performance and variance.
Finally, most evaluations of word2vec embeddings focus on English, with notable exceptions (Köper et al., 2015;Berardi et al., 2015;Svoboda and Brychcín, 2018;Venekoski and Vankka, 2017;Rodrigues et al., 2016;Chen et al., 2015;Grave et al., 2018). However, these are translations of word similarity tasks and share the weaknesses of their English language counterparts. We reproduce the evaluation of core word analogy evaluations of Köper et al. (2015) and Berardi et al. (2015) and expand them by comparing word2vec to its dependency-based counterpart, word2vecf. We use the word analogy task from Mikolov et al. (2013a) to give a reference point for model performance and ease comparison with other research, even though the pitfalls from the similarity tasks also apply to this task (Faruqui et al., 2016;Batchkarov et al., 2016). To supplement the evaluations on non-English languages, we manually translate our new 50-8-8 data set into German and Italian and thus provide a multilingual outlier identification data set and evaluation.

Tasks
In this section, we introduce the intrinsic tasks and data sets we use for evaluation. Furthermore, we summarize previous data sets' limitations and introduce a new data set for the outlier identification task.

Outlier identification
Evaluations of word similarity rely on a similarity score between words. Therefore, it is difficult (if not impossible) to obtain a gold standard, as people cannot agree on similarity scores between words (e.g., which is more similar to a cat: a tiger or a lion?). Outlier identification, on the other hand, aims to identify an outlier in a set of otherwise similar words. The outlier is the word with the lowest average cosine similarity to the rest of the set. This formulation makes constructing a gold standard more straightforward, as the attribution of specific similarity scores is avoided (Camacho-Collados and Navigli, 2016). Even though word embeddings cannot answer questions involving subtle similarity, they can represent outliers as sufficiently distinct from a group of words that share some similarities (the inliers).
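The selection rule above can be sketched in a few lines; a minimal illustration with invented toy vectors (the words and values below are ours, chosen only to make the outlier obvious):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def find_outlier(words, vectors):
    """Return the word with the lowest average cosine similarity
    to the other words in the set (the predicted outlier)."""
    avg_sim = {}
    for w in words:
        others = [cosine(vectors[w], vectors[o]) for o in words if o != w]
        avg_sim[w] = sum(others) / len(others)
    return min(avg_sim, key=avg_sim.get)

# Toy 2-d vectors: three "cat-like" words and one unrelated word.
vecs = {
    "tiger": np.array([0.9, 0.1]),
    "lion":  np.array([0.8, 0.2]),
    "cat":   np.array([0.85, 0.15]),
    "wrath": np.array([0.1, 0.9]),
}
print(find_outlier(["tiger", "lion", "cat", "wrath"], vecs))  # wrath
```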

Measures for outlier identification
We use two performance measures, Accuracy (Acc) and Outlier Position Percentage (OPP). Accuracy is the ratio of correctly identified outliers to the total number of test cases and provides a strict, narrow-focused measure of performance. OPP indicates how close the outliers are to being correctly classified. OPP is defined as:

OPP = (Σ_{W∈D} OP(W) / (|W| − 1)) / |D| × 100,

where W is a word set (8 inliers and one outlier) and D is a data set consisting of |D| such sets of words. The Outlier Position (OP) is the position of the outlier in the list of words ordered by average cosine similarity to the other words in the set. Positions range from 0 to |W| − 1, where an OP equal to |W| − 1 indicates a correct classification of the outlier; the lower the OP, the worse the system does at identifying the outlier. While accuracy takes a black-and-white approach to measuring performance, OPP accounts for differences in the words' rankings.
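Under these definitions, Acc and OPP can be computed as in the following sketch (a simplified re-implementation for illustration, not the authors' evaluation script; `evaluate` and the toy test-case format are our own names):

```python
import numpy as np

def avg_similarities(words, vectors):
    """Average cosine similarity of each word to the rest of the set."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return {w: sum(cos(vectors[w], vectors[o]) for o in words if o != w) / (len(words) - 1)
            for w in words}

def outlier_position(inliers, outlier, vectors):
    """OP: position of the true outlier when words are ranked by
    decreasing average similarity. OP == |W| - 1 means a correct hit."""
    words = inliers + [outlier]
    sims = avg_similarities(words, vectors)
    ranked = sorted(words, key=sims.get, reverse=True)
    return ranked.index(outlier)

def evaluate(test_cases, vectors):
    """test_cases: list of (inliers, outlier) pairs; returns (Acc, OPP)."""
    n = len(test_cases)
    ops = [outlier_position(inl, out, vectors) for inl, out in test_cases]
    size = len(test_cases[0][0]) + 1          # |W|, 9 in the 8-8-8 setting
    acc = 100 * sum(op == size - 1 for op in ops) / n
    opp = 100 * sum(op / (size - 1) for op in ops) / n
    return acc, opp
```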
For our experiments, we modify the original evaluation script of Camacho-Collados and Navigli to address a bug. In the script, vectors of Out-Of-Vocabulary (OOV) words are set to the zero vector, resulting in undeserved successful outlier identifications. In our experiments, we instead mark such test cases as unsuccessful. Accordingly, OOV words decrease performance scores instead of increasing them. We describe the error and our fix in Appendix A and share our fixed script with our 50-8-8 data set.

Data sets for outlier identification
Camacho-Collados and Navigli (2016) provide a manually curated 8-8-8 data set with their task: 8 test groups of 8 semantically related inliers and 8 alternatives for non-related outliers, resulting in 64 test cases. The data set, however, has some limitations. First of all, its low number of test cases means that each misclassification causes a significant change in accuracy. The low number also results in limited coverage of concepts in a vector space, which may not be representative of the semantic information encoded. Secondly, it contains ambiguous words. For example, Smart (used in the German car manufacturers test group) can denote both the car manufacturer and an unrelated adjective. Because the adjective might be more common in a corpus, it will have a higher influence on the resulting vector and might lead to its corresponding word being classified as an outlier. We claim that selecting the word "Smart" as an outlier when the adjective is prevalent is, in fact, the correct behavior. However, since this goes against the intention of the data set design (and the ground-truth labels), we consider such ambiguous words a drawback. Thirdly, multi-token words are handled by taking the average vector of all constituting tokens, which is problematic: the concept denoted by a multi-token word does not necessarily have connections to the meaning (i.e., vector) of the tokens that comprise it. Finally, some words in the data set have a very low frequency in the corpora used for training in the original paper. Low-frequency terms tend to have unstable word vectors, which can lead to high variance in evaluation using the 8-8-8 data set.
WikiSem500 (Blair et al., 2017) is an automatically generated extension of 8-8-8. By treating Wikidata as a graph in which semantic word similarities are distances, the authors of WikiSem500 automatically construct 500 test groups and 2 816 test cases. However, WikiSem500 has severe limitations. First of all, many inlier sets have a vague semantic connection that makes outliers difficult to identify (even for humans), which may be caused by Wikidata not always following structural rules from multilevel model theory (Brasileiro et al., 2016). Wikidata's crowd-sourced nature causes many hierarchies spanning more than one classification level to follow known anti-patterns, such as items that are simultaneously instances and subclasses of other items; items that are subclasses of several items, with one of the superclasses an instance of the other; and items that are instances of several items, with one of those also an instance of the other (Brasileiro et al., 2016). Such inconsistencies in the graph are reflected in some of the test groups in WikiSem500. Take, for example, test group Q197, which consists of instances of airplanes. The inliers include various specific combat aircraft models (e.g., B-29_Superfortress and F/A-18_Hornet) but also the terms glider and fighter_aircraft, which should be subclasses rather than instances of airplanes and should therefore not be inliers. At the same time, Mitsubishi F-1 (a Japanese combat aircraft) is an outlier, although it should be an instance of an airplane, and therefore an inlier.
Other problems include ambiguous words; the same outlier appearing several times in the same test group (thus overly impacting evaluation results); the same words with different spellings in the same test group; infrequent words; and inconsistency between using the same words or new ones in the same test group across different languages.
Given the limitations of 8-8-8 and WikiSem500, we propose 50-8-8, a manually curated data set comprising two sections: 25-8-8-Sem and 25-8-8-Syn. We select unambiguous single-token words with a minimum frequency of 350 in each training corpus (details in Section 4.2). We determine word ambiguity using dictionaries and native speakers. Our outliers have different degrees of connectedness to the inliers for different levels of test complexity, i.e., the further down the list of outliers, the weaker the connection to the inliers, and the more evident the outlier.
For example, in the test group Greek Gods, the first two outliers are Cupid (Roman god of love) and Odysseus (Greek legendary king), which could be misclassified by someone with little domain knowledge. The following are Jesus, Sparta, Delphi, and Rome, all of which have only a weak connection to the inliers. The last two outliers are wrath and Atlanta, with no connection to the inliers. 25-8-8-Sem contains 25 test groups, each comprising eight inliers and eight alternatives for outliers, resulting in 200 unique test cases, a more than 3-fold increase in size over the original 8-8-8 data set.
Please note that in preliminary experiments, we found that random selection of outliers produces trivial test cases, with all models scoring above 97.05 in accuracy and 99.15 in OPP.
The second part of our 50-8-8 data set, the syntactic 25-8-8-Syn data set, consists of 25 syntactic test groups, as defined by part-of-speech (PoS) tags. We choose words with a unique PoS tag in dictionaries to avoid syntactic ambiguity. Furthermore, we ensure that the words in each test case share no semantic connection, so that the evaluation can focus exclusively on distinction by syntactic role.
The two distinct subsets of 50-8-8 improve the outlier identification task by allowing for evaluations that target semantics and syntax, the two core aspects that word vectors encode.
In addition to English, we also look at German (another West Germanic language) and Italian (a Romance language), which both employ a more complex grammatical structure than English and use declension to mark gender and plurality. German also relies heavily on compound words and grammatical cases. We manually translate our 50-8-8 data set using dictionaries and native speakers. We address translation and language-specific challenges as follows. First of all, words that are unambiguous in one language can be ambiguous in another. We address semantic ambiguity by replacing ambiguous words in any language with words that are unambiguous in all languages, and syntactic ambiguity by replacing the ambiguous word with one belonging to the same PoS tag. Syntactic ambiguity is language-specific, e.g., when translating adverbs to German: the suffixes -ly and -mente often distinguish adverbs from adjectives in English and Italian, respectively, but German can use the same lexical form for both. In Italian, many adjectives are also nouns, and many nouns are also conjugations of verbs, which is not as prevalent in German and English. Secondly, when a word translates to two synonymous words, we use the most common one, as determined by native speakers.
Furthermore, for nouns in German, we use the nominative case of the nouns to avoid the effects of different grammatical cases. For adjectives in Italian, we use the masculine gender where applicable to avoid the effects of gender. Removing syntactic variation allows the semantic tests to stay focused on semantics. Thus, all the versions of 25-8-8-Sem are identical, all versions of 25-8-8-Syn have an identical distribution of PoS tags within a given test group, and we use consistent and frequent variants of words.

Word analogy task
Our study's second task is the word analogy task, which measures how well a model captures the relational similarity between pairs of words. A high degree of relational similarity between the pairs means that the words are analogous (Mikolov et al., 2013c; Turney, 2006). The task includes questions like Berlin is to Germany as what is to France?, where the model should return Paris. Like outlier identification, word analogy is separated into semantic and syntactic tests. As we note in Section 2, this task has drawn heavy criticism (Faruqui et al., 2016). We include it for easy comparison with existing work and to contextualize the outlier identification results.
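A common way to answer such questions is the 3CosAdd method used in our experiments: return the vocabulary word closest to vec(Berlin) − vec(Germany) + vec(France). A minimal sketch with invented toy vectors, where each capital is its country plus a fixed offset:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def analogy_3cosadd(a, b, c, vectors):
    """Answer 'a is to b as ? is to c' (3CosAdd): return the vocabulary
    word x maximizing cos(x, a - b + c), excluding the query words."""
    target = normalize(vectors[a] - vectors[b] + vectors[c])
    scores = {w: float(np.dot(normalize(v), target))
              for w, v in vectors.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

# Toy vectors: each capital = its country plus a fixed "capital-of" offset.
vecs = {
    "Germany": np.array([1.0, 0.0]), "Berlin": np.array([1.0, 1.0]),
    "France":  np.array([2.0, 0.0]), "Paris":  np.array([2.0, 1.0]),
    "Italy":   np.array([3.0, 0.0]), "Rome":   np.array([3.0, 1.0]),
}
print(analogy_3cosadd("Berlin", "Germany", "France", vecs))  # Paris
```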

Data set for word analogy task
We use the analogy data set of Mikolov et al. (2013a) for English. For German, we use a version of the analogy data set with a total of 18 552 test cases (the adjective-adverb category is missing, as it does not exist in German) (Köper et al., 2015). For Italian, we use a translation of the analogy data set (Berardi et al., 2015) with 19 791 test cases, with small changes to keep all words as single-token words. Please note that the word analogy data set is not balanced: size varies by category, causing some relations to be over-represented; e.g., two of the semantic categories evaluate knowledge about countries and their corresponding capitals and represent more than half of the total semantic tests (Gladkova et al., 2016).

Models and corpora
This section introduces the word embedding models and the training corpora we use for the evaluation.

Models
Word2vec consists of two types of models: CBOW (continuous-bag-of-words) and skip-gram (Mikolov et al., 2013a,b). Both models use a linear context, consisting of the n words before and n words after the current word.
Word2vecf (Levy and Goldberg, 2014a) replaces the linear context with one based on words directly connected via the dependency graph of the sentence. Thus, word2vecf eliminates the window size hyper-parameter of word2vec, increases the pool of available context tokens up to the sentence boundaries, and focuses context selection by eliminating irrelevant words. The example Australian scientist discovers star with a telescope from the original paper illustrates the difference in context. For the word discovers and a window size of 2, word2vec would consider the words Australian, scientist, star, and with to be part of the context. There is nothing inherently Australian about discovering; hence, this word and with add noise to the context of discovers. Word2vecf, instead, includes scientist_nsubj, star_obj, and telescope_prepwith in the context. Thus, word2vecf both removes noisy words (Australian, with) and includes relevant terms (telescope) in the context.
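The two context types can be contrasted by generating context lists for this example sentence; a sketch in which the dependency arcs are hand-written by us (the relation names therefore differ slightly from the paper's rendering, e.g. dobj vs. obj) rather than produced by a real parser:

```python
SENT = "australian scientist discovers star with a telescope".split()

def linear_contexts(tokens, i, window=2):
    """word2vec-style context: the window words on either side of position i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

# Hand-written (head, dependent, relation) arcs, with the preposition
# collapsed into 'prep_with' as in the word2vecf paper's example.
ARCS = [
    ("scientist", "australian", "amod"),
    ("discovers", "scientist", "nsubj"),
    ("discovers", "star", "dobj"),
    ("discovers", "telescope", "prep_with"),
    ("telescope", "a", "det"),
]

def dependency_contexts(word):
    """word2vecf-style context: directly connected words, suffixed with
    the dependency relation (inverse relations get a '-1' marker)."""
    ctx = []
    for head, dep, rel in ARCS:
        if head == word:
            ctx.append(f"{dep}_{rel}")
        elif dep == word:
            ctx.append(f"{head}_{rel}-1")
    return ctx

print(linear_contexts(SENT, SENT.index("discovers")))
# ['australian', 'scientist', 'star', 'with']
print(dependency_contexts("discovers"))
# ['scientist_nsubj', 'star_dobj', 'telescope_prep_with']
```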
On the downside, word2vecf requires the corpus to be dependency parsed, which introduces some parser noise (Chen and Manning, 2014). Word2vecf suffixes the dependency relation to each word in the context, which massively increases the vocabulary size, up to |V| · |D|, where |V| denotes the vocabulary size and |D| denotes the number of relation types supported by the dependency parser. The massive vocabulary increase leads to lower frequency counts and can result in instability in the vectors' values. Furthermore, the word vectors are trained on the auxiliary, relation-suffixed words instead of directly on each other; as such, the suffixed words act as barriers to information flow between context words and target words.
Word2vecf+ addresses a limitation of word2vecf, namely the inclusion of dependency relations as word suffixes, so that the vocabulary size does not increase. Word2vecf+ keeps the vocabulary size fixed by removing the suffix from each context word before training, thereby training words directly on each other and discarding the auxiliary words (Li et al., 2017). For example, the word scientist_nsubj from above becomes scientist. While the original paper calls this method generalized skip-gram with unbound dependency-based context, for readability, we refer to it as word2vecf+.
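This preprocessing step amounts to stripping the relation suffix from each context token before training, collapsing the context vocabulary back onto |V|; a minimal sketch assuming corpus words themselves contain no underscores (the helper name is ours):

```python
def strip_relation_suffix(context_token):
    """Map a word2vecf context token such as 'scientist_nsubj' back to the
    plain word 'scientist', so the context vocabulary stays at |V|
    rather than growing toward |V| * |D|."""
    return context_token.partition("_")[0]

contexts = ["scientist_nsubj", "star_dobj", "telescope_prep_with"]
print([strip_relation_suffix(c) for c in contexts])
# ['scientist', 'star', 'telescope']
```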

Training Corpora
We use multiple corpora to derive word vectors for our evaluations (see summary in Table 1). For English, we use the UMBC web-based corpus (Han et al., 2013) and the September 2019 dump of English Wikipedia. The choice of corpora aims to reproduce the experiments in the original outlier identification work (Camacho-Collados and Navigli, 2016). The newer version of Wikipedia is a super-set of the one used in the original experiments. As in the original paper, the use of two English corpora should eliminate questions of corpus-specific results.
For German, we derive vectors from the January 2020 version of Wikipedia; for Italian, from the April 2020 version of Wikipedia. The three versions of Wikipedia have widely different sizes: the largest (Wiki EN) is almost five times bigger than the smallest (Wiki IT). However, even the smallest has over 500 million tokens for a vocabulary of less than one million word types (average word type frequency of 670). The smallest corpus (Wiki IT) has a larger average word type frequency (670) than the second smallest, Wiki DE (average word type frequency 416). Such large corpora, combined with repetitions of training and evaluation cycles, provide a good overview of model performance and avoid the instability of word2vec (Antoniak and Mimno, 2018).
We use WikiExtractor (Attardi, 2018) to extract plain text from the Wikipedia corpora and tokenize all corpora using Stanford CoreNLP v3.9.2. We remove words that appear fewer than five times using the original word2vec code (Mikolov et al., 2013a) or word2vecf (Levy and Goldberg, 2014a), as appropriate. For models that require dependency relations (word2vecf, word2vecf+), we dependency parse using the Stanford neural-network dependency parser (Chen and Manning, 2014). For dependency parsing Italian, we use the model trained by Palmero Aprosio and Moretti (2016).

Experiments
This section presents experimental results, focusing on the reproduction, new data set, window size, different corpora, and languages. In our tables and figures, we denote the different approaches as follows: CBOW, SG (skip-gram), W2VF (word2vecf), W2VF+ (word2vecf+); each followed by the size of the window used. We include a detailed description of the experimental setup in Appendix C.

Reproduction results
In Table 2, we compare our reproduction results with those of Camacho-Collados and Navigli (2016). We observe a high variance in accuracy, which illustrates the small 8-8-8 data set's weakness and further underlines the importance of evaluating multiple training runs. We conclude that the original outlier identification results can be reproduced, but with the caveat that accuracy can suffer from large variance. In Section 5.2, we propose 50-8-8, a data set that alleviates this issue.

Table 3 shows the results of reproducing the word analogy task. For English, we see the same pattern as Mikolov et al. (2013a), where skip-gram outperforms CBOW, even though our corpora and hyper-parameters differ. Comparing the German Wikipedia results to those of Köper et al. (2015), we see a similar pattern in the semantic part, where skip-gram outperforms CBOW. However, in the syntactic part, our results differ: Köper et al. observe that CBOW outperforms skip-gram, whereas we observe the opposite, which could be due to the difference in corpora and in hyper-parameters such as vector dimensionality. Due to the different focus of this paper and that of Berardi et al. (2015), we can only compare skip-gram results with window size 10. We observe a similar semantic performance, but a significant difference in syntactic performance, where Berardi et al. observe a score of 32.62 compared to our result of 44.63; this could be the result of the difference in the number of negative samples (we use 15, they use 10) and the different Wikipedia version. However, as they do not cover the CBOW model, it is difficult to get an overview of model performance.

The effect of the new 50-8-8 data set
The results of outlier identification using our proposed 50-8-8 are in Table 4. As expected, given the more comprehensive tests, on both UMBC and English Wikipedia we see significantly lower accuracy variance for 25-8-8-Sem than for 8-8-8. The only exception is word2vecf, where the accuracy variance grows slightly from 0 on 8-8-8 to 0.15 on 25-8-8-Sem. Although the word2vecf accuracy variance on 8-8-8 is 0, the ten instances do differ in their answers, as can be observed in the OPP variance in Table 2. Except for a few individual cases, the variance on 25-8-8-Syn is also low. The performance of the best models on 25-8-8-Syn usually matches that on 25-8-8-Sem, suggesting that the two subsets of 50-8-8 are balanced in terms of difficulty. The best performing model on 25-8-8-Syn is CBOW 2 (except for Italian).

(Notes on the word analogy reproduction in Table 3: 5% of the questions were skipped by the German models and 10% of the questions were skipped by the Italian models due to OOV words; this was also observed by Berardi et al. (2015). We solve the task with the 3CosAdd method, as in Mikolov et al. (2013a); the alternative 3CosMul improves the analogy results and is discussed in Appendix D.)

Effect of window size

Table 4 shows that window size has a limited impact on OPP for semantic tests (25-8-8-Sem), but affects the results on syntactic tests (25-8-8-Syn), where skip-gram performs best with a low window size across all corpora. For the word analogy task (Table 3), the opposite is true for the semantic evaluation, where larger window sizes improve performance. These results align with Bansal et al. (2014), who observe that larger window sizes capture more semantic information, while smaller ones capture more syntactic information. The same pattern can be observed on syntactic German Wikipedia and syntactic UMBC when taking variance into account. Bansal et al. observe that CBOW and skip-gram with a lower window size perform better on syntactic tests, while a larger window size performs better on semantic tests. However, our results show that the preferred window size varies with the task: lower window sizes better capture clusters of semantically and syntactically similar words, whereas larger window sizes are better suited for capturing word relations. These observations also indicate that hyper-parameters can have a big influence on the performance of the models.

Word2vecf underperforms the best word2vec model on semantic tests on all corpora. However, word2vecf seems better suited to syntactic tests, where it matches or outperforms the best word2vec model on all four corpora. We observe the same results in the word analogy task (Table 3). Despite the expected improvements in the contexts of word2vecf and word2vecf+, they consistently underperform the word2vec models, sometimes underperforming even the weakest of the word2vec models. This observation is consistent across all data sets and all languages.

Effect of relation-suffix
The results in Table 4 show that word2vecf+ outperforms word2vecf on semantic outlier identification across all corpora. On the syntactic subset, 25-8-8-Syn, word2vecf consistently outperforms word2vecf+ on all corpora. The consistent difference in performance between word2vecf and word2vecf+ on both the semantic and syntactic tests suggests that word2vecf might be better suited for encoding syntactic information and word2vecf+ might be better suited for encoding semantic information.
We observe a large drop in syntactic OPP and accuracy for both word2vecf and word2vecf+ when moving from UMBC to Wiki EN. The drop may be due to the quality of dependency relations from the Stanford CoreNLP dependency parser, which was trained on the Penn Treebank, a corpus of scientific abstracts, news stories, and bulletins (Chen and Manning, 2014; Marcus et al., 1993). The Penn Treebank thus resembles UMBC more than English Wikipedia, which could explain the performance drop. On the word analogy task (Table 3), word2vecf+ performs better than word2vecf. On the syntactic tests, word2vecf is comparable to CBOW, but removing the relation suffix (word2vecf+) results in scores closer to skip-gram, the best performing model; on the semantic tests, removing the relation suffix results in a 3-fold increase in word2vecf+ performance over word2vecf.
Based on these observations, we conclude that word2vecf+ is better able to capture semantic information, as it avoids word2vecf's dramatic, artificial increase in vocabulary. It allows word vectors to directly influence each other during training, resulting in semantically related words being positioned closer in the embedding space and in better capturing of both syntactic and semantic similarities in word pairs. In contrast, the relational suffixes improve the clustering of syntactically related words.

Table 4 shows that the models trained on German and Italian are generally less capable than those trained on the English corpora. The difference between German and English is noticeable in the syntactic analogy results (Table 3): German performance is almost half that of English across all models, while Italian is better but still significantly lower than English. Furthermore, in the semantic part of word analogy, the performance of models trained on UMBC is closer to that of models trained on Wiki DE than to that of models trained on Wiki EN. In general, Table 4 shows a drop in performance for languages other than English, in line with our expectation that German and Italian are more difficult to model.

Conclusions
We contribute several reproduction studies of the outlier identification task and the classic word analogy task, both intrinsic evaluations of non-contextual word representations. We provide an in-depth analysis of word2vec, word2vecf, and word2vecf+ on the two tasks, analyzing the effects of window size, context type, and context representation on English, German, and Italian. We find that the context construction strategy of word2vecf and word2vecf+ is not always effective: sometimes the two models underperform even the weakest of the word2vec models.
Our reproduction of outlier identification shows high variance, which we attribute to the original data set's limitations. To address these limitations, we propose 50-8-8, a new data set that is multiple times larger, manually curated, multilingual, and contains both syntactic and semantic tests. Besides eliminating the variance issues, 50-8-8 quantifies the drop in performance in representations of languages with more complicated grammar and morphology than English.