On the Role of Seed Lexicons in Learning Bilingual Word Embeddings

A shared bilingual word embedding space (SBWES) is an indispensable resource in a variety of cross-language NLP and IR tasks. A common approach to SBWES induction is to learn a mapping function between monolingual semantic spaces, where the mapping critically relies on a seed word lexicon used in the learning process. In this work, we analyze the importance and properties of seed lexicons for SBWES induction across different dimensions (i.e., lexicon source, lexicon size, translation method, translation pair reliability). On the basis of our analysis, we propose a simple but effective hybrid bilingual word embedding (BWE) model. This model (HYBWE) learns the mapping between two monolingual embedding spaces using only highly reliable symmetric translation pairs from a seed document-level embedding space. We perform bilingual lexicon learning (BLL) with 3 language pairs and show that by carefully selecting reliable translation pairs our new HYBWE model outperforms benchmarking BWE learning models, all of which use more expensive bilingual signals. Effectively, we demonstrate that a SBWES may be induced by leveraging only a very weak bilingual signal (document alignments) along with monolingual data.


Introduction
Dense real-valued vector representations of words or word embeddings (WEs) have recently gained increasing popularity in natural language processing (NLP), serving as invaluable features in a broad range of NLP tasks, e.g., (Turian et al., 2010; Collobert et al., 2011; Chen and Manning, 2014). Several studies have showcased a direct link and comparable performance to "more traditional" distributional models (Turney and Pantel, 2010). Yet the widely used skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013b) is considered the state-of-the-art word representation model, due to its simplicity, fast training, as well as its solid and robust performance across a wide variety of semantic tasks (Baroni et al., 2014; Levy and Goldberg, 2014b).

Research interest has recently extended to bilingual word embeddings (BWEs). BWE learning models focus on the induction of a shared bilingual word embedding space (SBWES) where words from both languages are represented in a uniform language-independent manner such that similar words (regardless of the actual language) have similar representations (see Fig. 1). A variety of BWE learning models have been proposed, differing in the essential requirement of a bilingual signal necessary to construct such a SBWES (discussed later in Sect. 2). SBWES may be used to support many tasks, e.g., computing cross-lingual/multilingual semantic word similarity (Faruqui and Dyer, 2014), learning bilingual word lexicons (Mikolov et al., 2013a; Gouws et al., 2015), cross-lingual entity linking (Tsai and Roth, 2016), parsing (Guo et al., 2015; Johannsen et al., 2015), machine translation (Zou et al., 2013), or cross-lingual information retrieval (Vulić and Moens, 2015; Mitra et al., 2016). BWE models should have two desirable properties: (P1) leverage (large) monolingual training sets tied together through a bilingual signal, (P2) use as inexpensive a bilingual signal as possible in order to learn a SBWES in a scalable and widely applicable manner across languages and domains.

Figure 1: A toy example of a 3-dimensional monolingual vs shared bilingual word embedding space (further SBWES) from Gouws et al. (2015). Panels: Monolingual vs Bilingual.
While we provide a classification of related work, that is, different BWE models according to these properties in Sect. 2.1, the focus of this work is on a popular class of models labeled Post-Hoc Mapping with Seed Lexicons. These models operate as follows (Mikolov et al., 2013a; Ammar et al., 2016): (1) two separate non-aligned monolingual embedding spaces are induced using any monolingual WE learning model (SGNS is the typical choice), (2) given a seed lexicon of word translation pairs as the bilingual signal for training, a mapping function is learned which ties the two monolingual spaces together into a SBWES.
All existing work on this class of models assumes that high-quality training seed lexicons are readily available. In reality, little is understood regarding what constitutes a high-quality seed lexicon, even with "traditional" distributional models (Gaussier et al., 2004; Holmlund et al., 2005; Vulić and Moens, 2013). Therefore, in this work we ask whether BWE learning could be improved by making more intelligent choices when deciding over seed lexicon entries. In order to do this we delve deeper into the cross-lingual mapping problem by analyzing a spectrum of seed lexicons with respect to controllable parameters such as lexicon source, its size, translation method, and translation pair reliability.
The contributions of this paper are as follows: (C1) We present a systematic study on the importance of seed lexicons for learning mapping functions between monolingual WE spaces. (C2) Given the insights gained, we propose a simple yet effective hybrid BWE model HYBWE that removes the need for readily available seed lexicons, and satisfies properties P1 and P2. HYBWE relies on an inexpensive seed lexicon of highly reliable word translation pairs obtained by a document-level BWE model from document-aligned comparable data.
(C3) Using a careful pair selection process when constructing a seed lexicon, we show that in the BLL task HYBWE outperforms the BWE model of Mikolov et al. (2013a), which relies on readily available seed lexicons. HYBWE also outperforms the state-of-the-art models of Hermann and Blunsom (2014b) and Gouws et al. (2015), which require sentence-aligned parallel data.

Learning SBWES using Seed Lexicons
Given source and target language vocabularies $V^S$ and $V^T$, all BWE models learn a representation of each word $w \in V^S \cup V^T$ in a SBWES as a real-valued vector: $\mathbf{w} = [f_1, \ldots, f_d]$, where $f_k \in \mathbb{R}$ denotes the value for the $k$-th cross-lingual feature for $w$ within a $d$-dimensional SBWES. Semantic similarity $sim(w, v)$ between two words $w, v \in V^S \cup V^T$ is then computed by applying a similarity function (SF), e.g. cosine (cos), on their representations in the SBWES: $sim(w, v) = SF(\mathbf{w}, \mathbf{v}) = \cos(\mathbf{w}, \mathbf{v})$.

Related
(Type 1) Parallel-Only: This group of BWE models relies on sentence-aligned and/or word-aligned parallel data as the only data source (Zou et al., 2013; Hermann and Blunsom, 2014a; Kočiský et al., 2014; Hermann and Blunsom, 2014b; Chandar et al., 2014). In addition to an expensive bilingual signal (colliding with P2), these models do not leverage larger monolingual datasets for training (not satisfying P1).
(Type 2) Joint Bilingual Training: These models jointly optimize two monolingual objectives, with the cross-lingual objective acting as a cross-lingual regularizer during training (Klementiev et al., 2012; Gouws et al., 2015; Soyer et al., 2015; Shi et al., 2015; Coulmance et al., 2015). The idea may be summarized by the simplified formulation (Luong et al., 2015): $\gamma(\mathrm{Mono}_S + \mathrm{Mono}_T) + \delta\,\mathrm{Bi}$. The monolingual objectives $\mathrm{Mono}_S$ and $\mathrm{Mono}_T$ ensure that similar words in each language are assigned similar embeddings and aim to capture the semantic structure of each language, whereas the cross-lingual objective $\mathrm{Bi}$ ensures that similar words across languages are assigned similar embeddings. It ties the two monolingual spaces together into a SBWES (thus satisfying P1). Parameters $\gamma$ and $\delta$ govern the influence of the monolingual and bilingual components. The main disadvantage of Type 2 models is the costly parallel data needed for the bilingual signal (thus colliding with P2).
(Type 3) Pseudo-Bilingual Training: This set of models requires document alignments as bilingual signal to induce a SBWES. Vulić and Moens (2016) create a collection of pseudo-bilingual documents by merging every pair of aligned documents in training data, in a way that preserves important local information: words that appeared next to other words within the same language and those that appeared in the same region of the document across different languages. This collection is then used to train word embeddings with monolingual SGNS from word2vec.
With pseudo-bilingual documents, the "context" of a word is redefined as a mixture of neighbouring words (in the original language) and words that appeared in the same region of the document (in the "foreign" language). The bilingual contexts for each word in each document steer the final model towards constructing a SBWES. The advantage over other BWE model types lies in exploiting weaker document-level bilingual signals (satisfying P2), but these models are unable to exploit monolingual corpora during training (unlike Type 2 or Type 4; thus colliding with P1).
(Type 4) Post-Hoc Mapping with Seed Lexicons: These models learn post-hoc mapping functions between monolingual WE spaces induced separately for two different languages (e.g., by SGNS). All Type 4 models (Mikolov et al., 2013a; Faruqui and Dyer, 2014) rely on readily available seed lexicons of highly frequent words obtained by, e.g., Google Translate (GT) to learn the mapping (again colliding with P2), but they are able to satisfy P1.

Post-Hoc Mapping with Seed Lexicons: Methodology and Lexicons
Key Intuition One may infer that a type-hybrid procedure which would retain only highly reliable translation pairs obtained by a Type 3 model as a seed lexicon for Type 4 models effectively satisfies both requirements: (P1) unlike Type 1 and Type 3, it can learn from monolingual data and tie two monolingual spaces using the highly reliable translation pairs, (P2) unlike Type 1 and Type 2, it does not require parallel data; unlike Type 4, it does not require external lexicons and translation systems. The only bilingual signal required are document alignments. Therefore, our focus is on novel less expensive Type 4 models.
Overview The standard learning setup we use is as follows: First, two monolingual embedding spaces, $\mathbb{R}^{d_S}$ and $\mathbb{R}^{d_T}$, are induced separately in each of the two languages using a standard monolingual WE model such as CBOW or SGNS. $d_S$ and $d_T$ denote the dimensionality of the monolingual WE spaces. The bilingual signal is a seed lexicon, i.e., a list of word translation pairs $(x_i, y_i)$, where $x_i \in V^S$ and $y_i \in V^T$.

Learning Objectives Training is cast as a multivariate regression problem: it implies learning a function that maps the source language vectors from the training data to their corresponding target language vectors. A standard approach (Mikolov et al., 2013a) is to assume a linear map $\mathbf{W} \in \mathbb{R}^{d_S \times d_T}$, where an $L_2$-regularized least-squares error objective (i.e., ridge regression) is used to learn the map $\mathbf{W}$. The map is learned by solving the following optimization problem (typically by stochastic gradient descent (SGD)):

$$\min_{\mathbf{W} \in \mathbb{R}^{d_S \times d_T}} \|\mathbf{X}\mathbf{W} - \mathbf{Y}\|_F^2 + \lambda \|\mathbf{W}\|_F^2$$

$\mathbf{X}$ and $\mathbf{Y}$ are matrices obtained through the respective concatenation of source language and target language vectors from training pairs (one vector per row), and $\lambda$ is the regularization strength. Once the linear map $\mathbf{W}$ is estimated, any previously unseen source language word vector $\mathbf{x}_u$ may be straightforwardly mapped into the target language embedding space $\mathbb{R}^{d_T}$ as $\mathbf{W}^\top \mathbf{x}_u$. After mapping all vectors $\mathbf{x}$, $x \in V^S$, the target embedding space $\mathbb{R}^{d_T}$ in fact serves as the SBWES.
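As an illustration of the learning objective, the following is a minimal sketch of ridge regression between two embedding spaces. It makes several assumptions that are not part of the setup described above: it uses the closed-form solution instead of SGD, it stores one vector per row, the function names `learn_linear_map` and `map_to_target` and the regularization parameter `lam` are hypothetical, and the data are synthetic.

```python
import numpy as np

def learn_linear_map(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X^T X + lam*I)^(-1) X^T Y.

    X: (n, d_S) source vectors of the seed pairs, one per row.
    Y: (n, d_T) target vectors of the same pairs, one per row.
    Returns W with shape (d_S, d_T).
    """
    d_s = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d_s), X.T @ Y)

def map_to_target(x_u, W):
    """Map one source vector (row-vector convention) into the target space."""
    return x_u @ W

# Synthetic check: recover a known linear map from noiseless training pairs.
rng = np.random.default_rng(0)
W_true = rng.normal(size=(5, 4))   # d_S = 5, d_T = 4
X = rng.normal(size=(200, 5))      # 200 hypothetical seed pairs
Y = X @ W_true
W = learn_linear_map(X, Y, lam=1e-8)
mapped = map_to_target(X[0], W)
```

With real embeddings the pairs are noisy, so a larger `lam` would normally be tuned on held-out pairs; the near-zero value here only serves to show that the solver recovers the generating map.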

Seed Lexicon Source and Translation Method
Prior work on post-hoc mapping with seed lexicons used a translation system (i.e., GT) to translate highly frequent English words to other languages such as Czech and Spanish (Mikolov et al., 2013a; Gouws et al., 2015) or Italian. This method presupposes the availability and high quality of such an external translation system. To simulate this setup, we take as a starting point the BNC word frequency list from Kilgarriff (1997) containing the 6,318 most frequent English lemmas. The list is then translated to other languages via GT. We call the BNC-based lexicons obtained by employing Google Translate BNC+GT.
In this paper, we propose another option: first, we learn the "first" SBWES (i.e., SBWES-1) using another BWE model (see Sect. 2.1), and then translate the BNC list through SBWES-1 by retaining the nearest cross-lingual neighbor $y_i \in V^T$ for each $x_i$ in the BNC list which is represented in SBWES-1. The pairs $(x_i, y_i)$ constitute the seed lexicon needed for learning the mapping between monolingual spaces, that is, to induce the final SBWES-2.
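The translation step through SBWES-1 can be sketched as a cosine nearest-neighbour lookup. This is only a toy illustration under stated assumptions: the function name `induce_seed_lexicon`, the helper `unit_rows`, and the two-dimensional vectors and word lists are all made up for the example.

```python
import numpy as np

def unit_rows(M):
    """L2-normalize each row so that dot products equal cosine similarities."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def induce_seed_lexicon(src_words, src_vecs, tgt_words, tgt_vecs, seed_list):
    """Pair every seed word represented in SBWES-1 with its nearest target word."""
    index = {w: i for i, w in enumerate(src_words)}
    S, T = unit_rows(src_vecs), unit_rows(tgt_vecs)
    lexicon = []
    for w in seed_list:
        if w not in index:          # word has no representation in SBWES-1
            continue
        sims = T @ S[index[w]]      # cosine similarity to every target word
        lexicon.append((w, tgt_words[int(np.argmax(sims))]))
    return lexicon

# Toy shared space (hypothetical vectors).
src_words = ["dog", "cat", "house"]
src_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt_words = ["perro", "gato", "casa"]
tgt_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])

seed_lexicon = induce_seed_lexicon(
    src_words, src_vecs, tgt_words, tgt_vecs, ["dog", "house", "unicorn"]
)
```

Here `seed_lexicon` contains the pairs for "dog" and "house", while "unicorn" is skipped because it is not represented in the toy space.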
Although in theory any BWE induction model may be used to induce SBWES-1, we rely on a document-level Type 3 BWE induction model, since it requires only document alignments as (weak) bilingual signal. The resulting hybrid BWE induction model (HYBWE) combines the output of a Type 3 model (SBWES-1) and a Type 4 model (SBWES-2). This seed lexicon and BWE learning variant is called BNC+HYB.
Our new hybrid model allows us to also use source language words occurring in SBWES-1 sorted by frequency as seed lexicon source, again leaning on the intuition that higher frequency phenomena are more reliably translated using statistical models. Their translations can also be found through SBWES-1 to obtain seed lexicon pairs $(x_i, y_i)$. This variant is called HFQ+HYB.
Another possibility, recently introduced by Kiros et al. (2015) for vocabulary expansion in monolingual settings, relies on all words shared between two vocabularies to learn the mapping. In this work, we test the ability and limits of such orthographic evidence in cross-lingual settings: seed lexicon pairs are $(x_i, x_i)$, where $x_i \in V^S$ and $x_i \in V^T$. This seed lexicon variant is called ORTHO.
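A lexicon of identically spelled words is easy to build; the sketch below shows one possible way. The function name `ortho_seed_lexicon`, the `max_pairs` cut-off, and the assumption that the source vocabulary is already sorted by frequency are illustrative choices, and the two tiny vocabularies are invented.

```python
def ortho_seed_lexicon(src_vocab, tgt_vocab, max_pairs=None):
    """Return (x, x) pairs for every word form present in both vocabularies.

    src_vocab is assumed to be sorted by decreasing frequency, so truncating
    with max_pairs keeps the most frequent shared words.
    """
    tgt = set(tgt_vocab)
    shared = [w for w in src_vocab if w in tgt]
    if max_pairs is not None:
        shared = shared[:max_pairs]
    return [(w, w) for w in shared]

# Hypothetical Dutch and English vocabularies.
nl_vocab = ["de", "hotel", "taxi", "huis"]
en_vocab = ["the", "hotel", "taxi", "house"]
ortho_pairs = ortho_seed_lexicon(nl_vocab, en_vocab)
```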
Seed Lexicon Size While all prior work reported results with restricted seed lexicon sizes only (i.e., 1K, 2K and 5K lexicon pairs are used as standard), in this work we provide a full-fledged analysis of the influence of seed lexicon size on the SBWES performance in cross-lingual tasks. More extreme settings are also investigated, in an attempt to answer two important questions: (1) Can a Type 4 SBWES be induced in a limited setting with only a few hundred lexicon pairs available (e.g., 100-500)? (2) Can the Type 4 models profit from the inclusion of more seed lexicon pairs (e.g., more than 5K, even up to 40K-50K lexicon pairs)?
Translation Pair Reliability When building seed lexicons through SBWES-1 (i.e., BNC+HYB and HFQ+HYB methods), it is possible to control for the reliability of translation pairs to be included in the final lexicon, with the idea that the use of only highly reliable pairs can potentially lead to an improved SBWES-2. A simple yet effective reliability feature for translation pairs is the symmetry constraint (Peirsman and Padó, 2010; Vulić and Moens, 2013): two words $x_i \in V^S$ and $y_i \in V^T$ are used as a seed lexicon pair only if they are mutual nearest neighbours given their representations in SBWES-1. The two variants of seed lexicons with only symmetric pairs are BNC+HYB+SYM and HFQ+HYB+SYM. We also test the variants without the symmetry constraint (i.e., BNC+HYB+ASYM and HFQ+HYB+ASYM).
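The symmetry constraint amounts to keeping only mutual nearest neighbours. A minimal sketch follows, assuming cosine similarity over unit-normalized rows; the function name `mutual_nearest_pairs` and the toy vectors are hypothetical.

```python
import numpy as np

def mutual_nearest_pairs(S, T):
    """Return index pairs (i, j) where source i and target j are mutual nearest neighbours.

    S: (n_s, d) and T: (n_t, d) hold unit-normalized vectors from the same shared space.
    """
    sim = S @ T.T                   # cosine similarity matrix
    best_t = sim.argmax(axis=1)     # nearest target for every source word
    best_s = sim.argmax(axis=0)     # nearest source for every target word
    return [(i, int(j)) for i, j in enumerate(best_t) if best_s[j] == i]

def unit_rows(M):
    """L2-normalize each row."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

# Sources 0 and 1 both prefer target 0, but target 0 prefers source 0,
# so the asymmetric candidate pair (1, 0) is filtered out.
S = unit_rows(np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
T = unit_rows(np.array([[1.0, 0.05], [0.0, 1.0]]))
sym_pairs = mutual_nearest_pairs(S, T)
```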
Even more conservative reliability measures may be applied by exploiting the scores in the lists of translation candidates ranked by their similarity to the cue word $x_i$. We investigate a symmetry constraint with a threshold: two words $x_i \in V^S$ and $y_i \in V^T$ are included as seed lexicon pair $(x_i, y_i)$ iff they are mutual nearest neighbours in SBWES-1 and it holds:

$$sim(x_i, y_i) - sim(x_i, z_i) > THR \quad \text{and} \quad sim(x_i, y_i) - sim(y_i, w_i) > THR$$

where $z_i \in V^T$ is the second best translation candidate for $x_i$, and $w_i \in V^S$ for $y_i$. THR is a parameter which specifies the margin between the two best translation candidates. The intuition is that highly unambiguous and monosemous translation pairs (which is reflected in higher score margins) are also highly reliable. 3
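The thresholded variant can be sketched by adding a margin test on top of the mutual nearest neighbour check. The sketch assumes that the margin condition is a strict inequality between the best and second best similarity score in each direction; the function name `reliable_pairs` and the similarity matrix are made up for illustration.

```python
import numpy as np

def reliable_pairs(sim, thr=0.0):
    """Keep mutual nearest neighbours whose best-vs-second-best margin exceeds thr.

    sim: (n_s, n_t) similarity matrix with at least two rows and two columns.
    """
    best_t = sim.argmax(axis=1)
    best_s = sim.argmax(axis=0)
    top2_rows = np.sort(sim, axis=1)[:, -2:]   # per source: [second best, best]
    top2_cols = np.sort(sim, axis=0)[-2:, :]   # per target: rows [second best; best]
    pairs = []
    for i, j in enumerate(best_t):
        if best_s[j] != i:                     # symmetry constraint
            continue
        margin_src = top2_rows[i, 1] - top2_rows[i, 0]
        margin_tgt = top2_cols[1, j] - top2_cols[0, j]
        if margin_src > thr and margin_tgt > thr:
            pairs.append((i, int(j)))
    return pairs

# Hypothetical similarities: pair (1, 1) is symmetric but ambiguous,
# since its margins (0.02 and 0.05) are small.
sim = np.array([[0.90, 0.20, 0.10],
                [0.30, 0.80, 0.78],
                [0.10, 0.75, 0.60]])
no_threshold = reliable_pairs(sim, thr=0.0)
with_threshold = reliable_pairs(sim, thr=0.025)
```

Raising `thr` trades lexicon size for reliability: the ambiguous pair survives without a threshold but is dropped at 0.025.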

Experimental Setup
Task: Bilingual Lexicon Learning (BLL) After the final SBWES is induced, given a list of $n$ source language words $x_{u_1}, \ldots, x_{u_n}$, the task is to find a target language word $t$ for each $x_u$ in the list using the SBWES. $t$ is the target language word closest to the source language word $x_u$ in the induced SBWES, also known as the cross-lingual nearest neighbor. The set of $n$ learned $(x_u, t)$ pairs is then run against a gold standard BLL test set. Following the standard practice (Mikolov et al., 2013a), for all Type 4 models, all pairs containing any of the test words $x_{u_1}, \ldots, x_{u_n}$ are removed from training seed lexicons.

Test Sets For each language pair, we evaluate on standard 1,000 ground truth one-to-one translation pairs built for three language pairs: Spanish (ES)-, Dutch (NL)-, Italian (IT)-English (EN) by Vulić and Moens (2013). The dataset is generally considered a benchmarking test set for BLL models that learn from non-parallel data, and is available online. 4 We have also experimented with two other benchmarking BLL test sets (Bergsma and Durme, 2011; Leviant and Reichart, 2015), observing a very similar relative performance of all the models in our comparison.
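The BLL procedure can be sketched in three small steps: filter test words out of the seed lexicon, retrieve the cross-lingual nearest neighbour for each test word, and score the result against the gold standard as the fraction of exact matches. All function names, words and vectors below are hypothetical and serve only as an illustration.

```python
import numpy as np

def remove_test_words(seed_lexicon, test_words):
    """Drop every seed pair whose source word is also a test word."""
    test = set(test_words)
    return [(x, y) for x, y in seed_lexicon if x not in test]

def bll_translate(x_vecs, tgt_vecs, tgt_words):
    """Return the cosine nearest target word for each (already mapped) source vector."""
    X = x_vecs / np.linalg.norm(x_vecs, axis=1, keepdims=True)
    T = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return [tgt_words[int(j)] for j in (X @ T.T).argmax(axis=1)]

def top1_accuracy(predicted, gold):
    """Fraction of test words whose best candidate equals the gold translation."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Toy target space and three mapped source vectors.
tgt_words = ["dog", "cat", "house"]
tgt_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_vecs = np.array([[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]])
gold = ["dog", "cat", "house"]

predicted = bll_translate(x_vecs, tgt_vecs, tgt_words)
accuracy = top1_accuracy(predicted, gold)
filtered = remove_test_words([("perro", "dog"), ("gato", "cat")], ["gato"])
```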

Evaluation Metrics
We measure the BLL performance using the standard Top 1 accuracy ($Acc_1$) metric (Gaussier et al., 2004; Mikolov et al., 2013a; Gouws et al., 2015). 5

Baseline Models To induce SBWES-1, we resort to document-level embeddings of Vulić and Moens (2016) (Type 3). We also compare to results obtained directly by their model (BWESG) to measure the performance gains with HYBWE.
To compare with a representative Type 2 model, we opt for the BilBOWA model of Gouws et al. (2015) due to its solid performance and robustness in the BLL task when trained on general-domain corpora such as Wikipedia (Luong et al., 2015), its reduced complexity reflected in fast computations on massive datasets, as well as its public availability. In short, BilBOWA combines the adapted SGNS for monolingual objectives together with a cross-lingual objective that minimizes the $L_2$-loss between the bag-of-words vectors of parallel sentences. BilBOWA uses the same training setup as HYBWE (monolingual datasets plus a bilingual signal), but relies on a stronger bilingual signal (sentence alignments as opposed to HYBWE's document alignments).

3 ... literature (Smith and Eisner, 2007; Tu and Honavar, 2012; Vulić and Moens, 2013), but we do not observe any significant gains when resorting to the more complex reliability estimates.
4 http://people.cs.kuleuven.be/~ivan.vulic/
5 Similar trends are observed within a more lenient setting with $Acc_5$ and $Acc_{10}$ scores, but we omit these results for clarity and the fact that the actual BLL performance is best reflected in $Acc_1$ scores (i.e., best translation only).
We also compare with a benchmarking Type 1 model from sentence-aligned parallel data called BiCVM (Hermann and Blunsom, 2014b). Finally, a SGNS-based BWE model with the BNC+GT seed lexicon is taken as a baseline Type 4 model (Mikolov et al., 2013a).

Training Data and Setup We use standard training data and suggested settings to obtain BWEs for all models involved in comparison. We retain the 100K most frequent words in each language for all models. To induce monolingual WE spaces, two monolingual SGNS models were trained on the cleaned and tokenized Wikipedias from the Polyglot website (Al-Rfou et al., 2013) using SGD with a global learning rate of 0.025. For BilBOWA, as in the original work (Gouws et al., 2015), the bilingual signal for the cross-lingual regularization is provided by the first 500K sentences from Europarl.v7 (Tiedemann, 2012). We use SGD with a global learning rate of 0.15. The window size is varied from 2 to 16 in steps of 2, and the best scoring model is always reported in all comparisons.
BWESG was trained on the cleaned and tokenized document-aligned Wikipedias available online, using SGD on pseudo-bilingual documents with a global learning rate of 0.025. For BiCVM, we use the tool released by its authors and train on the whole Europarl.v7 for each language pair: we train an additive model, with hinge loss margin set to $d$ (i.e., dimensionality) as in the original paper, batch size of 50, and noise parameter of 10. All BiCVM models are trained with 200 iterations.
For all models, we obtain BWEs with $d = 40, 64, 300, 500$, but we report only results with 300-dimensional BWEs as similar trends were observed with other values of $d$. Other parameters are: 15 epochs, 15 negatives, subsampling rate 1e-4.

Results and Discussion
Exp. I: Standard BLL Setting First, we replicate the previous BLL setups with Type 4 models from Mikolov et al. (2013a) by relying on seed lexicons of exactly 5K word pairs (except for BNC+HYB+SYM which exhausts all possible pairs before the 5K limit) sorted by frequency of the source language word. Results with different lexicons for the three language pairs are summarized in Table 2, while Table 1 shows examples of nearest neighbour words for a Spanish word not present in any of the training lexicons. Table 1 provides evidence for our first insight: Type 4 models do not necessarily require external lexicons (such as the BNC+GT model) to learn a semantically plausible SBWES (i.e., the lists of nearest neighbours are similar for all lexicons excluding ORTHO). Table 1 also suggests that the choice of seed lexicon pairs may strongly influence the properties of the resulting SBWES. Due to its design, ORTHO finds a mapping which naturally brings foreign words appearing in the English vocabulary closer in the induced SBWES.
This first batch of quantitative results already shows that Type 4 models with inexpensive automatically induced lexicons (i.e., HYBWE) are on a par with or even better than Type 4 models relying on external resources or translation systems. In addition, the best reported scores using the more constrained symmetric BNC/HFQ+HYB+SYM lexicon variants are higher than those for three baseline models (of Type 1, Type 2, and Type 3) that previously held the highest scores on the BLL test sets. These improvements over the baseline models and BNC+GT are statistically significant (using McNemar's statistical significance test, p < 0.05). Table 2 also suggests that a careful selection of reliable pairs can lead to peak performances even with a lower number of pairs, i.e., see the results of BNC+HYB+SYM.
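For readers unfamiliar with McNemar's test, the sketch below compares two models on the same test words using only the items on which they disagree. It assumes the exact (binomial) two-sided variant, which may differ from the variant used for the reported results, and the function name `mcnemar_exact` and the correctness vectors are invented.

```python
from math import comb

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired correct/incorrect decisions."""
    b = sum(1 for a, c in zip(correct_a, correct_b) if a and not c)  # only A correct
    c = sum(1 for a, c in zip(correct_a, correct_b) if c and not a)  # only B correct
    n = b + c
    if n == 0:
        return 1.0                      # the models never disagree
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical outcome on 20 test words: model A alone is right on 10 items,
# model B alone on 1 item, and both are right on the remaining 9.
correct_a = [True] * 10 + [False] * 1 + [True] * 9
correct_b = [False] * 10 + [True] * 1 + [True] * 9
p_value = mcnemar_exact(correct_a, correct_b)
```

In this toy case the p-value is 24/2048, roughly 0.012, so the difference would count as significant at the 0.05 level.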
Exp. II: Lexicon Size BLL results for ES-EN and NL-EN obtained by varying the seed lexicon sizes are displayed in Fig. 2(a) and 2(b). Results for IT-EN closely follow the patterns observed with ES-EN. BNC+HYB+SYM and HFQ+HYB+SYM, the two models that do not blindly use all potential training pairs but rely on sets of symmetric pairs (i.e., they include the simple measure of translation pair reliability), display the best performance across all lexicon sizes. The finding confirms the intuition that a more intelligent pair selection strategy is essential for Type 4 BWE models. HFQ+HYB+SYM, a simple hybrid BWE model (HYBWE) combining a document-level Type 3 model with a Type 4 model and translation reliability detection, is the strongest BWE model overall (see also Table 2). HYBWE-based models which do not perform any pair selection (i.e., BNC/HFQ+HYB+ASYM) closely follow the behaviour of the GT-based model. This demonstrates that an external lexicon or translation system may be safely replaced by a document-level embedding model without any significant performance loss in the BLL task. The ORTHO-based model falls short of its competitors. However, we observe that even this model, whose learning setting relies on the cheapest bilingual signal, may lead to reasonable BLL scores, especially for the more closely related NL-EN pair.
The two models with the symmetry constraint display a particularly strong performance with settings relying on scarce resources (i.e., only a small portion of training pairs is available). For instance, HFQ+HYB+SYM scores 0.129 for ES-EN with only 200 training pairs (vs 0.002 with BNC+GT), and 0.529 with 500 pairs (vs 0.145 with BNC+GT). On the other hand, adding more pairs does not lead to an improved BLL performance. In fact, we observe a slow and steady decrease in performance with lexicons containing 10,000 and more training pairs for all HYBWE variants. The phenomenon may be attributed to the fact that highly frequent words receive more accurate representations in SBWES-1, and adding less frequent and, consequently, less accurate training pairs to the SBWES-2 learning process brings in additional noise. In plain language, when it comes to seed lexicons, Type 4 models prefer quality over quantity.
Exp. III: Translation Pair Reliability In the next experiment, we vary the threshold value THR (see Sect. 2.2) in the HFQ+HYB+SYM variant with the following values in comparison: 0.0 (None), 0.01, 0.025, 0.05, 0.075, 0.1. We investigate whether retaining only highly unambiguous pairs would lead to even better BLL performance. The results for all three language pairs are summarized in Fig. 3(a)-3(c). The results for all variant models again decrease when employing larger lexicons (due to the usage of less frequent word pairs in training). We observe that a slightly stricter selection criterion (i.e., THR = 0.01, 0.025) also leads to slightly improved peak BLL scores for ES-EN and IT-EN around the 5K region. The improvements, however, are not statistically significant. On the other hand, a too conservative pair selection criterion with higher threshold values significantly deteriorates the overall performance of HYBWE with HFQ+HYB+SYM. The conservative criteria discard plenty of potentially useful training pairs. Therefore, as one line of future research, we plan to investigate more sophisticated models for the selection of reliable seed lexicon pairs that will lead to a better trade-off between the lexicon size and reliability of the pairs.
Exp. IV: Another Task: Suggesting Word Translations in Context (SWTC) In the final experiment, we test whether the findings originating from the BLL task generalize to another cross-lingual semantic task: suggesting word translations in context (SWTC) recently proposed by Vulić and Moens (2014). Given an occurrence of a polysemous word $w \in V^S$, the SWTC task is to choose the correct translation in the target language of that particular occurrence of $w$ from the given set $TC(w) = \{t_1, \ldots, t_{tq}\}$, $TC(w) \subseteq V^T$, of its $tq$ possible translations/meanings. Whereas in the BLL task the candidate search is performed over the entire vocabulary $V^T$, the set $TC(w)$ typically comprises only a few pre-selected words/senses. One may refer to $TC(w)$ as an inventory of translation candidates for $w$. The best scoring translation candidate in the ranked list is then the correct translation for that particular occurrence of $w$ observing its local context $Con(w)$. SWTC is an extended cross-lingual variant of the task proposed by Huang et al. (2012) which evaluates monolingual context-sensitive semantic similarity of words in sentential context, and it is also closely related to cross-lingual lexical substitution (Mihalcea et al., 2010).

Table 3: $Acc_1$ scores in the SWTC task. All seed lexicons contain 6K translation pairs, except for BNC+HYB+SYM (its sizes provided in parentheses). * denotes a statistically significant improvement over baselines and BNC+GT using McNemar's statistical significance test with the Bonferroni correction, p < 0.05.
To isolate the performance of each BWE induction model from the details of the SWTC setup, we use the same approach with all models: we opt for the SWTC framework proven to yield excellent results with BWEs in the SWTC task. In short, the context bag $Con(w) = \{cw_1, \ldots, cw_r\}$ is obtained by harvesting all $r$ words that occur with $w$ in the sentence.
The vector representation of $Con(w)$ is the $d$-dimensional embedding computed by aggregating over all word embeddings for each $cw_j \in Con(w)$ using standard addition as the compositional operator (Mitchell and Lapata, 2008), which was proven a robust choice (Milajevs et al., 2014):

$$\mathbf{Con}(w) = \mathbf{cw}_1 + \mathbf{cw}_2 + \ldots + \mathbf{cw}_r$$

where $\mathbf{cw}_j$ is the embedding of the $j$-th context word, and $\mathbf{Con}(w)$ is the resulting embedding of the context bag $Con(w)$. Finally, for each $t_j \in TC(w)$, the context-sensitive similarity with $w$ is computed as: $sim(w, t_j, Con(w)) = \cos(\mathbf{Con}(w), \mathbf{t}_j)$, where $\mathbf{Con}(w)$ and $\mathbf{t}_j$ are representations of the (sentential) context bag and the candidate translation $t_j$ in the same SBWES.

The evaluation set consists of 360 sentences for 15 polysemous nouns (24 sentences for each noun) in each of the three languages: Spanish, Dutch, Italian, along with the gold standard single-word English translation given the sentential context.

Table 3 summarizes the results ($Acc_1$ scores) in the SWTC task. NO-CONTEXT refers to the context-insensitive majority baseline obtained by BNC+GT (i.e., it always chooses the most semantically similar translation candidate at the word type level). We also report the results of the best SWTC model from Vulić and Moens (2014).
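The context-sensitive scoring can be sketched as follows: the context bag is embedded by summing its word vectors, and every translation candidate is ranked by its cosine similarity to that sum. The function names `cosine` and `best_translation`, the candidate words and the two-dimensional vectors are hypothetical.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def best_translation(context_vecs, candidates):
    """Rank translation candidates against the additive context embedding.

    context_vecs: (r, d) embeddings of the r context words in the SBWES.
    candidates: dict mapping each candidate translation to its SBWES vector.
    Returns the best candidate and the score of every candidate.
    """
    con = context_vecs.sum(axis=0)      # additive composition of the context bag
    scores = {t: cosine(con, vec) for t, vec in candidates.items()}
    return max(scores, key=scores.get), scores

# Toy example: the context vectors point towards the first candidate.
context_vecs = np.array([[1.0, 0.2], [0.8, 0.1]])
candidates = {"money": np.array([1.0, 0.0]), "shore": np.array([0.0, 1.0])}
best, scores = best_translation(context_vecs, candidates)
```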
The results largely support the claims established with the BLL evaluation. An external seed lexicon of BNC+GT may be safely replaced by an automatically induced inexpensive seed lexicon (as in HYBWE with BNC+HYB+SYM/ASYM). The best performing models are again BNC+HYB+SYM and HFQ+HYB+SYM. The comparison of ASYM and SYM lexicon variants further suggests that filtering translation pairs using the symmetry constraint again leads to consistent improvements, but stricter selection criteria with higher thresholds do not lead to significant performance boosts, and may even hurt the performance (see the results for NL-EN). Various HYBWE variants significantly improve over baseline BWE models (Types 1-4), also outperforming previous best SWTC results.

Conclusions and Future Work
We presented a detailed analysis of the importance and properties of seed bilingual lexicons in learning bilingual word embeddings (BWEs), which are valuable for many cross-lingual/multilingual NLP tasks. On the basis of the analysis, we proposed a simple yet effective hybrid bilingual word embedding model called HYBWE. It learns the mapping between two monolingual embedding spaces using only highly reliable symmetric translation pairs from an inexpensive seed document-level embedding space. The results in the tasks of (1) bilingual lexicon learning and (2) suggesting word translations in context demonstrate that, due to its careful selection of reliable translation pairs for seed lexicons, HYBWE outperforms benchmarking BWE induction models, all of which use more expensive bilingual signals for training.
In future work, we plan to investigate other methods for seed pair selection, settings with scarce resources (Agić et al., 2015; Zhang et al., 2016), other context types inspired by recent work in monolingual settings (Levy and Goldberg, 2014a; Melamud et al., 2016), as well as model adaptations that can work with multi-word expressions. Encouraged by the excellent results, we also plan to test the portability of the approach to more language pairs, and other tasks and applications.