Learning Bilingual Word Embeddings Using Lexical Definitions

Bilingual word embeddings, which represent the lexicons of different languages in a shared embedding space, are essential for supporting semantic and knowledge transfer in a variety of cross-lingual NLP tasks. Existing approaches to training bilingual word embeddings require either large collections of predefined seed lexicons, which are expensive to obtain, or parallel sentences, which provide only coarse and noisy alignment. In contrast, we propose BilLex, which leverages publicly available lexical definitions for bilingual word embedding learning. Without requiring predefined seed lexicons, BilLex comprises a novel word pairing strategy to automatically identify and propagate precise, fine-grained word alignment from lexical definitions. We evaluate BilLex on word-level and sentence-level translation tasks, which seek to find the cross-lingual counterparts of words and sentences respectively. BilLex significantly outperforms previous embedding methods on both tasks.


Introduction
Bilingual word embeddings are essential components of multilingual NLP systems. These embeddings capture cross-lingual semantic transfer of words and phrases from bilingual corpora, and are widely deployed in many NLP tasks, such as machine translation (Conneau et al., 2018), cross-lingual Wikification (Tsai and Roth, 2016), knowledge alignment (Chen et al., 2018) and semantic search (Vulić and Moens, 2015).
A variety of approaches have been proposed to learn bilingual word embeddings (Duong et al., 2017; Luong et al., 2015; Coulmance et al., 2015). Many such approaches rely on aligned corpora. Such corpora can be seed lexicons that provide word-level mappings between two languages (Mikolov et al., 2013a; Xing et al., 2015), or parallel corpora that align sentences and documents (Klementiev et al., 2012; Gouws et al., 2015). However, these methods suffer from several deficiencies. First, seed-lexicon-based approaches are often hindered by the limited size of the seed lexicons (Vulić and Korhonen, 2016), an intractable barrier since high-quality seed lexicons require extensive human effort to obtain (Zhang et al., 2017). Second, parallel corpora provide coarse alignment that often fails to accurately capture fine-grained semantic transfer between lexicons (Ruder et al., 2017).
Unlike existing methods, we propose to use publicly available dictionaries for bilingual word embedding learning. Dictionaries, such as Wiktionary and Merriam-Webster, contain large collections of lexical definitions, which are clean linguistic knowledge that naturally connects word semantics within and across human languages. Hence, dictionaries provide valuable information to bridge the lexicons of different languages. However, cross-lingual learning from lexical definitions is a non-trivial task. A straightforward approach that aligns the target word embedding to the aggregated embedding of the words in its definition might work, but not all words in a definition are semantically related to the defined target word (Fig. 1(a)). Therefore, a successful model has to effectively identify the most related lexicons from the multi-granular and asymmetric alignment of lexical definitions. Besides, how to leverage both bilingual and monolingual dictionaries for cross-lingual learning is another challenge.
In this paper, we propose BilLex (Bilingual Word Embeddings Based on Lexical Definitions). BilLex constitutes a carefully designed two-stage mechanism to automatically cultivate, propagate and leverage lexicon pairs of high semantic similarity from lexical definitions in dictionaries. It first extracts bilingual strong word pairs from bilingual lexical definitions, i.e., pairs of words that contribute to the cross-lingual definitions of each other. On top of that, our model automatically derives induced word pairs, which combine monolingual dictionaries with the aforementioned strong pairs to discover further semantically related word pairs. This automated word pair induction process enables BilLex to capture abundant high-quality lexical alignment information, based on which the cross-lingual semantic transfer of words is easily captured in a shared embedding space. Experimental results on word-level and sentence-level translation tasks show that BilLex drastically outperforms various baselines trained on parallel or seed-lexicon corpora, as well as state-of-the-art unsupervised methods.

Related Work
Prior approaches to learning bilingual word embeddings often rely on word or sentence alignment (Ruder et al., 2017). In particular, seed lexicon methods (Mikolov et al., 2013a; Faruqui and Dyer, 2014; Guo et al., 2015) learn transformations across language-specific embedding spaces based on predefined word alignment. The performance of these approaches is limited by the sufficiency of the seed lexicons. Besides, parallel corpora methods (Gouws et al., 2015; Coulmance et al., 2015) leverage aligned sentences in different languages and force the representations of corresponding sentence components to be similar. However, aligned sentences merely provide weak alignment of lexicons that does not accurately capture the one-to-one mapping of words, while such a mapping is well-desired by translation tasks (Upadhyay et al., 2016). In addition, a few unsupervised approaches avoid the use of bilingual resources (Chen and Cardie, 2018; Conneau et al., 2018). These models require considerable effort to train and rely heavily on massive monolingual corpora.
Monolingual lexical definitions have been used for weak supervision of monolingual word similarities (Tissier et al., 2017). Our work demonstrates that dictionary information can be extended to the cross-lingual scenario, for which we develop a simple yet effective induction method to populate fine-grained word alignment.

Modeling
We first provide the formal definition of bilingual dictionaries. Let L be the set of languages and L^2 the set of ordered language pairs. For a language l ∈ L, we use V_l to denote its vocabulary, where each word w ∈ V_l has an embedding vector w ∈ R^k. A dictionary D_{l_i,l_j} contains words in language l_i and their definitions in l_j; D_{l_i,l_j} is a monolingual dictionary if l_i = l_j and a bilingual dictionary if l_i ≠ l_j. A dictionary D_{l_i,l_j} consists of entries (w_i, Q_j(w_i)), where w_i ∈ V_{l_i} is the word being defined and Q_j(w_i) is a sequence of words in l_j describing the meaning of w_i. Fig. 1(a) shows an entry from an English-French dictionary, and one from a French-English dictionary.
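The dictionary formalism above can be mirrored in a small data structure. The following is a minimal illustrative sketch, not part of BilLex itself; the entries are hypothetical toy examples rather than actual Wiktionary content.

```python
from collections import defaultdict

# D[(li, lj)] maps a word w_i of language l_i to Q_j(w_i), the sequence of
# words in l_j that defines it. A key (li, lj) with li == lj is a monolingual
# dictionary; otherwise it is a bilingual one.
D = defaultdict(dict)

# Hypothetical toy entries in the spirit of Fig. 1(a).
D[("en", "fr")]["car"] = ["véhicule", "motorisé", "à", "roues"]
D[("fr", "en")]["véhicule"] = ["a", "conveyance", "such", "as", "a", "car"]
D[("en", "en")]["car"] = ["a", "wheeled", "motor", "vehicle"]

def definition(D, li, lj, word):
    """Return Q_j(word) for a word of l_i, or [] if it is not defined."""
    return D.get((li, lj), {}).get(word, [])

print(definition(D, "en", "fr", "car"))  # the French definition of "car"
```

This layout makes both lookup directions of a language pair available, which is what the word pairing strategy in the next sections relies on.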
BilLex exploits semantically related word pairs in two stages. We first use bilingual dictionaries to construct bilingual strong pairs, which are analogous to the monolingual word pairs of Tissier et al. (2017). Then, based on the strong word pairs and monolingual dictionaries, we provide two types of induced word pairs to further enhance cross-lingual learning.

Bilingual Strong Pairs
A bilingual strong pair contains two words of high semantic relevance, which mutually contribute to the cross-lingual definitions of each other. Such pairs are defined as below.

Definition (Bilingual Strong Pairs). P^S_{l_i,l_j} is the set of bilingual strong pairs in (l_i, l_j) ∈ L^2, where each word pair is defined as:

(w_i, w_j) ∈ P^S_{l_i,l_j} ⇔ w_j ∈ Q_j(w_i) ∧ w_i ∈ Q_i(w_j)

Intuitively, if w_i appears in the cross-lingual definition of w_j and w_j appears in the cross-lingual definition of w_i, then w_i and w_j should be semantically close to each other. Particularly, P^S_{l_i,l_j} denotes monolingual strong pairs if l_i = l_j. For instance, (car, véhicule) depicted in Fig. 1(a) forms a bilingual strong pair. Note that Tissier et al. also introduce monolingual weak pairs by pairing the target word with the other words of its definition that do not form strong pairs with it. However, we do not extend such weak pairs to the bilingual setting, as we find them inaccurate in representing cross-lingual corresponding words.
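The mutual-containment condition can be sketched directly over the two directions of a bilingual dictionary. The dictionary layout (a dict from defined word to its definition word list) and the toy entries below are assumptions of this illustration, not the authors' implementation.

```python
def bilingual_strong_pairs(dict_ij, dict_ji):
    """Extract strong pairs (w_i, w_j): w_j appears in the l_j-definition of
    w_i AND w_i appears in the l_i-definition of w_j (mutual containment).
    dict_ij maps words of l_i to their definitions in l_j; dict_ji is the
    reverse direction."""
    pairs = set()
    for wi, q_j in dict_ij.items():
        for wj in set(q_j):
            if wi in dict_ji.get(wj, []):
                pairs.add((wi, wj))
    return pairs

# Hypothetical toy entries: "car" and "véhicule" define each other.
en_fr = {"car": ["véhicule", "motorisé"]}
fr_en = {"véhicule": ["a", "car"], "motorisé": ["motorized"]}
print(bilingual_strong_pairs(en_fr, fr_en))  # {('car', 'véhicule')}
```

Note that (car, motorisé) is not extracted, since "car" does not appear in the definition of "motorisé"; the condition is deliberately symmetric to filter out weakly related definition words.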

Induced Word Pairs
Since bilingual lexical definitions cover only limited numbers of words in two languages, we incorporate both monolingual and bilingual strong pairs, from which we induce two types of word pairs with different confidence: directly induced pairs and indirectly induced pairs.

Definition (Bilingual Directly Induced Pairs). P^D_{l_i,l_j} is the set of bilingual directly induced pairs in (l_i, l_j) ∈ L^2, where each word pair is defined as:

(w_i, w_j) ∈ P^D_{l_i,l_j} ⇔ ∃ w_p^i : (w_i, w_p^i) ∈ P^S_{l_i,l_i} ∧ (w_p^i, w_j) ∈ P^S_{l_i,l_j}

Intuitively, a bilingual directly induced pair (w_i, w_j) indicates that we can find a pivot word that forms a monolingual strong pair with one word of (w_i, w_j) and a bilingual strong pair with the other.

Definition (Bilingual Indirectly Induced Pairs). P^I_{l_i,l_j} is the set of bilingual indirectly induced pairs in (l_i, l_j) ∈ L^2, where each word pair is defined as:

(w_i, w_j) ∈ P^I_{l_i,l_j} ⇔ ∃ (w_p^i, w_p^j) ∈ P^S_{l_i,l_j} : (w_i, w_p^i) ∈ P^S_{l_i,l_i} ∧ (w_j, w_p^j) ∈ P^S_{l_j,l_j}

A bilingual indirectly induced pair (w_i, w_j) indicates that there exists a pivot bilingual strong pair (w_p^i, w_p^j) such that w_p^i forms a monolingual strong pair with w_i and w_p^j forms a monolingual strong pair with w_j. Fig. 1(b-c) shows examples of the two types of induced word pairs.
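The two induction rules can be sketched as set operations over pivot words. This is an illustrative sketch only; the pair sets are assumed to be given as sets of tuples, and monolingual strong pairs are assumed to be stored in both orientations, since the strong-pair relation is mutual.

```python
def directly_induced_pairs(mono_i, bi_ij):
    """P^D: (w_i, w_j) such that a pivot w_p^i forms a monolingual strong
    pair (w_i, w_p^i) in l_i and a bilingual strong pair (w_p^i, w_j)."""
    bi_by_pivot = {}
    for wp, wj in bi_ij:
        bi_by_pivot.setdefault(wp, set()).add(wj)
    return {(wi, wj)
            for wi, wp in mono_i
            for wj in bi_by_pivot.get(wp, ())}

def indirectly_induced_pairs(mono_i, mono_j, bi_ij):
    """P^I: (w_i, w_j) linked through a pivot bilingual strong pair
    (w_p^i, w_p^j), with (w_i, w_p^i) monolingual in l_i and (w_j, w_p^j)
    monolingual in l_j."""
    by_pivot_i, by_pivot_j = {}, {}
    for w, wp in mono_i:
        by_pivot_i.setdefault(wp, set()).add(w)
    for w, wp in mono_j:
        by_pivot_j.setdefault(wp, set()).add(w)
    return {(wi, wj)
            for wpi, wpj in bi_ij
            for wi in by_pivot_i.get(wpi, ())
            for wj in by_pivot_j.get(wpj, ())}

# Toy pair sets (hypothetical words, mirroring the structure of Fig. 1(b-c)):
bi = {("car", "véhicule")}
mono_en = {("automobile", "car"), ("car", "automobile")}
mono_fr = {("voiture", "véhicule"), ("véhicule", "voiture")}
print(directly_induced_pairs(mono_en, bi))            # {('automobile', 'véhicule')}
print(indirectly_induced_pairs(mono_en, mono_fr, bi)) # {('automobile', 'voiture')}
```

Indexing the pairs by pivot word keeps the induction linear in the number of pairs rather than quadratic, which matters once the extracted pair sets reach the scale reported in Table 1.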

Training
Our model jointly learns three word-pair-based cross-lingual alignment objectives Ω_K to align the embedding spaces of two languages, and two monolingual Skip-Gram losses (Mikolov et al., 2013b) L_{l_i}, L_{l_j} to preserve monolingual word similarities. Given a language pair (l_i, l_j) ∈ L^2, the learning objective of BilLex is to minimize the following joint loss function:

J = L_{l_i} + L_{l_j} + Σ_{K ∈ {P^S, P^D, P^I}} λ_K Ω_K

Each λ_K (K ∈ {P^S, P^D, P^I}) is a hyperparameter that controls how much the corresponding type of word pairs contributes to cross-lingual learning. For the alignment objectives, we use word pairs in both directions of an ordered language pair (l_i, l_j) ∈ L^2 to capture the cross-lingual semantic similarity of words, such that P^S = P^S_{l_i,l_j} ∪ P^S_{l_j,l_i}, P^D = P^D_{l_i,l_j} ∪ P^D_{l_j,l_i} and P^I = P^I_{l_i,l_j} ∪ P^I_{l_j,l_i}. Then for each K ∈ {P^S, P^D, P^I}, the alignment objective Ω_K is defined as below, where σ is the sigmoid function:

Ω_K = Σ_{(w_i, w_j) ∈ K} [ −log σ(w_i^⊤ w_j) − Σ_{(w_n^i, w_j) ∈ N_i(w_j)} log σ(−w_n^{i⊤} w_j) − Σ_{(w_n^j, w_i) ∈ N_j(w_i)} log σ(−w_n^{j⊤} w_i) ]
For each word pair (w_i, w_j), we use the unigram distribution raised to the power of 0.75 to select a number of words in l_i (or l_j) for w_j (or w_i) to form a negative sample set N_i(w_j) (or N_j(w_i)). Without loss of generality, we define the negative sample set as N_i(w_j) = {(w_n^i, w_j) | w_n^i ∼ U_i(w) ∧ (w_n^i, w_j) ∉ P^S ∪ P^D ∪ P^I}, where U_i(w) is the distribution of words in l_i.
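The alignment objective with its negative sampling can be sketched numerically as follows. This is a minimal sketch: the dense-matrix layout, the name `alignment_loss`, and the pre-drawn negative sample dictionaries are illustrative assumptions, not the authors' implementation, which additionally samples negatives on the fly from the unigram^0.75 distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alignment_loss(E_i, E_j, pairs, neg_i, neg_j):
    """Negative-sampling form of Omega_K for one word-pair set K.
    E_i, E_j: embedding matrices of the two languages (one row per word);
    pairs: index pairs (wi, wj) in K; neg_i[(wi, wj)]: indices of negative
    l_i words drawn for wj, and symmetrically for neg_j."""
    loss = 0.0
    for wi, wj in pairs:
        loss -= np.log(sigmoid(E_i[wi] @ E_j[wj]))       # pull the pair together
        for wn in neg_i[(wi, wj)]:                       # push negatives away
            loss -= np.log(sigmoid(-(E_i[wn] @ E_j[wj])))
        for wn in neg_j[(wi, wj)]:
            loss -= np.log(sigmoid(-(E_j[wn] @ E_i[wi])))
    return loss

# Tiny deterministic example: 2 words per language, 2 dimensions.
E_en = np.array([[1.0, 0.0], [0.0, 1.0]])
E_fr = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = alignment_loss(E_en, E_fr, [(0, 0)], {(0, 0): [1]}, {(0, 0): [1]})
print(round(loss, 4))
```

The same loss shape is applied once per pair set K, weighted by the corresponding λ_K, alongside the two monolingual Skip-Gram losses.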

Experiment
We evaluate BilLex on two bilingual tasks: word translation and sentence translation retrieval. Following convention (Gouws et al., 2015; Mogadala and Rettinger, 2016), we evaluate BilLex between English-French and English-Spanish. Accordingly, we extract word pairs from both directions of the bilingual dictionaries in Wiktionary for these language pairs. To support the induced word pairs, we also extract monolingual lexical definitions in the three languages involved, which include 238k entries in English, 107k entries in French and 49k entries in Spanish. The word pair extraction process of BilLex excludes stop words and punctuation in the lexical definitions. The statistics of the three types of extracted word pairs are reported in Table 1.

Word translation
This task aims to retrieve the translation of a source word in the target language. We use the test set provided by Conneau et al. (2018), which selects the most frequent 200k words of each language as candidates for 1.5k query words. We translate a query word by retrieving its k nearest neighbours in the target language, and report P@k (k = 1, 5), the fraction of correct translations ranked within the top k.

Evaluation protocol. The hyperparameters of BilLex are tuned on a small validation set of 1k word pairs provided by Conneau et al. (2018). We initialize 128-dimensional word embeddings with pre-trained BilBOWA (Gouws et al., 2015) and use the standard configuration of Skip-Gram (Mikolov et al., 2013b) on monolingual Wikipedia dumps. We set the negative sampling size of bilingual word pairs to 4, selected from 0 to 10 with a step of 1. λ_{P^S} is set to 0.9, tuned from 0 to 1 with a step of 0.1. As we assume the strong pair relations between words to be independent, we empirically set λ_{P^D} = (λ_{P^S})^2 = 0.81 and λ_{P^I} = (λ_{P^S})^3 = 0.729. We minimize the loss function using AMSGrad (Reddi et al., 2018) with a learning rate of 0.001. Training is terminated based on early stopping. We limit the vocabularies to the 200k most frequent words in each language, and exclude the bilingual strong pairs that appear in the test set. The baselines we compare against include BiCVM (Hermann and Blunsom, 2014), BilBOWA (Gouws et al., 2015), BiSkip (Luong et al., 2015), and supervised and unsupervised MUSE (Conneau et al., 2018).

Results. The results are reported in Table 2, which compares BilLex trained with different combinations of word pairs. BilLex(P^S+P^D+P^I) offers consistently better performance in all settings, which implies that the induced word pairs are effective in improving the cross-lingual learning of lexical semantics. Among the baseline models, unsupervised MUSE outperforms the other four supervised ones. We also observe that, for the word translation task, the
supervised models with coarse alignment, such as BiCVM and BilBOWA, do not perform as well as the models with word-level supervision, such as BiSkip and supervised MUSE. Our best BilLex variant outperforms unsupervised MUSE by 4.4∼5.7% in P@1 between En and Fr, and by 0.3∼1.8% between En and Es. The settings between En and Fr achieve better performance because there are far fewer bilingual definitions between En and Es.
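The P@k evaluation described above amounts to nearest-neighbour retrieval in the shared embedding space. A minimal sketch, assuming dense row-per-word matrices and cosine similarity as the retrieval metric:

```python
import numpy as np

def precision_at_k(src_emb, tgt_emb, test_pairs, k):
    """P@k: fraction of queries whose gold translation is among the k nearest
    target-language neighbours under cosine similarity."""
    # Normalize rows so that a dot product equals cosine similarity.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    hits = 0
    for query, gold in test_pairs:
        sims = t @ s[query]              # similarity of the query to all targets
        topk = np.argsort(-sims)[:k]     # indices of the k nearest neighbours
        hits += int(gold in topk)
    return hits / len(test_pairs)

# Tiny example: the gold target of query word 0 is target word 1.
src = np.array([[0.0, 1.0]])
tgt = np.array([[1.0, 0.0], [0.1, 0.9]])
print(precision_at_k(src, tgt, [(0, 1)], k=1))  # 1.0
```

At the scale of the actual benchmark (200k candidates per language), the same computation is typically done with one batched matrix product rather than a per-query loop.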

Sentence translation retrieval
This task focuses on retrieving, for a source-language sentence, its counterpart in the target language, using tf-idf weighted sentence representations. We follow the experimental setup of Conneau et al. (2018), with 2k source sentence queries and 200k target sentences from the Europarl corpus for English and French. We carry forward the model configurations from the previous experiment, and report P@k (k = 1, 5).
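The tf-idf weighted sentence representation can be sketched as a weighted average of word embeddings. The function name and the lookup-table inputs (`emb` mapping words to vectors, `idf` mapping words to precomputed inverse document frequencies) are assumptions of this illustration.

```python
import numpy as np
from collections import Counter

def tfidf_sentence_embedding(tokens, emb, idf):
    """Represent a tokenized sentence as the tf-idf weighted average of its
    word embeddings; out-of-vocabulary tokens are skipped."""
    counts = Counter(t for t in tokens if t in emb)
    dim = len(next(iter(emb.values())))
    vec, total = np.zeros(dim), 0.0
    for w, tf in counts.items():
        weight = tf * idf.get(w, 0.0)
        vec += weight * np.asarray(emb[w])
        total += weight
    return vec / total if total > 0 else vec

# Toy example with hypothetical embeddings and idf values.
emb = {"hello": [1.0, 0.0], "world": [0.0, 1.0]}
idf = {"hello": 1.0, "world": 3.0}
print(tfidf_sentence_embedding(["hello", "world"], emb, idf))
```

Query and candidate sentences are then compared by the similarity of these sentence vectors, so that retrieval quality directly reflects how well the bilingual word embeddings transfer sentential semantics.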
Results. The results are reported in Table 3. Overall, our best model variant BilLex(P^S+P^D+P^I) performs better than the best baseline, with a noticeable gain of 1.4∼1.7% in P@1 and 1.3∼2.1% in P@5. This demonstrates that BilLex is suitable for transferring sentential semantics.

Conclusion
In this paper, we propose BilLex, a novel bilingual word embedding model based on lexical definitions. BilLex is motivated by the fact that openly available dictionaries offer high-quality linguistic knowledge to connect lexicons across languages. We design a word pair induction method to capture semantically related lexicons in dictionaries, which serve as alignment information in joint training. BilLex outperforms state-of-the-art methods on word and sentence translation tasks.
Figure 1: Examples of three types of word pairs. The blue words in (b-c) are the pivot words of the induced pairs.

Table 1: Statistics of dictionaries and word pair sets.

Table 2: Results of the word translation task.

In Table 2, the performance of BilLex is reported for three variants: (i) training with bilingual strong pairs only, BilLex(P^S); (ii) with directly induced pairs added, BilLex(P^S+P^D); and (iii) with all three types of word pairs, BilLex(P^S+P^D+P^I).

Table 3: Results of sentence translation retrieval.