Dict2vec : Learning Word Embeddings using Lexical Dictionaries

Learning word embeddings on large unlabeled corpora has been shown to improve many natural language tasks. The most efficient and popular approaches learn or retrofit such representations using additional external data. The resulting embeddings are generally better than their corpus-only counterparts, although such resources cover only a fraction of the words in the vocabulary. In this paper, we propose a new approach, Dict2vec, based on one of the largest yet most refined data sources for describing words: natural language dictionaries. Dict2vec builds new word pairs from dictionary entries so that semantically related words are moved closer, and its negative sampling filters out pairs whose words are unrelated in dictionaries. We evaluate the word representations obtained with Dict2vec on eleven datasets for the word similarity task and on four datasets for a text classification task.


Introduction
Learning word embeddings usually relies on the distributional hypothesis: words appearing in similar contexts must have similar meanings, and thus close representations. Finding such representations for words and sentences has been a hot topic in Natural Language Processing (NLP) over the last few years (Mikolov et al., 2013; Pennington et al., 2014) and has led to many improvements in core NLP tasks such as Word Sense Disambiguation (Iacobacci et al., 2016), Machine Translation (Devlin et al., 2014), Machine Comprehension (Hewlett et al., 2016), and Semantic Role Labeling (Zhou and Xu, 2015; Collobert et al., 2011), to name a few.
These methods suffer from a classic drawback of unsupervised learning: the lack of supervision between a word and the words appearing in its contexts. Indeed, some terms of a context are likely unrelated to the considered word. Conversely, the fact that two words do not appear together (or, more likely, not often enough) in any context of the training corpora is no guarantee that these words are not semantically related. Recent approaches have tackled this issue with an attentive model for context selection (Ling et al., 2015), or by using external sources, like knowledge graphs, to improve the embeddings. Similarities derived from such resources are made part of the objective function during the learning phase (Yu and Dredze, 2014; Kiela et al., 2015) or used in a retrofitting scheme (Faruqui et al., 2015). These approaches tend to specialize the embeddings to the resource used and its associated similarity measures, while the construction and maintenance of these resources are complex, time-consuming, and error-prone tasks.
In this paper, we propose a novel word embedding learning strategy, called Dict2vec, that leverages existing online natural language dictionaries. We assume that dictionary entries (the definitions of words) contain latent word similarity and relatedness information that can improve language representations. Such entries provide, in essence, an additional context that conveys general semantic coverage for most words. Dict2vec adds new co-occurrence information based on the terms occurring in the definitions of a word. This information introduces weak supervision that can be used to improve the embeddings. We distinguish word pairs in which each word appears in the definition of the other (strong pairs) from pairs in which only one word appears in the definition of the other (weak pairs), each type having its own weight as a hyperparameter. Not only is this information useful at learning time to move the vectors of such word pairs closer, it also makes it possible to devise a controlled negative sampling. Controlled negative sampling, as introduced in Dict2vec, consists in filtering out the random negative examples of conventional negative sampling that form a (strong or weak) pair with the target word, since they are obviously not negative examples. Processing online dictionaries in Dict2vec does not require a human in the loop; it is fully automated. The neural network architecture of Dict2vec (Section 3) extends the Word2vec approach (Mikolov et al., 2013), which uses a Skip-gram model with negative sampling.
Our main results are as follows:
• Dict2vec exhibits a statistically significant improvement of around 12.5% over state-of-the-art solutions on eleven of the most common evaluation datasets for the word similarity task when embeddings are learned on the full Wikipedia dump.
• This edge is even larger on small training corpora (the first 50 million tokens of Wikipedia) than on the full dataset: the average improvement reaches 30%.
• Since Dict2vec does significantly better than its competitors at small dimensions (in the [20; 100] range) on small corpora, it can yield smaller yet efficient embeddings, even when trained on smaller corpora, which is of utmost practical interest for working natural language processing practitioners.
• We also show that the embeddings learned by Dict2vec perform similarly to other baselines on an extrinsic text classification task.
The Dict2vec software is an extension and optimization of the original Word2vec framework, leading to more efficient learning. The source code to fetch dictionaries, train Dict2vec models and evaluate word embeddings is publicly available 1 and can be used by the community as a seed for future work.
The paper is organized as follows. Section 2 presents related work, with a special focus on Word2vec, which we extend in our approach presented in Section 3. Our experimental setup and evaluation settings are introduced in Section 4, and we discuss the results in Section 5. Section 6 concludes the paper.

1. https://github.com/tca19/dict2vec

The Neural Network Approach
In the original model of Collobert and Weston (2008), a window approach was used to feed a neural network and learn word embeddings. Since there are long-range relations between words, the window-based approach was later extended to a sentence-based approach (Collobert et al., 2011), capturing more semantic similarities in the word vectors. Recurrent neural networks are another way to exploit the context of a word, by considering the sequence of words preceding it (Mikolov et al., 2010; Sutskever et al., 2011): each neuron receives the current window as input, but also its own output from the previous step.

Mikolov et al. (2013) introduced the Skip-gram architecture, built on a single-hidden-layer neural network, to efficiently learn a vector representation for each word w of a vocabulary V from a large corpus of size C. Skip-gram iterates over all (target, context) pairs (w_t, w_c) from every window of the corpus and tries to predict w_c knowing w_t. The objective is therefore to maximize the log-likelihood:

J = \sum_{t=1}^{C} \sum_{\substack{-n \le k \le n \\ k \ne 0}} \log P(w_{t+k} \mid w_t)

where n is the size of the window (composed of the n words around the central word w_t), and the probability can be expressed as:

P(w_{t+k} \mid w_t) = \frac{\exp(v_{t+k} \cdot v_t)}{\sum_{w_i \in V} \exp(v_i \cdot v_t)}

with v_{t+k} (resp. v_t) the vector associated to w_{t+k} (resp. w_t). This model relies on the principle "You shall know a word by the company it keeps" (Firth, 1957): words that frequently appear within the context of the target word will tend to have close representations, as the model updates their vectors to move them closer. This approach has two main drawbacks. First, words within the same window are not always related. Consider the sentence "Turing is widely considered to be the father of theoretical computer science and artificial intelligence." 2 : the pairs (Turing, widely) and (father, theoretical) will be moved closer even though they are not semantically related.
Second, strong semantic relations between words (like synonymy or meronymy) rarely occur within the same window, so these relations will not be well embedded in the vectors.
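To make the windowing concrete, here is a minimal sketch of how Skip-gram enumerates (target, context) pairs. It is illustrative only, not the actual Word2vec implementation, which adds subsampling and other refinements; all names are ours.

```python
# Every word within a window of n words around the target forms a
# (target, context) pair with it -- including unrelated neighbours,
# which is the first drawback discussed above.

def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for every position in the corpus."""
    pairs = []
    for t, target in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for c in range(lo, hi):
            if c != t:
                pairs.append((target, tokens[c]))
    return pairs

pairs = skipgram_pairs(["turing", "is", "widely", "considered"], window=1)
```

Note that pairs are produced in both directions, so frequent neighbours pull each other's vectors closer symmetrically.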
fastText, introduced in Bojanowski et al. (2016), uses additional internal information from the corpus to address the latter drawback. It trains a Skip-gram architecture to predict a word w_c given the central word w_t and all the n-grams G_{w_t} (subwords of 3 up to 6 letters) of w_t. The objective function becomes:

J = \sum_{t=1}^{C} \sum_{\substack{-n \le k \le n \\ k \ne 0}} \log P(w_{t+k} \mid w_t, G_{w_t})

Alongside learning one vector per word, fastText also learns one vector per n-gram. fastText is able to extract more semantic relations between words that share common n-grams (like fish and fishing), which also helps provide good embeddings for rare words, since a vector can be obtained by summing the vectors of a word's n-grams.
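The subword extraction can be sketched as follows. The '<' and '>' word-boundary markers follow the fastText paper; everything else here is illustrative, not the actual library code.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of `word`, with boundary markers added
    so that prefixes and suffixes are distinguishable from inner subwords."""
    w = "<" + word + ">"
    return {w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

# Morphologically related words share subword units, which is how
# fastText links their representations:
shared = char_ngrams("fish") & char_ngrams("fishing")
```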
In what follows, we report related work that leverages external resources to address the two issues raised about the window approach.

Using External Resources
Even with larger and larger text data available on the Web, extracting and encoding every linguistic relation into word embeddings directly from corpora is a difficult task. One way to add more relations to embeddings is to use external data. Lexical databases like WordNet, or sets of synonyms like the MyThes thesaurus, can be used during learning or in a post-processing step to specialize word embeddings. For example, Yu and Dredze (2014) include prior knowledge about synonyms from WordNet and the Paraphrase Database in a joint model built upon Word2vec. Faruqui et al. (2015) introduce a graph-based retrofitting method where they post-process learned vectors with respect to semantic relationships extracted from additional lexical resources. Kiela et al. (2015) propose to specialize the embeddings either on similarity or on relatedness relations in a Skip-gram joint-learning approach, by adding new contexts from an external thesaurus or from a norm association base to the function being optimized. Bian et al. (2014) combine several sources (syllables, POS tags, antonyms/synonyms, Freebase relations) and incorporate them into a CBOW model. These approaches generally aim to improve tasks such as document classification, synonym detection or word similarity. They rely on additional resources whose construction is a time-consuming and error-prone task, and they tend to specialize the embeddings to the external resource used. Moreover, lexical databases contain fewer entries than dictionaries (117k entries in WordNet versus about 200k in a dictionary) and less accurate content (different words in WordNet can belong to the same synset and thus share the same definition).

2. https://en.wikipedia.org/wiki/Alan_Turing
Another type of external resource is the knowledge base, which contains triplets. Each triplet links two entities with a relation, for example Paris - is capital of - France. Several methods (Weston et al., 2013; Xu et al., 2014) have been proposed to use the information from knowledge bases to improve semantic relations in word embeddings and to extract relational facts from text more easily. These approaches focus on tasks that depend on the knowledge base.

Dict2vec
The definition of a word is a group of words or sentences explaining its meaning. A dictionary is a set of (word, definition) tuples for several words. For example, one may find in a dictionary:

car: A road vehicle, typically with four wheels, powered by an internal combustion engine and able to carry a small number of people. 3

The presence of words like "vehicle", "road" or "engine" in the definition of "car" illustrates the relevance of using word definitions to obtain weak supervision, allowing us to get semantically related pairs of words.
Dict2vec models this information by building strong and weak pairs of words (§3.1), in order to provide both a novel positive sampling objective (§3.2) and a novel controlled negative sampling objective (§3.3). These objectives contribute to the global objective function of Dict2vec (§3.4).

Strong pairs, weak pairs
In a definition, each word does not have the same semantic relevance. In the definition of "car", the words "internal" or "number" are less relevant than "vehicle". We introduce the concept of strong and weak pairs to capture this relevance. If a word w_a is in the definition of a word w_b and w_b is in the definition of w_a, they form a strong pair; the K closest words to w_a (resp. w_b) also form strong pairs with w_b (resp. w_a). If w_a is in the definition of w_b but w_b is not in the definition of w_a, they form a weak pair.
The word "vehicle" is in the definition of "car" and "car" is in the definition of "vehicle". Hence, (car-vehicle) is a strong pair. The word "road" is in the definition of "car", but "car" is not in the definition of "road". Therefore, (car-road) is a weak pair.
Some weak pairs can be promoted to strong pairs if the two words are among the K closest neighbours of each other. We choose the K closest words according to the cosine distance in a pretrained word embedding, and find that K = 5 is a good trade-off between the semantic and syntactic information extracted.
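The basic pair-extraction rule can be sketched as follows, assuming each definition has already been tokenized into a set of words. The `definitions` data below is a toy fragment of ours, not the real dictionaries, and the K-closest promotion step is omitted.

```python
def classify_pairs(definitions):
    """Split dictionary-derived word pairs into strong and weak pairs.
    `definitions` maps a word to the set of words in its definition."""
    strong, weak = set(), set()
    for w_a, def_a in definitions.items():
        for w_b in def_a:
            if w_b not in definitions:
                continue  # no definition available for w_b
            pair = tuple(sorted((w_a, w_b)))
            if w_a in definitions[w_b]:
                strong.add(pair)  # each word is in the other's definition
            else:
                weak.add(pair)    # only one direction holds
    return strong, weak - strong

defs = {
    "car": {"road", "vehicle", "engine"},
    "vehicle": {"car", "transport"},
    "road": {"surface", "travel"},
}
strong, weak = classify_pairs(defs)
```

With this toy data, (car, vehicle) comes out strong and (car, road) weak, matching the example above.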

Positive sampling
We introduce the concept of positive sampling based on strong and weak pairs: in addition to moving closer the vectors of words co-occurring within the same window, we move closer the vectors of words forming either a strong or a weak pair.
Let S(w) be the set of all words forming a strong pair with the word w, and W(w) the set of all words forming a weak pair with w. For each target w_t from the corpus, we build V_s(w_t), a random set of n_s words drawn with replacement from S(w_t), and V_w(w_t), a random set of n_w words drawn with replacement from W(w_t). We compute the cost of positive sampling J_pos for each target as follows:

J_{pos}(w_t) = \beta_s \sum_{w_i \in V_s(w_t)} \ell(v_t \cdot v_i) + \beta_w \sum_{w_j \in V_w(w_t)} \ell(v_t \cdot v_j)

where \ell is the logistic loss function defined by \ell(x) = \log(1 + e^{-x}), and v_t (resp. v_i and v_j) is the vector associated to w_t (resp. w_i and w_j).
The objective is to minimize this cost for all targets, thus moving words forming a strong or a weak pair closer together.
The coefficients β_s and β_w, as well as the numbers of drawn pairs n_s and n_w, tune the importance of strong and weak pairs during the learning phase. We discuss the choice of these hyperparameters in Section 5. When β_s = 0 and β_w = 0, our model reduces to the Skip-gram model of Mikolov et al. (2013).
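A minimal sketch of the positive-sampling cost for one target word, assuming vectors are plain Python lists. The default hyperparameter values are those tuned in Section 4; the function names are ours.

```python
import math
import random

def logistic_loss(x):
    """l(x) = log(1 + exp(-x))"""
    return math.log(1.0 + math.exp(-x))

def j_pos(v_t, strong_vecs, weak_vecs, n_s=4, n_w=5, beta_s=0.8, beta_w=0.45):
    """Positive-sampling cost for one target: draw n_s strong and n_w weak
    pair vectors with replacement and accumulate the weighted logistic loss."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    drawn_s = random.choices(strong_vecs, k=n_s) if strong_vecs else []
    drawn_w = random.choices(weak_vecs, k=n_w) if weak_vecs else []
    return (beta_s * sum(logistic_loss(dot(v_t, v)) for v in drawn_s)
            + beta_w * sum(logistic_loss(dot(v_t, v)) for v in drawn_w))

# With no pairs (or beta_s = beta_w = 0) the cost vanishes and the model
# falls back to plain Skip-gram, as noted above.
cost = j_pos([1.0, 0.0], [[0.9, 0.1]], [[0.5, 0.5]])
```

Minimizing this cost pushes the dot products with paired words up, i.e. moves their vectors closer.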

Controlled negative sampling
Negative sampling consists in considering two random words from the vocabulary V to be unrelated. For each word w_t, we generate a set F(w_t) of k words randomly selected from the vocabulary:

F(w_t) = \{ w_i \}_{i=1}^{k}, \quad w_i \sim \mathcal{U}(V), \; w_i \neq w_t

The model aims at separating the vectors of the words from F(w_t) and the vector of w_t. More formally, this is equivalent to minimizing the cost J_neg for each target word w_t:

J_{neg}(w_t) = \sum_{w_i \in F(w_t)} \ell(-v_t \cdot v_i)

where the notation \ell, v_t and v_i is the same as in the previous subsection. However, there is a non-zero probability that w_i and w_t are related, in which case the model moves their vectors further apart instead of moving them closer. With the strong and weak word pairs of Dict2vec, it becomes possible to make this less likely to occur: we prevent a negative example from being a word that forms a weak or strong pair with w_t. The negative sampling objective becomes:

J_{neg}(w_t) = \sum_{w_i \in F(w_t)} \ell(-v_t \cdot v_i), \quad F(w_t) \cap \big( S(w_t) \cup W(w_t) \big) = \emptyset

In our experiments, we observed that this method discards around 2% of the generated negative pairs. Its influence on evaluation depends on the nature of the corpus and is discussed in Section 5.4.
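The controlled draw can be sketched with a simple rejection loop. Names and data structures are illustrative; the real implementation draws from a unigram noise distribution rather than uniformly.

```python
import random

def controlled_negatives(target, vocab, strong, weak, k=5):
    """Draw k negative samples for `target`, rejecting any candidate that
    forms a strong or weak pair with it. `strong` and `weak` map a word
    to the set of words it is paired with."""
    related = strong.get(target, set()) | weak.get(target, set()) | {target}
    negatives = []
    while len(negatives) < k:
        candidate = random.choice(vocab)
        if candidate not in related:  # conventional sampling skips this test
            negatives.append(candidate)
    return negatives

vocab = ["car", "vehicle", "road", "banana", "tree", "sky"]
strong = {"car": {"vehicle"}}
weak = {"car": {"road"}}
negatives = controlled_negatives("car", vocab, strong, weak, k=3)
```

Since only around 2% of candidates are discarded in practice, the rejection loop terminates almost immediately.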

Global objective function
Our objective function is derived from noise-contrastive estimation, which is a more efficient objective than the log-likelihood of Section 2 according to Mikolov et al. (2013). We add the positive sampling and the controlled negative sampling described above, and compute the cost for each (target, context) pair (w_t, w_c) from the corpus as follows:

J(w_t, w_c) = \ell(v_t \cdot v_c) + J_{pos}(w_t) + J_{neg}(w_t)

The global objective is obtained by summing the cost of every pair over the entire corpus:

J = \sum_{t=1}^{C} \sum_{\substack{-n \le k \le n \\ k \ne 0}} J(w_t, w_{t+k})


Experimental setup

Fetching online definitions
We extract all unique words with more than 5 occurrences from a full Wikipedia dump, representing around 2.2M words. Since no single dictionary contains a definition for every existing word (a word w might be in dictionary D_i but not in D_j), we combine several dictionaries to get a definition for almost all of these words (some words are too rare to have a definition anyway).

We use the hyperparameters commonly found in the literature for all models: 5 negative samples, 5 epochs, a window size of 5, a vector size of 100 (resp. 200 and 300) for the 50M file (resp. 200M and the full dump), and we remove words with fewer than 5 occurrences. We follow the same evaluation protocol as Word2vec and fastText to provide the fairest comparison against competitors, so all other hyperparameters (K, β_s, β_w, n_s, n_w) are tuned with a grid search that maximizes the weighted average score. For n_s and n_w, we search from 0 to 10 with a step of 1 and find the optimal values to be n_s = 4 and n_w = 5. For β_s and β_w, we search from 0 to 2 with a step of 0.05 and find β_s = 0.8 and β_w = 0.45 to be the best values for our model. Table 1 reports training times for the three models (all experiments were run on an E3-1246 v3 processor).
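The grid search can be sketched as follows. The scoring function below is a toy stand-in for what the paper actually performs, namely training a full model and computing its weighted average word-similarity score.

```python
from itertools import product

def grid_search(score_fn, ns_values, nw_values):
    """Exhaustive search over (n_s, n_w); beta_s and beta_w are tuned the
    same way with a 0.05 step over [0, 2]."""
    best_params, best_score = None, float("-inf")
    for n_s, n_w in product(ns_values, nw_values):
        score = score_fn(n_s, n_w)
        if score > best_score:
            best_params, best_score = (n_s, n_w), score
    return best_params, best_score

# Toy score surface peaking at the values reported above (n_s=4, n_w=5):
toy_score = lambda n_s, n_w: -((n_s - 4) ** 2 + (n_w - 5) ** 2)
best_params, best_score = grid_search(toy_score, range(11), range(11))
```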

Word similarity evaluation
We follow the standard method for word similarity evaluation, computing the Spearman's rank correlation coefficient (Spearman, 1904) between the human similarity judgements of pairs of words and the cosine similarity of the corresponding word vectors. A score close to 1 indicates an embedding close to the human judgement. We use the classic datasets MC-30 (Miller and Charles, 1991), MEN (Bruni et al., 2014), MTurk-287 (Radinsky et al., 2011), MTurk-771 (Halawi et al., 2012), RG-65 (Rubenstein and Goodenough, 1965), RW (Luong et al., 2013), SimVerb-3500 (Gerz et al., 2016), WordSim-353 (Finkelstein et al., 2001) and YP-130 (Yang and Powers, 2006). We follow the same protocol as Word2vec and fastText, discarding pairs that contain a word which is not in our embedding. Since all models are trained on the same corpora, the embeddings cover the same words, so all competitors share the same OOV rates.
We run each experiment 3 times and report in Table 2 the average score, to minimize the effect of the random initialization of the neural network. We compute the average by weighting each score by the number of pairs evaluated in its dataset, in the same way as Iacobacci et al. (2016). We multiply each score by 1,000 to improve readability.
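This evaluation protocol can be sketched in a few lines. The Spearman implementation below ignores ties for simplicity (real evaluations average tied ranks), and all names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman's rank correlation (no tie handling)."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def evaluate(pairs, gold_scores, embeddings):
    """Correlate human judgements with cosine similarities, discarding
    pairs that contain an out-of-vocabulary word."""
    model, gold = [], []
    for (w1, w2), score in zip(pairs, gold_scores):
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            gold.append(score)
    return spearman(model, gold)
```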

Text classification evaluation
Our text classification task follows the same setup as the one used for fastText. We train a neural network composed of a single hidden layer, where the input layer corresponds to the bag of words of a document and the output layer gives the probability of belonging to each label. The weights between the input and the hidden layer are initialized with the generated embeddings and kept fixed during training, so that the evaluation score depends solely on the embedding. We update the weights of the neural network classifier with gradient descent. We use the datasets AG-News 6 , DBpedia (Auer et al., 2007) and Yelp reviews (polarity and full) 7 . We split each dataset into a training and a test file. We use the same training and test files for all models and report the classification accuracy obtained on the test file.
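The forward pass of such a classifier can be sketched as follows, assuming the document representation is the mean of its (frozen) word vectors; only W and b would be trained by gradient descent. All names are ours.

```python
import math

def doc_vector(tokens, embeddings, dim):
    """Bag-of-words document representation: mean of the frozen word
    vectors; out-of-vocabulary tokens are skipped."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(tokens, embeddings, W, b):
    """Single linear layer on top of the fixed embeddings: one weight row
    per label over the embedding dimensions."""
    dim = len(W[0])
    d = doc_vector(tokens, embeddings, dim)
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, d)) + b_i
              for row, b_i in zip(W, b)]
    return softmax(scores)

emb = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
W, b = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
probs = predict(["good", "movie"], emb, W, b)
```

Since the embeddings are never updated, the test accuracy measures only how much label-relevant information they already encode.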

Baselines
We train Word2vec 8 and fastText 9 on the same 3 files and their 2 respective versions (A and B) described in Section 4.2, using the same hyperparameters (also described in Section 4.2) for all models. We train Word2vec with the Skip-gram model, since our method is based on it. We also train GloVe with the hyperparameters described in Pennington et al. (2014), but its results are lower than all other baselines (weighted average on the word similarity task: 350 on the 50M file, 389 on the 200M file and 454 on the full dump), so we do not report GloVe's results.
We also retrofit the embeddings learned on corpus A with Faruqui's method, to compare against another approach using additional resources. The retrofitting introduces external knowledge from the WordNet semantic lexicon (Miller, 1995). We use Faruqui's retrofitting 10 with the WN_all semantic lexicon from WordNet and 10 iterations, as advised in Faruqui et al. (2015). Furthermore, we compare the performance of our method when using WordNet as the additional resource instead of dictionaries.


Results and model analysis


Semantic similarity

Table 2 (top) reports the Spearman's rank correlation scores obtained with the method described in Section 4.3. Our model outperforms state-of-the-art approaches on most of the datasets for the 50M and 200M token files, and on almost all datasets for the full dump (significant according to a two-sided Wilcoxon signed-rank test with α = 0.05). On the weighted average score, our model improves on fastText's performance on the raw corpus (column A) by 28.3% on the 50M file, by 17.7% on the 200M file and by 12.8% on the full dump. Even when fastText is trained with the same additional knowledge as ours (column B), our model improves performance by 2.9% on the 50M file, by 5.1% on the 200M file and by 11.9% on the full dump.
We notice that column B (the corpus composed of Wikipedia and the definitions) yields better results than column A on the 50M file (+24% on average) and the 200M file (+12% on average). This demonstrates the strong semantic relations one can find in definitions: simply concatenating definitions to a small training file can boost the performance of the embeddings. Moreover, when the training file is large (full dump), our supervision with pairs is more efficient, as the boost brought by the concatenation of definitions becomes insignificant (+1.5% on average).
We also note that the number of strong and weak pairs drawn must be set according to the size of the training file. For the 50M and 200M token files, we train our model with the hyperparameters n_s = 4 and n_w = 5. For the full dump (20 times larger than the 200M token file), the number of windows in the corpus is largely increased, and so is the number of (target, context) pairs. Therefore, we adjust the influence of strong and weak pairs by decreasing n_s and n_w: we set n_s = 2 and n_w = 3 to train on the full dump.

Table 2: Spearman's rank correlation coefficients between vectors' cosine similarity and human judgement for several datasets (top) and accuracies on the text classification task (bottom). We train and evaluate each model 3 times and report the average score for each dataset, as well as the weighted average over all word similarity datasets.

Table 3: Percentage changes of word similarity scores for several datasets after Faruqui's retrofitting method is applied. We compare each model to its own non-retrofitted version (vs self) and to our non-retrofitted version (vs our). A positive percentage indicates the level of improvement of the retrofitting approach, while a negative percentage shows that the compared method is better without retrofitting. As an illustration, the +13.9% at the top left means that retrofitting Word2vec's vectors improves the initial vectors by 13.9%, while the -7.3% below indicates that our approach without retrofitting is better than the retrofitted Word2vec vectors.

Faruqui's retrofitting method improves the word similarity scores of all frameworks on all datasets, except on RW and WS353 (Table 3). But even when Word2vec and fastText are retrofitted, their scores remain worse than those of our non-retrofitted model (every percentage on the "vs our" line is negative). We also notice that our model is compatible with a retrofitting improvement method, as our scores also increase with Faruqui's method.
We also observe that, although our model is superior at every corpus size, our model trained on the 50M token file even outperforms the other models trained on the full dump (a 17% improvement over fastText, our best competitor, trained on the full dump). This means that considering strong and weak pairs is more efficient than increasing the corpus size, and that using dictionaries is a good way to improve the quality of the embeddings when the training file is small.
The models based on knowledge bases cited in §2.2 do not provide word similarity scores on all the datasets we used. However, on the scores they do report, Dict2vec outperforms these models, including Kiela et al. (2015).

Text classification accuracy
Table 2 (bottom) reports the classification accuracy on the considered datasets. Our model achieves the same performance as Word2vec and fastText on the 50M file and slightly improves the results on the 200M file and the full dump. Using supervision with pairs during training does not make our model specific to the word similarity task, which shows that our embeddings can also be used in downstream extrinsic tasks.
Note that for this experiment the embeddings were fixed and not updated during learning (we only learned the classifier parameters), since our objective was to evaluate the capability of the embeddings to be used in another task rather than to obtain the best possible models. It is of course possible to obtain better results by updating both the embeddings and the classifier parameters with respect to the supervised information, adapting the embeddings to the classification task at hand.

We also trained Dict2vec with pairs from WordNet, as well as with no additional pairs during training (in which case the model is the Skip-gram model of Word2vec). Results are reported in Table 5. Training with WordNet pairs increases the scores, showing that the supervision brought by positive sampling is beneficial to the model, but it lags behind training with dictionary pairs, demonstrating once again that dictionaries contain more semantic information than WordNet.

Positive and negative sampling
For the positive sampling, an empirical grid search shows that setting β_w to about half of β_s is a good rule of thumb for tuning these hyperparameters. We also notice that when these coefficients are too low (β_s ≤ 0.5 and β_w ≤ 0.2), results get worse because the model does not take the information from the strong and weak pairs into account. Conversely, when they are too high (β_s ≥ 1.2 and β_w ≥ 0.6), the model discards too much of the information from the context in favor of the information from the pairs. The behaviour is similar when the number of strong and weak pairs drawn is too low or too high (n_s, n_w ≤ 2 or n_s, n_w ≥ 5).
For the negative sampling, we notice that the control brought by the pairs increases the average weighted score by 0.7% compared to the uncontrolled version. We also observe that increasing the number of negative samples does not significantly improve the results, except on the RW dataset, where using 25 negative samples can boost performance by 10%. Indeed, this dataset is mostly composed of rare words, so the embeddings must learn to differentiate unrelated words rather than to move related ones closer.

In Fig. 1, we observe that our model still outperforms state-of-the-art approaches when we reduce the dimension of the embeddings to 20 or 40. We also notice that increasing the vector size does increase performance, but only up to a dimension of around 100, which is the common dimension used by the related approaches reported here when training on the 50M token file.

Conclusion
In this paper, we presented Dict2vec, a new approach for learning word embeddings using lexical dictionaries. It is based on a Skip-gram model whose objective function is extended with word pairs extracted from dictionary definitions, weighted according to the strength of the pairs. Our approach shows better results than state-of-the-art word embedding methods on the word similarity task, including methods based on retrofitting from external sources. We also provide the full source code to reproduce our experiments.