Enhancing Word Embeddings with Knowledge Extracted from Lexical Resources

In this work, we present an effective method for semantic specialization of word vector representations. To this end, we take traditional word embeddings and apply specialization methods to better capture semantic relations between words. In our approach, we leverage external knowledge from rich lexical resources such as BabelNet. We also show that our proposed post-specialization method, based on an adversarial neural network with the Wasserstein distance, yields improvements over state-of-the-art methods on two tasks: word similarity and dialog state tracking.


Introduction
Vector representations of words (embeddings) have become the cornerstone of modern Natural Language Processing (NLP), as learning word vectors and utilizing them as features in downstream NLP tasks is the de facto standard. Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) are typically trained in an unsupervised way on large monolingual corpora. While such word representations are able to capture some syntactic as well as semantic information, their ability to capture relations between words (e.g. synonymy, antonymy) is limited. To alleviate this deficiency, a set of refinement post-processing methods, called retrofitting or semantic specialization, has been introduced. In the next section, we discuss the intricacies of these methods in more detail.
To summarize, our contributions in this work are as follows:
• We introduce a set of new linguistic constraints (i.e. synonyms and antonyms) created with BabelNet for three languages: English, German and Italian.
• We introduce an improved post-specialization method (dubbed WGAN-postspec), which demonstrates improved performance compared to the state-of-the-art DFFN and AuxGAN (Ponti et al., 2018) models.
• We show that the proposed approach achieves performance improvements on an intrinsic task (word similarity) as well as on a downstream task (dialog state tracking).

* Equal contribution

Related Work
Numerous methods have been introduced for incorporating structured linguistic knowledge from external resources into word embeddings. Fundamentally, there exist three categories of semantic specialization approaches: (a) joint methods, which incorporate lexical information during the training of distributional word vectors; (b) specialization methods, also referred to as retrofitting methods, which use post-processing techniques to inject semantic information from external lexical resources into pre-trained word vector representations; and (c) post-specialization methods, which use linguistic constraints to learn a general mapping function that specializes the entire distributional vector space. In general, joint methods perform worse than the other two categories and are not model-agnostic, as they are tightly coupled to the distributional word vector models (e.g. Word2Vec, GloVe). Therefore, in this work we concentrate on the specialization and post-specialization methods.
Approaches which fall into the specialization category can be considered local specialization methods. The most prominent examples are: retrofitting (Faruqui et al., 2015), a post-processing method that enriches word embeddings with knowledge from semantic lexicons (e.g. WordNet) by bringing semantically similar words closer together; counter-fitting (Mrkšić et al., 2016), which likewise fine-tunes word representations, but in contrast to the retrofitting technique counter-fits the embeddings with respect to both similarity and antonymy constraints; and Attract-Repel (Mrkšić et al., 2017b), which uses linguistic constraints obtained from external lexical resources to semantically specialize word embeddings. Similarly to counter-fitting, Attract-Repel injects synonymy and antonymy constraints into distributional word vector spaces; in contrast to counter-fitting, however, it does not ignore how updates of the example word vector pairs affect their relations to other word vectors.
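To make the local specialization idea concrete, here is a minimal sketch of the retrofitting update of Faruqui et al. (2015), assuming a toy vocabulary stored as a dict of NumPy vectors; the function name and the uniform weights alpha and beta are our own simplifications, not the authors' code:

```python
import numpy as np

def retrofit(vectors, synonyms, iterations=10, alpha=1.0, beta=1.0):
    """Retrofitting (Faruqui et al., 2015): iteratively move each word vector
    toward the average of its lexicon neighbours while staying close to its
    original distributional vector."""
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in synonyms.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            # Closed-form update: weighted average of the original vector
            # and the current vectors of the word's lexicon neighbours.
            total = alpha * vectors[word] + beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = total / (alpha + beta * len(nbrs))
    return new_vecs
```

After a few iterations, synonym pairs from the lexicon end up measurably closer in the specialized space, while the original vectors are left untouched.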

On the other hand, the latter group, post-specialization methods, performs global specialization of distributional spaces. Here we can distinguish: explicit retrofitting, the first attempt to use external constraints (i.e. synonyms and antonyms) as training examples for learning an explicit mapping function that specializes words not observed in the constraints. Later, the more robust DFFN method was introduced with the same goal: to specialize the full vocabulary by leveraging the already specialized subspace of seen words.

Methodology
In this paper, we propose an approach that builds upon previous works (Ponti et al., 2018). The process of specializing distributional vectors is a two-step procedure (as shown in Figure 1). First, an initial specialization is performed (see §3.1). In the second step, a global specialization mapping function is learned, allowing the model to generalize to unseen words (see §3.2).

Initial Specialization
In this step, a subspace of distributional vectors for words that occur in the external constraints is specialized. To this end, fine-tuning of seen words can be performed using any specialization method. In this work, we utilize the Attract-Repel model (Mrkšić et al., 2017b), as it offers state-of-the-art performance. This method makes use of both synonymy (attract) and antonymy (repel) constraints. More formally, given a set A of attract word pairs and a set R of repel word pairs, let V_S be the vocabulary of words seen in the constraints. Each word pair (v_l, v_r) is represented by a corresponding vector pair (x_l, x_r). The model optimization operates over mini-batches: a mini-batch B_A of synonymy pairs (of size k_1) and a mini-batch B_R of antonymy pairs (of size k_2). Each pair (x_l, x_r) is coupled with a pair of negative examples (t_l, t_r), the vectors in the mini-batch closest to x_l and x_r, respectively. The negative examples serve the purpose of pulling synonym pairs closer to each other, and pushing antonym pairs further apart, than their corresponding negative examples. For synonyms:

Att(B_A) = Σ_{i=1}^{k_1} [ τ(δ_att + x_l^i · t_l^i − x_l^i · x_r^i) + τ(δ_att + x_r^i · t_r^i − x_l^i · x_r^i) ]

where τ(z) = max(0, z) is the rectifier function, and δ_att is the similarity margin determining how much closer synonymy vectors should be to each other than to their negative examples. Similarly, the equation for antonyms is given as:

Rep(B_R) = Σ_{i=1}^{k_2} [ τ(δ_rep + x_l^i · x_r^i − x_l^i · t_l^i) + τ(δ_rep + x_l^i · x_r^i − x_r^i · t_r^i) ]

A distributional regularization term is used to retain the quality of the original distributional vector space by means of L_2-regularization:

Reg(B_A, B_R) = Σ_{x_i ∈ V(B_A ∪ B_R)} λ_reg ‖ x̂_i − x_i ‖_2

where λ_reg is the L_2-regularization constant, and x̂_i is the original distributional vector for the word v_i.
Consequently, the final cost function is formulated as follows:

C(B_A, B_R) = Att(B_A) + Rep(B_R) + Reg(B_A, B_R)
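The attract and repel margin losses can be sketched in NumPy as follows; the function names are ours, and the negative examples (here t_l, t_r) are assumed to be given, whereas Attract-Repel selects them as the closest vectors within the mini-batch:

```python
import numpy as np

def relu(z):
    # The rectifier τ(z) = max(0, z).
    return np.maximum(0.0, z)

def attract_loss(x_l, x_r, t_l, t_r, delta_att=0.6):
    """Margin loss pulling a synonym pair (x_l, x_r) at least delta_att
    closer to each other than to their negative examples (t_l, t_r)."""
    return (relu(delta_att + x_l @ t_l - x_l @ x_r)
            + relu(delta_att + x_r @ t_r - x_l @ x_r))

def repel_loss(x_l, x_r, t_l, t_r, delta_rep=0.0):
    """Margin loss pushing an antonym pair further from each other
    than from their negative examples."""
    return (relu(delta_rep + x_l @ x_r - x_l @ t_l)
            + relu(delta_rep + x_l @ x_r - x_r @ t_r))
```

Summing these terms over the mini-batches and adding the L_2 regularizer gives the full Attract-Repel cost.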

Proposed Post-Specialization Model
Once the initial specialization is completed, post-specialization methods can be employed. This step is important because local specialization affects only words seen in the constraints, and thus just a subset of the original distributional space X_d. Post-specialization methods, by contrast, learn a global specialization mapping function, allowing them to generalize to unseen words X_u. Given the specialized word vectors X_s from the vocabulary of seen words V_S, our proposed method propagates this signal to the entire distributional vector space using a generative adversarial network (GAN) (Goodfellow et al., 2014). Hence, in our model, following the approach of Ponti et al. (2018), we introduce adversarial losses. More specifically, the mapping function is learned through a combination of a standard L_2-loss with adversarial losses. The motivation behind this is to make the mappings more natural and to ensure that vectors specialized for the full vocabulary are more realistic. To this end, we use the Wasserstein distance incorporated in the generative adversarial network (WGAN) as well as its improved variant with gradient penalty (WGAN-GP) (Gulrajani et al., 2017). For brevity, we call our model WGAN-postspec, which is an umbrella term for the WGAN and WGAN-GP methods implemented in the proposed post-specialization model. One of the benefits of using WGANs over vanilla GANs is that WGANs are generally more stable and do not suffer from vanishing gradients.
Our proposed post-specialization approach is based on the principles of GANs, as it is composed of two elements: a generator network G and a discriminator network D. The gist of this concept is to improve the generated samples through a min-max game between the generator and the discriminator.
In our post-specialization model, a multi-layer feed-forward neural network, which trains a global mapping function, acts as the generator. Consequently, the generator is trained to produce predictions G(x; θ_G) that are as similar as possible to the corresponding initially specialized word vectors x_s; that is, a global mapping function is trained on word vector pairs (x, x_s) such that G(x; θ_G) ≈ x_s. On the other hand, the discriminator D(x; θ_D), which is a multi-layer classification network, tries to distinguish the generated samples from the initially specialized vectors sampled from X_s. In this process, the differences between predictions and initially specialized vectors are used to improve the generator, resulting in more realistic outputs.
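As an illustration, a feed-forward generator of this kind and the L_2 part of its objective might be sketched as follows; the layer sizes, the tanh non-linearity and all function names are illustrative assumptions, not the exact architecture used in our model:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dims):
    """A stack of affine layers (weight, bias) acting as the generator G."""
    return [(rng.normal(0, 0.1, (i, o)), np.zeros(o)) for i, o in zip(dims, dims[1:])]

def forward(params, x):
    # tanh between layers, linear output so predictions live in vector space.
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.tanh(x)
    return x

def l2_loss(params, x_seen, x_spec):
    """Direct-mapping part of the objective: mean squared distance between
    G(x) and the initially specialized vectors x_s over seen words."""
    pred = forward(params, x_seen)
    return float(np.mean((pred - x_spec) ** 2))
```

In the full model this L_2 term is combined with the adversarial losses defined below, and the parameters θ_G are updated by gradient descent.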
In general, for the GAN model we can define the loss L_G of the generator as:

L_G = − (1/n) Σ_{i=1}^{n} log D(G(x_i; θ_G); θ_D)

while the loss of the discriminator L_D is given as:

L_D = − (1/n) Σ_{i=1}^{n} log D(x_s^i; θ_D) − (1/n) Σ_{i=1}^{n} log (1 − D(G(x_i; θ_G); θ_D))

In principle, the losses with the Wasserstein distance can be formulated as follows:

L_G = − (1/n) Σ_{i=1}^{n} D(G(x_i; θ_G); θ_D)

and

L_D = (1/n) Σ_{i=1}^{n} D(G(x_i; θ_G); θ_D) − (1/n) Σ_{i=1}^{n} D(x_s^i; θ_D)

An alternative scenario with a gradient penalty (WGAN-GP) adds to the discriminator loss a penalty term with coefficient λ that pushes the gradient norm of the discriminator towards 1.
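The loss formulas can be sketched numerically as follows, given discriminator (critic) scores for real and generated samples; the helper names are ours, and the gradient-penalty helper takes precomputed gradient norms rather than computing them by backpropagation:

```python
import numpy as np

def gan_losses(d_real, d_fake, eps=1e-8):
    """Vanilla GAN losses from discriminator probabilities in (0, 1)."""
    loss_d = -np.mean(np.log(d_real + eps)) - np.mean(np.log(1.0 - d_fake + eps))
    loss_g = -np.mean(np.log(d_fake + eps))
    return float(loss_g), float(loss_d)

def wgan_losses(c_real, c_fake):
    """Wasserstein losses from unbounded critic scores: the critic maximizes
    the score gap between real and generated samples, the generator closes it."""
    loss_d = np.mean(c_fake) - np.mean(c_real)
    loss_g = -np.mean(c_fake)
    return float(loss_g), float(loss_d)

def gradient_penalty(grad_norms, lam=10.0):
    """WGAN-GP term: penalize critic gradient norms that deviate from 1."""
    return float(lam * np.mean((grad_norms - 1.0) ** 2))
```

Note that the Wasserstein critic outputs raw scores, not probabilities, which is one reason its gradients remain informative even when the two distributions barely overlap.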

Experiments
Pre-trained Word Embeddings. In order to evaluate our proposed approach as well as to compare our results with current state-of-the-art post-specialization approaches, we use popular and readily available 300-dimensional pre-trained word vectors. Word2Vec (Mikolov et al., 2013) embeddings for English were trained using skip-gram with negative sampling on the cleaned and tokenized Polyglot Wikipedia (Al-Rfou' et al., 2013) by Levy and Goldberg (2014), while German and Italian embeddings were trained using CBOW with negative sampling on WaCky corpora (Dinu et al., 2015; Artetxe et al., 2017, 2018). Moreover, GloVe vectors for English were trained on Common Crawl (Pennington et al., 2014).
Linguistic Constraints. To perform semantic specialization of word vector spaces, we exploit linguistic constraints used in previous works (Zhang et al., 2014; Ono et al., 2015) (referred to as external), and we introduce a new set of constraints collected by us (referred to as babelnet) for three languages: English, German and Italian. We use constraints in two different settings: disjoint and overlap. In the disjoint setting, we remove all linguistic constraints that contain any of the words available in the SimLex (Hill et al., 2015), SimVerb (Gerz et al., 2016) and WordSim (Leviant and Reichart, 2015) evaluation datasets. In the overlap setting, we let the SimLex, SimVerb and WordSim words remain in the constraints. We report the number of word pairs for the English, German and Italian constraints in Table 1.
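The disjoint setting amounts to a simple filtering step over the constraint pairs, which might look like this (function name and data layout are illustrative):

```python
def filter_constraints(constraints, eval_words):
    """'disjoint' setting: drop any constraint pair that contains a word
    appearing in the evaluation datasets (SimLex, SimVerb, WordSim)."""
    return [(w1, w2) for (w1, w2) in constraints
            if w1 not in eval_words and w2 not in eval_words]
```

The overlap setting simply skips this filter, so evaluation words may also be directly specialized by the constraints.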
Let us discuss in more detail how the lists of constraints were constructed. In this work, we use two sets of linguistic constraints: external and babelnet. The first set of constraints was retrieved from WordNet (Fellbaum, 1998) and Roget's Thesaurus (Kipfer, 2009), resulting in 1,023,082 synonymy and 380,873 antonymy word pairs. The second set of constraints, which is part of our contribution, comprises synonyms and antonyms obtained using NASARI lexical embeddings (Camacho-Collados et al., 2016) and BabelNet (Navigli and Ponzetto, 2012). As NASARI provides lexical information for BabelNet words in five languages (EN, ES, FR, DE and IT), we collected each word with its related BabelNet ID (a sense database identifier) to extract the list of its synonyms and antonyms using the BabelNet API. Furthermore, to improve the list of Italian words, we also followed the approach proposed by Sucameli and Lenci (2017), who provide a dataset of semantically related Italian word pairs, covering nouns, adjectives and verbs with their synonyms, antonyms and hypernyms. The information in this dataset was gathered by its authors through crowdsourcing from a pool of Italian native speakers. This way, we could concatenate these Italian word pairs with ours to provide a more complete list of synonyms and antonyms.
Similarly, for German we rely on the work of Scheible and Schulte im Walde (2014), which presents a collection of semantically related German word pairs compiled through human evaluation. The list of word pairs was generated with a sampling technique relying on GermaNet and its Java API. We used these word pairs in our experiments as the external resource for the German language.
Initial Specialization and Post-Specialization. Although initially specialized vector spaces show gains over non-specialized word embeddings, linguistic constraints cover only a fraction of the total vocabulary. Therefore, semantic specialization is a two-step process. First, we perform an initial specialization of the pre-trained word vectors by means of the Attract-Repel algorithm (see §3.1). The hyperparameters are set to their default values: λ_reg = 10^-9, δ_att = 0.6, δ_rep = 0.0 and k_1 = k_2 = 50. Afterward, to specialize the entire vocabulary, a global specialization mapping function is learned. In our proposed WGAN-postspec approach, the post-specialization model uses a GAN with loss functions improved by means of the Wasserstein distance and gradient penalty. Importantly, the optimization process differs depending on the algorithm implemented in our model: in the case of a vanilla GAN (AuxGAN), standard stochastic gradient descent is used; in the WGAN model we employ RMSProp (Tieleman and Hinton, 2012); and in the case of WGAN-GP, the Adam (Kingma and Ba, 2015) optimizer is applied.

Table 2: Spearman's ρ correlation scores on SimLex-999 (SL), SimVerb-3500 (SV) and WordSim-353 (WS). Evaluation was performed using constraints in three settings: (a) external, (b) babelnet, (c) external + babelnet.

Word Similarity
We report our experimental results on a common intrinsic word similarity task, using standard benchmarks: SimLex-999 and WordSim-353 for English, German and Italian, as well as SimVerb-3500 for English. Each dataset contains human similarity ratings, and we evaluate with Spearman's ρ rank correlation between model similarity scores and the human ratings. In Table 2, we present results for the English benchmarks, whereas results for German and Italian are reported in Table 3. Word embeddings are evaluated in two scenarios: disjoint, where words observed in the benchmark datasets are removed from the linguistic constraints; and overlap, where all words provided in the linguistic constraints are utilized. We use the overlap setting in the downstream task (see §5.2).
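For reference, Spearman's ρ is the Pearson correlation of rank-transformed scores; a minimal implementation (ignoring tied ranks, which standard libraries handle by rank averaging) might look like:

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks of the
    two score lists. Ties are not handled here (see scipy.stats.spearmanr
    for a full implementation)."""
    def ranks(x):
        order = np.argsort(x)
        r = np.empty(len(x))
        r[order] = np.arange(len(x))
        return r
    ra, rb = ranks(np.asarray(a)), ranks(np.asarray(b))
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))
```

In the evaluation, one list holds the human similarity ratings and the other the cosine similarities between the corresponding word vectors.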
The results suggest that post-specialization methods bring improvements in the specialization of the distributional word vector space. Overall, the highest correlation scores are reported for the models with adversarial losses. We also observe that the proposed WGAN-postspec achieves fairly consistent correlation gains with GloVe vectors on the SimLex dataset. Interestingly, while exploiting additional constraints (i.e. external + babelnet) generally boosts correlation scores for German and Italian, the results are not conclusive in the case of English and thus require further investigation.

Dialog State Tracking
We also evaluate our proposed approach on a dialog state tracking (DST) downstream task. DST is a standard language understanding task that requires differentiating between word similarity and relatedness. To perform the evaluation, we follow previous works (Henderson et al., 2014; Williams et al., 2016; Mrkšić et al., 2017b). Concretely, the DST model computes its output probabilities based only on pre-trained word embeddings. We use the Wizard-of-Oz (WOZ) v2.0 dataset (Wen et al., 2017; Mrkšić et al., 2017a), composed of 600 training dialogues as well as 200 development and 400 test dialogues.
In our experiments, we report results with the standard joint goal accuracy (JGA) score. The results in Table 4 confirm our findings from the word similarity task, as initial semantic specialization and post-specialization (in particular WGAN-postspec) yield improvements over the original distributional word vectors. We expect this conclusion to hold in all settings; however, additional experiments with different languages and word embeddings would be beneficial.
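Joint goal accuracy is the fraction of dialogue turns whose full predicted slot-value set matches the gold annotation exactly; a minimal sketch (the data layout is assumed for illustration, not taken from the WOZ tooling):

```python
def joint_goal_accuracy(predicted, gold):
    """JGA: a turn counts as correct only if every slot-value pair in the
    predicted belief state equals the gold belief state for that turn."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)
```

Because a single wrong slot makes the whole turn incorrect, JGA is a stricter measure than per-slot accuracy.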

Conclusion and Future Work
In this work, we presented a method to perform semantic specialization of word vectors. Specifically, we compiled a new set of constraints obtained from BabelNet. Moreover, we improved a state-of-the-art post-specialization method by incorporating adversarial losses with the Wasserstein distance. Our results, obtained in an intrinsic and an extrinsic task, suggest that our method yields performance gains over current methods.
In the future, we plan to introduce constraints for asymmetric relations as well as extend our proposed method to leverage them. Moreover, we plan to experiment with adapting our model to a multilingual scenario, to be able to use it in a neural machine translation task. We make the code and resources available at: https://github.com/mbiesialska/wgan-postspec