Extrofitting: Enriching Word Representation and its Vector Space with Semantic Lexicons

We propose a post-processing method that enriches not only word representations but also their vector space using semantic lexicons, which we call extrofitting. The method consists of three steps: (i) expanding one or more dimensions on all the word vectors, filled with a representative value of each vector; (ii) transferring semantic knowledge by averaging the representative values of synonyms and writing the average back into the expanded dimension(s) — these two steps pull the representations of synonyms close together; and (iii) projecting the vector space using Linear Discriminant Analysis, which eliminates the expanded dimension(s) while preserving the semantic knowledge. Experimenting with GloVe, we find that our method outperforms Faruqui's retrofitting on some word similarity tasks. We also report further analyses of our method with respect to word vector dimension and vocabulary size, as well as on other well-known pretrained word vectors (e.g., Word2Vec, Fasttext).


Introduction
As a method to represent natural language on computers, researchers have utilized distributed word representations. A distributed word representation encodes a word as an n-dimensional real-valued vector, hypothesizing that some or all of the dimensions may capture semantic aspects of the word. The representation has worked well in various NLP tasks, substituting for one-hot representations (Turian et al., 2010). Two major algorithms for learning distributed word representations are CBOW (Continuous Bag-of-Words) and skip-gram (Mikolov et al., 2013b). Both CBOW and skip-gram learn the representation using a single hidden-layer neural network. The difference is that CBOW learns the representation of a center word from its neighbor words, whereas skip-gram learns the representations of neighbor words from the center word. Either way, the algorithms depend on word order, because their objective is to maximize the probability of occurrence of neighbor words given the center word. A problem then arises because the resulting word representations carry no information to distinguish synonyms from antonyms. For example, worthy and desirable should be mapped close together on the vector space, just as agree and disagree should be mapped apart, although both pairs occur in very similar contexts. Researchers have focused on this problem, and their main approach is to use semantic lexicons (Faruqui et al., 2014; Mrkšić et al., 2016; Speer et al., 2017; Vulić et al., 2017; Camacho-Collados et al., 2015). One of the successful works is Faruqui's retrofitting, which can be summarized as pulling the word vectors of synonyms close together by weighted averaging on a fixed vector space (explained in Section 2.1). Retrofitting greatly improves word similarity between synonyms, and the result not only corresponds with human intuition about words but also performs better on document classification tasks compared with the original word embeddings (Kiela et al., 2015). Building on the idea of retrofitting, we hypothesize that we can enrich not only word representations but also their vector space using semantic lexicons. We call our method extrofitting, as it retrofits word vectors by expanding their dimensions.

Retrofitting
Retrofitting (Faruqui et al., 2014) is a post-processing method to enrich word vectors using synonyms in semantic lexicons. The algorithm learns the word embedding matrix Q = {q_1, q_2, ..., q_n} with the objective function Ψ(Q):

Ψ(Q) = \sum_{i=1}^{n} \Big[ \alpha_i \lVert \hat{q}_i - q_i \rVert^2 + \sum_{(i,j) \in E} \beta_{ij} \lVert \hat{q}_i - \hat{q}_j \rVert^2 \Big]    (1)

where an original word vector is q_i, its synonym vector is q_j, and the inferred word vector is \hat{q}_i. The hyperparameters α and β control the relative strengths of the associations. The \hat{q}_i can be derived by the following online update:

\hat{q}_i = \frac{\sum_{j:(i,j) \in E} \beta_{ij} \hat{q}_j + \alpha_i q_i}{\sum_{j:(i,j) \in E} \beta_{ij} + \alpha_i}

Linear Discriminant Analysis (LDA)

LDA (Welling, 2005) is a dimension reduction algorithm that projects data into a different vector space while minimizing the loss of class information as much as possible. As a result, the algorithm finds a linear vector space that minimizes the distance between data points in the same class and maximizes the distance between different classes. The algorithm can be summarized as follows.

Calculating the between-class scatter matrix S_B and the within-class scatter matrix S_W. When we denote the data as x and the classes as c, S_B and S_W can be formulated as:

S_B = \sum_{i} N_i (\mu_i - \mu)(\mu_i - \mu)^T, \qquad S_W = \sum_{i} \sum_{x \in c_i} (x - \mu_i)(x - \mu_i)^T

where the overall average of x is μ, the partial average within class i is μ_i, and N_i is the number of data points in class i.
Maximizing the objective function J(w).
The objective function J(w) that we should maximize can be defined as

J(w) = \frac{w^T S_B w}{w^T S_W w},

and its solution reduces to finding U that satisfies S_W^{-1} S_B = U \Lambda U^T. Therefore, U is derived by eigen-decomposition of S_W^{-1} S_B: choosing the q eigenvectors with the top-q eigenvalues and composing the transform matrix U from them.

Transforming the data onto the new vector space. Using the transform matrix U, we can get the transformed data by y = U^T x.

Enriching Representations of Word Vectors and the Vector Space
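The LDA steps above can be sketched in NumPy; the helper name `lda_transform` is an assumption for illustration, and the pseudo-inverse is used in case S_W is singular:

```python
import numpy as np

def lda_transform(X, y, q):
    """Illustrative LDA: scatter matrices, eigen-decomposition, projection.

    X : (n_samples, d) data matrix.
    y : (n_samples,) integer class labels.
    q : number of discriminant directions to keep.
    """
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)      # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)        # between-class scatter
    # Solve S_W^{-1} S_B u = lambda u and keep the top-q eigenvectors.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1]
    U = eigvecs[:, order[:q]].real
    return X @ U                                # y = U^T x for each row x
```

This is a didactic sketch rather than an optimized implementation; in practice library routines (e.g., scikit-learn's) are preferable.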

Expanding Word Vector with Enrichment
We simply enrich the word vectors by expanding their dimensions: we add one or more dimensions to the original vectors, filled with a representative value r_i, which can be a mean value. We denote an original word vector as q_i = (e_1, e_2, ..., e_D), where D is the number of word vector dimensions. Then the representative value r_i can be formulated as

r_i = \frac{1}{D} \sum_{d=1}^{D} e_d.

Intuitively, if we expand more additional dimensions, the word vectors will strengthen their own meaning. Likewise, the ratio of the number of expanded dimensions to the number of original dimensions will affect the meaning of the word vectors.
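Step (i) can be sketched in a few lines of NumPy; the helper name `expand` is an assumption for illustration, and the representative value is taken to be the mean as described above:

```python
import numpy as np

def expand(Q, k=1):
    """Append k extra dimensions to each word vector, each filled with
    the vector's representative value r_i = (1/D) * sum_d e_d (its mean)."""
    r = Q.mean(axis=1, keepdims=True)     # one representative value per word
    return np.hstack([Q] + [r] * k)       # shape: (n_words, D + k)
```

For example, expanding a single 3-dimensional vector (1, 2, 3) by one dimension yields (1, 2, 3, 2), since its mean is 2.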

Transferring Semantic Knowledge
To transfer semantic knowledge onto the representative value r_i, we again take a simple approach: we average the representative values of the members of each synonym pair and substitute the average for each previous value. We get the synonym pairs from the lexicons we introduced in Section 3. The transferred representative value \hat{r}_i can be formulated as

\hat{r}_i = \frac{1}{N} \sum_{s \in L} r_s

where L refers to the lexicon consisting of synonym pairs s, and N is the number of synonyms. This manipulation makes the representations of the synonym pairs close to one another.
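Step (ii) can be sketched as follows, assuming a single expanded dimension stored in the last column; the helper name `transfer_knowledge` and the synset-as-index-list representation are assumptions for illustration:

```python
import numpy as np

def transfer_knowledge(Q_expanded, synsets):
    """For every synonym set, replace each member's representative value
    (the last column here) with the average over the set."""
    Q_out = Q_expanded.copy()
    for synset in synsets:                  # synset: list of word indices
        r_hat = Q_out[synset, -1].mean()    # \hat{r}_i = (1/N) sum_{s in L} r_s
        Q_out[synset, -1] = r_hat           # write the average back
    return Q_out
```

Words that appear in no synset keep their original representative value untouched.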

Enriching Vector Space
With the enriched vectors and the semantic knowledge, we perform Linear Discriminant Analysis for dimension reduction as well as for clustering the synonyms from the semantic knowledge. LDA finds a new vector space that clusters and differentiates the labeled data, which are the synonym pairs in this experiment. We can get the extrofitted word embedding matrix W as follows:

W = \mathrm{LDA}(Q, l)

where Q is the word embedding matrix composed of the word vectors q, and l is the index of the synonym pair used as a class label. We implement our method in Python 2.7 with scikit-learn (Pedregosa et al., 2011).
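A minimal sketch of step (iii) with scikit-learn's LinearDiscriminantAnalysis follows; the function name `extrofit` and the labeling scheme (each synonym set shares one label, and a word without synonyms would get its own unique label) are illustrative assumptions, not the authors' released code:

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def extrofit(Q_enriched, labels, out_dim):
    """Project the enriched word vectors onto a new space with LDA,
    using synonym-set indices as class labels.

    Q_enriched : (n_words, D + k) matrix after expansion and knowledge transfer.
    labels     : (n_words,) synonym-set index per word.
    out_dim    : target dimension of the extrofitted vectors.
    """
    lda = LinearDiscriminantAnalysis(n_components=out_dim)
    return lda.fit_transform(Q_enriched, labels)
```

Note that scikit-learn constrains `n_components` to at most min(n_classes - 1, n_features), so the number of distinct synonym labels must exceed the target dimension.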
Experiment Data

Fasttext (Bojanowski et al., 2016) is an extension of Word2Vec that utilizes subword information to represent an original word. We used 300-dimensional pretrained Fasttext vectors trained on Wikipedia (wiki.en.vec) with skip-gram.

Semantic Lexicons
We borrow the semantic lexicons from retrofitting (Faruqui et al., 2014). Faruqui et al. extracted synonyms from PPDB (Ganitkevitch et al., 2013) by finding words that more than two words in another language correspond with. Retrofitting also used the WordNet (Miller, 1995) database, which groups words into sets of synonyms (synsets). We used two versions of the WordNet lexicon: one that consists of synonyms only (WordNet_syn) and the other that additionally includes hypernyms and hyponyms (WordNet_all). Lastly, synonyms were extracted from FrameNet (Baker et al., 1998), which contains more than 200,000 manually annotated sentences linked to semantic frames. Faruqui et al. regarded words as synonyms if the words could be grouped by any of the frames.

Evaluation Data
We evaluate our method on word similarity tasks using four different datasets. MEN-3k (Bruni et al., 2014) consists of 3,000 word pairs rated from 0 to 50. WordSim-353 (Finkelstein et al., 2001) consists of 353 word pairs rated from 0 to 10. SimLex-999 (Hill et al., 2015) includes 999 word pairs rated from 0 to 10. RG-65 (Rubenstein and Goodenough, 1965) has 65 word pairs scored from 0 to 4. MEN-3k and WordSim-353 are split into a train (or dev) set and a test set, but we combined them, as we use them solely for evaluation. Other datasets contain many out-of-vocabulary words, so we leave them for future work.

Experiments on Word Similarity Task
The word similarity task is to calculate Spearman's correlation (Daniel, 1990) between human similarity ratings of word pairs and the similarity of the corresponding word vectors. We first apply extrofitting to GloVe with different semantic lexicons and present the results in Table 1. The results show that although the number of words extrofitted with FrameNet is smaller than with the other lexicons, its performance is on par with them. We can also confirm that our method improves the performance of the original pretrained word vectors. Next, we perform extrofitting on GloVe with different word dimensions and compare the performance with retrofitting. We use the WordNet_all lexicon for both retrofitting and extrofitting to compare the methods in the ideal environment for retrofitting. We present the results in Table 2. Our method outperforms retrofitting on some of the word similarity tasks, MEN-3k and WordSim-353. We believe that extrofitting is less powerful on SimLex-999 and RG-65 because all of the word pairs in those datasets are included in the WordNet_all lexicon. Since retrofitting forces the word similarity to be improved by weighted averaging of the word vectors, it is prone to overfitting on the semantic lexicons. Extrofitting also uses synonyms to improve word similarity, but it works differently: it projects the synonyms close together on a new vector space and, at the same time, far from the other words. Therefore, our method can produce more generalized word representations than retrofitting.

Figure 1: Plots of the top-100 nearest words of cue words under different post-processing methods. We choose two cue words: one included in the semantic lexicons (love; left) and one that is not (soo; right).

We plot the top-100 nearest words using t-SNE (Maaten and Hinton, 2008), as shown in Figure 1. We find that retrofitting strongly collects synonym words together, whereas extrofitting weakly disperses the words, resulting in a loss in cosine similarity score. However, the result of extrofitting can be interpreted as generalization: the word vectors strengthen their own meaning by moving away from each other while still keeping synonyms relatively close together (see Table 3). When we list the top-10 nearest words, extrofitting shows more favorable results than retrofitting. We also observe that extrofitting can even be applied to words that are not included in the semantic lexicons. Lastly, we apply extrofitting to other well-known pretrained word vectors trained with different algorithms (see Subsection 4.1). The results are presented in Table 4. Extrofitting can also be applied to Word2Vec and Fasttext, enriching their word representations except on WordSim-353 and RG-65, respectively. We find that our method can distort well-established word embeddings. However, our results are noteworthy in that extrofitting can be applied to other kinds of pretrained word vectors for further enrichment.
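The evaluation protocol described above can be sketched as follows, assuming cosine similarity between word vectors is ranked against the human ratings (the helper names are illustrative, not the authors' evaluation script):

```python
import numpy as np
from scipy.stats import spearmanr

def evaluate_similarity(vectors, pairs, human_scores):
    """Spearman's rho between model similarities and human ratings.

    vectors      : dict mapping word -> np.ndarray embedding.
    pairs        : list of (w1, w2) word pairs.
    human_scores : gold similarity ratings aligned with `pairs`.
    """
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2 in pairs]
    return spearmanr(model_scores, human_scores)[0]
```

Because Spearman's correlation is rank-based, only the ordering of the model's similarities matters, not their absolute magnitudes — which is why extrofitting can lose raw cosine similarity between synonyms yet still score well.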

Conclusion
We propose a post-processing method that enriches not only word representations but also their vector space using semantic lexicons, which we call extrofitting. Our method takes a simple approach: (i) expanding the word dimensions, (ii) transferring semantic knowledge onto the word vectors, and (iii) projecting the vector space with enrichment. We show that our method outperforms another post-processing method, retrofitting, on some word similarity tasks. Our method is robust with respect to the dimension of the word vectors and the size of the vocabulary, and it has only one explainable hyperparameter: the number of dimensions to be expanded. Further, our method does not depend on the order of the synonym pairs. As future work, we will conduct further research to generalize our method and improve its performance. First, we can experiment on other word similarity datasets for generalization. Second, we can utilize an Autoencoder (Bengio et al., 2009) for non-linear projection, with the constraint of preserving the spatial information of each dimension of the word vectors.

Table 1: Spearman's correlation of extrofitted word vectors on word similarity tasks using semantic lexicons. Our method improves pretrained GloVe with different vocabulary sizes.

GloVe (Pennington et al., 2014) has lots of variations with respect to word dimension, number of tokens, and training sources. We used glove.6B trained on Wikipedia + Gigaword and glove.42B.300d trained on Common Crawl. The other pretrained GloVe vectors do not fit our experiment because they have different word dimensions or are case-sensitive. We also use 300-dimensional Word2Vec (Mikolov et al., 2013a) with negative sampling trained on the GoogleNews corpus.


Table 3:
List of the top-10 nearest words of cue words under different post-processing methods. We show the cosine similarity scores for two cue words, one included in the semantic lexicon (love) and one that is not (soo).

Table 4:
Spearman's correlation of extrofitted word vectors on word similarity tasks for pretrained word vectors from Word2Vec and Fasttext. Extrofitting can be applied to other kinds of pretrained word vectors.