Using pseudo-senses for improving the extraction of synonyms from word embeddings

The methods proposed recently for specializing word embeddings according to a particular perspective generally rely on external knowledge. In this article, we propose Pseudofit, a new method for specializing word embeddings according to semantic similarity without any external knowledge. Pseudofit exploits the notion of pseudo-sense for building several representations for each word and uses these representations for making the initial embeddings more generic. We illustrate the benefits of Pseudofit for acquiring synonyms and study several variants of Pseudofit from this perspective.


Introduction
The interest aroused by word embeddings in Natural Language Processing, especially for neural models, has led to the proposal of methods for creating them from texts (Mikolov et al., 2013; Pennington et al., 2014) but also for specializing them according to a particular viewpoint. This viewpoint generally comes in the form of a set of lexical relations. For instance, Kiela et al. (2015) specialize word embeddings towards semantic similarity or relatedness by relying either on synonyms or free lexical associations. Methods such as Retrofitting (Faruqui et al., 2015), Counter-fitting (Mrkšić et al., 2016) or PARAGRAM (Wieting et al., 2015) fall within the same framework.
The specialization of word embeddings can also come from the way they are built. For instance, Levy and Goldberg (2014) bring word embeddings towards similarity rather than relatedness by using dependency-based distributional contexts rather than linear bag-of-words contexts. Finally, some methods aim at improving word embeddings but without a clearly defined orientation, such as the All-but-the-Top method (Mu, 2018), which focuses on dimensionality reduction, or approaches exploiting morphological relations.
In this article, we propose Pseudofit, a method that improves word embeddings without external knowledge and focuses on semantic similarity and synonym extraction. The principle of Pseudofit is to exploit the notion of pseudo-sense coming from word sense disambiguation for building representations accounting for distributional variability and to create better word embeddings by bringing these representations closer together. We show the interest of Pseudofit and its variants through both intrinsic and extrinsic evaluations.

Method
The distributional representation of a word varies from one corpus to another. Without even taking into account the plurality of meanings of a word, this variability also exists inside any corpus C, even if it is quite homogeneous: the distributional representations of a word built from each half of C, C1 and C2, are not identical. However, from the more general viewpoint of its meaning, they should be identical, or at least very close, with their differences considered as incidental. Following this perspective, a representation resulting from the convergence of the representations built from C1 and C2 should be more generic and show better semantic similarity properties.
The method we propose, Pseudofit, formalizes this approach through the notion of pseudo-sense. This notion is related to the notion of pseudo-word introduced in the field of word sense disambiguation by Gale et al. (1992) and Schütze (1992). A pseudo-word is an artificial word resulting from the clustering of two or more different words, each of them being considered as one pseudo-sense of the pseudo-word. Pseudofit adopts the opposite viewpoint. For each word w, more precisely nouns in our case, it arbitrarily splits its occurrences into two sets: the occurrences of one set are labeled as pseudo-sense w1 while the occurrences of the other set are labeled as pseudo-sense w2. A distributional representation is built for w, w1 and w2 under the same conditions, with a neural model in our case. The second stage of Pseudofit adapts a posteriori the representation of w according to the convergence of the representations of w1 and w2. This adaptation is performed by exploiting the similarity relations between w, w1 and w2 in the context of a word embedding specialization method. By considering w, w1 and w2 simultaneously, Pseudofit benefits both from the variations between the representations of w1 and w2 and from the quality of the representation of w, since the latter is built from the whole of C while the two others are built from half of it.

Building of Word Embeddings
The first stage of Pseudofit consists in building a distributional representation of each word w and of its two pseudo-senses w1 and w2. The starting point of this process is the generation of a set of distributional contexts for each occurrence of w. Classically, this generation is based on a linear fixed-size window centered on the considered occurrence. The specificity of Pseudofit is that contexts are generated both for the target word and for one of its pseudo-senses. The pseudo-sense changes from one occurrence of w to the next, leading to the same frequency for w1 and w2. The generation of such contexts with a window of 3 words (before and after the target word policeman) is illustrated here for the following sentence: A policeman1 was arrested by another policeman2.
TARGET        CONTEXTS
policeman     {a, be, arrest (2), by (2), another}
policeman1    {a, be, arrest, by}
policeman2    {another, by, arrest}

This sentence, which is deliberately artificial, shows how three different context sets are built for a word in a corpus: the first line is built from all the occurrences of the target word; the second line is built from half of its occurrences, representing its first pseudo-sense, while the last line is built from the other half of its occurrences, representing its second pseudo-sense.
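To make the mechanism concrete, here is a minimal Python sketch of this context generation; `pseudo_sense_contexts` is a hypothetical helper (not the authors' implementation) that alternates the pseudo-sense label from one occurrence of a target to the next and collects the window contexts for both the target and its current pseudo-sense:

```python
from collections import Counter, defaultdict

def pseudo_sense_contexts(sentences, targets, window=3):
    """For each target noun, collect bag-of-word contexts for the word itself
    and for its two pseudo-senses. Occurrences alternate between pseudo-sense
    1 and pseudo-sense 2, so both pseudo-senses get the same frequency."""
    contexts = defaultdict(Counter)      # word or pseudo-sense -> context counts
    next_sense = defaultdict(lambda: 1)  # next pseudo-sense label per target
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in targets:
                continue
            window_words = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            sense = next_sense[tok]
            next_sense[tok] = 3 - sense  # alternate 1 -> 2 -> 1 -> ...
            for c in window_words:
                contexts[tok][c] += 1                 # global representation of w
                contexts[f"{tok}_{sense}"][c] += 1    # pseudo-sense representation
    return contexts

# The artificial example sentence, in lemmatized form:
ctx = pseudo_sense_contexts([["a", "policeman", "be", "arrest", "by",
                              "another", "policeman"]], {"policeman"})
```

On this sentence the helper reproduces the three context sets of the table above: the global contexts of policeman, those of its first pseudo-sense and those of its second one.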
The generated contexts are then used for building word embeddings. More precisely, we adopt the variant of the Skip-gram model (Mikolov et al., 2013) proposed by Levy and Goldberg (2014), which can take as input arbitrary contexts.

Convergence of Word Representations
The second stage of Pseudofit brings the representations of each target word w and of its pseudo-senses w1 and w2 closer together. This convergence aims at producing a more general representation of w by erasing the differences between the representations of w, w1 and w2, which are assumed to be incidental since these representations refer by nature to the same object.
The implementation of this convergence process relies on the PARAGRAM algorithm, which takes as inputs word embeddings and a set of binary lexical relations accounting for semantic similarity. PARAGRAM gradually modifies the input embeddings to bring closer together the vectors of the words that are part of similarity relations. This adaptation is controlled by a kind of regularization that tends to preserve the input embeddings. More formally, this twofold objective consists in minimizing the following objective function by stochastic gradient descent:

Σ_{(x1, x2) ∈ Li} [ max(0, δ − cos(x1, x2) + cos(x1, t1)) + max(0, δ − cos(x1, x2) + cos(x2, t2)) ] + λ Σ_w ||x_w^init − x_w||²

where the first sum expresses the convergence of the vectors according to the similarity relations while the second sum, modulated by the λ parameter, corresponds to the regularization term.
The specificity of PARAGRAM, compared to methods such as Retrofitting, lies in its adaptation term. While it logically tends to bring closer together the vectors of the words that are part of similarity relations (attracting term cos(x1, x2)), it also pushes them away from the vectors of the words that are not part of these relations (repelling terms cos(x1, t1) and cos(x2, t2)). More precisely, the relations are split into a set of mini-batches Li. For each word (vector xi) of a relation, a word (vector tj) outside the relation is selected among the words of the mini-batch of the current relation in such a way that tj is the closest word to xi according to the Cosine measure, which represents the most discriminative option. δ is the margin between the attracting and repelling terms.

The application of PARAGRAM to the embeddings resulting from the first stage of Pseudofit exploits the fact that a word and its pseudo-senses are supposed to be similar. Hence, for each word w, three similarity relations are defined and used by PARAGRAM for adapting the initial embeddings: (w, w1), (w, w2) and (w1, w2). Finally, only the representations of the words w are kept since they are built from a corpus that is twice as large as the one used for the pseudo-senses.

Experimental Setup
For implementing Pseudofit, we randomly select, at the sentence level, a one-billion-word subset of the Annotated English Gigaword corpus (Napoles et al., 2012). This corpus is made of news articles in English processed by the Stanford CoreNLP toolkit. We use this corpus in its lemmatized form. The building of the embeddings is performed with word2vecf, the adaptation of word2vec by Levy and Goldberg (2014), with the following parameter values: minimal count = 5, vector size = 300, window size = 5, 10 negative examples and 10^-5 for the subsampling probability of the most frequent words. For PARAGRAM, we set δ = 0.6 and λ = 10^-9, with the AdaGrad optimizer (Duchi et al., 2011) and 50 epochs. Retrofitting and Counter-fitting are used with the parameter values specified respectively in (Faruqui et al., 2015) and (Mrkšić et al., 2016).

Evaluation of Pseudofit
Our first evaluation of Pseudofit at the word level is a classical intrinsic evaluation consisting in measuring, for a set of word pairs, the Spearman's rank correlation between human judgments and the similarity of these words computed from their embeddings by the Cosine measure. This evaluation is performed for the nouns of three sufficiently large reference datasets: SimLex-999, MEN (Bruni et al., 2014) and MTurk-771 (Halawi et al., 2012). Table 1 clearly shows that Pseudofit significantly improves the initial embeddings for the three datasets. By contrast, it also shows that replacing PARAGRAM with Retrofitting or Counter-fitting, two other reference methods for specializing embeddings, does not lead to comparable improvements and can even degrade results.
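This intrinsic protocol can be sketched in a few lines. `evaluate_similarity` is a hypothetical helper; the Spearman correlation is recomputed from ranks with NumPy only (without tie handling, which a statistics library would add), and out-of-vocabulary pairs are skipped, as is usual in this evaluation:

```python
import numpy as np

def rank(values):
    # rank transform (no tie handling, enough for this sketch)
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(len(values))
    return ranks

def spearman(a, b):
    # Spearman = Pearson correlation of the rank-transformed values
    ra, rb = rank(np.asarray(a)), rank(np.asarray(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def evaluate_similarity(embeddings, pairs):
    """Spearman correlation between human similarity judgments and the cosine
    of the word embeddings. `embeddings` maps word -> vector and `pairs` is a
    list of (word1, word2, human_score) tuples."""
    human, system = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            v1, v2 = embeddings[w1], embeddings[w2]
            human.append(score)
            system.append(float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))))
    return spearman(human, system)
```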
Our second evaluation, which is our main focus, is a more extrinsic task consisting in extracting synonyms. This extraction is performed by ranking a set of candidate synonyms for each target word according to the similarity, computed here by the Cosine measure, of their embeddings. We evaluate the relevance of this ranking as in Information Retrieval with R-precision (R-prec.), MAP (Mean Average Precision) and precision at various ranks (P@r). Our reference is made up of the synonyms of WordNet (Miller, 1990), while both our target words and candidate synonyms are made up of the nouns with more than ten occurrences in each half of our corpus, which represents 20,813 nouns. Table 2 gives the results of this second evaluation for the 11,481 nouns among our 20,813 targets that have synonyms in WordNet. As in the first evaluation, Pseudofit significantly outperforms the initial embeddings. Moreover, replacing PARAGRAM with Retrofitting or Counter-fitting leads to a systematic decrease of results, which emphasizes the importance of the repelling term of PARAGRAM. This term probably prevents the representation of a word from being changed too much by its pseudo-senses, which are interesting variants in terms of representations but were built from only half of the corpus.

Finally, we performed a finer analysis of these results according to the frequency and the degree of ambiguity of the target words. Concerning frequency, Table 3 shows that Pseudofit is particularly efficient for the lower half of the target words in terms of frequency, with a large increase of 5.3 points for R-precision, 6.7 points for MAP, 7.0 points for P@1 and 5.2 points for P@2, while the largest increase for the higher half of the target words is 1.1 points for MAP.
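The ranking metrics used in this evaluation can be sketched for a single target word as follows; `ranking_metrics` is a hypothetical helper, and MAP is then simply the mean of the returned average precision over all target words:

```python
def ranking_metrics(ranked, gold, ranks=(1, 2, 5)):
    """IR-style evaluation of a ranked list of candidate synonyms against a
    gold set of synonyms: R-precision, average precision, precision at ranks.

    ranked : list of candidates ordered by decreasing cosine similarity
    gold   : set of reference synonyms (e.g. from WordNet)
    """
    relevant = [c in gold for c in ranked]
    R = len(gold)
    # R-precision: precision among the first R candidates
    r_prec = sum(relevant[:R]) / R
    # average precision: mean of the precision at each relevant position
    hits, ap = 0, 0.0
    for i, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            ap += hits / i
    ap /= R
    # precision at the requested ranks
    p_at = {r: sum(relevant[:r]) / r for r in ranks}
    return r_prec, ap, p_at
```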
One possible explanation of this gap between high and low frequency words is linked to the degree of ambiguity of words: high frequency words are more likely to be polysemous and Pseudofit does not take into account the polysemy of words. Figure 1 tends to confirm this hypothesis by showing that the improvement brought by Pseudofit for a word is inversely proportional to its ambiguity as estimated by its number of senses in WordNet (words with at most 10 senses cover 98.9% of the nouns of our evaluation).

Variants of Pseudofit
We defined and tested several variants of Pseudofit. The first one, Pseudofit max, focuses on the strategy for selecting {t_j} in PARAGRAM. The results of Table 1 are obtained with a setting where half of the {t_j} are selected randomly. In Pseudofit max, all the {t_j} are selected according to their similarity with {x_i}.
The second variant, Pseudofit 3 pseudo-senses, aims at determining if increasing the number of pseudo-senses, from two to three at first, can have a positive impact on results.
The third variant, Pseudofit context, tests the interest of defining pseudo-senses for the words of distributional contexts. In this configuration, pseudo-senses are defined for all nouns, verbs and adjectives with more than 21 occurrences in the corpus, which corresponds to a minimal frequency of 10 in each half of the corpus.
Finally, similarly to the second variant, the last variant, Pseudofit fus-*, adds a supplementary representation of the target word. However, this representation is not an additional pseudo-sense but an aggregation of its already existing pseudo-senses, which can be viewed as another global representation of the target word. Three aggregation methods are considered: Pseudofit fus-addition performs an elementwise addition of the embeddings of the pseudo-senses, Pseudofit fus-average computes their mean while Pseudofit fus-max-pooling takes their elementwise maximal value.
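The three aggregation schemes can be sketched as follows, assuming the pseudo-sense embeddings are available as NumPy vectors; `aggregate_pseudo_senses` is a hypothetical helper, not the authors' code:

```python
import numpy as np

def aggregate_pseudo_senses(sense_vecs, mode):
    """Build the supplementary representation of a target word from its
    pseudo-sense embeddings, for the three Pseudofit fus-* variants."""
    m = np.stack(sense_vecs)
    if mode == "addition":
        return m.sum(axis=0)    # Pseudofit fus-addition: elementwise addition
    if mode == "average":
        return m.mean(axis=0)   # Pseudofit fus-average: elementwise mean
    if mode == "max-pooling":
        return m.max(axis=0)    # Pseudofit fus-max-pooling: elementwise maximum
    raise ValueError(f"unknown aggregation mode: {mode}")
```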
Each presented variant outperforms the base version of Pseudofit but Table 4 also shows that not all variants are of equal interest. From the viewpoint of both the absolute level of their results and the significance of their difference with Pseudofit, Pseudofit max and Pseudofit fus-max-pooling are clearly the most interesting variants. Their combination, Pseudofit max+fus-max-pooling, leads to our best results and significantly outperforms Pseudofit for all measures. Among the Pseudofit fus-* variants, Pseudofit fus-max-pooling and Pseudofit fus-average are close to each other and clearly exceed Pseudofit fus-addition. The results of Pseudofit 3 pseudo-senses show that using more than two pseudo-senses per word faces the problem of having too few occurrences for each pseudo-sense. The same frequency effect, at the level of contexts, probably explains the very limited impact of the introduction of pseudo-senses in contexts in the case of Pseudofit context.

Sentence Similarity
Our final evaluation, which is fully extrinsic, examines the impact of Pseudofit on the identification of semantic similarity between sentences. More precisely, we adopt the STS Benchmark dataset on semantic textual similarity (Cer et al., 2017). The overall principle of this task is similar to the word similarity task of our first evaluation but at the level of sentences: the similarity of a set of sentence pairs is computed by the system to evaluate and compared, using the Pearson correlation coefficient, against a gold standard produced by human annotators.
This framework is interesting for the evaluation of Pseudofit because the computation of the similarity of a pair of sentences can be achieved by unsupervised approaches based on word embeddings in a very competitive way, as demonstrated by Hill et al. (2016). More precisely, the approach we adopt is a classical baseline that composes the embeddings of the plain words of each sentence to compare by elementwise addition and computes the Cosine measure between the two resulting vectors. For building the representation of a sentence, we compare the use of our initial embeddings with that of the embeddings produced by Pseudofit max+fus-max-pooling, the best variant of Pseudofit. For this experiment, pseudo-senses are distinguished not only for nouns but more generally for all nouns, verbs and adjectives with more than 21 occurrences in the corpus. Table 5 shows the result of this evaluation for the 1,379 sentence pairs of the test part of the STS Benchmark dataset. As for the two previous evaluations, the use of the embeddings modified by Pseudofit leads to a significant improvement of results compared to the initial embeddings (with the same evaluation of statistical significance as for word similarity), which demonstrates that the improvement at the word level can be transposed to a larger scale. Table 5 also shows four reference results from (Cer et al., 2017): the lowest and the best baselines based on averaged word embeddings (Skip-gram and GloVe respectively), which are very close to our approach, and the best (Conneau et al., 2017) and the lowest (Duma and Menzel, 2017) unsupervised systems. Although our goal is not to compete with the best systems, it is interesting to note that our results are in line with the state of the art since they significantly outperform the two baselines and the lowest unsupervised system, as well as other unsupervised systems mentioned in (Cer et al., 2017).
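This additive baseline can be sketched as follows; `sentence_similarity` is a hypothetical helper that sums the embeddings of the in-vocabulary words of each sentence and compares the two sums with the Cosine measure:

```python
import numpy as np

def sentence_similarity(sent1, sent2, embeddings):
    """Unsupervised sentence similarity baseline: compose each sentence by
    elementwise addition of the embeddings of its words, then compare the
    two resulting vectors with the cosine. `embeddings` maps lemma -> vector
    and sentences are given as lists of lemmas of plain (content) words."""
    def compose(tokens):
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        return np.sum(vecs, axis=0)
    v1, v2 = compose(sent1), compose(sent2)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```

On the STS Benchmark, the system scores produced this way would then be compared to the human judgments with the Pearson correlation coefficient.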

Conclusion and Perspectives
In this article, we presented Pseudofit, a method that specializes word embeddings towards semantic similarity without external knowledge by exploiting the variability of distributional contexts. This method can be described as hybrid since it operates both before and after the building of the word embeddings. A set of intrinsic and extrinsic evaluations demonstrates the value of the word embeddings produced by Pseudofit and its variants, with a particular emphasis on the extraction of synonyms.
In the presented work, the principles underlying Pseudofit, in particular the generation and convergence of different representations of a word, were tested only within a single corpus. In line with work on word meta-embeddings (Yin and Schütze, 2016), it would be interesting to apply these principles to representations built from several corpora, possibly in different languages.