De-Conflated Semantic Representations

One major deficiency of most semantic representation techniques is that they usually model a word type as a single point in the semantic space, hence conflating all the meanings that the word can have. Addressing this issue by learning distinct representations for individual meanings of words has been the subject of several research studies in the past few years. However, the generated sense representations are either not linked to any sense inventory or are unreliable for infrequent word senses. We propose a technique that tackles these problems by de-conflating the representations of words based on the deep knowledge it derives from a semantic network. Our approach provides multiple advantages in comparison to the past work, including its high coverage and the ability to generate accurate representations even for infrequent word senses. We carry out evaluations on six datasets across two semantic similarity tasks and report state-of-the-art results on most of them.


Introduction
Modeling the meanings of linguistic items in a machine-interpretable form, i.e., semantic representation, is one of the oldest, yet most active, areas of research in Natural Language Processing (NLP). The field has recently experienced a resurgence of research interest with the new blood injected in its veins by neural network-based models that view the representation task as a language modeling problem and learn dense representations (usually referred to as embeddings) by efficiently processing massive amounts of texts. However, either in its conventional count-based form (Turney and Pantel, 2010) or the recent predictive approach, the prevailing objective of representing each word type as a single point in the semantic space has a major limitation: it ignores the fact that words can have multiple meanings and conflates all these meanings into a single representation. This objective can have negative impacts on accurate semantic modeling, e.g., semantically unrelated words that are synonymous to different senses of a word are pulled towards each other in the semantic space (Neelakantan et al., 2014).
Recently, there has been a growing interest in addressing the meaning conflation deficiency of word representations. A series of techniques associate a word with multiple points in the semantic space by clustering its contexts in a given text corpus and learning distinct representations for individual clusters (Reisinger and Mooney, 2010; Huang et al., 2012). However, these techniques usually assume a fixed number of word senses per word type, disregarding the fact that the number of senses of a word can range from one (monosemy) to dozens. Neelakantan et al. (2014) tackled this issue by allowing the number to be dynamically adjusted for each word during training. Nevertheless, this approach and all the other clustering-based techniques still suffer from the fact that the computed sense representations are not linked to any sense inventory, a linking which would require large amounts of sense-annotated data (Agirre et al., 2006). In addition, because of their dependence on knowledge derived from a text corpus, these techniques are generally unable to learn accurate representations for word senses that are infrequent in the underlying corpus.
Knowledge-based techniques tackle these issues by deriving sense-specific knowledge from external sense inventories, such as WordNet (Fellbaum, 1998), and learning representations that are linked to the sense inventory. These approaches either use sense definitions and employ Word Sense Disambiguation (WSD) to gather sense-specific contexts (Iacobacci et al., 2015) or take advantage of the properties of WordNet, such as synonymy and direct semantic relations (Rothe and Schütze, 2015). However, the non-optimal WSD techniques and the shallow utilization of knowledge from WordNet do not allow these techniques to learn accurate and high-coverage semantic representations for all senses in the inventory.
We propose a technique that de-conflates a given word representation into its constituent sense representations by exploiting deep knowledge from the semantic network of WordNet. Our approach provides the following three main advantages in comparison to past work: (1) our representations are linked to the WordNet sense inventory and, accordingly, the number of senses for a word is a dynamic parameter which matches that defined by WordNet; (2) the deep exploitation of WordNet's semantic network allows us to obtain accurate semantic representations, even for word senses that are infrequent in generic text corpora; and (3) our methodology involves only minimal parameter tuning and can be effectively applied to any sense inventory that is viewable as a semantic network and to any word representation technique. We evaluate our sense representations in two tasks: word similarity (both in-context and in-isolation) and cross-level semantic similarity. Experimental results show that the proposed technique can provide consistently high performance across six datasets, outperforming the recent state of the art on most of them.

De-Conflated Representations
Preliminaries. Our proposed approach takes a set of pre-trained word representations and uses the graph structure of a semantic lexical resource in order to de-conflate the representations into those of word senses. Therefore, our approach firstly requires a set of pre-trained word representations (e.g., word embeddings). Any model that maps a given word to a fixed-size vector representation (i.e., vector space model) can be used by our approach. In our experiments, we opted for a set of publicly available word embeddings (cf. §3.1).
Secondly, we require a lexical resource whose semantic relations allow us to view it as a graph $G = (V, E)$, where each vertex in the set of vertices $V$ corresponds to a concept and the edges in $E$ denote lexico-semantic relationships among these vertices. Each concept $c \in V$ is mapped to a set of word senses by a mapping function $\mu(c) : c \rightarrow \{s_1, \ldots, s_l\}$. WordNet, the de facto community standard sense inventory, is a suitable resource that satisfies these properties. WordNet can be readily represented as a semantic graph in which vertices are synsets and edges are the semantic relations that connect these synsets (e.g., hypernymy and meronymy). The mapping function in WordNet maps each synset to the set of synonymous words it contains (i.e., word senses).

Overview of the approach
Our goal is to compute a semantic representation that places a given word sense in an existing semantic space of words. We achieve this by leveraging word representations as well as the knowledge derived from WordNet. The gist of our approach lies in its computation of a list of sense biasing words for a given word sense. To this end, we first analyze the semantic network of WordNet and extract a list of the most representative words that can effectively pinpoint the semantics of individual synsets (§2.2). We then leverage an effective technique which learns semantic representations for individual word senses by placing the senses in the proximity of their corresponding sense biasing words (§2.3).

Determining sense biasing words
Algorithm 1 shows the procedure we use to extract from WordNet a list of sense biasing words for a given target synset $y_t$. The algorithm receives as its inputs the semantic graph of WordNet and the mapping function $\mu(\cdot)$, and outputs an ordered list of biasing words $B_t$ for $y_t$. The list comprises the words most semantically related to synset $y_t$, which can best represent and pinpoint its meaning. We leverage a graph-based algorithm for the computation of the sense biasing words.
Algorithm 1 Get sense biasing words for synset $y_t$
Require: Graph $G = (V, E)$ with vertices $V$ (of $m$ synsets) and edges $E$ (semantic relationships between synsets)
Require: Function $\mu(y_i)$ that returns for a given synset $y_i$ the words it contains
Require: Target synset $y_t \in V$ for which a sense biasing word sequence is required
Ensure: The sequence $B_t$ of sense biasing words for synset $y_t$
 1: $B_t \leftarrow ()$
 2: for all word $w$ in $\mu(y_t)$ do
 3:     append $w$ to $B_t$
 4: for all $y_i \in V$ with $i \neq t$ do
 5:     $p_i \leftarrow$ PPR weight of $y_i$ with respect to $y_t$
 6: $(y^*_h) \leftarrow$ the synsets $\{y^* \in V : y^* \neq y_t\}$ sorted by their weights $p$
 7: for all $y^*_h$ in $(y^*)$ do
 8:     for all word $w$ in $\mu(y^*_h)$ do
 9:         if $w \notin B_t$ then
10:             append $w$ to $B_t$
11: return $B_t$

Specifically, we use the Personalized PageRank algorithm (PPR; Haveliwala, 2002), which has been extensively used by several NLP applications (Yeh et al., 2009; Niemann and Gurevych, 2011; Agirre et al., 2014). To this end, we first represent the semantic network of WordNet as a row-stochastic transition matrix $M \in \mathbb{R}^{m \times m}$, where $m$ is the number of synsets in WordNet ($|V|$). The cell $M_{ij}$ of $M$ is set to the inverse of the degree of $i$ if there is a semantic relationship between synsets $i$ and $j$, and to zero otherwise. We compute the PPR distribution for a target synset $y_t$ by using the power iteration method $P^{t+1} = (1 - \sigma) P^{0} + \sigma M P^{t}$, where $\sigma$ is the damping factor (usually set to 0.85) and $P^{0}$ is a one-hot initialization vector in which the dimension corresponding to $y_t$ is set to 1.0. The weight $p_i$ in line 5 is the value of the $i$th dimension of the PPR vector $P$ computed for the synset $y_t$. This weight can be seen as the importance of the synset corresponding to the $i$th dimension (i.e., $y_i$) to $y_t$. When applied to a semantic network, such as the WordNet graph, this importance can be interpreted as semantic relevance. Hence, the value of $p_i$ denotes the extent of semantic relatedness between $y_i$ and $y_t$. We use this notion to retrieve a list of the words most semantically related to $y_t$.
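The power iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch on a hypothetical four-node toy graph, not the implementation used in the experiments: the function names, the undirected treatment of relations, and the fixed number of iterations are all assumptions. Because the transition matrix is row-stochastic, the sketch propagates mass along its transpose so that the scores keep summing to one.

```python
def transition_matrix(num_synsets, relations):
    """Row-stochastic M: M[i][j] = 1/degree(i) if i and j are related, else 0."""
    neighbours = [set() for _ in range(num_synsets)]
    for i, j in relations:  # assumption: relations are treated as undirected
        neighbours[i].add(j)
        neighbours[j].add(i)
    M = [[0.0] * num_synsets for _ in range(num_synsets)]
    for i, related in enumerate(neighbours):
        for j in related:
            M[i][j] = 1.0 / len(related)
    return M


def personalized_pagerank(M, target, sigma=0.85, iterations=100):
    """Power iteration P <- (1 - sigma) * P0 + sigma * (M^T P), restarted at `target`."""
    m = len(M)
    p0 = [1.0 if i == target else 0.0 for i in range(m)]
    p = list(p0)
    for _ in range(iterations):
        # Mass at synset i moves to synset j with probability M[i][j].
        moved = [sum(M[i][j] * p[i] for i in range(m)) for j in range(m)]
        p = [(1.0 - sigma) * p0[j] + sigma * moved[j] for j in range(m)]
    return p


# Hypothetical toy chain of four synsets: 0 - 1 - 2 - 3
M = transition_matrix(4, [(0, 1), (1, 2), (2, 3)])
scores = personalized_pagerank(M, target=0)
print([round(s, 3) for s in scores])
```

On a graph of WordNet's size, a sparse matrix representation would be the natural choice; the dense lists here only keep the sketch dependency-free.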
To retrieve this list, we sort the synsets $\{y^* \in V : y^* \neq y_t\}$ according to their PPR values in descending order. We then iterate (lines 7-10) over the sorted list $(y^*)$ and for each synset $y^*_h$ append to the list $B_t$ all the words in $y^*_h$ (i.e., $\mu(y^*_h)$). However, in order to ensure that the words in the target synset $y_t$ appear as the most representative words in $B_t$, we first assign these words to the list (line 3). Finally, the algorithm returns the ordered list $B_t$ of sense biasing words for the target synset $y_t$. Table 1 shows a sample of top biasing words extracted for the two senses of the noun digit: the numerical and the anatomical senses.¹ We explain in §2.3 how we use the sense biasing lists to learn sense-specific representations. Note that the size of the list is equal to the total number of strings in WordNet. However, we observed that taking a very small portion of the top-ranking elements in the lists is enough to generate representations that perform very similarly to those generated when using the full-sized lists (see §3.1).

Table 1: Sample of top sense biasing words for two senses of the noun digit.
#   Sense biasing words
1   dactyl, finger, toe, thumb, pollex, body part, nail, minimus, tarsier, webbed, extremity, appendage
2   figure, cardinal number, cardinal, integer, whole number, numeration system, number system, system of numeration, large integer, constituent, element, digital
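The list construction described above can be sketched as follows. This is a minimal illustration rather than the original code: the function name, the `top_k` argument, the synset identifiers, and the PPR weights in the example are all hypothetical.

```python
def sense_biasing_words(target, ppr_scores, mu, top_k=None):
    """Return the ordered biasing words for `target`.

    ppr_scores: synset id -> PPR weight with respect to `target`
    mu:         synset id -> list of words in that synset
    """
    biasing = []
    seen = set()

    def append_new(words):
        for word in words:
            if word not in seen:  # keep only the first (highest-ranked) occurrence
                seen.add(word)
                biasing.append(word)

    append_new(mu[target])  # the target synset's own words lead the list
    others = [y for y in mu if y != target]
    others.sort(key=lambda y: ppr_scores[y], reverse=True)
    for synset in others:
        append_new(mu[synset])
    return biasing if top_k is None else biasing[:top_k]


# Hypothetical synset ids, synset contents, and PPR weights, for illustration only.
mu = {
    "digit.anatomy": ["digit", "dactyl"],
    "finger": ["finger"],
    "thumb": ["thumb", "pollex", "finger"],
    "integer": ["integer", "whole number"],
    "toe": ["toe"],
}
ppr_scores = {"digit.anatomy": 0.40, "finger": 0.25, "toe": 0.20,
              "thumb": 0.10, "integer": 0.05}
print(sense_biasing_words("digit.anatomy", ppr_scores, mu, top_k=5))
```

Tracking already-seen words in a set keeps the membership check in line 9 of the algorithm cheap even when the list grows to the size of the whole vocabulary.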

Learning sense representations
Let $\mathcal{V}$ be the set of pre-trained $d$-dimensional word representations. Our objective here is to compute a set $\mathcal{V}^* = \{v^*_{s_1}, \ldots, v^*_{s_n}\}$ of representations for $n$ word senses $\{s_1, \ldots, s_n\}$ in the same $d$-dimensional semantic space of words. We achieve this for each sense $s_i$ by de-conflating the representation $v_{s_i}$ of its corresponding lemma and biasing it towards the representations of the words in $B_i$. Specifically, we obtain a representation $v^*_{s_i}$ for a word sense $s_i$ by solving:

$$\arg\min_{v^*_{s_i}} \; \alpha \, d(v^*_{s_i}, v_{s_i}) + \sum_{b_{ij} \in B_i} \delta_{ij} \, d(v^*_{s_i}, v_{b_{ij}}) \qquad (1)$$

where $v_{s_i}$ and $v_{b_{ij}}$ are the respective word representations ($\in \mathcal{V}$) of the lemma of $s_i$ and the $j$th biasing word in the list of biasing words for $s_i$, i.e., $B_i$, and $d(\cdot, \cdot)$ is the squared Euclidean distance. The first term in Formula 1 requires the representation of the word sense $s_i$ (i.e., $v^*_{s_i}$) to be similar to that of its corresponding lemma, i.e., $v_{s_i}$, whereas the second term encourages $v^*_{s_i}$ to be in the proximity of its biasing words in the semantic space. The above criterion is similar to the frameworks of Das and Smith (2011) and Faruqui et al. (2015) which, though being convex, is usually solved for efficiency reasons by an iterative method proposed by Bengio et al. (2007). Following these works, we obtain the following equation for computing the representation of a word sense $s_i$:

$$v^*_{s_i} = \frac{\alpha \, v_{s_i} + \sum_{b_{ij} \in B_i} \delta_{ij} \, v_{b_{ij}}}{\alpha + \sum_{b_{ij} \in B_i} \delta_{ij}} \qquad (2)$$

We define $\delta_{ij}$ as $e^{-\lambda r(i,j)}$, where $r(i, j)$ denotes the rank of the word $b_{ij}$ in the list $B_i$. This is essentially an exponential decay function that gives more importance to the top-ranking biasing words for $s_i$. The hyperparameter $\alpha$ denotes the extent to which $v^*_{s_i}$ is kept close to its corresponding lemma representation $v_{s_i}$. Following Faruqui et al. (2015), we set $\alpha$ to 1. The only parameter to be tuned in our experiments is $\lambda$. We discuss the tuning of this parameter in §3.1.

¹ The first and third senses of the noun digit in WordNet 3.0.
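To make the update concrete, here is a minimal sketch that assumes the objective uses squared Euclidean distance, in which case the sense vector reduces to a decay-weighted average of the lemma vector and the biasing-word vectors. The function name, the 1-based ranks, and the toy 2-d vectors are assumptions made for illustration only.

```python
import math


def deconflate(lemma_vec, biasing_vecs, lam=0.2, alpha=1.0):
    """Sense vector as a decay-weighted average of the lemma and its biasing words.

    biasing_vecs must be ordered from the top-ranked biasing word downwards.
    """
    numerator = [alpha * x for x in lemma_vec]
    denominator = alpha
    for rank, vec in enumerate(biasing_vecs, start=1):  # assumption: ranks start at 1
        delta = math.exp(-lam * rank)  # exponential decay over the rank
        numerator = [n + delta * x for n, x in zip(numerator, vec)]
        denominator += delta
    return [n / denominator for n in numerator]


# Hypothetical 2-d vectors: the lemma lies on the x-axis, the biasing words near the y-axis.
lemma = [1.0, 0.0]
biasing = [[0.0, 1.0], [0.1, 0.9]]
sense = deconflate(lemma, biasing)
print(sense)
```

With an empty biasing list the function returns the lemma vector unchanged, and larger values of `lam` concentrate the pull on the first few biasing words.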
The representation of a synset $y_i$ can be accordingly calculated as the centroid of the vectors of its associated word senses, i.e.,

$$v^*_{y_i} = \frac{1}{|\mu(y_i)|} \sum_{s \in \mu(y_i)} v^*_{s}$$

As a result of this procedure, we obtain the set $\mathcal{V}^*$ of $n$ sense representations in the same semantic space as the word representations in $\mathcal{V}$. In fact, we now have a unified semantic space which enables a direct comparison of the two types of linguistic items. In §3.3 we evaluate our approach in the word to sense similarity measurement framework. We show in Table 2 the closest words to the word bass and two of its senses, music and fish, in our unified semantic space. We can see in row #1 a mixture of both meanings when the word representation is used, whereas the closest words to the senses (rows #2 and #3) are mostly in-domain and specific to the corresponding sense.
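Since senses, synsets, and words share one space, a synset centroid can be compared to word vectors directly. The sketch below illustrates this with made-up 2-d vectors; the helper names and all the numbers are hypothetical and are not taken from Table 2.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def closest_words(query_vec, word_vectors, k=2):
    """The k words whose vectors have the highest cosine similarity to query_vec."""
    ranked = sorted(word_vectors,
                    key=lambda w: cosine(query_vec, word_vectors[w]),
                    reverse=True)
    return ranked[:k]


# Hypothetical 2-d vectors; axis 0 stands for "music", axis 1 for "fish".
word_vectors = {"guitar": [0.9, 0.1], "drum": [0.8, 0.2],
                "trout": [0.1, 0.9], "perch": [0.2, 0.8]}
fish_sense_vecs = [[0.2, 0.9], [0.1, 1.0]]  # made-up senses grouped in one synset
synset_vec = centroid(fish_sense_vecs)
print(closest_words(synset_vec, word_vectors))
```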
To exhibit another interesting property of our sense representation approach, we depict in Figure 1 the word digit and its numerical and anatomical senses (from the example in Table 1) in a 2-d semantic space, along with a sample set of words in their proximity. We can see that the word digit is placed in the semantic space in the neighbourhood of words from the numerical domain (lower left of the figure), mainly due to the dominance (Sanderson and Van Rijsbergen, 1999) of this sense in the general-domain corpus on which the word embeddings in our experiments were trained (cf. §3.1). However, upon de-conflation, the emerging anatomical sense of the word is shifted towards the region in the semantic space which is occupied by anatomical words (upper right of the figure). A clustering-based sense representation technique would have failed to accurately represent the infrequent anatomical meaning of digit by analyzing a general-domain corpus (such as the one used here). In contrast, our sense representation technique, thanks to its proper usage of knowledge from a sense inventory, is effective in unveiling and accurately modeling less frequent or domain-specific senses of a given word.
Please note that any vector space model representation technique can be used for the pre-training of the word representations in $\mathcal{V}$. Also, the list of sense biasing words can be obtained for larger sense inventories, such as FreeBase (Bollacker et al., 2008) or BabelNet (Navigli and Ponzetto, 2012). We leave the exploration of other ways of computing sense biasing words to future work.

Experiments
We benchmarked our sense representation approach against several recent techniques on two standard tasks: word similarity (§3.2), for which we evaluate on both in-isolation and in-context similarity datasets, and cross-level semantic similarity (§3.3).

Experimental setup
Pre-trained word representations. As our word representations, we used the 300-d Word2vec (Mikolov et al., 2013) word embeddings trained on the Google News dataset, mainly for their popularity across different NLP applications. However, our approach is equally applicable to any count-based representation technique (Baroni and Lenci, 2010; Turney and Pantel, 2010) or any other embedding approach (Pennington et al., 2014; LeCun et al., 2015). We leave the evaluation and comparison of various word representation techniques with different training approaches, objectives, and dimensionalities to future work.
Parameter tuning. Recall from §2.3 that our procedure for learning sense representations needs only one parameter to be tuned, i.e., λ. We did not perform an extensive tuning of this parameter and set its value to 1/5 after trying four different values (1, 1/2, 1/5, and 1/10) on a small validation dataset. We leave a more systematic tuning of the parameter and the choice of alternative decay functions (cf. §2.3) to future work.
The size of the sense biasing words lists. Also recall from §2.2 that the extracted lists of sense biasing words were originally as large as the total number of unique strings in WordNet (around 150K in ver. 3.0). However, given that we use an exponential decay function in our learning algorithm (cf. §2.3), the impact of the low-ranking words in the list is negligible. In fact, we observed that taking a very small portion of the top-ranking words, i.e., the top 25, produces similarity scores that are on par with those generated when the full lists were considered. Therefore, we experimented with the down-sized lists, which enabled us to quickly generate sense representations for all word senses in WordNet.
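A quick calculation illustrates why truncation costs so little. Assuming ranks start at 1 (an assumption; the starting rank is not stated), the sketch below computes the share of the total decay weight that falls on the top 25 positions for λ = 1/5 and a list of 150K entries; it comes to roughly 99%. It looks only at the weights, not at the word vectors themselves.

```python
import math

lam = 1 / 5          # decay rate selected in the parameter tuning above
list_size = 150_000  # approximate number of unique strings in WordNet 3.0
top_k = 25

# Assumption: ranks start at 1, so the r-th biasing word gets weight exp(-lam * r).
weights = [math.exp(-lam * rank) for rank in range(1, list_size + 1)]
share_of_top_k = math.fsum(weights[:top_k]) / math.fsum(weights)
print(f"weight share of the top {top_k} words: {share_of_top_k:.4f}")
```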

Word similarity
Comparison systems. We compared our results against nine other sense representation techniques: the WordNet-based approaches of Pilehvar and Navigli (2015) […] it very suitable for the evaluation of sense representation techniques. For each of the datasets, we list the results that are reported by any of our comparison systems.
Similarity measurement. For the SCWS dataset, we follow past work (Reisinger and Mooney, 2010; Huang et al., 2012) and report the results according to two system configurations: (1) AvgSim, where the similarity between two words is computed as the average of all the pairwise similarities between their senses, and (2) AvgSimC, where each pairwise sense similarity is weighted by the relevance of each sense to its corresponding context. For all the other datasets, since words are not provided with any context (they are in isolation), we measure the similarity between two words as that between their most similar senses. In all the experiments, we use cosine similarity as our similarity measure.
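The three measures described above can be sketched as follows. The function names are illustrative, the relevance weights are assumed to be given (how they are estimated from the context is not covered here), and normalising AvgSimC by the total weight is an assumption of this sketch.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def avg_sim(senses1, senses2):
    """AvgSim: mean of all pairwise sense similarities."""
    sims = [cosine(s1, s2) for s1 in senses1 for s2 in senses2]
    return sum(sims) / len(sims)


def avg_sim_c(senses1, senses2, relevance1, relevance2):
    """AvgSimC: pairwise similarities weighted by each sense's relevance to its context."""
    total = 0.0
    weight_sum = 0.0
    for s1, r1 in zip(senses1, relevance1):
        for s2, r2 in zip(senses2, relevance2):
            total += r1 * r2 * cosine(s1, s2)
            weight_sum += r1 * r2
    return total / weight_sum  # assumption: normalise by the total weight


def max_sim(senses1, senses2):
    """In-isolation setting: similarity of the two closest senses."""
    return max(cosine(s1, s2) for s1 in senses1 for s2 in senses2)


# Hypothetical sense vectors and context relevance weights.
senses_a = [[1.0, 0.0], [0.0, 1.0]]
senses_b = [[1.0, 0.0]]
print(avg_sim(senses_a, senses_b))
print(avg_sim_c(senses_a, senses_b, [0.9, 0.1], [1.0]))
print(max_sim(senses_a, senses_b))
```

When the context strongly favours one sense, AvgSimC moves from the plain average towards the similarity of that sense alone.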

Experimental results
Tables 4 and 3 show the results of our system, DECONF, and the comparison systems on the SCWS and the other four similarity datasets, respectively. In both tables we also report the word vectors baseline, whenever it is available, which is computed by directly comparing the corresponding word representations of the two words ($\in \mathcal{V}$). Please note that the word-based baseline does not apply to the approach of Pilehvar and Navigli (2015), as it is purely based on the semantic network of WordNet and does not use any pre-trained word embeddings.
We can see from the tables that our sense representations obtain considerable improvements over those of words across the five datasets. This highlights the fact that the de-conflation of word representations into those of their individual meanings has been highly beneficial. On the SCWS dataset, DECONF outperforms all the recent state-of-the-art sense representation techniques (in their best settings), which demonstrates the effectiveness of our approach in capturing the semantics of specific meanings of the words. The improvement is consistent across both system configurations (i.e., AvgSim and AvgSimC). Moreover, the state-of-the-art WordNet-based approach of Rothe and Schütze (2015) uses the same initial word vectors as DECONF does (cf. §3.1). Hence, the improvement we obtain indicates that our approach has made better use of the sense-specific knowledge encoded in WordNet. As seen in Table 3, our approach shows competitive performance on the other four datasets. The YP-130 dataset focuses on verb similarity, whereas SimLex-999 contains verbs and adjectives and MEN-3K has word pairs with different parts of speech (e.g., a noun compared to a verb). The results we obtain on these datasets exhibit the reliability of our approach in modeling non-nominal word senses.

Discussion
The similarity scale of the SimLex-999 dataset is different from our other word similarity benchmarks in that it assigns relatively low scores to antonymous pairs. For instance, sunset-sunrise and man-woman in this dataset are assigned the respective similarities of 2.47 and 3.33 (on a [0, 10] similarity scale), which is in the same range as the similarity between word pairs with slight domain relatedness, such as head-nail (2.47), air-molecule (3.05), or succeed-try (3.98). In fact, we observed that tweaking the similarity scale of our system so that it diminishes the similarity scores between antonyms can result in a significant performance improvement on this dataset. To this end, we performed an experiment in which the similarity of a word pair was simply divided by five whenever the two words belonged to synsets that were linked by the antonymy relation. We observed that the performance on the SimLex-999 dataset increased to 61.1 (from 54.2) and 59.0 (from 51.7) according to Pearson (r × 100) and Spearman (ρ × 100) correlation scores, respectively.
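The antonymy adjustment described above amounts to a single conditional rescaling. The sketch below is a hypothetical illustration: the function name, the representation of antonymy links as a set of synset-id pairs, and the identifiers in the example are all assumptions.

```python
def adjust_for_antonymy(similarity, synsets1, synsets2, antonym_links, factor=5.0):
    """Divide the score by `factor` if any synset of word 1 is linked
    by antonymy to any synset of word 2; otherwise return it unchanged."""
    for a in synsets1:
        for b in synsets2:
            if (a, b) in antonym_links or (b, a) in antonym_links:
                return similarity / factor
    return similarity


# Hypothetical synset ids and antonymy links, for illustration only.
antonym_links = {("sunset#1", "sunrise#1")}
print(adjust_for_antonymy(0.5, ["sunrise#1"], ["sunset#1", "sunset#2"], antonym_links))
print(adjust_for_antonymy(0.5, ["head#1"], ["nail#1"], antonym_links))
```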

Cross-Level semantic similarity
In addition to the word similarity benchmark, we evaluated the performance of our representations in the cross-level semantic similarity measurement framework. To this end, we opted for the SemEval-2014 task on Cross-Level Semantic Similarity (Jurgens et al., 2014, CLSS). The word to sense similarity subtask of this task, with 500 instances in its test set, provides a suitable benchmark for the evaluation of sense representation techniques.
For a word sense s and a word w, we compute the similarity score according to four different strategies: the similarity of s to the most similar sense of w (MaxSim), the average similarity of s to the individual senses of w (AvgSim), and the direct similarity of s to w when the latter is modeled either as its word representation (Sense-to-Word or S2W) or as the centroid of its senses' representations (Sense to aggregated word senses or S2A). For this task, we can only compare against the publicly available sense representations of Iacobacci et al. (2015), Rothe and Schütze (2015), Pilehvar and Navigli (2015) and […], which are linked to the WordNet sense inventory. Table 5 shows the results on the word to sense dataset of the SemEval-2014 CLSS task, according to Pearson (r) and Spearman (ρ) correlations and for the four strategies. As can be seen from the low overall performances, the task is a very challenging benchmark with many WordNet out-of-vocabulary or slang terms and rare usages. Despite this, DECONF provides consistent improvement over the comparison sense representation techniques according to both measures and for all the strategies.
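The four strategies differ only in what the sense vector is compared against, as the following sketch shows. The function and argument names are illustrative assumptions, and the toy vectors are made up.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def word_to_sense_similarity(sense_vec, word_vec, word_sense_vecs, strategy):
    """Similarity between a sense s and a word w under one of the four strategies."""
    if strategy == "MaxSim":   # s vs. the most similar sense of w
        return max(cosine(sense_vec, v) for v in word_sense_vecs)
    if strategy == "AvgSim":   # s vs. every sense of w, averaged
        return sum(cosine(sense_vec, v) for v in word_sense_vecs) / len(word_sense_vecs)
    if strategy == "S2W":      # s vs. the original word vector of w
        return cosine(sense_vec, word_vec)
    if strategy == "S2A":      # s vs. the centroid of w's sense vectors
        return cosine(sense_vec, centroid(word_sense_vecs))
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical 2-d vectors for a sense s and a word w with two senses.
s = [1.0, 0.0]
w = [1.0, 1.0]
w_senses = [[1.0, 0.0], [0.0, 1.0]]
for name in ("MaxSim", "AvgSim", "S2W", "S2A"):
    print(name, round(word_to_sense_similarity(s, w, w_senses, name), 3))
```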

Experimental results
Across the four strategies, S2A proves to be the most effective for DECONF and the representations of Rothe and Schütze (2015). The representations of […] perform best with the S2W strategy, whereas those of Iacobacci et al. (2015) do not show a consistent trend, with relatively low performance across the four strategies. Also, a comparison of our results across the S2W and S2A strategies reveals that a word's aggregated representation, i.e., the centroid of the representations of its senses, is more accurate than its original word representation.
Our analysis showed that the performances of the approaches of Rothe and Schütze (2015) and Iacobacci et al. (2015) were hampered partly due to their limited coverage. In fact, the former was unable to model around 35% of the synsets in WordNet 1.7.1, mainly due to its shallow exploitation of knowledge from WordNet, whereas the latter approach did not cover around 15% of synsets in WordNet 3.0. Chen et al. (2014) […] word senses in WordNet. However, the relatively low performance of their system shows that the usage of glosses in WordNet and the automated disambiguation have not resulted in accurate sense representations. Thanks to its deep exploitation of the underlying resource, our approach provides full coverage over all word senses and synsets in WordNet. The three best-performing systems in the task are Meerkat Mafia (Kashyap et al., 2014) (r = 37.5, ρ = 39.3), SimCompass (Banea et al., 2014) (r = 35.4, ρ = 34.9), and SemantiKLUE (Proisl et al., 2014) (r = 17.9, ρ = 18.8). Please note that these systems are specifically designed for the cross-level similarity measurement task. For instance, the best-ranking system in the task leverages a compilation of several dictionaries, including The American Heritage Dictionary, Wiktionary and WordNet, in order to handle slang terms and rare usages, which leads to its competitive performance (Kashyap et al., 2014).

Related Work
Learning semantic representations for individual senses of words has been an active area of research for the past few years. Based on the way they view the problem, the recent techniques can be classified into two main branches: (1) those that, similarly to our work, extract knowledge from external sense inventories for learning sense representations; and (2) those techniques that cluster the contexts in which a word appears in a given text corpus and learn distinct representations for individual clusters.
Examples of the first branch include the approaches of Chen et al. (2014), […] and Rothe and Schütze (2015), all of which use WordNet as an external resource and obtain sense representations for this sense inventory. Chen et al. (2014) use the content words in the definition of a word sense and WSD. However, the sole usage of glosses as sense-distinguishing contexts and the non-optimal WSD make the approach inaccurate, particularly for highly polysemous words with similar senses and for word senses with short definitions. Similarly, Rothe and Schütze (2015) use only the polysemy and synonymy properties of words in WordNet along with a small set of semantic relations. This significantly hampers the reliability of the technique in providing high coverage (discussed further in §3.3.1). Our approach improves over these works by exploiting deep knowledge from the semantic network of WordNet, coupled with an effective training approach. ADW (Pilehvar and Navigli, 2015) is another WordNet-based approach, which exploits only the semantic network of this resource and obtains interpretable sense representations. Other work in this branch includes SensEmbed (Iacobacci et al., 2015) and Nasari (Camacho-Collados et al., 2015; Camacho-Collados et al., 2016), which are based on the BabelNet sense inventory (Navigli and Ponzetto, 2012). The former technique first disambiguates words in a given corpus with the help of a knowledge-based WSD system and then uses the generated sense-annotated corpus as training data for Word2vec. Nasari combines structural knowledge from the semantic network of BabelNet with corpus statistics derived from Wikipedia for representing BabelNet synsets. However, the approach falls short of modeling non-nominal senses as Wikipedia, due to its encyclopedic nature, does not cover verbs, adjectives, or adverbs. The second branch, which is usually referred to as multi-prototype representation, is often associated with clustering.
Reisinger and Mooney (2010) proposed one of the recent pioneering techniques in this branch. Other prominent work in the category includes topical word embeddings (Liu et al., 2015), which use latent topic models for assigning topics to each word in a corpus and learn topic-specific word representations, and the technique proposed by Huang et al. (2012), which incorporates "global document context." Tian et al. (2014) modified the Skip-gram model in order to learn multiple embeddings for each word type. Although these techniques do not usually rely on the knowledge encoded in structured knowledge resources, they generally suffer from two disadvantages. The first limitation is that they usually assume that a given word has a fixed number of senses, ignoring the fact that polysemy varies greatly across words, which can range from monosemous to highly ambiguous with dozens of associated meanings (McCarthy et al., 2016). Neelakantan et al. (2014) tackled this issue by estimating the number of senses for a word type during the learning process. However, all techniques in the second branch suffer from another disadvantage: their computed sense representations are not linked to any sense inventory, a linking which itself would require the existence of high-coverage sense-annotated data (Agirre et al., 2006).
Another notable line of research incorporates knowledge from external resources, such as PPDB (Ganitkevitch et al., 2013) and WordNet, to improve word embeddings (Yu and Dredze, 2014; Faruqui et al., 2015). However, neither of the two techniques provides representations for word senses.

Conclusions
We put forward a sense representation technique, namely DECONF, that provides multiple advantages in comparison to the recent state of the art: (1) the number of word senses in our technique is flexible and the computed representations are linked to word senses in WordNet; (2) DECONF is effective in providing accurate representations of word senses, even for those senses that do not usually appear frequently in generic text corpora; and (3) our approach is general in that it can be readily applied to any set of word representations and any semantic network without the need for extensive parameter tuning. Our experimental results showed that DECONF can outperform the recent state of the art on several datasets across two tasks. We release our computed representations for around 118K synsets and 205K word senses in WordNet 3.0 at https://github.com/pilehvar/deconf. As future work, we plan to investigate the possibility of using larger semantic networks, such as FreeBase and BabelNet, which would also allow us to apply the technique to languages other than English. We also plan to evaluate the performance of our approach with other decay functions as well as with other initial word representations.