One Representation per Word - Does it make Sense for Composition?

In this paper, we investigate whether an a priori disambiguation of word senses is strictly necessary or whether the meaning of a word in context can be disambiguated through composition alone. We evaluate the performance of off-the-shelf single-vector and multi-sense vector models on a benchmark phrase similarity task and a novel task for word-sense discrimination. We find that single-sense vector models perform as well as or better than multi-sense vector models despite arguably less clean elementary representations. Our findings furthermore show that simple composition functions such as pointwise addition are able to recover sense-specific information from a single-sense vector model remarkably well.


Introduction
Distributional word representations based on counting co-occurrences have a long history in natural language processing and have successfully been applied to numerous tasks such as sentiment analysis, recognising textual entailment, word-sense disambiguation and many other important problems. More recently, low-dimensional and dense neural word embeddings have received a considerable amount of attention in the research community and have become ubiquitous in numerous NLP pipelines in academia and industry. One fundamental simplifying assumption commonly made in distributional semantic models, however, is that every word can be encoded by a single representation. Combining polysemous lexemes into a single vector has the consequence of essentially creating a weighted average of all observed meanings of a lexeme in a given text corpus. Therefore, a number of proposals have been made to overcome the issue of conflating several different senses of an individual lexeme into a single representation. One approach (Reisinger and Mooney, 2010; Huang et al., 2012) is to directly infer a predefined number of senses from data and subsequently label any occurrences of a polysemous lexeme with the inferred inventory. Similar approaches are proposed by Reddy et al. (2011) and Kartsaklis et al. (2013), who show that appropriate sense selection or disambiguation typically improves performance for composition of noun phrases (Reddy et al., 2011) and verb phrases (Kartsaklis et al., 2013). Dinu and Lapata (2010) proposed a model that represents the meaning of a word as a probability distribution over latent senses which is modulated based on contextualisation, and report improved performance on a word similarity task and the lexical substitution task.
Other approaches leverage an existing lexical resource such as BabelNet or WordNet to obtain sense labels a priori to creating word representations (Iacobacci et al., 2015), or as a postprocessing step after obtaining initial word representations (Chen et al., 2014; Pilehvar and Collier, 2016). While these approaches have exhibited strong performance on benchmark word similarity tasks (Huang et al., 2012; Iacobacci et al., 2015) and some downstream processing tasks such as part-of-speech tagging and relation identification (Li and Jurafsky, 2015), they have been weaker than the single-vector representations at predicting the compositionality of multi-word expressions (Salehi et al., 2015), and at tasks which require the meaning of a word to be considered in context, e.g., word-sense disambiguation (Iacobacci et al., 2016) and word similarity in context (Iacobacci et al., 2015).
In this paper we consider what happens when distributional representations are composed to form representations for larger units of meaning. In a compositional phrase, the meaning of the whole can be inferred from the meaning of its parts. Thus, assuming compositionality, the representation of a phrase such as black mood should be directly inferable from the representations for black and for mood. Further, one might suppose that composing the correct senses of the individual lexemes would result in a more accurate representation of that phrase. However, our counterhypothesis is that the act of composition contextualises or disambiguates each of the lexemes, thereby making the representations of individual senses redundant. We investigate this hypothesis by evaluating the performance of single-vector representations and multi-sense representations on both a benchmark phrase similarity task and a novel word-sense discrimination task.
Our contributions in this work are thus as follows. First, we provide quantitative and qualitative evidence that even simple composition functions have the ability to recover sense-specific information from a single-vector representation of a polysemous lexeme in context. Second, we introduce a novel word-sense discrimination task, which can be seen as the first stage of word-sense disambiguation. The goal is to find whether the occurrences of a lexeme in two or more sentential contexts belong to the same sense or not, without necessarily labelling the senses. While it has received relatively little attention in recent years, it is an important natural language understanding problem and can provide important insights into the process of semantic composition.

Evaluating Distributional Models of Composition
For evaluation we use several readily available off-the-shelf word embeddings that have already been shown to work well for a number of different NLP applications. We compare the 300-dimensional skip-gram word2vec (Mikolov et al., 2013) word embeddings to the dependency-based version of word2vec (Levy and Goldberg, 2014), henceforth dep2vec, and to the SENSEMBED model by Iacobacci et al. (2015), which creates word-sense embeddings by performing word-sense disambiguation prior to running word2vec. We note that word2vec and dep2vec use a single-vector-per-word approach and therefore conflate the different senses of a polysemous lexeme. On the other hand, SENSEMBED utilises Babelfy (Moro et al., 2014) as an external knowledge source to perform word-sense disambiguation and subsequently creates one vector representation per word sense.
For composition we use pointwise addition for all models as this has been shown to be a strong baseline in a number of studies (Hashimoto et al., 2014; Hill et al., 2016). We also experimented with pointwise multiplication as a composition function but, similar to Hill et al. (2016), found its performance to be very poor (results not reported). We model any out-of-vocabulary items as a vector consisting of all zeros and determine the proximity of composed meaning representations in terms of cosine similarity. We lowercase and lemmatise the data in our task but do not perform number or date normalisation, or removal of rare words.
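The composition and comparison steps described above can be sketched in a few lines. This is a minimal illustration rather than the code used in the experiments: the function names, the 3-dimensional toy vectors and the choice to return a similarity of 0.0 when one vector is all zeros are assumptions, and lemmatisation is omitted.

```python
import math


def compose(words, vectors, dim):
    """Compose a phrase by pointwise addition of its word vectors.

    Out-of-vocabulary words contribute an all-zero vector, as in the text.
    Words are lowercased; lemmatisation is left out of this sketch.
    """
    total = [0.0] * dim
    for word in words:
        vec = vectors.get(word.lower(), [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional toy vectors; the models in the text use 300 dimensions.
toy_vectors = {
    "black": [0.9, 0.1, 0.4],
    "mood": [0.1, 0.8, 0.1],
    "coffee": [0.2, 0.0, 0.9],
}

black_mood = compose(["black", "mood"], toy_vectors, dim=3)
black_coffee = compose(["black", "coffee"], toy_vectors, dim=3)
similarity = cosine(black_mood, black_coffee)
```

Because an out-of-vocabulary word adds nothing to the sum, a phrase with one unknown word reduces to the vector of its known word.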

Phrase Similarity
Our first evaluation task is the benchmark phrase similarity task of Mitchell and Lapata (2010). This dataset consists of 108 adjective-noun (AN), 108 noun-noun (NN) and 108 verb-object (VO) pairs. The task is to compare a compositional model's similarity estimates with human judgements by computing Spearman's ρ. An average ρ of 0.47-0.48 represents the current state-of-the-art performance on this task (Hashimoto et al., 2014; Wieting et al., 2015).
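As a reminder of what the evaluation metric computes, the following sketch implements Spearman's ρ as the Pearson correlation of average ranks. The function names and the four score pairs are made up for illustration and are not taken from the Mitchell and Lapata (2010) data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return float("nan")
    return cov / (sd_x * sd_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical human ratings and model cosine similarities for four phrase pairs.
human_scores = [1.0, 2.5, 4.0, 6.5]
model_scores = [0.10, 0.35, 0.30, 0.80]
rho = spearman_rho(human_scores, model_scores)
```

Only the ordering of the scores matters, so cosine similarities and human ratings can be compared directly even though they are on different scales.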
For single-sense representations, the strategy for carrying out this task is simple. For each phrase in each pair, we compose the constituent representations and then compute the similarity of each pair of phrases using the cosine similarity. For multi-sense representations, we adapted the strategy which has been used successfully in various word similarity experiments (Huang et al., 2012; Iacobacci et al., 2015). Typically, for each word pair, all pairs of senses are considered and the similarity of the word pair is considered to be the similarity of the closest pair of senses. The fact that this strategy works well suggests that when humans are asked to judge word similarity, the pairing automatically primes them to select the closest senses. Extending this to phrase similarity requires us to compose each possible pair of senses for each phrase and then select the sense configuration which results in maximal phrase similarity. For comparison, we also give results for the configuration which results in minimal phrase similarity and the arithmetic mean of all sense configurations.
Table 1 shows that the simple strategy of adding high-quality single-vector representations is very competitive with the state-of-the-art for this task. None of the strategies for selecting a sense configuration for the multi-sense representations could compete with the single-sense representations on this task. One possible explanation is that the commonly adopted closest sense strategy is not effective for composition since the composition of incorrect senses may lead to spuriously high similarities (for two "implausible" sense configurations). Table 2 lists a number of example phrase pairs with low average human similarity scores in the Mitchell and Lapata (2010) test set. The results show the tendency of the closest sense strategy with SENSEMBED (SE) to overestimate the similarity of dissimilar phrase pairs.
For comparison, we manually labelled the lexemes in the sample phrases with the appropriate BabelNet senses prior to composition (SE*). Human (H) similarity scores are normalised and averaged for an easier comparison; model estimates represent cosine similarities.
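The closest sense strategy for phrases, together with the minimum and mean variants mentioned above, can be sketched as follows. The data layout (a dictionary mapping each word to a list of sense vectors), the function names and the 2-dimensional toy vectors are assumptions made for this illustration and do not reflect the SENSEMBED file format.

```python
import itertools
import math


def add_vectors(vectors):
    """Pointwise addition of a sequence of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sense_configuration_similarities(phrase_a, phrase_b, sense_vectors):
    """Similarity of every pairing of sense configurations of two phrases.

    A sense configuration picks one sense vector per word of a phrase.
    """
    configs_a = list(itertools.product(*(sense_vectors[w] for w in phrase_a)))
    configs_b = list(itertools.product(*(sense_vectors[w] for w in phrase_b)))
    return [
        cosine(add_vectors(config_a), add_vectors(config_b))
        for config_a in configs_a
        for config_b in configs_b
    ]


# Hypothetical toy inventory: "black" has two senses, the other words one each.
toy_senses = {
    "black": [[1.0, 0.0], [0.0, 1.0]],
    "mood": [[0.0, 1.0]],
    "dark": [[1.0, 0.0]],
    "temper": [[0.0, 1.0]],
}

sims = sense_configuration_similarities(("black", "mood"), ("dark", "temper"), toy_senses)
closest = max(sims)             # closest sense strategy
farthest = min(sims)            # minimal phrase similarity
mean_sim = sum(sims) / len(sims)  # arithmetic mean over all configurations
```

Taking the maximum over all configurations means that a single high-scoring but implausible sense pairing determines the phrase similarity, which is the failure mode discussed above.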

Word Sense Discrimination
Table 2: Tendency of SENSEMBED (SE) to overestimate the similarity on phrase pairs with low average human similarity when the closest sense strategy is used.
Word-sense discrimination can be seen as the first stage of word-sense disambiguation, where the goal is to find whether two or more occurrences of the same lexeme express identical senses, without necessarily labelling the senses yet. It has received relatively little attention despite its potential for providing important insights into semantic composition, in particular into the ability of compositional distributional semantic models to appropriately contextualise a polysemous lexeme. Work on word-sense discrimination has suffered from the absence of a benchmark task as well as a clear evaluation methodology. For example, Schütze (1998) evaluated his model on a dataset consisting of 20 polysemous words (10 naturally ambiguous lexemes and 10 artificially ambiguous "pseudo-lexemes") in terms of accuracy for coarse-grained sense distinctions, and an information retrieval task. Pantel and Lin (2002) and Van de Cruys (2008) used automatically extracted words from various newswire sources and evaluated the output of their models in comparison to WordNet and EuroWordNet, respectively. Purandare and Pedersen (2004) used a subset of the words from the SENSEVAL-2 task and evaluated their models in terms of precision, recall and F1-score of how well available sense tags match with clusters discovered by their algorithms. Akkaya et al. (2012) used the concatenation of the SENSEVAL-2 and SENSEVAL-3 tasks and evaluated their models in terms of cluster purity and accuracy. Finally, Moen et al. (2013) used the semantic textual similarity (STS) 2012 task, which is based on human judgements of the similarity between two sentences.
One contribution of our work is a novel word-sense discrimination task, evaluated on a number of robust baselines in order to facilitate future research in that area. In particular, our task offers a testbed for assessing the contextualisation ability of compositional distributional semantic models. The goal is, for a given polysemous lexeme in context, to identify the sentence from a list of options that is expressing the same sense of that lexeme as the given target sentence. These two sentences, the target and the "correct answer", can exhibit any degree of semantic similarity as long as they convey the same sense of the target lexeme. Table 3 shows an example of the polysemous adjective black in our task. The goal of any model would be to determine that the expressed sense of black in the sentence She was going to set him free from all of the evil and black hatred he had brought to the world is identical to the expressed sense of black in the target sentence Or should they rebut the Democrats' black smear campaign with the evidence at hand.
Our task assesses the ability of a model to discriminate a particular sense in a sentential context from any other senses and thus provides an excellent testbed for evaluating multi-sense word vector models as well as compositional distributional semantic models. By composing the representation of a target lexeme with its surrounding context, it should be possible to determine its sense. For example, composing black smear campaign should lead to a compositional representation that is closer to the composed representation of black hatred than to black mood, black sense of humour or black coffee. This essentially uses the similarity of the compositional representation of a lexeme's context to determine its sense. Similar approaches to word-sense disambiguation have already been successfully used in past works (Akkaya et al., 2012; Basile et al., 2014).

Task Construction
For the construction of our dataset we made use of data from two English dictionaries (Oxford Dictionary and Collins Dictionary), accessible via their respective web APIs 6, as well as examples from the sense-annotated corpus SemCor (Miller et al., 1993). Our use of dictionary data is motivated by a number of favourable properties which make it a very suitable data source for our proposed task:
• The content is of very high quality and curated by expert lexicographers.
• All example sentences are carefully crafted in order to unambiguously illustrate the usage of a particular sense for a given polysemous lexeme.
• The granularity of the sense inventory reflects common language use 7.
• The example sentences are typically free of any domain bias wherever possible.
• The data is easily accessible via a web API.
By using the data from curated resources we were able to avoid a setup as a sentence similarity task and any potentially noisy crowd-sourced human similarity judgements. We were furthermore able to collect data from varying frequency bands, enabling an assessment of the impact of frequency on any model. Figure 1 shows the number of target lexemes per frequency band. While the majority of lexemes, with reference to a cleaned October 2013 Wikipedia dump 8, are in the middle band, there is a considerable number of less frequent lexemes. The most frequent target lexeme in our task is the verb be with ≈1.8m occurrences in Wikipedia, and the least frequent lexeme is the verb ruffle with only 57 occurrences. The average target lexeme frequency is ≈95k for adjectives, and ≈45k−46k for nouns and verbs 9.
6 https://developer.oxforddictionaries.com for the Oxford Dictionary, https://www.collinsdictionary.com/api/ for the Collins Dictionary. We use NLTK 3.2 to access SemCor.
7 The Oxford dictionary lists 5 different senses for the noun "bank", whereas WordNet 3.0 lists 10 synsets, for example distinguishing "bank" as the concept for a financial institution and "bank" as a reference to the building where financial transactions take place.
8 We removed any articles with fewer than 20 page views.
9 The overall number of unique word types is smaller than the number of examples in our task as there are a number of lexemes that can occur with more than one part-of-speech.

Table 3: Example of the polysemous adjective black in our task. The goal for any model is to predict option 1 as expressing the same sense of black as the target sentence.

| | Sense Definition | Sentence |
|---|---|---|
| Target | full of anger or hatred | Or should they rebut the Democrats' black smear campaign with the evidence at hand? |
| Option 1 | full of anger or hatred | She was going to set him free from all of the evil and black hatred he had brought to the world. |
| Option 2 | (of a person's state of mind) full of gloom or misery; very depressed | I've been in a black mood since September 2001, it's hanging over me like a penumbra. |
| Option 3 | (of humour) presenting tragic or harrowing situations in comic terms | Over the years I have come to believe that fate either hates me, or has one hell of a black sense of humour. |
| Option 4 | (of coffee or tea) served without milk | The young man was reading a paperback novel and sipping a steaming mug of hot, black coffee. |

Task Setup Details
We collected data for 3 different parts-of-speech: adjectives, nouns and verbs. We furthermore created task setups with varying numbers of senses to distinguish (2-5 senses) for a given target lexeme. This is to evaluate how well a model is able to discriminate different degrees of polysemy of any lexeme. For any task setup evaluating for n senses, we included all lexemes with > n senses and randomly sampled n senses from their inventories. For each lexeme, we furthermore ensured that it had at least 2 example sentences per sense. For the available senses of any given lexeme, we randomly chose a sense as the target sense, and from its list of example sentences randomly sampled 2 sentences, one as the target example and one as the "correct answer" for the list of candidate sentences. Finally, we once again randomly sampled the required number of other senses and example sentences to complete the task setup. Using random sampling of word senses and targets aims to avoid a predominant sense bias. For each part-of-speech we created a development split for parameter tuning and a test split for the final evaluation.
We evaluate models in terms of accuracy at correctly predicting which two sentences share the same sense of a given target lexeme. Accuracy has the advantage of being much easier to interpret, in absolute terms as well as in the relative difference between models, in comparison to other commonly used evaluation metrics such as cluster purity measures or correlation metrics such as Spearman ρ and Pearson r.
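The sampling procedure described above can be sketched as follows. This is one possible reading of the description, not the original construction script: the inventory layout (sense label mapped to a list of example sentences), the function names, the placeholder sense labels and sentences, and the shuffling of the options are all assumptions.

```python
import random


def build_instance(lexeme, inventory, n_senses, rng):
    """Build one task instance with n_senses candidate sentences.

    Only senses with at least 2 example sentences are eligible, and the
    lexeme is skipped unless it has more than n_senses eligible senses.
    """
    eligible = [s for s, examples in inventory.items() if len(examples) >= 2]
    if len(eligible) <= n_senses:
        return None
    senses = rng.sample(eligible, n_senses)
    target_sense = rng.choice(senses)
    # Two distinct sentences of the target sense: the target and the correct answer.
    target_sentence, answer = rng.sample(inventory[target_sense], 2)
    options = [(target_sense, answer)]
    for sense in senses:
        if sense != target_sense:
            options.append((sense, rng.choice(inventory[sense])))
    rng.shuffle(options)  # assumption: option order carries no information
    return {
        "lexeme": lexeme,
        "target": target_sentence,
        "target_sense": target_sense,
        "options": options,
    }


def accuracy(predictions, gold):
    """Fraction of instances where the predicted option index is the gold index."""
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return correct / len(gold)


# Placeholder inventory; sense labels and sentences are not dictionary data.
toy_inventory = {
    "sense_a": ["sentence A1", "sentence A2"],
    "sense_b": ["sentence B1", "sentence B2"],
    "sense_c": ["sentence C1", "sentence C2", "sentence C3"],
    "sense_d": ["sentence D1", "sentence D2"],
}

instance = build_instance("black", toy_inventory, 3, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the sampled task reproducible across runs.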

Experimental Setup
In this paper we compare the compositional models outlined earlier with two baselines, a random baseline and a word-overlap baseline of the extracted contexts. For the single-vector representations, we composed the target lexeme with all of the words in the context window and compared it with the equivalent representation of each of the options (lexeme plus context words). The option with the highest cosine similarity was deemed to be the selected sense. For SENSEMBED, we composed all sense vectors of a target lexeme with the given context and then used the closest sense strategy (Iacobacci et al., 2015) on composed representations to choose the predicted sense. The word-overlap baseline is simply the number of words in common between the context window for the target and each of the options. We experimented with symmetric linear bag-of-words contexts of size 1, 2 and 4 around the target lexeme. We also experimented with dependency contexts, where first-order dependency contexts performed almost identically to using a 2-word bag-of-words context window (results not reported). We excluded stop words prior to extracting the context window in order to maximise the number of content words. We break ties for any of the methods, including the baselines, by randomly picking one of the options with the highest similarity to the composed representation of the target lexeme with its context. Statistical significance between the best performing model and the word-overlap baseline is computed by using a randomised pairwise permutation test (Efron and Tibshirani, 1994).
Table 5 shows the results for all context window sizes across all parts-of-speech and number of senses. All models substantially outperform the random baseline for any number of senses. Interestingly, the word-overlap baseline is competitive for all context window sizes. While it is a very simple method, it has already been found to be a strong baseline for paraphrase detection and semantic textual similarity (Dinu and Thater, 2012). One possible explanation for its robust performance on our task is the one-sense-per-collocation hypothesis (Yarowsky, 1993). The performance of all other models is roughly comparable for all parts-of-speech and number of senses, suggesting that they form robust baselines for future models. While the results are relatively mixed for adjectives, word2vec appears to be the strongest model for polysemous nouns and verbs.
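The single-vector procedure and the word-overlap baseline from the experimental setup can be sketched as below. The stop-word list, the toy vectors, the function names and the decision to count shared word types (rather than tokens) for the overlap are assumptions made for this illustration.

```python
import math
import random

# Small illustrative stop-word list (assumption; the list used in the paper is not given).
STOP_WORDS = {"a", "an", "and", "at", "of", "or", "should", "the", "they", "to", "was", "with"}


def context_window(tokens, target_index, size):
    """Symmetric bag-of-words window of `size` (>= 1) content words per side."""
    left = [t for t in tokens[:target_index] if t not in STOP_WORDS][-size:]
    right = [t for t in tokens[target_index + 1:] if t not in STOP_WORDS][:size]
    return left + right


def compose(words, vectors, dim):
    """Pointwise addition; out-of-vocabulary words count as all-zero vectors."""
    total = [0.0] * dim
    for word in words:
        total = [t + v for t, v in zip(total, vectors.get(word, [0.0] * dim))]
    return total


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def pick_best(scores, rng):
    """Index of the highest score, breaking ties uniformly at random."""
    best = max(scores)
    return rng.choice([i for i, s in enumerate(scores) if s == best])


def composition_choice(lexeme, target_ctx, option_ctxs, vectors, dim, rng):
    """Choose the option whose composed (lexeme + context) vector is closest to the target's."""
    target_vec = compose([lexeme] + target_ctx, vectors, dim)
    scores = [cosine(target_vec, compose([lexeme] + ctx, vectors, dim)) for ctx in option_ctxs]
    return pick_best(scores, rng)


def overlap_choice(target_ctx, option_ctxs, rng):
    """Word-overlap baseline: number of word types shared with the target context."""
    scores = [len(set(target_ctx) & set(ctx)) for ctx in option_ctxs]
    return pick_best(scores, rng)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "black": [1.0, 1.0, 1.0], "smear": [1.0, 0.0, 0.0], "campaign": [1.0, 0.0, 0.0],
    "evil": [0.8, 0.2, 0.0], "hatred": [0.9, 0.1, 0.0],
    "coffee": [0.0, 0.0, 1.0], "mug": [0.0, 0.0, 1.0],
}
tokens = "or should they rebut the democrats black smear campaign with the evidence at hand".split()
window = context_window(tokens, tokens.index("black"), size=2)
```

With these toy vectors, composing black with smear and campaign lands closer to black evil hatred than to black coffee mug, while the overlap baseline sees no shared words and falls back to a random choice.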

Results
Perhaps the most interesting observation in Table 5 is that word2vec and dep2vec perform as well as or better than SENSEMBED despite the fact that the former conflate the senses of a polysemous lexeme in a single vector representation. Figure 2 shows the average performance of all models across parts-of-speech per number of senses and for all context window sizes.

SENSEMBED and Babelfy
One possible explanation for SENSEMBED not outperforming the other methods despite its cleaner encoding of different word senses in the above experiments is that at train time, it had access to sense labels from Babelfy. At test time on our task however, it did not have any sense labels available. We therefore sense tagged the 5-sense noun subtask with Babelfy and re-ran SENSEMBED. As Table 6 shows, access to sense labels at test time did not give a substantive performance boost, representing further support for our hypothesis that composition in single-sense vector models might be sufficient to recover sense specific information.

Frequency Range
We chose the 2-sense noun subtask to estimate the sensitivity of our task to target lexeme frequency. We merged the [1, 1k) and [1k, 10k) frequency bands, as well as the [50k, 100k) and [100k, ∞) bands from Figure 1, and sampled an equal number of target words from each band. Table 7 reports the results for this experiment. All methods outperform the random and word-overlap baselines and appear to be working better for less frequent lexemes. One possible explanation for this behaviour is that less frequent lexemes have fewer senses and potentially less subtle sense differences than more frequent lexemes, which would make them easier to discriminate by distributional semantic methods.
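The band merging and balanced sampling can be illustrated with a short sketch. The merged low and high bands follow the text; treating everything between 10k and 50k as a single middle band is an assumption, as are the function names and all toy frequencies except those of ruffle and be, which are quoted (approximately, for be) from the task description.

```python
import random

# Merged frequency bands as half-open intervals [low, high).
# The single middle band [10k, 50k) is an assumption of this sketch.
BANDS = [(1, 10_000), (10_000, 50_000), (50_000, float("inf"))]


def band_of(frequency):
    """Return the band containing `frequency`, or None if it falls outside all bands."""
    for low, high in BANDS:
        if low <= frequency < high:
            return (low, high)
    return None


def balanced_sample(frequencies, per_band, rng):
    """Sample the same number of lexemes from every band."""
    by_band = {band: [] for band in BANDS}
    for lexeme, freq in sorted(frequencies.items()):
        band = band_of(freq)
        if band is not None:
            by_band[band].append(lexeme)
    # Never ask for more lexemes than the smallest band can supply.
    k = min(per_band, min(len(lexemes) for lexemes in by_band.values()))
    return {band: rng.sample(lexemes, k) for band, lexemes in by_band.items()}


toy_frequencies = {
    "ruffle": 57,          # from the text
    "be": 1_800_000,       # approximately, from the text
    "penumbra": 900,       # hypothetical
    "smear": 12_000,       # hypothetical
    "mood": 30_000,        # hypothetical
    "black": 200_000,      # hypothetical
    "bank": 120_000,       # hypothetical
}

sample = balanced_sample(toy_frequencies, per_band=2, rng=random.Random(0))
```

Capping the sample size at the smallest band keeps the comparison across bands balanced even when one band is sparsely populated.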

Discussion
Our results suggest that pointwise addition in a single-sense vector model such as word2vec is able to discriminate the sense of a polysemous lexeme in context in a surprisingly effective way and represents a strong baseline for future work. Distributional composition can therefore be interpreted as a process of contextualising the meaning of a lexeme. This way, composition does not only act as a way to represent the meaning of a phrase as a whole, but also as a local discriminator for any lexemes in the phrase. For example the composed representation of dry clothes should only keep contexts that dry shares with clothes while suppressing contexts it shares with weather or wine. Hence, one would expect the same to happen with a polysemous lexeme such as bank in the context of river and account, respectively.
Recent work by Arora et al. (2016) has shown that the different senses of a polysemous lexeme reside in a linear substructure within a single vector and are recoverable by sparse coding. There is furthermore evidence that additive composition in low-dimensional word embeddings approximates an intersection of the contexts of two distributional word vectors (Tian et al., 2015). It therefore seems plausible that an intersective composition function should be able to recover sense specific information.
Table 5: Performance overview for all parts-of-speech and number of senses. ‡ statistically significant at the p < 0.01 level in comparison to the Word Overlap baseline; † statistically significant at the p < 0.05 level in comparison to the Word Overlap baseline.
Table 6: Results on the 5-sense noun subtask with SENSEMBED having access to Babelfy sense labels at test time.
Table 7: Results on a subsample of the 2-sense noun subtask across frequency bands.
To qualitatively analyse this hypothesis we used the word2vec and SENSEMBED vectors to compose a small number of example phrases by pointwise addition and calculated their top 5 nearest neighbours in terms of cosine similarity. For SENSEMBED we manually sense-tagged the phrases with the appropriate BabelNet sense labels prior to composition. We omitted the BabelNet sense labels in the neighbour list for brevity; however, they were consistent with the intended sense in all cases. Table 8 supports the view of composition as a way of contextualising the meaning of a lexeme. In all cases in our example the word2vec neighbours reflect the intended sense of the polysemous lexeme, providing evidence for the linear substructure of word senses in a single vector as discovered by Arora et al. (2016), and suggesting that distributional composition is able to recover sense-specific information from a polysemous lexeme. The very fine-grained sense-level vector space of SENSEMBED gives rise to a very focused neighbourhood; however, there does not seem to be any advantage over word2vec from a qualitative point of view when using simple additive composition.
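The nearest-neighbour analysis above amounts to composing a phrase by addition and ranking the vocabulary by cosine similarity to the result. The sketch below uses hypothetical 2-dimensional vectors and made-up function names; excluding the phrase's own words from the neighbour list is also an assumption.

```python
import math


def add_vectors(vectors):
    """Pointwise addition of a sequence of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def nearest_neighbours(query, vectors, k, exclude=()):
    """The k vocabulary items most similar to `query` by cosine similarity."""
    scored = [
        (cosine(query, vec), word)
        for word, vec in vectors.items()
        if word not in exclude
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [word for _, word in scored[:k]]


# Hypothetical toy space: first dimension ~ "water", second ~ "finance".
toy_vectors = {
    "river": [1.0, 0.0],
    "shore": [0.9, 0.2],
    "bank": [0.6, 0.6],
    "money": [0.2, 0.8],
    "loan": [0.1, 0.9],
    "account": [0.0, 1.0],
}

river_bank = add_vectors([toy_vectors["river"], toy_vectors["bank"]])
bank_account = add_vectors([toy_vectors["bank"], toy_vectors["account"]])

river_neighbours = nearest_neighbours(river_bank, toy_vectors, k=2, exclude=("river", "bank"))
account_neighbours = nearest_neighbours(bank_account, toy_vectors, k=2, exclude=("bank", "account"))
```

In this toy space the same vector for bank ends up with water-related neighbours when added to river and finance-related neighbours when added to account, which is the contextualising effect the analysis looks for.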

Related Work
Perhaps the most popular tasks for evaluating the ability of a model to capture or encode the different senses of a polysemous lexeme in a given context are the English lexical substitution task (McCarthy and Navigli, 2007) and the Microsoft sentence completion challenge (Zweig and Burges, 2011). Both tasks require a model to fill a pre-defined slot in a given sentential context with an appropriate word. The sentence completion challenge provides a list of candidate words while the English lexical substitution task does not. However, neither task focuses on polysemy, and the English lexical substitution task conflates the problems of discriminating word senses and finding meaning-preserving substitutes. Dictionary definitions have previously been used to evaluate compositional distributional semantic models where the goal is to match a dictionary entry with its corresponding definition (Kartsaklis et al., 2012; Polajnar and Clark, 2014). These datasets are commonly set up as retrieval tasks, but generally do not test the ability of a model to disambiguate a polysemous word in context, or discriminate multiple definitions of the same word.
Our task also provides a novel evaluation for compositional distributional semantic models, where the predominant strategy is to estimate the similarity of two short phrases (Bernardi et al., 2013; Grefenstette and Sadrzadeh, 2011; Kartsaklis and Sadrzadeh, 2014; Mitchell and Lapata, 2008; Mitchell and Lapata, 2010) or sentences (Agirre et al., 2016; Huang et al., 2012; Marelli et al., 2014) in comparison to human-provided gold-standard judgements. One problem with these similarity tasks is that the similarity or relatedness of two sentences is very difficult to judge, especially on a fine-grained scale, even for humans. This frequently results in a relatively high variance of judgements and low inter-annotator agreement (Batchkarov et al., 2016). The short phrase datasets typically have a fixed structure that only tests a very small fraction of the possible grammatical constructions in which a lexeme can occur, and furthermore provide very little context. The use of full sentences remedies the lack of context and grammatical variation, but such datasets can still contain a significant level of noise due to the automatic construction of the dataset or the variance in human ratings. In contrast, our task is not set up as a sentence similarity task and therefore avoids the use of human similarity judgements.
Our task is similar to word-sense induction (WSI); however, we only focus on discriminating the sense of a polysemous lexeme in context rather than inducing a set of senses from raw data and appropriately tagging subsequent occurrences of polysemous instances with the inferred inventory. Separating the sense discrimination task from the problem of sense induction has the advantage of making our task applicable to evaluating compositional distributional semantic models in order to test their ability to appropriately contextualise a polysemous lexeme. Because it does not require models to perform an extra step for sense induction, our task is easier to evaluate: no matching between sense clusters identified by a model and some gold-standard sense classes needs to be performed, as is typically proposed in the WSI literature (Agirre and Soroa, 2007; Manandhar et al., 2010).
Most closely related to our task are the Stanford Contextual Word Similarity (SCWS) dataset by Huang et al. (2012) and the Usage Similarity (USim) task by Erk et al. (2009). The goal in both tasks is to estimate the similarity of two polysemous words in context in comparison to human-provided gold-standard judgements. In the SCWS dataset typically two different lexemes are considered, whereas in USim and our task the same lexemes with different contexts are compared. Instead of relying on crowd-sourced human gold-standard similarity judgements, which can be prone to a considerable amount of noise 11, we leverage the high-quality content of available English dictionaries. Furthermore, our task is not formulated as estimating the similarity between two lexemes in context, but as identifying the sentences that use the same sense of a given polysemous lexeme.
Table 8: Nearest neighbours of composed phrases for word2vec and SENSEMBED. Distributional composition in word2vec is able to recover sense-specific information remarkably well. Some neighbours are phrases because they have been encoded as a single token in the original vector space.

Conclusion
While elementary multi-sense representations of words might capture a more fine grained semantic picture of a polysemous word, that advantage does not appear to transfer to distributional composition in a straightforward way. Our experiments on a popular phrase similarity benchmark and our novel word-sense discrimination task have demonstrated that semantic composition does not appear to benefit from a fine grained sense inventory, but that the ability to contextualise a polysemous lexeme in single-sense vector models is sufficient for superior performance. We furthermore have provided qualitative and quantitative evidence that an intersective composition function such as pointwise addition for neural word embeddings is able to discriminate the meaning of a word in context, and is able to recover sense specific information remarkably well.
Lastly, our experiments have uncovered an important question for multi-sense vector models, namely how to exploit the fine-grained sense-level representations for distributional composition. Our novel word-sense discrimination task provides an excellent testbed for compositional distributional semantic models, whether they follow a single-sense or a multi-sense vector modelling paradigm, due to its focus on assessing the ability of a model to appropriately contextualise the meaning of a word. Our task furthermore provides another evaluation option away from intrinsic evaluations which are based on often noisy human similarity judgements, while also not being embedded in a downstream task.
In future work we aim to extend our evaluation to more complex compositional distributional semantic models such as the lexical function model (Paperno et al., 2014) or the Anchored Packed Dependency Tree framework. We furthermore want to investigate how far the sense-discriminating ability of composition can be leveraged for other tasks.
[...] can be up to 4-5 in some cases.