Towards a Seamless Integration of Word Senses into Downstream NLP Applications

Lexical ambiguity can prevent NLP systems from accurately understanding the semantics of text. Despite its potential benefits, the integration of sense-level information into NLP systems has remained understudied. By incorporating a novel disambiguation algorithm into a state-of-the-art classification model, we create a pipeline to integrate sense-level information into downstream NLP applications. We show that a simple disambiguation of the input text can lead to consistent performance improvements on multiple topic categorization and polarity detection datasets, particularly when the fine granularity of the underlying sense inventory is reduced and the documents are sufficiently large. Our results also point to the need for sense representation research to focus more on in vivo evaluations, which target performance in downstream NLP applications, rather than on artificial benchmarks.


Introduction
As a general trend, most current Natural Language Processing (NLP) systems function at the word level, i.e. individual words constitute the most fine-grained meaning-bearing elements of their input. This word-level functionality can affect the performance of these systems in two ways: (1) it can hamper their efficiency in handling words that are not encountered frequently during training, such as multiwords, inflections and derivations, and (2) it can restrict their semantic understanding to the level of words, with all their ambiguities, and thereby prevent accurate capture of the intended meanings.
The first issue has recently been alleviated by techniques that aim to boost the generalisation power of NLP systems by resorting to sub-word or character-level information (Ballesteros et al., 2015; Kim et al., 2016). The second limitation, however, has not yet been studied sufficiently. A reasonable way to handle word ambiguity, and hence to tackle the second issue, is to semantify the input text: transform it from its surface-level semantics to the deeper level of word senses, i.e. their intended meanings. We take a step in this direction by designing a pipeline that enables seamless integration of word senses into downstream NLP applications, while benefiting from knowledge extracted from semantic networks. To this end, we propose a quick graph-based Word Sense Disambiguation (WSD) algorithm which allows high-confidence disambiguation of words without much computational overhead on the system. We evaluate the pipeline in two downstream NLP applications: polarity detection and topic categorization. Specifically, we use a classification model based on Convolutional Neural Networks which has been shown to be very effective in various text classification tasks (Kalchbrenner et al., 2014; Kim, 2014; Johnson and Zhang, 2015; Tang et al., 2015; Xiao and Cho, 2016). We show that a simple disambiguation of the input can lead to performance improvement of a state-of-the-art text classification system on multiple datasets, particularly for long inputs and when the granularity of the sense inventory is reduced. Our pipeline is quite flexible and modular, as it permits the integration of different WSD and sense representation techniques.

Motivation
With the help of an example news article from the BBC, shown in Figure 1, we highlight some of the potential deficiencies of word-based models.

Ambiguity. Language is inherently ambiguous. For instance, Mercedes, race, Hamilton and Formula can refer to several different entities or meanings. Current neural models have managed to successfully represent complex semantic associations by effectively analyzing large amounts of data. However, the word-level functionality of these systems is still a barrier to the depth of their natural language understanding. Our proposal is particularly tailored towards addressing this issue.
Multiword expressions (MWE). MWEs are lexical units made up of two or more words which are idiosyncratic in nature (Sag et al., 2002), e.g., Lewis Hamilton, Nico Rosberg and Formula 1.
Most existing word-based models ignore the interdependency between MWE's subunits and treat them as individual units. Handling MWE has been a long-standing problem in NLP and has recently received a considerable amount of interest (Tsvetkov and Wintner, 2014;Salehi et al., 2015). Our pipeline facilitates this goal.
Co-reference. Co-reference resolution of concepts and entities is not explicitly tackled by our approach. However, thanks to the fact that words that refer to the same meaning in context, e.g., Formula 1-F1 or German Grand Prix-German GP-Hockenheim, are all disambiguated to the same concept, the co-reference issue is also partly addressed by our pipeline.

Disambiguation Algorithm
Our proposal relies on a seamless integration of word senses in word-based systems. The goal is to semantify the text prior to its being fed into the system by transforming its individual units from their word surface forms to the deeper level of word senses. The semantification step is mainly tailored towards resolving ambiguities, but it brings about the other advantages mentioned in the previous section. The aim is to provide the system with an input of reduced ambiguity, which can facilitate its decision making.

Algorithm 1 (fragment): if maxDeg < θ|S|/100 then break; else Ŝ ← Ŝ ∪ {ŝ} and E ← E \ {(s, s′) : s ∨ s′ ∈ getLex(ŝ)}; return the disambiguation output Ŝ.
To this end, we developed a simple graph-based joint disambiguation and entity linking algorithm which can take any arbitrary semantic network as input. The gist of our disambiguation technique lies in its speed and scalability. Conventional knowledge-based disambiguation systems (Hoffart et al., 2012; Agirre et al., 2014; Moro et al., 2014; Pilehvar and Navigli, 2014) often rely on computationally expensive graph algorithms, which limits their application to on-the-fly processing of large numbers of text documents, as is the case in our experiments. Moreover, unlike supervised WSD and entity linking techniques (Zhong and Ng, 2010; Cheng and Roth, 2013; Melamud et al., 2016; Limsopatham and Collier, 2016), our algorithm relies only on semantic networks and does not require any sense-annotated data, which is limited to English and almost non-existent for other languages.
Algorithm 1 shows our procedure for disambiguating an input document T. First, we retrieve from our semantic network the list of candidate senses for each content word, as well as the semantic relationships among them. As a result, we obtain a graph representation (S, E) of the input text, where S is the set of candidate senses and E is the set of edges among different senses in S. The graph is, in fact, a small sub-graph of the input semantic network N. Our algorithm then selects the best candidates iteratively. In each iteration, the candidate sense with the highest graph degree is chosen as the winning sense:

ŝ = argmax_{s ∈ S} deg(s),  with maxDeg = deg(ŝ).

After each iteration, once a candidate sense ŝ is selected, the remaining candidate senses of the corresponding word (i.e. getLex(ŝ)) are removed from E (line 10 in the algorithm). Figure 2 shows a simplified version of the graph for a sample sentence, which the algorithm would disambiguate as follows. It first associates Oasis with its rock band sense, since its corresponding node has the highest degree, i.e. 3. On the basis of this, the desert sense of Oasis and its link to the stone sense of rock are removed from the graph. In the second iteration, rock is disambiguated as music band, given that its degree is 2. Finally, Manchester is associated with its city sense (with a degree of 1).
In order to enable disambiguating at different confidence levels, we introduce a threshold θ which determines the stopping criterion of the algorithm. Iteration continues until the following condition is fulfilled: maxDeg < θ|S| / 100. This ensures that the system will only disambiguate those words for which it has a high confidence and backs off to the word form otherwise, avoiding the introduction of unwanted noise in the data for uncertain cases or for word senses that are not defined in the inventory.
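A minimal sketch of this greedy procedure in Python. The graph, sense names and words are toy placeholders of our own, and the edge-removal step follows one reading of line 10 of the algorithm, namely discarding the edges of the losing candidates of a resolved word, which is the reading that reproduces the worked example above:

```python
from collections import defaultdict

def disambiguate(word_candidates, edges, theta=1.0):
    """Greedy max-degree disambiguation over the candidate-sense graph.

    word_candidates: {word: set of candidate senses}
    edges: set of frozenset({s, s'}) pairs taken from the semantic network
    theta: confidence threshold; stop once maxDeg < theta * |S| / 100
    """
    S = set().union(*word_candidates.values())
    E = set(edges)
    chosen = {}
    while len(chosen) < len(word_candidates):
        degree = defaultdict(int)
        for e in E:
            for s in e:
                degree[s] += 1
        # candidate senses of words that are still ambiguous
        remaining = [s for w, cands in word_candidates.items()
                     if w not in chosen for s in cands]
        best = max(remaining, key=lambda s: degree[s])
        if degree[best] < theta * len(S) / 100.0:
            break  # back off to the surface word form for the rest
        word = next(w for w, cands in word_candidates.items() if best in cands)
        chosen[word] = best
        # discard the losing candidates of the resolved word and their edges
        losers = word_candidates[word] - {best}
        E = {e for e in E if not (e & losers)}
    return chosen
```

With a large theta, only the highest-degree (most confident) words are disambiguated and the rest are left as surface forms, mirroring the back-off behaviour described above.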

Classification Model
In our experiments, we use a standard neural network based classification approach, similar to the Convolutional Neural Network classifier of Kim (2014) and the pioneering model of Collobert et al. (2011). Figure 3 depicts the architecture of the model. The network receives the concatenated vector representations of the input words, v_{1:n} = v_1 ⊕ v_2 ⊕ · · · ⊕ v_n, and applies (convolves) filters F on windows of h words:

m_i = f(F · v_{i:i+h−1} + b),

where b is a bias term and f() is a non-linear function, for which we use ReLU (Nair and Hinton, 2010). The convolution transforms the input text to a feature map m = [m_1, m_2, . . . , m_{n−h+1}]. A max pooling operation then selects the most salient feature m̂ = max{m} for each filter.
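The convolution and pooling steps above can be sketched as follows. This is a single-filter toy implementation in NumPy; the shapes and function name are our own, not taken from the original code:

```python
import numpy as np

def convolve_text(vectors, F, b):
    """Slide one filter F (length h*d) over windows of h consecutive
    d-dimensional word vectors, apply ReLU, and max-pool the feature map."""
    n, d = vectors.shape
    h = F.size // d
    # each window is the concatenation v_i ⊕ ... ⊕ v_{i+h-1}
    windows = [vectors[i:i + h].reshape(-1) for i in range(n - h + 1)]
    m = np.array([max(0.0, float(F @ w + b)) for w in windows])  # ReLU
    return m, m.max()  # feature map and pooled feature m̂
```

In the full model, many such filters are applied in parallel and their pooled features are stacked into the representation passed to the next layer.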
In the network of Kim (2014), the pooled features are directly passed to a fully connected softmax layer whose outputs are class probabilities. However, we add a recurrent layer before the softmax in order to better capture long-distance dependencies. Xiao and Cho (2016) showed that a recurrent layer can replace multiple layers of convolution and be beneficial, particularly as the length of the input text grows. Specifically, we use a Long Short-Term Memory network (Hochreiter and Schmidhuber, 1997, LSTM) as our recurrent layer, which was originally proposed to avoid the vanishing gradient problem and has proven its abilities in capturing distant dependencies. The LSTM unit computes three gate vectors (forget, input, and output) as follows:

f_t = σ(W_f g_t + U_f h_{t−1} + b_f)
i_t = σ(W_i g_t + U_i h_{t−1} + b_i)
o_t = σ(W_o g_t + U_o h_{t−1} + b_o)

where the W, U, and b are model parameters and g and h are the input and output sequences, respectively. The cell state vector c_t is then computed as

c_t = f_t ⊙ c_{t−1} + i_t ⊙ tanh(W_c g_t + U_c h_{t−1} + b_c),

and the output is h_t = o_t ⊙ tanh(c_t). As for regularization, we used dropout (Hinton et al., 2012) after the embedding layer.
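A single step of the LSTM unit described above, written out with the same gate structure. This is the standard formulation; the dictionary-based parameter layout is our own convenience:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(g_t, h_prev, c_prev, W, U, b):
    """One LSTM step with forget (f), input (i) and output (o) gates.
    W, U, b are dicts of parameters keyed by gate name ('f','i','o','c')."""
    f = sigmoid(W["f"] @ g_t + U["f"] @ h_prev + b["f"])
    i = sigmoid(W["i"] @ g_t + U["i"] @ h_prev + b["i"])
    o = sigmoid(W["o"] @ g_t + U["o"] @ h_prev + b["o"])
    c_tilde = np.tanh(W["c"] @ g_t + U["c"] @ h_prev + b["c"])
    c_t = f * c_prev + i * c_tilde  # new cell state
    h_t = o * np.tanh(c_t)          # new output
    return h_t, c_t
```

In the classifier, g_t is the pooled convolutional feature for position t, and the final h_t feeds the softmax layer.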
We perform experiments with two configurations of the embedding layer: (1) Random, initialized randomly and updated during training, and (2) Pre-trained, initialized by pre-trained representations and updated during training. In the following section we describe the pre-trained word and sense representation used for the initialization of the second configuration.

Pre-trained Word and Sense Embeddings
One of the main advantages of neural models is that they usually represent the input words as dense vectors. This can significantly boost a system's generalisation power and results in improved performance (Zou et al., 2013; Bordes et al., 2014; Kim, 2014; Weiss et al., 2015, inter alia). This feature also enables us to directly plug in pre-trained sense representations and check them in a downstream application.
In our experiments we generate a set of sense embeddings by extending DeConf, a recent technique with state-of-the-art performance on multiple semantic similarity benchmarks (Pilehvar and Collier, 2016). We leave the evaluation of other representations to future work. DeConf takes a pre-trained set of word embeddings and computes sense embeddings in the same semantic space. To this end, the approach exploits the semantic network of WordNet (Miller, 1995), using the Personalized PageRank (Haveliwala, 2002) algorithm, to obtain a set of sense-biasing words B_s for a word sense s. The representation of s is then computed with an exponential decay over this list:

v̂(s) = Σ_{i=1}^{|B_s|} e^{−i/δ} v(w_i),   (3)

where δ is a decay parameter and v(w_i) is the embedding of w_i, i.e. the i-th word in the sense-biasing list B_s of s. We follow Pilehvar and Collier (2016) and set δ = 5. Finally, the vector for sense s is calculated as the average of v̂(s) and the embedding of its corresponding word.
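The exponential-decay aggregation can be sketched as follows. The decay weights and the final averaging with the word embedding follow the description above; the normalization by the weight sum is our assumption to keep the result on the same scale as the input vectors:

```python
import numpy as np

def sense_vector(word_vec, biasing_vecs, delta=5.0):
    """DeConf-style sense vector: biasing words ranked higher in B_s
    contribute more, via exponential decay with parameter delta.
    `biasing_vecs` is ordered by salience (most biasing first)."""
    w = np.exp(-np.arange(1, len(biasing_vecs) + 1) / delta)
    v_hat = (w[:, None] * np.vstack(biasing_vecs)).sum(axis=0) / w.sum()
    # Final sense vector: average of v̂(s) and the word's own embedding.
    return (v_hat + word_vec) / 2.0
```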
Owing to its reliance on WordNet's semantic network, DeConf is limited to generating only those word senses that are covered by this lexical resource. We propose to use Wikipedia in order to expand the vocabulary of the computed word senses. Wikipedia provides a high coverage of named entities and domain-specific terms in many languages, while at the same time benefiting from continuous updates by collaborators. Moreover, it can easily be viewed as a sense inventory in which individual articles are word senses, arranged through hyperlinks and redirections. Camacho-Collados et al. (2016b) proposed NASARI, a technique to compute the most salient words for each Wikipedia page. These salient words were computed by exploiting the structure and content of Wikipedia and proved effective in tasks such as Word Sense Disambiguation (Tripodi and Pelillo, 2017; Camacho-Collados et al., 2016a), knowledge-base construction (Lieto et al., 2016), domain-adapted hypernym discovery (Espinosa-Anke et al., 2016) and object recognition (Young et al., 2016). We view these lists as biasing words for individual Wikipedia pages, and then leverage the exponential decay function (Equation 3) to compute new sense embeddings in the same semantic space. In order to represent both WordNet and Wikipedia senses in the same space, we rely on the WordNet-Wikipedia mapping provided by BabelNet (Navigli and Ponzetto, 2012). For the WordNet synsets which are mapped to Wikipedia pages in BabelNet, we average the corresponding Wikipedia-based and WordNet-based sense embeddings.
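The averaging of mapped WordNet-based and Wikipedia-based vectors can be sketched as follows. All the vectors, identifiers and the mapping below are toy placeholders, not real BabelNet data:

```python
import numpy as np

# Hypothetical sense vectors and a BabelNet-style WordNet-to-Wikipedia mapping.
wn_vec = {"oasis.n.02": np.array([1.0, 0.0])}
wiki_vec = {"Oasis_(band)": np.array([0.0, 1.0])}
wn_to_wiki = {"oasis.n.02": "Oasis_(band)"}

def unified_vector(synset):
    """Average the WordNet-based and Wikipedia-based embeddings for synsets
    mapped to a Wikipedia page; otherwise keep the WordNet-based vector."""
    v = wn_vec[synset]
    page = wn_to_wiki.get(synset)
    return (v + wiki_vec[page]) / 2.0 if page in wiki_vec else v
```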

Pre-trained Supersense Embeddings
It has been argued that WordNet sense distinctions are too fine-grained for many NLP applications (Hovy et al., 2013). This issue can be tackled by grouping together similar senses of the same word, either using automatic clustering techniques (Navigli, 2006; Agirre and Lopez, 2003; Snow et al., 2007) or with the help of WordNet's lexicographer files. Various applications have been shown to improve upon moving from senses to supersenses (Rüd et al., 2011; Severyn et al., 2013; Flekova and Gurevych, 2016). WordNet's lexicographer files define a total of 44 sense clusters, referred to as supersenses, for categories such as event, animal, and quantity. In our experiments we use these supersenses in order to reduce the granularity of our WordNet and Wikipedia senses. To generate supersense embeddings, we simply average the embeddings of the senses in the corresponding cluster.
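Generating supersense embeddings by averaging member senses can be sketched as follows. The sense vectors and supersense assignments below are toy placeholders standing in for real lexicographer-file labels:

```python
import numpy as np

# Hypothetical sense vectors and their lexicographer-file supersenses.
sense_vec = {"band.n.02": np.array([0.0, 4.0]),
             "flock.n.02": np.array([0.0, 2.0]),
             "rock.n.01": np.array([1.0, 0.0])}
supersense_of = {"band.n.02": "noun.group",
                 "flock.n.02": "noun.group",
                 "rock.n.01": "noun.object"}

def supersense_embeddings():
    """A supersense vector is the average of its member sense vectors."""
    groups = {}
    for s, v in sense_vec.items():
        groups.setdefault(supersense_of[s], []).append(v)
    return {ss: np.mean(np.vstack(vs), axis=0) for ss, vs in groups.items()}
```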

Evaluation
We evaluated our model on two classification tasks: topic categorization (Section 5.2) and polarity detection (Section 5.3). In the following section we present the common experimental setup.

Experimental setup
Classification model. Throughout all the experiments we used the classification model described in Section 4. The general architecture of the model was the same for both tasks, with slight variations in hyperparameters given the different natures of the tasks, following the values suggested by Kim (2014) and Xiao and Cho (2016). Hyperparameters were fixed across all configurations within each task. The embedding layer was fixed to 300 dimensions, irrespective of the configuration, i.e. Random and Pre-trained. For both tasks the evaluation was carried out by 10-fold cross-validation unless standard training-testing splits were available. The disambiguation threshold θ (cf. Section 3) was tuned on the training portion of the corresponding data, over the seven values in [0, 3] in steps of 0.5. We used Keras (Chollet, 2015) and Theano (Theano Development Team, 2016) for our model implementations.
Semantic network. The integration of senses was carried out as described in Section 3. For disambiguating with both WordNet and Wikipedia senses we relied on the joint semantic network of Wikipedia hyperlinks and WordNet, via the mapping provided by BabelNet.

Pre-trained word and sense embeddings. Throughout all the experiments we used Word2vec (Mikolov et al., 2013) embeddings trained on the Google News corpus (https://code.google.com/archive/p/word2vec/). We truncated this set to its 250K most frequent words. We also used WordNet 3.0 (Fellbaum, 1998) and the Wikipedia dump of November 2014 to compute the sense embeddings (see Section 4.1). As a result, we obtained a set of 757,262 sense embeddings in the same space as the pre-trained Word2vec word embeddings. We used DeConf (Pilehvar and Collier, 2016) as our pre-trained WordNet sense embeddings. All vectors had a fixed dimensionality of 300.
Supersenses. In addition to WordNet senses, we experimented with supersenses (see Section 4.2) to check how reducing granularity would affect system performance. For obtaining supersenses in a given text we relied on our disambiguation pipeline and simply clustered together senses belonging to the same WordNet supersense.
Evaluation measures. We report the results in terms of the standard accuracy and F1 measures. Since all models in our experiments provide full coverage, accuracy and F1 correspond to micro- and macro-averaged F1, respectively (Yang, 1999).
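For concreteness, the two measures can be computed as follows. This is a from-scratch sketch; with full-coverage predictions, `accuracy` equals micro-averaged F1:

```python
def accuracy(gold, pred):
    """Fraction of correct predictions (micro-F1 under full coverage)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight."""
    f1s = []
    for c in set(gold) | set(pred):
        tp = sum(g == p == c for g, p in zip(gold, pred))
        fp = sum(p == c and g != c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro-averaging weights every class equally, so it penalizes a classifier that ignores rare classes, whereas accuracy does not.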

Topic Categorization
The task of topic categorization consists of assigning a label (i.e. topic) to a given document from a pre-defined set of labels.

Datasets
For this task we used two newswire datasets and one medical topic categorization dataset. Table 1 summarizes the statistics of each dataset (coverage was computed using the 250K top words in the Google News Word2vec embeddings). The BBC news dataset (Greene and Cunningham, 2006; http://mlg.ucd.ie/datasets/bbc.html) comprises news articles taken from the BBC, divided into five topics: business, entertainment, politics, sport and tech. Newsgroups (Lang, 1995) is a collection of 11,314 documents for training and 7,532 for testing, for which we used the train-test partition available at http://qwone.com/~jason/20Newsgroups/. The dataset has 20 fine-grained categories clustered into six general topics, including computing, sport and motor vehicles, science, politics and religion; we used the coarse-grained labels for their clearer distinction and consistency with the BBC topics. Ohsumed (ftp://medir.ohsu.edu/pub/ohsumed; http://disi.unitn.it/moschitti/corpora.htm) is a domain-specific medical dataset with 23 classes.

Results
Table 2 shows the results of our classification model and its variants on the three datasets (symbols * and † indicate the sense-based model with the smallest margin to the word-based model whose accuracy difference is statistically significant at the 0.95 confidence level according to an unpaired t-test: * for a positive and † for a negative change). When the embedding layer is initialized randomly, the model integrated with word senses consistently improves over the word-based model, particularly when the fine granularity of the underlying sense inventory is reduced using supersenses (with statistically significant gains on all three datasets). This highlights the fact that a simple disambiguation of the input can bring about a performance gain for a state-of-the-art classification system. Also, the better performance of supersenses suggests that the sense distinctions of WordNet are too fine-grained for the topic categorization task. However, when pre-trained representations are used to initialize the embedding layer, no improvement is observed over the word-based model. This can be attributed to the quality of the representations, as the model utilizing them was unable to benefit from the advantage offered by sense distinctions. Our results suggest that research in sense representation should put special emphasis on real-world evaluations on benchmarks for downstream applications, rather than on artificial tasks such as word similarity. In fact, research has previously shown that word similarity might not constitute a reliable proxy for measuring the performance of word embeddings in downstream applications (Tsvetkov et al., 2015; Chiu et al., 2016).

Among the three datasets, Ohsumed proves to be the most challenging, mainly due to its larger number of classes (i.e. 23) and its domain-specific nature (i.e. medicine). Interestingly, unlike for the other two datasets, the introduction of pre-trained word embeddings results in reduced performance on Ohsumed. This suggests that general-domain embeddings might not be beneficial in specialized domains, which corroborates previous findings by Yadav et al. (2017) on a different task, i.e. entity extraction. This performance drop may also be due to diachronic issues (Ohsumed dates back to the 1980s) and low coverage: the pre-trained Word2vec embeddings cover 79.3% of the words in Ohsumed (see Table 1), in contrast to the higher coverage on the newswire datasets, i.e. Newsgroups (83.4%) and BBC (87.4%). However, note also that the best overall performance is attained when our pre-trained Wikipedia sense embeddings are used. This highlights the effectiveness of Wikipedia in handling domain-specific entities, thanks to its broad sense inventory.

Polarity Detection
Polarity detection is the most popular evaluation framework for sentiment analysis (Dong et al., 2015). The task is essentially a binary classification problem which determines whether the sentiment of a given sentence or document is negative or positive.

Datasets
For the polarity detection task we used five standard evaluation datasets; Table 1 summarizes their statistics. PL04 (Pang and Lee, 2004) is a polarity detection dataset composed of full movie reviews, whereas PL05 (Pang and Lee, 2005) is composed of short snippets from movie reviews. RTC contains critic reviews from Rotten Tomatoes, divided into 436,000 training and 2,000 test instances. IMDB (Maas et al., 2011) includes 50,000 movie reviews, split evenly between training and test. Finally, we used the Stanford Sentiment dataset, which associates each review with a value that denotes its sentiment. To be consistent with the binary classification of the other datasets, we removed the neutral phrases according to the dataset's scale (between 0.4 and 0.6) and considered the reviews whose values were below 0.4 as negative and those above 0.6 as positive. This resulted in a binary polarity dataset of 119,783 phrases. Unlike the previous four datasets, this dataset does not contain an even distribution of positive and negative labels.

Results
Table 4 lists the accuracy of our classification model and all its variants on the five polarity detection datasets. Results are generally better than those of Kim (2014), showing that the addition of the recurrent layer to the model (cf. Section 4) was beneficial. Interestingly, however, no consistent performance gain is observed in the polarity detection task when the model is provided with disambiguated input, particularly for datasets with relatively short reviews. We attribute this to the nature of the task. Firstly, given that words rarely happen to be ambiguous with respect to their sentiment, the semantic sense distinctions provided by the disambiguation stage do not assist the classifier in better decision making, and instead introduce data sparsity.
Secondly, since the datasets mostly contain short texts, e.g., sentences or snippets, the disambiguation algorithm does not have sufficient context to make high-confidence judgements, resulting in fewer disambiguations or less reliable ones. In the following section we perform a more in-depth analysis of the impact of document size on the performance of our sense-based models.
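Returning to the dataset preparation above, the binarization of the Stanford Sentiment scores can be sketched as follows. Treating scores exactly at the 0.4 and 0.6 boundaries as neutral is our assumption:

```python
def binarize(phrases, low=0.4, high=0.6):
    """Drop neutral phrases (score in [low, high]) and label the rest.
    phrases: iterable of (text, sentiment_score) pairs."""
    out = []
    for text, score in phrases:
        if score < low:
            out.append((text, "negative"))
        elif score > high:
            out.append((text, "positive"))
    return out
```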

Analysis
Document size. A detailed analysis revealed a relation between document size (the number of tokens) and the performance gain of our sense-level model. Figure 4 shows how these two vary for our most consistent configuration, i.e. Wikipedia supersenses with random initialization. Interestingly, as a general trend, the performance gain increases with average document size, irrespective of the classification task.

Table 4: Accuracy performance on five polarity detection datasets. Given that polarity datasets are balanced, we do not report F1, which would have been identical to accuracy.

We attribute this to two main factors: 1. Sparsity: Splitting a word into multiple word senses can have the negative side effect that the corresponding training data for that word is distributed among multiple independent senses. This reduces the number of training instances per word sense, which might affect the classifier's performance, particularly when senses are semantically related (in comparison to fine-grained senses, supersenses address this issue to some extent).
2. Disambiguation quality: As mentioned previously, our disambiguation algorithm requires the input text to be sufficiently large so as to create a graph with an adequate number of coherent connections to function effectively. In fact, for topic categorization, in which the documents are relatively long, our algorithm manages to disambiguate a larger proportion of words per document with high confidence. The lower performance of graph-based disambiguation algorithms on short texts is a known issue (Moro et al., 2014; Raganato et al., 2017), and tackling it remains an open area of research.
Sense granularity. Our results showed that reducing the granularity of sense distinctions can be beneficial to both tasks, irrespective of the underlying sense inventory, i.e. WordNet or Wikipedia, which corroborates previous findings (Hovy et al., 2013; Flekova and Gurevych, 2016). This suggests that text classification does not require fine-grained semantic distinctions. In this work we used a simple technique based on WordNet's lexicographer files for coarsening the senses of this inventory as well as those of Wikipedia. We leave the exploration of this promising area, as well as the evaluation of other granularity-reduction techniques for the WordNet (Snow et al., 2007; Bhagwani et al., 2013) and Wikipedia (Dandala et al., 2013) sense inventories, to future work.

Related Work
The past few years have witnessed a growing research interest in semantic representation, mainly as a consequence of the word embedding tsunami (Mikolov et al., 2013; Pennington et al., 2014). Soon after their introduction, word embeddings were integrated into different NLP applications, thanks to the migration of the field to deep learning and the fact that most deep learning models view words as dense vectors. The waves of the word embedding tsunami have also lapped on the shores of sense representation. Several techniques have been proposed that either extend word embedding models to cluster contexts and induce senses, usually referred to as unsupervised sense representations (Schütze, 1998; Reisinger and Mooney, 2010; Huang et al., 2012; Neelakantan et al., 2014; Guo et al., 2014; Tian et al., 2014; Šuster et al., 2016; Ettinger et al., 2016; Qiu et al., 2016), or exploit external sense inventories and lexical resources to generate sense representations for the individual meanings of words (Johansson and Pina, 2015; Jauhar et al., 2015; Iacobacci et al., 2015; Rothe and Schütze, 2015; Camacho-Collados et al., 2016b; Mancini et al., 2016; Pilehvar and Collier, 2016).
However, the integration of sense representations into deep learning models has not been so straightforward, and research in this field has often opted for alternative evaluation benchmarks such as WSD, or artificial tasks, such as word similarity. Consequently, the problem of integrating sense representations into downstream NLP applications has remained understudied, despite the potential benefits it can have. Li and Jurafsky (2015) proposed a "multi-sense embedding" pipeline to check the benefit that can be gained by replacing word embeddings with sense embeddings in multiple tasks. With the help of two simple disambiguation algorithms, unsupervised sense embeddings were integrated into various downstream applications, with varying degrees of success. Given the interdependency of sense representation and disambiguation in this model, it is very difficult to introduce alternative algorithms into its pipeline, either to benefit from the state of the art, or to carry out an evaluation. Instead, our pipeline provides the advantage of being modular: thanks to its use of disambiguation in the pre-processing stage and its use of sense representations that are linked to external sense inventories, different WSD techniques and sense representations can be easily plugged in and checked. Along the same lines, Flekova and Gurevych (2016) proposed a technique for learning supersense representations, using automatically-annotated corpora. Coupled with a supersense tagger, the representations were fed into a neural network classifier as additional features to the word-based input. Through a set of experiments, Flekova and Gurevych (2016) showed that the supersense enrichment can be beneficial to a range of binary classification tasks. Our proposal is different in that it focuses directly on the benefits that can be gained by semantifying the input, i.e. reducing lexical ambiguity in the input text, rather than assisting the model with additional sources of knowledge.

Conclusion and Future Work
We proposed a pipeline for the integration of sense-level knowledge into a state-of-the-art text classifier. We showed that a simple disambiguation of the input can lead to consistent performance gains, particularly for longer documents and when the granularity of the underlying sense inventory is reduced. Our pipeline is modular and can be used as an in vivo evaluation framework for WSD and sense representation techniques. We release our code and data to reproduce our experiments (including pre-trained sense and supersense embeddings) at https://github.com/pilehvar/sensecnn to allow further checking of the choice of hyperparameters and to allow further analysis and comparison. We hope that our work will foster future research on the integration of sense-level knowledge into downstream applications. As future work, we plan to investigate the extension of the approach to other languages and applications. Also, given the promising results observed for supersenses, we will investigate task-specific coarsening of sense inventories, particularly Wikipedia, or the use of SentiWordNet (Baccianella et al., 2010), which could be more suitable for polarity detection.