Sparsity Makes Sense: Word Sense Disambiguation Using Sparse Contextualized Word Representations

In this paper, we demonstrate that by utilizing sparse word representations, it becomes possible to surpass the results of more complex task-specific models on the task of fine-grained all-words word sense disambiguation. Our proposed algorithm relies on an overcomplete set of semantic basis vectors that allows us to obtain sparse contextualized word representations. We introduce an information theory-inspired synset representation based on the co-occurrence of word senses and the non-zero coordinates of word forms, which allows us to achieve an aggregated F-score of 78.8 over a combination of five standard word sense disambiguation benchmark datasets. We also demonstrate the general applicability of our proposed framework by evaluating it on part-of-speech tagging on four different treebanks. Our results indicate a significant improvement over the application of the dense word representations.


Introduction
Natural language processing applications have benefited remarkably from language modeling based contextualized word representations, including CoVe (McCann et al., 2017), ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), inter alia. Contrary to standard "static" word embeddings like word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), contextualized representations assign vectorial representations to mentions of word forms that are sensitive to the entire sequence in which they are present. This characteristic of contextualized word embeddings makes them highly applicable for performing word sense disambiguation (WSD), as has been investigated recently (Loureiro and Jorge, 2019; Vial et al., 2019).
Another popular line of research deals with sparse overcomplete word representations, which differ from typical word embeddings in that most coefficients are exactly zero. Such sparse word representations have been argued to convey an increased interpretability (Murphy et al., 2012; Faruqui et al., 2015; Subramanian et al., 2018), which could be advantageous for WSD. It has been shown that sparsity can not only favor interpretability, but can also contribute to an increased performance in downstream applications (Faruqui et al., 2015; Berend, 2017).
The goal of this paper is to investigate and quantify what synergies exist between contextualized and sparse word representations. Our rigorous experiments show that it is possible to get increased performance on top of contextualized representations when they are post-processed in a way which ensures their sparsity.
In this paper we introduce an information theory-inspired algorithm for creating sparse contextualized word representations and evaluate it in a series of challenging WSD tasks. In our experiments, we managed to obtain solid results for multiple fine-grained word sense disambiguation benchmarks. All our source code for reproducing our experiments is made available at https://github.com/begab/sparsity_makes_sense.
Our contributions can be summarized as follows:
• we propose the application of contextualized sparse overcomplete word representations in the task of word sense disambiguation,
• we carefully evaluate our information theory-inspired approach for quantifying the strength of the connection between the individual dimensions of (sparse) word representations and human interpretable semantic content such as fine-grained word senses,
• we demonstrate the general applicability of our algorithm by applying it to POS tagging on four different UD treebanks.

Related work
One of the key difficulties of natural language understanding is the highly ambiguous nature of language. As a consequence, WSD has long-standing origins in the NLP community (Lesk, 1986; Resnik, 1997a,b), and it still receives major research interest (Raganato et al., 2017a; Trask et al., 2015; Melamud et al., 2016; Loureiro and Jorge, 2019; Vial et al., 2019). A thorough survey on WSD algorithms of the pre-neural era can be found in (Navigli, 2009). A typical evaluation for WSD systems is to quantify the extent to which they are capable of identifying the correct sense of ambiguous words in their contexts according to some sense inventory. One of the most frequently applied sense inventories in the case of English is the Princeton WordNet (Fellbaum, 1998), which also served as the basis of our evaluation.
A variety of WSD approaches has evolved ranging from unsupervised and knowledge-based solutions to supervised ones. Unsupervised approaches could investigate the textual overlap between the context of ambiguous words and their potential sense definitions (Lesk, 1986) or they could be based on random walks over the semantic graph providing the sense inventory (Agirre and Soroa, 2009).
Supervised WSD techniques typically perform better than unsupervised approaches. IMS (Zhong and Ng, 2010) is a classical supervised WSD framework which was created with the intention of easy extensibility. It trains SVMs for predicting the correct sense of a word based on traditional features, such as the surface forms and POS tags of the ambiguous word as well as its neighboring words.
The recent advent of neural text representations has also shaped the landscape of algorithms performing WSD. Iacobacci et al. (2016) extended the classical feature-based IMS framework by incorporating word embeddings. Melamud et al. (2016) devised context2vec, which relies on a bidirectional LSTM (biLSTM) for performing supervised WSD. Kågebäck and Salomonsson (2016) also proposed the utilization of biLSTMs for WSD. Raganato et al. (2017b) tackled all-words WSD as a sequence learning problem and solved it using LSTMs. Vial et al. (2019) introduced a similar framework, but replaced the LSTM decoder with an ensemble of transformers. Vial et al. (2019) additionally relied on BERT contextual word representations as input to their all-words WSD system.
Contextual word embeddings have recently superseded traditional word embeddings due to their advantageous property of also modeling the neighboring context of words upon determining their vectorial representations. As such, the same word form gets assigned a separate embedding when mentioned in different contexts. Contextualized word vectors, including (Devlin et al., 2019; Yang et al., 2019), typically employ some language modelling-inspired objective and are trained on massive amounts of textual data, which makes them generally applicable in a variety of settings, as illustrated by top-performing entries on the SuperGLUE leaderboard (Wang et al., 2019).
Most recently, Loureiro and Jorge (2019) have proposed the usage of contextualized word representations for tackling WSD. Their framework builds upon BERT embeddings and performs WSD relying on a k-NN approach of query words towards the sense embeddings that are derived as the centroids of contextual embeddings labeled with a certain sense. The framework also utilizes static fasttext (Bojanowski et al., 2017) embeddings, and averaged contextual embeddings derived from the definitions attached to WordNet senses for mitigating the problem caused by the limited amounts of sense-labeled training data. Kumar et al. (2019) proposed the EWISE approach, which constructs sense definition embeddings also relying on the network structure of WordNet for performing zero-shot WSD in order to handle words without any sense-annotated occurrence in the training data. Bevilacqua and Navigli (2020) introduced EWISER as an improvement over the EWISE approach by providing a hybrid knowledge-based and supervised approach via the integration of explicit relational information from WordNet. Our approach differs from both (Kumar et al., 2019) and (Bevilacqua and Navigli, 2020) in that we do not exploit the structural properties of WordNet.
SenseBERT (Levine et al., 2019) extends BERT (Devlin et al., 2019) by incorporating an auxiliary task into the masked language modeling objective for predicting word supersenses besides word identities. Our approach differs from SenseBERT as we do not propose an alternative way for training contextualized embeddings, but introduce an algorithm for extracting a useful representation from pretrained BERT embeddings that can effectively be used for WSD. Due to this conceptual difference, our approach does not need a large transformer model to be trained, but can be readily applied over pretrained models.
GlossBERT (Huang et al., 2019) framed WSD as a sentence pair classification task between the sentence containing an ambiguous target token and the contents of the glosses for the potential synsets of the ambiguous token and fine-tuned BERT accordingly. GlossBERT hence requires a fine-tuning stage, whereas our approach builds directly on the pre-trained contextual embeddings, which makes it more resource efficient.
Our work also relates to the line of research on sparse word representations. The seminal work on obtaining sparse word representations by Murphy et al. (2012) applied matrix factorization over the co-occurrence matrix built from some corpus. Arora et al. (2018) investigated the linear algebraic structure of static word embedding spaces and concluded that "simple sparse coding can recover vectors that approximately capture the senses".

Approach
Our algorithm is composed of two steps: we first derive a sparse representation from the dense contextualized ones, then we derive a succinct representation describing the strength of the connection between the individual basis vectors of our representation and the sense inventory against which we would like to perform WSD. We elaborate on these components next.

Sparse contextualized embeddings
Our algorithm first determines contextualized word representations for some sense-annotated corpus. We shall denote the surface form realizations in the corpus as $X = \left\{ [x^{(i)}_j]_{j=1}^{N_i} \right\}_{i=1}^{M}$, with $x^{(i)}_j$ standing for the token at position $j$ within sentence $i$, supposing a total of $M$ sequences and $N_i$ tokens in sentence $i$. We refer to the contextualized word representation for some token in boldface, i.e. $\mathbf{x}^{(i)}_j$, and to the collection of contextual embeddings as $\mathbf{X} = \left\{ [\mathbf{x}^{(i)}_j]_{j=1}^{N_i} \right\}_{i=1}^{M}$. Analogously to the sequence of sentences and their respective tokens, we also utilize a sequence of annotations that we denote as $S = \left\{ [\mathbf{s}^{(i)}_j]_{j=1}^{N_i} \right\}_{i=1}^{M}$, with $\mathbf{s}^{(i)}_j$ indicating the labeling of token $j$ within sentence $i$. We have $\mathbf{s}^{(i)}_j \in \{0, 1\}^{|\mathcal{S}|}$, with $\mathcal{S}$ denoting the set of possible labels included in our annotated corpus. That is, we have an indicator vector conveying the annotation for every token. We allow for the $\mathbf{s}^{(i)}_j = \mathbf{0}$ case, meaning that it is possible that certain tokens lack annotation. In the case of WSD, the annotation takes the form of sense annotation, but in general, the token-level annotations could convey other types of information as well.
The next step in our algorithm is to perform sparse coding over the contextual embeddings of the annotated corpus. Sparse coding is a matrix decomposition technique which tries to approximate some matrix $X \in \mathbb{R}^{v \times m}$ as a product of a sparse matrix $\alpha \in \mathbb{R}^{v \times k}$ and a dictionary matrix $D \in \mathbb{R}^{k \times m}$, where $k$ denotes the number of basis vectors to be employed.
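We formed matrix $X$ by stacking and unit normalizing the contextual embeddings comprising $\mathbf{X}$. We then optimize
$$\min_{D \in \mathcal{C},\ \alpha^{(i)}_j \in \mathbb{R}^{k}_{\geq 0}} \sum_{i=1}^{M} \sum_{j=1}^{N_i} \left\| \mathbf{x}^{(i)}_j - \alpha^{(i)}_j D \right\|_2^2 + \lambda \left\| \alpha^{(i)}_j \right\|_1, \qquad (1)$$
where $\mathcal{C}$ denotes the convex set of matrices with row norm at most 1, $\lambda$ is the regularization coefficient and the sparse coefficients in $\alpha^{(i)}_j$ are required to be non-negative. We imposed the non-negativity constraint on $\alpha$ as it has been reported to provide increased interpretability (Murphy et al., 2012).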
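As a rough illustration of this optimization problem, the following NumPy sketch alternates a non-negative proximal gradient step on the sparse codes with a projected gradient step on the dictionary. It is not the SPAMS solver used in the experiments; the function names, the step-size rule, the iteration count and the toy data are all assumptions made for illustration.

```python
import numpy as np

def objective(X, alpha, D, lam):
    """Squared reconstruction error plus the l1 penalty on the codes."""
    return np.sum((X - alpha @ D) ** 2) + lam * np.sum(np.abs(alpha))

def learn_sparse_codes(X, k, lam=0.05, n_iter=50, seed=0):
    """Toy alternating minimisation (an illustrative stand-in, not SPAMS).

    X: (v, m) matrix with one unit-normalised embedding per row.
    Returns non-negative codes alpha (v, k), a dictionary D (k, m)
    and the objective value recorded after every iteration.
    """
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(k, X.shape[1]))
    D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)
    alpha = np.zeros((X.shape[0], k))
    history = [objective(X, alpha, D, lam)]
    for _ in range(n_iter):
        # Code step: proximal gradient with step 1/L, L = 2 * ||D D^T||_2.
        L_a = max(2.0 * np.linalg.norm(D @ D.T, 2), 1e-12)
        grad_a = 2.0 * (alpha @ D - X) @ D.T
        # Non-negative soft-thresholding covers both the l1 term and alpha >= 0.
        alpha = np.maximum(alpha - grad_a / L_a - lam / L_a, 0.0)
        # Dictionary step: gradient step, then projection of every row
        # onto the unit ball (the constraint set of the dictionary).
        L_d = max(2.0 * np.linalg.norm(alpha.T @ alpha, 2), 1e-12)
        D = D - 2.0 * alpha.T @ (alpha @ D - X) / L_d
        D /= np.maximum(np.linalg.norm(D, axis=1, keepdims=True), 1.0)
        history.append(objective(X, alpha, D, lam))
    return alpha, D, history

# Hypothetical toy input: 40 random "embeddings" of dimension 8.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalise each row
alpha, D, history = learn_sparse_codes(X, k=12)
```

Both steps use the inverse of the Lipschitz constant of the respective gradient as the step size, so the objective does not increase from one iteration to the next; a dedicated dictionary learning library would be the sensible choice at the scale of SemCor.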

Binding basis vectors to senses
Once we have obtained a sparse contextualized representation for each token in our annotated corpus, we determine the extent to which the individual bases comprising the dictionary matrix $D$ bind to the elements of our label inventory $\mathcal{S}$. In order to do so, we devise a matrix $\Phi \in \mathbb{R}^{k \times |\mathcal{S}|}$, which contains a score $\phi_{bs}$ for each pair of a basis vector $b$ and a particular label $s$. We summarize our algorithm for obtaining $\Phi$ in Algorithm 1.
The definition of $\Phi$ is based on a generalization of the co-occurrence of bases and the elements of the label inventory $\mathcal{S}$. We first define our co-occurrence matrix between bases and labels as
$$C = \sum_{i=1}^{M} \sum_{j=1}^{N_i} {\alpha^{(i)}_j}^{\top} \mathbf{s}^{(i)}_j, \qquad (2)$$
i.e. $C$ is the sum of outer products of the sparse word representations ($\alpha^{(i)}_j$) and their respective sense description vectors ($\mathbf{s}^{(i)}_j$). The definition in (2) ensures that every $c_{bs} \in C$ aggregates the sparse non-negative coefficients that words labeled as $s$ have received for their coordinate $b$. Recall that we allowed certain $\mathbf{s}^{(i)}_j$ to be the all-zero vector, i.e. tokens that lack any annotation are conveniently handled by Eq. (2), as the sparse coefficients of such tokens do not contribute towards $C$.
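The sum of outer products can be written out directly. The NumPy sketch below uses hypothetical toy data (3 tokens, 2 bases, 2 labels) and an illustrative function name; it shows that an unannotated token leaves the co-occurrence matrix unchanged and that the sum equals a single matrix product.

```python
import numpy as np

def cooccurrence(alpha, S):
    """Sum of outer products between each token's code and its label vector.

    alpha: (n_tokens, k) non-negative sparse codes.
    S:     (n_tokens, n_labels) 0/1 indicator rows; an all-zero row marks
           a token without annotation.
    """
    C = np.zeros((alpha.shape[1], S.shape[1]))
    for a, s in zip(alpha, S):
        C += np.outer(a, s)      # an all-zero s adds nothing
    return C

# Hypothetical toy data.
alpha = np.array([[0.5, 0.0],
                  [0.0, 2.0],
                  [1.0, 1.0]])
S = np.array([[1, 0],    # token 0 carries label 0
              [0, 1],    # token 1 carries label 1
              [0, 0]])   # token 2 is unannotated
C = cooccurrence(alpha, S)
```

In practice the loop would be replaced by the equivalent product `alpha.T @ S`, ideally on sparse matrices.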
We next turn the elements of $C$ into a matrix representing a joint probability distribution $P$ by determining the $\ell_1$-normalized variant of $C$ (line 5 of Algorithm 1). This way we devise a sparse matrix, the entries of which can be used for calculating Pointwise Mutual Information (PMI) between semantic bases and the presence of symbolic senses of our sense inventory.
For a pair of events $(i, j)$, PMI is measured as $\log \frac{p_{ij}}{p_{i*} p_{*j}}$, with $p_{ij}$ referring to their joint probability, and $p_{i*}$ and $p_{*j}$ denoting the marginal probabilities of $i$ and $j$, respectively. We determine these probabilities from the entries of $P$ that we obtain from $C$ via $\ell_1$ normalization.
Employing positive PMI. Negative PMI values for a pair of events convey the information that they repel each other. Multiple studies have argued that negative PMI values are hence detrimental (Bullinaria and Levy, 2007; Levy et al., 2015). To this end, we could opt for the determination of positive PMI (pPMI) values as indicated in line 7 of Algorithm 1.
Employing normalized PMI. An additional property of (positive) PMI is that it favors observations with low marginal frequency (Bouma, 2009), since for events with low marginal probability $p(x)$, $p(x|y) \approx p(x)$ tends to hold, which results in high PMI values. In our setting, it would result in rarer senses receiving higher $\phi_{bs}$ scores towards all the bases.
In order to handle low-frequency senses better, we optionally calculate the normalized (positive) PMI (Bouma, 2009) between a basis and a sense as $\frac{\log \frac{p_{ij}}{p_{i*} p_{*j}}}{-\log (p_{ij})}$. That is, we normalize the PMI scores by the negative logarithm of the joint probability (cf. line 8 of Algorithm 1). This step additionally ensures that the normalized PMI (nPMI) ranges between $-1$ and $1$, as opposed to the $(-\infty, \min(-\log(p_{i*}), -\log(p_{*j})))$ range of the unnormalized PMI values.
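Three limiting cases illustrate the stated range of nPMI. They follow directly from the definitions (assuming $0 < p_{ij} < 1$) and are included only as a worked check.

```latex
% Independence: p_{ij} = p_{i*} p_{*j}
\mathrm{PMI}(i,j) = \log\frac{p_{i*}\,p_{*j}}{p_{i*}\,p_{*j}} = 0
\quad\Rightarrow\quad
\mathrm{nPMI}(i,j) = \frac{0}{-\log p_{ij}} = 0

% Perfect co-occurrence: p_{ij} = p_{i*} = p_{*j}
\mathrm{PMI}(i,j) = \log\frac{p_{ij}}{p_{ij}^{2}} = -\log p_{ij}
\quad\Rightarrow\quad
\mathrm{nPMI}(i,j) = \frac{-\log p_{ij}}{-\log p_{ij}} = 1

% Vanishing joint probability with fixed marginals: p_{ij} \to 0
\mathrm{nPMI}(i,j) = \frac{\log p_{ij} - \log(p_{i*}\,p_{*j})}{-\log p_{ij}}
                   = -1 + \frac{-\log(p_{i*}\,p_{*j})}{-\log p_{ij}}
\;\longrightarrow\; -1
```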
Algorithm 1 Calculating Φ
Require: sense annotated corpus (X, S)
Ensure: Φ ∈ R k×|S| describing the strength between k sense basis and the elements of the sense inventory |S|
1: procedure CALCULATEPHI(X, S)
2: return Φ, D
10: end procedure
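The steps described in the text (normalization of the co-occurrence matrix, PMI, optional clipping and optional normalization) can be sketched in NumPy as follows. This is an illustrative reading of the prose, not the authors' implementation: the function name, the flags and the choice of scoring never co-occurring pairs as 0 are assumptions.

```python
import numpy as np

def phi_from_cooccurrence(C, positive=True, normalize=True):
    """Turn a non-negative (k, n_labels) co-occurrence matrix into Phi.

    positive=True clips negative PMI values to zero (pPMI);
    normalize=True divides PMI by -log of the joint probability (nPMI).
    Pairs that never co-occur receive a score of 0 (simplifying assumption).
    C is assumed to contain at least one positive entry.
    """
    P = C / C.sum()                          # l1-normalise to a joint distribution
    row = P.sum(axis=1, keepdims=True)       # marginal of each basis
    col = P.sum(axis=0, keepdims=True)       # marginal of each label
    seen = P > 0
    pmi = np.zeros_like(P)
    pmi[seen] = np.log(P[seen] / (row @ col)[seen])
    if positive:
        pmi = np.maximum(pmi, 0.0)
    if normalize:
        denom = -np.log(P[seen])
        denom[denom == 0] = 1.0              # a single certain pair has PMI 0 anyway
        pmi[seen] /= denom
    return pmi

# Hypothetical toy co-occurrence matrix: 2 bases, 2 labels.
C = np.array([[4.0, 0.0],
              [1.0, 5.0]])
phi = phi_from_cooccurrence(C)                                        # npPMI
vanilla = phi_from_cooccurrence(C, positive=False, normalize=False)   # plain PMI
```

Toggling the two flags yields the four PMI variants that the text distinguishes (plain, positive, normalized, and normalized positive).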

Inferring senses
We now describe the way we assign the most plausible sense to any given token from a sequence according to the sense inventory employed for constructing D and Φ.
For an input sequence of $N$ tokens accompanied by their corresponding contextualized word representations $[\mathbf{x}_j]_{j=1}^{N}$, we determine their corresponding sparse representations $[\alpha_j]_{j=1}^{N}$ based on the $D$ that we have already determined upon obtaining $\Phi$. That is, we solve an $\ell_1$-regularized convex optimization problem with $D$ being kept fixed for all the unit normalized vectors $\mathbf{x}_j$ in order to obtain the sparse contextualized word representation $\alpha_j$ for every token $j$ in the sequence.
We then take the product between $\alpha_j \in \mathbb{R}^{k}$ and $\Phi \in \mathbb{R}^{k \times |\mathcal{S}|}$. Since every column in $\Phi$ corresponds to a sense from the sense inventory, every scalar in the resulting product $\alpha_j \Phi \in \mathbb{R}^{|\mathcal{S}|}$ can be interpreted as the quantity indicating the extent to which token $j$, in its given context, pertains to the individual senses from the sense inventory. In other words, we assign to a particular token $j$ the sense $s$ which maximizes $\alpha_j \Phi_{*s}$, where $\Phi_{*s}$ indicates the column vector from $\Phi$ corresponding to sense $s$.
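The decision rule amounts to a matrix-vector product followed by an argmax. The sketch below uses hypothetical toy values; the optional `candidates` argument is an illustrative name for restricting the argmax to the senses that a given lemma can take.

```python
import numpy as np

def predict_sense(alpha_j, Phi, candidates=None):
    """Return the column index of Phi that maximises alpha_j @ Phi.

    alpha_j:    (k,) sparse code of one token.
    Phi:        (k, n_senses) basis-to-sense scores.
    candidates: optional iterable of admissible sense indices
                (hypothetical argument, e.g. the synsets of the lemma).
    """
    scores = alpha_j @ Phi
    if candidates is None:
        return int(np.argmax(scores))
    candidates = list(candidates)
    return candidates[int(np.argmax(scores[candidates]))]

# Hypothetical toy values: k = 2 bases, 3 senses.
Phi = np.array([[0.9, 0.0, 0.1],
                [0.0, 0.8, 0.3]])
alpha_j = np.array([0.2, 1.0])

best_overall = predict_sense(alpha_j, Phi)
best_allowed = predict_sense(alpha_j, Phi, candidates=[0, 2])
```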

Experiments and results
We evaluate our approach using the unified WSD evaluation framework released by Raganato et al. (2017a), which includes the sense-annotated SemCor dataset for training purposes. SemCor (Miller et al., 1994) consists of 802,443 tokens, with more than 28% (226,036) of its tokens being sense-annotated using WordNet sensekeys.
For instance, bank%1:14:00:: is one of the possible sensekeys that the word bank can be assigned, according to one of the 18 different synsets in which it is included in WordNet 3.0. WordNet 3.0 contains altogether 206,949 distinct senses for 147,306 unique lemmas grouped into 117,659 synsets. We constructed Φ relying on the synset-level information of WordNet.

Sparse contextualized embeddings
For obtaining contextualized word representations, we rely on the pretrained bert-large-cased model from (Wolf et al., 2019). Each input token x BERT relies on WordPiece tokenization, which means that a single token, such as playing, could be broken up into multiple subwords (play and ##ing). We defined token-level contextual embeddings to be the average of their subword-level contextual embeddings.
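Averaging subword vectors into token vectors can be sketched as follows. The snippet deliberately avoids any tokenizer API: the mapping from subwords to tokens (`word_ids`) is assumed to be given, every token index is assumed to occur at least once, and the 2-dimensional vectors are toy stand-ins for 1024-dimensional BERT outputs.

```python
import numpy as np

def average_subwords(subword_vecs, word_ids):
    """Average the subword vectors that belong to the same token.

    subword_vecs: (n_subwords, d) contextual vectors, one per WordPiece.
    word_ids:     for every subword, the index of the token it came from.
    """
    word_ids = np.asarray(word_ids)
    n_tokens = word_ids.max() + 1
    sums = np.zeros((n_tokens, subword_vecs.shape[1]))
    np.add.at(sums, word_ids, subword_vecs)     # accumulate rows per token
    counts = np.bincount(word_ids, minlength=n_tokens)
    return sums / counts[:, None]

# "playing" is split into "play" and "##ing"; "I" stays a single piece.
vecs = np.array([[1.0, 0.0],    # "I"
                 [2.0, 2.0],    # "play"
                 [4.0, 0.0]])   # "##ing"
tokens = average_subwords(vecs, word_ids=[0, 1, 1])
```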
Sparse coding as formulated in (1) took the stacked 1024-dimensional contextualized BERT embeddings for the 802,443 tokens from SemCor as input, i.e. we had $X \in \mathbb{R}^{1024 \times 802443}$. We used the SPAMS library (Mairal et al., 2009) to solve our optimization problems. Our approach has two hyperparameters, i.e. the number of basis vectors included in the dictionary matrix ($k$) and the regularization coefficient ($\lambda$). We experimented with $k \in \{1500, 2000, 3000\}$ in order to investigate the sensitivity of our proposed algorithm towards the dimension of the sparse vectors and we employed $\lambda = 0.05$ throughout all our experiments. Figure 1 includes the average number of nonzero coefficients for the sparse word representations

Evaluation on all-words WSD
The evaluation framework introduced in ( The concatenation of the previous datasets is also included in the evaluation toolkit, which is commonly referred to as the ALL dataset and includes 7253 sense-annotated test cases. We relied on the official scoring script included in the evaluation framework from (Raganato et al., 2017a). Unless stated otherwise, we report our results on the combination of all the datasets for brevity, as results for all the subcorpora behaved similarly.
In order to demonstrate the benefits of our proposed approach, we develop a strong baseline similar to the one devised in (Loureiro and Jorge, 2019). This approach employs the very same contextualized embeddings that we use in our algorithm, providing identical conditions for the different approaches. For each synset $s$, we determine its centroid based on the contextualized word representations pertaining to sense $s$ according to the training data. We then use the matrix $\Psi$ formed from these centroids as a replacement for $\Phi$ when making predictions for some token with its dense contextualized embedding $\mathbf{x}_j$.
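A minimal sketch of such a centroid baseline is given below. The function name, the use of -1 for unannotated tokens and the toy data are assumptions; senses without any training token simply keep an all-zero column here.

```python
import numpy as np

def sense_centroids(X, labels, n_senses):
    """Stack one centroid per sense into a (d, n_senses) matrix.

    X:      (n_tokens, d) dense contextual embeddings.
    labels: sense index per token, or -1 for unannotated tokens
            (the -1 convention is an assumption of this sketch).
    """
    labels = np.asarray(labels)
    Psi = np.zeros((X.shape[1], n_senses))
    for s in range(n_senses):
        members = X[labels == s]
        if len(members):
            Psi[:, s] = members.mean(axis=0)
    return Psi

# Hypothetical toy data: 4 tokens, 2-dimensional embeddings, 2 senses.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [5.0, 5.0]])
labels = [0, 1, 0, -1]            # the last token is unannotated
Psi = sense_centroids(X, labels, n_senses=2)

x_query = np.array([1.0, 0.2])
predicted = int(np.argmax(x_query @ Psi))   # same decision rule as with Phi
```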
The way we make our fine-grained sensekey predictions for the test tokens is identical when utilizing dense and sparse contextualized embeddings; the only difference is whether we base our decision on $\mathbf{x}_j \Psi$ (for the dense case) or $\alpha_j \Phi$ (for the sparse case). In either case, we choose the best scoring synset a particular query lemma can belong to. That is, we perform the argmax operation described in Section 3.3 over the set of possible synsets a query lemma can belong to. Figure 2 includes comparative results for the approach using dense and sparse contextualized embeddings derived from different layers of BERT. We can see that our approach yields considerable improvements over the application of dense embeddings. In fact, applying sparse contextualized embeddings provided significantly better results ($p < 0.01$ using McNemar's test) irrespective of the choice of $k$ when compared against the utilization of dense embeddings.
Additionally, the different choices for the dimension of the sparse word representations do not seem to play a decisive role, as illustrated by Figure 2 and also confirmed by our significance tests conducted between the sparse approaches using different values of $k$. Since the choice of $k$ did not severely impact the results, we report our experiments for the $k = 3000$ case hereon.
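For readers unfamiliar with McNemar's test, the sketch below computes a two-sided exact (binomial) version on paired predictions using only the standard library. Which variant of the test was used in the experiments is not stated, so the exact form, the function name and the toy data are assumptions.

```python
from math import comb

def mcnemar_exact(gold, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Only test cases where exactly one system is correct matter:
    b counts "A right, B wrong", c counts "A wrong, B right".
    Under the null hypothesis each such case is a fair coin flip.
    """
    b = sum(a == g and p != g for g, a, p in zip(gold, pred_a, pred_b))
    c = sum(a != g and p == g for g, a, p in zip(gold, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0                      # the systems never disagree
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical toy example: system A is always right, B errs on 8 of 10 cases.
gold = [1] * 10
pred_a = [1] * 10
pred_b = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
p_value = mcnemar_exact(gold, pred_a, pred_b)
```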

Increasing the amount of training data
We also measured the effects of increasing the amount of training data. We additionally used two sources of information for training, i.e. the WordNet synsets themselves and the Princeton WordNet Gloss Corpus (WNGC). The WordNet synsets were utilized in an identical fashion to the LMMS approach (Loureiro and Jorge, 2019), i.e. we determined a vectorial representation for each synset by taking the average of the contextual representations that are based on the concatenation of the definition and the lemmas belonging to the synset.
WNGC includes a sense-annotated version of WordNet itself containing 117,659 definitions (one for each synset in WordNet), consisting of 1,634,691 tokens, out of which 614,435 have a corresponding sensekey attached. We obtained this data from the Unification of Sense Annotated Corpora (UFSAC) (Vial et al., 2018).
For this experiment our entire framework was kept intact; the only difference was that instead of solely relying on the sense-annotated training data included in SemCor, we additionally relied on the sense representations derived from WordNet glosses and the sense annotations included in WNGC upon the determination of Φ and Ψ for the sparse and dense cases, respectively. For these experiments we used the same set of semantic basis vectors D that we determined earlier for the case when we relied solely on SemCor as the source of sense-annotated data. Figure 3 includes our results when increasing the amount of sense-annotated training data. We can see that the additional training data consistently improves performance for both the dense and the sparse case. Figure 3 also demonstrates that our proposed method, when trained on the SemCor data alone, is capable of achieving the same or better performance as the approach which is based on dense contextual embeddings using all the available sources of training signal.

Ablation experiments
We gave a detailed description of our algorithm in Section 3.2. We now report the experiments that we conducted in order to see the contribution of the individual components of our algorithm. As mentioned in Section 3.2, determining normalized positive PMI (npPMI) between the semantic bases and the elements of the sense inventory plays a central role in our algorithm.
In order to see the effects of normalizing and keeping only the positive PMI values, we evaluated 3 further *PMI-based variants for the calculation of Φ, i.e. we had
• vPMI, vanilla PMI without normalization or discarding negative entries,
• pPMI, which discards negative PMI values but does not normalize them, and
• nPMI, which performs normalization but does not discard negative PMI values.
Additionally, we evaluated a system which uses sparse contextualized word representations for determining Φ, but does not involve the calculation of PMI scores at all. In that case we calculated a centroid for every synset, similar to the calculation of Ψ for the case of contextualized embeddings that are kept dense. The only difference is that for the approach we refer to as no PMI, we calculated synset centroids based on the sparse contextualized word representations. Figure 4 includes our results for the previously mentioned variants of our algorithm when relying on the different layers of BERT as input. Figure 4 highlights that calculating PMI is indeed a crucial step in our algorithm (cf. the no PMI and *PMI results). We also tried to adapt the *PMI approaches for the dense contextual embeddings, but the results dropped severely in that case.
We can additionally observe that normalization has the most impact on improving the results, as the performance of nPMI is at least 4 points better than that of vPMI for all layers. Not relying on negative PMI scores also had an overall positive effect (cf. vPMI and pPMI), which seems to be additive with normalization (cf. nPMI and npPMI).

Comparative results
We next provide detailed performance results broken down for the individual subcorpora of the evaluation dataset. Table 1 includes comparative results to previous methods that also use SemCor and optionally WordNet glosses as their training data. In Table 1 we report the results obtained by our model which derives sparse contextual word embeddings based on the averaged representations retrieved from the last four layers of BERT, identical to how it was done in (Loureiro and Jorge, 2019). Figure 4 illustrates that reporting results from any of the last 4 layers would not change our overall results substantially. Table 1 reveals that it is only the LMMS$_{2348}$ (Loureiro and Jorge, 2019) approach which performs comparably to our algorithm. LMMS$_{2348}$ determines dense sense representations relying on the large BERT model as well. The sense representations used by LMMS$_{2348}$ are a concatenation of the 1024-dimensional centroid of each sense encountered in the training data, a 1024-dimensional vector derived from the glosses of WordNet synsets and a 300-dimensional static fasttext embedding. Even though our approach does not rely on static fasttext embeddings, we still managed to improve upon the best results reported in (Loureiro and Jorge, 2019). The improvement of our approach which uses the SemCor training data alone is 1.9 points compared to LMMS$_{1024}$, i.e. the variant of the LMMS system (Loureiro and Jorge, 2019) which also relies solely on BERT representations for the SemCor training set.
Table 1: Comparison with previous supervised results in terms of F measure computed by the official scorer provided in (Raganato et al., 2017a).

Evaluation towards POS tagging
In order to demonstrate the general applicability of our proposed algorithm, we evaluated it on POS tagging using version 2.5 of Universal Dependencies. We conducted experiments over four different subcorpora in English, namely the EWT (Silveira et al., 2014), GUM (Zeldes, 2017), LinES (Ahrenberg, 2007) and ParTUT (Sanguinetti and Bosco, 2015) treebanks.
For these experiments, we used the same approach as before. We also used the same dictionary matrix D for obtaining the sparse word representations that we determined based on the SemCor dataset. The only difference for our POS tagging experiments is that this time the token-level labels were the POS tags of the individual tokens as opposed to their sense labels. This means that both Ψ and Φ had 17 columns, i.e. the number of distinct POS tags used in these treebanks. Figure 5 reveals that the approach utilizing sparse contextualized word representations outperforms the one that is based on the adaptation of the LMMS approach for POS tagging by a fair margin, again irrespective of the layer of BERT that is used as input. A notable difference compared to the results obtained for all-words WSD is that for POS tagging the intermediate layers of BERT seem to deliver the most useful representation. We used the development set of the individual treebanks for choosing the most promising layer of BERT for each approach. For the npPMI approach we selected layers 13, 13, 14 and 11 for the EWT, GUM, LinES and ParTUT treebanks, respectively. As for the dense centroid-based approach, we selected layer 6 for the ParTUT treebank and layer 13 for the rest of the treebanks. Our results for the test sets of the four treebanks obtained this way are reported in Table 2. Our approach delivered significant improvements for POS tagging as well, as indicated by the p-values of the McNemar test.

Conclusions
In this paper we investigated how the application of sparse word representations obtained from contextualized word embeddings can provide a substantially increased ability for solving problems that require the distinction of fine-grained word senses. In our experiments, we managed to obtain solid results for multiple fine-grained word sense disambiguation benchmarks with the help of our information theory-inspired algorithm. We additionally carefully investigated the effects of increasing the amount of sense-annotated training data and the different design choices we made. We also demonstrated the general applicability of our approach by evaluating it in POS tagging. Our source code is made available at https://github.com/begab/sparsity_makes_sense.