SenseBERT: Driving Some Sense into BERT

The ability to learn from large unlabeled corpora has allowed neural language models to advance the frontier in natural language understanding. However, existing self-supervision techniques operate at the word form level, which serves as a surrogate for the underlying semantic content. This paper proposes a method to employ weak-supervision directly at the word sense level. Our model, named SenseBERT, is pre-trained to predict not only the masked words but also their WordNet supersenses. Accordingly, we attain a lexical-semantic level language model, without the use of human annotation. SenseBERT achieves significantly improved lexical understanding, as we demonstrate by experimenting on SemEval Word Sense Disambiguation, and by attaining a state of the art result on the ‘Word in Context’ task.


Introduction
Neural language models have recently undergone a qualitative leap forward, pushing the state of the art on various NLP tasks. Together with advances in network architecture (Vaswani et al., 2017), the use of self-supervision has proven to be central to these achievements, as it allows the network to learn from massive amounts of unannotated text.
The self-supervision strategy employed in BERT (Devlin et al., 2019) involves masking some of the words in an input sentence, and then training the model to predict them given their context. Other proposed approaches for self-supervised objectives, including unidirectional (Radford et al., 2019), permutational (Yang et al., 2019), or word insertion-based (Chan et al., 2019) methods, operate similarly, over words. However, since a given word form can possess multiple meanings (e.g., the word 'bass' can refer to a fish, a guitar, a type of singer, etc.), the word itself is merely a surrogate for its actual meaning in a given context, referred to as its sense. Indeed, the word-form level is viewed as a surface level which often introduces challenging ambiguity (Navigli, 2009).
In this paper, we bring forth a novel methodology for applying weak supervision directly on the level of a word's meaning. By infusing word-sense information into BERT's pre-training signal, we explicitly expose the model to lexical semantics when learning from a large unannotated corpus. We call the resultant sense-informed model SenseBERT. Specifically, we add a masked-word sense prediction task as an auxiliary task in BERT's pre-training. Thereby, jointly with the standard word-form level language model, we train a semantic-level language model that predicts the missing word's meaning. Our method does not require sense-annotated data; self-supervised learning from unannotated text is facilitated by using WordNet (Miller, 1998), an expert-constructed inventory of word senses, as weak supervision.
We focus on a coarse-grained variant of a word's sense, referred to as its WordNet supersense, in order to mitigate an identified brittleness of fine-grained word-sense systems, caused by arbitrary sense granularity, blurriness, and general subjectiveness (Kilgarriff, 1997; Schneider, 2014). WordNet lexicographers organize all word senses into 45 supersense categories, 26 of which are for nouns, 15 for verbs, 3 for adjectives and 1 for adverbs (see full supersense table in the supplementary materials). Disambiguating a word's supersense has been widely studied as a fundamental lexical categorization task (Ciaramita and Johnson, 2003; Basile, 2012; Schneider and Smith, 2015).
We employ the masked word's allowed supersenses list from WordNet as a set of possible labels for the sense prediction task. The labeling of words with a single supersense (e.g., 'sword' has only the supersense noun.artifact) is straightforward: we train the network to predict this supersense given the masked word's context. As for words with multiple supersenses (e.g., 'bass' can be: noun.food, noun.animal, noun.artifact, noun.person, etc.), we train the model to predict any of these senses, leading to a simple yet effective soft-labeling scheme.
We show that SenseBERT BASE outscores both BERT BASE and BERT LARGE by a large margin on a supersense variant of the SemEval Word Sense Disambiguation (WSD) data set standardized in Raganato et al. (2017). Notably, SenseBERT receives competitive results on this task without fine-tuning, i.e., when training a linear classifier over the pre-trained embeddings, which serves as a testament to its self-acquisition of lexical semantics. Furthermore, we show that SenseBERT BASE surpasses BERT LARGE in the Word in Context (WiC) task (Pilehvar and Camacho-Collados, 2019) from the SuperGLUE benchmark (Wang et al., 2019), which directly depends on word-supersense awareness. A single SenseBERT LARGE model achieves state of the art performance on WiC with a score of 72.14, improving the score of BERT LARGE by 2.5 points.

Related Work
Neural network based word embeddings first appeared as a static mapping (non-contextualized), where every word is represented by a constant pre-trained embedding (Mikolov et al., 2013; Pennington et al., 2014). Such embeddings were shown to contain some amount of word-sense information (Iacobacci et al., 2016; Yuan et al., 2016; Arora et al., 2018; Le et al., 2018). Additionally, sense embeddings computed for each word sense in the word-sense inventory (e.g., WordNet) have been employed, relying on hypernymity relations (Rothe and Schütze, 2015) or the gloss for each sense (Chen et al., 2014). These approaches rely on static word embeddings and require a large amount of annotated data per word sense.
The introduction of contextualized word embeddings (Peters et al., 2018), for which a given word's embedding is context-dependent rather than precomputed, has brought forth a promising prospect for sense-aware word embeddings. Indeed, visualizations in Reif et al. (2019) show that sense-sensitive clusters form in BERT's word embedding space. Nevertheless, we identify a clear gap in this ability. We show that a vanilla BERT model trained with the current word-level self-supervision, burdened with the implicit task of disambiguating word meanings, often fails to grasp lexical semantics, exhibiting high supersense misclassification rates. Our suggested weakly-supervised word-sense signal allows SenseBERT to significantly bridge this gap.
Moreover, SenseBERT exhibits an improvement in lexical semantics ability (reflected by the Word in Context task score) even when compared to models with WordNet-infused linguistic knowledge. Specifically, we compare to Peters et al. (2019), who re-contextualize word embeddings via a word-to-entity attention mechanism (where entities are WordNet lemmas and synsets), and to Loureiro and Jorge (2019), who construct sense embeddings from BERT's word embeddings and use the WordNet graph to enhance coverage (see quantitative comparison in table 3).

Incorporating Word-Supersense Information in Pre-training
In this section, we present our proposed method for integrating word-sense information within SenseBERT's pre-training. We start by describing the vanilla BERT architecture in subsection 3.1. We conceptually divide it into an internal Transformer encoder and an external mapping W which translates the observed vocabulary space into and out of the Transformer encoder space [see illustration in figure 1(a)].
In the subsequent subsections, we frame our contribution to the vanilla BERT architecture as an addition of a parallel external mapping to the word-supersense space, denoted S [see illustration in figure 1(b)]. Specifically, in section 3.2 we describe the loss function used for learning S in parallel to W, effectively implementing word-form and word-sense multi-task learning in the pre-training stage. Then, in section 3.3 we describe our methodology for adding supersense information in S to the initial Transformer embedding, in parallel to word-level information added by W. In section 3.4 we address the issue of supersense prediction for out-of-vocabulary words, and in section 3.5 we describe our modification of BERT's masking strategy, prioritizing single-supersensed words which carry a clearer semantic signal.

Background
The input to BERT is a sequence of words $\{x^{(j)} \in \{0,1\}^{D_W}\}_{j=1}^{N}$, where 15% of the words are replaced by a [MASK] token (see treatment of sub-word tokenization in section 3.4). Here $N$ is the input sentence length, $D_W$ is the word vocabulary size, and $x^{(j)}$ is a 1-hot vector corresponding to the $j$-th input word. For every masked word, the output of the pre-training task is a word-score vector $y^{\text{words}} \in \mathbb{R}^{D_W}$ containing the per-word score. BERT's architecture can be decomposed into (1) an internal Transformer encoder architecture (Vaswani et al., 2017) wrapped by (2) an external mapping to the word vocabulary space, denoted by $W$.

Figure 1: SenseBERT includes a masked-word supersense prediction task, pre-trained jointly with BERT's original masked-word prediction task (Devlin et al., 2019) (see section 3.2). As in the original BERT, the mapping from the Transformer dimension to the external dimension is the same both at input and at output ($W$ for words and $S$ for supersenses), where $M$ denotes a fixed mapping between word-forms and their allowed WordNet supersenses (see section 3.3). The vectors $p^{(j)}$ denote positional embeddings. For clarity, we omit a reference to a sentence-level Next Sentence Prediction task trained jointly with the above.

The Transformer encoder operates over a sequence of word embeddings $v^{(j)}_{\text{input}} \in \mathbb{R}^{d}$, where $d$ is the Transformer encoder's hidden dimension. These are passed through multiple attention-based Transformer layers, producing a new sequence of contextualized embeddings at each layer. The Transformer encoder output is the final sequence of contextualized word embeddings $v^{(j)}_{\text{output}} \in \mathbb{R}^{d}$.
The external mapping $W \in \mathbb{R}^{d \times D_W}$ is effectively a translation between the external word vocabulary dimension and the internal Transformer dimension. Original words in the input sentence are translated into the Transformer block by applying this mapping (and adding positional encoding vectors $p^{(j)} \in \mathbb{R}^{d}$):

$$v^{(j)}_{\text{input}} = W x^{(j)} + p^{(j)} \tag{1}$$

The word-score vector for a masked word at position $j$ is extracted from the Transformer encoder output by applying the transpose: $y^{\text{words}} = W^{\top} v^{(j)}_{\text{output}}$. The use of the same matrix $W$ as the mapping in and out of the Transformer encoder space is referred to as weight tying (Inan et al., 2017; Press and Wolf, 2017).
Given a masked word in position $j$, BERT's original masked-word prediction pre-training task is to have the softmax of the word-score vector $y^{\text{words}} = W^{\top} v^{(j)}_{\text{output}}$ get as close as possible to a 1-hot vector corresponding to the masked word. This is done by minimizing the cross-entropy loss between the softmax of the word-score vector and a 1-hot vector corresponding to the masked word:

$$\mathcal{L}_{\text{LM}} = -\log p(w \mid \text{context}) \tag{2}$$

where $w$ is the masked word, the context is composed of the rest of the input sequence, and the probability is computed by:

$$p(w \mid \text{context}) = \frac{\exp\left(y^{\text{words}}_{w}\right)}{\sum_{w'} \exp\left(y^{\text{words}}_{w'}\right)} \tag{3}$$

where $y^{\text{words}}_{w}$ denotes the $w$-th entry of the word-score vector.
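The weight-tied mapping and the masked-word loss described above can be illustrated with a minimal pure-Python sketch. The matrix values, the toy sizes (d = 2, D_W = 3), and the helper names `embed`, `word_scores` and `masked_lm_loss` are illustrative assumptions; the Transformer encoder itself is not modeled, so `v_output` is simply supplied by hand.

```python
import math

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    """Return the transpose of a list-of-rows matrix."""
    return [list(col) for col in zip(*matrix)]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy mapping W of shape d x D_W (hypothetical values).
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, 0.5]]

def embed(word_id, position_vec):
    """Input side: W x^(j) + p^(j), with x^(j) one-hot at word_id."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(len(W[0]))]
    return [a + b for a, b in zip(matvec(W, one_hot), position_vec)]

def word_scores(v_output):
    """Output side: the same W, transposed (weight tying)."""
    return matvec(transpose(W), v_output)

def masked_lm_loss(v_output, target_id):
    """Cross-entropy against a one-hot target: -log p(w | context)."""
    return -math.log(softmax(word_scores(v_output))[target_id])

v_in = embed(word_id=2, position_vec=[0.1, -0.1])
scores = word_scores([2.0, 0.0])
loss = masked_lm_loss(v_output=[2.0, 0.0], target_id=0)
```

Because the same `W` appears in both `embed` and `word_scores`, a gradient step on the output scores would also move the input embeddings, which is the property the text calls weight tying.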

Weakly-Supervised Supersense Prediction Task
Jointly with the above procedure for training the word-level language model of SenseBERT, we train the model to predict the supersense of every masked word, thereby training a semantic-level language model. This is done by adding a parallel external mapping to the word-supersense space, denoted by $S \in \mathbb{R}^{d \times D_S}$, where $D_S = 45$ is the size of the supersense vocabulary. Ideally, the objective is to have the softmax of the sense-score vector $y^{\text{senses}} \in \mathbb{R}^{D_S} := S^{\top} v^{(j)}_{\text{output}}$ get as close as possible to a 1-hot vector corresponding to the word's supersense in the given context.
For each word $w$ in our vocabulary, we employ the WordNet word-sense inventory for constructing $A(w)$, the set of its "allowed" supersenses. Specifically, we apply a WordNet lemmatizer on $w$, extract the different synsets that are mapped to the lemmatized word in WordNet, and define $A(w)$ as the union of supersenses coupled to each of these synsets. As exceptions, we set $A(w) = \emptyset$ for the following: (i) short words (up to 3 characters), since they are often treated as abbreviations, (ii) stop words, as WordNet does not contain their main synset (e.g., 'he' is either the element helium or the Hebrew language according to WordNet), and (iii) tokens that represent part of a word (see section 3.4 for further discussion on these tokens).
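The construction of A(w) and its three exceptions can be sketched as follows. This is a toy illustration rather than the authors' pipeline: `TOY_SYNSET_SUPERSENSES` is a hypothetical stand-in for a real WordNet lookup (only the 'sword' and 'bass' entries quoted in the introduction are included), `toy_lemmatize` is a crude placeholder for a WordNet lemmatizer, the stop-word list is an illustrative subset, and the `##` prefix for sub-word pieces is an assumed WordPiece-style convention.

```python
# Hypothetical stand-in for WordNet: lemma -> supersenses of its synsets.
TOY_SYNSET_SUPERSENSES = {
    "sword": ["noun.artifact"],
    "bass": ["noun.food", "noun.animal", "noun.artifact", "noun.person"],
}

# Illustrative subset of a stop-word list (assumption).
STOP_WORDS = {"the", "he", "with", "that"}

def toy_lemmatize(word):
    """Crude placeholder for a WordNet lemmatizer: strips a plural 's'."""
    lowered = word.lower()
    if lowered.endswith("s") and lowered[:-1] in TOY_SYNSET_SUPERSENSES:
        return lowered[:-1]
    return lowered

def allowed_supersenses(token):
    """Return A(w) as a set, applying the three exceptions from the text."""
    if token.startswith("##"):        # (iii) part-of-word token (assumed marker)
        return set()
    if len(token) <= 3:               # (i) short words, up to 3 characters
        return set()
    if token.lower() in STOP_WORDS:   # (ii) stop words
        return set()
    lemma = toy_lemmatize(token)
    # Union of the supersenses attached to every synset of the lemma.
    return set(TOY_SYNSET_SUPERSENSES.get(lemma, []))

single = allowed_supersenses("swords")
multiple = allowed_supersenses("bass")
```

In a real setting the dictionary lookup would be replaced by queries to a WordNet interface; the control flow around the exceptions would stay the same.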
Given the above construction, we employ a combination of two loss terms for the supersense-level language model. The following allowed-senses term maximizes the probability that the predicted sense is in the set of allowed supersenses of the masked word $w$:

$$\mathcal{L}^{\text{allowed}}_{\text{SLM}} = -\log p\left(s \in A(w) \mid \text{context}\right) = -\log \sum_{s \in A(w)} p(s \mid \text{context}) \tag{4}$$

where the probability for a supersense $s$ is given by:

$$p(s \mid \text{context}) = \frac{\exp\left(y^{\text{senses}}_{s}\right)}{\sum_{s'} \exp\left(y^{\text{senses}}_{s'}\right)} \tag{5}$$

The soft-labeling scheme given above, which treats all the allowed supersenses of the masked word equally, introduces noise to the supersense labels. We expect that encountering many contexts in a sufficiently large corpus will reinforce the correct labels whereas the signal of incorrect labels will diminish. To illustrate this, consider three sentences set in a food context that differ only in one marked word. Masking the marked word in each of the examples results in three identical input sequences, each with a different set of labels. The ground truth label, noun.food, appears in all cases, so that its probability in contexts indicating food is increased whereas the signals supporting other labels cancel out.
While $\mathcal{L}^{\text{allowed}}_{\text{SLM}}$ pushes the network in the right direction, minimizing this loss could result in the network becoming overconfident in predicting a strict subset of the allowed senses for a given word, i.e., a collapse of the prediction distribution. This is especially acute in the early stages of the training procedure, when the network could converge to the noisy signal of the soft-labeling scheme.
To mitigate this issue, the following regularization term is added to the loss, which encourages a uniform prediction distribution over the allowed supersenses:

$$\mathcal{L}^{\text{reg}}_{\text{SLM}} = -\sum_{s \in A(w)} \frac{1}{|A(w)|} \log p(s \mid \text{context}) \tag{6}$$

i.e., a cross-entropy loss with a uniform distribution over the allowed supersenses. Overall, jointly with the regular word-level language model trained with the loss in eq. 2, we train the semantic-level language model with a combined loss of the form:

$$\mathcal{L}_{\text{SLM}} = \mathcal{L}^{\text{allowed}}_{\text{SLM}} + \mathcal{L}^{\text{reg}}_{\text{SLM}} \tag{7}$$
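The interplay between the two loss terms can be made concrete with a small sketch. It assumes that the allowed-senses term is the negative log of the total probability mass on A(w), that the regularizer is a cross-entropy against a uniform distribution over A(w), and that the two are summed with equal weight; the function names and the toy score vectors are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of sense scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def allowed_senses_loss(sense_scores, allowed):
    """-log of the probability that the predicted sense lies in A(w)."""
    probs = softmax(sense_scores)
    return -math.log(sum(probs[s] for s in allowed))

def regularization_loss(sense_scores, allowed):
    """Cross-entropy between uniform-over-A(w) and the prediction."""
    probs = softmax(sense_scores)
    return -sum(math.log(probs[s]) for s in allowed) / len(allowed)

def supersense_loss(sense_scores, allowed):
    """Assumed combination: unweighted sum of the two terms."""
    return (allowed_senses_loss(sense_scores, allowed)
            + regularization_loss(sense_scores, allowed))

# Toy case with 4 supersenses and A(w) = {0, 1}.
allowed = [0, 1]
collapsed = [10.0, -10.0, -10.0, -10.0]  # all mass on one allowed sense
balanced = [10.0, 10.0, -10.0, -10.0]    # mass split evenly over A(w)

collapsed_terms = (allowed_senses_loss(collapsed, allowed),
                   regularization_loss(collapsed, allowed))
balanced_terms = (allowed_senses_loss(balanced, allowed),
                  regularization_loss(balanced, allowed))
```

Both toy predictions put almost all probability inside A(w), so the allowed-senses term alone barely distinguishes them; the regularizer is what penalizes the collapsed prediction.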

Supersense Aware Input Embeddings
Though in principle two different matrices could have been used for converting in and out of the Transformer encoder, the BERT architecture employs the same mapping $W$. This approach, referred to as weight tying, was shown to yield theoretical and practical benefits (Inan et al., 2017; Press and Wolf, 2017). Intuitively, constructing the Transformer encoder's input embeddings from the same mapping with which the scores are computed improves their quality as it makes the input more sensitive to the training signal.
We follow this approach, and insert our newly proposed semantic-level language model matrix $S$ in the input in addition to $W$ [as depicted in figure 1(b)], such that the input vector to the Transformer encoder (eq. 1) is modified to obey:

$$v^{(j)}_{\text{input}} = (W + SM)\, x^{(j)} + p^{(j)} \tag{8}$$

where $p^{(j)}$ are the regular positional embeddings as used in BERT, and $M \in \mathbb{R}^{D_S \times D_W}$ is a static 0/1 matrix converting between words and their allowed WordNet supersenses $A(w)$ (see construction details above). The above strategy for constructing $v^{(j)}_{\text{input}}$ allows for the semantic-level vectors in $S$ to come into play and shape the input embeddings even for words which are rarely observed in the training corpus. For such a word, the corresponding row in $W$ is potentially less informative, since due to the low word frequency the model did not have sufficient chance to adequately learn it. However, since the model learns a representation of its supersense, the corresponding row in $S$ is informative of the semantic category of the word. Therefore, the input embedding in eq. 8 can potentially help the model to elicit meaningful information even when the masked word is rare, allowing for better exploitation of the training corpus.
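As a rough illustration of the supersense-aware input embedding, the sketch below adds, for a given word, the sum of the supersense vectors it is allowed to take (selected by the static 0/1 matrix M) to its word vector and positional vector. The toy sizes (d = 2, three words, two supersenses) and all matrix values are hypothetical.

```python
def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Word mapping W, shape d x D_W (toy values).
W = [[0.1, 0.2, 0.0],
     [0.0, 0.1, 0.3]]

# Supersense mapping S, shape d x D_S (toy values).
S = [[1.0, 0.0],
     [0.0, 1.0]]

# Static 0/1 matrix M, shape D_S x D_W: M[s][w] = 1 if supersense s is in A(w).
M = [[1, 1, 0],   # supersense 0 is allowed for words 0 and 1
     [0, 1, 1]]   # supersense 1 is allowed for words 1 and 2

def input_embedding(word_id, position_vec):
    """Word vector + sum of allowed supersense vectors + positional vector."""
    one_hot = [1.0 if i == word_id else 0.0 for i in range(len(W[0]))]
    word_part = matvec(W, one_hot)
    sense_part = matvec(S, matvec(M, one_hot))
    return [w + s + p for w, s, p in zip(word_part, sense_part, position_vec)]

ambiguous = input_embedding(1, [0.0, 0.0])   # word 1 has both supersenses
single = input_embedding(0, [0.5, 0.5])      # word 0 has only supersense 0
```

Even if the column of `W` for a rare word were poorly trained, the `sense_part` term would still carry the learned vectors of its allowed supersenses, which is the benefit the paragraph above argues for.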

Rare Words Supersense Prediction
At the pre-processing stage, when an out-of-vocabulary (OOV) word is encountered in the corpus, it is divided into several in-vocabulary sub-word tokens. For the self-supervised word prediction task (eq. 2), masked sub-word tokens are straightforwardly predicted as described in section 3.1. In contrast, word-sense supervision is only meaningful at the word level. We compare two alternatives for dealing with tokenized OOV words for the supersense prediction task (eq. 7).
In the first alternative, called 60K vocabulary, we augment BERT's original 30K-token vocabulary (which roughly contained the most frequent words) with an additional 30K new words, chosen according to their frequency in Wikipedia. This vocabulary increase allows us to see more of the corpus as whole words for which supersense prediction is a meaningful operation. Additionally, in accordance with the discussion in the previous subsection, our sense-aware input embedding mechanism can help the model extract more information from lower-frequency words. For the cases where a sub-word token is chosen for masking, we only propagate the regular word-level loss and do not train the supersense prediction task.
The above addition to the vocabulary results in an increase of approximately 23M parameters over the 110M parameters of BERT BASE and an increase of approximately 30M parameters over the 340M parameters of BERT LARGE (due to different embedding dimensions d = 768 and d = 1024, respectively). It is worth noting that similar vocabulary sizes in leading models have not resulted in increased sense awareness, as reflected for example in the WiC task results (Liu et al., 2019).
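The parameter increases quoted above can be checked with a short calculation. The sketch assumes that each of the 30K added vocabulary entries contributes exactly one d-dimensional embedding vector (shared between input and output through weight tying) and ignores any other per-word parameters.

```python
# Number of words added to the original 30K-token vocabulary.
ADDED_VOCABULARY_ENTRIES = 30_000

# Hidden dimensions stated in the text for the two model sizes.
HIDDEN_DIMS = {"BASE": 768, "LARGE": 1024}

# One d-dimensional vector per added entry (assumption stated above).
added_parameters = {
    name: ADDED_VOCABULARY_ENTRIES * d for name, d in HIDDEN_DIMS.items()
}

for name, count in added_parameters.items():
    print(f"{name}: about {count / 1e6:.1f}M additional embedding parameters")
```

Under this assumption the counts come to 23.04M and 30.72M, in line with the approximate figures given in the text.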
Figure 3: (a) A demonstration of supersense probabilities assigned to a masked position within context, as given by SenseBERT's word-supersense level semantic language model (capped at 5%). Example words corresponding to each supersense are presented in parentheses. Example inputs: 'The [MASK] fell to the floor.' and 'Gill [MASK] the bread.', the latter with verb.contact (cut, buttered, ...) 33%, verb.consumption (ate, chewed, ...) 20%, verb.change (heated, baked, ...) 11%, verb.possession (took, bought, ...) 6%. (b) Examples of SenseBERT's prediction on raw text, when the unmasked input sentence is given to the model. Example inputs: 'Dan cooked a bass on the grill.' and 'The bass player was exceptional.' This beyond word-form abstraction ability facilitates a more natural elicitation of semantic content at pre-training.

As a second alternative, referred to as average embedding, we employ BERT's regular 30K-token vocabulary and employ a whole-word-masking strategy. Accordingly, all of the tokens of a tokenized OOV word are masked together. In this case, we train the supersense prediction task to predict the WordNet supersenses of this word from the average of the output embeddings at the location of the masked sub-word tokens.
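The average embedding alternative amounts to pooling the encoder outputs over the positions of a word's sub-word tokens before applying the supersense mapping. A minimal sketch, with a hypothetical helper name and toy output vectors:

```python
def average_embedding(output_embeddings, positions):
    """Average the output vectors at the given sub-word token positions."""
    dim = len(output_embeddings[0])
    return [
        sum(output_embeddings[p][k] for p in positions) / len(positions)
        for k in range(dim)
    ]

# Toy encoder outputs for a 4-token sequence with hidden dimension 2.
outputs = [[1.0, 1.0],
           [2.0, 4.0],
           [4.0, 0.0],
           [0.0, 0.0]]

# Suppose an OOV word was split into the sub-word tokens at positions 1 and 2.
pooled = average_embedding(outputs, positions=[1, 2])
```

The pooled vector would then take the place of the single-position output vector when computing the sense scores for the whole word.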

Single-Supersensed Word Masking
Words that have a single supersense are good anchors for obtaining an unambiguous semantic signal. These words teach the model to accurately map contexts to supersenses, such that it is then able to make correct context-based predictions even when a masked word has several supersenses. We therefore favor such words in the masking strategy, choosing 50% of the single-supersensed words in each input sequence to be masked. We stop if 40% of the overall 15% masking budget is filled with single-supersensed words (this rarely happens), and in any case we randomize the choice of the remaining words to complete this budget. As in the original BERT, 1 out of 10 words chosen for masking is shown to the model as itself rather than replaced with [MASK].
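One possible reading of this masking procedure is sketched below. The text leaves several details open, so the following are assumptions: the budget is 15% of the sequence length rounded to the nearest integer, the single-supersensed picks are a random half of those words truncated at 40% of the budget, and the remainder of the budget is drawn uniformly from all positions not yet chosen. The function names are hypothetical.

```python
import random

def choose_mask_positions(num_allowed, rng, mask_rate=0.15,
                          single_fraction=0.5, single_cap=0.4):
    """Pick positions to mask; num_allowed[i] is |A(w_i)| for position i."""
    n = len(num_allowed)
    budget = max(1, round(mask_rate * n))
    cap = int(single_cap * budget)

    # Favor single-supersensed words: take half of them, up to the cap.
    singles = [i for i, k in enumerate(num_allowed) if k == 1]
    rng.shuffle(singles)
    n_single = min(int(single_fraction * len(singles)), cap)
    chosen = singles[:n_single]

    # Fill the rest of the budget at random from the remaining positions.
    taken = set(chosen)
    rest = [i for i in range(n) if i not in taken]
    rng.shuffle(rest)
    chosen += rest[:budget - len(chosen)]
    return sorted(chosen)

def model_input(token, rng, keep_prob=0.1):
    """Show the chosen word as itself 1 time in 10, otherwise as [MASK]."""
    return token if rng.random() < keep_prob else "[MASK]"

rng = random.Random(0)
# Toy sequence of 40 words; every fourth word has a single supersense.
num_allowed = [1 if i % 4 == 0 else 3 for i in range(40)]
positions = choose_mask_positions(num_allowed, rng)
shown = model_input("bass", rng)
```

With 40 words the budget is 6 positions and the cap on favored single-supersensed picks is 2, so the cap rather than the 50% rule is the binding constraint in this toy case.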

Semantic Language Model Visualization
A SenseBERT pre-trained as described in section 3 (with training hyperparameters as in Devlin et al. (2019)) has an immediate non-trivial by-product. The pre-trained mapping to the supersense space, denoted S, acts as an additional head predicting a word's supersense given context [see figure 1(b)]. We thereby effectively attain a semantic-level language model (see figure 2(a)). We further identify finer-grained semantic clusters, as shown for example in figure 2(b) and given in more detail in the supplementary materials.
SenseBERT's semantic language model allows predicting a distribution over supersenses rather than over words in a masked position. Figure 3(a) shows the supersense probabilities assigned by SenseBERT in several contexts, demonstrating the model's ability to assign semantically meaningful categories to the masked position.
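Finally, we demonstrate that SenseBERT enjoys an ability to view raw text at a lexical semantic level. Figure 3(b) shows examples of SenseBERT's prediction on raw text, when the unmasked input sentence is given to the model.

Figure 4: (a) An example from the SemEval-SS data set: 'The team used a battery of the newly developed "gene probes"'. (b) Examples from the WiC data set. First pair, Sent. A: 'The kick must be synchronized with the arm movements.' Sent. B: 'A sidecar is a smooth drink but it has a powerful kick.' Second pair, Sent. A: 'Plant bugs in the dissident's apartment.' Sent. B: 'Plant a spy in Moscow.'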

Lexical Semantics Experiments
In this section, we present quantitative evaluations of SenseBERT, pre-trained as described in section 3. We test the model's performance on a supersense-based variant of the SemEval WSD test sets standardized in Raganato et al. (2017), and on the Word in Context (WiC) task (Pilehvar and Camacho-Collados, 2019) (included in the recently introduced SuperGLUE benchmark (Wang et al., 2019)), both directly relying on the network's ability to perform lexical semantic categorization.

Comparing Rare Words Supersense Prediction Methods
We first report a comparison of the two methods described in section 3.4 for predicting the supersenses of rare words which do not appear in BERT's original vocabulary. The first, the 60K vocabulary method, enriches the vocabulary, and the second, the average embedding method, predicts a supersense from the average embeddings of the sub-word tokens comprising an OOV word. During fine-tuning, when encountering an OOV word we predict the supersenses from the rightmost sub-word token in the 60K vocabulary method and from the average of the sub-word tokens in the average embedding method.
As shown in table 1, both methods perform comparably on the SemEval supersense disambiguation task (see following subsection), yielding an improvement over the baseline of learning supersense information only for whole words in BERT's original 30K-token vocabulary. We continue with the 60K-token vocabulary for the rest of the experiments, but note the average embedding option as a viable competitor for predicting word-level semantics.

SemEval-SS: Supersense Disambiguation
We test SenseBERT on a Word Supersense Disambiguation task, a coarse-grained variant of the common WSD task. We use SemCor (Miller et al., 1993) as our training dataset (226,036 annotated examples), and the SenseEval (Edmonds and Cotton, 2001; Snyder and Palmer, 2004) / SemEval (Pradhan et al., 2007; Navigli et al., 2013; Moro and Navigli, 2015) suite for evaluation (overall 7,253 annotated examples), following Raganato et al. (2017). For each word in both training and test sets, we change its fine-grained sense label to its corresponding WordNet supersense, and therefore train the network to predict a given word's supersense. We name this supersense disambiguation task SemEval-SS. See figure 4(a) for an example from this modified data set.
We show results on the SemEval-SS task for two different training schemes. In the first, we trained a linear classifier over the 'frozen' output embeddings of the examined model; we do not change the trained SenseBERT's parameters in this scheme. This Frozen setting is a test for the amount of basic lexical semantics readily present in the pre-trained model, easily extricable by further downstream tasks (reminiscent of the semantic probes employed in Hewitt and Manning (2019) and Reif et al. (2019)).
In the second training scheme we fine-tuned the examined model on the task, allowing its parameters to change during training (see full training details in the supplementary materials). Results attained by employing this training method reflect the model's potential to acquire word-supersense information given its pre-training. Table 2 shows a comparison between vanilla BERT and SenseBERT on the supersense disambiguation task.
Our semantic-level pre-training signal clearly yields embeddings with enhanced word-meaning awareness, relative to embeddings trained with BERT's vanilla word-level signal. SenseBERT BASE improves the score of BERT BASE in the Frozen setting by over 10 points and SenseBERT LARGE improves that of BERT LARGE by over 12 points, demonstrating competitive results even without fine-tuning. In the setting of model fine-tuning, we see a clear demonstration of the model's ability to learn word-level semantics, as SenseBERT BASE surpasses the score of BERT LARGE by 2 points.

Word in Context (WiC) Task
We test our model on the recently introduced WiC binary classification task. Each instance in WiC has a target word w for which two contexts are provided, each invoking a specific meaning of w. The task is to determine whether the occurrences of w in the two contexts share the same meaning or not, clearly requiring an ability to identify the word's semantic category. The WiC task is defined over supersenses (Pilehvar and Camacho-Collados, 2019): the negative examples include a word used in two different supersenses and the positive ones include a word used in the same supersense. See figure 4(b) for an example from this data set.
Results on the WiC task comparing SenseBERT to vanilla BERT are shown in table 2. SenseBERT BASE surpasses a larger vanilla model, BERT LARGE. As shown in table 3, a single SenseBERT LARGE model achieves the state of the art score in this task, demonstrating unprecedented lexical semantic awareness.

GLUE
The General Language Understanding Evaluation (GLUE; Wang et al. (2018)) benchmark is a popular testbed for language understanding models. It consists of 9 different NLP tasks, covering different linguistic phenomena. We evaluate our model on GLUE, in order to verify that SenseBERT gains its lexical semantic knowledge without compromising performance on other downstream tasks. Due to slight differences in the data used for pre-training BERT and SenseBERT (BookCorpus is not publicly available), we trained a BERT BASE model with the same data used for our models. BERT BASE and SenseBERT BASE were both fine-tuned using the exact same procedures and hyperparameters. The results are presented in table 4. Indeed, SenseBERT performs on par with BERT, achieving an overall score of 77.9, compared to 77.5 achieved by BERT BASE.

Conclusion
We introduce lexical semantic information into a neural language model's pre-training objective. This results in a boosted word-level semantic awareness of the resultant model, named SenseBERT, which considerably outperforms a vanilla BERT on a SemEval based Supersense Disambiguation task and achieves state of the art results on the Word in Context task. This improvement was obtained without human annotation, but rather by harnessing an external linguistic knowledge source. Our work indicates that semantic signals extending beyond the lexical level can be similarly introduced at the pre-training stage, allowing the network to elicit further insight without human supervision.