Attention Word Embedding

Word embedding models learn semantically rich vector representations of words and are widely used to initialize natural language processing (NLP) models. The popular continuous bag-of-words (CBOW) model of word2vec learns a vector embedding by masking a given word in a sentence and then using the other words as a context to predict it. A limitation of CBOW is that it weights the context words equally when making a prediction, which is inefficient, since some words have higher predictive value than others. We tackle this inefficiency by introducing the Attention Word Embedding (AWE) model, which integrates the attention mechanism into the CBOW model. We also propose AWE-S, which incorporates subword information. We demonstrate that AWE and AWE-S outperform state-of-the-art word embedding models both on a variety of word similarity datasets and when used for initialization of NLP models.


Introduction
Word embedding models learn vector representations of words such that semantically related words are close to each other in the vector space. Popular word embedding models include word2vec (Mikolov et al., 2013a; Mikolov et al., 2013b), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017). Word embedding models are often used to initialize deep learning models for various natural language processing (NLP) tasks such as machine translation (Bahdanau et al., 2015), part-of-speech tagging (Ling et al., 2015a), and sentiment analysis (Wang et al., 2016), leading to improved performance over training the models from scratch.
The core idea underlying the training methodology of all the above embedding models is that "a word is characterized by the company it keeps" (Firth, 1957). For instance, the continuous bag-of-words (CBOW) model of word2vec predicts a randomly selected word in a sentence using the other words in the sentence as a context. The context words are typically treated equally by these models. However, clearly some words in the context will be more predictive of the masked word than others, and intuitively, should be weighted more heavily.
A promising remedy that treats each context word differently is the attention mechanism (Vaswani et al., 2017), which enables learning the relative importance of input features for the prediction of the desired output. Attention has become the de facto standard and state-of-the-art in modern NLP for a range of different tasks (Zhang et al., 2019;Devlin et al., 2019;Dai et al., 2019). However, to date, there have been only a few attempts to employ it to improve the performance and interpretability of word embedding models (Ling et al., 2015b;Schick and Schütze, 2019).

Contributions
In this paper, we propose the Attention Word Embedding (AWE), a new word embedding model that integrates the attention mechanism into the CBOW model of word2vec.
AWE consists of two components. First, AWE uses a variant of self-attention (Vaswani et al., 2017) to narrow down relevant words in the context for the prediction of the masked word. In contrast to prior work (Ling et al., 2015b;Schick and Schütze, 2019), the attention weights are a function of both the context words and the target/masked word. Second, AWE embeds the context words and the masked word representations in a shared subspace, since both representations embody the meaning of the word. Note that this is in sharp contrast with CBOW, which learns two different embeddings for each word, one when the word is present in the context and another when it is masked.
We also introduce AWE-S, a variant of AWE that leverages subword information to enrich the word vectors. The idea of using subwords is inspired by fastText (Bojanowski et al., 2017), which defines the subwords of a word as its character n-grams. Incorporating subword information is particularly valuable for improving the vector representations of rare and unseen words.
Our experimental evaluations show the superior performance of AWE and AWE-S over many existing word embedding models, both on word similarity tasks and on a number of downstream NLP applications, including natural language inference, sentence semantic relatedness, and paraphrase detection. Furthermore, we analyze and interpret the role of the attention weights in AWE, which provides interesting insights into the workings of the attention mechanism.

Word2Vec
Word2vec is one of the standard word embedding models (Mikolov et al., 2013a;Mikolov et al., 2013b). There are two architectures proposed for word2vec: CBOW and Skip-Gram. We focus on the CBOW model in this work.
CBOW predicts a masked word using its context (a fill-in-the-blanks model). For each word, it learns two vectors to represent the two roles a word can play: first, when it is present in the context of the masked word, and second, when it is masked itself. Let U = [u_1, ..., u_N]^T ∈ R^{N×D} and V = [v_1, ..., v_N]^T ∈ R^{N×D}, where N is the size of the vocabulary and D is the size of the word vector. U models the first role and is used to calculate the context vector c, given by

c = Σ_{i=-b, i≠0}^{b} u_{w_i},    (1)

where b is the size of the context window and w_i is the index of each word (w_0 is the index of the masked word; the rest are the indices of the context words). V models the second role and learns the masked word vector for w_0, given by v_{w_0}. The probability p of w_0 occurring in the context {w_{-b}, ..., w_{-1}, w_1, ..., w_b} is given by

p = exp(v_{w_0}^T c) / Σ_{j=1}^{N} exp(v_j^T c).

The Skip-Gram model of word2vec is similar to CBOW; the key difference is that it predicts the context words using the masked word.
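The two roles and the prediction rule above can be sketched in a few lines of NumPy (a toy illustration with made-up sizes, not the word2vec implementation; in practice word2vec avoids the full softmax via negative sampling or a hierarchical softmax):

```python
import numpy as np

def cbow_predict(U, V, context_ids, masked_id):
    # Context vector: unweighted sum of the context word embeddings.
    c = U[context_ids].sum(axis=0)
    # Softmax over the vocabulary: probability of each word being the masked one.
    scores = V @ c
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs[masked_id]

rng = np.random.default_rng(0)
N, D = 10, 4                  # toy vocabulary size and embedding dimension
U = rng.normal(size=(N, D))   # embeddings for the context role
V = rng.normal(size=(N, D))   # embeddings for the masked role
p = cbow_predict(U, V, context_ids=[1, 2, 4, 5], masked_id=3)
```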

The AWE Model
We now explain AWE, our proposed word embedding model, which incorporates the attention mechanism. AWE augments the CBOW model of word2vec in two different ways. First, we introduce two new matrices: a key matrix K ∈ R^{N×D_k} and a query matrix Q ∈ R^{N×D_k}, where N is the size of the vocabulary and D_k is the key/query dimension. With the attention mechanism, the context vector c is modeled not as the simple sum in (1) but as a weighted sum of the context word vector embeddings,

c = Σ_{i=-b, i≠0}^{b} a_{w_i} u_{w_i},

where a_{w_i} is the attention weight of each context word vector u_{w_i}, calculated using the key matrix K and the query matrix Q:

a_{w_i} = exp(k_{w_0}^T q_{w_i}) / Σ_{j=-b, j≠0}^{b} exp(k_{w_0}^T q_{w_j}).

Second, we share the weights between the context word embedding matrix and the masked word embedding matrix in our model, i.e., we set U = V. Sharing weights is a natural and intuitive choice, since both matrices embed the meaning of a word in its vector representation, and the meaning of a word remains the same whether it occurs in the context window or as the masked word. In CBOW, the choice to have two separate matrices, U and V, is justified because it improves performance. In AWE, however, the intuitive choice of setting U equal to V works better and adds interpretability to the model. It also has an added advantage: AWE has far fewer parameters than CBOW even though it has one more matrix, because the key matrix K ∈ R^{N×D_k} and the query matrix Q ∈ R^{N×D_k} are much smaller than the value matrix V = [u_1, ..., u_N]^T ∈ R^{N×D}. In our experiments, we set D_k = 50 and D = 500.
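The attention-weighted context computation can be sketched as follows (a minimal NumPy illustration with toy dimensions; the single matrix V plays both the context and masked-word roles, reflecting the shared-weights choice above):

```python
import numpy as np

def awe_context(V, K, Q, context_ids, masked_id):
    # Attention logits: key of the masked word dotted with each context query.
    logits = Q[context_ids] @ K[masked_id]
    a = np.exp(logits - logits.max())
    a /= a.sum()                      # softmax over the context window
    # Weighted sum of the shared (U = V) context embeddings.
    return a @ V[context_ids], a

rng = np.random.default_rng(0)
N, D, D_k = 10, 8, 4              # toy sizes; the paper uses D = 500, D_k = 50
V = rng.normal(size=(N, D))       # shared value/embedding matrix
K = rng.normal(size=(N, D_k))     # key matrix
Q = rng.normal(size=(N, D_k))     # query matrix
c, a = awe_context(V, K, Q, context_ids=[1, 2, 4, 5], masked_id=3)
```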

Weight Initialization Techniques
Weight initialization is critical to the training and performance of deep neural networks (Sutskever et al., 2013). Several methods have been proposed for initializing the weight matrices which include random initialization, standard normal initialization, Xavier initialization (Glorot and Bengio, 2010), and Kaiming initialization (He et al., 2015).
We propose a new initialization scheme for the key and the query matrices. The entries of both matrices are drawn i.i.d. according to

K_{ij}, Q_{ij} ~ N(1/√D_k, σ²),

where N denotes the normal distribution and D_k is the dimension of the key and query vectors. It is easy to verify that for any masked word, say w_m, and any word in its context, say w_c, the initial attention logit, given by k_{w_m}^T q_{w_c}, is centered around 1, since E[k_{w_m}^T q_{w_c}] = D_k · (1/√D_k)² = 1. Thus, at the beginning of training, AWE mimics the CBOW method of word2vec, which assigns equal weight to each context word. As training progresses, AWE learns the relevance of each context word for the prediction of the masked word. For the value matrix, we initialize the weights using N(0, 0.1).
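A quick numerical check of this property (assuming, as one natural reading of the scheme, entries drawn i.i.d. from a normal with mean 1/√D_k and a small standard deviation; the value 0.1 below is illustrative):

```python
import numpy as np

D_k = 50
rng = np.random.default_rng(0)
mu = 1.0 / np.sqrt(D_k)   # mean chosen so that E[k^T q] = D_k * mu^2 = 1
K = rng.normal(mu, 0.1, size=(100_000, D_k))
Q = rng.normal(mu, 0.1, size=(100_000, D_k))
# Average initial attention logit over many independent key-query pairs.
mean_dot = float(np.mean(np.sum(K * Q, axis=1)))
```

With this initialization the softmax over initial logits is approximately uniform, so early training behaves like CBOW, as stated above.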

AWE-S: A Subword Variant of AWE
We also propose AWE-S, a variant of AWE that leverages subword information. A subword of a word is a part of the word, e.g., its character n-grams or its lemmas. Subword information has been used in a number of word embedding models including fastText (Bojanowski et al., 2017) and BERT (Devlin et al., 2019) to improve the expressiveness of rare word representations. As an illustration, learning a robust representation for a word like awing, which occurs rarely in a corpus, is challenging due to a dearth of examples. To circumvent the scarcity of samples, a word embedding model can exploit the fact that awe is the verb lemma form of awing to learn a better embedding for awing.
In AWE-S, the embedding of each word is given by the sum of the embeddings of all of its subwords, regardless of whether the word is a context word or the masked word in the prediction task, i.e.,

u_w = Σ_{s ∈ S_w} u_s,

where S_w is the subword set of the word w. The context vector c in AWE-S is given by

c = Σ_{i=-b, i≠0}^{b} a_{w_i} ( Σ_{s ∈ S_{w_i}} u_s ),

and the probability of the masked word w_0 occurring in the context {w_{-b}, ..., w_{-1}, w_1, ..., w_b} is given by

p = exp( (Σ_{s ∈ S_{w_0}} u_s)^T c ) / Σ_{j=1}^{N} exp( (Σ_{s ∈ S_j} u_s)^T c ).

Note that in fastText, the context vector of (1) and the probability of w_0 occurring in the context are computed without decomposing the masked word into subwords.

Table 3: Evaluation results of AWE and its subword variant, AWE-S, against other word embedding methods for initializing models for downstream tasks. Accuracy is reported for the SNLI and SICK-E datasets, the Pearson correlation for SICK-R and the STS Benchmark, and the F1 score for MRPC.
The computations in AWE-S may appear similar to those in fastText. However, observe that in fastText a word is not represented as the sum of the embeddings of its subwords when it is masked. Also, contrary to fastText, which uses character n-grams to construct the subword set of a word, AWE-S uses the noun, adjective, and verb lemmas of the word to construct the subword set. This significantly limits the size of the subword set compared to using character n-grams, which helps reduce the number of learnable parameters in AWE-S.
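A minimal sketch of the AWE-S word representation (the lemma sets and vectors below are invented for illustration):

```python
import numpy as np

def awes_vector(subword_emb, subword_set):
    # AWE-S represents a word as the sum of its subword (lemma) embeddings,
    # for both the context and the masked roles.
    return sum(subword_emb[s] for s in sorted(subword_set))

# Toy lemma-based subword sets: the rare form "awing" shares the lemma "awe",
# so its vector inherits information learned from the more frequent lemma.
subword_emb = {"awing": np.array([0.1, 0.2]), "awe": np.array([0.9, 0.4])}
v_awing = awes_vector(subword_emb, {"awing", "awe"})
```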

Experiments
We demonstrate the superior performance of AWE and AWE-S against CBOW, Skip-Gram, GloVe, and fastText on a variety of datasets, which broadly fall into two main categories: 1. Word similarity tasks. We use these tasks to evaluate the word embedding models themselves. The datasets for this task contain word pairs along with human-annotated semantic similarity scores, and are the most widely used evaluation method for word embedding models (Bakarov, 2018). We test AWE's performance using the Spearman correlation score on eight such datasets: MEN (Bruni et al., 2014), WS353 (Finkelstein et al., 2001), WS353R (Agirre et al., 2009), WS353S (Agirre et al., 2009), SimLex999 (Hill et al., 2015), RW (RareWords) (Luong et al., 2013), RG65 (Rubenstein and Goodenough, 1965), and MTurk (Radinsky et al., 2011).
2. Downstream NLP tasks. These tasks evaluate performance on important NLP applications including natural language inference, semantic entailment, semantic relatedness, and paraphrase detection. Because the conventional application of word embedding models has been to initialize NLP models, we use these tasks to assess the quality of our word embeddings for initializing NLP models for downstream applications. Datasets include SNLI (Bowman et al., 2015), SICK-E (Marelli et al., 2014), SICK-R (Marelli et al., 2014), STS Benchmark (Cer et al., 2017), and MRPC (Dolan et al., 2004). Unlike datasets for the previous task, all the datasets here have a training set.
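The word-similarity protocol in item 1 can be sketched as follows (hypothetical word pairs and scores; real evaluations use the datasets listed above, and ties in the ranks are ignored in this toy version):

```python
import numpy as np

def spearman(x, y):
    # Spearman correlation: Pearson correlation of the rank-transformed scores.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def word_similarity_score(emb, pairs, human_scores):
    # Cosine similarity of each word pair, then rank-correlate with human ratings.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    model_scores = [cos(emb[w1], emb[w2]) for w1, w2 in pairs]
    return spearman(model_scores, human_scores)

emb = {
    "cat": np.array([1.0, 0.0]),
    "feline": np.array([0.9, 0.1]),
    "dog": np.array([0.6, 0.8]),
    "car": np.array([0.0, 1.0]),
}
pairs = [("cat", "feline"), ("cat", "dog"), ("cat", "car")]
rho = word_similarity_score(emb, pairs, human_scores=[9.5, 6.0, 1.0])
```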

Experimental Setup
We train AWE, CBOW, Skip-Gram, GloVe, and fastText on the Wikipedia dataset (~17 GB in size), with a vocabulary of one million words. The dimension of the word embeddings for CBOW, Skip-Gram, GloVe, and fastText is 500. For AWE and AWE-S, K, Q ∈ R^{N×50} and V ∈ R^{N×500}, where N is the size of the vocabulary. Widely used parameter settings (a context window size of 5, 5 negative samples, and 5 training epochs) are used to train CBOW, Skip-Gram, and fastText; AWE and AWE-S are trained for the same number of epochs. GloVe is trained for 100 epochs with the parameters recommended in its paper (x-max of 100 and a context window size of 10). We used open-source word embedding evaluation toolkits to assess the quality of the models (Jastrzebski et al., 2017; Conneau and Kiela, 2018). The code for this work is available at https://github.com/luffycodes/attention-word-embedding.git. The codebase is inspired by (Mai et al., 2019).

Numerical Results
We report the performance statistics of AWE, AWE-S, and the competing word embedding models on word similarity datasets and downstream NLP tasks in Tables 1, 2, and 3. Table 1 shows the performance of AWE, CBOW, Skip-Gram, and GloVe on word similarity datasets. We use the Spearman correlation between cosine similarity and human-annotated relatedness scores as the performance metric. AWE performs significantly better on all of the datasets except SimLex999. Table 2 compares the performance of AWE-S against fastText; both algorithms incorporate additional subword information into their architectures. FastText uses character n-grams, while AWE-S uses the lemma forms of words. AWE-S wins against fastText on six out of the eight datasets. Table 3 reports the performance of AWE and AWE-S against CBOW, Skip-Gram, GloVe, and fastText on supervised downstream tasks, where a logistic regression classifier is trained to classify sentences using the pre-trained word embeddings. For these initialization applications, AWE improves performance over the other models on all datasets except SICK-E.

Interpretation of the attention mechanism
Studies have shown that the attention mechanism helps a neural network attend to relevant features in the input (Wiegreffe and Pinter, 2019; Vashishth et al., 2019), motivating its integration into CBOW. However, some studies argue otherwise and show that attention weights are poor indicators of feature importance (Jain and Wallace, 2019; Serrano and Smith, 2019). This difference of opinion necessitates an investigation of why attention works in the case of AWE. We visually analyze the attention weights to investigate whether the attention mechanism models the importance of a context word for masked word prediction in AWE. The investigation revealed two key findings.
First, no matter the masked word, the attention weight between the masked word and a highly frequent context word is typically high. In Tables 4 and 5, the attention weights for highly frequent words like for, and, the, has, and a (words whose cells have a dark gray background) are quite high compared to those of the other words in the sentence. Note that even though the attention weight is large, the similarity between the word vectors is very close to zero, so the probability of predicting the masked word is not affected.
Second, if we leave out these highly frequent words from the set of context words, we observe that the attention weights focus on the more informative words in the context for the prediction of the masked word. In Table 4, for each masked word, the attention weights corresponding to the context words that are most attended to for its prediction (excluding frequent words) have been highlighted with bold numerals and a light gray background. For more examples, please refer to Table 5.
Also, inspired by (Jain and Wallace, 2019), we randomly permute the attention weights and observe the change in AWE's performance. This idea of manipulating the attention weight distribution to study the dynamics of the attention mechanism is quite popular (Vashishth et al., 2019; Serrano and Smith, 2019). The experiment quantifies how the behavior of the AWE model changes if it focuses on a different set of context words to predict the masked word. We find that over the entire dataset, the prediction loss for the masked word increases by 26% when we shuffle the attention weights. Thus, we can conclude that the attention mechanism in AWE focuses on informative context words for the prediction of the masked word, which is consistent with our preceding findings from the visual analysis of the attention weights (Tables 4 and 5).

Table 4: Interpretation of attention. The attention weight between the masked word and a context word is given by exp(k_{masked word}^T q_{context word}), while the word vector similarity between the masked word and a context word is given by exp(u_{masked word}^T u_{context word}). Highly frequent words are highlighted with a dark gray background. For each masked word, the attention weights corresponding to the context words that are most attended to (excluding highly frequent words) are highlighted with bold numerals and a light gray background.
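A sketch of this permutation probe (the helper names are illustrative, and `loss_fn` stands in for the masked-word prediction loss as a function of the attention distribution):

```python
import numpy as np

def shuffled_loss_change(logits_list, loss_fn, rng):
    """Compare the loss under the learned attention weights with the loss
    after randomly permuting those weights, averaged over examples."""
    deltas = []
    for logits in logits_list:
        a = np.exp(logits - logits.max())
        a /= a.sum()                  # learned attention distribution
        a_perm = rng.permutation(a)   # randomly permuted attention weights
        deltas.append(loss_fn(a_perm) - loss_fn(a))
    return float(np.mean(deltas))

# Toy probe: this loss rewards attention on a (hypothetically) informative
# first context word, so permuting a peaked distribution can change the loss.
rng = np.random.default_rng(0)
delta = shuffled_loss_change([np.array([3.0, 0.0, 0.0])], lambda a: -a[0], rng)
```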

Role of subword information
Subwords play a critical role in the robust learning of infrequent words. For instance, the word happiest may not occur as frequently as the word happy, but the model can learn more about happiest by supplementing it with information from happy. This is why fastText and AWE-S perform significantly better than the other embedding methods on the RareWords dataset (Tables 1 and 2). FastText relates the two words happy and happiest through the n-gram 'happ', whereas AWE-S relates happiest to happy through its lemma form, i.e., 'happy'.

Application of AWE's distributional loss
Deep NLP models use an input embedding matrix to project a word's one-hot vector to a low-dimensional dense representation, which is then fed into the model. We propose to enrich this dense word vector representation with contextual information using a distributional loss term.
First, in addition to the input embedding matrix, which we refer to as the value matrix V ∈ R^{N×D}, we use two extra matrices: a key matrix K ∈ R^{N×D} and a query matrix Q ∈ R^{N×D}, where N is the size of the vocabulary and D is the embedding dimension. Let {w_0, w_1, ..., w_n} be the input sequence to a deep NLP model, and let w_i be an arbitrary word in the input sequence with context {w_{i-b}, ..., w_{i-1}, w_{i+1}, ..., w_{i+b}}, where b is the size of the context window. As previously stated, for w_i we use its value embedding v_{w_i} as input to the model. Second, we add a new loss term to the model which enforces the input vector representation of w_i, given by v_{w_i}, to be close to the weighted vector representation of its context. The loss for w_i is given by

L_{w_i} = || v_{w_i} − Σ_{j=i−b, j≠i}^{i+b} a_{w_j} v_{w_j} ||²,

where a_{w_j} = softmax(k_{w_i}^T q_{w_j}).

Table 5: More examples of attention weights between the masked word and the context words. The attention weight (att. wt.) between the masked word and a context word is given by exp(k_{masked word}^T q_{context word}), while the word vector similarity (sim.) between the masked word and a context word is given by exp(u_{masked word}^T u_{context word}). Highly frequent words are highlighted with a dark gray background. For each masked word, the attention weights corresponding to the context words that are most attended to (excluding highly frequent words) are highlighted with bold numerals and a light gray background.
This loss is inspired by the distributional hypothesis, which holds that a word is known through its context (Harris, 1954). Interestingly, this loss can be added to any NLP model and requires no additional training data. The idea is similar to the training paradigm of word2vec and AWE, which learn word embeddings using a variant of this loss.
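Under the assumption that "close" is measured by a squared Euclidean penalty, the auxiliary loss for one position can be sketched as follows (function and variable names are illustrative, with toy dimensions):

```python
import numpy as np

def distributional_loss(V, K, Q, seq, i, b):
    # Context positions within a window of size b around position i.
    idx = [j for j in range(max(0, i - b), min(len(seq), i + b + 1)) if j != i]
    ctx = [seq[j] for j in idx]
    # Attention weights over the context: softmax of key-query dot products.
    logits = Q[ctx] @ K[seq[i]]
    a = np.exp(logits - logits.max())
    a /= a.sum()
    # Penalize the distance between the word vector and its weighted context.
    diff = V[seq[i]] - a @ V[ctx]
    return float(diff @ diff)

rng = np.random.default_rng(0)
V = rng.normal(size=(10, 6))
K = rng.normal(size=(10, 3))
Q = rng.normal(size=(10, 3))
loss = distributional_loss(V, K, Q, seq=[4, 2, 7, 1, 9], i=2, b=2)
```

In a full model, this term would be summed over all positions i and added to the task loss; since the attention weights form a convex combination, the loss vanishes when a word vector already equals its weighted context representation.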
To test the efficacy of this loss, we train a transformer model (Vaswani et al., 2017) on the WMT17 English-to-German news translation dataset. The best trained transformer model achieves a perplexity score of 7.17; with the embedding scheme augmented with the new distributional loss term, we achieve a perplexity score of 6.56 (lower is better).

Conclusion & Future Work
In this work, we have proposed AWE and AWE-S, which perform significantly better than CBOW, Skip-Gram, GloVe, and fastText on a variety of datasets across a diverse set of tasks. The simple setting of masked word prediction in AWE also allowed us to visually analyze and interpret how the attention mechanism works. Our analysis revealed that the attention mechanism can identify the context words that are most relevant for predicting the masked word. In the future, we want to explore more applications of the proposed distributional loss for bigger models like BERT on larger datasets.