Refining Pretrained Word Embeddings Using Layer-wise Relevance Propagation

In this paper, we propose a simple method for refining pretrained word embeddings using layer-wise relevance propagation. Given a target semantic representation that one would like word vectors to reflect, our method first trains a neural network to map the original word vectors to the target representation. The estimated target values are then propagated backward toward the word vectors, and a relevance score is computed for each dimension of each word vector. Finally, the relevance score vectors are used to refine the original word vectors by projecting them into the subspace that reflects the information relevant to the target representation. An evaluation experiment using binary classification of word pairs demonstrates that the vectors refined by our method achieve higher performance than the original vectors.


Introduction
The recent success of neural NLP is due in large part to the development of word embedding techniques (Goldberg, 2017). Although a considerable number of studies have addressed training word embeddings from distributional information of language (Mikolov et al., 2013; Pennington et al., 2014; Bojanowski et al., 2017; Nickel and Kiela, 2017), one recent research trend is to refine or fine-tune pretrained word embeddings. One promising approach is the use of other information such as multimodal information (Bruni et al., 2014; Kiela et al., 2014; Kiela and Clark, 2015; Kiela et al., 2015a; Silberer et al., 2017) and language resources (Kiela et al., 2015b; Rothe and Schütze, 2017; Yu and Dredze, 2014). Other refinement methods include task-specific embeddings (Bolukbasi et al., 2016; Yu et al., 2017) and the selective use of multiple embeddings (Bollegala et al., 2017; Kiela et al., 2018).
In this paper, we propose a different approach to refining pretrained word embeddings so that word vectors reflect the information relevant to a specific kind of knowledge. Our method utilizes layer-wise relevance propagation (Bach et al., 2015; Samek et al., 2017), which has been proposed as a general framework for decomposing predictions of modern AI systems, in particular deep learning systems. The basic idea of layer-wise relevance propagation is to quantitatively measure the contribution of each fragment of an input vector (e.g., a single pixel of an image) to the prediction as a relevance score. Using relevance scores, our method projects word vectors into the subspace that better reflects the target knowledge. The assumption underlying our approach is that the information for any given target knowledge is contained in pretrained word embeddings. Our method attempts to make the best use of the information contained in word vectors by estimating the importance of each dimension in reflecting the target knowledge.
To the best of our knowledge, this paper is the first to employ layer-wise relevance propagation for refining word embeddings. Our method can be applied to word vectors $x$ trained by any word embedding method. This implies that our method does not compete with other refinement methods but is complementary to them; it can be applied to word vectors already refined by other methods. In addition, our method can refine word vectors for any target knowledge $y$, from a single binary value to a structured representation, as long as a function $y = f(x)$ can be learned.

Method for Refining Word Vectors
Our method comprises the following three steps: (1) training a prediction function from a pretrained word vector to a target representation; (2) computing a relevance score for each dimension of the word vector; and (3) projecting word vectors into a subspace using the relevance scores. In this section, these three steps are explained in detail.

Training the Prediction Function
Given pairs of an input word vector $x^{(i)}$ to be refined and a target knowledge representation $y^{(i)}$ for a word $w^{(i)}$, the proposed method trains a function $y^{(i)} = f(x^{(i)})$. In this paper, we use a neural network as the learning method, but other learning methods such as linear transformation and SVM can also be used. Note that a scalar value or a class label can also be used as a target representation $y^{(i)}$.
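The training step can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in the paper: it borrows the architecture reported in the evaluation section (one hidden layer with n/2 sigmoid units and a linear output layer), while the squared-error loss, full-batch gradient descent, the learning rate, the number of epochs, and the function names `train_prediction_function` and `predict` are illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_prediction_function(X, Y, epochs=500, lr=0.1, seed=0):
    """Fit y = f(x) with one sigmoid hidden layer (n/2 units) and a linear output.

    X: (N, n) word vectors, Y: (N, d) target representations.
    Loss (assumption): squared error averaged over the N training words.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape[1], Y.shape[1]
    h = max(n // 2, 1)
    W1 = rng.normal(0.0, 0.1, size=(n, h))
    b1 = np.zeros(h)
    W2 = rng.normal(0.0, 0.1, size=(h, d))
    b2 = np.zeros(d)
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)             # hidden activations
        Y_hat = H @ W2 + b2                  # linear output layer
        err = (Y_hat - Y) / len(X)           # gradient of the loss w.r.t. Y_hat
        dW2 = H.T @ err
        db2 = err.sum(axis=0)
        dH = (err @ W2.T) * H * (1.0 - H)    # backpropagate through the sigmoid
        dW1 = X.T @ dH
        db1 = dH.sum(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1, b1, W2, b2

def predict(params, X):
    """Apply the learned function f to the rows of X."""
    W1, b1, W2, b2 = params
    return sigmoid(X @ W1 + b1) @ W2 + b2
```

Any learner that yields a usable $f$ could replace this network; the later steps only need the trained weights and the forward activations.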

Computing Relevance Scores
This step derives an explanation of the prediction in terms of input variables, namely the importance of each dimension of a word vector $x^{(i)}$ for the prediction $\hat{y}^{(i)} = f(x^{(i)})$. In layer-wise relevance propagation, the score of the correct prediction $\hat{y}_j^{(i)}$ is redistributed backward using relevance propagation rules. By repeatedly applying the propagation rules, a relevance score $r_k$ is assigned to each dimension $k$ of the word vector $x^{(i)}$. As a result, a relevance score vector $r^{(i,j)}$ is obtained for each word vector $x^{(i)}$ and target dimension $y_j^{(i)}$.
Among a number of propagation rules (Bach et al., 2015), we use the "alpha-beta" rule for multilayer neural networks. The relevance score $R_i^{(l)}$ of unit $i$ in layer $l$ is computed from the relevance scores $R_j^{(l+1)}$ of the units in layer $l+1$ as

$$R_i^{(l)} = \sum_j \left( \alpha \frac{z_{ij}^{+}}{\sum_{i'} z_{i'j}^{+}} + \beta \frac{z_{ij}^{-}}{\sum_{i'} z_{i'j}^{-}} \right) R_j^{(l+1)},$$

where $z_{ij}$ is the weighted activation that unit $i$ in layer $l$ sends to unit $j$ in layer $l+1$, and $z_{ij}^{+}$ and $z_{ij}^{-}$ denote the positive and negative parts of $z_{ij}$. As a result, the relevance scores $r_k^{(i,j)}$ of the word vector $x^{(i)}$ and the target dimension $y_j^{(i)}$ are obtained as the relevance scores $R_k^{(1)}$ of the input layer. The parameters $\alpha$ and $\beta$ denote the importance of positive and negative evidence for predicting a target representation and should be chosen such that $\alpha + \beta = 1$. In this paper, we assume that positive and negative evidence contribute equally to the prediction and thus set $\alpha = \beta = 0.5$.
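The alpha-beta propagation can be sketched for a network with one sigmoid hidden layer and a linear output. This is a minimal sketch rather than the author's implementation. It assumes that bias terms are ignored when relevance is redistributed and that a unit with no positive (or no negative) incoming evidence simply contributes nothing for that part. The names `alpha_beta_step` and `relevance_scores` and the guard `eps` are hypothetical.

```python
import numpy as np

def alpha_beta_step(a, W, R_upper, alpha=0.5, beta=0.5, eps=1e-12):
    """Redistribute relevance from layer l+1 (R_upper) to the units of layer l.

    a: (n_lower,) activations of layer l, W: (n_lower, n_upper) weights.
    """
    z = a[:, None] * W                          # z[i, j]: contribution of unit i to unit j
    z_pos = np.maximum(z, 0.0)                  # positive parts
    z_neg = np.minimum(z, 0.0)                  # negative parts
    s_pos = z_pos.sum(axis=0)                   # total positive evidence per upper unit
    s_neg = z_neg.sum(axis=0)                   # total negative evidence per upper unit
    # Guard against division by zero when one part is empty (assumption).
    frac_pos = z_pos / np.where(s_pos > eps, s_pos, 1.0)
    frac_neg = z_neg / np.where(s_neg < -eps, s_neg, -1.0)
    return (alpha * frac_pos + beta * frac_neg) @ R_upper

def relevance_scores(x, W1, b1, W2, b2, j, alpha=0.5, beta=0.5):
    """Relevance of each input dimension of x for output dimension j."""
    hidden = 1.0 / (1.0 + np.exp(-(x @ W1 + b1)))
    y_hat = hidden @ W2 + b2
    R_out = np.zeros_like(y_hat)
    R_out[j] = y_hat[j]                         # start from the predicted value of dimension j
    R_hidden = alpha_beta_step(hidden, W2, R_out, alpha, beta)
    return alpha_beta_step(x, W1, R_hidden, alpha, beta)
```

With $\alpha + \beta = 1$ and both positive and negative evidence present at every unit, the returned scores sum to the predicted value $\hat{y}_j$, which is a convenient sanity check.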

Projecting Word Vectors into a Subspace
The basic idea of projection is that n-dimensional word vectors are projected onto the m dimensions whose relevance scores are greater than or equal to a threshold $\theta_R$.
First, for a target dimension $j$ of $y$, relevance score vectors are averaged over the set $V_j$ of words relevant to the target dimension, namely the words $w^{(i)}$ whose target value satisfies $y_j^{(i)} \geq \theta_T$. The functions $g_1$ and $g_2$ are used in this computation for downplaying irrelevant dimensions. For example, suppose that the target knowledge is the property Visually dark and that $V_{\text{visually dark}}$ is {chocolate, crow, night}. By averaging the relevance score vectors of these words, we obtain the mean relevance vector $r^{(\text{visually dark})}$, which represents the importance of each word vector dimension in predicting whether a given word has the property Visually dark.
Finally, using the mean relevance vector $r^{(j)}$, each word vector $x_i$ is transformed into a vector $z_i^{(j)}$ in a subspace for the target dimension. This is achieved by weighting $x_i$ through component-wise multiplication of $x_i$ and $r^{(j)}$ and then removing the dimensions of zero relevance. Formally, the projection is defined by an $n \times m$ projection matrix $T^{(j)}$.
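The projection step can be sketched in NumPy as follows. This is a sketch under stated assumptions. The exact definitions of $g_1$ and $g_2$ are not reproduced here, so a plain average followed by a single threshold `theta_R` stands in for them. The matrix layout (one column per retained dimension, holding that dimension's mean relevance) is one way to realize "weight, then drop zero-relevance dimensions". The function names are hypothetical.

```python
import numpy as np

def mean_relevance(R, y_j, theta_T):
    """Average relevance vectors over words whose target value y_j >= theta_T.

    R: (N, n) relevance vectors r^(i,j), y_j: (N,) target values for dimension j.
    """
    relevant = y_j >= theta_T
    return R[relevant].mean(axis=0)

def projection_matrix(r_mean, theta_R):
    """Build an n-by-m matrix that weights and keeps dimensions with relevance >= theta_R."""
    keep = np.flatnonzero(r_mean >= theta_R)    # indices of the m retained dimensions
    T = np.zeros((r_mean.size, keep.size))
    T[keep, np.arange(keep.size)] = r_mean[keep]
    return T

def project(X, T):
    """Map row vectors X (N, n) into the m-dimensional subspace."""
    return X @ T
```

Multiplying by this matrix is equivalent to the component-wise product of each word vector with the mean relevance vector, followed by deleting the dimensions that fall below the threshold.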

Evaluation Experiment
To verify the effectiveness of the proposed method, we conducted an evaluation experiment using binary classification of word pairs.
Corpus: All word vectors were trained on the Corpus of Contemporary American English (COCA), which includes 0.56 billion word tokens. Words that occurred fewer than 30 times in the corpus were ignored, resulting in a vocabulary of 108,230 words. Three context window sizes (3, 5, and 10) were used for training.
Target knowledge representation: We used Binder et al.'s (2016) brain-based semantic vectors of 535 words as a target representation. This representation comprises the 65 properties listed in Table 1, which are based entirely on functional divisions in the human brain. Each word is represented as a 65-dimensional vector, and each dimension corresponds to one of these properties. Each value of the brain-based vectors represents the salience of the corresponding property, which is calculated as a mean salience rating on a 7-point scale ranging from 0 to 6. Because these properties cover not only perceptual properties but also a variety of other properties such as affective, social, and cognitive ones, this dataset is suitable for evaluation.
Refining word vectors: The prediction function $f$ was trained using a three-layer neural network comprising an input layer for n-dimensional word vectors, one hidden layer with n/2 sigmoid units, and a linear output layer. The parameters $\theta_T$, $\theta_{R_1}$, and $\theta_{R_2}$ for projection were estimated.
Task: We used a binary classification task of judging whether a pair of words is similar or not with respect to each property in Table 1. For example, night and chocolate should be judged as similar with respect to the property Dark, while night and ice should be judged as dissimilar with respect to that property. For each property, we chose the 10 words with the highest salience and the 10 words with the lowest salience from the vocabulary of brain-based vectors, and generated 45 pairs of high-salience words and 100 pairs of a high-salience word and a low-salience word. Note that we did not consider pairs of low-salience words because it does not make sense to ask whether words (e.g., peace and wit) that do not have a property (e.g., Dark) are similar with respect to that property.
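The pair construction for one property can be sketched as follows. The dictionary `salience` (word to salience rating for a single property) and the function name `build_word_pairs` are hypothetical, and ties in salience are broken arbitrarily by the sort. With k = 10, the sketch yields the 45 similar pairs and 100 dissimilar pairs described above.

```python
from itertools import combinations, product

def build_word_pairs(salience, k=10):
    """Return (similar_pairs, dissimilar_pairs) for one property.

    salience: dict mapping each word to its salience rating for the property.
    """
    if len(salience) < 2 * k:
        raise ValueError("need at least 2 * k words so the two groups do not overlap")
    ranked = sorted(salience, key=salience.get, reverse=True)
    high, low = ranked[:k], ranked[-k:]
    similar = list(combinations(high, 2))       # pairs of high-salience words
    dissimilar = list(product(high, low))       # high-salience word with low-salience word
    return similar, dissimilar
```

Pairs of two low-salience words are deliberately not generated, following the rationale given in the task description.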
To confirm the generality of our method, we also generated another evaluation dataset for untrained words (i.e., words not included in Binder et al.'s vocabulary) using the CSLB concept property norms of 638 words (Devereux et al., 2014). After removing words contained in Binder et al.'s vocabulary, we chose properties that were closely related to Binder et al.'s properties and possessed by at least 10 words. The resulting dataset contained only the 18 properties listed in Table 2, because the property norms mainly include perceptual and functional properties.
Binary classification was carried out by computing the cosine similarity between the vectors of paired words and classifying the n pairs with the highest similarity as similar pairs. Hence, the classification performance was measured by average precision.
Table 3 shows the mean average precision across the 65 properties for the original word embeddings (Orig) and the embeddings refined by our method (Refn). An asterisk indicates that the mean average precision of the refined vectors is significantly higher than that of the original vectors according to the Wilcoxon signed-rank test (p < .05). For all word embeddings, the refined vectors achieved higher mean average precision than the original ones. Furthermore, in almost all cases, the improvement is statistically significant. This result demonstrates that the proposed method is successful in refining word embeddings so that vector similarity better reflects the target knowledge. Note that the properties plotted below the diagonal line, for which the refined word vectors yielded lower precision than the original vectors, are sensorimotor or spatiotemporal properties. This result is consistent with Utsumi's (2018) finding that these kinds of knowledge are less likely to be encoded in word vectors.
Table 4 shows the results of binary classification for the CSLB property norm dataset. In most cases, the refined vectors of untrained words also yielded better performance than the original vectors. In some cases, however, refinement did not improve the performance. One possible reason for this failure is that the small vocabulary of Binder et al.'s (2016) dataset is not sufficient for the subspace to generalize to untrained words.
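The evaluation measure can be sketched as follows: word pairs are ranked by cosine similarity, and average precision is computed over that ranking with similar pairs as the positive class. This is a minimal sketch. The helper names `pair_similarities` and `average_precision` and the dictionary `vectors` (word to NumPy vector) are hypothetical, and ties in similarity are not treated specially.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def pair_similarities(vectors, pairs):
    """Cosine similarity for each (word_a, word_b) pair."""
    return [cosine(vectors[a], vectors[b]) for a, b in pairs]

def average_precision(scores, labels):
    """Average precision of the ranking by descending score.

    labels: 1 for a similar pair, 0 for a dissimilar pair.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranked = np.asarray(labels)[order]
    hits = np.cumsum(ranked)                    # number of similar pairs seen so far
    ranks = np.arange(1, len(ranked) + 1)
    return float(np.sum(ranked * hits / ranks) / ranked.sum())
```

For example, scores `[0.9, 0.8, 0.7, 0.6]` with labels `[1, 0, 1, 0]` give an average precision of (1/1 + 2/3) / 2 = 5/6. Averaging this value over properties yields the mean average precision reported in the tables.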

Results
To confirm whether the projected subspace reflects the target knowledge better than the original space does, we visualize both spaces using MDS in Figure 2. Although all 535 words are embedded into the two-dimensional space, Figure 2 shows only the words used in the binary classification task, namely words with the 10 highest salience (denoted by red dots) and 20 lowest salience for a given property. As shown in Figure 2, our method refines the vectors of salient words so that they are more similar in the subspace, while preserving the other similarity relations among words.
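A two-dimensional MDS embedding of this kind can be sketched as follows. The text does not state which MDS variant or distance was used, so this sketch assumes classical (Torgerson) MDS applied to cosine distances. The function names are hypothetical.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise cosine distance (1 - cosine similarity) between the rows of X."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - U @ U.T

def classical_mds(D, dims=2):
    """Embed points in `dims` dimensions from a symmetric distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n         # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                 # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dims]      # indices of the largest eigenvalues
    # Clip small negative eigenvalues that arise for non-Euclidean distances.
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))
```

Running `classical_mds(cosine_distance_matrix(X))` once on the original vectors and once on the projected vectors gives two point sets that can be plotted side by side for comparison.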

Related Work
Prior work on word embedding refinement can be classified into general-purpose refinement and specific-target refinement. Many existing studies have attempted to refine word vectors to improve the performance of general-purpose similarity computation. These studies generally refine word vectors by solving an optimization problem whose objective function reflects the similarity obtained from language resources, such as WordNet (Yu and Dredze, 2014; Rothe and Schütze, 2017), Freebase (Rothe and Schütze, 2017), the Paraphrase Database (Yu and Dredze, 2014), free association norms (Kiela et al., 2015b), and dictionaries (Wang et al., 2015). Our method differs from them in that it is proposed for specific-target refinement. Consequently, vectors refined by a general-purpose refinement method can be further refined by our method to extract specific knowledge.
Most prior studies on specific-purpose refinement propose a method specialized for a specific task such as sentiment analysis (Labutov and Lipson, 2013; Tang et al., 2016; Yu et al., 2017) or lexical entailment (Mrkšić et al., 2016; Vulić and Mrkšić, 2018). In contrast, our method refines word vectors for a specific kind of knowledge or task, but it is not specialized for any particular one. Two prior studies are conceptually similar to our approach; their method refines word vectors for a specific kind of knowledge but is not specialized for a certain task. The merit of our method is that any type of representation can be used as a target, while their method is limited to binary labels. Furthermore, while their method learns an orthogonal transformation of pretrained word vectors by directly optimizing the objective function, our method can project word vectors into a subspace independently of the method used to train the prediction function.

Conclusion
In this paper, we proposed a method for refining pretrained word vectors using layer-wise relevance propagation. We demonstrated that the proposed method can refine word vectors so that they better reflect the target knowledge. One of our motivations is to make embeddings more interpretable and useful. In other studies (Utsumi, 2015, 2018), we have analyzed the internal knowledge encoded in text-based word embeddings, while this study is the first step toward a general method for utilizing the internal knowledge of word embeddings.
In future work, we plan to make the refinement method more effective by exploring how the internal knowledge of word vectors is extracted by multilayer neural networks and by examining the effectiveness of other relevance propagation methods. It will also be important to explore efficient combinations with other refinement methods that use language resources.