Delta Embedding Learning

Unsupervised word embeddings have become a popular approach to word representation in NLP tasks. However, there are limitations to the semantics captured by unsupervised embeddings, and inadequate fine-tuning of embeddings can lead to suboptimal performance. We propose a novel learning technique called Delta Embedding Learning, which can be applied to general NLP tasks to improve performance through optimized tuning of the word embeddings. A structured regularization is applied to the embeddings so that they are tuned incrementally. As a result, the tuned word embeddings become better word representations by absorbing semantic information from supervision without "forgetting." We apply the method to various NLP tasks and observe a consistent improvement in performance. Evaluation also confirms that the tuned word embeddings have better semantic properties.


Introduction
Unsupervised word embeddings have been popular in recent years and have become the basis for a wide range of natural language processing tasks. The most frequently used embedding models include skip-gram (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014). These embedding models all produce distributional representations of words in vector space and are trained on a large corpus. These representations capture the statistics of the corpus and have good properties that correspond to the semantics of words (Mikolov et al., 2013b). This representation also has drawbacks, such as the inability to model some fine-grained word semantics: for example, words in the same category but with different polarity are hard to distinguish, because they share much common statistics in the corpus (Mrkšić et al., 2016).
In supervised NLP tasks, unsupervised word embeddings are often used as a starting point for word representation. Based on the nature of the task and the available labeled data, there are usually three ways to use a word embedding: 1) fixed: when labeled data is scarce, use an unsupervised word embedding and fix it during training of the model, to avoid overfitting; 2) finetune: when a moderate amount of labeled data is available for a task that is not too easy, use an unsupervised word embedding as initialization and allow it to be adjusted during model training; 3) learn from scratch: for tasks with huge amounts of data, like machine translation with millions of examples, the labeled data contain sufficient information to learn an embedding from scratch that is good enough for the task.
From an optimization perspective, this all-or-none optimization of the word embedding lacks control over the learning process. One has to carefully balance between underfitting and overfitting.
Word embeddings learned in supervised NLP tasks are also vastly different from unsupervised ones. Trained to maximize the objective of the task, these embeddings are often highly task-specific, which means they are less useful as a general representation for transfer to other tasks. They are also harder to interpret, with no clear separation from the whole neural network model as a black box. A useful approach for incorporating supervised information into word embeddings is multi-task learning, where one predicts context words and external labels at the same time (Tang et al., 2014). (Yang and Mao, 2015) tried to finetune an unsupervised embedding with a specially designed gradient descent algorithm and stopping regime. It is still unclear whether these supervision-enhanced embeddings can provide a consistent benefit for downstream tasks.
In this paper, we propose delta embedding learning, a way to find an optimum between unsupervised and supervised learning of word semantics. The method learns semantics from tasks to enrich existing unsupervised embeddings. The result is an embedding that not only provides the best task performance, but is also itself a better quality universal embedding.
We aim to combine the benefits of unsupervised learning and supervised learning to learn better word embeddings. An unsupervised word embedding like skip-gram, trained on a large corpus, gives good-quality word representations and an embedding space with nice properties, like geometries that correspond to semantic relations. We use such an embedding w_unsup as a starting point and learn a delta embedding w_Δ on top of it:

w = w_unsup + w_Δ    (1)

The unsupervised embedding w_unsup is fixed to preserve the good properties of the embedding space and the word semantics learned from the large corpus. The delta embedding w_Δ is used to capture discriminative word semantics from supervised tasks and is trained together with the task model. In order to learn only useful word semantics, rather than task-specific peculiarities that result from fitting (or overfitting) a specific task, we use an L21 loss, a kind of structured regularization, on w_Δ: the regularization loss is added to the original loss of the supervised task. The effect of the L21 loss on w_Δ has a straightforward interpretation: minimize the total moving distance of word vectors in the embedding space while obtaining optimal task performance. The L2 part of the regularization keeps the change of each word vector small, so that a word does not lose its original semantics. The L1 part of the regularization induces sparsity over word deltas, so that only a small number of words (those critical to the task) receive a delta, while the majority of words are kept intact. The combined effect is selective finetuning with moderation: only significant word semantics that is contained in the training data of the task but absent in the unsupervised embedding is captured in the delta embedding.
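The regularized objective described above can be sketched as follows. This is a minimal pure-Python illustration, not the authors' implementation; the function names and toy dimensions are ours, and in a real model w_Δ would be a trainable parameter matrix (initialized to zero) updated by the optimizer while w_unsup stays frozen.

```python
import math

def l21_penalty(delta):
    """L21 (group-lasso style) penalty: the sum over words of the L2 norm
    of that word's delta vector. The L2 norm within a row keeps each
    word's change small; summing the norms (an L1 over rows) pushes
    most rows toward exactly zero, so few words move at all."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in delta)

def effective_embedding(w_unsup, delta):
    """The embedding actually fed to the task model: w = w_unsup + w_delta.
    Only delta receives gradients; w_unsup is kept fixed."""
    return [[u + d for u, d in zip(urow, drow)]
            for urow, drow in zip(w_unsup, delta)]

def total_loss(task_loss, delta, c=1e-4):
    """Overall training objective: task loss plus c * L21(delta),
    where c controls the regularization strength."""
    return task_loss + c * l21_penalty(delta)
```

At c = 0 this reduces to unconstrained finetuning; as c grows, the delta is driven to zero and the embedding stays fixed, so c interpolates between the two conventional regimes.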
In the remaining sections of the paper, we use experiments and illustrations to show that the delta embedding learning method indeed leads to performance advantages and desirable properties.

Experiments on supervised task
We conduct experiments on several different NLP tasks to illustrate the effect of delta embedding learning on task performance.

Experimental setup
Sentiment analysis We performed experiments on two sentiment analysis datasets: rt-polarity (binary) (Pang and Lee, 2005) and the Kaggle movie review dataset (KMR, 5-class) (Socher et al., 2013). For rt-polarity, we used a CNN model as in (Kim, 2014). For KMR, an LSTM-based model is used.
Reading comprehension We used the Stanford Question Answering Dataset (SQuAD, v1.1) (Rajpurkar et al., 2016) and the Bi-directional Attention Flow (BiDAF) (Seo et al., 2016) model. The original hyperparameters are used, except that character-level embedding is turned off to more clearly illustrate the effect of word embeddings.
Language inference The MultiNLI (Williams et al., 2018) and SNLI (Bowman et al., 2015) datasets are used for evaluation on the natural language inference task. We use the ESIM model, a strong baseline in (Williams et al., 2018). As MultiNLI is a large dataset, we use one genre of data ("fiction") in the training set for training, and use the development set and SNLI for testing.
Common setup For all the experiments, we used GloVe embeddings pre-trained on the Wikipedia and Gigaword corpus. The dimension of word embeddings in all models is set to 100.

Results
The task performance of models with different embedding learning methods is reported in Table 1. We compare with fixing the embedding to an unsupervised pre-trained embedding, and with using it as initialization and finetuning it while training the model. The delta embedding learning method has one hyperparameter c that controls the strength of the regularization. We empirically experiment in the range [10^-5, 10^-3].
In all the tasks, delta embedding learning outperforms conventional ways of using an embedding in terms of final task performance. As the embedding is the only variable, we can conclude that delta embedding learning learns better quality embeddings that result in better task performance.
Upon closer observation, there are two types of scenarios: easy-underfit and hard-overfit tasks. The sentiment analysis datasets represent easier tasks, where one primarily needs to learn the polarity of a set of words with salient sentiment. Such tasks are relatively easy to learn, and a fixed embedding will result in underfitting. Allowing the embedding to finetune helps to discriminate those critical words, resulting in better performance. On the other hand, in reading comprehension and language inference (especially the former), the task is more complex and comprehensive, and involves learning many words. Without abundant labeled data, it is very likely to overfit the embedding if it is not fixed during training.
In both these scenarios, delta embedding learning gets the best performance. In easy-underfit tasks, delta learning gets better performance than free finetuning, meaning it is better at capturing the semantics of critical words than unconstrained finetuning. In hard-overfit tasks, delta embedding learning manages to learn useful semantics from the task while avoiding overfitting.
For the choice of the regularization coefficient c, we found it insensitive to tasks, with c = 10^-4 achieving the best performance in most tasks. This makes the method easy to adapt.
The results indicate that with delta embedding learning, one does not need to decide whether or not to fix the embedding, as delta embedding learning always harvests the best of the unsupervised embedding and supervised finetuning, whether or not the amount of labeled data is sufficient to learn a good embedding. It is also a generally applicable method that can be used with various tasks and models.

Embedding evaluation
Besides obtaining better performance on supervised tasks, we examine the effect of delta embedding learning on the quality of the word embeddings by quantitative evaluation. Word embeddings from the BiDAF model are extracted after training on SQuAD and compared with the original GloVe embedding under different metrics. As reading comprehension involves a rather wide spectrum of word semantics, training on it can hopefully result in non-trivial changes of embedding properties at the whole-vocabulary level.

QVEC
QVEC (Tsvetkov et al., 2015) is a comprehensive evaluation of the quality of word embeddings by alignment with linguistic features. We calculated the QVEC score of the learned embeddings. Using the original GloVe embedding as a reference, unconstrained finetuning decreases the QVEC score, because the embedding overfits the task and some of the semantic information in the original embedding is lost. Delta embedding learning (c = 10^-4) not only achieved the best task performance, but also slightly increased the QVEC score. This means that it does not degrade the quality of the original embedding (in other words, it does not suffer from the catastrophic forgetting problem, which is present whenever one tries to learn new things by finetuning an existing network). Also, as the QVEC score is strongly related to downstream task performance, it means that embeddings with deltas are no less general and universal than the original unsupervised embedding.

Word similarity
Word similarity is a common approach for examining the semantics captured by embeddings. We evaluated the embeddings on 13 word similarity datasets. Results are displayed in Table 3. The delta embedding trained with c = 10^-4 has the best performance on over half of the benchmarks. Compared to the original GloVe embedding, the finetuned embedding gets better on some datasets while worse on others, indicating that naive finetuning learns some semantic information from the task while "forgetting" some other. Delta embedding learning, however, achieved better performance than the GloVe embedding on all but one dataset (a negligible decrease on RG-65; see the last column of Table 3). This shows that delta embedding learning effectively learns new semantics from the supervised task and adds it to the original embedding in a non-destructive way, which is the result of finetuning controlled via L21 regularization. Delta embedding learning is a process that improves the embeddings.
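A word similarity benchmark scores an embedding by the Spearman correlation between cosine similarities of word pairs and human similarity ratings. The sketch below is a self-contained illustration of that protocol, not the evaluation tool used in the paper; the helper names and toy vectors are ours.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def _ranks(xs):
    """Ranks (1-based), averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    vy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (vx * vy)

def similarity_score(emb, pairs):
    """emb: word -> vector; pairs: list of (word1, word2, human_rating).
    Returns the benchmark score for this embedding."""
    model = [cosine(emb[w1], emb[w2]) for w1, w2, _ in pairs]
    human = [r for _, _, r in pairs]
    return spearman(model, human)
```

Because the score depends only on rank order, an embedding that moves a few critical words (as delta learning does) can improve the score without disturbing the global geometry.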

Interpreting word semantics learning
Neural network models are notoriously difficult to interpret. Because the delta embedding captures the learned word semantics entirely, analysis of the delta embedding can facilitate understanding of how semantics is learned in a supervised task, regardless of the model used.
To answer the question "What is learned in the task?", the norm of the delta embeddings is computed to identify which words have a significant newly learned component. In Table 4, it can be seen that different kinds of words are learned in these tasks. For example, words with a strong sentiment or polarity like "bore" and "fun" are mostly learned in sentiment analysis tasks. In reading comprehension, question words like "what" and "why" indicate the intention of the question to be answered, and are the first to be learned. Words that help to locate possible answers, like "called", "another", and "also", also receive a large learned component. This helps us understand what semantics is most important to the task.

Sentiment Analysis: neither, still, unexpected, nor, bore, lacking, worst, suffers, usual, moving, works, interesting, tv, fun, smart
Reading Comprehension: why, another, what, along, called, whose, call, which, also, this, if, not, occupation, whom, but, he, because, into
Language Inference: not, the, even, I, nothing, because, that, you, it, as, anything, only, was, if, want, forget, well, be, so, from, does, in, certain, could
Table 4: Words with the largest norm of delta embedding in different tasks
Before training: (+) good, always, clearly, definitely, well, able; (-) nothing, yet, none
After training: (+) sure; (-) nothing, yet, none, bad, lack, unable, nobody, less, impossible, unfortunately, Not, rarely
Table 5: Nearest neighbors of the word "not"

The exact semantics learned for a word is represented by the shift of its position in the embedding space. We found that the semantics learned are mostly discriminative features. Take the word "not" as an example: after training, it clearly gained a component representing negativity, and it differentiates positive and negative words much better (Table 5). These discriminative semantics are sometimes absent or only weakly present in co-occurrence statistics, but play a crucial role in the understanding of text. Delta embedding learning helps to learn, from supervised tasks, discriminative semantics that are absent in unsupervised embeddings, and the result is a more accurate word representation, as seen in Sections 3 and 4.
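The before/after comparison behind Table 5 amounts to a nearest-neighbor lookup by cosine similarity, once on the original embedding and once after adding the delta. A sketch, with a toy three-word vocabulary of our own invention:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def nearest(word, emb, k=2):
    """k nearest neighbors of `word` by cosine similarity."""
    sims = {w: cosine(emb[word], v) for w, v in emb.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

def add_delta(emb, delta):
    """Apply per-word deltas; words without a delta are unchanged."""
    return {w: [a + b for a, b in zip(v, delta.get(w, [0.0] * len(v)))]
            for w, v in emb.items()}
```

Comparing `nearest(word, emb)` against `nearest(word, add_delta(emb, delta))` shows how a word's neighborhood, and hence its semantics, shifted under supervision.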

Conclusion
We proposed delta embedding learning, a supervised embedding learning method that not only improves performance on supervised NLP tasks, but also learns better universal word embeddings by letting the embedding "grow" under supervision.
In this work we have only investigated learning on a single task. As an incremental process, it is possible to learn from a sequence of tasks with delta embeddings, essentially "continual learning" (Parisi et al., 2018) of word semantics. Learning word embeddings more like the way humans learn a language is an interesting direction for future work.