Mittens: an Extension of GloVe for Learning Domain-Specialized Representations

We present a simple extension of the GloVe representation learning model that begins with general-purpose representations and updates them based on data from a specialized domain. We show that the resulting representations can lead to faster learning and better results on a variety of tasks.


Introduction
Many NLP tasks have benefitted from the public availability of general-purpose vector representations of words trained on enormous datasets, such as those released by the GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) teams. These representations, when used as model inputs, have been shown to lead to faster learning and better results in a wide variety of settings (Erhan et al., 2009, 2010; Cases et al., 2017).
However, many domains require more specialized representations but lack sufficient data to train them from scratch. We address this problem with a simple extension of the GloVe model (Pennington et al., 2014) that synthesizes general-purpose representations with specialized data sets. The guiding idea comes from the retrofitting work of Faruqui et al. (2015), which updates a space of existing representations with new information from a knowledge graph while also staying faithful to the original space (see also Yu and Dredze 2014; Mrkšić et al. 2016; Pilehvar and Collier 2016). We show that the GloVe objective is amenable to a similar retrofitting extension. We call the resulting model 'Mittens', evoking the idea that it is 'GloVe with a warm start' or a 'warmer GloVe'.
Our hypothesis is that Mittens representations synthesize the specialized data and the general-purpose pretrained representations in a way that gives us the best of both. To test this, we conducted a diverse set of experiments. In the first, we learn GloVe and Mittens representations on IMDB movie reviews and test them on separate IMDB reviews using simple classifiers. In the second, we learn our representations from clinical text and apply them to a sequence labeling task using recurrent neural networks, and to edge detection using simple classifiers. These experiments support our hypothesis about Mittens representations and help identify where they are most useful.

Mittens
This section defines the Mittens objective. We first vectorize GloVe to help reveal why it can be extended into a retrofitting model.

Vectorizing GloVe
For a word $i$ from vocabulary $V$ occurring in the context of word $j$, GloVe learns representations $w_i$ and $\tilde{w}_j$ whose inner product approximates the logarithm of the probability of the words' co-occurrence. Bias terms $b_i$ and $\tilde{b}_j$ absorb the overall occurrences of $i$ and $j$. A weighting function $f$ is applied to emphasize word pairs that occur frequently and reduce the impact of noisy, low-frequency pairs. This results in the objective
$$J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
where $X_{ij}$ is the co-occurrence of $i$ and $j$. Since $\log X_{ij}$ is only defined for $X_{ij} > 0$, the sum excludes zero-count word pairs. As a result, existing implementations of GloVe use an inner loop to compute this cost and associated derivatives.
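However, since $f(0) = 0$, the second bracket is irrelevant whenever $X_{ij} = 0$, and so replacing $\log X_{ij}$ with
$$g(X_{ij}) = \begin{cases} k & \text{if } X_{ij} = 0 \\ \log X_{ij} & \text{otherwise} \end{cases}$$
(for any $k$) does not affect the objective and reveals that the cost function can be readily vectorized as
$$J = \sum_{i,j} \left[ f(X) \odot M \odot M \right]_{ij}, \qquad M = W^\top \tilde{W} + b \mathbf{1}^\top + \mathbf{1} \tilde{b}^\top - g(X).$$
$W$ and $\tilde{W}$ are matrices whose columns comprise the word and context embedding vectors, $\odot$ denotes the elementwise product, and $f$ and $g$ are applied elementwise. Because $f(X_{ij})$ is a factor of all terms of the derivatives, the gradients are identical to those of the original GloVe implementation too.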
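The vectorization argument above can be checked numerically. The following is a minimal NumPy sketch, not the authors' released implementation: the function names are hypothetical, the weighting function is assumed to be the standard GloVe form (x / x_max)^α capped at 1, and the matrices follow the column-per-word convention stated in the text. It compares the vectorized cost against a loop over non-zero pairs.

```python
import numpy as np


def glove_weights(X, x_max=100.0, alpha=0.75):
    """Assumed GloVe weighting f: (x / x_max)^alpha, capped at 1, so f(0) = 0."""
    return np.minimum(X / x_max, 1.0) ** alpha


def glove_cost_vectorized(W, W_ctx, b, b_ctx, X, k=0.0):
    """Vectorized cost; W and W_ctx have shape (dim, V), one column per word."""
    # g(X): log X where X > 0, and the arbitrary constant k where X == 0.
    g = np.full(X.shape, k, dtype=float)
    np.log(X, out=g, where=X > 0)
    # M[i, j] = w_i . w~_j + b_i + b~_j - g(X_ij)
    M = W.T @ W_ctx + b[:, None] + b_ctx[None, :] - g
    return np.sum(glove_weights(X) * M ** 2)


def glove_cost_loop(W, W_ctx, b, b_ctx, X):
    """Reference cost that visits only the non-zero co-occurrence pairs."""
    f = glove_weights(X)
    total = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[:, i] @ W_ctx[:, j] + b[i] + b_ctx[j] - np.log(X[i, j])
        total += f[i, j] * diff ** 2
    return total
```

Because the weight is zero wherever the count is zero, the value chosen for k has no effect on the result, which is the property the argument relies on.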
To assess the practical value of vectorizing GloVe, we implemented the model in pure Python/NumPy (van der Walt et al., 2011) and in TensorFlow (Abadi et al., 2015), and we compared these implementations to a non-vectorized TensorFlow implementation and to the official GloVe C implementation (Pennington et al., 2014). The results of these tests are in tab. 1. Though the C implementation is the fastest (and scales to massive vocabularies), our vectorized TensorFlow implementation is a strong second-place finisher, especially where GPU computations are possible.

The Mittens Objective Function
This vectorized implementation makes it apparent that we can extend GloVe into a retrofitting model by adding a term to the objective that penalizes the squared Euclidean distance from the learned embedding $\hat{w}_i = w_i + \tilde{w}_i$ to an existing one, $r_i$:
$$J_{\text{Mittens}} = J + \mu \sum_{i \in R} \lVert \hat{w}_i - r_i \rVert^2$$
Here, $R$ contains the subset of words in the new vocabulary for which prior embeddings are available (i.e., $R = V \cap V'$ where $V'$ is the vocabulary used to generate the prior embeddings), and $\mu$ is a non-negative real-valued weight. When $\mu = 0$ or $R$ is empty (i.e., there is no original embedding), the objective reduces to GloVe's.
As in retrofitting, this objective encodes two opposing pressures: the GloVe objective (left term), which favors changing representations, and the distance measure (right term), which favors remaining true to the original inputs. We can control this trade-off by decreasing or increasing µ.
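The two pressures described above can be made concrete with a short NumPy sketch of the cost. This is an illustration under stated assumptions rather than the released code: the function name, the argument names `R_idx` (indices of words with prior embeddings) and `R_vecs` (those prior embeddings, one column per word), and the standard GloVe weighting function are all assumptions.

```python
import numpy as np


def mittens_cost(W, W_ctx, b, b_ctx, X, R_idx, R_vecs, mu=0.1,
                 x_max=100.0, alpha=0.75):
    """GloVe cost plus mu times the squared distance to the prior embeddings.

    W, W_ctx: (dim, V) word and context embeddings, one column per word.
    R_idx:    integer indices of the words that have a prior embedding.
    R_vecs:   (dim, len(R_idx)) prior embeddings r_i for those words.
    """
    # GloVe term, with zero-count cells neutralized by the weighting f.
    f = np.minimum(X / x_max, 1.0) ** alpha
    g = np.zeros(X.shape)
    np.log(X, out=g, where=X > 0)
    M = W.T @ W_ctx + b[:, None] + b_ctx[None, :] - g
    glove_term = np.sum(f * M ** 2)

    # Retrofitting term: the learned embedding is w_i + w~_i.
    W_hat = W + W_ctx
    diff = W_hat[:, R_idx] - R_vecs
    return glove_term + mu * np.sum(diff ** 2)
```

Setting `mu=0`, or passing an empty `R_idx`, leaves only the GloVe term, matching the reduction noted in the text.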
In our experiments, we always begin with 50-dimensional 'Wikipedia 2014 + Gigaword 5' GloVe representations (henceforth 'External GloVe'), but the model is compatible with any kind of 'warm start'.

Notes on Mittens Representations
GloVe's objective is that the log probability of words i and j co-occurring be proportional to the dot product of their learned vectors. One might worry that Mittens distorts this, thereby diminishing the effectiveness of GloVe. To assess this, we simulated 500-dimensional square count matrices and original embeddings for 50% of the words. Then we ran Mittens with a range of values of µ. The results for five trials are summarized in fig. 1: for reasonable values of µ, the desired correlation remains high (fig. 1a), even as vectors with initial embeddings stay close to those inputs, as desired (fig. 1b).
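One plausible way to compute the correlation discussed above is sketched below. The text does not spell out the exact procedure, so the details are assumptions: the function name is hypothetical, the probability of a pair is taken to be its share of the total count mass, and only pairs with non-zero counts are included.

```python
import numpy as np


def dot_logprob_correlation(E, X):
    """Pearson correlation between e_i . e_j and log P(i, j) over pairs with X_ij > 0.

    E: (V, dim) learned embeddings, one row per word.
    X: (V, V) co-occurrence counts.
    """
    mask = X > 0
    dots = (E @ E.T)[mask]
    # Assumed estimate of the co-occurrence probability: share of total counts.
    log_probs = np.log(X[mask] / X.sum())
    return np.corrcoef(dots, log_probs)[0, 1]
```

Since Pearson correlation is unaffected by adding a constant, dividing by the total count does not change the result; a per-row normalization would.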

Sentiment Experiments
For our sentiment experiments, we train our representations on the unlabeled part of the IMDB review dataset released by Maas et al. (2011). This simulates a common use-case: Mittens should enable us to achieve specialized representations for these reviews while benefiting from the large datasets used to train External GloVe.

Word Representations
All our representations begin from a common count matrix obtained by tokenizing the unlabeled movie reviews in a way that splits out punctuation, downcases words unless they are written in all uppercase, and preserves emoticons and other common social media mark-up. We say word i co-occurs with word j if i is within 10 words to the left or right of j, with the counts weighted by 1/d where d is the distance in words from j. Only words with at least 300 tokens are included in the matrix, yielding a vocabulary of 3,133 words. For regular GloVe representations derived from the IMDB data ('IMDB GloVe'), we train 50-dimensional representations and use the default parameters from Pennington et al. (2014): α = 0.75, x_max = 100, and a learning rate of 0.05. We optimize with AdaGrad (Duchi et al., 2011), also as in the original paper, training for 50K epochs.
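The counting scheme described above (a symmetric 10-word window, 1/d weighting, and a minimum token count) might be sketched as follows. The function name is hypothetical and tokenization is assumed to have been done already. One further assumption is that distances are measured in the original token sequence, before rare words are dropped.

```python
from collections import Counter, defaultdict


def cooccurrence_counts(docs, window=10, min_count=300):
    """Distance-weighted co-occurrence counts over pre-tokenized documents.

    docs: iterable of token lists. Returns (vocab, counts), where counts maps
    (word_i, word_j) to the sum of 1/d over all co-occurrences within `window`.
    """
    docs = [list(doc) for doc in docs]
    token_counts = Counter(tok for doc in docs for tok in doc)
    vocab = {tok for tok, n in token_counts.items() if n >= min_count}

    counts = defaultdict(float)
    for doc in docs:
        for pos, word in enumerate(doc):
            if word not in vocab:
                continue
            # Look right only; each pair is then credited in both directions.
            for d in range(1, window + 1):
                if pos + d >= len(doc):
                    break
                other = doc[pos + d]
                if other in vocab:
                    counts[(word, other)] += 1.0 / d
                    counts[(other, word)] += 1.0 / d
    return vocab, dict(counts)
```

For example, `cooccurrence_counts([["a", "b", "a"]], min_count=1)` credits the pair ("a", "b") with 2.0 (two adjacent occurrences) and ("a", "a") with 1.0 (one occurrence at distance 2, counted from both positions).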
For Mittens, we begin with External GloVe. The few words in the IMDB vocabulary that are not in this GloVe vocabulary receive random initializations with a standard deviation that matches that of the GloVe representations. Informed by our simulations, we train representations with the Mittens weight µ = 0.1. The GloVe hyperparameters and optimization settings are as above. Extending the correlation analysis of fig. 1a to these real examples, we find that the GloVe representations generally have Pearson's ρ ≈ 0.37, and the Mittens representations ρ ≈ 0.47. We speculate that the improved correlation is due to the low-variance External GloVe embeddings smoothing out noise from our co-occurrence matrix.
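The warm-start initialization described above could look roughly like this. The function name and the dictionary-based lookup are hypothetical. The sketch also assumes that the random vectors are zero-mean Gaussian and that the standard deviation is pooled over all entries of the matched pretrained vectors, neither of which the text specifies.

```python
import numpy as np


def warm_start(vocab, pretrained, dim=50, seed=0):
    """Build an initial embedding matrix with one row per vocabulary word.

    vocab:      ordered list of words in the new domain.
    pretrained: dict mapping word -> 1-D array of length `dim`.
    Returns (init, known), where `known` lists the words found in `pretrained`.
    """
    rng = np.random.default_rng(seed)
    known = [w for w in vocab if w in pretrained]
    # Match the spread of the pretrained vectors (assumed: pooled over all entries).
    scale = np.std(np.stack([pretrained[w] for w in known])) if known else 1.0

    init = np.empty((len(vocab), dim))
    for row, word in enumerate(vocab):
        if word in pretrained:
            init[row] = pretrained[word]
        else:
            init[row] = rng.normal(0.0, scale, size=dim)
    return init, known
```

The `known` list corresponds to the set R in the Mittens objective: only those words are pulled toward their original vectors during training.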

IMDB Sentiment Classification
The labeled part of the IMDB sentiment dataset defines a positive/negative classification problem with 25K labeled reviews for training and 25K for testing. We represent each review by the elementwise sum of the representations of the words in the review, and train a random forest model (Ho, 1995; Breiman, 2001) on these representations. The rationale behind this experimental set-up is that it fairly directly evaluates the vectors themselves; whereas the neural networks we evaluate next can update the representations, this model relies heavily on their initial values.
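The review representation described above amounts to a few lines of NumPy. In this sketch the function name is hypothetical, and tokens without an embedding are assumed to be skipped, a detail the text leaves open.

```python
import numpy as np


def review_vector(tokens, embeddings, dim=50):
    """Elementwise sum of the embeddings of the tokens in one review.

    tokens:     list of tokens from a single review.
    embeddings: dict mapping word -> 1-D array of length `dim`.
    """
    vec = np.zeros(dim)
    for tok in tokens:
        if tok in embeddings:  # assumed: out-of-vocabulary tokens are ignored
            vec += embeddings[tok]
    return vec
```

Stacking these vectors row by row gives a fixed-width feature matrix that any standard classifier can consume, and the classifier has no way to adjust the embeddings themselves.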
Via cross-validation on the training data, we optimize the number of trees, the number of features at each split, and the maximum depth of each tree. To help factor out variation in the representation learning step (Reimers and Gurevych, 2017), we report the average accuracies over five separate complete experimental runs.
Our results are given in tab. 2. Mittens outperforms External GloVe and IMDB GloVe, indicating that it effectively combines complementary information from both.

Clinical Text Experiments
Our clinical text experiments begin with 100K clinical notes (transcriptions of the reports healthcare providers create summarizing their interactions with patients during appointments) from Real Health Data. These notes are divided into informal segments that loosely follow the 'SOAP' convention for such reporting (Subjective, Objective, Assessment, Plan). The sample has 1.3 million such segments, and these segments provide our notion of 'document'.

Word Representations
The count matrix is created from the clinical text using the specifications described in sec. 3.1, but with the count threshold set to 500 to speed up optimization. The final matrix has a 6,519-word vocabulary. We train Mittens and GloVe as in sec. 3.1. The correlations in the sense of fig. 1a are ρ ≈ 0.51 for both GloVe and Mittens.

Disease Diagnosis Sequence Modeling
Here we use a recurrent neural network (RNN) to evaluate our representations. We sampled 3,206 sentences from clinical texts (disjoint from the data used to learn word representations) containing disease mentions, and labeled these mentions as 'Positive diagnosis', 'Concern', 'Ruled Out', or [...] (Hochreiter and Schmidhuber, 1997), and the inputs are updated during training. Fig. 2 summarizes the results of these experiments based on 10 random train/test splits, with 30% of the sentences allocated for testing. Since the inputs can be updated, we expect all the initialization schemes to converge to approximately the same performance eventually (though this seems not to be the case in practical terms for Random or External GloVe). However, Mittens learns fastest for all categories, reinforcing the notion that Mittens is a sensible default choice to leverage both domain-specific and large-scale data.

SNOMED CT edge prediction
Finally, we wished to see if Mittens representations would generalize beyond the specific dataset they were trained on. SNOMED CT is a public, widely used graph of healthcare concepts and their relationships (Spackman et al., 1997). It contains 327K nodes, classified into 169 semantic types, and 3.8M edges. Our clinical notes are more colloquial than SNOMED's node names and cover only some of its semantic spaces, but the Mittens representations should still be useful here.
For our experiments, we chose the five largest semantic types; tab. 4a lists these subgraphs along with their sizes. Our task is edge prediction: given a pair of nodes in a subgraph, the models predict whether there should be an edge between them. We sample 50% of the non-existent edges to create a balanced problem. Each node is represented by the sum of the vectors for the words in its primary name, and the classifier is trained on the concatenation of these two node representations. To help assess whether the input representations truly generalize to new cases, we ensure that the sets of nodes seen in training and testing are disjoint (which entails that the edge sets are disjoint as well), and we train on just 50% of the nodes. We report the results of ten random train/test splits.
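The node-disjoint split and the concatenated edge features described above can be sketched as follows. The function names are hypothetical, and the sketch assumes that edges whose endpoints fall on opposite sides of the split are simply discarded. Negative-edge sampling is omitted.

```python
import random

import numpy as np


def node_disjoint_split(nodes, edges, train_frac=0.5, seed=0):
    """Split nodes into train/test sets and keep only edges inside one set.

    nodes: list of node identifiers; edges: list of (u, v) pairs.
    Edges that cross the split are dropped (an assumption of this sketch).
    """
    rng = random.Random(seed)
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    train_nodes, test_nodes = set(shuffled[:cut]), set(shuffled[cut:])
    train_edges = [(u, v) for u, v in edges
                   if u in train_nodes and v in train_nodes]
    test_edges = [(u, v) for u, v in edges
                  if u in test_nodes and v in test_nodes]
    return train_nodes, test_nodes, train_edges, test_edges


def edge_features(u, v, node_vecs):
    """Classifier input for a candidate edge: the two node vectors concatenated."""
    return np.concatenate([node_vecs[u], node_vecs[v]])
```

Because every retained edge has both endpoints in the same node set, disjoint node sets guarantee disjoint edge sets, as the text notes.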
The large scale of these problems prohibits the large hyperparameter search described in sec. 3.2, so we used the best settings from those experiments (500 trees per forest, square root of the total features at each split, no depth restrictions).
Our results are summarized in tab. 4b. Though the differences are small numerically, they are meaningful because of the large size of the graphs (tab. 4a). Overall, these results suggest that Mittens is at its best where there is a highly specialized dataset for learning representations, but that it is a safe choice even when seeking to transfer the representations to a new domain.

Conclusion
We introduced a simple retrofitting-like extension to the original GloVe model and showed that the resulting representations were effective in a number of tasks and models, provided a substantial (unsupervised) dataset in the same domain is available to tune the representations. The most natural next step would be to study similar extensions of other representation-learning models.