Mapping Unseen Words to Task-Trained Embedding Spaces

We consider the supervised training setting in which we learn task-specific word embeddings. We assume that we start with initial embeddings learned from unlabelled data and update them to learn task-specific embeddings for words in the supervised training data. However, for new words in the test set, we must use either their initial embeddings or a single unknown embedding, which often leads to errors. We address this by learning a neural network to map from initial embeddings to the task-specific embedding space, via a multi-loss objective function. The technique is general, but here we demonstrate its use for improved dependency parsing (especially for sentences with out-of-vocabulary words), as well as for downstream improvements on sentiment analysis.


Introduction
Performance on NLP tasks drops significantly when moving from training sets to held-out data (Petrov et al., 2010). One cause of this drop is words that do not appear in the training data but appear in test data, whether in the same domain or in a new domain. We refer to such out-of-training-vocabulary (OOTV) words as unseen words. NLP systems often make errors on unseen words and, in structured tasks like dependency parsing, this can trigger a cascade of errors in the sentence.
Word embeddings can counter the effects of limited training data (Necsulescu et al., 2015; Turian et al., 2010; Collobert et al., 2011). While the effectiveness of pretrained embeddings can be heavily task-dependent (Bansal et al., 2014), there is a great deal of work on updating embeddings during supervised training to make them more task-specific (Kalchbrenner et al., 2014; Qu et al., 2015; Chen and Manning, 2014). These task-trained embeddings have shown encouraging results but raise some concerns: (1) the updated embeddings of infrequent words are prone to overfitting, and (2) many words in the test data are not contained in the training data at all. In the latter case, at test time, systems either use a single, generic embedding for all unseen words or use their initial embeddings (typically derived from unlabelled data) (Søgaard and Johannsen, 2012; Collobert et al., 2011). Neither choice is ideal: a single unknown embedding conflates many words, while the initial embeddings may be in a space that is not comparable to the trained embedding space.
In this paper, we address both concerns by learning to map from the initial embedding space to the task-trained space. We train a neural network mapping function that takes initial word embeddings and maps them to task-specific embeddings that are trained for the given task, via a multi-loss objective function. We tune the mapper's hyperparameters to optimize performance on each domain of interest, thereby achieving some of the benefits of domain adaptation. We demonstrate significant improvements in dependency parsing across several domains and for the downstream task of dependency-based sentiment analysis using the model of Tai et al. (2015).

Mapping Unseen Representations
Let $V = \{w_1, \ldots, w_{|V|}\}$ be the vocabulary of word types in a large, unannotated corpus. Let $e^o_i$ denote the initial (original) embedding of word $w_i$ computed from this corpus. The initial embeddings are typically learned in an unsupervised way, but for our purposes they can be any initial embeddings. Let $T \subseteq V$ be the subset of words that appear in the annotated training data for some supervised task-specific training. We define unseen words as those in the set $V \setminus T$. While our approach is general, for concreteness, we consider the task of dependency parsing, so the annotated data consists of sentences paired with dependency trees. We assume a dependency parser that learns task-specific word embeddings $e^t_i$ for word $w_i \in T$, starting from the original embedding $e^o_i$. In this work, we use the Stanford neural dependency parser (Chen and Manning, 2014).
Figure 1: System Pipeline
The goal of the mapper is as follows. We are given a training set of $N$ pairs of initial and task-trained embeddings $D = \{(e^o_1, e^t_1), \ldots, (e^o_N, e^t_N)\}$, and we want to learn a function $G$ that maps each initial embedding $e^o_i$ to be as close as possible to its corresponding output embedding $e^t_i$. We denote the mapped embedding $e^m_i$, i.e., $e^m_i = G(e^o_i)$. Figure 1a describes the training procedure of the mapper. We use a supervised parser which is trained on an annotated dataset and initialized with pre-trained word embeddings $e^o_i$. The parser uses back-propagation to update these embeddings during training, producing task-trained embeddings $e^t_i$ for all $w_i \in T$. After we train the parser, the mapping function $G$ is trained to map an initial word embedding $e^o_i$ to its parser-trained embedding $e^t_i$.
At test (or development) time, we use the trained mapper G to transform the original embeddings of unseen test words to the parser-trained space (see Figure 1b). When parsing held-out data, we use the same parser model parameters (W ) as shown in Figure 1b. The only difference is that now some of the word embeddings (i.e., for unseen words) have changed to mapped ones.
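The test-time behaviour described above can be sketched as a simple lookup rule. This is a minimal illustration, not the authors' implementation: the function name `embedding_for`, the dictionary-based storage, and the fallback to a single unknown vector for words with no initial embedding are assumptions made for the example.

```python
from typing import Callable, Dict, List

Vector = List[float]


def embedding_for(word: str,
                  task_trained: Dict[str, Vector],
                  initial: Dict[str, Vector],
                  mapper: Callable[[Vector], Vector],
                  unknown: Vector) -> Vector:
    """Choose the embedding the parser sees for `word` at test time."""
    if word in task_trained:
        # Seen during parser training: keep the task-trained embedding e^t.
        return task_trained[word]
    if word in initial:
        # Unseen, but an initial embedding e^o exists: use G(e^o).
        return mapper(initial[word])
    # Assumption: words with no initial embedding fall back to 'unknown'.
    return unknown


# Toy usage with a stand-in mapper that adds 1 to every coordinate.
def toy_mapper(v: Vector) -> Vector:
    return [x + 1.0 for x in v]


trained = {"dog": [0.5, 0.5]}
init = {"dog": [0.0, 0.0], "google": [1.0, 2.0]}
unk = [9.0, 9.0]
mapped_google = embedding_for("google", trained, init, toy_mapper, unk)
```

The parser parameters are untouched in this sketch; only the vector returned for an unseen word changes.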

Mapper Architecture
Our proposed mapper is a multi-layer feedforward neural network that takes an initial word embedding as input and outputs a mapped representation of the same dimensionality. In particular, we use a single hidden layer with a hardtanh nonlinearity, so the function $G$ is defined as:
$$G(e^o_i) = W_2 \, \mathrm{hardtanh}(W_1 e^o_i + b_1) + b_2 \qquad (1)$$
where $W_1$ and $W_2$ are parameter matrices and $b_1$ and $b_2$ are bias vectors.
The 'hardtanh' non-linearity is the standard 'hard' version of hyperbolic tangent:
$$\mathrm{hardtanh}(t) = \begin{cases} -1 & \text{if } t < -1 \\ t & \text{if } -1 \le t \le 1 \\ 1 & \text{if } t > 1 \end{cases}$$
In preliminary experiments we compared with other non-linear functions (sigmoid, tanh, and ReLU), as well as with zero and more than one non-linear layers. We found that fewer or more non-linear layers did not improve performance.
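To make the architecture concrete, the following is a small pure-Python sketch of the forward pass of a single-hidden-layer mapper with a hardtanh nonlinearity. The helper names (`hardtanh`, `affine`, `mapper_forward`) and the tiny hand-set weights are illustrative assumptions; the mapper in the experiments uses 100-dimensional inputs and a learned 400-unit hidden layer.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def hardtanh(t: float) -> float:
    """Clip t to the interval [-1, 1]."""
    return max(-1.0, min(1.0, t))


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]


def mapper_forward(x: Vector, W1: Matrix, b1: Vector,
                   W2: Matrix, b2: Vector) -> Vector:
    """Apply W2 * hardtanh(W1 x + b1) + b2."""
    hidden = [hardtanh(h) for h in affine(W1, x, b1)]
    return affine(W2, hidden, b2)


# Hand-set toy parameters: 2-d input, 3 hidden units, 2-d output.
W1 = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b2 = [0.5, -0.5]
out = mapper_forward([0.25, -3.0], W1, b1, W2, b2)
```

In this toy input, two of the three hidden pre-activations (-3.0 and -5.5) are clipped to -1, which shows how hardtanh bounds the hidden representation.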

Loss Function
We use a weighted, multi-loss regression approach, optimizing a weighted sum of mean squared error and mean absolute error:
$$\mathrm{loss}(y, \hat{y}) = (1 - \alpha) \sum_{j=1}^{n} |y_j - \hat{y}_j|^2 + \alpha \sum_{j=1}^{n} |y_j - \hat{y}_j| \qquad (2)$$
where $y = e^t_i$ (the ground truth) and $\hat{y} = e^m_i$ (the prediction) are $n$-dimensional vectors. This multi-loss approach seeks to make both the conditional mean of the predicted representation close to the task-trained representation (via the squared loss) and the conditional median of the predicted representation close to the task-trained one (via the mean absolute loss). A weighted multi-criterion objective allows us to avoid making strong assumptions about the optimal transformation to be learned. We tune the hyperparameter $\alpha$ on domain-specific held-out data. We try to minimize the assumptions in our formulation of the loss, and let the tuning determine the particular mapper configuration that works best for each domain. Strict squared loss and absolute loss are just special cases of this loss function.
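A short sketch of such a weighted two-term loss follows. It assumes that $\alpha$ weights the absolute-error term and $1 - \alpha$ the squared-error term, which matches the later remark that $\alpha = 0$ corresponds to squared error; the function name `multi_loss` and the use of sums rather than means over dimensions are choices made for this example.

```python
from typing import Sequence


def multi_loss(y: Sequence[float], y_hat: Sequence[float],
               alpha: float) -> float:
    """Weighted sum of squared and absolute error between two vectors.

    Assumption: alpha weights the absolute term, (1 - alpha) the squared term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    squared = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    absolute = sum(abs(a - b) for a, b in zip(y, y_hat))
    return (1.0 - alpha) * squared + alpha * absolute


target = [1.0, 2.0]
prediction = [0.0, 4.0]
pure_squared = multi_loss(target, prediction, 0.0)   # squared error only
pure_absolute = multi_loss(target, prediction, 1.0)  # absolute error only
blended = multi_loss(target, prediction, 0.5)
```

The two endpoint values of `alpha` recover the pure squared and pure absolute losses, so tuning `alpha` interpolates between them.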
For optimization, we use batch limited memory BFGS (L-BFGS) (Liu and Nocedal, 1989). In preliminary experiments comparing with stochastic optimization, we found L-BFGS to be more stable to train and easier to check for convergence (as has recently been found in other settings as well (Ngiam et al., 2011)).

Regularization
We use elastic net regularization (Zou and Hastie, 2005), which linearly combines $\ell_1$ and $\ell_2$ penalties on the parameters to control the capacity of the mapper function. This equates to minimizing:
$$F(\theta) = L(\theta) + \lambda_1 \|\theta\|_1 + \frac{\lambda_2}{2} \|\theta\|_2^2 \qquad (3)$$
where $\theta$ is the full set of mapper parameters and $L(\theta)$ is the loss function (Eq. 2 summed over mapper training examples). We tune the hyperparameters of the regularizer and the loss function separately for each task, using a task-specific development set. This gives us additional flexibility to map the embeddings for the domain of interest, especially when the parser training data comes from a particular domain (e.g., newswire) and we want to use the parser on a new domain (e.g., email). We also tried dropout-based regularization (Srivastava et al., 2014) for the non-linear layer but did not see any significant improvement.
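As an illustration of how an elastic net penalty enters the objective, here is a minimal sketch. The function name `elastic_net_objective`, the flat parameter list, and the factor of 1/2 on the squared $\ell_2$ term are assumptions of this example rather than details confirmed by the text.

```python
from typing import Sequence


def elastic_net_objective(data_loss: float, params: Sequence[float],
                          lam1: float, lam2: float) -> float:
    """Return data_loss + lam1 * ||params||_1 + (lam2 / 2) * ||params||_2^2."""
    l1 = sum(abs(p) for p in params)
    l2_squared = sum(p * p for p in params)
    # Assumption: the squared l2 term carries a conventional factor of 1/2.
    return data_loss + lam1 * l1 + 0.5 * lam2 * l2_squared


theta = [1.0, -2.0]
objective = elastic_net_objective(3.0, theta, lam1=0.5, lam2=0.25)
unregularized = elastic_net_objective(3.0, theta, lam1=0.0, lam2=0.0)
```

Setting either constant to zero removes the corresponding penalty, which is why 0 is a natural member of the search grid for each regularization constant.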

Mapper-Parser Thresholds
Certain words in the parser training data $T$ are very infrequent, which may lead to inferior task-specific embeddings $e^t_i$ learned by the parser. We want our mapper function to be learned on high-quality task-trained embeddings. After learning a strong mapping function, we can use it to remap the inferior task-trained embeddings.
We thus consider several frequency thresholds that control which word embeddings to use to train the mapper and which to map at test time. Below are the specific thresholds that we consider:

Mapper-training Threshold ($\tau_t$)
The mapper is trained only on embedding pairs for words seen at least $\tau_t$ times in the training data $T$.
Mapping Threshold ($\tau_m$)
For test-time inference, the mapper will map any word whose count in $T$ is less than $\tau_m$. That is, we discard parser-trained embeddings $e^t_i$ of these infrequent words and use our mapper to map the initial embeddings $e^o_i$ instead.
Parser Threshold ($\tau_p$)
While training the parser, for words that appear fewer than $\tau_p$ times in $T$, the parser replaces them with the 'unknown' embedding. Thus, no parser-trained embeddings will be learned for these words.
In our experiments, we explore a small set of values from this large space of possible threshold combinations (detailed below). We consider only relatively small values for the mapper-training ($\tau_t$) and parser ($\tau_p$) thresholds because as we increase them, the number of training examples for the mapper decreases, making it harder to learn an accurate mapping function.
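The mapper-training and mapping thresholds can be sketched as two filters over training-set word counts. The function names and the toy corpus below are hypothetical, and the parser threshold is not modelled here because it acts inside parser training.

```python
from collections import Counter
from typing import Dict, Iterable, Set


def mapper_training_words(counts: Dict[str, int], tau_t: int) -> Set[str]:
    """Words whose embedding pairs are used to train the mapper."""
    return {w for w, c in counts.items() if c >= tau_t}


def words_to_map(test_words: Iterable[str], counts: Dict[str, int],
                 tau_m: float) -> Set[str]:
    """Test-time words whose embeddings are produced by the mapper."""
    # Unseen words have count 0, so they are mapped whenever tau_m >= 1.
    return {w for w in test_words if counts.get(w, 0) < tau_m}


# Toy training corpus: 'the' occurs 3 times, 'dog' twice, the rest once.
counts = Counter("the dog saw the cat and the dog ran".split())
train_set = mapper_training_words(counts, tau_t=2)
mapped = words_to_map(["the", "cat", "google"], counts, tau_m=2)
mapped_all = words_to_map(["the", "cat", "google"], counts,
                          tau_m=float("inf"))
```

With an infinite mapping threshold, every test word is mapped, including frequent words that already have task-trained embeddings.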
Other work has also found improvements by combining pre-trained, fixed embeddings with task-trained embeddings (Kim, 2014; Paulus et al., 2014). Also relevant are approaches developed specifically to handle large target vocabularies (including many rare words) in neural machine translation systems (Jean et al., 2015; Luong et al., 2015; Chitnis and DeNero, 2015).
Closely related to our approach is that of Tafforeau et al. (2015). They induce embeddings for unseen words by combining the embeddings of the k nearest neighbors. In Sec. 4, we show that our approach outperforms theirs. Also related is the approach taken by Kiros et al. (2015). They learn a linear mapping of the initial embedding space via unregularized linear regression. Our approach differs by considering nonlinear mapping functions, comparing different losses and mapping thresholds, and learning separately tuned mappers for each domain of interest. Moreover, we focus on empirically evaluating the effect of the mapping of unseen words, showing statistically significant improvements on both parsing and a downstream task (sentiment analysis).

Dependency Parser
We use the feed-forward neural network dependency parser of Chen and Manning (2014). In all our experiments (unless stated otherwise), we use the default arc-standard parsing configuration and hyperparameter settings. For evaluation, we compute the percentage of words that get the correct head, reporting both unlabelled attachment score (UAS) and labelled attachment score (LAS). LAS additionally requires the predicted dependency label to be correct. To measure statistical significance, we use a bootstrap test (Efron and Tibshirani, 1986) with 100K samples.
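The two attachment scores can be computed as in the sketch below. The representation of each token as a `(head_index, label)` pair and the function name `attachment_scores` are assumptions of this example, and treebank-specific conventions such as punctuation handling are not modelled.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned tokens."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and aligned")
    # UAS counts tokens with the correct head.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the correct label.
    both_ok = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n


gold_arcs = [(2, "nsubj"), (0, "root"), (2, "dobj"), (3, "det")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "iobj"), (1, "det")]
uas, las = attachment_scores(gold_arcs, pred_arcs)
```

In the toy example, the third token has the right head but the wrong label, so it counts toward UAS only.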

Pre-Trained Word Embeddings
We use the 100-dimensional GloVe word embeddings from Pennington et al. (2014). These were trained on Wikipedia 2014 and the Gigaword v5 corpus and have a vocabulary size of approximately 400,000. 2
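Pre-trained GloVe vectors are distributed as plain text with one word per line followed by its space-separated vector components. A minimal parser for one such line might look like the following; the function name `parse_embedding_line` is hypothetical and the sample line is made up for illustration.

```python
from typing import List, Tuple


def parse_embedding_line(line: str) -> Tuple[str, List[float]]:
    """Split 'word v1 v2 ... vd' into the word and its float vector."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(x) for x in parts[1:]]


# Made-up 3-dimensional line; real GloVe lines here would have 100 values.
word, vector = parse_embedding_line("dog 0.1 -0.2 0.3\n")
```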

Datasets
We consider a number of datasets with varying rates of OOTV words. We define the OOTV rate (or, equivalently, the unseen rate) of a dataset as the percentage of the vocabulary (types) of words occurring in the set that were not seen in training.
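The type-level OOTV rate defined above can be computed as in this sketch. The function name `ootv_rate` and the toy token lists are illustrative assumptions.

```python
from typing import Iterable


def ootv_rate(train_tokens: Iterable[str], eval_tokens: Iterable[str]) -> float:
    """Percentage of evaluation word types that never occur in training."""
    train_vocab = set(train_tokens)
    eval_vocab = set(eval_tokens)
    if not eval_vocab:
        raise ValueError("evaluation set has no tokens")
    unseen = eval_vocab - train_vocab
    return 100.0 * len(unseen) / len(eval_vocab)


# Evaluation types are {a, b, d, e}; d and e are unseen in training.
rate = ootv_rate("a b c".split(), "a a b d e".split())
```

Because the rate is computed over types rather than tokens, the repeated token "a" in the evaluation data is counted once.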

Wall Street Journal (WSJ) and OntoNotes-WSJ
We conduct experiments on the Wall Street Journal portion of the English Penn Treebank dataset (Marcus et al., 1993). We follow the standard splits: sections 2-21 for training, section 22 for validation, and section 23 for testing. We convert the original phrase structure trees into dependency trees using Stanford Basic Dependencies (De Marneffe and Manning, 2008) in the Stanford Dependency Parser. The POS tags are obtained using the Stanford POS tagger (Toutanova et al., 2003) in a 10-fold jackknifing setup on the training data (achieving an accuracy of 96.96%). The OOTV rate in the development and test sets is approximately 2-3%.
We also conduct experiments on the OntoNotes 4.0 dataset (which we denote OntoNotes-WSJ). This dataset contains the same sentences as the WSJ corpus (and we use the same data splits), but has significantly different annotations. The OntoNotes-WSJ training data is used for the Web Treebank test experiments. We perform the same pre-processing steps as for the WSJ dataset.

Web Treebank
We expect our mapper to be most effective when parsing held-out data with many unseen words. This often happens when the held-out data is drawn from a different distribution than the training data. For example, when training a parser on newswire and testing on web data, many errors occur due to differing patterns of syntactic usage and unseen words (Foster et al., 2011; Petrov and McDonald, 2012; Kong et al., 2014; Wang et al., 2014).
2 http://www-nlp.stanford.edu/data/glove.6B.100d.txt.gz; we have also experimented with the downloadable 50-dimensional SENNA embeddings from Collobert et al. (2011) and with word2vec (Mikolov et al., 2013) embeddings that we trained ourselves; in preliminary experiments the GloVe embeddings performed best, so we use them for all experiments below.
We explore this setting by training our parser on OntoNotes-WSJ and testing on the Web Treebank (Petrov and McDonald, 2012), which includes five domains: answers, email, newsgroups, reviews, and weblogs. Each domain contains approximately 2000-4000 manually annotated syntactic parse trees in the OntoNotes 4.0 style. In this case, we adapt the parser trained on the OntoNotes corpus using the small development set for each domain (each Web Treebank development set contains only around 1000-2000 trees, so we use it for validation instead of including it in training). As before, we convert the phrase structure trees to dependency trees using Stanford Basic Dependencies. The parser and mapper hyperparameters were tuned separately on the development set for each domain. The unseen rate is typically 6-10% in the domains of the Web Treebank. We used the Stanford tagger (Toutanova et al., 2003), trained on the OntoNotes training corpus, to part-of-speech tag the Web Treebank corpora. The tagger uses a bidirectional architecture and includes word shape and distributional similarity features. We train a separate mapper for each domain; in this way, we obtain some of the benefits of domain adaptation for each target domain.

Downstream Task: Sentiment Analysis with Dependency Tree LSTMs
We also perform experiments to analyze the effects of embedding mapping on a downstream task, in this case sentiment analysis using the Stanford Sentiment Treebank. We use the dependency tree long short-term memory network (Tree-LSTM) proposed by Tai et al. (2015), simply replacing their default dependency parser with our version that maps unseen words. The dependency parser is trained on the WSJ corpus, and we use the same mapper that was optimized for the WSJ development set, without further hyperparameter tuning for the mapper. For the Tree-LSTM model, we use the same hyperparameter tuning as described in Tai et al. (2015). We use the standard train/development/test splits of 6820/872/1821 sentences for the binary classification task and 8544/1101/2210 for the fine-grained task.

Mapper Settings and Hyperparameters
The initial embeddings given to the mapper are the same as the initial embeddings given to the parser. These are the 100-dimensional GloVe embeddings mentioned above. The output dimensionality of the mapper is also fixed to 100. All model parameters of the mappers are initialized to zero. We set the dimensionality of the non-linear layer to 400 across all experiments. The model parameters are optimized by minimizing the weighted multi-loss objective using L-BFGS with elastic-net regularization (Section 2). The hyperparameters include the relative weight of the two objective terms ($\alpha$) and the regularization constants ($\lambda_1$, $\lambda_2$). For $\alpha$, we search over values in $\{0, 0.1, 0.2, \ldots, 1\}$. For each of $\lambda_1$ and $\lambda_2$, we consider values in $\{10^{-1}, 10^{-2}, \ldots, 10^{-9}, 0\}$. The hyperparameters are tuned via grid search to maximize the UAS on the development set.
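The search space described above can be enumerated directly. In this sketch the grid values follow the text, while `grid_search` and its `evaluate` argument are hypothetical: in practice `evaluate` would train a mapper with the given configuration and return development-set UAS, which is replaced here by a toy scoring function.

```python
from itertools import product
from typing import Callable, Tuple

Config = Tuple[float, float, float]  # (alpha, lambda_1, lambda_2)

alphas = [i / 10 for i in range(11)]                  # 0, 0.1, ..., 1
lambdas = [10.0 ** -k for k in range(1, 10)] + [0.0]  # 1e-1, ..., 1e-9, 0
grid = list(product(alphas, lambdas, lambdas))


def grid_search(evaluate: Callable[[Config], float]) -> Config:
    """Return the configuration with the highest score under `evaluate`."""
    return max(grid, key=evaluate)


# Toy stand-in for development-set UAS, peaking at alpha = 0.5, no penalty.
def toy_score(cfg: Config) -> float:
    alpha, lam1, lam2 = cfg
    return -abs(alpha - 0.5) - lam1 - lam2


best = grid_search(toy_score)
```

The full grid has 11 x 10 x 10 = 1100 configurations per domain, each requiring one mapper training run in a real search.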

Results on WSJ, OntoNotes, and Switchboard
The upper half of Table 1 shows our main test results on WSJ, OntoNotes, and Switchboard, the low-OOTV rate datasets. Due to the small initial OOTV rates (<3%), we only see modest gains of 0.3-0.4% in UAS, with statistical significance at p < 0.05 for WSJ and OntoNotes and p < 0.07 for Switchboard. The initial OOTV rates are cut in half by our mapper, with the remaining unknown words largely being numerical strings and misspellings. When only considering test sentences containing OOTV words (the row labeled "OOTV subset"), the gains are larger (0.5-0.8% UAS at p < 0.05).

Results on Web Treebank
The lower half of Table 1 shows our main test results on the Web Treebank, the high-OOTV rate datasets. As expected, the mapper has a much larger impact when parsing these out-of-domain datasets with high OOTV word rates. The parsing improvements (UAS and LAS) are statistically significant at p < 0.05. On subsets containing at least one OOTV word (that also has an initial embedding), we see an average gain of 1.14% UAS (see row labeled "OOTV subset"). In this case, all improvements are statistically significant at p < 0.02. The relative reduction in OOTV% for the Web Treebank domains is larger than for the WSJ, OntoNotes, or Switchboard datasets: we are able to reduce the OOTV% by 71-95% relative. We also see the intuitive trend that larger relative reductions in OOTV rate correlate with larger accuracy improvements.
Table 1: Results of dependency parsing on various treebanks. Entries of the form A→B give results for parsing without mapped embeddings (A) and with mapped embeddings (B). "OOTV %" entries A→B indicate that A% of the test set vocabulary was unseen in the parser training, and B% remain unknown after mapping the embeddings. "OOTV UAS" refers to UAS measured on the subset of the test set sentences that contain at least one OOTV word, and "#Sents" gives the number of sentences in this subset.
[Figure 2a example sentence: "Wife and I attempted to adopt a dog and was nothing but frustrating"]

Downstream Results
We now report results using the Dependency Tree-LSTM of Tai et al. (2015) for sentiment analysis on the Stanford Sentiment Treebank. We consider both the binary (positive/negative) and fine-grained (very negative, negative, neutral, positive, and very positive) classification tasks. We use the implementation provided by Tai et al. (2015), changing only the dependency parses that are fed to their model. The sentiment dataset contains approximately 25% OOTV words in the training set vocabulary, 5% in the development set vocabulary, and 9% in the test set vocabulary. We map unseen words with the mapper tuned on the WSJ development set and report the results in Table 2. We improve upon the original accuracies in both binary and fine-grained classification. We also reduce the OOTV rate from 25% in the training set vocabulary to about 6%, and from 9% in the test set vocabulary down to 4%.
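For readers comparing OOTV reductions across datasets, the relative reduction can be worked out from the before and after rates. The sketch below applies it to the sentiment figures quoted above; the function name `relative_reduction` is hypothetical, and since the text gives the training-set figure as "about 6%", the resulting percentage is approximate.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


train_reduction = relative_reduction(25.0, 6.0)  # training vocabulary
test_reduction = relative_reduction(9.0, 4.0)    # test vocabulary
```

Under these rounded figures, the training-set OOTV rate falls by roughly three quarters in relative terms and the test-set rate by a little over half.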

Effect of Thresholds
We also experimented with different values for the thresholds described in Section 2. For the mapping threshold $\tau_m$, mapper-training threshold $\tau_t$, and parser threshold $\tau_p$, we consider four settings, denoted $t_1$, $t_3$, $t_5$, and $t_\infty$. Using $\tau_m = \infty$ corresponds to mapping all words at test time, even words that we have seen many times in the training data and learned task-specific embeddings for.
We report the average development set UAS over all Web Treebank domains in Table 3. We see that $t_3$ performs best, though settings $t_1$ and $t_5$ also improve over the baseline. At threshold $t_3$ we have approximately 20,000 examples for training the mapper, while at threshold $t_5$ we have only about 10,000 examples. We see a performance drop at $t_\infty$, so it appears better to directly use the task-specific embeddings for words that appear frequently in the training data. In other results reported in this paper, we used $t_3$ for the Web Treebank test sets and $t_1$ for the rest.

Effect of Weighted Multi-Loss Objective
We analyzed the results when varying α, which balances between the two components of the mapper's multi-loss objective function. We found that, for all domains except Answers, the best results are obtained with some α between 0 and 1. The optimal values outperformed the cases with α = 0 and α = 1 by 0.1-0.3% UAS absolute. However, on the Answers domain, the best performance was achieved with α = 0; i.e., the mapper preferred mean squared error. For other domains, the optimal α tended to be within the range [0.3, 0.7].

Comparison with Related Work
We compare to the approach presented by Tafforeau et al. (2015). They propose to refine embeddings for unseen words based on the relative shifts of their $k$ nearest neighbors in the original embedding space. Specifically, they define "artificial refinement" as:
$$\phi_r(t) = \phi_o(t) + \sum_{k=1}^{K} \alpha_k \left( \phi_r(n_k) - \phi_o(n_k) \right)$$
where $\phi_r(\cdot)$ is the vector in the refined embedding space and $\phi_o(\cdot)$ is the vector in the original embedding space. They define $\alpha_k$ to be proportional to the cosine similarity between the target unseen word ($t$) and neighbor ($n_k$):
$$\alpha_k \propto \cos\left( \phi_o(t), \phi_o(n_k) \right)$$
Table 4 shows the average performance of the models over the development sets of the Web Treebank. On average, our mapper outperforms the k-NN approach ($k = 3$).
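The k-NN baseline can be illustrated with a small sketch: an unseen word's original vector is shifted by a similarity-weighted average of how its nearest seen neighbors moved during training. This is an interpretation under stated assumptions, not the implementation of Tafforeau et al. (2015): the function names are hypothetical, the weights are normalized to sum to one, and all vectors are assumed non-zero with positive similarity to the target.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def knn_refine(target: Vector, original: Dict[str, Vector],
               refined: Dict[str, Vector], k: int = 3) -> Vector:
    """Shift `target` by the weighted shifts of its k nearest seen words."""
    # Rank seen words by similarity to the target in the original space.
    ranked = sorted(((cosine(target, original[w]), w) for w in refined),
                    reverse=True)[:k]
    total = sum(sim for sim, _ in ranked)  # assumed positive
    shift = [0.0] * len(target)
    for sim, w in ranked:
        weight = sim / total  # assumption: weights normalized to sum to 1
        for j in range(len(target)):
            shift[j] += weight * (refined[w][j] - original[w][j])
    return [t + d for t, d in zip(target, shift)]


orig_space = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
refined_space = {"a": [1.5, 0.5], "b": [0.0, 2.0]}
result = knn_refine([1.0, 0.0], orig_space, refined_space, k=1)
```

Unlike the mapper, this procedure has no learned parameters: each unseen word depends only on its local neighborhood in the original space.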

Dependency Parsing Examples
In Figure 2, we show two sentences: an instance where the mapper helps and another where the mapper hurts the parsing performance. In the first sentence (Figure 2a), the parsing model has not seen the word 'attempted' during training. Note that the sentence contains 3 verbs: 'attempted', 'adopt', and 'was'. Even with the POS tags, the parser was unable to get the correct dependency attachment. After mapping, the parser correctly makes 'attempted' the root and gets the correct arcs and the correct tree. The 3 nearest neighbors of 'attempted' in the mapped embedding space are 'attempting', 'tried', and 'attempt'. We also see here that a single unseen word can lead to multiple errors in the parse.
In the second example (Figure 2b), the default model assigns the correct arcs using the POS information even though it has not seen the word 'google'. However, using the mapped representation for 'google', the parser makes errors. The 3-nearest neighbors for 'google' in the mapped space are 'damned', 'look', and 'hash'. We hypothesize that the mapper has mapped this noun instance of 'google' to be closer to verbs instead of nouns, which would explain the incorrect attachment.

Analyzing Mapped Representations
To understand the mapped embedding space, we use t-SNE (Van der Maaten and Hinton, 2008) to visualize a small subset of embeddings. In Figure 3, we plot the initial embeddings, the parser-trained embeddings, and finally the mapped embeddings. We include four unseen words (shown in caps): 'horrible', 'poor', 'marvelous', and 'magnificent'. In Figure 3a and Figure 3b, the embeddings for the unseen words are identical (even though t-SNE places them in different places when producing its projection). In Figure 3c, we observe that the mapper has placed the unseen words within appropriate areas of the space with respect to similarity with the seen words. We contrast this with Figure 3b, in which the unseen words appear to be within a different region of the space from all seen words.

Conclusion
We have described a simple method to resolve unseen words when training supervised models that learn task-specific word embeddings: a feedforward neural network that maps initial embeddings to the task-specific embedding space. We demonstrated significant improvements in dependency parsing accuracy across several domains, as well as improvements on a downstream task. Our approach is simple, effective, and applicable to many other settings, both inside and outside NLP.