Reconstruction of Word Embeddings from Sub-Word Parameters

Pre-trained word embeddings improve the performance of a neural model at the cost of increasing the model size. We propose to benefit from this resource without paying the cost by operating strictly at the sub-lexical level. Our approach is quite simple: before task-specific training, we first optimize sub-word parameters to reconstruct pre-trained word embeddings using various distance measures. We report interesting results on a variety of tasks: word similarity, word analogy, and part-of-speech tagging.


Introduction
Word embeddings trained from a large quantity of unlabeled text are often important for a neural model to reach state-of-the-art performance. They are shown to improve the accuracy of part-of-speech (POS) tagging from 97.13 to 97.55 (Ma and Hovy, 2016), the F1 score of named-entity recognition (NER) from 83.63 to 90.94 (Lample et al., 2016), and the UAS of dependency parsing from 93.1 to 93.9 (Kiperwasser and Goldberg, 2016). On the other hand, the benefit comes at the cost of a bigger model which now stores these embeddings as additional parameters.
In this study, we propose to benefit from this resource without paying the cost by operating strictly at the sub-lexical level. Specifically, we optimize the character-level parameters of the model to reconstruct the word embeddings prior to task-specific training. We frame the problem as distance minimization and consider various metrics suitable for different applications, for example Manhattan distance and negative cosine similarity.
While our approach is simple, the underlying learning problem is a challenging one: the sub-word parameters must reproduce the topology of word embeddings which are not always morphologically coherent (e.g., the meaning of fox does not follow any common morphological pattern). Nonetheless, we observe that the model can still learn useful patterns. We evaluate our approach on a variety of tasks: word similarity, word analogy, and POS tagging. We report small but consistent improvements on these tasks, which indicates that transforming the word topology based on pre-training can be beneficial. Faruqui et al. (2015) "retrofit" embeddings against semantic lexicons such as PPDB or WordNet. Cotterell et al. (2016) leverage existing morphological lexicons to incorporate sub-word components. The aim and scope of our work are clearly different: we are interested in training a strictly sub-lexical model that only operates over characters (which has the benefit of smaller model size) and yet somehow exploits pre-trained word embeddings in the process.

Related Work
Our work is also related to knowledge distillation, which refers to training a smaller "student" network to perform better by learning from a larger "teacher" network. We adopt this terminology and refer to pre-trained word embeddings as the teacher and sub-lexical embeddings as the student. This problem has mostly been considered for classification and framed as matching the probabilities of the student to the probabilities of the teacher (Ba and Caruana, 2014; Li et al., 2014). In contrast, we work directly with representations in Euclidean space.
Reconstruction by Distance Minimization

Let W denote the set of word types. For each word w ∈ W, we assume a pre-trained word embedding x^w ∈ R^d and a representation h^w ∈ R^d computed by sub-word model parameters Θ; we defer the definition of h^w until later. The reconstruction error with respect to a distance function D : R^d × R^d → R is

L_D(Θ) = Σ_{w ∈ W} D(x^w, h^w)    (1)

where x^w is constant and h^w is a function of Θ.
Since we use gradient descent to optimize (1), we can define D(u, v) to be any continuous function measuring the discrepancy between u and v, for example

D_1(u, v) = ||u − v||_1 (Manhattan distance),
D_2(u, v) = ||u − v||_2^2 (squared Euclidean distance),
D_cos(u, v) = −(u · v) / (||u|| ||v||) (negative cosine similarity).

Unlike other common losses in the neural network literature such as negative log likelihood or the hinge loss, L_D has a direct geometric interpretation, illustrated in Figure 1. We first optimize (1) over the sub-word model parameters Θ for a set number of epochs, and then proceed to optimize a task-specific loss L(Θ, Θ′) where Θ′ denotes all other model parameters.
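To make the objective concrete, here is a minimal numpy sketch (illustrative only, not the paper's implementation; the function names are ours) of the reconstruction loss under the three distances discussed (Manhattan, squared Euclidean, negative cosine):

```python
import numpy as np

def d_manhattan(u, v):
    # D_1: sum of absolute coordinate-wise differences
    return np.abs(u - v).sum()

def d_squared(u, v):
    # D_2: squared Euclidean distance
    return ((u - v) ** 2).sum()

def d_neg_cosine(u, v):
    # negative cosine similarity: penalizes angle, ignores magnitude
    return -np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def reconstruction_loss(X, H, dist):
    # L_D(Theta) = sum_w D(x^w, h^w); X holds the teacher embeddings
    # x^w as rows, H holds the sub-word representations h^w as rows.
    return sum(dist(x, h) for x, h in zip(X, H))
```

In the actual model, H is produced by the character-level network and the gradient of L_D flows back into Θ; numpy here only illustrates the loss values.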

Analysis of a Linear Model
In general, h^w can be a complicated function of Θ, but we can gain some insight by analyzing the simple case of a linear model, which corresponds to the top layer of a neural network. More specifically, we assume the form

h^w_i = θ_i · z^w

where z^w ∈ R^{d′} is fixed and Θ = {θ_1 . . . θ_d} ⊂ R^{d′} is the only parameter to be optimized.

Manhattan distance The error under D_1 decomposes over the d coordinates into independent least absolute deviations (LAD) problems

LAD_i(θ) := Σ_{w ∈ W} |x^w_i − θ · z^w|.

It is well-known that the LAD criterion is robust to outliers. To see this, suppose z^w = (1/d′)1 for all w ∈ W. Then a minimizer of LAD_i(θ) is given analytically by any θ whose coordinates sum to d′ · median({x^w_i : w ∈ W}), where the median resists extreme values (e.g., the median of both {1, 2, 3} and {1, 2, 999} is 2). Thus using Manhattan distance can be useful when teacher word embeddings are noisy or there are occasional exceptions in morphological patterns that are best ignored.
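The robustness claim is easy to check numerically (an illustrative snippet, not part of the paper's experiments): LAD fits the median of the targets while OLS fits the mean, and a single outlier moves only the latter.

```python
import numpy as np

# LAD fits the median; OLS fits the mean. A single outlier (999)
# drags the mean far away but leaves the median at 2.
clean = np.array([1.0, 2.0, 3.0])
noisy = np.array([1.0, 2.0, 999.0])

print(np.median(clean), np.median(noisy))  # 2.0 2.0
print(np.mean(clean), np.mean(noisy))      # 2.0 334.0
```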
Squared error The error under D_2 similarly decomposes into ordinary least squares (OLS) problems

OLS_i(θ) := Σ_{w ∈ W} (x^w_i − θ · z^w)^2.

Thus if the matrix Z ∈ R^{|W|×d′} with the z^w as rows has rank d′, the unique solution is given by θ_i = (Z^T Z)^{−1} Z^T x_i, where x_i ∈ R^{|W|} collects the i-th coordinate of every teacher embedding. Let h^w_i = θ_i · z^w denote the optimal sub-word embedding value. It is well-known that the change in h^w_i caused by removing x^w_i from the dataset is proportional to the residual x^w_i − h^w_i (Davidson et al., 1993). In other words, squared error is sensitive to outliers and may not be as suitable as Manhattan distance for fitting noisy or incoherent word embeddings.
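The closed-form OLS solution can be sketched as follows (hypothetical code: `ols_subword_params` is our name and the data are synthetic):

```python
import numpy as np

def ols_subword_params(Z, x_i):
    # Z: |W| x d' matrix with the fixed sub-word features z^w as rows.
    # x_i: |W| vector holding coordinate i of every teacher embedding.
    # Returns theta_i minimizing sum_w (x^w_i - theta . z^w)^2,
    # i.e. the closed form (Z^T Z)^{-1} Z^T x_i when Z has rank d'.
    theta, *_ = np.linalg.lstsq(Z, x_i, rcond=None)
    return theta

# Synthetic sanity check: noiseless targets recover theta exactly.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 4))
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
theta = ols_subword_params(Z, Z @ true_theta)
```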
Other distance metrics Euclidean distance is geometrically intuitive but less mathematically convenient than squared error, so we choose not to focus on it. The ℓ∞ distance penalizes the dimension with the maximum absolute difference and can be useful if calculating one coordinate at a time is convenient. Finally, negative cosine similarity penalizes the angle between embeddings. This is suitable when we only care about directions and not magnitude, for instance in word similarity where we measure cosine similarities between word embeddings.
There are distance metrics not discussed here that may be appropriate in certain situations. For instance, the KL divergence is a natural (asymmetric) measure if word embeddings are distributions (e.g., over context words). More generally, we can consider the wide class of Bregman divergences (Banerjee et al., 2005).

Sub-Word Architecture
We now describe how we define the word embedding h^w ∈ R^d from sub-word parameters. We use a character-based embedding scheme closely following Lample et al. (2016). We use an LSTM simply as a mapping φ : R^{d_c} × R^d → R^d that takes an input vector x and a state vector h and outputs a new state vector h′ = φ(x, h). See Hochreiter and Schmidhuber (1997) for a detailed description.

Character Model
Let C denote the set of character types. The model parameters Θ associated with this layer are a character embedding e_c ∈ R^{d_c} for each c ∈ C, the parameters of a forward LSTM φ^f and a backward LSTM φ^b, and a final linear layer. Let w(j) ∈ C denote the character of w ∈ W at position j. The model computes h^w ∈ R^d by running the two LSTMs over the character sequence in opposite directions and applying the linear layer to the concatenation of their final states:

f^w_j = φ^f(e_{w(j)}, f^w_{j−1}),  b^w_j = φ^b(e_{w(j)}, b^w_{j+1}),
h^w = W^h [f^w_{|w|} ; b^w_1] + b^h.    (2)

We also experiment with a highway network (Srivastava et al., 2015), which has been shown to be beneficial for image recognition (He et al., 2015) and language modeling. In this case, Θ includes additional parameters W^{highway} ∈ R^{d×d} and b^{highway} ∈ R^d. A new character-level embedding h̃^w is computed as

h̃^w = t ⊙ ReLU(W^{highway} h^w + b^{highway}) + (1 − t) ⊙ h^w    (3)

where the gate t = σ(·) ∈ [0, 1]^d is an element-wise sigmoid computed from h^w and ⊙ denotes element-wise multiplication. This allows the network to skip the nonlinearity by making t_i close to 0. We find that the additional highway network is beneficial in certain cases. We will use either (2) or (3) in our experiments depending on the task.
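The highway combination in (3) can be sketched as follows (a minimal numpy illustration; the separate gate parameters `W_t`, `b_t` and the ReLU nonlinearity are assumptions of this sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(h, W, b, W_t, b_t):
    # t gates a nonlinear transform of h; coordinates where t_i is
    # near 0 simply copy h_i through, skipping the nonlinearity.
    t = sigmoid(W_t @ h + b_t)
    nonlinear = np.maximum(0.0, W @ h + b)  # ReLU transform
    return t * nonlinear + (1.0 - t) * h

d = 4
h = np.ones(d)
# With a strongly negative gate bias, t ~ 0: the nonlinear path
# (which would output 5 here) is ignored and the output is ~ h.
out = highway(h, 5.0 * np.eye(d), np.zeros(d),
              np.zeros((d, d)), -20.0 * np.ones(d))
```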

Experiments
Implementation We implement our models using the DyNet library. We use the Adam optimizer (Kingma and Ba, 2014) and apply dropout at all LSTM layers (Hinton et al., 2012). For POS tagging and parsing, we perform a 5 × 5 grid search over learning rates 0.0001 . . . 0.0005 and dropout rates 0.1 . . . 0.5 and choose the configuration that gives the best performance on the dev set. We use the highway network (3) for word analogy and parsing and (2) for the other tasks. Note that the dimension d of the character-based embedding h^w must match the dimension of the pre-trained word embeddings.
Teacher Word Embeddings We use 100-dimensional word embeddings identical to those used in prior work, computed with a variant of the skip n-gram model. These embeddings have been shown to be effective in various tasks (Lample et al., 2016).

Word Similarity and Analogy
Data For word similarity, we use three public datasets: WordSim-353, MEN, and Stanford Rare Word, which contain 353, 3000, and 2034 word pairs, respectively, annotated with similarity scores. The evaluation is conducted by computing the cosine of the angle between each word pair (w_1, w_2) under the model (2),

cos(θ^{w_1,w_2}) = (h^{w_1} · h^{w_2}) / (||h^{w_1}|| ||h^{w_2}||)    (4)

and computing Spearman's correlation coefficient with the human scores. We report the average correlation across these datasets. For word analogy, we use the 8000 syntactic analogy questions from the dataset of Mikolov et al. (2013b) and the 8869 semantic analogy questions from the dataset of Mikolov et al. (2013a). We use the multiplicative technique of Levy and Goldberg (2014) for answering analogy questions.

Table 1: Effect of reconstruction on word similarity: the teacher word embeddings obtain score 0.50.
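The similarity evaluation can be sketched as follows (illustrative code; the simple rank computation assumes no tied scores, unlike a production Spearman implementation):

```python
import numpy as np

def cosine(u, v):
    # Equation-(4)-style similarity between two word representations.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the ranks.
    # (argsort of argsort yields ranks; assumes no ties.)
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb)))

human = np.array([0.1, 0.5, 0.9])  # annotated similarity scores
model = np.array([0.2, 0.3, 0.8])  # model cosine scores per pair
rho = spearman(human, model)       # same ranking, so rho = 1.0
```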
Result Table 1 shows word similarity scores for different numbers of reconstruction training epochs. The teacher word embeddings obtain 0.50. The sub-word model improves performance from the initial score of 0.03 up to 0.16. In particular, the negative cosine distance metric, which directly optimizes the relevant quantity (4), is consistently the best performing. Table 2 shows the accuracy on the syntactic and semantic analogy datasets. An interesting finding in our experiments is that for syntactic analogy, a randomly initialized character-based model outperforms the pre-trained embeddings, and thus reconstruction only decreases performance. We suspect that this is because many of the syntactic regularities are already captured by the architecture. Many questions involve only a simplistic transformation, for instance appending -r in wise : wiser ∼ free : x. The model correctly answers such questions simply by following its architecture, though it is unable to answer less regular questions (e.g., see : saw ∼ keep : x).
Semantic analogy questions have no such morphological regularities (e.g., Athens : Greece ∼ Havana : x) and are challenging for sub-lexical models. Nonetheless, the model is able to make a minor improvement in accuracy.

POS Tagging
We perform POS tagging on the Penn WSJ treebank with 45 tags using a BiLSTM model described in Lample et al. (2016). Given a vector sequence (v^{w_1} . . . v^{w_n}) corresponding to a sentence (w_1 . . . w_n) ∈ W^n, the BiLSTM model produces feature vectors (h_1 . . . h_n). We adhere to the simplest approach of making a local prediction at each position i by a feedforward network on h_i,

p(t | h_i) = softmax_t(W^{tag} h_i + b^{tag}),

where W^{tag} and b^{tag} are additional parameters. The model is trained by optimizing the log likelihood. We consider the following choices of v^w:
• FULL: v^w = e^w ⊕ h^w uses both the word-level lookup parameter e^w and the character-level embedding h^w (2).
• FULL+EMB: Same as FULL but the lookup parameters e^w are initialized with pre-trained word embeddings.
• CHAR: v^w = h^w uses characters only.
• CHAR(D): Same as CHAR but first optimized for 10 epochs to reconstruct the pre-trained word embeddings with distance metric D.

Table 3 shows the accuracy of these models. We see that pre-trained word embeddings boost the performance of FULL from 97.20 to 97.32. When we use the strictly character-based model CHAR without reconstruction, the performance drops to 96.93. But with reconstruction, the model recovers some of the lost accuracy. In particular, reconstructing with the Manhattan distance metric gives the largest improvement and yields 97.17.

Table 4 shows examples of nearest neighbors. For each example, the first row corresponds to the teacher, the second to the student (3) at random initialization, and the third to the student optimized for 10 epochs using D_1. The student embeddings at random initialization are already capable of capturing morphological regularities such as -ful and -ing. With reconstruction, there is a subtle change in the topology. For instance, the nearest neighbors of beautiful change from baleful and bagful to bountiful and peaceful. For Springfield, the nearest neighbors change from unrelated words such as Spanish-ruled to other proper nouns such as Stubblefield.

Table 4: Nearest neighbor examples: for each word, the three rows respectively show its nearest neighbors using pre-trained word embeddings, student embeddings at random initialization (3), and student embeddings optimized for 10 epochs using D_1.

beautiful
  teacher:  wonderful, prettiest, gorgeous, smartest, jolly, famous, sensual
  random:   baleful, bagful, basketful, bountiful, boastful, bashful, behavioural
  D_1:      bountiful, peaceful, disdainful, perpetual, primaeval, successul, purposeful

amazing
  teacher:  incredible, wonderful, remarkable, terrific, marvellous, astonishing, unbelievable
  random:   awaking, arming, aging, awakening, angling, agonizing, among
  D_1:      arousing, amusing, awarding, applauding, allaying, awaking, assaying

Springfield
  teacher:  Glendale, Kennesaw, Gainesville, Lynchburg, Youngstown, Kutztown, Harrisburg
  random:   Spanish-ruled, Serbian-held, Serbian-led, Spangled, Serbian-controlled, Schofield, Sharif-led
  D_1:      Stubblefield, Smithfield, Stansfield, Butterfield, Littlefield, Bitterfeld, Sinfield
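The local prediction step used for tagging can be sketched as follows (hypothetical code; the toy tag inventory and parameter shapes are our assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_tag(h_i, W, b, tags):
    # Local prediction at position i: a feedforward layer on the
    # BiLSTM feature vector h_i followed by a softmax over tags.
    return tags[int(np.argmax(softmax(W @ h_i + b)))]

tags = ["DT", "NN", "VB"]   # toy tag inventory (the WSJ set has 45)
h_i = np.array([1.0, 0.0])  # toy BiLSTM feature vector
W = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [1.0, 0.0]])
b = np.zeros(3)
```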

Conclusion
We have presented a simple method for a sub-lexical model to leverage pre-trained word embeddings. We have shown that by reconstructing the embeddings before task-specific training, the model can improve over random initialization on a variety of tasks. The reconstruction task is a challenging learning problem; while our model learns useful patterns, it is far from perfect. An important future direction is to improve reconstruction with other choices of architecture.