Reusing Weights in Subword-Aware Neural Language Models

We propose several ways of reusing subword embeddings and other weights in subword-aware neural language models. The proposed techniques do not benefit a competitive character-aware model, but some of them improve the performance of syllable- and morpheme-aware models while showing significant reductions in model sizes. We discover a simple hands-on principle: in a multi-layer input embedding model, layers should be tied consecutively bottom-up if reused at output. Our best morpheme-aware model with properly reused weights beats the competitive word-level model by a large margin across multiple languages and has 20%-87% fewer parameters.


Introduction
A statistical language model (LM) is a model which assigns a probability to a sequence of words. It is used in speech recognition, machine translation, part-of-speech tagging, information retrieval and other applications. Data sparsity is a major problem in building traditional n-gram language models, which assume that the probability of a word only depends on the previous n words. To deal with potentially severe problems when confronted with any n-grams that have not explicitly been seen before, some form of smoothing is necessary.
Recent progress in statistical language modeling is connected with neural language models (NLM), which tackle the data sparsity problem by representing words as vectors. Typically this is done twice: at input (to embed the current word of a sequence into a vector space) and at output (to embed candidates for the next word of a sequence). Especially successful are the models in which the architecture of the neural network between input and output is recurrent (Mikolov et al., 2010), which we refer to as recurrent neural network language models (RNNLM).
Tying input and output word embeddings in word-level RNNLMs is a regularization technique that was introduced early on (Bengio et al., 2001; Mnih and Hinton, 2007) but has come into wide use only relatively recently. There is empirical evidence (Press and Wolf, 2017) as well as theoretical justification (Inan et al., 2017) that this simple trick improves language modeling quality while cutting the total number of trainable parameters almost in half, since most of the parameters sit in the embedding matrices. Unfortunately, the technique is not directly applicable to subword-aware neural language models, as they receive subwords at input and return words at output. This raises the following questions: Is it possible to reuse embeddings and other parameters in subword-aware neural language models? Would it benefit language modeling quality? We experimented with different subword units, embedding models, and ways of reusing parameters, and our answer to both questions is as follows: there are several ways to reuse weights in subword-aware neural language models; none of them improve a competitive character-aware model, but some of them do benefit syllable- and morpheme-aware models while giving significant reductions in model sizes. A simple morpheme-aware model that sums morpheme embeddings of a word benefits most from appropriate weight tying, showing a significant gain over the competitive word-level baseline across different languages and data set sizes. Another contribution of this paper is the discovery of a hands-on principle: in a multi-layer input embedding model, layers should be tied consecutively bottom-up if reused at output.
The source code for the morpheme-aware model is available at https://github.com/zh3nis/morph-sum.

Related Work
Subword-aware NLM: There has been a large number of publications in the last 2-3 years on subword-level and subword-aware NLMs, especially for the cases when the subwords are characters (Ling et al., 2015; Kim et al., 2016; Verwimp et al., 2017) or morphemes (Botha and Blunsom, 2014; Qiu et al., 2014; Cotterell and Schütze, 2015). Less work has been done on syllable-level or syllable-aware NLMs (Mikolov et al., 2012; Assylbekov et al., 2017; Yu et al., 2017). For a thorough and up-to-date review of previous work on subword-aware neural language modeling we refer the reader to the paper by Vania and Lopez (2017), where the authors systematically compare different subword units (characters, character trigrams, BPE, morphs/morphemes) and different representation models (CNN, Bi-LSTM, summation) on languages with various morphological typologies.
Tying weights in NLM: Reusing embeddings in word-level neural language models is a technique which was used earlier (Bengio et al., 2001; Mnih and Hinton, 2007) and studied in more detail recently (Inan et al., 2017; Press and Wolf, 2017). However, not much work has been done on reusing parameters in subword-aware or subword-level language models. Jozefowicz et al. (2016) reused the CharCNN architecture of Kim et al. (2016) to dynamically generate softmax word embeddings without sharing parameters with the input word-embedding sub-network. They managed to significantly reduce the total number of parameters for large models trained on a huge dataset in English (1B tokens) with a large vocabulary (800K word types), at the expense of deteriorated performance. Labeau and Allauzen (2017) used a similar approach to augment the output word representations with subword-based embeddings. They experimented with characters and morphological decompositions, and tried different compositional models (CNN, Bi-LSTM, concatenation) on a Czech dataset consisting of 4.7M tokens. They did not tie weights between input and output representations, since their preliminary experiments with tied weights gave worse results.
Our approach differs in the following aspects: we focus on ways to reuse weights at output, seek both model size reduction and performance improvement in subword-aware language models, try different subword units (characters, syllables, and morphemes), and evaluate on small (1M-2M tokens) and medium (17M-51M tokens) data sets across multiple languages.

Recurrent Neural Language Model
Let W be a finite vocabulary of words. We assume that words have already been converted into indices. Let E^in_W ∈ R^{|W|×d_W} be an input embedding matrix for words, i.e., a matrix in which the w-th row (denoted w) corresponds to an embedding of the word w ∈ W.
Based on word embeddings w_{1:k} = w_1, ..., w_k for a sequence of words w_{1:k}, a typical word-level RNN language model produces a sequence of states h_{1:k} according to

  h_t = RNN(w_t, h_{t−1}),  h_0 = 0.   (1)

The last state h_k is assumed to contain information on the whole sequence w_{1:k} and is further used for predicting the next word w_{k+1} of a sequence according to the probability distribution

  Pr(w_{k+1} | w_{1:k}) = softmax(h_k (E^out_W)^T + b),   (2)

where E^out_W ∈ R^{|W|×d_LM} is an output embedding matrix, b ∈ R^{|W|} is a bias term, and d_LM is the state size of the RNN.

Subword-based word embeddings: One of the more recent advancements in neural language modeling has to do with segmenting words at input into subword units (such as characters, syllables, morphemes, etc.) and composing each word's embedding from the embeddings of its subwords. Formally, let S be a finite vocabulary of subwords, and let E^in_S ∈ R^{|S|×d_S} be an input embedding matrix for subwords. Any word w ∈ W is a sequence of its subwords, (s_1, s_2, ..., s_{n_w}) = σ(w), and hence can be represented as a sequence of the corresponding subword vectors:

  [s_1, s_2, ..., s_{n_w}].   (3)

A subword-based word embedding model E(·; E^in_S, Θ^in) with parameters (E^in_S, Θ^in) constructs a word vector x from the sequence of subword vectors (3),

  x = E([s_1, ..., s_{n_w}]; E^in_S, Θ^in),   (4)

which is then fed into the RNNLM (1) instead of a plain embedding w. The additional parameters Θ^in correspond to the way the embedding model constructs the word vector: for instance, in the CharCNN model of Kim et al. (2016), Θ^in are the weights of the convolutional and highway layers.

Reusing word embeddings: Another recent technique in word-level neural language modeling is tying input and output word embeddings, E^out_W = E^in_W, under the assumption that d_W = d_LM. Although useful for word-level language modeling (Press and Wolf, 2017; Inan et al., 2017), this regularization technique is not directly applicable to subword-aware language models, as they receive subword embeddings at input and return word embeddings at output.
In the next section we describe a simple technique to allow reusing subword embeddings E in S as well as other parameters Θ in in a subword-aware RNNLM.
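As a concrete illustration of equations (3)-(4), the sketch below (NumPy only; the toy segmenter and all names are ours, not from the paper) composes a word vector by summing subword embeddings, in the spirit of the summation-based models discussed later:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy subword vocabulary S and input embedding matrix E^in_S of shape (|S|, d_S).
subwords = {"un": 0, "break": 1, "able": 2}
d_S = 4
E_S_in = rng.normal(size=(len(subwords), d_S))

def sigma(word):
    """Hypothetical segmenter sigma(w): returns the subword sequence of a word."""
    return {"unbreakable": ["un", "break", "able"]}[word]

def embed_word(word):
    """Compose a word vector x from the subword vectors (here: by summation)."""
    ids = [subwords[s] for s in sigma(word)]
    return E_S_in[ids].sum(axis=0)  # x lives in R^{d_S}

x = embed_word("unbreakable")
```

The resulting x would then replace the plain word embedding w as input to the RNNLM in (1).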

Reusing Weights
Let E^out_S be an output embedding matrix for subwords, and let us modify the softmax layer (2) so that it utilizes E^out_S instead of the word embedding matrix E^out_W. The idea is fairly straightforward: we reuse the embedding model (4) to construct a new word embedding matrix Ê^out_W, whose row for a word w is

  ŵ = E(σ(w); E^out_S, Θ^out),   (5)

and use Ê^out_W instead of E^out_W in the softmax layer (2). Such a modification of the softmax layer will be referred to as subword-based softmax. The overall architecture of a subword-aware RNNLM with subword-based softmax is given in Figure 1. Such a model allows several options for reusing embeddings and weights, which are discussed below.
• Reusing neither subword embeddings nor embedding model weights: as was shown by Jozefowicz et al. (2016), this can significantly reduce the total number of parameters for large models trained on huge datasets (1B tokens) with large vocabularies (800K word types). However, we do not expect significant reductions on the smaller data sets (1M-2M tokens) with smaller vocabularies (10K-30K word types) which we use in our main experiments.
• Reusing subword embeddings (RE) can be done by setting E^out_S = E^in_S in (5). This gives a significant reduction in model size for models with |E^in_S| ≫ |Θ^in|, such as the morpheme-aware model of Botha and Blunsom (2014).
• Reusing weights of the embedding model (RW) can be done by setting Θ^out = Θ^in. Unlike the previous option, this significantly reduces the sizes of models with |E^in_S| ≪ |Θ^in|, such as the character-aware model of Kim et al. (2016).
• Reusing both subword embeddings and weights of the embedding model (RE+RW) can be done by setting E^out_S = E^in_S and Θ^out = Θ^in simultaneously in (5). This should significantly reduce the number of trainable parameters in any subword-aware model. Here we use exactly the same word representations both at input and at output, so this option corresponds to reusing plain word embeddings in pure word-level language models.
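The four options can be made concrete with a small parameter-container sketch (the helper and its names are ours, not from the paper); tying is modeled by aliasing the same NumPy array, so tied parameters contribute to the model size only once:

```python
import numpy as np

def make_params(num_subwords=100, d_S=16, theta_shape=(16, 16),
                reuse_emb=False, reuse_weights=False):
    """Build input/output embedding parameters with optional tying.

    reuse_emb     -> RE:    E^out_S is E^in_S (the very same array object)
    reuse_weights -> RW:    Theta^out is Theta^in
    both          -> RE+RW
    """
    E_in = np.zeros((num_subwords, d_S))
    Theta_in = np.zeros(theta_shape)
    E_out = E_in if reuse_emb else np.zeros_like(E_in)
    Theta_out = Theta_in if reuse_weights else np.zeros_like(Theta_in)
    return {"E_in": E_in, "Theta_in": Theta_in,
            "E_out": E_out, "Theta_out": Theta_out}

p = make_params(reuse_emb=True, reuse_weights=True)  # the RE+RW option
assert p["E_out"] is p["E_in"] and p["Theta_out"] is p["Theta_in"]
```

In a real framework the same effect is achieved by pointing both sub-networks at one shared parameter tensor.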

Experimental Setup
Data sets: All models are trained and evaluated on the PTB (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) data sets.
Subword-based embedding models: We experiment with existing representational models which have previously proven effective for language modeling.
• CharCNN (Kim et al., 2016) is a character-aware convolutional model, which performs on par with the 2014-2015 state-of-the-art word-level LSTM model (Zaremba et al., 2014) despite having 60% fewer parameters.
• SylConcat is a simple concatenation of syllable embeddings suggested by Assylbekov et al. (2017), which underperforms CharCNN but has fewer parameters and is trained faster.
• MorphSum is a summation of morpheme embeddings, which is similar to the approach of Botha and Blunsom (2014) with one important difference: the embedding of the word itself is not included in the sum. We do this because the other models do not utilize word embeddings.
In all subword-aware language models we inject a stack of two highway layers (Srivastava et al., 2015) right before the word-level RNNLM, as done by Kim et al. (2016); the nonlinear activation in each of these highway layers is a ReLU. The highway layer size is denoted by d_HW.
Word-level RNNLM: There is a large variety of RNN cells to choose from in (1). To make our results directly comparable to the previous work of Inan et al. (2017) and Press and Wolf (2017) on reusing word embeddings, we select a rather conventional architecture: a stack of two LSTM cells (Hochreiter and Schmidhuber, 1997).
Hyperparameters: We experiment with two configurations for the state size d_LM of the word-level RNNLM: 200 (small models) and 650 (medium-sized models). In what follows, values outside brackets correspond to small models, and values within brackets correspond to medium models.
• CharCNN: We use the same hyperparameters as in the work of Kim et al. (2016), where "large model" stands for what we call "medium-sized model". Optimization method is guided by previous work (Zaremba et al., 2014; Gal and Ghahramani, 2016) on word-level language modeling with LSTMs. See Appendix A for details.
Syllabification and morphological segmentation: True syllabification of a word requires its grapheme-to-phoneme conversion and then its splitting into syllables based on some rules. True morphological segmentation requires rather expensive morphological analysis and disambiguation tools. Since these are not always available for under-resourced languages, we decided to utilize Liang's widely used hyphenation algorithm (Liang, 1983) and an unsupervised morphological segmentation tool, Morfessor 2.0 (Virpioja et al., 2013), as approximations to syllabification and morphological segmentation, respectively. We use the default configuration of Morfessor 2.0. Syllable and morpheme vocabulary sizes for both PTB and WikiText-2 are reported in Table 1.
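The vocabulary-building step can be sketched as follows; the toy vowel-group splitter below is a stand-in of our own for Liang's hyphenation algorithm and Morfessor 2.0 (both external tools), not the actual segmenters used:

```python
import re

def syllabify(word):
    """Toy stand-in for a hyphenation-based syllabifier (NOT Liang's algorithm):
    greedily cuts after each vowel group, attaching trailing consonants."""
    return re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]*$)?", word) or [word]

def build_subword_vocab(words, segment):
    """Collect the subword vocabulary S induced by a segmenter sigma."""
    vocab = set()
    for w in words:
        vocab.update(segment(w))
    return sorted(vocab)

words = ["model", "modeling", "language"]
S = build_subword_vocab(words, syllabify)
# Every word is recoverable as the concatenation of its subword units.
assert all("".join(syllabify(w)) == w for w in words)
```

Swapping in a real segmenter only changes `syllabify`; the vocabulary construction stays the same.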

Results
In order to investigate the extent to which each of our proposed options benefits the language modeling task, we evaluate all four modifications (no reusing, RE, RW, RE+RW) for each subwordaware model against their original versions and word-level baselines. The results of evaluation are given in Table 2. We have both negative and positive findings which are summarized below.
Negative results:
• The 'no reusing' and RW options should never be applied in subword-aware language models, as they deteriorate performance.

Table 2: Results. The pure word-level models and the original versions of subword-aware models (with regular softmax) serve as baselines. Reusing the input embedding architecture at output in CharCNN leads to prohibitively slow models when trained on WikiText-2 (≈800 tokens/sec on an NVIDIA Titan X Pascal); we therefore abandoned evaluation of these configurations.
• The RE and RE+RW configurations benefit MorphSum, making it not only better than its original version but also better than all other models, and significantly smaller than the word-level model with reused embeddings.
In what follows we analyze the obtained results.

CharCNN is biased towards surface form
We hypothesize that the reason CharCNN does not benefit from tied weights is that a CNN over character embeddings is an excessively flexible model, which learns to adapt to surface form more than to semantics. To validate this hypothesis we pick several words from the English PTB vocabulary (the same words as Kim et al. (2016)) and consider their nearest neighbors under cosine similarity, as produced by the medium-sized models (with the regular softmax layer) at input (Table 3). As we can see from the examples, the CharCNN model is somewhat more biased towards surface forms at input than SylConcat and MorphSum. When CharCNN is reused to generate a softmax embedding matrix, this bias is propagated to the output embeddings as well (Table 3).

Tying weights bottom-up
From Table 2 one can notice that tying weights without tying subword embeddings (RW) always results in worse performance than tying both weights and embeddings (RE+RW). Recall that the subword embedding lookup is performed before the weights of the subword-aware embedding model are applied (see Figure 1). This leads us to the following conjecture. Let Θ^in_0 = E^in_S, Θ^in_1, Θ^in_2, ..., Θ^in_n be the parameters of the consecutive layers of a subword-aware input embedding model (4), i.e.

  x = x^(n) = f_n(x^(n−1); Θ^in_n), ..., x^(1) = f_1(x^(0); Θ^in_1),

where x^(0) is the sequence of subword vectors looked up in E^in_S, and let Θ^out_0 = E^out_S, Θ^out_1, Θ^out_2, ..., Θ^out_n be the parameters of the consecutive layers of the subword-aware embedding model used to generate the output projection matrix (5). Let A be a subword-aware neural language model in which the first (j+1) layers of the input and output embedding sub-networks have tied weights, ∀i = 0, ..., j: Θ^in_i = Θ^out_i, and let B be a model in which the same higher layers are tied but at least one layer below the (j+1)-th layer has untied weights, ∃i ≤ j: Θ^in_i ≠ Θ^out_i. We conjecture that A will perform at least as well as B.

Table 3: Nearest neighbors based on cosine similarity. We underline character ngrams in words which are close to the given word orthographically rather than semantically. The pyphen syllabifier, which is used in SylConcat, failed to segment the word 'looooook' into syllables, and therefore its neighbors are not available.

To test this conjecture empirically, we conduct the following experiments: in all three embedding models (CharCNN, SylConcat, and MorphSum) we reuse different combinations of layers. If an embedding model has n layers, there are 2^n ways to reuse them, as each layer can either be tied or untied at output. However, there are two particular configurations for each of the embedding models that do not interest us: (i) when none of the layers is reused, and (ii) when only the very first embedding layer is reused. Hence, for each model we need to check 2^n − 2 configurations. For faster experimentation we evaluate only small-sized models on PTB. The results are reported in Table 4. As we can see, the experiments reject our conjecture in its general form: in SylConcat, leaving an untied first highway layer between tied embedding and second highway layers (denote this HW2+Emb) turned out to be slightly better than tying all three layers (HW2+HW1+Emb). Recall that a highway layer is a weighted average of nonlinear and identity transformations of the incoming vector,

  y = t ⊙ g(xA + b) + (1 − t) ⊙ x,

where t = σ(xW + c) is a transform gate, g is the nonlinear activation, A, W, b, and c are trainable parameters, and ⊙ is the element-wise multiplication operator. To find out why leaving an untied highway below a tied one is beneficial in SylConcat, we compare the distributions of the transform gate values t from the first highway layers of both configurations, HW2+Emb and HW2+HW1+Emb, in SylConcat and MorphSum (Figure 2).
We can see that SylConcat heavily relies on nonlinearity in the first highway layer, while MorphSum does not utilize much of it. This means that in MorphSum the highway is close to an identity operator (t ≈ 0) and does not transform the sum of morpheme vectors much, either at input or at output. Therefore, tying the first highway layer is natural for MorphSum. SylConcat, on the other hand, applies nonlinear transformations to the concatenation of syllable vectors, and hence makes additional preparations of the word vector for the needs of the RNNLM at input and of the softmax prediction at output. These needs differ from each other (as shown in the next subsection). This is why SylConcat benefits from an additional degree of freedom when the first highway is left untied.
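The highway computation discussed above can be written out directly (NumPy sketch; we use a ReLU nonlinearity, as in our models). Note how a strongly negative transform bias c keeps the gate nearly closed, so the layer stays close to the identity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, W, c, A, b):
    """y = t * ReLU(x A + b) + (1 - t) * x, with transform gate t = sigmoid(x W + c)."""
    t = sigmoid(x @ W + c)
    return t * np.maximum(x @ A + b, 0.0) + (1.0 - t) * x

d = 4
x = np.ones(d)
# With c around -2 the transform gate is mostly closed (t = sigmoid(-2), about 0.12),
# so the output stays close to the input, as observed for MorphSum.
y = highway(x, W=np.zeros((d, d)), c=-2.0 * np.ones(d),
            A=np.zeros((d, d)), b=np.zeros(d))
```

This also explains the initialization of the transform bias to values around −2 mentioned in Appendix A.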
Figure 2: Distributions of the transform gate values in the first highway layers of SylConcat and MorphSum. Values corresponding to 'Input' and 'Output' curves come from the HW2+Emb configurations, while those corresponding to 'Tied' curves come from the HW2+HW1+Emb configurations.

Despite not being true in all cases, but being true in many, we believe that the above-mentioned conjecture is still useful. In short, it can be summarized as a practical hands-on rule: layers should be tied consecutively bottom-up, i.e. one should not leave untied layer(s) below a tied one. Keep in mind that this rule does not guarantee a performance increase as more and more layers are tied. It only says that leaving untied weights below tied ones is likely to be worse than not doing so.
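The practical payoff of this rule is a collapse of the search space from exponential to linear in the depth of the embedding sub-network: instead of all 2^n tied/untied masks, only the consecutive bottom-up prefixes need to be checked. A quick sketch (helper names are ours):

```python
from itertools import product

def all_configs(n):
    """All 2^n tied(1)/untied(0) masks over n layers (layer 0 = subword embeddings)."""
    return list(product([0, 1], repeat=n))

def bottom_up_configs(n):
    """Masks allowed by the hands-on rule: a tied prefix followed by untied layers."""
    return [tuple([1] * k + [0] * (n - k)) for k in range(n + 1)]

n = 3  # e.g. an embedding layer plus two highway layers
assert len(all_configs(n)) == 2 ** n           # 8 configurations in total
assert len(bottom_up_configs(n)) == n + 1      # only 4 respect the rule
```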

Difference between input and output embeddings
One can notice from the results of our experiments (Table 4) that having an untied second highway layer above the first one always leads to better performance than when it is tied. This means that there is a benefit in letting word embeddings slightly differ at input and output, i.e. in specializing them for the needs of the RNNLM at input and of the softmax at output. This specialization is quite natural, as input and output representations of words have two different purposes: input representations send a signal to the RNNLM about the current word in a sequence, while output representations are needed to predict the next word given all the preceding words. The difference between input and output word representations is discussed in greater detail by Garten et al. (2015) and Press and Wolf (2017). Here we decided to verify the difference indirectly: we test whether the intrinsic dimensionality of word embeddings significantly differs at input and output. For this, we apply principal component analysis to the word embeddings produced by all models in "no reusing" mode. The results are given in Figure 3, where we can see that the dimensionalities of input and output embeddings differ in the word-level, CharCNN, and SylConcat models, but the difference is less significant in the MorphSum model. Interestingly, in the word-level and MorphSum models the output embeddings have more principal components than the input ones. In CharCNN and SylConcat, however, the results are the other way around. We defer the study of this phenomenon to future work.
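The intrinsic-dimensionality check can be sketched with an SVD-based component count (NumPy; the 90% variance threshold below is our illustrative choice, not necessarily the one used in the paper):

```python
import numpy as np

def n_principal_components(X, var_threshold=0.9):
    """Number of principal components needed to explain var_threshold of variance."""
    Xc = X - X.mean(axis=0)                       # center the embeddings
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)    # cumulative explained variance
    return int(np.searchsorted(ratio, var_threshold) + 1)

rng = np.random.default_rng(0)
# Toy "embedding matrix": 200 vectors that effectively live in a
# 5-dimensional subspace of R^50.
basis = rng.normal(size=(5, 50))
X = rng.normal(size=(200, 5)) @ basis
```

Applying this to input and output embedding matrices of the same model gives the kind of comparison shown in Figure 3.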

CharCNN generalizes better than MorphSum
One may expect larger units to work better than smaller units, but smaller units to generalize better than larger units. This certainly depends on how one defines the generalizability of a language model. If it is the ability to model unseen text with unseen words, then, indeed, character-aware models may perform better than syllable- or morpheme-aware ones. This can be partially seen from Table 3, where the OOV words are better handled by CharCNN in terms of in-vocabulary nearest neighbors. However, to fully validate the above-mentioned expectation we conduct additional experiments: we train two models, CharCNN and MorphSum, on PTB and then evaluate them on the test set of Wikitext-2 (245K tokens, 10K word types). Some words in Wikitext-2 contain characters or morphemes that are not present in PTB, and therefore such words cannot be embedded by CharCNN or MorphSum, respectively. Such words were replaced by the <unk> token, and we call them new OOVs. The results of our experiments are reported in Table 5: CharCNN faces fewer OOVs on unseen text, and thus generalizes better than MorphSum.
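The cross-corpus comparison boils down to the following counting procedure (a schematic of ours, with toy corpora in place of PTB and Wikitext-2, and whole words standing in for Morfessor's morphs):

```python
def count_new_oovs(test_tokens, unit_vocab, segment):
    """Count test tokens containing at least one subword unit unseen in training."""
    return sum(1 for tok in test_tokens
               if any(u not in unit_vocab for u in segment(tok)))

# Toy stand-ins for PTB (train) and the Wikitext-2 test set.
train = ["the", "cat", "sam"]
test = ["the", "mat", "xyzzy"]

# Character units: a small inventory that covers more unseen words.
char_vocab = {c for w in train for c in w}
as_chars = lambda w: list(w)

# "Morpheme" units (toy: each word is a single morph).
morph_vocab = set(train)
as_morphs = lambda w: [w]

char_oovs = count_new_oovs(test, char_vocab, as_chars)     # only 'xyzzy'
morph_oovs = count_new_oovs(test, morph_vocab, as_morphs)  # 'mat' and 'xyzzy'
```

The smaller the unit inventory, the fewer new OOVs appear on unseen text, which is the effect summarized in Table 5.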

Performance on non-English Data
According to Table 2, our best model, MorphSum+RE+RW, outperforms the competitive word-level baseline with tied embeddings, Word+RE (Inan et al., 2017). It is interesting to see whether this advantage extends to non-English languages, which have richer morphology. For this purpose we evaluate both models on small (1M tokens) and medium (17M-51M tokens) data sets in five languages (see corpora statistics in Appendix B). Due to hardware constraints we only train the small models on medium-sized data. We use the same architectures for all languages and do not perform any language-specific tuning of hyperparameters, which are specified in Appendix A.

Table 6: Evaluation on non-English data. MorphSum+RE+RW has significantly fewer parameters than Word+RE (Appendix B). S - small model, M - medium model; D-S - small data, D-M - medium data; FR - French, ES - Spanish, DE - German, CS - Czech, RU - Russian.

The results are provided in Table 6. As one can see, the advantage of the morpheme-aware model over the word-level one is even more pronounced on non-English data. We can also notice that the gain is larger for small data sets. We hypothesize that the advantage of MorphSum+RE+RW over Word+RE diminishes as the type-token ratio (TTR) decreases. A scatterplot of PPL change versus TTR (Figure 4) supports this hypothesis. Moreover, there is a strong correlation between these two quantities, ρ(∆PPL, TTR) = 0.84, i.e. one can predict the mean decrease in PPL from the TTR of a text with a simple linear regression: ∆PPL ≈ 2109 × TTR.
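The TTR-based prediction amounts to the following (reading the fitted coefficient from the regression above as 2109; the toy token stream is ours):

```python
def type_token_ratio(tokens):
    """TTR = number of distinct word types / number of tokens."""
    return len(set(tokens)) / len(tokens)

def predicted_ppl_gain(tokens, coef=2109.0):
    """Predicted mean PPL decrease of MorphSum+RE+RW over Word+RE,
    via the fitted linear regression dPPL ~ coef * TTR."""
    return coef * type_token_ratio(tokens)

tokens = ["a", "rose", "is", "a", "rose", "is", "a", "rose"]  # TTR = 3/8
gain = predicted_ppl_gain(tokens)
```

Larger corpora have lower TTR, which is consistent with the smaller gains we observe on medium-sized data.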

Replacing LSTM with AWD-LSTM
The empirical perplexities in Table 2 are well above the current state of the art on the same datasets (Melis et al., 2018). However, the approach of Melis et al. (2018) requires thousands of evaluations and is feasible only for researchers with access to hundreds of GPUs. Unfortunately, we do not have such access. Moreover, the authors do not disclose the optimal hyperparameters they found, so we could not reproduce their models. There is another state-of-the-art language model, AWD-LSTM (Merity et al., 2018), which has open-source code. We replaced this model's word embedding layer with the MorphSum sub-network and fully reused the morpheme embeddings and other weights of MorphSum at output. We refer to this modification as AWD-LSTM-MorphSum+RE+RW. We trained both models without fine-tuning (due to time constraints) and did not use embedding dropout (Section 4.3 of Merity et al. (2018)) in either model, as it is not obvious how embeddings should be dropped in the case of AWD-LSTM-MorphSum. The results of evaluation on the PTB, Wikitext-2, and non-English datasets are given in Table 7.
Although AWD-LSTM-MorphSum is on par with AWD-LSTM-Word on PTB and is slightly better on Wikitext-2, replacing plain word embeddings with the subword-aware model with appropriately reused parameters is crucial for non-English data. Notice that AWD-LSTM underperforms the LSTM used by us on the Czech dataset (cf. Table 6). We suspect that the hyperparameters of AWD-LSTM in Merity et al. (2018) are thoroughly tuned for PTB and Wikitext-2 and may generalize poorly to other datasets.

Conclusion
There is no single best way to reuse parameters in all subword-aware neural language models: the reusing method should be tailored to each type of subword unit and embedding model. However, instead of testing an exponential (w.r.t. sub-network depth) number of configurations, it is sufficient to check only those where weights are tied consecutively bottom-up.
Despite being similar, input and output embeddings solve different tasks. Thus, fully tying the input and output embedding sub-networks in subword-aware neural language models is worse than letting them be slightly different. This raises the question of whether the same is true for pure word-level models, and we defer its study to future work.
One of our best configurations, a simple morpheme-aware model which sums morpheme embeddings and fully reuses the embedding subnetwork, outperforms the competitive word-level language model while significantly reducing the number of trainable parameters. However, the performance gain diminishes with the increase of training set size.

A Optimization
Training the models involves minimizing the negative log-likelihood over the corpus w_{1:K},

  −∑_{k=1}^{K} log Pr(w_k | w_{1:k−1}) → min,

by truncated BPTT (Werbos, 1990). We backpropagate for 35 time steps using stochastic gradient descent, where the learning rate is initially set to
• 1.0 in small word-level models,
• 0.5 in small and medium CharCNN and medium SylConcat (SS, SS+RW) models,
• 0.7 in all other models,
and start decaying it at a constant rate after a certain epoch: epoch 5 for the small word-level network, epoch 12 for CharCNN, and epoch 10 for all other networks. The decay rate is 0.9. The initial learning rates were tuned as follows: for each model we start with 1.0 and decrease it by 0.1 until there is convergence at the very first epoch. We use a batch size of 20 and train for 70 epochs. Parameters of the models are randomly initialized uniformly in [−0.1, 0.1] and [−0.05, 0.05] for the small and medium networks respectively, except the forget bias of the word-level LSTM, which is initialized to 1, and the transform bias of the highway layers, which is initialized to values around −2. For regularization we use a variant of variational dropout (Gal and Ghahramani, 2016) proposed by Inan et al. (2017). For PTB, the dropout rates are 0.3 and 0.5 for the small and medium models; for Wikitext-2, they are 0.2 and 0.4. We clip the norm of the gradients (normalized by minibatch size) at 5.
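The learning-rate schedule above can be written out explicitly (a helper of ours; epochs are 1-indexed):

```python
def learning_rate(epoch, lr0, decay_start, decay_rate=0.9):
    """Constant lr0 through epoch decay_start, then multiplied by
    decay_rate once per subsequent epoch."""
    if epoch <= decay_start:
        return lr0
    return lr0 * decay_rate ** (epoch - decay_start)

# Small word-level model: initial lr 1.0, decay begins after epoch 5.
assert learning_rate(5, 1.0, 5) == 1.0
assert abs(learning_rate(7, 1.0, 5) - 0.9 ** 2) < 1e-12
```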
For non-English small-sized data sets (Data-S) we use the same hyperparameters as for PTB. To speed up training on non-English medium-sized data (Data-M) we use a batch size of 100 and sampled softmax (Jean et al., 2015) with the number of samples equal to 20% of the vocabulary size (Chen et al., 2016).

B Non-English corpora statistics and model sizes
The non-English small and medium data come from the 2013 ACL Workshop on Machine Translation, with pre-processing per Botha and Blunsom (2014). Corpus statistics are provided in Table 8. Model sizes for Word+RE and MorphSum+RE+RW, which were evaluated on the non-English data sets, are given in Table 9. MorphSum+RE+RW requires 45%-87% fewer parameters than Word+RE.