Improving Low Compute Language Modeling with In-Domain Embedding Initialisation

Many NLP applications, in domains such as biomedical text and technical support, have 10-100 million tokens of in-domain data and limited computational resources for learning from it. How should we train a language model in this scenario? Most language modeling research considers either a small dataset with a closed vocabulary (like the standard 1 million token Penn Treebank), or the whole web with byte-pair encoding. We show that for our target setting in English, initialising and freezing input embeddings using in-domain data can improve language model performance by providing a useful representation of rare words, and this pattern holds across several different domains. In the process, we show that the standard convention of tying input and output embeddings does not improve perplexity when initialising with embeddings trained on in-domain data.


Introduction
Language modeling is an essential part of many NLP applications, including predictive keyboards, speech recognition, and translation. Recent work has focused on (1) small constrained datasets, such as the Penn Treebank (Marcus et al., 1993) and WikiText-103 (Merity et al., 2017b), and (2) vast resources with billions of words used to train enormous models with significant computational requirements (Radford et al., 2019). This leaves a gap: the case where a substantial amount of in-domain data is available but computational power is limited.
We explore how initialising word embeddings using in-domain data can improve language modeling in English. Testing all valid configurations of weight tying, embedding freezing, and initialisation, we find that the standard configuration is not optimal when rare words are present. Instead, the best approach is to initialise with in-domain data, untie the input and output, and freeze the input.
To understand this difference, we run a series of experiments to measure the impact of changing (a) the threshold for replacing rare words with a special symbol; (b) the source of data for initialisation; (c) the amount of training data for the language model; and (d) the hyperparameters for both the baseline and our proposed approach. We find that the improvement comes from improved representation of rare words. These findings are confirmed through experiments on four additional domains, with similar trends.
We also compare our approach to an n-gram language model and a large-scale transformer model. We find that if a large-scale transformer is inappropriate either for computational or modeling reasons, it is best to train an LSTM-based language model with as much data as possible and initialise the embeddings on all available in-domain data.

Proposed Approach
We propose initialising the language model's word embeddings with vectors trained on additional in-domain data. To make this most effective, we make two other key changes to training. First, we prevent the input embeddings from shifting during training. Without this, the embedding space could become inconsistent, as vectors for words seen in training shift while those for words seen only in the additional data stay the same. Second, we do not tie the weights of the input embeddings and the final output layer. To understand the impact of these factors, we train models with every valid combination of weight tying, freezing, and pretraining (note that for frozen output embeddings the bias is not frozen). We experiment with Merity et al. (2017a)'s AWD-LSTM, a high-performing model that can be trained in under a day on a single GPU (without fine-tuning). We train embeddings using GloVe on Gigaword, with embedding size 400 and rare word cutoff 5, the same as in the original AWD-LSTM model and GloVe respectively; all other GloVe hyperparameters were set as specified in the original GloVe paper, and the embeddings were trained using the released code.

For evaluation, we consider two versions of the Penn Treebank. Std is the standard version used in language modeling, with words of frequency less than five converted to UNK, all words lowercased, numbers replaced with a special symbol, and punctuation removed. Rare has the same pre-processing but without replacement of rare words.

Table 1 shows the results, with icons to concisely describe the different configurations. Looking first at the standard evaluation set, we can see the value of pretrained embeddings by considering pairs of configurations where the only difference is whether the embeddings are random or pretrained. Pretrained embeddings are better in all but one case (comparing the fourth-last and second-last rows), and there the difference is only 0.5.
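The initialise-and-freeze step described above can be sketched in PyTorch. This is our own illustration of the technique, not the AWD-LSTM code; the function name, the GloVe text format assumption, and the random fallback scale are assumptions for the sketch.

```python
import torch
import torch.nn as nn

def init_embedding(vocab, glove_path, dim=400):
    """Build an input embedding initialised from pretrained GloVe-format
    vectors and frozen so it cannot shift during LM training.

    vocab: dict mapping word -> row index.
    glove_path: text file with one "word v1 v2 ... vdim" entry per line.
    Words absent from the file keep a small random vector (illustrative choice).
    """
    weight = torch.randn(len(vocab), dim) * 0.1  # fallback for uncovered words
    with open(glove_path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            if word in vocab:
                weight[vocab[word]] = torch.tensor([float(v) for v in values])
    # freeze=True disables gradient updates, keeping vectors for words seen
    # only in the pretraining data consistent with the rest of the space
    return nn.Embedding.from_pretrained(weight, freeze=True)
```

In the untied configuration, the output layer is a separate `nn.Linear` that trains normally while this input matrix stays fixed.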
As for the pretrained input embeddings, keeping all other aspects the same, it is always better to freeze them.
There are also four clear bands of performance in the table: (a) frozen random output embeddings; (b) frozen pretrained output embeddings; (c) frozen random input embeddings; (d) the various remaining configurations. These results show an asymmetry. Freezing the output embeddings consistently leads to poor performance, even when they are pretrained. In contrast, freezing pretrained input embeddings leads to some of the best results. We expected freezing with random initialisation to perform poorly, but the drop is modest for input freezing and dramatic for output freezing. This suggests that the two embedding matrices serve different purposes in the model. The results do support the practice of tying when the input embeddings are random, but the benefit is half as large when they are pretrained.
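Weight tying itself is a one-line mechanism in most frameworks: the output projection shares its matrix with the input embedding. The following simplified model (our own stand-in, not the AWD-LSTM code) shows the tied and untied configurations side by side.

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Minimal LSTM language model illustrating tied vs untied embeddings."""
    def __init__(self, vocab_size, dim, tie_weights):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, dim)   # input embeddings
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.decoder = nn.Linear(dim, vocab_size)      # output matrix + bias
        if tie_weights:
            # One shared (vocab_size x dim) matrix: every step's output
            # gradient also updates the input embeddings.
            self.decoder.weight = self.encoder.weight

    def forward(self, tokens):
        hidden, _ = self.lstm(self.encoder(tokens))
        return self.decoder(hidden)  # logits over the vocabulary
```

Note that even when tied, the decoder bias remains a separate parameter, matching the observation that the output side carries information the input side does not.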
For the dataset with rare words we see mostly the same trends. The exception is the bottom six rows. Once rare words are present, random initialisation of the input embeddings is considerably worse than pretraining (third last row). Again, there is an asymmetry between input and output, with the top five models all using pretrained input embeddings, but only three of them using pretrained output embeddings. Tying is also no longer the best approach, with the top three models not tying. Our proposed approach, using pretrained untied embeddings and freezing the input, has the best results.
The only difference between Std and Rare is the lack of UNKs in Rare. This impacts 5.1% of tokens in the validation set (33% of types). While our pretrained embeddings do not cover all of these rare words, they do cover most: the vocabulary from Gigaword that we build vectors for covers 99.5% of the validation word tokens in Std (98% of word types), and 98.8% of the validation word tokens in Rare (84% of word types).

The script to generate our Rare data from the LDC release is available at: http://jkk.name/emnlp20lm/. (Table 1 icons: dice icon by Andrew Doane from the Noun Project; fire and snowflake icons by Freepik from www.flaticon.com.)
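Coverage statistics like these follow directly from the validation tokens and the pretrained vocabulary. A minimal sketch (our own illustration, not the paper's scripts):

```python
from collections import Counter

def coverage(val_tokens, pretrained_vocab):
    """Return (token coverage %, type coverage %) of a pretrained
    vocabulary over a list of validation tokens."""
    counts = Counter(val_tokens)
    covered_tokens = sum(c for w, c in counts.items() if w in pretrained_vocab)
    covered_types = sum(1 for w in counts if w in pretrained_vocab)
    return (100.0 * covered_tokens / len(val_tokens),
            100.0 * covered_types / len(counts))
```

Token coverage is typically much higher than type coverage because the uncovered words are, by construction, the rarest ones.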

When & Why Does Pretraining Help?
To understand the strengths and limitations of this new approach, we consider a series of experiments, each probing a specific variable. To simulate our target scenario, we use 44 million words of Wall Street Journal data from the North American News Corpus (NANC, Graff, 1995). This provides enough data for pretraining, training, validation, and test sets all in the exact same domain (not even varying the newspaper). We apply similar preprocessing as in the previous section, but break the data down into articles rather than sentences and keep rare words.
We compare the six best configurations from Table 1. In all cases the output embeddings are not frozen, so we leave out that symbol, and we use a single pretrained/random symbol when the input and output embeddings are treated the same way; the exceptions mark pretrained input and random output separately. The six configurations are:

Standard approach.
Standard approach + pretraining.
Our approach, but with random output embeddings and without freezing.
Our approach, but without freezing.
Our approach.
Our approach, but with random output embeddings.
Other Domains Show the Same Pattern. First, we consider varying the domain to make sure this is not an artifact of news data. Table 2 shows results on Covid-19 research, Ubuntu IRC chat (Kummerfeld et al., 2019), Reddit, and Wikipedia, tokenised with either Scispacy (Neumann et al., 2019) or Stanza (Qi et al., 2020). Pretraining consistently helps, while freezing is best on all but Wikipedia. Our approach is consistently either the best or very close to the best.
The Improvement is Due to Rare Words. To probe the impact of rare words, we explore replacing them with UNK (using the same UNK symbol as used in embedding pretraining). We consider four variations, each constructed in two steps. First, we make a list of the words in the original training set and how many times each one occurs. Second, we make modified versions of the training and validation sets, replacing words with UNK if their count in our list is lower than K. For this step, any word that does not appear in our list is treated as having a count of zero. We consider K = 0, 1, 2, and 5. K is 0 for all other experiments in this section, which means that no words are replaced with UNK. When K is 1, 2, or 5, the introduction of UNKs means all words in the validation set are seen during language model training.

Table 3 shows a clear trend: the benefit of our approach grows as more rare words are present (i.e., as K gets smaller). It may seem odd that perplexity is higher when K=1 than when K=0, since we have removed rare words. This is probably because when K is 1 there are UNKs in the validation set but not in the language model training set.

Table 4: Percentage of word types and tokens that occur five times or fewer in each dataset. The last two columns are the percentage of types/tokens in the training set that occur five or fewer times in the pretraining set. For PTB the pretraining set is Gigaword (as used in Table 1).

Table 4 shows statistics about rare words in the datasets. 71-83% of word types in the training sets occur fewer than five times, but most of these appear frequently in the pretraining sets (compare the first column with the second-last column). The same pattern occurs for word tokens. Comparing the statistics for the training set and the pretraining set, the percentage of rare word types is fairly consistent while the percentage of rare tokens consistently goes down.
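The two-step rare-word replacement described above can be sketched as follows (an illustrative implementation; the UNK token string is an assumption):

```python
from collections import Counter

def replace_rare(train, val, k, unk="<unk>"):
    """Replace words whose training-set count is below k with UNK.

    Words absent from the training set count as zero, so with k >= 1
    every word remaining in the validation set was seen during training.
    With k = 0 nothing is replaced.
    """
    counts = Counter(train)  # step 1: word counts from the original training set
    def sub(tokens):         # step 2: apply the threshold to any token list
        return [w if counts[w] >= k else unk for w in tokens]
    return sub(train), sub(val)
```

With k = 1 the training set is unchanged (every training word has count at least one) while unseen validation words become UNK, which is why that setting introduces UNKs into validation only.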
Pretraining Data Needs to be from a Similar Domain. We would expect the effectiveness of pretraining to depend on how similar the pretraining data is to the language modeling data. Table 5 shows results with different embeddings, indicating the number of words used in pretraining. The value of additional data does depend on the domain: Gigaword is also news text and is able to improve performance, while the larger GloVe datasets use Wikipedia and Common Crawl data, which is a poorer match and so does not improve performance. For GloVe we did have to change the embedding dimension from 400 to 300, which may impact performance slightly.
The Effect Persists When Language Model Training Data is Increased. So far we have only used the additional in-domain data for pretraining. In this experiment, we expand the training set for the language model itself. We try two variations: one where the data is an exact domain match (NANC) and one where it is also news, but from different newspapers and a different year (Gigaword). Table 6 shows that as we increase the amount of data, our approach and the variant with random output embeddings continue to do best, but the margin between them and the standard approach shrinks. Note, however, that these results are with hyperparameters tuned for the baseline configuration; with tuning, the 0.7 gap between our proposal and the baseline for 4xNANC widens to 6.6.

Table 6: Expanding the language model training set.

Hyperparameter Tuning Further Improves Results. All of the previous experiments were slightly tipped in favour of the baseline, as we used the hyperparameters from Merity et al. (2017a). We do not have the resources to tune for every condition, so instead we focus on a final set of experiments with the 4xNANC condition from Table 6. We run 37 configurations with randomly sampled hyperparameters, using the same configurations for the baseline and our proposed approach (see the supplementary material for details). Figure 1 shows that our approach is even stronger after tuning, with a score that is 6.6 better than the baseline. Comparing the baseline and tuned hyperparameters, some shifted substantially more than others: the learning rate was halved, word dropout was halved, and the number of layers was increased from 3 to 4. The other parameters shifted by 15-30%.
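Sampling the same random configurations for both systems keeps the comparison fair. A sketch of such a search, where the parameter names and ranges are illustrative assumptions rather than the paper's actual search space:

```python
import random

# Illustrative search space; these ranges are our own, not the paper's.
SPACE = {
    "learning_rate": lambda: 10 ** random.uniform(0.0, 1.5),  # SGD-style lr
    "word_dropout":  lambda: random.uniform(0.0, 0.3),
    "n_layers":      lambda: random.choice([2, 3, 4]),
}

def sample_configs(n, seed=0):
    """Draw n hyperparameter configurations. Seeding makes the draw
    reproducible, so the identical list can be reused for the baseline
    and the proposed approach."""
    random.seed(seed)
    return [{name: draw() for name, draw in SPACE.items()} for _ in range(n)]
```

Each configuration would then be trained once per system, and the best validation perplexity per system reported.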
Test Results Confirm Our Observations. Using the best configuration, we train the baseline and our proposed approach using 8xNANC (the most our GPU could support). We compare to an n-gram language model trained on all of the NANC data (Heafield et al., 2013), and a transformer-based model trained on a massive dataset, GPT-2 (Radford et al., 2019). While GPT-2 cannot be retrained in a low-compute scenario, it can be used, so we compare to GPT-2 without fine-tuning. We evaluate byte-pair encoding (BPE) separately because with BPE tokenisation models have additional information when predicting the second or later piece of a token (Merity, 2019). Table 7 shows that for word-level prediction, our approach improves over the baseline and an n-gram language model. BPE breaks up rare words, leading to no improvement over the baseline; and while we do better than the 112m parameter GPT-2, we do not do as well as the 774m parameter one (both untuned). Overall, this indicates that for users who require word-level scores and have limited computational resources, our approach is an effective way to use additional data when training LSTM language models.

Related Work
Embedding Tying. Tying the input and output matrices has consistently increased performance while reducing the number of model parameters (Press and Wolf, 2017; Inan et al., 2017). The improvement is thought to arise because otherwise only one input embedding is updated at each step, and the gradient has to propagate a long way through the model to reach it. Subsequent work has explored more advanced forms of tying, recognising that the roles of the input and output matrices are not exactly the same (Pappas et al., 2018). This asymmetry has been found in the actual embedding spaces learned and shown to have a negative effect on performance (Gao et al., 2019; Demeter et al., 2020). These observations match the patterns we observe and provide theoretical justification for not tying when possible.
In-Domain Data Pretraining and Freezing. Word vectors are frequently used in downstream tasks, and recent work has shown that their effectiveness depends on domain similarity (Peters et al., 2019; Arora et al., 2020). For language modeling, Kocmi and Bojar (2017) explored random and pretrained embeddings and found improvements, but did not consider tying and freezing. In-domain data is also useful for continuing to train contextual embedding models before fine-tuning (Gu et al., 2020; Gururangan et al., 2020), and for monolingual pretraining in machine translation (Neishi et al., 2017; Qi et al., 2018; Artetxe et al., 2018). This matches our observations, but does not cover the interactions between freezing and tying that we consider.
Handling Rare Words. These remain challenging even for large transformer models (Schick and Schütze, 2020). Recent work has explored copying mechanisms and character based generation (Kawakami et al., 2017), with some success. These ideas are complementary to the results of our work, extending coverage to the open vocabulary case. Due to space and computational constraints we only consider English. For other languages, inflectional morphology and other factors may impact the effectiveness of our approach (Shareghi et al., 2019; Cotterell et al., 2018). Our work is also complementary to concurrent work on producing rare words as output (Pappas and Mulcaire, 2020).
Language Model Types. We focus on a single model type for computational budget reasons. We chose an LSTM because while transformer-based models such as GPT-2 now dominate transfer learning, LSTMs continue to be competitive in language modeling (Du et al., 2020; Li et al., 2020; Melis et al., 2018; Merity et al., 2017a). Our ideas are orthogonal to this prior work and our findings may apply to transformers as well, but confirming that would require additional experiments.

Conclusion
Initialising embeddings with vectors trained on in-domain data can improve performance by providing better representations for rare words. This effect persists even as more in-domain data is used to train the language model. Our work also suggests that standard model components like embedding tying should be retested as we continue to explore the space of language modeling.