Learning to Create and Reuse Words in Open-Vocabulary Neural Language Modeling

Fixed-vocabulary language models fail to account for one of the most characteristic statistical facts of natural language: the frequent creation and reuse of new word types. Although character-level language models offer a partial solution in that they can create word types not attested in the training corpus, they do not capture the “bursty” distribution of such words. In this paper, we augment a hierarchical LSTM language model that generates sequences of word tokens character by character with a caching mechanism that learns to reuse previously generated words. To validate our model we construct a new open-vocabulary language modeling corpus (the Multilingual Wikipedia Corpus; MWC) from comparable Wikipedia articles in 7 typologically diverse languages and demonstrate the effectiveness of our model across this range of languages.


Introduction
Language modeling is an important problem in natural language processing with many practical applications (translation, speech recognition, spelling autocorrection, etc.). Recent advances in neural networks provide strong representational power to language models with distributed representations and unbounded dependencies based on recurrent neural networks (RNNs). However, most language models operate by generating words by sampling from a closed vocabulary which is composed of the most frequent words in a corpus. Rare tokens are typically replaced by a special token, called the unknown word token, UNK. Although fixed-vocabulary language models have some important practical applications and are appealing models for study, they fail to capture two empirical facts about the distribution of words in natural languages. First, vocabularies keep growing as the number of documents in a corpus grows: new words are constantly being created (Heaps, 1978). Second, rare and newly created words often occur in "bursts", i.e., once a new or rare word has been used once in a document, it is often repeated (Church and Gale, 1995; Church, 2000).
The open-vocabulary problem can be solved by dispensing with word-level models in favor of models that predict sentences as sequences of characters (Sutskever et al., 2011; Chung et al., 2017). Character-based models are quite successful at learning what (new) word forms look like (e.g., they learn a language's orthographic conventions that tell us that sustinated is a plausible English word and bzoxqir is not) and, when based on models that learn long-range dependencies such as RNNs, they can also be good models of how words fit together to form sentences.
However, existing character-sequence models have no explicit mechanism for modeling the fact that once a rare word is used, it is likely to be used again. In this paper, we propose an extension to character-level language models that enables them to reuse previously generated tokens (§2). Our starting point is a hierarchical LSTM that has been previously used for modeling sentences (word by word) in a conversation (Sordoni et al., 2015), except here we model words (character by character) in a sentence. To this model, we add a caching mechanism similar to recent proposals for caching that have been advocated for closed-vocabulary models (Merity et al., 2017; Grave et al., 2017). As word tokens are generated, they are placed in an LRU cache, and at each time step the model decides whether to copy a previously generated word from the cache or to generate it from scratch, character by character. The decision of whether to use the cache or not is a latent variable that is marginalised during learning and inference. In summary, our model has three properties: it creates new words, it accounts for their burstiness using a cache, and, being based on LSTMs over word representations, it can model long-range dependencies.
To evaluate our model, we perform ablation experiments with variants of our model without the cache or hierarchical structure. In addition to standard English data sets (PTB and WikiText-2), we introduce a new multilingual data set, the Multilingual Wikipedia Corpus (MWC), which is constructed from comparable articles from Wikipedia in 7 typologically diverse languages (§3), and show the effectiveness of our model in all languages (§4). By looking at the posterior probabilities of the generation mechanism (language model vs. cache) on held-out data, we find that the cache is used to generate "bursty" word types such as proper names, while numbers and generic content words are generated preferentially from the language model (§5).

Model
In this section, we describe our hierarchical character language model with a word cache. As is typical for RNN language models, our model uses the chain rule to decompose the problem into incremental predictions of the next word conditioned on the history:
$$p(w) = \prod_{t=1}^{|w|} p(w_t \mid w_{<t}).$$
We make two modifications to the traditional RNN language model, which we describe in turn. First, we begin with a cache-less model we call the hierarchical character language model (HCLM; §2.1), which generates words as a sequence of characters and constructs a "word embedding" by encoding a character sequence with an LSTM (Ling et al., 2015). However, like conventional closed-vocabulary, word-based models, it is based on an LSTM that conditions on words represented by fixed-length vectors (footnote 1).
Footnote 1: The HCLM is an adaptation of the hierarchical recurrent encoder-decoder of Sordoni et al. (2015), which was used to model dialog as a sequence of sentences, which are themselves sequences of words. The original model was proposed to compose words into query sequences, but we use it to compose characters into word sequences.
The HCLM has no mechanism to reuse words that it has previously generated, so new forms will only be repeated with very low probability. However, since the HCLM is not merely generating sentences as a sequence of characters, but also segmenting them into words, we may add a word-based cache to which we add words keyed by the hidden state being used to generate them (§2.2). This cache mechanism is similar to the model proposed by Merity et al. (2017).
Notation. Our model assigns probabilities to sequences of words $w = w_1, \ldots, w_{|w|}$, where $|w|$ is the length, and where each word $w_i$ is represented by a sequence of characters $c_i = c_{i,1}, \ldots, c_{i,|c_i|}$ of length $|c_i|$.

Hierarchical Character-level Language Model (HCLM)
This hierarchical model satisfies our linguistic intuition that written language has (at least) two different units, characters and words.
The HCLM consists of four components: three LSTMs (Hochreiter and Schmidhuber, 1997), namely a character encoder, a word-level context encoder, and a character decoder (denoted $\mathrm{LSTM}_{enc}$, $\mathrm{LSTM}_{ctx}$, and $\mathrm{LSTM}_{dec}$, respectively), and a softmax output layer over the character vocabulary. Fig. 1 illustrates an unrolled HCLM.
Suppose the model reads word $w_{t-1}$ and predicts the next word $w_t$. First, the model reads the character sequence representing the word, $w_{t-1} = c_{t-1,1}, \ldots, c_{t-1,|c_{t-1}|}$, where $|c_{t-1}|$ is the length in characters of the word generated at time $t-1$. Each character is represented as a vector, $v_{c_{t-1,1}}, \ldots, v_{c_{t-1,|c_{t-1}|}}$, and fed into the encoder $\mathrm{LSTM}_{enc}$. The final hidden state of the encoder is used as the vector representation of the previously generated word $w_{t-1}$:
$$v_{w_{t-1}} = \mathrm{LSTM}_{enc}(v_{c_{t-1,1}}, \ldots, v_{c_{t-1,|c_{t-1}|}}).$$
Then all the vector representations of words ($v_{w_1}, \ldots, v_{w_{|w|}}$) are processed with the context $\mathrm{LSTM}_{ctx}$. Each hidden state of the context LSTM, $h^{ctx}_t$, is considered a representation of the history of the word sequence.
Finally, the initial state of the decoder LSTM is set to $h^{ctx}_t$, and the decoder reads a vector representation of the start symbol, $v_S$, and generates the next word $w_t$ character by character. To predict the $j$-th character in $w_t$, the decoder LSTM reads vector representations of the previous characters in the word, conditioned on the context vector $h^{ctx}_t$ and a start symbol.
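The character generation probability is defined by a softmax layer applied to the corresponding hidden representation of the decoder LSTM.
Thus, the word generation probability of the HCLM is the product of the character generation probabilities:
$$p_{lm}(w_t \mid w_{<t}) = \prod_{j=1}^{|c_t|} p(c_{t,j} \mid c_{t,<j}, w_{<t}).$$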
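As a small illustration of how a word probability is assembled from character predictions, the sketch below sums per-character log-probabilities. The function name and the toy distributions are hypothetical and only stand in for the decoder's softmax outputs; this is not the authors' implementation.

```python
import math

def word_log_prob(word, char_log_probs):
    """Sum per-character log-probabilities for one word.

    char_log_probs[j] maps each candidate character to its log-probability
    at step j, assumed to be already conditioned on the word history and on
    the characters generated so far (hypothetical stand-in for the decoder).
    """
    if len(char_log_probs) != len(word):
        raise ValueError("need exactly one distribution per character")
    return sum(step[ch] for step, ch in zip(char_log_probs, word))

# Toy two-character alphabet with made-up probabilities.
steps = [
    {"a": math.log(0.5), "b": math.log(0.5)},
    {"a": math.log(0.5), "b": math.log(0.5)},
]
log_p = word_log_prob("ab", steps)
print(math.exp(log_p))
```

Working in log space avoids numerical underflow for long words, since the product of many small probabilities becomes a sum.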

Continuous cache component
The cache component is an external memory structure which stores $K$ elements of recent history. Similarly to the memory structure used in Grave et al. (2017), a word is added to a key-value memory after each generation of $w_t$. The key at position $i \in [1, K]$ is $k_i$ and its value is $m_i$. The memory slot is chosen as follows: if $w_t$ already exists in the memory, its key is updated (discussed below). Otherwise, if the memory is not full, an empty slot is chosen; if it is full, the least recently used slot is overwritten. When writing a new word to memory, the key is the RNN representation that was used to generate the word ($h_t$) and the value is the word itself ($w_t$). In the case when the word already exists in the cache at some position $i$, $k_i$ is updated to be the arithmetic average of $h_t$ and the existing $k_i$.
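The write rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the class name WordCache is hypothetical, keys are plain Python lists rather than tensors, and "least recently used" is interpreted as "least recently written".

```python
from collections import OrderedDict

class WordCache:
    """Key-value word memory with K slots and LRU replacement (sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Maps word (value m_i) -> key vector k_i; oldest entry comes first.
        self.slots = OrderedDict()

    def write(self, word, h):
        if word in self.slots:
            # Word already cached: average the stored key with the new state.
            old = self.slots[word]
            self.slots[word] = [(a + b) / 2.0 for a, b in zip(old, h)]
            self.slots.move_to_end(word)  # mark as most recently used
            return
        if len(self.slots) >= self.capacity:
            # Memory full: overwrite the least recently used slot.
            self.slots.popitem(last=False)
        self.slots[word] = list(h)

cache = WordCache(capacity=2)
cache.write("Lesnar", [1.0, 0.0])
cache.write("the", [0.0, 1.0])
cache.write("Lesnar", [0.0, 0.0])  # key becomes [0.5, 0.0]
cache.write("Gore", [1.0, 1.0])    # evicts "the"
print(list(cache.slots))
```

Averaging keeps a single slot per word type, so the same word never competes with itself for memory.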
To define the copy probability from the cache at time $t$, a distribution over copy sites is defined using the attention mechanism of Bahdanau et al. (2015). To do so, we construct a query vector $r_t$ from the RNN's current hidden state $h_t$,
$$r_t = \tanh(W_q h_t + b_q),$$
then, for each element $i$ of the cache, a 'copy score' $u_{i,t}$ is computed from the key $k_i$ and the query $r_t$, and the scores are normalized over the cache slots to give a distribution $p_{mem}(i \mid h_t)$. Finally, the probability of generating a word via the copying mechanism is
$$p_{ptr}(w_t \mid w_{<t}) = \sum_{i=1}^{K} p_{mem}(i \mid h_t)\,[m_i = w_t],$$
where $[m_i = w_t]$ is 1 if the $i$th value in memory is $w_t$ and 0 otherwise. Since $p_{mem}$ defines a distribution over slots in the cache, $p_{ptr}$ translates it into word space.
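The step from a distribution over cache slots to a probability over words can be sketched as below. The scoring function is passed in as a parameter because the exact copy-score formula is not reproduced here; the example uses a plain dot product as a stand-in, which is an assumption and not the additive attention of Bahdanau et al. (2015). All names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_prob(word, keys, values, query, score):
    """Attention mass on all cache slots whose stored word equals `word`."""
    if not keys:
        return 0.0  # empty cache: nothing can be copied
    p_mem = softmax([score(k, query) for k in keys])
    return sum(p for p, m in zip(p_mem, values) if m == word)

def dot(k, q):
    # Stand-in scoring function (assumption, see lead-in).
    return sum(a * b for a, b in zip(k, q))

keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
values = ["Gore", "the", "Gore"]
query = [0.0, 0.0]  # all scores equal, so p_mem is uniform over 3 slots
print(pointer_prob("Gore", keys, values, query, dot))
```

A word that is absent from the cache receives pointer probability zero, so it can only be produced by the character-level language model.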

Character-level Neural Cache Language Model
The word probability $p(w_t \mid w_{<t})$ is defined as a mixture of two probabilities. The first is the language model probability, $p_{lm}(w_t \mid w_{<t})$, and the other is the pointer probability, $p_{ptr}(w_t \mid w_{<t})$. The final probability is
$$p(w_t \mid w_{<t}) = \lambda_t\, p_{lm}(w_t \mid w_{<t}) + (1 - \lambda_t)\, p_{ptr}(w_t \mid w_{<t}),$$
where $\lambda_t$ is computed by a multi-layer perceptron (MLP) with two non-linear transformations using $h_t$ as its input, followed by a transformation by the logistic sigmoid function, $\lambda_t = \sigma(\mathrm{MLP}(h_t))$.
We remark that Grave et al. (2017) use a clever trick to estimate the probability $\lambda_t$ of drawing from the LM by augmenting their (closed) vocabulary with a special symbol indicating that a copy should be used. This enables word types that are highly predictive in context to compete with the probability of a copy event. However, since we are working with an open vocabulary, this strategy is unavailable in our model, so we use the MLP formulation.
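A minimal numeric sketch of the mixture follows. The gate is reduced to a single pre-sigmoid scalar, gate_logit, which is a hypothetical stand-in for the output of the MLP over the hidden state; the probabilities are made-up values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mixture_prob(p_lm, p_ptr, gate_logit):
    """Mix the language-model and pointer probabilities of one word.

    lam is the weight on the character-level language model;
    (1 - lam) is the weight on copying from the cache.
    """
    lam = sigmoid(gate_logit)
    return lam * p_lm + (1.0 - lam) * p_ptr

# A rare word that is cheap to copy but expensive to spell out.
print(mixture_prob(p_lm=0.01, p_ptr=0.5, gate_logit=0.0))
```

Because the mixture marginalises over the copy decision, no supervision is needed about which mechanism produced a given training word.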

Training objective
The model parameters as well as the character projection parameters are jointly trained by maximizing the following log likelihood of the observed characters in the training corpus:
$$\mathcal{L} = \sum_{t} \log p(w_t \mid w_{<t}).$$

Datasets
We evaluate our model on a range of datasets, employing preexisting benchmarks for comparison to previously published results, and a new multilingual corpus which specifically tests our model's performance across a range of typological settings.

Penn Tree Bank (PTB)
We evaluate our model on the Penn Tree Bank.
For fair comparison with previous work, we followed the standard preprocessing method used by Mikolov et al. (2010). In the standard preprocessing, tokenization is applied, words are lowercased, and punctuation is removed. Also, less frequent words are replaced by an unknown token (UNK), constraining the word vocabulary size to 10k. Because of this preprocessing, we do not expect this dataset to benefit from the modeling innovations we have introduced in the paper. Fig. 1 summarizes the corpus statistics.

Multilingual Wikipedia Corpus (MWC)
To attempt to control for topic divergences across languages, every language's data consists of the same articles. Although these are only comparable (rather than true translations), this ensures that the corpus has a stable topic profile across languages (footnote 4).
Construction & Preprocessing. We constructed the MWC similarly to the WikiText-2 corpus. Articles were selected from Wikipedia in the 7 target languages. To keep the topic distribution approximately the same across the corpora, we extracted articles about entities which are described in all the languages. We extracted articles which exist in all languages and each consist of more than 1,000 words, for a total of 797 articles. These cross-lingual articles are, of course, not usually translations, but they tend to be comparable. This filtering ensures that the topic profile in each language is similar. Each language corpus is approximately the same size as the WikiText-2 corpus.
Wikipedia markup was removed with WikiExtractor (footnote 5) to obtain plain text. We used the same thresholds for removing rare characters as in the WikiText-2 corpus. No tokenization or other normalization (e.g., lowercasing) was done.
Statistics. After the preprocessing described above, we randomly sampled 360 articles. The articles are split into sets of 300, 30, and 30; the first 300 articles are used for training and the rest are used for dev and test respectively. Table 3 summarizes the corpus statistics. Additionally, we show in Fig. 2 the distribution of frequencies of OOV word types (relative to the training set) in the dev+test portions of the corpus, which shows a power-law distribution, as expected given the burstiness of rare words found in prior work. Curves look similar for all languages (see Appendix A).
Footnote 4: The Multilingual Wikipedia Corpus (MWC) is available for download from http://k-kawakami.com/research/mwc
Footnote 5: https://github.com/attardi/wikiextractor

Experiments
We now turn to a series of experiments to show the value of our hierarchical character-level cache language model.For each dataset we trained the model with LSTM units.To compare our results with a strong baseline, we also train a model without the cache.
Model Configuration. For the HCLM and HCLM with cache models, we used 600 dimensions for the character embeddings, and the LSTMs have 600 hidden units in all the experiments. This keeps the model complexity approximately the same as in previous work, which used an LSTM with 1000 dimensions. Our baseline LSTM has 1000 dimensions for embeddings and recurrence weights.
For the cache model, we used a cache size of 100 in every experiment. All the parameters, including the character projection parameters, are randomly sampled from a uniform distribution from −0.08 to 0.08. The initial hidden and memory states of $\mathrm{LSTM}_{enc}$ and $\mathrm{LSTM}_{ctx}$ are initialized with zero. Mini-batches of size 25 are used for PTB experiments and 10 for WikiText-2, due to memory limitations. The sequences were truncated to 35 words. Then the words are decomposed into characters and fed into the model. A dropout rate of 0.5 was used for all but the recurrent connections.
Learning. The models were trained with the Adam update rule (Kingma and Ba, 2015) with a learning rate of 0.002. The maximum norm of the gradients was clipped at 10.
Evaluation. We evaluated our models with bits-per-character (bpc), a standard evaluation metric for character-level language models. Following the definition in Graves (2013), bits-per-character is the average value of $-\log_2 p(w_t \mid w_{<t})$ over the whole test set,
$$\mathrm{bpc} = -\frac{1}{|c|} \sum_{t} \log_2 p(w_t \mid w_{<t}),$$
where $|c|$ is the length of the corpus in characters.
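The relation between bits-per-character and per-word perplexity can be made concrete with a short sketch. Both metrics are computed from the same per-word log-probabilities and differ only in the normalizer. The function names and the toy numbers are hypothetical, and whether the character count includes spaces is left to the caller as an assumption.

```python
def bits_per_character(word_log2_probs, num_chars):
    """Total negative log2-likelihood divided by the corpus length in characters."""
    return -sum(word_log2_probs) / num_chars

def per_word_perplexity(word_log2_probs):
    """Normalize by the number of words instead, then exponentiate (base 2)."""
    return 2.0 ** (-sum(word_log2_probs) / len(word_log2_probs))

# Two words covering six characters in total (made-up values).
log2_probs = [-4.0, -8.0]
print(bits_per_character(log2_probs, num_chars=6))  # 12 bits over 6 characters
print(per_word_perplexity(log2_probs))              # 2 ** (12 / 2)
```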

Results
PTB. Tab. 4 summarizes results on the PTB dataset. Our baseline HCLM model achieved 1.276 bpc, which is better than the LSTM with Zoneout regularization (Krueger et al., 2017). The HCLM with cache outperformed the baseline model with 1.247 bpc and achieved results competitive with state-of-the-art models that use regularization on recurrence weights, which was not used in our experiments.
Expressed in terms of per-word perplexity (i.e., rather than normalizing by the length of the corpus in characters, we normalize by words and exponentiate), the test perplexity of the HCLM with cache is 94.79. The perplexity of the unregularized 2-layer LSTM with 1000 hidden units on the word-level PTB dataset is 114.5, and the same model with dropout achieved 87.0. Considering the fact that our character-level models are dealing with an open vocabulary without unknown tokens, the results are promising.

Multilingual Wikipedia Corpus (MWC)
Tab. 6 summarizes results on the MWC dataset.
As in the WikiText-2 experiments, the LSTM is a strong baseline. We observe that the cache mechanism improves performance in every language. In English, the HCLM with cache achieved 1.538 bpc where the baseline is 1.622 bpc, a 5.2% improvement. For the other languages, the improvement rates were 2.7%, 3.2%, 3.7%, 2.5%, 4.7%, and 2.7% in FR, DE, ES, CS, FI, and RU respectively. Among the non-English languages, the best improvement rate was obtained in Finnish.
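The improvement rate quoted for English is a relative reduction in bpc with respect to the baseline, which can be checked with a few lines. The function name is hypothetical; the two bpc values are the ones reported above.

```python
def relative_improvement(baseline_bpc, model_bpc):
    """Relative bpc reduction over the baseline, in percent."""
    return 100.0 * (baseline_bpc - model_bpc) / baseline_bpc

english = relative_improvement(baseline_bpc=1.622, model_bpc=1.538)
print(f"{english:.1f}%")  # prints 5.2%
```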

Analysis
In this section, we analyse the behavior of the proposed model qualitatively. To analyse the model, we compute the following posterior probability, which tells us whether the model used the cache given a word and its preceding context. Let $z_t$ be a random variable that says whether to use the cache or the LM to generate the word at time $t$. We would like to know, given the text $w$, whether the cache was used at time $t$. This can be computed as follows:
$$p(z_t \mid w) = \frac{p(z_t, w_t \mid h_t, \mathrm{cache}_t)}{\sum_{z'} p(z', w_t \mid h_t, \mathrm{cache}_t)},$$
where $\mathrm{cache}_t$ is the state of the cache at time $t$. We report the average posterior probability of cache generation excluding the first occurrence of $w$, $p(z \mid w)$. Tab. 7 shows the words in the WikiText-2 test set that occur more than once and that are most/least likely to be generated from the cache and from the character language model (words that occur only once cannot be cache-generated). We see that the model uses the cache for proper nouns (Lesnar, Gore, etc.) as well as for very frequent words which are always stored somewhere in the cache, such as single-token punctuation, the, and of. In contrast, the model uses the language model to generate numbers, which tend not to be repeated (300, 770), and basic content words (sounds, however, unable, etc.). This pattern is similar to the pattern found in the empirical distribution of frequencies of rare words observed in prior work (Church and Gale, 1995; Church, 2000), which suggests our model is learning to use the cache to account for bursts of rare words.
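For a single word, the posterior over the generation mechanism follows from Bayes' rule applied to the two mixture components. The sketch below assumes that lam is the weight on the language model, as in the mixture defined earlier; the function name and the numbers are hypothetical.

```python
def cache_posterior(p_lm, p_ptr, lam):
    """Posterior probability that a word was copied from the cache.

    lam is the mixture weight on the language model, so the joint
    probabilities are lam * p_lm (LM) and (1 - lam) * p_ptr (cache).
    """
    joint_cache = (1.0 - lam) * p_ptr
    joint_lm = lam * p_lm
    total = joint_cache + joint_lm
    if total == 0.0:
        raise ValueError("word has zero probability under both components")
    return joint_cache / total

# A repeated proper noun: hard to spell out, easy to copy (made-up values).
print(cache_posterior(p_lm=0.01, p_ptr=0.09, lam=0.5))
# A word that is not in the cache can only come from the language model.
print(cache_posterior(p_lm=0.01, p_ptr=0.0, lam=0.5))
```

Averaging this quantity over the non-initial occurrences of a word type gives a per-type score of the kind reported in the analysis.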
To look more closely at rare words, we also investigate how the model handles words that occurred between 2 and 100 times in the test set, but fewer than 5 times in the training set. Fig. 3 is a scatter plot of p(z | w) vs. the empirical frequency in the test set. As expected, more frequently repeated word types are increasingly likely to be drawn from the cache, but less frequent words show a range of cache generation probabilities. Tab. 8 shows word types with the highest and lowest average p(z | w) that occur fewer than 5 times in the training corpus. The pattern here is similar to the unfiltered list: proper nouns are extremely likely to have been cache-generated, whereas numbers and generic (albeit infrequent) content words are less likely to have been.

Discussion
Our results show that the HCLM outperforms a basic LSTM. With the addition of the caching mechanism, the HCLM becomes consistently more powerful than both the baseline HCLM and the LSTM. This is true even on the PTB, which has no rare or OOV words in its test set (because of preprocessing), by caching repetitive common words.
Non-English languages. For non-English languages, the pattern is largely similar, although the relative gains from the cache are smaller than in English. This is not surprising since morphological processes may generate forms that are related to existing forms, but these still have slight variations. Thus, they must be generated by the language model component (rather than from the cache). Still, the cache demonstrates consistent value in these languages. Our analysis of the cache on English does show that it is being used to model word reuse, particularly of proper names, but also of frequent words. While empirical analysis of rare word distributions predicts that names would be reused, the fact that the cache is used to model frequent words suggests that effective models of language should have a means to generate common words as units. Finally, our model disfavors copying numbers from the cache, even when they are available. This suggests that it has learnt that numbers are not generally repeated (in contrast to names).

Related Work
Caching language models were proposed to account for burstiness by Kuhn and De Mori (1990), and recently this idea has been incorporated to augment neural language models with a caching mechanism (Merity et al., 2017; Grave et al., 2017). Open-vocabulary neural language models have been widely explored (Sutskever et al., 2011; Mikolov et al., 2012; Graves, 2013, inter alia). Attempts to make them more aware of word-level dynamics, using models similar to our hierarchical formulation, have also been proposed (Chung et al., 2017).
The only models that combine open-vocabulary language modeling with a caching mechanism are the nonparametric Bayesian language models based on hierarchical Pitman-Yor processes, which generate a lexicon of word types using a character model and then generate a text using these (Teh, 2006; Goldwater et al., 2009; Chahuneau et al., 2013). These, however, do not use distributed representations or RNNs to capture long-range dependencies.

Conclusion
In this paper, we proposed a character-level language model with an adaptive cache which selectively assigns word probability from past history or character-level decoding. We empirically showed that our model efficiently models word sequences and achieves better perplexity on every standard dataset. To further validate the performance of our model on different languages, we collected a multilingual Wikipedia corpus for 7 typologically diverse languages. We also showed that our model performs better than character-level models by modeling burstiness of words in local context.
The model proposed in this paper assumes the observation of word segmentation. Thus, the model is not directly applicable to languages, such as Chinese and Japanese, where word segments are not explicitly observable. We will investigate a model which can marginalise word segmentation as latent variables in future work.

Figure 1: Description of Hierarchical Character Language Model with Cache.

Figure 2: Histogram of OOV word frequencies in the dev+test part of the MWC Corpus (EN).

Figure 3: Average p(z | w) of OOV words in the test set vs. term frequency in the test set for words not observed in the training set. The model prefers to copy frequently reused words from the cache component, which tend to be names (upper right), while character-level generation is used for infrequent open-class words (bottom left).

Fig. 4 shows the distribution of frequencies of OOV word types in 6 languages.

Figure 4: Histogram of OOV word frequencies in MWC Corpus in different languages.

Table 3: Summary of MWC Corpus.