A Simple and Effective Method for Injecting Word-Level Information into Character-Aware Neural Language Models

We propose a simple and effective method to inject word-level information into character-aware neural language models. Unlike previous approaches, which usually inject word-level information at the input of a long short-term memory (LSTM) network, we inject it into the softmax function. The resultant model can be seen as a combination of a character-aware language model and a simple word-level language model. Our injection method can also be used together with previous methods. Through experiments on 14 typologically diverse languages, we empirically show that our injection method, when used together with the previous methods, works better than the previous methods alone, including a gating mechanism, averaging, and concatenation of word vectors. We also provide a comprehensive comparison of these injection methods.


Introduction
Language modeling (LM) is an important task in natural language processing, with various applications such as speech recognition (Mikolov et al., 2010a), machine translation (Koehn, 2009), and summarization (Filippova et al., 2015). Recently, neural language models (NLMs) have shown great success and outperform traditional count-based methods (Bengio et al., 2003; Mikolov et al., 2010b). Standard NLMs usually maintain a fixed vocabulary and map each word to a continuous representation. The word representations obtained through NLMs are usually close to each other in the induced vector space if the words are semantically similar. However, there are two main problems with standard NLMs. One is that they cannot handle out-of-vocabulary words, which are usually replaced with a special unknown symbol. The other is that these models are not effective at learning the relationships between infrequent words.
For example, although the words "husbandman" and "salesman" share the suffix "man" in their surface forms, standard NLMs cannot capture such information when obtaining the relationship between the two words. A common way to deal with these issues is to use the character information of each word to calculate the word representation, which is often referred to as character-aware NLMs (Ling et al., 2015; Kim et al., 2016; Vania and Lopez, 2017; Gerz et al., 2018). Our research focuses on utilizing the advantages of both character-level and word-level information in character-aware NLMs.
Previous work usually combines word-level and character-level information at the input of the LSTM layers, through a gating mechanism or through averaging or concatenation of word vectors. Because these approaches target the input vectors, the word-level information cannot be explicitly taken into account at the output layer when predicting the next word.
To deal with this problem, we propose an improved character-aware neural language model that takes the injected word-level information into account at the output layer. This model is strongly inspired by the success of n-gram language models. Our model can predict the next word using the embeddings of the words in the current n-gram window, in addition to the hidden state of the LSTM layer. Specifically, we use a gate to control how much word-level information should be taken before injecting it into the softmax function. We then combine the gated word-level information with the output of the LSTM, and finally feed this combined information to the softmax function for word prediction. Our method can also take the information of previous words into account when injecting word-level information into the softmax function.
Our injection method is simple and easy to implement.1 We found our method effective compared with several common previous methods on 14 datasets with typologically diverse languages. In addition, further improvements can be obtained when our injection method is used together with the previous methods. We also conducted a comprehensive comparison of these injection methods. Finally, we set up several experiments to check the effects of infrequent words on our model, and we also compared our model with previous work on 6 common language modeling datasets. Our results show that:
- Compared with the previous injection methods (i.e., the gating mechanism, averaging, addition, and concatenation of word vectors), our injection method performs best on the majority of languages.
-Our injection method works effectively even when used alone, and the combination of our injection method and the previous injection methods performs better than the previous injection methods.
-When injecting word-level information into character-aware NLMs, discarding rare words in the training data can help improve the performance.

Related Work
Much work has attempted to improve character-aware NLMs in recent years. For example, Assylbekov and Takhanov (2018) proposed several ways of reusing weights in character-aware NLMs. Gerz et al. (2018) achieved improved results on 50 typologically diverse languages by injecting subword-level information into word vectors at the softmax. For a thorough review of past research, we recommend the work by Vania and Lopez (2017), who performed a systematic comparison across different models based on different subword units (characters, character trigrams, BPE, etc.). One direction related to our research is to inject word-level information into character-aware neural models. Aside from language modeling, Santos and Zadrozny (2014) and dos Santos and Guimarães (2015) first used a convolutional neural network (CNN) to encode characters and then concatenated these encoded character-level representations with word-level representations for part-of-speech tagging and named entity recognition. Luong and Manning (2016) introduced a character-word neural machine translation model that consults character-level representations, encoded with a deep LSTM, only for rare words.

1 https://github.com/yukunfeng/char_word_lm
As research efforts for language models, Kang et al. (2011) used a simple character-word NLM designed for Chinese. Miyamoto and Cho (2016) introduced a gating mechanism between word embeddings and character embeddings obtained from a bidirectional LSTM (BiLSTM) for English. Verwimp et al. (2017) directly concatenated word and character embeddings, without other subnetworks to encode the characters, for English and Dutch.
Although there are a number of research efforts on using both character-level and word-level information, they feed the two types of information only to the LSTM, while our model also injects the word-level information into the softmax function. Previous work on this topic has usually been tested on a limited number of languages and lacks a comprehensive comparison of different injection methods. We will compare our method with the previous methods mentioned in this section on 14 typologically diverse languages.

Model Description
For language modeling, we basically use an LSTM network (Hochreiter and Schmidhuber, 1997). We denote the hidden state of the LSTM for the t-th word w_t as h_t \in R^d, where d is the embedding size. We incorporate word-level information using the neural network shown in Figure 1. We describe the details in the following subsections.

Figure 1: Our character-aware LSTM language model with injection of word-level information, illustrated with the example word "cats". The symbols ^ and $ respectively represent the start and the end of a word.

Input Word Representations
We use a BiLSTM to encode character n-grams to obtain the character-level representation. We set n to 3 for all the languages except Japanese and Chinese, for which we set n to 1. This is because a BiLSTM over character 3-grams obtained the best results on most LM datasets in the work of Vania and Lopez (2017), while Japanese and Chinese are more ideographic than the others, so a smaller n is expected to work better.
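As an illustration, the character n-gram segmentation described above can be sketched in a few lines of Python (a hypothetical helper, not the authors' released code; the boundary symbols ^ and $ follow Figure 1):

```python
def char_ngrams(word, n=3):
    """Split a word into character n-grams, with ^ and $ marking
    the start and the end of the word, as in Figure 1."""
    chars = ["^"] + list(word) + ["$"]
    if len(chars) < n:
        # Very short words yield a single, shorter-than-n unit.
        return ["".join(chars)]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]

print(char_ngrams("cats"))      # trigrams, as used for most languages
print(char_ngrams("猫", n=1))   # unigrams, as used for Japanese and Chinese
```

With n = 3 this yields overlapping trigrams such as ^ca and ts$ for "cats"; with n = 1 it reduces to single characters plus the boundary symbols.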
Given a word w_t, we denote its embedding as w_t \in R^d, taken from a lookup table W_in \in R^{|V| \times d}, where |V| is the vocabulary size. We compute the character-level representation c_t of w_t as follows:

  c_t = W_f h^{fw}_l + W_b h^{bw}_0 + b,   (1)

where h^{fw}_l, h^{bw}_0 \in R^d are the last states of the forward and backward LSTMs respectively, and W_f, W_b \in R^{d \times d} and b \in R^d are trainable parameters. We define the following methods to obtain the combined representation \bar{w}_t from w_t and c_t:

gate: we use the same gating mechanism as Miyamoto and Cho (2016), described below, to combine w_t and c_t.
avg, add, cat: we obtain \bar{w}_t through averaging, addition, and concatenation of w_t and c_t, respectively.
In the gating mechanism, we compute \bar{w}_t as follows:

  g_{w_t} = \sigma(v_g^\top w_t + b_g),   (2)
  \bar{w}_t = g_{w_t} w_t + (1 - g_{w_t}) c_t,   (3)

where v_g \in R^d and b_g \in R are trainable parameters and \sigma(\cdot) is the sigmoid function.
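The four combination methods can be sketched as follows. This is an illustrative pure-Python version with the scalar gate of Eq. (2); in the real model, v_g and b_g are learned, and the vectors are d-dimensional embeddings:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def combine(w_t, c_t, method, v_g=None, b_g=0.0):
    """Combine a word embedding w_t with a character-level
    representation c_t via the gate/avg/add/cat methods."""
    if method == "add":
        return [w + c for w, c in zip(w_t, c_t)]
    if method == "avg":
        return [(w + c) / 2 for w, c in zip(w_t, c_t)]
    if method == "cat":
        return w_t + c_t  # concatenation doubles the dimension
    if method == "gate":
        # g = sigma(v_g . w_t + b_g), a scalar gate as in Eq. (2)
        g = sigmoid(sum(v * w for v, w in zip(v_g, w_t)) + b_g)
        return [g * w + (1 - g) * c for w, c in zip(w_t, c_t)]
    raise ValueError("unknown method: " + method)
```

Note that cat changes the dimensionality, so the downstream LSTM input size must be doubled for that variant, while the other three preserve d.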

Representation of Input to Softmax
Our proposal is to combine h_t with \bar{w}_t to better inform the softmax function of word-level information. The combined representation \tilde{h}_t is computed as follows:

  \tilde{h}_t = h_t + g^{out}_{w_t} \bar{w}_t,   (4)

where g^{out}_{w_t} is a gate value. In our experiments, we set up two types of gate. One is a fixed value, g^{out}_{w_t} = 0.5. The other is similar to the definition in Eq. (2) and adaptively outputs a gate value depending on \bar{w}_t:

  g^{out}_{w_t} = \sigma(v_k^\top \bar{w}_t + b_k),   (5)

where v_k \in R^d and b_k \in R are trainable parameters. In Eq. (4), the gate is used only on the word-level information, to decide how much of \bar{w}_t should be taken. If we remove the term h_t from Eq. (4), the resultant model is a simple word-level language model P(w_{t+1} | w_t). Based on this observation, we can extend our method to include the word-level information of previous words without extra parameters:

  h^{word}_t = \sum_{i=1}^{n} \frac{1}{i} \bar{w}_{t-i+1},   (6)

where n is the number of current and previous words used to calculate h^{word}_t. We simply give the embeddings of previous words weights inversely proportional to their distance. For example, when n = 2, h^{word}_t is computed as \bar{w}_t + \frac{1}{2}\bar{w}_{t-1}, which is used to calculate P(w_{t+1} | w_t, w_{t-1}). The combined representation \tilde{h}_t can now be calculated as follows:

  \tilde{h}_t = h_t + g^{out}_{w_t} h^{word}_t.   (7)
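The injection step, with a fixed gate g = 0.5 and inverse-distance weighting of the current and previous word vectors, can be sketched in pure Python for illustration (in the model these are learned d-dimensional vectors):

```python
def softmax_input(h_t, word_embs, g=0.5):
    """Compute the input to the softmax: h_t + g * h_word, where
    h_word sums the current and previous word vectors with weights
    inversely proportional to their distance. word_embs[0] is the
    current word's vector, word_embs[1] the preceding word's, etc."""
    d = len(h_t)
    h_word = [0.0] * d
    for i, emb in enumerate(word_embs, start=1):  # weight 1/i
        for k in range(d):
            h_word[k] += emb[k] / i
    return [h_t[k] + g * h_word[k] for k in range(d)]
```

For n = 2 this reproduces the example in the text: the current word's vector plus half of the preceding word's vector, gated and added to the LSTM state.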

Language Modeling
The language modeling task is to compute the probability of a given sentence w_1, ..., w_T:

  P(w_1, ..., w_T) = \prod_{t=1}^{T} P(w_t | w_1, ..., w_{t-1}).   (8)

We use a softmax function based on \tilde{h}_t to generate a probability distribution over the vocabulary:

  P(w_{t+1} | w_1, ..., w_t) = softmax(W_out^\top \tilde{h}_t),   (9)

where W_out \in R^{d \times |V|} contains the output word embeddings.
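A numerically stable version of the output distribution can be sketched as follows (an illustrative helper, not the paper's implementation; the real model computes this with matrix operations over learned parameters):

```python
import math

def softmax(scores):
    """Turn a list of logits into a probability distribution."""
    m = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def next_word_probs(h_tilde, W_out):
    """P(next word) = softmax of W_out^T h_tilde, where W_out is a
    d x |V| matrix of output word embeddings (list of d rows here)."""
    d, V = len(W_out), len(W_out[0])
    logits = [sum(h_tilde[k] * W_out[k][v] for k in range(d))
              for v in range(V)]
    return softmax(logits)
```

Each column of W_out plays the role of an output embedding for one vocabulary word, so words whose output embeddings align with the combined hidden state receive higher probability.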

Model Variants
The hyper-parameters of our models are shown in Table 1. The learning rate is decreased if no improvement is observed on the validation dataset. The baseline models and our models are listed as follows:
- Char-BiLSTM-LSTM: We use a BiLSTM to encode characters, without injecting word-level information.
- Word-LSTM: A standard word-level LSTM model.
-Char-BiLSTM-gate/avg/add/cat-Word-LSTM: We combine character-level and word-level information at the input of LSTM through gate/avg/add/cat methods, mentioned in Sec. 3.1.
-Char-BiLSTM-LSTM-Word: We inject word-level information only into the softmax function. This is our injection method.
-Char-BiLSTM-gate/avg/add/cat-Word-LSTM-Word: We combine our injection method and previous injection methods, which means we inject word-level information both at the input of LSTM and into the softmax function.
For both Char-BiLSTM-LSTM-Word and Char-BiLSTM-gate/avg/add/cat-Word-LSTM-Word, we use g = 0.5/adaptive and n = 1/2/3 to denote the specific variant of our injection method. For example, Char-BiLSTM-LSTM-Word (g = 0.5, n = 2) denotes that we use a fixed gate value on the word-level information in Eq. (4) and inject the information of the current word and the preceding word into the softmax function.
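The validation-based learning-rate schedule mentioned above can be sketched as follows. The decay factor of 0.5 is our assumption for illustration; the paper only states that the rate is decreased when no improvement is observed:

```python
def maybe_decay(lr, val_ppl, best_ppl, factor=0.5):
    """After each epoch: if validation perplexity did not improve,
    decay the learning rate; otherwise record the new best score.
    Returns the (possibly decayed) rate and the best perplexity."""
    if val_ppl >= best_ppl:
        return lr * factor, best_ppl
    return lr, val_ppl
```

This kind of plateau-based schedule is a common default for LSTM language models and requires no extra tuning beyond the decay factor.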

Datasets
Common language modeling datasets for evaluating character-aware NLMs are from the work of Botha and Blunsom (2014). While these datasets contain languages with rich morphology, they cover only 5 languages. Perhaps the largest-scale language modeling datasets are from the work of Gerz et al. (2018), who released 50 datasets covering typologically diverse languages. The difference between the newly released datasets and the previous common datasets is that unseen words are kept in the test set, so on these datasets we can test our methods in a realistic LM setup. The languages from the work of Gerz et al. (2018) were selected to represent a wide spectrum of morphological systems and contain many low-frequency or unseen words. These datasets are thus desirable for checking the performance of character-aware NLMs.
To simplify the experiments without losing the wide coverage, we only chose datasets of 14 languages from these datasets and tried to cover different language typologies as well as different type/token ratios (TTRs). The statistics of our chosen datasets are shown in Table 2. We used all the words observed in training data and one special unknown token for out-of-vocabulary words as the output vocabulary to make the setting the same as Gerz et al. (2018).
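The type/token ratio used to characterize the datasets is simply the number of distinct word types divided by the number of tokens; a quick sketch:

```python
def type_token_ratio(tokens):
    """Type/token ratio (TTR): distinct word types divided by the
    total number of tokens. High-TTR corpora contain relatively
    more rare and unseen words."""
    return len(set(tokens)) / len(tokens)

corpus = "the cat saw the dog and the dog saw the cat".split()
print(type_token_ratio(corpus))  # 5 types over 11 tokens
```

Morphologically rich languages tend to have higher TTR, since inflection multiplies the number of distinct surface forms.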

Comparison of Baseline Models
The results of Word-LSTM and Char-BiLSTM-LSTM are shown in Table 3, together with the results of Word-LSTM and Char-CNN-LSTM from the work of Gerz et al. (2018). The embedding size and the number of LSTM layers are the same as those for the models in Gerz et al. (2018). As shown in the table, both our Word-LSTM and Char-BiLSTM-LSTM baselines are better than the Word-LSTM and Char-CNN-LSTM from the work of Gerz et al. (2018) on all the datasets. Both Char-BiLSTM-LSTM and the Char-CNN-LSTM from Gerz et al. (2018) are better than their respective Word-LSTM on all the datasets. One possible reason is that the unseen words in the test sets of the 14 datasets cannot be handled by Word-LSTM in the testing phase, whereas character-aware models can encode the characters of these unseen words, making it possible to process them. It is also shown that as TTR increases, the advantage of Char-BiLSTM-LSTM over Word-LSTM grows. This may be because high-TTR languages have more low-frequency words and unseen tokens, as shown in Table 2. However, since frequent words still occupy the majority of both training and test data, injecting word-level information is still helpful for improving these character-aware models, as shown below.

Comparison of Different Injection Methods
The results of all the other injection methods on the 14 language modeling datasets are also shown in Table 3. In our experiments, Char-BiLSTM-gate-Word-LSTM underperforms Char-BiLSTM-LSTM on all the datasets, which indicates the gate method is not effective in our experiments. Char-BiLSTM-cat-Word-LSTM achieves better results than Char-BiLSTM-gate-Word-LSTM on all the datasets, but still underperforms Char-BiLSTM-LSTM on 8 out of 14 datasets. Char-BiLSTM-avg-Word-LSTM outperforms Char-BiLSTM-cat-Word-LSTM on 9 out of 14 datasets, which indicates the simple averaging method is better than the gating mechanism and the concatenation method in our tasks. However, Char-BiLSTM-avg-Word-LSTM still shows no obvious improvement over Char-BiLSTM-LSTM on most datasets.
Some previous work reports similar results for the language modeling task. Kim et al. (2016) used a Char-CNN-LSTM model without injecting word-level information, and reported that some basic methods (e.g., concatenation, averaging, and adaptive weighting schemes) for injecting word-level information degraded the performance of their Char-CNN-LSTM. Miyamoto and Cho (2016) showed that the concatenation method for injecting word-level information into their Char-BiLSTM-LSTM also underperformed their Word-LSTM model.
Char-BiLSTM-add-Word-LSTM improves over Char-BiLSTM-LSTM on 13 out of 14 datasets and also performs best in general among Char-BiLSTM-avg/add/gate/cat-Word-LSTM. The addition method thus works better than the other previous injection methods in our tasks, although this simple method is rarely mentioned in previous work. In conclusion, the performance of the previous injection methods in our experiments was, in descending order: add, avg, cat, and gate.
Our Char-BiLSTM-LSTM-Word (g = 0.5, n = 1) and Char-BiLSTM-LSTM-Word (g = adaptive, n = 1) work effectively, and both achieve better results than Char-BiLSTM-LSTM. A simple fixed gate value in our injection method may be effective enough: Char-BiLSTM-LSTM-Word (g = 0.5, n = 1) works better than Char-BiLSTM-LSTM-Word (g = adaptive, n = 1) on most datasets. When compared with the other injection methods, Char-BiLSTM-LSTM-Word (g = 0.5, n = 1) achieves the best results on most datasets (bold scores in Table 3). This suggests that our injection method, which targets a different position from the input of the LSTM, namely the softmax function, makes good use of word-level information.

Combination of Injection Methods
To avoid too many combinations of our injection method and other previous methods, we only chose to combine our Char-BiLSTM-LSTM-Word (g = 0.5, n = 1) with the other previous injection methods, because Char-BiLSTM-LSTM-Word (g = 0.5, n = 1) performs better than Char-BiLSTM-LSTM-Word (g = adaptive, n = 1), as mentioned above. The results of the combination of our Char-BiLSTM-LSTM-Word (g = 0.5, n = 1) and the previous injection methods are shown in Table 4.
When our injection method is used together with the gate/avg/cat/add methods, obvious improvements can be observed on most datasets. Among them, Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 1) obtained the best results on most datasets (bold scores in Table 4). This result indicates that the previous injection methods do not make full use of word-level information, while our method, which injects the word-level information into a different position, specifically the softmax function, can help the previous models make better use of it.

Including Word-level Information for Previous Words
As mentioned in Sec. 3.1, we can include word-level information for previous words when injecting it into the softmax function. The number of words used in our injection method is denoted by n. In our experiments, we only set n to 1, 2, and 3, as we observed no obvious improvements with a larger n. Since Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 1) performs best in general on most datasets, as mentioned above, we only changed n for this model. Note that Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 2/3) does not need extra parameters, as we just reuse the word embeddings from the lookup table W_in to compute the word-level information. In addition, the computational overhead of our injection method is low, since the involved computation is simple. The result is shown in Table 5.
In general, Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 2) achieves the best result on most datasets. Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 3) does not obtain further improvements on most datasets. Since our current method for including word-level information for previous words is simple, a more advanced method can be further exploited in future work.

Effects of Infrequent Words
In order to check whether infrequent words help our character-aware NLMs, we set up several experiments discarding some infrequent words based on their frequency. Note that we maintain two independent vocabularies. One is the input vocabulary, used to inject word-level information; the word embeddings in our and the previous injection methods are obtained through the lookup table W_in, as described in Sec. 3.1. The other is the output vocabulary, used for word prediction, as described in Sec. 3.3. When we discard the infrequent words, we only narrow down the input vocabulary and do not change the output vocabulary. Thus, the perplexity scores remain comparable with the scores in the above experiments. For example, when our model processes the sentence "the salesman brought some samples" in the training phase, where 'salesman' is an infrequent word in the training data, our model can still try to predict the word 'salesman' given the previous word 'the', because 'salesman' is in our output vocabulary. When inputting the word 'salesman' to predict the word 'brought', we do not inject word-level information for 'salesman'; we only use its character-level representation, obtained through our BiLSTM over characters, to perform the language modeling task.

Table 5: Perplexity of our Char-BiLSTM-add-Word-LSTM-Word including word-level information for previous words on 14 language modeling datasets.
We denote the frequency threshold as θ and set its value to 5, 15, or 25. If the frequency of a word in the training data is less than or equal to θ, we discard it. We refer to the model that discards infrequent words as Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 1, θ = 5/15/25). The result is shown in Table 7.
When discarding the words whose frequency is less than or equal to 15, the model obtains better results than Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 1) on only 2 out of 14 datasets. This indicates some infrequent words are still helpful. When we increase the frequency threshold further to 25, the performance drops compared with Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 1, θ = 15), as more frequent words are discarded. However, we found that a relatively small frequency threshold of θ = 5 works quite effectively: Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 1, θ = 5) achieves better results than Char-BiLSTM-add-Word-LSTM-Word (g = 0.5, n = 1) on 7 out of 14 datasets. The trend seems to be that discarding infrequent words with θ = 5 is useful for high-TTR languages; note that we arranged our datasets from low TTR to high TTR in Table 7. Since many words in natural languages are rare, as described by Zipf's law, we can reduce the size of the input vocabulary significantly even with a small θ. The sizes of the full input vocabulary and the reduced vocabularies with different frequency threshold values are shown in Table 8. As we can see, when θ is set to 5, our model achieves better results with fewer parameters.
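The θ-thresholded input vocabulary can be sketched as follows (an illustrative helper with a hypothetical name; only the input vocabulary is pruned, while the output vocabulary used for prediction stays untouched):

```python
from collections import Counter

def build_input_vocab(train_tokens, theta):
    """Build the input vocabulary, discarding words whose training
    frequency is less than or equal to theta. Discarded words fall
    back to their character-level representations only."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c > theta}
```

For the "salesman" example above: with θ = 5, a word seen three times in training is dropped from the input vocabulary, so it receives no word-level injection, but it remains a valid prediction target.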

Experiments on Common Datasets
In addition to the above datasets, we also set up 6 common language modeling datasets: the English Penn Treebank (PTB) (Marcus et al., 1993) and 5 non-English datasets with rich morphology from the 2013 ACL Workshop on Machine Translation, which have been commonly used for evaluating character-aware NLMs (Botha and Blunsom, 2014; Kim et al., 2016; Bojanowski et al., 2017; Assylbekov and Takhanov, 2018). Since some of the previous work tested their models on PTB, we also included PTB in our experiments. We used the preprocessed small version of the non-English datasets by Botha and Blunsom (2014) and followed the same splits as the previous work. The data statistics are provided in Table 9.
The results of our proposed models and previous work are shown in Table 6. We used Char-BiLSTM-LSTM and Char-BiLSTM-add-Word-LSTM as baseline models. For our models, we set the frequency threshold θ to 5 and set n to 2, as these settings help improve our character-aware NLMs, as discussed in Sec. 5.6 and Sec. 5.5. The language models used in the previous work are improved in different aspects, and most of them are also based on a standard LSTM, like ours. Botha and Blunsom (2014) used the morphological log-bilinear (MLBL) model, which takes morpheme information into account. Kim et al. (2016) used a CNN as their character encoder, and also trained an LSTM language model where the input representation of a word is the sum of the morpheme embeddings of the word. Bojanowski et al. (2017) trained word embeddings through skip-gram models with subword-level information, and used these word embeddings to initialize the lookup table of word embeddings of a word-level language model. Assylbekov and Takhanov (2018) focused on reusing embeddings and weights in a character-aware language model. The input of their model is also the sum of the morpheme embeddings of the word. As shown in the table, Char-BiLSTM-LSTM underperforms the previous work on PTB.
One reason may be that we did not tune the hyper-parameters of our models on PTB; the hyper-parameters were simply kept the same in all the experiments on the 20 datasets. Nevertheless, Char-BiLSTM-LSTM achieves better results than most previous work on the non-English datasets, and our models achieve the best results on the non-English datasets.

Conclusion
In addition to combining character-level and word-level information at the input of the LSTM, which is the widely used manner of combination, we proposed to also inject word-level information into the softmax function of a character-aware neural language model. We gave a detailed comparison with previous methods, and the results showed that our proposal works effectively on typologically diverse languages. For future work, it would be interesting to see how our model works for other tasks such as text generation.