Character and Subword-Based Word Representation for Neural Language Modeling Prediction

Most neural language models use different kinds of embeddings for word prediction. While word embeddings can be associated with each word in the vocabulary or derived from characters or from a factored morphological decomposition, these word representations are mainly used to parametrize the input, i.e. the context of prediction. This work investigates the effect of using subword units (characters and factored morphological decomposition) to build output representations for neural language modeling. We present a case study on Czech, a morphologically-rich language, experimenting with different input and output representations. When working with the full training vocabulary, despite unstable training, our experiments show that augmenting the output word representations with character-based embeddings can significantly improve the performance of the model. Moreover, reducing the size of the output look-up table, to let the character-based embeddings represent rare words, brings further improvement.


Introduction
Most neural language models, such as n-gram models (Bengio et al., 2003), are word-based and rely on the definition of a finite vocabulary V. A look-up table therefore maps each word w ∈ V to a vector of real-valued features, and is stored in a matrix. While this approach yields significant improvements for a variety of tasks and languages, see for instance (Schwenk, 2007) in speech recognition and (Le et al., 2012; Devlin et al., 2014; Bahdanau et al., 2014) in machine translation, it induces several limitations.
For morphologically-rich languages, like Czech or German, lexical coverage is still an important issue, since there is a combinatorial explosion of word forms, most of which are rarely observed in the training data. On the one hand, growing the look-up table is not a solution, since it would increase the number of parameters without providing enough training examples for a proper estimation. On the other hand, rare words can be replaced by a special token. This token acts as a word class merging very different words without any distinction, while using different word classes to handle out-of-vocabulary words (OOVs) (Allauzen and Gauvain, 2005) does not really solve this issue, since rare words are difficult to classify. Moreover, for most inflected or agglutinative forms, as well as for compound words, the word structure is overlooked, wasting parameters on modeling forms that could be handled more efficiently by decomposing words into subword units.
Using subword units, whether they are obtained with a supervised method that embeds linguistic knowledge or learned from the training data, has been attempted many times, especially for speech recognition. The main goal is to reduce the OOV rate. While most of these attempts focused on a specific language, (Creutz et al., 2007) is a representative example of such a model applied to several morphologically-rich languages.
Among the first general language models integrating morphological features to represent words are the factored language model (Bilmes and Kirchhoff, 2003) and its neural version (Alexandrescu and Kirchhoff, 2006). Input words are represented by their embedding, plus several other features, some of which include morphemes. To alleviate the impact of OOVs, (Mueller and Schuetze, 2011) used morphological features for class-based predictions when input words are unknown, obtaining state-of-the-art results on English. More recently, several types of language models represent words as a function of subword units, using a recursive structure (Luong et al., 2013) or an additive one (Botha and Blunsom, 2014). A lot of work has been done on language models that extract features directly from the character sequence, whether they use character n-grams (Sperr et al., 2013), or characters composed by a convolutional layer (Santos and Zadrozny, 2014; Kim et al., 2015) or a Bi-LSTM layer (Ling et al., 2015). This avoids using an external morphological analyser. These types of models have also been applied with success to several other tasks, including learning word representations (Qiu et al., 2014; Cotterell et al., 2016; Bojanowski et al., 2016; Wieting et al., 2016), POS tagging (Plank et al., 2016; Ma and Hovy, 2016; Heigold et al., 2017), named entity recognition (Gillick et al., 2016), parsing (Ballesteros et al., 2015) and machine translation (Costa-jussà and Fonollosa, 2016). Recently, an exhaustive summary of previous work on word representation by composing subword units was presented in (Vania and Lopez, 2017). This work also compares the types of subword units, how they are composed, and their impact on various morphological typologies.
While recurrent neural networks have shown excellent performance for character-level language modeling (Sutskever et al., 2011; Hermans and Schrauwen, 2013), the results of such models are usually worse than those of models that use word-level prediction, since they have to consider a far longer history of tokens to be able to predict the next one correctly. However, more recent work (Hwang and Sung, 2017) seems to obtain very satisfactory results with a supplementary word-level layer that allows a better processing of the longer history.
Our work focuses on replacing output word embeddings by representations built from subwords. To the best of our knowledge, such a model has only been proposed in (Józefowicz et al., 2016), which evaluates the use of convolutional and LSTM layers to build word representations for output words. Their approach allows the model to trade size against perplexity, since it performs worse than the classic softmax approach, but with far fewer parameters. We first propose to study the training of a language model which augments or completely replaces output word representations with character-based representations. We compare the effect of different architectures, as well as the effect of different input representations. Our results show that:
• When evaluating perplexity on the full training vocabulary, using an augmented output representation improves the model performance.
• Not using the look-up table for rare words also improves the model performance.
Finally, we describe a short experiment with factoring the output predictions using a morphological analysis, which we believe could lead to easier word generation when combined with reinflection models. Our paper is organized as follows: Section 2 describes the general architecture of the language model and of the representations used, as well as its training, Section 3 presents the experiments and Section 4 gives our results and discussion.

Language model
We use a recurrent neural language model (Mikolov et al., 2010). The input of the network is a sequence of words $S = (w_1, \ldots, w_{|S|})$. Given a fixed-size vocabulary V, the language model outputs a multinomial distribution $P(w_i = j \mid w_1^{i-1}), \forall j \in V$, for each position i in the sequence, with the prediction context $w_1^{i-1} = w_1, \ldots, w_{i-1}$. The probability of the whole sequence is then obtained as the product of these conditional probabilities over all positions. Our model uses the LSTM variant (Hochreiter and Schmidhuber, 1997). The hidden state $h_i$ is computed from the previous hidden state and a computed representation $r_{w_i}$ of the word in position i in the sequence. The conditional probability distribution of the next word is computed with a softmax function. We propose to improve output word embeddings by using representations built from subwords, as is often done for input words.
Usually, input and output word embeddings are parameters, stored in look-up matrices $W$ and $W^{out}$. The word embedding $r^{word}_w$ of a word w is simply the column of W corresponding to its index in the vocabulary V: $r^{word}_w = [W]_w$.
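The quantities introduced above can be written out as follows. This is a sketch under assumptions rather than the original typesetting: the bias term $b_j$, the name $r^{out}_j$ for the output representation of word j, and the exact indexing of the hidden state are illustrative choices.

```latex
% Sequence probability obtained with the chain rule
P(S) = \prod_{i=1}^{|S|} P(w_i \mid w_1^{i-1})

% Recurrent update with an LSTM cell, fed with the representation of the current word
h_i = \mathrm{LSTM}(h_{i-1}, r_{w_i})

% Softmax over the vocabulary V; r^{out}_j is the output representation of word j (assumed notation)
P(w_{i+1} = j \mid w_1^{i}) =
  \frac{\exp\left(h_i^\top r^{out}_j + b_j\right)}
       {\sum_{k \in V} \exp\left(h_i^\top r^{out}_k + b_k\right)}
```

The denominator sums over the whole vocabulary, which is what makes large output vocabularies costly.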

Representing words
We consider two other types of representations: a decomposition of words into characters (or character n-grams), and a decomposition into a lemma and positional tags obtained with a morphological analysis. An example of these different decompositions is shown in table 1.

Representation       Decomposition
Word                 pocátku
Characters           p+o+c+á+t+k+u
Character 3-grams    poc+ocá+cát+átk+tku
Lemma + Tags         pocátek+N+MascIn+Sg+Loc+Act

First, a convolution layer (Waibel et al., 1990; Collobert et al., 2011), similar to the layers used in (Santos and Zadrozny, 2014; Kim et al., 2015), applies a convolution filter $W^{CNN}_{n_c}$ over a sliding window of $n_c$ characters, producing a local feature vector $x^{n_c}_n$ for each position n in the word. The embedding of w is then obtained by applying max-pooling over the positions, followed by the activation function φ. We can use multiple filters of $n_c^f$ different sizes and concatenate their results.
Our second method uses a bi-LSTM (Hochreiter and Schmidhuber, 1997; Graves et al., 2005) on characters, similarly to (Ling et al., 2015). It combines the final states $\overrightarrow{h}_{|w|}$ and $\overleftarrow{h}_1$ of two LSTMs, run respectively over the character sequence and over the reversed character sequence.
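The convolutional character representation described above can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments: the function name, the zero-padding of short words, the absence of a bias term and the choice of tanh for φ are all assumptions.

```python
import numpy as np


def char_cnn_embedding(char_vectors, filters, activation=np.tanh):
    """Build a word embedding from the embeddings of its characters.

    char_vectors: array of shape (word_length, char_dim).
    filters: dict mapping a window size n_c to a weight array of shape
             (n_c * char_dim, n_features). Hypothetical parametrization.
    """
    pooled = []
    for n_c, weights in sorted(filters.items()):
        length, dim = char_vectors.shape
        # Assumption: zero-pad words shorter than the window so that
        # at least one window position exists.
        if length < n_c:
            chars = np.vstack([char_vectors, np.zeros((n_c - length, dim))])
        else:
            chars = char_vectors
        # One flattened window of n_c characters per position n.
        windows = np.stack([chars[n:n + n_c].reshape(-1)
                            for n in range(chars.shape[0] - n_c + 1)])
        local = windows @ weights                      # local features per position
        pooled.append(activation(local.max(axis=0)))   # max-pooling, then phi
    # Concatenate the results of all filter sizes.
    return np.concatenate(pooled)
```

With filters of 3, 5 and 7 characters producing 30, 50 and 70 features, the concatenated embedding has dimension 150.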

Lemma+Tags decomposition
For morphologically-rich languages, the different morphological properties of a word (gender, case, ...) are usually encoded using multiple tags, as shown in table 1. A word w is therefore decomposed into a lemma l along with a set of associated sub-tags $T = \{t_1, \ldots, t_{|T|}\}$ of fixed size |T|. For a given word, a single tag can be created simply by concatenating the sub-tags. However, this implies a large tagset and limits the generalization power, since some sub-tag combinations can remain unobserved in the training data. In this work we prefer a factored representation where each sub-tag is considered independently. Lemmas, similarly to surface forms, are represented by $|V_L|$ vectors stored in a look-up matrix L, with $r^{lemma}_l = [L]_l$. Each sub-tag has its own vocabulary and its own look-up matrix. However, the additional cost is negligible given their small size (see table 3). To infer a word embedding from a set of sub-tags, we also use two methods. The first simply concatenates their embeddings. The second uses a bidirectional LSTM on the sequence of tags T, with exactly the same structure as in section 2.1.1.

Figure 1: Example architecture of our language model, when using word embeddings and a character CNN to build both input and output word representations.
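The concatenation variant of the lemma+tags representation can be illustrated with a short sketch. The function name and the table layout are assumptions: embeddings are stored as rows here for convenience, whereas the text describes them as columns of the look-up matrices.

```python
import numpy as np


def lemma_tags_embedding(lemma_id, tag_ids, lemma_table, tag_tables):
    """Concatenate a lemma embedding with one embedding per sub-tag.

    lemma_table: array (|V_L|, lemma_dim), one row per lemma.
    tag_tables: list of arrays, one look-up table per tag category.
    tag_ids: one index per tag category, in the same order as tag_tables.
    """
    if len(tag_ids) != len(tag_tables):
        raise ValueError("expected one sub-tag index per tag category")
    parts = [lemma_table[lemma_id]]
    # Each sub-tag is looked up in its own, small table.
    parts += [table[t] for table, t in zip(tag_tables, tag_ids)]
    return np.concatenate(parts)
```

Because each category has its own small table, an unseen combination of sub-tags still maps to a well-defined vector, which is the motivation for the factored representation.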

Training
Our final model, as illustrated in figure 1, uses concatenation of word, character-based or lemma and tags embeddings, to obtain input and output word representations. Following (Kim et al., 2015), we used a Highway layer (Srivastava et al., 2015) to model interactions between concatenated embeddings of various sources.
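As an illustration of how a Highway layer mixes the concatenated embeddings, here is a minimal NumPy sketch of a single layer. The ReLU non-linearity, the parameter names and the single-layer depth are assumptions, not details given in the text.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def highway(x, w_h, b_h, w_t, b_t):
    """y = t * relu(x W_H + b_H) + (1 - t) * x, with gate t = sigmoid(x W_T + b_T).

    x: concatenated embeddings, shape (d,); w_h and w_t: shape (d, d).
    """
    transform = np.maximum(0.0, x @ w_h + b_h)   # transformed input
    gate = sigmoid(x @ w_t + b_t)                # transform gate t
    # The carry gate (1 - t) lets part of the input pass through unchanged.
    return gate * transform + (1.0 - gate) * x
```

A strongly negative gate bias makes the layer close to the identity, so it can start by simply passing the concatenated embeddings through.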
Usually, such a model is trained by maximizing the log-likelihood. For a word $w_i$ given its preceding sequence $w_1, \ldots, w_{i-1}$, the model parameters θ are estimated in order to maximize the log-probability $\log P_\theta(w_i \mid w_1^{i-1})$, summed over all positions of all the sequences observed in the training data. This objective function implies a very costly summation imposed by the softmax activation of the output layer: large output vocabularies cause a computational bottleneck due to the output normalization.
Different solutions have been proposed, such as shortlists (Schwenk, 2007), hierarchical softmax (Morin and Bengio, 2005; Mnih and Hinton, 2009; Le et al., 2011), or self-normalisation techniques (Devlin et al., 2014; Andreas et al., 2015; Chen et al., 2016). Sampling-based techniques explore a different solution, where a limited number of negative examples are sampled to reduce the normalization cost. Working with a large vocabulary, and with output representations that are potentially more costly to compute, we choose to use the following sampling-based training algorithms:
• Target sampling, which is based on importance sampling (Bengio and Sénécal, 2008; Jean et al., 2015), directly approximates the normalization over V by normalizing over a sampled subset. Indeed, the gradient of the objective described in equation 7 is the difference between the gradient of the score of the target word and the expectation, under the model distribution over V, of the gradient of the scores. The idea is to approximate this expectation, the second term, by importance sampling a subset of V from a proposal distribution Q. Target sampling implies associating with a part $D_i$ of the training data a subset $V_i$ of V that corresponds to the target words of $D_i$ plus a small subset of the remaining words. The resulting objective is equivalent to approximating the probability computed in equation 1 by normalizing it only over $V_i$.
• Noise contrastive estimation (NCE), introduced in (Gutmann and Hyvärinen, 2012; Mnih and Teh, 2012), aims to discriminate between one example sampled from the real data D and k examples sampled from a noise distribution $P_n$, and results in the model being theoretically unnormalized. The idea is to sample examples according to a mixture of the data distribution and the noise distribution, and to train the model to recover whether each sample came from the data or from the noise. This is done by minimizing the binary cross-entropy of recognizing the origin of the current sample, using the posterior probabilities of each origin given the sample. Besides, the probabilities involved in equation 10 can be replaced by unnormalized scores at training time, since we can consider the normalizing quantities as parameters to be learned.
• BlackOut (Ji et al., 2015) also approximates the normalization computation, with a weighted sampling scheme and a discriminative objective. It can be considered as a variant of NCE where we sample a set $S_k$ of k examples from a proposal distribution Q. NCE is then applied with a re-weighted noise distribution, which empirically behaves far better than standard NCE and provides improved stability. BlackOut can also be linked to importance sampling.
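For the first two criteria of the list above, the quantities involved can be written out explicitly. These are the standard forms of the softmax gradient and of the NCE mixture and posterior, given as a sketch: the score function $s_\theta$, the context symbol $h$ and the indicator $C$ are assumed notation.

```latex
% Gradient of the log-likelihood for a target word w in context h,
% where s_\theta(w, h) is the unnormalized score of w
\nabla_\theta \log P_\theta(w \mid h)
  = \nabla_\theta s_\theta(w, h)
  - \sum_{j \in V} P_\theta(j \mid h) \, \nabla_\theta s_\theta(j, h)

% NCE: samples are drawn from a mixture of one data example for k noise examples
P(w \mid h) = \frac{1}{k+1} P_D(w \mid h) + \frac{k}{k+1} P_n(w)

% Posterior probability that a sample w came from the data (C = 1) rather than the noise (C = 0)
P(C = 1 \mid w, h) = \frac{P_\theta(w \mid h)}{P_\theta(w \mid h) + k \, P_n(w)}
```

The second term of the gradient is the expectation that target sampling approximates on a subset of V.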
Ultimately, these three algorithms approximate the negative log-likelihood computed on a number k of negative samples from V, using an easy to sample distribution.
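A minimal NumPy sketch of two of these criteria, for a single prediction, may help to make the difference concrete. The function names and argument layout are assumptions, and the target sampling version omits any importance-weight correction for the proposal distribution.

```python
import numpy as np


def target_sampling_loss(scores, target, subset):
    """Negative log-likelihood with the softmax normalized over `subset` only.

    scores: unnormalized scores for the whole vocabulary, shape (|V|,).
    subset: indices of the sampled vocabulary V_i; must contain `target`.
    """
    subset = np.asarray(subset)
    if target not in subset:
        raise ValueError("the target word must belong to the sampled subset")
    s = scores[subset]
    log_norm = s.max() + np.log(np.exp(s - s.max()).sum())  # stable log-sum-exp
    return log_norm - scores[target]


def nce_loss(score_target, scores_noise, log_pn_target, log_pn_noise, k):
    """Binary cross-entropy of telling one data sample from k noise samples.

    Scores are treated as unnormalized log-probabilities.
    """
    def log_sigmoid(x):
        return -np.logaddexp(0.0, -x)

    # Log-odds of "came from the data": s(w) - log(k * P_n(w)).
    delta_target = score_target - (np.log(k) + log_pn_target)
    delta_noise = scores_noise - (np.log(k) + log_pn_noise)
    return -(log_sigmoid(delta_target) + log_sigmoid(-delta_noise).sum())
```

When the subset is the whole vocabulary, the first function reduces to the usual softmax negative log-likelihood; the second never computes a normalization over V.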

Experiments
Experiments are carried out on Czech, a morphologically rich language using the different criteria described in section 2.2.

Data
We used data from the parallel corpus News-commentary 2015, from the WMT News MT Task. The data consists of 210K word sequences, amounting to about 4.7M tokens. We divided the data into training, development and testing sets, the last two amounting to 150K tokens each. In our experiments, we use different vocabulary sizes by varying the frequency threshold: words are selected when their frequency in the training data is strictly higher than the threshold. Table 2 shows the correspondence between vocabulary sizes and these thresholds.

The lemma and tags decompositions presented in section 2.1.2 were obtained with Morphodita (Straková et al., 2014). There are 12 tag categories for Czech. Vocabulary sizes for characters, lemmas and tags are detailed in table 3.

Table 3 (sub-tag vocabulary sizes): 12, 65, 11, 6, 9, 6, 3, 5, 5, 4, 3, 3
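The vocabulary selection rule can be sketched in a few lines of Python. The function names and the `<unk>` symbol are hypothetical; only the rule itself (keep words whose training frequency is strictly higher than the threshold) comes from the text.

```python
from collections import Counter


def build_vocabulary(sentences, threshold, unk="<unk>"):
    """Keep the words whose training frequency is strictly above `threshold`."""
    counts = Counter(tok for sent in sentences for tok in sent)
    kept = sorted(w for w, c in counts.items() if c > threshold)
    # The unknown token always gets index 0.
    return {w: i for i, w in enumerate([unk] + kept)}


def map_to_vocabulary(sentence, vocab, unk="<unk>"):
    """Replace out-of-vocabulary words by the unknown token."""
    return [tok if tok in vocab else unk for tok in sentence]
```

A threshold of 0 keeps the full training vocabulary, and larger thresholds give smaller vocabularies.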

Setup
The different versions of our model used in experiments are shown in table 4. We used a Highway layer when there is a concatenation of embeddings from different sources, which is the case for almost all architectures. We tried applying a Highway layer to the output representation, but it seemed almost always counter-productive, rendering training more unstable. In all experiments presented here, weights are not tied between input and output representations, since our preliminary experiments with tied weights always gave worse results. Besides, we did not mix structures for character-level representations (for example, using an input CharCNN and an output CharLSTM), since our first experiments gave systematically worse results than using the same structure. When using different types of representations, we kept consistency between vocabularies: if both lemmas and words are used in a model, any lemma considered unknown will have its corresponding word unknown, and conversely. The same (or corresponding) vocabularies are used for inputs, outputs, and evaluation. The only exception is presented in section 4.4. When using a character-based output representation, during evaluation, the unknown token is built from a specific character token, a specific lemma token, and 12 specific tag tokens that are parameters of the model.

Our experiments aim at comparing potential uses of subword-based word representations, and thus are not directed towards performance. For this reason, we used the same implementation for all experiments and did not specifically try to optimize the general model structure or the dimensional hyperparameters, nor did we compare our results with benchmarks on Czech corpora.

Training and evaluation
Language models are evaluated with perplexity, computed over all sequences in the testing data. Perplexity is computed for a fixed output vocabulary V, which allows us to compare models using the same output vocabulary. However, we cannot evaluate model performance on out-of-vocabulary words, since those are to be classified as the unknown token in V.
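Perplexity is the exponential of the average negative log-probability assigned to each token. A minimal sketch, assuming natural logarithms and a normalization by the total number of predicted tokens (the function name is hypothetical):

```python
import math


def perplexity(log_probs):
    """Perplexity from per-token log-probabilities.

    log_probs: list of sentences, each a list of natural-log probabilities
               log P(w_i | w_1 .. w_{i-1}) assigned by the model.
    """
    total, count = 0.0, 0
    for sentence in log_probs:
        total += sum(sentence)
        count += len(sentence)
    if count == 0:
        raise ValueError("no tokens to evaluate")
    # exp of the average negative log-probability per token
    return math.exp(-total / count)
```

For instance, a model that assigns probability 1/4 to every token has a perplexity of 4.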
Our models are implemented with TensorFlow (Abadi et al., 2015). We use the Adam algorithm (Kingma and Ba, 2014) with an initial learning rate of $5 \times 10^{-4}$ for training, over a maximum of 10 epochs, with a batch size of 128 sequences. However, since training is often unstable, the model backtracks to the last checkpoint if it does not improve its performance on validation data after 1/10 of an epoch, and stops training after 10 unsuccessful reloads in a row. To avoid overfitting, we use dropout with probability 0.5 on recurrent layers, and L2 regularization on feedforward layers.
We use two hidden layers, and choose our embedding dimensions so that each type of representation has an embedding dimension of 150. In the case of the CNN, we used filters of 3, 5 and 7 characters, of dimension 30, 50, and 70. Whether we use NCE, BlackOut, or importance sampling, we draw k = 500 noise samples per batch. For all experiments, we report the perplexity on test data at the end of training. Results presented in tables 5, 6 and 7 are the average and the standard deviation of the results obtained on 5 models.
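The backtracking schedule can be sketched as a small control loop. The four callbacks are hypothetical placeholders for training on 1/10 of an epoch, computing validation perplexity, and saving or restoring a checkpoint; they are not part of any library API.

```python
def train_with_backtracking(run_chunk, evaluate, save, load,
                            max_chunks=100, max_failed_reloads=10):
    """Train in chunks, reloading the last checkpoint whenever the
    validation perplexity does not improve.

    max_chunks=100 corresponds to 10 epochs split into tenths.
    Returns the best validation perplexity reached.
    """
    best = evaluate()
    save()
    failures = 0
    for _ in range(max_chunks):
        run_chunk()
        current = evaluate()
        if current < best:
            best = current
            save()
            failures = 0          # the counter only tracks reloads in a row
        else:
            load()                # backtrack to the last checkpoint
            failures += 1
            if failures >= max_failed_reloads:
                break
    return best
```

Resetting the counter after each improvement is what makes the stopping rule apply to consecutive unsuccessful reloads only.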

Influence of the vocabulary size
We first train our model with different vocabulary sizes. As shown in figure 2, our model fails to improve upon the conventional word model when the output vocabulary size is relatively small (shown on the two leftmost graphs). More precisely, models that use word and character-based representations at the output seem unable to learn after a couple of iterations. We first link this behaviour to the difficulty met by the authors of (Józefowicz et al., 2016): since most logits are tied when we use an output character-based representation (as opposed to independently learned word embeddings), the function mapping from word to word representation is smoother and training becomes more difficult. They used a smaller learning rate and a low-dimensional correction factor, learned for each word, as a work-around.
However, increasing the vocabulary size reduces this effect. This is especially clear with the whole training vocabulary (on the rightmost graph of figure 2): in this setup, using a character-based representation improves the performance of the model. We can assume that, for rare words, learning independent embeddings fails because these embeddings are updated too rarely. For rare words, combining word and character-based embeddings allows the model to better counteract the sparsity issue.

Choice of the training criterion
Given the previous results, we use the full training vocabulary to assess the impact of the training criterion. However, using this full training vocabulary renders training very unstable, especially with sampling-based algorithms. Stability issues, especially for noise contrastive estimation, have previously been discussed (Chen et al., 2016; Józefowicz et al., 2016). We briefly experimented to choose the most practical criterion to use. Figure 3 shows the shape of the training curves. While target sampling and BlackOut both seem to work properly, NCE needs far more noise samples to converge. We believe this is related to the TensorFlow implementation, which reuses the same noise samples for every example in the batch, leading to a lack of diversity in negative examples. Increasing the number of samples or reducing the size of the batch are possible solutions, but they increase the training time. BlackOut obtains better results because, while very similar to NCE, the scores used as coming from the noise distribution are context-dependent, which brings diversity to negative examples.
Overall, across several experiments, target sampling performs better than BlackOut, and we choose to use it for the rest of our experiments. Since training is still quite unstable, depending on the architecture, we report results over 5 training runs in the next sections.

Table 5 gathers the main experimental results to assess which combination of input and output representations gives the best performance. For any input representation, augmenting the output representation with a character-based embedding improves the performance of the model. This is especially true for convolutional layers. We can also notice that the improvement is larger for models that performed badly with basic output word embeddings. Overall, biLSTMs perform worse than their convolution/concatenation counterparts. Finally, the best average perplexity of 495 for word-only output representations is improved to an average of 411 for augmented output representations.

Effects of the representation choice
Other output representations: First, our experiments with only character-based embeddings as output representations give results far worse than those reported in table 5, with our best model obtaining an average perplexity of ≈ 2500. Training is also far more unstable. We believe these results are linked to the difficulties mentioned in (Józefowicz et al., 2016) and in section 4.1.
We also tried to use the lemma+tags decomposition presented in section 2.1.2, but without success. When tags were ambiguous across several occurrences of the same word, we tried using specific tokens, or choosing the most frequent tags, but in both cases the model severely overfits.
Finally, we tried to use word embeddings pretrained with word2vec (Mikolov et al., 2013) as output representations. We obtained results very similar to those of classical word embeddings, with a small but noticeable improvement when the input representation used an LSTM. However, these improvements are still well under those obtained by augmenting the output representation with a character-based embedding.

Corresponding vocabulary sizes are given in table 2. Test perplexities are given for all words, frequent words (frequency > 10) and rare words (frequency < 10). In bold are the best models for a given input representation.
Following our observations in section 4.1, we then assess the effect of reducing the word vocabulary size for the Words+CharCNN output representation. We do not change the size of the event space: when constructing output representations for words under a chosen frequency, we simply do not use the word representation. For example, using a threshold of f W T h = 10 means that words that appear less than ten times will not have their own word embedding, and will be represented by the unknown word token combined with their character-based representation. Results are shown in table 6. We can see that for all input representations, using a specific unknown token in place of a specific word embedding for words appearing less than 5 times in the training data gives the best performance. Reducing the look-up table to words appearing more than 10 times gives worse results, although they are still better than those obtained with the full table. However, there is no clear trend when looking at the perplexities of rare words, which are very hard to interpret given their very high standard deviation. With a smaller output word look-up table, our best average perplexity of 411 is reduced to 376, which is a very sizeable overall improvement.

BiLSTM 232 ± 5 203 ± 6 212 ± 6
Table 7: Test perplexities averaged on 5 models on lemmas with a multiple-objective cost function. Results are given for various input/output representations. In bold are the best models for a given output representation.
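The reduced look-up table can be illustrated with a short sketch: every word keeps its own output class, but rare words share the word-level vector of the unknown token and differ only through their character-based part. All names are hypothetical, `char_embed` stands for any character-based encoder, and embeddings are stored as rows for convenience.

```python
import numpy as np


def output_representation(word, counts, word_table, word_index, char_embed,
                          threshold, unk="<unk>"):
    """Concatenate a word-level vector with a character-based vector.

    Words seen fewer than `threshold` times in training use the row of the
    unknown token instead of a row of their own.
    """
    key = word if counts.get(word, 0) >= threshold else unk
    word_part = word_table[word_index[key]]
    # The character-based part is always computed from the actual word.
    return np.concatenate([word_part, char_embed(word)])
```

The event space is unchanged: two rare words still get distinct output vectors, since their character-based parts differ.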

Predicting root and tags jointly
While using the lemma+tags decomposition to build output representations was not successful in our experiments, we investigated a factorised prediction of lemma and tags. We used different costs for predicting the lemma and each tag, which are summed into a final objective function. As recently seen in (Martinez et al., 2016; Burlot and Yvon, 2017), these objectives are individually easier when working with morphologically-rich languages, and fully inflected words can be obtained by using morphological inflection models, which have been shown to be quite successful (Faruqui et al., 2016; Kann et al., 2017). Table 7 shows the test perplexities on lemmas for various input and output representations. We can observe that in all cases training is far more stable, with generally lower standard deviations. In this case, using either a lemma+tags input representation with a BiLSTM or a Words+CharCNN input representation gives the best results, while augmenting the output representation of the lemma with a character-built embedding also improves results. This makes the joint learning of a factored prediction and reinflection language model a very interesting direction for future work.
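The summed objective can be sketched as one cross-entropy for the lemma plus one per tag category. The function names are hypothetical and the unweighted sum is an assumption about how the individual costs are combined.

```python
import numpy as np


def cross_entropy(logits, target):
    """Negative log-probability of `target` under a softmax over `logits`."""
    shifted = logits - logits.max()            # for numerical stability
    return np.log(np.exp(shifted).sum()) - shifted[target]


def factored_loss(lemma_logits, lemma_target, tag_logits, tag_targets):
    """Sum of the lemma cost and of one cost per tag category."""
    loss = cross_entropy(lemma_logits, lemma_target)
    for logits, target in zip(tag_logits, tag_targets):
        loss += cross_entropy(logits, target)
    return loss
```

Each tag softmax runs over a handful of classes, so the added cost is small compared with the lemma prediction.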

Conclusion
We described a neural language model allowing the use of subword units for both input and output word representations. While in our experiments training with a full vocabulary is unstable, we can identify important trends: augmenting output representations with character-based embeddings improves the model performance, and in this setup, replacing independent word embeddings by the unknown token for rare words yields further improvement. It is worth noting that this also opens the vocabulary, since our model can be used to rescore unknown words. Additional experiments suggest that factoring the output of the model with a lemma+tags decomposition, then re-inflecting these into words, could make generation easier: this is a direction we plan to investigate.