Gated Word-Character Recurrent Language Model

We introduce a recurrent neural network language model (RNN-LM) with long short-term memory (LSTM) units that utilizes both character-level and word-level inputs. Our model has a gate that adaptively finds the optimal mixture of the character-level and word-level inputs. The gate creates the final vector representation of a word by combining two distinct representations of the word. The character-level inputs are converted into vector representations of words using a bidirectional LSTM. The word-level inputs are projected into another high-dimensional space by a word lookup table. The final vector representations of words are used in the LSTM language model which predicts the next word given all the preceding words. Our model with the gating mechanism effectively utilizes the character-level inputs for rare and out-of-vocabulary words and outperforms word-level language models on several English corpora.


Introduction
Recurrent neural networks (RNNs) achieve state-of-the-art performance on fundamental tasks of natural language processing (NLP) such as language modeling (RNN-LM) (Józefowicz et al., 2016; Zoph et al., 2016). RNN-LMs are usually based on word-level or subword-level information such as characters (Mikolov et al., 2012), and predictions are made at the word level or subword level respectively.
In word-level LMs, the probability distribution over the vocabulary conditioned on preceding words is computed at the output layer using a softmax function. Word-level LMs require a predefined vocabulary size since the computational complexity of a softmax function grows with respect to the vocabulary size. This closed-vocabulary approach tends to ignore rare words and typos, as words that do not appear in the vocabulary are replaced with an out-of-vocabulary (OOV) token. The words appearing in the vocabulary are indexed and associated with high-dimensional vectors. This process is done through a word lookup table.
Although this approach brings a high degree of freedom in learning expressions of words, information about morphemes such as prefixes, roots, and suffixes is lost when a word is converted into an index. Also, word-level language models require some heuristics to differentiate between OOV words; otherwise they assign exactly the same vector to all OOV words. These are the major limitations of word-level LMs.
In order to alleviate these issues, we introduce an RNN-LM that utilizes both character-level and word-level inputs. In particular, our model has a gate that adaptively chooses between two distinct ways to represent each word: a word vector derived from the character-level information and a word vector stored in the word lookup table. This gate is trained to make this decision based on the input word.
According to the experiments, our model with the gate outperforms other models on the Penn Treebank (PTB), BBC, and IMDB Movie Review datasets. Also, the trained gating values show that the gating mechanism effectively utilizes the character-level information when it encounters rare words.
Related Work Character-level language models that make word-level predictions have recently been proposed. Ling et al. (2015a) introduce the compositional character-to-word (C2W) model that takes as input the character-level representation of a word and generates a vector representation of the word using a bidirectional LSTM (Graves and Schmidhuber, 2005). Kim et al. (2015) propose a convolutional neural network (CNN) based character-level language model and achieve state-of-the-art perplexity on the PTB dataset with significantly fewer parameters.
Moreover, word-character hybrid models have been studied on different NLP tasks. Kang et al. (2011) apply a word-character hybrid language model to Chinese using a neural network language model (Bengio et al., 2003). Santos and Zadrozny (2014) produce high-performance part-of-speech taggers using a deep neural network that learns character-level representations of words and associates them with usual word representations. Bojanowski et al. (2015) investigate RNN models that predict characters based on character- and word-level inputs. Luong and Manning (2016) present word-character hybrid neural machine translation systems that consult the character-level information for rare words.

Model Description
The model architecture of the proposed word-character hybrid language model is shown in Fig. 1.
Word Embedding At each time step t, both the word lookup table and a bidirectional LSTM take the same word w_t as an input. The word-level input is projected into a high-dimensional space by a word lookup table E ∈ R^{|V|×d}, where |V| is the vocabulary size and d is the dimension of a word vector:

x^word_{w_t} = E^T w_{w_t},  (1)

where w_{w_t} ∈ R^{|V|} is a one-hot vector whose i-th element is 1 and whose other elements are 0. The character-level input is converted into a word vector by using a bidirectional LSTM. The last hidden states of the forward and reverse recurrent networks are linearly combined:

x^char_{w_t} = W^f h^f_{w_t} + W^r h^r_{w_t} + b,  (2)

where h^f_{w_t}, h^r_{w_t} ∈ R^d are the last states of the forward and the reverse LSTM respectively, W^f, W^r ∈ R^{d×d} and b ∈ R^d are trainable parameters, and x^char_{w_t} ∈ R^d is the vector representation of the word w_t using a character input. The generated vectors x^word_{w_t} and x^char_{w_t} are mixed by a gate g_{w_t}:

g_{w_t} = σ(v_g^T x^word_{w_t} + b_g)
x_{w_t} = (1 − g_{w_t}) x^word_{w_t} + g_{w_t} x^char_{w_t},  (3)

where v_g ∈ R^d is a weight vector, b_g ∈ R is a bias scalar, and σ(·) is a sigmoid function. This gate value is independent of the time step: even if a word appears in different contexts, the same gate value is applied. Hashimoto and Tsuruoka (2016) apply a very similar approach to compositional and non-compositional phrase embeddings and achieve state-of-the-art results on compositionality detection and verb disambiguation tasks.

Figure 1: The model architecture of the gated word-character recurrent language model. w_t is an input word at t. x^word_{w_t} is a word vector stored in the word lookup table. x^char_{w_t} is a word vector derived from the character-level input. g_{w_t} is a gating value of a word w_t. ŵ_{t+1} is a prediction made at t.
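The gating computation can be sketched in a few lines of NumPy; the function name and the random example inputs below are ours, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_word_char(x_word, x_char, v_g, b_g):
    # Scalar gate computed from the word-level vector; the same
    # gate value applies to a word wherever it occurs.
    g = sigmoid(v_g @ x_word + b_g)
    return (1.0 - g) * x_word + g * x_char, g

rng = np.random.default_rng(0)
d = 200
x_word, x_char = rng.normal(size=d), rng.normal(size=d)
x, g = gated_word_char(x_word, x_char, rng.normal(size=d) * 0.01, 0.0)
assert x.shape == (d,) and 0.0 < g < 1.0
```

With a zero weight vector and zero bias the gate sits at 0.5, mixing the two representations equally; training moves it per word.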
Language Modeling The output vector x_{w_t} is used as an input to an LSTM language model. Since the word embedding part is independent of the language modeling part, our model retains the flexibility to change the architecture of the language modeling part. We use an architecture similar to the non-regularized LSTM model by Zaremba et al. (2014). One step of LSTM computation corresponds to:

f_t = σ(W^f x_{w_t} + U^f h_{t−1} + b^f)
i_t = σ(W^i x_{w_t} + U^i h_{t−1} + b^i)
c̃_t = tanh(W^c̃ x_{w_t} + U^c̃ h_{t−1} + b^c̃)
o_t = σ(W^o x_{w_t} + U^o h_{t−1} + b^o)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t),  (4)

where W^s, U^s ∈ R^{d×d} and b^s ∈ R^d for s ∈ {f, i, c̃, o} are parameters of LSTM cells, σ(·) is an element-wise sigmoid function, tanh(·) is an element-wise hyperbolic tangent function, and ⊙ is an element-wise multiplication. The hidden state h_t is affine-transformed, followed by a softmax function:

p(w_{t+1} = k | w_{≤t}) = exp(v_k^T h_t + b_k) / Σ_{k′=1}^{|V|} exp(v_{k′}^T h_t + b_{k′}),  (5)

where v_k is the k-th column of a parameter matrix V ∈ R^{d×|V|} and b_k is the k-th element of a bias vector b ∈ R^{|V|}. In the training phase, we minimize the negative log-likelihood with stochastic gradient descent.
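As a concrete reference, one LSTM step and the softmax readout can be written out in NumPy. This is a sketch with the four gate parameter sets stacked into single matrices; the variable names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W, U: (4d x d) stacked input/recurrent weights; b: (4d,) bias,
    # ordered as forget, input, candidate, output.
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f, i, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[3*d:])
    c_tilde = np.tanh(z[2*d:3*d])
    c = f * c_prev + i * c_tilde   # new cell state
    h = o * np.tanh(c)             # new hidden state
    return h, c

def next_word_probs(h, V, b_out):
    # Affine transform of h followed by a numerically stable softmax.
    logits = V.T @ h + b_out
    e = np.exp(logits - logits.max())
    return e / e.sum()
```

Stacking the gates into one matrix multiplication is a common implementation choice; it does not change the computation in Eq. (4).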

Experimental Settings
We test five different model architectures on three English corpora. Each model has a unique word embedding method, but all models share the same LSTM language modeling architecture, which has two LSTM layers with 200 hidden units (d = 200). Except for the character-only model, weights in the language modeling part are initialized with uniform random variables between −0.1 and 0.1. Weights of the bidirectional LSTM in the word embedding part are initialized with Xavier initialization (Glorot and Bengio, 2010). All biases are initialized to zero.
Stochastic gradient descent (SGD) with a mini-batch size of 32 is used to train the models. In the first k epochs, the learning rate is 1. After the k-th epoch, the learning rate is divided by l each epoch. k determines the learning-rate decay schedule, and l controls the speed of decay. k and l are tuned for each model based on the validation dataset.
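The schedule can be written as a small helper; the 1-indexing of epochs is our assumption:

```python
def learning_rate(epoch, k, l):
    # 1.0 for the first k epochs, then divided by l after each
    # subsequent epoch, as described above.
    if epoch <= k:
        return 1.0
    return 1.0 / (l ** (epoch - k))

# With k = 5 and l = 2: epochs 1-5 use 1.0, epoch 6 uses 0.5, epoch 7 uses 0.25.
```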
As the standard metric for language modeling, perplexity (PPL) is used to evaluate model performance. Perplexity over the test set is computed as

PPL = exp( −(1/N) Σ_{i=1}^{N} log p(w_i | w_{<i}) ),

where N is the number of words in the test set, and p(w_i | w_{<i}) is the conditional probability of a word w_i given all the preceding words in a sentence. We use Theano (2016) to implement all the models. The code for the models is available from https://github.com/nyu-dl/gated_word_char_rlm.
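Given per-word natural-log probabilities over the test set, the perplexity above is computed as:

```python
import math

def perplexity(log_probs):
    # PPL = exp(-(1/N) * sum_i log p(w_i | w_<i))
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 0.1 to every word has PPL 10.
```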

Model Variations
Word Only (baseline) This is a traditional word-level language model and is a baseline model for our experiments.
Character Only This is a language model where each input word is represented as a character sequence.

Word & Character This model simply concatenates the vector representations of a word constructed from the character input x^char_{w_t} and the word input x^word_{w_t} to get the final representation of a word x_{w_t}, i.e.,

x_{w_t} = [x^char_{w_t} ; x^word_{w_t}].

Before being concatenated, the dimensions of x^char_{w_t} and x^word_{w_t} are reduced by half to keep the size of x_{w_t} comparable to the other models.
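A sketch of the concatenation variant; the text does not say how the dimensions are halved, so we assume learned linear projections to d/2:

```python
import numpy as np

def concat_word_char(x_word, x_char, P_word, P_char):
    # Project each representation to d/2 with (assumed) linear maps,
    # then concatenate into a d-dimensional vector [x_char; x_word].
    return np.concatenate([P_char @ x_char, P_word @ x_word])
```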

Gated Word & Character, Fixed Value
This model uses a globally constant gating value to combine the vector representations of a word constructed from the character input x^char_{w_t} and the word input x^word_{w_t}:

x_{w_t} = (1 − g) x^word_{w_t} + g x^char_{w_t},

where g is a constant between 0 and 1. We choose g ∈ {0.25, 0.5, 0.75}.

Gated Word & Character, Adaptive
This model uses adaptive gating values to combine the vector representations of a word constructed from the character input x^char_{w_t} and the word input x^word_{w_t}, as in Eq. (3).

Datasets
Penn Treebank We use the Penn Treebank Corpus (Marcus et al., 1993) preprocessed by Mikolov et al. (2010). We use the 10k most frequent words and 51 characters. In the training phase, we use only sentences with fewer than 50 words.
BBC We use the BBC corpus prepared by Greene and Cunningham (2006). We use the 10k most frequent words and 62 characters. In the training phase, we use sentences with fewer than 50 words.

IMDB Movie Reviews
We use the IMDB Movie Review Corpus prepared by Maas et al. (2011). We use the 30k most frequent words and 74 characters. In the training phase, we use sentences with fewer than 50 words. In the validation and test phases, we use sentences with fewer than 500 characters.

Pre-training
For the word-character hybrid models, we apply a pre-training procedure to encourage the model to use both representations. The entire model is trained using only the word-level input for the first m epochs and only the character-level input for the next m epochs. In the first m epochs, the learning rate is fixed at 1, and a smaller learning rate of 0.1 is used in the next m epochs. After the 2m-th epoch, both the character-level and the word-level inputs are used. We use m = 2 for PTB and BBC, and m = 1 for IMDB. Lample et al. (2016) report that a pre-trained word lookup table improves the performance of their word & character hybrid model on named entity recognition (NER). In their method, word embeddings are first trained using skip-n-gram (Ling et al., 2015b), and then the word embeddings are fine-tuned in the main training phase.
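The pre-training schedule can be summarized as a helper function; the epoch 1-indexing and the function name are ours:

```python
def input_mode(epoch, m):
    # Word-only input (lr 1.0) for epochs 1..m, character-only input
    # (lr 0.1) for epochs m+1..2m, then both inputs in the main phase.
    if epoch <= m:
        return "word", 1.0
    if epoch <= 2 * m:
        return "char", 0.1
    return "both", None  # main phase: lr follows the decay schedule
```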

Values of Word-Character Gate
The BBC and IMDB datasets retain out-of-vocabulary (OOV) words, while the OOV words have been replaced by <unk> in the Penn Treebank dataset. On the BBC and IMDB datasets, our model assigns significantly higher gating values to the unknown word token UNK than to the other words.
We observe that pre-training results in different distributions of gating values. As can be seen in Fig. 2 (a), the gating values trained in the gated word & character model without pre-training are in general higher for less frequent words, implying that the recurrent language model has learned to exploit the spelling of a word when its word vector could not be estimated properly. Fig. 2 (b) shows that the gating values trained in the gated word & character model with pre-training are less correlated with the frequency ranks than those without pre-training. The pre-training step initializes the word lookup table using the training corpus and includes its information in the initial values. We hypothesize that the recurrent language model tends to be word-input-oriented if the informativeness of the word inputs and character inputs is not balanced, especially in the early stage of training.
Although the recurrent language model with or without pre-training learns different gating values, the results are still similar. We conjecture that the flexibility of modulating between word-level and character-level representations results in a better language model in multiple ways.
Overall, the gating values are small. However, this does not mean the model does not utilize the character-level inputs. We observe that the word vectors constructed from the character-level inputs usually have a larger L2 norm than the word vectors constructed from the word-level inputs. For instance, the mean L2 norms over the 1000 most frequent words in the IMDB training set are 52.77 and 6.27, respectively. The small gate values compensate for this difference.
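Using the mean norms quoted above, the gate value that would make the two contributions equal in magnitude works out to roughly 0.11, consistent with the small gate values observed:

```python
norm_char = 52.77  # mean L2 norm of character-derived word vectors (from the text)
norm_word = 6.27   # mean L2 norm of lookup-table word vectors

# Solve g * norm_char = (1 - g) * norm_word for the balancing gate value.
g_balance = norm_word / (norm_char + norm_word)
assert 0.10 < g_balance < 0.11
```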

Conclusion
We introduced a recurrent neural network language model with LSTM units and a word-character gate.
Our model was empirically found to utilize the character-level input especially when the model encounters rare words. The experimental results suggest that the gate can be efficiently trained so that the model finds a good balance between the word-level and character-level inputs.