Incorporating Side Information into Recurrent Neural Network Language Models

Recurrent neural network language models (RNNLMs) have recently demonstrated vast potential in modelling long-term dependencies for NLP problems, ranging from speech recognition to machine translation. In this work, we propose methods for conditioning RNNLMs on external side information, e.g., metadata such as keywords, description, document title or topic headline. Our experiments show consistent improvements of RNNLMs using side information over the baselines for two different datasets and genres in two languages. Interestingly, we found that side information in a foreign language can be highly beneficial in modelling texts in another language, serving as a form of cross-lingual language modelling.


Introduction
Neural network approaches to language modelling (LM) have made remarkable performance gains over traditional count-based n-gram LMs (Bengio et al., 2003; Mnih and Hinton, 2007; Mikolov et al., 2011). They offer several desirable characteristics, including the capacity to generalise over large vocabularies through the use of vector space representations and, for recurrent models (Mikolov et al., 2011), the ability to encode long distance dependencies that cannot be captured with the limited context windows used in conventional n-gram LMs. These early papers have spawned a cottage industry in neural-LM-based applications where text generation is a key component, including conditional language models for image captioning (Kiros et al., 2014; Vinyals et al., 2015) and neural machine translation (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015).
Inspired by this work on conditioning LMs on complex side information, such as images and foreign text, in this paper we investigate the possibility of improving LMs in a more traditional setting, namely when they are applied directly to text documents. Corpora typically include rich side information, such as document titles, authorship, time stamps and keywords, although this information is usually discarded when applying statistical models. However, it can be highly informative: for instance, keywords, titles or descriptions often state central topics, which helps in modelling or understanding the document text. We propose mechanisms for encoding this side information into a vector space representation, and means of incorporating it into the generation process in an RNNLM framework. Evaluating on two corpora and two different languages, we show consistently significant perplexity reductions over state-of-the-art RNNLM models.
The contributions of this paper are as follows:
1. We propose a framework for encoding structured and unstructured side information, and its incorporation into an RNNLM.
2. We introduce a new corpus, the RIE corpus, based on the Europarl web archive, with rich annotations of several types of meta-data.
3. We provide empirical analysis showing consistent improvements from using side information across two datasets in two languages.

Problem Formulation & Model
We first review the RNNLM architecture (Mikolov et al., 2011) before describing our extension in §2.2.

RNNLM Architecture
The standard RNNLM consists of 3 main layers: an input layer, where each input word, coded as a one-hot vector, is mapped to its embedding; a hidden layer consisting of recurrent units, where a state is conditioned recursively on past states; and an output layer, where a target word is predicted. The RNNLM has an advantage over conventional n-gram language models in modelling long distance dependencies effectively. In general, an RNN operates from left to right over the input word sequence; i.e.,
$$h_t = \mathrm{RU}(x_t, h_{t-1}) \equiv f\left(W^{(hh)} h_{t-1} + W^{(ih)} x_t + b^{(h)}\right)$$
$$w_{t+1} \sim \mathrm{softmax}\left(W^{(ho)} h_t + b^{(o)}\right)$$
where $x_t$ is the embedding of the input word $w_t$; $f(\cdot)$ is a non-linear function, e.g., tanh, applied element-wise to its vector input; $h_t$ is the current RNN hidden state at time-step $t$; and matrices $W$ and vectors $b$ are model parameters. The model is trained using gradient-based methods to optimise a (regularised) training objective, e.g. the likelihood function. In principle, a recurrent unit (RU) can be implemented with different variants of recurrent structures, such as Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), Gated Recurrent Unit (GRU) (Cho et al., 2014), or more recent deeper structures, e.g. Depth Gated Long Short Term Memory (DGLSTM), a stack of LSTMs with extra connections between memory cells in deep layers (Yao et al., 2015). DGLSTM can be regarded as a generalisation of the LSTM recurrence to both time and depth. Such a deep recurrent structure may capture long distance patterns at their most general. Empirically, we found that the RNNLM with DGLSTM structure is the best performer across our datasets, and it is therefore used predominantly in our experiments.
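To make the three layers concrete, the following is a minimal NumPy sketch of one left-to-right step of a plain tanh RNNLM. It is an illustration under stated assumptions, not the DGLSTM model used in the experiments: the parameter names (`E`, `W_ih`, `W_hh`, `W_ho`, `b_h`, `b_o`), the initialisation and the helper functions are all hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def init_params(vocab_size, dim, seed=0):
    """Small random parameters; names and scale are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    return {
        "E": rng.normal(0.0, 0.1, (vocab_size, dim)),     # input embeddings
        "W_ih": rng.normal(0.0, 0.1, (dim, dim)),         # input -> hidden
        "W_hh": rng.normal(0.0, 0.1, (dim, dim)),         # hidden -> hidden
        "b_h": np.zeros(dim),
        "W_ho": rng.normal(0.0, 0.1, (vocab_size, dim)),  # hidden -> output
        "b_o": np.zeros(vocab_size),
    }

def rnn_lm_step(word_id, h_prev, params):
    """Embed the current word, update the hidden state, predict the next word."""
    x_t = params["E"][word_id]  # equivalent to one-hot vector times E
    h_t = np.tanh(params["W_hh"] @ h_prev + params["W_ih"] @ x_t + params["b_h"])
    p_next = softmax(params["W_ho"] @ h_t + params["b_o"])
    return h_t, p_next

def sentence_log_likelihood(word_ids, params):
    """Sum of log-probabilities of each word given the preceding words."""
    h = np.zeros(params["b_h"].shape)
    total = 0.0
    for current, following in zip(word_ids[:-1], word_ids[1:]):
        h, p_next = rnn_lm_step(current, h, params)
        total += np.log(p_next[following])
    return total
```

Replacing the `np.tanh` update with an LSTM, GRU or DGLSTM cell changes only the body of `rnn_lm_step`; the input and output layers stay the same.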

Incorporating Side Information
Nowadays, many corpora are archived with side information or contextual meta-data. In this work, we argue that such information can be useful for language modelling (and presumably other NLP tasks). By providing this auxiliary information directly to the RNNLM, we stand to boost language modelling performance. The first question in using side information is how to encode these unstructured inputs, $y$, into a vector representation, denoted $e$. We discuss several methods for encoding the auxiliary vector: BOW, an additive bag of words, $e = \sum_t y_t$, and average, the average embedding vector, $e = \frac{1}{T} \sum_t y_t$, both inspired by (Hermann and Blunsom, 2014a); bigram, a convolution with sum-pooling, $e = \sum_t \tanh(y_{t-1} + y_t)$ (Hermann and Blunsom, 2014b); and RNN, a recurrent neural network over the word sequence (Sutskever et al., 2014), using the final hidden state(s) as $e$. Of these methods, we found that BOW worked consistently well, outperforming the other approaches, and moreover led to a simpler model with faster training. For this reason we report only results for the BOW encoding. Note that when using multiple auxiliary inputs, we use a weighted combination of the individual encodings.
The next step is the integration of $e$ into the RNNLM. We consider two integration methods: as input to the hidden state (denoted input), and connected to the output softmax layer (output), as shown in Figure 1a and 1b, respectively. In both cases, we compare experimentally the following integration strategies: add, adding the vectors together, e.g., using $x_t + e$ as the input to the RNN, such that $h_t = \mathrm{RU}(x_t + e, h_{t-1})$; stack, concatenating the vectors, e.g., using $[x_t; e]$ for generating the RNN hidden state, such that $h_t = \mathrm{RU}([x_t; e], h_{t-1})$; and mlp, feeding both vectors into an extra perceptron with a single hidden layer, using a tanh nonlinearity and projecting the output to the required dimensionality. Note that add requires the vectors to have the same dimensionality, while the other two methods do not.
The stack method can be quite costly, given that it increases the size of several matrices, either in the recurrent unit (for input) or in the output mapping for word generation. This is a problem in the latter case: given the large size of the vocabulary, the matrix $W^{(ho)}$ is already very large, and making it larger (doubling its size) has a sizeable effect on training time (and presumably also on the propensity to over-fit). The output+stack method does, however, have a compelling interpretation as a jointly trained product model between an RNNLM and a unigram model conditioned on the side information, where both models are formulated as softmax classifiers. Considered as a product model (Hinton, 2002; Pascanu et al., 2013), the two components can each concentrate on the aspects of the problem where the other model is not confident, and each model has the ability to 'veto' certain outputs by assigning them a low probability.
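The non-recurrent encoders and the three integration strategies described above can be sketched in a few lines of NumPy. This is a hedged illustration: the function names are hypothetical, `Y` is assumed to hold one auxiliary word embedding per row, and `integrate_mlp` shows one possible reading of the mlp strategy (a single tanh layer over the concatenated vectors whose output size is the required dimensionality), since the exact parameterisation is not reproduced here.

```python
import numpy as np

def encode_bow(Y):
    """BOW: sum of the auxiliary word embeddings (rows of Y)."""
    return Y.sum(axis=0)

def encode_average(Y):
    """average: mean of the auxiliary word embeddings."""
    return Y.mean(axis=0)

def encode_bigram(Y):
    """bigram: sum over t of tanh(y_{t-1} + y_t), i.e. convolution with sum-pooling."""
    return np.tanh(Y[:-1] + Y[1:]).sum(axis=0)

def integrate_add(x_t, e):
    """add: element-wise sum; x_t and e must have the same dimensionality."""
    return x_t + e

def integrate_stack(x_t, e):
    """stack: concatenation; downstream weight matrices grow accordingly."""
    return np.concatenate([x_t, e])

def integrate_mlp(x_t, e, W, b):
    """mlp (assumed form): tanh layer over [x_t; e], projecting to len(b) dimensions."""
    return np.tanh(W @ np.concatenate([x_t, e]) + b)
```

With the input method, the returned vector replaces `x_t` as the argument of the recurrent unit; with the output method, the same functions would be applied to the hidden state `h_t` before the softmax layer.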
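The product-model reading of output+stack can be checked numerically. A softmax over the concatenation $[h_t; e]$ has logits $W_h h_t + W_e e + b$, and exponentiating a sum of logits gives a product of exponentials, so the result equals the renormalised product of a softmax over $h_t$ and a softmax over $e$. The sketch below assumes hypothetical names (`W_h`, `W_e`, `b`) for the two blocks of the enlarged output matrix and its bias.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - z.max()
    p = np.exp(z)
    return p / p.sum()

def output_stack(h_t, e, W_h, W_e, b):
    """Softmax over [h_t; e] using the enlarged output matrix [W_h, W_e]."""
    W = np.hstack([W_h, W_e])
    return softmax(W @ np.concatenate([h_t, e]) + b)

def product_of_models(h_t, e, W_h, W_e, b):
    """Renormalised product of an RNNLM softmax and a side-information softmax."""
    p_rnn = softmax(W_h @ h_t + b)   # distribution from the hidden state
    p_side = softmax(W_e @ e)        # unigram-style distribution from e alone
    product = p_rnn * p_side
    return product / product.sum()

# Random illustrative values; sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
vocab_size, dim_h, dim_e = 7, 4, 3
h_t, e = rng.normal(size=dim_h), rng.normal(size=dim_e)
W_h = rng.normal(size=(vocab_size, dim_h))
W_e = rng.normal(size=(vocab_size, dim_e))
b = rng.normal(size=vocab_size)

same = np.allclose(output_stack(h_t, e, W_h, W_e, b),
                   product_of_models(h_t, e, W_h, W_e, b))
```

Because the two distributions are multiplied, a word that either component assigns near-zero probability ends up with near-zero probability overall, which is the 'veto' behaviour mentioned above.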

Experiments
Datasets. We conducted our experiments on two datasets with different genres in two languages. As the first dataset, we use the IWSLT2014 MT track on TED Talks, chosen for its self-contained rich auxiliary information, including title, description, keywords, and author-related information. We chose the English-French pair for our experiments. The statistics of the training set are shown in Table 1. We used dev2010 (7 talks/817 sentences) for early stopping when training neural network models. For evaluation, we used test sets from several years: tst2010 (10 talks/1587 sentences), tst2011 (7/768) and tst2012 (10/1083). As the second dataset, we crawled the entire European Parliament website, focusing on plenary sessions. Such sessions contain useful structural information, namely multilingual texts divided into speaker sessions and topics. We believe that those texts are interesting and challenging for language modelling tasks. Our dataset contains 724 plenary sessions over 12.5 years until June 2011, with multilingual texts in 22 languages. We refer to this dataset as RIE (Rich Information Europarl). We randomly select 200/5/30 plenary sessions as the training/development/test sets, respectively. We believe that the new data, including side information, pose another challenge for language modelling. Furthermore, the sizes of our working datasets are an order of magnitude larger than the standard Penn Treebank set, which is often used for evaluating neural language models.
Set-up and Baselines. We used cnn to implement our models. We use the same configuration for all neural models: 512 input embedding and hidden layer dimensions, 2 hidden layers, and vocabulary sizes as given in Table 1. We used the same vocabulary for the auxiliary and modelled text. We trained a conventional 5-gram language model using modified Kneser-Ney smoothing, with the KenLM toolkit (Heafield, 2011). We used the Wilcoxon signed-rank test (Wilcoxon, 1945) to measure the statistical significance (p < 0.05) of differences between the sentence-level perplexity scores of improved models and those of the best baseline; in the result tables, bold marks results that are statistically significantly better than the best baseline. Throughout our experiments, punctuation, stop words and sentence markers (〈s〉, 〈/s〉, 〈unk〉) are filtered out of all auxiliary inputs. We observed that this filtering was required for BOW to work reasonably well. For each model, the best perplexity score on the development set is used for early stopping of training, which occurred after 2-5 and 2-3 epochs on the TED Talks and RIE datasets, respectively.
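The significance test on sentence-level perplexities can be sketched with SciPy's `scipy.stats.wilcoxon`. Several points here are assumptions rather than details from the experiments: the helper names are hypothetical, the test is run one-sided (the text does not say whether a one- or two-sided test was used), and the perplexity values are synthetic numbers generated purely for illustration.

```python
import numpy as np
from scipy.stats import wilcoxon

def sentence_perplexity(token_log_probs):
    """Perplexity of one sentence from its per-token natural-log probabilities."""
    return float(np.exp(-np.mean(token_log_probs)))

def significantly_better(ppl_model, ppl_baseline, alpha=0.05):
    """Paired Wilcoxon signed-rank test on per-sentence perplexities.

    One-sided alternative (an assumption): the model's perplexities tend to
    be lower than the baseline's on the same sentences.
    """
    result = wilcoxon(ppl_model, ppl_baseline, alternative="less")
    return bool(result.pvalue < alpha), float(result.pvalue)

# Synthetic, made-up perplexities for 200 sentences (NOT results from the paper).
rng = np.random.default_rng(0)
baseline_ppl = rng.lognormal(mean=4.5, sigma=0.5, size=200)
model_ppl = baseline_ppl * np.exp(rng.normal(-0.05, 0.05, size=200))

is_better, p_value = significantly_better(model_ppl, baseline_ppl)
```

The test is paired, so both arrays must list perplexities for the same sentences in the same order.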
Results & Analysis. The perplexity results on the TED Talks dataset are presented in Table 2. We consider three kinds of side information: keywords, title and description. We attempted to inject these into different RNNLM layers, resulting in the model variants shown in Table 2. First, we chose "keywords" (+k) as an anchor to determine which incorporation method works well. Among the variants compared (including input+add+k, input+mlp+k and input+stack+k), the largest decrease is obtained by output+mlp+k, consistently across all test sets (and development sets, not shown here). We further evaluated the addition of other side information (e.g., "description" (+d), "title" (+t)), finding that +d has a similar effect to +k, whereas +t has a mixed effect, being detrimental for one test set (tst2011). We suspect that this is because titles in that test set are often short after our filtering step, leaving little useful information to feed into neural network learning. Interestingly, the best performance is obtained when incorporating both +k and +d, showing that there is complementary information in the two auxiliary inputs. Further, we achieved similar results on the French counterpart of the English data using output+mlp with both +t and +d, as shown in Table 3. In the French data, no "keywords" information is available. For this reason, we ran additional experiments injecting English keywords as side information into neural models of French. Interestingly, we found that "keywords" side information in English effectively improves the modelling of French texts, as shown in Table 3, serving as a new form of cross-lingual language modelling.
We further achieved similar results by incorporating the topic headline in the RIE dataset. The consistent improvements (Table 4) demonstrate the robustness of the output+mlp approach.

Conclusion
We have proposed an effective approach to boost the performance of RNNLMs using auxiliary side information (e.g. keywords, title, description, topic headline) of a textual utterance. We provided an empirical analysis of various ways of encoding such information into a distributed representation, which is then incorporated into either the input, hidden, or output layer of the RNNLM architecture. Our experimental results show consistent improvements over strong baselines for different datasets and genres in two languages. Our future work will investigate the model performance on a closely related task, i.e., neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015). Furthermore, we will explore learning methods to combine utterances with and without the auxiliary side information.