Larger-Context Language Modelling with Recurrent Neural Network

In this work, we propose a novel method to incorporate corpus-level discourse information into language modelling. We call this a larger-context language model. We introduce a late fusion approach to a recurrent language model based on long short-term memory (LSTM) units, which helps the LSTM unit keep intra-sentence dependencies and inter-sentence dependencies separate from each other. Through evaluation on three corpora (IMDB, BBC, and Penn Treebank), we demonstrate that the proposed model improves perplexity significantly. In the experiments, we evaluate the proposed approach while varying the number of context sentences and observe that the proposed late fusion is superior to the usual way of incorporating additional inputs to the LSTM. By analyzing the trained larger-context language model, we discover that content words, including nouns, adjectives and verbs, benefit most from an increasing number of context sentences. This analysis suggests that the larger-context language model improves on the unconditional language model by capturing the theme of a document better and more easily.


INTRODUCTION
The goal of language modelling is to estimate the probability distribution of various linguistic units, e.g., words and sentences (Rosenfeld, 2000). Among the earliest techniques were count-based n-gram language models, which estimate the probability distribution of a word given a fixed number of previous words. Later, Bengio et al. (2003) proposed a feed-forward neural language model, which achieved substantial improvements in perplexity over count-based language models. Bengio et al. showed that this neural language model could simultaneously learn the conditional probability of the latest word in a sequence as well as a vector representation for each word in a predefined vocabulary.
Recently, recurrent neural networks have become one of the most widely used models in language modelling (Mikolov et al., 2010). The long short-term memory unit (LSTM, Hochreiter & Schmidhuber, 1997) is one of the most common recurrent activation functions. Architecturally speaking, the memory state and output state are explicitly separated by activation gates such that the vanishing gradient and exploding gradient problems described in Bengio et al. (1994) are alleviated. Motivated by such gated models, a number of RNN variants (e.g., the GRU of Cho et al. (2014b) and the GF-RNN of Chung et al. (2015)) have been designed to easily capture long-term dependencies.
When modelling a corpus, these language models assume mutual independence among sentences, and the task is often reduced to assigning a probability to a single sentence. In this work, we propose a method to incorporate corpus-level discourse dependency into a neural language model. We call this a larger-context language model. It models the influence of context by defining a conditional probability of the form $P(w_n \mid w_{1:n-1}, S)$, where $w_1, \ldots, w_n$ are words from the same sentence, and $S$ represents the context, which consists of a number of previous sentences of arbitrary length.
We evaluated our model on three different corpora (IMDB, BBC, and Penn Treebank). Our experiments demonstrate that the proposed larger-context language model significantly reduces per-word perplexity compared to language models without context information. Further, through part-of-speech tag analysis, we discovered that content words, including nouns, adjectives and verbs, benefit the most from an increasing number of context sentences. This led us to the conclusion that the larger-context language model improves on the unconditional language model by capturing the theme of a document.
To achieve this improvement, we propose a late fusion approach, which is a modification to the LSTM such that it better incorporates the discourse context from preceding sentences. In the experiments, we evaluated the proposed approach against the early fusion approach with various numbers of context sentences, and demonstrated that late fusion is superior to early fusion.
Our model explores another aspect of context-dependent recurrent language models. It is novel in that it also provides an insightful way to feed information into the LSTM unit, which could benefit all encoder-decoder based applications.

BACKGROUND: STATISTICAL LANGUAGE MODELLING
Given a document $D = (S_1, S_2, \ldots, S_L)$ which consists of $L$ sentences, statistical language modelling aims at computing its probability $P(D)$. It is often assumed that each sentence in the document is independent of the others:
$$P(D) \approx \prod_{l=1}^{L} P(S_l). \tag{1}$$
We call this probability (before approximation) a corpus-level probability. Under this assumption of mutual independence among sentences, the task of language modelling is often reduced to assigning a probability to a single sentence $P(S_l)$.
A sentence $S_l = (w_1, w_2, \ldots, w_{T_l})$ is a variable-length sequence of words or tokens. By assuming that a word at any location in a sentence is largely predictable by preceding words, we can rewrite the sentence probability into
$$P(S_l) = \prod_{t=1}^{T_l} p(w_t \mid w_{<t}), \tag{2}$$
where $w_{<t}$ is a shorthand notation for all the preceding words. We call this a sentence-level probability.
This rewritten probability expression can be either directly modelled by a recurrent neural network (Mikolov et al., 2010) or further approximated as a product of n-gram conditional probabilities such that
$$P(S_l) = \prod_{t=1}^{T_l} p(w_t \mid w_{<t}) \approx \prod_{t=1}^{T_l} p(w_t \mid w_{t-n}^{t-1}), \tag{3}$$
where $w_{t-n}^{t-1} = (w_{t-n}, \ldots, w_{t-1})$. The latter is called n-gram language modelling. See, e.g., Kneser & Ney (1995) for detailed reviews of the most widely used techniques for n-gram language modelling.
The most widely used approach to statistical language modelling is the n-gram language model in Eq. (3). This approach builds a large table of n-gram statistics based on a training corpus. Each row of the table contains as its key the n-gram phrase and its number of occurrences in the training corpus. Based on these statistics, one can estimate the n-gram conditional probability (one of the terms inside the product in Eq. (3)) by
$$p(w_t \mid w_{t-n}^{t-1}) = \frac{c(w_{t-n}, \ldots, w_{t-1}, w_t)}{\sum_{w' \in V} c(w_{t-n}, \ldots, w_{t-1}, w')},$$
where $c(\cdot)$ is the count in the training corpus, and $V$ is the vocabulary of all unique words/tokens. As this estimate suffers severely from data sparsity (i.e., most n-grams do not occur at all in the training corpus), many smoothing/back-off techniques have been proposed over decades. One of the most widely used smoothing techniques is modified Kneser-Ney smoothing (Kneser & Ney, 1995). More recently, Bengio et al. (2003) proposed to use a feedforward neural network to model those n-gram conditional probabilities to avoid the issue of data sparsity. This model is often referred to as a neural language model.
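The count-based estimate described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus, the function names `ngram_counts` and `ngram_prob`, and the use of bigrams (a one-word history) are hypothetical choices, and no smoothing is applied.

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count all n-grams (as tuples) in a list of tokenized sentences."""
    counts = Counter()
    for sent in sentences:
        for i in range(len(sent) - n + 1):
            counts[tuple(sent[i:i + n])] += 1
    return counts

def ngram_prob(counts, vocab, history, word):
    """Unsmoothed estimate c(history, word) / sum over w' of c(history, w')."""
    denom = sum(counts[history + (w,)] for w in vocab)
    if denom == 0:
        return 0.0  # unseen history: the data-sparsity problem in miniature
    return counts[history + (word,)] / denom

# Toy corpus, assumed purely for illustration.
toy_corpus = [["the", "movie", "was", "good"],
              ["the", "movie", "was", "bad"],
              ["the", "plot", "was", "good"]]
vocab = sorted({w for s in toy_corpus for w in s})
bigrams = ngram_counts(toy_corpus, 2)

p_good = ngram_prob(bigrams, vocab, ("was",), "good")
p_unseen = ngram_prob(bigrams, vocab, ("good",), "movie")
```

In this toy corpus the history "good" never precedes another word, so the unsmoothed estimate has nothing to normalise by; smoothing and back-off techniques exist to handle exactly such cases.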
This n-gram language modelling is, however, limited by the n-th order Markov assumption made in Eq. (3). Hence, Mikolov et al. (2010) proposed to use a recurrent neural network to directly model Eq. (2) without making any Markov assumption. We will refer to this approach of using a recurrent neural network for language modelling as recurrent language modelling.
A recurrent language model is composed of two functions: a transition function and an output function. The transition function reads one word $w_t$ and updates its hidden state such that
$$h_t = \phi(w_t, h_{t-1}), \tag{4}$$
where $h_0$ is an all-zero vector. $\phi$ is a recurrent activation function, and the two most commonly used ones are long short-term memory units (LSTM, Hochreiter & Schmidhuber, 1997) and gated recurrent units (GRU, Cho et al., 2014b). For more details on these recurrent activation units, we refer the reader to Jozefowicz et al. (2015) and Greff et al. (2015).
At each timestep, the output function computes the probability over all possible next words in the vocabulary $V$. This is done by
$$p(w_{t+1} = w' \mid w_1^t) \propto \exp(g_{w'}(h_t)). \tag{5}$$
$g$ is commonly implemented as an affine transformation:
$$g(h_t) = W_g h_t + b_g,$$
where $W_g \in \mathbb{R}^{|V| \times d}$ and $b_g \in \mathbb{R}^{|V|}$, with $d$ the dimensionality of $h_t$. The whole model is trained by maximizing the log-likelihood of a training corpus, often using stochastic gradient descent with backpropagation through time (see, e.g., Rumelhart et al., 1988).
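As a minimal sketch of the output function in Eq. (5), the snippet below turns a hidden state into a distribution over next words with an affine map followed by a softmax. The names `W_g`, `b_g` and `next_word_distribution`, the toy sizes, and the random parameters are assumptions made for illustration only.

```python
import numpy as np

def next_word_distribution(h_t, W_g, b_g):
    """Softmax over the vocabulary from affine scores g(h_t) = W_g h_t + b_g."""
    scores = W_g @ h_t + b_g
    scores = scores - scores.max()      # subtract the max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()          # normalise so the entries sum to one

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 5, 3           # toy sizes, chosen for illustration
W_g = rng.normal(size=(vocab_size, hidden_dim))
b_g = np.zeros(vocab_size)
h_t = rng.normal(size=hidden_dim)

p_next = next_word_distribution(h_t, W_g, b_g)
```

Subtracting the maximum score leaves the distribution unchanged, because the proportionality in Eq. (5) is invariant to adding a constant to every score.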
These different approaches to language modelling have been extensively tested against each other in terms of speech recognition and machine translation in recent years (Sundermeyer et al., 2015; Baltescu & Blunsom, 2014; Schwenk, 2007). Often the conclusion is that all three techniques tend to have different properties and qualities depending on many factors, such as the size of the training corpus, available memory and target application. In many cases, it has been found beneficial to combine all these techniques in order to achieve the best language model.
One important thing to note in this conventional approach to statistical language modelling is that every sentence in a document is assumed independent of the others (see Eq. (1)). This raises the questions of how strong an assumption this is, how much impact it has on the final language model quality, and how much language modelling can gain by weakening it.

LANGUAGE MODELLING WITH LONG SHORT-TERM MEMORY
Here let us briefly describe the long short-term memory unit, which is widely used as a recurrent activation function $\phi$ (see Eq. (4)) for language modelling (see, e.g., Graves, 2013).
A layer of long short-term memory (LSTM) units consists of three gates and a single memory cell. The three gates (input, output and forget) are computed by
$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), \tag{6}$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), \tag{7}$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \tag{8}$$
where $\sigma$ is a sigmoid function and $x_t$ is the input at the $t$-th timestep.
The memory cell is computed by
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c), \tag{9}$$
where $\odot$ is an element-wise multiplication. This adaptive leaky integration of the memory cell allows the LSTM to easily capture long-term dependencies in the input sequence, and it has recently been widely adopted in many works involving language models (see, e.g., Sundermeyer et al., 2015).
The output, or the activation of this LSTM layer, is then computed as $h_t = o_t \odot \tanh(c_t)$.
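The LSTM layer described above can be sketched with NumPy as follows. This is a hedged illustration rather than the implementation used in the experiments: the parameter layout, the helper names `init_lstm_params` and `lstm_step`, the small random initialisation, and the all-zero initial memory cell are all assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_lstm_params(input_dim, hidden_dim, rng):
    """Small random weights for the gates (i, o, f) and the cell candidate (c)."""
    params = {}
    for name in ("i", "o", "f", "c"):
        params["W_" + name] = 0.1 * rng.normal(size=(hidden_dim, input_dim))
        params["U_" + name] = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
        params["b_" + name] = np.zeros(hidden_dim)
    return params

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: three gates, memory cell update, then the layer output."""
    def affine(name):
        return (params["W_" + name] @ x_t
                + params["U_" + name] @ h_prev
                + params["b_" + name])
    i_t = sigmoid(affine("i"))                        # input gate
    o_t = sigmoid(affine("o"))                        # output gate
    f_t = sigmoid(affine("f"))                        # forget gate
    c_t = f_t * c_prev + i_t * np.tanh(affine("c"))   # leaky integration of the cell
    h_t = o_t * np.tanh(c_t)                          # output of the layer
    return h_t, c_t

rng = np.random.default_rng(0)
params = init_lstm_params(input_dim=4, hidden_dim=3, rng=rng)
h, c = np.zeros(3), np.zeros(3)       # all-zero initial states (c_0 = 0 is assumed)
for x in rng.normal(size=(5, 4)):     # a toy sequence of five input vectors
    h, c = lstm_step(x, h, c, params)
```

Because the output gate lies in (0, 1) and the hyperbolic tangent in (-1, 1), every element of the output stays strictly inside (-1, 1).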

LARGER-CONTEXT LANGUAGE MODELLING
In this paper, we aim not at improving the sentence-level probability estimation $P(S)$ (see Eq. (2)) but at improving the corpus-level probability $P(D)$ from Eq. (1) directly. One thing we noticed at the beginning of this work is that it is not necessary for us to make the assumption of mutual independence of sentences in a corpus. Rather, similarly to how we model a sentence probability, we can loosen this assumption by
$$P(D) \approx \prod_{l=1}^{L} P(S_l \mid S_{l-n}^{l-1}), \tag{10}$$
where $S_{l-n}^{l-1} = (S_{l-n}, S_{l-n+1}, \ldots, S_{l-1})$. $n$ decides how many preceding sentences each conditional sentence probability conditions on, similarly to what happens in usual n-gram language modelling.
From the perspective of statistical modelling, estimating the corpus-level probability in Eq. (10) is equivalent to building a statistical model that approximates
$$P(S_l \mid S_{l-n}^{l-1}) = \prod_{t=1}^{T_l} p(w_t \mid w_{<t}, S_{l-n}^{l-1}), \tag{11}$$
similarly to Eq. (2). One major difference from the existing approaches to statistical language modelling is that now each conditional probability of a next word is conditioned not only on the preceding words in the same sentence, but also on the $n$ preceding sentences.
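The factorisation above can be made concrete with a short sketch that pairs every sentence with its context window and sums the conditional log-probabilities. The function names, the toy document and the `uniform_log_prob` stand-in model are hypothetical; truncating the window for the first sentences of a document is also an assumption, since the text does not say how they are handled.

```python
import math

def context_windows(document, n):
    """Pair each sentence S_l with its (up to) n preceding sentences."""
    pairs = []
    for l, sentence in enumerate(document):
        context = document[max(0, l - n):l]   # shorter window at the document start
        pairs.append((context, sentence))
    return pairs

def corpus_log_prob(document, n, sentence_log_prob):
    """Sum of log P(S_l | context) over the document, i.e. the log of the product."""
    return sum(sentence_log_prob(sentence, context)
               for context, sentence in context_windows(document, n))

def uniform_log_prob(sentence, context, vocab_size=10):
    # Hypothetical stand-in model: uniform over the vocabulary, ignores the context.
    return -len(sentence) * math.log(vocab_size)

document = [["a"], ["b", "c"], ["d"]]
pairs = context_windows(document, n=2)
total = corpus_log_prob(document, 2, uniform_log_prob)
```

Any model that scores a sentence given its context sentences can be passed in place of the uniform stand-in.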
A conventional, count-based n-gram language model is not well suited to this task due to the issue of data sparsity. In other words, the number of rows in the table storing n-gram statistics will explode, as the number of possible sentence combinations grows with the vocabulary size and exponentially with each sentence's length and the number of context sentences.
Neither neural nor recurrent language modelling, however, suffers from this issue of data sparsity. This makes these models ideal for modelling the larger-context sentence probability in Eq. (11).
More specifically, we are interested in adapting the recurrent language model for this.
In doing so, we answer two questions in the following subsections. First, there is the question of how we should represent the context sentences $S_{l-n}^{l-1}$. We consider two possibilities in this work. Second, there is a large degree of freedom in how we build a recurrent activation function that is conditioned on the context sentences. We also consider two alternatives in this case.

CONTEXT REPRESENTATION
A sequence of preceding sentences can be represented in many different ways. Here, let us describe two alternatives we test in the experiments.
The first representation is to simply bag all the words in the preceding sentences into a single vector $s \in [0, 1]^{|V|}$. Any element of $s$ corresponding to a word that exists in one of the preceding sentences is assigned the frequency of that word, and 0 otherwise. This vector is multiplied from the left by a matrix $P$, which is tuned together with all the other parameters: $p = Ps$. We call this representation $p$ a bag-of-words (BoW) context.
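A minimal sketch of the BoW context follows. Since $s$ lies in $[0, 1]^{|V|}$, the code assumes that "frequency" means relative frequency over all context words; that reading, the toy vocabulary, the function name `bow_vector`, and the random matrix `P` (learned jointly in the actual model) are assumptions.

```python
import numpy as np

def bow_vector(context_sentences, word_to_id):
    """Relative word frequencies over all context sentences (zeros elsewhere)."""
    s = np.zeros(len(word_to_id))
    for sentence in context_sentences:
        for word in sentence:
            if word in word_to_id:          # out-of-vocabulary words are skipped
                s[word_to_id[word]] += 1.0
    total = s.sum()
    return s / total if total > 0 else s

word_to_id = {"the": 0, "movie": 1, "was": 2, "long": 3, "plot": 4}
context = [["the", "movie", "was", "long"], ["the", "movie"]]
s = bow_vector(context, word_to_id)

rng = np.random.default_rng(0)
P = rng.normal(size=(3, len(word_to_id)))   # random here, trained in the real model
p = P @ s                                    # BoW context vector
```

Note that this representation discards both the word order and the order of the context sentences.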
Second, we try to represent the preceding context sentences as a sequence of bag-of-words vectors. Each $s_j$ is the bag-of-words representation of the $j$-th context sentence, and they are put into a sequence $(s_{l-n}, \ldots, s_{l-1})$. Unlike the first BoW context, this allows us to incorporate the order of the preceding context sentences. This sequence of BoW vectors is read by a recurrent neural network which is separate from the one used for modelling a sentence (see Eq. (4)). We use LSTM units as recurrent activations, and for each context sentence in the sequence, we get $z_t = \phi(x_t, z_{t-1})$ for $t = l-n, \ldots, l-1$. We set the last hidden state $z_{l-1}$ of this recurrent neural network, to which we refer as a context recurrent neural network, as the context vector $p$.

Attention-based Context Representation
The sequence of BoW vectors can also be used in a slightly different way. Instead of a unidirectional recurrent neural network, we first use a bidirectional recurrent neural network to read the sequence. The forward recurrent neural network reads the sequence as usual in the forward direction, and the reverse recurrent neural network in the opposite direction. The hidden states from these two networks are then concatenated for each context sentence in order to form a sequence of annotation vectors $(z_{l-n}, \ldots, z_{l-1})$.
Unlike the other approaches, in this case the context vector $p$ differs for each word $w_t$ in the current sentence, and we denote it by $p_t$. The context vector $p_t$ for the $t$-th word is computed as the weighted sum of the annotation vectors:
$$p_t = \sum_{l'=l-n}^{l-1} \alpha_{t,l'} z_{l'},$$
where the attention weight $\alpha_{t,l'}$ is computed by
$$\alpha_{t,l'} = \frac{\exp(\mathrm{score}(z_{l'}, h_t))}{\sum_{k=l-n}^{l-1} \exp(\mathrm{score}(z_k, h_t))}.$$
$h_t$ is the hidden state of the recurrent language model of the current sentence from Eq. (5). The scoring function $\mathrm{score}(z_{l'}, h_t)$ returns a relevance score of the $l'$-th context sentence with respect to $h_t$.
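The attention-based context vector can be sketched as a softmax-weighted sum of annotation vectors. The form of the scoring function is not given here, so the dot product in `dot_score` is purely a hypothetical placeholder; the function names and the toy annotation vectors are likewise assumptions.

```python
import numpy as np

def attention_context(annotations, h_t, score):
    """Weighted sum of annotation vectors with softmax-normalised relevance scores."""
    scores = np.array([score(z, h_t) for z in annotations])
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # attention weights, sum to one
    p_t = alpha @ annotations                       # context vector for word t
    return p_t, alpha

def dot_score(z, h):
    # Hypothetical scoring function; the actual model may use a different one.
    return float(z @ h)

# Three context sentences, each summarised by a 2-dimensional annotation vector.
annotations = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
h_t = np.array([0.5, -0.5])
p_t, alpha = attention_context(annotations, h_t, dot_score)

# With an all-zero hidden state every score ties, so the weights are uniform.
p_uniform, alpha_uniform = attention_context(annotations, np.zeros(2), dot_score)
```

Because the weights depend on the hidden state, a different context vector is produced at every word position of the current sentence.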

CONDITIONAL LSTM
Early Fusion. Once the context vector $p$ is computed from the $n$ preceding sentences, we need to feed it into the sentence-level recurrent language model. The most straightforward way is to simply consider it as an input at every time step such that
$$x_t = E^\top w_t + W_p p,$$
where $E$ is the word embedding matrix that transforms the one-hot vector of the $t$-th word into a continuous word vector. This $x_t$ is used by the LSTM layer as the input, as described in Sec. 2.1. We call this approach an early fusion of the context into language modelling.
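Early fusion amounts to adding a projection of the context vector to the word embedding before the LSTM sees it. The sketch below assumes an embedding matrix `E` with one row per vocabulary word, so that a row lookup is equivalent to multiplying the transposed matrix by a one-hot vector; the shapes, names and random values are illustrative assumptions.

```python
import numpy as np

def early_fusion_input(word_id, E, W_p, p):
    """LSTM input: embedding of the current word plus the projected context."""
    return E[word_id] + W_p @ p   # E[word_id] equals E^T times the one-hot vector

rng = np.random.default_rng(0)
vocab_size, embed_dim, context_dim = 6, 4, 3    # toy sizes
E = rng.normal(size=(vocab_size, embed_dim))    # one embedding row per word
W_p = rng.normal(size=(embed_dim, context_dim))
p = rng.normal(size=context_dim)                # context vector, fixed per sentence

x_t = early_fusion_input(2, E, W_p, p)
one_hot = np.eye(vocab_size)[2]
```

The same projected context is added at every time step, so the LSTM gates and memory cell must carry intra-sentence and inter-sentence information together.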
Late Fusion. In addition to this approach, we propose here a modification to the LSTM such that it better incorporates the context from the preceding sentences (summarized by $p_t$). The basic idea is to keep dependencies within the sentence being modelled (intra-sentence dependencies) and those between the preceding sentences and the current sentence (inter-sentence dependencies) separate from each other.
We let the memory cell $c_t$ of the LSTM in Eq. (9) model intra-sentence dependencies. This simply means that there is no change to the existing formulation of the LSTM, described in Eqs. (6)-(9).
The inter-sentence dependencies are reflected in the interaction between the memory cell $c_t$, which models intra-sentence dependencies, and the context vector $p$, which summarizes the $n$ preceding sentences. We model this by first computing the amount of influence of the preceding context sentences as
$$r_t = \sigma(W_r (W_p p) + W_r c_t + b_r).$$
This vector $r_t$ controls the strength of each of the elements in the context vector $p$. This amount of influence from the $n$ preceding sentences is decided based on the currently captured intra-sentence dependency structures and the preceding sentences.
This controlled context vector $r_t \odot (W_p p)$ is then used to compute the output of the LSTM layer such that
$$h_t = o_t \odot \tanh(c_t + r_t \odot (W_p p)).$$
This is illustrated in Fig. 1 (b).
We call this approach a late fusion, as the effect of the preceding context is fused together with the intra-sentence dependency structure in the later stage of the recurrent activation. Late fusion is a simple but effective way to mitigate the issue of vanishing gradient in corpus-level language modelling. By letting the context representation flow without having to pass through saturating nonlinear activation functions, it provides a linear path through which the gradient for the context flows easily.
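The late fusion step can be sketched as a small function applied after the usual LSTM update. The exact parameterisation of the gate is an assumption here: the code takes the gate to be a sigmoid of an affine function of the projected context and the memory cell, with hypothetical parameter names `W_p`, `W_r` and `b_r`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def late_fusion_output(c_t, o_t, p, W_p, W_r, b_r):
    """Gate the projected context with r_t, then add it to the cell before the output."""
    context = W_p @ p                                  # projected context vector
    r_t = sigmoid(W_r @ context + W_r @ c_t + b_r)     # per-element influence of context
    h_t = o_t * np.tanh(c_t + r_t * context)           # fused output of the layer
    return h_t, r_t

rng = np.random.default_rng(0)
hidden_dim, context_dim = 4, 3                         # toy sizes
c_t = rng.normal(size=hidden_dim)                      # memory cell from the usual LSTM
o_t = sigmoid(rng.normal(size=hidden_dim))             # output gate from the usual LSTM
p = rng.normal(size=context_dim)
W_p = rng.normal(size=(hidden_dim, context_dim))
W_r = rng.normal(size=(hidden_dim, hidden_dim))

h_t, r_t = late_fusion_output(c_t, o_t, p, W_p, W_r, np.zeros(hidden_dim))

# A strongly negative bias closes the gate, recovering the plain LSTM output.
h_closed, r_closed = late_fusion_output(c_t, o_t, p, W_p, W_r,
                                        np.full(hidden_dim, -100.0))
```

The memory cell update itself is untouched, which is what keeps intra-sentence dependencies separate from the context until the output is formed.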

CONTEXT-DEPENDENT RECURRENT LANGUAGE MODEL
The possibility of extending neural or recurrent language modelling to incorporate larger context was explored earlier. In particular, Mikolov & Zweig (2012) proposed an approach, called the context-dependent recurrent neural network language model, very similar to the approach proposed here.
The basic idea of their approach is to use a topic distribution, represented as a vector of probabilities, of the previous n words when computing the hidden state of the recurrent neural network at each time step.
In doing so, the words used to compute the topic distribution often went over the sentence boundary, meaning that this distribution vector was summarizing a part of a preceding sentence. Nevertheless, their major goal was to use this topic distribution vector as a way to "convey contextual information about the sentence being modeled." More recently, Mikolov et al. (2014) proposed a similar approach, however without relying on external topic modelling.
There are three major differences between the proposed approach and the work by Mikolov & Zweig (2012). First, the goal in this work is to explicitly model preceding sentences to better approximate the corpus-level probability (see Eq. (10)) rather than to get a better context of the current sentence. Second, Mikolov & Zweig (2012) use an external method, such as latent Dirichlet allocation (Blei et al., 2003) or latent semantic analysis (Dumais, 2004), to extract a feature vector, whereas we learn the whole model, including the context vector extraction, end-to-end. Third, we propose a late fusion approach which is well suited to the LSTM units that have recently been widely adopted in many works involving language models (see, e.g., Sundermeyer et al., 2015). This late fusion is later shown to be superior to the early fusion approach.
Similarly, Sukhbaatar et al. (2015) proposed more recently to use a memory network for language modelling with a very large context of one hundred to two hundred preceding words. The major difference from the proposed approach is the lack of separation between the context sentences and the current sentence being processed. There are two implications of this approach. First, each sentence, depending on its own length and the lengths of the preceding sentences, is conditioned on a different number of context sentences. Second, words at the beginning of the sentence being modelled tend to have a larger context (in terms of the number of preceding sentences they are conditioned on) than those at the end of the sentence. These issues do not exist in the approach proposed here.
Unlike these earlier works, the approach proposed here uses sentence boundaries explicitly. This makes it easier to integrate with downstream applications, such as machine translation and speech recognition, at the decoding level, which almost always works sentence-wise.
It is, however, important to notice that these two previous works by Mikolov & Zweig (2012) and Sukhbaatar et al. (2015) are not in competition with the proposed larger-context recurrent language model. Rather, all three are orthogonal to each other and can be combined.

DIALOGUE MODELLING WITH RECURRENT NEURAL NETWORKS
A model more similar to the proposed larger-context recurrent language model is the hierarchical recurrent encoder decoder (HRED) proposed recently by Serban et al. (2015). The HRED consists of three recurrent neural networks that model a dialogue between two people from the perspective of one of them, to whom we refer as a speaker. If we consider the last utterance of the speaker being modelled by the decoder of the HRED, this model can be considered a larger-context recurrent language model with early fusion.
Aside from the fact that the ultimate goals differ (in their case, dialogue modelling, and in our case, document modelling), there are two technical differences. First, they only test the early fusion approach. We show later in the experiments that the proposed late fusion gives a better language modelling quality than the early fusion. Second, we use a sequence of bag-of-words to represent the preceding sentences, while the HRED uses a sequence of sequences of words. This allows the HRED to potentially better model the order of the words in each preceding sentence, but it increases computational complexity (one more recurrent neural network) and decreases statistical efficiency (more parameters with the same amount of data).
Again, the larger-context language model proposed here is not competing against the HRED. Rather, it is a variant, with differences in technical details, that is being evaluated specifically for document language modelling. We believe many of the components in these two models are complementary and may improve each other. For instance, the HRED may utilize the proposed late fusion, and the larger-context recurrent language model here may represent the context sentences as a sequence of sequences of words, instead of a BoW context or a sequence of BoW vectors.

SKIP-THOUGHT VECTORS
Perhaps the most similar work is the skip-thought vector by Kiros et al. (2015). In their work, a recurrent neural network is trained to read a current sentence, as a sequence of words, and extract a so-called skip-thought vector of the sentence. There are two other recurrent neural networks which respectively model the preceding and following sentences. If we only consider the prediction of the following sentence, then this model becomes a larger-context recurrent language model which considers a single preceding sentence as a context.
As with the other previous works we have discussed so far, the major difference is in the ultimate goal of the model. Kiros et al. (2015) fully focused on using their model to extract a good, generic sentence vector, while in this paper we are focused on obtaining a good language model. There are also less major technical differences. First, the skip-thought vector model conditions only on the immediately preceding sentence, while we extend this to multiple preceding sentences. The experiments later will show the importance of having a larger context. Second, similarly to the two other previous works by Mikolov & Zweig (2012) and Serban et al. (2015), the skip-thought vector model only implements early fusion.

NEURAL MACHINE TRANSLATION: CONDITIONAL LANGUAGE MODELLING
Neural machine translation is another related approach (Forcada & Ñeco, 1997; Kalchbrenner & Blunsom, 2013; Cho et al., 2014b; Sutskever et al., 2014; Bahdanau et al., 2014). In neural machine translation, often two recurrent neural networks are used. The first recurrent neural network, called an encoder, reads a source sentence, represented as a sequence of words in a source language, to form a context vector, or a set of context vectors. The other recurrent neural network, called a decoder, then models the target translation conditioned on this source context. This is similar to the proposed larger-context recurrent language model, if we consider the source sentence as a preceding sentence in a corpus. The major difference is in the ultimate application, machine translation vs. language modelling, and technically, the differences between neural machine translation and the proposed larger-context language model are similar to those between the HRED and the larger-context language model.
Under review as a conference paper at ICLR 2016
Similarly to the other previous works we discussed earlier, it is possible to incorporate the proposed larger-context language model into the existing neural machine translation framework, and also to incorporate advanced mechanisms from the neural machine translation framework. The attention mechanism was introduced by Bahdanau et al. (2014) with the intention of building a variable-length context representation of the source sentence. In the larger-context language model, this mechanism is applied to the context sentences (see Sec. 3.1), and we present results in a later section showing that the attention mechanism indeed improves the quality of language modelling.

CONTEXT-DEPENDENT QUESTION-ANSWERING MODELS
Context-dependent question-answering is a task in which a model is asked to answer a question based on the facts from a natural language paragraph. The question and answer are often formulated as filling in a missing word in a query sentence (Hermann et al., 2015; Hill et al., 2015). This task is closely related to the larger-context language model we propose in this paper in the sense that its goal is to build a model to learn
$$p(q_k \mid q_{<k}, q_{>k}, D), \tag{12}$$
where $q_k$ is the missing $k$-th word in a query $Q$, and $q_{<k}$ and $q_{>k}$ are the context words from the query. $D$ is the paragraph containing facts about this query. Often, it is explicitly constructed so that the query $Q$ does not appear in the paragraph $D$.
It is easy to see the similarity between Eq. (12) and one of the conditional probabilities on the r.h.s. of Eq. (11). By replacing the context sentences $S_{l-n}^{l-1}$ in Eq. (11) with $D$ in Eq. (12) and conditioning $w_t$ on both the preceding and following words, we get a context-dependent question-answering model. In other words, the proposed larger-context language model can be used for context-dependent question-answering, however, with computational overhead. The overhead comes from the fact that, for every possible answer, the conditional probability of the completed query sentence must be evaluated.

EXPERIMENTAL SETTINGS
5.1 MODELS
There are six possible combinations of the proposed methods. First, there are two ways of representing the context sentences: (1) bag-of-words (BoW) and (2) a sequence of bag-of-words (SeqBoW), from Sec. 3.1. There are two separate ways to incorporate the SeqBoW: (1) with attention mechanism (ATT) and (2) without it. Then, there are two ways of feeding the context vector into the main recurrent language model (RLM): (1) early fusion (EF) and (2) late fusion (LF), from Sec. 3.2. We will denote these six possible models by
1. RLM-BoW-EF-n
2. RLM-SeqBoW-EF-n
3. RLM-SeqBoW-ATT-EF-n
4. RLM-BoW-LF-n
5. RLM-SeqBoW-LF-n
6. RLM-SeqBoW-ATT-LF-n
n denotes the number of preceding sentences used as the set of context sentences. We test four different values of n: 1, 2, 4 and 8.
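The naming scheme above can be enumerated programmatically, which is convenient when setting up an experiment grid. The helper `model_name` and the list names are hypothetical; only the model labels and the values of n come from the text.

```python
from itertools import product

context_types = ["BoW", "SeqBoW", "SeqBoW-ATT"]   # context representations
fusion_types = ["EF", "LF"]                        # early and late fusion
context_sizes = [1, 2, 4, 8]                       # number n of context sentences

def model_name(context, fusion, n):
    """Build a label such as RLM-SeqBoW-ATT-LF-4."""
    return f"RLM-{context}-{fusion}-{n}"

names = [model_name(c, f, n)
         for f, c, n in product(fusion_types, context_types, context_sizes)]
```

Six model types with four context sizes each give 24 configurations in total.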
As a baseline, we also train a recurrent language model without any context information. We refer to this model as RLM. Furthermore, we also report the result of the conventional, count-based n-gram language model with modified Kneser-Ney smoothing, trained with KenLM (Heafield et al., 2013). Each recurrent language model uses 1000 LSTM units and is trained with Adadelta (Zeiler, 2012) to maximize the log-likelihood defined as
$$\mathcal{L}(\theta) = \frac{1}{K} \sum_{k=1}^{K} \log p(S_k \mid S_{k-n}^{k-1}),$$
where $K$ is the number of training sentences. We early-stop training based on the validation log-likelihood and report the perplexity on the test set using the best model according to the validation log-likelihood.
For computational reasons, we use only sentences of up to 50 words when training a recurrent language model. For KenLM, we used all available sentences in a training corpus.
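The reported per-word perplexity can be computed from sentence log-probabilities as sketched below. The function name and the toy numbers are assumptions, natural logarithms are assumed, and whether an end-of-sentence token counts towards the word total is not specified in the text.

```python
import math

def perplexity(sentence_log_probs, sentence_lengths):
    """Per-word perplexity: exp of the negative average log-probability per word."""
    total_log_prob = sum(sentence_log_probs)
    total_words = sum(sentence_lengths)
    return math.exp(-total_log_prob / total_words)

# Sanity check: a model that is uniform over 10 words has perplexity 10.
lengths = [3, 5]
log_probs = [-n_words * math.log(10) for n_words in lengths]
ppl = perplexity(log_probs, lengths)
```

Lower perplexity corresponds to a higher average log-likelihood per word, so the model selected by validation log-likelihood is also the one with the lowest validation perplexity when the word count is fixed.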

DATASETS
We evaluate the proposed larger-context language model on three different corpora.For detailed statistics, see Table 1.
IMDB Movie Reviews. A set of movie reviews is an ideal dataset to evaluate many different settings of the proposed larger-context language models, because each review is highly likely to have a single theme (the movie under review). The choice of words or the style of writing is well determined by the preceding sentences.
We use the IMDB Movie Review Corpus (IMDB) prepared by Maas et al. (2011). This corpus has 75k training reviews and 25k test reviews. We use the 30k most frequent words for recurrent language models.
BBC. Similarly to movie reviews, each news article tends to convey a single theme. We use the BBC corpus prepared by Greene & Cunningham (2006). Unlike the IMDB corpus, this corpus contains news articles which are almost always written in a formal style. By evaluating the proposed approaches on both the IMDB and BBC corpora, we can tell whether the benefits from larger context exist in both informal and formal languages. We use the 10k most frequent words for recurrent language models.
With both the IMDB and BBC corpora, we did not do any preprocessing other than tokenization.
Penn Treebank. We evaluate a normal recurrent language model, a count-based n-gram language model as well as the proposed RLM-BoW-EF-n and RLM-BoW-LF-n with varying n = 1, 2, 4, 8 on the Penn Treebank Corpus. We preprocess the corpus according to Mikolov et al. (2011) and use a vocabulary of 10k words.

RESULTS AND ANALYSIS
6.1 CORPUS-LEVEL PERPLEXITY
We evaluated the models, including all the proposed approaches (RLM-{BoW,SeqBoW}-{ATT,∅}-{EF,LF}-n), on the IMDB corpus. In Fig. 2 (a), we see three major trends. First, RLM-BoW, with either early fusion or late fusion, outperforms both the count-based n-gram and the recurrent language model (LSTM) regardless of the number of context sentences. Second, the improvement grows as the number n of context sentences increases, and this is most visible with the novel late fusion. Lastly, we see that RLM-SeqBoW does not work well regardless of the fusion type (RLM-SeqBoW-EF not shown), while with the attention-based model RLM-SeqBoW-ATT the performance is greatly improved.
Because of the second observation from the IMDB corpus, that late fusion clearly outperforms early fusion, we evaluated only the RLM-{BoW,SeqBoW}-{ATT}-LF-n models on the other two corpora.
On the other two corpora, PTB and BBC, we observed a similar trend of RLM-SeqBoW-ATT-LF-n and RLM-BoW-LF-n outperforming the two conventional language models, and this trend strengthened as the number n of context sentences grew. We also observed again that RLM-SeqBoW-ATT-LF outperforms RLM-SeqBoW-LF and RLM-BoW in almost all cases.
From these experiments, the benefit of allowing larger context to a recurrent language model is clear, provided that the right context representation (see Sec. 3.1) and the right mechanism for feeding the context information to the recurrent language model (see Sec. 3.2) are chosen. In these experiments, the sequence of bag-of-words representation with attention mechanism, together with late fusion, was found to be the best choice on all three corpora.
One possible explanation for the failure of the SeqBoW representation with a context recurrent neural network is that it is simply difficult for the context recurrent neural network to compress multiple sentences into a single vector. This difficulty in training a recurrent neural network to compress a long sequence into a single vector has been observed earlier, for instance, in neural machine translation (Cho et al., 2014a). The attention mechanism, which was found to avoid this problem in machine translation (Bahdanau et al., 2014), is found to solve this problem in our task as well.
6.2 ANALYSIS: PERPLEXITY PER PART-OF-SPEECH TAG
Next, we attempted to discover why the larger-context recurrent language model outperforms the unconditional recurrent language model. In order to do so, we computed the perplexity per part-of-speech (POS) tag.
We used the Stanford log-linear part-of-speech tagger (Stanford POS Tagger, Toutanova et al., 2003) to tag each word of each sentence in the corpora. We then computed the perplexity of each word and averaged them for each tag type separately. We show the results using RLM-BoW-LF and RLM-SeqBoW-ATT-LF on all three corpora (IMDB, BBC and Penn Treebank) in Fig. 3. We observe that the predictability, measured by the perplexity (negatively correlated), grows most for nouns (Noun) and adjectives (JJ) as the number of context sentences increases. They are followed by verbs (Verb). In other words, nouns, adjectives and verbs are the ones which become more predictable by a language model given more context. We however noticed a relative degradation of quality in coordinating conjunctions (CC), determiners (DT) and personal pronouns (PRP).
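The per-tag analysis can be sketched as a simple grouping of word-level log-probabilities by POS tag. The code exponentiates the mean negative log-probability within each tag, which is one reading of the averaging described above and should be treated as an assumption; the function name and the toy values are hypothetical.

```python
import math
from collections import defaultdict

def perplexity_per_tag(tagged_log_probs):
    """Map each POS tag to exp(-mean log-probability) over the words with that tag."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tag, log_prob in tagged_log_probs:
        totals[tag] += log_prob
        counts[tag] += 1
    return {tag: math.exp(-totals[tag] / counts[tag]) for tag in totals}

# Toy (tag, log-probability) pairs, one per word; the values are made up.
toy = [("NN", math.log(0.25)), ("NN", math.log(0.25)), ("DT", math.log(0.5))]
per_tag = perplexity_per_tag(toy)
```

Running the same grouping for models with different numbers of context sentences makes it possible to see which tags gain the most from larger context.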
It is worthwhile to note that nouns, adjectives and verbs are open-class (content) words, whereas conjunctions, determiners and pronouns are closed-class (function) words (see, e.g., Miller, 1999). Function words often play grammatical roles, while content words convey the content of a sentence or discourse, as the name indicates. From this, we may carefully conclude that the larger-context language model improves upon the conventional, unconditional language model by capturing the theme of a document, which is reflected by the improved perplexity on "content-heavy" open-class words (Chung & Pennebaker, 2007). In our experiments, however, this came at the expense of a slight degradation in the perplexity of function words, as the model's capacity stayed the same (although it need not). This observation is in line with a recent finding by Hill et al. (2015). They also observed a significant gain in predicting open-class, or content, words when a question answerer, whether a model or a human, was allowed a larger context.

CONCLUSION
In this paper, we proposed a method to improve language modelling at the corpus level by incorporating larger context. Using this model improves perplexity on the IMDB, BBC and Penn Treebank corpora, validating the advantage of providing larger context to a recurrent language model.
From our experiments, we found that the sequence of bag-of-words with attention is better than bag-of-words for representing the context sentences (see Sec. 3.1), and that late fusion is better than early fusion for feeding the context vector into the main recurrent language model (see Sec. 3.2). Our part-of-speech analysis revealed that content words, including nouns, adjectives and verbs, benefit most from an increasing number of context sentences (see Sec. 6.2). This analysis suggests that the larger-context language model improves perplexity because it captures the theme of a document better and more easily.
To explore the potential of such a model, more research is needed in several directions. First, the three datasets we used in this paper are relatively small in the context of language modelling; the proposed larger-context language model should therefore be evaluated on larger corpora. Second, more analysis, beyond the one based on part-of-speech tags, should be conducted in order to better understand the advantage of such larger-context models. Lastly, it is important to evaluate the impact of the proposed larger-context models in downstream tasks such as machine translation and speech recognition.

Figure 1: Graphical illustration of the proposed (a) early fusion and (b) late fusion.

Figure 2: Corpus-level perplexity on (a) IMDB, (b) Penn Treebank and (c) BBC. The count-based 5-gram language models with Kneser-Ney smoothing resulted in perplexities of 110.20, 148 and 127.32, respectively, and are not shown here. We do not show SeqBoW for n = 1, as it is equivalent to BoW in that case.
Figure 3: Perplexity per POS tag on the (a) IMDB, (b) BBC and (c) Penn Treebank corpora.

Table 1: Statistics of IMDB, BBC and Penn TreeBank.

Footnote 4: Among the 36 POS tags used by the Stanford POS Tagger, we looked at the perplexities of the ten most frequent tags (NN, IN, DT, JJ, RB, NNS, VBZ, VB, PRP, CC), of which we combined NN and NNS into a new tag Noun and VB and VBZ into a new tag Verb.