Topically Driven Neural Language Model

Language models are typically applied at the sentence level, without access to the broader document context. We present a neural language model that incorporates document context in the form of a topic model-like architecture, thus providing a succinct representation of the broader document context outside of the current sentence. Experiments over a range of datasets demonstrate that our model outperforms a pure sentence-based model in terms of language model perplexity, and leads to topics that are potentially more coherent than those produced by a standard LDA topic model. Our model also has the ability to generate related sentences for a topic, providing another way to interpret topics.


Introduction
Topic models provide a powerful tool for extracting the macro-level content structure of a document collection in the form of latent topics (usually in the form of multinomial distributions over terms), with a plethora of applications in NLP (Hall et al., 2008; Newman et al., 2010a; Wang and McCallum, 2006). A myriad of variants of the classical LDA method (Blei et al., 2003) have been proposed, including recent work on neural topic models (Cao et al., 2015; Wan et al., 2012; Larochelle and Lauly, 2012; Hinton and Salakhutdinov, 2009).
Separately, language models have long been a foundational component of any NLP task involving generation or textual normalisation of a noisy input (including speech, OCR and the processing of social media text). The primary purpose of a language model is to predict the probability of a span of text, traditionally at the sentence level, under the assumption that sentences are independent of one another, although recent work has started using broader local context such as the preceding sentences (Wang and Cho, 2016; Ji et al., 2016).
In this paper, we combine the benefits of a topic model and language model in proposing a topically-driven language model, whereby we jointly learn topics and word sequence information. This allows us to both sensitise the predictions of the language model to the larger document narrative using topics, and to generate topics which are better sensitised to local context and are hence more coherent and interpretable.
Our model has two components: a language model and a topic model. We implement both components using neural networks, and train them jointly by treating each component as a sub-task in a multi-task learning setting. We show that our model is superior to other language models that leverage additional context, and that the generated topics are potentially more coherent than LDA topics. The architecture of the model provides an extra dimension of topic interpretability, in supporting the generation of sentences from a topic (or mix of topics). It is also highly flexible, in its ability to be supervised and to incorporate side information, which we show to further improve language model performance. An open source implementation of our model is available at: https://github.com/jhlau/topically-driven-language-model.


Related Work
Griffiths et al. (2004) propose a model that learns topics and word dependencies using a Bayesian framework. Word generation is driven by either LDA or an HMM: for LDA, a word is generated based on a sampled topic in the document; for the HMM, a word is generated conditioned on the preceding words. A key difference over our model is that their language model is driven by an HMM, which uses a fixed window and is therefore unable to track long-range dependencies.
Cao et al. (2015) relate the topic model view of documents and words (documents having a multinomial distribution over topics, and topics having a multinomial distribution over words) from a neural network perspective, by embedding these relationships in differentiable functions. With that, the model loses the stochasticity and Bayesian inference of LDA, but gains non-linear, more complex representations. The authors further propose extensions to the model for supervised learning, where document labels are given.
Wang and Cho (2016) and Ji et al. (2016) relax the sentence independence assumption in language modelling, and use preceding sentences as additional context. By treating words in preceding sentences as a bag of words, Wang and Cho (2016) use an attentional mechanism to focus on these words when predicting the next word. The authors show that the incorporation of additional context helps language models.

Architecture
The architecture of the proposed topically-driven language model (henceforth "tdlm") is illustrated in Figure 1. There are two components in tdlm: a language model and a topic model. The language model is designed to capture word relations in sentences, while the topic model learns topical information in documents. The topic model works like an auto-encoder, where it is given the document words as input and optimised to predict them.
The topic model takes in word embeddings of a document and generates a document vector using a convolutional network. Given the document vector, we associate it with the topics via an attention scheme to compute a weighted mean of topic vectors, which is then used to predict a word in the document.
The language model is a standard LSTM language model (Hochreiter and Schmidhuber, 1997; Mikolov et al., 2010), but it incorporates the weighted topic vector generated by the topic model to predict succeeding words.
Marrying the language and topic models allows the language model to be topically driven, i.e. it models not just word contexts but also the document context where the sentence occurs, in the form of topics.

Topic Model Component
Let x i ∈ R e be the e-dimensional word vector for the i-th word in the document. A document of n words is represented as the concatenation of its word vectors:

X 1:n = x 1 ⊕ x 2 ⊕ ... ⊕ x n

where ⊕ denotes the concatenation operator. We use a number of convolutional filters to process the word vectors, but for clarity we will explain the network with one filter.
Let w v ∈ R eh be a convolutional filter which we apply to a window of h words to generate a feature. A feature c i for a window of words x i:i+h−1 is given as follows:

c i = I(w v · x i:i+h−1 + b v)

where b v is a bias term and I is the identity function. 1 A feature map c is the collection of features computed from all windows of words:

c = [c 1, c 2, ..., c n−h+1]

where c ∈ R n−h+1. To capture the most salient feature in c, we apply a max-over-time pooling operation (Collobert et al., 2011), yielding a scalar:

d = max(c)

In the case where we use a filters, we have d ∈ R a, and this constitutes the vector representation of the document generated by the convolutional and max-over-time pooling network.
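As a concrete sketch, the convolution over word windows and the max-over-time pooling can be written in a few lines of numpy. Shapes and the identity activation follow the description above; the toy sizes and variable names are illustrative only:

```python
import numpy as np

def document_vector(X, W, b, h):
    """Convolve filters over word-vector windows and max-pool over time.

    X: (n, e) word embeddings for an n-word document.
    W: (a, e*h) one row per convolutional filter (window width h).
    b: (a,) filter biases.
    Returns d: (a,) document vector (one max-pooled feature per filter).
    """
    n, e = X.shape
    # Each window of h consecutive word vectors is flattened into one e*h vector.
    windows = np.stack([X[i:i + h].reshape(-1) for i in range(n - h + 1)])
    # Identity activation, as in the paper: c has shape (n-h+1, a).
    c = windows @ W.T + b
    # Max-over-time pooling keeps the most salient feature per filter.
    return c.max(axis=0)

rng = np.random.default_rng(0)
n, e, h, a = 12, 8, 3, 5          # toy sizes: 12 words, 5 filters
X = rng.normal(size=(n, e))
W = rng.normal(size=(a, e * h))
b = np.zeros(a)
d = document_vector(X, W, b, h)   # d ∈ R^a
```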
The topic vectors are stored in two lookup tables A ∈ R k×a (input vector) and B ∈ R k×b (output vector), where k is the number of topics, and a and b are the dimensions of the topic vectors.
To align the document vector d with the topics, we compute an attention vector which is used to compute a document-topic representation: 2

p = softmax(Ad) (1)
s = B^T p (2)

where p ∈ R k and s ∈ R b. Intuitively, s is a weighted mean of topic vectors, with the weighting given by the attention p. This is inspired by the generative process of LDA, whereby documents are defined as having a multinomial distribution over topics. Finally, s is connected to a dense layer with softmax output to predict each word in the document, where each word is generated independently as a unigram bag-of-words, and the model is optimised using categorical cross-entropy loss. In practice, to improve efficiency we compute the loss for predicting a sequence of m 1 words in the document, where m 1 is a hyper-parameter.
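The attention step can be sketched in numpy as follows (toy sizes; names follow the notation above):

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a vector."""
    v = v - v.max()
    w = np.exp(v)
    return w / w.sum()

def document_topic_representation(d, A, B):
    """p = softmax(A d); s = B^T p (weighted mean of output topic vectors).

    d: (a,) document vector; A: (k, a) input topic vectors;
    B: (k, b) output topic vectors.  Returns (p, s).
    """
    p = softmax(A @ d)    # attention over the k topics; sums to 1
    s = B.T @ p           # (b,) weighted mean of topic output vectors
    return p, s

rng = np.random.default_rng(1)
k, a, b = 10, 6, 4
d = rng.normal(size=a)
A = rng.normal(size=(k, a))
B = rng.normal(size=(k, b))
p, s = document_topic_representation(d, A, B)
```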

Language Model Component
The language model is implemented using LSTM units (Hochreiter and Schmidhuber, 1997):

i t = σ(W i v t + U i h t−1 + b i)
f t = σ(W f v t + U f h t−1 + b f)
o t = σ(W o v t + U o h t−1 + b o)
ĉ t = tanh(W c v t + U c h t−1 + b c)
c t = f t ⊙ c t−1 + i t ⊙ ĉ t
h t = o t ⊙ tanh(c t)

where ⊙ denotes the element-wise product; i t, f t and o t are the input, forget and output gate activations respectively at time step t; and v t, h t and c t are the input word embedding, LSTM hidden state, and cell state, respectively. Hereinafter W, U and b are used to refer to the model parameters.
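A single LSTM time step, written out in numpy for concreteness (toy dimensions; parameter names follow the standard W/U/b convention for the four gates):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(v, h_prev, c_prev, P):
    """One LSTM time step in the standard formulation.

    v: input word embedding; h_prev, c_prev: previous hidden/cell state;
    P: dict of weights W_*, U_* and biases b_* for the input (i),
    forget (f), output (o) and candidate (c) gates.
    """
    i = sigmoid(P["W_i"] @ v + P["U_i"] @ h_prev + P["b_i"])
    f = sigmoid(P["W_f"] @ v + P["U_f"] @ h_prev + P["b_f"])
    o = sigmoid(P["W_o"] @ v + P["U_o"] @ h_prev + P["b_o"])
    c_hat = np.tanh(P["W_c"] @ v + P["U_c"] @ h_prev + P["b_c"])
    c = f * c_prev + i * c_hat      # element-wise gated cell update
    h = o * np.tanh(c)              # hidden state for this time step
    return h, c

rng = np.random.default_rng(2)
e, dim = 5, 7                       # toy embedding and hidden sizes
P = {}
for g in "ifoc":
    P[f"W_{g}"] = rng.normal(size=(dim, e))
    P[f"U_{g}"] = rng.normal(size=(dim, dim))
    P[f"b_{g}"] = np.zeros(dim)
h, c = lstm_step(rng.normal(size=e), np.zeros(dim), np.zeros(dim), P)
```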
Traditionally, a language model operates at the sentence level, predicting the next word given its history of words in the sentence. The language model of tdlm incorporates topical information by assimilating the document-topic representation (s) with the hidden output of the LSTM (h t ) at each time step t. To prevent tdlm from memorising the next word via the topic model network, we exclude the current sentence from the document context.
We use a gating unit similar to a GRU (Chung et al., 2014) to allow tdlm to learn the degree of influence of topical information on the language model:

z t = σ(W z s + U z h t)
r t = σ(W r s + U r h t)
ĥ t = tanh(W h s + U h (r t ⊙ h t))
h′ t = (1 − z t) ⊙ h t + z t ⊙ ĥ t

where z t and r t are the update and reset gate activations respectively at time step t. The new hidden state h′ t is connected to a dense layer with linear transformation and softmax output to predict the next word, and the model is optimised using standard categorical cross-entropy loss.
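A GRU-style gate of this kind can be sketched as follows. This is one plausible parameterisation (the update gate z blending topical information into the LSTM output); the weight names and sizes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def topical_gate(s, h, P):
    """GRU-style gate blending the topic vector s into the LSTM output h.

    z controls how much topical information enters; r resets parts of
    h before computing the candidate state.  W_* map from the topic
    dimension, U_* map from the hidden dimension.
    """
    z = sigmoid(P["W_z"] @ s + P["U_z"] @ h)            # update gate
    r = sigmoid(P["W_r"] @ s + P["U_r"] @ h)            # reset gate
    h_hat = np.tanh(P["W_h"] @ s + P["U_h"] @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_hat                    # new hidden state

rng = np.random.default_rng(3)
b, dim = 4, 7                       # toy topic-vector and hidden sizes
P = {name: rng.normal(size=(dim, b if name.startswith("W") else dim))
     for name in ["W_z", "U_z", "W_r", "U_r", "W_h", "U_h"]}
h_new = topical_gate(rng.normal(size=b), rng.normal(size=dim), P)
```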

Training and Regularisation
tdlm is trained using minibatches and SGD. 3 For the language model, a minibatch consists of a batch of sentences, while for the topic model it is a batch of documents (each predicting a sequence of m 1 words).
We treat the language and topic models as subtasks in a multi-task learning setting, and train them jointly using categorical cross-entropy loss. Most parameters in the topic model are shared by the language model, as illustrated by their scopes (dotted lines) in Figure 1.
Hyper-parameters of tdlm are detailed in Table 1. Word embeddings for the topic model and language model components are not shared, although their dimensions are the same (e). 4 For m 1 , m 2 and m 3 , sequences/documents shorter than these thresholds are padded. Sentences longer than m 2 are broken into multiple sequences, and documents longer than m 3 are truncated. Optimal hyper-parameter settings are tuned using the development set; the presented values are used for experiments in Sections 4 and 5.
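The padding and splitting scheme for sequences can be sketched as follows (the pad token and toy sentences are illustrative; `m2` here plays the role of the maximum sequence length):

```python
def pad_or_clip(seq, max_len, pad="<pad>"):
    """Pad a token sequence to max_len, or truncate it."""
    return (seq + [pad] * (max_len - len(seq)))[:max_len]

def split_and_pad(sentences, m2, pad="<pad>"):
    """Break sentences longer than m2 into multiple sequences of
    length m2, padding the final chunk where necessary."""
    out = []
    for sent in sentences:
        for i in range(0, len(sent), m2):
            out.append(pad_or_clip(sent[i:i + m2], m2, pad))
    return out

batches = split_and_pad([["the", "cat", "sat"], ["hi"]], 2)
```

Documents longer than m3 would analogously be truncated with `pad_or_clip` at the document level.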
To regularise tdlm, we use dropout regularisation (Srivastava et al., 2014). We apply dropout to d and s in the topic model, and to the input word embedding and hidden output of the LSTM in the language model (Pham et al., 2013;Zaremba et al., 2014).

Language Model Evaluation
We use standard language model perplexity as the evaluation metric. In terms of datasets, we use document collections from 3 sources: APNEWS, IMDB and BNC. APNEWS is a collection of Associated Press 5 news articles from 2009 to 2016. IMDB is a set of movie reviews collected by Maas et al. (2011). BNC is the written portion of the British National Corpus (BNC Consortium, 2007), which contains excerpts from journals, books, letters, essays, memoranda, news and other types of text. For APNEWS and BNC, we randomly sub-sample a set of documents for our experiments.
For preprocessing, we tokenise words and sentences using Stanford CoreNLP (Klein and Manning, 2003). We lowercase all word tokens, filter word types that occur less than 10 times, and exclude the top 0.1% most frequent word types. 6 We additionally remove stopwords for the topic model document context. 7 All datasets are partitioned into training, development and test sets; preprocessed dataset statistics are presented in Table 2.
We tune hyper-parameters of tdlm based on development set language model perplexity. In general, we find that optimal settings are fairly robust across collections, with the exception of m 3 , as document length is collection dependent; optimal hyper-parameter values are given in Table 1. In terms of LSTM size, we explore 2 settings: a small model with 1 LSTM layer and 600 hidden units, and a large model with 2 layers and 900 hidden units. 8 For the topic number, we experiment with 50, 100 and 150 topics. Word embeddings are pre-trained 300-dimension word2vec Google News vectors. 9

We compare tdlm with the following models: 10

vanilla-lstm:
A standard LSTM language model, using the same tdlm hyper-parameters where applicable. This is the baseline model.

lclm:
A larger-context language model that incorporates context from preceding sentences (Wang and Cho, 2016), by treating the preceding sentences as a bag of words, and using an attentional mechanism when predicting the next word. An additional hyper-parameter in lclm is the number of preceding sentences to incorporate, which we tune based on the development set (to 4 sentences in each case). All other hyper-parameters (such as n batch , e, n epoch and k 2 ) are the same as for tdlm.

lstm+lda:
A standard LSTM language model that incorporates LDA topic information. We first train an LDA model (Blei et al., 2003) to learn 50/100/150 topics for APNEWS, IMDB and BNC (based on Gibbs sampling; α = 0.1, β = 0.01). For a document, the LSTM incorporates the LDA topic distribution q by concatenating it with the output hidden state h t to predict the next word (i.e. h′ t = h t ⊕ q). That is, it incorporates topical information into the language model, but unlike tdlm, the language model and topic model are trained separately.

We present language model perplexity performance in Table 3. All models outperform the baseline vanilla-lstm, with tdlm performing the best across all collections. lclm is competitive over the BNC, although the superiority of tdlm for the other collections is substantial. lstm+lda performs relatively well over APNEWS and IMDB, but very poorly over BNC.
The strong performance of tdlm over lclm suggests that compressing document context into topics benefits language modelling more than using extra context words directly. 12 Overall, our results show that topical information can help language modelling and that joint inference of topic and language model produces the best results.

Topic Model Evaluation
We saw that tdlm performs well as a language model, but it is also a topic model, and like LDA it produces: (1) a probability distribution over topics for each document (Equation (1)); and (2) a probability distribution over word types for each topic.

Recall that s is a weighted mean of topic vectors for a document (Equation (2)). Generating the vocabulary distribution for a particular topic is therefore trivial: we can do so by giving the topic of interest maximum attention weight (1.0) and all other topics no weight (0.0). Let B t denote the topic output vector for the t-th topic. To generate the multinomial distribution over word types for the t-th topic, we replace s with B t before computing the softmax over the vocabulary.
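This lookup can be sketched as follows, with a hypothetical output layer `W_out` (and bias `b_out`) standing in for the topic model's word-prediction layer:

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    w = np.exp(v)
    return w / w.sum()

def topic_word_distribution(B, W_out, b_out, t):
    """Vocabulary distribution for topic t: replace s with B[t].

    B: (k, b) output topic vectors; W_out: (V, b) output layer of the
    topic model's word predictor; b_out: (V,) bias; t: topic index.
    """
    return softmax(W_out @ B[t] + b_out)

rng = np.random.default_rng(4)
k, b, V = 8, 4, 50                  # toy sizes: 8 topics, 50-word vocab
B = rng.normal(size=(k, b))
W_out = rng.normal(size=(V, b))
dist = topic_word_distribution(B, W_out, np.zeros(V), 3)
top_words = np.argsort(-dist)[:10]  # indices of the topic's top-10 words
```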

Topic models are traditionally evaluated using model perplexity. There are various ways to estimate test perplexity (Wallach et al., 2009), but Chang et al. (2009) show that perplexity does not correlate with the coherence of the generated topics. Newman et al. (2010b), Mimno et al. (2011) and Aletras and Stevenson (2013) propose automatic approaches to computing topic coherence, and Lau et al. (2014) summarise these methods to understand their differences. We propose using automatic topic coherence as a means to evaluate the topic model aspect of tdlm.
Following Lau et al. (2014), we compute topic coherence using normalised PMI ("NPMI") scores. Given the top-n words of a topic, coherence is computed based on the sum of pair-wise NPMI scores between topic words, where the word probabilities used in the NPMI calculation are based on co-occurrence statistics mined from English Wikipedia with a sliding window (Newman et al., 2010b;Lau et al., 2014). 13 Based on the findings of Lau and Baldwin (2016), we average topic coherence over the top-5/10/15/20 topic words. To aggregate topic coherence scores for a model, we calculate the mean coherence over topics.
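A minimal NPMI coherence calculation might look like the following. The co-occurrence statistics here are toy values; a real implementation would mine sliding-window counts from English Wikipedia, and the choice of −1 for never-co-occurring pairs (the limiting NPMI value) is one common convention:

```python
import math
from itertools import combinations

def npmi_coherence(topic_words, word_freq, co_freq, n_windows):
    """Mean pairwise NPMI over the top words of one topic.

    word_freq[w]: number of sliding windows containing w;
    co_freq[(w1, w2)]: windows containing both (keys sorted);
    n_windows: total number of windows in the reference corpus.
    """
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        pair = tuple(sorted((w1, w2)))
        p1 = word_freq[w1] / n_windows
        p2 = word_freq[w2] / n_windows
        p12 = co_freq.get(pair, 0) / n_windows
        if p12 == 0.0:
            scores.append(-1.0)   # never co-occur: minimum NPMI
            continue
        pmi = math.log(p12 / (p1 * p2))
        scores.append(pmi / -math.log(p12))   # normalise to [-1, 1]
    return sum(scores) / len(scores)

# Toy statistics: "storm"/"snow" co-occur often, "storm"/"film" never.
wf = {"storm": 100, "snow": 80, "film": 90}
cf = {("snow", "storm"): 60}
coh_good = npmi_coherence(["storm", "snow"], wf, cf, 10000)
coh_bad = npmi_coherence(["storm", "film"], wf, cf, 10000)
```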
In terms of datasets, we use the same document collections (APNEWS, IMDB and BNC) as the language model experiments (Section 4). We use the same hyper-parameter settings for tdlm and do not tune them.
For comparison, we use the following topic models:

lda:
We use an LDA model as a baseline topic model, reusing the LDA models that were used to learn topic distributions for lstm+lda (Section 4).

ntm:
A neural topic model proposed by Cao et al. (2015). The document-topic and topic-word multinomials are expressed from a neural network perspective using differentiable functions. Model hyper-parameters are tuned using development loss.
Topic model performance is presented in Table 4.
There are two models of tdlm (tdlm-small and tdlm-large), which specify the size of its LSTM model (1 layer+600 hidden vs. 2 layers+900 hidden; see Section 4). tdlm achieves encouraging results: it has the best performance over APNEWS, and is competitive over IMDB. lda, however, produces more coherent topics over BNC. Interestingly, coherence appears to increase as the topic number increases for lda, but the trend is less pronounced for tdlm. ntm performs the worst of the 3 topic models, and manual inspection reveals that topics are in general not very interpretable. Overall, the results suggest that tdlm topics are competitive: at best they are more coherent than lda topics, and at worst they are as good as lda topics.
To better understand the spread of coherence scores and the impact of outliers, we present box plots for all models (number of topics = 100) over the 3 domains in Figure 2. Across all domains, ntm has poor performance and a larger spread of scores. The difference between lda and tdlm is small (tdlm > lda over APNEWS, but lda > tdlm over BNC), which is consistent with our previous observation that tdlm topics are competitive with lda topics.

Extensions
One strength of tdlm is its flexibility, owing to it taking the form of a neural network. To showcase this flexibility, we explore two simple extensions of tdlm, where we: (1) build a supervised model using document labels (Section 6.1); and (2) incorporate additional document metadata (Section 6.2).

Supervised Model
In datasets where document labels are known, supervised topic model extensions are designed to leverage the additional information to improve modelling quality. The supervised setting also has an additional advantage in that model evaluation is simpler, since models can be quantitatively assessed via classification accuracy.
To incorporate supervised document labels, we treat document classification as another sub-task in tdlm. Given a document and its label, we feed the document through the topic model network to generate the document-topic representation s, and connect it to another dense layer with softmax output to generate the probability distribution over classes.
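The classification sub-task is a single dense layer with softmax output over the document-topic representation s; a sketch (weights and sizes illustrative):

```python
import numpy as np

def softmax(v):
    v = v - v.max()
    w = np.exp(v)
    return w / w.sum()

def classify(s, W_cls, b_cls):
    """Class distribution from the document-topic representation s."""
    return softmax(W_cls @ s + b_cls)

rng = np.random.default_rng(5)
b, n_classes = 4, 20                # e.g. the 20 newsgroup categories
s = rng.normal(size=b)
probs = classify(s, rng.normal(size=(n_classes, b)), np.zeros(n_classes))
pred = int(np.argmax(probs))        # predicted class label
```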
During training, we have additional minibatches for the documents. We start the document classification training after the topic and language models have completed training in each epoch.
We use 20NEWS in this experiment, which is a popular dataset for text classification. 20NEWS is a collection of forum-like messages from 20 newsgroup categories. We use the "bydate" version of the dataset, where the train and test partitions are separated by a specific date. We sample 2K documents from the training set to create the development set. For preprocessing, we tokenise words and sentences using Stanford CoreNLP (Klein and Manning, 2003), and lowercase all words. As with previous experiments (Section 4), we additionally filter low/high frequency word types and stopwords. Preprocessed dataset statistics are presented in Table 5.
Table 7: Topic coherence and language model perplexity by incorporating classification tags on APNEWS. Boldface indicates optimal coherence and perplexity performance for each topic setting.

For comparison, we use the same two topic models as in Section 5: ntm and lda. Both ntm and lda have natural supervised extensions (Cao et al., 2015; McAuliffe and Blei, 2008) for incorporating document labels. For this task, we tune the model hyper-parameters based on development accuracy. 14

Classification accuracy for all models is presented in Table 6. We present tdlm results using only the small setting of the LSTM (1 layer + 600 hidden units), as we found there is little gain from using a larger LSTM.
ntm performs very strongly, outperforming both lda and tdlm by a substantial margin. Comparing lda and tdlm, tdlm achieves better performance, especially when there is a smaller number of topics. Upon inspection of the topics we found that ntm topics are much less coherent than those of lda and tdlm, consistent with our observations from Section 5.

Incorporating Document Metadata
In APNEWS, each news article contains additional document metadata, including subject classification tags, such as "General News", "Accidents and Disasters", and "Military and Defense". We present an extension to incorporate document metadata in tdlm to demonstrate its flexibility in integrating this additional information.
As some of the documents in our original APNEWS sample were missing tags, we re-sampled a set of APNEWS articles of the same size as our original sample, all of which have tags. In total, approximately 1500 unique tags are found among the training articles.
To incorporate these tags, we represent each of them as a learnable vector and concatenate it with the document vector before computing the attention distribution. Let z i ∈ R f denote the f-dimensional vector for the i-th tag. For the j-th document, we sum the vectors of all tags associated with it:

e = Σ i=1..n tags I(i, j) z i

where n tags is the total number of unique tags, and the indicator function I(i, j) returns 1 if the i-th tag is in the j-th document and 0 otherwise. We compute d as before (Section 3.1), and concatenate it with the summed tag vector: d′ = d ⊕ e.
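The tag summation and concatenation can be sketched as follows (names and toy sizes illustrative):

```python
import numpy as np

def tagged_document_vector(d, Z, tag_ids):
    """Concatenate d with the sum of the document's tag vectors.

    d: (a,) document vector; Z: (n_tags, f) learnable tag vectors;
    tag_ids: indices of the tags attached to this document.
    """
    e = Z[tag_ids].sum(axis=0) if tag_ids else np.zeros(Z.shape[1])
    return np.concatenate([d, e])   # d' = d ⊕ e, shape (a + f,)

rng = np.random.default_rng(6)
a, n_tags, f = 6, 1500, 5           # f = 5 as in the experiments
Z = rng.normal(size=(n_tags, f))
d = rng.normal(size=a)
d_tagged = tagged_document_vector(d, Z, [3, 41])
```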
We train two versions of tdlm on the new APNEWS dataset: (1) the vanilla version that ignores the tag information; and (2) the extended version which incorporates tag information. 15 We experimented with a few values for the tag vector size (f) and found that a small value works well; in the following experiments we use f = 5. We evaluate the models based on language model perplexity and topic model coherence, and present the results in Table 7. 16 In terms of language model perplexity, we see a consistent improvement over different topic settings, suggesting that the incorporation of tags improves modelling. In terms of topic coherence, there is a small but encouraging improvement (with one exception).

Table 8: Topics (top-10 words) and generated sentences.

Topic: protesters suspect gunman officers occupy gun arrests suspects shooting officer
• police say a suspect in the shooting was shot in the chest and later shot and killed by a police officer .
• a police officer shot her in the chest and the man was killed .
• police have said four men have been killed in a shooting in suburban london .

Topic: film awards actress comedy music actor album show nominations movie
• it 's like it 's not fair to keep a star in a light , " he says .
• but james , a four-time star , is just a unk .
• a unk adaptation of the movie " the dark knight rises " won best picture and he was nominated for best drama for best director of " unk , " which will be presented sunday night .

Topic: storm snow weather inches flooding rain service winds tornado forecasters
• temperatures are forecast to remain above freezing enough to reach a tropical storm or heaviest temperatures .
• snowfall totals were one of the busiest in the country .
• forecasters say tornado irene 's strong winds could ease visibility and funnel clouds of snow from snow monday to the mountains .

Topic: virus nile flu vaccine disease outbreak infected symptoms cough tested
• he says the disease was transmitted by an infected person .
• unk says the man 's symptoms are spread away from the heat .
• meanwhile in the unk , the virus has been common in the mojave desert .
To investigate whether the vectors learnt for these tags are meaningful, we plot the top-14 most frequent tags in Figure 3. 17 The plot seems reasonable: there are a few related tags that are close to each other, e.g. "State government" and "Government and politics"; "Crime" and "Violent Crime"; and "Social issues" and "Social affairs".

Discussion
Topics generated by topic models are typically interpreted by way of their top-N highest-probability words. In tdlm, we can additionally generate sentences related to a topic, providing another way to understand it. To do this, we constrain the topic vector for the language model to be the topic output vector B t of a particular topic (Section 5).
We present 4 topics from an APNEWS model (k = 100; LSTM size = "large") and 3 randomly generated sentences conditioned on each topic in Table 8. 18 The generated sentences highlight the content of the topics, providing another interpretable aspect for the topics. These results also reinforce that the language model is driven by topics. (As the vanilla tdlm is trained on the new APNEWS dataset, the numbers in Table 7 are slightly different to those in Tables 3 and 4; the 5-dimensional tag vectors plotted in Figure 3 are compressed using PCA.)

Conclusion
We propose tdlm, a topically driven neural language model. tdlm has two components: a language model and a topic model, which are jointly trained as a neural network. We demonstrate that tdlm outperforms a state-of-the-art language model that incorporates larger context, and that its topics are potentially more coherent than LDA topics. We additionally propose simple extensions of tdlm to incorporate information such as document labels and metadata, achieving encouraging results.