A Latent Variable Recurrent Neural Network for Discourse-Driven Language Models

This paper presents a novel latent variable recurrent neural network architecture for jointly modeling sequences of words and (possibly latent) discourse relations between adjacent sentences. A recurrent neural network generates individual words, thus reaping the benefits of discriminatively-trained vector representations. The discourse relations are represented with a latent variable, which can be predicted or marginalized, depending on the task. The resulting model can therefore employ a training objective that includes not only discourse relation classification, but also word prediction. As a result, it outperforms state-of-the-art alternatives for two tasks: implicit discourse relation classification in the Penn Discourse Treebank, and dialog act classification in the Switchboard corpus. Furthermore, by marginalizing over latent discourse relations at test time, we obtain a discourse-informed language model, which improves over a strong LSTM baseline.


Introduction
Natural language processing (NLP) has recently experienced a neural network "tsunami" (Manning, 2016). A key advantage of these neural architectures is that they employ discriminatively-trained distributed representations, which can capture the meaning of linguistic phenomena ranging from individual words (Turian et al., 2010) to longer-range linguistic contexts at the sentence level (Socher et al., 2013) and beyond (Le and Mikolov, 2014). Because they are discriminatively trained, these methods can learn representations that yield very accurate predictive models (e.g., Dyer et al., 2015).
However, in comparison with the probabilistic graphical models that were previously the dominant machine learning approach for NLP, neural architectures lack flexibility. By treating linguistic annotations as random variables, probabilistic graphical models can marginalize over annotations that are unavailable at test or training time, elegantly modeling multiple linguistic phenomena in a joint framework (Finkel et al., 2006). But because these graphical models represent uncertainty for every element in the model, adding too many layers of latent variables makes them difficult to train.
In this paper, we present a hybrid architecture that combines a recurrent neural network language model with a latent variable model over shallow discourse structure. In this way, the model learns a discriminatively-trained distributed representation of the local contextual features that drive word choice at the intra-sentence level, using techniques that are now state-of-the-art in language modeling (Mikolov et al., 2010). However, the model treats shallow discourse structure (specifically, the relationships between pairs of adjacent sentences) as a latent variable. As a result, the model can act as both a discourse relation classifier and a language model. Specifically:
• If trained to maximize the conditional likelihood of the discourse relations, it outperforms state-of-the-art methods for both implicit discourse relation classification in the Penn Discourse Treebank (Rutherford and Xue, 2015) and dialog act classification in Switchboard (Kalchbrenner and Blunsom, 2013). The model learns from both the discourse annotations and the language modeling objective, unlike previous recursive neural architectures that learn only from annotated discourse relations (Ji and Eisenstein, 2015).
• If the model is trained to maximize the joint likelihood of the discourse relations and the text, it is possible to marginalize over discourse relations at test time, outperforming language models that do not account for discourse structure.
In contrast to recent work on continuous latent variables in recurrent neural networks (Chung et al., 2015), which require complex variational autoencoders to represent uncertainty over the latent variables, our model is simple to implement and train, requiring only minimal modifications to existing recurrent neural network architectures that are implemented in commonly-used toolkits such as Theano, Torch, and CNN.
We focus on a class of shallow discourse relations, which hold between pairs of adjacent sentences (or utterances). These relations describe how the adjacent sentences are related: for example, they may be in CONTRAST, or the latter sentence may offer an answer to a question posed by the previous sentence. Shallow relations do not capture the full range of discourse phenomena (Webber et al., 2012), but they account for two well-known problems: implicit discourse relation classification in the Penn Discourse Treebank, which was the 2015 CoNLL shared task (Xue et al., 2015); and dialog act classification, which characterizes the structure of interpersonal communication in the Switchboard corpus (Stolcke et al., 2000), and is a key component of contemporary dialog systems (Williams and Young, 2007). Our model outperforms state-of-the-art alternatives for implicit discourse relation classification in the Penn Discourse Treebank, and for dialog act classification in the Switchboard corpus.

Background
Our model scaffolds on recurrent neural network (RNN) language models (Mikolov et al., 2010), and recent variants that exploit multiple levels of linguistic detail (Ji et al., 2015; Lin et al., 2015).

RNN Language Models
Let us denote token $n$ in a sentence $t$ by $y_{t,n} \in \{1 \ldots V\}$, and write $y_t = \{y_{t,n}\}_{n \in \{1 \ldots N_t\}}$ to indicate the sequence of words in sentence $t$. In an RNN language model, the probability of the sentence is decomposed as,

$$p(y_t) = \prod_{n=1}^{N_t} p(y_{t,n} \mid y_{t,<n}), \qquad (1)$$

where the probability of each word $y_{t,n}$ is conditioned on the entire preceding sequence of words $y_{t,<n}$ through the summary vector $h_{t,n-1}$. This vector is computed recurrently from $h_{t,n-2}$ and from the embedding of the current word, $X_{y_{t,n-1}}$, where $X \in \mathbb{R}^{K \times V}$ and $K$ is the dimensionality of the word embeddings. The language model can then be summarized as,

$$h_{t,n} = f(X_{y_{t,n}}, h_{t,n-1}) \qquad (2)$$

$$p(y_{t,n} \mid y_{t,<n}) = \mathrm{softmax}(W_o h_{t,n-1} + b_o), \qquad (3)$$

where the matrix $W_o \in \mathbb{R}^{V \times K}$ defines the output embeddings, and $b_o \in \mathbb{R}^{V}$ is an offset. The function $f(\cdot)$ is a deterministic non-linear transition function. It typically takes an element-wise non-linear transformation (e.g., tanh) of a vector resulting from the sum of the word embedding and a linear transformation of the previous hidden state.
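The recurrence described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the recurrent matrix name `W_h`, the all-zero initial state, and the random toy parameters are all hypothetical, and the transition is the simple tanh form rather than an LSTM.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def rnn_step(x, h_prev, W_h):
    """Simple transition f: tanh(word embedding + W_h @ previous hidden state)."""
    return [math.tanh(xi + ri) for xi, ri in zip(x, matvec(W_h, h_prev))]


def sentence_log_prob(tokens, X, W_h, W_o, b_o):
    """Return sum_n log p(y_n | y_<n) for one sentence of word indices."""
    h = [0.0] * len(W_h)  # assumption: all-zero initial hidden state
    total = 0.0
    for w in tokens:
        # Predict word w from the summary of the preceding words.
        scores = [s + b for s, b in zip(matvec(W_o, h), b_o)]
        total += math.log(softmax(scores)[w])
        # Fold the embedding of w (stored here as row X[w]) into the state.
        h = rnn_step(X[w], h, W_h)
    return total


def random_matrix(rows, cols):
    """Toy parameter matrix with small random entries (illustration only)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
V, K = 5, 3  # toy vocabulary size and embedding dimension
X, W_h, W_o, b_o = random_matrix(V, K), random_matrix(K, K), random_matrix(V, K), [0.0] * V
print(sentence_log_prob([2, 0, 4], X, W_h, W_o, b_o))
```

Because each per-token distribution is a softmax, the probabilities of all sentences of a fixed length sum to one, which is a convenient sanity check for any implementation of this decomposition.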
The model as described thus far is identical to the recurrent neural network language model (RNNLM) of Mikolov et al. (2010). In this paper, we replace the above simple hidden state units with the more complex Long Short-Term Memory units (Hochreiter and Schmidhuber, 1997), which have consistently been shown to yield much stronger performance in language modeling (Pham et al., 2014). For simplicity, we still use the term RNNLM in referring to this model.

Document Context Language Model
One drawback of the RNNLM is that it cannot propagate long-range information between the sentences. Even if we remove sentence boundaries, long-range information will be attenuated by repeated application of the non-linear transition function. Ji et al. (2015) propose the Document Context Language Model (DCLM) to address this issue. The core idea is to represent context with two vectors: $h_{t,n}$, representing intra-sentence word-level context, and $c_t$, representing inter-sentence context. These two vectors are then linearly combined in the generation function for word $y_{t,n}$,

$$p(y_{t,n} \mid y_{t,<n}, y_{<t}) = \mathrm{softmax}(W_o h_{t,n-1} + W_c c_{t-1} + b_o), \qquad (5)$$

where $c_{t-1}$ is set to the last hidden state of the previous sentence. Ji et al. (2015) show that this model can improve language model perplexity.

Discourse Relation Language Models
We now present a probabilistic neural model over sequences of words and shallow discourse relations. Discourse relations $z_t$ are treated as latent variables, which are linked with a recurrent neural network over words in a latent variable recurrent neural network (Chung et al., 2015).

The Model
Our model (see Figure 1) is formulated as a two-step generative story. In the first step, context information from the sentence $(t-1)$ is used to generate the discourse relation between sentences $(t-1)$ and $t$,

$$p(z_t \mid y_{t-1}) = \mathrm{softmax}(U c_{t-1} + b), \qquad (6)$$

where $z_t$ is a random variable capturing the discourse relation between the two sentences, $U$ and $b$ are the relation prediction parameters, and $c_{t-1}$ is a vector summary of the contextual information from sentence $(t-1)$, just as in the DCLM (Equation 5). The model maintains a default context vector $c_0$ for the first sentences of documents, and treats it as a parameter learned with other model parameters during training.
In the second step, the sentence $y_t$ is generated, conditioning on the preceding sentence $y_{t-1}$ and the discourse relation $z_t$:

$$p(y_t \mid z_t, y_{t-1}) = \prod_{n=1}^{N_t} p(y_{t,n} \mid y_{t,<n}, y_{t-1}, z_t). \qquad (7)$$

The generative probability for the sentence $y_t$ decomposes across tokens as usual (Equation 7). The per-token probabilities are shown in Equation 4, in Figure 2. Discourse relations are incorporated by parameterizing the output matrices $W_o^{(z_t)}$ and $W_c^{(z_t)}$ by the relation $z_t$.
Overall, the joint probability of the text and discourse relations is,

$$p(y_{1:T}, z_{1:T}) = \prod_{t=1}^{T} p(z_t \mid y_{t-1}) \, p(y_t \mid z_t, y_{t-1}). \qquad (8)$$

If the discourse relations $z_t$ are not observed, then our model is a form of latent variable recurrent neural network (LVRNN). Connections to recent work on LVRNNs are discussed in § 6; the key difference is that the latent variables here correspond to linguistically meaningful elements, which we may wish to predict or marginalize, depending on the situation.
Parameter Tying. As proposed, the Discourse Relation Language Model has a large number of parameters. Let $K$, $H$ and $V$ be the input dimension, hidden dimension and the size of vocabulary in language modeling. The size of each prediction matrix is $V \times H$; there are two such matrices for each possible discourse relation. We reduce the number of parameters by factoring each of these matrices into two components:

$$W_o^{(z)} = W_o \cdot V^{(z)}, \qquad W_c^{(z)} = W_c \cdot M^{(z)}, \qquad (9)$$

where $V^{(z)}$ and $M^{(z)}$ are relation-specific components for intra-sentential and inter-sentential contexts; the size of these matrices is $H \times H$, with $H \ll V$. The larger $V \times H$ matrices $W_o$ and $W_c$ are shared across all relations.
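To make the saving from this factorization concrete, the following sketch counts output-layer weights with and without tying. The sizes used (a 10,000-word vocabulary, hidden dimension 64, 42 relation types) are illustrative values borrowed from elsewhere in the paper, not a configuration the authors report; bias terms are ignored, and the function names are hypothetical.

```python
def untied_output_params(vocab_size, hidden_dim, num_relations):
    """Two full V x H matrices for every discourse relation."""
    return num_relations * 2 * vocab_size * hidden_dim


def tied_output_params(vocab_size, hidden_dim, num_relations):
    """Two shared V x H matrices plus two H x H matrices per relation."""
    shared = 2 * vocab_size * hidden_dim
    per_relation = num_relations * 2 * hidden_dim * hidden_dim
    return shared + per_relation


V, H, Z = 10_000, 64, 42  # illustrative sizes only
untied = untied_output_params(V, H, Z)
tied = tied_output_params(V, H, Z)
print(untied)  # 53760000
print(tied)    # 1624064
print(round(untied / tied, 1))
```

Under these assumed sizes the tied parameterization is more than an order of magnitude smaller, because the relation-specific cost scales with $H^2$ rather than with $V \times H$.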

Inference
There are two possible inference scenarios: inference over discourse relations, conditioning on words; and inference over words, marginalizing over discourse relations.
Inference over Discourse Relations. The probability of discourse relations given the sentences $p(z_{1:T} \mid y_{1:T})$ is decomposed into the product of probabilities of individual discourse relations conditioned on the adjacent sentences, $\prod_t p(z_t \mid y_t, y_{t-1})$. These probabilities are computed by Bayes' rule:

$$p(z_t \mid y_t, y_{t-1}) = \frac{p(y_t \mid z_t, y_{t-1}) \, p(z_t \mid y_{t-1})}{\sum_{z'} p(y_t \mid z', y_{t-1}) \, p(z' \mid y_{t-1})}. \qquad (10)$$

The terms in each product are given in Equations 6 and 7. Normalizing involves only a sum over a small finite number of discourse relations. Note that inference is easy in our case because all words are observed and there is no probabilistic coupling of the discourse relations.
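The Bayes' rule computation described above is best carried out in log space, since sentence likelihoods are products of many small per-token probabilities. The sketch below is a minimal illustration; the function names and the toy numbers are assumptions, and in practice the two input lists would come from the relation prior and the relation-specific sentence likelihoods of a trained model.

```python
import math


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of log-domain values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def relation_posterior(log_prior, log_likelihood):
    """Posterior over relations for one sentence pair.

    log_prior[k]      = log p(z = k | previous sentence)
    log_likelihood[k] = log p(current sentence | z = k, previous sentence)
    """
    joint = [lp + ll for lp, ll in zip(log_prior, log_likelihood)]
    normalizer = log_sum_exp(joint)  # sum over the small set of relations
    return [math.exp(j - normalizer) for j in joint]


# Toy example with three relations (numbers are made up for illustration).
log_prior = [math.log(0.5), math.log(0.3), math.log(0.2)]
log_likelihood = [-40.0, -38.0, -45.0]
posterior = relation_posterior(log_prior, log_likelihood)
predicted = max(range(len(posterior)), key=posterior.__getitem__)
print(posterior, predicted)
```

Because the relations are decoupled given the words, this per-pair computation can be applied independently to every adjacent sentence pair in a document.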
Inference over Words. In discourse-informed language modeling, we marginalize over discourse relations to compute the probability of a sequence of sentences $y_{1:T}$, which can be written as,

$$p(y_{1:T}) = \prod_{t=1}^{T} \sum_{z_t} p(z_t \mid y_{t-1}) \, p(y_t \mid z_t, y_{t-1}), \qquad (11)$$

because the word sequences are observed, decoupling each $z_t$ from its neighbors $z_{t+1}$ and $z_{t-1}$. This decoupling ensures that we can compute the overall marginal likelihood as a product over local marginals.
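The practical consequence of this decoupling can be checked numerically: a sum over all joint assignments of relations equals a product of per-sentence sums. The sketch below compares the two on a toy document. All names and numbers are hypothetical; the per-sentence log terms are treated as fixed inputs, which is valid here because they depend only on the observed words.

```python
import itertools
import math


def log_sum_exp(values):
    """Stable log(sum(exp(v))) for a list of log-domain values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def document_log_marginal(sentences):
    """Sum over sentences of the local log marginal over relations.

    sentences: list of (log_prior, log_likelihood) pairs, one per sentence,
    each a list indexed by relation.
    """
    total = 0.0
    for log_prior, log_lik in sentences:
        total += log_sum_exp([p + l for p, l in zip(log_prior, log_lik)])
    return total


def brute_force_log_marginal(sentences):
    """Same quantity by enumerating every joint relation sequence."""
    num_relations = len(sentences[0][0])
    terms = []
    for z in itertools.product(range(num_relations), repeat=len(sentences)):
        terms.append(sum(lp[k] + ll[k] for (lp, ll), k in zip(sentences, z)))
    return log_sum_exp(terms)


# Toy document: three sentences, two relations (made-up numbers).
doc = [
    ([math.log(0.6), math.log(0.4)], [-12.0, -11.5]),
    ([math.log(0.2), math.log(0.8)], [-20.0, -22.0]),
    ([math.log(0.5), math.log(0.5)], [-9.0, -9.5]),
]
fast = document_log_marginal(doc)
slow = brute_force_log_marginal(doc)
print(fast, slow)
```

The local computation costs time linear in the number of sentences, whereas the enumeration grows exponentially; the two agree only because no relation depends on its neighbors once the words are observed.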

Learning
The model can be trained in two ways: to maximize the joint probability $p(y_{1:T}, z_{1:T})$, or to maximize the conditional probability $p(z_{1:T} \mid y_{1:T})$. The joint training objective is more suitable for language modeling scenarios, and the conditional objective is better for discourse relation prediction. We now describe each objective in detail.
Joint likelihood objective. The joint likelihood objective function is directly adopted from the joint probability defined in Equation 8. The objective function for a single document with $T$ sentences or utterances is,

$$\ell(\theta) = \sum_{t=1}^{T} \Big[ \log p(z_t \mid y_{t-1}) + \sum_{n=1}^{N_t} \log p(y_{t,n} \mid y_{t,<n}, y_{t-1}, z_t) \Big], \qquad (12)$$

where $\theta$ represents the collection of all model parameters, including the parameters in the LSTM units and the word embeddings.
Maximizing the objective function $\ell(\theta)$ will jointly optimize the model on both language modeling and discourse relation prediction. As such, it can be viewed as a form of multi-task learning (Caruana, 1997), where we learn a shared representation that works well for discourse relation prediction and for language modeling. However, in practice, the large vocabulary size and number of tokens mean that the language modeling part of the objective function tends to dominate.

Conditional objective
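This training objective is specific to the discourse relation prediction task, and based on Equation 10 can be written as:

$$\begin{aligned} \ell_c(\theta) &= \sum_{t=1}^{T} \log p(z_t \mid y_{t-1}, y_t) \\ &= \sum_{t=1}^{T} \Big[ \log p(z_t \mid y_{t-1}) + \log p(y_t \mid z_t, y_{t-1}) \\ &\qquad\quad - \log \sum_{z'} p(z' \mid y_{t-1}) \, p(y_t \mid z', y_{t-1}) \Big]. \end{aligned} \qquad (13)$$

In the expansion in Equation 13, the first line is the same as $\ell(\theta)$, but the second line reflects the normalization over all possible values of $z_t$. This forces the objective function to attend specifically to the problem of maximizing the conditional likelihood of the discourse relations and treat language modeling as an auxiliary task (Collobert et al., 2011).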

Modeling limitations
The discourse relation language model is carefully designed to decouple the discourse relations from each other, after conditioning on the words. It is clear that text documents and spoken dialogues have sequential discourse structures, and it seems likely that modeling this structure could improve performance. In a traditional hidden Markov model (HMM) generative approach (Stolcke et al., 2000), modeling sequential dependencies is not difficult, because training reduces to relative frequency estimation. However, in the hybrid probabilistic-neural architecture proposed here, training is already expensive, due to the large number of parameters that must be estimated. Adding probabilistic couplings between adjacent discourse relations $z_{t-1}, z_t$ would require the use of dynamic programming for both training and inference, increasing time complexity by a factor that is quadratic in the number of discourse relations. We did not attempt this in this paper; we do compare against a conventional HMM on the dialogue act prediction task in § 5.
Ji et al. (2015) propose an alternative form of the document context language model, in which the contextual information $c_t$ impacts the hidden state $h_{t+1}$, rather than going directly to the outputs $y_{t+1}$. They obtain slightly better perplexity with this approach, which has fewer trainable parameters. However, this model would couple $z_t$ with all subsequent sentences $y_{>t}$, making prediction and marginalization of discourse relations considerably more challenging. Sequential Monte Carlo algorithms offer a possible solution (de Freitas et al., ; Gu et al., 2015), which may be considered in future work.

Data and Implementation
We evaluate our model on two benchmark datasets: (1) the Penn Discourse Treebank (Prasad et al., 2008, PDTB), which is annotated on a corpus of Wall Street Journal articles; (2) the Switchboard dialogue act corpus (Stolcke et al., 2000, SWDA), which is annotated on a collection of phone conversations. Both corpora contain annotations of discourse relations and dialogue relations that hold between adjacent spans of text.
The Penn Discourse Treebank (PDTB) provides a low-level discourse annotation on written texts. In the PDTB, each discourse relation is annotated between two argument spans, Arg1 and Arg2. There are two types of relations: explicit and implicit. Explicit relations are signalled by discourse markers (e.g., "however", "moreover"), and the span of Arg1 is almost totally unconstrained: it can range from a single clause to an entire paragraph, and need not be adjacent to either Arg2 or the discourse marker. However, automatically classifying these relations is considered to be relatively easy, due to the constraints from the discourse marker itself (Pitler et al., 2008). In addition, explicit relations are difficult to incorporate into language models which must generate each word exactly once. In contrast, implicit discourse relations are annotated only between adjacent sentences, based on a semantic understanding of the discourse arguments. Automatically classifying these discourse relations is a challenging task (Lin et al., 2009; Pitler et al., 2009; Rutherford and Xue, 2015; Ji and Eisenstein, 2015). We therefore focus on implicit discourse relations, leaving to future work the question of how to apply our modeling framework to explicit discourse relations. During training, we collapse all relation types other than implicit (explicit, ENTREL, and NOREL) into a single dummy relation type, which holds between all adjacent sentence pairs that do not share an implicit relation.
As in the prior work on first-level discourse relation identification (e.g., Park and Cardie, 2012), we use sections 2-20 of the PDTB as the training set, sections 0-1 as the development set for parameter tuning, and sections 21-22 for testing. For preprocessing, we lower-cased all tokens, and substituted all numbers with a special token "NUM". To build the vocabulary, we kept the 10,000 most frequent words from the training set, and replaced low-frequency words with a special token "UNK". In prior work that focuses on detecting individual relations, balanced training sets are constructed so that there are an equal number of instances with and without each relation type (Park and Cardie, ; Biran and McKeown, 2013; Rutherford and Xue, 2014). In this paper, we target the more challenging multi-way classification problem, so this strategy is not applicable; in any case, since our method deals with entire documents, it is not possible to balance the training set in this way.
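The preprocessing steps just described (lower-casing, number substitution, and a frequency-capped vocabulary) might be sketched as follows. This is an illustration under assumptions rather than the authors' pipeline: the regular expression that decides what counts as a number, the choice to count "NUM" toward the vocabulary, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Assumption: a token is a "number" if it is digits with optional sign,
# thousands separators, and a decimal part. The paper does not specify this.
NUMBER_PATTERN = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?$")


def normalize(token):
    """Lower-case a token and map numeric tokens to the special token NUM."""
    token = token.lower()
    return "NUM" if NUMBER_PATTERN.match(token) else token


def build_vocab(train_sentences, max_size=10_000):
    """Keep the max_size most frequent normalized tokens from the training set."""
    counts = Counter(normalize(tok) for sent in train_sentences for tok in sent)
    return {word for word, _ in counts.most_common(max_size)}


def encode(sentence, vocab):
    """Normalize a sentence and replace out-of-vocabulary tokens with UNK."""
    return [tok if tok in vocab else "UNK" for tok in map(normalize, sentence)]


train = [
    ["The", "price", "rose", "3.5", "percent"],
    ["The", "price", "fell", "12", "percent"],
]
vocab = build_vocab(train, max_size=4)  # tiny cap so the toy data shows UNK
print(encode(["The", "price", "jumped", "7"], vocab))
```

The vocabulary is built from the training set only, so the same `encode` call can be applied unchanged to development and test sentences.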
The Switchboard Dialog Act Corpus (SWDA) is annotated on the Switchboard Corpus of human-human conversational telephone speech (Godfrey et al., 1992). The annotations label each utterance with one of 42 possible speech acts, such as AGREE, HEDGE, and WH-QUESTION. Because these speech acts form the structure of the dialogue, most of them pertain to both the preceding and succeeding utterances (e.g., AGREE). The SWDA corpus includes 1,155 five-minute conversations. We adopted the standard split from Stolcke et al. (2000), using 1,115 conversations for training and nineteen conversations for test. For parameter tuning, we randomly select nineteen conversations from the training set as the development set. After parameter tuning, we train the model on the full training set with the selected configuration. We use the same preprocessing techniques here as in the PDTB.

Implementation
We use a single-layer LSTM to build the recurrent architecture of our models, which we implement in the CNN package. Our implementation is available at https://github.com/jiyfeng/drlm. Some additional details follow.
Initialization. Following prior work on RNN initialization (Bengio, 2012), all parameters except the relation prediction parameters $U$ and $b$ are initialized with random values drawn from the range $\left[-\sqrt{6/(d_1 + d_2)}, \sqrt{6/(d_1 + d_2)}\right]$, where $d_1$ and $d_2$ are the input and output dimensions of the parameter matrix respectively. The matrix $U$ is initialized with random numbers from $[-10^{-5}, 10^{-5}]$ and $b$ is initialized to 0.
Learning. Online learning was performed using AdaGrad (Duchi et al., 2011) with initial learning rate $\lambda = 0.1$. To avoid the exploding gradient problem, we used the norm clipping trick with a threshold of $\tau = 5.0$ (Pascanu et al., 2012). In addition, we used value dropout (Srivastava et al., 2014) with rate 0.5, on the input $X$, context vector $c$ and hidden state $h$, similar to the architecture proposed by Pham et al. (2014). The training procedure is monitored by the performance on the development set. In our experiments, 4 to 5 epochs were enough.
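A single update combining these two ingredients might look like the sketch below, written for a flat list of parameters. It is a generic AdaGrad step with gradient-norm clipping, not the CNN package's implementation; the stabilizing constant `eps`, the choice to clip one flat gradient vector rather than each matrix separately, and the function names are assumptions.

```python
import math


def clip_by_norm(grad, threshold=5.0):
    """Rescale the gradient so that its L2 norm is at most threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grad]
    return list(grad)


def adagrad_step(params, grad, accum, lr=0.1, eps=1e-8):
    """One in-place AdaGrad update on clipped gradients.

    accum holds the running sum of squared gradients per parameter;
    eps is an assumed small constant that avoids division by zero.
    """
    grad = clip_by_norm(grad)
    for i, g in enumerate(grad):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)
    return params, accum


params, accum = [1.0, -1.0], [0.0, 0.0]
params, accum = adagrad_step(params, [3.0, 4.0], accum)  # norm 5.0, not clipped
print(params, accum)
print(clip_by_norm([30.0, 40.0]))  # norm 50.0, rescaled to norm 5.0
```

On the first step AdaGrad moves every parameter by roughly the learning rate regardless of gradient magnitude; later steps shrink as squared gradients accumulate.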
Hyper-parameters. Our model includes two tunable hyper-parameters: the dimension of the word representation $K$, and the hidden dimension of the LSTM unit $H$. We consider the values {32, 48, 64, 96, 128} for both $K$ and $H$. For each corpus in the experiments, the best combination of $K$ and $H$ is selected via grid search on the development set.

Experiments
Our main evaluation is discourse relation prediction, using the PDTB and SWDA corpora. We also evaluate on language modeling, to determine whether incorporating discourse annotations at training time and then marginalizing them at test time can improve performance.

Implicit discourse relation prediction on the PDTB
We first evaluate our model with implicit discourse relation prediction on the PDTB dataset. Most of the prior work on first-level discourse relation prediction focuses on the "one-versus-all" binary classification setting, but we attack the more general four-way classification problem, as performed by Rutherford and Xue (2015). We compare against the following methods:
Rutherford and Xue (2015) build a set of feature-rich classifiers on the PDTB, and then augment these classifiers with additional automatically-labeled training instances. We compare against their published results, which are state-of-the-art.
Ji and Eisenstein (2015) employ a recursive neural network architecture. Their experimental setting is different, so we re-run their system using the same setting as described in § 4.
Results. As shown in Table 1, the conditionally-trained discourse relation language model (DRLM) outperforms all alternatives, on both metrics. While the jointly-trained DRLM is at the same level as the previous state-of-the-art, conditional training on the same model provides a significant additional advantage, as indicated by a binomial test.

Dialogue Act tagging
Dialogue act tagging has been widely studied in both NLP and speech communities. We follow the setup used by Stolcke et al. (2000) to conduct experiments, and adopt the following systems for comparison:
Stolcke et al. (2000) employ a hidden Markov model, with each HMM state corresponding to a dialogue act.
Kalchbrenner and Blunsom (2013) employ a complex neural architecture, with a convolutional network at each utterance and a recurrent network over the length of the dialog. To our knowledge, this model attains state-of-the-art accuracy on this task, outperforming other prior work such as Webb et al. (2005) and Milajevs and Purver (2014).
Prior work
3. Stolcke et al. (2000): 71.0
4. Kalchbrenner and Blunsom (2013): 73.9
Accuracy is more reliable on this evaluation, since no single class dominates, unlike the PDTB task.

Discourse-aware language modeling
As a joint model for discourse and language modeling, DRLM can also function as a language model, assigning probabilities to sequences of words while marginalizing over discourse relations. To determine whether discourse-aware language modeling can improve performance, we compare against the following systems:
RNNLM+LSTM. This is the same basic architecture as the RNNLM proposed by Mikolov et al. (2010), which was shown to outperform a Kneser-Ney smoothed 5-gram model on modeling Wall Street Journal text. Following Pham et al. (2014), we replace the sigmoid nonlinearity with a long short-term memory (LSTM) unit.
DCLM. We compare against the Document Context Language Model (DCLM) of Ji et al. (2015). We use the "context-to-output" variant, which is identical to the current modeling approach, except that it is not parametrized by discourse relations. This model achieves strong results on language modeling for small and medium-sized corpora, outperforming RNNLM+LSTM.

Results
The perplexities of language modeling on the PDTB and the SWDA are summarized in Table 3. The comparison between line 1 and line 2 shows the benefit of considering multi-sentence context information on language modeling. Line 3 shows that adding discourse relation information yields further improvements for both datasets. We emphasize that discourse relations in the test documents are marginalized out, so no annotations are required for the test set; the improvements are due to the disambiguating power of discourse relations in the training set.
Because our training procedure requires discourse annotations, this approach does not scale to the large datasets typically used in language modeling. As a consequence, the results obtained here are somewhat academic, from the perspective of practical language modeling. Nonetheless, the positive results here motivate the investigation of training procedures that are also capable of marginalizing over discourse relations at training time.

Related Work
This paper draws on previous work in both discourse modeling and language modeling.
Discourse and dialog modeling. Early work on discourse relation classification utilizes rich, handcrafted feature sets (Joty et al., 2012; Lin et al., 2009; Sagae, 2009). Recent representation learning approaches attempt to learn good representations jointly with discourse relation classifiers and discourse parsers (Ji and Eisenstein, 2014; Li et al., 2014). Of particular relevance are applications of neural architectures to PDTB implicit discourse relation classification (Ji and Eisenstein, 2015; Zhang et al., 2015; Braud and Denis, 2015). All of these approaches are essentially classifiers, and take supervision only from the 16,000 annotated discourse relations in the PDTB training set. In contrast, our approach is a probabilistic model over the entire text.
Probabilistic models are frequently used in dialog act tagging, where hidden Markov models have been a dominant approach (Stolcke et al., 2000). In that work, the emission distribution is an n-gram language model for each dialogue act; we use a conditionally-trained recurrent neural network language model. An alternative neural approach for dialogue act tagging is the combined convolutional-recurrent architecture of Kalchbrenner and Blunsom (2013). Our modeling framework is simpler, relying on a latent variable parametrization of a purely recurrent architecture.
Language modeling. There are an increasing number of attempts to incorporate document-level context information into language modeling. For example, Mikolov and Zweig (2012) introduce LDA-style topics into RNN-based language modeling. Sordoni et al. (2015) use a convolutional structure to summarize the context from the previous two utterances as a context vector for RNN-based language modeling. Our models in this paper provide a unified framework to model the context and current sentence. Wang and Cho (2015) and Lin et al. (2015) construct bag-of-words representations of previous sentences, which are then used to inform the RNN language model that generates the current sentence.
The most relevant work is the Document Context Language Model (Ji et al., 2015, DCLM); we describe the connection to this model in § 2. By adding discourse information as a latent variable, we attain better perplexity on held-out data.
Latent variable neural networks. Introducing latent variables to a neural network model increases its representational capacity, which is the main goal of prior efforts in this space (Kingma and Welling, 2014; Chung et al., 2015). From this perspective, our model with discourse relations as latent variables shares the same merit. Unlike this prior work, in our approach, the latent variables carry a linguistic interpretation, and are at least partially observed. Also, these prior models employ continuous latent variables, requiring complex inference techniques such as variational autoencoders (Kingma and Welling, 2014; Burda et al., 2016; Chung et al., 2015). In contrast, the discrete latent variables in our model are easy to sum and maximize over.

Conclusion
We have presented a probabilistic neural model over sequences of words and shallow discourse relations between adjacent sequences. This model combines positive aspects of neural network architectures with probabilistic graphical models: it can learn discriminatively-trained vector representations, while maintaining a probabilistic representation of the targeted linguistic element: in this case, shallow discourse relations. This method outperforms state-of-the-art systems in two discourse relation detection tasks, and can also be applied as a language model, marginalizing over discourse relations on the test data. Future work will investigate the possibility of learning from partially-labeled training data, which would have at least two potential advantages. First, it would enable the model to scale up to the large datasets needed for competitive language modeling. Second, by training on more data, the resulting vector representations might support even more accurate discourse relation prediction.

Depending on the discourse relation that holds between $(t-1)$ and $t$, the output matrices $W_o^{(z_t)}$ and $W_c^{(z_t)}$ will favor different parts of the embedding space. The bias term $b_o^{(z_t)}$ is also parametrized by the discourse relation, so that each relation can favor specific words.

training 77.0*
* significantly better than line 4 with p < 0.01

Table 1: Multiclass relation identification on the first-level PDTB relations.

Table 2: The results of dialogue act tagging.