Hierarchical Recurrent Neural Network for Document Modeling

This paper proposes a novel hierarchical recurrent neural network language model (HRNNLM) for document modeling. After establishing an RNN to capture the coherence between sentences in a document, HRNNLM integrates it as sentence history information into the word-level RNN to predict the word sequence with cross-sentence contextual information. A two-step training approach is designed, in which the sentence-level and word-level language models are trained to convergence one after the other in a pipeline style, as an approximation of joint training. Evaluated on the standard sentence reordering scenario, HRNNLM models sentence coherence more accurately than the baselines. At the word level, experimental results also indicate a significantly lower model perplexity, followed by a better translation result in practice when the model is applied to a Chinese-English document translation reranking task.


Introduction
Deep Neural Network (DNN), a neural network with multiple layers, has been proven powerful in many different domains, such as visual recognition (Kavukcuoglu et al., 2010) and speech recognition (Dahl et al., 2012), ever since Hinton et al. (2006) formulated an efficient training method for it.
In addition to the applications mentioned above, many neural network based methods have also been applied to natural language processing (NLP) tasks with great success. For example, Collobert et al. (2011) propose a generalized DNN framework for a variety of fundamental NLP tasks, including part-of-speech tagging (postag), chunking, named entity recognition (NER), and semantic role labeling.
* Contribution during internship at Microsoft Research.
DNNs have also been successfully applied to word-level language modeling, i.e., predicting the next word given the preceding words. Bengio et al. (2003) propose a feedforward neural network to train a word-level language model with a limited n-gram history. To leverage as much history as possible, Mikolov et al. (2010) apply a recurrent neural network to word-level language modeling. The model absorbs one word at a time, keeps the information in a history vector, and predicts the next word from all the word history stored in that vector.
A word-level language model can only learn the relationships between words within one sentence. For sentences in a document that discusses one or several specific topics, the words in the next sentence are chosen partially in accordance with the previous sentences. To model this kind of sentence coherence, Le and Mikolov (2014) extend the word embedding learning network (Mikolov et al., 2013) to learn a paragraph embedding as a fixed-length vector representation of a paragraph or sentence. Li and Hovy (2014) propose a neural network coherence model which employs distributed sentence representations and then predicts the probability that a sequence of sentences is coherent.
In contrast to the methods mentioned above, which learn the word relationships within or between sentences separately, we propose a hierarchical recurrent neural network language model (HRNNLM) to capture the word sequence across sentence boundaries at the document level. HRNNLM is essentially a combination of a word-level language model and a sentence-level language model, both of which are recurrent neural networks. The word-level recurrent neural network follows Mikolov et al. (2010). The sentence-level language model is another recurrent neural network that takes a sentence representation as input and predicts the words in the next sentence. Similar to Mikolov et al. (2010), the hidden layer of the sentence-level recurrent neural network contains the sentence history information. This hidden layer is then linked as an input to the word-level recurrent neural network, which predicts the next word from it together with the word-level history vector. This allows the language model to predict the next-word probability distribution using context beyond the words in the current sentence.
We propose a two-step training approach to optimize the parameters of HRNNLM. In the first step, we train the sentence-level language model independently. Then we connect the hidden layer of the sentence-level language model to the input of the word-level RNNLM and train the two models jointly until convergence. At the sentence level, we evaluate our model on a sentence ordering task, and the result shows that our method outperforms both a maximum entropy based solution and another state-of-the-art solution. At the word level, we compare our method with the conventional recurrent neural network based language model and find that the perplexity is reduced significantly. We also apply our method to rank machine translation output and conduct experiments on a Chinese-English document translation task, yielding better translation results than a state-of-the-art baseline system.
The rest of this paper is organized as follows: Section 2 introduces work related to applying neural networks to document modeling and SMT. Section 3 introduces the general framework for document modeling. Our sentence-level language model and its training are described in Section 4, and the overall HRNNLM and its training are presented in Section 5. Section 6 presents our experiments and their results. Finally, we conclude in Section 7.

Related work
In this section, we introduce previous efforts on applying neural networks to model word coherence across sentence boundaries, as well as work on improving machine translation performance at the discourse level.
Mikolov and Zweig (2012) propose an RNN-LDA model to implement a context-dependent language model. They feed contextual information into the conventional RNNLM via a real-valued input vector, which is the probability distribution over LDA topics computed from a block of preceding text. They train a Latent Dirichlet Allocation (LDA) model using documents of about 10 sentences each from the Penn Treebank (PTB) training data. Their approach outperforms RNNLM in perplexity on PTB data, but its context history is limited to topics rather than the complete information of the preceding sentences.
Le and Mikolov (2014) extend the Continuous Bag-of-Words Model (CBOW) and Continuous Skip-gram Model (Skip-gram) (Mikolov et al., 2013) by introducing a paragraph vector. In their method, the paragraph vector is learnt in a way similar to the word vector model, and there are N × P parameters if there are N paragraphs and each paragraph is mapped to P dimensions. In contrast, the sentence vectors of our model are learnt with nearly unlimited sentence history in an RNN framework, in which the bag of words of the sentence is used as input. The sentence vector no longer depends on the sentence id, but only on the words in the sentence. Our sentence vector also integrates nearly all the history information of previous sentences, which their model cannot.
Li and Hovy (2014) implement a neural network model to predict discourse coherence quality in essays. In their work, recurrent (Sutskever et al., 2011) and recursive (Socher et al., 2013) neural networks are both examined for learning distributed sentence representations given pre-trained word embeddings. The distributed sentence representation is intended to capture both syntactic and semantic information. With a sliding window over the distributed sentence representations, a neural network classifier is trained to evaluate the coherence of the text. Successful as it is in scoring the coherence for a given sequence of sentences, this method is attempted to discriminate the different word order within a sentence.
An attempt to introduce recurrence into a convolutional neural network (CNN) is investigated by Xu and Sarikaya (2014) for spoken language understanding (SLU). To exploit more contextual information, they apply a CNN with Jordan-type (Jordan, 1997) recurrent connections. The recurrent connections send the distribution of the last softmax layer's output to the current input layer as additional features. Aimed at improving SLU domain classification, their model is essentially a kind of document representation with certain text information, and it neglects the coherence information between sentences.
Following the thread of modeling the word sequence relationship within and across sentences, we propose a hierarchical recurrent neural network language model consisting of a sentence-level language model and a word-level language model. The overall network is trained to capture the coherence between sentences and to predict the word sequence with the preceding sentences as context.
For statistical machine translation (SMT), which we use as a scenario to evaluate our model, DNNs have also shown good results in several components. Yang et al. (2013) adapt and extend the CD-DNN-HMM (Dahl et al., 2012) model to the HMM-based word alignment model. In their method, they use bilingual word embeddings to capture the lexical translation information and model the context with surrounding words. Liu et al. (2014) propose a recursive recurrent neural network (R$^2$NN) for end-to-end decoding to help improve translation quality. Cho et al. (2014) propose an RNN Encoder-Decoder, a joint recurrent neural network model that works at the sentence level, as a conventional SMT decoder does. However, at the discourse level, there are few reports on applying DNNs to improve the translation of a document.

Document Language Modeling
A statistical language model assigns a probability to a natural language sequence. Conventional language models focus only on the word sequence within a sentence. For sentences in a document that discusses one or several specific topics, adjacent sentences should be in a coherent order. Therefore, the words in the next sentence also depend on the preceding sentences. To model the coherence of sentences in the document $D$, which contains $N$ sentences $S_1, S_2, S_3, \dots, S_N$, we need to maximize the following objective:

$$p(D) = \prod_{k=1}^{N} p(S_k \mid S_1, S_2, \dots, S_{k-1}) \qquad (1)$$

For the sentence $S_k$ containing words $w_1, w_2, w_3, \dots, w_T$, $p(S_k \mid S_1, S_2, \dots, S_{k-1})$ is defined as:

$$p(S_k \mid S_1, S_2, \dots, S_{k-1}) = \prod_{t=1}^{T} p(w_t \mid w_1, \dots, w_{t-1}, S_1, \dots, S_{k-1}) \qquad (2)$$

As a special case of approximation to this, the classical n-gram language model keeps only several words as history, discarding any information across sentence boundaries. The recurrent neural network language model (Mikolov et al., 2010) uses a hidden layer that feeds a real-valued vector back recurrently as network input in order to keep as much history as possible. This makes it possible to extend RNNLM to capture history beyond a sentence.
To prevent the potential exponential decay of the history, the history length in an RNN cannot be too long. Here we approximate the history information of previous sentences, $p(S_k \mid S_1, S_2, \dots, S_{k-1})$, by the following:

$$p(S_k \mid S_1, S_2, \dots, S_{k-1}) \approx p(S_k \mid BoW_{S_k}) \cdot p(BoW_{S_k} \mid BoW_{S_1}, \dots, BoW_{S_{k-1}}) \qquad (3)$$

where $BoW_{S_k}$ denotes the bag of words for the sentence $S_k$. The document is thus generated in two steps.
• Given the previous sentences $BoW_{S_1}, \dots, BoW_{S_{k-1}}$ (treating them as bags of words here), first generate the words that will appear in the next sentence, without considering their order, with $p(BoW_{S_k} \mid BoW_{S_1}, \dots, BoW_{S_{k-1}})$.
• Generate the words one by one with $p(S_k \mid BoW_{S_k})$.
The first step amounts to sentence-level language modeling, and the second addresses word-level language modeling. Because recurrent neural networks have a natural advantage in processing sequential data, we investigate how to model the whole process under a unified recurrent neural network framework.
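The two-step factorization above can be illustrated with a small sketch. It assumes two hypothetical scoring callables, `bow_logprob` and `order_logprob`, standing in for the sentence-level and word-level models; these names and the toy uniform scorers are illustrative assumptions, not part of the paper.

```python
import math
from typing import Callable, List, Sequence

Sentence = Sequence[str]


def document_logprob(
    doc: List[Sentence],
    bow_logprob: Callable[[List[Sentence], Sentence], float],
    order_logprob: Callable[[Sentence], float],
) -> float:
    """Sum log p(BoW_k | previous BoWs) + log p(S_k | BoW_k) over all sentences."""
    total = 0.0
    for k, sent in enumerate(doc):
        history = doc[:k]                      # sentences before the current one
        total += bow_logprob(history, sent)    # step 1: which words appear
        total += order_logprob(sent)           # step 2: in which order they appear
    return total


# Toy stand-ins (assumptions for illustration only).
def uniform_bow(history: List[Sentence], sent: Sentence, vocab_size: int = 10) -> float:
    """Every word is drawn uniformly from a vocabulary of vocab_size words."""
    return len(sent) * math.log(1.0 / vocab_size)


def uniform_order(sent: Sentence) -> float:
    """Every ordering of the bag of words is equally likely."""
    return -math.log(math.factorial(len(sent)))
```

With trained models plugged in for the two callables, the same loop would give the document log probability implied by the approximation.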

Sentence-level Language Model
In this section, we describe how to leverage recurrent neural networks for sentence-level language modeling. Mikolov et al. (2010) demonstrate a recurrent neural network language model (RNNLM) for word ordering. It overcomes the limitation of classical language models, which capture only a fixed-length history, and yields significant improvements in perplexity and speech recognition accuracy. Here we adapt this framework to RNN based sentence-level language modeling, i.e., RNNSLM.

Model
A conventional language model reads one word at a time, keeps several words as history, and then predicts the probability distribution of the next word. Similarly, our sentence-level language model reads one sentence at a time in a bag-of-words representation. It then stores the sentence history, which captures the coherence of sentences, in a real-valued history vector. With the history vector, our model can predict which words are most likely to appear in the next sentence. All of this is modeled by a recurrent neural network.

Figure 1: Recurrent Neural Network for Sentence-level Language Modeling

As shown in Figure 1, similar to the conventional recurrent neural network, for the sentence $j$ our network has two input layers, $xs_j$ and $hs_{j-1}$. $xs_j$ is the current sentence representation, and $hs_{j-1}$ is the history information vector before sentence $j$. The model has a hidden layer $hs_j$, which combines the history information of $hs_{j-1}$ and the current sentence input $xs_j$, and an output layer $ys_{j+1}$, which generates the probabilities of the words in the sentence $j + 1$. The layers are computed as follows:

$$hs_j = f(U_s\, xs_j + W_s\, hs_{j-1})$$
$$ys_{j+1} = g(V_s\, hs_j)$$

where $W_s$, $U_s$ and $V_s$ denote the weight matrices. $f(z)$ is the HTanh function:

$$f(z) = \begin{cases} -1 & z < -1 \\ z & -1 \le z \le 1 \\ 1 & z > 1 \end{cases}$$

and $g(z)$ is the softmax function:

$$g(z_m) = \frac{e^{z_m}}{\sum_{k} e^{z_k}}$$

The output layer $ys_{j+1}$ is a $1 \times V$ vector that represents the probability distribution of words in the next sentence given the current sentence $xs_j$ and the previous history $hs_{j-1}$, where $V$ denotes the vocabulary size.
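A minimal sketch of one forward step of the sentence-level network may help make the layer computation concrete. It assumes that $U_s$ multiplies the bag-of-words input and $W_s$ multiplies the previous history vector (the assignment of the two names is an assumption), and it uses plain Python lists rather than any particular numerical library.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def hard_tanh(z: Vector) -> Vector:
    """HTanh: clip every element to the interval [-1, 1]."""
    return [max(-1.0, min(1.0, x)) for x in z]


def softmax(z: Vector) -> Vector:
    """Numerically stable softmax over a vector."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnnslm_step(
    xs_j: Vector, hs_prev: Vector, U_s: Matrix, W_s: Matrix, V_s: Matrix
) -> Tuple[Vector, Vector]:
    """Return (hs_j, ys_next) for one sentence.

    xs_j    : bag-of-words vector of sentence j (length = vocabulary size)
    hs_prev : history vector hs_{j-1} (length = hidden size)
    """
    pre = [a + b for a, b in zip(matvec(U_s, xs_j), matvec(W_s, hs_prev))]
    hs_j = hard_tanh(pre)
    ys_next = softmax(matvec(V_s, hs_j))
    return hs_j, ys_next
```

Calling `rnnslm_step` repeatedly, feeding each returned `hs_j` back in as `hs_prev`, carries the sentence history through a document.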
To emphasize coherence between adjacent sentences, we further add bigram-like bag-of-words features to the output layer. As mentioned in (Mikolov, 2012), this is a kind of maximum entropy feature that can be derived from a two-layer neural network. Experiments show that perplexity decreases significantly after adding these features. Following Mikolov (2012), the maximum entropy bigram features are added to our RNNSLM through a direct connection between the feature input array and the output layer $ys_{j+1}$. Following Mahoney (2000), we use feature hashing to map the bigram maximum entropy features to a fixed-length array, which reduces the memory complexity of the direct connections. The output layer can then be computed as follows:

$$ys_{j+1}(t) = g\Big( (V_s\, hs_j)(t) + \sum_{w_i \in xs_j} D\big[\mathrm{hash}(w_i, w_t)\big] \Big)$$

where $(t)$ denotes the $t$-th row of a vector or a matrix, $w_t$ is the $t$-th word of the vocabulary, $D$ denotes the hash array that contains the feature weights, and $\mathrm{hash}(w_i, w_j)$ denotes the hash function that maps a bigram feature to a position in the fixed-length array. For an output $ys_{j+1}$, multiple connections may be activated according to the words in sentence $xs_j$.
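The hashed bigram features can be sketched as follows. This is only an illustration under stated assumptions: the paper follows Mahoney (2000) for hashing but does not spell out the function, so `zlib.crc32` is used here as a stand-in, and the pairing of each word in the current sentence with each candidate word of the next sentence is one reading of "bigram-like bag of words feature". The helper names are hypothetical.

```python
import zlib
from typing import Iterable, List


def bigram_hash(w_i: str, w_j: str, size: int) -> int:
    """Map the word pair (w_i, w_j) to a slot in a fixed-length array.

    crc32 is a stand-in hash (an assumption), chosen because it is
    deterministic across runs.
    """
    key = f"{w_i}\t{w_j}".encode("utf-8")
    return zlib.crc32(key) % size


def maxent_scores(sentence_words: Iterable[str], vocab: List[str], D: List[float]) -> List[float]:
    """For each vocabulary word w_t, sum the hashed weights of the pairs (w_i, w_t).

    The result is the extra term that would be added to the pre-softmax
    output for w_t; several slots of D may be activated per output.
    """
    words = list(sentence_words)
    size = len(D)
    return [sum(D[bigram_hash(w_i, w_t, size)] for w_i in words) for w_t in vocab]
```

Because distinct bigrams can collide in the same slot, the array size trades memory against feature interference.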

Training
The training objective of our RNNSLM is to find the best parameters for predicting the words of the next sentence. Formally, given the next sentence $S_k$ containing words $w_1, w_2, w_3, \dots, w_T$, the training objective according to (Mikolov et al., 2013) can be denoted by:

$$\frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid BoW_{S_1}, \dots, BoW_{S_{k-1}})$$

The weight matrices $W_s$, $U_s$, $V_s$ and the hash feature weights $D$ are trained in the same way as in a conventional recurrent neural network. The learning rate $\alpha$ is set to 0.1 at the start of training, as suggested in (Mikolov et al., 2010). After each epoch, the learning rate is adjusted according to the training loss of the network. If the loss decreases significantly, training continues with the same learning rate. Otherwise, training continues with a new learning rate $\alpha/2$. The training process is terminated after about 30 epochs.
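The learning rate schedule described above can be sketched as a small helper. The paper does not quantify what counts as a "significant" decrease, so the relative threshold `min_improvement` below is an assumption, as are the function names.

```python
from typing import List


def next_learning_rate(
    alpha: float, prev_loss: float, curr_loss: float, min_improvement: float = 0.003
) -> float:
    """Keep alpha if the loss dropped by more than min_improvement (relative), else halve it.

    min_improvement is a hypothetical threshold, not a value from the paper.
    """
    if prev_loss - curr_loss > min_improvement * prev_loss:
        return alpha
    return alpha / 2.0


def schedule(losses: List[float], alpha: float = 0.1) -> List[float]:
    """Return the learning rate used in each epoch, given the per-epoch losses."""
    rates = [alpha]
    for prev, curr in zip(losses, losses[1:]):
        alpha = next_learning_rate(alpha, prev, curr)
        rates.append(alpha)
    return rates
```

In this sketch, an epoch whose loss rises or stalls halves the rate for every later epoch, so the rate never grows back.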

Initialization
All elements in the weight matrices $W_s$ and $U_s$ are initialized by random sampling from the uniform distribution $[-\frac{1}{K_1}, \frac{1}{K_1}]$, where $K_1$ is the size of the input layer. Elements in the weight matrix $V_s$ are initialized by random sampling from the uniform distribution $[-\frac{1}{K_2}, \frac{1}{K_2}]$, where $K_2$ denotes the size of the hidden layer. The hash feature weight array $D$ is initialized to 0.
The initial history vector $hs_0$ is set to a vector whose elements all equal 0.1.
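The initialization scheme can be sketched as below. Two points are assumptions rather than statements from the paper: the "size of the input layer" $K_1$ is taken to be the vocabulary size plus the hidden size (the bag-of-words input together with the previous history vector), and $U_s$ is taken to be the matrix applied to the bag-of-words input. The function names are hypothetical.

```python
import random
from typing import Dict, List


def uniform_matrix(rows: int, cols: int, bound: float, rng: random.Random) -> List[List[float]]:
    """Matrix with entries drawn uniformly from [-bound, bound]."""
    return [[rng.uniform(-bound, bound) for _ in range(cols)] for _ in range(rows)]


def init_rnnslm(vocab_size: int, hidden_size: int, hash_size: int, seed: int = 0) -> Dict[str, list]:
    """Initialize RNNSLM parameters following the ranges described in the text."""
    rng = random.Random(seed)
    k1 = vocab_size + hidden_size   # assumption: input layer = sentence input + history
    k2 = hidden_size
    return {
        "U_s": uniform_matrix(hidden_size, vocab_size, 1.0 / k1, rng),
        "W_s": uniform_matrix(hidden_size, hidden_size, 1.0 / k1, rng),
        "V_s": uniform_matrix(vocab_size, hidden_size, 1.0 / k2, rng),
        "D": [0.0] * hash_size,          # hash feature weights start at zero
        "hs_0": [0.1] * hidden_size,     # initial history vector
    }
```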

Hierarchical Recurrent Neural Network
In the previous section, we proposed an RNNSLM which models the coherence between sentences but ignores the word sequence within a sentence. Ideally, a document model should capture not only the information between sentences but also the information within a sentence. We therefore propose a hierarchical recurrent neural network language model (HRNNLM) to address this issue.

Model
A hierarchical recurrent neural network consists of two recurrent neural networks. A conventional word-level language model predicts the next word using only the word history within the sentence. To capture a longer history, we feed the sentence history from the sentence-level language model into the word-level language model, which forms a hierarchical recurrent neural network. As illustrated in Figure 2, the upper part is the unfolded illustration of the conventional recurrent neural network based language model. It takes one word $w_i$ at a time together with the previous history information $hw_{i-1}$ and predicts the probability of the next word $p(w_{i+1})$ with the information kept in the history vector $hw_i$. The lower part is our RNNSLM, which takes the bag-of-words representation of a sentence $xs_j$ at a time together with the history information of previous sentences $hs_{j-1}$ and predicts the bag of words in the next sentence $p(s_{j+1})$ with the information kept in $hs_j$.
We integrate these two recurrent neural networks by adding connections between the sentence-level history vector $hs_{j-1}$ and the word-level history vector $hw_i$. Thus, when predicting the next word $w_{i+1}$ of the current sentence, our model considers the current word $w_i$, the history of previous sentences $hs_{j-1}$, and the history of previous words $hw_{i-1}$. The new word-level history vector $hw_i$ is computed by applying $f$ to a weighted combination of these three inputs, where $f(z)$ is the HTanh function. For HRNNLM, we also add a bigram hash feature, as we do for RNNSLM.
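A plausible form of the word-level history update is sketched below. The weight matrix names `U_w`, `W_w` and `C_w` are hypothetical, since the text does not name the word-level parameters; the sketch only assumes that the three inputs are combined linearly before the HTanh nonlinearity.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def hard_tanh(z: Vector) -> Vector:
    """HTanh: clip every element to the interval [-1, 1]."""
    return [max(-1.0, min(1.0, x)) for x in z]


def hrnnlm_hidden(
    xw_i: Vector, hw_prev: Vector, hs_prev: Vector,
    U_w: Matrix, W_w: Matrix, C_w: Matrix,
) -> Vector:
    """Word-level history hw_i from the current word, word history and sentence history.

    xw_i    : one-hot (or other) encoding of the current word w_i
    hw_prev : word-level history hw_{i-1}
    hs_prev : sentence-level history hs_{j-1}, produced by the RNNSLM
    U_w, W_w, C_w : hypothetical weight matrices for the three inputs
    """
    terms = zip(matvec(U_w, xw_i), matvec(W_w, hw_prev), matvec(C_w, hs_prev))
    return hard_tanh([a + b + c for a, b, c in terms])
```

Setting `C_w` to zeros recovers an ordinary word-level recurrent update, which is one way to see that the sentence history acts as an additional input.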

Training
The HRNNLM can be trained from scratch following Mikolov et al. (2010) with a dual objective, but this is not without problems. At the beginning of training, the sentence history is unstable because the parameters of the sentence-level language model are still being updated. Consequently, the training of HRNNLM is also unstable and hard to converge.
In this paper, we approximate the whole training of HRNNLM by a two-step training method. We first train an RNNSLM until it converges. Then we connect the hidden layer of RNNSLM to the hidden layer of RNNWLM. To increase the training speed, all the parameters of RNNSLM are fixed while training HRNNLM. We only update the randomly initialized parameters in HRNNLM, though ideally the gradient with respect to the sentence history vector could be propagated back so that the RNNSLM is updated again. The learning rate $\alpha$ is set to 0.1 and is updated in the same way as described in Section 4.2. All the parameters can be initialized as described in Section 4.3.

Experiments
We evaluate the sentence-level performance of HRNNLM with the sentence ordering task, a common coherence evaluation, and its word-level performance with perplexity. We also apply our HRNNLM to an SMT reranking task on an open Chinese-English translation dataset. Translation performance is measured with the IBM version of BLEU-4 (Papineni et al., 2002).

Sentence Ordering
We follow Barzilay and Lapata (2008) and evaluate our sentence-level language model on a sentence ordering task with test sets 2010 (tst2010), 2011 (tst2011) and 2012 (tst2012) from IWSLT 2014, totaling 37 English documents. For each document, 20 random permutations of its sentences are generated. Each permutation and its original document are combined as an article pair. Our goal is to identify the original document in each article pair.
The training data for the sentence-level language model is the 1,414 English documents from the parallel corpus also provided by the IWSLT 2014 spoken language translation task. 90% of the documents are used for training and the rest are reserved for validation. The size of the hidden layer is set to 30 and the hash array size is $10^7$.
We define the log probability of a given document as its coherence score. The document with the higher score is regarded as the original document.
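The pairwise evaluation can be sketched as follows. The `score` callable stands for any function that returns a document's coherence score (for example, its log probability under the model); it and the helper names are assumptions for illustration. Note that in this sketch a permutation that happens to equal the original order produces a tie, which is counted as an error.

```python
import random
from typing import Callable, List, Sequence, Tuple

Document = Sequence[object]


def make_pairs(doc: Document, n_perm: int = 20, seed: int = 0) -> List[Tuple[list, list]]:
    """Pair the original document with n_perm random permutations of its sentences."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_perm):
        perm = list(doc)
        rng.shuffle(perm)
        pairs.append((list(doc), perm))
    return pairs


def ordering_accuracy(
    pairs: List[Tuple[list, list]], score: Callable[[list], float]
) -> float:
    """Fraction of pairs in which the original scores strictly higher than the permutation."""
    correct = sum(1 for original, permuted in pairs if score(original) > score(permuted))
    return correct / len(pairs)
```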
We provide two baselines for sentence ordering. One is the state-of-the-art recursive neural network based method proposed by Li and Hovy (2014). We implement their model and train and test it with our data. The other is a maximum entropy classifier trained with bag-of-words features of adjacent sentences, which generates a coherence probability for each pair of adjacent sentences. The document with the higher sum of log probabilities over its adjacent sentence pairs is regarded as the original document. Table 1 shows the accuracy of our system and the baselines.

Setting       Accuracy
Recursive     91.39%
ME system     91.89%
Our system    95.68%

Table 1: Accuracy of the sentence ordering task for each system

From Table 1 we can see that the maximum entropy model and the recursive neural network model have almost the same performance. Compared with the baseline systems, the proposed HRNNLM achieves a significant improvement, with an absolute accuracy gain of nearly 4.3 percentage points over the recursive model and about 3.8 points over the maximum entropy system. The experimental result shows that the HRNNLM can model document coherence and capture cross-sentence information.

Word-level Model Perplexity
We compare the word-level performance of HRNNLM with the most popular RNNLM in terms of model perplexity. For a fair comparison, we follow Mikolov et al. (2010) and train the baseline model on the same 90% of the 1,414 English documents from IWSLT 2014, totaling about 3M words. Then we train our model with the same hidden layer size and hash array size as the baseline system. The perplexity of the two models is evaluated on the held-out documents, about 370K words. The results are shown in Table 2.

Setting       Perplexity
RNNLM-30      183
HRNNLM-30     174

Table 2: Perplexity of the different language models

According to Table 2, it is reasonable to claim that integrating the history information of previous sentences significantly decreases the model perplexity. Empirically, this confirms the hypothesis that the choice of words in the next sentence depends on the preceding sentences in the same document.
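For reference, perplexity is the exponential of the negative average log probability per word. The short sketch below computes it from per-word natural-log probabilities and also gives a helper for the relative reduction between two perplexities; the function names are illustrative, not from the paper.

```python
import math
from typing import Sequence


def perplexity(word_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability over all scored words."""
    if not word_logprobs:
        raise ValueError("at least one word is required")
    return math.exp(-sum(word_logprobs) / len(word_logprobs))


def relative_reduction(baseline_ppl: float, new_ppl: float) -> float:
    """Relative perplexity reduction of new_ppl with respect to baseline_ppl."""
    return (baseline_ppl - new_ppl) / baseline_ppl
```

Applied to the figures in Table 2, `relative_reduction(183, 174)` is about 0.049, a relative reduction of roughly 4.9%.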

Spoken Language Translation
Conventional SMT systems translate sentences independently, without considering the coherence of the sentences in the same document. In order to capture translation coherence between sentences, we apply the HRNNLM to a machine translation reranking task.

Data Setting and Baselines
The data comes from the IWSLT 2014 spoken language translation task. The training data consists of 1,414 documents on TED talks, and contains 179k sentence pairs, about 3M Chinese words, and 3.3M English words. The language model for SMT is a 4-gram language model trained with the English documents in the training data. The development set is specified by IWSLT as dev2010, and the test set contains 37 documents from tst2010, tst2011 and tst2012.
The IWSLT 2014 baseline system is built upon the open-source machine translation toolkit Moses with its default configuration, as described by Cettolo et al. (2012). We also train our own decoder, an in-house Bracketing Transduction Grammar (BTG) (Wu, 1997) decoder in CKY style with a lexical reordering model trained with maximum entropy (Xiong et al., 2006). The decoder uses commonly used features, such as translation probabilities, lexical weights, a language model, word penalty, and distortion probabilities.

Rerank System
Our reranking system is a linear model with several features, including the SMT system final scores, sentence-level language model scores, and HRNNLM scores. It should be noted that all these features are actually employed by the SMT model except for the HRNNLM score. Since Minimum Error Rate Training (MERT) (Och, 2003) is the most general method adopted in SMT systems for tuning, the feature weights are tuned by MERT. In our reranking system, scoring the translation of one sentence requires the translation results of all the previous sentences in the document. Our SMT decoder generates 10-best results for all the sentences of the document, and the reranking system first selects the best translation result for the first sentence. Given the translation of the first sentence, we score all the translation candidates of the second sentence and select the best one as the result. Following this procedure, we obtain the translation results for all the sentences in the document.
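The sentence-by-sentence selection procedure can be sketched as a greedy loop. The `score` callable is a hypothetical stand-in for the linear reranking model: it receives the translations already chosen for the preceding sentences and one candidate, and returns that candidate's score.

```python
from typing import Callable, List, Sequence


def rerank_document(
    nbest_lists: Sequence[Sequence[str]],
    score: Callable[[List[str], str], float],
) -> List[str]:
    """Greedily pick one translation per sentence, in document order.

    nbest_lists[k] holds the n-best candidates for sentence k (10-best in
    the setup described above). Each choice is conditioned on the
    translations already selected for sentences 0..k-1.
    """
    chosen: List[str] = []
    for candidates in nbest_lists:
        best = max(candidates, key=lambda cand: score(chosen, cand))
        chosen.append(best)
    return chosen
```

Because each decision is final, an early choice affects the context of every later sentence; the sketch makes no attempt to revisit earlier selections.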

Results
The HRNNLM focuses on exploiting longer context, especially cross-sentence word dependencies. Therefore, the translation data for IWSLT 2014 is organized as documents instead of sentences for our rerank system. We hope HRNNLM will enable a context-sensitive reranking process that captures the syntactic and logical relationships between the sentences in the same document.

Table 3: BLEU scores of SMT systems. IWSLT is a public baseline issued by the organizers of IWSLT 2014, as described in (Cettolo et al., 2012).

The translation performance comparison is shown in Table 3. From Table 3, we can see that the rerank system improves SMT performance consistently. For a single sentence without context information, there are often several appropriate translations, and it is hard to tell which one is better. When the context of a document is considered (the previous sentences, for our model), some translation candidates turn out not to be coherent with the others and should not be selected. By considering the history of previous sentences, our model can select the most coherent translation results.
For example, consider the following two Chinese sentences from one document together with their correct translations:

我 拍摄 过 的 冰山, 有些 冰 是 非常 年轻 -- 几千 年 年龄
Some of the ice in the icebergs that I photograph is very young -- a couple thousand years old.

有些 冰 超过 十万 年
And some of the ice is over 100,000 years old.

The Chinese word "有些" means "some" in English. But when it is used in parallel sentences, it means "some of" instead of "some". A traditional SMT system translates the second sentence without considering the context. The translation result of such a system is:

Some ice more than 100,000 years.

In our system, the HRNNLM can take the previous sentence as context and learn the parallelism between the two sentences. It can select the best translation "some of" for 有些, and the output of our system is:

Some of the ice more than 100,000 years.

We also calculate the BLEU increase ratio of our system at the document level. The ratio is defined as $\frac{1}{N} \#(BleuD_{rerank} > BleuD_{baseline})$, where $N$ denotes the number of documents, and $\#(BleuD_{rerank} > BleuD_{baseline})$ denotes the number of documents for which the document-level BLEU score of the reranking system is higher than that of the baseline. The results are shown in Table 4.

tst2010    tst2011    tst2012
72.73%     71.43%     75%

Table 4: Experimental results to test BLEU increase ratio after reranking

From Table 4, we can see that, for all three test sets, our reranking system achieves better performance on more than 70% of the documents.
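The document-level BLEU increase ratio defined above amounts to counting wins. A minimal sketch, assuming the per-document BLEU scores of both systems are already available as two aligned lists (the function name is illustrative):

```python
from typing import Sequence


def bleu_increase_ratio(rerank_bleu: Sequence[float], baseline_bleu: Sequence[float]) -> float:
    """Fraction of documents whose reranked BLEU is strictly higher than the baseline BLEU."""
    if len(rerank_bleu) != len(baseline_bleu):
        raise ValueError("both lists must cover the same documents")
    if not rerank_bleu:
        raise ValueError("at least one document is required")
    wins = sum(1 for r, b in zip(rerank_bleu, baseline_bleu) if r > b)
    return wins / len(rerank_bleu)
```

Ties count against the reranking system here, matching the strict inequality in the definition.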

Conclusion and Future Work
In this paper, we propose a hierarchical recurrent neural network language model for document modeling. We first build an RNNSLM to capture the information between sentences. Then we integrate the hidden layer of RNNSLM into the input layer of the word-level language model to form a hierarchical recurrent neural network. This enables the model to capture both in-sentence and cross-sentence information in a unified RNN. Compared with conventional language models, our model can perceive a longer history and captures the context patterns in the previous sentences. At the sentence level, we evaluate our model on the sentence ordering task. At the word level, we measure the model perplexity. We also conduct an SMT reranking experiment on the IWSLT 2014 data set. In all three experiments, our hierarchical recurrent neural network improves over the baselines.
In the future, we will explore better sentence representations, such as distributed sentence representations, as input to our sentence-level language model in order to better model document coherence. We could also propagate gradients between the two RNNs so that they are updated jointly, which may improve performance further.