Multi-Timescale Long Short-Term Memory Neural Network for Modelling Sentences and Documents

Neural network based methods have made great progress on a variety of natural language processing tasks. However, it is still a challenging task to model long texts, such as sentences and documents. In this paper, we propose a multi-timescale long short-term memory (MT-LSTM) neural network to model long texts. MT-LSTM partitions the hidden states of the standard LSTM into several groups. Each group is activated at different time periods. Thus, MT-LSTM can model very long documents as well as short sentences. Experiments on four benchmark datasets show that our model outperforms the other neural models on the text classification task.


Introduction
Distributed representations of words have been widely used in many natural language processing (NLP) tasks (Collobert et al., 2011; Turian et al., 2010; Mikolov et al., 2013b; Bengio et al., 2003). Following this success, there is rising interest in learning distributed representations of continuous word sequences, such as phrases, sentences, paragraphs and documents (Mitchell and Lapata, 2010; Socher et al., 2013; Mikolov et al., 2013b; Le and Mikolov, 2014). The primary role of these models is to represent a variable-length sentence or document as a fixed-length vector. A good representation of variable-length text should fully capture the semantics of natural language.
Recently, the long short-term memory neural network (LSTM) (Hochreiter and Schmidhuber, 1997) has been applied successfully in many NLP tasks, such as spoken language understanding (Yao et al., 2014), sequence labeling (Chen et al., 2015) and machine translation (Sutskever et al., 2014). LSTM is an extension of the recurrent neural network (RNN) (Elman, 1990) that can capture both long-term and short-term dependencies and is well suited to modeling variable-length texts. Besides, LSTM is sensitive to word order and, unlike the recursive neural network (Socher et al., 2013), does not rely on an external syntactic structure.
However, when modeling long texts, such as documents, LSTM needs to keep useful features for quite a long period of time. The long-term dependencies need to be transmitted one by one along the sequence, so some important features could be lost in transmission. Besides, the error signal is also back-propagated step by step through multiple time steps in the training phase with the back-propagation through time (BPTT) algorithm (Werbos, 1990), which can reduce learning efficiency for long texts. For example, if a valuable feature occurs at the beginning of a long document, we need to back-propagate the error through the whole document.
In this paper, we propose a multi-timescale long short-term memory (MT-LSTM) to capture valuable information at different timescales. Inspired by the work of El Hihi and Bengio (1995) and Koutnik et al. (2014), we partition the hidden states of the standard LSTM into several groups. Each group is activated and updated at different time periods. The fast-speed groups keep the short-term memories, while the slow-speed groups keep the long-term memories. We evaluate our model on four benchmark datasets of text classification. Experimental results show that our model can not only handle short texts, but can also model long texts.
Our contributions can be summarized as follows.
• With memories at multiple timescales, MT-LSTM easily carries crucial information over a long distance and can model both short and long texts well.
• MT-LSTM converges faster than the standard LSTM, since the error signal can be back-propagated through multiple timescales in the training phase.

Neural Models for Sentences and Documents
The primary role of the neural models is to represent a variable-length sentence or document as a fixed-length vector. These models generally consist of a projection layer that maps words, subword units or n-grams to vector representations (often trained beforehand with unsupervised methods), followed by a neural network architecture that combines these vectors. Most of these models for distributed representations of sentences or documents can be classified into four categories.
Bag-of-words models A simple and intuitive method is the Neural Bag-of-Words (NBOW) model, in which the representation of sentences or documents can be generated by averaging constituent word representations. However, the main drawback of NBOW is that the word order is lost. Although NBOW is effective for general document classification, it is not suitable for short sentences.
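The averaging step of NBOW can be illustrated with a minimal sketch. The embedding table and the function name `nbow` are hypothetical and chosen only for illustration; the example also shows why word order is lost, since two orderings of the same words yield the same vector.

```python
# Hypothetical 3-dimensional embeddings, chosen only for illustration.
embeddings = {
    "good": [1.0, 0.0, 2.0],
    "movie": [0.0, 1.0, 0.0],
    "not": [-1.0, 0.0, 1.0],
}


def nbow(tokens, table):
    """Average the word vectors of a non-empty token list into one vector."""
    dim = len(next(iter(table.values())))
    total = [0.0] * dim
    for tok in tokens:
        for i, value in enumerate(table[tok]):
            total[i] += value
    return [s / len(tokens) for s in total]


# Two different word orders give the same representation.
a = nbow(["not", "good", "movie"], embeddings)
b = nbow(["good", "movie", "not"], embeddings)
```

Because the representation is a plain average, any model built on top of it cannot distinguish the two inputs above.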
Sequence models Sequence models construct the representation of sentences or documents based on the recurrent neural network (RNN) (Mikolov et al., 2010) or the gated versions of RNN (Sutskever et al., 2014; Chung et al., 2014). Sequence models are sensitive to word order, but they have a bias towards the latest input words. This gives the RNN excellent performance at language modelling, but it is suboptimal for modeling the whole sentence, especially for long texts. Le and Mikolov (2014) proposed a Paragraph Vector (PV) to learn continuous distributed vector representations for pieces of text, which can be regarded as a long-term memory of sentences as opposed to the short-term memory in RNN.
Topological models Topological models compose the sentence representation following a given topological structure over the words (Socher et al., 2011a; Socher et al., 2012; Socher et al., 2013). Recursive neural network (RecNN) adopts a more general structure to encode a sentence (Pollack, 1990; Socher et al., 2013). At every node in the tree, the contexts at the left and right children of the node are combined by a classical layer. The weights of the layer are shared across all nodes in the tree. The layer computed at the top node gives a representation for the sentence. However, RecNN depends on an external topological structure, such as a constituency parse tree.
Convolutional models Convolutional neural network (CNN) is also used to model sentences (Collobert et al., 2011;Hu et al., 2014). It takes as input the embeddings of words in the sentence aligned sequentially, and summarizes the meaning of a sentence through layers of convolution and pooling, until reaching a fixed length vectorial representation in the final layer. CNN can maintain the word order information and learn more abstract characteristics.

Long Short-Term Memory Networks
A recurrent neural network (RNN) (Elman, 1990) is able to process a sequence of arbitrary length by recursively applying a transition function to its internal hidden state vector $h_t$ of the input sequence. The activation of the hidden state $h_t$ at time step $t$ is computed as a function $f$ of the current input symbol $x_t$ and the previous hidden state $h_{t-1}$:
$$h_t = \begin{cases} 0 & t = 0 \\ f(h_{t-1}, x_t) & \text{otherwise} \end{cases} \qquad (1)$$
It is common to use the state-to-state transition function $f$ as the composition of an element-wise nonlinearity with an affine transformation of both $x_t$ and $h_{t-1}$.
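The recurrence can be sketched in a few lines. This is a minimal illustration that assumes tanh as the element-wise nonlinearity; the parameter names `W`, `U` and `b` and the toy values are assumptions, not taken from the experiments.

```python
import math


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(x_t, h_prev, W, U, b):
    """h_t = tanh(W x_t + U h_{t-1} + b): affine map, then element-wise tanh."""
    wx = matvec(W, x_t)
    uh = matvec(U, h_prev)
    return [math.tanh(p + q + c) for p, q, c in zip(wx, uh, b)]


def run_rnn(xs, W, U, b):
    """Apply the transition function along the sequence, starting from h_0 = 0."""
    h = [0.0] * len(b)
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b)
    return h


# Toy parameters (assumed values) for a 2-unit RNN over 2-dimensional inputs.
W = [[0.5, 0.0], [0.0, 0.5]]
U = [[0.1, 0.0], [0.0, 0.1]]
b = [0.0, 0.0]
h_final = run_rnn([[1.0, 0.0], [0.0, 1.0]], W, U, b)
```

The final hidden state `h_final` is the fixed-length vector that would be passed to a classifier.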
Traditionally, a simple strategy for modeling a sequence is to map the input sequence to a fixed-sized vector using one RNN, and then to feed the vector to a softmax layer for classification or other tasks (Sutskever et al., 2014). Unfortunately, a problem with RNNs with transition functions of this form is that during training, components of the gradient vector can grow or decay exponentially over long sequences (Bengio et al., 1994; Hochreiter et al., 2001; Hochreiter and Schmidhuber, 1997). This problem of exploding or vanishing gradients makes it difficult for the RNN model to learn long-distance correlations in a sequence. The long short-term memory network (LSTM) was proposed by Hochreiter and Schmidhuber (1997) to specifically address this issue of learning long-term dependencies. The LSTM maintains a separate memory cell inside it that updates and exposes its content only when deemed necessary. While there are numerous LSTM variants with minor modifications to the standard LSTM unit, here we describe the implementation used by Graves (2013).
We define the LSTM units at each time step $t$ to be a collection of vectors in $\mathbb{R}^d$: an input gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory cell $c_t$ and a hidden state $h_t$, where $d$ is the number of LSTM units. The entries of the gating vectors $i_t$, $f_t$ and $o_t$ are in $[0, 1]$. The LSTM transition equations are the following:
$$\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + V_i c_{t-1}), \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + V_f c_{t-1}), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + V_o c_t), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1}), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \qquad (2)$$
where $x_t$ is the input at the current time step, $W$, $U$ and $V$ are weight matrices, $\sigma$ denotes the logistic sigmoid function and $\odot$ denotes elementwise multiplication. Intuitively, the forget gate controls the extent to which each unit of the memory cell is erased, the input gate controls how much each unit is updated, and the output gate controls the exposure of the internal memory state. Figure 1 shows the structure of an LSTM unit. In particular, these gates and the memory cell allow an LSTM unit to adaptively forget, memorize and expose the memory content. If the detected feature, i.e., the memory content, is deemed important, the forget gate will be closed and carry the memory content across many time steps, which is equivalent to capturing a long-term dependency. On the other hand, the unit may decide to reset the memory content by opening the forget gate.
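One LSTM step in the style of Graves (2013) can be sketched as follows. The parameter names (`W_*` for input weights, `U_*` for recurrent weights, `V_*` for peephole weights) are assumptions for illustration, and the peephole weights are taken to be diagonal and stored as vectors, as is common for this variant.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step; `p` maps hypothetical parameter names to weights."""

    def pre(name):
        # Shared part of every pre-activation: W x_t + U h_{t-1}.
        wx = matvec(p["W_" + name], x_t)
        uh = matvec(p["U_" + name], h_prev)
        return [a + b for a, b in zip(wx, uh)]

    # Input and forget gates look at the previous memory cell (peephole).
    i_t = [sigmoid(a + v * c) for a, v, c in zip(pre("i"), p["V_i"], c_prev)]
    f_t = [sigmoid(a + v * c) for a, v, c in zip(pre("f"), p["V_f"], c_prev)]
    c_hat = [math.tanh(a) for a in pre("c")]
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # The output gate looks at the updated memory cell.
    o_t = [sigmoid(a + v * c) for a, v, c in zip(pre("o"), p["V_o"], c_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def zero_params(d_in, d):
    """All-zero parameters, handy for checking the gating arithmetic."""
    p = {}
    for g in "ifoc":
        p["W_" + g] = [[0.0] * d_in for _ in range(d)]
        p["U_" + g] = [[0.0] * d for _ in range(d)]
    for g in "ifo":
        p["V_" + g] = [0.0] * d
    return p


h_t, c_t = lstm_step([1.0, 1.0], [0.0, 0.0], [2.0, -2.0], zero_params(2, 2))
```

With all-zero weights every gate equals 0.5 and the candidate is zero, so the memory cell is simply halved; this makes the role of the forget gate easy to see.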

Multi-Timescale Long Short-Term Memory Neural Network
LSTM can capture the long-term and short-term dependencies in a sequence. But the long-term dependencies need to be transmitted one by one along the sequence. Some important information could be lost in the transmission process for long texts, such as documents. Besides, the error signal is back-propagated through multiple time steps when we use the back-propagation through time (BPTT) algorithm (Werbos, 1990). The training efficiency could also be low for long texts. For example, if a valuable feature occurs at the beginning of a long document, we need to back-propagate the error through the whole document.
Inspired by the work of El Hihi and Bengio (1995) and Koutnik et al. (2014), which use delayed connections and units operating at different timescales to improve the simple RNN, we separate the LSTM units into several groups. Different groups capture dependencies at different timescales.
More formally, the LSTM units are partitioned into $g$ groups $\{G_1, \cdots, G_g\}$. Each group $G_k$ $(1 \le k \le g)$ is activated at different time periods $T_k$. Accordingly, the gates and weight matrices are also partitioned to maintain the corresponding LSTM groups. The MT-LSTM with just one group is the same as the standard LSTM.
At each time step $t$, only the groups $G_k$ that satisfy $(t \bmod T_k) = 0$ are executed. The choice of the set of periods $T_k \in \{T_1, \cdots, T_g\}$ is arbitrary. Here, we use the exponential series of periods: group $G_k$ has the period $T_k = 2^{k-1}$. The group $G_1$ is the fastest one and is executed at every time step, which works like the standard LSTM. The group $G_g$ is the slowest one.
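The activation schedule can be made concrete with a short sketch. The function names are hypothetical, and time steps are assumed to start at 1.

```python
def period(k):
    """Period of group G_k (1-indexed) in the exponential series 2**(k-1)."""
    return 2 ** (k - 1)


def active_groups(t, g):
    """Indices k of the groups executed at time step t, i.e. t mod T_k == 0."""
    return [k for k in range(1, g + 1) if t % period(k) == 0]


# Schedule for g = 3 groups over the first eight time steps.
schedule = {t: active_groups(t, 3) for t in range(1, 9)}
```

With three groups, the fastest group runs at every step, the second at every other step, and the slowest only at steps 4 and 8.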
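At time step $t$, the memory cell vector and hidden state vector of group $G_k$ are calculated in two cases:
(1) When group $G_k$ is activated at time step $t$, the LSTM units of this group are calculated by the following equations:
$$\begin{aligned}
i^k_t &= \sigma\Big(W^k_i x_t + \sum_{j=1}^{g} U^{j \to k}_i h^j_{t-1} + \sum_{j=1}^{g} V^{j \to k}_i c^j_{t-1}\Big), \\
f^k_t &= \sigma\Big(W^k_f x_t + \sum_{j=1}^{g} U^{j \to k}_f h^j_{t-1} + \sum_{j=1}^{g} V^{j \to k}_f c^j_{t-1}\Big), \\
o^k_t &= \sigma\Big(W^k_o x_t + \sum_{j=1}^{g} U^{j \to k}_o h^j_{t-1} + \sum_{j=1}^{g} V^{j \to k}_o c^j_t\Big), \\
\tilde{c}^k_t &= \tanh\Big(W^k_c x_t + \sum_{j=1}^{g} U^{j \to k}_c h^j_{t-1}\Big), \\
c^k_t &= f^k_t \odot c^k_{t-1} + i^k_t \odot \tilde{c}^k_t, \\
h^k_t &= o^k_t \odot \tanh(c^k_t),
\end{aligned}$$
where $i^k_t$, $f^k_t$ and $o^k_t$ are the vectors of input gates, forget gates and output gates of group $G_k$ at time step $t$ respectively; $c^k_t$ and $h^k_t$ are the memory cell vector and hidden state vector of group $G_k$ at time step $t$ respectively; and $W^k$, $U^{j \to k}$ and $V^{j \to k}$ are the weight matrices, with $j \to k$ denoting a connection from group $G_j$ to group $G_k$.
(2) When group $G_k$ is not activated at time step $t$, its LSTM units remain unchanged:
$$c^k_t = c^k_{t-1}, \qquad h^k_t = h^k_{t-1}.$$
Figure 3 shows the difference between the standard LSTM and MT-LSTM.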
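The two cases can be sketched as a single step function. This is an illustrative skeleton rather than a full implementation: the name `mt_step` and the pluggable `update` callback are assumptions, groups are 0-indexed here, and a toy counter stands in for the LSTM equations of an activated group.

```python
def mt_step(t, states, periods, update):
    """Advance all groups by one time step.

    states[k] is the (h, c) pair of group k. An activated group is
    recomputed by update(k, states), which sees the states of all groups
    from the previous step; a non-activated group is copied unchanged.
    """
    new_states = []
    for k, (state, T) in enumerate(zip(states, periods)):
        if t % T == 0:
            new_states.append(update(k, states))
        else:
            new_states.append(state)
    return new_states


def count_update(k, states):
    """Toy stand-in for the LSTM equations: count how often a group runs."""
    h, c = states[k]
    return (h + 1, c + 1)


periods = [1, 2, 4]
states = [(0, 0)] * 3
for t in range(1, 9):
    states = mt_step(t, states, periods, count_update)
```

After eight steps the three groups have been updated 8, 4 and 2 times respectively, which is where the reduction in computation for slow groups comes from.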

Two Feedback Strategies
The feedback mechanism of LSTM is implemented by the recurrent connections from time step $t-1$ to $t$. Since the MT-LSTM groups are updated with different frequencies, we can regard the different groups as analogous to human memory. The fast-speed groups are short-term memories, while the slow-speed groups are long-term memories. Therefore, an important consideration is which feedback mechanism to use between the short-term and long-term memories.
For the proposed MT-LSTM, we consider two feedback strategies to define the connectivity patterns among the different groups.
Fast-to-Slow (F2S) Strategy Intuitively, when we accumulate the short-term memory to a certain degree, we store some valuable information from the short-term memory into the long-term memory. Therefore, we first define a fast-to-slow strategy, which updates the slower group using the faster group. The connections from group $j$ to group $k$ exist if and only if $T_j \le T_k$. The weight matrices $U^{j \to k}$ are set to zero when $T_j > T_k$. The F2S updating strategy is shown in Figure 3a.
Slow-to-Fast (S2F) Strategy Following the work of Koutnik et al. (2014), we also investigate another update scheme from slow-speed group to fast-speed group. The motivation is that a long-term memory can be "distilled" into a short-term memory. The connections from group $j$ to group $k$ exist if and only if $T_j \ge T_k$. The weight matrices $U^{j \to k}$ are set to zero when $T_j < T_k$. The S2F update strategy is shown in Figure 3b.
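Both strategies amount to a mask over the group-to-group connections. The following sketch builds that mask from the periods; the function name `connection_mask` and the string labels are assumptions for illustration.

```python
import operator


def connection_mask(periods, strategy):
    """mask[j][k] is True when the connection from group j to group k is kept."""
    if strategy == "F2S":
        keep = operator.le  # keep j -> k when T_j <= T_k
    elif strategy == "S2F":
        keep = operator.ge  # keep j -> k when T_j >= T_k
    else:
        raise ValueError("strategy must be 'F2S' or 'S2F'")
    return [[keep(tj, tk) for tk in periods] for tj in periods]


periods = [1, 2, 4]
f2s = connection_mask(periods, "F2S")
s2f = connection_mask(periods, "S2F")
```

In both cases a group keeps its connection to itself, and the two masks are transposes of each other; the weight blocks where the mask is False would be fixed to zero.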

Dynamic Selection of the Number of the MT-LSTM Unit Groups
Another consideration is how many groups need to be used. Intuitively, we need more groups for long texts than for short texts, so the number of groups depends on the length of the texts.
Here, we use a simple dynamic strategy to choose the maximum number of groups, and then the best $g$ is chosen as a hyperparameter according to different tasks. The upper bound of the number of groups is calculated by
$$g = \log_2 L - 1,$$
where $L$ is the average length of the corpus. Thus, the slowest group is activated at least twice.
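A small sketch of the bound, assuming it takes the form $g = \log_2 L - 1$ rounded down to an integer and that periods follow $T_k = 2^{k-1}$. The function names, the lower clamp to one group, and the example lengths are assumptions for illustration.

```python
import math


def max_groups(avg_len):
    """Upper bound on the number of groups: floor(log2(L)) - 1, at least 1."""
    return max(1, int(math.floor(math.log2(avg_len))) - 1)


def activations(avg_len, g):
    """How often the slowest group (period 2**(g-1)) fires in steps 1..L."""
    return avg_len // 2 ** (g - 1)


# Illustrative average lengths, not taken from any dataset.
bounds = {L: max_groups(L) for L in (10, 16, 256)}
```

For an average length of 16 the bound is 3 groups, and the slowest of them (period 4) is executed four times, which satisfies the "at least twice" requirement.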

Training
In each of the experiments, the hidden layer at the last time step is followed by a fully connected layer and a softmax non-linear layer that predicts the probability distribution over classes given the input sentence. The network is trained to minimise the cross-entropy of the predicted and true distributions; the objective includes an $L_2$ regularization term over the parameters. The network is trained with back-propagation, and the gradient-based optimization is performed using the Adagrad update rule (Duchi et al., 2011). The back-propagation of the error is similar to that of the standard LSTM. The only difference is that the error propagates only from groups that were executed at time step $t$. The error of non-activated groups gets copied back in time (similarly to copying the activations of nodes not activated at time step $t$ during the corresponding forward pass), where it is added to the back-propagated error.
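The objective and the update rule can be sketched as follows. This is a minimal illustration, not the training code used in the experiments: the function names, the `0.5 * l2` scaling of the penalty, the learning rate and the `eps` constant are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def loss(scores, target, params, l2):
    """Cross-entropy of the softmax output plus an L2 penalty on the parameters."""
    probs = softmax(scores)
    return -math.log(probs[target]) + 0.5 * l2 * sum(p * p for p in params)


def adagrad_update(params, grads, cache, lr=0.1, eps=1e-8):
    """In-place Adagrad step: each parameter gets its own effective step size."""
    for i, g in enumerate(grads):
        cache[i] += g * g  # running sum of squared gradients
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)


params, cache = [1.0], [0.0]
adagrad_update(params, [2.0], cache)
value = loss([0.0, 0.0], 0, [1.0, 2.0], l2=0.1)
```

Parameters that receive large or frequent gradients accumulate a large cache and therefore take smaller steps, which is the defining property of Adagrad.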

Experiments
In this section, we investigate the empirical performances of our proposed MT-LSTM model on four benchmark datasets for sentence and document classification and then compare it to other competitor models.

Datasets
We evaluate our model on four different datasets. The first three datasets are sentence-level, and the last dataset is document-level. The detailed statistics about the four datasets are listed in Table 1. Each dataset is briefly described as follows.
• SST-1 The movie reviews with five classes (negative, somewhat negative, neutral, somewhat positive, positive) in the Stanford Sentiment Treebank (Socher et al., 2013).
• SST-2 The movie reviews with binary classes. It is also from the Stanford Sentiment Treebank.
• QC The TREC questions dataset involves six different question types, e.g. whether the question is about a location, about a person or about some numeric information (Li and Roth, 2002).
• IMDB The IMDB dataset consists of 100,000 movie reviews with binary classes (Maas et al., 2011). One key aspect of this dataset is that each movie review has several sentences.

Competitor Models
We compare our model with the following models:
• NB-SVM and MNB Naive Bayes SVM and Multinomial Naive Bayes with unigram and bigram features (Wang and Manning, 2012).
• NBOW The NBOW sums the word vectors and applies a non-linearity followed by a softmax classification layer.
• RAE Recursive Autoencoders with pretrained word vectors from Wikipedia (Socher et al., 2011b).

Hyperparameters and Training
In all of our experiments, the word embeddings are trained using word2vec (Mikolov et al., 2013a; https://github.com/piskvorky/gensim/) on the Wikipedia corpus (1B words). The vocabulary size is about 500,000. The word embeddings are fine-tuned during training to improve the performance (Collobert et al., 2011). The other parameters are initialized by randomly sampling from a uniform distribution in [-0.1, 0.1]. The hyperparameters that achieve the best performance on the development set are chosen for the final evaluation. For datasets without a development set, we use 10-fold cross-validation (CV) instead. The final hyperparameters for the LSTM and MT-LSTM are set as in Figure 2.

Results
Table 3 shows the classification accuracies of the standard LSTM and MT-LSTM compared with the competitor models. Firstly, we compare the two feedback strategies of MT-LSTM. The fast-to-slow feedback strategy (MT-LSTM (F2S)) is better than the slow-to-fast strategy (MT-LSTM (S2F)), which indicates that MT-LSTM benefits from periodically storing some valuable information "purified" from the short-term memory into the long-term memory. In the following discussion, we use the fast-to-slow feedback strategy as the default setting of MT-LSTM. Compared with the standard LSTM, MT-LSTM yields significant improvements with the same size of hidden layers.
MT-LSTM outperforms the competitor models on the SST-1, QC and IMDB datasets, and is close to the two best CNN based models on the SST-2 dataset. Moreover, MT-LSTM uses far fewer parameters than the CNN based models: the number of parameters of LSTM ranges from 10K to 40K, while the number of parameters is about 400K in CNN.
In addition, MT-LSTM can not only handle short texts, but can also model long texts in the classification task.
The first level uses a DCNN to transform embeddings for the words in each sentence into an embedding for the entire sentence. The second level uses another DCNN to transform sentence embeddings from the first level into a single embedding vector that represents the entire document. However, their result is unsatisfactory and they reported that the IMDB dataset is too small to train a CNN model. The standard LSTM has an advantage in modeling documents due to its simplicity. However, it is also difficult to train LSTM since the error signals need to be back-propagated over a long distance with the BPTT algorithm.
Our MT-LSTM can alleviate this problem with multiple timescale memories. The experiment on the IMDB dataset demonstrates this advantage. MT-LSTM achieves an accuracy of 92.1%, which is better than that of the other models.
Moreover, MT-LSTM converges at a faster rate than the standard LSTM. Figure 4 plots the convergence on the IMDB dataset. In practice, MT-LSTM is approximately three times faster than the standard LSTM, since the hidden states of the low-speed groups often remain unchanged and need not be re-calculated at each time step.
Impact of the Different Number of Memory Groups In our model, the number of memory groups is a hyperparameter. We plot the accuracy curves of our model with different numbers of memory groups in Figure 5 to show its impact on the four datasets.
When the texts are short (SST-1, SST-2 and QC), not all memory groups can be activated if we set too many groups, which may harm the performance. When dealing with long texts (IMDB), the performance improves as the number of memory groups increases.
According to our dynamic strategy, the maximum numbers of groups are 3, 3, 2 and 7 for the four datasets. The best numbers of groups from the experiments are 3, 3, 3 and 5 respectively. Therefore, our dynamic strategy is reasonable: for all the datasets except QC, the best number of groups is equal to or smaller than our calculated upper bound. MT-LSTM suffers from underfitting when the number of groups is larger than the upper bound.
Figure 6: The dynamical changes of the predicted sentiment score over time. The Y-axis represents the sentiment score, while the X-axis represents the input words in chronological order. The red horizontal line gives a border between the positive and negative sentiments.

Case Study
To get an intuitive understanding of what is happening when we use LSTM or MT-LSTM to predict the class of text, we design an experiment to analyze the output of LSTM and MT-LSTM at each time step.
We sample three sentences from the SST-2 test dataset, and the dynamical changes of the predicted sentiment score over time are shown in Figure 6. It is intriguing to notice that our model can handle rhetorical questions well.
The first sentence "Is this progress ?" has a negative sentiment. Although the word "progress" is positive, our model can adjust the sentiment correctly after seeing the question mark "?", and finally gets a correct prediction.
The second sentence "He 'd create a movie better than this ." also has a negative sentiment. The word "better" is positive. Our model finally gets a correct negative prediction after seeing "than this", while LSTM gets a wrong prediction.
The third sentence "It 's not exactly a gourmet meal but fare is fair , even coming from the drive ." is positive and has a more complicated semantic composition. Our model can still capture the useful long-term features and gets the correct prediction, while LSTM does not work well.

Related Work
There are many previous works that model variable-length text as a fixed-length vector. For the text classification task specifically, most of the models cannot deal with texts of several sentences (paragraphs, documents), such as MV-RNN (Socher et al., 2012), RNTN (Socher et al., 2013), CNN (Kim, 2014), AdaSent (Zhao et al., 2015), and so on. The simple neural bag-of-words model can deal with long texts, but it loses the word order information. PV (Le and Mikolov, 2014) works in an unsupervised way, and the learned vector cannot be fine-tuned on the specific task.
Our proposed MT-LSTM can handle short texts as well as long texts in the classification task.

Conclusion
In this paper, we introduce the MT-LSTM, a generalization of LSTMs that captures information at different timescales. With memories at multiple timescales, MT-LSTM can model both short and long texts well and easily carries crucial information over a long distance. Another advantage of MT-LSTM is that the training speed is faster than that of the standard LSTM (approximately three times faster in practice).
In future work, we would like to investigate other feedback mechanisms between the short-term and long-term memories.