Cached Long Short-Term Memory Neural Networks for Document-Level Sentiment Classification

Recently, neural networks have achieved great success on sentiment classification due to their ability to alleviate feature engineering. However, one of the remaining challenges is to model long texts in document-level sentiment classification under a recurrent architecture, because of the deficiency of the memory unit. To address this problem, we present Cached Long Short-Term Memory neural networks (CLSTM) to capture the overall semantic information in long texts. CLSTM introduces a cache mechanism, which divides memory into several groups with different forgetting rates and thus enables the network to keep sentiment information better within a recurrent unit. The proposed CLSTM outperforms the state-of-the-art models on three publicly available document-level sentiment analysis datasets.


Introduction
Sentiment classification is one of the most widely used natural language processing techniques in many areas, such as e-commerce websites, online social networks and political orientation analyses (Wilson et al., 2009; O'Connor et al., 2010). Recently, deep learning approaches (Socher et al., 2013; Kim, 2014; Chen et al., 2015; Liu et al., 2016) have achieved encouraging results on sentiment classification, freeing researchers from handcrafted feature engineering. Among these methods, Recurrent Neural Networks (RNNs) are one of the most prevalent architectures because of their ability to handle variable-length texts.
Sentence- or paragraph-level sentiment analysis expects the model to extract features from a limited source of information, while document-level sentiment analysis demands more in selecting and storing global sentiment messages from long texts with noise and redundant local patterns. Simple RNNs are not powerful enough to handle the overflow and to pick up key sentiment messages from relatively distant time-steps.
Efforts have been made to solve this scalability problem on long texts by extracting semantic information hierarchically (Tang et al., 2015a; Tai et al., 2015): these methods first obtain sentence representations and then combine them to generate high-level document embeddings. However, some of these solutions either rely on explicit a priori structural assumptions or discard the order information within a sentence, which makes them vulnerable to sudden changes or twists in texts, especially long-range ones (McDonald et al., 2007; Mikolov et al., 2013). Recurrent models match people's intuition of reading word by word and are capable of modeling the intrinsic relations between sentences. By keeping the word order, RNNs can extract the sentence representation implicitly and meanwhile analyze the semantic meaning of a whole document without any explicit boundary.
Partially inspired by the neural structure of the human brain and by computer system architecture, we present the Cached Long Short-Term Memory neural networks (CLSTM) to capture long-range sentiment information. In the dual store memory model proposed by Atkinson and Shiffrin (1968), memories can reside in the short-term "buffer" for a limited time while they are simultaneously strengthening their associations in long-term memory. Accordingly, CLSTM equips a standard LSTM with a similar cache mechanism, whose internal memory is divided into several groups with different forgetting rates. A group with a high forgetting rate plays the role of a cache in our model, bridging and passing information to groups with relatively lower forgetting rates. With different forgetting rates, CLSTM learns to capture, remember and forget semantic information over a very long distance.
Our main contributions are as follows:
• We introduce a cache mechanism to diversify the internal memory into several distinct groups with different memory cycles by squashing their forgetting rates. As a result, our model can capture both local and global emotional information, thereby better summarizing and analyzing sentiment on long texts in an RNN fashion.

Related Work
In this section, we briefly introduce related work in two areas. First, we discuss the existing document-level sentiment classification approaches. Second, we discuss some variants of LSTM that address the problem of storing long-term information.

Document-level Sentiment Classification
Document-level sentiment classification is a difficult task in sentiment analysis (Pang and Lee, 2008), which is to infer the sentiment polarity or intensity of a whole document. The most challenging part is that not every part of the document is equally informative for inferring the sentiment of the whole document (Pang and Lee, 2004; Yessenalina et al., 2010). Various methods have been investigated and explored over the years (Wilson et al., 2005; Pang and Lee, 2008; Pak and Paroubek, 2010; Yessenalina et al., 2010; Moraes et al., 2013). Most of these methods depend on traditional machine learning algorithms and need effective handcrafted features.
Recently, neural network based methods have become prevalent due to their ability to learn discriminative features from data (Socher et al., 2013; Le and Mikolov, 2014; Tang et al., 2015a). Zhu et al. (2015) and Tai et al. (2015) integrate a tree-structured model into LSTM for better semantic composition; Bhatia et al. (2015) enhance document-level sentiment analysis by using extra discourse parsing results. Most of these models work well on sentence-level or paragraph-level sentiment classification. When it comes to document-level sentiment classification, a bottom-up hierarchical strategy is often adopted to alleviate the model complexity (Denil et al., 2014; Tang et al., 2015b).

Memory Augmented Recurrent Models
Although it is widely accepted that LSTM has more long-lasting memory units than RNNs, it still suffers from "forgetting" information that is too far away from the current point (Le et al., 2015; Karpathy et al., 2015). This scalability problem of LSTMs must be solved in order to extend previous sentence-level work to document-level sentiment analysis.
Various models have been proposed to increase the ability of LSTMs to store long-range information (Le et al., 2015; Salehinejad, 2016), and two kinds of approaches have gained traction. One is to augment LSTM with an external memory (Sukhbaatar et al., 2015; Monz, 2016), but such models are slow because of the huge external memory matrix. Unlike these methods, we fully exploit the potential of the internal memory of LSTM by adjusting its forgetting rates.
The other one tries to use multiple time-scales to distinguish different states (El Hihi and Bengio, 1995; Koutník et al., 2014; Liu et al., 2015). These methods partition the hidden states into several groups, and each group is activated and updated at a different frequency (e.g. one group updates every 2 time-steps, the other updates every 4 time-steps). In these methods, different memory groups are not fully interconnected, and the information is transmitted from faster groups to slower ones, or vice versa.
However, the memory of slower groups is not updated at every step, which may lead to sentiment information loss and semantic inconsistency. In our proposed CLSTM, we assign different forgetting rates to memory groups. This novel strategy enables each memory group to be updated at every time-step, and every bit of the long-term and short-term memories in the previous time-step to be taken into account when updating.

Long Short-Term Memory Networks
Long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997) is a typical recurrent neural network, which alleviates the problem of vanishing and exploding gradients. LSTM can capture long dependencies in a sequence by introducing a memory unit and a gate mechanism, which decides how to utilize and update the information kept in the memory cell.
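Formally, the update of each LSTM component can be formalized as:

$$
\begin{align}
i^{(t)} &= \sigma(W_i x^{(t)} + U_i h^{(t-1)}), \tag{1} \\
f^{(t)} &= \sigma(W_f x^{(t)} + U_f h^{(t-1)}), \tag{2} \\
o^{(t)} &= \sigma(W_o x^{(t)} + U_o h^{(t-1)}), \tag{3} \\
\tilde{c}^{(t)} &= \tanh(W_c x^{(t)} + U_c h^{(t-1)}), \tag{4} \\
c^{(t)} &= f^{(t)} \odot c^{(t-1)} + i^{(t)} \odot \tilde{c}^{(t)}, \tag{5} \\
h^{(t)} &= o^{(t)} \odot \tanh(c^{(t)}), \tag{6}
\end{align}
$$

where $\sigma$ is the logistic sigmoid function and the operator $\odot$ is the element-wise multiplication operation. $i^{(t)}$, $f^{(t)}$, $o^{(t)}$ and $c^{(t)}$ are the input gate, forget gate, output gate, and memory cell activation vector at time-step $t$ respectively, all of which have the same size as the hidden vector $h^{(t)} \in \mathbb{R}^{H}$. $W_i, W_f, W_o, W_c \in \mathbb{R}^{H \times d}$ and $U_i, U_f, U_o, U_c \in \mathbb{R}^{H \times H}$ are trainable parameters. Here, $H$ and $d$ are the dimensionality of the hidden layer and the input respectively.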
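To make the update concrete, a single LSTM step can be sketched in NumPy as follows. This is a minimal sketch that assumes the standard bias-free formulation (input, forget and output gates, a tanh candidate, and an element-wise cell update); the names `lstm_step` and `init_params`, the parameter layout and the random toy input are hypothetical choices made for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_params(H, d, rng):
    """Hypothetical layout: for each component, an (H x d) input matrix
    and an (H x H) recurrent matrix, drawn uniformly from [-0.1, 0.1)."""
    return {name: (rng.uniform(-0.1, 0.1, (H, d)),
                   rng.uniform(-0.1, 0.1, (H, H)))
            for name in ("i", "f", "o", "c")}

def lstm_step(params, x, h_prev, c_prev):
    """One bias-free LSTM update for input x and previous state (h_prev, c_prev)."""
    def pre(name):
        W, U = params[name]
        return W @ x + U @ h_prev

    i = sigmoid(pre("i"))          # input gate
    f = sigmoid(pre("f"))          # forget gate
    o = sigmoid(pre("o"))          # output gate
    c_tilde = np.tanh(pre("c"))    # candidate memory
    c = f * c_prev + i * c_tilde   # element-wise memory update
    h = o * np.tanh(c)             # hidden vector
    return h, c

# Toy run over a random sequence of 5 inputs (made-up data).
rng = np.random.default_rng(0)
H, d = 4, 3
params = init_params(H, d, rng)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, d)):
    h, c = lstm_step(params, x, h, c)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every entry of the hidden vector stays strictly inside (-1, 1), whatever the sequence length.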

Cached Long Short-Term Memory Neural Network
LSTM is supposed to capture long-term and short-term dependencies simultaneously, but when dealing with considerably long texts, LSTM also fails to capture and understand significant sentiment messages (Le et al., 2015). Specifically, the error signal still suffers from vanishing gradients when modeling long texts with hundreds of words, and thus the network is difficult to train. Since the standard LSTM inevitably loses valuable features, we propose cached long short-term memory neural networks (CLSTM) to capture information over longer spans by introducing a cache mechanism. Moreover, in order to better control and balance the historical message and the incoming information, we adopt one particular variant of LSTM proposed by Greff et al. (2015), the Coupled Input and Forget Gate LSTM (CIFG-LSTM).
Coupled Input and Forget Gate LSTM
Previous studies show that the merged version gives performance comparable to a standard LSTM on language modeling and classification tasks, because using the input gate and forget gate simultaneously incurs redundant information (Chung et al., 2014; Greff et al., 2015).
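In the CIFG-LSTM, the input gate and forget gate are coupled as one uniform gate, that is, let $i^{(t)} = 1 - f^{(t)}$. We use $f^{(t)}$ to denote the coupled gate. Formally, we replace Eq. 5 with:

$$
c^{(t)} = f^{(t)} \odot c^{(t-1)} + (1 - f^{(t)}) \odot \tilde{c}^{(t)}, \tag{7}
$$

where $\tilde{c}^{(t)}$ is the candidate memory at time-step $t$. Figure 1 gives an illustrative comparison of a standard LSTM and the CIFG-LSTM.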
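The effect of the coupling can be checked numerically with a small sketch. It assumes only the relation $i = 1 - f$ stated above; the function name `cifg_cell_update` and the toy vectors are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cifg_cell_update(f, c_prev, c_tilde):
    """Coupled update: the input gate is tied to the forget gate as i = 1 - f."""
    return f * c_prev + (1.0 - f) * c_tilde

# Toy values: old memory +1, candidate memory -1, three gate pre-activations.
f = sigmoid(np.array([-2.0, 0.0, 2.0]))
c_prev = np.array([1.0, 1.0, 1.0])
c_tilde = np.array([-1.0, -1.0, -1.0])
c = cifg_cell_update(f, c_prev, c_tilde)
```

Since the gate lies in (0, 1), the new memory is a convex combination of the old memory and the candidate: a gate close to 1 keeps the history, a gate close to 0 overwrites it. A single gate therefore balances historical and incoming information, and the unit needs one weight set fewer than a standard LSTM.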
Cached LSTM
Cached long short-term memory neural networks (CLSTM) aim at capturing long-range information by a cache mechanism, which divides memory into several groups and assigns different forgetting rates, regarded as filters, to different groups.
Different groups capture dependencies at different scales by squashing the scales of their forgetting rates. The groups with high forgetting rates are short-term memories, while the groups with low forgetting rates are long-term memories.
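Specifically, we divide the memory cells into $K$ groups $\{G_1, \dots, G_K\}$. We modify the update of an LSTM as follows:

$$
\begin{align}
r_k^{(t)} &= \psi_k\Big(\sigma\big(W_r^{k} x^{(t)} + \sum_{j=1}^{K} U_r^{j \to k} h_j^{(t-1)}\big)\Big), \tag{8} \\
o_k^{(t)} &= \sigma\big(W_o^{k} x^{(t)} + \sum_{j=1}^{K} U_o^{j \to k} h_j^{(t-1)}\big), \tag{9} \\
\tilde{c}_k^{(t)} &= \tanh\big(W_c^{k} x^{(t)} + \sum_{j=1}^{K} U_c^{j \to k} h_j^{(t-1)}\big), \tag{10} \\
c_k^{(t)} &= (1 - r_k^{(t)}) \odot c_k^{(t-1)} + r_k^{(t)} \odot \tilde{c}_k^{(t)}, \tag{11} \\
h_k^{(t)} &= o_k^{(t)} \odot \tanh(c_k^{(t)}), \tag{12}
\end{align}
$$

where $r_k^{(t)}$ represents the forgetting rate of the $k$-th memory group at step $t$, and $\psi_k$ is a squash function, which constrains the value of the forgetting rate $r_k$ within a range. To better distinguish the role of each group, its forgetting rate is squashed into a distinct range. The squash function $\psi_k(z)$ can be formalized as:

$$
\psi_k(z) = \frac{1}{K} \cdot z + \frac{k - 1}{K}, \tag{13}
$$

where $z \in (0, 1)$ is computed by the logistic sigmoid function. Therefore, $\psi_k$ constrains the forgetting rate $r_k$ to the range $(\frac{k-1}{K}, \frac{k}{K})$. Intuitively, if a forgetting rate $r_k$ approaches 0, group $k$ tends to be the long-term memory; if $r_k$ approaches 1, group $k$ tends to be the short-term memory. Therefore, group $G_1$ is the slowest, while group $G_K$ is the fastest one. The faster groups are supposed to play the role of a cache, passing information from faster groups to slower groups.

We also employ the bi-directional mechanism on CLSTM, so that words in a text receive information from both sides of the context. Formally, the outputs of the forward LSTM for the $k$-th group are $[\overrightarrow{h}_k^{(1)}, \dots, \overrightarrow{h}_k^{(T)}]$, and the outputs of the backward LSTM for the $k$-th group are $[\overleftarrow{h}_k^{(1)}, \dots, \overleftarrow{h}_k^{(T)}]$. Hence, we encode each word $w_t$ in a given text $w_{1:T}$ as $h_k^{(t)}$:

$$
h_k^{(t)} = \overrightarrow{h}_k^{(t)} \oplus \overleftarrow{h}_k^{(t)}, \tag{14}
$$

where $\oplus$ indicates the concatenation operation.

Table 1: Statistics of the three datasets used in this paper. The rating scales (Class) of Yelp 2013 and Yelp 2014 range from 1 to 5 and that of IMDB ranges from 1 to 10. Words/Doc is the average length of a sample and Sents/Doc is the average number of sentences in a document.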
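Returning to the cache mechanism of the Cached LSTM section, the squashing of forgetting rates into the range ((k-1)/K, k/K) and the group-wise memory update can be sketched as follows. This is a sketch under stated assumptions rather than the authors' code: the squash is taken to be the linear map from (0, 1) onto that range, every group is assumed to read the previous outputs of all groups, and the names `squash` and `clstm_step`, the weight layout and the toy sizes are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def squash(z, k, K):
    """Map z in (0, 1) linearly into ((k - 1) / K, k / K) for group k = 1..K."""
    return (z + (k - 1)) / K

def clstm_step(params, x, h_prev, c_prev):
    """Update all K memory groups once; h_prev and c_prev are lists of K vectors.

    Assumption of this sketch: each group sees the concatenated previous
    outputs of all groups, so every group is updated at every time-step.
    """
    K = len(c_prev)
    h_all = np.concatenate(h_prev)
    h_new, c_new, rates = [], [], []
    for k in range(1, K + 1):
        W_r, U_r, W_o, U_o, W_c, U_c = params[k - 1]
        r = squash(sigmoid(W_r @ x + U_r @ h_all), k, K)  # forgetting rate
        o = sigmoid(W_o @ x + U_o @ h_all)                # output gate
        c_tilde = np.tanh(W_c @ x + U_c @ h_all)          # candidate memory
        c = (1.0 - r) * c_prev[k - 1] + r * c_tilde       # low r keeps history
        rates.append(r)
        c_new.append(c)
        h_new.append(o * np.tanh(c))
    return h_new, c_new, rates

# Toy setup: K = 3 groups of n = 2 cells each, input size d = 4 (made-up sizes).
rng = np.random.default_rng(0)
K, n, d = 3, 2, 4
params = [tuple(rng.uniform(-0.1, 0.1, shape)
                for shape in [(n, d), (n, K * n)] * 3)
          for _ in range(K)]
h = [np.zeros(n) for _ in range(K)]
c = [np.zeros(n) for _ in range(K)]
for x in rng.normal(size=(6, d)):
    h, c, rates = clstm_step(params, x, h, c)
```

With K = 3, the first group can only forget between 0 and 1/3 of its memory per step, so it behaves as the slow, long-term store, while the third group always replaces at least 2/3 of its memory and acts as the cache.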

Task-specific
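Then, a fully connected layer followed by a softmax function is used to predict the probability distribution over classes for a given input. Formally, the probability distribution $p$ is:

$$
p = \mathrm{softmax}(W_p \mathbf{z} + b_p), \tag{15}
$$

where $W_p$ and $b_p$ are the model's parameters. Here $\mathbf{z}$ is the document representation fed to the output layer.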

Training
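The objective of our model is to minimize the cross-entropy error between the predicted and true distributions. Besides, the objective includes an $L_2$ regularization term over all parameters. Formally, suppose we have $m$ training sentence and label pairs $(w^{(i)}, y_i)_{i=1}^{m}$; the goal is to minimize the objective function $J(\theta)$:

$$
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \log p_i^{(y_i)} + \frac{\lambda}{2} \lVert \theta \rVert^2, \tag{16}
$$

where $\theta$ denotes all the trainable parameters of our model, $p_i^{(y_i)}$ is the predicted probability of the true label $y_i$ for the $i$-th sample, and $\lambda$ is the regularization weight.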
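The output layer and the training objective can be sketched together in NumPy. The sketch assumes a softmax over a linear map of a fixed-size document representation, and an objective equal to the mean cross-entropy plus an L2 penalty. For brevity it penalises only the output-layer parameters, whereas the text regularises all parameters; the names `objective`, `Z` and `lam` and the toy data are hypothetical.

```python
import numpy as np

def softmax(a):
    # Subtract the row-wise maximum for numerical stability.
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def objective(W_p, b_p, Z, y, lam):
    """Mean cross-entropy of softmax(W_p z + b_p) plus an L2 penalty.

    Z:   (m, D) document representations, one row per document
    y:   (m,)   integer class labels in [0, C)
    lam: regularisation weight (hypothetical name); only W_p and b_p are
         penalised in this sketch.
    """
    P = softmax(Z @ W_p.T + b_p)                     # (m, C) class probabilities
    nll = -np.log(P[np.arange(len(y)), y]).mean()    # cross-entropy term
    l2 = 0.5 * lam * (np.sum(W_p ** 2) + np.sum(b_p ** 2))
    return nll + l2, P

# Toy data: 8 documents, 6-dimensional representations, 5 classes (made up).
rng = np.random.default_rng(0)
m, D, C = 8, 6, 5
Z = rng.normal(size=(m, D))
y = rng.integers(0, C, size=m)
W_p, b_p = np.zeros((C, D)), np.zeros(C)
loss, P = objective(W_p, b_p, Z, y, lam=1e-4)
```

With all-zero output weights the predicted distribution is uniform, so the loss equals log(C); this gives a quick sanity check for an untrained classifier.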

Experiment
In this section, we study the empirical results of our model on three datasets for document-level sentiment classification. Results show that the proposed model outperforms competitor models in several respects when modelling long texts.

Datasets
We evaluate our model on three datasets: Yelp 2013, Yelp 2014 and IMDB. Table 1 shows the statistical information of the three datasets. All these datasets are publicly accessible. We pre-process and split the datasets in the same way as Tang et al. (2015b).
• Yelp 2013 and Yelp 2014 are review datasets derived from the Yelp Dataset Challenge of 2013 and 2014 respectively. The sentiment polarity of each review ranges from 1 star to 5 stars, which reveals the consumers' attitude and opinion towards the restaurants.
• IMDB is a popular movie review dataset consisting of 84919 movie reviews with ratings ranging from 1 to 10. The average length of each review is 394.6 words, which is much longer than the reviews in the two Yelp datasets.

Evaluation Metrics
We use Accuracy (Acc.) and Mean Squared Error (MSE) as evaluation metrics for sentiment classification. Accuracy is a standard metric that measures the overall classification result, and MSE measures the divergence between predicted sentiment labels and the ground truth ones.
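Both metrics are straightforward to compute from predicted and gold rating labels. The sketch below assumes integer ratings and uses made-up toy labels; the function names are hypothetical.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted rating equals the gold rating."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def mean_squared_error(y_true, y_pred):
    """Average squared distance between predicted and gold ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy example with four reviews on a 1-5 scale (made-up labels).
gold = [5, 3, 1, 4]
pred = [5, 2, 1, 1]
acc = accuracy(gold, pred)            # 2 of 4 correct -> 0.5
mse = mean_squared_error(gold, pred)  # (0 + 1 + 0 + 9) / 4 -> 2.5
```

The example shows why the two metrics complement each other: both errors count equally for accuracy, but predicting 1 star for a 4-star review contributes nine times more to MSE than predicting 2 stars for a 3-star review.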

Baseline Models
We compare our models, CLSTM and B-CLSTM, with the following baseline methods.
• CBOW sums the word vectors and applies a non-linearity followed by a softmax classification layer.
• JMARS is one of the state-of-the-art recommendation algorithms, which leverages user and aspects of a review with collaborative filtering and topic modeling.
• CNN UPNN (CNN) (Tang et al., 2015b) can be regarded as a CNN (Kim, 2014)
• Coupled Input Forget Gate LSTM and BLSTM, denoted as CIFG-LSTM and CIFG-BLSTM respectively (Greff et al., 2015). They combine the input and forget gates of LSTM and require a smaller number of parameters in comparison with the standard LSTM.

Hyper-parameters and Initialization
For parameter configuration, we choose parameters on the validation set mainly according to classification accuracy, for convenience, because MSE is strongly correlated with accuracy. The dimension of the pre-trained word vectors is 50. We use 120 as the dimension of hidden units, and choose the weight decay among {5e−4, 1e−4, 1e−5}. We use Adagrad (Duchi et al., 2011) as the optimizer, with an initial learning rate of 0.01. Batch size is chosen among {32, 64, 128} for efficiency. For CLSTM, the number of memory groups is chosen for each dataset, which will be discussed later. We keep the total number of hidden units unchanged. For instance, given 120 neurons in all and four memory groups, each group has 30 neurons. This makes the model comparable to (B)LSTM. Table 3 shows the optimal hyper-parameter configurations for each dataset.
For model initialization, we initialize all recurrent matrices by sampling randomly from a uniform distribution in [-0.1, 0.1]. Besides, we use GloVe (Pennington et al., 2014) as pre-trained word vectors. The word embeddings are fine-tuned during training. The hyper-parameters achieving the best results on the validation set are chosen for the final evaluation on the test set.
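The configuration above can be collected in a short sketch. The values in `config` are taken from the text; the helper names, the even-split rule with an error on remainders, and the Adagrad stability constant `eps` are assumptions made for illustration.

```python
import numpy as np

# Values stated in the text; the dictionary keys are hypothetical names.
config = {
    "embedding_dim": 50,
    "hidden_units": 120,
    "learning_rate": 0.01,
    "weight_decay_grid": [5e-4, 1e-4, 1e-5],
    "batch_size_grid": [32, 64, 128],
}

def group_sizes(hidden_units, num_groups):
    """Split a fixed budget of hidden units evenly over the memory groups."""
    if hidden_units % num_groups != 0:
        raise ValueError("hidden_units must be divisible by num_groups")
    return [hidden_units // num_groups] * num_groups

def init_recurrent(shape, rng, scale=0.1):
    """Sample a recurrent matrix uniformly from [-scale, scale)."""
    return rng.uniform(-scale, scale, size=shape)

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """One Adagrad update; eps is an assumed stability constant."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

rng = np.random.default_rng(0)
sizes = group_sizes(config["hidden_units"], 4)     # four groups of 30 neurons
U = init_recurrent((sizes[0], config["hidden_units"]), rng)
theta, accum = adagrad_step(np.ones(3), np.full(3, 2.0), np.zeros(3),
                            lr=config["learning_rate"])
```

Keeping the total number of hidden units fixed while varying the number of groups means that the models in a comparison have roughly the same capacity, which is the point of the 120 = 4 × 30 example.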

Results
The classification accuracy and mean squared error (MSE) of our models compared with other competitive models are shown in Table 2. When comparing our models to other neural network models, we have several meaningful findings.
1. Among all unidirectional sequential models, RNN fails to capture and store semantic features, while vanilla LSTM preserves sentiment messages much longer than RNN. This shows that internal memory plays a key role in text modeling. CIFG-LSTM gives performance comparable to vanilla LSTM.
2. With the help of a bidirectional architecture, models can look backward and forward to capture long-range features from a global perspective. In sentiment analysis, if users state their opinion at the beginning of their review, unidirectional models may forget these hints.
3. The proposed CLSTM beats the CIFG-LSTM and vanilla LSTM and even surpasses the bidirectional models. On Yelp 2013, CLSTM achieves 59.4% in accuracy, which is only 0.4 percentage points worse than B-CLSTM. This suggests that the cache mechanism stores valuable information effectively without the support of a bidirectional structure.
4. Compared with the existing best methods, our model achieves new state-of-the-art results by a large margin on all document-level datasets in terms of classification accuracy. Moreover, B-CLSTM even surpasses the JMARS and CNN (UPNN) methods, which utilize extra user and product information.
5. In terms of time complexity and number of parameters, our model stays almost the same as its counterpart models, while hierarchical composition models may require more computational resources and time.

Rate of Convergence
We compare the convergence rates of our models, including CIFG-LSTM, CIFG-BLSTM and B-CLSTM, and the baseline models (LSTM and BLSTM). We configure the hyper-parameters so that every competing model has approximately the same number of parameters; the resulting convergence rates are shown in Figure 3. In terms of convergence rate, B-CLSTM beats the other competing models. The reason why B-CLSTM converges faster is that the split memory groups can be seen as a better initialization and as constraints during the training process.

Effectiveness on Grouping Memory
For the proposed model, the number of memory groups is a key hyper-parameter. In Figure 4, we plot the best prediction accuracy (Y-axis) achieved on the validation set with different numbers of memory groups on all datasets. From the diagram, we can see that our model outperforms the baseline method. On Yelp 2013, splitting the memory into 4 groups achieves the best result among all tested numbers of memory groups. We observe a downward trend when we choose more than 5 groups.
For fair comparison, we set the total number of neurons in our model to be the same as in vanilla LSTM. Therefore, the more groups we split the memory into, the fewer neurons belong to each group, which leads to a lower capacity than configurations with sufficient neurons in each group.

Sensitivity on Document Length
We also investigate the performance of our model on IMDB when it encodes documents of different lengths. Test samples are divided into 10 groups according to their length. From Figure 5, we can draw several conclusions.
1. Bidirectional models perform much better than their unidirectional counterparts.
2. The overall performance of B-CLSTM is better than that of CIFG-BLSTM. This means that our model adapts to both short texts and long documents. Besides, our model is stronger than CIFG-BLSTM in dealing with very long texts.
3. CBOW is slightly better than CIFG-LSTM, because LSTM forgets a large amount of information during the unidirectional propagation.
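The length-sensitivity analysis can be reproduced in outline with the sketch below. The text only states that test samples are divided into 10 groups by length; the sketch assumes equal-sized groups over the length-sorted samples, and the function name `accuracy_by_length` and the toy data are hypothetical.

```python
import numpy as np

def accuracy_by_length(lengths, correct, num_groups=10):
    """Accuracy per length group, from the shortest to the longest documents.

    lengths: document lengths in words
    correct: 1 if the prediction for that document was right, else 0
    Assumes at least num_groups samples, so that no group is empty.
    """
    lengths = np.asarray(lengths)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(lengths, kind="stable")    # sort documents by length
    buckets = np.array_split(order, num_groups)   # near-equal-sized groups
    return [float(correct[idx].mean()) for idx in buckets]

# Toy data: 20 documents, the 10 shortest classified correctly (made up).
lengths = np.arange(1, 21) * 10
correct = [1] * 10 + [0] * 10
acc = accuracy_by_length(lengths, correct, num_groups=10)
```

Plotting the returned list against the group index gives a curve of the kind discussed above: a model that forgets early content shows accuracy falling towards the long-document groups.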

Conclusion
In this paper, we address the problem of effectively analyzing the sentiment of document-level texts in an RNN architecture. Similar to the memory structure of humans, memory with a low forgetting rate captures the global semantic features, while memory with a high forgetting rate captures the local semantic features. Empirical results on three real-world document-level review datasets show that our model outperforms state-of-the-art models by a large margin.
For future work, we are going to design a strategy to dynamically adjust the forgetting rates for fine-grained document-level sentiment analysis.

Figure 1: (a) A standard LSTM unit and (b) a CIFG-LSTM unit. There are three gates in (a): the input gate, forget gate and output gate, while in (b) there are only two gates: the CIFG gate and the output gate.

Figure 2: An overview of the proposed architecture. Different styles of arrows indicate different forgetting rates. Groups with stars are fed to a fully connected layer for softmax classification. Here is an instance of B-CLSTM with text length equal to 4 and 3 memory groups.

Figure 3: Convergence speed experiment on Yelp 2013. The X-axis is the number of training epochs and the Y-axis is the classification accuracy (%) achieved.

Figure 4: Classification accuracy for different numbers of memory groups on the three datasets. The X-axis is the number of memory groups.
Each group includes an internal memory $c_k$, an output gate $o_k$ and a forgetting rate $r_k$. The forgetting rates of different groups are squashed into distinct ranges.
For B-CLSTM, we concatenate the state of the first group in the forward LSTM at the $T$-th time-step and the state of the first group in the backward LSTM at the first time-step.

Table 2: Sentiment classification results of our model against competitor models on IMDB, Yelp 2014 and Yelp 2013. Evaluation metrics are classification accuracy (Acc.) and MSE. Models with * use user and product information as additional features. Best results in each group are in bold.