Investigating Dynamic Routing in Tree-Structured LSTM for Sentiment Analysis

Deep neural network models such as long short-term memory (LSTM) and tree-LSTM have been shown to be effective for sentiment analysis. However, sequential LSTM is a biased model in which the words at the tail of a sentence are emphasized more heavily than those at the head when building sentence representations. Even tree-LSTM, with useful structural information, cannot avoid the bias problem because the root node is dominant and the nodes at the bottom of the parse tree are less emphasized even though they may contain salient information. To overcome the bias problem, this study proposes a capsule tree-LSTM model, introducing a dynamic routing algorithm as an aggregation layer to build sentence representations by assigning different weights to nodes according to their contributions to prediction. Experiments on the Stanford Sentiment Treebank (SST) for sentiment classification and EmoBank for regression show that the proposed method improved the performance of tree-LSTM and other neural network models. In addition, the deeper the tree structure, the larger the improvement.


Introduction
In sentiment analysis, word embeddings (Mikolov et al., 2013a; Mikolov et al., 2013b; Pennington et al., 2014) and sentiment embeddings (Tang et al., 2016; Yu et al., 2018a; Yu et al., 2018b) have become fundamental components for building deep neural networks such as convolutional neural networks (CNN) (Kalchbrenner et al., 2014; Kim, 2014), recurrent neural networks (RNN) (Graves, 2012; Irsoy and Cardie, 2014), gated recurrent units (GRU) (Cho et al., 2014), and long short-term memory (LSTM) (Tai et al., 2015; Wang et al., 2015). Given a variable-length text, one challenge of using these neural networks is to compose individual word vectors into sentence vectors of a fixed length (Iyyer et al., 2015). Sequential neural networks such as RNN, GRU, and LSTM are commonly used because of their ability to capture long-distance dependencies in sequential texts. However, these methods are biased models, in which the words at the tail of a sentence are emphasized more heavily than those at the head when building sentence representations. As shown in Fig. 1(a), the priority of the word vectors will be "fantastic > really > is > story > this". This prioritization seems satisfactory for this sentence, but note that the key components could appear anywhere in the sentence rather than necessarily at the end.
To improve the abovementioned sequential models, Tai et al. (2015) and Huang et al. (2017) proposed a tree-LSTM model to introduce useful structural information from sentence parse trees. However, the tree-LSTM also heavily emphasizes the root node of the tree when building sentence representations. That is, words that are close to the root are given higher priority than words that are far from the root. As shown in Fig. 1(b), the priority of the word vectors would be "this = story = is > really = fantastic". This example shows that the tree-LSTM still cannot avoid the bias problem because nodes (e.g., fantastic) that contribute more to the prediction but lie in leaf nodes at the bottom of the parse tree are less emphasized.
To overcome the bias problem that may arise in the tree-LSTM, this study proposes a capsule tree-LSTM model. Inspired by recent promising work on capsule networks (Sabour et al., 2017), the proposed method introduces a dynamic routing algorithm that considers all non-leaf nodes when building sentence vectors, instead of using the root alone as in the tree-LSTM. In addition, different nodes receive different weights according to their contributions to the prediction task. Unlike self-attention (Lin et al., 2017), which applies a fixed policy without considering the state of the final sentence vectors, the proposed model treats weight assignment as a routing problem: it iteratively determines how much information can be passed from the non-leaf nodes in the tree to the vector representation of the sentence, according to the state of the final output. For example, in the aforementioned example text, it would be useful for the model to emphasize fantastic, which contains the most salient information, even though the word lies at the bottom of the parse tree. Based on the dynamic routing algorithm, the priority of the word vectors in the proposed model would be "fantastic > really = is > this = story". The proposed method is evaluated on both sentiment classification and regression tasks to determine whether dynamic routing can improve the performance of the tree-LSTM and other neural network models.
The rest of this paper is organized as follows. Section 2 describes the proposed capsule tree-LSTM model with dynamic routing. Section 3 summarizes the evaluation results. Conclusions are presented in Section 4.

Capsule Tree-LSTM Model
Figure 1(c) shows the framework of the proposed model. First, the given sentence is parsed into a tree-structured topology. The vector representation of this sentence is then generated by composing the vectors of all non-leaf nodes in the tree according to their weights learned by the dynamic routing algorithm. Finally, the composed sentence vector is used for sentiment prediction.

Tree-structured LSTM
Given a binary parse tree, the leaf nodes are words and the non-leaf nodes are multi-word phrases. Let C(j) denote the set of left and right child nodes of a non-leaf node j. Unlike in the sequential LSTM, the hidden state $h_{t-1}^{j}$ of the non-leaf node j is the composition of its left and right child nodes. In this composition, $i_t$, $f_t$, $o_t$, and $c_t$ respectively denote the input gate, forget gate, output gate, and memory cell of node j, $x_t$ denotes the input word vector at time step t, $\sigma$ denotes the logistic sigmoid function, W and b respectively denote the weights and bias, and $\odot$ denotes element-wise multiplication. To integrate the sequence information in the output layer, the order in which the non-leaf hidden states form the input matrix of the dynamic routing layer is a key consideration. Here, we used the in-order traversal of the depth-first search algorithm on the tree-structured topology.
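For reference, the node composition described above can be written out under the assumption that it follows the Child-Sum Tree-LSTM of Tai et al. (2015). This is only a sketch in the notation of this section: the recurrent weights $U$, the candidate update $u_t$, and the per-child forget gates $f_t^{k}$ are assumptions taken from that formulation and are not named in the text, so the exact form used here may differ.

```latex
\begin{aligned}
h_{t-1}^{j} &= \sum_{k \in C(j)} h_{t-1}^{k} \\
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1}^{j} + b^{(i)}\right) \\
f_t^{k} &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1}^{k} + b^{(f)}\right), \quad k \in C(j) \\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1}^{j} + b^{(o)}\right) \\
u_t &= \tanh\left(W^{(u)} x_t + U^{(u)} h_{t-1}^{j} + b^{(u)}\right) \\
c_t &= i_t \odot u_t + \sum_{k \in C(j)} f_t^{k} \odot c_{t-1}^{k} \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{aligned}
```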

Dynamic Routing
To compose all word vectors to generate sentence vectors, the tree-structured LSTM model uses the hidden states of all non-leaf nodes to obtain the weights for all nodes through the dynamic routing algorithm.
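The ordering of the non-leaf nodes that feed the routing layer (an in-order, depth-first traversal, as stated in the previous subsection) can be illustrated with a minimal sketch. The `Node` class, the phrase labels, and the bracketing of the example sentence are hypothetical and serve only to show the traversal order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A node of a binary parse tree (hypothetical helper type)."""
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def inorder_nonleaf(node: Optional[Node]) -> List[str]:
    """Return the labels of non-leaf nodes in in-order (left, node, right)."""
    if node is None or node.is_leaf():
        return []  # leaves (words) are skipped; only phrases are collected
    return inorder_nonleaf(node.left) + [node.label] + inorder_nonleaf(node.right)


# Assumed bracketing: ((This story) (is (really fantastic)))
tree = Node(
    "S",
    Node("NP", Node("This"), Node("story")),
    Node("VP", Node("is"), Node("ADJP", Node("really"), Node("fantastic"))),
)
order = inorder_nonleaf(tree)
print(order)
```

In a full model, the hidden states of these nodes would be stacked in this order to form the input matrix of the routing layer.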
Taking the hidden states of all non-leaf nodes as the input vectors, the goal of dynamic routing is to encode the sentiment information of those vectors into a fixed-length sentence vector. Here, $s_j$ is the vector output of capsule j and $v_j$ is the total input, which is a weighted sum over all "prediction vectors" $h_{j|t}$ from the capsules in the layer below. The coupling coefficients $c_{tj}$ form probability distributions that are computed using a softmax function, so that the coefficients over all capsules in the layer above sum to 1. The values $b_{tj}$ are the log probabilities, initialized to 0. The detailed iterative process of learning the weights between capsules in two layers for each non-leaf node is shown in Fig. 2.
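The quantities defined above can be related to one another as follows, assuming the routing-by-agreement formulation of Sabour et al. (2017) rewritten with the symbol roles of this section ($s_j$ as output, $v_j$ as total input). The transformation matrices $W_{tj}$ and the node hidden states $h_t$ are assumptions introduced for illustration.

```latex
\begin{aligned}
s_j &= \frac{\lVert v_j \rVert^{2}}{1 + \lVert v_j \rVert^{2}} \, \frac{v_j}{\lVert v_j \rVert} \\
v_j &= \sum_{t} c_{tj} \, h_{j|t}, \qquad h_{j|t} = W_{tj} \, h_t \\
c_{tj} &= \frac{\exp\left(b_{tj}\right)}{\sum_{k} \exp\left(b_{tk}\right)} \\
b_{tj} &\leftarrow b_{tj} + h_{j|t} \cdot s_j
\end{aligned}
```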
In Eq. (7), the capsules in the layer above try to learn contribution weights $c_{tj}$ (i.e., coupling coefficients) for the capsules in the layer below. The updated information in $b_{tj}$ comes from the scalar product |
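The iterative weighting can be sketched in plain Python. This is a minimal illustration that assumes the routing-by-agreement procedure of Sabour et al. (2017), not the exact implementation used in this study. The function names, the three-iteration default, and the toy prediction vectors are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def squash(v: Vector) -> Vector:
    """Shrink v so that its length lies in [0, 1) while keeping its direction."""
    sq_norm = sum(x * x for x in v)
    if sq_norm == 0.0:
        return [0.0 for _ in v]
    scale = sq_norm / (1.0 + sq_norm) / math.sqrt(sq_norm)
    return [scale * x for x in v]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over one row of routing logits."""
    m = max(logits)
    exps = [math.exp(b - m) for b in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(
    pred: List[List[Vector]], num_iters: int = 3
) -> Tuple[List[Vector], List[Vector]]:
    """pred[t][j] is the prediction vector from lower capsule t to upper capsule j.

    Returns the outputs s[j] and the coupling coefficients c[t][j] that
    produced them in the last iteration.
    """
    n_lower, n_upper, dim = len(pred), len(pred[0]), len(pred[0][0])
    b = [[0.0] * n_upper for _ in range(n_lower)]  # routing logits start at 0
    s: List[Vector] = []
    c: List[Vector] = []
    for _ in range(num_iters):
        c = [softmax(row) for row in b]  # each lower capsule's weights sum to 1
        s = []
        for j in range(n_upper):
            # total input v_j: weighted sum of the prediction vectors
            v_j = [
                sum(c[t][j] * pred[t][j][d] for t in range(n_lower))
                for d in range(dim)
            ]
            s.append(squash(v_j))
        for t in range(n_lower):
            for j in range(n_upper):
                # agreement: scalar product of prediction vector and output
                b[t][j] += sum(p * q for p, q in zip(pred[t][j], s[j]))
    return s, c


# Toy input: three nodes, two upper capsules, 2-dimensional vectors.
pred = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[1.0, 0.0], [0.0, 0.0]],
    [[-1.0, 0.0], [0.0, 0.0]],
]
outputs, coupling = dynamic_routing(pred)
print(coupling)
```

In this toy input, the two nodes whose prediction vectors agree with the emerging output of the first capsule end up with larger coupling coefficients to it than the node that disagrees.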

Experimental Results
Datasets. This experiment used two datasets for evaluation. i) The Stanford Sentiment Treebank (SST) (Socher et al., 2013) is used for sentiment classification. It contains 6920/872/1821 sentences for the train/dev/test sets with binary labels (positive/negative) and 8544/1101/2210 sentences with fine-grained labels (very negative/negative/neutral/positive/very positive). ii) EmoBank (Buechel and Hahn, 2016; Buechel and Hahn, 2017) is used for sentiment regression to predict valence-arousal (VA) values (Wang et al., 2016b; Yu et al., 2016). It contains 10,000 sentences with real-valued VA ratings in the range of (1, 9), where valence refers to the degree of positive and negative sentiment and arousal refers to the degree of calm and excitement. The provided ratings have Reader and Writer perspectives, and the Reader ratings were adopted as the ground truth because of their superiority as reported by Buechel and Hahn (2017). We performed 5-fold cross-validation (6:2:2) on the EmoBank dataset.
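One way to read the 5-fold cross-validation with a 6:2:2 ratio is a rotation in which three folds are used for training, one for development, and one for testing. The sketch below rests on that assumption, since the text does not spell out the procedure. The function name and the index-based fold assignment are hypothetical, and the items are assumed to have been shuffled beforehand.

```python
from typing import List, Tuple

Split = Tuple[List[int], List[int], List[int]]


def rotating_splits(n_items: int, n_folds: int = 5) -> List[Split]:
    """Return (train, dev, test) index lists, one triple per fold rotation."""
    # Assign item i to fold i % n_folds.
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test_fold = k
        dev_fold = (k + 1) % n_folds
        train = [
            i
            for f in range(n_folds)
            if f not in (test_fold, dev_fold)
            for i in folds[f]
        ]
        splits.append((train, folds[dev_fold], folds[test_fold]))
    return splits


splits = rotating_splits(10000)
print([tuple(len(part) for part in split) for split in splits])
```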
Evaluation Metrics. For SST, the evaluation metric is accuracy for both binary and fine-grained classification. For EmoBank, we used the Pearson correlation coefficient (r) and mean absolute error (MAE). A higher r or a lower MAE value indicates better prediction performance.
Implementation Details. Several deep neural networks were implemented for comparison, including CNN, GRU, LSTM, and tree-LSTM. For the sequential models (GRU and LSTM), we additionally implemented enhanced versions using a bi-directional strategy and a 2-layer stacked architecture. To investigate the performance of self-attention, we also implemented a self-attention layer that takes as input the hidden states of all non-leaf nodes, forming an attention Tree-LSTM model (Kokkinos and Potamianos, 2017). For word vectors, we used GloVe pre-trained on the 840B Common Crawl corpus (Pennington et al., 2014). The dimensionalities of the word vectors and hidden states were 300 and 120, respectively. For the classification and regression tasks, softmax and linear decoder (Wang et al., 2016a) activation functions were respectively applied as the output layer.
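The two EmoBank metrics named under Evaluation Metrics follow their standard definitions and can be written out directly. The function names and the sample ratings below are hypothetical and are not taken from the experiments.

```python
import math
from typing import Sequence


def pearson_r(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson correlation coefficient between predicted and gold ratings."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    return cov / math.sqrt(var_p * var_g)


def mean_absolute_error(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Average absolute difference between predicted and gold ratings."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(pred)


# Hypothetical valence predictions and gold ratings for three sentences.
pred = [3.0, 5.0, 7.0]
gold = [2.0, 5.0, 8.0]
print(pearson_r(pred, gold), mean_absolute_error(pred, gold))
```

The two metrics are complementary: predictions can be perfectly correlated with the gold ratings and still be offset from them, which only MAE would reveal.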
Comparative Results. Tables 1 and 2 respectively show the comparative results of different methods for SST and EmoBank. Both the enhanced bi-directional and 2-layer GRU/LSTM outperformed the standard GRU, LSTM, and CNN, and the Tree-LSTM with structural information achieved better performance than all of them for both classification and regression tasks. Once the dynamic routing algorithm was introduced, the proposed Capsule Tree-LSTM further improved on the performance of Tree-LSTM (with attention). Figure 3 shows a detailed analysis of the effect of dynamic routing. The test sentences were first divided into several groups according to the depths of their parse trees (e.g., the depth of the example sentence in Fig. 1 is three). The performance improvement of Capsule Tree-LSTM over Tree-LSTM was then calculated for each group. The results show that the performance improvement increased with tree depth. The reason is that the Tree-LSTM may suffer from a more serious bias problem for sentences with a deeper tree structure because the useful nodes at the deeper levels tend to be ignored. Conversely, the Capsule Tree-LSTM can assign a higher weight to the nodes that contribute more to the prediction even though they lie in leaf nodes at the bottom of the tree.
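The depth-wise analysis described above can be sketched as follows. The nested-tuple tree encoding, the bracketing of the example sentence, and the toy correctness flags are assumptions for illustration only and do not reproduce any reported result.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]


def tree_depth(tree: Tree) -> int:
    """Depth of a nested-tuple binary tree; a bare word has depth 0."""
    if isinstance(tree, str):
        return 0
    return 1 + max(tree_depth(child) for child in tree)


def accuracy_gain_by_depth(
    records: Iterable[Tuple[Tree, bool, bool]]
) -> Dict[int, float]:
    """records holds (tree, baseline_correct, capsule_correct) per sentence.

    Returns, for each depth, capsule accuracy minus baseline accuracy.
    """
    totals = defaultdict(lambda: [0, 0, 0])  # depth -> [count, base hits, capsule hits]
    for tree, base_ok, caps_ok in records:
        bucket = totals[tree_depth(tree)]
        bucket[0] += 1
        bucket[1] += int(base_ok)
        bucket[2] += int(caps_ok)
    return {d: (c - b) / n for d, (n, b, c) in sorted(totals.items())}


# Assumed bracketing of the example sentence; it has depth three.
example = (("This", "story"), ("is", ("really", "fantastic")))
records = [
    (example, False, True),
    (example, True, True),
    (("good", "movie"), True, True),
]
gains = accuracy_gain_by_depth(records)
print(tree_depth(example), gains)
```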

Conclusion
This study presents a capsule tree-LSTM model for sentiment classification and regression. The proposed method uses a dynamic routing algorithm to automatically learn the weight of each node when composing sentence representations. Experimental results show that the proposed method yielded better results than convolutional (CNN), sequential (LSTM and GRU), structural (tree-LSTM), and self-attention neural networks. Future work will conduct a more detailed analysis to further enhance the proposed method.