Tree Communication Models for Sentiment Analysis

Tree-LSTMs have been used for tree-based sentiment analysis over the Stanford Sentiment Treebank, which allows the sentiment signals over hierarchical phrase structures to be calculated simultaneously. However, traditional tree-LSTMs capture only the bottom-up dependencies between constituents. In this paper, we propose a tree communication model using a graph convolutional neural network and a graph recurrent neural network, which allows rich information exchange between phrases in a constituent tree. Experiments show that our model outperforms existing work on bidirectional tree-LSTMs in both accuracy and efficiency, providing more consistent predictions on phrase-level sentiments.


Introduction
There has been increasing research interest in sentiment classification over hierarchical phrases (Tai et al., 2015; Zhu et al., 2015; Looks et al., 2017; Teng and Zhang, 2017). As shown in Figure 1, the goal is to predict the sentiment class of a sentence and of each phrase in its constituent tree. Some methods classify each phrase independently (Li et al., 2015; McCann et al., 2017). However, sentiments over hierarchical phrases can have dependencies. For example, in Figure 1, both sentences contain the phrase "an awesome day", but its polarity differs according to the sentence-level context.
To better represent such sentiment dependencies, one can encode a constituency tree holistically using a neural encoder. To this end, tree-structured LSTMs have been investigated as a dominant approach (Tai et al., 2015; Zhu et al., 2015; Gan and Gong, 2017; Liu et al., 2016). Such methods work by encoding hierarchical phrases bottom-up, so that sub-constituents can be used as inputs for representing a constituent. However, they cannot pass information from a constituent node to its children, which can be necessary for cases similar to Figure 1. In this example, sentence-level information from top-level nodes is useful for disambiguating "an awesome day". Bi-directional tree-LSTMs provide a solution, using a separate top-down LSTM to augment a tree-LSTM (Teng and Zhang, 2017). This method has achieved highly competitive accuracies, at the cost of doubling the runtime.
Intuitively, information exchange between tree nodes can happen beyond bottom-up and top-down directions. For example, direct communication between sibling nodes, such as ("an awesome day", "winning the game") and ("an awesome day", "experiencing the tsunami"), can also bring benefits to tree representation. Recent advances in graph neural networks, such as the graph convolutional neural network (GCN) (Kipf and Welling, 2016) and the graph recurrent neural network (GRN) (Beck et al., 2018; Zhang et al., 2018b; Song et al., 2018), offer rich node communication patterns over graphs. For relation extraction, for example, GCNs have been shown to be superior to tree-LSTMs for encoding a dependency tree (Zhang et al., 2018c).
We investigate both GCNs and GRNs as tree communication models for tree sentiment classification. In particular, initialized with a vanilla tree-LSTM representation, each node repeatedly exchanges information with its neighbours using graph neural networks. Such multi-pass information exchange allows each node to be more informed about its sentence-level context through rich communication patterns. In addition, the number of time steps does not scale with the height of the tree. To allow better interaction, we further propose a novel time-wise attention mechanism over GRN, which summarizes the representation after each communication step.
Experiments on the Stanford Sentiment Treebank (SST; Socher et al. 2013) show that our model outperforms the standard bottom-up tree-LSTM (Zhu et al., 2015; Looks et al., 2017) and also recent work on bidirectional tree-LSTMs (Teng and Zhang, 2017). In addition, our model allows a more holistic prediction of phrase-level sentiments over the tree with a high degree of node sentiment consistency. To our knowledge, we are the first to investigate graph NNs for tree sentiment classification, and the first to discuss phrase-level sentiment consistency over a constituent tree for SST. We release our code and models at https://github.com/fred2008/TCMSA.

Related Work
Bi-directional Tree-LSTM. Paulus et al. (2014) capture bidirectional information over a binary tree by propagating global belief down from the tree root to leaf nodes. Miwa and Bansal (2016) adopt a bidirectional dependency tree-LSTM model by introducing a top-down LSTM path. Teng and Zhang (2017) propose the first bidirectional tree-LSTM for constituent structures, by building a top-down tree-LSTM with estimations of head lexicons. Compared with their work, we achieve information interaction using an asymptotically more efficient algorithm, which performs node communication simultaneously across a whole tree.
Graph Neural Network. Scarselli et al. (2009) propose the graph neural network (GNN) for encoding an arbitrary graph structure. Kipf and Welling (2016) use a graph convolutional network to learn node representations for graph structures. Marcheggiani and Titov (2017) and Bastings et al. (2017) extend the use of the graph convolutional network (GCN) to NLP tasks. In particular, they use GCNs to learn dependency-syntactic word representations for semantic role labeling and machine translation, respectively. Zhang et al. (2018b) use a graph recurrent network (GRN) to model sentences. Beck et al. (2018) and Song et al. (2018) use a graph recurrent network for learning representations of abstract meaning representation (AMR) graphs. Our work is similar in utilizing graph neural networks for NLP. Compared with their work, we apply GNNs to constituent trees. In addition, we propose a novel time-wise attention mechanism on GRN to combine recurrent time steps dynamically.

Baseline
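We take standard bottom-up tree-LSTMs as our baseline. A tree-LSTM extends a sequence LSTM by utilizing two previous states, one for a left child node and one for a right child node, in a recurrent state transition process. Formally, a tree-LSTM calculates a cell state through an input gate, an output gate and two forget gates at each time step. In particular, at time step $t$, the input gate $i_t$ and the output gate $o_t$ are calculated respectively as follows:
$$i_t = \sigma\left(W^L_{hi} h^L_{t-1} + W^R_{hi} h^R_{t-1} + W^L_{ci} c^L_{t-1} + W^R_{ci} c^R_{t-1} + b_i\right),$$
$$o_t = \sigma\left(W^L_{ho} h^L_{t-1} + W^R_{ho} h^R_{t-1} + W_{co} c_t + b_o\right),$$
where $\sigma$ is the sigmoid function, $h^L_{t-1}$, $c^L_{t-1}$ and $h^R_{t-1}$, $c^R_{t-1}$ are the hidden and cell states of the left and right child, and $W^L_{hi}$, $W^R_{hi}$, $W^L_{ci}$, $W^R_{ci}$, $b_i$, $W^L_{ho}$, $W^R_{ho}$, $W_{co}$ and $b_o$ are parameters of the input gate and the output gate, respectively.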
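The forget gates of the left node $f^L_t$ and the right node $f^R_t$ are calculated respectively as:
$$f^L_t = \sigma\left(W^L_{hf_l} h^L_{t-1} + W^R_{hf_l} h^R_{t-1} + W^L_{cf_l} c^L_{t-1} + W^R_{cf_l} c^R_{t-1} + b_{f_l}\right),$$
$$f^R_t = \sigma\left(W^L_{hf_r} h^L_{t-1} + W^R_{hf_r} h^R_{t-1} + W^L_{cf_r} c^L_{t-1} + W^R_{cf_r} c^R_{t-1} + b_{f_r}\right).$$
The cell candidate $\tilde{C}_t$ depends on both child nodes:
$$\tilde{C}_t = \tanh\left(W^L_{hC} h^L_{t-1} + W^R_{hC} h^R_{t-1} + b_C\right),$$
where $W^L_{hC}$, $W^R_{hC}$ and $b_C$ are model parameters. Based on the two previous cell states $c^L_{t-1}$ and $c^R_{t-1}$, the cell state of the current node $c_t$ is calculated as:
$$c_t = f^L_t \odot c^L_{t-1} + f^R_t \odot c^R_{t-1} + i_t \odot \tilde{C}_t,$$
where $f^L_t$ is the forget gate of the left child node, $f^R_t$ is the forget gate of the right child node, $\tilde{C}_t$ is the cell candidate, and $\odot$ denotes element-wise multiplication.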
Finally, the hidden state $h_t$ of the current node is calculated based on the current cell $c_t$ and the output gate $o_t$ (with $\odot$ the element-wise product):
$$h_t = o_t \odot \tanh(c_t).$$
Limitation. Tree-LSTM models capture only bottom-up node dependencies. Specifically, for a node $j$, the hidden representation $h^{tree}_j$ is dependent on the descendant nodes only. Formally,
$$h^{tree}_j = f\left(\{h_k : k \in d_j\}\right),$$
where $d_j$ is the set of descendant nodes of node $j$.
Bi-directional Solution. A bidirectional tree-LSTM (Bi-tree-LSTM) takes a bottom-up tree-LSTM as a first step and then performs a top-down tree communication process. Teng and Zhang (2017) is one example.
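The bottom-up state transition of the baseline can be illustrated with a minimal pure-Python sketch. It assumes a binary tree-LSTM cell with one input gate, two forget gates, a cell candidate and an output gate that sees the new cell state; the exact wiring of each gate, the helper names (`affine`, `init_params`, `tree_lstm_cell`) and the random initialization are illustrative assumptions, not the released implementation.

```python
import math
import random

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]

def tanh(v):
    return [math.tanh(x) for x in v]

def mul(a, b):
    return [x * y for x, y in zip(a, b)]

def affine(mats, vecs, bias):
    """Return bias + sum_k mats[k] @ vecs[k] for paired matrices and vectors."""
    out = list(bias)
    for W, x in zip(mats, vecs):
        for r, row in enumerate(W):
            out[r] += sum(w * xi for w, xi in zip(row, x))
    return out

def init_params(dim, scale=0.1, seed=0):
    """Hypothetical parameters: per gate, a list of dim x dim matrices and a bias."""
    rng = random.Random(seed)
    def mat():
        return [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(dim)]
    n_mats = {"i": 4, "fL": 4, "fR": 4, "C": 2, "o": 3}
    return {g: ([mat() for _ in range(n)], [0.0] * dim) for g, n in n_mats.items()}

def tree_lstm_cell(p, hL, cL, hR, cR):
    """Compose the (h, c) states of a left and a right child into the parent's."""
    def lin(name, vecs):
        mats, bias = p[name]
        return affine(mats, vecs, bias)
    kids = [hL, hR, cL, cR]
    i = sigmoid(lin("i", kids))            # input gate
    fL = sigmoid(lin("fL", kids))          # forget gate for the left child
    fR = sigmoid(lin("fR", kids))          # forget gate for the right child
    cand = tanh(lin("C", [hL, hR]))        # cell candidate (assumed to use h only)
    c = [a + b + d for a, b, d in zip(mul(fL, cL), mul(fR, cR), mul(i, cand))]
    o = sigmoid(lin("o", [hL, hR, c]))     # output gate sees the new cell state
    h = mul(o, tanh(c))
    return h, c
```

Because the parent state is a function of its two children only, applying this cell recursively from the leaves to the root yields representations that never see context outside their own subtree, which is the limitation discussed above.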

Tree Communication Models
Our tree communication models (TCM) take a trained tree-LSTM as an initial state and perform information exchange using a graph neural network (GNN). Thus $h_j$ is dependent on all related neighborhood nodes rather than only descendant nodes:
$$h_j = f\left(\{h_k : k \in r_j\}\right),$$
where $r_j$ is the set of all relevant nodes of node $j$. With sufficient communication, this set can cover the full tree. In particular, given a constituent tree, for each constituent node $j$, the initial state $h'_j$ is obtained using a tree-LSTM:
$$h'_j, c'_j = \text{tree-LSTM}\left(h'_{left(j)}, c'_{left(j)}, h'_{right(j)}, c'_{right(j)}\right),$$
where $h'_j$ is the hidden state of node $j$, $c'_j$ is the cell state of node $j$, $left(j)$ denotes the left child of node $j$, and $right(j)$ denotes the right child of node $j$.
As shown in Figure 2, a TCM performs information exchange between a constituent node $j$ and its neighbor nodes through three channels:
• A self-to-self channel transfers information from node $j$ to itself. The input for the channel is represented as $x^{self}_j = h'_j$.
• A bottom-up channel transfers information from lower-level nodes to upper-level nodes. The inputs for the channel are $h'_{left(j)}$ and $h'_{right(j)}$, where $left(j)$ and $right(j)$ denote the left child and the right child of node $j$, respectively. $x^{up}_j$ is the sum of inputs from bottom up: $x^{up}_j = h'_{left(j)} + h'_{right(j)}$.
• A top-down channel transfers information from parent nodes to child nodes. The input for the channel is represented as $x^{down}_j = h'_{prt(j)}$, where $prt(j)$ denotes the parent node of node $j$.
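The three channels can be sketched in a few lines of Python. The sketch assumes that each channel input is read from the initial tree-LSTM hidden states and that a missing neighbour (the children of a leaf, the parent of the root) contributes a zero vector; the function name `channel_inputs` and the dictionary-based tree encoding are hypothetical.

```python
def channel_inputs(h_init, left, right, parent):
    """Build self, bottom-up and top-down inputs for every node of a binary tree.

    h_init maps node id -> initial hidden vector; left, right and parent map a
    node id to a neighbour id (absent keys mean "no such neighbour").
    """
    dim = len(next(iter(h_init.values())))
    zero = [0.0] * dim  # assumption: missing neighbours contribute zeros

    def get(node):
        return h_init[node] if node is not None else zero

    inputs = {}
    for j in h_init:
        x_self = list(h_init[j])
        x_up = [a + b for a, b in zip(get(left.get(j)), get(right.get(j)))]
        x_down = list(get(parent.get(j)))
        inputs[j] = {"self": x_self, "up": x_up, "down": x_down}
    return inputs

# Hypothetical three-node tree: node 0 is the root with children 1 and 2.
example = channel_inputs(
    {0: [1.0, 1.0], 1: [2.0, 0.0], 2: [0.0, 3.0]},
    left={0: 1}, right={0: 2}, parent={1: 0, 2: 0},
)
```

Every node's inputs depend only on its immediate neighbours, so all nodes can be updated at the same time; this is what keeps the number of sequential steps independent of the tree height.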
When tree communications are executed repeatedly, each node receives information from an increasingly larger context. We explore a convolutional tree communication model (CTCM) and a recurrent tree communication model (RTCM), which are based on GCN and GRN (Song et al., 2018), respectively. Both models allow node interactions in a tree to be performed in parallel, and thus are computationally efficient. The time complexity of the additional interaction in TCMs is O(1), in contrast to O(n) for a top-down tree-LSTM.

Convolutional Tree Communication Model
We apply the strategy of , where multiple convolutional layers can be used for information combination. In particular, for the k-th layer, transformed inputs are obtained by linear transformation for each channel: where W k,e g and b k,e g are model parameters. The final representation of node j is:
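As a rough illustration of one convolutional communication layer, the following sketch applies a separate linear transformation to each channel and combines the results. It assumes that the transformed channels are summed and passed through a ReLU, and that each layer reads its channel inputs from the previous layer's node representations; both choices, as well as the names `linear` and `ctcm_layer`, are assumptions made for the example.

```python
def linear(W, b, x):
    """Return W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def ctcm_layer(h, left, right, parent, params):
    """One GCN-style layer; params maps channel name -> (W, b) for this layer."""
    dim = len(next(iter(h.values())))
    zero = [0.0] * dim  # assumption: missing neighbours contribute zeros

    def get(node):
        return h[node] if node is not None else zero

    new_h = {}
    for j in h:
        channels = {
            "self": h[j],
            "up": [a + b for a, b in zip(get(left.get(j)), get(right.get(j)))],
            "down": get(parent.get(j)),
        }
        total = [0.0] * dim
        for name, x in channels.items():
            W, b = params[name]
            total = [t + y for t, y in zip(total, linear(W, b, x))]
        new_h[j] = [max(0.0, t) for t in total]  # ReLU (assumed nonlinearity)
    return new_h
```

All nodes are updated from the previous layer's states, so stacking k such layers lets information travel up to k edges in the tree; with two layers, for instance, a node can already be influenced by its sibling through their common parent.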

Recurrent Tree Communication Model
We take the strategy of Song et al. (2018). The structure of RTCM is shown in Figure 3. For each recurrent step $t$, the hidden states from the last recurrent step are taken to calculate the cell state of the current step. In particular, for node $j$, the hidden state of the previous step can be divided into the last hidden state $h^{self}_{t-1,j}$ from the self-to-self channel, the last hidden state $h^{up}_{t-1,j}$ from the bottom-up channel and the last hidden state $h^{down}_{t-1,j}$ from the top-down channel. We calculate gate and state values based on the inputs and last hidden states from the three information channels. The input gate $i^j_t$ and the forget gate $f^j_t$ are each computed from these inputs and hidden states, with their own weight matrices and biases (such as $b_f$ for the forget gate).
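The cell candidate $\tilde{C}^j_t$ is computed from the same inputs and last hidden states. The current cell state is calculated as:
$$c^j_t = f^j_t \odot c^j_{t-1} + i^j_t \odot \tilde{C}^j_t,$$
where $\odot$ denotes element-wise multiplication. The output gate $o^j_t$ is likewise computed from the inputs and last hidden states. The final hidden state $h^j_t$ is calculated through the current cell state $c^j_t$ and the output gate $o^j_t$:
$$h^j_t = o^j_t \odot \tanh(c^j_t).$$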
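A single recurrent communication step can be sketched as follows. The sketch assumes that every gate applies one weight matrix to the concatenation of the three channel inputs and the three channel hidden states, that the hidden states start from the tree-LSTM states and the cells from zeros, and that missing neighbours contribute zero vectors. The parameter layout and the names `init_params`, `neighbours` and `rtcm_step` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(W, b, v, act):
    """Apply act(W v + b) element-wise."""
    return [act(sum(w * x for w, x in zip(row, v)) + bi) for row, bi in zip(W, b)]

def init_params(dim, seed=0):
    """Hypothetical random parameters for gates i, f, o and the candidate C."""
    rng = random.Random(seed)
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(6 * dim)] for _ in range(dim)]
    return {g: (mat(), [0.0] * dim) for g in ("i", "f", "o", "C")}

def neighbours(states, j, left, right, parent):
    """Return the bottom-up sum and the top-down vector for node j."""
    zero = [0.0] * len(states[j])
    def get(node):
        return states[node] if node is not None else zero
    up = [a + b for a, b in zip(get(left.get(j)), get(right.get(j)))]
    return up, list(get(parent.get(j)))

def rtcm_step(h_init, h_prev, c_prev, left, right, parent, p):
    """Update all nodes at once from the previous step's hidden and cell states."""
    h_new, c_new = {}, {}
    for j in h_prev:
        x_up, x_down = neighbours(h_init, j, left, right, parent)
        h_up, h_down = neighbours(h_prev, j, left, right, parent)
        v = h_init[j] + x_up + x_down + h_prev[j] + h_up + h_down  # concatenation
        i = gate(*p["i"], v, sigmoid)
        f = gate(*p["f"], v, sigmoid)
        o = gate(*p["o"], v, sigmoid)
        cand = gate(*p["C"], v, math.tanh)
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev[j], i, cand)]
        c_new[j] = c
        h_new[j] = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h_new, c_new
```

In this sketch a node sees its parent and children after one step and its sibling after two, so the receptive field grows with the number of recurrent steps rather than with a traversal of the tree.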

Time-wise attention
Both GRN and GCN calculate a sequence of incrementally more abstract representations $c^1_j, c^2_j, \ldots, c^t_j$ for each node $c_j$. We further introduce a novel attention scheme to GRN. Intuitively, each recurrent step in RTCM learns a different level of abstraction. For a constituent node higher in the tree or on the leaf, more recurrent steps may be needed to learn the interaction between nodes. Accordingly, we use an adaptive recurrence mechanism to learn a dynamic node representation through an attention structure (Bahdanau et al., 2014). Our method first encodes a recurrent-step-sensitive hidden state $h^{j,depth}_t$ for node $j$ at the $t$-th step by combining the hidden state with a positional encoding $e_p$ of the recurrence step.
Inspired by Vaswani et al. (2017), a static representation is used for the positional encoding $e_p(t)$, which does not require training:
$$e_{t,2i} = \sin\left(t / 10000^{2i/d_{emb}}\right), \qquad e_{t,2i+1} = \cos\left(t / 10000^{2i/d_{emb}}\right),$$
where $t$ is the index of recurrent steps, $e_{t,m}$ is the $m$-th dimension of the positional embedding, and $d_{emb}$ is the dimension of the embedding. We learn the weight $w_t$ for the $t$-th recurrent step from the recurrent-step-sensitive hidden states $h^{j,depth}_t$.
The final state can be represented as a weighted sum, with weights $w_t$, of the hidden states obtained after different recurrent steps.
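The time-wise attention idea can be sketched as below. The sinusoidal encoding follows the form used by Vaswani et al. (2017); adding it to the hidden state, scoring each step by a dot product with a query vector, and summing the step-sensitive states are assumptions made for illustration, and the names `positional_encoding`, `time_wise_attention` and `query` are hypothetical.

```python
import math

def positional_encoding(t, d_emb):
    """Static sinusoidal encoding of recurrent step t (no trainable parameters)."""
    enc = []
    for m in range(d_emb):
        angle = t / (10000 ** (2 * (m // 2) / d_emb))
        enc.append(math.sin(angle) if m % 2 == 0 else math.cos(angle))
    return enc

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def time_wise_attention(h_steps, query):
    """Combine the hidden states of one node over recurrent steps 1..T.

    h_steps[t - 1] is the node's hidden state after step t; query is a
    hypothetical trainable vector used to score each step (an assumption).
    """
    d = len(query)
    h_depth = [
        [h + e for h, e in zip(h_t, positional_encoding(t, d))]
        for t, h_t in enumerate(h_steps, start=1)
    ]
    scores = [sum(q * x for q, x in zip(query, h_t)) for h_t in h_depth]
    weights = softmax(scores)
    final = [sum(w * h_t[k] for w, h_t in zip(weights, h_depth)) for k in range(d)]
    return final, weights
```

Because the weights are computed per node, a node that needs more communication rounds can put more mass on later steps, while a node that is already well informed can rely on earlier ones.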

Decoding and Training
Following Looks et al. (2017) and Teng and Zhang (2017), we perform softmax classification on each node according to the last hidden state:
$$o = \text{softmax}(M h_j + b),$$
where $M$ and $b$ are model parameters. For training, negative log-likelihood loss is computed over each $o$ locally, and accumulated over the tree.
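A minimal sketch of node-level decoding and the tree-level training loss is given below, assuming a single linear layer followed by a softmax on each node's final hidden state; the function names `node_distribution` and `tree_nll` are hypothetical.

```python
import math

def node_distribution(M, b, h):
    """Softmax over sentiment classes for one node with hidden state h."""
    logits = [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(M, b)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def tree_nll(M, b, node_states, gold):
    """Negative log-likelihood summed over all nodes of one tree.

    node_states maps node id -> final hidden state; gold maps node id -> class index.
    """
    loss = 0.0
    for j, h in node_states.items():
        probs = node_distribution(M, b, h)
        loss += -math.log(probs[gold[j]])
    return loss
```

With all-zero parameters the classifier is uniform over the five SST-5 classes, so each node contributes log 5 to the loss, which is a convenient sanity check for an implementation.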

Experiments
We test the effectiveness of TCM by comparing its performance with a standard tree-LSTM (Zhu et al., 2015) as well as a state-of-the-art bidirectional tree-LSTM (Teng and Zhang, 2017). A series of analyses is conducted to measure the holistic representation of sentiment in a tree via phrase-level sentiment consistency.

Data
We use the Stanford Sentiment Treebank (SST; Socher et al. 2013), which is a dataset of movie reviews originally from Pang and Lee (2005), annotated at both the clause level and the sentence level. Following Zhu et al. (2015) and Teng and Zhang (2017), we perform both fine-grained sentiment classification and binary classification. For the former, the dataset was annotated for 5 levels of sentiment: strong negative, negative, neutral, positive, and strong positive. For the latter, the data was labeled with positive sentiment and negative sentiment. We adopt a standard dataset split following Tai et al. (2015) and Teng and Zhang (2017). Table 1 lists the data statistics.

Experimental Settings
Hyper-parameters. We initialize word embeddings using GloVe (Pennington et al., 2014) 300-dimensional embeddings. Embeddings are fine-tuned during training. The size of the LSTM hidden states is set to 300. We fix the number of recurrent steps to 9.
Training. In order to obtain a good representation for an initial constituent state, we first train an independent bottom-up tree-LSTM, over which we train our tree communication models. To avoid over-fitting, we adopt dropout on the embedding layer, with a rate of 0.5. Training is done on minibatches through Adagrad (Duchi et al., 2011) with a learning rate of 0.05. We adopt gradient clipping with a threshold of 1.0. The L2 regularization parameter is set to 0.001.

Development Experiments
Hyper-parameters. We investigate the effect of the number of recurrent steps of RTCM, as shown in Block A of Table 2. As the number of steps increases from 1, the accuracy increases, showing the effectiveness of tree node communication. Nine recurrent steps give the best accuracies, and a larger number of steps does not give further improvements. This is consistent with the observations of Song et al. (2018), who show that sufficient context information can be collected over a small number of iterations.
The effectiveness of TCM. Block B in Table 2 shows the performance of different models. Tree-LSTMs with different TCMs outperform the baseline tree-LSTM on both datasets. In addition, the time-wise attention mechanism in Section 4.2.1 improves performance on both SST-5 and SST-2.
In the remaining experiments, we use RTCM with time-wise attention. Table 3 shows the overall performance for sentiment classification on both SST-5 and SST-2. We report accuracies on both the sentence level and the phrase level. Compared with previous methods based on constituent tree-LSTMs, our model improves the performance on different datasets and settings. In particular, it outperforms BiConLSTM (Teng and Zhang, 2017), which uses a bidirectional tree-LSTM. This demonstrates the advantage of graph neural networks compared to a top-down LSTM for tree communication. Our model gives state-of-the-art accuracies in the phrase-level settings. Note that we do not leverage character representations or external resources such as sentiment lexicons and large-scale corpora. There has also been work using large-scale external datasets to improve performance. McCann et al. (2017) pretrain their model on large parallel bilingual datasets and exploit character n-gram features. They report an accuracy of 53.7 on sentence-level SST-5 and an accuracy of 90.3 on sentence-level SST-2, both of which are lower than those of our model. Peters et al. (2018) pretrain a language model with character convolutions on a large-scale corpus and report an accuracy of 54.7 on sentence-level SST-5, which is slightly higher than that of our model. Large-scale pretraining is orthogonal to our method. For a fair comparison, we do not list their results in Table 3.

Final Results
We further analyze the performance with respect to different sentence lengths. Figure 4 shows the results. On both datasets, the performance of tree-LSTM on sentences of length less than 10 (l = 5 in the figure) is much better than that on longer sentences. There is a tendency of decreasing accuracy as the sentence length increases. As the length of sentences increases, there are longer-range dependencies along the depth of the tree structure, which are more difficult to model than those in short sentences.
It can be seen that the improvement of TCM over tree-LSTM model is larger with increasing sentence length. This shows that longer sentences can benefit more from rich tree communication.

Discussion
Sentence-level performance. To further compare holistic phrase sentiment classification, we measure phrase accuracy within each sentence. We define the sentence-level phrase accuracy (SPAcc) of a sentence as: SPAcc = n_correct / n_total, where n_total is the total number of phrases in the sentence, and n_correct is the number of correct sentiment predictions in the sentence. For each sentence of the test dataset, taking the SPAcc of the label sequence produced by the baseline model as the x-coordinate and the SPAcc of the label sequence produced by TCM as the y-coordinate, we draw a scatter plot with a regression line, as shown in Figure 5. The regression line is inclined towards the top-left, indicating that TCM can improve the performance on holistic phrase classification over a whole sentence.
If the SPAcc of a sentence is high, the sentence is more holistically-labeled. Table 4 shows the statistics on the rate of holistically-labeled sentences, that is, sentences whose SPAcc reaches a threshold α (SPAcc-α). The rate of holistically-labeled sentences for TCM is higher than that for tree-LSTM on both SST-5 and SST-2 for different values of α. This demonstrates that TCM labels the constituent nodes of the whole tree better than the tree-LSTM model, thanks to more information exchange between phrases in a tree.
Consistency between nodes. To compare the sentiment classification consistency of phrases in each sentence, we define a metric, phrase error deviation (PEDev), to measure the deviation among the errors of labels for one sentence:
$$\text{PEDev}(\hat{y}, y) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(d(\hat{y}_i, y_i) - \bar{d}\right)^2},$$
where $N$ is the number of phrases in the sentence, $d(\hat{y}_i, y_i)$ is the Hamming distance between the $i$-th predicted label and the $i$-th ground truth label, and $\bar{d}$ is the mean value of $d(\hat{y}_i, y_i)$. Since $d(\hat{y}_i, y_i) \in [0, 1]$, PEDev$(\hat{y}, y) \in [0, 0.5]$.
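Both sentence-level metrics are simple to compute from the predicted and gold phrase labels of one sentence. The sketch below takes PEDev to be the population standard deviation of the per-phrase 0/1 errors, an assumption that matches the stated range [0, 0.5]; the function names `spacc` and `pedev` are hypothetical.

```python
import math

def spacc(pred, gold):
    """Sentence-level phrase accuracy: correct phrase labels / all phrases."""
    correct = sum(1 for p, g in zip(pred, gold) if p == g)
    return correct / len(gold)

def pedev(pred, gold):
    """Phrase error deviation: standard deviation of the per-phrase 0/1 errors."""
    errors = [0.0 if p == g else 1.0 for p, g in zip(pred, gold)]
    mean = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean) ** 2 for e in errors) / len(errors))

# Hypothetical sentence with four phrases, two of them mislabeled.
half_right = ([1, 2, 3, 4], [1, 2, 0, 0])
```

A sentence whose phrases are all right or all wrong has zero deviation, whereas a sentence that is half right and half wrong reaches the maximum of 0.5, which is why a lower PEDev indicates more consistent labeling.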
For an input sentence, if all the predicted labels are the same as the ground truth, or all the predicted labels are different from the ground truth, PEDev(ŷ, y) = 0, which means that the sentence is labeled with the maximum consistency. On the contrary, if the predicted labels of some phrases are the same as the ground truth while others are not, PEDev(ŷ, y) is high. Table 5 lists the statistics on PEDev(ŷ, y) of the baseline model and our model for all the test sentences on SST-5 and SST-2. The mean and median of PEDev(ŷ, y) of TCM are much lower than those of the baseline tree-LSTM model. In addition, as Figure 6 shows, the PEDev(ŷ, y) distribution of TCM is concentrated at lower values than that of the tree-LSTM model. This demonstrates that TCM improves the consistency of phrase classification for each sentence.
Figure 7: Sentiment classification samples.

Confusion matrix
Figure 8 shows the confusion matrix on the SST-5 phrase-level test set for tree-LSTM (left) and TCM (right). Compared with tree-LSTM, the accuracies of most sentiment labels increase with TCM (the accuracy of the neutral label decreases slightly, by 0.3%), indicating that TCM is strong in differentiating fine-grained sentiments in global and local contexts.

Comparison with Bi-tree-LSTM
Table 6 shows the sentence-level phrase accuracy (SPAcc) and phrase error deviation (PEDev) comparison on SST-5 between bi-tree-LSTM and TCM. TCM outperforms bi-tree-LSTM on all the metrics, which demonstrates that TCM gives more consistent predictions of sentiments over different phrases in a tree, compared to top-down communication. This shows the benefit of rich node communication. Figure 9 shows a scatter chart and a deviation chart comparison between the two models, in the same format as Figure 5 and Figure 6, respectively. As shown in Figure 9a, the errors of TCM and bi-tree-LSTM are scattered, which shows that different communication patterns influence sentiment prediction. The final observation is consistent with Table 6.
Table 6: Sentence-level phrase accuracy (SPAcc) and phrase error deviation (PEDev) comparison on SST-5 between bi-tree-LSTM and TCM.
Figure 7 shows four samples on SST-5. In the first sentence, the phrase "seemed static" by itself bears a neutral sentiment. However, it has a negative sentiment in the context. The tree-LSTM model captures the sentiment of the phrase bottom-up, and therefore predicts the neutral sentiment. In contrast, TCM considers larger contexts by repeated node interaction. The phrase "seemed static" receives information from the constituents "never took off" and "Though everything might be literate and smart" through their common ancestor nodes, leading to the correct result. Although bi-tree-LSTM predicts the sentiments of the phrase "seemed static" and of the whole sentence correctly, it gives more incorrect results on the phrase level. The other sentences in Figure 7 show similar trends. From the samples we can see that TCM provides more consistent predictions on phrase-level sentiments thanks to its better understanding of different contexts.

Conclusion
We investigated two tree communication models for sentiment analysis, leveraging recent advances in graph neural networks for information exchange between nodes in a baseline tree-LSTM model. Both GCNs and GRNs are explored and compared, with GRNs showing better accuracies. We additionally propose a novel time-wise attention mechanism to further improve GRNs. Results on a standard benchmark show that graph NNs give better results compared to bi-directional tree-LSTMs, providing more consistent predictions over phrases in one sentence. To our knowledge, we are the first to leverage graph neural network structures for enhancing tree-LSTMs, and the first to discuss tree-level sentiment consistency using a set of novel metrics.