Structural Attention Neural Networks for improved sentiment analysis

We introduce a tree-structured attention neural network for sentences and small phrases and apply it to the problem of sentiment classification. Our model extends current recursive models by incorporating structural information around a node of a syntactic tree, using both bottom-up and top-down information propagation. The model also utilizes structural attention to identify the most salient representations during the construction of the syntactic tree.


Introduction
Sentiment analysis deals with the assessment of opinions, speculations, and emotions in text (Zhang et al., 2012; Pang and Lee, 2008). It is a relatively recent research area that has attracted great interest, as demonstrated by a series of shared evaluation tasks, e.g., the analysis of tweets (Nakov et al., 2016). In (Turney and Littman, 2002), the affective ratings of unknown words were predicted using the affective ratings of a small set of words (seeds) and the semantic relatedness between the unknown and the seed words. An example of sentence-level analysis was proposed in (Malandrakis et al., 2013). Other application areas include the detection of public opinion and the prediction of election results (Singhal et al., 2015), and the correlation of mood states with stock market indices (Bollen et al., 2011).
Recently, Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) or Gated Recurrent Units (GRUs) (Chung et al., 2014) have been applied to various Natural Language Processing tasks. Tree-structured neural networks, found in the literature as Recursive Neural Networks, hold a linguistic interest due to their close relation to the syntactic structure of sentences, being able to capture distributed information about structure, such as logical terms (Socher et al., 2012). These syntactic structures are N-ary trees which represent either the underlying structure of a sentence, known as constituency trees, or the relations between words, known as dependency trees.
This paper focuses on sentence-level sentiment classification of movie reviews, using syntactic parse trees as input to the proposed networks. To solve the task, we build upon a variant of Recursive Neural Networks which recursively creates representations following the syntactic structure, exploiting information from the subnodes as well as the parent node of the node under examination. This type of network is referred to as a Bidirectional Recursive Network (Irsoy and Cardie, 2013). The model is further enhanced with memory units and the proposed structural attention mechanism. Not all nodes of a tree are equally informative: different nodes hold information of variable saliency, so the proposed model selectively weights the contribution of each node to the sentence-level representation using the structural attention model.
We evaluate our approach on the sentence-level sentiment classification task using a standard movie review dataset (Socher et al., 2013). Experimental results show that the proposed model outperforms state-of-the-art methods.

Tree-Structured GRUs
Recursive GRUs (TreeGRUs) over tree structures are an extension of sequential GRUs that allows information to propagate through network topologies. Similarly to the Recursive LSTM network on tree structures (Tai et al., 2015), for every node of a tree the TreeGRU has gating mechanisms that modulate the flow of information inside the unit without the need for a separate memory cell. The activation h_j of the TreeGRU for node j is an interpolation between the previously calculated activations h_{jk} of its k-th child out of N total children and the candidate activation \tilde{h}_j:

    h_j = z_j \odot \tilde{h}_j + (1 - z_j) \odot \sum_{k=1}^{N} h_{jk}    (1)

where z_j is the update gate, which decides the degree of update applied to the activation based on the input vector x_j and the previously calculated representations h_{jk}:

    z_j = \sigma\Big( \sum_{k=1}^{N} W_k^{(z)} h_{jk} + U^{(z)} x_j \Big)    (2)

The candidate activation \tilde{h}_j for node j is computed similarly to that of a Recursive Neural Network (Socher et al., 2011):

    \tilde{h}_j = f\Big( \sum_{k=1}^{N} W_k (r_j \odot h_{jk}) + U x_j \Big)    (3)

where r_j is the reset gate, which allows the network to effectively forget previously computed representations when its value is close to 0; it is computed as follows:

    r_j = \sigma\Big( \sum_{k=1}^{N} W_k^{(r)} h_{jk} + U^{(r)} x_j \Big)    (4)

Every part of the gated recurrent unit x_j, h_j, r_j, z_j, \tilde{h}_j \in R^d, where d is the input vector dimensionality; \sigma is the sigmoid function and f is the tanh non-linearity. The sets of matrices W_k, U \in R^{d \times d} used in Eqs. 2-4 are the trainable weight parameters that connect the representation h_{jk} of the k-th child and the input vector x_j with the representation of node j.
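The per-node computation can be sketched in numpy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the small random weight initialization, the helper names, and the plain summation over children are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # hidden/input dimensionality (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_children):
    """One weight matrix per child position k, plus an input matrix U,
    for each of the update gate, reset gate, and candidate activation."""
    p = {}
    for name in ("z", "r", "h"):
        p["W_" + name] = [rng.normal(0, 0.1, (d, d)) for _ in range(n_children)]
        p["U_" + name] = rng.normal(0, 0.1, (d, d))
    return p

def tree_gru_node(x_j, child_hs, p):
    """Compute the activation h_j of one TreeGRU node from its input
    vector x_j and the activations child_hs of its N children."""
    z = sigmoid(sum(W @ h for W, h in zip(p["W_z"], child_hs)) + p["U_z"] @ x_j)
    r = sigmoid(sum(W @ h for W, h in zip(p["W_r"], child_hs)) + p["U_r"] @ x_j)
    h_cand = np.tanh(sum(W @ (r * h) for W, h in zip(p["W_h"], child_hs))
                     + p["U_h"] @ x_j)
    h_children = sum(child_hs)                 # aggregated child activations
    return z * h_cand + (1 - z) * h_children   # gated interpolation

params = init_params(n_children=2)
children = [np.tanh(rng.normal(size=d)) for _ in range(2)]
h_j = tree_gru_node(rng.normal(size=d), children, params)
print(h_j.shape)  # (4,)
```

Leaf nodes would be handled the same way with an empty child list, so only the `U` terms contribute.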

Bidirectional TreeGRU
A natural extension of the Tree-Structured GRU is a bidirectional approach. A TreeGRU calculates the activation of node j using only the previously computed activations lying lower in the tree structure. The bidirectional approach, in contrast, uses information from both the nodes below and the nodes above a particular node j, so a newly calculated activation incorporates content from both the children and the parent of that node. The bidirectional network is trained in two separate phases: i) the Upward phase and ii) the Downward phase. During the Upward phase, the network topology is the same as that of a TreeGRU: every activation is calculated from the previously calculated activations found lower in the structure, in a bottom-up fashion. When every activation from the leaves to the root has been computed, the root activation is used as input to the Downward phase, which calculates the activations for every child of a node using content from its parent, in a top-down fashion. The two phases are strictly separated: in a first pass the network computes the upward activations and, once this is completed, the downward representations are computed. The upward activation h_j^{\uparrow}, similarly to the TreeGRU, is an interpolation between the previously calculated activations h_{jk}^{\uparrow} of the N children of node j and the candidate activation \tilde{h}_j^{\uparrow}:

    h_j^{\uparrow} = z_j^{\uparrow} \odot \tilde{h}_j^{\uparrow} + (1 - z_j^{\uparrow}) \odot \sum_{k=1}^{N} h_{jk}^{\uparrow}    (5)

The update gate, reset gate and candidate activation of the Upward phase are computed as in Eqs. 2-4:

    z_j^{\uparrow} = \sigma\Big( \sum_{k=1}^{N} W_k^{(z)} h_{jk}^{\uparrow} + U^{(z)} x_j \Big),
    r_j^{\uparrow} = \sigma\Big( \sum_{k=1}^{N} W_k^{(r)} h_{jk}^{\uparrow} + U^{(r)} x_j \Big),
    \tilde{h}_j^{\uparrow} = f\Big( \sum_{k=1}^{N} W_k (r_j^{\uparrow} \odot h_{jk}^{\uparrow}) + U x_j \Big)    (6)

The downward activation h_j^{\downarrow} for node j is an interpolation between the previously calculated activation h_{p(j)}^{\downarrow}, where the function p(j) returns the index of the parent node, and the candidate activation \tilde{h}_j^{\downarrow}:

    h_j^{\downarrow} = z_j^{\downarrow} \odot \tilde{h}_j^{\downarrow} + (1 - z_j^{\downarrow}) \odot h_{p(j)}^{\downarrow}    (7)

The update gate, reset gate and candidate activation of the Downward phase are computed as follows:

    z_j^{\downarrow} = \sigma\big( W_d^{(z)} h_{p(j)}^{\downarrow} + U_d^{(z)} h_j^{\uparrow} \big),
    r_j^{\downarrow} = \sigma\big( W_d^{(r)} h_{p(j)}^{\downarrow} + U_d^{(r)} h_j^{\uparrow} \big),
    \tilde{h}_j^{\downarrow} = f\big( W_d (r_j^{\downarrow} \odot h_{p(j)}^{\downarrow}) + U_d h_j^{\uparrow} \big)    (8)

During the Downward phase, the matrices U_d \in R^{d \times d} connect the upward representation h_j^{\uparrow} of node j with the respective downward node, while the matrices W_d \in R^{d \times d} connect the parent representation h_{p(j)}^{\downarrow}.
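The two-phase traversal can be illustrated on a toy tree. The combine functions below are deliberately simplified stand-ins for the full gated updates (plain tanh layers), so only the computation order of the Upward and Downward phases is faithful; the `Node` class and weight names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

class Node:
    def __init__(self, word_vec, children=()):
        self.x = word_vec            # input vector (e.g., a leaf embedding)
        self.children = list(children)
        self.h_up = None             # upward-phase activation
        self.h_down = None           # downward-phase activation

# Toy combine weights standing in for the full gated updates:
W_up = rng.normal(0, 0.3, (d, d))
W_d = rng.normal(0, 0.3, (d, d))     # connects the parent's downward state
U_d = rng.normal(0, 0.3, (d, d))     # connects the node's own upward state

def upward(node):
    """Post-order (bottom-up) traversal: children are computed before parents."""
    for c in node.children:
        upward(c)
    kids = sum(c.h_up for c in node.children) if node.children else 0.0
    node.h_up = np.tanh(W_up @ node.x + kids)

def downward(node, parent_h_down):
    """Pre-order (top-down) traversal: parents are computed before children."""
    node.h_down = np.tanh(W_d @ parent_h_down + U_d @ node.h_up)
    for c in node.children:
        downward(c, node.h_down)

leaves = [Node(rng.normal(size=d)) for _ in range(2)]
root = Node(rng.normal(size=d), children=leaves)
upward(root)                     # first pass: leaves -> root
downward(root, np.zeros(d))      # second pass: root -> leaves, seeded at the root
print(root.h_up.shape, leaves[0].h_down.shape)
```

Seeding the Downward phase with a zero "parent" state at the root is one plausible convention; the key point is that every `h_down` consumes a fully computed `h_up`.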

Structural Attention
We introduce Structural Attention, a generalization of the sequential attention model (Luong et al., 2015) which extracts the informative nodes of a syntactic tree and aggregates the representations of those nodes to form the sentence vector. We feed the representation h_j of each node through a one-layer Multilayer Perceptron with weight matrix W_w \in R^{d \times d} to get a hidden representation u_j:

    u_j = \tanh( W_w h_j )    (9)

Using the softmax function, a weight a_j for each node is obtained based on the similarity of its hidden representation u_j to a global context vector u_w \in R^d:

    a_j = \frac{\exp( u_j^{\top} u_w )}{\sum_k \exp( u_k^{\top} u_w )}    (10)

The normalized weights a_j are used to form the final sentence representation s \in R^d as a weighted sum of all node representations h_j:

    s = \sum_j a_j h_j    (11)
The proposed attention model is applied to structural content, since all node representations encode syntactic structural information during training due to the recursive nature of the network topology.
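The attention pooling can be sketched in a few lines of numpy; this is a minimal illustration assuming a bias-free projection and a handful of precomputed node representations (the sizes and initializations are placeholders).

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4

# Node representations h_j collected from the tree (one row per node),
# plus the attention parameters: projection W_w and global context vector u_w.
H = np.tanh(rng.normal(size=(5, d)))   # 5 nodes (illustrative)
W_w = rng.normal(0, 0.1, (d, d))
u_w = rng.normal(size=d)

u = np.tanh(H @ W_w.T)                 # hidden representations u_j (one-layer MLP)
scores = u @ u_w                       # similarity with the global context vector
a = np.exp(scores - scores.max())
a /= a.sum()                           # softmax weights a_j over the nodes
s = a @ H                              # sentence vector: weighted sum of node reps

print(a.shape, s.shape)
```

Subtracting `scores.max()` before exponentiating is the usual numerically stable softmax; it does not change the resulting weights.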

Experiments
We evaluate the performance of the aforementioned models on the task of sentiment classification of sentences sampled from movie reviews. We use the Stanford Sentiment Treebank (Socher et al., 2013), a dataset which contains sentiment labels for every syntactically plausible phrase of the 8544/1101/2210 train/dev/test sentences. Each phrase is labeled on a 5-class sentiment scale: very negative, negative, neutral, positive, very positive. The dataset can also be used for a binary classification subtask by excluding neutral phrases from the original splits; the binary subtask is evaluated on 6920/872/1821 train/dev/test splits.

Sentiment Classification
For all of the aforementioned architectures, at each node j we use a softmax classifier to predict the sentiment label \hat{y}_j; the predicted label corresponds to the sentiment class of the phrase spanned by node j. The classifier of the unidirectional TreeGRU architectures uses the hidden state h_j produced by the recursive computations up to node j over the set \{x\}_j of input vectors:

    \hat{p}_\theta( y \mid \{x\}_j ) = \mathrm{softmax}( W_s^{\top} h_j )    (12)

where W_s \in R^{d \times c} and c is the number of sentiment classes. The classifier of the bidirectional TreeBiGRU architectures uses both hidden states h_j^{\uparrow} and h_j^{\downarrow} produced by the recursive computations up to node j during the Upward and Downward phases:

    \hat{p}_\theta( y \mid \{x\}_j ) = \mathrm{softmax}( W_s^{\uparrow\top} h_j^{\uparrow} + W_s^{\downarrow\top} h_j^{\downarrow} )    (13)

where W_s^{\uparrow}, W_s^{\downarrow} \in R^{d \times c}. The predicted label \hat{y}_j is the class with the maximum confidence:

    \hat{y}_j = \arg\max_y \hat{p}_\theta( y \mid \{x\}_j )    (14)

For the Structural Attention models, we use the final sentence representation s to predict the sentiment label \hat{y}_j, where j is the root node of the sentence. The cost function is the negative log-likelihood of the ground-truth labels y_k at each node:

    E(\theta) = -\frac{1}{m} \sum_{k=1}^{m} \log \hat{p}_\theta( y_k \mid \{x\}_k ) + \lambda \|\theta\|_2^2    (15)

where m is the number of labels in a training sample and \lambda is the L2 regularization hyperparameter.
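The node-level classifier and cost can be sketched as follows; the node representations, labels, and weight initialization are placeholders, and only the softmax/NLL structure is meant to mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(3)
d, c = 4, 5   # hidden size, number of sentiment classes

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

W_s = rng.normal(0, 0.1, (c, d))   # stored transposed for convenience

def node_loss(h_j, y_j):
    """Softmax classifier on one node representation, returning the
    max-confidence prediction and the NLL of the gold label."""
    probs = softmax(W_s @ h_j)
    y_hat = int(np.argmax(probs))
    return y_hat, -np.log(probs[y_j])

# Per-sentence cost: average NLL over the m labeled nodes, plus an L2 penalty.
nodes = [(np.tanh(rng.normal(size=d)), int(rng.integers(0, c))) for _ in range(3)]
lam = 1e-4
cost = sum(node_loss(h, y)[1] for h, y in nodes) / len(nodes)
cost += lam * np.sum(W_s ** 2)
print(cost > 0)
```

In a real training loop the penalty would of course run over all trainable parameters, not only the classifier matrix.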

Results
The evaluation results are presented in Table 2 in terms of accuracy, for several state-of-the-art models proposed in the literature as well as for the TreeGRU and TreeBiGRU models proposed in this work. Among the approaches reported in the literature, the highest accuracy is yielded by DRNN and DMN for the binary scheme (88.6), and by DMN for the fine-grained scheme (52.1). We observe that the best performance is achieved by TreeBiGRU with attention, for both the binary (89.5) and fine-grained (52.4) evaluations, exceeding any previously reported results. In addition, the attention mechanism employed in the proposed TreeGRU and TreeBiGRU models improves performance on both evaluations.

Hyperparameters and Training Details
The evaluated models are trained using the AdaGrad (Duchi et al., 2010) algorithm with a learning rate of 0.01 and a minibatch size of 25 sentences. L2 regularization is applied to the model parameters with λ = 10^{-4}. We use dropout with probability 0.5 on both the input layer and the softmax layer. The word embeddings are initialized using the publicly available GloVe vectors of dimensionality 300, which provide 95.5% coverage of the SST dataset. All word vectors are fine-tuned during training along with every other parameter. Every matrix is initialized with the identity matrix multiplied by 0.5, except for the matrices of the softmax layer and the attention layer, which are randomly initialized from the standard normal distribution. Every bias vector is initialized with zeros. Training lasts for 40 epochs; during training we evaluate the network 4 times per epoch and keep the parameters that give the best root accuracy on the development set.

[Table 2 model references: (Socher et al., 2013); PVec (Mikolov et al., 2013); TreeLSTM (Tai et al., 2015); DRNN (Irsoy and Cardie, 2013); DCNN (Kalchbrenner et al., 2014); CNN-multichannel (Kim, 2014); DMN (Kumar et al., 2015).]
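The optimizer and initialization choices above can be sketched as follows; the gradient is a dummy placeholder, and only the AdaGrad update rule and the scaled-identity initialization are meant to reflect the description.

```python
import numpy as np

def adagrad_update(param, grad, cache, lr=0.01, eps=1e-8):
    """One AdaGrad step: the effective per-parameter learning rate shrinks
    with the accumulated squared gradients (cache is updated in place)."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)
    return param, cache

d = 4
W = 0.5 * np.eye(d)          # scaled-identity init used for recursive matrices
cache = np.zeros_like(W)
for _ in range(3):           # a few steps with a fixed dummy gradient
    grad = 0.1 * np.ones_like(W)
    W, cache = adagrad_update(W, grad, cache)
print(W.shape)
```

The `eps` constant is the usual small term that prevents division by zero on the first step; its exact value is an assumption, as the paper does not report one.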

Conclusion
In this short paper, we propose an extension of Recursive Neural Networks that incorporates a bidirectional approach with gated memory units as well as an attention model at the structure level. The proposed models were evaluated on both fine-grained and binary sentence-level sentiment classification tasks. Our results indicate that both the direction of the computation and attention at the structural level can enhance the performance of neural networks on a sentiment analysis task.

Acknowledgments
This work has been partially funded by the BabyRobot project, supported by the EU Horizon 2020 Programme under grant number 687831. The authors would also like to thank NVIDIA for supporting this work with the donation of a Titan X GPU.