You Only Need Attention to Traverse Trees

In recent NLP research, a topic of interest is universal sentence encoding: sentence representations that can be used in any supervised task. At the word sequence level, fully attention-based models suffer from two problems: a quadratic increase in memory consumption with respect to sentence length and an inability to capture and use syntactic information. Recursive neural nets, in contrast, can extract very good syntactic information by traversing a tree structure. To this end, we propose Tree Transformer, a model that captures phrase-level syntax for constituency trees as well as word-level dependencies for dependency trees by doing recursive traversal only with attention. Evaluated on four tasks, this model achieves noteworthy results compared to the standard Transformer and LSTM-based models as well as tree-structured LSTMs. We also provide ablation studies to determine whether positional information is inherently encoded in the trees and which type of attention is suitable for the recursive traversal.


Introduction
Following the breakthrough in NLP research with word embeddings by Mikolov et al. (2013), recent research has focused on sentence representations. Good sentence representations can help accomplish many NLP tasks because we eventually deal with sentences, e.g., question answering, sentiment analysis, semantic similarity, and natural language inference. Most of the existing task-specific sequential sentence encoders are based on recurrent neural nets such as LSTMs or GRUs (Conneau et al., 2017; Lin et al., 2017; Liu et al., 2016). All of these works follow a common paradigm: use an LSTM/GRU over the word sequence, extract contextual features at each time step, and apply some kind of pooling on top of that. However, a few works adopt different methods. Kiros et al. (2015) propose a skip-gram-like objective function at the sentence level to obtain the sentence embeddings. Logeswaran and Lee (2018) reformulate the task of predicting the next sentence given the current one into a classification problem where, instead of a decoder, they use a classifier to predict the next sentence from a set of candidates.
The attention mechanism adopted by most of the RNN-based models requires access to the hidden states at every time step (Kumar et al., 2016). These models are inefficient and at the same time very hard to parallelize. To overcome this, Parikh et al. (2016) propose a fully attention-based neural network which can adequately model the word dependencies and at the same time is parallelizable. Vaswani et al. (2017) adopt the multi-head version in both the encoder and decoder of their Transformer model along with positional encoding. Ahmed et al. (2017) propose a multi-branch attention framework where each branch captures a different semantic subspace and the model learns to combine them during training. Cer et al. (2018) propose an unsupervised sentence encoder that leverages only the encoder part of the Transformer: they train on the large Stanford Natural Language Inference (SNLI) corpus and then use transfer learning on smaller task-specific corpora.
Apart from these sequential models, there has been extensive work on the tree structure of natural language sentences. Socher et al. (2011b, 2013) propose a family of recursive neural net (RvNN) based models where a composition function is applied recursively bottom-up on children nodes to compute the parent node representation until the root is reached. Tai et al. (2015) propose two variants of the sequential LSTM: the child-sum Tree LSTM and the N-ary Tree LSTM. The same gating structures as in the standard LSTM are used except [...]. Recently, Shen et al. (2018) propose a Parsing-Reading-Predict Network (PRPN) which can induce syntactic structure automatically from an unannotated corpus and can learn a better language model with that induced structure. Later, Htut et al. (2018) test PRPN under various configurations and datasets and further verify its empirical success for neural network latent tree learning. Williams et al. (2018) also validate the effectiveness of two latent tree based models but find some issues, such as a bias towards producing shallow trees, inconsistencies in negation handling, and a tendency to consider the last two words of a sentence as constituents.
In this paper, we propose a novel recursive neural network architecture consisting of a decomposable attention framework in every branch. We call this model Tree Transformer as it is solely dependent on attention. In a subtree, the use of a composition function is justified by a claim of Socher et al. (2011b). In this work, we replace this composition function with an attention module. While Socher et al. (2011b) consider only the child representations for both dependency and constituency syntax trees, in this work, for dependency trees, the attention module takes both the child and parent representations as input and produces weighted attentive copies of them. For constituency trees, as the parent vector is entirely dependent on the upward propagation, the attention module works only with the child representations. Our extensive evaluation shows that our model is better than, or at least on par with, the existing sequential (i.e., LSTM and Transformer) and tree-structured (i.e., Tree LSTM and RvNN) models.

Proposed Model
Our model is designed to address the following general problem. Given a dependency or constituency tree structure, the task is to traverse every subtree within it attentively and infer the root representation as a vector. Our idea is inspired by the RvNN models of Socher et al. (2011b, 2013), where a composition function is used to transform a set of child representations into one single parent representation. In this section, we describe how we use the attention module as a composition function to build our Tree Transformer. Figure 1 gives a sketch of our model. A dependency tree contains a word at every node. To traverse a subtree in a dependency tree, we look at both the parent and child representations ($X_d$ in Eqn. 1). In contrast, in a constituency tree, only leaf nodes contain words. The nonterminal vectors are calculated only after traversing each subtree. Consequently, only the child representations ($X_c$ in Eqn. 1) are considered.
$$X_d = [p_v; c_{1v}; c_{2v}; \dots], \qquad X_c = [c_{1v}; c_{2v}; \dots] \tag{1}$$
Here, $p_v$ is the parent representation and the $c_{iv}$'s are the child representations. For both of these trees, Eqn. 2 computes the attentive transformed representation.
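The difference between the two inputs can be illustrated with a minimal Python sketch of the bottom-up traversal. This is not the authors' implementation: the `Node` class, the `encode` function and the averaging placeholder `mean_compose` (which merely stands in for the attention module $f$) are hypothetical names introduced here, and vectors are plain lists of floats.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class Node:
    """Hypothetical tree node. `vec` holds a word vector; it is None for
    constituency nonterminals, which carry no word of their own."""
    vec: Optional[Vector] = None
    children: List["Node"] = field(default_factory=list)


def mean_compose(rows: List[Vector]) -> Vector:
    """Placeholder composition: column-wise average of the input rows.
    It only stands in for the attention-based composition function."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def encode(node: Node, dependency: bool,
           compose: Callable[[List[Vector]], Vector] = mean_compose) -> Vector:
    """Traverse the tree bottom-up and return the representation of `node`."""
    if not node.children:
        return node.vec  # a leaf is represented by its word vector
    child_reps = [encode(c, dependency, compose) for c in node.children]
    if dependency:
        rows = [node.vec] + child_reps  # X_d: parent word plus children
    else:
        rows = child_reps               # X_c: children only
    return compose(rows)


leaf_a, leaf_b = Node(vec=[3.0, 0.0]), Node(vec=[5.0, 6.0])
dep_root = Node(vec=[1.0, 0.0], children=[leaf_a, leaf_b])
con_root = Node(children=[leaf_a, leaf_b])
print(encode(dep_root, dependency=True))   # [3.0, 2.0]
print(encode(con_root, dependency=False))  # [4.0, 3.0]
```

Passing the composition function as an argument keeps the traversal independent of the module used at each subtree, so an attention-based function could be substituted for the placeholder without changing `encode`.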
Here, $f$ is the composition function using the multi-branch attention framework (Ahmed et al., 2017). This multi-branch attention is built upon the multi-head attention framework (Vaswani et al., 2017), which in turn uses scaled dot-product attention (Parikh et al., 2016) as the building block. It operates on a query $Q$, key $K$ and value $V$ as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V \tag{3}$$
where $d_k$ is the dimension of the key. As we are interested in $n$ branches, $n$ copies are created for each of $(Q, K, V)$, converted to a 3D tensor, and then scaled dot-product attention is applied to each copy using the transformation matrices $W_i^Q$, $W_i^K$ and $W_i^V$ (Eqn. 4). Instead of having separate parameters for the transformation of leaves, internal nodes and parents, we keep $W_i^Q$, $W_i^K$ and $W_i^V$ the same for all these components. We then project each of the resultant tensors into different semantic sub-spaces and employ a residual connection (Srivastava et al., 2015) around them. Lastly, we normalize the resultant outputs using a layer normalization block (Ba et al., 2016) and apply a scaling factor $\kappa$ to get the branch representation. All of these are summarized in Eqn. 5.
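The branch computation described above can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' code: the helper names (`matmul`, `attention`, `layer_norm`, `branch`) are hypothetical, self-attention is assumed (query, key and value are all derived from the same input rows), the residual is taken with respect to the module input, and layer normalization omits the learned gain and bias.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (r x m) and b (m x c)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul([softmax(row) for row in scores], v)


def layer_norm(row: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize one row to zero mean and unit variance (no gain or bias)."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def branch(x: Matrix, w_q: Matrix, w_k: Matrix, w_v: Matrix,
           w_b: Matrix, kappa: float) -> Matrix:
    """One branch: project, attend, project back to d_m, add the residual
    (assumed here to be the module input), normalize, scale by kappa."""
    head = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
    projected = matmul(head, w_b)  # d_v -> d_m
    summed = [[p + xi for p, xi in zip(p_row, x_row)]
              for p_row, x_row in zip(projected, x)]
    return [[kappa * v for v in layer_norm(row)] for row in summed]
```

In this sketch `x` has one row per node in the subtree and $d_m$ columns, `w_q`, `w_k` and `w_v` map $d_m$ to the per-branch dimension, and `w_b` maps back to $d_m$. Reusing the same matrices for every subtree mirrors the parameter sharing across leaves, internal nodes and parents described in the text.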
Here, $W_i^b \in \mathbb{R}^{n \times d_v \times d_m}$ and $\kappa \in \mathbb{R}^n$ are the parameters to be learned. Note that we choose $d_k = d_q = d_v = d_m/n$. Following this, we take each of these $B$'s and apply a convolutional neural network (see Eqn. 6) consisting of two transformations applied to each position separately and identically, with a ReLU activation ($R$) in between.
We compute the final attentive representation of these subspace semantics by doing a linearly weighted summation (see Eqn. 7), where $\alpha \in \mathbb{R}^n$ is learned as a model parameter. Lastly, we employ another residual connection with the output of Eqn. 7, transform it non-linearly and perform an element-wise summation (EwS) to get the final parent representation as in Eqn. 8.
Here, $x$ and $\tilde{x}$ depict the input and output of the attention module.
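The last three steps (position-wise transformation, weighted combination of branches, and reduction to a parent vector) can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names are hypothetical, the convolution is assumed to have kernel width 1 (which makes it a per-position linear map), `tanh` is an assumed choice for the unspecified non-linearity, and the element-wise summation is assumed to run over the rows (nodes) of the subtree.

```python
import math
from typing import List

Matrix = List[List[float]]


def linear(x: Matrix, w: Matrix, b: List[float]) -> Matrix:
    """Apply the same affine map to every row (position) of x.
    Equivalent to a 1D convolution with kernel width 1."""
    return [[sum(xi * wij for xi, wij in zip(row, col)) + bj
             for col, bj in zip(zip(*w), b)]
            for row in x]


def pcnn(x: Matrix, w1: Matrix, b1: List[float],
         w2: Matrix, b2: List[float]) -> Matrix:
    """Two position-wise transformations with a ReLU in between."""
    hidden = [[max(0.0, v) for v in row] for row in linear(x, w1, b1)]
    return linear(hidden, w2, b2)


def combine(branches: List[Matrix], alpha: List[float]) -> Matrix:
    """Linearly weighted sum of the branch outputs with weights alpha."""
    rows, cols = len(branches[0]), len(branches[0][0])
    return [[sum(a * br[r][c] for a, br in zip(alpha, branches))
             for c in range(cols)]
            for r in range(rows)]


def parent_vector(x: Matrix, attended: Matrix) -> List[float]:
    """Residual connection, tanh (assumed non-linearity), then an
    element-wise sum over the rows to obtain a single parent vector."""
    transformed = [[math.tanh(a + xi) for a, xi in zip(a_row, x_row)]
                   for a_row, x_row in zip(attended, x)]
    return [sum(col) for col in zip(*transformed)]
```

With the hyperparameters reported in the experimental setup, `w1` would map to 341 channels and `w2` back to 300, and `alpha` would hold one weight per branch.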

Experiments
In this section, we demonstrate the effectiveness of our Tree Transformer model by reporting its evaluation on four NLP tasks. We present a detailed ablation study on whether positional encoding is important for trees and also demonstrate which attention module is most suitable as a composition function for the recursive architectures.
Experimental Setup: We initialize the word embedding layer weights with GloVe 300-dimensional word vectors (Pennington et al., 2014). These embedding weights are not updated during training. In the multi-head attention block, the dimension of the query, key and value matrices is set to 50 and we use 6 parallel heads on each input. The multi-branch attention block is composed of 6 position-wise convolutional layers. The number of branches is also set to 6. We use two layers of convolutional neural network as the composition function for the PCNN layer. The first layer uses 341 1d kernels with no dropout and the second layer uses 300 1d kernels with dropout 0.1.
During training, the model parameters are updated using the Adagrad algorithm (Duchi et al., 2011) with a fixed learning rate of 0.0002. We trained our model on an Nvidia GeForce GTX 1080 GPU and used PyTorch 0.4 for the implementation under the Linux environment.
Datasets: Evaluation is done on four tasks: the Stanford Sentiment Treebank (SST) (Socher et al., 2011b) for sentiment analysis, Sentences Involving Compositional Knowledge (SICK) (Marelli et al., 2014) for semantic relatedness (-R) and natural language inference (-E), and the Microsoft Research Paraphrase (MSRP) corpus (Dolan et al., 2004) for paraphrase identification.
The SST dataset includes already generated dependency and constituency trees. As the other two datasets do not provide tree structures, we parsed each sentence using the Stanford dependency and constituency parsers.
For the sentiment classification (SST), natural language inference (SICK-E), and paraphrase identification (MSRP) tasks, we use accuracy, the standard evaluation metric. For the semantic relatedness task (SICK-R), we use mean squared error (MSE) as the evaluation metric.
We use KL-divergence as the loss function for SICK-R to measure the distance between the predicted and target distributions. For the other three tasks, we use cross entropy as the loss function. Table 1 shows the results of the evaluation of the model on the four tasks in terms of task-specific evaluation metrics. We compare our Tree Transformer against tree-structured RvNNs, LSTM-based, and Transformer-based architectures.
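To make the KL-divergence loss for SICK-R concrete, the sketch below shows one common way to turn a real-valued relatedness score into a target distribution and compare it with a predicted distribution. The target construction follows the sparse encoding used by Tai et al. (2015); the text above does not state that the same construction is used here, so it should be read as an assumption. The function names are hypothetical, and scores are assumed to lie in the range 1 to 5.

```python
import math
from typing import List


def target_distribution(score: float, num_classes: int = 5) -> List[float]:
    """Spread a score in [1, num_classes] over its two neighbouring integer
    classes (assumed encoding, following Tai et al., 2015)."""
    floor = math.floor(score)
    p = [0.0] * num_classes
    if floor == score:
        p[floor - 1] = 1.0            # integer score: all mass on one class
    else:
        p[floor - 1] = floor - score + 1
        p[floor] = score - floor
    return p


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """KL(p || q); terms with p_i = 0 contribute nothing."""
    return sum(pi * math.log(pi / max(qi, eps))
               for pi, qi in zip(p, q) if pi > 0)


def expected_score(q: List[float]) -> float:
    """Map a distribution over classes 1..K back to a real-valued score."""
    return sum((i + 1) * qi for i, qi in enumerate(q))


target = target_distribution(3.5)      # [0.0, 0.0, 0.5, 0.5, 0.0]
uniform = [0.2] * 5
print(kl_divergence(target, target))   # 0.0
print(kl_divergence(target, uniform) > 0)  # True
```

Under this encoding the expected score of the target distribution equals the original score, so MSE can be computed on `expected_score` of the predicted distribution at evaluation time.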
For a fair comparison, we implemented both variants of the Tree LSTM and the Transformer-based architectures, as well as some of the RvNN and LSTM-based models that do not have reported results for every task. Rather than assessing transfer performance, we evaluate on each corpus separately following the standard train/test/valid split.
For SICK-E, our model achieves 82.95% and 82.72% accuracy with dependency and constituency trees, respectively, which is on par with DT-LSTM (83.11%) as well as CT-LSTM (82.00%) and somewhat better than the standard Transformer (81.15%). As can be seen, all of the previous recursive architectures are somewhat inferior to the Tree Transformer results.
For SICK-R, our model obtains MSEs of .2774 and .3012, whereas the reported MSEs for DT-LSTM and CT-LSTM are .2532 and .2734, respectively. However, in our implementation of those models with the same hyperparameters, we were not able to reproduce the reported results. Instead, we obtained MSEs of .2625 and .2891 for DT-LSTM and CT-LSTM, respectively. On this task, our model does substantially better than the standard Transformer (.5241 MSE).
On the MSRP dataset, our dependency tree version (70.34% Acc.) is below DT-LSTM (72.07%).
Since positional encoding is a crucial part of the standard Transformer, Table 2 presents its effect on trees. In constituency trees, positional information is inherently encoded in the tree structure. However, this is not the case with dependency trees. Nonetheless, our experiments suggest that for trees, positional encoding is irrelevant information, as the performance drops in all but one case.
We also ran an experiment to see which attention module is best suited as a composition function and report the results in Table 3. As can be seen, in almost all cases, multi-branch attention performs much better than the other two. This gain from multi-branch attention is much more pronounced for CTT than for DTT.
Figure 2 visualizes how our CTT model puts attention on different phrases in a tree to compute the correct sentiment. Space limitations allow only portions of the tree to be visualized. As can be seen, the sentiment is positive (+1) at the root and the model puts more attention on the right branch, as it has all of the positive words, whereas the left branch (NP) is neutral (0). The bottom three trees are the phrases which contain the positive words. The model again puts more attention on the relevant branches. The words 'well' and 'sincere' are inherently positive. In the corpus, the word 'us' is tagged as positive for this sentence.
Figure 2: Attentive tree visualization (CTT). Leaf words shown in the figure: Doug Liman the director of Bourne directs the traffic well gets a nice wintry look from his locations absorbs us with the movie 's spycraft and uses Damon 's ability to be focused and sincere

Conclusion
In this paper, we propose Tree Transformer which successfully encodes natural language grammar trees utilizing the modules designed for the standard Transformer. We show that we can effectively use the attention module as the composition function together with grammar information instead of just bag of words and can achieve performance on par with Tree LSTMs and even better performance than the standard Transformer.