On Tree-Based Neural Sentence Modeling

Neural networks with tree-based sentence encoders have shown better results on many downstream tasks. Most existing tree-based encoders adopt syntactic parsing trees as the explicit structure prior. To study the effectiveness of different tree structures, we replace the parsing trees with trivial trees (i.e., binary balanced tree, left-branching tree and right-branching tree) in the encoders. Though trivial trees contain no syntactic information, those encoders get competitive or even better results on all of the ten downstream tasks we investigated. This surprising result indicates that explicit syntax guidance may not be the main contributor to the superior performance of tree-based neural sentence modeling. Further analysis shows that tree modeling gives better results when crucial words are closer to the final representation. Additional experiments give more clues on how to design an effective tree-based encoder. Our code is open-source and available at https://github.com/ExplorerFreda/TreeEnc.


Introduction
Sentence modeling is a crucial problem in natural language processing (NLP). Recurrent neural networks with long short term memory (Hochreiter and Schmidhuber, 1997) or gated recurrent units (Cho et al., 2014) are commonly used sentence modeling approaches. These models embed sentences into a vector space and the resulting vectors can be used for classification or sequence generation in the downstream tasks.
In addition to the plain sequence of hidden units, recent work on sequence modeling proposes to impose a tree structure in the encoder (Socher et al., 2013; Tai et al., 2015; Zhu et al., 2015). These tree-based LSTMs introduce the syntax tree as an intuitive structure prior for sentence modeling. They have already obtained promising results in many NLP tasks, such as natural language inference (Bowman et al., 2016; Chen et al., 2017c) and machine translation (Eriguchi et al., 2016; Chen et al., 2017a,b; Zhou et al., 2017). Li et al. (2015) empirically conclude that syntax tree-based sentence modeling is effective for tasks requiring relatively long-term context features.
On the other hand, some works propose to abandon the syntax tree and adopt a latent tree for sentence modeling (Choi et al., 2018; Maillard et al., 2017; Williams et al., 2018). Such latent trees are learned directly from the downstream task with reinforcement learning (Williams, 1992) or Gumbel softmax (Jang et al., 2017; Maddison et al., 2017). However, Williams et al. (2018) empirically show that Gumbel softmax produces unstable latent trees under the same hyper-parameters but different initializations, while reinforcement learning (Williams et al., 2018) even tends to generate left-branching trees. Neither method yields syntactically meaningful latent trees, yet each still obtains considerable improvements in performance. This indicates that syntax may not be the main contributor to the performance gains.
With the above observation, we raise the following questions: What really matters in tree-based sentence modeling? If tree structures are necessary for encoding sentences, what contributes most to the improvement in downstream tasks? We attempt to investigate the driving force behind the improvement obtained by latent trees without syntax.
In this paper, we empirically study the effectiveness of tree structures in sentence modeling. We compare the performance of bi-LSTM and five tree LSTM encoders with different tree layouts, including the syntax tree, the latent tree (from Gumbel softmax) and three kinds of designed trivial trees (binary balanced tree, left-branching tree and right-branching tree). Experiments are conducted on 10 different tasks, which are grouped into three categories, namely single sentence classification (5 tasks), sentence relation classification (2 tasks), and sentence generation (3 tasks). These tasks depend on different granularities of features, and the comparison among them can help us learn more about the results. We repeat all the experiments 5 times and take the average to reduce the instability caused by random initialization of deep learning models.
We reach the following conclusions:
• Tree structures are helpful to sentence modeling on classification tasks, especially for tasks that need global (long-term) context features, which is consistent with previous findings (Li et al., 2015).
• Trivial trees outperform syntactic trees, indicating that syntax may not be the main contributor to the gains of tree encoding, at least on the ten tasks we investigate.
• Further experiments show that, given strong priors, tree-based methods give better results when crucial words are closer to the final representation. If structure priors are unavailable, the balanced tree is a good choice, as it makes the path distances between each word and the sentence encoding roughly equal, in which case tree encoding can more easily learn the crucial words by itself.

Experimental Framework
We show the encoder-classifier/decoder framework applied to each group of tasks in Figure 1. Our framework has two main components: the encoder part and the classifier/decoder part. In general, models encode a sentence into a fixed-length vector, and then apply the vector as the feature for classification or generation. We fix the structure of the classifier/decoder, and use five different types of tree structures for the encoder part:
• Parsing tree. We use the binary constituency tree as the representative, which is widely used in natural language inference (Bowman et al., 2016) and machine translation (Eriguchi et al., 2016; Chen et al., 2017a). Dependency parsing trees (Zhou et al., 2015, 2016a) are not considered in this paper.

Figure 1: The encoder-classifier/decoder framework for three different groups of tasks; panel (c) shows the Siamese encoder-classifier framework for sentence relation classification. We apply a multi-layer perceptron (MLP) for classification, and left-to-right decoders for generation in all experiments.
• Binary balanced tree. To construct a binary balanced tree, we recursively divide a group of $n$ leaves into two contiguous groups of sizes $\lceil n/2 \rceil$ and $\lfloor n/2 \rfloor$, until each group has only one leaf node left.
• Gumbel trees, which are produced by straightforward Gumbel softmax models (Choi et al., 2018). Note that Gumbel trees are not stable for a given sentence (Williams et al., 2018), and we only draw one sample among all of them.
• Left-branching trees. We combine two nodes from left to right to construct a left-branching tree, which is similar to those generated by the REINFORCE-based RL-SPINN model (Williams et al., 2018).
• Right-branching trees. In contrast to left-branching ones, nodes are combined from right to left to form a right-branching tree.
We show an intuitive view of the five types of tree structures in Figure 2. In addition, existing works (Choi et al., 2018; Williams et al., 2018) show that using hidden states of bidirectional RNNs as leaf node representations (bi-leaf-RNN) instead of word embeddings may improve the performance of tree LSTMs, as leaf RNNs help encode context information more completely. Our framework also supports leaf RNNs for tree LSTMs.
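The three trivial tree layouts described above can be sketched in a few lines. This is a minimal illustration rather than the released implementation: the nested-tuple tree representation and the function names (`balanced_tree`, `left_branching_tree`, `right_branching_tree`, `leaf_depths`) are assumptions made for this example, and a non-empty word list is assumed.

```python
from typing import List, Union

# A leaf is a word; an internal node is a (left, right) pair (illustrative choice).
Tree = Union[str, tuple]


def balanced_tree(words: List[str]) -> Tree:
    """Recursively split n leaves into contiguous groups of ceil(n/2) and floor(n/2)."""
    n = len(words)
    if n == 1:
        return words[0]
    mid = (n + 1) // 2  # ceil(n / 2) leaves go to the left child
    return (balanced_tree(words[:mid]), balanced_tree(words[mid:]))


def left_branching_tree(words: List[str]) -> Tree:
    """Combine nodes from left to right: ((w1, w2), w3), ..."""
    tree: Tree = words[0]
    for word in words[1:]:
        tree = (tree, word)
    return tree


def right_branching_tree(words: List[str]) -> Tree:
    """Combine nodes from right to left: (w1, (w2, (w3, ...)))."""
    tree: Tree = words[-1]
    for word in reversed(words[:-1]):
        tree = (word, tree)
    return tree


def leaf_depths(tree: Tree, depth: int = 0) -> List[int]:
    """Distance from the root to each leaf, listed left to right."""
    if isinstance(tree, str):
        return [depth]
    left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)


sentence = "I love my pet cat .".split()
print(balanced_tree(sentence))
print(leaf_depths(left_branching_tree(sentence)))
print(leaf_depths(right_branching_tree(sentence)))
```

The leaf depths make the structural difference visible: in the right-branching tree the leftmost words sit closest to the root, in the left-branching tree the rightmost words do, and in the balanced tree all depths are nearly equal.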

Description of Investigated Tasks
We conduct experiments on 10 different tasks, which are grouped into 3 categories, namely single sentence classification (5 tasks), sentence relation classification (2 tasks), and sentence generation (3 tasks). Each of the tasks is compatible with the encoder-classifier/decoder framework shown in Figure 1. These tasks cover a wide range of NLP applications, and depend on different granularities of features.
Note that the datasets may use articles or paragraphs as instances, some of which consist of only one sentence. For each dataset, we only pick the subset of single-sentence instances for our experiments, and the detailed meta-data is in Table 1.

Sentence Classification
First, we introduce four existing text classification datasets: AG's News, Amazon Review Polarity, Amazon Review Full and DBpedia. Additionally, noticing that parsing trees were shown to be effective (Li et al., 2015) on the task of word-level semantic relation classification (Hendrickx et al., 2009), we also add this dataset to our selection.
AG's News (AGN). Each sample in this dataset is an article, associated with a label indicating its topic: world, sports, business or sci/tech.
Amazon Review Polarity (ARP). The Amazon Review dataset is obtained from the Stanford Network Analysis Project (SNAP; McAuley and Leskovec, 2013). It collects a large number of product reviews as paragraphs, each associated with a star rating from 1 (most negative) to 5 (most positive). In this dataset, 3-star reviews are dropped, while the others are classified into two groups: positive (4 or 5 stars) and negative (1 or 2 stars).
Amazon Review Full (ARF). Similar to the ARP dataset, the ARF dataset is also collected from Amazon product reviews. Labels in this dataset are integers from 1 to 5.
DBpedia. DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia (Lehmann et al., 2015). Fourteen non-overlapping classes from DBpedia 2014 are selected to construct this dataset. Each sample is given by the title and abstract of the Wikipedia article, associated with the class label.

Word-Level Semantic Relation (WSR)
SemEval-2010 Task 8 (Hendrickx et al., 2009) is to find semantic relationships between pairs of nominals. Each sample is a sentence in which two nominals are explicitly indicated, associated with a manually labeled semantic relation between the two nominals. For example, the sentence "My [apartment]$_{e_1}$ has a pretty large [kitchen]$_{e_2}$." has the label component-whole($e_2$, $e_1$). Instead of retrieving the path between the two nominals (Li et al., 2015; Socher et al., 2013), we feed the entire sentence together with the nominal indicators (i.e., tags of $e_1$ and $e_2$) as words to the framework. We also ignore the order of $e_1$ and $e_2$ in the labels given by the dataset. Thus, this task becomes a 10-way classification task.
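The input and label handling described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the indicator tokens `<e1>`, `</e1>`, `<e2>`, `</e2>`, the span convention (start inclusive, end exclusive) and the function names are hypothetical choices for this example, not the tags used in the released code.

```python
import re
from typing import List, Tuple


def mark_nominals(tokens: List[str], e1: Tuple[int, int], e2: Tuple[int, int]) -> List[str]:
    """Wrap the token spans e1 and e2 (start inclusive, end exclusive) with indicator tokens."""
    out: List[str] = []
    for i, tok in enumerate(tokens):
        if i == e1[0]:
            out.append("<e1>")  # hypothetical tag token, fed to the encoder as a word
        if i == e2[0]:
            out.append("<e2>")
        out.append(tok)
        if i == e1[1] - 1:
            out.append("</e1>")
        if i == e2[1] - 1:
            out.append("</e2>")
    return out


def undirected_label(label: str) -> str:
    """Drop the argument order, e.g. 'Component-Whole(e2,e1)' becomes 'Component-Whole'."""
    return re.sub(r"\(e\d,\s*e\d\)$", "", label.strip())


tokens = "My apartment has a pretty large kitchen .".split()
print(mark_nominals(tokens, e1=(1, 2), e2=(6, 7)))
print(undirected_label("Component-Whole(e2,e1)"))
```

Collapsing the two directions of each relation is what reduces the label set to ten classes: the nine relation types of the dataset plus the "Other" class.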

Sentence Relation Classification
To evaluate how well a model can capture semantic relation between sentences, we introduce the second group of tasks: sentence relation classification.
Natural Language Inference (NLI). The Stanford Natural Language Inference (SNLI) Corpus (Bowman et al., 2015) is a challenging dataset for sentence-level textual entailment. It has 550K training sentence pairs, as well as 10K for development and 10K for test. Each pair consists of two related sentences, associated with a label that is one of entailment, contradiction and neutral.

Conjunction Prediction (Conj). Information about the coherence relation between two sentences is sometimes explicitly apparent in the text (Miltsakaki et al., 2004): this is the case whenever the second sentence starts with a conjunction phrase. Jernite et al. (2017) propose a method to create a conjunction prediction dataset from an unlabeled corpus. They create a list of phrases, which can be classified into nine types, as conjunction indicators. The objective of this task is to recover the conjunction type given two sentences, which can be used to evaluate how well a model captures the semantic meaning of sentences. We apply the method proposed by Jernite et al. (2017) to the Wikipedia corpus to create our Conj dataset.

Sentence Generation
We also include sentence generation tasks in our experiments, to investigate the representation ability of different encoders over global (long-term) context features. Note that our framework is based on encoding, which is different from attention-based approaches.

Paraphrasing (Para). The Quora Question Pair Dataset is widely used to evaluate paraphrasing models (Li et al., 2017b).¹ In this work, we treat the paraphrasing task as a sequence-to-sequence one, and evaluate on it with our sentence generation framework.
Machine Translation (MT). Machine translation, especially across language families, is a complex task that requires models to capture the semantic meaning of sentences well. We use a large, challenging English-Chinese sentence translation task for this investigation, which has been adopted by a variety of neural translation work (Tu et al., 2016; Li et al., 2017a; Chen et al., 2017a). We extract the parallel data
Auto-Encoding (AE). We extract the English part of the machine translation dataset to form an auto-encoding task, which is also compatible with our encoder-decoder framework.
¹ https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

Experiments
In this section, we present our experimental results and analysis. Section 4.1 introduces our setup for all the experiments. Section 4.2 shows the main results and analysis on ten downstream tasks grouped into three classes, which cover a wide range of NLP applications. Given that trivial tree-based LSTMs perform the best among all models, we draw two hypotheses: i) the right-branching tree benefits a lot from strong structural priors; ii) the balanced tree wins because it treats all words fairly, so that crucial information can be more easily learned by the LSTM gates automatically. We test the hypotheses in Section 4.3. Finally, we compare the performance of linear and tree LSTMs with three widely applied pooling mechanisms in Section 4.4.
Table 2: Test results for different encoder architectures trained by a unified encoder-classifier/decoder framework. We report accuracy (×100) for classification tasks, and BLEU score (Papineni et al., 2002; word-level for English targets and char-level for Chinese targets) for generation tasks. Larger is better for both metrics. The best number(s) for each task are in bold. In addition, the average sentence length (in words) of each dataset is given in the last row, underlined.

Set-up
In experiments, we fix the structure of the classifier as a two-layer MLP with ReLU activation, and the structure of the decoder as a GRU-based recurrent neural network (Cho et al., 2014). The hidden-layer size of the MLP is fixed to 1024, while that of the GRU is adapted from the size of the sentence encoding. We initialize the word embeddings with 300-dimensional GloVe (Pennington et al., 2014) vectors. We apply a 300-dimensional bidirectional (600-dimensional in total) LSTM as the leaf RNN when necessary. We use the Adam (Kingma and Ba, 2015) optimizer to train all the models, with a learning rate of 1e-3 and a batch size of 64. In the training stage, we drop the samples in which either the source sentence or the target sentence is longer than 64. We do not apply any regularization or dropout in any experiment except the WSR task, on which we tune dropout with respect to the development set. We generate the binary parsing trees for the datasets without parsing trees using ZPar (Zhang and Clark, 2011). More details are summarized in the supplementary materials.

Main Results
In this subsection, we compare the results from different encoders. We do not include any attention (Lin et al., 2017) or pooling (Collobert and Weston, 2008; Socher et al., 2011; Zhou et al., 2016b) mechanism here, in order to avoid distractions and make the encoder structure the dominant factor. We will further analyze pooling mechanisms in Section 4.4. Table 2 presents the performance of different encoders on a variety of downstream tasks, which leads to the following observations:
Tree encoders are useful on some tasks. We reach the same conclusion as Li et al. (2015) that tree-based encoders perform better on tasks requiring long-term context features. Setting aside the linear-structured left-branching and right-branching tree encoders, we find that tree-based encoders generally perform better than Bi-LSTMs on the tasks of sentence relation and sentence generation, which may require relatively more long-term context features to obtain better performance. However, the improvements of tree encoders on NLI and Para are relatively small, possibly because the sentences in these two tasks are shorter than in the others, so the tree encoder gains little advantage in capturing long-term context.
Trivial tree encoders outperform other encoders. Surprisingly, the binary balanced tree encoder obtains the best results on most classification tasks, and the right-branching tree encoder tends to be the best on sentence generation. Note that the binary balanced tree and the right-branching tree are only trivial tree structures, yet they outperform syntactic tree and latent tree encoders. The latent tree is quite competitive on some tasks, as its structure is directly tuned by the corresponding tasks. However, it only beats the binary balanced tree by very small margins on NLI and ARP. We will analyze this in Section 4.3.
A larger number of parameters is not the only reason for the improvements. Table 2 shows that tree encoders benefit a lot from adding a leaf-LSTM, which brings not only sentence-level information to leaf nodes, but also more parameters than the bi-LSTM encoder. However, the left-branching tree LSTM has a structure quite similar to that of a linear LSTM, and it can be viewed as a linear LSTM-on-LSTM structure. It has the same number of parameters as other tree-based encoders, but still falls behind the balanced tree encoder on most of the tasks. This indicates that a larger number of parameters is at least not the only reason for binary balanced tree LSTM encoders to improve over bi-LSTMs.

Why Do Trivial Trees Work Better?
The binary balanced tree and the right-branching tree are trivial structures that hardly contain any syntactic information. In this section, we analyze in depth why these trees achieve high scores.

Right Branching Tree Benefits from Strong Structural Prior
We argue that right-branching trees benefit from their strong structural prior. In sentence generation tasks, models generate sentences from left to right, which makes words on the left of the source sentence more important (Sutskever et al., 2014). If the encoder fails to memorize the left words, the information about the right words would not help due to error propagation. In right-branching trees, the left words of the sentence are closer to the final representation, which makes them easier to memorize; we call this a structure prior. Conversely, in left-branching trees, the right words of the sentence are closer to the representation.
To validate our hypothesis, we propose to visualize the Jacobian as word-level saliency (Shi et al., 2018), which can be viewed as the contribution of each word to the sentence encoding:
$$J(\mathbf{s}, \mathbf{w}) = \|\nabla(\mathbf{s}, \mathbf{w})\|_1 = \sum_{i,j} \left| \frac{\partial s_i}{\partial w_j} \right|,$$
where $\mathbf{s} = (s_1, s_2, \cdots, s_p)^T$ denotes the embedding of a sentence, and $\mathbf{w} = (w_1, w_2, \cdots, w_q)^T$ denotes the embedding of a word. We can compute the saliency score using backward propagation. For a word in a sentence, a higher saliency score means a larger contribution to the sentence encoding. We present the visualization in Figure 3 using the visualization tool from Lin et al. (2017). It shows that right-branching tree LSTM encoders tend to look at the left part of the sentence, which is very helpful to the final generation performance, as left words are more crucial. Balanced trees also have this feature, and we think this is because the balanced tree treats these words fairly, so that crucial information can be more easily learned by the LSTM gates automatically.
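One way to make the saliency computation concrete is sketched below, assuming the score of a word is the sum of absolute values of the Jacobian entries $\partial s_i / \partial w_j$. The sketch approximates the Jacobian with central finite differences instead of backward propagation so that it runs without a deep learning library; `saliency_scores` and `toy_encoder` are hypothetical names, and the toy encoder is only a stand-in that weights left words more heavily, not a tree LSTM.

```python
from typing import Callable, List

Vector = List[float]


def saliency_scores(encode: Callable[[List[Vector]], Vector],
                    words: List[Vector], eps: float = 1e-5) -> List[float]:
    """Approximate sum_{i,j} |d s_i / d w_j| for every word by central differences."""
    scores = []
    for k, word in enumerate(words):
        total = 0.0
        for j in range(len(word)):
            plus = [list(w) for w in words]
            minus = [list(w) for w in words]
            plus[k][j] += eps   # perturb one coordinate of word k upwards
            minus[k][j] -= eps  # and downwards
            s_plus, s_minus = encode(plus), encode(minus)
            total += sum(abs(a - b) / (2 * eps) for a, b in zip(s_plus, s_minus))
        scores.append(total)
    return scores


def toy_encoder(words: List[Vector]) -> Vector:
    """Hypothetical stand-in encoder: word k is weighted by 0.5 ** k, so left words dominate."""
    dim = len(words[0])
    return [sum((0.5 ** k) * w[i] for k, w in enumerate(words)) for i in range(dim)]


embeddings = [[0.1, -0.3], [0.7, 0.2], [-0.4, 0.5]]
scores = saliency_scores(toy_encoder, embeddings)
print(scores)  # saliency decreases from the first word to the last
```

With a real encoder, the same quantity would normally be obtained from automatic differentiation; the finite-difference version is only meant to show what is being measured.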
However, bi-LSTM and left-branching tree LSTM also pay much attention to words on the right (especially the last two words), which may be caused by the short path from the right words to the root representation in the two corresponding structures.
Additionally, Table 3 shows that models trained with the same hyper-parameters but different initializations have strong agreement with each other.
[Figure 3: Saliency visualization of words for two example sentences, repeated for each encoder. The example sentences are "the standing committee 's training work and informationization work has also been strengthened in varying degrees ." and "maintaining the overall situation of stability , taking the improvement of people 's standard of living as the basic starting point , and allowing people to continuously reap the benefits of reform and development - these are the cornerstones of lasting peace and stability in the nation and an inexhaustible motive force for reform and opening up ."]

Table 3: Mean average Pearson correlation (×100) across five models trained with the same hyper-parameters, reported per model on MT and AE. For each test sentence, we compute the saliency scores of words. The cross-model Pearson correlation shows the agreement of two models on one sentence, and the average Pearson correlation is computed over all sentences. We report the mean average Pearson correlation of the 5 × 4 model pairs.
Thus, "looking at the first words" is a stable behavior of balanced and right-branching tree LSTM encoders in sentence generation tasks. So is "looking at the first and the last words" for Bi-LSTMs and left-branching tree LSTMs.
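The agreement statistic reported in Table 3 can be sketched as follows. This is an illustrative reading of the caption, not the authors' script: the data layout `saliency[m][t]` (scores of model `m` on test sentence `t`) and the function names are assumptions, and each sentence is assumed to have non-constant saliency scores so that the correlation is defined.

```python
from itertools import permutations
from math import sqrt
from typing import List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_average_pearson(saliency: List[List[List[float]]]) -> float:
    """saliency[m][t] holds the word saliency scores of model m on test sentence t."""
    pair_averages = []
    # Ordered model pairs: 5 x 4 of them when five models are compared.
    for a, b in permutations(range(len(saliency)), 2):
        per_sentence = [pearson(sa, sb) for sa, sb in zip(saliency[a], saliency[b])]
        pair_averages.append(sum(per_sentence) / len(per_sentence))
    return sum(pair_averages) / len(pair_averages)


# Three toy "models", each scored on a single three-word sentence.
toy = [[[1.0, 2.0, 3.0]], [[2.0, 4.0, 6.0]], [[3.0, 2.0, 1.0]]]
print(100 * mean_average_pearson(toy))  # scaled by 100, as in the table
```

Because the correlation is symmetric, averaging over ordered pairs gives the same value as averaging over unordered pairs; the ordered form simply mirrors the 5 × 4 count in the caption.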

Binary Balanced Tree Benefits from Shallowness
Compared to syntactic and latent trees, the only advantage of the balanced tree we can hypothesize is that it is shallower and more balanced than the others. Shallowness may lead to shorter paths for information propagation from the leaves to the root representation, and makes representation learning easier by reducing the errors accumulated in the propagation process. Balance makes the tree treat all leaf nodes fairly, which makes it easier to automatically select the crucial information among all words in a sentence.
To test our hypothesis, we conduct the following experiments. We select three tasks on which the binary balanced tree encoder beats Bi-LSTMs by a large margin (WSR, MT and AE). We generate random binary trees for sentences, while controlling the depth using a hyper-parameter $\rho$. We start with a group containing all words (nodes) in the sentence. At each step, we split a group of $n$ nodes into two contiguous groups of sizes $(\lceil n/2 \rceil, \lfloor n/2 \rfloor)$ with probability $\rho$, and into groups of sizes $(n - 1, 1)$ with probability $1 - \rho$. Trees generated with $\rho = 0$ are exactly left-branching trees, and those generated with $\rho = 1$ are binary balanced trees. The expected node depth of the tree becomes smaller as $\rho$ varies from 0 to 1. Figure 4 shows that, in general, trees with shallower node depth perform better on all three tasks (for binary trees, shallower also means more balanced), which supports our hypothesis that the binary balanced tree benefits from its shallow and balanced structure.
Figure 5: Length-performance lines for the further investigated tasks. We divide test instances into several groups by length, and report the performance on each group respectively. Sentences with length in [1, 8] are put in the first group, and group $i$ ($i \geq 2$) covers the length range [4i + 1, 4i + 4].
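The depth-controlled random tree procedure described above can be sketched as follows. The nested-tuple representation, the function names and the explicit `random.Random` argument are assumptions made for this example; the sampling rule itself follows the text (balanced split with probability rho, otherwise an (n - 1, 1) split).

```python
import random
from typing import List, Union

# A leaf is a word; an internal node is a (left, right) pair (illustrative choice).
Tree = Union[str, tuple]


def random_tree(words: List[str], rho: float, rng: random.Random) -> Tree:
    """Sample a binary tree whose expected depth shrinks as rho goes from 0 to 1."""
    n = len(words)
    if n == 1:
        return words[0]
    if rng.random() < rho:
        split = (n + 1) // 2  # balanced split: (ceil(n/2), floor(n/2))
    else:
        split = n - 1         # skewed split: (n - 1, 1)
    return (random_tree(words[:split], rho, rng), random_tree(words[split:], rho, rng))


def leaf_depths(tree: Tree, depth: int = 0) -> List[int]:
    """Distance from the root to each leaf, listed left to right."""
    if isinstance(tree, str):
        return [depth]
    left, right = tree
    return leaf_depths(left, depth + 1) + leaf_depths(right, depth + 1)


def mean_leaf_depth(tree: Tree) -> float:
    depths = leaf_depths(tree)
    return sum(depths) / len(depths)


rng = random.Random(0)
words = list("abcdef")
for rho in (0.0, 0.5, 1.0):
    tree = random_tree(words, rho, rng)
    print(rho, tree, mean_leaf_depth(tree))
```

At the two extremes the sampler is deterministic: rho = 0 always yields the left-branching tree and rho = 1 always yields the binary balanced tree, which matches the endpoints stated in the text.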
Additionally, Figure 5 demonstrates that binary balanced trees work especially well on relatively long sentences. As expected, on short-sentence groups, the performance gap between Bi-LSTM and binary balanced tree LSTM is not obvious, while it grows as the test sentences become longer. This explains why tree-based encoders give small improvements on NLI and Para: sentences in these two tasks are much shorter than in the others.
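The length grouping behind Figure 5 can be written down directly. The sketch below assumes the grouping rule from the figure caption (lengths 1 to 8 in the first group, and group i for i of at least 2 covering lengths 4i + 1 to 4i + 4); `length_group` and `accuracy_by_group` are hypothetical helper names, and accuracy stands in for whichever metric a task uses.

```python
from collections import defaultdict
from typing import Dict, List


def length_group(length: int) -> int:
    """Group 1 covers lengths 1-8; group i >= 2 covers [4i + 1, 4i + 4]."""
    if length < 1:
        raise ValueError("sentence length must be positive")
    return max(1, (length - 1) // 4)


def accuracy_by_group(lengths: List[int], correct: List[bool]) -> Dict[int, float]:
    """Per-group accuracy for test instances bucketed by sentence length."""
    hits: Dict[int, int] = defaultdict(int)
    totals: Dict[int, int] = defaultdict(int)
    for n, ok in zip(lengths, correct):
        group = length_group(n)
        totals[group] += 1
        hits[group] += int(ok)
    return {g: hits[g] / totals[g] for g in sorted(totals)}


print(accuracy_by_group([3, 8, 9, 12, 13], [True, False, True, True, False]))
```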

Can Pooling Replace Tree Encoder?
Max pooling (Collobert and Weston, 2008), mean pooling (Conneau et al., 2017) and self-attentive pooling (also known as self-attention; Santos et al., 2016; Lin et al., 2017) are three popular and efficient choices to improve sentence encoding. In this part, we compare the performance of tree LSTMs and bi-LSTM on the tasks of WSR, MT and AE with each pooling mechanism, aiming to demonstrate the role that pooling plays in sentence modeling, and to test whether tree encoders can be replaced by pooling.
As shown in Figure 6, for linear LSTMs, we apply the pooling mechanism to all hidden states; for tree LSTMs, pooling is applied to all hidden states and leaf states. Implementation details are summarized in the supplementary materials. Table 4 shows that max and attentive pooling improve all the structures on the task of WSR, but all the pooling mechanisms fail on MT and AE, which require the encoding to capture the complete information of sentences, whereas pooling may lose information along the way. The result indicates that, although pooling is efficient on some tasks, it cannot fully recover the advantages brought by tree structures. Additionally, we think the attention mechanism shares the benefits of balanced tree modeling: it also treats all words fairly and learns the crucial parts automatically. The paths from the representation to the words in attention are even shorter than in the balanced tree. Thus, the fact that attentive pooling outperforms balanced trees on WSR is not surprising to us.
Table 4: Performance of tree and linear-structured encoders with or without pooling, on the selected three tasks. We report accuracy (×100) for WSR, char-level BLEU for MT and word-level BLEU for AE. All of the tree models have bidirectional leaf RNNs (BiLRNN). The best number(s) for each task are in bold. The up and down arrows indicate the increase or decrease brought by each pooling mechanism, relative to the baseline of the pure tree-based encoder with the same structure.
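The three pooling mechanisms compared here can be sketched over a list of hidden states as follows. This is a simplified illustration: the paper defers its implementation details to the supplementary materials, so the dot-product scoring against a single `query` vector in `attentive_pooling` is an assumption made for this example, and all function names are hypothetical.

```python
from math import exp
from typing import List

Vector = List[float]


def max_pooling(states: List[Vector]) -> Vector:
    """Element-wise maximum over all states."""
    return [max(column) for column in zip(*states)]


def mean_pooling(states: List[Vector]) -> Vector:
    """Element-wise average over all states."""
    return [sum(column) / len(states) for column in zip(*states)]


def attentive_pooling(states: List[Vector], query: Vector) -> Vector:
    """Softmax-weighted sum of states; dot-product scoring with `query` is an assumption."""
    logits = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [exp(logit - peak) for logit in logits]
    total = sum(weights)
    dim = len(states[0])
    return [sum(w * h[i] for w, h in zip(weights, states)) / total for i in range(dim)]


hidden = [[1.0, 2.0], [3.0, 0.0]]
print(max_pooling(hidden))
print(mean_pooling(hidden))
print(attentive_pooling(hidden, query=[1.0, 0.0]))
```

In all three cases every state is one step away from the pooled vector, which is the sense in which the path from the representation to each word is shorter than in a balanced tree.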

Discussions
The balanced tree for sentence modeling has been explored by Munkhdalai and Yu (2017) and Williams et al. (2018) in natural language inference (NLI). However, Munkhdalai and Yu (2017) focus on designing inter-attention on trees, instead of comparing the balanced tree with other linguistic trees in the same setting. Williams et al. (2018) do compare balanced trees with latent trees, but the balanced tree does not outperform the latent one in their experiments, which is consistent with ours. As analyzed in Section 4.2, sentences in NLI are too short for the balanced tree to show its advantage. It has also been argued that LSTMs work because of the gates' ability to compute an element-wise weighted sum. In that case, the tree LSTM can also be regarded as a special case of attention, especially for balanced-tree modeling, which also automatically selects the crucial information from all word representations. Kim et al. (2017) propose tree-structured attention networks, which combine the benefits of tree modeling and attention, and the tree structures in their model are also learned rather than taken from syntax trees.
Although binary parsing trees do not produce better numbers than trivial trees on many downstream tasks, it is worth noting that we are not claiming that parsing trees are useless; they are intuitively reasonable for human language understanding. A recent work (Blevins et al., 2018) shows that RNN sentence encodings directly learned from downstream tasks can capture implicit syntax information. Their interesting result may explain why explicit syntactic guidance does not work for tree LSTMs. In summary, we still believe in the potential of linguistic features to improve neural sentence modeling, and we hope our investigation can inform future exploration of more effective tree-based encoders.

Conclusions
In this work, we empirically investigate what contributes most to tree-based neural sentence encoding. We find that trivial trees without syntax surprisingly give better results than the syntax tree and the latent tree. Further analysis indicates that the balanced tree gains from its shallow and balanced properties compared to other trees, and that the right-branching tree benefits from its strong structural prior under the setting of a left-to-right decoder.