When Are Tree Structures Necessary for Deep Learning of Representations?

Recursive neural models, which use syntactic parse trees to recursively generate representations bottom-up from parse children, are a popular new architecture, promising to capture structural properties like long-distance semantic dependencies. But exactly which tasks this parse-based method is appropriate for remains an open question. In this paper we benchmark recursive neural models against sequential recurrent neural models, which are structured solely on word sequences. We investigate 4 tasks: (1) sentence-level sentiment classification; (2) matching questions to answer-phrases; (3) discourse parsing; (4) computing semantic relations (e.g., component-whole) between nouns. We implement basic, general versions of recursive and recurrent models and apply them to each task. Our analysis suggests that syntax-tree-based recursive models are very helpful for tasks that require representing long-distance relations between words (e.g., semantic relations between nominals), but may not be helpful in other situations, where sequence-based recurrent models can produce equal performance. Our results offer insights on the design of neural architectures for representation learning.

For tasks where the inputs are larger text units (e.g., phrases, sentences or documents), a convolutional model is first needed to aggregate tokens into a vector of fixed dimensionality that can be used as a feature for other NLP tasks. Models for achieving this usually fall into two categories: recurrent models and recursive models. Recurrent models deal successfully with time-series data (Pearlmutter, 1989; Dorffner, 1996) like speech recognition (Robinson et al., 1996; Lippmann, 1989; Graves et al., 2013) or handwriting recognition (Graves and Schmidhuber, 2009; Graves, 2012). They were also applied early on to NLP (Elman, 1990), modeling a sentence as tokens processed sequentially, at each step combining the current token with previously built embeddings. These models generally consider no linguistic structure aside from word order.
Recursive neural models (Williams and Zipser, 1989), by contrast, are structured by syntactic parse trees. Instead of considering tokens sequentially, recursive models combine neighbors based on the recursive structure of the parse tree, starting from the leaves and proceeding recursively in a bottom-up fashion until the root of the tree is reached. Figure 1 shows a recursive neural network building a distributed representation for the phrase the food is delicious, following the operation sequence ((the food) (is delicious)) rather than the sequential order (((the food) is) delicious). Many recursive models have been proposed. One possible advantage of recursive models is their potential for capturing long-distance dependencies: two tokens may be structurally close to each other even though they are far apart in the word sequence. For example, a verb and its corresponding direct object can be far away in terms of tokens if many adjectives lie in between, but they are adjacent in the parse tree (Irsoy and Cardie, 2013). But we don't know if this advantage is truly important, and if so for which tasks, or whether other issues are at play. Indeed, the reliance of recursive models on parsing is also a potential disadvantage, given that parsing is relatively slow, can be error-prone, and parsers are usually quite domain-dependent.
On the other hand, recent progress in multiple subfields of neural NLP has suggested that recurrent nets may be sufficient for many of the tasks for which recursive models have been proposed. Our goal in this paper is thus to investigate a number of tasks in order to understand for which kinds of problems recurrent models are sufficient, and for which kinds recursive models offer specific advantages. We investigate four tasks with different properties. Binary sentiment classification at the sentence level (Pang et al., 2002) and phrase level can help us understand the role of recursive models in dealing with semantic compositionality. Question answer-phrase matching on the UMD-QA dataset (Iyyer et al., 2014) can help us see whether parsing is useful for finding similarities between source question sentences and target answer phrases. Semantic relation classification on the SemEval-2010 data (Hendrickx et al., 2009) tries to understand whether parsing is helpful in dealing with long-distance dependencies, such as relations between two words that are far apart in the sequence. Discourse parsing (on the RST dataset) is useful for measuring the extent to which parsing improves discourse tasks that need to combine the meanings of larger text units.
The principal motivation for this paper is to better understand when, and why, recursive models are needed to outperform simpler models, enforcing apples-to-apples comparisons as much as possible. This paper might therefore intrinsically suffer from the following weaknesses: (1) It applies existing models to existing tasks, offering no novel algorithms, no new tasks, and no new state-of-the-art results. Our goal is rather an analytic one: to investigate different versions of recursive and recurrent models. We therefore pair each recurrent model with a matching recursive one (standard recurrent vs. standard recursive, bidirectional recurrent vs. bidirectional recursive, etc.).
(2) We only explore the most general or basic forms of recursive/recurrent models rather than sophisticated algorithmic variants. This is because fair comparison becomes more and more difficult as models get complex (e.g., in the number of neural layers, the number of hidden units within each layer, etc.). Thus all the neural models employed in this work are comprised of only one layer of neural convolution, despite the fact that deep neural models with multiple layers potentially come with more expressivity and flexibility, and thus better results. Our conclusions might therefore be limited to the algorithms employed in this paper, and it is unclear whether they can be extended to other variants or to the latest state of the art. We return to this question in the final discussion.
The rest of this paper is organized as follows: we detail versions of recursive/recurrent models in Section 2, present the tasks and results in Section 3, and conclude with discussion in Section 4.

Recursive and Recurrent Models
In this section, we introduce the recursive and recurrent models we consider.

Notations
We assume that the text unit S, which could be a phrase, a sentence or a document, is comprised of a sequence of tokens/words: S = {w_1, w_2, ..., w_{N_S}}, where N_S denotes the number of tokens in S. Each word w is associated with a K-dimensional vector embedding e_w = {e_w^1, e_w^2, ..., e_w^K}. The goal of recursive and recurrent models is to map the sequence to a K-dimensional vector e_S, based on its tokens and their corresponding embeddings.
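As a concrete illustration of this notation, the token-to-embedding mapping can be sketched in a few lines of numpy (a toy example with a hypothetical four-word vocabulary and random embeddings, not the embeddings used in the experiments):

```python
import numpy as np

# Hypothetical sketch: map a token sequence S = [w_1, ..., w_{N_S}] to its
# K-dimensional embeddings via a lookup table.
K = 4
vocab = {"the": 0, "food": 1, "is": 2, "delicious": 3}
rng = np.random.default_rng(0)
E = rng.standard_normal((len(vocab), K))  # one K-dim row per word type

def embed(tokens):
    """Return an (N_S, K) matrix of embeddings for the token sequence."""
    return E[[vocab[w] for w in tokens]]

X = embed(["the", "food", "is", "delicious"])
```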

Recurrent Models
Standard Recurrent Models A recurrent network successively takes word w_t at step t, combines its vector representation e_{w_t} with the previously built embedding e_{t-1} from time t-1, calculates the resulting current embedding e_t, and passes it to the next step. The embedding e_t for the current time t is thus:

e_t = f(W · e_{t-1} + V · e_{w_t})

where W and V are K × K dimensional convolutional matrices and f(·) denotes the activation function. Bias vectors are omitted. If N_S denotes the length of the sequence, e_{N_S} represents the whole sequence S.
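For concreteness, the recurrence above can be sketched as follows (a minimal numpy illustration with random, untrained parameters and tanh as the activation f):

```python
import numpy as np

K = 4  # embedding dimensionality
rng = np.random.default_rng(1)
W = rng.standard_normal((K, K)) * 0.1  # combines the previous embedding e_{t-1}
V = rng.standard_normal((K, K)) * 0.1  # combines the current word embedding e_{w_t}

def recurrent_forward(word_embeddings):
    """Compute e_t = f(W e_{t-1} + V e_{w_t}) left to right; return e_{N_S}."""
    e = np.zeros(K)  # e_0
    for e_w in word_embeddings:
        e = np.tanh(W @ e + V @ e_w)
    return e

sequence = rng.standard_normal((5, K))  # five random "word embeddings"
e_S = recurrent_forward(sequence)       # represents the whole sequence S
```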
Bidirectional Recurrent Models (Schuster and Paliwal, 1997) add bidirectionality to the recurrent framework, computing embeddings for each time step both forwardly and backwardly:

e_t^→ = f(W^→ · e_{t-1}^→ + V^→ · e_{w_t})
e_t^← = f(W^← · e_{t+1}^← + V^← · e_{w_t})

Normally, bidirectional models feed the concatenation of the vectors calculated from the two directions, [e_1^←, e_{N_S}^→], into downstream classifiers.

LSTM Models Long short-term memory (LSTM) models extend the recurrent framework with gates that control how much earlier information is preserved at each step:

i_t = σ(W_i · e_{t-1} + V_i · e_{w_t})
f_t = σ(W_f · e_{t-1} + V_f · e_{w_t})
o_t = σ(W_o · e_{t-1} + V_o · e_{w_t})
l_t = f(W_l · e_{t-1} + V_l · e_{w_t})
c_t = f_t · c_{t-1} + i_t · l_t

where σ denotes the sigmoid function, and i_t, f_t and o_t are scalars in [0,1]. The current timestep embedding e_t is then given by:

e_t = o_t · c_t

As mentioned in Section 1, the LSTM models we employ (along with the gate model for recursive models) contain only one layer of convolution.
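A minimal numpy sketch of such a gated recurrence (a toy illustration with scalar gates, as described above, and random untrained parameters; the exact gating variant shown is a common choice, not necessarily the paper's precise formulation):

```python
import numpy as np

K = 4
rng = np.random.default_rng(2)
# One (w, v) vector pair per gate, giving scalar gates in [0, 1].
gates = {g: (rng.standard_normal(K) * 0.1, rng.standard_normal(K) * 0.1)
         for g in ("i", "f", "o")}
W_l = rng.standard_normal((K, K)) * 0.1
V_l = rng.standard_normal((K, K)) * 0.1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(word_embeddings):
    """Run the gated recurrence and return the final embedding e_{N_S}."""
    e = np.zeros(K)
    c = np.zeros(K)  # memory vector
    for e_w in word_embeddings:
        i = sigmoid(gates["i"][0] @ e + gates["i"][1] @ e_w)  # input gate
        f = sigmoid(gates["f"][0] @ e + gates["f"][1] @ e_w)  # forget gate
        o = sigmoid(gates["o"][0] @ e + gates["o"][1] @ e_w)  # output gate
        l = np.tanh(W_l @ e + V_l @ e_w)                      # new content
        c = f * c + i * l
        e = o * c
    return e

sequence = rng.standard_normal((6, K))
e_S = lstm_forward(sequence)
```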
Bidirectional LSTM As in bidirectional recurrent models, Bi-LSTM obtains embeddings forwardly and backwardly and concatenates vectors calculated from both directions for classification.
Recurrent Tensor Network We extend the RNTN (Socher et al., 2013) recursive model to a recurrent version, which enables richer convolution between the earlier-time output e_{t-1} and the current word embedding e_{w_t}. Adapting the recursive RNTN, let e = [e_{t-1}, e_{w_t}] denote the concatenation of e_{t-1} and e_{w_t}; the output embedding e_t is then given by:

e_t = f(e^T · V^[1:K] · e + W · e)    (5)

where V^[1:K] is a 2K × 2K × K tensor whose K slices each produce one dimension of the output, and W is a K × 2K matrix.
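The tensor composition in equation (5) can be sketched as follows (a toy numpy illustration with random parameters; the einsum contracts one 2K × 2K slice of the tensor per output dimension):

```python
import numpy as np

K = 3
rng = np.random.default_rng(3)
T = rng.standard_normal((K, 2 * K, 2 * K)) * 0.1  # the tensor V^[1:K]
W = rng.standard_normal((K, 2 * K)) * 0.1

def rntn_step(e_prev, e_w):
    """e_t = f(e^T V^[1:K] e + W e), with e = [e_{t-1}, e_{w_t}]."""
    e = np.concatenate([e_prev, e_w])            # 2K-dimensional concatenation
    bilinear = np.einsum("i,kij,j->k", e, T, e)  # K bilinear forms
    return np.tanh(bilinear + W @ e)

e_t = rntn_step(np.zeros(K), rng.standard_normal(K))
```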

Recursive Models
Standard Recursive Models Recursive models rely on the structure of parse trees, where each leaf node corresponds to a word in the original sentence; a representation is computed for each parent node from its immediate children, recursively in a bottom-up fashion until the root of the tree is reached. Concretely, for a given node η in the tree with left child η_left (associated with vector representation e_left) and right child η_right (associated with vector representation e_right), the standard recursive network calculates e_η as follows:

e_η = f(W · e_left + V · e_right)

where W and V are K × K convolutional matrices and f(·) denotes the activation function.
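The bottom-up computation can be sketched with a small recursive function (a toy numpy illustration in which trees are nested pairs with integer leaf indices and parameters are random):

```python
import numpy as np

K = 4
rng = np.random.default_rng(4)
W = rng.standard_normal((K, K)) * 0.1  # left-child matrix
V = rng.standard_normal((K, K)) * 0.1  # right-child matrix

def compose(tree, leaf_vecs):
    """Bottom-up composition: a tree is either a leaf index or a
    (left, right) pair, and e_eta = f(W e_left + V e_right)."""
    if isinstance(tree, int):
        return leaf_vecs[tree]
    left, right = tree
    return np.tanh(W @ compose(left, leaf_vecs) + V @ compose(right, leaf_vecs))

# ((the food) (is delicious)) as in Figure 1, with leaves indexed 0..3
leaves = rng.standard_normal((4, K))
root = compose(((0, 1), (2, 3)), leaves)
```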
Bidirectional Recursive Models extend standard recursive models to consider both downward and upward information propagation along the tree (Irsoy and Cardie, 2013). Each node η is associated with two vectors, an upward vector e_η^↑ and a downward vector e_η^↓. e_η^↑ is computed as in standard recursive models, from the node's two immediate children:

e_η^↑ = f(W^↑ · e_left^↑ + V^↑ · e_right^↑)

The downward vector e_η^↓ is computed from the downward vector of the node's parent h(η) and its own upward vector e_η^↑:

e_η^↓ = f(W^↓ · e_{h(η)}^↓ + V^↓ · e_η^↑)

where W^↓, W^↑, V^↓ and V^↑ denote convolutional matrices. Intuitively, e_η^↑ includes evidence from the descendants of η, while e_η^↓ includes information from its ancestors.
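One step of the downward propagation can be sketched as follows (a toy numpy illustration of the second equation above, with random parameters):

```python
import numpy as np

K = 3
rng = np.random.default_rng(6)
W_down = rng.standard_normal((K, K)) * 0.1  # applied to the parent's downward vector
V_down = rng.standard_normal((K, K)) * 0.1  # applied to the node's own upward vector

def downward_step(e_down_parent, e_up_self):
    """e_down_eta = f(W_down e_down_h(eta) + V_down e_up_eta)."""
    return np.tanh(W_down @ e_down_parent + V_down @ e_up_self)

e_down = downward_step(rng.standard_normal(K), rng.standard_normal(K))
```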
Note that bi-directional recursive models are suited for classification at the node level, where both upward and downward information can be considered, but not for sentence-level classification: no downward vector is computed for the root (downward vectors are computed from parents), and concatenating the root embedding (computed upwardly) with leaf embeddings (computed downwardly) would yield vectors of different dimensionality. We therefore omit the bi-directional recursive model for sentence-level classification tasks.

Gate-Based Recursive Models (Gate-Recursive) We adapt the idea of LSTM gating to the recursive model, so that the representation of a node preserves part of its children's information and adds in the newly composed convolutional part. Let r_η and z_η denote the control gates associated with node η, given by:

r_η = σ(W_r · e_left + V_r · e_right)
z_η = σ(W_z · e_left + V_z · e_right)

e_η is then given by a gated combination of the newly composed representation and the children's representations:

e_η = z_η · f(W · (r_η · e_left) + V · (r_η · e_right)) + (1 − z_η) · ½ (e_left + e_right)

Experiments

In this section, we detail our experimental settings and results. We consider 4 tasks, each representative of a different class of NLP tasks. Gradients are calculated with standard backpropagation (Goller and Kuchler, 1996). Parameters to tune include the size of mini-batches, the learning rate, and the L2 penalty. The number of training iterations is also treated as a parameter to tune, and the model achieving the best performance on the development set is used as the final model to be evaluated.

Binary Sentiment Classification (Pang)
The sentiment dataset of Pang et al. (2002) consists of sentences, each with a sentiment label. We divide the original dataset into training (8,101) / dev (500) / test (2,000) splits. No pretraining procedure as described in Socher et al. (2011b) is employed. Word embeddings are initialized using skip-grams and kept fixed during learning; we trained skip-gram embeddings on the Wikipedia+Gigaword dataset using the word2vec package. Sentence-level embeddings are fed into a sigmoid classifier. For bi-directional recurrent models, the concatenation of the embeddings calculated leftwardly and rightwardly is used as the sentence-level vector. As this concatenation doubles the dimensionality of the embedding, a further convolutional operation is used to preserve vector dimensionality:

e_S = f(W_L · [e_1^←, e_{N_S}^→])

where W_L denotes a 2K × K dimensional matrix. Performance for 50- and 250-dimensional vectors is given in Figure 2; no large difference appears between standard recursive and recurrent models, or between RNTN-recursive and RNTN-recurrent models.
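The dimensionality-preserving step and the sigmoid classifier can be sketched as follows (a toy numpy illustration with random parameters; W_L is the 2K × K matrix from the text, and the classifier weights are hypothetical):

```python
import numpy as np

K = 4
rng = np.random.default_rng(5)
W_L = rng.standard_normal((2 * K, K)) * 0.1
w_cls = rng.standard_normal(K)  # sigmoid classifier weights (illustrative)

def sentence_vector(e_left, e_right):
    """Project the 2K-dim concatenation back to K dims with W_L."""
    return np.tanh(np.concatenate([e_left, e_right]) @ W_L)

def p_positive(e_S):
    """Sigmoid classifier over the sentence embedding."""
    return 1.0 / (1.0 + np.exp(-(w_cls @ e_S)))

e_S = sentence_vector(rng.standard_normal(K), rng.standard_normal(K))
p = p_positive(e_S)
```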
Why don't parse trees help on this task? One possible explanation is the distance of the supervision signal from the local compositional structure. The Pang et al. dataset has an average sentence length of 22.5 words, which means it takes multiple steps before sentiment-related evidence comes up to the surface. It is therefore unclear whether local compositional operators (such as negation) can be learned: there is only a small amount of training data, and sentiment supervision given only at the level of the sentence may not be easy to propagate down to deeply buried local phrases.
The question-answering dataset QANTA is comprised of 1,713 history questions (8,149 sentences) and 2,246 literature questions (10,552 sentences), each question paired with an answer (Iyyer et al., 2014); the dataset is available at http://cs.umd.edu/~miyyer/qblearn/. Because the publicly released dataset is smaller than the version used in Iyyer et al. (2014) due to privacy issues, our numbers are not comparable to theirs. Each answer is a token or short phrase. The task is defined as a multi-class classification task that matches a source question with a candidate phrase from a predefined pool of candidate phrases, rather than sequence generation as in standard QA. We give an illustrative example here: Question: He left unfinished a novel whose title character forges his

Recurrent versions of the models minimize the distance between the answer embedding and the embeddings calculated at each timestep of the sequence:

L = Σ_t Σ_z max(0, 1 − c · e_t + z · e_t)

As the concatenation of e_t^← and e_t^→ (or e_t^↑ and e_t^↓) in bi-directional models doubles the dimensionality of the embedding, a further convolutional operation is used to preserve vector dimensionality:

e_t' = f(W_L · [e_t^←, e_t^→])

where W_L denotes a 2K × K dimensional matrix. As demonstrated in Figure 4, we see no great difference between recurrent and recursive models.
The UMD-QA task represents a group of situations in which, because we have insufficient supervision about matching (it is hard to know which node in the parse tree or which timestep provides the most direct evidence for the answer), decisions have to be made by looking at and iterating over all subunits (all nodes in the parse trees, or all timesteps). Similar ideas can be found in pooling structures (e.g., Socher et al. (2011a)). We can draw a similar conclusion: in this task, recursive models look at the relevance to the target of linguistically meaningful units from parse trees, but this does not necessarily lead to better performance than simple sequential units.

For the recursive implementations, we follow the neural framework defined in Socher et al. (2012). The path in the parse tree between the two nominals is retrieved, and its embedding is calculated with the recursive models and fed to a softmax classifier. (Socher et al. (2012) achieve state-of-the-art performance by combining a sophisticated model, MV-RNN, in which each word is represented with both a matrix and a vector, with human-feature engineering.) Retrieved paths are transformed for the recurrent models as shown in Figure 3.

Semantic Relationship Classification
Results are shown in Figure 5. Unlike in the earlier tasks, here recursive models yield much better performance than the corresponding recurrent versions. Even standard recursive models perform better than LSTMs.
These results suggest that it is the need to integrate structures that are far apart in the sentence that characterizes the tasks where recursive models surpass recurrent models. In parse-based models, the two target words are drawn together much earlier in the decision process than in recurrent models, which must remember one target word until the other appears.

Discourse Parsing
Our final task, discourse parsing on the RST-DT corpus (Carlson et al., 2003), is to build a discourse tree for a document by assigning Rhetorical Structure Theory (RST) relations between elementary discourse units (EDUs). Because discourse relations express the coherence structure of discourse, they presumably capture different aspects of compositional meaning than sentiment or nominal relations. A discourse relation classifier takes as input a vector embedding of each of the two EDUs and computes the most likely relation (Li et al., 2014). We compare recursive versus recurrent methods for computing EDU embeddings. See Ji and Eisenstein (2014) and Hernault et al. (2010) for more details on discourse parsing and the RST-DT corpus.
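The relation classifier can be sketched as a softmax over the concatenated EDU embeddings (a toy numpy illustration; the embedding size and the number of relation classes here are arbitrary, and the real RST relation inventory is larger):

```python
import numpy as np

K, n_relations = 4, 3  # illustrative sizes only
rng = np.random.default_rng(7)
W_r = rng.standard_normal((2 * K, n_relations)) * 0.1

def relation_probs(e1, e2):
    """Softmax over relation labels given the two EDU embeddings."""
    scores = np.concatenate([e1, e2]) @ W_r
    scores = np.exp(scores - scores.max())  # numerically stable softmax
    return scores / scores.sum()

p = relation_probs(rng.standard_normal(K), rng.standard_normal(K))
```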
Again, because MV-RNN is difficult to adapt to a recurrent version, we do not employ this state-of-the-art model, adhering only to the general versions of recursive models described in Section 2, since our main goal is to compare equivalent recursive and recurrent models rather than to implement the state of the art. Following Li et al. (2014), a CKY bottom-up algorithm is used to compute the most likely discourse parse tree using dynamic programming. Models are evaluated in terms of three metrics: span (on blank tree structures), nuclearity (on tree structures with nuclearity indication), and relation (on tree structures with rhetorical relation indication but no nuclearity indication). For plain comparison, no additional human-developed features are added.
Results are reported in Table 1. We see no large differences between equivalent recurrent and recursive models. Again, Bi-LSTM achieves the best performance.

Discussions
We compared recursive and recurrent neural models for representation learning on four separate tasks. Our results suggest that recurrent models are sufficient to capture the compositional semantics required for many NLP tasks. But in tasks like semantic relation extraction, in which single headwords need to be associated across a long distance, recursive models shine. This suggests that for the many classes of tasks where long-distance dependencies play a role (such as Chinese-English translation), syntactic structures from recursive models might offer useful power.
Exploring these directions constitutes our future work; for example, our results suggest that improving sequential models to handle such phenomena will require methods for modeling long-distance dependencies.
Of course, it is hard to define exactly what is "fair" in this comparison-oriented paper: we force every model to be trained in exactly the same way (AdaGrad with minibatches, the same set of initializations, etc.). However, it is not necessarily true that this is the optimal way to train every model; different training strategies may be tailored for, and should be adopted by, different models. In that sense, trying to be "fair" in the way we selected in this paper is in fact not being fair.
Additionally, we require that all sequence and tree models have the same structure. Such a strategy may be problematic, since one model might be more amenable to certain algorithmic variants than the other. For example, we only employ a one-layer neural structure for each model, to make fair comparison easier. However, it could be that one of the models is intrinsically better suited to deep (multi-layer) structures than the other. For instance, Tai et al. (2015) found that deep LSTM tree models outperform deep LSTM sequence models on the Stanford Sentiment Treebank task, suggesting that depth may offer different or greater advantages in recursive models. We hope to explore these questions in future work.

Figure 1: Standard recursive neural network calculating a representation for the food is delicious based on the syntactic parse tree.
Recursive Neural Tensor Network (RNTN) The RNTN (Socher et al., 2013) enables richer composition between children embeddings. Let e = [e_left, e_right] denote the concatenation of the two children's vectors. RNTN computes the parent representation e_η as follows:

e_η = f(e^T · V^[1:K] · e + W · e)

where V^[1:K] is a 2K × 2K × K tensor and W is a K × 2K matrix.
father's signature to get out of school and avoids the draft by feigning desire to join. Name this German author of The Magic Mountain and Death in Venice. Answer: Thomas Mann [from the pool of phrases]

The model of Iyyer et al. (2014) minimizes the distances between answer embeddings and node embeddings along the parse tree of the question. Concretely, let c denote the correct answer to question S, with embedding c, and let z denote the embedding of any random wrong answer. The objective function sums, over every node η in the question parse tree, the margin between the answer representations:

L = Σ_{η ∈ parse tree} Σ_z max(0, 1 − c · e_η + z · e_η)    (13)

where e_η denotes the embedding for parse-tree node η calculated by the recursive neural model. Here the parse trees are dependency parses, following Iyyer et al. (2014).
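Objective (13) can be computed directly (a toy numpy illustration; the node and answer embeddings below are small hand-picked vectors, not learned ones):

```python
import numpy as np

def qa_loss(node_embeddings, c, wrong_answers):
    """L = sum over parse-tree nodes eta and wrong answers z of
    max(0, 1 - c.e_eta + z.e_eta), as in equation (13)."""
    loss = 0.0
    for e_eta in node_embeddings:
        for z in wrong_answers:
            loss += max(0.0, 1.0 - c @ e_eta + z @ e_eta)
    return loss

c = np.array([1.0, 0.0])   # embedding of the correct answer
z = np.array([-1.0, 0.0])  # embedding of a wrong answer
aligned = [np.array([2.0, 0.0])]  # node agrees with c: margin satisfied
neutral = [np.array([0.0, 0.0])]  # uninformative node: margin violated
```

On these toy vectors, the aligned node incurs zero loss while the uninformative node incurs the full margin of 1.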

Figure 3: Illustration of Models for Semantic Relationship Classification.

SemEval-2010
Task 8 (Hendrickx et al., 2009) is to find semantic relationships between pairs of nominals; e.g., in "My [apartment]_e1 has a pretty large [kitchen]_e2", the relation between [apartment] and [kitchen] is classified as component-whole. The dataset contains 9 ordered relationships (plus Other), so the task is formalized as a 19-class classification problem, with directed relations treated as separate labels; see Hendrickx et al. (2009) and Socher et al. (2012) for details.

Figure 6: An illustration of discourse parsing. [e_1, e_2, ...] denote EDUs, each consisting of a sequence of tokens. [r_12, r_34, r_56] denote the relationships to be classified. A binary classification model is first used to decide whether two EDUs should be merged, and a multi-class classifier is then used to decide the relationship label.
In each case we followed the protocols described in the original papers. We employed standard training frameworks for the neural models: for each task, we used stochastic gradient descent with AdaGrad (Duchi et al., 2011) and minibatches (Cotter et al., 2011). Parameters are tuned on the development dataset when the original dataset provides one, and by cross-validation otherwise. Derivatives are calculated with standard backpropagation.
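A single AdaGrad update can be sketched as follows (a minimal numpy illustration of the per-parameter learning-rate scaling; the learning rate and toy objective are arbitrary choices for illustration):

```python
import numpy as np

def adagrad_step(theta, grad, hist, lr=0.1, eps=1e-8):
    """One AdaGrad update (Duchi et al., 2011): divide the learning rate,
    per parameter, by the root of the accumulated squared gradients."""
    hist = hist + grad ** 2
    theta = theta - lr * grad / (np.sqrt(hist) + eps)
    return theta, hist

# toy objective f(theta) = ||theta||^2 / 2, whose gradient is theta itself
theta = np.array([1.0, -2.0])
hist = np.zeros_like(theta)
for _ in range(50):
    theta, hist = adagrad_step(theta, theta.copy(), hist)
```

Because the accumulated history only grows, effective step sizes shrink over time, which is why the number of iterations is itself worth tuning on the development set.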

Table 1: Discourse parsing performance of the different EDU models according to the 3 metrics.