Top-down Tree Long Short-Term Memory Networks

Long Short-Term Memory (LSTM) networks, a type of recurrent neural network with a more complex computational unit, have been successfully applied to a variety of sequence modeling tasks. In this paper we develop Tree Long Short-Term Memory (TreeLSTM), a neural network model based on LSTM, which is designed to predict a tree rather than a linear sequence. TreeLSTM defines the probability of a sentence by estimating the generation probability of its dependency tree. At each time step, a node is generated based on the representation of the generated sub-tree. We further enhance the modeling power of TreeLSTM by explicitly representing the correlations between left and right dependents. Application of our model to the MSR sentence completion challenge achieves results beyond the current state of the art. We also report results on dependency parsing reranking achieving competitive performance.


Introduction
Neural language models have been gaining increasing attention as a competitive alternative to n-grams. The main idea is to represent each word using a real-valued feature vector capturing the contexts in which it occurs. The conditional probability of the next word is then modeled as a smooth function of the feature vectors of the preceding words and the next word. In essence, similar representations are learned for words found in similar contexts, resulting in similar predictions for the next word. Previous approaches have mainly employed feed-forward (Bengio et al., 2003; Mnih and Hinton, 2007) and recurrent neural networks (Mikolov et al., 2010; Mikolov, 2012) in order to map the feature vectors of the context words to the distribution for the next word. Recently, RNNs with Long Short-Term Memory (LSTM) units (Hochreiter and Schmidhuber, 1997; Hochreiter, 1998) have emerged as a popular architecture due to their strong ability to capture long-term dependencies. LSTMs have been successfully applied to a variety of tasks ranging from machine translation to speech recognition (Graves et al., 2013) and image description generation.
Despite superior performance in many applications, neural language models essentially predict sequences of words. Many NLP tasks, however, exploit syntactic information operating over tree structures (e.g., dependency or constituent trees). In this paper we develop a novel neural network model which combines the advantages of the LSTM architecture and syntactic structure. Our model estimates the probability of a sentence by estimating the generation probability of its dependency tree. Instead of explicitly encoding tree structure as a set of features, we use four LSTM networks to model four types of dependency edges which altogether specify how the tree is built. At each time step, one LSTM is activated which predicts the next word conditioned on the sub-tree generated so far. To learn the representations of the conditioned sub-tree, we force the four LSTMs to share their hidden layers. Our model is also capable of generating trees just by sampling from a trained model and can be seamlessly integrated with text generation applications.
Our approach is related to but ultimately different from recursive neural networks (Pollack, 1990), a class of models which operate on structured inputs. Given a (binary) parse tree, they recursively generate parent representations in a bottom-up fashion, by combining tokens to produce representations for phrases, and eventually the whole sentence. The learned representations can then be used in classification tasks such as sentiment analysis (Socher et al., 2011b) and paraphrase detection (Socher et al., 2011a). Tai et al. (2015) learn distributed representations over syntactic trees by generalizing the LSTM architecture to tree-structured network topologies.
The key feature of our model is not so much that it can learn semantic representations of phrases or sentences, but its ability to predict tree structure and estimate its probability.
Syntactic language models have a long history in NLP dating back to Chelba and Jelinek (2000) (see also Roark (2001) and Charniak (2001)). These models differ in how grammar structures in a parse tree are used when predicting the next word. Other work develops dependency-based language models for specific applications such as machine translation (Shen et al., 2008; Zhang, 2009; Sennrich, 2015), speech recognition (Chelba et al., 1997) or sentence completion (Gubbins and Vlachos, 2013). All instances of these models apply Markov assumptions on the dependency tree, and adopt standard n-gram smoothing methods for reliable parameter estimation. Emami et al. (2003) and Sennrich (2015) estimate the parameters of a structured language model using feed-forward neural networks (Bengio et al., 2003). Mirowski and Vlachos (2015) re-implement the model of Gubbins and Vlachos (2013) with RNNs. They view sentences as sequences of words over a tree. While they ignore the tree structures themselves, we model them explicitly.
Our model shares with other structure-based language models the ability to take dependency information into account. It differs in the following respects: (a) it does not artificially restrict the depth of the dependencies it considers and can thus be viewed as an infinite order dependency language model; (b) it not only estimates the probability of a string but is also capable of generating dependency trees; (c) finally, contrary to previous dependency-based language models which encode syntactic information as features, our model takes tree structure into account more directly by representing different types of dependency edges explicitly using LSTMs. Therefore, there is no need to manually determine which dependency tree features should be used or how large the feature embeddings should be. We evaluate our model on the MSR sentence completion challenge, a benchmark language modeling dataset. Our results outperform the best published results on this dataset. Since our model is a general tree estimator, we also use it to rerank the top K dependency trees from the (second order) MSTParser and obtain performance on par with recently proposed dependency parsers.

Tree Long Short-Term Memory Networks
We seek to estimate the probability of a sentence by estimating the generation probability of its dependency tree. Syntactic information in our model is represented in the form of dependency paths. In the following, we first describe our definition of dependency path and based on it explain how the probability of a sentence is estimated.

Dependency Path
Generally speaking, a dependency path is the path between ROOT and $w$, consisting of the nodes on the path and the edges connecting them. To represent dependency paths, we introduce four types of edges which essentially define the "shape" of a dependency tree. Let $w_0$ denote a node in a tree and $w_1, w_2, \ldots, w_n$ its left dependents. As shown in Figure 1, a LEFT edge is the edge between $w_0$ and its first left dependent, denoted as $(w_0, w_1)$. Let $w_k$ (with $1 < k \le n$) denote a non-first left dependent of $w_0$. The edge from $w_{k-1}$ to $w_k$ is a NX-LEFT edge (NX stands for NEXT), where $w_{k-1}$ is the right adjacent sibling of $w_k$. Note that the NX-LEFT edge $(w_{k-1}, w_k)$ replaces edge $(w_0, w_k)$ (illustrated with a dashed line in Figure 1) in the original dependency tree. The modification allows information to flow from $w_0$ to $w_k$ through $w_1, \ldots, w_{k-1}$ rather than directly from $w_0$ to $w_k$. RIGHT and NX-RIGHT edges are defined analogously for right dependents.
Given these four types of edges, dependency paths (denoted as $D(w)$) can be defined as follows, bearing in mind that the first right dependent of ROOT is its only dependent and that $w_p$ denotes the parent of $w$. We use $(\ldots)$ to denote a sequence, where $()$ is an empty sequence and $\|$ is an operator for concatenating two sequences.
(1) if $w$ is ROOT, then $D(w) = ()$
(2) if $w$ is a left dependent of $w_p$
    (a) if $w$ is the first left dependent, then $D(w) = D(w_p) \,\|\, (\langle w_p, \text{LEFT} \rangle)$
    (b) if $w$ is not the first left dependent and $w_s$ is its right adjacent sibling, then $D(w) = D(w_s) \,\|\, (\langle w_s, \text{NX-LEFT} \rangle)$
(3) if $w$ is a right dependent of $w_p$
    (a) if $w$ is the first right dependent, then $D(w) = D(w_p) \,\|\, (\langle w_p, \text{RIGHT} \rangle)$
    (b) if $w$ is not the first right dependent and $w_s$ is its left adjacent sibling, then $D(w) = D(w_s) \,\|\, (\langle w_s, \text{NX-RIGHT} \rangle)$
A dependency tree can be represented by the set of its dependency paths, which in turn can be used to reconstruct the original tree.¹ Dependency paths for the first two levels of the tree in Figure 2 are as follows (ignoring for the moment the subscripts, which we explain in the next section): $D(\text{sold}) = (\langle \text{ROOT}, \text{RIGHT} \rangle)$ (see definitions (1) and (3a)) and $D(\text{cars}) = D(\text{sold}) \,\|\, (\langle \text{sold}, \text{RIGHT} \rangle)$ (see (3a)); the right dependent of sold that follows cars has the path $D(\text{cars}) \,\|\, (\langle \text{cars}, \text{NX-RIGHT} \rangle)$ (according to (3b)).
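The recursive definition of dependency paths can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `dependency_paths`, the head-index encoding of the tree (position 0 is ROOT) and the toy sentence are ours, not part of the paper's released code.

```python
def dependency_paths(words, heads):
    """Compute D(w) for every token of a projective dependency tree.

    words[i] is the token at position i, with words[0] == "ROOT".
    heads[i] is the position of the head of token i (heads[0] is None).
    Both the encoding and the function name are illustrative assumptions.
    """
    n = len(words)
    left = {i: [] for i in range(n)}
    right = {i: [] for i in range(n)}
    for i in range(1, n):
        (left if i < heads[i] else right)[heads[i]].append(i)
    for h in range(n):
        left[h].sort(reverse=True)  # closest left dependent first
        right[h].sort()             # closest right dependent first

    paths = {0: ()}                 # rule (1): D(ROOT) = ()

    def visit(p):
        for deps, first, nxt in ((left[p], "LEFT", "NX-LEFT"),
                                 (right[p], "RIGHT", "NX-RIGHT")):
            for k, w in enumerate(deps):
                if k == 0:          # first dependent: extend the parent's path
                    paths[w] = paths[p] + ((words[p], first),)
                else:               # later dependent: extend the sibling's path
                    s = deps[k - 1]
                    paths[w] = paths[s] + ((words[s], nxt),)
        for w in left[p] + right[p]:
            visit(w)

    visit(0)
    return paths


# Toy tree (hypothetical): "the manufacturer sold cars", headed by "sold".
toy_words = ["ROOT", "the", "manufacturer", "sold", "cars"]
toy_heads = [None, 2, 3, 0, 3]
toy_paths = dependency_paths(toy_words, toy_heads)
```

In this toy tree the path of "cars" is the sequence (ROOT, RIGHT), (sold, RIGHT), which is the context the text later describes as being used to predict cars.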

Tree Probability
The core problem in syntax-based language modeling is to estimate the probability of sentence $S$ given its corresponding tree $T$, $P(S|T)$. We view the probability computation of a dependency tree as a generation process. Specifically, we assume dependency trees are constructed top-down, in a breadth-first manner. Generation starts at the ROOT node. For each node at each level, first its left dependents are generated from closest to farthest and then its right dependents (again from closest to farthest). The same process is then applied to the next node at the same level or to a node at the next level. Figure 2 shows the breadth-first traversal of a dependency tree.
¹ Throughout this paper we assume all dependency trees are projective.
Under the assumption that each word $w$ in a dependency tree is only conditioned on its dependency path, the probability of a sentence $S$ given its dependency tree $T$ is:
\[ P(S|T) = \prod_{w \in \mathrm{BFS}(T) \setminus \mathrm{ROOT}} P(w \mid D(w)) \tag{1} \]
where $D(w)$ is the dependency path of $w$. Note that each word $w$ is visited according to its breadth-first search order ($\mathrm{BFS}(T)$) and the probability of ROOT is ignored since every tree has one. The role of ROOT in a dependency tree is the same as that of the beginning-of-sentence token (BOS) in a sentence. When computing $P(S|T)$ (or $P(S)$), the probability of ROOT (or BOS) is ignored (we assume it always exists), but ROOT is used to predict other words. We explain in the next section how TREELSTM estimates $P(w|D(w))$.
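The generation order and the resulting tree probability can be illustrated as follows. This is a minimal sketch: `generation_order`, `tree_log_prob` and the head-index encoding are hypothetical names of ours, `cond_prob` stands in for a trained model's $P(w|D(w))$, and the order in which nodes of the same level are expanded (the order in which they were generated) is our assumption.

```python
import math
from collections import deque


def generation_order(heads):
    """Breadth-first generation order of token positions (ROOT = 0 excluded).

    heads[i] is the head position of token i; heads[0] is None.
    For each node, left dependents come first (closest to farthest),
    then right dependents (closest to farthest).
    """
    n = len(heads)
    left = {i: [] for i in range(n)}
    right = {i: [] for i in range(n)}
    for i in range(1, n):
        (left if i < heads[i] else right)[heads[i]].append(i)

    order, queue = [], deque([0])
    while queue:
        p = queue.popleft()
        deps = sorted(left[p], reverse=True) + sorted(right[p])
        order.extend(deps)   # these tokens are generated now
        queue.extend(deps)   # and expanded later, level by level
    return order


def tree_log_prob(heads, cond_prob):
    """log P(S|T): sum of log P(w | D(w)) over all non-ROOT tokens.

    cond_prob(position) is a stand-in for the model's conditional probability.
    """
    return sum(math.log(cond_prob(w)) for w in generation_order(heads))
```

Working in log space avoids numerical underflow when the product runs over long sentences.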

Tree LSTMs
A dependency path $D(w)$ is a subtree which we denote as a sequence of $\langle$word, edge-type$\rangle$ tuples. Our model uses four LSTMs (GEN-L, GEN-R, GEN-NX-L and GEN-NX-R), one for each edge type, with tied $W_e$ and tied $W_{ho}$ (see Figure 3). At each time step, an LSTM is chosen according to an edge type; the LSTM then takes a word as input and predicts/generates its dependent or sibling. This process can also be viewed as adding an edge and a node to a tree. Specifically, LSTMs GEN-L and GEN-R are used to generate the first left and right dependent of a node ($w_1$ and $w_4$ in Figure 3). So, these two LSTMs are responsible for going deeper in a tree. GEN-NX-L and GEN-NX-R, in contrast, generate the remaining left/right dependents and therefore go wider in a tree. As shown in Figure 3, $w_2$ and $w_3$ are generated by GEN-NX-L, whereas $w_5$ and $w_6$ are generated by GEN-NX-R. Note that the model can handle any number of left or right dependents by applying GEN-NX-L or GEN-NX-R multiple times.
We assume time steps correspond to the steps taken by the breadth-first traversal of the dependency tree and that the sentence has length $n$. At time step $t$ ($1 \le t \le n$), let $\langle w_{t'}, z_t \rangle$ denote the last tuple in $D(w_t)$. Subscripts $t$ and $t'$ denote the breadth-first search order of $w_t$ and $w_{t'}$, respectively. $z_t \in \{\text{LEFT}, \text{RIGHT}, \text{NX-LEFT}, \text{NX-RIGHT}\}$ is the edge type (see the definitions in Section 2.1). Let $W_e \in \mathbb{R}^{s \times |V|}$ denote the word embedding matrix and $W_{ho} \in \mathbb{R}^{|V| \times d}$ the output matrix of our model, where $|V|$ is the vocabulary size, $s$ the word embedding size and $d$ the hidden unit size. We use tied $W_e$ and tied $W_{ho}$ for the four LSTMs to reduce the number of parameters in our model. The four LSTMs also share their hidden states. Let $H \in \mathbb{R}^{d \times (n+1)}$ denote the shared hidden states of all time steps and $e(w_t)$ the one-hot vector of $w_t$. Then, $H[:, t]$ represents $D(w_t)$ at time step $t$, and the computation² is:
\begin{align}
x_t &= W_e \cdot e(w_{t'}) \tag{2a} \\
h_t &= \mathrm{LSTM}^{z_t}(x_t, H[:, t']) \tag{2b} \\
H[:, t] &= h_t \tag{2c}
\end{align}
where the initial hidden state $H[:, 0]$ is initialized to a vector of small values such as 0.01. According to Equation (2b), the model selects an LSTM based on edge type $z_t$. We describe the details of $\mathrm{LSTM}^{z_t}$ in the next paragraph. The probability of $w_t$ given its dependency path $D(w_t)$ is estimated by a softmax function:
\[ P(w_t \mid D(w_t)) = \frac{\exp(W_{ho}[w_t, :] \cdot h_t)}{\sum_{k'=1}^{|V|} \exp(W_{ho}[k', :] \cdot h_t)} \tag{3} \]
We must point out that although we use four jointly trained LSTMs to encode the hidden states, the training and inference complexity of our model is no different from that of a regular LSTM, since at each time step only one LSTM is active.
We implement $\mathrm{LSTM}^{z}$ in Equation (2b) using a deep LSTM (to simplify notation, from now on we write $z$ instead of $z_t$). The inputs at time step $t$ are $x_t$ and $h_{t'}$ (the hidden state of an earlier time step $t'$) and the output is $h_t$ (the hidden state of the current time step). Let $L$ denote the number of layers of $\mathrm{LSTM}^{z}$ and $\hat{h}^l_t$ the internal hidden state of the $l$-th layer of $\mathrm{LSTM}^{z}$ at time step $t$, where $x_t$ is $\hat{h}^0_t$ and $h_{t'}$ is $\hat{h}^L_{t'}$.
The LSTM architecture introduces multiplicative gates and memory cells $\hat{c}^l_t$ (at the $l$-th layer) in order to address the vanishing gradient problem, which makes it difficult for the standard RNN model to learn long-distance correlations in a sequence. Here, $\hat{c}^l_t$ is a linear combination of the current input signal $u_t$ and an earlier memory cell $\hat{c}^l_{t'}$. How much input information $u_t$ will flow into $\hat{c}^l_t$ is controlled by input gate $i_t$, and how much of the earlier memory cell $\hat{c}^l_{t'}$ will be forgotten is controlled by forget gate $f_t$. This process is computed as follows:
\begin{align}
u_t &= \tanh(W^{z,l}_{ux} \cdot \hat{h}^{l-1}_t + W^{z,l}_{uh} \cdot \hat{h}^{l}_{t'}) \tag{4a} \\
i_t &= \sigma(W^{z,l}_{ix} \cdot \hat{h}^{l-1}_t + W^{z,l}_{ih} \cdot \hat{h}^{l}_{t'}) \tag{4b} \\
f_t &= \sigma(W^{z,l}_{fx} \cdot \hat{h}^{l-1}_t + W^{z,l}_{fh} \cdot \hat{h}^{l}_{t'}) \tag{4c} \\
\hat{c}^l_t &= f_t \odot \hat{c}^l_{t'} + i_t \odot u_t \tag{4d}
\end{align}
where $W^{z,l}_{ux} \in \mathbb{R}^{d \times d}$ ($W^{z,l}_{ux} \in \mathbb{R}^{d \times s}$ when $l = 1$) and $W^{z,l}_{uh} \in \mathbb{R}^{d \times d}$ are weight matrices for $u_t$, $W^{z,l}_{ix}$ and $W^{z,l}_{ih}$ are weight matrices for $i_t$, and $W^{z,l}_{fx}$ and $W^{z,l}_{fh}$ are weight matrices for $f_t$. $\sigma$ is a sigmoid function and $\odot$ the element-wise product. Output gate $o_t$ controls how much information of the cell $\hat{c}^l_t$ can be seen by other modules:
\begin{align}
o_t &= \sigma(W^{z,l}_{ox} \cdot \hat{h}^{l-1}_t + W^{z,l}_{oh} \cdot \hat{h}^{l}_{t'}) \tag{5a} \\
\hat{h}^l_t &= o_t \odot \tanh(\hat{c}^l_t) \tag{5b}
\end{align}
Application of the above process to all $L$ layers will yield $\hat{h}^L_t$, which is $h_t$. Note that in implementation, all $\hat{c}^l_t$ and $\hat{h}^l_t$ ($1 \le l \le L$) at time step $t$ are stored, although we only care about $\hat{h}^L_t$ ($h_t$).
² We ignore all bias terms for notational simplicity.
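To make the gated update and the edge-type-based selection concrete, here is a minimal pure-Python sketch of a single LSTM layer step and of a TreeLSTM step that reads the shared state at the earlier position $t'$. It assumes the standard LSTM gate equations (tanh input signal, sigmoid gates, biases omitted as in the text). All names (`lstm_layer_step`, `treelstm_step`, `init_params`, the parameter keys) are hypothetical, and storing memory cells in a list `C` alongside `H` is our implementation choice.

```python
import math
import random


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lstm_layer_step(params, x, h_prev, c_prev):
    """One layer of the gated update, biases omitted.

    x is the layer input; (h_prev, c_prev) is the state at the earlier step t'.
    params maps "ux", "uh", "ix", "ih", "fx", "fh", "ox", "oh" to matrices.
    """
    def pre(gate):
        wx = matvec(params[gate + "x"], x)
        wh = matvec(params[gate + "h"], h_prev)
        return [a + b for a, b in zip(wx, wh)]

    u = [math.tanh(a) for a in pre("u")]   # input signal
    i = [sigmoid(a) for a in pre("i")]     # input gate
    f = [sigmoid(a) for a in pre("f")]     # forget gate
    o = [sigmoid(a) for a in pre("o")]     # output gate
    c = [fj * cj + ij * uj for fj, cj, ij, uj in zip(f, c_prev, i, u)]
    h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h, c


def init_params(d, s, rng):
    """Uniform initialisation in [-0.1, 0.1] for a first-layer LSTM."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {g + k: mat(d, s if k == "x" else d) for g in "uifo" for k in "xh"}


def treelstm_step(lstms, z, x, H, C, t, t_prev):
    """Pick the LSTM for edge type z, read state t', write state t."""
    H[t], C[t] = lstm_layer_step(lstms[z], x, H[t_prev], C[t_prev])
    return H[t]


rng = random.Random(0)
d, s = 4, 2
lstms = {z: init_params(d, s, rng) for z in ("LEFT", "RIGHT", "NX-LEFT", "NX-RIGHT")}
H = [[0.01] * d, None]   # H[0] is the initial hidden state
C = [[0.0] * d, None]    # initial memory cell: an assumption, not stated in the text
h1 = treelstm_step(lstms, "RIGHT", [0.5, -0.5], H, C, t=1, t_prev=0)
```

Because only one of the four parameter sets is touched per step, the cost of a step matches that of an ordinary single LSTM, which is the point made in the text.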

Left Dependent Tree LSTMs
TREELSTM computes $P(w|D(w))$ based on the dependency path $D(w)$, which ignores the interaction between left and right dependents on the same level. In many cases, TREELSTM will use a verb to predict its object directly without knowing its subject. For example, in Figure 2, TREELSTM uses $\langle \text{ROOT}, \text{RIGHT} \rangle$ and $\langle \text{sold}, \text{RIGHT} \rangle$ to predict cars. This information is unfortunately not specific to cars (many things can be sold, e.g., chocolates, candy). Considering manufacturer, the left dependent of sold, would help predict cars more accurately.
In order to jointly take left and right dependents into account, we employ yet another LSTM, which goes from the furthest left dependent to the closest left dependent (LD is a shorthand for left dependent). As shown in Figure 4, LD LSTM learns the representation of all left dependents of a node w 0 ; this representation is then used to predict the first right dependent of the same node. Non-first right dependents can also leverage the representation of left dependents, since this information is injected into the hidden state of the first right dependent and can percolate all the way. Note that in order to retain the generation capability of our model (Section 3.4), we only allow right dependents to leverage left dependents (they are generated before right dependents).
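The computation of the LDTREELSTM is almost the same as in TREELSTM except when $z_t$ = GEN-R. In this case, let $v_t$ be the corresponding left dependent sequence with length $K$ ($v_t = (w_3, w_2, w_1)$ in Figure 4), and let $v_{t,k}$ denote its $k$-th element. Then, the hidden state ($q_k$) of $v_t$ at each time step $k$ is:
\begin{align}
m_k &= W_e \cdot e(v_{t,k}) \\
q_k &= \mathrm{LSTM}^{\mathrm{LD}}(m_k, q_{k-1})
\end{align}
where $q_K$ is the representation for all left dependents. Then, the computation of the current hidden state becomes (see Equation (2) for the original computation):
\begin{align}
r_t &= \begin{bmatrix} W_e \cdot e(w_{t'}) \\ q_K \end{bmatrix} \\
h_t &= \mathrm{LSTM}^{\text{GEN-R}}(r_t, H[:, t'])
\end{align}
where $q_K$ serves as additional input for $\mathrm{LSTM}^{\text{GEN-R}}$. All other computational details are the same as in TREELSTM (see Section 2.3).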

Model Training
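On small scale datasets we employ Negative Log-likelihood (NLL) as our training objective for both TREELSTM and LDTREELSTM:
\[ \mathcal{L}^{\mathrm{NLL}}(\theta) = -\frac{1}{|\mathcal{S}|} \sum_{S \in \mathcal{S}} \log P(S \mid T) \]
where $S$ is a sentence in the training set $\mathcal{S}$, $T$ is the dependency tree of $S$ and $P(S|T)$ is defined as in Equation (1).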
On large scale datasets (e.g., with a vocabulary size of 65K), computing the output layer activations and the softmax function with NLL would become prohibitively expensive. Instead, we employ Noise Contrastive Estimation (NCE; Gutmann and Hyvärinen, 2012; Mnih and Teh, 2012), which treats the normalization term $\hat{Z}$ in $\hat{P}(w \mid D(w_t)) = \frac{\exp(W_{ho}[w,:] \cdot h_t)}{\hat{Z}}$ as constant. The intuition behind NCE is to discriminate between samples from a data distribution $\hat{P}(w \mid D(w_t))$ and a known noise distribution $P_n(w)$ via binary logistic regression. Assuming that noise words are $k$ times more frequent than real words in the training set (Mnih and Teh, 2012), the probability of a word $w$ being from our model is
\[ P(\text{true} \mid w, D(w_t)) = \frac{\hat{P}(w \mid D(w_t))}{\hat{P}(w \mid D(w_t)) + k P_n(w)} \]
and $P(\text{false} \mid w, D(w_t)) = 1 - P(\text{true} \mid w, D(w_t))$. We apply NCE to large vocabulary models with the following training objective:
\[ \mathcal{L}^{\mathrm{NCE}}(\theta) = -\frac{1}{|\mathcal{S}|} \sum_{T \in \mathcal{S}} \sum_{t=1}^{|T|} \Big( \log P(\text{true} \mid w_t, D(w_t)) + \sum_{j=1}^{k} \log P(\text{false} \mid \tilde{w}_{t,j}, D(w_t)) \Big) \]
where $\tilde{w}_{t,j}$ is a word sampled from the noise distribution $P_n(w)$. We use smoothed unigram frequencies (exponentiating by 0.75) as the noise distribution $P_n(w)$ (Mikolov et al., 2013b). We initialize $\ln \hat{Z} = 9$ as suggested in Chen et al. (2015), but instead of keeping it fixed we also learn $\hat{Z}$ during training (Vaswani et al., 2013). We set $k = 20$.
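The NCE quantities for a single prediction step can be sketched as below. This is an illustration only: `noise_distribution` and `nce_step_loss` are hypothetical names, scores stand for the unnormalised values $W_{ho}[w,:] \cdot h_t$, and gradients and the learning of $\ln \hat{Z}$ are not shown.

```python
import math


def noise_distribution(counts, power=0.75):
    """Smoothed unigram noise distribution: counts raised to `power`, renormalised."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()}


def nce_step_loss(score_true, pn_true, noise_scores, pn_noise, k=20, log_z=9.0):
    """Negative NCE objective for one time step.

    score_true:   unnormalised score of the observed word
    pn_true:      noise probability P_n of the observed word
    noise_scores: unnormalised scores of the k sampled noise words
    pn_noise:     noise probabilities P_n of those sampled words
    log_z:        ln of the normalisation term, treated as a constant here
    """
    def p_model(score):
        return math.exp(score - log_z)

    pm = p_model(score_true)
    loss = -math.log(pm / (pm + k * pn_true))       # log P(true | observed word)
    for score, pn in zip(noise_scores, pn_noise):
        pm = p_model(score)
        loss -= math.log(k * pn / (pm + k * pn))    # log P(false | noise word)
    return loss
```

Each step touches only the observed word and the $k$ noise samples, so the cost no longer grows with the vocabulary size.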

Experiments
We assess the performance of our model on two tasks: the Microsoft Research (MSR) sentence completion challenge and dependency parsing reranking. We also demonstrate the tree generation capability of our models. In the following, we first present details on model training and then present our results. We implemented our models using the Torch library (Collobert et al., 2011) and our code is available at https://github.com/XingxingZhang/td-treelstm.

Training Details
We trained our model with back-propagation through time (Rumelhart et al., 1988) on an Nvidia GPU card with a mini-batch size of 64. The objective (NLL or NCE) was minimized by stochastic gradient descent. Model parameters were uniformly initialized in [−0.1, 0.1]. We used the NCE objective on the MSR sentence completion task (due to the large size of this dataset) and the NLL objective on dependency parsing reranking. We used an initial learning rate of 1.0 for all experiments; when there was no significant improvement in log-likelihood on the validation set, the learning rate was divided by 2 per epoch until convergence (Mikolov et al., 2010). To alleviate the exploding gradients problem, we rescaled the gradient $g$ when the gradient norm $\|g\| > 5$ and set $g = \frac{5g}{\|g\|}$ (Pascanu et al., 2013). Dropout (Srivastava et al., 2014) was applied to the 2-layer TREELSTM and LDTREELSTM models. The word embedding size was set to $s = d/2$, where $d$ is the hidden unit size.
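The gradient rescaling rule and the learning rate schedule can be sketched as follows. The names `clip_gradient` and `update_learning_rate` are hypothetical, and the threshold `min_gain` for a "significant" improvement is an assumption, since the text does not quantify it.

```python
import math


def clip_gradient(g, max_norm=5.0):
    """Rescale g to norm max_norm when its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(x * x for x in g))
    if norm > max_norm:
        return [max_norm * x / norm for x in g]
    return list(g)


def update_learning_rate(lr, decaying, prev_ll, curr_ll, min_gain=1e-3):
    """Halve the learning rate every epoch once validation log-likelihood stalls.

    decaying records whether halving has already started.
    min_gain is an assumed threshold for a "significant" improvement.
    Returns the learning rate for the next epoch and the updated flag.
    """
    if not decaying and curr_ll - prev_ll < min_gain:
        decaying = True
    return (lr / 2.0 if decaying else lr), decaying
```

Once halving starts it continues every epoch, which is our reading of "divided by 2 per epoch until convergence".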

Microsoft Sentence Completion Challenge
The task in the MSR Sentence Completion Challenge is to select the correct missing word for 1,040 SAT-style test sentences when presented with five candidate completions. The training set contains 522 novels from Project Gutenberg, which we preprocessed as follows. After removing headers and footers from the files, we tokenized and parsed the dataset into dependency trees with the Stanford CoreNLP toolkit. The resulting training set contained 49M words. We converted all words to lower case and replaced those occurring five times or less with UNK. The resulting vocabulary size was 65,346 words. We randomly sampled 4,000 sentences from the training set as our validation set.
The literature describes two main approaches to the sentence completion task, based on word vectors and on language models. In vector-based approaches, all words in the sentence and the five candidate words are represented by vectors; the candidate which has the highest average similarity with the sentence words is selected as the answer. For language model-based methods, the LM computes the probability of a test sentence with each of the five candidate words, and picks the candidate completion which gives the highest probability. Our model belongs to this class of models.
Table 1 presents a summary of our results together with previously published results. The best performing word vector model is IVLBL (Mnih and Kavukcuoglu, 2013) with an accuracy of 55.5, while the best performing single language model is LBL (Mnih and Teh, 2012) with an accuracy of 54.7. Both approaches are based on the log-bilinear language model (Mnih and Hinton, 2007). A combination of several recurrent neural networks and the skip-gram model holds the state of the art with an accuracy of 58.9 (Mikolov et al., 2013b). To fairly compare with existing models, we restrict the layer size of our models to 1. We observe that LDTREELSTM consistently outperforms TREELSTM, which indicates the importance of modeling the interaction between left and right dependents. In fact, LDTREELSTM (d = 400) achieves a new state of the art on this task, despite being a single model. We also implement LSTM and bidirectional LSTM language models. An LSTM with d = 400 outperforms its smaller counterpart (d = 300); however, performance decreases with d = 450. The bidirectional LSTM is worse than the LSTM (see Mnih and Teh (2012) for a similar observation). The best performing LSTM is worse than an LDTREELSTM (d = 300). The input and output embeddings ($W_e$ and $W_{ho}$) dominate the number of parameters in all neural models except for RNNME, depRNN+3gram and ldepRNN+4gram, which include an ME model that contains 1 billion sparse n-gram features (Mikolov, 2012; Mirowski and Vlachos, 2015). The number of parameters in TREELSTM and LDTREELSTM is not much larger than in the LSTM, due to the tied $W_e$ and $W_{ho}$ matrices.
Table 2: Results for the NN and S-LSTM parsers are from Chen and Manning (2014) and Dyer et al. (2015), respectively. * indicates that the model is initialized with pre-trained word vectors.
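The language model-based approach to sentence completion reduces to an argmax over candidate fillers, as the following sketch shows. `pick_completion` and the blank marker are hypothetical. `sentence_log_prob` stands in for any scorer; for a tree-based model it would have to parse each filled-in sentence before scoring it, which is an assumption about the pipeline rather than something this section spells out.

```python
def pick_completion(tokens, candidates, sentence_log_prob, blank="___"):
    """Return the candidate whose filled-in sentence scores highest.

    tokens:            the test sentence as a list, with `blank` marking the gap
    candidates:        the candidate words
    sentence_log_prob: callable mapping a token list to a log-probability
    """
    def fill(candidate):
        return [candidate if tok == blank else tok for tok in tokens]

    return max(candidates, key=lambda c: sentence_log_prob(fill(c)))
```

Accuracy on the benchmark is then the fraction of test sentences for which the returned candidate is the correct one.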

Dependency Parsing
In this section we demonstrate that our model can also be used for parse reranking. This is not possible for sequence-based language models since they cannot estimate the probability of a tree. We use our models to rerank the top K dependency trees produced by the second order MSTParser (McDonald and Pereira, 2006).⁴ We follow closely the experimental setup of Chen and Manning (2014) and Dyer et al. (2015). Specifically, we trained TREELSTM and LDTREELSTM on Penn Treebank sections 2-21. We used section 22 for development and section 23 for testing. We adopted the Stanford basic dependency representations (De Marneffe et al., 2006); part-of-speech tags were predicted with the Stanford Tagger (Toutanova et al., 2003). We trained TREELSTM and LDTREELSTM as language models (singletons were replaced with UNK) and did not use any POS tags, dependency labels or composition features, whereas these features are used in Chen and Manning (2014) and Dyer et al. (2015). We tuned d, the number of layers, and K on the development set. Table 2 reports unlabeled attachment scores (UAS) and labeled attachment scores (LAS) for the MSTParser, TREELSTM (d = 300, 1 layer, K = 2), and LDTREELSTM (d = 200, 2 layers, K = 4). We also include the performance of two neural network-based dependency parsers: Chen and Manning (2014) use a neural network classifier to predict the correct transition (NN parser); Dyer et al. (2015) also implement a transition-based dependency parser, using LSTMs to represent the contents of the stack and buffer in a continuous space. As can be seen, both TREELSTM and LDTREELSTM outperform the baseline MSTParser, with LDTREELSTM performing best. We also initialized the word embedding matrix $W_e$ with pre-trained GLOVE vectors (Pennington et al., 2014). We obtained a slight improvement over TREELSTM (TREELSTM* in Table 2; d = 200, 2 layers, K = 4) but no improvement over LDTREELSTM.
Finally, notice that LDTREELSTM is slightly better than the NN parser in terms of UAS but worse than the S-LSTM parser. In the future, we would like to extend our model so that it takes labeled dependency information into account.
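Reranking and the unlabeled attachment score can be sketched as follows. The sketch assumes the reranker simply keeps the candidate tree with the highest model score; whether the base parser's own scores are also taken into account is not specified here, so that choice is ours. `rerank` and `uas` are hypothetical names, and trees are encoded as lists of head positions, one per word.

```python
def rerank(candidate_trees, tree_log_prob):
    """Return the candidate tree with the highest log-probability under the model."""
    return max(candidate_trees, key=tree_log_prob)


def uas(gold_heads, pred_heads):
    """Unlabeled attachment score: percentage of words with the correct head."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return 100.0 * correct / len(gold_heads)
```

With K = 1 the reranker returns the base parser's output unchanged, which is why K has to be tuned on development data.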

Tree Generation
This section demonstrates how to use a trained LDTREELSTM to generate tree samples. The generation starts at the ROOT node. At each time step $t$, for each node $w_t$, we add a new edge and node to the tree. Unfortunately, during generation we do not know which type of edge to add. We therefore use four binary classifiers (ADD-LEFT, ADD-RIGHT, ADD-NX-LEFT and ADD-NX-RIGHT) to predict whether we should add a LEFT, RIGHT, NX-LEFT or NX-RIGHT edge.⁵ Then when a classifier predicts true, we use the corresponding LSTM to generate a new node by sampling from the predicted word distribution in Equation (3). The four classifiers take the previous hidden state $H[:, t']$ and the output embedding of the current node $W_{ho}^{\top} \cdot e(w_t)$ as features.⁶ Specifically, we use a trained LDTREELSTM to go through the training corpus and generate hidden states and embeddings as input features; the corresponding class labels (true and false) are "read off" the training dependency trees. We use two-layer rectifier networks (Glorot et al., 2011) as the four classifiers, with a hidden size of 300. We use the same LDTREELSTM model as in Section 3.3 to generate dependency trees. The classifiers were trained using AdaGrad (Duchi et al., 2011) with a learning rate of 0.01. The accuracies of ADD-LEFT, ADD-RIGHT, ADD-NX-LEFT and ADD-NX-RIGHT are 94.3%, 92.6%, 93.4% and 96.0%, respectively. Figure 5 shows examples of generated trees.
[Figure 5: generated dependency trees; one sample reads "Profit widened to $ UNK million , from $ 1.37 billion a year earlier ."]
⁴ http://www.seas.upenn.edu/~strctlrn/MSTParser
⁵ It is possible to get rid of the four classifiers by adding START/STOP symbols when generating left and right dependents as in Eisner (1996). We refrained from doing this for computational reasons. For a sentence with N words, this approach will lead to 2N additional START/STOP symbols (with one START and one STOP symbol for each word). Consequently, the computational cost and memory consumption during training will be three times as much, rendering our model less scalable.
⁶ The input embeddings have lower dimensions and therefore result in slightly worse classifiers.
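The generation procedure can be sketched as a breadth-first loop in which a yes/no decision precedes every new edge. In this sketch `should_add` stands in for the four binary classifiers and `sample_word` for sampling from the selected LSTM's word distribution; both are hypothetical callables supplied by the caller, and the `max_nodes` cap is our own safeguard against unbounded samples.

```python
from collections import deque


def generate_tree(should_add, sample_word, max_nodes=50):
    """Sample a dependency tree top-down, breadth-first.

    should_add(edge_type, node, nodes)  -> bool, stand-in for the ADD-* classifiers
    sample_word(edge_type, node, nodes) -> str, stand-in for sampling from the LSTM
    Returns a list of (word, head_index, edge_type); index 0 is ROOT.
    """
    nodes = [("ROOT", None, None)]
    queue = deque([0])
    while queue and len(nodes) < max_nodes:
        p = queue.popleft()
        for first, nxt in (("LEFT", "NX-LEFT"), ("RIGHT", "NX-RIGHT")):
            if p == 0 and first == "LEFT":
                continue                  # ROOT only has a first right dependent
            prev, edge = p, first         # the first dependent is predicted from p
            while len(nodes) < max_nodes and should_add(edge, prev, nodes):
                nodes.append((sample_word(edge, prev, nodes), p, edge))
                prev = len(nodes) - 1     # later siblings are predicted from prev
                queue.append(prev)
                edge = nxt
                if p == 0:
                    break                 # ROOT has exactly one dependent
    return nodes
```

Left dependents are always sampled before right ones, matching the restriction that only right dependents may condition on left dependents.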

Conclusions
In this paper we developed TREELSTM (and LDTREELSTM), a neural network model architecture, which is designed to predict tree structures rather than linear sequences. Experimental results on the MSR sentence completion task show that LDTREELSTM is superior to sequential LSTMs. Dependency parsing reranking experiments highlight our model's potential for dependency parsing. Finally, the ability of our model to generate dependency trees holds promise for text generation applications such as sentence compression and simplification (Filippova et al., 2015). Although our experiments have focused exclusively on dependency trees, there is nothing inherent in our formulation that disallows its application to other types of tree structure such as constituent trees or even taxonomies.