Neural Tree Indexers for Text Understanding

Recurrent neural networks (RNNs) process input text sequentially and model the conditional transition between word tokens. In contrast, recursive networks explicitly model the compositionality and the recursive structure of natural language. However, the current recursive architecture is limited by its dependence on a syntactic tree. In this paper, we introduce a robust, syntactic parsing-independent tree-structured model, Neural Tree Indexers (NTI), that provides a middle ground between sequential RNNs and syntactic tree-based recursive models. NTI constructs a full n-ary tree by processing the input text with its node function in a bottom-up fashion. An attention mechanism can then be applied to both the structure and the node function. We implemented and evaluated a binary-tree model of NTI, showing that the model achieved state-of-the-art performance on three different NLP tasks: natural language inference, answer sentence selection, and sentence classification, outperforming state-of-the-art recurrent and recursive neural networks.


Introduction
Recurrent neural networks (RNNs) have been successful for modeling sequence data (Elman, 1990). In particular, RNNs equipped with gated hidden units and internal short-term memories, such as long short-term memories (LSTM) (Hochreiter and Schmidhuber, 1997), have achieved notable success in several NLP tasks including named entity recognition (Lample et al., 2016), constituency parsing (Vinyals et al., 2015), textual entailment recognition (Rocktäschel et al., 2016), question answering (Hermann et al., 2015), and machine translation (Bahdanau et al., 2015). However, most LSTM models explored so far are chain-structured. They encode text sequentially from left to right or vice versa and do not naturally support the compositionality of language. Chain-structured LSTM models seem to learn syntactic structure from natural language; however, their generalization to unseen text is relatively poor compared with models that exploit syntactic tree structure (Bowman et al., 2015b).
Unlike chain-structured sequential models, recursive neural networks compose word phrases over a syntactic tree structure and have shown improved performance in sentiment analysis (Socher et al., 2013). However, their dependence on a syntactic tree architecture limits their practical NLP applications. In this study, we introduce Neural Tree Indexers (NTI), a class of tree-structured models for NLP tasks. NTI takes a sequence of tokens and produces its representation by constructing a full n-ary tree in a bottom-up fashion. Each node in NTI is associated with one of the node transformation functions: leaf node mapping and non-leaf node composition functions. Unlike previous recursive models, the tree structure for NTI is relaxed, i.e., NTI does not require the input sequences to be parsed syntactically; it is therefore flexible and can be directly applied to a wide range of NLP tasks. Furthermore, we propose different variants of the node composition function and attention over tree for our NTI models. When a sequential leaf node transformer such as LSTM is chosen, the NTI network forms an effective sequence-tree hybrid model taking advantage of both the conditional and compositional powers of sequential and recursive models. Figure 1 shows a binary-tree model of NTI. Although the model does not follow the syntactic tree structure, we empirically show that it achieved state-of-the-art performance on three different NLP applications: natural language inference, answer sentence selection, and sentence classification.

Recurrent Neural Networks and Attention Mechanism
RNNs model input text sequentially by taking a single token at each time step and producing a corresponding hidden state. The hidden state is then passed along to the next time step to provide historical sequence information. Although they have achieved great success in a variety of tasks, RNNs have limitations (Bengio et al., 1994; Hochreiter, 1998). It is difficult to train RNNs with the standard affine hidden units on long input sequences because the gradients tend to either vanish (vanishing gradients) or explode (exploding gradients). Because an RNN has to compress all information captured in the past time steps into the current fixed-length vector, it is not good at memorizing long or distant sequences (Sutskever et al., 2014). This is frequently called the information flow bottleneck.
Approaches have been developed to overcome these limitations. The exploding gradient problem is addressed by gradient clipping (Pascanu et al., 2013). To deal with vanishing gradients, gated and internal short-term memory variants of an RNN unit have been shown to be effective (Hochreiter and Schmidhuber, 1997; Cho et al., 2014). To mitigate the information flow bottleneck, Bahdanau et al. (2015) extended RNNs with an attention mechanism in the context of neural machine translation, leading to improved results in translating longer sentences. The attention mechanism allows RNNs to selectively focus on the most task-relevant parts of the input sequence, assign importance weights to those parts and blend them into a single attended representation; the blending operation can therefore bring a past and possibly distant input vector to the current time step.
Another limitation is that RNNs are linear chain-structured; this limits their potential for natural language, which can be represented by complex structures including syntactic structure. In this study, we propose models to mitigate this limitation.

Recursive Neural Networks
Unlike RNNs, recursive neural networks explicitly model the compositionality and the recursive structure of natural language over a tree. The tree structure can be predefined by a sentence parser (Socher et al., 2013). Each non-leaf tree node is associated with a node composition function which combines its child nodes and produces its own representation. The model is then trained by back-propagating error through structures (Goller and Kuchler, 1996).
The node composition function can be varied. A single-layer network with tanh non-linearity was adopted in recursive auto-associative memories (Pollack, 1990) and recursive autoencoders (Socher et al., 2011). Socher et al. (2012) extended this network with an additional matrix representation for each node to augment the expressive power of the recursive network. Tensor networks have also been used as the composition function for the sentence-level sentiment analysis task (Socher et al., 2013). Recently, Zhu et al. (2015) introduced S-LSTM, which extends LSTM units to compose tree nodes in a recursive fashion.
In this paper, we adopt S-LSTM and introduce a novel attentive node composition function. Unlike previous work, which relies on parser output and fine-grained supervision of non-leaf nodes, our NTI model relies on neither a parser nor fine-grained supervision (in NTI, the supervision from the target labels is provided at the root node), making our NTI model robust and applicable to a wide range of NLP tasks. Like RNNs, NTI has the issues of vanishing/exploding gradients and the information bottleneck. To address this, we introduce attention over the tree structure in NTI.

Methods
We consider learning-based NLP tasks. The training set consists of $N$ examples $\{X^i, Y^i\}_{i=1}^{N}$, where the input $X^i$ is a sequence of word tokens $w^i_1, w^i_2, \ldots, w^i_{T_i}$ and the output $Y^i$ can be either a single target or a sequence. Each input word token $w_t$ is represented by its word embedding $x_t \in \mathbb{R}^k$.
NTI is a full n-ary tree. It has two types of transformation function: the non-leaf node function $f^{node}(h^1, \ldots, h^c)$ and the leaf node function $f^{leaf}(x_t)$. $f^{leaf}(x_t)$ computes a (possibly nonlinear) transformation of the input word embedding $x_t$. $f^{node}(h^1, \ldots, h^c)$ is a function of its child node representations $h^1, \ldots, h^c$, where $c$ is the total number of child nodes of this non-leaf node.
NTI is open to different tree structures. In this study we explored only a binary tree form of NTI, meaning that a non-leaf node can take in only two direct child nodes (i.e., $c = 2$). Therefore, the function $f^{node}(h^l, h^r)$ composes its left child node $h^l$ and right child node $h^r$. Figure 1 illustrates our NTI model applied to question answering (a) and natural language inference (b). Note that the node and leaf node functions are neural networks, and their weights are the only trainable parameters in NTI.
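The bottom-up construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: `build_nti_tree`, `identity_leaf` and `mean_node` are hypothetical names, and the toy leaf and node functions merely stand in for the learned neural networks. The sketch assumes the input has already been padded to a power-of-two length.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def build_nti_tree(
    embeddings: Sequence[Vector],
    f_leaf: Callable[[Vector], Vector],
    f_node: Callable[[Vector, Vector], Vector],
) -> List[List[Vector]]:
    """Return node representations level by level: leaves first, root last."""
    n = len(embeddings)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("expected a power-of-two number of leaves (pad first)")
    levels = [[f_leaf(x) for x in embeddings]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        # Compose adjacent, non-overlapping pairs (left child, right child).
        levels.append(
            [f_node(prev[i], prev[i + 1]) for i in range(0, len(prev), 2)]
        )
    return levels


# Toy stand-ins for the learned networks (assumptions for illustration only).
def identity_leaf(x: Vector) -> Vector:
    return list(x)


def mean_node(h_l: Vector, h_r: Vector) -> Vector:
    return [(a + b) / 2.0 for a, b in zip(h_l, h_r)]


levels = build_nti_tree([[1.0], [3.0], [5.0], [7.0]], identity_leaf, mean_node)
root = levels[-1][0]
num_nodes = sum(len(level) for level in levels)
depth = len(levels) - 1
```

With n = 4 leaves the sketch produces 2n − 1 = 7 nodes arranged in log2(n) = 2 composition levels, which matches the node count and tree depth used later in the text.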
We explored two different approaches to compose node representations: an extended LSTM and attentive node composition functions, to be described below.

Non-Leaf Node Composition Functions
We define two different methods for the non-leaf node function $f^{node}(h^l, h^r)$.
LSTM-based Non-leaf Node Function (S-LSTM): First we use LSTM to compose $f^{node}(h^l, h^r)$. We adopt S-LSTM as a non-leaf node function from the work of Zhu et al. (2015). S-LSTM is an extension of LSTM applied to tree structures. It learns to compose child nodes into their parent node representations. Concretely, let $h^l_t$, $h^r_t$, $c^l_t$ and $c^r_t$ be the vector representations and cell states of the left and right children. An S-LSTM computes a parent node representation $h^p_{t+1}$ and a node cell state $c^p_{t+1}$ from these four inputs. The training parameters are $W^s_1, \ldots, W^s_{18} \in \mathbb{R}^{k \times k}$ and the biases (omitted for brevity). $\sigma$ and $\odot$ denote the element-wise sigmoid function and the element-wise vector multiplication. Extending the S-LSTM non-leaf node function to compose more children is straightforward. However, the number of parameters increases quadratically in S-LSTM as we add more child nodes.
Attentive Non-leaf Node Function (ANF): We introduce ANF as a new non-leaf node function. ANF is suitable for modeling sequence pairs such as the question-answer pair in QA, the premise-hypothesis pair in language inference and source-target sentences in machine translation. Unlike S-LSTM, ANF composes the child nodes attentively with respect to another relevant input vector $q \in \mathbb{R}^k$. The input vector $q$ can be a learned representation of a relevant sequence such as the question, the premise or a partial translation, depending on the task. ANF takes as input a matrix $S^{ANF} \in \mathbb{R}^{k \times 2}$, obtained by concatenating the child node representations $h^l_t$ and $h^r_t$, together with the third input vector $q$. It is defined in terms of a learnable matrix $W^{ANF}_1 \in \mathbb{R}^{k \times k}$, the attention score $m \in \mathbb{R}^2$ and the attention weight vector $\alpha \in \mathbb{R}^2$ for each child. $f^{score}$ is an attention scoring function, which can be implemented as a multi-layer perceptron (MLP) or a matrix-vector product $m = q^\top S^{ANF}$. The matrices $W^{score}_1$ and $W^{score}_2 \in \mathbb{R}^{k \times k}$ and the vector $w \in \mathbb{R}^k$ are training parameters. $e \in \mathbb{R}^2$ is a vector of ones and $\otimes$ the outer product. We use the ReLU function for the non-linear transformation.
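To make the shape of such a composition concrete, here is a minimal pure-Python sketch of a binary tree-LSTM style node function. It is an illustration under stated assumptions, not the exact S-LSTM: the gate layout (one input gate, one forget gate per child, an output gate that sees the new cell state) and the way weight matrices are grouped in the hypothetical `params` dictionary are assumptions, and they are not claimed to match the assignment of the 18 matrices mentioned above. Biases are omitted, as in the text.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def add(*vs: Vector) -> Vector:
    return [sum(t) for t in zip(*vs)]


def mul(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v: Vector) -> Vector:
    return [math.tanh(x) for x in v]


def tree_lstm_compose(
    h_l: Vector, h_r: Vector, c_l: Vector, c_r: Vector,
    params: Dict[str, List[Matrix]],
) -> Tuple[Vector, Vector]:
    """Compose two children into (parent hidden state, parent cell state)."""
    def affine(name: str, inputs: List[Vector]) -> Vector:
        # One k-by-k matrix per input vector (assumed layout).
        return add(*[matvec(W, v) for W, v in zip(params[name], inputs)])

    i = sigmoid(affine("input", [h_l, h_r, c_l, c_r]))
    f_l = sigmoid(affine("forget_left", [h_l, h_r, c_l, c_r]))
    f_r = sigmoid(affine("forget_right", [h_l, h_r, c_l, c_r]))
    u = tanh(affine("candidate", [h_l, h_r]))
    c_p = add(mul(f_l, c_l), mul(f_r, c_r), mul(i, u))
    o = sigmoid(affine("output", [h_l, h_r, c_p]))
    h_p = mul(o, tanh(c_p))
    return h_p, c_p


# Usage with all-zero weights: every gate is 0.5 and the candidate is 0.
zeros = [[0.0, 0.0], [0.0, 0.0]]
params = {
    "input": [zeros] * 4, "forget_left": [zeros] * 4,
    "forget_right": [zeros] * 4, "candidate": [zeros] * 2,
    "output": [zeros] * 3,
}
h_p, c_p = tree_lstm_compose([0.1, 0.2], [0.3, 0.4], [1.0, -1.0], [3.0, 1.0], params)
```

Each gate multiplies every child vector by its own k×k matrix, so adding a child adds matrices to every gate as well as a new forget gate; this is one way to see why the parameter count grows quickly with the branching factor.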
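The following Python sketch illustrates one way an attentive composition of two children could look. It is a sketch under assumptions, not the authors' definition: it uses the matrix-vector product variant of the scoring function (a dot product between q and each child), normalizes the two scores with softmax, blends the children with the resulting weights, and applies a ReLU layer with a single matrix. The order of these steps and the name `anf_compose` are assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def anf_compose(
    h_l: Vector, h_r: Vector, q: Vector, W1: Matrix
) -> Tuple[Vector, Vector]:
    """Blend two children with attention weights computed against q."""
    m = [dot(q, h_l), dot(q, h_r)]      # scores, one per child
    alpha = softmax(m)                  # attention weights over the children
    z = [alpha[0] * a + alpha[1] * b for a, b in zip(h_l, h_r)]
    h_p = [max(0.0, dot(row, z)) for row in W1]  # ReLU(W1 z), assumed form
    return h_p, alpha


identity = [[1.0, 0.0], [0.0, 1.0]]
# A zero query scores both children equally, so they are averaged.
h_even, alpha_even = anf_compose([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], identity)
# A query aligned with the left child puts almost all weight on it.
h_left, alpha_left = anf_compose([1.0, 0.0], [0.0, 1.0], [10.0, 0.0], identity)
```

The two calls show the intended behavior: the parent representation moves toward whichever child is more relevant to q.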

Attention Over Tree
Compared with chain-structured LSTM models, NTI has less recurrence, which is defined by the tree depth: log(n) for a binary tree, where n is the length of the input sequence. However, NTI still needs to compress all the input information into a single representation vector at the root and, as with LSTM, this imposes practical difficulties when processing long sequences. We address this issue with an attention mechanism over the tree. In addition, the attention mechanism can be used for matching trees that carry different sequence representations. We first define a global attention and then introduce a tree attention which considers the parent-child dependency when calculating the attention weights.
Global Attention: An attention neural network for the global attention takes all node representations as input and produces an attentively blended vector for the whole tree. This neural net is similar to ANF. In particular, given a matrix $S^{GA} \in \mathbb{R}^{k \times (2n-1)}$ resulting from concatenating the node representations $h_1, \ldots, h_{2n-1}$, and the relevant input representation $q$, the global attention includes the blending step $z = S^{GA}\alpha$ (14), where $W^{GA}_1$ and $W^{GA}_2 \in \mathbb{R}^{k \times k}$ are training parameters and $\alpha \in \mathbb{R}^{2n-1}$ is the attention weight vector for each node. This attention mechanism is robust as it globally normalizes the attention score $m$ with softmax to obtain the weights $\alpha$. However, it does not consider the tree structure when producing the final representation $h^{tree}$.
Tree Attention: We modify the global attention network to obtain the tree attention mechanism. Given a matrix $S^{TA} \in \mathbb{R}^{k \times 3}$ resulting from concatenating the parent node representation $h^p_t$, the left child $h^l_t$ and the right child $h^r_t$, and the relevant input representation $q$, every non-leaf node $h^p_t$ updates its own representation in a bottom-up manner, using an equation that is almost the same as the global attention. However, now each non-leaf node attentively collects its own and its children's representations and passes the result towards the root, which finally constructs the attentively blended tree representation. Note that, unlike the global attention, the tree attention locally normalizes the attention scores with softmax.
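The difference between global and local normalization can be illustrated with a short Python sketch. It is a simplified illustration under assumptions: dot-product scoring stands in for the scoring function, only the softmax-weighted blending step is shown (the learned projection matrices are left out), and the names `blend`, `global_attention` and `tree_attention` are hypothetical. The tree is given level by level, leaves first and root last.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend(vectors: List[Vector], q: Vector) -> Vector:
    """Score each vector against q, normalize, and return the weighted sum."""
    alpha = softmax([sum(a * b for a, b in zip(q, v)) for v in vectors])
    k = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(alpha, vectors)) for d in range(k)]


def global_attention(levels: List[List[Vector]], q: Vector) -> Vector:
    # One softmax over all 2n - 1 nodes; the tree structure is ignored.
    nodes = [h for level in levels for h in level]
    return blend(nodes, q)


def tree_attention(levels: List[List[Vector]], q: Vector) -> Vector:
    # Leaves are kept as they are; each parent is re-computed bottom-up from
    # itself and its two (already updated) children with a local softmax.
    updated = [[list(h) for h in levels[0]]]
    for depth in range(1, len(levels)):
        below = updated[-1]
        updated.append([
            blend([parent, below[2 * j], below[2 * j + 1]], q)
            for j, parent in enumerate(levels[depth])
        ])
    return updated[-1][0]


# Four leaves, two inner nodes and a root, with one-dimensional vectors.
levels = [[[0.0], [0.0], [0.0], [12.0]], [[0.0], [6.0]], [[3.0]]]
q = [0.0]  # a zero query gives uniform attention weights
z_global = global_attention(levels, q)
z_tree = tree_attention(levels, q)
```

In the global variant a single node can receive most of the weight regardless of where it sits in the tree, whereas in the tree variant each node competes only with its parent and sibling.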

Experiments
In this section we describe experiments on three different NLP tasks to demonstrate that NTI can be effective and flexible in different settings. We report results on natural language inference, question answering and sentence classification. These tasks challenge a model in terms of language understanding and semantic reasoning. The models are trained using Adam (Kingma and Ba, 2014) with hyperparameters selected on the development set. The pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) were used for the word embeddings (http://nlp.stanford.edu/projects/glove/). The word embeddings are fixed during training. The embeddings for out-of-vocabulary words were set to the zero vector. We pad the input sequence with a padding vector to form a full binary tree. The size of the hidden units of the NTI modules was set to 300. The models were regularized by using dropouts and an $l_2$ weight decay.
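As an illustration of the padding step, the sketch below extends a token sequence to the next power of two so that it can form the leaves of a full binary tree. It is a sketch under assumptions: the text does not say where the padding is inserted, so appending on the right is an assumption, and `pad_to_full_binary_tree` and the `"-"` pad symbol are names chosen for this example.

```python
from typing import List, Sequence


def pad_to_full_binary_tree(tokens: Sequence[str], pad: str = "-") -> List[str]:
    """Append pad tokens until the length is a power of two."""
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot build a tree from an empty sequence")
    target = 1
    while target < n:
        target *= 2
    return list(tokens) + [pad] * (target - n)


padded = pad_to_full_binary_tree(["a", "person", "rides", "a", "bike"])
num_leaves = len(padded)
tree_depth = num_leaves.bit_length() - 1  # log2 of a power of two
```

For a five-token sentence this yields eight leaves and a tree of depth three; in a real model each pad token would be mapped to the padding vector mentioned above.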

Natural Language Inference
Natural language inference is one of the main tasks in language understanding. This task tests the ability of a model to reason about the semantic relationship between two sentences. To perform well on the task, NTI should be able to capture sentence semantics and to reason about the relation between sentence pairs, i.e., whether premise-hypothesis pairs are entailing, contradictory or neutral. We conducted experiments on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015a) [...] batch size to 32. The initial learning rate, the regularization strength and the number of training epochs are varied for each model.
Table 2 (excerpt):
Model                                             MAP     MRR
... (Miao et al., 2016)                           0.6552  0.6747
Three-layer LSTM attention (Miao et al., 2016)    0.6639  0.6828
NASM (Miao et al., 2016)                          0.6705  0.6914
NTI (Ours)                                        0.6742  0.6884
NTI-SLSTM: this model does not rely on an $f^{leaf}$ transformer but uses the S-LSTM units for the non-leaf node function. We set the initial learning rate to 1e-3 and the $l_2$ regularizer strength to 3e-5, and train the model for 90 epochs. The neural net was regularized by 10% input dropouts and 20% output dropouts.
NTI-SLSTM-LSTM: we use LSTM for the leaf node function $f^{leaf}$. Here we initialize the memory cell of S-LSTM with the LSTM memory. The hyperparameters are the same as for the previous model.
NTI-SLSTM node-by-node global attention: while the models described so far do not involve the inter-sentence relation and learn each sentence representation separately, this variant models the inter-sentence relation with the global attention over the premise-indexed tree. The model is similar to the word-by-word attention model of Rocktäschel et al. (2016) in that it attends over the premise tree nodes at every time step of hypothesis encoding. We tie the weight parameters of the two NTI-SLSTMs for premise and hypothesis, and no $f^{leaf}$ transformer is used. We set the initial learning rate to 3e-4 and the $l_2$ regularizer strength to 1e-5, and train the model for 40 epochs. The neural net was regularized by 15% input dropouts and 15% output dropouts.
NTI-SLSTM node-by-node tree attention: this is a variation of the previous model with the tree attention. The hyperparameters are the same as for the previous model.
NTI-SLSTM-LSTM node-by-node global attention: in this model we include LSTM as the leaf node function $f^{leaf}$. Here we initialize the memory cell of S-LSTM with the LSTM memory, and the hidden/memory state of the hypothesis LSTM with that of the premise LSTM (the latter follows Rocktäschel et al. (2016)). We set the initial learning rate to 3e-4 and the $l_2$ regularizer strength to 1e-5, and train the model for 10 epochs. The neural net was regularized by 10% input dropouts and 15% output dropouts.
NTI-SLSTM-LSTM node-by-node tree attention: this is a variation of the previous model with the tree attention. The hyperparameters are the same as for the previous model.
Tree matching NTI-SLSTM-LSTM global attention: this model first constructs the premise and hypothesis trees simultaneously with the NTI-SLSTM-LSTM model and then computes their matching vector by using the global attention and an additional LSTM. In particular, the attention vectors are produced at each hypothesis tree node and are then given to the LSTM model sequentially. The LSTM model compresses the attention vectors and outputs a single matching vector. This vector is then passed to an MLP for classification. The MLP for this tree matching setting has an input layer of 1024 units with ReLU activation and a softmax output layer.
This model is similar to the matching LSTM model of Wang and Jiang (2015) in the sense that we use an additional LSTM to calculate the matching vector. However, we use common LSTM units and match trees, while they design an LSTM architecture, mLSTM, tailored for matching sequences. We set the initial learning rate to 3e-4 and the $l_2$ regularizer strength to 3e-5, and train the model for 20 epochs. The neural net was regularized by 20% input dropouts and 20% output dropouts.
Tree matching NTI-SLSTM-LSTM tree attention: we replace the global attention with the tree attention. The hyperparameters are the same as for the previous model.
Full tree matching NTI-SLSTM-LSTM global attention: this model produces two sets of attention vectors, one by attending over the premise tree with respect to each hypothesis tree node and another by attending over the hypothesis tree with respect to each premise tree node. Each set of attention vectors is given to an LSTM model to achieve full tree matching. The last hidden states of the two LSTM models (i.e., one for each attention vector set) are concatenated for classification. These two tree matching LSTM models share the same training weights, and thus this model introduces no additional parametric complexity over the Tree matching NTI-SLSTM-LSTM global attention model. The hyperparameters are the same as for the previous model.
Table 1 shows the results of our models. For comparison, we also include published state-of-the-art systems and their performance. The classifier with handcrafted features extracts a set of lexical features. The next group of models is based on sentence encoding. While most of the sentence encoder models rely solely on word embeddings, the dependency tree CNN and the SPINN-NP models make use of sentence parser output. The last set of methods models the inter-sentence relation with soft attention (Bahdanau et al., 2015). Our best score on this task is 87.3% accuracy, obtained with the full tree matching NTI model. The previous best performing model on the task performs phrase matching by using the attention mechanism. The comparison table shows some interesting points:
• NTI-SLSTM improved the performance of the chain-structured LSTM encoder by approximately 2%. However, the number of training parameters in NTI-SLSTM is larger than in LSTM, and thus we do not draw a strong conclusion here.
• Using LSTM as the leaf node function does help in learning better representations. In fact, NTI-SLSTM-LSTM is a hybrid model which encodes a sequence sequentially through its leaf node function and then hierarchically composes the output representations.
• Computing the matching vector between trees or sequences with a separate LSTM model is effective.
• The global attention seems to be robust on this task. The tree attention was not helpful, as it normalizes the attention scores locally within the parent-child relationship.

Answer Sentence Selection
Answer sentence selection is an integral part of open-domain question answering. For this task, a model is trained to identify the correct sentences that answer a factual question from a set of candidate sentences. We experiment on the WikiQA dataset constructed from Wikipedia (Yang et al., 2015). The dataset contains 20,360/2,733/6,165 QA pairs for the train/dev/test sets. Table 3 summarises the statistics of the dataset.
The MLP setup used in the language inference task is kept the same, except that we now replace the softmax layer with a sigmoid layer and model a conditional probability distribution given $h^q_n$ and $h^a_n$, the question and answer encoded vectors, computed from $o^{QA}$, the output of the hidden layer of the MLP. We trained NTI to minimize the sigmoid cross-entropy loss. For this task, we use NTI-SLSTM-LSTM to encode answer candidates and NTI-ANF-LSTM to encode the question sentences. Note that NTI-ANF-LSTM relies on ANF as the non-leaf node function. The $q$ vector for NTI-ANF-LSTM is the answer representation produced by the answer-encoding NTI-SLSTM-LSTM model. We set the batch size to 4 and the initial learning rate to 1e-3, and train the model for 10 epochs. We used 20% input dropouts and no $l_2$ weight decay. Following previous work, we adopt MAP and MRR as the evaluation metrics for this task.
Table 2 presents the results of our model and the previous models for the task. The classifier with handcrafted features is an SVM model trained with a set of features. The Bigram-CNN model is a simple convolutional neural net. We highlight the following points from Table 2:
• The Deep and LSTM attention models outperform the previous best result by a large margin, nearly 5-6%.
• NASM improves the result further and sets a strong baseline by combining a variational auto-encoder (Kingma and Welling, 2014) with soft attention. NASM adopts a deep three-layer LSTM and introduces a latent stochastic attention mechanism over the answer sentence.
• Our NTI model exceeds NASM by approximately 0.4% on MAP for this task.
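For reference, the two metrics reported for this task can be computed as in the following Python sketch, which uses the standard definitions of mean average precision (MAP) and mean reciprocal rank (MRR). The data layout (a list of (score, label) candidates per question) and the function names are assumptions for illustration, and skipping questions that have no correct candidate is an assumed convention rather than something stated in the text.

```python
from typing import List, Tuple

Candidate = Tuple[float, int]  # (model score, 1 if correct answer else 0)


def average_precision(ranked_labels: List[int]) -> float:
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this correct answer
    return total / hits if hits else 0.0


def reciprocal_rank(ranked_labels: List[int]) -> float:
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def map_mrr(questions: List[List[Candidate]]) -> Tuple[float, float]:
    aps, rrs = [], []
    for candidates in questions:
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        labels = [label for _, label in ranked]
        if not any(labels):
            continue  # assumed convention: skip questions with no answer
        aps.append(average_precision(labels))
        rrs.append(reciprocal_rank(labels))
    return sum(aps) / len(aps), sum(rrs) / len(rrs)


questions = [
    [(0.9, 0), (0.8, 1), (0.1, 1)],  # correct answers ranked 2nd and 3rd
    [(0.2, 0), (0.7, 1)],            # correct answer ranked 1st
]
mean_ap, mean_rr = map_mrr(questions)
```

In the toy example the first question has average precision (1/2 + 2/3)/2 and reciprocal rank 1/2, while the second scores 1 on both, so MAP and MRR are the means of those per-question values.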

Sentence Classification
Lastly, we evaluated NTI on the Stanford Sentiment Treebank (SST) (Socher et al., 2013). This dataset comes with standard train/dev/test sets and two subtasks: binary sentence classification and fine-grained classification into five classes. We trained our model on the text spans corresponding to labeled phrases in the training set and evaluated the model on the full sentences.
We use the NTI-SLSTM and NTI-SLSTM-LSTM models to learn sentence representations for the task. The sentence representations were passed to a two-layer MLP for classification. The first layer of the MLP has 1024 units followed by ReLU activation. The second layer is a softmax layer. We set the batch size to 64, the initial learning rate to 1e-3 and the $l_2$ regularizer strength to 3e-5, and train each model for 10 epochs. The NTI-SLSTM model was regularized by 10%/20% input/output dropouts in the binary setting and 20%/30% input/output dropouts in the fine-grained setting; the NTI-SLSTM-LSTM model was regularized by 20% input dropouts in the binary setting and 20%/30% input/output dropouts in the fine-grained setting.
Table 4 compares the results of our model with the state-of-the-art methods on the two subtasks. Most of the best-performing methods exploit the parse tree provided in the treebank on this task, with the exception of DMN. The DMN (Dynamic Memory Network) model is a memory-augmented network which reads from and writes to an episodic memory. We summarize the comparison table as follows:
• Our NTI-SLSTM model performed slightly worse than its constituency tree-based counterpart, the CT-LSTM model. The CT-LSTM model composes phrases according to the output of a sentence parser and uses a node composition function similar to S-LSTM.
• If we transform the input with an LSTM leaf node function, we achieve the best performance on this task. NTI-SLSTM-LSTM outperformed the DMN and set state-of-the-art results on both subtasks.

Discussion and Conclusion
In this paper we introduce Neural Tree Indexers, a class of tree-structured recursive neural networks. With the right composition function, including the LSTM or attentive node composition functions we explored in this study, NTI achieves state-of-the-art performance in different NLP tasks. Most of the NTI models introduced here form deep neural networks, and we think this is one reason that NTI works well even though it lacks the direct linguistic motivations followed by other syntactic-tree-structured recursive models (Socher et al., 2013). For example, our sequence-tree hybrid model (NTI-SLSTM-LSTM) is deeper than the common two-layer LSTMs. In addition, it models the input text from both sequential and compositional perspectives. Another reason is that the tasks we addressed here may not require syntactic knowledge to be explicitly encoded.
There is a relation between CNN and NTI models. Both NTI and CNNs are hierarchical. However, the current implementation of NTI only operates on non-overlapping sub-trees, while CNNs slide over the input to produce higher-level representations. NTI has a great deal of freedom in selecting the node function and the attention mechanisms. NTI can be extended to operate on overlapping sub-trees, like CNNs, thus effectively performing word n-gram compositions.
One of the limitations of our study is that we explored only binary-tree-structured NTI models. Different branching factors for the underlying tree structure have yet to be explored. Note that NTI can be seen as a generalization of LSTM. If we construct left-branching trees in a bottom-up fashion, the model acts just like a sequential LSTM. NTI can be extended so that it learns to select and compose a dynamic number of nodes for efficiency, essentially performing text chunking at its leaf nodes. We leave this for future work.

Figure 1: A binary tree form of Neural Tree Indexers (NTI) in the context of question answering and natural language inference. We insert empty tokens (denoted by −) into the input text to form a full binary tree. (a) NTI produces the answer representation at the root node. This representation, along with the question, is used to find the answer. (b) NTI learns representations for the premise and hypothesis sentences and then attentively combines them for classification. Dotted lines indicate attention over the premise-indexed tree.

Table 1: Training and test accuracy on the natural language inference task. d is the word embedding size.

Table 2: Test set performance on answer sentence selection.

Table 3: Statistics of the WikiQA dataset.