Multiplicative Tree-Structured Long Short-Term Memory Networks for Semantic Representations

Tree-structured LSTMs have shown advantages in learning semantic representations by exploiting syntactic information. Most existing methods model tree structures by combining constituent nodes bottom-up with a single shared compositional function, often using only input word information. This inability to capture the richness of compositionality limits the expressive power of these models. In this paper, we propose multiplicative tree-structured LSTMs to tackle this problem. Our model makes use of not only word information but also relation information between words, and is more expressive because a different combination function can be used for each child node. In addition to syntactic trees, we also investigate the use of Abstract Meaning Representation in tree-structured models, in order to incorporate both syntactic and semantic information from the sentence. Experimental results on common NLP tasks show that the proposed models lead to better sentence representations and that AMR brings benefits in complex tasks.


Introduction
Learning the distributed representation of long spans of text from their constituents has been a crucial step in various NLP tasks such as text classification (Zhao et al., 2015; Kim, 2014), semantic matching (Liu et al., 2016), and machine translation (Cho et al., 2014). Seminal work uses recurrent neural networks (RNN) (Elman, 1990), convolutional neural networks (Kalchbrenner et al., 2014), and tree-structured neural networks (Socher et al., 2011; Tai et al., 2015) for sequence and tree modeling. Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) networks are a type of recurrent neural network capable of learning long-term dependencies across sequences and have achieved significant improvements on a variety of sequence tasks. LSTM has been extended to model tree structures (e.g., TreeLSTM) and has produced promising results in tasks such as sentiment classification (Tai et al., 2015; Zhu et al., 2015) and relation extraction (Miwa and Bansal, 2016). Figure 1 shows the topologies of the conventional chain-structured LSTM (Hochreiter and Schmidhuber, 1997) and the TreeLSTM (Tai et al., 2015), illustrating the input (x), cell (c), and hidden node (h) at a time step t. The key difference between Figure 1 (a) and (b) is the branching factor: while a cell in the sequential LSTM depends only on the single previous hidden node, a cell in the tree-structured LSTM depends on the hidden states of its child nodes.

* Work done as an intern at Amazon.
Despite their success, the tree-structured models have a limitation in their inability to fully capture the richness of compositionality (Socher et al., 2013a). The same combination function is used for all kinds of semantic compositions, though the compositions have different characteristics in nature. For example, the composition of the adjective and the noun differs significantly from the composition of the verb and the noun.
To alleviate this problem, some researchers propose to use multiple compositional functions, predefined according to some partition criterion (Socher et al., 2012, 2013a; Dong et al., 2014). Socher et al. (2013a) defined different compositional functions in terms of syntactic categories, with a suitable compositional function selected based on the syntactic category. Dong et al. (2014) introduced multiple compositional functions, with a proper one selected based on the input information. These models accomplished their objective to a certain extent, but they still face critical challenges: the predefined compositional functions cannot cover all compositional rules, and they add many more learnable parameters, bearing the risk of overfitting.
In this paper, we propose multiplicative TreeLSTM, an extension of the TreeLSTM model that injects relation information into every node in the tree. It allows the model to use different semantic composition matrices to combine child nodes. To reduce model complexity and keep the number of parameters manageable, we define the composition matrices as the product of two dense matrices shared across relations, with an intermediate diagonal matrix that is relation dependent.
Though syntactic-based models have been shown to be promising for compositional semantics, they do not make full use of the available linguistic information. For example, semantic nodes are often the argument of more than one predicate (e.g., coreference), and it is generally useful to exclude semantically vacuous words like articles or complementizers, i.e., to leave unattached the nodes that do not add further meaning to the resulting representations. Recently, Banarescu et al. (2013) introduced Abstract Meaning Representation (AMR): single-rooted, directed, acyclic graphs that incorporate semantic roles, coreference, negation, and other linguistic phenomena. In this paper, we investigate combining the semantic composition process provided by the TreeLSTM model with the lexical semantic representation of the AMR formalism. This differs from most existing work in this area, where syntactic rather than semantic information is incorporated into the tree-structured models. We seek to answer the question: to what extent can we do better with AMR, as opposed to syntactic representations such as constituency and dependency trees, in tree-structured models?
We evaluate the proposed models on three common tasks: sentiment classification, sentence relatedness, and natural language inference. The results show that the multiplicative TreeLSTM models outperform TreeLSTM models on the same tree structures. The results further suggest that using AMR as the backbone for tree-structured models helps in complex tasks, such as natural language inference over longer sentences, but not in sentiment classification, where lexical information alone suffices.
In short, our contribution is twofold:
1. We propose the new multiplicative TreeLSTM model, which effectively learns the distributed representation of a given sentence from its constituents, utilizing not only the lexical information of words but also the relation information between them.
2. We conduct an extensive investigation of the usefulness of the lexical semantic representation induced by the AMR formalism in tree-structured models.

Tree-Structured LSTM
A standard LSTM processes a sentence in a sequential order, e.g., from left to right. It estimates a sequence of hidden vectors given a sequence of input vectors, through the calculation of a sequence of hidden cell vectors using a gate mechanism. Extending the standard LSTM from linear chains to tree structures leads to TreeLSTM. Unlike the standard LSTM, TreeLSTM allows richer network topologies, where each LSTM unit is able to incorporate information from multiple child units.
As in standard LSTM units, each TreeLSTM unit contains an input gate i_j, an output gate o_j, a memory cell c_j, and a hidden state h_j for node j. Unlike in the standard LSTM, the gating vectors and memory cell updates in TreeLSTM depend on the states of one or more child units. In addition, the TreeLSTM unit contains one forget gate f_{jk} for each child k instead of a single forget gate. The transition equations of node j are as follows:

\tilde{h}_j = \sum_{k \in C(j)} h_k
i_j = \sigma(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)})
f_{jk} = \sigma(W^{(f)} x_j + U^{(f)} h_k + b^{(f)})
o_j = \sigma(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)})
u_j = \tanh(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)})
c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k
h_j = o_j \odot \tanh(c_j)    (1)

where C(j) is the set of children of node j, k \in C(j) in f_{jk}, \sigma is the sigmoid function, and \odot is the element-wise (Hadamard) product.
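The child-sum update above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the standard formulation (Tai et al., 2015), not the authors' code; the parameter layout and helper names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_node(x_j, children, params):
    """One child-sum TreeLSTM update for node j.

    x_j      : input vector for node j, shape (d_in,)
    children : list of (h_k, c_k) pairs from child units (empty for leaves)
    params   : dict with W_g, U_g, b_g for each gate g in {i, f, o, u}
    """
    d = params["U_i"].shape[0]
    h_sum = sum((h for h, _ in children), np.zeros(d))  # h~_j, sum of child states

    i = sigmoid(params["W_i"] @ x_j + params["U_i"] @ h_sum + params["b_i"])
    o = sigmoid(params["W_o"] @ x_j + params["U_o"] @ h_sum + params["b_o"])
    u = np.tanh(params["W_u"] @ x_j + params["U_u"] @ h_sum + params["b_u"])

    # one forget gate per child, conditioned on that child's hidden state
    c = i * u
    for h_k, c_k in children:
        f_k = sigmoid(params["W_f"] @ x_j + params["U_f"] @ h_k + params["b_f"])
        c = c + f_k * c_k
    h = o * np.tanh(c)
    return h, c
```

Note that, unlike the sequential LSTM, the number of children is unbounded: the hidden states are summed before gating, so the same parameters handle any branching factor.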

Multiplicative Tree-Structured LSTM
Encoding rich linguistic analysis introduces many distinct edge types, or relations, between nodes, such as syntactic dependencies and semantic roles. This opens up many possibilities for parametrization, but it is not considered in most existing syntax-aware LSTM approaches, which make use of input node information only.
In this paper, we fill this gap by proposing multiplicative TreeLSTM, an extension of the TreeLSTM model that injects relation information into every node in the tree. The multiplicative TreeLSTM model, mTreeLSTM for short, introduces more fine-grained parameters based on the edge types. Inspired by the multiplicative RNN (Sutskever et al., 2011), the hidden-to-hidden propagation in mTreeLSTM contains a separately learned transition matrix W_{hh} for each possible edge type and is given by

\tilde{h}_j = \sum_{k \in C(j)} W_{hh}^{r(j,k)} h_k    (2)

where r(j, k) signifies the connection type between node k and its parent node j. This parametrization is straightforward, but it requires a large number of parameters when there are many edge types. For instance, there are dozens of syntactic edge types, each corresponding to a Stanford dependency label.
To reduce the number of parameters and leverage potential correlations among fine-grained edge types, we learn an embedding of the edge types and factorize the transition matrix W_{hh}^{r(j,k)} as the product of two dense matrices shared across edge types, with an intermediate diagonal matrix that is edge-type dependent:

W_{hh}^{r(j,k)} = W_{hh}^{a} \mathrm{diag}(e_{jk}) W_{hh}^{b}

where e_{jk} is the edge-type embedding, jointly trained with the other parameters. The mapping from h_k to \tilde{h}_j is then given by

\tilde{h}_j = \sum_{k \in C(j)} W_{hh}^{a} \mathrm{diag}(e_{jk}) W_{hh}^{b} h_k

The gating units (input gate i, output gate o, and forget gate f) are computed in the same way as in the TreeLSTM with Eq. (1). Multiplicative TreeLSTM can be applied to any tree where connection types between nodes are given. For example, in dependency trees, the relations r(j, k) between nodes are provided by a dependency parser.
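The factorized transition can be computed without ever materializing a per-relation matrix, since diag(e) W_b h is just an element-wise scaling of W_b h. A minimal NumPy sketch (variable names are ours, not the authors'):

```python
import numpy as np

def multiplicative_transition(children, edge_types, W_a, W_b, E):
    """Compute h~_j = sum_k W_a diag(e_{jk}) W_b h_k.

    children   : list of child hidden states h_k, each of shape (d,)
    edge_types : list of integer relation ids r(j, k), one per child
    W_a        : shape (d, d_e), shared across relations
    W_b        : shape (d_e, d), shared across relations
    E          : shape (n_relations, d_e), edge-type embedding table
    """
    d = W_a.shape[0]
    h_tilde = np.zeros(d)
    for h_k, r in zip(children, edge_types):
        e = E[r]                          # relation-dependent diagonal
        h_tilde += W_a @ (e * (W_b @ h_k))
    return h_tilde
```

This is mathematically equivalent to building W_hh^r = W_a diag(e_r) W_b explicitly, but it stores only two shared dense matrices plus a d_e-dimensional embedding per relation.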

Tree-Structured LSTMs with Abstract Meaning Representation
Tree-structured LSTMs have been applied successfully to syntactic parse trees (Tai et al., 2015; Miwa and Bansal, 2016). In this work, we look beyond the syntactic properties of the text and incorporate semantic properties into the tree-structured LSTM model. Specifically, we utilize the network topology offered by a tree-structured LSTM and incorporate semantic features induced by the AMR formalism. We aim to address the following questions: in which tasks is using AMR structures as the backbone for the tree-structured LSTM useful? Furthermore, which semantic properties are useful for a given task? AMR is a semantic formalism in which the meaning of a sentence is encoded as a single-rooted, directed, acyclic graph (Banarescu et al., 2013). For example, the sentence "A young girl is playing on the edge of a fountain and an older woman is not watching her" is represented as:

Figure 2: An AMR representing the sentence "A young girl is playing on the edge of a fountain and an older woman is not watching her".
(a / and
    :op1 (p / play-01
        :ARG0 (g / girl
            :mod (y / young))
        :ARG1 (e / edge-01
            :ARG1 (f / fountain)))
    :op2 (w / watch-01
        :ARG0 (w2 / woman
            :mod (o / old))
        :ARG1 g
        :polarity -))

The same AMR can be represented as in Figure 2, in which the nodes of the graph (also called concepts) map to words in the sentence and the edges represent the relations between words. AMR concepts consist of predicate senses, named entity annotations, and, in some cases, simply lemmas of English words. AMR relations consist of core semantic roles drawn from PropBank (Palmer et al., 2005) as well as fine-grained semantic relations defined specifically for AMR. Since AMR provides a whole-sentence semantic representation, it captures long-range dependencies among the constituent words of a sentence. Similar to other semantic schemes, such as UCCA (Abend and Rappoport, 2013), GMB (Basile et al., 2012), and UDS (White et al., 2016), AMR abstracts away from morphological and syntactic variability and generalizes cross-linguistically.
To use AMR structures in a tree-structured LSTM, we first parse sentences into AMR graphs and transform the graphs into tree structures. The transformation follows the procedure used by Takase et al. (2016): each node with an in-degree larger than one, which mainly represents a coreferential concept, is split into a set of separate nodes whose in-degrees are exactly one. We use JAMR (Flanigan et al., 2014, 2016), a statistical semantic parser trained on the AMR bank, for AMR parsing.
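The transformation amounts to unfolding the rooted DAG into a tree by duplicating every re-entrant node once per incoming path. A small illustrative sketch (ours, not the code of Takase et al., and assuming the graph is acyclic, as AMR guarantees):

```python
def dag_to_tree(root, children):
    """Unfold a rooted DAG into a tree by duplicating re-entrant nodes.

    children : dict mapping node -> list of (relation, child) edges
    Returns nested (node, [(relation, subtree), ...]) tuples; a node
    reachable along two paths (in-degree > 1) appears once per path,
    so every node in the result has in-degree exactly one.
    """
    def unfold(node):
        return (node, [(rel, unfold(ch)) for rel, ch in children.get(node, [])])
    return unfold(root)
```

For the AMR in Figure 2, the coreferential concept g (girl) is reached both as the :ARG0 of play-01 and as the :ARG1 of watch-01, so it would appear twice in the resulting tree.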
On the one hand, the AMR tree structure can be used directly with the TreeLSTM architecture described in Section 2, in which only node information is utilized to encode sentences into fixed-length embedding vectors. On the other hand, since AMR provides rich information about semantic relations between nodes, the mTreeLSTM architecture is more applicable due to its capability of modeling edges in the tree. We evaluate the encoded vectors produced by both TreeLSTM and mTreeLSTM on AMR trees in Section 6.

Applications
In this section, we describe three specific models that apply the mTreeLSTM architecture and the AMR tree structures described above.

Sentiment Classification
In this task, we wish to predict the sentiment of sentences, in which two sub-tasks are considered: binary and fine-grained multiclass classification. In the former, sentences are classified into two classes (positive and negative), while in the latter they are classified into five classes (very positive, positive, neutral, negative, and very negative).
For a sentence x, we first apply tree-structured LSTMs over the sentence's parse tree to obtain the representation h_r at the root node r. A softmax classifier is then used to predict the class \hat{y} of the sentence, with

\hat{p}_\theta(y | x) = \mathrm{softmax}(W^{(s)} h_r), \quad \hat{y} = \mathrm{argmax}_y \, \hat{p}_\theta(y | x)

where \theta denotes the model parameters. The cost function is the negative log-likelihood of the true sentiment class of the sentence, with L2 regularization.
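The classification head is a single softmax layer over the root representation. A minimal NumPy sketch (names are ours; the regularization term mirrors the L2 penalty described above):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerically stable
    e = np.exp(z)
    return e / e.sum()

def predict_sentiment(h_r, W_s):
    """Softmax classifier over the root representation h_r."""
    p = softmax(W_s @ h_r)   # p_theta(y | x)
    return int(np.argmax(p)), p

def nll_loss(p, y, weights, l2=1e-4):
    """Negative log-likelihood of gold class y, with L2 regularization."""
    reg = l2 * sum(float(np.sum(W * W)) for W in weights)
    return -float(np.log(p[y])) + reg
```

For the binary sub-task W^{(s)} has 2 rows; for the fine-grained sub-task it has 5, one per sentiment class.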

Semantic Relatedness
Given a sentence pair, the goal is to predict an integer-valued similarity score in {1, 2, ..., K}, where higher scores indicate greater degrees of similarity between the sentences.
Following Tai et al. (2015), we first produce semantic representations h_L and h_R for the two sentences in the pair using the described models over each sentence's parse tree. Then, we predict the similarity score \hat{y} using additional feedforward layers over a feature vector x_s encoding both the distance and the angle between the pair (h_L, h_R):

\hat{p}_\theta = \mathrm{softmax}(W^{(p)} \sigma(W^{(s)} x_s)), \quad \hat{y} = r^\top \hat{p}_\theta

where r = [1, 2, ..., K]. Similar to Tai et al. (2015), we define a sparse target distribution p such that the ground-truth rating y \in [1, K] equals r^\top p, and use the regularized KL-divergence between p and \hat{p}_\theta as the cost function.
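The "distance and angle" features are commonly realized as the absolute difference and the element-wise product of the two representations, and the sparse target places mass only on the two integer ratings adjacent to y. An illustrative NumPy sketch (function names are ours):

```python
import numpy as np

def pair_features(h_l, h_r):
    """Distance and angle features between two sentence representations."""
    return np.concatenate([np.abs(h_l - h_r), h_l * h_r])

def sparse_target(y, K):
    """Sparse distribution p over ratings 1..K with r^T p == y (Tai et al., 2015).

    Mass is split between floor(y) and floor(y)+1 so the expected
    rating under p equals the real-valued gold score y.
    """
    p = np.zeros(K)
    lo = int(np.floor(y))
    if lo == y:                # integer rating: all mass on one entry
        p[lo - 1] = 1.0
    else:
        p[lo - 1] = lo + 1 - y
        p[lo] = y - lo
    return p
```

For example, a gold score of 4.3 on a 1-to-5 scale yields p = [0, 0, 0, 0.7, 0.3], whose expected rating is 4 * 0.7 + 5 * 0.3 = 4.3.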

Natural Language Inference (NLI)
In this task, the model reads two sentences (a premise and a hypothesis), and outputs a judgement of entailment, contradiction, or neutral, reflecting the relationship between the meanings of the two sentences.
Following Bowman et al. (2016), we frame the inference task as sentence pair classification. We first produce representations h_P and h_H for the premise and the hypothesis, and then construct a feature vector x_c for the pair consisting of the concatenation of the two vectors, their difference, and their element-wise product. This feature vector is passed to a feedforward layer followed by a softmax layer to yield a distribution over the three classes:

\hat{p}_\theta = \mathrm{softmax}(W^{(p)} \sigma(W^{(c)} x_c))

The negative log-likelihood of the true class labels of the sentence pairs is used as the cost function.
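The matching step above can be sketched as follows (an illustrative NumPy version; matrix names follow the text, helper names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nli_classify(h_p, h_h, W_c, W_p):
    """Distribution over {entailment, contradiction, neutral} for a pair.

    x_c concatenates the two vectors, their difference, and their
    element-wise product, so it is four times the sentence dimension.
    """
    x_c = np.concatenate([h_p, h_h, h_p - h_h, h_p * h_h])
    return softmax(W_p @ sigmoid(W_c @ x_c))
```

With sentence representations of dimension d, W^{(c)} maps from 4d to the hidden size and W^{(p)} maps from the hidden size to the three classes.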

Hyperparameters and Training
The model parameters are optimized using AdaGrad (Duchi et al., 2011) with a learning rate of 0.05 for the first two tasks, and Adam (Kingma and Ba, 2015) with a learning rate of 0.001 for the NLI task. A batch size of 25 was used for all tasks, and the model parameters were regularized with a per-minibatch L2 regularization strength of 10^-4. The sentiment and inference classifiers were additionally regularized using dropout with a dropout rate of 0.5.
Following Tai et al. (2015) and Zhu et al. (2015), we initialized the word embeddings with 300-dimensional GloVe vectors (Pennington et al., 2014). In addition, we use the aligner provided by the JAMR parser to align the sentences with the AMR trees and then generate the node embeddings using the GloVe vectors. The relation embeddings, of size 100, were randomly sampled from a uniform distribution over [-0.05, 0.05]. The word and relation embeddings were updated during training with a learning rate of 0.1.
We use one hidden layer and the same dimensionality settings for the sequential LSTM and the tree-structured LSTMs. LSTM hidden states are of size 150. The output hidden size is 50 for the relatedness task and the NLI task. Each model is trained for 10 iterations (we did not observe better results with more iterations). The same training procedure is repeated 5 times, with parameters evaluated at the end of every iteration on the development set. The model with the best results on the development set is used for the final tests.
We parse all sentences in the datasets with a constituency parser (Klein and Manning, 2003), a dependency parser (Chen and Manning, 2014), and an AMR parser (Flanigan et al., 2014, 2016) to obtain the tree structures. We compare our mTreeLSTM model with two baselines: LSTM and TreeLSTM. We use the notation (C), (D), and (A) to denote the tree structures the models are based on: constituency trees, dependency trees, and AMR trees, respectively. The code to reproduce the results is available at https://github.com/namkhanhtran/m-treelstm.

Sentiment Classification
For this task, we use the Stanford Sentiment Treebank (Socher et al., 2013b) with the standard train/dev/test splits of 6920/872/1821 for the binary classification sub-task and 8544/1101/2210 for the fine-grained classification sub-task. We used two different settings for training: root-level and phrase-level. In the root-level setting, each sentence is a data point, while in the phrase-level setting each phrase is reconstructed from nodes in the parse tree and treated as a separate data point. The phrase-level setting provides much more training data, but the root-level setting is closer to real-world applications. For AMR trees, we only report results in the root-level setting, as the annotation cost for the phrase-level setting is prohibitively high. We evaluate our models and the baseline models at the sentence level. Table 1 shows the main results for the sentiment classification task. While the LSTM model obtains quite good performance in both settings, the TreeLSTM model on constituency trees obtains better results, especially in the phrase-level setting, which has more supervision. This confirms the conclusion of Tai et al. (2015) that combining linguistic knowledge with LSTM leads to better performance than sequence models on this task. Table 1 also shows that mTreeLSTM consistently outperforms TreeLSTM on the same tree structures in both settings: in the phrase-level setting, mTreeLSTM (D) outperforms TreeLSTM (D), and in the root-level setting, mTreeLSTM (D) and mTreeLSTM (A) perform better than TreeLSTM (D) and TreeLSTM (A), respectively. This demonstrates the effectiveness of the relation multiplication mechanism and the importance of modeling relation information. The TreeLSTM and mTreeLSTM models with AMR trees do not perform well on this task.
Syntactic information alone goes a long way in determining the sentiment of a sentence. Noisy sentences in this task also impact the accuracy of the AMR parser.
We now look more closely at what the models learn by listing the composition matrices W_{hh}^{r(j,k)} with the largest Frobenius norms. These matrices have learned larger weights, which are multiplied with the corresponding child hidden states; those children therefore carry more weight in the composed parent vector. In decreasing order of Frobenius norm, the relation matrices for mTreeLSTM on dependency trees are: conjunction, adjectival modifier, object of a preposition, negation modifier, verbal modifier. The relation matrices for mTreeLSTM on AMR trees are: negation (:polarity), attribute (:ARG3, :ARG2), modifier (:mod), conjunction (:opN). The model learns that verbal and adjectival modifiers are more important than nouns, as they tend to affect the sentiment of sentences.
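This kind of inspection is a one-liner once the per-relation matrices are materialized from the factorization. A small illustrative helper (ours, not part of the released code):

```python
import numpy as np

def rank_by_frobenius(matrices):
    """Sort named composition matrices by decreasing Frobenius norm.

    matrices : dict mapping relation name -> 2-D weight matrix
    Returns relation names, largest norm first.
    """
    norms = {name: float(np.linalg.norm(W, ord="fro"))
             for name, W in matrices.items()}
    return sorted(norms, key=norms.get, reverse=True)
```

For mTreeLSTM, each W_{hh}^{r} would first be reconstructed as W_a diag(e_r) W_b before computing its norm.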

Sentence Relatedness
For this task, we use the Sentences Involving Compositional Knowledge (SICK) dataset, consisting of 9927 sentence pairs with the standard train/dev/test split of 4500/500/4927. Each pair is annotated with a relatedness score y in [1, 5], with 1 indicating that the two sentences are completely unrelated and 5 indicating that they are very related. Following Tai et al. (2015), we use Pearson and Spearman correlations and the mean squared error (MSE) as evaluation metrics.

Table 2: Results on the SICK dataset for the semantic relatedness task, with standard deviations in parentheses.
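The three metrics can be computed in a few lines of NumPy. This is an illustrative sketch; the rank-based Spearman shortcut below assumes no tied scores (library implementations such as scipy.stats.spearmanr handle ties properly):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two score vectors."""
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no ties)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

def mse(x, y):
    """Mean squared error between predicted and gold scores."""
    return float(np.mean((x - y) ** 2))
```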
Our results are summarized in Table 2. The tree-structured LSTMs, both TreeLSTM and mTreeLSTM, perform better than the standard LSTM. The model using dependency trees as the backbone achieves the best results. The mTreeLSTM with AMR trees obtains slightly better results than the TreeLSTM with constituency trees. The multiplicative TreeLSTM models outperform the TreeLSTM models on the same parse trees, again illustrating the usefulness of incorporating relation information into the model. Similar to the previous experiment, we list the composition matrices W_{hh}^{r(j,k)} with the largest Frobenius norms.

Natural Language Inference
In this task, we first look at the SICK dataset described in the previous section. In this setting, each sentence pair is classified into one of three labels: entailment, contradiction, and neutral.
In addition to the standard test set, we also report the performance of our models on two different subsets. The first subset, Long Sentence (LS), consists of sentence pairs in the test set where the premise sentence contains at least 18 words. We hypothesize that long sentences are more difficult to handle for sequential models as well as for tree-structured models. The second subset, Negation, is the set of sentence pairs where negation words (not, n't, or no) do not appear in the premise but appear in the hypothesis. In the test set, 58.7% of these examples are labeled as contradiction.

Table 3: Accuracy on the SICK dataset for the NLI task, with standard deviations in parentheses (numbers in percentages).

Table 3 summarizes the results of our models on the different test sets. The mTreeLSTM models obtain the highest results, followed by the TreeLSTM models. The standard LSTM model does not work well on this task. The results reconfirm the benefit of using the structural information of sentences in learning semantic representations. In addition, Table 3 shows that TreeLSTM on dependency trees and AMR trees outperforms the models with constituency trees. Dependency trees provide some semantic information, i.e., semantic relations between words to some degree, while AMR trees present more semantic information. The multiplicative TreeLSTM on AMR trees performs much better than the other models on the LS and Negation subsets. The results on the LS subset show that mTreeLSTM on AMR trees can handle long-range dependencies in a sentence more effectively. For example, only mTreeLSTM (A) is able to predict the following example correctly:

Premise: The grotto with a pink interior is being climbed by four middle eastern children, three girls and one boy.
Hypothesis: A group of kids is playing on a colorful structure.
Label: entailment

Similar to previous experiments, we list the composition matrices with the largest Frobenius norms to get some insight into what the models learn.
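The Negation subset can be extracted with a simple token-level filter. An illustrative sketch (ours; it assumes the text is already tokenized so that "n't" is a separate token, as standard tokenizers produce):

```python
NEGATION_WORDS = {"not", "n't", "no"}

def in_negation_subset(premise_tokens, hypothesis_tokens):
    """True for pairs where a negation word appears in the hypothesis only."""
    prem = {t.lower() for t in premise_tokens}
    hyp = {t.lower() for t in hypothesis_tokens}
    return not (prem & NEGATION_WORDS) and bool(hyp & NEGATION_WORDS)
```

The LS subset is analogous, keeping pairs whose premise has at least 18 tokens.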

Additional Tests and Discussions
Incorporating relation information in the tree-structured LSTM increases model complexity. In this experiment, we analyze the impact of the dimensionality of the relation embedding on the model size and accuracy. Table 5 shows that the model with a relation embedding size of 100 achieves the best accuracy, while the overall impact of the embedding size is mild. The multiplicative TreeLSTM has only 1.2 times the number of weights of TreeLSTM (with the same number of hidden units). We did not count the parameters of the embedding models, since these are the same for all models.

Table 6: Comparison between different methods of using relation information on the SICK dataset for the NLI task.

Table 6 shows a comparison between mTreeLSTM and two other plausible methods for integrating relation information with TreeLSTM. In addTreeLSTM, a relation is treated as an additional node input in the TreeLSTM model; fullTreeLSTM corresponds to Eq. (2), where each edge type has a separate transition matrix. Both models achieve better results than TreeLSTM, indicating the usefulness of relation information. While addTreeLSTM and fullTreeLSTM obtain comparable performance, mTreeLSTM outperforms both of them. It is also worth noting that mTreeLSTM has far fewer parameters than fullTreeLSTM.
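The parameter savings of the factorization are easy to quantify. The sketch below counts only the relation-dependent transition parameters; the figure of 50 relation types is a hypothetical example, not a number from the paper:

```python
def transition_params(d, d_e, n_relations, factorized):
    """Relation-dependent hidden-to-hidden parameters.

    full       : one d x d matrix per relation type
    factorized : shared W_a (d x d_e) and W_b (d_e x d), plus a
                 d_e-dimensional embedding per relation type
    """
    if factorized:
        return 2 * d * d_e + n_relations * d_e
    return n_relations * d * d
```

With the paper's settings (hidden size d = 150, relation embedding size d_e = 100) and, say, 50 relation types, the full parametrization needs 50 * 150^2 = 1,125,000 extra weights, while the factorized one needs only 2 * 150 * 100 + 50 * 100 = 35,000, and the per-relation cost grows with d_e rather than d^2.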

Other Related Work
There is a line of research extending the standard LSTM (Hochreiter and Schmidhuber, 1997) to model more complex structures. Tai et al. (2015) and Zhu et al. (2015) extended sequential LSTMs to tree-structured LSTMs by adding branching factors. They showed that such extensions outperform competitive LSTM baselines on several tasks such as sentiment classification and semantic relatedness prediction (which this paper also confirms). Li et al. (2015) further investigated the effectiveness of TreeLSTMs on various tasks and discussed when tree structures are necessary. Chen et al. (2017) combined sequential and tree-structured LSTMs for NLI and achieved state-of-the-art results on the benchmark dataset. Their approach uses an n-ary TreeLSTM based on syntactic constituency parses. In contrast, we focus on the child-sum TreeLSTM, which is better suited for trees with high branching factors.
Previous work has studied the use of relation information. Dyer et al. (2015) considered each syntactic relation as an additional node and included its embedding in their composition function for dependency parsing. Peng et al. (2017) introduced a different set of parameters for each edge type in their LSTM-based approach to relation extraction. In contrast to these works, our mTreeLSTM model incorporates relation information via a multiplicative mechanism, which we have shown is more effective and uses fewer parameters.
AMR has been successfully applied to a number of NLP tasks besides the ones we considered in this paper. For example, Mitra and Baral (2016) made use of AMR to improve question answering, and Liu et al. (2015) utilized AMR to produce promising results for abstractive summarization. Using AMR as the backbone in TreeLSTM was investigated by Takase et al. (2016). They incorporated AMR information via a neural encoder into the attention-based summarization method (Rush et al., 2015), and it performed well on headline generation. Our work differs from these studies in that we aim to investigate how semantic information induced by the AMR formalism can be incorporated into tree-structured LSTM models, and to study which properties introduced by AMR turn out to be useful in various tasks. In this paper, we use the state-of-the-art AMR parser of Flanigan et al. (2016), which additionally provides the alignment between words and nodes in the tree.
Though we have considered only AMR in this paper, we believe the conclusions we draw here largely apply to other semantic schemes, such as GMB and UCCA, as well. Abend and Rappoport (2017) recently noted that the differences between these schemes are not critical, and that the main distinguishing factors between them are their relation to syntax, their degree of universality, and the expertise they require from annotators.

Conclusions
We presented multiplicative TreeLSTM, an extension of existing tree-structured LSTMs that incorporates relation information between nodes in the tree. Multiplicative TreeLSTM allows different compositional functions for child nodes, which makes it more expressive. In addition, we investigated how lexical semantic representations can be used with tree-structured LSTMs. Experiments on three common NLP tasks showed that multiplicative TreeLSTMs outperform conventional TreeLSTMs, illustrating the usefulness of relation information. Moreover, with AMR as the backbone, tree-structured models can effectively handle long-range dependencies.