Graph-to-Tree Neural Networks for Learning Structured Input-Output Translation with Applications to Semantic Parsing and Math Word Problem

The celebrated Seq2Seq technique and its numerous variants achieve excellent performance on many tasks such as neural machine translation, semantic parsing, and math word problem solving. However, these models either treat input objects only as sequences, ignoring structural information that is important for encoding, or treat output objects as flat sequences rather than structured objects for decoding. In this paper, we present a novel Graph-to-Tree neural network, namely Graph2Tree, consisting of a graph encoder and a hierarchical tree decoder, that encodes an augmented graph-structured input and decodes a tree-structured output. In particular, we investigate our model on two problems: neural semantic parsing and math word problem solving. Our extensive experiments demonstrate that our Graph2Tree model outperforms or matches the performance of other state-of-the-art models on these tasks.


Introduction
Learning general functional dependency between arbitrary input and output spaces is one of the key challenges in machine learning. While many efforts in machine learning have mainly focused on designing flexible and powerful input representations for solving classification or regression problems, many applications require researchers to design novel models that can deal with complex structured inputs and outputs, such as graphs, trees, sequences, or sets. In this paper, we consider the general problem of learning a mapping between a graph input $G \in \mathcal{G}$ and a tree output $T \in \mathcal{T}$, based on a training sample of structured input-output pairs $(G_1, T_1), \ldots, (G_n, T_n) \in \mathcal{G} \times \mathcal{T}$ drawn from some fixed but unknown probability distribution.
Such learning problems often arise in a variety of applications, ranging from semantic parsing to math word problem, label sequence learning, and supervised grammar learning, to name just a few. As shown in Fig. 1, finding the parse tree of a sentence involves a structural dependency among the labels in the parse tree; generating a mathematical expression of a math word problem involves a hierarchical dependency between math logical operations and the numbers. Conventionally, there have been efforts in generalizing kernel methods to predict structured and inter-dependent variables in a supervised learning setting (Tsochantaridis et al., 2005; Altun et al., 2004; Joachims et al., 2009).

Table 1: Examples of structured input and output of semantic parsing (SP) and math word problem (MWP). For inputs, we consider parsing tree augmented sequences to get structural information. The outputs are naturally hierarchical structures with structural symbols such as brackets.
SP
Text input: what jobs are there for web developer who know 'c++' ?
Structured output: answer( A , ( job ( A ) , title ( A , W ) , const ( W , 'Web Developer' ) , language ( A , C ) , const ( C , 'c++' ) ) )
MWP
Text input: 0.5 of the cows are grazing grass . 0.25 of the cows are sleeping and 9 cows are drinking water from the pond . find the total number of cows .
Structured output: ( ( 0.5 * x ) + ( 0.25 * x ) ) + 9.0 = x

* Authors contributed equally to this research.

Recently, the celebrated Sequence-to-Sequence technique (Seq2Seq) and its numerous variants (Sutskever et al., 2014; Bahdanau et al., 2014; Luong et al., 2015) have achieved excellent performance in neural machine translation. Encouraged by the success of the Seq2Seq model, there is a surge of interest in applying Seq2Seq models to other tasks, such as developing neural semantic parsers (Dong and Lapata, 2016) or solving math word problems (Ling et al., 2017). However, two significant challenges make a Seq2Seq model ineffective in these tasks: i) the natural text input often entails hidden syntactic structure information, such as a dependency or constituency tree, or even semantic structure information such as an AMR parsing tree; ii) the meaning representation output typically contains abundant information in a structured object such as a parsing tree or a mathematical equation.
Inspired by these observations, in this work we propose a Graph-to-Tree neural network, namely Graph2Tree, consisting of a graph encoder and a hierarchical tree decoder, which leverages the structural information of both source graphs and target trees. In particular, our Graph2Tree model learns the mapping from a structured object such as a graph to another structured object such as a tree. In addition, we observe that structured object translation typically follows a modular procedure, which translates each individual sub-graph in the source graph into the corresponding part of the target tree, and then composes these parts to form the final target tree.
Therefore, we design a workflow to align with this procedure: our graph encoder first learns from an input graph constructed from various inputs, such as a word sequence combined with the corresponding dependency or constituency tree, and then our tree decoder generates the tree object from the learned graph vector representations to explicitly capture the compositional structure of a tree. In particular, we present a novel Graph2Tree model with a separated attention mechanism to jointly learn a final hidden vector of the corresponding graph nodes in order to align the generation process between a heterogeneous graph input and a hierarchical tree output.
To demonstrate the effectiveness of our model, we perform experiments on two important tasks: Semantic Parsing and Math Word Problem. First, we compare our approach against several neural network approaches on the Semantic Parsing task. Our experimental results show that our Graph2Tree model outperforms or matches the performance of other state-of-the-art models on three standard benchmark datasets. Second, we compare our approach with recently developed neural approaches on the math word problem, and our results show that our Graph2Tree model achieves state-of-the-art performance compared to other baselines that use many task-specific techniques. We believe our Graph2Tree model is a solid attempt at learning structured input-output translation.

Graph Neural Networks
Graph representation learning has recently attracted a lot of attention from both academia and industry. One of the most important research lines is learning semantic embeddings of graph nodes or edges with graph neural networks (GNNs) (Kipf and Welling, 2017; Velickovic et al., 2017; Gilmer et al., 2017; Hamilton et al., 2017). Encouraged by the recent success of GNNs, various Sequence-to-Graph (Peng et al., 2018) and Graph-to-Sequence models (Xu et al., 2018a,b,c; Beck et al., 2018; Chen et al., 2020) have been proposed to handle structured inputs, structured outputs, or both, e.g., generating an AMR graph from a text sequence. More recently, some researchers proposed Tree-to-Tree, Graph-to-Tree (Yin et al., 2019) and Graph-to-Graph (Guo et al., 2018) neural networks for targeted application scenarios.
However, these works are designed exclusively for specific downstream tasks like program translation or code edit. Compared to them, our proposed Graph2Tree neural network with novel design of graph encoder and tree decoder does not rely on any specific downstream task assumption. Additionally, our Graph2Tree is the first generic neural network translating graph inputs into tree outputs, which may have numerous applications in practice.

Neural Semantic Parsing
Semantic parsing is the task of translating natural language utterances into machine-interpretable meaning representations like logical forms or SQL queries. Recent years have witnessed a surge of interest in developing neural semantic parsers with sequence-to-sequence models. These parsers have achieved promising results (Jia and Liang, 2016; Dong and Lapata, 2016; Ling et al., 2016). Because the meaning representations are usually structured objects (e.g., tree structures), many efforts have been devoted to developing structure-oriented decoders, including tree decoders (Dong and Lapata, 2016; Alvarez-Melis and Jaakkola, 2017), grammar constrained decoders (Yin and Neubig, 2017; Yin et al., 2018; Jie and Lu, 2018; Dong and Lapata, 2018), action sequences for semantic graph generation (Chen et al., 2018a), and modular decoders based on abstract syntax trees (Rabinovich et al., 2017). However, those approaches could potentially be further improved because they only consider the word sequence information and ignore other rich syntactic information, such as dependency or constituency trees, available at the encoder side.
Researchers have recently attempted to leverage the power of GNNs in various NLP tasks, including neural machine translation (Bastings et al., 2017; Beck et al., 2018), conversational machine reading comprehension (Chen et al., 2019b), and AMR-to-text (Song et al., 2018). Specifically in the semantic parsing field, a general Graph2Seq model (Xu et al., 2018b) was proposed to combine dependency and constituency trees with the word sequence and create a syntactic graph as the encoding input. However, this approach simply treats a logical form as a sequence, neglecting the abundant information in a structured object like a tree in the decoder architecture. Therefore, we present the Graph2Tree model to utilize the structure information in both structured inputs and outputs.

Math Word Problems
The math word problem is the task of translating a short paragraph (typically consisting of multiple short sentences) into succinct mathematical equations. To solve a math word problem such as the one illustrated in Table 1, traditional approaches focus on generating numeric answer expressions by mapping verbs in the problem text to categories (Hosseini et al., 2014) or by generating templates from problem texts. However, these approaches either need additional hand-crafted annotations for problem texts or are limited to a set of predefined equation templates. Inspired by the great success of Seq2Seq models in Neural Machine Translation, deep-learning based methods have been intensively explored for equation generation (Ling et al., 2017; Li et al., 2018, 2019; Zou and Lu, 2019; Xie and Sun, 2019). However, different forms of equations can be formed to solve the same math problem, which often makes models fail. To resolve this equation duplication issue, various equation normalization methods have been proposed to generate a unique expression tree, at the cost of losing the understanding of problem-solving steps in equation expressions. In contrast, we propose to use a Graph2Tree model to solve this task without any special mechanisms like equation normalization. To the best of our knowledge, this is the first work to use a GNN to build a math word problem solver.

Problem Formulation and Structure Object Construction

Graph-to-Tree Translation Task
In this work, we consider the problem of translating a graph input to a tree output. In particular, we consider two important tasks: Semantic Parsing and Math Word Problem. Formally, we define both tasks as follows. The input side contains a set of text sequences, denoted as $S = \{s_1, s_2, \ldots, s_n\} \in \mathcal{S}$, where $s_i$ is a text sequence consisting of a sequence of word embeddings. The heterogeneous graph input built from each sequence contains all of the original word nodes $V_1 \in \mathcal{V}_1$ as well as the relationship nodes $V_2 \in \mathcal{V}_2$ from the relationships of a parsing tree (i.e., dependency or constituency tree), and $E \in \mathcal{E}$ denotes whether two nodes are connected or not. The aim is to translate a set of heterogeneous graph inputs $G = \{g_1, g_2, \ldots, g_n\}$ into a set of tree outputs $T = \{t_1, t_2, \ldots, t_n\} \in \mathcal{T}$, where $t_i$ is a logic form or math equation consisting of a sequence of tree node tokens $t_i = \{y_1, y_2, \ldots, y_{|t_i|}\} \in \mathcal{Y}$.
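The heterogeneous graph defined above, with word nodes and relationship nodes, can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function name `build_dependency_graph`, the `(head, label, dependent)` triple format, the hand-written parse, and the head-to-label-to-dependent edge direction are all assumptions made for the example; a real pipeline would obtain the triples from a dependency parser.

```python
from typing import List, Tuple


def build_dependency_graph(words: List[str], deps: List[Tuple[int, str, int]]):
    """Return (nodes, edges) for a word sequence plus relationship nodes.

    deps holds (head_index, label, dependent_index) triples. They are
    supplied by hand here; this format is an assumption for illustration.
    """
    # Word nodes (V1): one node per token, indexed by position.
    nodes = [("word", w) for w in words]
    edges = set()

    # Bi-directional word sequence chain between adjacent tokens.
    for i in range(len(words) - 1):
        edges.add((i, i + 1))
        edges.add((i + 1, i))

    # Relationship nodes (V2): each dependency label becomes its own node,
    # linked to the two words it relates (direction is an assumption).
    for head, label, dep in deps:
        rel = len(nodes)
        nodes.append(("rel", label))
        edges.add((head, rel))
        edges.add((rel, dep))

    return nodes, sorted(edges)


words = ["find", "the", "total", "number"]
deps = [(0, "obj", 3), (3, "det", 1), (3, "amod", 2)]  # illustrative parse
nodes, edges = build_dependency_graph(words, deps)
```

Turning labels into nodes rather than edge attributes keeps the graph free of typed edges, so a plain neighborhood-aggregation encoder can consume it directly.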

Constructing Graph Inputs and Tree Outputs
To apply GNNs, the first step is to construct a graph input by combining the word sequence with its corresponding hidden structure information. How such graphs are constructed is critical for incorporating the structured information and influences the final performance. Similarly, how the tree outputs are constructed from logic forms or math equations also plays an important role in the final performance and model interpretability. In this section, we introduce two methods for graph construction and one method for tree construction.
Combining Word Sequence with Dependency Parse Tree. The dependency parse tree not only represents various grammatical relationships between pairs of text words, but has also been shown to play an important role in transforming texts into logical forms (Reddy et al., 2016). Therefore, the first method integrates two types of features by adding dependency linkages between corresponding word pairs in the word sequence. Concretely, we transform a dependency label into a node, which is linked to the two word nodes in the dependency relationship. Figure 1 gives an example of a heterogeneous graph constructed from a text.
Combining Word Sequence with Constituency Tree. The constituency tree contains the phrase structure information, which is also critical for describing word relationships and has been shown to provide useful information for translation (Gū et al., 2018). Since the leaf nodes in the constituency tree are the word nodes in the text, this method merges these nodes with the identical ones in the bi-directional word sequence chain to create the syntactic graph. Figure 2 shows an example of a constructed heterogeneous graph input.
Figure 3: A sample tree output in our decoding process from expression "( ( 0.5 * x ) + ( 0.25 * x ) ) + 9.0 = x".
Constructing Tree Outputs. To effectively learn the compositional nature of our structured outputs, we first need to transform the original outputs from logic forms or math equations into tree-structured objects. Specifically, we follow the tree construction method of Dong and Lapata (2016), which generates tree-structured outputs in a top-down manner. In original outputs containing structural symbols like brackets, we first extract the sub-tree structures and replace them with sub-tree symbols. Then we grow branches from the generated sub-tree symbols until all hierarchical structures in the original sequence are processed. Figure 3 provides an example of a tree object constructed from a mathematical expression.
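The bracket-based tree construction can be illustrated with a short sketch. It assumes whitespace-separated tokens and uses a hypothetical sub-tree symbol `<N>`; the helper names `build_tree` and `level_view` are introduced only for this example and are not taken from the authors' code.

```python
def build_tree(tokens):
    """Turn a bracketed token list into nested lists, one list per sub-tree."""
    root, stack = [], []
    current = root
    for tok in tokens:
        if tok == "(":
            child = []
            current.append(child)  # the nested list stands in for a sub-tree node
            stack.append(current)
            current = child
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced brackets")
            current = stack.pop()
        else:
            current.append(tok)
    if stack:
        raise ValueError("unbalanced brackets")
    return root


def level_view(node, sub="<N>"):
    """Show one tree level, with every sub-tree replaced by a sub-tree symbol."""
    return [sub if isinstance(t, list) else t for t in node]


expr = "( ( 0.5 * x ) + ( 0.25 * x ) ) + 9.0 = x"
tree = build_tree(expr.split())
first_level = level_view(tree)
second_level = level_view(tree[0])
third_level = level_view(tree[0][0])
```

For the expression from Table 1 this yields three levels: the first level contains one sub-tree symbol followed by `+ 9.0 = x`, the second contains two sub-tree symbols joined by `+`, and the third contains only terminal tokens.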

Graph2Tree Neural Networks
We aim to learn a mapping that translates a heterogeneous graph-structured input $G$ into its corresponding tree-structured output $T$. We illustrate the workflow of our proposed Graph2Tree model for semantic parsing in Figure 4, and present each component of the model as follows.

Graph Encoder
To effectively learn graph representations from our constructed heterogeneous text graph, we present a novel bidirectional graph node embedding method, BiGraphSAGE. The proposed BiGraphSAGE extends the widely used GraphSAGE (Hamilton et al., 2017) by learning forward and backward node embeddings of a graph $G$ in an interleaved fashion.
In particular, consider a word node $v \in V_1$ with a pretrained word embedding $w_v$, such as GloVe (Pennington et al., 2014), as $v$'s initial attributes. We then generate the contextualized node embeddings $a_v$ for all nodes $v \in V_1$ using a Bi-directional Long Short Term Memory network (BiLSTM) (Graves et al., 2013). For a relationship node $v \in V_2$, we initialize $a_v$ with randomized embeddings. These feature vectors are used as the initial node embeddings $h^0_v = a_v$. Then each node embedding learns its vector representation by aggregating information from the node's local neighborhood within $K$ hops of the graph.
Here $k \in \{1, \ldots, K\}$ is the iteration index and $\mathcal{N}$ is the neighborhood function of node $v$. $M^k_{\vdash}$ and $M^k_{\dashv}$ are the forward and backward aggregator functions. Node $v$'s forward (backward) representation $h^k_{v\vdash}$ ($h^k_{v\dashv}$) aggregates the information of nodes in $\mathcal{N}_{\vdash}(v)$ ($\mathcal{N}_{\dashv}(v)$).
where $\odot$ denotes component-wise multiplication, $\sigma$ is a sigmoid function and $w_g$ is a gating vector. The graph encoder learns node embeddings $h^k_v$ by repeating this process $K$ times, where $W^k$ denotes weight matrices, $\sigma$ is a nonlinearity function, and $K$ is the maximum number of hops. The final bi-directional node embedding $z_v$ concatenates the two unidirectional node embeddings at the last hop, $z_v = \mathrm{CONCAT}(h^K_{v\vdash}, h^K_{v\dashv})$. After the bi-directional embeddings $z$ for all nodes are computed, we feed the obtained node embeddings into a fully-connected neural network and apply the element-wise max-pooling operation on all node embeddings to compute the graph-level vector representation $g$. Other commutative operations, such as the mean or an attention-based weighted sum, can be used as well.
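To make the encoder description more concrete, the following NumPy sketch shows one plausible reading of bidirectional neighborhood aggregation followed by max-pooling. Several choices here are assumptions rather than details from the paper: a mean aggregator, a tanh update over the concatenated self and neighbor vectors, a ReLU fully-connected layer before pooling, and the function names `bigraphsage` and `graph_embedding`. The gated fusion mentioned in the text is left out because its exact form is not given.

```python
import numpy as np


def bigraphsage(h0, edges, weights_fw, weights_bw):
    """h0: (n, d) initial node embeddings; edges: directed (src, dst) pairs.

    weights_fw / weights_bw: lists of K matrices of shape (2d, d), one per hop.
    Returns z of shape (n, 2d): forward and backward embeddings concatenated.
    """
    n, d = h0.shape
    # Forward neighbours follow the edge direction; backward ones reverse it.
    nbr_fw = [[] for _ in range(n)]
    nbr_bw = [[] for _ in range(n)]
    for src, dst in edges:
        nbr_fw[src].append(dst)
        nbr_bw[dst].append(src)

    def hop(h, nbrs, w):
        # Mean aggregator (assumed); nodes without neighbours get zeros.
        agg = np.stack([h[ns].mean(axis=0) if ns else np.zeros(d) for ns in nbrs])
        return np.tanh(np.concatenate([h, agg], axis=1) @ w)

    h_fw, h_bw = h0, h0
    for w_fw, w_bw in zip(weights_fw, weights_bw):  # K hops
        h_fw = hop(h_fw, nbr_fw, w_fw)
        h_bw = hop(h_bw, nbr_bw, w_bw)
    return np.concatenate([h_fw, h_bw], axis=1)


def graph_embedding(z, w_fc, b_fc):
    """Fully-connected layer (ReLU assumed), then element-wise max over nodes."""
    return np.maximum(z @ w_fc + b_fc, 0.0).max(axis=0)


rng = np.random.default_rng(0)
n, d, K = 4, 3, 2
h0 = rng.normal(size=(n, d))
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
w_fw = [rng.normal(size=(2 * d, d)) for _ in range(K)]
w_bw = [rng.normal(size=(2 * d, d)) for _ in range(K)]
z = bigraphsage(h0, edges, w_fw, w_bw)
g = graph_embedding(z, rng.normal(size=(2 * d, d)), np.zeros(d))
```

Keeping the forward and backward passes separate until the final concatenation mirrors the statement that the two unidirectional embeddings are joined at the last hop.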

Tree Decoder
We propose a new general tree decoder that fully leverages the outputs of our graph encoder, i.e., the bidirectional node embeddings and the graph embedding, and faithfully generates tree-structured targets such as logic forms or math equations. Inspired by the thinking paradigm of human beings, our tree decoder uses, at a high level, a divide-and-conquer strategy that splits the whole decoding task into sub-tasks. Figure 3 illustrates an example output of our tree decoder. In this example, we first initialize the root tree node ROOT with the graph embedding $g$, and then apply a sub-decoder on ROOT to generate a 1st-level coarse output containing a sub-tree node $S_1$. This $S_1$ is further decoded with a similar sub-decoder to derive the 2nd-level coarse output. This procedure is repeated to generate the 3rd-level output, in which there are no sub-tree nodes. In this way, we get the whole tree output in a top-down manner.
This whole procedure can be summarized as follows: 1) initialize the root tree node with the graph embedding from our encoder and perform the first-level decoding with our LSTM-based sub-decoder; 2) for each newly generated sub-tree node, apply a sub-decoder to derive the next-level coarse output; 3) repeat step 2 until there are no sub-tree nodes in the last level of the tree structure.
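The three steps above can be sketched as a simple queue-driven loop. This is only an illustration of the control flow: the learned LSTM sub-decoder is replaced by a hypothetical lookup table (`TOY`), decoder states are plain strings, and the sub-tree symbol `<N>` and all function names are assumptions for the example.

```python
from collections import deque

SUB = "<N>"  # hypothetical sub-tree symbol


def decode_tree(graph_embedding, sub_decoder):
    """Expand sub-tree nodes level by level until none remain."""
    root = {"tokens": [], "children": []}
    queue = deque([(root, graph_embedding)])  # step 1: start from the root
    while queue:                              # step 3: repeat until exhausted
        node, parent_state = queue.popleft()
        tokens, states = sub_decoder(parent_state)
        node["tokens"] = tokens
        for tok, state in zip(tokens, states):
            if tok == SUB:                    # step 2: expand each sub-tree node
                child = {"tokens": [], "children": []}
                node["children"].append(child)
                queue.append((child, state))
    return root


def linearize(node):
    """Write the tree back as a bracketed token sequence."""
    out, kids = [], iter(node["children"])
    for tok in node["tokens"]:
        if tok == SUB:
            out += ["("] + linearize(next(kids)) + [")"]
        else:
            out.append(tok)
    return out


# Toy stand-in for a trained sub-decoder: a lookup keyed by a string state.
TOY = {
    "g": [SUB, "+", "9.0", "=", "x"],
    "g/0": [SUB, "+", SUB],
    "g/0/0": ["0.5", "*", "x"],
    "g/0/2": ["0.25", "*", "x"],
}


def toy_sub_decoder(state):
    tokens = TOY[state]
    return tokens, [f"{state}/{i}" for i in range(len(tokens))]


tree = decode_tree("g", toy_sub_decoder)
expression = " ".join(linearize(tree))
```

With the toy table, the loop reproduces the three-level expansion of the math expression from Table 1.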

Sub-Decoder Design
In each sub-decoder task, the conditional probability of the generated word at step $t$ is calculated as follows:
$p(y_t \mid y_{<t}, x) = f_{predict}(s_t)$ (9)
where $x$ denotes the vectors of all input words, $y_t$ is the predicted output word at step $t$, $s_t$ is the decoder hidden state at step $t$, and $f_{predict}$ is a non-linear function.
The key component of Eq. (9) is the computation of $s_t$. Conceptually, this value is calculated as $s_t = f_{decoder}(y_{t-1}, s_{t-1})$, where $f_{decoder}$ is usually an RNN unit. We propose two improvements on top of it, parent feeding and sibling feeding, to feed more information for decoding sub-tree nodes.
Parent feeding. For a sub-task in our tree decoding process, we aim to expand a sub-tree node in the parent layer, so it is reasonable to take the sub-tree node embedding $st_i$ into consideration. We therefore add the sub-tree node embedding as part of our input at every time step, in order to capture the upper-layer information for decoding.
Sibling feeding. Besides the information from parent nodes, if two sub-tree nodes share the same parent node, then these two sub-tasks can also be related. Inspired by this observation, we employ the sibling feeding mechanism to feed the preceding sibling sentence embedding to the sub-task related to its closest neighbor sub-tree node. For example, if $p_1$ is the parent node of $c_1$ and $c_2$, we feed the embeddings of both $p_1$ and $c_1$ when decoding $c_2$. Our sub-decoder therefore calculates the decoder hidden state $s_t$ from $y_{t-1}$, $s_{t-1}$, $st_{parent}$ and $st_{sibling}$, where $st_{parent}$ stands for the sub-tree node embedding from the parent layer and $st_{sibling}$ is the sentence embedding of the closest preceding sibling. By fully utilizing the information from parent nodes and sibling nodes, our tree decoder can effectively generate target hierarchical outputs.
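A single decoding step with parent and sibling feeding might look like the sketch below. It is a simplification under stated assumptions: a plain tanh RNN cell stands in for the LSTM, the three inputs are combined by concatenation, and a zero vector is used when no preceding sibling exists. The function name and all parameter names are illustrative.

```python
import numpy as np


def sub_decoder_step(y_prev, s_prev, st_parent, st_sibling, w_in, w_rec, b):
    """One decoder step that sees the previous token, parent and sibling.

    A tanh RNN cell replaces the LSTM here, and concatenation of the three
    inputs is an assumption about how the extra embeddings are fed in.
    """
    x = np.concatenate([y_prev, st_parent, st_sibling])
    return np.tanh(w_in @ x + w_rec @ s_prev + b)


rng = np.random.default_rng(0)
d_emb, d_hid = 4, 5
w_in = rng.normal(size=(d_hid, d_emb + 2 * d_hid))
w_rec = rng.normal(size=(d_hid, d_hid))
b = np.zeros(d_hid)

y_prev = rng.normal(size=d_emb)      # embedding of the previous output token
s_prev = np.zeros(d_hid)             # previous decoder hidden state
st_parent = rng.normal(size=d_hid)   # sub-tree node embedding from the parent
no_sibling = np.zeros(d_hid)         # first child: no preceding sibling (assumed zeros)
st_sibling = rng.normal(size=d_hid)  # embedding of the closest preceding sibling

s_first_child = sub_decoder_step(y_prev, s_prev, st_parent, no_sibling, w_in, w_rec, b)
s_second_child = sub_decoder_step(y_prev, s_prev, st_parent, st_sibling, w_in, w_rec, b)
```

Because the sibling embedding enters every step, decoding the second child is conditioned on what was generated for the first, which is the dependency the text attributes to sibling feeding.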

Separate Attention Mechanism to Locate Source Sub-graph
Various attention mechanisms have been proposed (Bahdanau et al., 2014; Luong et al., 2015) to take the hidden vectors of the inputs into account during decoding. In particular, the context vector $s_t$ depends on a set of bidirectional node representations of the source graph $(z_1, \ldots, z_{|V|})$, with which the decoder locates the source sub-graph. Since our graph input is essentially a heterogeneous graph with two different input sources (word nodes and relationship nodes of a parsing tree), we propose to employ a separated attention mechanism over the node representations corresponding to the different node types, computing attention weights separately for all $v \in V_1$ (11) and for all $v \in V_2$ (12), where the score(·) function estimates the similarity of $z_v$ and $s_t$. Then, we compute the context vectors $c_{v_1}$ and $c_{v_2}$, respectively.
We concatenate the context vector $c_{v_1}$, the context vector $c_{v_2}$ and the decoder hidden state $s_t$ to compute the final attention hidden state $\tilde{s}_t$ at this time step, where $W_c$ and $b_c$ are learnable parameters. The final context vector $\tilde{s}_t$ is further used for decoding tree-structured outputs. The output probability distribution over a vocabulary at the current time step is calculated by:
$p(y_t \mid y_1, y_2, \ldots, y_{t-1}, g) = \mathrm{softmax}(W_v \tilde{s}_t + b_v)$ (16)
where $W_v$ and $b_v$ are learnable parameters. Our model is then jointly trained to maximize the conditional log-probability of the target tree given a heterogeneous graph input $g$.
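The separated attention described above can be sketched in NumPy as follows. The sketch makes several assumptions that are not stated in the text: a dot-product score function, softmax normalization within each node type, a tanh over the concatenated vector, and node embeddings that share the decoder state's dimension. The names `alpha`, `beta`, and `separate_attention` are illustrative.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def separate_attention(z, is_word, s_t, w_c, b_c, w_v, b_v):
    """Attend over word nodes and relationship nodes separately.

    z: (n, d) node embeddings; is_word: boolean mask of length n;
    s_t: (d,) decoder hidden state. Dot-product scoring is an assumption.
    """
    z_word, z_rel = z[is_word], z[~is_word]
    alpha = softmax(z_word @ s_t)  # weights over word nodes (V1)
    beta = softmax(z_rel @ s_t)    # weights over relationship nodes (V2)
    c_v1 = alpha @ z_word          # context vector from word nodes
    c_v2 = beta @ z_rel            # context vector from relationship nodes
    # Final attention hidden state from the concatenation [c_v1; c_v2; s_t].
    s_tilde = np.tanh(w_c @ np.concatenate([c_v1, c_v2, s_t]) + b_c)
    # Output distribution over the vocabulary, as in Eq. (16).
    probs = softmax(w_v @ s_tilde + b_v)
    return alpha, beta, probs


rng = np.random.default_rng(0)
d, vocab = 4, 6
z = rng.normal(size=(5, d))
is_word = np.array([True, True, True, False, False])
s_t = rng.normal(size=d)
alpha, beta, probs = separate_attention(
    z, is_word, s_t,
    rng.normal(size=(d, 3 * d)), np.zeros(d),
    rng.normal(size=(vocab, d)), np.zeros(vocab),
)
```

Normalizing within each node type keeps relationship nodes from competing with word nodes for attention mass, so each source contributes its own context vector.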

Experiments
In this section, we evaluate the effectiveness and generality of our Graph2Tree model on two important tasks: Semantic Parsing and Math Word Problem. The code and data for our Graph2Tree model are provided for research purposes.

Experiments for Semantic Parsing
Datasets. We evaluate our Graph2Tree on three different benchmark datasets for the semantic parsing task: JOBS (Zettlemoyer and Collins, 2005), GEO (Zettlemoyer and Collins, 2005), and ATIS (Dahl et al., 1994). JOBS is a set of 640 queries from a job listing database, GEO is a set of 880 queries on a database of U.S. geography, and ATIS is a dataset of 5410 queries from a flight booking system. We use the same train/dev/test splits as previous works and adopt the data preprocessing provided by Dong and Lapata (2016). Natural language utterances are lowercased and stemmed, and entity mentions are replaced by numbered markers. For the graph construction, we use the dependency parser and constituency parser from CoreNLP.
Settings. We use the Adam optimizer (Kingma and Ba, 2014) with a batch size of 20. For the JOBS and GEO datasets, our hyper-parameters are cross-validated on the training sets. For ATIS, we tune them on the development set. The learning rate is set to 0.001. In the graph encoder, the BiRNN is a one-layer BiLSTM with a hidden size of 150, and the hop size in the GNN is chosen from {2,3,4,5,6}. The decoder is a one-layer LSTM with a hidden size of 300. The dropout rate is chosen from {0.1,0.3,0.5}.
Baselines. We compare our model against several state-of-the-art neural semantic parsers: i) a Seq2Seq model with a copy mechanism (Jia and Liang, 2016); ii) Seq2Seq and Seq2Tree models (Dong and Lapata, 2016); iii) a Graph2Seq model (Xu et al., 2018a). We report the exact-match accuracy for each baseline on all three benchmarks.
Results. Table 2 shows that our proposed Graph2Tree outperforms or achieves comparable exact-match accuracy compared to other state-of-the-art baselines, which highlights the benefit of fully exploiting the structural information in both inputs and outputs.
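For reference, exact-match accuracy can be computed along the following lines. This is a generic sketch, not the evaluation script used in the experiments: comparing whitespace-tokenized strings is an assumption, and the function name and the sample logical forms are made up for illustration.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose token sequence equals the reference.

    Comparison after whitespace tokenization is an assumption; a real
    evaluation script may normalize logical forms differently.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        return 0.0
    hits = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return hits / len(references)


preds = ["answer ( A , job ( A ) )", "answer ( A ,  title ( A , W ) )"]
refs = ["answer ( A , job ( A ) )", "answer ( A , language ( A , C ) )"]
accuracy = exact_match_accuracy(preds, refs)
```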

Case study. Next we analyze the different decoding results of all models for an example case in Table 3. The challenge in semantic parsing is the high-order neighborhood estimation of the noun key word "jobs" to its attribute words "windows" and "san antonio". It is hard for a traditional sequence encoder to encode a high-order neighborhood (long-range dependency). For instance, there are 10 hops between the words "jobs" and "windows" according to the sequential dependency, while there are only two hops if we introduce the syntactic dependency information. Therefore, a syntactic graph with a graph encoder is an effective way to learn a high-quality representation for decoding. This partially explains why our Graph2Tree model outperforms the Seq2Seq and Seq2Tree models.
Table 4: Ablation study of Graph2Tree on semantic parsing (JOBS and GEO). We employ exact-match accuracy as the evaluation metric.
Ablation study. Table 4 presents the ablation study on our Graph2Tree using a constituency tree based graph on the SP datasets JOBS and GEO. It is conducted on the test sets, since JOBS and GEO have no dev set. First, we observe that the syntactic information in the constituency tree, which helps describe word relationships, is critical to our overall performance. We also find that our bidirectional GraphSAGE, which encodes from both forward and backward nodes according to edge direction, improves the final performance. Furthermore, the parent feeding and sibling feeding mechanisms, which enrich the paternal and fraternal information in decoding, also play important roles in the whole model. In addition, the separate attention mechanism, designed for the different types of nodes in the input graph, proves useful in our model. Last but not least, it is also necessary to use a BiLSTM in the encoder to learn contextualized word embeddings from the word sequences.

Experiments for Math Word Problems
Datasets. We evaluate our Graph2Tree model on two benchmark datasets for automatic math word problem solving: MAWPS (Koncel-Kedziorski et al., 2016) and MATHQA (Amini et al., 2019). MAWPS is an English math word problem dataset that contains 2373 pairs after harvesting equations with a single unknown variable. MATHQA is a recently proposed large-scale math word problem dataset with 37k English pairs, where each math expression corresponds to an annotated formula for better interpretability. This dataset is more difficult because it covers complex multivariate problems.
Baselines. We compare our Graph2Tree model against several state-of-the-art methods. We report the solution accuracy of each baseline on the test set. On MAWPS, our baselines are: i) Retrieval, Classification, and Seq2Seq (Robaidek et al., 2018); ii) Seq2Tree (Dong and Lapata, 2016); iii) Graph2Seq (Xu et al., 2018a); iv) MathDQN; v) T-RNN; vi) Group-Att (Li et al., 2019). On MATHQA, our baselines are: i) Sequence-to-program (Amini et al., 2019); ii) TP-N2F (Chen et al., 2019a); iii) Seq2Seq, Seq2Tree and Graph2Seq.
Table 6: Solution accuracy comparison on MATHQA.
... so far on this MAWPS benchmark. We have observed similar conclusions on a more challenging and larger dataset, MATHQA. This highlights the importance of our Graph2Tree neural networks, which can leverage the structured information from both inputs and outputs for automatic solving of math problems. It is worth noting that our hierarchical tree decoder directly generates the original mathematical expressions, which faithfully reflect the reasoning steps taken when building math equations. In contrast, state-of-the-art math word problem solvers like Group-Att (Li et al., 2019) or T-RNN have achieved high performance by utilizing Equation Normalization (EN) to keep the structures of output equations unified. This method can improve solution accuracy because it reduces the difficulty of equation generation. On the other hand, the normalized equations completely lose the semantic meaning of operands and operators, making it difficult to reason about how the answer equations are built.
Attention visualization. For a better understanding of our separated attention, we give a visualization sample from MAWPS. As shown in Figure 5(a), we give an augmented graph input and equation tree, where N is a sub-tree node and 1, 2 are indexed markers for the original numbers. Specifically, Figures 5(b) and 5(c) illustrate the alignments with word nodes and compositional nodes in the graph input, respectively.
For example, in Figure 5(c), the equation part "2 * 1" is matched with "a bee has 2 legs" in the original natural language sentence, which is semantically connected with "NP" and "VP" in the constituency tree.
Ablation study. Similarly, we also perform an ablation study for the math word problem (MAWPS), as shown in Table 7. It is conducted on the dev set. The attention mechanism, the constituency structure and the other components of our model play significant roles in Graph2Tree achieving high performance in MWP solving, which is consistent with our observation in the semantic parsing task. However, it is worth noting that, according to the experiment, the sibling mechanism is clearly more important to the MWP task than to the semantic parsing task, which is in line with our expectations. In the MWP task, the result of decoding, a math expression, is relatively simple compared to semantic parsing. In math expressions, the order between leaf nodes (numbers), which directly affects the correctness of expressions, is very important, and the sibling mechanism plays exactly such a role. One potentially interesting extension is to connect leaf nodes in the input graph and employ edge weights to dynamically represent the order between the nodes, which may achieve a similar or even better effect than the sibling mechanism.
Table 7: Ablation study of Graph2Tree on the math word problem (MAWPS). We employ solution accuracy as the evaluation metric. The method settings are the same as in Table 4.

Conclusion and Future Work
We presented a novel Graph2Tree model, consisting of a graph encoder and a hierarchical tree decoder, for learning the translation between structured inputs and structured outputs. Studies on two tasks, Semantic Parsing and Math Word Problem, demonstrated that our model consistently outperformed or matched the performance of the state of the art. Our Graph2Tree model is generic and agnostic to the downstream task, so one direction for future work is to adapt it to other NLP applications.