Tree-structured Decoding for Solving Math Word Problems

Automatically solving math word problems is an interesting research topic that needs to bridge natural language descriptions and formal math equations. Previous studies introduced end-to-end neural network methods, but these approaches did not efficiently consider an important characteristic of the equation, i.e., an abstract syntax tree. To address this problem, we propose a tree-structured decoding method that generates the abstract syntax tree of the equation in a top-down manner. In addition, our approach can automatically stop during decoding without a redundant stop token. The experimental results show that our method achieves single model state-of-the-art performance on Math23K, which is the largest dataset on this task.


Introduction
Math word problem (MWP) solving is the task of obtaining a solution from math problems that are described in natural language. One example of MWP is illustrated in Figure 1. Early approaches rely on either predefined rules or statistical machine learning-based methods to map the problems into several predefined templates in classification style or retrieval style. The major drawback of these approaches is that they are inflexible for new templates and require extra effort to design rules, features and templates.
Modeling the tree structure of math equations has been considered an important factor when building models for MWP. As shown in Figure 1, each equation could be transformed into an abstract syntax tree (AST). Roy and Roth (2017) built an expression tree and combined two classifiers for quantity relevance prediction and operator classification. Koncel-Kedziorski et al. (2015) designed a template ranking function based on the AST of the equations. However, these approaches are based on traditional methods and require feature engineering.
* This denotes equal contribution.
[Figure 1. Problem: The distance between the two places A and B is 660 km, the car starting from A drives 32 km/h, and the car starting from B drives 34 km/h. The two cars are starting from the two places at the same time in inverse direction. How many hours later would the two cars meet? Panels: Equation, AST.]
Recently, the appearance of large-scale datasets and the development of neural generative models have opened a new research line for MWP. Wang et al. (2017) cast this task as a sequence generation problem and used a sequence-to-sequence (seq2seq) model to learn the mapping from natural language text to a math equation. Recent approaches use the Reverse Polish notation of equations, also called suffix notation, in which operators follow their operands, to implicitly model the tree structure (Wang et al., 2018a; Chiang and Chen, 2018). However, despite their promising results, these studies lost sight of important information in the math equation ASTs (e.g., the parent and siblings of each node). Thus, their models have to spend extra effort to memorize various pieces of auxiliary information, such as the sibling node of the current step, and to learn the implicit tree structure in the sequence.
Meanwhile, similar tasks such as semantic parsing and code generation, which also try to convert natural language into a well-formed, executable, tree-structured output, first decode the function or operator at the root node before decoding the variables. Such top-down decoding matches the Polish notation, also called the prefix order, of the AST of math equations, shown in Figure 1. Humans also reason in a similar order: they usually first determine the function before filling in the variables, not the reverse. Thus we present a top-down hierarchical sequence-to-tree (seq2tree) model which explicitly models the tree structure. This decoder considers the prefix order of the equation and feeds the information of the parent and sibling nodes during decoding.
Another advantage of this system is that it can use a stack to monitor the decoding process and end automatically. By pushing newly generated nodes onto the stack and popping out the completed subtrees, the decoding process ends naturally when the tree is completed and the stack is empty. There is no need for the special end-of-sequence (EOS) token that is used in ordinary text sequence generation. Without the EOS token, the model is more likely to generate valid equation answers.
In summary, the contributions of this work are as follows:
1. We design a hierarchical seq2tree model that can better capture information of the AST of the equation.
2. We are the first to use prefix order decoding for MWP.
3. We abandon the EOS token for end-to-end equation generation.
4. Our model outperforms previous systems for solving math word problems. On the large-scale dataset Math23K, we achieve state-of-the-art single model performance.

Related Work
Our work synthesizes two strands of research, which are math word problems and seq2tree encoder-decoder architectures.

Math Word Problems
Early approaches hand engineered rule-based systems to solve MWPs (Mukherjee and Garain, 2008). The rules could only cover a limited domain of problems, while math word problems are flexible in real-world settings.
There are currently three major research lines in solving MWPs. The first research line maps a problem text into logical forms, and then uses the logical forms to obtain the equation (Shi et al., 2015; Roy and Roth, 2015; Huang et al., 2017; Roy and Roth, 2018). Shi et al. (2015) defined a Dolphin language to connect math word problems and logical forms. The major drawback is that it requires extra human annotation for the logical forms.
Another research line uses either a retrieval or classification model to select a template, and then fills in the slots with quantities.  first introduced the idea of 'equation template'. For example, $x = 6 * 7$ and $x = 10 * 5$ belong to the same template $x = n_1 * n_2$. They collected the first dataset of this task, ALG514, which contained 514 samples. They proposed a two-step pipeline model, which first used a classifier to select a template and then mapped the numbers into the slots. One major drawback is that it cannot solve problems beyond the templates in the training data. This two-step pipeline model was further extended with tree-based features, ranking-style retrieval models and so on (Upadhyay and Chang, 2017; Roy and Roth, 2017). Huang et al. (2016) released the first large-scale dataset, Dolphin18K, and trained a similar system on it.
The third research line directly generates the equation. Hosseini et al. (2014) cast the problem into a State Transition Diagram of verbs and trained a binary classifier that could solve problems with only addition and subtraction operators. Wang et al. (2017) first used a seq2seq model to directly generate the equation template and released a high-quality large-scale Chinese dataset, Math23K. Reinforcement learning was used to further improve the seq2seq framework. Wang et al. (2018b) first extended the seq2seq model by decoding the suffix order sequence of the equations. Wang et al. (2018a) introduced equation normalization techniques that address the duplicated template problem. Chiang and Chen (2018) used the copy mechanism to improve the semantic representation of quantities. However, in these works the model needs extra effort to learn the implicit tree structure and memorize tree information in the sequence. Wang et al. (2019) proposed a two-step pipeline method that first generates a template with unknown operators, and then uses a recursive neural network to combine tree structure information and predict the operators. However, the topology of the AST is determined in the first step without tree structure information. To the best of our knowledge, we are the first to explicitly give the model guidance from parent and sibling nodes.

Seq2Tree Architectures
Seq2Tree-style encoder-decoders are mainly used in two fields, both of which try to bridge natural language and a tree-structured output. Semantic parsing is the task of translating natural language text into formal logical forms or structured queries. Code generation maps a program description to programming language source code. Dong and Lapata (2016) first used a seq2tree model based on recurrent neural networks (RNNs) for semantic parsing and outperformed the seq2seq model. One drawback is that their generation is at the token level, so it cannot guarantee that the result is syntactically correct. Grammar rules were used to solve this problem. Another drawback is that they needed special tokens for predicting branches, which are not necessary for MWPs because all operators are binary operators. A similar framework is also used in code generation (Zhang et al., 2016; Yin and Neubig, 2017). Alvarez-Melis and Jaakkola (2017) presented doubly recurrent neural networks to predict tree topology explicitly. Rabinovich et al. (2017) presented an abstract syntax network that combines edge information for code generation. Convolutional neural networks (CNNs) were used for code generation decoding because the output program is much longer than the outputs in semantic parsing and MWPs, and RNNs suffer from the long-term dependency problem (Sun et al., 2018).

Model
Our model consists of two stages as shown in Figure 2: the encoder stage that encodes the input natural language into a sequence of representation vectors and the decoder stage that receives these vectors and decodes the AST of the equation.

Table 1: One example of a prefix template.
Equation | x = 130 * (1 - 0.8)
Quantities | $n_1$: 130, $n_2$: 0.8
Template | $n_1$ * (1 - $n_2$)
Prefix template | * $n_1$ - 1 $n_2$

Preprocessing
Significant Number Identification A Significant Number Identification (SNI) unit is used for reducing the noise in the input numbers. Significance refers to whether the number appears in the equation. In MWPs, it is very common that the input text contains irrelevant numbers, such as dates or descriptive text such as 'third grade student'. We follow Wang et al. (2017) and simply use whether the numbers appear in the equation as gold labels and build an LSTM-based binary classification model to determine the significance of the input numbers. The accuracy of this unit is 99% and thus it can effectively reduce the noise.

Prefix Order Equation Template
For the output equations, we first turn them into prefix order equation templates before using them to train the model. Table 1 shows one example of a prefix template. Given a problem-equation pair, we first build the mapping between the numbers in the problem and those in the equation, where $n_i$ denotes the $i$-th number in the problem after SNI. We use this mapping to convert the equation into a template by replacing the numbers with $n_i$ tokens, and finally convert the template into prefix order.
Note that an equation can be mapped to more than one AST. For example, $n_1 + n_2 + n_3$ could be mapped to an AST with either the first + or the second + as the root node, and the prefix order notation differs between the two cases. We assume the first operator is the root node of the AST here. Further details are shown in the appendix.
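The conversion from an equation to a prefix template can be sketched as follows. This is a minimal illustration rather than the authors' preprocessing code: the function names, the `n_i` token spelling, and the use of standard left-associative grouping for chains of equal-precedence operators are all assumptions (the paper's own rule for choosing the root in such chains is given in its appendix and may differ). The example reproduces the row values of Table 1.

```python
import re

PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}

def tokenize(equation):
    """Split an infix expression into numbers, operators and parentheses."""
    return re.findall(r"\d+\.\d+|\d+|[-+*/()]", equation)

def to_template(tokens, problem_numbers):
    """Replace each number that occurs in the problem text with its n_i token."""
    mapping = {num: f"n_{i}" for i, num in enumerate(problem_numbers, start=1)}
    return [mapping.get(tok, tok) for tok in tokens]

def parse(tokens):
    """Build a nested (op, left, right) tuple by precedence climbing."""
    pos = 0

    def expr(min_prec):
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            left = expr(1)
            pos += 1  # skip the closing ")"
        else:
            left = tok
        while pos < len(tokens) and PRECEDENCE.get(tokens[pos], 0) >= min_prec:
            op = tokens[pos]
            pos += 1
            # Assumption: equal-precedence operators group to the left.
            right = expr(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    return expr(1)

def prefix(node):
    """Pre-order traversal of the AST: operator first, then its operands."""
    if isinstance(node, tuple):
        op, left, right = node
        return [op] + prefix(left) + prefix(right)
    return [node]

# Numbers found in the problem text after SNI (hypothetical input).
tokens = to_template(tokenize("130 * ( 1 - 0.8 )"), ["130", "0.8"])
print(" ".join(prefix(parse(tokens))))  # * n_1 - 1 n_2
```

The constant 1 is not a number from the problem text, so it stays as a literal token in the template.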

Equation Normalization
One math word problem can be solved by different but equivalent equations, which introduces noise during training. For example, $10 - (8 + 5)$ and $10 - 8 - 5$ are equivalent, but the templates $n_1 - (n_2 + n_3)$ and $n_1 - n_2 - n_3$ are different. This problem is called the equation duplication problem. We follow Wang et al. (2018a) and use several rules for equation normalization to alleviate the equation duplication problem. The rules are listed below.
Figure 2: Framework of our seq2tree model. The blue blocks refer to the encoder. The yellow blocks refer to the decoder. The green blocks refer to the auxiliary stack.
• If one long equation template could be converted into a shorter one, then it should be shortened. For example, $n_1 + n_2 + n_3 + n_3 - n_3$ and $n_1 + n_2 + n_3$ are equivalent. In this case the former should be normalized to the latter.
• The order of number tokens in the template should follow their occurrence order in the problem text as much as possible. For example, $n_1 + n_3 + n_2$ should be normalized as $n_1 + n_2 + n_3$.
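The second rule can be illustrated with a small sketch. It only covers the simplest case, a flat chain of one commutative operator over number tokens, and leaves every other template untouched; the function name, the space-separated infix input and the `n_i` token spelling are assumptions made for this illustration, not the normalization code of Wang et al. (2018a).

```python
import re

def reorder_commutative_chain(template):
    """Sort the operands of a flat '+' or '*' chain by their index in the problem text.

    Templates that mix operators or contain parentheses are returned unchanged.
    """
    tokens = template.split()
    operands, operators = tokens[0::2], tokens[1::2]
    is_flat_chain = (
        len(set(operators)) == 1
        and operators[0] in {"+", "*"}
        and all(re.fullmatch(r"n_\d+", tok) for tok in operands)
    )
    if not is_flat_chain:
        return template
    operands.sort(key=lambda tok: int(tok[2:]))  # n_3 -> 3
    return f" {operators[0]} ".join(operands)

print(reorder_commutative_chain("n_1 + n_3 + n_2"))  # n_1 + n_2 + n_3
print(reorder_commutative_chain("n_1 - n_3 + n_2"))  # unchanged: mixed operators
```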

Encoder
A bidirectional Long Short-Term Memory network (BiLSTM) is an effective method to encode sequential information. Formally, given an input math word problem sentence $x = \{x_t\}_{t=1}^{n}$, we first embed each word into a vector $e_t$. Then these embeddings are fed into a BiLSTM layer to model the sequential information.
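A standard BiLSTM formulation that is consistent with the symbols used in this section can be written as below. The exact form is an assumption: the text only defines $e_t$, $h^f_t$, $h^b_t$ and their concatenation $h_t$, and the $\mathrm{LSTM}(\cdot)$ notation for a single recurrent step is introduced here for illustration.

```latex
\begin{align}
h^f_t &= \mathrm{LSTM}\left(e_t,\, h^f_{t-1}\right) \\
h^b_t &= \mathrm{LSTM}\left(e_t,\, h^b_{t+1}\right) \\
h_t   &= \left[\, h^f_t \,;\, h^b_t \,\right]
\end{align}
```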
Here $h_t$, the representation of the $t$-th word, is the concatenation of the hidden states $h^f_t$ and $h^b_t$, which are from the forward and backward LSTMs, respectively. These representation vectors are then fed into the following decoder stage.

Tree-structured Decoder
For decoding, we follow Dong and Lapata (2016) and build a top-down hierarchical tree-structured decoder. However, their model is built for semantic parsing, and some of its components are unnecessary for equation template decoding. Benefiting from the fact that operator nodes must have two children and number nodes must be leaf nodes, we build a decoder that is specialized for math equation ASTs, as shown in Figure 3. It extends a vanilla sequence-based LSTM decoder by using tree-based information feeding as the input, together with an auxiliary stack that tells the model which token to decode next and ends the decoding process automatically without a special token.

Tree-based Information Feeding
The input of each time step consists of three parts: parent feeding, sibling feeding and previous token feeding.
The parent feeding $h_{parent}$ refers to using the LSTM hidden state of the parent node as the input when decoding children nodes, which is shown as the orange solid line in Figure 3. This informs the model of the parent node status. For the root node, this part of the input is padded with zeros.
The sibling feeding $e_{sibling}$ refers to using the embedding of the left sibling node as the input when decoding the right sibling node, which is shown as the yellow dotted line in Figure 3. This informs the model whether we are decoding the left or the right sibling. For the root node, we use a special token s for sibling feeding. For the left-most node, we use a special token ( for sibling feeding.
The previous token feeding $e_{prev}$ refers to using the previous token in prefix order as the input when decoding the next token, which is shown as the blue dotted line in Figure 3. This informs the model of which part of the tree has already been decoded. For the root node, we also use the special token s for previous token feeding.
At time step $t$, the input $e^d_t$ of the LSTM unit is the concatenation of these three components.
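To make the three feeding signals concrete, the sketch below walks over a prefix template and lists, for every decoding step, which earlier step supplies the parent hidden state, which token is fed as the sibling, and which token is fed as the previous token. It works on token strings only; in the model these would be looked up as hidden states and embeddings and concatenated into $e^d_t$. The function name and the returned tuple layout are assumptions made for this illustration.

```python
OPERATORS = frozenset("+-*/")
ROOT_TOKEN = "s"        # special token used at the root, as named in the text
LEFTMOST_TOKEN = "("    # special sibling token for a left-most child

def feeding_inputs(prefix_tokens):
    """Return (token, parent_step, sibling_token, previous_token) for each step.

    parent_step is the index of the step that decoded the parent node
    (None for the root, where the parent input is zero-padded).
    """
    steps = []
    open_nodes = []  # stack of [step index of operator, children decoded so far]
    prev = ROOT_TOKEN
    for step, tok in enumerate(prefix_tokens):
        if not open_nodes:
            parent_step, sibling = None, ROOT_TOKEN
        else:
            parent_step, children = open_nodes[-1]
            sibling = children[0] if children else LEFTMOST_TOKEN
            children.append(tok)
        steps.append((tok, parent_step, sibling, prev))
        if tok in OPERATORS:
            open_nodes.append([step, []])
        else:
            # Close every operator whose two children are now decoded.
            while open_nodes and len(open_nodes[-1][1]) == 2:
                open_nodes.pop()
        prev = tok
    return steps

for row in feeding_inputs(["-", "+", "n_1", "n_2", "n_3"]):
    print(row)
```

In this example the last token `n_3` is a right child of the root, so its sibling input is the root's left child `+` while its previous-token input is `n_2`; this is the kind of information a purely sequential decoder would have to memorize.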

Tree-Structured LSTM
The tree-structured decoder uses an LSTM to generate the equation template in a top-down manner, shown as the grey solid line in Figure 3, and uses an auxiliary stack to guide the decoding process. Given the input $e^d_t$ described in the previous section, we generate the output token $y_t$ with one LSTM layer and one multi-layer perceptron (MLP) layer.
As shown in Algorithm 1 and Figure 3, if the predicted token $y_t$ is an operator, we push it onto the stack $S$ and go on to predict its left child node. If $y_t$ is a quantity, we check the stack to determine which token needs to be decoded next. If the top of the stack is an operator, we push $y_t$ onto the stack and go on to decode the right sibling of the current node. If the top of $S$ is a quantity, we follow Algorithm 1 to find the next position to decode: we push $y_t$ onto the stack and pop the top three tokens, which should be op num num. These three tokens form a subtree $t$, which we regard as one quantity unit in the following process. We then examine the stack again: if its top is still a quantity, we push $t$ back and repeat the merge until the top of the stack is no longer a quantity. When the loop stops, if the top of the stack is an operator, we push $t$ back and continue to decode the operator's right child node; if the stack is empty, the AST is complete and the decoding process ends. In the latter case we still push the tree unit back onto the stack and treat the state in which the stack contains exactly one quantity unit as the ending condition, which corresponds to line 2 in Algorithm 1. In this way, the case in which the first generated token is already the answer number is covered by the same condition. With the help of this stack, we can guide the decoding process, including which token to generate next and when to stop, without any special tokens.
We show one example of the decoding process with the auxiliary stack in Figure 4. The upper half shows the status of the stack, where the solid blocks stand for the inserted tokens. The lower half shows the status of the AST during decoding, where the solid lines stand for the generated nodes and the dotted line stands for the node that should be generated in the next step. The decoder first generates −, +, $n_1$ and $n_2$, which form a complete subtree in the AST. This subtree (+, $n_1$, $n_2$) is popped out of the stack and pushed back as one unit, + $n_1$ $n_2$. The model then continues to predict the sibling node of the subtree's root node, which is the dotted circle in Figure 4. The top three items of the stack now form a complete subtree again and are popped out of the stack. The stack is now empty, so we push the subtree back; the stack then contains only one quantity unit, the decoding process ends, and the equation template is popped out.
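The stack procedure described above can be sketched in a few lines. This is an illustrative reading of the prose, not a transcription of Algorithm 1: the function names, the `max_steps` guard, and the scripted `next_token` callback that stands in for the LSTM and MLP prediction are all hypothetical, and the third number token `n_3` in the example is an assumption.

```python
OPERATORS = {"+", "-", "*", "/"}

def is_quantity(item):
    """A quantity unit is a number token or an already completed subtree."""
    return isinstance(item, tuple) or item not in OPERATORS

def stack_guided_decode(next_token, max_steps=64):
    """Decode a prefix template, stopping when the stack holds one quantity unit.

    next_token(tokens_so_far) stands in for the model's prediction step.
    Returns the generated tokens and the completed AST as nested tuples.
    """
    stack, tokens = [], []
    for _ in range(max_steps):
        y = next_token(tokens)
        tokens.append(y)
        stack.append(y)
        # Merge every completed (op, quantity, quantity) triple into one unit.
        while (len(stack) >= 3 and is_quantity(stack[-1])
               and is_quantity(stack[-2]) and not is_quantity(stack[-3])):
            right, left, op = stack.pop(), stack.pop(), stack.pop()
            stack.append((op, left, right))
        # Ending condition: exactly one quantity unit is left on the stack.
        if len(stack) == 1 and is_quantity(stack[0]):
            return tokens, stack[0]
    raise RuntimeError("tree not completed within max_steps")

# Scripted predictor; the trailing token is never requested because decoding
# stops as soon as the tree is complete.
script = ["-", "+", "n_1", "n_2", "n_3", "unused"]
tokens, tree = stack_guided_decode(lambda so_far: script[len(so_far)])
print(tokens)  # ['-', '+', 'n_1', 'n_2', 'n_3']
print(tree)    # ('-', ('+', 'n_1', 'n_2'), 'n_3')
```

Because termination is decided by the stack rather than by a predicted token, a single number as the first output also ends decoding immediately, which matches the unified ending condition described in the text.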

Attention Mechanism
Attention mechanisms have shown their effectiveness in various natural language processing tasks. We extend the decoder with an attention mechanism by adjusting Equation 4. Instead of directly using the hidden state $h^d_t$ to predict the output token $y_t$, we consider relevant information from the input vectors to better predict $y_t$. Formally, given the LSTM hidden state $h^d_t$ and the encoder outputs $\{h_t\}_{t=1}^{n}$, we calculate attention weights $\alpha^i_t$ over the encoder outputs and an attention vector $s_t$.
In lieu of Equation 4, we use the attention vector $s_t$, which considers the relevance of the encoder information, to predict the output token $y_t$.
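As an illustration of how attention weights and a context vector can be computed, the sketch below uses plain dot-product scoring followed by a softmax. Which scoring function the model actually uses is not stated in this text, so dot-product scoring is an assumption, as are the function and variable names; the learned layer that would combine the context vector with $h^d_t$ into the attention vector $s_t$ is omitted.

```python
import math

def dot_product_attention(decoder_state, encoder_outputs):
    """Return softmax attention weights and the weighted sum of encoder outputs."""
    scores = [sum(d * h for d, h in zip(decoder_state, h_i))
              for h_i in encoder_outputs]
    max_score = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(score - max_score) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * h_i[k] for w, h_i in zip(weights, encoder_outputs))
               for k in range(len(decoder_state))]
    return weights, context

# Toy vectors: the first encoder output is aligned with the decoder state.
weights, context = dot_product_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```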

Experiments
To demonstrate the effectiveness of our model, we conduct experiments on the Math23K dataset. Our method achieves the state-of-the-art (SOTA) single model performance and also exceeds the previous ensemble model SOTA.

Dataset
Math23K Math23K is a large-scale Chinese MWP dataset that contains 23,162 math problems and math equation solutions. The questions are at elementary school level. Every question is linear and contains only one unknown variable.
Although there are other large-scale datasets such as Dolphin18K and AQuA, which are in English, they either contain many small typos (e.g., using x to represent * ) or contain wrong answers and templates. Other datasets such as ALG514 and MAWPS are much smaller. Therefore, we decide to conduct experiments on Math23K, which is the only large-scale, clean and high-quality dataset.

Implementation Details
The embedding vectors are pretrained on the training set with the word2vec algorithm. The dimension of the embedding is 128. We use a two-layer BiLSTM with a hidden size of 512 for the encoder. The decoder is a two-layer LSTM with a hidden size of 1024. We use a teacher forcing ratio of 0.83 during training. We use cross entropy as the loss function and Adam to optimize the parameters. We also use dropout to avoid over-fitting. The batch size is 128.
Table 2 shows the results of our system and other recent MWP systems on the Math23K test set. The retrieval-style models compare a question in the test set with the questions in the training set, choose the template that has the highest similarity, and then fill the numbers into the template (Upadhyay and Chang, 2017; Robaidek et al., 2018). The classification-style models train a classifier to select an equation template, and then map the numbers into the template. For retrieval and classification models, we use the results of Robaidek et al. (2018). The generation models use end-to-end encoder-decoder systems to generate an equation template and then fill in the numbers.

Table 2, ensemble models (results on the Math23K test set).
Model | Result
DNS+Retrieval (Wang et al., 2017) | 64.7%
DNS+suffix+EN Ensemble (Wang et al., 2018a) | 68.4%
T-RNN+Retrieval (Wang et al., 2019) | 68.7%

Our seq2tree model is also a generation-style model. As shown in Table 2, we achieve state-of-the-art single model performance on the test set, and even better results than all the previous ensemble models, which demonstrates the effectiveness of our proposed method.

Table 4: Percentage of invalid templates with different terminating methods.
Model | Invalid Templates
EOS as terminator | 1.3%
Stack as terminator | 0.2%
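Whether a generated prefix template forms a complete tree can be checked mechanically. The sketch below assumes that "invalid" means the token sequence does not form exactly one complete binary tree; the paper does not spell out its criterion in this section, so this definition and the function name are assumptions.

```python
def is_valid_prefix(tokens, operators=frozenset("+-*/")):
    """Check that a prefix token sequence forms exactly one complete binary tree."""
    open_slots = 1  # positions in the tree that still need a node
    for tok in tokens:
        if open_slots == 0:
            return False  # extra tokens after the tree is already complete
        # An operator fills one slot and opens two; a number fills one slot.
        open_slots += 1 if tok in operators else -1
    return open_slots == 0

print(is_valid_prefix(["*", "n_1", "-", "1", "n_2"]))  # True
print(is_valid_prefix(["*", "n_1"]))                   # False: right child missing
```

A decoder that relies on an EOS token can stop with open slots remaining or continue after the tree is complete, which are the two failure cases this check detects.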

Ablation Study
To get better insight into our seq2tree system, we conduct an ablation study on the Math23K development set, which is shown in Table 3. The prefix baseline denotes the model that removes parent feeding and sibling feeding and uses only previous token feeding for the input. Thus, this model loses parent and sibling information and reduces to a linear seq2seq model based on the prefix notation. The prefix baseline achieves results competitive with the previous single model SOTA (66.9%), which supports the effectiveness of top-down decoding. Parent feeding and sibling feeding separately improve the baseline model by 1.1% and 1.0%, respectively, demonstrating the importance of informing the model of AST structure information.
In Table 4, we also report the percentage of invalid templates under different terminating methods. We remove the auxiliary stack and use a special end-of-sentence (EOS) token, which was used in previous studies to terminate the decoding process (Wang et al., 2017, 2018a). Using the stack as a terminator lets the model generate a very low percentage of invalid templates (0.2%, versus 1.3% with EOS) and outperforms the EOS method.

Figure 5: One example of our system compared with BiLSTM+Suffix+EN (Wang et al., 2018a).
Problem: A person is taking a trip from A to B. He took a train for $n_1$ of the trip the first day. He took a bus and travelled for $n_2$ km the second day. He still needs to travel for $n_3$ of the total distance. How far is it from A to B?
Gold (suffix order): x = $n_1$ 1 $n_2$ - $n_3$ - /
Gold (prefix order): x = / $n_1$ - - 1 $n_2$ $n_3$
Prediction (BiLSTM+Suffix+EN): $n_1$ $n_2$ - $n_3$ / (error)
Prediction (Ours): / $n_1$ - - 1 $n_2$ $n_3$ (correct)

Case Study
Here we give an example that is improved by our tree-structured decoding system. As shown in Figure 5, in the gold equation of this example, there is a long distance between two pairs of parent-child nodes, $n_1$ and /, and also 1 and −. The BiLSTM+Suffix+EN model failed to capture the relationship between these two pairs of parent-child nodes and caused an error. Our model has a better ability to capture the relation between pairs of parent-child nodes even if there is a long distance between them in the notation.

Error Analysis
In  model's performance.
In Table 6, we examine the performance of the model in different question domains. MWPs in the same domain usually share similar logic, while there is an obvious difference between questions across different domains. Accurately detecting the question domain is very laborious, so we do this experiment by simply detecting frequent keywords of each domain in the question. We show further details in the appendix. The results show that the performance of the model varies considerably across domains and is limited in some domains such as solid geometry. This is because these domains require complicated external knowledge for solving the questions, such as $S_{circle} = \pi r^2$. It is difficult for the model to automatically summarize this kind of information with only the supervision of the equation templates. Adding external knowledge for this task may further improve the model.

Conclusion
We proposed a sequence-to-tree generative model to improve template generation for solving math word problems. The hierarchical top-down tree-structured decoder can use the information of the abstract syntax tree of an equation during decoding. With the help of an auxiliary stack, this decoding process can end without any redundant special tokens. Our model achieves state-of-the-art results on the large-scale dataset Math23K, demonstrating the effectiveness of our approach.