Semantically-Aligned Universal Tree-Structured Solver for Math Word Problems

A practical automatic textual math word problem (MWP) solver should be able to solve various types of MWPs, while most existing works focus only on one-unknown linear MWPs. Herein, we propose a simple but efficient method called Universal Expression Tree (UET) to make the first attempt to represent the equations of various MWPs uniformly. Then a semantically-aligned universal tree-structured solver (SAU-Solver), based on an encoder-decoder framework, is proposed to resolve multiple types of MWPs in a unified model, benefiting from our UET representation. Our SAU-Solver generates a universal expression tree explicitly by deciding which symbol to generate according to the semantic meanings of the generated symbols, like a human solving MWPs. Besides, our SAU-Solver includes a novel subtree-level semantically-aligned regularization to further enforce the semantic constraints and rationality of the generated expression tree by aligning it with the contextual information. Finally, to validate the universality of our solver and extend the research boundary of MWPs, we introduce a new challenging Hybrid Math Word Problems dataset (HMWP) consisting of three types of MWPs. Experimental results on several MWPs datasets show that our model can solve multiple types of MWPs and outperforms several state-of-the-art models.


Introduction
Math word problem (MWP) solving aims to automatically answer a math word problem by understanding the textual description of the problem and reasoning out the underlying answer. A typical MWP is a short story that describes a partial state of the world and poses a question about one or more unknown quantities. Thus, a machine needs both natural language understanding and reasoning. To solve an MWP, the relevant quantities must be identified from the text, and the correct operators and their computation order among these quantities must be determined. Many traditional methods (Yuhui et al., 2010; Kushman et al., 2014; Shi et al., 2015) have been proposed to address this problem, but they relied on tedious hand-crafted features and template annotation, which required extensive human effort and knowledge. Recently, deep learning has opened a new direction towards automatic MWPs solving (Huang et al., 2018; Wang et al., 2018b, 2019; Xie and Sun, 2019; Chiang and Chen, 2019). Most deep learning-based methods try to train an end-to-end neural network to automatically learn the mapping function between problems and their corresponding equations. However, some limitations hinder them from being applied in real-world applications. First, although the seq2seq model can be applied to solve various MWPs, it suffers from generating fake numbers and mispositioned numbers, because all data share the same target vocabulary without problem-specific constraints. Second, some advanced methods (Wang et al., 2018b, 2019; Xie and Sun, 2019) only target arithmetic word problems with at most one unknown and do not model the unknowns underlying MWPs, which prevents them from generalizing to various MWPs, such as equation set problems. Thus, these methods can only handle arithmetic problems with no more than one unknown.
Besides, they lack an efficient equation representation mechanism to handle MWPs with multiple unknowns and multiple equations, such as equation set problems. Finally, though some methods (Huang et al., 2018; Chiang and Chen, 2019) can handle multiple types of MWPs, they neither generate the next symbol by taking full advantage of the generated symbols, as a human would, nor consider the semantic transformation between equations in a problem, resulting in poor performance on multiple-unknown MWPs, such as those involving equation sets.
To address the above issues, we propose a simple yet efficient method called Universal Expression Tree (UET) to make the first attempt to represent the equations of various MWPs uniformly, like the expression tree of one-unknown linear word problems, while considering unknowns. Specifically, as shown in Fig. 1, UET integrates all expression trees underlying an MWP into an ensemble expression tree via math operator symbol extension, so that the grounded equations of various MWPs can be handled in a unified manner, just like one-unknown linear MWPs. It thus significantly reduces the difficulty of modeling the equations of various MWPs.
Then, we propose a semantically-aligned universal tree-structured solver (SAU-Solver), based on our UET representation and an encoder-decoder framework, to solve multiple types of MWPs in a unified manner with a single model. In our SAU-Solver, the encoder is designed to understand the semantics of MWPs and extract number semantic representations, while the tree-structured decoder generates the next symbol over a problem-specific target vocabulary in a semantically-aligned manner by taking full advantage of the semantic meanings of the generated expression tree, much as a human uses the problem's contextual information and all tokens already written to reason about the next token when solving MWPs. The problem-specific target vocabulary helps our solver mitigate the problem of fake number generation as much as possible.
Besides, to further enforce the semantic constraints and rationality of the generated expression tree, we propose a subtree-level semantically-aligned regularization that improves subtree-level semantic representation by aligning it with the contextual information of a problem, which improves answer accuracy effectively.
Finally, to validate the universality of our solver and push the research boundary of MWPs to better match real-world applications, we introduce a new challenging Hybrid Math Word Problems dataset (HMWP), consisting of one-unknown linear word problems, one-unknown non-linear word problems, and equation set problems with two unknowns. Experimental results on HMWP, ALG514, Math23K, and Dolphin18K-Manual show the universality and superiority of our approach compared with several state-of-the-art methods.

Related Works
Numerous methods have been proposed to attack the MWPs task, ranging from rule-based methods (Bakman, 2007; Yuhui et al., 2010), statistical machine learning methods (Kushman et al., 2014; Zhou et al., 2015; Mitra and Baral, 2016; Huang et al., 2016; Roy and Roth, 2018), and semantic parsing methods (Shi et al., 2015; Koncel-Kedziorski et al., 2015; Huang et al., 2017), to deep learning methods (Ling et al., 2017; Wang et al., 2017, 2018b; Huang et al., 2018; Wang et al., 2018a; Xie and Sun, 2019; Wang et al., 2019). Due to space limitations, we only review some recent advances in deep learning-based methods. Wang et al. (2017) made the first attempt to generate expression templates using a Seq2Seq model. The seq2seq approach has achieved promising results, but it suffers from generating spurious numbers, predicting numbers at wrong positions, and the equation duplication problem (Huang et al., 2018; Wang et al., 2018a). To address these issues, Huang et al. (2018) proposed adding a copy-and-alignment mechanism to the standard Seq2Seq model, and Wang et al. (2018a) proposed equation normalization to normalize duplicated equations by considering the uniqueness of an expression tree. Different from seq2seq-based works, Xie and Sun (2019) proposed a tree-structured decoder that generates an expression tree, inspired by the goal-driven problem-solving mechanism, and Wang et al. (2019) proposed a two-stage template-based solution built on a recursive neural network for math expression construction. However, these methods do not model the unknowns underlying MWPs and thus handle only one-unknown linear word problems. Besides, they lack an efficient mechanism to handle MWPs with multiple unknowns and multiple equations. Therefore, they cannot solve other, more challenging types of MWPs with larger search spaces, such as equation set problems and non-linear equation problems.
The method of Chiang and Chen (2019) is a general equation generator that produces expressions via a stack, but it does not consider the semantic transformation between equations in a problem, resulting in poor performance on multiple-unknown MWPs, such as equation set problems.
The Design of SAU-Solver

Universal Expression Tree (UET)
Textual MWPs can be primarily divided into two groups: arithmetic word problems and equation set problems. For a universal MWPs solver, it is highly desirable to represent the various equations of various MWPs in a unified manner so that the solver can generate equations efficiently. Although most existing works handle one-unknown linear word problems well, it is much harder for current methods to handle equation set MWPs with multiple unknowns, since they neither model the unknowns in the MWPs nor provide an efficient equation representation mechanism that lets their decoders generate the required equations efficiently. An intuitive way to handle this issue is to treat the equation set as a forest of expression trees and process all trees iteratively in a certain order. Although this can handle equation set problems, it increases the difficulty of equation generation, since the model must reason out the number of equations before starting generation, and a prediction error there greatly influences equation generation. Besides, it is also challenging to take full advantage of the context information from the problem and the already-generated trees. Another way is to deploy a Seq2Seq-based architecture to handle various equations in infix order, as in previous works (Huang et al., 2018), but this has limitations such as generating invalid expressions, generating spurious numbers, and generating numbers at wrong positions.
To overcome the above issues while maintaining simplicity, we propose a new equation representation called Universal Expression Tree (UET) to make the first attempt to represent the equations of various MWPs uniformly. Specifically, we extend the math operator symbol table by introducing a new operator ";" as the lowest-priority operator, which integrates one or more expression trees into a universal expression tree, as shown in Fig. 1. With UET, a solver can more easily handle the underlying equations of various textual MWPs in a unified manner, in the same way as for arithmetic word problems. Although UET is simple, it provides an efficient, concise, and uniform way to utilize the context information from the problem, and it makes the semantic transformation between equations as simple as the semantic transformation between subtrees within an equation.
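To make the UET construction concrete, the following minimal sketch represents each equation as a nested tuple in tree form and folds several equations into one universal tree rooted by the lowest-priority operator ";". The function names and token names (n1, n2, ...) are illustrative, not from the paper's code.

```python
def merge_trees(trees):
    """Left-fold a list of expression trees into one UET with ';' roots."""
    uet = trees[0]
    for t in trees[1:]:
        uet = (";", uet, t)
    return uet

def to_prefix(tree):
    """Serialize a (possibly merged) tree into a prefix token list."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return [op] + to_prefix(left) + to_prefix(right)
    return [tree]

# Equation set {x + y = n1, x - y = n2} as two '='-rooted trees:
eq1 = ("=", ("+", "x", "y"), "n1")
eq2 = ("=", ("-", "x", "y"), "n2")
uet = merge_trees([eq1, eq2])
print(to_prefix(uet))
# → [';', '=', '+', 'x', 'y', 'n1', '=', '-', 'x', 'y', 'n2']
```

Because the merged result is again a single binary tree, a decoder built for one-unknown arithmetic expression trees needs no architectural change to emit equation sets.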

SAU-Solver
Based on our proposed UET representation, we design a universal tree-structured solver that generates a universal expression tree explicitly according to the problem context, modeling the relationships among unknown variables, quantities, math operations, and constants in a tree-structured way, as shown in Fig. 2. Our solver consists of a BiGRU-based problem encoder and an explicit tree-structured equation decoder. When a problem is entered, our model first encodes each word of the problem to produce the problem's contextual representation g_0 with the problem encoder. Then g_0 is used as the initial hidden state by our tree-structured equation decoder to guide equation generation in prefix order with two intertwined processes: top-down tree-structured decoding and bottom-up subtree semantic transformation. With their help, SAU-Solver can generate the next symbol by taking full advantage of the generated symbols in a semantically-aligned manner, like a human solving MWPs. Finally, we apply infix traversal and inverse number mapping to generate the corresponding human-readable equation, which can be computed by SymPy, a Python library for symbolic mathematics.

Figure 2: An overview of our SAU-Solver. When a problem preprocessed by number mapping and replacement is entered, our problem encoder encodes the problem text as a context representation. Then our equation decoder generates an expression tree explicitly in pre-order traversal for the problem according to the context representation. Finally, infix traversal and inverse number mapping are applied to generate the corresponding equation.
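The final post-processing step (infix traversal plus inverse number mapping) can be sketched as follows. The operator set and the number-token naming scheme (n1, n2, ...) are assumptions for illustration; solving the resulting equation string with SymPy is left out to keep the sketch self-contained.

```python
# ";" separates equations; all other operators are binary.
OPS = {";", "=", "+", "-", "*", "/", "^"}

def prefix_to_infix(tokens):
    """Consume a prefix token list and rebuild an infix equation string."""
    def build(it):
        tok = next(it)
        if tok in OPS:
            left, right = build(it), build(it)
            if tok == ";":                      # equation separator
                return left + " ; " + right
            return "(" + left + " " + tok + " " + right + ")"
        return tok                              # number, unknown, or constant
    return build(iter(tokens))

def inverse_number_mapping(expr, mapping):
    """Replace number tokens with the original values from the problem."""
    for tok, value in mapping.items():
        expr = expr.replace(tok, value)
    return expr

prefix = [";", "=", "+", "x", "y", "n1", "=", "-", "x", "y", "n2"]
infix = prefix_to_infix(prefix)
print(inverse_number_mapping(infix, {"n1": "8", "n2": "2"}))
# → ((x + y) = 8) ; ((x - y) = 2)
```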

Problem Encoder
Bidirectional Gated Recurrent Unit (BiGRU) (Cho et al., 2014) is an efficient method to encode sequential information. Formally, given an input math word problem sentence P = {x_t}_{t=1}^{n}, we first embed each word into a vector x_t. These embeddings are then fed into a two-layer BiGRU, run from beginning to end and from end to beginning, to model the problem sequence:

h_t^→ = GRU(x_t, h_{t-1}^→),  h_t^← = GRU(x_t, h_{t+1}^←),  h_t^p = h_t^→ + h_t^←,

where GRU(·,·) represents the function of a two-layer GRU, and h_t^p is the sum of the hidden states h_t^→ and h_t^← from the forward and backward GRUs. These representation vectors are then fed into our tree-structured equation decoder for ensemble expression tree generation. Besides, we construct the hidden state g_0 = h_n^→ + h_1^← as the initial hidden state of our equation decoder.

Equation Decoder
For decoding, inspired by previous works (Xie and Sun, 2019; Chiang and Chen, 2019), we build a semantically-aligned tree decoder that decides which symbol to generate by taking full advantage of the semantic meanings of the generated symbols, with two intertwined processes: top-down tree-structured decoding and bottom-up subtree semantic transformation. Our decoder takes tree-based information g_parent (left node) or (g_parent, t_l) (right node) as input and maintains two auxiliary stacks G and T to enforce a semantically-aligned decoding procedure. The stack G maintains the hidden states generated from the parent node, while the stack T helps the model decide which symbol to generate by maintaining the subtree semantic information of the generated symbols. Benefiting from UET, our decoder can automatically end the decoding procedure without any special token. If the predicted token y_t is an operator, we generate two child hidden states g_l and g_r according to the current node embedding n of y_t and push them into the stack G, where they maintain the state transitions among nodes and are used to predict tokens and their node embeddings. We also push the token embedding e(y_t|P) of y_t into the stack T so that we can maintain the subtree semantic information of the generated symbols after right-child-node generation. If the predicted token y_t is not an operator, we check the size of the stack T to judge whether the current node is a right node. If it is, we transform the embeddings of the parent node op, the left sibling node l, and the current node e(y_t|P) into a subtree semantic representation t, which represents the semantic meaning of the generated symbols for the current subtree and helps the right-node generation of the upper subtree. In this way, our equation decoder decodes out an equation the way a human writes one out according to the problem description.

Token Embedding. For a problem P, its target vocabulary V_tar consists of four parts: math operators V_op, unknowns V_u, constants V_con (common-sense numerical values that occur in the target expression but not in the problem text, e.g., a chick has 2 legs), and the numbers n_P occurring in P. For each token y in V_tar, its token embedding e(y|P) is defined as:

e(y|P) = M_op(y) if y ∈ V_op;  M_u(y) if y ∈ V_u;  M_con(y) if y ∈ V_con,

where M_op, M_u, and M_con are three trainable word embedding matrices independent of the specific problem. For a numeric value y in n_P, however, we take the corresponding hidden state h_loc(y,P)^p from the encoder as its token embedding, where loc(y, P) is the index position of the numeric value y in P.
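The decoder's ability to end decoding without a special end token follows from UET being a binary tree: in prefix order, one slot is open at the start, each binary operator consumes one slot and opens two (net +1), and each operand closes one. The following sketch of that bookkeeping uses an illustrative operator set, not the paper's code.

```python
# Operators in UET are all binary, including ";" and "=".
BINARY_OPS = {";", "=", "+", "-", "*", "/", "^"}

def is_complete(prefix_tokens):
    """True iff the prefix token sequence forms exactly one full tree."""
    open_slots = 1
    for tok in prefix_tokens:
        if open_slots == 0:      # tokens after a finished tree → invalid
            return False
        open_slots += 1 if tok in BINARY_OPS else -1
    return open_slots == 0

print(is_complete([";", "=", "+", "x", "y", "n1", "=", "-", "x", "y", "n2"]))
# → True
print(is_complete(["+", "x"]))
# → False (a right operand is still missing)
```

In the actual decoder the same invariant is realized implicitly by the stack G: decoding stops when no pending child hidden states remain.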

Gating Mechanism and Attention Mechanism.
To let important information flow and filter out useless information, we apply a gating mechanism to generate a node state n, which is used for predicting the output and for generating the child hidden states g_l and g_r of descendant nodes if the output of the current node is a math operator:

O = σ(W_O I) ⊙ tanh(W_C I),

where O can be a left node state n_l, a right node state n_r, a left child hidden state g_l, or a right child hidden state g_r. For n_l, I is the g_l generated by the parent node. For n_r, I is [g_r, t_l], the concatenation of the hidden state g_r generated by the parent node and the subtree semantic embedding t_l of the left sibling. For g_l and g_r, I is [n, c, e(y_t|P)], the concatenation of the current node state n, the contextual vector c (which aggregates relevant information of the problem as a weighted representation of the input tokens via the attention mechanism), and the token embedding e(y_t|P) of the predicted token y_t.
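A minimal sketch of the gating idea follows: a sigmoid gate decides, per dimension, how much of a tanh-transformed candidate state to let through. The 2x2 toy weights and the exact gate form are illustrative assumptions, not the trained parameters or the paper's exact equations.

```python
import math

def matvec(W, x):
    """Plain matrix-vector product over nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gated_state(W_g, W_c, I):
    """O = sigmoid(W_g I) * tanh(W_c I), elementwise."""
    gate = [1.0 / (1.0 + math.exp(-v)) for v in matvec(W_g, I)]
    cand = [math.tanh(v) for v in matvec(W_c, I)]
    return [g * c for g, c in zip(gate, cand)]

W_g = [[0.5, 0.0], [0.0, 0.5]]   # toy gate weights
W_c = [[1.0, 0.0], [0.0, 1.0]]   # toy candidate weights
print(gated_state(W_g, W_c, [0.0, 0.0]))
# → [0.0, 0.0]  (tanh(0) = 0, so nothing passes regardless of the gate)
```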
To better predict a token y_t by utilizing contextual information, we deploy an attention mechanism to aggregate relevant information from the input vectors. Formally, given the current node state n and the encoder outputs {h_t^p}_{t=1}^{n}, we calculate the contextual vector c as:

a_t = exp(score(n, h_t^p)) / Σ_i exp(score(n, h_i^p)),  c = Σ_t a_t h_t^p,

where score(n, h_t^p) = v_a^T tanh(W_a [n, h_t^p]). Based on the contextual vector c and the current node state n, we predict the token y_t as the token maximizing prob(y|n, c, P) = softmax(s(y|n, c, P)), where

s(y|n, c, P) = v_n^T tanh(W_s [n, c, e(y|P)]).

Subtree Semantic Transformation. Although our decoder decodes a universal expression tree in prefix order, to help the model generate the next symbol in a semantically-aligned manner by taking full advantage of the semantic meanings of the generated expression tree, we design a recursive neural network that transforms the semantic representations of the current node and its two child subtrees t_l and t_r into a high-level embedding t in a bottom-up manner. Formally, let t be a subtree and y the predicted token of its root node. If y is a math operator, the current subtree t must have two child subtrees t_l and t_r, and the high-level embedding t fuses the semantic information of the operator token y, the left child subtree t_l, and the right child subtree t_r:

t = σ(W_gt [e(y|P), t_l, t_r]) ⊙ tanh(W_ct [e(y|P), t_l, t_r]).

Otherwise, t is the embedding e(y|P) of the predicted token y, because y is a numeric value, an unknown variable, or a constant quantity, and the recursion stops.
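The attention step above can be sketched in plain Python: score each encoder output against the current node state, softmax the scores, and take the weighted sum as the contextual vector c. The dot-product score here stands in for the learned scoring function.

```python
import math

def attention_context(node_state, encoder_outputs):
    """Softmax-weighted sum of encoder outputs, scored against node_state."""
    scores = [sum(n * h for n, h in zip(node_state, h_t))
              for h_t in encoder_outputs]
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_outputs[0])
    return [sum(w * h_t[d] for w, h_t in zip(weights, encoder_outputs))
            for d in range(dim)]

H = [[1.0, 0.0], [0.0, 1.0]]                     # two toy encoder outputs
print(attention_context([10.0, 0.0], H))
# weights ≈ [1, 0], so c is close to the first encoder output
```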

Semantically-Aligned Regularization
When a subtree t is produced by our model, we have a computable unit, and the semantics of this computable unit should be consistent with the problem text P. To achieve this goal, we propose a subtree-level semantically-aligned regularization to help train a better model with higher performance. For each subtree embedding t and the encoder outputs h_1^p, h_2^p, ..., h_n^p, we first apply an attention function, as in the equation decoder, to compute a semantically-aligned vector a, and then use two-layer feed-forward neural networks with tanh activation to transform t and a into the same semantic space:

e_sa = tanh(W_e2 tanh(W_e1 a)),  d_sa = tanh(W_d2 tanh(W_d1 t)),

where W_e1, W_e2, W_d1, and W_d2 are trainable parameter matrices. With the vectors e_sa and d_sa, letting m be the number of subtrees in a universal expression tree, we regularize our model by minimizing the following loss:

L_sa = (1/m) Σ_{k=1}^{m} ||e_sa^(k) − d_sa^(k)||_2^2.
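The regularizer can be sketched as follows: project the subtree embedding t and the attended context a through two-layer tanh networks into a shared space, then penalize their squared distance, averaged over subtrees. The weight shapes and the squared-distance choice are illustrative assumptions.

```python
import math

def two_layer(x, W1, W2):
    """Two-layer feed-forward network with tanh activations."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return [math.tanh(sum(w * hi for w, hi in zip(row, h))) for row in W2]

def sa_loss(subtrees, contexts, We1, We2, Wd1, Wd2):
    """Average squared distance between projected (t, a) pairs."""
    total = 0.0
    for t, a in zip(subtrees, contexts):
        e_sa = two_layer(a, We1, We2)   # projected context vector
        d_sa = two_layer(t, Wd1, Wd2)   # projected subtree embedding
        total += sum((e - d) ** 2 for e, d in zip(e_sa, d_sa))
    return total / len(subtrees)

I2 = [[1.0, 0.0], [0.0, 1.0]]           # identity weights for the demo
print(sa_loss([[0.0, 0.0]], [[0.0, 0.0]], I2, I2, I2, I2))
# → 0.0  (identical inputs project to identical points)
```

A subtree whose semantics drift away from the problem context receives a larger penalty, which is the intended training pressure.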

Training Objective
Given the training dataset D = {(P_1, T_1), (P_2, T_2), ..., (P_N, T_N)}, where T_i is the universal expression tree of problem P_i, we minimize the following loss function:

L = Σ_{i=1}^{N} [ Σ_{t=1}^{m} −log prob(y_t | g_t, c_t, P_i) + λ L_sa ],

where m denotes the size of T_i, and g_t and c_t are the hidden state vector and its contextual vector at the t-th node. We set λ to 0.01 empirically.
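For a single problem, the objective combines the token-level negative log-likelihood of the gold prefix sequence with λ times the regularization term. A minimal sketch, assuming the per-token gold probabilities and the regularizer value are already computed:

```python
import math

def total_loss(token_probs, sa_reg, lam=0.01):
    """token_probs: probability the model assigned to each gold token."""
    nll = -sum(math.log(p) for p in token_probs)
    return nll + lam * sa_reg

# Perfect token predictions give zero NLL, so only the regularizer remains:
print(total_loss([1.0, 1.0, 1.0], sa_reg=2.0))
# → 0.02
```

With λ = 0.01 the regularizer acts as a gentle correction rather than competing with the main prediction loss.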

Discussion
The methods most relevant to ours are GTS (Xie and Sun, 2019) and StackDecoder (Chiang and Chen, 2019), but our method differs from them as follows. First, our method applies a universal expression tree to represent the diverse equations underlying different MWPs uniformly, which matches real-world MWPs better than GTS, which can only handle one-unknown linear MWPs without considering unknowns, and StackDecoder, which handles equation set problems iteratively. Second, we introduce subtree-level semantically-aligned regularization to better enforce the semantic constraints and rationality of the generated expression tree during training, leading to higher answer accuracy, as illustrated in Table 2.

Hybrid Math Word Problem Dataset
Most public datasets for automatic MWPs solving are either quite small, such as Alg514 (Kushman et al., 2014), DRAW-1K (Upadhyay and Chang, 2017), and MaWPS (Koncel-Kedziorski et al., 2016), or contain some incorrect labels, such as Dolphin18K (Huang et al., 2016). An exception is the Math23K dataset, which contains 23,161 problems labeled with structured equations and answers. However, it contains only one-unknown linear MWPs, which is not sufficient to validate the ability of a solver to handle multiple types of MWPs. Therefore, we introduce a new high-quality MWPs dataset, called HMWP, in which each sample is extracted from a Chinese K12 math word problem bank, to validate the universality of math word problem solvers and push the research boundary of MWPs toward real-world scenarios. Our dataset contains three types of MWPs: arithmetic word problems, equation set problems, and non-linear equation problems. There are 5,491 MWPs in total, including 2,955 one-unknown-variable linear MWPs, 1,636 two-unknown-variable linear MWPs, and 900 one-unknown-variable non-linear MWPs. Our dataset is sufficient for validating the universality of math word problem solvers since these problems cover most cases of MWPs. We labeled our data with structured equations and answers, as in Math23K. The data statistics of our dataset and several publicly available datasets are shown in Table 1. From the statistics, we can see that the Avg EL (average equation length), Avg SNI (average number of quantities occurring in problems and their corresponding equations), and Avg Ops (average number of operators in equations) of our dataset are the largest among the publicly available datasets. Xie and Sun (2019) showed that the higher these values, the more difficult the dataset. Therefore, our dataset is more challenging for MWPs solvers.

Table 1: Statistics of our dataset and several publicly available datasets.
Avg EL, Avg SNI, Avg Constants, and Avg Ops denote the average equation length, the average number of quantities occurring in problems and their corresponding equations, the average number of constants occurring only in equations, and the average number of operators in equations, respectively. The higher these values, the more difficult the dataset, as shown in (Xie and Sun, 2019).

Experimental Setup and Training Details
Datasets, Baselines, and Evaluation metric.
We conduct experiments on four datasets: HMWP, Alg514 (Kushman et al., 2014), Math23K, and Dolphin18K-Manual (Huang et al., 2016). The data statistics of the four datasets are shown in Table 1. The main state-of-the-art learning-based methods compared are as follows. Seq2Seq-attn w/ SNI is a universal solver based on the seq2seq model with significant number identification (SNI). GTS (Xie and Sun, 2019) is a goal-driven tree-structured MWP solver for one-unknown-variable linear MWPs only. StackDecoder (Chiang and Chen, 2019) is a semantically-aligned MWPs solver. SAU-Solver w/o SSAR and SAU-Solver are the two universal tree-structured solvers proposed in this paper, without and with subtree-level semantically-aligned regularization, respectively. Following our baselines, we use answer accuracy as the evaluation metric: if the calculated value of the predicted expression tree equals the true answer, the prediction is considered correct, since the predicted expression is then equivalent to the target expression.
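The answer-accuracy metric can be sketched as a tolerance comparison between computed values and gold answers, so that algebraically equivalent expressions all count as correct. The tolerance value and the convention of using None for unparseable predictions are assumptions for illustration.

```python
def answer_accuracy(predicted_values, gold_answers, tol=1e-4):
    """Fraction of predictions whose computed value matches the gold answer."""
    correct = sum(1 for p, g in zip(predicted_values, gold_answers)
                  if p is not None and abs(p - g) < tol)
    return correct / len(gold_answers)

# "3 + 5 * 2" and "5 * 2 + 3" both evaluate to 13, so both count as correct.
preds = [13.0, 13.0, None]          # None: the expression failed to evaluate
print(answer_accuracy(preds, [13.0, 13.0, 7.0]))
# → 0.6666666666666666
```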

Results and Analyses
Answer Accuracy. We conduct 5-fold cross-validation to evaluate the performance of the baselines and our models on all four datasets. The results are shown in Table 2, from which several observations can be made. First, our SAU-Solver achieves significantly better results than the baselines on all four datasets. This shows that our model is feasible for solving multiple types of MWPs, and that it is more general and more effective than other state-of-the-art models in the real-world scenario where multiple types of MWPs must be solved with a unified solver.
Second, with our subtree-level semantically-aligned regularization during training, our SAU-Solver gains an additional absolute 0.43% accuracy on HMWP, 1.95% on ALG514, 0.31% on Math23K, and 0.39% on Dolphin18K-Manual. This shows that subtree-level semantically-aligned regularization helps improve subtree semantic embeddings, and in turn expression tree generation, especially the generation of right child nodes. Although StackDecoder can serve as a universal math word problem solver via a simple operator extension, its performance on HMWP, ALG514, and Dolphin18K-Manual is very poor, since it generates expression trees independently and only considers the semantically-aligned transformation within a single expression tree. In contrast, our SAU-Solver generates multiple expression trees as one universal expression tree and conducts subtree-level semantically-aligned transformation for subsequent tree node generation within this universal expression tree. In this way, the semantic information of a previous expression tree can guide the generation of the current expression tree, so we achieve better performance than StackDecoder. Overall, our model is more general and effective than other state-of-the-art models on multiple types of MWPs and outperforms the compared state-of-the-art models by a large margin on answer accuracy.
Performance on different types of MWPs. We drill down to analyze the performance of Retrieval-Jaccard, Seq2Seq-attn w/ SNI, and SAU-Solver on different types of MWPs in HMWP. The data statistics and performance results are shown in Table 3. Our model outperforms the other two models by a large margin on all subsets. Intuitively, the longer the expression, the more complex the mathematical relationships of the problem and the more difficult it is, and the average expression length of our dataset is much longer than that of Math23K according to the statistics in Table 3 and Table 1. This explains why the accuracy of our model on linear (One-VAR) problems is lower than on Math23K in Table 2.

Error Analysis
In

Case Study
Further, we conduct a case analysis and provide four cases in Table 4, which show the effectiveness of our approach. Our analyses are summarized as follows. In Case 1, Seq2Seq generates a spurious number n_2 not in the problem text, while both SAU-Solver w/o SSAR and SAU-Solver predict correctly owing to the problem-specific target vocabulary. Besides, although both SAU-Solver w/o SSAR and SAU-Solver generate a correct equation, the equation generated by SAU-Solver is more semantically aligned with a human's than the one generated by SAU-Solver w/o SSAR. In Case 2, Seq2Seq generates an invalid expression containing consecutive operators, while our models guarantee the validity of expressions since they generate expression trees directly. In Case 3, we find it interesting that tree-based models can avoid generating redundant operations such as "n_1 *". In Case 4, SAU-Solver prevents generating the same subtree as its left sibling when the parent node is "*".

Conclusion
We proposed SAU-Solver, which solves multiple types of MWPs by generating a universal expression tree explicitly in a semantically-aligned manner. Besides, we proposed a subtree-level semantically-aligned regularization to improve subtree semantic representation. Finally, we introduced a new MWP dataset, called HMWP, to validate our solver's universality and push the research boundary of MWPs toward real-world applications. Experimental results show the superiority of our approach.