Translating a Math Word Problem to an Expression Tree

Sequence-to-sequence (SEQ2SEQ) models have been successfully applied to automatic math word problem solving. Despite its simplicity, a drawback still remains: a math word problem can be correctly solved by more than one equations. This non-deterministic transduction harms the performance of maximum likelihood estimation. In this paper, by considering the uniqueness of expression tree, we propose an equation normalization method to normalize the duplicated equations. Moreover, we analyze the performance of three popular SEQ2SEQ models on the math word problem solving. We find that each model has its own specialty in solving problems, consequently an ensemble model is then proposed to combine their advantages. Experiments on dataset Math23K show that the ensemble model with equation normalization significantly outperforms the previous state-of-the-art methods.


Introduction
Developing computer systems to automatically solve math word problems (MWPs) has been an interest of NLP researchers since 1963 (Feigenbaum et al., 1963;Bobrow, 1964).A typical MWP is shown in Table 1.Readers are asked to infer how many pens and pencils Jessica have in total, based on the textual problem description provided.Statistical machine learning-based methods (Kushman et al., 2014;Amnueypornsakul and Bhat, 2014;Zhou et al., 2015;Mitra and Baral, 2016;Roy and Roth, 2018) and semantic parsing-based methods (Shi et al., 2015;Koncel-Kedziorski et al., 2015;Roy and Roth, 2015;Huang et al., 2017) are proposed to tackle this problem, yet they still require considerable manual * The work was done when Lei Wang and Deng Cai were interns at Tencent AI Lab.
Problem: Dan has 5 pens and 3 pencils, Jessica has 4 more pens and 2 less pencils than him.How many pens and pencils does Jessica have in total?Equation: x = 5 + 4 + 3 − 2; Solution: 10 Table 1: A math word problem efforts on feature or template designing.For more literatures about solving math word problems automatically, refer to a recent survey paper Zhang et al. (2018).
Recently, the Deep Neural Networks (DNNs) have opened a new direction towards automatic MWP solving.Ling et al. (2017) take multiplechoice problems as input and automatically generate rationale text and the final choice.Wang et al. (2018) then make the first attempt of applying deep reinforcement learning to arithmetic word problem solving.Wang et al. (2017) train a deep neural solver (DNS) that needs no hand-crafted features, using the SEQ2SEQ model to automatically learn the problem-to-equation mapping.
Although promising results have been reported, the model in (Wang et al., 2017) still suffers from an equation duplication problem: a MWP can be solved by multiple equations.Taking the problem in Table 1 as an example, it can be solved by various equations such as x = 5 + 4 + 3 − 2, x = 4 + (5 − 2) + 3 and x = 5 − 2 + 3 + 4.This duplication problem results in a non-deterministic output space, which has a negative impact on the performance of most data-driven methods.In this paper, by considering the uniqueness of expression tree, we propose an equation normalization method to solve this problem.
Given the success of different SEQ2SEQ models on machine translation (such as recurrent encoder-decoder (Wu et al., 2016), Convolutional SEQ2SEQ model (Gehring et al., 2017) and Trans-arXiv:1811.05632v2[cs.CL] 15 Nov 2018 former (Vaswani et al., 2017)), it is promising to adapt them to MWP solving.In this paper, we compare the performance of three state-of-the-art SEQ2SEQ models on MWP solving.We observe that different models are able to correctly solve different MWPs, therefore, as a matter of course, an ensemble model is proposed to achieve higher performance.Experiments on dataset Math23K show that by adopting the equation normalization and model ensemble techniques, the accuracy boosts from 60.7% to 68.4%.
The remaining part of this paper is organized as follows: we first introduce the SEQ2SEQ Framework in Section 2. Then the equation normalization process is presented in Section 3, following which three SEQ2SEQ models and an ensemble model are applied to MWP solving in Section 4. The experimental results are presented in Section 5. Finally we conclude this paper in Section 6.

SEQ2SEQ Framework
The process of using SEQ2SEQ model to solve MWPs can be divided into two stages (Wang et al., 2017).In the first stage (number mapping stage), significant numbers (numbers that will be used in real calculation) in problem P are mapped to a list of number tokens {n 1 , . . ., n m } by their natural order in the problem text.Throughout this paper, we use the significant number identification (SNI) module proposed in (Wang et al., 2017) to identify whether a number is significant.In the second stage, SEQ2SEQ models can be trained by taking the problem text as the source sequence and equation templates (equations after number mapping) as the target sequence.
Taking the problem P in Table 1 as an example, first we can obtain a number mapping M : {n 1 = 5; n 2 = 3; n 3 = 4; n 4 = 2; }, and transform the given equation E P : x = 5 + 4 + 3 − 2 to an equation template During training, the objective of our SEQ2SEQ model is to maximize the conditional probability P (T p |P ), which will be decomposed to tokenwise probabilities.During decoding, we use beam search to approximate the most likely equation template.After that, we replace the number tokens with actual numbers and calculate the solution S with a math solver.

Equation Normalization
In the number mapping stage, the equations E P have been successfully transformed to equation templates T P .However, due to the equation duplication problem introduced in Section 1, this problem-equation templates formalization is a non-deterministic transduction that will have adverse effects on the performance of maximum likelihood estimation.There are two types of equation duplication: 1) order duplication such as "n1 + n3 + n2" and "n1 + n2 + n3", 2) bracket duplication such as "n 1 + n 3 − n 2 " and "n1 + (n 3 − n 2 )".
To normalize the order-duplicated templates, we define two normalization rules: • Rule 1: Two duplicated equation templates with unequal length should be normalized to the shorter one.For example, two equation templates "n 1 + n 2 + n 3 + n 3 − n 3 ", "n 1 + n 2 + n 3 " should be normalized to the latter one.
• Rule 2: The number tokens in equation templates should be ordered as close as possible to their order in number mapping.For example, three equation templates "n 1 + n 3 + n 2 ", "n 1 + n 2 + n 3 " and "n 3 + n 1 + n 2 " should be normalized to "n 1 + n 2 + n 3 ".
To solve the bracket duplication problem, we further normalize the equation templates to an expression tree.Every inner node in the expression tree is an operator with two children, while each leaf node is expected to be a number token.An example of expressing equation template n 1 + n 2 + n 3 − n 4 as the unique expression tree is shown in Figure 1.In this section, we present three types of SEQ2SEQ models to solve MWPs: bidirectional Long Short Term Memory network (BiLSTM) (Wu et al., 2016), Convolutional SEQ2SEQ model (Gehring et al., 2017), and Transformer (Vaswani et al., 2017).To benefit the output accuracy with all three architectures, we propose to use a simple ensemble method.

BiLSTM
The BiLSTM model uses two LSTMs (forward and backward) to learn the representation of each token in the sequence based on both the past and the future context of the token.At each time step of decoding, the deocder uses a global attention mechanism to read those representations.
In more detail, we use two-layer Bi-LSTM cells with 256 hidden units as encoder, and two layers LSTM cells with 512 hidden units as decoder.In addition, we use Adam optimizer with learning rate 1e −3 , β 1 = 0.9, and β 2 = 0.99.The epochs, minibatch size, and dropout rate are set to 100, 64, and 0.5, respectively.

ConvS2S
ConvS2S (Gehring et al., 2017) uses a convolutional architecture instead of RNNs.Both encoder and decoder share the same convolutional structure that uses n kernels striding from one side to the other, and uses gate linear units as nonlinearity activations over the output of convolution.
Our ConvS2S model adopts a four layers encoder and a three layers decoder, both using kernels of width 3 and hidden size 256.We adopt early stopping and learning rate annealing and set max-epochs equals to 100.

Ensemble Model
Through careful observation (detailed in Section 5.2), we find that each model has a speciality in solving problems.Therefore, we propose an ensemble model which selects the result according to models' generation probability: where y = {y 1 , ..., y T } is the target sequence, and x = {x 1 , ..., x S } is the source sequence.Finally, the output of the model with the highest generation probability is selected as the final output.

Experiment
In this section, we conduct experiments on dataset Math23K to examine the performance of different SEQ2SEQ models.Our main experimental result is to show a significant improvement over the baseline methods.We further conduct a case study to analyze why different SEQ2SEQ models can solve different kinds of MWPs.
Dataset: Math23K1 collected by Wang et al. (2017)  problems are linear algebra questions with only one unknown variable.
Baselines: We compare our methods with two baselines: DNS and DNS-Hybrid.Both of them are proposed in (Wang et al., 2017), with stateof-the-art performance on dataset Math23K.The DNS is a vanilla SEQ2SEQ model that adopts GRU (Chung et al., 2014) as encoder and LSTM as decoder.The DNS-Hybrid is a hybrid model that combines DNS and a retrieval-based solver to achieve better performance.

Results
In experiments, we use the testing set in Math23K as the test set, and randomly split 1, 000 problems from the training set as validation set.Evaluation results are summarized in Table 2. First, to examine the effectiveness of equation normalization, model performance with and without equation normalization are compared.Then the performance of DNS, DNS-Hybrid, Bi-LSTM, ConvS2S, Transformer, and Ensemble model are examined on the dataset.
Several observations can be made from the results.First, the equation normalization process significantly improves the performance of each model.The accuracy of different models gain increases from 2.7% to 7.1% after equation normalization.Second, Bi-LSTM, ConvS2S, Transformer can achieve much higher performance than DNS, which means that popular machine translation models are also efficient in automatic MWP solving.Third, by combining the SEQ2SEQ models, our ensemble model gains additional 1.7% increase on accuracy.
In addition, we have further conducted three extra experiments to disentangle the benefits of three different EN techniques.Table 3 gives the details of the ablation study of the three SEQ2SEQ models.Taking Bi-LSTM as an example, accuracies of rule 1 (SE), rule 2 (OE) and eliminating brackets (EB) are 63.1%, 63.7% and 65.3%, respectively.Obviously, the performance of SEQ2ESQ models benefits from the equation normalization technologies.

Case Study
Further, we conduct a case analysis on the capability of different SEQ2SEQ models and provide three examples in Table 4.Our analysis is summarized as follows: 1) Transformer occasionally generates mathematically incorrect templates, while Bi-LSTM and ConvS2S almost do not, as shown in Example 1.This is probably because the size of training data is still not enough to train the multi-head self-attention structures; 2) In Example 2, the Transformer is adapted to solve problems that require complex inference.It is mainly because different heads in a self-attention structure can model various types of relationships between number tokens; 3) The multi-layer convolutional block structure in ConvS2S can properly process the context information of number tokens.In Example 3, it is the only one that captures the relationship between stamp A and stamp B.

Conclusion
In this paper, we first propose an equation normalization method that normalizes duplicated equation templates to an expression tree.We test different SEQ2SEQ models on MWP solving and propose an ensemble model to achieve higher performance.Experimental results demonstrate that the proposed equation normalization method and the ensemble model can significantly improve the state-of-the-art methods.
Example 1: Two biological groups have produced 690 (n 1 ) butterfly specimens in 15 (n 2 ) days.The first group produced 20 (n 3 ) each day.How many did the second group produced each day?Bi-LSTM: n 1 n 2 /n 3 −; (correct) ConvS2S: n 2 n 3 + n 1 * ; (error) Transformer: n 2 n 1 n 3 n 3 +; (error) Example 2: A plane, in a speed of 500 (n 1 ) km/h, costs 3 (n 2 ) hours traveling from city A to city B. It only costs 2 (n 3 ) hours for return.How much is the average speed of the plane during this round-trip?Bi-LSTM: n

Figure 1 :
Figure 1: A Unique Expression Tree After equation normalization, the SEQ2SEQ models can solve MWPs by taking problem text as source sequence and the postorder traversal of an unique expression tree as target sequence, as shown in Figure 2.

Figure 2 :
Figure 2: Framework of SEQ2SEQ models Vaswani et al. (2017) proposed the Transformer based on an attention mechanism without relying on any convolutional or recurrent architecture.Both encoder and decoder are composed of a stack of identical layers.Each layer contains two parts: a multi-head self-attention module and a positionwise fully-connected feed-forward network.Our transformer is four layers deep, with n head = 16, d k = 12, d v = 32, and d model = 512, where n head is the number of heads of its selfattention, d k is the dimension of keys, d v is the dimension of values, and d model is the output dimension of each sub-layer.In addition, we use Adam optimizer with learning rate 1e −3 , β 1 = 0.9, β 2 = 0.99, and dropout rate of 0.3.

Table 3 :
contains 23,162 labeled MWPs.All these The ablation study of three equation normalization methods.SE is the first Rule mentioned in Section 3. OE is the second rule mentioned in Section 3. EB means eliminating the brackets.

Table 4 :
Three examples of solving MWP with SEQ2SEQ model.Note that the results are postorder traversal of expression trees, and the problems are translated to English for brevity.