Deep Neural Solver for Math Word Problems

This paper presents a deep neural solver to automatically solve math word problems. In contrast to previous statistical learning approaches, we directly translate math word problems to equation templates using a recurrent neural network (RNN) model, without sophisticated feature engineering. We further design a hybrid model that combines the RNN model and a similarity-based retrieval model to achieve additional performance improvement. Experiments conducted on a large dataset show that the RNN model and the hybrid model significantly outperform state-of-the-art statistical learning methods for math word problem solving.


Introduction
Developing computer models to automatically solve math word problems has been an interest of NLP researchers since 1963 (Feigenbaum et al., 1963; Bobrow, 1964; Briars and Larkin, 1984; Fletcher, 1985). Recently, machine learning techniques (Amnueypornsakul and Bhat, 2014; Zhou et al., 2015; Mitra and Baral, 2016) and semantic parsing methods (Shi et al., 2015; Koncel-Kedziorski et al., 2015) have been proposed to tackle this problem, and promising results have been reported on some datasets. Although progress has been made on this task, the performance of state-of-the-art techniques is still quite low on large datasets with diverse problem types (Huang et al., 2016).
A typical math word problem is shown in Table 1. The reader is asked to infer how many pens Dan and Jessica have, based on the constraints provided. Given the success of deep neural networks (DNN) on many NLP tasks (such as POS tagging, syntactic parsing, and machine translation), it is interesting to study whether DNNs could also help math word problem solving. In this paper, we propose a recurrent neural network (RNN) model for automatic math word problem solving. It is a sequence to sequence (seq2seq) model that transforms natural language sentences in math word problems to mathematical equations. Experiments conducted on a large dataset show that the RNN model significantly outperforms state-of-the-art statistical learning approaches.

Problem: Dan have 2 pens, Jessica have 4 pens. How many pens do they have in total?
Equation: x = 4 + 2
Solution: 6
Table 1: A math word problem
Since it has been demonstrated (Huang et al., 2016) that a simple similarity-based method performs as well as more sophisticated statistical learning approaches on large datasets, we implement a similarity-based retrieval model and compare it with our seq2seq model. We observe that although seq2seq performs better on average, the retrieval model is able to correctly solve many problems for which the RNN generates wrong results. We also find that the accuracy of the retrieval model correlates positively with the maximal similarity score between the target problem and the problems in the training data: the larger the similarity score, the higher the average accuracy.
Inspired by these observations, we design a hybrid model which combines the seq2seq model and the retrieval model. In the hybrid model, the retrieval model is chosen if the maximal similarity score returned by the retrieval model is larger than a threshold, otherwise the seq2seq model is selected to solve the problem. Experiments on our dataset show that, by introducing the hybrid model, the accuracy increases from 58.1% to 64.7%.
Our contributions are as follows:
1) To the best of our knowledge, this is the first work to use DNN technology for automatic math word problem solving.
2) We propose a hybrid model in which a seq2seq model and a similarity-based retrieval model are combined to achieve further performance improvement.
3) A large dataset is constructed to facilitate the study of automatic math problem solving.
The remainder of this paper is organized as follows: After analyzing related work in Section 2, we formalize the problem and introduce our dataset in Section 3. We present our RNN-based seq2seq model in Section 4, and the hybrid model in Section 5. Experimental results are then shown and analyzed in Section 6. Finally, we conclude the paper in Section 7.
Related Work

Math Word Problems Solving
Previous work on automatic math word problem solving falls into two categories: symbolic approaches and statistical learning approaches. In 1964, STUDENT (Bobrow, 1964) handled algebraic problems in two steps: first, it transforms natural language sentences into kernel sentences using a small set of transformation patterns; then the kernel sentences are transformed into mathematical expressions by pattern matching. A similar approach is also used to solve English rate problems (Charniak, 1968, 1969). Liguda and Pfeiffer (2012) propose modeling math word problems with augmented semantic networks. In addition, addition/subtraction problems are the most studied (Briars and Larkin, 1984; Dellarosa, 1986; Bakman, 2007; Yuhui et al., 2010; Roy et al., 2015).
In 2015, Shi et al. (2015) propose SigmaDolphin, a system which automatically solves math word problems by semantic parsing and reasoning. In the same year, Koncel-Kedziorski et al. (2015) formalize the problem of solving multi-sentence algebraic word problems as that of generating and scoring equation trees.
Since 2014, statistical learning based approaches have been proposed to solve math word problems. Hosseini et al. (2014) deal with the open-domain aspect of algebraic word problems by learning verb categorization from training data. An equation template system has also been proposed to solve a wide range of algebra word problems. Zhou et al. (2015) further extend this method by adopting a max-margin objective, which results in higher accuracy and lower time cost. In addition, Roy et al. (2015) and Roy and Roth (2016) try to handle arithmetic problems with multiple steps and operations without depending on additional annotations or predefined templates. Mitra and Baral (2016) present a novel method that learns to use formulas to solve simple addition-subtraction arithmetic problems.
As reported in Huang et al. (2016), state-of-the-art approaches have extremely low performance on a big and highly diverse dataset (18,000+ problems). In contrast to these approaches, we study the feasibility of applying deep learning to the task of math word problem solving.

Sequence to Sequence (seq2seq) Learning
With the framework of seq2seq learning (Sutskever et al.), neural machine translation (NMT) and neural responding machine (NRM) models (Shang et al., 2015) have demonstrated the power of recurrent neural networks (RNNs) at capturing and translating natural language semantics. The NMT and NRM models are purely data-driven and directly learn to converse from end-to-end conversational corpora.
Recently, the task of translating natural language queries into regular expressions has been explored using a seq2seq model (Locascio et al., 2016), which achieves a performance gain of 19.6% over previous state-of-the-art models. To our knowledge, we are the first to apply a seq2seq model to the task of math word problem solving.

Problem Formulation
A math word problem P is a word sequence $W_P$ and contains a set of variables $V_P = \{v_1, \ldots, v_m, x_1, \ldots, x_k\}$, where $v_1, \ldots, v_m$ are known numbers in P and $x_1, \ldots, x_k$ are variables whose values are unknown. A problem P can be solved by a mathematical equation $E_P$ formed by $V_P$ and mathematical operators.

Problem: Dan have 5 pens and 3 pencils, Jessica have 4 more pens and 2 less pencils than him. How many pens and pencils do Jessica have in total?
Equation: x = 5 + 4 + 3 - 2
Solution: 10
Table 2: A math word problem
In math word problems, different equations may belong to the same equation template. For example, equation $x = (9 * 3) + 7$ and equation $x = (4 * 5) + 2$ share the same equation template $x = (n_1 * n_2) + n_3$. To decrease the diversity of equations, we map each equation to an equation template $T_P$ through a number mapping $M_P$. The number mapping process is defined as follows.
Definition 1 Number mapping: For a problem P with m known numbers, a number mapping $M_P$ maps the numbers in problem P to a list of number tokens $\{n_1, \ldots, n_m\}$ by their order in the problem text.
Definition 2 Equation template: A general form of equations. For a problem P with equation $E_P$ and number mapping $M_P$, its equation template is obtained by mapping the numbers in $E_P$ to a list of number tokens $\{n_1, \ldots, n_m\}$ according to $M_P$.

Take the problem in Table 2 as an example. First we obtain a number mapping from the problem:
$M: \{n_1 = 5;\ n_2 = 3;\ n_3 = 4;\ n_4 = 2\}$
and then the given equation can be expressed as an equation template:
$x = n_1 + n_3 + n_2 - n_4$
After number mapping, the problem in Table 2 becomes: "Dan have $n_1$ pens and $n_2$ pencils, Jessica have $n_3$ more pens and $n_4$ less pencils than him. How many pens and pencils do Jessica have in total?"
We solve math word problems by generating equation templates through a seq2seq model. The input of the seq2seq model is the sequence $W_P$ after number mapping, and the output is an equation template $T_P$. The equation $E_P$ can be obtained by applying the corresponding number mapping $M_P$ to $T_P$.
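The number mapping and template steps defined in this section can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names, the `n1, n2, ...` token spelling, and the regular expression used to detect numbers are all assumptions, and the value-based lookup in `to_template` is ambiguous when the same number occurs twice in a problem.

```python
import re

# Assumed number pattern: integers and simple decimals.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def number_mapping(problem_text):
    """Replace each number by n1, n2, ... in order of appearance (Definition 1)."""
    mapping = {}

    def repl(match):
        token = f"n{len(mapping) + 1}"
        mapping[token] = match.group()
        return token

    return NUMBER.sub(repl, problem_text), mapping


def to_template(equation, mapping):
    """Rewrite the numbers of an equation as number tokens (Definition 2)."""
    value_to_token = {value: token for token, value in mapping.items()}
    return NUMBER.sub(lambda m: value_to_token.get(m.group(), m.group()), equation)


def from_template(template, mapping):
    """Recover a concrete equation by applying the mapping to a template."""
    return re.sub(r"n\d+", lambda m: mapping[m.group()], template)


text = ("Dan have 5 pens and 3 pencils, Jessica have 4 more pens "
        "and 2 less pencils than him.")
mapped_text, mapping = number_mapping(text)
template = to_template("x = 5 + 4 + 3 - 2", mapping)
equation = from_template(template, mapping)
```

For the Table 2 problem, `template` is `x = n1 + n3 + n2 - n4` and `from_template` restores the original equation.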

Constructing a Large Dataset
Most public datasets for automatic math word problem solving are quite small and contain limited types of problems. The most frequently used dataset, Alg514, contains only 514 linear algebra problems with 28 equation templates. There are 1,000 problems in the newly constructed DRAW-1K (Shyam and Ming-Wei, 2017) dataset. Dolphin1878 (Shi et al., 2015) includes 1,878 number word problems. An exception is the Dolphin18K dataset (Huang et al., 2016), which contains 18,000+ problems. However, this dataset has not been made publicly available so far.
Since DNN-based approaches typically need large training data, we have to build a large dataset of labeled math word problems. We crawl over 60,000 Chinese math word problems from a couple of online education web sites. All of them are real math word problems for elementary school students. We focus on one-unknown-variable linear math word problems in this paper and leave other problem types as future work. Note that the solutions to the problems are in natural language, so we have to extract equation systems and structured answers from the solution text. We implement a rule-based extraction method for this purpose, which achieves very high precision and medium recall. That is, most equations and structured answers extracted by our method are correct, but many problems are dropped from the dataset. As a result, we obtain the dataset Math23K, which contains 23,161 problems labeled with structured equations and answers. Please refer to Table 3 for some statistics of the dataset and a comparison with other public datasets.

Deep Neural Solver
In this section, we propose an RNN-based seq2seq model to translate problem text to math equations. Since not all numbers in the problem text may be useful for solving the problem, we propose, in Section 4.2, a significant number identification model to decide whether a number in a problem should appear in the corresponding equations. Using the problem in Table 2 as an example, the model input after number mapping is: "Dan have $n_1$ pens and $n_2$ pencils, Jessica have $n_3$ more pens and $n_4$ less pencils than him. How many pens and pencils do Jessica have in total?"

RNN based Seq2seq Model
The output sequence $R = \{r_1, \ldots, r_s\}$ is the equation template; for the problem in Table 2, this is $x = n_1 + n_3 + n_2 - n_4$. Gated recurrent units (GRU) (Chung et al., 2014) and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) cells are used for encoding and decoding, respectively. We use a GRU rather than an LSTM as the encoder because the GRU has fewer parameters and is less likely to overfit on a small dataset. The GRU has four fundamental operational stages, in which $\sigma$ represents the sigmoid function and $\odot$ is an element-wise multiplication. The input $x_t$ is a word $w_t$ along with the previously generated character $r_{t-1}$. The variables U and W are weight matrices for each gate.
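For reference, the four GRU stages can be written in the standard form of Chung et al. (2014), shown below with bias terms omitted. This is the textbook formulation rather than a transcription of this paper's own equations, whose notation may differ. The reset gate is written $g_t$ here (an assumed symbol) to avoid a clash with the output character $r_t$, and the superscripts on W and U only label the gate each matrix belongs to.

```latex
\begin{aligned}
z_t &= \sigma\left(W^{z} x_t + U^{z} h_{t-1}\right) && \text{(update gate)} \\
g_t &= \sigma\left(W^{g} x_t + U^{g} h_{t-1}\right) && \text{(reset gate)} \\
\tilde{h}_t &= \tanh\left(W^{h} x_t + U^{h} (g_t \odot h_{t-1})\right) && \text{(candidate state)} \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t && \text{(hidden state)}
\end{aligned}
```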
The LSTM decoder follows the standard formulation (Hochreiter and Schmidhuber, 1997), where the input $x_t$ is a word $w_t$ along with the previously generated character $r_{t-1}$.
We then redesign the activation function of the seq2seq model, which makes it different from vanilla seq2seq models. If we directly generate equation templates with a softmax function, some incorrect equations may be generated, such as "$x = n_1 + + * n_2$" and "$x = (n_1 * n_2$". To ensure that the output equations are mathematically correct, we need to determine which characters are illegal given the previously generated characters. This is done by five predefined rules:
• Rule 1: If $r_{t-1}$ is in {+, −, *, /}, then $r_t$ will not be in {+, −, *, /, ), =};
• Rule 2: If $r_{t-1}$ is a number, then $r_t$ will not be a number and will not be in {(, =};
• Rule 3: If $r_{t-1}$ is "=", then $r_t$ will not be in {+, −, *, /, =, )};
• Rule 4: If $r_{t-1}$ is "(", then $r_t$ will not be in {(, ), +, −, *, /, =};
• Rule 5: If $r_{t-1}$ is ")", then $r_t$ will not be a number and will not be in {(, )}.
A binary vector $\rho_t$ is generated from $r_{t-1}$ and these rules. Each position in $\rho_t$ corresponds to a character in the output vocabulary, where "1" indicates that the character is mathematically correct and "0" indicates that it is mathematically incorrect. Thus, the output probability distribution at each time step t can be calculated as:
$$P(r_t \mid h_t) = \frac{\rho_t \odot \exp(h_t^{\top} W_s)}{\sum \rho_t \odot \exp(h_t^{\top} W_s)}$$
where $h_t$ is the output of the LSTM decoder and $W_s$ is the weight matrix. The probability of mathematically incorrect characters is therefore 0.
Our model is five layers deep, with a word embedding layer, a two-layer GRU as the encoder and a two-layer LSTM as the decoder. Both the encoder and the decoder contain 512 nodes. We apply standard dropout during training (Srivastava et al., 2014) after each GRU and LSTM layer, with a dropout probability of 0.5. We train for 80 epochs with a mini-batch size of 256 and a learning rate of 0.01.
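The rule-based masking of the decoder output described in this section can be sketched as follows. This is a hedged illustration in plain Python rather than the authors' code: the vocabulary, the `n1, n2, ...` token spelling, and the helper names `legality_mask` and `masked_softmax` are assumptions, and the sketch assumes at least one character is legal at every step.

```python
import math

OPERATORS = {"+", "-", "*", "/"}


def is_number(tok):
    # Number tokens are spelled n1, n2, ... in this sketch (assumed naming).
    return tok.startswith("n") and tok[1:].isdigit()


def legality_mask(prev, vocab):
    """Build the binary vector rho_t from the previous character r_{t-1}."""
    mask = []
    for tok in vocab:
        ok = True
        if prev in OPERATORS:                      # Rule 1
            ok = tok not in OPERATORS | {")", "="}
        elif is_number(prev):                      # Rule 2
            ok = not is_number(tok) and tok not in {"(", "="}
        elif prev == "=":                          # Rule 3
            ok = tok not in OPERATORS | {"=", ")"}
        elif prev == "(":                          # Rule 4
            ok = tok not in OPERATORS | {"(", ")", "="}
        elif prev == ")":                          # Rule 5
            ok = not is_number(tok) and tok not in {"(", ")"}
        mask.append(1 if ok else 0)
    return mask


def masked_softmax(logits, mask):
    """Softmax restricted to legal characters; illegal ones get probability 0."""
    top = max(l for l, keep in zip(logits, mask) if keep)  # for numerical stability
    exps = [math.exp(l - top) if keep else 0.0 for l, keep in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["x", "=", "+", "-", "*", "/", "(", ")", "n1", "n2"]
mask = legality_mask("+", vocab)
probs = masked_softmax([0.0] * len(vocab), mask)
```

With uniform logits and `+` as the previous character, only `x`, `(`, `n1` and `n2` remain legal, so each receives probability 0.25.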

Significant Number Identification (SNI)
In a math word problem, not all numbers appear in the equation for solving the problem. An example is shown in Table 4, where the number "1" in "1 day, 1 girl" and the number "2" in "She has 2 types of" should not be used in equation construction. We say a number is significant if it should be included in the equation for the problem; otherwise it is insignificant. For the problem in Table 4, the significant numbers are 9, 3, and 5, while 1 and 2 are insignificant. Identifying significant and insignificant numbers is important for constructing correct equations. For this purpose, we build an LSTM-based binary classification model to determine whether a number in a piece of problem text is significant.
The training data for the SNI model are extracted from the math word problems. Each number together with its context in a problem is one training instance for SNI. An instance is labelled "True" if the number is significant; otherwise it is labelled "False". The structure of the SNI model is shown in Figure 2. Using single-layer LSTMs with 128 nodes and a symmetric window of length 3, our model achieves 99.1% accuracy. Table 4 is an example of number mapping with and without SNI.
Problem: 1 day, 1 girl was organizing her book case making sure each of the shelves had exactly 9 books on it. She has 2 types of books - mystery books and picture books. If she had 3 shelves of mystery books and 5 shelves of picture books, how many books did she have in total?
Number mapping: $n_1 = 1$; $n_2 = 1$; $n_3 = 9$; $n_4 = 2$; $n_5 = 3$; $n_6 = 5$
Equation template: $x = n_5 * n_3 + n_6 * n_3$
Number mapping with SNI: $n_1 = 9$; $n_2 = 3$; $n_3 = 5$
Equation template with SNI: $x = n_2 * n_1 + n_3 * n_1$
Problem after number mapping and SNI: 1 day, 1 girl was organizing her book case making sure each of the shelves had exactly $n_1$ books on it. She has 2 types of books - mystery books and picture books. If she had $n_2$ shelves of mystery books and $n_3$ shelves of picture books, how many books did she have in total?
Table 4: Number mapping with and without SNI
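As a rough illustration of how SNI training instances could be built, the sketch below pairs every number token with its surrounding context and a significance label. It is not the authors' pipeline: the function name, the whitespace tokenization, the position-based labels, and the reading of "symmetric window of length 3" as three tokens on each side are all assumptions.

```python
import re

NUMBER = re.compile(r"^\d+(?:\.\d+)?$")


def sni_instances(tokens, significant_positions, window=3):
    """Return one (context_tokens, label) pair per number token.

    significant_positions holds the token indices of numbers that appear in
    the gold equation; every other number is labelled False.
    """
    instances = []
    for i, tok in enumerate(tokens):
        if NUMBER.match(tok):
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            instances.append((left + [tok] + right, i in significant_positions))
    return instances


tokens = "1 day , 1 girl had exactly 9 books".split()
instances = sni_instances(tokens, significant_positions={7})
```

Here the two occurrences of "1" yield negative instances and "9" yields a positive one; the resulting contexts would then be fed to the binary classifier.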

Hybrid Model
To compare the performance of our deep neural solver and traditional statistical learning methods, we implement a similarity-based retrieval model (refer to Section 5.1 for more details).
The Venn diagram in Figure 3 shows the relationship between the problems solved by the retrieval model and those solved by the seq2seq model. We can see that although seq2seq performs better on average, the retrieval model is able to correctly solve many problems that seq2seq cannot solve. If we can combine the two models properly to build a hybrid model, more problems may get solved.
Figure 3: Green area: problems correctly solved by the retrieval model; Blue area: problems correctly solved by the seq2seq model; Overlapped area: problems correctly solved by both models; White area: problems that both models fail to solve.
In this section, we first give some details about the retrieval model in Section 5.1, then the hybrid model is introduced in Section 5.2.

Retrieval Model
The retrieval model solves problems by calculating the lexical similarity between the testing problem and each problem in the training data; the equation template of the most similar problem is then applied to the testing problem. Each problem is modeled as a vector of word TF-IDF scores $W = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^{\top}$, where
$$w_{t,d} = tf_{t,d} \cdot \log \frac{|D|}{|\{d \in D \mid t \in d\}|}$$
and $tf_{t,d}$ is the frequency of word t in problem d; $|D|$ is the total number of problems in the dataset; and $|\{d \in D \mid t \in d\}|$ is the number of documents containing the word t. The similarity between the testing problem $P_T$ and another problem Q can be calculated by the Jaccard similarity between their corresponding vectors:
$$J(P_T, Q) = \frac{|P_T \cap Q|}{|P_T \cup Q|}$$
The retrieval model chooses the training problem $Q_1$ that has the maximal similarity with $P_T$ and uses the equation template T of $Q_1$ as the template of problem $P_T$.
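A minimal sketch of such a retrieval step is given below. The text does not spell out how the Jaccard similarity is applied to real-valued TF-IDF vectors, so this sketch assumes the generalized (weighted) form, sum of element-wise minima over sum of element-wise maxima, which reduces to the ordinary set-based Jaccard similarity when all weights are 0 or 1. The function names are hypothetical, and computing IDF over the same list of problems that is being vectorized is a simplification.

```python
import math
from collections import Counter


def tfidf_vectors(problems):
    """problems: list of token lists. Returns one {word: weight} dict per problem."""
    n_docs = len(problems)
    doc_freq = Counter(word for toks in problems for word in set(toks))
    vectors = []
    for toks in problems:
        tf = Counter(toks)
        vectors.append({w: tf[w] * math.log(n_docs / doc_freq[w]) for w in tf})
    return vectors


def weighted_jaccard(u, v):
    """Generalized Jaccard similarity between two sparse weight vectors."""
    words = set(u) | set(v)
    num = sum(min(u.get(w, 0.0), v.get(w, 0.0)) for w in words)
    den = sum(max(u.get(w, 0.0), v.get(w, 0.0)) for w in words)
    return num / den if den > 0 else 0.0


def retrieve(query_vec, train_vecs):
    """Return (index, similarity) of the most similar training problem."""
    scores = [weighted_jaccard(query_vec, v) for v in train_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


vectors = tfidf_vectors([["a", "b"], ["a", "c"]])
best_index, best_score = retrieve({"a": 1.0}, [{"b": 1.0}, {"a": 1.0}])
```

The template of the training problem at `best_index` would then be reused for the testing problem, and `best_score` is the maximal similarity that the hybrid model later compares against its threshold.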
An important and interesting observation about the retrieval model is the relation between the maximal similarity and solution accuracy. Figure 4 shows the results when only considering the problems for which the maximal similarity returned by the retrieval model is above a threshold θ (in other words, we skip a problem if its corresponding maximal similarity is below the threshold). It is clear that the larger the similarity score, the higher the average accuracy. In our hybrid model, we make use of this property to combine the seq2seq model and the retrieval model.

Hybrid Model
Our hybrid model combines the retrieval model and the seq2seq model by setting a hyperparameter θ as the similarity threshold. In Algorithm 1, if the Jaccard similarity between the testing problem $P_T$ and the retrieved problem $Q_1$ is higher than θ, the model chooses the equation template T of $Q_1$ as the equation template of problem $P_T$. Otherwise, an equation template is generated by the seq2seq model. As shown in Figure 4, the retrieval model has a higher precision than the seq2seq model when we set a high threshold.
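The selection logic of the hybrid model can be sketched as a small dispatcher. The two callables below are hypothetical stand-ins for the trained retrieval and seq2seq models, and the threshold value used in the example is arbitrary, not the value tuned in the experiments.

```python
def hybrid_solve(problem, retrieve, seq2seq, theta):
    """Use the retrieved template if its similarity exceeds theta, else decode one.

    retrieve(problem) -> (template, max_similarity) and
    seq2seq(problem) -> template are assumed interfaces.
    """
    template, similarity = retrieve(problem)
    if similarity > theta:
        return template, "retrieval"
    return seq2seq(problem), "seq2seq"


# Toy stand-ins for the two models, for illustration only.
def toy_retrieve(problem):
    similarity = 0.9 if "pens" in problem else 0.1
    return "x = n1 + n2", similarity


def toy_seq2seq(problem):
    return "x = n1 * n2"


high = hybrid_solve("Dan have n1 pens , Jessica have n2 pens", toy_retrieve, toy_seq2seq, 0.5)
low = hybrid_solve("each shelf had n1 books on n2 shelves", toy_retrieve, toy_seq2seq, 0.5)
```

Returning the name of the model that produced the template makes it easy to check how often each branch is taken as θ varies.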

Experiments
In this section, we conduct experiments on two datasets to examine the performance of the proposed models. Our main experimental result is a significant improvement over the baseline method on the proposed Math23K dataset. We further show that the baseline method cannot solve problems with new equation templates. In contrast, the proposed seq2seq model is quite robust on problems with new equation templates (refer to Table 7).

Algorithm 1 Hybrid model
Input: Q: problems in training data; $P_T$: testing problem; θ: pre-defined threshold of similarity
Output: Problem solution
1: Get equation templates and number mappings for training problems Q and testing problem $P_T$
2: Number identification: identify significant numbers
3: Retrieval: choose problem $Q_1$ from Q that has the maximal Jaccard similarity with $P_T$
4: if $J(P_T, Q_1) > θ$ then
5:     Apply the retrieval model: T = equation template of $Q_1$
6: else
7:     Apply the seq2seq model: T = seq2seq($P_T$)
8: end if
9: Apply the number mapping of $P_T$ to T and calculate the final solution

Experimental Setup
Datasets: As introduced in Section 3.2, we collected a dataset called Math23K, which contains 23,161 math word problems labeled with equation templates and answers. All these problems are linear algebra questions with only one variable. There are 2,187 equation templates in the dataset.
In addition, we also evaluate our method on the public dataset Alg514.
Baseline: We compare our proposed methods with two baselines. The first baseline is the retrieval model introduced in Section 5.1. The second one is ZDC (Zhou et al., 2015), which is an improved version of KAZB. It maps a problem to one equation template defined in the training set by reasoning across problem sentences. It reports an accuracy of 79.7% on the Alg514 dataset. The Stanford parser is adopted in ZDC to parse all math word problems.

Experimental Results
Each approach is evaluated on each dataset via 5-fold cross-validation: in each run, 4 folds are used for training and 1 fold is used for testing. Evaluation results are summarized in Table 5. First, to test the effectiveness of significant number identification (SNI), model performance before and after the application of SNI is compared. Then, the performance of the hybrid model, the seq2seq model, and the retrieval model is examined on the two datasets respectively.
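The 5-fold protocol can be illustrated with a short index-splitting helper. The function name is hypothetical, and the sketch assigns contiguous blocks to folds without shuffling, since the text does not say how problems were assigned to folds.

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_items))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


splits = list(k_fold_splits(23161, k=5))
test_sizes = [len(test) for _, test in splits]
```

For the 23,161 problems of Math23K this gives one fold of 4,633 test problems and four folds of 4,632, with every problem tested exactly once.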
To check whether the performance improvements are statistically significant, we conduct a significance study on pairs of methods. Table 6 shows the results of a sign test, where the symbol > indicates that the method in the row significantly outperforms the method in the column (p-value < 0.05), and the symbol ≫ indicates that the performance improvement is extremely significant (p-value < 0.01).
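For readers unfamiliar with the sign test, the sketch below computes an exact p-value from paired outcomes. It assumes a two-sided test with ties discarded, which the text does not specify, and the function name is hypothetical. Here `wins` counts problems that method A solves and method B does not, and `losses` counts the reverse.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test p-value under a fair-coin null hypothesis."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of seeing at most k of the rarer outcome in n fair trials.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_small_sample = sign_test_p_value(8, 2)
p_clear_gap = sign_test_p_value(20, 2)
```

With 8 wins against 2 losses the difference is not significant at the 0.05 level, whereas 20 wins against 2 losses falls below 0.01.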
Several observations can be made from the results. First, the seq2seq model significantly outperforms state-of-the-art statistical learning methods (ZDC and the retrieval model). Second, by combining the retrieval model and the seq2seq model using a simple mechanism, our hybrid model achieves a significant performance gain over the seq2seq model. Third, the SNI module effectively improves model accuracy: the accuracy of both the hybrid model and the seq2seq model increases by approximately 4% after number identification. Note that on the small Alg514 dataset, the seq2seq model performs much worse than the other methods. This is not surprising, because deep neural networks typically need large training data. Figure 5 shows the performance of different models on various scales of training data. As expected, the seq2seq model performs very well on big datasets, but poorly on small datasets.
Ability to Generate New Equation Templates: Note that many problems in Math23K can be solved using the same equation template. For example, a problem which corresponds to the equation $x = (9 * 3) + 7$ and a different problem that maps to $x = (4 * 5) + 2$ share the same equation template.
One nice property of the seq2seq model is its ability to generate new equation templates. Most previous statistical learning methods, including ZDC and the retrieval model, can only select equation templates from the training set. Results for problems with new equation templates are reported in Table 7. By comparing Table 5 and Table 7, it is clear that the gap between the seq2seq model and the baselines becomes larger in the new settings. This is because the seq2seq model can effectively generate new equation templates for new problems, instead of selecting equation templates from the training set.
Although ZDC and the retrieval model cannot generate new templates, their accuracy is not zero in the new settings. That is because one problem can be solved by multiple equation templates: Although one problem is labeled with template T 1 in the test set, it may also be solved by another template T 2 in the training set.

Discussion
Compared to most previous statistical learning methods for math problem solving, our proposed seq2seq model and hybrid model have the following advantages:
1) They have higher accuracy on large training data. On the Math23K dataset, the hybrid model achieves at least 22% higher accuracy than the baselines.
2) They have the ability to generate new templates (i.e., templates that are not in the training data).
3) They do not rely on sophisticated feature engineering.

Conclusion
We have proposed an RNN-based seq2seq model to automatically solve math word problems. This model directly transforms problem text to a math equation template. This is the first work of applying deep learning technologies to math word problem solving. In addition, we have designed a hybrid model which combines the seq2seq model and a retrieval model to further improve performance. A large dataset has been constructed for model training and empirical evaluation. Experimental results show that both the seq2seq model and the hybrid model significantly outperform state-of-the-art statistical learning methods in math word problem solving.
The output of our seq2seq model is a single equation containing one unknown variable. Therefore our approach is only applicable to the problems whose solution involves one linear equation of one unknown variable. As future work, we plan to extend our model to be able to generate equation systems and nonlinear equations.