Learning Fine-Grained Expressions to Solve Math Word Problems

This paper presents a novel template-based method to solve math word problems. This method learns the mappings between math concept phrases in math word problems and their math expressions from training data. For each equation template, we automatically construct a rich template sketch by aggregating information from various problems with the same template. Our approach is implemented in a two-stage system. It first retrieves a few relevant equation system templates and aligns numbers in math word problems to those templates for candidate equation generation. It then does a fine-grained inference to obtain the final answer. Experiment results show that our method achieves an accuracy of 28.4% on the linear Dolphin18K benchmark, which is 10% (54% relative) higher than previous state-of-the-art systems while achieving an accuracy increase of 12% (59% relative) on the TS6 benchmark subset.


Introduction
The research topic of automatically solving math word problems dates back to the 1960s (Bobrow, 1964a,b;Charniak, 1968). Recently many systems have been proposed to these types of problems Hosseini et al., 2014;Koncel-Kedziorski et al., 2015;Zhou et al., 2015;Roy and Roth, 2015;Shi et al., 2015;Upadhyay et al., 2016;Mitra and Baral, 2016). On a recent evaluation conducted by Huang et al. (2016), current state-of-the-art systems only achieved an 18.3% accuracy on their published dataset Dol-phin18K. Their results indicate that math word problem solving is a very challenging task.
To solve a math word problem, a system needs to understand natural language text to extract information from the problem as local context. Also, it should provide an external knowledge base, including commonsense knowledge (e.g. "a chicken has two legs") and mathematical knowledge (e.g. "the perimeter of a rectangle = 2 * length + 2 * width"). The system can then perform reasoning based on the above two resources to generate an answer.

P1
: What's 25% off $139.99? Equation: (1-0.25)*139.99 = x P2: How much will the ipod now be if the original price is $260 and I get 10% discount? Equation: (1-0.1)*260 = x Template: (1-n 1 )*n 2 = x P3: I bought something for $306.00 dollars. I got a 20% discount. What was the original price? Equation: (1-0.2)*x = 306 Template: (1-n 1 )*x = n 2 In this paper, we focus on the acquisition of mathematical knowledge, or deriving math concepts from natural language. Consider the first two problems P 1 and P 2 in Figure 1. The math concept in the problems tells you to take away a percentage from one and get the resulting percentage of a total. Using mathematical language, it can be formulated as (1−n 1 ) * n 2 , where n 1 , n 2 are quantities. In this example, we can derive the concept of subtraction from the text "[NUM] % off " and "[NUM] % discount".
Acquisition of mathematical knowledge is nontrivial. Initial statistical approaches (Hosseini et al., 2014;Roy and Roth, 2015;Koncel-Kedziorski et al., 2015) derive math concepts based on observations from their dataset of specific types of problems, e.g. problems with one single equation. For example, Hosseini et al. (2014) assumes verbs and only verbs embed math concepts and map them to addition/subtraction. Roy and Roth (2015); Koncel-Kedziorski et al. (2015) assume there is only one unknown variable in the problem and cannot derive math concepts involving constants or more than one unknown variables, such as "the product of two unknown numbers".
Template-based approaches Zhou et al., 2015;Upadhyay et al., 2016), on the other hand, leverage the built-in composition structure of equation system templates to formulate all types of math concepts seen in training data, such as (1 − n 1 ) * n 2 = x in Figure 1. However, they suffer from two major shortcomings. First, the math concepts they learned, which is expressed as an entire template, fails to capture a lot of useful information with sparse training instances. We argue that it would be more expressive if the math concept is learned in a finer granularity. Second, their learning processes rely heavily on lexical and syntactic features, such as the dependency path between two slots in a template. When applied to a large-scale dataset, they create a huge and sparse feature space and it is unclear how these template-related features would contribute.
To alleviate the sparseness problem of math concept learning and better utilize templates, we propose a novel approach to capture rich information contained in templates, including textual expressions that imply math concepts. We parse the template into a tree structure and define "template fragment" as any subtree with at least one operator and two operands. We learn fine-grained mappings between textual expressions and template fragments, based on longest common substring. For example, given the three problems in Figure 1, we can map "[NUM] % off" and "[NUM] % discount" to 1 − n 1 , and "[NUM] % off [NUM]" to (1 − n 1 ) * n 2 = x. In this way, we can decompose the templates and learn math concepts in a finer grain. Furthermore, we observe that problems of the same template share some common properties. By aggregating problems of the same template and capturing these properties, we automatically construct a sketch for each template in the training data.
Our approach is implemented in a two-stage system. We first retrieve a few relevant templates in the training data. This narrows our search space to focus only on those templates that are likely to be relevant. Then we align numbers in the problem to those few returned templates, and do finegrained inference to obtain the final answer. We show that the textural expressions and template sketch we propose are effective for both stages. In addition, our system significantly reduces the hypothesis space of candidate equations compared to previous systems, which benefits the learning process and inference at scale. We evaluate our system on the benchmark dataset provided by Huang et al. (2016). Experiments show that our system outperforms two stateof-the-art baselines with a more than 10% absolute (54% relative) accuracy increase in the linear benchmark and a more than 20% absolute (71% relative) accuracy increase for the dataset with a template size greater than or equal to 6.
In the remaining parts of this paper, we introduce related work in Section 2, describe template sketch and textual expression learning in Section 3, present our two-stage system in Section 4, summarize experiment setup and results in Section 5, and conclude this paper in Section 6.

Related Work
Automatic math word problem solving methods (Bobrow, 1964a,b;Charniak, 1968Charniak, , 1969Briars and Larkin, 1984;Fletcher, 1985;Dellarosa, 1986;Bakman, 2007;Yuhui et al., 2010) developed before 2008 are mostly rule-based. They accept limited well-format input sentences and map them into certain structures by pattern matching. They usually focus on problems with simple math operations such as addition or subtraction. Please see Mukherjee and Garain (2008) for a summary.
In recent years, symbolic and statistical methods have been explored by various researchers. In the symbolic approach, systems transform math word problems to structured representations. Bakman (2007) maps math problems to predefined schema with a table of textual formulas and changing verbs. Liguda and Pfeiffer (2012) uses augmented semantic networks to represent math problems. Shi et al. (2015) parses math problems to their pre-defined semantic language. However, these methods are only effective in their designated math problem categories and are not scalable to other categories. For example, the method used by Shi et al. (2015) works extremely well for solving number word problems but not others.
In the statistical machine learning approach, Hosseini et al. (2014) solves addition and subtraction problems by extracting quantities as states and derive math concepts from verbs in the training data.  and Zhou et al. (2015) generalize equations attached to problems with variable slots and number slots. They learn a probabilistic model for finding the best solution equation. Upadhyay et al. (2016) follows their approach and leverage math word problems without equation annotation as external resources. Seo et al. (2015) solves a set of SAT geometry questions with text and diagram provided. Koncel-Kedziorski et al. (2015) and Roy and Roth (2015) target math problems that can be solved by one single linear equation. They map quantities and words to candidate equation trees and select the best tree using a statistical learning model. Mitra and Baral (2016) considers addition and subtraction problems in three basic problem types: "Change", "Part Whole" and "Comparison". They manually design different features for each type, which is difficult to expand to more types.
In summary, previous methods can achieve high accuracy in limited math problem categories, (i.e. Shi et al., 2015)), but do not scale or perform well in datasets containing various math problem types as in Huang et al. (2016), as their designed features are becoming sparse. Their process of acquiring mathematical knowledge is either sparse or based on certain assumptions of specific problem types. To alleviate this problem, we introduce our template sketch construction and fine-grained expressions learning in the next section.

Template Sketch Construction
A template sketch contains template information.
We define three categories of information for the sketch shown in this section. Next we describe how we construct a template sketch, via aggregation of rich information from training problems. We group problems of the same template in training set as one cluster and collect information. See Figure 2 for the outline of our template sketch con-struction.

Definition
Template: It is first introduced in . It is a unique form of an equation system. For example, given an equation system as follows: This equation system is a solution for a specific math word problem. We replace the numbers with four number slots {n 1 , n 2 , n 3 , n 4 } and generalize the equations to the following template: Alignment: We align numbers in the math problem with the number slots of a template. For the first math problem in Figure 1 with its corresponding template (1 − n 1 ) * n 2 = x, there are two numbers 0.25 and 139.99 to align with two number slots n 1 and n 2 , which results in two different alignments.  aligns nouns to variable slots {x 1 , x 2 , ...} which leads to a huge hypothesis space and does not perform as well as the number slot alignment only method proposed later by (Zhou et al., 2015). Therefore, we only consider number slot alignment in this paper.

Textual Expressions
For template fragments, there are usually some textual expressions. For example, "n 1 % off" and "n 1 % discount" are both mapped to the template fragment 1 − n 1 .
We employ a statistical framework to automatically mine textual expressions for template fragments from a training dataset. First we parse the equation to a hierarchical tree. In a bottom-up approach, we obtain each possible subtree as a template fragment t k , which associates with at least one number slot. For each t k , we use the numbers to anchor the number-related phrases in the problem, replace numbers with"[NUM]" and noun phrases with "[VAR]", and cluster the phrases P = {p 1 , p 2 , · · · } with the same t k across all data given all training problems. Then we compute the longest common substring lcs k ij between pairs p i and p j and calculate tf-idf score of lcs k ij . We keep the lcs k ij with scores above certain empirically determined threshold as the textual expressions. (1-0.28)*275 = x Template: (1-n 1 )*n 2 = x Problem: Figure 2: Template Sketch Construction.

Slot Type
Number slots in templates have their own type of constraints. For example, in the template (1−n 1 ) * n 2 = x, usually n 1 represents a percentage quantity and n 2 is the quantity of an object. We model slot types with quantity units, and find the direct governing noun phrase as its 'owner'. For the problem in Figure 2, we extract quantity unit sequence as {%, $}, normalized unit sequence as {0, 1} (because % and $ are of different quantity types), and quantity owners as {discount, item}. The slot type information provides important clues to choose the correct template and alignment.

Question Keyword
Question keyword decides which template we use. Given the following problem setting: "A rectangle has a width of 5cm and a length of 10cm.", we can ask either Q1:"What is the area of the rectangle?" or Q2: "What is the difference between width and length?". The question keywords area and difference help our system to decide if is should apply template n 1 * n 2 = x for Q1 and apply template n 1 − n 2 = x for Q2.
We first detect the question sentence (containing keywords "what","how","figure out"...). Then we extract the question keyword on the dependency tree with simple rules that we observed in the dev set (e.g. retrieving nouns with "attr − nsubj" dependency relation with keyword "what"). Please note that we favor recall over precision of our detected question keywords since they are used as features instead of hard constrains on template decision. Simple rule-based extraction can already satisfy our need for detecting question keywords in math problems.

Two-Stage System
In this section, we describe our two-stage system for solving math problems, including template retrieval and alignment ranking. We show how to apply textual expressions and template sketch to our system.

Template Retrieval
We use an efficient retrieval module to first narrow our search space and focus only on templates that are likely to be relevant. Let χ denote the set of test problems, and T = {t 1 , t 2 , . . . , t j } as the template set in the training data. For each test problem x i , our goal is to select the correct template t j . We define the conditional probability of selecting a template given a problem as follows: where ν t is the model parameter and f (x i , t j ) is the feature vector. We apply the Ranking SVM (Herbrich et al., 2000) to minimize a regularized margin-based pairwise loss. We then have the following objective function: where superscript "+" indicates the correct instance and "-" indicates the false ones. We use the loss function l(t) = max(0, 1 − t) 2 .
To construct the vector f (x i , t j ) for template t j , we use the three categories in the template sketch shown in Table 1. Let Q(t j ) represent the cluster of training problems with template t j .

Textual Features
Contains textual expressions in each template fragments? Average Word Overlap with Q(t j ) Max Word Overlap with Q(t j ) Quantity Features Unit sequence in Q(t j ) Normalized unit sequence in Q(t j ) Question Features Is Question keyword in Q(t j ) At the phrase level, as we have mined different expressions in 3.2 for slots in templates, we can extract the phrases related to each number or number pair in a test problem and match them with expressions. For example, given a test problem to match template (1 − n 1 ) * n 2 = x in Figure 2, we have two groups of patterns to match, corresponding to 1.0 − n 1 and (1.0 − n 1 ) * n 2 respectively. Quantity types in a problem are important. We use the unit type sequences and normalized unit type sequence for describing number slot types in a template. In addition, if a number unit type cannot differentiate each number slot, we will make use of number "owner" as defined in subsection 3.3. For example, in the sentence "The width is 3cm and the length is 5cm", we extract two quantities with unit type sequence {cm, cm}; and owner {width, length}.
In addition, we consider question keywords for templates. For example, if the question keyword is "difference", then x + n 1 = n 2 will have a higher probability of being selected than x = n 1 * n 2 .
We observe that in some cases, one word difference can lead to two different templates. To consider cases in which some templates are very similar (e.g. x + n 1 = n 2 and n 1 + n 2 = x, part/whole unknown), we retrieve the top ranked N (N =3) templates as candidates for alignment in the next stage.

Alignment Ranking
For each top N templates from the previous stage, we generate possible alignments A = {a 1 , a 2 , . . . , a m } as the candidate equation system for the test problem x i . We train a ranking model to choose the alignment with the highest probability p(a k |x i , t j ; ν a ), where ν a is the model parameter vector.
We use the same ranking model as in template selection stage and the objective function is changed to: We design more fine-grained features for each number slot to formulate the alignment feature vector f (x i , a k ). It contains the following features in Table 2.

Textual Features
Match textual expressions in template fragment aligned to each number slot (pair) Quantity Features Aligned unit sequence in Q(t j ) Aligned normalized unit sequence in Q(t j ) Relationship with noun phrase Optimal number 1 or 2 is used?

Solution Features
Is integer solution? Is positive solution? At the textual level, we want to capture textual expressions describing each number slot. For example, in the template (1 − n 1 ) * n 2 = x, we have mined patterns of 1 − n 1 in 3.2, such as "a discount of n 1 %", "mark down n 1 %", etc. Given the problem in Figure 2 as the test problem, alignment (1-0.28) * 275 = x matches textual expressions, while (1-275) * 0.28 = x does not.
For quantity features, we use the alignmentordered unit sequence. For the problem in Figure  2 mapping to template (1 − n 1 ) * n 2 = x, we have two different alignments: {n 1 :0.28, n 2 :275}, {n 1 :275,n 2 :0.28}. Their aligned unit sequences are {%, $} and {$,%} respectively. We also use the relations of quantities with noun phrases to differentiate number slot interaction with unknown variable slots and number slots, such as n 1 * x and n 1 * n 2 .
Some templates have numerical solution properties while others do not. For example, template x 1 = (n 1 − n 2 )/(n 3 − n 4 ) would be less likely to have any strong indication of integer solution properties. We count the percentage of integer/positive solutions from the corresponding problems as the probability that this template prefers an integer/positive solution.

Model Discussion
Our method has two main differences from previous template-based methods Zhou et al., 2015;Upadhyay et al., 2016).
First, previous methods implicitly model mapping from problem text to templates. We learn fine-grained textual expressions mapped to template fragments; and explicitly model the property of templates with template sketches. Second, previous methods align numbers for all templates in a training set, while we only examine the N most probable templates. This significantly reduces the equation candidate search space. Given a problem in which m numbers align with a template of n number slots, the number of possible equation candidates would be A n m . The search space grows linearly with the number of templates in the training data. Suppose m = 5, n = 4 and we have 1000 templates, the total space would be (5 * 4 * 3 * 2) * 1000 = 120, 000 for one problem in Zhou et al. (2015), and will be much larger if it considers unknown variable alignment as in .

Experiments
Settings As demonstrated in Huang et al. (2016), previous datasets for math problems are limited in both scale and diversity. We conduct our experiment on their dataset Dolphin18K. We use the linear subset, containing 10,644 problems in total. We use two baseline systems for comparison: (1) ZDC (Zhou et al., 2015) is a statistical learning method that is an improved version of KAZB  1 . (2) SIM (Huang et al., 2016) is a simple similarity based method. We do not compare other systems because they only solve one specific type of problem, e.g. Hosseini et al. (2014) only handle addition/subtraction problems and Koncel-Kedziorski et al. (2015) aim to solve problems with one single linear equation. Experiments are conducted using 5-fold crossvalidation with 80% problems randomly selected as training data and the remaining 20% for testing. We report the solution accuracy. Table 3 shows the overall performance of different systems. In the table, the size of a template is the number of problems corresponding to a template. For example, for templates with a size 100 or larger, their problem counts add up to 1,807.  From the table, we observe that our model consistently achieves better performance than the baselines on all template sizes. As the template size becomes larger, all three systems achieve better performance. When template size equals 6 (TS6, as a de-facto template size constrain adopted in ZDC), our model achieve an absolute increase of over 12% (59% relative). This demonstrates the effectiveness of our proposed method.

Overall Evaluation Results
When including long tail problems with a template size less than 2, performance of all three systems drop significantly. This is because the templates of these problems are not seen in the training set, and thus are difficult to solve using these template-based methods. Still, we have at least 10% absolute (54% relative) accuracy increase on the whole test set compared to the two baselines. Previous template-based methods require templates size larger than 6 in the data as constraints. From the result, we can see that our method relaxes the template size constraint and matches more problems with less frequent templates.

Accuracy per Template
Here we investigate the performance of different templates. In Table 4, we sample some dominant templates and report their accuracies. For our model, we report both template retrieval accuracy and final solution accuracy.
As we can see, our method performs better than the baselines for most dominant templates. Performance of the dominant templates can reach an accuracy level of 60%. This proves that our template sketch and textual expressions are effective in capturing rich template information.
To our surprise, some templates tend to perform better than others even with smaller template sizes. For example, x 1 = n 1 − n 2 , which represents the subtraction problem, has 63 problems but performs not as well as x 1 = (n 1 − n 2 )/(n 3 − n 4 ) which has 48 problems. We look into their corresponding problems and find out that x 1 = n 1 − n 2 are applied to more themes in natural language than x 1 = (n 1 − n 2 )/(n 3 − n 4 ), which are almost about the theme of "coordinate slope".
In our model, there is a gap between template retrieval accuracy and final solution accuracy, which means that although we select the correct template candidates for the problem, the alignment model cannot rank the equations correctly.

Two-Stage Evaluation
Next, we evaluate the performance of our twostage system. Accuracy of template retrieval and alignment ranking is shown in Table 5.
For template retrieval accuracy, Hit@N means the correct template for a problem is included in the top N list returned by our model. We estimate the best achievable performance by using oracle template retrieval. The result is 47.1% (Hit@ALL), which means 47.1% of the templates exist more than once in the problem set. Please note that our template retrieval evaluation may be underestimated, since in some cases, a test problem can be solved by different templates.
We then use the top N templates as input for both our alignment ranking and ZDC. From the table, we have the following observations: (1) Hit@3 performs better compared to Hit@1 for both systems. This confirms our claim that some templates are similar and we need to incorporate more fine-grained features to differentiate in the alignment step; (2) It obtains the highest accuracy when N = 3 and decreases when N gets larger. Both systems get benefits from our template retrieval which helps retrieve relevant templates and reduce the hypothesis space of equations; (3) Given the same N templates input, our alignment ranking achieves better performance than ZDC. This implies that our features are more indicative.

Feature Ablation
This section describes our feature ablation study.
Template Retrieval In Table 6, we conduct three configurations against our model (FULL). Each ablated configuration corresponds to one category of our template sketch. From the table, we can see that all three categories of features contribute to system performance. We remove QUANTITY results in the worse performance comparing to the FULL model. Alignment Ranking In Table 7, N means to select the top N templates in the previous stage for alignment. The column "Correct Template" represents how well the alignment model can perform given the correct template input for alignment. Our alignment model (FULL) performs the best compared to the three ablated settings, which confirms the effectiveness of template properties.

Error Analysis
We have observed that template-based methods have difficulty solving problems with small template sizes, especially for cases that have a single problem instance (i.e. template size = 1). We sample 100 problems in which our system makes mistakes in the dev set of Dolphin18K and summarize them in Table 8.
Quantity    the table, if we can detect "24 male" is the same as "men", the problem can be solved.
Relation/State Detection If we can identify the changed states or mathematical relations between variables, we can solve this category of problems. In the example problem, it is important to understand that "commission is taken out" is my money state.
External Knowledge This requires specific mathematical models, such as permutation and combination, or commonsense knowledge, e.g. a dice has 6 sides.

Equation Decomposition
The limitation of template-based approaches is that they require test problems belonging to one of the templates seen  in training. Thus, for problems corresponding to template sizes less than 2, we can decompose templates into smaller units and conduct learning more precisely. We then need to generate the equations, which is also a challenge.

Conclusion and Future Work
In this paper, we propose a novel approach to solving math word problems with rich information of templates. We learn mappings between textual expressions and template fragments. Furthermore, we automatically construct sketches for each template. We implement a two-stage system, including template retrieval and alignment ranking. Experiments show that our method performs signifi-Category Math Problem Quantity Type (10%) The ratio of women to men in a certain club is 3 to 2. If there are 24 male club members, then how many female club members are there? Relation/State Detection (12%) If I am selling something for $25,000 and a 7% commission is taken out, how much money will I be left with? External Knowledge (23%) Find the probability that total score is 10 or more given at least one dice show 6 if 2 dice red & blue thrown? Equation Decomposition (55%) The average weight of A, B and C is 45 kg. If the average weight of A and B is 40 kg and that of B and C is 43 kg, the weight of B is? cantly better than two state-of-the-art systems.
Based on our error analysis, we plan to improve our model by detecting quantity types more accurately, learning relations and incorporating commonsense knowledge. For long tail problems with a template size less 2, we want to utilize the finegrained expressions we have learned and decompose equations for learning. Then we can reason with small equation units to generate a final equation in testing. We would like to leverage semantic parsing and transform math problems to a more structured representation. Additionally, we plan to apply our findings to generating math problem.