Quantity Tagger: A Latent-Variable Sequence Labeling Approach to Solving Addition-Subtraction Word Problems

An arithmetic word problem typically includes a textual description containing several constant quantities. The key to solving the problem is to reveal the underlying mathematical relations (such as addition and subtraction) among quantities, and then generate equations to find solutions. This work presents a novel approach, Quantity Tagger, that automatically discovers such hidden relations by tagging each quantity with a sign corresponding to one type of mathematical operation. For each quantity, we assume there exists a latent, variable-sized quantity span surrounding the quantity token in the text, which conveys information useful for determining its sign. Empirical results show that our method achieves 5 and 8 points of accuracy gains on two datasets respectively, compared to prior approaches.


Introduction
Teaching machines to automatically solve arithmetic word problems, exemplified by two problems in Figure 1, is a long-standing Artificial Intelligence (AI) task (Bobrow, 1964; Mukherjee and Garain, 2008).
Recent research (Hosseini et al., 2014; Roy and Roth, 2015; Wang et al., 2017, 2018b) has focused on designing algorithms to automatically solve arithmetic word problems. One line of prior work designed rules (Mukherjee and Garain, 2008; Hosseini et al., 2014) or templates (Zhou et al., 2015; Mitra and Baral, 2016) to map problems to expressions, where the rules or templates are collected from training data.
However, it would be non-trivial and expensive to acquire a general set of rules or templates. Furthermore, such approaches typically require additional annotations. The addition-subtraction problems, which constitute the most fundamental class of arithmetic word problems, have been the focus of many previous works (Hosseini et al., 2014; Mitra and Baral, 2016). We also focus on this important task in this work. Our key observation is that this class of problems can be tackled from a sequence labeling perspective. This motivates us to build a novel sequence labeling approach, namely Quantity Tagger. The approach tags each quantity in the text with a label that indicates a specific mathematical operation.
Figure 1: Two example arithmetic word problems.
Problem 1: A worker at a medical lab is studying blood samples. 2 samples contained a total of 7341 blood cells. The first sample contained 4221 blood cells. How many blood cells were in the second sample?
Prediction: (0)×2 + (+1)×7341 + (−1)×4221 + (−1)×x = 0
Equation: 7341 − 4221 − x = 0
Solution: x = 3120
Problem 2: There are 22 walnut trees currently in the park. Park workers will plant walnut trees today. When the workers are finished there will be 55 walnut trees in the park. How many walnut trees did the workers plant today?
Prediction: (+1)×22 + (−1)×55 + (+1)×x = 0
Equation: 22 − 55 + x = 0
Solution: x = 33
Solving an arithmetic word problem can thus be cast as a sequence labeling problem where we assign every quantity appearing in the problem text a sign (in the form of a tag) from the set {+1, 0, −1}. We further assume there exists a latent quantity span that needs to be learned: a sequence of words surrounding each quantity, based on which tagging decisions can be made. We demonstrate through experiments on benchmark data that, despite the relatively simple assumptions involved, our novel sequence labeling approach yields significantly better results than various state-of-the-art models. To the best of our knowledge, this is the first work that tackles the problem from a sequence labeling perspective. Our code is publicly available at https://github.com/zoezou2015/quantity_tagger.
Figure 2: L, N and R nodes with the subscripts +, 0 and − over the tokens of Problem 2 ("There are 22 walnut trees currently in the park . Park workers will plant walnut trees today . When the workers are finished there will be 55 walnut trees in the park . How many walnut trees did the workers plant today ?").

A Tagging Problem
We define Q = (q_1, q_2, ..., q_i, x, q_{i+1}, ..., q_m) (0<i<m, m≥2 in arithmetic word problems) as an ordered quantity sequence for a problem text T, where q_i ∈ Q represents a constant quantity appearing in T, and x stands for the unknown quantity assigned to the question sentence. Q maintains the same order as the quantities appearing in T. The goal is to construct a valid math equation E. This research investigates such a problem by sequentially tagging each quantity q ∈ Q with the most likely sign from the set S = {+1, 0, −1}, where "+(−)1" means a quantity is positively (negatively) related to the question, i.e., the sign of the quantity should be +(−) when forming part of the equation; "0" means a quantity is irrelevant to the question and should be ignored.
Given a specific prediction of the signs of the quantities, we can form an equation as follows:
$$\sum_{i=1}^{m} s_i \, q_i + s_x \, x = 0$$
where s_i ∈ {+1, 0, −1} is the sign for the i-th constant quantity q_i, and s_x ∈ {+1, −1} is the sign for x. Since the equation is linear in x, the solution can be easily obtained.
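To make the mapping from signs to a solution concrete, the following is a minimal sketch that solves the signed-sum equation for x. The function name `solve_unknown` and the use of exact fractions are illustrative assumptions, not part of the released code.

```python
from fractions import Fraction


def solve_unknown(signs, quantities, sign_x):
    """Solve sum_i s_i * q_i + s_x * x = 0 for the unknown x.

    signs:      one sign in {+1, 0, -1} per constant quantity
    quantities: the constant quantities, in the same order
    sign_x:     the sign of the unknown, +1 or -1
    (Hypothetical helper written for illustration only.)
    """
    if len(signs) != len(quantities):
        raise ValueError("signs and quantities must have the same length")
    if any(s not in (-1, 0, 1) for s in signs):
        raise ValueError("each sign must be +1, 0 or -1")
    if sign_x not in (-1, 1):
        raise ValueError("the sign of the unknown must be +1 or -1")
    # Exact arithmetic avoids floating-point noise for decimal quantities.
    total = sum(s * Fraction(str(q)) for s, q in zip(signs, quantities))
    return -total / sign_x


# Problem 1 of Figure 1: (0)x2 + (+1)x7341 + (-1)x4221 + (-1)x x = 0
print(solve_unknown([0, +1, -1], [2, 7341, 4221], sign_x=-1))  # 3120
# Problem 2 of Figure 1: (+1)x22 + (-1)x55 + (+1)x x = 0
print(solve_unknown([+1, -1], [22, 55], sign_x=+1))  # 33
```

Both calls reproduce the solutions listed in Figure 1, which illustrates that a correct sign assignment fully determines the answer.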

Quantity Tagger
Our primary assumption is that, for each quantity, there exists an implicit quantity span that resides in the problem text and conveys information useful for determining the sign of the quantity. The quantity span of a quantity is essentially a contiguous token sequence from the problem text that consists of the quantity itself and some surrounding word tokens. Formally, our model needs to learn how to sequentially assign each quantity q ∈ Q its optimal sign s ∈ S. This is a sequence labeling problem (Lample et al., 2016; Zou and Lu, 2019). Common sequence labeling tasks, such as NER and POS tagging, mainly consider one sentence at a time and tag each token in the sentence. However, our tagging problem typically involves multiple sentences, some of which may contain relatively unimportant information. For instance, the second sentence of Problem 2 in Figure 1, "Park workers will plant walnut trees today", describes background knowledge of the problem, but such information may not be useful for solving the problem and may even be obstructive.
For each quantity q ∈ Q, we first consider a token window consisting of q, the J − 1 tokens located immediately to its left, and the J − 1 tokens located immediately to its right. This gives us a window of 2J − 1 word tokens. Next, the token windows for all quantities in Q are merged to form a new token sequence, denoted as t. Note that t is formed by concatenating token subsequences taken from T and has length n (1 ≤ n ≤ N, where N is the length of T). We assume the quantity spans are defined over this token sequence t (rather than T), which we believe conveys the most relevant information for determining the signs of the quantities. Taking Problem 2 in Figure 1 as an example, we show a token sequence t with J = 3 in Figure 2.
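The window-merging step can be sketched as follows. This is an illustrative reading of the description above, assuming that the positions of the quantity tokens are already known and that overlapping windows are merged without duplicating tokens; the function name `merge_windows` is hypothetical, and how the unknown x is positioned is left to the caller.

```python
def merge_windows(tokens, quantity_positions, J):
    """Merge the (2J - 1)-token windows around each quantity into t.

    tokens:             the problem text T as a list of tokens
    quantity_positions: indices of the quantity tokens in T (assumed given)
    J:                  window hyperparameter (J - 1 tokens on each side)
    Returns the merged token sequence t and the kept indices into T.
    """
    if J < 1:
        raise ValueError("J must be at least 1")
    keep = set()
    for pos in quantity_positions:
        lo = max(0, pos - (J - 1))                # clip at the start of T
        hi = min(len(tokens) - 1, pos + (J - 1))  # clip at the end of T
        keep.update(range(lo, hi + 1))
    kept = sorted(keep)  # preserve the original token order
    return [tokens[i] for i in kept], kept


text = ("There are 22 walnut trees currently in the park . "
        "When the workers are finished there will be 55 walnut trees in the park .")
tokens = text.split()
t, kept = merge_windows(tokens, quantity_positions=[2, 18], J=3)
print(t)
```

With J = 3 each quantity contributes a window of five tokens, so the two constant quantities in this shortened text yield a sequence t of ten tokens.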
To capture quantity span information, we design 9 different labels with different semantics:
• The N nodes are used to indicate that the current token is a quantity.
• The L (R) nodes are used to indicate that the current token appears within a quantity span of a given quantity but to the left (right) of the quantity.
The subscripts "+", "0", and "−" are used to denote the sign (+1, 0 and −1 respectively) associated with the quantities (and quantity spans). Combining the three node types with the three signs gives the 9 labels.
All quantities are explicitly given in the problem text. Therefore, the N node is used to tag a word token if and only if the token represents a quantity. Otherwise, L and R nodes are considered. Furthermore, the unknown quantity is always relevant to the problem. We thus tag it with either N+ or N−, while all three types of N nodes are available for constant quantities. As illustrated in Figure 2, only one node from the node set H will be selected at each position. Sequentially connecting all such nodes forms a single path that reveals information about the quantity spans selected for all quantities.
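As a small illustration of the label scheme, the sketch below enumerates the 9 labels and reads the quantity signs off a label path. The string encoding of the labels (for example "N+" with an ASCII minus sign) and the function name `signs_from_path` are assumptions made for this example.

```python
from itertools import product

# Node types L, N, R combined with the signs +, 0, - give 9 labels.
SIGN_OF = {"+": +1, "0": 0, "-": -1}
LABELS = [kind + sign for kind, sign in product("LNR", "+0-")]


def signs_from_path(path):
    """Return the signs of the quantities tagged along a label path.

    Only N nodes mark quantity tokens, so the sign sequence is read
    from the N labels in order of appearance.
    """
    for label in path:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
    return [SIGN_OF[label[1]] for label in path if label[0] == "N"]


print(LABELS)
# A hypothetical path over seven tokens containing two quantities.
print(signs_from_path(["L+", "L+", "N+", "R+", "L-", "N-", "R-"]))  # [1, -1]
```

The L and R labels only delimit the spans; the predicted equation depends solely on the signs carried by the N labels.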
Following CRF (Lafferty et al., 2001), we formulate our method as a log-linear model with latent variables. Formally, given the problem text T, let t = (t_1, t_2, ..., t_n) be a token sequence as defined above, y be the corresponding label sequence, and h be a latent variable that provides specific quantity span information for the (t, y) tuple. We define:
$$p(y \mid t) = \frac{\sum_{h} \exp\big(w^{\top} f(t, y, h)\big)}{\sum_{y', h'} \exp\big(w^{\top} f(t, y', h')\big)}$$
where w is the feature weight vector, i.e., the model parameters, and f is the feature vector defined over the triple (t, y, h); f(t, y, h) returns a list of discrete features (refer to the supplementary materials).
During training, we would like to minimize the negative log-likelihood of the training set:
$$\mathcal{L}(w) = -\sum_{i} \log p\big(y^{(i)} \mid t^{(i)}\big) = \sum_{i} \log \sum_{y', h'} \exp\big(w^{\top} f(t^{(i)}, y', h')\big) - \sum_{i} \log \sum_{h} \exp\big(w^{\top} f(t^{(i)}, y^{(i)}, h)\big)$$
where (t^(i), y^(i)) is the i-th training instance. Standard gradient-based methods, such as L-BFGS (Liu and Nocedal, 1989), can be used to optimize the above objective. The gradient of the above function with respect to each weight w_k is given by:
$$\frac{\partial \mathcal{L}(w)}{\partial w_k} = \sum_{i} E_{p(y', h' \mid t^{(i)})}\big[f_k(t^{(i)}, y', h')\big] - \sum_{i} E_{p(h \mid t^{(i)}, y^{(i)})}\big[f_k(t^{(i)}, y^{(i)}, h)\big]$$
where E_p[·] is the expectation under distribution p. We can construct a lattice representation on top of the nodes shown in Figure 2. The representation compactly encodes exponentially many paths, where each path corresponds to one possible label sequence. Note that there exists a topological ordering amongst all nodes. This allows us to apply a generalized forward-backward algorithm to perform exact marginal inference so as to calculate both objective and expectation values efficiently (Li and Lu, 2017; Zou and Lu, 2018). MAP inference can be done analogously and is used at decoding time.
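The idea behind exact inference over a lattice can be illustrated with the forward recursion on a plain first-order chain. This is a simplified sketch rather than the generalized forward-backward algorithm used by the model: the score tables `emit` and `trans` are hypothetical stand-ins for the feature scores, and a transition score of negative infinity marks a disallowed edge.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emit, trans):
    """Forward recursion: log of the summed exp-scores of all label paths.

    emit[i][a]:  score of label a at position i
    trans[a][b]: score of moving from label a to label b
    """
    n_labels = len(trans)
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            emit[i][b]
            + logsumexp([alpha[a] + trans[a][b] for a in range(n_labels)])
            for b in range(n_labels)
        ]
    return logsumexp(alpha)


def log_partition_brute_force(emit, trans):
    """Same quantity by enumerating every path (exponential; for checking)."""
    n_labels = len(trans)
    scores = []
    for path in product(range(n_labels), repeat=len(emit)):
        score = sum(emit[i][a] for i, a in enumerate(path))
        score += sum(trans[a][b] for a, b in zip(path, path[1:]))
        scores.append(score)
    return logsumexp(scores)


emit = [[0.5, -1.0, 0.2], [0.1, 0.3, -0.4], [1.0, 0.0, -0.5]]
trans = [[0.0, 0.2, -math.inf], [0.3, 0.0, 0.1], [-0.2, 0.4, 0.0]]
print(log_partition(emit, trans), log_partition_brute_force(emit, trans))
```

The recursion visits each position once, so its cost grows linearly with the sequence length, whereas the brute-force check enumerates every path. Replacing the log-sum-exp with a maximum gives the analogous MAP (Viterbi-style) computation.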

Model Variants
We further consider two variants of our model.
Semi-Markov Variant: Our first variant, namely QT(S), employs the semi-Markov assumption (Sarawagi and Cohen, 2005), where N nodes are removed. Different from QT, which makes the first-order Markov assumption, QT(S) assumes L [...]
Relaxed Variant: One assumption made by QT is that each word in t strictly belongs to a certain quantity span. The variant QT(R) relaxes this constraint: some tokens in t may not belong to any quantity span. Considering the example shown in Figure 2, the token "There" in t may not belong to any span.
Table 1: Accuracies (%) on AddSub and AS CN.
Model                             AddSub   AS CN
Hosseini et al. (2014)            77.70    -
[...]                             64.00    -
Koncel-Kedziorski et al. (2015)   77.00    -
Roy and Roth (2015)               78.00    47.57
Zhou et al. (2015)                53.14    51.48
Mitra and Baral (2016)            86.07    -
Roy and Roth (2017)               [...]    [...]

Experiments
We conduct experiments on two datasets: AddSub (Hosseini et al., 2014), consisting of 395 addition-subtraction problems in English, and AS CN, with 1,049 addition-subtraction problems in Chinese (Wang et al., 2017). For all of our experiments, we use the L-BFGS algorithm (Liu and Nocedal, 1989) for learning model parameters, with an ℓ2 regularization coefficient of 0.01. To tune the hyperparameter J, we randomly select 80% of the training instances for training and use the remaining 20% as the development set, on which J is tuned.

Analysis
Following standard evaluation procedures used in previous works (Hosseini et al., 2014; Mitra and Baral, 2016), we conduct 3-fold cross validation on AddSub and AS CN, and report accuracies in Table 1. We make comparisons with a list of recent works and two baselines. One of the baselines is QT(FIX), where the quantity span for each quantity is a fixed-size token window. All of our proposed models consistently outperform previous research efforts. These figures confirm the capability of our approach to provide more promising solutions to addition-subtraction problems. We do not require any additional annotations, which can be expensive, while annotations such as variable-word alignments and formulas are necessary for the work of Mitra and Baral (2016).
Table 2: Accuracies on two types of problems and F1 scores for three types of signs of quantities. A_S.S.: accuracy on single-step problems (%); A_M.S.: accuracy on multi-step problems (%); F_+(−/0): F1 score of sign "+1(−1/0)" (%).
To investigate the power of features extracted by external tools, such as ConceptNet (Liu and Singh, 2004) and the Stanford CoreNLP tool (Manning et al., 2014), we conduct additional experiments on the aforementioned datasets. We call such features external features (see supplementary material) and indicate the corresponding setting as "-EF". As expected, performance drops in this setting, because such features are necessary for capturing evidence across sentences. The effect is especially large on the AddSub dataset. As discussed before (Hosseini et al., 2014; Mitra and Baral, 2016), AddSub contains a lot of irrelevant information and many information gaps. We can thus infer that the external features help our approach bridge information gaps and recognize irrelevant information when solving arithmetic problems. The poor performance also shows how challenging it is to solve such problems in Chinese.
Which of our variants works the best? We observe that models with variable-sized quantity spans, namely QT, QT(S) and QT(R), generally perform better than QT(FIX), where the quantity spans are fixed token windows. This shows the effectiveness of introducing the quantity span as a latent variable. QT obtains the highest average accuracy on AddSub, and QT(R) outperforms the other two variants on AS CN.
How does our approach perform on different types of problems? We divide problems into two categories: single-step and multi-step problems. The equation of a single-step problem contains at most two constant quantities tagged with either "+1" or "−1", while the equation of a multi-step problem has more than two constant quantities with signs of "+1" or "−1". We report accuracy and F1 score in Table 2. According to the empirical results in Table 2, our approach gives more accurate answers to multi-step problems, while the accuracy on single-step problems is lower. On the other hand, the three models show similar patterns across the three types of signs: the F1 scores for the signs "+1" and "−1" are higher than those for "0". After examining the outputs, we found that the texts of single-step problems often contain more than two constant quantities, of which only two should be labeled as "+1" or "−1" and the rest should be tagged as "0". Incorrectly labeling an irrelevant quantity with "+1" or "−1" leads to a wrong solution for a single-step problem. This also reveals that one main challenge for automatically solving arithmetic word problems is to recognize the irrelevant quantities. Failures in identifying irrelevant information may be due to implicit information in the problem text or to issues with the external tools.
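The single-step versus multi-step distinction defined above depends only on the gold signs of the constant quantities, as the following short sketch shows. The function name `problem_type` is a hypothetical helper introduced for illustration.

```python
def problem_type(constant_signs):
    """Classify a problem from the signs of its constant quantities.

    A problem is single-step if at most two constant quantities carry
    a sign of +1 or -1, and multi-step otherwise.
    """
    relevant = sum(1 for s in constant_signs if s in (+1, -1))
    return "single-step" if relevant <= 2 else "multi-step"


# Problem 1 of Figure 1 has three constant quantities, one of them irrelevant.
print(problem_type([0, +1, -1]))      # single-step
print(problem_type([+1, +1, -1, 0]))  # multi-step
```

The first call shows the case discussed in the text: a single-step problem whose text still contains an irrelevant quantity that must be tagged "0".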
Does J really matter? We further investigate the effects of J on the three proposed models. Figure 3 plots how performance varies with J (J ∈ {1, 2, 3, 4, 5, 6, N}) on the datasets AddSub (above) and AS CN (below). On AddSub, the three models show a similar pattern: performance tends to get worse as J grows. On AS CN, the three models achieve relatively higher accuracies with J ∈ {2, 3, 6} than with other values. Interestingly, QT and QT(R) appear to perform better than the semi-Markov variant QT(S).
We tracked the outputs of the three models and found that QT(S) made more mistakes in its predictions for the unknown quantity. The fact that models with J = N do not perform well confirms our assumption that taking token windows into account, rather than the whole text, is reasonable and effective.
Table 3: Results for three types of signs for quantities predicted by three models. P.: Precision (%), R.: Recall (%), F.: F1 score (%). Highest scores are in bold, and we use *, † and ⋆ to distinguish different sign types.
Evaluation on different types of signs: We investigate the capability of the proposed approach to predict the three types of signs ({+1, 0, −1}), as illustrated in Table 3. The three models show similar patterns on the two datasets. Predictions of "+1" and "−1" are more accurate than predictions of "0". This reveals that one main challenge for automatically solving arithmetic word problems is to recognize the irrelevant information that should be labeled with "0". As discussed above, failures to detect irrelevant information may result from unavoidable errors introduced by external resources and from the absence of crucial information in the problem text.
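Per-sign precision, recall and F1 can be computed as in the sketch below. It assumes that gold and predicted signs are pooled over all quantities into two aligned lists; how the reported scores were aggregated is not stated here, so this pooling and the function name `sign_prf` are assumptions.

```python
def sign_prf(gold, pred, sign):
    """Precision, recall and F1 for one sign over aligned sign lists."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned")
    tp = sum(1 for g, p in zip(gold, pred) if g == sign and p == sign)
    fp = sum(1 for g, p in zip(gold, pred) if g != sign and p == sign)
    fn = sum(1 for g, p in zip(gold, pred) if g == sign and p != sign)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: one irrelevant quantity is wrongly tagged as "+1".
gold = [+1, 0, -1, +1, 0]
pred = [+1, +1, -1, +1, 0]
for sign in (+1, 0, -1):
    print(sign, sign_prf(gold, pred, sign))
```

In the toy example a single mislabeled irrelevant quantity lowers both the recall of "0" and the precision of "+1", which mirrors the error pattern described above.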
Error Analysis: The leading sources of errors can be categorized into three types: 1) The description of the problem is incomplete or implicit, which is challenging for a machine to understand. 2) Failure to recognize relevant quantities leads to missing quantities or to the introduction of irrelevant information. 3) Incomplete information or errors from external tools, such as ConceptNet (Liu and Singh, 2004) and the Stanford CoreNLP tool (Manning et al., 2014), are another, unavoidable source of wrong predictions.