Are NLP Models really able to Solve Simple Math Word Problems?

The problem of designing NLP solvers for math word problems (MWP) has seen sustained research activity and steady gains in the test accuracy. Since existing solvers achieve high performance on the benchmark datasets for elementary level MWPs containing one-unknown arithmetic word problems, such problems are often considered “solved” with the bulk of research attention moving to more complex MWPs. In this paper, we restrict our attention to English MWPs taught in grades four and lower. We provide strong evidence that the existing MWP solvers rely on shallow heuristics to achieve high performance on the benchmark datasets. To this end, we show that MWP solvers that do not have access to the question asked in the MWP can still solve a large fraction of MWPs. Similarly, models that treat MWPs as bag-of-words can also achieve surprisingly high accuracy. Further, we introduce a challenge dataset, SVAMP, created by applying carefully chosen variations over examples sampled from existing datasets. The best accuracy achieved by state-of-the-art models is substantially lower on SVAMP, thus showing that much remains to be done even for the simplest of the MWPs.


Introduction
A Math Word Problem (MWP) consists of a short natural language narrative describing a state of the world and poses a question about some unknown quantities (see Table 1 for some examples). MWPs are taught in primary and higher schools. The MWP task is a type of semantic parsing task where given an MWP the goal is to generate an expression (more generally, equations), which can then be evaluated to get the answer. The task is challenging because a machine needs to extract relevant information from natural language text as well as perform mathematical reasoning to solve it. The complexity of MWPs can be measured along multiple axes, e.g., reasoning and linguistic  complexity and world and domain knowledge. A combined complexity measure is the grade level of an MWP, which is the grade in which similar MWPs are taught. Over the past few decades many approaches have been developed to solve MWPs with significant activity in the last decade (Zhang et al., 2020).
MWPs come in many varieties. Among the simplest are the one-unknown arithmetic word problems where the output is a mathematical expression involving numbers and one or more arithmetic operators (+, −, * , /). Problems in Tables 1 and 6 are of this type. More complex MWPs may have systems of equations as output or involve other operators or may involve more advanced topics and specialized knowledge. Recently, researchers have started focusing on solving such MWPs, e.g. multiple-unknown linear word problems (Huang et al., 2016a), geometry (Sachan and Xing, 2017) and probability (Amini et al., 2019), believing that existing work can handle one-unknown arithmetic MWPs well (Qin et al., 2020). In this paper, we question the capabilities of the state-of-the-art (SOTA) methods to robustly solve even the simplest of MWPs suggesting that the above belief is not well-founded.
In this paper, we provide concrete evidence to show that existing methods use shallow heuristics to solve a majority of word problems in the benchmark datasets. We find that existing models are able to achieve reasonably high accuracy on MWPs from which the question text has been removed leaving only the narrative describing the state of the world. This indicates that the models can rely on superficial patterns present in the narrative of the MWP and achieve high accuracy without even looking at the question. In addition, we show that a model without word-order information (i.e., the model treats the MWP as a bag-of-words) can also solve the majority of MWPs in benchmark datasets.
The presence of these issues in existing benchmarks makes them unreliable for measuring the performance of models. Hence, we create a challenge set called SVAMP (Simple Variations on Arithmetic Math word Problems; pronounced swamp) of one-unknown arithmetic word problems with grade level up to 4 by applying simple variations over word problems in an existing dataset (see Table 1 for some examples). SVAMP further highlights the brittle nature of existing models when trained on these benchmark datasets. On evaluating SOTA models on SVAMP, we find that they are not even able to solve half the problems in the dataset. This failure of SOTA models on SVAMP points to the extent to which they rely on simple heuristics in training data to make their prediction.
Below, we summarize the two broad contributions of our paper.
• We show that the majority of problems in benchmark datasets can be solved by shallow heuristics lacking word-order information or lacking question text.
• We create a challenge set called SVAMP for more robust evaluation of methods developed to solve elementary level math word problems.

Cai et al. (2017) showed that biases prevalent in the ROC stories cloze task allowed models to yield state-of-the-art results when trained only on the endings. To the best of our knowledge, this kind of analysis has not been done on any Math Word Problem dataset.
Challenge Sets for NLP tasks have been proposed most notably for NLI and machine translation (Belinkov and Glass, 2019; Nie et al., 2020; Ribeiro et al., 2020). Gardner et al. (2020) suggested creating contrast sets by manually perturbing test instances in small yet meaningful ways that change the gold label. We believe that we are the first to introduce a challenge set targeted specifically for robust evaluation of Math Word Problems.

Problem Formulation
We denote a Math Word Problem P by a sequence of n tokens P = (w_1, ..., w_n) where each token w_i can be either a word from a natural language or a numerical value. The word problem P can be broken down into a body B = (w_1, ..., w_k) and a question Q = (w_{k+1}, ..., w_n). The goal is to map P to a valid mathematical expression E_P composed of numbers from P and mathematical operators from the set {+, −, /, *} (e.g. 3 + 5 − 4). The metric used to evaluate models on the MWP task is Execution Accuracy, which is obtained by comparing the predicted answer (calculated by evaluating E_P) with the annotated answer. In this work, we focus only on one-unknown arithmetic word problems.

Note that our implementations achieve a higher score on these datasets than the previously reported highest scores of 78% on ASDiv-A (Miao et al., 2020) and 83.7% on MAWPS (Zhang et al., 2020). The implementation details are provided in Section A in the Appendix.
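The Execution Accuracy metric can be made concrete with a short sketch (the helper name is ours, not from the paper): a predicted expression is evaluated and the result compared to the annotated answer within a small numerical tolerance, with malformed or invalid expressions counting as wrong.

```python
def execution_accuracy(predicted_exprs, gold_answers, tol=1e-4):
    """Fraction of problems whose predicted expression evaluates to the gold answer.

    predicted_exprs: list of strings over numbers and +, -, *, / (e.g. "3 + 5 - 4").
    gold_answers: list of floats (the annotated answers).
    """
    correct = 0
    for expr, gold in zip(predicted_exprs, gold_answers):
        try:
            # eval is acceptable here only because the expression grammar is
            # restricted to numbers and the four arithmetic operators.
            value = eval(expr, {"__builtins__": {}})
        except (SyntaxError, ZeroDivisionError):
            continue  # malformed or invalid expressions count as incorrect
        if abs(value - gold) < tol:
            correct += 1
    return correct / len(predicted_exprs)

# "3 + 5 - 4" evaluates to 4 and matches; "8 / 2" evaluates to 4, not 5
acc = execution_accuracy(["3 + 5 - 4", "8 / 2"], [4.0, 5.0])
```

Note that because the answer (not the expression string) is compared, any expression equivalent to the gold equation is counted as correct.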

Deficiencies in existing datasets
Here we describe the experiments that show that there are important deficiencies in MAWPS and ASDiv-A.

Evaluation on Question-removed MWPs
As mentioned in Section 3.1, each MWP consists of a body B, which provides a short narrative on a state of the world and a question Q, which inquires about an unknown quantity about the state of the world. For each fold in the provided 5-fold split in MAWPS and ASDiv-A, we keep the train set unchanged while we remove the questions Q from the problems in the test set. Hence, each problem in the test set consists of only the body B without any question Q. We evaluate all three models with RoBERTa embeddings on these datasets. The results are provided in Table 3. The best performing model is able to achieve a 5-fold cross-validation accuracy of 64.4% on ASDiv-A and 77.7% on MAWPS. Loosely translated, this means that nearly 64% of the problems in ASDiv-A and 78% of the problems in MAWPS can be correctly answered without even looking at the question. This suggests the presence of patterns in the bodies of MWPs in these datasets that have a direct correlation with the output equation.
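The question-removal preprocessing above can be sketched as follows. This assumes, as in the datasets considered, that the question Q is the final sentence of the problem and ends in a question mark; the function name is ours.

```python
import re

def remove_question(problem: str) -> str:
    """Return only the body B of an MWP by dropping its final question sentence.

    Assumes the question is the last sentence and ends with a question mark,
    which holds for the one-unknown arithmetic problems considered here.
    """
    # split on whitespace that follows sentence-ending punctuation
    sentences = re.split(r'(?<=[.!?])\s+', problem.strip())
    if sentences and sentences[-1].endswith('?'):
        sentences = sentences[:-1]
    return ' '.join(sentences)

body = remove_question(
    "John has 8 marbles and 3 stones. How many more marbles than stones does he have?"
)
```

A model evaluated on such bodies receives no information about which quantity is being asked for, so any accuracy it retains must come from patterns in the narrative alone.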
Some recent works have demonstrated similar evidence of bias in NLI datasets (Gururangan et al., 2018; Poliak et al., 2018). They observed that NLI models were able to predict the correct label for a large fraction of the standard NLI datasets based on only the hypothesis of the input, without the premise. Our results on question-removed math word problems resemble their observations on NLI datasets and similarly indicate the presence of artifacts that statistical models can exploit. Following Gururangan et al. (2018), we attempt to understand the extent to which SOTA models rely on the presence of simple heuristics in the body to predict correctly. We partition the test set into two subsets for each model: problems that the model predicted correctly without the question are labeled Easy, and problems that the model could not answer correctly without the question are labeled Hard. Table 4 shows the performance of the models on their respective Hard and Easy sets. Note that their performance on the full set is already provided in Table 2. It can be seen clearly that although the models correctly answer many Hard problems, the bulk of their success is due to the Easy problems. This shows that the ability of SOTA methods to robustly solve word problems is overestimated and that they rely on simple heuristics in the body of the problems to make predictions.

Performance of a constrained model
We construct a simple model based on the Seq2Seq architecture by removing the LSTM Encoder and replacing it with a Feed-Forward Network that maps the input embeddings to their hidden representations. The LSTM Decoder is provided with the average of these hidden representations as its initial hidden state. During decoding, an attention mechanism (Luong et al., 2015) assigns weights to the individual hidden representations of the input tokens. We use either RoBERTa embeddings (non-contextual; taken directly from the Embedding Matrix) or train the model from scratch. Clearly, this model does not have access to word-order information. Table 5 shows the performance of this model on MAWPS and ASDiv-A. The constrained model with non-contextual RoBERTa embeddings is able to achieve a cross-validation accuracy of 51.2% on ASDiv-A and an astounding 77.9% on MAWPS. It is surprising that a model with no word-order information can solve a majority of word problems in these datasets. These results indicate that it is possible to get a good score on these datasets simply by associating the occurrence of specific words in the problems with their corresponding equations. We illustrate this more clearly in the next section.
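The encoder side of the constrained model can be sketched in NumPy to show why it is order-blind by construction (untrained random weights, purely illustrative; the real model uses a trained feed-forward network and an LSTM decoder):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_hid = 16, 8

# Feed-forward "encoder": each token embedding is mapped independently,
# so no word-order information can enter the hidden representations.
W = rng.normal(size=(d_emb, d_hid))

def encode(token_embeddings):
    # (n_tokens, d_emb) -> (n_tokens, d_hid), position-independent
    return np.tanh(token_embeddings @ W)

def attention(hidden, query):
    # dot-product attention over the per-token hidden representations
    scores = hidden @ query                  # (n_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    context = weights @ hidden               # (d_hid,)
    return weights, context

tokens = rng.normal(size=(5, d_emb))         # a 5-token "problem"
hidden = encode(tokens)
init_state = hidden.mean(axis=0)             # decoder's initial hidden state
weights, context = attention(hidden, init_state)

# Permuting the tokens permutes the hidden states identically:
# the model treats the problem as a bag of words.
perm = [3, 1, 4, 0, 2]
assert np.allclose(encode(tokens[perm]), hidden[perm])
```

Because each hidden representation depends on exactly one token, the attention weights over these representations directly expose which words drive the prediction, which is exploited in the analysis below.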

Analyzing the attention weights
To get a better understanding of how the constrained model is able to perform so well, we analyze the attention weights that it assigns to the hidden representations of the input tokens. As shown by Wiegreffe and Pinter (2019), analyzing the attention weights of our constrained model is a reliable way to explain its predictions, since each hidden representation consists of information about only that token, as opposed to the case of an RNN, where each hidden representation may contain information about the context, i.e., its neighboring tokens.
We train the constrained model (with RoBERTa embeddings) on the full ASDiv-A dataset and observe the attention weights it assigns to the words of the input problems. We found that the model usually attends to a single word to make its prediction, irrespective of the context. Table 6 shows some representative examples. In the first example, the model assigns an attention weight of 1 to the representation of the word 'every' and predicts the correct equation. However, when we make a subtle change to this problem such that the corresponding equation changes, the model keeps attending over the word 'every' and predicts the same equation, which is now incorrect. Similar observations can be made for the other two examples. Table 22 in the Appendix has more such examples. These examples represent only a few types of spurious correlations that we could find; there could be other types of correlations that we might have missed.

Input Problem | Predicted Equation | Answer
John delivered 3 letters at every house. If he delivered for 8 houses, how many letters did John deliver? | 3 * 8 | 24
John delivered 3 letters at every house. He delivered 24 letters in all. How many houses did John visit to deliver letters? | 3 * 24 | 72
Sam made 8 dollars mowing lawns over the Summer. He charged 2 bucks for each lawn. How many lawns did he mow? | 8 / 2 | 4
Sam mowed 4 lawns over the Summer. If he charged 2 bucks for each lawn, how much did he earn? | 4 / 2 | 2
10 apples were in the box. 6 are red and the rest are green. how many green apples are in the box? | 10 - 6 | 4
10 apples were in the box. Each apple is either red or green. 6 apples are red. how many green apples are in the box? | 10 / 6 | 1.67
Table 6: Attention paid to specific words by the constrained model.
Note that, we do not claim that every model trained on these datasets relies on the occurrence of specific words in the input problem for prediction the way our constrained model does. We are only asserting that it is possible to achieve a good score on these datasets even with such a brittle model, which clearly makes these datasets unreliable to robustly measure model performance.
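The failure mode in Table 6 can be mimicked by a trivial keyword rule (an illustration of the spurious correlation, not the trained model; the trigger words are our hypothetical choices): a single word fires the same operator regardless of what the question asks.

```python
def keyword_baseline(problem: str) -> str:
    """Predict an operator from a single trigger word, ignoring all context.

    Mirrors the behavior in Table 6: the same trigger word yields the same
    operator even after the question has been changed.
    """
    triggers = {"every": "*", "each": "*", "rest": "-", "more": "-"}
    for word in problem.lower().split():
        if word in triggers:
            return triggers[word]
    return "+"

p1 = ("John delivered 3 letters at every house. "
      "If he delivered for 8 houses, how many letters did John deliver?")
p2 = ("John delivered 3 letters at every house. He delivered 24 letters in all. "
      "How many houses did John visit to deliver letters?")

# Correct for p1 (3 * 8) but wrong for p2: same trigger word, same prediction.
op1, op2 = keyword_baseline(p1), keyword_baseline(p2)
```

A dataset on which such a rule scores well cannot distinguish genuine reasoning from word–equation memorization, which motivates the variations in SVAMP.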

SVAMP
The efficacy of existing models on benchmark datasets has led to a shift in the focus of researchers towards more difficult MWPs. We claim that this efficacy on benchmarks is misleading and SOTA MWP solvers are unable to solve even elementary level one-unknown MWPs. To this end, we create a challenge set named SVAMP containing simple one-unknown arithmetic word problems of grade level up to 4. The examples in SVAMP test a model across different aspects of solving word problems. For instance, a model needs to be sensitive to questions and possess certain reasoning abilities to correctly solve the examples in our challenge set. SVAMP is similar to existing datasets of the same level in terms of scope and difficulty for humans, but is less susceptible to being solved by models relying on superficial patterns.
Our work differs from adversarial data collection methods such as Adversarial NLI (Nie et al., 2020) in that these methods create examples depending on the failure of a particular model while we create examples without referring to any specific model. Inspired by the notion of Normative evaluation (Linzen, 2020), our goal is to create a dataset of simple problems that any system designed to solve MWPs should be expected to solve. We create new problems by applying certain variations to existing problems, similar to the work of Ribeiro et al. (2020). However, unlike their work, our variations do not check for linguistic capabilities. Rather, the choice of our variations is motivated by the experiments in Section 4 as well as certain simple capabilities that any MWP solver must possess.

Creating SVAMP
We create SVAMP by applying certain types of variations to a set of seed examples sampled from the recently proposed ASDiv-A dataset. We select seed examples from ASDiv-A rather than MAWPS since it appears to be of higher quality and harder: we performed a simple experiment to test the coverage of each dataset by training a model on one dataset and testing it on the other. When we train a Graph2Tree model on ASDiv-A, it achieves 82% accuracy on MAWPS; however, when trained on MAWPS and tested on ASDiv-A, the model achieves only 73% accuracy. Also recall Table 2, where most models performed better on MAWPS. Moreover, ASDiv has problems annotated according to types and grade levels, which are useful for us.
To select a subset of seed examples that sufficiently represent different types of problems in the ASDiv-A dataset, we first divide the examples into groups according to their annotated types.

Question Sensitivity

Original: He then went to see the oranges being harvested. He found out that they harvest 83 sacks per day and that each sack contains 12 oranges. How many sacks of oranges will they have after 6 days of harvest?
Variation: He then went to see the oranges being harvested. He found out that they harvest 83 sacks per day and that each sack contains 12 oranges. How many oranges do they harvest per day?

Reasoning Ability

Invert Operation
Original: He also made some juice from fresh oranges. If he used 2 oranges per glass of juice and he made 6 glasses of juice, how many oranges did he use?
Variation: He also made some juice from fresh oranges. If he used 2 oranges per glass of juice and he used up 12 oranges, how many glasses of juice did he make?

Structural Invariance

Change order of objects
Original: John has 8 marbles and 3 stones. How many more marbles than stones does he have?
Variation: John has 3 stones and 8 marbles. How many more marbles than stones does he have?

Change order of phrases
Original: Matthew had 27 crackers. If Matthew gave equal numbers of crackers to his 9 friends, how many crackers did each person eat?
Variation: Matthew gave equal numbers of crackers to his 9 friends. If Matthew had a total of 27 crackers initially, how many crackers did each person eat?

Add irrelevant information
Original: Jack had 142 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?
Variation: Jack had 142 pencils. Dorothy had 50 pencils. Jack gave 31 pencils to Dorothy. How many pencils does Jack have now?
Table 8: Examples of variations applied to seed examples.

Variations
The variations that we make to each seed example can be broadly classified into three categories based on desirable properties of an ideal model: Question Sensitivity, Reasoning Ability and Structural Invariance. Examples of each type of variation are provided in Table 8.
1. Question Sensitivity. Variations in this category check if the model's answer depends on the question. In these variations, we change the question in the seed example while keeping the body the same. The possible variations are as follows: (a) Same Object, Different Structure: The principal object (i.e., the object whose quantity is unknown) in the question is kept the same while the structure of the question is changed. (b) Different Object, Same Structure: The principal object in the question is changed while the structure of the question is kept the same. (c) Different Object, Different Structure: Both the principal object and the structure of the question are changed.
2. Reasoning Ability. Variations in this category check whether a model can determine a change in reasoning arising from subtle changes in the problem text. The different possible variations are as follows: (a) Add relevant information: Extra relevant information is added to the example that affects the output equation.
(b) Change information: The information provided in the example is changed.
(c) Invert operation: The previously unknown quantity is now provided as information and the question instead asks about a previously known quantity which is now unknown.
3. Structural Invariance. Variations in this category check whether a model remains invariant to superficial structural changes that do not alter the answer or the reasoning required to solve the example. The different possible variations are as follows: (a) Add irrelevant information: Extra irrelevant information is added to the problem text that is not required to solve the example.
(b) Change order of objects: The order of objects appearing in the example is changed.
(c) Change order of phrases: The order of numbercontaining phrases appearing in the example is changed.
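Structural-invariance variations lend themselves to a simple robustness check (a sketch of ours, not part of the SVAMP pipeline): a model's predicted answer must be identical on the original and varied problem, since such variations change neither the answer nor the reasoning.

```python
import re

def invariance_failures(model, pairs):
    """Collect structural-invariance violations.

    pairs: list of (original, varied) problem strings where the variation
    (e.g. reordering objects or phrases) must not change the answer.
    model: any callable mapping a problem string to a numeric answer.
    """
    failures = []
    for original, varied in pairs:
        if model(original) != model(varied):
            failures.append((original, varied))
    return failures

# A toy "model" that subtracts the first two numbers it sees -- a shallow
# heuristic that breaks as soon as the order of objects is swapped.
def toy_model(problem):
    a, b = [int(n) for n in re.findall(r"\d+", problem)][:2]
    return a - b

pairs = [(
    "John has 8 marbles and 3 stones. How many more marbles than stones does he have?",
    "John has 3 stones and 8 marbles. How many more marbles than stones does he have?",
)]
failed = invariance_failures(toy_model, pairs)
```

The toy model answers 5 on the original but -5 on the variation, so it is flagged, exactly the kind of brittleness the Structural Invariance category is designed to expose.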

Protocol for creating variations
Since creating variations requires a high level of familiarity with the task, the construction of SVAMP is done in-house by the authors and colleagues, hereafter called the workers. The 100 seed examples (as shown in Table 7) are distributed among the workers.
For each seed example, the worker needs to create new variations by applying the variation types discussed in Section 5.1.1. Importantly, a combination of different variations over the seed example can also be done. For each new example created, the worker needs to annotate it with the equation as well as the type of variation(s) used to create it. More details about the creation protocol can be found in Appendix B.
We created a total of 1098 examples. However, since ASDiv-A does not have examples with equations of more than two operators, we discarded 98 examples from our set which had equations consisting of more than two operators. This is to ensure that our challenge set does not have any unfairly difficult examples. The final set of 1000 examples was provided to an external volunteer unfamiliar with the task to check the grammatical and logical correctness of each example.

Dataset Properties
Our challenge set SVAMP consists of one-unknown arithmetic word problems which can be solved by expressions requiring no more than two operators. Table 9 shows some statistics of our dataset and of ASDiv-A and MAWPS. The Equation Template for each example is obtained by converting the corresponding equation into prefix form and masking out all numbers with a meta symbol. Observe that the number of distinct Equation Templates and the Average Number of Operators are similar for SVAMP and ASDiv-A and are considerably smaller than for MAWPS. This indicates that SVAMP does not contain unfairly difficult MWPs in terms of the arithmetic expression expected to be produced by a model.
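The Equation Template construction can be sketched as follows, assuming space-separated, parenthesis-free infix equations (which covers the at-most-two-operator expressions in SVAMP); the function name and the mask symbol `N` are our choices.

```python
def to_template(equation: str) -> str:
    """Convert an infix equation over numbers and + - * / (no parentheses)
    into its Equation Template: prefix form with numbers masked out."""
    tokens = equation.split()

    def prefix(toks):
        if len(toks) == 1:
            return ["N"]  # mask the number with a meta symbol
        # split at the LAST lowest-precedence operator to respect
        # left-associativity (+/- bind looser than */ /)
        for ops in (("+", "-"), ("*", "/")):
            for i in range(len(toks) - 1, -1, -1):
                if toks[i] in ops:
                    return [toks[i]] + prefix(toks[:i]) + prefix(toks[i + 1:])
        raise ValueError("not a valid expression")

    return " ".join(prefix(tokens))

template = to_template("3 + 5 - 4")   # (3 + 5) - 4 in prefix: - + 3 5 4
```

Counting distinct templates then measures structural rather than lexical diversity: "3 + 5 - 4" and "10 + 2 - 7" collapse to the same template.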
Previous works, including those introducing MAWPS and ASDiv, have tried to capture the notion of diversity in MWP datasets. Miao et al. (2020) introduced a metric called Corpus Lexicon Diversity (CLD) to measure lexical diversity. Their contention was that higher lexical diversity is correlated with the quality of a dataset. As can be seen from Table 9, SVAMP has a much lower CLD than ASDiv-A. SVAMP is also less diverse in terms of problem types compared to ASDiv-A. Despite this, we show in the next section that SVAMP is in fact more challenging than ASDiv-A for current models. Thus, we believe that lexical diversity is not a reliable way to measure the quality of MWP datasets. Rather, quality may depend on other factors such as the diversity in MWP structure, which precludes models from exploiting shallow heuristics.

Experiments on SVAMP
We train the three considered models on a combination of MAWPS and ASDiv-A and test them on SVAMP. The scores of all three models with and without RoBERTa embeddings for various subsets of SVAMP can be seen in Table 10.  The best performing Graph2Tree model is only able to achieve an accuracy of 43.8% on SVAMP. This indicates that the problems in SVAMP are indeed more challenging for the models than the problems in ASDiv-A and MAWPS despite being of the same scope and type and less diverse. Table 23 in the Appendix lists some simple examples from SVAMP on which the best performing model fails. These results lend further support to our claim that existing models cannot robustly solve elementary level word problems.
Next, we remove the questions from the examples in SVAMP and evaluate them using the three models with RoBERTa embeddings trained on the combination of MAWPS and ASDiv-A. The scores can be seen in Table 11. The accuracy drops by half when compared to ASDiv-A and by more than half compared to MAWPS, suggesting that the problems in SVAMP are more sensitive to the information present in the question. We also evaluate the performance of the constrained model on SVAMP when trained on MAWPS and ASDiv-A. The best model achieves only 18.3% accuracy (see Table 12), which is marginally better than the majority-template baseline. This shows that the problems in SVAMP are less vulnerable to being solved by models using simple patterns and that a model needs contextual information in order to solve them. We also explored using SVAMP for training by combining it with ASDiv-A and MAWPS. We performed 5-fold cross-validation over SVAMP where the model was trained on a combination of the three datasets and tested on unseen examples from SVAMP. To create the folds, we first divide the seed examples into five sets, with each type of example distributed nearly equally among the sets. A fold is obtained by combining all the examples in SVAMP that were created using the seed examples in a set. In this way, we get five different folds from the five sets. We found that the best model achieved about 65% accuracy. This indicates that even with additional training data, existing models are still not close to the performance that was estimated based on prior benchmark datasets.
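The fold construction described above groups by seed example so that no variation of a seed appears in both train and test. A minimal sketch (round-robin assignment for brevity; the actual protocol also balances example types across the sets):

```python
def seed_folds(examples, n_folds=5):
    """Split examples into folds by seed id, so that every variation created
    from the same seed example lands in the same fold (no leakage).

    examples: list of (seed_id, problem) pairs.
    """
    seed_ids = sorted({sid for sid, _ in examples})
    # round-robin over seeds; the real protocol also balances problem types
    assignment = {sid: i % n_folds for i, sid in enumerate(seed_ids)}
    folds = [[] for _ in range(n_folds)]
    for sid, problem in examples:
        folds[assignment[sid]].append(problem)
    return folds

# 10 seeds, 3 variations each -> 30 examples in 5 folds of 6
examples = [(s, f"variation {v} of seed {s}") for s in range(10) for v in range(3)]
folds = seed_folds(examples)
```

Grouping by seed matters because variations of the same seed share most of their surface text; splitting them across folds would let a model succeed by near-duplicate memorization.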
To check the influence of different categories of variations in SVAMP, for each category, we measure the difference between the accuracy of the best model on the full dataset and its accuracy on a subset containing no example created from that category of variations. The results are shown in    Table 14.
We also check the break-up of performance of the best performing Graph2Tree model according to the number of numbers present in the text of the input problem. We trained the model on both ASDiv-A and MAWPS, tested it on SVAMP, and compared those results against the 5-fold cross-validation setting of ASDiv-A. The scores are provided in Table 15. While the model can solve many problems consisting of only two numbers in the input text (even in our challenge set), it performs very badly on problems having more than two numbers. This shows that current methods are incapable of properly associating numbers with their context. Also, the gap between the performance on ASDiv-A and SVAMP is high, indicating that the examples in SVAMP are more difficult for these models to solve than the examples in ASDiv-A, even when considering structurally the same type of word problems.

Final Remarks
Going back to the original question, are existing NLP models able to solve elementary math word problems? This paper gives a negative answer. We have empirically shown that the benchmark English MWP datasets suffer from artifacts making them unreliable for gauging the performance of MWP solvers: we demonstrated that the majority of problems in the existing datasets can be solved by simple heuristics even without word-order information or the question text. The performance of existing models on our proposed challenge dataset also highlights their limitations in solving simple elementary level word problems. We hope that our challenge set SVAMP, containing elementary level MWPs, will enable more robust evaluation of methods. We believe that methods proposed in the future that make genuine advances in solving the task rather than relying on simple heuristics will perform well on SVAMP despite being trained on other datasets such as ASDiv-A and MAWPS.
In recent years, the focus of the community has shifted towards solving more difficult MWPs such as non-linear equations and word problems with multiple unknown variables. We demonstrated that the capability of existing models to solve simple one-unknown arithmetic word problems is overestimated. We believe that developing more robust methods for solving elementary MWPs remains a significant open problem.

A Implementation Details
We use 8 NVIDIA Tesla P100 GPUs each with 16 GB memory to run our experiments. The hyperparameters used for each model are shown in Table  16. The best hyperparameters are highlighted in bold. Following the setting of Zhang et al. (2020), the arithmetic word problems from MAWPS are divided into five folds, each of equal test size. For ASDiv-A, we consider the 5-fold split [238,238,238,238,266] provided by the authors (Miao et al., 2020).

B Creation Protocol
We create variations in template form. Generating more data by scaling up from these templates or by performing automatic operations on them is left for future work. The template form of an example is created by replacing certain words with their respective tags. Table 17 lists the various tags used in the templates. The NUM tag is used to replace all the numbers and the NAME tag is used to replace all the names of persons in the example. The OBJs and OBJp tags are used for replacing the objects in the example. OBJs and OBJp tags with the same index represent the same object in singular and plural form respectively. The intention when using the OBJs or the OBJp tag is that it can serve as a placeholder for other similar words which, when entered in that place, make sense as per the context. These tags must not be used for collectives; rather, they should be used for the things that the collective represents. Some example uses of OBJs and OBJp tags are provided in Table 18. Lastly, the MOD tag must be used to replace any modifier preceding the OBJs/OBJp tag. A preprocessing script is executed over the Seed Examples to automatically generate template suggestions for the workers. The script uses Named Entity Recognition and Regular Expression matching to automatically mask the names of persons and the numbers found in the Seed Examples. The outputs from the script are called the Script Examples. An illustration is provided in Table 19.
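The number- and name-masking step of the preprocessing script can be sketched with regular expressions alone (the real script also uses NER; OBJ and MOD tags are added manually by workers, and the function name is ours):

```python
import re

def to_template_form(text, names):
    """Mask numbers with NUM tags and known person names with NAME tags,
    numbering tags by order of first appearance. Repeated values share a
    tag in this sketch."""
    num_index, name_index = {}, {}

    def mask_num(m):
        n = m.group(0)
        num_index.setdefault(n, len(num_index) + 1)
        return f"NUM{num_index[n]}"

    # mask integers and decimals first, then person names
    text = re.sub(r"\d+(?:\.\d+)?", mask_num, text)
    for name in names:
        if name in text and name not in name_index:
            name_index[name] = len(name_index) + 1
        if name in name_index:
            text = text.replace(name, f"NAME{name_index[name]}")
    return text

templated = to_template_form(
    "Beth has 4 packs of crayons. Each pack has 10 crayons in it.",
    names=["Beth"],
)
```

Workers then only need to correct the script's mistakes and add OBJs/OBJp/MOD tags, rather than templating each example from scratch.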
Each worker is provided with the Seed Examples along with their respective Script Examples that have been allotted to them. The worker's task is to edit the Script Example by correcting any mistakes made by the preprocessing script and adding any new tags such as the OBJs and the OBJp tags in order to create the Base Example. If a worker introduces a new tag, they need to mark it against its example-specific value. If the tag is used to mask objects, the worker needs to mark both the singular and plural forms of the object in a comma-separated manner. Additionally, for each unique index of the OBJs/OBJp tag in the example, the worker must enter at least one alternate value that can be used in that place. Similarly, the worker must enter at least two modifier words that can be used to precede the principal OBJs/OBJp tags in the example. These alternate values are used to gather a lexicon which can be utilised to scale up the data at a later stage. An illustration of this process is provided in Table 20.
In order to create the variations, the worker needs to check the different types of variations in Table 8 to see if they can be applied to the Base Example. If applicable, the worker needs to create the Variation Example while also making a note of the type of variation. If a particular example is the result of performing multiple types of variations, all the variation types should be listed according to their order of application, from latest to earliest, in a comma-separated manner. For any variation, if a worker introduces a new tag, they need to mark it against its example-specific value as mentioned before. The index of any new tag introduced needs to be one more than the highest index already in use for that tag in the Base Example or its previously created variations.
To make the annotation more efficient and streamlined, we provide the workers with a sequence of steps to be followed in order. To make sure that different workers following our protocol make similar types of variations, we held a trial where each worker created variations from the same 5 seed examples. We observed that, barring minor linguistic differences, most of the created examples were the same, thereby indicating the effectiveness of our protocol.

C Analyzing Attention Weights
In Table 22, we provide more examples to illustrate the specific word to equation correlation that the constrained model learns.

D Examples of Simple Problems
In Table 23, we provide a few simple examples from SVAMP that the best performing Graph2Tree model could not solve.

E Ethical Considerations
In this paper, we consider the task of automatically solving Math Word Problems (MWPs). Our work encourages the development of better systems that can robustly solve MWPs. Such systems can be deployed for use in the education domain. For example, an application can be developed that takes MWPs as input and provides detailed explanations of how to solve them. Such applications can aid elementary school students in learning and practicing math.
We present a challenge set called SVAMP of one-unknown English Math Word Problems. SVAMP is created in-house by the authors themselves by applying some simple variations to examples from ASDiv-A (Miao et al., 2020), which is a publicly available dataset. We provide a detailed creation protocol in Section B. We are not aware of any risks associated with our proposed dataset.
To provide an estimate of the energy requirements of our experiments, we provide the details such as computing platform and running time in Section A. Also, in order to reduce carbon costs from our experiments, we first perform a broad hyperparameter search over only a single fold for the datasets and then run the cross validation experiment over a select few hyperparameters.

Excerpt of Example
Beth has 4 packs of red crayons and 2 packs of green crayons. Each pack has 10 crayons in it.

Template Form

NAME1 has NUM1 packs of MOD1 OBJp1 and NUM2 packs of MOD2 OBJp1.

Excerpt of Example
In a game, Frank defeated 6 enemies. Each enemy earned him 9 points.

Template Form
In a game, NAME1 defeated NUM1 OBJp1. Each OBJs1 earned him NUM2 points.

Table 18: Example uses of tags. Note that in the first example, the word 'packs' was not replaced since it is a collective. In the second example, the word 'points' was not replaced because it is too instance-specific and no other word can be used in its place.

Seed Example Body
Beth has 4 packs of crayons. Each pack has 10 crayons in it. She also has 6 extra crayons.

Seed Example Question
How