Semantic Parsing of Pre-university Math Problems

We have been developing an end-to-end math problem solving system that accepts natural language input. The current paper focuses on how we analyze the problem sentences to produce logical forms. We chose a hybrid approach combining a shallow syntactic analyzer and a manually-developed lexicalized grammar. A feature of the grammar is that it is extensively typed on the basis of a formal ontology for pre-university math. These types are helpful in semantic disambiguation inside and across sentences. Experimental results show that the hybrid system produces a well-formed logical form with 88% precision and 56% recall.


Introduction
Frege and Russell, the initiators of mathematical logic, also explored a theory of natural language semantics (Frege, 1892; Russell, 1905). Since then, symbolic logic has been a fundamental tool and a source of inspiration in the study of language meaning. It suggests that the formalizations of the two realms, mathematical reasoning and language meaning, are actually two sides of the same coin: probably, we could not even conceive the idea of formalizing language meaning without grounding it in mathematical reasoning. This point was first clarified by Tarski (1936; 1944), mainly for formal languages, and then extended to natural languages by Davidson (1967). Montague (1970a; 1970b; 1973) further embodied it by putting forward a terrifyingly arrogant and attractive idea of seeing a natural language as a formal language.
The automation of end-to-end math problem solving thus has an outstanding status in the research themes in natural language processing. The conceptual basis has been laid down, which connects text to the truth (= answer) through reasoning. However, we have not seen a fully automated system that instantiates it end-to-end. We wish to add a piece to the big picture by materializing it.
Define the two straight lines L1 and L2 on the xy-plane as L1: y = 0 (x-axis) and L2: y = √3 x. Let P be a point on the xy-plane. Let Q be the point symmetric to P about the straight line L1, and let R be the point symmetric to P about the straight line L2. Answer the following questions:
(1) Let (a, b) be the coordinates of P, then represent the coordinates of R using a and b.
(2) Assuming that the distance between the two points Q and R is 2, find the locus C of P.
(3) When the point P moves on C, find the maximum area of the triangle PQR and the coordinates of P that gives the maximum area.
(Hokkaido Univ., 1999-Sci-3)
Figure 1: Example problem
Past studies have mainly targeted primary-school-level arithmetic word problems (Bobrow, 1964; Charniak, 1969; Hosseini et al., 2014; Shi et al., 2015; Roy and Roth, 2015; Zhou et al., 2015; Koncel-Kedziorski et al., 2015; Mitra and Baral, 2016; Upadhyay et al., 2016). By nature, arithmetic questions are quantifier-free. Moreover, they tend to include only ∧ (and) as the logical connective. The main challenge in these works was to extract simple numerical relations (most typically equations) from a real-world scenario described in a text. Seo et al. (2015) took SAT geometry questions as their benchmark. However, the nature of SAT geometry questions restricts the resulting formula's complexity. In §3, we will show that none of them includes ∀ (for all), ∨ (or) or → (implies). This suggests that questions of this type require little analysis of the logical structure of the problems beyond conjunctions of predicates.
We take pre-university math problems falling in the theory of real-closed fields (RCF) as our benchmark because of their variety and complexity. The subject areas include real and linear algebra, complex numbers, calculus, and geometry. Furthermore, many problems involve more than one subject: e.g., algebraic curves and calculus as in Fig. 1. Their logical forms include all the logical connectives, quantifiers, and λ-abstraction.
Our goal is to recognize the complex logical structures precisely, including the scopes of the quantifiers and other logical operators.
In the rest of the paper, we first present an overview of an end-to-end problem solving system (§2) and analyze the complexity of the pre-university math benchmark in comparison with others (§3). Among the modules in the end-to-end system, we focus on the sentence-level semantic parsing component and describe an extensively typed grammar (§4 and §5), an analyzer for the math expressions in the text (§6), and two semantic parsing techniques to cope with the scarcity of the training data (§7) and the complexity of the domain (§8). Experimental results show the effectiveness of the presented techniques as well as the complexity of the task through an in-depth analysis of the end-to-end problem solving results (§9).

End-to-end Math Problem Solving
Fig. 2 presents an overview of our end-to-end math problem solving system. A math problem text is first analyzed with a dependency parser. Anaphoric and coreferential expressions in the text are then identified and their antecedents are determined. We assume the math formulas in the problems are encoded in MathML presentation mark-up. A specialized parser processes each of them to determine its syntactic category and semantic content. The semantic representation of each sentence is determined by a semantic parser based on Combinatory Categorial Grammar (CCG) (Steedman, 2001, 2012). The output from the CCG parser is a ranked list of sentence-level logical forms for each sentence. After the sentence-level processing steps, we determine the logical relations among the sentence-level logical forms (discourse parsing) by a simple rule-based system. It produces a tree structure whose leaves are labeled with sentences and internal nodes with logical connectives. Free variables in the logical form are then bound by some quantifiers (or kept free) and their scopes are determined according to the logical structure of the problem. A semantic representation of a problem is obtained as a formula in a higher-order logic through these language analysis steps.
The logical representation is then rewritten using a set of axioms that define the meanings of the predicate and function symbols in the formula, such as maximum, defined as follows: maximum(x, S) ↔ x ∈ S ∧ ∀y(y ∈ S → y ≤ x), as well as several logical rules such as β-reduction. We hope to obtain a representation of the initial problem expressed in a decidable math theory such as RCF through these equivalence-preserving rewritings. Once we find such a formula, we invoke a computer algebra system (CAS) or an automatic theorem prover (ATP) to derive the answer.
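To make the role of such an axiom concrete, the following minimal sketch unfolds the definition of maximum over a finite set of numbers. It is only an illustration: the function names `is_maximum` and `maximum_of` are hypothetical, and the actual system rewrites formulas symbolically over the reals rather than enumerating elements.

```python
from typing import Iterable, Optional


def is_maximum(x: float, s: Iterable[float]) -> bool:
    """Right-hand side of the axiom: x ∈ S ∧ ∀y(y ∈ S → y ≤ x), for a finite S."""
    elements = list(s)
    return x in elements and all(y <= x for y in elements)


def maximum_of(s: Iterable[float]) -> Optional[float]:
    """Search S for a witness x that satisfies the unfolded definition."""
    elements = list(s)
    for x in elements:
        if is_maximum(x, elements):
            return x
    return None  # e.g. the empty set has no maximum
```

Replacing the predicate symbol by its defining formula, as `is_maximum` does, is the kind of equivalence-preserving step that leaves only the primitives ∈, ≤, ∧, and ∀.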
The reasoning module (i.e., the formula rewriting and the deduction with CAS and ATP) of the system has been extensively tested on a large collection of manually formalized pre-university math problems that includes more than 1,500 problems. It solves 70% of them within the time limit of 10 minutes per problem. Table 1 shows the rate of successfully solved problems in the manually formalized version of the benchmark problems used in the current paper.
Table 4: Less frequent sketches found in UNIV (DEV)
Sketch                                        Freq.
∀(P → ∃(∀(P → P) ∧ P))                        2
∃(∃(¬P ∧ P) ∧ P ∧ P(λf)) ∧ P(λ(P → P)))       1
∃(P ∧ P(λ(¬P ∧ ∃(∃P ∧ P))))                   1
∃(P ∧ P(λf)) ∧ P(λ(¬P ∧ P)) ∧ P(λP))          1

Profile of the Benchmark Data
Our benchmark problems, UNIV, were collected from the past entrance exams of seven top-ranked universities in Japan. From the exams held in odd-numbered years from 1999 to 2013, we exhaustively selected the problems that are ultimately expressible in RCF. They account for 40% of all the problems. We divided the problems into two sets: DEV for development (those from 1999 to 2005) and TEST for testing (those from 2007 to 2013). DEV was used for the lexicon development and the tuning of the end-to-end system. The problem texts (both in English and Japanese) with MathML mark-up and manually translated logical forms are publicly available at https://github.com/torobomath. The manually translated logical forms were formulated in a higher-order semantic language introduced later in the paper. The translation was done as faithfully as possible to the original wordings of the problems. They thus keep the inherent logical structures expressed in natural language.
Table 2 lists several statistics of the UNIV problems in the English version and their manual formalization. For comparison, the statistics of three other benchmarks are also listed. JOBS and GEOQUERY are collections of natural language queries against databases. They have been widely used as benchmarks for semantic parsing (e.g., Tang and Mooney, 2001; Collins, 2005, 2007; Kwiatkowski et al., 2010, 2011; Liang et al., 2011). The queries are annotated with logical forms in Prolog. We converted them to equivalent higher-order formulas to collect comparable statistics. GEOMETRY is a collection of SAT geometry questions compiled by Seo et al. (2015). We formalized the GEOMETRY questions in our semantic language in the same way as UNIV.
In Table 2, the first column lists the number of problems. The next three provide statistics of the problem texts: the average number of words and sentences in a problem ('Avg. tokens' and 'Avg. sents'), and the number of unique words in the whole dataset. They reveal that the sentences in UNIV are significantly longer than the others and more than three sentences have to be correctly processed for a problem.
The remaining columns provide statistics about the logical complexity of the problems. 'Atoms' stands for the average number of occurrences of predicates per problem. The next three columns list the number of variables bound by ∃, ∀, and λ. We count sequential occurrences of the same binder as one. The columns for ∧, ∨, ¬, and → list the average number of them per problem. We can see that UNIV includes a wider variety of quantifiers and connectives than the others.
The final column lists the numbers of unique 'sketches' of the logical forms in the dataset. What we call 'sketch' here is a signature that encodes the overall structure of a logical form. Table 3 shows the top four most frequent sketches observed in the datasets. In a sketch, P stands for a (conjunction of) predicate(s) and f stands for a term. ∃, ∀, and λ stand for (immediately nested sequence of) the binders.
To obtain the sketch of a formula φ, we first replace all the predicate symbols in φ with P and all the function symbols and constants with f. We then eliminate all variables in φ and 'flatten' it by applying the following rewriting rules to the sub-formulas in φ in the bottom-up order: Finally, we sort the arguments of Ps and fs and remove the duplicates among them. For instance, to obtain the sketch of the following formula: we replace the predicate/function symbols as in: and then eliminate the variables to have: and finally flatten it to: ∀(P(λP) → P). Table 3 shows that a wide variety of structures are found in UNIV while the other datasets are dominated by a small number of structures. Table 4 presents some of the less frequent sketches found in UNIV (DEV). In fact, 67% of the unique sketches found in UNIV (DEV) occur only once in the dataset. These statistics suggest that the distribution of the logical structures found in UNIV, and in math text in general, is very long-tailed.
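The sketch computation can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the procedure used to build the tables: the tuple-based formula encoding is hypothetical, and the flattening behavior (dropping first-order term arguments, merging duplicate conjuncts, collapsing immediately nested binders of the same kind) is our reading of the prose above, since the exact rewriting rules are not reproduced here.

```python
# Hypothetical formula encoding (an assumption made for this illustration):
#   ("var", name) | ("const", name) | ("func", name, [args]) | ("pred", name, [args])
#   ("and", [subs]) | ("or", [subs]) | ("not", sub) | ("imp", left, right)
#   ("forall", var, body) | ("exists", var, body) | ("lam", var, body)
BINDERS = {"forall": "∀", "exists": "∃", "lam": "λ"}
CONNECTIVES = {"and": " ∧ ", "or": " ∨ "}


def _wrap(text):
    """Parenthesize a binder body unless it is a bare P or f."""
    return text if text in ("P", "f") else "(" + text + ")"


def _operand(node):
    """Sketch of a connective operand, parenthesized if it stays compound."""
    text = sketch(node)
    if node[0] in ("and", "or", "imp") and " " in text:
        return "(" + text + ")"
    return text


def sketch(node):
    kind = node[0]
    if kind == "var":
        return None  # variables are eliminated
    if kind in ("const", "func"):
        return "f"  # every term collapses to f
    if kind == "pred":
        args = [sketch(a) for a in node[2]]
        # Keep only higher-order (λ) arguments, sorted and de-duplicated.
        lambdas = sorted({a for a in args if a and a.startswith("λ")})
        return "P(" + ", ".join(lambdas) + ")" if lambdas else "P"
    if kind in CONNECTIVES:
        parts = []
        for sub in node[1]:
            text = _operand(sub)
            if text not in parts:  # P ∧ P collapses to P
                parts.append(text)
        return CONNECTIVES[kind].join(parts)
    if kind == "not":
        return "¬" + _operand(node[1])
    if kind == "imp":
        return _operand(node[1]) + " → " + _operand(node[2])
    if kind in BINDERS:
        body = node[2]
        while body[0] == kind:  # an immediately nested sequence counts as one binder
            body = body[2]
        return BINDERS[kind] + _wrap(sketch(body) or "f")
    raise ValueError("unknown node kind: " + str(kind))


example = ("forall", "x",
           ("imp",
            ("pred", "p", [("lam", "y", ("pred", "q", [("var", "y"), ("var", "x")])),
                           ("var", "x")]),
            ("pred", "r", [("var", "x")])))
print(sketch(example))
```

The example formula is made up; it is chosen so that the printed result has the same shape as the sketch ∀(P(λP) → P) mentioned in the text.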

A Type System for Pre-university Math
Our semantic language is a higher-order logic (lambda calculus) with parametric polymorphism. Table 5 presents the types in the language. The atomic types are defined so that they capture the selectional restrictions of verbs and other argument-taking phrases as precisely as possible.
For instance, an equation in the real domain, e.g., x^2 − 1 = 0, can be regarded as a set of reals, i.e., {x | x^2 − 1 = 0}. However, we never say 'a solution of a set.' We thus discriminate an equation from a set in the type system even though the concept of equation is mathematically dispensable. Entities of equation and set are built by constructor functions that take a higher-order term as the argument as in eqn(λx.x^2 − 1) and set(λx.x^2 − 1). Related concepts such as 'solution' and 'element' are defined by the axioms for corresponding function and predicate symbols:
Distinction of cardinal numbers (Card) and ordinal numbers (Ord), and the introduction of the 'integer division' type (QuoRem) are also linguistically motivated.
The former is necessary to capture the difference between, e.g., 'kth integer in n_1, n_2, ..., n_m' and 'k integers in n_1, n_2, ..., n_m.' An object of type QuoRem is conceptually a pair of integers that represent the quotient and the remainder of integer division. It is linguistically distinct from the type Pair(Z,Z) because, e.g., in 'Select a pair of integers (n, m) and divide n by m. If the remainder (of φ) is zero, ...', the null (i.e., omitted) pronoun φ has 'the result of division n/m' as its antecedent but not (n, m).
Figure 3: Sketch of the derivation tree for a sentence including an action verb and quantification (recoverable nodes: S\NP : λx.(x = 3); S\NP : λx.quo_of(x) = 3; S : (∃k; k ∈ K) → quo_of(quorem(k, m)) = 3)
[...] phrases is expressed by polymorphic lists and tuples: e.g., 'the radii of the circles C_1, C_2, and C_3' is of type ListOf(R) and 'the function f and its maximum value' is of type Pair(R2R,R).

Combinatory Categorial Grammar
An instance of CCG grammar consists of a lexicon and a small number of combinatory rules. A lexicon is a set of lexical items, each of which associates a word surface form with a syntactic category and a semantic function: e.g.,
    sum :: NP/PP : λx.sum_of(x)
    intersects :: S\NP/PP : λy.λx.intersect(x, y)
A syntactic category is either an atomic category, such as NP, PP, and S, or a complex category of the form X/Y or X\Y, where X and Y are syntactic categories.
The syntactic categories and the semantic functions of constituents are combined by applying combinatory rules. The most fundamental rules are forward (>) and backward (<) application:
    X/Y : f    Y : a    ⇒    X : f a    (>)
    Y : a    X\Y : f    ⇒    X : f a    (<)
The atomic categories are further classified by features such as num(ber) and case of noun phrases. In the current paper, the features are written as in NP[num=pl,case=acc].
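The two application rules can be sketched in a few lines of code. The following is only an illustration: the class and function names (`Atom`, `Slash`, `Item`, `forward_apply`, `backward_apply`) are hypothetical, features such as num and case are ignored, and semantic functions are modeled as plain Python callables that build strings.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Atom:
    name: str  # atomic category such as "S", "NP", "PP"


@dataclass(frozen=True)
class Slash:
    result: "Category"
    direction: str  # "/" looks rightward, "\\" looks leftward
    argument: "Category"


Category = Union[Atom, Slash]


@dataclass
class Item:
    category: Category
    semantics: object  # a callable for functors, any value for arguments


def forward_apply(left: Item, right: Item) -> Optional[Item]:
    """X/Y : f   Y : a   =>   X : f a"""
    c = left.category
    if isinstance(c, Slash) and c.direction == "/" and c.argument == right.category:
        return Item(c.result, left.semantics(right.semantics))
    return None


def backward_apply(left: Item, right: Item) -> Optional[Item]:
    """Y : a   X\\Y : f   =>   X : f a"""
    c = right.category
    if isinstance(c, Slash) and c.direction == "\\" and c.argument == left.category:
        return Item(c.result, right.semantics(left.semantics))
    return None


S, NP, PP = Atom("S"), Atom("NP"), Atom("PP")
# intersects :: S\NP/PP : λy.λx.intersect(x, y)
intersects = Item(Slash(Slash(S, "\\", NP), "/", PP),
                  lambda y: lambda x: f"intersect({x}, {y})")
verb_phrase = forward_apply(intersects, Item(PP, "C"))
sentence = backward_apply(Item(NP, "L"), verb_phrase)
print(sentence.category, sentence.semantics)
```

Here the category S\NP/PP is read as (S\NP)/PP, so the PP argument is consumed first by forward application and the subject NP is then consumed by backward application.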

A Japanese CCG Grammar and Lexicon
We developed a Japanese CCG following the analysis of basic constructions by Bekki (2010) but significantly extending it by covering various phenomena related to copula verbs, action verbs, argument-taking nouns, appositions and so forth. The semantic functions are defined in the format of a higher-order version of dynamic predicate logic (Eijck and Stokhof, 2006). The dynamic property is necessary to analyze semantic phenomena related to quantifications, such as donkey anaphora. In the following examples, we use English instead of Japanese and the standard notation of higher-order logic for the sake of readability.
We added two atomic categories, Sn and Sa, to the commonly used S, NP, and N. Category Sn is assigned to a proposition expressed as a math formula, such as 'x > 0'. Semantically it is of type Bool but syntactically it behaves both like a noun phrase and a sentence.
Category Sa is assigned to a sentence where the main verb is an action verb such as add and rotate. Such a sentence introduces the result of the action as a discourse entity (i.e., what can be an antecedent of coreferential expressions). The action verbs can also mediate quantification as in:
    When any k ∈ K is divided by m, the quotient is 3.
    ∀k(k ∈ K → quo_of(quorem(k, m)) = 3)
where quorem(k, m) represents the result of the division (i.e., the pair of the quotient and the remainder) and quo_of is a function that extracts the quotient from it. To handle such phenomena, we posit the semantic type of Sa as Pair(α, Bool), where the two components respectively carry the result of an action and the condition on it (including quantification). Fig. 3 presents a derivation tree for the above example.
The atomic categories NP, N, and Sa in our grammar have a type feature. Its value is one of the types defined in the semantic language or a type variable when the entity type is underspecified. The lexical entries for '(an integer) divides (an integer)' and '(a set) includes (an element)' would thus have the following categories (features other than type are not shown):
When defining a lexical item, we do not have to explicitly specify the type features in most cases. They can usually be inferred from the definition of the semantic function. In the above example, divides will have λy.λx.(x|y) and includes will have λy.λx.(y ∈ x) as their semantic functions. In both cases, the type features of the NP arguments can be determined from the type definitions of the operators | and ∈ in the ontology.
The lexicon currently includes 54,902 lexical items for 8,316 distinct surface forms, in which 5,094 lexical items for 1,287 surface forms are for function words and functional multi-word expressions. The number of unique categories in the lexicon is 10,635. When the type features are ignored, there are still 4,026 distinct categories.

Math Expression Analysis
The meaning of a math expression is composed with the semantic functions of surrounding words to produce a logical form. We dynamically generate lexical items for each math expression in a problem. Consider the following sentence including two 'equations': If a^2 − 4 = 0, then x^2 + ax + 1 = 0 has a real solution.
The latter, x^2 + ax + 1 = 0, should receive a lexical item of a noun phrase, NP : eqn(λx.x^2 + ax + 1), but the former, a^2 − 4 = 0, should receive category S since it denotes a proposition. Such disambiguation is not always possible without semantic analysis of the text. We thus generate more than one lexical item for ambiguous expressions and let the semantic parser make a choice.
To generate the lexical items, we first collect appositions to the math expressions, such as 'integer n and m' and 'equation x^2 + a = 0,' and use them as the type constraints on the variables and the compound expressions. Compound expressions are then parsed with an operator precedence parser (Aho et al., 2006). Overloaded operators, such as + for numbers and vectors, are resolved using the type constraints whenever possible. Finally, we generate all possible interpretations of the expressions and select appropriate syntactic categories.
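As a rough illustration of the expression parsing step, the sketch below parses a flat token string with precedence climbing, one common way to implement operator-precedence parsing. It is not the parser used in the system: it assumes plain-text input with an explicit `*` for multiplication instead of MathML, the precedence table is made up for the example, and overloading and type constraints are not modeled.

```python
import re

# Hypothetical precedence table: a higher number binds more tightly.
PRECEDENCE = {"=": 1, "+": 2, "-": 2, "*": 3, "/": 3, "^": 4}
RIGHT_ASSOCIATIVE = {"^"}


def tokenize(text):
    """Split into numbers, identifiers, operators, and parentheses."""
    return re.findall(r"\d+|[A-Za-z]+|[-+*/^=()]", text)


def parse(tokens):
    """Return a nested tuple (operator, left, right) for a token list."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        token = tokens[pos]
        pos += 1
        return token

    def parse_atom():
        token = advance()
        if token == "(":
            node = parse_expr(1)
            if advance() != ")":
                raise ValueError("expected ')'")
            return node
        return token

    def parse_expr(min_prec):
        left = parse_atom()
        # Keep absorbing operators that bind at least as tightly as min_prec.
        while peek() in PRECEDENCE and PRECEDENCE[peek()] >= min_prec:
            op = advance()
            next_min = PRECEDENCE[op] + (0 if op in RIGHT_ASSOCIATIVE else 1)
            left = (op, left, parse_expr(next_min))
        return left

    tree = parse_expr(1)
    if pos != len(tokens):
        raise ValueError("unexpected token: " + tokens[pos])
    return tree


tree = parse(tokenize("x^2 + a*x + 1 = 0"))
print(tree)
```

In a typed setting, each operator node of such a tree would then be checked against the type constraints collected from the appositions, and one interpretation would be produced per consistent reading of the overloaded operators.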
We have seen only three categories of math expressions: NP, Sn, and T/(T\NP). The last one is used for an NP with post-modification, as in:

Two-step Semantic Parsing
Two central issues in parsing are the cost of the search and the accuracy of disambiguation. Supervised learning is commonly used to solve both. It is, however, very costly to create the training data by manually annotating a large number of sentences with CCG trees. Past studies have tried to bypass it by so-called weak supervision, where a parser is trained only with the logical form (e.g., Kwiatkowski et al. 2011) or even only with the answers to the queries (e.g., Liang et al. 2011).
Although the adaptation of such methods to the pre-university math data is an interesting future direction, we developed yet another approach based on a hybrid of shallow dependency parsing and the detailed CCG grammar. The syntactic structure of Japanese sentences has traditionally been analyzed based on the relations among word chunks called bunsetsus. A bunsetsu consists of one or more content words followed by zero or more function words. The dependencies among bunsetsus mostly correspond to the predicate-argument and interclausal dependencies (Fig. 4). The dependency structure hence matches the overall structure of a CCG tree only leaving the details unspecified.
We derive a full CCG tree by using a bunsetsu dependency tree as a constraint. We assume: (i) the fringe of each sub-tree in the dependency tree has a corresponding node in the CCG tree. We call such a node in the CCG tree 'a matching node.' We further assume: (ii) a matching node is combined with another CCG tree node whose span includes at least one word in the head bunsetsu of the matching node. Fig. 5 presents an example of a sentence consisting of four bunsetsus (rounded squares), each of which contains two words (w_1 to w_8). In the figure, the i-th cell in the k-th row from the bottom is the CKY cell for the span from the i-th to the (i+k−1)-th words. Under the two assumptions, we only need to fill the hatched cells given the dependency structure shown below the CKY chart. The hatched cells with a white circle indicate the positions of the matching nodes.
Figure 5: Restricted CKY parsing based on a shallow dependency structure
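Assumption (i) can be made concrete with a small sketch that computes, for every bunsetsu, the word span covered by the fringe of its dependency sub-tree, i.e., the span where a matching node must exist. This is an illustration only: the function name `matching_spans`, the 0-based inclusive word indices, and the example dependency structure are assumptions, since the structure drawn in Fig. 5 is not reproduced here. Assumption (ii) and the beam search are not modeled.

```python
from typing import List, Tuple


def matching_spans(heads: List[int],
                   bunsetsu_spans: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Word span (start, end), inclusive, of the sub-tree rooted at each bunsetsu.

    heads[i] is the index of the head bunsetsu of bunsetsu i, or -1 for the root.
    bunsetsu_spans[i] is the word span of bunsetsu i itself.
    """
    children = [[] for _ in heads]
    for dependent, head in enumerate(heads):
        if head >= 0:
            children[head].append(dependent)

    def subtree_span(i: int) -> Tuple[int, int]:
        start, end = bunsetsu_spans[i]
        for child in children[i]:
            child_start, child_end = subtree_span(child)
            start, end = min(start, child_start), max(end, child_end)
        return start, end

    return [subtree_span(i) for i in range(len(heads))]


# Four bunsetsus of two words each; the dependency structure is hypothetical:
# bunsetsu 0 depends on 1, and bunsetsus 1 and 2 depend on the final bunsetsu 3.
heads = [1, 3, 3, -1]
spans = [(0, 1), (2, 3), (4, 5), (6, 7)]
print(matching_spans(heads, spans))
```

Only CKY cells compatible with these spans need to be filled, which is what makes the restricted search much smaller than a full chart.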
Even under the constraint of a dependency tree, it is impractical to do exhaustive search. We use beam search based on a simple score function on the chart items that combines several features such as the number of atomic categories in the item. We also use N-best dependency trees to circumvent dependency errors. The restricted CKY parsing is repeated on the N-best dependency trees until a CCG tree is obtained. Our hope is to reject a dependency error as a violation of the syntactic and semantic constraints encoded in the CCG lexicon. In the experiment, we used a Japanese dependency parser developed by Kudo and Matsumoto (2002). We modified it to produce N-best outputs and used up to 20-best trees per sentence.

Global Type Coherency
The well-typedness of the logical form is usually guaranteed by the combinatory rules. However, they do not always guarantee the type coherency among the interpretations of the math expressions.
For instance, consider the following derivation: The + symbol is interpreted as the addition of real numbers (add_R) in the first clause but that of vectors (add_V) in the second one. The logical form is not typable because the two occurrences of x must have different types. The forward application rule does not reject this derivation since the categories of the two clauses perfectly match the rule schema.
We can reject such inconsistency by doing type checking on the logical form at every step of the derivation. It is, however, quite time-consuming because we can no longer use dynamic programming and need to do type checking on numerous chart items. Furthermore, such type inconsistency may happen across sentences. Instead, we consider the type environment while parsing. A type environment, written as {v_1 : T_1, v_2 : T_2, ...}, is a finite function from variables to type expressions. A pair v : T means that the variable v must be of type T or its instance (e.g., SetOf(R) is an instance of SetOf(α)). For example, the logical form of the first clause of the above sentence is typable under {x : R, y : R, z : α, U : SetOf(R), V : β}, but that of the second clause is not. See, e.g., Pierce (2002) for the formal definitions. Two environments Γ_1 and Γ_2 are unifiable iff there exists a substitution σ that maps the type variables in Γ_1 and Γ_2 to some type expressions so that Γ_1σ = Γ_2σ holds. We write Γ_1 ⊔ Γ_2 for the result of such substitution (i.e., unification) with the most general such σ (the most general unifier, mgu).
Algorithm 1
    Chart ← empty CKY chart
    for each token t in s do
        for each lexical item C : f for t do    // C: category, f: semantic function
            if t is a math expression then
                for each environment Γ ∈ Envs do
                    if Γ is unifiable with FV(f) then
                        add (C, Γ ⊔ FV(f)) to Chart
            else    // t is a normal word
                add (C, ∅) to Chart
    return Chart
    FV(f): the environment that maps the free variables in a semantic function f to their principal types determined by type inference on f.

    // Envs: a set of environments; Derivs: derivation trees
    procedure UPDATEENVIRONMENTS(Envs, Derivs)
        NewEnvs ← ∅    // environments for the next sentence
        for each derivation d ∈ Derivs do
            Γ ← the environment at the root of d
            if Γ ≠ ∅ then    // update the environments
                NewEnvs ← NewEnvs ∪ {Γ}
            else    // no update: there was no math expression
                NewEnvs ← NewEnvs ∪ Envs
        // eliminate those subsumed by other environments
        return MOSTGENERALENVIRONMENTS(NewEnvs)
We associate a type environment with each chart item and refine it through parsing. The type constraints implied in a discourse are accumulated in the environment and block the generation of incoherent derivations (Algorithm 1). Fig. 6 presents an example of a parsing result, in which the type constraints implied in the two clauses are unified at the root and the type of U is determined. When we apply a combinatory rule, we first check if the environments of the child chart items are unifiable. If so, we put the unified environment in the parent item and apply the unifier to the type features in the parent category. For instance, the forward application rule is revised as follows:
    (X/Y : f, Γ_1)    (Y : a, Γ_2)    ⇒    (Xσ : f a, Γ_1 ⊔ Γ_2)
where σ is the mgu of Γ_1 and Γ_2 and Xσ means the application of σ to the type features in X.[5]
[5] To be precise, we also consider the type constraints induced through the unification of the categories. It can be seen in the derivation step for "n divides 12" in Fig. 6, where the new constraint n : Z is induced by the unification of NP[α] and NP[Z] and merged into the environment of the parent.
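The unification of two type environments can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in the system: type variables are encoded as strings starting with `?`, atomic types as other strings, and constructed types as tuples such as `("SetOf", "R")`; a variable missing from one environment is treated as unconstrained there; and type variables with the same name in both environments are treated as the same variable.

```python
def is_type_var(t):
    return isinstance(t, str) and t.startswith("?")


def resolve(t, subst):
    """Apply the substitution to a type expression, recursively."""
    while is_type_var(t) and t in subst:
        t = subst[t]
    if isinstance(t, tuple):
        return (t[0],) + tuple(resolve(a, subst) for a in t[1:])
    return t


def occurs(var, t):
    """True if the type variable occurs inside the (already resolved) type t."""
    if t == var:
        return True
    return isinstance(t, tuple) and any(occurs(var, a) for a in t[1:])


def unify(t1, t2, subst):
    """Extend subst so that t1 and t2 become equal; return None on a clash."""
    t1, t2 = resolve(t1, subst), resolve(t2, subst)
    if t1 == t2:
        return subst
    if is_type_var(t1):
        return None if occurs(t1, t2) else {**subst, t1: t2}
    if is_type_var(t2):
        return unify(t2, t1, subst)
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        for a, b in zip(t1[1:], t2[1:]):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None  # e.g. real number vs. vector


def unify_envs(env1, env2):
    """Return the unified environment, or None if the two are not unifiable."""
    subst = {}
    for var in env1.keys() & env2.keys():
        subst = unify(env1[var], env2[var], subst)
        if subst is None:
            return None
    merged = {**env1, **env2}
    return {var: resolve(t, subst) for var, t in merged.items()}


first = {"x": "R", "z": "?a", "U": ("SetOf", "?a")}
second = {"U": ("SetOf", "Z"), "n": "Z"}
print(unify_envs(first, second))
print(unify_envs({"x": "R"}, {"x": "Vec"}))
```

In the second call the two clauses disagree on the type of x, so no unified environment exists and the corresponding chart item would not be created. The atomic type name `Vec` is a placeholder chosen for the example.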

Experiments and Analysis
This section presents the overall performance of the current end-to-end system and demonstrates the effectiveness of the proposed parsing techniques. We also present an analysis of the failures.
Table 6 presents the results of end-to-end problem solving on the UNIV data. It shows that failure in semantic parsing is a major bottleneck in the current system. Since a problem in UNIV includes more than three sentences on average, parsing a whole problem is quite a high bar for a semantic parser. It is, however, necessary to solve it by the nature of the task. Once a problem-level logical form was produced, the system yielded a correct solution for 44% of such problems in DEV and 36% in TEST.
Table 7 lists the fraction of the sentences on which the two-step parser produced a CCG tree within the top-N dependency trees. We compared the results obtained with the dependency parser trained only on a news corpus (News) (Kurohashi and Nagao, 2003), which is annotated with bunsetsu-level dependencies, and that trained additionally with a math problem corpus consisting of 6,000 sentences (News+Math). The math problem corpus was developed according to the same annotation guideline as the news corpus. The attachment accuracy of the dependency parser was 84% on math problem text when trained only on the news corpus but improved to 94% by the addition of the math problem corpus. The performance gain by increasing N is more evident in the results with the News parser than in those with the News+Math parser. It suggests that the grammar properly rejected wrong dependency trees, which were ranked higher by the News parser. The effect of the additional training is very large at small N and still significant at N = 20. It means that we successfully boosted both the speed and the success rate of CCG parsing only with the shallow dependency annotation on in-domain data.
Table 9: Reasons for the parse failures
Table 8 shows the effect of CCG parsing with type environments.
The column headed 'Typing failure' is the fraction of the problems on which no logical form was obtained due to typing failure. Parsing with type environments eliminated almost all such failures and significantly improved the number of correct answers. The remaining typing failure was due to beam thresholding, where a necessary derivation fell out of the beam.
Table 9 lists the reasons for the parse failures on 1/4 of the TEST section (the problems taken from exams in 2007). In the table, "unknown usage" means a missing lexical item for a word already in the lexicon. "Unknown word" means no lexical item was defined for the word. Collecting unknown usages (especially those of function words) is much harder than just compiling a list of words. Our experience in the lexicon development tells us that once we find a usage example, in the large majority of the cases, it is not difficult to write down its syntactic category and semantic function. Table 9 suggests that we can efficiently detect and collect unknown word usages through parsing failures on a large raw corpus of math problems.
Table 10 presents the accuracy of the sentence- and problem-level logical forms produced on the year 1999 subset of DEV and the year 2007 subset of TEST. Although the recall on the unseen test data is not as high as we hoped, the high precision of the sentence-level logical forms is encouraging.
Table 11: Types of errors in the logical forms
Table 11 provides the counts of the error types found in the wrong sentence-level logical forms produced on DEV-1999 and TEST-2007. It reveals that the majority of the errors are related to the choice of quantifier (∃, ∀, or free) and logical operators (e.g., → vs. ↔) as well as the determination of their scopes. Meanwhile, we did not find an error related to the predicate-argument structure of a logical form. This fact and the results in Table 6 suggest that the selectional restrictions, encoded in the lexicon, properly rejected nonsensical predicate-argument relations. Our next step is to introduce a more sophisticated disambiguation model on top of the grammar, taking advantage of the properly confined search space.

Conclusion
We have explained why the task of end-to-end math problem solving matters for a practical theory of natural language semantics and introduced the semantic parsing of pre-university math problems as a novel benchmark. The statistics of the benchmark data revealed that it includes far more complex semantic structures than the other benchmarks. We also presented an overview of an end-to-end problem solving system and described two parsing techniques motivated by the scarcity of the annotated data and the need for the type coherency of the analysis. Experimental results demonstrated the effectiveness of the proposed techniques and showed that sentence-level logical forms were produced with 88% precision and 56% recall. Our future work includes the expansion of the lexicon with the aid of the semantic parser and the development of a disambiguation model for the binding and scoping structures.