From Textbooks to Knowledge: A Case Study in Harvesting Axiomatic Knowledge from Textbooks to Solve Geometry Problems

Textbooks are rich sources of information. Harvesting structured knowledge from textbooks is a key challenge in many educational applications. As a case study, we present an approach for harvesting structured axiomatic knowledge from math textbooks. Our approach uses rich contextual and typographical features extracted from raw textbooks. It leverages the redundancy and shared ordering across multiple textbooks to further refine the harvested axioms. These axioms are then parsed into rules that are used to improve the state-of-the-art in solving geometry problems.


Introduction
Recently, researchers have proposed standardized tests as "drivers for progress in AI". There is a growing body of work on solving standardized tests such as reading comprehension (Richardson et al., 2013; Sachan et al., 2015, inter alia), science question answering (Schoenick et al., 2016; Sachan et al., 2016, inter alia), algebra word problems (Kushman et al., 2014, inter alia), geometry problems (Seo et al., 2015), pre-university entrance exams (Fujita et al., 2014), etc. A major challenge in building these solvers is the lack of subject knowledge. For example, geometry tests require knowledge of geometry axioms and pre-university exams require knowledge of laws of physics, chemistry, etc.
In this paper, we present an automatic approach that can (a) harvest such subject knowledge from textbooks, and (b) parse the extracted knowledge to structured programs that the solvers can use. Unlike information extraction systems trained on domains such as web documents (Chang et al., 2003; Etzioni et al., 2004, inter alia), learning an information extraction system that can extract axiomatic knowledge from textbooks is challenging because of the small amount of in-domain labeled data available for these tasks. We tackle this challenge by (a) leveraging the redundancy and shared ordering of axiom mentions across multiple textbooks, and (b) utilizing rich contextual and typographical features from textbooks to effectively extract and parse axioms. Finally, we also provide an approach to parse the extracted axiom mentions from various textbooks and reconcile them to achieve the best program for each axiom.

Figure 1: An excerpt of a textbook from our dataset that introduces the Pythagoras theorem. The textbook has a lot of typographical features that can be used to harvest this theorem: the textbook explicitly labels it as a "theorem"; there is a colored bounding box around it; an equation writes down the rule and there is a supporting figure. Our models leverage such rich contextual and typographical information (when available) to accurately harvest axioms and then parse them to horn-clause rules. The horn-clause rule derived by our approach for the Pythagoras theorem is: $\text{isTriangle}(ABC) \wedge \text{perpendicular}(AC, BC) \implies BC^2 + AC^2 = AB^2$.
As a case study, we use our approach to harvest axiomatic knowledge of geometry from math textbooks, and use this knowledge to improve the state-of-the-art system for solving SAT style geometry problems. Seo et al. (2015) recently presented GEOS, an automated end-to-end system that solves SAT style geometry questions such as the one shown in Figure 2. GEOS derives a logical expression that represents the meaning of the text description and the diagram (also shown in Figure 2), and then solves the geometry question by checking the satisfiability of the derived logical expression. While this solver has its basis in coordinate geometry and indeed works, it has some key issues: GEOS requires an explicit mapping of each predicate into a set of constraints over point coordinates (footnote 3). These constraints can be non-trivial to write, requiring significant manual engineering. As a result, GEOS's constraint set is incomplete and it cannot solve a number of SAT style geometry questions. Furthermore, this solver is not interpretable. As our user studies show, it is not natural for a student to understand the solution of these geometry questions in terms of satisfiability of constraints over coordinates. A more natural way for students to understand and reason about these questions is through deductive reasoning using axioms of geometry (footnote 4).

We use our model to extract and parse axiomatic knowledge from a novel dataset of 20 publicly available math textbooks. We use this structured axiomatic knowledge to build a new axiomatic solver that performs logical inference to solve geometry problems. Our axiomatic solver outperforms GEOS on all existing test sets introduced in Seo et al. (2015) as well as a new test set of geometry questions collected from these textbooks. We also performed user studies with a number of school students studying geometry, who found our axiomatic solver to be more interpretable and useful than GEOS.

Figure 2: Above: An example SAT style geometry question with its text description, corresponding diagram and (optionally) answer candidates. Below: A logical expression that represents the meaning of the text description and the diagram in the problem. GEOS derives a weighted logical expression where each predicate also carries a weighted score, but we do not show the scores here for clarity.

Footnote 3: For example, the predicate isPerpendicular(AB, CD) is mapped to a constraint over the coordinates of the points A, B, C and D.

Footnote 4: For example, the deductive reasoning required to solve the question in Figure 2 is: (1) Use the axiom that the sum of interior angles of a triangle is 180° and the fact that ∠AMO is 90° to conclude that ∠MOA is 60°. (2) △MOA ∼ △MOB (using a similar triangle axiom) and then, ∠MOB = ∠MOA = 60° (using the axiom that corresponding angles of similar triangles are equal). (3) Use the angle sum rule to conclude that ∠AOB = ∠MOB + ∠MOA = 120°. (4) Use the axiom that the angle subtended by an arc of a circle at the centre is double the angle subtended by it at any point on the circle to conclude that ∠ADB = 0.5 × ∠AOB = 60°.

Background: GEOS
Our work reuses GEOS to parse the question text and diagram into a formal problem description, as shown in Figure 2. GEOS represents this description as a logical formula, a first-order logic expression that includes known numbers or geometrical entities (e.g. 4 cm) as constants, unknown numbers or geometrical entities (e.g. O) as variables, geometric or arithmetic relations (e.g. isLine, isTriangle) as predicates and properties of geometrical entities (e.g. measure, liesOn) as functions. This is done by learning a set of relations that potentially correspond to the question text (or the diagram) along with a confidence score. For diagram parsing, GEOS uses a publicly available diagram parser for geometry problems (Seo et al., 2014). For text parsing, GEOS takes a multi-stage approach, which maps words or phrases in the text to their corresponding concepts, and then identifies relations between identified concepts. Given this formal problem description, GEOS uses a numerical method to check the satisfiability of literals by defining a relaxed indicator function for each literal. These indicator functions are manually engineered for every predicate. Since this is a cumbersome process, GEOS has an incomplete mapping of literals to indicator functions.

Set up for the Axiomatic Solver
In this work, we replace the numerical solver of GEOS with an axiomatic solver. We extract axiomatic knowledge from textbooks and parse it into horn clause rules. Then we build an axiomatic solver that performs logical inference with these horn clause rules and the formal problem description. A sample logical program (in prolog notation) that solves the problem in Figure 2 is given in Figure 3. The logical program has a set of declarations from the GEOS text and diagram parsers which describe the problem specification, and the parsed horn clause rules describe the underlying theory. Normalized confidence scores from question text, diagram and axiom parsing models are used as probabilities in the program. Next, we describe how we harvest structured axiomatic knowledge from textbooks.

Figure 3: A sample logical program (in prolog notation) that solves the problem in Figure 2. The program consists of a set of data structure declarations that correspond to types in the prolog program, a set of declarations from the diagram and text parse, and a subset of the geometry axioms written as horn clause rules. The axioms are used as the underlying theory with the aforementioned declarations to yield the solution upon logical inference. Normalized confidence weights from the diagram, text and axiom parses are used as probabilities. For the reader's understanding, we list the axioms in the order (1 to 7) they are used to solve the problem. However, this ordering is not required. Other (less probable) declarations and axiom rules are not shown here for clarity but they can be assumed to be present.
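To make the idea of inference over weighted horn clause rules concrete, the following is a minimal sketch in Python. It is not the solver used in this work: it assumes ground (variable-free) atoms written as plain strings, and it scores a derived atom by the best product of rule weight and premise scores (a max-product simplification of probabilistic inference). The names `forward_chain`, `facts` and `rules` and the numeric weights are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# A rule is (premises, conclusion, weight); atoms are ground strings.
Rule = Tuple[FrozenSet[str], str, float]


def forward_chain(facts: Dict[str, float], rules: List[Rule]) -> Dict[str, float]:
    """Apply rules until no atom's best (max-product) score improves."""
    scores = dict(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion, weight in rules:
            if all(p in scores for p in premises):
                score = weight
                for p in premises:
                    score *= scores[p]
                # Keep only the best-scoring derivation of each atom.
                if score > scores.get(conclusion, 0.0) + 1e-12:
                    scores[conclusion] = score
                    changed = True
    return scores


# Declarations (with hypothetical confidences) and one axiom as a rule.
facts = {"isTriangle(ABC)": 0.9, "perpendicular(AC,BC)": 0.8}
rules = [
    (frozenset({"isTriangle(ABC)", "perpendicular(AC,BC)"}),
     "equals(BC^2+AC^2,AB^2)", 0.95),
]
derived = forward_chain(facts, rules)
print(derived["equals(BC^2+AC^2,AB^2)"])
```

A full solver would additionally need unification over variables and a proper probabilistic semantics; the sketch only illustrates how parse confidences and rule weights can combine during deduction.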

Harvesting Axiomatic Knowledge
We present a structured prediction model that identifies axioms in textbooks and then parses them. Since harvesting axioms from a single textbook is a very hard problem, we use multiple textbooks and leverage the redundancy of information to accurately extract and parse axioms. We first define a joint model that identifies axiom mentions in each textbook and aligns repeated mentions of the same axiom across textbooks. Then, given a set of axioms (with possibly multiple mentions of each axiom), we define a parsing model that maps each axiom to a horn clause rule by utilizing the various mentions of the axiom.
Given a set of textbooks B in machine readable form (XML in our experiments), we extract chapters relevant for geometry in each of them to obtain a sequence of sentences (with associated typographical information) from each textbook. Let $S_b = \{s^{(b)}_1, s^{(b)}_2, \dots, s^{(b)}_{|S_b|}\}$ denote the sequence of sentences in textbook b, where $|S_b|$ denotes the number of sentences in textbook b.

Axiom Identification and Alignment
We decompose the problem of extracting axioms from textbooks into two tractable sub-problems: (a) identification of axiom mentions in each textbook using a sequence labeling approach, and (b) aligning repeated mentions of the same axiom across textbooks. Then, we combine the learned models for these sub-problems into a joint optimization framework that simultaneously learns to identify and align axiom mentions. Joint modeling of the axiom identification and alignment is necessary as both sub-problems can help each other.

Axiom Identification
A linear-chain CRF formulation (Lafferty et al., 2001) can be used for the sub-problem of axiom identification. Given $\{S_b \mid b \in B\}$, the model labels each sentence $s^{(b)}_i$ as Beginning, Inside or Outside an axiom. Hereon, a contiguous block of sentences labeled B or I will be considered as an axiom mention. Let $T = \{B, I, O\}$ denote the tag set. Let $y^{(b)}_k \in T$ be the tag assigned to sentence $s^{(b)}_k$ and let $Y_b$ be the tag sequence assigned to $S_b$. The CRF defines

$$p(Y_b \mid S_b; \theta) \propto \exp\Big( \sum_{k} \theta^\top f\big(y^{(b)}_{k-1}, y^{(b)}_k, S_b, k\big) \Big).$$

We find the parameters $\theta$ using maximum-likelihood estimation with L2 regularization. We use L-BFGS to optimize the objective and Viterbi decoding for inference.
Features: Features f look at a pair of adjacent tags $(y^{(b)}_{k-1}, y^{(b)}_k)$, the input sequence $S_b$, and the position k in the sequence. The features (listed in Table 1) include various content based features encoding various notions of similarity between pairs of sentences as well as various typographical features such as whether the sentences are annotated as an axiom (or theorem or corollary) in the textbook, contain equations, diagrams, text that is bold or italicized, are in the same node of the XML hierarchy, are contained in a bounding box, etc.
Some extracted axiom mentions contain pointers to a diagram, e.g. "Figure 2.1". We consider the diagram to be a part of the axiom mention.
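The Viterbi decoding step over the tag set {B, I, O} can be sketched as follows. This is an illustrative sketch, not the trained model: the per-sentence scores (`emissions`) and tag-pair scores (`transitions`) stand in for the log-potentials $\theta^\top f$ and are hypothetical, and the only structural restriction encoded is that tag O cannot be followed by tag I (the constraint stated later for the joint model).

```python
import math
from typing import Dict, List, Tuple

TAGS = ["B", "I", "O"]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring tag sequence under additive log-scores."""
    best = [dict(emissions[0])]          # best score of a path ending in each tag
    back: List[Dict[str, str]] = []      # back-pointers for each later position
    for k in range(1, len(emissions)):
        col, ptr = {}, {}
        for cur in TAGS:
            cands = {}
            for prev in TAGS:
                if prev == "O" and cur == "I":
                    cands[prev] = -math.inf   # forbidden transition O -> I
                else:
                    cands[prev] = best[-1][prev] + transitions.get((prev, cur), 0.0)
            arg = max(cands, key=cands.get)
            col[cur] = cands[arg] + emissions[k][cur]
            ptr[cur] = arg
        best.append(col)
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    tag = max(best[-1], key=best[-1].get)
    path = [tag]
    for ptr in reversed(back):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]


# Hypothetical scores for three sentences.
emissions = [
    {"B": 0.2, "I": 0.0, "O": 1.0},
    {"B": 0.5, "I": 2.0, "O": 0.0},
    {"B": 0.0, "I": 1.0, "O": 0.0},
]
print(viterbi(emissions, {}))
```

Without the forbidden transition the best path here would begin with O; the constraint forces the block of I-tagged sentences to open with a B tag, so that it forms a well-defined axiom mention.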

Axiom Alignment
Next, we leverage the redundancy of information and the relatively fixed ordering of axioms in various textbooks by aligning various mentions of the same axiom across textbooks and introducing structural constraints on the alignment.

Table 1: Features used for axiom identification.

Content
Sentence Overlap: Semantic textual similarity between the current and next sentence. We include features that compute the proportion of common unigrams and geometry entities (constants, predicates and functions) across the two sentences. This feature is conjoined with the tag assigned to the current and next sentence.
Geometry entities: Number of geometry entities (normalized by the number of tokens) in this sentence. This feature is conjoined with the tag assigned to the current sentence.
Intra-sentence semantics: Indicator that the current sentence contains any one of the following words: hence, if, equal, twice, proportion, ratio, product. This feature is conjoined with the tag assigned to the current sentence.

Typography
Axiom, Theorem, Corollary Mention: (a) The current (or previous) sentence is mentioned as an Axiom, Theorem or Corollary, e.g. Similar Triangle Theorem or Corollary 2.1. (b) The section or subsection in the textbook containing the current (or previous) sentence mentions an Axiom, Theorem or Corollary. This feature is conjoined with the tag assigned to the current (and previous) sentence.
Eqn. Template: The current (or next) sentence contains an equation, e.g. $PA \times PB = PT^2$. This feature is conjoined with the tag assigned to the current (and next) sentence.
Assoc. Diagram: The current sentence contains a pointer to a figure, e.g. "Figure 2.1". This feature is conjoined with the tag assigned to the current sentence.
RST edge: Indicator for the RST relation between the current and next sentence. This feature is conjoined with the tag assigned to the current and next sentence.
Bold/Underline: The current (or previous) sentence contains text that is in bold font or underlined. Conjoined with the tag assigned to the current (and previous) sentence.
XML structure: Indicator that the current and previous sentence are in the same node of the XML hierarchy. Conjoined with the tag assigned to the current and previous sentence.
Bounding box: Indicator that the current and previous sentence are bounded by a bounding box in the textbook. Conjoined with the tag assigned to the current and previous sentence.
Let $A_b = \{A^{(b)}_1, \dots, A^{(b)}_{|A_b|}\}$ be the axiom mentions extracted from textbook b. Let A denote the collection of axiom mentions extracted from all textbooks. We assume a global ordering of axioms, indexed $1, \dots, U$, where U is some pre-defined upper bound on the total number of axioms in geometry. We then require that the axiom mentions extracted from each textbook (roughly) follow this ordering. Let $Z^{(b)}_{ik}$ be a binary variable indicating that the i-th axiom mention in textbook b refers to the k-th axiom in the global ordering. We introduce a log-linear model that factorizes over alignment pairs, i.e. pairs of mentions from different textbooks that refer to the same global axiom:

$$p(Z \mid A; \phi) = \frac{1}{Z(A; \phi)} \exp\Big( \sum_{b_1 \neq b_2} \sum_{k=1}^{U} \sum_{i=1}^{|A_{b_1}|} \sum_{j=1}^{|A_{b_2}|} Z^{(b_1)}_{ik} Z^{(b_2)}_{jk} \, \phi^\top g\big(A^{(b_1)}_i, A^{(b_2)}_j\big) \Big)$$

Here, $Z(A; \phi)$ is the partition function of the log-linear model and g denotes the feature function described later. We introduce the following constraints on the alignment structure:
C1: An axiom appears in one book at most once.
C2: An axiom refers to exactly one theorem in the global ordering.
C3: Ordering constraint: If the i-th axiom in a book refers to the j-th axiom in the global ordering, then no axiom succeeding the i-th axiom can refer to a global axiom preceding j.

Learning with Hard Constraints: We find the optimal parameters $\phi$ using maximum-likelihood estimation with L2 regularization, and use L-BFGS to optimize the objective. To compute the feature expectations appearing in the gradient of the objective, we use a Gibbs sampler that draws each $Z^{(b)}_{ik}$ from its conditional distribution given all other alignment variables. Note that the constraints C1 to C3 define the feasible space of alignments. Our sampler always samples the next $Z^{(b)}_{ik}$ in this feasible space.

Learning with Soft Constraints: We might want to treat some constraints, in particular the ordering constraints C3, as soft constraints. We can write down the constraint C3 using the alignment variables:

$$Z^{(b)}_{ij} = 1 \implies Z^{(b)}_{kl} = 0 \quad \text{for all } k > i,\; l < j.$$

To model these constraints as soft constraints, we penalize the model for violating them. Each violation of the above constraint incurs an exponential penalty whose strength is set by $\nu$, a hyper-parameter that tunes the cost of violating a constraint, and these penalties are added to the regularized objective. We use L-BFGS to find the optimal parameters $\phi^*$. We perform Gibbs sampling to compute feature expectations. The sampling equation for $Z^{(b)}_{ik}$ is similar to the hard-constraint case, but additionally accounts for the penalty terms.

Figure 4: An illustration of the three operations to sample axiom blocks.
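The alignment constraints can be checked mechanically. The sketch below is a hedged illustration in Python under one assumed representation: the alignment of a single textbook is a list whose i-th entry is the index of the global axiom that the i-th mention refers to (so C2 holds by construction). The function names are hypothetical. The violation count is the quantity a soft-constraint variant would penalize; the exact penalty form is not reproduced here.

```python
from typing import List


def ordering_violations(alignment: List[int]) -> int:
    """Count pairs (i, k) with i < k whose global indices are out of order (C3)."""
    count = 0
    for i in range(len(alignment)):
        for k in range(i + 1, len(alignment)):
            if alignment[k] < alignment[i]:
                count += 1
    return count


def is_feasible(alignment: List[int]) -> bool:
    """Hard-constraint check for one textbook.

    C1: no global axiom is used by two mentions of the same book.
    C3: mentions follow the global ordering.
    """
    no_repeats = len(set(alignment)) == len(alignment)
    return no_repeats and ordering_violations(alignment) == 0


print(is_feasible([1, 2, 5]), ordering_violations([1, 3, 2, 5]))
```

A sampler that respects the hard constraints would only propose alignments for which `is_feasible` holds, whereas the soft variant would allow any alignment and scale its score by the number of violations.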
Features: Now, we describe the features g. These too include content based features encoding various notions of similarity between pairs of axiom mentions as well as various typographical features. The features are listed in Table 2.

Joint Identification and Alignment
Joint modeling of the axiom identification and alignment components is useful as both problems potentially help each other. Let $Y^{(b)}_{ij}$ denote that the sentence $s^{(b)}_i$ is assigned the tag $j \in T$, and let $Z^{(b)}_{ik}$ denote the alignment variables as before. We further define $Z^{(b)}_{i0}$ such that it denotes that the i-th axiom in textbook b is not aligned to any global axiom. We again define a log-linear model with factors that score axiom identification and axiom alignments: an identification factor built from the features f, and an alignment factor $f_{AA}$ built from the features g.

We write down the model constraints below:
C1': Every sentence has a unique label.
C2': Tag O cannot be followed by tag I.
C3': Consistency between the Y's and Z's, i.e. axiom boundaries defined by the Y's and Z's must agree.
C4': Same as C3.

We use L-BFGS for learning. To compute feature expectations, we use a Metropolis-Hastings sampler that samples the Y's and Z's alternately. Sampling for the Z's reduces to Gibbs sampling and the sampling equations are the same as before (Section 4.1.2). For better mixing, we sample Y in blocks. Consider blocks of Y's which denote axiom boundaries at time stamp t; we define three operations to sample axiom blocks at the next time stamp. The operations (shown in Figure 4) are:
Update axiom: The axiom boundary can be shrunk, expanded or moved. The new axiom, however, cannot overlap with other axioms.
Delete axiom: The axiom can be deleted by labeling all its sentences as O.
Introduce axiom: Given a contiguous sequence of sentences labeled O, a new axiom can be introduced.

Note that these three operations define an ergodic Markov chain. We use the axiom identification part of the model as the proposal. Hence, the acceptance ratio only depends on the alignment part of the model: a proposed labeling $Y'$ is accepted with probability $\min\big(1, U(Y') / U(Y)\big)$, where $U(Y) = f_{AA}$ evaluated at Y. We again have two variants, where we model the ordering constraints (C4') as soft or hard constraints.
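The acceptance step described above can be sketched as follows. This is a generic Metropolis-Hastings step under the stated assumption that proposals are drawn from the identification part of the model, so that the proposal terms cancel and only the alignment score remains. The callables `propose` and `log_alignment_score` are hypothetical placeholders for the identification sampler and for $\log U(Y)$.

```python
import math
import random
from typing import Callable, TypeVar

Labeling = TypeVar("Labeling")


def accept_probability(log_u_proposed: float, log_u_current: float) -> float:
    """min(1, U(Y') / U(Y)), computed from log alignment scores."""
    return math.exp(min(0.0, log_u_proposed - log_u_current))


def mh_step(current: Labeling,
            propose: Callable[[random.Random], Labeling],
            log_alignment_score: Callable[[Labeling], float],
            rng: random.Random) -> Labeling:
    """One Metropolis-Hastings step; returns the new state of the chain."""
    candidate = propose(rng)
    p = accept_probability(log_alignment_score(candidate),
                           log_alignment_score(current))
    return candidate if rng.random() < p else current


# Toy usage: labelings are strings and the score favours more "B" tags.
rng = random.Random(0)
state = mh_step("OOO", lambda r: "BIO", lambda y: float(y.count("B")), rng)
print(state)
```

Working in log space avoids overflow when alignment scores are large, and a proposal that improves the alignment score is always accepted.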

Axiom Parsing
After harvesting axioms, we build a parser for these axioms that maps raw axioms to horn clause rules. The axiom harvesting step provides us a multi-set of axiom extractions. Let $A = \{A_1, A_2, \dots, A_{|A|}\}$ represent the multi-set where each axiom $A_i$ is mentioned at least once.
First, we describe a base parser that parses axiom mentions to horn clause rules. Then, we utilize the redundancy of axiom extractions from various sources (textbooks) to improve our parser.

Base Axiomatic Parser
Our base parser identifies the premise and conclusion portions of each axiom and then uses GEOS's text parser to parse the two portions into a logical formula. Then, the two logical formulas are put together to form horn clause rules. Axiom mentions (for example, the Pythagoras theorem mention in Figure 1) are often accompanied by equations or diagrams. When the mention has an equation, we simply treat the equation as the conclusion and the rest of the mention as the premise. When the axiom has an associated diagram, we always include the diagram in the premise. We learn a model to predict the split of the axiom text into two parts forming the premise and the conclusion spans. Then, the GEOS parser maps the premise and conclusion spans to premise and conclusion logical formulas, respectively.
Table 2: Features used for axiom alignment.
Unigram, Bigram, Dependency and Entity Overlap: Real valued features that compute the proportion of common unigrams, bigrams, dependencies and geometry entities (constants, predicates and functions) across the two axioms. When comparing geometric entities, we include geometric entities derived from the associated diagrams when available.
Longest Common Subsequence: Real valued feature that computes the length of the longest common sub-sequence of words between two axiom mentions, normalized by the total number of words in the two mentions.
Number of sentences: Real valued feature that computes the absolute difference in the number of sentences in the two mentions.
Alignment Scores: We use an off-the-shelf monolingual word aligner, JACANA (Yao et al., 2013), pretrained on PPDB, and compute the alignment score between axiom mentions as the feature.
MT Metrics: We use two common MT evaluation metrics, METEOR (Denkowski and Lavie, 2010) and MAXSIM (Chan and Ng, 2008), and use the evaluation scores as features. While METEOR computes n-gram overlaps controlling on precision and recall, MAXSIM performs bipartite graph matching and maps each word in one axiom to at most one word in the other.
Summarization Metrics: We also use Rouge-S (Lin, 2004), a text summarization metric, and use the evaluation score as a feature. Rouge-S is based on skip-grams.
Equation Template: Indicator feature that matches templates of equations detected in the axiom mentions.
Image Caption: Proportion of common unigrams in the image captions of the diagrams associated with the axiom mentions. If both mentions do not have associated diagrams, this feature doesn't fire.
XML structure: Indicator matching the current (and parent) node of axiom mentions in respective XML hierarchies.

Let $Z_s$ represent the split that demarcates the premise and conclusion spans. We score the axiom split as a log-linear model: $p(Z_s \mid a; w) \propto \exp\big(w^\top h(a, Z_s)\big)$. Here, h are feature functions described later. We found that in most cases (>95%), the premise and conclusion are contiguous spans in the axiom mention where the left span corresponds to the premise and the right span corresponds to the conclusion. Hence, we search over the space of contiguous spans to infer $Z_s$. We use L-BFGS for learning.
Features: We list the features h in Table 3. The features are defined over candidate spans forming the text split, and are strongly inspired by rhetorical structure theory (Mann and Thompson, 1988) and previous work on discourse parsing (Marcu, 2000; Soricut and Marcu, 2003). Given a beam of Premise and Conclusion splits, we use the GEOS parser to get Premise and Conclusion logical formulas for each split in the beam and obtain a beam of axiom parses for each axiom in each textbook.
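The search over contiguous premise/conclusion splits under the log-linear score can be illustrated with a short Python sketch. It assumes the axiom mention is a token list and that a split at index i makes `tokens[:i]` the premise and `tokens[i:]` the conclusion. The feature function `toy_features`, its single feature and the weight value are hypothetical stand-ins for the features h and learned weights w.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

FeatureFn = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def rank_splits(tokens: Sequence[str], features: FeatureFn,
                weights: Dict[str, float],
                beam_size: int = 10) -> List[Tuple[int, float]]:
    """Score every contiguous split and return the top (index, probability) pairs."""
    scored = []
    for i in range(1, len(tokens)):
        feats = features(tokens[:i], tokens[i:])
        scored.append((i, sum(weights.get(n, 0.0) * v for n, v in feats.items())))
    if not scored:
        return []
    # Softmax over all candidate splits (shifted by the max for stability).
    top = max(s for _, s in scored)
    z = sum(math.exp(s - top) for _, s in scored)
    ranked = sorted(((i, math.exp(s - top) / z) for i, s in scored),
                    key=lambda pair: -pair[1])
    return ranked[:beam_size]


def toy_features(premise: Sequence[str], conclusion: Sequence[str]) -> Dict[str, float]:
    # Hypothetical feature: the conclusion begins with the marker "then".
    return {"conclusion_starts_with_then": float(conclusion[0].lower() == "then")}


tokens = "if two angles are equal then the sides are equal".split()
beam = rank_splits(tokens, toy_features, {"conclusion_starts_with_then": 3.0})
print(beam[0])
```

Because only contiguous left/right splits are considered, the search is linear in the number of tokens, and the beam of top-ranked splits can then be handed to a downstream parser.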

Multi-source Axiomatic Parser
Now, we describe a multi-source parser that utilizes the redundancy of axiom extractions from various sources (textbooks). Given a beam of 10-best parses for each axiom from each source, we use a number of heuristics to determine the best parse for the axiom:
1. Majority Voting: For each axiom, pick the parse that occurs most frequently across beams.
2. Average Score: For each axiom, pick the parse that has the highest average parse score (only counting the top 5 parses for each source).
3. Learn Source Confidence: Learn a set of weights $\{\mu_1, \mu_2, \dots, \mu_S\}$, one for each source, and then pick the parse that has the highest average weighted parse score for each axiom.
4. Predicate Score: Instead of selecting from one of the top parses across various sources, treat each axiom parse as a bag of premise predicates and a bag of conclusion predicates. Then, pick a subset of premise and conclusion predicates for the final parse using average scoring with thresholding.
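The first three heuristics can be sketched in a few lines of Python. The sketch assumes each source contributes a beam of (parse, score) pairs sorted best first, with parses represented as strings, and that a parse missing from a source's beam contributes a score of zero for that source; these representation choices and the function names are assumptions made for illustration. Passing `source_weights` turns the average-score heuristic into the weighted variant, with the weights supplied rather than learned.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Beam = List[Tuple[str, float]]  # (parse, score), best first


def majority_vote(beams: Dict[str, Beam]) -> str:
    """Pick the parse that occurs most frequently across all beams."""
    counts = Counter(parse for beam in beams.values() for parse, _ in beam)
    return counts.most_common(1)[0][0]


def average_score(beams: Dict[str, Beam], top_k: int = 5,
                  source_weights: Optional[Dict[str, float]] = None) -> str:
    """Pick the parse with the highest (optionally weighted) average score."""
    totals: Dict[str, float] = defaultdict(float)
    for source, beam in beams.items():
        w = 1.0 if source_weights is None else source_weights.get(source, 1.0)
        for parse, score in beam[:top_k]:   # only the top_k parses per source
            totals[parse] += w * score
    return max(totals, key=lambda p: totals[p] / len(beams))


beams = {
    "book1": [("p1", 0.6), ("p2", 0.4)],
    "book2": [("p2", 0.5), ("p3", 0.5)],
    "book3": [("p1", 0.9), ("p2", 0.1)],
}
print(majority_vote(beams), average_score(beams))
```

In this toy example the two heuristics disagree: "p2" appears in every beam, while "p1" has the higher average score, which illustrates why the choice of heuristic matters.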

Experiments
Datasets: We use a collection of grade 6-10 Indian high school math textbooks by four publishers/authors (NCERT, R S Aggarwal, R D Sharma and M L Aggarwal), a total of 5 × 4 = 20 textbooks, to validate our model. Millions of students in India study geometry from these books every year and these books are readily available online. We manually marked chapters relevant for geometry in these books and then parsed them using Adobe Acrobat's pdf2xml parser. Then, we annotated geometry axioms, alignments and parses for grade 6, 7 and 8 textbooks by the four publishers/authors. We use grade 6, 7 and 8 textbook annotations for development, training, and testing, respectively. All the hyper-parameters in all the models are tuned on the development set using grid search.

GEOS used 13 types of entities and 94 functions and predicates. We add some more entities, functions and predicates to cover other more complex concepts in geometry not covered in GEOS. Thus, we obtain a final set of 19 entity types and 115 functions and predicates for our parsing model. We use Stanford CoreNLP (Manning et al., 2014) for feature generation.

We use two datasets for evaluating our system: (a) practice and official SAT style geometry questions used in GEOS, and (b) an additional dataset of geometry questions collected from the aforementioned textbooks. This dataset consists of a total of 1406 SAT style questions across grades 6-10, and is approximately 7.5 times the size of the dataset used in GEOS. We split the dataset into training (350 questions), development (150 questions) and test (906 questions) with equal proportion of grade 6-10 questions. We annotated the 500 training and development questions with ground-truth logical forms. We use the training set to train another version of GEOS with the expanded set of entity types, functions and predicates. We call this system GEOS++.

Table 3: Features used for axiom parsing.
Discourse Markers: Discourse markers (connectives, cue-words or cue-phrases, etc.) have been shown to give good indications of discourse structure (Marcu, 2000). We build a list of discourse markers using the training set, considering the first and last tokens of each span, culled to the top 100 by frequency. We use these 100 discourse markers as features. We repeat the same procedure using part-of-speech (POS) tags instead of words and use them as features.
Punctuation: Punctuation at the segment border is an excellent cue. We include indicator features for whether there is punctuation at the segment border.
Text Organization: Indicator that the two text spans are part of the same (a) sentence, (b) paragraph.
XML Structure: Indicator that the two spans are in the same node in the XML hierarchy. Conjoined with the indicator feature that the two spans are part of the same paragraph.
RST Parse: We use an off-the-shelf RST parser (Feng and Hirst, 2014) and include an indicator feature that the segmentation matches the parse segmentation. We also include the RST label as a feature.
Span Lengths: The distribution of the two text spans is typically dependent on their lengths. We use the ratio of the lengths of the two spans as an additional feature.
Soricut and Marcu Segmenter: Soricut and Marcu (2003) (section 3.1) presented a statistical model for deciding elementary discourse unit boundaries. We use the probability given by this model, retrained on our training set, as a feature. This feature uses both lexical and syntactic information.
Head / Common Ancestor / Attachment Node: The head node is the word with the highest occurrence as a lexical head in the lexicalized tree among all the words in the text span. The attachment node is the parent of the head node. We have features for the head words of the left and right spans, the common ancestor (if any), the attachment node and the conjunction of the two head node words. We repeat these features with part-of-speech (POS) tags instead of words.
Syntax: Distance to (a) the root and (b) the common ancestor for the nodes spanning the respective spans. We use these distances, and the difference in the distances, as features.
Dominance: Dominance (Soricut and Marcu, 2003) is a key idea in discourse which looks at syntax trees and studies sub-trees for each span to infer a logical nesting order between the two. We use the dominance relationship as a feature. See Soricut and Marcu (2003) for details.
Span Similarity: Proportion of (a) words, (b) geometry relations and (c) relation-arguments shared by the two spans.
No. of Relations: Number of geometry relations represented in the two spans. We use the Lexicon Map from GEOS to compute the number of expressed geometry relations.
Relative Position: Relative position of the two lexical heads and the text split in the sentence.
Results: We first evaluate the axiom identification, alignment and parsing models individually.
For axiom identification, we compare the results of automatic identification with gold axiom identifications and compute the precision, recall and F-measure on the test set. We use strict as well as relaxed comparison. In strict comparison mode, the automatically identified mentions and gold mentions must match exactly to get credit, whereas in the relaxed comparison mode only a majority (>50%) of sentences in the automatically identified mentions and gold mentions must match to get credit. Table 4 shows the results of axiom identification, where we clearly see improvements in performance when we jointly model axiom identification and alignment. This is due to the fact that both components reinforce each other. We also observe that modeling the ordering constraints as soft constraints leads to better performance than modeling them as hard constraints. This is because the ordering of presentation of axioms is generally (yet not always) consistent across textbooks.
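The strict and relaxed comparison modes can be made concrete with the following Python sketch. It assumes each mention is represented as a set of sentence indices, and it interprets the relaxed criterion as requiring the overlap to exceed 50% of both the predicted and the gold mention; that interpretation, along with the function names, is an assumption for illustration.

```python
from typing import List, Set, Tuple


def matches(pred: Set[int], gold: Set[int], relaxed: bool) -> bool:
    """Strict: identical sentence sets. Relaxed: majority overlap on both sides."""
    if not relaxed:
        return pred == gold
    overlap = len(pred & gold)
    return overlap > 0.5 * len(pred) and overlap > 0.5 * len(gold)


def prf(preds: List[Set[int]], golds: List[Set[int]],
        relaxed: bool = False) -> Tuple[float, float, float]:
    """Precision, recall and F-measure of predicted mentions against gold ones."""
    hit_pred = sum(any(matches(p, g, relaxed) for g in golds) for p in preds)
    hit_gold = sum(any(matches(p, g, relaxed) for p in preds) for g in golds)
    precision = hit_pred / len(preds) if preds else 0.0
    recall = hit_gold / len(golds) if golds else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


preds = [{1, 2, 3}, {7, 8}]
golds = [{1, 2, 3, 4}, {7, 8}, {10}]
print(prf(preds, golds), prf(preds, golds, relaxed=True))
```

In the toy data, the first predicted mention misses one gold sentence, so it only receives credit in the relaxed mode.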
To evaluate axiom alignment, we first view it as a series of decisions, one for each pair of axiom mentions, and compute precision, recall and F-score by comparing automatic decisions with gold decisions. Then, we also use a standard clustering metric, Normalized Mutual Information (NMI) (Strehl and Ghosh, 2002), to measure the quality of axiom mention clustering. Table 5 shows the results on the test set when gold axiom identifications are used. We observe improvements in axiom alignment performance too when we jointly model axiom identification and alignment, both in terms of F-score and NMI. Modeling ordering constraints as soft constraints again leads to better performance than modeling them as hard constraints in terms of both metrics.
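Both alignment metrics can be computed from cluster labels, as in the sketch below. It assumes each axiom mention carries the label of the global axiom it is aligned to (one list for the predicted alignment and one for the gold alignment, in the same mention order). The NMI here normalizes mutual information by the geometric mean of the two entropies; the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations
from typing import List, Tuple


def pairwise_prf(pred: List[int], gold: List[int]) -> Tuple[float, float, float]:
    """Treat every pair of mentions as a same-axiom / different-axiom decision."""
    pairs = list(combinations(range(len(gold)), 2))
    pred_pos = {(i, j) for i, j in pairs if pred[i] == pred[j]}
    gold_pos = {(i, j) for i, j in pairs if gold[i] == gold[j]}
    tp = len(pred_pos & gold_pos)
    p = tp / len(pred_pos) if pred_pos else 0.0
    r = tp / len(gold_pos) if gold_pos else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def nmi(pred: List[int], gold: List[int]) -> float:
    """Mutual information normalized by sqrt(H(pred) * H(gold))."""
    n = len(gold)
    joint, cp, cg = Counter(zip(pred, gold)), Counter(pred), Counter(gold)
    mi = sum(c / n * math.log(c * n / (cp[a] * cg[b]))
             for (a, b), c in joint.items())
    hp = -sum(c / n * math.log(c / n) for c in cp.values())
    hg = -sum(c / n * math.log(c / n) for c in cg.values())
    return mi / math.sqrt(hp * hg) if hp > 0 and hg > 0 else 0.0


gold = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(pairwise_prf(pred, gold), nmi(pred, gold))
```

Note that NMI is invariant to how clusters are named, so a predicted alignment that groups mentions identically to the gold alignment scores 1 even if the cluster identifiers differ.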
To evaluate axiom parsing, we compute precision, recall and F-score for (a) the literals derived in axiom parses and (b) the final axiom parses on our test set. Table 6 shows the results of axiom parsing for GEOS (trained on the training set) as well as various versions of our best performing system (GEOS++ with our axiomatic solver) with various heuristics for multisource parsing. The results show that our system (single source) performs better than GEOS, as it is trained with the expanded set of entity types, functions and predicates. The results also show that the choice of heuristic is important for the multisource parser, though all the heuristics lead to improvements over the single source parser. The average score heuristic, which chooses the parse with the highest average score across sources, performs better than majority voting, which chooses the best parse based on a voting heuristic. Learning the confidence of every source and using a weighted average is an even better heuristic. Finally, predicate scoring, which chooses the parse by scoring predicates on the premise and conclusion sides, performs the best, leading to an F1 score of 87.5 (when computed over parse literals) and 73.2 (when computed on the full parse). The high F1 score for axiom parsing on the test set shows that our approach works well and that we can accurately harvest axiomatic knowledge from textbooks.

Table 6: Axiom parsing scores, computed over literals derived in axiom parses or over full axiom parses. We show results for the old GEOS system, for the improved GEOS++ system with expanded entity types, functions and predicates, and for the multisource parsers presented in this paper.

Finally, we use the extracted horn clause rules in our axiomatic solver for solving geometry problems. For this, we over-generate a set of horn clause rules by generating 3 horn clause parses for each axiom and use them as the underlying theory in Prolog programs such as the one shown in Figure 3. We use weighted logical expressions for the question description and the diagram derived from GEOS++ as declarations, and the (normalized) score of the parsing model multiplied by the score of the joint axiom identification and alignment model as weights for the rules. Table 7 shows the results for our best end-to-end system and compares it to GEOS on the practice and official SAT datasets from Seo et al. (2015) as well as on questions from the 20 textbooks. On all three datasets, our system outperforms GEOS. On the dataset from the 20 textbooks in particular (which is a harder dataset and includes more problems that require complex geometric reasoning), GEOS does not perform very well, whereas our system still achieves a good score. Oracle shows the performance of our system when gold axioms (written down by an expert) are used along with automatic text and diagram interpretations in GEOS++. The gap between our system and Oracle shows that there is scope for further improvement in our approach.

Table 7: Scores for solving geometry questions on the SAT practice and official datasets and a dataset of questions from the 20 textbooks. We use the SAT's grading scheme, which rewards a correct answer with a score of 1.0 and penalizes a wrong answer with a negative score of 0.25. Oracle uses gold axioms but automatic text and diagram interpretation in our logical solver. All differences between GEOS and our system are significant (p < 0.05 using the two-tailed paired t-test).

Table 8: User study ratings for GEOS and our system (O.S.) by students in grades 6-10. Ten students in each grade were asked to rate the two systems on a scale of 1-5 on two facets: 'interpretability' and 'usefulness'. Each cell shows the mean rating computed over the ten students in that grade for that facet.

Interpretability: Students around the world solve geometry problems through rigorous deduction, whereas the numerical solver in GEOS does not provide such interpretability. One of the key benefits of our axiomatic solver is that it provides an easy-to-understand, student-friendly deductive solution to geometry problems.
To test the interpretability of our axiomatic solver, we asked 50 students in grades 6-10 (10 students in each grade) to use GEOS and our system (GEOS++ with our axiomatic solver) as a web-based assistive tool while learning geometry. Each student was asked to rate how 'interpretable' and 'useful' the two systems were on a scale of 1-5. Table 8 shows the mean rating by students in each grade on the two facets. Students in every grade found our system to be more interpretable as well as more useful to them than GEOS. This study lends support to our claims about the need for an interpretable deductive solver for geometry problems.
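The two evaluation measures used in this section can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the evaluation code behind Tables 6 and 7: the function names and the literal strings are hypothetical, literals are compared by exact string match, and an unanswered question is assumed to score 0.

```python
from typing import Iterable, Optional, Tuple


def literal_prf(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over sets of literals (exact string match)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def sat_score(answers: Iterable[Optional[str]], key: Iterable[str]) -> float:
    """SAT-style score: +1.0 per correct answer, -0.25 per wrong answer.

    Assumption: an unanswered question (None) contributes 0.
    """
    total = 0.0
    for given, correct in zip(answers, key):
        if given is None:
            continue  # assumed: no reward and no penalty
        total += 1.0 if given == correct else -0.25
    return total


# Hypothetical parse of the Pythagoras theorem with one spurious literal.
predicted = ["isTriangle(ABC)", "perpendicular(AC,BC)", "equals(AB,AC)"]
gold = ["isTriangle(ABC)", "perpendicular(AC,BC)"]
precision, recall, f1 = literal_prf(predicted, gold)

# Four questions: two correct, one wrong, one left unanswered.
score = sat_score(["A", "B", None, "D"], ["A", "C", "B", "D"])
```

In this example two of the three predicted literals are correct and both gold literals are recovered, so precision is 2/3 and recall is 1. The SAT-style score is 1.0 - 0.25 + 0 + 1.0 = 1.75. Scoring full parses instead of literals amounts to passing each whole parse as a single item.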

Related Work
Solving Geometry Problems: While the problem of using computers to solve geometry questions is old (Feigenbaum and Feldman, 1963; Schattschneider and King, 1997; Davis, 2006), NLP and computer vision techniques were first used to solve geometry problems in Seo et al. (2015). While Seo et al. (2014) only aligned geometric shapes with their textual mentions, Seo et al. (2015) also extracted geometric relations and built GEOS, the first automated system to solve SAT style geometry questions. GEOS used a coordinate geometry based solution by translating each predicate into a set of manually written constraints. A boolean satisfiability problem posed with these constraints was used to solve the multiple-choice question. GEOS had two key issues: (a) it needed access to answer choices, which may not always be available for such problems, and (b) it lacked the deductive geometric reasoning used by students to solve these problems. Our axiomatic solver mitigates these issues by performing deductive reasoning using axiomatic knowledge extracted from textbooks.
Information Extraction from Textbooks: Our model builds upon ideas from information extraction (IE), which is the task of automatically extracting structured information from unstructured and/or semi-structured documents. While there has been a lot of work in IE on domains such as web documents (Chang et al., 2003; Etzioni et al., 2004; Cafarella et al., 2005; Chang et al., 2006; Banko et al., 2007; Etzioni et al., 2008; Mitchell et al., 2015) and scientific publication data (Shah et al., 2003; Peng and McCallum, 2006; Saleem and Latif, 2012), work on IE from educational material is much sparser. Most of the research in IE from educational material deals with extracting simple educational concepts (Shah et al., 2003; Canisius and Sporleder, 2007; Liu et al., 2016b; Wang et al., 2016) or binary relational tuples (Balasubramanian et al., 2002; Clark et al., 2012; Dalvi et al., 2016) using existing IE techniques. In contrast, our approach extracts axioms and parses them to horn clause rules, which is much more challenging. Directly applying the rule mining or sequence labeling techniques used for web documents and scientific publications to educational material usually leads to poor results, as educational material contains less redundancy and labeled data is scarce. Our approach tackles these issues by making judicious use of typographical information, the redundancy of information and ordering constraints to improve the harvesting and parsing of axioms. This has not been attempted in previous work.
Language to Programs: After harvesting axioms from textbooks, we also present an approach to parse the axiom mentions to horn clause rules. This work is related to a large body of work on semantic parsing (Zelle and Mooney, 1993, 1996; Kate et al., 2005; Zettlemoyer and Collins, 2012, inter alia). Semantic parsers typically map natural language to formal programs such as database queries (Liang et al., 2011; Berant et al., 2013; Yaghmazadeh et al., 2017, inter alia), commands to robots (Shimizu and Haas, 2009; Matuszek et al., 2010; Chen and Mooney, 2011, inter alia), or even general purpose programs (Lei et al., 2013; Ling et al., 2016; Yin and Neubig, 2017; Ling et al., 2017). More specifically, Liu et al. (2016a) and Quirk et al. (2015) learn "If-Then" and "If-This-Then-That" rules, respectively. In theory, these works can be adapted to parse axiom mentions to horn clause rules. However, this would require a large amount of supervision, which would be expensive to obtain. We mitigate this issue by using redundant axiom mention extractions from multiple textbooks and then combining the parses obtained from the various textbooks to achieve a better final parse for each axiom.
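As an illustration of how parses from several textbooks might be combined, the sketch below implements three of the heuristics compared in Table 6: majority voting, average score, and a weighted average with per-source confidences. It is a minimal sketch under assumptions, not the authors' implementation. The data layout (`scores[source][parse]`), the function names, the example scores and the fixed weights are all hypothetical. In the paper the source confidences are learned, and the best performing predicate scoring heuristic is not sketched here.

```python
from collections import Counter
from typing import Dict, Optional

# scores[source][parse] = parser score of a candidate parse for the axiom
# mention found in that source (hypothetical layout, assumed non-empty).
Scores = Dict[str, Dict[str, float]]


def majority_vote(scores: Scores) -> str:
    """Each source votes for its own top-scoring parse; the most voted parse wins."""
    votes = Counter(max(by_parse, key=by_parse.get) for by_parse in scores.values())
    return votes.most_common(1)[0][0]


def weighted_average(scores: Scores, weights: Optional[Dict[str, float]] = None) -> str:
    """Pick the parse with the highest weighted mean score across sources.

    With weights=None every source counts equally (the average score heuristic).
    Assumptions: weights are positive, and a parse that a source did not
    score counts as 0.0 for that source.
    """
    if weights is None:
        weights = {source: 1.0 for source in scores}
    total_weight = sum(weights[source] for source in scores)
    candidates = {parse for by_parse in scores.values() for parse in by_parse}

    def mean_score(parse: str) -> float:
        weighted_sum = sum(weights[s] * scores[s].get(parse, 0.0) for s in scores)
        return weighted_sum / total_weight

    # sorted() makes tie-breaking deterministic.
    return max(sorted(candidates), key=mean_score)


# Hypothetical scores for two candidate parses of one axiom in three textbooks.
example_scores = {
    "book_a": {"P1": 0.90, "P2": 0.10},
    "book_b": {"P1": 0.40, "P2": 0.60},
    "book_c": {"P1": 0.45, "P2": 0.55},
}
by_vote = majority_vote(example_scores)
by_average = weighted_average(example_scores)
by_confidence = weighted_average(
    example_scores, {"book_a": 0.2, "book_b": 1.0, "book_c": 1.0}
)
```

On these made-up numbers the heuristics disagree. Two of the three sources prefer `P2`, so majority voting returns `P2`. The very confident `book_a` pulls the unweighted average toward `P1`, and down-weighting `book_a` returns `P2` again. This shows in miniature why the choice of heuristic matters.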

Conclusion
We presented an approach to harvest structured axiomatic knowledge from math textbooks. Our approach uses rich features based on context and typography, the redundancy of axiomatic knowledge and shared ordering constraints across multiple textbooks to accurately extract axiomatic knowledge and parse it to horn clause rules. We used the parsed axiomatic knowledge to improve the best previously published automatic approach to solving geometry problems. A user study conducted with school students studying geometry found our approach to be more interpretable and useful than its predecessor. While this paper focused on harvesting geometry axioms from textbooks as a case study, the approach can be extended to obtain valuable structured knowledge from textbooks in areas such as science, engineering and finance.