Natural Solution to FraCaS Entailment Problems



Introduction
Understanding and automatically processing natural language semantics is a central task for computational linguistics and its related fields. At the same time, inference tasks are regarded as the best way of testing an NLP system's semantic capacity (Cooper et al., 1996, p. 63). Following this view, recognizing textual entailment (RTE) challenges (Dagan et al., 2005) were regularly held, evaluating RTE systems on RTE datasets. An RTE dataset is a set of text-hypothesis pairs that are human-annotated with inference relations: entailment, contradiction and neutral. Hence it attempts to evaluate systems against human reasoning. In general, RTE datasets are created semi-automatically and are often motivated by scenarios found in applications like question answering, relation extraction, information retrieval and summarization (Dagan et al., 2005; Dagan et al., 2013). On the other hand, semanticists are busy designing theories that account for the valid logical relations over natural language sentences. These theories usually model reasoning that depends on certain semantic phenomena, e.g., Booleans, quantifiers, events, attitudes, intensionality, monotonicity, etc. These types of reasoning are weak points of RTE systems, as the above-mentioned semantic phenomena are underrepresented in RTE datasets.
In order to test and train the weak points of an RTE system, we choose the FraCaS dataset (Cooper et al., 1996). The set contains complex entailment problems covering various challenging semantic phenomena which are still not fully mastered by RTE systems. Moreover, unlike the standard RTE datasets, FraCaS also allows multi-premised problems. To account for these complex entailment problems, we employ the theorem prover for higher-order logic of Abzianidze (2015a), which implements a version of formal logic motivated by natural logic (Lakoff, 1970; Van Benthem, 1986). Though such expressive logics usually come with inefficient decision procedures, the prover maintains efficiency by using inference rules that are specially tailored for reasoning in natural language. We introduce new rules for the prover in light of the FraCaS problems and test the rules against the relevant portion of the set. The test results are compared to the current state-of-the-art on the dataset.
The rest of the paper is structured as follows. We start by introducing a tableau system for natural logic (Muskens, 2010). Section 3 explores the FraCaS dataset in more detail. In Section 4, we describe the process of adapting the theorem prover to FraCaS, i.e., how specific semantic phenomena are modeled with the help of tableau rules. Several premises with monotone quantifiers increase the search space for proofs. In Section 5, we present several rules that contribute to shorter proofs. In the evaluation part (Section 6), we analyze the results of the prover on the relevant FraCaS sections and compare them with the related RTE systems. We end with possible directions of future work.

[Figure 1: A closed tableau proves that every prover halts quickly entails most tableau provers terminate. Each branch growth is marked with the corresponding rule application; the branches are closed by (≤×) applied to nodes [8,13] and [10,11].]

Tableau theorem prover for natural language
Reasoning in formal logics (i.e., formal languages with well-defined semantics) is carried out by automated theorem provers, which come in different forms based on their underlying proof systems. In order to mirror this scenario for reasoning in natural language, Muskens (2010) proposed to approximate natural language with a version of natural logic (Lakoff, 1970; Van Benthem, 1986; Sánchez-Valencia, 1991), while a version of the analytic tableau method (Beth, 1955; Hintikka, 1955; Smullyan, 1968), hereafter referred to as natural tableau, is introduced as a proof system for the logic. The version of natural logic employed by Muskens (2010) is higher-order logic formulated in terms of the typed lambda calculus (Church, 1940). 1 As a result, the logic is much more expressive (in the sense of modeling certain phenomena in an intuitive way) than first-order logic; e.g., it can naturally account for generalized quantifiers (Montague, 1973; Barwise and Cooper, 1981), the monotonicity calculus (Van Benthem, 1986; Sánchez-Valencia, 1991; Icard and Moss, 2014) and subsective adjectives. What makes the logic natural are its terms, called Lambda Logical Forms (LLFs), which are built up only from variables and lexical constants via functional application and λ-abstraction. In this way LLFs have a more natural appearance than, for instance, the formulas of first-order logic. Examples of LLFs are given in the nodes of the tableau proof tree in Figure 1, where the type information for terms is omitted. A tableau node can be seen as a statement of truth type which is structured as a triplet of a main LLF, an argument list of terms and a truth sign. The semantics associated with a tableau node is that the application of the main LLF to the terms of the argument list is evaluated according to the truth sign. For instance, node 9 is interpreted as the term tableau prover d being true, i.e., d is in the extension of tableau prover.
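To make the term structure concrete, LLFs can be encoded as a small inductive datatype of variables, lexical constants, applications and λ-abstractions. The following is a hypothetical sketch of ours (the names are illustrative, not LangPro's internals):

```python
# A minimal, hypothetical encoding of Lambda Logical Forms (LLFs):
# terms are built only from variables, lexical constants, functional
# application and lambda-abstraction, mirroring the typed lambda calculus.
from dataclasses import dataclass

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Const:          # a lexical constant, e.g. 'every', 'prover', 'halt'
    name: str

@dataclass(frozen=True)
class App:            # functional application (f a)
    fun: object
    arg: object

@dataclass(frozen=True)
class Lam:            # lambda-abstraction (lambda x. body)
    var: Var
    body: object

def show(t):
    """Render an LLF in a parenthesised, surface-like form."""
    if isinstance(t, (Var, Const)):
        return t.name
    if isinstance(t, App):
        return f"({show(t.fun)} {show(t.arg)})"
    return f"(\\{t.var.name}. {show(t.body)})"

# 'every prover halts' as an LLF: ((every prover) halt)
llf = App(App(Const("every"), Const("prover")), Const("halt"))
print(show(llf))  # ((every prover) halt)
```

Note how the rendered term stays close to the surface word order, which is the sense in which LLFs are more "natural" than first-order formulas.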
Notice that LLFs not only resemble surface forms in terms of lexical elements, but most of their constituents are in correspondence too. This facilitates the automated generation of LLFs from surface forms. The natural tableau system of Muskens (2010), like any other tableau system (D'Agostino et al., 1999), tries to prove statements by refuting them. For instance, in the case of an entailment proof, a tableau starts with the counterexample where the premises are true and the conclusion is false. The proof is further developed with the help of schematic inference rules, called tableau rules (see Figure 2). A tableau is closed if all its branches are closed, i.e., are marked with a closure (×) sign. A tableau branch intuitively corresponds to a situation, while a closed branch represents an inconsistent situation. Refutation of a statement fails if a closed tableau is obtained; hence the closed tableau serves as a proof of the statement. The proof of an entailment in terms of a closed tableau is demonstrated in Figure 1. The tableau starts with the counterexample ( 1 , 2 ) of the entailment. It is further developed by applying the rule (MON↑) to 1 and 2 , taking into account that every is upward monotone in the second argument position. 1

1 We use a one-sorted type theory, i.e., with the entity e and truth t types, and hence omit a type s for world-time pairs.

[Figure 2: The tableau rules employed by the tableau proof in Figure 1. Side conditions: ... or H is mon↓, #d and P are fresh, and A ≤ B abbreviates "A entails B".]

The rule application is carried out until all branches are closed or no new rule application is possible. In the running example, all the branches close as (≤×) identifies inconsistencies there; for instance, 4 and 7 are inconsistent according to (≤×), assuming that a knowledge base (KB) provides that halting entails termination, i.e., halt ≤ terminate.
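The refutation procedure and the (≤×) closure check can be sketched as follows. This is a toy illustration with hypothetical names, not the actual NLogPro code:

```python
# Refutation sketch: assume the premise true and the conclusion false,
# then close a branch when a lexical entailment from the knowledge base
# makes two signed statements inconsistent (the (<=x) closure rule).

# Toy knowledge base of lexical entailments: (a, b) means "a entails b".
KB = {("halt", "terminate")}

def entails(a, b):
    """A term entails itself or a term reachable via a KB edge."""
    return a == b or (a, b) in KB

def branch_closes(nodes):
    """A branch closes if some term X is asserted true of an entity while
    a weaker term Y (with X <= Y) is asserted false of the same entity."""
    for (term_x, ent_x, sign_x) in nodes:
        for (term_y, ent_y, sign_y) in nodes:
            if (ent_x == ent_y and sign_x and not sign_y
                    and entails(term_x, term_y)):
                return True
    return False

# Counterexample branch: 'd halts' is true, 'd terminates' is false.
branch = [("halt", "d", True), ("terminate", "d", False)]
print(branch_closes(branch))  # True: halt <= terminate closes the branch
```

A branch without such a clash, e.g. one asserting halt true and run false of d, stays open, which corresponds to a consistent situation and a failed proof.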
The natural tableau system was successfully applied to the SICK textual entailment problems (Marelli et al., 2014) by Abzianidze (2015a). In particular, the theorem prover for natural language, called LangPro, was implemented, integrating three modules: a parser for Combinatory Categorial Grammar (CCG) (Steedman, 2000), LLFgen, which generates LLFs from CCG derivation trees, and the natural logic tableau prover (NLogPro), which builds tableau proofs. The pipeline architecture of the prover is depicted in Figure 3: the sentences of an input problem are first parsed, then converted into LLFs, which are further processed by NLogPro. For a CCG parser, there are at least two options, C&C (Clark and Curran, 2007; Honnibal et al., 2010) and EasyCCG (Lewis and Steedman, 2014).

[Figure 3: The architecture of LangPro.]

NLogPro uses the inventory of rules in (Muskens, 2010) and also additional rules that were collected from SICK. In order to make theorem proving robust, LangPro employs a conservative extension of the type theory for accessing the syntactic information of terms (Abzianidze, 2015b): in addition to the basic semantic types e and t, the extended type theory incorporates basic syntactic types n, np, s and pp corresponding to the primitive categories of CCG. Abzianidze (2015a) shows that on the unseen portion of SICK, LangPro obtains results comparable to the state-of-the-art scores while achieving an almost perfect precision. Based on this inspiring result, we decided to adapt and test LangPro on the FraCaS problems, which are much harder than the SICK ones from the semantics point of view. 2

FraCaS dataset
The FraCaS test suite (Cooper et al., 1996) is a set of 346 test problems. It was prepared by the FraCaS consortium as an initial benchmark for the semantic competence of NLP systems. Each FraCaS problem is a pair of premises and a yes-no question that is annotated with a gold judgment: yes (entailment), no (contradiction), or unknown (neutral). The problems mainly consist of short sentences and resemble the problems found in introductory logic books. To convert the test suite into the style of an RTE dataset, MacCartney and Manning (2007) translated the questions into declarative sentences. The judgments were copied from the original test suite with slight modifications. 3 Several problems drawn from the obtained FraCaS dataset are presented in Table 1.
Unlike other RTE datasets, the FraCaS problems contain multiple premises (45% of the total problems) and are structured in sections according to the semantic phenomena they concern. The sections cover generalized quantifiers (GQs), plurals, anaphora, ellipsis, adjectives, comparatives, temporal reference, verbs and attitudes. Due to the challenging problems it contains, the FraCaS dataset can be seen as one of the most complex RTE datasets from the semantics perspective. Unfortunately, due to its small size the dataset is not representative enough for system evaluation purposes. These facts are perhaps the main reasons why the FraCaS data is less favored for developing and assessing the semantic competence of RTE systems. Nevertheless, several RTE systems (MacCartney and Manning, 2008; Angeli and Manning, 2014; Lewis and Steedman, 2013; Mineshima et al., 2015) were trained and evaluated on (parts of) the dataset. Usually the goal of these evaluations is to show that specific theories/frameworks and the corresponding RTE systems are able to model deep semantic reasoning over the phenomena found in FraCaS. Our aim in the rest of the sections is the same.

Modeling semantic phenomena
Modeling a new semantic phenomenon in the natural tableau requires the introduction of special rules. This section presents the new rules that account for certain semantic phenomena found in FraCaS.
FraCaS Section 1, in short FrSec-1, focuses on GQs and their monotonicity properties. Since the rules for monotonicity are already implemented in LangPro, in order to model the monotonicity behavior of a new GQ, it is sufficient to define its monotonicity features in the signature. For instance, few is defined as few n↓,vp↓,s while many and most are modeled as many n,vp↑,s and most n,vp↑,s respectively. 4 The contrast between the monotonicity properties of the first arguments of few and many is conditioned solely by the intuition behind the FraCaS problems: few is understood as an absolute amount while many as proportional (see Fr-56 and 76 in Table 1). Accounting for the monotonicity properties of most, i.e. most n,vp↑,s , is not sufficient for fully capturing its semantics. For instance, solving Fr-26 requires more than just upward monotonicity of most in its second argument. We capture the semantics of most, concerning more than a half, by the new rule (MOST), with the side condition that X is either T or F. With (MOST), it is now possible to prove Fr-26 (see Figure 4). The rule efficiently but partially captures the semantics of most; modeling its complete semantics would introduce unnecessary inefficiency in the theorem proving. 5

[Table 1: Samples of the FraCaS problems.
99 | yes | P1: Clients at the demonstration were all impressed by the system's performance. P2: Smith was a client at the demonstration. C: Smith was impressed by the system's performance.
100 | yes | P: Clients at the demonstration were impressed by the system's performance. C: Most clients at the demonstration were impressed by the system's performance.
211 | no | P1: All elephants are large animals. P2: Dumbo is a small elephant. C: Dumbo is a small animal.]

[Figure 4: The tableau proof of Fr-26. The proof also employs the admissible rules (∀ n T) and (∀ v T) from Section 5.]

FrSec-1 also involves problems dedicated to the conservativity phenomenon (1). Although we have not specially modeled the conservativity property of GQs in LangPro, it is able to solve all 16 problems about conservativity except one. The reason is that conservativity is underrepresented in FraCaS: the problems cover conservativity in the form of (2) instead of (1). We capture (2) with the help of the existing rules for GQs and (THR×), from (Abzianidze, 2015b), which treats expletive constructions, like there is, as a universal predicate, i.e., any entity not satisfying it leads to inconsistency (×).

FrSec-2 covers the problems concerning plurals. Usually phrases like bare plurals, definite plurals and definite descriptions (e.g., the dog) do not get special treatment in wide-coverage semantic processing and by default are treated as indefinites. Since we want to take advantage of the expressive power of the logic and its proof system, we decided to model these phrases separately. We treat bare plurals and definite plurals as GQs of the form s n,vp,s N n , where s stands for the plural morpheme. The quantifier s can be ambiguous in LLFs due to the ambiguity related to plurals: they can be understood as more than one, universal or quasi-universal (i.e., almost every). Since most of the problems in FraCaS favor the latter reading, we model s as a quasi-universal quantifier. We introduce the lexical knowledge s ≤ a and s ≤ most in the KB and allow the existential quantification rules (e.g., ∃ T ) to apply to the plural terms s N . With this treatment, for instance, the prover is able to prove the entailment in Fr-100.
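The lexical knowledge s ≤ a and s ≤ most can be pictured as edges in a small entailment graph over which the prover looks up weaker quantifiers. A toy sketch of such a lookup, with illustrative names of our own:

```python
# Toy knowledge base of lexical entailments ((a, b) reads "a entails b"),
# including the plural morpheme 's' modeled as a quasi-universal
# quantifier with s <= a and s <= most, as described above.
KB = {("s", "a"), ("s", "most"), ("halt", "terminate")}

def kb_entails(a, b):
    """Reflexive-transitive closure of the <= relation over the KB."""
    if a == b:
        return True
    seen, stack = set(), [a]
    while stack:
        x = stack.pop()
        for (u, v) in KB:
            if u == x and v not in seen:
                if v == b:
                    return True
                seen.add(v)
                stack.append(v)
    return False

print(kb_entails("s", "most"))  # True: bare plurals license 'most'
print(kb_entails("most", "s"))  # False: the converse does not hold
```

With such a lookup, a branch containing s client P : T and most client P : F can be closed the same way halt/terminate closes a branch, which is how Fr-100 goes through.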
We model definite descriptions as generalized quantifiers of the form the N , where the rules make the act as the universal and existential quantifiers when marked with T, and as the existential quantifier in case of F. Put differently, (∀ T ), (∃ T ) and (∃ F ) allow the quantifier in their antecedent nodes to match the. This choice guarantees that, for example, the occurrences of the demonstration in the premises of Fr-99 co-refer, which enables the proof of entailment. This approach also maintains the link when different surface forms co-refer, e.g., the demonstration and the presentation, in contrast to the approach in Abzianidze (2015a). FrSec-2 also involves several problems with contrasting cardinal phrases like exactly n and m, where n < m (see Fr-85). We account for these problems with the closure rule (×EXCT), where the type q, the predicate greater/2 and the domain for E act as constraints.

[The closure rule (×EXCT), with the side condition that E ∈ {just, exactly} and greater(M, N).]

FrSec-5 contains RTE problems pertaining to various types of adjectives. First-order logic has problems with modeling subsective or privative adjectives (Kamp and Partee, 1995), but they are naturally modeled with higher-order terms. A subsective term, e.g., small n,n , is a relation over a comparison class and an entity; e.g., small n,n animal n c e is of type t, as n is a subtype of et according to the extended type theory (Abzianidze, 2015b). The rule (⊆) in Figure 2 accounts for the subsective property. With its help, the prover correctly identifies Fr-211 as a contradiction (see Figure 5). Under the standard first-order intersective analysis, the premises of Fr-211 would be translated as ∀x(elephant(x) → large(x) ∧ animal(x)) and small(dumbo) ∧ elephant(dumbo), which is a contradiction given that small and large are contradictory predicates. Therefore, due to the principle of explosion, everything, including the conclusion and its negation, would be entailed by the premises.
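The higher-order treatment of subsective adjectives can be illustrated with a toy extensional model of our own (the sizes and extensions are made up; this is not the prover's analysis, only an illustration of why "small" must take a comparison class):

```python
# Toy model: a subsective adjective like 'small' is interpreted relative
# to a comparison class (the noun it modifies), so small(elephant)(x)
# and small(animal)(x) can disagree. All numbers are invented.
SIZE = {"dumbo": 300, "jumbo": 500, "mouse": 1}   # arbitrary units
EXT = {"elephant": {"dumbo", "jumbo"},
       "animal":   {"dumbo", "jumbo", "mouse"}}

def small(noun):
    """small_{n,n}: maps a comparison class to a property of entities.
    Here an entity counts as 'small' for a class if it is strictly
    below the average size of that class."""
    cls = EXT[noun]
    avg = sum(SIZE[x] for x in cls) / len(cls)
    return lambda x: x in cls and SIZE[x] < avg

print(small("elephant")("dumbo"))  # True:  small for an elephant
print(small("animal")("dumbo"))    # False: not small for an animal
```

Dumbo comes out as a small elephant but not a small animal, so the conclusion of Fr-211 does not follow; this is exactly the inference that the intersective analysis wrongly licenses.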
FrSec-9, about attitudes, is the last section we explore. Though the tableau system of Muskens (2010) employs intensional types, LangPro only uses extensional types due to the simplicity of the system and the paucity of intensionality in RTE problems. Despite this, with the proof-theoretic approach and extensional types, we can still account for a certain type of reasoning with attitude verbs by modeling the entailment properties of the verbs in the style of Nairn et al. (2006) and Karttunen (2012). For example, know has the (+/+) property, meaning that when it occurs in a positive embedding context, it entails its sentential complement with a positive polarity. Similarly, manage to is (+/+) and (-/-) because John managed to run entails John ran and John did not manage to run entails John did not run. We accommodate the entailment properties in the tableau system in a straightforward way; e.g., terms with the (+/+) property, like know and manage, are modeled via the rule (+/+), where ?p is an optional prepositional or particle term. The remaining three entailment properties for attitude verbs are captured in a similar way.
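The entailment-property scheme can be sketched as a table from a verb and the polarity of its embedding context to the polarity licensed for the complement. This is a hypothetical encoding of ours; the verb entries are illustrative and only cover the properties mentioned above:

```python
# Toy encoding of entailment properties of attitude verbs in the style
# of Nairn et al. (2006): each verb maps the polarity of the embedding
# context ('+' or '-') to the polarity entailed for its sentential
# complement, or None when nothing is entailed in that context.
PROPS = {
    "know":   {"+": "+", "-": None},   # (+/+); other contexts omitted here
    "manage": {"+": "+", "-": "-"},    # (+/+) and (-/-)
}

def complement_polarity(verb, context_polarity):
    """Polarity entailed for the complement, or None if no entailment."""
    return PROPS.get(verb, {}).get(context_polarity)

# 'John managed to run' entails 'John ran':
print(complement_polarity("manage", "+"))  # '+'
# 'John did not manage to run' entails 'John did not run':
print(complement_polarity("manage", "-"))  # '-'
```

A tableau rule like (+/+) then simply copies the complement onto the branch with the polarity returned by such a lookup.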
We also associate the entailment properties with the phrases it is true that and it is false that and model them via the corresponding tableau rules.
Our account of intensionality with extensional types represents a syntactic approach rather than a semantic one. From the semantics perspective, the extensional types license John knowing all true statements if he knows at least one of them. But using the proof system, a syntactic machinery, we avoid such unwanted entailments by the absence of the corresponding rules. In the future, we could incorporate intensional types in LangPro if representative RTE data for the intensionality phenomenon becomes available.
The rest of the FraCaS sections were skipped during the adaptation phase for several reasons. FrSec-3 and FrSec-4 are about anaphora and ellipsis, respectively. We omitted these sections since pronoun resolution is currently not modeled in the natural tableau and almost all sentences involving ellipsis are wrongly analyzed by the CCG parsers. In the current setting of the natural tableau, we treat auxiliaries as vacuous; for this reason LangPro cannot properly account for the problems in FrSec-8, as most of them concern the aspect of verbs. FrSec-6 and FrSec-7 consist of problems with comparatives and temporal reference, respectively. To account for these phenomena, the LLFs of certain constructions need to be specified further (e.g., for comparative phrases) and additional tableau rules must be introduced that model calculations over time and degrees.

Efficient theorem proving
Efficiency in theorem proving is crucial, as we do not have infinite time to wait for provers to terminate and return an answer. Smaller tableau proofs are also easier to verify and debug. This section discusses the challenges for efficient theorem proving induced by the FraCaS problems and introduces new rules that improve efficiency to some extent.
The inventory of rules is a main component of a tableau method. Usually tableau rules are inference rules whose consequent expressions are not larger than the antecedent expressions and are built up from sub-parts of the antecedent expressions. The natural tableau rules also satisfy these properties, which contribute to the termination of tableau development. But there is still a big chance that a tableau does not terminate or gets unnecessarily large. The reason for this is a combination of branching rules, δ-rules (introducing fresh entity terms), γ-rules (triggered for each entity term), and non-equivalent rules (the antecedents of which must remain accessible to other rules too). 6 Efficient theorem proving with LangPro becomes more challenging with multi-premised problems and monotonic GQs. More nodes in a tableau give rise to more choice points in rule applications, and monotonic GQs are usually available for both monotonicity and standard semantic rules.

6 For instance, (MON↑) and (MON↓) in Figure 2 are both branching and δ. They are also non-equivalent since their consequents are semantically weaker than their antecedents; this requires that after their application, the antecedent nodes are still reusable for further rule applications. On the other hand, (∀ T ) is non-equivalent and γ; for instance, for any entity term c, it applies to a node like every dog bark : T and asserts that either c is not a dog or c barks.
To encourage short tableau proofs, we introduce eight admissible rules, i.e., rules that are redundant from the completeness point of view but represent smart shortcuts of several rule applications. 7 Half of the rules for the existential (e.g., a and the) and universal (e.g., every, no and the) quantifiers are γ-rules. 8 To make the application of these rules more efficient, we introduce two admissible rules for each of the γ-rules, (∀ n T) and (∀ v T), where q ∈ {every, the}. Their efficiency is due to choosing a relevant entity c e , rather than an arbitrary entity as (∀ T ) does: (∀ n T ) chooses an entity that satisfies the noun term while (∀ v T ) picks one not satisfying the verb term. Moreover, the admissible rules are not branching, unlike their γ counterparts. Four other admissible rules account for a and the in a false context and no in a true context in a similar way.
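The gain can be illustrated by contrasting a γ-rule, which fires for every entity on the branch and branches each time, with its admissible counterpart, which only fires on relevant entities and does not branch. A schematic sketch with our own names, not the prover's rule engine:

```python
# Schematic contrast between the gamma-rule (forall_T) and the
# admissible rule (forall_n_T) for a node like 'every dog bark : T'.
# Nodes are (predicate, entity, sign) triples.

entities = ["c1", "c2", "c3", "c4"]            # entities on the branch
facts = {("dog", "c3", True)}                  # c3 is known to be a dog

def gamma_forall(noun, verb):
    """(forall_T): for EVERY entity on the branch, open two branches:
    either the entity is not a noun, or it does the verb."""
    return [((noun, c, False), (verb, c, True)) for c in entities]

def admissible_forall_n(noun, verb):
    """(forall_n_T): fire only on entities already known to satisfy the
    noun and assert the verb of them - no branching, fewer nodes."""
    return [(verb, c, True) for (n, c, s) in facts if n == noun and s]

print(len(gamma_forall("dog", "bark")))     # 4 branching applications
print(admissible_forall_n("dog", "bark"))   # [('bark', 'c3', True)]
```

The γ-rule produces a branching choice for each of the four entities, while the admissible rule adds a single non-branching node for the one relevant entity, which is where the shorter proofs come from.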
The monotonicity rules, (MON↑) and (MON↓), are inefficient as they are branching δ-rules. On the other hand, the rules for GQs are also inefficient, each being a γ- or δ-rule. Both types of rules are often applicable to the same GQs, e.g., every and a, as most GQs have monotonicity properties. Instead of triggering these two types of rules separately, we introduce two admissible rules, (∃FUN↑) and (∅FUN↓), which trigger them in tandem.

7 In other words, if a closed tableau makes use of an admissible rule, the tableau can still be closed with a different rule application strategy that ignores the admissible rule.
8 Remember from Section 4 that the is treated like the universal and existential quantifiers in certain cases.

[Table 2.
ID | FraCaS entailment problem
64 unk | P: At most ten female commissioners spend time at home. C: At most ten commissioners spend time at home.
88 unk | P: Every representative and client was at the meeting. C: Every representative was at the meeting.
109 no | P: Just one accountant attended the meeting. C: Some accountants attended the meeting.
215 unk | P1: All legal authorities are law lecturers. P2: All law lecturers are legal authorities. C: All competent legal authorities are competent law lecturers.]

For instance, if g = every, a single application of (∃FUN↑) already yields the fine-grained semantics: there is c e that is A and N but not B. If the nodes were processed by the rules for every, (∀ F ) would first entail 4 and 5 from 2 , and then (∀ T ) or (∀ n T ) would introduce 3 from 1 . (∃FUN↑) also represents a more specific version of the admissible rule (FUN↑) of Abzianidze (2015a), which is itself an efficient and partial version of (MON↑).
(∃FUN↑) and (∅FUN↓) not only represent admissible rules, but they also model semantics of few and many that is not captured by the monotonicity rules: few licenses the inference encoded in (∅FUN↓) and, similarly, it can be shown that many satisfies the inference in (∃FUN↑).

Evaluation
After adapting the prover to the FraCaS sections for GQs, plurals, adjectives and attitudes, we evaluate it on the relevant sections and analyze its performance. The obtained results are compared to related RTE systems.
We run two versions of the prover, ccLangPro and easyLangPro, which employ CCG derivations produced by C&C and EasyCCG, respectively. In order to abstract from parser errors to some extent, the answers from both provers are aggregated in LangPro: a proof is found iff one of the parser-specific provers finds a proof. The evaluation results of the three versions of LangPro on the relevant FraCaS sections are presented in Table 3, along with the confusion matrix for LangPro. The results show that LangPro performs slightly better with C&C compared to EasyCCG. This is due to LLFgen, which is mostly tuned on the C&C derivations. Despite this bias, easyLangPro proves 8 problems that were not proved by ccLangPro. For half of these problems, C&C failed to return derivations for some of the sentences, while in the other half the errors in C&C derivations were crucial; e.g., in the conclusion of Fr-44, committee members was not analyzed as a constituent. On the other hand, ccLangPro proves 10 problems unsolved by easyLangPro; e.g., Fr-6 was not proved because EasyCCG analyzes really as a modifier of are in the conclusion, or, even more unfortunately, the morphological analyzer of EasyCCG cannot get the lemma of clients correctly in Fr-99, and as a result the prover cannot relate clients to client.
The precision of LangPro is high due to its sound inference rules. Fr-109 in Table 2 was the only case where entailment and contradiction were confused: plurals are not modeled as strictly more than one. 9 The false proofs are mostly due to a lack of knowledge about adjectives. LangPro does not know a default comparison class for clever; e.g., clever person entails clever, but clever politician does not. Fr-215 was proved as entailment because we have not modeled the intensionality of adjectives. Since EasyCCG was barely used during the adaptation (except for changing most of the NP modifiers into noun modifiers), it analyzed at most in Fr-64 as a sentential modifier, which was not modeled as downward monotone in the signature. Hence, by default, it was considered upward monotone, leading to the proof for entailment.
There are several reasons behind the problems that were not proved by the prover. Several problems for adjectives were not proved as they contained comparative constructions, not covered by the rules. Some problems assume the universal reading of plurals. A couple of problems involving at most were not solved as the parsers often analyze the phrase in a wrong way. 10

[Table 4: Comparison with related RTE systems: (Lewis and Steedman, 2013) with Parser and Gold syntax, NLI (Angeli and Manning, 2014), T14a, T14b (Dong et al., 2014) and M15 (Mineshima et al., 2015). BL is a majority (yes) baseline. Results for non-applicable sections are struck out.]

We also check how representative the FraCaS sections are for higher-order GQs (HOGQs). After replacing all occurrences of most, several, many, s and the with the indefinite a in LLFs, LangPro -HOGQ (i.e., without the HOGQs) achieves an overall accuracy of 81% over FrSec-1,2,5,9. Compared to LangPro, only 6 problems, including Fr-56 and 99, were misclassified, while Fr-26 and 100 were solved. This shows that the dataset is not representative enough for HOGQs.
In Table 4, the current results are compared to the RTE systems that have been tested on the single or multi-premised FraCaS problems. 11 According to the table, the current work shows that the natural tableau system and LangPro are successful in deep reasoning over multiple premises.
The natural logic approach of MacCartney and Manning (2008) and Angeli and Manning (2014) models monotonicity reasoning with the exclusion relation in terms of string edit operations over phrases. Since the approach heavily hinges on a sequence of edits that relates a premise to a conclusion, it cannot process multi-premised problems properly. Lewis and Steedman (2013) and Mineshima et al. (2015) are both based on first-order logic representations. While Lewis and Steedman (2013) employ distributional relation clustering to model the semantics of content words, Mineshima et al. (2015) extend first-order logic with several higher-order terms (e.g., for most, believe, manage) and augment the first-order inference of Coq with additional inference rules for the higher-order terms. Dong et al. (2014) build an inference engine that reasons over abstract denotations, formulas of relational algebra or a sort of description logic, obtained from Dependency-based Compositional Semantics trees (Liang et al., 2011). Our system and approach differ from the above-mentioned ones in their unique combination of the expressiveness of higher-order logic, the naturalness of logical forms (making them easily obtainable) and the flexibility of a semantic tableau method. All these allow modeling surface and deep semantic reasoning successfully in a single system.

Future work
We have modeled several semantic phenomena in the natural tableau theorem prover and obtained high results on the relevant FraCaS sections. Concerning the FraCaS dataset, in future work we plan to account for comparatives and temporal reference in the natural tableau. After showing that the natural tableau can successfully model both deep reasoning (e.g., the FraCaS problems) and (relatively) wide-coverage surface reasoning (e.g., the SICK dataset), we see RTE datasets like RTE-1 (Dagan et al., 2005) and SNLI (Bowman et al., 2015), involving texts obtained from newswire or crowdsourcing, as the next step for developing the theory and the theorem prover.