Natural Language Inference with Monotonicity

This paper describes a working system which performs natural language inference using polarity-marked parse trees. The system handles all of the instances of monotonicity inference in the FraCaS data set. Except for the initial parse, it is entirely deterministic. It handles multi-premise arguments, and the kind of inference performed is essentially “logical”, but it goes beyond what is representable in first-order logic. In any case, the system works on surface forms rather than on representations of any kind.


Introduction
Computational systems which attempt to automate natural language inference (NLI) generally fall into one of three categories: 1) systems which translate input into first-order logic (FOL) or higher-order logic; 2) systems based on distributional semantics, using word embeddings and then neural networks for learning inference (e.g. Bowman et al., 2015; Cases and Karttunen, 2017); and 3) systems using natural logic.
This paper is a contribution to the third category, as are Abzianidze (2015, 2017); MacCartney and Manning (2009); Angeli and Manning (2014); Angeli et al. (2016); and Mineshima et al. (2015). Specifically, we continue work on order-based approaches to natural language inference going back to Fyodorov et al. (2003) and Zamansky et al. (2006). We make use of the polarity-marking tool due to Hu and Moss (2018). When given as input a derivation tree in CCG, this tool outputs the polarized yield of the tree. For example, when one inputs Most Europeans live outside of Asia, the output will be Most ↑ Europeans = live ↑ outside = of = Asia ↓ . Indeed, the tool polarizes every constituent. These arrows indicate whether inferences can be made by replacement "upward" (↑), by replacement "downward" (↓), or in neither direction (=). Iterating this idea of replacement does give a "fair amount" of inference, but to cope with NLI datasets we augment replacement with rules of natural logic (van Benthem, 1986), and with a mechanism for handling contradictions.
Our system is aimed at inference problems such as those in the FraCaS data set (Cooper et al., 1996), and we compare our efforts with the results in other papers. In addition, the ideas in our system can be adapted by others as part of their NLI toolkits.
Our system uses two auxiliary sets: a knowledge base K and a sentence base S of entailed sentences. K consists of a relation ≤ on constituents (that is, a set of ordered pairs of words or multi-word constituents), for example:
cat ≤ animal
kiss ≤ touch
kissed some cat ≤ touched some animal
These pairs come from the premise set P, from a fixed background knowledge base, or from a lexical source such as WordNet (Miller, 1995). K also keeps track of all the nouns, verbs, adjectives, adverbs, and relative clauses that appear in either P or H. The second auxiliary set is the sentence base S. This set stores all the inferences and contradictions our system derives, starting from P. Inferences are stored in S.inferences, whereas contradictions are stored in S.contradictions. (Optionally, it might include a subset of K which is relevant to P ∪ {H}.)
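As an illustration only, the two auxiliary sets could be represented as follows. This is a minimal sketch, not the authors' implementation: the class names `KnowledgeBase` and `SentenceBase`, the method `add`, and the choice of plain strings for constituents are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class KnowledgeBase:
    """Hypothetical container for the relation <= on constituents."""
    # bigger[x] holds every y with x <= y; smaller[y] holds every x with x <= y.
    bigger: dict = field(default_factory=lambda: defaultdict(set))
    smaller: dict = field(default_factory=lambda: defaultdict(set))

    def add(self, small: str, big: str) -> None:
        """Record the pair small <= big in both lookup directions."""
        self.bigger[small].add(big)
        self.smaller[big].add(small)


@dataclass
class SentenceBase:
    """Hypothetical container mirroring S.inferences and S.contradictions."""
    inferences: set = field(default_factory=set)
    contradictions: set = field(default_factory=set)


K = KnowledgeBase()
K.add("cat", "animal")
K.add("kiss", "touch")
K.add("kissed some cat", "touched some animal")

S = SentenceBase()
S.inferences.add("all animal sleep")  # premises are added to S.inferences first
```

Indexing the relation in both directions is a design choice of the sketch: replacement needs "smaller" candidates in ↓ positions and "bigger" candidates in ↑ positions, so both lookups are kept cheap.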

Algorithm
The key intuition of our algorithm is that once we have a correctly polarized CCG tree, e.g., all ↑ animal ↓ sleep ↑ , we can replace a constituent with some word or phrase from K and get an inference: all ↑ cat ↓ sleep ↑ . This replacement is extremely simple and effective. Along the same lines, we can replace all with no to obtain a contradiction: no ↑ animal ↓ sleep ↓ . (This is not strictly a contradiction: in a logically possible model, there might not be any animals. In that model, both all animals sleep and no animals sleep are true. However, the spirit of work in the RTE area is that we should use the more natural semantics of all, the semantics that carries an existential presupposition. With this presupposition in the premise, we indeed have a contradiction.) Since we ignore most of the morphology, the words are all represented by their lemmas. We also normalize sentence structures to reduce variation. For example, there-be structures such as "there are NP who VP" are changed to "some NP VP". The major steps of our algorithm are listed below. (A more complete pseudocode is presented in the Appendix.)
1. Get the polarities of all sentences in P, using the system in Hu and Moss (2018).
2. For each P ∈ P: (1) Add P to S.inferences; (2) If P is of the form "every X is a N", then add X ≤ N to K; (3) If P is of the form "every X VP", then add be X ≤ VP to K; (4) If P is of the form "N pr is a N", then add every N ≤ N pr and N pr ≤ some N to K.
3. Next, make the following additions to K: • For each noun n and each subsective adjective a, each prepositional phrase p, and each relative clause r in P ∪ {H}: add to K the following inequalities: a n ≤ n, n p ≤ n, and n r ≤ n. For instance, small dog ≤ dog, dog from France ≤ dog, dog that barks ≤ dog.
• For each verb v and each adverb a: add to K the inequality v a ≤ v.
4. Loop over each premise P and do two types of replacement; one derives inferences while the other produces contradictions. See Figure 1 for a concrete example.
• replacement infer: 1) if a constituent is ↓ , then replace it with something "smaller" in K; 2) if a constituent is ↑ , then replace it with something "bigger" in K. Finally, store the new sentences in S.inferences.
• replacement contra: 1) replace "no" with "some" and vice versa if the quantifier is not embedded in a relative clause; 2) negate the sentence, e.g., by adding "do not" before the main verb. Finally, store the new sentences in S.contradictions.
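The inference half of step 4 can be sketched as a single pass over a polarized sentence. The sketch below is a simplification under stated assumptions: it works on individual words rather than on every constituent of a CCG tree, it keeps polarities fixed after a replacement (which is only safe when the replaced word is not itself a quantifier or other polarity-changing operator), and the function name `replacement_infer` and the labels "up", "down", "eq" are hypothetical.

```python
def replacement_infer(sentence, smaller, bigger):
    """Return every sentence obtainable by exactly one replacement.

    sentence: list of (word, polarity) pairs, polarity in {"up", "down", "eq"}
    smaller:  dict mapping a word to the set of words below it in K
    bigger:   dict mapping a word to the set of words above it in K
    """
    results = []
    for i, (word, polarity) in enumerate(sentence):
        if polarity == "down":
            candidates = smaller.get(word, ())
        elif polarity == "up":
            candidates = bigger.get(word, ())
        else:
            candidates = ()  # "eq" positions admit no replacement
        for candidate in sorted(candidates):
            new_sentence = list(sentence)
            new_sentence[i] = (candidate, polarity)
            results.append(new_sentence)
    return results


# all(up) animal(down) sleep(up), with cat <= animal in K
premise = [("all", "up"), ("animal", "down"), ("sleep", "up")]
derived = replacement_infer(premise, smaller={"animal": {"cat"}}, bigger={})
print(" ".join(word for word, _ in derived[0]))  # all cat sleep
```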
Now the inference problem becomes a typical AI search problem. That is, we are searching for an inference that matches H string for string. If such an inference can be found before reaching a stopping criterion, then we return entail. If not, we turn to the generated contradictions to see if any of them matches H; if so, we return contradict, otherwise we return unknown.
Figure 1: Example search tree where P is every animal likes some young semanticist, with the H: every cat likes some linguist. Only one replacement is allowed at each step. Sentences in rectangles are the generated contradictions. In this case our system will return entail.
Note that theoretically we can perform replacement infer indefinitely, on the inferences generated in the last step. By contrast, replacement contra can only be applied once to each inference, since the contradiction of a contradiction brings us back to the premise again. This search problem is implemented using depth-first search, with a default depth of 2.
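The search just described can be sketched as a depth-limited depth-first search. This is a hedged illustration, not the system's code: the function `search`, its parameters `expand` and `contradict`, and the toy rewrite table are assumptions, and for simplicity the sketch collects all inferences within the depth limit before testing for a match instead of stopping at the first hit.

```python
def search(premises, hypothesis, expand, contradict, max_depth=2):
    """Return "entail", "contradict" or "unknown" for the hypothesis.

    expand(s)     -> sentences one inference replacement away from s
    contradict(s) -> sentences that contradict s (applied once per inference)
    """
    best_depth = {}  # shallowest depth at which each sentence was reached

    def dfs(sentence, depth):
        if best_depth.get(sentence, max_depth + 1) <= depth:
            return  # already explored from the same or a shallower depth
        best_depth[sentence] = depth
        if depth < max_depth:
            for nxt in expand(sentence):
                dfs(nxt, depth + 1)

    for premise in premises:
        dfs(premise, 0)

    if hypothesis in best_depth:
        return "entail"
    contradictions = {c for s in best_depth for c in contradict(s)}
    return "contradict" if hypothesis in contradictions else "unknown"


# Toy stand-ins for the two replacement procedures (lemmatized strings).
P = "every animal like some young semanticist"
REWRITES = {
    P: ["every cat like some young semanticist",
        "every animal like some linguist"],
    "every cat like some young semanticist": ["every cat like some linguist"],
    "every animal like some linguist": ["every cat like some linguist"],
}


def expand(s):
    return REWRITES.get(s, [])


def contradict(s):
    # Toy negation: insert "do not" before the main verb "like".
    return [s.replace(" like ", " do not like ", 1)]


print(search([P], "every cat like some linguist", expand, contradict))  # entail
```

Recording the shallowest depth per sentence, rather than a plain visited set, avoids skipping the expansion of a sentence that is first reached at the depth limit and later reached by a shorter path.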
Inferences not handled by replacement
As discussed in , replacement can handle/derive many rules of natural logic, but not all of them. To name just a few, the rules below are not covered by replacement:
To deal with this, we first convert the premises to sentences compatible with natural logic syntax, i.e., quantifier x y. Then we apply the above rules to these sentences to get inferences. Finally, we convert the sentences in natural logic back to sentences in natural language. This usually involves only minimal editing. For example, every cat (animal ∧ meow) will be converted to "every cat is an animal who meows". As we will show later, DET is useful in solving many of the multi-premise problems in the first section of FraCaS. It is also worth noting that the capacity of our system can be easily expanded by including more rules from natural logic.
The initial knowledge base K includes the most basic (monotonicity) knowledge that can be utilized for all problems:
• knowledge from WordNet (Miller, 1995), e.g., dog ≤ animal, dog | cat. The first section of the FraCaS dataset does not require world knowledge, so we have not included WordNet relations for now. However, they can easily be added if need be.
• knowledge about quantifiers. Our system treats the following words/phrases as quantifiers: every = all = each ≤ most ≤ many ≤ a few = several ≤ some = a; the ≤ some = a; at least/most n.
Because the parsers do not treat at least/most n as quantifiers as we hoped, we need a separate work-around for them.
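The quantifier ordering listed above can be encoded compactly by ranking the groups of equivalent quantifiers along the chain and treating "the" as a separate case. The following is a sketch under assumptions: the names `CHAIN`, `RANK` and `quant_leq` are hypothetical, "the" is taken to be comparable only with "some" and "a" because no other relation is stated, and at least/most n is left out since it needs a separate work-around.

```python
# Groups of equivalent quantifiers, ordered from smallest to biggest.
CHAIN = [
    ("every", "all", "each"),
    ("most",),
    ("many",),
    ("a few", "several"),
    ("some", "a"),
]
RANK = {q: i for i, group in enumerate(CHAIN) for q in group}


def quant_leq(q1, q2):
    """Return True if q1 <= q2 in the quantifier ordering."""
    if q1 == q2:
        return True  # the ordering is reflexive
    if q1 == "the":
        return q2 in ("some", "a")  # the only stated relations for "the"
    if q1 in RANK and q2 in RANK:
        return RANK[q1] <= RANK[q2]
    return False


print(quant_leq("every", "most"))   # True
print(quant_leq("most", "every"))   # False
print(quant_leq("the", "some"))     # True
```

In an ↑ position, a quantifier q1 may then be replaced by any q2 with `quant_leq(q1, q2)`, which is how Most can become Many or Several in the inferences.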

Experiments on section 1 of FraCaS
We run our algorithm on the FraCaS dataset for NLI. This paper reports only on the first section: generalized quantifiers. Extending to other sections of the FraCaS dataset, and to other datasets, is work in progress. Results of our system are shown in Table 1. We have perfect precision and accuracy comparable to that of previous systems.

Choice of parsers and their errors
Parser performance is the biggest bottleneck of the system. We have tested two commonly used CCG parsers, C&C (Clark and Curran, 2007) and EasyCCG (Lewis and Steedman, 2014). C&C fails to parse four sentences from Section 1 of FraCaS. EasyCCG can parse all of them, but we still need to modify the trees semi-automatically. Some of these modifications transform the tree into a semantically more meaningful form, while others correct parse errors. For example, not all quantifiers are super-tagged consistently: most and few are sometimes not tagged as NP/N. There are also parsing errors involving multi-word expressions such as "a lot of" and "used to be". We only correct systematic errors.

An example
The following example shows the actual process of solving FraCaS-026, a multi-premise problem that is handled not only by replacement but also by the DET rule. The major steps are listed below.
1. Get the polarities of all premises P, but not of the hypothesis H:
P1: Most ↑ Europeans = are ↑ resident ↑ in ↑ Europe ↑
P2: All ↑ Europeans ↓ are ↑ people ↑
P3: All ↑ people ↓ who ↓ are ↓ resident ↓ in ↓ Europe ↓ can ↑ travel ↑ freely ↑ within ↑ Europe ↓
H: Most Europeans can travel freely within Europe
2. Update the knowledge base K with the information from P. For example, based on the form "every (or equivalent quantifiers, see above) X VP", add X ≤ VP, which is:
people who are resident in Europe ≤ can travel freely within Europe
We can also get:
be people who are resident in Europe ≤ can travel freely within Europe
3. Using the DET rule, the system generates a series of sentences which are also polarized, e.g., applying the DET rule to P1 and P2 we get:
4. Add the generated sentences to the sentence base S, and perform replacement on every constituent of each sentence. We thereby obtain a series of inferences like the following:
Many European are people who are resident in Europe
Most European are people
Most European can travel freely within Europe
Several European are people who are resident in Europe
5. Finally, one of the sentences in the list of inferences above matches the given hypothesis H, so the answer to the original problem is entail.

Comparison with previous systems
We comment on our system and compare it to several systems mentioned in our Introduction (e.g. MacCartney and Manning, 2008; Angeli and Manning, 2014; Mineshima et al., 2015; Abzianidze, 2015). Our algorithm is provably correct in the following sense. If one has a correctly parsed set P = P1, ..., Pn of premises and uses our algorithm, and if the hypothesis H is proved from the premises in our system, then H follows logically from P. So in this sense, our system will have no false positives, i.e., no type I error. This claim requires a few words of clarification. First, the parsed output frequently does not reflect the logical structure accurately, and in this case, the polarity-marking step of our algorithm might well go wrong. Second, in the case of ambiguous logical words, it is also possible that errors in a parse will lead to errors in our output. For example, the English word "any" means "all" in a downward-entailing environment, but it means "some" in an upward-entailing environment. There are exceptions to this.
MacCartney and Angeli's systems, on the other hand, find downward-entailing environments by pattern-matching using dependency trees, which makes the polarizing algorithm more error-prone. For example, their system (part of Stanford CoreNLP v3.9.1) incorrectly polarizes the following sentences: no ↑ man ↓ walks ↑ , Ed ↑ refused ↑ to ↑ dance ↑ , John ↑ loves ↑ dancing ↑ without ↑ shoes ↑ , I ↑ like ↑ most ↑ dogs ↑ , whereas our system correctly polarizes all of them (walks, dance, shoes should all be ↓ and dogs = ). Another difference is that in our system, replacement can happen in any order, and the results are the same, whereas in Angeli's system only certain "mutating" orders lead to the correct inferences (see Section 3.1 of Angeli et al. (2016)). A final point is that their systems only polarize at the word level, but our system computes polarities also on the constituent level, which is important for replacement as shown in the example above.
We used the tool in Hu and Moss (2018), but they did not provide a working system for NLI. Such a system was described in , but that paper was programmatic and did not have an implementation or test data. So it left open the issue of how much NLI can be done with monotonicity alone, and how much requires natural logic rules. We present initial attempts to address this problem.
The best-performing system in the logic-related area of NLI is the one described in Abzianidze (2014, 2015, 2017). That line of work uses tableau rules rather than deduction, as in a more standard theorem prover from automated reasoning. Our system is arguably simpler than his: we use no lambda logical forms, and we believe that the theoretical basis of our system is also simpler. When our system finds an inference (or contradiction), the derivations in our system are not so far from a natural language proof of the hypothesis (or its negation) from the premises. This would not be possible from a tableau proof.

Conclusion and future work
We have provided a generic algorithm for natural language inference based on polarized parses and natural logic rules. The algorithm requires parsed output, but it does not require translation into a logical form, or special alignment steps. It would be possible to extend our system in either or both directions.
For future work, we are currently tuning our system on other sections of the FraCaS dataset (especially Sections 5 and 6) and the larger SICK dataset (Marelli et al., 2014). Another line of work is to incorporate into our system some "light" representation of the sentences, e.g. dependency parses, abstract meaning representation, that allows for more flexible syntactic variation than the current string-for-string match strategy.