HELP: A Dataset for Identifying Shortcomings of Neural Models in Monotonicity Reasoning

Large crowdsourced datasets are widely used for training and evaluating neural models on natural language inference (NLI). Despite these efforts, neural models have a hard time capturing logical inferences, including those licensed by phrase replacements, so-called monotonicity reasoning. Since no large dataset has been developed for monotonicity reasoning, it is still unclear whether the main obstacle is the size of datasets or the model architectures themselves. To investigate this issue, we introduce a new dataset, called HELP, for handling entailments with lexical and logical phenomena. We add it to the training data for state-of-the-art neural models and evaluate them on test sets for monotonicity phenomena. The results show that our data augmentation improves the overall accuracy. We also find that the improvement is greater on monotonicity inferences with lexical replacements than on downward inferences with disjunction and modification. This suggests that some types of inferences can be improved by our data augmentation while others are immune to it.


Introduction
Natural language inference (NLI) has been proposed as a benchmark task for natural language understanding.
This task is to determine whether a given statement (premise) semantically entails another statement (hypothesis) (Dagan et al., 2013).
Large crowdsourced datasets such as SNLI (Bowman et al., 2015a) and MultiNLI (Williams et al., 2018) have been created from naturally-occurring texts for training and testing neural models on NLI. Recent reports showed that these crowdsourced datasets contain undesired biases that allow prediction of entailment labels only from hypothesis sentences (Gururangan et al., 2018; Poliak et al., 2018b; Tsuchiya, 2018). Moreover, these standard datasets come with the so-called upward monotonicity inferences (see Table 1), i.e., inferences from subsets to supersets (changes in personal values ⊑ changes in values), but they rarely come with downward monotonicity inferences, i.e., inferences from supersets to subsets (commissioners ⊒ female commissioners). Downward monotonicity inferences are interesting in that they allow a phrase to be replaced with a more specific one, so the resulting sentence can become longer, yet the inference remains valid.

Table 1
Upward: Some changes in personal values are simply part of growing older (MultiNLI) ⇒ Some changes in values are a part of growing older
Downward: At most ten commissioners spend time at home (FraCaS) ⇒ At most ten female commissioners spend time at home

FraCaS (Cooper et al., 1994) contains such logically challenging problems as downward inferences. However, it is too small (only 346 examples) for training neural models, and it covers only simple syntactic patterns with severely restricted vocabularies. The lack of such a dataset on a large scale is due to at least two factors: it is hard to instruct crowd workers without deep knowledge of natural language syntax and semantics, and it is also infeasible to employ experts to obtain a large number of logically challenging inferences.
Bowman et al. (2015b) proposed an artificial dataset for logical reasoning, whose premise and hypothesis are automatically generated from a simple English-like grammar. Following this line of work, Geiger et al. (2018) presented a method to construct a complex dataset for multiple quantifiers (e.g., Every dwarf licks no rifle ⇒ No ugly dwarf licks some rifle). These datasets contain downward inferences, but they are designed not to require lexical knowledge. There are also NLI datasets which expand lexical knowledge by replacing words using lexical rules (Monz and de Rijke, 2001; Glockner et al., 2018; Naik et al., 2018; Poliak et al., 2018a).
In these works, however, little attention has been paid to downward inferences.
The GLUE leaderboard (Wang et al., 2019) reported that neural models did not perform well on downward inferences. This leaves open whether the obstacle to neural models' understanding of such inferences is the lack of large datasets covering the interaction between lexical and logical inferences.
To shed light on this problem, this paper makes the following three contributions: (a) providing a method to create a large NLI dataset that embodies the combination of lexical and logical inferences focusing on monotonicity (i.e., phrase replacement-based reasoning) (Section 3), (b) measuring to what extent the new dataset helps neural models to learn monotonicity inferences, and (c) by analyzing the results, revealing which types of logical inferences are solved with our training data augmentation and which ones are immune to it (Section 4.2).

Monotonicity Reasoning
Monotonicity reasoning is a sort of reasoning based on word replacement. Based on the monotonicity properties of words, it determines whether a certain word replacement results in a sentence entailed by the original one (van Benthem, 1983; Icard and Moss, 2014). A polarity is a characteristic of a word position imposed by monotone operators. Replacements with more general (or specific) phrases in ↑ (or ↓) polarity positions license entailment. Polarities are determined by a function which is always upward monotone (+) (i.e., an order preserving function that licenses entailment from specific to general phrases), always downward monotone (−) (i.e., an order reversing function), or neither (non-monotone).
Determiners are modeled as binary operators, taking noun and verb phrases as the first and second arguments, respectively, and they entail sentences with their arguments narrowed or broadened according to their monotonicity properties. For example, the determiner some is upward monotone both in its first and second arguments, and the concepts can be broadened by replacing a word with its hypernym (people ⊒ boy), removing modifiers (dancing ⊒ happily dancing), or adding disjunction. The concepts can be narrowed by replacing a word with its hyponym (schoolboy ⊑ boy), adding modifiers, or adding conjunction.
If a sentence contains negation, the polarity of words over the scope of negation is reversed. If the propositional object is embedded in another negative or conditional context, the polarity of words over its scope can be reversed again: [the party might be canceled]. In this way, the polarity of words is determined by monotonicity operators and syntactic structures.
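The polarity rules described in this section can be illustrated with a minimal Python sketch. The determiner table below lists standard textbook monotonicity signatures and is an assumption for illustration, not a resource from the paper; the names DETERMINER_MONOTONICITY, compose, and licenses_entailment are hypothetical.

```python
# Monotonicity signatures (first argument, second argument):
# +1 = upward monotone, -1 = downward monotone, 0 = non-monotone.
# Illustrative textbook entries, not a table taken from the paper.
DETERMINER_MONOTONICITY = {
    "some": (+1, +1),
    "every": (-1, +1),
    "no": (-1, -1),
    "exactly two": (0, 0),
}


def compose(*signs):
    """Polarity of a position: product of the signs of all operators above it.

    Each downward operator (e.g., negation) flips the polarity, and a
    non-monotone operator (0) blocks monotonicity inferences entirely.
    """
    polarity = 1
    for sign in signs:
        polarity *= sign
    return polarity


def licenses_entailment(polarity, replacement):
    """Return True if the replacement ('broaden' or 'narrow') is licensed."""
    if polarity > 0:
        return replacement == "broaden"
    if polarity < 0:
        return replacement == "narrow"
    return False
```

For instance, the first argument of every is downward, so narrowing is licensed there; embedding it under negation gives compose(-1, -1) == 1, so broadening becomes licensed instead.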

Data Creation
We address three issues when creating the inference problems: (a) Detect the monotone operators and their arguments; (b) Based on the syntactic structure, induce the polarity of the argument positions; (c) Using lexical knowledge or logical connectives, narrow or broaden the arguments.

Source corpus
We use sentences from the Parallel Meaning Bank (PMB) as a source while creating the inference dataset. The reason behind choosing the PMB is threefold. First, the fine-grained annotations in the PMB facilitate our automatic monotonicity-driven construction of inference problems. In particular, semantic tokenization and WordNet (Fellbaum, 1998) senses make narrow and broad concept substitutions easy, while the syntactic analyses in Combinatory Categorial Grammar (CCG, Steedman, 2000) format and semantic tags contribute to monotonicity and polarity detection. Second, the PMB contains lexically and syntactically diverse texts from a wide range of genres. Third, the gold (silver) documents are fully (partially) manually verified, which controls noise in the automatically generated dataset. To prevent easy inferences, we use the sentences with more than five tokens from 5K gold and 5K silver portions of the PMB.

Methodology
[Figure 1. Step 1: Select a sentence using semantic tags from the PMB, e.g., All/AND kids/CON were/PST dancing/EXG on/REL the/DEF floor/CON. Step 2: Detect the polarity of constituents via CCG analysis.]

Figure 1 illustrates the method of creating the HELP dataset. We use declarative sentences from the PMB containing monotone operators, conjunction, or disjunction as a source (Step 1). These target words can be identified by their semantic tags: AND (all, every, each, and), DIS (some, several, or), NEG (no, not, neither, without), DEF (both), QUV (many, few), and IMP (if, when, unless).
In Step 2, after locating the first (NP) and the second (VP) arguments of the monotone operator via a CCG derivation, we detect their polarities with the possibility of reversing a polarity if an argument appears in a downward environment.
In Step 3, to broaden or narrow the first and the second arguments, we consider two types of operations: (i) lexical replacement, i.e., substituting the argument with its hypernym/hyponym (e.g., H1) and (ii) syntactic elimination, i.e., dropping a modifier or a conjunction/disjunction phrase in the argument (e.g., H2). Given the polarity of the argument position (↑ or ↓) and the type of replacement (with more general or specific phrases), the gold label (entailment or neutral) of a premise-hypothesis pair is automatically determined; e.g., both (P, H1) and (P, H2) in Step 3 are assigned entailment. For lexical replacements, we use WordNet senses from the PMB and their ISA relations with the same part-of-speech to control naturalness of the obtained sentence. To compensate for missing word senses from the silver documents, we use the Lesk algorithm (Lesk, 1986).
In Step 4, by swapping the premise and the hypothesis, we create another inference pair and assign its gold label; e.g., (P′1, H′) and (P′2, H′) are created and assigned neutral. By swapping a sentence pair created by syntactic elimination, we can create a pair such as (P′2, H′) in which the hypothesis is more specific than the premise.
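The label assignment in Step 3 and the swap in Step 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names Pair, gold_label, and make_pairs are hypothetical, the replacement is a plain string substitution rather than a CCG-guided one, and the swapped pair's label is derived by reversing the replacement direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    premise: str
    hypothesis: str
    label: str  # "entailment" or "neutral"


def gold_label(polarity: str, direction: str) -> str:
    """polarity: 'up', 'down', or 'non'; direction: 'general' or 'specific'.

    Entailment holds when a more general phrase is used in an upward
    position or a more specific phrase in a downward position.
    """
    if (polarity, direction) in {("up", "general"), ("down", "specific")}:
        return "entailment"
    return "neutral"


def make_pairs(premise, target, replacement, polarity, direction):
    """Build the Step 3 pair and its Step 4 swapped counterpart."""
    # Step 3: replace the first occurrence of the target phrase.
    hypothesis = premise.replace(target, replacement, 1)
    forward = Pair(premise, hypothesis, gold_label(polarity, direction))
    # Step 4: swapping premise and hypothesis reverses the direction.
    opposite = "specific" if direction == "general" else "general"
    backward = Pair(hypothesis, premise, gold_label(polarity, opposite))
    return forward, backward


# Hypothetical example: 'dancing' sits in an upward position and is
# replaced with the more general 'moving'.
fwd, bwd = make_pairs(
    "All kids were dancing on the floor", "dancing", "moving", "up", "general"
)
```

Here fwd is labeled entailment and bwd neutral, matching the labels described for Steps 3 and 4.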

The HELP dataset
The resulting dataset has 36K inference pairs consisting of upward monotone, downward monotone, non-monotone, conjunction, and disjunction. Table 2 shows some examples. The number of vocabulary items is 15K. We manually checked the naturalness of 500 randomly sampled sentence pairs, of which 146 (29.2%) were unnatural. As mentioned in previous work (Glockner et al., 2018), there are some cases where using WordNet for substitution leads to unnatural sentences due to context mismatch; e.g., P: You have no driving happening ⇒ H: You have no driving experience, where P is obtained from H by replacing experience with its hypernym happening. Since our intention is to explore possible ways to augment training data for monotonicity reasoning, we include these cases in the training dataset.

Experiments
We use HELP as additional training material for three neural models for NLI and evaluate them on test sets dealing with monotonicity reasoning.
Training data We used three different training sets and compared their performance: MultiNLI (392K), MultiNLI+MQ (the dataset for multiple quantifiers introduced in Section 1; Geiger et al., 2018) (892K), and MultiNLI+HELP (429K).

Results and discussion
Table 3 shows that adding HELP to MultiNLI improved the accuracy of all models on GLUE, FraCaS, and SICK. Regarding MultiNLI, note that adding data for downward inference can be harmful for performing upward inference, because lexical replacements work in an opposite way in downward environments. However, our data augmentation minimized the decrease in performance on MultiNLI. This suggests that models managed to learn the relationships between downward operators and their arguments from HELP.
The improvement in accuracy is greater with HELP than with MQ, even though HELP is much smaller than MQ. MQ does not deal with lexical replacements, and thus the improvement is not stable. This indicates that the improvement comes from carefully controlling the target reasoning of the training set rather than from its size. ESIM showed a greater improvement in accuracy compared with the other models when we added HELP. This result arguably supports the finding in Bowman et al. (2015b) that a tree architecture is better for learning some logical inferences. Regarding the evaluation on SICK, Talman and Chatzikyriakidis (2018) reported a drop in accuracy of 40-50% when BiLSTM and ESIM were trained on MultiNLI because SICK is out of the domain of MultiNLI. Indeed, the accuracy of each model, including BERT, was low at 40-60%.
When compared among linguistic phenomena, the improvement from adding HELP was greater for upward and downward monotone inferences. In particular, all models except those trained with HELP failed to answer 68 problems for monotonicity inferences with lexical replacements. This indicates that such inferences can be improved by adding HELP.
The improvement for disjunction was smaller than for other phenomena. To investigate this, we conducted error analysis on 68 problems of GLUE and FraCaS, which all the models misclassified. Of these, 44 are neutral problems in which all words in the hypothesis occur in the premise (e.g., He is either in London or in Paris ⇏ He is in London), and 13 are entailment problems in which the hypothesis contains a word or a phrase not occurring in the premise (e.g., I don't want to have to keep entertaining people ⇒ I don't want to have to keep entertaining people who don't value my time). These problems contain disjunction or modifiers in downward environments where either (i) the premise P contains all words in the hypothesis H yet the inference is invalid or (ii) H contains more words than those in P yet the inference is valid. Although HELP contains 21K such problems, the models nevertheless misclassified them. This indicates that the difficulty in learning these non-lexical downward inferences might not come from the lack of training datasets.
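The two word-overlap patterns (i) and (ii) in this error analysis can be checked mechanically. The following Python sketch is one possible way to do so, assuming a simple lowercase regular-expression tokenizer; the function names are hypothetical and this is not the procedure used in the paper.

```python
import re


def tokens(sentence):
    """Lowercased word set; a deliberately simple tokenizer (an assumption)."""
    return set(re.findall(r"[a-z']+", sentence.lower()))


def overlap_pattern(premise, hypothesis):
    """Classify a pair by whether the hypothesis introduces new words."""
    if tokens(hypothesis) <= tokens(premise):
        return "hypothesis_words_in_premise"
    return "hypothesis_has_new_words"


def is_counterintuitive(premise, hypothesis, gold):
    """True for pattern (i): full overlap but neutral,
    or pattern (ii): new words in the hypothesis but entailment."""
    pattern = overlap_pattern(premise, hypothesis)
    if pattern == "hypothesis_words_in_premise":
        return gold == "neutral"
    return gold == "entailment"


case_i = is_counterintuitive(
    "He is either in London or in Paris", "He is in London", "neutral"
)
case_ii = is_counterintuitive(
    "I don't want to have to keep entertaining people",
    "I don't want to have to keep entertaining people who don't value my time",
    "entailment",
)
```

Both examples from the text are flagged, which is consistent with the observation that surface word overlap points to the wrong label in these cases.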

Conclusion and Future Work
We introduced a monotonicity-driven NLI data augmentation method. The experiments showed that neural models trained on HELP obtained higher overall accuracy. However, the improvement tended to be small on downward monotone inferences with disjunction and modification, which suggests that some types of inferences can be improved by adding data while others might require a different kind of help.
For future work, our data augmentation can be used for multilingual corpora. Since the PMB annotations sufficed for creating HELP, applying our method to the non-English PMB documents seems straightforward. Additionally, it would be interesting to verify the quality and contribution of a dataset created by applying our method to an automatically annotated and parsed corpus.