Exploring End-to-End Differentiable Natural Logic Modeling

We explore end-to-end trained differentiable models that integrate natural logic with neural networks, aiming to keep the backbone of natural language reasoning based on the natural logic formalism while introducing subsymbolic vector representations and neural components. The proposed model adapts module networks to model natural logic operations and is enhanced with a memory component to model contextual information. Experiments show that the proposed framework models monotonicity-based reasoning more effectively than baseline neural network models without a built-in inductive bias for monotonicity-based reasoning. The proposed model is also shown to be robust when transferred from upward to downward inference. Further analyses of the model's aggregation performance show that the proposed subcomponents help achieve better intermediate aggregation performance.


Introduction
A recent research trend has attempted to make further progress on the long-standing problem of bringing together the complementary strengths of neural networks and symbolic models, e.g., the research performed in (Garcez et al., 2015; Yang et al., 2017; Rocktäschel and Riedel, 2017; Evans and Grefenstette, 2018; Weber et al., 2019; De Raedt et al., 2019; Mao et al., 2019), among others. It is known that neural models can approximate complex functions and are robust to noise and ambiguity, while symbolic models often render superior explainability and interpretability but are brittle and prone to fail in the presence of noise and uncertainty.
The majority of research efforts are based on some abstract logical forms such as first-order logic (FOL) or its fragments. For natural language, obtaining such a representation is known to face many thorny challenges. Natural logic instead aims to sidestep some of these challenges by performing inference over the surface forms of text based on monotonicity or projectivity (Van Benthem, 1986; Valencia, 1991; MacCartney and Manning, 2009; Icard and Moss, 2014), and has been applied to tasks such as natural language inference (MacCartney and Manning, 2009; Angeli and Manning, 2014) and question answering (Angeli et al., 2016).
In this work we explore differentiable natural logic models that integrate natural logic with neural networks, with the aim of keeping the backbone of inference based on the natural logic formalism while introducing subsymbolic vector representations and neural components into the framework. Two basic problems follow directly from this objective: 1) how (and where) to leverage the strength of neural networks in the natural logic formalism; and 2) how to alleviate the lack of intermediate supervision for training sub-components, which may lead to the spurious problem (Guu et al., 2017; Min et al., 2019) in end-to-end training.
We explore a framework in which module networks (Andreas et al., 2016; Gupta et al., 2020) are leveraged to model the natural logic operations and are enhanced with a memory component to capture contextual information. At the lexical and local relation learning layers, we constrain the network to predict the seven natural logic relations. The entire model is differentiable and trained end-to-end.
We evaluate and analyze the proposed model on the monotonicity subset of Semantic Fragments (Richardson et al., 2020), HELP (Yanaka et al., 2019b), and MED (Yanaka et al., 2019a). We also extend MED to generate a dataset that helps evaluate 2-hop inference. The model can effectively learn natural logic operations in the end-to-end training paradigm.

Related Work

Neural Symbolic Models
A growing number of research efforts have recently revisited the long-standing problem of bringing together the complementary advantages of neural networks and symbolic methods. At least two approaches have received intensive attention. One uses symbolic constraints as regularizers to equip neural models with the corresponding inductive bias (Demeester et al., 2016; Diligenti et al., 2017; Donadello et al., 2017; Xu et al., 2018; Li and Srikumar, 2019). The other develops differentiable end-to-end trained frameworks based on symbolic models. For example, the work in (Rocktäschel and Riedel, 2017; Weber et al., 2019; Minervini et al., 2020) proposes a differentiable backward-chaining algorithm, and Dong et al. (2019) adopt probabilistic tensor representations for logic predicates and mimic the forward-chaining proof. Evans and Grefenstette (2018) treat inductive logic programming as a satisfiability problem, and Manhaeve et al. (2018) combine high-level symbolic reasoning with low-level neural perception models. The second approach is of more interest to us for exploring powerful reasoning models with built-in explainability. Unlike the existing work based on abstract logical forms, this paper explores the integration of neural networks with natural logic.

Natural Logic
Natural logic (Lakoff, 1970; van Benthem, 1988; Valencia, 1991; Van Benthem, 1995; Nairn et al., 2006; MacCartney, 2009; Icard, 2012; Angeli et al., 2016) has a long history that is traceable to the syllogisms of Aristotle. It aims to model a subset of logical inferences by operating directly on the surface form and structure of language, based on monotonicity or projectivity (Van Benthem, 1986; Valencia, 1991; MacCartney and Manning, 2009; Icard and Moss, 2014), rather than by deduction over abstract forms such as first-order logic (FOL) or its fragments; it is well known that deriving logical forms for natural language is a very challenging task.
In natural language processing, the framework proposed in (MacCartney and Manning, 2008; MacCartney and Manning, 2009) extends monotonicity-based models (van Benthem, 1988; Valencia, 1991) to incorporate semantic exclusion and unifies them to consider implicatives (Nairn et al., 2006). It is a state-of-the-art natural logic formalism that has been used for multiple NLP tasks (MacCartney, 2009; Angeli and Manning, 2014). In this work we explore neural natural logic based on this formalism. We briefly review the background in Section 3.

Background
This section briefly reviews the natural logic formalism (MacCartney and Manning, 2009) that our work is based on. For more details, we refer readers to (MacCartney and Manning, 2008; MacCartney and Manning, 2009; MacCartney, 2009; Angeli et al., 2016).
Monotonicity is a pervasive feature of natural language and an essential concept in natural logic (Van Benthem, 1986; Valencia, 1991; MacCartney and Manning, 2009; Icard and Moss, 2014). Similar to monotone functions in calculus, an upward monotone context in natural language keeps the entailment relation when the argument "increases" (e.g., some cats are playing ⊏ some animals are playing, where cats is replaced by its hypernym animals). A downward monotone context keeps the entailment relation when the argument "decreases" (e.g., all animals are playing ⊏ all cats are playing, where animals is replaced by its hyponym cats).
To extend monotonicity to consider exclusion, MacCartney and Manning (2009) investigate all sixteen equivalence classes of set relations and remove nine degenerate, semantically vacuous relations, thereby defining a seven-relation set B = {≡, ⊏, ⊐, ∧, |, ⌣, #} for natural logic, as shown in Table 1.
From a high-level perspective, the natural logic proof system proposed by MacCartney and Manning (2009) consists of the following steps. First, the alignment between two text spans (often two sentences) is obtained, and then lexical relation recognition is performed for the aligned pairs of words. Consider a simplified example: a premise All animals outside are eating and a corresponding hypothesis All cats outside are playing, as shown in Figure 1. Each pair of aligned words is assigned one of the relations in Table 1, e.g., animals ⊐ cats and eating | playing.
Projection ρ : B → B is then performed according to the projectivity in the specific context. The projection operation has been implemented in the Stanford natlog parser. For a given sentence, natlog can output the projections at each word position. For example, Table 2 summarizes the projections in the context of the quantifiers all, some, and no. Specifically, consider the example discussed in the last paragraph: as animals and cats occur in the first argument of the quantifier all, according to the projectivity in Table 2, the reverse entailment relation (animals ⊐ cats) is projected to forward entailment (animals ⊏ cats) in this specific context. As another example, since eating and playing occur in the second argument of all, the alternation relation (eating | playing) is projected to alternation (eating | playing).
Built on this, relation aggregation is performed to aggregate multiple projected local relations, according to Table 3, to determine the global relation between the sentence pair. In our example, two projected relations, forward entailment (⊏) and alternation (|), are aggregated to yield alternation (|); i.e., we obtain All animals outside are eating | All cats outside are playing. The seven natural logic relations at the sentence level can be used to determine NLI relations. For example, if NLI is defined as a three-way classification problem (entailment, contradiction, and neutral), the '≡' or '⊏' relation is mapped to entailment, the '∧' or '|' relation is mapped to contradiction, and '⊐', '⌣', or '#' is mapped to neutral.
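The three symbolic steps above (lexical relation, projection, aggregation) can be sketched on the running example. This is a minimal illustration, not the Stanford natlog implementation: the function and table names are hypothetical, the projection entries cover only the quantifier all, and the join table lists only the entries needed for this example, so any other combination raises a `KeyError` rather than guessing.

```python
# Relation symbols follow MacCartney and Manning (2009).
EQ, FWD, REV, NEG, ALT, COV, IND = "≡", "⊏", "⊐", "∧", "|", "⌣", "#"

# Projection for the quantifier "all" (partial; hypothetical helper table).
# First argument is downward monotone, second argument is upward monotone.
PROJECT_ALL = {
    1: {EQ: EQ, FWD: REV, REV: FWD},
    2: {EQ: EQ, FWD: FWD, REV: REV, ALT: ALT, NEG: ALT},
}

# Partial join table: only the non-trivial entries this example needs.
JOIN = {(FWD, FWD): FWD, (REV, REV): REV, (FWD, ALT): ALT}

# Three-way NLI mapping described in the text.
NLI_LABEL = {EQ: "entailment", FWD: "entailment",
             NEG: "contradiction", ALT: "contradiction",
             REV: "neutral", COV: "neutral", IND: "neutral"}


def join(u, v):
    """Aggregate two relations; equivalence acts as the identity."""
    if u == EQ:
        return v
    if v == EQ:
        return u
    return JOIN[(u, v)]


def infer(aligned):
    """aligned: list of (lexical relation, argument position of 'all' or None)."""
    state, trajectory = EQ, []
    for relation, argument in aligned:
        projected = relation if argument is None else PROJECT_ALL[argument][relation]
        state = join(state, projected)
        trajectory.append(state)
    return trajectory, NLI_LABEL[state]


# All/All, animals/cats, outside/outside, are/are, eating/playing
EXAMPLE = [(EQ, None), (REV, 1), (EQ, 1), (EQ, 2), (ALT, 2)]
trajectory, label = infer(EXAMPLE)
```

On this input the trajectory is ≡, ⊏, ⊏, ⊏, | and the final label is contradiction, matching the worked example in the text.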

Neural Natural Logic Model
We present a differentiable framework in which natural logic is integrated with neural networks. The overall architecture of the model is shown in Figure 1. At the core of the framework are natural logic operations modeled with memory-enhanced module networks, which are trained end-to-end. In the training objective, y is the output, which in natural language inference is the label of the relation between a premise and a hypothesis sentence (e.g., entailment, contradiction, or neutral) and which can be a different label in other tasks. The input $X = \langle X^p, X^h \rangle$ comprises a premise sentence $X^p$ and a hypothesis sentence $X^h$. We use $z = \{z_1, z_2, \ldots, z_n\}$ to denote a sequence of latent variables corresponding to the output of natural logic aggregation at each time step, where n is the number of latent variables. The term $Z$ denotes the space of all possible trajectories, and $z \in Z$. Specifically, for the example in Figure 1, if we perform the aggregation from left to right, $z_1$ = '≡', $z_2 = z_3 = z_4$ = '⊏', and $z_5$ = '|' is a trajectory z that proves the contradiction label. Note that $z_i \in B$, where B is the set of seven relations listed in Table 1.

Encoding and Alignment
Recent research has shown the effectiveness of distributed representations for encoding lexicons and their semantic relations. We use word embeddings and neural networks to learn lexical representations that capture natural logic related semantics. Let $X^p = \{x^p_1, x^p_2, \ldots, x^p_m\}$ be a premise sentence and $X^h = \{x^h_1, x^h_2, \ldots, x^h_n\}$ the corresponding hypothesis sentence, where m and n are the numbers of word tokens in the premise and hypothesis, respectively. Each sentence is fed into a multi-layer BiLSTM, for which $a_i = \mathrm{BiLSTM}(X^p, i)$ denotes the i-th hidden vector at the top layer of the BiLSTM, encoding the i-th token and its context in the premise. Similarly, we use $b_j = \mathrm{BiLSTM}(X^h, j)$ to denote the hidden vector at the j-th position at the top layer of the BiLSTM that encodes the hypothesis. In this paper, we focus on understanding neural natural logic itself, without being further confounded by different ways of exploiting knowledge external to the training data, e.g., via pretraining.
Many models can be used to capture cross-sentence attention. Among models that rely only on the training data, the approach proposed in (Chen et al., 2017b) has been widely used in the NLI literature as a baseline. We follow this work to compute the cross-sentence attention weight $e_{ij} = a_i^\top b_j$ for each pair $\langle a_i, b_j \rangle$. Specifically, for each $b_j$ in the hypothesis, the corresponding content in the premise is summarized as the attention-weighted sum $\tilde{b}_j = \sum_{i=1}^{m} \frac{\exp(e_{ij})}{\sum_{k=1}^{m} \exp(e_{kj})} a_i$, which is used together with $b_j$ to learn local lexical-level inference relations (refer to (Chen et al., 2017b) for more details).
In addition, we compute a hard alignment indicator $\phi_j$, where $\phi_j = 1$ if and only if $x^p_{i^*} = x^h_j$, with $i^* = \arg\max_{i \in \{1, \ldots, m\}} e_{ij}$. That is, for each word token $x^h_j$ in the hypothesis, we record the token $x^p_{i^*}$ in the premise that has the maximum attention value $e_{ij}$. If the word tokens $x^p_{i^*}$ and $x^h_j$ are the same word type, we let $\phi_j = 1$, which is used to help reduce the search space in aggregation.
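The soft alignment and the hard alignment indicator described above can be sketched in plain Python. This is an illustrative sketch only: the hidden vectors are toy lists rather than BiLSTM outputs, and the function names `soft_align` and `hard_alignment` are hypothetical.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def soft_align(a, b):
    """a: premise hidden vectors (m x d); b: hypothesis hidden vectors (n x d).

    Returns the attention-weighted premise summary for every hypothesis
    position and the index i* of the premise token with maximal attention.
    """
    summaries, best_indices = [], []
    for b_j in b:
        e = [dot(a_i, b_j) for a_i in a]           # e_ij for all i
        top = max(e)
        exps = [math.exp(x - top) for x in e]      # numerically stable softmax
        total = sum(exps)
        weights = [x / total for x in exps]
        dim = len(a[0])
        summaries.append([sum(w * a_i[k] for w, a_i in zip(weights, a))
                          for k in range(dim)])
        best_indices.append(max(range(len(a)), key=lambda i: e[i]))
    return summaries, best_indices


def hard_alignment(premise_tokens, hypothesis_tokens, best_indices):
    """phi_j = 1 iff the most-attended premise token equals hypothesis token j."""
    return [int(premise_tokens[i] == h)
            for h, i in zip(hypothesis_tokens, best_indices)]


# Toy vectors standing in for BiLSTM states (assumption for illustration).
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[5.0, 0.0], [0.0, 5.0]]
summaries, best = soft_align(a, b)
phi = hard_alignment(["all", "animals"], ["all", "cats"], best)
```

Here "all" aligns to the identical premise token and gets an indicator of 1, while "cats" aligns to "animals" and gets 0.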

Learning Local Natural Logic Relation
Given a sequence of aligned pairs $\{\langle \tilde{b}_1, b_1 \rangle, \ldots, \langle \tilde{b}_j, b_j \rangle, \ldots, \langle \tilde{b}_n, b_n \rangle\}$, we use a bi-linear model to compute each pair's probability distribution $p_j$ over the natural logic relations B. In the scoring function $f_s$, each relation type $k \in B$ has its own weight matrix $M_k \in \mathbb{R}^{d \times d}$, which is a slice of the tensor $M \in \mathbb{R}^{d \times d \times |B|}$, where d is the dimensionality of $b_j$ or $\tilde{b}_j$; the score of relation k is the bilinear form of $\tilde{b}_j$ and $b_j$ under $M_k$. We use softmax to normalize the scores into a distribution over B. Among the several alternatives we tried, the bi-linear model achieves the best performance on the development set, and we use it in our final framework.
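A bilinear relation scorer of the kind described above can be sketched as follows. The exact form of the scoring function is not given in the text, so this assumes the plain bilinear form (premise summary, relation matrix, hypothesis vector) without a bias term; the function names are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def bilinear_relation_distribution(b_tilde_j, b_j, M):
    """M is a list of |B| matrices of size d x d, one per relation.

    Returns a probability distribution over the relations for one aligned pair.
    """
    d = len(b_j)
    scores = []
    for M_k in M:
        # Bilinear form: b_tilde_j^T M_k b_j (assumed form, no bias term).
        score = sum(b_tilde_j[r] * M_k[r][c] * b_j[c]
                    for r in range(d) for c in range(d))
        scores.append(score)
    return softmax(scores)


# Toy parameters: relation k uses k times the identity matrix.
M = [[[float(k), 0.0], [0.0, float(k)]] for k in range(7)]
p_j = bilinear_relation_distribution([1.0, 0.0], [1.0, 0.0], M)
```

With these toy matrices the raw scores are 0 through 6, so the last relation receives the largest probability.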

Local Relation Constraints
As in many other weakly supervised setups, we do not have direct supervision signals to learn logic relations at the lexical level; instead, the supervision signals are backpropagated from the overall sentence-level NLI errors. To reduce the search space and alleviate the spurious problem (Guu et al., 2017), in which incorrect local inference relations and aggregation produce correct sentence-level NLI labels, we adopt the following strategies.
Symmetric Inference Parameter Sharing: We make the forward entailment (⊏) and reverse entailment (⊐) relations share the same parameters. Specifically, to compute the reverse entailment score in $p_j$, we reverse the order of $\langle \tilde{b}_j, b_j \rangle$ and reuse $M_{\sqsubset}$ in the scoring function, where $M_{\sqsubset}$ is the matrix in M that corresponds to the forward entailment (⊏) relation.
Equivalence Constraint: A token pair is assigned the equivalence relation (≡) if the indicator $\phi_j$ computed in the alignment stage takes the value of 1.

Figure 2: A memory-enhanced module network for natural logic aggregation.

Collapse Constraints: We suppress the relations negation (∧) and cover (⌣). Inspired by Angeli and Manning (2014), we suppress the negation relation (∧) because its behavior is almost the same as that of alternation (|) in natural logic aggregation, as shown in Table 3; this avoids the co-linearity problem when training on datasets without double-negation samples. We also suppress the cover relation (⌣) because it is extremely rare in current natural language inference datasets.

Projected Distribution
With the predicted seven-dimensional probability vector $p_j$ ready, our model uses a projection operator ρ to re-organize the distribution according to the projectivity of the corresponding input hypothesis word at position j. Unlike the discrete "hard" projection used in conventional natural logic, e.g., projecting reverse entailment to forward entailment in the first argument of all, we apply a "soft" projection over the relation probability distribution. Specifically, based on the projection table (Table 2), we convert the original probability distribution $p_j$ to the projected distribution $\bar{p}_j$: $\bar{p}_j^{k'} = \sum_{k \in B} \mathbb{1}(\rho_j(k) = k')\, p_j^{k}$, where $\mathbb{1}(\cdot)$ is the indicator function, k is the original relation, and k' is the projected relation. Consider the pair of sentences in Figure 1 and suppose the pair eating vs. playing has a probability of 0.8 of being alternation (|) and 0.1 of being negation (∧). According to the projectivity of the second argument of the quantifier all in Table 2, both relations are projected to alternation (|): ρ_playing(|) = ρ_playing(∧) = |. So after projection, $\bar{p}_5^{|} = p_5^{|} + p_5^{\wedge} = 0.9$, where the subscript 5 is the index of the word token playing in the hypothesis.
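The soft projection amounts to moving each relation's probability mass to the relation it projects to. The sketch below reproduces the eating vs. playing example; the remaining 0.1 of probability mass is placed on independence (#) purely as an illustrative assumption, and the projection map lists only the entries needed here.

```python
RELATIONS = ["≡", "⊏", "⊐", "∧", "|", "⌣", "#"]


def soft_project(p, rho):
    """p: {relation: probability}; rho: {original relation: projected relation}."""
    projected = {k: 0.0 for k in RELATIONS}
    for k, mass in p.items():
        projected[rho[k]] += mass      # sum the mass of all k with rho(k) = k'
    return projected


# Second argument of "all": alternation and negation both project to
# alternation (from the text); independence stays independence.
rho_playing = {"|": "|", "∧": "|", "#": "#"}
p_5 = {"|": 0.8, "∧": 0.1, "#": 0.1}   # the 0.1 on "#" is an assumed filler
p_bar_5 = soft_project(p_5, rho_playing)
```

After projection the alternation entry carries 0.8 + 0.1 = 0.9 and the negation entry is empty, as in the worked example.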

Aggregation
We propose to leverage module networks (Andreas et al., 2016; Gupta et al., 2020) to perform neural natural logic aggregation, enhanced by a memory network component that models context. Figure 2 shows the proposed neural natural logic aggregation network. The right part of the figure is the aggregation module network and the left is the memory network component. Specifically, at each time step j, our aggregation algorithm computes a distribution $p(z_j \mid X) = \mathrm{softmax}(s_j)$, where $s_j = \{s_j^k\}$ is a set of logits and $s_j^k$ is the one corresponding to $p(z_j = k \mid X)$ for relation $k \in B$. Our model computes $s_j^k$ with Equation 7.
At time step 1, $s_1$ is initialized with $\bar{p}_1$. At any other time step j > 1, we invoke modules $G_{u \bowtie v}(\cdot)$ to derive $s_j$. Specifically, in our network each relation aggregation in Table 3, i.e., $u \bowtie v$ ($u, v \in B$), has its own module $G_{u \bowtie v}(\cdot)$. Given the previous $s_{j-1}$ and the current projected local relation distribution $\bar{p}_j$, $s_j$ is computed by marginalizing the Cartesian product $s_{j-1} \cdot \bar{p}_j^\top$ according to the aggregation table (Table 3). More specifically, we first compute the Cartesian product $s_{j-1} \cdot \bar{p}_j^\top$, each entry of which is weighted by the memory-dependent term $g_{u \bowtie v}(o_j)$. Then the outputs of all modules whose output relation is k according to Table 3 are summed: $s_j^k = \sum_{u, v \in B} \mathbb{1}(u \bowtie v = k)\, g_{u \bowtie v}(o_j)\, s_{j-1}^{u}\, \bar{p}_j^{v}$ (Equation 7), where $\mathbb{1}(\cdot)$ is the indicator function.
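One aggregation step of this kind can be sketched as a gated marginalization over the join table. The join table below is transcribed from the standard deterministic table of MacCartney and Manning (2009) and should be checked against Table 3; the memory-dependent weight is replaced by an optional dictionary of gates that defaults to 1.0, which is a simplification, and all names are hypothetical.

```python
RELATIONS = ["≡", "⊏", "⊐", "∧", "|", "⌣", "#"]

# JOIN_ROWS[u][column of v] is the relation produced by aggregating u with v.
JOIN_ROWS = {
    "≡": "≡⊏⊐∧|⌣#",
    "⊏": "⊏⊏#||##",
    "⊐": "⊐#⊐⌣#⌣#",
    "∧": "∧⌣|≡⊐⊏#",
    "|": "|#|⊏#⊏#",
    "⌣": "⌣⌣#⊐⊐##",
    "#": "#######",
}


def dist(mass):
    """Build a full relation vector from a sparse {relation: value} dict."""
    return {k: mass.get(k, 0.0) for k in RELATIONS}


def aggregate_step(s_prev, p_bar, gate=None):
    """Sum gate * s_prev[u] * p_bar[v] over all (u, v) that join to relation k."""
    s = {k: 0.0 for k in RELATIONS}
    for u in RELATIONS:
        for col, v in enumerate(RELATIONS):
            k = JOIN_ROWS[u][col]
            g = 1.0 if gate is None else gate[(u, v)]   # stand-in for the memory term
            s[k] += g * s_prev[u] * p_bar[v]
    return s


# Previous state is forward entailment; the new local relation is mostly alternation.
s_prev = dist({"⊏": 1.0})
p_bar = dist({"|": 0.9, "#": 0.1})
s_next = aggregate_step(s_prev, p_bar)
```

In this example the mass on alternation carries over unchanged because forward entailment joined with alternation yields alternation.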
Below we discuss how the memory network response $o_j$ is calculated. We use a memory network component (Weston et al., 2014; Sukhbaatar et al., 2015) to enhance our module aggregation network, aiming to better model contextual information. The details are shown in the left part of Figure 2. Specifically, at time step j, we store memory vectors $\{m_1, \ldots, m_j\}$ and the corresponding output vectors $\{c_1, \ldots, c_j\}$ in the memory. The query vector $q_j$ scans the memory and computes the match between itself and each memory vector $m_i$ by taking the inner product $q_j^\top m_i$ followed by a softmax over i = 1, ..., j. The query, memory, and output vectors are functions of the aligned token representation $[\tilde{b}_j, b_j]$, typically modeled by two feed-forward layers. The response vector $o_j$ is computed as the sum of the stored output vectors $c_i$ weighted by these match probabilities and is used in the module network discussed above; $o_j$ thus encodes all historical transitions and their context and is incorporated into Equation 7.
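The memory read described above follows the usual end-to-end memory network pattern: inner products, a softmax, then a weighted sum of output vectors. A minimal sketch with toy vectors is given below; the feed-forward layers that produce the query, memory, and output vectors are omitted, and the function name is hypothetical.

```python
import math


def memory_response(query, memory, outputs):
    """query: q_j; memory: [m_1..m_j]; outputs: [c_1..c_j]. Returns o_j."""
    scores = [sum(q * m for q, m in zip(query, m_i)) for m_i in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]       # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(outputs[0])
    # Weighted sum of the stored output vectors.
    return [sum(w * c_i[k] for w, c_i in zip(weights, outputs))
            for k in range(dim)]


# Two memory slots that match the query equally well.
o_j = memory_response([1.0, 0.0],
                      [[0.0, 0.0], [0.0, 0.0]],
                      [[2.0, 0.0], [0.0, 2.0]])
```

With equal match scores the response is the plain average of the stored output vectors.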
In addition to the sequential aggregation discussed above, in which we perform aggregation left-to-right over a premise and hypothesis pair, we also perform aggregation on binarized constituency parses, where aggregation follows the tree structure. For node j in the constituency tree, we define a random variable $z_j$ that represents the reasoning state upon seeing node j and its sub-tree, and we use $s_j$ to denote the distribution of $z_j$. We initialize $s_j$ with the projected relation distribution $\bar{p}_j$ if node j is a leaf node. Iteratively, the distribution $s_j$ for each non-leaf node is computed by aggregating the distributions of its left child (lc) and right child (rc), where $o_j$ is the memory network response vector computed from the information of all nodes that have already been visited.

Objective Function
The final prediction of the sentence relation is computed from the distribution of the hidden state $s_n$ at the last time step (or at the root node if reasoning is performed over the constituency tree). We follow the work of Angeli and Manning (2014) and group equivalence (≡) and forward entailment (⊏) as entailment; negation (∧) and alternation (|) as contradiction; and reverse entailment (⊐), cover (⌣), and independence (#) as neutral. We apply a variant of the hard-EM training method (Min et al., 2019), which selects the most likely relation within each group: $p_{\text{entailment}} = \max(s_n^{\equiv}, s_n^{\sqsubset})$, $p_{\text{contradiction}} = \max(s_n^{\wedge}, s_n^{\mid})$, and $p_{\text{neutral}} = \max(s_n^{\sqsupset}, s_n^{\smile}, s_n^{\#})$. After applying softmax, we obtain the prediction probability, which is used to compute the cross-entropy loss.
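The grouping and loss described above can be sketched as max-pooling within each label group, a softmax over the three pooled values, and a cross-entropy term. This is an illustration with hypothetical function names and a toy final state, not the training code used in the experiments.

```python
import math

GROUPS = {
    "entailment": ["≡", "⊏"],
    "contradiction": ["∧", "|"],
    "neutral": ["⊐", "⌣", "#"],
}


def nli_probabilities(s_n):
    """Max-pool the final state within each group, then apply softmax."""
    pooled = {label: max(s_n[k] for k in members)
              for label, members in GROUPS.items()}
    top = max(pooled.values())
    exps = {label: math.exp(v - top) for label, v in pooled.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def cross_entropy(probabilities, gold_label):
    return -math.log(probabilities[gold_label])


# Toy final aggregation state with most mass on alternation.
s_n = {"≡": 0.0, "⊏": 0.05, "⊐": 0.0, "∧": 0.0, "|": 0.9, "⌣": 0.0, "#": 0.05}
probs = nli_probabilities(s_n)
loss = cross_entropy(probs, "contradiction")
```

Because only the highest-scoring relation in each group contributes, the gradient flows through a single relation per label, which is the hard-EM flavor the text refers to.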

Setup
Data: We use three datasets that are designed for studying monotonicity-based reasoning, i.e., HELP (Yanaka et al., 2019b), MED (Yanaka et al., 2019a), and the monotonicity subset of Semantic Fragments (Richardson et al., 2020). The HELP dataset has 35,891 inference pairs, which are automatically generated by conducting lexical substitution or deletion on one sentence to obtain the other, given the natural logic polarity of each word token and the syntactic structure of the sentences. The MED dataset contains 5,382 human-generated inference pairs, obtained either by asking crowdworkers to write them or by manually collecting them from linguistics publications. The monotonicity subset of Semantic Fragments contains around 2,000 pairs, which are automatically generated with a controlled set of rules and lexicons. Since the pairs with the contradiction relation in the Semantic Fragments dataset are obtained by changing quantifiers, which is outside the scope of the natural logic formalism that we use, we do not include this subset in our experiments.
In addition, we create a new 2-hop dataset. The above datasets lack ground-truth labels for evaluating aggregation at each time step, and most of their examples involve 1-hop aggregation, in which a premise and hypothesis differ by only one span of text. In our 2-hop dataset, the premise and hypothesis differ by two edits of word/phrase insertion, deletion, or substitution. Our dataset provides the ground-truth aggregation output $\{z_1, \ldots, z_j, \ldots, z_n\}$ to help assess models' performance on natural logic operations and understand their decision paths. The development of this 2-hop dataset includes three steps: (a) identify the editing type for each example in MED and determine the logic relations; (b) add one more hop of relation; and (c) record the ground-truth aggregation labels at each time step and the final NLI labels following MacCartney's natural logic formalism. We manually checked a subset of the data and found that more than 96% of the examples are correct. Details of the data development are included in Appendix A.
Implementation Details: Following Chen et al. (2017b), hidden vectors in our model are 300-dimensional. We use pretrained 300-dimensional 840B GloVe vectors (Pennington et al., 2014) to initialize our word embeddings. All word embeddings are trainable after being initialized. We apply a dropout rate of p = 0.5. Adam (Kingma and Ba, 2015) is used as our optimizer, with the first and second moment coefficients set to 0.9 and 0.999, respectively. The batch size is set to 32 and the initial learning rate is 0.0004. We train ESIM and our neural natural logic models for 32 epochs and use the development set to select models for testing. We use the default hyper-parameters specified in (Devlin et al., 2019) and train the BERT-base model for 3 epochs.

Results
Inference Performance: Table 4 shows the test accuracy of different models on the four datasets that are designed specifically for evaluating monotonicity-based inference. Following Richardson et al. (2020) and Yanaka et al. (2019a), we train the models on SNLI (Bowman et al., 2015) and test on these different test sets. The proposed models, in general, achieve better performance on these four datasets than ESIM (Chen et al., 2017b) and BERT (Devlin et al., 2019). The difference is more prominent on the 2-hop dataset, which requires the system to have a better aggregation ability to make the final prediction.
To demonstrate how the models generalize between upward and downward monotone contexts, we train the models on HELP's upward monotone subset and test on the downward monotone subset. A system that can better model monotonicity should achieve more robust performance. Specifically, we split the upward monotone subset of the HELP dataset into a training set (∼6k examples) and a development set (∼1.5k examples). We train all models on the training split and select the models with the highest development accuracy. We test all models on the HELP downward monotone subset (∼21k examples). The right-most column of Table 4 shows that while ESIM and BERT achieve very high development accuracy on the upward data, they fail to generalize to the downward monotone test set. The proposed models generalize well and achieve better test accuracy on the downward monotone dataset.
Aggregation Decisions: The proposed model provides inference explainability by giving access to natural logic's aggregation and decision paths. Figure 3 shows an example from the 2-hop dataset, together with a visualization of the intermediate aggregation decisions. From left to right, the first subfigure shows the cross-sentence attention between the premise (x-axis) and hypothesis (y-axis), where a darker color corresponds to a larger attention weight. In the second subfigure, for each word in the hypothesis (y-axis), the predicted distribution over lexical-level logical relations is shown along the x-axis. The third subfigure shows the aggregation output. For example, on the second row, the aggregation has already been performed over the first two words $b_1$ = "the" and $b_2$ = "animals" using their lexical relation distributions, which are shown in the second subfigure and are, in turn, computed from the first subfigure using $\langle \tilde{b}_1, b_1 \rangle$ and $\langle \tilde{b}_2, b_2 \rangle$. Since '≡' ⋈ '⊏' = '⊏', we can see that on the second row a large probability mass has been put on ⊏ (i.e., ent_f in the figure).
We further perform a quantitative analysis of the aggregation performance, focusing on sequential aggregation. Specifically, for the 2-hop dataset, in which we have access to the aggregation decisions $\hat{z} = \{\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_n\}$, where $\hat{z}_j$ is the aggregation result at time step j, we evaluate the models by comparing the estimated $\hat{z}$ with the ground truth z. We use precision, recall, and F-score as our evaluation metrics. The details of how to compute them are in Appendix B.
Table 5 shows the results. Since ESIM and BERT do not produce intermediate aggregation results, they are not included in the table. The ablation analysis shows that both the memory/module component and the local relation constraints help the model learn intermediate natural logic aggregation. We can also see that there is still considerable room for improvement in aggregation prediction, and further work is desirable. As part of our efforts, we have also performed component training that leverages WordNet (Miller, 1998) and ConceptNet (Speer et al., 2017) to help determine lexical relations. This approach is not particularly effective, since the lexical pairs from these knowledge bases cover only a very small percentage of the pairs that need to be modeled.

Conclusions
This paper studies end-to-end trained differentiable models that integrate natural logic with neural networks. The proposed model adapts module networks to model natural logic operations, which is enhanced with a memory component to model contextual information.

Appendix A
Figure 4 shows an example of the 2-hop dataset. The premise and hypothesis differ by two edits of word/phrase insertion, deletion, or substitution. The dataset provides the ground truth of aggregation at each time step (the equivalence relation is the default relation and is hence not included in the "Ground truth of aggregation" section) and the word locations/indices associated with each edit. The 2-hop dataset is developed with the following three steps:

Premise: Some delegates finished the survey on time.
Hypothesis: Some individuals finished the survey.
Label: Entailment

Identifying MED Relations: Since most sentence pairs in the MED dataset differ by only one word/phrase edit, i.e., the premise and the hypothesis differ by one word/phrase, it is easy to determine the location of the insertion, deletion, or replacement. For insertion and deletion, we follow (Angeli and Manning, 2014) and treat the relation as reverse entailment (⊐) and forward entailment (⊏), respectively. We set aside the replacement samples since we cannot determine their relations without human labeling.
To ensure that the identified natural logic relations are correct, we compare the labels provided in MED with the labels determined by MacCartney's natural logic theory and remove the samples in which the labels do not agree, yielding roughly 1.1K sentence pairs.
Adding One More Hop of Relations: We ask human annotators to replace a noun in either the premise or the hypothesis with another word. The relation between the substituted and substituting words is one of {≡, ⊏, ⊐, |, #}. Annotators have access to WordNet, which can help suggest substituting words (e.g., hypernyms or hyponyms). Meanwhile, we require that the candidate words to be replaced are not children or parents of any previously identified differences in the parse tree. This replacement operation yields 5,858 sentence pairs, and the premise and the hypothesis of each example now differ by two edits.
Determining Labels: We apply the projection operation and natural logic aggregation according to (MacCartney and Manning, 2009) to determine the 3-way natural language inference labels for the generated 2-hop sentence pairs. We also record the ground-truth relations of each hop of aggregation output. We manually assess the data quality on 300 sentence pairs (100 for each category). We find that on average 3% of the samples have either incorrect labels or wrong intermediate aggregation output (4% in the Entailment category, 4% in Neutral, and 1% in Contradiction). These mistakes are mainly caused by incorrect parser-identified polarity.

B Aggregation Evaluation Metrics
We evaluate the intermediate aggregations of the proposed model with precision, recall, and F1 score. Precision is the number of correctly performed aggregations divided by the total number of aggregations performed by a model. Recall is the number of correctly performed aggregations divided by the total number of aggregations present in the ground-truth annotation. Note that we only consider aggregations at time step t when $\hat{z}_t \neq \hat{z}_{t-1}$. Since by default the starting state is $\hat{z}_0$ = '≡', if $\hat{z}_1$ = '≡', we do not count this degenerate case.
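These metrics can be sketched by extracting the time steps at which the aggregation state changes and comparing the predicted and ground-truth sets. The matching criterion used here, same time step and same resulting relation, is an assumption of this sketch, since the text does not spell it out; the function names are hypothetical.

```python
def transitions(z, start="≡"):
    """Return {(t, relation)} for every step t where the state changes."""
    previous, changed = start, set()
    for t, relation in enumerate(z, start=1):
        if relation != previous:       # unchanged steps are not counted
            changed.add((t, relation))
        previous = relation
    return changed


def aggregation_prf(predicted, gold):
    pred_set, gold_set = transitions(predicted), transitions(gold)
    correct = len(pred_set & gold_set)
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["≡", "⊏", "⊏", "⊏", "|"]
predicted = ["≡", "⊏", "⊏", "#", "#"]
precision, recall, f1 = aggregation_prf(predicted, gold)
```

In this toy case the model gets the first change right and the second wrong, so precision, recall, and F1 are all 0.5.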

Figure 1: A high-level view of the proposed neural natural logic model.

Figure 4: An example of the 2-hop dataset.

Table 1: Seven natural logic relations proposed by MacCartney and Manning (2009).

Table 2: The projection function ρ can project an input relation r into a different relation depending on the context. Here we show the projection function for each argument position of the quantifiers all, some, and no.

Table 3: Relation aggregation table (Icard, 2012). Relations listed in the first column are aggregated with those listed in the first row, yielding the relations in the corresponding entries in the table.

Table 4: Test accuracy of the models.

Table 5: Evaluation of models' aggregation performance on the 2-hop dataset.