Do Neural Models Learn Systematicity of Monotonicity Inference in Natural Language?

Despite the success of language models using neural networks, it remains unclear to what extent neural models have the generalization ability to perform inferences. In this paper, we introduce a method for evaluating whether neural models can learn systematicity of monotonicity inference in natural language, namely, the regularity for performing arbitrary inferences with generalization on composition. We consider four aspects of monotonicity inference and test whether the models can systematically interpret lexical and logical phenomena on different training/test splits. A series of experiments shows that three neural models systematically draw inferences on unseen combinations of lexical and logical phenomena when the syntactic structures of the sentences are similar between the training and test sets. However, the performance of the models significantly decreases when the structures are slightly changed in the test set while retaining all vocabulary items and constituents already appearing in the training set. This indicates that the generalization ability of neural models is limited to cases where the syntactic structures are nearly the same as those in the training set.


Introduction
Natural language inference (NLI), a task in which a system judges whether a given set of premises P semantically entails a hypothesis H (Dagan et al., 2013; Bowman et al., 2015), is a fundamental task for natural language understanding. As with other NLP tasks, recent studies have shown the remarkable impact of deep neural networks on NLI (Williams et al., 2018; Wang et al., 2019; Devlin et al., 2019). However, it remains unclear to what extent DNN-based models are capable of learning the compositional generalization underlying NLI from labeled training instances. Systematicity of inference (or inferential systematicity) (Fodor and Pylyshyn, 1988; Aydede, 1997) in natural language has been intensively studied in formal semantics. Among the various aspects of inferential systematicity, in the context of NLI we focus on monotonicity (van Benthem, 1983; Icard and Moss, 2014) and its productivity. Consider the following premise-hypothesis pairs (1)-(3), which all have the target label entailment:

(1) P: Some [puppies ↑] ran.
H: Some dogs ran.
(2) P: No [cats ↓] ran.
H: No small cats ran.

(3) P: No [animals that licked some [dogs ↓] ↓] ran.
H: No animals that licked some small dogs ran.

As in (1), for example, quantifiers such as some are upward monotone (shown as [... ↑]), and replacing a phrase in an upward-entailing context with a more general phrase (replacing puppies in P with dogs as in H) yields a sentence inferable from the original sentence. In contrast, as in (2), quantifiers such as no are downward monotone (shown as [... ↓]), and replacing a phrase in a downward-entailing context with a more specific phrase (replacing cats in P with small cats as in H) yields a sentence inferable from the original sentence. Such primitive inference patterns combine recursively, as in (3). In this manner, monotonicity and its productivity produce a potentially infinite number of inferential patterns, so NLI models must be capable of systematically interpreting such primitive patterns and reasoning over unseen combinations of patterns. Although many studies have addressed this issue by modeling logical reasoning in formal semantics (Abzianidze, 2015; Mineshima et al., 2015; Hu et al., 2019) and testing DNN-based models on monotonicity inference (Yanaka et al., 2019a,b, 2020), the ability of DNN-based models to generalize to unseen combinations of patterns is still underexplored.
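The recursive composition of monotonicity directions behind examples like (3) can be sketched in a few lines. This is an illustrative stand-alone sketch (the function name and the 'up'/'down' encoding are ours, not part of any existing toolkit): a replaced constituent sits in an upward-entailing context when it is under an even number of downward operators, and in a downward-entailing context when the number is odd.

```python
def polarity(path):
    """Compose monotonicity markings along the path from the outermost
    quantifier down to the replaced constituent.

    `path` lists the monotonicity of each operator on that path
    ('up' or 'down'); an odd number of downward operators yields a
    downward-entailing context, an even number an upward-entailing one."""
    assert all(mark in {"up", "down"} for mark in path)
    downs = sum(1 for mark in path if mark == "down")
    return "downward" if downs % 2 == 1 else "upward"

# (1) "Some [puppies]": directly under upward "some"  -> upward context
# (2) "No [cats]":      directly under downward "no"  -> downward context
```

For instance, `polarity(["up"])` returns `"upward"`, while `polarity(["down", "up"])` returns `"downward"`, capturing how an upward quantifier embedded under a downward one flips the entailment direction of its argument.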
Given this background, we investigate the systematic generalization ability of DNN-based models on four aspects of monotonicity: (i) systematicity of predicate replacements (i.e., replacements with a more general or specific phrase), (ii) systematicity of embedding quantifiers, (iii) productivity, and (iv) localism (see Section 2.2). To this end, we introduce a new evaluation protocol in which we (a) synthesize training instances from sampled sentences and (b) systematically control which patterns are shown to the models in the training phase and which are left unseen. The rationale behind this protocol is twofold. First, patterns of monotonicity inference are highly systematic, so we can create training data with arbitrary combinations of patterns, as in examples (1)-(3). Second, evaluating models trained on well-known NLI datasets such as MultiNLI (Williams et al., 2018) might severely underestimate their ability, because such datasets tend to contain only a limited number of training instances exhibiting the inferential patterns of interest. Furthermore, using such datasets would prevent us from identifying which combinations of patterns the models can infer from which patterns in the training data.
This paper makes two primary contributions. First, we introduce an evaluation protocol that systematically controls the training/test split under various combinations of semantic properties to evaluate whether models learn inferential systematicity in natural language; the evaluation code is publicly available at https://github.com/verypluming/systematicity. Second, we apply our evaluation protocol to three NLI models and present evidence suggesting that, while all models generalize to unseen combinations of lexical and logical phenomena, their generalization ability is limited to cases where sentence structures are nearly the same as those in the training set.
Method

Basic idea

Figure 1 illustrates the basic idea of our evaluation protocol on monotonicity inference. We use synthesized monotonicity inference datasets, where NLI models should capture both (i) the monotonicity directions (upward/downward) of various quantifiers and (ii) the types of various predicate replacements in their arguments. To build such datasets, we first generate a set of premises G^Q_d with a context-free grammar G of depth d (i.e., the maximum number of applications of recursive rules), given a set of quantifiers Q. Then, by applying to G^Q_d the elements of a set of functions for predicate replacements (or replacement functions for short) R, each of which rephrases a constituent of the input premise and returns a hypothesis, we obtain a set of premise-hypothesis pairs defined as

D^{Q,R}_d = {(P, r(P)) | P ∈ G^Q_d, r ∈ R}.

For example, the premise Some puppies ran is generated from the quantifier some in Q and the production rule S → Q N IV, and thus is an element of G^Q_1. By applying to this premise a replacement function that replaces a word with its hypernym (e.g., puppy ⊑ dog), we obtain the premise-hypothesis pair Some puppies ran ⇒ Some dogs ran in Figure 1.
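The construction of D^{Q,R}_d can be sketched as follows for depth 1. The toy grammar, lexicon, and function names are illustrative stand-ins for the paper's generation pipeline, not its actual code.

```python
from itertools import product

QUANTIFIERS = ["some", "no"]          # a toy Q
NOUNS = ["puppies", "dogs"]
IVS = ["ran"]
HYPERNYM = {"puppies": "dogs"}        # puppy ⊑ dog

def generate_premises():
    """A toy G^Q_1: depth-1 premises from the rule S -> Q N IV."""
    return [f"{q} {n} {iv}" for q, n, iv in product(QUANTIFIERS, NOUNS, IVS)]

def replace_hypernym(premise):
    """One replacement function r in R: swap a noun for its hypernym.
    Returns None when no word in the premise has a listed hypernym."""
    words = premise.split()
    out = [HYPERNYM.get(w, w) for w in words]
    return " ".join(out) if out != words else None

def build_pairs():
    """D^{Q,R}_1 = {(P, r(P)) | P in G^Q_1, r in R}."""
    pairs = []
    for p in generate_premises():
        h = replace_hypernym(p)
        if h is not None:
            pairs.append((p, h))
    return pairs
```

Running `build_pairs()` yields pairs such as ("some puppies ran", "some dogs ran"), mirroring the Figure 1 example.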
We can control which patterns are shown to the models during training and which are left unseen by systematically splitting D^{Q,R}_d into training and test sets. As shown on the left side of Figure 1, we consider how to test the systematic capacity of models on unseen combinations of quantifiers and predicate replacements. To expose models to primitive patterns regarding Q and R, we fix an arbitrary element q of Q and feed various predicate replacements to the models via the training set of inferences D^{{q},R}_d, generated from combinations of the fixed quantifier and all predicate replacements. Likewise, we fix an arbitrary element r of R and feed various quantifiers to the models via the training set of inferences D^{Q,{r}}_d, generated from combinations of all quantifiers and the fixed predicate replacement.
We then test the models on the set of inferences generated from unseen combinations of quantifiers and predicate replacements; that is, we test them on D^{Q,R}_d \ (D^{{q},R}_d ∪ D^{Q,{r}}_d). Similarly, as shown on the right side of Figure 1, we can test the productive capacity of models on unseen depths by changing the training/test split based on d. For example, by training models on D^{Q,R}_d and testing them on D^{Q,R}_{d+1}, we can evaluate whether models generalize to one deeper depth. By testing models with an arbitrary training/test split of D^{Q,R}_d based on the semantic properties of monotonicity inference (i.e., quantifiers, predicate replacements, and depths), we can evaluate whether models systematically interpret them.
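The combination-based split above can be sketched as follows; the data layout (a dict keyed by (quantifier, replacement) pairs) is an assumption made for illustration.

```python
def split_by_combination(pairs_by_qr, q_fixed, r_fixed):
    """Train on every replacement paired with the fixed quantifier q_fixed
    and every quantifier paired with the fixed replacement r_fixed
    (i.e., D^{{q},R}_d and D^{Q,{r}}_d); test on the remaining, unseen
    quantifier/replacement combinations."""
    train, test = [], []
    for (q, r), pairs in pairs_by_qr.items():
        (train if q == q_fixed or r == r_fixed else test).extend(pairs)
    return train, test
```

With two quantifiers and two replacement types, fixing one of each puts three of the four combinations in training and holds out the fourth as the unseen combination.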

Evaluation protocol
To test NLI models from multiple perspectives of inferential systematicity in monotonicity inference, we focus on four aspects: (i) systematicity of predicate replacements, (ii) systematicity of embedding quantifiers, (iii) productivity, and (iv) localism. For each aspect, we use a set D^{Q,R}_d of premise-hypothesis pairs. Let Q = Q↑ ∪ Q↓ be the union of a set of selected upward quantifiers Q↑ and a set of selected downward quantifiers Q↓ such that |Q↑| = |Q↓| = n. Let R be a set of replacement functions {r_1, ..., r_m}, and let d be the embedding depth, with 1 ≤ d ≤ s.
(4) is an example of an element of D^{Q,R}_1, containing the quantifier some in the subject position and a predicate replacement using the hypernym relation dogs ⊑ animals in its upward-entailing context, without embedding.
(4) P: Some dogs ran ⇒ H: Some animals ran

I. Systematicity of predicate replacements

The following describes how we test the extent to which models generalize to unseen combinations of quantifiers and predicate replacements. Here, we expose models to all primitive patterns of predicate replacements like (4) and (5) and all primitive patterns of quantifiers like (6) and (7). We then test whether the models can systematically capture the difference between upward quantifiers (e.g., several) and downward quantifiers (e.g., no), as well as the different types of predicate replacements (e.g., the lexical relation dogs ⊑ animals and the adjective deletion small dogs ⊑ dogs), and correctly interpret unseen combinations of quantifiers and predicate replacements like (8) and (9). Here, we consider the set of inferences D^{Q,R}_1 whose depth is 1. We move from harder to easier tasks by gradually changing the training/test split according to combinations of quantifiers and predicate replacements.

First, we expose models to primitive patterns of Q and R with the minimum training set. Thus, we define the initial training set S_1 and test set T_1 as

S_1 = D^{{q},R}_1 ∪ D^{Q,{r}}_1,  T_1 = D^{Q,R}_1 \ S_1,

where q is arbitrarily selected from Q and r is arbitrarily selected from R. Next, we gradually add to the training set the sets of inferences generated from combinations of an upward/downward quantifier pair and all predicate replacements. In the examples above, we add (8) and (9) to the training set to simplify the task. We assume a set Q′ of upward/downward quantifier pairs, pairing each upward quantifier in Q↑ with a downward quantifier in Q↓, and consider the set perm(Q′) of permutations of Q′. For each p ∈ perm(Q′), we gradually add the set of inferences generated from p(i) to the training set S_i, with 1 < i ≤ n − 1.
Then, we provide a test set T_i generated from the complement of the set of quantifiers used in S_i, i.e., from the quantifiers that remain unseen. To evaluate the extent to which the generalization ability of models is robust to different syntactic structures, we use an additional test set generated using three production rules. The first is the case where one adverb is added at the beginning of the sentence, as in example (10).
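The incremental protocol can be sketched as follows, under assumed inputs (q_pairs lists the upward/downward quantifier pairs, pairs_by_q maps each quantifier to its generated inferences, and s1 is the initial training set S_1); this is a minimal sketch, not the paper's implementation.

```python
from itertools import permutations

def protocol_splits(q_pairs, pairs_by_q, s1):
    """For each permutation p of upward/downward quantifier pairs, grow
    the training set S_i one pair at a time; the test set T_i consists of
    inferences whose quantifier has not yet appeared in training."""
    for p in permutations(q_pairs):
        seen, train, splits = set(), list(s1), []
        for up_q, down_q in p[:-1]:          # the last pair stays unseen
            seen |= {up_q, down_q}
            train = train + pairs_by_q[up_q] + pairs_by_q[down_q]
            test = [ex for q, exs in pairs_by_q.items()
                    if q not in seen for ex in exs]
            splits.append((list(train), test))
        yield splits
```

Averaging the test accuracy over all |perm(Q′)| permutations gives the final evaluation score described above.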
(10) P_adv: Slowly, several small dogs ran
H_adv: Slowly, several dogs ran

The second is the case where a three-word prepositional phrase is added at the beginning of the sentence, as in example (11).

(11) P_prep: Near the shore, several small dogs ran
H_prep: Near the shore, several dogs ran

The third is the case where the replacement is performed in the object position, as in example (12).

(12) P_obj: Some tiger touched several small dogs
H_obj: Some tiger touched several dogs

We train and test models |perm(Q′)| times, then take the average accuracy as the final evaluation result.
II. Systematicity of embedding quantifiers

To properly interpret embedding monotonicity, models should detect both (i) the monotonicity direction of each quantifier and (ii) the type of predicate replacement in the embedded argument. The following describes how we test whether models generalize to unseen combinations of embedding quantifiers. We expose models to all primitive combination patterns of quantifiers and predicate replacements like (4)-(9), using the set of non-embedding monotonicity inferences D^{Q,R}_1, together with some embedding patterns like (13), where Q1 and Q2 are chosen from a selected set of upward or downward quantifiers such as some or no. We then test the models on an inference with the unseen quantifier several, as in (14), to evaluate whether models can systematically interpret embedding quantifiers.
(13) P: Q1 animals that chased Q2 dogs ran
H: Q1 animals that chased Q2 animals ran

(14) P: Several animals that chased several dogs ran
H: Several animals that chased several animals ran

We move from harder to easier tasks of learning embedding quantifiers by gradually changing the training/test split of the set of inferences D^{Q,R}_2 whose depth is 2, i.e., inferences involving one embedded clause. We assume a set Q′ of upward/downward quantifier pairs, as in Section I. We train and test models |perm(Q′)| times, then take the average accuracy as the final evaluation result.

III. Productivity

Productivity (or recursiveness) is a concept related to systematicity; it refers to the capacity to grasp an indefinite number of natural language sentences or thoughts with generalization on composition. The following describes how we test whether models generalize to unseen deeper depths in embedding monotonicity (see also the right side of Figure 1). For example, we expose models to all primitive non-embedding/single-embedding patterns like (15) and (16) and test whether the models generalize to inferences involving embedded clauses of unseen, deeper depths.

IV. Localism

According to the principle of compositionality, the meaning of a complex expression derives from the meanings of its constituents and the way they are combined. One important concern is how local the composition operations should be (Pagin and Westerståhl, 2010). We therefore test whether models trained with inferences involving embedded monotonicity can locally perform inferences composed of smaller constituents. Specifically, we train models with examples like (17) and then test the models with examples like (15) and (16): we train models with D_d and test them on ∪_{k∈{1,...,d}} D_k, with 3 ≤ d ≤ s.
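The depth-based splits for productivity and localism can be sketched as follows, assuming pairs_by_depth maps each depth d to the inference set D_d (an illustrative layout, not the paper's code).

```python
def productivity_split(pairs_by_depth, d):
    """Productivity: train on all inferences up to depth d, test on the
    unseen depth d + 1."""
    train = [ex for k in range(1, d + 1) for ex in pairs_by_depth[k]]
    return train, list(pairs_by_depth[d + 1])

def localism_split(pairs_by_depth, d):
    """Localism: train on depth-d inferences only, test on every depth up
    to d, i.e., on inferences composed of smaller constituents."""
    test = [ex for k in range(1, d + 1) for ex in pairs_by_depth[k]]
    return list(pairs_by_depth[d]), test
```

The two splits are mirror images: productivity trains shallow and tests deep, while localism trains deep and tests shallow.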

Data creation
To prepare the datasets shown in Table 1, we first generate premise sentences involving quantifiers from a set of context-free grammar (CFG) rules and lexical entries, shown in Table 6 in the Appendix. We select 10 words from among nouns, intransitive verbs, and transitive verbs as lexical entries. The set of quantifiers Q consists of eight elements: a set of four downward quantifiers Q↓ = {no, at most three, less than three, few} and a set of four upward quantifiers Q↑ = {some, at least three, more than three, a few}, all of which have the same monotonicity direction in their first and second arguments. We thus have n = |Q↑| = |Q↓| = 4 in the protocol in Section 2.2. The ratio of the monotonicity directions (upward/downward) of generated sentences is set to 1:1. We then generate hypothesis sentences by applying replacement functions to premise sentences according to the polarities of constituents. The set of replacement functions R is composed of the seven types of lexical replacements and phrasal additions in Table 2. We remove unnatural premise-hypothesis pairs in which the same word or phrase appears more than once. For embedding monotonicity, we consider inferences involving four types of replacement functions in the first argument of the quantifier in Table 2: hyponyms, adjectives, prepositions, and relative clauses. We generate sentences up to depth d = 5. There are various types of embedding monotonicity, including relative clauses, conditionals, and negated clauses. In this paper, we consider three types of embedded clauses: peripheral-embedding clauses and two kinds of center-embedding clauses, shown in Table 6 in the Appendix.
The number of generated sentences exponentially increases with the depth of embedded clauses. Thus, we limit the number of inference examples to 320,000, split into 300,000 examples for the training set and 20,000 examples for the test set. We guarantee that all combinations of quantifiers are included in the set of inference examples for each depth. Gold labels for generated premise-hypothesis pairs are automatically determined according to the polarity of the argument position (upward/downward) and the type of predicate replacements (with more general/specific phrases). The ratio of each gold label (entailment/non-entailment) in the training and test sets is set to 1 : 1.
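The gold-label assignment described above can be sketched as a two-valued check (a sketch of the monotonicity calculus for a single replacement site; the string encodings are ours, chosen for illustration).

```python
def gold_label(polarity, replacement):
    """A hypothesis follows from a premise when the replacement direction
    matches the polarity of the replaced position: a more general phrase
    in an upward-entailing context, or a more specific phrase in a
    downward-entailing context.  Any other combination is non-entailment."""
    assert polarity in {"upward", "downward"}
    assert replacement in {"more_general", "more_specific"}
    entails = (polarity == "upward") == (replacement == "more_general")
    return "entailment" if entails else "non-entailment"
```

For example, replacing dogs with animals (more general) under some (upward) yields entailment, while the same replacement under no (downward) yields non-entailment.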
To double-check the gold labels, we translate each premise-hypothesis pair into a logical formula (see the Appendix for details). The logical formulas are obtained by combining lambda terms in accordance with the meaning composition rules specified in the CFG rules in the standard way (Blackburn and Bos, 2005). We prove the entailment relation using the theorem prover Vampire, checking whether a proof is found within a time limit for each entailment pair. For all pairs, the output of the prover matched the entailment relation automatically determined by the monotonicity calculus.

Models
We consider three DNN-based NLI models. The first architecture employs long short-term memory (LSTM) networks (Hochreiter and Schmidhuber, 1997). We set the number of layers to three with no attention. Each premise and hypothesis is processed as a sequence of words using a recurrent neural network with LSTM cells, and the final hidden state of each serves as its representation.
The second architecture employs multiplicative tree-structured LSTM (TreeLSTM) networks (Tran and Cheng, 2018), which are expected to be more sensitive to hierarchical syntactic structures. Each premise and hypothesis is processed as a tree structure by bottom-up combination of constituent nodes using the same shared compositional function, input word information, and between-word relational information. We parse all premise-hypothesis pairs with the dependency parser in the spaCy library and obtain tree structures. For each experimental setting, we randomly sample 100 tree structures and check their correctness. In LSTM and TreeLSTM, the dimension of the hidden units is 200, and we initialize the word embeddings with 300-dimensional GloVe vectors (Pennington et al., 2014). Both models are optimized with Adam (Kingma and Ba, 2015), and no dropout is applied.
The third architecture is a Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2019). We used the base-uncased model pre-trained on Wikipedia and BookCorpus from the pytorch-pretrained-bert library, fine-tuned for the NLI task on our dataset. In fine-tuning BERT, no dropout is applied, and we choose hyperparameters commonly used for MultiNLI. We train all models for 25 epochs or until convergence, and select the best-performing model based on its performance on the validation set. We perform five runs per model and report the average and standard deviation of their scores.

Experiments and Discussion
I. Systematicity of predicate replacements

Figure 2 shows the performance on unseen combinations of quantifiers and predicate replacements. With the minimal training set S_1, the accuracy of LSTM and TreeLSTM was close to chance, but that of BERT was around 75%, suggesting that only BERT partially generalized to unseen combinations of quantifiers and predicate replacements. When we trained BERT with the training set S_2, which contains inference examples generated from combinations of one pair of upward/downward quantifiers and all predicate replacements, the accuracy was 100%. This indicates that, having seen two kinds of quantifiers in the training data, BERT could distinguish between upward and downward for the other quantifiers. The accuracy of LSTM and TreeLSTM increased with the training set size but did not reach 100%, indicating that LSTM and TreeLSTM also generalize to inferences involving similar quantifiers to some extent, but their generalization ability is imperfect.
When testing models with inferences where adverbs or prepositional phrases are added to the beginning of the sentence, the accuracy of all models significantly decreased. The decrease becomes larger as the syntactic structures of the sentences in the test set diverge from those in the training set. Contrary to our expectations, the models fail to maintain accuracy even on a test set whose only difference from the training set is an adverb at the beginning of the sentence. Of course, we could augment the training data with that structure, but doing so would require feeding all combinations of inference pairs into the models. These results indicate that the models tend to estimate the entailment label from the beginning of a premise-hypothesis sentence pair, and that inferential systematicity for inferences involving quantifiers and predicate replacements is not fully generalized at the level of arbitrary constituents.

II. Systematicity of embedding quantifiers

Figure 3 shows the performance of all models on unseen combinations of embedding quantifiers. Even when the training set of inferences involving one embedded clause and two quantifiers was added step by step, no model showed improved performance. The accuracy of BERT slightly exceeded chance, but the accuracy of LSTM and TreeLSTM was nearly the same as or lower than chance. These results suggest that all the models fail to generalize to unseen combinations of embedding quantifiers, even when those combinations involve similar upward/downward quantifiers.

III. Productivity

Table 3 shows the performance on unseen depths of embedded clauses. The accuracy on D_1 and D_2 was nearly 100%, indicating that all models almost completely generalize to inferences of previously seen depths. When D_1 + D_2 was used as the training set, the accuracy of all models on D_3 exceeded chance. Similarly, when D_1 + D_2 + D_3 was used as the training set, the accuracy of all models on D_4 exceeded chance. This indicates that all models partially generalize to inferences containing embedded clauses one level deeper than the training set.

However, the standard deviations of BERT and LSTM were around 10, suggesting that these models did not consistently generalize to inferences containing embedded clauses one level deeper than the training set. Although the distribution of monotonicity directions (upward/downward) in the training and test sets was uniform, the accuracy of LSTM and BERT tended to be lower for downward inferences than for upward inferences. This also indicates that these models fail to properly compute the monotonicity directions of constituents from syntactic structures. The standard deviation of TreeLSTM was smaller, indicating that TreeLSTM robustly learns inference patterns containing embedded clauses one level deeper than the training set.

However, the performance of all models trained with D_1 + D_2 significantly decreased on D_4 and D_5, and the performance of all models trained with D_1 + D_2 + D_3 decreased on D_5. That is, the performance of all models, including TreeLSTM, significantly decreased on inferences containing embedded clauses two or more levels deeper than those in the training set. These results indicate that all models fail to develop productivity on inferences involving embedding monotonicity.

IV. Localism

Table 4 shows the performance of all models on localism of embedding monotonicity. When the models were trained with D_3, D_4, or D_5, all performed at around chance on the test set of non-embedding inferences D_1 and the test set of inferences involving one embedded clause D_2. These results indicate that even when models are trained with a set of inferences containing complex syntactic structures, they fail to locally interpret the constituents of those inferences.
Performance of data augmentation

Prior studies (Yanaka et al., 2019b; Richardson et al., 2020) have shown that, given BERT initially trained with MultiNLI, further training with synthesized instances of logical inference improves performance on the same types of logical inference while maintaining the initial performance on MultiNLI. To investigate whether the results of our study transfer to current work on MultiNLI, we trained models with our synthesized dataset mixed with MultiNLI and checked (i) whether our synthesized dataset degrades the original performance of models on MultiNLI and (ii) whether MultiNLI degrades the ability to generalize to unseen depths of embedded clauses. Table 5 shows that training BERT on our synthetic data D_1 + D_2 and MultiNLI increases the accuracy on our test sets D_1 (46.9 to 100.0), D_2 (46.2 to 100.0), and D_3 (46.8 to 67.8) while preserving accuracy on MultiNLI (84.6 to 84.4). This indicates that training BERT with our synthetic data improves performance on monotonicity without degrading performance on commonly used corpora like MultiNLI, which suggests that our data-synthesis approach can be combined with naturalistic datasets. For TreeLSTM and LSTM, however, adding our synthetic dataset decreases accuracy on MultiNLI. One possible reason is that a pre-training-based model like BERT can mitigate catastrophic forgetting across various types of datasets.
Regarding the ability to generalize to unseen depths of embedded clauses, the accuracy of all models on our synthetic test set containing embedded clauses one level deeper than the training set still exceeds chance, but the improvement becomes smaller with the addition of MultiNLI. In particular, with the addition of MultiNLI, the models tend to make wrong predictions in cases where the hypothesis contains a phrase not occurring in the premise yet the premise entails the hypothesis. Such inference patterns run contrary to the heuristics learnable from MultiNLI (McCoy et al., 2019). This indicates that there may be a trade-off in performance between the inference patterns in the training set and those in the test set.

Related Work
The question of whether neural networks are capable of processing compositionality has been widely discussed (Fodor and Pylyshyn, 1988; Marcus, 2003). Recent empirical studies illustrate the importance and difficulty of evaluating this capability in neural models. Generation tasks using artificial datasets have been proposed for testing whether models compositionally interpret training data in terms of the underlying grammar of the data (Lake and Baroni, 2017; Hupkes et al., 2018; Saxton et al., 2019; Loula et al., 2018; Hupkes et al., 2019; Bernardy, 2018). However, the conclusions are controversial, and it remains unclear whether the failure of models on these tasks stems from their inability to deal with compositionality.
Previous studies using logical inference tasks have also reported both positive and negative results.
Assessment results on propositional logic (Evans et al., 2018), first-order logic (Mul and Zuidema, 2019), and natural logic (Bowman et al., 2015) show that neural networks can generalize to unseen words and lengths. In contrast, Geiger et al. (2019) obtained negative results by testing models under fair conditions of natural logic. Our study suggests that these conflicting results come from an absence of perspective on combinations of semantic properties.
Regarding assessment of the behavior of modern language models, Linzen et al. (2016) and Goldberg (2019) investigated their syntactic capabilities by testing such models on subject-verb agreement tasks. Many studies of NLI tasks (Liu et al., 2019; Glockner et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; McCoy et al., 2019; Rozen et al., 2019; Ross and Pavlick, 2019) have provided evaluation methodologies and found that current NLI models often fail on particular inference types or learn undesired heuristics from the training set. In particular, recent works (Yanaka et al., 2019a,b; Richardson et al., 2020) have evaluated models on monotonicity, but did not focus on the ability to generalize to unseen combinations of patterns. Monotonicity covers various systematic inferential patterns and is thus an adequate semantic phenomenon for assessing inferential systematicity in natural language. Another benefit of focusing on monotonicity is that it provides problem settings that are hard for models relying on such heuristics (McCoy et al., 2019), which fail on downward-entailing inferences where the hypothesis is longer than the premise.

Conclusion
We introduced a method for evaluating whether DNN-based models can learn systematicity of monotonicity inference under four aspects. A series of experiments showed that the capability of three models to capture systematicity of predicate replacements was limited to cases where the positions of the constituents were similar between the training and test sets. For embedding monotonicity, no models consistently drew inferences involving embedded clauses whose depths were two levels deeper than those in the training set. This suggests that models fail to capture inferential systematicity of monotonicity and its productivity.
We also found that BERT trained with our synthetic dataset mixed with MultiNLI maintained performance on MultiNLI while improving the performance on monotonicity. This indicates that though current DNN-based models do not systematically interpret monotonicity inference, some models might have sufficient ability to memorize different types of reasoning. We hope that our work will be useful in future research for realizing more advanced models that are capable of appropriately performing arbitrary inferences.
Appendix

Context-free grammar for premise sentences:

S → NP IV1
NP → Q N | Q N S′
S′ → WhNP TV NP | WhNP NP TV | NP TV
IV1 → IV1 Adv | IV1 PP | IV1 or IV2 | IV1 and IV2

Lexicon:

Q → {no, at most three, less than three, few, some, at least three, more than three, a few}
N → {dog, rabbit, lion, cat, bear, tiger, elephant, fox, monkey, wolf}
IV1 → {ran, walked, came, waltzed, swam, rushed, danced, dawdled, escaped, left}
IV2 → {laughed, groaned, roared, screamed, cried}
TV → {kissed, kicked, hit, cleaned, touched, loved, accepted, hurt, licked}

Embedded clauses are generated in the first argument of the quantifier. To generate natural sentences consistently, we use the past tense for verbs; for lexical entries and predicate replacements, we select those that do not violate selectional restrictions.
To check the gold labels for the generated premise-hypothesis pairs, we translate each sentence into a first-order logic (FOL) formula and test whether the entailment relation holds by theorem proving. The FOL formulas are compositionally derived by combining the lambda terms assigned to each lexical item in accordance with the meaning composition rules specified in the CFG rules in the standard way (Blackburn and Bos, 2005). Since our purpose is to check the polarity of monotonicity marking, vague quantifiers such as few are represented according to their polarity; for example, we map the quantifier few onto the lambda term λP.λQ.¬∃x(few(x) ∧ P(x) ∧ Q(x)). Table 7 shows all results on embedding monotonicity. These results indicate that all models partially generalize to inferences containing embedded clauses one level deeper than the training set, but fail to generalize to inferences containing embedded clauses two or more levels deeper.