Probing Linguistic Systematicity

Recently, there has been much interest in the question of whether deep natural language understanding (NLU) models exhibit systematicity, generalizing such that units like words make consistent contributions to the meaning of the sentences in which they appear. There is accumulating evidence that neural models do not learn systematically. We examine the notion of systematicity from a linguistic perspective, defining a set of probing tasks and a set of metrics to measure systematic behaviour. We also identify ways in which network architectures can generalize non-systematically, and discuss why such forms of generalization may be unsatisfying. As a case study, we perform a series of experiments in the setting of natural language inference (NLI). We provide evidence that current state-of-the-art NLU systems do not generalize systematically, despite overall high performance.


Introduction
Language allows us to express and comprehend a vast variety of novel thoughts and ideas. This creativity is made possible by compositionality: the linguistic system builds utterances by combining an inventory of primitive units such as morphemes, words, or idioms (the lexicon), using a small set of structure-building operations (the grammar; Carnap, 1947; Fodor and Pylyshyn, 1988; Hodges, 2012; Janssen et al., 2012; Lake et al., 2017a,b; Szabó, 2012; Zadrozny, 1994).
One property of compositional systems, widely studied in the cognitive sciences, is the phenomenon of systematicity. Systematicity refers to the fact that lexical units such as words make consistent contributions to the meaning of the sentences in which they appear. Fodor and Pylyshyn (1988) provided a famous example: If a competent speaker of English knows the meaning of the sentence John loves the girl, they also know the meaning of The girl loves John. This is because, for speakers of English, knowing the meaning of the first sentence implies knowing the meaning of the individual words the, loves, girl, and John, as well as grammatical principles such as how transitive verbs take their arguments. But knowing these words and principles of grammar implies knowing how to compose the meaning of the second sentence.
Deep learning systems now regularly exhibit very high performance on a large variety of natural language tasks, including machine translation (Wu et al., 2016; Vaswani et al., 2017), question answering (Henaff et al., 2016), visual question answering (Hudson and Manning, 2018), and natural language inference (Devlin et al., 2018; Storks et al., 2019). Recently, however, researchers have asked whether such systems generalize systematically (see §4).
Systematicity is the property whereby words have consistent contributions to composed meaning; the alternative is the situation where words have a high degree of contextually conditioned meaning variation. In such cases, generalization may be based on local heuristics (McCoy et al., 2019b;Niven and Kao, 2019), variegated similarity (Albright and Hayes, 2003), or local approximations (Veldhoen and Zuidema, 2017), where the contribution of individual units to the meaning of the sentence can vary greatly across sentences, interacting with other units in highly inconsistent and complex ways.
This paper introduces several novel probes for testing systematic generalization. We employ an artificial language to have control over systematicity and contextual meaning variation. Applying our probes to this language in an NLI setting reveals that some deep learning systems which achieve very high accuracy on standard holdout evaluations do so in ways which are non-systematic: the networks do not consistently capture the basic notion that certain classes of words have meanings which are consistent across the contexts in which they appear.
The rest of the paper is organized as follows. §2 discusses degrees of systematicity and contextually conditioned variation; §3 introduces the distinction between open- and closed-class words, which we use in our probes; §4 reviews related work. §5 introduces the NLI task and describes the artificial language we use; §6 discusses the models that we tested and the details of our training setup; §7 introduces our probes of systematicity, and results are presented in §8.

Systematicity and Contextual Conditioning
Compositionality is often stated as the principle that the meaning of an utterance is determined by the meanings of its parts and the way those parts are combined (see, e.g., Heim and Kratzer, 2000). Systematicity, the property that words mean the same thing in different contexts, is closely related to compositionality; nevertheless, compositional systems can vary in their degree of systematicity. At one end of the spectrum are systems in which primitive units contribute exactly one identical meaning across all contexts. This high degree of systematicity is approached by artificial formal systems including programming languages and logics, though even these systems don't fully achieve this ideal (Cantwell Smith, 1996;Dutilh Novaes, 2012).
The opposite of systematicity is the phenomenon of contextually conditioned variation in meaning, where the contribution of individual words varies according to the sentential contexts in which they appear. Natural languages exhibit such context dependence in phenomena like homophony, polysemy, multi-word idioms, and co-compositionality. Nevertheless, there are many words in natural language, especially closed-class words like quantifiers (see below), which exhibit very little variability in meaning across sentences.
At the other end of the spectrum from programming languages and logics are systems where many or most meanings are highly context dependent.
The logical extreme-a system where each word has a different and unrelated meaning every time it occurs-is clearly of limited usefulness since it would make generalization impossible. Nevertheless, learners with sufficient memory capacity and flexibility of representation, such as deep learning models, can learn systems with very high degrees of contextual conditioning-in particular, higher than human language learners. An important goal for building systems that learn and generalize like people is to engineer systems with inductive biases for the right degree of systematicity. In §8, we give evidence that some neural systems are likely too biased toward allowing contextually conditioned meaning variability for words, such as quantifiers, which do not vary greatly in natural language.

Compositional Structure in Natural Language
Natural language distinguishes between content or open-class lexical units and function or closed-class lexical units. The former refers to categories, such as nouns and verbs, which carry the majority of contentful meaning in a sentence and which permit new coinages. Closed-class units, by contrast, carry most of the grammatical structure of the sentence and consist of things like inflectional morphemes (like pluralizing -s in English) and words like determiners, quantifiers, and negation (e.g., all, some, the in English). These are mostly fixed; adult speakers do not coin new quantifiers, for example, the way that they coin new nouns. Leveraging this distinction gives rise to the possibility of constructing probes based on jabberwocky-type sentences. This term references the poem Jabberwocky by Lewis Carroll, which combines nonsense open-class words with familiar closed-class words in a way that allows speakers to recognize the expression as well formed. For example, English speakers identify a contradiction in the sentence All Jabberwocks flug, but some Jabberwocks don't flug, without knowing a meaning for jabberwock and flug. This is possible because we expect the words all, some, but, and don't to contribute the same meaning as they do when combined with familiar words, as in All pigs sleep, but some pigs don't sleep.
Using jabberwocky-type sentences, we tested the generalizability of certain closed-class word representations learned by neural networks. Giving the networks many examples of each construction with a large variety of different content words, that is, large amounts of highly varied evidence about the meaning of the closed-class words, we asked during the test phase how fragile this knowledge is when transferred to new open-class words. That is, our probes combine novel open-class words with familiar closed-class words, to test whether the closed-class words are treated systematically by the network. For example, we might train the networks to identify contradictions in pairs like All pigs sleep; some pigs don't sleep, and test whether the network can identify the contradiction in a pair like All Jabberwocks flug; some Jabberwocks don't flug. A systematic learner would reliably identify the contradiction, whereas a non-systematic learner may allow the closed-class words (all, some, don't) to take on contextually conditioned meanings that depend on the novel context words.
Related Work

In contrast to our approach (testing novel words in familiar combinations), many previous studies probe systematicity by testing familiar words in novel combinations. Lake and Baroni (2018) adopt this approach in semantic parsing with an artificial language known as SCAN. Dasgupta et al. (2018, 2019) introduce a naturalistic NLI dataset, with test items that shuffle the argument structure of natural language utterances. In the inductive logic programming domain, Sinha et al. (2019) introduced the CLUTRR relational-reasoning benchmark. The novel-combinations-of-familiar-words approach was formalized in the CFQ dataset and associated distribution metric of Keysers et al. (2019). Ettinger et al. (2018) introduced a semantic-role-labeling and negation-scope labeling dataset, which tests compositional generalization with novel combinations of familiar words and makes use of syntactic constructions like relative clauses. Finally, Kim et al. (2019) explore pre-training schemes' abilities to learn prepositions and wh-words with syntactic transformations (two kinds of closed-class words which our work does not address).
A different type of systematicity analysis directly investigates learned representations, rather than developing probes of model behavior. This is done either through visualization (Veldhoen and Zuidema, 2017), training a second network to approximate learned representations using a symbolic structure (Soulos et al., 2019) or as a diagnostic classifier (Giulianelli et al., 2018), or reconstructing the semantic space through similarity measurements over representations (Prasad et al., 2019).

Natural Language Inference
We make use of the natural language inference (NLI) task to study the question of systematicity. The NLI task is to infer the relation between two sentences (the premise and the hypothesis). Sentence pairs must be classified into one of a set of predefined logical relations such as entailment or contradiction. For example, the sentence All mammals growl entails the sentence All pigs growl. A rapidly growing number of studies have shown that deep learning models can achieve very high performance in this setting (e.g., Evans et al., 2018).

Natural Logic
We adopt the formulation of NLI known as natural logic (MacCartney and Manning, 2009, 2014; Lakoff, 1970). Natural logic makes use of seven logical relations between pairs of sentences. These are shown in Table 1. These relations can be interpreted as the set-theoretic relationship between the extensions of the two expressions. For instance, if the expressions are the simple nouns warthog and pig, then the entailment relation (⊏) holds between these extensions (warthog ⊏ pig), since every warthog is a kind of pig. For higher-order operators such as quantifiers, relations can be defined between sets of possible worlds. For instance, the set of possible worlds consistent with the expression All blickets wug is a subset of the set of possible worlds consistent with the logically weaker expression All red blickets wug. Critically, the relationship between composed expressions such as All X Y and All P Q is determined entirely by the relations between X/P and Y/Q, respectively. Thus, natural logic allows us to compute the relation between the whole expressions using the relations between parts. We define an artificial language in which such alignments are easy to compute, and use this language to probe deep learning systems' ability to generalize systematically.
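To make the compositional computation concrete, here is a toy sketch of how the relation between All X Y and All P Q follows from the part relations. All names and tables are illustrative, not the paper's implementation; `<`, `>`, `=`, and `#` are ASCII stand-ins for forward entailment (⊏), reverse entailment (⊐), equality, and independence, and the join rule is a simplification of full natural logic.

```python
# "all" is downward monotone in its restrictor (noun slot) and upward
# monotone in its scope (verb slot); projection flips or preserves the
# part relation accordingly.
PROJECT_ALL = {
    "restrictor": {"=": "=", "<": ">", ">": "<"},
    "scope":      {"=": "=", "<": "<", ">": ">"},
}

def join(r1, r2):
    """Combine two projected relations (simplified: equality absorbs,
    agreement passes through, anything else falls back to '#')."""
    if r1 == "=":
        return r2
    if r2 == "=":
        return r1
    return r1 if r1 == r2 else "#"

def relation_all(noun_rel, verb_rel):
    """Relation between 'All X Y' and 'All P Q' given X/P and Y/Q."""
    return join(PROJECT_ALL["restrictor"][noun_rel],
                PROJECT_ALL["scope"][verb_rel])

# From "All mammals growl" to "All pigs growl": pigs < mammals, so the
# noun relation (mammals vs. pigs) is '>', which "all" flips to '<'.
print(relation_all(">", "="))  # '<' (forward entailment)
```

The key point the paper relies on is exactly this locality: the whole-sentence relation is a deterministic function of the part relations, so a systematic learner has everything it needs.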

[Table 1: the seven natural logic relations. Columns: Symbol, Name, Example, Set-theoretic definition.]

The Artificial Language
In our artificial language, sentences are generated according to the six-position template shown in Table 2, and include a quantifier (position 1), noun (position 3), and verb (position 6), with optional pre- and post-modifiers (positions 2 and 4) and optional negation (position 5). For readability, all examples in this paper use real English words; however, simulations can use uniquely identified abstract symbols (e.g., generated by gensym).
We compute the relation between position-aligned pairs of sentences in our language using the natural logic system (described in §5.2). Quantifiers and negation have their usual natural-language semantics in our artificial language; pre- and post-modifiers are treated intersectively. Open-class items (nouns and verbs) are organized into linear hierarchical taxonomies, where each open-class word is the sub- or super-set of exactly one other open-class item in the same taxonomy. For example, since all dogs are mammals, and all mammals are animals, they form the entailment hierarchy dogs ⊏ mammals ⊏ animals. We vary the number of distinct noun and verb taxonomies according to an approach we refer to as block structure, described in the next section.
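A minimal generator for the six-position template might look as follows. The vocabularies below are illustrative stand-ins (Table 2's actual fillers are not reproduced here), and the empty string encodes an omitted optional position.

```python
import random

# Hypothetical fillers for the six-position template: quantifier,
# optional pre-modifier, noun, optional post-modifier, optional
# negation, verb. Real simulations would use gensym-style symbols.
QUANTIFIERS = ["all", "some"]
PRE_MODS    = ["", "red"]
NOUNS       = ["dogs", "mammals", "animals"]   # a linear taxonomy
POST_MODS   = ["", "nearby"]
NEGATION    = ["", "don't"]
VERBS       = ["run", "move"]                  # run is a subset of move

def generate_sentence(rng):
    """Fill each template position, dropping empty optional slots."""
    slots = (rng.choice(QUANTIFIERS), rng.choice(PRE_MODS),
             rng.choice(NOUNS), rng.choice(POST_MODS),
             rng.choice(NEGATION), rng.choice(VERBS))
    return " ".join(tok for tok in slots if tok)

rng = random.Random(0)
print(generate_sentence(rng))
```

Seeding the generator makes sampled training and test sets reproducible across runs.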

Block Structure
In natural language, most open-class words do not appear with equal probability with every other word. Instead, their distribution is biased and clumpy, with words in similar topics occurring together. To mimic such topic structure, we group nouns and verbs into blocks. Each block consists of six nouns and six verbs, which form taxonomic hierarchies (e.g., lizards/animals, run/move). Nouns and verbs from different blocks have no taxonomic relationship (e.g., lizards and screwdrivers or run and read) and do not co-occur in the same sentence pair. Because each block includes six nouns and six verbs in a linear taxonomic hierarchy, no single block is intrinsically harder to learn than any other block.
The same set of closed-class words appears with all blocks of open-class words, and their meanings are systematic regardless of the open-class words (nouns and verbs) they are combined with. For example, the quantifier some has a consistent meaning whether it is applied to some screwdrivers or some animals. Because closed-class words are shared across blocks, models are trained on extensive and varied evidence of their behaviour. We present closed-class words in a wide variety of sentential contexts, with a wide variety of different open-class words, to provide maximal pressure against overfitting and maximal evidence of their consistent meaning.
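The block construction described above can be sketched as follows. Symbol names are hypothetical gensym-style identifiers; the only structural commitments taken from the text are six nouns and six verbs per block, each arranged in a linear subset taxonomy.

```python
def make_block(block_id, n_items=6):
    """Build one block: six nouns and six verbs in linear taxonomies,
    where item i is a subset of item i+1 (noun_0 < noun_1 < ...)."""
    nouns = [f"noun_{block_id}_{i}" for i in range(n_items)]
    verbs = [f"verb_{block_id}_{i}" for i in range(n_items)]
    # taxonomy as (sub, super) links along the linear hierarchy
    noun_tax = list(zip(nouns, nouns[1:]))
    verb_tax = list(zip(verbs, verbs[1:]))
    return nouns, verbs, noun_tax, verb_tax

nouns, verbs, noun_tax, verb_tax = make_block(0)
print(len(nouns), len(noun_tax))  # 6 items, 5 subset links
```

Because every block is generated by the same procedure, the blocks differ only in their (arbitrary) symbols, matching the claim that no block is intrinsically harder than another.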

Test and Train Structure
We now describe the structure of our training blocks, holdout test set, and jabberwocky blocks. We also discuss our two test conditions, and several other issues that arise in the construction of our dataset.
Training set: For each training block, we sampled (without replacement) one sentence pair for every possible combination of open-class words, that is, every combination of nouns and verbs (noun1, noun2, verb1, verb2). Closed-class words were sampled uniformly to fill each remaining position in the sentence (see Table 2). A random subset of 20% of training items was reserved for validation (early stopping) and not used during training.
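A sketch of this sampling scheme (function and variable names are hypothetical, and closed-class positions are reduced to a quantifier and optional negation for brevity): one premise/hypothesis pair per combination of open-class words, with closed-class slots filled uniformly at random.

```python
import itertools
import random

def sample_training_pairs(nouns, verbs, quantifiers, negations, rng):
    """One pair per (noun1, noun2, verb1, verb2) combination; the
    remaining (closed-class) slots are drawn uniformly."""
    pairs = []
    for n1, n2 in itertools.product(nouns, repeat=2):
        for v1, v2 in itertools.product(verbs, repeat=2):
            premise    = (rng.choice(quantifiers), n1,
                          rng.choice(negations), v1)
            hypothesis = (rng.choice(quantifiers), n2,
                          rng.choice(negations), v2)
            pairs.append((premise, hypothesis))
    return pairs

rng = random.Random(0)
pairs = sample_training_pairs(["dogs", "mammals"], ["run", "move"],
                              ["all", "some"], ["", "don't"], rng)
print(len(pairs))  # 2*2 noun combos x 2*2 verb combos = 16
```

Exhausting the open-class combinations while randomizing the closed-class slots is what balances open-class word instances, as discussed under Balancing below.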
Holdout test set: For each training block, we sampled a holdout set of forms using the same nouns and verbs, but disjoint from the training set just described. The sampling procedure was identical to that for the training blocks. These holdout items allow us to test the generalization of the models with known words in novel configurations (see §8.1).
Jabberwocky test set: Each jabberwocky block consisted of novel open-class items (i.e., nouns and verbs) that did not appear in training blocks. For each jabberwocky block, we began by following a sampling procedure analogous to that used for the training blocks, but with the novel open-class vocabulary.

Balancing: One consequence of the sampling method is that logical relations will not be equally represented in training. In fact, it is impossible to simultaneously balance the distributions of syntactic constructions, logical relations, and instances of words. In this trade-off, we chose to balance the distribution of open-class words in the vocabulary, as we are focused primarily on the ability of neural networks to generalize closed-class word meaning.
Balancing instances of open-class words provided the greatest variety of learning contexts for the meanings of the closed-class items.

Models
We analyze the performance of four simple baseline models known to perform well on standard NLI tasks, such as the Stanford Natural Language Inference dataset (Bowman et al., 2015). Following Conneau et al. (2017), the hypothesis u and premise v are individually encoded by neural sequence encoders such as a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) or gated recurrent unit (GRU; Cho et al., 2014). These vectors, together with their element-wise product u * v and element-wise difference u − v, are fed into a fully connected multilayer perceptron to predict the relation. The encodings u and v are produced from an input sentence of M words, w_1, ..., w_M, by a recurrent neural network f, which yields hidden representations h_t = f(w_t, h_{t-1}) for t = 1, ..., M; the sequence encoding is represented by its last hidden vector h_M. The simplest of the four models sets f to be a bidirectional gated recurrent unit (BGRU), which concatenates the last hidden state of a GRU run forwards over the sequence with the last hidden state of a GRU run backwards over it. Our second system is the InferSent model of Conneau et al. (2017), a bidirectional LSTM with max pooling (INFS). Here f is an LSTM, each word is represented by the concatenation of its forward and backward hidden states, and a fixed vector representation of the sequence is constructed by taking the maximum value over each dimension of the word representations. Our third model is a self-attentive sentence encoder (SATT), which uses an attention mechanism over the hidden states of a BiLSTM to generate the sentence representation (Lin et al., 2017). The attention mechanism computes a weighted linear combination of the word representations, u = Σ_i α_i h_i, where the weights α_i are given by a softmax over scores u_w^T ĥ_i, with ĥ_i = tanh(W h_i + b_w); here u_w is a learned context query vector and (W, b_w) are the weights of an affine transformation.
This self-attentive network also has multiple views of the sentence, so the model can attend to multiple parts of the given sentence at the same time. Finally, we test the Hierarchical Convolutional Network (CONV) architecture from Conneau et al. (2017), which is itself inspired by the AdaSent model (Zhao et al., 2015). This model has four convolution layers; at each layer, an intermediate representation u_i is computed by a max-pooling operation over feature maps. The final representation is a concatenation u = [u_1, ..., u_l], where l is the number of layers.
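The shared classification setup can be sketched in PyTorch as follows, using the BGRU variant as the encoder. This is an illustrative re-implementation under stated assumptions, not the authors' code: class names, hyperparameters, and dimensions are hypothetical, and the feature vector [u; v; u*v; u−v] follows the description above.

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    """Sketch of the Conneau et al. (2017)-style setup: a shared
    encoder produces u and v; an MLP classifies the concatenation
    [u; v; u*v; u-v] into the seven natural logic relations."""
    def __init__(self, vocab_size, emb_dim=64, hid_dim=128, n_relations=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # the BGRU baseline: last forward and backward hidden states
        self.encoder = nn.GRU(emb_dim, hid_dim,
                              bidirectional=True, batch_first=True)
        enc_dim = 2 * hid_dim
        self.mlp = nn.Sequential(
            nn.Linear(4 * enc_dim, hid_dim), nn.ReLU(),
            nn.Linear(hid_dim, n_relations))

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))  # h: (2, batch, hid)
        return torch.cat([h[0], h[1]], dim=-1)   # concat directions

    def forward(self, premise, hypothesis):
        u, v = self.encode(premise), self.encode(hypothesis)
        feats = torch.cat([u, v, u * v, u - v], dim=-1)
        return self.mlp(feats)

model = NLIClassifier(vocab_size=100)
logits = model(torch.randint(0, 100, (8, 6)),
               torch.randint(0, 100, (8, 6)))
print(logits.shape)  # one score per natural logic relation
```

Swapping the encoder (LSTM with max pooling, self-attention, or convolutions) while keeping the feature construction and MLP fixed yields the other three baselines.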

Probing Systematicity
In this section, we study the systematicity of the models described in §6.1. Recall that systematicity refers to the degree to which words have consistent meaning across different contexts, and is contrasted with contextually conditioned variation in meaning. We describe three novel probes of systematicity which we call the known word perturbation probe, the identical open-class words probe, and the consistency probe.
All probes take advantage of the distinction between closed-class and open-class words reflected in the design of our artificial language, and are performed on sentence pairs with novel open-class words (jabberwocky-type sentences; see §5.5). We now describe the logic of each probe.

Known Word Perturbation Probe
We test whether the models treat the meaning of closed-class words systematically by perturbing correctly classified jabberwocky sentence pairs with a closed-class word. More precisely, for a pair of closed-class words w and w′, we consider test items which can be formed by substituting w′ for w in a correctly classified test item. We allow both w and w′ to be any of the closed-class items, including quantifiers, negation, nominal post-modifiers, or the empty string (thus modeling insertions and deletions of these known, closed-class items). Suppose that Example 1 was correctly classified. Substituting some for all in the premise of Example 1 yields Example 2, and changes the relation from entailment (⊏) to reverse entailment (⊐).
There are two critical features of this probe. First, because we start from a correctly-classified jabberwocky pair, we can conclude that the novel words (e.g., wug and blickets above) were assigned appropriate meanings.
Second, since the perturbation only involves closed-class items which do not vary in meaning and have been highly trained, the perturbation should not affect the model's ability to correctly classify the resulting sentence pair. If the model does misclassify the resulting pair, it can only be because a perturbed closed-class word (e.g., some) interacts with the open-class items (e.g., wug) in a way that is different from the pre-perturbation closed-class item (i.e., all). This is non-systematic behavior.
In order to rule out trivially correct behavior where the model simply ignores the perturbation, we consider only perturbations which result in a change of class (e.g., ⊏ → ⊐) for the sentence pair. In addition to accuracy on these perturbed items, we also examine the variance of model accuracy on probes across different blocks. If a model's accuracy varies depending only on the novel open-class items in a particular block, this provides further evidence that it does not treat word meaning systematically.
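The probe's filtering logic can be sketched as follows. All helper names are hypothetical: `gold_fn` stands in for the natural logic labeler and `predict_fn` for the trained model; the demo at the bottom uses a toy labeler and an oracle model purely to exercise the filter.

```python
def perturbation_probe(correct_items, substitutions, gold_fn, predict_fn):
    """Swap one closed-class word in each correctly classified pair,
    keep only perturbations whose gold label changes (so ignoring the
    edit cannot succeed), and return accuracy on the kept items."""
    kept, hits = 0, 0
    for premise, hypothesis in correct_items:
        for old, new in substitutions:
            if old not in premise:
                continue
            p2 = tuple(new if w == old else w for w in premise)
            if gold_fn(p2, hypothesis) == gold_fn(premise, hypothesis):
                continue  # label unchanged: not a valid probe item
            kept += 1
            hits += predict_fn(p2, hypothesis) == gold_fn(p2, hypothesis)
    return hits / kept if kept else float("nan")

# toy demo: gold label depends only on whether the quantifiers match
def gold(p, h):
    return "ent" if p[0] == h[0] else "rev"

def oracle(p, h):        # a perfectly systematic "model"
    return gold(p, h)

items = [(("all", "dogs", "run"), ("all", "dogs", "move"))]
acc = perturbation_probe(items, [("all", "some")], gold, oracle)
print(acc)  # 1.0 for the oracle; a non-systematic model scores lower
```

A systematic model should score near its holdout accuracy here; the paper's finding is that the baselines do not.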

Identical Open-class Words Probe
Some sentence pairs are classifiable without any knowledge of the novel words' meaning; for example, pairs where premise and hypothesis have identical open-class words. An instance is shown in Example 3: the two sentences must stand in contradiction, regardless of the meaning of blicket or wug.
The closed-class items and compositional structure of the language are sufficient for a learner to deduce the relationships between such sentences, even with unfamiliar nouns and verbs. Our second probe, the identical open-class words probe, tests the models' ability to correctly classify such pairs.

Consistency Probe
Consider Examples 4 and 5, which present the same two sentences in opposite orders.
In Example 4, the two sentences stand in an entailment (⊏) relation. In Example 5, by contrast, the two sentences stand in a reverse entailment (⊐) relation. This is a logically necessary consequence of the way the relations are defined. Reversing the order of sentences has predictable effects for all seven natural logic relations: in particular, such reversals map ⊏ → ⊐ and ⊐ → ⊏, leaving all other relations intact. Based on this observation, we develop a consistency probe of systematicity. We ask, for each correctly classified jabberwocky block test item, whether the corresponding reversed item is also correctly classified. The intuition behind this probe is that whatever meaning a model assumes for the novel open-class words, it should assume the same meaning when the sentence order is reversed. If the reversed pair is not correctly classified, then this is strong evidence of contextual dependence in meaning.
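The reversal rule and the probe's scoring can be sketched in a few lines. Names are hypothetical; `<` and `>` are ASCII stand-ins for forward (⊏) and reverse (⊐) entailment, and every other relation label maps to itself under reversal.

```python
# reversing premise and hypothesis swaps the two entailment relations
# and leaves the other five natural logic relations unchanged
REVERSE = {"<": ">", ">": "<"}

def reversed_label(label):
    return REVERSE.get(label, label)

def consistency(correct_items, predict_fn):
    """Fraction of correctly classified (premise, hypothesis, label)
    items whose reversed counterpart is also classified correctly."""
    hits = sum(predict_fn(h, p) == reversed_label(y)
               for p, h, y in correct_items)
    return hits / len(correct_items)

# toy demo with a single forward-entailment item and an oracle that
# always answers '>' for the reversed order
items = [(("all", "X", "Y"), ("some", "X", "Y"), "<")]
print(consistency(items, lambda p, h: ">"))  # 1.0
```

A systematic model would score near 100% by construction, since every scored item was already classified correctly in its original order.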

Results
In this section, we report the results of two control analyses and of our three systematicity probes described above.

Analysis I: Holdout Evaluations
We first establish that the models perform well on novel configurations of known words. Table 3 reports accuracy on heldout sentence pairs, described in §5.5. The table reports average accuracies across training blocks together with the standard deviations of these statistics. As can be seen in the table, all models perform quite well on holdout forms across training blocks, with very little variance. Because these items use the same sampling scheme and vocabulary as the trained blocks, these simulations serve as a kind of upper bound on the performance and a lower bound on the variance that we can expect from the more challenging jabberwocky-block-based evaluations below.

Analysis II: Distribution of Novel Words
Our three systematicity probes employ jabberwocky-type sentences, that is, novel open-class words in sentential frames built from known closed-class words. Since models are not trained on these novel words, it is important to establish that they are from the same distribution as the trained words and, thus, that the models' performance is not driven by some pathological feature of the novel word embeddings. Trained word embeddings were initialized randomly from N(0, 1) and then updated during training. Novel word embeddings were simply drawn from N(0, 1) and never updated. Figure 1 plots visualizations of the trained and novel open-class word embeddings in two dimensions, using t-SNE parameters computed over all open-class words (van der Maaten and Hinton, 2008). Trained words are plotted as +, novel words as •. Color indicates the proportion of test items containing that word that were classified correctly. As the plot shows, the two sets of embeddings overlap considerably. Moreover, there does not appear to be a systematic relationship between rates of correct classification for items containing novel words and their proximity to trained words. We also performed a resampling analysis, determining that novel vectors did not differ significantly in length from trained vectors (p = 0.85). Finally, we observed the mean and standard deviation of the pairwise cosine similarity between trained and novel words to be 0.999 and 0.058, respectively, confirming that there is little evidence the distributions are different.
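The embedding scheme behind this control can be sketched as follows. Dimensions and counts are illustrative, not the paper's; the point is only that trained and novel vectors are drawn from the same N(0, 1) distribution, with novel vectors frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
trained = rng.standard_normal((50, 16))  # would be updated by training
novel   = rng.standard_normal((10, 16))  # drawn once, never updated

# resampling-style sanity check in miniature: compare mean vector
# lengths of the two sets (before any training updates, both come
# from the same distribution, so the gap should be small)
gap = abs(np.linalg.norm(trained, axis=1).mean()
          - np.linalg.norm(novel, axis=1).mean())
print(round(gap, 3))
```

In the actual analysis, the trained set has been updated by gradient descent, which is precisely why the length and cosine-similarity comparisons reported above are needed rather than being true by construction.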

Analysis III: Known Word Perturbation Probe
Recall from §7.1 that the known word perturbation probe involves insertion, deletion, or substitution of a trained closed-class word in a correctly classified jabberwocky-type sentence pair. Figure 2 plots the results of this probe. Each point represents a perturbation type: a group of perturbed test items that share their before/after target perturbed closed-class words and before/after relation pairs. The upper plot displays the mean accuracy of all perturbations, averaged across blocks, and the lower plot displays the standard deviations across blocks. All models perform substantially worse than in the holdout evaluation on at least some of the perturbations. In addition, the standard deviation of accuracy between blocks is higher than in the holdout tests. As discussed in §7.1, low accuracy on this probe indicates that closed-class words do not maintain a consistent interpretation when paired with different open-class words. Variance across blocks shows that under all models the behavior of closed-class words is highly sensitive to the novel words they appear with.
Performance is also susceptible to interference from sentence-level features. For example, consider the perturbation which deletes a post-modifier from a sentence pair in the negation relation, yielding a pair in the cover relation. The self-attentive encoder performs perfectly when this perturbation is applied to a premise (100% ± 0.00%), but not when applied to a hypothesis (86.60% ± 18.08%). Similarly, deleting the adjective red from the hypothesis of a forward-entailing pair results in an unrelated sentence pair (84.79% ± 7.50%), another forward-entailing pair (92.32% ± 3.60%), or an equality pair (100% ± 0.00%). All the possible perturbations we studied exhibit similarly inconsistent performance.

Analysis IV: Identical Open-Class Words Probe
Recall that the identical open-class words probe consists of sentence pairs in which all open-class lexical items are identical. Table 4 shows the accuracies for these probes in the small-language condition. Average accuracies across jabberwocky blocks are reported together with standard deviations. Accuracy on the probe pairs fails to reach the holdout test levels for most models and most relations besides #, and variance between blocks is much higher than in the holdout evaluation. Of special interest is negation (∧), for which accuracy is dramatically lower and variance dramatically higher than in the holdout evaluation.
The results are similar for the large language condition, shown in Table 5. Although model accuracies improve somewhat, variance remains higher than the heldout level and accuracy lower. Recall that these probe items can be classified while ignoring the specific identity of their open-class words. Thus, the models' inability to leverage this fact, and their high variance across different sets of novel open-class words, illustrate their sensitivity to context.

Analysis V: Consistency Probe
The consistency probe tests abstract knowledge of relationships between logical relations, such as the fact that two sentences that stand in a contradiction still stand in a contradiction after reversing their order. Results of this probe in the small-language condition are shown in Table 6, which reports the percentage of correctly labeled sentence pairs that, when presented in reverse order, were also correctly labeled.
The best-performing model on negation reversal is SATT, which correctly labeled reversed items 66.92% of the time. Although negation is notably more difficult than the other relations, every model, on every relation, exhibited inter-block variance higher than that of the holdout evaluations. Furthermore, as can be seen in Table 7, the large language condition yields little improvement. Negation pairs are still well below the holdout test threshold, still with a high degree of variation. Variation remains high for many relations, which is surprising because the means report accuracy on test items that were chosen specifically because the same item, in reverse order, was already correctly labeled. Reversing the order of sentences causes the model to misclassify the resulting pair, more often for some blocks than others.

Discussion and Conclusion
Systematicity refers to the property of natural language representations whereby words (and other units or grammatical operations) have consistent meanings across different contexts. Our probes test whether deep learning systems learn to represent linguistic units systematically in the natural language inference task. Our results indicate that, despite their high overall performance, these models tend to generalize in ways that allow the meanings of individual words to vary in different contexts, even in an artificial language where a totally systematic solution is available. This suggests the networks lack a sufficient inductive bias to learn systematic representations of words like quantifiers, which even in natural language exhibit very little meaning variation.
Our analyses contain two ideas that may be useful for future studies of systematicity. First, two of our probes (known word perturbation and consistency) are based on the idea of starting from a test item that is classified correctly, and applying a transformation that should result in a classifiable item (for a model that represents word meaning systematically). Second, our analyses made critical use of the differential sensitivity (i.e., variance) of the models across test blocks with different novel words but otherwise identical information content. We believe these are novel ideas that can be employed in future studies.