Filler-gaps that neural networks fail to generalize

It can be difficult to separate abstract linguistic knowledge in recurrent neural networks (RNNs) from surface heuristics. In this work, we probe for highly abstract syntactic constraints that have been claimed to govern the behavior of filler-gap dependencies across different surface constructions. For models to generalize abstract patterns in expected ways to unseen data, they must share representational features in predictable ways. We use cumulative priming to test for representational overlap between disparate filler-gap constructions in English and find evidence that the models learn a general representation for the existence of filler-gap dependencies. However, we find no evidence that the models learn any of the shared underlying grammatical constraints we tested. Our work raises questions about the degree to which RNN language models learn abstract linguistic representations.


Introduction
While sentences appear highly variable on the surface, many syntactic constructions share the same underlying constraints, which determine their acceptability or grammaticality, i.e., the extent to which they are considered "well formed" through adherence to the rules of grammar. One of the strongest pieces of evidence for the existence of these shared underlying constraints comes from filler-gap constructions such as:
(1) What does Leslie like __?
Filler-gap constructions contain a dependency between an overt filler (what in (1)) and a gap site (marked __ above). The filler is bound to a referent (e.g., Robin's painting) that can fill the gap:
(2) Leslie likes Robin's painting.
There are well-known restrictions (islands; Ross, 1967) that prevent certain words from participating in a filler-gap dependency. For example, it isn't possible to form a filler-gap dependency with prenominal (left-branch) noun modifiers:
(3) *Whose does Leslie like painting?
Further, very different filler-gap constructions obey shared underlying principles (e.g., subjacency; Chomsky, 1973). In this work, we probe recurrent neural network (RNN) language model understanding of these underlying principles in English. Recent work has claimed that recurrent neural network language models understand filler-gap dependencies (Chowdhury and Zamparelli, 2018; Wilcox et al., 2018, 2019). However, behavioral probing has suggested that this understanding is relatively superficial and doesn't reflect the underlying constraints that govern filler-gap acceptability (Chaves, 2020). An intermediate possibility, which we explore in this paper, is that RNNs do acquire a basic understanding of the underlying constraints but that the learned representations of the constraints are too weak to correctly drive behavior in behavioral probing tasks.
We use cumulative priming (van Schijndel and Linzen, 2018; Prasad et al., 2019) to test for representational overlap between disparate constructions. While we find some evidence that RNNs learn a general representation for the existence of filler-gap dependencies (in keeping with Wilcox et al., 2018, 2019), we find no evidence that RNNs learn a shared representation of the associated governing constraints (in keeping with Chaves, 2020).
Several recent papers have highlighted ways in which RNN behavior actually reflects shallow surface heuristics (McCoy et al., 2019;Chaves, 2020;Davis and van Schijndel, 2020). Note that the representational overlap we seek in the present study is actually a requirement for appropriate generalization of abstract knowledge to unseen data. This is what differentiates abstract knowledge from the surface heuristics that make RNN behavior fragile to adversarial methods. Therefore, our finding that RNNs fail to learn any shared abstract constraints across filler-gap constructions despite being sensitive to the existence of filler-gap dependencies raises questions about the ability of these models to learn abstract generalizable linguistic patterns.

Background
There are a number of different kinds of filler-gap constructions whose behavior is governed by a variety of different underlying constraints (see Table 1). Previous work has probed model understanding of filler-gap dependencies by testing model performance on individual construction types rather than the underlying constraints that might govern them (Chowdhury and Zamparelli, 2018; Wilcox et al., 2018; cf. Chaves, 2020). These studies have followed the logic of the subject-verb agreement probing literature (e.g., Linzen et al., 2016): a model that understands filler-gap dependencies should assign greater probability to grammatical filler-gap dependencies than to ungrammatical filler-gap dependencies. However, certain properties are shared across a variety of filler-gap constructions, giving rise to the hypothesis in the syntax literature that there are shared constraints that underlie multiple different filler-gap constructions. Chaves (2020) identifies failure cases where neural language models incorrectly rank sentences containing acceptable and unacceptable filler-gap dependencies, possibly because they have overlearned the individual filler-gap constructions without understanding the broader underlying constraints. In this paper, we test whether models understand four underlying filler-gap constraints that have been widely studied in the syntax literature. Specifically, we test whether multiple constructions that are governed by a single constraint share any representational features that are not present in other constructions. If such selective representational overlap exists, it could be an indication that the models do understand the underlying constraints but that their representation is too weak to correctly rank acceptable/unacceptable sentence pairs.
Cumulative priming has been introduced as a method for probing the linguistic representations encoded in RNNs (van Schijndel and Linzen, 2018; Prasad et al., 2019; Lepori et al., 2020). This approach involves fine-tuning pretrained models for a single epoch with a small amount of additional training data. The pretrained model acts as a filter on the linguistic features learned during fine-tuning. More salient features will be affected by the fine-tuning to a larger degree than less salient features. By measuring the responsiveness of the model to linguistic input before and after priming (fine-tuning), researchers can identify which linguistic features are salient for the pretrained model. Importantly, if one construction primes a different construction, there is representational overlap between the two constructions within the model. Therefore, this approach provides a direct method of separating abstract linguistic knowledge from surface heuristics.

Constructions
We analyze eight types of syntactic constructions involving filler-gap dependencies in English. Syntacticians often refer to these dependencies as extractions, which alludes to the linguistic theory that filler phrases originate at the gap site before being extracted to the filler site during language production.

Adjunct islands
An adjunct island is formed from an adjunct clause, out of which wh-extraction is not possible. Adjunct clauses include clauses introduced by because, if, and when, as well as relative clauses.
(4) a. She ate her hat because they announced the plan. b. *What did she eat her hat because they announced __?
Who is it probable that Bill likes __? + + (+)
Non-bridge verb island *How did she whisper that he had died __? ? ? ?
Table 1: Island constructions (rows) and the associated underlying constraints (L-marking, Subjacency, and the Empty Category Principle) that govern their behavior. We only examine constraints that are hypothesized to apply to more than one of our construction types. For constraints that are thought to be particularly influential for a construction, we denote the influence with +/-. For example, subject island extractions are unacceptable because they violate L-marking, while object extractions are acceptable because they adhere to L-marking. Discourse-Linking (D-linking) is an optional shared feature that can make some unacceptable constructions more acceptable.

Wh-islands
A wh-island is created by an embedded sentence which is introduced by a wh-word. Extraction out of a wh-island results in an unacceptable sentence.
(5) a. Sam wonders who solved the problem. b. *What does Sam wonder who solved __?

Subject islands
A subject island is formed from a subject clause or a subject phrase, out of which wh-movement is not possible.
(6) a. That he has met Julia Roberts is unlikely. b. *Who is that __ has met Julia Roberts unlikely?
(7) a. The rumour about Susan was circulating. b. *Whom was the rumour about __ circulating?

Left branch islands
Left branch islands consist of noun phrases with modifiers, such as possessive determiners and attributive adjectives, that appear on a left branch under the noun. These preceding modifiers of a noun cannot be extracted from a noun phrase.

Complex noun phrase islands
Complex noun phrase islands ban extraction from the clausal complement of a noun, and from a relative clause modifying a noun.
(12) a. You heard the rumour that Bill speaks a Balkan language. b. *What did you hear the rumour that Bill speaks __?
(13) a. They hired someone who cleans a dirty surface. b. *What dirty surface did they hire someone who cleans __?

Object extraction
Object extraction is when a filler-gap dependency involves an object clause or phrase. In contrast to subject islands, object extraction produces acceptable sentences.
(14) a. She told me that her mother is a teacher. b. Her mother, she told me __, is a teacher.
(15) a. It is important to invite Will to our party. b. Who is it important to invite __ to our party?

Non-bridge verb islands
Non-bridge verb islands ban extraction out of that-clause verb complements when the matrix verb is a non-bridge verb. Non-bridge verbs include manner-of-speaking verbs, such as whisper or shout.
(16) a. She thinks that he died in his sleep. b. How does she think that he died __?
(17) a. She whispered that he had died in his sleep. b. *How did she whisper that he had died __?
The unacceptability of non-bridge verb islands hinges on the frequency of the verb (Liu et al., 2019). Therefore, the degree to which different constraints govern their behavior is hotly debated. As such, we do not use this construction when probing for underlying constraint knowledge. However, we do include grammatical variants of this construction (16-b) as acceptable stimuli in our general filler-gap analysis (Section 7.2).

Constraints
We analyze four underlying syntactic principles that have been hypothesized to govern the behavior of the above constructions.

Subjacency
Subjacency (Chomsky, 1973) is defined in terms of the notion of extraction mentioned in the previous section. Syntacticians theorize that a filler is iteratively extracted from its gap site to particular possible landing sites during language production. The subjacency principle states that extraction is only permitted if all landing site positions that intervene between the filler and the gap are unfilled during the extraction process. If the possible structural positions are unavailable because they are filled with another lexical item, extraction is blocked and the resulting filler-gap dependency is deemed ungrammatical. The effect of the subjacency constraint can be observed in the above examples of Wh-islands, subject islands, left branch islands, and complex noun phrase islands.

Empty Category Principle
The Empty Category Principle (ECP; Kayne, 1980;Chomsky, 1981) is a syntactic constraint that requires a gap be properly governed. To be properly governed, gaps must be identifiable as empty positions in the surface structure of a sentence, which allows a tree structure to "remember" what has happened at earlier stages of a sentence's derivation. Adherence to this constraint makes extraction of a wh-word from a subject or adjunct position ungrammatical, while extraction from an object position or from a coordinate structure island is grammatical.

L-marking
L-marking (Chomsky, 1986) is a process that defines the types of categories that act as barriers to movement, including extraction. A category is L-marked if and only if it gets its theta role from a lexical head. A theta role specifies the number and type of arguments that are syntactically required by a particular verb. For example, direct objects in English receive theta roles from the main verb, while adjuncts and subjects do not. Movement is grammatical only when it occurs out of an L-marked phrase, as in object extraction but not in adjunct or subject islands.

D-linking
Discourse-linking or D-linking (Pesetsky, 1987) indicates that there is a pre-existing contextual relationship between a filler and its associated noun phrase (e.g., which man). D-linked phrases contrast with non-discourse linked interrogative pronouns such as who, which do not necessarily imply familiar discourse entities. Left branch island extractions and wh-island extractions become more acceptable with a D-linked wh-phrase (Pesetsky, 1987; Atkinson et al., 2015). For example:
(18) a. ??Which book did Will ask why John read __?
b. *What did Will ask why John read __?
Since D-linking is an optional feature that can be added to filler-gap constructions, it isn't something that can be violated, per se. Therefore, in our analyses to probe for D-linking knowledge we only test constructions that adhere to D-linking (18-a).

Models
We focused in this work on recurrent neural language models with long short-term memory units (LSTMs; Hochreiter and Schmidhuber, 1997). We analyzed five of the highest performing models released by van Schijndel et al. (2019), 4 who showed that these models perform comparably to state-of-the-art transformers GPT and BERT on many simple syntactic agreement tasks. The models are 2-layer LSTMs with 400 hidden units per layer, each with a unique random initialization, trained on 80 million training tokens of English Wikipedia data. Analyzing multiple similar models with different random seeds helps ensure that our results are more representative of a class of models rather than simply revealing how a single exceptional model behaves (e.g., BERT; Devlin et al., 2019). This is very important given the speed with which new individual models supplant each other in the literature. Our results are averaged across all five of our models.

Method
We generated 40 sentences per construction type, partitioned into a prime set (15 sentences) and a test set (25 sentences). Sentences are available in the supplementary materials. We measured model performance as the average surprisal (negative log-probability; Shannon, 1948; Hale, 2001) experienced by a model M when processing each word w_i of each sentence s_j in a set S.
4 https://zenodo.org/record/3559340
Priming is achieved by giving the model a single training epoch on the set of priming stimuli S_P. This process produces a modified model M_P whose performance on a test set S_T differs from the original model in a way that gives insight into the representations of the original model. One way to think about this is that pretraining provides a certain kind of model initialization. Priming the model moves the model representations along gradients which are characterized by an interaction of the priming stimuli and the initial pre-trained model state. If a set of priming stimuli has a consistent set of features, the initial state's sensitivity to that set of features can be probed with those stimuli. Following Prasad et al. (2019), we take the raw effect of priming (also known as adaptation) to be the change in performance on S_T between M and M_P.
We are actually interested in the interaction between the original model and the priming set, but the raw measure also includes the interaction between the original model and the test set. Less expected (more surprising) test constructions can produce larger measures of priming simply because the original model has more room for improvement (Prasad et al., 2019). Therefore, we used linear regression to predict the size of the priming effect from the original model's performance on the test set. To obtain a more appropriate adaptation effect (AE) for analysis, we subtracted out the predicted linear relation between the original model's test performance and the size of the final priming effect. This measure of priming more directly reflects the interaction of the original model with the priming set, normalizing the adaptation effect by each model and prime construction, and producing a comparable measure to that studied by Prasad et al. (2019). Across all analyses, greater values of the adaptation effect indicate greater similarity between adaptation and test structures.
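As a rough illustration of the measures described in this section, the following Python sketch computes an average surprisal and a regression-corrected adaptation effect. The function names, the input format (per-word conditional probabilities, and one average surprisal per test set before and after priming), the use of base-2 logarithms, the sign convention (surprisal before priming minus surprisal after priming), and the use of `numpy.polyfit` for the linear fit are all assumptions made for illustration; this is not the authors' released code.

```python
import math

import numpy as np


def average_surprisal(sentence_word_probs):
    """Mean surprisal (in bits) over every word of every sentence in a set.

    sentence_word_probs: list of sentences, each a list of the model's
    probabilities P(w_i | preceding words). Hypothetical input format.
    """
    surprisals = [-math.log2(p) for sent in sentence_word_probs for p in sent]
    return sum(surprisals) / len(surprisals)


def residualized_adaptation(surp_before, surp_after):
    """Adaptation effect with the linear effect of baseline surprisal removed.

    surp_before / surp_after: average surprisal on each test set under the
    original model M and the primed model M_P (same order in both).
    """
    before = np.asarray(surp_before, dtype=float)
    after = np.asarray(surp_after, dtype=float)
    # Assumed sign convention: positive means priming reduced surprisal.
    raw = before - after
    # Fit raw ~ intercept + slope * before, then keep only the residual.
    slope, intercept = np.polyfit(before, raw, deg=1)
    predicted = intercept + slope * before
    return raw - predicted
```

In this sketch the regression would be fitted separately for each model and prime construction, in line with the normalization described above; how the observations are grouped is left to the caller.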
Recently, Kodner and Gupta (2020) have raised concerns about the efficacy of this technique in probing model representations. Specifically, they showed that non-syntactic models can produce qualitative patterns that appear to mimic the effects of syntactic priming. However, their results demonstrate that while the qualitative patterns may be similar, syntactic priming produces much larger effects in syntactic models compared with non-syntactic models. Therefore, their results should not be taken as an indictment of this methodology but simply as an indication that reasonable baselines must be employed when using this probing method.

Probing for Filler-Gaps

Null-Prime Baseline
Adaptation effects are difficult to interpret on their own. A positive effect indicates that priming produced more accurate predictions in a given class, while a negative effect indicates the opposite, but the magnitude of the effect is tricky to interpret. Following Prasad et al. (2019), for each analysis we define a class of interest and then compare within-class adaptation effects to cross-class adaptation effects. However, as Kodner and Gupta (2020) point out, spurious correlations may be introduced during stimulus creation/selection. Therefore, we also compare to null-primed models, which use the same original models but are primed on prime sets whose sentences are shuffled at the word level (as in Example (19-b)):
(19) a. *What does Sam wonder who solved __?
b. *Sam wonder What does solved who?
These shuffled null-prime stimuli do not contain any filler-gap dependencies, and so the adaptation effect from null-priming must represent phenomena in which we are not interested (e.g., lexical priming). Our effect-of-interest (filler-gap knowledge) is therefore reflected in any adaptation effect in excess of the null-prime effect.
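The null-prime baseline can be pictured with a short Python sketch. It assumes whitespace tokenization and a fixed random seed, and the helper names are hypothetical; the paper does not specify how the shuffling was implemented.

```python
import random


def shuffle_words(sentence, rng):
    """Return the sentence with its whitespace-separated tokens reordered."""
    words = sentence.split()
    # Note: a random shuffle can occasionally reproduce the original order,
    # especially for very short sentences.
    rng.shuffle(words)
    return " ".join(words)


def make_null_prime_set(prime_sentences, seed=0):
    """Build a word-shuffled copy of a prime set (assumed procedure)."""
    rng = random.Random(seed)
    return [shuffle_words(s, rng) for s in prime_sentences]


def excess_over_null(primed_effect, null_effect):
    """Adaptation effect remaining after subtracting the null-prime effect."""
    return primed_effect - null_effect
```

Because the shuffled set keeps exactly the same words as the original prime set, any adaptation it produces reflects lexical overlap and similar factors rather than the dependency structure.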

Priming Filler-Gap Existence
First we ask whether the models have learned to represent the overall existence of a filler-gap dependency. To test this, we partition our stimuli into wholly acceptable constructions (involving object extraction, bridge verb extraction, and instances of left branch extraction and coordinate structure extraction) and wholly unacceptable constructions (the remaining constructions). We then test whether grammatical constructions can prime ungrammatical constructions and vice versa. Although at the sentence level, ungrammatical constructions do not have a resolvable filler-gap dependency, that unacceptability only manifests at or near the end of the sentence. From the perspective of our unidirectional models, both sets of sentences initially require the retention of an apparent "filler." We therefore hypothesize that if the models understand filler-gap, both of these sets should initially contain a shared unidirectional representation of filler-gap existence.
We find that grammatical constructions prime ungrammatical constructions beyond the baseline shuffled adaptation effect (Figure 1). In other words, fine-tuning on grammatical items teaches the models how to process ungrammatical items with apparent filler-gap dependencies. We also find that ungrammatical items prime grammatical items in a similar fashion. Since the single unifying feature present in both grammatical and ungrammatical constructions is the presence of an apparent filler-gap dependency, these results suggest that the models contain a representation of filler-gap existence that is shared across constructions.

Filler-Gap Existence Baseline
In the previous section, we found that RNNs represent the existence of filler-gap dependencies similarly across construction types. We are therefore interested in whether the models systematically differentiate between filler-gap constructions that are governed by different constraints. If so, we would expect the shared representation of a constraint to produce greater priming within a constraint than across constraints. Therefore, rather than using shuffled sentences, our lower-bound baseline in this section consists of the adaptation effect from priming on sentences that do not share a constraint (green bars). We also use an upper-bound baseline adaptation effect from when models are tested and primed with the same syntactic construction (though the actual sentences differed; gold bars).

Priming Filler-Gap Constraints
Our results are presented in Figures 2 and 3. We divided our constructions based on those that consistently adhere to or violate particular constraints. If an abstract filler-gap constraint is learned by the model, then constraint adherence should prime adherence in the test set and constraint violation should prime violation in the test set (pink bars). Further, true understanding of a constraint would mean that constraint adherence would prime subsequent adherence more than subsequent violation and vice versa (purple bars reflect adherence priming violation and vice versa).
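The matched versus mismatched comparison can be sketched as follows. The record format (prime status, test status, adaptation effect) and the function name are assumptions for illustration, and any numbers passed in would be placeholders rather than results from the paper.

```python
from statistics import mean


def summarize_constraint_priming(records):
    """Average adaptation effects for matched and mismatched conditions.

    records: iterable of (prime_status, test_status, adaptation_effect),
    where each status is "adhere" or "violate" (hypothetical encoding).
    """
    records = list(records)
    # Matched: adherence priming adherence, or violation priming violation.
    matched = [ae for prime, test, ae in records if prime == test]
    # Mismatched: adherence priming violation, or vice versa.
    mismatched = [ae for prime, test, ae in records if prime != test]
    return {"matched": mean(matched), "mismatched": mean(mismatched)}
```

Under the logic above, a model that had learned a constraint would be expected to show a larger matched than mismatched average; the significance tests reported in the text are not reproduced by this sketch.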
For subjacency, simple existence of filler-gap primed the model significantly more than other constructions involving subjacency. Adherence to D-linking primed subsequent adherence to D-linking significantly more than simple filler-gap existence did, suggesting that perhaps the models do have an abstract representation of D-linking (but see Section 8.3).
Constructions involving L-marking and ECP produced significantly more priming in the mismatched adherence condition (adherence priming violation; violation priming adherence) than the matched adherence condition (adherence priming adherence; violation priming violation), suggesting that these constraints aren't learned by RNN language models. In fact, matched adherence in L-marking was not significantly different than priming with unrelated filler-gap sentences. Matched adherence in ECP did produce significantly greater priming than filler-gap existence, but since the priming effect was even greater for mismatched adherence, we can conclude that the model did not have an abstract representation of ECP that modulates filler-gap acceptability in a predictable way.

D-Linking Modulation
Our priming results in the previous subsection suggested that D-linked stimuli prime subsequent D-linking more than simple filler-gap existence does. However, D-linking also has an explicit surface cue, sentence-initial 'which', that could produce representational clustering even without an abstract linguistic concept of the constraint. Therefore, in this section, we use a behavioral probe to determine whether the models actually encode the underlying constraint.
As noted above, D-linking increases the acceptability of a filler-gap dependency.
(20) a. ??Which book did Will ask why John read __?
b. *What did Will ask why John read __?
At the filler, D-linking provides semantic clues to help predict the gap site (e.g., books will be bought or read but not eaten), and at the gap site D-linking greatly reduces the set of possible referents and eases retrieval of the correct referent both for syntactic attachment of the filler and for comprehension of the sentence (e.g., John did not read a sign or a scroll). We therefore expect that a model that understands D-linking will find D-linked sentences easier to process, similar to humans. We compared model performance (average surprisal; Equation 1) on each set of constructions. If a model uses D-linking in a human-like way, the D-linked sentences (like (20-a)) should produce better performance (be less surprising) than the non-D-linked sentences (like (20-b)).
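A minimal Python sketch of this behavioral comparison is given below. It assumes that per-word surprisal values (in bits) have already been obtained from a model for each sentence; the function names and the input layout are hypothetical.

```python
def mean_surprisal(per_word_surprisals):
    """Average surprisal over all words of all sentences in a set."""
    flat = [s for sentence in per_word_surprisals for s in sentence]
    return sum(flat) / len(flat)


def d_linking_preference(d_linked, non_d_linked):
    """Difference in average surprisal between the two stimulus sets.

    A positive value means the D-linked set was less surprising, which is
    the direction expected if the model used D-linking as humans do.
    """
    return mean_surprisal(non_d_linked) - mean_surprisal(d_linked)
```

A negative value from `d_linking_preference` would correspond to a model that finds the D-linked set more surprising than the non-D-linked set.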
In contrast to humans, RNNs find D-linked sentences more surprising than non-D-linked ones (Figure 4). In other words, the models prefer the less acceptable filler-gap constructions. These results suggest that the underlying D-linking feature is not learned even in a construction-specific way, let alone in a way that is shared across multiple constructions.
Based on the findings in the current and preceding sections, we conclude that RNN language models do not go beyond representing the existence of filler-gap dependencies to representing any of the shared underlying constraints we studied here.

Discussion
Our results support previous behavioral findings that recurrent neural language models acquire some abstract concept of "filler-gap dependency" (Chowdhury and Zamparelli, 2018; Wilcox et al., 2018, 2019), but go further than those findings by indicating that this concept is representationally shared across different constructions. However, our findings also indicate that RNNs do not encode any of the syntactic filler-gap constraints we studied.
Both grammatical and ungrammatical sentences involving apparent filler-gap dependencies cause models to anticipate the existence of subsequent filler-gap dependencies. However, in two of four cases, the baseline adaptation effect from priming on unrelated filler-gap constructions was comparable to or greater than that from priming on other constructions governed by the same constraint. Constructions that violated ECP and L-marking were represented similarly to constructions that adhered to those same constraints. And models assigned lower probability to sentences involving D-linking than sentences without D-linking, which is the opposite of human results.
Overall, our results provide robust evidence that Chaves (2020) was correct that recurrent neural language models do not fully understand filler-gap constructions. While it is entirely possible that the constraints analyzed in this work do not actually govern filler-gap dependencies (i.e. these particular syntactic theories may be incorrect), we chose four constraints that have been very widely studied by syntacticians precisely because of their broad coverage. Therefore, even if the underlying syntactic theories are incorrect, it is conceivable that RNNs could induce these constraints as plausible abstractions to aid in filler-gap processing. It is therefore striking that RNNs learn none of them.
One might wonder whether our priming sets simply needed to be larger to observe the desired priming effects. Our priming sets consisted of 15 items, which is comparable to the number of priming stimuli used by Prasad et al. (2019) and Kodner and Gupta (2020). Since the constraints we study involve fewer surface cues, they could require more priming data to produce noticeable effects. However, our non-baseline adaptation effects were around 1.5 bits, which is much larger than the 0.5-1 bit priming effects seen in those previous studies. Since we are already seeing large priming effects with these constructions, it seems unlikely that increasing the amount of priming data would produce qualitatively different effects.
We selected four common, well-studied filler-gap influences from the syntax literature and tested whether RNNs shared representational features across filler-gap constructions in a way that would suggest they had learned those constraints. While we did find evidence that RNNs encode some general representation of the existence of filler-gap dependencies, we found no evidence for more abstract underlying shared constraints. That is, while we find that RNN language models can learn abstract representations that are shared across constructions, our work raises questions about the depth of such abstractions.