Investigating Lexical Variability in Language Models

Neural language models learn, to varying degrees of accuracy, the grammatical properties of natural languages. In this work, we investigate whether there are systematic sources of variation in the language models' accuracy. Focusing on subject-verb agreement and reflexive anaphora, we find that certain nouns are systematically understood better than others, an effect which is robust across grammatical tasks and different language models. Surprisingly, we find that across four orders of magnitude, corpus frequency is unrelated to a noun's performance on grammatical tasks. Finally, we find that a novel noun's grammatical properties can be few-shot learned from various types of training data. The results present a paradox: there should be less variation in grammatical performance than is actually observed.


Introduction
Neural language models (Howard and Ruder, 2018; Devlin et al., 2019; Radford et al., 2019) have achieved success in both text prediction and downstream tasks such as question-answering, text classification, and natural language inference. The strong performance of these models raises scientific questions about the knowledge they have acquired, in particular, about the abstractness and generality of their linguistic representations.
Previous work has investigated the linguistic representations of neural language models in several domains, and found varying evidence for how linguistically adequate these representations are (Lau et al., 2017; Marvin and Linzen, 2018; Goldberg, 2019). This work has employed psycholinguistic methodology in order to elicit grammatical judgments from these models, inferring the models' underlying representations from the patterns of judgments.
In the current work, we focus on the variation in grammatical knowledge that potentially exists within a neural language model. Just as in human psycholinguistic tasks, previous work on neural LMs has observed variability in grammatical judgments between different sentences; not all violations of a grammatical constraint are judged to be equally bad. It is not clear, however, whether there are systematic sources of variation in these judgments, and if so, what the sources are.
We focus on variation among lexical items, using English subject-verb agreement and reflexive anaphora as a case study. We first ask whether language models learn the grammatical properties of some nouns more accurately than those of others. We do this by measuring the accuracy of language models when making grammatical judgments involving different nouns. We find systematic variation among nouns: nouns that perform well on one task or language model are more likely to perform well on other tasks or other language models. We then consider possible sources of the observed variation between nouns, finding that the grammatical properties of nouns are paradoxically easy to learn; our results suggest that there should be much less variation than is actually observed.

Related work
A number of other studies have investigated the linguistic representations of neural models, both language models specifically and networks trained using other objectives. Linzen et al. (2016); Gulordava et al. (2018); Kuncoro et al. (2018) probe the ability of LSTMs to learn hierarchical structures. Warstadt et al. (2019b) introduces a large-scale corpus of grammatical acceptability judgements, trains RNNs to predict these judgments, and concludes that the models outperform unsupervised baselines, but fall far short of human performance. Lepori et al. (2020) finds that tree-based RNNs outperform sequential RNNs on number prediction tasks, but that fine-tuning on an artificially-generated augmentation set can bring the models closer to parity.
Other work has focused on probing whether neural language models have acquired adequate representations of specific linguistic phenomena. Marvin and Linzen (2018) and Goldberg (2019) use a minimal pair methodology to assess the grammatical knowledge of RNNs and BERT, looking at subject-verb number agreement, reflexive anaphora, and negative polarity items. Wilcox et al. (2018) examines whether RNN language models exhibit wh-licensing interactions on surprisal associated with gaps, concluding they can represent long-distance filler-gap dependencies and learn certain island constraints. Other work studies whether neural language models show evidence for incremental syntactic state representations using psycholinguistic methodology. Warstadt et al. (2019a) studies BERT's knowledge of NPIs, focusing on differences between tasks: boolean classification (e.g. Linzen et al. 2016 and Warstadt et al. 2019b), minimal pair comparisons (e.g. Marvin and Linzen 2018), and probing tasks (e.g. Giulianelli et al. 2018).

Approach
We use the minimal pair methodology of Marvin and Linzen (2018) to investigate the grammatical judgments of neural language models. A minimal pair is a pair of sentences that differ in acceptability due to a difference in just one grammatical property. If the model understands the grammatical phenomenon being studied, it should assign higher probability to the grammatical sentence than to the ungrammatical sentence. Table 1 shows the 10 grammatical tasks (Marvin and Linzen, 2018) and the templates used for generating minimal pairs. The tasks fall into two general categories: subject-verb agreement (SVA) and reflexive anaphora (RA). The first SVA task, SVA Simple, probes whether the model understands that subject number must agree with the number of third-person present verbs:
(1) a. The cat walks.
b. * The cat walk.
The other SVA tasks probe whether the models have more sophisticated representations of number agreement. For example, the SVA PP task measures whether the model is able to ignore distractors ("boys") which occur between the head of the subject and the verb:
(2) a. The cat next to the boys jumps.
b. * The cat next to the boys jump.
The object relative clause tasks probe whether the model accurately maintains the head's number in the presence of an embedded clause. Marvin and Linzen (2018) provide extensive discussion of the linguistic motivation for these tasks. The RA tasks measure whether the language model understands the structural conditions on the binding of reflexive pronouns. The tasks make use of the following property of English reflexives: a reflexive pronoun needs to agree in number with its antecedent. The RA Sent.Comp task evaluates whether the model understands that reflexives must be in the same clause as their antecedents:
(3) a. The lawyers said the defendant incriminated himself.
b. * The lawyers said the defendant incriminated themselves.
The RA tasks involving object relative clauses evaluate whether the models understand that reflexive anaphora do not bind to the noun in an embedded clause but rather to the head noun.

Measuring the performance of a noun
We use these tasks to measure how well the model understands the grammatical properties of a particular target noun. A given target noun is substituted as the TargetNoun in each of the task templates shown in Table 1. This gives a partially specified template. For example, substituting the target noun "zombie" in the SVA Simple template results in:
(4) The zombie Verb.
Given each of these partially specified templates, 500 minimal pairs are randomly sampled by filling in the remaining lexical items. Finally, the model's grammatical judgments on the 500 minimal pairs are computed (by taking the difference in scores between the grammatical and ungrammatical variants) and averaged, resulting in a task performance score for the noun.

Limitations
These analyses are limited in several respects. First, only two categories of grammatical task (subject-verb agreement and reflexive anaphora) are used. By using a wider range of tasks, it will be possible to investigate a larger set of grammatical phenomena outside of number agreement. Second, while the study focuses on the grammatical information carried by nouns, other lexical types such as verbs are likely to carry this information as well. Future work can determine whether the approach generalizes to verbs and other lexical types.
Finally, while the study uses acceptability judgments in order to determine the models' grammatical knowledge, other probing tasks exist and may produce different results (Warstadt et al., 2019a). We use acceptability judgments because, to the best of our knowledge, feature probing has not been extensively studied for GPT-2 or Transformer-XL. Different probing architectures may produce different results for these models. It would be desirable to understand the robustness of the current results to the choice of experimental readout.

Methods
In this section we describe the process of calculating a target noun's task performance score in more detail.

Sentence generation
Using WordNet (Fellbaum, 1998) and VerbNet (Schuler, 2005), we compiled a list of lexical items as shown in Table 2. The target nouns were drawn from the Noun list, which consisted of animate nouns. Only nouns with distinct singular and plural forms were included. All verbs in the Verb set have an intransitive reading. For each pair of task template and target noun, 500 sentences were randomly sampled by choosing lexical items from the appropriate word lists.
For each sampled sentence, 2*2 or 2*2*2 versions were generated (depending on the template). These versions varied the grammaticality of the sentence and the plurality of the target noun and any distractor nouns. For example, for the SVA Simple task, 2*2 versions are generated for every sampled sentence:
(5) a. Singular-Grammatical: The horse walks.
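The 2*2 expansion for SVA Simple can be illustrated with a short Python sketch. This is not the authors' code: the function name and the dictionary layout are hypothetical, and the way the remaining three variants are built (swapping in the non-agreeing verb form, and pluralizing the noun) is our reading of the description above.

```python
from itertools import product

def sva_simple_variants(noun_sg, noun_pl, verb_sg, verb_pl):
    """Return the 2*2 variants of 'The Noun Verb.' keyed by (number, is_grammatical)."""
    nouns = {"singular": noun_sg, "plural": noun_pl}
    # Third-person present verb form that agrees with each noun number.
    agreeing = {"singular": verb_sg, "plural": verb_pl}
    # The ungrammatical variant uses the verb form of the opposite number.
    mismatched = {"singular": verb_pl, "plural": verb_sg}
    variants = {}
    for number, grammatical in product(("singular", "plural"), (True, False)):
        verb = agreeing[number] if grammatical else mismatched[number]
        variants[(number, grammatical)] = f"The {nouns[number]} {verb}."
    return variants

variants = sva_simple_variants("horse", "horses", "walks", "walk")
```

Templates with a distractor noun would add a third binary factor (the distractor's number), giving the 2*2*2 case.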

Models
Our experiments use three models: Transformer-XL, GPT-2 (Radford et al., 2019), and BERT (Devlin et al., 2019). We use the Hugging Face implementations (Wolf et al., 2019) with the pre-trained models transfo-xl-wt103, which is trained on the WikiText-103 dataset, gpt2-xl, which is trained on the WebText dataset, and bert-base-uncased, which is trained on BookCorpus and English Wikipedia.

Sentence scoring
We now describe how a score was calculated for a particular sampled sentence. For each of the sentence variants (e.g. Example 5), the model computes a score. In the case of Transformer-XL and GPT-2, this score is simply the log probability of the string. For example, for Transformer-XL:
$\mathrm{score}(s) = \log P_{\mathrm{TXL}}(s)$
where $P_{\mathrm{TXL}}$ is the Transformer-XL language model probability distribution. For BERT, given its masked language model architecture, we follow the approach of Goldberg (2019). For the SVA tasks, we compute the log conditional probability of the verb whose number must agree with the target noun. For the RA tasks, we compute the log conditional probability of the reflexive pronoun. Both probabilities are conditioned on the left and right contexts.
Given the scores for a sentence's variants, we compute an overall score for the sentence, which captures how much the model prefers the grammatical variants to the ungrammatical variants. For each sampled sentence S, there are either 2 or 4 minimal pairs among its variants. In Example 5, a. and b. form a minimal pair, and c. and d. form a minimal pair. Letting $s_a, \ldots, s_d$ denote these variants, the overall score for the sentence combines, across the minimal pairs, the difference in score between the grammatical and the ungrammatical variant. The formula when there are four minimal pairs is similar.

Noun scoring
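We next compute an overall score for the target noun. As described in Section 3.1, for a specific target noun n and task, we sample 500 sentences $S_1, \ldots, S_{500}$. The noun's score for this task is then given by the average of the sentence scores:
$\mathrm{score}(n) = \frac{1}{500} \sum_{i=1}^{500} \mathrm{score}(S_i)$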
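The two levels of scoring can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the log probabilities are invented, and averaging (rather than summing) the grammatical-minus-ungrammatical differences over a sentence's minimal pairs is an assumption, since the text specifies only that differences are taken per pair and that sentence scores are averaged per noun.

```python
from statistics import mean

def sentence_score(variant_scores, minimal_pairs):
    """Combine a sentence's variant scores into one preference score.

    variant_scores: dict mapping a variant label to its model score (log probability).
    minimal_pairs: list of (grammatical_label, ungrammatical_label) tuples.
    Assumption: the per-pair differences are averaged.
    """
    return mean([variant_scores[g] - variant_scores[u] for g, u in minimal_pairs])

def noun_score(sentence_scores):
    """Average the sentence scores sampled for one noun on one task."""
    return mean(sentence_scores)

# Illustrative (made-up) log probabilities for the four variants of one sentence.
example = {"a": -10.0, "b": -12.5, "c": -11.0, "d": -11.5}
pairs = [("a", "b"), ("c", "d")]
s = sentence_score(example, pairs)
n = noun_score([s, 0.5, 1.0])
```

A positive sentence score means the model prefers the grammatical variants on average; a noun's task score is positive when this holds across its sampled sentences.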

Word filtering and tokenization
Words were removed from a particular model if either their singular or plural form was tokenized to unk, or if their singular and plural forms were assigned different numbers of tokens. For BERT, words in the Verb set were removed if they were assigned more than one token, as BERT does not model the joint distribution over multiple masked tokens.
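The noun filter described above can be sketched as a small predicate. The tokenizer here is a hypothetical stand-in (a lookup table), not any model's real tokenizer, and the "<unk>" string is an assumed spelling of the unknown token; only the two filtering conditions come from the text.

```python
def keep_noun(singular, plural, tokenize, unk_token="<unk>"):
    """Keep a noun only if neither form maps to the unknown token and
    both forms are split into the same number of tokens."""
    sg_tokens, pl_tokens = tokenize(singular), tokenize(plural)
    if unk_token in sg_tokens or unk_token in pl_tokens:
        return False
    return len(sg_tokens) == len(pl_tokens)

# Hypothetical toy tokenizer: known words map to token lists, everything else to <unk>.
TOY_VOCAB = {
    "cat": ["cat"],
    "cats": ["cats"],
    "zombie": ["zombie"],
    "zombies": ["zombie", "##s"],
}

def toy_tokenize(word):
    return TOY_VOCAB.get(word, ["<unk>"])

kept = [
    noun
    for noun, plural in [("cat", "cats"), ("zombie", "zombies"), ("wug", "wugs")]
    if keep_noun(noun, plural, toy_tokenize)
]
```

With a real model, `tokenize` would be replaced by that model's own tokenizer, so the set of retained nouns differs from model to model.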
For Transformer-XL, we add a padding text and a start-of-sentence token (SOS) to the beginning of the sentence and an end-of-sentence token (EOS) to the end of the sentence. For GPT-2, we make no modifications to the generated sentence (although prefix spaces are added to the strings for tokenization purposes). For BERT, since it is a masked language model, we replace the Verb (for SVA) or reflexive pronoun (for RA) with a [MASK] token after tokenization. Thus, each sentence will have a single mask token corresponding to the word that should agree with the target noun.

Noun performance is correlated across tasks
We first examine how each noun's performance varies across the grammatical tasks. For each noun-task pair, we measure the average performance of the noun on that task, as described above. This gives 10 features per noun, corresponding to the 10 grammatical tasks. Figure 1 shows the pairwise comparisons between performance on the different tasks for Transformer-XL. Results for BERT and GPT-2 are similar and are shown in the appendix. The figure shows that performance is correlated across the tasks; for many pairs of tasks, nouns which have higher performance on one task are likely to have higher performance on the other.
Using principal component analysis, we found that a single principal component explains 47% of task variance for Transformer-XL, and two principal components explain 73%. Results are similar for BERT and GPT-2, and are shown in the appendix. The first PC primarily measures performance on the four reflexive anaphora tasks, while the second PC measures performance on the subject-verb agreement across relative clause tasks. This suggests that there is a dimension that characterizes whether the model understands how reflexive binding constraints operate for a noun, and a dimension for whether the model understands subject-verb agreement for the noun. Note that Figure 1 additionally demonstrates correlations between the reflexive tasks and the subject-verb agreement tasks.
These results provide evidence that language models' variation in performance on the grammatical tasks is, in part, explained by properties of the nouns which are stable across tasks. The models understand number agreement better for some nouns, and worse for others.
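The principal component analysis above operates on a nouns-by-tasks matrix of performance scores. The following sketch shows one way to obtain the explained-variance ratios with NumPy; the data are synthetic (a shared per-noun factor plus noise), the function name is hypothetical, and the paper does not state which PCA implementation or preprocessing was used.

```python
import numpy as np

def explained_variance_ratio(scores):
    """PCA via SVD on a (n_nouns, n_tasks) matrix of task performance scores.

    Returns the fraction of variance explained by each principal component
    and the component directions (one row per component).
    """
    centered = scores - scores.mean(axis=0)
    # Squared singular values of the centered matrix are proportional to
    # the variance captured along each principal component.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)
    variances = singular_values ** 2
    return variances / variances.sum(), components

# Synthetic stand-in data: 200 nouns, 10 tasks, one shared per-noun factor.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))
scores = shared + 0.5 * rng.normal(size=(200, 10))
ratio, components = explained_variance_ratio(scores)
```

The largest absolute entries in each row of `components` indicate which tasks contribute most to that component, which is how a component can be read as "mostly reflexive anaphora" or "mostly agreement across relative clauses".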

Noun performance is correlated across models
We next investigate whether nouns exhibit stable behavior across different neural language models. For each pair of the three language models, we measured how well a noun's task performance in one language model predicted its task performance in the other language model. Figure 2 shows comparisons between pairs of language models on the 10 grammatical tasks. Of the 30 comparisons, 24 show significant positive correlations between the pairs of language models. 22 of the correlations remain significant after Bonferroni correction.
GPT-2 and Transformer-XL show the strongest correlation in performance. It is possible that this is due to methodological differences between the task setup for GPT-2 and Transformer-XL compared to BERT: GPT-2 and Transformer-XL are performing a language modeling task in which the probability of a full sentence is queried, while BERT performs masked language modeling on a single target word. The difference may also be due to corresponding training differences between BERT and the autoregressive language models.
The results provide evidence that nouns exhibit stable task performance across language models. The source of the correlation across language models must come from features of the training data. Properties of the natural text distribution of nouns lead some of these nouns to be better understood than others.
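The cross-model comparison amounts to correlating per-noun scores between two models on each task and then correcting for the 30 comparisons. The sketch below is an illustration under stated assumptions: the paper does not say which significance test was used, so the p-value here comes from the Fisher z-transform (a normal approximation), the significance level of 0.05 is assumed, and all data and function names are invented.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def approx_p_value(r, n):
    """Two-sided p-value using the Fisher z-transform (normal approximation)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values that survive Bonferroni correction for len(p_values) tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Made-up per-noun scores for two models on one task.
model_a = [0.1 * i for i in range(50)]
model_b = [2 * a + ((i * 7) % 5 - 2) * 0.3 for i, a in enumerate(model_a)]
r = pearson_r(model_a, model_b)
p = approx_p_value(r, len(model_a))
flags = bonferroni_significant([0.001, 0.01, 0.04])
```

With 30 comparisons and an assumed level of 0.05, a correlation would need a p-value below 0.05 / 30 to remain significant after correction.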

Effect of frequency on task performance
In Sections 4.1 and 4.2, we found evidence that nouns exhibit stable performance across different grammatical tasks and language models. One obvious explanation of these results is that nouns vary in their frequency in natural text, and language models learn more accurate grammatical representations for more frequent nouns.
In order to investigate this, we measured the frequency of each noun in two corpora: WikiText-103, a 103 million token subset of Wikipedia, which was used for training Transformer-XL; and OpenWebText (Gokaslan and Cohen, 2019), an open-source implementation of the web corpus used to train GPT-2. Word frequencies were measured separately for singular and plural noun forms. Figure 3 shows the relationship between frequency and task performance on each of the ten grammatical tasks. The appendix shows the results broken down by task type.
The results show no clear relationship between noun frequency and task performance. Frequency explains no more than 0.1% of the variation in performance. This holds true over more than four orders of magnitude in frequency. This provides evidence that 1) differences in corpus frequency do not explain the systematic differences observed between nouns, and 2) relatively few observations suffice for transformer language models to learn correct number agreement behavior for a noun. In the next section, we investigate this finding further.

Figure 2: Pairwise comparisons between GPT-2, Transformer-XL, and BERT on the 10 grammatical tasks. Each row corresponds to a pair of language models, and each column is a single task. One point represents the performance of a noun on a single task.

Figure 3: Relationship between corpus frequency and task performance for Transformer-XL, BERT, and GPT-2. Performance scores are z-normalized. Colors indicate the ten grammatical tasks and singular/plural form of the noun (s indicates singular, p indicates plural). Each point represents task performance for a single noun.

Few-shot learning for novel lexical items
The results in the previous section provide evidence that nouns systematically vary in their performance on grammatical tasks; some nouns perform better than others across tasks and language models. However, this variation is not explained by frequency of occurrence in natural text. Nouns that occur on the order of 100 times in a corpus do not have systematically worse performance than nouns that occur $10^6$ times.
The results raise a question: if frequency does not influence how well a noun is understood, what does? If low frequency nouns are understood as well as higher frequency nouns, then this suggests that language models few-shot learn the grammatical properties of nouns. We suggest that by studying what makes a noun learnable in a few-shot setting, it may be possible to better understand the sources of the observed variation.
We use a few-shot learning paradigm, introducing a new lexical item into the vocabulary of the language model, either "wug" (intended as a new singular noun), or "wuz" (intended as a plural). We then fine-tune the language model using several example sentences containing this word. Note that this paradigm is distinct from nearly all of the few-shot learning experiments performed in Radford et al. (2019) and Brown et al. (2020), which operate on a known vocabulary.

Learning agreement from syntactic data
We first look at whether training data containing explicit syntactic markers of number agreement is sufficient for few-shot learning. Table 3 describes the types of training data we examine. The three types of training data use different syntactic markers of plurality to indicate whether the new noun is singular or plural. The language models are fine-tuned with 5 sentences drawn from a single training data type. GPT-2 was fine-tuned for 2 epochs, and BERT was fine-tuned for 4 epochs. Transformer-XL was not used for the fine-tuning experiments, due to issues with introducing new vocabulary items given Transformer-XL's adaptive weight embedding.

Table 3
Training data type    Template
Pred-adj              The wug/wuz is/are Adj.
Reflexive             The wug/wuz Verb himself/themselves.
After fine-tuning, each model was evaluated on the 10 grammatical tasks in Table 1. For each grammatical task, 500 sentences were sampled from the task template, and a performance score was calculated by averaging scores of the samples, as described in Section 3.4. Figure 4 shows results for fine-tuning on the three types of syntactic data. Compared to model performance on real lexical items (shown in the leftmost column), both BERT and GPT-2 achieve qualitatively similar performance given the Pred-adj and Reflexive training data, but worse performance given the Simple training data. Performance is weakest on subject-verb agreement (SV-agreement) tasks involving relative clauses. When trained on data containing reflexive anaphora, both models achieve notably higher performance on the grammatical tasks involving reflexive anaphora.
The results provide evidence that small amounts of syntactic training data support learning the agreement properties of novel nouns. They also provide evidence of heterogeneity among different types of training data. Training from bare present tense verbs is least effective, and training from sentences containing reflexives leads to improved performance on tasks which require understanding of the conditions on reflexive binding.
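As an illustration of how a small fine-tuning set of this kind could be assembled, the sketch below fills the Pred-adj template with sampled adjectives. The adjective list, function name, and seed are hypothetical; only the template shape ("The wug/wuz is/are Adj.") and the set size of 5 come from the text.

```python
import random

# Hypothetical adjective list; the paper's Adj word list is not reproduced here.
ADJECTIVES = ["tall", "happy", "quiet", "clever", "tired", "brave", "young"]

def pred_adj_examples(number, adjectives=ADJECTIVES, k=5, seed=0):
    """Build k Pred-adj training sentences for the novel noun.

    number: "singular" uses "wug" with "is"; "plural" uses "wuz" with "are".
    """
    noun, copula = ("wug", "is") if number == "singular" else ("wuz", "are")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [f"The {noun} {copula} {adj}." for adj in rng.sample(adjectives, k)]

singular_set = pred_adj_examples("singular")
plural_set = pred_adj_examples("plural")
```

Here the copula is the only cue to the novel noun's number, which is what makes this template a purely syntactic training signal.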

Learning agreement from semantic data
We next examine whether purely semantic indicators of plurality are sufficient for learning a noun's number agreement properties. We look at several types of constructions which provide information about the plurality of a noun, but using predicates with past tense verbs that don't inflect for number so that there is no grammatical number agreement. In particular, we note the different possible readings with reference to the distributive and collective distinction described in the semantics literature (Lønning, 1997; Lasersohn, 2011; Champollion, 2015). For documentation of predicates that require a collective NP subject, see Levin (1993).

Figure caption: The baseline columns indicate performance of non-fine-tuned models on the novel wug/wuz lexical items. Scores are differences of log-probabilities between grammatical and ungrammatical. The 95% confidence interval around each point estimate is always smaller than ±0.25.
We use the fine-tuning method from Section 5.1.

Singular constructions
In order to induce singular noun interpretations, we use the singular-biased constructions shown at the top of Table 4. For example, if a wug worked all alone or came unaccompanied, it is likely that "wug" is both semantically and grammatically singular. However, these constructions do not grammatically require the head noun to be singular: they are compatible with distributive readings where the predicate individually applies to members of a group (e.g. "the lawyers worked all alone" means each lawyer worked alone). BERT and GPT-2 were fine-tuned on 5 examples of each of the singular constructions. Figure 5 shows the results. None of the constructions consistently induced correct performance on the grammatical tasks across both models. Three of the constructions (all-alone, unaccompanied, and personally) led to strong performance on the reflexive anaphora tasks (stronger than the average performance calculated in Section 4). The separated-entire construction consistently decreased performance on the tasks relative to baseline.

Table 4
Singular
unaccompanied       The wug came unaccompanied.
separated-entire    The wug became separated from the entire group.
personally          The wug personally thanked me.
Plural
unison              The wuz nodded in unison.
together            The wuz ate together.
simultaneously      The wuz jumped simultaneously.
outnumbered         The wuz outnumbered the cats.
constituted         The wuz constituted a majority of the team.
gathered            The wuz gathered quietly.

Plural constructions
In order to provide the models with data indicating that a novel noun is plural, we use constructions which force either collective or distributive readings. For example, in Table 4, if the wuz constituted the majority of the team, then the word "wuz" must be semantically plural. The construction constituted a majority is collective because it must apply to the group as a whole:
(6) The doctors constituted a majority of the team.
a. *Distributive reading: each of the doctors constituted a majority.
b. Collective reading: the doctors as a group constituted a majority.
While the argument of a collective predicate must be semantically plural, it is not necessarily grammatically plural. For example, the singular "the group" could constitute the majority of the team. Three of the constructions in Table 4 are collective: outnumbered, constituted, and gathered. The other three are distributive phrasal predicates, which force distributive readings:
(7) The architects nodded in unison.
a. Distributive reading: each of the architects nodded.
b. *Collective reading: the group of architects itself nodded.
Figure 6 shows the plural learning results. The 6 types of training data perform comparably on the subject-verb agreement tasks (and similar to the baseline model, which represents performance prior to fine-tuning). The three distributive phrasal constructions perform better on the reflexive anaphora tasks than the three collective constructions, though all constructions improve relative to the baseline.

Discussion
We have investigated the sources of variation in neural language models' grammatical judgments. We found that there are systematic differences between nouns: when a language model exhibits knowledge of a noun's grammatical properties in one task, it is more likely to do so in other tasks. Moreover, when one language model exhibits this knowledge, other language models are more likely to as well.
The study found two latent dimensions of variation between nouns: one corresponding to how well the models understood a noun's behavior with reflexive pronouns, and the other corresponding to subject-verb agreement.
Subsequent analyses demonstrate a pair of empirical phenomena:
1. It is relatively easy to learn the number agreement properties of a noun. The models learn the agreement properties of a novel noun from just a few samples, and the data supporting few-shot learning appears to be densely distributed; nearly all types of syntactic and semantic data examined lead to improvements on the reflexive pronoun or subject-verb agreement tasks.
2. Nouns that occur more frequently during training are not learned more accurately. Many nouns that occur with high frequency are not learned accurately.
These results suggest that nouns should vary less in their grammatical performance than is actually observed; the study finds excess variation in grammatical performance. If number agreement can be correctly learned from a few samples (FSL samples), then one would expect model performance to either a) improve with more data, as more FSL samples are observed, or b) improve with more data up to some threshold, and then asymptote after learning has saturated. In either case, for high frequency nouns, a sufficient number of FSL samples should be observed for these nouns to be learned very accurately.
A potential explanation of the results is that they are caused by catastrophic forgetting (Ratcliff, 1990; French, 1999): although a sufficient number of FSL samples are observed for a noun, these samples are forgotten during training, causing the performance of the noun to degrade. This explanation is implausible. If catastrophic forgetting is occurring, then the problem should be more severe for infrequent nouns than for frequent nouns, as the interval between training samples will be longer for infrequent nouns. This would predict better performance for frequent nouns, which is not what the frequency analysis shows.

the three language models (Tables 5-8); pairwise comparison between task performance for BERT and GPT-2 (Figures 7 and 8); and more fine-grained comparisons between word frequency and model performance (Figures 9 and 10).

Table 8: Top contributors (tasks) to top few (of 10) PCs for GPT-2's noun performance as detailed in Section 4.1. Cells contain the task name followed by their (absolute) component value in the eigenvector.