An LSTM Adaptation Study of (Un)grammaticality

We propose a novel approach to the study of how artificial neural networks perceive the distinction between grammatical and ungrammatical sentences, a crucial task in the growing field of synthetic linguistics. The method is based on performance measures of language models trained on corpora and fine-tuned with either grammatical or ungrammatical sentences, then applied to (different types of) grammatical or ungrammatical sentences. The results show that both in the difficult and highly symmetrical task of detecting subject islands and in the more open CoLA dataset, grammatical sentences give rise to better scores than ungrammatical ones, possibly because they can be better integrated within the body of linguistic structural knowledge that the language model has accumulated.


Introduction
As the language modeling abilities of Artificial Neural Networks (ANNs) expand, a growing number of studies have started to address a network's ability to distinguish sentences that contain various types of syntactic errors from minimally different correct sentences, thus providing the equivalent of human grammaticality judgments, one of the cornerstones of theoretical linguistics since Chomsky (1957). These studies are important for at least two reasons: they can shed light on the type and amount of information which can be learned from pure linguistic data without any specialized language-learning device (thus contributing to the debate on human Universal Grammar, Chomsky 1986; Lasnik and Lidz 2015; Chowdhury and Zamparelli 2018), and they can be used as probes on the ANNs themselves, investigating whether models which are apparently proficient at language modeling are actually sensitive to the same syntactic and semantic cues humans use.
This work was funded by the Italian 2015 PRIN Grant "TREiL", and is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/
The ANNs used in this area of research (often LSTMs, Hochreiter and Schmidhuber 1997, but recently also transformer-based ANNs, Vaswani et al. 2017, all trained on large datasets of normal text) are tested on a mix of grammatical and ungrammatical sentences. The latter are obtained either by altering naturally occurring sentences (semi-randomly, as in Lau et al. 2017, or systematically, Linzen et al. 2016; Gulordava et al. 2018), by collecting examples from the published linguistic literature (Warstadt et al., 2018) or by creating minimal pairs by hand (individually, Wilcox et al. 2018, or with sentence-schemata, as in Chowdhury and Zamparelli 2018). 1 Once test data have been acquired, the literature has followed two very different approaches: treating grammaticality as a classification problem (i.e. feeding grammatical/ungrammatical sentences to a classifier and asking it to discriminate, cf. the first experiment in Linzen et al. 2016), or feeding the test sentences to a Language Model (LM) pretrained on normal language and measuring the perplexity accumulated by the LM as it traverses the sentence. 2 The classification approach works somewhat better, and can tell us whether the ability to spot ungrammaticality can in principle be learned from the data, but it is not directly comparable with the human ability to detect ungrammaticality, since explicit syntactic judgments play a negligible role in language acquisition.
1 Most studies except Lau et al. (2017) make the simplifying assumption that judgments can be treated as binary (e.g. acceptable/non-acceptable). This position is not entirely satisfactory, theoretically, but we believe that it won't do much harm at this early stage of research.
2 Intermediate methods are possible: Warstadt et al. (2018) and Warstadt and Bowman (2019) train a classifier on sentence vectors produced by various types of language models.
The approach which reads (un)grammaticality from the performance of an LM starts from a more naturalistic task, predicting what's coming (van Berkum, 2010), and can thus be more directly compared to human performance. However, the probability assigned by an LM to the words reflects many factors (sentence complexity, level of embedding, semantic coherence, etc.), making it difficult to tease apart 'grammaticality' from a more general notion of 'acceptability' or 'processing load'.
In this paper we propose a third approach to measuring grammaticality, derived from the LM method. In this approach, we take our in-house pre-trained LSTM LM and adapt the model via fine-tuning (Pan and Yang, 2010; Li, 2012) on variations of the test sentences.
Grammaticality is then treated as a comparative measure of coherence: to what extent the new (un)grammatical input can be integrated with what the ANN has learned so far, and to what extent it can improve similar grammatical or ungrammatical constructions. We test this method with a large number of artificially generated examples, focusing on a particularly difficult contrast, the case of subject vs. object subextraction. 3 We then apply the method to a more general scenario, the CoLA dataset, tuning an LM with either grammatical or ungrammatical CoLA sentences and measuring its performance in various testing scenarios. 4
In the following sections, we first present a detailed task description in Section 2, followed by a brief overview of the methodology and datasets used for the study (Section 3). In Section 4, we formalize our hypothesis of how the model should behave, and in Section 5 we report the results and our observations of the network's behavior; we then discuss our observations and conclude the study with future directions in Section 6.
3 The expanded test sets for each task can be found at https://github.com/LiCo-TREiL/Computational-Ungrammaticality/tree/master/blackboxnlp2019.
4 See Warstadt et al. (2018). Every sentence in the corpus, which can be found at https://nyu-mll.github.io/CoLA/, is marked as grammatical or ungrammatical. The values are drawn from the published literature, see Warstadt et al. (2018, Tab. 2) for details.

Task Description
It has been noted since Ross (1967) that while Wh-questions and relative clauses (RC) can give rise to gaps at unbounded distance (as in Who did Mary say that John saw and The boy that Mary thinks that John adopted), gaps in certain positions (e.g. inside relative clauses, individual conjuncts, or certain adjuncts) are perceived as degraded. Ross coined the term syntactic islands for these environments, which have been the focus of a huge amount of research in theoretical linguistics (see e.g. Szabolcsi and den Dikken 1999). Studies on ANNs' sensitivity to grammaticality have tried to model certain types of islands, with varying degrees of success (Lau et al., 2017; Wilcox et al., 2018, 2019; Jumelet and Hupkes, 2018). In this paper, we address subject islands, i.e. the difference between (1a) and (1b) for Wh-interrogatives, and between (2a) and (2b) for relative clauses.
Subject islands are an interesting domain for various reasons: (i) extractions from subjects and objects can contain nearly the same words (like above), and there are no lexical cues which signal one or the other type (e.g. both cases in (1) require do-support); (ii) while (1) and (2) share the extraction phenomenon, they have completely different discourse functions and distributions: it is not obvious that a model that learns relative clauses should boost its processing of Wh-questions, or vice-versa; (iii) extractions out of PPs inside nominals are rare in naturally occurring data, so they stand as a challenging test of the ANN's generalization abilities. Embedded Wh extractions out of PPs (*I know who the painting by fetched a high price at auction.) were one of the violations studied in Wilcox et al. (2018), using Google's LM and the model from Gulordava et al. (2018). Neither LM managed to model extractions out of PPs, treating the PP either as a possible extraction domain (Google's LM) or as an island in both subject and object position (Gulordava's). The study didn't address RCs like (2).
This case thus presents an interesting challenge for our technique: combined with the sentence schemata method described in Section 3, it gives a highly controlled environment; however, this comes at the cost of a high lexical overlap (after fine-tuning, the ANN is tested on structures which contain many words it has already practiced with). To try a different and more open testing environment, we applied the same method to the 5 test sets of the CoLA dataset (see Section 3 for details). In this case, we fine-tuned the ANN on grammatical or ungrammatical sentences from the CoLA training set, and tested it on different CoLA test-phenomena sentences, checking the interactions. Since this part of CoLA is categorized by topic, this gives a sense of which types of phenomena improve with this method.

Methodology
In this section, we describe our pipeline, including details of the datasets used in each step. In addition, we present the evaluation measure used to validate the effects (if any) of the fine-tuning method on our tasks. Figure 1 shows the pipeline we propose for exploring the effect of rehearsing new (un)grammatical input on a trained LSTM language model.
LM Architecture: The first step (Step 1 in Fig. 1) is to train a language model (LM_O) using a large text corpus. For the study, we used a left-to-right long short-term memory (LSTM) language model (Hochreiter and Schmidhuber, 1997), trained with 500 hidden units in each layer (layers = 2) and an embedding dimension of 256. The model was trained using a PyTorch RNN implementation, with dropout regularization applied in different layers of the architecture, along with an SGD optimizer using a fixed batch [...] (Lahiri, 2014) and UKWaC (Ferraresi et al., 2008), as shown in Table 1. We then tokenized the input sentences, removing URLs, email addresses, emoticons and text enclosed in any form of brackets ({.}, (.), [.]). We replaced rare words (tokens with frequency < 20) with an <UNK> token along with its signature (e.g. -ed, -ing, -ly etc.) to represent every possible out-of-vocabulary (OOV) word. We also replaced numbers (exponential, comma-separated etc.) with a <NUM> tag. We removed the sentences from UKWaC with OOV tags. As a result, to train the LM we used a training set consisting of ≈ 0.7B words in ≈ 31M sentences, with a vocabulary of size |V| = 0.1M.
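The rare-word and number replacement described above can be sketched as follows. This is only an illustration under stated assumptions, not the preprocessing script used for the study: the tag format `<UNK-ed>`, the suffix list, the number pattern and the function names are all hypothetical; only the frequency threshold of 20 and the `<UNK>`/`<NUM>` tags come from the text.

```python
import re
from collections import Counter

MIN_FREQ = 20                   # threshold from the text: frequency < 20 is rare
SUFFIXES = ("ing", "ed", "ly")  # assumed subset of the suffix signatures

# Assumed number pattern: plain, comma-separated, decimal and exponential forms.
NUM_RE = re.compile(r"^[+-]?\d[\d,]*(\.\d+)?([eE][+-]?\d+)?$")


def unk_signature(token):
    """Map a rare token to <UNK>, with a suffix signature when one applies."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            return "<UNK-" + suffix + ">"  # hypothetical tag format
    return "<UNK>"


def normalize(sentences, min_freq=MIN_FREQ):
    """Replace numbers with <NUM>, then rare tokens with <UNK> signatures."""
    with_nums = [["<NUM>" if NUM_RE.match(t) else t for t in s] for s in sentences]
    counts = Counter(t for s in with_nums for t in s)
    return [
        [t if t == "<NUM>" or counts[t] >= min_freq else unk_signature(t) for t in s]
        for s in with_nums
    ]
```

Here `sentences` is a list of token lists. Counting is done after number replacement, so that numeric strings do not inflate the set of rare types.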
Adaptation via fine-tuning: The trained LM_O was used to initialize the weights of the new LSTM LM_X, so as to transfer the knowledge LM_O has acquired so far (Step 2 in Fig. 1). To adapt the models LM_X to new (un)grammatical structures, we fine-tuned them on the sentences from our small training data sets, with a batch size of 20, for e epochs (e ∈ {3, 10}). All other parameters remained unchanged with respect to the original LM_O. In this paper, for brevity, we only report the results after 3 epochs.
Datasets for Adaptation: LMs can be quite sensitive to the specific content words used. To minimize this effect and focus on structure, we used the 'sentence schemata' method from Chowdhury and Zamparelli (2018): starting from a schema such as (3), a script automatically generates sentences containing all the possible combinations of the bracketed expressions. The schema in (3) (tagged Aff(irmatives with complex) Obj(ects)) gives 160 affirmative sentences (e.g. Activists hated fighting for these laws); we also constructed schemata for affirmatives with the gerund in subject position (AffSubj, e.g. fighting for these causes scares politicians), as well as for the corresponding root Wh-clauses (WhSubj, WhObj, as in (1a)/(1b)) and relatives (RelSubj, RelObj, as in (2a)/(2b)). In total, we have 6 train/test sets, see Table 2 for details.
• Reflexive-Antecedent Agreement (ReflAgr): A test of whether reflexive pronouns have appropriate local antecedents (cf. I amused myself / *yourself / *herself / *himself / *ourselves / *themselves).
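The sentence schemata method mentioned above amounts to taking the Cartesian product of the alternatives listed for each slot. A minimal sketch follows; the schema is hypothetical (it is not schema (3), and its slot fillers are invented around the example sentence quoted in the text), and the names `SCHEMA` and `expand` are our own.

```python
from itertools import product

# Hypothetical schema: one list of alternatives per slot. The words are
# illustrative only and do NOT reproduce the schema (3) used in the study.
SCHEMA = [
    ["Activists", "Politicians"],
    ["hated", "loved"],
    ["fighting for", "voting for"],
    ["these", "those"],
    ["laws", "causes"],
]


def expand(schema):
    """Return every sentence obtained by picking one option per slot."""
    return [" ".join(choice) for choice in product(*schema)]


sentences = expand(SCHEMA)
```

The number of generated sentences is the product of the slot sizes (here 2 × 2 × 2 × 2 × 2), which is why a compact schema can yield a test set of the size reported above.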
Evaluation Measure: To track the performance of our LSTM on the test sets, we adopted the popular acceptability measure Syntactic log-odds ratio (SLOR), introduced in this domain by Lau et al. (2017) and shown in Equation 1.
$$\mathrm{SLOR}(\varepsilon) = \frac{\log p_m(\varepsilon) - \log p_u(\varepsilon)}{|\varepsilon|} \qquad (1)$$
where ε represents the sentence; p_m(.) is the probability of ε given by the model, calculated by multiplying the probabilities of each target word in the sentence; p_u(.) is the unigram probability of ε; and |ε| represents the length of the sentence. The measure considers the structure and position of the words, subtracting out the unigram log-probability so that sentences that use rare words are not penalized, and is normalized by sentence length, thus removing (positive or negative) biases due to long sentences. Higher SLOR values correspond to 'better' (i.e. more predictable/acceptable) sentences.
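As an illustration of the measure just defined, SLOR can be computed from per-token probabilities as in the sketch below. It assumes that the model probabilities and the unigram probabilities are already available as two aligned lists, one entry per token; the function name and the input format are our own choices and do not describe the evaluation code used in the study.

```python
import math


def slor(token_probs, unigram_probs):
    """SLOR = (log p_m(sentence) - log p_u(sentence)) / sentence length.

    token_probs:   probability the language model assigns to each token
    unigram_probs: unigram probability of each of the same tokens
    """
    if len(token_probs) != len(unigram_probs) or not token_probs:
        raise ValueError("need two non-empty lists of equal length")
    # Summing logs is equivalent to taking the log of the product.
    log_pm = sum(math.log(p) for p in token_probs)
    log_pu = sum(math.log(p) for p in unigram_probs)
    return (log_pm - log_pu) / len(token_probs)
```

A sentence whose tokens the model predicts exactly as well as the unigram baseline gets a SLOR of 0; positive values mean that the model finds the sentence more predictable than word frequency alone would suggest.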

Our Hypothesis
We expect the adapted LM to improve in proportion to the similarity between the tasks, but also in proportion to how well the material presented in the fine-tuning phase is consistent with what the ANN already knows about language structures. Our expectation is that ungrammatical sentences should be harder to incorporate into previous knowledge, so that retraining on them should lead to worse performance in terms of generalization. Note that improvements when the ANN is trained on Wh and tested on RC or vice-versa can be attributed in part to lexical familiarity (the training contained most of the words seen in the testing), in part to the model's ability to note the common element in the two constructions, i.e. the extraction. We can mitigate the lexical overlap problem by subtracting the scores of an LM fine-tuned on the affirmative cases (i.e. the sentences generated from (3)) from those obtained from the corresponding extraction cases (RC and Wh), since our affirmative cases already contain most of the lexicon found in the RC/Wh sentences.
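The baseline subtraction described above can be sketched as follows. We assume here that the comparison is made on mean SLOR over a test set, which the text does not spell out; the function name and arguments are hypothetical.

```python
from statistics import mean


def overlap_corrected_gain(slor_tuned, slor_aff_baseline):
    """Mean SLOR gain of an extraction-tuned model over the Aff-only model.

    slor_tuned:        SLOR of each test sentence under the model tuned on
                       affirmatives plus an extraction type (e.g. Aff+WhObj)
    slor_aff_baseline: SLOR of the same sentences under the model tuned on
                       affirmatives only
    """
    if len(slor_tuned) != len(slor_aff_baseline):
        raise ValueError("both models must be scored on the same test set")
    return mean(slor_tuned) - mean(slor_aff_baseline)
```

Because both models have seen most of the test lexicon during fine-tuning, a positive difference is less likely to reflect lexical familiarity and more likely to reflect what was learned from the extraction sentences themselves.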
In the second experiment, where we tested on CoLA, there is no reason to expect a very high lexical overlap, so any effect found there can be attributed purely to the structures.
Figure 2 gives an overview of the SLOR values of our LSTM tuned for 3 epochs just on the affirmative sentences (LM_X, left), compared to the original (LM_O, right). Unsurprisingly, LM_X shows a large improvement in the AffSubj/AffObj cases, but also an improvement in the Wh case and especially in relative clauses. Note that after fine-tuning, all conditions (Aff, Rel and Wh) show a significant preference for the object case (present in Aff/Wh even in the original run). This effect emerged also in Chowdhury and Zamparelli (2018) (and in work of ours, under review, which specifically addresses this phenomenon). Since it is also present in affirmatives, it obviously cannot be attributed to a sensitivity to islands, but can probably be put down to a general preference of LSTM LMs for having complex structures in object position. This effect seems to override an effect found in Chowdhury and Zamparelli (2018) (Task A, which however uses different measures), where subject relatives scored better than object relatives (both being grammatical), in line with human parsing preferences widely discussed in the psycholinguistic literature (Gibson, 1998; Gordon et al., 2001; Friedmann et al., 2009). The generally lower score for RCs, compared to Wh cases, could also be attributed to the fact that in the testing phase the LM receives an End-of-Input signal before the sentence is over (i.e. RCs are sentence fragments). Figure 3 shows the effect of fine-tuning the original LM on different parts of the test set and testing it on the others.
At a global level, if we compare the scores with the affirmative baseline (the performance of the model fine-tuned with affirmatives only, as in Figure 2, left), we see that on average adding Wh-clauses significantly boosts RCs (+0.64) and vice-versa, though not as strongly (+0.39). Next, tuning with grammatical material gives a larger overall boost than tuning on ungrammatical material. This can be seen from the Total in Table 3 (using the notation ARelObj(RelObj) - Aff(RelObj) to mean "SLOR of LM_O fine-tuned with Aff+RelObj (ARelObj) and applied to relatives with object subextraction, minus the SLOR of the same test set under the model adapted on Aff"). Within construction, tuning with Aff plus Obj extractions boosts other object cases (green cells) more than tuning on Aff plus Subj extractions boosts other Subj cases (pink cells); across construction, Aff+WhObj tuning boosts RelObj and even RelSubj and, to a lesser extent, Aff+RelObj tuning boosts WhObj more than Aff+RelSubj boosts WhSubj.
Figure 2: Variation of the SLOR measure for different test groups using the model adapted on affirmative sentences (Aff, both AffSubj and AffObj) and the original LM_O (Ori) model. Higher is better. The blue arrows with the ∆ values represent the difference in SLOR between grammatical and ungrammatical sentences. * warns that the same test set is used to adapt the respective model. ns indicates that the results are not significantly different from each other.
Figure 3: Variation of the SLOR measure for different test groups using models adapted on: relative clause-object (ARelObj); relative clause-subject (ARelSubj); wh-object (AWhObj); wh-subject (AWhSubj). All the models are initially adapted on affirmative sentences, hence the presence of A in ARelObj and all other models. The blue arrows with the ∆ values represent the difference in SLOR between grammatical and ungrammatical sentences. * warns that the same test set was used to adapt the corresponding model.
Figure 4 shows the results on the 5 test sets for the original model (LM_O) and the LM fine-tuned with the CoLA grammatical and ungrammatical sentences, respectively. The first thing to note is that LM_O is already able to significantly distinguish, on average, the two classes, with the worst performance coming from the Causative-Inchoative Alternation, a construction linked to the lexical semantics of a class of verbs which are not likely to be encountered in many other examples. As in the previous experiment, fine-tuning improves the SLOR scores of all cases, ungrammatical ones included. In keeping with the previous experiment, we verify whether the switch from CoLA_G to CoLA_UG has a significant effect on the improvements (esp. CoLA-G(G) vs. CoLA-UG(UG), keeping in mind that here, unlike in the previous experiment, the training can contain at most a small dose of the lexicon and the phenomena in the testing set). Given the results in the Subj/Obj island task, our expectation is that tuning on CoLA_G should work better than tuning on CoLA_UG. The difference turns out not to be significant with Subject-Verb Agreement cases (SVArg, Figure 4a), significant but with ungrammatical cases coming out best for the Subject-Verb-Object permutation cases (SVO, Figure 4b), and significantly bigger with grammatical tuning in the remaining cases (see Figure 4f for the overall picture). The case of SVArg might be due to the fact that the contrastive examples found in the syntactic literature might not cover something as basic as wrong subject-verb agreement. The behavior of SVO remains unclear.

Discussions and Conclusions
The results of our first experiment suggest that, even though the contrast between subject and object subextraction is one of the hardest for ANNs to detect (see Wilcox et al. 2018), fine-tuning a language model with one of the two conditions does not give the same effect: above and beyond the effect of assertions (see Figure 3), tuning with grammatical extractions (i.e. object cases) yields a larger boost for the construction used for tuning than tuning with the ungrammatical cases. In small measure, the boost extends to the related construction (Wh to RC, and partially vice-versa).
The same effect is found with the much less controlled CoLA dataset, at least for some of the constructions we tested.
The results are consistent with the hypothesis that grammatical cases are somehow easier to integrate into what the ANN has already discovered about linguistic structures. Of course, positive examples of grammatical extractions like WhObj and RelObj also boost the ungrammatical cases, but possibly this is because they apply to parts of the sentence different from the extraction site (indeed, ungrammatical cases boost grammatical and ungrammatical cases almost to the same degree). This suggests that the methodology we are proposing could be a useful addition to the toolbox of this research area.
An obvious question, at this point, is whether the fine-tuning approach could be turned into a classification method.
One could for instance imagine classifying a sentence as grammatical or ungrammatical on the basis of its SLOR difference across LMs tuned with grammatical/ungrammatical sets (e.g. the CoLA_G and CoLA_UG conditions). Recall however that SLOR is sensitive to a variety of factors which have nothing to do with grammaticality (e.g. collocations, pragmatic plausibility), and that it has been used to study grammaticality only with carefully constructed minimal pairs. We suspect that such a classification experiment would be hard to carry out with relatively open data like CoLA, though it might become feasible with more balanced materials. Probably a better use for the technique proposed here would be to study similarity across constructions as seen by the network. Using the more fine-grained classification of the CoLA data given in Warstadt and Bowman (2019), it might be possible to selectively fine-tune a model with one construction, test it with all the others and discover from the variations in a performance measure like SLOR how the ANN 'sees' the relation between different linguistic cases.
Figure 4f represents the difference in SLOR value, for grammatical (G) and ungrammatical (UG) examples, between the CoLA-G and CoLA-UG models, i.e. A_G(SLOR) − A_UG(SLOR) for all the above test cases (a-e), where A_G represents the result from the CoLA-G model and A_UG the result from the CoLA-UG model. The blue arrows with the ∆ values represent the difference in SLOR between grammatical and ungrammatical sentences. ns indicates that the results are not significantly different from each other.