An Analysis of the Utility of Explicit Negative Examples to Improve the Syntactic Abilities of Neural Language Models

We explore the utility of explicit negative examples in training neural language models. Negative examples here are incorrect words in a sentence, such as barks in *The dogs barks. Neural language models are commonly trained only on positive examples, i.e., a set of sentences in the training data, but recent studies suggest that models trained in this way are not capable of robustly handling complex syntactic constructions, such as long-distance agreement. In this paper, we first demonstrate that appropriately using negative examples for particular constructions (e.g., subject-verb agreement) boosts the model's robustness on them in English, with a negligible loss of perplexity. The key to our success is an additional margin loss between the log-likelihoods of a correct word and an incorrect word. We then provide a detailed analysis of the trained models. One of our findings is the difficulty of object relative clauses for RNNs. We find that even with our direct learning signals the models still suffer from resolving agreement across an object relative clause. Augmenting the training data with sentences involving this construction helps somewhat, but the accuracy still does not reach the level of subject relative clauses. Although not directly cognitively appealing, our method can be a tool for analyzing the true architectural limitations of neural models on challenging linguistic constructions.


Introduction
Despite not being exposed to explicit syntactic supervision, neural language models (LMs), such as recurrent neural networks, are able to generate fluent and natural sentences, suggesting that they induce syntactic knowledge about the language to some extent. However, it is still under debate whether such induced knowledge about grammar is robust enough to deal with syntactically challenging constructions such as long-distance subject-verb agreement. So far, the results for RNN language models (RNN-LMs) trained only with raw text are overall negative; prior work has reported low performance on challenging test cases (Marvin and Linzen, 2018) even with massive data and model sizes (van Schijndel et al., 2019), or argued for the necessity of an architectural change to track syntactic structure explicitly (Wilcox et al., 2019b; Kuncoro et al., 2018). Here the task is to evaluate whether a model assigns a higher likelihood to a grammatically correct sentence (1a) than to an incorrect sentence (1b) that is minimally different from the original one (Linzen et al., 2016).
(1) a. The author that the guards like laughs.
b. *The author that the guards like laugh.
In this paper, to obtain a new insight into the syntactic abilities of neural LMs, in particular RNN-LMs, we perform a series of experiments under a different condition from the prior work. Specifically, we extensively analyze the performance of the models that are exposed to explicit negative examples. In this work, negative examples are the sentences or tokens that are grammatically incorrect, such as (1b) above.
Since these negative examples provide a direct learning signal for the task at test time, it may not be very surprising if task performance goes up. We acknowledge this, and argue that our motivation for this setup is to deepen our understanding of, in particular, the limitations or the capacity of the current architectures, which we expect can be reached with such strong supervision. Another motivation is engineering: negative examples could be exploited in different ways, and establishing a better way will be of practical importance for building an LM or generator that is robust on particular linguistic constructions.
The first research question we pursue concerns this latter point: what is a better method of utilizing negative examples to help LMs acquire robustness on the target syntactic constructions? Regarding this point, we find that adding an additional token-level loss that tries to guarantee a margin between the log-probabilities of the correct and incorrect words (e.g., log p(laughs|h) and log p(laugh|h) for (1a)) is superior to the alternatives. On the test set of Marvin and Linzen (2018), we show that LSTM language models (LSTM-LMs) trained with this loss reach a near-perfect level on most syntactic constructions for which we create negative examples, with only a slight increase of perplexity of about 1.0 point.
Past work conceptually similar to ours is Enguehard et al. (2017), which, while not directly exploiting negative examples, trains an LM with additional explicit supervision signals for the evaluation task. They hypothesize that LSTMs have enough capacity to acquire robust syntactic abilities but that the learning signals given by raw text are weak, and show that multi-task learning with a binary classification task predicting the upcoming verb form (singular or plural) helps make models aware of the target syntax (subject-verb agreement). Our experiments basically confirm and strengthen this argument, with even stronger learning signals from negative examples, and we argue that this allows us to evaluate the true capacity of the current architectures. In our experiments (Section 4), we show that our margin loss achieves higher syntactic performance than their multi-task learning.
Another relevant work on the capacity of LSTM-LMs is Kuncoro et al. (2019), which shows that by distilling from syntactic LMs (Dyer et al., 2016), LSTM-LMs can improve their robustness on various agreement phenomena. We show that our LMs with the margin loss outperform theirs in most aspects, further strengthening the argument for a stronger capacity of LSTM-LMs.
The latter part of this paper is a detailed analysis of the trained models and the introduced losses. Our second question is about the true limitation of LSTM-LMs: are there still syntactic constructions that the models cannot handle robustly even with our direct learning signals? This question can be seen as a fine-grained version of the one raised by Enguehard et al. (2017), pursued with a stronger tool and an improved evaluation metric. Among the tested constructions, we find that syntactic agreement across an object relative clause (RC) is challenging. To inspect whether this is due to an architectural limitation, we train another LM on a dataset in which we unnaturally augment sentences involving object RCs. Since object RCs are known to be relatively rare compared to subject RCs (Hale, 2001), frequency may be the main reason for the lower performance. Interestingly, even when increasing the number of sentences with an object RC by eight times (more than twice the number of sentences with a subject RC), the accuracy does not reach the same level as agreement across a subject RC. This result suggests an inherent difficulty for sequential neural architectures in tracking a syntactic state across an object RC.
We finally provide an ablation study to understand the linguistic knowledge encoded in the models learned with the help of our method. We experiment with reduced supervision at two different levels: (1) at a lexical level, by not giving negative examples for verbs that appear in the test set; (2) at a construction level, by not giving negative examples for a particular construction, e.g., verbs after a subject RC. We observe no large score drops in either case. This suggests that our learning signals at a lexical level (negative words) strengthen abstract syntactic knowledge about the target constructions, and also that the models can generalize the knowledge acquired from negative examples to similar constructions for which negative examples are not explicitly given. The result also implies that negative examples do not have to be complete and can be noisy, which is appealing from an engineering perspective.

Target Task and Setup
The most common evaluation metric for an LM is perplexity. Although neural LMs achieve impressive perplexity (Merity et al., 2018), it is an average score across all tokens and does not inform us about the models' behavior on linguistically challenging structures, which are rare in the corpus. This is the primary motivation for separately evaluating the models' syntactic robustness with a different task.

Syntactic evaluation task
As introduced in Section 1, the task for a model is to assign a higher probability to the grammatical sentence over the ungrammatical one, given a pair of minimally different sentences at a critical position affecting the grammaticality. For example, (1a) and (1b) differ only in the final verb form, and to assign a higher probability to (1a), models need to be aware of the agreement dependency between author and laughs over an RC.

Marvin and Linzen (2018) test set While initial work (Linzen et al., 2016; Gulordava et al., 2018) collected test examples from naturally occurring sentences, this approach suffers from a coverage issue, as syntactically challenging examples are relatively rare. We use the test set compiled by Marvin and Linzen (2018), which consists of synthetic examples (in English) created from a fixed vocabulary and a grammar. This approach allows us to collect a variety of sentences with complex structures.
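Concretely, the comparison reduces to summing next-word log-probabilities over each sentence and checking which total is higher. A minimal sketch follows; the `toy_logprob` scorer is a hypothetical stand-in for a trained LM's conditional distribution:

```python
import math

def sentence_logprob(next_word_logprob, sentence):
    """Sum log p(w_i | w_1..w_{i-1}) over all tokens of the sentence."""
    return sum(next_word_logprob(sentence[:i], w) for i, w in enumerate(sentence))

def prefers_grammatical(next_word_logprob, good, bad):
    """True if the LM assigns a higher total log-likelihood to `good`."""
    return sentence_logprob(next_word_logprob, good) > sentence_logprob(next_word_logprob, bad)

# Hypothetical stand-in for a trained LM: slightly prefers the agreeing verb.
def toy_logprob(prefix, word):
    if word == "laughs":
        return math.log(0.02)
    if word == "laugh":
        return math.log(0.01)
    return math.log(0.1)

good = "the author that the guards like laughs".split()
bad = "the author that the guards like laugh".split()
print(prefers_grammatical(toy_logprob, good, bad))  # True
```

Since the two sentences differ only at the critical position, the comparison isolates the model's preference at exactly that position.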
The test set is divided by the syntactic constructions appearing in each example. Many constructions are different types of subject-verb agreement, including local agreement on different sentential positions (2), and non-local agreement across different types of phrases. Intervening phrases include prepositional phrases, subject RCs, object RCs, and coordinated verb phrases (3). (1) is an example of agreement across an object RC.
(3) The senators like to watch television shows and are/*is twenty three years old.

Previous work has shown that non-local agreement is particularly challenging for sequential neural models (Marvin and Linzen, 2018).
The other patterns are reflexive anaphora dependencies between a noun and a reflexive pronoun (4), and negative polarity items (NPIs), such as ever, which require a preceding negation word (e.g., no and none) in an appropriate scope (5):
(4) The authors hurt themselves/*himself.
(5) No/*Most authors have ever been popular.
Note that NPI examples differ from the others in that the context determining the grammaticality of the target word (No/*Most) does not precede it. Rather, the grammaticality is determined by the following context. As we discuss in Section 3, this property makes it difficult to apply training with negative examples for NPIs for most of the methods studied in this work.
All examples above (1-5) are actual test sentences, and since they are synthetic, some may sound somewhat unnatural. The main argument for using this dataset is that even if not entirely natural, they are still strictly grammatical, and an LM equipped with robust syntactic abilities should be able to handle them as a human would.
We use the original test set of Marvin and Linzen (2018).1 See the supplementary materials of that work for the lexical items and example sentences in each construction.

Language models
Training data Following prior practice, we train LMs on a dataset not directly related to the test set. Throughout the paper, we use the English Wikipedia corpus assembled by Gulordava et al. (2018), which has been used as training data for the present task (Marvin and Linzen, 2018; Kuncoro et al., 2019), consisting of 80M/10M/10M tokens for the training/dev/test sets. It is tokenized, and rare words are replaced by a single unknown token, yielding a vocabulary size of 50,000.
Baseline LSTM-LM Since our focus in this paper is on additional losses exploiting negative examples (Section 3), we fix the baseline LM throughout the experiments. Our baseline is a three-layer LSTM-LM with 1,150 hidden units in internal layers, trained with the standard cross-entropy loss. Word embeddings are 400-dimensional, and input and output embeddings are tied (Inan et al., 2016). Deviating from some prior work (Marvin and Linzen, 2018; van Schijndel et al., 2019), we train LMs at the sentence level, as in sequence-to-sequence models (Sutskever et al., 2014). This setting has been employed in some previous work (Kuncoro et al., 2018, 2019).2 Parameters are optimized by SGD. For regularization, we apply dropout to word embeddings and to the outputs of every LSTM layer, with a weight decay of 1.2e-6, and anneal the learning rate by 0.5 if the validation perplexity does not improve, checking every 5,000 mini-batches. Mini-batch size, dropout weight, and initial learning rate are tuned by perplexity on the dev set of the Wikipedia dataset.3 Note that we tune these values for the baseline LSTM-LM and fix them across the experiments. The size of our three-layer LM is the same as that of the state-of-the-art document-level LSTM-LM (Merity et al., 2018). Marvin and Linzen (2018)'s LSTM-LM has two layers with 650 hidden units and 650-dimensional word embeddings. Comparing the two, since our word embeddings are smaller (400 vs. 650), the total model sizes are comparable (40M parameters for ours vs. 39M for theirs). Nonetheless, we will see in the first experiment that our carefully tuned three-layer model achieves much higher syntactic performance than their model (Section 4), providing a stronger baseline for our extensions, which we introduce next.
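The learning-rate schedule described above (halve the rate whenever validation perplexity fails to improve, checked every 5,000 mini-batches) can be sketched as follows; the class and method names are our own, not from the paper's code:

```python
class PerplexityAnnealer:
    """Halve the learning rate when validation perplexity stops improving.

    `check` is meant to be called once every 5,000 mini-batches with the
    current validation perplexity; if it does not beat the best value seen
    so far, the learning rate is multiplied by `factor` (0.5 here).
    """

    def __init__(self, lr, factor=0.5):
        self.lr = lr
        self.factor = factor
        self.best_ppl = float("inf")

    def check(self, valid_ppl):
        if valid_ppl < self.best_ppl:
            self.best_ppl = valid_ppl  # improvement: keep the current rate
        else:
            self.lr *= self.factor     # no improvement: anneal
        return self.lr
```

For example, starting from lr = 1.0, an improving check keeps the rate at 1.0, a non-improving one drops it to 0.5, and so on.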

Learning with Negative Examples
Now we describe four additional losses for exploiting negative examples. The first two are existing ones, proposed for a similar purpose or under a different motivation. As far as we know, the latter two have not appeared in past work.4 We note that we create negative examples by modifying the original Wikipedia training sentences, not sentences in the test set. As a running example, let us consider the case where sentence (6a) exists in a mini-batch, from which we create a negative example (6b).
(6) a. An industrial park with several companies is located in the close vicinity.
b. *An industrial park with several companies are located in the close vicinity.
Notations By a target word, we mean a word for which we create a negative example (e.g., is). We distinguish two types of negative examples: a negative token and a negative sentence; the former means a single incorrect word (e.g., are), while the latter means an entire ungrammatical sentence.
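Negative tokens and negative sentences can be derived from a training sentence by flipping the number of each target word, roughly as follows. The `FLIP` table here is a tiny hypothetical fragment; in practice the target vocabulary is larger and target verbs are identified automatically:

```python
# Hypothetical flip table between singular and plural forms of target words.
FLIP = {"is": "are", "are": "is",
        "likes": "like", "like": "likes",
        "laughs": "laugh", "laugh": "laughs"}

def negative_tokens(word):
    """Return the set of negative tokens for a target word (empty if none)."""
    return {FLIP[word]} if word in FLIP else set()

def negative_sentences(sentence):
    """All single-token corruptions of a sentence, one per target token."""
    out = []
    for i, w in enumerate(sentence):
        for neg in negative_tokens(w):
            out.append(sentence[:i] + [neg] + sentence[i + 1:])
    return out

sent = "an industrial park with several companies is located here".split()
negs = negative_sentences(sent)
# one negative sentence, with "is" flipped to "are"
```

Note that a sentence with multiple target tokens yields multiple negative sentences, each differing from the original in exactly one position.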

Negative Example Losses
Binary-classification loss This is proposed by Enguehard et al. (2017) to complement a weak inductive bias in LSTM-LMs for learning syntax. It is multi-task learning across the cross-entropy loss (L_lm) and an additional loss (L_add):

L = L_lm + β L_add,   (1)

where β is a relative weight for L_add. Given the outputs of the LSTM, a linear layer and a binary softmax layer predict whether the next token is singular or plural. L_add is a loss for this classification, defined only for the contexts preceding a target token x_i:

L_add = Σ_{x_{1:i} ∈ h*} −log p(num(x_i) | x_{1:i−1}),

where x_{1:i} = x_1 · · · x_i is a prefix sequence and h* is the set of all prefixes ending with a target word (e.g., An industrial park with several companies is) in the training data. num(x) ∈ {singular, plural} is a function returning the number of x. In practice, for each mini-batch for L_lm, we calculate L_add for the same set of sentences and add the two to obtain the total loss for updating parameters. As we mentioned in Section 1, this loss does not exploit negative examples explicitly; essentially, a model is only informed of a key position (the target word) that determines the grammaticality. This is a rather indirect learning signal, and we expect that it does not outperform the other approaches.
Unlikelihood loss This is recently proposed (Welleck et al., 2020) for resolving the repetition issue, a known problem for neural text generators (Holtzman et al., 2019). Aiming at learning a model that can suppress repetition, they introduce an unlikelihood loss, which is an additional loss at a token level and explicitly penalizes choosing words previously appeared in the current context.
We customize their loss for negative tokens x*_i (e.g., are in (6b)). Since this loss is added at the token level, instead of Eq. 1 the total loss is L_lm, which we modify as:

L_lm = Σ_{x ∈ D} Σ_i ( −log p(x_i | x_{1:i−1}) + g(x_i) ),

g(x_i) = Σ_{x*_i ∈ neg_t(x_i)} −α log(1 − p(x*_i | x_{1:i−1})),

where neg_t(·) returns the negative tokens for a target x_i,5 α controls the weight, and x is a sentence in the training data D. The unlikelihood loss strengthens the signal to penalize undesirable words in a context by explicitly reducing the likelihood of negative tokens x*_i. This is a more direct learning signal than the binary classification loss.
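Numerically, the unlikelihood penalty for one position behaves as follows; this is a pure-Python sketch (function name ours) that in practice would operate on the LM's softmax outputs:

```python
import math

def unlikelihood_penalty(p_negatives, alpha=1000.0):
    """Per-position unlikelihood term: -alpha * log(1 - p) for each negative token.

    The penalty is zero when the model assigns no probability to the negative
    tokens, and grows without bound as any p(x*_i | context) approaches 1.
    """
    return sum(-alpha * math.log(1.0 - p) for p in p_negatives)
```

The default α = 1000 here mirrors the weight found necessary for this loss in the experiments (Section 4).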
Sentence-level margin loss We propose a different loss, in which the likelihoods of correct and incorrect sentences are more tightly coupled. As in the binary classification loss, the total loss is given by Eq. 1. We consider the following loss for L_add:

L_add = Σ_{x ∈ D} Σ_{x*_j ∈ neg_s(x)} max(0, δ − (log p(x) − log p(x*_j))),

where δ is a margin value between the log-likelihood of the original sentence x and those of the negative sentences {x*_j}. neg_s(·) returns a set of negative sentences obtained by modifying the original one. Note that we change only one token in each x*_j, and thus may obtain multiple negative sentences from one x when it contains multiple target tokens (e.g., she leaves there but comes back ...).6 Compared to the unlikelihood loss, rather than only decreasing the likelihood of a negative example, this loss tries to guarantee a certain difference between the two likelihoods. The learning signal of this loss seems stronger in this sense; however, it lacks token-level supervision, which may provide a more direct signal for learning a clear contrast between correct and incorrect words. This is an empirical question we pursue in the experiments.
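A sketch of the per-sentence term of this loss, operating on whole-sentence log-likelihoods (function name ours):

```python
def sentence_margin_loss(logp_pos, logp_negs, delta=10.0):
    """Sum of max(0, delta - (log p(x) - log p(x*_j))) over negative sentences.

    Each term is zero once the grammatical sentence outscores the corresponding
    negative sentence by at least the margin delta; otherwise it penalizes the
    shortfall linearly.
    """
    return sum(max(0.0, delta - (logp_pos - lp)) for lp in logp_negs)
```

For instance, with δ = 10, a gap of 20 nats incurs no loss, while a gap of only 5 nats incurs a loss of 5.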
Token-level margin loss Our final loss is a combination of the previous two, replacing g(x_i) in the unlikelihood loss with a margin loss:

g(x_i) = Σ_{x*_i ∈ neg_t(x_i)} max(0, δ − (log p(x_i | x_{1:i−1}) − log p(x*_i | x_{1:i−1}))).

We will see that this loss is the most advantageous in the experiments (Section 4).
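Putting the pieces together, the token-level margin-augmented loss for a single sentence can be sketched as follows; `logprobs[i]` and `negatives[i]` stand in for the LM's predicted log-probabilities at position i, and the function name is ours:

```python
def token_margin_loss(logprobs, negatives, alpha=1.0, delta=10.0):
    """Per-sentence loss: sum_i [ -log p(x_i|h_i) + alpha * g(x_i) ], where
    g(x_i) = sum over negative tokens x*_i of
             max(0, delta - (log p(x_i|h_i) - log p(x*_i|h_i))).

    logprobs[i]  : log p(x_i | x_1..x_{i-1}) of the correct token
    negatives[i] : list of log-probs of negative tokens at position i
    """
    total = 0.0
    for lp, negs in zip(logprobs, negatives):
        g = sum(max(0.0, delta - (lp - ln)) for ln in negs)
        total += -lp + alpha * g
    return total
```

Positions without negative tokens (an empty list in `negatives`) contribute only the standard cross-entropy term, so the margin penalty applies exactly where a negative example exists.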

Parameters
Each method employs a few additional hyperparameters (β for the binary-classification loss, α for the unlikelihood loss, and δ for the margin losses). In preliminary experiments, we select β and α from {1, 10, 100, 1000} by the best average syntactic performance, finding β = 1 and α = 1000. For the two margin losses, we fix β = 1.0 and α = 1.0 and only examine the effects of the margin value δ.
6 In principle, one can accumulate this loss within a single mini-batch for L_lm as we do for the binary-classification loss. However, obtaining L_add requires running the LM entirely on the negative sentences as well, which demands a lot of GPU memory. We avoid this by separating mini-batches for L_lm and L_add. We precompute all possible pairs of (x, x*_j) and create a mini-batch by sampling from them. We make the batch size for L_add (the number of pairs) half of that for L_lm, to equalize the number of sentences contained in both kinds of batches. Finally, in each epoch, we sample at most half as many mini-batches as those for L_lm, to reduce the total training time.

Reflexive pronoun We also create negative examples for reflexive anaphora, by flipping between {themselves}↔{himself, herself}.

Scope of Negative Examples
These two are both related to the syntactic number of a target word. For the binary-classification loss, we regard both as target words, unlike the original work, which only deals with subject-verb agreement (Enguehard et al., 2017). We use a single common linear layer for both constructions.
In this work, we do not create negative examples for NPIs, mainly for technical reasons. Among the four losses, only the sentence-level margin loss can correctly handle negative examples for NPIs, essentially because the other losses are token-level. For NPIs, the left context does not contain the information needed to decide the grammaticality of the target token (a quantifier: no, most, etc.) (Section 2.1). Instead, in this work, we use the NPI test cases as a proxy to detect possible negative (or positive) side effects of specially targeting certain constructions. We will see that, in particular for our margin losses, such negative effects are very small.

Experiments on Additional Losses
We first examine the overall performance of the baseline LSTM-LMs as well as the effects of the additional losses. Throughout the experiments, for each setting, we train five models from different random seeds and report the average score and standard deviation. The code is available at https://github.com/aistairc/lm_syntax_negative.
Naive LSTM-LM performs well The main accuracy comparison across target constructions for the different settings is presented in Table 1.7 We first note that our baseline LSTM-LM performs much better than the model of Marvin and Linzen (2018).

[Table 1: Accuracies of the baseline LSTM-LM, models with the additional margin losses (δ = 10), models with the additional losses (α = 1000, β = 1), and the distilled LSTM-LM of Marvin and Linzen (2018). K19 is the result of the two-layer LSTM-LM distilled from RNNGs (Kuncoro et al., 2019). VP: verb phrase; PP: prepositional phrase; SRC: subject relative clause; ORC: object relative clause. Margin values are set to 10, which works better according to Figure 1.]

7 We use the Stanford tagger (Toutanova et al., 2003) to find the present-tense verbs.

Higher margin value is effective For the two types of margin loss, which margin value should we use? Figure 1 reports average accuracies within the same types of constructions. For both token- and sentence-level losses, task performance increases with δ, but a too-large value (15) causes a negative effect, in particular on reflexive anaphora.8 Increases (degradations) of perplexity are observed with both methods, but this effect is much smaller for the token-level loss. In the following experiments, we fix the margin value to 10 for both, which achieves the best syntactic performance.

8 We omit the comparison but the scores are overall similar.
Which additional loss works better? We see a clear tendency that our token-level margin loss achieves overall better performance. The unlikelihood loss does not work unless we choose a huge weight parameter (α = 1000), and even then it does not outperform ours, with a similar perplexity. The improvements from the binary-classification loss are smaller, indicating that its signals are weaker than those of the other methods with explicit negative examples. The sentence-level margin loss is conceptually advantageous in that it can deal with any type of sentence-level grammaticality, including NPIs. We see that it is overall competitive with the token-level margin loss but suffers from a larger increase of perplexity (4.9 points), which is observed even with smaller margin values (Figure 1). Understanding the cause of this degradation, as well as alleviating it, is an important future direction.

Limitations of LSTM-LMs
In Table 1, the accuracies on dependencies across an object RC are relatively low. The central question in this experiment is whether this low performance is due to a limitation of the current architectures, or to other factors such as frequency. We base our discussion on the contrast between object (7) and subject (8) RCs:
(7) The authors (that) the chef likes laugh.
(8) The authors that like the chef laugh.
Importantly, the accuracies for subject RCs are more stable, reaching 99.8% with the token-level margin loss, although the content words used in the examples are common.9 It is known that object RCs are less frequent than subject RCs (Hale, 2001; Levy, 2008), and it could be that the use of negative examples still does not fully alleviate this factor. Here, to understand the true limitation of the current LSTM architecture, we try to eliminate such other factors as much as possible in a controlled experiment.

9 Precisely, they are not the same. Examples of object RCs are divided into two categories by the animacy of the main subject (animate or not), while subject RCs only contain animate cases. If we select only animate examples from the object RCs, the vocabularies for both RC types are the same, leaving only differences in word order and inflection, as in (7, 8).
Setup We first inspect the frequencies of object and subject RCs in the training data by parsing it with the state-of-the-art Berkeley neural parser (Kitaev and Klein, 2018). In total, subject RCs occur 373,186 times, while object RCs occur only 106,558 times. We create three additional training datasets by adding sentences involving object RCs to the original Wikipedia corpus (Section 2.2). To this end, we randomly sample 30 million sentences from Wikipedia (not overlapping with any sentences in the original corpus), parse them with the same parser, and keep the sentences containing an object RC, amounting to 680,000 sentences. We create the augmented training sets by adding a subset, or all, of these sentences to the original training sentences. Among the test cases about object RCs, we report accuracies only on subject-verb agreement, for which a corresponding portion for subject RCs also exists. This allows us to compare the difficulty of the two types of RCs for the present models. We also evaluate on an "animate only" subset, which corresponds to the test cases for subject RCs with differences only in word order and inflection (like (7) and (8); see footnote 9). Of particular interest to us is the accuracy on these animate cases. We expect that the main reason for the lower performance on object RCs is frequency, and that with our augmentation the accuracy will reach the same level as that for subject RCs.
Results However, for both the full and animate-only cases, accuracies remain below those for subject RCs (Figure 2). Although we see improvements over the original score (93.7), the highest average accuracy with the token-level margin loss on the animate subset is 97.1 ("with that"), not beyond 99%. This result indicates some architectural limitation of LSTM-LMs in handling object RCs robustly at a near-perfect level. Answering why the accuracy does not reach (almost) 100%, perhaps by examining other empirical properties or inductive biases (Khandelwal et al., 2018; Ravfogel et al., 2019), is future work.
Do models generalize explicit supervision, or just memorize it?
One distinguishing property of our margin loss, in particular the token-level loss, is that it is highly lexical, making a contrast explicitly between correct and incorrect words. This direct signal may make models acquire very specialized knowledge about each target word that does not generalize across similar words and contexts. In this section, to get insights into the transferability of the syntactic knowledge induced by our margin losses, we provide an ablation study in which certain negative examples are removed during training: (1) at a token level, by not giving negative examples for verbs that appear in the test set (-TOKEN); and (2) at a construction level, by not giving negative examples occurring in a particular construction (across a prepositional phrase (PP), subject RC, object RC, and long verb phrase (VP)) (-PATTERN).11 We hypothesize that models are less affected by token-level ablation, as knowledge transfer across words appearing in similar contexts is promoted by the language modeling objective. We expect that construction-level supervision is necessary to induce robust syntactic knowledge, as different phrases, e.g., a PP and a VP, are perhaps processed differently.
Results Figure 3 shows the main results. Across models, we restrict the evaluation to four non-local dependency constructions, which we also select as the ablation candidates. For a model with -PATTERN, we evaluate only on examples of the construction ablated in training (see caption). To our surprise, both -TOKEN and -PATTERN have similar effects, except on "Across an ORC", where the degradation from -PATTERN is larger. This may be related to the inherent difficulty of object RCs for LSTM-LMs that we verified in Section 5. For such particularly challenging constructions, models may need explicit supervision signals. We observe less score degradation when ablating prepositional phrases and subject RCs. This suggests that, for example, the syntactic knowledge strengthened for prepositional phrases with negative examples can be exploited to learn the syntactic patterns of subject RCs, even when direct learning signals on subject RCs are missing. We see a score degradation of approximately 10.0 points on long VP coordination for both ablations. Does this mean that long VPs are particularly hard in terms of transferability? We find that the main reasons for this drop, relative to the other cases, are rather technical, essentially due to the target verbs used in the test cases. Tables 2 and 3 show that the failed cases for the ablated models are often characterized by the presence of either like or likes. Excluding these cases ("other verbs" in Table 3), the accuracies reach 99.2 and 98.0 for -TOKEN and -PATTERN, respectively. These verbs do not appear as target verbs in the test cases of the other tested constructions. This result suggests that the transferability of syntactic knowledge to a particular word may depend on some characteristics of that word. We conjecture that the reason for the weak transferability to like and likes is that they are polysemous; e.g., in the corpus, like is much more often used as a preposition, and its use as a present-tense verb is rare.
This type of frequency issue may be one reason for the reduced transferability. In other words, like can be seen as a verb whose usage is challenging to learn from the corpus alone, and our margin loss helps in such cases.

Discussion and Conclusion
Our results with explicit negative examples are overall positive. We have demonstrated that models exposed to these examples at training time in an appropriate way are capable of handling the targeted constructions at a near-perfect level, with a few exceptions. We found that our new token-level margin loss is superior to the other approaches, and that the remaining challenging cases are dependencies across an object relative clause.
Object relative clauses are known to be harder for humans as well, and our results may indicate some similarities in the sentence processing behaviors of humans and RNNs, though other studies also find dissimilarities between them (Linzen and Leonard, 2018; Wilcox et al., 2019a). The difficulty of object relative clauses for RNN-LMs has also been observed in prior work (Marvin and Linzen, 2018; van Schijndel et al., 2019). A new insight provided by our study is that this difficulty holds even after alleviating the frequency effects by augmenting the target structures along with direct supervision signals. This indicates that RNNs might inherently suffer from some memory limitation, like human subjects, for whom the difficulty of particular constructions, including center-embedded object relative clauses, is known to arise from memory limitations (Gibson, 1998; Demberg and Keller, 2008) rather than purely from the frequency of the phenomena. In terms of language acquisition, the supervision provided in our approach can be seen as direct negative evidence (Marcus, 1993). Since human learners are known to acquire syntax without such direct feedback, we do not claim that our proposed learning method itself is cognitively plausible.
One limitation of our approach is that the scope of negative examples has to be predetermined and fixed. Alleviating this restriction is an important future direction. Though challenging, we believe that our final analysis of transferability, which indicates that negative examples do not have to be complete and can be noisy, suggests the possibility of a mechanism that induces negative examples itself during training, perhaps relying on other linguistic cues or external knowledge.