Text Genre and Training Data Size in Human-like Parsing

Domain-specific training typically makes NLP systems work better. We show that this extends to cognitive modeling as well by relating the states of a neural phrase-structure parser to electrophysiological measures from human participants. These measures were recorded as participants listened to a spoken recitation of the same literary text that was supplied as input to the neural parser. Given more training data, the system derives a better cognitive model — but only when the training examples come from the same textual genre. This finding is consistent with the idea that humans adapt syntactic expectations to particular genres during language comprehension (Kaan and Chun, 2018; Branigan and Pickering, 2017).


Introduction
Natural language processing (NLP) systems based on deep neural networks are sensitive to the amount and type of training data that they receive. A "data hungry" method may not work well until it is supplied with sufficient examples (e.g. Yogatama et al., 2019). Likewise, transfer to a different textual genre may be poor (Petrov and McDonald, 2012). This is the classic problem of domain adaptation (see footnote 1), which arises in many areas of NLP.
Footnote 1: It remains quite difficult to reconcile human-like incremental parsing with high performance out-of-domain; many researchers take a nonincremental whole-sentence approach (Gildea, 2001; Baucom et al., 2013; Joshi et al., 2018).
This paper revisits domain adaptation in the context of human-like parsing. With this human-like aspect in mind, we consider models that use linguistically plausible trees (see Frank, 2011 for a review) and operate incrementally from left to right (e.g. Steedman, 2000). We quantify the fit to human language performance using freely available electrophysiological data (henceforth: EEG) that was elicited by a pre-existing literary text (Brennan and Hale, 2019).
These EEG data come from a naturalistic stimulus and, in virtue of their higher temporal resolution, are more detailed than reading times or plausibility judgments. Hale et al. (2018) were the first to model them using a neural parser. In that study, textual training data came from the same book as did the human participants' stimuli. While this yielded a model that was quite well-matched to the EEG modeling task, its training data was confined to just 1543 sentences (24K words). In contrast, recent studies suggest that neural language models require substantially more data to achieve human-like linguistic competence (Gulordava et al., 2018; Futrell et al., 2019; Frank and Hoeks, 2019).
We investigate this question of data size together with a contrast between textual domains or "genres". Modeling human neural signals, we find that in-domain training leads to a progressively better fit as more examples are added to the training set, whereas with out-of-domain data, more examples do not help. This is interesting given the consistent reductions in language modeling perplexity with more data, which we observe across both domains. We further find that, across all amounts of training data, models that incorporate linguistically plausible phrase structure achieve a better fit to human EEG data than models that do not. This suggests that phrase structure should play an important role in human-like models of language comprehension, even in models that benefit from large training data.

Newspaper text
In the second genre, we randomly sampled news articles from the English Gigaword corpus (Graff et al., 2005). Articles were sampled without regard to their national source, i.e. Agence France-Presse, New York Times or Xinhua News Agency. Sentences in this sample were, on average, 20 words long. This out-of-domain text had a CosineTop50 dissimilarity level of 0.56.

Presumptive Trees
Both genres were parsed using a reimplementation of the Berkeley parser (Petrov et al., 2006) to yield presumptively correct, "silver-grade" trees. This Berkeley parser was trained on a diverse set of annotated data, including the Penn Treebank's Wall Street Journal materials (Marcus et al., 1993), the Question Treebank (Judge et al., 2006), OntoNotes (e.g. Pradhan and Ramshaw, 2017) and the Parsing-the-Web corpora (Petrov and McDonald, 2012). In a manual inspection of randomly sampled silver trees, the only obvious mistakes were tagging errors (1 newspaper, 2 literature), which are not harmful since RNNGs do not use part-of-speech tags. Indeed, the bracketing and phrase labels on these silver trees appeared to be fully consistent with the Penn Treebank Bracketing Guidelines (Bies et al., 1995). Before being passed to the RNNG as training data, these trees were post-processed to remove empty elements, punctuation, and function tags.
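The post-processing step can be illustrated with a short Python sketch. It is not the authors' code: the nested-list tree encoding, the helper names (parse, clean, to_string) and the set of punctuation tags are assumptions made for illustration, following common Penn Treebank conventions.

```python
import re

# Assumed set of Penn Treebank punctuation tags (illustrative, not from the paper).
PUNCT_TAGS = {",", ".", ":", "``", "''", "-LRB-", "-RRB-"}

def parse(s):
    """Parse one bracketed tree into nested lists [label, child, ...]; words are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        node = [tokens[pos]]          # the node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # consume ")"
        return node

    return read()

def strip_function_tag(label):
    """NP-SBJ -> NP, NP-SBJ-1 -> NP; labels such as -NONE- are left intact."""
    if label.startswith("-"):
        return label
    return re.split(r"[-=]", label)[0]

def clean(node):
    """Return a cleaned copy of node, or None if nothing remains under it."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):  # preterminal
        if label == "-NONE-" or label in PUNCT_TAGS:
            return None               # drop empty elements and punctuation
        return [label, children[0]]
    kept = [c for c in (clean(ch) for ch in children) if c is not None]
    if not kept:
        return None                   # constituent dominated only removed material
    return [strip_function_tag(label)] + kept

def to_string(node):
    """Render nested lists back into bracketed form."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(to_string(c) for c in node) + ")"

tree = "(S (NP-SBJ (-NONE- *)) (VP (VBD sat) (PP-LOC (IN by) (NP (PRP$ her) (NN sister)))) (. .))"
print(to_string(clean(parse(tree))))
# prints: (S (VP (VBD sat) (PP (IN by) (NP (PRP$ her) (NN sister)))))
```

Constituents that dominate only removed material are pruned as well, so that no empty brackets remain in the training trees.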

Vocabulary
Within each of the two genres, we facilitate comparison across different training set sizes by running the RNNGs with a shared vocabulary derived from the largest training configuration. An attestation frequency cut-off was applied to each genre's vocabulary to ensure broadly comparable rates of out-of-vocabulary words on the development set, Alice in Wonderland. For the lexically-similar literature, this threshold was 5 attestations, whereas for the newspaper text this threshold was 20 attestations. On the development set, these cut-offs yielded out-of-vocabulary rates of 2.3 to 2.4 percent for newspaper text and 0.5 to 1.28 percent for lexically-similar literature.
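As a minimal sketch of the cut-off and the resulting out-of-vocabulary rate, consider the following Python fragment. The function names are hypothetical, and treating the threshold as inclusive (a word is kept if it is attested at least min_count times) is an assumption; the text does not say whether the cut-off is inclusive.

```python
from collections import Counter

def build_vocab(train_tokens, min_count):
    """Keep word types attested at least min_count times (inclusive: an assumption)."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def oov_rate(tokens, vocab):
    """Fraction of tokens whose type is missing from the vocabulary."""
    if not tokens:
        return 0.0
    return sum(1 for w in tokens if w not in vocab) / len(tokens)

# Toy data, not drawn from the corpora used in the paper.
train = ["alice"] * 5 + ["rabbit"] * 5 + ["hookah"] * 2
dev = ["alice", "rabbit", "hookah", "alice"]

vocab = build_vocab(train, min_count=5)
print(oov_rate(dev, vocab))  # prints: 0.25
```

Raising min_count shrinks the vocabulary and raises the out-of-vocabulary rate, which is why a larger and lexically more distant corpus can call for a different threshold to reach a comparable rate.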

Achieved perplexity
Per-action perplexity levels (footnote 4) achieved by the trained RNNG parsing models on the development set are shown in Figure 1. Validation perplexity consistently improves with more and more silver-grade training data, even when that data comes from a different domain (footnote 5).
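Per-action perplexity is based on the joint probability p(x, y) of strings and trees, so word-generation and tree-building actions are counted alike. A minimal Python sketch follows, assuming each sentence is represented as a list of natural-log probabilities, one per parser action; this input format and the function name are illustrative assumptions.

```python
import math

def per_action_perplexity(action_logprobs):
    """exp of the negative mean log-probability over all parser actions.

    action_logprobs: list of sentences, each a list of natural-log
    probabilities, one per action (word-generation and tree-building alike).
    """
    total = sum(lp for sent in action_logprobs for lp in sent)
    n_actions = sum(len(sent) for sent in action_logprobs)
    return math.exp(-total / n_actions)

# Toy check: if every action has probability 0.25, the perplexity is about 4.
toy = [[math.log(0.25)] * 3, [math.log(0.25)] * 5]
print(per_action_perplexity(toy))
```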

EEG Regression model
Surprisal values from a beam search parser based on RNNG are entered as predictors into a regression model of scalp voltages. Models are fit with the brm function in R, and model fits are compared using Bayesian model comparison (Vehtari et al., 2017). These regression models include random intercepts for participant alongside predictors to account for factors of non-interest that nevertheless are known to influence sentence processing difficulty (see e.g. Goodkind and Bicknell, 2018). These are: sentence position in the stimulus text, word position within each sentence, acoustic sound power, and unigram word frequency in the HAL corpus (Balota et al., 2007). Unigram predictors are included for the previous word, the current word and the next word. The EEG data themselves come from 33 datasets that were collected by Brennan and Hale (2019) while participants listened to the first chapter of Alice in Wonderland (footnote 6). Ocular artifacts and other noise sources are removed from the raw signal using ICA and visual inspection.
Footnote 3: The RNNG hyperparameters are: 2-layer stack LSTM (Dyer et al., 2015), 450 hidden units, and an initial SGD learning rate of 0.3, decayed exponentially with a factor of 0.1 applied after the tenth epoch.
Footnote 4: The per-action perplexity is computed based on the joint probability of strings (x) and trees (y), denoted as p(x, y), which therefore aggregates the perplexity of the next-word prediction with the perplexity of tree-building actions. Approximate inference methods such as importance sampling can be used to derive an estimate of p(x) (Dyer et al., 2016).
Footnote 5: While relative perplexity levels internal to a genre are comparable in virtue of a shared vocabulary, absolute perplexity values are not directly comparable across the genres since these vocabularies are different.
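The surprisal predictor in this regression can be sketched as follows. The text does not spell out how surprisal is read off the beam, so this sketch assumes the standard prefix-probability formulation: the surprisal of a word is the log ratio of the total probability mass on the beam before and after that word is consumed. The input format and function names are hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def beam_surprisals(beam_logprobs):
    """Word-by-word surprisal in bits from beam contents.

    beam_logprobs[t] holds the natural-log joint probabilities of the analyses
    on the beam after t words have been consumed (index 0 is the initial state).
    The summed beam mass serves as an approximate prefix probability.
    """
    prefix = [logsumexp(beam) for beam in beam_logprobs]
    return [(prefix[t - 1] - prefix[t]) / math.log(2)
            for t in range(1, len(prefix))]

# Toy beam: total mass goes 1 -> 0.5 -> 0.25, so each word costs about 1 bit.
beams = [[0.0],
         [math.log(0.5)],
         [math.log(0.125), math.log(0.125)]]
print(beam_surprisals(beams))
```

Because the beam retains only some analyses, the summed mass underestimates the true prefix probability; the resulting values are approximations whose quality depends on beam width.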
The EEG data are reduced to a single spatio-temporal region of interest (ROI) comprising the data from anterior channels across both hemispheres between 200 and 400 ms after the onset of content words. This anterior ROI has, uniquely, shown sensitivity to surprisal values from incremental parsers under a data-driven whole-scalp analysis in Brennan and Hale (2019). Note that this ROI is earlier and more anterior than the usual locus of the N400 component, which typically manifests on central electrodes at or around 400 ms post-stimulus (for a review see Kaan, 2007).
Figure 2 plots the goodness-of-fit of a regression model that includes RNNG-derived complexity metrics. As the neural phrase structure parser is trained on increasingly large corpora, the regression model of EEG amplitudes fits better and better. However, this pattern only obtains with in-domain training data that is lexically-similar to the first chapter of Alice in Wonderland. When trained on newspaper text from the Gigaword corpus, the goodness of fit to human EEG data remains about the same no matter how much training data is supplied. Tukey's test of additivity indicates an interaction between genre and training set size, F(1, 5) = 171.3, p < 0.000001. No effect of surprisal was observed at the electrodes and timepoints corresponding to the N400 component.
Figure 3 compares the RNNG, which explicitly uses phrase structure representations, to an LSTM sequence model which does not (Hochreiter and Schmidhuber, 1997). Here, both models receive the benefit of in-domain training on the lexically-similar literature. But regardless of how much in-domain training data is made available, the explicitly phrase-structural RNNG always offers a better account of the EEG signal.

Related Work
The importance of domain adaptation in NLP has been well-established in earlier work (see Daumé III, 2007 and footnote 1), including applications to parsing (Sarkar, 2001; McClosky et al., 2006; Søgaard and Rishøj, 2010; Weiss et al., 2015). Our approach to in-domain data selection is closely related to earlier work in language modeling and machine translation (Keller and Lapata, 2003 First, we consider the interaction between domain and amount of training data, rather than examining each variable in isolation. Second, we investigate the impact of these variables on cognitive modeling, which reveals a pattern that is different from what we observe in the standard perplexity evaluation. We focus on text genre, rather than online adaptation as van Schijndel and Linzen (2018) do. Despite coming at the problem from a different direction (and using EEG rather than self-paced reading), our results agree with van Schijndel and Linzen in suggesting that some kind of adaptation must be going on in human language comprehension.

Conclusion
These comparisons confirm that genre matters. If surprisal describes human linguistic expectations, then we can say that those expectations are better modeled by a parsing system that benefits from in-domain training. This would follow if, as Kaan and Chun (2018) have suggested, people are able to very rapidly adjust their syntactic expectations to match a particular genre. Indeed, these expectations seem to be phrase-structural in nature. Certainly the presence of unigram nuisance predictors in the EEG regression (section 5) and the comparatively worse performance of the LSTM sequence model (Figure 3) render it unlikely that this finding is due to word frequency or superficial co-occurrence. Rather, the neural parser has learned something about 19th-century children's literature that can be captured at the syntactic level. Whatever these syntactic properties are, it could have been the case that they were equally learnable from newswire text or Alice-like books. But in fact Alice-like books generalize better to human EEG data. This use of moment-by-moment processing difficulty to adjudicate between trained NLP systems offers a reminder that the quest for human-level performance in language technology should always be understood in relation to a particular kind of linguistic performance in a particular genre.