Nightmare at test time: How punctuation prevents parsers from generalizing

Punctuation is a strong indicator of syntactic structure, and parsers trained on text with punctuation often rely heavily on this signal. Punctuation is a diversion, however, since human language processing does not rely on punctuation to the same extent, and in informal texts, we therefore often leave out punctuation. We also use punctuation ungrammatically for emphatic or creative purposes, or simply by mistake. We show that (a) dependency parsers are sensitive to both absence of punctuation and to alternative uses; (b) neural parsers tend to be more sensitive than vintage parsers; (c) training neural parsers without punctuation outperforms all out-of-the-box parsers across all scenarios where punctuation departs from standard punctuation. Our main experiments are on synthetically corrupted data to study the effect of punctuation in isolation and avoid potential confounds, but we also show effects on out-of-domain data.


Introduction
We study the sensitivity of modern dependency parsers to punctuation. While punctuation was originally motivated by reading aloud, serving the purpose of "breath marks" (Baldwin and Coady, 1978), many modern-day punctuation systems are designed to facilitate grammatical disambiguation. This paper aims to show that, for this reason, punctuation can significantly hurt the generalization ability of state-of-the-art syntactic parsers. In other words, syntactic parsers become too reliant on punctuation and therefore suffer when punctuation is absent or used creatively. Such uses are abundant; see Table 1 for examples from Twitter. Situations of this kind, where highly predictive features are absent or distorted at test time, were referred to by Globerson and Roweis (2006) as nightmare at test time. Human reading is very robust to variation in punctuation (Baldwin and Coady, 1978), so creative use of punctuation does not hurt human reading performance. In effect, sensitivity to punctuation is a major obstacle that prevents our syntactic parsers from achieving human-level robustness.
The generalization ability of a dependency parser is usually measured by evaluating its accuracy on held-out data, our yardstick to prevent over-fitting; that is, we define the degree to which a parser has over-fitted to the training data as the difference between performance on training data and performance on the held-out data. This practice is poor when data is not i.i.d., since the held-out data cannot be assumed to be representative. In such cases, little or no over-fitting does not guarantee that our parsers have learned important linguistic generalizations; rather, the parsers may have over-fitted to superficial cues that are present in both the training and test datasets (Jo and Bengio, 2017). We argue that punctuation signs are superficial cues preventing modern parsers from learning appropriately high-level abstractions from our datasets.
Contributions We evaluate three neural dependency parsers for English, as well as two older alternatives, on a standard benchmark, before and after stripping punctuation, as well as after injecting more punctuation signs in the benchmark. We show that (a) projective parsers are, unsurprisingly, more sensitive to punctuation injection than non-projective ones, since punctuation injection may introduce crossing edges, and (b) neural parsers are more sensitive than vintage parsers. The latter is our main contribution, but we also show that training a neural parser without punctuation outperforms all parsers trained in a regular fashion across all punctuation scenarios. Our experiments are on semi-synthetic data to control for confounds, but we also show the parser trained without punctuation is superior on real data with non-standard punctuation.

Punctuation in Stanford dependencies
Dependency annotation Dependency annotation refers to the manual assignment of syntactic structures to sentences, following one of several sets of available annotation guidelines. This paper focuses exclusively on the Stanford dependencies annotation scheme (de Marneffe and Manning, 2008). This scheme restricts the set of possible syntactic structures to single-rooted, ordered, possibly non-projective trees whose edges are uniquely labeled by a single dependency label.
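The structural constraints above can be made concrete with a minimal sketch. It assumes a head-array encoding that is not taken from the paper: token i (numbered from 1) stores the index of its head, and 0 stands for an artificial root node. The function names are hypothetical. The sketch checks that a structure is a single-rooted tree and, separately, whether it is projective, i.e., free of crossing edges.

```python
from typing import List


def is_well_formed(heads: List[int]) -> bool:
    """Check that `heads` encodes a single-rooted tree.

    heads[i - 1] is the head of token i (tokens are numbered from 1);
    0 stands for an artificial root node (an encoding assumed here).
    """
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return False
    # Exactly one token may attach to the artificial root.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Every token must reach the root without running into a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def is_projective(heads: List[int]) -> bool:
    """Return True if no two arcs cross (the root arc is included)."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Arc (c, d) starts strictly inside (a, b) and ends outside it.
            if a < c < b < d:
                return False
    return True


# Token 2 is the root; tokens 1 and 3 attach to it: a projective tree.
flat = [2, 0, 2]
# Arcs 1-3 and 2-4 cross: well-formed but non-projective.
crossing = [3, 4, 0, 3]
```

Both example structures are well-formed trees; only the second contains crossing edges, which is the property that matters for parsers restricted to projective output.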
Punctuation Punctuation should be distinguished from diacritics and logographs. The two most frequently used punctuation signs are periods and commas. Periods ("."), however, are potentially ambiguous with other uses of dots, typically indicating omissions or pauses. When dots are used emphatically and creatively it is hard to maintain this distinction, and we will simply refer to dots and commas in this paper. We ignore other punctuation signs, including dashes, question and exclamation marks, and colons and semicolons.
Punctuation is, among other things, used to mark boundaries between constituents of written language. Space characters, for example, separate words, albeit sometimes inconsistently. Spacing is a fairly recent innovation in writing; classical Latin and Greek did not leave spaces between words, and many Asian languages, e.g., Thai and

Experiments
This section describes how we remove and inject punctuation (our perturbation maps), and details of the parsers used in our experiments.
Perturbation maps Since dots consistently attach to the root token of the sentence, and commas attach to their left neighbour or to the root token, we can remove and inject additional punctuation in a sentence without affecting the rest of its syntactic structure and without violating the well-formedness of dependency trees. Note, however, that injecting a root-dominated dot or comma may lead to crossing edges, i.e., turn a projective dependency tree into a non-projective one. This may lead to cascading errors for projective dependency parsers (Ng and Curran, 2015). In our experiments, arc-eager MALTPARSER and STANFORD are the only projective parsers. We therefore propose two perturbation maps (Jo and Bengio, 2017): (a) simply removing punctuation, and (b) a simple injection scheme with two parameters χ and δ. Let a dependency structure be an ordered tree with n nodes decorated with words w_1, ..., w_n. At any node 1 ≤ i ≤ n, we (a) inject a comma at position i with probability χ and move nodes i ≤ j ≤ n to positions j + 1, increasing the size of the graph by 1; and (b) inject a dot at position i + 1 with probability δ and move nodes i < j ≤ n to positions j + 1, increasing the size of the graph by 1. If we follow standard methodology and ignore punctuation when evaluating parsers, we can compare evaluations before and after applying the injection scheme. It is equally straightforward to remove punctuation without affecting the rest of the dependency tree: for every punctuation node w_j that is removed, each element w_i to its right (i > j) moves one position to the left (to i − 1), decreasing the length of the sentence by 1 each time.
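The two perturbation maps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the `Token` type, the function names, the "punct" label for injected signs, and the choice to attach every injected sign to the root token are all assumptions made here; the removal step further assumes that punctuation tokens are leaves. The parameter names `chi` and `delta` mirror χ and δ.

```python
import random
from typing import List, NamedTuple


class Token(NamedTuple):
    form: str
    head: int   # 1-based position of the head; 0 for the root token
    label: str


PUNCT = {".", ","}


def remove_punct(sent: List[Token]) -> List[Token]:
    """Drop dots and commas and re-index heads (assumes they are leaves)."""
    keep = [i for i, t in enumerate(sent, start=1) if t.form not in PUNCT]
    new_index = {old: new for new, old in enumerate(keep, start=1)}
    new_index[0] = 0
    return [
        Token(sent[i - 1].form, new_index[sent[i - 1].head], sent[i - 1].label)
        for i in keep
    ]


def inject_punct(sent: List[Token], chi: float, delta: float,
                 rng: random.Random) -> List[Token]:
    """Inject a comma before word i with probability chi and a dot after
    word i with probability delta; injected signs attach to the root token
    (an assumption of this sketch)."""
    layout = []  # (form, original position or None for an injected sign)
    for i, tok in enumerate(sent, start=1):
        if rng.random() < chi:
            layout.append((",", None))
        layout.append((tok.form, i))
        if rng.random() < delta:
            layout.append((".", None))
    # Map original positions to their shifted positions.
    new_index = {0: 0}
    for new, (_, old) in enumerate(layout, start=1):
        if old is not None:
            new_index[old] = new
    root = new_index[next(i for i, t in enumerate(sent, 1) if t.head == 0)]
    out = []
    for form, old in layout:
        if old is None:
            out.append(Token(form, root, "punct"))
        else:
            tok = sent[old - 1]
            out.append(Token(form, new_index[tok.head], tok.label))
    return out


# Illustrative sentence: "Yesterday , she slept ."
example = [
    Token("Yesterday", 4, "tmod"),
    Token(",", 4, "punct"),
    Token("she", 4, "nsubj"),
    Token("slept", 0, "root"),
    Token(".", 4, "punct"),
]
stripped = remove_punct(example)
noisy = inject_punct(stripped, chi=0.05, delta=0.05, rng=random.Random(0))
```

Because injected signs are leaves, stripping punctuation from an injected sentence recovers the punctuation-free tree, which is why scores that ignore punctuation remain comparable before and after injection.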
Note that both removing punctuation and our injection scheme can be seen as perturbation maps (Jo and Bengio, 2017) of our dataset, with the following important properties: (a) grammatical structure recognizability, i.e., human ability to correctly process sentences, is preserved (Baldwin and Coady, 1978), (b) surface statistical regularities are qualitatively different, and (c) there exists a non-trivial generalization map between the original dataset and the perturbed version. These properties mean we can use our punctuation injection scheme to evaluate the sensitivity of neural dependency parsers to the surface statistical regularities involving dots and commas (Jo and Bengio, 2017). Since human reading is largely unaffected by erroneous punctuation, we may expect parsers to be robust to absence of punctuation and punctuation injection, as well. Our results clearly show this is not the case; in fact, recently proposed neural dependency parsers are very sensitive to differences in punctuation.

Our dependency parsers We use five parsers in our experiments: the Uppsala parser (UUPARSER) (de Lhoneux et al., 2017a,b), the graph-based parser proposed by Kiperwasser and Goldberg (2016) (KGRAPHS), the arc-eager MALTPARSER (Nivre et al., 2007), the TURBOPARSER (Fernández-González and Martins, 2015), and the STANFORD parser (Chen and Manning, 2014). UUPARSER is a neural transition-based dependency parser, while KGRAPHS is a neural graph-based parser. MALTPARSER is a more traditional transition-based parser, and TURBOPARSER is a more traditional graph-based parser. Finally, the STANFORD parser is a projective, neural transition-based dependency parser. All parsers rely on predicted part-of-speech tags, except UUPARSER (which does not rely on part-of-speech information at all). We use the TURBOTAGGER to obtain those. See Table 2 for an overview of our parsers.
Finally, we also evaluate three non-standard versions of the UUPARSER, namely, a parser trained with the same parameters as the off-the-shelf parser (de Lhoneux et al., 2017b), but which simply ignores dots and commas completely (NOPUNCT), and two heavily regularised versions of the parser trained in the standard fashion: (a) a version trained with the drop-out parameter set to 0.8 (zeros out 80% of activations); (b) a version with the gradient clipping parameter set to 0.075. We do so to answer the question of whether more heavily regularized dependency parsers are less sensitive to punctuation (they are not).

Results and analysis
We discuss the sensitivity of off-the-shelf dependency parsers to our perturbation maps, comparing to a parser trained after removing punctuation in the training data, as well as to heavily regularised versions of the same parser.
No punctuation We first test our parsers on a version of the validation set where we strip away all punctuation. The data thus consists of newswire (WSJ 22) with punctuation removed. This is similar to Example (1) in Table 1, but in-domain. The results are in the second results column in Table 3, with the relative increases in error listed in the third results column. The drop induced by removing punctuation is quite dramatic: The UUPARSER, for example, suffers an absolute drop of 5.4% LAS, or an error increase of 67%. For every three mistakes UUPARSER makes, stripping away punctuation makes it introduce another two. Note that, generally, the relative increase in error is much higher for the three neural parsers, and that the regularisation strategies (drop-out and gradient clipping) do not seem to help much.
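The arithmetic behind these figures can be sketched as follows, assuming that the relative increase in error is the error rate under perturbation divided by the baseline error rate, minus one, with scores given as fractions. The function names are hypothetical. The second function backs out the baseline error rate implied by the two rounded figures quoted above; the result is an approximation derived from rounded numbers, not a score reported in Table 3.

```python
def relative_error_increase(bl_score: float, sys_score: float) -> float:
    """Relative increase in error when accuracy falls from bl_score
    (original text) to sys_score (perturbed text); both are fractions."""
    return (1.0 - sys_score) / (1.0 - bl_score) - 1.0


def implied_baseline_error(abs_drop: float, rel_increase: float) -> float:
    """Baseline error rate implied by an absolute drop in accuracy and the
    matching relative increase in error: abs_drop = rel_increase * error."""
    return abs_drop / rel_increase


# Figures quoted in the text for UUPARSER: 5.4 points LAS, 67% more errors.
uuparser_error = implied_baseline_error(0.054, 0.67)

# A relative increase of 2/3 means two extra mistakes for every three.
extra_mistakes_per_three = 3 * 0.67
```

With these rounded inputs the implied baseline error rate comes out at roughly 8%, which is consistent with the "three mistakes become five" reading of a 67% increase.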
Comma and dot injection At medium injection rates, all parsers are sensitive to punctuation injection. With δ = 0.05, χ = 0.05, for example, all parsers perform worse than in the absence of punctuation. Our main observation is, again, that neural parsers suffer higher relative increases in errors than vintage parsers. Note that the MALTPARSER is a projective parser and therefore has a higher relative increase in error; but TURBOPARSER is much more robust than the other parsers. That said, it still does much worse than the UUPARSER trained without punctuation.
Table 3: Labeled attachment scores with punctuation removed. All parsers suffer from absence of or additional punctuation. The relative increase in error ((1 − SYS)/(1 − BL) − 1, with BL the performance on the original text and SYS the performance under NO PUNCT and δ = 0.1, χ = 0.1, respectively) for neural parsers is higher than for non-neural parsers. GWEB and FOSTER scores are on development sentences (of at least five words) with no punctuation.
Evaluation on informal text with non-standard punctuation We also evaluate the models on sentences with non-standard punctuation in the development sections in the Google Web Treebank with informal text (from Yahoo Answers and user reviews). Specifically, we evaluate the models on sentences with more than one dot. Again, we show that the neural dependency parser trained without punctuation is superior to the other parsers.

Related work
Punctuation in parsing Spitkovsky et al. (2011) introduced the idea of splitting sentences at punctuation and imposing parsing restrictions over the fragments, and observed significant improvements in the context of unsupervised parsing. Ng and Curran (2015) aim to prevent cascading errors by enforcing correct punctuation arcs. They restrict themselves to projective dependency parsing; erroneous punctuation arcs do not lead to cascading errors in non-projective dependency parsing. Ma et al. (2014), motivated by the same observation, treat punctuation marks as properties of their neighboring words rather than as individual tokens, showing improvements on in-domain data.
Breaking NLP models Jia and Liang (2017) show how machine reading models can easily be broken with distractor sentences at test time and propose an alternative evaluation scheme, and Belinkov and Bisk (2018) show how susceptible character-based machine translation models are to noise. Both papers are similar to ours in evaluating the performance of state-of-the-art models under corruptions of the data. There was recently a workshop dedicated to evaluation of NLP models under human adversarial example selection (Ettinger et al., 2017). Historically, NLP models were rarely evaluated on synthetic or otherwise adversarial data, but we believe this is a fruitful research direction. This is largely a philosophical question, and we believe a philosophical argument is in order. The American philosopher John Dewey (Dewey, 1910) distinguishes three modes of thinking: (i) common reasoning, which identifies patterns in available, historical data, (ii) empirical thinking, which collects new data to vary the experimental conditions, and (iii) experimental thinking, which actively modifies the conditions in controlled experiments to isolate the relevant variables. We believe recent work on breaking NLP models is an attempt to introduce experimental thinking into NLP, which has otherwise been limited (or handicapped, in Dewey's words) by what data happens to be available.

Conclusions
We evaluate the sensitivity of five dependency parsers to variations in punctuation, showing that available neural parsers tend to be more sensitive to such variation. We also show, however, that training neural parsers without punctuation provides a robust model that is better than any off-the-shelf parser.