How do you correct run-on sentences it's not as easy as it seems

Run-on sentences are common grammatical mistakes but little research has tackled this problem to date. This work introduces two machine learning models to correct run-on sentences that outperform leading methods for related tasks, punctuation restoration and whole-sentence grammatical error correction. Due to the limited annotated data for this error, we experiment with artificially generating training data from clean newswire text. Our findings suggest artificial training data is viable for this task. We discuss implications for correcting run-ons and other types of mistakes that have low coverage in error-annotated corpora.


Introduction
A run-on sentence is defined as having at least two main or independent clauses that lack either a conjunction to connect them or a punctuation mark to separate them. Run-ons are problematic because they not only make the sentence unfriendly to the reader but potentially also to the local discourse. Consider the example in Table 1.
In the field of grammatical error correction (GEC), most work has focused on determiner, preposition, verb and other errors which non-native writers make more frequently. Run-ons have received little to no attention even though they are common errors for both native and non-native speakers. Among college students in the United States, run-on sentences are the 18th most frequent error overall and the 8th most frequent error made by students who are not native English speakers (Leacock et al., 2014).
Correcting run-on sentences is challenging (Kagan, 1980) for several reasons:
• They are sentence-level mistakes with long-distance dependencies, whereas most other grammatical errors are local and only need a small window for decent accuracy.
• There are multiple ways to fix a run-on sentence. For example, one can a) add sentence-ending punctuation to separate the clauses; b) add a conjunction (such as and) to connect the two clauses; or c) convert an independent clause into a dependent clause.
• They are relatively infrequent in existing, annotated GEC corpora and therefore existing systems tend not to learn how to correct them.

Table 1
Before correction
But the illiterate will not stay illiterate always if they put an effort to improve and are given a chance for good education, they can still develop into a group of productive Singaporeans.
After correction
But the illiterate will not stay illiterate always. If they put an effort to improve and are given a chance for good education, they can still develop into a group of productive Singaporeans.

In this paper, we analyze the task of automatically correcting run-on sentences. We develop two methods, a conditional random field model (roCRF) and a Seq2Seq attention model (roS2S), and show that they outperform models from the sister tasks of punctuation restoration and whole-sentence grammatical error correction. We also experiment with artificially generating training examples in clean, otherwise grammatical text, and show that models trained on this data do nearly as well on naturally occurring run-on sentences as on artificial ones.

Related Work
Early work in the field of GEC focused on correcting specific error types such as preposition and article errors (Tetreault et al., 2010; Rozovskaya and Roth, 2011; Dahlmeier and Ng, 2011), but did not consider run-on sentences. The closest work to our own is Israel et al. (2012), who used Conditional Random Fields (CRFs) for correcting comma errors (excluding comma splices, a type of run-on sentence). Lee et al. (2014) used a similar system based on CRFs but focused on comma splice correction. Recently, the field has focused on the task of whole-sentence correction, targeting all errors in a sentence in one pass. Whole-sentence correction methods borrow from advances in statistical machine translation (Madnani et al., 2012; Felice et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2016) and, more recently, neural machine translation (Yuan and Briscoe, 2016; Chollampatt and Ng, 2018; Xie et al., 2018; Junczys-Dowmunt et al., 2018).
To date, GEC systems have been evaluated on corpora of non-native student writing such as NUCLE (Dahlmeier et al., 2013) and the Cambridge Learner Corpus First Certificate of English (Yannakoudakis et al., 2011). The 2013 and 2014 CoNLL Shared Tasks in GEC used NUCLE as their train and test sets (Ng et al., 2013, 2014). There are few instances of run-on sentences annotated in both test sets, making it hard to assess system performance on that error type.
A closely related task to run-on error correction is that of punctuation restoration in the automatic speech recognition (ASR) field. Here, a system takes as input a speech transcription and is tasked with inserting any type of punctuation where appropriate. Most work utilizes textual features with n-gram models (Gravano et al., 2009), CRFs (Lu and Ng, 2010), convolutional neural networks or recurrent neural networks (Peitz et al., 2011; Che et al., 2016). The Punctuator (Tilk and Alumäe, 2016) is a leading punctuation restoration system based on a sequence-to-sequence model (Seq2Seq) trained on long slices of text which can span multiple sentences.

Model Descriptions
We treat correcting run-ons as a sequence labeling task: given a sentence, the model reads each token and learns whether there is a SPACE or PERIOD following that token, as shown in Table 2. We apply two sequence models to this task, conditional random fields (roCRF) and Seq2Seq (roS2S).

Table 2
This/S shows/S the/S rising/S of/S life/S expectancies/P it/S is/S an/S achievement/S and/S it/S is/S also/S a/S challenge/S ./S
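To make the labeling scheme concrete, the following minimal sketch turns a tagged sentence in the Table 2 format into corrected text. The function names `parse_tagged` and `apply_labels` are hypothetical, and capitalizing the token after an inserted period is an assumption made for readability rather than a step described above.

```python
def parse_tagged(line):
    """Split 'token/LABEL' pairs (Table 2 format) into tokens and labels."""
    tokens, labels = [], []
    for item in line.split():
        token, label = item.rsplit("/", 1)
        tokens.append(token)
        labels.append(label)
    return tokens, labels


def apply_labels(tokens, labels):
    """Insert a period after every token labeled 'P' (PERIOD).

    Tokens labeled 'S' (SPACE) are copied unchanged. Capitalizing the
    token that follows an inserted period is an assumption of this sketch.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    output = []
    capitalize_next = False
    for token, label in zip(tokens, labels):
        if capitalize_next:
            token = token[:1].upper() + token[1:]
            capitalize_next = False
        output.append(token)
        if label == "P":
            output.append(".")
            capitalize_next = True
    return output


tagged = ("This/S shows/S the/S rising/S of/S life/S expectancies/P it/S "
          "is/S an/S achievement/S and/S it/S is/S also/S a/S challenge/S ./S")
tokens, labels = parse_tagged(tagged)
print(" ".join(apply_labels(tokens, labels)))
```

Because both models only predict one label per token, the same post-processing step can be shared by any sequence labeler applied to this task.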

Conditional Random Fields
Our CRF model, roCRF, represents a sentence as a sequence of spaces between tokens, labeled to indicate whether a period should be inserted in that space. Each space is represented by contextual features (sequences of tokens, part-of-speech tags, and capitalization flags around each space), parse features (the highest uncommon ancestor of the word before and after the space, and binary indicators of whether the highest uncommon ancestors are preterminals), and a flag indicating whether the mean per-word perplexity of the text decreases when a period is inserted at the space according to a 5-gram language model.
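As an illustration of the contextual features only, the sketch below builds a feature dictionary for the space between two tokens. The feature names, the window size, and the padding symbol are assumptions of this sketch and not the templates used by roCRF; part-of-speech, parse, and language model features are omitted because they require external tools.

```python
PAD = "<pad>"  # hypothetical padding symbol for positions outside the sentence


def context_features(tokens, i, window=2):
    """Features for the space between tokens[i] and tokens[i + 1].

    Offset 0 is the token before the space and offset 1 the token after it.
    Only token identity and capitalization flags are produced here.
    """
    if not 0 <= i < len(tokens) - 1:
        raise IndexError("i must index a token that has a right neighbor")
    feats = {}
    for offset in range(-window + 1, window + 1):
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"tok[{offset}]"] = tokens[j].lower() if inside else PAD
        feats[f"cap[{offset}]"] = inside and tokens[j][:1].isupper()
    # Token sequence spanning the space, as a single conjoined feature.
    feats["bigram"] = feats["tok[0]"] + "_" + feats["tok[1]"]
    return feats


sentence = "This shows the rising of life expectancies it is an achievement".split()
print(context_features(sentence, 6))
```

In a full system each dictionary would be converted into the input format of the CRF toolkit, one row per space.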

Sequence to Sequence Model with Attention Mechanism
Another approach is to treat run-on correction as a form of neural sequence generation. In this case, the input sentence is a single run-on sentence. During decoding we pass the binary label, which determines whether there is terminal punctuation following the token at the current position. We then combine the generated labels and the input sequence to get the final output. Our model, roS2S, is a Seq2Seq attention model based on the neural machine translation model of Bahdanau et al. (2015). The encoder is a bidirectional LSTM, where a recurrent layer processes the input sequence in both the forward and backward directions. The decoder is a unidirectional LSTM. An attention mechanism is used to obtain the context vector.

Data
Train: Although run-on sentences are common mistakes, existing GEC corpora do not include enough labeled run-on sentences to use as training data. Therefore we artificially generate training examples from a corpus of clean newswire text, Annotated Gigaword (Napoles et al., 2012). We randomly select paragraphs and identify candidate pairs of adjacent sentences, where the sentences have between 5 and 50 tokens and no URLs or special punctuation (colons, semicolons, dashes, or ellipses). Run-on sentences are generated by removing the terminal punctuation between the sentence pairs and lowercasing the first word of the second sentence (when not a proper noun). In total we create 2.8M run-on sentences, and randomly select 1.75M other Gigaword sentences as negative examples. We want the model to learn more patterns of run-on errors, so we feed it a large portion of positive examples during training, while we report our results on a test set where the ratio is closer to that of the real world. We call this data FakeGiga-Train. An additional 28k run-ons and 218k non-run-ons are used for validation.
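The generation procedure can be sketched as follows for one pair of tokenized sentences. This is an illustration under stated assumptions, not the authors' script: the names `is_candidate` and `make_run_on` are hypothetical, the regular expression is one possible reading of "URLs or special punctuation", the token count is assumed to include the final punctuation mark, and the `keep_capitalized` set stands in for whatever proper noun detection was actually used.

```python
import re

# Assumed filter: URLs, colons, semicolons, dashes, and ellipses.
SPECIAL = re.compile(r"https?://|www\.|[:;\u2013\u2014\u2026]|^-+$|\.\.\.")
TERMINAL = {".", "!", "?"}


def is_candidate(tokens, min_len=5, max_len=50):
    """True if the sentence has 5 to 50 tokens and no excluded punctuation."""
    if not min_len <= len(tokens) <= max_len:
        return False
    return not any(SPECIAL.search(token) for token in tokens)


def make_run_on(first, second, keep_capitalized=frozenset()):
    """Join two adjacent sentences into a run-on with S/P labels.

    Returns (tokens, labels), or None if the pair is not a valid candidate.
    The label 'P' marks the token after which the period was removed.
    """
    if not (is_candidate(first) and is_candidate(second)):
        return None
    if first[-1] not in TERMINAL:
        return None
    head = second[0]
    if head not in keep_capitalized:  # stand-in for a proper noun check
        head = head[:1].lower() + head[1:]
    tokens = first[:-1] + [head] + second[1:]
    labels = ["S"] * len(tokens)
    labels[len(first) - 2] = "P"  # last word of the first sentence
    return tokens, labels


first = "The market fell sharply on Monday .".split()
second = "Investors were worried about rising rates .".split()
print(make_run_on(first, second))
```

Sentences that are left unmodified serve as negative examples, with every token labeled S.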
Test: We evaluate on two dimensions: clean versus noisy text and real versus artificial run-ons. In the first evaluation, we artificially generate sentences from Gigaword and NUCLE following the procedure above such that 10% of sentences are run-ons, based on our estimates of their rate in real-world data (similar observations can be found in Watcharapunyawong and Usaha (2012)). We refer to these test sets as FakeGiga and FakeESL respectively. Please note that the actual run-on sentences in NUCLE are not included in FakeESL.
The second evaluation compares performance on artificial versus naturally occurring run-on sentences, using the NUCLE and CoNLL 2013 and 2014 corpora. Errors in these corpora are annotated with corrected text and error types, one of which is Srun: run-on sentences and comma splices. Sruns occur 873 times in the NUCLE corpus. We found that some of the Srun annotations do not actually correct run-on sentences, so we reviewed the Srun annotations to exclude any corrections that do not address run-on sentences. We also found that 300 of the 873 sentences with Srun annotations are actually corrected by adding a period. Other Srun annotations correct run-ons by converting independent clauses to dependent clauses, but we only target missing periods in this initial work. We manually edit Srun annotations so that the only correction performed is inserting periods. (This could be as simple as deleting the comma in the original text of a comma splice or more involved, as in rewriting a dependent clause to an independent clause in the corrected text.) In total, we find fewer than 500 run-on sentences. We discard all other error annotations and combine the NUCLE train and CoNLL test sets, which we call RealESL.
Only 1% of sentences in RealESL are run-ons, which may not be the case in other forms of ESL corpora. So for a fair comparison we down-sample the run-on sentences in FakeESL to form a test set with the same distribution as RealESL, FakeESL-1%. Table 3 describes the size of our data sets.
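The down-sampling step amounts to solving k / (k + n) = r for the number k of run-ons to keep, given n non-run-on sentences and a target rate r. A minimal sketch, in which the function name and the fixed random seed are assumptions:

```python
import random


def downsample_positives(positives, negatives, target_rate, seed=0):
    """Sample positives so they make up `target_rate` of the combined set.

    Solves k / (k + n) = target_rate for k, where n = len(negatives).
    If fewer than k positives exist, all of them are kept.
    """
    if not 0 < target_rate < 1:
        raise ValueError("target_rate must be strictly between 0 and 1")
    k = round(target_rate * len(negatives) / (1 - target_rate))
    k = min(k, len(positives))
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    return rng.sample(positives, k)


run_ons = [f"run-on {i}" for i in range(100)]
others = [f"sentence {i}" for i in range(990)]
kept = downsample_positives(run_ons, others, target_rate=0.01)
print(len(kept), len(kept) / (len(kept) + len(others)))
```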

Experiments
Metrics: We report precision, recall, and the F0.5 score. In GEC, precision is more important than recall, and therefore the standard metric for evaluation is F0.5, which weights precision twice as much as recall.
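For reference, the standard F-beta definition with beta = 0.5 can be computed from true positive, false positive, and false negative counts as in the sketch below. The function name is hypothetical and the counts in the example are made up for illustration; they are not results from this work.

```python
def precision_recall_fbeta(tp, fp, fn, beta=0.5):
    """Return (precision, recall, F-beta) from raw counts.

    F-beta = (1 + beta^2) * P * R / (beta^2 * P + R).
    With beta = 0.5, precision counts more heavily than recall.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Illustrative counts only: 8 correct insertions, 2 spurious, 8 missed.
p, r, f = precision_recall_fbeta(tp=8, fp=2, fn=8)
print(f"P={p:.2f} R={r:.2f} F0.5={f:.3f}")
```

A system that produces only false positives gets P = R = F0.5 = 0 under this convention.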
Baselines: We report results on a balanced random baseline and state-of-the-art models from whole-sentence GEC (NUS18) and punctuation restoration (the Punctuator). NUS18 is the released GEC model of Chollampatt and Ng (2018), trained on two GEC corpora, NUCLE and Lang-8 (Mizumoto et al., 2011). We test two versions of the Punctuator: Punctuator-EU is the released model, trained on English Europarl v7 (Koehn, 2005), and Punctuator-RO is a model we trained on artificial clean data (FakeGiga-Train) using the authors' code.
roCRF: We train our model with L1 regularization and c = 10 using the CRF++ toolkit. Only features that occur at least 5 times in the training set were included. Spaces are labeled to contain missing punctuation when the marginal probability is less than 0.70. Parameters are tuned to F0.5 on 25k held-out sentences.
Table 4: Performance on clean v. noisy artificial data with 10% run-ons, and real v. artificial data with 1% run-ons.
roS2S: Both the encoder and decoder have a single layer, 1028-dimensional hidden states, and a vocabulary of 100k words. We limit the input sequences to 100 words and use 300-dimensional pre-trained GloVe word embeddings (Pennington et al., 2014). The dropout rate is 0.5 and minibatches are of size 128. We train using Adagrad with a learning rate of 0.0001 and a decay of 0.5.

Results and Analysis
Results are shown in Table 4. A correct judgment is where a run-on sentence is detected and a PERIOD is inserted in the right place. Across all datasets, roCRF has the highest precision. We speculate that roCRF consistently has the highest precision because it is the only model to use POS and syntactic features, which may restrict the occurrence of false positives by identifying longer-distance, structural dependencies. roS2S is able to generalize better than roCRF, resulting in higher recall with only a moderate impact on precision. On all datasets except RealESL, roS2S consistently has the highest overall F0.5 score. In general, Punctuator has the highest recall, probably because it is trained for a more general purpose task and tries to predict punctuation at each possible position, resulting in lower precision than the other models. NUS18 predicts only a few false positives and no true positives, so P = R = 0 and we exclude it from the results table. Even though NUS18 is trained on NUCLE, which RealESL encompasses, its very poor performance is not too surprising given the infrequency of run-ons in NUCLE.
Clean v. Noisy: In the first set of experiments (columns 2 and 3), we compare models on clean text (FakeGiga), which has no other grammatical mistakes, and noisy text (FakeESL), which may have several other errors in each sentence. Punctuator-EU is the only model which improves when tested on the noisy artificial data compared to the clean. It is possible that the speech transcripts used for training Punctuator-EU more closely resemble FakeESL, which is less formal than FakeGiga. All other models do worse, which could be due to overfitting FakeGiga. However, further work is needed to determine how much of the performance drop is due to a domain mismatch versus the frequency of grammatical mistakes in the data.
Real v. Artificial: So far, we have only used artificially generated data for training and testing. The second set of experiments (columns 4 and 5) determines if it is easier to correct run-on sentences that are artificially generated compared to those that occur naturally. The Punctuators do poorly on this data because they are too liberal, evidenced by the high recall and very low precision. Our models, roCRF and roS2S, outperform the Punctuators and have similar performance on both the real and artificial run-ons (RealESL and FakeESL-1%). roCRF has significantly higher precision on RealESL while roS2S has significantly higher recall and F0.5 on RealESL and FakeESL-1% (with bootstrap resampling, p < 0.05). This supports the use of artificially generated run-on sentences as training data for this task.

Conclusions
Correcting run-on sentences is a challenging task that has not been individually targeted in earlier GEC models. We have developed two new models for run-on sentence correction: a syntax-aware CRF model, roCRF, and a Seq2Seq model, roS2S. Both of these outperform leading models for punctuation restoration and grammatical error correction on this task. In particular, roS2S has very strong performance, with F0.5 = 0.86 and F0.5 = 0.67 on run-ons generated from clean and noisy data, respectively. roCRF has very high precision (0.83 ≤ P ≤ 0.89) but low recall, meaning that it does not generalize as well as the leading system, roS2S.
Run-on sentences have low frequency in annotated GEC data, so we experimented with artificially generated training data. We chose clean newswire text as the source for training data to ensure there were no unlabeled naturally occurring run-ons in the training data. Using ungrammatical text as a source of artificial data is an area of future work. The results of this study are inconclusive in terms of how much harder the task is on clean versus noisy text. However, our findings suggest that artificial run-ons are similar to naturally occurring run-ons in ungrammatical text because models trained on artificial data do just as well predicting real run-ons as artificial ones.
In this work, we found that a leading GEC model (Chollampatt and Ng, 2018) does not correct any run-on sentences, even though there was an overlap between the test and training data for that model. This supports the recent work of Choshen and Abend (2018), who found that GEC systems tend to ignore less frequent errors due to reference bias. Based on our work with run-on sentences, a common error type that is infrequent in annotated data, we strongly encourage future GEC work to address low-coverage errors.