(Almost) Unsupervised Grammatical Error Correction using Synthetic Comparable Corpus

We introduce unsupervised techniques based on phrase-based statistical machine translation for grammatical error correction (GEC), trained on a pseudo learner corpus created by Google Translation. We verified our GEC system through experiments on the low resource track of the shared task at BEA2019. Our system achieved an F0.5 score of 28.31 points on the test data.


Introduction
Research on grammatical error correction (GEC) has gained considerable attention recently. Many studies treat GEC as a task that involves translation from a grammatically erroneous sentence (source-side) into a correct sentence (target-side) and thus leverage methods based on machine translation (MT) for GEC. For instance, some GEC systems use large parallel corpora and synthetic data (Ge et al., 2018; Xie et al., 2018).
We introduce an unsupervised method based on MT for GEC that does not use parallel learner data. In particular, we use methods proposed by Marie and Fujita (2018), Artetxe et al. (2018b), and Lample et al. (2018). These methods are based on phrase-based statistical machine translation (SMT) and phrase table refinements. Forward refinement used by Marie and Fujita (2018) simply augments a learner corpus with automatic corrections. We also use forward refinement to improve the phrase table.
Unsupervised MT techniques require a comparable corpus rather than a parallel corpus as training data. Therefore, we use texts translated by Google Translation as the source-side data. Specifically, we use News Crawl written in English as target-side data and News Crawl written in another language and translated into English as source-side data.
We verified our GEC system through experiments on the low resource track of the shared task at Building Educational Applications 2019 (BEA2019). The experimental results show that our system achieved an F0.5 score of 28.31 points.

Unsupervised GEC
Algorithm 1 shows the pseudocode for unsupervised GEC. This code is derived from Artetxe et al. (2018b). First, the cross-lingual phrase embeddings are acquired. Second, a phrase table is created based on these cross-lingual embeddings. Third, the phrase table is combined with a language model trained on monolingual data to initialize a phrase-based SMT system. Finally, the SMT system is updated through iterative forward translation.
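The four steps above can be sketched as a small training loop. This is only an illustrative skeleton, not the pseudocode of Algorithm 1: the function name `unsupervised_gec_training`, the callables `induce_phrase_table`, `translate`, and `retrain`, and the dictionary representation of a phrase table are all hypothetical placeholders for the embedding-based induction and the Moses-based training steps.

```python
from typing import Callable, Dict, List, Tuple

Sentence = str
# Hypothetical stand-in for a phrase table; a real system would use Moses files.
PhraseTable = Dict[str, str]


def unsupervised_gec_training(
    source_corpus: List[Sentence],
    induce_phrase_table: Callable[[], PhraseTable],
    translate: Callable[[PhraseTable, Sentence], Sentence],
    retrain: Callable[[List[Tuple[Sentence, Sentence]]], PhraseTable],
    n_iterations: int,
) -> List[PhraseTable]:
    """Return one phrase table per iteration (index 0 is the induced table)."""
    # Steps 1-3: induce the initial table from cross-lingual embeddings
    # (hidden behind the hypothetical `induce_phrase_table` callable).
    tables = [induce_phrase_table()]
    # Step 4: iterative forward translation. The source side stays fixed and
    # the target side is regenerated by the current system at each iteration.
    for _ in range(n_iterations):
        current = tables[-1]
        synthetic_pairs = [(src, translate(current, src)) for src in source_corpus]
        tables.append(retrain(synthetic_pairs))
    return tables
```

Keeping every intermediate table makes it possible to pick the best iteration on held-out data afterwards, rather than assuming that the last iteration is the best one.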
Cross-lingual embeddings. First, n-gram embeddings were created on the source and target sides. Specifically, each monolingual embedding was created using a variant of skip-gram (Mikolov et al., 2013) for high-frequency unigrams, bigrams, and trigrams in the monolingual data. Next, the monolingual embeddings were mapped onto a shared space to obtain cross-lingual embeddings. The self-learning method of Artetxe et al. (2018a) was used for unsupervised mapping.
Phrase table induction. A phrase table was created based on the cross-lingual embeddings. In particular, this involved the creation of phrase translation models and lexical translation models.
The translation candidates were limited in the source-to-target phrase translation model ϕ(f |e) for each source phrase e to its 100 nearest neighbor phrases f on the target-side. The score of the phrase translation model was calculated based on the normalized cosine similarity between the source and target phrases.
f′ represents each phrase embedding on the target side and τ is a temperature parameter that controls the confidence of prediction.² The backward phrase translation probability ϕ(e|f) was determined in a similar manner.
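The phrase translation score described above can be illustrated with a short sketch. It assumes that "normalized cosine similarity" means a softmax over the cosine similarities of the retained candidates, i.e. ϕ(f|e) = exp(cos(e, f)/τ) / Σ_f′ exp(cos(e, f′)/τ); this form is a reconstruction from the description, and the function names, the toy vectors, and the dictionary layout are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def phrase_translation_probs(
    source_vec: Sequence[float],
    target_vecs: Dict[str, Sequence[float]],
    tau: float,
    k: int = 100,
) -> Dict[str, float]:
    """Softmax over cosine similarities, restricted to the k nearest phrases."""
    sims = {phrase: cosine(source_vec, vec) for phrase, vec in target_vecs.items()}
    # Keep only the k most similar target phrases (100 in the text).
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    logits = {phrase: sims[phrase] / tau for phrase in nearest}
    # Subtract the maximum logit for numerical stability.
    max_logit = max(logits.values())
    exps = {phrase: math.exp(value - max_logit) for phrase, value in logits.items()}
    total = sum(exps.values())
    return {phrase: value / total for phrase, value in exps.items()}
```

A smaller τ concentrates the probability mass on the nearest neighbor, which is why the temperature can be read as a confidence parameter.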
The source-to-target lexical translation model lex(f |e) considers the word with the highest translation probability in a target phrase for each word in a source phrase. The score of the lexical translation model was calculated based on the product of respective phrase translation probabilities.
ϵ is a constant term for the case where no alignments are found. As in Artetxe et al. (2018b), the term was set to 0.001. The backward lexical translation probability lex(e|f ) is calculated in a similar manner.
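The lexical score can be sketched in the same spirit. The sketch follows the wording above: for each word of the source phrase it takes the best-matching word of the target phrase, floors that value at ϵ, and multiplies the results. The direction of the product, the function name `lexical_score`, and the `word_prob` callable (a stand-in for the word-level translation probabilities) are assumptions made for illustration.

```python
from typing import Callable, Sequence

EPSILON = 0.001  # constant used when no alignment is found, as stated in the text


def lexical_score(
    source_words: Sequence[str],
    target_words: Sequence[str],
    word_prob: Callable[[str, str], float],
    epsilon: float = EPSILON,
) -> float:
    """Product over source words of the best word-level translation probability."""
    score = 1.0
    for e_word in source_words:
        # Best-matching target word for this source word (0.0 if none exists).
        best = max((word_prob(f_word, e_word) for f_word in target_words), default=0.0)
        # Floor at epsilon so that one unaligned word does not zero the product.
        score *= max(epsilon, best)
    return score
```

With this flooring, a phrase pair containing one unaligned word is penalized by a factor of 0.001 instead of being discarded outright.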

Refinement of SMT system
The phrase table created is considered to include noisy phrase pairs. Therefore, we update the phrase table using an SMT system. The SMT system trained on synthetic data eliminates the noisy phrase pairs using language models trained on the target-side corpus. This process corresponds to lines 6-10 in Algorithm 1. The phrase table is refined with forward refinement (Marie and Fujita, 2018). For forward refinement, target synthetic data were generated from the source monolingual data using the source-to-target phrase [...]
² As in Artetxe et al. (2018b), τ is estimated by maximizing the phrase translation probability between an embedding and the nearest embedding on the opposite side.
Construction of a comparable corpus. This unsupervised method is based on the assumption that the source and target corpora are comparable. In fact, Lample et al. (2018), Artetxe et al. (2018b), and Marie and Fujita (2018) use the News Crawl of the source and target languages as training data.
To make a comparable corpus for GEC, we use texts translated by Google Translation as the source-side data. Specifically, we use Finnish News Crawl translated into English as the source side. English News Crawl is used as the target side as is. Finnish data is used because Finnish is not similar to English.
This translated data does not include misspelled words. To handle the misspellings that occur in the input at inference time, we use a spell checker as a preprocessing step before inference.

Experiment of low resource GEC
Experimental setting
Table 1 shows the training and development data size. Finnish News Crawl 2014-2015 translated into English was used as source training data and English News Crawl 2017 was used as target training data. To train the extra language model of the target side (LM t), we used the training data of One Billion Word Benchmark (Chelba et al., 2014). We used googletrans v2.4.0 for Google Translation. This module occasionally failed, and thus we obtained 2,122,714 translated sentences.⁴ We sampled 3,000,000 sentences from English News Crawl 2017 and excluded sentences with more than 150 words from both the source- and target-side data. Finally, the synthetic comparable corpus comprises the processed News Crawl data listed in Table 1. The low resource track permitted the use of the W&I+LOCNESS (Bryant et al., 2019; Granger, 1998) development set, so we split it in half into tune data and dev data.⁵ These data were tokenized with spaCy v1.9.0⁶ and the en_core_web_sm-1.2.0 model. We used the Moses truecaser for the training data; this truecaser model was learned from the processed English News Crawl. We used byte-pair encoding (Sennrich et al., 2016) learned from the processed English News Crawl; the number of operations is 50K.
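The level-balanced split of the development set mentioned above can be illustrated as follows. This is a minimal sketch under stated assumptions: the text does not say how sentences were assigned to each half, so the seeded random shuffle, the function name `split_by_level`, and the `(level, sentence)` input format are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (learner level, sentence); hypothetical format


def split_by_level(examples: List[Example], seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Split examples in half so that each learner level is divided equally."""
    by_level: Dict[str, List[str]] = defaultdict(list)
    for level, sentence in examples:
        by_level[level].append(sentence)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    tune: List[Example] = []
    dev: List[Example] = []
    for level in sorted(by_level):
        sentences = list(by_level[level])
        rng.shuffle(sentences)
        half = len(sentences) // 2
        tune.extend((level, s) for s in sentences[:half])
        dev.extend((level, s) for s in sentences[half:])
    return tune, dev
```

Splitting within each level, rather than over the pooled data, keeps the tune and dev halves comparable in difficulty.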
The implementation proposed by Artetxe et al. (2018b)⁷ was modified to conduct the experiments. Specifically, the following features were added: word-level Levenshtein distance, word- and character-level edit operations, an operation sequence model (Durrani et al., 2013),⁸ and a 9-gram word class language model, similar to Grundkiewicz and Junczys-Dowmunt (2018) without sparse features. The word class language model was trained with One Billion Word Benchmark data; the number of classes is 200, and the word classes were estimated with fastText (Bojanowski et al., 2017). The distortion feature was not used.
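As an illustration of the first added feature, the word-level Levenshtein distance between a source sentence and a candidate correction can be computed with the standard dynamic program. The sketch assumes whitespace-tokenized input and unit costs for insertion, deletion, and substitution; the function name `word_levenshtein` is hypothetical and the code is not taken from the modified implementation.

```python
def word_levenshtein(source: str, target: str) -> int:
    """Levenshtein distance between two sentences, counted in words."""
    src = source.split()
    tgt = target.split()
    # prev[j] holds the distance between the processed source prefix
    # and the first j target words.
    prev = list(range(len(tgt) + 1))
    for i, s_word in enumerate(src, start=1):
        curr = [i]
        for j, t_word in enumerate(tgt, start=1):
            cost = 0 if s_word == t_word else 1
            curr.append(min(
                prev[j] + 1,         # delete the source word
                curr[j - 1] + 1,     # insert the target word
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]
```

A feature of this kind lets the tuning step control how strongly the system is rewarded or penalized for changing the input, which matters in GEC because most words should be left untouched.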
Moses (Koehn et al., 2007) was used to train the SMT system. FastAlign (Dyer et al., 2013) was used for word alignment and KenLM (Heafield, 2011) was used to train the 5-gram language models over the processed English News Crawl and One Billion Word Benchmark. MERT (Och, 2003) was used with the tuning data for the M² Scorer (Dahlmeier and Ng, 2012). Synthetic sentence pairs with a sentence length in the range [3, 80] were used at the refinement step. The number of iterations N was set to 5, and the embedding dimension was set to 300. We selected the best iteration using the dev data and submitted the output of the best iteration model.
We used pyspellchecker⁹ as a spell checker. This tool uses Levenshtein distance to obtain permutations within an edit distance of 2 over the words included in a word list. We made the word list from One Billion Word Benchmark and included words that occur more than five times.
⁴ Finnish News Crawl 2014-2015 contains 6,360,479 sentences.
⁵ Because the W&I+LOCNESS data had four types of learner level, we split it so that each learner level is divided equally.
⁶ https://github.com/explosion/spaCy
⁷ https://github.com/artetxem/monoses
⁸ The operation sequence model was used in the refinement step.
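The spell-checking step described above can be approximated with a short, self-contained sketch. It is not the pyspellchecker code: it is a simplified candidate search that builds a frequency-filtered word list and looks for known words within two edits, choosing the most frequent one. Treating a transposition as a single edit, restricting edits to lowercase ASCII letters, and the function names are assumptions of this sketch.

```python
import string
from collections import Counter
from typing import Dict, Iterable, Set


def build_word_list(tokens: Iterable[str], min_occurrences: int = 5) -> Dict[str, int]:
    """Keep words that occur more than `min_occurrences` times."""
    counts = Counter(tokens)
    return {word: n for word, n in counts.items() if n > min_occurrences}


def edits1(word: str) -> Set[str]:
    """All strings that are one edit away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, word_list: Dict[str, int]) -> str:
    """Return the most frequent known word within two edits, else the input."""
    if word in word_list:
        return word
    candidates = {w for w in edits1(word) if w in word_list}
    if not candidates:
        # Fall back to candidates that are two edits away.
        candidates = {w2 for w1 in edits1(word) for w2 in edits1(w1) if w2 in word_list}
    if not candidates:
        return word
    # Prefer the most frequent candidate; break ties alphabetically.
    return max(candidates, key=lambda w: (word_list[w], w))
```

Filtering the word list by frequency keeps rare misspellings that occur in the corpus itself from being accepted as valid words.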
⁹ https://github.com/barrust/pyspellchecker
We report precision, recall, and F0.5 score based on the dev data and official test data. The dev-data output was evaluated with the ERRANT scorer (Bryant et al., 2017), as was the official test data. Table 2 shows the results of the GEC experiments with test data. The F0.5 score for our system (TMU) is 28.31; this score is eighth among the nine teams. In particular, the number of false positives of our system is 4,314; this is the worst result of all.
Table 3: GEC results with dev data. The bold scores represent the best score without the spell checker.

Discussion
Table 3 shows the results of the dev data listed in Table 1. On the dev data, the system of iteration 1 is the best among all. The improvement from iteration 0 to iteration 1 confirms that the refinement method works well. However, the system does not improve after iteration 1. The source-side data is fixed, and the target-side data is generated from the source side at each iteration. Therefore, the quality of the source-side data is important for this refinement method. In this study, we use automatically translated text as source-side data; thus, we consider that its quality is not high enough for the refinement to work after iteration 1.
The results in Table 3 confirm that the spell checker works well. We also investigated the order of the two components: should the SMT system or the spell checker be applied first? We found that it is better to apply the SMT system after the spell checker. This is because the source-side data does not include misspelled words, as mentioned above.
Table 4 shows the error types that our system corrected well or mostly did not correct on the dev data. SPELL denotes spelling errors; the correction of these errors depends only on the spell checker. PUNCT denotes punctuation errors, e.g., 'Unfortunately when we... → Unfortunately, when we...'. We consider that our system can correct errors such as these owing to the n-gram co-occurrence knowledge derived from the language models.
In contrast, our system struggled to correct content word errors. For example, NOUN includes errors such as 'way → means', and VERB includes errors such as 'watch → see'. We consider that our system is mostly unable to correct context-dependent word usage errors because the phrase table was still noisy. Although we observed some usage error examples of 'watch' in the synthetic source data, our model was not able to replace 'watch' with 'see' based on the context.

Related Work
Unsupervised Machine Translation. Studies on unsupervised methods have been conducted for both NMT (Lample et al., 2018; Marie and Fujita, 2018) and SMT (Artetxe et al., 2018b). In this study, we apply the USMT method of Artetxe et al. (2018b) and Marie and Fujita (2018) to GEC. The UNMT method (Lample et al., 2018) was ineffective under the GEC setting in our preliminary experiments.
Table 4: Error types for which our best system corrected errors well or mostly did not correct on the dev data. Top2 denotes the top two errors, and Bottom2 denotes the lowest two errors in terms of F0.5.
GEC with NMT/SMT. Several studies that introduce sequence-to-sequence models in GEC rely heavily on large amounts of training data. Ge et al. (2018), who presented state-of-the-art results in GEC, proposed a supervised NMT method trained on corpora totaling 5.4M sentence pairs. We mainly use monolingual corpora because the low resource track does not permit the use of learner corpora. Despite the success of NMT, many studies on GEC traditionally use SMT (Susanto et al., 2014; Junczys-Dowmunt and Grundkiewicz, 2014). These studies apply an off-the-shelf SMT toolkit, Moses, to GEC. Junczys-Dowmunt and Grundkiewicz (2014) claimed that an SMT system optimized for BLEU learns not to change the source sentence. Instead of BLEU, they proposed tuning an SMT system using the M² score with annotated development data. In this study, we also tune the weights with an F0.5 score measured by the M² scorer because the official score is an F0.5 score.
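For reference, the F0.5 measure used for tuning and evaluation weights precision twice as much as recall. The sketch below computes it from edit counts using the standard F-beta formula; the function name `f_beta`, the handling of zero denominators, and the counts in the usage note are illustrative assumptions and do not reproduce the M² or ERRANT scorers or any result reported in this paper.

```python
from typing import Tuple


def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F_beta) from edit-level counts."""
    # Zero denominators are mapped to 0.0 here; real scorers may differ.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    beta_sq = beta * beta
    score = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return precision, recall, score
```

For example, with hypothetical counts of 50 true positives, 50 false positives, and 150 false negatives, precision is 0.5, recall is 0.25, and F0.5 is about 0.417, closer to precision than the balanced F1 would be. This is why a large number of false positives is costly under this metric.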

Conclusion
In this paper, we described our GEC system for the low resource track of the shared task at BEA2019. We introduced an unsupervised approach based on SMT for GEC. This track prohibited the use of learner data as training data, so we created a synthetic comparable corpus using Google Translation. The experimental results demonstrate that our system achieved an F0.5 score of 28.31 points on the test data.