The AIP-Tohoku System at the BEA-2019 Shared Task

We introduce the AIP-Tohoku grammatical error correction (GEC) system submitted to the BEA-2019 shared task in Track 1 (Restricted Track) and Track 2 (Unrestricted Track); both submissions use the same system architecture. Our system comprises two key components: error generation and sentence-level error detection. In particular, GEC with sentence-level grammatical error detection is a novel and versatile approach, and we experimentally demonstrate that it significantly improves the precision of the base model. Our system ranked 9th in Track 1 and 2nd in Track 2.


Introduction
As part of the BEA-2019 shared task, we participated in Track 1 (Restricted Track) and Track 2 (Unrestricted Track). We used the Transformer (Vaswani et al., 2017) machine translation architecture as our base model, as it has become a state-of-the-art approach to grammatical error correction (GEC).
In our system, the error correction model collaborates with a sentence-level error detection model. GEC is evaluated with F0.5 because precision is more important than recall. To improve precision on the test set, our system corrects input sentences only after detecting errors with a sentence-level error detection model (which we denote SED). We applied the bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2018) for sentence-level error detection. To improve the performance of SED, we propose an SED model that takes the learner's proficiency into account. To the best of our knowledge, this is the first study to combine GEC with SED.

* Current affiliation: Yahoo Japan Corporation, hiroasan@yahoo-corp.jp
† Current affiliation: Future Corporation, mizumoto.tomoya.mh7@is.naist.jp
Because grammatical correctness is required of output sentences in GEC, the target side of parallel training corpora should not contain noisy sentences. Our correction model is therefore trained only on sentence pairs whose corrected side is judged error-free by our sentence-level grammatical error detection model. We call this data cleaning process BERT-Cleaning.
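The BERT-Cleaning step described above can be sketched as a simple filter over training pairs. The `is_grammatical` callable below is a stand-in for the BERT-based sentence-level detector (the toy detector shown is purely illustrative):

```python
def bert_clean(pairs, is_grammatical):
    """Keep only training pairs whose target side the detector
    judges to be grammatical (the BERT-Cleaning step).

    `is_grammatical` stands in for the sentence-level BERT detector;
    here it is any callable mapping a sentence to True/False.
    """
    return [(src, tgt) for src, tgt in pairs if is_grammatical(tgt)]

# Toy stand-in detector: flags a sentence as ungrammatical if it
# contains an (obviously simplified) marker phrase.
pairs = [
    ("He go to school .", "He goes to school ."),
    ("She like apples .", "She like apples ."),   # noisy target side
]
clean = bert_clean(pairs, lambda s: "like apples" not in s)
```

In the actual system, the detector is the fine-tuned BERT SED model, and the filter is applied to the essay scoring data and EFCAMDAT.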
For Track 1, similarly to back-translation (Sennrich et al., 2016b; Edunov et al., 2018), we augmented the parallel training corpus with errors generated from monolingual data. Adding the generated data and applying the SED process improved the F0.5 score over the base model.
For Track 2, we used the EF-Cambridge Open Language Database (EFCAMDAT) (Geertzen et al., 2013) and non-public Lang-8 as the external language learner corpus.

Error Detection
The field of grammatical error detection (GED) has a long history. Many previous studies have treated GED as a token-level binary classification task (Tetreault and Chodorow, 2008; Han et al., 2006; Chodorow et al., 2012; Rei and Yannakoudakis, 2016; Rei, 2017). Kaneko et al. (2017) improved grammatical error detection by learning word embeddings that consider grammaticality and error patterns. Yannakoudakis et al. (2017) proposed an approach to N-best list re-ranking using neural sequence-labelling models.
While many studies in GED focus on token-level error detection, some studies perform sentence-level binary classification of sentences that need editing (Han et al., 2006; Tetreault and Chodorow, 2008; Chodorow et al., 2012; Schmaltz et al., 2016). Compared with token-level grammatical error detection, sentence-level grammatical error detection is a simpler problem setting because there is no need to identify the location of errors.

Error Generation
In the field of machine translation, back-translation is an effective method for neural machine translation systems (Sennrich et al., 2016b; Imamura et al., 2018). Edunov et al. (2018) reported that back-translation via sampling or noised beam outputs is effective for neural machine translation systems.
Recently, back-translation has been applied to grammatical error detection and correction. Artificial error generation with statistical machine translation and syntactic patterns has been proposed for error detection. Kasewa et al. (2018) constructed synthetic samples using a seq2seq neural model with greedy search and temperature sampling for error detection. Xie et al. (2018) proposed several noising methods for error generation, and Ge et al. (2018) proposed back-boost learning using fluency scores.
System Architecture

Base Correction Model
We used the Transformer, a self-attention-based translation model, as our base GEC system (Vaswani et al., 2017). Previous studies have used the Transformer to achieve high performance (Junczys-Dowmunt et al., 2018; Zhao et al., 2019).

Motivation
The sentence-level error detection (SED) module is one of the key components of our system; its goal is to detect sentences containing grammatical errors. The aim of introducing SED is to reduce false positives by passing only sentences that contain errors to the GEC model. We calculated the proportion of sentences that are changed in the W&I+LOCNESS development set and found it to be 64.34%, i.e., almost 35% of the sentences require no correction.
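The SED-gated correction described above can be sketched as follows. The `detect` and `correct` callables are stand-ins for the BERT-based SED model and the Transformer GEC model; the toy implementations below (doubled-word detection and deduplication) are purely illustrative:

```python
def correct_with_sed(sentences, detect, correct):
    """Run the GEC model only on sentences the sentence-level
    detector (SED) flags as erroneous; copy the rest through
    unchanged, which avoids false-positive edits."""
    return [correct(s) if detect(s) else s for s in sentences]

# Toy stand-ins: detect adjacent repeated words, "correct" by
# removing the repetition.
def detect(s):
    toks = s.split()
    return any(a == b for a, b in zip(toks, toks[1:]))

def correct(s):
    out, prev = [], None
    for t in s.split():
        if t != prev:
            out.append(t)
        prev = t
    return " ".join(out)

outputs = correct_with_sed(["the the cat sat .", "a clean sentence ."],
                           detect, correct)
```

Sentences the detector passes through untouched can never become false positives, which is how the gate trades recall for precision.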

Base Model
We built a base SED model using BERT (Devlin et al., 2018); this is a straightforward extension of BERT to sequence classification tasks such as CoLA (Warstadt et al., 2018) and SST-2 (Socher et al., 2013). To set up a training set for the base SED model, we preprocessed the data to obtain binary labels (0 for correct sentences and 1 for incorrect sentences).

Proposed Model
Figure 1 shows the architecture of our proposed SED model. The key ideas of our proposed model are as follows:
• There is a correlation between the error rate and the learner's level of proficiency.
• The performance of SED can be improved by fine-tuning the model according to the learner's proficiency.
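One simple way to derive the binary SED training labels from a parallel corpus, as described for the base model, is to label a source sentence 0 when it already equals its corrected reference and 1 otherwise (a sketch, not the paper's exact preprocessing script):

```python
def make_sed_labels(pairs):
    """Label each source sentence 0 (correct) if it equals its
    corrected reference, 1 (incorrect) otherwise, yielding the
    binary training data for the base SED classifier."""
    return [(src, 0 if src == tgt else 1) for src, tgt in pairs]

data = make_sed_labels([
    ("He goes to school .", "He goes to school ."),
    ("He go to school .", "He goes to school ."),
])
```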
The first idea is based on the following observation on the W&I+LOCNESS development set: across the three CEFR levels A (beginner), B (intermediate), and C (advanced), the word error rate (WER) is 19.49%, 13.18%, and 6.04%, respectively. The second idea comes from previous studies on GEC (Junczys-Dowmunt and Grundkiewicz, 2016; Junczys-Dowmunt et al., 2018), which showed that better results can be achieved if the error rate of the training data is adapted to that of the development set, a technique called error adaptation.
Let N and M denote the total number of source words and sentences in a corpus, respectively. WER is defined as

WER = (1/N) Σ_{m=1}^{M} d(X_m, Y_m),

where X_m denotes each source sentence, Y_m denotes the corresponding corrected sentence, and d(X_m, Y_m) denotes the edit distance between X_m and Y_m.
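The WER definition above can be computed directly, taking d(X_m, Y_m) as a token-level Levenshtein distance (the exact edit-distance variant used by the authors is not specified, so Levenshtein is an assumption here):

```python
def edit_distance(x, y):
    """Token-level Levenshtein distance d(X_m, Y_m)."""
    dp = list(range(len(y) + 1))
    for i, xt in enumerate(x, 1):
        prev, dp[0] = dp[0], i
        for j, yt in enumerate(y, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (xt != yt))  # substitution
    return dp[len(y)]

def wer(sources, targets):
    """WER = (1/N) * sum_m d(X_m, Y_m), where N is the total
    number of source words across the corpus."""
    n = sum(len(s) for s in sources)
    return sum(edit_distance(s, t) for s, t in zip(sources, targets)) / n

# One substitution over four source words -> WER of 0.25.
score = wer([["He", "go", "to", "school"]],
            [["He", "goes", "to", "school"]])
```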
Based on the above ideas, our SED model is developed in two steps:

1. Building the Proficiency Prediction Module (PPM):
The PPM predicts the proficiency of the learner who wrote a given sentence. Based on the above key ideas, we employ a multi-task learning approach in which the model estimates the learner's proficiency and performs sentence-level error detection simultaneously (PP&SED in Figure 1). It is trained on labelled data obtained by simply conjoining the SED label with the PP label (e.g., "1 A").
We confirmed that PP&SED outperforms vanilla PP by a large margin of up to 7.8 accuracy points (from 42.2 to 50.0).

2. Fine-tuning the SED model: After dividing the given text by proficiency, based on the labels estimated by the PPM, the SED model is fine-tuned for each proficiency level.
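The two steps above amount to building joint labels for multi-task training and then bucketing sentences by predicted level. A minimal sketch (the `predict_level` callable stands in for the trained PPM; the length heuristic is purely illustrative):

```python
def conjoin_labels(examples):
    """Build the joint multi-task labels used for PP&SED by
    conjoining the binary SED label with the CEFR proficiency
    label, e.g. "1 A" for an erroneous level-A sentence."""
    return [(sent, f"{sed} {cefr}") for sent, sed, cefr in examples]

def split_by_proficiency(sentences, predict_level):
    """Bucket sentences by the proficiency the PPM predicts, so a
    separate SED model can be fine-tuned per level."""
    buckets = {}
    for sent in sentences:
        buckets.setdefault(predict_level(sent), []).append(sent)
    return buckets

joint = conjoin_labels([("He go to school .", 1, "A")])
buckets = split_by_proficiency(
    ["short .", "a much longer sentence ."],
    lambda s: "A" if len(s.split()) < 4 else "B",  # toy PPM
)
```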
The SED module then performs sentence-level binary classification, identifying sentences that need editing. Table 1 shows the performance of SED on our dev set; here, we split the official development set into test and dev sets for our experiments. By considering learner proficiency, our proposed SED model achieved significant improvements in both precision and recall.

Error Generation
Our error generation system follows Edunov et al. (2018). A target-to-source model is trained, and back-translation is applied to monolingual data: pseudo-parallel data are generated by sampling from the distribution of the target-to-source model.
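The key point of sampling-based back-translation is decoding by drawing from the model's output distribution rather than taking the argmax, which yields more diverse (and usefully noisier) pseudo-sources. A toy sketch, where `step_probs` stands in for the target-to-source model's per-step distributions:

```python
import random

def generate(step_probs, sampling=True, rng=None):
    """Decode one pseudo-source sentence. With sampling=True,
    tokens are drawn from each step's distribution (Edunov et
    al.'s sampling scheme); otherwise greedy argmax is used."""
    rng = rng or random.Random(0)
    out = []
    for dist in step_probs:
        if sampling:
            tokens, probs = zip(*dist.items())
            out.append(rng.choices(tokens, weights=probs)[0])
        else:
            out.append(max(dist, key=dist.get))
    return out

# Toy two-step distributions: greedy always emits the mode,
# sampling sometimes emits the erroneous variant "go".
steps = [{"He": 1.0}, {"goes": 0.6, "go": 0.4}]
greedy = generate(steps, sampling=False)
sampled = generate(steps, sampling=True)
```

In the real system this sampling happens inside the seq2seq decoder over subword vocabularies; the dictionaries here only illustrate the decision rule.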

Experimental Setting
We will now describe the training data and tools used to train our model.

Tools
We used the Transformer implemented in Fairseq (https://github.com/pytorch/fairseq) (Ott et al., 2019) as our GEC model. For the Transformer, we used a token embedding size of 512. The hidden size is set to 512, and the filter size to 2048. The multi-head attention has eight attention heads, and the encoder and decoder each have six layers. We use the Adam optimizer with β1 = 0.9, β2 = 0.98, and ε = 10^−9, with inverse square root learning rate decay. We set the dropout to 0.3. Rather than using words directly, we used byte pair encoding (BPE) (Sennrich et al., 2016a); each of the source and target vocabularies comprises the 30K most frequent BPE tokens.
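Under fairseq's standard CLI, a training invocation matching these hyperparameters might look like the following. This is a sketch, not the authors' actual script: the data path is a placeholder, and unstated settings (batch size, warmup, number of updates) are omitted.

```shell
# Hypothetical fairseq-train call reflecting the stated hyperparameters.
fairseq-train data-bin/gec \
    --arch transformer \
    --encoder-layers 6 --decoder-layers 6 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-9 \
    --lr-scheduler inverse_sqrt \
    --dropout 0.3
```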
For building the sentence-level error detection model, we employed a BERT-based model for the sequence-level tasks described in Section 3.2. We used the PyTorch implementation of Google's BERT model.
For building the error generation model, we used a 7-layer convolutional seq2seq model implemented in Fairseq (Gehring et al., 2017; Chollampatt and Ng, 2018). Following Chollampatt and Ng (2018), both source and target embeddings have 500 dimensions, and each of the source and target vocabularies comprises the 30K most frequent BPE tokens. The hidden size of the encoders and decoders is 1,024, with a convolution window width of 3. The output of each encoder and decoder layer has 1,024 dimensions. We set the dropout rate to 0.3. The parameters are optimized with the Nesterov Accelerated Gradient optimizer (Sutskever et al., 2013) with a momentum value of 0.99; we set the initial learning rate to 0.25 and use early stopping. For evaluating system outputs, ERRANT (Bryant et al., 2017) is used as the scorer. All results reported in this study are span-based correction F0.5.
To generate erroneous sentences, we used Simple Wikipedia and essay scoring data sets, i.e., the International Corpus of Learner English (Blanchard et al., 2013). For Simple Wikipedia, we ignored sentences longer than 60 tokens. To remove erroneous sentences, we applied BERT-Cleaning to the essay scoring data sets. After BERT-Cleaning and preprocessing (Chollampatt and Ng, 2018), we obtained 1,426,354 sentence pairs by error generation.

External Dataset for Track-2
We used EFCAMDAT (Geertzen et al., 2013) and non-public Lang-8 data as external language learner corpora. The EFCAMDAT was constructed by the Department of Theoretical and Applied Linguistics at the University of Cambridge. Lo et al. (2018) were the first researchers to use the EFCAMDAT for the GEC task; however, their system trained with the EFCAMDAT performed worse than the system trained with the Lang-8 corpus. One cause of the lower performance is that many errors remain in the EFCAMDAT corrected sentences. We therefore applied BERT-Cleaning to the EFCAMDAT to remove erroneous sentences, which reduced the number of EFCAMDAT sentence pairs from 1,157,339 to 760,393. Finally, after pre-processing (Chollampatt and Ng, 2018), we used 7,739,577 sentence pairs (non-public Lang-8 + cleaned EFCAMDAT) as additional training data.

Results
Table 2 shows the results of our systems, obtained with ensemble decoding of five independently trained models. We compared four systems: (1) Base (the Transformer-based GEC system), (2) Base plus sentence-level error detection (Base+SED) as described in Section 3.2, (3) Base plus generated data (Base+GenData), and (4) Base plus both (Base+SED+GenData). Our full system, combining SED and GenData, achieved an F0.5 score of 60.97. Both proposed methods, SED and GenData, improved GEC performance. In particular, SED improved precision from 61.97 to 65.45 (+3.48), although recall dropped from 42.11 to 38.04 (−4.07). GenData improved both recall (from 42.11 to 46.40) and precision (from 61.97 to 64.57). Table 3 shows the results of the model trained with the additional data (Track1+AddData). The additional data improve both precision and recall, with a notably large gain in recall (from 42.16 to 51.03).

Conclusion
We described our system for the BEA-2019 shared task. Our system has two key components: error generation and sentence-level error detection. We input only sentences predicted to be grammatically incorrect by the sentence-level error detection model into our correction model. Sentence-level grammatical error detection is a novel approach to grammatical error correction, and we have shown that it can significantly improve performance. Our system ranked 9th in Track 1 and 2nd in Track 2.