Encoder-Decoder Models Can Benefit from Pre-trained Masked Language Models in Grammatical Error Correction

This paper investigates how to effectively incorporate a pre-trained masked language model (MLM), such as BERT, into an encoder-decoder (EncDec) model for grammatical error correction (GEC). The answer to this question is not as straightforward as one might expect, because the common previous methods for incorporating an MLM into an EncDec model have potential drawbacks when applied to GEC. For example, the distribution of the inputs to a GEC model can differ considerably (erroneous, clumsy, etc.) from that of the corpora used for pre-training MLMs; however, this issue is not addressed by the previous methods. Our experiments show that our proposed method, in which we first fine-tune an MLM with a given GEC corpus and then use the output of the fine-tuned MLM as additional features in the GEC model, maximizes the benefit of the MLM. The best-performing model achieves state-of-the-art performance on the BEA-2019 and CoNLL-2014 benchmarks. Our code is publicly available at: https://github.com/kanekomasahiro/bert-gec.


Introduction
Grammatical Error Correction (GEC) is a sequence-to-sequence task in which a model corrects an ungrammatical sentence into a grammatical one. Numerous studies on GEC have successfully used encoder-decoder (EncDec) based models, and in fact, most current state-of-the-art neural GEC models employ this architecture (Zhao et al., 2019; Grundkiewicz et al., 2019; Kiyono et al., 2019).
In light of this trend, one natural and intriguing question is whether neural EncDec GEC models can benefit from the recent advances in masked language models (MLMs), since MLMs such as BERT (Devlin et al., 2019) have been shown to yield substantial improvements in a variety of NLP tasks (Qiu et al., 2020). BERT, for example, builds on the Transformer architecture (Vaswani et al., 2017) and is trained on large raw corpora to learn general representations of linguistic components (e.g., words and sentences) in context, which have been shown to be useful for various tasks. In recent years, MLMs have been used not only for classification and sequence labeling tasks but also for language generation, where combining an MLM with the EncDec model of a downstream task yields a noticeable improvement (Lample and Conneau, 2019).
Common methods of incorporating an MLM into an EncDec model are initialization (init) and fusion (fuse). In the init method, the downstream task model is initialized with the parameters of a pre-trained MLM and then trained on a task-specific training set (Lample and Conneau, 2019; Rothe et al., 2019). This approach, however, does not work well for sequence-to-sequence language generation tasks, because such tasks tend to require a huge amount of task-specific training data, and fine-tuning an MLM on such a large dataset tends to destroy its pre-trained representations, leading to catastrophic forgetting (Zhu et al., 2020; McCloskey and Cohen, 1989). In the fuse method, the pre-trained representations of an MLM are used as additional features during the training of a task-specific model (Zhu et al., 2020). When this method is applied to GEC, what the MLM has learned in pre-training is preserved; however, the MLM is adapted neither to the GEC task nor to the task-specific distribution of inputs (i.e., erroneous sentences in a learner corpus), which may hinder the GEC model from effectively exploiting the potential of the MLM. Given these drawbacks of the two common methods, it is not as straightforward to gain the advantages of MLMs in GEC as one might expect. This background motivates us to investigate how an MLM should be incorporated into an EncDec GEC model to maximize its benefit. To the best of our knowledge, no research has addressed this question.
In our investigation, we employ BERT, which is a widely used MLM (Qiu et al., 2020), and evaluate the following three methods: (a) initialize an EncDec GEC model using pre-trained BERT, as in Lample and Conneau (2019) (BERT-init); (b) pass the output of pre-trained BERT into the EncDec GEC model as additional features (BERT-fuse) (Zhu et al., 2020); and (c) combine the best parts of (a) and (b).
In this new method (c), we first fine-tune BERT with the GEC corpus and then use the output of the fine-tuned BERT model as additional features in the GEC model. To implement this, we further consider two options: (c1) additionally train pre-trained BERT with GEC corpora (BERT-fuse mask), and (c2) fine-tune pre-trained BERT by way of the grammatical error detection (GED) task (BERT-fuse GED). In (c2), we expect that the GEC model will be trained so that it can leverage both the representations learned from large general corpora (pre-trained BERT) and the task-specific information useful for GEC induced from the GEC training data.
Our experiments show that using the output of the fine-tuned BERT model as additional features in the GEC model (method (c)) is the most effective way of using BERT on most of the GEC corpora used in our experiments. We also show that GEC performance improves further when the BERT-fuse mask and BERT-fuse GED methods are combined. The best-performing model achieves state-of-the-art results on the BEA-2019 and CoNLL-2014 benchmarks.

Related Work
Studies have reported that an MLM can improve the performance of GEC when it is employed either as a re-ranker (Chollampatt et al., 2019) or as a filtering tool (Asano et al., 2019; Kiyono et al., 2019). EncDec-based GEC models combined with MLMs can also be used together with these pipeline methods. Asano et al. (2019) proposed sequence labeling models based on correction methods. Our method can reuse existing knowledge about EncDec GEC models, whereas their methods cannot, owing to their different model architecture. Besides, to the best of our knowledge, no research has yet investigated how to incorporate the information of MLMs effectively into the training of an EncDec GEC model.
MLMs are generally used in downstream tasks by fine-tuning (Liu, 2019; Zhang et al., 2019); however, Zhu et al. (2020) demonstrated that it is more effective to provide the output of the final layer of an MLM to the EncDec model as contextual embeddings. Recently, Weng et al. (2019) addressed the mismatch between the contextual knowledge of pre-trained models and the target bilingual machine translation task. Here, we likewise claim that addressing the gap between grammatically correct raw corpora and GEC corpora can lead to the improvement of GEC systems.

Methods for Using Pre-trained MLM in GEC Model
In this section, we describe our approaches for incorporating a pre-trained MLM into our GEC model. Specifically, we consider the following approaches: (1) initializing a GEC model using BERT; (2) using the BERT output as additional features for a GEC model; and (3) using the output of BERT fine-tuned with the GEC corpora as additional features for a GEC model.

BERT-init
We create a GEC EncDec model initialized with pre-trained BERT weights, following the approach of Lample and Conneau (2019). Most recent state-of-the-art methods use pseudo-data, which is generated by injecting pseudo-errors into grammatically correct sentences. Note, however, that this method cannot initialize the GEC model with parameters pre-trained on such pseudo-data.
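As an illustration of the init idea, the sketch below copies pre-trained parameter values into a downstream model wherever parameter names overlap and randomly initializes the rest (e.g., decoder parameters absent from the checkpoint). The parameter names and toy dictionaries are hypothetical, not the actual BERT/Transformer parameterization.

```python
# Hypothetical sketch of BERT-init: initialize matching parameters from a
# pre-trained checkpoint, randomly initialize the remainder.
import random

def init_from_pretrained(model_params, pretrained_params):
    """Copy pre-trained values into model_params for overlapping keys."""
    initialized, fresh = [], []
    for name in model_params:
        if name in pretrained_params:
            model_params[name] = list(pretrained_params[name])  # copy weights
            initialized.append(name)
        else:
            # parameters absent from the checkpoint (e.g., decoder cross-attention)
            model_params[name] = [random.gauss(0.0, 0.02) for _ in model_params[name]]
            fresh.append(name)
    return initialized, fresh

# toy parameter dictionaries (names are illustrative)
gec = {"encoder.layer0.w": [0.0] * 4, "decoder.cross_attn.w": [0.0] * 4}
bert = {"encoder.layer0.w": [0.1, 0.2, 0.3, 0.4]}
inited, fresh = init_from_pretrained(gec, bert)
```

In practice the copied encoder parameters are then trained further on the GEC data, which is precisely where the catastrophic-forgetting risk discussed in the introduction arises.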

BERT-fuse
We use the model proposed by Zhu et al. (2020) as a feature-based approach (BERT-fuse). This model is based on the Transformer EncDec architecture. It takes an input sentence X = (x_1, ..., x_n), where n is its length and x_i is the i-th token of X. First, BERT encodes X and outputs a representation B = (b_1, ..., b_n).
Next, the GEC model encodes X and B as inputs. Let h_i^l \in H^l denote the i-th hidden representation of the l-th encoder layer of the GEC model, where H^0 is the word-embedding sequence of the input sentence X. We then compute \tilde{h}_i^l as follows:

\tilde{h}_i^l = \frac{1}{2} \left( A_h(h_i^{l-1}, H^{l-1}) + A_b(h_i^{l-1}, B) \right),

where A_h and A_b are attention models over the hidden states H^{l-1} of the GEC encoder and over the BERT output B, respectively. Each \tilde{h}_i^l is further processed by the feed-forward network F, which outputs the l-th layer H^l = (F(\tilde{h}_1^l), ..., F(\tilde{h}_n^l)). The decoder's hidden state s_t^l \in S^l is calculated analogously, with an additional self-attention model A_s over the previously generated states:

\tilde{s}_t^l = A_s(s_t^{l-1}, S_{<t+1}^{l-1}), \quad s_t^l = F\left( \frac{1}{2} \left( A_h(\tilde{s}_t^l, H^L) + A_b(\tilde{s}_t^l, B) \right) \right).

Finally, s_t^L is processed via a linear transformation and a softmax function to predict the t-th word \hat{y}_t. We also apply the drop-net trick proposed by Zhu et al. (2020) to the outputs of BERT and of the GEC encoder.
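To make the encoder-side computation concrete, here is a minimal numpy sketch of one BERT-fuse encoder layer: each position attends both to the previous encoder layer (A_h) and to the BERT output B (A_b), the two attention outputs are averaged, and the result is passed through the feed-forward network F. This is a deliberate simplification: single-head dot-product attention without learned projections, a ReLU standing in for F, and no drop-net.

```python
# Simplified BERT-fuse encoder layer: average attention over the previous
# encoder layer and over the BERT outputs, then apply a feed-forward step.
import numpy as np

def attention(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

def bert_fuse_layer(H_prev, B, F=lambda x: np.maximum(x, 0.0)):
    """h_i^l = F( 1/2 * (A_h(h_i, H_prev) + A_b(h_i, B)) ) for every position i."""
    out = []
    for h_i in H_prev:
        fused = 0.5 * (attention(h_i, H_prev, H_prev) + attention(h_i, B, B))
        out.append(F(fused))
    return np.stack(out)

rng = np.random.default_rng(0)
n, d = 3, 4
H0 = rng.normal(size=(n, d))  # GEC encoder embeddings (toy dimensions)
B = rng.normal(size=(n, d))   # BERT representations of the same sentence
H1 = bert_fuse_layer(H0, B)
```

The real model additionally uses learned multi-head projections, residual connections, and the drop-net trick, which are omitted here for brevity.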

BERT-fuse Mask and GED
The advantage of BERT-fuse is that it preserves the information pre-trained on raw corpora; however, it may not be adapted to the GEC task or to the task-specific distribution of inputs, because, unlike the data used to train BERT, the input to a GEC model can be an erroneous sentence. To fill the gap between the corpora used to train BERT and GEC corpora, we either additionally train BERT on GEC corpora (BERT-fuse mask) or fine-tune BERT as a GED model (BERT-fuse GED) before using it for BERT-fuse. GED is a sequence labeling task that detects grammatically incorrect words in input sentences (Rei and Yannakoudakis, 2016; Kaneko et al., 2017). Since BERT is also effective for GED (Bell et al., 2019), it is considered a suitable starting point for fine-tuning that takes grammatical errors into account.
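For the GED fine-tuning step, token-level correct/incorrect labels must be derived from the parallel GEC data. The sketch below shows one plausible way to do this (an illustrative assumption, not necessarily the authors' exact procedure): align each erroneous sentence with its correction and mark replaced or deleted source tokens as incorrect.

```python
# Derive token-level GED labels from a (source, correction) sentence pair
# by aligning the two token sequences.
import difflib

def ged_labels(source_tokens, target_tokens):
    """Return one 'correct'/'incorrect' label per source token."""
    labels = ["correct"] * len(source_tokens)
    matcher = difflib.SequenceMatcher(a=source_tokens, b=target_tokens)
    for op, i1, i2, _, _ in matcher.get_opcodes():
        # source tokens that are replaced or deleted are grammatical errors
        if op in ("replace", "delete"):
            for i in range(i1, i2):
                labels[i] = "incorrect"
    return labels

src = "He go to school yesterday".split()
trg = "He went to school yesterday".split()
labels = ged_labels(src, trg)  # "go" is flagged as incorrect
```

BERT is then fine-tuned to predict these labels for each token, so that its representations become sensitive to erroneous input.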

Train and Development Sets
We use the official shared task data of the BEA-2019 workshop (Bryant et al., 2019) as training and development sets. Specifically, to train the GEC model, we use the W&I-train (Granger, 1998; Yannakoudakis et al., 2018), NUCLE (Dahlmeier et al., 2013), FCE-train (Yannakoudakis et al., 2011), and Lang-8 (Mizumoto et al., 2011) datasets, and W&I-dev as the development set. Note that we excluded sentence pairs from the training data in which no correction was made. To train BERT for BERT-fuse mask and GED, we use W&I-train, NUCLE, and FCE-train as training data and W&I-dev as development data.
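The exclusion of uncorrected sentence pairs mentioned above amounts to a trivial preprocessing filter, sketched here (illustrative code, not the authors' implementation):

```python
# Drop training pairs where the target is identical to the source,
# i.e., where no correction was made.
def filter_uncorrected(pairs):
    """Keep only (source, target) pairs whose target differs from the source."""
    return [(s, t) for s, t in pairs if s != t]

pairs = [
    ("He go home .", "He went home ."),   # corrected: keep
    ("She is here .", "She is here ."),   # unchanged: drop
]
kept = filter_uncorrected(pairs)
```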

Evaluating GEC Performance
In GEC, it is important to evaluate models on multiple datasets. Therefore, we used the W&I-test, CoNLL-2014 (Ng et al., 2014), FCE-test, and JFLEG (Napoles et al., 2017) evaluation sets. We used the ERRANT evaluation metric (Felice et al., 2016; Bryant et al., 2017) for W&I-test, the M^2 score (Dahlmeier and Ng, 2012) for the CoNLL-2014 and FCE-test sets, and GLEU (Napoles et al., 2015) for JFLEG. All our results (except the ensembles) are the average of four distinct trials with four different random seeds.
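Both ERRANT and the M^2 scorer report F_0.5, which weights precision twice as heavily as recall, since unnecessary corrections are considered more harmful than missed ones. For reference, the general F_beta formula is sketched below:

```python
# F_beta score; beta = 0.5 emphasizes precision, as in standard GEC evaluation.
def f_beta(precision, recall, beta=0.5):
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)

score = f_beta(0.6, 0.3)  # 1.25 * 0.18 / (0.15 + 0.3) = 0.5
```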

Models
Hyperparameter values for the GEC model are listed in Table 1. We used the BERT-Base cased model for consistency across experiments. The model was evaluated on the development set.

Table 2: Results of our GEC models. The top group shows the results of the single models without using pseudo-data and/or ensembles. The second group shows the results of the single models using pseudo-data. The third group shows ensemble models using pseudo-data. Bold indicates the highest score in each column. * reports the state-of-the-art scores on the BEA test and CoNLL-2014 sets for two separate models, with and without SED; we filled a single line with the results of these two separate models.

Right-to-left (R2L) Re-ranking for Ensemble
We describe the R2L re-ranking technique incorporated in our experiments, proposed by Sennrich et al. (2016), which has proved effective for the GEC task (Grundkiewicz et al., 2019; Kiyono et al., 2019). Standard left-to-right (L2R) models generate the n-best hypotheses with scores from the normal ensemble, and right-to-left (R2L) models re-score them. We then re-rank the n-best candidates based on the sum of the L2R and R2L scores. We use the generation probability as the re-ranking score and ensemble four L2R models and four R2L models.

Results

Table 2 shows the experimental results of the GEC models. A Transformer model trained without BERT is denoted as "w/o BERT." The top group of results shows that using BERT consistently improves the accuracy of our GEC model. Moreover, BERT-fuse, BERT-fuse mask, and BERT-fuse GED outperformed the BERT-init model in almost all cases, and adapting BERT to GEC corpora before BERT-fuse leads to better correction results. BERT-fuse GED always gives better results than BERT-fuse mask, possibly because BERT-fuse GED can explicitly take grammatical errors into account. In the second group, the correction results are likewise improved by using BERT. In this setting, BERT-fuse GED outperformed the other models in all cases except the FCE-test set, achieving state-of-the-art results with a single model on the BEA-2019 and CoNLL-2014 datasets. In the last group, the ensemble model yielded high scores on all corpora, improving the state-of-the-art result on CoNLL-2014 by 0.2 points.
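The sum-of-scores re-ranking used in our ensemble can be sketched as follows; the hypotheses, scores, and R2L scoring function here are toy stand-ins for the actual ensemble generation log-probabilities:

```python
# Re-rank an n-best list by the sum of L2R and R2L (log-probability) scores.
def rerank(nbest, r2l_score):
    """nbest: list of (hypothesis, l2r_logprob); returns the re-ranked list."""
    rescored = [(hyp, l2r + r2l_score(hyp)) for hyp, l2r in nbest]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

# toy n-best list: the L2R ensemble slightly prefers hypothesis A,
# but the R2L score flips the ranking in favor of B
nbest = [("A", -1.0), ("B", -1.2)]
toy_r2l = {"A": -2.0, "B": -0.5}.get
ranked = rerank(nbest, toy_r2l)  # B: -1.7 now outranks A: -3.0
```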

Hidden Representation Visualization
We investigate the characteristics of the hidden representations of vanilla BERT (i.e., without any fine-tuning) and of BERT fine-tuned with GED. We visualize the last-layer hidden representations H^L of the same words, chosen according to their correctness in different contexts, using the above models. The eight target words ("the", ",", "in", "to", "of", "a", "for", and "is"), each mistaken more than 50 times, were chosen from W&I-dev. We sampled the same number of correctly used cases of each word from the corrected side of W&I-dev. Figure 1 visualizes the hidden representations of BERT and fine-tuned BERT. Vanilla BERT does not distinguish between correct and incorrect clusters: the plotted eight words are grouped together, and hidden representations of the same word gather in the same place regardless of correctness. In contrast, fine-tuned BERT produces a vector space in which correct and incorrect words lie on different sides, showing that the hidden representations take grammatical errors into account when fine-tuned on GEC corpora. Moreover, the correct cases are divided into eight clusters, implying that BERT's original information is also retained.
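The visualization setup can be sketched as follows: collect last-layer vectors for the sampled correct and incorrect occurrences and project them to two dimensions. We use a plain PCA via SVD and random vectors as stand-ins for BERT's hidden states; the paper does not specify its actual plotting pipeline.

```python
# Project high-dimensional hidden states onto their top-2 principal components.
import numpy as np

def project_2d(vectors):
    """Center the vectors and project them onto the top-2 principal components."""
    X = vectors - vectors.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

rng = np.random.default_rng(0)
hidden = rng.normal(size=(16, 768))  # 16 occurrences, BERT-Base hidden width
points = project_2d(hidden)          # one (x, y) point per occurrence
```

Each point would then be colored by its correct/incorrect label and word identity to produce a figure like Figure 1.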

Performance for Each Error Type
We investigate the correction results for each error type. We use ERRANT (Felice et al., 2016; Bryant et al., 2017) to measure the F_0.5 of the model for each error type; ERRANT can automatically assign error types from source and target sentences. For this investigation, we use the single BERT-fuse GED and w/o BERT models trained without pseudo-data. Table 3 shows their results on most error types, including all of the top-5 most frequent error types in W&I-dev. BERT-fuse GED is better than w/o BERT for all error types. We can thus say that using BERT fine-tuned with GED in the EncDec model improves performance independently of the error type.

Table 3: The results of the single fine-tuned BERT-fuse and w/o BERT models without using pseudo-data on most error types, including all of the top-5 most frequent error types in W&I-dev.

Conclusion
In this paper, we investigated how to effectively use MLMs for training GEC models. Our results show that BERT-fuse GED, in which BERT is fine-tuned with GEC corpora before its output is fused into the EncDec model, was the most effective of the methods we examined. In future work, we will investigate whether BERT-init can be used effectively in combination with methods that mitigate catastrophic forgetting.