Cross-lingual Transfer Learning for Grammatical Error Correction

In this study, we explore cross-lingual transfer learning in grammatical error correction (GEC) tasks. Many languages lack the resources required to train GEC models. Cross-lingual transfer learning from high-resource languages (the source models) is effective for training models of low-resource languages (the target models) for various tasks. However, in GEC tasks, the possibility of transferring grammatical knowledge (e.g., grammatical functions) across languages is not evident. Therefore, we investigate cross-lingual transfer learning methods for GEC. Our results demonstrate that transfer learning from other languages can improve the accuracy of GEC. We also demonstrate that proximity to source languages has a significant impact on the accuracy of correcting certain types of errors.


Introduction
Grammatical error correction (GEC) is the task of correcting grammatically incorrect sentences. The demand for GEC has grown significantly in recent decades because of the increasing opportunities for cross-cultural collaboration. Previous studies have primarily focused on improving automated GEC for English. Because of the large amount of data available for training, several machine-learning-based methods have achieved high scores in English GEC (Zhao et al., 2019; Grundkiewicz et al., 2019; Kiyono et al., 2019; Kaneko et al., 2020). In recent years, researchers have started working on other languages, including Russian and Czech (Rozovskaya and Roth, 2019; Náplava and Straka, 2019). However, for these languages, the resources required to train accurate GEC models are not sufficiently available.
It is known that using high-resource languages as the source languages can improve the accuracy of deep neural models for low-resource target languages in various settings (Johnson et al., 2017; Ruder et al., 2019; Dabre et al., 2020). One such setting involves cross-lingual transfer learning (Zoph et al., 2016), which aims to improve the accuracy of low-resource target models using knowledge from high-resource source models. The similarity between the source and target languages is a key factor for successfully transferring grammatical knowledge (Cotterell and Heigold, 2017; Johnson et al., 2017). For example, languages within the same language family share several rules of grammar and nuances of vocabulary, which aid the learning process of the target models.
However, thus far, no study has investigated the use of cross-lingual transfer learning for GEC from other languages; therefore, it is unclear if useful grammatical knowledge (e.g., case inflection or conjugation) can be transferred. Table 1 shows example case inflections of words that mean "sister" in English, Russian, and Czech. In English, the difference between nominative and genitive is marked by the suffix "'s," whereas in Russian and Czech, it is marked by inflectional endings on the noun. This example shows that Russian and Czech inflections are similar, suggesting that it may be possible to perform cross-lingual transfer learning by exploiting their grammatical similarities.
In this study, we investigate the following three research questions with respect to cross-lingual transfer learning for GEC:
(a) Does cross-lingual transfer learning improve GEC? To help answer this question, we compare the results of GEC models trained with and without transfer learning. In addition, we compare several cross-lingual transfer learning methods.
(b) Is information on grammatical errors transferable? To help answer this question, we analyze the correction results for each error type. Specifically, we transfer grammatical knowledge from languages that have similar grammatical structures and analyze the results for error types that are similar between the languages, such as noun case inflection.
(c) How does the size of data in the target language affect the results of transfer learning? Generally, in transfer learning, the transfer is performed from a high-resource language to a low-resource language. However, it is not clear whether it is effective to perform transfer learning from a low-resource language to a high-resource language. Therefore, we analyze the effectiveness of transfer learning, even for targets from a high-resource setting.
Our results indicate that using transfer learning from similar languages can improve the accuracy of GEC for certain grammatical errors. In particular, we show that the error correction performance of similar lexical items is improved and that the transfer of grammatical knowledge is possible. Additionally, we demonstrate that transfer learning is more effective for some types of errors than others, depending on the size of the target data.

Related Work
Most recent GEC methods use the encoder-decoder (EncDec) model, which requires large-scale training data (Zhao et al., 2019; Grundkiewicz et al., 2019). Therefore, several studies created additional pseudo-data in low-resource scenarios (Náplava and Straka, 2019; Rozovskaya and Roth, 2019). For example, because it is easy to generate a grammatically incorrect sentence from a grammatically correct sentence, extensive research has been conducted on generating pseudo-data from large-scale monolingual corpora (Xie et al., 2018; Kiyono et al., 2019). In addition, the use of EncDec models pretrained with large-scale unlabeled data is known to be effective for GEC (Kaneko et al., 2020). These studies aimed to improve the performance of GEC using large-scale training data.
Furthermore, research has also been conducted on the use of linguistic knowledge from other languages in neural machine translation (NMT). Zoph et al. (2016) proposed a method to fine-tune NMT models trained from high-resource language pairs on low-resource language pairs. Johnson et al. (2017) demonstrated that a language can be translated with no training data by jointly training one model by concatenating the training data from multiple languages. Schuster et al. (2018) presented a method that uses a bidirectional NMT encoder for cross-lingual contextual word representations, to generate dialog responses. For question answering, Lee and Lee (2019) proposed a cross-lingual transfer learning method that uses generative adversarial networks. These studies focused on the tasks for which semantic information is more important, as opposed to GEC, for which grammatical information is the key factor.
Various studies have analyzed the transfer of syntactic knowledge between languages. Kim et al. (2017) proposed a part-of-speech tagging method for learning language-independent and language-dependent expressions between languages by combining two models corresponding to the expressions. Wu and Dredze (2019) used multilingual BERT for five tasks, such as POS tagging and dependency parsing, and demonstrated that its performance can be improved by using multilingual knowledge. As in our study, these studies perform transfer learning between languages in tasks for which syntactic information is important. However, it is not clear whether linguistic knowledge about grammatical errors can be transferred across languages.
Several GEC studies using L1 information have been conducted. Rozovskaya and Roth (2011) adopted information from five L1s with different priorities to preposition correction using the naïve Bayes classifier. Rozovskaya et al. (2017) extended this method to eleven L1s and three error types. Mizumoto et al. (2011) demonstrated that using the same L1 for training and test data in an SMT-based GEC system improved the system performance. Chollampatt et al. (2016) extended this method by incorporating three different L1 neural language models into an SMT-based GEC model as features to adapt to each L1. In these studies, GEC was performed considering the L1 information; however, unlike our study, the objective of these studies did not include the transfer of grammatical knowledge between languages.

Overall Training Steps
We train a GEC model that employs the Masked Language Modeling (MLM) / Translation Language Modeling (TLM) (Conneau and Lample, 2019) shown in Subsection 3.2, and the transfer learning method shown in Subsection 3.3. Figure 1 illustrates the overall training steps. First, we pre-train the MLM/TLM with the monolingual/parallel corpora of the source and target languages. Second, we initialize the GEC model using the MLM/TLM and train it with the learner corpora of the source and target languages. Finally, we fine-tune the GEC model with the learner corpus of the target language.
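The three steps above can be written down as a small plan, which makes explicit which data each step consumes. This is only a sketch: the `Stage` class, the `training_plan` function, and the data labels such as `learner:ru` are hypothetical names introduced for illustration, not part of any released code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Stage:
    """One step of the pipeline: its name and the data it is trained on."""
    name: str
    data: Tuple[str, ...]


def training_plan(source: str, target: str, use_tlm: bool) -> List[Stage]:
    """Return the three training steps for one source/target language pair."""
    pretrain_data = [f"monolingual:{source}", f"monolingual:{target}"]
    if use_tlm:
        # TLM additionally consumes parallel sentence pairs.
        pretrain_data.append(f"parallel:{source}-{target}")
    return [
        # Step 1: pre-train the MLM (or MLM + TLM).
        Stage("pretrain_lm", tuple(pretrain_data)),
        # Step 2: initialize the GEC model from step 1, train on both learner corpora.
        Stage("train_gec", (f"learner:{source}", f"learner:{target}")),
        # Step 3: fine-tune on the target-language learner corpus only.
        Stage("finetune_gec", (f"learner:{target}",)),
    ]


plan = training_plan("cs", "ru", use_tlm=True)
for stage in plan:
    print(stage.name, "<-", ", ".join(stage.data))
```

Setting `use_tlm=False` gives the MLM-only variant, which differs only in the data used for the first step.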

Using Pre-trained Language Representations
We use pre-trained language representations in our cross-lingual transfer learning to transfer cross-lingual linguistic knowledge that cannot be obtained solely from a learner corpus. We use a method based on MLM and TLM to learn the language representations.
MLM training uses monolingual source and target language data. Input is given as one sentence with some tokens being masked, and training is performed by predicting the masked tokens. When training the MLM, a batch includes sentences coming from the same language at each iteration. TLM extends the capability of MLM to use parallel data for training. If the input data are parallel, then the sentence pair is combined into one sequence while masking some tokens. Training and prediction are performed in a manner similar to those in MLM. Because MLM and TLM are used in combination, they are trained alternately. Henceforth, in this paper, the model trained by combining MLM and TLM will be referred to as TLM. It is assumed that using a parallel corpus with TLM provides better knowledge transfer between languages than using only MLM.

Source languages for each target language, by similarity:
target    high      moderate   low
Russian   Czech     English    Japanese
Czech     Russian   English    Japanese
English   German    Russian    Japanese
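The way MLM and TLM inputs are built, as described in this subsection, can be sketched as follows. The mask symbol, the separator symbol, the masking probability of 0.15, and the function names are all assumptions made for illustration; the sketch only shows how inputs are corrupted and does not train a model.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical mask symbol
SEP = "</s>"     # hypothetical separator between the two sentences of a pair


def mask_tokens(tokens: List[str], mask_prob: float,
                rng: random.Random) -> Tuple[List[str], List[int]]:
    """Mask each non-separator token with probability mask_prob.

    Returns the corrupted sequence and the masked positions, i.e. the
    positions whose original tokens the model is trained to predict.
    """
    corrupted: List[str] = []
    positions: List[int] = []
    for i, token in enumerate(tokens):
        if token != SEP and rng.random() < mask_prob:
            corrupted.append(MASK)
            positions.append(i)
        else:
            corrupted.append(token)
    return corrupted, positions


def tlm_input(src: List[str], tgt: List[str]) -> List[str]:
    """Combine a parallel sentence pair into one sequence for TLM."""
    return src + [SEP] + tgt


rng = random.Random(0)
mono = "моя сестра читает книгу".split()                  # MLM: one sentence
pair = tlm_input("my sister reads a book".split(), mono)  # TLM: a sentence pair

mlm_corrupted, mlm_positions = mask_tokens(mono, 0.15, rng)
tlm_corrupted, tlm_positions = mask_tokens(pair, 0.15, rng)
print(mlm_corrupted, mlm_positions)
print(tlm_corrupted, tlm_positions)
```

Because the TLM input contains both sentences of a pair, a token masked in one language can in principle be predicted from its translation in the other, which is the intuition behind the assumed benefit of parallel data.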
We compare the GEC model initialized using MLM with that initialized using TLM, and investigate whether it is better to use a parallel corpus for cross-lingual transfer for GEC. After initializing the GEC model using MLM or TLM, we reset the learning rate and train the GEC model. While training the GEC models, we do not use the language embeddings proposed by Conneau and Lample (2019).

Transfer-learning Method for GEC
In this study, we investigate whether grammatical knowledge can be transferred in GEC using cross-lingual transfer learning. Various methods are utilized for cross-lingual transfer learning, as discussed in Section 2. In this study, we focus on sharing both lexical and grammatical knowledge between languages. Thus, we use the EncDec model to facilitate transfer learning.
Several studies on NMT have demonstrated that training a source model on a combination of different languages is effective when performing fine-tuning on low-resource language pairs (Imankulova et al., 2019). Therefore, we train the GEC models by concatenating the learner data from both the source and the target languages, wherein each batch may consist of tokens from the two languages; subsequently, we fine-tune the models using the learner data of the target language. Finally, the outputs of all models are re-ranked, as proposed by Chollampatt et al. (2018).
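The joint-training and fine-tuning data flow described above might look like the following sketch. It batches by sentence count for simplicity, whereas the actual setup may batch by tokens; the function names and the toy examples are hypothetical.

```python
import random
from typing import Iterator, List, Tuple

# (language, learner sentence, corrected sentence)
Example = Tuple[str, str, str]


def mixed_batches(source: List[Example], target: List[Example],
                  batch_size: int, seed: int = 0) -> Iterator[List[Example]]:
    """Yield shuffled batches drawn from the concatenated learner data."""
    data = source + target  # new list, so the inputs are left untouched
    random.Random(seed).shuffle(data)
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def target_only_batches(target: List[Example], batch_size: int,
                        seed: int = 0) -> Iterator[List[Example]]:
    """Fine-tuning stage: batches from the target learner data only."""
    return mixed_batches([], target, batch_size, seed)


# Toy placeholders, not real corpus sentences.
cs = [("cs", f"cs_src_{i}", f"cs_trg_{i}") for i in range(6)]
ru = [("ru", f"ru_src_{i}", f"ru_trg_{i}") for i in range(4)]

joint = list(mixed_batches(cs, ru, batch_size=4))
finetune = list(target_only_batches(ru, batch_size=4))
print([[ex[0] for ex in batch] for batch in joint])
```

Shuffling the concatenated data is what allows a single batch to contain examples from both languages; the fine-tuning pass then sees the target language only.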

Languages
In this study, we perform experiments with GEC on three target languages: Russian, Czech, and English. For each target language, we use three source languages: one with high similarity, one with moderate similarity, and one with low similarity (i.e., Japanese, which is not used as a target language). For the Russian GEC, we use Czech, English, and Japanese as the source languages. Russian and Czech are languages that belong to the Slavic family; hence, they have considerable commonalities, such as in the inflections of adjectives and nouns. Although English is different from Russian in terms of the language family and the characters used, it has some similarities; for example, the verb forms change depending on the singular or plural status of the subject. Japanese is the language that is least related to Russian among the languages used in this study. It differs not only in terms of the language family but also in terms of characters and sentence structure.
For the Czech GEC, we use Russian, English, and Japanese as the source languages. The similarities between Russian and Czech have already been mentioned, and the relation of Czech to English and Japanese is similar to that of Russian.
For the English GEC, we use German, Russian, and Japanese as the source languages. English and German are considerably similar languages and belong to the same Germanic family. The relation between English and Russian has been described above; Japanese is again the least similar language to English among the languages used.

Data
The data used in the experiment are presented in Table 3. In this study, we use WMT-2019's News Crawl and Japanese Wikipedia data as monolingual data for training the MLM, and we use TED Talks (Cettolo et al., 2012) [...] (Boyd, 2018) as the learner corpora for training the GEC model. For the development and test data for GEC, we use the Russian, Czech, and German data attached to each corpus. We use English data from CoNLL 2013 and CoNLL 2014 and Japanese data from the NAIST Goyo Corpus (Oyama et al., 2013) for the development and test data. We also use Russian News Crawl (2015-2018), Czech News Crawl (2014-2018), and English News Crawl (2015-2018) to train the language model for re-ranking the outputs of the GEC model.
The TED Talks data are reconstructed from the original English translation data by extracting the corresponding sentence pairs in each language. News Crawl, Wikipedia, Lang-8, and NUCLE data are obtained by extracting the number of sentences listed in Table 3 from the original data. To maintain a consistent experimental setting, the size of each source language's data is adjusted to match that of the language with the smallest data size. The source-language data size is 40K in the Russian GEC and 54K in the Czech and English GEC.
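The size adjustment described above amounts to downsampling every source corpus to the size of the smallest one. The sketch below assumes random sampling with a fixed seed; the text does not state how the sentences were selected, so this choice, the function name `equalize_sizes`, and the toy corpora are illustrative assumptions.

```python
import random
from typing import Dict, List


def equalize_sizes(corpora: Dict[str, List[str]],
                   seed: int = 0) -> Dict[str, List[str]]:
    """Downsample every corpus to the size of the smallest one."""
    smallest = min(len(sentences) for sentences in corpora.values())
    rng = random.Random(seed)
    return {lang: rng.sample(sentences, smallest)
            for lang, sentences in corpora.items()}


# Toy placeholders standing in for the source-language learner corpora.
toy = {
    "cs": [f"cs_{i}" for i in range(7)],
    "en": [f"en_{i}" for i in range(12)],
    "ja": [f"ja_{i}" for i in range(5)],
}
balanced = equalize_sizes(toy)
print({lang: len(sentences) for lang, sentences in balanced.items()})
```

Equal source sizes keep the comparison between source languages from being confounded by the amount of source data.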

Settings
We use the same architecture as Conneau and Lample (2019) for the MLM/TLM and a Transformer encoder and decoder for the GEC models. Both the encoder and the decoder of the GEC model are initialized with the parameters of the MLM/TLM. The number of layers in the model is six, the dimension of the hidden and embedding layers is 1,024, the batch size is 32, and dropout with a probability of 0.1 is applied. The best model is selected using perplexity on the development data. We report the precision, recall, and F0.5 scores computed with m2scorer (Dahlmeier and Ng, 2012) on the test data.
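For reference, the F0.5 score weights precision twice as much as recall. The sketch below computes it from edit-level counts using the standard F-beta formula; it is not the m2scorer implementation, the counts in the example are hypothetical, and the handling of empty denominators is a simplification that may differ from m2scorer.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int,
                       beta: float = 0.5) -> Tuple[float, float, float]:
    """Precision, recall, and F-beta from edit-level counts.

    tp: proposed edits that match a gold edit
    fp: proposed edits that match no gold edit
    fn: gold edits that were not proposed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0  # simplified edge case
    recall = tp / (tp + fn) if tp + fn else 0.0     # simplified edge case
    denom = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical counts, not results from the experiments.
p, r, f = precision_recall_f(tp=30, fp=10, fn=60)
print(f"P={p:.3f} R={r:.3f} F0.5={f:.3f}")
```

For these counts, precision is 0.75, recall is 1/3, and F0.5 is 0.6, which lies closer to the precision than the balanced F1 of about 0.46 would.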

Baseline
In this study, we use two baselines to compare the effects of transfer learning and the MLM/TLM.

PLAIN
In this setting, we do not use the MLM/TLM. Therefore, the GEC model learns grammatical knowledge from the learner corpus only.

MLM {Ru,Cs,En}-only
In this setting, we pre-train the MLM with the target language monolingual corpus only and train the GEC model with the target language learner corpus only. This model learns grammatical knowledge from the large-scale target language monolingual corpus and learner corpus; it does not use knowledge of other languages.

Table 4 shows the results for the Russian GEC. The models that use transfer learning with MLM or TLM obtain a higher F0.5 score than the PLAIN models and the MLM Ru-only model; moreover, the model that uses transfer learning from Czech with TLM obtains the highest precision and F0.5 score.

Table 5 shows the results for the Czech GEC. In this table, a trend similar to the Russian results can be observed. The models that use transfer learning with MLM or TLM obtain a higher F0.5 score than the PLAIN models and the MLM Cs-only model; furthermore, the model that uses transfer learning from Russian with TLM obtains the highest precision and F0.5 score.

Table 6 shows the results for the English GEC. In the "NUCLE only" setting, the results in Table 6 show a trend similar to that of the Russian and Czech GEC results. The F0.5 scores of the models that use transfer learning with MLM or TLM are higher than those of the PLAIN models and the MLM En-only model; the model that uses transfer learning from German with TLM obtains the highest F0.5 score. In the "NUCLE + Lang-8-En" setting, the F0.5 scores of the models that use transfer learning with MLM or TLM are higher than those of the PLAIN models; however, unlike the other GEC results, they are almost the same as the score of the MLM En-only model. No difference is observed between the scores of the models that use transfer learning with MLM and those that use TLM.

Table 7: Recall of the Russian GEC models on the top-10 most frequent error types in RULEC-GEC.

Table 8: Output example of the Russian GEC model for Adj:Case and Noun:Case. The words in red are incorrect, and those in blue are correct. Brackets show the types of error. The first term indicates a part of speech, where "A" is an adjective, and "N" is a noun. The second term indicates the word case, where "Nom" is nominative, "Gen" is genitive, "Pre" is prepositional, and "Inst" is instrumental.
English: Of course, in Russia there is a more political transparency in the news when it comes to international relations, but not about domestic politics.

Cross-lingual Transfer Learning for GEC
The GEC results for the three languages provide some interesting insights that are common to all languages for GEC using cross-lingual transfer learning. For all target languages, the models that use transfer learning with MLM or TLM score higher than those that use transfer learning without MLM or TLM. This shows that transfer learning with MLM and TLM is effective for GEC, irrespective of the language pairs. The models that use transfer learning with MLM or TLM from the most similar languages obtain a higher score than those that use MLM pretrained in the target language instead of transfer learning. This suggests that a better GEC model can be trained by considering knowledge in multiple languages instead of only the target language. For every target language, the model that uses transfer learning with TLM from the language closest to the target language obtains the highest score. This indicates that it is important to perform transfers from a closely related language.

Similar Lexical Items between Languages
In this subsection, we investigate whether transfer learning between languages improves the error correction accuracy for similar grammatical items. Table 7 shows the recall of different models for the error types calculated using manually annotated evaluation data for Russian GEC. Using Czech as a source language improves the accuracy of most error types compared to the baseline models. Moreover, for grammatical items that are similar between Czech and Russian (e.g., Spelling, Lexical choice, Adj:Case (errors of adjective case inflection), and Noun:Case (errors of noun case inflection)), the TLM model transferred from Czech obtains the highest scores, thereby demonstrating the positive effect of transfer learning on similar grammatical items. Table 8 shows output examples of the Russian GEC model for the error types Adj:Case and Noun:Case. The underlined words represent the errors in Adj:Case and Noun:Case. When the original sentence is compared with the gold sentence, the case in the gold sentence is observed to have changed. The case of a word is denoted by suffixes in Russian and Czech. Czech has seven cases, six of which it shares with Russian. The MLM Ru-only model fails to correct the errors in this example. We speculate that this is caused by a lack of training data. The use of Japanese as a source language leads to erroneous changes in nouns. In English, the prepositional relations are marked using standalone words, which is not helpful in correcting the prepositional errors in Russian, wherein the prepositional relations are marked by case systems. Only the TLM model transferred from Czech generates the correct output. We hypothesize that Adj:Case and Noun:Case errors are corrected into the prepositional case because our model captures the grammatical information of Czech, which is also useful in Russian.

Table 9: Recall of the English GEC models on the top-five and bottom-five in terms of the numbers of errors in the Lang-8-En corpus, excluding error types whose number of errors in the test data is less than or equal to 50.

Size of Target Language Data
In this subsection, we analyze the effect of the size of the target language data on transfer learning. We use ERRANT (Bryant et al., 2017) to annotate the training and evaluation data with error types and analyze the knowledge that is effective (i.e., transferable) in transfer learning to a high-resource target. Table 9 shows the recall results of the English GEC models on the top-five error types (determiner, preposition, punctuation, verb, and verb tense) and the bottom-five error types (spelling, pronoun, verb form, morphology, and subject-verb agreement) in terms of the numbers of errors in the Lang-8-En corpus.
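Per-error-type recall of the kind reported in Table 9 can be computed once the gold edits carry error-type labels. The sketch below assumes that an edit is a tuple of sentence id, span, and correction and that a system edit counts as correct only on an exact match; this representation, the function name, and the toy edits are illustrative assumptions and do not reproduce the ERRANT toolkit itself.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Edit = Tuple[int, int, int, str]            # (sentence id, start, end, correction)
TypedEdit = Tuple[int, int, int, str, str]  # an Edit plus an error-type label


def recall_by_type(gold: Iterable[TypedEdit], system: Iterable[Edit],
                   min_count: int = 1) -> Dict[str, float]:
    """Recall per error type, skipping types with fewer than min_count gold edits."""
    proposed = set(system)
    total: Counter = Counter()
    matched: Counter = Counter()
    for sent_id, start, end, correction, error_type in gold:
        total[error_type] += 1
        if (sent_id, start, end, correction) in proposed:
            matched[error_type] += 1
    return {t: matched[t] / n for t, n in total.items() if n >= min_count}


# Toy edits, not taken from any corpus.
gold_edits = [
    (0, 1, 2, "the", "DET"),
    (0, 4, 5, "in", "PREP"),
    (1, 0, 1, "The", "DET"),
    (1, 2, 3, "goes", "VERB:SVA"),
]
system_edits = [(0, 1, 2, "the"), (1, 2, 3, "goes"), (1, 5, 6, "a")]

recalls = recall_by_type(gold_edits, system_edits)
print(recalls)
```

Setting `min_count=51` would mirror the exclusion of error types with 50 or fewer errors in the test data.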
The results of "NUCLE only," wherein the target is in a low-resource setting, demonstrate that the model that uses transfer learning from German obtains the highest recall for most error types. This suggests that transfer learning is effective for most error types when the target language is low-resource.
For the "NUCLE + Lang-8-En" setting, there is a tendency wherein the recall of the top-five error types does not increase much through transfer learning from any language. However, the tendency of recall of the bottom-five error types is slightly different. The recall of the model that uses transfer learning from Japanese barely increases from that of the baseline model in terms of pronoun, verb form, and subject-verb agreement, which are dissimilar error types between Japanese and English. In contrast, the recall results of the model transferred from German are higher than those of the MLM En-only model for similar errors, such as pronoun, morphology, and subject-verb agreement in English and German. Accordingly, it can be considered that transfer learning from other languages is effective for the error types that are infrequent but similar between the source and target languages, even if the target is a high-resource language.

Conclusion
In this study, we show that certain grammatical knowledge can be transferred across languages for GEC. In particular, the correction performance of a model transferred from a similar language is greatly improved. Additionally, we show that the performance improves more for errors in similar grammatical items than for errors in dissimilar grammatical items. We also show that cross-lingual transfer learning is effective for GEC in both low-resource and high-resource languages to some extent.
In the future, we plan to compare and visualize word embeddings in models with and without transfer learning and investigate why MLM and TLM are effective for transfer learning in GEC. We also plan to investigate the results of training with and transferring from multiple source languages instead of only one.