Generating Diverse Corrections with Local Beam Search for Grammatical Error Correction

In this study, we propose a beam search method to obtain diverse outputs in a local sequence transduction task in which most of the tokens in the source and target sentences overlap, such as grammatical error correction (GEC). In GEC, it is advisable to rewrite only the local sequences that must be rewritten while leaving the correct sequences unchanged. However, existing methods for acquiring diverse outputs focus on revising all tokens of a sentence. Therefore, existing methods may either generate ungrammatical sentences because they force the entire sentence to change, or produce non-diversified sentences by weakening the constraints to avoid generating ungrammatical sentences. Considering these issues, we propose a method that rewrites only those parts of a text that need diverse corrections, rather than all of its tokens. Our beam search method adjusts the tokens searched in each beam according to the probability that the prediction is copied from the source sentence. The experimental results show that the proposed method generates more diverse corrections than existing methods without losing accuracy on the GEC task.


Introduction
Grammatical error correction (GEC) is a task that corrects grammatical errors in an input text. Depending on the input, there are multiple ways to correct such text. For example, 10 annotators can produce 10 different valid correction results for the same grammatically incorrect text (Bryant and Ng, 2015). If a GEC model presents multiple correction candidates, it helps the user decide whether to accept a correction, as the user can select their preferred expression from among the candidates.
However, existing GEC models do not consider the generation of multiple correction candidates. Generally, in GEC, the method for obtaining multiple corrections is a plain beam search that generates the n-best candidates (Grundkiewicz et al., 2019; Kaneko et al., 2020). However, it has been shown that a plain beam search does not provide a sufficient variety of candidates and produces lists of nearly identical sequences (Vijayakumar et al., 2018). Therefore, the n-best candidates generated by a beam search without diversity control are not expected to provide useful additional information.

Considering this problem, several beam search methods have been proposed to generate diverse candidates (Li et al., 2016; Vijayakumar et al., 2018; Kulikov et al., 2019). These methods encourage diversity by globally rewriting all tokens in a sentence; we refer to them as diverse global beam search methods. Conversely, GEC is a local sequence transduction task in which most of the tokens in the source and target sentences overlap, so excessive correction of the input sentence is not preferred: unnecessary rewriting damages the grammatically correct parts of the input sentence. Furthermore, encouraging more corrections than necessary decreases the performance of the GEC itself (Hotate et al., 2019).

We hypothesize that neither plain beam search nor diverse global beam search is suitable for GEC: a GEC model must correct the grammatical errors of the input sentence in diverse ways while preserving the correct portions of the sentence. In this study, we propose a diverse local beam search method that obtains diverse outputs by considering whether a token should be corrected during the beam search. Note that our method can be used for any local sequence transduction task. Figure 1 compares the existing and proposed methods.
The proposed beam search method considers the following: (a) In plain beam search, the correction is concentrated on a specific path. Therefore, this method generates sentences with similar token combinations and a small number of word types. (b) The diverse global beam search method explores many different paths. Therefore, unlike plain beam search, it generates sentences with various token combinations and a large number of word types; however, it also generates correction candidates for tokens that do not require correction. (c) The proposed diverse local beam search expands different paths only for tokens that require correction. Therefore, it generates sentences that combine more diverse tokens than plain beam search, but only at the points that need correction. Note that all the above methods search the same number of candidates n, but along different paths. The experimental results show that our diverse local beam search generates more diverse and accurate n-best candidates than the existing methods, with almost no deterioration on the general evaluation datasets for the GEC task.

Related work
Several methods have been proposed to obtain diverse outputs using beam search. Li et al. (2016) modified the standard beam search to penalize the scores of hypotheses that share the same parent node, so that the surviving hypotheses tend to come from different parent beams. Vijayakumar et al. (2018) proposed dividing the beam into several groups and performing beam search within each group, with an additional constraint that makes it harder for a group to select tokens already selected by other groups at the same time step. Kulikov et al. (2019) proposed an iterative beam search, which produces a more diverse set of candidate responses in neural dialogue modeling. However, none of these studies distinguishes between the parts of a sentence that do not need to be rewritten and the parts that require diverse corrections.
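As an illustration of the group-wise approach of Vijayakumar et al. (2018), the following minimal sketch shows one decoding step in which each group penalizes tokens already chosen by earlier groups. The function name, the greedy single-token-per-group simplification, and the additive penalty form are our assumptions, not the authors' implementation.

```python
import numpy as np

def diverse_group_step(logprobs, num_groups, diversity_strength):
    """One simplified decoding step of group-wise diverse beam search:
    each group picks its best token after penalizing tokens already
    chosen by earlier groups at this time step.
    logprobs: array of shape (num_groups, vocab_size)."""
    vocab_size = logprobs.shape[-1]
    penalty = np.zeros(vocab_size)
    chosen = []
    for g in range(num_groups):
        scores = logprobs[g] - diversity_strength * penalty
        tok = int(np.argmax(scores))
        chosen.append(tok)
        penalty[tok] += 1.0  # later groups are discouraged from reusing this token
    return chosen
```

With a large enough diversity strength, the second group is pushed away from the first group's choice even when both groups share the same underlying distribution.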

Diverse local beam search
Diverse local beam search encourages diverse correction candidates for the parts of the input sentence that must be corrected and discourages alternatives for the parts that are already correct. Consequently, it needs to diversify only a small portion of each sentence. For this purpose, a penalty score s_{b,t} is assigned to each beam b at each time step t, indicating whether a correction should be made at that position. Although different methods can be used to calculate the penalty score, in this study we use the copy probability of the copy-augmented model (Zhao et al., 2019) as s_{b,t}. We explain the copy-augmented model in greater detail in the Model section. Using the penalty score, we penalize the score of the rank-k candidate token y^k_{b,t} in beam b at time step t as follows:

    S(y^k_{b,t}) = log p(y^k_{b,t}) − λ (s_{b,t} + β) k    (1)

where p is the output distribution of the GEC model and k is the rank of the candidate within the beam (k = 0 for the most probable token). β and λ are hyperparameters: β prevents the penalty from falling to zero when s_{b,t} is zero, and λ determines the strength of the penalty.
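The rank-based local penalty described above can be sketched as follows. This is a minimal illustration with our own variable names, assuming a penalty of the form λ(s_{b,t} + β) applied in proportion to each candidate's rank; it is not the authors' implementation.

```python
import numpy as np

def local_rank_penalty_scores(log_probs, copy_prob, lam=4.0, beta=1.0):
    """Re-rank one beam's candidate tokens at a single time step.
    log_probs: candidate log-probabilities for beam b at step t,
               sorted in descending order (rank 0 is the best token).
    copy_prob: s_{b,t}, the probability that this position is copied
               from the source (high -> no correction is needed here).
    Each rank-k candidate is demoted by lam * (copy_prob + beta) * k,
    so lower-ranked alternatives survive mainly where the copy
    probability is low, i.e., where a correction is likely needed."""
    ranks = np.arange(len(log_probs))
    return log_probs - lam * (copy_prob + beta) * ranks
```

Note that the top-ranked candidate (k = 0) is never penalized, so the model's best guess is always preserved; only the spread of alternatives is modulated by the copy probability.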

Model
We used the copy-augmented model (Zhao et al., 2019) as the GEC model. This model controls the balance between the copy distribution p_copy and the generation distribution p_gen via the balancing factor α^{copy}. p_copy is the probability distribution over tokens to be copied from the source sentence, and p_gen is the generation probability distribution that predicts the output tokens. The output distribution p_{b,t} is calculated as follows:

    p_{b,t} = α^{copy}_{b,t} p_copy + (1 − α^{copy}_{b,t}) p_gen    (2)

Copying from the source sentence or generating a new token can be regarded as a choice between leaving a position unchanged or correcting it. Therefore, we used α^{copy}_{b,t} as the penalty score s_{b,t} in Equation 1. As the diverse global beam search baseline, we used diverse beam search (Vijayakumar et al., 2018), wherein the number of groups is set to the number of desired diverse sentences, and a diversity strength is chosen such that the output tokens at each time step differ across groups. We set the number of groups to the n of the n-best and the diversity strength to 0.7 for diverse global beam search. For diverse local beam search, we used β = 1.0 and λ = 4.0.

Datasets
We fine-tuned models that had been pre-trained on publicly available pseudo-data, using published training data. We used the public NUCLE (Dahlmeier et al., 2013), Lang-8 (Mizumoto et al., 2011), and FCE (Yannakoudakis et al., 2011) corpora as our training data. We used the JFLEG test set and dev set, corrected by four different annotators (Napoles et al., 2017), as the development set. We used the CoNLL-2014 dataset as the test set. The original CoNLL-2014 dataset was corrected by two different annotators (Ng et al., 2014); in this work, eight corrections made by Bryant and Ng (2015) and four minimal corrections made by Sakaguchi et al. (2016) were also used as references.

Evaluation
Performance of GEC (G-score). We evaluated each decoding method with the GLEU score (Napoles et al., 2017) on JFLEG and the F_0.5 score computed by the MaxMatch scorer (Dahlmeier and Ng, 2012) on the CoNLL-2014 test set as general evaluation metrics for GEC.
Diversity of corrections (C-score). To evaluate the diversity of corrections, we calculated a coverage score between the n-best candidates and the references. As the coverage score, we used the weighted recall that served as the evaluation metric in the 2020 Duolingo Shared Task. In this work, the number of duplicated corrections divided by the total number of corrections is used for weighting; this gives higher weight to corrections that more references agree on.
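One possible reading of this weighted recall can be sketched as follows. The function name, the representation of corrections as strings, and the exact weighting are our assumptions based on the description above.

```python
from collections import Counter

def weighted_coverage(candidate_edits, reference_edits):
    """Sketch of a weighted-recall coverage score: each reference
    correction is weighted by its frequency across all references
    (count / total), so corrections that many annotators agree on
    contribute more.
    candidate_edits: set of corrections found in the n-best candidates.
    reference_edits: list of corrections over all references,
                     duplicates included."""
    counts = Counter(reference_edits)
    total = sum(counts.values())
    covered = sum(c for edit, c in counts.items() if edit in candidate_edits)
    return covered / total if total else 0.0
```

For example, if two of three reference corrections are "a -> b" and the candidates recover it, the score is 2/3.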
Correctness of corrections (DF score). To evaluate the correctness of the corrections, we distinguished between the correct and incorrect parts of the input sentences and considered the correction outputs acceptable if they only change tokens that require corrections. In this study, we used document frequency (DF) to assess the correctness of the corrections.
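One possible reading of this n-gram-based DF score can be sketched as follows. The function name and the exact normalization (distinct n-grams of the two sentences) are our assumptions; the paper's own formula may differ in detail.

```python
def df_score(source, predicted, n=2):
    """Sketch of a DF-style correctness score: the fraction of n-grams
    shared between the source and a predicted sentence, relative to all
    distinct n-grams in the two sentences. A high score means the
    prediction left the (mostly correct) source spans untouched."""
    def ngrams(tokens, n):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    src = ngrams(source.split(), n)
    pred = ngrams(predicted.split(), n)
    union = src | pred
    return len(src & pred) / len(union) if union else 1.0
```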
First, we calculated the number of n-grams that are present in both the source and predicted sentences. Then, we calculated the DF score by dividing this number by the number of all n-grams in the source and reference sentences. Finally, the average score over the multiple outputs per source sentence was calculated across the entire test set.

Table 1: Evaluation results for 10-, 15-, and 20-best candidates. The G-score denotes performance on the GEC task, which is GLEU for JFLEG and F_0.5 for CoNLL-2014. The coverage score (C-score) and DF score denote the diversity and correctness of the corrections, respectively. [The table reports the G-score, C-score, and DF score of each method at n = 10, 15, and 20.]

Results

Table 1 shows the evaluation results of our two baselines, plain beam search (PBS) and diverse global beam search (DGBS), as well as those of our proposed diverse local beam search (DLBS). We performed experiments for 10-, 15-, and 20-best candidates. DGBS has lower coverage scores than PBS, and its DF scores are also lower than those of PBS. These results show that DGBS cannot produce more diverse outputs than PBS in GEC. We hypothesize that this is because DGBS attempts to rewrite and diversify all tokens regardless of their correctness in the source sentence. By contrast, the coverage scores and DF scores of our proposed method are higher than those of both baselines. Therefore, we conclude that our proposed method diversifies only the tokens that must be corrected in the source sentence.

Analysis
Diversity has two aspects: the diversity of the correction points and the diversity of the words used in the corrections. The DF score can measure the former, but not the latter. Here, we analyze the number of word types in the output as a measure of the diversity of the corrected words. Figure 2 shows the average number of word types per sentence for each method. The horizontal axis represents the n of the n-best (or the number of reference sentences), and the vertical axis represents the average number of word types per sentence. The GOLD value indicates the references for CoNLL-2014, and the dashed line indicates the maximum value of GOLD. For all methods, the number of word types increases as the number of sentences increases. However, for GOLD, the increase is particularly small as the number of references grows. This implies that, although there are several references, not all of them use different words. PBS produces fewer word types than GOLD. The number of word types in DGBS increases linearly with n, indicating that DGBS generates more word types than necessary compared with GOLD. Conversely, DLBS diversifies the output more appropriately than PBS and DGBS.

Table 2 shows the output examples of each decoding method for the source sentence "You can share photos , videos and every meaningful experiences of your life ." We can observe that DGBS unnecessarily changed parts of the input sentence (e.g., "meaningful" and "your"). Conversely, the proposed method generates diverse outputs only for the tokens that should be rewritten (e.g., "every," "experiences," and "of").
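The word-type statistic used in this analysis can be computed with a short sketch. The function name and the whitespace tokenization are our assumptions; per-sentence type counts over an n-best list are averaged across source sentences.

```python
def avg_word_types(outputs_per_source):
    """Average number of distinct word types per source sentence,
    counted over the n-best outputs (or references) for each source.
    outputs_per_source: list of lists of output sentences, one inner
    list per source sentence."""
    totals = [len({w for sent in outputs for w in sent.split()})
              for outputs in outputs_per_source]
    return sum(totals) / len(totals)
```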

Conclusion
Existing methods of acquiring diverse outputs rewrite all tokens of the input sentence. In this work, we proposed a method that produces diverse candidates only for the parts of the input that require correction in GEC. The proposed method was shown to diversify the outputs more appropriately in GEC than the existing methods. In the future, we would like to apply this method to other model architectures and extend it to tasks other than GEC.