Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction

Grammatical Error Correction (GEC) aims to correct writing errors and help language learners improve their writing skills. However, existing GEC models tend to produce spurious corrections or fail to detect many errors. A quality estimation model is therefore necessary to ensure that learners receive accurate GEC results and are not misled by poorly corrected sentences. Well-trained GEC models can generate several high-quality hypotheses through decoding, such as beam search, which provide valuable GEC evidence and can be used to evaluate GEC quality. However, existing models neglect the possible GEC evidence from different hypotheses. This paper presents the Neural Verification Network (VERNet) for GEC quality estimation with multiple hypotheses. VERNet establishes interactions among hypotheses with a reasoning graph and applies two kinds of attention mechanisms to propagate GEC evidence and verify the quality of generated hypotheses. Our experiments on four GEC datasets show that VERNet achieves state-of-the-art grammatical error detection performance, achieves the best quality estimation results, and significantly improves GEC performance by reranking hypotheses. All data and source codes are available at https://github.com/thunlp/VERNet.


Introduction
Grammatical Error Correction (GEC) systems primarily aim to serve second-language learners for proofreading. These systems are expected to detect grammatical errors, provide precise corrections, and guide learners to improve their language ability. With the rapid increase of second-language learners, GEC has drawn growing attention from numerous researchers in the NLP community.

* Corresponding author: M. Sun (sms@tsinghua.edu.cn)

Figure 1: The Grammaticality of Generated Hypotheses. The hypotheses are generated by Kiyono et al. (2019) with beam search decoding. Each hypothesis is compared to the source sentence with a BERT-based language model and classified into Win (the hypothesis is better), Tie (the hypothesis and the source are the same), and Loss (the source is better). The ratios of the classes are plotted for different beam search ranks.
Existing GEC systems usually inherit the seq2seq architecture (Sutskever et al., 2014) to correct grammatical errors or improve sentence fluency. These systems employ beam search decoding to generate correction hypotheses, then rerank the hypotheses from K-best decoding (Kiyono et al., 2019; Kaneko et al., 2020) or model ensembles (Chollampatt and Ng, 2018a) with quality estimation models to produce more appropriate and accurate grammatical error corrections. Such models build on edit distance and language models (Chollampatt and Ng, 2018a; Chollampatt et al., 2019; Yannakoudakis et al., 2017; Kaneko et al., 2020). Chollampatt and Ng (2018b) further consider GEC accuracy in quality estimation by directly predicting the official evaluation metric, the F0.5 score.
The K-best hypotheses from beam search usually derive from model uncertainty (Ott et al., 2018). These uncertainties of multi-hypotheses come from model confidence and the potential ambiguity of linguistic variation (Fomicheva et al., 2020), and they can be used to improve machine translation performance (Wang et al., 2019b). Fomicheva et al. (2020) further leverage multi-hypotheses to produce convincing machine translation evaluations that correlate better with human judgments. Their work demonstrates that multi-hypotheses from well-trained neural models can provide additional hints for estimating generation quality.
For GEC, the hypotheses from the beam search decoding of well-trained GEC models can provide valuable GEC evidence, for the following reasons.
• Beam search can provide better GEC results.
There is a large gap in GEC performance between the top-ranked hypothesis and the best hypothesis in the beam. For two existing GEC systems, Zhao et al. (2019) and Kiyono et al. (2019), the F0.5 scores of the top-ranked hypotheses are 58.99 and 62.03 on the CoNLL-2014 dataset, while the F0.5 scores of the best hypotheses in the beam reach 73.56 and 76.82.
• Beam search candidates are more grammatical. As shown in Figure 1, the hypotheses from well-trained GEC models with beam search usually win the favor of language models, even for hypotheses ranked near the rear. This illustrates that these hypotheses are usually more grammatical than the source sentences.
• Beam search candidates can provide valuable GEC evidence. As shown in Figure 2, the hypotheses of different beam ranks have almost the same Recall score, which demonstrates that all hypotheses in beam search can provide some valuable GEC evidence.
Existing quality estimation models for GEC (Chollampatt and Ng, 2018b) treat hypotheses independently and neglect the potential GEC evidence from different hypotheses. To fully use the valuable GEC evidence in GEC hypotheses, we propose the Neural Verification Network (VERNet), which estimates GEC quality by modeling interactions among multi-hypotheses. Given a source sentence and K hypothesis sentences from the beam search decoding of a basic GEC model, VERNet establishes hypothesis interactions by regarding $\langle$source, hypothesis$\rangle$ pairs as nodes and constructing a fully-connected reasoning graph to propagate GEC evidence among multi-hypotheses. VERNet then applies two kinds of attention mechanisms on the reasoning graph, node interaction attention and node selection attention, to summarize and aggregate the necessary GEC evidence from other hypotheses to estimate token quality.
Our experiments show that VERNet can pick up necessary GEC evidence from multi-hypotheses provided by GEC models and help verify the quality of GEC hypotheses. VERNet helps GEC models to generate more accurate GEC results and benefits most grammatical error types.

Related Work
The GEC task is designed for automatic proofreading. Large-scale annotated corpora (Mizumoto et al., 2011; Dahlmeier et al., 2013; Bryant et al., 2019) bring an opportunity for building fully data-driven GEC systems.
Existing neural models regard GEC as a natural language generation (NLG) task and usually use the sequence-to-sequence architecture (Sutskever et al., 2014) to generate correction hypotheses with beam search decoding (Yuan and Briscoe, 2016; Chollampatt and Ng, 2018a). Transformer-based architectures (Vaswani et al., 2017) show their effectiveness in NLG tasks and are also employed to achieve convincing correction results (Grundkiewicz et al., 2019; Kiyono et al., 2019). The copying mechanism has also been introduced to GEC models (Zhao et al., 2019) to better align tokens from the source sentence to the hypothesis sentence. To further accelerate the generation process, some work proposes non-autoregressive GEC models that leverage a single encoder to detect and correct grammatical errors in parallel (Awasthi et al., 2019; Malmi et al., 2019; Omelianchuk et al., 2020).
Recent research focuses on two directions to improve GEC systems. The first treats GEC as a low-resource language generation problem and focuses on data augmentation for a grammar-sensitive and language-proficient GEC system (Junczys-Dowmunt et al., 2018; Kiyono et al., 2019). Various weak-supervision corpora have been leveraged, such as Wikipedia edit history (Lichtarge et al., 2019), GitHub edit history (Hagiwara and Mita, 2020), and confusing word sets (Grundkiewicz et al., 2019). Besides, much work generates grammatical errors through generation models or round-trip translation (Ge et al., 2018; Wang et al., 2019a; Xie et al., 2018). Kiyono et al. (2019) further consider different data augmentation strategies to conduct better GEC pre-training. Reranking GEC hypotheses from K-best decoding or GEC model ensembles (Hoang et al., 2016; Chollampatt and Ng, 2018b) with quality estimation models provides another promising direction toward better GEC performance. Some methods evaluate whether hypotheses satisfy linguistic and grammatical rules, employing language models (Chollampatt and Ng, 2018a; Chollampatt et al., 2019) or grammatical error detection (GED) models to estimate hypothesis quality. GED models (Rei, 2017; Rei and Søgaard, 2019) estimate hypothesis quality at both the sentence level and the token level (Yannakoudakis et al., 2017). Chollampatt and Ng (2018b) further estimate GEC quality by considering correction accuracy: they establish source-hypothesis interactions with an encoder-decoder architecture and learn to directly predict the official evaluation score F0.5.
The pre-trained language model BERT (Devlin et al., 2019) has proven its effectiveness in producing contextual token representations, achieving better quality estimation (Chollampatt et al., 2019) and improving GEC performance by fusing BERT representations (Kaneko et al., 2020). However, existing quality estimation models regard each hypothesis independently and neglect the interactions among multi-hypotheses, which can also benefit quality estimation (Fomicheva et al., 2020).

Neural Verification Network
This section describes the Neural Verification Network (VERNet), which estimates GEC quality with multi-hypotheses, as shown in Figure 3.
Given a source sentence $s$ and $K$ corresponding hypotheses $C = \{c^1, \dots, c^k, \dots, c^K\}$ generated by a GEC model, we first regard each $\langle$source, hypothesis$\rangle$ pair $\langle s, c^k \rangle$ as a node and fully connect all nodes to establish multi-hypothesis interactions. Then VERNet leverages BERT to get the representation of each token in the $\langle s, c^k \rangle$ pairs (Sec. 3.1) and conducts two kinds of attention mechanisms to propagate and aggregate GEC evidence from other hypotheses to verify token quality (Sec. 3.2). Finally, VERNet estimates hypothesis quality by aggregating token-level quality estimation scores (Sec. 3.3). VERNet is trained end-to-end with supervision from gold labels (Sec. 3.4).

Figure 3: Example hypotheses from the beam search decoding of the basic GEC model, e.g., $c^1$: "Does someone who suffered from this disease …"; $c^3$: "Does someone who suffers (4-th token) from this disease …"; $c^5$: "Does anyone who suffers from this disease …".

Initial Representations for Sentence Pairs
Pre-trained language models, e.g. BERT (Devlin et al., 2019), show their advantages in producing contextual token representations for various NLP tasks. Hence, given a source sentence $s$ with $m$ tokens and the $k$-th hypothesis $c^k$ with $n$ tokens, we use BERT to encode the source-hypothesis pair $\langle s, c^k \rangle$ and get its representation $H^k$:
$$H^k = \mathrm{BERT}(\text{[CLS]} \, s \, \text{[SEP]} \, c^k \, \text{[SEP]}).$$
The pair representation $H^k$ consists of token-level representations, that is,
$$H^k = \{H^k_0, H^k_1, \dots, H^k_{m+n+2}\},$$
where $H^k_0$ is the representation of the "[CLS]" token.
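To make this step concrete, the following is a minimal sketch of encoding one $\langle s, c^k \rangle$ pair with huggingface's transformers; the checkpoint, example sentences, and variable names are illustrative, not the released VERNet code.

```python
# Minimal sketch of Sec. 3.1; the checkpoint and examples are illustrative.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

source = "Do one who suffered from this disease keep it a secret ?"
hypothesis = "Does one who suffers from this disease keep it a secret ?"

# Packs the pair as "[CLS] source [SEP] hypothesis [SEP]".
inputs = tokenizer(source, hypothesis, return_tensors="pt",
                   truncation=True, max_length=120)
with torch.no_grad():
    H_k = encoder(**inputs).last_hidden_state  # token-level H^k: (1, seq, 768)
```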

Verify Token Quality with Multi-hypotheses
VERNet conducts two kinds of attention mechanisms, node interaction attention and node selection attention, to verify token quality with the verification representation $V^k$ of the $k$-th node, which gathers supporting evidence for estimating token quality from multi-hypotheses. The node interaction attention first summarizes useful GEC evidence from the $l$-th node into the fine-grained representation $V^{l \to k}$ (Sec. 3.2.1). The node selection attention then aggregates the fine-grained representations $V^{l \to k}$ with scores $\gamma_l$ according to each node's confidence (Sec. 3.2.2). Finally, we calculate the verification representation $V^k$ to verify the quality of each token in each node.

Fine-grained Node Representation with Node Interaction Attention
The node interaction attention $\alpha^{l \to k}$ attentively reads tokens in the $l$-th node and picks up supporting evidence towards the $k$-th node to build fine-grained node representations $V^{l \to k}$. For the $p$-th token in the $k$-th node, $w^k_p$, we first calculate the node interaction attention weight $\alpha^{l \to k}_q$ according to the relevance between $w^k_p$ and the $q$-th token in the $l$-th node, $w^l_q$:
$$\alpha^{l \to k}_q = \mathrm{softmax}_q \big( (H^k_p)^\top W H^l_q \big),$$
where $W$ is a parameter, and $H^k_p$ and $H^l_q$ are the representations of $w^k_p$ and $w^l_q$. Then all token representations of the $l$-th node are aggregated:
$$V^{l \to k}_p = \sum_{q=1}^{m+n+2} \alpha^{l \to k}_q \cdot H^l_q.$$
Based on $V^{l \to k}_p$, we further build the $l$-th node's fine-grained representation towards the $k$-th node, $V^{l \to k} = \{V^{l \to k}_1, \dots, V^{l \to k}_{m+n+2}\}$.
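A sketch of this step under our reading of the equations above; the bilinear form, `hidden` size, and initialization are assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeInteractionAttention(nn.Module):
    """Sketch of node interaction attention (Sec. 3.2.1); the bilinear
    relevance is our reconstruction of the equation above."""
    def __init__(self, hidden=768):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden, hidden) * 0.02)

    def forward(self, H_k, H_l):
        # Relevance between every token of node k and every token of node l.
        scores = H_k @ self.W @ H_l.transpose(-1, -2)  # (seq_k, seq_l)
        alpha = F.softmax(scores, dim=-1)              # alpha^{l->k}
        return alpha @ H_l                             # V^{l->k}: (seq_k, hidden)
```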

Evidence Aggregation with Node Selection Attention
The node selection attention measures node importance and is used to aggregate supporting evidence from the fine-grained node representation $V^{l \to k}$ of the $l$-th node. We leverage the attention-over-attention mechanism (Cui et al., 2017) to build source ($h^{ls}$) and hypothesis ($h^{lh}$) representations and calculate the $l$-th node selection attention score $\gamma_l$. We then obtain the node verification representation $V^k_p$ with the node selection attention $\gamma_l$.
To calculate the node selection attention $\gamma_l$, we establish an interaction matrix $M^l$ between the source and hypothesis sentences of the $l$-th node. Each element $M^l_{ij}$ of $M^l$ is calculated from the relevance between the $i$-th source token and the $j$-th hypothesis token (including "[SEP]" tokens):
$$M^l_{ij} = (H^{ls}_i)^\top W H^{lh}_j,$$
where $W$ is a parameter. Then we calculate attention scores $\beta^{ls}_i$ and $\beta^{lh}_j$ along the source dimension and the hypothesis dimension, respectively:
$$\beta^{ls}_i = \mathrm{softmax}_i \Big( \sum_j M^l_{ij} \Big), \quad \beta^{lh}_j = \mathrm{softmax}_j \Big( \sum_i M^l_{ij} \Big).$$
Then the representations of the source sentence and the hypothesis sentence are calculated:
$$h^{ls} = \sum_i \beta^{ls}_i H^{ls}_i, \quad h^{lh} = \sum_j \beta^{lh}_j H^{lh}_j.$$
Finally, the node selection attention $\gamma_l$ of the $l$-th node is calculated for the evidence aggregation:
$$\gamma_l = \mathrm{softmax}_l \big( \mathrm{Linear}([h^{ls} \circ h^{lh}; h^{ls}; h^{lh}]) \big),$$
where $\circ$ is the element-wise multiplication operator and $;$ is the concatenation operator. The node selection attention $\gamma_l$ aggregates evidence into the verification representation $V^k_p$ of $w^k_p$:
$$V^k_p = \sum_{l=1}^{K} \gamma_l \cdot V^{l \to k}_p,$$
where $V^k = \{V^k_1, \dots, V^k_p, \dots, V^k_{m+n+2}\}$ is the $k$-th node's verification representation.
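The attention-over-attention step for a single node $l$ can be sketched as below; the scoring layer on the concatenated features is our assumption, and the softmax over all $K$ node scores (yielding $\gamma_l$) is applied outside this module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeSelectionScore(nn.Module):
    """Sketch of the node selection score for one node (Sec. 3.2.2),
    following our reconstruction of the attention-over-attention step."""
    def __init__(self, hidden=768):
        super().__init__()
        self.W = nn.Parameter(torch.randn(hidden, hidden) * 0.02)
        self.score = nn.Linear(3 * hidden, 1)

    def forward(self, H_src, H_hyp):          # (m, hidden), (n, hidden)
        M = H_src @ self.W @ H_hyp.T          # interaction matrix M^l: (m, n)
        beta_src = F.softmax(M.sum(dim=1), dim=0)  # over source tokens
        beta_hyp = F.softmax(M.sum(dim=0), dim=0)  # over hypothesis tokens
        h_src, h_hyp = beta_src @ H_src, beta_hyp @ H_hyp
        # Unnormalized score; a softmax over all K nodes gives gamma_l.
        return self.score(torch.cat([h_src * h_hyp, h_src, h_hyp], dim=-1))
```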

Hypothesis Quality Estimation
For the $p$-th token $w^k_p$ in the $k$-th node, the probability $P(y|w^k_p)$ of quality label $y$ is calculated from the verification representation $V^k_p$:
$$P(y|w^k_p) = \mathrm{softmax} \big( \mathrm{Linear}([H^k_p \circ V^k_p; H^k_p; V^k_p]) \big),$$
where $\circ$ is the element-wise multiplication operator and $;$ is the concatenation operator. We average the probabilities $P(y=1|w^k_p)$ of the token-level quality estimation as the hypothesis quality estimation score $f(s, c^k)$ for the pair $\langle s, c^k \rangle$:
$$f(s, c^k) = \frac{1}{n+1} \sum_{p=m+2}^{m+n+2} P(y=1|w^k_p).$$
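A hedged sketch of this head: the concatenated feature layout fed to the classifier is our assumption, and the averaging range follows the indexing above (the hypothesis tokens plus the closing "[SEP]").

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QualityHead(nn.Module):
    """Sketch of the token quality head (Sec. 3.3); the feature layout
    is our assumption."""
    def __init__(self, hidden=768):
        super().__init__()
        self.proj = nn.Linear(3 * hidden, 2)  # labels: incorrect (0) / correct (1)

    def forward(self, H_k, V_k):              # both (seq, hidden)
        feats = torch.cat([H_k * V_k, H_k, V_k], dim=-1)
        return F.softmax(self.proj(feats), dim=-1)  # P(y | w_p^k): (seq, 2)

def hypothesis_score(P, m, n):
    # f(s, c^k): mean P(y=1) over positions m+2 .. m+n+2 (n+1 tokens).
    return P[m + 2 : m + n + 3, 1].mean()
```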

End-to-end Training
We conduct joint training with token-level supervision. Both source labels and hypothesis labels are used, denoting the grammatical quality of source tokens and the GEC accuracy of hypothesis tokens. The cross-entropy loss for the $p$-th token $w^k_p$ in the $k$-th node is calculated using the ground-truth token labels $y^*$:
$$\mathcal{L}^k_p = \mathrm{CrossEntropy} \big( y^*, P(y|w^k_p) \big).$$
Then the training loss of VERNet is calculated:
$$\mathcal{L} = \sum_{k=1}^{K} \sum_{p=1}^{m+n+2} \mathcal{L}^k_p.$$
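A sketch of this joint loss under the formulation above; masking of padded positions is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def vernet_loss(probs, labels):
    """Sketch of the training loss (Sec. 3.4). probs: (K, seq, 2) token
    label distributions for all K nodes; labels: (K, seq) gold 0/1 labels
    for source and hypothesis tokens. Padding masks are omitted."""
    log_p = torch.log(probs.clamp_min(1e-12))
    # Cross entropy against the gold labels y*, summed over nodes and tokens.
    return F.nll_loss(log_p.view(-1, 2), labels.view(-1), reduction="sum")
```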
Experimental Methodology

Basic GEC Model. To generate correction hypotheses, we take one of the state-of-the-art autoregressive GEC systems (Kiyono et al., 2019) as our basic GEC model and keep the same settings. The beam size of the basic model is set to 5 (Kiyono et al., 2019), and all beam search hypotheses are kept in our experiments.
We generate quality estimation labels for tokens in both source and hypothesis sentences with ERRANT (Bryant et al., 2017; Felice et al., 2016); the labels indicate grammatical correctness and GEC accuracy, respectively. As shown in Table 2, ERRANT annotates edit operations (delete, insert, and replace) towards the ground-truth corrections. Based on such annotations, each token is labeled as correct (1) or incorrect (0).
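As a hedged illustration of this labeling step (the exact handling of insertions and subword alignment in the released pipeline may differ), hypothesis tokens that ERRANT would still edit towards the gold reference are labeled incorrect:

```python
# Hedged sketch of token labeling with ERRANT: hypothesis tokens that
# ERRANT would still edit towards the gold reference get label 0
# (incorrect); insertions, which span no hypothesis token, are ignored
# here for simplicity.
import errant

annotator = errant.load("en")
hyp = annotator.parse("Does one who suffered from this disease keep it a secret ?")
ref = annotator.parse("Does anyone who suffers from this disease keep it a secret ?")

labels = [1] * len(hyp)
for edit in annotator.annotate(hyp, ref):
    for i in range(edit.o_start, edit.o_end):  # span still needing an edit
        labels[i] = 0
```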
Evaluation Metrics. We introduce the evaluation metrics in three tasks: token quality estimation, sentence quality estimation, and GEC.
To evaluate model performance on token-level quality estimation, we employ the same evaluation metrics as previous GED models (Rei, 2017; Rei and Søgaard, 2019; Yannakoudakis et al., 2017), including Precision, Recall, and F0.5. F0.5 is our primary evaluation metric.

For the evaluation of sentence-level quality estimation, we employ the same evaluation metrics as the previous quality estimation model (Chollampatt and Ng, 2018b), covering two evaluation scenarios: (1) GEC evaluation metrics for the hypothesis reranked to top-1, and (2) the Pearson Correlation Coefficient (PCC) between reranking scores and gold scores (F0.5) over all hypotheses.
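For reference, F0.5 is the F-beta score with beta = 0.5 computed from Precision and Recall, weighting precision twice as heavily as recall; a minimal implementation from edit-level counts:

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F_beta from true/false positive and false negative counts;
    beta=0.5 weights precision more than recall, the GEC convention."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 or r == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)
```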
To evaluate GEC performance, we adopt GLEU (Napoles et al., 2015) on the JFLEG dataset. The official tool of the BEA19 shared task, ERRANT (Bryant et al., 2019), is used to calculate Precision, Recall, and F0.5 scores for the other datasets. For the CoNLL-2014 dataset, the M² evaluation (Dahlmeier and Ng, 2012) is also adopted as our main evaluation.
Baselines. BERT-fuse (GED) (Kaneko et al., 2020) is compared in our experiments; it trains BERT with the GED task and fuses BERT representations into the Transformer. For quality estimation, we consider two groups of baseline models, and more details of these models can be found in Appendix A.1.
(1) BERT-based language models. We employ three BERT-based language models to estimate hypothesis quality. BERT-LM (Chollampatt et al., 2019) measures hypothesis quality with the perplexity of the language model. BERT-GQE is trained with annotated GEC data and estimates whether the hypothesis contains grammatical errors. We also build BERT-GED (SRC), which predicts token-level grammaticality labels, inspired by GED models (Yannakoudakis et al., 2017). BERT shows significant improvement over LSTM-based models on the GED task (Appendix A.2), so LSTM-based models are not included in our experiments.
(2) GEC accuracy estimation models. These models further consider source-hypothesis interactions to evaluate GEC accuracy. We take NQE (Chollampatt and Ng, 2018b) as a strong baseline. NQE employs an encoder-decoder (predictor) architecture to encode source-hypothesis pairs and predicts the F0.5 score with an estimator architecture. All their proposed architectures, NQE (CC), NQE (RC), NQE (CR), and NQE (RR), are compared. In NQE (XY), X indicates the predictor architecture and Y the estimator architecture; X and Y can be recurrent (R) or convolutional (C) neural networks. In addition, we employ BERT to encode source-hypothesis pairs and then predict the F0.5 score, which implements the BERT-QE model. We also propose two baselines, BERT-GED (HYP) and BERT-GED (JOINT), which leverage BERT to encode source-hypothesis pairs and are supervised with token-level quality estimation labels. BERT-GED (HYP) is trained with supervision from hypotheses, and BERT-GED (JOINT) is supervised with labels from both source and hypothesis sentences.
Implementation Details. In all experiments, we use the base versions of BERT (Devlin et al., 2019) and ELECTRA (Clark et al., 2020). BERT is a widely used pre-trained language model trained with the masked language model task. ELECTRA is trained with the replaced token detection task, predicting whether a token is original or was replaced by a BERT-based generator during pre-training; it is thus a discriminator-style pre-trained language model whose pre-training task resembles GED. We regard BERT as our main text encoder and leverage ELECTRA to evaluate the generalization ability of our model. Both BERT and ELECTRA inherit huggingface's PyTorch implementation (Wolf et al., 2020). Adam (Kingma and Ba, 2015) is used for parameter optimization. We set the maximum sentence length to 120 for source and hypothesis sentences, the learning rate to 5e-5, the batch size to 8, and the number of gradient accumulation steps to 4 during training.
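These hyperparameters correspond to a training loop like the following sketch, where `model` and `batches` are assumed to exist and `model(**batch)` is assumed to return the loss:

```python
# Sketch of the optimization setup described above; effective batch
# size is 8 x 4 = 32 with gradient accumulation.
from torch.optim import Adam

optimizer = Adam(model.parameters(), lr=5e-5)
for step, batch in enumerate(batches):   # batch size 8
    loss = model(**batch) / 4            # scale for 4 accumulation steps
    loss.backward()
    if (step + 1) % 4 == 0:
        optimizer.step()
        optimizer.zero_grad()
```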
For hypothesis reranking, we leverage the learning-to-rank method Coordinate Ascent (CA) (Metzler and Croft, 2007) to combine the ranking features and the basic GEC score into the final ranking score. We treat the hypotheses with the highest F0.5 score as positive instances and the others as negative ones. The Coordinate Ascent method is implemented with RankLib.
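A hedged sketch of the data preparation for this step: features are written in the LETOR format that RankLib consumes, with file names and feature values purely illustrative; in RankLib's CLI, `-ranker 4` selects Coordinate Ascent.

```python
# Hedged sketch: write reranking features in the LETOR format consumed
# by RankLib; feature values and file names are illustrative.
def write_letor(path, groups):
    # groups: list of (qid, [(label, [feat_1, feat_2, ...]), ...]),
    # one qid per source sentence, one row per hypothesis.
    with open(path, "w") as f:
        for qid, candidates in groups:
            for label, feats in candidates:
                feat_str = " ".join(f"{i + 1}:{v}" for i, v in enumerate(feats))
                f.write(f"{label} qid:{qid} {feat_str}\n")

# Label 1 marks the hypothesis with the highest F0.5 in its beam; the
# two features here stand in for the basic GEC score and the VERNet score.
write_letor("train.letor",
            [(1, [(1, [-0.42, 0.91]), (0, [-0.40, 0.77])])])
# Training then uses RankLib's Coordinate Ascent, e.g.:
#   java -jar RankLib.jar -train train.letor -ranker 4 -save ca.model
```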

Evaluation Results
We conduct experiments to study the performance of VERNet from three aspects: token-level quality estimation, sentence-level quality estimation, and VERNet's effectiveness in GEC models. We then present a case study to qualitatively analyze the effectiveness of the two proposed attention mechanisms in VERNet.

Performance of Token Level Quality Estimation
We first evaluate VERNet's effectiveness on token-level quality estimation. BERT-GED (SRC) is the previous state-of-the-art GED model. Two additional variants of BERT-GED, HYP and JOINT, are built as baselines by considering the first-ranked GEC hypothesis in beam search decoding. As shown in Table 3, two scenarios, source and hypothesis, are used to evaluate model performance. The source scenario evaluates the ability of grammaticality quality estimation, the same as GED models (Rei and Søgaard, 2019). The hypothesis scenario tests quality estimation ability on GEC accuracy.
For the source scenario, BERT-GED (JOINT) outperforms BERT-GED (SRC), illustrating that GEC results can help estimate the grammaticality of source sentences. For the hypothesis scenario, BERT-GED (JOINT) shows better performance than BERT-GED (HYP), benefiting from the supervision of source sentences. For both scenarios, BERT-VERNet shows further improvement over BERT-GED (JOINT). Such improvements demonstrate that the diverse GEC evidence from multiple hypotheses benefits token-level quality estimation.
Moreover, the detection-style pre-trained model ELECTRA (Clark et al., 2020) is also used as our sentence encoder. VERNet is boosted considerably on all scenarios and datasets, which illustrates both the strong ability of ELECTRA in token-level quality estimation and the generalization ability of VERNet.

Performance of Sentence Level Quality Estimation
In this part, we evaluate VERNet's performance on sentence-level quality estimation by reranking hypotheses from beam search decoding. The baselines can be divided into two groups: language model based and GEC accuracy based quality estimation models. The former focus on grammaticality and fluency, including BERT-LM, BERT-GQE, and BERT-GED (SRC). The latter focus on estimating GEC accuracy, including NQE, BERT-QE, and BERT-GED (HYP)/(JOINT).
As shown in Table 4, we find that language model based quality estimation prefers higher recall but lower precision, which leads to more redundant corrections. Considering only grammaticality is insufficient, since such unnecessary correction suggestions may mislead users. By contrast, GEC accuracy based quality estimation models achieve much better Precision and F0.5, providing more precise feedback for users. Furthermore, BERT-GED (HYP) outperforms BERT-QE, showing that token-level supervision provides finer-grained signals that help the model distinguish subtle differences among hypotheses. VERNet outperforms all baselines, which supports our claim that multi-hypotheses from beam search provide valuable GEC evidence and enable more effective quality estimation of generated GEC hypotheses.

VERNet's Effectiveness in GEC Models
This part explores the effectiveness of VERNet in improving GEC models. We build VERNet†, which aggregates the scores from the basic GEC model and VERNet for hypothesis reranking.
As shown in Table 5, two baseline models are compared in our experiments: Basic GEC (Kiyono et al., 2019) and BERT-fuse (GED) (Kaneko et al., 2020). Compared to BERT-fuse (GED), BERT-VERNet† achieves comparable performance on CoNLL-2014 and larger improvements on BEA19. This demonstrates that reranking hypotheses with VERNet provides an effective way to improve the basic GEC model without changing the Transformer architecture. R2L models incorporate four right-to-left Transformer models to improve GEC performance. However, these R2L models are trained with unpublished data and are not supplied in the open-source codes, so their results are hard to reproduce. ELECTRA-VERNet† incorporates only one model and achieves comparable performance on BEA19 and JFLEG.

Figure 4 presents VERNet†'s performance on different grammatical error types. We plot the F0.5 scores of both the basic GEC model and VERNet† on BEA19. VERNet† achieves improvement on most types and performs significantly better on word morphology and word usage errors, such as Noun Inflection (NOUN:INFL) and Pronoun (PRON). Such results illustrate that VERNet† is able to leverage clues learned from multi-hypotheses to verify GEC quality. However, we also find that VERNet† degrades GEC performance on a few error types, e.g., Contraction (CONTR). Annotation biases may cause this decrease on CONTR errors: for example, "n't" and "not" are both grammatical, but annotators often produce different corrections under different GEC standards.

Case Study
We select one case from CoNLL-2014 and visualize the node interaction and node selection attention weights to study what VERNet learns from the multi-hypotheses of beam search, as shown in Figure 5.
Given a source sentence, "Do one who suffered from this disease keep it a secret of infrom their relatives ?", and its five hypotheses from the Basic GEC Model, we plot the node interaction attention weights towards the word "suffers" in the hypothesis of node 2, which is assigned more higher score by BERT-VERNet. The word usage "suffers" is The node selection attention assigned to each hypothesis is annotated with dark orange. The node interaction attention towards the edited token "suffers" in the second node is also plotted. Darker red indicates higher attention weights. more appropriate than "suffered" according to the context.
The node interaction attention accurately picks up the associated tokens "Does" from nodes 1, 3, and 4, and "suffers" from node 5. "Does" and "suffers" indicate the present tense and provide sufficient evidence to verify the quality of "suffers" in node 2. For node selection attention, node 2 receives more attention than the other nodes, as its hypothesis is more appropriate than the other hypotheses. This demonstrates that node selection attention is effective at selecting high-quality corrections through source-hypothesis interactions.
The attention patterns are intuitive and effective, which further demonstrates VERNet's ability to well model the interactions of multi-hypotheses for better quality estimation.

Conclusion and Future Work
This paper presents VERNet for GEC quality estimation with multi-hypotheses. VERNet models the interactions of multiple hypotheses by building a reasoning graph, and then extracts clues with two kinds of attention, node selection attention and node interaction attention, which summarize and aggregate GEC evidence from multi-hypotheses to verify token quality. Experiments on four datasets show that VERNet achieves state-of-the-art GED and quality estimation performance, and improves a published state-of-the-art GEC system. In the future, we will explore the impact of different kinds of hypotheses used in VERNet.

A.1 Model Details of Sentence Quality Estimation Score Calculation
This part describes the details of the sentence score calculation for the BERT-based quality estimation models. Given a source sentence $s$ with $m$ tokens and the $k$-th hypothesis $c^k$ with $n$ tokens, we can get the representation $H^k$ of the $k$-th $\langle$source, hypothesis$\rangle$ pair through BERT:
$$H^k = \mathrm{BERT}(\text{[CLS]} \, s \, \text{[SEP]} \, c^k \, \text{[SEP]}),$$
or only the representation $\hat{H}^k$ of the $k$-th hypothesis through BERT:
$$\hat{H}^k = \mathrm{BERT}(\text{[CLS]} \, c^k \, \text{[SEP]}).$$
The "[CLS]" representations are $H^k_0$ and $\hat{H}^k_0$, respectively.

BERT-LM. We mask the tokens in the $k$-th hypothesis sentence $c^k$ one at a time and calculate the perplexity of the $k$-th hypothesis sentence:
$$\mathrm{PPL}(c^k) = \exp \Big( - \frac{1}{n} \sum_{i=1}^{n} \log P(w^k_i \mid c^k_{\setminus i}) \Big),$$
where $c^k_{\setminus i}$ denotes the hypothesis with its $i$-th token masked; hypotheses with lower perplexity are ranked higher.

BERT-GQE. BERT-GQE uses the "[CLS]" representation $\hat{H}^k_0$ of the $k$-th hypothesis to estimate sentence quality with the probability $P(y_s|c^k)$:
$$P(y_s|c^k) = \mathrm{softmax}(W \cdot \hat{H}^k_0),$$
where $W$ is a parameter and the label $y_s$ is categorized into two groups: correct ($y_s = 1$) and incorrect ($y_s = 0$). Then the sentence-level quality estimation score of hypothesis $c^k$ is calculated:
$$f_{GQE}(c^k) = P(y_s = 1 | c^k).$$
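A sketch of the BERT-LM scoring described above, masking one token at a time; batching and subword handling are omitted, and the checkpoint is illustrative.

```python
# Sketch of BERT-LM scoring: mask each token in turn, accumulate its
# masked-LM log-probability, and exponentiate the average negative
# log-likelihood into a pseudo-perplexity.
import math
import torch
from transformers import BertForMaskedLM, BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")
mlm = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids[0]
    nll = 0.0
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tok.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        nll -= torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return math.exp(nll / (len(ids) - 2))
```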
BERT-QE. BERT-QE uses the "[CLS]" representation $H^k_0$ of the $k$-th $\langle$source, hypothesis$\rangle$ pair to estimate the quality of the GEC hypothesis:
$$f_{QE}(s, c^k) = \sigma(W \cdot H^k_0),$$
where $W$ is a parameter. The quality estimation score $f_{QE}(s, c^k)$ is trained to approximate the F0.5 score of the $k$-th hypothesis $c^k$.

BERT-GED. Taking BERT-GED (HYP) as an example, it uses the hypothesis representation $H^k_{m+2:m+n+2}$ of the $k$-th $\langle$source, hypothesis$\rangle$ pair to estimate the quality of the GEC hypothesis. Note that the "[SEP]" token is also used in BERT-GED to denote the end of the sentence. We calculate the probability of the token quality estimation label $y$ for the $i$-th token $w^k_i$ in the $k$-th $\langle$source, hypothesis$\rangle$ pair:
$$P(y|w^k_i) = \mathrm{softmax}(W \cdot H^k_i),$$
where $W$ is a parameter. The label $y$ is categorized into two groups: correct ($y = 1$) and incorrect ($y = 0$).
To estimate the quality of hypotheses, we average all token quality estimation probabilities $P(y = 1|w^k_i)$ as the sentence quality estimation score $f(s, c^k)$ for the $k$-th hypothesis $c^k$:
$$f_{GED}(s, c^k) = \frac{1}{n+1} \sum_{i=m+2}^{m+n+2} P(y = 1 | w^k_i).$$

A.2 Grammatical Error Detection Performance with LSTM
In this experiment, we evaluate the effectiveness of BERT and LSTM on the grammatical error detection (GED) task. We keep the same settings as previous work (Rei and Søgaard, 2019). The FCE dataset is used for evaluation, with Precision, Recall, and F0.5 as evaluation metrics. As shown in Table 6, three models from Rei and Søgaard (2019), LSTM, LSTM-ATTN, and LSTM-JOINT, are compared with the BERT model. The LSTM model leverages an LSTM encoder and adds a language modeling objective during training (Rei, 2017). LSTM-ATTN and LSTM-JOINT further add attention constraints and sentence-level supervision to achieve better performance (Rei and Søgaard, 2019). The BERT model is the same as our BERT-GED (SRC).
The BERT-based model shows significant improvement over the LSTM-based models. We therefore do not consider LSTM-based GED models in the GEC quality estimation experiments.