Fast End-to-end Coreference Resolution for Korean

Recently, end-to-end neural network-based approaches have shown significant improvements over traditional pipeline-based models in English coreference resolution. However, such advancements have come at the cost of computational complexity, and recent works have not focused on tackling this problem. Hence, in this paper, we propose BERT-SRU-based Pointer Networks that leverage the linguistic properties of head-final languages. Applying this model to Korean coreference resolution, we significantly reduce the coreference linking search space. Combining this with Ensemble Knowledge Distillation, we maintain state-of-the-art performance of 66.9 CoNLL F1 on the ETRI test set while achieving a 2x speedup (30 doc/sec) in document processing time.


Introduction
Coreference resolution is one of the fundamental sub-tasks for Machine Reading Comprehension and Dialogue Systems; it groups mentions of the same entity in a given sentence or document (Soon et al., 2001; Raghunathan et al., 2010; Ng, 2010; Lee et al., 2013). Recently, for English coreference resolution, span-based end-to-end trained models such as e2e-coref, c2f-coref (Lee et al., 2018), and BERT-coref have been shown to outperform previous rule-based or mention-pairing approaches.
However, such approaches suffer from computational complexity, effectively being O(n^4), where n is the length of the input document. Furthermore, as coreference resolution is an important and complicated task, most research efforts have focused on solving the problem through better modeling, such as higher-order coreference resolution (Lee et al., 2018). Inevitably, these approaches lead to more complicated models that are more computation-heavy, yet there are few studies on solving this complexity issue. Hence, this paper aims to cope with this problem by infusing relevant linguistic features into the model.
One of the underlying reasons for such high computational complexity is the creation of O(n^2) spans caused by the mixed head directionality of English, as shown in Figure 1. This makes it hard to locate the heads in the mentions because the head location is not deterministic. On the other hand, having deterministic head locations is a very desirable linguistic trait for solving the aforementioned computational complexity issue: it effectively reduces the search space for coreference linking, as we can use only the heads of the mentions.
Korean is not only a new domain for end-to-end coreference resolution but is also considered a strongly head-final language (Kwon et al., 2006), which motivates us to focus on Korean. In this paper, we present the first end-to-end model for Korean coreference resolution. Our model leverages these head-final properties using Pointer Networks (Vinyals et al., 2015) and achieves performance comparable to that of state-of-the-art models with a 2x speedup. Our contributions can be summarized as follows:
• The first end-to-end coreference resolution model for Korean
• A 2x speedup over state-of-the-art models
• State-of-the-art performance with an ensemble, while maintaining the 2x speedup via Knowledge Distillation

Background
Coreference resolution is fundamentally about linking mention pairs (which are often noun phrases). Essentially, this means finding heads of noun phrases that refer to the same entity, but the head locations within mentions are unknown. While previous approaches (Wiseman et al., 2016; Clark and Manning, 2016a,b) relied on several hand-engineered features, including head-related ones, recent end-to-end methods attempt to directly model the mention distribution using span-based neural networks.

Span-based Coreference Resolution
To elaborate, Lee et al. (2017, 2018) formulated the task of end-to-end coreference resolution for English as a set of decisions over every possible span in the document. The input is a document consisting of n words, which contains S = n(n+1)/2 = O(n^2) possible spans. For each span, the task is to assign an antecedent that refers to the same entity. Hence, as all of these spans have to be ranked against each other, the final coreference resolution search space is S(S+1)/2 = O(n^4). Finally, the entity resolutions are recovered by grouping all spans that are connected.
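To make these counts concrete, here is a minimal sketch (our illustration, not from the original paper) that computes the span and span-pair search space for a document of n words:

```python
def span_search_space(n: int):
    """Count candidate spans and span pairs for a document of n words."""
    spans = n * (n + 1) // 2          # S = n(n+1)/2 = O(n^2) possible spans
    pairs = spans * (spans + 1) // 2  # S(S+1)/2 = O(n^4) ranking decisions
    return spans, pairs

# A modest 500-word document already yields ~125k candidate spans and
# ~7.8 billion span pairs to rank against each other.
print(span_search_space(500))  # (125250, 7843843875)
```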

Head-final Coreference Resolution
In this section, we introduce the concept of our proposed head-final coreference resolution. Head-final languages are left-branching: the head of a mention phrase is at the end of the phrase (Dryer, 2009). This allows us to easily extract accurate coreference links between nouns across mentions and use them directly for training. In English, by contrast, it is impossible to know which nouns in the mentions are supposed to be linked together because the head locations are non-deterministic. Hence, using this head-final property, we can effectively reduce a search over span candidates to a search over head candidates, which are simply the nouns. In short, this yields a coreference resolution search space of O(n^2).
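As a toy illustration (ours; the sentence and tags are only examples), filtering a POS-tagged Korean sentence down to its nouns shows how the candidate set shrinks:

```python
# POS-tagged morphemes for an illustrative sentence; in a head-final
# language, the nouns are the head candidates.
tagged = [("로마", "NNG"), ("의", "JKG"), ("신", "NNG"), ("바카스", "NNP"),
          ("는", "JX"), ("술", "NNG"), ("의", "JKG"), ("신", "NNG"),
          ("이", "VCP"), ("다", "EF")]

heads = [i for i, (_, tag) in enumerate(tagged) if tag.startswith("NN")]
print(heads)   # [0, 2, 3, 5, 7]: 5 head candidates out of 10 tokens

# 5 heads -> 15 head pairs, versus 55 spans -> 1540 span pairs for the
# same sentence under span-based search.
print(len(heads) * (len(heads) + 1) // 2)
```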

BERT-SRU Pointer Networks
We propose a novel model, BERT-SRU Pointer Networks, suited to head-final coreference resolution. This model combines bidirectional encoder representations from transformers (BERT) (Devlin et al., 2019) with bidirectional simple recurrent units (SRUs) (Lei et al., 2017) and Pointer Networks (Vinyals et al., 2015), as shown in Figure 2. Initially, the encoder (BERT) receives morphologically analyzed texts along with their POS-tags as inputs. The decoder then extracts the hidden states corresponding to the head candidates (which are all nouns) and uses them as its inputs. After that, the gated self-attention layer in the decoder models head information, and the decoder outputs the positions corresponding to the input using the pointer networks. We use deep biaffine attention (Dozat and Manning, 2016) as the attention score of the pointer networks, and the model performs both mention detection and coreference resolution.

Model Inputs
To elaborate on the BERT encoder layer, we use a BERT model pre-trained on a morphologically analyzed large-scale Korean corpus and apply byte pair encoding (BPE) (Sennrich et al., 2016) to the input morpheme sequence. When using BPE, we add a [CLS] and a [SEP] token to the beginning and end of the input sequence, and we distinguish morphemes at the subword level by attaching ' ' to the last syllable of each morpheme. We also use features appropriate for Korean coreference resolution: morpheme boundary, word boundary, dependency parsing, named entity recognition (NER), and candidate head distance.

Input Text Preprocessing
The following example shows the use of morphological analysis and BPE for a given raw text. In the example below, the entity is 바카스 (Bacchus).

Figure 2: Our fast head-final coreference resolution model for Korean. We use BERT to obtain embeddings corresponding to the input tokens. Along with the five features, the embeddings are used as SRU encoder inputs, and the encoder outputs are fed to the SRU-based decoder through self-attention. Finally, the decoder outputs are transformed to predict 1) the mention start boundary and 2) the coreference resolution. All parameters are trained in an end-to-end manner. After training, the model can be further extended with ensemble knowledge distillation, achieving performance comparable to that of the state of the art with 2x inference speed.

Given an input text, morphological analysis is performed using a part-of-speech (POS) tagger and BPE is applied. We use the POS-tag together with the morphological analysis results to specify the POS information of each morpheme. After applying BPE, '로마/NNG' (Rome/NNG) and '바카스/NNP' (Bacchus/NNP) are divided into '로' (Ro), '마/NNG ' (me/NNG ) and '바' (Ba), '카스/NNP ' (cchus/NNP ) according to BPE dictionary matching.
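A rough sketch of this preprocessing step (our simplification: the greedy one-character split and the tiny vocabulary below are illustrative stand-ins for real BPE dictionary matching):

```python
def to_bert_input(morphs, bpe_vocab):
    """Morpheme/POS pairs -> subword sequence framed with [CLS]/[SEP]."""
    tokens = ["[CLS]"]
    for morph, tag in morphs:
        unit = f"{morph}/{tag}"
        if unit in bpe_vocab:   # whole morpheme is in the BPE dictionary
            tokens.append(unit)
        else:                   # split, e.g. '바카스/NNP' -> '바', '카스/NNP'
            tokens.extend([unit[0], unit[1:]])
    tokens.append("[SEP]")
    return tokens

vocab = {"의/JKG", "신/NNG", "는/JX"}
print(to_bert_input([("로마", "NNG"), ("의", "JKG"), ("신", "NNG"),
                     ("바카스", "NNP"), ("는", "JX")], vocab))
# ['[CLS]', '로', '마/NNG', '의/JKG', '신/NNG', '바', '카스/NNP', '는/JX', '[SEP]']
```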

Additional Input Features
In this study, we use five features for Korean coreference resolution: word boundary, morpheme (morp) boundary, dependency parsing, NER, and head distance. The description of each feature is as follows:
Word boundary: This reflects the boundary of words for coreference resolution in word units. The starting token of a word is tagged B, and the following tokens are tagged I.
Morpheme boundary: This reflects the morpheme boundaries from the morphological analysis results. Morp-B marks the beginning token, and morp-I marks the inside tokens of a morpheme.
Dependency parsing: We use the dependency parsing label as a feature to reflect the structural and semantic information of the sentences.
NER: We use the type information of each entity appearing in the document as a feature.
Head distance: To use distance information between extracted candidate nouns, we measure the distance from the immediately preceding noun and map it into the buckets [1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63, 64+] (Clark and Manning, 2016b). A sketch of this bucketing follows below.
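The head-distance bucketing can be read as follows (a sketch under our interpretation of the bucket list above):

```python
import bisect

# Lower edges of the paper's buckets: [1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63, 64+]
BUCKET_STARTS = [1, 2, 3, 4, 5, 8, 16, 32, 64]

def distance_bucket(prev_head_pos: int, head_pos: int) -> int:
    """Map the distance to the immediately preceding noun to a bucket id."""
    dist = head_pos - prev_head_pos
    return bisect.bisect_right(BUCKET_STARTS, dist) - 1

print(distance_bucket(3, 4))    # distance 1   -> bucket 0
print(distance_bucket(0, 10))   # distance 10  -> bucket 5 (8-15)
print(distance_bucket(0, 100))  # distance 100 -> bucket 8 (64+)
```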

Encoder
As shown in Figure 2, each token input to the encoder receives a hidden state b_i = BERT(x_i) from the pre-trained BERT model. The hidden state for each input feature is generated as h_i^f = emb_feat(f_i). We concatenate the BERT hidden state and the feature hidden states to form e_i = [b_i; h_i^f]. Then, according to Equation 1, the encoder feeds e_i into a bidirectional SRU (biSRU) to generate the hidden state r_i:

r_i = biSRU(e_i)    (1)
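A minimal sketch of the encoder wiring (ours, not the authors' code): we use a HuggingFace-style BERT, nn.GRU as a stand-in for the bidirectional SRU, and an assumed per-feature embedding size:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """BERT states concatenated with feature embeddings, fed to a biRNN."""

    def __init__(self, bert, num_feats=5, feat_vocab=64, feat_dim=320, hidden=800):
        super().__init__()
        self.bert = bert
        self.feat_emb = nn.ModuleList(
            nn.Embedding(feat_vocab, feat_dim) for _ in range(num_feats))
        in_dim = bert.config.hidden_size + num_feats * feat_dim
        # nn.GRU stands in for the paper's 2-layer bidirectional SRU.
        self.rnn = nn.GRU(in_dim, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)

    def forward(self, input_ids, feats):                    # feats: (B, T, num_feats)
        b = self.bert(input_ids).last_hidden_state          # b_i = BERT(x_i)
        h_f = [emb(feats[..., k]) for k, emb in enumerate(self.feat_emb)]
        e = torch.cat([b] + h_f, dim=-1)                    # e_i = [b_i; h_i^f]
        r, _ = self.rnn(e)                                  # r_i = biSRU(e_i), Eq. 1
        return r
```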

Decoder
In Figure 2, the input of the decoder is r_t^h = copy(r_{y_t}), which extracts the hidden state corresponding to the head position y_t from the encoded hidden states r_i. The decoder applies biSRU(·), as shown in Equation 2, to model the context information between heads:

h_t^h = biSRU(r_t^h)    (2)
Self-attention Module: To model the scores between similar heads, we apply a gated self-matching layer (Wang et al., 2017):

c_t = Σ_k softmax(score(h_t^h, h_k^h)) · h_k^h
g_t = σ(W_g [h_t^h; c_t]) ⊙ [h_t^h; c_t]
h_t = biSRU(g_t)

where c_t is the context vector over all heads and g_t is the hidden state generated by the additional gate. The gate concatenates the hidden state h_t^h and the context vector c_t, and applies a sigmoid gate that scales the significant values of the two vectors up and the others down. A biSRU(·) then models the gated g_t and generates h_t.
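Our reconstruction of the gated self-matching computation, as a sketch (the scoring function and parameter shapes are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSelfAttention(nn.Module):
    """Gated self-matching over head states (Wang et al., 2017), sketched."""

    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, dim, bias=False)   # bilinear-style scoring
        self.gate = nn.Linear(2 * dim, 2 * dim, bias=False)

    def forward(self, h):                              # h: (B, heads, dim)
        att = torch.matmul(self.score(h), h.transpose(1, 2))
        c = torch.matmul(F.softmax(att, dim=-1), h)    # c_t: context over heads
        hc = torch.cat([h, c], dim=-1)                 # [h_t^h; c_t]
        g = torch.sigmoid(self.gate(hc)) * hc          # g_t: sigmoid-gated state
        return g                                       # fed to a biSRU -> h_t
```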
Deep Biaffine Score: To output the mention start boundary and the coreference resolution, we apply an elu (Clevert et al., 2015) to the last hidden state h_t of the decoder, as in Dozat and Manning (2016), and create the hidden states h_t^{men_src}, h_t^{coref_src}, and h_t^{coref_tgt}. The hidden state used for the output of the mention boundary, h_i^{men_tgt}, is based on the output hidden state r_i of the encoder. We apply the deep biaffine score when performing attention to output the mention boundary and the coreference resolution:

s(t, i) = h_t^{src⊤} U h_i^{tgt} + w^⊤ h_i^{tgt}

where the (src, tgt) pair is (men_src, men_tgt) for the mention boundary and (coref_src, coref_tgt) for the coreference resolution.
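A hedged sketch of the deep biaffine scorer in the Dozat and Manning (2016) form (the exact parameterization in the paper may differ; the 50-dimensional biaffine hidden size follows Appendix A.4):

```python
import torch
import torch.nn as nn

class DeepBiaffine(nn.Module):
    """Score s[t, i] = src_t^T U tgt_i + w^T tgt_i over candidate positions."""

    def __init__(self, dim, hidden=50):
        super().__init__()
        self.src_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ELU())
        self.tgt_mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ELU())
        self.U = nn.Parameter(torch.zeros(hidden, hidden))
        self.w = nn.Parameter(torch.zeros(hidden))

    def forward(self, src, tgt):          # src: (B, T, dim), tgt: (B, N, dim)
        s = self.src_mlp(src)             # e.g. h_t^{coref_src}
        t = self.tgt_mlp(tgt)             # e.g. h_i^{coref_tgt}
        bilinear = torch.einsum("bth,hk,bnk->btn", s, self.U, t)
        bias = torch.einsum("bnh,h->bn", t, self.w).unsqueeze(1)
        return bilinear + bias            # pointer scores over candidates
```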

Model Extension: Ensemble Knowledge Distillation
An ensemble combines the output results of several single models into a single result. When performing the ensemble, we average the softmax probability distributions of the single models. Meanwhile, knowledge distillation is a model compression technique for reducing model size (Hinton et al., 2015). A small student model is trained to match the output distribution of a large teacher model using a loss that compares the distributions, such as the Kullback-Leibler divergence (KLD, Kullback and Leibler (1951)). As shown in Figure 3, we use the ensemble model as the teacher distribution and distill its knowledge into a single student model. The knowledge distillation training loss is the KLD:

L_kd = KLD(p_T(y|x) || p_S(y|x))    (6)

In Equation 6, p_T(y|x) is the output distribution of the teacher model, and p_S(y|x) is the output distribution of the student model. Equation 7 computes the final loss by combining the cross-entropy losses (Nasr et al., 2002) and the knowledge distillation losses of mention detection and coreference resolution:

L = L_ce + β L_kd,  where  L_ce = α L_ce^coref + (1-α) L_ce^men  and  L_kd = γ L_kd^coref + (1-γ) L_kd^men    (7)
The cross-entropy losses for mention detection and coreference resolution are L_ce^men and L_ce^coref, and the knowledge distillation losses are L_kd^men and L_kd^coref, respectively. The weight α is a hyper-parameter that determines the loss ratio between coreference resolution and mention boundary detection, β is the weight of the knowledge distillation loss, and γ determines the loss ratio between coreference resolution and mention boundary detection in the teacher model. α, β, and γ are all optimized, with values of 0.9, 0.2, and 0.9, respectively.
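Under our reading of Equations 6 and 7, the combined loss can be sketched as follows (function names and the exact mixing are our assumptions; the temperature of 5 follows Appendix A.6):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_probs, T=5.0):
    """KLD between teacher (ensemble) and student distributions (Eq. 6)."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, teacher_probs, reduction="batchmean")

def total_loss(ce_men, ce_coref, kd_men, kd_coref,
               alpha=0.9, beta=0.2, gamma=0.9):
    """Eq. 7 sketch: L = L_ce + beta * L_kd, with alpha/gamma mixing the
    coreference and mention terms of the student and teacher losses."""
    l_ce = alpha * ce_coref + (1 - alpha) * ce_men
    l_kd = gamma * kd_coref + (1 - gamma) * kd_men
    return l_ce + beta * l_kd
```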

Experiments
Dataset and Measures: We use the Korean coreference resolution data (Park et al., 2016) from the ETRI quiz domain of AIOpen 1. Table 1 summarizes the dataset statistics. We report CoNLL F1, the average of MUC, B^3, and CEAF_φ4, computed with the official CoNLL-2012 evaluation script. However, we evaluate coreference resolution using only the heads of mentions, as this is more suitable for Korean coreference resolution: the positional weighting in the script is tailored to English.
Pre-training Korean BERT: BERT consists of a bidirectional Transformer encoder with several layers. For pre-training BERT, we reuse the hyper-parameters from Devlin et al. (2019). We used Wikipedia and news data (23.5 GB in total) collected from the web. After performing morphological analysis on all input words, tokenization was done on the subwords using BPE. The dictionary consists of 30,349 BPE tokens.
Implementation: The hyper-parameters of the BERT-SRU-based Pointer Networks model are as follows. We fine-tune all models on the ETRI Korean data for 70 epochs with a batch size of six per GPU. The model was trained on two GEFORCE GTX 1080 Ti GPUs. The hidden layer dimension and feature dimension of the SRU were optimized to 800 and 1,600, respectively, and the SRU hidden layer stack was optimized to two. We set the dropout to 0.1. We train with Adam (Kingma and Ba, 2014) with a weight decay of 1 × 10^-2 and a learning rate of 5 × 10^-5 under a linear learning rate schedule. The maximum input sequence length was limited to 430 because the longest input sequence in the test set was 428. We used the ETRI language analyzer to obtain POS-tagging, NER, and dependency parsing features. A sketch of this setup is shown below.
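The stated training setup translates roughly into the following sketch (AdamW is our stand-in for Adam with weight decay; the placeholder model and step count are illustrative):

```python
import torch
from transformers import get_linear_schedule_with_warmup

MAX_SEQ_LEN = 430   # longest test-set sequence is 428 tokens
BATCH_SIZE = 6      # per GPU
EPOCHS = 70
DROPOUT = 0.1

model = torch.nn.Linear(8, 8)        # placeholder for the full network
num_training_steps = EPOCHS * 1000   # illustrative steps-per-epoch

# AdamW approximates "Adam with weight decay 1e-2" from the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=1e-2)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_training_steps)
```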
Head Candidates: In general, the target outputs of pointer networks align with the decoder inputs. For our head-final coreference resolution, we set the decoder inputs to the list of head candidates. These head candidates are all nouns of the source document, and they are extracted using the POS-tags. By doing so, we effectively reduce the computational complexity to O(n^2), as the search for coreference links is done only between the head candidates.
Attention Masking: In coreference resolution, the antecedent at position i comes before the head at position j, i.e., i ≤ j. Similarly, in mention detection, the beginning boundary of a mention always appears before the head. Accordingly, when calculating the attention score, we apply attention masking to prevent attention from being computed for elements later than the j-th position.
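A minimal sketch of this masking (ours; the tensor shapes are illustrative):

```python
import torch

def mask_future(scores, head_pos, cand_pos):
    """Mask attention scores for candidates appearing after each head.

    scores:   (heads, candidates) raw attention scores
    head_pos: (heads,) token position j of each head
    cand_pos: (candidates,) token position i of each candidate
    Keeps only candidates with i <= j, since antecedents and mention
    start boundaries never follow the head."""
    allowed = cand_pos.unsqueeze(0) <= head_pos.unsqueeze(1)
    return scores.masked_fill(~allowed, float("-inf"))

scores = torch.zeros(2, 3)
print(mask_future(scores, torch.tensor([1, 5]), torch.tensor([0, 2, 4])))
# The head at j=1 may only attend to the candidate at i=0; j=5 sees all.
```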

Results
In this section, we present our experimental results for Korean coreference resolution. We denote the BERT-SRU-based ptr-net as our model for head-based coreference resolution. The performance of the models is measured and compared using the CoNLL average F1-score. Our baselines are the span-based models (Lee et al., 2017, 2018) adapted with Korean word vector representations (Lee et al., 2014; Park et al., 2019b); they are denoted e2e-coref, c2f-coref, and BERT-coref, respectively. We extend the original TensorFlow implementations of e2e-coref, c2f-coref 2, and BERT-coref 3 for Korean coreference resolution. In the results table, the third column (Doc/sec) is the number of documents processed per second, and the final column shows the time complexity of each model.

Coreference Resolution
The e2e-coref model shows an average F1 of 59.4, and c2f-coref (Lee et al., 2018), which uses second-order span representations, achieves a slightly higher 60.2 F1 for head-based Korean coreference resolution. Our proposed model achieves 66.2 CoNLL F1, which is 6.8 and 6.0 points higher than e2e-coref and c2f-coref, respectively. However, this improvement is most likely due to the usage of BERT: BERT-coref also shows significantly higher performance (67.0 F1) than the other two baselines, and its main difference from c2f-coref is the usage of BERT.
Meanwhile, by ensembling 10 models, we achieve state-of-the-art performance on this Korean dataset with an F1 of 68.6, which is 2.4 points higher than our single model and 1.6 points higher than BERT-coref. However, as ensembling models is notoriously expensive in terms of inference time and memory usage, we also provide a knowledge-distilled version of the ensemble, referred to as BERT-SRU ptr-net (KD), that solves this problem. This distilled model has the same size as the single model while scoring 0.7 points higher in F1, only 0.1 point below the best single model, BERT-coref. It is noteworthy that our ensemble KD model not only achieves performance similar to BERT-coref without using any higher-order modeling, but also processes documents 2x faster (30 vs. 15 doc/sec) due to its much smaller computational complexity.
We also compare the usage of different pre-trained BERT embeddings. Table 2 shows that our pre-trained version is more suitable for this task than Google's multilingual BERT 4 (BERT-SRU ptr-net (Google)).

Ensemble Knowledge Distillation
Ensemble: We perform an ensemble of ten single models with different random seeds on the dev set. The lowest performance among the 10 models is 70.04% F1, and the average F1 score is 70.37% (standard deviation 0.253), both of which still outperform the 68.62 F1 of the Korean BERT-coref baseline. We compare a maximum-score ensemble and an average-score ensemble over the 10 models. The maximum-score ensemble reaches 72.26% F1 and the average-score ensemble 72.23% F1. However, we choose the average-score ensemble because it is 1.28% higher than the maximum-score ensemble on the test set.
Knowledge Distillation: We optimize the knowledge distillation weight β, applying it only to the knowledge distillation loss term, i.e., L = L_ce + βL_kd in Equation 7. The optimized β is 0.2, and applying this loss is beneficial for Korean coreference resolution.

Analysis
Feature ablation study: We perform feature ablation to understand the effect of each feature on Korean coreference resolution. Table 3 compares the ablation performance of each feature. Removing the morp boundary feature decreases the average F1 score by 0.8 points. Removing the dependency parsing or NER feature decreases the F1 score by 1.12 and 1.20 points, respectively, and removing the head distance feature reduces it by 1.27 points. Among all the features, removing the word boundary makes the most significant difference.
Component ablation study: To understand the effect of the different components of the model, we perform a component ablation study on the dev set, as shown in Table 4. We apply attention masking so that only valid antecedents are considered when calculating the attention score in the decoder of the pointer networks, and we define the head candidate list (nouns) as the target class to reduce the number of target candidates. Removing the attention mask decreases the average F1 score by 0.66 points. Defining the target class as the entire input document instead deteriorates the F1 score significantly, by 2.39 points. These two methods combined make the largest contribution to our model.
In addition, we share a hidden layer to perform coreference resolution and detection of the mention start boundary together. When the mention detection module is removed, the F1 score drops by 0.74 points. Finally, removing the self-attention module of the decoder results in a difference of 0.94 F1. Accordingly, all components of the proposed model contribute meaningfully to the Korean coreference resolution task.
Qualitative Analysis: Our qualitative analysis in Figure 4 highlights the strengths of our model. Figure 4 shows examples first in Korean and then in English translation. In Example 1, the model without the mention detection module (w/o MD) does not properly link the entity 레오나르도 다빈치 (Leonardo da Vinci). When trained with BERT embeddings other than our fine-tuned Korean BERT, the model does not find 엘리자베타 (Elisabeta) as an entity to resolve. Our full model, by contrast, distinguishes the various entities and performs coreference resolution correctly on all of them. In Example 2, our model even finds 물체 (object) entity links missing from the ground truth, demonstrating its robustness.
Meanwhile, pronouns and determiner phrases are among the most substantial parts of coreference resolution. In Example 3, our model successfully predicts that pronouns and determiner phrases such as 이 사자성어 (this idiom), 이 말 (this), and 무엇 (what) are linked to the entity 어려운 기회 (challenging opportunity). Furthermore, foreign words such as Chinese characters and English frequently appear in Korean documents. Our model reflects the contextual information and successfully resolves coreference for foreign words. In Example 4, the Persian token exists in the vocabulary of BERT, and the model successfully resolves the coreference between the two foreign words. In addition, the model can detect relatively long and complex noun phrases, such as 낙타나 말 등에 짐을 싣고 떼지어 다니면서 특산물을 파고 사는 상인의 집단 (a group of merchants who travel in droves, carrying loads on camels or horses and trading in specialties).

Weaknesses and Future Works
As the results show, head-final coreference resolution, which reflects the linguistic characteristics of Korean, has a significant computational advantage over span-based coreference resolution. However, our method can only be applied to languages that are either strongly head-initial (the head is at the beginning of the mention) or strongly head-final, and English is a mixture of the two. In future work, a search for suitable English linguistic traits could alleviate the computational complexity of this task in English or other mixed head-directional languages. Furthermore, as shown in Table 5, coreference resolution performance decreases significantly as document length increases. Although this is partly because the Korean dataset is relatively small and non-uniform in document length, we believe the choice of BERT size is also relevant. Recent studies have shown that a larger BERT may better encode longer contexts (Joshi et al., 2019a): using the BERT-large model (we use BERT-base) improves overall coreference performance, especially for long documents. In future work, we would like to explore BERT variants that handle larger contexts well.

Conclusion
We propose head-based coreference resolution that reflects the head-final characteristics of Korean and present a suitable BERT-SRU-based Pointer Networks model that leverages this linguistic trait. The proposed method, the first end-to-end Korean coreference resolution model, not only achieves state-of-the-art performance in Korean coreference resolution through ensembling but also dramatically speeds up document processing compared to conventional span-based coreference resolution. Our method achieves this by reducing coreference resolution from a search over span candidates to a search over head candidates, using the fact that mention heads are easy to extract in a head-final language.
Moreover, our proposed method of using head-directionality to speed up coreference resolution while maintaining the best performance is valid not only for other strongly head-final languages like Japanese, but also for strongly head-initial languages, as the same method of head extraction can be applied. We believe that our paper also provides an interesting and important research direction: combining linguistic theories like head-directionality and branching with deep learning has strong potential for modeling fundamental tasks like coreference resolution more efficiently and effectively.

A.1 Related Work
Traditional coreference resolution studies are divided into rule-based and machine learning-based methods. Among rule-based methods, Stanford's model (Lee et al., 2013) applied a multi-pass sieve using pronouns, entity attributes, named entity information, and so on. Among statistics-based methods, various coreference models have been proposed, such as mention-pair (Ng and Cardie, 2002; Ng, 2010), mention-ranking (Wiseman et al., 2015; Clark and Manning, 2016a), and entity-level models (Haghighi and Klein, 2010; Clark and Manning, 2016b). Lee et al. (2017) defined mentions as span representations and proposed a span ranking model based on long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) over all spans in the document. Although span representations reflect contextual information from the LSTM, unrelated spans can still be interpreted as related entities, a phenomenon that results in local consistency errors and erroneous coreference resolutions. Hence, Lee et al. (2018) applied an attention mechanism to resolve coreference with a higher-order function. The end-to-end models of Lee et al. (2017, 2018) showed superior performance in English coreference resolution; however, considering all spans and span pairs of the document incurs O(n^4) complexity. Zhang et al. (2018) built on this model, replacing the concat attention score with a biaffine attention score to calculate the coreference score; it also performed multi-task learning that additionally computes a loss for the mention score.
The simple recurrent unit (SRU) (Lei et al., 2017) architecture addresses the vanishing gradient problem that occurs during back-propagation in recurrent neural networks (RNNs). The SRU, an RNN variant like the gated recurrent unit (GRU) (Cho et al., 2014) and the LSTM, has lower computational complexity than other RNN types because it encodes hidden states using a feed-forward neural gate and a recurrent cell within a layer.
Recently, a variety of downstream studies using BERT (Bidirectional Encoder Representations from Transformers; Vaswani et al. (2017); Devlin et al. (2019)), pre-trained on large amounts of data, have been conducted across natural language processing tasks (Zhang et al., 2019; Park et al., 2019a; Wang et al., 2019). A BERT-coref study was also conducted on the English coreference resolution task, and SpanBERT (Joshi et al., 2019a), which is more effective for coreference resolution, has also been studied, with dramatic gains on the GAP (Webster et al., 2018) and OntoNotes (Pradhan et al., 2012) datasets. A qualitative assessment of BERT-coref showed that BERT is significantly better at distinguishing unique entities and concepts.

A.2 Data Format for Our Model
The following example shows the input sequence, head list, and decoder output format.
• Decoder output: [0, 0, 0, 0, 0, 5, 5]
We add [CLS] and [SEP] to match the input sequence to the BERT format. Heads is an example of the heads contained in a sentence, and Heads applied by BPE is an example of the heads after BPE is applied. BPE divides words into subwords, and a head divided into subwords uses its first token as the representative of the head. In the Heads applied by BPE example, the representative of the BPE-applied head '바' (Ba) and '카스/NNP' (cchus/NNP) is '바' (Ba). The head list gives the position of each head in the sentence, matched to the BERT input format, and is input to the decoder; the head list is the target class. The decoder output is the position in the head list to which the coreference resolves. Since '바' (Ba) is the first mention of the Bacchus entity, '바' (Ba) outputs its own location, 5. '신/NNG' (a god) outputs position 5 because it is linked to '바' (Ba). We then convert the output to word units via post-processing, as sketched below.
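As a sketch of this post-processing (our reconstruction; treating position 0, the [CLS] slot, as the dummy output is our assumption based on the example above):

```python
def recover_clusters(head_positions, decoder_output, dummy=0):
    """Group heads into entity clusters from pointer-network outputs.

    A head pointing to its own position opens a new entity; a head pointing
    to an earlier head's position joins that antecedent's cluster."""
    cluster_of = {}                        # head position -> cluster root
    clusters = {}
    for pos, ant in zip(head_positions, decoder_output):
        if ant == dummy:                   # head not linked to any entity
            continue
        root = cluster_of.get(ant, ant)    # self-pointer -> new cluster root
        cluster_of[pos] = root
        clusters.setdefault(root, []).append(pos)
    return [c for c in clusters.values() if len(c) > 1]

# '바' at position 5 points to itself; the later head '신/NNG' (here at
# position 8) points back to 5, so the two form one Bacchus cluster.
print(recover_clusters([5, 8], [5, 5]))   # [[5, 8]]
```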

A.3 Overall Performance
Please refer to Table 6 for full performance on all metrics and to Table 7 for the dev set results.

A.4 Optimizing Hyperparameters
We perform hyperparameter optimization on the baseline version of the BERT-SRU Pointer Networks, to which the head target class component is not applied. We optimize hyperparameters on the development set, covering the feature embedding size, the number of RNN hidden layer dimensions, and the number of biaffine hidden layer dimensions. We try dimension sizes of 50, 100, 200, 400, 800, and 1,600 to find the best-performing hyperparameters.
Optimizing Dimension Size of Feature Embedding: As shown in Table 8, we optimize the feature embedding size, and our model performs best with an embedding size of 1,600. According to Table 8, overall performance improves in proportion to the embedding dimension size.

Optimizing Size of RNN Hidden States
The optimization of the number of RNN hidden layer dimensions is shown in Table 9; a hidden state size of 800 performs best. We believe a moderately large number of dimensions works well because the hidden state e of Equation 1 is the concatenation of the BERT hidden state and the feature hidden states.
Optimizing Size of Biaffine Hidden States: Table 10 shows the optimization of the number of biaffine hidden layer dimensions; with 50 hidden dimensions, the performance is 69.72% CoNLL F1, in line with the previous tables. We then apply the head target class component on top of the optimized hyperparameters. As a result, the single model achieves 70.83% CoNLL F1.

As in highway networks (Srivastava et al., 2015), a skip connection is used to allow the gradient to propagate directly to the previous layer, so the information loss is small even when the stack is deepened.

A.6 Ensemble Knowledge Distillation
Ensemble: Table 12 shows the performance of ten single models with different random seeds and of the ensemble models on the dev set. We are interested in how the proposed model performs under different random initial conditions; our model shows consistent performance across the 10 different initializations. The lowest performance among the 10 models is 70.04% F1 and the mean F1 score is 70.37%, both of which still outperform the 68.62 F1 of the Korean BERT-coref baseline. We compare a maximum-score ensemble and an average-score ensemble over the 10 models. The maximum-score ensemble reaches 72.26% F1 and the average-score ensemble 72.23% F1. However, we choose the average-score ensemble because it is 1.28% higher than the maximum-score ensemble on the test set.
Knowledge Distillation: We optimize the knowledge distillation weight β. The final loss for knowledge distillation training can be computed in two ways. The first method applies β only to the knowledge distillation loss term, as L = L_ce + βL_kd in Equation 7. The second method applies β to both terms, as L = (1 − β)L_ce + βL_kd. Figure 5 shows the optimization results for the hyper-parameter β when training the ensemble knowledge distillation model. The experiment uses the loss function of Equation 7 with both methods and optimizes β between 0.1 and 1.0. When using the KLD, the temperature (Hinton et al., 2015) is set to 5. As a result, the first method achieves its best performance of 71.18% F1 when β is 0.2 on the dev set, improving the F1 score by 0.34% over the single model. When β is 0.1, the F1 score is 71.06%, the second-best performance for this method. For the second method, β values of 0.3 and 0.5 give F1 scores of 70.66% and 71.01%, respectively, both improvements over the single model. Accordingly, knowledge distillation with β below 0.5 helps training, and applying the loss of the first method is meaningful for Korean coreference resolution.