On GAP Coreference Resolution Shared Task: Insights from the 3rd Place Solution

This paper presents the 3rd-place-winning solution to the GAP coreference resolution shared task. The approach adopted consists of two key components: fine-tuning the BERT language representation model (Devlin et al., 2018) and the usage of external datasets during the training process. The model uses hidden states from the intermediate BERT layers instead of the last layer. The resulting system almost eliminates the difference in log loss per gender during the cross-validation, while providing high performance.


Introduction
The GAP coreference resolution shared task promotes gender-fair modelling with its GAP dataset (Webster et al., 2018). GAP is a coreference dataset for the resolution of ambiguous pronoun-name pairs in real-world context. GAP has a particular focus on the balance of masculine and feminine pronouns and allows for gender-specific evaluation. The challenge was hosted by Kaggle and consisted of two stages. Stage 1 attracted 838 participants, and stage 2 involved 263 participants.
A GAP training example looks as follows:
Burnett Stone (Peter Fonda) is Lily's grandfather and Lady's caretaker. He keeps her in Muffle Mountain.
Here, her is the ambiguous pronoun, Lily is candidate mention A, and Lady is candidate mention B. The data was extracted from Wikipedia, so the source URL of the article is given in addition to the text. The goal is to predict whether the ambiguous pronoun refers to mention A, to mention B, or to NEITHER of them. The problem was treated as a gold-two-mention task, in which the model has access to the positions of the mentions.
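For illustration, the example above can be held in a small record. This is a minimal sketch: the class name, the field names and the `label_from_flags` helper are hypothetical, and the character offsets are computed with `str.find` rather than copied from the dataset. No gold label is asserted for this example.

```python
from dataclasses import dataclass


@dataclass
class GapExample:
    """One gold-two-mention example (field names are illustrative)."""
    text: str
    pronoun: str
    pronoun_offset: int  # character offset of the pronoun in text
    a: str
    a_offset: int        # character offset of mention A
    b: str
    b_offset: int        # character offset of mention B


def label_from_flags(a_coref: bool, b_coref: bool) -> str:
    """Map two per-mention coreference flags to one of the three classes."""
    if a_coref and b_coref:
        raise ValueError("the pronoun cannot refer to both mentions")
    if a_coref:
        return "A"
    if b_coref:
        return "B"
    return "NEITHER"


text = ("Burnett Stone (Peter Fonda) is Lily's grandfather and Lady's "
        "caretaker. He keeps her in Muffle Mountain.")
example = GapExample(
    text=text,
    pronoun="her",
    # Search for " her " so that the "her" inside "grandfather" is skipped.
    pronoun_offset=text.find(" her ") + 1,
    a="Lily",
    a_offset=text.find("Lily"),
    b="Lady",
    b_offset=text.find("Lady"),
)
```

Storing offsets rather than only the strings matters because a mention string can occur more than once in a snippet.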

The data
The GAP dataset is split into training, validation and test sets.

Additional data
There are several coreference datasets available for training and evaluation. Besides the GAP data, the presented solution uses four external data sources: Winobias (Zhao et al., 2018), Winogender (Rudinger et al., 2018), the Definite Pronoun Resolution (DPR) dataset (Rahman and Ng, 2012; Peng et al., 2015) and Ontonotes 5.0 (Pradhan et al., 2012). Each of them was processed to be compatible with the GAP format. After cleaning, this resulted in 39,452 training examples for Ontonotes 5.0, 360 for Winogender, 3,162 for Winobias and 1,400 for DPR. In this paper, this external data (Ontonotes 5.0, Winobias, Winogender, DPR) is called warm-up data, because it was used to fine-tune the BERT embeddings, and the weights learned from this data served as a 'warm-up' for the training on the GAP dataset.
There was one more candidate for the pool of additional datasets, namely PreCo, but despite many efforts this dataset did not provide any score improvement. Presumably, this is mainly due to the different structure of the data and the high amount of noise. For instance, some training examples contained the same word as both mention and pronoun, which may have worsened the model performance.
There were several attempts to include all the additional datasets in the training procedure. The naive attempt to concatenate the GAP data and the additional data into one big training set did not work, because the additional data has a different structure and does not have the URL feature. The second attempt was to use a two-step approach:
1. Warm-up step: pretrain part of the model (namely the head, see section 3 for the explanation) on the external data only. Then, select the weights from the model that performs best on the warm-up validation set.
2. GAP step: continue training on the GAP data, using the weights from the best-performing warm-up model instead of randomly initialized weights.
The warm-up data was randomly split into training and validation sets in 95%-5% proportions. This strategy slightly improved the model performance, showing that warming up on external datasets can be a promising direction. One possible explanation is that starting with pretrained weights allowed the model to reach a flatter optimum and generalize better. In addition, the external data contained many more training examples for the category NEITHER (see Table 1), resulting in better performance for this group.
During the third attempt, the validation set was not randomly chosen, but consisted of Winobias only. This was done to ensure that a gender-fair representation would be chosen as the initialization weights for the GAP step. This change provided a further small improvement in the evaluation metric. However, a negative effect of choosing Winobias as validation data was the complete exclusion of the class NEITHER from the validation data (see Table 1). Surprisingly, this was not detrimental to the performance, most likely because the training data contained enough examples of this class. The final version of the model also fine-tunes particular layers of the BERT embeddings, in addition to the warm-up of the head. For a full description, see section 3.
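The control flow of the two-step procedure can be sketched on a toy problem. Everything below is an illustrative assumption: the "model" is a single number, the two quadratic losses stand in for the warm-up and GAP objectives, and the `train` helper is hypothetical. Only the structure (keep the weights with the best validation loss, then reuse them as initialization) follows the description above.

```python
def train(weight, grad_fn, val_loss_fn, n_steps, lr=0.1):
    """Gradient descent that returns the weight with the lowest validation loss."""
    best_weight, best_loss = weight, val_loss_fn(weight)
    for _ in range(n_steps):
        weight = weight - lr * grad_fn(weight)
        loss = val_loss_fn(weight)
        if loss < best_loss:
            # Checkpoint selection: remember the best weights seen so far.
            best_weight, best_loss = weight, loss
    return best_weight


# Toy stand-ins for the two objectives (assumptions, not the real losses).
def warmup_loss(w):
    return (w - 3.0) ** 2


def warmup_grad(w):
    return 2.0 * (w - 3.0)


def gap_loss(w):
    return (w - 4.0) ** 2


def gap_grad(w):
    return 2.0 * (w - 4.0)


# 1. Warm-up step: start from an arbitrary initialization.
warm_weight = train(0.0, warmup_grad, warmup_loss, n_steps=50)
# 2. GAP step: start from the best warm-up weights instead.
final_weight = train(warm_weight, gap_grad, gap_loss, n_steps=50)
```

In the toy setting the warm-up optimum lies close to the GAP optimum, so the second stage starts from a better point than the arbitrary initialization would give.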

Evaluation metrics and class distributions
Class distributions. Table 1 shows the class distributions for the three final datasets used: GAP train, which includes all GAP data; warm-up train, which includes Ontonotes 5.0, Winogender and DPR; and warm-up validation, which is Winobias. As can be seen, warm-up train has the most balanced distribution of classes, while GAP train has a lower proportion of the category NEITHER.
Evaluation metrics. Solutions were evaluated using the multi-class logarithmic loss. For each pronoun, the participants had to provide the probabilities of it belonging to A, B, or NEITHER. The formula to evaluate the performance of the model is:
$$\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij}$$
where $N$ is the number of samples in the test set, $M$ is the number of classes (3), $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.
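The metric described above can be sketched in a few lines. Because y_ij is one-hot, the inner sum over classes reduces to the log-probability of the true class. The function name and the `eps` guard against log(0) are assumptions of this sketch, not part of the task definition.

```python
import math


def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices (0 = A, 1 = B, 2 = NEITHER).
    probs:  list of rows, one probability per class.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Guard against log(0) for hard 0/1 predictions.
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)


# A uniform prediction over three classes scores ln(3), about 1.0986.
uniform = multiclass_log_loss([0, 1, 2], [[1 / 3, 1 / 3, 1 / 3]] * 3)
```

The uniform score is a useful reference point: any submission above ln(3) is worse than guessing.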

Features
Besides the direct textual input, the current solution uses some manually constructed features. The majority of them were already mentioned by Webster et al. (2018) as single baseline models. The following features were used:
• Token distance. Distance between the mentions and the pronoun, and also between the mentions themselves.
• Syntactic distance between mentions and pronoun. The distances were extracted with Stanford CoreNLP.
• URL. Whether the Wikipedia URL contains the mention.
• Sentence of the mention. Index of the sentence, where the mention is located, divided by total number of sentences in the snippet.
• Syntactic distance to the sentence root. The distance between the mention and its syntactic parent.
• Character position. The relative character position of the mention in the text.
• Pronoun gender. Gender of the pronoun. It was noticed that in some examples, mentions were of different gender, so the hope was that this feature could help. It did not help, but it did not hurt either. The fact that this feature did not affect the performance can be a good indicator of gender-neutral learning.
The features that provided the biggest improvements were URL (a 0.07 decrease in log loss) and syntactic distance between mentions and pronoun (0.01). The contribution of the other features was very limited.
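Three of the simpler features listed above can be sketched with the standard library alone. This is a rough approximation under stated assumptions: whitespace tokenization, a naive regular-expression sentence splitter, and a substring match on the last URL segment. The function names, the sample sentence and the sample URL are hypothetical, and the syntactic features (which need a parser) are not covered.

```python
import re


def token_distance(text, offset_1, offset_2):
    """Count whitespace tokens from the earlier offset up to the later one."""
    lo, hi = sorted((offset_1, offset_2))
    return len(text[lo:hi].split())


def url_contains_mention(url, mention):
    """True if the page title in the URL contains the mention (case-insensitive)."""
    title = url.rsplit("/", 1)[-1].replace("_", " ").lower()
    return mention.lower() in title


def relative_sentence_index(text, offset):
    """Index of the sentence holding `offset`, divided by the sentence count."""
    # Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace.
    ends = [m.end() for m in re.finditer(r"[.!?]\s+", text)]
    n_sentences = len(ends) + 1
    index = sum(1 for end in ends if end <= offset)
    return index / n_sentences


sample = "Alice met Beth. She smiled."       # hypothetical snippet
sample_url = "http://en.wikipedia.org/wiki/Jane_Doe"  # hypothetical URL
she = sample.find("She")
distance = token_distance(sample, sample.find("Alice"), she)
position = relative_sentence_index(sample, she)
```

A real pipeline would use a proper tokenizer and sentence splitter; the naive versions here break on abbreviations such as "Dr. Smith".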

The system
The final solution uses an ensemble of four neural networks: fifth-to-last layer with cased BERT, fifth-to-last layer with uncased BERT, sixth-to-last layer with uncased BERT, and sixth-to-last layer with cased BERT. The choice of layers is explained in the next subsection. Each network consists of two main parts:
• BERT part: contextual representations of the text with fine-tuned BERT embeddings.
• Head part: uses the embeddings together with the manually crafted features to produce softmax probabilities of the three classes.
All networks have the same architecture for the head part; the only difference is in the BERT part.

BERT part
BERT is a general-purpose language model, pretrained on Wikipedia and BookCorpus. It leverages the large amount of unannotated data on the web and produces context-aware word embeddings. The current solution uses pytorch-pretrained-BERT. Besides fine-tuning the BERT weights, the set of possible choices for the pretrained BERT is limited. The possibilities are:
• Number of layers in the Transformer model: 12-layer (bert-base) or 24-layer (bert-large)
• Cased or uncased model
• Multilingual or single-language model
One additional peculiarity, discovered by several contestants independently, is that using the output of the last layer may be an inferior option compared to the deeper ones, i.e. the layers further from the output. For the architecture presented in this paper, the optimal layers were fifth-to-last ([-5]) and sixth-to-last ([-6]). Figure 1 shows the log loss during stage 1 for different output layers, keeping all other parameters of the network fixed.
One possible explanation for this phenomenon is that the last output layer specializes in predicting the masked words, while the intermediate layers contain more general information. The U-shaped curve also shows that taking much deeper layers negatively affects the model performance.
Fine-tuning. As mentioned in section 2.1, initially only the head part was fine-tuned on the warm-up dataset, and the learned weights were used as initialization weights for the GAP training. The logical next step was to fine-tune the BERT embeddings themselves. This was done in the following way: first, the head was trained for 16 epochs with the parameters of BERT frozen. Afterwards, all of the parameters of the head were frozen, and one particular layer of BERT (either fifth-to-last or sixth-to-last) was fine-tuned. A small learning rate was crucial at this step, because the number of trainable parameters is high (12,596,224). The current solution uses a learning rate of 4e-5 and trains for 16 epochs with a batch size of 32. Every 400 steps the network performance was estimated on the evaluation data, and only the best-performing model was used after the training was done.
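As a sanity check on the parameter count quoted above, the size of one Transformer encoder layer can be computed by hand. The sketch assumes the standard BERT-large dimensions (hidden size 1024, feed-forward size 4096) and the usual layer layout (four attention projections, two feed-forward projections, two layer norms). The source does not state that the count refers to a single layer, but the arithmetic is consistent with that reading.

```python
HIDDEN = 1024        # BERT-large hidden size (assumed standard configuration)
INTERMEDIATE = 4096  # BERT-large feed-forward size (assumed standard configuration)


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters of a layer norm: scale plus shift."""
    return 2 * n


# Query, key, value and output projections, followed by a layer norm.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
# Up-projection, down-projection, followed by a layer norm.
feed_forward = (linear(HIDDEN, INTERMEDIATE)
                + linear(INTERMEDIATE, HIDDEN)
                + layer_norm(HIDDEN))
total = attention + feed_forward  # 12,596,224
```

The attention block contributes 4,200,448 parameters and the feed-forward block 8,395,776, which together give the figure in the text.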

Head part
The head network always takes BERT contextual embeddings of shape [batch size, seq len, 1024] (for BERT-large). These embeddings are processed with a 1D convolutional layer with 64 output channels and kernel size 1 in order to reduce the dimensionality. Interestingly, increasing the kernel size to 2 or 3 deteriorates the performance. This may be because the context immediately around the mentions was not that informative.
Because the positions of the mentions and the pronoun are known in advance, only the embeddings of those three phrases are extracted. This is done using SelfAttentiveSpanExtractor from AllenNLP (Gardner et al., 2017). This span extractor generates three vectors of size 64: one each for A, B and the pronoun. For single-token mentions the span representation is just the original vector itself, while for multi-token mentions the span extractor produces a weighted representation using the attention scores. Other span extractors from AllenNLP did not perform as well as the self-attentive span extractor.
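The pooling step of a self-attentive span extractor can be illustrated in plain Python. This is a sketch of the idea, not the AllenNLP implementation: in the real module the per-token scores come from a learned linear layer, whereas here `token_scores` is passed in directly as a hypothetical stand-in, and the function name is an assumption.

```python
import math


def self_attentive_span(token_vectors, token_scores, start, end):
    """Pool tokens start..end (inclusive) with softmax-normalized scores."""
    vectors = token_vectors[start:end + 1]
    scores = token_scores[start:end + 1]
    # Softmax restricted to the span; subtract the max for numerical stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[d] for w, v in zip(weights, vectors)) / norm
        for d in range(dim)
    ]


# Two toy token vectors of size 2 (the real ones have size 64).
tokens = [[1.0, 2.0], [3.0, 4.0]]
single = self_attentive_span(tokens, [0.3, -1.2], 0, 0)
pooled = self_attentive_span(tokens, [0.0, 0.0], 0, 1)
```

With a one-token span the softmax weight is 1, so the vector passes through unchanged; with equal scores the result is the plain average of the span.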
The resulting three vectors of size 64 are concatenated and processed with a standard fully-connected block: BatchNorm1d(192) → Linear(192, 64) → ReLU → BatchNorm1d → Dropout(0.6). This output is concatenated with all the manual features mentioned in section 2.3, which results in a vector of length 96. Adding the features directly to the last layer is important; otherwise they do not bring any improvement. Finally, a Linear(96, 3) layer on top produces the logits. The softmax probabilities are computed in a numerically stable way in the loss function.

Training details
For the GAP training, the Adam optimizer with a learning rate of 2e-3 is used. The batch size is 20. For both BERT fine-tuning and GAP training, the triangular learning rate schedule is used (Smith, 2017). The loss used is CrossEntropyLoss, which combines a numerically stable computation of softmax probabilities with the negative log-likelihood loss function.
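The triangular policy of Smith (2017) moves the learning rate linearly between a lower and an upper bound. A minimal sketch follows; the function name, the lower bound of 2e-4 and the half-cycle length of 100 steps are hypothetical values chosen for illustration, and treating 2e-3 as the upper bound is an assumption, since the source gives a single learning rate.

```python
import math


def triangular_lr(step, base_lr, max_lr, half_cycle):
    """Triangular cyclical learning rate.

    Rises linearly from base_lr to max_lr over `half_cycle` steps,
    then falls back to base_lr over the next `half_cycle` steps.
    """
    cycle = math.floor(1 + step / (2 * half_cycle))
    x = abs(step / half_cycle - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Learning rate at a few points of the first cycle.
schedule = [triangular_lr(s, 2e-4, 2e-3, 100) for s in (0, 50, 100, 150, 200)]
```

The schedule starts at the lower bound, peaks at step 100 and returns to the lower bound at step 200, after which the cycle repeats.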
The predictions for each of the four networks are made with 10-fold cross-validation stratified by the class distribution, i.e. the model is trained on 90% of the data and the other 10% is used for the evaluation. The final prediction is the average across all folds and all models, 40 models overall. The final predictions were clipped to the interval (1e-2, 1 − 1e-2), because log loss penalizes predictions heavily as they drift away from the ground truth.
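The averaging and clipping steps can be sketched as follows. The function names and the toy predictions are assumptions; whether the rows were renormalized after clipping is not stated in the source, so the sketch leaves them as they are.

```python
import math


def average_predictions(all_preds):
    """all_preds[m][i][j]: model m's probability of class j for example i."""
    n_models = len(all_preds)
    n_examples = len(all_preds[0])
    n_classes = len(all_preds[0][0])
    return [
        [sum(p[i][j] for p in all_preds) / n_models for j in range(n_classes)]
        for i in range(n_examples)
    ]


def clip_predictions(probs, low=1e-2, high=1 - 1e-2):
    """Clamp every probability to [low, high]."""
    return [[min(max(p, low), high) for p in row] for row in probs]


# Two toy models, one example, three classes (A, B, NEITHER).
preds = [[[0.8, 0.1, 0.1]], [[0.6, 0.3, 0.1]]]
averaged = average_predictions(preds)
clipped = clip_predictions([[1.0, 0.0, 0.0]])

# After clipping, a confidently wrong answer costs at most -ln(0.01), about 4.6.
worst_case = -math.log(1e-2)
```

Without clipping, a single prediction of exactly 0 for the true class would make the log loss unbounded, so the clip trades a small cost on correct answers for a cap on the worst case.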
The training takes approximately one day on a single NVIDIA Tesla P100. This can be substantially reduced with proper code optimization (for instance, removing the BERT computations for all layers after the fifth-to-last). The framework used for the implementation is PyTorch.

Results
The described solution achieves a log loss of 0.1839 on the stage 2 test data, which corresponds to third place on the Kaggle leaderboard. The results of the single models are presented in Table 3. As can be seen, the cased version generally performs better. One possible reason is the variety of personal names in GAP, which the cased version is able to recognize better.
On the cleaned stage 1 test data, the best-performing single model (cased BERT, fifth-to-last layer) achieved a log loss of 0.23819. Because the true labels are available for the stage 1 test data, the loss for the masculine and the feminine pronouns was estimated separately. For masculine pronouns the log loss was 0.24014, while for feminine pronouns it was 0.23623. The difference is about 4e-3 (0.00391), which can be considered insignificant.
The error matrix for the whole stage one dataset (10-fold cross-validated) is presented in the

Discussion
One of the weaknesses of the presented system is the training and prediction time. Because there are four networks, training takes a long time, and the prediction on the test set requires almost 2 hours. The attempt to concatenate several intermediate BERT layers, or to use a linear combination of them, did not work, although it was reported to have a positive effect (Tenney et al., 2018). Using the information from the whole Wikipedia page also did not provide any improvements.
During the early stages of the competition, when only GAP data was used, the sources of model mistakes were analyzed. It was found that, despite the best efforts of the dataset authors, there are still some mislabelled examples in the GAP data itself. Other participants reported errors as well. Some of these errors are quite simple, but the majority require substantial human effort, and some were impossible to detect without reading the corresponding Wikipedia article. The number of erroneous labels for the different GAP datasets, separated by gender, is reported in Table 4. This list is based on the mistakes reported on the forum, as well as on the author's own spot checks, and it is not comprehensive. It can be seen that mislabelled examples represent less than 5% of all cases. They are usually equally distributed between genders, except for GAP test, where the number of mislabelled feminine examples is 30% higher.

Conclusion
This paper presents a solution for the GAP coreference resolution shared task. The solution utilizes the pretrained contextual embeddings from BERT and fine-tunes them for the coreference problem on additional data. One of the findings is that the output of BERT's intermediate layers gives a better representation of the input text for the coreference task. Another contribution is the finding that the gender bias in external data can be mitigated by using gender-fair datasets as validation data during the pretraining phase.