Transfer Learning from Pre-trained BERT for Pronoun Resolution

This paper describes the submission of the team “We used bert!” to the shared task Gendered Pronoun Resolution (pairing pronouns with their correct entities). Our final submission, a model based on fine-tuned BERT (Bidirectional Encoder Representations from Transformers), ranks 14th among 838 teams with a multi-class logarithmic loss of 0.208. In this work, we investigate the contribution of transfer learning to pronoun resolution systems and evaluate the gender bias contained in the classification models.


Introduction
The shared task Gendered Pronoun Resolution aims to resolve pronouns in sentences, that is, to find the name that a given pronoun refers to, such as she in: In May, Fujisawa joined Mari Motohashi's rink as the team's skip, moving back from Karuizawa to Kitami where she had spent her junior days.
This pronoun resolution task is closely related to the traditional coreference resolution task in natural language processing. Many works related to coreference resolution (Wiseman et al., 2016; Clark and Manning, 2016; Lee et al., 2017) have been published recently, and all of them are evaluated on the CoNLL-2012 shared task dataset (Pradhan et al., 2012). However, simply pursuing the best score over the entire dataset may lead one to neglect the performance gap between the two genders.
To explore the existence of gender bias in such tasks, researchers from Google built and released GAP (Gendered Ambiguous Pronouns) (Webster et al., 2018), a human-labeled corpus of 8908 ambiguous pronoun-name pairs derived from Wikipedia with balanced gender pronouns. It has been shown that most recent representative coreference systems struggle on the GAP dataset, with mediocre overall performance and a large performance gap between genders. This may be due to the unbalanced training datasets used by these coreference systems, to the design of the systems, or to both. Up to now, detecting and eliminating gender bias in such systems remains a challenge.
* Both authors contributed equally to this work.
In this paper, we explore transfer learning from pre-trained models to improve performance on tasks with limited data. Various efficient approaches to reusing the knowledge in pre-trained BERT for this shared task are proposed and compared. The final system significantly outperforms the off-the-shelf resolvers, with balanced prediction performance for the two genders. Moreover, gender bias in word-level and sentence-level embeddings is studied with a statistical experiment on the Caliskan dataset (Caliskan et al., 2017).

Data
This shared task is based on the GAP dataset, which includes:
• Test, 4,000 pairs: used for official evaluation
• Development, 4,000 pairs: used for model development
• Validation, 908 pairs: used for parameter tuning
In the first stage, we use the part of the data released in the Google GAP GitHub repository, which includes 2000 development pairs, 2000 test pairs, and 454 validation pairs. We refer to the test pairs as training data, the development pairs as testing data and the validation pairs as validation data. Each sample contains a sentence and three mentions: A, B and the pronoun. Each pronoun has been labeled as A, B, or NEITHER. Submissions are evaluated using the multi-class logarithmic loss. Table 1 shows the frequency of the different types of pronouns in the dataset. The numbers of masculine and feminine pronouns are exactly equal.
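To make the evaluation metric concrete, the following is a minimal sketch of the multi-class logarithmic loss over the three labels. The function name, the row renormalisation and the clipping constant `eps` are assumptions that follow common practice for this metric; they are not taken from the task description.

```python
import math

CLASSES = ("A", "B", "NEITHER")

def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of labels from CLASSES.
    y_prob: list of (p_A, p_B, p_NEITHER) tuples, one per sample.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Assumption: renormalise each row, then clip away from 0 and 1
        p = probs[CLASSES.index(label)] / sum(probs)
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)

# Toy predictions for two samples (illustrative values only)
y_true = ["A", "NEITHER"]
y_prob = [(0.8, 0.1, 0.1), (0.2, 0.2, 0.6)]
print(multiclass_log_loss(y_true, y_prob))
```

A uniform prediction of 1/3 for every class gives a loss of ln 3, which is a useful baseline when reading the losses reported later.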

Data Preparation
This section describes in detail how the data are processed before training.

Data Preprocessing
Data preprocessing can be summarized in the following steps:
BERT embeddings generation: We use the pre-trained bert-large-uncased model to obtain contextual embeddings as features. This part is implemented with the bert-as-service library, which is based on TensorFlow (Xiao, 2018).
Dimension reduction: Dimension reduction is performed on the original BERT contextual embeddings to mitigate overfitting. This approach is inspired by Algorithm 2 (PPA-PCA-PPA) proposed in Raunak (2017).
For large vectors of dimension 1024, instead of directly using PCA (principal component analysis), we train a linear autoencoder to approximate the linear PCA procedure. Namely, we train the autoencoder by minimizing the reconstruction loss:
$$\mathcal{L} = \lVert X - W_2 W_1 X \rVert^2$$
where $X$ is the contextual embedding. $W_1$ and $W_2$ are $m \times n$ and $n \times m$ matrices that project vectors to the lower-dimensional space and recover them from it, respectively ($m < n$).
Hence, the PCA part of the original algorithm is performed by computing $W_1 X$, and the PPA part is performed by computing $X - W_2 W_1 X$.
Here the PPA procedures remove the first 4 principal components. The PCA procedure maps 1024-dimensional vectors to 256-dimensional vectors.
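The PPA-PCA-PPA pipeline can be sketched as follows. This is not the trained linear autoencoder described above: as a stand-in it uses the closed-form principal directions from an SVD, which span the same subspace that an optimal linear autoencoder recovers. The function names and the small toy sizes are illustrative assumptions, and rows are samples here, so the projection $W_1 X$ appears as `Xc @ W1.T`.

```python
import numpy as np

def top_components(X, k):
    """Top-k principal directions of X (rows are samples), shape (k, n)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, sorted by decreasing variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def ppa(X, d):
    """PPA step: centre X and remove its top-d principal components."""
    Xc = X - X.mean(axis=0)
    W1 = top_components(X, d)            # (d, n), plays the role of W_1
    return Xc - (Xc @ W1.T) @ W1         # x - W_2 W_1 x with W_2 = W_1^T

def pca(X, m):
    """PCA step: project centred X onto its top-m principal directions."""
    Xc = X - X.mean(axis=0)
    return Xc @ top_components(X, m).T

def ppa_pca_ppa(X, m, d):
    """Remove dominant directions, reduce to m dimensions, remove again."""
    return ppa(pca(ppa(X, d), m), d)

# Toy data; the setting in the text would be n = 1024, m = 256, d = 4
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))
Z = ppa_pca_ppa(X, m=8, d=2)
print(Z.shape)
```

Using the SVD keeps the sketch short and deterministic; for 1024-dimensional embeddings a trained autoencoder, as in the text, avoids a full decomposition.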
Processing mentions: A mention in the data (A, B or the pronoun) can be a single word or multiple words. Also, since BERT is based on the word piece model (Wu et al., 2016), a word may be split into multiple word pieces by the BERT tokenizer. We define the mention index as the list of positions in the tokenized word piece list that correspond to the original mention.
The vectors in the BERT contextual embeddings that correspond to the mention index are extracted, and their mean is taken. We call the result the mention vector.
Find names: All names in the sentences except A and B are extracted with a named entity recognition tool, the Stanford Named Entity Tagger (Finkel et al., 2005). Their mention indices are then found by the same procedure as in the previous step. We call these indices the neither mention index.
An example of tokenization and mention index is shown in Table 2.
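The mention index and mention vector steps can be illustrated with a small sketch. The toy tokenizer below is a hypothetical stand-in for the BERT word piece tokenizer, and its splits are invented for illustration. The special tokens that BERT adds (such as [CLS]) are omitted, so real indices would be shifted accordingly.

```python
# Hypothetical word piece splits, for illustration only
TOY_PIECES = {"fujisawa": ["fuji", "##sawa"], "motohashi": ["moto", "##hashi"]}

def toy_tokenize(word):
    """Stand-in for a word piece tokenizer (lower-cases, as an uncased model would)."""
    return TOY_PIECES.get(word.lower(), [word.lower()])

def piece_spans(words, word_to_pieces):
    """Tokenize word by word; record each word's [start, end) piece range."""
    pieces, spans = [], []
    for word in words:
        new_pieces = word_to_pieces(word)
        spans.append((len(pieces), len(pieces) + len(new_pieces)))
        pieces.extend(new_pieces)
    return pieces, spans

def mention_index(spans, first_word, last_word):
    """Word piece positions covering words first_word..last_word (inclusive)."""
    return list(range(spans[first_word][0], spans[last_word][1]))

def mention_vector(embeddings, index):
    """Mean of the contextual embeddings at the given positions."""
    dim = len(embeddings[0])
    return [sum(embeddings[i][d] for i in index) / len(index) for d in range(dim)]

words = ["Fujisawa", "joined", "Mari", "Motohashi"]
pieces, spans = piece_spans(words, toy_tokenize)
index_b = mention_index(spans, 2, 3)      # the two-word mention "Mari Motohashi"
toy_embeddings = [[0, 0], [2, 4], [1, 1], [3, 0], [6, 3], [0, 3]]
print(pieces, index_b, mention_vector(toy_embeddings, index_b))
```

The same two functions apply unchanged to the names found by the entity tagger, which yields the neither mention index.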

Data Augmentation
We replace the originally referred mention with a different random mention in the sentence, then change the label to neither. This creates 1445 samples labeled neither from the training data. The original training data together with the augmented neither data make up the augmented training set.
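One possible reading of this augmentation is sketched below. The sample layout (a dict with keys "A", "B" and "label") and the function name are hypothetical, and the sketch assumes that the candidate slot holding the true referent is swapped for another name found in the sentence. Real GAP records also carry character offsets, which would need to be updated in the same way.

```python
import random

def augment_neither(sample, other_names, rng):
    """Return a copy labelled NEITHER, or None if no substitute exists.

    sample: dict with keys "A", "B" and "label" ("A", "B" or "NEITHER").
    other_names: names in the sentence other than A and B.
    """
    if sample["label"] not in ("A", "B") or not other_names:
        return None
    augmented = dict(sample)
    # Swap the referred candidate for a name the pronoun does not refer to
    augmented[sample["label"]] = rng.choice(other_names)
    augmented["label"] = "NEITHER"
    return augmented

rng = random.Random(0)
sample = {"A": "Alice", "B": "Beth", "label": "A"}
print(augment_neither(sample, ["Carol"], rng))
```

Samples that are already labelled neither, or that contain no other name, produce no augmented copy in this sketch.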

Architecture
We mainly explored two categories of models, as shown in Figure 1. The first category is based on fine-tuned BERT with different top layers. For this category, back-propagation updates both the top layers and the pre-trained BERT model. The second category uses BERT as a feature extractor. Unlike fine-tuned BERT, models in the second category do not back-propagate into the BERT weights during training. All of these base models contribute to our final model.²

[Figure 1: Initial Sentences & Mention Indices → base models (Fine-tuned BERT, POS-top, Feature Extractor, LR, MLP) → Meta Classifier → Output Probabilities]

Fine-tuned BERT
We propose two different kinds of top layers to fine-tune the BERT model on the GAP task, implemented with the PyTorch Pretrained BERT library (HuggingFace, 2018). The first kind of top layer, shown in Figure 2, is called MLP-top. It extracts the vectors for all mentions and aggregates them by concatenation; the result is then fed into a multi-layer neural network. The second kind of top layer first maps the output of BERT to a scalar at each position with a linear layer whose output size is 1. Then we extract the values corresponding to the mention indices and feed them into a softmax layer to obtain a 3-class probability output. We call this Positional-top, which is illustrated in Figure 3.³
² Due to the space limit, we do not explain in detail all the base models that we use to produce the final ensemble model. The models described below are only the efficient and representative base models. For a comprehensive list of the base models we use, please check: https://github.com/bxclib2/kaggle_gender_coref/
³ Both Figure 2 and Figure 3 show mentions that contain only a single word piece after tokenization. If a mention contains multiple word pieces, the mean over the corresponding positions in the BERT output layer is computed in order to generate a tensor of the desired size to be fed into the top layer.

BERT as Feature Extractor
When BERT is used as a feature extractor, the contextual embeddings and the prepared mention vectors are passed to the subsequent classifier.
Here we use SVM (support vector machine) and BIDAF (bi-directional attention flow layer) (Seo et al., 2017) as classifiers.
SVM: We denote the mention vectors of A, B and the pronoun as h_A, h_B and h_pron. A vector built from h_A, h_B and h_pron using point-wise products is fed as the input of the SVM. Multi-class support is handled according to a one-vs-one scheme.
BIDAF: The BERT contextual embeddings and the pronoun mention vectors are passed to the bi-directional attention flow layer as the context and the query, respectively. Here we use the original embeddings extracted from BERT large, with an embedding dimension of 1024. A two-layer point-wise fully-connected neural network then maps the output embedding vectors to scalars. The fully-connected layer has 64 hidden units with ELU as the activation function (Clevert et al., 2016). Finally, the scalars corresponding to A, B and neither are fed into a softmax layer to generate 3-class probabilities.
The top layer of the BIDAF network works similarly to the positional head (Positional-top) of the fine-tuned BERT. However, there are two major differences. First, the positional head uses only a linear layer to map the embeddings to scalars, while the BIDAF network uses a two-layer neural network with ELU activation. Second, the output of BIDAF is taken from the positions corresponding to the A, B and neither mentions, while the positional head extracts the scalars corresponding to the A, B and pronoun mentions.
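The positional head of the fine-tuned BERT can be sketched as a forward pass in plain Python. The function names and toy numbers are illustrative assumptions, and no training is shown. Multi-piece mentions are handled by averaging the per-position scalars, which for a linear layer equals applying the layer to the mean of the output vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def positional_top(token_vectors, weight, bias, mention_indices):
    """Score each position with one linear unit, then softmax over mentions.

    token_vectors: one output vector per word piece.
    weight, bias: parameters of the linear layer with output size 1.
    mention_indices: three lists of positions, in the order A, B, pronoun.
    """
    scalars = [sum(w * x for w, x in zip(weight, vec)) + bias
               for vec in token_vectors]
    # Mean over positions for mentions that span several word pieces
    scores = [sum(scalars[i] for i in idx) / len(idx) for idx in mention_indices]
    return softmax(scores)

# Toy example: 4 word pieces with 2-dimensional outputs
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
probs = positional_top(token_vectors, weight=[1.0, 2.0], bias=0.0,
                       mention_indices=[[0], [1, 2], [3]])
print(probs)
```

In this sketch the third probability comes from the pronoun position; reading it as the neither class is an interpretation of the 3-class output, not something stated explicitly above. A BIDAF-style head would replace the single linear unit with a two-layer network and read its third score from the neither mention positions instead.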

Model Ensemble
Ensemble learning greatly improves the results compared to single models. We use stacking as the ensemble method: several base classifiers are trained to make preliminary predictions, and a meta classifier makes the final prediction based on these predictions.
To reduce data leakage, 5-fold cross validation is performed when building the training data for the meta classifier from the original training data. In other words, we avoid training the base classifiers and the meta classifier on the same fold of data (Beaudon, 2016). In each round, 4 folds of data are used to train a base model, and the resulting model predicts the remaining fold to build one fold of training data for the meta classifier, as shown in Figure 5. We use logistic regression as the meta classifier.
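The out-of-fold procedure can be sketched as follows. The class-prior base model is a hypothetical placeholder for the real base classifiers, and the interleaved fold assignment is an assumption made for brevity; in practice the folds would usually be shuffled.

```python
def k_fold_indices(n, k=5):
    """Assign sample i to fold i mod k (illustrative, no shuffling)."""
    return [list(range(fold, n, k)) for fold in range(k)]

def out_of_fold_predictions(X, y, fit, predict, k=5):
    """Predict every sample with a base model that never trained on it."""
    n = len(X)
    predictions = [None] * n
    for held_out in k_fold_indices(n, k):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = predict(model, X[i])
    return predictions

# Hypothetical base model: smoothed class frequencies, ignoring the input
CLASSES = ("A", "B", "NEITHER")

def fit_prior(X_train, y_train):
    return [(y_train.count(c) + 1) / (len(y_train) + len(CLASSES)) for c in CLASSES]

def predict_prior(model, x):
    return model

X = list(range(10))
y = ["A", "B", "NEITHER", "A", "B", "A", "B", "A", "NEITHER", "B"]
meta_features = out_of_fold_predictions(X, y, fit_prior, predict_prior)
print(meta_features[0])
```

The out-of-fold probabilities of all base models, concatenated per sample, would then form the training input of the logistic regression meta classifier.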

Experiment
In this section, we present the results of different classifiers on the shared task.

Experiment setting
For the SVM, C equals 5.0 and the kernel is the RBF function. The SVM is trained with both the original 1024-dimensional mention vectors and the 256-dimensional reduced mention vectors for comparison. The BIDAF network is trained for 50 epochs with a batch size of 25. We use the Adam optimizer with a learning rate of 1e-3 for training. Dropout with probability 0.7 is applied to each fully-connected layer in BIDAF. The network is trained with both the original training set and the augmented training set for comparison. This training process takes about 10 minutes on a GTX 1070 GPU.
The fine-tuned BERT models are trained with the Adam optimizer with a learning rate of 2e-5. All the dropout layers in the original BERT model are set to a dropout rate of 0.15. Models are trained for 1 epoch with a batch size of 16. Because 16 training sentences do not fit in the limited GPU memory at once, gradient accumulation is used: each forward and backward pass processes 2 training sentences, and the gradient is accumulated over 8 passes. This fine-tuning process takes about 10 minutes on a Tesla K80 GPU.
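The gradient accumulation scheme can be illustrated on a toy problem. The sketch below uses a one-parameter least-squares model and a plain gradient step rather than BERT and Adam, so everything except the micro-batch size of 2, the 8 accumulation steps and the learning rate is an illustrative assumption. It shows that accumulating scaled micro-batch gradients reproduces the gradient of the full batch of 16.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y ~ w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, data, micro_batch=2, accumulation_steps=8, lr=2e-5):
    """One parameter update from gradients accumulated over micro-batches."""
    grad = 0.0
    for step in range(accumulation_steps):
        chunk = data[step * micro_batch:(step + 1) * micro_batch]
        # Scale each micro-batch so the sum equals the full-batch mean gradient
        grad += grad_mse(w, chunk) / accumulation_steps
    return w - lr * grad, grad

# 16 toy samples, processed as 8 micro-batches of 2
data = [(float(i), 3.0 * i) for i in range(16)]
w_new, grad = accumulated_step(0.0, data)
print(grad, grad_mse(0.0, data))
```

Only the memory footprint changes: at any time the model holds activations for 2 sentences, while the update is equivalent to one computed on 16.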
The meta classifier is logistic regression with L2 regularization, with the regularization constant C equal to 0.5.

Evaluation
The results are shown in Table 3. The masculine and feminine data losses are reported separately in order to show the gender bias. We compute the model loss on the testing data (stage 1) and the losses on the masculine part and the feminine part of the stage 1 testing data. We also submitted our base model results after the competition finished in order to obtain the loss on the private testing data (stage 2). We draw the following conclusions:
• Dimension reduction greatly improves the SVM result, reducing the multi-class logarithmic loss on the testing data by about 0.1. SVM 1024 has a loss of 0.184 and 0.597 on the training and testing data, while SVM 256 has a loss of 0.250 and 0.505. Both SVM models overfit considerably, but the dimension reduction of BERT contextual embeddings effectively mitigates overfitting and narrows the performance gap between training data and testing data.
• The BIDAF model performs worse when trained with the augmented training set than with the original training set. This is due to the distribution mismatch caused by data augmentation: the proportion of neither data is larger in the training set than in the testing set.
• Both fine-tuned BERT models achieve much more competitive results than the BERT as Feature Extractor models.
• Ensemble learning with logistic regression greatly improves the overall classification result.
Although data augmentation does not improve the BIDAF model directly, it still helps to make more accurate predictions of the neither class in the ensemble model. BIDAF-aug and BIDAF reach losses of 0.982 and 1.095, respectively. On the testing data (stage 1), the accuracies for the A, B and neither classes are 89.8%, 89.5% and 73.1%, indicating that predicting the neither class correctly is much harder than predicting A and B. In other words, it is easier for the model to choose A or B as the answer than to predict that there is no reference.
We also evaluate our system F1 score with stage 1 testing dataset to compare to the off-the-shelf resolvers in

Gender Bias in the Embeddings
To further examine whether gender bias is present in the embeddings, we use both the Word Embedding Association Test (WEAT) (Caliskan et al., 2017) and the Sentence Embedding Association Test (SEAT) (May et al., 2019). As fine-tuned BERT large models with Positional-top contribute a lot to our final ensemble model, we focus only on this category of models in this section.

WEAT & SEAT
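For both the word-level test and the sentence-level test, let X and Y be two sets of target concept word or sentence embeddings, and let A and B be two sets of attribute word embeddings. The test statistic is the difference between the sums of associations with the attributes over the two target concepts, which can be calculated as:
$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B)$$
where:
$$s(w, A, B) = \operatorname{mean}_{a \in A} \cos(w, a) - \operatorname{mean}_{b \in B} \cos(w, b)$$
The p-value of $s(X, Y, A, B)$ is used to assess the significance of the association between (A, B) and (X, Y):
$$p = \Pr_i\left[ s(X_i, Y_i, A, B) > s(X, Y, A, B) \right]$$
where $(X_i, Y_i)$ ranges over the partitions of $X \cup Y$ into two sets of equal size. The effect size d is used to measure the magnitude of the association:
$$d = \frac{\operatorname{mean}_{x \in X} s(x, A, B) - \operatorname{mean}_{y \in Y} s(y, A, B)}{\operatorname{std\_dev}_{w \in X \cup Y} s(w, A, B)}$$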
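The test statistic, effect size and p-value can be sketched in a few lines, following the definitions of Caliskan et al. (2017). The function names and the toy 2-dimensional vectors are illustrative assumptions. The sample standard deviation is used for the effect size, and the p-value enumerates every equal-size partition exactly, which is only feasible for small target sets.

```python
import math
from itertools import combinations
from statistics import mean, stdev

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A, B):
    """s(w, A, B): mean similarity to A minus mean similarity to B."""
    return mean(cosine(w, a) for a in A) - mean(cosine(w, b) for b in B)

def weat_statistic(X, Y, A, B):
    """s(X, Y, A, B): summed association of X minus summed association of Y."""
    return sum(association(x, A, B) for x in X) - sum(association(y, A, B) for y in Y)

def effect_size(X, Y, A, B):
    scores_x = [association(x, A, B) for x in X]
    scores_y = [association(y, A, B) for y in Y]
    # Assumption: sample standard deviation over all target embeddings
    return (mean(scores_x) - mean(scores_y)) / stdev(scores_x + scores_y)

def p_value(X, Y, A, B):
    """One-sided exact permutation test over equal-size partitions of X and Y."""
    observed = weat_statistic(X, Y, A, B)
    pooled = X + Y
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(X)):
        chosen = set(idx)
        Xi = [pooled[i] for i in idx]
        Yi = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        hits += weat_statistic(Xi, Yi, A, B) > observed
        total += 1
    return hits / total

# Toy embeddings: X leans towards attribute A, Y towards attribute B
A, B = [[1.0, 0.0]], [[0.0, 1.0]]
X = [[1.0, 0.1], [0.9, 0.2]]
Y = [[0.1, 1.0], [0.2, 0.9]]
print(weat_statistic(X, Y, A, B), effect_size(X, Y, A, B), p_value(X, Y, A, B))
```

For SEAT the same functions apply with sentence vectors in place of word vectors. For larger target sets the exact enumeration would typically be replaced by random sampling of partitions.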

Experiments and Results
We apply WEAT and SEAT on Caliskan Test of male/female names with career and family, which corresponds to past social psychology studies.
Method   GloVe    ELMo     BERT   F-BERT
WEAT     1.81*    −0.45    0.21   0.38
SEAT     1.74*    −0.38    0.08   0.07

Table 5 shows the results of WEAT and SEAT. Sentence vectors are aggregated by taking the mean of all word vectors in the sentence for GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), BERT and fine-tuned BERT (F-BERT).⁵ With p-values lower than 0.01, GloVe embeddings show significant gender bias at both the word level and the sentence level, indicating that women are associated with family while men are associated with career.
⁵ For a better comparison, we aggregate sentence vectors for BERT with a different method from the cited paper, which uses the [CLS] vector as the sentence vector.
However, the p-values of all contextual embeddings, including ELMo, BERT and fine-tuned BERT, are larger than 0.05, so there is no evidence of gender bias in these embeddings. One possible explanation is that contextual word embeddings represent a single word differently in different sentences, resulting in more flexible word representations that focus on the context within a sentence rather than on the overall word frequency distribution.

Conclusion and Future Work
We propose a transfer-learning-based solution for pronoun resolution. The proposed solution leads to gender balance in both word embeddings and overall predictions. It improves on the off-the-shelf solution proposed by Lee et al. (2017) by 23.3% F1 on the widely studied Google GAP dataset. Among the single models in our ensemble, the BERT-mlp and BERT-pos models substantially outperform the others in the experiments. Overall, this work shows the efficacy of employing BERT in downstream natural language processing classification tasks.
In the future, we would like to investigate various transfer structures on top of pre-trained BERT, especially to improve the stability of the fine-tuning process. We observe in our experiments that the performance of fine-tuned models based on BERT depends strongly on the initial random state; thus, further research on building more robust models is indispensable.