Gendered Pronoun Resolution using BERT and an Extractive Question Answering Formulation

The resolution of ambiguous pronouns is a longstanding challenge in Natural Language Understanding. Recent studies have suggested gender bias among state-of-the-art coreference resolution systems. As an example, the Google AI Language team recently released a gender-balanced dataset and showed that the performance of these coreference resolvers is significantly limited on the dataset. In this paper, we propose an extractive question answering (QA) formulation of the pronoun resolution task that overcomes this limitation and shows much lower gender bias (0.99) on their dataset. This system uses fine-tuned representations from the pre-trained BERT model and outperforms the existing baseline by a significant margin (22.2% absolute improvement in F1 score) without using any hand-engineered features. This QA framework is equally performant even without knowledge of the candidate antecedents of the pronoun. An ensemble of QA and BERT-based multiple choice and sequence classification models further improves the F1 (23.3% absolute improvement upon the baseline). This ensemble model was submitted to the shared task for the 1st ACL workshop on Gender Bias for Natural Language Processing. It ranked 9th on the final official leaderboard.


Introduction
Coreference resolution is a task that aims to identify spans in a text that refer to the same entity. This is central to Natural Language Understanding. We focus on a specific aspect of coreference resolution: resolving ambiguous pronouns in English. Recent studies have shown that state-of-the-art coreference resolution systems exhibit gender bias (Webster et al., 2018; Rudinger et al., 2018; Zhao et al., 2018). Webster et al. (2018) released a dataset that contains an equal number of male and female examples to encourage gender-fair modeling on the pronoun resolution task. A shared task for this dataset was then published on Kaggle. The task involves classifying a specific ambiguous pronoun in a given Wikipedia passage as coreferring with one of three classes: the first candidate antecedent (hereafter referred to as A), the second candidate antecedent (hereafter referred to as B) or neither of them (hereafter referred to as N). The authors show that even the best of the baselines, such as (Clark and Manning, 2015), (Wiseman et al., 2016) and (Lee et al., 2017), achieve an F1 score of just 66.9% on this dataset. The limited number of annotated labels available in this unbiased setting makes the modeling a challenging task. To that end, we propose an extractive question answering formulation of the task that leverages BERT (Devlin et al., 2018) pre-trained representations and significantly improves (22.2% absolute improvement in F1 score) upon the best baseline (Webster et al., 2018). In this formulation, the task is similar to a SQuAD (Rajpurkar et al., 2016) style question answering (QA) problem where the question is the context window (neighboring words) surrounding the pronoun to be resolved and the answer is the antecedent of the pronoun. The answer is contained in the provided Wikipedia passage.
The intuition behind using the pronoun's context window as a question is that it allows the model to correctly identify the pronoun to be resolved, since there can be multiple tokens in a passage that match the given pronoun. Previous work has cast coreference resolution as a question answering problem (Kumar et al., 2016), but the questions used in that approach take the form "Who does "she" refer to?". This would necessitate including additional information such as an indicator vector to identify the exact pronoun to be resolved [...] (Levesque et al., 2012) as a question answering problem by including the candidate antecedents as part of the question. A unique feature of the question answering framework we propose (referred to as CorefQA) is that it doesn't require knowledge of the candidate antecedents in order to produce an answer for the pronoun resolution task. The model "learns", from training on the QA version of the shared task dataset, the specific task of extracting the appropriate antecedent of the pronoun given just the Wikipedia passage and the pronoun's context window. We also demonstrate other modeling variants for the shared task that use knowledge of the candidate antecedents A and B. The first variant (CorefQAExt) is an extension of the CorefQA model that uses its predictions to produce probabilities over A, B and N. The second variant (CorefMulti) takes the formulation of a SWAG (Zellers et al., 2018) style multiple choice classification, and the final variant (CorefSeq) takes the standard sequence classification formulation. An ensemble of the CorefQAExt, CorefMulti and CorefSeq models shows further performance gains (23.3% absolute improvement in F1 score).

Data
The dataset used for this shared task is the GAP dataset (Webster et al., 2018), where each row contains a Wikipedia text snippet, the corresponding page's URL, the pronoun to be resolved, the two candidate antecedents (A and B) of the pronoun, the text offsets of A, B and the pronoun, and boolean flags indicating whether the pronoun corefers with A and with B. The Kaggle competition for this shared task was conducted in two stages. Table 1 shows the aggregate statistics for each stage.
The 5-Fold Dev row represents the number of examples used for 5-fold cross validation, stratified by the gender of the pronoun. This could lead to different distributions of A and B during the training of each fold. We chose to do so because we wanted to retain the perfect balance between male and female representations during training and thereby minimize the bias from the data. The columns T, A, B and N refer to the total number of examples and the number of examples where the pronoun's antecedent is A, B and neither, respectively. Note that for the question answering model, we exclude all the "neither" examples from the training data as we don't have an exact answer for them. While this seems destructive, the model doesn't need, by design, explicit supervision on the "neither" examples to predict an antecedent that's neither A nor B. The male and female pronoun examples are equally represented (50-50 split) in the development, validation and test datasets, with the exception of the stage 2 test dataset, which has 377 male and 383 female examples. We use the lowercased BERT WordPiece tokenizer for preprocessing. This comes with a pre-built vocabulary of size 30522.
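The gender-stratified split and the removal of "neither" examples described above can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function names and the `a_coref` / `b_coref` field names are hypothetical, and the round-robin assignment (without shuffling) is an assumption made to keep the example deterministic.

```python
from collections import defaultdict

def gender_stratified_folds(genders, n_folds=5):
    """Assign example indices to folds so that each fold receives a
    near-equal share of every gender label (illustrative sketch)."""
    by_gender = defaultdict(list)
    for idx, gender in enumerate(genders):
        by_gender[gender].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_gender.values():
        # Deal the indices of one gender round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds

def drop_neither(examples):
    """Keep only examples whose antecedent is A or B (QA training set).
    The a_coref / b_coref keys are hypothetical names for the boolean flags."""
    return [ex for ex in examples if ex["a_coref"] or ex["b_coref"]]

# Toy data: 10 male and 10 female pronoun examples.
genders = ["M", "F"] * 10
folds = gender_stratified_folds(genders)
```

With this toy input every fold holds two male and two female examples, while the A/B/N label distribution within a fold is left uncontrolled, as the text notes.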

System Description
The final model used for submission is an ensemble of the question answering (CorefQAExt), multiple choice (CorefMulti) and sequence classification (CorefSeq) models. We describe each of these models in the following sections. We chose the pytorch-pretrained-bert library to implement all models. The source code is available at https://github.com/rakeshchada/corefqa.
The question text Q is the pronoun context window of up to 5 words. The context window is the pronoun itself and its two neighboring words to the left and right. So, if the passage W is "They say John and his wife Carol had a son", then Q would be "John and his wife Carol", assuming "his" is the pronoun to be resolved. When there are fewer than two words on a given side, we just use the words available within the window, so these cases lead to a window with fewer than 5 words. The text at this point is still un-tokenized, so the "words" are just space-separated tokens in a given text. The answer text is either A's or B's name (the "neither" cases have already been filtered out). The rest of the architecture until the Span-wise Max Pooling layer follows the standard SQuAD formulation in (Devlin et al., 2018). It's worth noting that the architecture up to this point (before the Span-wise Max Pooling layer) doesn't use the text or offset information of the candidate antecedents A and B. The output at this intermediate layer (Dense Layer) contains two sets of logits: start and end logits for each token. These can then be used to extract the maximum scoring span as an answer, as demonstrated in (Devlin et al., 2018). We refer to the architecture until the Span-wise Max Pooling Layer as CorefQA.
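The construction of the question text from the pronoun's context window can be sketched as below. This is an illustrative sketch under stated assumptions: the function name is hypothetical, and the pronoun is located through its character offset in a passage whose words are separated by single spaces.

```python
def pronoun_context_window(passage, pronoun_offset, side=2):
    """Return the pronoun plus up to `side` space-separated words on each side.

    `pronoun_offset` is the character offset at which the pronoun starts.
    """
    words = passage.split(" ")
    position = 0
    pronoun_index = None
    for i, word in enumerate(words):
        # Each word occupies the characters [position, position + len(word)).
        if position <= pronoun_offset < position + len(word):
            pronoun_index = i
            break
        position += len(word) + 1  # +1 for the separating space
    if pronoun_index is None:
        raise ValueError("offset does not point inside a word")
    left = max(0, pronoun_index - side)
    right = pronoun_index + side + 1
    # Slicing truncates automatically near the passage boundaries.
    return " ".join(words[left:right])

passage = "They say John and his wife Carol had a son"
question = pronoun_context_window(passage, passage.index(" his ") + 1)
```

For the passage in the running example, `question` is "John and his wife Carol". A pronoun near the start or end of the passage simply yields a shorter window.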

Probability Estimation
The shared task requires the output to be probabilities over the given A, B and N spans. So, we implement a mechanism that combines Span-wise Max Pooling and Logistic Regression to extract probabilities from the start and end logits obtained in the previous step. Since we have access to the offsets of A and B, we simply extract the span logits corresponding to those offsets. Span logits are calculated by taking the maximum value of the individual token logits in a span. This gives us four values that represent the maximum logits for the start and end of the A and B spans. We also calculate the maximum start and end logits over the entire sequence. These six logits are then fed as input features to a multi-class logistic regression. The output of this classifier gives us the desired probabilities $P_A$, $P_B$ and $P_N$. We refer to this end-to-end architecture (from the input layer to the Multi-class Logistic Regression layer) as CorefQAExt.
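The six-feature construction can be illustrated with a short sketch. The function names, the inclusive token-index convention for spans, and the ordering of the six features are assumptions made for illustration; the logit values are toy numbers.

```python
def span_max(logits, start, end):
    """Maximum logit over the tokens start..end (inclusive)."""
    return max(logits[start:end + 1])

def qa_features(start_logits, end_logits, a_span, b_span):
    """Six features: max start/end logits within A, within B, and over the sequence."""
    return [
        span_max(start_logits, *a_span),
        span_max(end_logits, *a_span),
        span_max(start_logits, *b_span),
        span_max(end_logits, *b_span),
        max(start_logits),
        max(end_logits),
    ]

# Toy per-token logits for a six-token sequence.
start_logits = [0.1, 2.0, 0.3, -1.0, 0.5, 1.2]
end_logits = [0.0, 0.4, 2.5, -0.5, 0.2, 3.0]
features = qa_features(start_logits, end_logits, a_span=(1, 2), b_span=(4, 5))
# features == [2.0, 2.5, 1.2, 3.0, 2.0, 3.0]
```

A vector like `features` would then serve as the input row for the multi-class logistic regression that outputs the probabilities over A, B and N.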

Training & Hyperparameters
We use the Adam optimizer with a learning rate of 1e-5, $\beta_1 = 0.9$, $\beta_2 = 0.999$, L2 weight decay of 0.01, learning rate warmup over the first 10% of total training steps, and linear decay of the learning rate. The maximum sequence length is set to 300 and a batch size of 12 is used during training. We use the BERT Large Uncased pre-trained model for initializing the weights of the BERT layers. This model has 24 layers, each producing a 1024-dimensional hidden representation. The whole system is trained in an end-to-end fashion. We fine-tune the last 12 BERT Encoder layers (layer 13 to layer 24) and freeze layers 1 to 12, meaning the parameters of those layers aren't updated during training. This leaves on the order of 150 million trainable parameters. We didn't use any dropout. The hyperparameter C for the logistic regression is set to 0.1. This model was trained for 2 epochs on an NVIDIA K80 GPU. The training with the 5-fold cross validation finished in about 30 minutes. The average of the predictions of each fold on the test dataset is used as the final prediction. We experimented with different choices for each of these hyperparameters, such as freezing or unfreezing more layers, different learning rates and different batch sizes, but these values gave us the best results. Another hyperparameter the model was sensitive to was the context window size. Lower window sizes gave us better results, with 5 being the ideal size.
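A learning rate schedule with warmup over the first 10% of steps followed by linear decay can be sketched as follows. This is a generic illustration: the function name is hypothetical, and the exact shape of the schedule implemented by the library used in the paper may differ (here the rate rises linearly from zero to the peak and then decays linearly to zero).

```python
def learning_rate(step, total_steps, peak_lr=1e-5, warmup_fraction=0.1):
    """Linear warmup to `peak_lr`, then linear decay to zero (illustrative)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed.
        return peak_lr * step / warmup_steps
    # Decay phase: scale by the fraction of post-warmup steps still remaining.
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)

# Rates at a few points of a hypothetical 1000-step run.
schedule = [learning_rate(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

In this sketch the rate reaches its peak of 1e-5 at step 100, is back to half the peak at step 550, and reaches zero at the final step.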

Multiple Choice classification (CorefMulti)
Here, we formulate the task as a SWAG (Zellers et al., 2018) style multiple choice problem among A, B and N classes.

Inputs and Architecture
For each example, we construct four input sequences, each of which contains the concatenation of two sequences S1 and S2. S1 is a concatenation of the given Wikipedia passage with an additional sentence of the form "P is ", where P is the text of the pronoun in question. So, for a passage that ends with the sentence "They say John and his wife Carol had a son", the sequence S1 would be "They say John and his wife Carol had a son. his is ", assuming "his" is the pronoun to be resolved. The sequence S2 is one of A's name, B's name or the word "neither" (for the case where the pronoun in the example doesn't corefer with either A or B). Once we represent the inputs in this fashion, the rest of the architecture follows the design of the BERT-based SWAG task architecture discussed in (Devlin et al., 2018).
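The construction of the S1 and S2 sequences can be sketched as below. The function name is hypothetical, and the sketch follows the worked example literally: it assumes the passage has no trailing period and builds one (S1, S2) pair for each S2 option named in the text (A's name, B's name and the word "neither").

```python
def build_choice_inputs(passage, pronoun, name_a, name_b):
    """Return (S1, S2) pairs for a multiple-choice formulation (illustrative)."""
    # S1: the passage followed by a sentence of the form "P is ".
    s1 = f"{passage}. {pronoun} is "
    # S2: one candidate completion per pair.
    return [(s1, s2) for s2 in (name_a, name_b, "neither")]

pairs = build_choice_inputs(
    "They say John and his wife Carol had a son", "his", "John", "Carol"
)
```

Each pair would then be tokenized and scored by the model, with a softmax over the per-pair scores giving the class probabilities.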

Training & Hyperparameters
We use a batch size of 4 for training, initialize the BERT layers with the weights from the BERT Large Uncased pre-trained model and keep the rest of the hyperparameters the same as the ones used for the CorefQAExt model. Layers 12 to 24 of the BERT Encoder are fine-tuned and the rest of the layers are frozen. We use 5-fold cross validation and average the test predictions from each fold. This model took about 100 minutes to run on Stage 1 data on an NVIDIA K80 GPU.

Sequence classification (CorefSeq)
This involves framing the problem as a standard sequence classification task.

Inputs and Architecture
The input is the given Wikipedia passage without any additional augmentation. The sequence features are extracted by concatenating token embeddings corresponding to the A, B and the pronoun spans. These span embeddings are calculated by concatenating token embeddings of the start token, end token and the result of an element-wise multiplication of start and end token embeddings. The token embeddings are the output of the last encoder layer of the (fine-tuned) BERT. These features are then fed to a single hidden layer feedforward neural network with a ReLU activation. This hidden layer has 512 hidden units. A softmax layer at the output then provides the desired A, B and N probabilities.
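The span feature construction can be sketched with plain lists standing in for BERT's last-layer token embeddings. The function names and the inclusive token-index spans are assumptions, and the toy hidden size of 2 replaces the real 1024 dimensions.

```python
def span_embedding(token_embeddings, start, end):
    """Concatenate the start embedding, the end embedding and their element-wise product."""
    first = token_embeddings[start]
    last = token_embeddings[end]
    product = [x * y for x, y in zip(first, last)]
    return first + last + product

def sequence_features(token_embeddings, a_span, b_span, p_span):
    """Concatenate the span embeddings of A, B and the pronoun."""
    features = []
    for start, end in (a_span, b_span, p_span):
        features.extend(span_embedding(token_embeddings, start, end))
    return features

# Toy embeddings: four tokens with hidden size 2.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
a_features = span_embedding(embeddings, 0, 1)
# a_features == [1.0, 2.0, 3.0, 4.0, 3.0, 8.0]
features = sequence_features(embeddings, (0, 1), (2, 2), (3, 3))
```

Each span contributes three times the hidden size, so three spans with a 1024-dimensional encoder output would give a feature vector of length 3 × 3 × 1024 = 9216 as input to the feedforward network.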

Training & Hyperparameters
A dropout of 0.1 is applied before the inputs are fed from BERT's last encoder layer to the feedforward neural network. The model is trained for 30 epochs with a batch size of 10. Layers 12 to 24 of the BERT Encoder are fine-tuned and the rest of the layers are frozen. A learning rate of 1e-5 is used with a triangular learning rate scheduler (Smith, 2017) whose steps per cycle is set to 100 times the length of the training data. We use 5-fold cross validation and average the test predictions from each fold. This model took 105 minutes to run on Stage 1 data on an NVIDIA K80 GPU.
[...] selects, most of the time, the spans corresponding to named entities as answers even though that constraint wasn't explicitly encoded in its design (sample predictions are shown in Section A of the Supplemental Material). The CorefQA model doesn't produce probabilities over the A, B and N classes as that information isn't available to the model. Hence, we report Log-loss as "N/A" in Table 2. The probabilities from the CorefQAExt, CorefMulti and CorefSeq models are averaged to obtain the ensemble model's probabilities. This ensemble model, with an Overall F1 score of 90.2, improves upon the baseline by 23.3 percentage points. This model ranked ninth on the final leaderboard of the Kaggle competition. The CorefMulti model seemed most robust to bias (0.99). The ensemble model had the best log loss in stage 2 even though the CorefQAExt model had the best Overall F1 score. This might reflect issues with probability calibration. Another explanation might simply be the smaller size of the stage 2 data compared to stage 1. Finally, although the CorefSeq model doesn't individually outperform the other models, we get better ensemble performance by including it than by excluding it.
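The ensembling by probability averaging, together with the per-example multi-class log loss, can be sketched as follows. The function names are hypothetical and the probabilities are toy values, not results from the paper.

```python
import math

def average_probabilities(model_probs):
    """Average the per-class probabilities of several models for one example."""
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]

def log_loss(true_class, probs, eps=1e-15):
    """Negative log probability assigned to the true class, clipped for stability."""
    p = min(max(probs[true_class], eps), 1 - eps)
    return -math.log(p)

# Toy (P_A, P_B, P_N) outputs of three models for a single example.
model_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.8, 0.1, 0.1],
]
ensemble = average_probabilities(model_probs)  # approximately [0.7, 0.2, 0.1]
loss = log_loss(0, ensemble)                   # true class is A
```

Because log loss penalizes confident errors heavily, a model can have the best F1 score and still a worse log loss than an averaged ensemble, which is consistent with the calibration explanation suggested above.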

Freezing BERT weights
We tried freezing all BERT layer weights for some of our initial experiments but didn't see much success, especially when we used the weights from the last encoder layer of BERT. The Stage 1 Overall F1 score for the CorefQAExt model dropped significantly to 63.6% in this setting. This improved to 72.1% when we used layer 18 weights. We also tried concatenating the last four encoder layer outputs of BERT. This resulted in a slightly better Overall F1 score of 74.4% for Stage 1. So, the performance seemed to be sensitive to the choice of the encoder layer outputs. However, in these preliminary experiments, there remained a large gap of about 15% in Overall F1 compared to the fine-tuned model. A more principled and thorough analysis of this phenomenon is an important area of future work.

Post Stage 2 deadline Results
After the competition had finished, we experimented with a few model variations on the final stage 2 test dataset that gave us interesting insights. First, we tried excluding each model from the full ensemble. We noticed that we obtained a better Log Loss of 0.195 when we excluded CorefSeq. This model is listed as QAMul Ensemble in Table 2. We carried out another experiment where we trained CorefQAExt using the cased version of the BERT model. Ensembling the uncased version with this cased version delivered further performance gains (3% absolute F1 improvement upon uncased CorefQAExt). Then, we tried ensembling the cased and uncased versions of all three individual models (CorefQAExt, CorefMulti and CorefSeq) on stage 2 test data. This resulted in an overall F1 score of 94.7%, Male F1 of 94.8%, Female F1 of 94.6%, bias of 1.0 and a log loss of 0.197.
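As a quick arithmetic check of the reported bias, the sketch below assumes that bias is measured as the ratio of female to male F1, as in the GAP evaluation of Webster et al. (2018). That definition is not restated in this paper, so it is an assumption here, and the function name is hypothetical.

```python
def gender_bias(female_f1, male_f1):
    """Ratio of female to male F1; values near 1.0 indicate balanced performance.
    Assumes the GAP-style definition of bias."""
    return female_f1 / male_f1

# Female and male F1 scores reported for the cased + uncased ensemble.
bias = gender_bias(94.6, 94.8)
```

Under this assumption the ratio is about 0.998, which rounds to the reported bias of 1.0.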

Failed Experiments
1. We tried fine-tuning the BERT model in an unsupervised manner by training a language model on the texts extracted from the Wikipedia pages corresponding to the URLs provided in the dataset. The idea behind this one was to see if we can get better BERT layer representations by tuning them to the shared task's dataset. However, this is a computationally expensive step to run and we didn't see promising gains from initial runs. We hypothesize that this may be due to the fact that BERT representations were originally obtained by training on Wikipedia as one of the sources. So, fine-tuning on the task's dataset which is also from Wikipedia might not have added an extra signal.
2. For the CorefMulti model, we tried adding to the token embedding vector an additional entity embedding vector that encodes, at the wordpiece token level, whether the token belongs to A, B or P. We hypothesized that this would help the model focus its attention on the entities relevant to the coreference task. But we weren't able to make successful use of these embeddings to improve the model performance within the competition deadline. However, this is a promising future direction.
3. For the CorefQAExt model, we appended the title extracted from the provided Wikipedia page's URL to the input token sequence to evaluate whether the page URL provides a useful signal to the model. This made the performance slightly worse.

Conclusion
We proposed an extractive question answering (QA) formulation of the pronoun resolution task that uses BERT fine-tuning and shows strong performance on the gender-balanced dataset. We have shown that this system can also effectively extract the antecedent of the pronoun without using knowledge of the candidate antecedents. We demonstrated three other formulations of the task that use this knowledge. The ensemble of all these models obtained further gains (Table 2). This work showed that the pre-trained BERT representations provide a strong signal for the coreference resolution task. Furthermore, thanks to training on the gender-balanced dataset, this modeling framework was able to generate unbiased predictions despite using pre-trained representations. An important direction for future work would be to analyze the gains obtained from BERT representations in more detail and perhaps compare them with alternative contextual token representations and fine-tuning mechanisms (Peters et al., 2018; Howard and Ruder, 2018). We would also like to apply our techniques to the Winograd schema challenge (Levesque et al., 2012), the Definite Pronoun Resolution dataset (Rahman and Ng, 2012) and the Winogender schema dataset (Rudinger et al., 2018), and to explore extensions to other languages, perhaps using the CoNLL 2012 shared task dataset (Pradhan et al., 2012).