Resolving Gendered Ambiguous Pronouns with BERT

Pronoun resolution is part of coreference resolution, the task of pairing an expression with the entity it refers to. This is an important task for natural language understanding and a necessary component of machine translation systems, chat bots and assistants. Neural machine learning systems perform far from ideally in this task, reaching F1 scores as low as 73% on modern benchmark datasets. Moreover, they tend to perform better for masculine pronouns than for feminine ones. Thus, the problem is both challenging and important for NLP researchers and practitioners. In this project, we describe our BERT-based approach to solving the problem of gender-balanced pronoun resolution. We are able to reach a 92% F1 score and a much lower gender bias on the benchmark dataset shared by the Google AI Language team.


Introduction
In this work, we are dealing with gender bias in pronoun resolution. A more general task of coreference resolution is reviewed in Sec. 2. In Sec. 3, we give an overview of a related Kaggle competition. Then, Sec. 4 describes the GAP dataset and Google AI's heuristics to resolve pronominal coreference in a gender-agnostic way, so that pronoun resolution is done equally well in cases of masculine and feminine pronouns. In Sec. 5, we provide the details of our BERT-based solution, while in Sec. 6 we analyze the pleasantly low gender bias of our system (our code is shared on GitHub 1 ). Lastly, in Sec. 7, we draw conclusions and express some ideas for further research.

Related work
Popular approaches to coreference resolution 2 include rule-based, mention-pair, mention-ranking, and clustering methods. Among rule-based approaches, the naïve Hobbs algorithm (Hobbs, 1986), in spite of being naïve, showed state-of-the-art performance on the OntoNotes dataset 3 up to 2010.
Recent state-of-the-art approaches (Peters et al., 2018a) are pretty complex examples of mention ranking systems. The 2017 version is the first end-to-end coreference resolution model that didn't utilize syntactic parsers or hand-engineered mention detectors. Instead, it used LSTMs and an attention mechanism to improve over previous NN-based solutions.
Some more state-of-the-art coreference resolution systems are reviewed in (Webster et al., 2018) as well as popular datasets with ambiguous pronouns: Winograd schemas (Levesque et al., 2012), WikiCoref (Ghaddar and Langlais, 2016), and The Definite Pronoun Resolution Dataset (Pradhan et al., 2007). We also refer to the GAP paper for a brief review of gender bias in machine learning.
We further note that the e2e-coref model, in spite of being state-of-the-art in coreference resolution, didn't show good results in the pronoun resolution task that we tackled, so we only used e2e-coref predictions as an additional feature.

Kaggle competition "Gendered Pronoun Resolution"
Following the Kaggle competition "Gendered Pronoun Resolution", 4 for each abstract from Wikipedia pages we are given a pronoun, and we try to predict the right coreference for it, i.e. to which named entity (A or B) it refers. Let's take a look at this simple example:
"John entered the room and saw [A] Julia. [Pronoun] She was talking to [B] Mary Hendriks and looked so extremely gorgeous that John was stunned and couldn't say a word."
Here "Julia" is marked as entity A, "Mary Hendriks" as entity B, and the pronoun "She" is marked as Pronoun. In this particular case the task is to correctly identify to which entity the given pronoun refers.
If we feed this sentence into a coreference resolution system (see Fig. 1 and online demo 5 ), we see that it correctly identifies that "she" refers to Julia. It also correctly clusters together two mentions of "John" and detects that Mary Hendriks is a two-word span.
However, in an abstract like the following one, it is pretty hard to resolve the coreference:
"Roxanne, a poet who now lives in France. Isabel believes that she is there to help Roxanne during her pregnancy with her toddler infant, but later realizes that her father and step-mother sent her there so that Roxanne would help the shiftless Isabel gain some direction in life. Shortly after she (pronoun) arrives, Roxanne confides in Isabel that her French husband, Claude-Henri has left her."
Google AI and Kaggle (organizers of this competition) provided the GAP dataset (Webster et al., 2018) with 4454 snippets from Wikipedia articles; in each of them, named entities A and B are labeled along with a pronoun. The dataset is labeled, i.e. for each sentence a correct coreference is specified as one of three mutually exclusive classes: A, B, or "Neither". Thus, the prediction task is actually a multiclass classification problem.
Moreover, the dataset is balanced w.r.t. masculine and feminine pronouns. Thus, the competition was supposed to address the problem of building a coreference resolution system which is not susceptible to gender bias, i.e. works equally well for masculine and feminine pronouns.
These are the columns provided in the dataset (Webster et al., 2018):
• ID - unique identifier for an example (matches Id in the output file format)
• Text - text containing the ambiguous pronoun and two candidate names (about a paragraph in length)
• Pronoun - target pronoun (text)
• Pronoun-offset - character offset of Pronoun in Text
• A - first name candidate (text)
• A-offset - character offset of name A in Text
• B - second name candidate
• B-offset - character offset of name B in Text
• URL - URL of the source Wikipedia page for the example
The evaluation metric chosen for the competition 6 is multiclass logarithmic loss. Each pronoun has been labeled with whether it refers to A, B, or "Neither". For each pronoun, a set of predicted probabilities (one for each class) is submitted. The formula is then
$$\mathrm{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log p_{ij},$$
where $N$ is the number of samples in the test set, $M$ is 3, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.
Unfortunately, the chosen evaluation metric does not reflect the above-mentioned goal of building a gender-unbiased coreference resolution algorithm, i.e. the metric does not account for gender imbalance: logarithmic loss may not reflect the fact that e.g. predicted pronoun coreference is much worse for masculine pronouns than for feminine ones. Therefore, we explore gender bias separately in Sec. 6 and compare our results with those published by the Google AI Language team (reviewed in Sec. 4).
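To make the metric concrete, here is a minimal sketch of multiclass logarithmic loss as defined in the text (mean negative natural log of the probability assigned to the true class). The function name, the clipping bound `eps` and the row renormalisation are assumptions added for numerical safety; they are not taken from the competition's scoring code.

```python
import math

def multiclass_log_loss(labels, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    labels: class index per sample (0 = A, 1 = B, 2 = Neither).
    probs:  one row of M = 3 predicted probabilities per sample.
    eps:    clipping bound (an assumption, keeps log() finite).
    """
    total = 0.0
    for label, row in zip(labels, probs):
        clipped = [min(max(p, eps), 1.0 - eps) for p in row]
        norm = sum(clipped)  # renormalise so that the row sums to 1
        # y_ij is 1 only for the true class, so the inner sum over j
        # collapses to a single term.
        total += math.log(clipped[label] / norm)
    return -total / len(labels)

labels = [0, 1, 2]
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3          # no information at all
overconfident = [[0.01, 0.98, 0.01]] * 3       # always bets on class B

loss_uniform = multiclass_log_loss(labels, uniform)
loss_overconfident = multiclass_log_loss(labels, overconfident)
print(loss_uniform, loss_overconfident)
```

In this toy example the uniform prediction scores ln 3, while the overconfident one is penalised heavily for the two samples where it is wrong.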

Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns
The Google AI Language team addresses the problem of gender bias in pronoun resolution (when systems favor masculine entities) and presents a gender-balanced labeled corpus of 8,908 ambiguous pronoun-name pairs sampled to provide diverse coverage of challenges posed by real-world text (Webster et al., 2018) (further referred to as the GAP dataset). They run 4 state-of-the-art coreference resolution models (Lee et al., 2013; Clark and Manning, 2015; Wiseman et al., 2016) on the OntoNotes and GAP datasets, reporting F1 scores separately for masculine and feminine pronoun-named entity pairs (metrics M and F in the paper). They also measure "gender bias", defined as B = F / M. In general, they conclude, these models perform better for masculine pronoun-named entity pairs, but pronoun resolution is still challenging: all achieved F1 scores are less than 0.7 for both datasets. Further, they propose simple heuristics (called surface, structural and Wikipedia cues). The best reported cues are "Parallelism" (if the pronoun is a subject or direct object, select the closest candidate with the same grammatical argument) and "URL" (select the syntactically closest candidate which has a token overlap with the page title). They compare the performance of the "Parallelism + URL" cue with e2e-coref on the GAP dataset and, surprisingly enough, conclude that the heuristics work better, achieving better F1 scores (0.742 for M and 0.716 for F) while at the same time being less gender-biased (some of the heuristics are totally gender-unbiased; for "Parallelism + URL", B = F / M = 0.96).
Finally, they explored the Transformer architecture (Vaswani et al., 2017) for this task and observed that the coreference signal is localized on specific heads and that these heads are in the deep layers of the network. In Sec. 5 we confirm this observation. Specifically, they select the candidate which attends most to the pronoun ("Transformer heuristic" in the paper). Even though they conclude that Transformer models implicitly learn language understanding relevant to coreference resolution, in terms of F1 scores they didn't make it work better than e2e-coref or Parallelism cues (F1 scores lower than 0.63). Moreover, the proposed Transformer heuristics are a bit biased towards masculine pronouns, with B from 0.95 to 0.98.
Further we report a much stronger gender-unbiased BERT-based (Devlin et al., 2018) pronoun resolution system.
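As a small worked example of the bias measure B = F / M, the sketch below recomputes the value quoted for the "Parallelism + URL" cue from its two F1 scores. The helper names are hypothetical, and the F1-from-counts function is included only to recall how M and F themselves are obtained.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def gender_bias(f1_feminine, f1_masculine):
    """Bias B = F / M; values below 1 mean feminine pronouns are resolved worse."""
    return f1_feminine / f1_masculine

# F1 scores quoted in the text for the "Parallelism + URL" cue.
bias = gender_bias(0.716, 0.742)
print(round(bias, 2))  # 0.96
```

A value of exactly 1 would mean that masculine and feminine pronouns are resolved equally well in terms of F1.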
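The "Transformer heuristic" can be illustrated with a toy sketch: given the attention weights of a single head, pick the candidate whose tokens attend most strongly to the pronoun. Everything below is an assumption made for illustration (the function name, taking the maximum over the tokens of a multi-token span, and the hand-written weights, which are not real model output); the GAP paper should be consulted for the exact procedure.

```python
def pick_by_attention(attn, pronoun_idx, candidates):
    """Select the candidate span that attends most to the pronoun.

    attn[i][j]:  weight with which token i attends to token j (one head).
    candidates:  dict mapping a label to the token indices of that span.
    """
    scores = {
        name: max(attn[i][pronoun_idx] for i in span)  # span score: max over its tokens (assumed)
        for name, span in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

tokens = ["John", "saw", "Julia", ".", "She", "greeted", "Mary", "Hendriks"]
n = len(tokens)
attn = [[0.0] * n for _ in range(n)]  # toy weights, filled in by hand
attn[2][4] = 0.6  # "Julia"    -> "She"
attn[6][4] = 0.2  # "Mary"     -> "She"
attn[7][4] = 0.1  # "Hendriks" -> "She"

best, scores = pick_by_attention(attn, pronoun_idx=4, candidates={"A": [2], "B": [6, 7]})
print(best, scores)  # A {'A': 0.6, 'B': 0.2}
```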

System
BERT (Devlin et al., 2018) is a transformer architecture, pre-trained on a large corpus (Wikipedia + BookCorpus), with 12 to 24 transformer layers. In the large (24-layer) model, each layer learns a 1024-dimensional representation of the input token, with layer 1 being similar to a standard word embedding and layer 24 specialized for the task of predicting missing words from context. At the same time, BERT embeddings are learned for a second auxiliary task of resolving whether the second of two sentences actually follows the first one or not.
In general, motivated by (Tenney et al., 2019), we found that BERT provides very good token embeddings for the task at hand.
Our proposed pipeline is built upon solutions by teams "Ken Krige" and "[ods.ai] five zeros" (placed 5 and 22 in the final leaderboard 7 respectively). The way these two teams approached the competition task is described in two Kaggle posts. 8, 9 The combined pipeline includes several subroutines:
• Extracting BERT-embeddings for named entities A, B, and pronouns
• Fine-tuning BERT classifier
• Hand-crafted features
• Neural network architectures
• Correcting mislabeled instances

Extracting BERT-embeddings for named entities A, B, and pronouns
We concatenated embeddings for entities A, B, and Pronoun taken from Cased and Uncased large BERT "frozen" (not fine-tuned) models. 10 We noticed that extracting embeddings from intermediate layers (from -4 to -6) worked best for the task. We also added pointwise products of embeddings for Pronoun and entity A, Pronoun and entity B, as well as AB - PP. The first of these embedding vectors expresses similarity between the pronoun and A, the second one expresses similarity between the pronoun and B, and the third vector is supposed to represent the extent to which entities A and B are similar to each other but differ from the Pronoun.
7 https://www.kaggle.com/c/gendered-pronoun-resolution/leaderboard
8 https://www.kaggle.com/c/gendered-pronoun-resolution/discussion/90668
9 https://www.kaggle.com/c/gendered-pronoun-resolution/discussion/90431
10 https://github.com/google-research/bert
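The combination of embeddings described in this subsection can be sketched as follows. This is an illustration under stated assumptions: "AB - PP" is read here as the pointwise product of A and B minus the pointwise square of Pronoun, the function names are hypothetical, and tiny 2-dimensional vectors stand in for real BERT embeddings.

```python
def pointwise(u, v):
    """Element-wise (Hadamard) product of two equal-length vectors."""
    return [x * y for x, y in zip(u, v)]

def build_features(a, b, p):
    """Concatenate A, B, Pronoun with the three interaction vectors."""
    ap = pointwise(a, p)                      # similarity between Pronoun and A
    bp = pointwise(b, p)                      # similarity between Pronoun and B
    ab_minus_pp = [x - y for x, y in zip(pointwise(a, b), pointwise(p, p))]
    return a + b + p + ap + bp + ab_minus_pp  # list concatenation

# Toy 2-d "embeddings"; real ones would be slices of BERT hidden states.
a = [1.0, 2.0]
b = [3.0, -1.0]
p = [2.0, 0.5]

features = build_features(a, b, p)
print(len(features), features)
```

With d-dimensional inputs the resulting vector has 6d entries; how this maps onto the exact input sizes used in the final networks is not spelled out here.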

Fine-tuning BERT classifier
Apart from extracting embeddings from the original BERT models, we also fine-tuned a BERT classifier for the task at hand. We made appropriate changes to the "run_classifier.py" script from Google's repository. 11 Preprocessing input data for the BERT input layer included stripping the text to 64 symbols, then splitting it into 4 segments, running BERT WordPiece for each segment, adding start and end tokens (with truncation if needed) and concatenating the segments back together. The whole preprocessing is reproduced in a Kaggle Kernel 12 as well as in our final code on GitHub. 13

Hand-crafted features
Apart from BERT embeddings, we also added 69 features which can be grouped into several categories:
• Neuralcoref, 14 Stanford CoreNLP (Manning et al., 2014) and e2e-coref model predictions. It turned out that these models did not perform really well on the task at hand, but their predictions worked well as additional features.
• Predictions of a Multi-Layered Perceptron trained with ELMo (Peters et al., 2018b) embeddings.
• Syntactic roles of entities A, B, and Pronoun (subject, direct object, attribute etc.) extracted with SpaCy. 15
• Positional and frequency-based features (distances between A, B, Pronoun and derivations, whether they all are in the same sentence or Pronoun is in the following one etc.). Many of these features were motivated by the Hobbs algorithm (Hobbs, 1986) for coreference resolution.
• Named entities predicted for A and B with SpaCy.
• GAP heuristics outlined in the corresponding paper (Webster et al., 2018) and briefly discussed in Sec. 4.
We need to mention that adding all these features had only a minor effect on the quality of pronoun resolution (it resulted in a 0.01 decrease in logarithmic loss when measured on the Kaggle test dataset) as compared to e.g. fine-tuning the BERT classifier.
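To give a flavour of the positional features, the sketch below derives a few of them from the character offsets provided in the dataset. It is only an illustration: the feature names are hypothetical, the same-sentence check is a naive punctuation test rather than a real sentence splitter, and this is not the actual set of 69 features.

```python
def positional_features(text, pronoun_offset, a_offset, b_offset):
    """Toy positional features computed from character offsets."""
    def same_sentence(i, j):
        lo, hi = sorted((i, j))
        # Naive check: no sentence-final punctuation between the two offsets.
        return not any(ch in ".!?" for ch in text[lo:hi])

    return {
        "dist_a": pronoun_offset - a_offset,   # positive: A precedes the pronoun
        "dist_b": pronoun_offset - b_offset,   # negative: B follows the pronoun
        "a_same_sentence": same_sentence(a_offset, pronoun_offset),
        "b_same_sentence": same_sentence(b_offset, pronoun_offset),
    }

text = "John entered the room and saw Julia. She was talking to Mary Hendriks."
feats = positional_features(
    text,
    pronoun_offset=text.index("She"),
    a_offset=text.index("Julia"),
    b_offset=text.index("Mary Hendriks"),
)
print(feats)
```

For this shortened version of the earlier example, A ("Julia") sits 7 characters before the pronoun in the previous sentence, while B ("Mary Hendriks") follows the pronoun within the same sentence.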

Neural network architectures
The final setup includes:
• 6 independently trained fine-tuned BERT classifiers with the preprocessing described in Subsec. 5.2. In Tables 1, 2, and 3, we refer to their averaged prediction as that of the "fine-tuned" model.
• 5 multi-layered perceptrons trained with different combinations of BERT embeddings for A, B, Pronoun (see Subsec. 5.1) and hand-crafted features (see Subsec. 5.3), all together referred to as "frozen" in Tables 1, 2, and 3. Using MLPs with pre-trained BERT embeddings is motivated by (Tenney et al., 2019). Two MLPs, separate for the Cased and Uncased BERT models, both take 9216-d input and output 112-d vectors. Two Siamese networks were trained on top of distances between Pronoun and A-embeddings, Pronoun and B-embeddings as inputs. One more MLP took only the 69-dimensional feature vectors as input. Finally, a single dense layer mapped the outputs of the mentioned 5 models into 3 classes corresponding to named entities A, B or "Neither".
• Blending involves taking the predicted probabilities for A, B and "Neither" with weight 0.65 for the "fine-tuned" model and summing the result with 0.35 times the corresponding probabilities output by the "frozen" model.
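The blending step is a plain weighted average of two probability vectors. The sketch below uses the 0.65 / 0.35 weights from the text; the function name and the example probabilities are made up for illustration.

```python
import math

def blend(p_finetuned, p_frozen, w_finetuned=0.65, w_frozen=0.35):
    """Weighted average of class probabilities for (A, B, Neither)."""
    return [w_finetuned * a + w_frozen * b for a, b in zip(p_finetuned, p_frozen)]

# Hypothetical predictions for one example whose true class is B (index 1).
p_finetuned = [0.98, 0.01, 0.01]   # confident, but wrong
p_frozen = [0.60, 0.30, 0.10]      # less extreme
p_blend = blend(p_finetuned, p_frozen)

true_class = 1
loss_finetuned = -math.log(p_finetuned[true_class])
loss_blend = -math.log(p_blend[true_class])
print(p_blend, loss_finetuned, loss_blend)
```

Because the two weights sum to 1, the blended vector is still a probability distribution. In this toy case the per-example logarithmic loss drops from about 4.6 to about 2.2, even though the predicted class (A) does not change.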
In the next section, we perform an analysis identical to the one done in (Webster et al., 2018) to measure the quality of pronoun resolution and the severity of gender bias in the task at hand.

Correcting mislabeled instances
During the competition, 158 label corrections were proposed for the GAP dataset 16 for cases when Pronoun is said to mention A but actually mentions B and vice versa. For the GAP test set, this resulted in 66 pronoun coreferences being corrected. It's important to mention that the observed mislabeling is a bit biased against feminine pronouns (39 mislabeled feminine pronouns versus 27 mislabeled masculine ones), and it turned out that most of the gender bias for F1 score and accuracy comes from these mislabeled examples.

Results
In Table 1, we report the logarithmic loss that we got on the GAP test ("gap-test.tsv") and Kaggle test (Stage 2) datasets. Kaggle competition results can also be seen on the final competition leaderboard. 17 We report GAP test results as well to further compare with the results reported in the GAP paper: measured are logarithmic loss, F1 score and accuracy for masculine and feminine pronouns (Table 2). Logarithmic loss and accuracy are computed for a 3-class classification problem (A, B, or Neither), while F1 is computed for a 2-class problem (A or B) to compare with the results reported by the Google AI Language team in (Webster et al., 2018). We also incorporated 66 label corrections as described in Subsec. 5.5 and, interestingly enough, this led to the conclusion that with corrected labels, models are less susceptible to gender bias. Table 3 reports the same metrics in the case of corrected labeling, and we see that in this case the proposed models are almost gender-unbiased.
These results imply that:
• Overall, in terms of F1 score, the proposed solution compares very favorably with the results reported in the GAP paper, achieving an overall F1 score as high as 0.911, compared to 0.729 for the "Parallelism + URL" heuristic from (Webster et al., 2018);
• Blending model predictions improves logarithmic loss considerably but does not impact F1 score and accuracy that much. This can be explained: logarithmic loss is high for predictions that are confident and at the same time incorrect. Blending averages predicted probabilities so that they end up less extreme (not so close to 0 or 1);
• With original labeling, all models are somewhat susceptible to gender bias, especially in terms of logarithmic loss. However, in terms of F1 score, gender bias is still less than for e2e-coref and the "Parallelism + URL" heuristic reported in (Webster et al., 2018);
• Fixing some incorrect labels almost eliminates gender bias in terms of F1 score and accuracy of pronoun resolution.

Conclusions and further work
We conclude that we managed to propose a BERT-based approach to pronoun resolution which results in considerably better quality (as measured in terms of F1 score and accuracy) than pronoun resolution done with the heuristics described in the GAP paper. Moreover, the proposed solution is almost gender-unbiased: pronoun resolution is done almost equally well for masculine and feminine pronouns.
Further we plan to investigate which semantic and syntactic information is carried by different BERT layers and how it relates to coreference resolution. We are also going to benchmark our system on the OntoNotes, Winograd, and DPR datasets.