Gendered Ambiguous Pronouns Shared Task: Boosting Model Confidence by Evidence Pooling

This paper presents a strong set of results for resolving gendered ambiguous pronouns on the Gendered Ambiguous Pronouns shared task. The model presented here draws upon the strengths of state-of-the-art language and coreference resolution models, and introduces a novel evidence-based deep learning architecture. Injecting evidence from the coreference models complements the base architecture, and analysis shows that the model is not hindered by the weaknesses of those models, specifically gender bias. The modularity and simplicity of the architecture make it easy to extend for further improvement and applicable to other NLP problems. Evaluation on the GAP test data yields state-of-the-art performance at 92.5% F1 (gender bias ratio of 0.97), edging closer to the human performance of 96.6%. The end-to-end solution presented here placed 1st in the Kaggle competition, winning by a significant lead.


Introduction
The Gendered Ambiguous Pronouns (GAP) shared task aims to mitigate the bias observed in the performance of coreference resolution systems when dealing with gendered pronouns. State-of-the-art coreference models suffer from a systematic bias, resolving masculine entities more confidently than feminine entities. To this end, Webster et al. (2018) published the GAP dataset to encourage research into building models and systems that are robust to gender bias.
The arrival of modern language models like ELMo (Peters et al., 2018), BERT (Devlin et al., 2018), and GPT (Radford et al., 2018) has significantly advanced the state of the art in a wide range of NLP problems. All of them share a common theme: a generative language model is pretrained on a large amount of data and subsequently fine-tuned on the target task data. This approach of transfer learning has been very successful. The current work applies the same philosophy and uses BERT as the base model to encode low-level features, followed by a task-specific module that is trained from scratch (fine-tuning BERT in the process). The code is available at https://github.com/sattree/gap.
The GAP shared task presents the general GAP problem in the gold-two-mention (Webster et al., 2018) format and formulates it as a classification problem, where the model must resolve a given pronoun to either of the two given candidates or to neither. Neither instances are particularly difficult to resolve, since they require understanding a wider context and perhaps knowledge of the world. A parallel can be drawn to question-answering systems, where confidently identifying unanswerable questions remains an active research area. Recent work shows that it is possible to determine lack of evidence with greater confidence by explicitly modeling for it. Zhong et al. (2019) and Kundu and Ng (2018) demonstrate specialized deep learning architectures that encode evidence in the input and show significant improvement in identifying unanswerable questions. This paper first introduces a baseline based on a language model. Then, a novel architecture for pooling evidence from off-the-shelf coreference models is presented, which further boosts the confidence of the base classifier and specifically helps in resolving Neither instances. The main contributions of this paper are:
• Demonstrate the effectiveness of pretrained language models and their transferability to establish a strong baseline (ProBERT) for the gold-two-mention shared task.
• Introduce an Evidence Pooling based neural architecture (GREP) to draw upon the strengths of off-the-shelf coreference systems.
• Present the model results that placed 1st in the GAP shared task Kaggle competition.

Data Augmentation: Neither instances
In an attempt to upsample the underrepresented Neither category (Table 2) and boost the classifier's confidence in it, about 250 instances were added manually. These were created by obtaining cluster predictions from the coreference model of Lee et al. (2018) and choosing a pronoun and the two candidate entities A and B from disjoint clusters. However, in the interest of time, this strategy was not fully pursued. Instead, the evidence pooling module was used to address this problem, as will become clear from the discussion in section 6.
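The augmentation heuristic above can be sketched as follows; the function and the data shapes are illustrative only, not the actual code used in this work:

```python
def make_neither_instance(clusters, pronoun):
    """Build a synthetic NEITHER sample: locate the predicted coreference
    cluster containing the pronoun, then draw candidates A and B from two
    other, disjoint clusters, so neither candidate corefers with it."""
    pronoun_cluster = next((c for c in clusters if pronoun in c), None)
    others = [c for c in clusters if c is not pronoun_cluster and c]
    if len(others) < 2:
        return None  # not enough disjoint clusters to sample from
    return {"pronoun": pronoun,
            "A": others[0][0],
            "B": others[1][0],
            "label": "NEITHER"}
```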

Mention Tags
The raw text snippet is manipulated by enclosing the labeled span of each mention with its associated tag, i.e., <P> for the pronoun, <A> for entity mention A, and <B> for entity mention B. The primary reason for doing this is to provide the positional information of the labeled mentions implicitly within the text, as opposed to explicitly through additional features. A secondary motivation was to test the language model's sensitivity to noise in the input text structure, and its ability to adapt the pronoun representation to the positional tags. Figure 2 shows an example of this annotation scheme.

Figure 2: Sample text snippet after annotating the mentions with their corresponding tags. Bob Suter and Dehner were tagged as entities A and B, and the mention 'His' following them was tagged as the pronoun.
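As a sketch, the tagging step might look like the following; the enclosing open/close tag format and the span representation are assumptions for illustration:

```python
def insert_mention_tags(text, spans):
    """Enclose each labeled mention span with its tag.

    spans: (start, end, tag) tuples with offsets into the original
    (untagged) text, e.g. (11, 20, "A") marks text[11:20] as entity A.
    Insertions are applied right-to-left so earlier offsets stay valid.
    """
    for start, end, tag in sorted(spans, reverse=True):
        text = (text[:start] + f"<{tag}> " + text[start:end]
                + f" </{tag}>" + text[end:])
    return text
```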

Label Sanitization
Only samples where labels could be corrected unambiguously based on the snippet-context were corrected. The Wikipedia page-context and url-context were not used. A visualization tool was also developed as part of this work to aid in this activity. Table 2 lists the corpus statistics before and after the sanitization process.

Coreference Signal
Transformer networks have been found to have limited capability in modeling long-range dependencies (Dai et al., 2018; Khandelwal et al., 2018), and coreference resolution may additionally require world knowledge (Lee et al., 2018). Being cognizant of these two factors, it would be useful to inject predictions from off-the-shelf coreference models as an auxiliary source of evidence (with the input text context being the primary evidence source).
Model Architecture

ProBERT: baseline model
ProBERT uses a fine-tuned BERT language model (Devlin et al., 2018; Howard and Ruder, 2018) with a classification head on top to serve as the baseline. The snippet text is augmented with mention-level tags (section 2.2) to capture the positional information of the pronoun, entity A, and entity B mentions, before feeding the text as input to the model. The position-wise token representation corresponding to the pronoun is extracted from the last layer of the language model. With the GAP dataset and WordPiece tokenization (Devlin et al., 2018), all pronouns were found to be single-token entities. Let E_p ∈ R^H (where H is the dimensionality of the language model output) denote the pooled pronoun vector. A linear transformation is applied to it, followed by a softmax, to obtain a probability distribution over the classes A, B, and NEITHER: P = softmax(W^T E_p), where W ∈ R^{H×3} is the linear projection weight matrix. All parameters are jointly trained to minimize cross-entropy loss. This simple architecture is depicted in Figure 1. Only H × 3 new parameters are introduced, allowing the model to use training data more efficiently.
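As a minimal sketch of the classification head in pure Python (no tensor library; E_p and W below are toy values, not trained weights):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def classify_pronoun(e_p, W):
    """Project the pooled pronoun vector e_p (length H) through the
    H x 3 weight matrix W, then normalize to a probability
    distribution over the classes A, B, and NEITHER."""
    logits = [sum(e_p[i] * W[i][k] for i in range(len(e_p)))
              for k in range(3)]
    return softmax(logits)
```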
A natural question arises as to why this model functions so well (see section 5.2) with just the pronoun representation.This is discussed in section 6.1.

GREP: Gendered Resolution by Evidence Pooling
The GREP architecture pairs the simple ProBERT architecture with a novel Evidence Pooling module. The Evidence Pooling (EP) module leverages cluster predictions from pretrained (or heuristics-based) coreference models to gather evidence for the resolution task. The internals of the coreference models are opaque to the system, allowing any evidence source, such as a knowledge base, to be included as well. This design choice prevents gradients from being propagated through the coreference models, so some information is lost and their predictions remain noisy.

Suppose we have access to N off-the-shelf coreference models, and each predicts T_n mentions that are coreferent with the given pronoun. Let P, A, and B refer to the entities labeled in the text snippet as the pronoun and entities A and B, respectively. Without loss of generality, consider the n-th coreference model and the m-th mention in the cluster it predicts. Let E ∈ R^{T×H} denote the position-wise token embeddings obtained from the last layer of the language model for a mention, where T is the number of tokens in the mention. The first step is to aggregate information at the mention level. Self-attention is used to reduce the mention tokens, an operation referred to as AttnPool (attention pooling) hereafter. A single-layer MLP computes a position-wise compatibility score, which is then normalized and used to compute a weighted average over the mention tokens, yielding a pooled representation of the mention:

s_t = MLP(e_t)    (1)
α = softmax(s)    (2)
AttnPool(E) = Σ_t α_t e_t    (3)

In the same way, a pooled representation is obtained for every mention in the cluster predicted by the n-th coreference model, and for the P, A, and B entity mentions. Let A_n ∈ R^{T_n×H} denote the joint representation of the cluster mentions, and A_p, A_a, and A_b the pooled representations of the entity mentions. Next, to compute the compatibility of the cluster with the given entities, the cluster representation is systematically transformed by passing it through a transformer layer (Vaswani et al., 2017). A sequence of such transformations is applied successively, feeding A_p, A_a, and A_b as query vectors at each stage. Each transformer layer consists of multi-head attention and feed-forward (FFN) projection layers; the reader is referred to Vaswani et al. (2017) for details of the MultiHead operation.

The transformed cluster representation C_b is then reduced at the cluster level, and finally at the coreference-model level, by attention pooling:

A_co = AttnPool([AttnPool(C_b^1); …; AttnPool(C_b^N)])    (7)

A_co represents the evidence vector that encodes the information obtained from all the coreference models. Finally, the evidence vector is concatenated with the pronoun representation and fed once again through a linear layer and softmax to obtain class probabilities.
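The AttnPool operation reduces a sequence of token vectors to one pooled vector. A dependency-free sketch, with a stand-in scoring callable in place of the learned single-layer MLP:

```python
import math

def attn_pool(tokens, score):
    """Attention pooling: score each token vector, softmax-normalize
    the scores, and return the weighted average of the vectors."""
    s = [score(t) for t in tokens]
    m = max(s)
    w = [math.exp(x - m) for x in s]
    z = sum(w)
    alphas = [x / z for x in w]
    dim = len(tokens[0])
    return [sum(a * t[i] for a, t in zip(alphas, tokens))
            for i in range(dim)]
```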

Training
All models were trained on 4 NVIDIA V100 GPUs (16GB memory each).
The pytorch-pretrained-bert library was used as the language model module, and saved model checkpoints were used for initialization. The Adam (Kingma and Ba, 2014) optimizer was used with β1 = 0.9, β2 = 0.999, ε = 1e-6, and a fixed learning rate of 4e-6. For regularization, a fixed dropout (Srivastava et al., 2014) rate of 0.1 was used in all layers, and a weight decay of 0.01 was applied to all parameters. Batch sizes of 16 and 8 samples were used for model variants with bert-base and bert-large, respectively. Models with bert-base took about 6 minutes to train, while those with bert-large took up to 20 minutes.
For single model performance evaluation, the models were trained on gap-train, early stopping was based on gap-validation, and gap-test was used for test evaluation. Kaggle competition results were obtained by training models on all datasets, i.e., gap-train, gap-validation, gap-test, and gpr-neither (a total of 4707 samples), in a 5-fold cross-validation (Friedman et al., 2001) fashion. Each model was effectively trained on 3768 samples, while 942 samples were held out for validation. Training terminated upon identifying an optimal early-stopping point based on performance on the validation set, with an evaluation frequency of 80 gradient steps. The model's access is limited to the snippet-context; the Wikipedia page-context is not used. However, page-url context may be used via the coreference signal (Parallelism+URL).
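The 5-fold splitting described above can be sketched as follows; contiguous folds are used here for simplicity, and the actual fold assignment is an assumption:

```python
def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k near-equal contiguous folds and
    yield (train_indices, heldout_indices) pairs, one per fold."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    idx, start = list(range(n)), 0
    for size in sizes:
        held = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, held
        start += size
```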

Results
The performance of the ProBERT and GREP models is benchmarked against results previously established by Webster et al. (2018) and Liu et al. (2019). It is worth noting that Liu et al. do not use the gold-two-mention labeled spans for prediction, and hence their results may not be directly comparable. This section first introduces an estimate of human performance on this task. Then, results for single model performance are presented, followed by the ensembled model results that won the Kaggle competition. F1 scores were obtained using the GAP scorer script provided by Webster et al. Wherever applicable, log loss (the official Kaggle metric) is reported as well.

Human Performance
The rate of errors found in the crowd-sourced labels is taken as a measure of human performance on this task and serves as a benchmark. The corrections are only a best-effort attempt to fix some obvious mistakes found in the dataset labels, and were made with certain considerations (section 2.3). This performance measure is subject to variation based on an evaluator's opinion on ambiguous samples.

Single Model Performance
Single model performance on the GAP test set is shown in Table 3. The GREP model (with bert-large-uncased as the language model) achieves state-of-the-art performance on this task. The model benefits significantly from evidence pooling, gaining 6 points in log loss and 2.8 points in F1. Further analysis of the source of these gains is presented in section 6.
While it may seem that the significantly improved performance of GREP comes at a small cost in gender bias, note that the model's performance improves for both genders. The gains on masculine instances are larger than those on feminine instances, and the slight degradation in the bias ratio is a manifestation of this. The superior performance of GREP provides evidence that, for a given sample context, the model architecture is able to discriminate between the coreference signals and identify their usefulness.

Kaggle Competition
To encourage fairness in modeling, the competition was organized in two stages. This strategy eliminates any attempts at leaderboard probing and other such malpractices. Furthermore, models were frozen at the end of stage 1 and were only allowed to operate in inference mode to generate predictions for the stage 2 test submission. Additionally, no feedback (in terms of performance score) was provided on stage 2 submissions until the end of the competition. The GREP model was trained as described in section 4, and the out-of-fold (OOF) error on the held-out samples is reported. The experiments were repeated with 5 different random seeds (42, 59, 75, 46, 91) for initialization. Finally, two sets of models were trained, with bert-large-uncased and bert-large-cased as the language models. This overall scheme leads to 50 models being trained in total, and 50 sets of predictions being generated on the stage 2 test data. To generate predictions for submission, ensembling is done by simply taking the unbiased weighted mean over the 50 individual prediction sets.
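The final ensembling step amounts to averaging the 50 per-sample class distributions. A sketch, with uniform weights by default (the exact weighting used in the submission is not specified here):

```python
def ensemble(prediction_sets, weights=None):
    """Combine several models' predictions by a weighted mean of their
    per-sample class probabilities, renormalizing each result."""
    n = len(prediction_sets)
    weights = weights or [1.0 / n] * n
    combined = []
    for sample in zip(*prediction_sets):  # one row per model
        probs = [sum(w * row[k] for w, row in zip(weights, sample))
                 for k in range(len(sample[0]))]
        z = sum(probs)
        combined.append([p / z for p in probs])
    return combined
```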
Table 4 presents a granular view of the winning model's performance. This performance comes very close to human performance and shows almost no gender bias. As the table shows, the ensembled models achieve much larger gains in log loss than in F1. This is expected, since the committee of models makes more confident decisions on "easier" examples. Two insights can be drawn by comparing these results with the single model performance presented in section 5.2: (1) model accuracy benefits from more training data, although the gains are marginal at best (92.5 vs. 93.9) given that the model was trained on approximately twice the amount of data; (2) ensembling has an effect similar to evidence pooling, i.e., models become more confident in their predictions.

Discussion
The results in section 5 establish the superior performance of GREP compared to ProBERT. This can be attributed to two sources: (1) GREP corrects some errors made by ProBERT, reflected in F1; and (2) where predictions are correct, GREP is more confident in them, reflected in log loss. To investigate this, error analysis was performed on gap-test.
Figure 5 shows a class-wise comparison of the probabilities generated by the two models. GREP is more confident in its predictions (all distributions appear translated closer to 1.0), and the improvement is overwhelmingly evident for the NEITHER class. To understand the difference between the two models, confusion matrix statistics are presented in Table 5. The diagonal terms show the number of instances on which the two models agree, and the off-diagonal terms show where they disagree. The numbers reveal that the evidence pooling module not only boosts model confidence but also helps in correctly resolving Neither instances (44 vs. 11), indicating that the model is successfully able to build evidence for or against the given candidates.
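Agreement statistics of this kind can be computed with a simple tally over per-sample predictions; the following sketch counts right/wrong agreement rather than the full class-wise breakdown, and the names are illustrative:

```python
def comparison_counts(preds_a, preds_b, labels):
    """Tally how often two models are each right or wrong on the same
    samples: diagonal entries = agreement on correctness, off-diagonal
    entries = samples one model resolves correctly and the other misses."""
    counts = {(a, b): 0
              for a in ("right", "wrong") for b in ("right", "wrong")}
    for pa, pb, y in zip(preds_a, preds_b, labels):
        key = ("right" if pa == y else "wrong",
               "right" if pb == y else "wrong")
        counts[key] += 1
    return counts
```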
Appendix A details the behavior of GREP through some examples. The first example is particularly interesting: while it is trivial for a human to resolve, a machine would require knowledge of the world to understand "death" and its implications.

Unreasonable Effectiveness of ProBERT
It may seem unreasonable that ProBERT performs so well given the noisy input text (due to mention tags) and that it makes the classification decision by looking at the pronoun alone. Two theories may explain this behavior: (1) attention heads in the (BERT) transformer architecture are able to specialize the pronoun representation in the presence of the supervision signal; (2) the nature of dropout (present in every layer) makes the model immune to a small amount of noise, while at the same time preventing it from ignoring the tags. An analysis of the attention heads to investigate these claims is left for future work.

Conclusion
A strong set of results has been established for the shared task. The work presented in this paper makes it feasible to efficiently employ neural attention for pooling information from auxiliary sources of global knowledge. The evidence pooling mechanism introduced here is able to leverage the strengths of off-the-shelf coreference solvers without being hindered by their weaknesses (gender bias). A natural extension of the GREP model would be to solve the gendered pronoun resolution problem beyond the scope of the gold-two-mention task, i.e., without access to the labeled gold spans.
Tables 6, 7, 8 and Figure 6 show an example of how incorporating evidence from the coreference models helps GREP correct a prediction error made by ProBERT. While the example is trivial for a human to resolve, a machine would require knowledge of the world to understand "death" and its implications. ProBERT is unsure about the resolution and assigns comparable probabilities to entities A and B. GREP, on the other hand, shifts nearly all of the probability mass from B to the correct resolution A, in light of the strong evidence presented by the coreference solvers. Figure 6 illustrates an interesting phenomenon: while e2e-coref groups the pronoun and both entities A and B into the same cluster, the model architecture is able to harvest information from the AllenNLP predictions, propagating the belief that entity A must be the better candidate. These observations indicate that by pooling evidence from various sources, the model is able to reason over a larger space and build a rudimentary form of world knowledge.
Tables 9, 10, 11 and Figure 7 show a second example. This one is not easy even for a human to resolve without reading and understanding the full context. A model may find this situation adverse, given the presence of many named entities as distractors; the url-context can also be misleading, since the pronoun referent is not the subject of the article. Nevertheless, the model is able to successfully build evidence against the given candidates and resolve the instance with a very high confidence of 92.5%. Finally, a third example is shown in Tables 12 and 13. It shows that the model does not simply make a majority decision, but rather considers interactions between the global structures exposed by the various evidence sources.

Figure 1: ProBERT (Pronoun BERT). Token embeddings corresponding to the labeled pronoun in the input text are extracted from the last layer of the language model (BERT) and used for prediction.

Figure 5: Comparison of probabilities assigned by ProBERT and GREP. The figures show the distribution of predicted class probabilities assigned by each model to samples from each class.

Table 1: Corpus statistics. Masculine (M) and feminine (F) instances were identified based on the gender of the pronoun mention labeled in the sample. † Only a subset of these may have been used for final evaluation.

Table 2: GAP dataset label distribution before and after sanitization. (-x) indicates the number of samples that were moved out of a given class, and (+x) indicates the number of samples that were added post-sanitization.

Table 3: Single model performance on the gap-test set by gender. M: masculine, F: feminine, B: (bias) ratio of feminine to masculine performance, O: overall. Log loss is not available for systems that only produce labels. † As reported by Webster et al. (2018). ‡ As reported by Liu et al. (2019); their model does not use gold-two-mention labeled span information for prediction.

Table 4: GREP model performance in the Kaggle competition. Out-of-fold (OOF) error is reported on all data, i.e., gap-development, gap-validation, gap-test, and gpr-neither, as well as on gap-test explicitly, for comparison against single model performance results. Since early stopping is based on OOF samples, the OOF errors reported here cannot be considered an estimate of test error. Nevertheless, stage 2 test performance benchmarks the model. † Due to a bug, the model did not fully leverage coref evidence; further gains are expected with the fixed version.

Table 5: Class-wise comparison of model accuracy for ProBERT and GREP. Off-diagonal terms show cases where GREP fixes errors made by ProBERT, and vice versa.