WikiCREM: A Large Unsupervised Corpus for Coreference Resolution

Pronoun resolution is a major area of natural language understanding. However, large-scale training sets are still scarce, since manually labelling data is costly. In this work, we introduce WikiCREM (Wikipedia CoREferences Masked), a large-scale, yet accurate dataset of pronoun disambiguation instances. We use a language-model-based approach for pronoun resolution in combination with our WikiCREM dataset. We compare a series of models on a collection of diverse and challenging coreference resolution problems, where we match or outperform previous state-of-the-art approaches on 6 out of 7 datasets, including GAP, DPR, WNLI, PDP, WinoBias, and WinoGender. We release our model to be used off-the-shelf for solving pronoun disambiguation.


Introduction
Pronoun resolution, also called coreference or anaphora resolution, is a natural language processing (NLP) task that aims to link pronouns to their referents. This task is of crucial importance in various other NLP tasks, such as information extraction (Nakayama, 2019) and machine translation (Guillou, 2012). Due to its importance, pronoun resolution has seen a series of different approaches, such as rule-based systems (Lee et al., 2013) and end-to-end-trained neural models (Lee et al., 2017; Liu et al., 2019). However, the recently released GAP dataset (Webster et al., 2018) shows that most of these solutions perform worse than naïve baselines when the answer cannot be deduced from the syntax. Addressing this drawback is difficult, partially due to the lack of large-scale challenging datasets needed to train the data-hungry neural models.
As observed by Trinh and Le (2018), language models are a natural approach to pronoun resolution, by selecting the replacement for a pronoun that forms the sentence with highest probability. Additionally, language models have the advantage of being pre-trained on a large collection of unstructured text and then fine-tuned on a specific task using much less training data. This procedure has obtained state-of-the-art results on a series of natural language understanding tasks (Devlin et al., 2018).
In this work, we address the lack of large training sets for pronoun disambiguation by introducing a large dataset that can be easily extended. To generate this dataset, we find passages of text where a personal name appears at least twice and mask one of its non-first occurrences. To make the disambiguation task more challenging, we also ensure that at least one other distinct personal name is present in the text in a position before the masked occurrence. We instantiate our method on English Wikipedia and generate the Wikipedia CoREferences Masked (WIKICREM) dataset with 2.4M examples, which we make publicly available for further usage 1 . We show its value by using it to fine-tune the BERT language model (Devlin et al., 2018) for pronoun resolution.
To show the usefulness of our dataset, we train several models that cover three real-world scenarios: (1) when the target data distribution is completely unknown, (2) when training data from the target distribution is available, and (3) the transductive scenario, where the unlabeled test data is available at the training time. We show that fine-tuning BERT with WIKICREM consistently improves the model in each of the three scenarios, when evaluated on a collection of 7 datasets. For example, we outperform the state-of-the-art approaches on GAP (Webster et al., 2018), DPR (Rahman and Ng, 2012), and PDP (Davis et al., 2017) by 5.9%, 8.4%, and 12.7%, respectively. Additionally, models trained with WIKICREM show increased performance and reduced bias on the gender diagnostic datasets WINOGENDER (Rudinger et al., 2018) and WINOBIAS (Zhao et al., 2018).

Related Work
There are several large and commonly used benchmarks for coreference resolution (Pradhan et al., 2012; Schäfer et al., 2012; Ghaddar and Langlais, 2016). However, Webster et al. (2018) argue that a high performance on these datasets does not correlate with a high accuracy in practice, because examples where the answer cannot be deduced from the syntax (we refer to them as hard pronoun resolution) are underrepresented. Therefore, several hard pronoun resolution datasets have been introduced (Webster et al., 2018; Rahman and Ng, 2012; Rudinger et al., 2018; Davis et al., 2017; Zhao et al., 2018; Emami et al., 2019). However, they are all relatively small, often created only as a test set.
Therefore, most of the pronoun resolution models that address hard pronoun resolution rely on little (Liu et al., 2019) or no training data, via unsupervised pre-training (Trinh and Le, 2018; Radford et al., 2019). Another approach involves using external knowledge bases (Emami et al., 2018; Fähndrich et al., 2018); however, the accuracy of these models still lags behind that of the aforementioned pre-trained models.
A similar approach to ours for unsupervised data generation and language-model-based evaluation has been recently presented in our previous work (Kocijan et al., 2019). We generated MASKEDWIKI, a large unsupervised dataset created by searching for repeated occurrences of nouns. However, training on MASKEDWIKI on its own is not always enough and sometimes makes a difference only in combination with additional training on the DPR dataset (called WSCR) (Rahman and Ng, 2012). In contrast, WIKICREM brings a much more consistent improvement over a wider range of datasets, strongly improving models' performance even when they are not fine-tuned on additional data. As opposed to our previous work, we also analyse the distribution of the generated data. We highlight that pronouns are not actually inserted into the sentences and thus none of the examples sound unnatural. This analysis was performed to show that WIKICREM consists of examples with a data distribution closer to the target tasks than MASKEDWIKI.

The WIKICREM Dataset
In this section, we describe how we obtained WIKICREM. Starting from English Wikipedia 2 , we search for sentences and pairs of sentences with the following properties: at least two distinct personal names appear in the text, and one of them is repeated. We do not use pieces of text with more than two sentences, in order to collect concise examples only. Personal names in the text are called "candidates". One non-first occurrence of the repeated candidate is masked, and the goal is to predict the masked name, given the correct and one incorrect candidate. In case of more than one incorrect candidate in the sentence, several datapoints are constructed, one for each incorrect candidate.
We ensure that the alternative candidate appears before the masked-out name in the text, in order to avoid trivial examples. Thus, the example is retained in the dataset if: (a) the repeated name appears after both candidates, all in a single sentence; or (b) both candidates appear in a single sentence, and the repeated name appears in a sentence directly following. Examples where one of the candidates appears in the same sentence as the repeated name, while the other candidate does not, are discarded, as they are often too trivial.
We illustrate the procedure with the following example: When asked about Adams' report, Powell found many of the statements to be inaccurate, including a claim that Adams first surveyed an area that was surveyed in 1857 by Joseph C.
The second occurrence of "Adams" is masked. The goal is to determine which of the two candidates ("Adams","Powell") has been masked out. The masking process resembles replacing a name with a pronoun, but the pronoun is not inserted to keep the process fully unsupervised and error-free.
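The retention rules above can be expressed as a small predicate. This is a minimal sketch under our own encoding: each occurrence is a (sentence index, token offset) pair, with sentence index 0 for the first sentence and 1 for the directly following one; the function name is illustrative, not released code.

```python
def retain_example(candidate_a, candidate_b, masked):
    """Decide whether a candidate WIKICREM example is kept.

    `candidate_a` and `candidate_b` are the two candidate names (the first
    occurrence of the repeated name and the alternative name); `masked` is
    the repeated, masked-out occurrence.  Each is (sentence_idx, offset).
    """
    s_a, o_a = candidate_a
    s_b, o_b = candidate_b
    s_m, o_m = masked
    # Rule (a): all three in a single sentence, and the repeated name
    # appears after both candidates.
    if s_a == s_b == s_m and o_m > max(o_a, o_b):
        return True
    # Rule (b): both candidates share a sentence, and the repeated name
    # appears in the sentence directly following.
    if s_a == s_b and s_m == s_a + 1:
        return True
    # Otherwise (e.g. only one candidate shares a sentence with the
    # repetition), the example is discarded as too trivial.
    return False
```

The last branch covers the discarded case described above, where one candidate appears in the same sentence as the repeated name while the other does not.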
We used the spaCy named entity recognition library 3 to find the occurrences of names in the text. The resulting dataset consists of 2,438,897 samples. We note that our dataset contains hard examples. To resolve one such example, one needs to understand that Denise was assigned a task and that "meant to be the parent" thus refers to her. To resolve another, one needs to understand that having an abortion can only happen if one falls pregnant first; since both candidates have feminine names, the answer cannot be deduced just from the common co-occurrence of female names and the word "abortion".
We highlight that our example generation method, while having the advantage of being unsupervised, also does not give incorrect signals, since we know the ground truth reference.
Even though WIKICREM and GAP both use text from English Wikipedia, they produce differing examples, because their generating procedures differ.

WIKICREM statistics. We analyze our dataset for gender bias. We use the Gender guesser library 4 to determine the gender of the candidates. To mimic the analysis of pronoun genders performed in related work (Webster et al., 2018; Rudinger et al., 2018; Zhao et al., 2018), we observe the gender of the correct candidates only. There were 0.8M "male" or "mostly male" names and 0.42M "female" or "mostly female" names; the rest were classified as "unknown". The female-to-male ratio among correct candidates is thus estimated at around 0.53, i.e., male candidates predominate. As shown in Section 6.2, this gender imbalance does not have any negative impact on model bias.
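The tally behind these statistics can be sketched as follows; `guess_gender` is a stub standing in for the Gender guesser library (whose "mostly male"/"mostly female" labels we fold into "male"/"female"), and the names in the stub table are illustrative.

```python
from collections import Counter

# Stub lookup standing in for the Gender guesser library; illustrative only.
STUB_GENDERS = {"John": "male", "Peter": "male", "Mary": "female", "Robin": "unknown"}

def guess_gender(name):
    """Return 'male', 'female', or 'unknown' for a first name."""
    return STUB_GENDERS.get(name, "unknown")

def female_to_male_ratio(correct_candidates):
    """Female-to-male ratio among the correct candidates only, mirroring
    the analysis of pronoun genders in the related work."""
    counts = Counter(guess_gender(name) for name in correct_candidates)
    return counts["female"] / counts["male"]
```

With the aggregate counts reported above, 0.42M female-ish names against 0.8M male-ish names give the quoted ratio of roughly 0.53.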
However, our unsupervised generating procedure sometimes yields examples where the correct answer cannot be deduced from the available information; we refer to these as unsolvable examples. To estimate the percentage of unsolvable examples, we manually annotated 100 randomly selected examples from the WIKICREM dataset. To prevent guessing, the candidates were not visible to the annotators. For each example, we asked the annotators to state whether it was solvable or not, and to answer the solvable examples. Among the 100 examples, we found 18 unsolvable ones, and the annotators achieved 95.1% accuracy on the rest, showing that the annotation error rate is tolerable. These annotations can be found in Appendix A.
However, as shown in Section 6.2, training on WIKICREM alone does not match the performance of training on data from the target distribution. The data distribution of WIKICREM differs from that of the evaluation datasets: if we replace the [MASK] token with a pronoun instead of the correct candidate, the resulting sentence sometimes sounds unnatural and would not occur in a human-written text. On the annotated 100 examples, we estimated the percentage of natural-sounding sentences to be 63%. While these sentences are not incorrect, the distribution of the training data differs from that of the target data.

Model
We use a simple language-model-based approach to anaphora resolution to show the value of the introduced dataset. In this section, we first introduce BERT (Devlin et al., 2018), the language model that we use throughout this work. In the second part, we describe the utilization of BERT and the fine-tuning procedures employed.

BERT
The Bidirectional Encoder Representations from Transformers (BERT) language model is based on the transformer architecture (Vaswani et al., 2017). We choose this model due to its strong language-modeling abilities and high performance on several NLU tasks (Devlin et al., 2018).
BERT is initially trained on two tasks: next sentence prediction and masked token prediction. In the next sentence prediction task, the model is given two sentences and is asked to predict whether the second sentence follows the first one. In the masked token prediction task, the model is given text with approximately 15% of the input tokens masked, and it is asked to predict these tokens. The details of the pre-training procedure can be found in Devlin et al. (2018).
In this work, we only focus on the masked token prediction. We use the PyTorch implementation of BERT 5 and the pre-trained weights for BERT-large released by Devlin et al. (2018).

Pronoun Resolution with BERT
This section introduces the procedure for pronoun resolution used throughout this work. Let S be the sentence with a pronoun that has to be resolved. Let a be a candidate for pronoun resolution. The pronoun in S is replaced with a [MASK] token, and the result is used as the input to the model to compute the log-probability log P(a|S). If a consists of more than one token, the same number of [MASK] tokens is inserted into S, and the log-probability log P(a|S) is computed as the average of the log-probabilities of all tokens in a.
The candidate-finding procedures are datasetspecific and are described in Section 6. Given a sentence S and several candidates a 1 , . . . , a n , we select the candidate a i with the largest log P(a i |S).
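The scoring rule can be sketched as follows. The `token_logprob` callable stands in for BERT's masked-token head (in practice, a forward pass of a masked language model); the function names and the per-candidate masking layout are our own illustration. Because candidates may differ in token length, each candidate gets its own masked copy of S.

```python
def score_candidate(token_logprob, sentence_tokens, candidate_tokens):
    """Average log-probability of `candidate_tokens` at the [MASK]
    positions of `sentence_tokens`.  `token_logprob(position, token)`
    stands in for a masked language model; the pronoun has already been
    replaced by len(candidate_tokens) [MASK] tokens."""
    mask_positions = [i for i, tok in enumerate(sentence_tokens) if tok == "[MASK]"]
    assert len(mask_positions) == len(candidate_tokens)
    total = sum(token_logprob(pos, tok)
                for pos, tok in zip(mask_positions, candidate_tokens))
    return total / len(candidate_tokens)

def resolve(token_logprob, masked_inputs):
    """Pick the candidate with the highest average log-probability.
    `masked_inputs` maps each candidate string to its own
    (masked sentence tokens, candidate tokens) pair."""
    return max(masked_inputs,
               key=lambda c: score_candidate(token_logprob, *masked_inputs[c]))
```

For instance, with two single-token candidates "Adams" and "Powell", `resolve` simply returns whichever the stubbed model assigns the higher log-probability at the masked position.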

Training
When training the model, the setup is similar to testing. We are given a sentence with a name or a pronoun masked out, together with two candidates. The goal is to determine which of the candidates is a better fit. Let a be the correct candidate, and b be an incorrect candidate. Following our previous work (Kocijan et al., 2019), we minimize the negative log-likelihood of the correct candidate, while additionally imposing a max-margin between the log-likelihoods of the correct and incorrect candidates:

L = - log P(a|S) + α · max(0, log P(b|S) - log P(a|S) + β),

where α and β are hyperparameters controlling the influence of the max-margin loss term and the margin between the log-likelihood of the correct and incorrect candidates, respectively. We observe that this combined loss consistently yields better results on the validation sets of all experiments than the negative log-likelihood or the max-margin loss on their own.
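The combined training loss (negative log-likelihood plus an α-weighted max-margin term with margin β) can be sketched in a few lines. The sketch operates on scalar log-probabilities; a real implementation would work with batched tensors, and the α, β values below are placeholders, not our tuned hyperparameters.

```python
def combined_loss(logp_correct, logp_incorrect, alpha, beta):
    """Negative log-likelihood of the correct candidate plus an
    alpha-weighted hinge enforcing a margin beta between the correct
    and incorrect candidates' log-likelihoods."""
    nll = -logp_correct                                        # -log P(a|S)
    margin_term = max(0.0, logp_incorrect - logp_correct + beta)
    return nll + alpha * margin_term
```

When the correct candidate already beats the incorrect one by more than β, the hinge term vanishes and the loss reduces to the plain negative log-likelihood.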
Since WIKICREM is large and one epoch takes around two days even when parallelized on 8 Tesla P100 GPUs, we only fine-tune BERT on WIKICREM for a single epoch. We note that better results may be achieved with further fine-tuning and an improved hyperparameter search.
Fine-tuning on other datasets is performed in the same way as training except for two differences. Firstly, in fine-tuning, the model is trained for 30 epochs due to the smaller size of datasets. Secondly, we do not sub-sample the training set for hyperparameter search. We validate the model after every epoch, retaining the model that performs best on the WIKICREM validation set.

Evaluation Datasets
We now introduce the 7 datasets that were used to evaluate the models. We decided not to use the CONLL2012 and WINOCOREF (Pradhan et al., 2012; Peng et al., 2015) datasets, because they contain more general coreference examples than just pronouns. We did not evaluate on the KNOWREF dataset (Emami et al., 2019), since it was not yet publicly available at the time of writing.
GAP. GAP (Webster et al., 2018) is a collection of 4,454 passages from Wikipedia containing ambiguous pronouns. It focuses on the resolution of personal pronouns referring to human names and has a 1:1 ratio between masculine and feminine pronouns. In addition to the overall performance on the dataset, each model is also evaluated on its performance on the masculine subset (F1M), the feminine subset (F1F), and its gender bias (the ratio between the two). The best performance was exhibited by the Referential Reader (Liu et al., 2019), a GRU-based model with additional external memory cells.
For each example, two candidates are given, with the goal of determining whether they are the referent. In approximately 10% of the training examples, none of the candidates are correct. When training on the GAP dataset, we discard such examples from the training set; we do not discard any examples from the validation or test set. When testing the model, we use the spaCy NER library to find all candidates in the sentence. Since the GAP dataset mainly contains examples with human names, we only retain named entities with the tag PERSON. We observe that in 18.5% of the test samples, the spaCy NER library fails to extract the candidate in question, making the answer for that candidate "FALSE" by default and putting our models at a disadvantage. Because of this, 7.25% of answers are always false negatives, and 11.25% are always true negatives, regardless of the model. Taking this into account, we compute that the maximal F1-score achievable by our models is capped at 91.1%.
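The 91.1% cap can be reproduced with a short calculation. The ~45% positive rate used below is our own estimate (two candidate answers per example, with neither candidate correct in roughly 10% of examples), not a figure stated explicitly above.

```python
def max_f1(positive_rate, forced_fn):
    """Upper bound on F1 when a fraction `forced_fn` of all answers are
    unavoidable false negatives (the NER step missed the candidate) and
    every other answer is, in the best case, correct.

    Both arguments are fractions of the total number of answers.  In the
    best case there are no false positives, so precision is 1 and the
    bound is driven by recall alone.
    """
    true_positives = positive_rate - forced_fn
    recall = true_positives / positive_rate
    precision = 1.0
    return 2 * precision * recall / (precision + recall)

# ~45% positive answers, 7.25% forced false negatives -> cap around 0.91
bound = max_f1(0.45, 0.0725)
```

Under these assumptions the bound comes out at roughly 0.912, consistent with the 91.1% figure quoted above.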
We highlight that, when evaluating our models, we are stricter than previous approaches (Liu et al., 2019; Webster et al., 2018). While they count the answer as "correct" if the model returns a substring of the correct answer, we only accept the full answer. The aforementioned models return the exact location of the correct candidate in the input sentence, while our approach does not. This strictness is necessary, because a substring of a correct answer could be a substring of several answers at once, making it ambiguous.

WSC. Each example in the Winograd Schema Challenge has the following structure:
1. Two entities appear in the text.
2. A pronoun or a possessive adjective appears in the sentence and refers to one of the entities. It would be grammatically correct if it referred to the other entity.
3. The goal is to find the referent of the pronoun or possessive adjective.
4. The text contains a "special word". When switched for the "alternative word", the sentence remains grammatically correct, but the referent of the pronoun changes.
The Winograd Schema Challenge is specifically made up of challenging examples that require commonsense reasoning for resolution and should not be solvable with statistical analysis of co-occurrence and association.
We evaluate the models on the collection of 273 problems used for the 2016 Winograd Schema Challenge (Davis et al., 2017), also known as WSC273. The best known approach to this problem uses the BERT language model, fine-tuned on the DPR dataset (Kocijan et al., 2019).
DPR. The Definite Pronoun Resolution (DPR) corpus (Rahman and Ng, 2012) is a collection of problems that resemble the Winograd Schema Challenge. The criteria for this dataset have been relaxed, and it contains examples that might not require commonsense reasoning, or examples where the "special word" is actually a whole phrase. We remove 6 examples in the DPR training set that overlap with the WSC dataset. The dataset was constructed manually and consists of 1,316 training and 564 test samples after we removed the overlapping examples. The best result on the dataset was reported by Peng et al. (2015) using external knowledge sources and integer linear programming.
PDP. The Pronoun Disambiguation Problem (PDP) is a small collection of 60 problems that was used as the first round of the Winograd Schema Challenge in 2016 (Davis et al., 2017). Unlike WSC, the examples do not contain a "special word", however, they do require commonsense reasoning to be answered. The examples were manually collected from books. Despite its small size, there have been several attempts at solving this challenge (Fähndrich et al., 2018;Trinh and Le, 2018), the best result being held by the Marker Passing algorithm (Fähndrich et al., 2018).

WNLI. The Winograd Natural Language Inference (WNLI) task is an inference task inspired by the Winograd Schema Challenge and is one of the 9 tasks on the GLUE benchmark (Wang et al., 2019). WNLI examples are obtained by rephrasing Winograd Schemas. The Winograd Schema is given as the "premise". A "hypothesis" is constructed by repeating the part of the premise with the pronoun and replacing the pronoun with one of the candidates. The goal is to classify whether the hypothesis follows from the premise.
A WNLI example obtained by rephrasing one of the WSC examples looks like this:

Premise: The city councilmen refused the demonstrators a permit because they feared violence.
Hypothesis: The demonstrators feared violence.
Answer: true / false
The WNLI dataset is constructed manually. Since the WNLI training and validation sets overlap with WSC, we use the WNLI test set only. The test set of WNLI comes from a separate source and does not overlap with any other dataset.
The current best approach transforms examples back into Winograd Schemas and solves them as a coreference problem (Kocijan et al., 2019). Following our previous work, we reverse the process of example generation in the same way: we automatically detect which part of the premise has been copied to construct the hypothesis. This locates the pronoun that has to be resolved and the candidate in question. All other nouns in the premise are treated as alternative candidates; we find nouns in the premise with the Stanford POS tagger (Manning et al., 2014).
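The reversal step can be approximated with a simple suffix-matching heuristic. This is a deliberately simplified sketch: the actual procedure relies on POS tagging and handles rephrasings and case changes that plain string matching cannot.

```python
def reverse_wnli(premise, hypothesis):
    """Heuristically reconstruct the coreference problem from a WNLI pair.

    The hypothesis repeats a suffix of the premise with the pronoun
    swapped for a candidate, so we match the longest common character
    suffix of the two strings: the remainder of the hypothesis is the
    candidate, and the last word of the remaining premise prefix is the
    pronoun to resolve.
    """
    i = 0
    while (i < min(len(premise), len(hypothesis))
           and premise[-1 - i] == hypothesis[-1 - i]):
        i += 1
    candidate = hypothesis[:len(hypothesis) - i].strip()
    pronoun = premise[:len(premise) - i].strip().split()[-1]
    return candidate, pronoun
```

On the councilmen example above, the heuristic recovers "The demonstrators" as the candidate and "they" as the pronoun.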
WINOGENDER. WINOGENDER (Rudinger et al., 2018) is a dataset that follows the WSC format and aims to measure gender bias. One of the candidates is always an occupation, while the other is a participant, both selected to be gender-neutral. Examples intentionally contain occupations with a strong imbalance in the gender ratio. The participant can be replaced with the neutral "someone", and three different pronouns (he/she/they) can be used. The aim of this dataset is to measure how the change of the pronoun gender affects the accuracy of the model.
Our models mask the pronoun and are thus not affected by the pronoun gender. They exhibit no bias on this dataset by design. We mainly use this dataset to measure the accuracy of different models on the entire dataset. According to Rudinger et al. (2018), the best performance is exhibited by Durrett and Klein (2013) when used on the male subset of the dataset. We use this result as the baseline.

WINOBIAS.
Similarly to the WINOGENDER dataset, WINOBIAS (Zhao et al., 2018) is a WSC-inspired dataset that measures gender bias in coreference resolution algorithms. Like WINOGENDER, it contains instances of occupations with high gender imbalance. It contains 3,160 examples of Winograd Schemas, equally split into a validation and a test set. The test set examples are split into 2 types: examples of type 1 are "harder" and should not be solvable using the analysis of co-occurrence, while examples of type 2 are easier. Additionally, each of these subsets is split into anti-stereotypical and pro-stereotypical subsets, depending on whether the gender of the pronoun matches the most common gender in the occupation. The difference in performance between pro- and anti-stereotypical examples shows how biased the model is. The best performance is exhibited by Lee et al. (2017) and Durrett and Klein (2013), as reported by Zhao et al. (2018).

Evaluation
We quantify the impact of WIKICREM on the introduced datasets.

Experiments
We train several different models to evaluate the contribution of the WIKICREM dataset in different real-world scenarios. In Scenario A, no information about the target distribution is available. In Scenario B, the distribution of the target data is known, and a sample of training data from the target distribution is available. Finally, Scenario C is the transductive scenario, where the unlabeled test samples are known in advance. All evaluations on the GAP test set are considered to be Scenario C, because BERT has been pre-trained on English Wikipedia and has thus seen the text in the GAP dataset at pre-training time.
We describe the evaluated models below.
BERT. This model, pretrained by Devlin et al. (2018), is the starting point for all models and serves as the soft baseline for Scenario A.
BERT WIKIRAND. This model serves as an additional baseline for Scenario A and aims to eliminate external factors that might have worked against the performance of BERT. To eliminate the effect of sentence lengths, loss function, and the percentage of masked tokens during the training time, we generate the RANDOMWIKI dataset. It consists of random passages from Wikipedia and has the same sentence-length distribution and number of datapoints as WIKICREM. However, the masked-out word from the sentence is selected randomly, while the alternative candidate is selected randomly from the vocabulary. BERT is then trained on this dataset in the same way as BERT WIKICREM, as described in Section 4.3.
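The RANDOMWIKI generation step can be sketched as follows; the function name and token-level masking are our illustration, and the vocabulary is assumed to contain more than one distinct token.

```python
import random

def randomwiki_example(passage_tokens, vocabulary, rng):
    """Build one RANDOMWIKI datapoint: mask a uniformly random token of
    the passage and draw the incorrect alternative uniformly from the
    vocabulary.  This mirrors WIKICREM's format (masked text, correct
    candidate, distractor) while carrying no coreference signal."""
    position = rng.randrange(len(passage_tokens))
    correct = passage_tokens[position]
    masked = passage_tokens[:position] + ["[MASK]"] + passage_tokens[position + 1:]
    alternative = correct
    while alternative == correct:          # ensure a distinct distractor
        alternative = rng.choice(vocabulary)
    return masked, correct, alternative

rng = random.Random(0)
example = randomwiki_example(["Alice", "met", "Bob", "yesterday"],
                             ["Alice", "met", "Bob", "yesterday", "Carol"], rng)
```

Because both the masked word and the distractor are random, a model trained on this data sees the same sentence lengths and loss as with WIKICREM, but no referential structure.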
BERT WIKICREM. This model is obtained by fine-tuning BERT on WIKICREM, as described in Section 4.3; it serves as the evaluation of WIKICREM for Scenario A.

BERT GAP. This model is obtained by fine-tuning BERT on the GAP dataset. It serves as the baseline for Scenario C, as explained at the beginning of Section 6.1.
BERT WIKICREM GAP. This model serves as the evaluation of WIKICREM for Scenario C and is obtained by fine-tuning BERT WIKICREM on GAP.
BERT ALL. This model is obtained by finetuning BERT on all the available data from the target datasets at once. Combined GAP-train and DPR-train data are used for training. The model is validated on the GAP-validation set and the WINOBIAS-validation set separately. Scores on both sets are then averaged to obtain the validation performance. Since both training sets and both validation sets have roughly the same size, both tasks are represented equally.
BERT WIKICREM ALL. This model is obtained in the same way as the BERT ALL model, but starting from BERT WIKICREM instead.

Results
The results of the evaluation of the models on the test sets are shown in Table 1. (In Table 1, each section contains a model that has been trained on WIKICREM and models that have not been; the best result in each section is in bold, and the best overall result is underlined. Scores on GAP are measured as F1-scores, while the performance on other datasets is given in accuracy. The source of each SOTA is listed in Section 5.) We notice that additional training on WIKICREM consistently improves the performance of the models in all scenarios and on most tests. Due to the small size of some test sets, some of the results are subject to deviation. This especially applies to PDP (60 test samples) and WNLI (145 test samples). We observe that BERT WIKIRAND generally performs worse than BERT, with GAP and PDP being notable exceptions. This shows that BERT is a strong baseline and that the improved performance of BERT WIKICREM is not a consequence of training on shorter sentences or with a different loss function. BERT WIKICREM consistently outperforms both baselines on all tests, showing that WIKICREM can be used as a standalone dataset.

We observe that training on the data from the target distribution improves the performance the most. Models trained on GAP-train usually show more than a 20% increase in their F1-score on GAP-test. Still, BERT WIKICREM GAP shows a consistent improvement over BERT GAP on all subsets of the GAP test set. This confirms that WIKICREM works not just as a standalone dataset, but also as additional pre-training in the transductive scenario.
Similarly, BERT WIKICREM DPR outperforms BERT DPR on the majority of tasks, showing the applicability of WIKICREM to the scenario where additional training data is available. However, the good results of BERT GAP DPR show that additional training on a manually constructed dataset, such as GAP, can yield similar results to additional training on WIKICREM. The reason behind this difference is the impact of the data distribution: GAP, DPR, and WIKICREM contain data that follows different distributions, which strongly impacts the trained models. This can be seen when we fine-tune BERT GAP on DPR to obtain BERT GAP DPR, as the model's performance on GAP-test drops by 8.2%. WIKICREM's data distribution strongly differs from the test sets', as described in Section 3.
However, the best results are achieved when all available data is combined, as shown by the models BERT ALL and BERT WIKICREM ALL. BERT WIKICREM ALL achieves the highest performance on GAP, DPR, WNLI, and WINOBIAS among the models, and sets the new state-of-the-art result on GAP, DPR, and WINOBIAS.
The new state-of-the-art result on the WINOGENDER dataset is achieved by the BERT WIKICREM DPR model, while BERT WIKICREM ALL and BERT GAP DPR set the new state-of-the-art on the PDP dataset.

Conclusions and Future Work
In this work, we introduced WIKICREM, a large dataset of training instances for pronoun resolution. We used our dataset to fine-tune the BERT language model. Our models match or outperform state-of-the-art models on 6 out of 7 evaluated datasets.
The employed data-generating procedure can be further applied to other large sources of text to generate more training sets for pronoun resolution. In addition, both variety and size of the generated datasets can be increased if we do not restrict ourselves to personal names. We hope that the community will make use of our released WIKICREM dataset to further improve the pronoun resolution task.