Coreferential Reasoning Learning for Language Representation

Language representation models such as BERT can effectively capture contextual semantic information from plain text, and have been shown to achieve promising results on many downstream NLP tasks with appropriate fine-tuning. However, existing language representation models seldom explicitly consider coreference, the relationship between noun phrases referring to the same entity, even though it is essential to a coherent understanding of the whole discourse. To address this issue, we present CorefBERT, a novel language representation model designed to capture the relations between noun phrases that co-refer to each other. Experimental results show that, compared with existing baseline models, CorefBERT achieves significant improvements on several downstream NLP tasks that require coreferential reasoning, while maintaining performance comparable to previous models on other common NLP tasks.


Introduction
Recently, language representation models have made significant strides in many natural language understanding tasks, such as natural language inference, sentiment classification, question answering, relation extraction, fact extraction and verification, and coreference resolution (Zhang et al., 2019; Talmor and Berant, 2019; Peters et al., 2019; Joshi et al., 2019b). These models usually conduct self-supervised pre-training tasks over large-scale corpora to obtain informative language representations, which capture the contextual semantics of the input text.
Although existing language representation models have achieved success on many downstream tasks, they are still not sufficient for understanding coreference in long texts. Pre-training tasks such as masked language modeling often lead the model to collect local semantic and syntactic information to recover the masked tokens, while ignoring long-distance connections beyond the sentence level, since coreference is not modeled explicitly. Coreference is a linguistic connection that commonly appears in long sequences and is one of the most important elements for a coherent understanding of the whole discourse. Long text usually accommodates complex relationships between noun phrases, which poses a challenge for text understanding. For example, in the sentence "The physician hired the secretary because she was overwhelmed with clients.", it is necessary to realize that she refers to the physician in order to comprehend the whole context.
To improve the coreferential reasoning capacity of language representation models, a straightforward solution is to fine-tune these models on supervised coreference resolution data. Nevertheless, it is impractical to obtain a large-scale supervised coreference dataset. In this paper, we present CorefBERT, a language representation model designed to better capture and represent the coreference information in an utterance without supervised data. CorefBERT introduces a novel pre-training task called Mention Reference Prediction (MRP), in addition to Masked Language Modeling (MLM). MRP leverages repeated mentions (e.g., nouns or noun phrases) that appear multiple times in a passage to acquire abundant co-referring relations. In particular, MRP involves a mention reference masking strategy, which masks one or several of the repeated mentions in the passage and requires the model to predict the masked mention from its corresponding referents. Here is an example. Sequence: "Jane presents strong evidence against Claire, but [MASK] may present a strong defense." Candidates: Jane, evidence, Claire, ... For the MRP task, we substitute the repeated mention, Claire, with [MASK] and require the model to find the proper candidate for filling the [MASK].
To explicitly model the coreference information, we further introduce a copy-based training objective that encourages the model to select the consistent noun phrase from the context instead of from the vocabulary. The copy mechanism establishes more interactions among the mentions of an entity, which benefits the coreference resolution scenario.
We conduct experiments on a suite of downstream NLP tasks that require coreferential reasoning in language understanding, including extractive question answering, relation extraction, fact extraction and verification, and coreference resolution. Experimental results show that CorefBERT outperforms the vanilla BERT on almost all benchmarks owing to its improved coreferential reasoning. To verify the robustness of our model, we also evaluate CorefBERT on other common NLP tasks, where CorefBERT still achieves results comparable to BERT. This demonstrates that the introduction of the new pre-training task does not impair BERT's ability in common language understanding.

Background
BERT, a language representation model, learns universal language representations with a deep bidirectional Transformer (Vaswani et al., 2017) from a large-scale unlabeled corpus. Typically, it utilizes two training tasks to learn from unlabeled text: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). However, it turns out that NSP is not as helpful as expected for language representation learning. Therefore, we train our model, CorefBERT, on contiguous sequences without the NSP objective.
Notation Given a sequence of tokens $X = (x_1, x_2, \ldots, x_n)$ (in this paper, tokens are at the subword level), BERT first represents each token by aggregating the corresponding token, segment, and position embeddings, and then feeds the input representation into a deep bidirectional Transformer to obtain the final contextual representation.
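As a minimal sketch of this input construction (the module and variable names below are illustrative, not BERT's released implementation):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment, and position embeddings, as described above."""
    def __init__(self, vocab_size, hidden_size, max_position=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_size)
        self.segment = nn.Embedding(num_segments, hidden_size)
        self.position = nn.Embedding(max_position, hidden_size)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len); positions are simply 0..seq_len-1
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
```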
Masked language modeling (MLM) MLM can be regarded as a kind of cloze task and aims to predict the missing tokens according to their final contextual representations. In CorefBERT, we retain the MLM objective for learning general representations, and further add Mention Reference Prediction to infuse stronger coreferential reasoning ability into the language representation.

Methodology
In this section, we present CorefBERT, a language representation model that aims to better capture the coreference information of the text. Our approach introduces a novel auxiliary training task, Mention Reference Prediction (MRP), which is added to enhance the coreferential reasoning ability of BERT. MRP utilizes a mention reference masking strategy to mask one of the repeated mentions in the sequence and then employs a copy-based training objective to predict the masked tokens by copying other tokens in the sequence.

Mention Reference Masking
To better capture the coreference information of the text, we propose a novel masking strategy, mention reference masking, which masks tokens of the repeated mentions in the sequence instead of masking random tokens. The idea is inspired by unsupervised coreference resolution. We follow a distant supervision assumption: the repeated mentions in a sequence refer to each other; therefore, if we mask one of them, the masked tokens can be inferred from the context and the unmasked references. Based on this strategy and assumption, the CorefBERT model is expected to capture the coreference information in the text in order to fill in the masked tokens.
In practice, we regard nouns in the text as mentions. We first use spaCy for part-of-speech tagging to extract all nouns in the given sequence. Then, we cluster the nouns into several groups, where each group contains all mentions of the same noun. After that, we select the masked nouns from different groups uniformly.
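A minimal sketch of this noun extraction and grouping step (assuming a standard spaCy English pipeline; the function name is ours for illustration):

```python
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")  # any English spaCy pipeline with a POS tagger works

def repeated_noun_groups(text):
    """Group token positions of nouns that occur more than once in the sequence."""
    doc = nlp(text)
    groups = defaultdict(list)
    for token in doc:
        if token.pos_ in ("NOUN", "PROPN"):
            groups[token.text.lower()].append(token.i)
    # keep only repeated nouns, i.e. candidate mentions assumed to refer to each other
    return {noun: idxs for noun, idxs in groups.items() if len(idxs) > 1}

print(repeated_noun_groups(
    "Jane presents strong evidence against Claire, but Claire may present a strong defense."))
# e.g. {'claire': [5, 8]} -> one of the two occurrences can be masked for MRP
```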
In order to maintain the universal language representation ability of CorefBERT, we utilize both masked language modeling (random token masking) and mention reference prediction (mention reference masking) during training. Empirically, the masked words for masked language modeling and mention reference prediction are sampled at a ratio of 4:1. Similar to BERT, 15% of the tokens are masked in total, where 80% of them are replaced with [MASK], 10% are kept as the original tokens, and 10% are replaced with random tokens. We also adopt whole word masking, which masks all the subwords belonging to the masked words or mentions.
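The sampling and corruption procedure could look roughly as follows (a simplified word-level sketch; whole word masking over subwords and other bookkeeping are omitted):

```python
import random

def choose_masking_positions(words, noun_groups, mask_rate=0.15, mrp_share=0.2):
    """Pick word positions to mask: ~1/5 via mention reference masking, ~4/5 at random (4:1 ratio)."""
    n_mask = max(1, int(len(words) * mask_rate))
    n_mrp = max(1, int(n_mask * mrp_share))
    mrp_positions = []
    for noun in random.sample(list(noun_groups), k=min(n_mrp, len(noun_groups))):
        mrp_positions.append(random.choice(noun_groups[noun]))  # mask one of the repeated mentions
    remaining = [i for i in range(len(words)) if i not in mrp_positions]
    mlm_positions = random.sample(remaining, k=max(0, n_mask - len(mrp_positions)))
    return mrp_positions, mlm_positions

def corrupt(words, positions, vocab):
    """BERT-style corruption: 80% [MASK], 10% kept unchanged, 10% replaced by a random token."""
    words = list(words)
    for i in positions:
        r = random.random()
        if r < 0.8:
            words[i] = "[MASK]"
        elif r >= 0.9:
            words[i] = random.choice(vocab)
        # else: keep the original word
    return words
```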

Figure 1: An illustration of CorefBERT's training process for the sequence "Jane presents strong evidence against Claire, but [MASK] may present a strong defense." In this example, the second Claire is masked. We use the copy-based objective to predict the masked token from its context for the mention reference prediction task. The overall loss consists of the losses of both mention reference prediction and masked language modeling.

Copy-based Training Objective
In order to capture the coreference information of the text, CorefBERT models the correlation among words in the sequence. The copy mechanism is widely adopted in sequence-to-sequence tasks: it alleviates the out-of-vocabulary problem in text summarization (Gu et al., 2016), translates specific words in machine translation, and retells queries in dialogue generation. We adapt the copy mechanism and introduce a copy-based training objective that requires the model to predict the missing tokens of a masked noun by copying the unmasked tokens in the context. Through the copy mechanism, the CorefBERT model can explicitly capture the relations between the masked mention and its referring mentions, and thereby obtain the coreference information in the context.
The representations of the start token and the end token of a word typically contain the whole word's information (Lee et al., 2017), based on which we apply the copy-based training objective on both ends of the masked word. Formally, we first encode the given input sequence $X = (x_1, \ldots, x_n)$, with some tokens masked, into hidden states $H = (h_1, \ldots, h_n)$ via a multi-layer Transformer (Vaswani et al., 2017). The probability of recovering the masked token $x_i$ by copying from $x_j$ is defined as:

$$P(x_j \mid x_i) = \frac{\exp\big((h_j \odot h_i)^\top V\big)}{\sum_{k=1}^{n} \exp\big((h_k \odot h_i)^\top V\big)}, \quad (1)$$

where $\odot$ denotes the element-wise product and $V$ is a trainable parameter that measures the importance of each dimension for the tokens' similarity. For a masked noun $w_i$ consisting of a sequence of tokens $(x_{s_i}, \ldots, x_{t_i})$, where $s_i$ and $t_i$ denote its start and end positions, we recover $w_i$ by copying its referring context word, and define the probability of choosing word $w_j$ as:

$$P(w_j \mid w_i) = P(x_{s_j} \mid x_{s_i}) \cdot P(x_{t_j} \mid x_{t_i}). \quad (2)$$

A masked noun possibly has multiple corresponding words in the sequence, for which we collectively maximize the probability of all corresponding words. This approach is widely used in question answering (Kadlec et al., 2016; Swayamdipta et al., 2018) to handle multiple answers. Finally, we define the loss of mention reference prediction (MRP) as:

$$\mathcal{L}_{\mathrm{MRP}} = -\sum_{w_i \in \mathcal{M}} \log \sum_{w_j \in \mathcal{C}_{w_i}} P(w_j \mid w_i), \quad (3)$$

where $\mathcal{M}$ is the set of all masked mentions for mention reference masking, and $\mathcal{C}_{w_i}$ is the set of all corresponding words of word $w_i$.
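A minimal PyTorch-style sketch of Eqs. (1)-(3) (tensor shapes and helper names are our assumptions for illustration, not the released implementation):

```python
import torch

def copy_probs(H, v, i):
    """Eq. (1): distribution over context positions j for recovering the masked token at position i."""
    scores = (H * H[i]) @ v              # (h_j ⊙ h_i)^T V for every position j; H: (n, d), v: (d,)
    return torch.softmax(scores, dim=0)

def mrp_loss(H, v, masked_nouns):
    """masked_nouns: list of (s_i, t_i, candidates), where candidates holds the (s_j, t_j)
    spans of all unmasked corresponding words of the masked noun."""
    loss = 0.0
    for s_i, t_i, candidates in masked_nouns:
        p_start, p_end = copy_probs(H, v, s_i), copy_probs(H, v, t_i)
        # Eqs. (2)-(3): sum word-level copy probabilities over all corresponding words
        p_word = sum(p_start[s_j] * p_end[t_j] for s_j, t_j in candidates)
        loss = loss - torch.log(p_word)
    return loss / max(len(masked_nouns), 1)
```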

Training
CorefBERT aims to capture the coreference information of the text while maintaining the language representation capability of BERT. Thus, the overall loss of CorefBERT consists of two losses, the mention reference prediction loss $\mathcal{L}_{\mathrm{MRP}}$ and the masked language modeling loss $\mathcal{L}_{\mathrm{MLM}}$, which can be formulated as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{MRP}} + \mathcal{L}_{\mathrm{MLM}}.$$

Experiment
In this section, we first introduce the training details of CorefBERT. After that, we present the fine-tuning results on a comprehensive suite of tasks, including extractive question answering, document-level relation extraction, fact extraction and verification, coreference resolution, and eight tasks in the GLUE benchmark.

Training Details
Due to the large cost of training CorefBERT from scratch, we initialize the parameters of CorefBERT with the BERT released by Google, which is also used as our baseline on downstream tasks. Similar to previous language representation models, we adopt English Wikipedia as our training corpus, which contains about 3,000M tokens. Note that, since the Wikipedia corpus has been used to train the original BERT, CorefBERT does not use any additional corpus. We train CorefBERT with contiguous sequences of up to 512 tokens, and shorten the input sequences with a 10% probability.
To verify the effectiveness of our method for language representation models trained with a tremendous corpus, we further train CorefRoBERTa starting from the released RoBERTa. Additionally, we follow the pre-training hyperparameters used in BERT, and adopt the Adam optimizer (Kingma and Ba, 2015) with a batch size of 256. A learning rate of 5e-5 is used for the base model and 1e-5 for the large model. The optimization runs for 33k steps, where the first 20% of steps use a linear warm-up of the learning rate. Pre-training took 1.5 days for the base model and 11 days for the large model on 8 2080 Ti GPUs.
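For reference, the pre-training hyperparameters described above can be summarized as follows (a hypothetical configuration dictionary; the key names are ours):

```python
corefbert_pretrain_config = {
    "init_checkpoint": "BERT (Google release) / RoBERTa release",
    "corpus": "English Wikipedia (~3,000M tokens)",
    "max_seq_length": 512,
    "short_seq_prob": 0.10,             # shorten input sequences with 10% probability
    "optimizer": "Adam",
    "batch_size": 256,
    "learning_rate": {"base": 5e-5, "large": 1e-5},
    "train_steps": 33_000,
    "warmup_steps": int(0.2 * 33_000),  # linear warm-up over the first 20% of steps
}
```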

Extractive Question Answering
Given a question and a passage, the extractive question answering task aims to select spans in the passage that answer the question. We evaluate our model on the Questions Requiring Coreferential Reasoning dataset (QUOREF), which contains more than 24k question-answer pairs. Compared to previous reading comprehension benchmarks, QUOREF is more challenging: 78% of its questions cannot be answered without coreference resolution, while tracking entities' coreference is essential to comprehending documents. Therefore, QUOREF can examine the coreference resolution capability of question answering models to some extent. We also evaluate the models on the MRQA shared task (Fisch et al., 2019). MRQA integrates several existing datasets into a unified format, providing a single context within 800 tokens for each question and ensuring that at least one answer can be accurately found in the context. We use six benchmarks from MRQA, including SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2017), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), HotpotQA (Yang et al., 2018), and Natural Questions (NaturalQA) (Kwiatkowski et al., 2019). The MRQA shared task involves paragraphs from different sources and questions with manifold styles, helping us effectively evaluate our model in different domains. Since MRQA does not provide a public test set, we randomly split the development set into two halves to make new validation and test sets.
Baselines For QUOREF, we compare our CorefBERT model with the following baselines: (1) QANet (Yu et al., 2018) combines a self-attention mechanism with a convolutional neural network, and achieves the best performance to date without pre-training; (2) QANet+BERT adopts the BERT representation as an additional input feature for QANet; (3) BERT, which simply fine-tunes BERT for extractive question answering; we further design two components accounting for coreferential reasoning and multiple answers, by which we obtain a stronger BERT baseline on QUOREF; (4) RoBERTa-MT, the current state-of-the-art, is pre-trained on the CoLA, SST-2, and SQuAD datasets in turn before finally being fine-tuned on QUOREF.
Implementation Details Following BERT's setting, given the question $Q = (q_1, q_2, \ldots, q_m)$ and the passage $P = (p_1, p_2, \ldots, p_n)$, we represent them as a sequence $X = (\texttt{[CLS]}, q_1, q_2, \ldots, q_m, \texttt{[SEP]}, p_1, p_2, \ldots, p_n, \texttt{[SEP]})$, feed the sequence $X$ into the pre-trained encoder, and train two classifiers on top of it to predict the answer's start and end positions simultaneously. For MRQA, CorefBERT maintains the same framework as BERT. For QUOREF, we further employ two extra components to handle answers with multiple mentions: (1) Inspired by Hu et al. (2019) in handling the multiple-answer-span problem, we utilize the representation of [CLS] to predict the number of answers. Then, we adopt the non-maximum suppression (NMS) algorithm (Rosenfeld and Thurston, 1971) to extract a specific number of non-overlapping spans. NMS first selects the answer span with the highest score, then chooses the span with the next-highest score that does not overlap with previously selected spans, and so on, until the predicted number of spans has been selected. (2) When answering a question from QUOREF, the coreferent mention could be a pronoun in the sentence most relevant to the correct answer, so we add an additional reasoning layer (a Transformer layer) before the span boundary classifier.
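A simplified sketch of the NMS step described in (1) (the span-scoring interface is our assumption for illustration):

```python
def select_answer_spans(candidates, num_answers):
    """Greedy non-maximum suppression over candidate answer spans.
    candidates: list of (start, end, score); num_answers: the count predicted from [CLS]."""
    selected = []
    for start, end, score in sorted(candidates, key=lambda c: c[2], reverse=True):
        # a span overlaps if it intersects any previously selected span
        overlaps = any(start <= e and s <= end for s, e, _ in selected)
        if not overlaps:
            selected.append((start, end, score))
        if len(selected) == num_answers:
            break
    return selected
```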
Results Table 2 shows the performance on QUOREF. Our BERT Base outperforms the original BERT by about 2 points in EM and F1 score, which indicates the effectiveness of the added reasoning layer and multi-span prediction module. CorefBERT Base and CorefBERT Large exceed our adapted BERT Base and BERT Large by 4.4% and 2.9% F1 respectively. CorefRoBERTa Large also gains a 0.7% F1 improvement and achieves a new state-of-the-art. We show four case studies in the Supplemental Materials, which indicate that, through reasoning over mentions, CorefBERT can aggregate information to answer questions requiring coreferential reasoning. Table 1 further shows that the effectiveness of CorefBERT is consistent across the six MRQA datasets besides QUOREF. We find that, although the MRQA shared task is not designed for coreferential reasoning, our CorefBERT model still achieves an average improvement of over 1 point on the six datasets, especially on NewsQA and HotpotQA. In NewsQA, 20.7% of the answers can only be inferred by synthesizing information distributed across multiple sentences. In HotpotQA, 63% of the answers need to be inferred through bridge entities or by checking multiple properties in different positions. This demonstrates that coreferential reasoning is an essential ability in question answering.

Relation Extraction
Relation extraction (RE) aims to extract the relationship between two entities in a given text. We evaluate our model on DocRED (Yao et al., 2019), a challenging document-level RE dataset which requires extracting relations between entities by synthesizing information from all of their mentions after reading the whole document. DocRED requires a variety of reasoning types, and 17.6% of its relation facts need to be uncovered through coreferential reasoning.
Baselines We compare our model with the following baselines: (1) CNN/LSTM/BiLSTM. CNN (Zeng et al., 2014), LSTM (Hochreiter and Schmidhuber, 1997), and bidirectional LSTM (BiLSTM) (Cai et al., 2016) are widely adopted as text encoders in relation extraction tasks. These text encoders convert each word in the document into its output representation, and the representations of the two entities are then used to predict the relationship between them. We replace the encoder with BERT/RoBERTa to provide a stronger baseline. (2) HIN-BERT (Tang et al., 2020) proposes a hierarchical inference network to obtain and aggregate inference information at different granularities.
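A simplified sketch of such an encoder-plus-classifier baseline (the mention averaging and bilinear scoring below are our illustrative choices, not a specific published architecture):

```python
import torch
import torch.nn as nn

class EntityPairClassifier(nn.Module):
    """Predict the relation between two entities from contextual word representations."""
    def __init__(self, hidden_size, num_relations):
        super().__init__()
        self.bilinear = nn.Bilinear(hidden_size, hidden_size, num_relations)

    def entity_repr(self, H, mention_spans):
        # average over all mentions of the entity in the document; H: (seq_len, hidden)
        reps = [H[s:e + 1].mean(dim=0) for s, e in mention_spans]
        return torch.stack(reps).mean(dim=0)

    def forward(self, H, head_mentions, tail_mentions):
        head = self.entity_repr(H, head_mentions)
        tail = self.entity_repr(H, tail_mentions)
        return self.bilinear(head.unsqueeze(0), tail.unsqueeze(0)).squeeze(0)  # relation logits
```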
Results Table 3 shows the performance on DocRED. CorefBERT Base outperforms the BERT Base model by 0.7% F1. CorefRoBERTa Large beats RoBERTa Large by 0.3% F1 and outperforms all previously published work. This demonstrates the effectiveness of considering the coreference information of text for document-level relation classification.

Fact Extraction and Verification
Fact extraction and verification aims to verify deliberately fabricated claims against trustworthy corpora. We evaluate our model on a large-scale public fact verification dataset, FEVER (Thorne et al., 2018). FEVER consists of 185,455 annotated claims, together with the full set of Wikipedia documents.
Baselines We compare our model with four BERT-based fact verification models: (1) BERT Concat concatenates all evidence pieces and the claim to predict the claim label; (2) SR-MRS (Nie et al., 2019) employs hierarchical BERT retrieval to improve model performance; (3) GEAR constructs an evidence graph and conducts a graph attention network for joint reasoning over several evidence pieces; (4) KGAT (Liu et al., 2019b) further conducts a fine-grained graph attention network with kernels.

Results Table 4 shows the performance on FEVER. KGAT with CorefBERT Base outperforms KGAT with BERT Base by 0.4% FEVER score. KGAT with CorefRoBERTa Large gains a 1.4% FEVER score improvement compared to the model with RoBERTa Large, which makes our model the best among all previously published work. This again demonstrates the effectiveness of our model. CorefBERT, which incorporates coreference information in distantly supervised pre-training, helps to verify whether the claim and evidence discuss the same mentions, such as a person or an object.

Coreference Resolution
Coreference resolution aims to link referring expressions that evoke the same discourse entity. We inspect the models' intrinsic coreference resolution ability under the setting that all mentions have been detected. Given two sentences where the former has two or more mentions and the latter contains an ambiguous pronoun, models should predict what mention the pronoun refers to. We evaluate our model on several widely-used datasets, including GAP (Webster et al., 2018), DPR (Rahman and Ng, 2012), WSC (Levesque, 2011), Winogender (Rudinger et al., 2018) and PDP (Davis et al., 2017).
Baselines We compare our model with coreference resolution models that are based on pre-trained language models and fine-tuned on the GAP and DPR training sets. Trinh and Le (2018) resolve pronouns according to language model probabilities. Kocijan et al. (2019a) generate GAP-like sentences automatically, pre-train BERT with an objective that minimizes the perplexity of the correct mentions in these sentences, and finally fine-tune the model on the supervised datasets. Benefiting from the augmented data, Kocijan et al. (2019a) achieve state-of-the-art results in sentence-level coreference resolution.
Results Table 5 shows the performance on the test sets of the above coreference datasets. Our CorefBERT model significantly outperforms BERT, which demonstrates that the intrinsic coreference resolution ability of CorefBERT has been enhanced by the mention reference prediction training task. Moreover, it achieves comparable performance with the state-of-the-art baseline WikiCREM. Note that WikiCREM is specially designed for sentence-level coreference resolution and is not suitable for other NLP tasks, whereas the coreferential reasoning capability of CorefBERT can be transferred to other NLP tasks.
GLUE

Implementation Details Following BERT's setting, we add a [CLS] token in front of the input sentences, and extract its top-layer representation as the representation of the whole sentence or sentence pair for classification or regression. We use a batch size of 32, fine-tune for 3 epochs on all GLUE tasks, and select the Adam learning rate among 2e-5, 3e-5, 4e-5, and 5e-5 for the best performance on the development set.
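A minimal sketch of the [CLS]-based classification head used for GLUE fine-tuning (module names are illustrative):

```python
import torch.nn as nn

class SequenceClassificationHead(nn.Module):
    """Classify (or regress) from the top-layer [CLS] representation."""
    def __init__(self, hidden_size, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states):
        cls_repr = hidden_states[:, 0]    # position 0 holds the [CLS] token
        return self.classifier(self.dropout(cls_repr))
```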
Results Table 6 shows the performance on GLUE. We notice that CorefBERT achieves results comparable to BERT. Although GLUE tasks by their nature do not require much coreference resolution ability, the results show that our masking strategy and auxiliary training objective do not weaken performance on common natural language understanding tasks.

Ablation Study
In this subsection, we explore the effects of Whole Word Masking (WWM), Mention Reference Masking (MRM), Next Sentence Prediction (NSP), and the copy-based training objective on several benchmark datasets. We continue to train Google's released BERT Base on the same Wikipedia corpus with different strategies. As shown in Table 7, we have the following observations: (1) Removing the next sentence prediction training task results in better performance on almost all tasks, a conclusion consistent with previous work. (2) The MRM scheme usually achieves parity with the WWM scheme except on SearchQA, and both of them outperform the original subword masking scheme on NewsQA (+1.7% F1 on average) and TriviaQA (+1.5% F1 on average). (3) On top of the mention reference masking scheme, our copy-based training objective explicitly requires the model to look for a noun's referents in the context, which effectively exploits the coreference information of the sequence. CorefBERT takes advantage of this objective and further improves performance, with a substantial gain (+2.3% F1) on QUOREF.

Related Work
Word representations (Mikolov et al., 2013; Pennington et al., 2014) aim to capture the semantic information of words from an unlabeled corpus, transforming discrete words into continuous vector representations. Since pre-trained word representations cannot handle polysemy well, ELMo (Peters et al., 2018) further extracts context-aware word embeddings from a sequence-level language model. Deep learning models benefit from adopting these word representations as input features, and have achieved encouraging progress in the last few years (Kim, 2014; Lample et al.). More recently, language representation models that generate contextual word representations have been learned from large-scale unlabeled corpora and then fine-tuned for downstream tasks. SA-LSTM (Dai and Le, 2015) pre-trains an auto-encoder on unlabeled text and achieves strong performance in text classification with a few fine-tuning steps. ULMFiT (Howard and Ruder, 2018) further builds a universal language model. OpenAI GPT (Radford et al., 2018) learns pre-trained language representations with the Transformer (Vaswani et al., 2017) architecture. BERT trains a deep bidirectional Transformer with the masked language modeling objective, which achieves state-of-the-art results on various NLP tasks. SpanBERT extends BERT by masking contiguous random spans and training the model to predict the entire content within the span boundary. XLNet combines Transformer-XL and an auto-regressive loss, which takes the dependencies between predicted positions into account. MASS (Song et al., 2019) explores masking strategies for sequence-to-sequence pre-training. Although both pre-trained word representations and language models have achieved great success, they still cannot capture coreference information well. In this paper, we design the mention reference prediction task to enhance language representation models in terms of coreferential reasoning.
Our work, which acquires coreference resolution ability from an unlabeled corpus, can also be viewed as a special form of unsupervised coreference resolution. Formerly, researchers made efforts to explore feature-based unsupervised coreference resolution methods (Haghighi and Klein, 2007; Bejan et al., 2009; Ma et al., 2016). After that, Trinh and Le (2018) showed that pronouns in a sentence can naturally be resolved according to language model probabilities. Moreover, Kocijan et al. (2019a,b) propose sentence-level unsupervised coreference resolution datasets to train a language-model-based coreference discriminator, which achieves outstanding performance in coreference resolution. However, we find that the above methods cannot be directly transferred to the training of language representation models, since their learning objectives may weaken the model's performance on downstream tasks. Therefore, in this paper, we introduce the mention reference prediction objective along with masked language modeling to make the learned abilities available to more downstream tasks.

Conclusion and Future Work
In this paper, we present a language representation model named CorefBERT, which is trained with a novel task, mention reference prediction, to strengthen the coreferential reasoning ability of BERT. Experimental results on several downstream NLP tasks show that CorefBERT significantly outperforms BERT by considering the coreference information within the text. In the future, there are several prospective research directions: (1) We introduce a distant supervision (DS) assumption in our mention reference prediction training task. It is a feasible approach to introducing a coreferential signal into language representation models, but the automatic labeling mechanism is inevitably accompanied by the wrong-labeling problem. Mitigating the noise in DS data remains an open question.
(2) The DS assumption does not consider the pronouns in the text, while pronouns play an important role in coreferential reasoning. Thus, it is worth developing a novel strategy, such as self-supervised learning, to further consider pronouns in CorefBERT.