Fine-grained Information Status Classification Using Discourse Context-Aware BERT

Previous work on bridging anaphora recognition (Hou et al., 2013) casts the problem as a subtask of learning fine-grained information status (IS). However, these systems heavily depend on many hand-crafted linguistic features. In this paper, we propose a simple discourse context-aware BERT model for fine-grained IS classification. On the ISNotes corpus (Markert et al., 2012), our model achieves new state-of-the-art performance on fine-grained IS classification, obtaining a 4.8 absolute overall accuracy improvement compared to Hou et al. (2013). More importantly, we also show an improvement of 10.5 F1 points for bridging anaphora recognition without using any complex hand-crafted semantic features designed for capturing the bridging phenomenon. We further analyze the trained model and find that the most attended signals for each IS category correspond well to linguistic notions of information status.


Introduction
Information Structure (Halliday, 1967; Prince, 1981; Prince, 1992; Gundel et al., 1993; Lambrecht, 1994; Birner and Ward, 1998; Kruijff-Korbayová and Steedman, 2003) studies structural and semantic properties of a sentence according to its relation to the discourse context. Information structure affects how discourse entities are referred to in a text, which is known as Information Status (Halliday, 1967; Prince, 1981; Nissim et al., 2004). Specifically, information status (IS henceforth) reflects the accessibility of a discourse entity based on the evolving discourse context and the speaker's assumption about the hearer's knowledge and beliefs. For instance, according to Markert et al. (2012), old mentions refer to entities that have been referred to previously; mediated mentions have not been mentioned before but are accessible to the hearer by reference to another old mention or to prior world knowledge; and new mentions refer to entities that are introduced to the discourse for the first time and are not known to the hearer before.
In this paper, we mainly follow the IS scheme proposed by Markert et al. (2012) and focus on learning fine-grained IS on written texts. A mention's semantic and syntactic properties can signal its information status. For instance, indefinite NPs tend to be new and pronouns are likely to be old. Moreover, referential patterns of how a mention is referred to in a sentence also affect this mention's IS. In Example 1, "Friends" is a bridging anaphor even if we do not know the antecedent (i.e., she); while the information status for "Friends" in Example 2 is mediated/worldKnowledge. Section 3.1 analyzes the characteristics of each IS category and the relations between IS and discourse context.
(2) Friends are part of the glue that holds life and faith together.
In this work, we propose a simple yet effective discourse context-aware self-attention model based on BERT (Devlin et al., 2019) for fine-grained IS classification. We find that the sentence containing the target mention as well as the lexical overlap information between the target mention and the preceding mentions are the most important discourse context when assigning IS to a mention. With the self-attention mechanism, our model can capture important signals within a mention and the interactions between the mention and its context. On the ISNotes corpus (Markert et al., 2012), our model achieves new state-of-the-art performance on fine-grained IS classification, obtaining a 4.8 absolute overall accuracy improvement compared to Hou et al. (2013a). More importantly, we also show an improvement of 10.5 F1 points for bridging anaphora recognition without using any sophisticated hand-crafted semantic features.
Furthermore, to gain additional insights into our model's predictions, we analyze the attention mechanisms of our trained model. We find that the most attended tokens for each IS category correspond well with linguistic features of information status. For instance, for the old IS category, the most attended token list includes pronouns such as "she", "her", and "it". For the new category, by contrast, the model pays more attention to indefinite determiners such as "a" and "an". Section 6 provides a detailed analysis of the attention map for each IS category.
To summarize, the main contributions of our work are as follows:
• We propose a simple and effective model for fine-grained IS classification. The model uses a novel approach for encoding information from the previous sentences along with the current sentence for IS classification.
• Our proposed model achieves new state-of-the-art results for IS classification and bridging anaphora recognition on the ISNotes corpus. Our model also achieves competitive results for fine-grained IS classification on the Switchboard dialogue IS corpus (Nissim et al., 2004) that uses a different IS scheme than the one in ISNotes. The processed datasets and code are publicly available at: https://github.com/IBM/bridging-resolution.
• We carry out ablation studies to understand the effectiveness of each component in our model. We further investigate the self-attention patterns in our model and find that the model does learn specific linguistic features for predicting information status.

Related Work
IS classification and bridging anaphora recognition. Bridging resolution (Hou et al., 2014) contains two subtasks: identifying bridging anaphors (Markert et al., 2012; Hou et al., 2013a; Hou, 2016a) and finding the correct antecedents among candidates (Hou et al., 2013b; Hou, 2018a; Hou, 2018b; Hou, 2020). Most previous studies handle bridging anaphora recognition as part of the IS classification problem. Markert et al. (2012) applied joint inference for IS classification on the ISNotes corpus but reported very low results on bridging recognition. Building on this work, Hou et al. (2013a) designed many linguistic features to capture bridging and integrated them into a cascading collective classification algorithm. This approach was later integrated into a pipeline for bridging resolution (Hou, 2016b). In contrast, Hou (2016a) used an attention-based LSTM model based on GloVe vectors and a small set of features for IS classification. The author reported an overall IS classification accuracy similar to Hou et al. (2013a), but the result on bridging anaphora recognition is much worse than Hou et al. (2013a). Rahman and Ng (2012) incorporated carefully designed rules into an SVM algorithm for IS classification on the Switchboard dialogue IS corpus (Nissim et al., 2004). The authors first designed a rule-based system to assign IS classes to mentions on the basis of Nissim's IS annotation guidelines (Nissim et al., 2004). They then applied an SVM multiclass algorithm for this task, combining the prediction from the rule-based system, the ordering of the rules, as well as two lexical features.
Another work on IS classification was carried out by Cahill and Riester (2012). They assumed that the distribution of IS classes within sentences tends to follow certain linear patterns, e.g., old > mediated > new. Under this assumption, they trained a CRF model with syntactic and surface features for fine-grained IS classification on the German DIRNDL radio news corpus (Riester et al., 2010). Recently, Rösiger (2019) adapted eight rules from Hou et al. (2014) to recognize bridging anaphors and find their antecedents in the improved annotations of the extended DIRNDL corpus (Björkelund et al., 2014). Different from the above-mentioned work, we do not use any complicated hand-crafted features, and our model improves the previous state-of-the-art results on both overall IS classification accuracy and bridging recognition by a large margin on the ISNotes corpus. Our model also achieves competitive results for fine-grained IS classification on Switchboard compared to the approach of Rahman and Ng (2012), which combines the Stanford coreference resolver with an SVM classifier built on 18 carefully designed hand-crafted rules.
Fine-tuning with contextual word embeddings. Recent studies (Peters et al., 2018; Devlin et al., 2019) have shown that a range of downstream NLP tasks benefit from fine-tuning task-specific parameters on top of pre-trained contextual word representations. Our work belongs to this category, and we fine-tune our model based on BERT representations (Devlin et al., 2019). The novelty of our approach is that we create a "pseudo sentence" for each mention that encodes the most effective local and global discourse context for predicting the mention's IS. The self-attention mechanism in the Transformer encoder (Vaswani et al., 2017) allows our model to attend to both the context and the mention itself for clues that help predict the mention's IS.
Model probing. Recently, there have been a number of studies exploring the types of knowledge encoded in the BERT model. Jawahar et al. (2019) found that internal vector representations in BERT encode rich linguistic information, with surface information at the bottom layers, syntactic information in the middle layers, and semantic information at the top layers. Clark et al. (2019) showed that certain attention heads in BERT correspond well to the linguistic knowledge of syntax and coreference. In our work, we demonstrate that the attention patterns in our trained model embed linguistic notions of information status.

Information Status and Discourse Context
The IS scheme proposed by Markert et al. (2012) adopts three major coarse-grained IS categories (old, new, and mediated) from Nissim et al. (2004) and distinguishes six subcategories for mediated. Below we provide a brief description of the eight fine-grained IS classes in ISNotes.
Old mentions are coreferent with the already introduced entities. New mentions are entities that have not been introduced into the discourse and the hearer/reader cannot infer them from either previously mentioned entities or general world knowledge. Mediated mentions are discourse-new and hearer-old (Prince, 1992). They have not been introduced into the discourse before but are accessible to the hearer by reference to another mention or to prior world knowledge.
Among the mediated category, Mediated/worldKnowledge mentions are generally known to the hearer. This category contains mostly proper names. Mediated/syntactic mentions are syntactically linked to other old or mediated mentions, such as "[[their]_old father]_m/syntactic" or "[a war in [Africa]_mediated]_m/syntactic". Mediated/aggregate mentions are coordinated NPs where at least one element is old or mediated, such as "[[U.S.]_mediated and [Canada]_mediated]_m/aggregate". Mediated/function mentions refer to a value of a previously explicitly mentioned function, and this function needs to be able to rise or fall (e.g., 6 cents in Example 3). Mediated/comparative mentions usually contain a premodifier to indicate that this entity is compared to another preceding entity (antecedent) (e.g., further attacks in Example 4). Finally, Mediated/bridging mentions are associative anaphors that link to previously introduced related entities/events (e.g., Friends in Example 1).
(3) In trading on the American Stock Exchange, Delmed's price [went down]_function 6 cents.
(4) [The cyber attacks]_antecedent were followed by further attacks on ZDNet.com, a news portal.
We characterize the linguistic factors that affect a mention's IS into three categories: mention properties, local context, and previous context.

m/aggregate: coordinated NPs where at least one element is old or mediated (e.g., U.S. and Canada; he and his son)
m/function: refer to a value of a previously explicitly mentioned rise/fall function (e.g., (the price went down) 6 cents)
m/comparative: usually contain a premodifier to indicate that this entity is compared to another entity (e.g., another law; further attacks)
m/bridging: associative anaphors which link to previously introduced related entities/events (e.g., the price; the reason)
new: introduced into the discourse for the first time and not known to the hearer before (e.g., a reader; politics)
Table 1: Information status categories and their main affecting factors. "Local context" means the sentence s which contains the target mention; "previous context" indicates all sentences from the discourse which occur before s.
Table 1 summarizes the main affecting factors for each IS class. Note that we analyze the main affecting factors for each IS class based on their definitions. As described in Section 1, a mention's internal syntactic and semantic properties can signal its IS. For instance, a mention containing a possessive pronoun modifier is likely to be mediated/syntactic (e.g., their father); and a mediated/comparative mention often contains a premodifier indicating that this entity is compared to another preceding entity (e.g., further attacks).
In addition, for some IS classes, the "local context" (the sentence s which contains the target mention) and the "previous context" (sentences from the discourse which precede s) play an important role when assigning IS to a mention. Example 1 and Example 2 in Section 1 demonstrate the role of the local context for IS. In Example 1, the referential patterns in the local context indicate that "Friends" is a bridging anaphor, whereas "Friends" in Example 2 is a generic NP.
Sometimes we need to look at the previous context when deciding the IS for a mention. In Example 5, without looking at the previous context, we would tend to assign "Poland" in the second sentence the IS mediated/worldKnowledge. However, the correct IS for "Poland" is old because it has been mentioned in the previous context.

IS Classification with Discourse Context-Aware Self-Attention
To account for the different factors described in the previous section when predicting IS for a mention, we create a novel "pseudo sentence" for each mention and apply the multi-head self-attention encoder (Vaswani et al., 2017; Devlin et al., 2019) to this sentence. Figure 1 depicts the high-level structure of our model. The pseudo sentence consists of five parts: previous overlap info, local context, the delimiter token "[SEP]", the content of the target mention, and the IS prediction token "[CLS]". The previous overlap info part contains two tokens, which indicate whether the target mention has the same string/head as a mention from the preceding sentences. The local context is the sentence containing the target mention.
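To make the construction concrete, the sketch below (our illustration, not the authors' released code) assembles the two input segments from a mention record. The surface form of the overlap indicators is written here as "pre_overlap1"/"pre_overlap2" tokens; which of the two encodes the string match versus the head match, and all class and function names, are our assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Mention:
    text: str                           # surface string of the target mention
    sentence: str                       # sentence containing the mention (local context)
    same_string_before: Optional[bool]  # string match with a preceding mention; None = "NA" (pronouns)
    same_head_before: Optional[bool]    # head match with a preceding mention; None = "NA" (pronouns)

def overlap_token(value: Optional[bool], name: str) -> str:
    """Render one previous-overlap indicator token."""
    if value is None:
        return f"{name} = NA"
    return f"{name} = {'yes' if value else 'no'}"

def build_pseudo_sentence(m: Mention) -> Tuple[str, str]:
    """Return the two segments of the pseudo sentence.

    Segment A = previous overlap info + local context; segment B = the mention.
    A BERT tokenizer later prepends "[CLS]" and separates the segments with "[SEP]".
    """
    segment_a = " ".join([
        overlap_token(m.same_string_before, "pre_overlap1"),
        overlap_token(m.same_head_before, "pre_overlap2"),
        m.sentence,
    ])
    return segment_a, m.text
```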
The final prediction is made based on the hidden state of the prediction token "[CLS]". In BERT, this token is added to every sequence as the first token and its hidden state is used as the aggregate sequence representation for classification tasks. The novelty of our work is that we design the structure of the two input segments (the context part before "[SEP]" and the mention part after it) in a way that embeds the most indicative information for predicting a mention's information status. During training, the multi-head self-attention mechanism helps the model learn the important cues from both the mention itself and its discourse context when predicting IS.
There are other ways to encode a mention's context information. For instance, one could add more previous sentences to the local context or replace the current previous overlap info with all previous sentences. In practice, we found that the current configuration yields the best results on the ISNotes and Switchboard IS corpora. In particular, we notice that using all previous sentences as the discourse context significantly decreases the results for IS classification. This is in line with the observation from Joshi et al. (2019) that modeling longer context in BERT provides no improvement for coreference resolution.

Model Parameters
We use the vanilla BERT (Devlin et al., 2019) for our experiments. We initialize our model using pre-trained BERT contextual embeddings, which are trained on the BookCorpus (800M words) and English Wikipedia (2,500M words). We then fine-tune the model for 3 epochs with a learning rate of 3e-5 and a batch size of 32. During training and testing, the maximum token length of the pseudo sentence is set to 128.
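For concreteness, here is a minimal fine-tuning sketch with the Hugging Face transformers library using the hyper-parameters above (3 epochs, learning rate 3e-5, batch size 32, maximum length 128). The checkpoint name, the data pipeline, and all function names are our assumptions rather than the original implementation.

```python
import torch
from torch.optim import AdamW
from transformers import BertTokenizer, BertForSequenceClassification

IS_CLASSES = ["old", "m/worldKnowledge", "m/syntactic", "m/aggregate",
              "m/function", "m/comparative", "m/bridging", "new"]

tokenizer = BertTokenizer.from_pretrained("bert-large-cased")   # assumed checkpoint
model = BertForSequenceClassification.from_pretrained(
    "bert-large-cased", num_labels=len(IS_CLASSES))

def encode(segment_a: str, segment_b: str):
    # Pseudo sentence: [CLS] overlap info + local context [SEP] mention [SEP]
    return tokenizer(segment_a, segment_b, truncation=True,
                     max_length=128, padding="max_length", return_tensors="pt")

def fine_tune(batches, epochs=3, lr=3e-5):
    # `batches` is assumed to come from a DataLoader with batch_size=32,
    # yielding (encoded inputs, LongTensor of IS class indices).
    optimizer = AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for inputs, labels in batches:
            optimizer.zero_grad()
            loss = model(**inputs, labels=labels).loss
            loss.backward()
            optimizer.step()
```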

Experimental Setup
We perform experiments on the ISNotes corpus (Markert et al., 2012), which contains 10,980 mentions annotated for information status in 50 news texts taken from the Wall Street Journal portion of the OntoNotes corpus (Weischedel et al., 2011). Table 2 shows the IS distribution in ISNotes.
Following Hou et al. (2013a), all experiments are performed via 10-fold cross-validation on documents. On each testing fold, the model is trained on the other nine folds. The hyper-parameters of 3 epochs and a learning rate of 3e-5 were fixed during all training processes. We report overall accuracy as well as precision, recall and F-score per IS class. In the following, we describe the baselines as well as our model with different settings.

collective (baseline 1). This is the joint inference system for IS classification from Markert et al. (2012).

cascade collective (baseline 2). This is the cascading minority preference system for bridging anaphora recognition from Hou et al. (2013a).

incremental LSTM (baseline 3). This is the attention-based LSTM model proposed by Hou (2016a). The model uses one-hot vectors to encode IS classes and predicts information status for all mentions of a document from left to right incrementally.
self-attention with BERT BASE . We fine-tune BERT BASE on the pseudo sentences described in Section 3. The model has 12 transformer blocks, 768 hidden units, and 12 self-attention heads.
self-attention with BERT LARGE . We fine-tune BERT LARGE on the pseudo sentences described in Section 3. The model has 24 transformer blocks, 1024 hidden units, and 16 self-attention heads.

Table 3 shows the results of our models compared to the baselines. Our best model self-attention with BERT LARGE improves over all baselines by a large margin on all IS categories. It achieves an overall accuracy of 83.7% on fine-grained IS classification, obtaining absolute accuracy improvements of 4.8 and 5.1 points over the two strong baselines (collective and cascade collective), respectively.

It is worth noting that recognizing bridging anaphora is a challenging task (Markert et al., 2012). Hou et al. (2013a) proposed a large set of discourse structure, lexico-semantic and genericity detection features to capture the phenomenon. Their best model for bridging anaphora recognition (cascade collective) achieves an F-score of 42.2. Overall, our model self-attention with BERT LARGE achieves the new state-of-the-art performance for this task with an F-score of 52.7 without resorting to any sophisticated hand-crafted semantic features. By comparing the confusion matrices of cascade collective and self-attention with BERT LARGE , we find that the largest share of recall errors for bridging recognition in cascade collective comes from bridging anaphors being misclassified as new. This can be explained by the fact that many new mentions and bridging anaphors share the same syntactic form (see Example 1 and Example 2), and the lexico-semantic features in cascade collective only pick up on certain types of bridging. In addition, most precision errors in cascade collective are new and old mentions being misclassified as m/bridging. Both kinds of errors are less frequent in self-attention with BERT LARGE . It seems that our model captures the properties of bridging anaphora better by only looking at a mention and its interactions with the surrounding context.
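The confusion-matrix comparison above can be reproduced with a short sketch such as the following (our own illustration, not the released code); gold and predicted labels are assumed to be collected over all test folds.

```python
from sklearn.metrics import confusion_matrix

IS_CLASSES = ["old", "m/worldKnowledge", "m/syntactic", "m/aggregate",
              "m/function", "m/comparative", "m/bridging", "new"]

def bridging_error_breakdown(gold, pred):
    """Show where bridging recall and precision errors go.

    gold, pred: lists of IS labels (strings from IS_CLASSES).
    """
    cm = confusion_matrix(gold, pred, labels=IS_CLASSES)
    b = IS_CLASSES.index("m/bridging")
    # Recall errors: gold bridging anaphors predicted as some other class.
    recall_errors = {cls: int(cm[b, j])
                     for j, cls in enumerate(IS_CLASSES) if j != b}
    # Precision errors: other classes wrongly predicted as m/bridging.
    precision_errors = {cls: int(cm[i, b])
                        for i, cls in enumerate(IS_CLASSES) if i != b}
    return recall_errors, precision_errors
```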

Ablations
To better understand the impact of different components in our model, we carry out an ablation experiment. We remove the target mention, local context, previous overlap info, as well as all context information (local context + previous overlap info) from our best model self-attention with BERT LARGE , respectively. Table 4 reports the results of the different configurations of our model. Surprisingly, the model considering only the content of mentions (see the last column of Table 4) achieves results competitive with the baseline cascade collective, which explores many hand-crafted linguistic features. It also outperforms the three baselines on several IS categories (m/syntactic, m/aggregate, m/comparative, m/bridging and new). In Section 3.1, we noted that m/syntactic and m/aggregate are often signaled by a mention's internal syntactic structure, and that the semantics of certain premodifiers is a strong signal for m/comparative. The improvements on these categories show that our model can capture the semantic/syntactic properties of a mention when predicting its IS.
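The ablation settings can be realized simply by dropping one component of the pseudo sentence at a time, as in the following sketch (flag names and the handling of the removed mention segment are our assumptions):

```python
from typing import Optional, Tuple

def build_ablated_input(overlap_info: str, local_context: str, mention: str,
                        use_overlap: bool = True, use_local: bool = True,
                        use_mention: bool = True) -> Tuple[str, Optional[str]]:
    """Build (segment_a, segment_b) for the ablation settings discussed above.

    use_overlap=False -> "wo pre overlap info"
    use_local=False   -> "wo local context"
    use_mention=False -> "wo target mention"
    """
    parts_a = []
    if use_overlap:
        parts_a.append(overlap_info)
    if use_local:
        parts_a.append(local_context)
    # When the mention is ablated, the second segment is dropped entirely and
    # the tokenizer is called with a single segment.
    segment_b = mention if use_mention else None
    return " ".join(parts_a), segment_b
```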
Among all three components, it seems that the content of mentions has the most impact on the overall results, while the local context has the least impact. Furthermore, we find that local context and previous overlap info have different impacts on IS classes. More specifically, we notice that m/bridging, m/function and new benefit most from local context, whereas old and m/worldKnowledge benefit most from previous overlap info. This may seem counter-intuitive for m/bridging and m/worldKnowledge, as one expects that m/bridging should benefit more from the previous context and that m/worldKnowledge is a local phenomenon. For m/worldKnowledge, this is explained by the fact that the system without previous context information (self-attention wo pre overlap info) wrongly predicts many old mentions as m/worldKnowledge, as illustrated in Example 5. For m/bridging, the large impact of the local context corresponds to Hou et al. (2013a)'s observation that some bridging can be indicated by referential patterns without world knowledge about the anaphor/antecedent NPs. For instance, in the following sentence, "The blicket couldn't be connected to the dax. The wug failed.", the mention "The wug" is likely a bridging anaphor, although we do not know the antecedent. Similarly, Clark (1975) distinguishes between bridging via necessary, probable, and inducible parts/roles. He states that only in the first case does the antecedent trigger the bridging anaphor in the sense that we already spontaneously think of the anaphor when we read/hear the antecedent. In the probable/inducible cases, the bridging anaphor accommodates itself into the context and is induced by the need for an antecedent.
In addition, we also tested whether a broader local context helps the model detect bridging better. In the ISNotes corpus, 26% of bridging anaphors have antecedents in the same sentence, and 77% of anaphors have antecedents occurring in the same or up to two sentences prior to the anaphor. In practice, we tried adding the previous k sentences (k = 1 and k = 2) to the current local context but found that the overall results for bridging in both settings are similar to those of the current configuration.

Experimental Results on the Switchboard IS Corpus
In this section, we apply our discourse context-aware self-attention model to the Switchboard dialogue IS corpus (Nissim et al., 2004). The corpus contains around 63k mentions annotated with IS types (i.e., old, mediated, and new) and subtypes. Note that the IS scheme in this corpus differs from the one in ISNotes in terms of fine-grained IS classes. In general, bridging in this corpus includes non-anaphoric, syntactically linked part-of and set-member relations (e.g., the house's door), as well as comparative anaphors that are marked by surface indicators such as "other" or "different". Nevertheless, we think this corpus provides a useful testbed for checking whether our model carries over to a different IS scheme and domain.

IS class: Most attended tokens
old: the, pre overlap2 = NA, pre overlap1 = NA, pre overlap2 = yes, pre overlap1 = yes, it, her, she, that, they
m/worldKnow.: pre overlap1 = no, pre overlap2 = no, the, month, year, and, of, to, this, said
m/syntactic: pre overlap1 = no, the, pre overlap2 = no, of, 's, her, in, its, pre overlap2 = yes, to
m/aggregate: and, the, or, pre overlap1 = no, pre overlap2 = yes, her, oil, units, of
m/function: %, units, pre overlap2 = yes, pre overlap1 = no, to, 8, fell, 5, 243, million
m/comparative: pre overlap1 = no, more, pre overlap2 = no, other, pre overlap2 = yes, higher, companies, some, of, that
m/bridging: pre overlap1 = no, the, pre overlap2 = no, a, in, friends, year, demand, production, to
new: pre overlap1 = no, pre overlap2 = no, the, a, an, of, to, -, magazines, but
Table 6: Top ten most attended tokens for each IS class.

Following Rahman and Ng (2012), we split the dataset into a training set containing 117 dialogues and a testing set containing 30 dialogues. We train our model self-attention with BERT LARGE on the training dataset using the parameters described in Section 3.3. Table 5 lists the results of our model compared to Rahman and Ng (2012)'s system, which is an SVM multiclass model based on predictions from a rule-based system and the Stanford Deterministic Coreference Resolution System (Lee et al., 2011). The rule-based system consists of 18 hand-crafted rules for assigning IS subtypes to mentions. Some of the rules are based on lexical relations encoded in WordNet and FrameNet.
Note that the results of the two systems in Table 5 are not directly comparable due to the different splits of the training/testing datasets. Nevertheless, our model self-attention with BERT LARGE achieves competitive performance compared to Rahman and Ng (2012)'s system in terms of the overall accuracy. In general, it seems that our model is better at predicting old mentions. We also checked the confusion matrix and found that the low results for med/situation, med/event and med/func value are due to the fact that our model cannot distinguish these three categories from med/set.

Attention to Linguistic Features
In order to gain additional insights into our model's predictions, we analyze the attention maps in our best model (self-attention with BERT LARGE ) that is trained on ISNotes. We aim to check to what extent the most attended tokens correspond to the linguistic features for each IS class.
Specifically, we randomly choose one fold and apply the trained model to its testing dataset. Since the "[CLS]" token is used for prediction, we analyze the attention weights assigned to the other tokens from "[CLS]" for each testing instance. The weight of each token is normalized by the sequence length. The final attention score for each token is calculated by aggregating the normalized attention weights across all testing instances in all 16 heads from the last layer, since previous work suggests that the last layer usually encodes the task-specific features during fine-tuning (Kovaleva et al., 2019). Table 6 lists the top ten most attended tokens for each IS class. We exclude the separator tokens ([CLS]/[SEP]) and two punctuation tokens (comma and period) from the list, following Clark et al. (2019), who observe that these tokens are heavily attended in deep heads and might be used as a no-op for attention heads. Note that pre overlap1 and pre overlap2 are the two tokens that indicate whether the target mention has the same string/head as a mention from the preceding sentences. Both can take the value "yes", "no", or "NA". Following Markert et al. (2012), "NA" means "non-applicable" and is mainly used for pronouns.
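A possible way to compute these aggregated attention scores with the transformers API is sketched below; the exclusion list and the length normalization follow the description above, while the function name and the choice to probe a fine-tuned BertForSequenceClassification model directly are our assumptions.

```python
import collections
import torch

def aggregate_cls_attention(model, tokenizer, pseudo_sentences):
    """Aggregate attention from "[CLS]" to every other token over the last layer.

    pseudo_sentences: iterable of (segment_a, segment_b) pairs for one test fold.
    Returns a dict mapping token -> summed, length-normalized attention weight
    across all heads of the last layer and all instances.
    """
    scores = collections.defaultdict(float)
    model.eval()
    for seg_a, seg_b in pseudo_sentences:
        inputs = tokenizer(seg_a, seg_b, return_tensors="pt",
                           truncation=True, max_length=128)
        with torch.no_grad():
            out = model(**inputs, output_attentions=True)
        last_layer = out.attentions[-1][0]       # (num_heads, seq_len, seq_len)
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        seq_len = len(tokens)
        cls_to_tokens = last_layer[:, 0, :]      # attention from [CLS] (position 0)
        for head in cls_to_tokens:               # aggregate over all heads
            for tok, weight in zip(tokens, head.tolist()):
                if tok in ("[CLS]", "[SEP]", ",", "."):
                    continue                     # excluded as described above
                scores[tok] += weight / seq_len  # normalize by sequence length
    return scores
```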
We notice that many of the attended tokens in Table 6 correspond well with the linguistic features for each IS class. For old mentions, the model attends to pronouns and to signals that indicate string overlap, while for new mentions, the model attends to tokens that indicate string non-overlap and to the indefinite determiners "a/an".
It is interesting to note that the model seems to learn the internal syntactic/semantic structure for a few IS classes. For instance, "of" and "'s" are strong signals for m/syntactic mentions that have a prepositional or possessive structure. Also, m/aggregate mentions usually contain the tokens "and/or", which indicate a coordination structure. Similarly, for the m/comparative category, the model learns to focus on a few premodifiers (e.g., "more", "other", and "higher") that indicate a comparison between two entities.
Finally, for m/function mentions, the model learns to focus mostly on numbers. Surprisingly, the model also learns to attend to the verb "fell", which corresponds well with the definition of this IS class (see Section 3.1). For the most difficult category, m/bridging, the model attends to some relational nouns (e.g., "friends" or "demand") that are likely to be used as bridging anaphors.

Conclusions
We propose a simple discourse context-aware self-attention model for IS classification based on the BERT fine-tuning framework. We cast the IS classification problem as a sentence classification task by creating a novel "pseudo sentence" for each mention. We design the "pseudo sentence" based on linguistic intuitions about IS so that it contains the most indicative context information for predicting a mention's information status. This design allows the model to capture clues from both the mention and its context when predicting IS.
Our model does not contain any complex hand-crafted semantic features and achieves new state-of-the-art results for IS classification and bridging anaphora recognition on ISNotes, which contains written news articles. In another domain consisting of conversational dialogues (Switchboard), our model also achieves competitive performance for fine-grained IS classification compared to previous work (Rahman and Ng, 2012).
Finally, in order to better understand our model's predictions, we probe our best model (self-attention with BERT LARGE ) on ISNotes. We find that our model learns to pay more attention to signals that correspond well to the linguistic features of each IS class.