Bridging Resolution: Making Sense of the State of the Art

While Yu and Poesio (2020) have recently demonstrated the superiority of their neural multi-task learning (MTL) model over rule-based approaches to bridging anaphora resolution, there is little understanding of (1) in what ways it is better than the rule-based approaches (e.g., are the two approaches making similar or complementary mistakes?) and (2) what should be improved. To shed light on these issues, we (1) propose a hybrid rule-based and MTL approach that enables a better understanding of their comparative strengths and weaknesses; and (2) perform a manual analysis of the errors made by the MTL model.


Introduction
Bridging resolution is an anaphora resolution task that involves identifying and resolving bridging/associative anaphors, which are anaphoric references to non-identical associated antecedents. To exemplify, consider the following sentence taken from the BASHI corpus (Rösiger, 2018a): Even if baseball triggers losses at CBS -- and he doesn't think it will -- "I'd rather see the games on our air than on NBC and ABC," he says.
In this example, a bridging link exists between the anaphor the games and its antecedent baseball, as the definite description cannot be interpreted correctly unless it is associated with baseball.
Bridging resolution is arguably more challenging than entity coreference resolution. The reason is that unlike in entity coreference, in bridging resolution there are typically no clear syntactic or surface clues for identifying the antecedent of a bridging anaphor. In many cases, resolution requires the use of context as well as commonsense inference.
Compounding the difficulty of bridging resolution, the annotated corpora available for training bridging resolvers are much smaller than those for training entity coreference resolvers (e.g., OntoNotes (Hovy et al., 2006)). As a result, early work focused on developing rule-based systems (e.g., Hou et al. (2014), Rösiger (2018b)). A key weakness of rule-based approaches is that the ruleset may have to be updated when it is applied to a new corpus (e.g., new rules may have to be added, and existing rules may have to be removed or modified), as different bridging corpora are annotated with slightly different guidelines (to cover different kinds of bridging links, for instance). In light of this weakness, Yu and Poesio (2020) recently proposed a neural bridging resolver based on multi-task learning (MTL). Despite being trained on the relatively small amount of labeled data currently available, their resolver has achieved state-of-the-art results on three evaluation corpora.
In this paper, we seek to make sense of this state of the art by shedding light on two issues. First, how is the MTL model better than its rule-based counterparts? More specifically, while MTL is apparently making fewer mistakes than the rules, are the two approaches making similar or complementary mistakes? Second, given that the MTL model is the current state of the art, what needs to be improved in MTL?
To investigate the first issue, we propose a hybrid approach to bridging resolution: we first apply the hand-crafted rules to identify bridging links, and then employ the MTL-based model to resolve any (anaphoric) mentions that are not resolved by the rules. The design of this pipelined resolver is motivated in part by sieve-based approaches to entity coreference resolution (Raghunathan et al., 2010; Lee et al., 2013). Specifically, given our hypothesis that hand-crafted rules typically have higher precision and lower coverage than machine-learned patterns, we employ the rules as our first sieve and MTL as our second sieve. If our hybrid approach outperformed both the rule-based and learning-based approaches, that would provide suggestive evidence that the two approaches have complementary strengths.
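Concretely, the two-sieve pipeline can be sketched as follows. This is a minimal illustration of the design rather than our actual implementation; rule_based_resolve and mtl_resolve are hypothetical stand-ins for the two underlying resolvers.

```python
def hybrid_resolve(mentions, rule_based_resolve, mtl_resolve):
    """Two-sieve bridging resolver: high-precision rules first,
    then the MTL model for anaphors the rules leave unresolved."""
    # Sieve 1: apply the hand-crafted rules.
    links = dict(rule_based_resolve(mentions))  # {anaphor: antecedent}

    # Sieve 2: let the MTL model resolve the mentions the rules skipped.
    unresolved = [m for m in mentions if m not in links]
    for anaphor, antecedent in mtl_resolve(unresolved, mentions).items():
        links.setdefault(anaphor, antecedent)  # rule links take precedence
    return links
```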

Experimental Setup
Corpora. We evaluate on three corpora: ISNotes (Markert et al., 2012), BASHI (Rösiger, 2018a), and ARRAU (Poesio and Artstein, 2008; Uryupina et al., 2020). Following previous work, we report results only on RST, the most comprehensively annotated segment of ARRAU. Table 1 shows the statistics on these corpora. For ARRAU RST, we use the standard train-test split. For ISNotes and BASHI, we divide the available documents into 10 folds and report 10-fold cross-validation results, following previous work (Hou, 2020; Yu and Poesio, 2020).

Corpora      Docs   Tokens   Mentions   Anaphors
ISNotes        50    40292      11272        663
BASHI          50    57709      18561        459
ARRAU RST     413   228901      72013       3777

Table 1: Statistics on the three evaluation corpora.

The hybrid approach. Recall that our hybrid approach is composed of a rule-based system and Yu and Poesio's (2020) learning-based MTL approach. Below we provide a brief overview of the MTL approach and the rules.
Yu and Poesio's (2020) MTL-based system is the first neural model for full bridging resolution. They presented two extensions to Kantor and Globerson's (2019) span-based neural mention-ranking model (Denis and Baldridge, 2008) that was originally developed for entity coreference resolution. First, they provided gold mentions as input to the model, meaning that the model needs to learn the span representations but not the span boundaries. Second, they proposed to train the model to perform coreference and bridging in an MTL framework, where the span representation layer is shared by the two tasks so that information learned from one task can be utilized when learning the other. Unlike feature-based approaches, where feature engineering plays a critical role in performance, this model employs only two features: the length of a mention and the mention-pair distance.
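To make the MTL setup concrete, the sketch below shows the general shape of such a model: a shared span-representation layer feeding two task-specific pairwise scorers, one for coreference and one for bridging. The class name, layer sizes, and span-encoding details are illustrative assumptions, not Yu and Poesio's actual configuration.

```python
import torch
import torch.nn as nn

class SharedSpanMTL(nn.Module):
    """Shared span encoder with coreference and bridging heads (illustrative)."""
    def __init__(self, token_dim=768, span_dim=400):
        super().__init__()
        # Shared span-representation layer: updated by both tasks.
        self.span_ffnn = nn.Sequential(
            nn.Linear(2 * token_dim, span_dim), nn.ReLU())
        # Task-specific scorers over (anaphor, antecedent) pairs.
        self.coref_scorer = nn.Linear(2 * span_dim, 1)
        self.bridging_scorer = nn.Linear(2 * span_dim, 1)

    def span_repr(self, tokens, start, end):
        # Endpoint concatenation; real models add head attention, width features, etc.
        return self.span_ffnn(torch.cat([tokens[start], tokens[end]], dim=-1))

    def score_pair(self, tokens, anaphor, antecedent, task):
        a = self.span_repr(tokens, *anaphor)      # anaphor = (start, end)
        b = self.span_repr(tokens, *antecedent)
        pair = torch.cat([a, b], dim=-1)
        scorer = self.coref_scorer if task == "coref" else self.bridging_scorer
        return scorer(pair)
```

In the mention-ranking setup, the model scores every candidate antecedent for a given anaphor and links the anaphor to the highest-scoring candidate.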
Different rule-based systems have been developed for the three evaluation corpora. We used Hou et al.'s (2014) rules for ISNotes, and Rösiger's (2018b) rulesets for BASHI and ARRAU. Table 2 shows an example rule designed by Hou et al. (2014) for full bridging resolution in ISNotes. As can be seen, a rule is composed of two conditions: one on the anaphor and the other on the antecedent. If two mentions satisfy these conditions, the rule posits a bridging link between them. In the table, we express the rule in terms of its name, the condition on the anaphor, the condition on the antecedent, and the motivation behind its design.
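A rule of this form is essentially a pair of predicates. The following sketch shows one way to encode such rules; the example conditions are invented for illustration and do not reproduce any actual rule from Table 2.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class BridgingRule:
    """A rule fires on an (anaphor, antecedent) pair of mentions
    when both of its conditions hold (cf. Table 2)."""
    name: str
    anaphor_cond: Callable[[dict], bool]
    antecedent_cond: Callable[[dict, dict], bool]

    def applies(self, anaphor, antecedent):
        return (self.anaphor_cond(anaphor)
                and self.antecedent_cond(anaphor, antecedent))

# Illustrative (hypothetical) rule: a definite NP with a relational head
# noun (e.g., "the games") may bridge to a mention in an earlier sentence.
RELATIONAL_NOUNS = {"game", "price", "door", "roof"}
example_rule = BridgingRule(
    name="definite-relational-NP",
    anaphor_cond=lambda m: m["definite"] and m["head"] in RELATIONAL_NOUNS,
    antecedent_cond=lambda ana, ant: ant["sent_id"] <= ana["sent_id"],
)
```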
Setting. We report results for full bridging resolution. In this setting, a system is given as input not only a document but also the gold mentions in the document. The goal is to identify the subset of the gold mentions that are bridging anaphors and resolve them to their antecedents, which are also chosen from the gold mentions.
Postprocessing. Following previous work (Rösiger et al., 2018), we postprocess the output of a resolver by removing the gold coreferent anaphors from the predicted bridging anaphors.
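This postprocessing step amounts to a simple filter over the predicted links. A minimal sketch, assuming predictions are stored as an anaphor-to-antecedent dictionary:

```python
def remove_coreferent_anaphors(predicted_links, gold_coref_anaphors):
    """Drop predicted bridging anaphors that are gold coreferent anaphors."""
    return {ana: ant for ana, ant in predicted_links.items()
            if ana not in gold_coref_anaphors}
```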
Table 3: Full bridging recognition and resolution results in ISNotes, BASHI, and ARRAU RST.

Evaluation metrics. We report results for recognition and resolution in terms of precision, recall, and F-score. For recognition, recall is the fraction of gold anaphors that are correctly identified, whereas precision is the fraction of anaphors identified by the system that are correct. For resolution, recall and precision are defined in a similar fashion.
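Concretely, both metrics can be computed from anaphor-to-antecedent dictionaries, as in the sketch below; for simplicity it assumes a single gold antecedent per anaphor and exact-match scoring.

```python
def prf(correct, predicted, gold):
    p = correct / predicted if predicted else 0.0
    r = correct / gold if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(pred_links, gold_links):
    """pred_links / gold_links: {anaphor: antecedent} dictionaries."""
    # Recognition: is the mention correctly identified as a bridging anaphor?
    rec_correct = len(set(pred_links) & set(gold_links))
    recognition = prf(rec_correct, len(pred_links), len(gold_links))
    # Resolution: is the anaphor also linked to a correct antecedent?
    res_correct = sum(1 for ana, ant in pred_links.items()
                      if ana in gold_links and gold_links[ana] == ant)
    resolution = prf(res_correct, len(pred_links), len(gold_links))
    return recognition, resolution
```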

Results
Bridging recognition and resolution results of the three approaches under comparison (i.e., Rules, MTL, and Hybrid) on the three evaluation corpora are shown in Table 3. The performance trends largely corroborate our hypothesis. On all three datasets, the recall of Hybrid is substantially higher than that of Rules and MTL for both recognition and resolution, meaning that Rules and MTL are making different rather than similar mistakes and can therefore be used to complement each other's weaknesses. Moreover, Hybrid's F-scores on ISNotes and BASHI are better than those of Rules and MTL: on ISNotes, Hybrid outperforms MTL by 5.7 and 4.8 percentage points in recognition and resolution F-score, respectively; on BASHI, the corresponding improvements are 5.4 and 2.0 points. On ARRAU RST, however, Hybrid's recognition and resolution F-scores are only slightly better than those of Rules and MTL. The failure of Hybrid to offer substantial F-score gains on ARRAU RST can be attributed to Rules's relatively low precision: unlike on ISNotes and BASHI, where Rules's precision is higher than MTL's, on ARRAU RST Rules's precision is more or less at the same level as MTL's.
Next, we compare in Table 4 the performance of our three resolvers on different categories of anaphors defined by the rules used in the rule-based resolver. (Owing to space limitations, only the results on ISNotes and BASHI are shown in Table 4; the results on ARRAU RST can be found in Appendix B.) Each rule category is identified using its rule ID (column 1); the mapping between rule IDs and the rule categories can be found in Appendix A. Each fraction in column 2 is the ratio of the number of gold anaphors that satisfy the anaphor condition of a rule to the number of gold mentions that satisfy the same condition. Finally, the recognition and resolution results shown in the remaining columns are expressed in terms of precision (P), recall (R), and F-score (F). We believe that these results can reveal the comparative strengths and weaknesses of the resolvers.
A few points about the results in Table 4 deserve mention. On ISNotes (Table 4(a)), while Rules outperforms MTL on the majority of the rule categories in resolution F-score, MTL achieves the state of the art by resolving anaphors in the largest category, Rule 18 (Other), which consists of anaphors that cannot be handled by any of the rules. On BASHI (Table 4(b)), however, Rules outperforms MTL on only four rule categories. This is somewhat surprising because the rulesets used for ISNotes and BASHI are almost identical to each other. A closer look at the numbers in the second column of Table 4 reveals an interesting observation: for a majority of the rules, the number of gold anaphors that satisfy a rule's anaphor condition is smaller in BASHI than in ISNotes, whereas the number of gold mentions that satisfy an anaphor condition is larger in BASHI than in ISNotes. This is again somewhat surprising because both ISNotes and BASHI contain 50 WSJ news articles taken from OntoNotes that are annotated with very similar annotation schemes. Consequently, we computed the average length of a document in the two datasets and found that BASHI indeed has more tokens per document on average (1154 tokens/doc in BASHI compared to 805 tokens/doc in ISNotes). The fact that BASHI has longer documents could explain why more gold mentions satisfy the anaphor conditions of the rules in BASHI than in ISNotes. However, we still could not explain why the number of gold anaphors that satisfy the anaphor conditions of the rules is smaller in BASHI than in ISNotes.
To understand the reason, we took a closer look at the documents in BASHI and found that there are cases of bridging that are not annotated. Examples of such missing bridging links are shown in Table 6, where the missing anaphors are marked with asterisks and their antecedents with underscores. We therefore speculate that the lower resolution precision achieved by Rules on BASHI has to do with the incomplete gold annotations in BASHI.

When Michael S. Perry took the podium at _a recent cosmetics industry event_, more than 500 executives packing *the room* snapped to attention.

_Folk doctors_ also prescribe it for kidney, bladder and urethra problems, duodenal ulcers and hemorrhoids. *Some* apply it to gouty joints.

Table 6: Examples of bridging links that are not annotated in BASHI.
In Table 5, we quantify how different Rules and MTL are w.r.t. each rule category. Let GA_i be the set of gold anaphors that are covered by rule category i. We show for each i the percentage of GA_i that are (1) correctly recognized/resolved by both resolvers (B), (2) correctly recognized/resolved by Rules but not MTL (R), and (3) correctly recognized/resolved by MTL but not Rules (M). For both ISNotes and BASHI, the relatively large numbers under the "R" and "M" columns suggest that Rules and MTL are making different predictions; moreover, the fact that the numbers under "R" are larger than the corresponding numbers under "M" for a majority of categories implies that more gold anaphors are recognized/resolved solely by Rules than solely by MTL.

        Recognition       Resolution
Rule    B    R    M       B    R    M
1      38   62    0      25   50    0
2      29   43   14      29   29   14
3      47   47    5      16   58   16
4      46   14   26      40    6   23
5      50   12    0      38

Table 5: Percentages of gold anaphors covered by each rule category that are correctly recognized/resolved by both Rules and MTL (B), by Rules only (R), and by MTL only (M).
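The B/R/M percentages in Table 5 can be computed by partitioning the gold anaphors of each rule category according to which resolver handles them correctly. A minimal sketch, assuming we have the sets of correctly handled anaphors for each system:

```python
def brm_percentages(gold_anaphors, correct_rules, correct_mtl):
    """Partition the gold anaphors of one rule category by which
    system handles them correctly: Both, Rules-only, MTL-only."""
    ga = set(gold_anaphors)
    if not ga:
        return {"B": 0.0, "R": 0.0, "M": 0.0}
    both = ga & correct_rules & correct_mtl
    rules_only = (ga & correct_rules) - correct_mtl
    mtl_only = (ga & correct_mtl) - correct_rules
    return {label: 100 * len(subset) / len(ga)
            for label, subset in
            [("B", both), ("R", rules_only), ("M", mtl_only)]}
```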

Error Analysis
To better understand where the MTL model needs improvement, we perform a manual analysis of its errors and discuss the three major types of error in the following three subsections.

Recognition: Precision Errors
Precision errors in recognition are errors in which a mention is misclassified as a bridging anaphor. Coreference anaphor errors are the most common type of precision error, accounting for 14-30% of the overall precision errors in recognition. Coreference anaphor errors occur when a gold coreference anaphor is predicted to be a bridging anaphor.
Consider the first example in Table 7. In this example, the gold coreference anaphor the stake is predicted to be a bridging anaphor and resolved to the ground, but it has a coreference link with a big iron stake. By definition, a bridging anaphor (especially a referential bridging anaphor) should not be a coreference anaphor. We speculate that MTL makes these mistakes because it is trained on coreference and bridging in the multi-task setting.

(1) After three Sagos were stolen from his home in Garden Grove, "I put a big iron stake in the ground and tied the tree to the stake with a chain," he says proudly.

(2) Currently, Boeing has a backlog of about $80 billion, but production has been slowed by a strike of 55,000 machinists, which entered its 22nd day today.

(3) In addition, the government is figuring that the releases could create a split between the internal and external wings of the ANC and between the newly freed leaders and those activists who have emerged as leaders inside the country during their imprisonment. In order to head off any divisions, Mr. Mandela, in a meeting with his colleagues before they were released, instructed them to report to the ANC headquarters in Lusaka as soon as possible.

Table 7: Examples of recognition and resolution errors made by the MTL model.

Recognition: Recall Errors
Recall errors in recognition refer to the model's failure to identify bridging anaphors. Indefinite expression errors are the most common type of recall error, accounting for 48-71% of the overall recall errors in recognition on the three datasets. Indefinite expression errors occur when a system misclassifies an indefinite bridging anaphor as a mention having the NEW information status. (In the information status taxonomy, bridging is a subcategory of the MEDIATED category.) Consider the second example in Table 7. In this example, the indefinite bridging anaphor production is not detected by the MTL model. The reason is that the syntactic forms of many NEW instances and indefinite bridging anaphors are the same, so it is not easy for the model to distinguish between them. This observation has also been made by Hou et al. (2018).

Resolution: Precision Errors
Precision errors in resolution refer to errors in identifying the antecedent of a bridging anaphor. Unmodified expression errors are the most common type of precision error, accounting for 23-63% of the overall precision errors in resolution. Unmodified expression errors occur when a predicted anaphor is a short mention without modifiers. Such a mention is semantically less rich than a modified one and is therefore harder to resolve.
Consider the third example in Table 7. In this example, the anaphor any divisions is resolved to a wrong antecedent, their imprisonment, rather than to the correct antecedent, the ANC.

Conclusion
In this paper, we sought to make sense of the state of the art in bridging resolution. We combined the hand-crafted rules and the MTL model in a pipelined fashion, showing that (1) the rules and MTL were making complementary mistakes and (2) the resulting hybrid approach achieved state-of-the-art results on three standard evaluation datasets. In addition, we performed a manual error analysis to determine what needed to be improved in MTL. Finally, our findings suggested that BASHI's annotation quality may need to be reassessed.

A Rules for Bridging Resolution
Table 8 enumerates the heuristic rules manually designed for bridging resolution on ISNotes, BASHI, and ARRAU RST. As mentioned earlier, each rule is composed of a rule ID, an anaphor condition, and an antecedent condition. To enable the reader to better understand these rules, we describe in the last column of the table the motivation behind the design of each rule.

B Results on ARRAU RST
Results on ARRAU RST are shown in Tables 9 and 10. Specifically, Table 9 shows the performance of the three resolvers (Rules, MTL, and Hybrid) on different rule categories. This table is formatted in the same way as Table 4 and can therefore be interpreted in the same manner. Comparing Table 9 with Table 4, we see that Rules outperforms MTL on only a few rule categories, namely Rules 5, 6, 14, and 15. Consequently, the improvement of Hybrid over MTL on ARRAU RST is the smallest of the three evaluation datasets.
In Table 10, we quantify how different Rules and MTL are w.r.t. each rule category on ARRAU RST by showing the percentages of gold anaphors covered by each rule category that are correctly recognized/resolved by both Rules and MTL (B), by Rules only (R), and by MTL only (M). This table is formatted in the same way as Table 5 and can therefore be interpreted in the same manner. As we can see, the largest values in the "R" column for both recognition and resolution are associated with Rules 12 and 14, meaning that these are the rule categories in which Rules has unique strength. This observation is consistent with the results for Rules 12 and 14 in Table 9. Other than these two rule categories, Rules manages to uniquely recognize/resolve just a few anaphors, covered by rule categories 5, 10, 11, and 15. In contrast, the number of gold anaphors that are uniquely recognized/resolved by MTL is larger than the number uniquely handled by Rules. Overall, we can infer from the results in Table 10 that the use of Rules does not add much value to MTL on ARRAU RST.