Anaphora Resolution with the ARRAU Corpus

The ARRAU corpus is an anaphorically annotated corpus of English providing rich linguistic information about anaphora resolution. The most distinctive feature of the corpus is the annotation of a wide range of anaphoric relations, including bridging references and discourse deixis in addition to identity (coreference). Other distinctive features include treating all NPs as markables, including non-referring NPs; and the annotation of a variety of morphosyntactic and semantic mention and entity attributes, including the genericity status of the entities referred to by markables. However, the corpus has not been extensively used for anaphora resolution research so far. In this paper, we discuss three datasets extracted from the ARRAU corpus to support the three subtasks of the CRAC 2018 Shared Task: identity anaphora resolution over ARRAU-style markables, bridging reference resolution, and discourse deixis; the evaluation scripts assessing system performance on those datasets; and preliminary results on these three tasks that may serve as a baseline for subsequent research on these phenomena.


Introduction
The release of the ONTONOTES coreference corpus (Pradhan et al., 2007a) and the organization of two CONLL shared tasks based on the dataset (Pradhan et al., 2012) have resulted in a substantial increase in coreference research, both in terms of quantity and in terms of quality. We expect ONTONOTES to remain a key resource for the field for many years.
However, ONTONOTES also has a number of frequently mentioned limitations, including:
• Not all NPs of relevance to anaphora resolution are treated as markables. For instance, expletives are not annotated.
• And even among referring markables, singletons are not annotated, nor are references to abstract objects or many types of generic objects (Pradhan et al., 2012).
Furthermore, anaphora resolution involves a number of phenomena besides 'coreference', such as bridging reference (Clark, 1975) and discourse deixis (Webber, 1991). Only a simple form of discourse deixis, event anaphora, is annotated in ONTONOTES; bridging reference was not annotated, although a subset of the corpus has been annotated with this information by Markert et al. (2012). A number of these limitations are overcome in the ARRAU corpus (Uryupina et al., In press). In ARRAU, all NPs are considered markables, including expletives and singletons. Both discourse deixis and bridging reference have been annotated.
However, the corpus has not yet been widely used for anaphora resolution research, with a few exceptions (Rodriguez, 2010; Uryupina and Poesio, 2012; Marasović et al., 2017). There are a number of reasons for this, ranging from the fact that research on both bridging reference and discourse deixis is still limited, to the unusual markup format. The objective of this paper is to introduce the community to the three datasets extracted from the ARRAU corpus to support this year's CRAC 2018 Shared Task, the first evaluation campaign based on ARRAU. Our hope is that making such datasets available may, on the one hand, facilitate the use of ARRAU and, on the other, increase the community of researchers working on these aspects of anaphora resolution.

Genres
The ARRAU corpus includes a substantial amount of news text in the sub-corpus called RST, consisting of the entire subset of the Penn Treebank (Marcus et al., 1993) that was annotated in the RST treebank (Carlson et al., 2003). News data were annotated so that researchers could compare results on ARRAU with results on other news datasets; and these documents were chosen because they had already been annotated in a number of ways: not only syntactically (e.g., through the Penn Treebank (Marcus et al., 1993)) and for their argument structure (e.g., through Propbank (Palmer et al., 2005)) but also for rhetorical structure (Carlson et al., 2003). But one of the objectives of the ARRAU annotation was to cover genres other than news, so, in addition to RST, ARRAU includes three more sub-corpora. The TRAINS sub-corpus includes all the task-oriented dialogues in the TRAINS-93 corpus; the PEAR sub-corpus consists of the complete collection of spoken narratives in the Pear Stories that provided some of the early evidence on salience and anaphoric reference (Chafe, 1980); and the GNOME sub-corpus covers documents from the medical and art history genres covered by the GNOME corpus (Poesio, 2000a, 2004b), used to study both local and global salience (Poesio et al., 2004). The same coding scheme was used for all sub-corpora, but separate guidelines were written for the textual and the spoken dialogue sub-corpora. Table 1 provides basic statistics about the four ARRAU sub-corpora. Note in particular the large number of non-referring markables. RST, TRAINS and PEAR were used for the CRAC 2018 shared task.

Markables
Markable definition Many anaphorically annotated corpora, especially the older ones, impose syntactic, semantic or discourse-based restrictions on markables. For instance, in ONTONOTES neither expletives nor singletons are annotated (for a discussion of the state of the art in anaphoric annotation, see Poesio et al. (2016)). By contrast, in ARRAU all NPs are considered markables, even when they are non-referring, whether expletives such as it or predicative NPs such as a busy place in (1), or when they do not corefer with any other markable and thus form a singleton coreference chain. Moreover, non-referring markables are manually sub-classified into expletives, predicative NPs, and quantifiers. In addition, possessive pronouns are marked as well, and all premodifiers are marked when the entity referred to is mentioned again, e.g., in the case of the proper name US in (2), and when the premodifier refers to a kind, like exchange-rate in (3).
(1) [It] seems to be [a busy place]
Newsweek's ad rates would increase 5% in January.
Markable properties All markables are manually annotated for a variety of properties according to the GNOME guidelines (Poesio, 2000b): these include morphosyntactic agreement (gender, number and person), grammatical function, and the semantic type of the entity. The guidelines and reliability studies leading to this scheme are discussed in (Poesio, 2000a, 2004a; Uryupina et al., In press). We will only mention one attribute here, the reference attribute, which specifies the logical form status of the NP (referring, expletive, quantificational, or predicative) and can be used to distinguish between referring and non-referring markables.

Types of anaphoric relations marked
The ARRAU guidelines support annotation of different types of anaphoric relations. All referring markables are marked as either discourse new or discourse old. Discourse-new mentions introduce new entities and thus are not marked as being coreferent with an entity already introduced (antecedent). For discourse-old mentions, an antecedent can be identified, either of type phrase (if the antecedent was introduced using a nominal markable) or segment (not introduced by a nominal markable, for discourse deixis). In addition, referring NPs can be marked as related to a previously mentioned discourse entity, to identify them as examples of associative (bridging) anaphora.

Bridging references
The term bridging reference was introduced by Clark (1975) to refer to any reference that requires some sort of 'bridging' inference to be interpreted. Clark's very general definition covered both identity anaphora in which the description of the anaphor is different from the description of the antecedent, as in (5); and so-called associative anaphora (Hawkins, 1978), in which the anaphoric expression refers to an object that is associated with, but not identical to, the antecedent, as in (6). (These days, the term bridging reference is mostly used to refer to the associative cases.) Annotating (indeed, even identifying) bridging references in a reliable way is difficult (Vieira, 1998; Poesio and Vieira, 1998), which is one of the reasons why so few large-scale corpora for anaphora include this type of annotation (Poesio et al., 2016). The ARRAU guidelines for bridging anaphora are based on experiments that started with the work of Vieira and Poesio (Vieira, 1998; Poesio and Vieira, 1998) and continued in the GNOME project (Poesio, 2004a). In GNOME, a subset of relations that could be annotated reliably was found (Poesio, 2004a), including three types of relations: element-of; subset; and a generalized possession relation poss covering both part-of relations and general possession relations. The ARRAU Release 1 guidelines followed the GNOME guidelines, but with an extension and a simplification. The extension was that annotators were asked to mark a markable as related to a particular antecedent if it stood to that antecedent in one of the relations identified in GNOME (indeed, the same examples were used), or in one of two additional relations, whose annotation reliability was not tested:
• other, for other NPs, broadly following the guidelines in (Modjeska, 2003);
• an undersp-rel relation for 'obvious cases of bridging that didn't fit any other category'.
The simplification was that in ARRAU Release 1, coders were not asked to specify the relation: effectively, any associative bridging reference was considered a case of 'underspecified relation'. In ARRAU Release 2, the annotation of bridging references was revised for the RST domain only, and coders were asked to mark the relations only in that domain. Some statistics about bridging references in ARRAU Release 2 are shown in Table 2. A total of 5512 bridging references were marked, but a classification of the relations was only provided for the 3777 bridging references identified in the RST domain. In the table, we write P+S+E+O+U as the category for the bridging references in the other domains, which are currently not classified.
Discourse deixis
The term discourse deixis was introduced by Webber (1991) to indicate reference to abstract entities which have not been introduced in the discourse through a nominal markable, as in the following example from the TRAINS corpus, where that in utterance 7.6 refers to the plan of shipping a boxcar of oranges to Elmira.
7.3 : so we ship one
7.4 : boxcar
7.5 : of oranges to Elmira
7.6 : and that takes another 2 hours
Discourse deixis is a very complex form of reference, both to annotate (Artstein and Poesio, 2006) and to resolve. Very few anaphoric annotation projects have attempted annotating discourse deixis in its entirety (Artstein and Poesio, 2006; Dipper and Zinsmeister, 2012). More typical is a partial annotation, as in (Byron and Allen, 1998; Navarretta, 2000), who annotated pronominal reference to abstract objects; in ONTONOTES, where event anaphora was marked (Pradhan et al., 2007b); and in the work of Kolhatkar (2014), which focused on so-called shell nouns. In ARRAU,
1. A coder specifying that a referring expression is discourse old is asked whether its antecedent was introduced using a phrase (markable) or segment (discourse segment).
2. Coders choosing segment have to mark a sequence of predefined clauses.
Statistics about discourse deixis in ARRAU Release 2 are shown in Table 3. A total of 1633 cases of discourse deixis were marked.

Markup
ARRAU was annotated using the MMAX2 annotation tool (Müller and Strube, 2006). MMAX2 is based on token standoff technology: the annotated anaphoric information is stored in a phrase level whose markables point to a base layer in which each token is represented by a separate XML element.
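To make the idea of token standoff concrete, the following Python sketch resolves a markable span against a base layer of tokens. The span syntax (`word_1..word_4` for a range, a bare `word_4` for a single token) follows the MIN column of the shared-task export described in this paper; the dictionary-based base layer and the function name `resolve_span` are assumptions made for illustration and are not part of MMAX2 itself.

```python
def resolve_span(span, base_layer):
    """Return the tokens covered by a standoff span such as 'word_1..word_4'."""
    start, _, end = span.partition("..")
    first = int(start.split("_")[1])
    # A span without '..' covers a single token.
    last = int(end.split("_")[1]) if end else first
    return [base_layer[f"word_{i}"] for i in range(first, last + 1)]


# Hypothetical base layer: one entry per token, keyed by token id.
base_layer = {
    "word_1": "Ripples",
    "word_2": "from",
    "word_3": "the",
    "word_4": "strike",
}

print(resolve_span("word_1..word_4", base_layer))
print(resolve_span("word_4", base_layer))
```

Because markables only store pointers into the base layer, several annotation levels can refer to the same tokens without duplicating or altering the text.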

Two releases
There have been two releases of the corpus. The first release, in 2008, is discussed in (Poesio and Artstein, 2008). This first release was relatively small (about 100K words in total), and focused primarily on identity anaphora and on the annotation of ambiguity, but its development involved extensive experiments with the annotation of discourse deixis and of ambiguity that led to the annotation guidelines used throughout the project (Poesio and Artstein, 2005b,a; Artstein and Poesio, 2006). The second release, via LDC in 2013, is substantially larger than the first (350K), and its annotation of bridging reference, discourse deixis and genericity is much more extensive. Another key annotation effort was the annotation of minimal spans of markables (MINs). Last but not least, extensive checks were run on the annotation of identity anaphora. This is the release used for the CRAC 2018 Shared Task.

Previous work on anaphora resolution with ARRAU

Identity anaphora
Rodriguez (2010) used BART (Versley et al., 2008) to compare the difficulty of ARRAU with that of the two more widely used corpora at the time, MUC-7 and ACE02. He also studied the effect of using MIN information to ascribe partial credit (50%) whenever a system markable overlaps with the minimal span of a gold markable and the boundaries of the system markable do not exceed those of the gold markable, as done in MUC. He found that assigning such partial credit substantially improves the scores. Uryupina and Poesio (2012) [...] or the entire dataset. They did that on both ARRAU 2 and ONTONOTES, thus providing what to our knowledge is the only comparison between the two corpora in terms of system performance. Table 4 summarizes the results.

Discourse Deixis
Marasović et al. (2017) developed an approach to abstract anaphora resolution based on bidirectional LSTMs to produce representations of the anaphor and the candidate sentence, and a mention ranking component adapted from the systems by Clark and Manning (2016) and Wiseman et al. (2015). The system was tested using both the dataset by Kolhatkar et al. (2013) (for shell nouns) and the discourse deixis cases in ARRAU.

The Three Tasks of CRAC 2018
The CRAC 2018 Shared Task was the evaluation campaign associated with this workshop. The task was articulated in three subtasks: a first task on identity anaphora resolution, a second one on bridging reference, and a third one on discourse deixis. Researchers could participate in each subtask independently, and indeed no group participated in more than one task. In this Section we discuss how the datasets for the three tasks were created using ARRAU, and the evaluation scripts that were used.

Markable Settings
One characteristic common to all three subtasks is that the official evaluation of systems was based on a gold setting, in that the markables were specified in advance. This was done because the organizers of Tasks 2 and 3 felt that the state of the art in bridging anaphora and discourse deixis resolution is such that the system markable setting would be too hard, so data had to be released in a gold setting for those tasks; it would then not make sense to release the same data in a system markables setting for Task 1. The evaluation scripts, however, supported both gold and predicted markables, and the evaluations reported below were carried out in both settings.

Task 1: Identity anaphora
In this task, systems have to decide
• whether a markable is referring or not;
• if referring, whether it introduces a new entity/coreference chain (discourse new) or refers to an entity already introduced (discourse old);
• in case it is classified as discourse old, the systems have to identify the antecedent (entity, or coreference chain).
Data format For this task, the documents were exported in the format used for EVALITA-2011 (Uryupina and Poesio, 2013), derived from the tabular CONLL-style format used in the SEMEVAL 2010 shared task on multilingual anaphora (Recasens et al., 2010). The format used involves three tab-separated columns, with one line per token:

TOKEN MARKABLE MIN
The first column specifies the token; the second column specifies whether the token belongs to a markable in BIO format (as noted above, evaluation is on gold markables, although participants could also submit runs for system-markables evaluation); and the third column specifies which token is the minimal span (MIN) of the markable, in the sense of MUC. So for example, the first line of the document wsjarrau_2308.CONLL consists of the following three columns:

Ripples B-markable_45 word_1
where Ripples is the token (in this case, the first token of the document, i.e., word_1); the second column says the token is the beginning of markable 45; and the third column says the MIN word of the markable is token 1, i.e., this very same token (note that token indices start from 1). The task of a system is to decide whether a markable is referring, and if so, the coreference chain it belongs to (possibly a singleton). Participation in a coreference chain is represented using the markable=set notation from EVALITA, a slight variation of the standard CONLL notation which generalizes to representations for bridging reference and discourse deixis as well, as discussed below. In the case of the example line above, the gold version of the document contains the following line:

Ripples B-markable_45=set_37 word_1 new

which states that markable 45 is referring; that the entity it refers to is discourse-new (fourth column); and that this entity is coreference chain set_37. (The EVALITA notation can easily be converted into the CONLL notation to use the standard CONLL scorer as well, as we did; see below.) In case a token is part of distinct markables, the @ notation from EVALITA 2011 is used, derived from the | notation from SEMEVAL 2010. Consider for instance the first few lines of the same test set file, representing the NP Ripples from the strike by 55,000 Machinists Union members against Boeing Co..
I-markable_45@I-markable_47@I-markable_50 word_1@word_4@word_11..word_12

This states that, for instance, the token Machinists is the Beginning of markable 609, which in turn is Inside markable 49, in turn inside markable 47, and then inside markable 45. For each of these markables, the coreference chain to which it belongs is specified using the markable=set notation. The third column specifies the MINs of each of these markables, again using the @ notation.
A system correctly interpreting these markables should output for every markable its coreference chain and information status (non referring, discourse new, or discourse old).
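The column layout described above can be read with a few lines of Python. The sketch below is only an illustration of the format as described in this section, not the reader used by the organizers: the function names are hypothetical, the handling of tokens outside any markable is not covered, and the fourth (information status) column is treated as optional because it only appears in the gold files.

```python
def parse_markable_field(field):
    """Split a MARKABLE field into (bio_tag, markable_id, chain_id) triples."""
    parsed = []
    for part in field.split("@"):                      # '@' separates nested markables
        bio_tag, _, rest = part.partition("-")         # 'B' or 'I', then the markable
        markable_id, _, chain_id = rest.partition("=") # chain is empty in test files
        parsed.append((bio_tag, markable_id, chain_id or None))
    return parsed


def parse_task1_line(line):
    """Parse one tab-separated token line of a Task 1 document."""
    columns = line.rstrip("\n").split("\t")
    token, markable_field, min_field = columns[:3]
    return {
        "token": token,
        "markables": parse_markable_field(markable_field),
        "mins": min_field.split("@"),
        "status": columns[3] if len(columns) > 3 else None,
    }


gold = parse_task1_line("Ripples\tB-markable_45=set_37\tword_1\tnew")
print(gold)

nested = parse_markable_field("I-markable_45@I-markable_47@I-markable_50")
print(nested)
```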
Evaluation script The coreference evaluation script developed by Moosavi and Strube was modified to produce the scorer for Task 1. We will refer to this script as 'the extended coreference scorer' below. The extended scorer, when run excluding non-referring expressions and singletons and ignoring MIN information, evaluates a system's response using the same metrics (indeed, a reimplementation of the same code) as the standard CONLL evaluation script, v8 (Pradhan et al., 2014). When required to use MIN information, the extended scorer follows the MUC convention, and considers a mention boundary correct if it contains the MIN and doesn't go beyond the annotated maximum boundary. When singletons are to be considered, singletons are also included in the scores (all metrics apart from MUC can deal with singletons). Finally, when run in all-markables mode, the script scores referring and non-referring expressions separately. Referring expressions are scored using the CONLL metrics; for non-referring expressions, the script evaluates P, R and F1 at non-referring expression identification. The extended coreference scorer is available from Moosavi's github at https://github.com/ns-moosavi/coval.
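The MUC-style boundary check used with MIN information can be sketched as follows. This is a minimal illustration of the rule stated above, not code taken from the extended scorer: the function name `min_match` and the representation of spans as inclusive (start, end) pairs of token indices are assumptions.

```python
def min_match(system_span, gold_max_span, gold_min_span):
    """Return True if the system span contains the gold MIN and stays
    within the gold maximum boundary. Spans are inclusive (start, end)
    pairs of token indices."""
    sys_start, sys_end = system_span
    max_start, max_end = gold_max_span
    min_start, min_end = gold_min_span
    contains_min = sys_start <= min_start and min_end <= sys_end
    within_max = max_start <= sys_start and sys_end <= max_end
    return contains_min and within_max


# Gold markable spanning tokens 1..12 whose MIN is token 1.
gold_max, gold_min = (1, 12), (1, 1)

print(min_match((1, 4), gold_max, gold_min))   # shorter span that keeps the MIN
print(min_match((2, 12), gold_max, gold_min))  # span that drops the MIN
print(min_match((1, 13), gold_max, gold_min))  # span that exceeds the maximum boundary
```

Under this rule a system is not penalized for omitting postmodifiers of the head, but it is still penalized for missing the head or for overshooting the annotated markable.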

Task 2: Bridging Anaphora
Data format For the bridging task, the documents were exported in a similar format to that of Task 1. Again, the test set already specifies the gold markables (in this case, only the bridging references). The test set provides four tab-separated columns, with one line for each token:

TOKEN MARKABLE MIN BRIDGE
The meaning of the first three columns is as in Task 1. The fourth column specifies whether the markable is a bridging reference. For example, the following lines

a B-markable_311 word_695 B-markable_311
speedy I-markable_311 word_695 I-markable_311
resolution I-markable_311 word_695 I-markable_311

state that tokens a, speedy, and resolution are part of markable 311, with head token word_695, and that this markable is a bridging reference. The objective of participating systems is to identify which anchor entity, and which anchor markable referring to that entity, the bridging reference refers to, using the notation

bridg_ref=bridg_rel=anchor_mark=anchor_ent

For example, in the case of markable 311 above, the correct answer states that markable 311 belongs to entity set_148 and is an associative reference to entity set_3 through the undersp-rel relation.
Evaluation script The evaluation script for Task 2 is based on the evaluation method proposed in (Hou et al., 2013). The script separately measures precision and recall at anchor entity recognition (e.g., whether set 3 is the right coreference chain) and at anchor markable detection (i.e., whether markable 308 is the appropriate markable of set 3). Note that whereas the identification of the anchoring entity is considered correct whenever the right coreference chain is identified, irrespective of the particular anchor markable chosen, the identification of the anchor markable is strict, i.e., it is only considered correct if the same markable as annotated is found.
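The difference between the lenient entity-level score and the strict markable-level score can be illustrated with a short Python sketch. It is not the official Task 2 script: the function name, the dictionary layout, and the identifiers `markable_400`, `markable_390`, `markable_305` and `set_9` are invented for the example, while markable 311, markable 308 and set 3 are the ones discussed above.

```python
def score_bridging(gold, system):
    """Precision and recall at anchor entity and anchor markable level.

    Both arguments map a bridging markable id to a pair
    (anchor_markable, anchor_entity)."""
    entity_hits = sum(
        1 for ref, (_, entity) in system.items()
        if ref in gold and gold[ref][1] == entity      # any mention of the right chain
    )
    markable_hits = sum(
        1 for ref, (markable, _) in system.items()
        if ref in gold and gold[ref][0] == markable    # exactly the annotated markable
    )

    def precision_recall(hits):
        precision = hits / len(system) if system else 0.0
        recall = hits / len(gold) if gold else 0.0
        return precision, recall

    return {
        "anchor_entity": precision_recall(entity_hits),
        "anchor_markable": precision_recall(markable_hits),
    }


gold = {
    "markable_311": ("markable_308", "set_3"),
    "markable_400": ("markable_390", "set_9"),   # hypothetical second bridging reference
}
# The system picks the right chain for markable 311 but a different mention of it.
system = {"markable_311": ("markable_305", "set_3")}

print(score_bridging(gold, system))
```

In this example the system gets full precision and half recall at anchor entity recognition, but no credit at anchor markable detection.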

Task 3: Discourse deixis
Finally, in this task (discourse deixis) systems have to identify the unit (a clausal text segment) that evokes the abstract entity the discourse deixis refers to.
For this task, the documents have been exported in a format again consisting of three columns, again with one line for each token:

TOKEN UNIT MARKABLE
The second column specifies which unit (= utterance in the case of dialogue data, clause in the case of textual data) the token belongs to. (All units have already been marked, so systems do not need to recognize them.) The third column specifies whether the token belongs to a discourse deixis, and if so, which unit (utterance) evoked the antecedent.
The first 14 lines contain tokens belonging to unit markable 565. The following 4 lines contain tokens belonging to unit markable 566. The last of these is marked as a discourse deixis:

this I-markable_566 B-markable_322

This line states that token this belongs to unit markable 566 5, and it is the beginning of a discourse deixis, B-markable_322. The systems' task is to identify which unit the discourse deixis refers to. The gold interpretation, using the =unit:<markable ID> format, would be as follows: 6

this I-markable_566 B-markable_322=unit:markable_565

Evaluation script The evaluation script for Task 3 computes the Success@N metric proposed by Kolhatkar (e.g., (Kolhatkar and Hirst, 2014)) and also used by Marasović et al. (2017). SUCCESS@N is the proportion of instances where the gold answer (the unit label) occurs within a system's first n choices. (S@1 is standard precision.)

5 All levels of annotation have markables named markable N where N is an integer, but those names are independent: so unit markable 566 is different from coreference markable 566.
6 It is actually not entirely clear from the example whether demonstrative this refers to 'preferring a simpler strategy' or 'hedging their individual holdings' or, more likely, a more complex abstract object.
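The Success@N metric used for Task 3 can be computed as in the following Python sketch. It is an illustration of the definition given above rather than the official evaluation script; the function name, the dictionary layout, and the identifiers `markable_400` and `markable_570` are assumptions made for the example, while markable 322 and units 565 and 566 come from the example above.

```python
def success_at_n(gold_units, ranked_predictions, n):
    """Proportion of discourse deixis instances whose gold unit occurs
    among the system's first n ranked candidate units."""
    if not gold_units:
        return 0.0
    hits = sum(
        1 for deixis, gold_unit in gold_units.items()
        if gold_unit in ranked_predictions.get(deixis, [])[:n]
    )
    return hits / len(gold_units)


gold_units = {
    "markable_322": "markable_565",
    "markable_400": "markable_570",   # hypothetical second instance
}
ranked_predictions = {
    "markable_322": ["markable_566", "markable_565"],  # gold unit ranked second
    "markable_400": ["markable_570"],                  # gold unit ranked first
}

print(success_at_n(gold_units, ranked_predictions, 1))
print(success_at_n(gold_units, ranked_predictions, 2))
```

An instance for which the system returns no candidates counts as a miss, so the denominator is always the number of gold instances.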

Anaphoric Resolution with The Three New Datasets: Results
No system participated in Task 1 and Task 3 of the shared task. In this Section we discuss the results obtained with Task 2, as well as the baseline results for markable extraction and Task 1.

Markable extraction
One of the important differences between corpora for anaphora / coreference is the definition of mentions (or markables, in this case). In order to compare the difficulty of markable extraction in ARRAU with that of mention extraction in ONTONOTES, we ran two markable extractors on both corpora: a few versions of a mention extractor based on the Stanford CORE pipeline, and our own implementation of an LSTM architecture for markable extraction. Our markable extractor is a modified version of the neural named entity recognition system proposed by Lample et al. (2016). Two versions of this markable extractor were run on the ONTONOTES dataset, one optimized for F1, one for recall. The results are shown in Table 5.
The results suggest that markable extraction in ARRAU is considerably easier than mention extraction in ONTONOTES. This might be due to the differences in markable definition, since singletons and non-referring NPs have to be excluded in ONTONOTES. But the accuracy gaps might also be a result of the domain differences between ONTONOTES and ARRAU. To test this, we ran the Stanford pipeline on the WSJ portion of the ONTONOTES test set. The highest score on the WSJ portion is obtained by the rule-based version of the pipeline, and is lower (43.1% F1) than that for the entire set. This suggests that the difference in performance is due to the more relaxed notion of markable used in ARRAU.

Task 1
The results from (Uryupina and Poesio, 2012) suggest that the resolution of identity anaphoric reference in ARRAU is no harder than in ONTONOTES, but to further test this the Stanford CORE deterministic coreference resolver (Lee et al., 2013) was run on the RST subset of the dataset for Task 1 as a baseline, using the division into training, development and test sets built into the shared task for this subdomain. The system was run both on gold and on predicted mentions, and evaluated first using both the CONLL official scorer and the extended coreference scorer ignoring singletons and non-referring markables, then including those.
On gold markables The first 10 lines of Table 6 show the results obtained using the extended coreference scorer and the CONLL official scorer excluding both singletons (4161 markables) and non-referring markables (1391), i.e., the same conditions as in the standard CONLL evaluations. In these conditions, the extended coreference scorer and the CONLL official scorer obtain the same scores modulo rounding. The following lines in Table 6 show the results when singletons are included in the assessment; for this evaluation, the Stanford deterministic coreference resolver was made to output singletons instead of removing them prior to evaluation. When non-referring markables are included as well, the results for referring expressions remain identical, but in addition, the scorer outputs the results on those separately. (The Stanford deterministic coreference resolver does not attempt to identify non-referring markables, hence all values are 0.) The first conclusion that can be drawn from this table is that the results of the Stanford resolver on gold markables on this dataset are broadly comparable to the results the system achieved on gold markables at CONLL 2011, where it achieved a CONLL score of 60.7. The second observation is that the system appears quite good at identifying singletons, as its CONLL score in that case is over ten percentage points higher; in other words, the system is very much penalized when running on the CONLL dataset.
On Predicted Markables Table 7 shows the results obtained by the Stanford deterministic coreference resolver when evaluated on predicted markables instead of gold markables. These are the results that are more directly comparable with those obtained by this system in the CONLL 2011 shared task. We can see a substantial drop in CONLL score, from 58.3 on predicted markables in the CONLL 2011 shared task to 43.2 on predicted markables with the Task 1 dataset. Most likely, this indicates that the system was to some degree optimized for the characteristics of the CONLL dataset, even though it is not trained.
Using the MIN information Finally, Table 8 shows the effect of using the MIN information. As can be seen from the Table, this results in five extra percentage points.

Task 2
One aspect of anaphoric interpretation for which there were no previous results with ARRAU is bridging reference. One group from the University of Stuttgart participated in this subtask (Roesiger, 2018). We summarize the results here; for further detail, see the paper. Roesiger developed two systems, one rule-based, one ML-based. The results obtained by these systems on all three subdomains are summarized in Table 9 in the Appendix. The three columns present the results of the two systems at the tasks of (i) attempting to resolve all gold bridging references; (ii) only producing results when the system is reasonably convinced; and (iii) identifying and resolving bridging references. These results appear broadly comparable to those obtained by Hou et al. (2013) over the ISNotes corpus as far as the RST and TRAINS domains are concerned, but much lower for the PEAR domain, although given the small number of bridging references in this domain (354) not too much should be read into this. See Roesiger (2018) for some interesting hypotheses regarding the differences between the two corpora.

Conclusions
In this paper we discussed a dataset based on the ARRAU corpus that supports three fundamental anaphora resolution tasks: identity anaphora resolution, bridging reference resolution, and discourse deixis. We are not aware of any other dataset supporting all three tasks, which makes the resource distinctive. We also discussed preliminary experiments with the data that can give other groups an idea of how to use them and what results have been achieved so far.