Plug Latent Structures and Play Coreference Resolution

We present cort, a modular toolkit for devising, implementing, comparing and analyzing approaches to coreference resolution. The toolkit allows for a unified representation of popular coreference resolution approaches by making explicit the structures they operate on. Several of the implemented approaches achieve state-of-the-art performance.


Introduction
Coreference resolution is the task of determining which mentions in a text refer to the same entity. Machine learning approaches to coreference resolution range from simple binary classification models on mention pairs (Soon et al., 2001) to complex structured prediction approaches (Durrett and Klein, 2013; Fernandes et al., 2014).
In this paper, we present a toolkit that implements a framework unifying these approaches: the framework yields a unified representation of many coreference approaches by making explicit the latent structures they operate on.
Our toolkit provides an interface for defining structures for coreference resolution, which we use to implement several popular approaches. An evaluation of the approaches on CoNLL shared task data (Pradhan et al., 2012) shows that they obtain state-of-the-art results. The toolkit can also perform end-to-end coreference resolution.
We implemented this functionality on top of the coreference resolution error analysis toolkit cort (Martschat et al., 2015). Hence, this toolkit now provides functionality for devising, implementing, comparing and analyzing approaches to coreference resolution. cort is released as open source and is available from the Python Package Index.

A Framework for Coreference Resolution
In this section we briefly describe a structured prediction framework for coreference resolution.

Motivation
The popular mention pair approach (Soon et al., 2001; Ng and Cardie, 2002) operates on a list of mention pairs. Each mention pair is considered individually for learning and prediction. In contrast, antecedent tree models (Yu and Joachims, 2009; Fernandes et al., 2014; Björkelund and Kuhn, 2014) operate on a tree which encodes all anaphor-antecedent decisions in a document. Conceptually, both approaches have in common that the structures they employ are not annotated in the data (in coreference resolution, the annotation consists of a mapping of mentions to entity identifiers). Hence, we can view both approaches as instantiations of a generic structured prediction approach with latent variables.

Setting
Our aim is to learn a prediction function $f$ that, given an input document $x \in \mathcal{X}$, predicts a pair $(h, z) \in \mathcal{H} \times \mathcal{Z}$. Here, $h$ is the (unobserved) latent structure encoding the coreference relations between mentions in $x$, and $z$ is the mapping of mentions to entity identifiers (which is observed in the training data). Usually, $z$ is obtained from $h$ by taking the transitive closure over the coreference decisions encoded in $h$. $\mathcal{H}$ and $\mathcal{Z}$ are the spaces containing all such structures and mappings.
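To illustrate how the entity mapping can be obtained from the latent structure by transitive closure, the following is a minimal sketch, not the toolkit's implementation. It assumes that mentions are numbered 1 to n and that the structure is given as a list of (anaphor, antecedent) pairs; the function name `entity_mapping` is hypothetical.

```python
def entity_mapping(num_mentions, arcs):
    """Derive the mention -> entity mapping from coreference arcs.

    Assumption for this sketch: mentions are the integers 1..num_mentions
    and arcs is a list of (anaphor, antecedent) pairs.
    """
    # Union-find forest; every mention starts as its own entity.
    parent = list(range(num_mentions + 1))

    def find(i):
        # Follow parent pointers to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merging the two sets for every arc yields the transitive closure.
    for anaphor, antecedent in arcs:
        parent[find(anaphor)] = find(antecedent)

    return {m: find(m) for m in range(1, num_mentions + 1)}


z = entity_mapping(5, [(2, 1), (4, 2), (5, 3)])
```

In this example, mentions 1, 2 and 4 end up in one entity even though no arc links 4 and 1 directly.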

Representation
For a document $x \in \mathcal{X}$, we write $M_x = \{m_1, \ldots, m_n\}$ for the mentions in $x$. Following previous work (Chang et al., 2012; Fernandes et al., 2014), we make use of a dummy mention which we denote as $m_0$. If $m_0$ is predicted as the antecedent of a mention $m_i$, we consider $m_i$ non-anaphoric. We define $M_x^0 = \{m_0\} \cup M_x$.
Inspired by previous work (Bengtson and Roth, 2008; Fernandes et al., 2014; Martschat and Strube, 2014), we adopt a graph-based representation of the latent structures $h \in \mathcal{H}$. In particular, we express structures by labeled directed graphs with vertex set $M_x^0$. Figure 1 shows a structure underlying the mention ranking and the antecedent tree approach. An arc between two mentions signals coreference. For antecedent trees (Fernandes et al., 2014), the whole structure is considered, while for mention ranking (Denis and Baldridge, 2008; Chang et al., 2012) only the antecedent decision for one anaphor is examined. This can be expressed via an appropriate segmentation into subgraphs which we refer to as substructures. One such substructure encoding the antecedent decision for $m_3$ is colored black in the figure.
Via arc labels we can express additional information. For example, mention pair models (Soon et al., 2001) distinguish between positive and negative instances. This can be modeled by labeling arcs with appropriate labels, such as + and −.
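As a rough illustration of labeled arcs, the sketch below builds the structure a mention pair model would operate on. It assumes that mentions are given in document order as the keys of a dictionary mapping each mention to its gold entity identifier; the function name `mention_pair_structure` and this data layout are assumptions made for the example, not part of the toolkit.

```python
def mention_pair_structure(gold_entity):
    """Label every (anaphor, antecedent) pair with '+' or '-'.

    Assumption for this sketch: gold_entity maps mentions, in document
    order, to entity identifiers.
    """
    mentions = list(gold_entity)
    structure = {}
    for i in range(len(mentions)):
        for j in range(i):  # all mentions preceding mentions[i]
            same = gold_entity[mentions[i]] == gold_entity[mentions[j]]
            structure[(mentions[i], mentions[j])] = "+" if same else "-"
    return structure


h = mention_pair_structure({"m1": "e1", "m2": "e2", "m3": "e1"})
```

Each key of `h` is an arc and each value its label, so the same graph representation covers both labeled and unlabeled approaches.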

Inference and Learning
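As is common in natural language processing, we model the prediction of $(h, z)$ via a linear model. That is,

$$f(x) = \operatorname*{arg\,max}_{(h, z) \in \mathcal{H} \times \mathcal{Z}} \langle \theta, \phi(x, h, z) \rangle,$$

where $\theta \in \mathbb{R}^d$ is a parameter vector and $\phi : \mathcal{X} \times \mathcal{H} \times \mathcal{Z} \to \mathbb{R}^d$ is a joint feature representation for inputs and outputs. When employing substructures, one maximization problem has to be solved for each substructure (instead of one maximization problem for the whole structure).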
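A minimal sketch of prediction with a linear model over one substructure follows. It assumes sparse feature vectors stored as dictionaries and a candidate set that maps each arc to its features; the names `score` and `predict` and the example features are hypothetical.

```python
def score(theta, features):
    """Inner product of the parameter vector and a sparse feature vector."""
    return sum(theta.get(name, 0.0) * value for name, value in features.items())


def predict(theta, candidates):
    """Return the candidate arc with the highest linear score.

    Assumption for this sketch: candidates maps each arc to its sparse
    feature dictionary.
    """
    return max(candidates, key=lambda arc: score(theta, candidates[arc]))


theta = {"same_head": 2.0, "distance": -0.5}
candidates = {
    ("m3", "m1"): {"same_head": 1.0, "distance": 2.0},
    ("m3", "m2"): {"distance": 1.0},
    ("m3", "m0"): {},  # dummy antecedent with no active features
}
best = predict(theta, candidates)
```

Solving one such maximization per substructure is what distinguishes, for example, mention ranking from approaches that maximize over the whole document structure at once.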
To learn the parameter vector $\theta \in \mathbb{R}^d$ from training data, we employ a latent structured perceptron (Sun et al., 2009) with cost-augmented inference (Crammer et al., 2006) and averaging (Collins, 2002).
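The learning procedure can be sketched as follows. This is a simplified illustration under stated assumptions rather than the toolkit's implementation: each substructure is a list of candidates, and each candidate is a dictionary holding its sparse `features`, a flag `consistent` (whether it agrees with the gold annotation) and a `cost`. Every substructure is assumed to contain at least one consistent candidate. Averaging is done naively by summing the weights after every step.

```python
def dot(theta, features):
    """Sparse inner product of parameters and features."""
    return sum(theta.get(name, 0.0) * value for name, value in features.items())


def train(substructures, epochs=5):
    """Latent structured perceptron with cost-augmented inference and averaging."""
    theta = {}
    summed = {}  # running sum of theta over all steps, used for averaging
    steps = 0
    for _ in range(epochs):
        for candidates in substructures:
            # Cost-augmented inference: costly candidates get a score bonus.
            predicted = max(
                candidates, key=lambda c: dot(theta, c["features"]) + c["cost"]
            )
            # Latent step: best candidate consistent with the gold annotation.
            gold = max(
                (c for c in candidates if c["consistent"]),
                key=lambda c: dot(theta, c["features"]),
            )
            if not predicted["consistent"]:
                # Move towards the gold candidate, away from the prediction.
                for name, value in gold["features"].items():
                    theta[name] = theta.get(name, 0.0) + value
                for name, value in predicted["features"].items():
                    theta[name] = theta.get(name, 0.0) - value
            steps += 1
            for name, value in theta.items():
                summed[name] = summed.get(name, 0.0) + value
    return {name: value / steps for name, value in summed.items()}


wrong = {"features": {"a": 1.0}, "consistent": False, "cost": 1.0}
right = {"features": {"b": 1.0}, "consistent": True, "cost": 0.0}
weights = train([[wrong, right]])
```

A practical implementation would typically avoid touching every weight at every step, for example by storing the step of the last update per feature.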

Implementation
We now describe our implementation of the framework presented in the previous section.

Aims
By expressing approaches in the framework, researchers can quickly devise, implement, compare and analyze approaches for coreference resolution. To facilitate development, it should be as easy as possible to define a coreference resolution approach. We first describe the general architecture of our toolkit before giving a detailed description of how to implement specific coreference resolution approaches.

Architecture
The toolkit is implemented in Python. It can process raw text and data conforming to the format of the CoNLL-2012 shared task on coreference resolution (Pradhan et al., 2012). The toolkit is organized in four modules: the preprocessing module contains functionality for processing raw text, the core module provides mention extraction and computation of mention properties, the analysis module contains error analysis methods, and the coreference module implements the framework described in the previous section.

preprocessing
By making use of NLTK, this module provides classes and functions for performing the preprocessing tasks necessary for mention extraction and coreference resolution: tokenization, sentence splitting, parsing and named entity recognition.

core
We employ a rule-based mention extractor, which also computes a rich set of mention attributes, including tokens, head, part-of-speech tags, named entity tags, gender, number, semantic class, grammatical function and mention type. These attributes, from which features are computed, can be extended easily.

analysis
To support system development, this module implements the error analysis framework of Martschat and Strube (2014). Users can extract, analyze and visualize recall and precision errors of the systems they are working on. Figure 2 shows a screenshot of the visualization. A more detailed description can be found in Martschat et al. (2015).

coreference
This module provides features for coreference resolution and implements the machine learning framework described in the previous section.
We implemented a rich set of features employed in previous work (Ng and Cardie, 2002; Bengtson and Roth, 2008; Björkelund and Kuhn, 2014), including lexical, rule-based and semantic features. The feature set can be extended by the user.
The module provides a latent structured perceptron implementation and contains classes that implement the workflows for training and prediction. As its main feature, it provides an interface for defining coreference resolution approaches. We have already implemented various approaches (see Section 4).

Defining Approaches
The toolkit provides a simple interface for devising coreference resolution approaches via structures. The user just needs to specify two functions: an instance extractor, which defines the search space for the optimal (sub)structures, and a decoder, which, given a parameter vector, finds optimal (sub)structures. The toolkit then performs training and prediction using these user-specified functions. The user can further customize the approach by defining cost functions to be used during cost-augmented inference, and clustering algorithms to extract coreference chains from latent structures, such as closest-first (Soon et al., 2001) or best-first (Ng and Cardie, 2002).
In the remainder of this section, we present an example implementation of the mention ranking model with latent antecedents (Chang et al., 2012) in our toolkit.

Instance Extractors
The instance extractor receives a document as input and defines the search space for the maximization problem to be solved by the decoder. To do so, it needs to output the segmentation of the latent structure for one document into substructures, and the candidate arcs for each substructure. Listing 1 shows source code of the instance extractor for the mention ranking model with latent antecedents. In this model, each antecedent decision for a mention corresponds to one substructure. Therefore, the extractor iterates over all mentions. For each mention, arcs to all preceding mentions are extracted and stored as candidate arcs for one substructure.
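The extraction procedure described above can be sketched as follows. This is a minimal illustration and not the toolkit's source code: it assumes the document is reduced to a list of mentions in document order whose first element is the dummy mention, and the function name `extract_substructures` is chosen for the example.

```python
def extract_substructures(mentions):
    """Return one list of candidate arcs per antecedent decision.

    Assumption for this sketch: mentions[0] is the dummy mention and the
    remaining mentions are in document order.
    """
    substructures = []
    # Every real mention is treated as a potential anaphor.
    for i in range(1, len(mentions)):
        anaphor = mentions[i]
        # Candidate antecedents are all preceding mentions, closest first,
        # ending with the dummy mention.
        arcs = [(anaphor, mentions[j]) for j in range(i - 1, -1, -1)]
        substructures.append(arcs)
    return substructures


substructures = extract_substructures(["m0", "m1", "m2", "m3"])
```

For the three mentions in this example, the extractor yields three substructures; the last one contains the arcs from m3 to m2, m1 and the dummy mention m0.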

Decoders
The decoder solves the maximization problems for obtaining the highest-scoring latent substructures consistent with the gold annotation, and the highest-scoring cost-augmented latent substructures.
Listing 2 shows source code of a decoder for the mention ranking model with latent antecedents. The input to the decoder is a substructure, which is a set of arcs, and a mapping from arcs to information about arcs, such as features or costs. The output is a tuple containing
• a list of arcs that constitute the highest-scoring substructure, together with their labels (if any) and scores,
• the same for the highest-scoring substructure consistent with the gold annotation,
• the information whether the highest-scoring substructure is consistent with the gold annotation.
To obtain this prediction, we invoke the auxiliary function self.find_best_arcs. This function searches through a set of arcs to find the overall highest-scoring arc and the overall highest-scoring arc consistent with the gold annotation. Furthermore, it also outputs the scores of these arcs according to the model, and whether the prediction of the best arc is consistent with the gold annotation.
For the mention ranking model, we let the function search through all candidate arcs for a substructure, since these represent the antecedent decision for one anaphor. Note that the mention ranking model does not use any labels.
The update of the parameter vector is handled by our implementation of the structured perceptron.
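To make the decoder's input and output concrete, here is a self-contained sketch for the mention ranking model. It is not the toolkit's code: the helper `find_best_arcs` is written as a plain function, and the arc information is assumed to be a dictionary with precomputed `score`, `cost` and `consistent` entries per arc, which is a simplification chosen for this example. The score reported for the best arc includes its cost.

```python
def find_best_arcs(arcs, arc_information):
    """Find the best cost-augmented arc and the best gold-consistent arc."""
    def augmented(arc):
        return arc_information[arc]["score"] + arc_information[arc]["cost"]

    best = max(arcs, key=augmented)
    # Assumption: at least one candidate arc is consistent with the gold data.
    consistent_arcs = [a for a in arcs if arc_information[a]["consistent"]]
    best_cons = max(consistent_arcs, key=lambda a: arc_information[a]["score"])
    return (
        best,
        augmented(best),
        best_cons,
        arc_information[best_cons]["score"],
        arc_information[best]["consistent"],
    )


def argmax(substructure, arc_information):
    """Decode one antecedent decision; mention ranking uses no arc labels."""
    best, best_score, cons, cons_score, is_consistent = find_best_arcs(
        substructure, arc_information
    )
    # Arcs, labels and scores for the prediction, then the same for the
    # gold-consistent structure, then the consistency flag.
    return ([best], [], [best_score], [cons], [], [cons_score], is_consistent)


arcs = [("m3", "m2"), ("m3", "m1"), ("m3", "m0")]
info = {
    ("m3", "m2"): {"score": 1.0, "cost": 1.0, "consistent": False},
    ("m3", "m1"): {"score": 1.5, "cost": 0.0, "consistent": True},
    ("m3", "m0"): {"score": 0.5, "cost": 2.0, "consistent": False},
}
result = argmax(arcs, info)
```

In the example, the cost term lets the erroneous arc to the dummy mention win the cost-augmented maximization, which would trigger a perceptron update.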

Cost Functions
Cost functions allow us to bias the learner towards specific substructures, which leads to a large-margin approach. For the mention ranking model, we employ a cost function that assigns a higher cost to erroneously determining anaphoricity than to selecting a wrong link, similar to the cost functions employed by Durrett and Klein (2013) and Fernandes et al. (2014). The source code is displayed in Listing 3.
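A cost function of this kind might look like the following sketch. The concrete cost values (2 for anaphoricity errors, 1 for wrong links), the argument layout and the name `cost` are assumptions made for illustration and are not taken from the toolkit.

```python
DUMMY = "m0"  # hypothetical identifier for the dummy mention


def cost(arc, gold_entity, is_anaphoric):
    """Cost of an antecedent decision relative to the gold annotation.

    Assumptions for this sketch: gold_entity maps mentions to entity
    identifiers, and is_anaphoric tells whether a mention has a gold
    antecedent.
    """
    anaphor, antecedent = arc
    if antecedent == DUMMY:
        # Predicting "non-anaphoric" is an error only for anaphoric mentions.
        return 2 if is_anaphoric[anaphor] else 0
    if not is_anaphoric[anaphor]:
        # A non-anaphoric mention was linked to an antecedent.
        return 2
    # Anaphoricity is correct; check whether the link itself is right.
    return 0 if gold_entity[anaphor] == gold_entity[antecedent] else 1


gold_entity = {"m1": "e1", "m2": "e2", "m3": "e1"}
is_anaphoric = {"m1": False, "m2": False, "m3": True}
costs = [
    cost(("m3", "m0"), gold_entity, is_anaphoric),
    cost(("m3", "m2"), gold_entity, is_anaphoric),
    cost(("m3", "m1"), gold_entity, is_anaphoric),
]
```

During cost-augmented inference, these values are added to the model score, so that costly errors must be beaten by a larger margin.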

Clustering Algorithms
The mention ranking model selects one antecedent for each anaphor, therefore there is no need to cluster antecedent decisions. Our toolkit provides clustering algorithms commonly used for mention pair models, such as closest-first (Soon et al., 2001) or best-first (Ng and Cardie, 2002).
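The difference between the two clustering strategies can be sketched as below. The sketch assumes that mentions are identified by their position in the document and that a mention pair model has produced a score for each (anaphor, antecedent) pair, with scores above a threshold counting as positive; the function name and data layout are hypothetical.

```python
def select_antecedents(pair_scores, strategy="closest-first", threshold=0.0):
    """Choose one antecedent per anaphor from mention pair scores.

    Assumption for this sketch: pair_scores maps an anaphor index to a
    dictionary {antecedent index: score}, indices following document order.
    """
    links = {}
    for anaphor, candidates in pair_scores.items():
        positive = {a: s for a, s in candidates.items() if s > threshold}
        if not positive:
            continue  # no positive pair: the mention starts a new entity
        if strategy == "closest-first":
            # Largest index among preceding mentions is the closest one.
            links[anaphor] = max(positive)
        elif strategy == "best-first":
            links[anaphor] = max(positive, key=positive.get)
        else:
            raise ValueError("unknown strategy: " + strategy)
    return links


pair_scores = {2: {1: -0.5}, 3: {1: 0.9, 2: 0.2}}
closest = select_antecedents(pair_scores, "closest-first")
best = select_antecedents(pair_scores, "best-first")
```

Here closest-first links mention 3 to mention 2, while best-first links it to mention 1; coreference chains then follow from the selected links by transitive closure.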

Running cort
cort can be used as a Python library, but it also provides two command-line tools, cort-train and cort-predict.

Evaluation
We implemented a mention pair model with best-first clustering (Ng and Cardie, 2002), the mention ranking model with closest (Denis and Baldridge, 2008) and latent (Chang et al., 2012) antecedents, and antecedent trees (Fernandes et al., 2014). Only slight modifications of the source code displayed in Listings 1 and 2 were necessary to implement these approaches. For the ranking models and antecedent trees we use the cost function described in Listing 3.
We evaluate the models on the English test data of the CoNLL-2012 shared task on multilingual coreference resolution (Pradhan et al., 2012). We use the reference implementation of the CoNLL scorer, which computes the average of the evaluation metrics MUC, B$^3$ (Bagga and Baldwin, 1998) and CEAF$_e$ (Luo, 2005). The models are trained on the concatenation of training and development data.
The evaluation of the models is shown in Table 1. To put the numbers into context, we compare with Fernandes et al. (2014), the winning system of the CoNLL-2012 shared task, and the state-of-the-art system of Björkelund and Kuhn (2014). The mention pair model performs decently, while the antecedent tree model exhibits performance comparable to Fernandes et al. (2014), who use a very similar model. The ranking models outperform Björkelund and Kuhn (2014), obtaining state-of-the-art performance.

Related Work
Many researchers working on coreference resolution release an implementation of the coreference model described in their paper (Lee et al., 2013; Durrett and Klein, 2013; Björkelund and Kuhn, 2014, inter alia). However, each of these releases implements only one approach following one paradigm (such as mention ranking or antecedent trees).
Similarly to cort, research toolkits such as BART (Versley et al., 2008) or Reconcile (Stoyanov et al., 2009) provide a framework to implement and compare coreference resolution approaches. In contrast to these toolkits, we make the latent structure underlying coreference approaches explicit, which facilitates development of new approaches and renders the development more transparent. Furthermore, we provide a generic and customizable learning algorithm.

Conclusions
We presented an implementation of a framework for coreference resolution that represents approaches to coreference resolution by the structures they operate on. In the implementation we placed emphasis on facilitating the definition of new models in the framework.
The presented toolkit cort can process raw text and CoNLL shared task data. It achieves state-of-the-art performance on the shared task data.
The framework and toolkit presented in this paper help researchers to devise, analyze and compare representations for coreference resolution.