Antecedent Prediction Without a Pipeline

We consider several antecedent prediction models that use no pipelined features generated by upstream systems. Models trained in this way are interesting because they allow for side-stepping the intricacies of upstream models, and because we might expect them to generalize better to situations in which upstream features are unavailable or unreliable. Through quantitative and qualitative error analysis we identify what sorts of cases are particularly difficult for such models, and suggest some directions for further improvement.


Introduction
Most recent approaches to identity coreference resolution rely on a set of pipelined features generated by relatively accurate upstream systems. For instance, the CoNLL 2012 coreference datasets (Pradhan et al., 2012), which are based on the OntoNotes corpus (Hovy et al., 2006), make available both gold and predicted parse, part-of-speech, and named-entity information for each sentence in the corpus. While recent systems have managed to improve on the state of the art in coreference resolution by taking advantage of such information (Durrett and Klein, 2013; Wiseman et al., 2015; Björkelund and Kuhn, 2014; Fernandes et al., 2012; Martschat and Strube, 2015), we might be interested in systems that do not use pipelined features for several reasons. First, pipelined systems are known to accumulate errors throughout the stages of the pipeline. Second, unpipelined models do not need to contend with the intricacies of the various systems in the pipeline, which may have little impact on the target task. Finally, models that do not require pipelined features may be more applicable to regimes in which upstream features are unavailable or unreliable, such as those arising from predicting coreference in low-resource languages or in social media text. Indeed, to the extent that it is easier to obtain coreference annotations than it is to obtain (for instance) parse annotations in such regimes, an unpipelined strategy may be particularly practical.
Accordingly, in this paper we consider systems that attempt to move beyond OntoNotes by making coreference predictions without access to pipelined features, using only a document's words and sentence boundaries. In the hopes of shedding light on whether this is a viable strategy, we consider, as a case study, how well coreference systems without access to upstream features can perform on English. Given the amount of research that has gone into English coreference resolution with pipelined features, by also considering the English "unpipelined" setting we can expect to get a rather accurate sense of how much we sacrifice by ignoring these features. Moreover, in addition to the benefits of unpipelined models noted above, the proposed line of research is congenial to the recent trend in NLP of using as few hand-engineered features as possible (as advocated, for instance, in Collobert et al. (2011)).
We report preliminary experiments on the subtask of antecedent prediction (defined in Wiseman et al. (2015) and reviewed below) on the CoNLL 2012 English dataset in this unpipelined setting. In particular, we will assume that we have automatically extracted mentions from a document, but that no other pipelined information is available. We emphasize that this is a strong assumption (since pipelined features, such as parse trees, are often used to extract mentions), and so what follows should be interpreted as an attempt to obtain an upper bound on the performance possible in such a setting. We conclude by analyzing the errors made by the proposed unpipelined systems, and discussing how these systems might be made more competitive.

Problem Setting
As above, we will assume we are given a set of documents from which we are able to automatically extract mentions. We denote by $X$ the set of these automatically extracted mentions. For a mention $x \in X$, let $A(x)$ denote the set of mentions appearing before $x$ in the document, and let the set $C(x) \subseteq A(x)$ denote the mentions appearing before $x$ that are coreferent with $x$. The problem of antecedent ranking involves trying to predict an antecedent $y \in C(x)$ for only those $x$ for which $C(x) \neq \emptyset$, that is, for only those $x$ that have coreferent antecedents. We will moreover require that in making these antecedent predictions no pipelined features are used. In particular, we will assume that "unpipelined" systems have access only to a document's mention boundaries, to the sets $C(x)$ for each $x \in X$ (when training), to the words in each document, and to the document's sentence boundaries.
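To make the setting concrete, the following is a minimal sketch of how the sets A(x) and C(x), and the resulting antecedent-prediction accuracy, might be computed from gold cluster labels. The function names and the representation of a document as a list of cluster ids (with None for mentions belonging to no cluster) are illustrative assumptions, not part of any dataset format.

```python
from typing import Dict, List, Optional, Tuple


def antecedent_sets(
    cluster_ids: List[Optional[int]],
) -> Tuple[Dict[int, List[int]], Dict[int, List[int]]]:
    """Build A(x) and C(x) for every mention index x.

    cluster_ids[i] is the (assumed) gold cluster id of the i-th mention in
    document order, or None if the mention is in no cluster.
    """
    A: Dict[int, List[int]] = {}
    C: Dict[int, List[int]] = {}
    for i, cid in enumerate(cluster_ids):
        # A(x): all mentions that precede x in the document.
        A[i] = list(range(i))
        # C(x): preceding mentions that share x's cluster.
        C[i] = [j for j in range(i) if cid is not None and cluster_ids[j] == cid]
    return A, C


def anaphoric_mentions(C: Dict[int, List[int]]) -> List[int]:
    """Mentions with a non-empty C(x); only these are evaluated."""
    return [i for i in sorted(C) if C[i]]


def antecedent_accuracy(predictions: Dict[int, int], C: Dict[int, List[int]]) -> float:
    """Fraction of anaphoric mentions whose predicted antecedent lies in C(x)."""
    anaphoric = anaphoric_mentions(C)
    if not anaphoric:
        return 0.0
    correct = sum(1 for i in anaphoric if predictions.get(i) in C[i])
    return correct / len(anaphoric)
```

Note that mentions with an empty C(x) are simply skipped, which is what distinguishes antecedent prediction from full coreference resolution.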
Whereas recent coreference systems typically make use of syntactic information, named-entity tags, word-lists containing type information (e.g., number, gender, animacy), and speaker information (Durrett and Klein, 2013; Björkelund and Kuhn, 2014; Lee et al., 2013), given the aforementioned restrictions, the only common coreference features that remain legal are word-based features and "distance" features. Distance features are typically defined in terms of the number of words, mentions, or sentences between a mention and a candidate antecedent (Durrett and Klein, 2013), and such features can presumably be defined accurately in many settings without the use of upstream systems.

Models
We will use a very simple mention-ranking style model for antecedent prediction. Mention-ranking models make use of a scoring function $s(x, y)$ that scores the compatibility between a mention $x$ and a candidate antecedent $y$, and they predict the antecedent to be $y^* = \arg\max_{y \in A(x)} s(x, y)$. We will define $s$ as

$$s(x, y) = u^\top g\left( W \begin{bmatrix} \Phi_c(x) \\ \Phi_c(y) \\ \Phi_d(x, y) \end{bmatrix} + b \right),$$

where $g$ denotes the elementwise nonlinearity of the hidden layer, $\Phi_c$ extracts relevant word-based features from a mention and its context, and $\Phi_d$ extracts distance-based features between $x$ and $y$. Thus, the scoring function $s$ is defined by applying a standard multi-layer perceptron (MLP) to the (vertically) concatenated outputs of the functions $\Phi_c$ and $\Phi_d$. In particular, $W$ represents the weight matrix of the MLP's first hidden layer, $b$ the corresponding bias vector, and $u$ the vector of weights projecting the first hidden layer into a scalar score. The exact dimensions of these weights will become clear in what follows.
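As an illustration, a mention-ranking scorer of this general shape can be sketched in a few lines. This is only a sketch under stated assumptions: the use of tanh as the hidden nonlinearity, the feeding of word-based features for both the mention and the candidate, and all function names are assumptions made for the example rather than details confirmed above.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mlp_score(
    phi_c_x: Vector,
    phi_c_y: Vector,
    phi_d_xy: Vector,
    W: Sequence[Vector],
    b: Vector,
    u: Vector,
) -> float:
    """One-hidden-layer MLP score over vertically concatenated features."""
    # Vertical concatenation of the three feature vectors.
    v: List[float] = list(phi_c_x) + list(phi_c_y) + list(phi_d_xy)
    # Hidden layer: tanh(W v + b); tanh is an assumed nonlinearity.
    hidden = [
        math.tanh(sum(w * vi for w, vi in zip(row, v)) + bias)
        for row, bias in zip(W, b)
    ]
    # Project the hidden layer to a scalar with u.
    return sum(ui * hi for ui, hi in zip(u, hidden))


def predict_antecedent(
    phi_c_x: Vector,
    candidates: Sequence[Tuple[Vector, Vector]],
    W: Sequence[Vector],
    b: Vector,
    u: Vector,
) -> int:
    """Return the index of the highest-scoring candidate in A(x).

    candidates[k] holds (phi_c(y_k), phi_d(x, y_k)) for the k-th candidate.
    """
    scores = [mlp_score(phi_c_x, pc, pd, W, b, u) for pc, pd in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

The arg max is taken over all preceding mentions, since at prediction time the set of truly coreferent antecedents is unknown.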
In defining $\Phi_c$ we will view a mention $x$ spanning $M$ words as a sequence of real vectors $x_1, \ldots, x_M$, with each $x_m \in \mathbb{R}^{D}$ obtained by looking up the $m$-th word of $x$ in an embedding matrix $E \in \mathbb{R}^{D \times |V|}$, where $V$ is our fixed vocabulary. Accordingly, let $X_{1:M} \in \mathbb{R}^{D \times M}$ be the matrix formed by concatenating the embeddings of the words in a mention (in order). Analogously, let $X_{-K:-1} \in \mathbb{R}^{D \times K}$ be the concatenation of the embedding vectors corresponding to the $K$ words preceding $x$ on the left (padded where necessary), and $X_{M+1:M+K}$ the concatenation of the embedding vectors corresponding to the $K$ words following $x$ on the right (padded where necessary).
For simplicity, we will require $\Phi_c$ to take the following form:

$$\Phi_c(x) = \begin{bmatrix} h(X_{1:M}) \\ h(X_{-K:-1}) \\ h(X_{M+1:M+K}) \end{bmatrix},$$

where $h(X_{i:j})$ is some function of the matrix $X_{i:j}$. That is, $\Phi_c(x)$ simply concatenates a representation of the words of $x$ with representations (respectively) of the $K$ words preceding and following $x$.
For example, consider the following passage from the development portion of the CoNLL 2012 English data, from which the final example in Table 1 is taken, and in which the mention we might like to predict an antecedent for is "the water":

Suddenly we realized water came into the engine room and it was rising and they started to pump, of course, and they pumped and pumped and the water came more and more and more. (bn/cnn/cnn 0410)

If we are interested in predicting coreferent antecedents for "the water," which we will denote by $x$, then we will have $M = 2$, and $X_{1:2}$ will be a matrix in $\mathbb{R}^{D \times 2}$ with its first column equal to the embedding (in $E$) for "the," and its second column equal to the embedding for "water." Since in predicting an antecedent for $x$ we will likely also want to take into account some of its surrounding context, we will also form matrices corresponding to the $K$ words to the left and to the right (respectively) of $x$. Thus, if we set $K = 1$, we will form $X_{-1:-1}$ as the matrix in $\mathbb{R}^{D \times 1}$ consisting of the embedding for "and," and we would define $X_{M+1:M+1}$ analogously from the embedding for "came." Given the aforementioned $X$ matrices, we define $\Phi_c$ by vertically concatenating the outputs of applying a function $h$ to each of these matrices.
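The construction in this example can be sketched as follows. Here h is taken to be an element-wise max over the word embeddings in each window, which is one possible choice and an assumption of this sketch; the padding and unknown-word tokens, the toy dictionary standing in for the embedding matrix, the mention-left-right ordering of the concatenation, and the function names are likewise illustrative.

```python
from typing import Dict, List, Tuple

PAD = "<pad>"  # hypothetical padding token
UNK = "<unk>"  # hypothetical unknown-word token


def window_tokens(
    doc: List[str], start: int, end: int, K: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split out the K words left of, inside, and right of doc[start:end]."""
    left = doc[max(0, start - K):start]
    left = [PAD] * (K - len(left)) + left      # pad on the far left
    right = doc[end:end + K]
    right = right + [PAD] * (K - len(right))   # pad on the far right
    return left, doc[start:end], right


def embed(tokens: List[str], E: Dict[str, List[float]]) -> List[List[float]]:
    """Look up each token's embedding, falling back to the unknown token."""
    return [E.get(t, E[UNK]) for t in tokens]


def max_over_time(vectors: List[List[float]]) -> List[float]:
    """h(X): element-wise max across the columns (words) of X."""
    return [max(dim) for dim in zip(*vectors)]


def phi_c(
    doc: List[str], start: int, end: int, K: int, E: Dict[str, List[float]]
) -> List[float]:
    """Concatenate h(mention), h(left window), and h(right window)."""
    left, mention, right = window_tokens(doc, start, end, K)
    return (
        max_over_time(embed(mention, E))
        + max_over_time(embed(left, E))
        + max_over_time(embed(right, E))
    )
```

With the token list "and the water came more", the mention at positions 1 to 2, and K = 1, window_tokens returns the windows ["and"], ["the", "water"], and ["came"], matching the example above.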
We now consider three approaches to defining $h(X_{1:M})$, in increasing order of complexity: a Max-Over-Time Model, a Convolutional Model, and an LSTM Model.

[Footnote 1] We found the max-pooling described here to be more effective than mean-pooling.

To define $\Phi_d$ we first define indicator features (represented as one-hot vectors), which (respectively) bucket the number of mentions and the number of sentences between a mention and a candidate antecedent into 11 discrete buckets, following Durrett and Klein (2013). We therefore have 22 distance indicator features in total, and they are used to index into an embedding matrix $A \in \mathbb{R}^{D_d \times 22}$. Accordingly, $\Phi_d(x, y) \in \mathbb{R}^{D_d}$ represents the sum of the (two) distance embeddings obtained from $A$ in this way. This approach resembles that of Sukhbaatar et al. (2015).
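A minimal sketch of this lookup is given below. The bucket boundaries are hypothetical: the sketch gives each of the distances 0 through 9 its own bucket and groups everything from 10 upward into the last one, which yields 11 buckets but is not necessarily the scheme of Durrett and Klein (2013). The embedding matrix is represented as a list of its 22 columns.

```python
from typing import List, Tuple

NUM_BUCKETS = 11  # buckets per distance type, giving 22 indicator features


def bucket(distance: int) -> int:
    """Map a non-negative distance to one of 11 buckets (assumed boundaries)."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    return min(distance, NUM_BUCKETS - 1)


def distance_indices(mention_dist: int, sentence_dist: int) -> Tuple[int, int]:
    """Column indices into the 22-column distance embedding matrix."""
    # Columns 0..10 hold mention-distance buckets, 11..21 sentence-distance.
    return bucket(mention_dist), NUM_BUCKETS + bucket(sentence_dist)


def phi_d(mention_dist: int, sentence_dist: int, A: List[List[float]]) -> List[float]:
    """Sum the two selected distance embeddings (columns of A)."""
    i, j = distance_indices(mention_dist, sentence_dist)
    return [a + b for a, b in zip(A[i], A[j])]
```

Summing the two embeddings, rather than concatenating them, keeps the distance representation at a fixed dimensionality regardless of how many distance types are used.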

Methods
We conduct antecedent-ranking experiments on the development portion of the CoNLL 2012 English corpus. Mentions were extracted using the Berkeley Coreference System (Durrett and Klein, 2013). We set K = 4 in forming word-windows, and we trained by optimizing the margin ranking-loss defined in Wiseman et al. (2015) using mini-batch Adagrad (Duchi et al., 2011).
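For intuition about this kind of training objective, the sketch below shows a generic margin ranking loss with a latent gold antecedent, together with a single Adagrad update. It is not claimed to reproduce the exact loss of Wiseman et al. (2015), which may weight errors differently; the margin of 1, the learning rate, and the function names are assumptions of this sketch.

```python
import math
from typing import List, Set


def margin_ranking_loss(scores: List[float], gold: Set[int], margin: float = 1.0) -> float:
    """Hinge loss between the best gold antecedent and the worst violator.

    scores[k] is s(x, y_k) for the k-th candidate in A(x); gold holds the
    indices of the candidates in C(x) and is assumed to be non-empty.
    """
    # Latent gold antecedent: the currently highest-scoring member of C(x).
    best_gold = max(scores[i] for i in gold)
    # Margin violation of every incorrect candidate.
    violations = [margin + s - best_gold for i, s in enumerate(scores) if i not in gold]
    return max([0.0] + violations)


def adagrad_step(
    params: List[float],
    grads: List[float],
    accum: List[float],
    lr: float = 0.1,
    eps: float = 1e-8,
) -> None:
    """In-place Adagrad update with per-parameter accumulated squared gradients."""
    for k, g in enumerate(grads):
        accum[k] += g * g
        params[k] -= lr * g / (math.sqrt(accum[k]) + eps)
```

The loss is zero only when every incorrect candidate scores at least the margin below the best correct one.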
For the Convolutional Model, we used windows of size 1, 2, and 3, and 40 filters for each. We set $D_d$, the dimensionality of the distance feature embeddings which constitute the columns of $A$, to 20. We used the element-rnn RNN package (Léonard et al., 2015) to implement the LSTM, and we set the LSTM's hidden-layer size to 200. All models used 300 hidden units in the final layer (represented by $W$), and we used dropout for regularization. All hyperparameters, including window size, were tuned on the development set.
For all models we initialized $E$, the word embedding matrix, with word vectors obtained from word2vec (Mikolov et al., 2013), and so $E \in \mathbb{R}^{300 \times |V|}$, where $V$ is the vocabulary consisting of words in the training or development sets (plus an unknown word token). $E$ was updated during training. For the Max-Over-Time Model we found it beneficial to untie the embedding matrices used to embed the words in the mention, before the mention, and after the mention, giving 3 separate embedding matrices. For the Convolutional and LSTM Models, performance was at least as good when using a single embedding matrix.

Results
We are particularly interested in determining in what situations a word-and-distance model underperforms models with access to more sophisticated information. In Table 2 we compare the antecedent-prediction accuracy of the three models defined above with the antecedent-ranking performance of the model described in Wiseman et al. (2015), which uses an MLP over pipelined coreference features. We will refer to this latter model as the "baseline MLP." We see that the word-and-distance models underperform, though the LSTM model comes within 5.2% of the baseline MLP. (It is also worth noting here that without the distance features $\Phi_d$ all models are significantly less accurate, with accuracies decreasing by over 15 percentage points.)

Discussion
In Table 3 we examine, using an analysis similar to that in Durrett and Klein (2013), where the unpipelined models go wrong. There, we partition mentions column-wise into nominal or proper mentions that have a head-match (HM) with some previously occurring mention, nominal or proper mentions that do not, and pronominal mentions. (Note that whereas parse information must be used to detect heads, this is only used in our analysis, and none of the three models introduced here have access to this information.) We first consider the Convolutional Model, which underperforms the baseline in all categories, but does particularly badly in predicting antecedents for mentions for which a previous mention in the text has the same head. Why is this? Further analysis shows that almost 84% of the HM examples that are correctly predicted by the baseline MLP but incorrectly predicted by the Convolutional Model involve the baseline MLP predicting an antecedent with an exact head-match to the current mention, and the Convolutional Model predicting a non-head-match antecedent. We show some representative examples in Table 1, where we bracket the head of each mention. As is evident from Table 1, the model is picking antecedents that are semantically reasonable, but which do not have a head match. The reason the Convolutional Model makes these errors is presumably that it is not able to tell what the head of each mention is (because it sees only the words in the mention, and the word-windows preceding and following it). The baseline MLP, however, does have access to the head of each mention, and so can learn that head-match is a discriminative feature.
As we move to the LSTM model, we find that errors decrease in all categories, though they follow largely the same pattern. Indeed, over 78% of the LSTM model's errors in the HM category also involve predicting a non-head-match antecedent when the baseline MLP correctly predicts a head-match antecedent. Thus, it seems the LSTM model too could benefit from better head-finding. As additional evidence for this hypothesis, in Figure 1 we plot the percentage of correctly predicted antecedents in the CoNLL 2012 development set as the length of the current mention $x$ increases. (Only mention-lengths occurring at least 10 times in the development set are reported.) We see that the accuracy of both the Convolutional and LSTM models (as well as that of the Max-Over-Time model) generally decreases as the mention-length increases, though that of the baseline MLP model does not. Of course, it stands to reason that finding heads is more difficult in longer mentions, which may explain this trend.
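The length-based breakdown described here can be computed with a short helper along the following lines. The input format, a list of (mention length, correct or not) pairs, and the function name are assumptions of this sketch; the threshold of 10 occurrences follows the text.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_length(
    records: Iterable[Tuple[int, bool]], min_count: int = 10
) -> Dict[int, float]:
    """Antecedent accuracy per mention length, for sufficiently frequent lengths.

    Each record is (mention length in words, whether the predicted antecedent
    was correct). Lengths seen fewer than min_count times are dropped.
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for length, correct in records:
        totals[length] += 1
        hits[length] += int(correct)
    return {
        length: hits[length] / totals[length]
        for length in sorted(totals)
        if totals[length] >= min_count
    }
```

Dropping rare lengths avoids reporting accuracies that are estimated from only a handful of mentions.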
When it comes to the other major category of errors in Table 3, namely, errors on pronominal mentions, it is more difficult to diagnose a single underlying cause of error. In particular, the unpipelined models' errors tend to involve either predicting antecedents that are inconsistent in terms of gender or number, or, interestingly, predicting non-pronominal antecedents when the baseline MLP predicts a pronominal antecedent. While it is certainly the case that the baseline MLP has access to gender information that the unpipelined models do not, it is not as clear why these unpipelined models learn to disprefer predicting pronominal antecedents for pronominal mentions, and this issue requires further investigation.

Conclusion
The results presented above suggest that a major factor holding word-and-distance-only models back from competing with models that have access to pipelined features is their inability to find mention heads and, more generally, to take advantage of syntactic features. While the fact that such models would benefit from syntactic information is not surprising, the examples in Table 1 suggest that even coarse notions of head-finding may be sufficient to improve performance. Accordingly, one might imagine that alignment or attention models (such as that of Bahdanau et al. (2014)) that attempt to model coarse head-information would be useful in such cases.