A Mention-Ranking Model for Abstract Anaphora Resolution

Resolving abstract anaphora is an important, but difficult task for text understanding. Yet, with recent advances in representation learning this task becomes a more tangible aim. A central property of abstract anaphora is that it establishes a relation between the anaphor embedded in the anaphoric sentence and its (typically non-nominal) antecedent. We propose a mention-ranking model that learns how abstract anaphors relate to their antecedents with an LSTM-Siamese Net. We overcome the lack of training data by generating artificial anaphoric sentence–antecedent pairs. Our model outperforms state-of-the-art results on shell noun resolution. We also report first benchmark results on an abstract anaphora subset of the ARRAU corpus. This corpus presents a greater challenge due to a mixture of nominal and pronominal anaphors and a greater range of confounders. We found model variants that outperform the baselines for nominal anaphors, without training on individual anaphor data, but still lag behind for pronominal anaphors. Our model selects syntactically plausible candidates and – if disregarding syntax – discriminates candidates using deeper features.


Introduction
Current research in anaphora (or coreference) resolution is focused on resolving noun phrases referring to concrete objects or entities in the real world, which is arguably the most frequently occurring type. Distinct from these are diverse types of abstract anaphora (AA) (Asher, 1993) where reference is made to propositions, facts, events or properties. An example is given in (1) below. 1 While recent approaches address the resolution of selected abstract shell nouns (Kolhatkar and Hirst, 2014), we aim to resolve a wide range of abstract anaphors, such as the NP this trend in (1), as well as pronominal anaphors (this, that, or it).
Henceforth, we refer to a sentence that contains an abstract anaphor as the anaphoric sentence (AnaphS), and to a constituent that the anaphor refers to as the antecedent (Antec) (cf. (1)).
(1) Ever-more powerful desktop computers, designed with one or more microprocessors as their "brains", are expected to increasingly take on functions carried out by more expensive minicomputers and mainframes. "[Antec The guys that make traditional hardware are really being obsoleted by microprocessor-based machines]", said Mr. Benton. [ AnaphS As a result of this trendAA, longtime powerhouses HP, IBM and Digital Equipment Corp. are scrambling to counterattack with microprocessor-based systems of their own.] A major obstacle for solving this task is the lack of sufficient amounts of annotated training data. We propose a method to generate large amounts of training instances covering a wide range of abstract anaphor types. This enables us to use neural methods which have shown great success in related tasks: coreference resolution (Clark and Manning, 2016a), textual entailment (Bowman et al., 2016), learning textual similarity (Mueller and Thyagarajan, 2016), and discourse relation sense classification (Rutherford et al., 2017).
Our model is inspired by the mention-ranking model for coreference resolution (Wiseman et al., 2015;Manning, 2015, 2016a,b) and combines it with a Siamese Net (Mueller and Thyagarajan, 2016), (Neculoiu et al., 2016) for learning similarity between sentences. Given an anaphoric sentence (AntecS in (1)) and a candidate antecedent (any constituent in a given context, e.g. being obsoleted by microprocessor-based machines in (1)), the LSTM-Siamese Net learns representations for the candidate and the anaphoric sentence in a shared space. These representations are combined into a joint representation used to calculate a score that characterizes the relation between them. The learned score is used to select the highest-scoring antecedent candidate for the given anaphoric sentence and hence its anaphor. We consider one anaphor at a time and provide the embedding of the context of the anaphor and the embedding of the head of the anaphoric phrase to the input to characterize each individual anaphorsimilar to the encoding proposed by Zhou and Xu (2015) for individuating multiply occurring predicates in SRL. With deeper inspection we show that the model learns a relation between the anaphor in the anaphoric sentence and its antecedent. Fig. 1 displays our architecture.
In contrast to other work, our method for generating training data is not confined to specific types of anaphora such as shell nouns (Kolhatkar and Hirst, 2014) or anaphoric connectives (Stede and Grishina, 2016). It produces large amounts of instances and is easily adaptable to other languages. This enables us to build a robust, knowledge-lean model for abstract anaphora resolution that easily extends to multiple languages. We evaluate our model on the shell noun resolution dataset of Kolhatkar et al. (2013b) and show that it outperforms their state-of-the-art results. Moreover, we report results of the model (trained on our newly constructed dataset) on unrestricted abstract anaphora instances from the ARRAU corpus (Poesio and Artstein, 2008;Uryupina et al., 2016). To our knowledge this provides the first state-of-the-art benchmark on this data subset.
Our TensorFlow 2 implementation of the model and scripts for data extraction are available at: https://github.com/amarasovic/ neural-abstract-anaphora.

Related and prior work
Abstract anaphora has been extensively studied in linguistics and shown to exhibit specific properties in terms of semantic antecedent types, their degrees of abstractness, and general dis-2 Abadi et al. (2015) course properties (Asher, 1993;Webber, 1991). In contrast to nominal anaphora, abstract anaphora is difficult to resolve, given that agreement and lexical match features are not applicable. Annotation of abstract anaphora is also difficult for humans (Dipper and Zinsmeister, 2012), and thus, only few smaller-scale corpora have been constructed. We evaluate our models on a subset of the AR-RAU corpus (Uryupina et al., 2016) that contains abstract anaphors and the shell noun corpus used in Kolhatkar et al. (2013b). 3 We are not aware of other freely available abstract anaphora datasets.
Little work exists for the automatic resolution of abstract anaphora. Early work (Eckert and Strube, 2000;Strube and Müller, 2003;Byron, 2004;Müller, 2008) has focused on spoken language, which exhibits specific properties. Recently, event coreference has been addressed using feature-based classifiers (Jauhar et al., 2015;Lu and Ng, 2016). Event coreference is restricted to a subclass of events, and usually focuses on coreference between verb (phrase) and noun (phrase) mentions of similar abstractness levels (e.g. purchase -acquire) with no special focus on (pro)nominal anaphora. Abstract anaphora typically involves a full-fledged clausal antecedent that is referred to by a highly abstract (pro)nominal anaphor, as in (1). Rajagopal et al. (2016) proposed a model for resolution of events in biomedical text that refer to a single or multiple clauses. However, instead of selecting the correct antecedent clause(s) (our task) for a given event, their model is restricted to classifying the event into six abstract categories: this these changes, responses, analysis, context, finding, observation, based on its surrounding context. While related, their task is not comparable to the full-fledged abstract anaphora resolution task, since the events to be classified are known to be coreferent and chosen from a set of restricted abstract types.
More related to our work is Anand and Hardt (2016) who present an antecedent ranking account for sluicing using classical machine learning based on a small training dataset. They employ features modeling distance, containment, discourse structure, and -less effectively -content and lexical correlates. 4 Closest to our work is Kolhatkar et al. (2013b) (KZH13) and Kolhatkar and Hirst (2014) (KH14) on shell noun resolution, using classical machine learning techniques. Shell nouns are abstract nouns, such as fact, possibility, or issue, which can only be interpreted jointly with their shell content (their embedded clause as in (2) or antecedent as in (3)). KZH13 refer to shell nouns whose antecedent occurs in the prior discourse as anaphoric shell nouns (ASNs) (cf.
(3)), and cataphoric shell nouns (CSNs) otherwise (cf. (2)). 5 (2) Congress has focused almost solely on the fact that [special education is expensive -and that it takes away money from regular education.] ( KZH13 presented an approach for resolving six typical shell nouns following the observation that CSNs are easy to resolve based on their syntactic structure alone, and the assumption that ASNs share linguistic properties with their embedded (CSN) counterparts. They manually developed rules to identify the embedded clause (i.e. cataphoric antecedent) of CSNs and trained SVM rank (Joachims, 2002) on such instances. The trained SVM rank model is then used to resolve ASNs. KH14 generalized their method to be able to create training data for any given shell noun, however, their method heavily exploits the specific properties of shell nouns and does not apply to other types of abstract anaphora.
Stede and Grishina (2016) study a related phenomenon for German. They examine inherently anaphoric connectives (such as demzufolge -according to which) that could be used to access their abstract antecedent in the immediate context. Yet, such connectives are restricted in type, and the study shows that such connectives are often ambiguous with nominal anaphors and require sense disambiguation. We conclude that they cannot be easily used to acquire antecedents automatically.
In our work, we explore a different direction: we construct artificial training data using a general pattern that identifies embedded sentence constituents, which allows us to extract relatively secure training data for abstract anaphora that captures a wide range of anaphora-antecedent rela-tions, and apply this data to train a model for the resolution of unconstrained abstract anaphora.
Recent work in entity coreference resolution has proposed powerful neural network-based models that we will adapt to the task of abstract anaphora resolution. Most relevant for our task is the mention-ranking neural coreference model proposed in Clark and Manning (2015), and their improved model in Clark and Manning (2016a), which integrates a loss function (Wiseman et al., 2015) which learns distinct feature representations for anaphoricity detection and antecedent ranking.
Siamese Nets distinguish between similar and dissimilar pairs of samples by optimizing a loss over the metric induced by the representations. It is widely used in vision (Chopra et al., 2005), and in NLP for semantic similarity, entailment, query normalization and QA (Mueller and Thyagarajan, 2016;Neculoiu et al., 2016;Das et al., 2016).

Mention-Ranking Model
Given an anaphoric sentence s with a marked anaphor (mention) and a candidate antecedent c, the mention-ranking (MR) model assigns the pair (c, s) a score, using representations produced by an LSTM-Siamese Net. The highest-scoring candidate is assigned to the marked anaphor in the anaphoric sentence. Fig. 1 displays the model.
We learn representations of an anaphoric sentence s and a candidate antecedent c using a bidirectional Long Short-Term Memory (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). One bi-LSTM is applied to the anaphoric sentence s and a candidate antecedent c, hence the term siamese. Each word is represented with a vector w i constructed by concatenating embeddings of the word, of the context of the anaphor (average of embeddings of the anaphoric phrase, the previous and the next word), of the head of the anaphoric phrase 6 , and, finally, an embedding of the constituent tag of the candidate, or the S constituent tag if the word is in the anaphoric sentence. For each sequence s or c, the word vectors w i are sequentially fed into the bi-LSTM, which produces outputs from the forward pass, − → h i , and outputs ← − h i from the backward pass. The final output of the i-th word is defined as To get a representation of the full sequence, h s or h c , all outputs are averaged, except for those that correspond to padding tokens. To prevent forgetting the constituent tag of the sequence, we concatenate the corresponding tag embedding with h s or h c (we call this a shortcut for the tag information). The resulting vector is fed into a feed-forward layer of exponential linear units (ELUs) (Clevert et al., 2016) to produce the final representationh s orh c of the sequence.
Fromh c andh s we compute a vector h c,s = [|h c −h s |;h c h s ] (Tai et al., 2015), where |-| denotes the absolute values of the element-wise subtraction, and the element-wise multiplication. Then h c,s is fed into a feed-forward layer of ELUs to obtain the final joint representation,h c,s , of the pair (c, s). Finally, we compute the score for the pair (c, s) that represents relatedness between them, by applying a single fully connected linear layer to the joint representation: where W is a 1 × d weight matrix, and d the dimension of the vectorh c,s . We train the described mention-ranking model with the max-margin training objective from Wiseman et al. (2015), used for the antecedent ranking subtask. Suppose that the training set , where a i is the i-th abstract anaphor, s i the corresponding anaphoric sentence, T (a i ) the set of antecedents of a i and N (a i ) the set of candidates that are not antecedents (negative candidates). Lett i = arg max t∈T (a i ) score(t i , s i ) be the highest scor- S' x S Figure 2: A general pattern for artificially creating anaphoric sentence-antecedent pairs.
ing antecedent of a i . Then the loss is given by

Training data construction
We create large-scale training data for abstract anaphora resolution by exploiting a common construction, consisting of a verb with an embedded sentence (complement or adverbial) (cf. Fig.  2). We detect this pattern in a parsed corpus, 'cut off' the S constituent and replace it with a suitable anaphor to create the anaphoric sentence (AnaphS), while S yields the antecedent (Antec). This method covers a wide range of anaphoraantecedent constellations, due to diverse semantic or discourse relations that hold between the clause hosting the verb and the embedded sentence.
First, the pattern applies to verbs that embed sentential arguments. In (4), the verb doubt establishes a specific semantic relation between the embedding sentence and its sentential complement.
(4) He doubts [ S [S a Bismarckian super state will emerge that would dominate Europe], but warns of "a risk of profound change in the [..] European Community from a Germany that is too strong, even if democratic"].
From this we extract the artificial antecedent A Bismarckian super state will emerge that would dominate Europe, and its corresponding anaphoric sentence He doubts this, but warns of "a risk of profound change ... even if democratic", which we construct by randomly choosing one of a predefined set of appropriate anaphors (here: this, that, it), cf. Table 1. The second row in Table 1 is used when the head of S is filled by an overt complementizer (doubts that), as opposed to (4). The remaining rows in Table 1 apply to adverbial clauses of different types.
Adverbial clauses encode specific discourse relations with their embedding sentences, often indicated by their conjunctions. In (5), for example, the causal conjunction as relates a cause (embedded sentence) and its effect (embedding sentence):  We randomly replace causal conjunctions because, as with appropriately adjusted anaphors, e.g. because of that, due to this or therefore that make the causal relation explicit in the anaphor. 7 Compared to the shell noun corpus of KZH13, who made use of a carefully constructed set of extraction patterns, a downside of our method is that our artificially created antecedents are uniformly of type S. However, the majority of abstract anaphora antecedents found in the existing datasets are of type S. Also, our models are intended to induce semantic representations, and so we expect syntactic form to be less critical, compared to a feature-based model. 8 Finally, the general extraction pattern in Fig. 2, covers a much wider range of anaphoric types.
Using this method we generated a dataset of artificial anaphoric sentence-antecedent pairs from the WSJ part of the PTB Corpus (Marcus et al., 1993), automatically parsed using the Stanford Parser (Klein and Manning, 2003).

Datasets
We evaluate our model on two types of anaphora: (a) shell noun anaphora and (b) (pro)nominal abstract anaphors extracted from ARRAU.
a. Shell noun resolution dataset. For comparability we train and evaluate our model for shell noun resolution, using the original training (CSN) and test (ASN) corpus of Kolhatkar et al. (2013a,b). 9 7 In case of ambiguous conjunctions (e.g. as interpreted as causal or temporal), we generally choose the most frequent interpretation. 8 This also alleviates problems with languages like German, where (non-)embedded sentences differ in surface position of the finite verb. We can either adapt the order or ignore it, when producing anaphoric sentence -antecedent pairs. 9 We thank the authors for providing the available data.
We follow the data preparation and evaluation protocol of Kolhatkar et al. (2013b) The CSN corpus was constructed from the NYT corpus using manually developed patterns to identify the antecedent of cataphoric shell nouns (CSNs). In KZH13, all syntactic constituents of the sentence that contains both the CSN and its antecedent were considered as candidates for training a ranking model. Candidates that differ from the antecedent in only one word or one word and punctuation were as well considered as antecedents 10 . To all other candidates we refer to as negative candidates. For every shell noun, KZH13 used the corresponding part of the CSN data to train SVM rank .
The ASN corpus serves as the test corpus. It was also constructed from the NYT corpus, by selecting anaphoric instances with the pattern "this shell noun " for all covered shell nouns. For validation, Kolhatkar et al. (2013a) crowdsourced annotations for the sentence which contains the antecedent, which KZH13 refer to as a broad region. Candidates for the antecedent were obtained by using all syntactic constituents of the broad region as candidates and ranking them using the SVM rank model trained on the CSN corpus. The top 10 ranked candidates were presented to the crowd workers and they chose the best answer that represents the ASN antecedent. The workers were encouraged to select None when they did not agree with any of the displayed answers and could provide information about how satisfied they were with the displayed candidates. We consider this dataset as gold, as do KZH13, although it may be biased towards the offered candidates. 11 b. Abstract anaphora resolution data set. We use the automatically constructed data from the WSJ corpus (Section 4) for training. 12 Our test data for unrestricted abstract anaphora resolution is obtained from the ARRAU corpus (Uryupina et al., 2016). We extracted all abstract anaphoric instances from the WSJ part of ARRAU that are marked with the category abstract or plan, 13 and call the subcorpus ARRAU-AA. 10 We obtained this information from the authors directly. 11 The authors provided us with the workers' annotations of the broad region, antecedents chosen by the workers and links to the NYT corpus. The extraction of the anaphoric sentence and the candidates had to be redone. 12 We excluded any documents that are part of ARRAU. 13 ARRAU distinguishes abstract anaphors and (mostly) pronominal anaphors referring to an action or plan, as plan.  Candidates extraction. Following KZH13, for every anaphor we create a list of candidates by extracting all syntactic constituents from sentences which contain antecedents. Candidates that differ from antecedents in only one word, or one word and punctuation, were as well considered as antecedents. Constituents that are not antecedents are considered as negative candidates.
Data statistics. Table 2 gives statistics of the datasets: the number of anaphors (row 1), the median length (in tokens) of antecedents (row 2), the median length (in tokens) for all anaphoric sentences (row 3), the median of the number of antecedents and candidates that are not antecedents (negatives) (rows 4-5), the number of pronominal and nominal anaphors (rows 6-7). Both training sets, artificial and CSN, have only one possible antecedent for which we accept two minimal variants differing in only one word or one word and punctuation. On the contrary, both test sets by design allow annotation of more than one antecedent that differ in more than one word. Every anaphor in the artificial training dataset is pronominal, whereas anaphors in CSN and ASN are nominal only. ARRAU-AA has a mixture of nominal and pronominal anaphors.
Data pre-processing. Other details can be found in Supplementary Materials.

Baselines and evaluation metrics
Following KZH13, we report success@n (s@n), which measures whether the antecedent, or a candidate that differs in one word 14 , is in the first n ranked candidates, for n ∈ {1, 2, 3, 4}. Additionally, we report the preceding sentence baseline (PS BL ) that chooses the previous sentence for the antecedent and TAGbaseline (TAG BL ) that randomly chooses a candidate with the constituent tag label in {S, VP, ROOT, SBAR}. For TAG BL we report the average of 10 runs with 10 fixed seeds. PS BL always performs worse than the KZH13 model on the ASN, so we report it only for ARRAU-AA.

Training details for our models
Hyperparameters tuning. We recorded performance with manually chosen HPs and then tuned HPs with Tree-structured Parzen Estimators (TPE) (Bergstra et al., 2011) 15 . TPE chooses HPs for the next (out of 10) trails on the basis of the s@1 score on the devset. As devsets we employ the ARRAU-AA corpus for shell noun resolution and the ASN corpus for unrestricted abstract anaphora resolution. For each trial we record performance on the test set. We report the best test s@1 score in 10 trials if it is better than the scores from default HPs. The default HPs and prior distributions for HPs used by TPE are given below. The (exact) HPs we used can be found in Supplementary Materials. Input representation. To construct word vectors w i as defined in Section 3, we used 100-dim. GloVe word embeddings pre-trained on the Gigaword and Wikipedia (Pennington et al., 2014), and did not fine-tune them. Vocabulary was built from the words in the training data with frequency in {3, U(1, 10)}, and OOV words were replaced with an UNK token. Embeddings for tags are initialized with values drawn from the uniform distribution U − 1 √ d+t , 1 √ d+t , where t is the number of tags 16 and d ∈ {50, qlog-U(30, 100)} the size of the tag embeddings. 17 We experimented with removing embeddings for tag, anaphor and context.
Weights initialization. The size of the LSTMs hidden states was set to {100, qlog-U(30, 150)}. We initialized the weight matrices of the LSTMs with random orthogonal matrices (Henaff et al., 2016), all other weight matrices with the initialization proposed in He et al. (2015). The first feed-forward layer size is set to a value in {400, qlog-U(200, 800)}, the second to a value in {1024, qlog-U(400, 2000)}. Forget biases in the LSTM were initialized with 1s (Józefowicz et al., 2015), all other biases with 0s.  Optimization. We trained our model in minibatches using Adam (Kingma and Ba, 2015) with the learning rate of 10 −4 and maximal batch size 64. We clip gradients by global norm (Pascanu et al., 2013), with a clipping value in {1.0, U(1, 100)}. We train for 10 epochs and choose the model that performs best on the devset.
6 Results and analysis 6.1 Results on shell noun resolution dataset Table 3 provides the results of the mentionranking model (MR-LSTM) on the ASN corpus using default HPs. Column 2 states which model produced the results: KZH13 refers to the best reported results in Kolhatkar et al. (2013b) and TAG BL is the baseline described in Section 5.2.
In terms of s@1 score, MR-LSTM outperforms both KZH13's results and TAG BL without even necessitating HP tuning. For the outlier reason we tuned HPs (on ARRAU-AA) for different variants of the architecture: the full architecture, without embedding of the context of the anaphor (ctx), of the anaphor (aa), of both constituent tag em-  bedding and shortcut (tag,cut), dropping only the shortcut (cut), using only word embeddings as input (ctx,aa,tag,cut), without the first (ffl1) and second (ffl2) layer. From Table 4 we observe: (1) with HPs tuned on ARRAU-AA, we obtain results well beyond KZH13, (2) all ablated model variants perform worse than the full model, (3) a large performance drop when omitting syntactic information (tag,cut) suggests that the model makes good use of it. However, this could also be due to a bias in the tag distribution, given that all candidates stem from the single sentence that contains antecedents. The median occurrence of the S tag among both antecedents and negative candidates is 1, thus the model could achieve 50.00 s@1 by picking S-type constituents, just as TAG BL achieves 42.02 for reason and 48.66 for possibility.
Tuning of HPs gives us insight into how different model variants cope with the task. For example, without tuning the model with and without syntactic information achieves 71.27 and 19.68 (not shown in table) s@1 score, respectively, and with tuning: 87.78 and 68.10. Performance of 68.10 s@1 score indicates that the model is able to learn without syntactic guidance, contrary to the 19.68 s@1 score before tuning. Table 5 shows the performance of different variants of the MR-LSTM with HPs tuned on the ASN corpus (always better than the default HPs), when evaluated on 3 different subparts of the ARRAU-AA: all 600 abstract anaphors, 397 nominal and 203 pronominal ones. HPs were tuned on the ASN corpus for every variant separately, without shuffling of the training data. For the best performing variant, without syntactic information (tag,cut), we report the results with HPs that yielded the best s@1 test score for all anaphors (row 4), when training with those HPs on shuffled training data (row 5), and with HPs that yielded the best s@1 all (600) nominal (397) pronominal (  score for pronominal anaphors (row 6). The MR-LSTM is more successful in resolving nominal than pronominal anaphors, although the training data provides only pronominal ones. This indicates that resolving pronominal abstract anaphora is harder compared to nominal abstract anaphora, such as shell nouns. Moreover, for shell noun resolution in KZH13's dataset, the MR-LSTM achieved s@1 scores in the range 76.09-93.14, while the best variant of the model achieves 51.89 s@1 score for nominal anaphors in ARRAU-AA. Although lower performance is expected, since we do not have specific training data for individual nominals in ARRAU-AA, we suspect that the reason for better performance for shell noun resolution in KZH13 is due to a larger number of positive candidates in ASN (cf. Table 2, rows: antecedents/negatives).

Results on the ARRAU corpus
We also note that HPs that yield good performance for resolving nominal anaphors are not necessarily good for pronominal ones (cf. rows 4-6 in Table 5). Since the TPE tuner was tuned on the nominal-only ASN data, this suggest that it would be better to tune HPs for pronominal anaphors on a different dataset or stripping the nouns in ASN.
Contrary to shell noun resolution, omitting syntactic information boosts performance in ARRAU-AA. We conclude that when the model is provided with syntactic information, it learns to pick S-type candidates, but does not continue to learn deeper features to further distinguish them or needs more data to do so. Thus, the model is not able to point to exactly one antecedent, resulting in a lower s@1 score, but does well in picking a few good candidates, which yields good s@2-4 scores. This is what we can observe from row 2 vs. row 6 in Table 5: the MR-LSTM without context embedding (ctx) achieves a comparable s@2 score with the variant that omits syntactic information, but better s@3-4 scores. Further, median occurrence of tags not in {S, VP, ROOT, SBAR} among top-4 ranked candidates is 0 for the full architecture, and 1 when syntactic information is omitted. The need for discriminating capacity of the model is more emphasized in ARRAU-AA, given that the median occurrence of S-type candidates among negatives is 2 for nominal and even 3 for pronominal anaphors, whereas it is 1 for ASN. This is in line with the lower TAG BL in ARRAU-AA.
Finally, not all parts of the architecture contribute to system performance, contrary to what is observed for reason. For nominal anaphors, the anaphor (aa) and feed-forward layers (ffl1, ffl2) are beneficial, for pronominals only the second ffl.

Exploring the model
We finally analyze deeper aspects of the model: (1) whether a learned representation between the anaphoric sentence and an antecedent establishes a relation between a specific anaphor we want to resolve and the antecedent and (2) whether the maxmargin objective enforces a separation of the joint representations in the shared space.
(1) We claim that by providing embeddings of both the anaphor and the sentence containing the anaphor we ensure that the learned relation between antecedent and anaphoric sentence is dependent on the anaphor under consideration. Fig.  3 illustrates the heatmap for an anaphoric sentence with two anaphors. The i-th column of the heatmap corresponds to absolute differences between the output of the bi-LSTM for the i-th word in the anaphoric sentence when the first vs. second anaphor is resolved. Stronger color indi- Figure 3: Visualizing the differences between outputs of the bi-LSTM over time for an anaphoric sentence containing two anaphors.
cates larger difference, the blue rectangle represents the column for the head of the first anaphor, the dashed blue rectangle the column for the head of the second anaphor. Clearly, the representations differ when the first vs. second anaphor is being resolved and consequently, joint representations with an antecedent will differ too.
(2) It is known that the max-margin objective separates the best-scoring positive candidate from the best-scoring negative candidate. To investigate what the objective accomplishes in the MR-LSTM model, we analyze the joint representations of candidates and the anaphoric sentence (i.e., outputs of ffl2) after training. For a randomly chosen instance from ARRAU-AA, we plotted outputs of ffl2 with the tSNE algorithm (v.d. Maaten and Hinton, 2008). Fig. 4 illustrates that the joint representation of the first ranked candidate and the anaphoric sentence is clearly separated from other joint representations. This shows that the maxmargin objective separates the best scoring positive candidate from the best scoring negative candidate by separating their respective joint representations with the anaphoric sentence.

Conclusions
We presented a neural mention-ranking model for the resolution of unconstrained abstract anaphora, and applied it to two datasets with different types of abstract anaphora: the shell noun dataset and a subpart of ARRAU with (pro)nominal abstract anaphora of any type. To our knowledge this work is the first to address the unrestricted abstract anaphora resolution task with a neural network. Our model also outperforms state-of-the-art results on the shell noun dataset.
In this work we explored the use of purely artificially created training data and how far it can bring Figure 4: tSNE projection of outputs of ffl2. Labels are the predicted ranks and the constituent tag.
us. In future work, we plan to investigate mixtures of (more) artificial and natural data from different sources (e.g. ASN, CSN).
On the more challenging ARRAU-AA, we found model variants that surpass the baselines for the entire and the nominal part of ARRAU-AA, although we do not train models on individual (nominal) anaphor training data like the related work for shell noun resolution. However, our model still lags behind for pronominal anaphors. Our results suggest that models for nominal and pronominal anaphors should be learned independently, starting with tuning of HPs on a more suitable devset for pronominal anaphors.
We show that the model can exploit syntactic information to select plausible candidates, but that when it does so, it does not learn how to distinguish candidates of equal syntactic type. By contrast, if the model is not provided with syntactic information, it learns deeper features that enable it to pick the correct antecedent without narrowing down the choice of candidates. Thus, in order to improve performance, the model should be enforced to first select reasonable candidates and then continue to learn features to distinguish them, using a larger training set that is easy to provide.
In future work we will design such a model, and offer it candidates chosen not only from sentences containing the antecedent, but the larger context. coreference resolution. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL), Beijing, China.
Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1127-1137, Beijing, China.