Transferring Coreference Resolvers with Posterior Regularization

We propose a cross-lingual framework for learning coreference resolvers for resource-poor target languages, given a resolver in a source language. Our method uses word-aligned bitext to project information from the source to the target. To handle task-specific costs, we propose a softmax-margin variant of posterior regularization, and we use it to achieve robustness to projection errors. We show empirically that this strategy outperforms competitive cross-lingual methods, such as delexicalized transfer with bilingual word embeddings, bitext direct projection, and vanilla posterior regularization.


Introduction
The goal of coreference resolution is to find the mentions in text that refer to the same discourse entity. While early work focused primarily on English (Soon et al., 2001; Ng and Cardie, 2002), efforts have been made toward multilingual systems, an issue addressed in recent shared tasks (Pradhan et al., 2012). However, the lack of annotated data hinders rapid system deployment for new languages. Unsupervised methods (Haghighi and Klein, 2007; Ng, 2008) and rule-based approaches (Raghunathan et al., 2010) avoid this data annotation bottleneck, but they often require complex generative models or expert linguistic knowledge.
We propose cross-lingual coreference resolution as a way of transferring information from a resource-rich language to build coreference resolvers for languages with scarcer resources; as a testbed, we transfer from English to Spanish and to Brazilian Portuguese. We build upon the recent successes of cross-lingual learning in NLP, which has proved quite effective in several structured prediction tasks, such as POS tagging (Täckström et al., 2013), named entity recognition (Wang and Manning, 2014), dependency parsing (McDonald et al., 2011), semantic role labeling (Titov and Klementiev, 2012), and fine-grained opinion mining (Almeida et al., 2015). The potential of these techniques, however, has never been fully exploited in coreference resolution (despite some existing work, reviewed in §6, none resulted in an end-to-end coreference resolver).
We bridge this gap by proposing a simple learning-based method with weak supervision, based on posterior regularization (Ganchev et al., 2010). We adapt this framework to handle softmax-margin objective functions (Gimpel and Smith, 2010), leading to softmax-margin posterior regularization (§4). This step, while fairly simple, opens the door for incorporating task-specific cost functions, which are important to manage the precision/recall trade-offs in coreference resolution systems. We show that the resulting problem involves optimizing the difference of two cost-augmented log-partition functions, making a bridge with supervised systems based on latent coreference trees (Fernandes et al., 2012), reviewed in §3. Inspired by this idea, we consider a simple penalized variant of posterior regularization that tunes the Lagrange multipliers directly, bypassing the saddle-point problem of existing EM and alternating stochastic gradient algorithms (Ganchev et al., 2010; Liang et al., 2009). Experiments (§5) show that the proposed method outperforms commonly used cross-lingual approaches, such as delexicalized transfer with bilingual embeddings, direct projection, and "vanilla" posterior regularization.

Architecture and Experimental Setup
Our methodology, outlined as Algorithm 1, is inspired by the recent work of Ganchev and Das (2013) on cross-lingual learning of sequence models. For simplicity, we call the source and target languages English (e) and "foreign" (f), respectively, and we assume the existence of parallel documents in the two languages (bitext). The first two steps (lines 1-2) run a word aligner and label the source side of the parallel data with a pre-trained English coreference system. Afterwards, the predicted English entities are projected to the target side of the parallel data (line 3), inducing an automatic (and noisy) training dataset for the foreign language. Finally, a coreference system is trained on this dataset with the aid of softmax-margin posterior regularization (line 4).

Figure 1: Excerpt of a bitext document with automatic coreference annotations (from FAPESP). The English side had its coreferences resolved by a state-of-the-art system. The predicted coreference chains {The pulmonary alveoli, the alveoli, their} and {The pulmonary surfactant} are then projected to the Portuguese side, via word alignments.

Algorithm 1 Cross-Lingual Coreference Resolution via Softmax-Margin Posterior Regularization
Input: Source coreference system S_e, parallel data D_e and D_f, posterior constraints Q.
Output: Target coreference system S_f.
We next detail all the datasets and tools involved in our experimental setup. Table 1 provides a summary, along with some statistics.
Parallel Data. As parallel data, we use a sentence-aligned trilingual (English-Portuguese-Spanish) parallel corpus based on the Brazilian scientific news magazine Revista Pesquisa FAPESP, collected by Aziz and Specia (2011). We preprocessed this dataset as follows. We labeled the English side with the Berkeley Coreference Resolution system v1.0, using the provided English model. Then, we computed word alignments using the Berkeley aligner (Liang et al., 2006), intersected them, and filtered out all the alignments whose confidence is below 0.95. After this, we projected English mentions to the target side using the maximal span heuristic of Yarowsky et al. (2001). We filtered out documents where more than 15% of the mentions were not aligned. At this point, we obtained an automatically annotated corpus D_f in the target language. Figure 1 shows a small excerpt where all mentions were correctly projected. In practice, not all documents are so well behaved: in the English-Portuguese parallel data, only 200,175 out of the original 271,122 mentions (about 73.8%) were conserved after the projection step. In Spanish, this number drops to 69.9%.
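As an illustration, the maximal-span heuristic can be sketched as follows. The data layout (per-link confidence scores attached to the alignment) is an assumption for illustration; the actual pipeline operates on Berkeley aligner output.

```python
def project_mention(span, alignment, min_conf=0.95):
    """Project a source mention span onto the target side with the
    maximal-span heuristic: the projected mention covers everything
    between the smallest and largest aligned target positions.

    span: (start, end) source token indices, inclusive.
    alignment: dict mapping a source position to a list of
               (target_position, confidence) pairs.
    """
    targets = [t
               for s in range(span[0], span[1] + 1)
               for (t, conf) in alignment.get(s, [])
               if conf >= min_conf]                 # drop low-confidence links
    if not targets:
        return None  # unaligned mention; counts toward the 15% document filter
    return (min(targets), max(targets))
```

Documents where too many mentions project to `None` would then be discarded, as described above.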
Monolingual Data. We also use monolingual data for validation and comparison with supervised systems. The Berkeley Coreference Resolution system is trained on the English OntoNotes dataset used in the CoNLL 2011 shared task; this dataset is also used to train delexicalized models. For Spanish, we use the AnCora dataset (Recasens and Martí, 2010) provided in the SemEval 2010 coreference task, which we preprocessed as follows. We split all MWEs into individual tokens (for consistency with the other corpora). We also removed the extra gap tokens associated with zero-anaphoric relations, and the anaphoric annotations associated with relative pronouns (e.g., in "[una central de ciclo combinado [que]_1 debe empezar a funcionar en mayo del 2002]_1" we removed the nested mention [que]_1), since these are not annotated in the English dataset.
For Portuguese, we used the Summ-It 3.0 corpus (Collovini et al., 2007), which contains 50 documents annotated with coreferences, from the science section of the Folha de São Paulo newspaper. This dataset is much smaller than OntoNotes and AnCora, as shown in Table 1. We split the data into train, development, and test partitions.
For both Spanish and Portuguese, we obtained automatic POS tags and dependency parses by using TurboParser (Martins et al., 2013).

Problem Definition and Prior Work
In coreference resolution, we are given a set of mentions M := {m_1, . . . , m_M}, and the goal is to cluster them into discourse entities, E := {e_1, . . . , e_E}, where each e_j ⊆ M and e_j ≠ ∅. The set E must form a partition of M, i.e., we must have ⋃_{j=1}^{E} e_j = M, and e_i ∩ e_j = ∅ for i ≠ j. A variety of approaches have been proposed for this problem, including entity-centric models (Haghighi and Klein, 2010; Rahman and Ng, 2011), pairwise models (Bengtson and Roth, 2008; Versley et al., 2008), greedy rule-based methods (Raghunathan et al., 2010), and mention-ranking decoders (Denis and Baldridge, 2008; Durrett and Klein, 2013). We chose to base our coreference resolvers on this last class of methods, which permit efficient decoding by shifting from entity clusters to latent coreference trees. In particular, the inclusion of lexicalized features by Durrett and Klein (2013) yields nearly state-of-the-art performance with surface information only. Given that our goal is to prototype resolvers for resource-poor languages, this model is a good fit; we next describe it in detail.

Latent Coreference Tree Models
Let x be a document containing M mentions, sorted from left to right. We associate with the m-th mention a random variable y_m ∈ {0, 1, . . . , m−1} to denote its antecedent, where the value y_m = 0 means that m is a singleton or starts a new coreference chain. We denote by Y(x) the set of coreference trees that can be formed by linking mentions to their antecedents; we represent each tree as a vector y := (y_1, . . . , y_M). Note that each tree y induces a unique clustering E, but that this map is many-to-one, i.e., different trees may correspond to the same set of entity clusters. We denote by Y(E) the set of trees that are consistent with a given clustering E.
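The many-to-one map from trees to clusterings can be made concrete with a short routine (an illustrative sketch; the 1-indexed mention convention is an assumption):

```python
def tree_to_clusters(y):
    """Map an antecedent vector to the induced entity clustering.

    y[m-1] is the antecedent of mention m (1-indexed); a value of 0
    means mention m starts a new chain. Different trees may induce
    the same clustering, so this map is many-to-one.
    """
    cluster_of = {}   # mention -> index into `clusters`
    clusters = []     # list of sets of mentions
    for m, antecedent in enumerate(y, start=1):
        if antecedent == 0:
            cluster_of[m] = len(clusters)
            clusters.append({m})
        else:
            cluster_of[m] = cluster_of[antecedent]
            clusters[cluster_of[m]].add(m)
    return clusters
```

For instance, the trees (0, 1, 0, 2) and (0, 1, 0, 1) both induce the clustering {{1, 2, 4}, {3}}.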
We model the probability distribution p_w(y|x) as an arc-factored log-linear model:

p_w(y|x) ∝ exp(∑_{m=1}^{M} w⊤f(x, m, y_m)),   (1)

normalized over Y(x), where w is a weight vector and each f(x, m, y_m) is a local feature vector that depends on the document x, the mention m, and its candidate antecedent y_m.
This model permits a cheap computation of the most likely tree ŷ := arg max_{y∈Y(x)} p_w(y|x): simply compute the best antecedent independently for each mention, and collect them to form a tree. An analogous procedure can be employed to compute the posterior marginals p_w(y_m|x) for every mention m.
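Because the model is arc-factored, both decoding and marginal computation reduce to independent per-mention operations. A toy sketch (the score layout is an assumption for illustration):

```python
import math

def decode_and_marginals(scores):
    """Per-mention decoding and posterior marginals.

    scores[m] lists the arc scores w.f(x, m, a) for each candidate
    antecedent a of mention m+1, with index 0 standing for the
    'new chain' option. Both outputs factor over mentions.
    """
    best_tree, marginals = [], []
    for row in scores:
        # best antecedent, independently for each mention
        best_tree.append(max(range(len(row)), key=row.__getitem__))
        # per-mention softmax gives the posterior marginals
        log_z = math.log(sum(math.exp(s) for s in row))
        marginals.append([math.exp(s - log_z) for s in row])
    return best_tree, marginals
```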
Gold coreference tree annotations are rarely available; datasets usually consist of documents annotated with entity clusters, {(x^(n), E^(n))}_{n=1}^{N}. Durrett and Klein (2013) proposed to learn the probabilistic model in Eq. 1 by maximizing conditional log-likelihood, treating the coreference trees as latent variables. They also found it advantageous to incorporate a cost function ℓ(y, Y(E)), measuring the extent to which a prediction y differs from the ones that are consistent with the gold entity set E. Putting these pieces together, we arrive at the following loss function to be minimized:

L(w) := −∑_{n=1}^{N} log ∑_{y∈Y(E^(n))} p̃_w(y|x^(n)),   (2)

where p̃_w is the cost-augmented distribution:

p̃_w(y|x) := exp(w⊤f(x, y) + ℓ(y, Y(E))) / Z_ℓ(w, x).   (3)

The loss function in Eq. 2 can be seen as a probabilistic analogue of the hinge loss of support vector machines, and a model trained this way is called a softmax-margin CRF (Gimpel and Smith, 2010). Note that L(w) is non-convex, corresponding to the difference of two log-partition functions (both convex in w):

L(w) = ∑_{n=1}^{N} [log Z_ℓ(w, x^(n)) − log Z(w, x^(n))],   (4)

where f(x, y) := ∑_{m=1}^{M} f(x, m, y_m), and

Z_ℓ(w, x) := ∑_{y∈Y(x)} exp(w⊤f(x, y) + ℓ(y, Y(E))),   (5)

Z(w, x) := ∑_{y∈Y(E)} exp(w⊤f(x, y)).   (6)

Evaluating the gradient of the loss in Eq. 4 requires computing marginals for the candidate antecedents of each mention, which can be done in a mention-synchronous fashion. This enables a simple stochastic gradient descent algorithm, which was the procedure taken by Durrett and Klein (2013).
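Because the model, the cost, and the gold-consistency check all decompose over mentions, the loss in Eq. 4 can be evaluated with per-mention log-sum-exp accumulations. A minimal sketch (the data layout is an illustrative assumption):

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def softmax_margin_loss(scores, costs, gold_ok):
    """Latent-tree softmax-margin loss for one document:
    log Z_cost - log Z_gold, both factoring over mentions.

    scores[m][a], costs[m][a]: arc score and arc cost for candidate
    antecedent a of mention m; gold_ok[m][a]: True iff the arc is
    consistent with the gold clustering (and hence has zero cost).
    """
    log_z_cost = sum(logsumexp([s + c for s, c in zip(srow, crow)])
                     for srow, crow in zip(scores, costs))
    log_z_gold = sum(logsumexp([s for s, ok in zip(srow, okrow) if ok])
                     for srow, okrow in zip(scores, gold_ok))
    return log_z_cost - log_z_gold
```

With all costs zero and every arc gold-consistent, the two terms coincide and the loss vanishes; costly arcs raise the first term, producing the margin effect.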
Another way of regarding this framework, expressed through the marginalization in Eq. 2, is to "pretend" that the outputs we care about are the actual coreference trees, but that the datasets are only "weakly labeled" with the entity clusters. We build on this point of view in §4.1.

Cross-Lingual Coreference Resolution
We now adapt the framework above to learn coreference resolvers in a cross-lingual manner.

Softmax-Margin Posterior Regularization
In the weakly supervised case, the training data may only be partially labeled or contain annotation errors. To take advantage of such data, we need a procedure that handles uncertainty about the missing annotations and is robust to mislabelings. We next describe an approach based on posterior regularization (PR) that fulfills these requirements.
For ease of explanation, we introduce corpus-level counterparts for the variables in §3.2. We use bold capital letters X := {x^(1), . . . , x^(N)} and Y := {y^(1), . . . , y^(N)} to denote the documents and candidate coreference trees in our corpus. We denote by p_w(Y|X) := ∏_{n=1}^{N} p_w(y^(n)|x^(n)) the conditional distribution of trees over the corpus, induced by a model w, and similarly for the cost-augmented distribution p̃_w(Y|X).
In PR, we define a vector g(X, Y) of corpus-level constraint features, and a vector b of upper bounds for those features. We consider the family of distributions over Y (call it Q) that satisfy these constraints in posterior expectation:

Q := {q(Y) | E_q[g(X, Y)] ≤ b}.   (7)

To make the analysis simpler, we assume that 0 ≤ b ≤ 1, and that for every j, min_Y g_j(X, Y) = 0 and max_Y g_j(X, Y) = 1, where the min/max above are over all possible coreference trees Y that can be built from the documents X in the corpus. Under this assumption, the two extreme values of the upper bounds have a precise meaning: if b_j = 0, the j-th feature becomes a hard constraint (i.e., any feasible distribution in Q vanishes outside {Y | g_j(X, Y) = 0}), while b_j = 1 turns it into a vacuous feature. We also make the usual assumption that the constraint features decompose over documents, g(X, Y) := ∑_{n=1}^{N} g(x^(n), y^(n)); if this were not the case, decoding would be much harder, as the documents would be coupled.
In vanilla PR (Ganchev et al., 2010), one seeks the model w minimizing the Kullback-Leibler divergence between the set Q and the distribution p_w. Here, we go one step further and consider the cost-augmented distribution in Eq. 3. That is, we minimize KL(Q‖p̃_w) := min_{q∈Q} KL(q‖p̃_w). The next proposition shows that this expression also corresponds to a difference of two log-partition functions, as in Eq. 4.

Proposition 1. The (regularized) minimization of the cost-augmented KL divergence is equivalent to the following saddle-point problem:

min_w max_{u≥0} (γ/2)‖w‖² − u⊤b + ∑_{n=1}^{N} [log Z_ℓ(w, x^(n)) − log Z_u(w, x^(n))],   (8)

with Z_ℓ(w, x) as in Eq. 5, and

Z_u(w, x) := ∑_{y∈Y(x)} exp(w⊤f(x, y) + ℓ(y, Y(E)) − u⊤g(x, y)).   (9)

Proof. See Appendix A.
In sum, what Proposition 1 shows is that we can easily extend the vanilla PR framework of Ganchev et al. (2010) to incorporate a task-specific cost: by Lagrange duality, the resulting optimization problem still amounts to finding a saddle point of an objective function (Eq. 8), which involves the difference of two log-partition functions (Eq. 9). The difference is that these partition functions now incorporate the cost term ℓ(y, Y(E)). If this cost term has a factorization compatible with the features and the constraints, this comes at no additional computational burden.

Penalized Variant
In their discriminative PR formulation for learning sequence models, Ganchev and Das (2013) optimize an objective similar to Eq. 8 by alternating stochastic gradient updates with respect to w and u. In their procedure, b was chosen a priori via linear regression (see their Figure 2). Here, we propose a different strategy, based on Proposition 1 and a simple observation: while the constraint values b have a more intuitive meaning than the Lagrange multipliers u (since they may correspond, e.g., to proportions of events observed in the data), choosing these upper bounds is often no easier than tuning u. In this case, a preferable strategy is to specify u directly; this leaves that variable fixed in Eq. 8, and allows us to get rid of b. The resulting problem becomes

min_w (γ/2)‖w‖² + ∑_{n=1}^{N} [log Z_ℓ(w, x^(n)) − log Z_u(w, x^(n))],   (11)

which is a penalized variant of PR and no longer a saddle-point problem. This variant requires tuning the Lagrange multipliers u_j in the range [0, +∞], for every constraint. The two extreme cases of b_j = 0 and b_j = 1 correspond respectively to u_j = +∞ and u_j = 0. Note that this grid search is only appealing for a small number of posterior constraints at corpus level (since document-level constraints would require tuning separate coefficients for each document). The practical advantages of the penalized variant over the saddle-point formulation are illustrated in Figure 2, which compares the performance of stochastic gradient algorithms for the two formulations (there, η_2 = 1 − b_2).
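With u fixed, the penalized objective can again be evaluated per document as a difference of log-partition functions. A sketch under the assumption that the constraint features decompose over arcs (data layout is illustrative):

```python
import math

def logsumexp(xs):
    """Numerically stable log of a sum of exponentials."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def penalized_pr_loss(scores, costs, g_feats, u):
    """Per-document penalized PR objective: log Z_cost - log Z_u.

    scores[m][a], costs[m][a]: arc score and arc cost for candidate
    antecedent a of mention m; g_feats[m][a]: vector of arc-level
    constraint feature values; u: fixed penalty weights, one per
    constraint.
    """
    log_z_cost = sum(logsumexp([s + c for s, c in zip(srow, crow)])
                     for srow, crow in zip(scores, costs))
    # Z_u keeps the cost term and additionally penalizes arcs that
    # activate the constraint features.
    log_z_u = sum(
        logsumexp([s + c - sum(ui * gi for ui, gi in zip(u, g))
                   for s, c, g in zip(srow, crow, grow)])
        for srow, crow, grow in zip(scores, costs, g_feats))
    return log_z_cost - log_z_u
```

With u = 0 the two terms coincide and the objective vanishes; larger u pushes probability mass away from constraint-violating trees.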
An interesting aspect of this penalized formulation is its resemblance to latent variable models. Indeed, the objective of Eq. 11 is also a difference of log-partition functions, as in the latent-tree supervised case (cf. Eq. 4). The noticeable difference is that now both partition functions include extra cost terms: the task-specific cost ℓ(y, Y(E)) in Z_ℓ, and the soft-constraint penalty u⊤g(x, y) in Z_u. In particular, if we set a single constraint feature g_1(x, y) := I(ℓ(y, Y(E)) ≠ 0) with weight u_1 → +∞, all non-zero-cost summands in Z_u(w, x) vanish and we get Z_u(w, x) = Z(w, x), recovering the supervised case (see Eq. 6).

Figure 2: Comparison of saddle-point and penalized PR for Spanish, using the setup in §5.5. Left: variation of the multiplier u_2 over gradient iterations, with strong oscillations in initial epochs and somewhat slow convergence. Right: impact on the averaged F1 scores (on the dev-set). Contrast with the more "stable" scores achieved by the penalized method.
Intuitively, this formulation pushes probability mass toward structures that respect the constraints in Eq. 7, while moving away from those that have a large task-specific cost. A similar idea, but applied to the generative case, underlies the framework of contrastive estimation (Smith and Eisner, 2005).

Cost Function
Denote by E_m the entire coreference chain of the m-th mention (so E = ⋃_{m∈M} {E_m}), and by M_sing := {m ∈ M | E_m = {m}} the set of mentions that are projected as singletons in the data (we call these gold-singleton mentions).
We design a task-specific cost ℓ(ŷ, Y(E)) to balance three kinds of mistakes: (i) false anaphor (ŷ_m ≠ 0 while m ∈ M_sing); (ii) false new (ŷ_m = 0 while m ∉ M_sing); and (iii) wrong link (ŷ_m ≠ 0 but E_m ≠ E_{ŷ_m}). Letting I_FA(ŷ_m, E), I_FN(ŷ_m, E), and I_WL(ŷ_m, E) be indicators for these events, we define a weighted Hamming cost function:

ℓ(ŷ, Y(E)) := ∑_{m=1}^{M} [α_FA I_FA(ŷ_m, E) + α_FN I_FN(ŷ_m, E) + α_WL I_WL(ŷ_m, E)].

We set α_FA = 0.0, α_FN = 3.0, and α_WL = 1.0. Since this cost decomposes as a sum over mentions, the computation of cost-augmented marginals (necessary to evaluate the gradient of Eq. 11) can still be done with mention-ranking decoders.
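A direct implementation of this weighted Hamming cost can be sketched as follows. One detail is an assumption on our part: we count a mention as gold-anaphoric only when an earlier mention shares its projected chain, which guarantees that gold-consistent trees receive zero cost.

```python
def coreference_cost(y_hat, chain, alpha_fa=0.0, alpha_fn=3.0, alpha_wl=1.0):
    """Weighted Hamming cost over mentions.

    y_hat[m-1]: predicted antecedent of mention m (0 = new chain);
    chain[m]: id of the projected chain containing mention m.
    """
    cost = 0.0
    for m, a in enumerate(y_hat, start=1):
        # gold-anaphoric: some earlier mention lies in the same chain
        gold_anaphoric = any(chain[k] == chain[m] for k in range(1, m))
        if a != 0 and not gold_anaphoric:
            cost += alpha_fa          # false anaphor
        elif a == 0 and gold_anaphoric:
            cost += alpha_fn          # false new
        elif a != 0 and chain[a] != chain[m]:
            cost += alpha_wl          # wrong link
    return cost
```

With the weights above, false-new errors are penalized most heavily, which encourages recall on anaphoric links.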

Constraint Features
Finally, we describe the constraint features (Eq. 7) used in our softmax-margin PR formulation.
Constraint #1: Clusters should not split. Let |M| − |E| be the number of anaphoric mentions in the projected data. We push these mentions to preserve their anaphoricity (y_m ≠ 0) and to have their antecedent in the projected coreference chain (E_m = E_{y_m}). To do so, we force the fraction of mentions satisfying these properties to be at least η_1. This can be enforced via a constraint feature counting (the negative of) these mentions, together with an upper bound b_1 := −η_1 ∑_{n=1}^{N} (|M^(n)| − |E^(n)|). (These quantities are shifted by a constant and rescaled to meet the assumption in §4.1.) In our experiments, we set η_1 = 1.0, turning this into a hard constraint. This is equivalent to setting u_1 = +∞ in the penalized formulation.
Constraint #2: Most projected singletons should become non-anaphoric. We define a soft constraint so that a large fraction of the gold-singleton mentions m ∈ M_sing satisfy y_m = 0. This can be done via a constraint feature counting (the negative of) these mentions, together with an upper bound b_2 := −η_2 ∑_{n=1}^{N} |M_sing^(n)|. In our experiments, we varied η_2 in the range [0, 1], either directly or via the dual variable u_2, as described in §4.1. The extreme case η_2 = 0 corresponds to a vacuous constraint, while for η_2 = 1 this becomes a hard constraint which, combined with the previous constraint, recovers bitext direct projection (see §5.3). The intermediate case makes this a soft constraint which allows some singletons to be attached to existing entities (therefore introducing some robustness to non-aligned mentions), but penalizes the number of reattachments.
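The raw counts behind these two constraints can be sketched as follows (the data layout is an assumption; negating a count turns a "fraction at least η" requirement into an upper-bound constraint of the form g ≤ b):

```python
def constraint_counts(y_hat, chain, singletons):
    """Raw (unshifted, unrescaled) counts behind the two constraint
    features.

    y_hat[m-1]: predicted antecedent of mention m (0 = non-anaphoric);
    chain[m]: projected chain id of mention m;
    singletons: the set of gold-singleton mentions.
    """
    kept_links = 0       # constraint #1: correct links preserved
    kept_singletons = 0  # constraint #2: singletons kept non-anaphoric
    for m, a in enumerate(y_hat, start=1):
        if m in singletons:
            if a == 0:
                kept_singletons += 1
        elif a != 0 and chain[a] == chain[m]:
            kept_links += 1
    return -kept_links, -kept_singletons
```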

Experiments
We now present experiments using the setup in §2. We compare our coreference resolvers trained with softmax-margin PR ( §5.5) with three other weakly-supervised baselines: delexicalized transfer with cross-lingual embeddings ( §5.2), bitext projection ( §5.3), and vanilla PR ( §5.4). We also run fully supervised systems ( §5.1), to obtain upper bounds for the level of performance we expect to achieve with the weakly-supervised systems.
An important step in coreference resolution systems is mention prediction. For English, mention spans were predicted from the noun phrases given by the Berkeley parser (Petrov and Klein, 2007), following the same procedure as Durrett and Klein (2013). For Spanish and Portuguese, this prediction relied on the output of the dependency parser, using a simple heuristic: besides pronouns, each maximal span formed by contiguous descendants of a noun becomes a candidate mention. This heuristic is quite effective, as shown by Attardi et al. (2010).

Table 2 shows the performance of supervised systems for English, Spanish, and Portuguese. All optimize Eq. 4 augmented with an extra regularization term (γ/2)‖w‖², by running 20 epochs of stochastic gradient descent (SGD; we set γ = 1.0 and selected the best epoch using the dev-set). All lexicalized systems use the same features as the SURFACE model of Durrett and Klein (2013), plus features for gender and number. We collected a list of pronouns for all languages along with their gender, number, and person information. For English, we trained on the WSJ portion of the OntoNotes dataset, and for Spanish and Portuguese we trained on the monolingual datasets described in §2.
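The dependency-based mention heuristic described above can be sketched as follows (the input format, universal POS tags plus 1-indexed head pointers, is an assumption for illustration):

```python
def candidate_mentions(tags, heads):
    """Candidate mention extraction: every pronoun, plus the maximal
    contiguous span of descendants around each noun.

    tags[i-1]: universal POS tag of token i;
    heads[i-1]: 1-indexed parent of token i (0 for the root).
    """
    n = len(tags)

    def ancestors(j):
        # walk head pointers from token j up to the root
        while heads[j - 1] != 0:
            j = heads[j - 1]
            yield j

    mentions = set()
    for i in range(1, n + 1):
        if tags[i - 1] == 'PRON':
            mentions.add((i, i))
        elif tags[i - 1] == 'NOUN':
            descendants = {i} | {j for j in range(1, n + 1)
                                 if i in ancestors(j)}
            # keep the maximal contiguous block around the head noun
            lo, hi = i, i
            while lo - 1 in descendants:
                lo -= 1
            while hi + 1 in descendants:
                hi += 1
            mentions.add((lo, hi))
    return sorted(mentions)
```

For "the big dog barked" (with "the" and "big" attached to "dog"), this yields the single candidate span covering tokens 1-3.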

Supervised Systems
We observe that the Spanish system obtains averaged F1 scores around 44%, a few points below the English figures. In Portuguese, these scores are significantly lower (in the 37-39% range), which is explained by the fact that the training dataset is much smaller (cf. Table 1).
For English, we also report the performance of delexicalized systems, i.e., systems where all the lexical features were removed. The second row of Table 2 shows a drop of 2-2.5 points with respect to the lexicalized system. For the third and fourth rows, the lexical features were replaced by bilingual word embeddings (either English-Spanish or English-Portuguese; a detailed description of these embeddings will be provided in §5.2).
Here the drop is small, and for English-Spanish the result is on par with the lexicalized system.

Table 2: Results for the supervised systems. We also show the performance of delexicalized English systems, with and without cross-lingual embeddings. Shown are MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998), and CEAFe (Luo, 2005) scores.

Table 3: Results for all the cross-lingual systems. Bold indicates the overall highest scores. As a lower bound, we show a simple deterministic baseline that, for pronominal mentions, selects the closest non-pronominal antecedent, and, for non-pronominal mentions, selects the closest non-pronominal mention that is a superstring of the current mention.

Baseline #1: Delexicalized Transfer With Cross-Lingual Embeddings
We now turn to the cross-lingual systems. Delexicalized transfer is a popular strategy in NLP (Zeman and Resnik, 2008; McDonald et al., 2011), recently strengthened with cross-lingual word representations (Täckström et al., 2012). The procedure works as follows: a delexicalized model for the source language is trained by eliminating all the language-specific features (such as lexical features); then, this model is used directly in the target language. We report here the performance of this baseline on coreference resolution for Spanish and Portuguese, using the delexicalized models trained on the English data as mentioned in §5.1. To achieve a unified feature representation, we mapped all language-specific POS tags to universal tags (Petrov et al., 2012). All lexical features were replaced either by cross-lingual word embeddings (for words that are not pronouns), or by a universal representation containing the gender, number, and person information of the pronoun. To obtain the cross-lingual word embeddings, we ran the method described by Hermann and Blunsom (2014) for the English-Spanish and English-Portuguese pairs, using the parallel sentences in §2. When used as features, these 128-dimensional continuous representations were scaled by a factor of 0.5 (selected on the dev-set), using the procedure of Turian et al. (2010).
The second and seventh rows in Table 3 show the performance of this baseline, which is rather disappointing. For Spanish, we observe a large drop in performance when going from supervised training to delexicalized transfer (about 11-13% in averaged F1). For Portuguese, where the supervised system is not so accurate, the difference is less sharp (about 9-11%). These drops are mainly due to the fact that this method does not take into account the intricacies of each language; e.g., possessive forms have different agreement rules in English and in the Romance languages, which in turn have clitic pronouns that are absent in English. Feature weights that promote certain English agreement relations may then harm performance more than they help.

Baseline #2: Bitext Direct Projection
Another popular strategy for cross-lingual learning is bitext direct projection, which consists of projecting annotations through parallel data in the source and target languages (Yarowsky et al., 2001; Hwa et al., 2005). This is essentially the same as Algorithm 1, except that line 4 is replaced by simple supervised learning, via minimization of the loss function in Eq. 4 with ℓ2-regularization. This procedure has the disadvantage of being very sensitive to annotation errors, as we shall see. For Portuguese, this baseline is a near-reproduction of the work of Souza and Orȃsan (2011), discussed in §6.
The third and eighth rows in Table 3 show that this baseline is stronger than the delexicalized baseline, but still 6-8 points away from the supervised systems. This gap is due to a mix of two factors: prediction errors in the English side of the bitext, and missing alignments. Indeed, when automatic alignments are used, false negatives for coreferent pairs of mentions are common, due to words that have not been aligned with sufficiently high confidence. The direct projection method is not robust to these annotation errors.

Baseline #3: Vanilla PR
Our last baseline is a vanilla PR approach; this is an adaptation of the procedure carried out by Ganchev and Das (2013) to our coreference resolution problem. The motivation is to increase the robustness of bitext projection to annotation errors, which we do by applying the soft constraints in §4.4. We seek a saddle point of the PR objective by running 20 epochs of SGD, alternating w-updates and u-updates. The best results on the dev-set were obtained with η_1 = 1.0 and η_2 = 0.9.
By looking at the fourth and ninth rows of Table 3, we observe that vanilla PR manages to reduce the gap to supervised systems, obtaining consistent gains over the bitext projection baseline (with the exception of the Portuguese dev-set). This confirms the ability of PR methods to handle annotation mistakes in a robust manner.

Our Proposal: Softmax-Margin PR
Finally, the fifth and last rows in Table 3 show the performance of our systems trained with softmax-margin PR, as described in §4.1. We optimized the loss function in Eq. 11 with γ = 1.0 by running 20 epochs of SGD, setting u_1 = +∞ and u_2 = 1.0 (cf. §4.4); the last value was tuned on the dev-set. As shown in Figure 2, this penalized variant was more effective than the saddle-point formulation.
From Table 3, we observe that softmax-margin PR consistently beats all the baselines, narrowing the gap with respect to supervised systems to about 5 points for Spanish, and 2-3 points for Portuguese. Gains over the vanilla PR procedure (the strongest baseline) lie in the range 0.5-3%. These gains come from the ability of softmax-margin PR to handle task-specific cost functions, enabling a better management of precision/recall trade-offs.

Error Analysis
We carried out some error analysis, focused on the Spanish development dataset, to better understand where the improvements of softmax-margin PR come from. The main conclusions carry over to the Portuguese case, with a few exceptions, mostly due to different human annotation criteria.

Table 4 shows the precision and recall scores for mention prediction and the different coreference evaluation metrics. Note that all systems predict the same candidate mentions; however, a final post-processing step discards all mentions that ended up in singleton entities, for compliance with the official scorer. Therefore, the mention prediction score reflects how well a system does in predicting whether a mention is anaphoric or not. The first thing to note is that the PR methods, due to their ability to create new links during training (via constraint #2), tend to predict fewer singletons than the direct projection method. Indeed, we observe that softmax-margin PR achieves 47.1% mention prediction recall, which is more than 5% above the direct projection method, and 10% above the delexicalized transfer method. Note also that, while the vanilla PR method achieves higher recall than the two other baselines, it is still almost 5% below the system trained with softmax-margin PR. This is because vanilla PR does not benefit from the cost function in §4.3; such a cost is able to penalize false non-anaphoric mentions and encourage larger clusters, allowing softmax-margin PR to achieve a better precision/recall trade-off. From Table 4, we can see that this improvement in mention recall consistently translates into higher recall for the MUC, B3, and CEAFe coreference metrics.
Further analysis revealed that a major source of error for the delexicalized baseline is its inability to handle pronominal mentions robustly across languages. In practice, we found the delexicalized systems to be quite conservative with possessive pronouns: for the Spanish dataset, where the vast majority of possessive pronouns are anaphoric, the delexicalized model incorrectly predicts 53.3% of these pronouns as non-anaphoric. The direct projection model is slightly less conservative, missing 30.1% of the possessives (arguably due to its inability to recover missing links in the projected data during training). By comparison, the vanilla and softmax-margin PR models only miss 4.9% and 3.4% of the possessives, respectively. In Portuguese, where many possessives are not annotated in the gold data, we observe a similar but much less pronounced trend.

Related Work
While multilingual coreference resolution has been the subject of recent SemEval and CoNLL shared tasks, no submitted system attempted cross-lingual training. As shown by Recasens and Hovy (2010), language-specific issues pose a challenge, due to phenomena such as pronoun dropping and grammatical gender that are absent in English but exist in other languages. We have discussed some of these issues in the scope of the present work. Harabagiu and Maiorano (2000) and Postolache et al. (2006) projected English corpora to Romanian to bootstrap human annotation, either manually or via automatic alignments. Rahman and Ng (2012) applied translation-based projection at test time (but this requires an external translation service). Hardmeier et al. (2013) addressed the related task of cross-lingual pronoun prediction. While all these approaches help alleviate the corpus annotation bottleneck, none resulted in a full coreference resolver, which our work accomplishes.
The work most closely related to ours is that of Souza and Orȃsan (2011), who also used parallel data to transfer an English coreference resolver to Portuguese, but could not beat a simple baseline that clusters together mentions with the same head. Their approach is similar to our bitext direct projection baseline, except that they used Reconcile (Stoyanov et al., 2010) instead of the Berkeley Coreference System, and a smaller version of the FAPESP corpus. We have shown that our softmax-margin PR procedure is superior to this approach.
Discriminative PR has been proposed by Ganchev et al. (2010). The same idea underlies the generalized expectation criterion (Mann and McCallum, 2010;Wang and Manning, 2014). An SGD algorithm for solving the resulting saddle point problem has been proposed by Liang et al. (2009), and used by Ganchev and Das (2013) for cross-lingual learning of sequence models. We extended this framework in two aspects: by incorporating a task-specific cost in the objective function, and by formulating a penalized variant of PR.

Conclusions
We presented a framework for cross-lingual transfer of coreference resolvers. Our method uses word-aligned bitext to project information from the source to the target language.
Robustness to projection errors was achieved via a PR framework, which we generalized to handle task-specific costs, yielding softmax-margin PR. We also proposed a penalized formulation that is effective for a small number of corpus-based constraints. Empirical gains were shown over three popular cross-lingual methods: delexicalized transfer, bitext direct projection, and vanilla PR.

Acknowledgments
I would like to thank the reviewers for their helpful comments, José Guilherme Camargo de Souza for pointing to existing datasets, and Mariana Almeida for valuable feedback. This work was partially supported by the EU/FEDER programme, QREN/POR Lisboa (Portugal), under the Intelligo project (contract 2012/24803), and by the FCT grants UID/EEA/50008/2013 and PTDC/EEI-SII/2312/2012.
Appendix A. Proof of Proposition 1

By standard variational arguments (namely, Fenchel duality between the log-partition function and the negative entropy; see, e.g., Martins et al. (2010)), we have that the optimal q* minimizing the Lagrangian is

q*(Y) = exp(w⊤f(X, Y) + ℓ(Y) − u⊤g(X, Y)) / ∏_{n=1}^{N} Z_u(w, x^(n)).

Plugging this into the Lagrangian yields Eq. 8.