A Fine-grained Large-scale Analysis of Coreference Projection

We perform a fine-grained large-scale analysis of coreference projection. By projecting gold coreference from Czech to English and vice versa on the Prague Czech-English Dependency Treebank 2.0 Coref, we establish an upper bound for the proposed projection approach for these two languages. We undertake a thorough analysis that combines an examination of the projection's subtasks with an evaluation of performance on individual mention types. The findings are accompanied by examples from the corpus.


Introduction
Projection has long been seen as an alternative way to build linguistic tools for resource-poor languages, and coreference projection is no exception. Despite its mostly mediocre results, only a few works perform a proper error analysis.
In this work, we conduct a fine-grained large-scale analysis of coreference projection. We adopt a corpus-based projection approach and apply it to the Czech-English parallel texts of the Prague Czech-English Dependency Treebank 2.0 Coref (Nedoluzhko et al., 2016) in both projection directions. We project manually annotated coreference links within texts enriched with mostly manually created linguistic annotation. Results obtained on manual (i.e. gold) annotation can then be considered an upper bound for projection techniques in which the gold annotation is replaced by automatically obtained annotation.
We took inspiration from two works that have previously focused on projection of gold coreference. Even though both of them provided an analysis of the collected projections, they approached it in completely different ways. Postolache et al. (2006) concentrated on a factorized analysis: they split the task of projection into subtasks, such as mention matching, mention span overlapping and antecedent selection, and inspected their effects on the final result separately. Alternatively, in their multilingual projection approach, Grishina and Stede (2017) carried out an analysis across mention types: they split all mentions into categories such as noun phrases, named entities and pronouns, and evaluated projection on these mention types separately.
Our work combines both these views, providing a factorized, fine-grained analysis of projected coreference. In addition, we include new categories of mentions: zeros. These have often been neglected, as they are not expressed on the surface. However, by ignoring them we would lose valuable information, especially in pro-drop languages such as Czech and Spanish. Furthermore, our analysis is based on a corpus roughly 100 times larger than those used in the two related works, which makes the findings and conclusions more reliable.
The paper is structured as follows. In Section 2 we describe the two main projection approaches, with a special emphasis on corpus-based projection of gold coreference. Section 3 presents the corpus that we use for making projections and Section 4 describes the projection method we propose. The main projection experiments and their results are presented in Section 5 and analyzed in detail using a factorized view in Section 6. Finally, we conclude in Section 7.

Related Work
Approaches to cross-lingual projection usually aim to bridge the gap of missing resources in the target language. So far, they have been quite successfully applied to part-of-speech tagging (Täckström et al., 2013), syntactic parsing (Hwa et al., 2005), semantic role labeling (Padó and Lapata, 2009), opinion mining (Almeida et al., 2015), etc. Projection techniques are generally grouped into two types with respect to how they obtain the translation into the source language, which is usually a resource-rich language. MT-based approaches apply a machine-translation service to create synthetic data in the source language. Corpus-based approaches take advantage of a human-translated parallel corpus of the two languages.
MT-based approaches. The workflow of these approaches is as follows. Starting with a text in the target language to be labeled with coreference, the text first must be machine-translated into the source language. A coreference resolver for the source language is then applied to the translated text and, finally, the newly established coreference links are projected back to the target language. The flexibility of this approach lies in the fact that it can be applied both at train and at test time, and no linguistic processing tools for the target language are required. To the best of our knowledge, this approach has been applied to coreference only twice: by Rahman and Ng (2012) for projection from English to Spanish and Italian, and by Ogrodniczuk (2013) for projection from English to Polish.
Corpus-based approaches. In these approaches, a human-translated parallel corpus of the two languages is available and the projection mechanism is applied within this corpus. Coreference annotation on the source-language side of the corpus may be labeled either by humans or by a coreference system. The target-language side of the corpus then serves as a training dataset for a coreference resolver. This approach thus must be applied at train time and, moreover, it requires a coreference resolver trainable on the target-language data. As a consequence, linguistic processing tools should be available for the target language, as most resolvers depend on some amount of additional linguistic information. On the other hand, human translation and gold coreference annotation, if available, should increase the quality of the projected coreference. This approach has been used to create a coreference resolver by multiple authors, e.g. de Souza and Orȃsan (2011), Martins (2015), Wallin and Nugues (2017), and Novák et al. (2017). However, since the present work employs the corpus-based approach on gold annotations of coreference, we offer more details on the works of Postolache et al. (2006) and Grishina and Stede (2015, 2017).

Postolache et al. (2006) followed the corpus-based approach using a small English-Romanian corpus of 638 sentence pairs in order to create a bilingually annotated resource. They projected manually annotated coreference, which was then post-processed by linguists to acquire high-quality annotation in Romanian. Based on the gold coreference annotation of the Romanian side of the corpus, they evaluated the F-scores of mention head matching, mention span overlapping, and coreference clusters, both on all and only on correctly projected mentions.
The factorized error analysis they carried out shows that the majority of errors in coreference projection stem from low recall (around 70%), caused by missing alignment due to alignment errors or language differences introduced in the translation.
Yulia Grishina and her colleagues have also investigated possibilities of corpus-based coreference projection. In (Grishina and Stede, 2015), they introduced a "generalizable" annotation schema that they tested on parallel texts in three languages (English, Russian and German) and three genres (newswire articles, short stories, medical leaflets). Using this dataset, consisting of fewer than 500 sentence triples, they conducted experiments on projection from English to the two other languages. In (Grishina and Stede, 2017), they pursue the goal of multi-source projection of manual coreference annotation. They propose several strategies for combining projections from multiple languages, some of which slightly improve the F-score over the best-performing single projection source. They also provide a qualitative analysis of individual mention types, suggesting that pronouns have a much higher projection accuracy than nominal groups. They attribute their unsatisfactory results, especially for German nominal groups, to problems with the inclusion of unaligned German determiners in definite descriptions.

Data Source
We employ a slightly modified version of the Prague Czech-English Dependency Treebank 2.0 Coref (Nedoluzhko et al., 2016, PCEDT 2.0 Coref) for our projection experiments.
PCEDT 2.0 Coref is a coreferential extension of the Prague Czech-English Dependency Treebank 2.0 (Hajič et al., 2012). It is a Czech-English parallel corpus consisting of almost 50k sentence pairs (its basic statistics are shown in the upper part of Table 1). The English part originally comes from the Wall Street Journal texts collected in the Penn Treebank (Marcus et al., 1999), and the Czech part was manually translated. The corpus has been annotated at multiple layers of linguistic representation, up to the layer of deep syntax (the tectogrammatical layer), based on the theory of Functional Generative Description (Sgall et al., 1986). The tectogrammatical representation of a sentence is a dependency tree with semantic labeling, coreference, and an argument structure description based on a valency lexicon. The nodes of a tectogrammatical tree comprise only autosemantic words. Furthermore, some surface-elided expressions are reconstructed at this layer. These include anaphoric zeros (e.g. zero subjects in Czech, or unexpressed arguments of non-finite clauses in both English and Czech), which are introduced at the tectogrammatical layer as newly established nodes.
The coreference annotation of PCEDT 2.0 Coref takes place on the tectogrammatical layer, which allows for marking zero anaphora. Coreference is technically annotated as links connecting two mentions: the anaphor (the referring expression) and the antecedent (the referred-to expression). The coreference links then form chains, which correspond to coreference entities. In tectogrammatics, a mention is determined only by its head; no mention boundaries are specified. Therefore, a coreference link always connects two nodes on the tectogrammatical layer.
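Since coreference is annotated as head-to-head links, the chains (entities) can be recovered as connected components over the anaphor-antecedent pairs. A minimal sketch of that recovery step; the representation of mentions as plain identifiers and the function name are ours, for illustration only:

```python
def links_to_chains(links):
    """Group head-to-head coreference links (anaphor, antecedent) into
    chains (connected components); each chain corresponds to one entity."""
    parent = {}

    def find(x):
        # Find the component root, with path halving for efficiency.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for anaphor, antecedent in links:
        # Merge the two mentions' components.
        parent[find(anaphor)] = find(antecedent)

    chains = {}
    for mention in parent:
        chains.setdefault(find(mention), set()).add(mention)
    return list(chains.values())
```

For example, the links (C, B), (B, A), (E, D) yield two chains, {A, B, C} and {D, E}, regardless of the order in which the links are listed.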
In order to provide a fine-grained qualitative analysis, we divide mentions in this paper into multiple categories: (1) personal pronouns, (2) possessive pronouns, (3) reflexive possessive pronouns, (4) reflexive pronouns, all four pronoun types in the 3rd or ambiguous person, (5) demonstrative pronouns, (6) zero subjects, (7) zeros in non-finite clauses, (8) relative pronouns, (9) pronouns of types (1)-(4) in the 1st or 2nd person, (10) named entities, (11) common nominal groups, and (12) other expressions. Note that categories (3) and (6) occur only in Czech. Category (12) comprises, among others, nodes restoring ellipsis, e.g. zeros in other than subject positions or missing arguments in reciprocal relations. We do not focus on this category in the rest of the paper. The statistics of coreferential mentions are collected in the bottom part of Table 1.

The treebank is aligned on the level of tectogrammatical nodes. The alignment is based on unsupervised word alignment by GIZA++ (Och and Ney, 2000), augmented with a supervised method (Novák and Žabokrtský, 2014) for selected coreferential expressions. The supervised alignment has been trained on a section of PCEDT 2.0 Coref comprising 1,078 sentence pairs with manual annotation of alignment. To ensure that the whole PCEDT 2.0 Coref is aligned in the same way for our experiments, we slightly modify it, replacing the manual alignment in this particular section with the supervised one, obtained by 10-fold cross-validation.

Coreference Projection
Our approach to coreference projection belongs to the corpus-based methods introduced in Section 2. We work with a manually translated English-Czech parallel corpus with word alignment and project coreference from one language side to the other. In fact, our approach is similar to the one adopted by multiple previous works (Postolache et al., 2006; de Souza and Orȃsan, 2011; Wallin and Nugues, 2017; Grishina, 2017, i.a.). Nevertheless, there is a substantial difference between our work and the others: our projection system operates on the tectogrammatical representation. This has two consequences. Firstly, our system is able to address zero anaphora, since in tectogrammatics zeros are represented by generated nodes. A thorough cross-lingual analysis by Novák and Nedoluzhko (2015) showed that many counterparts of Czech or English coreferential expressions are zeros; this likely holds for other pro-drop languages, too. It is thus surprising that the previous work on projection to Spanish (Rahman and Ng, 2012) or Portuguese (de Souza and Orȃsan, 2011) did not address this problem at all. Secondly, mention spans are not specified in tectogrammatical trees, as mentioned in Section 3. Concerning projection, many of the previous works (Rahman and Ng, 2012; Postolache et al., 2006; Wallin and Nugues, 2017, i.a.) devote considerable space to the question of the proper strategy for determining the boundaries of a projected mention. If a mention is solely defined by its head, as in the present work, this question does not need to be answered.
The projection algorithm is schematized in Algorithm 1. The input of the algorithm is two aligned lists of tectogrammatical trees representing the same text in the source and the target language. First, a list of coreference chains is extracted from the source trees (line 1). Every coreference chain is projected independently (lines 2-18), mention by mention, starting with the first one, i.e. the one that has no outgoing link. For each mention, at this moment viewed as an anaphor, its counterpart in the target language is obtained using the alignment (line 5). In case there are several nodes aligned to the anaphor, those which do not yet participate in a different chain are interlinked, and only the very last one is returned by the function GetAlignedAndInterlink. If no aligned counterpart of the anaphor is found, the anaphor is skipped and its outgoing coreference link thus remains unprojected. Otherwise (lines 6-16), the counterparts of the anaphor's direct antecedents are retrieved (lines 7-8) and the algorithm adds a link between the anaphor's and the antecedents' counterparts in the target language (line 10). If there are no antecedents' counterparts, the last successfully projected anaphor from any of the previous iterations is used instead (line 13).
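The chain-by-chain projection described above can be condensed into the following sketch. The data structures are simplified relative to Algorithm 1: a chain is an ordered list in which each mention's direct antecedent is the preceding element, `align` maps source nodes to their aligned target nodes, and `used` collects target nodes already claimed by other chains; all names are ours, not the authors' implementation.

```python
def project_chain(chain, align, used):
    """Project one source-language coreference chain to the target side.
    Returns the projected (anaphor, antecedent) links between target nodes."""
    target_links = []
    projected = {}                  # source mention -> its target counterpart
    last_projected = None           # fallback antecedent counterpart

    for i, anaphor in enumerate(chain):
        counterparts = [t for t in align.get(anaphor, []) if t not in used]
        if not counterparts:
            continue                # unaligned anaphor: link stays unprojected
        # Several aligned nodes: interlink them and keep the very last one.
        for left, right in zip(counterparts, counterparts[1:]):
            target_links.append((right, left))
        counterpart = counterparts[-1]
        used.update(counterparts)
        projected[anaphor] = counterpart

        if i > 0:                   # the chain-initial mention has no antecedent
            ante = projected.get(chain[i - 1], last_projected)
            if ante is not None and ante != counterpart:
                target_links.append((counterpart, ante))
        last_projected = counterpart
    return target_links
```

For instance, projecting the chain [m1, m2, m3] where m2 has no aligned counterpart links m3's counterpart directly to m1's, realizing the fallback to the last successfully projected anaphor.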

Experiments and Results
In the following experiment, we project gold coreference between gold trees in two directions: from English to Czech and vice versa. The experiment is carried out on the dataset presented in Section 3, PCEDT 2.0 Coref with supervised alignment in all its sections.
One of the objectives of this work is to analyze the performance of coreference projection for individual mention types. Standard evaluation metrics (e.g. MUC (Vilain et al., 1995), B^3 (Bagga and Baldwin, 1998)) are not suitable for our purposes, though, since they do not allow for scoring only a subset of mentions. Instead, we use a measure similar to the scores proposed by Tuggener (2014), which we denote as the anaphora score.
Let K = {K_1, . . . , K_m} be the set of true and S = {S_1, . . . , S_n} the set of predicted coreference chains. From K and S we derive the sets of true anaphors ANAPH(K) and predicted anaphors ANAPH(S) using the following definition:

ANAPH(Z) = { x : ∃i such that x ∈ Z_i and ANTE_{Z_i}(x) ≠ ∅ },

where the set ANTE_{Z_i}(x) contains all direct antecedents of x in the chain Z_i. We also define an indicator function both(a, K, S) as follows:

both(a, K, S) = 1 if ∃i, j : a ∈ K_i ∩ S_j and ∃e ∈ K_i : e ∈ ANTE_{K_i}(a) and ∃f ∈ ANTE_{S_j}(a) : f ∈ K_i,
                0 otherwise.

In other words, it fires only if a has an antecedent both in the truth and in the prediction, and the predicted antecedent of the anaphor a belongs to the true coreference chain associated with a. Precision (P) and recall (R) are then computed by averaging the function both(a, K, S) over all predicted and all true anaphors, respectively, and the F-score (F) traditionally as the harmonic mean of P and R:

P = (1 / |ANAPH(S)|) Σ_{a ∈ ANAPH(S)} both(a, K, S),
R = (1 / |ANAPH(K)|) Σ_{a ∈ ANAPH(K)} both(a, K, S),
F = 2PR / (P + R).

To evaluate only a particular anaphor type, both sets ANAPH(K) and ANAPH(S) must be restricted to anaphoric mentions of the given type. In the following tables, we use the pattern P R F to format the three components of the anaphora score.

Algorithm 1 — Input: SrcTrees = source-language trees with coreference, TrgTrees = target-language trees. Output: TrgTrees = target-language trees with projected coreference.

Table 2 shows the results of gold coreference projection. The main observation is that, with overall F-scores around 50%, coreference projection between EN and CS appears to be a difficult problem. Moreover, let us emphasize that this experiment is supposed to set an upper bound for our projection approach, since most of the annotation it exploits is manual. Comparing the two directions, the CS→EN projection appears to be somewhat easier, yet it still does not reach a 60% F-score. Although precision rates are rather low, it is the even lower recall rates that seem to have the greater effect on the weak performance of projection.
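The anaphora score can be computed directly from the definitions above. In the sketch below, a chain is represented as a list ordered so that each mention's direct antecedent is the preceding element (a simplification of the general ANTE sets); the function names are ours:

```python
def anaphora_score(true_chains, pred_chains):
    """Anaphora-based precision, recall and F-score over coreference chains.
    A chain is a list of mentions; each mention's direct antecedent is
    the preceding element of its chain."""
    def anaphors(chains):
        # Map each anaphor to its direct antecedent;
        # chain-initial mentions are not anaphors.
        return {m: chain[i - 1] for chain in chains
                for i, m in enumerate(chain) if i > 0}

    true_ante, pred_ante = anaphors(true_chains), anaphors(pred_chains)
    # Map each true mention to its whole true chain (entity).
    entity_of = {m: frozenset(chain) for chain in true_chains for m in chain}

    def both(a):
        # 1 iff `a` is anaphoric in both truth and prediction and its
        # predicted antecedent lies in the true chain of `a`.
        return int(a in true_ante and a in pred_ante
                   and pred_ante[a] in entity_of.get(a, ()))

    p = sum(map(both, pred_ante)) / len(pred_ante) if pred_ante else 0.0
    r = sum(map(both, true_ante)) / len(true_ante) if true_ante else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, against a true chain A-B-C, the predicted chain A-C scores P = 1.0 (the single predicted anaphor C is linked within the true entity) but R = 0.5 (the true anaphor B is missed), illustrating the recall-dominated losses reported in Table 2.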
Note that our absolute projection scores cannot be easily compared with the numbers reported by other works performing projection of gold coreference (e.g. by Postolache et al. (2006) and Grishina and Stede (2017)). There are several factors affecting the score values in which these experiments certainly differ: the target language, the range of expressions annotated with coreference, the quality of alignment, the evaluation measure, etc.
To facilitate comparison at least partially, we can judge the relative performance on individual mention types. In both languages, coreference information is clearly best preserved for central pronouns (except for basic reflexives). This agrees with the findings of Grishina and Stede (2017), who observed higher precision for pronouns than for nominal groups. They suggest that the inferior performance on nominal groups may be a result of errors in mention matching. To find out whether our results can be explained in the same way, we undertake a detailed analysis of the factors influencing the projection score.

Analysis of Factors
There are three main factors that contribute to the quality of coreference projection: the quality of (1) alignment, (2) mention matching, and (3) antecedent selection. Every projection error can be associated with the factor that caused it. Table 3 shows the results of the analysis of these factors, which are elaborated in more detail in the following paragraphs.
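The attribution of an error to its factor amounts to finding the first stage of the projection pipeline that fails for a given target-language gold mention. A minimal sketch of this decision cascade; the sets and the function name are illustrative, not part of the authors' tooling:

```python
def error_factor(mention, aligned, matched, correctly_linked):
    """Attribute a projection error on a target-language gold mention to the
    first pipeline stage that failed.
    `aligned`: mentions with at least one aligned source counterpart;
    `matched`: aligned mentions whose counterpart is a gold source mention;
    `correctly_linked`: matched mentions whose projected antecedent lies
    in the correct chain."""
    if mention not in aligned:
        return "alignment"            # nothing can be projected to it
    if mention not in matched:
        return "mention matching"     # aligned, but not to a coreferential head
    if mention not in correctly_linked:
        return "antecedent selection" # matched, but linked into a wrong chain
    return None                       # no error
```

The cascade mirrors the column order of Table 3: each stage is evaluated only on the mentions that survived the previous one.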
Proportion of aligned mentions. No coreference link can be projected to an unaligned mention. Missing alignment of target-language mentions thus causes errors of the first type. The left-hand side of Table 3 shows the proportion of aligned target-language mentions. An extremely low proportion of aligned mentions is observed for Czech basic reflexive pronouns. In the vast majority of cases, unaligned Czech basic reflexives are a result of the corresponding argument of the proposition not being expressed in English. For instance, the Czech translation of the verb to rent in Example 1 requires an explicit reflexive pronoun to signal that Exxon will pay for using the tower, not that Exxon will receive money as its owner.
(1) Exxon si pronajme část výškové budovy.
    Exxon [to it] will rent part of a tower.

    Do dokončení stavby si společnost Exxon pronajme část stávající kancelářské výškové budovy.
    Until the building is completed, Exxon will rent part of an existing office tower.

Surprisingly, Czech personal pronouns are also aligned less frequently than the other mention types. Similarly to the previous case, the reason often is that some arguments of the English proposition are not explicitly mentioned (see Example 2). In general, missing English counterparts result from the compact formulation of English sentences, as in Example 5. Compact language is, in our view, an inherent property of English as well as a feature of the specific journalistic style used in the Wall Street Journal (WSJ). Moreover, one should not neglect the so-called Explicitation Hypothesis formulated by Blum-Kulka (1986): the redundancy expressed by a rise in cohesive explicitness in the target-language text might be caused by the nature of the translation process itself.
(2) Pro prodejce není úplně snadné vypořádat se s pocity, které je od práce odrazují.
    [lit.: For a salesperson it is not entirely easy to cope with the feelings which discourage them from working.]

    It can be hard for a salesperson to fight off feelings of discouragement.
As for English, we again see lower scores for zeros in non-finite clauses and for reflexive pronouns. The non-finite clauses mainly consist of past and present participles. All the missing Czech counterparts of zeros in past participles are due to the participle being represented as an adjective in Czech, and thus having no valency arguments annotated. The reasons behind a missing Czech counterpart of a zero in a present participle are more diverse. The counterpart is often missing even for the governing verb, not just for its zero argument (see Example 3). As opposed to the previous case of explicitation, this is an example of implicitation in the EN→CS translation.
Missing alignment for English reflexives stems from three prevailing reasons. In the first group, there is no counterpart at all. The second group has surface counterparts; however, they are not represented in the tectogrammatical tree. This concerns Czech basic reflexive pronouns, for which it is often hard to distinguish whether they are tightly bound to a verb or fill one of its arguments. The last group comprises English reflexive pronouns in their emphatic use. As shown in Example 4, they are often translated as the words samotný or sám (alone), for which the automatic alignment often fails.

(4) I live in hopes that the ringers themselves will be drawn into that fuller life.

Table 3: Results of the analysis of the three factors directly affecting projection quality. The scores are always measured from a target-language perspective.
Mention matching. A coreference relation cannot be correctly projected unless both the anaphor and the antecedent match a mention in the target language. Not matching a target-language mention is an error of the second type. To assess the impact of mention matching, we measure it solely on aligned target-language mentions and show the results in the middle part of Table 3.
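Restricted to aligned mentions, the matching rates reduce to a set comparison: an aligned target node counts as correctly matched if it is also a gold coreferential mention on the target side. A minimal sketch under this assumption (names are ours):

```python
def matching_scores(projected_mentions, gold_mentions):
    """Mention-matching precision and recall over aligned mentions.
    `projected_mentions`: target-side nodes that received a projected mention;
    `gold_mentions`: target-side nodes that are gold coreferential mentions."""
    projected, gold = set(projected_mentions), set(gold_mentions)
    hits = len(projected & gold)          # correctly matched mentions
    precision = hits / len(projected) if projected else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

A projected node outside the gold set (e.g. a non-head word of a Czech named entity) costs precision, while a gold mention never covered by a projection costs recall, matching the named-entity asymmetry discussed below.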
In agreement with the findings of Grishina and Stede (2017), we observe that the pronouns and zeros in the top part of the table clearly approach a matching precision of 100% in both projection directions. At the same time, the named entities, nominal and other coreferential expressions in the bottom part of the table exhibit drops in precision. We presume that the precision score grows with decreasing length of the mention span. An interesting behavior is displayed by named entities. Whereas in Czech their precision is much lower than their recall, in English the rates are similar but swapped. A closer look at the data gives a clear explanation, illustrated in Example 5. A modifier such as společnost (company), firma (firm), or trh (market) is added to many named entities in Czech. It sounds more natural and is easier to comprehend, especially for readers not familiar with the WSJ domain. This modifier is in fact the head of the complete named entity and, more importantly, the node that may corefer with others. Since it has no counterpart in English, no coreference is transferred to English, which results in recall errors for the corresponding named entities. In the opposite projection direction, an English coreference link connected directly to one of the words in a given named entity finds its Czech counterpart, which, however, is not the head of the mention. Hence, the Czech counterpart is in fact not coreferential, which causes a precision error. And because the head of the mention, the true coreferential node, is a word like společnost, the recall error incurred by not covering it falls into the category of nominal groups, not named entities.
Moreover, English reflexives see a dramatic fall in recall. These errors are again incurred for instances translated as the Czech expressions sám or samotný (alone). Even if they are correctly aligned, these Czech expressions do not carry any coreference annotation; therefore, no links can be projected.
Antecedent selection quality. If both the anaphor and the antecedent are correctly matched to some target-language mentions but these mentions belong to distinct chains, an error of the third type is incurred. The right-hand side of Table 3 shows the anaphora scores calculated on the same data as used until now, but only on correctly matched mentions. These account for around 49% of all coreferential links in Czech and 53% in English.
All F-scores reach around 90% or more. The only exception is the category of Czech demonstrative pronouns. The reasons behind the errors related to them are various, including annotation errors and alignment errors. However, they are often caused by the relatively free nature of demonstratives, which can refer to nominal groups, predicates, larger segments, as well as entities outside the text. This free nature allows annotators to mark different (but somehow related) mentions as antecedents, especially when a differing syntactic structure of the two languages encourages it. For instance, in Example 7 both the expressions "the exchange" and "volume" are in some sense possible antecedents of "it". The same holds for the Czech translation.

Conclusion
Coreference projection performs poorly in both investigated directions. And since the experiments were undertaken on gold data, performance with automatically obtained annotation can hardly be expected to be better.
The analysis confirmed the conclusions drawn in the related literature. First, the bottleneck of coreference projection seems to be alignment and mention matching, which incur mostly recall errors. Second, the precision of projection on pronouns is much better than on nominal groups and named entities, which leads us to believe that shorter mentions are easier to project. However, to confirm this, we would need to define span boundaries for tectogrammatical mentions.
Our analysis also revealed some more detailed findings. Reflexive pronouns seem to be very problematic: not only are they difficult to align, they do not excel at mention matching either. Surprisingly, a relatively high proportion of Czech personal pronouns remains unaligned. The reason for this cannot be clearly generalized from the current corpus and should thus be verified on data covering various domains and translation directions.