Using prosodic annotations to improve coreference resolution of spoken text

This paper is the first to examine the effect of prosodic features on coreference resolution in spoken discourse. We test features from different prosodic levels and investigate which strategies can be applied. Our results on the basis of manual prosodic labelling show that the presence of an accent is a helpful feature in a machine-learning setting. Including prosodic boundaries and determining whether the accent is the nuclear accent further improves results.


Introduction
Noun phrase coreference resolution is the task of determining which noun phrases (NPs) in a text or dialogue refer to the same discourse entities (Ng, 2010). Coreference resolution has been extensively addressed in NLP research, e.g. in the CoNLL shared task 2012 (Pradhan et al., 2012) or in the SemEval shared task 2010 (Recasens et al., 2010). Amoia et al. (2012) have shown that there are differences between written and spoken text with respect to coreference resolution and that performance typically drops when systems developed for written text are applied to spoken text. There has been considerable work on coreference resolution in written text, but comparatively little on spoken text, with a few exceptions, such as systems for pronoun resolution in transcripts of spoken text (Strube and Müller, 2003; Tetreault and Allen, 2004). However, so far, prosodic information has not been taken into account. The interaction between prosodic prominence and coreference has been investigated in several experimental and theoretical analyses (Terken and Hirschberg, 1994; Schwarzschild, 1999; Cruttenden, 2006); for German, see Baumann and Riester (2013), Baumann and Roth (2014) and Baumann et al. (2015).
There is a tendency for coreferent items, i.e. entities that have already been introduced into the discourse, to be deaccented, as the speaker assumes the entity to be salient in the listener's discourse model. We can exploit this by including prominence features in the coreference resolver.
Our prosodic features mainly aim at definite descriptions, where it is difficult for the resolver to decide whether a potential anaphor is actually anaphoric. In these cases, accentuation is an important means to distinguish between given entities (often deaccented) and other categories (e.g. bridging anaphors, see below) that are typically accented, particularly for entities whose heads have a different lexeme than their potential antecedent. Pronouns are not of interest here, as they are (almost) always anaphoric. To make the intuitions clearer, Example (1), taken from Umbach (2002), shows the difference prominence can make:

(1) John has an old cottage.
    a. Last year he reconstructed the SHED.
    b. Last year he reconSTRUCted the shed.
Due to the pitch accent on shed in (1a), it is quite obvious that the shed and the cottage refer to different entities; they exemplify a bridging relation, in which the shed is a part of the cottage. In (1b), however, the shed is deaccented, which has the effect that the shed and the cottage corefer.

We present a pilot study on German spoken text that uses manual prominence marking to show the principled usefulness of prosodic features for coreference resolution. In the long run and for application-based settings, of course, we do not want to rely on manual annotations. This work investigates the potential of prominence information and is meant to motivate the use of automatic prosodic features. Our study deals with German data, but the prosodic properties are comparable to those of other West Germanic languages, like English and Dutch. To the best of our knowledge, this is the first work on coreference resolution in spoken text that tests the theoretical claims regarding the interaction between coreference and prominence in a general, state-of-the-art coreference resolver, and shows that prosodic features improve coreference resolution.

Prosodic features for coreference resolution
The prosodic information used for the purpose of our research results from manual annotations that follow the GToBI(S) guidelines by Mayer (1995), which stand in the tradition of autosegmental-metrical phonology, cf. Pierrehumbert (1980), Gussenhoven (1984), Féry (1993), Ladd (2008) and Beckman et al. (2005). We mainly make use of pitch accents and prosodic phrasing. The annotations distinguish intonation phrases, terminated by a major boundary (%), and intermediate phrases, closed by a minor boundary (-), as shown in Examples (2) and (3). The available pitch accent and boundary annotations allow us to automatically derive a secondary layer of prosodic information which maps the pitch accents onto a prominence scale: the nuclear (i.e. final) accents of an intonation phrase (n2) rank as the most prominent, followed by the nuclear accents of intermediate phrases (n1) and prenuclear (i.e. non-final) accents (pn), which are perceptually the least prominent. To put it simply, the nuclear accent is the most prominent accent in a prosodic phrase, while prenuclear accents are less prominent.
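This mapping can be derived mechanically from the accent and boundary annotations. The following sketch illustrates the derivation; the token representation and function name are our own assumptions, not part of any GToBI(S) tool:

```python
# Illustrative sketch of the prominence mapping described above; the
# (accent, boundary) token representation is an assumption for this
# example, not a format prescribed by GToBI(S) tools.

def prominence_levels(tokens):
    """Assign each token a prominence level: 'n2' for the nuclear
    accent of an intonation phrase (closed by '%'), 'n1' for the
    nuclear accent of an intermediate phrase (closed by '-'), 'pn'
    for prenuclear accents, and None for unaccented tokens.

    tokens: list of (accent, boundary) pairs, where accent is a
    GToBI(S) label such as 'H*L' or None, and boundary is '%', '-'
    or None on the token that closes a phrase."""
    levels = [None] * len(tokens)
    last_accent = None  # index of the most recent accent in the phrase
    for i, (accent, boundary) in enumerate(tokens):
        if accent is not None:
            if last_accent is not None:
                levels[last_accent] = "pn"  # superseded: prenuclear
            last_accent = i
        if boundary == "%":    # major boundary ends an intonation phrase
            if last_accent is not None:
                levels[last_accent] = "n2"
            last_accent = None
        elif boundary == "-":  # minor boundary ends an intermediate phrase
            if last_accent is not None:
                levels[last_accent] = "n1"
            last_accent = None
    return levels
```

For instance, in a sequence with two accents before a minor boundary and two more before a major boundary, the phrase-final accents receive n1 and n2 respectively, and the others pn.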
While we expect the difference between the presence and absence of pitch accents to influence the classification of short NPs, as in Example (1), we do not expect complex NPs to be fully deaccented. For complex NPs, we nevertheless hope that the prosodic structure of coreferential NPs will turn out to differ significantly from the structure of discourse-new NPs, so as to yield a measurable effect. Examples (2) and (3) show the prosodic realisation of two expressions with different information status. In Example (2), the complex NP the text about the aims and future of the EU refers back to the Berlin Declaration, whereas in Example (3), the complex NP assault with lethal consequences and reckless homicide is not anaphoric. The share of prenuclear accents is higher in the anaphoric case, which indicates lower overall prominence. The features described in Section 2.1 only take into account the absence or type of the pitch accent; those in Section 2.2 additionally employ prosodic phrasing. To get a better picture of the effect of these features, we implement, for each feature, one version for all noun phrases and a second version only for short noun phrases (≤ 4 words).

Prosodic features ignorant of phrase boundaries
Pitch accent type corresponds to the pitch accent types present in the GToBI(S)-based annotations:

Fall           H*L
Rise           L*H
Downstep fall  !H*L
High target    H*
Low target     L*
Early peak     HH*L
Late peak      L*HL

For complex NPs, the crucial label is the last label in the mention. For short NPs, this usually matches the label on the syntactic head.
Pitch accent presence focuses on the presence of a pitch accent, disregarding its type. If at least one accent is present in the markable, the Boolean feature is assigned the value true, and false otherwise.
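Assuming a markable is represented simply by the list of GToBI(S) accent labels it contains, the two boundary-ignorant features can be sketched as follows (function names are hypothetical, not taken from the resolver's code):

```python
def pitch_accent_type(accents):
    """GToBI(S) label of the last pitch accent in the markable
    (the crucial label for complex NPs), or 'none' if deaccented."""
    return accents[-1] if accents else "none"

def pitch_accent_presence(accents):
    """True if at least one pitch accent falls inside the markable."""
    return bool(accents)
```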

Prosodic features including phrase boundary information
The following set of features takes into account the degree of prominence of pitch accents as presented at the beginning of Section 2, which at the same time encodes information about prosodic phrasing.
Nuclear accent type looks at the different degrees of accent prominence. The markable is assigned the type n2, n1 or pn according to the degree of the last accent in the phrase (and none if it is deaccented).
Nuclear accent presence is a Boolean feature comparable to pitch accent presence. It is assigned the value true if some kind of accent is present in the markable. To judge the helpfulness of the distinction between the categories introduced above, we experiment with two different versions:

1. Only n2 accents get assigned true.
2. n2 and n1 accents get assigned true.

Note that a version where all accents get assigned true, i.e. pn, n1 and n2, is not included, as this equals the feature Pitch accent presence.

(3) Non-anaphoric complex NP (DIRNDL sentences 9/10): The trial about the death of an asylum seeker from Sierra Leone during police custody has started. Charges include [assault with lethal consequence, and reckless homicide], . . .
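In terms of the prominence levels introduced at the beginning of Section 2, the two variants of this feature can be sketched with a single hypothetical helper (not the actual implementation); the input is the list of prominence labels of the accents inside the markable:

```python
def nuclear_presence(levels, n1_counts=False):
    """Variant 1 (n1_counts=False): true only if an n2 accent is present.
    Variant 2 (n1_counts=True): true if an n1 or n2 accent is present.
    levels: prominence labels ('pn', 'n1', 'n2') of the markable's accents."""
    wanted = {"n1", "n2"} if n1_counts else {"n2"}
    return any(level in wanted for level in levels)
```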
Nuclear bag of accents treats accents like a bag-of-words approach treats words: if an accent type is present once (or multiple times), the accent type is considered present. This means we get a number of different combinations (2³ = 8 in total) of accent types that are present in the markable, e.g. pn and n1 but no n2 for Example (2), and pn, n1 and n2 for Example (3).
Nuclear: first and last includes linear information while avoiding an explosion of combinations. It only looks at the (degree of the) first pitch accent present in the markable and combines it with the last accent.
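Both features can be sketched over the same prominence-label representation (again illustrative, with hypothetical names):

```python
def nuclear_bag_of_accents(levels):
    """Unordered set of prominence levels present in the markable;
    with three levels there are 2**3 = 8 possible feature values."""
    return frozenset(levels)

def nuclear_first_last(levels):
    """Combine the prominence level of the first and the last accent,
    keeping some linear information while avoiding an explosion of
    combinations."""
    if not levels:
        return "none"
    return levels[0] + "+" + levels[-1]
```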

Experimental setup
We perform our experiments using the IMS HotCoref system (Björkelund and Kuhn, 2014), a state-of-the-art coreference resolution system for English. As German is not a language that is featured in the standard resolver, we first had to adapt it. These adaptations include gender and number agreement, lemma-based (sub)string match and a feature that addresses German compounds, to name only a few.[2]

For our experiments on prosodic features, we use the DIRNDL corpus[3] (ca. 50,000 tokens, 3,221 sentences), a radio news corpus annotated with both manual coreference and manual prosody labels (Eckart et al., 2012).[4] We adopt the official train, test and development split. We decided to remove abstract anaphors (e.g. anaphors that refer to events or facts), which are not resolved by the system. In all experiments, we only use predicted annotations and no gold mention boundary (GB) information, as we aim at real end-to-end coreference resolution. On DIRNDL, our system achieves a CoNLL score of 47.93, which will serve as a baseline in our experiments. To put the baseline in context, we also report performance on the German reference corpus TüBa-D/Z[5] (Naumann, 2006), which consists of newspaper text. Table 2 compares the performance of our system against CorZu (Klenner and Tuggener, 2011; Tuggener and Klenner, 2014), a rule-based state-of-the-art system for German[9] (on the newest TüBa dataset).


Experiments using prosodic features

Table 3 shows the effect of the respective features: those that are not informed about intonation boundaries (Table 3a) and those that are (Table 3b). Features that achieved a significant improvement over the baseline are marked in boldface.[10] The best-performing feature in Table 3a is the presence of a pitch accent in short NPs. This feature has a negative effect when applied to all NPs: presumably, the system is misled into classifying a higher number of complex anaphoric expressions as non-anaphoric, due to the presence of pitch accents. This confirms our conjecture that long NPs will always contain some kind of accent, so that accent presence alone cannot distinguish nuclear from prenuclear accents. Features based on GToBI(S) accent type did not result in any improvements.

Table 3b presents the performance of the features that are phonologically more informed. Distinguishing between prenuclear and nuclear accents (NuclearType) works best for short NPs, where there is only one accent, while having a negative effect on all NPs. Nuclear presence, however, works well for both versions: not distinguishing between n1 and n2 works best for short NPs, while counting only n2 accents works best for all NPs. This feature achieves the overall best performance for both short NPs (48.76) and all NPs (48.88). The NuclearBagOfAccents feature works quite well, too: this is a feature designed for NPs that have more than one accent, and so it works best for complex NPs. Combining the features did not lead to any improvements.

Overall, it becomes clear that one has to be very careful about how the prosodic information is used. In general, the presence of an accent works better than the distinction between certain accent types, and including intonation boundary information also contributes to the system's performance. When this information is included, we can observe that for the presence of a pitch accent (the best-performing feature), the distinction between prenuclear and nuclear accents is an important one: not distinguishing between them deteriorates results. The results also seem to suggest that simpler features (like the presence or absence of a certain type of pitch accent) work best for simple (i.e. short) phrases. For longer markables, this effect is reversed. This probably means that simple features cannot do justice to the complex prosody of longer NPs, which gets blurred. The obvious solution is to define more complex features that approximate the rhythmic pattern (or even the prosodic contour) found on longer phrases, which, however, will require more data and, ideally, automatic prosodic annotation.

Footnotes:
[2] To download the German coreference system, visit: www.ims.uni-stuttgart.de/forschung/ressourcen/werkzeuge/HOTCorefDe.html
[3] http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/dirndl.html
[4] In this work, we have focused on improvements within the clearly defined field of coreference resolution, using prosodic features. As one of the reviewers pointed out, the DIRNDL corpus additionally features manual two-level information status annotations according to the RefLex scheme (Baumann and Riester, 2012), which additionally distinguishes bridging anaphors, deictic expressions, and more. Recent work on smaller datasets of read text has shown that there is a meaningful correspondence between information status classes and degrees of prosodic prominence, with regard to both pitch accent type and position (Baumann and Riester, 2013; Baumann et al., 2015). Moreover, information status classification has been identified as a task closely related to coreference resolution (Cahill and Riester, 2012; Rahman and Ng, 2012). Integrating these approaches is a promising, though rather complex task, which we reserve for future work. It might, furthermore, require more detailed prosodic analyses than are currently available in DIRNDL.
[5] http://www.sfs.uni-tuebingen.de/de/ascl/ressourcen/corpora/tueba-dz.html
[6] http://stel.ub.edu/semeval2010-coref/
[7] Using the official CoNLL scorer v8.01, including singletons as they are part of TüBa 8.
[8] Using the official CoNLL scorer v8.01, not including singletons as TüBa 9 does not contain them.
[9] CorZu performance: Don Tuggener, personal communication. We did not use CorZu for our experiments as the integration of prosodic information in a rule-based system is non-trivial.
[10] We compute significance using the Wilcoxon signed rank test (Siegel and Castellan, 1988).

Conclusion
We have tested a set of features that include different levels of prosodic information and investigated which strategies can be successfully applied for coreference resolution. Our results on the basis of manual prosodic labelling show that including prosody improves performance. While information on pitch accent types does not seem beneficial, the presence of an accent is a helpful feature in a machine-learning setting. Including prosodic boundaries and determining whether the accent is the nuclear accent further improves results. We interpret this as a promising result, which motivates further research on the integration of coreference resolution and spoken language.