Annotating Information Structure in Italian: Characteristics and Cross-Linguistic Applicability of a QUD-Based Approach

We present a discourse annotation study in which an annotation method based on Questions under Discussion (QUD) is applied to Italian data. The results of our inter-annotator agreement analysis on authentic Italian data show that the QUD-based approach, originally spelled out for English and German, can successfully be transferred cross-linguistically, supporting good agreement for the annotation of central information structure notions such as focus and non-at-issueness.


Introduction
In this paper, we present a discourse annotation study of Italian data, which uses the annotation scheme and discourse-analytic method, the QUD-tree framework, developed in ?, ? and ?. Its purpose is the cross-linguistic analysis of the information structure and discourse structure of textual data. On the theoretical side, the QUD framework has been applied to a number of different languages, such as German, English and French in (?), and various Austronesian languages as discussed in ? and ?. On the applied side, ? showed that the QUD-based method supports the successful annotation of discourse structure and information structure in German and English spoken language data. Here we want to broaden the cross-linguistic scope of the QUD framework and apply it to another Romance language, Italian. We explore both the QUD annotation and the information structure annotation, including all information structure labels that are part of the annotation scheme proposed in ?: focus, background, contrastive topic, non-at-issue material (NAI) and topic. Topic is regarded as a notoriously difficult label in agreement studies (cf. ??). While the results of our study show that the question-based annotation method supports the successful annotation of discourse structure and of information structure, in particular focus, we also discuss, using the example of topic, some shortcomings of the QUD-based annotation method.

The QUD framework
The QUD framework introduced in ? presents an explicit method for the reconstruction of QUDs, which are otherwise usually discussed only as an abstract theoretical notion. The center of the QUD framework is a compact representation format for QUD trees, in which the textual assertions (A) represent the terminal nodes of a discourse tree (preserving the linear order of the text from left to right) while (implicit or explicit) QUDs (Q) form the non-terminal nodes. An abstract QUD tree is shown in Figure 1. The QUD-tree framework as spelled out in ? can be applied to any kind of written or spoken discourse or conversation. It is not language-specific and can, in principle, be used to investigate data from any language. While the exact analysis procedure is described at great length in the guidelines document (?), we only briefly introduce some basic principles here.
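The tree format described above can be pictured with a small data structure. The following is only a minimal sketch under our own assumptions: the class names, the field names and the toy tree are hypothetical and are not taken from the guidelines or from any annotation tool.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Assertion:
    """Terminal node: one text segment (assertion)."""
    label: str
    text: str


@dataclass
class QUD:
    """Non-terminal node: an implicit or explicit question."""
    label: str
    text: str
    explicit: bool = False
    children: List[Union["QUD", Assertion]] = field(default_factory=list)


def terminals(node):
    """Return the assertions below a node in left-to-right (text) order."""
    if isinstance(node, Assertion):
        return [node]
    result = []
    for child in node.children:
        result.extend(terminals(child))
    return result


# Hypothetical toy tree: Q0 dominates A1 and Q1; Q1 dominates A2 and A3.
tree = QUD("Q0", "top-level question", children=[
    Assertion("A1", "first segment"),
    QUD("Q1", "implicit sub-question", children=[
        Assertion("A2", "second segment"),
        Assertion("A3", "third segment"),
    ]),
])

print([a.label for a in terminals(tree)])
```

Reading off the terminals from left to right gives back the segments in their textual order, which is the property the representation format is meant to preserve.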

QUD principles
The actual identification of a QUD for each assertion is guided by a number of explicit principles adapted from the formal literature on information structure (????), cf. ?:
Q-A-CONGRUENCE: QUDs must be answerable by the assertion(s) that they immediately dominate.
Q-GIVENNESS: Implicit QUDs can only consist of given (or, at least, highly salient) material.
MAXIMIZE-Q-ANAPHORICITY: Implicit QUDs should contain as much given (or salient) material as possible.
Example (4) shows that from these principles we can derive QUD Q32 for assertion A32 in the context of A31, whereas any of the questions in (5), used in place of Q32, would violate at least one of the QUD constraints in the same context.
Two or more assertions are defined as parallel if and only if they share some semantically identical content and represent partial answers to the same QUD, see Example (6), where the semantically shared content is Alek (omitted in the second assertion).
PARALLELISM: The background of a QUD with two or more parallel answers consists of the (semantically) common material of the answers.
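To make the givenness-related principles more tangible, the following sketch approximates them by surface token overlap. This is a deliberately crude proxy and an assumption of ours: in the framework, givenness and salience are judged by human annotators and involve coreference and inference, not string matching. The exemption list, the function names and the English toy sentences are all hypothetical.

```python
import re

# Assumption: question words and a few function words are exempt from the check.
EXEMPT = {"what", "who", "which", "where", "when", "why", "how",
          "did", "do", "does", "is", "are", "was",
          "the", "a", "an", "about", "of", "to"}


def tokens(text):
    """Lower-cased word tokens of a string."""
    return re.findall(r"\w+", text.lower())


def ungiven_material(qud, context):
    """QUD tokens that are neither exempt nor present in the preceding context."""
    given = set(tokens(context))
    return [t for t in tokens(qud) if t not in EXEMPT and t not in given]


def satisfies_q_givenness(qud, context):
    """Proxy for Q-GIVENNESS: the QUD contains no ungiven material."""
    return not ungiven_material(qud, context)


def given_count(qud, context):
    """Number of distinct non-exempt QUD tokens that are given in the context."""
    given = set(tokens(context))
    return sum(1 for t in set(tokens(qud)) if t in given and t not in EXEMPT)


def more_anaphoric(qud_a, qud_b, context):
    """Proxy for MAXIMIZE-Q-ANAPHORICITY: prefer the QUD with more given tokens."""
    if given_count(qud_a, context) >= given_count(qud_b, context):
        return qud_a
    return qud_b


context = "Maria wrote a novel last year."
print(satisfies_q_givenness("What about the novel?", context))
print(ungiven_material("What did the critics say?", context))
print(more_anaphoric("What happened?", "What about the novel?", context))
```

A question that introduces new material (here, the critics) fails the check, and of two admissible questions the one that reuses more of the context is preferred.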

QUDs and information structure
The basis of our annotation approach is an alternative-based definition of information-structural categories, in line with e.g. ?, ?, ?, ? or ?. Table 1 shows the definitions of the information structure categories as introduced in ?. These are the basis for the labels used in our annotation study.
Example (7) demonstrates the assignment of information-structure labels in the context of a QUD (in curly brackets). Note that the indentation (>) of A7 in the textual representation marks subordination in the discourse tree, as shown in Figure 2. The focus is i romanzi di formazione 'coming-of-age novels', which is labelled [ ]F and constitutes the answer to the QUD Q7. The background is linguistic content that is mentioned in this QUD. The question is about what books the interviewee reads or likes to read, so ho ripreso a leggere 'I have started reading again' (labelled [ ]BG) is clearly recoverable from the QUD. Focus and background together form the focus domain, labelled [ ]∼. The sentence-initial phrase Di recente 'recently' is not needed to answer the QUD Q7, which would still receive an answer without it; it is therefore labelled [ ]NAI.
An example of the label topic (T) is given in (8). In A32, the clause-initial phrase Senza etichette, the novel's title, is part of the background (in fact, it is the only background in that utterance) because it is mentioned in Q32. Since it is a referential expression, it is marked [ ]T. An example of a contrastive topic (CT) is given in (9). The (explicit) question Q10 asks the interviewee to name three wishes. The speaker answers with three different assertions, each about one wish. Clearly, il primo 'the first (wish)' in A10.1 and il secondo 'the second (wish)' in A10.2 are members of the alternative set mentioned in Q10 (tre desideri 'three wishes').
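The labelling of example (7) can be written down as token-level data, which is also the granularity at which agreement is measured later on. The sketch below is illustrative only: it assumes that the three phrases quoted in the discussion of (7) occur in this linear order and make up the whole assertion, and the function name is hypothetical.

```python
def token_labels(spans):
    """Expand (label, text) spans into one (token, label) pair per token."""
    pairs = []
    for label, text in spans:
        for tok in text.split():
            pairs.append((tok, label))
    return pairs


# Assumed reconstruction of A7 from the phrases quoted in the text.
a7 = [
    ("NAI", "Di recente"),             # not needed to answer the QUD
    ("BG", "ho ripreso a leggere"),    # recoverable from the QUD
    ("F", "i romanzi di formazione"),  # the answer to the QUD
]

labelled = token_labels(a7)

# Focus and background together form the focus domain.
focus_domain = [tok for tok, lab in labelled if lab in {"F", "BG"}]

for tok, lab in labelled:
    print(f"{tok}\t{lab}")
```

Under these assumptions the non-at-issue phrase is the only material outside the focus domain.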

Evaluation: Discourse structure
In the present annotation study, which is based on the QUD framework described above, our goal is to show that discourse annotation in terms of QUDs can be applied reliably to naturally occurring data, in this particular case Italian data. We conducted an empirical study in which annotators followed the QUD guidelines described in ? to annotate two Italian blog interviews.
For the QUD-based annotation we use the tool TreeAnno introduced by ?, which enables the analyst to semi-automatically segment texts, systematically enhance them with implicit Questions under Discussion (QUDs), and transform the data into a discourse tree called a QUD tree, as described in ?.

Evaluation setup
Two trained annotators, both native speakers of Italian, analyzed and annotated two short Italian blog interviews downloaded from the internet. The first blog interview consists of 95 text segments, the second of 113 segments. The QUD discourse tree for Blog 1 produced by the first annotator is shown in Figure 3; the other three discourse trees are included in the Appendix.

Method and results
For the comparison of the two annotated documents, we follow the method described in ?. The basic idea is that, to compare two QUD annotations, one needs to calculate an inter-annotator agreement score that takes into account, for every segment and every possible span of segments, whether a QUD is present or not. In order to compute a κ statistic (?) based on our QUD annotations, ? propose to follow the method described in ?, which was developed for measuring agreement in the labelling of rhetorical structure categories in texts. The method is based on the idea of mapping the hierarchical structure of a discourse tree onto sets of units (i.e. text segments), represented as a matrix or chart filled with categorical values. In our case, the values indicate whether there exists a (Q)uestion spanning the respective segments, from start to end, or (n)ot.
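The mapping from a tree to a chart can be sketched as follows. This is a simplified illustration, not the implementation used in the study: it assumes that every contiguous span of segments, including single segments, counts as one item, so the number of items it produces (n(n+1)/2 for n segments) need not coincide with the item counts reported for our texts. The function name and the toy spans are hypothetical.

```python
from itertools import combinations_with_replacement


def span_chart(n_segments, qud_spans):
    """Map a QUD annotation onto a chart with one cell per contiguous span.

    qud_spans is the set of (start, end) pairs, 1-based and inclusive,
    that are dominated by a QUD in the annotator's tree.
    """
    chart = {}
    for start, end in combinations_with_replacement(range(1, n_segments + 1), 2):
        chart[(start, end)] = "Q" if (start, end) in qud_spans else "n"
    return chart


# Toy text with four segments, annotated by two hypothetical annotators.
chart_a = span_chart(4, {(1, 4), (2, 3)})
chart_b = span_chart(4, {(1, 4), (2, 4)})

# Both charts share the same keys, so the cells can be compared pairwise.
spans = sorted(chart_a)
values_a = [chart_a[s] for s in spans]
values_b = [chart_b[s] for s in spans]
print(list(zip(spans, values_a, values_b)))
```

Because the two charts are defined over the same set of spans, the two value sequences can be passed directly to any agreement coefficient for categorical data.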
A κ statistic can then be computed between two charts that represent two different QUD annotations of the same text, more precisely between the two resulting sets of possible spans of segments. For our two annotated documents we calculated κ values for the annotation charts derived from our QUD annotations, based on the method described above. For the text Italian Blog 1, consisting of 95 segments, we calculated the κ statistic based on 4,256 items (i.e. possible spans of segments), and for Italian Blog 2, with 113 segments, based on 6,187 items. The results are shown in Table 2. The values show moderate agreement between the two annotators. For Blog 1, the κ value is .61, which is substantially higher than what (?) report for the QUD annotations of their German and English texts: their κ values are around .5. For Blog 2, the κ value is .51, which is very similar to the scores reported in (?) for texts of similar length. Our two annotated Italian texts are relatively short, only around 100 segments each, so it is perhaps too early to draw firm conclusions, in particular since this is a rather complex task. However, since the results are comparable to those reported in (?), we take this as further evidence that the QUD-based annotation of discourse can successfully be applied cross-linguistically.
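For readers who want to see the coefficient itself, here is a minimal sketch of a κ computation over two equally long sequences of categorical values, such as the Q/n cells of two charts. We assume Cohen's κ for two annotators here; the exact variant used in the study follows the cited literature. The function name and the toy sequences are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement from the two annotators' marginal distributions.
    expected = sum(freq_a[lab] * freq_b[lab]
                   for lab in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single, identical label throughout
    return (observed - expected) / (1 - expected)


# Toy charts with ten spans each: 8 of 10 cells agree.
cells_a = ["Q", "Q", "n", "n", "n", "n", "n", "n", "n", "n"]
cells_b = ["Q", "n", "Q", "n", "n", "n", "n", "n", "n", "n"]
print(round(cohen_kappa(cells_a, cells_b), 3))
```

In the toy case the raw agreement is 80%, but since most cells are "n" for both annotators, the chance-corrected value is considerably lower. The same effect applies to real charts, where spans without a QUD vastly outnumber spans with one.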

Evaluation: Information structure
The second major issue we are interested in is to evaluate the reliability of information-structure annotation based on the previous identification of QUDs.

Evaluation setup
For the evaluation of the information-structure annotation, the same two Italian blog texts were annotated by the same two trained annotators, who again followed the guidelines of (Riester et al. 2018). We aimed at annotating all five categories that are mentioned in ?: focus (F), background (BG), non-at-issue material (NAI), contrastive topic (CT) and topic (T). Focus domain labels (∼) were not annotated, since each text segment (assertion) already corresponds to one focus domain. The annotators based their annotations on the previously performed QUD analysis in the TreeAnno tool. As an annotation tool for the token-based information-structure annotation, WebAnno (?) was chosen. Figure 4 shows a screenshot of the information-structure annotation of the beginning of Blog 1.

Method and results
As agreement measure for the evaluation of the information structure annotation, we calculated κ values on the annotated texts based on tokens. In addition to the definitions in Table 1, we defined a number of heuristic (but potentially debatable) rules in order to prevent disagreement due to theoretically unclear issues, such as:
• Discourse connectors (but, and, although, because, therefore etc.) at the beginning of discourse segments are not annotated.
• Punctuation: Quotation marks around an expression, and commas within and at the right edge of an expression, are part of the markable. Periods, colons, semicolons and exclamation marks are not.
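The two rules above can be sketched as a token filter that is applied before labels are compared. This is an illustrative sketch only: the connector list simply repeats the English glosses given in the rule (for Italian data the corresponding Italian connectors would have to be listed), the function name is hypothetical, and the treatment of cases the rules do not mention is left open.

```python
# Assumption: English glosses from the rule above stand in for the Italian connectors.
CONNECTORS = {"but", "and", "although", "because", "therefore"}

# Punctuation that is never part of a markable according to the second rule.
EXCLUDED_PUNCT = {".", ":", ";", "!"}


def markable_tokens(segment_tokens):
    """Return the tokens of a segment that remain annotatable under the two rules."""
    toks = list(segment_tokens)
    # Rule 1: a discourse connector at the beginning of a segment is not annotated.
    if toks and toks[0].lower() in CONNECTORS:
        toks = toks[1:]
    # Rule 2: periods, colons, semicolons and exclamation marks are dropped;
    # quotation marks and commas stay inside the markable.
    return [t for t in toks if t not in EXCLUDED_PUNCT]


segment = ["But", "I", "like", '"', "novels", '"', ",", "really", "!"]
print(markable_tokens(segment))
```

Applying the same filter to both annotations ensures that tokens excluded by convention cannot count as agreements or disagreements.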
Results are shown in Table 3, divided into scores for all labels taken together and individual scores for each label.
Table 3: Kappa for information structure annotation
The results are rather heterogeneous in both texts, but overall they show that the QUD-based method does contribute to a successful annotation of information structure in Italian for a range of labels. For the first text, Blog 1, the overall agreement score for all annotated categories taken together is .7, which shows substantial agreement, the score for focus annotation alone being .72. The agreement scores for the second blog are lower overall, but with .58 for the overall agreement and .51 for focus they are still at a relatively high level and still comparable to the scores that (?) report for the annotation of German and English data (which are around .65). The category NAI, the classification of non-at-issue material, also received reasonable agreement scores: .53 in Blog 1 and .62 in Blog 2. The agreement scores for the categories BG and CT differ a lot between the two texts. In Blog 1, the score for contrastive topic is very high at .85, whereas in Blog 2 the score of .1 shows that there was hardly any agreement between the two annotators. This might be due to the fact that the label CT was used in only very few cases. In Blog 1, the label CT was used for 9 and 12 tokens in the two annotations; in Blog 2 it was assigned to 13 and 14 tokens (out of 1243 tokens). The case is similar with respect to background: in the two annotated documents, the label BG was assigned to only around 40 tokens in Blog 1 and 30 tokens in Blog 2. This means that a disagreement on even a single token when assigning the label CT or BG had a much greater impact on the agreement scores for these labels than a disagreement on a focus label. The category topic (T) received relatively low agreement scores of .45 and .35, but these are still at a level that other studies report for categories like focus (cf. ?, who report a κ of .44 for focus).
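The sensitivity of rare labels can be illustrated with a small calculation. The sketch below assumes that per-label scores are obtained by comparing one label against all others (one-vs-rest) with Cohen's κ; whether the scores in Table 3 were computed in exactly this way is an assumption on our part. The token counts are invented toy data, not our corpus.

```python
def label_kappa(labels_a, labels_b, target):
    """Cohen's kappa for one label against all other labels (one-vs-rest)."""
    n = len(labels_a)
    pairs = list(zip(labels_a, labels_b))
    both = sum(a == target and b == target for a, b in pairs)
    only_a = sum(a == target and b != target for a, b in pairs)
    only_b = sum(a != target and b == target for a, b in pairs)
    neither = n - both - only_a - only_b
    observed = (both + neither) / n
    p_a = (both + only_a) / n  # share of tokens annotator A labels as target
    p_b = (both + only_b) / n  # share of tokens annotator B labels as target
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented toy text of 100 tokens: 60 F, 36 NAI, 4 CT for annotator A.
ann_a = ["F"] * 60 + ["NAI"] * 36 + ["CT"] * 4
ann_b = list(ann_a)
ann_b[0] = "NAI"   # one focus token labelled differently by annotator B
ann_b[96] = "F"    # one CT token labelled as focus by annotator B

kappa_f = label_kappa(ann_a, ann_b, "F")
kappa_ct = label_kappa(ann_a, ann_b, "CT")
print(round(kappa_f, 3), round(kappa_ct, 3))
```

In this toy case the focus label is involved in two disagreements and the CT label in only one, yet the score for CT comes out lower, because a single token is a large share of the four CT tokens.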
In the following section we will qualitatively evaluate why the annotation scheme seems to better support the successful annotation of a category like focus, whereas there seems to be much more disagreement when annotating topic.

Qualitative Evaluation: The Case of Topics
In the question-based definitions of our information structure labels, the focus corresponds to those parts of an assertion that answer the current QUD. Especially in the case of overt questions, but also with implicit QUDs, the annotators agree on focus.
The definition of topic in the QUD framework, however, is the only one that does not take the current QUD into account. As remarked by ?, while potentially all referential expressions inside the background could be labelled as topic, one might argue that not all referential expressions inside the background are actually aboutness topics. But unfortunately, the QUD method is not meant to single out the best topic candidate. And ? do not provide any rules that help to distinguish between better and worse topic candidates. The only cue that is given through the current QUD is that all focal expressions are excluded as topic candidates.
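The consequence of this definition can be made explicit with a small sketch: the QUD only narrows the search down to referential expressions in the background, and it does not rank them. The class, the field names and the toy spans below are hypothetical, and the referentiality flag is assumed to be supplied by the annotator rather than computed.

```python
from dataclasses import dataclass


@dataclass
class Span:
    text: str
    label: str         # information structure label: "F", "BG", "NAI", ...
    referential: bool  # assumption: judged by the annotator


def topic_candidates(spans):
    """All referential background spans; focal spans are excluded by definition."""
    return [s.text for s in spans if s.label == "BG" and s.referential]


# Toy mix of phrases quoted in examples (7) and (8), for illustration only.
spans = [
    Span("Senza etichette", "BG", True),
    Span("ho ripreso a leggere", "BG", False),
    Span("i romanzi di formazione", "F", True),
]
print(topic_candidates(spans))
```

When the function returns more than one candidate for an assertion, the choice among them is left to the annotator, which is where the two annotations can diverge.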
A typical topic expression in Italian would be a clitic left- or right-dislocated phrase (see quel libro below), but no dislocation was present in our data, probably because a blog interview is less interactive than a spoken conversation, and these constructions are typically used in interaction.
(10) a. Quel libro ... 'that book ...'
Clitic personal pronouns, such as le in A2 in (11), are also typical candidates for (continuing) topics.
(11) A1: Abbiamo fatto quattro chiacchiere con Maria Verdiana Rigoglioso per parlare di Senza Etichette, il romanzo che ha pubblicato con Libromania. 'We had a chat with Maria Verdiana Rigoglioso to talk about Senza Etichette, the novel she published with Libromania.'
Q2: {What did you do with her exactly?}
What about cases where the topic is neither a dislocated expression nor a clitic? Our annotation method should be able to single out such cases, but it does not always do so. The example above nicely illustrates a case where our annotators disagreed about labelling a given referential expression as topic: the PP del romanzo in A3, whose referent is already introduced in the previous sentence, A1. One annotator nevertheless chose to include it in the focus and to label A3 as an all-focus assertion. The other annotator, while annotating a similar QUD, chose to label the PP as a topic. Indeed, strictly speaking, this given PP should then also be part of the QUD ("What for, with respect to the novel?").
It may be observed that the PP del romanzo is embedded inside the verb's direct object NP. Our assumption is that informational categories are defined and identified solely by pragmatic means, in particular by the QUD-related properties given in Table 1. Despite this assumption, we may suppose that it was the syntactically embedded position of del romanzo that led one annotator to consider it as part of the focus, or more precisely, the fact that the focus (retroscena e curiosità) did not form a constituent on its own without the PP del romanzo. The relationship between the given-new structure and the syntactic structure has not been discussed by ?, but it is something that might be worth addressing in the future. Of course, if the syntactic position of the topic must be invoked to complete the picture and arrive at its identification, then we expect different levels of complexity in the task of annotating aboutness topics depending on the language.
In other cases, the topic was identified by both annotators, such as le due lingue in (13). In this example, syntax does not help to identify the topic status of the direct object le due lingue. This expression is mentioned in A1 as part of the focus, but instead of being promoted to topic in the subsequent utterance by some syntactic device for topic shift (such as left dislocation, cf. ?), it is left in situ. One reason for the speaker's choice may be the fact that the topic expression is inside a free relative, a construction that seems to be incompatible with dislocation, as the unacceptability of the examples below shows.
Since due lingue is mentioned in the previous sentence, the context tells us that this expression is clearly background. Since it is a referential expression, it has all that is required to be identified as topic. Note that a clitic pronoun might have been acceptable here (see example (15)), but this option is not chosen by the speaker/writer.
The mechanism of identifying parallel structures (multiple answers to the same question) is a strategy that our annotation tool provides to help recognize 'hidden' topics.
Clearly, the fact that le lingue (which again occupies a canonical post-verbal position in A55) is elided in A55 shows that it represents shared material between A55 and A55, and is therefore part of the background.
Cases of topic shift were easily recognized by the two annotators. One example is given below in (17). The referent la mamma che parla la lingua minoritaria per crescere i suoi bambini bilingui is introduced in the overt question Q24.1 and then continues as topic in the answer A24.1. Then the topic changes and becomes i bambini in A25. In A26, the topic changes back to la mamma madrelingua.
The fact that the topic is a preverbal subject also helped the annotators to recognize it. As discussed in (?), preverbal subjects are typical sentence topics, and our two annotators agreed more often when the topic was in that position. The so-called hidden topics were more challenging.
And even if an expression was correctly included within the background, the two annotators still had to decide for every referential item that was part of the background whether to label it as a topic or not. Not surprisingly, they sometimes agreed, as in (13), and they sometimes picked different elements. Since there are several characteristics of the text and the preceding discourse that have to be taken into account for the identification of possible topics, we hypothesise that this category will probably always be annotated with less accuracy than the other information structure categories such as focus or non-at-issue material.

Conclusion
We have presented a novel method for the annotation of information structure which achieves good inter-annotator scores. In particular, the agreement scores for focus are much higher than the results reported in other similar annotation studies on naturally occurring data (cf. ?). The method is based on the reconstruction of QUDs, from which the annotation of IS categories is then derived. The results of our inter-annotator agreement analysis show that the QUD-based approach, originally spelled out for English and German, can successfully be transferred cross-linguistically, supporting good agreement for the annotation of central information structure notions such as focus and non-at-issueness. (Contrastive) topic and background show lower levels of agreement for some texts, which we attribute to the underrepresentation of those categories in some of the data analysed. Thanks to the QUD-based method, attention was drawn to some interesting aspects of Italian information structure, and in particular of Italian topics. Some difficulties of topic identification were shown to be reduced by the adopted annotation procedure. We believe that the discussion of the problems occurring with the labelling of topics in Italian not only contributes to the analysis of topics in Romance languages, but also helps to refine the QUD annotation procedure in general, so that future annotators are more aware of problematic cases, which will hopefully lead to even more reliable annotations.