Presenting TWITTIRÒ-UD: An Italian Twitter Treebank in Universal Dependencies

In this paper we describe the early-stage application of Universal Dependencies to an Italian social media corpus developed for shared tasks related to irony and stance detection. The development of this novel resource (TWITTIRÒ-UD) serves a twofold goal: it enriches the set of available treebanks for social media and for Italian, and it paves the way for a more reliable extraction of a larger variety of morphological and syntactic features to be used by sentiment analysis tools. On the one hand, social media texts are especially hard to parse, and the limited amount of resources for training and testing NLP tools makes the situation worse. On the other hand, we thought that adding the Universal Dependencies format to the fine-grained annotation for irony previously applied to TWITTIRÒ might meaningfully help in investigating possible relationships between the syntax and the semantics of figurative language, irony in particular.


Introduction
In the last decade, interest in social networking sites has grown considerably, and the NLP community has been relying more and more on data extracted from social media and micro-blogs. In particular, thanks to the APIs provided by the platform and to the variety of sentiments and opinions that people express on it, Twitter has become one of the most exploited sources for data retrieval, especially in the fields of Sentiment Analysis (SA) and Opinion Mining. Nevertheless, although humans can understand each other when they exchange social media content, which features non-standard word-forms, misspelled words, dialectal word-forms, emojis and elongated words, dealing with such content still proves to be a very hard challenge for automatic analysis, especially concerning syntax and morphology.
In this paper we introduce a novel Twitter treebank for Italian, i.e. TWITTIRÒ-UD. The data come from a resource originally developed for training and testing irony detection systems, also exploited as a benchmark for the Italian irony detection task held at EVALITA 2018 (Cignarella et al., 2018b). In order to pave the way towards collecting evidence about the relationships between syntax and the semantic knowledge involved in SA tasks, we are developing an annotation project that combines in TWITTIRÒ-UD both the fine-grained annotation for irony applied in a multilingual setting in Karoui et al. (2017) and the morphological and syntactic annotation provided by Universal Dependencies (UD). Such a resource will allow us to extract morphological and syntactic features to be used to improve performance in irony and stance detection tasks (Duric and Song, 2012; Sidorov et al., 2014). The UD resources available for Italian and social media meaningfully helped us in the morphological and syntactic analysis of the dataset (Sanguinetti et al., 2017; Sanguinetti et al., 2018).
This paper is organized as follows. The next section briefly surveys the literature about Italian social media UD resources. Section 3 introduces the dataset used for our project and describes the various annotation steps. In Section 4 we discuss the creation of the gold standard set and highlight the findings of a quantitative analysis. Finally, in Section 5 we draw some considerations on the current state of the project and give some insights on future work.

Related Work
In recent years UD has become the standard for syntactic annotation (De Marneffe et al., 2014; Nivre et al., 2016), and the repository of UD projects grows by the day, also including data for under-resourced languages and less studied varieties, see e.g. Wang et al. (2017). As far as Italian is concerned, the two main UD resources, which we exploited as reference, are the UD-Italian treebank and PoSTWITA-UD (Sanguinetti et al., 2017; Sanguinetti et al., 2018). The former contains standard texts drawn from newspapers, legal codes and Wikipedia, the latter texts from social media.
The genre of social media texts can be a bottleneck for morphological and syntactic analysis, but some experiments on parsing this type of data are reported in the literature, see e.g. Foster et al. (2011) and Kong et al. (2014); the latter introduce the dependency parser TWEEBOPARSER and TWEEBANK, a Twitter treebank later extended in TWEEBANK V2 (Liu et al., 2018). In Albogamy and Ramsay (2017) an Arabic dependency treebank of tweets is converted into the UD format, while in Blodgett et al. (2018) a treebank of tweets in African-American English is created, and in Bhat et al. (2018) a UD treebank of Hindi-English is created, focusing on syntactic aspects of code-switching.
Finally, as far as the morphological analysis of social media is concerned, the task organized in the 2016 edition of EVALITA can be cited. In this edition of the evaluation campaign for NLP and speech tools for Italian, a task on PoS-tagging of social media texts was organized (Bosco et al., 2016), centered on the PoSTWITA corpus, i.e. the corpus later enriched with UD annotation to create PoSTWITA-UD. This kind of experience encourages the community to adapt NLP tools to this different text domain, which is noisy and difficult to deal with automatically.

Data and Annotation
The data of TWITTIRÒ-UD are drawn from TWITTIRÒ (Cignarella et al., 2018a; Cignarella et al., 2019), a gold standard Italian corpus for irony detection. It was first annotated according to the fine-grained schema for irony proposed in Karoui et al. (2017). Later it was extended with the annotation for sarcasm exploited in the EVALITA 2018 task on irony detection in Italian tweets (IronITA) (Cignarella et al., 2018b). The corpus includes 1,424 tweets annotated as follows.
# sent_id = 507111702744162304
# twittiro = EXPLICIT HYPERBOLE
# sarcasm = 0
# text = se sento ancora la parola merito vomito #labuonascuola #chenonèquelladirenzi
In this tweet two features are marked for irony: the fact that all the elements necessary for interpreting the irony are lexically represented in the post (EXPLICIT), and the particular device that triggers irony (HYPERBOLE). A binary annotation marks the presence or absence of sarcasm. In TWITTIRÒ-UD, this annotation, manually provided and revised in the original resource, is enriched with the annotation for morphology and syntax according to UD (see examples in Sec. 3.1).
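The tweet-level annotation shown above can be read programmatically. The following is a minimal sketch, assuming that the metadata are stored as CoNLL-U style comment lines of the form "# key = value", that the key is spelled sent_id as in standard CoNLL-U, that the irony labels in the twittiro field are separated by spaces, and that sarcasm is encoded as 0 or 1. The function name parse_tweet_metadata is hypothetical and not part of any released tool.

```python
def parse_tweet_metadata(comment_lines):
    """Turn '# key = value' comment lines into a dictionary."""
    meta = {}
    for line in comment_lines:
        line = line.strip()
        if not line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so that the tweet text is kept whole.
        key, value = line[1:].split("=", 1)
        meta[key.strip()] = value.strip()
    # Assumption: irony labels are space-separated in the 'twittiro' field.
    meta["twittiro"] = meta.get("twittiro", "").split()
    # Assumption: sarcasm is encoded as the string '0' or '1'.
    meta["sarcasm"] = meta.get("sarcasm") == "1"
    return meta


example = [
    "# sent_id = 507111702744162304",
    "# twittiro = EXPLICIT HYPERBOLE",
    "# sarcasm = 0",
    "# text = se sento ancora la parola merito vomito "
    "#labuonascuola #chenonèquelladirenzi",
]
meta = parse_tweet_metadata(example)
print(meta["sent_id"], meta["twittiro"], meta["sarcasm"])
```

Keeping the irony labels as a list makes it straightforward to pair them later with features extracted from the dependency tree of the same tweet.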
In order to create TWITTIRÒ-UD, we applied the full pipeline of tokenization, lemmatization, PoS-tagging and dependency parsing provided by UDPipe (Straka and Straková, 2017). For this purpose, we trained UDPipe on two different gold benchmarks, namely PoSTWITA-UD (Sanguinetti et al., 2018) (6,712 tokens) and UD Italian (Simi et al., 2014) (14,167 tokens). Considering the type of text and the features of ironic messages, we followed the PoSTWITA-UD tenets, in particular as concerns segmentation, which is at tweet level rather than at sentence level.

Issues in Manual Correction
In this paper we focus on a subset of the original corpus, which includes only 897 tweets, while we plan a second release in the UD repository, including the full corpus, for November 2019. From the manual correction of this dataset we have already learned some interesting lessons.

Tokenization
Several tokenization errors depend on misspelled words (i.e. words not correctly separated by spaces) or on irregularly used punctuation, as in the following example. In line 12 and line 17 of the tweet we find "concorso...solo" and "perde?#dalleparoleaifatti", each of which should be split into three different tokens. In order to prevent tokenization failures from propagating to the other annotation levels, before tokenization we applied an automatic data cleaning step, which consists in always adding a white space between words and punctuation signs (with the exception of the apostrophe, which is left attached to the preceding token). We manually corrected only the remaining cases of misspelled tokens, i.e. tokens not separated by the necessary white space. The result of the correction of the example above can be seen below (where we also corrected the PoS tags).

Lemmatization and PoS-tagging
Misspelled forms, which often occur in social media content, cannot be recognized by lemmatizers, and their analysis may result in a failure. Here, as was done in the annotation of PoSTWITA-UD, we associated the non-standard forms with the lemmas of their normalized versions, thus allowing correct PoS-tagging. For instance, the typo anema is paired with the lemma anima (soul), the abbreviation ke with che (that), the elongated nooo with no (no), and the abbreviations X and h with per (for) and ora (hour) respectively. Emoticons, emojis, URLs, email addresses, and Twitter marks (hashtags and mentions) have instead been labelled with the tag SYM.
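The automatic cleaning step applied before tokenization could be sketched as follows. This is only an illustration under stated assumptions, not the script actually used for the corpus: a run of consecutive punctuation signs (such as "...") is treated as a single sign, the characters # and @ are kept attached to the following word so that hashtags and mentions survive, and chunks starting with http:// or https:// are left untouched. The name clean_tweet is hypothetical.

```python
import re
import string

# Punctuation to be separated from words. Assumptions: the apostrophe stays
# attached to its token, '#' and '@' stay attached to hashtags and mentions,
# and '_' is kept because it can occur inside user names.
SPLITTABLE = "".join(c for c in string.punctuation if c not in "'#@_") + "…"
PUNCT_RUN = re.compile("([" + re.escape(SPLITTABLE) + "]+)")


def clean_tweet(text):
    """Add a white space between words and runs of punctuation signs."""
    cleaned = []
    for chunk in text.split():
        # Assumption: URLs are protected, otherwise they would be broken up.
        if chunk.lower().startswith(("http://", "https://")):
            cleaned.append(chunk)
        else:
            cleaned.append(PUNCT_RUN.sub(r" \1 ", chunk))
    # Collapse the extra spaces introduced by the substitution.
    return " ".join(" ".join(cleaned).split())


print(clean_tweet("concorso...solo"))
print(clean_tweet("perde?#dalleparoleaifatti"))
```

On the two strings discussed in the Tokenization paragraph, the sketch yields three white-space separated tokens each, which a tokenizer can then split reliably.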

Dependency Relations Attachment
As said above, following the strategy applied in PoSTWITA-UD, we did not perform any sentence splitting in the novel dataset. Each syntactic tree of TWITTIRÒ-UD corresponds to a tweet in its entirety and may therefore consist of multiple sentences. At the same time, since the UD scheme imposes a single-root constraint, the internal connections between the different sentences occurring in a tweet have to be annotated with the dependency relation parataxis. This relation is quite hard for the parser to predict, as it often fails to recognize this kind of structure. See, for instance, Figure 1, where we display a tweet containing several paratactic structures. Another issue is related to the widespread presence of Twitter marks. The currently limited amount of adequate training data prevents the parser from dealing with them successfully. In the manual correction phase, we resorted to the label vocative:mention for Twitter mentions, the label discourse:emo for emojis, and dep for URLs. Moreover, hashtags and mentions can be used either at the end of the tweet, to create more emphasis, or with a full syntactic function. In the first case, we resort to the dedicated relations (parataxis:hashtag and vocative:mention), while in the second we annotate them according to their syntactic role; see for example, in Fig. 2, the hashtag and the mention labelled as nmod.
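The single-root constraint and the role of parataxis can be illustrated with a small check over a tree in CoNLL-U format. The sketch below assumes the standard ten tab-separated CoNLL-U columns (ID in the first, FORM in the second, HEAD in the seventh, DEPREL in the eighth). The two-sentence "tweet" is an invented toy example, not taken from the corpus, and the function names are hypothetical.

```python
def read_tree(conllu_block):
    """Return (id, form, head, deprel) tuples for one CoNLL-U tree."""
    rows = []
    for line in conllu_block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
        if not cols[0].isdigit():
            continue
        rows.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return rows


def has_single_root(rows):
    """True if exactly one token is attached to the artificial root (head 0)."""
    return sum(1 for row in rows if row[2] == 0) == 1


# Invented toy tweet made of two sentences: "piove . resto a casa"
toy = "\n".join("\t".join(cols) for cols in [
    ["1", "piove", "piovere", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"],
    ["3", "resto", "restare", "VERB", "_", "_", "1", "parataxis", "_", "_"],
    ["4", "a", "a", "ADP", "_", "_", "5", "case", "_", "_"],
    ["5", "casa", "casa", "NOUN", "_", "_", "3", "obl", "_", "_"],
])

rows = read_tree(toy)
paratactic = [form for _, form, _, deprel in rows if deprel == "parataxis"]
print(has_single_root(rows), paratactic)
```

In the toy tree the second sentence is attached to the first one by parataxis, so the tweet as a whole keeps a single root, which is the configuration described above.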
Table 1: Distribution of deprel labels for hashtags and mentions (columns: PoSTWITA-UD, TWITTIRÒ-UD).
Table 1 shows the distribution of the dependency relations (deprels) and confirms that there is a syntactic correlate of the peculiar semantic role that hashtags and mentions play in tweets. The labels most often exploited for linking hashtags to the sentence structure in PoSTWITA-UD and in TWITTIRÒ-UD are two: nmod and nsubj.
We can observe, despite the sparseness of relations, how their frequency and distribution characterize the language of the social media data collected in TWITTIRÒ-UD and PoSTWITA-UD with respect to the standard language collected in UD Italian. As expected, meaningful differences emerge for parataxis and punctuation. Punctuation is indeed exploited more extensively in the two social media datasets (12.08% and 17.24%) than in UD Italian (11.36%), and the frequency of the parataxis deprel is 4.02% and 4.62% in PoSTWITA-UD and TWITTIRÒ-UD, while it is only 0.14% in UD Italian, a marked difference. The relations vocative:mention and parataxis:hashtag are especially characteristic of the two social media treebanks: the mentions' deprel accounts for 2.06% in PoSTWITA-UD and 2.89% in TWITTIRÒ-UD, while the hashtags' deprel accounts for 1.81% and 2.15% respectively. Furthermore, it is interesting to notice that the passive voice (aux:pass) accounts for 0.75% in the UD Italian treebank but only 0.12% in PoSTWITA-UD and 0.18% in TWITTIRÒ-UD, indicating a preference for the active voice in the language used in social media, as happens in spoken language.
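Relative frequencies of dependency relations such as those discussed above can be obtained by counting the DEPREL column of a treebank file. The following sketch assumes standard CoNLL-U input with ten tab-separated columns and is not necessarily the procedure used to produce the reported figures. The function name deprel_distribution and the four-token toy input are hypothetical.

```python
from collections import Counter


def deprel_distribution(conllu_text):
    """Map each deprel to its percentage over all syntactic words."""
    counts = Counter()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular token lines: ten columns and an integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        counts[cols[7]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {rel: 100.0 * n / total for rel, n in counts.items()}


# Invented toy input, used only to show the shape of the result.
toy = "\n".join("\t".join(cols) for cols in [
    ["1", "piove", "piovere", "VERB", "_", "_", "0", "root", "_", "_"],
    ["2", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"],
    ["3", "resto", "restare", "VERB", "_", "_", "1", "parataxis", "_", "_"],
    ["4", "#casa", "#casa", "SYM", "_", "_", "1", "parataxis:hashtag", "_", "_"],
])

dist = deprel_distribution(toy)
for rel, pct in sorted(dist.items()):
    print(f"{rel}\t{pct:.2f}%")
```

Running the same function on each of the three treebanks would give directly comparable percentages, since every count is normalized by the size of its own treebank.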

A Parsing Experiment
In order to carry out a preliminary evaluation of the similarities between the three datasets, we evaluated UDPipe using the TWITTIRÒ-UD gold corpus as a test set. The following three settings were exploited: training on UD Italian only, training on PoSTWITA-UD only, and training on both resources.
For evaluation we used the script made available for the CoNLL 2018 Shared Task with the default parameter settings. Table 3 reports the resulting scores for precision (P), recall (R) and averaged F1-score (F1).
First of all, it is interesting to notice the variation of the Unlabelled Attachment Score (UAS) and the Labelled Attachment Score (LAS). As concerns UAS, the first setup, where only the data from UD Italian were used for training, yielded a better result than the second one, where PoSTWITA-UD is the training dataset. The opposite holds for LAS. We can hypothesize that the larger amount of data in UD Italian allowed us to build a more representative statistical model. Nevertheless, training on a resource which includes the same type of data may be crucial for acquiring adequate knowledge about the specific relations exploited. This explains the best scores for LAS and UAS, which were obtained in the third setup, which benefits from both resources for training. This encourages us to develop more and better gold standard treebanks for social media as well, to be used for training.
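To make the two metrics concrete, the sketch below computes UAS and LAS in a simplified setting. It assumes that gold and system trees share exactly the same tokenization, in which case precision, recall and F1 coincide. It is therefore not a substitute for the shared task script, which also has to align tokens when the system tokenization differs from the gold one. The function name attachment_scores and the toy trees are hypothetical.

```python
def attachment_scores(gold, system):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs."""
    if len(gold) != len(system):
        raise ValueError("gold and system must have the same number of tokens")
    if not gold:
        return 0.0, 0.0
    # UAS: the predicted head is correct, whatever the label.
    uas_hits = sum(g[0] == s[0] for g, s in zip(gold, system))
    # LAS: both the predicted head and the predicted label are correct.
    las_hits = sum(g == s for g, s in zip(gold, system))
    return uas_hits / len(gold), las_hits / len(gold)


# Invented toy trees: one wrong label (token 3) and one wrong head (token 4).
gold = [(0, "root"), (1, "punct"), (1, "parataxis"), (3, "obl")]
system = [(0, "root"), (1, "punct"), (1, "conj"), (2, "obl")]

uas, las = attachment_scores(gold, system)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

The toy case shows how the two scores can diverge: a token attached to the right head with the wrong relation, for instance conj instead of parataxis, counts for UAS but not for LAS.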

Conclusion and Future Work
In this paper we presented an ongoing project for the development of a novel Italian treebank from Twitter in the UD format: TWITTIRÒ-UD. Focusing on the 897 tweets currently annotated for the first release, we discussed the annotation of this resource, which combines a fine-grained representation of irony with the UD morpho-syntactic analysis.
The preliminary analysis we carried out shows some differences in the distribution of dependency relations between standard Italian and social media language, e.g. in the use of the active and passive voice, confirming that the language used in social media shows a strong preference for the active voice. Furthermore, a simple parsing experiment and a comparison among the novel resource, UD Italian and PoSTWITA-UD (Sanguinetti et al., 2018) are provided, in order to shed light on the syntactic features of social media texts. Also in view of the future release of the complete resource (1,424 tweets), to be accomplished before the next UD release in November 2019, the work serves a twofold goal: it enriches the set of available resources for a text genre which is especially hard to parse (social media text), and it helps in the investigation of possible relationships between the syntax and the semantics of figurative language (irony in particular). The availability of a resource whose annotation encompasses both UD relations and a fine-grained description of irony may indeed pave the way for investigating whether syntactic knowledge might help in SA and other related tasks.