tweeDe – A Universal Dependencies treebank for German tweets

We introduce the first German treebank for Twitter microtext, annotated within the framework of Universal Dependencies. The new treebank includes over 12,000 tokens from over 500 tweets, independently annotated by two human coders. In the paper, we describe the data selection and annotation process and present baseline parsing results for the new testsuite.


Introduction
Recent years have seen an increasing interest in developing robust NLP applications for data from different language varieties and domains. The Universal Dependencies (UD) project (Nivre et al., 2016) has inspired the creation of many new datasets for dependency parsing in a multilingual setting. Treebanks have been created for low-resourced languages such as Bambara, Erzya, or Kurmanji as well as for many new domains, genres and language varieties for which no annotated data was yet available. Cases in point are web genres, spoken discourse, literary prose, historical data and data from social media. We contribute to the creation of new resources for different language varieties and introduce tweeDe, a new German UD Twitter treebank. TweeDe has a size of over 12,000 tokens, annotated with PoS, morphological features and syntactic dependencies. TweeDe differs from existing German UD treebanks in that its content focusses on private communication. Private tweets share many properties of spoken language. They are often highly informal and not carefully edited, often lack punctuation and can include ungrammatical structures. In addition, the data often includes spelling errors and a creative use of language that results in a high number of unknown words. These properties make user-generated microtext a challenging test case for parser evaluation.
In the paper, we describe the creation of tweeDe, including data selection, preprocessing and the annotation process. We report inter-annotator agreement for the syntactic annotations (§2) and discuss some of the decisions that we have made during annotation (§3). We compare tweeDe to other treebanks in §4. In §5 we present baseline parsing results for the new treebank. Finally, we put our work into context (§6) and outline avenues for future work (§7).

tweeDe – A German Twitter treebank
This section describes the creation of the first German Twitter treebank, annotated with Universal Dependencies. The treebank includes 519 tweets with over 12,000 tokens of microtext.

Data extraction
The annotation of user-generated microtext is a challenging task, due to the brevity of the messages and the missing context information, which often results in highly ambiguous texts. As a result, inter-annotator agreement (IAA) is often lower than on standard newspaper text. To avoid such problems, we opted to extract short communication threads, which range in length from 2 up to 34 tweets. This approach allowed the annotators to see the context of each tweet and was thus crucial for resolving ambiguities in the data.
The conversations were collected in two steps. We first used an existing Python tool that supports the downloading of conversations by querying the Twitter API for a set of query terms and then scraping the HTML page on twitter.com that represents each matching conversation. However, Twitter does not embed complete JSON files into the HTML pages, and the existing crawler had some problems in fully retrieving tweet text containing certain special characters. We therefore used the output of the initial crawler only to establish the ids and the sequencing of the tweets in a conversation and then re-downloaded the full JSON files to be sure we had complete tweets.
The query terms we used were all German stop words, i.e. highly frequent closed-class function words such as prepositions, articles, modal verbs, and adverbs such as auch 'too' or dann 'then'. The idea behind this was to avoid any kind of topic bias. Of the threads retrieved, we only retained those representing private communication between two or more participants. Threads consisting mainly of automatically generated tweets, advertisements, and so on were discarded after manual inspection. The treebank preserves the temporal order of the tweets in the same thread. For meta-information, we keep the tweet id, date and time as well as the author's user name. As is common practice for UD treebanks, we also store the raw, untokenised text for each tweet.
Besides issues arising from brevity, further problems for annotating user-generated social media content are the creative use of language, including acronyms (example 1) and emoticons (example 2), non-canonical spellings (example 3), missing arguments (example 2) and the often missing or inconsistent use of punctuation (examples 1-4). The latter causes segmentation problems like those faced in annotating spoken language where, since no punctuation is given, the annotator has to decide where to insert sentence boundaries.

Segmentation
For spoken German, several proposals have been made on how to segment transcribed utterances, based on syntax, intonation and prosodic cues, pausing and hesitation markers (Rehbein et al., 2004; Selting et al., 2009). However, when the different levels of analysis provide contradicting evidence, it is not clear how to proceed. For tweets, we have to deal with similar issues. When punctuation is missing or used inconsistently, we have to decide how to segment the tweet into units for syntactic analysis. Earlier work has chosen to consider the whole tweet as one unit, i.e. as one syntax tree. Since Twitter has changed their policy and doubled the length limit from 140 to 280 characters, this is no longer feasible (see example 5 below). We thus decided to split up the messages into sentences, based on the following rules.
Example 5 (English translation): "@surfguard @Mathias59351078 @ArioMirzaie Some make me laugh, some make me think "hm" and still others make me feel appalled. I don't have anything to do with any of them. If you blame me for the color of my skin, you're a racist."
• Hashtags and URLs at the beginning or end of the tweet that are not syntactically integrated in the sentence are separated and form their own unit (tree).
• Emoticons are treated as non-verbal comments to the text and are integrated in the tree (figure 1).

Tokenisation
User-generated text often reflects (or mimics) morpho-phonological processes from spoken language that are in conflict with the rules of Standard German orthography. One example is words that, according to German grammar, should be written separately but are contracted into one token in spoken varieties of German. We split merged tokens to avoid having tokens with more than one PoS tag and grammatical function. To mark that the word has been written as one atomic token, we use the UD feature SpaceAfter=No in combination with CorrectSpaceAfter=Yes in the last column of the CoNLL-UD file. Figure 2 (left) shows an example where the canonical token sequence "Kennst Du ?" is instead fused into the single token "Kennste ?".
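The marking of split tokens described above can be illustrated with a minimal Python sketch. It is not the tooling used for tweeDe: the function name, the `last_misc` parameter and the split of "Kennste" into "Kennst" and "e" are assumptions made for illustration only. Only the two feature names, SpaceAfter=No and CorrectSpaceAfter=Yes, are taken from the text.

```python
def misc_for_split_token(parts, last_misc="_"):
    """Return (form, misc) pairs for the parts of one merged surface token.

    Every part except the last is written without a following space in the
    raw tweet (SpaceAfter=No) although standard orthography would require
    one (CorrectSpaceAfter=Yes). The last part keeps whatever MISC value
    fits the raw text; "_" is used here as a default (an assumption).
    """
    if not parts:
        raise ValueError("parts must contain at least one token")
    rows = []
    for i, form in enumerate(parts):
        is_last = i == len(parts) - 1
        misc = last_misc if is_last else "SpaceAfter=No|CorrectSpaceAfter=Yes"
        rows.append((form, misc))
    return rows


# Hypothetical split of the fused form "Kennste"; the actual token forms
# in the treebank may differ.
rows = misc_for_split_token(["Kennst", "e"])
for form, misc in rows:
    print(form, misc, sep="\t")
```

Keeping both features on the non-final parts means the raw tweet text can be restored from the tokens, while the MISC column still records that a space would have been expected.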
We also observe the opposite case where tokens that should have been written as one word are split into two or more separate tokens in the tweet. Most of these are German noun compounds. We chose to annotate split compounds using the UD relation goeswith. We follow UD conventions to always annotate the first component as the head and attach all remaining components to the first component. One problem with this approach is that in some cases the head of the compound will end up with the wrong PoS tag. Figure 2 (right) gives an example where the whole compound should have been annotated as a noun (Japanurlaub, Japan vacation) but instead now obtains a proper noun PoS tag. A possible solution to this problem is to deviate from UD practice and annotate the second component (i.e. the real head) as the head. As those cases were rare in our data, we refrained from doing so, for the sake of consistency with other UD treebanks.
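The attachment convention for split compounds can be written down in a few lines of Python. This is a sketch under assumptions: the function name and the token ids in the example are hypothetical, and only the rule itself (first component is the head, all later components attach to it via goeswith) comes from the text.

```python
def goeswith_arcs(component_ids):
    """Attach all later components of a split compound to the first one.

    component_ids: 1-based token ids of the compound parts in surface order.
    Returns a list of (dependent_id, head_id, relation) triples.
    """
    if len(component_ids) < 2:
        return []  # a single token is not a split compound
    head = component_ids[0]
    return [(dep, head, "goeswith") for dep in component_ids[1:]]


# Hypothetical ids: "Japan" is token 4 and "Urlaub" is token 5.
arcs = goeswith_arcs([4, 5])
print(arcs)
```

Because the first component is always the head, the PoS tag of that component (a proper noun for "Japan") becomes the tag of the whole unit, which is the side effect discussed above.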

Annotation
We annotated two types of PoS tags, based on the UD (Petrov et al., 2012) and Stuttgart-Tübingen (STTS) (Schiller et al., 1995) tag sets. The PoS tags and morphological features represent the annotations of one annotator, who corrected the output of the UD processing pipeline for German (UDPipe) (Straka and Straková, 2017). For all dependency annotations, two annotators provided syntactic attachments and dependency labels, which were subsequently adjudicated. The adjudicated syntactic dependency relations were used for consistency checks between the dependency labels and the PoS and morphological tags. Additional consistency checks based on DECCA (Dickinson and Meurers, 2003) verified the compatibility of the different annotation layers. All incompatibilities were manually inspected and resolved. The final testsuite includes 12,073 tokens from 519 tweets, split up into train, development and test data (table 1). Around 10% of the tweets include a non-projective tree structure.
Inter-Annotator Agreement We computed IAA on a subset of the data with 1,630 tokens. For labelled attachments, the agreement between the two annotators was 0.83 κ; for unlabelled attachments, the score increased to 0.89 κ.
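For readers unfamiliar with κ, the following Python sketch shows how Cohen's κ can be computed for two annotators. It rests on an assumption the text does not spell out: each token is treated as one item whose category is the head index (unlabelled) or the pair of head index and dependency label (labelled). The toy annotations are invented and do not reproduce the figures reported above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Toy example: (head, label) per token for a four-token sentence.
coder_a = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "advmod")]
coder_b = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "advmod")]

labelled = cohens_kappa(coder_a, coder_b)
unlabelled = cohens_kappa([h for h, _ in coder_a], [h for h, _ in coder_b])
print(round(labelled, 3), round(unlabelled, 3))
```

In the toy data the coders disagree on one label but on no head, so the unlabelled score is higher than the labelled one, mirroring the direction of the difference reported above.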

Annotation decisions
Below we discuss decisions we made during the annotation process that deviate from other existing German UD treebanks, i.e. the UD-GSD and the UD-TüBa-D/Z. UD-GSD has been converted from an earlier version of Stanford-style dependencies (McDonald et al., 2013) and contains mostly web reviews while the UD-TüBa-D/Z (Çöltekin et al., 2017) is a conversion of the TüBa-D/Z (Telljohann et al., 2004) and includes articles from a German daily newspaper.
Placeholder sentences In the UD-GSD treebank, finite subordinate placeholder sentences with dass or ob (that, whether) are mostly analysed as ccomp while non-finite correlates are annotated as acl and attached to the placeholder, usually a pronominal adverb. In contrast, the TüBa-D/Z attaches both finite and non-finite placeholder clauses as adverbial clauses to the verb of the matrix clause. We decided to annotate finite and non-finite placeholder sentences as acl and attach both to their respective placeholder (figure 3).
Gloss: there belongs also a regular portion creativity to_it , so much shit to build. Translation: "It takes a good deal of creativity to screw up so bad."
Fixed multi-word constructions German has a rich system of adverbs and particles that can form multi-word constructions and so obtain a meaning that is different from that of their individual components. We annotate those using the dependency label fixed (figure 4 left). Adpositions also frequently form multiword units and have been treated the same (figure 4 right), as have specific combinations of pronouns and prepositions (e.g. Was für ein Unsinn! (What for a nonsense), English translation: "What utter nonsense!").
Correlative construction with two clauses The correlative construction je X, desto/umso Y (the X, the Y) (figure 5) consists of a subordinate clause marked by je, followed by a matrix clause that is introduced by desto/umso. Each clause needs to contain a comparative form, either of an adjective or of an adverb. Semantically, the construction describes a relationship between an independent and a dependent variable (example 8).
As indicated by word order, the clause expressing the causal variable is the subordinate clause (the finite verb comes last) while the clause describing the dependent variable is syntactically encoded as the matrix clause (the finite verb comes in second position). While je typically only marks the subordinate clause, there also exist variants of the construction where the desto/umso is omitted and a second je is used instead to mark the comparative that describes the dependent variable (example 9).
Based on these observations, we decided to attach the subordinate clause as an adverbial clause to the matrix clause and analyse both particles as adverbial modifiers. We do not assign the mark relation as the particles are not modifiers of the head of the subordinate clause but are modifiers of the comparative forms in the subordinate and in the matrix clause. This analysis is different from the one in the German UD-GSD and TüBa-D/Z UD treebanks (figure 6) where the head of the subordinate clause is analysed as the root of the sentence and the matrix clause is attached as a conjunct of the subordinate clause. Our analysis is consistent with the one for conditional clauses that are similar in meaning (e.g.: If I scroll down further, I can see more), where the subordinate if-clause is also an adverbial clausal modifier of the matrix clause.
Gloss: the more_constant the market_shares declined , the more_regular became reformed. Translation: "The more consistently market shares declined, the more regularly reforms were carried out."
[...] al., 2014), which includes mostly news articles and is also the largest existing German treebank. Figure 7 shows the distribution of PoS tags in the four treebanks. While the other three treebanks are quite homogeneous (except UD-GSD including more proper names), the most striking difference between tweeDe and the other treebanks is the higher number of adverbs and pronouns.
This is typical for informal multiparty communication and is accompanied by a lower percentage of nouns, determiners, adjectives and adpositions as well as a slightly higher proportion of verbs. This shows that tweeDe has a more verbal style, as opposed to the nominal style of the other treebanks.
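A comparison of this kind rests on relative PoS tag frequencies per treebank. The sketch below shows one way to compute them from CoNLL-U text in Python. It is an illustration rather than the procedure used for figure 7; the function name and the four-token sample are made up, and only the standard CoNLL-U layout (ten tab-separated columns, UPOS in the fourth) is assumed.

```python
from collections import Counter


def upos_distribution(conllu_text):
    """Relative frequency of each UPOS tag in a CoNLL-U formatted string."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword token ranges and empty nodes
        counts[cols[3]] += 1
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def token_line(token_id, form, upos):
    """Build a token line with all other columns left unspecified."""
    return "\t".join([token_id, form, "_", upos, "_", "_", "_", "_", "_", "_"])


# Invented sample sentence, not taken from the treebank.
sample = "\n".join([
    "# text = Ich komme auch dann",
    token_line("1", "Ich", "PRON"),
    token_line("2", "komme", "VERB"),
    token_line("3", "auch", "ADV"),
    token_line("4", "dann", "ADV"),
])
dist = upos_distribution(sample)
print(dist)
```

Running the same function over each treebank and comparing the resulting dictionaries gives the kind of per-tag contrast described above.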

Parsing experiments
We present parsing baselines for the new German UD treebank, using the state-of-the-art parser of Dozat et al. (2017). The parser is a neural dependency parser that learns complex, non-linear representations directly from the input text, based on bidirectional LSTMs (Hochreiter and Schmidhuber, 1997). It only considers local context and predicts attachments and labels in a greedy fashion. The huge success of the parser is based on its use of biaffine attention.
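To make the idea of biaffine arc scoring more concrete, here is a small pure-Python sketch. It is not the parser's implementation: in the actual parser the head and dependent vectors come from learned layers on top of the BiLSTM states and the weights are trained, whereas all numbers below are hand-picked and all names are assumptions. The sketch only shows the scoring form (a bilinear term plus a head-only bias term) and the independent choice of the best-scoring head for each token.

```python
def biaffine_arc_scores(dep_vecs, head_vecs, U, u):
    """Score every candidate head for every dependent.

    scores[i][j] = dep_i^T U head_j + u^T head_j
    """
    scores = []
    for dep in dep_vecs:
        # Row vector dep^T U, reused for all candidate heads.
        dep_U = [sum(dep[k] * U[k][m] for k in range(len(dep)))
                 for m in range(len(U[0]))]
        row = []
        for head in head_vecs:
            bilinear = sum(a * b for a, b in zip(dep_U, head))
            bias = sum(a * b for a, b in zip(u, head))
            row.append(bilinear + bias)
        scores.append(row)
    return scores


def greedy_heads(scores):
    """Pick the highest-scoring head for each dependent independently."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in scores]


# Toy setup: candidate heads are ROOT (index 0), word 1 and word 2.
head_vecs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# Dependents are word 1 and word 2.
dep_vecs = [[1.0, 0.2], [0.1, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, for readability only
u = [0.0, 0.0]                # no head bias in this toy example

heads = greedy_heads(biaffine_arc_scores(dep_vecs, head_vecs, U, u))
print(heads)
```

In this toy configuration word 1 attaches to ROOT and word 2 attaches to word 1. Since each head is chosen on its own, nothing in the sketch guarantees a well-formed tree.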
In our first experiment, we train the parser on the 250 tweets in the tweeDe training set. We use pretrained skipgram embeddings with 100 dimensions (window size: 5, min word count: 10), trained on a large collection of German tweets, collected in a time period from 2013 to 2017. The embeddings are publicly available from https://www.cl.uni-heidelberg.de/research/downloads. All models have been trained with default parameters.
Table 2 (left) shows results for gold PoS and for automatically predicted PoS tags. Using UD PoS tags for parsing outperforms the STTS tags by a large margin, probably due to sparsity caused by the more fine-grained STTS. Feeding both UD and STTS tags to the parser can further increase results, but only slightly (less than 1%). Most surprisingly, we obtain higher results when using automatically predicted STTS tags than when using gold STTS tags. This observation, however, is more pronounced for the test set and might not be representative; it may be an artefact of the small data size.
Results for training on the small tweeDe dataset only are in the range of 74% LAS (gold PoS) and 68% LAS (auto PoS). When adding the training data from the German-GSD UD treebank, results increase to 81% LAS (gold PoS) and 76% LAS (auto PoS). The large gap of about 5 percentage points between the gold and auto PoS settings highlights the importance of high-quality PoS tags for parsing tweets.
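The labelled attachment score (LAS) reported here, together with its unlabelled counterpart (UAS), can be computed as in the following Python sketch. It is a simplified illustration that assumes gold tokenisation and one (head, label) pair per token; it is not the evaluation script used for the experiments, and the example trees are invented.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs.

    UAS counts tokens with the correct head; LAS additionally requires
    the correct dependency label.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty, aligned token lists")
    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / n, las_hits / n


# Invented four-token example: one label error and one head error.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "punct")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2%}, LAS = {las:.2%}")
```

With predicted PoS tags, tagging errors can propagate into both head and label decisions, which is one way the gap between the gold and auto PoS settings can arise.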

Related work
Twitter treebanks exist not only for English (Kong et al., 2014; Liu et al., 2018; Blodgett et al., 2018) but also for Italian (Sanguinetti et al., 2018) and Arabic (Albogamy et al., 2017). Foster et al. (2011) were among the first to provide syntactic analyses for Twitter microtext. They created a testset with over 500 sentences extracted from tweets. The data was automatically parsed with a constituency parser and the trees were manually corrected by one annotator. Inter-annotator agreement (IAA) for labelled bracketing, measured on a subset of the data annotated by a second annotator, was quite high with nearly 96%. Parsing accuracy without any domain adaptation, however, was low: the Malt parser (Nivre et al., 2006), trained on the WSJ, achieved an LAS of 63.3% on the Twitter testset. The Tweebank v1 (Kong et al., 2014) is another English Twitter treebank, with a size of over 900 tweets annotated with unlabelled dependencies. Liu et al. (2018) extend the work of Kong et al. (2014) by enlarging the treebank to more than 3,500 tweets, refining the guidelines and adding labels to the formerly unlabelled trees. They report an IAA of 84.3% for labelled attachments in the Tweebank v2. A third English Twitter treebank was created by Blodgett et al. (2018). Their corpus includes 250 African-American English (AAE) tweets and 250 tweets of mainstream American English microtext. The data has been annotated by two coders but no inter-annotator agreement is reported.
The Italian Twitter treebank of Sanguinetti et al. (2018) is the largest existing Twitter treebank and includes more than 6,700 trees. The authors report an IAA of 0.92 κ for syntactic annotation. The results for a dependency parser (Dozat et al., 2017) trained on a combination of the Italian UD treebank and the new dataset are also quite high, with a labelled attachment score of 81.5%. The high agreement and parsing scores suggest that the dataset is somewhat easier and better behaved than the Tweebank (see table 3 for baseline results for the different Twitter treebanks).
For Arabic, a treebank with Twitter microtext has been created fully automatically, based on predictions of a rule-based and a data-driven parser (Albogamy et al., 2017). Efforts have been made to map the annotations to the UD scheme, but, to the best of our knowledge, the data is not yet available.
With over 12,000 tokens, our new German Twitter treebank is comparable in size to Tweebank v1 (Kong et al., 2014) even though the number of tweets in our dataset is smaller. This is due to the fact that our data were collected after Twitter raised the maximum length for tweets from 140 to 280 characters.