EmpiriST 2015: A Shared Task on the Automatic Linguistic Annotation of Computer-Mediated Communication and Web Corpora

This paper describes the goals, design and results of a shared task on the automatic linguistic annotation of German language data from genres of computer-mediated communication (CMC), social media interactions and Web corpora. The two sub-tasks of tokenization and part-of-speech tagging were performed on two data sets: (i) a genuine CMC data set with samples from several CMC genres, and (ii) a Web corpora data set of CC-licensed Web pages which represents the type of data found in large corpora crawled from the Web. The teams participating in the shared task achieved a substantial improvement over current off-the-shelf tools for German. The best tokenizer reached an F1-score of 99.57% (vs. 98.95% off-the-shelf baseline), while the best tagger reached an accuracy of 90.44% (vs. 84.86% baseline). The gold standard (more than 20,000 tokens of training and test data) is freely available online together with detailed annotation guidelines.


Motivation, premises and goals
Over the past decade, there has been a growing interest in collecting, processing and analyzing data from genres of computer-mediated communication and social media interactions (henceforth referred to as CMC) such as chats, blogs, forums, tweets, newsgroups, messaging applications (SMS, WhatsApp), interactions on "social network" sites and on wiki talk pages. The development of resources, tools and best practices for automatic linguistic processing and annotation of CMC discourse has turned out to be a desideratum for several fields of research in the humanities:
1. Large corpora crawled from the Web often contain substantial amounts of CMC (blogs, forums, etc.) and similar forms of non-canonical language. Such data are often regarded as "bycatch" that proves difficult for linguistic annotation by means of standard natural language processing (NLP) tools that are optimized for edited text (Giesbrecht and Evert, 2009).
2. For corpus-based variational linguistics, corpora of CMC discourse are an important resource that closes the "CMC gap" in corpora of contemporary written language and language-in-interaction. With a considerable part of contemporary everyday communication being mediated through CMC technologies, up-to-date investigations of language change and linguistic variation need to be able to include CMC discourse in their empirical analyses.
In order to harness the full potential of corpus-based research, the preparation of any type of linguistic corpus which includes CMC discourse (whether a genuine CMC corpus or a broad-coverage Web corpus) faces the challenge of handling and annotating the linguistic peculiarities characteristic of the types of written discourse found in CMC genres. Two fundamental (but non-trivial) tasks are (i) accurate tokenization and (ii) sufficiently reliable part-of-speech (PoS) annotation. Together, they provide a layer of basic linguistic information on the token level that is a prerequisite for any form of advanced linguistic analysis on the word, sentence and interaction level.
The linguistic peculiarities of discourse in CMC and social media genres have been extensively described in the literature (for an overview of features with a focus on German CMC see e.g. Haase et al., 1997; Runkehl et al., 1998; Beißwenger, 2000; Storrer, 2001; Dürscheid, 2005; Androutsopoulos, 2007; Bartz et al., 2013; for English CMC see e.g. Crystal, 2001, 2003; Herring, 1996, 2010, 2011). Due to its dialogic nature and depending on the degree to which the interlocutors consider their interaction as an informal, private exchange, CMC discourse typically includes a range of deviations from the syntactic and orthographic norms of the written standard (often referred to as non-canonical phenomena) such as colloquial spellings (e.g., clitics and schwa elisions) and lexical items which typically occur in spoken interactions rather than monologic texts (interjections, intensifiers, focus and gradation particles, modal particles and downtoners, etc.). The word order and syntax of CMC posts exhibit features that are characteristic of spoken or "conceptually oral" language use in colloquial registers (e.g., ellipses, German weil or obwohl with V2 clause). High-speed typing causes speedwriting phenomena such as typos, the omission of upper case or the use of acronyms; other deviations from the orthographic standard have to be considered as intended, creative spellings (nice2CU, good n8). The need for emotion markers leads to the use of emoticons and emoji; upper case and letter iterations serve as suprasegmental forms of emphasis in the written medium (LASS DAS!, suuuuuper!!!!). Addressing terms and hashtags indicate reference between user posts and link individual posts to discourse topics.
Tackling the linguistic peculiarities of CMC data with NLP tools is an open issue in corpus and computational linguistics, which has been addressed by an increasing number of papers and approaches in recent years (as a desideratum e.g. Beißwenger and Storrer, 2008; King, 2009; for the development of NLP tools e.g. Ritter et al., 2011; Gimpel et al., 2011; Owoputi et al., 2015; Avontuur et al., 2012; Bartz et al., 2013; Neunerdt et al., 2013; Rehbein, 2013; Horbach et al., 2015; Zinsmeister et al., 2014; Ljubešić et al., 2015). Issues of processing and annotating CMC data have also been a central topic of the DFG-funded scientific network Empirical Research of Internet-Based Communication (Empirikom), which brought together researchers interested in building and analyzing CMC, social media and Web corpora for research questions in linguistics, computational linguistics and language technology during the years 2010-2014. As a result of discussions in the network, it was decided to set up a community shared task to foster the development of approaches for automatic linguistic annotation of CMC data for German in a competitive setting. The task was named Empirikom Shared Task on Automatic Linguistic Annotation of Computer-Mediated Communication and Social Media (EmpiriST 2015).
The design of EmpiriST 2015 was based on the following two premises:
1. It should take into consideration not only the compilation of CMC corpora for research and teaching purposes in linguistics but also the handling of portions of CMC data as part of large Web corpora.
2. It should be based on a freely available gold standard created with a well-defined PoS tagset and precise guidelines for tokenization and PoS annotation (see Sec. 2).
The main goals and research questions are:
1. To what extent can the performance of automatic tools for tokenization and PoS tagging of German CMC discourse be improved, using our gold standard for training or domain adaptation?
2. Can both genuine CMC corpora and Web corpora (where CMC phenomena typically occur much less frequently) be processed by the same approaches and models, or do we need different tools for the two types of corpora?

The EmpiriST gold standard
The gold standard developed for the shared task comprises roughly 10,000 tokens of training data provided to participants as well as roughly 10,000 tokens of unseen test data used in the evaluation phase. It was compiled from data samples considered representative of the two types of corpora: (i) a CMC subset covering discourse from a range of CMC/social media genres, and (ii) a Web corpora subset containing CC-licensed Web pages from different genres.

Data sets
The CMC subset includes samples from several CMC genres and different sources:
• a selection of donated tweets from (i) the Twitter channel of an academy project used for (monologic) project-related announcements, (ii) the Twitter channel of a lecturer used for discussions with the students accompanying a university class (= dialogic use of tweets);
• a selection of data taken from the Dortmund Chat Corpus (Beißwenger, 2013) representing discourse from different types of chat: (i) social chat recorded in multiparty chatrooms where people met mainly for recreational purposes, (ii) professional chat comprising professional uses of chatrooms, e.g. advisory chats and chats in the context of learning and teaching;
• a selection of threads retrieved from Wikipedia talk pages;
• a selection of WhatsApp interactions taken from the data collected in the project Whats up, Deutschland?; 2
• a selection of blog comments from CC-licensed weblogs collected by Adrien Barbaresi.
For the Web corpora subset, roughly 50,000 running words of text were collected by Web crawling. In order to ensure a broad coverage of Web genres and topics, the crawl was based on a set of manually pre-selected seed words. The following list gives an impression of the distribution of genres in the data:
• Web sites on topics such as hobbies, travel and IT;
• blogs on topics such as hobbies, travel and legal issues;
• Wikipedia articles on topics such as biology, botany and cities;
• Wikinews on topics such as IT security and ecology.
The largest portion of these data consists of Web pages, blog entries and commentaries; a smaller portion consists of genres such as Wikipedia articles, Wikinews etc. An important requirement was that all texts must be published under a suitable Creative Commons licence so that the resulting corpus can be made freely available to the community without any legal issues. From the available data, we selected roughly 5,000 tokens of training data for each subset, which were provided to task participants with manual tokenization and PoS tagging. Another 5,000 tokens per subset were used as unseen test data, with a similar distribution of genres and sources as in the training data. The precise data sizes of the training and test sets are listed in Tab
2 http://www.whatsup-deutschland.de/

Annotation guidelines
For tokenization, we developed a guideline with detailed rules for handling CMC-specific tokenization issues (Beißwenger et al., 2015a). It was tested and refined for a range of CMC and Web genres with the help of several student annotators in Berlin, Darmstadt, Dortmund and Erlangen.
For PoS tagging, we used the 'STTS IBK' tag set which had been defined as a result of discussions in the Empirikom network and at three workshops dedicated to the adaptation and extension of the canonical version of the Stuttgart-Tübingen-Tagset ('STTS 1.0'; Schiller et al., 1999) to the peculiarities of "non-standard" genres (Zinsmeister et al., 2013, 2014). STTS IBK introduces two types of new tags: (i) tags for phenomena that are specific to CMC and social media discourse, (ii) tags for phenomena that are typical for spontaneous (spoken or "conceptually oral") language in colloquial registers (cf. Tab. 2). These extensions are useful for corpus-based research on CMC as well as spoken conversation. STTS IBK is downward compatible with STTS 1.0 and therefore allows for interoperability with existing corpora and tools. In addition, the tag set extensions in STTS IBK are compatible with the STTS extensions defined at IDS Mannheim for PoS annotation (Westpfahl and Schmidt, 2013; Westpfahl, 2013). The tag set is described in an annotation guideline (Beißwenger et al., 2015b) and has been tested with data from several CMC genres in advance. The complete annotation guidelines (in German) as well as supplementary documentation are available online from the shared task Web site. For international participants, an English translation of the tagging guideline is also provided.

Annotation procedure
All data sets were manually tokenized and PoS tagged by multiple annotators, based on the official tokenization (Beißwenger et al., 2015a) and tagging guidelines (Schiller et al., 1999; Beißwenger et al., 2015b), see Sec. 2.2. Cases of disagreement were then adjudicated by the task organizers to produce the final gold standard. During the annotation of the training data, minor changes to the annotation guidelines were made based on experience from the adjudication procedure. In addition, various problematic cases were collected in a supplementary document available to the annotators.
The manual tokenization was carried out in a plain text editor, starting from whitespace-tokenized files in one-token-per-line format. Annotators were instructed to make no other changes to the files than inserting additional line breaks as token boundaries (except for a few special cases), but were allowed to mark unclear cases with comments. The tokenizations were compared and adjudicated using the kdiff3 utility. In the next step, manual tagging was partly carried out with the Web-based annotation platform CorA and partly with our own Web-based tool MiniMarker. In both cases annotators worked independently with separate password-protected accounts and were encouraged to document interesting or difficult phenomena in free-form comments. CorA has the advantage that tokenization errors can be corrected at the tagging stage, while MiniMarker enables annotators to look up how specific word forms are tagged in the TIGER treebank corpus in order to ensure consistent annotation. For adjudication of the PoS tagging, we pre-annotated unanimous annotator decisions and filled in the remaining disputed tags with MiniMarker. Agreement between annotators as well as the agreement of each annotator with the final gold standard was determined using the same evaluation metrics as for systems participating in the shared task (see Sec. 3.2).

CMC subset
In a preliminary study on the manual tokenization of CMC, we observed very high inter-annotator agreement with F1 scores ranging from 98.6% to 99.7%, showing that manual tokenization of such data provides a valid and reliable gold standard. For training and test data of the CMC subset, we therefore decided to pursue a "sequential double keying" approach. The initial tokenization was done at a very early stage of the task preparation; it was later double-checked and revised according to the final tokenization guidelines by a second expert annotator. PoS tags were added by two independent annotators. Tab. 3 shows the observed agreement between the annotators and the adjudicated gold standard in terms of accuracy (acc).
It is interesting to note that, for both annotators, agreement with the gold standard is much higher than agreement between the two annotators. One possible explanation is that each annotator had difficulties with specific types of phenomena. A look at the error classes confirms this assumption: for example, annotator FW tended to misclassify adverbs as intensifier particles (PTKIFG, n = 66) whereas annotator BT made this mistake only six times. On the other hand, BT misjudged more than twice as many adjectives (ADJA vs. ADJD) as FW.

Web corpora subset
The test data of the Web corpora subset were manually tokenized by five primary annotators, and then adjudicated in two phases by one of the task organizers. Tab. 4 shows pairwise agreement between annotators and the agreement of each annotator with the gold standard in terms of F1 scores for token boundaries. Agreement is very high between all pairs of annotators, indicating that the manual tokenization is reliable.
PoS tags were manually added by four independent annotators, based on the adjudicated tokenization. No further corrections of the tokenization were found to be necessary in this phase. Tab. 5 shows agreement between the annotators and the gold standard in terms of observed accuracy (acc). Due to the low probability of chance agreement (approx. 7.5%), there is no need to compute κ values or other adjusted scores. Agreement for the manual tagging is less satisfactory than for the tokenization. Major sources of disagreement were the newly introduced particle classes (in particular PTKIFG and PTKMA) as well as unintuitive or poorly defined category boundaries in the original STTS 1.0 tag set, in particular common nouns (NN) vs. proper nouns (NE) vs. foreign text (FM), and adverbs (ADV) vs. adverbial adjectives (ADJD). It is also noticeable that the training and experience of individual annotators played an important role: two annotators (AS and JM) agree fairly well with each other and with the adjudicated gold standard, while the other two annotators performed considerably worse.
Despite these issues, most errors and misinterpretations were caught by our adjudication of the four-fold annotation. A fifth independent tagging carried out by annotator SM at a later stage showed an agreement of acc = 95.90% with the final gold standard.
The training data of the Web corpora subset were manually tokenized by three independent annotators and tagged by five independent annotators, with adjudication by one of the task organizers after each stage. Agreement between annotators and the gold standard is similar to the test data.
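The remark above that chance-corrected scores are unnecessary for the tagging agreement can be checked with a short calculation. The following is a minimal sketch, not part of the original evaluation: it applies the usual chance correction κ = (p_o − p_e) / (1 − p_e), taking the approximate chance agreement of 7.5% and the accuracy of annotator SM (95.90%) from the text. The function name `kappa` is chosen here for illustration.

```python
def kappa(observed: float, chance: float) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    return (observed - chance) / (1.0 - chance)


# Values quoted in the text: acc = 95.90%, chance agreement approx. 7.5%.
adjusted = kappa(0.9590, 0.075)
print(f"observed = 0.9590, adjusted = {adjusted:.4f}")
```

With chance agreement this low, the adjusted value (about 0.956) differs from the raw accuracy by less than half a percentage point, which is consistent with reporting plain accuracy.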

Availability
All gold standard data sets, the specification of the extended STTS tag set and the guidelines for tokenization and PoS tagging have been published on the EmpiriST Web site and will remain available for use in future research. We used simple UTF-8 encoded text formats for both raw and annotated versions of the data. Annotated files are provided in one-token-per-line format with empty lines serving as posting or paragraph boundary markers. Corresponding PoS tags are given in an additional column separated from the token text by a single tab stop. Metadata for each posting or Web page are inserted as empty XML elements on separate lines. A small excerpt from one of the files is shown in Fig. 1.
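The file format described above is simple enough to read with a few lines of code. The following is a minimal sketch, assuming exactly the conventions stated in the text (one token per line, tab-separated tag, empty lines as boundaries, metadata as empty XML elements on their own lines). The element name `posting`, its `id` attribute and the sample content are made up for illustration and are not taken from the gold standard files.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Unit(NamedTuple):
    """One posting or paragraph with the metadata line that precedes it."""
    meta: Optional[str]
    tokens: List[Tuple[str, str]]  # (token text, PoS tag)


def read_annotated(lines: Iterable[str]) -> List[Unit]:
    units: List[Unit] = []
    meta: Optional[str] = None
    tokens: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        is_meta = ("\t" not in line and line.startswith("<")
                   and line.endswith("/>"))
        if line == "" or is_meta:
            # Empty lines and metadata lines both close the current unit.
            if tokens:
                units.append(Unit(meta, tokens))
                tokens = []
            if is_meta:
                # Metadata stays in effect until the next metadata line.
                meta = line
        else:
            # Token line: text and tag separated by a single tab stop.
            # A line without a tab raises ValueError here.
            token, tag = line.split("\t", 1)
            tokens.append((token, tag))
    if tokens:
        units.append(Unit(meta, tokens))
    return units


# Hypothetical sample; tags are illustrative only.
SAMPLE = (
    '<posting id="demo1"/>\n'
    "Hallo\tITJ\n"
    "!\t$.\n"
    "\n"
    '<posting id="demo2"/>\n'
    "suuuuuper\tADJD\n"
    ":-)\tEMOASC\n"
)
units = read_annotated(SAMPLE.splitlines())
```

Keeping the raw metadata line as a string avoids assumptions about which attributes the real files carry; an XML parser could be applied to it afterwards if needed.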
Apart from the actual contents, the EmpiriST 2015 data package comes with a description of the tag set, evaluation scripts and licensing information.
The preparation stage started with the release of the annotation guidelines together with roughly 2,000 tokens of trial data from each subset in October 2015. The trial data were intended to illustrate the required input and output file formats and to give an impression of the specific characteristics of the CMC and Web texts to be processed. They were based on preliminary versions of the guidelines and were produced without multiple annotation. Participants were therefore instructed not to rely on them for training the final systems. During the preparation stage, there was also a fruitful dialogue between interested parties and the shared task organizers, leading to clarifications and corrections of the guidelines.
The second stage was dedicated to the training and adaptation of the competing systems. It started with the release of the complete training data on the shared task Web site in December 2015. The registration deadline fell within this stage, enabling participants to make an initial assessment of their performance before registering.
The evaluation stage was divided into two consecutive phases so that (i) tokenization and tagging quality could be evaluated separately and (ii) the same test data could be used for both subtasks. In each phase, unannotated test data were released via the shared task Web site; participants then had to submit their system output within five days by e-mail. For the tokenization phase, raw texts were released, padded with additional filler data in order to prevent tuning of systems to the test data before the second phase. For the tagging phase, manually tokenized versions of the texts were released. The two phases took place in two consecutive weeks in February 2016.

Evaluation metrics
Evaluation of the submissions to EmpiriST 2015 was carried out by the task organizers. Following Jurish and Würzner (2013), results for the tokenization task were evaluated based on the unweighted harmonic average ($F_1$) between precision ($pr$) and recall ($rc$) of the token boundaries in the participants' submissions. Formally, let $B_{\mathrm{retrieved}}$ be the set of token boundaries predicted by the tokenization procedure to be evaluated and $B_{\mathrm{relevant}}$ those present in the gold standard; then:
$$pr = \frac{|B_{\mathrm{relevant}} \cap B_{\mathrm{retrieved}}|}{|B_{\mathrm{retrieved}}|} \tag{1}$$
$$rc = \frac{|B_{\mathrm{relevant}} \cap B_{\mathrm{retrieved}}|}{|B_{\mathrm{relevant}}|} \tag{2}$$
$$F_1 = \frac{2 \cdot pr \cdot rc}{pr + rc} \tag{3}$$
For technical reasons, the trivial token boundary at the beginning of each text file is included in the evaluation, but not the boundary at its end.
Following Giesbrecht and Evert (2009), the PoS tagging task was evaluated in terms of the accuracy ($acc$) of the PoS tag assignments in the participants' submissions. Formally, let $n_{\mathrm{correct}}$ be the number of tokens whose tags agree with the gold standard, and $n_{\mathrm{total}}$ the total number of tokens in the data set; then:
$$acc = \frac{n_{\mathrm{correct}}}{n_{\mathrm{total}}} \tag{4}$$
In order to support participants in development and self-evaluation of their submissions, both evaluation metrics were implemented as Perl scripts by the organizers and published together with the training and test data sets.
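The two metrics defined above can be illustrated with a short Python sketch. This is an independent illustration, not a port of the organizers' Perl scripts. It assumes that tokens contain no whitespace and that both tokenizations cover the same character sequence, and it represents a token boundary as the offset of a token's first character in the whitespace-free text, so that the boundary at offset 0 is counted and the one at the end of the text is not. All function names are chosen here for illustration.

```python
from typing import Sequence, Set, Tuple


def boundaries(tokens: Sequence[str]) -> Set[int]:
    """Start offsets of all tokens in the whitespace-free character stream."""
    offsets: Set[int] = set()
    pos = 0
    for tok in tokens:
        offsets.add(pos)  # includes the trivial boundary at offset 0
        pos += len(tok)
    return offsets        # the final offset (end of text) is never added


def tokenization_scores(gold: Sequence[str],
                        system: Sequence[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over token boundaries."""
    if not gold or "".join(gold) != "".join(system):
        raise ValueError("tokenizations must cover the same non-empty text")
    relevant, retrieved = boundaries(gold), boundaries(system)
    hits = len(relevant & retrieved)
    pr = hits / len(retrieved)
    rc = hits / len(relevant)
    return pr, rc, 2 * pr * rc / (pr + rc)


def accuracy(gold_tags: Sequence[str], system_tags: Sequence[str]) -> float:
    """Proportion of tokens whose tag agrees with the gold standard."""
    if not gold_tags or len(gold_tags) != len(system_tags):
        raise ValueError("tag sequences must be non-empty and equally long")
    correct = sum(g == s for g, s in zip(gold_tags, system_tags))
    return correct / len(gold_tags)


# A system that fails to split off the clitic in "geht's" misses one boundary.
pr, rc, f1 = tokenization_scores(["geht", "'s", "?"], ["geht's", "?"])
print(pr, rc, f1)
```

In this example the system predicts two of the three gold boundaries and no spurious ones, so precision is 1, recall is 2/3 and F1 is 0.8.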

Participating systems
Tab. 6 gives an overview of the participating teams and systems. Team UdS submitted three related systems (UdS-distributional, UdS-retrain, UdS-surface). In addition, each system was permitted to submit up to three different runs, with only the best run being included in the task results.

Summary of competing approaches
As shown in Tab. 6, we had five submissions for the tokenization subtask, one of them non-competitive. All five systems employed rule-based tokenization approaches. Two of them (AIPHES and LTL-UDE) used a "split and merge" strategy that splits tokens into atomic units in the first pass. In subsequent passes, higher-order rules implement merging strategies for dealing with complex phenomena such as URLs, abbreviations or emoticons. In contrast, COW used an "under segmentation" strategy protecting certain token sequences in the first pass and further segmenting them in a second. SoMaJo used complex, cascaded regular expressions successively dealing with the aforementioned classes of phenomena. All approaches made use of additional lists of abbreviations, proper names, emoticons, etc. in order to improve correct tokenization of special characters and punctuation.
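The "split and merge" strategy mentioned above can be sketched in a few lines. This is a toy illustration of the general idea only and does not reproduce the AIPHES or LTL-UDE systems: the atom definition, the three merge patterns and the tiny abbreviation list are all assumptions made for this example.

```python
import re
from typing import List, Tuple

# Pass 1: atomic units are runs of word characters or single other symbols.
ATOM = re.compile(r"\w+|[^\w\s]")

# Pass 2: patterns for sequences of atoms that should form one token.
MERGE_PATTERNS = [
    re.compile(r"https?://\S+"),         # URLs
    re.compile(r"[:;]-?[)(DP]"),         # a few ASCII emoticons
    re.compile(r"z\.B\.|d\.h\.|usw\."),  # tiny abbreviation list
]


def split_atoms(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character spans of all atomic units."""
    return [m.span() for m in ATOM.finditer(text)]


def merge_atoms(text: str, atoms: List[Tuple[int, int]]) -> List[str]:
    """Merge adjacent atoms whenever a pattern covers them completely."""
    atom_ends = {end for _, end in atoms}
    tokens: List[str] = []
    i = 0
    while i < len(atoms):
        start, end = atoms[i]
        for pattern in MERGE_PATTERNS:
            m = pattern.match(text, start)
            # Accept only matches that stop at an atom boundary, so that
            # a merge never cuts an atom in half (":Danke" stays split).
            if m and m.end() > end and m.end() in atom_ends:
                end = m.end()
        tokens.append(text[start:end])
        while i < len(atoms) and atoms[i][0] < end:
            i += 1  # skip atoms swallowed by the merge
    return tokens


def tokenize(text: str) -> List[str]:
    return merge_atoms(text, split_atoms(text))


print(tokenize("Schau mal: http://example.org :-) z.B. hier!"))
```

Splitting aggressively first and merging afterwards keeps each rule local: a merge rule only has to recognize its own phenomenon, whereas an "under segmentation" approach has to protect such sequences before any splitting takes place.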
We had six submissions for the PoS tagging subtask, two of them non-competitive. Of the four regular submissions, one (bot.zen) was sent in after the submission deadline and is thus not included in the official ranking. In contrast to tokenization, all systems competing in the PoS tagging subtask made use of statistical models specially trained or re-trained for the purpose of EmpiriST 2015. The types of models employed reflect all state-of-the-art approaches to the task of PoS tagging. All approaches have in common that they extend the EmpiriST training data with additional corpora and linguistic resources.
The three UdS systems built on a classical hidden Markov model (HMM; Rabiner, 1989). In addition, they focused on improvements in the analysis of out-of-vocabulary (OOV) words by adding domain-specific training material and a list of likely PoS tags for OOV items. LTL-UDE and AIPHES used conditional random fields (CRF; Lafferty et al., 2001). Both systems differed in the selection of features and the additional resources used in the training process. Team bot.zen employed a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) recurrent neural network in combination with neural word embeddings as input representations (Mikolov et al., 2013).

Results
In order to put the performance of the shared task submissions into perspective, we also evaluated several widely used off-the-shelf tools as baselines:
• the WASTE tokenizer (Jurish and Würzner, 2013); 12
• TreeTagger v3.2 (Schmid, 1995); 13
• Stanford tagger v3.6.0 (Toutanova et al., 2003); 14
• the COW pipeline (Schäfer and Bildhauer, 2012; Schäfer, 2015).
[...] adaptations to account for the tokenization principles and extended tag set of EmpiriST. It may therefore be more appropriate to compare COW with the baseline systems than with the other task participants.
12 We used WASTE as shipped with the moot package (v2.0.13, http://kaskade.dwds.de/waste/) and trained a model solely using the EmpiriST training data.
13 We used the German UTF-8 parameter file downloaded from http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/ on 21 June 2016.
14 We used the german-dewac parameter file from the distribution released on 9 Dec 2015. Substantial automatic and manual post-editing was required to undo character transformations made by the tokenizer, replace non-STTS tags (e.g. $[ instead of $(), and account for the systematic mistagging of parentheses and brackets as TRUNC.
Tab. 7 (tokenization) and Tab. 8 (PoS tagging) show the results obtained by all task participants and baseline systems on the CMC and Web corpora subsets. Within each subset, results are micro-averaged across the text samples. The overall score is the macro-average over both subsets, ensuring that CMC and Web corpora carry the same weight. For systems that submitted multiple runs, only the best run is shown in the table (indicated by a subscript appended to the team name). The official ranking ("podium") includes only competitive and timely submissions. Since team UdS entered three closely related systems into the competition, only one of them was selected for the official podium. Detailed results for individual runs and text samples are available on the EmpiriST Web page.
Since the existing off-the-shelf taggers used as a baseline are not aware of the new PoS tags in STTS IBK, the evaluation was carried out both at the level of STTS IBK and at the level of the established STTS 1.0 tag set (Schiller et al., 1999). For this purpose, one or more alternative STTS 1.0 tags were also accepted for each extended tag in the gold standard. The precise mapping rules are specified in Tab. 9. The official ranking is always based on the full STTS IBK tag set.
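The scoring scheme described above (micro-averaging within a subset, macro-averaging across the two subsets, and optional acceptance of alternative STTS 1.0 tags) can be sketched as follows. This is an illustrative sketch, not the official evaluation script. In particular, the `ALTERNATIVES` mapping is hypothetical and only stands in for the rules of Tab. 9, and the tag pairs are made-up data.

```python
from typing import Dict, List, Optional, Set, Tuple

TagPair = Tuple[str, str]  # (gold tag, system tag)

# Hypothetical mapping for illustration; the real rules are given in Tab. 9.
ALTERNATIVES: Dict[str, Set[str]] = {"PTKIFG": {"ADV"}, "PTKMA": {"ADV"}}


def tag_matches(gold: str, system: str,
                alternatives: Optional[Dict[str, Set[str]]] = None) -> bool:
    """Exact match, or an accepted alternative tag if a mapping is given."""
    if gold == system:
        return True
    return alternatives is not None and system in alternatives.get(gold, set())


def micro_average(samples: List[List[TagPair]],
                  alternatives: Optional[Dict[str, Set[str]]] = None) -> float:
    """Pool all tokens of a subset, so longer samples weigh more."""
    correct = sum(tag_matches(g, s, alternatives)
                  for sample in samples for g, s in sample)
    total = sum(len(sample) for sample in samples)
    return correct / total


def overall_score(cmc: List[List[TagPair]], web: List[List[TagPair]],
                  alternatives: Optional[Dict[str, Set[str]]] = None) -> float:
    """Macro-average: both subsets carry the same weight."""
    return (micro_average(cmc, alternatives)
            + micro_average(web, alternatives)) / 2


# Made-up data: two CMC samples and one Web sample.
cmc = [[("PTKIFG", "ADV"), ("NN", "NN")],
       [("ITJ", "ITJ"), ("NN", "NE"), ("$.", "$."), ("ADV", "ADV")]]
web = [[("NN", "NN"), ("VVFIN", "VVFIN")]]

strict = overall_score(cmc, web)                 # full extended tag set
relaxed = overall_score(cmc, web, ALTERNATIVES)  # alternatives accepted
print(strict, relaxed)
```

In the made-up data, the strict score is (4/6 + 1)/2 = 5/6, and accepting ADV for PTKIFG raises it to (5/6 + 1)/2 = 11/12, which mirrors how a tagger unaware of the extended tags is credited at the STTS 1.0 level.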

Conclusion
The systems submitted to the EmpiriST 2015 shared task have improved the state of the art for tokenization and PoS tagging of CMC and Web corpora. The best submitted tokenizer achieved an F1-score of 99.54% (vs. 98.47% baseline) for the CMC data set and an F1-score of 99.77% (vs. 99.42% baseline) for the Web corpora data set. For PoS tagging, the results are still far from optimal. Nevertheless, the improvement over baseline systems is striking, especially for the CMC subset: the best submitted tagger achieved an accuracy of 87.33% evaluated against STTS IBK (vs. 77.89% baseline), and an accuracy of 90.28% against STTS 1.0 (vs. 81.51% baseline). For the Web corpora subset, where the baseline systems already perform much better than on gen-
Further evaluation of the results in future work should include a close examination and discussion of the performance of the tagger models with respect to the tag set extensions defined in STTS IBK, as well as their performance on different genres and text sources. This will be the topic of a round table organized at the 3rd NLP4CMC workshop at KONVENS 2016. 17
The results of the shared task can be considered a promising step towards better NLP tools for German CMC data, especially since all participants (except for UdS) have made their systems available to the community as open-source software. However, the adaptation of NLP tools to the linguistic peculiarities of CMC discourse, especially for PoS tagging, is still a challenging task. The resources developed for EmpiriST 2015 (gold standard and annotation guidelines) will remain available on the task Web site under a Creative Commons licence. 18 We hope that they will stimulate further advances in adapting NLP technologies to CMC discourse as well as in improving the annotation quality of German Web corpora.
17 https://sites.google.com/site/nlp4cmc2016/
18 https://sites.google.com/site/