EmpiriST: AIPHES - Robust Tokenization and POS-Tagging for Different Genres

We present our system used for the AIPHES team submission in the context of the EmpiriST shared task on “Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media”. Our system is based on a rule-based tokenizer and a machine learning sequence labelling POS tagger using a variety of features. We show that the system is robust across the two tested genres: German computer-mediated communication (CMC) and general German web data (WEB). We achieve the second rank in three of four scenarios. Also, the presented systems are freely available as open source components.


Introduction
Tokenization and part-of-speech (POS) tagging are considered core tasks in a standard Natural Language Processing (NLP) pipeline. NLP tasks, such as summarization, information extraction, event detection, machine translation, and many others, are typically based on machine learning algorithms which use the outcome of lower level NLP tasks, such as tokens or intermediate linguistic phenomena including parts-of-speech or grammatical relations, as features. Though tokenization and part-of-speech tagging are considered simple tasks, it is highly important to achieve high-quality results, as errors propagate to downstream applications, where they are hard to repair and may cause notable consequential errors. Thus, a major goal is to minimize the propagation of errors by using methods that perform as accurately as possible in lower level tasks on a diversity of texts and genres.
In this paper we present a simple, yet flexible and universally applicable system for tokenization and POS tagging of German text. Our system participated in the EmpiriST shared task on "Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media" (Beißwenger et al., 2016). For this task, we applied our solution to texts from two different genres: a) general, HTML-stripped web data and b) colloquial language from social media texts.
The paper is organized as follows: We first describe the shared task and related work in Section 2. Our systems for tokenization and POS tagging are laid out in Section 3 and evaluated in Section 4, which includes a detailed error analysis. Section 5 concludes.

Task Description & Related Work
The main goal of the GSCL Shared Task "Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media" was to encourage adaptation and development of language processing tools for German texts of computer-mediated communication genres. The shared task was divided into two subtasks, tokenization and POS tagging, which made use of an extended STTS-EmpiriST tag set. For both tasks, two data sets were provided for trial and training purposes.
• A computer-mediated communication data set (CMC) that included chat texts, tweets, blogs and Wikipedia talk pages.
• A Web data set (WEB) with various web text genres.
The training data set includes 5,109 (WEB) and 6,034 (CMC) manually annotated and expert-checked tokens. System submissions for the tasks were evaluated by the organizers on 7,800 (WEB) and 6,142 (CMC) tokens of blind test data.

Tokenization
Tokenization is usually the first step in an NLP system. Even systems that do not follow the classical NLP pipeline architecture still mostly operate on the basis of tokens, including unified architectures starting from scratch (Collobert et al., 2011). This is common, since tokens, either directly or indirectly, are usually considered to bear the information in a text. However, the importance of tokenization is often neglected, as simple methods like whitespace segmentation can at first sight yield acceptable accuracies for many languages (Webster and Kit, 1992). But errors in an early phase of an NLP pipeline can have severe effects on higher level tasks and influence their performance by a large margin.
Existing tokenizers can be organized into three categories: a) rule-based methods, b) supervised methods, c) unsupervised methods. Manning et al. (2014), for example, internally use JFlex, which is a meta language for rules based on regular expressions and procedures to execute when a rule matches. In contrast, Jurish and Würzner (2013) present a supervised system for joint tokenization and sentence splitting, which employs a Hidden Markov Model on character features for boundary detection. Kiss and Strunk (2006) introduce Punkt, an unsupervised model for sentence splitting and tokenization. They exploit the fact that most ambiguous token or sentence boundaries occur around punctuation characters, such as periods/full stops. Punkt finds collocations of characters before and after punctuation marks, assuming that these collocations are typical abbreviations, initials, or ordinal numbers, which can be maintained as a simple list of non-splittable tokens.
Automatically learned models, both supervised and unsupervised, are typically hard to debug, and their results might need post-cleaning, e.g. post-merging or splitting of common mistakes, because modifying learned models is usually not trivial: they need to be re-learned with different parameter settings or training data. However, it is important to be able to easily debug and change the outcome of the tokenization; hence, our goal is to implement a small and reasonable ruleset.

POS Tagging
Existing POS taggers for German primarily rely on the Stuttgart-Tübingen Tagset (STTS, Schiller et al. (1999)), which consists of 54 POS tags and distinguishes between eleven main parts of speech, which are further divided into various subcategories. The STTS tagset has become a de facto standard for German, as it is also used in major German treebanks, such as the Tiger treebank (Brants et al., 2004), called Tiger henceforth. Tiger consists of approx. 900,000 tokens of German newspaper text (taken from the Frankfurter Rundschau), and the POS annotations have been added semi-automatically. For this, the TnT tagger (Brants, 2000) was used, because it also outputs probabilities that can be used as confidence scores. Only POS tags with a low confidence score were checked for correctness by human annotators.
As the STTS tagset was developed on the basis of newspaper corpora, STTS only contains six POS tags that describe categories other than the standard grammatical word categories (e.g., non-words or punctuation marks). In contrast, the extended version of STTS used in the EmpiriST shared task contains 18 additional tags for elements that are specific to computer-mediated communication, for example, tags for emoticons, hashtags and URLs, or tags for phenomena which are typical of spoken language.
State-of-the-art POS taggers use supervised machine learning to train a model from corpora annotated with POS tags. While there are several ways to model POS tagging as a machine learning problem, casting it as a sequence labeling problem is a frequent approach, used already for the early TnT tagger by Brants (2000). In sequence tagging, the learning algorithm (e.g. Hidden Markov Models or Conditional Random Fields (CRFs)) determines the most likely tags over the whole sequence while taking interdependencies of tags into account, as opposed to a mere token-based classification.
Another annotation task that is a typical example of sequence labeling is named entity recognition. For example, the GermaNER toolkit (Benikova et al., 2015) uses CRFs for learning to tag named entities. GermaNER has been built in a modular fashion and is highly configurable, which allows users to easily train it with new data and feature sets; hence, we chose to build upon the GermaNER system for POS tagging in this shared task.

System Description
The systems we describe in the following subsections are available as open source components under the Apache v2 license. For tokenization, we have not attempted to create different variants for the two text genres of the shared task, but rather provide a robust generic solution, since we would not want to adapt subsequent processing steps when applying them to a different genre.

Tokenization
We present a rule-based tokenizer where the rules describe merging routines of two or more conservatively segmented tokens. Rules are defined in terms of a list of common non-splittable terms and simple regular expressions. The tokenizer is configured with a set of configuration files, which we call a ruleset. A ruleset can be easily adapted or changed depending on a particular language. In the following we present the tokenizer's configuration options and show selected toy examples.
The main building blocks of the tokenizer are the following:
Conservative splits: A base tokenizer provides the initial tokens that are refined in the next steps. We chose a robust tokenizer that operates on general Unicode character categories, i.e. a stream of characters is processed and for each character its general Unicode category is retrieved. Based on the transition from the current character's Unicode category to the next character's Unicode category, new token segments are created according to specified rules. More specifically, new token segments are created, among others, for transitions from empty space to non-empty space.
Merge rules: Since merge lists contain only fixed tokens that must match entirely and hence do not allow for modifications within tokens, we additionally maintain a list of merge rules which are specified as regular expressions. This is particularly important for expressions involving digits, such as date expressions, usernames, etc. Rules are processed in the order of their definition. Unfortunately, as with potentially every rule-based system, too many handwritten rules start to interfere and introduce unwanted behavior. This is especially true if rules are too general, i.e. they match more examples than they should. We balance this trade-off between rule complexity and rule interaction by introducing global and local reject rules, i.e. merge rules are rejected if and only if a reject rule also matches. The scope of these reject rules can be defined globally, matching tokens that should never be considered for merging, or locally, matching tokens that should not be considered for merging only if a particular merge rule matched. Multiple consecutive reject rules are possible. Listing 2 shows a snippet of the respective configuration file.
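The interplay of conservative splits, merge rules and reject rules can be illustrated with a minimal sketch. It is written in Python rather than the Java of the actual tokenizer, and several points are assumptions made only for illustration: the base split fires whenever the first letter of the general Unicode category changes, a merge candidate is a run of base tokens without intervening whitespace, candidates are tried longest first, and the ruleset in `RULESET` is hypothetical and not taken from Listing 2. Only the `+`, `-` and `#` notation follows the description of the configuration file.

```python
import re
import unicodedata

# Hypothetical ruleset in the '+' / '-' / '#' notation; the patterns are
# illustrative assumptions, not the rules shipped with the tokenizer.
RULESET = r"""
# global reject: never merge '@' or '#' with a bare number
- [@#]\d+
# dates such as 24.12.2015
+ \d{1,2}\.\d{1,2}\.\d{2,4}
# dotted abbreviations such as z.B.
+ ([^\W\d_]\.){2,}
# user names and hashtags
+ [@#]\w+
# hyphenated compounds such as E-Mail ...
+ \w+-\w+
# ... but not number ranges such as 10-12 (local reject)
- \d+-\d+
"""


def conservative_split(text):
    """Return (start, end) spans, split wherever the major Unicode category changes."""
    spans, start, prev = [], None, None
    for i, ch in enumerate(text):
        # Assumption: first letter of the general category (L, N, P, S, ...).
        cat = "Z" if ch.isspace() else unicodedata.category(ch)[0]
        if cat != prev:
            if start is not None:
                spans.append((start, i))
            start = None if cat == "Z" else i
        prev = cat
    if start is not None:
        spans.append((start, len(text)))
    return spans


def parse_ruleset(source):
    """Rejects before the first '+' rule are global, later ones are local."""
    global_rejects, rules = [], []
    for line in source.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        kind, pattern = line[0], re.compile(line[1:].strip())
        if kind == "+":
            rules.append((pattern, []))
        elif kind == "-" and rules:
            rules[-1][1].append(pattern)
        elif kind == "-":
            global_rejects.append(pattern)
    return global_rejects, rules


def accepts(candidate, global_rejects, rules):
    """Accept a merge if a merge rule matches and no applicable reject rule does."""
    if any(r.fullmatch(candidate) for r in global_rejects):
        return False
    for pattern, local_rejects in rules:  # tried in order of definition
        if pattern.fullmatch(candidate) and not any(
            r.fullmatch(candidate) for r in local_rejects
        ):
            return True
    return False


def tokenize(text, global_rejects, rules, max_run=8):
    spans, tokens, i = conservative_split(text), [], 0
    while i < len(spans):
        end = i + 1
        # Try the longest run of directly adjacent base tokens first.
        for j in range(min(len(spans), i + max_run), i + 1, -1):
            if any(spans[k][1] != spans[k + 1][0] for k in range(i, j - 1)):
                continue  # whitespace inside the run, not a merge candidate
            if accepts(text[spans[i][0]:spans[j - 1][1]], global_rejects, rules):
                end = j
                break
        tokens.append(text[spans[i][0]:spans[end - 1][1]])
        i = end
    return tokens


rejects, merge_rules = parse_ruleset(RULESET)
print(tokenize("@anna: Am 24.12.2015 z.B. E-Mail 10-12 #1", rejects, merge_rules))
```

With this hypothetical ruleset, the date, the abbreviation, the user name and the hyphenated compound are merged back into single tokens, whereas the number range is kept apart by the local reject rule and `#1` by the global one.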
The tokenizer is implemented in Java using the Java default regular expression engine. It was developed as part of the lt-segmenter and is provided as a branch.
Listing 2: Examples for merge rules defined as regular expressions. Merge rules are defined with an initial '+' at the beginning of the line, whereas reject rules are defined with an initial '-'. Global reject rules are defined before any positive rule, and comments begin with a # character. A description of each rule can be found as a comment before the actual rule.

POS Tagging
For POS tagging, we have adapted the GermaNER tool (Benikova et al., 2015), which is written in Java. GermaPOS (available at https://github.com/AIPHES/GermaPOS) is a fork of the software, adapting the framework for this purpose. As a machine learning algorithm, a CRF sequence tagger (Lafferty et al., 2001) is used; specifically, we employ the implementation provided by CRFsuite (Okazaki, 2007) as integrated in the ClearTK framework.
The architecture of GermaPOS is a highly extensible UIMA (Unstructured Information Management Architecture, https://uima.apache.org/) pipeline (Ferrucci and Lally, 2004), providing a simple interface both to training a new tagger based on user-provided training data and to running a pretrained model on simple text files. The pipeline first reads a tab-separated input file. In a subsequent step, feature extraction is performed per token, using additional information from external sources, e.g. word lists. Feature extraction can further take into account any surrounding context of the current token, e.g. time-shifted features of relative position −2, −1, 0, +1, +2. In training mode, a CRF model is then built on the basis of feature annotations; at runtime the model provides POS tags as UIMA annotations. An optional output step in the pipeline produces a POS-annotated file. Alternatively, the pipeline can be used within UIMA projects out of the box. We perform a post-hoc assignment of POS tags based on a subset of our mapping rules that cover EmpiriST-specific conventions. For example, a token emojiQsmilingFace will be assigned the tag EMOIMG, regardless of the output of the sequence tagger.
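The post-hoc assignment can be pictured as a small override step that runs after the sequence tagger. The following Python sketch is only an illustration under stated assumptions: the pairing of emojiQsmilingFace with EMOIMG comes from the text, whereas the regular expression `emojiQ\w+`, the table `OVERRIDES` and the function `apply_overrides` are hypothetical names that do not stem from GermaPOS.

```python
import re

# Hypothetical override table: only the emojiQ... -> EMOIMG pairing comes from
# the text; the regular expression itself is an assumption.
OVERRIDES = [
    (re.compile(r"emojiQ\w+"), "EMOIMG"),
]


def apply_overrides(tokens, predicted_tags, overrides=OVERRIDES):
    """Replace the tagger's prediction whenever an override pattern matches."""
    final = []
    for token, tag in zip(tokens, predicted_tags):
        for pattern, forced_tag in overrides:
            if pattern.fullmatch(token):
                tag = forced_tag
                break
        final.append(tag)
    return final
```

Keeping such conventions outside the learned model means that they can be changed without retraining, at the cost of ignoring the sequence context for the affected tokens.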
Features We adapt nearly the full feature set of GermaNER, with the exception of POS features. In the following list, we give a brief overview; a more detailed description can be found in Benikova et al. (2015).

1. Character n-grams: First and last character n-grams for n ∈ {1, 2, 3} of the current token, as well as time-shifted versions of this feature with offsets from −2 to 2, are extracted.
2. Gazetteers and word lists: We adapt most gazetteers from GermaNER, containing mostly named entities (NE). As we gained no performance increase from a higher coverage of NEs in our datasets through Freebase (Bollacker et al., 2008), we omit this resource in favor of a more lightweight system. In addition, we incorporate word lists. We employ a small list of English words (as these cover most occurrences of foreign language tags), as well as hand-crafted lists (partially compiled from Wikipedia and enriched by data from various internet sites, e.g. internetslang.com) of onomatopoeia, discourse markers, Internet abbreviations, intensity markers, and various types of particles.
3. Similar words: JoBimText (Biemann and Riedl, 2013) is used to obtain a distributional thesaurus (DT), from which the four most similar words for the current token are used. The underlying motivation is to be able to correctly tag infrequent or unseen targets by expanding them with a frequent similar term, most likely sharing the same part of speech.
4. Topic clusters: LDA topic modeling was applied to the DT defined above, resulting in a fixed number of topic clusters. For each token, and for the time-shifted context tokens, its topic index is extracted as a feature. We again build on existing work of GermaNER and use a precomputed set of 200 clusters.
5. Syntax: We use simple syntactic features, such as the word position and casing of tokens. We generalize the original GermaPOS setup to use arbitrary regular expressions as binary features. We then use all regular expressions designed for tokenization as features. This way, we also cover most casing information. Furthermore, we extract the character range of each token as a feature in case all characters fall into the same class: if all characters are from the same Unicode code block, this block is extracted as a feature. This feature makes it possible, for example, to capture Unicode emoticons that have not been specifically preprocessed as in the EmpiriST data.
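Two of the features above, the character n-grams with time shifts and the Unicode block feature, can be sketched as follows. This is a simplified Python illustration and not the GermaPOS feature extractors: the function names are hypothetical, `BLOCKS` is a tiny hand-picked subset of the Unicode block table (Python's standard library offers no block lookup), and shifting every feature by every offset is a simplification chosen for brevity.

```python
# Tiny illustrative subset of Unicode blocks: (first, last, name).
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x2600, 0x26FF, "Miscellaneous Symbols"),
    (0x1F600, 0x1F64F, "Emoticons"),
]


def unicode_block(token):
    """Return the block name if all characters share one block, else None."""
    names = set()
    for ch in token:
        names.add(next((n for lo, hi, n in BLOCKS if lo <= ord(ch) <= hi), "unknown"))
    return names.pop() if len(names) == 1 else None


def token_features(token):
    """First and last character n-grams for n in {1, 2, 3}, plus the block."""
    feats = {}
    for n in (1, 2, 3):
        feats[f"prefix{n}"] = token[:n]
        feats[f"suffix{n}"] = token[-n:]
    block = unicode_block(token)
    if block is not None:
        feats["block"] = block
    return feats


def sentence_features(tokens, offsets=(-2, -1, 0, 1, 2)):
    """Attach the features of neighbouring tokens as time-shifted copies."""
    base = [token_features(t) for t in tokens]
    rows = []
    for i in range(len(tokens)):
        row = {}
        for off in offsets:
            j = i + off
            if 0 <= j < len(tokens):
                for name, value in base[j].items():
                    row[f"{name}[{off:+d}]"] = value
        rows.append(row)
    return rows
```

For a sentence such as `["Das", "ist", "\U0001F600"]`, each row is a dictionary of string-valued features that could be handed to a CRF toolkit; the emoticon token receives the block value Emoticons without any emoticon-specific preprocessing.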
Training In the context of the EmpiriST shared task, we train separate models for the CMC and WEB datasets. As the training data is comparatively small for the purpose of POS tagging, we add the Tiger dataset to the respective training sets. The Tiger corpus is annotated using the standard STTS tagset, whereas the task at hand provided an extended tagset. In order to make learning from Tiger feasible, we have manually converted the Tiger data to the extended tagset using a set of simple rules, which aim at covering most of the easy cases.
As with GermaNER, the selection of resources and software components was done in favor of choosing a permissive license rather than focusing on system performance. Although it is plausible to improve POS tagging performance by integrating high-quality resources, we have opted to release GermaPOS with only free components, i.e. those already employed in GermaNER as well as manual additions not encumbered with restrictive usage rights. Where applicable, the system can be customized to utilize additional resources. A possible extension is the integration of another third-party POS tagger to be utilized as a feature.
Usage GermaPOS is provided as a runnable jar file with a pre-bundled model trained on the data described above. The training format is equivalent to the EmpiriST training data: a tab-separated file with one token-tag pair per line and sentences separated by an empty line.
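The training format described above is simple enough to read with a few lines of code. The sketch below is a Python illustration, not part of GermaPOS; the function name `read_tagged_sentences` is hypothetical, and the sketch assumes exactly one tab per non-empty line.

```python
def read_tagged_sentences(lines):
    """Group tab-separated token/tag lines into sentences split at empty lines."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # An empty line closes the current sentence, if there is one.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split("\t")
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences
```

Passing an open file object works as well as a list of strings, since both are iterated line by line.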

Evaluation
Following the EmpiriST task setup, we evaluate our tokenizer by measuring precision P, recall R, and the F1 score as in Jurish and Würzner (2013). Precision denotes the proportion of correctly identified token boundaries over the total number of token boundaries proposed by our tokenizer, and recall denotes the proportion of correctly identified token boundaries over the total number of token boundaries in the gold standard. The F1 score is the harmonic mean of precision and recall.
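The boundary-based measures can be made concrete with a short sketch. It is a Python illustration under the assumption that a token boundary is represented by the end offset of a token in the whitespace-free character stream; the official evaluation script may represent boundaries differently, and the function names are hypothetical.

```python
def boundaries(tokens):
    """End offsets of all tokens over the concatenated, whitespace-free text."""
    ends, pos = set(), 0
    for tok in tokens:
        pos += len(tok)
        ends.add(pos)
    return ends


def prf(gold_tokens, system_tokens):
    """Precision, recall and F1 over token boundaries."""
    gold, system = boundaries(gold_tokens), boundaries(system_tokens)
    correct = len(gold & system)
    p = correct / len(system) if system else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

If the gold standard keeps `z.B.` as one token while a system splits it into four, all gold boundaries are still found, so recall stays perfect and only precision drops; the F1 score balances the two.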
For our POS tagger, we report the tagging accuracy. That is, we measure the fraction of correct tag guesses over the total number of tokens to tag. To enable a comparison of our tagger's results with previous work on German, we additionally use the STTS mapping provided by the shared task organizers and measure the tagging accuracy using the mapped tags.
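Tagging accuracy with and without a tag mapping can be sketched in the same spirit. The Python function below is an illustration only; the `mapping` argument stands in for the STTS mapping provided by the organizers, whose actual entries are not reproduced here, so any concrete mapping passed in is the caller's assumption.

```python
def accuracy(gold_tags, predicted_tags, mapping=None):
    """Fraction of tokens whose predicted tag equals the gold tag.

    If a mapping is given, both sides are mapped first (unmapped tags stay
    unchanged), which mimics scoring on a coarser tagset.
    """
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences differ in length")
    mapping = mapping or {}
    correct = sum(
        mapping.get(g, g) == mapping.get(p, p)
        for g, p in zip(gold_tags, predicted_tags)
    )
    return correct / len(gold_tags) if gold_tags else 0.0
```

Because several extended tags may collapse onto one tag under a mapping, accuracy on mapped tags can only stay equal or increase relative to accuracy on the extended tagset.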
Below, we first discuss our results according to these standardized metrics and then conduct a careful analysis of the most prominent errors of our tools.

Results
We present results according to the task's evaluation. Table 1 shows the results for the tokenization task for the two datasets CMC and WEB. Without adapting the rules for the particular subtasks, we achieved good performance on both sets and ranked second in both categories.
The results for the POS tagging task are shown in Table 2. We achieve clearly better results on the WEB dataset (second best results) than on the CMC dataset. One possible reason for this is the distribution of the new POS tag labels in the test set: as can be seen in Table 3, the CMC data make more use of the new labels. Another reason might be the adaptation of our system to the text style, which is dominated by the much larger Tiger training set.

Common Errors
We identified three main sources of tokenization errors. The examples in the following show the gold tokenization on the left and the system tokens on the right; errors are marked with an asterisk.
1. Rules are underspecified, which means that certain rules were not specified or the lookahead list did not contain the particular abbreviation. Also, note that we deviated from the annotation guidelines and did not perform token splitting at camel case boundaries.
Examples:

POS tagging error analysis We have performed a post-hoc error analysis on the EmpiriST data. Table 4 shows a confusion matrix regarding classes of POS tags by their prefix (first character). Note that this matrix only lists tagging errors, so that the diagonal of the matrix denotes incorrect tagging within the same prefix class. It can be seen that the majority of errors happen within these classes, such as N*. The most common tagging error is in fact the confusion of NE and NN.
[Table 4: confusion matrix of tagging errors by POS tag prefix class: $*, AD*, AKW*, AP*, AR*, EML*, EMO*, N*, P*, PPER*, PT*, V*]
We define a number of error classes to better quantify the types of errors introduced by our tagger. For this, we construct an ordered list from which we select the first item that applies as the error class:
other: if none of the other criteria apply
We then annotate the first 160 errors from the CMC test set with their respective error classes. The results are shown in Table 5. It can be observed that most errors are related to nouns or named entities, which the tagger commonly confuses. For CMC data, a very common source of error is nouns written in lower case, which are generally assigned a completely different POS. As we have trained our tagger on a standard STTS-annotated corpus (with minimal postprocessing), some errors also stem from not capturing the new rules introduced by the extended EmpiriST tagset. There are also a few errors resulting from unknown foreign language words or emoticons not captured by our regular expressions, but these account for only a tiny percentage of errors.

Conclusion
We have presented our submission to the EmpiriST shared task on "Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media", comprising a rule-based tokenizer and a machine-learning-based POS tagger. Overall, we achieved a very good, but not the best, performance amongst the participating systems, ranking second throughout except for CMC POS tagging with the extended tagset. Our submission was aimed at robustness; we have not tuned our tokenizer per genre, and show good POS tagging performance throughout. Both systems are freely available as open source under a permissive license.