An In-depth Analysis of the Effect of Lexical Normalization on the Dependency Parsing of Social Media

Existing natural language processing systems have often been designed with standard texts in mind. However, when these tools are used on the substantially different texts from social media, their performance drops dramatically. One solution is to translate social media data to standard language before processing; this is also called normalization. It is well known that this improves performance for many natural language processing tasks on social media data. However, little is known about which types of normalization replacements have the most effect. Furthermore, it is unknown what the weaknesses of existing lexical normalization systems are in an extrinsic setting. In this paper, we analyze the effect of manual as well as automatic lexical normalization for dependency parsing. From our analysis, we conclude that for most categories, automatic normalization scores close to manually annotated normalization, and that small annotation differences are important to take into consideration when exploiting normalization in a pipeline setup.


Introduction
It is well known that many traditional natural language processing systems are focused on standard texts, and that their performance drops when used on another domain. This is also called the problem of domain adaptation. Recently, much focus has been on the notoriously noisy domain of social media. The hasty and informal nature of communication on social media results in highly nonstandard texts, including a variety of phenomena not seen in standard texts, like phrasal abbreviations, slang, typos, lengthening, etc. One approach to adapt natural language processing tools to the social media domain is to 'translate' input to standard text before processing it; this is also referred to as normalization. In this approach, the input data is made more similar to the type of data the tool is expecting. Previous work has shown that normalization improves performance on social media data for tasks like POS tagging, parsing, lemmatization and named entity tagging (Baldwin and Li, 2015; Schulz et al., 2016; Zhang et al., 2013). However, it often remains unknown which types of replacements are most influential and which types of replacements still have potential to improve the usefulness of an automatic normalization system. Baldwin and Li (2015) already investigated this effect in detail. They evaluate the effect of manual normalization beyond the word level (including insertion and deletion of words). To the best of our knowledge, no automatic systems are available to obtain such a normalization, which is why Baldwin and Li (2015) focused only on the theoretical effect (i.e. manually annotated normalization). In this work, we will instead focus on lexical normalization, which is normalization on the word level. For this task, publicly available datasets and automatic systems exist (Han and Baldwin, 2011).
Recently, multiple English social media treebanks were released (Blodgett et al., 2018; Liu et al., 2018) in Universal Dependencies format (Nivre et al., 2017), as well as novel categorizations of phenomena occurring in lexical normalization. In this work, we combine these two types of annotation into one dataset, which allows us to evaluate not only the theoretical effect of lexical normalization for dependency parsing, but also a real-world situation with automatic normalization.
The main contributions of this paper are:
• We add a layer of annotation to a social media treebank to also include normalization categories.
• We analyze the theoretical effect of lexical normalization for dependency parsing by using manually annotated normalization.
• We analyze the effect of an automatic lexical normalization model for dependency parsing, thereby showing which type of replacements still require attention.

Data
In this section we briefly discuss our choices for datasets and annotation formats, starting with the treebank data, followed by the annotation of lexical normalization categories and automatic normalization. See Figure 1 for a fully annotated example instance from our development data.

Treebank
In 2018, three research groups simultaneously annotated dependency trees in the Universal Dependencies format on tweets: Liu et al. (2018) focused on training a better parser by using an ensemble strategy, Blodgett et al. (2018) improved a dependency parser by using several adaptation methods, whereas van der Goot and van Noord (2018) focused on the use of normalization. Because the treebank created by van der Goot and van Noord (2018) is already annotated for lexical normalization, we will use this treebank. The data from the treebank is taken from Li and Liu (2015), where van der Goot and van Noord (2018) only kept the tweets that were still available at the time of writing. The data from Li and Liu (2015) was in turn taken from two different sources: the LexNorm dataset (Han and Baldwin, 2011), originally annotated with lexical normalization, and the dataset by Owoputi et al. (2013), originally annotated with POS tags. Li and Liu (2015) complemented this annotation so that both sets contain normalization as well as POS tags, to which van der Goot and van Noord (2018) added Universal Dependencies structures. Following van der Goot and van Noord (2018), we use the English Web Treebank (Silveira et al., 2014) for training, and Owoputi (development data) for the analysis. The test split is not used in this work, since our aim is not to improve the parser.

Normalization Categories
We choose to use the taxonomy of van der Goot et al. (2018) for three main reasons: 1) to the best of our knowledge, this is the most detailed categorization for lexical normalization; 2) annotation for the same source data as the treebanks is available from Reijngoud (2019); 3) systems are available to automatically perform this type of normalization, as opposed to the taxonomy used by Baldwin and Li (2015). The existing annotation is edited to fit the treebank tokenization: if a word is split in the treebank, the normalization is split accordingly, and both resulting words are annotated with the same category. Reijngoud (2019) added one category to the taxonomy: informal contractions, which includes splitting of words like 'gonna' and 'wanna'. The frequencies of the categories in the development data are shown in Table 1. The 'split', 'merge' and 'phrasal abbreviations' categories are very infrequent because the original annotation only included 1-1 replacements; these categories were added when transforming the annotation to the treebank tokenization.
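The adaptation of the annotation to the treebank tokenization can be illustrated with a minimal sketch. It assumes that the split of both the original word and its normalization is given; the `Annotation` type, the `split_annotation` helper and the 'gon' + 'na' split are hypothetical names and examples for illustration, not the conversion script used for the dataset.

```python
from typing import List, NamedTuple


class Annotation(NamedTuple):
    word: str      # original (noisy) token
    norm: str      # normalized form
    category: str  # normalization category label


def split_annotation(ann: Annotation,
                     word_parts: List[str],
                     norm_parts: List[str]) -> List[Annotation]:
    """Split one annotated word into several treebank tokens.

    Every resulting token inherits the category of the original word.
    """
    if len(word_parts) != len(norm_parts):
        raise ValueError("word and normalization need the same number of parts")
    if "".join(word_parts) != ann.word:
        raise ValueError("word parts must concatenate to the original word")
    return [Annotation(w, n, ann.category)
            for w, n in zip(word_parts, norm_parts)]


# Assumed example: the treebank tokenization splits 'gonna' into 'gon' + 'na'.
gonna = Annotation("gonna", "going to", "informal contraction")
tokens = split_annotation(gonna, ["gon", "na"], ["going", "to"])
for token in tokens:
    print(token)
```

Both resulting tokens carry the 'informal contraction' label, so category frequencies are counted per treebank token rather than per original word.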

Automatic Lexical Normalization
We use the state-of-the-art model for lexical normalization: MoNoise (van der Goot, 2019), which is a modular normalization model consisting of two steps: candidate generation and candidate ranking. For the generation, the most important modules are a lookup list based on the training data, the Aspell spell-checker (http://aspell.net/) and word embeddings. For the ranking of candidates, features from the generation are complemented with n-gram probabilities and used as input to a random forest classifier, which predicts the confidence that a candidate is the correct replacement. We train MoNoise on data from Li and Liu (2014), because it is most similar in annotation style to our development and test sets. Performance on the normalization task is slightly lower compared to the reported results (error reduction rate (van der Goot, 2019) on the word level dropped from 60.61 to 45.38), because of differences in tokenization required for Universal Dependencies annotation. Also, the model clearly has issues with capitalization (see for example Figure 1), because capitalization is not corrected in the normalization training data.
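To make the two-step setup concrete, the following toy sketch generates candidates and then ranks them. It is not MoNoise: a single-edit generator over a tiny vocabulary stands in for Aspell, word embeddings are left out, and hand-set weights stand in for the random forest classifier. All data, weights and function names are invented for illustration.

```python
from collections import Counter, defaultdict

# Toy training pairs (noisy word, normalization); invented for illustration.
TRAIN_PAIRS = [("u", "you"), ("u", "you"), ("2", "to"), ("2", "two"),
               ("2", "to"), ("pls", "please")]
# Toy unigram counts, standing in for n-gram probabilities.
UNIGRAMS = Counter({"you": 50, "to": 80, "two": 10, "please": 5,
                    "see": 20, "sea": 3, "tomorrow": 4})


def build_lookup(pairs):
    """Lookup list: for each noisy word, count its observed normalizations."""
    lookup = defaultdict(Counter)
    for noisy, norm in pairs:
        lookup[noisy][norm] += 1
    return lookup


def edits1(word):
    """All strings within one deletion, substitution or insertion."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | replaces | inserts


def generate(word, lookup):
    """Step 1: each module proposes candidates and leaves a feature value."""
    cands = defaultdict(lambda: {"lookup": 0.0, "spell": 0.0, "orig": 0.0})
    cands[word]["orig"] = 1.0  # keeping the word unchanged is always an option
    seen = lookup.get(word, Counter())
    total = sum(seen.values())
    for norm, count in seen.items():
        cands[norm]["lookup"] = count / total
    for cand in edits1(word):
        if cand in UNIGRAMS:  # crude stand-in for a spell checker
            cands[cand]["spell"] = 1.0
    return dict(cands)


def rank(cands):
    """Step 2: score candidates; hand-set weights replace a trained classifier."""
    total = sum(UNIGRAMS.values())
    scored = []
    for cand, feats in cands.items():
        score = (2.0 * feats["lookup"] + 0.5 * feats["spell"]
                 + 0.3 * feats["orig"] + UNIGRAMS[cand] / total)
        scored.append((cand, score))
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def normalize(sentence, lookup):
    """Replace every word by its highest-ranked candidate."""
    return [rank(generate(word, lookup))[0][0] for word in sentence]


lookup = build_lookup(TRAIN_PAIRS)
print(normalize(["pls", "see", "u", "2"], lookup))
```

Because the original word is always among the candidates, the ranker also decides whether a word needs normalization at all; in this sketch, 'see' is kept as is even though 'sea' is generated as an alternative.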

Effect of Manual Normalization
We use the UUparser (de Lhoneux et al., 2017) for our experiments, with settings similar to those of van der Goot and van Noord (2018), including a heuristic to correctly parse a sentence starting with a retweet token or a username. All results reported in this paper are obtained with the official UD evaluation script and are the average of 10 runs with different random seeds for the parser. For both settings (manual/automatic) we inspected the LAS graphs as well as the UAS graphs, but because the UAS scores showed very similar trends, they are not reported here. The parser scores 52.56 LAS on the original input data, which improves to 57.83 when using the full gold normalization.
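As a reminder of what is being averaged, the sketch below computes UAS and LAS for one parser output and averages them over several runs. It is a simplification and not the official evaluation script: it assumes that gold and predicted trees share the same tokenization, and it compares labels after dropping any subtype following a colon, which is an assumption of this sketch. All function names are illustrative.

```python
from statistics import mean
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) of one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over all tokens."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and predicted arcs must be non-empty and aligned")
    uas_hits = 0
    las_hits = 0
    for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred):
        if g_head == p_head:
            uas_hits += 1  # correct head
            if g_rel.split(":")[0] == p_rel.split(":")[0]:
                las_hits += 1  # correct head and label
    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


def average_over_runs(gold: List[Arc], runs: List[List[Arc]]) -> Tuple[float, float]:
    """Average UAS and LAS over parser runs (e.g. different random seeds)."""
    scores = [attachment_scores(gold, pred) for pred in runs]
    return mean(s[0] for s in scores), mean(s[1] for s in scores)


# Toy three-token example with two runs; the arcs are invented.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
run_a = [(2, "nsubj"), (0, "root"), (2, "obl")]  # wrong label on token 3
run_b = [(2, "nsubj"), (0, "root"), (1, "obj")]  # wrong head on token 3
avg_uas, avg_las = average_over_runs(gold, [run_a, run_b])
print(f"UAS {avg_uas:.2f}  LAS {avg_las:.2f}")
```

For a corpus-level score, the arcs of all sentences would be concatenated before calling `attachment_scores`, so that every token counts equally.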
To evaluate the effect of each category, we measure performance twofold: in isolation, and in an ablation setting. For the isolation, we look at the difference between the baseline parser (without normalization) and a parser which only has access to normalization replacements of one category. For the ablation setting, we look at the loss when removing one category from the full model.
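The two settings differ only in which replacements the parser gets to see. A minimal sketch of how the parser input could be constructed in each setting is given below; it assumes 1-1 replacements and one category label per token, and the `Token` type, the function names and the example tweet are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str                # original token
    norm: str                # gold normalization (same as word if unchanged)
    category: Optional[str]  # normalization category, None if unchanged


def isolation_input(tokens: List[Token], category: str) -> List[str]:
    """Isolation: only replacements of the given category are applied."""
    return [t.norm if t.category == category else t.word for t in tokens]


def ablation_input(tokens: List[Token], category: str) -> List[str]:
    """Ablation: all replacements except those of the given category."""
    return [t.word if t.category == category else t.norm for t in tokens]


def isolation_gain(las_isolation: float, las_baseline: float) -> float:
    """Gain of one category over the parser without normalization."""
    return las_isolation - las_baseline


def ablation_loss(las_full: float, las_ablation: float) -> float:
    """Loss when removing one category from the full normalization."""
    return las_full - las_ablation


# Invented example tweet with two categories from the taxonomy.
tweet = [
    Token("u", "you", "other transformation"),
    Token("goin", "going", "shortening end"),
    Token("2", "to", "other transformation"),
    Token("the", "the", None),
    Token("store", "store", None),
]
print(isolation_input(tweet, "shortening end"))
print(ablation_input(tweet, "shortening end"))
```

The resulting token sequences are then parsed, and the LAS of each setting is compared to the baseline (isolation) or to the full normalization (ablation).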

Figure 3: The effect of the categories when using automatic normalization. On the right y-axis the performance of the normalization model on this category is plotted (recall). The 'Other' category shows the effect of normalization replacements that were not annotated (but are still replaced by MoNoise).
The results for each category with gold normalization are shown in Figure 2. From these results, it becomes clear that some categories have a much larger effect than others. Not surprisingly, there is a correlation visible with the frequencies (Table 1). The categories going beyond 1-1 normalization have very little effect, since they are very rare in this dataset. The most important category is 'other transformation'; this is mainly due to very frequent short words (e.g. 2 → to, u → you). Other important categories are 'shortening end' and 'regular transformations'. This can be explained by the fact that they repair suffixes, which often contain important syntactic clues.
It also becomes clear that differences in tokenization guidelines play a large role: one of the most frequent categories, 'missing apostrophe', does not seem to be useful for parsing. A manual inspection showed that this is because these words also occur in the training data in their non-normalized form (e.g. 'll → will), so normalizing them creates more diversity. For the same reason, informal contractions (e.g. wanna, gonna) also have a relatively small effect.
[...] normalization categories, which is 72% of the gain that can be achieved with gold normalization compared to the baseline setting (52.56). Similar to the previous section, we run an isolation as well as an ablation experiment. In this setting, we only allow the normalization to replace words that are annotated as the category under evaluation (for the ablation experiments the inverse).
The parser performance as well as the recall of the normalization model on each category are plotted in Figure 3. Results show that the 'other transformation' and 'slang' categories have the most room for improvement in LAS compared to gold normalization, even though they are not the worst categories with respect to the normalization performance. Furthermore, trends are rather similar to those for the gold normalization, even though there are differences in normalization performance. As expected from the gold normalization, the 'missing apostrophe' category is not helpful.
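The recall per category can be computed as in the following minimal sketch. It assumes one triple of gold normalization, predicted normalization and category for every token that needs normalization, and it counts a prediction as correct only on an exact, case-sensitive match; the function name and the example triples are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_per_category(items: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Fraction of gold replacements per category that the model reproduced.

    Each item is (gold normalization, predicted normalization, category)
    for a token that is normalized in the gold annotation.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold_norm, pred_norm, category in items:
        total[category] += 1
        if gold_norm == pred_norm:  # exact match, case-sensitive
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Invented example predictions.
examples = [
    ("you", "you", "other transformation"),
    ("to", "2", "other transformation"),   # model left the word unchanged
    ("going", "going", "shortening end"),
    ("friend", "homie", "slang"),          # model left the word unchanged
]
for category, recall in recall_per_category(examples).items():
    print(f"{category}: {recall:.2f}")
```

A high recall on a category does not by itself imply a large parsing gain, which is why recall and LAS are inspected side by side.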
Interestingly, the 'other' category, which includes normalization replacements that were not annotated in the gold normalization, shows a small increase in performance. This category includes replacements like 'supp' → 'support' and 'da' → 'the', which were overlooked by the annotator. This could also be due to differences in the scope of annotation between the training data and development data.

Conclusion
We have introduced a novel annotation layer for an existing treebank with normalization annotation, which indicates which types of replacements are made. This allowed us to evaluate the effect of lexical normalization on the dependency parsing of tweets, both with manual normalization annotation and automatically predicted normalization. The automatic normalization obtained over 70% of the performance increase that could be obtained with gold normalization. The most influential categories were 'other transformation', which includes many replacements for very short words, and the categories with a high frequency that repair a word's suffix: 'shortening end' and 'regular transformation'. The categories which have the most potential for improvement in parser performance are the 'other transformation' and 'slang' categories. Furthermore, we saw that some predicted normalization replacements which were not annotated in the gold data also led to an increase in performance. Our results suggest that care should be taken when using out-of-the-box annotation, because differences in annotation and the scope of the normalization task (i.e. tokenization, missed normalization) could lead to sub-optimal performance.
The dataset and code for the analysis are available at: https://bitbucket.org/robvanderg/taxeval/.

Acknowledgments
I would like to thank Wessel Reijngoud for providing the annotation of the normalization categories and Gertjan van Noord and the anonymous reviewers for their feedback.