DaN+: Danish Nested Named Entities and Lexical Normalization

This paper introduces DaN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization to support research on cross-lingual cross-domain learning for a less-resourced language. We empirically assess three strategies to model the two-layer Named Entity Recognition (NER) task. We compare transfer capabilities from German versus in-language annotation from scratch. We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER. Our results show that 1) the most robust strategy is multi-task learning which is rivaled by multi-label decoding, 2) BERT-based NER models are sensitive to domain shifts, and 3) in-language BERT and lexical normalization are the most beneficial on the least canonical data. Our results also show that an out-of-domain setup remains challenging, while performance on news plateaus quickly. This highlights the importance of cross-domain evaluation of cross-lingual transfer.


Annotation
We opted for a two-level NER annotation scheme, largely following the annotation scheme provided by NoSTA-D. First-level annotations contain the outermost entities (e.g., the company 'Maribo Frø'). Second-level annotations are sub-entities (e.g., the location 'Maribo'). Four annotators were involved, three of whom are native Danish speakers and one of whom is proficient in Danish. For each task, a native speaker annotated the entire dataset after initial training.
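As an illustration of the scheme, the running example can be encoded as two parallel BIO label sequences; the snippet below is our own sketch, not the released data format:

# Two-layer BIO encoding of the running example (illustration only).
tokens = ["Maribo", "Frø"]
layer1 = ["B-ORG", "I-ORG"]   # first level: the company 'Maribo Frø'
layer2 = ["B-LOC", "O"]       # second level: the nested location 'Maribo'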

Experimental Setup
For nested NER, we use BERT (Devlin et al., 2019) with fine-tuning implemented in MaChAmp (van der Goot et al., 2020). We evaluate three decoding strategies:
single task-merged: both annotation layers are merged into a single flat entity label.
multi-task: the encoder is shared and each layer of annotation has its own decoder.
multi-label: treats nested NER as a multi-label problem, where a label l_i is predicted if P(l_i | ·) ≥ τ (Bekoulis, 2019), as further illustrated in Figure 1.
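To make the multi-label decoding rule concrete, the following is a minimal sketch, assuming per-token logits and an independent sigmoid per label (the function and variable names are ours, purely for illustration):

import torch

def multilabel_decode(logits, label_set, tau=0.9):
    # logits: tensor of shape (num_tokens, num_labels).
    # Predict every label whose probability reaches the threshold tau.
    probs = torch.sigmoid(logits).tolist()
    return [
        [label_set[i] for i, p in enumerate(token_probs) if p >= tau]
        for token_probs in probs
    ]

With τ = 0.9 (the value tuned on the Danish news dev data), a token can receive zero, one, or several entity labels, which is what allows nested spans.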
We first evaluate all NER models on Danish, both within news and on the three out-of-domain (OOD) varieties. We further compare to transfer from German: 1) zero-shot transfer, fine-tuning only on German; and 2) fine-tuning on the union of the Danish and German data. We compare multilingual BERT (mlBERT) to Danish BERT (danishBERT). Even though both are trained on Danish data, for mlBERT this is Wikipedia data, whereas danishBERT is trained on Wikipedia, Common Crawl, Danish debate forums, and Danish subtitles.

For MaChAmp, we use the proposed default parameters (van der Goot et al., 2020), which have been shown to work well across tasks. We tune early stopping and τ on the Danish news dev data, and set τ = 0.9. We compare our final model to the boundary-aware model (Zheng et al., 2019), a state-of-the-art nested NER model which was also evaluated on GermEval 2014. We train it with bilingual Danish and German Polyglot embeddings obtained via Procrustes alignment (Conneau et al., 2018). For evaluation we use the official GermEval script with strict span-based F1 over both entity levels.

For normalization, we use MoNoise (van der Goot, 2019), which requires n-grams and word embeddings to reach good performance. We use a Wikipedia dump from 01-01-2020 and Twitter data collected throughout 2012 and 2018, filtered with the FastText language classifier (Joulin et al., 2017). Intrinsic normalization results are reported as capitalization-sensitive word-level accuracy over all words (including words which are not normalized). Because we have no external training data for normalization, we use a 10-fold setup over dev+test.

Figure 2 depicts the main results for nested NER on the dev set, while detailed results are given in Tables 4 and 5 in the appendix. First, we note that the model performs well within Danish newswire, reaching an F1 score in the 80s (left bars). However, we observe a domain shift, as performance drops to 38-67% on the three non-canonical social media datasets, with Twitter and Arto reaching the lowest scores. Our Danish training dataset is of modest size, hence the question arises whether existing German data is beneficial. The German multi-task model performs remarkably well on Danish news in the zero-shot setup with mlBERT, reaching an F1 of 76%. This can be explained by the closeness of the languages, the similarity of the annotations, and the large German training set (Table 1).
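The strict span-based metric used above can be sketched in a few lines. This is our simplified reading, not the official GermEval script: an entity counts as correct only if layer, label, and span all match exactly.

def strict_span_f1(gold, pred):
    # gold, pred: sets of (layer, label, start, end) tuples over the corpus;
    # an entity counts only if all four fields match exactly.
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)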

NER
For a model trained on the union of the German and Danish data (de+da), we observe that performance is overall close to that of the model trained on Danish only, even though the Danish data is five times smaller. The average F1 over all Danish datasets (News, Reddit, Twitter, Arto) for the two best models (using multi-task learning) is 65.01 for da.da.multitask and 65.05 for de+da.da.multitask. Interestingly, danishBERT is best for the least canonical domain (Arto), in contrast to mlBERT, which fares best on news. This is likely due to the forum data included in pre-training danishBERT, while mlBERT is based on Wikipedia data, which is a poorer fit for non-canonical text. This suggests that adaptive pre-training could yield better results (Han and Eisenstein, 2019). We also compare transfer learning from German with increasing amounts of in-language Danish data. The learning curve in Figure 3 shows that transfer helps for low amounts of data, that in-domain performance plateaus surprisingly quickly (especially in the de+da setup), and that in-language data remains the best in-domain (ID). In contrast, the gap to the non-canonical domains remains large for both in-language and cross-language setups, and OOD performance is less stable throughout, calling for more out-of-domain evaluation of NER models.

Table 3: Nested NER F1 score on the test sets for models with mlBERT (ml) vs danishBERT (da).

Lexical Normalization
We take a straightforward baseline for normalization, which always copies the original token; the accuracy of this baseline is thus equal to the percentage of words that are not normalized. We evaluate the impact of automatic versus gold normalization on NER. The results in Table 2 show that MoNoise performs well on the less canonical Arto data, in contrast to the Twitter data. On the Arto data, MoNoise reaches scores in a range similar to state-of-the-art results on other languages (van der Goot, 2019). In the downstream evaluation (right part of Table 2), we see that normalization is most beneficial when the data is less canonical (Arto), but normalization is beneficial even on Twitter. Furthermore, from the GOLD results, we can conclude that there is still room for improvement for automatic normalization.
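Both the intrinsic metric and the copy baseline are simple enough to state precisely; the following sketch (our illustration, not MoNoise code) shows how the baseline's accuracy reduces to the share of unchanged words:

def normalization_accuracy(pred, gold):
    # Capitalization-sensitive word-level accuracy over ALL words,
    # including those that need no normalization.
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def copy_baseline_accuracy(original, gold):
    # The baseline predicts every original token unchanged, so its
    # accuracy equals the fraction of words that are not normalized.
    return normalization_accuracy(original, gold)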

Test Data
We evaluate the model that fares best overall on in-domain source news (de+da.da.multitask) with danishBERT and mlBERT on the test sets. Table 3 shows that our model outperforms the boundary-aware method, which turns out to be brittle under domain shift. Overall, the results confirm that normalization helps most on the least canonical data (i.e., Arto), and that mlBERT is better than danishBERT on canonical news data, whereas on the least standard data (Arto) it is the other way around.

Conclusions
This paper contributes to the limited prior work on cross-lingual cross-domain transfer for nested NER. We provide a new resource for Danish, DaN+, with baselines for nested NER and lexical normalization, using two BERT variants and training on Danish, German, or both. Our results show that BERT-based variants are sensitive to domain shift in cross-domain nested NER, whereas they cope relatively well with missing in-language data. Results on normalization show that it helps only for very non-standard data, for which automatic normalization improves Danish nested NER performance.

D Annotation guidelines for lexical normalization
The guidelines are based on Baldwin et al. (2015a); all cases where we diverged from these guidelines, or where we believed clarification was necessary, are described below.
Systematic misspellings
Since the data was taken from social media, some words were systematically misspelled. This is especially visible on Arto, where many words were spelled with q instead of g. Here, q was replaced with g:
jeq → jeg (I)
muliqe → mulige (possible)
As it is also common to write words without the last one or two letters, many words were missing one or more letters at the end. Here, the missing letters were inserted:
ik → ikke (not)
hva → hvad (what)

Capitalization
Capitalization was corrected in names, for the first letter of a post, and after periods, question marks, and other punctuation that requires the following word to start with a capital letter. Capitalized words that illustrate yelling or emphasis have been decapitalized; acronyms that are capitalized have been kept capitalized.
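To make the q-for-g pattern concrete, a toy sketch is given below; the actual DaN+ normalization was performed manually by annotators, so this code is purely hypothetical:

def fix_q_for_g(word):
    # Toy rule for the systematic Arto pattern: q written instead of g.
    # Manual annotation was context-sensitive; this blanket rule is not.
    return word.replace("q", "g")

assert fix_q_for_g("jeq") == "jeg"        # 'I'
assert fix_q_for_g("muliqe") == "mulige"  # 'possible'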

Phrasal abbreviations
There was no correction of phrasal abbreviations because the written-out form does not correspond to the intended meaning of the phrase. The only ones found were in English.

Hashtags
Hashtags and usernames were not corrected, even if they were misspelled or if they contained multiple words.
#sundhedforalle → #sundhedforalle (health for all)

Corrections of the letters æ, ø, å
The Danish alphabet contains the three letters æ, ø and å. If these are not available on the keyboard used, they are often replaced by other vowels:
ae → æ
o → ø
aa → å
In words where these replacement vowels are used, they have been replaced with the appropriate letter. In some data, æ, ø and å were left out entirely; here, the missing letter was inserted. As the missing letter in some cases allows multiple options, the intended word was determined from context:
har → har (to have) or hår (hair)
fler → flere (more) or føler (feels)