What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian

We present experiments on Slovenian, Croatian and Serbian morphosyntactic annotation and lemmatisation, comparing the former state of the art for these three languages with one of the best-performing systems at the CoNLL 2018 shared task, the Stanford NLP neural pipeline. Our experiments show significant improvements in morphosyntactic annotation, especially on categories where either semantic knowledge, available through word embeddings, is needed, or where long-range dependencies have to be modelled. On the other hand, on the task of lemmatisation the neural solution yields no improvements, mostly due to the heavy dependence of the task on lookup in an external lexicon, but also due to obvious room for improvement in the Stanford NLP pipeline's lemmatisation.


Introduction
Morphosyntactic annotation and lemmatisation are crucial tasks for languages that are rich in inflectional morphology, such as Slavic languages. These tasks are far from solved, and the recent CoNLL 2017 and CoNLL 2018 (Zeman et al., 2018) shared tasks on multilingual parsing from raw text to Universal Dependencies (Nivre et al., 2016) have given the necessary spotlight to these problems. In addition to the advances due to multi- and cross-lingual settings, the participating systems have also confirmed the predominance of neural network approaches in the field of natural language processing.
In this paper we compare the improvements obtained on these two tasks in three South Slavic languages (Slovenian, Croatian and Serbian) by moving from traditional approaches to neural ones. The tool that we use as the representative of the traditional approaches is reldi-tagger (Ljubešić and Erjavec, 2016), the previous state of the art for morphosyntactic tagging and lemmatisation of the three focus languages due to (1) carefully engineered features for the CRF-based tagger, (2) integration of an inflectional lexicon for both the morphosyntactic tagging and the lemmatisation task and (3) lemma guessing for unknown word forms via morphosyntactic-tag-specific Naive Bayes classifiers that predict the transformation of the surface form. The tool that we use as the representative of the neural approaches is stanfordnlp, the Stanford NLP pipeline (Qi et al., 2018), a state-of-the-art tool for neural morphosyntactic and dependency syntax text annotation. The system took part in the CoNLL 2018 shared task (Zeman et al., 2018) as one of the best-performing systems, which would have, with "an unfortunate bug fixed", placed among the top three for all evaluation metrics, including lemmatisation and morphology prediction. The tool is, additionally, released as open source and has a vibrant development community, 1 with a named entity recognition module in development.

Experiment Setup
We perform our comparison of the traditional and the neural tool of choice on the two tasks on data splits defined in the babushka-bench benchmarking platform, 2 which currently hosts data and results for the three South Slavic languages we use in these experiments, namely Slovenian, Croatian and Serbian. It is organised as a git repository, with scripts for transferring datasets from the CLARIN.SI repository 3 and splitting them into training, development, and testing portions. While the primary use of this platform is in-house experiments on available and emerging technologies, other researchers are more than welcome to further enrich the repository.
1 https://github.com/stanfordnlp/stanfordnlp
2 https://github.com/clarinsi/babushka-bench
3 https://www.clarin.si/repository/xmlui
The name of the repository has its roots in the erroneous, but popular naming of the Matryoshka doll in South Slavic languages: the datasets are split into train, dev and test portions in a random fashion, but with a fixed random seed, so that the splits of smaller annotation layers nest inside those of the whole dataset. This enables splitting the same datasets on annotation layers that were not applied over the whole dataset (as is often the case with costly annotations of syntax, semantics etc.), while ensuring that no spillage between train, dev and test occurs across the various layers. There are many cases where such a split comes in handy for benchmarking, one example being the use of the whole datasets for training taggers and just portions of the datasets (i.e. the manually parsed subsets) for training parsers that require tagging as upstream processing.
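The nesting property described above can be illustrated with a minimal Python sketch. It is not the babushka-bench code: the function names, the seed value and the 80/10/10 proportions are assumptions made for illustration only. Each sentence identifier is assigned to a portion once, over the whole dataset, and any smaller annotation layer simply inherits that assignment.

```python
import random


def assign_splits(ids, seed=42, dev_share=0.1, test_share=0.1):
    """Map every sentence id to 'train', 'dev' or 'test' deterministically.

    The seed and the shares are illustrative assumptions, not values
    taken from babushka-bench.
    """
    ordered = sorted(ids)                 # make the result independent of input order
    random.Random(seed).shuffle(ordered)  # fixed seed gives a reproducible shuffle
    n_test = int(len(ordered) * test_share)
    n_dev = int(len(ordered) * dev_share)
    splits = {}
    for position, sentence_id in enumerate(ordered):
        if position < n_test:
            splits[sentence_id] = "test"
        elif position < n_test + n_dev:
            splits[sentence_id] = "dev"
        else:
            splits[sentence_id] = "train"
    return splits


def restrict(splits, subset_ids):
    """Split a smaller annotation layer by reusing the full-dataset assignment."""
    return {sentence_id: splits[sentence_id] for sentence_id in subset_ids}


# Hypothetical data: 1000 tagged sentences, every third one also parsed.
all_ids = [f"sent{i}" for i in range(1000)]
parsed_ids = all_ids[::3]

full = assign_splits(all_ids)
parsed = restrict(full, parsed_ids)
```

Because the parsed subset never receives its own shuffle, a sentence that is in the test portion for tagging cannot end up in the training portion for parsing.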
For evaluating morphosyntactic tagging and lemmatisation in babushka-bench, we use a modified CoNLL 2018 shared task evaluation script to enable evaluation without parsing present. This script calculates the F1 metric between the gold and the predicted annotations, taking into account the possibility of different segmentation, which is not the case in these experiments as we use gold segmentation from the datasets to focus on the tasks of morphosyntactic tagging and lemmatisation. When modelling morphosyntax, we predict morphosyntactic descriptions (MSDs), position-based encodings of part-of-speech and feature-value pairs, as defined in the MULTEXT-East tagset (Erjavec, 2012). The training-data-defined size of the tagset for each of the three languages lies between 600 and 1300 MSDs, depending on the language and the size of the training data. This is the default tagset for the reldi-tagger and is also supported by the stanfordnlp tool, where language-specific tags (XPOS) are predicted as one of the three outputs by the tagging module (the other two being UD parts-of-speech (UPOS) and features (FEATS)).
The datasets we use for our experiments are the three official datasets for training standard language technologies for these languages. These are the ssj500k dataset for Slovenian, the hr500k dataset for Croatian and the SETimes.SR dataset for Serbian. While the Slovenian and Croatian datasets are both around 500 thousand tokens in size, the Serbian dataset is significantly smaller, with only 87 thousand tokens. We additionally make use of the inflectional lexicons of these three languages, Sloleks for Slovenian, hrLex for Croatian (Ljubešić, 2019a) and srLex for Serbian (Ljubešić, 2019b), all containing more than 100 thousand lemmas with around 3 million inflected forms.
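As a rough illustration of the F1 evaluation described in this section, the following Python sketch computes precision, recall and F1 over token-aligned annotations. It is a simplification and not the modified CoNLL 2018 script: it assumes identical (gold) segmentation, in which case precision, recall and F1 all coincide with plain accuracy. The function name and the example tags are illustrative.

```python
def f1_aligned(gold, predicted):
    """F1 between gold and predicted labels under identical segmentation."""
    if len(gold) != len(predicted):
        raise ValueError("this sketch assumes gold segmentation on both sides")
    correct = sum(g == p for g, p in zip(gold, predicted))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative MSD sequences: one accusative noun is mistagged as nominative.
gold = ["Ncmsn", "Vmr3s", "Ncmsan", "Z"]
pred = ["Ncmsn", "Vmr3s", "Ncmsn", "Z"]
score = f1_aligned(gold, pred)
```

With different segmentation the two sequence lengths would differ and the tokens would first have to be aligned, which is the part of the original script that this sketch leaves out.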
Our experiments are split into two main parts: experiments on morphosyntactic tagging in Section 3.1, backed by a comparison of the most frequent errors made by the traditional and the neural approach, and experiments on lemmatisation in Section 3.2.

Morphosyntax
We first compare the results of the two tools on morphosyntactic annotation, trained on the training portion of the datasets of the three languages, with development data used if necessary. 5 The results of the two taggers on the three languages are presented in Table 1.
The results show significant differences between reldi-tagger and stanfordnlp, with a relative error reduction of 43% for Slovenian, 27% for Croatian and 40% for Serbian. Regarding the usage of different embedding collections with stanfordnlp, there are no drastic differences, but the CLARIN.SI embeddings prove better suited than the CoNLL embeddings, which does not come as a surprise as the former are based on more text, which is frequently also of higher quality. The distinction between word2vec (w2v) and fastText (fT) embeddings proves minimal, but fastText seems to be more beneficial when smaller amounts of training data are available, as is the case with Serbian.
4 Currently only the fastText versions are available for download in the repository.
5 While stanfordnlp uses the development data for updating the learning rate and optimization algorithm, reldi-tagger did not make any use of the development data during this training phase. However, during the development of reldi-tagger, a series of feature selections and hyperparameter values were investigated on held-out data, so that tool can be considered to have used development data indirectly as well.
For the error analysis, as well as downstream experiments on lemmatisation, for which morphosyntactic annotation is a prerequisite, we take the stanfordnlp tool with CLARIN.SI fastText embeddings, as these settings achieve the best results on average.
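For readers unfamiliar with the measure, relative error reduction expresses how much of the baseline's error the new system removes. The Python sketch below shows the arithmetic under the assumption that scores are given as percentages. The input values are hypothetical and are not taken from Table 1.

```python
def relative_error_reduction(baseline_score, new_score):
    """Share of the baseline error removed by the new system.

    Both scores are percentages (e.g. F1 or accuracy on a 0-100 scale).
    """
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    if baseline_error <= 0:
        raise ValueError("the baseline makes no errors, reduction is undefined")
    return (baseline_error - new_error) / baseline_error


# Hypothetical scores: moving from 90.0 to 94.0 removes 4 of 10 error points.
rer = relative_error_reduction(90.0, 94.0)
```

A reduction of 0.4 here corresponds to 40%, which shows why a gain of a few absolute points can amount to a large relative improvement when the baseline is already strong.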
To identify the differences in morphosyntactic tagging errors between the traditional and the neural tagger, we analyse the 10 most frequent confusions per tagger for each of the three languages. Our results, presented in Table 2, show that some of the most frequent errors in reldi-tagger are substantially reduced by stanfordnlp, such as the confusion between masculine nouns in the singular accusative (Ncmsan) and nominative (Ncmsn), which shows the neural tagger to be more capable of modelling long-range dependencies. Namely, whether a masculine noun is in the nominative or the accusative case depends mostly on whether one of these two cases has already occurred somewhere in the clause.
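Extracting the most frequent confusions amounts to counting mismatching gold and predicted tag pairs. A minimal Python sketch is given below. Treating the pairs as directional (gold tag, predicted tag) is an assumption, as is the function name, and the example sequences are invented for illustration.

```python
from collections import Counter


def top_confusions(gold, predicted, n=10):
    """Return the n most frequent (gold tag, predicted tag) mismatches."""
    pairs = Counter((g, p) for g, p in zip(gold, predicted) if g != p)
    return pairs.most_common(n)


# Invented example: two accusative nouns tagged as nominative, plus one
# adverb/adjective confusion in each direction.
gold = ["Ncmsan", "Ncmsan", "Rgp", "Ncmsn", "Agpnsn"]
pred = ["Ncmsn", "Ncmsn", "Agpnsn", "Ncmsn", "Rgp"]
confusions = top_confusions(gold, pred)
```

Running the same function over the output of each tagger and comparing the two ranked lists gives the kind of side-by-side view that an error table of this sort requires.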
Another regular confusion in morphosyntactic tagging in general, which is also largely resolved by the neural tagger, is that between adjectives in the neuter nominative (Agpnsn) and adverbs (Rgp), which, again, requires information from a wider context, i.e., whether there is a noun to which the potential adjective can be attached. An error type which requires more of a semantic understanding is the distinction between proper nouns (Npmsn) and foreign residuals (Xf) in Croatian and Serbian. In these two languages, the rule is that proper nouns of foreign origin (Easy Jet, Feng Shui) are annotated as foreign residuals. This type of error is in good part resolved via word embedding information, where this distinction is evidently encoded, while in the 1000 hierarchical Brown clusters that reldi-tagger relies on this is evidently not the case.
Table 3: F1 results in lemmatisation with the traditional and neural tool and different upstream processing.
Interestingly, some shared errors are even more frequent in the neural stanfordnlp predictions, such as the disambiguation between homonymous conjunctions (Cc, Cs) and adverbs (Rgp) for Croatian and Slovenian (e.g. već, tako, zato), which does come as a surprise as this distinction requires long-range information which should be more available in the neural approach.

Lemmatisation
Given that morphosyntactic information is usually expected as the input to lemmatisation, we compare the lemmatisation performance of the two tools if (1) gold morphosyntax is given, (2) the morphosyntax predicted by the tool itself is used and (3) the best predicted morphosyntax, that of stanfordnlp, is used. In addition to that, we also expand stanfordnlp with a simple intervention in the lemmatisation procedure, in which the lexicon lookup is performed not only over the training data, but over the external inflectional lexicons as well, naming this modified tool stanfordnlp+lex.
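The idea behind the lexicon intervention can be sketched in Python as follows. This is not the stanfordnlp implementation: the function names, the choice of (form, tag) as the lookup key and the identity fallback standing in for a trained lemmatisation model are all assumptions made for illustration.

```python
def build_lookup(training_triples, lexicon_triples=()):
    """Build a (form, tag) -> lemma table.

    Lexicon entries are added first and never overwrite each other;
    entries observed in the training data take precedence.
    """
    lookup = {}
    for form, tag, lemma in lexicon_triples:
        lookup.setdefault((form, tag), lemma)
    for form, tag, lemma in training_triples:
        lookup[(form, tag)] = lemma
    return lookup


def lemmatise(form, tag, lookup, fallback):
    """Use the lookup table when possible, otherwise ask the fallback model."""
    key = (form, tag)
    if key in lookup:
        return lookup[key]
    return fallback(form, tag)


def identity_fallback(form, tag):
    """Placeholder for a trained model: returns the surface form unchanged."""
    return form


# Illustrative Croatian entries for the lemma "kuća" (house).
training = [("kuće", "Ncfsg", "kuća")]
lexicon = [("kućama", "Ncfpd", "kuća")]

train_only = build_lookup(training)
with_lexicon = build_lookup(training, lexicon)

lemma_without = lemmatise("kućama", "Ncfpd", train_only, identity_fallback)
lemma_with = lemmatise("kućama", "Ncfpd", with_lexicon, identity_fallback)
```

The form unseen in training is only lemmatised correctly once the external lexicon is part of the table, which is the effect the intervention aims for.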
The results of the lemmatisation experiments are given in Table 3.
The results show that reldi-tagger outperforms the original stanfordnlp by a substantial margin, which does not come as a surprise as reldi-tagger uses a large inflectional lexicon. A simple lexicon intervention with stanfordnlp+lex closes the gap between the two, with almost no difference in lemmatisation quality for any of the languages.
Regarding different upstream processing, as expected, preprocessing with stanfordnlp closes one third of the gap between preprocessing with reldi-tagger and having perfect, gold morphosyntactic annotation.
Investigating the differences between the decisions of reldi-tagger and stanfordnlp+lex shows that these mostly differ in handling named entities, with both tools missing the correct lemma with similar frequency. For stanfordnlp+lex in particular, some errors can be attributed to the fact that it does not rely on the morphological feature (FEATS) information when looking up forms in the lexicon and producing lemma predictions, causing errors such as generating a feminine proper noun lemma for a correctly tagged masculine proper noun.
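The error type described above can be reproduced with a small Python sketch that contrasts a lookup ignoring FEATS with one that uses it. The lexicon format, the function names and the feature-overlap heuristic are assumptions and do not reflect the internals of either tool. The example relies on the Croatian form "Marina", which is both the nominative of the feminine name Marina and the genitive of the masculine name Marin.

```python
# Hypothetical lexicon entries: (form, UPOS, FEATS, lemma).
LEXICON = [
    ("Marina", "PROPN", "Case=Nom|Gender=Fem|Number=Sing", "Marina"),
    ("Marina", "PROPN", "Case=Gen|Gender=Masc|Number=Sing", "Marin"),
]


def lookup_without_feats(form, upos, lexicon):
    """Return the lemma of the first entry matching form and UPOS only."""
    for entry_form, entry_upos, _feats, lemma in lexicon:
        if entry_form == form and entry_upos == upos:
            return lemma
    return None


def lookup_with_feats(form, upos, feats, lexicon):
    """Among entries matching form and UPOS, prefer the largest FEATS overlap."""
    wanted = set(feats.split("|"))
    candidates = [
        (set(entry_feats.split("|")), lemma)
        for entry_form, entry_upos, entry_feats, lemma in lexicon
        if entry_form == form and entry_upos == upos
    ]
    if not candidates:
        return None
    best_feats, best_lemma = max(candidates, key=lambda c: len(wanted & c[0]))
    return best_lemma


predicted_feats = "Case=Gen|Gender=Masc|Number=Sing"
lemma_ignoring_feats = lookup_without_feats("Marina", "PROPN", LEXICON)
lemma_using_feats = lookup_with_feats("Marina", "PROPN", predicted_feats, LEXICON)
```

Even with a correct masculine tag, the lookup that ignores FEATS returns the feminine lemma simply because that entry comes first, while the feature-aware variant selects the masculine one.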

Conclusion
In this paper we have presented the setup of a long-term evaluation platform for benchmarking current and future NLP tools for the three South Slavic languages, a practice which is still far too rare. We carried out a comparative evaluation of two state-of-the-art tools with different architectures (traditional vs. neural) and confirmed that the neural approach yields significant improvements in tagging, especially because of better long-range dependency modelling and the greater amount of distributional semantic information available.
For lemmatisation, the results of both approaches are very close, especially because of a heavy dependence on the lookup in a large inflectional lexicon, but with obvious room for improvement in the neural lemmatisation process.
The presented results give important pointers for the development of future state-of-the-art tools for the three languages, but also Slavic languages in general.