Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe

Many natural language processing tasks, including the most advanced ones, routinely start with several basic processing steps: tokenization and segmentation, most likely also POS tagging and lemmatization, and commonly parsing as well. A multilingual pipeline performing these steps can be trained using the Universal Dependencies project, which contains annotations of the described tasks for 50 languages in the latest release UD 2.0. We present an update to UDPipe, a simple-to-use pipeline processing CoNLL-U version 2.0 files, which performs these tasks for multiple languages without requiring additional external data. We provide models for all 50 languages of UD 2.0, and furthermore, the pipeline can be trained easily using data in CoNLL-U format. UDPipe is a standalone application in C++, with bindings available for Python, Java, C# and Perl. In the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, UDPipe was the eighth best system, while achieving low running times and moderately sized models.


Introduction
The Universal Dependencies project (Nivre et al., 2016) seeks to develop cross-linguistically consistent treebank annotation of morphology and syntax for many languages. The latest version of UD (Nivre et al., 2017a) consists of 70 dependency treebanks in 50 languages. As such, the UD project represents an excellent data source for developing multi-lingual NLP tools which perform sentence segmentation, tokenization, POS tagging, lemmatization and dependency tree parsing.
The goal of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (CoNLL 2017 UD Shared Task) is to stimulate research in multi-lingual dependency parsers which process raw text only. The overview of the task and the results are presented in Zeman et al. (2017).
This paper describes UDPipe (Straka et al., 2016), an open-source tool which automatically generates sentence segmentation, tokenization, POS tagging, lemmatization and dependency trees, using UD version 2 treebanks as training data.
The contributions of this paper are:
• Description of UDPipe 1.1 Baseline System, which was used to provide baseline models for CoNLL 2017 UD Shared Task and preprocessed test sets for the CoNLL 2017 UD Shared Task participants. UDPipe 1.1 provided a strong baseline for the task, placing as the 13th (out of 33) best system in the official ranking. The UDPipe 1.1 Baseline System is described in Section 3.
• Description of UDPipe 1.2 Participant System, an improved variant of UDPipe 1.1, which was used as a contestant system in the CoNLL 2017 UD Shared Task, finishing 8th in the official ranking, while keeping very low software requirements. The UDPipe 1.2 Participant System is described in Section 4.
• Evaluation of search-based oracle and several transition-based systems on UD 2.0 dependency trees (Section 5).

Related Work
There are a number of NLP pipelines available, e.g., the Natural Language Processing Toolkit (NLTK, http://nltk.org; Bird et al., 2009) or OpenNLP, to name a few. We designed yet another one, UDPipe (http://ufal.mff.cuni.cz/udpipe), with the aim to provide an extremely simple tool which can be trained easily using only a CoNLL-U file without additional resources or feature engineering.
Deep neural networks have recently achieved remarkable results in many areas of machine learning. In NLP, end-to-end approaches were initially explored by Collobert et al. (2011). With a practical method for precomputing word embeddings (Mikolov et al., 2013) and routine utilization of recurrent neural networks (Hochreiter and Schmidhuber, 1997; Cho et al., 2014), deep neural networks achieved state-of-the-art results in many NLP areas like POS tagging, named entity recognition (Yang et al., 2016) or machine translation (Vaswani et al., 2017). The wave of neural network parsers was started recently by Chen and Manning (2014), who presented a fast and accurate transition-based parser. Many other parser models followed, employing various techniques like stack LSTM, global normalization (Andor et al., 2016), biaffine attention (Dozat and Manning, 2016) or recurrent neural network grammars (Kuncoro et al., 2016), improving LAS score in English and Chinese dependency parsing by more than 2 points in 2016.

UDPipe 1.1 Baseline System
UDPipe 1.0 (Straka et al., 2016) is a trainable pipeline performing sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. It is fully trainable using CoNLL-U version 1 files and the pretrained models for UD 1.2 treebanks are provided.
For the purpose of the CoNLL 2017 UD Shared Task, we implemented a new version, UDPipe 1.1, which processes CoNLL-U version 2 files. UDPipe 1.1 was used as one of the baseline systems in the shared task. UDPipe 1.1 Baseline System was trained and tuned in the training phase of CoNLL 2017 UD Shared Task on the UD 2.0 training data, and the trained models and outputs were available to the participants.
In this Section, we describe the UDPipe 1.1 Baseline System, focusing on the differences from the previous version described in Straka et al. (2016): the tokenizer (Section 3.1), the tagger (Section 3.2), the parser (Section 3.3), the hyperparameter search support (Section 3.4), the training details (Section 3.5) and evaluation (Section 3.6).

Tokenizer
In UD and in CoNLL-U files, the text is structured on several levels: a document consists of paragraphs composed of (possibly partial) sentences, which are sequences of tokens. A token is also usually a word (a unit used in further morphological and syntactic processing), but a single token may be composed of several syntactic words (for example, the token zum consists of the words zu and dem in German). The original text can therefore be reconstructed as a concatenation of tokens with adequate spaces, but not as a concatenation of words.

Sentence Segmentation and Tokenization
Sentence segmentation and tokenization are performed jointly (as in UDPipe 1.0) using a single-layer bidirectional GRU network which predicts for each character whether it is the last one in a sentence, the last one in a token, or not the last one in a token. Spaces are usually not allowed in tokens and therefore the network does not need to predict end-of-token before a space (it only learns to separate adjacent tokens, for example in Hi! or cannot).
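To make the per-character prediction concrete, the following minimal sketch decodes a sequence of character labels into sentences and tokens. The label names S, T and - are hypothetical (the text only states that three classes are predicted), and the labels are written by hand here rather than produced by a GRU network.

```python
# Hypothetical label names; the three classes correspond to
# "last character of a sentence", "last character of a token", and neither.
SENT_END, TOKEN_END, INSIDE = "S", "T", "-"


def decode(text, labels):
    """Turn per-character labels into a list of sentences (lists of tokens)."""
    sentences, tokens, current = [], [], ""
    for ch, label in zip(text, labels):
        if ch.isspace():
            # Without spaces in tokens, a space always closes the open token,
            # so no end-of-token label is needed in front of it.
            if current:
                tokens.append(current)
                current = ""
            continue
        current += ch
        if label in (TOKEN_END, SENT_END):
            tokens.append(current)
            current = ""
        if label == SENT_END:
            sentences.append(tokens)
            tokens = []
    if current:
        tokens.append(current)
    if tokens:
        sentences.append(tokens)
    return sentences


text = "Hi! I cannot."
labels = "-TS-----T--TS"  # one hand-written label per character
print(decode(text, labels))
```

Only the boundaries between adjacent tokens (Hi|!, can|not, not|.) need an explicit end-of-token label in this sketch; the token I is closed by the following space.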

Multi-Word Token Splitting
In UDPipe 1.0, a case-insensitive dictionary was used to split tokens into words. This approach is beneficial if there is a fixed number of multi-word tokens in the language (which is the case for example in German).
In UDPipe 1.1 Baseline System we also employ automatically generated suffix rules: a token with a specific suffix is split, using the non-matching part of the token as the prefix of the first word, followed by a fixed sequence consisting of the first word's suffix and the other words (e.g., in Polish we create a rule łem → ł + em). The rules are generated automatically by keeping all such rules present in the training data which do not trigger incorrectly too often. The contribution of suffix rules is evaluated in Section 5.
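Applying such a suffix rule can be sketched as follows. The rule table format and the function name are assumptions made for illustration, as is the choice to try longer suffixes first; the sketch does not cover how rules are extracted or filtered from training data.

```python
def split_by_suffix_rule(token, rules):
    """Split a multi-word token using suffix rules.

    `rules` maps a token suffix to (suffix of the first word, other words).
    This table format is hypothetical, not UDPipe's internal representation.
    """
    # Assumption: prefer the longest matching suffix.
    for suffix in sorted(rules, key=len, reverse=True):
        if len(token) > len(suffix) and token.lower().endswith(suffix):
            first_suffix, other_words = rules[suffix]
            prefix = token[: len(token) - len(suffix)]  # non-matching part
            return [prefix + first_suffix] + list(other_words)
    return [token]  # no rule applies: the token is a single word


rules = {"łem": ("ł", ["em"])}  # the Polish rule łem → ł + em
print(split_by_suffix_rule("zrobiłem", rules))
```

For the token zrobiłem the rule keeps zrobi as the prefix, appends ł to form the first word, and emits em as the second word.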

Documents and Paragraphs
We use an improved sentence segmenter in UDPipe 1.1 Baseline System. The segmenter learns sentence boundaries in the text in a standard way as in UDPipe 1.0, but it omits the sentence breaks at the end of a paragraph or a document. The reason for excluding these boundaries from the training data is that the ends of paragraphs and documents are frequently recognized by layout (e.g. newspaper headlines), and if the recognizer is trained to recognize these sentence breaks, it tends to erroneously split regular sentences.
Additionally, we now also mark paragraph boundaries (recognized by empty lines) and document boundaries (corresponding to files being processed, storing file names as document ids) when running the segmenter.

Spaces in Tokens
An additional feature allowed in CoNLL-U version 2 files is the presence of spaces in tokens. If spaces in tokens are allowed, the GRU tokenizer network must be modified to predict token breaks in front of spaces. On the other hand, many UD 2.0 languages do not allow spaces in tokens (and in such languages a space in a token might confuse the following systems in the pipeline). Therefore, it is configurable whether spaces in tokens are allowed, with the default being to allow spaces in tokens if there is any token with spaces in the training data.

Precise Reconstruction of Spaces
Unfortunately, neither CoNLL-U version 1 nor version 2 provides a standardized way of storing inter-token spaces which would allow reconstructing the original plain text. Therefore, UDPipe 1.1 Baseline System supports several UDPipe-specific MISC fields that are used for this purpose.
CoNLL-U defines the SpaceAfter=No MISC feature, which denotes that a given token is not followed by a space.
We extend this scheme in a compatible way by introducing SpacesAfter=spaces and SpacesBefore=spaces fields.
These fields store the spaces following and preceding this token, with SpacesBefore by default empty and SpacesAfter being by default empty or one space depending on SpaceAfter=No presence. Therefore, these fields are not needed if tokens are separated by no space or a single space. The spaces are encoded by means of a C-like escaping mechanism, with escape sequences \s, \t, \r, \n, \p, \\ for space, tab, CR, LF, | and \ characters, respectively.
If spaces in tokens are allowed, these spaces cannot be represented faithfully in the FORM field, which disallows tabs and newline characters. Therefore, UDPipe utilizes an additional MISC field SpacesInToken=token with spaces representing the token with original spaces. Once again, with the default value being the value of the FORM field, the field is needed only if the token spaces cannot be represented in the FORM field.
All described MISC fields are generated automatically by the UDPipe 1.1 Baseline System tokenizer, with SpacesBefore used only at the beginning of a sentence. Furthermore, we also provide an optional way of storing the document-level character offsets of all tokens, using the TokenOffset MISC field. The values of this field employ a Python-like start:end format.
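The escaping scheme can be illustrated with a short sketch. The function names are hypothetical and the code is not taken from UDPipe; it only implements the six escape sequences listed above.

```python
# Escape sequences as listed in the text: \s \t \r \n \p \\ .
ESCAPES = {" ": r"\s", "\t": r"\t", "\r": r"\r", "\n": r"\n",
           "|": r"\p", "\\": "\\\\"}
UNESCAPES = {seq[1]: ch for ch, seq in ESCAPES.items()}


def escape_spaces(text):
    """Encode whitespace, '|' and backslash for a SpacesAfter-style field."""
    return "".join(ESCAPES.get(ch, ch) for ch in text)


def unescape_spaces(encoded):
    """Invert escape_spaces."""
    out, i = [], 0
    while i < len(encoded):
        if encoded[i] == "\\" and i + 1 < len(encoded):
            # Unknown escapes fall back to the escaped character itself.
            out.append(UNESCAPES.get(encoded[i + 1], encoded[i + 1]))
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return "".join(out)


print("SpacesAfter=" + escape_spaces("  \n"))
```

The | character needs its own escape because it separates individual features inside the MISC column.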

Detokenization
To train the tokenizer, the original plain texts of the CoNLL-U files are required. These plain texts can be reconstructed using the SpaceAfter=No feature. However, very few UD version 1 corpora contain this information. Therefore, UDPipe 1.0 offers a way of generating these features using a different raw text in the concerned language (Straka et al., 2016).
Fortunately, most UD 2.0 treebanks do include the SpaceAfter=No feature. We perform detokenization only for Danish, Finnish-FTB and Slovenian-SST.
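When SpaceAfter=No is present, reconstructing the plain text is straightforward. The sketch below is an illustration under simplifying assumptions (a single space between tokens and between sentences, no SpacesAfter or SpacesBefore fields); the helper names are hypothetical.

```python
def conllu_to_text(conllu):
    """Rebuild plain text from CoNLL-U using FORM and SpaceAfter=No."""
    pieces = []
    covered_until = 0  # last word ID covered by the current multi-word token
    for line in conllu.splitlines():
        if not line.strip():
            covered_until = 0  # sentence boundary: word IDs restart at 1
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        token_id, form, misc = cols[0], cols[1], cols[9]
        if "." in token_id:
            continue  # empty node, absent from the surface text
        if "-" in token_id:
            covered_until = int(token_id.split("-")[1])
        elif int(token_id) <= covered_until:
            continue  # syntactic word inside a multi-word token
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


def row(token_id, form, misc="_"):
    """Build a CoNLL-U line: ID, FORM, seven unused columns, MISC."""
    return "\t".join([token_id, form] + ["_"] * 7 + [misc])


sample = "\n".join([
    "# text = zum Haus.",
    row("1-2", "zum"), row("1", "zu"), row("2", "dem"),
    row("3", "Haus", "SpaceAfter=No"), row("4", "."),
    "",
])
print(conllu_to_text(sample))
```

The text is built from tokens (the 1-2 line) rather than from the syntactic words zu and dem, in line with the distinction between tokens and words in CoNLL-U.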

Inference
When employing the segmenter and tokenizer GRU network during inference, it is important to normalize spaces in the given text. The reason is that during training, tokens were either adjacent or separated by a single space, so we need to modify the network input during inference accordingly.
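A minimal sketch of such normalization follows. Returning the original whitespace runs alongside the normalized text is an assumption made here so that they could later be restored (for example into SpacesAfter); it is not a description of the actual UDPipe code.

```python
import re


def normalize_spaces(text):
    """Collapse every whitespace run to one space; keep the original runs."""
    runs = re.findall(r"\s+", text)          # original runs, in order
    normalized = re.sub(r"\s+", " ", text)   # what the network would see
    return normalized, runs


print(normalize_spaces("Hi!\n\nI  cannot."))
```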
During inference, we precompute as many network operations on character embeddings as possible (to be specific, we cache 6 matrix products for every character embedding in each GRU). Consequently, the inference is almost twice as fast.

Tagger
The tagger utilized by UDPipe 1.1 Baseline System is nearly identical to the previous version in UDPipe 1.0. A guesser generates several (UPOS, XPOS, FEATS) triplets for each word according to its last four characters, and an averaged perceptron tagger with a fixed set of features disambiguates the generated tags (Straka et al., 2016;Straková et al., 2014).
The lemmatizer is analogous. A guesser produces (lemma rule, UPOS) pairs, where the lemma rule generates a lemma from a word by stripping some prefix and suffix and prepending and appending a new prefix and suffix. To generate correct lemma rules, the guesser generates the results not only according to the last four characters of a word, but also using the word prefix. Again, the disambiguation is performed by an averaged perceptron tagger.
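The notion of a lemma rule can be sketched as follows. The rule representation and the way a rule is derived from a (form, lemma) pair (via the longest common substring) are assumptions chosen for illustration; the text does not specify how UDPipe encodes or extracts its rules.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class LemmaRule(NamedTuple):
    strip_prefix: int  # number of characters removed from the start
    strip_suffix: int  # number of characters removed from the end
    add_prefix: str    # string prepended to the remaining stem
    add_suffix: str    # string appended to the remaining stem


def make_rule(form, lemma):
    """Derive a rule around the longest common substring (an assumption)."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    match = matcher.find_longest_match(0, len(form), 0, len(lemma))
    return LemmaRule(
        strip_prefix=match.a,
        strip_suffix=len(form) - (match.a + match.size),
        add_prefix=lemma[:match.b],
        add_suffix=lemma[match.b + match.size:],
    )


def apply_rule(form, rule):
    """Generate a lemma from a form; leave too-short forms unchanged."""
    if rule.strip_prefix + rule.strip_suffix > len(form):
        return form
    stem = form[rule.strip_prefix:len(form) - rule.strip_suffix]
    return rule.add_prefix + stem + rule.add_suffix


rule = make_rule("gemacht", "machen")
print(rule, apply_rule("gesagt", rule))
```

A rule learned from one word generalizes to other words of the same pattern, which is why relative rules are usually more compact than storing every lemma explicitly.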
We prefer to perform lemmatization and POS tagging separately (not as a joint task), because we found out that utilization of two different guessers and two different feature sets improves the performance of our system (Straka et al., 2016).
The only change in UDPipe 1.1 Baseline System is the possibility to store lemmas not only as lemma rules, i.e., relatively, but also as "absolute" lemmas. This change was required by the fact that some languages such as Persian contain a lot of empty lemmas which are difficult to encode using relative lemma rules, and because the Latin-PROIEL treebank uses the greek.expression lemma for all Greek forms.

Dependency Parsing
UDPipe 1.0 utilizes a fast transition-based neural dependency parser. The parser is based on a simple neural network with just one hidden layer and without any recurrent connections, using locally normalized scores.
The parser offers several transition systems: a projective arc-standard system (Nivre, 2008), a partially non-projective link2 system (Gómez-Rodríguez et al., 2014) and a fully non-projective swap system (Nivre, 2009). Several transition oracles are implemented: static oracles, a dynamic oracle for the arc-standard system (Goldberg et al., 2014) and a search-based oracle (Straka et al., 2015). A detailed description of the parser architecture, transition systems and oracles can be found in Straka et al. (2016) and Straka et al. (2015).
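For readers unfamiliar with transition-based parsing, the sketch below applies a sequence of arc-standard transitions in a common textbook formulation that operates on the two topmost stack items. Whether it matches the UDPipe implementation in detail is an assumption; dependency labels and the neural scoring of transitions are omitted.

```python
def parse(n_words, transitions):
    """Apply arc-standard transitions; return {word: head}, with 0 = root."""
    stack, buffer = [0], list(range(1, n_words + 1))
    heads = {}
    for transition in transitions:
        if transition == "shift":
            stack.append(buffer.pop(0))
        elif transition == "left_arc":
            # The second-topmost item becomes a dependent of the topmost one.
            dependent = stack.pop(-2)
            heads[dependent] = stack[-1]
        elif transition == "right_arc":
            # The topmost item becomes a dependent of the second-topmost one.
            dependent = stack.pop()
            heads[dependent] = stack[-1]
        else:
            raise ValueError(f"unknown transition: {transition}")
    return heads


# "She reads books": 1 = She, 2 = reads, 3 = books
sequence = ["shift", "shift", "left_arc", "shift", "right_arc", "right_arc"]
print(parse(3, sequence))
```

In this formulation a one-word sentence admits only the sequence shift, right_arc, which attaches the word to the root.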
The parser makes use of FORM, UPOS, FEATS and DEPREL embeddings. The form embeddings are precomputed with word2vec using the training data, the other embeddings are initialized randomly, and all embeddings are updated during training.
We again precompute as many network operations as possible for the input embeddings. However, to keep memory requirements and loading times reasonable, we do so only for the 1000 most frequent embeddings of every type.
Because the CoNLL 2017 UD Shared Task did not allow sentences with multiple roots, we modified all the transition systems in UDPipe 1.1 to generate only one root node and to use the root dependency relation only for this node.

Hyperparameter Search Support
All three described components employ several hyperparameters which can improve performance if tuned correctly. To ease the process, UDPipe offers random hyperparameter search for all the components: the run=number option during training generates pseudorandom but deterministic values for predefined hyperparameters. The hyperparameters are supposed to be tuned for every component individually, and then merged.
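The idea of pseudorandom but deterministic values can be sketched by seeding a random generator with the run number. The hyperparameter names and ranges below are purely hypothetical and do not reflect the values UDPipe actually samples.

```python
import random


def sample_hyperparameters(run):
    """Return the same hyperparameter values for the same run number."""
    rng = random.Random(run)  # the run number acts as the seed
    return {
        # All names and ranges are illustrative assumptions.
        "learning_rate": 10 ** rng.uniform(-4, -2),
        "dropout": rng.choice([0.1, 0.2, 0.3]),
        "batch_size": rng.choice([50, 100]),
    }


for run in (1, 2, 3):
    print(run, sample_hyperparameters(run))
```

Because the values depend only on the run number, a promising configuration can be reproduced later simply by repeating the same run.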

Training the UDPipe 1.1 Baseline System
When developing the UDPipe 1.1 Baseline System in the training phase of CoNLL 2017 UD Shared Task, the testing data were not yet available for the participants. Therefore a new data split was created from the available training and development data: the performance of the models was evaluated on the development data, and part of the training data was put aside and used to tune the hyperparameters. This baselinemodel-split of the UD 2.0 data is provided together with the baseline models from Straka (2017).
The following subsections describe the details of training the UDPipe 1.1 Baseline System.

Tokenizer
The segmenter and tokenizer network employs character embeddings and GRU cells of dimension 24. The network was trained using dropout both before and after the recurrent units, using the Adam optimization algorithm (Kingma and Ba, 2014). Suitable batch size, dropout probability, learning rate and number of training epochs were tuned on the tune set.

Tagger
The tagger and the lemmatizer do not use any hyperparameters which require tuning. The guesser hyperparameters were tuned on the tune set.

Parser
The parser network employs form embeddings of dimension 50, and UPOS, FEATS and DEPREL embeddings of dimension 20. The hidden layer has dimension 200, each batch consists of 10 words and the network was trained for 10 iterations. The suitable transition system, oracle, learning rate and L2 regularization were chosen to maximize the accuracy on the tune set.

Evaluation of the UDPipe 1.1 Baseline System
There are three testing collections in CoNLL 2017 UD Shared Task: UD 2.0 test data, new parallel treebank (PUD) sets, and four surprise languages.
The UDPipe 1.1 Baseline System models were completely trained, released and "frozen" already in the training phase of the CoNLL 2017 UD Shared Task, using the UD 2.0 training and development data with a new split (see Section 3.5). In contrast, the participant systems could use the full training data for training and the development data for tuning.
We used the UDPipe 1.1 Baseline System models for evaluation of the completely new parallel treebank (PUD) set and completely new surprise languages in the following way: For the new parallel treebank sets we utilized the "main" treebank for each language (e.g., for Finnish fi instead of fi_ftb). This arbitrary decision was a lucky one: after the shared task evaluation, the performance on the parallel treebanks was shown to be significantly worse if treebanks other than the "main" one were used (even if they were larger or provided higher LAS on their own test set). The reason seems to be the inconsistencies among the treebanks of the same language: the Universal Dependencies are not yet as universal as everyone would like.
To parse the surprise languages, we employed the baseline model which resulted in the highest LAS F1-score on the surprise language sample data, resulting in Finnish FTB, Polish, Finnish FTB and Slovak models for the surprise languages Buryat, Kurmanji, North Sámi and Upper Sorbian, respectively. Naturally, most words of a surprise language are not recognized by a baseline model for a different language. Conveniently, the UPOS tags and FEATS are shared across languages, allowing the baseline model to operate similarly to a delexicalized parser.

UDPipe 1.2 Participant System
We further updated the UDPipe 1.1 Baseline System to participate in CoNLL 2017 UD Shared Task with an improved UDPipe 1.2 Participant System.
As participants of the shared task, we trained the system using the whole training data and searched for hyperparameters using the development data (instead of using the baselinemodel-split described in Section 3.5). Although the data size increase is not exactly a change in the system itself, it improves performance, especially for smaller treebanks.

Hyperparameter Changes
While tokenization and segmentation are straightforward in some languages, they are quite complex in others (notably in Japanese and Chinese, which do not use spaces for word separation, or in Vietnamese, in which many tokens contain spaces). In order to improve the performance on these languages we increased the embedding dimension and GRU cell dimension in the tokenizer from 24 to 64.
We increased the form embedding dimension in the parser from 50 to 64 (larger dimensions showed no more improvements on the development set) and also trained the parser for 20 iterations over the training data instead of 10.
Furthermore, instead of using a beam of size 5 during parsing as in UDPipe 1.1 Baseline System, we tuned the beam size individually for each treebank, choosing 5, 10, 15 or 20 according to the resulting LAS on a development set.

Merging Treebanks of the Same Language
For several languages, there are multiple treebanks available in the UD 2.0 collection. Ideally, one would merge all training data of all treebanks of a given language. However, according to our preliminary experiments, the annotation is not perfectly consistent even across treebanks of the same language. Still, additional training data, albeit imperfect, could benefit small treebanks.
We therefore attempt to exploit these multiple treebanks by enriching each treebank's training data with training data from other treebanks of the same language. Given a treebank for which other treebanks of the same language exist, we evaluate the performance of several such expansions and choose the best according to LAS score on the development data of the treebank in question. We extend the original training data by adding random sentences from the additional treebanks of the same language: we consider subsets containing 1/4, 1/2, 1 and 2 times the size of the original treebank.
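The expansion procedure can be sketched as follows. The function names, the fixed seed, the capping of the sample at the number of available sentences, and the dev_las callback (standing in for training a model and measuring LAS on the development data) are all assumptions made for illustration.

```python
import random

RATIOS = (0.25, 0.5, 1, 2)  # subset sizes relative to the original treebank


def expansions(original, additional, seed=0):
    """Yield (ratio, training data) for each considered expansion."""
    rng = random.Random(seed)
    for ratio in RATIOS:
        # Assumption: never sample more sentences than are available.
        count = min(len(additional), int(ratio * len(original)))
        yield ratio, original + rng.sample(additional, count)


def best_expansion(original, additional, dev_las):
    """Pick the variant with the highest development LAS.

    `dev_las` is a hypothetical callback: it trains on the given sentences
    and returns the LAS on the development data of the original treebank.
    """
    candidates = [(None, original)] + list(expansions(original, additional))
    return max(candidates, key=lambda candidate: dev_las(candidate[1]))


original = [f"orig-{i}" for i in range(4)]
additional = [f"extra-{i}" for i in range(10)]
print([len(data) for _, data in expansions(original, additional)])
```

Including the unexpanded training data among the candidates ensures that an expansion is used only when it actually helps on the development data.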

Joint Sentence Segmentation and Parsing
Some treebanks are very difficult to segment into sentences due to missing punctuation, which harms the parser performance. We segment the three smallest treebanks of this kind (namely Gothic, Latin-PROIEL and Slovenian-SST) jointly with the parser, by choosing the sentence segmentation which maximizes the likelihood of their parse trees.
In order to determine the segmentation with maximum parsing likelihood, we evaluate every possible segmentation with sentences up to a given maximum length L. Because likelihoods of parse trees are independent, we can utilize dynamic programming and find the best segmentation in polynomial time by parsing sentences of lengths 1 to L at every location in the original text. Therefore, the procedure has the same complexity as parsing text which is circa L^2/2 times longer than the original one.
Additionally, we incorporate the segmentation suggested by the tokenizer in the likelihood of the parse trees: we multiply the tree likelihood by a fixed probability for each sentence boundary different from the one returned by the tokenizer.
However, if a transition-based parser is used, the optimum solution for the algorithm described so far would probably be to segment the text into one-token sentences, due to the fact that for a single word there is only one possible sequence of transitions (to make the word a root node), which therefore has probability one. Consequently, we introduce a third hyperparameter, which is an additional "cost" for every sentence. We tuned the three described hyperparameters for every treebank independently to maximize LAS score on the development set. The chosen hyperparameter values are shown in Table 1.
We expect graphical parsing models to benefit even more from this kind of joint segmentation: for every word, one can compute the probability distribution of attaching it as a dependent to all words within a distance of L (including the word itself, which represents the word being a root node). Then, the likelihood of a single-word sentence would not be one, but would take into account the possibility of attaching the word as a dependent to every near word.
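The dynamic program with its three hyperparameters (maximum length, boundary mismatch probability and per-sentence cost) can be sketched as below. It works in log space; the sentence_logprob callback stands in for the parser, and counting a mismatch for every position where the chosen and the tokenizer boundaries disagree is one possible reading of the description. All names are hypothetical.

```python
import math


def segment(n, max_len, sentence_logprob, tokenizer_ends,
            mismatch_logprob, sentence_cost):
    """Return sentence end positions maximizing the total score.

    sentence_logprob(start, end) is a hypothetical callback giving the log
    likelihood of the best parse of tokens[start:end].
    """
    best = [-math.inf] * (n + 1)  # best[i]: best score of the first i tokens
    back = [0] * (n + 1)          # back[i]: start of the last sentence
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            # Assumption: count every disagreement with the tokenizer.
            mismatches = (end not in tokenizer_ends) + sum(
                1 for boundary in tokenizer_ends if start < boundary < end)
            score = (best[start] + sentence_logprob(start, end)
                     + mismatches * mismatch_logprob - sentence_cost)
            if score > best[end]:
                best[end], back[end] = score, start
    ends, end = [], n
    while end > 0:  # follow the back-pointers from the end of the text
        ends.append(end)
        end = back[end]
    return ends[::-1]


def toy_logprob(start, end):
    """Toy parser: one-token sentences have probability one."""
    if end - start == 1:
        return 0.0
    return -0.5 if (start, end) in {(0, 2), (2, 4)} else -5.0


print(segment(4, 4, toy_logprob, set(), 0.0, 0.0))
print(segment(4, 4, toy_logprob, {2, 4}, math.log(0.5), 1.0))
```

With the toy scores, the first call (no tokenizer information, no sentence cost) degenerates into one-token sentences, while the second call recovers the two two-token sentences. The inner loop evaluates sentences of lengths 1 to max_len ending at every position, i.e. about L(L+1)/2 tokens per position, which is the source of the circa L^2/2 factor.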

Experiments and Results
The official CoNLL 2017 UD Shared Task evaluation was performed using the TIRA platform (Potthast et al., 2014), which provided a virtual machine for every participant's system. During test data evaluation, the machines were disconnected from the internet, and reset after the evaluation finished. This way, the entire test sets were kept private even during the evaluation.
In addition to official results, we also report results of supplementary experiments. These were evaluated after the shared task, using the released test data (Nivre et al., 2017b). All results are produced using the official evaluation script.
Because only plain text (and not gold tokenization) is used as input, all results are in fact F1-scores and always take tokenization performance into account.
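The reason why the metrics become F1-scores can be illustrated with a simplified sketch that is not the official evaluation script: when system tokens need not coincide with gold tokens, items are compared by their character spans, and precision and recall differ. The span representation below is an assumption.

```python
def f1_score(gold, system):
    """F1 over two collections of items, e.g. (start, end) character spans."""
    gold, system = set(gold), set(system)
    correct = len(gold & system)
    if correct == 0:
        return 0.0
    precision = correct / len(system)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Text "cannot.": gold tokens are can|not|. while the system failed to
# split "cannot" and produced cannot|.
gold_tokens = [(0, 3), (3, 6), (6, 7)]
system_tokens = [(0, 6), (6, 7)]
print(f1_score(gold_tokens, system_tokens))
```

For parsing metrics such as LAS, an item would additionally have to carry the head and the dependency relation, so a wrongly tokenized word can never count as correctly attached.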
The complete UDPipe 1.2 Participant System scores are shown in Table 2. We also include LAS F1-score of the UDPipe 1.1 Baseline System for reference. Note that due to time constraints, some UDPipe 1.2 Participant System submitted models did not generate any XPOS and lemmas. In these cases, we show XPOS and lemmatization results using post-competition models and typeset them in italic.

Table 5: Joint segmentation and parsing in UDPipe 1.2 Participant System, optimized to maximize parsing likelihood, in comparison with sequential segmentation and parsing.

In order to make the extensive results more visual, we show relative difference of baseline LAS score using the grey bars (on a scale that ignores 3 outliers). We use this visualization also in later tables, always showing relative difference to the first occurrence of the metric in question.
The effect of enlarging training data using other treebanks of the same language (Section 4.2) is evaluated in Table 3. We include only those treebanks in which the enlarged training data result in better LAS score and compare the performance to cases in which only the original training data is used.
The impact of tokenizer dimension 64 compared to dimension 24 can be found in Table 4. We also include the effect of not using the suffix rules for multi-word token splitting, and not using multi-word token splitting at all. As expected, for many languages the dimension 64 does not change the results, but it yields superior performance for languages with either difficult tokenization or sentence segmentation.
The improvement resulting from joint sentence segmentation and parsing is evaluated in Table 5. While the LAS and UAS F1-scores of the joint approach improve, the sentence segmentation F1-score deteriorates significantly.
The overall effect of search-based oracle with various transition systems on parsing accuracy is summarized in Table 6. The search-based oracle improves results in all cases, but the increase is only slight if a dynamic oracle is also used. Note however that dynamic oracles for non-projective systems are usually either very inefficient (for link2, only an O(n^8) dynamic oracle is proposed in Gómez-Rodríguez et al. (2014)) or not known (as is the case for the swap system).
Furthermore, if only a static oracle is used, partially or fully non-projective systems yield better overall performance than a projective one. Yet, a dynamic oracle improves performance of the projective system to the extent that it yields better results (which is further improved by also utilizing a search-based oracle).
The influence of beam size on UAS and LAS scores is analyzed in Table 7. According to the results, tuning the beam size for every treebank independently is worse than using a large beam size all the time.
Finally, model size and runtime performance of individual UDPipe components are outlined in Table 8. The median of complete model size is circa 13MB and the speed of full processing (tokenization, tagging and parsing with beam size 5) is approximately 1700 words per second on a single core of an Intel Xeon E5-2630 2.4GHz processor.

Binary tools as well as bindings for C++, Python, Perl, Java and C# are provided. As our future work, we consider using deeper models in UDPipe for tokenizers, POS taggers and especially for the parser.