RACAI’s Natural Language Processing pipeline for Universal Dependencies

This paper presents RACAI’s approach, experiments and results at the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. We handle raw text and we cover tokenization, sentence splitting, word segmentation, tagging, lemmatization and parsing. All results are reported under strict training, development and testing conditions, in which the corpora provided for the shared task are used “as is”, without any modifications to the composition of the train and development sets.


Introduction
This paper describes RACAI's entry for the CoNLL 2017 Shared Task on Universal Dependencies parsing. We represent the Research Institute for Artificial Intelligence in Bucharest, Romania. The shared task concerns processing raw text with the goal of automatically inferring word dependencies. While some approaches require only segmented (tokenized) text, parsing methods that depend on rich feature sets (which is our case) implicitly require that the text is tokenized, POS tagged and lemmatized. The Universal Dependencies (UD) corpus (Nivre et al., 2016, 2017a) uses 3 distinct layers of analysis: (a) a Universal Part-of-Speech layer (UPOS) (Petrov et al., 2011); (b) a language-specific part-of-speech layer (XPOS) and (c) a list of language-dependent morphological attributes (Zeman, 2008). Also, in the current version of UD, tokenization and word segmentation require different handling strategies (see section 3.2 for details). In what follows we provide an overview of our system's architecture (section 2) and a detailed description of each module (section 3) used in our processing pipeline, followed by its evaluation (section 4) (Nivre et al., 2017b) on the TIRA platform (Potthast et al., 2014). Though Syntaxnet (Weiss et al., 2015) models were also available, some discrepancies in the tokenization and word-segmentation methodology made the comparison impossible (mainly because Syntaxnet's output was incompatible with the evaluation script). This work focuses on presenting our system's technical details and individual results. The full comparison between competing systems, as well as the baseline values obtained by UDPipe v1.1 (Straka et al., 2016), are available in Zeman et al. (2017).
At runtime, the input of the system is a raw text file. As the file is not sentence split, all newline characters are removed and the text is passed as a single long string to the first module: Tokenization and Sentence Splitting (Tok/SS) (section 3.1).
Depending on the language, some files are already tokenized, in which case we perform only sentence splitting; however, for most languages, we perform both tokenization and sentence splitting in a single pass. At this point, we obtain a file in the conllu format, which is passed to the Compound Word module (section 3.2). This module looks for words that can (and should) be split into two or more tokens. For example, in German "im" becomes "in dem", and in Polish "kupiłbym" becomes "kupił by m". Compound words are added as new lines in the conllu file. Next, language-independent part-of-speech (UPOS) tags are added by the Tag.UPOS module (section 3.3). Based on the information available so far, the following modules all run in parallel: the Lemma (section 3.4), Tag.XPOS and Morphological Features modules add lemmas, XPOS tags and morphological features, respectively. The XPOS and Morphological Features modules are language dependent and are described together in section 3.5. Finally, the Parser module (section 3.6) adds the dependencies to the output conllu file.

System Architecture
For the training process we used the available training data in the conllu format. The conllu files provided by the organizers contained "detokenization" information (a SpaceAfter=No flag added to every token that was not followed by a space in the original text) and morphological analysis layers: words, lemmas, part-of-speech, language-dependent part-of-speech, morphological attributes and word dependencies (the index of the head of each word, as well as the relationship type).
Further details about each information layer will be provided later in the paper, when we introduce the individual processing modules and describe our feature extraction and labeling strategies. During training we use a separate script responsible for preparing the custom training and development sets. Depending on the memory and CPU requirements, we either trained the models sequentially (e.g. the parser is memory expensive and the linear classifier for part-of-speech tagging is multi-threaded) or trained multiple models in parallel (the morphological analysis requires far less memory and CPU time; in practice, the decision trees for all languages in the competition were built and pruned in parallel).

System Description
Before we proceed with the description of the modules we must note that some of our methods rely on decision trees (DT) that are built using a custom-designed algorithm that relies on the constituency matrix to speed up the computation of the Information Gain (IG) (Equation 1). Also, after the initial trees are built, we use the available development sets to prune them, which reduces possible overfitting on the training data and leads to improved performance.
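Under the standard definitions of information gain and Shannon entropy, Equations 1 and 2 take the following form. The symbols $T(i)$, $C$ and $p_c$ are notation assumed for this sketch: the subsets of $S$ induced by the values of feature $i$, the set of output classes, and the fraction of instances in $S$ labeled $c$.

```latex
\begin{align}
IG(S, i) &= H(S) - \sum_{t \in T(i)} P(t)\, H(t), \qquad P(t) = \frac{|t|}{|S|} \tag{1} \\
H(S) &= -\sum_{c \in C} p_c \log_2 p_c \tag{2}
\end{align}
```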
where i represents an input feature, H(S) (Equation 2) is the entropy of the initial set S, H(t) the entropy of subset t, and P (t) is the fraction of elements in t over the entire |S|. We note |S| as the number of instances in S.
We note this because the above-mentioned methodology has not been presented elsewhere and we feel that it is an important aspect of our approach (see section 4 for details regarding the model sizes and section 5 for comments).
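As an illustration of the information gain criterion, the following is a minimal sketch that assumes the standard entropy-based definition and categorical features. The function names are hypothetical and the code does not reproduce our matrix-based speed-up or the pruning step.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """H(S) minus the weighted entropy of the subsets induced by one feature."""
    subsets = defaultdict(list)
    for value, label in zip(feature_values, labels):
        subsets[value].append(label)
    total = len(labels)
    # Each subset t contributes P(t) * H(t), with P(t) = |t| / |S|.
    remainder = sum(len(t) / total * entropy(t) for t in subsets.values())
    return entropy(labels) - remainder
```

A tree builder would evaluate `information_gain` for every candidate feature at a node and split on the one with the highest value.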

Tokenization and Sentence Splitting
The first module in the pipeline is the Tokenizer and Sentence Splitter. Depending on the training data, we actually have 4 distinct tokenizers/sentence splitters merged into our module: the standard Tok/SS for corpora that had punctuation and were not already tokenized (this is the case for most languages), the character-level Tok/SS for Japanese and Chinese, and two versions of the sentence splitter for training data with pre-tokenized sentences: one for corpora that had punctuation (e.g. Danish, Finnish FTB and Slovene SST) and one for corpora that had no punctuation (e.g. Gothic, Latin Proiel, Ancient Greek Proiel and Old Church Slavonic).
The main tokenizer and sentence splitter is based on DTs. Tokenization and sentence splitting are independent models and are run in parallel. Based on a number of features (described below), the DT tokenizer model chooses between 4 classes: SPLIT_LEFT, SPLIT_RIGHT, SPLIT_LEFTRIGHT and NONE, meaning that it should split to the left of the current character, split to the right, split on both sides of the character or not split at all. The DT sentence splitter has only 2 classes: SPLIT and NONE. Training is done in the following manner: (1) initially, we look for specific characters where we might have a word split, marked in the conllu train file by "SpaceAfter=No"; we then normalize the frequency of these characters; iterating again over the training data, we keep only the characters that are most likely to initiate a split, based on the normalized frequency; for sentence splitting we perform the same process, the only difference being that we pre-seed the character list with a number of punctuation characters such as . - ! ?, because we had cases where, for example, ? was not frequent enough to remain in the split list, though it was a valid sentence splitter; (2) we extract the following features for each split character: the current letter, 3 characters before and after (4 for sentence splitting), and, for the previous and next words, markers indicating whether the word is punctuation only, whether it ends with punctuation, whether it contains punctuation, whether it is uppercase, whether it is capitalized, and the number of periods in the word; (3) we train the DT model and prune the tree based on the dev set. Another small optimization worth mentioning is that we replaced all digits with zeroes to reduce variability in the training data.
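The feature extraction in step (2) can be sketched as follows. This is only an illustration under our own assumptions: the function names, the feature keys, the "_" padding symbol and the use of ASCII `string.punctuation` as the punctuation inventory are all hypothetical choices.

```python
import re
import string


def normalize_digits(text):
    """Replace every digit with '0' to reduce variability."""
    return re.sub(r"\d", "0", text)


def char_window(text, pos, size=3, pad="_"):
    """Current character plus `size` characters on each side (padded at the edges)."""
    feats = {"cur": text[pos]}
    for k in range(1, size + 1):
        left, right = pos - k, pos + k
        feats[f"c-{k}"] = text[left] if left >= 0 else pad
        feats[f"c+{k}"] = text[right] if right < len(text) else pad
    return feats


def word_flags(word):
    """Word-level markers used for the previous and next words."""
    punct = [ch in string.punctuation for ch in word]  # ASCII only: an assumption
    return {
        "punct_only": bool(word) and all(punct),
        "ends_punct": bool(word) and punct[-1],
        "has_punct": any(punct),
        "is_upper": word.isupper(),
        "is_capitalized": word[:1].isupper()
        and not any(ch.isupper() for ch in word[1:]),
        "n_periods": word.count("."),
    }
```

For sentence splitting the same window function would be called with `size=4`.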
The symbol based tokenizer is targeted for Japanese and Chinese where we have to look after each symbol and decide whether to split. The features are simply the current symbol and one symbol to the left and right, with two possible classes: SPLIT (split after the current symbol) and NONE.
The last two Tok/SS variants address languages that are already tokenized, performing only sentence splitting. For the languages that had no punctuation (e.g. Gothic), the features are 5 characters to the left and 3 to the right (including the current character). For the languages that had punctuation (e.g. Danish) the features are token based: the current token plus 2 tokens before and after; for each token we mark all the features the main tokenizer has for words (whether the token is punctuation only, etc.). Overall, we compared our results on the dev sets against UDPipe's, obtaining good results. However, for a few languages it seems that the decision tree approach is not optimal, yielding low performance. For English, Bulgarian, Korean, Portuguese and 14 other smaller languages, at runtime we directly used the UDPipe tokenization and sentence split conllu files, bypassing this module. This is the only place in the system where we use data that was not generated by us.

Compound Words
The compound words module has the task of word expansion. For example, in German "im" is the contraction of "in dem"; in Turkish, "muhabbetliydi" stands for "muhabbet li ydi". While word expansion could be solved relatively well by using a dictionary (search for the key and replace it with the expansion tokens), this would fail for unseen words and would ignore split/no-split decisions that depend on POS tags. Our intuition was that we need to represent generated word expansions as parts of the original string (longest common substring or LCS) plus new terminations, before and/or after the LCS. For example, currency tokens like "7000€" should always expand as the first variable part plus the last symbol separately if that last symbol is "€", while the opposite example "€7000" should be represented as a static first symbol followed by a variable string (here the numeric amount).
Needless to say, the process of token decomposition into words carries a great weight in the accuracy of the system, because all other modules depend on it: tagging, morphological analysis and parsing. Preserving the head or the tail of the original token and concatenating strings at the beginning or end requires different labeling strategies, because any of the words in the decomposition can be written either by keeping the head of the original token and concatenating a suffix or by adding a prefix and concatenating the tail of the original token. More often than not, one strategy will yield a larger number of unique output labels in the training data than the other. As such, the actual difficulty is determining which labeling strategy would be more accurate. We attempted to determine the labeling strategy as follows:
• First, take the training data and generate output labels using all (desired) tagging strategies, generating multiple label sets;
• Then, measure the system entropy for each of the previously generated label sets;
• Finally, use the tagging strategy which generated the lowest-entropy system for training the classifier.
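The selection procedure above can be sketched as follows, assuming that "system entropy" is the Shannon entropy of the label distribution produced by each strategy. The function names and the toy labels are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the labels one strategy generates."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())


def pick_strategy(label_sets):
    """label_sets maps a strategy name to the labels it yields on the training data."""
    return min(label_sets, key=lambda name: label_entropy(label_sets[name]))


# Toy label sets: the second strategy reuses one label, so its entropy is lower.
toy = {
    "FS": ["2+m", "3+m", "4+m", "5+m"],
    "FE": ["1+m", "1+m", "1+m", "1+m"],
}
best = pick_strategy(toy)
```

A strategy that maps many training tokens to the same few labels has lower entropy and gives the classifier fewer, better-populated classes to learn.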
In our approach the labeling scheme was in the form of "n+<string>" where n is a number and <string> is a string to keep. Consider the following example, where a is the word that will be expanded into two words b c. There are four different output encodings we can choose from: FS_KS, FS_KE, FE_KS and FE_KE. FS means "from start" and denotes that the number n is an absolute index, which measures the character span from the beginning of the word; FE means "from end" and denotes that n is a relative index, which expresses a character span relative to the size of the original token. KS stands for "keep start" and means that the head of the token should be preserved, while KE stands for "keep end" and means that the tail of the original token will be used in building the decomposed token. Now, suppose we write word a as letters $a_1 a_2 a_3 a_4 a_5$ and its first expansion b as $b_1 b_2 b_3 b_4$, where $a_{1-2}$ is equal to $b_{1-2}$ and $a_5$ is equal to $b_4$. To obtain the first encoding, FS_KS, we find the longest common substring from the start of a and b, which is $b_{1-2}$, of length 2, keeping the rest of b; so, the FS_KS label encoding is 2+<$b_{3-4}$>. Encoding FS_KE means finding the longest common substring from start, while keeping the end: 3 + $b_{4-n-3}$.
Our algorithm only has to choose between the FS and FE labeling schemes; the decision between KS and KE is subordinate and depends only on the LCS criterion (for each word in the decomposition, we keep the head or the tail of the original token, whichever provides the larger character overlap).
After determining the best labeling strategy we generated a decision tree using the following features: first four letters, last four letters, wordform (if occurrence frequency was higher than 10 in the training data).
To show how important choosing the appropriate labeling strategy is, on the Hebrew development set we obtained 93% F-score using the FE notation (automatically selected by the system), versus 88% when we forced the system to use FS.

POS Tagging
The part-of-speech inventory used in this step refers to UPOS tags, an inventory which contains only 17 unique labels. Our tagging methodology is fairly standard: we use a Conditional Random Field to estimate the probability of the i-th tag ($t_i$), based on the previous tag and a rich set of features ($f_i$) (Equation 3). During runtime, we use Viterbi to obtain the optimal sequence of labels.
The set of features is composed of (a) the lowercased wordform, (b) a large number of letter n-grams, with n ranging from 2 to 5 and (c) a feature which we refer to as "writing style". All the features are extracted from a window of 3 words (centered on the current word). The "writing style" feature takes 4 values:
• ALL-CAPS: the word is written in CAPS;
• ALL-LOWER: the word contains only lower-cased symbols;
• F-UPPER: the word starts with a capital letter and all other symbols are lower-cased;
• F-UPPER-START: similar to F-UPPER, only this time the word is also the first token of the sentence.
We use a large number of character combinations (90), which includes cross-word letter n-grams and was manually obtained using a trial-and-error process.
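A minimal sketch of the "writing style" feature and of the 3-word window follows. The function names and feature keys are hypothetical, and the "OTHER" fallback for words that match none of the four values (mixed case, digits) is our own assumption for illustration.

```python
def writing_style(word, is_sentence_start=False):
    """Map a word to one of the four writing-style values."""
    if word.isupper():
        return "ALL-CAPS"
    if word.islower():
        return "ALL-LOWER"
    if word[:1].isupper() and word[1:].islower():
        return "F-UPPER-START" if is_sentence_start else "F-UPPER"
    return "OTHER"  # hypothetical fallback, not one of the four listed values


def window_features(words, i):
    """Lowercased wordform and writing style for a 3-word window centered on i."""
    feats = {}
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(words):
            feats[f"lower[{offset}]"] = words[j].lower()
            feats[f"style[{offset}]"] = writing_style(words[j], j == 0)
    return feats
```

The letter n-gram features would be added to the same dictionary before passing it to the CRF.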
To prevent overfitting and obtain a robust model for out-of-vocabulary (OOV) words, we only include a wordform as a feature, if that word's occurrence frequency is higher than a threshold (k) in the training data 1 .
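The frequency filter can be sketched as follows. The helper names are hypothetical and the threshold value in the usage note is purely illustrative, not the value used in our experiments.

```python
from collections import Counter


def frequent_wordforms(sentences, k):
    """Lowercased wordforms that occur more than k times in the training data."""
    counts = Counter(word.lower() for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n > k}


def wordform_feature(word, vocab):
    """Emit the wordform feature only for sufficiently frequent words."""
    lowered = word.lower()
    return {"lower": lowered} if lowered in vocab else {}
```

With `vocab = frequent_wordforms(train_sentences, k=2)`, a rare word yields no wordform feature, so the model has to rely on its letter n-grams and writing style instead.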

Lemmatizer
Lemmatization is done in two steps. First, the surface wordform and its UPOS (language-independent part of speech) are looked up in a dictionary created at train time. If the surface form and UPOS match an entry, the corresponding lemma is used (if there is more than one lemma for the surface form and UPOS pair, we prefer the most frequent). If no entry is found, we attempt to create the new lemma using a DT. Given a word, we extract the UPOS, the first 4 letters and the last 4 letters as features. The first and last letters may overlap if the word is shorter than 8 letters, or can be null (encoded as "_") if the word is shorter than 4 letters. The output classes are strings of the form "n+<string>", where n represents how many letters to cut from the end of the surface form of the word, and <string> is the string to append to what remains. For example, given the word forgotten, which is a verb, with features f o r g and t t e n, the output class would be 5+et, meaning that we need to cut the last 5 letters (to obtain the longest common prefix) and add et to obtain the lemma forget.
We note that the number of output classes varies from a few hundred to several thousand depending on the language, but, even for this large number, the results of the DT seem accurate.
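The "n+<string>" output classes can be sketched as follows. The function names are hypothetical, and the exact padding layout for words shorter than 4 letters is an assumption made for illustration.

```python
def encode_lemma_label(form, lemma):
    """Build the 'n+<string>' class: cut n trailing letters, then append <string>."""
    prefix = 0
    for a, b in zip(form, lemma):
        if a != b:
            break
        prefix += 1
    return f"{len(form) - prefix}+{lemma[prefix:]}"


def apply_lemma_label(form, label):
    """Apply an 'n+<string>' class to a surface form to obtain the lemma."""
    cut, suffix = label.split("+", 1)
    return form[: len(form) - int(cut)] + suffix


def lemma_features(form, upos, pad="_"):
    """UPOS plus the first 4 and last 4 letters, padded for short words."""
    first = list(form[:4]) + [pad] * (4 - len(form[:4]))
    last = [pad] * (4 - len(form[-4:])) + list(form[-4:])
    return [upos] + first + last
```

For the example above, `encode_lemma_label("forgotten", "forget")` yields the class `5+et`, and applying that class to "forgotten" restores "forget".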

Language Specific Morphological Analysis
Language-specific morphological analysis is a two-fold process that refers to the resolution of (a) the language-dependent part-of-speech (XPOS) tag and (b) a structured set of morphological attributes (in the form of key-value pairs), which are used to encode important information such as gender, number, case etc. As a rule of thumb, the XPOS tag used in morphologically rich languages is a compact representation of the morphological attributes. For instance, the Romanian corpus from the UD data uses a standardized compact representation, which is composed of morphosyntactic descriptors (MSDs) (Erjavec, 2004). Given the similarities between language-specific tags and UD morphological attributes, in our approach we used the same feature sets for both tasks. The features are composed of: (a) the UPOS tag; (b) the first four characters of the word; (c) the last four characters of the word and (d) the previously mentioned "writing style" feature. To capture local dependencies between words we used a context window of 5 centered on the current word.
In this case, we preferred to use decision trees, mainly because of their reduced computational requirements and the small footprint of the output models.

Parsing
Once the morphological analysis is completed, our processing pipeline relies on RBGParser 2 which is a greedy hill-climbing parser, well described in Zhang et al. (2014a,b); Lei et al. (2014). In our approach, we used branch 1.1.2 of RBG, which we modified in order to be compliant with the current UD version.
The main incompatibility was generated by the presence of multiword tokens. During training we modified the data adapter of the RBGParser to skip multiword tokens and, for the runtime version, we filtered the input for RBG to exclude multiword tokens and we re-aligned the output of RBG with the unparsed dataset, to restore multiword tokens and provide an output compatible with the current UD standard.
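The filtering and re-alignment steps can be sketched as follows for a single sentence. This relies on the CoNLL-U convention that multiword tokens carry an ID range such as "1-2"; the function names are hypothetical, and the sketch assumes the input has no comment lines or empty nodes.

```python
def split_multiword(lines):
    """Separate word lines from multiword-token lines of one CoNLL-U sentence.

    Returns (word_lines, mwt), where mwt maps the first word ID covered by a
    multiword token to the original range line.
    """
    words, mwt = [], {}
    for line in lines:
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:
            mwt[int(token_id.split("-")[0])] = line
        else:
            words.append(line)
    return words, mwt


def restore_multiword(parsed_lines, mwt):
    """Re-insert each multiword-token line before the first word it covers."""
    out = []
    for line in parsed_lines:
        word_id = int(line.split("\t", 1)[0])
        if word_id in mwt:
            out.append(mwt[word_id])
        out.append(line)
    return out
```

In this sketch the parser only ever sees `word_lines`, and its output is passed through `restore_multiword` to recover a file that follows the UD layout.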
The RBG models were built using the default parameters for the "standardModel" predefined configuration, on which we added automatically extracted word embeddings (Mikolov et al., 2013), obtained using word2vec 3 . The word embeddings were computed by applying the Continuous Bag of Words (CBOW) model on the permitted raw-text resources.
Depending on the language, for the computation of word vectors we compiled monolithic corpora composed of Wikipedia dumps (whenever available) and raw text from the UD training data.
Table 1 presents the types of the individual models and their sizes. We use decision trees for most of the tasks, a CRF model for UPOS tagging and the models that RBGParser creates for the final task of parsing. The DT models are very small, even though they are written in text mode: the largest average size, for morphological features, is 170 KB. We also note that while there are 64 models for the 64 languages we had training data for, we only created 22 models for the languages that actually had compound words and 56 for lemmatization. The 57 tokenization models do not include the symbol tokenization models for Japanese and Chinese, which are even smaller; also, there were other languages that were pre-tokenized, so no models were created for them (details in section 3.1). The CRF models used for UPOS tagging are significantly larger, with an average of 121MB. Still, with a standard deviation of 100MB, we can say that most models are smaller than 250MB. The largest models are created by RBG, with the model for Czech reaching an impressive 3.16GB. On average, RBG creates models of around 1GB.
Moving on to the system results, we evaluate each task incrementally, starting from tokenization. We obtained a macro-averaged F1 score of 98.58, 0.37 points below first place. The decision tree approach used, while simple, brought interesting results, such as first place for Czech (CLTT), Italian, Irish and Russian.
For sentence splitting, also based on DT, we obtained an average F1 score of 87.52 versus the top score of 89.10. We obtained first place on a number of languages such as Hebrew, Basque and Latin. However, on Latvian we obtained last place with an F1 of 93.30 versus 98.90. We note that no language-dependent tuning was performed, either for tokenization or for sentence splitting. The same features were chosen for all languages. While we did perform tree pruning based on the dev sets, we did not vary and choose the best feature set for each language (e.g. tokenization on some languages was better with a context of 2,2, while for others with a context of 4,1; we used 3,3 for every language that had punctuation and was not pre-tokenized).
In word segmentation we obtained a score of 98.39 vs 98.81, a difference of only 0.42 points. Using the DT classifier brought us top places in several languages such as Czech (CLTT), Danish, Norwegian (Bokmaal) and Russian.
Lemmatization, also based on decision trees, unfortunately worked really well on only a small number of languages. For example on Farsi we were the first with a 1.5 point difference over second place. Overall, we obtained an F1 score of 77.45 versus the top performer that obtained 83.74.
The morphological features average F1 score of 70.8 brings us relatively close to the top score of 73.92. Again, while not a best performer, the decision tree algorithm we used has shown very good performance compared to more complex algorithms in the competition.
On the language independent parts of speech (

Conclusions
The CoNLL 2017 UD Shared Task has been a learning experience for us. Considering that so far we have worked mostly on Romanian and English, and only up to the level of POS tagging, we managed to draw a number of conclusions, a few of which are outlined below:
• while the DT algorithm has, on average, below state-of-the-art performance, it is very close to the top performers. It sometimes achieves first place on tasks like tokenization or word expansion, which follow a simpler and more predictable set of rules. We used a decision tree model because it is a predictable and understandable model that, for this initial set of experiments, allowed us to obtain significant insight into how we should create features and output labels, something that using a neural network would not allow.
• sticking with a method and trying out variations can lead to noticeable improvements. For example, pruning the character list on which to attempt a word split for tokenization based on normalized frequency yielded a more balanced training set. This has led to better results than simply asking whether to split on each character in the unpruned list (an unbalanced training set with most examples being "no split").
• sometimes intuition does not work. Initially, we hypothesized that for the morphological feature prediction task it was natural to attempt to predict each feature individually: we would predict, for every word irrespective of its part of speech, all available features separately. Each feature had an extra class of NONE, meaning that it was not appropriate for that particular word, so it would not show up in the final composition of features. The results were actually significantly worse than trying to predict all features at once, as a single class output label, even if the number of such labels was much higher, as it contained all combinations of morphological features seen in the training data.
As we viewed the CoNLL 2017 UD Shared Task as a learning experience, we attempted all tasks sequentially, even though the main goal of the challenge was the last task: parsing. The only place we used baseline UDPipe files was in the tokenization and sentence splitting where our decision tree approach with no tuning produced results significantly below the baseline. However, we kept our Tok/SS module even for languages where we were 5 points below the baseline, to see what would be the results on the test data. That basically meant that any error in the initial task would be partially propagated in the next one in our processing chain, as each module relies on information from any number of the preceding modules, marginally explaining some of the lower scores in later tasks.
Regarding the system itself, we have already created a fully functioning on-line version available at our NLP Tools Website 4 . During the weeks after the shared task ended, we replaced the decision tree algorithm with our own implementation of a linear classifier and obtained superior results. However, the footprint of the model obtained using the linear classifier is, in some cases, 1000 times larger than that of the decision tree classifier (e.g. the Ancient Greek XPOS linear model size is 4.5GB, whereas the DT model is only 4MB). Experiments using a deep neural network (DNN) architecture, trained to predict attributes and XPOS based on character-level features, were also performed. Though this approach provided state-of-the-art results for some languages, we found it difficult to tune hyper-parameters for all languages. However, for the DNN approach, the model footprint and performance figures (accuracy and computational time) were very appealing.
While one might consider that training independent models for each morphological attribute would provide better results, the decision tree, linear classifier and DNN models all performed significantly better when trained to output all the morphological features at once (softmax one-of-n encoded, not multitask learning). Additionally, we experimented with multitask learning (i.e., using a common network structure, followed by multiple softmax layers) (Collobert and Weston, 2008) and observed that it did not improve the learning process, at least on the corpora and feature sets we used. Further tuning will be done, and performance figures evaluated by the UD evaluation script will be reported on the above-mentioned website.