NLP-Cube: End-to-End Raw Text Processing With Neural Networks

We introduce NLP-Cube: an end-to-end Natural Language Processing framework, evaluated in CoNLL’s “Multilingual Parsing from Raw Text to Universal Dependencies 2018” Shared Task. It performs sentence splitting, tokenization, compound word expansion, lemmatization, tagging and parsing. Based entirely on recurrent neural networks and written in Python, this ready-to-use open source system is freely available on GitHub. For each task we describe and discuss its specific network architecture, closing with an overview of the results obtained in the competition.


Introduction and Shared task description
NLP-Cube 1 is a freely available Natural Language Processing (NLP) system that performs: sentence splitting, tokenization, lemmatization, tagging and parsing. The system takes raw text as input and annotates it, generating a CoNLL-U 2 format file. Written in Python, it is based entirely on recurrent neural networks built in DyNET (Neubig et al., 2017). The paper describes each NLP task and its architecture, motivating our choices and comparing them to the current state of the art 3.
1 https://github.com/adobe/NLP-Cube
2 The CoNLL-U format is well described on the official Universal Dependencies (UD) website and is the standard format of the UD Corpus.
3 We must note that in the official runs our system was affected by a bug which had a negative impact on the quality of the lexicalized features (see Section 2.1 for details). Because we were unable to retrain the models in time for the Shared Task's deadline (at the time of submitting this article we are still retraining them), we are reposting all new results on the GitHub project page.
The "Multilingual Parsing from Raw Text to Universal Dependencies" 2018 Shared Task (Zeman et al., 2018) targets primarily learning to generate syntactic dependency trees and secondarily the end-to-end text preprocessing pipeline (from raw text segmentation up to parsing), all in a multilingual setting. The task is open to anybody, and participants can choose whether to focus on parsing or to attack the end-to-end problem. The task itself is not simple, as it involves typologically different languages, some of them with little or even no training data. Based on the Universal Dependencies (UD) Corpus 4 (Nivre et al., 2016), participants have to target 82 languages, with datasets annotated in the CoNLL-U format. Their systems, given raw text as input, have to correctly: segment a text into sentences (marked as SS in the results table, from Sentence Splitting), segment sentences into words (marked as Tok, from Tokenization), expand single tokens/words into compound words (marked as Word), and, for each word, predict its universal part-of-speech (UPOS), language-dependent part-of-speech (XPOS), morphological attributes (Morpho), and dependency link to another word and its label, evaluated with 5 different metrics named CLAS, BLEX, MLAS, UAS, and LAS. Each of these metrics is well described in the Shared Task; for brevity, in this paper we will focus mostly on UAS (Unlabeled Attachment Score), which measures only the linking to the correct word, and LAS (Labeled Attachment Score), which measures both linking to another word and correctly predicting the link's label. Section 4 presents NLP-Cube's results for all these metrics for all languages in the Shared Task.
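To make the difference between the two attachment scores concrete, the following is a minimal sketch, assuming gold and predicted words are already aligned one-to-one. The official evaluation also has to align system words to gold words, which this sketch ignores, and the function name `attachment_scores` is ours, not part of the Shared Task tooling.

```python
from typing import List, Tuple

# Each word is described by (head index, dependency label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as fractions, assuming identical tokenization."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold:
        return 0.0, 0.0
    # UAS counts words whose head is correct, regardless of the label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS counts words whose head and label are both correct.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred_arcs = [(2, "nsubj"), (0, "root"), (2, "nmod")]
uas, las = attachment_scores(gold_arcs, pred_arcs)
```

In this toy sentence every head is correct but one label is wrong, so UAS is 1.0 while LAS is 2/3.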
The paper is organized as follows: in section 2 we first discuss generics, then move to each particular task. We further present some training aspects of our system in section 3, followed by results in section 4, closing with section 6 on conclusions.

Processing pipeline
The end-to-end system is a standard processing pipeline having the following components: a sentence splitter, tokenizer, compound word expander (specific to the UD format), lemmatizer, tagger and a parser.

Input features
Our system is able to work with both lexicalized features (word embeddings and character embeddings) and delexicalized, morphological features (UPOS, XPOS and ATTRS). However, we observed that when using morphological features as input (for example using POS tags as input for parsing), the performance of the end-to-end system generally degrades. This is mainly because, while training is done using gold-standard morphological features (e.g. the parser is trained on gold POS tags), at runtime these features are predicted at an earlier step and then used as "gold" input (e.g. the parser would be given tagger-predicted POS tags as input). There are several ways in which this effect can be mitigated with varying degrees of success; in our approach we preferred to use only lexicalized features as input for all our modules, with the exception of the lemmatizer, which is heavily dependent on morphological information.
In what follows, when we refer to lexicalized features, we mean a concatenation of the following:
1. external word embeddings: 300-dimensional standard word embeddings using Facebook's FastText (Bojanowski et al., 2016) vectors 5, projected to a 100-dimensional space using a linear transformation; to these we added a trainable <UNK> token;
2. holistic word embeddings: these represent all words in the trainset which have a frequency of at least 2. They are 100-dimensional trainable embeddings, also including a <UNK> token for unseen tokens in the testset;
3. character word embeddings: a 100-dimensional word representation generated by applying a network over the word's symbols.
The character word embeddings are obtained by applying a two-layer bidirectional LSTM network (size 200, using 0.33 dropout only on the recurrent connections) on a word's characters/symbols (see Figure 1). We then concatenate the final outputs from the second-layer (top) forward and backward LSTM with an attention vector (totaling 400 values: 100 from the last forward state, 100 from the last backward state and 200 from the attention). The attention is computed over all the network states, using the final internal states of the top forward and backward layers for conditioning. Let f_{k,(1,n)} be the forward states of the top (k-th) layer in the character network and b_{k,(1,n)} be its backward counterpart. If f_{k,n} is the forward state corresponding to the last character of a word and b_{k,1} is the backward state of the first letter of that word, then the character-level embeddings (E_c) are computed as in Equations 1, 2 and 3. Finally, we linearly project E_c to a 100-dimensional vector. Note that we use f* and b* for the internal states of the LSTM cells and that a missing superscript means the variables refer to the output of the LSTM cells.
5 Available on github.com/facebookresearch/fastText
The morphological features are computed by adding three distinct (trainable) embeddings of size 100: one for UPOS, one for XPOS and one for ATTRS.
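The way the two feature types are assembled can be sketched as follows. This is an illustrative sketch in plain Python with random placeholder vectors, not the DyNET implementation: the function names are ours, the projection omits any bias term, and we read "adding" in the sentence above as element-wise summation, which is an assumption.

```python
import random
from typing import List

Vector = List[float]


def project(vec: Vector, weights: List[Vector]) -> Vector:
    """Linear map with one output value per weight row (no bias, for brevity)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def lexicalized(external: Vector, holistic: Vector, char_level: Vector,
                weights: List[Vector]) -> Vector:
    """Concatenate projected external, holistic and character embeddings."""
    return project(external, weights) + holistic + char_level


def morphological(upos: Vector, xpos: Vector, attrs: Vector) -> Vector:
    """Element-wise sum of the three 100-dimensional tag embeddings."""
    return [u + x + a for u, x, a in zip(upos, xpos, attrs)]


rng = random.Random(0)


def rand_vec(n: int) -> Vector:
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# 300-to-100 projection matrix (placeholder values).
W = [rand_vec(300) for _ in range(100)]
lex = lexicalized(rand_vec(300), rand_vec(100), rand_vec(100), W)
morph = morphological([1.0] * 100, [2.0] * 100, [3.0] * 100)
```

Under these assumptions the lexicalized vector has 300 values (three 100-dimensional parts concatenated), while the morphological vector stays at 100 values.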

Tokenization and sentence splitting
For most languages in the Shared Task our system uses raw text as input. Exceptions apply to the low-resourced languages for which we had little or no training data. In these cases we use the input provided by the UDPipe baseline system (Straka et al., 2016) which is already in CoNLL-U format.
For tokenization and sentence splitting we use the same network architecture (see Figure 2) and labeling strategy for all languages. The process is sequential: first we run sentence splitting and then we perform tokenization on the segmented sentences. In both steps, we use identical networks; arguably we could achieve both tasks in a single pass over the input data (the same architecture could perform both sentence splitting and tokenization). However, the best performing network parameters for sentence splitting are not identical to the best performing network parameters for tokenization. With this in mind, we trained two separate models for the two tasks.
For every symbol (s_i) in the input text, the decision for tokenization or sentence splitting (after s_i) is generated using a softmax layer that takes as input 4 distinct vectors (final output states) of:
1. Forward Network: A unidirectional LSTM that sees the input symbol by symbol in natural order;
2. Peek Network: A unidirectional LSTM that peeks at a limited window of symbols 6 in front of the current symbol; the input is fed to the network in reverse order;
3. Language Model (LM) Network: A unidirectional LSTM that takes as input external word embeddings for previously generated words; it updates only when a new word is predicted by the network;
4. Partial Word Embeddings (PWE) Network: It is often the case that we are able to generate valid (known) words made up of symbols from the previously tokenized word up to the current symbol. If the joined symbols form a word that exists in the embeddings, we use these embeddings. Otherwise we use the unknown word embedding. We project the embedding using the same 300-to-100 linear transformation.
6 We set the value to 5 based on empirical observations.

Figure 2: Tokenization and Sentence Splitting
For regularization, we observed that adding two auxiliary softmax layers (with the same labels as the final layer) for the Forward Network and the Peek Network slightly reduces overfitting. Intuitively, the Forward Network should be able to tokenize/sentence split based only on the previous characters and the Peek Network should also share this trait.
Moreover, the LM Network combined with the PWE Network should be able to "determine" if it makes sense (from the Language Modeling point of view) to generate another word, based on the previous words. This is highly important for languages that don't use spaces to delimit words inside an utterance (e.g. Chinese, Japanese etc.).
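The inputs seen by the Forward, Peek and PWE networks at a given symbol can be sketched as below. This is a simplified illustration with hypothetical helper names: the real networks consume embeddings of these symbols, and we assume that "reverse order" means the symbol furthest from s_i is fed first.

```python
from typing import Dict, List

PEEK_WINDOW = 5  # window size reported for the Peek Network


def forward_input(text: str, i: int) -> str:
    """Symbols seen by the Forward Network: everything up to and including s_i."""
    return text[: i + 1]


def peek_input(text: str, i: int, window: int = PEEK_WINDOW) -> str:
    """Symbols seen by the Peek Network: the next `window` symbols, reversed."""
    return text[i + 1 : i + 1 + window][::-1]


def partial_word_vector(text: str, word_start: int, i: int,
                        embeddings: Dict[str, List[float]],
                        unk: List[float]) -> List[float]:
    """Embedding of the candidate word text[word_start..i], or the <UNK> vector."""
    candidate = text[word_start : i + 1]
    return embeddings.get(candidate, unk)


sample = "Hello world."
toy_embeddings = {"Hello": [1.0]}
toy_unk = [0.0]
known = partial_word_vector(sample, 0, 4, toy_embeddings, toy_unk)
unknown = partial_word_vector(sample, 0, 3, toy_embeddings, toy_unk)
```

At the symbol "o" of "Hello" the candidate string matches a known word, so its embedding is used; one symbol earlier ("Hell") the lookup falls back to the unknown word embedding.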
For the large treebanks in the UD corpus our tokenization method placed second, with an overall token-level score of 99.46%, the highest score being 99.51%. On the same treebanks, for sentence splitting we placed 5th, with an overall F-score of 86.83% (highest was 89.52%).

Lemmatization and compound word expansion
Lemmatization (automatically inferring a word's canonical form) and compound word expansion (automatically expanding collapsed tokens into their constituents) are similar in the sense that both start from a sequence of symbols and have the task of generating another sequence of symbols. One difference is that lemmatization is also dependent on the input word's morphological attributes and part-of-speech, whereas compound word expansion doesn't have such data available (at least not for the UD corpus and consequently not for our system). At first glance the two tasks can easily be solved using sequence to sequence models. It is important to mention that by analyzing some input examples, one can easily see that input-output sequences have monotonic alignments. This implies that the standard encoder-decoder with attention model is too complex and resource consuming for these two tasks.
We propose a method that uses an attention-free encoder-decoder model, which is less computationally expensive and, surprisingly, provides a 3-5% absolute increase in accuracy (at word level) as opposed to its attention-based counterpart.
The model is composed of a bidirectional LSTM encoder and a unidirectional LSTM decoder. Similarly to a Finite State Transducer (FST), we train a model to output any symbol from the alphabet and three additional special symbols: <COPY>, <INC> and <EOS>. During training, we use a dynamic algorithm to monotonically align the input symbols to the output symbols. Based on these alignments, we create the "gold-standard" decoder output, which aims at copying as many input characters to the output as possible, while incrementing the input cursor and emitting new symbols only as a last resort.
Finding a comprehensive example for English proves difficult (most lemmas are obtained by simply copying a portion of the input word), so we prefer to illustrate lemmatization with a Romanian example, because it allows a better exploration of the output sequence of symbols. A good example is the lemmatization process for the word "fetelor" (en.: girls), which has the canonical form "fată" (en.: girl). The alignment process will generate the following source-destination pairs of indexes: 1-1, 3-3. The pairs map only symbols that are identical in the input and output sequence. The output symbol list for the decoder to learn is: <COPY>a<INC><INC><COPY>ă<EOS> 7 .
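One way to derive such a gold-standard symbol list is sketched below. The text only says that a dynamic algorithm produces a monotonic alignment; using a longest-common-subsequence alignment here is our assumption, and the function names are hypothetical.

```python
from typing import List, Tuple

COPY, INC, EOS = "<COPY>", "<INC>", "<EOS>"


def monotonic_alignment(src: str, dst: str) -> List[Tuple[int, int]]:
    """Align identical symbols via longest common subsequence (0-based pairs)."""
    n, m = len(src), len(dst)
    # lcs[i][j] holds the LCS length of src[i:] and dst[j:].
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src[i] == dst[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    pairs: List[Tuple[int, int]] = []
    i = j = 0
    while i < n and j < m:
        if src[i] == dst[j]:
            pairs.append((i, j))
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs


def gold_ops(src: str, dst: str) -> List[str]:
    """Build the decoder target: copy aligned symbols, emit the rest literally."""
    ops: List[str] = []
    cursor = 0   # position of the input cursor
    emitted = 0  # number of output symbols produced so far
    for i, j in monotonic_alignment(src, dst):
        ops.extend(dst[emitted:j])         # symbols that cannot be copied
        ops.extend([INC] * (i - cursor))   # move the cursor to the aligned symbol
        ops.append(COPY)
        cursor, emitted = i, j + 1
    ops.extend(dst[emitted:])
    ops.append(EOS)
    return ops


example_ops = gold_ops("fetelor", "fat\u0103")
```

For "fetelor" and "fată" this sketch aligns "f" and "t" (pairs 1-1 and 3-3 in 1-based indexing) and yields a copy, a literal "a", two cursor increments, a second copy, a literal "ă" and the end-of-sequence symbol.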
Let E_{(1,n)} be the output of the encoder for a sequence of n input symbols and i be an internal index which takes values from 1 to n. At each step of the decoding process, the decoder receives f(E[i], word) as input. The function f(E[i], word) is generically defined for both lemmatization and compound word expansion: it uses the output of the encoder for position i and, for lemmatization, it concatenates this vector with morphological features (see Section 2.1 for details). The compound word expander directly uses E[i] as input for the decoder.
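A decoding loop consistent with this description can be sketched as follows. It is a hedged reconstruction rather than the authors' listing: `predict` is a hypothetical stand-in for the LSTM decoder fed with f(E[i], word), indices are 0-based, and the `max_steps` guard is our addition.

```python
from typing import Callable, List

COPY, INC, EOS = "<COPY>", "<INC>", "<EOS>"


def decode(word: str,
           predict: Callable[[int, List[str]], str],
           max_steps: int = 100) -> str:
    """Run the copy/increment transducer over `word`.

    `predict(i, output_so_far)` returns the next symbol: a literal character,
    <COPY>, <INC> or <EOS>.
    """
    i = 0  # input cursor
    out: List[str] = []
    for _ in range(max_steps):
        symbol = predict(i, out)
        if symbol == EOS:
            break
        if symbol == COPY:
            if i < len(word):
                out.append(word[i])  # copy the symbol under the cursor
        elif symbol == INC:
            if i + 1 < len(word):
                i += 1               # advance the cursor, emit nothing
        else:
            out.append(symbol)       # emit a new symbol from the alphabet
    return "".join(out)


def scripted(ops: List[str]) -> Callable[[int, List[str]], str]:
    """Toy predictor that replays a fixed symbol list (for illustration only)."""
    it = iter(ops)
    return lambda i, out: next(it)


lemma = decode("fetelor",
               scripted([COPY, "a", INC, INC, COPY, "\u0103", EOS]))
```

Replaying the gold symbol list for "fetelor" reproduces the lemma "fată"; note that in this sketch <COPY> does not advance the cursor, so the same input symbol could be copied twice.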
To our knowledge, the attention-free encoder-decoder provides state-of-the-art results 8, our results being on par with the highest-ranking system in the UD Shared Task. The results are reported without using any lexicon for known words, and by employing the heuristic of leaving numbers and proper nouns unchanged.

Tagging
Tagging is achieved using a two-layer bidirectional LSTM (same size for all languages). The input of the network is composed only of lexicalized features (see Section 2.1) and the output contains three softmax layers that independently predict UPOS, XPOS and ATTRS. Though the ATTRS label is composed of multiple key-value pairs for each morphological attribute of the word (e.g. gender, case, number etc.), we treat the concatenated strings as a single value.
7 As a reviewer kindly noted, a <COPY> might not always be followed by an <INC>; we cannot exclude the possibility that a word in a certain language might have a single letter that has to be copied twice in the lemma. We thank the reviewer for pointing this out.
We performed a number of experiments trying to predict individual morphological attributes, but the overall accuracy degraded and we preferred this naive approach to other tagging strategies.
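The difference between the two labeling strategies can be illustrated with a small sketch. The attribute strings follow the CoNLL-U FEATS convention, while the helper name and the toy data are ours.

```python
from typing import Dict, Iterable

def build_label_vocab(labels: Iterable[str]) -> Dict[str, int]:
    """Assign a class index to every distinct label, in order of first use."""
    vocab: Dict[str, int] = {}
    for label in labels:
        vocab.setdefault(label, len(vocab))
    return vocab


attrs = [
    "Case=Nom|Number=Sing",
    "Case=Acc|Number=Sing",
    "Case=Nom|Number=Sing",
    "_",
]

# Strategy used here: the whole concatenated string is one class.
joint_vocab = build_label_vocab(attrs)

# Alternative strategy: one class per individual key-value pair.
split_vocab = build_label_vocab(
    pair for a in attrs if a != "_" for pair in a.split("|")
)
```

With the joint strategy a single softmax picks one of the observed attribute combinations; the split strategy would instead need one decision per attribute, which is the variant that degraded accuracy in our experiments.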
For regularization, we use an auxiliary layer of softmax functions (Szegedy et al., 2015), located after the first bidirectional LSTM layer. The objective function is also designed to maximize the prediction probabilities for the same labels as the main softmax functions.
Note: The tagger is completely independent from the parser and we don't use any morphological information for parsing.

Parsing
Our parser is inspired by Kiperwasser and Goldberg (2016) and Dozat et al. (2017), in the sense that we use multiple stacked bidirectional LSTM layers and project 4 specialized representations for each word in a sentence, which are later aggregated in a multilayer perceptron in order to produce arc and label probabilities.
We observed that training the parser on both morphological and lexical features biases the model into relying on correct previously-predicted tags. This does not hold for end-to-end parsing, which implies that we use predicted (thus imperfect) morphology. Also, in this Shared Task we can only train a tagger using the provided corpora, which means that it has access to the same features and training examples as the parser itself.
Taking all this into account, an interesting question arises: "Why would tagging followed by parsing (learned on an identical training dataset) be better than multi-task learning and joint prediction of arcs, labels and POS tags?". The answer that we came to is that it is not. In fact, we observed that jointly training a parser to also output morphological features increases the absolute UAS and LAS scores by up to 1.5% (at least for our own models).
Our parser architecture (Figure 3) is composed of 5 layers of bidirectional LSTMs (sized 300, 300, 200, 200, 200). After the first two layers we introduce an auxiliary loss using three softmax layers for the three independent morphological labels: UPOS, XPOS and ATTRS. After the final stacked layer we project 4 specialized representations, which are used in a bi-affine attention for predicting arcs between words and in a softmax layer for predicting the label itself (after we decode the graph into a parsing tree).
There are several interesting observations which apply to this approach (but they could be generally true):
Observation 1: If we compute the accuracy of the auxiliary predicted tags and compare it to that of the independent tagger, we get a slightly higher accuracy for the UPOS labels and lower figures for XPOS and ATTRS. This could mean that the contribution to parsing of the UPOS labels is higher than that of XPOS labels and morphological attributes. Of course, we are also using lexicalized features, so this conclusion might be false.
Note: In the end-to-end system we use the tagger to predict UPOS, XPOS and ATTRS; the slight gain in accuracy from using UPOS tags predicted by the parser is offset by the complexity of picking labels from separate modules and by more parameter logic for the end-user of our system (for example, if a user requests only POS tags, they would then need to run the parser just for UPOS).
Observation 2: In theory, the parsing tree should be computed as the minimum or maximum spanning tree (MST) from the complete graph that we create using the network. A standard way to do this is to use the Chu-Liu/Edmonds algorithm. However, in our initial experiments we used a greedy method, which almost never generated MSTs. The algorithm worked by sorting all possible arcs by probability, from highest to lowest. We would then start from the most probable arc and iteratively add arcs if they did not introduce cycles. While this is similar to Kruskal's algorithm, Kruskal's optimality guarantee does not hold for directed graphs. When we switched to the MST algorithm we obtained lower UAS and LAS scores for the parser. We checked the validity of the results and, indeed, the total score of the MST trees is higher than that of the greedy trees. We also tried multiple MST implementations, including our own, which reduces the chance of coding errors. The conclusion is that in order to obtain good UAS/LAS scores, one should always favor strong arc scores over lower-confidence relations between words. The MST algorithm removes high-confidence relations and replaces them with subsets of lower-scoring relations that provide a "global optimum". Our intuition is that if one wants to use an MST algorithm to produce a parsing tree, this algorithm should be inte-
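The greedy procedure described in Observation 2 can be sketched as follows. This is an illustrative reconstruction under stated assumptions: node 0 is an artificial root, each word receives exactly one head, and the function name and toy scores are ours.

```python
from typing import List


def greedy_heads(scores: List[List[float]]) -> List[int]:
    """Pick heads greedily; scores[h][d] is the score of the arc h -> d.

    Node 0 is the artificial root. Returns head[d] for every node, with -1
    for the root itself.
    """
    n = len(scores)
    arcs = sorted(
        ((scores[h][d], h, d)
         for h in range(n) for d in range(1, n) if h != d),
        reverse=True,
    )
    head = [-1] * n
    for _, h, d in arcs:
        if head[d] != -1:
            continue  # this word already has a (more confident) head
        # Adding h -> d creates a cycle if d is already an ancestor of h.
        node, creates_cycle = h, False
        while node != -1:
            if node == d:
                creates_cycle = True
                break
            node = head[node]
        if not creates_cycle:
            head[d] = h
    return head


# Toy scores for two words: the strongest arc is 1 -> 2 (0.9).
toy_scores = [
    [0.0, 0.1, 0.6],
    [0.0, 0.0, 0.9],
    [0.0, 0.8, 0.0],
]
heads = greedy_heads(toy_scores)
greedy_total = sum(toy_scores[heads[d]][d] for d in range(1, len(heads)))
```

In this toy graph the greedy method keeps the strongest arc 1 -> 2 and must then attach word 1 to the root, for a total of 1.0, whereas the maximum spanning tree (0 -> 2, 2 -> 1) totals 1.4 but drops the highest-confidence arc.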

Training details
Regarding drop-out, for all tasks we use a consistent strategy: similar to the methodology of Dozat et al. (2017), we randomly drop each representation 9 independently and we scale the others to compensate for the missing input. The default parameters used in our process are also close to those proposed in the aforementioned paper, with the exception that we found a batch size of 1000 to provide better results. The batch size refers to the number of tokens included in one training iteration. Our models are implemented using DyNET (Neubig et al., 2017), which is a framework for neural networks with dynamic computation graphs. This implies that we don't require bucketing and padding in our approach. Instead, when we compute a batch we add sentences until the total number of tokens reaches the batch threshold (1000). We often exceed the threshold, because the token counts rarely sum to exactly 1000.
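The token-based batching described above can be sketched as a small generator. The function name is ours, and sentences are represented simply as lists of tokens.

```python
from typing import Iterable, Iterator, List

Sentence = List[str]


def batch_by_tokens(sentences: Iterable[Sentence],
                    threshold: int = 1000) -> Iterator[List[Sentence]]:
    """Group whole sentences until the token count reaches `threshold`.

    Sentences are never split, so a batch may exceed the threshold.
    """
    batch: List[Sentence] = []
    count = 0
    for sentence in sentences:
        batch.append(sentence)
        count += len(sentence)
        if count >= threshold:
            yield batch
            batch, count = [], 0
    if batch:
        yield batch  # final, possibly smaller, batch


toy_corpus = [["tok"] * 400, ["tok"] * 400, ["tok"] * 400, ["tok"] * 100]
batches = list(batch_by_tokens(toy_corpus))
batch_sizes = [sum(len(s) for s in b) for b in batches]
```

With sentence lengths of 400, 400, 400 and 100 tokens, the first batch closes at 1200 tokens (overflowing the threshold) and the remaining sentence forms a final batch of 100 tokens.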
The global early-stopping condition is that the task-specific metric over the development set doesn't improve over 20 consecutive training epochs.
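The stopping rule can be sketched as a simple patience counter. The function `evaluate_epoch` is a hypothetical callback that trains for one epoch and returns the development-set metric; `max_epochs` is our own safety bound.

```python
from typing import Callable, Tuple


def train_with_early_stopping(evaluate_epoch: Callable[[int], float],
                              patience: int = 20,
                              max_epochs: int = 1000) -> Tuple[int, float]:
    """Stop after `patience` consecutive epochs without improvement.

    Returns the index of the best epoch and its development score.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_gain = 0
    for epoch in range(max_epochs):
        score = evaluate_epoch(epoch)
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best_epoch, best_score


# Toy metric curve: improves for three epochs, then plateaus below the best.
toy_curve = [0.5, 0.6, 0.7] + [0.65] * 30
calls = []


def toy_evaluate(epoch: int) -> float:
    calls.append(epoch)
    return toy_curve[epoch]


best = train_with_early_stopping(toy_evaluate)
```

On this toy curve the best epoch is the third one, and training stops after 20 further epochs that fail to improve on it.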
All models that use auxiliary softmax functions weight the auxiliary loss by an empirically selected value of 0.2. Whenever more than one auxiliary softmax layer is used, the weight is divided equally between the losses (i.e. if we use two auxiliary loss layers, each will incur a loss that is scaled by 0.1, not 0.2).
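The weighting scheme amounts to the following arithmetic. This is a minimal sketch: the function name is ours, and combining the main and auxiliary terms by simple addition is an assumption.

```python
from typing import Sequence


def total_loss(main_loss: float,
               aux_losses: Sequence[float],
               aux_weight: float = 0.2) -> float:
    """Add the auxiliary losses, sharing `aux_weight` equally among them."""
    if not aux_losses:
        return main_loss
    per_layer_weight = aux_weight / len(aux_losses)
    return main_loss + per_layer_weight * sum(aux_losses)


# Two auxiliary layers: each is scaled by 0.1, so 1.0 + 0.1 * (0.5 + 1.5).
combined = total_loss(1.0, [0.5, 1.5])
```

With two auxiliary layers each loss is scaled by 0.1, so the example evaluates to 1.2.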
At runtime the end-to-end system performs the following operations sequentially: (a) it segments the input raw text using the best-accuracy sentence splitter model; (b) it tokenizes the sentences using the best-accuracy tokenizer model; (c) it generates compound words over the tokens with the best-accuracy compound word expander model; (d) it predicts POS tags using the best-performing network models for UPOS, XPOS and ATTRS respectively; (e) it generates parse links and labels using the best UAS model (and not the best LAS one, though we save that one as well); and finally (f) it fills in the lemma with the best-accuracy lemmatizer model.
We used the same hyperparameters for all languages. They were chosen based on a few languages that we initially tested on, and we then used these values for all other languages. However, each task has its own set of hyperparameters that can be tuned individually. Except for the input sizes (like the 300-to-100 linear transform in the tokenizer), all LSTM sizes are configurable through the automatically generated config file for each task.

Results
We summarized our results in Table 1, showing NLP-Cube's individual task scores for each language, and in two tables comparing our ranking by task (Tables 2 and 3).
Table 3: Results for Big Treebanks
Full results are available on the official website 10 and, due to space restrictions, the description of each individual score is available online 11 as well. For example, for sentence splitting (SS) and tokenization (Token), the figures reported are F1 scores. For Tables 2 and 3 we did not include the lowest-performing system in the max-min/average/median calculation, as it had a very low score and would skew the overall ranking. For the Rank value in the tables, please note that there were 25 systems participating (excluding the lowest competitor), so rank 10 means 10th position out of 25. Overall, NLP-Cube performed above average for most tasks and treebanks, and even better if we consider only the large treebanks. Due to the hidden bug we discovered very late in the TIRA testing period (mentioned in the introduction), we see consistently poor performance for compound word expansion and lemmatization, where the character network has a large influence. Considering that for most languages we performed end-to-end processing, low performance early in the processing chain compounded the error and led to lower scores.

Use-cases
We've built NLP-Cube with the vision that it would help in higher-level NLP tasks like Machine Translation, Named Entity Recognition or Question Answering, to name a few.
10 http://universaldependencies.org/conll18/results.html
11 http://universaldependencies.org/conll18/evaluation.html
As part of NLP-Cube, we have a Named Entity Recognition (NER) system 12 that employs Graph-Based-Decoding (GBD) over a hybrid network architecture composed of bidirectional LSTMs for word-level encoding, which had great results 13 .
We're currently working on integrating Universal Morphological Reinflection and also Machine Translation tasks in NLP-Cube. We welcome feedback and contributions to the project, as well as new ideas and areas we could cover.

Conclusions
This paper introduces NLP-Cube: an end-to-end system that performs text segmentation, lemmatization, part-of-speech tagging and parsing. It allows training of any model given datasets in the CoNLL-U format. Written in Python, it is open-source, easily usable ("pip install nlpcube") and provides models for the large treebanks in the Universal Dependencies Corpus.
We presented and discussed each NLP task. Results place NLP-Cube in the upper half of the best performing end-to-end text preprocessing systems. As we retrain our models, new scores will be continuously updated online 14 .
Finally, we highlight a few ideas:
1. We presented a lemmatizer / compound word expander that uses a Finite State Transducer-style algorithm that is faster and has better results than the classic attention-based encoder-decoder model (with the mention that it requires monotonic alignments between symbols) (see section 2.3);
2. We obtained better results for Morphological Attributes when using each example as a single class instead of splitting and predicting their presence or not at every instance (see section 2.4);
3. Parsing based on lexicalized features only, while at the same time performing UPOS, XPOS and ATTRS prediction jointly with arc index and labeling, led to a higher performance than parsing based on previously predicted morphological features generated by a tagger (see section 2.5).