The ParisNLP entry at the CoNLL UD Shared Task 2017: A Tale of a #ParsingTragedy

We present the ParisNLP entry at the UD CoNLL 2017 parsing shared task. In addition to the UDPipe models provided, we built our own data-driven tokenization models, sentence segmenter and lexicon-based morphological analyzers. All of these were used with a range of different parsing models (neural or not, feature-rich or not, transition- or graph-based, etc.), and the best combination for each language was selected. Unfortunately, a glitch in the shared task's Matrix led our model selector to run generic, weakly lexicalized models, tailored for surprise languages, instead of our dataset-specific models. Because of this #ParsingTragedy, we officially ranked 27th, whereas our real models unofficially ranked 6th.


Introduction
The Universal Dependency parsing shared task (Zeman et al., 2017) was arguably the hardest shared task the field has seen since the CoNLL 2006 shared task (Buchholz and Marsi, 2006), in which 13 languages had to be parsed in gold-token, gold-morphology mode; its 2007 follow-up introduced an out-of-domain track for a subset of the 2006 languages (Nivre et al., 2007). The SANCL "parsing the web" shared task (Petrov and McDonald, 2012) introduced the parsing of English non-canonical data in gold-token, predicted-morphology mode and saw a large decrease in performance compared to what was usually reported for English parsing of the Penn Treebank. As far as we know, the SPMRL shared tasks (Seddah et al., 2013, 2014) were the first to introduce a non-gold tokenization, predicted-morphology scenario for two morphologically rich languages, Arabic and Hebrew, while for other languages complex source tokens were left untouched (Korean, German, French, etc.). Here, the Universal Dependency (hereafter "UD") shared task introduced an end-to-end parsing evaluation protocol in which none of the usual stratification layers were to be evaluated in gold mode: tokenization, sentence segmentation, morphology prediction and of course syntactic structures had to be produced 1 for 46 languages covering 81 datasets. Some of them are low-resource languages, with training sets containing as few as 22 sentences. In addition, an out-of-domain scenario was de facto included via a new 14-language parallel test set. Because of the very nature of the UD initiative, some languages are covered by several treebanks (English, French, Russian, Finnish, etc.) built by different teams, who interpreted the annotation guidelines with a certain degree of freedom when it comes to rare, or simply not covered, phenomena. 2 Let us add that our systems had to be deployed on a virtual machine and evaluated in a totally blind mode, with different metadata between the trial and the test runs.
All those parameters led to a multi-dimensional shared task which can loosely be summarized by the following "equation":

SharedTask = f(Lang, Tok, WS, Seg, Morph, DS, OOD, AS, Exp)

where Lang stands for language, Tok for tokenization, WS for word segmentation, Seg for sentence segmentation, Morph for predicted morphology, DS for data scarcity, OOD for out-of-domainness, AS for annotation scheme, and Exp for experimental environment.

1 Although baseline predictions for all layers were made available through Straka et al.'s (2016) pre-trained models or pre-annotated development and test files.
2 See for example, the discrepancy between frpartut and the other French treebanks regarding the annotation of the not so rare car conjunction, 'for/because', and the associated syntactic structures, cf. https://github.com/ UniversalDependencies/docs/issues/432.
In this shared task, we earnestly tried to cover all of these dimensions, ranking #3 in UPOS tagging and #5 in sentence segmentation. But we were ultimately strongly impacted by the Exp parameter (cf. Section 6.3), a parameter we could not control, resulting in a disappointing rank of #27 out of 33. Once this variable was corrected, we reached rank #6. 3 Our system relies on a strong pre-processing pipeline, which includes lexicon-enhanced statistical taggers as well as data-driven tokenizers and sentence segmenters. The parsing step proper makes use, for each dataset, of one of four parsing models: two non-neural ones (transition- and graph-based) and extensions of these models with character- and word-level neural layers.

Architecture and Strategies
In preparation for the shared task, we developed and adapted a number of different models for tokenization, 4 sentence segmentation, tagging (predicting UPOS and the values of a manually selected, language-independent subset of the morphological attributes, hereafter "partial MSTAGs") and parsing. For each dataset for which training data was available, we combined different pre-processing strategies with different parsing models and selected the best performing ones based on parsing F-scores on the development set in the predicted-token scenario. Whenever no development set was available, we performed this selection based on a 10-fold cross-evaluation on the training set.
Our baseline pre-processing strategy consisted in simply using the data annotated using UDPipe (Straka et al., 2016) provided by the shared task organizers. We also developed new tools of our own, namely a tagger as well as a joint tokenizer and sentence segmenter. We chose whether to use the baseline UDPipe annotations or our own annotations for each of the following steps: sentence segmentation, tokenization, and tagging (UPOS and partial MSTAGs). We used UDPipe-based information in all configurations for XPOS, lemma, and word segmentation, based on an a posteriori character-level alignment algorithm.
At the parsing level, we developed and tried five different parsers, both neural and non-neural, which are variants of the shift-reduce (hereafter "SR") and maximum spanning-tree algorithms (hereafter "MST"). The next two sections describe in more detail our different pre-processing and parsing architectures, give insights into their performance, and show how we selected our final architecture for each dataset.
Pre-processing

Tagging

Tagging architecture

Taking advantage of the opportunity given by the shared task, we developed a new part-of-speech tagging system inspired by our previous work on MElt (Denis and Sagot, 2012), a left-to-right maximum-entropy tagger relying on features based on both the training corpus and, when available, an external lexicon. The two main advantages of using an external lexicon as a source of additional features are the following: (i) it provides the tagger with information about words unknown to the training corpus; (ii) it allows the tagger to have a better insight into the right context of the current word, for which the tagger has not yet predicted anything.
Whereas MElt uses the megam package to learn tagging models, our new system, named alVWTagger, relies on Vowpal Wabbit. 5 One of Vowpal Wabbit's major advantages is its training speed, which allowed us to train many tagger variants for each language, in order to assess the relative performance of different types of external lexicons and different ways to use them as a source of features. In all our experiments, we used VW in its default multiclass mode, i.e. using a squared loss and the one-against-all strategy. Our feature set is a slight extension of the one used by MElt (cf. Denis and Sagot, 2012).
The first improvement over MElt concerns how information extracted from the external lexicons is used: instead of only using the categories provided by the lexicon, we also use morphological features. We experimented with different modes. In the baseline mode, the category provided by the lexicon is the concatenation of a UPOS and a sequence of morphological features, hereafter the "full category". In the F mode ("ms mode" in Table 1), only the UPOS is used (morphological features are ignored). In the M mode, both the full category and the sequence of morphological features are used, separately. Finally, in the FM mode, both the UPOS and the sequence of morphological features are used, separately.
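The four modes can be sketched as follows (a minimal illustration; the function name and the "#" separator are hypothetical, and the actual feature templates differ):

```python
def lexicon_features(upos, msfeats, mode):
    """Return the lexicon-derived features for one entry, depending on the mode.

    upos:    part of speech from the lexicon, e.g. "NOUN"
    msfeats: morphological feature string, e.g. "Gender=Fem|Number=Sing"
    mode:    "baseline", "F", "M", or "FM"
    """
    full = f"{upos}#{msfeats}"      # the "full category"
    if mode == "baseline":
        return [full]               # concatenated UPOS and features
    if mode == "F":
        return [upos]               # UPOS only, features ignored
    if mode == "M":
        return [full, msfeats]      # full category and features, separately
    if mode == "FM":
        return [upos, msfeats]      # UPOS and features, separately
    raise ValueError(f"unknown mode: {mode}")
```
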
The second improvement over MElt is that alVWTagger predicts both a part-of-speech (here, a UPOS) and a set of morphological features. As mentioned earlier, we decided to restrict the set of morphological features we predict, in order to reduce data sparsity. 6 For each word, our tagger first predicts a UPOS. Next, it uses this UPOS as a feature to predict the set of morphological features as a whole, using an auxiliary model. 7

Extraction of morphological lexicons

As mentioned above, our tagger is able to use an external lexicon as a source of external information. We therefore created morphological lexicons for as many languages as possible, relying only on data and tools that were allowed by the shared task instructions. We compared the UPOS accuracies on the development sets, or on the training sets in a 10-fold setting when no development data was provided, and retained the best performing lexicon for each dataset (see Table 1). Each lexicon was extracted from one or several of the following sources, using an a posteriori merging algorithm:

• The monolingual lexicons from the Apertium project (lexicon type code "AP" in Table 1);

• Raw monolingual corpora provided by the shared task organizers, after application of a basic rule-based tokenizer and the appropriate Apertium or Giellatekno morphological analyzers (codes "APma" or "GTma");

• The corresponding training dataset (code "T") or another training dataset for the same language (code "Tdataset");

• The UDPipe-annotated corpora provided by the shared task organizers (code "UDP");

• A previously extracted lexicon for another language, which we automatically "translated" using a dedicated algorithm, seeded with a bilingual lexicon automatically extracted from OPUS sentence-aligned data (code "TRsource language").

6 The list of features we retained is the following: Case, Gender, Number, PronType, VerbForm, Mood, and Voice.
7 We also experimented with per-feature prediction, but it resulted in slightly lower accuracy results on average, as measured on development sets.

All lexical information not directly extracted from UDPipe-annotated data or from training data was converted to the UD morphological categories (UPOS and morphological features).
For a few languages only (for lack of time), we also created expanded versions of our lexicons using word embeddings re-computed on the raw data provided by the organizers, assigning to words unknown to the lexicon the morphological information associated with the closest known word (using a simple Euclidean distance in the word embedding space). 8 When the best performing lexicon is one of these extended lexicons, it is indicated in Table 1 by the "-e" suffix.
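This nearest-neighbor expansion can be sketched as follows (an illustrative reimplementation with hypothetical names, not the actual code; a brute-force search stands in for whatever indexing the real tool used):

```python
import math

def expand_lexicon(lexicon, embeddings):
    """Extend a morphological lexicon: each word that has an embedding but no
    lexicon entry receives the entry of its nearest in-lexicon word, measured
    by Euclidean distance in the embedding space.

    lexicon:    dict word -> morphological information (e.g. "NOUN#Number=Sing")
    embeddings: dict word -> list of floats (the word's embedding vector)
    """
    known = [w for w in embeddings if w in lexicon]
    expanded = dict(lexicon)
    for word, vec in embeddings.items():
        if word in lexicon:
            continue
        nearest = min(known, key=lambda w: math.dist(embeddings[w], vec))
        expanded[word] = lexicon[nearest]
    return expanded
```
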

Tokenization and sentence segmentation
Using the same architecture as our tagger, yet without resorting to external lexicons, we developed a data-driven tokenizer and sentence segmenter, which runs as follows. First, a simple rule-based pre-tokenizer is applied to the raw text of the training corpus, after removing all sentence boundaries. 9,10 This pre-tokenizer outputs a sequence of "pre-tokens"; at each pre-token boundary, we keep track of whether a whitespace was present at this position in the raw text. Next, we use the gold training data to label each pre-token boundary with one of the following labels: not a token boundary (NATB), token boundary (TB), sentence boundary (SB). 11 This model can then be applied to raw text, after the pre-tokenizer has been applied. It labels each pre-token boundary, resulting in the following decisions depending on whether the boundary corresponds to a whitespace in the raw text or not: (i) if it predicts NATB at a non-whitespace boundary, the boundary is removed; (ii) if it predicts NATB at a whitespace boundary, it results in a token-with-space; (iii) if it predicts TB (resp. SB) at a non-whitespace boundary, a token (resp. sentence) boundary is created and "SpaceAfter=No" is added to the preceding token; (iv) if it predicts TB (resp. SB) at a whitespace boundary, a token (resp. sentence) boundary is created. We compared our tokenization and sentence segmentation results with the UDPipe baseline on development sets. Whenever the UDPipe tokenization and sentence segmentation scores were both better, we used them in all configurations. The other datasets, for which tokenization and sentence segmentation performance is shown in Table 2, were split into two sets: those on which our tokenization was better but our sentence segmentation was worse (for those, we forced the UDPipe sentence segmentation in all settings), and those for which both our tokenization and sentence segmentation were better.
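Decisions (i)-(iv) can be sketched as follows (an illustrative reimplementation with hypothetical names, not the actual code; the last boundary, after the final pre-token, is taken to close the sentence):

```python
def apply_boundary_decisions(pre_tokens, spaces, labels):
    """Turn pre-token boundary labels into tokens and sentences.

    pre_tokens: list of pre-token strings
    spaces:     spaces[i] is True iff a whitespace followed pre_tokens[i] in the raw text
    labels:     labels[i] in {"NATB", "TB", "SB"}, one label per boundary

    Returns a list of sentences; each sentence is a list of (form, space_after)
    pairs, where space_after=False corresponds to "SpaceAfter=No".
    """
    sentences, sentence, current = [], [], pre_tokens[0]
    for i, label in enumerate(labels):
        last = (i == len(labels) - 1)
        if label == "NATB" and not last:
            if spaces[i]:
                current += " " + pre_tokens[i + 1]   # (ii) token-with-space
            else:
                current += pre_tokens[i + 1]         # (i) boundary removed
        else:
            sentence.append((current, spaces[i]))    # (iii)/(iv) close the token
            if label == "SB" or last:
                sentences.append(sentence)           # close the sentence
                sentence = []
            if not last:
                current = pre_tokens[i + 1]
    return sentences
```
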

Preprocessing model configurations
As mentioned in Section 2, we used parsing-based evaluation to select our pre-processing strategy for each corpus. More precisely, we selected for each dataset one of the following strategies:

1. UDPIPE: the UDPipe baseline is used and provided as such to the parser.

2. TAG: the UDPipe baseline is used, except for the UPOS and MSTAG information, which is provided by our own tagger.

3. TAG+TOK+SEG and TAG+TOK: we apply our own tokenizer and POS tagger to produce UPOS and MSTAG information; sentence segmentation is performed either by us (TAG+TOK+SEG, available for datasets with "yes" in the last column of Table 2) or by the UDPipe baseline (TAG+TOK, available for datasets with "no" in Table 2).

Table 2: Tokenization and sentence segmentation accuracies for the UDPipe baseline and our tokenizer (restricted to those datasets for which we experimented with the use of our own tokenization).
Whenever we used our own tokenization and not that of the UDPipe baseline, we used a character-level alignment algorithm to map this information onto our own tokens. Table 1 indicates the configuration retained for each dataset for which a training set was provided in advance. 12 For surprise language datasets, we always used the UDPIPE configuration. 13 For PUD corpora, we used the same configuration as for the basic dataset of the same language (for instance, we used for the fr_pud dataset the same configuration as that chosen for the fr dataset). 14
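The character-level alignment can be sketched as follows (an illustrative reimplementation with hypothetical names; here each target token simply inherits the annotation of the source token covering its first character, which is one simple way to realize such an alignment):

```python
def align_annotations(ref_tokens, ref_annots, our_tokens):
    """Project token-level annotations from one tokenization onto another
    via character offsets over the whitespace-free text.

    Both tokenizations must spell out the same character sequence once
    internal whitespace is removed.
    """
    assert "".join(ref_tokens).replace(" ", "") == "".join(our_tokens).replace(" ", "")
    # Map every character position to the annotation of the covering reference token.
    char_annot = []
    for tok, annot in zip(ref_tokens, ref_annots):
        char_annot.extend([annot] * len(tok.replace(" ", "")))
    aligned, pos = [], 0
    for tok in our_tokens:
        aligned.append(char_annot[pos])      # annotation at the token's first character
        pos += len(tok.replace(" ", ""))
    return aligned
```
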

Parsing Models
We used 4 base parsers, all implemented on top of the DYALOG system (de La Clergerie, 2005), a logic-programming environment (à la Prolog) specially tailored for natural language processing, in particular for tabulation-based dynamic programming algorithms.
Non-neural parsing models

The first two parsers are feature-based and use no neural components. The most advanced one is DYALOG-SR, a shift-reduce transition-based parser, using dynamic programming techniques to maintain beams (Villemonte De La Clergerie, 2013). It accepts a large set of transition types, besides the usual shift and reduce transitions of the arc-standard strategy. In particular, to handle non-projectivity, it can use different instances of swap transitions, to swap 2 stack elements among the 3 topmost ones. A noop transition may also be used at the end of parsing paths to compensate for differences in path lengths. Training is done with a structured averaged perceptron, using early aggressive updates, whenever the oracle falls out of the beam, or when a violation occurs, or when a margin becomes too high, etc. 15

13 For surprise languages, the UDPipe baseline was trained on data not available to the shared task participants.

14 Because of a last-minute bug, we used the TAG configuration for tr_pud and pt_pud although we used the UDPIPE configuration for tr and pt. We also used the TAG setting for fi_pud rather than the TAG+TOK+SEG setting used for fi.

15 By "violation," we mean for instance adding an edge not present in the gold tree, a first step towards dynamic oracles. We explored this path further for the shared task through dynamic programming exploration of the search space, yet did not observe significant improvements.

Feature templates are used to combine elementary standard features:

• Word features related to the 3 topmost stack elements s_i (i = 0..2), the 4 first buffer elements I_j (j = 1..4), the leftmost/rightmost children [lr]s_i and grandchildren [lr]2s_i of the stack elements, and governors. These features include the lexical form, lemma, UPOS, XPOS, morphosyntactic features, Brown-like clusters (derived from word embeddings), and flags indicating capitalization, numbers, etc.
• Binned distances between some of these elements;

• Dependency features related to the leftmost/rightmost dependency labels for s_i (and dependents [lr]s_i), the label set of the dependents of s_i and [lr]s_i, and the number of dependents;

• The last action (+label) leading to the current parsing state.
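Distance binning is a standard trick to keep such features sparse but informative; a minimal sketch (the bin thresholds here are illustrative, not those actually used):

```python
def binned_distance(i, j, bins=(1, 2, 3, 4, 5, 10)):
    """Map the linear distance between two word positions to a coarse bin label,
    so that the parser learns one weight per bin instead of one per distance."""
    d = abs(i - j)
    for b in bins:
        if d <= b:
            return f"dist<={b}"
    return f"dist>{bins[-1]}"
```
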
The second feature-based parser is DYALOG-MST, a parser developed for the shared task and implementing the Maximum Spanning Tree (MST) algorithm (McDonald et al., 2005). By definition, DYALOG-MST may produce non-projective trees. Being recent and much less flexible than DYALOG-SR, it also relies on a much smaller set of first-order features and templates, related to the source and target words of a dependency edge, plus its label and binned distance. It also exploits features related to the number of occurrences of a given POS between the source and target of an edge (inside features) or not covered by the edge but in the neighborhood of the nodes (outside features). Similar features are also implemented for punctuation.
Neural parsing models Both feature-based parsers were then extended with a neural-based component, implemented in C++ with DyNet (Neubig et al., 2017). The key idea is that the neural component can provide the best parser action or, if asked, a ranking of all possible actions. This information is then used as extra features to finally make a decision. The 2 neural-based variants of DYALOG-SR and DYALOG-MST, straightforwardly dubbed DYALOG-SRNN and DYALOG-MSTNN, implement a similar architecture, the one for DYALOG-SRNN being a bit more advanced and stable. Moreover, DYALOG-MSTNN was only found to be the best choice for a very limited number of treebanks. In addition to these models, we also investigated a basic version of DYALOG-SRNN that only uses, in a feature-poor setting, its character-level component and its joint action prediction, and which provides the best performance on 3 languages. The following discussion will focus on DYALOG-SRNN. The architecture is inspired by Google's PARSEYSAURUS (Alberti et al., 2017), with a first left-to-right char LSTM covering the whole sentence and (artificial) whitespaces introduced to separate tokens. 16 The output vectors of the char LSTM at the token separations are used as (learned) word embeddings that are concatenated (when present) with both the pre-trained ones provided for the task and the UPOS tags predicted by the external tagger. The concatenated vectors serve as input to a word bi-LSTM that is also used to predict UPOS tags as a joint task (training with the gold tags provided as oracle). For a given word w_i, its final vector representation is the concatenation of the output of the bi-LSTM layers at position i with the LSTM-predicted UPOS tag.
The deployment of the LSTMs is done once for a given sentence. Then, for any parsing state, characterized by the stack, buffer, and dependency components mentioned above, a query is made to the neural layers to suggest an action. The query fetches the final vectors associated with the stack, buffer, and dependent state words, and completes them with input vectors for 12 (possibly empty) label dependencies and for the last action. The number of considered state words is a hyper-parameter of the system, which can range between 10 and 19, the best and default value being 10, covering the 3 topmost stack elements and 6 dependent children, but only the first buffer lookahead word 17 and no grandchildren. Through a hidden layer and a softmax layer, the neural component returns the best action paction (and plabel) but also the ranking and weights of all possible actions. The best action is used as a feature to guide the decision of the parser in combination with the other features, the final weight of an action being a linear weighted combination of the weights returned by the perceptron and neural layers. 18 A dropout rate of 0.25 was used to introduce some noise. The DyNet AdamTrainer was chosen for gradient updates, with its default parameters. Many hyper-parameters are however available as options, such as the number of layers of the char and word LSTMs, and the sizes of the input, hidden and output dimensions for the LSTMs and feedforward layers. A partial exploration of these parameters was run on a few languages, but not in a systematic way, given the lack of time and the huge number of possibilities.

16 A better option would be to add a whitespace only when present in the original text.

17 We assume that the information relative to the other lookahead words is encapsulated in the final vector of the first lookahead word.

18 The best way to combine the weights of the neural and feature components remains a point for further investigation.
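The final decision step, a linear weighted combination of the perceptron and neural scores, can be sketched as follows (hypothetical names and a fixed interpolation weight alpha; as noted above, the best way to set this combination remained open):

```python
def combine_action_scores(perceptron_scores, neural_scores, alpha=0.5):
    """Pick the final parser action by linearly interpolating the score each
    component assigns to each candidate action.

    perceptron_scores, neural_scores: dict action -> float
    Returns (best_action, combined_scores).
    """
    actions = set(perceptron_scores) | set(neural_scores)
    combined = {a: alpha * perceptron_scores.get(a, 0.0)
                   + (1.0 - alpha) * neural_scores.get(a, 0.0)
                for a in actions}
    best = max(combined, key=combined.get)
    return best, combined
```
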
Clearly, even though we tried 380 distinct parsing configurations through around 16K training runs, 19 we are still far from language-specific parameter tuning, thus leaving room for improvement.

Results
Because of the greater flexibility of transition-based parsers, MST-based models were only used for a few languages. However, our results, provided in the Appendix, show the good performance of these models, for instance on Old Church Slavonic (cu), Gothic (got), Ancient Greek (grc), and Kazakh (kk). Already during development, it was surprising to observe, for most languages, a strong preference for either SR-based or MST-based models. For instance, for Ancient Greek, the best score in gold-token mode for an MST-based model is 62.43, while the best score for an SR-based one is 60.59. Conversely, for Arabic (ar), we get 74.87 for the best SR model and 71.44 for the best MST model.
Altogether, our real, yet unofficial, scores are encouraging (ranking #6 in LAS), while our official UPOS tagging, sentence segmentation and tokenization results ranked #3, #6 and #5 respectively. Let us note that our low official LAS ranking, #27, was the result of a mismatch between the trial and test experimental environments provided by the organizers (cf. Section 6.3). However, we officially ranked #5 on the surprise languages, which were not affected by this mismatch.

Discussion
While developing our parsers, training and evaluation were mostly performed using the UDPipe pre-processing baseline with predicted UPOS and MSTAGs but gold tokenization and gold sentence segmentation. For several (bad) reasons, only in the very last days did we train on files tagged with our pre-processing chain. Even later, evaluation (but not training) was finally performed on dev files with predicted segmentation and tokenization, produced either by UDPipe or by our pre-processing chains (TAG, TAG+TOK+SEG or TAG+TOK). Based on the results, we selected, for each language and treebank, the best pre-processing configuration and the best parsing model.
In general, we observed that neural-based models without features often performed worse than pure feature-based parsers (such as srcat), but performed well when combined with features. We believe that our neural architecture, being quite recent, is not yet mature and that we still have to evaluate several possible options. Between using no features (srnnsimple and srnncharsimple models) and using a rich feature set (srnnpx models), where the predicted actions paction and plabel may be combined with other features, we also tested, for a few languages, a more restricted feature set with no combinations of paction and plabel with other features (srnncharjoin models). These latter combinations are faster to train and reach good scores, as shown in Table 3. For the treebanks without dev files, we simply did a standard training, using a random 80/20 split of the train files. Given more time, we would have tried transfer from other treebanks when available (as described below).
To summarize, a large majority of the selected models, 47, were based on DYALOG-SRNN with a rich feature set, 29 of them relying on predicted data coming from our processing chains (TAG or TAG+TOK+SEG), the others relying on the tags and segmentation predicted by UDPipe. 10 models were based on DYALOG-MSTNN, 5 of them relying on our pre-processing chain. Finally, 5 (resp. 2) were simply based on DYALOG-SR (resp. DYALOG-MST), none of them using our pre-processing.

OOV Handling
Besides the fact that we did not train on files with predicted segmentation, we are also aware of weaknesses in the handling of unknown words in test files. Indeed, at some point, we chose to filter the large word-embedding files by the vocabulary found in the train and dev files of each dataset. We did the same for the clusters we derived from word embeddings. This means that unknown words have no associated word embeddings or clusters (besides a default one). The impact of this choice is not yet clear, but it likely accounts for a significant part of the performance gap between our scores on the dev sets (with predicted segmentation) and on the final test set. 20
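The problematic filtering can be sketched as follows (an illustrative reimplementation with hypothetical names; test-set words outside the train/dev vocabulary all fall back to the single default vector):

```python
def filter_embeddings(embeddings, vocabulary, default_key="<unk>"):
    """Restrict an embedding table to a given vocabulary, keeping a single
    default vector for everything else.

    Returns a lookup function; words outside the vocabulary, typically
    test-set OOVs, all map to the default vector.
    """
    filtered = {w: v for w, v in embeddings.items() if w in vocabulary}
    dim = len(next(iter(embeddings.values())))
    default = embeddings.get(default_key, [0.0] * dim)
    filtered[default_key] = default
    return lambda w: filtered.get(w, filtered[default_key])
```
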

Generic Models
We also started exploring the idea of transferring information between close languages, such as Slavic languages. Treebank families were created for some groups of related languages by randomly sampling their respective treebanks as described in Table 4. A fully generic treebank was also created by randomly sampling 41k sentences from almost all languages (1k sentences per primary treebank).
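The sampling scheme can be sketched as follows (hypothetical names; 1k sentences per primary treebank, as described above):

```python
import random

def build_generic_treebank(treebanks, per_treebank=1000, seed=0):
    """Build a mixed treebank by randomly sampling up to `per_treebank`
    sentences from each primary treebank, then shuffling the pool.

    treebanks: dict treebank name -> list of sentences
    """
    rng = random.Random(seed)
    pooled = []
    for sentences in treebanks.values():
        k = min(per_treebank, len(sentences))   # small treebanks contribute everything
        pooled.extend(rng.sample(sentences, k))
    rng.shuffle(pooled)
    return pooled
```
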

Table 4: Generic model partitions (model name, languages covered, number of sampled sentences)

The non-neural parsers were trained on these new treebanks, using much less lexicalized information (no clusters, no word embeddings, keeping only UPOS tags among the morphological information, but keeping forms and lemmas when present). We tested using the resulting models, whose names are prefixed with "ZZ", as base models for further training on some small language treebanks. However, preliminary evaluations did not show any major improvement and, due to lack of time, we put this line of research on hold, while keeping our generic parsers as backups.
Some of these generic parsers were used for the 4 surprise languages, with Upper Sorbian using a ZZSSlavic parser 21 (LAS=56.22), North Saami using ZZFinnish (LAS=37.33), and the two others using the fully generic parser (Kurmanji: LAS=34.8; Buryat: LAS=28.55).

The Tragedy
Ironically, the back-off mechanism we set up for our model selection was both a cause of failure and our salvation. Because of the absence of the name field in the test set metadata, which was nevertheless present in the dev run and, crucially, in the trial run metadata, the selection of the best model per language was broken and led to the selection of back-off models, either a family one or, in most cases, the generic one. The TIRA blind test configuration prevented us from discovering this experimental discrepancy before the deadline. Once we adapted our wrapper to the test metadata, the appropriate models were selected, resulting in our real run results. It turned out that our non-language-specific, generic models performed surprisingly well, with a macro-averaged LAS F-score of 60%. Of course, except for Ukrainian, our language-specific models reach much better performance, with a macro-averaged F-score of 70.3%. But our misadventure is an invitation to further investigation.
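The failure mode can be sketched as a cascaded lookup (an illustrative reimplementation with hypothetical names, not our actual wrapper): when the metadata lacks the name field, the dataset-specific lookup silently fails and the back-off is used.

```python
def select_model(metadata, specific_models, family_models, generic_model):
    """Select the parsing model for a dataset, backing off to a language-family
    model and then to the fully generic one when the dataset is not identified.

    metadata: dict, expected to carry a "name" field (absent in the test run!)
    """
    name = metadata.get("name")
    if name in specific_models:
        return specific_models[name]        # the intended, dataset-specific model
    family = metadata.get("family")
    if family in family_models:
        return family_models[family]        # back-off: family model ("ZZ...")
    return generic_model                    # back-off: fully generic model
```
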
However, it is unclear at this stage whether or not mixing languages in a large treebank really has advantages over using several small treebanks. In very preliminary experiments on Greek, Arabic, and French, we extracted the 1,000 sentences present in the generic treebank for each of these languages and trained the best generic configuration (srcat, beam 6) on each of these small treebanks. As shown in Table 5, the scores on the development sets do not exhibit any improvement coming from mixing languages in a large pool and are well below the scores obtained on a larger language-specific treebank.

Impact of the Lexicon
We also investigated the influence of our tagging strategy with respect to the UDPipe baseline. Figure 1 plots the parsing LAS F-scores with respect to training corpus size. It also shows the result of logarithmic regressions performed on the datasets for which we used the UDPipe baseline for pre-processing versus those for which we used the TAG configuration. As can be seen, using the UDPipe baseline results in a much stronger impact of training corpus size, whereas using our own tagger leads to more stable results.

Figure 1: LAS F-score w.r.t. training corpus size

We interpret this observation as resulting from the influence of external lexicons during tagging, which lowers the negative impact of out-of-training-corpus words on tagging and therefore parsing performance. It illustrates the relevance of using external lexical information, especially for small training corpora.
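A logarithmic regression of this kind is just a least-squares line fit in log space; a minimal sketch (hypothetical function name, not the actual analysis script):

```python
import math

def fit_log_regression(sizes, scores):
    """Least-squares fit of score = a * log(size) + b, relating a parsing
    score to training corpus size. Returns the coefficients (a, b)."""
    xs = [math.log(s) for s in sizes]
    n = len(xs)
    mx, my = sum(xs) / n, sum(scores) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, scores))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b
```
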

Conclusion
The shared task was an excellent opportunity for us to develop a new generation of NLP components to process a large spectrum of languages, using some of the latest developments in deep learning. However, it was a truly challenging task, with an overwhelming number of decisions to make and experiments to run over a short period of time.
We now have many paths for improvement. First, because we have a very flexible but newly developed architecture, we need to stabilize it by carefully selecting the best design choices and parameters. We also plan to explore the potential of a multilingual dataset based on the UD annotation scheme, focusing on cross-language transfer and language-independent models.