CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.


Introduction
Ten years ago, two CoNLL shared tasks were a major milestone for parsing research in general and dependency parsing in particular. For the first time dependency treebanks in more than ten languages were available for learning parsers. Many of them were used in follow-up work, evaluating parsers on multiple languages became standard, and multiple state-of-the-art, open-source parsers became available, facilitating production of dependency structures to be used in downstream applications. While the two tasks (Buchholz and Marsi, 2006; Nivre et al., 2007) were extremely important in setting the scene for the following years, there were also limitations that complicated application of their results: (1) gold-standard tokenization and part-of-speech tags in the test data moved the tasks away from real-world scenarios, and (2) incompatible annotation schemes made cross-linguistic comparison impossible. CoNLL 2017 has picked up the threads of those pioneering tasks and addressed these two issues. 1
The focus of the 2017 task was learning syntactic dependency parsers that can work in a real-world setting, starting from raw text, and that can work over many typologically different languages, even surprise languages for which there is little or no training data, by exploiting a common syntactic annotation standard. This task has been made possible by the Universal Dependencies initiative (UD) (Nivre et al., 2016), which has developed treebanks for 50+ languages with cross-linguistically consistent annotation and recoverability of the original raw texts.
Participating systems had to find labeled syntactic dependencies between words, i.e., a syntactic head for each word, and a label classifying the type of the dependency relation. No gold-standard annotation (tokenization, sentence segmentation, lemmas, morphology) was available in the input text. However, teams wishing to concentrate just on parsing were able to use segmentation and morphology predicted by the baseline UDPipe system (Straka et al., 2016).

Data
In general, we wanted the participating systems to be able to use any data that is available free of charge for research and educational purposes (so that follow-up research is not obstructed). We deliberately did not place upper bounds on data sizes (in contrast to e.g. Nivre et al. (2007)), despite the fact that processing large amounts of data may be difficult for some teams. Our primary objective was to determine the capability of current parsers with the data that is currently available.
In practice, the task was formally closed, i.e., we listed the approved data resources so that all participants were aware of their options. However, the selection was rather broad, ranging from Wikipedia dumps over the OPUS parallel corpora (Tiedemann, 2012) to morphological transducers. Some of the resources were proposed by the participating teams.
We provided dependency-annotated training and test data, and also large quantities of crawled raw texts. Other language resources are available from third-party servers and we only referred to the respective download sites.

Training Data: UD 2.0
Training and development data come from the Universal Dependencies (UD) 2.0 collection (Nivre et al., 2017b). Unlike previous UD releases, the test data was not included in UD 2.0. It was kept hidden until the evaluation phase of the shared task terminated. In some cases, the underlying texts had been known from previous UD releases but the annotation had not (UD 2.0 follows new annotation guidelines that are not backward-compatible).
64 UD treebanks in 45 languages were available for training. 15 languages had two or more training treebanks from different sources, often also from different domains. 56 treebanks contained designated development data. Participants were asked not to use it for training proper but only for evaluation, development, tuning hyperparameters, doing error analysis etc. The 8 remaining treebanks were small and had only training data (and even these were extremely small in some cases, especially for Kazakh and Uyghur). For those treebanks cross-validation had to be used during development, but the entire dataset could be used for training once hyperparameters were determined.
Participants received the training and development data with gold-standard tokenization, sentence segmentation, POS tags and dependency relations; and for some languages also lemmas and/or morphological features.
Cross-domain and cross-language training was allowed and encouraged. Participants were free to train models on any combination of the training treebanks and apply it to any test set. They were even allowed to use the training portions of the 6 UD 2.0 treebanks that were excluded from evaluation (see Section 2.3).

Supporting Data
To enable the induction of custom embeddings and the use of semi-supervised methods in general, the participants were provided with supporting resources primarily consisting of large text corpora for (nearly) all of the languages in the task, as well as embeddings pre-trained on these corpora.

Raw texts
The supporting raw data was gathered from CommonCrawl, which is a publicly available web crawl created and maintained by the non-profit CommonCrawl foundation. 2 The data is publicly available in the Amazon cloud both as raw HTML and as plain text. It is collected from a number of independent crawls from 2008 to 2017, and totals petabytes in size.
We used cld2 3 as the language detection engine because of its speed, available Python bindings and large coverage of languages. Language detection was carried out on the first 1024 bytes of each plaintext document. Deduplication was carried out using hashed document URLs, a simple strategy found in our tests to be effective for coarse duplicate removal. The data for each language was capped at 100,000 tokens per single input file.
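The two filtering steps above can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names, the choice of SHA-1 as the URL hash, and the `toy_detect` stand-in for the cld2 detector are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Optional, Tuple

Document = Tuple[str, bytes]  # (url, plain-text payload); hypothetical layout


def dedup_and_detect(
    docs: Iterable[Document],
    detect: Callable[[bytes], Optional[str]],
    prefix_bytes: int = 1024,
) -> Iterator[Tuple[str, str, bytes]]:
    """Yield (language, url, payload) for the first document seen per URL."""
    seen = set()
    for url, payload in docs:
        # Coarse deduplication: remember a hash of the URL, not the URL itself.
        key = hashlib.sha1(url.encode("utf-8")).digest()  # hash choice is an assumption
        if key in seen:
            continue
        seen.add(key)
        # Only the first `prefix_bytes` bytes are shown to the detector.
        lang = detect(payload[:prefix_bytes])
        if lang is not None:
            yield lang, url, payload


def toy_detect(prefix: bytes) -> Optional[str]:
    """Hypothetical stand-in for a real language detector such as cld2."""
    words = prefix.decode("utf-8", errors="ignore").split()
    if "the" in words:
        return "en"
    if "der" in words:
        return "de"
    return None
```

A real run would pass a wrapper around the actual detector as `detect`; the toy detector only keeps the sketch self-contained.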
Automatic tokenization, morphology and parsing
The raw texts were further processed in order to generate automatic tokenization, segmentation, morphological annotations and dependency trees.
At first, basic cleaning was performed: paragraphs with erroneous encoding or fewer than 16 characters were dropped, and the remaining paragraphs were converted to Normalization Form KC (NFKC) 4 and deduplicated again. Then the texts were segmented and tokenized, multi-word tokens were split into words, and sentences with fewer than 5 words were dropped. Because we wanted to publish the resulting corpus, we shuffled the sentences and also dropped sentences with more than 80 words at this point for licensing reasons. The segmentation and tokenization were obtained using the baseline UDPipe models described in Section 5. These models were also used to further generate automatic morphological annotations (lemmas, UPOS, XPOS and FEATS) and dependency trees.
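The cleaning steps can be illustrated with a short sketch. The thresholds (16 characters, 5 and 80 words) come from the text; the function names are hypothetical, and treating the presence of U+FFFD as the sign of "erroneous encoding" is an assumption, since the actual criterion is not specified here.

```python
import unicodedata
from typing import Iterable, List


def clean_paragraphs(paragraphs: Iterable[str], min_chars: int = 16) -> List[str]:
    """Drop damaged or very short paragraphs, normalize to NFKC, deduplicate."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        # Assumption: U+FFFD (replacement character) signals a decoding error.
        if "\ufffd" in paragraph or len(paragraph) < min_chars:
            continue
        paragraph = unicodedata.normalize("NFKC", paragraph)
        # Deduplicate after normalization, so compatibility variants collapse.
        if paragraph in seen:
            continue
        seen.add(paragraph)
        kept.append(paragraph)
    return kept


def keep_sentence(words: List[str], min_words: int = 5, max_words: int = 80) -> bool:
    """Keep sentences with at least 5 and at most 80 words."""
    return min_words <= len(words) <= max_words
```

Deduplicating after normalization matters: two paragraphs that differ only in a ligature or a full-width character become identical under NFKC.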
The resulting corpus contains 5.9 M sentences and 90 G words in 45 languages and is available in CoNLL-U format. The per-language sizes of the corpus are listed in Table 1.
2 http://commoncrawl.org/
Except for Ancient Greek, which was gathered from the Perseus Digital Library.
Precomputed word embeddings
We also precomputed word embeddings using the segmented and tokenized plain texts. Because UD words can contain spaces, these in-word spaces were converted to Unicode character NO-BREAK SPACE (U+00A0). 5 The dimensionality of the word embeddings was chosen to be 100 after thorough discussion: more dimensions may yield better results and are commonly used, but even with just 100, the uncompressed word embeddings for the 45 languages take 135 GiB. Also note that Andor et al. (2016) achieved state-of-the-art results with 64 dimensions.
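The in-word space conversion is simple enough to show directly. The sketch below assumes that the embedding training input is one sentence per line with words separated by a plain space; the helper name is hypothetical.

```python
from typing import List

NBSP = "\u00a0"  # Unicode NO-BREAK SPACE


def embedding_line(words: List[str]) -> str:
    """Join UD words with plain spaces, protecting spaces inside a word."""
    return " ".join(word.replace(" ", NBSP) for word in words)


line = embedding_line(["10 000", "VND"])
# Split on the plain space only: Python's argument-less str.split() treats
# NO-BREAK SPACE as whitespace and would break the protected word apart.
tokens = line.split(" ")
```

Any tool that later reads such a file has to split on U+0020 only, for the reason given in the comment.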

Test Data: UD 2.0
The main part of test data comprises test sets corresponding to 63 of the 64 training treebanks. 6 Test sets from two different treebanks of one language were evaluated separately as if they were different languages. Every test set contained at least 10,000 words or punctuation marks. UD 2.0 treebanks that were smaller than 10,000 words were excluded from the evaluation. Among the treebanks that were able to provide the required amount of test data, there are 8 treebanks so small that the remaining data could not be split to training and development portions; for two of them, the data left for training is only a tiny sample (529 words in Kazakh, 1662 in Uyghur). There was no upper limit on the test data; the largest treebank had a test set comprising 170K words.
Although the 63 test sets correspond to UD 2.0 treebanks, they were not released with UD 2.0. They were kept hidden and only published after the evaluation phase of the shared task (Nivre et al., 2017a).

New Parallel Test Sets
In addition, there were test sets for which no corresponding training data sets exist: 4 "surprise" languages (described in Section 2.5) and 14 test sets of a new Parallel UD (PUD) treebank (described in this section). These test sets were created for this shared task, i.e., not included in any previous UD release.
The PUD treebank consists of 1000 sentences currently in 18 languages (15 K to 27 K words, depending on the language), which were randomly picked from on-line newswire and Wikipedia; 7 usually only a few sentences per source document. 750 sentences were originally English, the remaining 250 sentences come from German, French, Italian and Spanish texts. They were translated by professional translators to 14 languages (i.e., 15 languages with the original: Arabic, Chinese, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Russian, Spanish, Thai and Turkish; but four languages-Chinese, Indonesian, Korean and Thai-were excluded from the shared task due to consistency issues). Translators were instructed to prefer translations closer to original grammatical structure, provided it is still a fluent sentence in the target language. In some cases, picking a correct translation was difficult because the translators did not see the context of the original document. The translations were organized at DFKI and text & form, Germany; they were then tokenized, morphologically and syntactically annotated at Google following guidelines based on McDonald et al. (2013), and finally converted to proper UD v2 annotation style by volunteers from the UD community using the Udapi framework (Popel et al., 2017). 8 Three additional translations (Czech, Finnish and Swedish) were contributed and annotated natively in UD v2 by teams from Charles University, University of Turku and Uppsala University, respectively.
The Google dependency representation predates Universal Dependencies, deriving from the scheme used by McDonald et al. (2013), i.e., Stanford Dependencies 2.0 with the option to make copula verbs heads (de Marneffe and Manning, 2008, section 4.7) and Google Universal POS tags (Petrov et al., 2011). Various tree transformations were needed to convert it to UD. 9 For example, prepositions and copula verbs are phrasal heads in Google annotation but must be dependent function words in UD. Similarly, some POS tags differ in the two schemes; particularly hard were conjunctions, where the Google tag set does not distinguish coordinators (CCONJ in UD) from subordinators (SCONJ). Some bugs, for example where verbs had multiple subjects or objects, or where function words were not leaves, were detected automatically 10 and fixed manually. Finally, the most prominent consistency issues lay in tokenization and word segmentation, especially in languages where it interacts with morphology or where the writing system does not clearly mark word boundaries. The tokenizers used before manual annotation were not necessarily compatible with existing UD treebanks, yet in the shared task it was essential to make the segmentation consistent with the training data. We were able to fix some problems, such as unmarked multi-word tokens in European languages, 11 and we were even able to re-segment Japanese (note that this often involved new dependency relations); on the other hand, we had to exclude Korean for not being able to fix it in time.
7 The two domains are encoded in sentence ids but this information is not visible to the systems participating in the shared task.
8 http://udapi.github.io/
9 using ud.Google2ud from the Udapi framework
Many transformations were specific to individual languages. For example, in the original tokenization of Arabic, the definite article al-was separated from the modified word, which is comparable to the D3 tokenization scheme (Habash, 2010). This scheme was inconsistent with the tokenization of the Arabic training data, hence it had to be changed. Text-level normalization further involved removal of the shadda diacritical mark (marking consonant gemination), which is optional in Arabic orthography and does not occur in the training data. On the POS level, the active and passive participles and verbal nouns (masdars) were annotated as verbs. For Arabic, however, these should be mapped to NOUN. Once we changed the tags, we also had to modify the surrounding relations to those used with nominals.
Like some UD treebanks, the parallel data contains information on document boundaries. They are projected as empty lines to the raw text presented to parsers, and they can be exploited to improve sentence segmentation. Note that due to the way the sentences were collected, the paragraphs are rather short. 12 The fact that the data is parallel was not exploited in this task. Participating systems were told the language code so they could select an appropriate model. All parallel test sets were in languages that have at least one training treebank in UD 2.0 (although the domain may differ).
After the evaluation phase these parallel test sets were published together with the main test data; in the future they will become part of regular UD releases.
10 using ud.MarkBugs from the Udapi framework
11 using Udapi's ud.de.AddMwt for German, and similarly for Spanish (es), French (fr) and Portuguese (pt). For all languages, we applied ud.ComplyWithText to make sure the concatenation of tokens matches exactly the original raw text.
12 A special case is Arabic where we artificially marked every sentence as a separate paragraph, to make it more consistent with somewhat unusual segmentation of the existing UD Arabic treebank. This gave an advantage to systems that were able to take paragraph boundaries into account, including those that re-used the baseline segmentation.

Surprise Languages
The second type of additional test sets were surprise languages, which had not been previously released in UD. Names of surprise languages (Buryat, Kurmanji Kurdish, North Sámi and Upper Sorbian) and small samples of gold-standard data (about 20 sentences) were published one week before the beginning of the evaluation phase. Crawled raw texts were provided too, though in much smaller quantity than for the other languages. The point of having surprise languages was to encourage participants to pursue truly multilingual approaches to parsing, utilizing data from other languages.
As with all other test sets, the systems were able to use segmentation and part-of-speech tags predicted by the baseline UDPipe system (in this case UDPipe was trained and applied in a 10-fold cross-validation manner directly on the test data; hence this is the only annotation that the participants were given but could not produce with their own models).
Note that the smallest non-surprise languages (Kazakh, Uyghur) called for multilingual approaches as well, given that the amount of their own training data was close to zero. The difference was that participants at least knew in advance what these languages were and had more time to determine the most suitable training model. On the other hand, the segmentation and tagging models for these languages were only trained on the tiny training data, i.e., they were much worse than the models for the surprise languages. In this sense parsing of Kazakh and Uyghur was even harder than parsing the surprise languages.
When compared to the training data available in UD 2.0, the genetically closest language to Kazakh and Uyghur is Turkish; but it uses a different writing system, and the Turkish dataset itself is not particularly large. For Kurmanji Kurdish, the closest relative is Persian, again with different script and other reservations. Buryat is a Mongolic language written in Cyrillic script and does not have any close relative in UD. North Sámi is a Finno-Ugric language; Finnish and Estonian UD data could be expected to be somewhat similar. Finally, Upper Sorbian is a West Slavic language spoken in Germany; among the many Slavic languages in UD, Czech and Polish are its closest relatives.
In summary, the test data consisted of 81 files in 49 languages (55 test sets from "big" UD 2.0 treebanks, 8 "small" treebanks, 14 parallel test sets and 4 surprise-language test sets).

Evaluation Metrics
The standard evaluation metric of dependency parsing is the labeled attachment score (LAS), i.e., the percentage of nodes with correctly assigned reference to parent node, including the label (type) of the relation. When parsers are applied to raw text, the metric must be adjusted to the possibility that the number of nodes in gold-standard annotation and in the system output differ. Therefore, the evaluation starts with aligning system nodes and gold nodes. A dependency relation cannot be counted as correct if one of the nodes could not be aligned to a gold node. LAS is then re-defined as the harmonic mean (F1) of precision P and recall R, where
\[ P = \frac{\#\mathit{correctRelations}}{\#\mathit{systemNodes}}, \qquad R = \frac{\#\mathit{correctRelations}}{\#\mathit{goldNodes}}, \qquad \mathrm{LAS} = \frac{2PR}{P + R} \]
Note that attachment of all nodes including punctuation is evaluated. LAS is computed separately for each of the 81 test files and a macro-average of all these scores serves as the main metric for system ranking in the task.
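As a worked illustration of the metric, the following sketch computes the F1-style LAS from three counts and then macro-averages per-file scores. It assumes that precision is the number of correct relations divided by the number of system nodes, and recall the same count divided by the number of gold nodes; the function names are ours, not those of the official evaluation script.

```python
from typing import Sequence


def las_f1(correct: int, system_nodes: int, gold_nodes: int) -> float:
    """Harmonic mean of precision and recall over aligned, correct relations."""
    if correct == 0 or system_nodes == 0 or gold_nodes == 0:
        return 0.0
    precision = correct / system_nodes
    recall = correct / gold_nodes
    return 2 * precision * recall / (precision + recall)


def macro_average(file_scores: Sequence[float]) -> float:
    """Unweighted mean over test files; a missing or invalid file counts as 0."""
    return sum(file_scores) / len(file_scores)
```

When system and gold segmentation agree, the two node counts are equal and the score reduces to the ordinary LAS. Because the average is unweighted, a small test set influences the final ranking as much as a large one.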

Token Alignment
UD defines two levels of token/word segmentation. The lower level corresponds to what is usually understood as tokenization. However, unlike some popular tokenization schemes, it does not include any normalization of the non-whitespace characters. We can safely assume that any two tokenizations of a text differ only in whitespace while the remaining characters are identical. There is thus a 1-1 mapping between gold and system non-whitespace characters, and two tokens are aligned if all their characters match.
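One way to realize this alignment is to give every token a span of offsets counted in non-whitespace characters and to align tokens whose spans coincide. The sketch below follows that idea; it is an illustration under this assumption, with hypothetical function names, rather than the official implementation.

```python
from typing import List, Tuple


def strip_ws(token: str) -> str:
    """Remove every whitespace character from a token."""
    return "".join(ch for ch in token if not ch.isspace())


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Span of each token, with offsets counted in non-whitespace characters."""
    spans = []
    position = 0
    for token in tokens:
        length = len(strip_ws(token))
        spans.append((position, position + length))
        position += length
    return spans


def align_tokens(gold: List[str], system: List[str]) -> List[Tuple[int, int]]:
    """Return (gold_index, system_index) pairs for tokens with identical spans."""
    if "".join(map(strip_ws, gold)) != "".join(map(strip_ws, system)):
        raise ValueError("character mismatch: the two files cannot be aligned")
    gold_index = {span: i for i, span in enumerate(token_spans(gold))}
    return [
        (gold_index[span], j)
        for j, span in enumerate(token_spans(system))
        if span in gold_index
    ]
```

For instance, if the gold tokens are `New`, `York`, `-`, `based`, `.` and the system produced `New`, `York-based`, `.`, only the first and last tokens are aligned; the three gold tokens in the middle have no counterpart.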

Syntactic Word Alignment
The higher segmentation level is based on the notion of syntactic word. Some languages contain multi-word tokens (MWT) that are regarded as contractions of multiple syntactic words. For example, the German token zum is a contraction of the preposition zu "to" and the article dem "the".
Syntactic words constitute independent nodes in dependency trees. As shown by the example, it is not required that the MWT is a pure concatenation of the participating words; the simple token alignment thus does not work when MWTs are involved. Fortunately, the CoNLL-U file format used in UD clearly marks all MWTs so we can detect them both in system output and in gold data. Whenever one or more MWTs have overlapping spans of surface character offsets, the longest common subsequence algorithm is used to align syntactic words within these spans.
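The longest-common-subsequence step can be sketched as below. The example aligns the word forms inside one overlapping span; comparing forms by exact string equality and the function name are assumptions made for the illustration.

```python
from typing import List, Tuple


def lcs_align(gold: List[str], system: List[str]) -> List[Tuple[int, int]]:
    """Align two word sequences along one longest common subsequence."""
    n, m = len(gold), len(system)
    # table[i][j] = length of the LCS of gold[i:] and system[j:]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if gold[i] == system[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start, collecting the matched index pairs.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if gold[i] == system[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return pairs
```

With gold words `zu dem Haus` and a system that failed to split `zum`, only `Haus` is aligned, so the relations of the two unaligned gold words cannot be counted as correct.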

Sentence Segmentation
Words are aligned and dependencies are evaluated in the entire file without considering sentence segmentation. Still, the accuracy of sentence boundaries has an indirect impact on LAS: any missing or extra sentence boundary necessarily makes one or more dependency relations incorrect.

Invalid Output
If a system fails to produce one of the 81 files or if the file is not valid CoNLL-U format, the score of that file (counting towards the system's macroaverage) is zero.
Formal validity is defined more leniently than for UD-released treebanks. For example, a nonexistent dependency type does not render the whole file invalid, it only costs the system one incorrect relation. However, cycles and multi-root sentences are disallowed. A file is also invalid if there are character mismatches that could make the token alignment algorithm fail.
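The two structural conditions (no cycles, a single root) can be checked per sentence from the HEAD column alone. The following is a minimal sketch with a hypothetical function name; rejecting out-of-range head indices is our own assumption about what "valid" should include.

```python
from typing import List


def is_valid_tree(heads: List[int]) -> bool:
    """heads[i] is the 1-based head of word i+1; 0 marks attachment to the root."""
    n = len(heads)
    # Assumption: a head index pointing outside the sentence is also invalid.
    if any(head < 0 or head > n for head in heads):
        return False
    # Multi-root sentences are disallowed (and an empty sentence has no root).
    if sum(1 for head in heads if head == 0) != 1:
        return False
    # Follow the head chain from every word; revisiting a word means a cycle.
    for start in range(1, n + 1):
        visited = set()
        node = start
        while node != 0:
            if node in visited:
                return False
            visited.add(node)
            node = heads[node - 1]
    return True
```

The quadratic walk is kept for clarity; sentences are short enough that this is unlikely to matter in practice.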

CLAS
Content-word Labeled Attachment Score (CLAS) has been proposed as an alternative parsing metric that is tailored to the UD annotation style and more suitable for cross-language comparison (Nivre and Fang, 2017). It differs from LAS in that it only considers relations between content words. Attachment of function words is disregarded because it corresponds to morphological features in other languages (and morphology is not evaluated in this shared task). Furthermore, languages with many function words (e.g., English) have longer sentences than morphologically rich languages (e.g., Finnish), hence a single error in Finnish costs the parser significantly more than an error in English. CLAS also disregards attachment of punctuation.
As CLAS is still experimental, we have designated full LAS as our main evaluation metric; nevertheless, a large evaluation campaign like this is a great opportunity to study the behavior of the new metric, and we present both scores in Section 6.
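To make the difference between the two metrics concrete, the sketch below computes LAS and CLAS for one aligned sentence. The set of relations treated as function-word or punctuation relations is an assumption made for illustration and should be checked against the official evaluation script; labels are compared exactly as given, and the data layout and names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[Optional[int], str]  # (index of head in the same list or None for root, relation)

# Assumed set of relations ignored by CLAS; not taken from the official script.
FUNCTION_RELATIONS = {"aux", "case", "cc", "clf", "cop", "det", "mark", "punct"}


def is_content(relation: str) -> bool:
    return relation.split(":")[0] not in FUNCTION_RELATIONS


def f1(correct: int, n_system: int, n_gold: int) -> float:
    if correct == 0 or n_system == 0 or n_gold == 0:
        return 0.0
    p, r = correct / n_system, correct / n_gold
    return 2 * p * r / (p + r)


def las_and_clas(gold: List[Node], system: List[Node],
                 alignment: Dict[int, int]) -> Tuple[float, float]:
    """alignment maps gold word indices to system word indices."""
    las_correct = clas_correct = 0
    for g, s in alignment.items():
        g_head, g_rel = gold[g]
        s_head, s_rel = system[s]
        if g_head is None or s_head is None:
            same_head = g_head is None and s_head is None
        else:
            same_head = alignment.get(g_head) == s_head
        if same_head and g_rel == s_rel:
            las_correct += 1
            if is_content(g_rel):
                clas_correct += 1
    las = f1(las_correct, len(system), len(gold))
    clas = f1(clas_correct,
              sum(is_content(rel) for _, rel in system),
              sum(is_content(rel) for _, rel in gold))
    return las, clas
```

For "The dog barks ." with a single mislabeled subject, LAS is 0.75 (three of four relations correct) while CLAS is 0.5 (one of two content-word relations correct), which illustrates why CLAS penalizes the same error more heavily.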

Evaluation Methodology
Key goals of any empirical evaluation are to ensure a blind evaluation, its replicability, and its reproducibility. To facilitate these goals, we employed the cloud-based evaluation platform TIRA (Potthast et al., 2014), 13 which implements the evaluation as a service paradigm (Hanbury et al., 2015). In doing so, we depart from the traditional submission of system output to shared tasks, which lacks in these regards, toward the submission of working software. Naturally, software submissions bring about additional overhead for both organizers and participants, whereas the goal of an evaluation platform like TIRA is to reduce this overhead to a bearable level. Still being an early prototype, though, TIRA fulfills this goal only with some reservations. Nevertheless, the scale of the CoNLL 2017 UD Shared Task also served as a test of scalability of the evaluation as a service paradigm in general as well as that of TIRA in particular.

Blind Evaluation
Traditionally, evaluations in shared tasks are half-blind (the test data are shared with participants while the ground truth is withheld), whereas outside shared tasks, say, during paper-writing, evaluations are typically pseudo-blind (the test data and

Replicability and Reproducibility
The replicability of an evaluation depends on whether the same results can be obtained from re-running an experiment using the same setup, whereas reproducibility refers to achieving results that are commensurate with a reference evaluation, for instance, when exchanging the test data with alternative test data. Both are important aspects of an evaluation, the former pertaining to its reliability, and the latter to its validity. Ensuring both requires that a to-be-evaluated software is preserved in working condition for as long as possible. Traditionally, shared tasks do not take charge of participant software preservation, mostly because the software remains with participants, and since open sourcing the software underlying a paper is still the exception rather than the rule. To ensure both, TIRA supplies participants with a virtual machine, offering a range of commonly used operating systems in order not to limit the choice of technology stacks and development environments. Once deployed and tested, the virtual machines are archived to preserve the software within.
Many participants agreed to share their code, so we decided to collect the respective projects in a kind of open source proceedings at GitHub. 14

Resource Allocation
The allocation of an appropriate amount of computing resources (especially CPUs and RAM, whereas disk space is cheap enough) to each participant proved to be difficult, since minimal requirements were unknown. When asked, participants typically request liberal amounts of resources, just to be on the safe side, whereas assigning too much up front would be neither economical nor scalable. We hence applied a least commitment strategy with an initial assignment of 1 CPU and 4 GB RAM. More resources were granted on request, the limit being the size of the underlying hardware. When it comes to exploiting available resources, a lot depends on programming prowess, whereas more resources do not necessarily translate into better performance. This is best exemplified by the fact that with 4 CPUs and 16 GB RAM, the winning team Stanford used only a quarter of the resources used by each of the second- and third-placed teams. The team in fourth (sixth) place was even more frugal, getting by with 1 CPU and 8 GB RAM (4 GB RAM). All of the aforementioned teams' approaches exceed the LAS level of 70%.

UDPipe
We prepared a set of baseline models using UDPipe (Straka et al., 2016) [...] by Straka and Straková (2017) as one of the competing systems. Straka and Straková (2017) describe both these versions in more detail.
The baseline models were released together with the UD 2.0 training data, one model for each treebank. Because only training and development data were available during baseline model training, we put aside a part of the training data for hyperparameter tuning, and evaluated the baseline model performance on development data. We called this data split baseline model split. The baseline models, the baseline model split, and also UD 2.0 training data with morphology predicted by 10-fold jack-knifing (cross-validation), are available on-line (Straka, 2017).
UDPipe baseline models are able to reconstruct nearly all annotation from CoNLL-U files: they can generate segmentation, tokenization, multi-word token splitting, morphological annotation (lemmas, UPOS, XPOS and FEATS) and dependency trees. Participants were free to use any part of the model in their systems: for all test sets, we provided UDPipe-processed variants in addition to raw text inputs. We provided the UDPipe-processed variant even for surprise languages; however, it contained only segmentation, tokenization and morphology, generated by 10-fold jack-knifing, as described in Section 2.5.
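The idea behind 10-fold jack-knifing is that every sentence is annotated by a model that never saw it during training. A generic sketch follows; `train` and `predict` are hypothetical callables standing in for a tagger's training and prediction routines, and the round-robin assignment of sentences to folds is an assumption.

```python
from typing import Callable, List, Sequence, TypeVar

S = TypeVar("S")  # sentence type
M = TypeVar("M")  # model type
A = TypeVar("A")  # annotation type


def jackknife(
    sentences: Sequence[S],
    train: Callable[[List[S]], M],
    predict: Callable[[M, S], A],
    folds: int = 10,
) -> list:
    """Annotate each sentence with a model trained on the other folds only."""
    predictions = [None] * len(sentences)
    for k in range(folds):
        # Round-robin fold assignment (an assumption of this sketch).
        held_out = range(k, len(sentences), folds)
        held_out_set = set(held_out)
        training = [s for i, s in enumerate(sentences) if i not in held_out_set]
        model = train(training)
        for i in held_out:
            predictions[i] = predict(model, sentences[i])
    return predictions
```

Annotation produced this way has roughly the error rate a parser will face at test time, which is why it is preferable to gold morphology as parser training input.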

Baseline UDPipe Shared Task System
We further used the baseline models as a baseline system in the shared task. We used the corresponding models for the UD 2.0 test data.
For the new parallel treebanks, we used UD 2.0 baseline models of the corresponding languages. If there were several treebanks for one language, we arbitrarily chose the one named after the language only (e.g., we chose ru and not ru_syntagrus). Unfortunately, we did not explicitly mention this choice to the participants and this arbitrary choice had a large impact on results: some contestant systems fell below the UDPipe baseline just because of choosing different treebanks to train on for the parallel treebanks. (On the other hand, there was no guarantee that the models selected in the baseline system would be optimal.)
For each surprise language, we also chose one baseline model to apply. Even if most words are unknown to the baseline model, universal POS tags can be used to drive the parsing, making the baseline model act similar to a delexicalized parser. We chose a baseline model to maximize [...]
Team LAS
1. Stanford (Dozat et al.) 76.30
2. C2L2 (Shi et al.) 75.00
3. IMS (Björkelund et al.) 74.42
4. HIT-SCIR (Che et al.) 72.11
5. LATTICE (Lim and Poibeau) 70.93
6. NAIST SATO (Sato et al.) 70.14
7. Koç University (Kırnap et al.) 69.76
8. ÚFAL (Straka and Straková) 69.52
9. UParse (Vania et al.) 68.87
10. Orange (Heinecke and Asadullah) 68.61
11. TurkuNLP (Kanerva et al.) 68.59
12. darc (Yu et al.) 68

SyntaxNet
Another set of baseline models was prepared by Alberti et al. (2017) based on an improved version of the SyntaxNet system (Andor et al., 2016). Pre-trained models were provided for UD 2.0 data. However, no SyntaxNet models were prepared for the surprise languages; therefore, the SyntaxNet baseline is not part of the official results.

Official Parsing Results
Table 2 gives the main ranking of participating systems by the LAS F1 score macro-averaged over all 81 test files. The table also shows the performance of the baseline UDPipe system; the baseline is relatively strong and only 12 of the 32 systems managed to outperform it.
We used bootstrap resampling to compute 95% confidence intervals: they are in the range ±0.11 to ±0.15 (% LAS) for all systems except the three lowest-scoring ones. We used paired bootstrap resampling to compute whether the difference in LAS is significant (p < 0.05) for each pair of systems. 15
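For readers unfamiliar with the technique, a generic percentile bootstrap and a paired bootstrap test are sketched below. The resampling unit (here, a list of per-unit scores), the number of resamples and the function names are assumptions; the exact procedure used for the official results may differ.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(scores: Sequence[float], samples: int = 1000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(samples))
    low = means[int((1 - level) / 2 * samples)]
    high = means[min(samples - 1, int((1 + level) / 2 * samples))]
    return low, high


def paired_bootstrap_p(a: Sequence[float], b: Sequence[float],
                       samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resamples in which system A does not outscore system B."""
    if len(a) != len(b):
        raise ValueError("paired test needs scores on the same units")
    rng = random.Random(seed)
    n = len(a)
    not_better = 0
    for _ in range(samples):
        # The same resampled units are used for both systems (the pairing).
        indices = [rng.randrange(n) for _ in range(n)]
        if sum(a[i] for i in indices) <= sum(b[i] for i in indices):
            not_better += 1
    return not_better / samples
```

Under this sketch, system A would be called significantly better than system B at the 0.05 level when `paired_bootstrap_p(a, b)` falls below 0.05.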

Secondary Metrics
In addition to the main LAS ranking, we evaluated the systems along multiple other axes, which may shed more light on their strengths and weaknesses. This section provides an overview of selected secondary metrics for systems matching or surpassing the baseline; a large number of additional results is available at the shared task website. 16 The website also features a LAS ranking of unofficial system runs, i.e. those that were not marked by their teams as primary runs, or were even run after the official evaluation phase closed and test data were unblinded. At least two differences from the official results are remarkable; both seem to be partially caused by the blind evaluation on TIRA and the inability of the participants to see the diagnostic messages from their software. In the first case, the Dynet library seems to pro[...] et al., 2017) used a wrong method of recognizing the input language, which was not supported in the test data (but unfortunately it was possible to get along with it in development and trial data). Had the parser simply crashed, the task moderator could have shown the team their diagnostic output and they would have fixed the bug; however, the parser was robust enough to switch to a language-agnostic mode and produced results that were not great, but also not bad enough to alert the moderator and prompt an investigation. Thus the official LAS of the system is 60.02 (27th place) while without the bug it could have been 70.35 (6th place).
Table 3 ranks the systems by CLAS instead of LAS (see Section 3.5). The scores are lower than LAS but differences in system ranking are minimal, possibly indicating that optimization towards one of the metrics does not make the parser bad with respect to the other.
Table 4 evaluates detection of tokens, syntactic words and sentences. Half of the systems simply trusted the segmentation offered by the baseline system. 7 systems were able to improve baseline segmentation.
For most languages and in aggregate, the ability to improve parsing scores through better segmentation was probably negligible, but for a few languages, such as Chinese and Vietnamese, the UDPipe baseline segmentation was not so strong and several teams, notably IMS, appear to have improved their LAS by several percent through use of improved segmentation.
The systems were not required to generate any morphological annotation (part-of-speech tags, features or lemmas). Some parsers do not even need morphology and learn to predict syntactic dependencies directly from text. Nevertheless, systems that did output POS tags, and had them at least as good as the baseline system, are evaluated in Table 5. Note that as with segmentation, morphology predicted by the baseline system was available and some systems simply copied it to the output.

Partial Results
Table 6 gives the LAS F1 score averaged over the 55 "big" treebanks (training data larger than test data, development data available). Higher scores reflect the fact that models for these test sets are easier to learn: enough data is available, no cross-lingual or cross-domain learning is necessary (the parallel test sets are not included here). When compared to Table 2, four new teams now surpass the baseline, LyS-FASTPARSE being the best among them. The likely explanation is that the systems can learn good models but are not so good at picking the right model for unknown domains and languages.

Table 6: Average attachment score on the 55 "big" treebanks.

Table 7 gives the LAS F1 score on the four surprise languages only. The globally best system, Stanford, now falls back to the fourth rank while C2L2 (Cornell University) apparently employs the most successful strategy for under-resourced languages. Another immediate observation is that our surprise languages are very hard to parse; accuracy under 50% is hardly useful for any downstream processing. However, there are significant language-by-language differences, the best score on Upper Sorbian surpassing 60%. This probably owes to the presence of many Slavic treebanks in training data, including some of the largest datasets in UD.
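The subset scores in Tables 6 and 7 are averages of per-treebank LAS F1 over a chosen group of test sets. A minimal sketch of such an average is given below, assuming an unweighted mean over treebanks and assuming that a treebank with no score counts as zero. The function name, the treebank labels and all numbers are hypothetical and serve only as illustration; they are not results from the shared task.

```python
from statistics import mean


def macro_average_las(scores, treebanks):
    """Unweighted mean of per-treebank LAS F1 over the given treebanks.

    Assumption of this sketch: a treebank without a score (for example,
    because a run failed) contributes 0.0 to the average.
    """
    return mean(scores.get(tb, 0.0) for tb in treebanks)


# Hypothetical per-treebank LAS F1 values for one system (not real results).
system_scores = {"big_a": 85.0, "big_b": 80.0, "surprise_a": 55.0}

big_group = ["big_a", "big_b"]
surprise_group = ["surprise_a", "surprise_b"]  # "surprise_b" has no score

print(macro_average_las(system_scores, big_group))       # 82.5
print(macro_average_las(system_scores, surprise_group))  # 27.5
```

Because every treebank carries the same weight, a small test set influences the average as much as a large one, which is one reason why the ranking of systems can change from one subset to another.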
In contrast, the results on the 8 small non-surprise treebanks (Table 8) are higher on average, but again the variance is huge. Uyghur (best score 43.51) is worse than three surprise languages, and Kazakh (best score 29.22) is the least parsable test set of all (see Table 10). These two treebanks are outliers in the size of training data (529 words for Kazakh and 1662 words for Uyghur, while the other "small" treebanks have between 10K and 20K words). However, the only "training data" of the surprise languages are samples of 147 to 460 words, yet they seem to be easier for some systems. It would be interesting to know whether the more successful systems took a similar approach to Kazakh and Uyghur as to the surprise languages.

Table 9 gives the average LAS on the 14 new parallel test sets (PUD). Three of them (Turkish, Arabic and Hindi) proved difficult to parse for any model trained on the UD 2.0 training data; it seems likely that besides domain differences, inconsistent application of the UD annotation guidelines played a role, too.
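The per-test-set best scores cited in this section, which Table 10 collects into a ranking, amount to taking a maximum over systems for each test set and then sorting. The following sketch shows one way to compute such a ranking; the data layout, the function name and the toy scores are assumptions made for illustration and are not taken from the official evaluation scripts.

```python
def rank_by_best_las(results):
    """Rank test sets by the best LAS any system achieved on them.

    results maps system name -> {test set name -> LAS}.
    Returns (test_set, best_las, best_system) tuples, best score first.
    """
    best = {}
    for system, per_test_set in results.items():
        for test_set, las in per_test_set.items():
            # Keep the highest score seen so far and the system that produced it.
            if test_set not in best or las > best[test_set][0]:
                best[test_set] = (las, system)
    rows = [(ts, las, system) for ts, (las, system) in best.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical systems, test sets and scores (not real results).
results = {
    "SystemA": {"aa": 40.0, "bb": 29.0, "cc": 58.0},
    "SystemB": {"aa": 43.5, "bb": 25.0, "cc": 61.0},
}

for test_set, las, system in rank_by_best_las(results):
    print(f"{test_set}\t{las:.2f}\t{system}")
```

As with the table itself, such a ranking reflects training data size as much as inherent difficulty, so it should not be read as a ranking of languages.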
See Table 10 for a ranking of all test sets by the best LAS achieved on them by any parser. Note that this cannot be directly interpreted as a ranking of languages by their parsing difficulty: many treebanks have high ranks simply because the corresponding training data is large. The table also gives a secondary ranking by CLAS and indicates the system that achieved the best LAS / CLAS (mostly the same system won by both metrics). Finally, the best score of word and sentence segmentation is given (without indicating the best-scoring system).

Table 9: Average attachment score on the 14 parallel test sets (PUD).

Vietnamese proved to be the hardest language in terms of word segmentation; it is not surprising given that its writing system allows spaces inside words. Second hardest was Hebrew, probably due to a large number of multi-word tokens. In both cases the poor segmentation correlates with poor parsing accuracy. Sentence segmentation was particularly difficult for treebanks without punctuation, i.e., most of the classical languages and spoken data (the best score achieved on the Spoken Slovenian Treebank is only 21.41%). On the other hand, the paragraph boundaries available in some treebanks made sentence detection significantly easier (the extreme being Arabic PUD with one sentence per paragraph; some systems were able to exploit this anomaly and get 100% correct segmentation).

Analysis of Submitted Systems
Table 11 gives an overview of 29 of the systems evaluated in the shared task. The overview is based on a post-evaluation questionnaire to which 29 of 32 teams responded. The abbreviations used in Table 11 are explained in Table 12.
As we can see from Table 11, the typical system uses the baseline models for segmentation and morphological analysis (including part-of-speech tagging), employs a single parsing model with pretrained word embeddings provided by the organizers, and does not make use of any additional data. For readability, all the cells corresponding to use of baseline models (and lack of additional data) have been shaded gray.
Only 7 teams have developed their own word and sentence segmenters, while an additional 5 teams have retrained or improved the baseline models, or combined them with other techniques.

Table 10: Treebank ranking by best parser LAS. Bold CLAS is higher than the preceding one. Best F1 of word and sentence segmentation is also shown. ISO 639 language codes are optionally followed by a treebank code.
When it comes to part-of-speech tags and morphology, 7 teams use their own systems and 4 use modified versions of the baseline, while 2 teams predict tags jointly with parsing and 3 teams do not predict morphology at all. For parsing, most teams use a single parsing model (transition-based, graph-based or even rule-based), but 4 teams build ensemble systems in one way or another. It is worth noting that, whereas the C2L2 and IMS systems are ensembles, the winning Stanford system is not, which makes its performance even more impressive.
The majority of parsers incorporate pre-trained word embeddings. Only 3 parsers use word embeddings without pre-training, and only 4 parsers do not incorporate word embeddings at all. Except for training word embeddings, the additional data provided (or permitted) appears to have been used very sparingly.
When it comes to the surprise languages (and some of the other low-resource languages), the dominant approach is to use a cross-lingual parser, single- or multi-source, and often delexicalized. Finally, for the parallel test sets, most teams have picked a model trained on a single treebank from the same language, but at least 4 teams have trained models on multiple treebanks.
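To make the notion of a delexicalized parser more concrete, the sketch below removes word forms and lemmas from CoNLL-U data, so that a parser trained on the result can rely only on part-of-speech tags, morphological features and tree structure, which are shared across UD languages. This is one common reading of "delexicalized" and not a description of what any particular team did; the function name and the sample sentence are hypothetical.

```python
def delexicalize_conllu(text):
    """Blank the FORM and LEMMA columns of CoNLL-U token lines.

    CoNLL-U token lines have 10 tab-separated columns:
    ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    Comment lines and blank lines are passed through unchanged.
    """
    out = []
    for line in text.split("\n"):
        cols = line.split("\t")
        if len(cols) == 10 and not line.startswith("#"):
            cols[1] = "_"  # FORM
            cols[2] = "_"  # LEMMA
            line = "\t".join(cols)
        out.append(line)
    return "\n".join(out)


# Hypothetical two-word sentence in CoNLL-U format.
sample = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

print(delexicalize_conllu(sample))
```

In a multi-source setting, delexicalized training data from several related treebanks could simply be concatenated before training. Note that this sketch leaves the MISC column untouched, although it may carry language-specific information in some treebanks.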

Conclusion
The CoNLL 2017 Shared Task on UD parsing was novel in several respects. Besides using cross-linguistically consistent linguistic representations and emphasizing end-to-end processing of text, as discussed in the introduction, it was unusual also in featuring a very large number of languages, in integrating cross-lingual learning for resource-poor languages, and in using a multiply parallel test set.
It was the first large-scale evaluation on data annotated in the Universal Dependencies style. For most UD languages the results represent a new state of the art for dependency parsing. The numbers are not directly comparable to some older work for various reasons (different annotation schemes, gold-standard POS tags, tokenization, etc.) but the way the task was organized should ensure their reproducibility and comparability in the future. Furthermore, parsing results are now more comparable across languages than ever before.

Table 11: Classification of participating systems. The second column repeats the main system ranking.
Two new language resources were produced whose usefulness reaches far beyond the task itself: a UD-style parallel treebank in 18 languages, and a large, web-crawled parsebank in 48 languages, with over 90 billion words in total.
The analysis of the shared task results has so far only scratched the surface, and we refer to the system description papers for more in-depth analysis of individual systems and their performance. For many previous CoNLL shared tasks, the task itself has only been the starting point of a long and fruitful research strand, enabled by the resources created for the task. We hope and believe that the 2017 UD parsing task will join this tradition.