IMS at the CoNLL 2017 UD Shared Task: CRFs and Perceptrons Meet Neural Networks

This paper presents the IMS contribution to the CoNLL 2017 Shared Task. In the preprocessing step we employed a CRF POS/morphological tagger and a neural tagger predicting supertags. On some languages, we also applied word segmentation with the CRF tagger and sentence segmentation with a perceptron-based parser. For parsing we took an ensemble approach by blending multiple instances of three parsers with very different architectures. Our system achieved third place overall and second place on the surprise languages.


Introduction
This paper presents the IMS contribution to the CoNLL 2017 UD Shared Task (Zeman et al., 2017). Our submission to the Shared Task (ST) ranked third. Our overall approach relies on established techniques for improving accuracies of dependency parsers, including strong preprocessing, supertagging and parser combination.
The task was to predict dependency trees from raw text. To make the ST more accessible to participants, the organizers provided baseline predictions for all preprocessing steps (including word and sentence segmentation and POS/morphological feature predictions) using the baseline UDPipe system (Straka et al., 2016). We scrutinized the baseline and considered where we could improve over it. It turns out that, although the UDPipe baseline is a strong one, considerable parsing accuracy improvements can be gained by improving the preprocessing steps. In particular, we applied our own POS/morphology tagging using a CRF tagger and supertagging (Ouchi et al., 2014) with a neural tagger. Additionally, we performed our own word and/or sentence segmentation on a subset of the test sets.
* All three authors contributed equally.
For the parsing step we applied an ensemble approach using three different parsers, sometimes using multiple instances of the same parser: one graph-based parser trained with the perceptron; one transition-based beam search parser also trained with the perceptron; and one greedy transition-based parser trained with neural networks. The parser outputs were combined through blending (also known as reparsing; Sagae and Lavie, 2006) using the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967).
The final test runs were carried out on the TIRA platform (Potthast et al., 2014) where participants were assigned a virtual machine. To ensure that our final test run would finish on time on the VM, we established a time budget for each treebank and set a goal that a full test run should finish within 24 hours. Thus we applied a combination search under a time constraint to limit the maximal number of instances of the individual parsers.
An interesting aspect of the ST was the introduction of four surprise languages. These languages were only announced one week before the test phase at which point the participants were provided with roughly 20 gold standard sentences for each language. Unfortunately, among the allowed external resources the amount of parallel data for the surprise languages was rather limited. This prevented us from using cross-lingual techniques or multilingual word vectors. We therefore resorted to blending models trained on the small samples as well as delexicalized models trained on other source languages.
Another challenge of the ST was the 14 new parallel test sets for the known languages. Since the UD annotation scheme is applied to all of the treebanks, this suggests that the training data of the same language from different domains could be combined. We ran several experiments in this direction and trained models on merged treebanks for most of the parallel test sets (Section 7).
The remainder of this paper is organized as follows. Section 2 discusses our preprocessing steps, including word and sentence segmentation, POS and morphological tagging, and supertagging. In Section 3 we describe the three baseline parsers, while blending is reviewed in Section 4. In Section 5 we go through our pipeline and show results on the development data. Sections 6 and 7 describe our approaches to the surprise languages and parallel test sets, respectively. Our official test set results are shown in Section 8 and Section 9 concludes.

Preprocessing
For most data sets word and sentence segmentation plays a minimal role, as it is delivered almost for free by means of whitespace, sentence-final punctuation and capital letters. Therefore our overall architecture applies the word/sentence segmentation pipeline only to treebanks for which this task is non-trivial (see Figure 1). These test sets can roughly be grouped into two categories. The first comprises languages where tokenization is challenging, e.g., Chinese and Japanese, but also languages such as Arabic and Hebrew, where many orthographic tokens are segmented into smaller syntactic words with transformations. The second category comprises the treebanks where the detection of sentence boundaries is difficult, mostly classical texts.

Word Segmentation
We applied our own word segmentation on six languages: Arabic, French, Hebrew, Japanese, Vietnamese, and Chinese. We selected them by analyzing the UDPipe baseline and picking out cases where we potentially could surpass it.
For Arabic, French and Hebrew, the difficulty lies in splitting orthographic words (i.e., multiword tokens) into several syntactic words (e.g., clitics). Additionally the orthographic words are often not the simple concatenation of their components. For example in French, the multiword token des would be split into two syntactic words de and les. We cast this problem as classification by predicting the Levenshtein edit script to transform a multiword token into its components.
Concretely with the French example, we take the multiword token des as input, and predict de&les, where & is an artificial delimiter to split the token. To reduce the tag set, we used the Levenshtein edit script "=2+&le=1" instead of de&les as the target class, which means keeping the first 2 characters, adding "&le", then keeping 1 character, so that des can be transformed into de&les (thus split into de and les). Using edit scripts reduced the tag set size from about 12,000 to 1,000 for Arabic and from 14,000 to 600 for Hebrew. For Japanese, Vietnamese and Chinese, we simply applied a standard chunking method: for each character (or phoneme in Vietnamese), we predicted the chunk boundary, jointly with the POS tag of the word.
In both cases, we used the state-of-the-art morphological CRF tagger MarMoT (Müller et al., 2013) to predict the tags (edit scripts or chunk boundaries). We used second order models for Arabic, French and Hebrew, and third order models for Japanese, Vietnamese and Chinese.
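The edit-script encoding from the French example can be illustrated with a minimal sketch. The function names `make_script` and `apply_script` are hypothetical, the "-N" deletion operation is an assumption (the example in the text only needs keep and insert), and the sketch assumes that inserted text never contains the characters "=", "+" or "-". The alignment comes from Python's `difflib`, which is a stand-in for whatever Levenshtein alignment was actually used.

```python
import re
from difflib import SequenceMatcher

def make_script(source, target):
    """Encode target as edits over source: '=N' keep, '-N' delete, '+text' insert."""
    parts = []
    ops = SequenceMatcher(None, source, target, autojunk=False).get_opcodes()
    for tag, i1, i2, j1, j2 in ops:
        if tag == "equal":
            parts.append(f"={i2 - i1}")
        else:
            if i2 > i1:                      # 'delete' or 'replace': drop source chars
                parts.append(f"-{i2 - i1}")
            if j2 > j1:                      # 'insert' or 'replace': add target chars
                parts.append(f"+{target[j1:j2]}")
    return "".join(parts)

def apply_script(source, script):
    """Apply an edit script to a multiword token and return the delimited string."""
    out, pos = [], 0
    for keep, delete, insert in re.findall(r"=(\d+)|-(\d+)|\+([^=+\-]+)", script):
        if keep:
            out.append(source[pos:pos + int(keep)])
            pos += int(keep)
        elif delete:
            pos += int(delete)
        else:
            out.append(insert)
    return "".join(out)

# The example from the text: the tagger predicts the script, not the full string.
segmented = apply_script("des", "=2+&le=1")
print(segmented.split("&"))
```

Because the script only records what changes, many different multiword tokens share the same class label, which is what shrinks the tag set.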

Sentence boundary detection
We applied our own sentence segmentation on nine languages (see Figure 1). For some of them, like Gothic or Latin PROIEL, typical orthographic features (e.g., punctuation or capitalization) that indicate sentence boundaries are not present and UDPipe was achieving extremely low scores (23.51 and 19.76 F1 respectively). The others were selected empirically by tests on the development data.
We employed a beam-search transition-based parser extended to predict sentence boundaries (Björkelund et al., 2016). This parser (referred to as TPSeg) is an extension of our transition-based parser (see Section 3.2) using the perceptron and is trained using DLASO updates (Björkelund and Kuhn, 2014; Björkelund et al., 2016). It marks sentence boundaries with an additional transition. For this parser the input is not just a pre-tokenized sentence, but a pre-tokenized document. As documents during test-time we used paragraphs from the raw input text, assuming that no sentence would span across a paragraph break.
A training instance for the parser is a document (rather than a sentence). Some treebanks have the entire training set represented as a single paragraph (document). Initial experiments showed that training the parser on a single document took considerable time and also did not perform very well. Instead, we created artificial documents for training by taking chunks of 10 sentences from the training set and treating them as documents (irrespective of whether they went across paragraphs). We trained the parser using gold word segmentation and POS/morphology information. At test time we relied on UDPipe predictions in most cases. However, for Arabic, the only language where we did both word and sentence segmentation, we applied our own POS/morphology tagger since the word segmentation had changed. Additionally, we applied our tagger on Old Church Slavonic, Estonian, Gothic, Ancient Greek PROIEL and Dutch LassySmall since we found that this led to better sentence segmentation results on the development sets.
Figure 1: System architecture, where langs1: he, ja, fr, fr_sequoia, fr_partut, fr_pud, vi, zh; langs2: cu, en, et, got, grc_proiel, la, la_ittb, la_proiel, nl_lassysmall, sl_sst; langs3: ar, ar_pud.

Part-of-Speech and Morphological Tagging
We used MarMoT to jointly predict POS tags and morphological features. We annotated the training sets via 5-fold jackknifing. All parsers for all languages except the surprise ones were trained on jackknifed features. We did not use XPOS tags and lemmas. We used MarMoT with default hyperparameters.
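As an illustration of 5-fold jackknifing, the following sketch annotates every training sentence with a tagger that never saw that sentence during training. The `train` callback is a hypothetical placeholder for training a tagger such as MarMoT; its signature (a list of training sentences in, a tagging function out) and the round-robin fold assignment are assumptions made for this example only.

```python
def jackknife(data, train, k=5):
    """Return predicted tags for every sentence in data via k-fold jackknifing.

    data  : list of (words, gold_tags) pairs
    train : hypothetical callback; takes a list of such pairs and returns
            a function mapping a word list to a list of predicted tags
    """
    predicted = [None] * len(data)
    for fold in range(k):
        # Sentence i belongs to fold i % k (an arbitrary but simple split).
        train_part = [data[i] for i in range(len(data)) if i % k != fold]
        tagger = train(train_part)
        for i in range(fold, len(data), k):
            words, _gold = data[i]
            predicted[i] = tagger(words)
    return predicted

# Toy stand-in for a real tagger: always predicts the most frequent gold tag.
def train_majority(pairs):
    counts = {}
    for _words, tags in pairs:
        for tag in tags:
            counts[tag] = counts.get(tag, 0) + 1
    majority = max(counts, key=counts.get)
    return lambda words: [majority] * len(words)

toy = [(["a", "dog"], ["DET", "NOUN"]), (["dogs"], ["NOUN"]),
       (["cats"], ["NOUN"]), (["the", "cat"], ["DET", "NOUN"]),
       (["birds"], ["NOUN"])]
print(jackknife(toy, train_majority, k=5))
```

The parsers are then trained on these predicted tags, so that the tag quality seen during training resembles what the parser will see at test time.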

Supertags
Supertags (Joshi and Bangalore, 1994) are labels for tokens which encode syntactic information, e.g., the head direction or the subcategorization frame. Supertagging has recently been proposed to provide syntactic information to the feature model of statistical dependency parsers (Ambati et al., 2013; Ouchi et al., 2014).
We follow the definition of supertagging from Ouchi et al. (2014) and extract supertag tag sets from the treebanks. We use their Model 1 to design our supertags. That is, we encode the dependency relation (label), the relative head direction (hdir) and the presence of left and right dependents (hasLdep, hasRdep) and follow the template label/hdir+hasLdep_hasRdep.
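A minimal sketch of extracting such supertags from a gold tree follows. The function name `supertags` is hypothetical, and the exact string rendering (using "L"/"R" for directions, omitting the direction for the root, and omitting the "+" part when a token has no dependents) is an assumption about how the template is instantiated, not a detail confirmed by the text.

```python
def supertags(heads, labels):
    """Build one supertag per token from a dependency tree.

    heads  : heads[i] is the 1-based head index of token i+1 (0 = root)
    labels : labels[i] is the dependency relation of token i+1
    """
    n = len(heads)
    tags = []
    for tok in range(1, n + 1):
        head = heads[tok - 1]
        tag = labels[tok - 1]
        if head != 0:
            # Relative head direction: is the head to the left or the right?
            tag += "/" + ("L" if head < tok else "R")
        deps = [d for d in range(1, n + 1) if heads[d - 1] == tok]
        sides = []
        if any(d < tok for d in deps):
            sides.append("L")            # hasLdep
        if any(d > tok for d in deps):
            sides.append("R")            # hasRdep
        if sides:
            tag += "+" + "_".join(sides)
        tags.append(tag)
    return tags

# "She reads books": 'reads' is the root with one dependent on each side.
print(supertags([2, 0, 2], ["nsubj", "root", "obj"]))
```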
We used an in-house neural-based tagger (TAGNN) to predict the supertags. It takes the context of a word within a window size of 15. The input word representations are concatenations of three components: output of a character-based Convolutional Neural Network (CNN), pretrained word embeddings provided by the ST organizers, and a binary code indicating the target word. The word representations of the whole context-window are then fed into another CNN to predict the supertag of the target word. We used TAGNN instead of the CRF for supertagging, since it performed considerably better in the preliminary experiments.

Baseline parsers
Surdeanu and Manning (2010) show that combining a set of parsers with a simple voting scheme can improve parsing performance. Martins et al. (2013) demonstrate that self-application, i.e., stacking a parser on its own output, only leads to minuscule improvements. Therefore, to profit from combining components, one of the most significant factors is their diversity. Thus we experimented with three parsers with quite different architectures and additionally varied their settings.

Graph-based perceptron parser
As the graph-based parser we used mate (Bohnet, 2010), henceforth referred to as GP. This is a state-of-the-art graph- and perceptron-based parser. The parser uses the Carreras (2007) extension of the Eisner (1997) decoding algorithm to build a projective parse tree. It then applies the non-projective approximation algorithm of McDonald and Pereira (2006) to recover non-projective dependencies. We train the parser using the default number of training epochs (10).
We modified the publicly available sources of this parser in two ways. First, we extended the feature set with features based on the supertags following Faleńska et al. (2015). Second, we changed the perceptron implementation to shuffle the training instances between epochs. Shuffling enables us to obtain different instances of the parser trained with different random seeds, which are used in the blending step.
Since the time complexity of the Carreras (2007) decoder is quite high (O(n^4)), this parser required a considerable amount of time to parse long sentences. Therefore, while applying this parser in the blending scenario, we skipped all sentences longer than 50 tokens. We additionally made sure that for each treebank we had at least one parser that was not GP, so that all sentences would be parsed.

Transition-based beam-perceptron parser
We apply an in-house transition-based beam search parser trained with the perceptron (Björkelund and Nivre, 2015), henceforth referred to as TP. We have previously extended this parser to accommodate features from supertags (Faleńska et al., 2015). It uses the ArcStandard system extended with a Swap transition (Nivre, 2009) and is trained using an improved oracle. The parser is trained with a globally optimized structured perceptron (Zhang and Clark, 2008) using max-violation updates (Huang et al., 2012).
We use the default settings for beam size (20) and number of training epochs (also 20). Similarly to GP, we employ different seeds for the random number generator used during shuffling of the training instances in order to obtain multiple different models.

Transition-based greedy neural parser
We use an in-house transition-based greedy parser with neural networks, henceforth referred to as TN. The parser uses a CNN to compose word representations from characters; it also takes the embeddings of word forms, universal POS tags and supertags and concatenates all of them as input features. The input is then fed into two hidden layers with ReLU non-linearity, and finally the transition is predicted with a softmax layer. The parser uses the same Swap transition system and oracle as TP. We use the default hyperparameters during training and testing.
During training the parser additionally predicts the supertag of the top token in the stack and includes the tagging cross-entropy into the cost function. This approach is similar to stack-propagation (Zhang and Weiss, 2016), where the tagging task is only used as a regularizer.

Blending
To enhance the performance of the baseline single parsers we combined them using blending (Sagae and Lavie, 2006). We trained multiple instances of each baseline parser using different random seeds. We parsed every sentence and assigned scores to arcs depending on how frequent they were in the predicted trees. We used the Chu-Liu-Edmonds algorithm to decode the maximum spanning tree from the resultant graph. This way we obtained the majority decision of the parser instances under the tree constraint.
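The blending step described above can be sketched as follows: arcs are scored by how many parser instances predicted them, and the Chu-Liu-Edmonds algorithm decodes the maximum spanning tree. This is a compact illustrative implementation rather than the one used in our system; the function names are hypothetical, all parsers are weighted equally, and the sketch does not enforce that the artificial root (node 0) has exactly one dependent.

```python
from collections import Counter

def find_cycle(head):
    """Return the nodes of a cycle in the dep -> head map, or None."""
    for start in head:
        path, on_path, node = [], set(), start
        while node in head and node not in on_path:
            on_path.add(node)
            path.append(node)
            node = head[node]
        if node in on_path:
            return path[path.index(node):]
    return None

def max_arborescence(nodes, score, root=0):
    """Chu-Liu-Edmonds. score maps (head, dep) -> weight; returns dep -> head."""
    # Step 1: every non-root node greedily picks its best-scoring head.
    best = {}
    for dep in nodes:
        if dep != root:
            best[dep] = max((s, h) for (h, d), s in score.items()
                            if d == dep and h != dep)[1]
    cycle = find_cycle(best)
    if cycle is None:
        return best
    # Step 2: contract the cycle into a fresh node and rescore entering arcs.
    in_cycle, merged = set(cycle), max(nodes) + 1
    new_score, origin = {}, {}
    for (h, d), s in score.items():
        if d == root or h == d or (h in in_cycle and d in in_cycle):
            continue
        if d in in_cycle:
            key, val = (h, merged), s - score[(best[d], d)]
        elif h in in_cycle:
            key, val = (merged, d), s
        else:
            key, val = (h, d), s
        if key not in new_score or val > new_score[key]:
            new_score[key], origin[key] = val, (h, d)
    new_nodes = [n for n in nodes if n not in in_cycle] + [merged]
    sub = max_arborescence(new_nodes, new_score, root)
    # Step 3: expand; the arc entering the cycle breaks it at one node.
    result = {}
    for d, h in sub.items():
        orig_head, orig_dep = origin[(h, d)]
        result[orig_dep] = orig_head
    for d in cycle:
        result.setdefault(d, best[d])
    return result

def blend(predictions):
    """predictions: one head list per parser; heads[i] is the head of token i+1."""
    n = len(predictions[0])
    votes = Counter((h, d) for heads in predictions
                    for d, h in enumerate(heads, start=1))
    tree = max_arborescence(list(range(n + 1)), dict(votes))
    return [tree[d] for d in range(1, n + 1)]

# Three parsers vote on a three-token sentence.
print(blend([[2, 0, 2], [2, 0, 2], [0, 1, 2]]))
```

Plain per-token majority voting could select arcs that form a cycle; decoding with a spanning-tree algorithm is what guarantees a well-formed tree.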
As a baseline for blending (BLEND-BL), we took four instances from each of the baseline parsers: The four GP instances were trained with different random seeds. The four TP instances further split into two groups: two parse from left to right (TP-l2r) and two parse from right to left (TP-r2l). The four TN instances differ not only in the parsing direction, but also in the word embeddings: two use pretrained embeddings from the organizers (TN-l2r-vec, TN-r2l-vec) and two use randomly initialized embeddings (TN-l2r-rand, TN-r2l-rand).
The 4+4+4 combination was rather arbitrary and simply based on the intuition that different parsers should be equally represented and as diverse as possible. However, this might not be the optimal combination since different parsers are better at different treebanks. Also, given the relatively limited computing resources on the VM, we needed to optimize the number of blended instances in terms of speed.
We thus applied a combination search under a time constraint. First we measured time needed by each parser to parse every development treebank on the VM as an estimation of time usage for the test run. We then defined a time budget of 1,000 seconds for each treebank, and checked all combinations of the parsers on the development set under the time budget. We took the combinations from a pool of 24 individual instances, divided into seven groups: 8×GP; 4×TP-l2r; 4×TP-r2l; 2×TN-l2r-rand; 2×TN-l2r-vec; 2×TN-r2l-rand; 2×TN-r2l-vec.
Note that enumerating all combinations of individual instances is not feasible (2^24 combinations). Thus we applied a two-step heuristic search. First we searched for the optimal number of instances from the 7 groups, by drawing samples from the pool of instances that differ only in their random seed (at most 9 × 5 × 5 × 3 × 3 × 3 × 3 = 18,225 possibilities). Once the optimal numbers of instances were found, we then searched exhaustively for the optimal instances (BLEND-OPT).
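The first step of the search, choosing how many instances to take from each group under the time budget, can be sketched as below. The group sizes and the 1,000-second budget come from the text; the per-instance timings and the `evaluate` callback (which would blend sampled instances and return LAS on the development set) are hypothetical placeholders, as are the function names.

```python
from itertools import product

# Pool of 24 instances in seven groups, as described in the text.
GROUPS = {"GP": 8, "TP-l2r": 4, "TP-r2l": 4, "TN-l2r-rand": 2,
          "TN-l2r-vec": 2, "TN-r2l-rand": 2, "TN-r2l-vec": 2}

def count_vectors(groups):
    """Yield every way of taking 0..size instances from each group."""
    return product(*(range(size + 1) for size in groups.values()))

def search_counts(groups, seconds, evaluate, budget=1000.0):
    """Return (score, counts) of the best count vector that fits the budget.

    seconds  : measured parse time per instance for each group (assumed given)
    evaluate : hypothetical callback mapping {group: count} to a dev-set score
    """
    best = None
    for vector in count_vectors(groups):
        counts = dict(zip(groups, vector))
        cost = sum(n * seconds[g] for g, n in counts.items())
        if sum(vector) == 0 or cost > budget:
            continue
        score = evaluate(counts)
        if best is None or score > best[0]:
            best = (score, counts)
    return best

# 9 * 5 * 5 * 3 * 3 * 3 * 3 count vectors, matching the figure in the text.
print(sum(1 for _ in count_vectors(GROUPS)))

# Toy run with made-up timings and a stand-in score (number of instances).
toy_seconds = {g: 100.0 for g in GROUPS}
print(search_counts(GROUPS, toy_seconds, lambda c: sum(c.values())))
```

Searching over counts rather than individual instances is what reduces the space from 2^24 subsets to at most 18,225 candidates per treebank.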

Evaluation
In this section we evaluate the aforementioned methods on the 55 treebanks for which development data was available.

Word and sentence segmentation
As discussed in Section 2, we applied our own word and/or sentence segmentation to a subset of languages. The corresponding results on the development sets are shown in Tables 1 and 2. For word tokenization both our methods (predicting edit scripts and tagging with chunk boundaries) outperform the UDPipe baseline by 2.64 F1-score points on average. The biggest gains are achieved for Hebrew (4.57 points) and Vietnamese (4.67 points).
Using the TPSeg parser to predict sentence boundaries results in an average improvement of 9.32 points on sentence segmentation F1-score over the UDPipe baseline. Especially the difficult data sets that do not use orthographic features to indicate sentence boundaries improve by a big margin, for example Latin PROIEL by 18.76 and Gothic by 15.73.
Most importantly, the improvements in word and sentence segmentation F1-score roughly translate into LAS improvements with a 1:1 and a 5:1 ratio, respectively.

Preprocessing and Supertags
To see the improvements stemming from our preprocessing steps we ran the baseline parsers in four incremental settings: (1) using only the UDPipe baseline predictions, (2) replacing POS and morphological features with CRF predictions, (3) adding supertags, and (4) applying our own word and sentence segmentation. Table 3 shows the average LAS for each parser across the 55 development sets for the consecutive experiments. For each set of experiments the parsers were trained on corresponding jackknifed annotations for POS, morphology, and supertags. Gold word and sentence segmentation was used while training parsers in all settings. The table shows that replacing the baseline UDPipe POS and morphological predictions with those of the CRF improves the parsers by 0.66 on average (see footnote 8). The introduction of supertags brings an additional 0.88 points, which demonstrates that supertags are a useful source of syntactic features for dependency parsers, irrespective of architecture. Replacing the word and sentence segmentation from UDPipe with our own improves on average by 0.74 points. It is worth noting that this improvement stems only from the 15 treebanks where we applied our own segmentation, although the averages in Table 3

Development Results
Our overall results on the development sets are shown in Table 4. The table shows the performance of the preprocessing steps, the individual baseline parsers, and the results of the two blends.
The 15 treebanks where we applied our own word and/or sentence segmentation are marked explicitly in the table; for the other cases we used the UDPipe baseline. The three single baseline parsers achieved similar average performances. Each one of them performed best on some of the treebanks, but not on all. It is worth noting that the strongest baseline parser, GP, is perceptron-based rather than a neural model. That is not to say that perceptrons generally are stronger than neural models (our neural TN parser is a greedy parser, and other participants in the Shared Task present considerably stronger neural models); however, it indicates that perceptrons are not miles behind the more recent neural-based parsers.
Blending the parsers yields a strong boost over the baselines. BLEND-BL improves by roughly 2-3 points depending on the choice of baseline. By searching for optimal combinations under the time budget, this can be further improved by 0.49 on average (BLEND-OPT). The search reduced the number of models from 660 to 438. In particular, there were 221 instances of GP, 79 of TP, and 138 of TN.
8 The actual improvements on the POS and morphological tagging tasks amount to 0.68 and 1.17, respectively.

Surprise languages
The implementation of the surprise languages in the Shared Task was done in a rather peculiar way with respect to preprocessing. The test sets were annotated by the organizers through cross-validation. That is, the test sets were provided with predicted (by UDPipe) POS and morphological tags. Participants were provided with a small sample (about 20 sentences) for each surprise language, however only with gold standard preprocessing. This meant that it was difficult to use the samples for tuning/development since we would either have to use gold standard preprocessing, or apply cross-validation on the samples ourselves which most likely would have resulted in considerably worse preprocessing than that which was delivered for the test sets. We chose to consistently use gold preprocessing for all development experiments on the surprise languages.
A straightforward approach to the surprise languages is to use delexicalized parser transfer (Zeman and Resnik, 2008). The idea is to train a parser on a source treebank using only nonlexical features (in our case universal POS tags and morphological features) and apply it on sentences from the target language. We followed Rosa and Žabokrtský (2015) and performed multisource delexicalized transfer by blending together models trained on several languages. Contrary to them, we treat the source languages equally and blend them with the same weight.
We trained delexicalized TP and GP parsers for 40 source languages (we took the 40 biggest treebanks, excluding the domain specific ones). We refrained from training TN since the main motivation of this parser is that it operates on characters. Therefore, using it in the delexicalized setting does not make sense.
To narrow down the number of possible source language parsers, we used TP to select the best six source languages for each surprise language using the sample data. We then searched for the optimal
Table 4: Development results. The treebanks for which we did our own word and/or sentence segmentation are marked with ⊗ and respectively. The TP and TN models correspond to TP-l2r and TN-l2r-vec, respectively.
Table 5: Parsing accuracy (LAS) for surprise languages: the three best delexicalized TP-l2r parsers and lexicalized parser obtained by leave-one-out jackknifing.
In addition to the delexicalized models, we also trained lexicalized TP and GP models on the sample data and applied leave-one-out jackknifing. A comparison between the three best delexicalized TP models and the lexicalized TP parser is shown in Table 5. Only for Upper Sorbian were transferred models able to surpass the model trained on the very small training data. Interestingly, the blended models were much better than any of the individual models for all languages except Kurmanji. Therefore we decided not to use any of the delexicalized models for this language. For the other three surprise languages we ultimately blended eight delexicalized (selected as described above) and eight lexicalized models, the intuition being that this would give equal weight to lexicalized and delexicalized models.

Parallel datasets
For the 14 additional parallel datasets (PUD) we used parsers trained on their corresponding languages. For several languages there was more than one treebank in the training data for the corresponding PUD test set. This raises the question as to whether the models used for the PUD test sets should be trained only on the primary treebank, or on the combination of all training sets corresponding to that language. For the main treebanks, initial experiments indicated that combining was a bad idea and parsers performed better when training sets were not combined. However, for the PUD test sets we had no information on the annotation scheme nor the domain, which made it difficult to decide whether to use only the primary training set or all available.
For each language with multiple training sets, we trained one parser on each training set as well as on their concatenation. We applied these models on the development sets and created a confusion matrix. Without prior knowledge about the PUD treebanks, we estimated the expected LAS as the average LAS of the development sets and chose the model that maximizes the estimation. Table 6 shows such a confusion matrix for Swedish using the TN parser. The expected LAS for PUD (right-most column) is highest when trained on the concatenation of the two treebanks. This observation held for all the languages with multiple treebanks that we tested and we therefore used models trained on the concatenation of all training data with two exceptions: For Czech time prevented us from training models and creating a confusion matrix and we only used models trained on the primary treebank. For Finnish FTB the README distributed with the treebank states that this treebank is a conversion that tries to approximate the primary Finnish treebank. This suggests that it does not entirely conform to the Finnish UD standard. We assumed that the Finnish PUD test set would be closer to the primary treebank and therefore chose to use only the model trained on the primary treebank.
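The selection rule described above amounts to averaging each row of the confusion matrix and taking the maximum. In the sketch below the function name `pick_training_set` and the treebank keys are hypothetical, and the numbers are illustrative placeholders only; they are not the values reported in Table 6.

```python
def pick_training_set(las):
    """Choose the training configuration with the highest expected LAS.

    las : nested dict, las[training_set][dev_set] -> LAS on that dev set.
    The expected LAS on an unseen test set is estimated as the unweighted
    mean over the available development sets.
    """
    expected = {model: sum(row.values()) / len(row) for model, row in las.items()}
    return max(expected, key=expected.get), expected

# Placeholder confusion matrix (made-up numbers, NOT the paper's results).
toy_matrix = {
    "primary":      {"dev_primary": 80.0, "dev_secondary": 70.0},
    "secondary":    {"dev_primary": 72.0, "dev_secondary": 78.0},
    "concatenated": {"dev_primary": 80.5, "dev_secondary": 79.0},
}
choice, expected = pick_training_set(toy_matrix)
print(choice, expected)
```

The unweighted mean encodes the assumption that the unseen test set is equally likely to resemble any of the development sets; a different prior would call for a weighted average.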

Test Set results
Our final results on the test sets are shown in Table 7. Overall we ranked third in the Shared Task with a macro-average LAS of 74.42 behind two teams: Stanford and Cornell. Both of them used state-of-the-art neural-based parsers (Zeman et al., 2017). Our efforts to improve the preprocessing scores paid off. On most of the languages where we applied our word and/or sentence segmentation we achieved the best parsing results. On the secondary evaluation metrics we ranked first for word segmentation and sentence segmentation, second for POS tagging, and first for morphological tagging. Additionally we were second on parsing the surprise languages.
As it turns out, all PUD treebanks were presumably annotated following the guidelines of the primary treebanks. This most likely lowered our results a little bit for some of the PUD treebanks. However, for Russian PUD our results are abnormally low compared to many other participants. We scored about 13 points behind the top result, in comparison to an average distance of less than 2 points. This is most likely an artifact of how the non-primary (SynTagRus) Russian treebank is considerably larger than the primary Russian treebank, which means that a parser trained on the concatenation is mostly dominated by the SynTagRus annotation style and domains.

Conclusion
We have presented the IMS contribution to the CoNLL 2017 UD Shared Task. We have shown that tuning the preprocessing methods is a way to achieve competitive parsing performance. We made use of a CRF tagger for POS and morphological features and very strong word and sentence segmentation tools.
None of our baseline parsers alone would rank third. We therefore used blending to combine them. In general, we can confirm the observation of Surdeanu and Manning (2010) that the diversity of parsers is important. Additionally, we observed that both the diversity of parser architectures and number of instances of the same parser can improve performance. Furthermore, our automatized combination search method could be seen as a case of a "sparsely" weighted voting scheme.
We confirmed two of our previous findings on a larger scale. (1) Syntactic information can help sentence segmentation (Björkelund et al., 2016).
For the surprise languages we blended delexicalized models from other languages with lexicalized models trained on the small in-language sample data. This approach seems to have been robust and earned us second place on the surprise languages. However, further analysis would be required in order to understand whether the lexicalized or delexicalized models in general fare better in this setting.
As for the PUD treebanks we found that, although the UD annotation scheme should be consistent across treebanks, combining training sets for one language is not useful for parsing the PUD test sets. Whether this depends on annotation idiosyncrasies or domain differences is an open question and deserves further attention.
Table 7: Test results. The treebanks for which we did our own word and/or sentence segmentation are marked with ⊗ and respectively.