A Semi-universal Pipelined Approach to the CoNLL 2017 UD Shared Task

This paper presents our system submitted for the CoNLL 2017 Shared Task, “Multilingual Parsing from Raw Text to Universal Dependencies.” We ran the system for all languages with our own fully pipelined components, without relying on re-trained baseline systems. To train the dependency parser, we used only the universal part-of-speech tags and the distance between words, and applied deterministic rules to assign dependency labels. These simple, delexicalized models are suitable for cross-lingual transfer approaches and for a universal language model. Experimental results show that our model performed well on some metrics, and we discuss topics such as the contribution of each component and syntactic similarities among languages.


Introduction
We tested dependency-based syntactic parsing in 49 languages on Universal Dependencies (Nivre et al., 2015), using 81 corpora from the UD version 2.0 datasets. The task is described in the overview paper (Zeman et al., 2017), and the whole system was evaluated on the TIRA platform (Potthast et al., 2014).
Instead of merely pursuing higher scores in the shared task, we adopted several strategies in the design of our parser:
Self-contained system. To retain control over the input and output of the system, we use only our own components throughout the pipeline, including the sentence splitter, tokenizer, lemmatizer, PoS tagger, dependency parser and relation labeler. We do not rely on any existing preprocessors such as UDPipe (Straka et al., 2016) or SyntaxNet (Weiss et al., 2015).
One model per language. When there are multiple corpora for a language with different annotation strategies, our system does not optimize models for each corpus, because real applications cannot assume such corpus-specific conventions.
No machine learning. We use only simple statistics over the part of speech of each word and the distance between words, together with induced deterministic rules. Neither higher-order models nor word embeddings are used, so our system is fully controllable with linguistic knowledge.
Componentized pipeline. The components in the pipeline can be divided and optimized independently, so that each is interchangeable with a corresponding external component such as the UDPipe tokenizer. Our dependency parser relies only on Universal PoS tags and does not use extended PoS tags, lemmas, or features annotated by a specific tokenizer.
Our system was composed under these constraints at the sacrifice of overall scores, but it performed reasonably well, achieving the best participant scores on a number of metrics. The major contributions of this report are as follows: 1. A report of runs without UDPipe, with very different results from those obtained by other participants.
2. Experiments in cross-lingual and universal scenarios using delexicalized statistics from different languages.
3. Simple and reusable techniques to induce rules for PoS tagging and relation labeling.
Section 2 describes each component in our pipeline. Section 3 reports our results, including ablation studies and additional experiments in cross-lingual and multilingual settings. Section 4 discusses prior work related to our approach.

Components
Figure 1 illustrates our pipelined architecture for multilingual parsing from raw text. As indicated by the dotted boxes in the figure, we exploited in-house engines for sentence splitting, tokenization and PoS tagging for a number of languages, and fit their output to the UD annotation schemata. For languages our engines do not cover, we used simple statistics from the training corpus to assign Universal PoS (UPOS) tags. For syntactic parsing, we extracted statistics to predict head words, taking UPOS and distance into account. To assign relation labels we applied rules induced from the corpus.
The rest of this section describes each component, with its language-specific treatments, in pipeline order.

Sentence splitting
For sentence splitting we applied existing logic that takes into account language-specific punctuation and special cases such as "Mr." in English. For languages our sentence splitter does not cover, we simply applied the English logic. For corpora that do not use punctuation at all (e.g. got and la proiel), we identified words that tend to be the first or the last word of a sentence (those in that position in more than half of their occurrences, e.g. "itaque" as a first word in Latin), and used them to split long sentences of 10 or more words.
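The boundary-cue heuristic above can be sketched as follows. This is a minimal illustration of the idea, not the authors' code; the function names, the strict > 0.5 threshold, and the exact splitting policy are our assumptions.

```python
from collections import Counter

def find_boundary_cues(sentences, threshold=0.5):
    """From training sentences (lists of words), collect words that occur
    as the first or the last word in more than `threshold` of their
    appearances, e.g. "itaque" as a sentence starter in Latin."""
    total, first, last = Counter(), Counter(), Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            total[w] += 1
            if i == 0:
                first[w] += 1
            if i == len(sent) - 1:
                last[w] += 1
    starters = {w for w in total if first[w] / total[w] > threshold}
    enders = {w for w in total if last[w] / total[w] > threshold}
    return starters, enders

def split_long_sentence(words, starters, enders, min_len=10):
    """Split a long token sequence before a starter word / after an ender."""
    if len(words) < min_len:
        return [words]
    parts, current = [], []
    for w in words:
        if current and w in starters:
            parts.append(current)
            current = []
        current.append(w)
        if w in enders:
            parts.append(current)
            current = []
    if current:
        parts.append(current)
    return parts
```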

Tokenization
Our in-house tokenizer and PoS tagger support 17 languages: ar, cs, da, de, en, es, fr, he, it, ja, ko, nl, pl, pt, tr, ru and zh. Three of them, Japanese (ja), Korean (ko) and Chinese (zh), split words in a very different manner, without relying on whitespace.
For the other languages, we applied the English tokenizer, which simply splits words on whitespace and punctuation. For Vietnamese (vi), in which the word units are longer than space-split tokens, we extracted multi-token words from the training corpus and aggregated them at runtime. This raised the word F1 score for vi from 73.7 to 85.1.
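The Vietnamese aggregation step could be implemented as a greedy longest-match merge against a lexicon of multi-token words harvested from the training corpus. The sketch below is our own illustration; the paper does not specify the matching strategy or the maximum word length.

```python
def build_mwe_lexicon(training_words):
    """Collect multi-token words (those containing internal spaces) from
    the training corpus; Vietnamese words may span several syllables."""
    return {tuple(w.split()) for w in training_words if " " in w}

def aggregate_tokens(tokens, lexicon, max_len=4):
    """Greedily merge space-split tokens into known multi-token words,
    preferring the longest match at each position."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out
```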
There are non-negligible mismatches in tokenization strategy between our tokenizer and the UD corpora. The major difference is in Korean (ko): while our tokenizer splits particles and suffixes from content words, the UD corpus uses whitespace (eojeol) tokenization. Accordingly, we merged those tokens after obtaining the part of speech of each unit.
We also made adjustments in Turkish (tr) to attach suffixes other than "ki", and in Arabic (ar) to attach the determiner "al". Many differences still remain in other languages, but we did not make any other modifications, which resulted in lower word correspondence values (95.5 on average) than those of UDPipe (98.6).

PoS tagging
As with tokenization, for the 17 covered languages we used the PoS tags output by our engine in its own PoS schemata; some of these are close to the Penn Treebank style and others follow different schemes. We adopted those tags as Extended PoS (XPOS) tags and mapped them to UPOS. The mapper assigns the most frequent UPOS in the training corpus for the combination of the XPOS and the lemma of a given word. By design, our PoS tagger does not distinguish some main verbs (VERB) from auxiliary verbs (AUX), such as "do" and "have" in English, "avoir" in French and "haber" in Spanish, which causes many parsing errors, so we added heuristics that change the UPOS based on context.
For the other languages, which our PoS tagger does not cover, we assigned the most frequent UPOS for each surface form in the training corpus. Even with this naïve method we obtained UPOS scores higher than 90 for some languages, such as Czech (cs), Persian (fa), Hindi (hi) and Indonesian (id), but it did not work well enough for lower-resource languages.
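The majority-tag fallback amounts to a simple lookup table built from (surface form, UPOS) pairs. The sketch below is ours; in particular, the NOUN default for unseen forms is an assumption, not something the paper states.

```python
from collections import Counter, defaultdict

def train_upos_lookup(tagged_corpus):
    """tagged_corpus: iterable of (surface_form, upos) pairs from the
    training data. Returns the most frequent UPOS per surface form."""
    counts = defaultdict(Counter)
    for form, upos in tagged_corpus:
        counts[form][upos] += 1
    return {form: c.most_common(1)[0][0] for form, c in counts.items()}

def tag(words, lookup, default="NOUN"):
    """Assign the majority UPOS per surface form; fall back to a default
    for unseen words (the fallback choice is our assumption)."""
    return [lookup.get(w, default) for w in words]
```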

PoS-level models
To keep the parsing method simple and language-universal, we built a first-order delexicalized model for each language. The score of the dependency between two words is determined only by the UPOS of the head and dependent words and the surface distance between them. Figure 2 shows a sample dependency structure for an English sentence, and Table 1 shows the true (T) and false (F) dependencies found in that sentence. By counting the frequencies of these events over all word pairs in each sentence, we calculate the ratio of correct dependencies for each combination of PoS pair and distance.
Formally, let h be a head word, d be a dependent, p_w be the UPOS of a word w, and ∆_{d,h} be the distance between d and h. The score is

score(d, h) = #(p_d, p_h, ∆_{d,h}, T) / ( #(p_d, p_h, ∆_{d,h}, T) + #(p_d, p_h, ∆_{d,h}, F) ),

where #(·) is the frequency in the training corpus, T denotes that d depends on h, and F denotes that it does not. The score is set to 0 when the denominator is 0. (Table 1 lists the true (T) and false (F) dependencies between pairs of words in the sentence in Figure 2; a negative distance means that the head is to the left of the dependent.) Table 2 shows example scores. These statistics reflect universal attributes: for example, smaller distances are preferred, and functional words tend not to have dependents. Language-specific attributes are also captured, such as the orientation of adjective modification and adpositions.
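The counting scheme above can be sketched as follows. The data layout and helper names are our own illustration, not the authors' code; we represent a treebank as sentences of (UPOS, 1-based head index) pairs, with head 0 meaning the root.

```python
from collections import Counter
from itertools import permutations

def collect_statistics(treebank):
    """Count true and total dependency events for every ordered word pair,
    keyed by (dependent UPOS, head UPOS, signed distance).  A negative
    distance means the head is to the left of the dependent."""
    true_cnt, all_cnt = Counter(), Counter()
    for sent in treebank:
        for d, h in permutations(range(len(sent)), 2):
            key = (sent[d][0], sent[h][0], h - d)
            all_cnt[key] += 1
            if sent[d][1] == h + 1:   # d's gold head is word h
                true_cnt[key] += 1
    return true_cnt, all_cnt

def score(dep_upos, head_upos, distance, true_cnt, all_cnt):
    """Ratio of true dependencies for this UPOS pair and distance;
    0 when the event was never observed (zero denominator)."""
    key = (dep_upos, head_upos, distance)
    return true_cnt[key] / all_cnt[key] if all_cnt[key] else 0.0
```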
These scores are used as the edge weights for the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to obtain the maximum spanning tree, which optimizes the dependency structure of a sentence. This algorithm can produce non-projective trees, which appear frequently in languages such as German, Latin and Czech (McDonald et al., 2005).
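A compact recursive sketch of the Chu-Liu-Edmonds maximum-arborescence search over such a score matrix follows. The implementation details (index conventions, helper names, cycle contraction bookkeeping) are our own illustration, not the authors' code.

```python
def _find_cycle(head):
    """Return a list of nodes forming a cycle in head[], or None."""
    n = len(head)
    for start in range(1, n):
        seen, v = set(), start
        while v > 0 and v not in seen:
            seen.add(v)
            v = head[v]
        if v == start:                      # walked back to where we began
            cycle, u = [start], head[start]
            while u != start:
                cycle.append(u)
                u = head[u]
            return cycle
    return None

def chu_liu_edmonds(score):
    """score[h][d]: weight of arc h -> d; node 0 is the artificial ROOT.
    Returns head[] with head[d] the selected head of d (head[0] = -1)."""
    n = len(score)
    head = [-1] * n
    for d in range(1, n):                   # greedy best head per word
        head[d] = max((h for h in range(n) if h != d), key=lambda h: score[h][d])
    cycle = _find_cycle(head)
    if cycle is None:
        return head
    cyc = set(cycle)
    rest = [v for v in range(n) if v not in cyc]   # surviving nodes (incl. ROOT)
    idx = {v: i for i, v in enumerate(rest)}
    c = len(rest)                                  # index of the contracted node
    NEG = float("-inf")
    sub_score = [[NEG] * (c + 1) for _ in range(c + 1)]
    enter, leave = {}, {}
    for h in range(n):
        for d in range(n):
            if h == d:
                continue
            if h not in cyc and d not in cyc:
                sub_score[idx[h]][idx[d]] = score[h][d]
            elif h not in cyc and d in cyc:
                # entering arc: pay for breaking d's current cycle arc
                w = score[h][d] - score[head[d]][d]
                if w > sub_score[idx[h]][c]:
                    sub_score[idx[h]][c] = w
                    enter[idx[h]] = (h, d)
            elif h in cyc and d not in cyc:
                if score[h][d] > sub_score[c][idx[d]]:
                    sub_score[c][idx[d]] = score[h][d]
                    leave[idx[d]] = h
    sub = chu_liu_edmonds(sub_score)        # solve the contracted graph
    final = head[:]                         # keep cycle arcs by default
    for d2 in range(1, c + 1):
        h2 = sub[d2]
        if d2 == c:                         # the contracted node: break the cycle
            h_orig, d_orig = enter[h2]
            final[d_orig] = h_orig
        elif h2 == c:                       # head lies inside the cycle
            final[rest[d2]] = leave[d2]
        else:
            final[rest[d2]] = rest[h2]
    return final
```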

Language specific cases
Japanese (ja) and Korean (ko) are parsed in a different manner. A point common to both languages is that all content words form right-head structures; consequently, a set of rules selects the syntactically possible head words for a given word by using syntactic features (Kanayama et al., 2014). Here the dependency for each word is determined by choosing the nearest of its modification candidates (a nearest-head baseline), without relying on statistics from the training corpora.
For the 'surprise languages', which have no training corpora, we used the models of languages from nearby regions (Russian (ru) for Buryat (bxr), Persian (fa) for Kurmanji (kmr), Finnish (fi) for North Sámi (sme) and Polish (pl) for Upper Sorbian (hsb)), but these selections were not optimal, as found in the experiments in Section 3.2.

Exceptional dependencies
The statistical model above is necessarily ignorant of the vocabulary and of lexical features finer than the UPOS level. To capture some such phenomena we made two deterministic modifications.

Fixed expressions.
Multi-word expressions behave exceptionally in the UPOS-based model. In each language we extracted fixed phrases such as "because of" and "as well as" in English, and at runtime forcibly tagged the dependencies for such word sequences with the fixed label. Also, for consecutive appearances of the same PoS tag among NOUN, PROPN or NUM, a structure with the majority label (one of flat, nmod, compound or nummod) is assigned depending on the pair of language and PoS; e.g. left-head structures with the flat label are assigned to PROPN sequences in Catalan (ca).

Consistent words. The English UPOS PART is used both for the possessive "'s" and for the infinitive "to", which behave very differently from each other. For such words, whose head lies in a consistent direction per dependent word, the score for the other direction is discounted by multiplying it by 0.1; e.g. the score of left-head modification is multiplied by 0.1 for the PART "to" in English.
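The direction discount can be expressed as a small adjustment applied before tree search. The sketch below is our own; the `preferred` map (surface form to the side its head consistently appears on) and its contents are illustrative assumptions.

```python
def apply_direction_discount(base_score, form, distance, preferred, factor=0.1):
    """Discount an arc score when the head lies on the unexpected side of
    a direction-consistent word.  `distance` follows the paper's sign
    convention: negative means the head is to the left of the dependent."""
    side = preferred.get(form)
    if side == "right" and distance < 0:   # head expected to the right
        return base_score * factor
    if side == "left" and distance > 0:    # head expected to the left
        return base_score * factor
    return base_score
```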

Relation label assignment
After obtaining the tree structures, we assigned a dependency label to each node by referring to the most frequent label between the two UPOS tags in each language. The labels vary by language and by the orientation of the dependency, as exemplified in Table 3.
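This majority-label assignment can be sketched as a lookup keyed by the UPOS pair and the dependency orientation. The representation below (tuples with an 'L'/'R' direction flag, and the generic dep fallback for unseen combinations) is our assumption, not from the paper.

```python
from collections import Counter, defaultdict

def train_label_table(observations):
    """observations: iterable of (dep_upos, head_upos, direction, label)
    tuples, direction being 'L' (head left) or 'R' (head right).
    Returns the majority label per (dep UPOS, head UPOS, direction)."""
    counts = defaultdict(Counter)
    for dep, head, direction, label in observations:
        counts[(dep, head, direction)][label] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def assign_label(dep_upos, head_upos, direction, table, default="dep"):
    # fall back to the generic `dep` label for unseen combinations (an assumption)
    return table.get((dep_upos, head_upos, direction), default)
```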
In some cases the labels are difficult to deterministically assign merely by using UPOS of two words. In such cases, we applied the following label refinement rules.
Word-based constraints. Forcibly change the label for words whose relation labels are mostly consistent (≥ .95); e.g. modification by "there" in English should have the expl label.
Verb arguments. Adjust the label of a NOUN, PROPN or PRON that is a dependent of a VERB under several conditions; e.g. set obl if the word has a dependent labeled case, in most languages.
Pronouns. Change the relation label of a PRON that is a dependent of a VERB to the majority label for its surface form, e.g. obj for "him" in English.

Conjunctions. When the dependent and head words have the same UPOS and there is a CCONJ between the two words, set the label of the dependent to conj.

Experiments

Table 4 shows the results for the 81 test corpora in 49 languages, including the surprise languages. The left side shows the performance of our system as described in Section 2. The scores are the same as those of the official run except for the ja pud data, on which we encountered a technical problem in the official run; '*' denotes values updated from the official score. WLAS denotes the "weighted labeled attachment score", which discounts functional word attachments by multiplying them by 0.1 and ignores punctuation.

Overall results
Numbers in bold indicate that our system achieved the best score among the task participants. Our sentence splitting was the best for seven corpora, including three surprise languages, and our word segmentation was the best for five corpora.
Japanese (ja) shows the best scores except for sentence splitting, but it is an exceptional case. As we provided the Japanese UD 2.0 data set, our tokenization, PoS mapping and label definitions are consistent with the data set, so it is straightforward to convert our parse structures into the appropriate UD schema. Although we intentionally used a naïve parsing method (the nearest-head baseline), we performed the best among the participants owing to the high agreement of the tokenization.
For Kazakh (kk), our approach worked well, achieving the best sentence splitting and unlabeled attachment scores (UAS). The absolute scores were not high, which shows the difficulty this language poses for machine learning approaches.
Besides the languages that are difficult in terms of tokenization, i.e. Chinese (zh), Vietnamese (vi) and Hebrew (he), some languages show quite low word segmentation scores (e.g. pt and tr) due to differences in tokenization policy that our adjustment rules did not cover. Due to the nature of the pipelined architecture, errors in word segmentation directly affect the downstream metrics. Since UPOS is used for dependency parsing, PoS tagging and PoS mapping errors are critical for the parsing scores, both UAS and LAS.
The right side of Table 4 shows the results of our parser when UDPipe is used for tokenization and PoS tagging. Three columns (Sentence, Words and UPOS) show the scores of UDPipe itself, and the remaining columns show the scores of our parser with UDPipe preprocessing. Since UDPipe was trained on the UD corpora, its tokenization and PoS tagging performed much better than ours, resulting in UAS and LAS scores 8 and 9 points higher, respectively. For Kazakh (kk), Vietnamese (vi) and one of the Arabic datasets (ar pud), our tokenizer and PoS tagger performed better than UDPipe, resulting in the better parsing scores on the left side.
Korean (ko) is another exception. Since our UPOS-based dependency parsing model does not capture the decomposed elements of each token, the parser did not work well with the UDPipe preprocessing. Our deterministic parser can handle the functional words and thus performed better.

Cross-lingual and universal evaluation
One of the advantages of Universal Dependencies is the capability to test language-independent models and cross-lingual transfer learning. As described in Section 2.4, our dependency parsing models contain no lexical information and are very general; they can therefore be applied to other languages, enabling us to test a universal language model. Table 5 compares UAS scores in the cross-lingual and universal settings. The 'Own model' column shows the original scores, the 'Best transfer' column shows the scores using the model that performed best among the other languages, and the 'Universal' column shows the scores obtained with the combined statistics extracted from all of the multilingual corpora. Numbers in bold denote that the transfer or universal model outperformed the language-specific model. Japanese (ja) and Korean (ko) were not tested here because they do not use compatible models.
The experimental results show the best models to apply to the low-resource languages: fi for bxr, cs for kmr, tr for sme and hr for hsb. Also, for relatively low-resource languages such as Kazakh (kk) and Ukrainian (uk), models trained on larger corpora outperformed their own models. For the four French (fr) corpora, the Portuguese (pt) model performed as well as the French model, which suggests that the model built from three different French corpora was noisy.
It is interesting to consider 'neighbor' languages in terms of syntax. English and Swedish (sv) selected each other as the closest language, which suggests that they were selected not only because of the size of their training corpora. It is also notable that the two variants of Norwegian (no) were the closest models for different languages (Danish (da) and Swedish, respectively).

Even the universal model performed well: the drop in UAS scores from the language-specific results was only 4.5 points on average. This shows that our method is general enough for a multilingual design. Not only for low-resource languages such as Ukrainian and the surprise languages, but also for Russian (ru) and Slovak (sk), the universal model outperformed the language-specific model.

Ablation of refinement rules

Table 6 shows the difference in UAS scores when we did not apply one of the sets of rules to change the dependency structures described in Section 2.4, and in LAS scores without one of the refinements for relation labels described in Section 2.5. The identification of multi-word tokens did not work as well as expected, and the word-level rules made little contribution.

Applying all label refinement rules improved the LAS score by 2.35 points on average. The rules to modify labels for verb arguments were the most important on average. The conjunction rules are very simple but consistently improved scores for almost all languages. The word-based constraints helped for some languages but may cause side effects. The pronoun rules helped for Gothic (got), which suggests that Gothic pronouns are used relatively consistently for argument cases, e.g. "saei" for nsubj and "mik" for obj.

Related Work
Some approaches share the same motivation as ours. Martínez-Alonso et al. (2017) used a small set of UPOS-level attachment rules for parsing and achieved 55 UAS with a universal model and predicted PoS. In this shared task we also needed to tackle preprocessing and relation labeling, which cannot be done in a fully language-agnostic manner. Accordingly, we used minimal statistics for each language and achieved similar UAS levels with our own tokenization and PoS prediction, and a higher value (by 9 points) when using the UDPipe preprocessor.
Universal parsing is not our main focus here, but our results in the rightmost column of Table 5 can be used to compare our approach with universal approaches (Ammar et al., 2016).

Conclusion
For the CoNLL 2017 Shared Task on multilingual parsing from raw text, we built a complete multilingual parsing pipeline in a "semi-universal" manner, exploiting minimal statistics from the training corpora together with deterministic rules for part-of-speech tagging and label adjustment. Even with this simple and general model we achieved an average labeled attachment score of .43, and showed that the proposed model can be suitably applied to cross-lingual and universal scenarios.