A Transition-based System for Universal Dependency Parsing

This paper describes our system for the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies. We design a system based on UDPipe for universal dependency parsing, in which multilingual transition-based models are trained for different treebanks. Our system takes raw text directly as input, performs several intermediate steps such as tokenization and tagging, and finally generates the corresponding dependency trees. For the surprise languages of the task, we adopt a delexicalized strategy and predict via transfer learning from related languages. In the final evaluation of the shared task, our system achieves a macro-averaged LAS F1-score of 66.53%.


Introduction
Universal Dependencies (UD) (Nivre et al., 2016, 2017b) and universal dependency parsing aim to build cross-linguistically consistent treebank annotation and to develop cross-lingual learning for parsing many languages, including low-resource ones. Universal Dependencies release 2.0 (Nivre et al., 2017b) covers a rich set of languages and treebank resources, and the parsing task at CoNLL 2017 is based on this dataset. Dependency parsing was previously the topic of the shared tasks at CoNLL-X and CoNLL 2007 (Buchholz and Marsi, 2006; Nivre et al., 2007), which have been milestones in the field of parsing research. This time, the task adopts a universal annotation scheme and tries to exploit cross-linguistic similarities between languages.
In this paper, we describe the system of team Wanghao-ftd-SJTU for the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies (Zeman et al., 2017). For this task, we use only the provided treebanks to train models, without any other resources, including pretrained embeddings.
For dependency parsing, there have been two major approaches: graph-based and transition-based. The former searches for the final tree with graph algorithms, decomposing trees into factors and exploiting ingenious dynamic programming (Eisner, 1996; McDonald et al., 2005; McDonald and Pereira, 2006), while the latter parses sentences by making a series of shift-reduce decisions (Yamada and Matsumoto, 2003; Nivre, 2003). Our system uses the transition-based approach for its simplicity and relatively low computational cost.
Transition-based dependency parsing runs in linear time and can utilize rich features for structured prediction (Zhang and Clark, 2008; Zhang and Nivre, 2011). Specifically, a buffer for input words, a stack for the partially built structure, and a set of shift-reduce actions are the basic elements of a transition-based parser. Two transition systems are commonly used for dependency parsing: arc-standard and arc-eager (Nivre, 2008). Our system adopts the former, whose transitions can be described as follows, where σ, β, A denote the stack, the buffer, and the set of arcs respectively:

Shift: (σ, b|β, A) ⇒ (σ|b, β, A)
Left-Arc: (σ|s1|s0, β, A) ⇒ (σ|s0, β, A ∪ {(s0, s1)})
Right-Arc: (σ|s1|s0, β, A) ⇒ (σ|s1, β, A ∪ {(s1, s0)})
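The arc-standard transitions above can be sketched in a few lines of Python. This is a minimal illustration, not code from UDPipe or Parsito; the function and oracle names are our own, and the oracle (which decides the next action from the current configuration) is left abstract.

```python
def parse_arc_standard(words, oracle):
    """Parse a sentence with arc-standard transitions.

    words  : list of token ids (1-based; 0 is the artificial root)
    oracle : callable mapping (stack, buffer) to one of
             "SHIFT", "LEFT-ARC", "RIGHT-ARC"
    Returns the arc set A as a set of (head, dependent) pairs.
    """
    stack, buffer, arcs = [0], list(words), set()
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer)
        if action == "SHIFT" and buffer:
            stack.append(buffer.pop(0))          # move next word onto the stack
        elif action == "LEFT-ARC" and len(stack) >= 2:
            s0, s1 = stack[-1], stack[-2]
            arcs.add((s0, s1))                   # s1 becomes a dependent of s0
            del stack[-2]
        elif action == "RIGHT-ARC" and len(stack) >= 2:
            s0, s1 = stack[-1], stack[-2]
            arcs.add((s1, s0))                   # s0 becomes a dependent of s1
            stack.pop()
        else:
            raise ValueError("invalid action %r" % action)
    return arcs
```

In practice the oracle is the trained neural classifier; here any callable returning a valid action sequence will drive the parser to a complete tree.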
One major difference between parsing now and ten years ago is the rise of neural network based methods in Natural Language Processing, and parsing has also been greatly changed by them. With distributed representations of words and sentences and the powerful non-linear modeling ability of neural networks, deeper syntactic and perhaps semantic information can be explored in text analysis, and both graph-based (Pei et al., 2015; Wang and Chang, 2016) and transition-based (Chen and Manning, 2014; Weiss et al., 2015; Dyer et al., 2015; Andor et al., 2016) parsing have benefited greatly from neural representation learning. In our system, the transition-action predictor, trained with UDPipe, is also a neural network similar to that of Chen and Manning (2014).
For this shared task, our system is built on UDPipe (Straka et al., 2016), which provides a pipeline from raw text to dependency structures, including a tokenizer, taggers and a dependency parser. We trained and tuned models on the different treebanks, and in the final evaluation a macro-averaged LAS F1-score of 66.53% is achieved. The task also includes several surprise languages that lack annotated resources, which makes it impossible to train dedicated models for them. To tackle this problem, we exploit universal part-of-speech (POS) tags, which serve as cross-lingual knowledge free of language-specific information, and adopt a delexicalized, cross-lingual method that relies solely on universal POS tags and annotated data in closely related languages.
The rest of the paper is organized as follows: Section 2 gives an overview of our system, Section 3 elaborates on the components of the system, Section 4 presents the experiments and results of our participation in the shared task, and Section 5 concludes the paper.

System Overview
The overall architecture of our universal dependency parser is shown in Figure 1. The system is divided into two parts: the Known Language Parser and the Surprise Language Parser. The former deals with known languages, covering both rich-resource and low-resource treebanks whose annotations are available as training data, while the latter handles languages without dependency annotations. When text is input to the system, it is first identified as rich-resource or low-resource and then dispatched to the corresponding sub-system, as described below.
For the Known Language Parser, the pipeline contains the following three steps.
(1) Tokenizer The raw text is split into basic units for the later processing of dependency analysis, which is the main task of the tokenizer. For all rich-resource languages, we train tokenizers using the provided training data, including for languages that can be easily tokenized by specific delimiters.
(2) Tagger The tokenized texts are labeled by taggers, which provide the tags utilized in the later dependency analysis, such as POS tags and morphological features. As in the previous step, we train taggers for all rich-resource languages.
(3) Dependency Parser Tokens and linguistic features generated by taggers are put into the dependency parser to generate the final dependency structures.
For the Surprise Language Parser, only the Dependency Parser is needed. We directly take the provided CoNLL-U files, which already include tokens and features, as input and predict the results. Without annotated training data, we could not train tokenizers and taggers for these languages; for parsing, we adopt a delexicalized, cross-lingual strategy, which is described later in Section 3.3.
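The three-step known-language pipeline above can be sketched as a simple composition of stages. The function below is purely illustrative; `tokenize`, `tag` and `parse` are hypothetical placeholders standing in for the trained UDPipe components, not actual UDPipe API calls.

```python
def process_known_language(raw_text, tokenize, tag, parse):
    """Run raw text through the tokenizer, tagger and dependency parser."""
    sentences = tokenize(raw_text)               # step 1: raw text -> token lists
    tagged = [tag(sent) for sent in sentences]   # step 2: add POS/morph features
    return [parse(sent) for sent in tagged]      # step 3: build dependency trees
```

The surprise-language path skips the first two stages and feeds the provided CoNLL-U tokens and features directly to the parser.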

Model Selector
In the final testing phase of the shared task, there are three main types of test data (Nivre et al., 2017a): the Ordinary Provided Resource Test Sets, which have corresponding training datasets; the Parallel Test Sets, which concern selected known languages but may differ in domain from their training data; and the Surprise Languages, whose training annotations are not available in the provided dataset. The model selector discriminates between these input types and dispatches the inputs to the different sub-systems. Specifically, the first two types, which we refer to as Known Languages, are handled by the Known Language Parser, while the Surprise Language Parser handles the surprise languages.
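The dispatch logic of the model selector reduces to a lookup on the language code. The sketch below is illustrative; the function names and the parser objects are hypothetical, and the surprise-language codes listed are those of the four surprise languages of the task (Buryat, Upper Sorbian, Kurmanji, North Sami).

```python
# Codes of the surprise languages in the CoNLL 2017 shared task.
SURPRISE = {"bxr", "hsb", "kmr", "sme"}

def select_parser(lang_code, known_parsers, surprise_parser):
    """Dispatch input to the Known Language Parser or the Surprise Language Parser."""
    if lang_code in SURPRISE:
        return surprise_parser          # delexicalized, cross-lingual model
    return known_parsers[lang_code]     # treebank-specific trained model
```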

Tokenizer
In the Known Language Parser, the first step is to tokenize the input raw text, generating the basic units for later processing. We train tokenizers for all languages using UDPipe, including those that are easy to tokenize with simple rules, such as splitting on blank spaces in English.
Since some languages cannot be tokenized simply by blank spaces, we adopt this unified treatment for all of them. The tokenizers are trained mainly using the SpaceAfter annotations provided in the CoNLL-U files, and the parameters of the UDPipe tokenizer are shown in Table 1.

Tagger
In the pipeline for known languages, the second step is to produce several lightweight syntactic and morphological features for the tokenized texts, which are used as input features in the final parsing step. In our system, we adopt the tagger in UDPipe, whose tagging method is based on MorphoDiTa (Straková et al., 2014) and trained with the classical averaged perceptron (Collins, 2002); the training parameters of the UDPipe tagger are provided in Table 2. In this step, the tagger produces the following outputs:

1. Lemma: the lemma or stem of the word form.
2. FEATS: morphological features from the universal feature inventory or from a defined language-specific extension.

These features are used as inputs in the final parsing step for Rich Resource Languages.
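For concreteness, the tagger outputs above are read from fixed columns of a CoNLL-U token line (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). A minimal sketch, with a function name of our own choosing:

```python
def read_tagger_columns(conllu_line):
    """Return (form, lemma, feats_dict) from one CoNLL-U token line."""
    cols = conllu_line.rstrip("\n").split("\t")
    form, lemma, feats = cols[1], cols[2], cols[5]
    feat_dict = {}
    if feats != "_":                       # "_" marks an empty FEATS column
        for pair in feats.split("|"):      # e.g. "Number=Plur|Case=Nom"
            key, value = pair.split("=")
            feat_dict[key] = value
    return form, lemma, feat_dict
```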

Dependency Parser
For the final step, we generate the dependency output from the tokens and features produced by the pre-trained taggers. The parser uses Parsito (Straka et al., 2015b), a transition-based parser with a neural network classifier similar to that of Chen and Manning (2014). The inputs to the model represent the current configuration of the stack and buffer, including features of the top three nodes on each and the child nodes of the nodes on the stack. The features are projected to embeddings, which are concatenated into a vector representation of the input; this vector is fed to a hidden layer with tanh activation, and a softmax output layer gives the probabilities of the possible transition actions. The parser supports projective and non-projective dependency parsing, configured by the transition system option. In Universal Dependencies release 2.0, only UD Japanese and UD Galician contain no non-projective dependency trees, while UD Chinese, UD Polish and UD Hebrew have only a few non-projective trees, around 1% of the treebank. Based on the proportion of projective trees in each treebank, we train non-projective parsing for most treebanks, except UD Japanese and UD Galician. In projective parsing, we use the dynamic oracle, which usually performs better but is slower; in non-projective parsing, we use the static lazy and search-based oracles (Straka et al., 2015a).
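The forward pass of the classifier described above can be sketched as follows. This is an illustration of the embed-concatenate-tanh-softmax architecture, not Parsito's actual implementation; the function name, parameter names and dimensions are all ours.

```python
import numpy as np

def transition_scores(feature_ids, E, W1, b1, W2, b2):
    """Score transition actions for one parser configuration.

    feature_ids : indices of the extracted stack/buffer features
    E           : embedding matrix, one row per feature value
    W1, b1      : hidden-layer weights and bias (tanh activation)
    W2, b2      : output-layer weights and bias (softmax over actions)
    """
    x = np.concatenate([E[i] for i in feature_ids])  # embed and concatenate
    h = np.tanh(W1 @ x + b1)                         # hidden layer
    logits = W2 @ h + b2
    exp = np.exp(logits - logits.max())              # numerically stable softmax
    return exp / exp.sum()                           # P(action | configuration)
```

At parsing time the highest-scoring legal transition is applied, and the process repeats until the configuration is terminal.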
Apart from the transition system option, the other Parsito configurations are the same across all treebanks. For the structured interval option, we keep the default value 8. To ensure that there is only a single root when parsing, the single root option is set to 1. The training parameters of Parsito are shown in Table 3.

Surprise Language Parser
This sub-system deals with the surprise languages, for which no training data is available. We use a simple delexicalized, cross-lingual method: these low-resource languages are parsed with models learned from other languages. This follows the method of Zeman and Resnik (2008), which shows that transfer to another language with a delexicalized parser can perform well. Although different languages have different word forms, the underlying syntactic information may overlap, and universal POS tags can be used to exploit these correlations. To this end, for each surprise language we train a dependency parser on a closely related language (the source language) and then feed the delexicalized POS tag sequence of the surprise language to the source language parser. We consider language family and geographic proximity when choosing the source language for a surprise language.
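The delexicalization step itself is simple: word forms and lemmas are blanked out so that the parser sees only the universal POS tags (and morphological features), which are shared across languages. A minimal sketch operating on CoNLL-U token lines, with a function name of our own:

```python
def delexicalize(conllu_line):
    """Blank out FORM and LEMMA in a CoNLL-U token line, keeping UPOS/FEATS."""
    cols = conllu_line.split("\t")
    if len(cols) == 10 and cols[0].isdigit():  # a token line, not a comment
        cols[1] = "_"   # FORM
        cols[2] = "_"   # LEMMA
    return "\t".join(cols)
```

Applying this to both the source-language training data and the surprise-language input puts the two languages in the same delexicalized feature space.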

Results
The evaluation process of this shared task is deployed on TIRA (Potthast et al., 2014). LAS is the main scoring metric, and we show the performance of our system on several groups of treebanks in Table 5, using the same grouping as the official results. In addition, the LAS of our system on the Surprise Languages is shown in Table 6. Several official evaluation results, such as LAS and UAS, are compared with the best results in Table 7.

Conclusion
In this paper, we describe the universal dependency parser built for our participation in the CoNLL 2017 shared task. The official evaluation shows that our system achieves a macro-averaged LAS F1-score of 66.53% on the official blind test set. Further improvements could be obtained by more careful fine-tuning of the models and by adopting more sophisticated neural models.