Universal Morpho-Syntactic Parsing and the Contribution of Lexica: Analyzing the ONLP Lab Submission to the CoNLL 2018 Shared Task

We present the contribution of the ONLP lab at the Open University of Israel to the UD shared task on multilingual parsing from raw text to Universal Dependencies. Our contribution is based on a transition-based parser called ‘yap – yet another parser’, which includes a standalone morphological model, a standalone dependency model, and a joint morphosyntactic model. In the task we used yap's standalone dependency parser to parse input morphologically disambiguated by UDPipe, and obtained the official score of 58.35 LAS. In our follow-up investigation we use yap to show how the incorporation of morphological and lexical resources may improve the performance of end-to-end raw-to-dependencies parsing in the case of a morphologically rich and low-resource language, Modern Hebrew. Our results on Hebrew underscore the importance of CoNLL-UL, a UD-compatible standard for accessing external lexical resources, for enhancing end-to-end UD parsing, in particular for morphologically rich and low-resource languages. We thus encourage the community to create, convert, or make available more such lexica in future tasks.


Introduction
The Universal Dependencies (UD) initiative (universaldependencies.org) is an international, cross-linguistic and cross-cultural initiative aimed at providing morpho-syntactically annotated data sets for the world's languages under a unified, harmonized annotation scheme. The UD scheme (Nivre et al., 2016) adheres to two main principles: (i) there is a single set of POS tags, morphological properties, and dependency labels for all treebanks, and their annotation obeys a single set of annotation principles, and (ii) the text is represented in a two-level representation, clearly mapping the written space-delimited source tokens to the (morpho)syntactic words which participate in the dependency tree.
The CONLL 2018 UD SHARED TASK is a multilingual parsing evaluation campaign wherein, contrary to previous shared tasks such as CoNLL-06/07 (Buchholz and Marsi, 2006; Nivre et al., 2007), corpora are provided as raw text, and the end goal is to provide a complete morpho-syntactic representation, including automatically resolving all of the token-word discrepancies. Contrary to the previous SPMRL shared tasks (Seddah et al., 2013, 2014), the output of all systems obeys a single annotation scheme, allowing for reliable cross-system and cross-language evaluation. This paper presents the system submitted by the ONLP lab to the shared task, including the dependency models trained on the train sets, assuming input tokens morphologically disambiguated by UDPipe (Straka et al., 2016). We successfully parsed 81 test treebanks of the UDv2 set (Nivre et al., 2017) participating in the CONLL 2018 UD SHARED TASK (Zeman et al., 2018), obtaining the official score of 58.35 LAS on average over all treebanks. We then present an analysis of the case of Modern Hebrew, a low-resource morphologically rich language (MRL), which is known to be notoriously hard to parse due to its high morphological word ambiguity and the small size of the treebank. We investigate the contribution of an external lexicon and a standalone morphological component, and show that inclusion of such lexica can lead to above 10% LAS improvement on this MRL.
Our investigation demonstrates the importance of sharing not only syntactic treebanks but also lexical resources among the UD community, and we propose the UD-compatible CoNLL-UL standard for external lexica (More et al., 2018) for sharing broad-coverage lexical resources in the next UD shared tasks, and in general.
The remainder of this document is organized as follows: In Section 2, we present our parser's formal system and statistical models. In Section 3 we present technical issues relevant to the official run for the shared task, followed by our results on all languages. In Section 4 we proceed with an analysis of the performance on Modern Hebrew in the task, compared against its performance when augmented with a lexicon-backed morphological analyzer. Finally, in Section 5 we discuss directions for future work and conclude by embracing the CoNLL-UL standard (More et al., 2018) for UD-anchored lexical resources as a means to facilitate and improve raw-to-dependencies UD parsing.

Our Framework
The parsing system presented by the ONLP Lab for this task is based on yap (yet another parser), a transition-based parsing system that relies on the formal framework of Zhang and Clark (2011), an efficient computational framework designed for structure prediction and based on the generalized perceptron for learning and beam search for decoding. This section briefly describes the formal settings and specific models available via yap. 2

Formal Settings
Formally, a transition system is a quadruple $(C, T, c_s, C_t)$ where $C$ is a set of configurations, $T$ a set of transitions between the elements of $C$, $c_s$ an initialization function, and $C_t \subset C$ a set of terminal configurations. A transition sequence $y = t_n(t_{n-1}(\ldots t_1(c_s(x))))$ for an input $x$ starts with an initial configuration $c_s(x)$ and results in a terminal configuration $c_n \in C_t$. In order to determine which transition $t \in T$ to apply given a configuration $c \in C$, we define a model that learns to predict the transition that would be chosen by an oracle function $O : C \to T$, which has access to the gold output. We employ an objective function $F(x) = \operatorname{argmax}_{y \in GEN(x)} Score(y)$, which scores output candidates (transition sequences in $GEN(x)$) such that the most plausible sequence of transitions is the one that most closely resembles the one generated by an oracle.
2 https://github.com/habeanf/yap
To compute $Score(y)$, $y$ is mapped to a global feature vector $\Phi(y) = \{\phi_i(y)\}$ where each feature $\phi_i(y)$ is a count of occurrences of a pattern defined by a feature function. Given this vector, $Score(y)$ is calculated as the dot product of $\Phi(y)$ and a weights vector $\omega$: $Score(y) = \Phi(y) \cdot \omega = \sum_i \omega_i \phi_i(y)$. Following Zhang and Clark (2011), we learn the weights vector $\omega$ via the generalized perceptron, using the early-update averaged variant of Collins and Roark (2004). For decoding, the framework uses the beam search algorithm, which helps mitigate otherwise irrecoverable errors in the transition sequence.
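The scoring and decoding scheme described above can be illustrated with a minimal Python sketch. It is not yap's implementation: the function names (`score`, `beam_search`), the callback interface, and the toy two-token tagging problem are assumptions made for illustration only, and perceptron training with early update is not shown.

```python
from collections import Counter

def score(features, weights):
    """Dot product of a sparse feature-count vector with a weight vector."""
    return sum(count * weights.get(name, 0.0) for name, count in features.items())

def beam_search(initial, is_terminal, successors, extract, weights, beam_size=4):
    """Return the best-scoring terminal configuration and its score.

    successors(config) returns (transition, next_config) pairs;
    extract(config, transition) returns the feature names fired by that step.
    Both callbacks are hypothetical interfaces for this sketch.
    """
    beam = [(0.0, initial, Counter())]
    while not all(is_terminal(config) for _, config, _ in beam):
        candidates = []
        for total, config, feats in beam:
            if is_terminal(config):
                # Finished sequences stay in the beam and keep competing.
                candidates.append((total, config, feats))
                continue
            for transition, nxt in successors(config):
                # Global feature vector: counts accumulated over the sequence.
                new_feats = feats + Counter(extract(config, transition))
                candidates.append((score(new_feats, weights), nxt, new_feats))
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    best_total, best_config, _ = beam[0]
    return best_config, best_total

# Toy problem: assign tag A or B to each of two tokens.
tokens = ["x", "y"]
weights = {"0:A": 1.0, "0:B": 0.5, "1:B": 2.0}
best, best_score = beam_search(
    initial=(),
    is_terminal=lambda c: len(c) == len(tokens),
    successors=lambda c: [(t, c + (t,)) for t in ("A", "B")],
    extract=lambda c, t: [f"{len(c)}:{t}"],
    weights=weights,
)
```

Keeping several candidates per step is what lets a sequence that looks weaker early on overtake the locally best one later, which is the sense in which beam search mitigates early errors.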

Morphological Analysis
The input to the morphological disambiguation (MD) component in particular, and to the yap parsing system in general, is a lattice $L$ representing all of the morphological analysis alternatives of the $k$ surface tokens of the input stream $x = x_1, \ldots, x_k$, such that each $L_i = MA(x_i)$ is generated by a morphological analysis (MA) component, and $L$ concatenates the token lattices for the whole input sentence $x$. Each lattice arc in $L$ has a morphosyntactic representation (MSR) defined as $m = (b, e, f, t, g)$, with $b$ and $e$ marking the start and end nodes of $m$ in $L$, $f$ a form, $t$ a universal part-of-speech tag, and $g$ a set of attribute=value universal features. These lattice arcs correspond to potential nodes in the intended dependency tree.
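To make the lattice representation concrete, the following Python sketch models an arc as the tuple $(b, e, f, t, g)$ and concatenates per-token lattices. The class name `Arc`, the helper functions, the convention that each token lattice is numbered from node 0, and the transliterated toy forms are all assumptions for illustration; they are not taken from yap or from a real lexicon.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class Arc:
    b: int                           # start node
    e: int                           # end node
    f: str                           # form
    t: str                           # universal POS tag
    g: FrozenSet[str] = frozenset()  # attribute=value features

def concatenate(token_lattices):
    """Chain per-token lattices so each starts at the previous lattice's top node.

    Assumes every token lattice uses node 0 as its bottom and its largest
    end node as its top.
    """
    arcs, offset = [], 0
    for lattice in token_lattices:
        top = max(arc.e for arc in lattice)
        arcs.extend(Arc(a.b + offset, a.e + offset, a.f, a.t, a.g) for a in lattice)
        offset += top
    return arcs

def paths(arcs, node, top):
    """Enumerate all arc sequences leading from `node` to `top`."""
    if node == top:
        return [[]]
    found = []
    for arc in arcs:
        if arc.b == node:
            for rest in paths(arcs, arc.e, top):
                found.append([arc] + rest)
    return found

# Hypothetical toy input: the first token is ambiguous between a two-morpheme
# and a one-morpheme analysis, the second token has a single analysis.
token_1 = [Arc(0, 1, "b", "ADP"), Arc(1, 2, "bit", "NOUN"), Arc(0, 2, "bbit", "NOUN")]
token_2 = [Arc(0, 1, "gdwl", "ADJ")]
lattice = concatenate([token_1, token_2])
analyses = [[arc.f for arc in path] for path in paths(lattice, 0, 3)]
```

Each path through the concatenated lattice is one candidate segmentation and tagging of the whole sentence, which is the search space the disambiguation component operates on.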

Morphological Disambiguation
The morphological disambiguation (MD) component of our parser is based on More and Tsarfaty (2016), modified to accommodate UD POS tags and morphological features. We provide here a brief exposition of the transition system, as shall be needed for our later analysis, and refer the reader to the original paper for an in-depth discussion (More and Tsarfaty, 2016).
A configuration for morphological disambiguation $C_{MD} = (L, n, i, M)$ consists of a lattice $L$, an index $n$ representing a node in $L$, an index $i$ s.t. $0 \le i < k$ representing a specific token's lattice, and a set of disambiguated morphemes $M$.
The initial configuration function $c_s(x)$ sets $n = bottom(L)$, the bottom of the lattice. A configuration is terminal when $n = top(L)$ and $i = k$. To traverse the lattice and disambiguate the input, we define an open set of transitions using the $MD_s$ transition template, where $p = b$ and $q = e$ (the start and end nodes of the morpheme $m$), and $s$ relates the transition to the disambiguated morpheme $m$ using a parameterized delexicalization $s = DLEX_{OC}(m)$. In words, DLEX projects a morpheme either with or without its form, depending on whether or not the POS tag is open-class with respect to the form. For UD, we define the open-class set $OC$ over the universal POS tags. We use the parametric model of More and Tsarfaty (2016) to score the transitions at each step. Since lattices may have paths of different lengths and we use beam search for decoding, the problem of variable-length transition sequences arises. We follow More and Tsarfaty (2016), using the ENDTOKEN transition to mitigate the biases induced by variable-length sequences.
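The delexicalization idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the definition used in yap: the set `ASSUMED_OPEN_CLASS` is a hypothetical placeholder rather than the paper's actual open-class set for UD, and the direction of the projection (open-class morphemes lose their form, all others keep it) is our reading of the description above.

```python
# Hypothetical open-class set, for illustration only.
ASSUMED_OPEN_CLASS = frozenset({"NOUN", "VERB", "ADJ"})

def dlex(morpheme, open_class=ASSUMED_OPEN_CLASS):
    """Project a (form, tag, features) morpheme onto a transition signature.

    Assumption: open-class morphemes are delexicalized (form dropped),
    while all other morphemes keep their form.
    """
    form, tag, feats = morpheme
    if tag in open_class:
        return (None, tag, feats)
    return (form, tag, feats)

noun_signature = dlex(("bit", "NOUN", ("Gender=Masc",)))
adp_signature = dlex(("b", "ADP", ()))
```

Under this reading, the set of transitions stays manageable: open-class words share one transition per tag and feature bundle, while function-like morphemes each get their own lexicalized transition.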

Syntactic Disambiguation
A syntactic configuration is a triplet $C_{DEP} = (\sigma, \beta, A)$ where $\sigma$ is a stack, $\beta$ is a buffer, and $A$ a set of labeled arcs. For dependency parsing, we use a specific variant of Arc Eager that was first presented by Zhang and Nivre (2011). The differences between plain Arc Eager and the Arc ZEager variant are detailed in Figure 1.
The features defined for the parametric model also follow the definition of non-local features by Zhang and Nivre (2011), with one difference: we created one version of each feature with a morphological signature (all feature values of the relevant node) and one without. This allows the model to capture phenomena such as agreement.
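For readers unfamiliar with the transition system, the Python sketch below implements the four transitions of plain Arc Eager over a configuration (stack, buffer, arcs). It does not reproduce the Arc ZEager variant, whose differences are given in Figure 1, and the function names and the toy sentence indices are assumptions for illustration.

```python
def has_head(arcs, node):
    """True if some arc already assigns a head to `node`."""
    return any(dep == node for _, _, dep in arcs)

def shift(stack, buffer, arcs):
    """Move the first buffer item onto the stack."""
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs, label):
    """Attach the stack top as a dependent of the first buffer item, then pop it."""
    dep, head = stack[-1], buffer[0]
    assert dep != 0 and not has_head(arcs, dep)  # not the root, no head yet
    return stack[:-1], buffer, arcs | {(head, label, dep)}

def right_arc(stack, buffer, arcs, label):
    """Attach the first buffer item as a dependent of the stack top, then push it."""
    head, dep = stack[-1], buffer[0]
    return stack + [dep], buffer[1:], arcs | {(head, label, dep)}

def reduce_top(stack, buffer, arcs):
    """Pop the stack top, which must already have a head."""
    assert has_head(arcs, stack[-1])
    return stack[:-1], buffer, arcs

# Toy sentence with node 0 as the artificial root and words 1..3
# (for example "She reads books"); arcs are (head, label, dependent).
config = ([0], [1, 2, 3], frozenset())
config = shift(*config)
config = left_arc(*config, "nsubj")
config = right_arc(*config, "root")
config = right_arc(*config, "obj")
stack, buffer, arcs = config
```

The parser's statistical model decides which of these transitions to apply at each step; the sketch only fixes the mechanics of each transition.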

Joint Morpho-Syntactic Processing
Given the standalone morphological and syntactic disambiguation systems, it is possible to embed the two into a single joint morpho-syntactic transition system with a "router" that decides which of the transition systems to apply in a given configuration, and to train the morphosyntactic model to maximize a single objective function. We implemented such a joint parser in yap but did not use it in the task, and we thus leave its description out of this exposition. For further discussion and experiments with the syntactic and joint morpho-syntactic variants in yap we refer the reader to More et al. (In Press).

Shared Task Implementation
For sentence segmentation and tokenization up to and including full morphological disambiguation for all languages, we rely on UDPipe (Straka et al., 2016). Our parsing system implementation is yap (yet another parser), an open-source natural language processor written in Go. Once compiled, the processor is a self-contained binary, without any dependencies on external libraries.
For the shared task the processor was compiled with Go version 1.10.3. During the test phase we wrapped the processor with a bash script that invoked yap serially on all the treebanks. Additionally, in order to train on all treebanks we limited the size of all training sets to the first 50,000 sentences for the parser.
Finally, our training algorithm iterates until convergence, where performance is measured by F1 for labeled attachment score (LAS) when evaluated on the languages' respective development sets. We define convergence as two consecutive iterations resulting in a monotonic decrease in F1 for LAS, and we used the best performing model up to that point. For some languages we observed that F1 never decreased twice in a row, so after 20 iterations we manually stopped training and used the best performing model.
For some treebanks (cs_cac, fr_sequoia, ru_syntagrus) the serialization code, which relies on Go's built-in encoder package, failed to serialize the in-memory model because it was larger than $2^{30}$ bytes. To overcome this limitation we downloaded the Go source code, manually changed the const field holding this limit, and recompiled Go from source.
Our strategy for parsing low-resource languages was to use another treebank in the same language when one existed.
[Figure 1: Arc Eager and Arc ZEager transition systems]
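The stopping rule can be sketched as follows. This is a minimal Python illustration, not the training loop in yap: the function name `train_until_convergence`, the `run_iteration` callback, and the toy score sequences are hypothetical, and the rule is read as "stop once two consecutive iterations each score below the preceding one, or after 20 iterations, and keep the best model seen so far".

```python
def train_until_convergence(run_iteration, max_iterations=20):
    """Train until dev F1 drops in two consecutive iterations or the cap is hit.

    run_iteration(i) is a hypothetical callback that trains iteration i and
    returns (model, dev_f1). Returns (best_model, best_f1, iterations_run).
    """
    best_model, best_f1 = None, float("-inf")
    previous_f1, consecutive_drops = None, 0
    iterations_run = 0
    for i in range(1, max_iterations + 1):
        model, f1 = run_iteration(i)
        iterations_run = i
        if f1 > best_f1:
            best_model, best_f1 = model, f1
        if previous_f1 is not None and f1 < previous_f1:
            consecutive_drops += 1
        else:
            consecutive_drops = 0
        previous_f1 = f1
        if consecutive_drops == 2:
            break
    return best_model, best_f1, iterations_run

# Toy score sequence: the peak is at iteration 3, followed by two drops.
toy_scores = [50.0, 55.0, 57.0, 56.5, 56.0, 58.0]
converged = train_until_convergence(lambda i: (f"model-{i}", toy_scores[i - 1]))

# Toy sequence that oscillates and never drops twice in a row.
capped = train_until_convergence(lambda i: (f"model-{i}", 50.0 + (i % 2)))
```

In the first toy run, training stops at iteration 5 and the model from iteration 3 is kept, so the later score of 58.0 is never reached; this is the usual trade-off of patience-based stopping. The fixed cap stands in for the manual stop described above.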

The Case of MRLs: A Detailed Analysis for Modern Hebrew
As is well known, and as observed in this particular task, morphologically rich languages are the most challenging to parse in raw-to-dependencies parsing scenarios. This is because the initial automatic segmentation and morphological disambiguation may contain irrecoverable errors which undermine parsing performance. In order to investigate the errors of our parser we took a particular MRL that is known to be hard to parse (Modern Hebrew, ranked 58 in the LAS ranking, with baseline 58.73 accuracy) and contrasted the baseline UDPipe results with the results of our parser, with and without the use of external lexical and morphological resources. Table 1 lists the results of the different parsing models on our dev set. In all of the parsing scenarios, we used UDPipe's built-in sentence segmentation, to make sure we parse the exact same sentences. We then contrasted UDPipe's full pipeline with the yap output for different morphological settings. We used the Hebrew UD train set for training and the Hebrew UD dev set for analyzing the empirical results.
Initially, we parsed the dev set with the same system we used for the shared task, namely, the yap dependency parser applied to the morphologically disambiguated output of UDPipe (yap DEP).
Here we see that the yap DEP results (59.19) are lower than those of the full UDPipe pipeline (61.95). We then moved on to experiment with yap's complete pipeline, including a data-driven morphological analyzer (MA) to produce input lattices, transition-based morphological disambiguation, and transition-based parsing. The results now dropped relative to the UDPipe baseline and relative to our own yap DEP system, from 59.19 to 52.25 LAS. Interestingly, when we replace the baseline data-driven MA learned from the treebank alone with an MA backed by an external broad-coverage lexicon called HebLex (adapted from Adler and Elhadad, 2006), the results reach 60.94 LAS, outperforming the results obtained by yap DEP (UDPipe morphology with yap dependencies) and closing much of the gap with the full UDPipe model. This suggests that much of the parser error stems from missing lexical knowledge concerning the morphologically rich and ambiguous word forms, rather than from the parsing model itself.
Finally, we simulated ideal morphological lattices by artificially infusing the path that indicates the correct disambiguation into the HebLex lattices whenever it was missing. Note that we still provide an ambiguous input signal, with many possible morphological analyses, only now we guarantee that the correct analysis exists in the lattice. For this setting, we see a significant improvement in LAS, obtaining 71.39 (well beyond the baseline) without changing any of the parsing algorithms involved. So, for morphologically rich and ambiguous languages it appears that lexical coverage is a major factor affecting task performance, especially in the resource-scarce case.
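The infusion step can be pictured with a small Python sketch. It is an assumption-laden illustration rather than the procedure used in the experiments: arcs are simplified to (start, end, form, tag) tuples, the gold path is assumed to use node indices compatible with the lattice, and the function name `infuse_gold` and the toy forms are hypothetical.

```python
def infuse_gold(lattice, gold_path):
    """Return the lattice with any missing gold arcs appended.

    Existing arcs are kept, so the input remains ambiguous; the only change
    is that the correct analysis is guaranteed to be present.
    """
    present = set(lattice)
    return list(lattice) + [arc for arc in gold_path if arc not in present]

# Toy lattice that offers only a one-morpheme analysis of the token,
# while the gold analysis splits it into two morphemes.
lexicon_lattice = [(0, 2, "bbit", "NOUN")]
gold = [(0, 1, "b", "ADP"), (1, 2, "bit", "NOUN")]
infused = infuse_gold(lexicon_lattice, gold)
```

Because the competing analyses remain in the lattice, the disambiguation model still has to choose the correct path; the experiment isolates coverage, not disambiguation accuracy.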
Note that the upper-bound of our parser, when given a completely disambiguated morphological input stream, provides LAS of 79.33, which is a few points above the best system (Stanford) in the raw-to-dependencies scenario.

Discussion and Conclusion
This paper presents our submission to the CONLL 2018 UD SHARED TASK. Our submitted system relied on UDPipe up to and including morphological disambiguation, and employed a state-of-the-art transition-based parsing model to successfully parse 81 treebanks in the UDv2 set, with an average LAS of 58.35, ranked 22 among the shared task participants. A detailed post-task investigation of the performance on Modern Hebrew, covering the shared task system and a number of variants, has shown that for the MRL case many of the parser errors may be attributed to incomplete morphological analyses, or a complete lack thereof, for the source tokens in the input stream.
In the future we intend to investigate more sophisticated ways of incorporating additional external lexical and morphological knowledge, explicitly by means of broad-coverage lexica obeying the CoNLL-UL format (More et al., 2018), or implicitly by means of word embeddings pre-trained on large volumes of data. Note, however, that the utility of word embeddings itself presents an open question in the case of morphologically rich and ambiguous source tokens, where each token may be equivalent to multiple syntactic words in a language like English.