Universal Joint Morph-Syntactic Processing: The Open University of Israel’s Submission to The CoNLL 2017 Shared Task

We present the Open University’s submission to the CoNLL 2017 Shared Task on multilingual parsing from raw text to Universal Dependencies. The core of our system is a joint morphological disambiguator and syntactic parser which accepts morphologically analyzed surface tokens as input and returns morphologically disambiguated dependency trees as output. Our parser requires a lattice as input, so we generate morphological analyses of surface tokens using a data-driven morphological analyzer that derives its lexicon from the UD training corpora, and we rely on UDPipe for sentence segmentation and surface-level tokenization. Our official macro-average LAS is 56.56. Although our model is not as performant as many others, it does not make use of neural networks, and therefore does not rely on word embeddings or on any data source other than the corpora themselves. In addition, we show the utility of a lexicon-backed morphological analyzer for the morphologically rich language (MRL) Modern Hebrew. We use our results on Modern Hebrew to argue that the UD community should define a UD-compatible standard for access to lexical resources, which we argue is crucial for MRLs and low-resource languages in particular.


Introduction
The Universal Dependencies (UD) project (Nivre et al., 2016) sets itself apart from previous multilingual parsing initiatives such as the CoNLL (Buchholz and Marsi, 2006; Nivre et al., 2007) and SPMRL (Seddah et al., 2013, 2014) shared tasks with two key principles: (i) the POS tags, morphological properties, and dependency labels are unified, with enforceable annotation guidelines, and (ii) corpus text is provided via a two-level representation of the input stream. With the latter two-level principle in place, corpora can be provided with raw text, syntactic words as the nodes of syntactic trees, and the relationship between them, in a harmonized scheme. This representation is crucial to the participation of Morphologically Rich Languages (MRLs) in end-to-end parsing tasks.
The availability of a wide range of language corpora in this manner provides a unique opportunity for the advancement of (universal) joint morpho-syntactic processing, introduced by Tsarfaty and Goldberg (2008) in a generative setting and advocated for in a variety of settings (Bohnet and Nivre, 2012; Andor et al., 2016; Bohnet et al., 2013; Li et al., 2011; Li et al., 2014). To this end, our submission is a joint morpho-syntactic processor in a transition-based framework. We present our submission (OpenU-NLP-Lab), with models trained only on the train sets (Nivre et al., 2017b), parsing all 81 test treebanks of UD v2 corpora (Nivre et al., 2017a) participating in the CoNLL 2017 UD Shared Task (Zeman et al., 2017).
We use the results of our processor on an MRL to argue that one last piece of the puzzle is missing: a universal scheme for access to lexical resources. We discuss our results for a lexicon-backed approach, compared to a data-driven one. The goal of our submission is to compel the UD community to recognize the need for lexical resources in the context of joint morpho-syntactic processing, and to push forward the discussion on a UD annotation-compliant standard for access to lexical resources that could benefit MRLs and low-resource languages.
In section 2 we describe our framework and formal settings (2.1), first instantiated individually as a morphological disambiguator (2.2) and dependency parser (2.3), followed by how we unify the two into a joint processor (2.4).
Since the input stream of the processor is a morphological analysis of the tokenized raw text, we describe a universal, data-driven morphological analyzer, and a lexicon-based MA for the MRL Modern Hebrew (2.5).
In section 3, we detail the implementation of our parser (3.1) and specific technical issues we encountered with the official run for the shared task (3.2). We then present our results on all languages in section 4, and present a comparison to processing Modern Hebrew with a lexicon-based morphological analyzer. We discuss directions for future work in section 5, conclude with a summary of our submission in section 6, and urge the UD community to put forth a standard for lexical resource access.

Our Framework
We use the transition-based framework of Zhang and Clark (2011), originally designed for syntactic processing using the generalized perceptron and beam search, which we briefly cover in subsection 2.1.
We first describe the standalone transition system and model for morphological disambiguation of More and Tsarfaty (2016) (2.2), and the Arc Standard transition system together with a rich linguistic feature model (2.3). We then present our approach to joint morpho-syntactic processing, which unifies both transition systems (2.4).
We present our baseline approach to data-driven morphological analysis, followed by our Modern Hebrew lexical resource (2.5).

Formal Settings
Formally, a transition system is a quadruple $(C, T, c_s, C_t)$ where $C$ is a set of configurations, $T$ a set of transitions between the elements of $C$, $c_s$ an initialization function, and $C_t \subset C$ a set of terminal configurations. A transition sequence $y = t_n(t_{n-1}(\ldots t_1(c_s(x))))$ for an input $x$ starts with the configuration $c_s(x)$. After $n$ transitions of corresponding configurations $(t_i, c_i) \in T \times C$, the result is a terminal configuration $c_n \in C_t$. In order to determine which transition $t \in T$ to apply given a configuration $c \in C$, we need to define a model that learns to predict the transition that would be chosen by an oracle function $O : C \rightarrow T$, which has access to the correct (gold) output.
To define a model, we employ an objective function $F : X \rightarrow \mathbb{R}$, which ranks outputs via a scoring of the possible transition sequences ($GEN(x)$) from which outputs are derived, such that the most plausible sequence of transitions is the one that most closely resembles one generated by an oracle:
$$F(x) = \operatorname{argmax}_{y \in GEN(x)} Score(y)$$
How we define $Score$ is therefore crucial to the performance of the model, since it must capture the relation of a generated sequence (and its derived output) to that of an oracle's output. To compute $Score(y)$, $y$ is mapped to a global feature vector $\Phi(y) = \{\phi_i(y)\}$ where each feature is a count of occurrences defined by feature functions. Given this vector, $Score(y)$ is calculated as the dot product of $\Phi(y)$ and a weights vector $\omega$:
$$Score(y) = \Phi(y) \cdot \omega = \sum_i \omega_i \phi_i(y)$$
Following Zhang and Clark (2011), we learn the weights vector $\omega$ via the generalized perceptron, using the early-update averaged variant of Collins and Roark (2004).
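As an illustration of the scoring scheme described above, the following minimal Python sketch computes a sparse global feature vector, scores a sequence as a dot product with a weight vector, and applies one perceptron step. It is not the yap implementation: the function names, the two toy feature functions in `FEATURES`, and the explicit candidate list standing in for $GEN(x)$ are all assumptions made for the example, and the update shown is the plain perceptron step rather than the averaged, early-update variant.

```python
from collections import Counter


def global_features(sequence, feature_fns):
    """Build Phi(y): a sparse count of every feature fired along the sequence."""
    phi = Counter()
    for step in range(len(sequence)):
        for fn in feature_fns:
            phi[fn(sequence, step)] += 1
    return phi


def score(phi, weights):
    """Score(y) as the dot product of Phi(y) with a sparse weight vector."""
    return sum(count * weights.get(feature, 0.0) for feature, count in phi.items())


def best_sequence(candidates, feature_fns, weights):
    """Pick the highest-scoring sequence among explicitly listed candidates."""
    return max(candidates, key=lambda y: score(global_features(y, feature_fns), weights))


def perceptron_update(weights, gold_phi, predicted_phi):
    """One plain perceptron step: reward gold features, penalise predicted ones."""
    for feature, count in gold_phi.items():
        weights[feature] = weights.get(feature, 0.0) + count
    for feature, count in predicted_phi.items():
        weights[feature] = weights.get(feature, 0.0) - count
    return weights


# Hypothetical feature functions: the transition itself, and the transition bigram.
FEATURES = [
    lambda seq, i: ("t", seq[i]),
    lambda seq, i: ("tt", seq[i - 1] if i > 0 else "<s>", seq[i]),
]
```

Keeping both the feature vector and the weights sparse means a score only touches the features that actually fired, which matters when feature templates are lexicalized.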
For decoding, the framework uses the beam search algorithm, which helps mitigate otherwise irrecoverable errors in the transition sequence.
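The role of the beam can be sketched as follows. This is a generic beam search over an abstract transition system, written under the assumption that a hypothetical `successors` callable returns `(score_delta, next_configuration)` pairs and that `is_terminal` recognizes terminal configurations; neither name comes from the parser itself.

```python
def beam_search(initial, successors, is_terminal, beam_size=4):
    """Return the best (total_score, configuration) pair found by beam search."""
    beam = [(0.0, initial)]
    while not all(is_terminal(config) for _, config in beam):
        candidates = []
        for total, config in beam:
            if is_terminal(config):
                # Finished hypotheses stay in play and compete with extended ones.
                candidates.append((total, config))
                continue
            for delta, next_config in successors(config):
                candidates.append((total + delta, next_config))
        if not candidates:
            raise ValueError("no hypothesis can be extended to a terminal configuration")
        # Keep only the beam_size highest-scoring hypotheses.
        candidates.sort(key=lambda item: item[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0]
```

With `beam_size=1` this reduces to greedy decoding, where a locally attractive first transition can never be undone. A wider beam keeps lower-ranked hypotheses alive long enough for later transitions to overturn the ranking.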

Morphological Disambiguation
The morphological disambiguator (MD) component of our parser is based on More and Tsarfaty (2016), modified only to accommodate UD POS tags and morphological features. We provide a brief exposition of the transition system, and refer the reader to the original paper for an in-depth explanation (More and Tsarfaty, 2016).
The input to the transition-based MD is a lattice $L$ of an input stream of $k$ surface tokens $x = x_1, \ldots, x_k$, such that $L_i = MA(x_i)$ is generated by a morphological analysis component that analyzes each token separately and returns a lattice for the whole input sentence $x$. We rely on the UDPipe baseline models (Straka et al., 2016) for sentence segmentation and tokenization.
Each lattice-arc in $L$ corresponds to a potential node in the intended dependency tree. A lattice-arc has a morpho-syntactic representation (MSR) defined as $m = (b, e, f, t, g)$, with $b$ and $e$ marking the start and end nodes of $m$ in $L$, $f$ a form, $t$ a universal part-of-speech tag, and $g$ a set of attribute=value universal features.
A configuration $C_{MD} = (L, n, i, M)$ consists of a lattice $L$, an index $n$ representing a node in $L$, an index $i$ s.t. $0 \le i < k$ representing a specific token's lattice, and a set of disambiguated morphemes $M$.
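The initial configuration function $c_s$ pairs the lattice $L$ of the input with an empty set of disambiguated morphemes $M$. To traverse the lattice and disambiguate the input, we define an open set of transitions using the $MD_s$ transition template:
$$MD_s : (L, p, i, M) \rightarrow (L, q, i, M \cup \{m\})$$
where $p = b$, $q = e$, and $s$ relates the transition to the disambiguated morpheme $m$ using a parameterized delexicalization $s = DLEX_{OC}(m)$:
$$DLEX_{OC}(m) = \begin{cases} (\_, \_, \_, t, g) & \text{if } t \in OC \\ (\_, \_, f, t, g) & \text{otherwise} \end{cases}$$
In words, $DLEX$ projects a morpheme either with or without its form, depending on whether or not the POS tag is an open class with respect to the form. For UD, we redefine the set of open-class tags $OC$ in terms of the universal POS tags. We use the parametric model of More and Tsarfaty (2016) to score the transitions at each step.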
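The delexicalization described in words above can be sketched in Python as follows. The `MSR` class mirrors the tuple $(b, e, f, t, g)$ from the text. The set `ASSUMED_OPEN_CLASS` is purely illustrative and is not the set of tags used by the parser, and dropping the lattice positions $b$ and $e$ from the projection is likewise an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class MSR:
    """Morpho-syntactic representation of one lattice-arc."""
    b: int                 # start node in the lattice
    e: int                 # end node in the lattice
    f: Optional[str]       # surface form
    t: str                 # universal POS tag
    g: FrozenSet[str]      # attribute=value universal features


# Illustrative assumption only: NOT the open-class set used in the paper.
ASSUMED_OPEN_CLASS = frozenset({"NOUN", "VERB", "ADJ", "PROPN"})


def dlex(m, open_class=ASSUMED_OPEN_CLASS):
    """Project a morpheme without its form if its tag is open-class, else with it."""
    form = None if m.t in open_class else m.f
    return (form, m.t, m.g)
```

Under these assumptions, two different nouns with the same features map to the same transition, while closed-class items such as determiners keep their form and so remain distinct transitions.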
Since lattices may have paths of different length and we use beam search for decoding, the problem of variable-length transition sequences arises. We follow More and Tsarfaty (2016), using the ENDTOKEN transition to mitigate the biases induced by variable-length sequences.

Syntactic Disambiguation
For dependency parsing, we use the Arc Standard configuration, transition system, and oracle function defined in Kübler et al. (2009). A configuration is a triple C DEP = (σ, β, A) where σ is a stack, β is a buffer, and A a set of labeled arcs.
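For orientation, the Arc Standard system in the textbook formulation of Kübler et al. (2009), in which arcs are drawn between the top of the stack and the front of the buffer, can be sketched in Python as below. This is written from that general formulation rather than from the parser's source code. The function names and the representation of arcs as `(head, label, dependent)` triples are assumptions of the sketch, and preconditions (such as never making the artificial root a dependent) are omitted.

```python
def shift(stack, buffer, arcs):
    """Move the front of the buffer onto the stack."""
    return stack + [buffer[0]], buffer[1:], arcs


def left_arc(stack, buffer, arcs, label):
    """Front of buffer becomes head of the stack top; the stack top is popped."""
    head, dependent = buffer[0], stack[-1]
    return stack[:-1], buffer, arcs | {(head, label, dependent)}


def right_arc(stack, buffer, arcs, label):
    """Stack top becomes head of the buffer front and replaces it in the buffer."""
    head, dependent = stack[-1], buffer[0]
    return stack[:-1], [head] + buffer[1:], arcs | {(head, label, dependent)}


def parse_toy_sentence():
    """Derive a tree for 'economic news had' (words 1..3, 0 is the artificial root)."""
    config = ([0], [1, 2, 3], frozenset())
    config = shift(*config)
    config = left_arc(*config, "amod")    # news -> economic
    config = shift(*config)
    config = left_arc(*config, "nsubj")   # had -> news
    config = right_arc(*config, "root")   # ROOT -> had
    config = shift(*config)               # empties the buffer: terminal
    return config
```

Because every arc connects two adjacent items at the stack/buffer boundary, a derivation of this kind can only produce projective trees.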
We present the specific variant of Arc Standard that we use in Figure 1. Note that in this variant, arc operations are performed between the top of the stack σ and the head of the buffer β. Additionally, in order to guarantee a single root, for the purposes of the shared task we apply a post-processing step in which the first root node encountered (in left-to-right order) is designated as the only root node, and all other root nodes are set as its modifiers with the "punct" dependency label.
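The single-root post-processing step can be sketched as follows. The representation is an assumption made for the example: `heads[i]` holds the head of word `i + 1`, with `0` denoting the root, and `labels` is a parallel list of dependency labels.

```python
def enforce_single_root(heads, labels):
    """Keep the first root (left to right); reattach every later root to it as 'punct'."""
    heads, labels = list(heads), list(labels)
    first_root = None
    for index, head in enumerate(heads):
        if head != 0:
            continue
        if first_root is None:
            first_root = index + 1      # word ids are 1-based
        else:
            heads[index] = first_root
            labels[index] = "punct"
    return heads, labels
```

The inputs are copied before modification, so the parser's original output remains available for inspection.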
Of course, this means that our transition system only applies to projective trees: the oracle will fail given a non-projective tree, and our transition system cannot output one. In addition, since we are using the Arc Standard transition system, which has been shown not to be arc-decomposable, we cannot employ a dynamic oracle during training (Goldberg and Nivre, 2012).
The rich linguistic feature model for our dependency parser, inspired by Zhang and Nivre (2011), applies their rich non-local features to Arc Standard (where this is possible), so as to accommodate the free word order of MRLs. We provide an appendix with a detailed comparison of the two feature models.

Joint Morpho-Syntactic Processing
Given standalone morphological and syntactic disambiguation systems in the same framework, we integrate them into a joint morpho-syntactic processor. Our integration is a literal embedding of the two systems, with a deterministic "router" that decides which of the two transition systems should apply a transition to a given configuration. We call this router a strategy.
We first must alter the morphological disambiguation transition such that a disambiguated morpheme is enqueued onto $\beta$. We call the set of joint strategies used for the shared task $ArcGreedy_k$, because it will perform a syntactic operation if possible, otherwise it will disambiguate a morpheme. $k$ determines the minimal number of morphemes that must be in the buffer $\beta$ of the Arc Standard configuration in order to perform a syntactic transition. We set $k = 3$ based on the features we use to predict the syntactic transition.
Figure 1: The Arc Standard transition system
The ArcGreedy approach provides joint processing through the interaction of the two systems via the global score. Together with beam search, this allows a syntactic transition to reverse the ranking of an otherwise higher-scored disambiguation candidate, and vice versa, although this interaction occurs with a small delay due to the gap between the morphological disambiguation transition and the syntactic transition for the same morpheme.
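A routing strategy of this kind can be sketched in a few lines of Python. The function name, its arguments, and the string return values are hypothetical. In particular, the rule that syntactic transitions proceed once the lattice is fully disambiguated, even with fewer than `k` morphemes in the buffer, is an assumption added here so that the toy router can always reach a terminal configuration.

```python
def arc_greedy(buffer_size, lattice_exhausted, k=3):
    """Choose which transition system acts next: 'DEP' (syntactic) or 'MD' (morphological)."""
    if buffer_size >= k or lattice_exhausted:
        return "DEP"
    return "MD"
```

For example, with `k=3` and an empty buffer the router first disambiguates three morphemes before the dependency system gets to act, which is the source of the small delay between the two kinds of transition for the same morpheme.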

Morphological Analysis
The joint parser requires a morphologically analyzed input, in the form of a lattice. However, universal lexical resources are not available for all languages participating in the shared task. Therefore, we use the data-driven morphological analyzer from More and Tsarfaty (2016), which derives its lexicon from the training set of a given UD corpus, modified to read/write UDv2-compatible file formats.
As part of our submission, we provide these derived lexica to the community.
In addition, we use the HEBLEX morphological analyzer from More and Tsarfaty (2016), adapted to output lattices conforming to UD annotation standards for universal POS tags and morphological features.

Implementation
In this section we describe technical details of the implementation (3.1), bugs encountered during the shared task (3.2), and our approach to surprise languages (3.3).

Technical Details
For sentence segmentation and tokenization, we rely on the UDPipe (Straka et al., 2016) predicted data files. The morphological analysis component and joint morpho-syntactic parser are all implemented in yap (yet another parser), an open-source natural language processor written in Go. Once compiled, the processor is a self-contained binary, without any dependencies on external libraries.
For the shared task the processor was compiled with Go version 1.8.1, and a git tag was created for the commit used at the time of the task. During the test phase we wrapped the processor with a Python script that invokes two instances concurrently in order to complete processing before the official (final) deadline.
Additionally, in order to train on all treebanks we limited the size of all training sets to the first 50,000 sentences for the parser.
Finally, our training algorithm iterates until convergence, where performance is measured by $F_1$ for full morphological disambiguation, evaluated on each language's development set. We define convergence as two consecutive iterations in which $F_1$ for full MD decreases, and we use the best performing model up to that point. For some languages we observed that $F_1$ never decreased twice in a row, so after 20 iterations we manually stopped training and used the best performing model.
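The stopping rule can be illustrated with a small Python sketch. It assumes that the development-set $F_1$ of each training iteration is available as a sequence, and it reads "two consecutive iterations resulting in a decrease" as two back-to-back drops relative to the preceding iteration. Both the function name and that reading are assumptions of the example.

```python
def select_model(f1_per_iteration, max_iterations=20):
    """Return (iteration, f1) of the best model seen before training stops."""
    best_iteration, best_f1 = None, float("-inf")
    previous, consecutive_drops = None, 0
    for iteration, f1 in enumerate(f1_per_iteration, start=1):
        if f1 > best_f1:
            best_iteration, best_f1 = iteration, f1
        # Count back-to-back decreases; any non-decrease resets the counter.
        if previous is not None and f1 < previous:
            consecutive_drops += 1
        else:
            consecutive_drops = 0
        previous = f1
        if consecutive_drops == 2 or iteration == max_iterations:
            break
    return best_iteration, best_f1
```

The iteration cap plays the role of the manual stop after 20 iterations for languages whose scores keep oscillating.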

Shared Task Bugs
We encountered two serious bugs during training for the shared task, which prevented us from running our joint processor on all treebanks.
First, for some treebanks (cs_cac, cs_cltt, cs_pud, cs, en, fr_sequoia, ru_syntagrus) the serialization code, which relies on Go's built-in encoder package, failed to serialize the in-memory model because it is larger than $2^{30}$ bytes. Much to our surprise, this is apparently an issue related to the decoder, one the Go maintainers are aware of but have decided not to address. Changing our model serialization code was too large a task at the time we found the problem, so for the aforementioned problematic treebanks we had no choice but to train only the dependency parser, and rely on UDPipe for morphological disambiguation.
Second, close to the time of submitting this paper, we discovered a bug in the morphological disambiguator. The original MD model from More and Tsarfaty (2016) assumed the Hebrew treebank SPMRL annotation (SPMRL citation), in which some clitics are identified by morphological "suffix" features, as opposed to the UD approach which breaks them down as separate syntactic words. As a result, the MD transition system sometimes fails to distinguish between lattice-arcs.
As a temporary remedy, we modified the parser such that syntactic words with clitic suffixes carry an additional indication as such, to set them apart from syntactic words without clitic suffixes. However, we did not have time to re-run our parses based on data-driven morphological analysis with this fix.

Surprise Languages
Our strategy for parsing surprise languages was to train a delexicalized (no word-form features) dependency-only parsing model on one treebank per surprise language, which we manually deemed "close", as follows:
• bxr: ru_syntagrus

Results and Discussion
In Tables 1 and 2 we present our official results for all languages. For the MRL Modern Hebrew, we additionally train and test using the lexicon-backed morphological analyzer (HEBLEX). When using HEBLEX, we obtained a word-segmentation $F_1$ score of 87.48, compared to 81.26 with the data-driven MA of the official results, a 33% reduction in error rate.
Although the data-driven results suffer from the aforementioned bug, we do not expect them to change considerably, as we have seen similarly large differences in comparable experiments on the SPMRL Hebrew treebank. We hope the results from our unofficial run will be more convincing. It is important to note that the best word-segmentation result for Modern Hebrew in the shared task is 91.37.
We argue that although our lexicon-assisted model did not outperform the best model in the shared task, this does not invalidate our position on universal lexical resources. A word-segmentation $F_1$ of 91.37 on Modern Hebrew is quite low, and in our opinion, still too low for inclusion in practical, real-world applications. We believe it is likely that, together with access to lexical resources, more performant models would be able to bridge the gap and reduce this large error rate to a more acceptable level for downstream tasks.
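The reported reduction in error rate can be checked with a short calculation. The sketch below assumes that the error rate is taken as $100 - F_1$, which is consistent with the reported figure; the function name is hypothetical.

```python
def relative_error_reduction(baseline_f1, improved_f1):
    """Fraction of the baseline error (100 - F1) removed by the improved system."""
    baseline_error = 100.0 - baseline_f1
    improved_error = 100.0 - improved_f1
    return (baseline_error - improved_error) / baseline_error


# Data-driven MA (81.26) versus HEBLEX (87.48): 6.22 / 18.74, roughly 0.33.
hebrew_reduction = relative_error_reduction(81.26, 87.48)
```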

Future Work
In the future, we would like to replace our more traditional linear model with a modern, non-linear neural network-based approach. However, to date there is no such solution for joint morpho-syntactic processing of MRLs, a problem we aim to tackle. In the context of a neural-based solution, we believe that the availability of lexical resources will be crucial for MRLs and low-resource languages in particular.

Conclusion
We present our submission to the CoNLL 2017 UD Shared Task, to the best of our knowledge the first universal, joint morpho-syntactic processor. Our official macro-average LAS is 56.56. We contrast our results on the MRL Modern Hebrew, as a showcase of the utility of access to a lexicon-backed morphological analyzer.
Our goal is to instigate a discussion in the UD community on the need for a universal scheme for lexical resource access.

Rich Linguistic Feature Types
In addition to the features described in Zhang and Nivre (2011), we define the following attributes:
• $f_p$: the multi-set of parts of speech of the dependents of a node
• $s_f$: the multi-set of labels of all dependents of a node
• $v_f$: the valency (= number) of all dependents of a node
Also, we define $C_i$ as an address generator: it generates a feature for each dependent of the addressed node.

Morphological Augmentation
To allow the inclusion of morphology we add the ability of specifying morphological properties to be added to all features of a feature group. Augmentation of a feature group does not cause a replacement of the defined features, it only creates a copy with the addition of morphological properties.
To augment a feature group, all the features in the group are required to have the same number of addresses. An augmentation specifies a character, either h or x, to specify the host or suffix morphological properties as attributes, respectively. If the group has more than one address, the augmentation must specify an address (a 1-indexed integer offset). Multiple augmentations may be used together.
For example, given the feature group Pairs in Table 3, the first few features are $S_0wtN_0wt$, $S_0wtN_0w$, $S_0wN_0wt$, etc. All features in the Pairs group have two addresses. An example of a morphological augmentation of the Pairs group is h1h2, resulting in the new features $S_0wtm_hN_0wtm_h$, $S_0wtm_hN_0wm_h$, $S_0wm_hN_0wtm_h$, etc., where $m_h$ is the set of key-value pairs of properties of the respective morphemes at the top of the stack ($S_0$) and buffer ($N_0$).
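The augmentation mechanism can be sketched in Python as below. The data layout is an assumption made for the example and is not taken from the parser: a feature is a tuple of `(address, attributes)` pairs, and the added morphological attribute is written `m_h` or `m_x`.

```python
import re


def parse_augmentation(spec):
    """Split a spec such as 'h1h2' into (kind, offset) pairs; offset may be None."""
    return [(kind, int(offset) if offset else None)
            for kind, offset in re.findall(r"([hx])(\d*)", spec)]


def augment_group(group, spec):
    """Return the original features plus one morphologically augmented copy of each."""
    arities = {len(feature) for feature in group}
    if len(arities) != 1:
        raise ValueError("all features in a group must have the same number of addresses")
    arity = arities.pop()
    copies = []
    for feature in group:
        parts = list(feature)
        for kind, offset in parse_augmentation(spec):
            if offset is None:
                if arity > 1:
                    raise ValueError("an address offset is required for this group")
                offset = 1
            if not 1 <= offset <= arity:
                raise ValueError("address offset out of range")
            address, attributes = parts[offset - 1]
            parts[offset - 1] = (address, attributes + ("m_" + kind,))
        copies.append(tuple(parts))
    # Augmentation adds copies; it never replaces the features already defined.
    return list(group) + copies
```

Applied to the first two Pairs features with the spec `h1h2`, the sketch returns four features: the two originals followed by their copies with `m_h` added at both addresses.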

Features
The set of rich non-local features of (Zhang and Nivre, 2011)