The ATILF-LLF System for Parseme Shared Task: a Transition-based Verbal Multiword Expression Tagger

We describe the ATILF-LLF system built for the MWE 2017 Shared Task on automatic identification of verbal multiword expressions. We participated in the closed track only, for all 18 available languages. Our system is a robust greedy transition-based system, in which MWEs are identified through a Merge transition. The system was designed to accommodate the variety of linguistic resources provided for each language, in terms of accompanying morphological and syntactic information. Using the per-MWE F-score, the system was ranked first for all but two languages (Hungarian and Romanian).


Introduction
Verbal multi-word expressions (hereafter VMWEs) tend to exhibit more morphological and syntactic variation than other MWEs, if only because in general the verb is inflected, and it can receive adverbial modifiers. Furthermore some VMWEs, in particular light verb constructions (one of the VMWE categories provided in the shared task), allow for the full range of syntactic variation (extraction, coordination, etc.). This renders the VMWE identification task even more challenging than general MWE identification, in which fully frozen and contiguous expressions help increase the overall performance.
The data sets are quite heterogeneous, both in terms of the number of annotated VMWEs and of accompanying resources (for the closed track; see footnote 2). So our first priority when setting up the architecture was to build a generic system applicable to all 18 languages, with limited language-specific tuning. We thus chose to participate in the closed track only, relying exclusively on training data, the accompanying CoNLL-U files when available, and basic feature engineering. We developed a one-pass greedy transition-based system, which we believe can handle discontinuities elegantly. We integrated more or less informed feature templates, depending on their availability in the data. We describe our system in section 2, the experimental setup in section 3, the results in section 4 and the related work in section 5. We conclude in section 6 and give perspectives for future work.
Footnote 1: Two systems participated for one language only (French), and five systems participated for more than one language.
Footnote 2: Some of the data sets contain the tokenized sentences plus VMWEs only (BG, ES, HE, LT), some are accompanied with morphological information such as lemmas and POS (CS, MT, RO, SL), and for the third group (the 10 remaining languages) full dependency parses are provided. See (Savary et al., 2017) for more information on the data sets.

System description
The identification system we used is a simplified and partial implementation of the system proposed in Constant and Nivre (2016), which is in itself a mild extension of an arc-standard dependency parser (Nivre, 2004). Constant and Nivre (2016) proposed a parsing algorithm that jointly predicts a syntactic dependency tree and a forest of lexical units including MWEs. In particular, in line with Nivre (2014), this system integrates special parsing mechanisms to deal with lexical analysis. Given that the shared task focuses on the lexical task only and that datasets do not always provide syntactic annotations, we have modified the structure of the original system by removing syntax prediction, in order to use the same system for all 18 languages.
A transition-based system consists in applying a sequence of actions (namely transitions) to incrementally build the expected output structure in a bottom-up manner. Each transition is usually predicted by a classifier given the current state of the parser (namely configuration). A configuration in our system consists of a triplet c = (σ, β, L), where σ is a stack containing units under processing, β is a buffer containing the remaining input tokens, and L is a set of processed lexical units. The processed units correspond either to tokens or to VMWEs. When corresponding to a single token, a lexical unit is composed of one node only, whereas a unit representing a (multi-token) VMWE is represented as a binary lexical tree over the input tokens. Every unit is associated with a set of linguistic attributes (when available in the working dataset): its actual form, lemma, part-of-speech (POS) tag, syntactic head and label. The initial configuration for a sentence x = x1, ..., xn is ([], [x1, ..., xn], ∅), and a terminal configuration has the form ([], [], L). The transitions of this system are limited to the following: (a) the Shift transition takes the first element in the buffer and pushes it onto the stack; (b) the Merge transition removes the two top elements of the stack, combines them as a single element, and adds it to the stack (see footnote 3); (c) the Complete transition moves the upper element of the stack to L, whether the element is a single token or an identified VMWE; and finally (d) the Complete-MWT transition, only valid for multiword tokens (MWT), acts as Complete, but also marks the element moved to L as VMWE.
Footnote 3: The newly created element is assigned linguistic attributes using basic concatenation rules that would deserve to be improved in future experiments: e.g., the lemma is the concatenation of the lemmas of the two initial elements.
Training such a system means enabling it to classify a configuration into the next transition to apply. This requires an oracle that determines what is an optimal transition sequence given an input sentence and the gold VMWEs. We created a static oracle using a greedy algorithm that performs Complete as soon as possible (i.e. when a non-VMWE token or a gold VMWE is on top of the stack) and Merge as late as possible (i.e. when the right-most component of the VMWE is on top of the stack) (see Figure 1). Note that an oracle sequence for a sentence of n tokens is composed of exactly 2n transitions: every single token requires one Shift and one Complete, and each multi-token VMWE of length m requires m Shifts, m−1 Merges and a single Complete.
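The transition system and the greedy static oracle described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Config`, `shift`, `merge`, `complete` and `static_oracle` are hypothetical, tokens are plain integer positions, a lexical unit is flattened to a tuple of positions instead of a binary tree, Complete-MWT is omitted, and gold VMWEs are assumed to be non-embedded, non-overlapping and non-interleaved.

```python
from dataclasses import dataclass, field


@dataclass
class Config:
    """Configuration c = (sigma, beta, L); all names here are illustrative."""
    stack: list = field(default_factory=list)   # sigma: units under processing
    buffer: list = field(default_factory=list)  # beta: remaining token positions
    done: list = field(default_factory=list)    # L: processed lexical units


def shift(c):
    # Move the first buffer element onto the stack as a one-token unit.
    c.stack.append((c.buffer.pop(0),))


def merge(c):
    # Combine the two top stack elements into a single unit.
    right = c.stack.pop()
    left = c.stack.pop()
    c.stack.append(left + right)


def complete(c):
    # Move the top stack element (token or VMWE) to the processed units.
    c.done.append(c.stack.pop())


def static_oracle(n_tokens, gold_vmwes):
    """Greedy oracle: Complete as soon as possible, Merge as late as possible."""
    member = {i: frozenset(v) for v in gold_vmwes for i in v}
    c = Config(buffer=list(range(n_tokens)))
    sequence = []
    while c.buffer or c.stack:
        top = c.stack[-1] if c.stack else None
        gold = member.get(top[0]) if top else None
        if top is not None and (gold is None or set(top) == gold):
            # A non-VMWE token or a fully built gold VMWE is on top of the stack.
            complete(c)
            sequence.append("COMPLETE")
        elif top is not None and max(gold) in top:
            # The right-most component of the VMWE has reached the stack top.
            if len(c.stack) < 2 or member.get(c.stack[-2][0]) != gold:
                raise ValueError("interleaved VMWEs are not supported")
            merge(c)
            sequence.append("MERGE")
        else:
            shift(c)
            sequence.append("SHIFT")
    return sequence, c.done
```

For a three-token sentence such as "take a bath" with the gold VMWE made of positions 0 and 2, the sketch discards the foreign element "a" with a Complete before merging, and the resulting sequence has 2n = 6 transitions.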
The proposed system has some limitations with respect to the shared task annotation scheme. First, for now, our system does not handle embedded VMWEs (only the longest VMWE is considered in the oracle, and the transition system cannot predict embeddings). This feature could be straightforwardly activated as VMWEs are represented with lexical trees. Note also that the system cannot handle overlapping MWEs like take_{1,2} a bath_1 then a shower_2 (subscripts indicate the VMWE each token belongs to), since this requires a graph representation (not a tree).

Experimental setup
For replication purposes, we now describe how the system has been implemented (Subsection 3.1), which feature templates have been used (Subsection 3.2) and how they have been tuned (Subsection 3.3). Simple descriptions of the system settings are provided in Table 1. Hereafter, we use the symbol Bi to indicate the ith element in the buffer. S0 and S1 stand for the top and the second top elements of the stack. For every unit X in the stack or the buffer, we denote Xw its word form, Xl its lemma and Xp its POS tag. The concatenation of two elements X and Y is noted XY.

Implementation
For a given language, and a given train/dev split, we train three SVM classifiers (one vs all, one vs one and error-correcting output codes) and we take their majority vote (see footnote 5). Note that some configurations only allow for a unique transition type, and thus do not require transition prediction. A configuration with a one-token stack and an empty buffer requires the application of a Complete, as the last transition of the transition sequence. Similarly, a configuration with an empty stack and a non-empty buffer must lead to a Shift transition.
During the feature tuning phase, for a few languages we added a number of hard-coded procedures aiming at enforcing specific transitions in given contexts. These procedures all use a VMWE dictionary extracted from the training set (hereafter the VMWE dictionary). For German and Hungarian, we noticed a high percentage of VMWEs with one token only (see footnote 6). We added the Complete-MWT transition for these languages, which we systematically apply when the head of the stack S0 is a token appearing as MWT in the VMWE dictionary (cf. setting Q in Table 1). For other languages with long and discontinuous expressions, we used other hard-coded procedures that experimentally proved to be beneficial (setting P in Table 1). We systematically apply a Complete transition when S1lB0l or S1lB1l forms a VMWE existing in the VMWE dictionary. Moreover, an obligatory Shift is applied when the concatenation of successive elements in the stack and the buffer belongs to the VMWE dictionary. In particular, we test S1lS0lB0l, S0lB0l, S0lB0lB1l and S0lB0lB1lB2l.
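The configurations with a single possible transition and the dictionary-based forced transitions can be sketched as one function. This is only an illustration: the function name `forced_transition`, the representation of stack and buffer as lists of lemma strings, the space-joined dictionary keys and the order in which the rules are checked are all assumptions, not details given in the text.

```python
def forced_transition(stack, buffer, vmwe_lexicon, mwt_lexicon=frozenset()):
    """Return a forced transition name, or None if a classifier must decide.

    stack: lemmas of the stack units, top element last (hypothetical layout).
    buffer: lemmas of the remaining tokens, B0 first.
    vmwe_lexicon: set of space-joined lemma sequences seen as VMWEs in training.
    mwt_lexicon: set of lemmas seen as one-token VMWEs in training.
    """
    def key(*lemmas):
        return " ".join(lemmas)

    # Empty stack: only Shift is possible (if anything is left to read).
    if not stack:
        return "SHIFT" if buffer else None

    s0 = stack[-1]
    s1 = stack[-2] if len(stack) > 1 else None
    b = buffer[:3]

    # Setting Q: one-token VMWE known from the training dictionary.
    if s0 in mwt_lexicon:
        return "COMPLETE-MWT"

    # One-element stack and empty buffer: Complete ends the sequence.
    if len(stack) == 1 and not buffer:
        return "COMPLETE"

    # Setting P: Complete S0 when S1 forms a known VMWE with B0 or B1.
    if s1 is not None:
        for nxt in b[:2]:
            if key(s1, nxt) in vmwe_lexicon:
                return "COMPLETE"

    # Setting P: Shift when successive elements form a known VMWE.
    candidates = []
    if s1 is not None and b:
        candidates.append(key(s1, s0, b[0]))          # S1 S0 B0
    for k in (1, 2, 3):
        if len(b) >= k:
            candidates.append(key(s0, *b[:k]))        # S0 B0 (B1 (B2))
    if any(c in vmwe_lexicon for c in candidates):
        return "SHIFT"

    return None
```

With the hypothetical entry "take bath" in the dictionary, a stack holding "take" and "a" in front of a buffer starting with "bath" triggers a Complete, which removes the intervening "a" so that the two components can later be merged.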

Feature Templates
A key point in a classical transition-based system is feature engineering: the design and tuning of feature templates can play a very important role in increasing the accuracy of the system.

Basic Linguistic Features
First of all, depending on their availability in the working dataset and on the activation of related settings (cf. G and J in Table 1), we extracted linguistic attributes in order to generate features such as S0l, S0p and S0w, where l, p and w stand for the lemma, the part of speech, and the word form respectively. The same features are extracted for unigrams S1, B0 and B1 (when used) (cf. E in Table 1).
Footnote 5: The whole system was developed using Python 2.7, with 2,200 lines of code, using the open-source Scikit-learn 0.19 libraries for the SVMs. The code is available on GitHub: https://goo.gl/EDFyiM
Footnote 6: These correspond mainly to cases of verb-particle (tagged VPC in the data sets) in which the particle is not separated from the verb.
Code  F  Setting description
B     +  use of transition history (length 1)
C     +  use of transition history (length 2)
D     +  use of transition history (length 3)
E     +  use of B1
F     +  use of bigrams (S1S0, S0B0, S1B0, S0B1)
G     +  use of lemma
H     +  use of syntax dependencies
I     +  use of trigrams S1S0B0
J     +  use of POS tag
K     +  use of distance between S0 and S1
L     +  use of training corpus VMWE lexicon
M     +  use of distance between S0 and B0
N     +  use of (S0B2) bigram
O     +  use of stack length
P     -  enabling dictionary-based forced transitions
Q     -  enabling Complete-MWT transition
Table 1: System setting code descriptions. The 'F' column indicates whether the setting is a feature-related setting ('+') used by the classifiers or a hard-coded implementation enhancement ('-').
When enabled, the bigram features for the pair XY of elements are XpYp, XlYl, XwYw, XpYl and XlYp. The trigram-based features are extracted in the same way. Basically, the involved bigrams are S1S0, S0B0, S1B0 and S0B1 (cf. setting F in Table 1), but we also added the S0B2 bigram for a few languages (cf. N in Table 1). For trigrams, we only used the features of the S1S0B0 triple (cf. I in Table 1).
Finally, because the datasets for some languages do not provide the basic linguistic attributes such as lemmas and POS tags, we tried to bridge the gap by extracting unigram "morphological" attributes when POS tag and lemma extraction settings were disabled (cf. G and J in Table 1). The features of S0 for such languages would be S0w, S0r and S0s, where r and s stand for the last two and three letters of S0w respectively.
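The unigram and bigram templates, together with the suffix fallback, can be sketched as follows. The function names, the dictionary-based unit representation with keys "form", "lemma" and "pos", and the underscore used to join values are hypothetical choices made for this illustration only.

```python
def unigram_features(name, unit, use_lemma=True, use_pos=True):
    """Features for one unit, e.g. name='S0' gives S0w, S0l, S0p.

    unit is a dict with keys 'form', 'lemma', 'pos' (assumed layout).
    use_lemma / use_pos mirror settings G and J of Table 1.
    """
    feats = {name + "w": unit["form"]}
    if use_lemma and unit.get("lemma"):
        feats[name + "l"] = unit["lemma"]
    if use_pos and unit.get("pos"):
        feats[name + "p"] = unit["pos"]
    if not use_lemma and not use_pos:
        # Fallback "morphological" attributes: last two and three letters.
        feats[name + "r"] = unit["form"][-2:]
        feats[name + "s"] = unit["form"][-3:]
    return feats


def bigram_features(xname, x, yname, y, use_lemma=True, use_pos=True):
    """Bigram templates XpYp, XlYl, XwYw, XpYl and XlYp for the pair XY."""
    fx = unigram_features(xname, x, use_lemma, use_pos)
    fy = unigram_features(yname, y, use_lemma, use_pos)
    feats = {}
    for a, b in (("p", "p"), ("l", "l"), ("w", "w"), ("p", "l"), ("l", "p")):
        kx, ky = xname + a, yname + b
        # A template is only instantiated when both attributes are available.
        if kx in fx and ky in fy:
            feats[kx + ky] = fx[kx] + "_" + fy[ky]
    return feats
```

In this sketch, disabling the lemma setting while keeping bigrams enabled leaves only the XpYp and XwYw templates for a pair, since every template involving a lemma has a missing attribute.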

Syntax-based Features
After integrating classical linguistic attributes, we investigated using more linguistically sophisticated features. First of all, syntactic structure is known to help MWE identification (Fazly et al., 2009; Seretan, 2011; Nagy T. and Vincze, 2014). We therefore inform the system with the provided syntactic dependencies when available: for each token Bn that both appears in the buffer and is a syntactic dependent of S0 with label l, we capture the existence of the dependency using the features RightDep(S0, Bn) = True and RightDepLabel(S0, Bn) = l. We also use the opposite features IsGovernedBy(S0, Bn) = True and IsGovernedByLabel(S0, G) = l when S0's syntactic governor G appears in the buffer. Other syntax-based features aim at modeling the direction and label of a syntactic relation between the two top elements of the stack (feature syntacticRelation(S0, S1) = ±l is used for S0 governing/governed by S1). All these syntactic features (cf. H in Table 1) try to capture syntactic regularities between the tokens composing a VMWE.
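A possible way to compute these syntax-based features is sketched below. It assumes that S0 and S1 are single tokens represented as dicts with CoNLL-U-like fields "id", "head" and "deprel"; the function name and the exact feature strings are hypothetical, and how merged units inherit a head is left out.

```python
def syntax_features(s0, s1, buffer):
    """Syntax-based features for the top stack elements and the buffer.

    s0, s1: token dicts with 'id', 'head', 'deprel' (s1 may be None).
    buffer: list of token dicts, B0 first.
    """
    feats = {}
    for n, tok in enumerate(buffer):
        if tok["head"] == s0["id"]:
            # Bn is a syntactic dependent of S0.
            feats["RightDep(S0,B%d)" % n] = True
            feats["RightDepLabel(S0,B%d)" % n] = tok["deprel"]
        if s0["head"] == tok["id"]:
            # The governor of S0 is still in the buffer.
            feats["IsGovernedBy(S0,B%d)" % n] = True
            feats["IsGovernedByLabel(S0,B%d)" % n] = s0["deprel"]
    if s1 is not None:
        if s1["head"] == s0["id"]:
            # S0 governs S1: positive label.
            feats["syntacticRelation(S0,S1)"] = "+" + s1["deprel"]
        elif s0["head"] == s1["id"]:
            # S0 is governed by S1: negative label.
            feats["syntacticRelation(S0,S1)"] = "-" + s0["deprel"]
    return feats
```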

History-based Features
We found that other traditional features of transition-based systems, such as the (local) transition history, were sometimes useful. We thus added features to represent the sequence of previous transitions (of length one, two or three, cf. settings B, C and D in Table 1).

Distance-based Features
Distance between sentence components is also known to help transition-based dependency parsing (Zhang and Nivre, 2011). We thus added the distance between S0 and B0 and the distance between S0 and S1 (cf. settings K and M in Table 1).

Dictionary-based Features
We also added features based on the VMWE dictionary automatically extracted from the training set. Such features inform the system when one of the focused elements (Si, Bj) is a component of a VMWE present in the dictionary (cf. L in Table 1).
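Extracting such a dictionary and turning it into membership features could look like the sketch below. The training-data layout (a list of lemmas plus sets of token positions per sentence), the function names and the feature names are assumptions made for illustration.

```python
def build_vmwe_lexicon(training_sentences):
    """Collect the lemma sequences of all annotated VMWEs.

    training_sentences: iterable of (lemmas, vmwes) pairs, where lemmas is a
    list of strings and vmwes is a list of sets of token positions.
    """
    lexicon = set()
    for lemmas, vmwes in training_sentences:
        for positions in vmwes:
            lexicon.add(tuple(lemmas[i] for i in sorted(positions)))
    return lexicon


def dictionary_features(focus, lexicon):
    """Flag each focused element whose lemma is a component of a known VMWE.

    focus: dict mapping an element name such as 'S0' or 'B0' to its lemma.
    """
    components = {lemma for entry in lexicon for lemma in entry}
    return {name + "InLexicon": lemma in components
            for name, lemma in focus.items()}
```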

Stack-length Features
Using the length of the stack as an additional feature (cf. O in Table 1) has also proven beneficial during our feature tuning.
Finally, it is worth noting that system settings (cf. Table 1) interact when used to generate the precise set of features. For instance, if lemma extraction is disabled (code G) while bigram extraction is enabled (code F), the produced features for e.g. the S1S0 bigram would not include the following features: S1lS0l, S1pS0l and S1lS0p.
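The history-, distance- and stack-length-based features, and the way setting codes switch them on, can be sketched as follows. The function name, the feature names and the use of token positions to measure distance are assumptions; in particular, how the position of a merged unit is defined is not specified in the text.

```python
def context_features(stack, buffer, history, settings):
    """History, distance and stack-length features driven by setting codes.

    stack, buffer: token positions (top of stack last, B0 first).
    history: names of the transitions applied so far, oldest first.
    settings: set of enabled codes from Table 1, e.g. {'C', 'K', 'O'}.
    """
    feats = {}
    # Settings B, C, D: transition history of length 1, 2 or 3.
    for code, length in (("B", 1), ("C", 2), ("D", 3)):
        if code in settings:
            feats["history%d" % length] = "_".join(history[-length:])
    # Setting K: distance between S0 and S1.
    if "K" in settings and len(stack) >= 2:
        feats["dist(S0,S1)"] = stack[-1] - stack[-2]
    # Setting M: distance between S0 and B0.
    if "M" in settings and stack and buffer:
        feats["dist(S0,B0)"] = buffer[0] - stack[-1]
    # Setting O: current stack length.
    if "O" in settings:
        feats["stackLength"] = len(stack)
    return feats
```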

Feature Tuning
We first divided the data sets into 3 groups, based on the availability of CoNLL-U files: (a) for BG, HE and LT only the VMWEs on tokenized sentences are available; (b) CS, ES, FA, MT and RO are accompanied by CoNLL-U files but without syntactic dependency annotations; and (c) the other languages are accompanied by a fully annotated CoNLL-U file. In the first tuning period, we tested the various configurations using three pilot languages (BG, CS, FR), each representing one group. In the final days of the experiments, testing was extended to all languages and systematic tuning was performed for every language. Table 2 summarizes the performance of the system over all the languages proposed by the shared task. Each row of the table displays the per-MWE and per-token F-scores on the test dataset for a given language (identified by its ISO 639-1 code), along with a 5-fold cross-validation (CV) per-MWE F-score on the training dataset. The system settings are represented as a sequence of codes described in Table 1.

Results
We can observe that results are very heterogeneous. For instance, five languages (CS, FA, FR, PL, RO) are above 0.70 per-MWE F-score in the case of cross-validation, while seven languages (DE, HE, HU, IT, LT, MT, SV) are below 0.30. In general, we can see an approximately linear correlation between the number of training VMWEs and the performance. This suggests that the training datasets are not large enough, since the performance has not yet converged. We note though that some languages like CS and TR reach relatively low scores given the size of their training data, which shows the high complexity of this task for these languages.
When comparing to the other shared task systems, we can observe that our system is the only one that handled all 18 languages, showing the robustness of our approach. Moreover, evaluation using per-MWE F-score (i.e. exact VMWE matching) ranks our system first on all languages but two (HU: 2nd, RO: 3rd), with an average difference of 6.73 points over the best other system for the same evaluation/language pair. Concerning per-token scores (which allow partial matches), results are relatively lower: our system is ranked first for 12 languages (out of 18), with a positive average difference of 1.84 points as compared with the best other system. Such encouraging results for per-MWE evaluations seem to show that our system is better at treating an MWE as a whole. Further error analysis is needed to explain this trait, and in particular to check the impact of the Merge transition, which transforms sequences of elements into one.
Table 2 (caption fragment): ... and delta columns display the difference in F-score (times 10^-2) between our system and the best other system of the shared task for the current evaluation/language configuration.
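The difference between exact (per-MWE) and partial (per-token) matching can be illustrated with a simplified sketch for a single sentence. This is not the official shared task scorer: the function names are hypothetical, and the per-token variant below simply compares the sets of tokens covered by any VMWE, which is an assumption that ignores how the official metric aligns predicted and gold expressions.

```python
def f_score(true_positives, n_predicted, n_gold):
    """Harmonic mean of precision and recall, 0.0 when undefined."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def per_mwe_f(gold, predicted):
    """Exact matching: a predicted VMWE counts only if all its tokens match."""
    gold_set = {frozenset(m) for m in gold}
    pred_set = {frozenset(m) for m in predicted}
    return f_score(len(gold_set & pred_set), len(pred_set), len(gold_set))


def per_token_f(gold, predicted):
    """Simplified partial matching over the tokens covered by VMWEs."""
    gold_tokens = {t for m in gold for t in m}
    pred_tokens = {t for m in predicted for t in m}
    return f_score(len(gold_tokens & pred_tokens),
                   len(pred_tokens), len(gold_tokens))
```

If the gold VMWE covers positions 0 and 2 and the system predicts positions 0, 1 and 2, the exact-match score is zero while the token-level score still rewards the two correct tokens.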

Related Work
Previous approaches for VMWE identification include the two-pass method of candidate extraction followed by binary classification (Fazly et al., 2009; Nagy T. and Vincze, 2014). VMWE identification has also been performed using sequence labeling approaches with an IOB scheme. For instance, Diab and Bhutada (2009) apply a sequential SVM to identify verb-noun idiomatic combinations in English. Such approaches were used for MWE identification in general (including verbal expressions), ranging from contiguous expressions (Blunsom and Baldwin, 2006) to gappy ones (Schneider et al., 2014).
A joint syntactic analysis and VMWE identification approach using off-the-shelf parsers is another interesting alternative that has been shown to help the identification of VMWEs such as light verb constructions (Eryigit et al., 2011; Vincze et al., 2013).

Conclusion and future work
This article presents a simple transition-based system devoted to VMWE identification. In particular, it offers a simple mechanism to handle discontinuity, since foreign elements are iteratively discarded from the stack, which is a crucial point for VMWEs. It also has the advantage of being robust, accurate and efficient (linear time complexity). As future work, we would like to apply more sophisticated syntax-based features, as well as more advanced machine-learning techniques like neural networks and word embeddings. We also believe that a dynamic oracle could improve results by dealing better with cases where the system is unsure.