Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions

This paper describes the PARSEME Shared Task 1.1 on automatic identification of verbal multiword expressions. We present the annotation methodology, focusing on changes from last year’s shared task. Novel aspects include enhanced annotation guidelines, additional annotated data for most languages, corpora for some new languages, and new evaluation settings. Corpora were created for 20 languages, which are also briefly discussed. We report organizational principles behind the shared task and the evaluation metrics employed for ranking. The 17 participating systems, their methods and obtained results are also presented and analysed.

This paper describes edition 1.1 of the PARSEME Shared Task, which builds on this momentum. We combined organizational experience from last year's task, a more polished version of the annotation methodology and an extended set of linguistic data, yielding an event that attracted 12 teams from 9 countries. Novel aspects in this year's task include additional annotated data for most of the languages, some new languages with annotated datasets and enhanced annotation guidelines.
The paper is structured as follows. First, related work is presented; then the annotation methodology is described, focusing on changes from last year's shared task. We have annotated corpora for 20 languages, which are briefly discussed. The main organizational principles behind the shared task and the evaluation metrics are reported next. Finally, participating systems are introduced and their results are discussed before we draw our conclusions.

Related Work
In the last few years, there have been several evaluation campaigns for MWE identification. First, the 2008 MWE workshop contained an MWE-targeted shared task. However, the goal of participants was to rank the provided MWE candidates instead of identifying them in raw texts. The recent DiMSUM 2016 shared task (Schneider et al., 2016) challenged participants to label English sentences in tweets, user reviews of services, and TED talks both with MWEs and supersenses for nouns and verbs. Last, the 1.0 edition of the PARSEME Shared Task in 2017 (Savary et al., 2017) provided annotated datasets for 18 languages, where the goal was to identify verbal MWEs in context. Our current shared task is similar in vein to the previous edition. However, the annotation methodology has been enhanced (see Section 3) and the set of languages covered has also changed.
Rosén et al. (2015) report on a survey of MWE annotation in 17 treebanks for 15 languages, collaboratively documented according to common guidelines. They highlight the heterogeneity of MWE annotation practices. Similar conclusions have been drawn for Universal Dependencies (McDonald et al., 2013). In light of these conclusions, we intended to provide unified guidelines for all the participating languages, in order to avoid heterogeneous, hence incomparable, datasets.

Enhanced Annotation Methodology
The first PARSEME annotation campaign (Savary et al., forthcoming) generated rich feedback from annotators and language team leaders. It also attracted the interest of new teams, working on languages not covered by the previous version of the PARSEME corpora. About 80 issues were raised and discussed among dozens of contributors. This boosted our efforts towards a better understanding of VMWE-related phenomena, and towards a better synergy of terminologies across languages and linguistic traditions. The annotation guidelines were gradually enhanced, so as to achieve more clear-cut distinctions among categories, and to make the decision process easier and more reliable. As a result, we expected higher-quality annotated corpora and better VMWE identification systems learned on them.

Definitions
We maintain all major definitions (unified across languages) introduced in edition 1.0 of the annotation campaign (Savary et al., forthcoming, Sec. 2). In particular, we understand multiword expressions as expressions with at least two lexicalized components (i.e. always realised by the same lexemes), including a head word and at least one other syntactically related word. Thus, lexicalized components of MWEs must form a connected dependency graph. Such expressions must display some degree of lexical, morphological, syntactic and/or semantic idiosyncrasy, formalised by the annotation procedures.
As previously, syntactic variants of MWE candidates are normalised to their least marked form maintaining the idiomatic reading (called the canonical form) before being submitted to linguistic tests. A verbal MWE is defined as an MWE whose head in a canonical form is a verb, and which functions as a verbal phrase, unlike e.g. FR peut-être 'may-be'⇒'maybe' (which is always an adverbial). As in edition 1.0, we account for single-token VMWEs with multiword variants, e.g. ES hacerse 'make-self'⇒'become' vs. se hace 'self makes'⇒'becomes'.

Typology
Major changes in the annotation guidelines between editions 1.0 and 1.1 include redesigning the VMWE typology, which is now defined as follows:
1. Two universal categories, that is, valid for all languages participating in the task:
   (a) LIGHT VERB CONSTRUCTIONS (LVC), divided into two subcategories:
       i. LVCs in which the verb is semantically totally bleached (LVC.full), e.g. DE eine Rede halten 'hold a speech'⇒'give a speech',
       ii. LVCs in which the verb adds a causative meaning to the noun (LVC.cause), e.g. PL narazić na straty 'expose to losses'
   (b) VERBAL IDIOMS (VID), grouping all VMWEs not belonging to other categories, and most often having a relatively high degree of semantic non-compositionality, e.g. LT našta gula ant savivaldybių pečių 'the burden lies on the shoulders of the municipality'⇒'the municipality is in charge of the burden'
2. Three quasi-universal categories, valid for some language groups or languages, but not all:
   (a) INHERENTLY REFLEXIVE VERBS (IRV), pervasive in Romance and Slavic languages, and present in Hungarian and German, in which the reflexive clitic (REFL) either always co-occurs with a given verb, or markedly changes its meaning or subcategorisation frame, e.g. PT se formar 'REFL form'⇒'graduate'
   (b) VERB-PARTICLE CONSTRUCTIONS (VPC), pervasive in Germanic languages and Hungarian, rare in Romance and absent in Slavic languages, with two subcategories:
       i. fully non-compositional VPCs (VPC.full), in which the particle totally changes the meaning of the verb, e.g. HU berúg 'in-kick'⇒'get drunk'
       ii. semi non-compositional VPCs (VPC.semi), in which the particle adds a partly predictable but non-spatial meaning to the verb, e.g. EN wake up
   (c) MULTI-VERB CONSTRUCTIONS (MVC), close to semantically non-compositional serial verbs in Asian languages like Chinese, Hindi, Indonesian and Japanese (but also attested in Spanish), e.g. HI kar le 'do take'⇒'do (for one's own benefit)', kar de 'do give'⇒'do (for other's benefit)'
3. One language-specific category, introduced for Italian:
   (a) INHERENTLY CLITIC VERBS (LS.ICV), in which at least one non-reflexive clitic (CLI) either always accompanies a given verb or markedly changes its meaning or its subcategorisation frame, e.g. IT prenderle 'take-them'⇒'get beaten up'
4. One optional experimental category, to be considered in the post-annotation step:
   (a) INHERENTLY ADPOSITIONAL VERBS (IAV), which include idiomatic combinations of verbs with prepositions or post-positions, depending on the language, e.g. HR ne dođe do usporavanja 'it will not come to delay'⇒'no delay will occur'

Decision tree for annotation
Edition 1.0 featured a two-stage annotation process, according to which VMWEs were supposed to be first identified in a category-neutral fashion, then classified into one of the VMWE categories. Since the annotation practice showed that VMWE identification is virtually always done in a category-specific way, for this year's task we constructed a unified decision tree, shown in Fig. 1.

Annotation process and decision tree
We propose the following methodology for VMWE annotation:
Step 1: identify a candidate, that is, a combination of a verb with at least one other word which could form a VMWE. If the candidate has the structure of a meaning-preserving variant, the following steps apply to its canonical form. This step is largely based on the annotators' linguistic knowledge and intuition after reading this guide.
Step 2: determine which components of the candidate (or of its canonical form) are lexicalized, that is, if they are omitted, the VMWE no longer occurs. Corpus and web searches may be required to confirm intuitions about acceptable variants.
Step 3: depending on the syntactic structure of the candidate's canonical form, formally check if it is a VMWE using the generic and category-specific decision trees and tests below. Notice that your intuitions used in Step 1 to identify a given candidate are not sufficient to annotate it: you must confirm them by applying the tests in the guidelines.
Step 4 (experimental and optional): if your language team chose to experimentally annotate the IAV category, follow the dedicated inherently adpositional verb (IAV) tests. These tests should always be applied once the 3 previous steps are complete, i.e. the IAV overlays the universal annotation.
The decision tree below indicates the order in which tests should be applied in Step 3. The decision trees are a useful summary to consult during annotation, but contain very short descriptions of the tests. Each test is detailed and explained with examples in the following sections.

Generic decision tree
If you are annotating Italian or Hindi, go to the Italian-specific decision tree or Hindi-specific decision tree. For all other languages follow the tree below.

Consistency checks
Due to manpower constraints, we could not perform double annotation followed by adjudication. For most languages, only small fractions of the corresponding corpus were double-annotated (Sec. 4.2). Therefore, in order to increase the consistency of the annotations, we applied the consistency checking tool developed for edition 1.0 (Savary et al., forthcoming, Sec. 5.4). The tool provides an "orthogonal" view of the corpus, where all annotations of the same VMWE are grouped and can be corrected interactively. Previous experience showed that the use of this tool greatly reduced noise and silence errors. This year, almost all language teams completed the consistency check phase (with the exception of Arabic).

Corpora
For edition 1.1, we prepared annotated corpora for 20 languages divided into four groups:
• Germanic languages: German (DE), English (EN)
• Romance languages: Spanish (ES), French (FR), Italian (IT), Portuguese (PT), Romanian (RO)
• Balto-Slavic languages: Bulgarian (BG), Croatian (HR), Lithuanian (LT), Polish (PL), Slovene (SL)
• Other languages: Arabic (AR), Greek (EL), Basque (EU), Farsi (FA), Hebrew (HE), Hindi (HI), Hungarian (HU), Turkish (TR)
Arabic, Basque, Croatian, English and Hindi were additional languages, compared to the first edition of the shared task. However, the Czech, Maltese and Swedish corpora were not updated and hence were not included in edition 1.1 of the shared task.
The Basque corpus comprises texts from the whole UD corpus (Aranzabe et al., 2015) and part of the Elhuyar Web Corpora. The Bulgarian corpus comprises news articles from the Bulgarian National Corpus (Koeva et al., 2012). The Croatian corpus contains sentences from the Croatian version of the SETimes corpora: mostly running text but also selected fragments, such as introductory blurbs and image descriptions characteristic of newswire text. The English corpus consists of 7,437 sentences taken from three UD treebanks: the Gold Standard Universal Dependencies Corpus for English, the LinES parallel corpus and the Parallel Universal Dependencies treebank. The Farsi corpus is built on top of the MULTEXT-East corpora (QasemiZadeh and Rahimi, 2006), and VMWE annotations are added to a portion of Orwell's 1984 novel. The French corpus contains the Sequoia corpus (Candito and Seddah, 2012) converted to UD, the GSD French UD treebank, the French part of the Partut corpus, and part of the Parallel UD (PUD) corpus. The German corpus contains shuffled sentences crawled from online news, reviews and wikis, derived from the WMT16 shared task data (Bojar et al., 2016), and Universal Dependencies v2.0. The Greek corpus comprises Wikipedia articles and newswire texts from various on-line newspaper editions and news portals. The Hebrew corpus contains news and articles from the Arutz 7 and HaAretz news websites, collected by the MILA Knowledge Center for Processing Hebrew. The Hindi corpus consists of news-genre sentences selected from the test section of the Hindi Treebank (Bhat et al., 2015). The Hungarian corpus contains legal texts from the Szeged Treebank (Csendes et al., 2005). The Italian corpus is a selection of texts from the PAISÁ corpus of web texts (Lyding et al., 2014), including Wikibooks, Wikinews, Wikiversity, and blog services. The Lithuanian corpus contains articles from the Lithuanian news portal DELFI. The Polish corpus builds on top of the National Corpus of Polish (Przepiórkowski et al., 2011) and the Polish Coreference Corpus (Ogrodniczuk et al., 2015). These are balanced corpora, from which we selected mainly daily and periodical press extracts. The Portuguese corpus contains sentences from the informal Brazilian newspaper Diário Gaúcho and from the training set of the UD_Portuguese-GSD v2.1 treebank. The Romanian corpus is a collection of articles from the concatenated editions of the Agenda newspaper. The Slovenian corpus contains parts of the ssj500k 2.0 training corpus (Krek et al., 2017), which consists of sampled paragraphs from the Slovenian reference FidaPLUS corpus (Arhar Holdt et al., 2007), including literary novels, daily newspapers, web blogs and social media. The Spanish corpus consists of newspaper texts from the Ancora corpus (Taulé et al., 2016), the UD version of Ancora, a corpus compiled by the IXA group at the University of the Basque Country, and parts of the training set of the UD v2.0 treebank. The Turkish corpus consists of 18,611 sentences of newswire texts in several genres.
As shown in Table 2, most languages provided corpora containing several thousand VMWEs, totalling 79,326 VMWEs across all languages. The smallest corpus is in English, containing 7,437 sentences and 832 VMWEs, and the largest one is in Hungarian, with 7,760 VMWEs. All corpora, except the Arabic one, are available under different flavours of the Creative Commons license.

Format
Edition 1.1 of the shared task saw a major evolution of the data format, motivated by a quest for synergies between PARSEME (Savary et al., forthcoming) and Universal Dependencies (Nivre et al., 2016), two complementary multilingual initiatives aiming at unified terminologies and methodologies. The new format, called cupt, combines in one file the conllu format and the parsemetsv format. As seen in Fig. 2, each token in a sentence is now represented by 11 columns: the 10 columns compatible with the conllu specification (notably: rank, token, lemma, part-of-speech, morphological features, and syntactic dependencies), and the 11th column containing the VMWE annotations, according to the same conventions as parsemetsv but with the updated set of categories (cf. Sec. 3.2). Note the presence of an IRV (tokens 2-3) embedded in a VID (tokens 2-5). The underscore '_', when it occurs alone in a field, is reserved for underspecified annotations. It can be used in incomplete annotations or in blind versions of the annotated files. The star '*', when it occurs alone in a field, is reserved for empty annotations, which are different from underspecified ones. This concerns sporadic annotations, typical for VMWEs (where not necessarily all words receive an annotation, as opposed to e.g. part-of-speech tags).
Besides adding a new column to conllu, cupt also introduces additional conventions concerning comments (lines starting with '#'). The first line of each file must indicate the ordered list of columns (with standardized names) that this file contains, i.e. the same format can be used for any subset of standard columns, in any order. Each sentence is then preceded by the identifier of the source sentence (source_sent_id), which consists of three fields: (i) the persistent URI of the original corpus (e.g. of a UD treebank), (ii) the path of the source file in the original corpus, (iii) the sentence identifier, unique within the whole corpus. Items (i) and (ii) contain '.' if there is no external source corpus, as in the example of Figure 2. The following comment line contains the text of the current sentence. Validation scripts and converters were developed for cupt, and published before the shared task.
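The column layout described above can be illustrated with a minimal reading sketch. Several details are assumptions rather than facts stated here: that columns are tab-separated, that the column declaration is a comment of the form `# global.columns = ...`, that the 11th column is named `PARSEME:MWE`, and that its codes look like `1:VID` on the first token of an expression and `1` on its later tokens, with `;` separating overlapping expressions. The sentence itself is a placeholder, not the example of Fig. 2.

```python
from collections import defaultdict


def row(i, mwe):
    # Hypothetical token line: only ID, FORM, LEMMA and the MWE column are filled in.
    return "\t".join([str(i), f"w{i}", f"w{i}"] + ["_"] * 7 + [mwe])


# Illustrative sentence: an IRV (tokens 2-3) embedded in a VID (tokens 2-5).
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE",
    "# source_sent_id = . . demo-1",
    "# text = w1 w2 w3 w4 w5",
    row(1, "*"),
    row(2, "1:VID;2:IRV"),
    row(3, "1;2"),
    row(4, "1"),
    row(5, "1"),
])


def parse_cupt_sentence(text):
    """Return {mwe_id: {"category": str, "tokens": [int, ...]}} for one sentence."""
    columns = None
    mwes = defaultdict(lambda: {"category": None, "tokens": []})
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # The declared column order drives how token lines are read.
            columns = line.split("=", 1)[1].split()
            continue
        if not line.strip() or line.startswith("#"):
            continue
        if columns is None:
            raise ValueError("missing column declaration line")
        fields = dict(zip(columns, line.split("\t")))
        code = fields["PARSEME:MWE"]
        if code in ("*", "_"):  # empty or underspecified annotation
            continue
        for part in code.split(";"):
            mwe_id, _, category = part.partition(":")
            # Assumes plain integer token ranks (no ranges such as "1-2").
            mwes[mwe_id]["tokens"].append(int(fields["ID"]))
            if category:  # assumed to appear only where the expression starts
                mwes[mwe_id]["category"] = category
    return dict(mwes)
```

Reading the column order from the declaration line, instead of hard-coding positions, mirrors the statement that the same format can carry any subset of columns in any order.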

Inter-Annotator Agreement
Contrary to standard practice in corpus annotation, most corpora were not double-annotated due to lack of human resources. Nonetheless, each language team double-annotated a sample containing at least 100 annotated VMWEs. The number of sentences (S) and the number of VMWEs annotated by the first (A_1) and by the second annotator (A_2) are shown in Table 1. The last three columns report two measures to assess span agreement (tokens belonging to a VMWE) and one measure to assess the agreement on VMWE categories (Sec. 3.2). The F_span score is the MWE-based F-measure when considering that one of the annotators tries to predict the other one's annotations. This is identical to the F1-MWE score used to evaluate participating systems (Sec. 6). F_span is an optimistic estimator which ignores chance agreement. On the other hand, κ_span and κ_cat estimate to what extent the observed agreement P_O exceeds the expected agreement P_E, that is, κ = (P_O − P_E) / (1 − P_E).
Observed and expected agreement for κ_span are based on the number of verbs V in the sample, assuming that a simplification of the task consists of deciding whether each verb belongs to a VMWE or not. If annotators perfectly agree on A_{1=2} annotated VMWEs, then we estimate that they agree on […]. As for κ_cat, we consider only the A_{1=2} VMWEs on which both annotators agree on the span, and calculate P_O and P_E based on the proportion of times both annotators agree on the VMWE's category label.
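As a rough illustration of the two kinds of agreement measure, the sketch below computes a span F-measure from the counts A_1, A_2 and A_{1=2}, and a chance-corrected κ from an observed and an expected agreement. The function names are hypothetical, the κ formula is the standard Cohen-style definition, and the way P_O and P_E are estimated from the number of verbs is deliberately not reproduced, since it is not fully recoverable from the text.

```python
def f_span(a1, a2, a_both):
    """F-measure when annotator 2's VMWEs are scored against annotator 1's.

    a1, a2: number of VMWEs annotated by each annotator.
    a_both: number of VMWEs on which both annotators agree exactly.
    """
    if a1 == 0 or a2 == 0 or a_both == 0:
        return 0.0
    precision = a_both / a2
    recall = a_both / a1
    return 2 * precision * recall / (precision + recall)


def kappa(p_observed, p_expected):
    """Chance-corrected agreement: (P_O - P_E) / (1 - P_E)."""
    if p_expected == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)
```

The F-measure is symmetric in the two annotators, so it does not matter which one is treated as the reference. Unlike κ, it does not discount agreement that would arise by chance.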
Inter-annotator agreement scores can give an idea of the quality of the guidelines and of the training procedures for annotators. We observe a high variability among languages, especially for determining the span of VMWEs, with κ_span ranging from 0.227 for Spanish to 0.984 for Turkish. Macro-averaged κ_span is 0.691, which is higher than the macro-averaged κ_unit of 0.58 reported in 2017 (Savary et al., 2017). Categorization agreement results are much more homogeneous, with a macro-averaged κ_cat of 0.836, which is also slightly higher than the 0.819 obtained in 2017.
The variable agreement values observed could be explained by language and corpus characteristics (e.g. web texts are harder to annotate than newspapers). They could also be explained by the fact that the double-annotated samples are quite small. Finally, they could indicate that the guidelines are still vague and that annotators do not always receive appropriate training. In reality, a mixture of all these factors probably explains the low agreement observed for some languages. In short, Table 1 strongly suggests that there is still room for improvement in (a) guidelines, (b) annotator training, and (c) annotation team management, best practices, and methodology. It should also be noted that lower agreement values may correlate with the results obtained by participants: the lower the IAA for a given language (i.e. the more difficult the task is for humans), the lower the results of automatic MWE identification. Nevertheless, we believe that the systematic use of our in-house consistency checking tool helped homogenize some of these annotation disagreements (Sec. 3.4).

Shared Task Organization
Each language in the shared task was handled by a team that was responsible for the choice of subcorpora and for the annotation of VMWEs, in a similar setting as in the previous edition. For each language, we then split its corpus into training, test and development sets (train/test/dev), as follows:
• If the corpus has fewer than 550 VMWEs: Take sentences containing 90% of the VMWEs as test, and the other 10% as a small training corpus.
• If the corpus has between 550 and 1,500 VMWEs: Take sentences containing 500 VMWEs as test, and take the rest for training.
• If the corpus has between 1,500 and 5,000 VMWEs: Take sentences containing 500 VMWEs as test, take sentences containing 500 VMWEs as dev, and take the rest for training.
• If the corpus has more than 5,000 VMWEs: Take sentences containing 10% of the VMWEs as test, take sentences containing 10% of the VMWEs as dev, and take the remaining 80% for training.
As in edition 1.0, participants could submit their systems to two tracks: open and closed. Systems in the closed track were only allowed to train their models on the train and dev files provided.
In this edition, we distinguished sentences based on their origin, so as to make sure that the fraction of each sub-corpus is the same in all splits for each language. For example, around 59% of all Basque sentences came from UD, while the other 41% came from the Elhuyar sub-corpus. We made sure that similar percentages also applied to train/test/dev when taken in isolation. Due to this balancing act, for most languages, we could not keep the VMWEs in the same split as in edition 1.0.
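The size-dependent split rules can be summarised as a small function that returns target VMWE counts per split. This is only a sketch: the function name is hypothetical, the handling of the exact boundaries (1,500 and 5,000 are assigned to the lower band) and the floor rounding of percentages are assumptions, and the balancing of sub-corpora across splits is not modelled.

```python
def split_plan(n_vmwes):
    """Return target VMWE counts for train, dev and test given corpus size."""
    if n_vmwes < 550:
        # Very small corpus: 90% test, the remainder as a small training set.
        test = (9 * n_vmwes) // 10
        return {"train": n_vmwes - test, "dev": 0, "test": test}
    if n_vmwes <= 1500:  # boundary handling is an assumption
        return {"train": n_vmwes - 500, "dev": 0, "test": 500}
    if n_vmwes <= 5000:  # boundary handling is an assumption
        return {"train": n_vmwes - 1000, "dev": 500, "test": 500}
    # Large corpus: 10% test, 10% dev, remaining 80% train.
    tenth = n_vmwes // 10
    return {"train": n_vmwes - 2 * tenth, "dev": tenth, "test": tenth}
```

The counts are targets in terms of VMWEs, whereas the actual unit being assigned to a split is the sentence, so real splits can only approximate them.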

Evaluation Measures
The goal of the evaluation measures is to represent the quality of system predictions when compared to the human-annotated gold standard for a given language. As in edition 1.0, we define two types of evaluation measures: a strict per-VMWE score (in which each VMWE in gold is either deemed predicted or not, in a binary fashion); and a fuzzy per-token score (which takes partial matches into account). For each of these two, we can calculate precision (P), recall (R) and F1-scores (F).
Orthogonally to the type of measure, there is the choice of what subset of VMWEs to take into account from gold and system predictions. As in the previous edition, we calculate a general category-agnostic measure (both per-VMWE and per-token) based on the totality of VMWEs in both gold and system predictions; this measure only considers whether each VMWE has been properly predicted, regardless of category. We also calculate category-specific measures (both per-VMWE and per-token), where we consider only the subset of VMWEs associated with a given category.
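The contrast between the strict and the fuzzy score can be sketched as follows. Each VMWE is represented here as a frozenset of (sentence id, token rank) pairs, which is an assumed representation. The per-VMWE score requires an exact match of the whole token set. The per-token function is a deliberate simplification that only compares the sets of tokens covered by any VMWE; it is not claimed to be the official per-token scorer.

```python
def prf(n_correct, n_pred, n_gold):
    """Precision, recall and F1 from raw counts (0.0 where undefined)."""
    p = n_correct / n_pred if n_pred else 0.0
    r = n_correct / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def per_vmwe_scores(gold, pred):
    """Strict score: a predicted VMWE counts only if its token set matches exactly."""
    gold, pred = set(gold), set(pred)
    return prf(len(gold & pred), len(pred), len(gold))


def per_token_scores(gold, pred):
    """Simplified fuzzy score: compare the tokens covered by any VMWE."""
    gold_tokens = set().union(*gold)
    pred_tokens = set().union(*pred)
    return prf(len(gold_tokens & pred_tokens), len(pred_tokens), len(gold_tokens))


# Hypothetical sentence 1: the second predicted VMWE misses one token.
GOLD = [frozenset({(1, 2), (1, 3)}), frozenset({(1, 5), (1, 6), (1, 7)})]
PRED = [frozenset({(1, 2), (1, 3)}), frozenset({(1, 5), (1, 6)})]
```

On this example the partially matched expression is a complete miss for the strict score but still earns credit under the token-level view. Category-specific variants could be obtained by filtering both lists to one category before scoring.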
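We additionally consider the following phenomenon-specific measures, which focus on some of the challenging phenomena specifically relevant to MWEs (Constant et al., 2017):
• MWE continuity: We calculate per-VMWE scores for two different subsets: continuous VMWEs, e.g. TR istifa edecek 'resignation will-do'⇒'he/she will resign', and discontinuous VMWEs, e.g. SL imajo investicijske načrte 'they-have investment plans'⇒'they have investment plans'.
• MWE length: We calculate per-VMWE scores for two different subsets: single-token VMWEs, e.g. DE anfangen 'at-catch'⇒'begin', ES abstenerse 'abstain-REFL'⇒'abstain', and multi-token VMWEs, e.g. FA 'eye throw'⇒'to look at'.
• MWE novelty: We calculate per-VMWE scores for two subsets: seen and unseen VMWEs. We consider a VMWE in the (gold or prediction) test corpus as seen if a VMWE with the same multiset of lemmas is annotated at least once in the training corpus. Other VMWEs are deemed unseen. For instance, given the occurrence of EN has a new look in the training corpus, the occurrences of EN had a look of innocence and of EN having a look at this report in the test corpus would be considered seen and unseen, respectively.
• MWE variability: We calculate per-VMWE scores for the subset of VMWEs that are variants of VMWEs from the training corpus. A VMWE is considered a variant if: (i) it is deemed a seen VMWE, as defined above, and (ii) it is not identical to another VMWE, i.e. the training corpus does not contain the sequence of surface-form tokens as seen in this VMWE (including non-lexicalized components in between, in the case of discontinuous VMWEs). E.g., BG накриво ли беше стъпил is a variant of стъпя накриво 'to step to the side'⇒'to lose (one's) footing'.
Systems may predict VMWEs for all languages in the shared task, and the aforementioned measures are independently calculated for each language. Additionally, we calculate a macro-average score based on all of the predictions. In this case, the precision P for a given measure (e.g. for continuous VMWEs) is the average of the precisions for all 19 languages. Arabic is not considered due to delays in the corpus release. Missing system predictions are assumed to have P = R = 0. The recall R is averaged in the same manner, and the average F score is calculated from these averaged P and R scores.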
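The macro-averaging rule can be made concrete with a short sketch: precision and recall are averaged over the full list of languages, languages without predictions contribute P = R = 0, and F is derived from the two averages instead of being averaged itself. The function name and the example scores are hypothetical.

```python
def macro_average(scores_by_language, languages):
    """Macro-averaged (P, R, F) over a fixed list of languages.

    scores_by_language maps a language code to a (P, R) pair; languages
    for which a system made no predictions count as P = R = 0.
    """
    n = len(languages)
    p = sum(scores_by_language.get(lang, (0.0, 0.0))[0] for lang in languages) / n
    r = sum(scores_by_language.get(lang, (0.0, 0.0))[1] for lang in languages) / n
    # F is computed from the averaged P and R, not averaged per language.
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical system that covered two of three languages.
EXAMPLE = macro_average({"DE": (0.6, 0.3), "EN": (0.9, 0.6)}, ["DE", "EN", "FR"])
```

Counting missing languages as zeros means that a system covering few languages is penalised in the macro-average even if it does well on the languages it covers.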

System Results
For the 2018 edition of the PARSEME Shared Task, 12 teams submitted 17 system results: 13 to the closed track and 4 to the open track. No team submitted system results for all 20 languages of the shared task, but 11 teams covered 19 languages (all except Arabic). Detailed result tables are reported on the shared task website. In the tables, systems are referred to by anonymous nicknames. System authors and their affiliations are available in the system description papers published in these proceedings.
As for the best performing systems, TRAPACC and TRAVERSAL were ranked first for 8 languages and 7 languages, respectively. TRAVERSAL is more effective in Slavic and Romance languages, whereas TRAPACC works well for German and English. In the "Other" language group, GDB-NER achieved the best results for Farsi and Turkish, and CRF approaches proved to be the best for Hindi. The best results for Bulgarian were obtained by varIDE, based on a Naive Bayes classifier.
Results per language show that Hungarian and Romanian were the "easiest" languages for the systems, with best MWE-based F-scores of 90.31 and 85.28, respectively. Hebrew, English and Lithuanian show the lowest MWE-based F-scores, not exceeding 23.28, 32.88 and 32.17, respectively. This is likely due to the amount of annotated training data: Hungarian had the highest, while English and Lithuanian had the lowest, number of VMWEs in the training data. A notable exception to this tendency is Hindi, where good results (an F-score of 72.98) could be achieved building on a small amount of training data. This is probably due to the high number of multi-verb constructions (MVCs) in Hindi, which are usually formed by a sequence of two verbs, hence relatively easily identified by relying on POS tags.
Table 12 shows the effectiveness of MWE identification with regard to MWE categories. The highest F-scores were achieved for IRVs (especially for Balto-Slavic languages). This might be due to the fact that IRVs tend to be continuous and must contain a reflexive pronoun/clitic; therefore, the presence of such a pronoun in the immediate neighborhood of a verb is a strong predictor for IRVs. The LVC.full category is present in all languages. Interestingly, LVCs of this category are most effectively identified in the "Other" language group. Idioms occur in the test corpora of almost all languages (except Farsi), and they can be identified to the greatest extent in Romance languages. VPCs seem to be the easiest to find in Hungarian.
With regard to phenomenon-specific macro-average results (Tables 4 to 11), let us have a closer look at the F1-MWE measure of the 11 systems which submitted results for all 19 languages, except MWE-TreeC (whose results are hard to interpret). The differences are: (i) from 13 to 28 points (17 points on average) for continuous vs. discontinuous VMWEs, (ii) from 14 to 43 points (27 points on average) for multi-token vs. single-token VMWEs, (iii) from 45 to 56 points (50 points on average) for seen-in-train vs. unseen-in-train VMWEs, and (iv) from 13 to 27 points (20 points on average) for identical-to-train vs. variant-of-train VMWEs. These results confirm that the phenomena they focus on are major challenges in the VMWE identification task, and we suggest that the corresponding measures should be systematically used for future evaluation. The hardest challenge is identifying unseen-in-train VMWEs. This result is not a surprise since MWE-hood is, by nature, a lexical phenomenon, that is, a particular idiomatic reading is available only in the presence of a combination of particular lexical units. Replacing one of them by a semantically close lexeme usually leads to the loss of the idiomatic reading, e.g. force one's hand 'compel someone to act against her will' is an idiom, while force one's arm can only be understood literally. Few other, non-lexical, hints are given to distinguish a particular VMWE occurrence from a literal expression, because a VMWE usually takes syntactically regular forms. Morphosyntactic idiosyncrasy (e.g. the fact that a given VMWE allows some regular syntactic transformations and blocks others) is a property of types rather than tokens. We expect, therefore, satisfactory unseen-in-train VMWE identification results mostly from systems using large-scale VMWE lexicons or semi/unsupervised methods and very large corpora.

Conclusions and Future Work
We reported on edition 1.1 of the PARSEME Shared Task aiming at identifying verbal MWEs in texts in 20 languages. We described our corpus annotation methodology, the data provided to the participants, the shared task modalities and evaluation measures. The official results of the shared task were also presented and briefly discussed. The outputs of individual systems should be compared more thoroughly in the future, so as to see how systems with different architectures cope with different phenomena. For instance, it would be interesting to check if, as expected, discontinuous VMWEs are handled better by parsing-based methods vs. sequential taggers, or by LSTMs vs. other neural network architectures.
Compared to the first edition in 2017, we attracted a larger number of participants (17 vs. 7), with 11 of the submissions covering 19 languages. We expect that this growing interest in modeling and computational treatment of verbal MWEs will motivate teams working on corpus annotation, especially from new language families, to join the initiative. We expect to maintain and continuously increase the quality and the size of the existing annotated corpora. For instance, we have identified weaknesses in the guidelines for MVCs that will require enhancements. Furthermore, we need to collect feedback about the IAV experimental category, and decide whether to consolidate its annotation guidelines.
Our ambitious goal for a future shared task is to extend annotation to other MWE categories, not only verbal ones. We are aware of corpora and guidelines for individual languages (e.g. English or French) and/or MWE categories (e.g. noun-noun compounds). However, a considerable effort will be required to design and apply universal annotation guidelines for the annotation of new MWE categories. We strongly believe that the large community and collective expertise gathered in the PARSEME initiative will allow us to take on this challenge. We hope that this initiative will continue in the coming years, yielding available multilingual annotated corpora that can foster MWE research in computational linguistics, as well as in linguistics and translation studies.

Appendix B: Shared task results

Figure 1: Decision tree for joint VMWE identification and classification.
Note that the first 4 tests are structural. They first hypothesize as VIDs those candidates which: (S.1) do not have a unique verb as head, e.g. HE britanya nas'a ve-natna 'im micrayim 'Britain carried and gave with Egypt'⇒'Britain negotiated with Egypt', (S.2) have more than one lexicalized dependent of the head verb, e.g. EL ρίχνω λάδι στη φωτιά 'pour oil to-the fire'⇒'make a bad or negative situation feel worse', (S.3) have a lexicalized subject, e.g. EU deabruak eraman 'devil-the.ERG take'⇒'be taken by the devil, go to hell'. The remaining candidates, i.e. those having exactly one head verb and one lexicalized non-subject dependent, trigger category-specific tests depending on the part-of-speech of this dependent (S.4).
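The structural part of the decision tree (S.1 to S.4) can be sketched as a routing function that returns which family of tests to apply next. This is an illustration under assumptions, not the guidelines themselves: the function and argument names are hypothetical, the link between a reflexive clitic dependent and the IRV tests is inferred from the typology, and branches for dependent types not mentioned in this text are left unhandled (the function returns None for them).

```python
def route_candidate(n_head_verbs, n_lexicalized_dependents,
                    has_lexicalized_subject, dependent_kind=None):
    """Return the name of the test family to apply to a VMWE candidate."""
    if n_head_verbs != 1:               # S.1: no unique verb as head
        return "VID"
    if n_lexicalized_dependents > 1:    # S.2: several lexicalized dependents
        return "VID"
    if has_lexicalized_subject:         # S.3: lexicalized subject
        return "VID"
    # S.4: one head verb and one lexicalized non-subject dependent, so the
    # choice of tests depends on the kind of that dependent.
    by_dependent = {
        "reflexive_clitic": "IRV",  # assumed mapping, see lead-in
        "particle": "VPC",
        "verb": "MVC",
    }
    return by_dependent.get(dependent_kind)
```

The returned label only selects which tests to run; whether the candidate is finally annotated, and with which subcategory, depends on the outcome of those tests.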

  ↳ YES ⇒ Annotate as a VMWE of category IRV
  ↳ NO ⇒ It is not a VMWE, exit
↳ Particle ⇒ Apply VPC-specific tests ⇒ VPC tests positive?
  ↳ YES ⇒ Annotate as a VMWE of category VPC.full or VPC.semi
  ↳ NO ⇒ It is not a VMWE, exit
↳ Verb with no lexicalized dependent ⇒ Apply MVC-specific tests ⇒ MVC tests positive?
  ↳ YES ⇒ Annotate as a VMWE of category MVC
  ↳ NO ⇒ Apply the VID-specific tests ⇒ VID tests positive?
The conllu and parsemetsv formats were both used in the previous edition of this shared task.

Table 1: Per-language inter-annotator agreement on a sample of S sentences, with A_1 and A_2 VMWEs annotated by each annotator. F_span is the F-measure between annotators, κ_span is the agreement on the annotation span and κ_cat is the agreement on the VMWE category. EL, EN and HI provided corpora annotated by more than 2 annotators. We report the highest scores among all possible annotator pairs.

The third agreement measure concerns VMWE categories (Sec. 3.2). The F_span score is the MWE-based F-measure when considering that one of the annotators tries to predict the other one's annotations. This is identical to the F1-MWE score used to evaluate participating systems (Sec. 6). F_span is an optimistic estimator which ignores chance agreement. On the other hand, κ_span and κ_cat estimate to what extent the observed agreement P_O exceeds the expected agreement P_E.

• MWE length: We calculate per-VMWE scores for two different subsets: single-token VMWEs, e.g. DE anfangen 'at-catch'⇒'begin', ES abstenerse 'abstain-REFL'⇒'abstain', and multi-token VMWEs, e.g. FA 'eye throw'⇒'to look at'.
• MWE novelty: We calculate per-VMWE scores for two subsets: seen and unseen VMWEs. We consider a VMWE in the (gold or prediction) test corpus as seen if a VMWE with the same multiset of lemmas is annotated at least once in the training corpus. Other VMWEs are deemed unseen. For instance, given the occurrence of EN has a new look in the training corpus, the occurrences of EN had a look of innocence and of EN having a look at this report in the test corpus would be considered seen and unseen, respectively.
• MWE variability: We calculate per-VMWE scores for the subset of VMWEs that are variants of VMWEs from the training corpus. A VMWE is considered a variant if it is deemed a seen VMWE, as defined above, and its sequence of surface-form tokens does not occur in the training corpus.
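The seen/unseen and variant definitions can be sketched as a small classifier. The representation is an assumption: each VMWE is a pair of (lemmas of its lexicalized components, surface tokens it spans, gaps included), and the surface comparison is made only against annotated training VMWEs sharing the same lemmas. The function name and the example expressions are hypothetical.

```python
def novelty_class(test_vmwe, train_vmwes):
    """Classify a test VMWE as 'unseen', 'identical' or 'variant'."""
    lemmas, surface = test_vmwe
    # Sorting keeps duplicates and ignores order, so the key is a multiset.
    key = tuple(sorted(lemmas))
    same_lemmas = [s for l, s in train_vmwes if tuple(sorted(l)) == key]
    if not same_lemmas:
        return "unseen"
    if tuple(surface) in {tuple(s) for s in same_lemmas}:
        return "identical"
    # Seen, but never with this exact surface sequence.
    return "variant"


# Hypothetical training annotation: one occurrence of 'take decision'.
TRAIN = [(["take", "decision"], ["took", "a", "decision"])]
```

Under this scheme, seen VMWEs are exactly the union of the 'identical' and 'variant' classes, which is how the novelty and variability measures relate to each other.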