Pay Attention when you Pay the Bills. A Multilingual Corpus with Dependency-based and Semantic Annotation of Collocations.

This paper presents a new multilingual corpus with semantic annotation of collocations in English, Portuguese, and Spanish. The whole resource contains 155k tokens and 1,526 collocations labeled in context. The annotated examples belong to three syntactic relations (adjective-noun, verb-object, and nominal compounds), and represent 58 lexical functions in the Meaning-Text Theory (e.g., Oper, Magn, Bon, etc.). Each collocation was annotated by three linguists and the final resource was revised by a team of experts. The resulting corpus can serve as a basis to evaluate different approaches for collocation identification, which in turn can be useful for different NLP tasks such as natural language understanding or natural language generation.


Introduction
The automatic identification of collocations, as well as other multiword expressions (MWEs), is crucial for many natural language processing (NLP) tasks, since their linguistic behaviour differs from other combinations of words (Mel'čuk, 1995; Sag et al., 2002; Ramisch and Villavicencio, 2018). On the one hand, approaches to natural language generation may take advantage of collocational information to produce natural utterances with the desired meanings (Wanner et al., 2010; Lareau et al., 2011). In this regard, while different adjectives such as heavy and strong can convey basically the same meaning (e.g., 'intensification' in heavy load and in strong fragrance), great has different senses in great loss and in great time (with 'intensification' and 'positive' meanings, respectively). On the other hand, to interpret the meaning of a sentence, a system should take into account the properties of these expressions: for instance, the meaning of the verb [to] take in the collocation take [a] cab is different from the same verb in a free combination such as take [a] pencil, so natural language understanding or abstract meaning representation systems could benefit from the correct identification of collocations (Bonial et al., 2014; O'Gorman et al., 2018). It is worth mentioning that collocations are pervasive and frequent in all domains and text typologies, so their correct interpretation should be critical to progress in the automatic processing of natural languages.
The concept of collocation was formalized in the Meaning-Text Theory as a combination of two lexical units (LUs) where one of them (the BASE, e.g., attention in the collocation pay attention) is freely selected due to its meaning, while the selection of the other one (the COLLOCATE, e.g., [to] pay) is restricted by the former (Mel'čuk, 1995). Under this theory, lexical functions (LFs) represent a relation between an LU (the base) and a set of expressions (the potential collocates) (Mel'čuk, 1996, 1998; Wanner, 1996). For instance, the LF Oper means 'to carry out', so we could define Oper(picture)=[to] take to formalize the collocation take a picture. Similarly, the adjective-noun collocation loud screech can be represented as Magn(screech)=loud, where the lexical function Magn denotes 'intensification'.
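As an illustration only, a lexical function can be pictured as a mapping from a base to its set of potential collocates. The sketch below is a toy lexicon built from examples given in this paper; the dictionary layout and the name apply_lf are assumptions made for the illustration, not part of the released resource.

```python
# Toy lexicon: (lexical function, base) -> set of potential collocates.
# The entries are examples quoted in the paper; the data structure is an assumption.
LEXICON = {
    ("Oper", "picture"): {"take"},
    ("Oper", "breakfast"): {"have"},
    ("Magn", "screech"): {"loud"},
    ("Magn", "priority"): {"high"},
    ("Bon", "event"): {"great"},
}


def apply_lf(lf, base):
    """Return the collocates that the lexical function `lf` assigns to `base`."""
    return LEXICON.get((lf, base), set())


print(apply_lf("Oper", "picture"))
print(apply_lf("Magn", "screech"))
```

A pair that is not in the toy lexicon simply returns an empty set, which mirrors the idea that the collocate is lexically restricted by the base.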
The automatic identification of collocations has attracted a substantial body of work from researchers in NLP and computational linguistics as well as in lexicography and corpus linguistics (Evert, 2008; Pecina, 2010; Gries, 2013). Most approaches rely on statistical association measures (AMs), both symmetrical and directional, and recent works incorporate distributional semantics to automatically identify the collocate of a given base and LF, or to classify the compositionality of MWEs including collocations (Wanner et al., 2006; Carlini et al., 2014; Rodríguez-Fernández et al., 2016; Cordeiro et al., 2019). To evaluate the extraction, some researchers manually select true collocations from ranked lists, while others take advantage of examples extracted from collocation dictionaries. However, most of these evaluations are carried out in only one language, and they do not always allow precise recall values to be obtained. Moreover, they usually do not include semantic information.
Taking the above into account, this paper attempts to fill this gap by releasing a freely available multilingual corpus of English, Portuguese, and Spanish with manual annotation of collocations and their lexical functions. The whole resource, annotated by five experts, has more than 155k tokens and 1,526 collocations classified into 60 lexical functions. For each language, we provide both the labeled data of each annotator and the gold-standard data. 1
1 http://www.grupolys.org/~marcos/pub/collocations.zip

Related Work
Different statistical methods have been applied to automatically identify and classify collocations from corpora. Studies such as Wanner et al. (2006), Wanner et al. (2016), or Gelbukh and Kolesnikova (2010) train statistical models using Spanish data (from EuroWordNet, from the DiCE dictionary, and using a Spanish corpus, respectively). For French, Fonseca et al. (2017) explore the combination of dependency parsing with a lexical network based on lexical functions.
The semantic classification of base-collocate pairs has made it possible to implement multilingual natural language generation systems which take advantage of lexical functions to select the most appropriate combinations for each context (Wanner et al., 2010). In this regard, Lareau et al. (2011) propose a methodology to use lexical functions in Lexical Functional Grammar.
With respect to the extraction process, a large number of studies have focused on the automatic identification of collocations in corpora. Most approaches have relied on statistical association measures applied both to windows of n-grams (Church and Hanks, 1990; Smadja, 1993; Pecina, 2010) and to syntax-based dependency triples (Seretan, 2011; Carlini et al., 2014; Garcia et al., 2017; Uhrig et al., 2018). Rodríguez-Fernández et al. (2016) present a method to retrieve potential collocates of a given LF and a base. Other studies address the identification process as a classification problem. Karan et al. (2012) take advantage of a set of true positive and true negative collocations to evaluate machine learning algorithms which use, among others, features based on association values.
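To make the notion of an association measure concrete, the following minimal sketch computes pointwise mutual information (PMI), one widely used AM, over a list of observed (base, collocate) pairs. The toy counts are invented for the illustration and the function name pmi_scores is hypothetical; none of the cited systems is reproduced here.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Compute PMI (in bits) for every distinct (base, collocate) pair observed."""
    pair_counts = Counter(pairs)
    base_counts = Counter(base for base, _ in pairs)
    collocate_counts = Counter(collocate for _, collocate in pairs)
    total = len(pairs)
    scores = {}
    for (base, collocate), freq in pair_counts.items():
        p_joint = freq / total
        p_base = base_counts[base] / total
        p_collocate = collocate_counts[collocate] / total
        scores[(base, collocate)] = math.log2(p_joint / (p_base * p_collocate))
    return scores


# Invented toy observations, e.g. pairs taken from dependency triples.
observed = (
    [("attention", "pay")] * 4
    + [("bill", "pay")] * 2
    + [("pencil", "take"), ("pencil", "pay")]
)

for pair, score in sorted(pmi_scores(observed).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

Candidates would then be ranked by score; note that PMI tends to favour low-frequency pairs, which is one reason frequency thresholds are often applied in practice.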
To evaluate such methods, some authors carry out a manual review of the n best combinations of candidate collocation lists, ranked by a given AM (Seretan and Wehrli, 2006; Garcia, 2018). A different approach consists of collecting an inventory of true collocations (usually from existing dictionaries), which is then used to compare the performance of various AMs (Evert and Krenn, 2001; Pearce, 2002; Pecina, 2010; Kilgarriff et al., 2014; Evert et al., 2017). Concerning the available data with collocational information, it is worth noting that a vast majority of the resources are dictionaries and lexicons often targeted at language learners (Benson et al., 1986; Alonso-Ramos et al., 2010). From a different perspective, initiatives such as PropBank and abstract meaning representation also provide corpora with semantic annotation of MWEs, some of which may be considered collocations (Banarescu et al., 2013; Bonial et al., 2014; O'Gorman et al., 2018).
The approach to evaluate the process of collocation extraction proposed here consists of using a gold-standard corpus with manual annotation of such expressions. On the one hand, this allows for accurate precision and recall values to be obtained, also taking into account ambiguous combinations which may be collocations or not depending on the context. On the other hand, a gold-standard enables the research community to evaluate different strategies in a more comparable way. In this regard, the 2017 and 2018 PARSEME Shared Tasks released multilingual corpora with annotation of verbal MWEs (Savary et al., 2017). Even if the initial objectives of these shared tasks differ from ours (they annotate idioms, verb-particle constructions and other non-collocation MWEs), some of the units actually intersect with the expressions we want to identify. Thus, we rely on these corpora to initiate the construction of a multilingual corpus with dependency-based and semantic annotation of collocations.
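A minimal sketch of how precision and recall could be computed against such a gold standard is given below. It assumes, purely for illustration, that each collocation instance is identified by a (sentence id, base token id, collocate token id) tuple, so that the same word pair can count as a collocation in one context and not in another; the function name and the toy data are hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted instances against gold instances."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical instances: (sentence id, base token id, collocate token id).
gold_instances = {(1, 5, 2), (1, 5, 4), (2, 3, 1)}
system_output = {(1, 5, 2), (2, 3, 1), (3, 2, 1), (3, 4, 2)}

precision, recall = precision_recall(system_output, gold_instances)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because every instance in the corpus is annotated, the denominator of recall is known exactly, which is what evaluations based on ranked lists or dictionary samples cannot guarantee.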

Source Data and Annotation
This section describes the source data used to build our multilingual corpora, together with the annotation guidelines and procedure.

Corpora
We decided to take advantage of three subcorpora of edition 1.1 of the PARSEME Shared Task, which includes annotation of different verbal multiword expressions in 20 languages. Since we understand collocations as lexically restricted combinations of words, some of the MWEs annotated in the PARSEME corpora are also useful for our objectives (see Section 3.2).
Our main purpose is to provide datasets to evaluate unsupervised strategies for extracting collocations, so we selected the test datasets for Portuguese (58k tokens) and Spanish (39k tokens), and the train corpus for English (53k tokens), because the test set for this language is smaller. These corpora are annotated with Universal Dependencies (Nivre, 2015) and released in .cupt format (an extension of .conllu).

Annotation Guidelines
In general, our annotation follows Mel'čuk (1996), with specific guidelines for each collocation type. We also tried to remain compatible with the PARSEME principles, with a view to combining both annotations. As our strategy relies on dependency analysis to obtain candidate combinations (which are subsequently revised), we defined annotation guidelines for three syntactic patterns of collocations (for each pattern, a set of tests for identifying collocations was included). In the following we present some examples of the most productive lexical functions in each pattern (see Appendix A for the whole list of LFs):
Adjective-noun (amod): collocations where the adjective has a function of intensification or attenuation (Magn: high priority, or AntiMagn: weak resource), expresses a positive or negative consideration from the speaker (Bon: great event, AntiBon: unfortunate mistake), or conveys a specific sense (NonStandard) in combination with the noun (e.g., dark sorcerer) (Mel'čuk, 1996).
Verb-object (obj): verb-object collocations occur between a predicative noun (Polguère, 2011) and the verb it depends on, which either does not contribute to the meaning of the combination (Oper: [to] have breakfast), or expresses causation (CausOper: [to] cause damage) or a specific meaning in combination with the base (NonStandard: [to] shake hands). As some of these types were covered by PARSEME (labeled as light verb constructions), we revised each case and added their LFs. Some structures, such as verb-obj candidates occurring in passive voice as subjects (e.g., [the] damage was caused) or in relative constructions (which do not have a direct dependency between the lexical base and the collocate), were not extracted.
A simplified version of the guidelines can be accessed at the following URL: http://www.grupolys.org/~marcos/collocations/guidelines.html.

Annotation Procedure
In order to facilitate the labeling by each annotator as well as to speed up the whole process we defined the following procedure (see Figure 1): We start by extracting the instances of the desired relations (amod, nmod, compound and obj) from the referred corpora, and arrange them into triples (base;collocate;relation). Despite the fact that most collocation extraction approaches set up a frequency threshold to avoid noisy and less frequent combinations, we revised every single instance of each dependency relation.
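The extraction step can be sketched as follows for a single sentence in CoNLL-U-like format. This is a simplified illustration, not the script used to build the corpus: the function name is hypothetical, lemmas are used to build the triples (an assumption), and only amod and obj are handled, taking the noun as the base in both patterns, because this section does not specify how the base is chosen for nmod and compound.

```python
SAMPLE_SENTENCE = """\
1\tHe\the\tPRON\t_\t_\t2\tnsubj\t_\t_
2\ttook\ttake\tVERB\t_\t_\t0\troot\t_\t_
3\ta\ta\tDET\t_\t_\t5\tdet\t_\t_
4\tdeep\tdeep\tADJ\t_\t_\t5\tamod\t_\t_
5\tbreath\tbreath\tNOUN\t_\t_\t2\tobj\t_\t_
"""


def extract_triples(conllu_sentence):
    """Return (base, collocate, relation) triples for amod and obj dependencies."""
    rows = [
        line.split("\t")
        for line in conllu_sentence.splitlines()
        if line and not line.startswith("#")
    ]
    # Keep plain tokens only (skip multiword-token ranges such as "3-4").
    rows = [row for row in rows if row[0].isdigit()]
    lemma_by_id = {int(row[0]): row[2] for row in rows}

    triples = []
    for row in rows:
        lemma, head, relation = row[2], int(row[6]), row[7]
        if relation == "amod":
            # Assumption: the head noun is the base, the adjective the collocate.
            triples.append((lemma_by_id[head], lemma, relation))
        elif relation == "obj":
            # Assumption: the object noun is the base, the governing verb the collocate.
            triples.append((lemma, lemma_by_id[head], relation))
    return triples


print(extract_triples(SAMPLE_SENTENCE))
```

No frequency threshold is applied in the sketch, in line with the decision to revise every instance of each dependency relation.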
Then, for each language and collocation pattern, we created a sheet including the triples, a link to an automatically generated HTML site with examples from the corpora, and annotation fields (see Figure 2). Each collocation candidate was revised by three experts (native or near-native speakers of the target languages), who classified it as collocation, non-collocation, or doubt. 4 Doubts include (i) combinations which are collocations in some examples but not in others, (ii) collocations which include internal MWEs (e.g., verb-particle constructions), and consequently need a specific annotation, and (iii) cases in which the annotator is not sure about the collocational status of the candidate. After that, we created a final sheet including the most frequent classification for each candidate. Those cases in which there was no agreement (i.e., each annotator selected a different label, or there is more than one doubt) were also marked as doubt. The dubious cases in these final sheets were revised jointly by the whole team of language experts, who decided on the LF of each collocation. Finally, we automatically transferred the annotation to a new version of the initial corpora, and converted it to the .tsv format required by WebAnno (Eckart de Castilho et al., 2016). Using this tool, we corrected the special cases (MWE bases and collocates, combinations which are collocations only in some contexts, etc.) and performed a general revision of the corpus. At the end of this process, we generated the final corpus in .conllu format using the original resources and the .tsv files.
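The rule used to build the final sheets can be written down as a short function. The sketch below follows the description above (most frequent label, with doubt assigned when all three annotators differ or when more than one of them selected doubt); the label strings and the function name are assumptions made for the illustration.

```python
from collections import Counter


def aggregate_labels(labels):
    """Combine the labels of three annotators into the label of the final sheet."""
    if len(labels) != 3:
        raise ValueError("this sketch assumes exactly three annotators")
    counts = Counter(labels)
    # No agreement: every annotator chose a different label, or more than one doubt.
    if len(counts) == 3 or counts["doubt"] > 1:
        return "doubt"
    return counts.most_common(1)[0][0]


print(aggregate_labels(["collocation", "collocation", "non-collocation"]))
print(aggregate_labels(["collocation", "doubt", "non-collocation"]))
print(aggregate_labels(["doubt", "doubt", "collocation"]))
```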
It is worth mentioning that we did not perform a systematic evaluation of the syntactic analysis of each corpus. In this respect, we could miss some true collocations incorrectly labeled with a wrong dependency relation. However, the annotated cases were manually checked, and therefore they have a correct syntactic analysis (except for human errors).
4 Light verb constructions already annotated in the original corpus were initially marked as doubt, so each annotator also revised these cases again.

Figure 3:
id  token   h  dep    collocational information
1   He      2  nsubj
2   took    0  root   1
3   a       5  det
4   deep    5  amod   2
5   breath  2  obj    1:obj Oper1;2:amod Magn
The resulting corpus contains the collocational annotation in the last column of the .conllu file (see an example in Figure 3). On the one hand, the base of each collocation has a numerical id followed by the syntactic pattern (e.g., obj, amod) and by its lexical function. On the other hand, the collocate is labeled with the same id as the base it depends on. In blended collocations (as in the example), the base contains information about both combinations separated by a semicolon.
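A reader of the corpus could recover the collocations of a sentence from this column roughly as sketched below. The exact serialization is an assumption: the code takes the field as rendered in Figure 3 (id, colon, pattern, a space, then the lexical function) and treats empty or "_" cells as unannotated, so the separator may need adjusting against the released files. All function names are hypothetical.

```python
def parse_collocation_field(field):
    """Split one collocational field into base entries and collocate ids."""
    bases, collocate_ids = [], []
    for item in field.split(";"):
        item = item.strip()
        if not item or item == "_":
            continue
        if ":" in item:
            # Base entry, e.g. "1:obj Oper1" (separator before the LF is assumed).
            collocation_id, rest = item.split(":", 1)
            pattern, _, lexical_function = rest.partition(" ")
            bases.append((int(collocation_id), pattern, lexical_function))
        else:
            # Collocate entry: only the id of the collocation it belongs to.
            collocate_ids.append(int(item))
    return bases, collocate_ids


def collect_collocations(rows):
    """Pair bases and collocates that share a collocation id within a sentence."""
    bases, collocates = {}, {}
    for token, field in rows:
        base_entries, collocate_ids = parse_collocation_field(field)
        for collocation_id, pattern, lexical_function in base_entries:
            bases[collocation_id] = (token, pattern, lexical_function)
        for collocation_id in collocate_ids:
            collocates[collocation_id] = token
    return [
        (bases[cid][0], collocates.get(cid), bases[cid][1], bases[cid][2])
        for cid in sorted(bases)
    ]


# (token, collocational field) pairs for the sentence in Figure 3.
FIGURE_3 = [
    ("He", ""),
    ("took", "1"),
    ("a", ""),
    ("deep", "2"),
    ("breath", "1:obj Oper1;2:amod Magn"),
]

print(collect_collocations(FIGURE_3))
```

For the example sentence, the blended annotation on breath yields two collocations that share the same base: one with took (obj, Oper1) and one with deep (amod, Magn).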

Final Resources and Results
The final multilingual corpus has 155,794 tokens and 1,526 annotated collocations (1,394 unique). Table 1 includes the number of revised candidates and annotated collocations for each language and dependency relation. As expected, adjective-noun and verb-object collocations were the most productive ones, and nominal compound combinations were less frequent.
We computed multi-κ inter-annotator agreement (Davies and Fleiss, 1982; Artstein and Poesio, 2008) for each language and relation (Table 2), with values between κ = 0.370 and κ = 0.706. Agreement was higher for verb-object collocations and lower for nominal compounds.
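For readers unfamiliar with the coefficient, the sketch below computes multi-κ following the formulation in Artstein and Poesio (2008): observed agreement is the proportion of agreeing annotator pairs per item, and expected agreement is estimated from the individual label distribution of each annotator, averaged over annotator pairs. The toy annotations are invented and the function name is hypothetical; this is not the script used to produce Table 2.

```python
from collections import Counter
from itertools import combinations


def multi_kappa(annotations):
    """Multi-kappa for a list of items, each a tuple with one label per annotator."""
    n_items = len(annotations)
    n_coders = len(annotations[0])
    categories = sorted({label for item in annotations for label in item})

    # Observed agreement: proportion of agreeing annotator pairs, averaged over items.
    observed = 0.0
    for item in annotations:
        counts = Counter(item)
        agreeing = sum(n * (n - 1) for n in counts.values())
        observed += agreeing / (n_coders * (n_coders - 1))
    observed /= n_items

    # Expected agreement: based on each annotator's own label distribution.
    distributions = []
    for coder in range(n_coders):
        counts = Counter(item[coder] for item in annotations)
        distributions.append({k: counts[k] / n_items for k in categories})
    coder_pairs = list(combinations(range(n_coders), 2))
    expected = sum(
        distributions[a][k] * distributions[b][k]
        for k in categories
        for a, b in coder_pairs
    ) / len(coder_pairs)

    # Undefined (division by zero) when expected agreement is 1,
    # i.e. when every annotator always uses the same single label.
    return (observed - expected) / (1 - expected)


# Invented toy data: four candidates, three annotators, labels c / n.
toy = [("c", "c", "c"), ("c", "c", "n"), ("n", "n", "n"), ("n", "c", "n")]
print(round(multi_kappa(toy), 3))
```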
Once the final sheets (for each relation and language) were created, a total of 447 combinations (3.6%) were labeled as doubt (there was no agreement between the annotators). Out of these, 260 (58.2%) were finally considered true collocations by the team of experts. Among the most frequent disagreements we found adjective-noun pairs for which the annotators doubted whether they were technical terms (e.g., light cluster), nominal compounds in which one of the nouns seems lexically selected by the other (e.g., golf tournament), and verb-object combinations in which the noun could be predicative and the verb has scarce lexical content, but lacks a single-word verb equivalent (e.g., tener velocidad 'have speed' in Spanish). In the latter group, we harmonized their annotation in the three languages.
The final resource includes a total of 60 lexical functions, some of them complex (e.g., Magn + AntiBon), and not all of them in every language (i.e., less frequent LFs appear only in one or two corpora). The most frequent ones are Oper1, Magn, Bon, and NonStandard (see Appendix A for the full list of LFs per language).

Conclusions and Further Work
This paper presented a multilingual corpus with manual annotation of collocations and their lexical functions in English, Portuguese, and Spanish. The resource contains 155k tokens and 1,526 collocations classified into 60 lexical functions. Each collocation candidate was revised by three language experts, and those which were dubious were corrected by the whole team of annotators.
We release both the final corpus of each annotator and the gold-standard resource in .conllu format. This dataset can serve as a basis to evaluate systems designed to automatically extract collocations and identify their lexical functions, which in turn may be useful for different NLP and corpus linguistics tasks. As we provide resources for three languages (and with different dependency relations), the corpora can also be useful to verify whether some methods behave similarly or not in each language and syntactic pattern.
In further work we plan to carry out a multilingual alignment of the collocations in each language. This process, also enlarged with other multilingual equivalents, will generate a new dataset for evaluating the automatic translation of this type of multiword expressions.

Leo Wanner, Bernd Bohnet, Nadjet Bouayad-Agha, François Lareau, and Daniel Nicklaß. 2010