Collocation Classification with Unsupervised Relation Vectors

Lexical relation classification is the task of predicting whether a certain relation holds between a given pair of words. In this paper, we explore to what extent the current distributional landscape based on word embeddings provides a suitable basis for the classification of collocations, i.e., pairs of words between which idiosyncratic lexical relations hold. First, we introduce a novel dataset with collocations categorized according to lexical functions. Second, we conduct experiments on a subset of this benchmark, comparing it in particular to the well-known DiffVec dataset. In these experiments, in addition to simple word vector arithmetic operations, we also investigate the role of unsupervised relation vectors as a complementary input. While these relation vectors indeed help, we also show that lexical function classification poses a greater challenge than the syntactic and semantic relations that are typically used for benchmarks in the literature.


Introduction
Relation classification is the task of predicting whether a certain lexical, semantic or morphosyntactic relation holds between a given pair of words or phrases. This task has a direct impact on downstream NLP tasks such as machine translation, paraphrase identification (Etzioni et al., 2005), named entity recognition (Socher et al., 2012), or knowledge base completion (Socher et al., 2013). The currently standard approach to relation classification is to combine the embeddings corresponding to the arguments of a given relation into a meaningful representation, which is then passed to a classifier. As for which relations have been targeted so far, the landscape is considerably more varied, although we may safely group them into morphosyntactic and semantic relations.
Morphosyntactic relations have been the focus of work on unsupervised relational similarity, as it has been shown that verb conjugation or nominalization patterns are relatively well preserved in vector spaces (Mikolov et al., 2013; Pennington et al., 2014a). Semantic relations, however, pose a greater challenge (Vylomova et al., 2016). In fact, as of today, it is unclear which operation performs best (and why) for the recognition of individual lexico-semantic relations (e.g., hypernymy or meronymy, as opposed to cause, location or action). Still, a number of works address this challenge. For instance, hypernymy has been modeled using vector concatenation (Baroni et al., 2012), and vector difference and component-wise squared difference (Roller et al., 2014) as input to linear regression models (Fu et al., 2014; Espinosa-Anke et al., 2016); cf. also a sizable number of neural approaches (Shwartz et al., 2016; Anh et al., 2016). Furthermore, several high-quality semantic relation datasets are available, ranging from well-known resources such as WordNet (Miller, 1995), Yago (Suchanek et al., 2007) and BLESS (Baroni and Lenci, 2011) to several SemEval datasets (Jurgens et al., 2012; Camacho-Collados et al., 2018) and DiffVec (Vylomova et al., 2016). But there is a surprising gap regarding collocation modeling. Collocations, which are semi-compositional in nature in that they are situated between fixed multiword expressions (MWEs) and free (semantic) word combinations, are of relevance to second language (henceforth, L2) learners and NLP applications alike. In what follows, we investigate whether collocations can be modeled along the same lines as semantic relations between pairs of words. For this purpose, we introduce LexFunC, a newly created dataset in which collocations are annotated with respect to the semantic typology of lexical functions (LFs) (Mel'čuk, 1996). We use LexFunC to train linear SVMs on top of different word and relation embedding compositions. We show that the
recognition of the semantics of a collocation, i.e., its classification with respect to the LF typology, is a more challenging problem than the recognition of standard lexico-semantic relations, although incorporating distributional relational information brings a significant increase in performance.

Collocations and LexFunC
We first introduce the notions of collocation and LF and then present the LexFunC dataset.

The phenomenon of collocation
Collocations such as make [a] suggestion, attend [a] lecture, heavy rain, deep thought or strong tea, to name a few, are described by Kilgarriff (2006) as restricted lexical co-occurrences of two syntactically bound lexical items. Due to their idiosyncrasy, collocations tend to be language-specific. For instance, in English or Norwegian we take [a] nap, whereas in Spanish we throw it, and in French, Catalan, German and Italian we make it. However, they are compositionally less rigid than some other types of multiword expressions such as, e.g., idioms (as, e.g., [to] kick the bucket) or multiword lexical units (as, e.g., President of the United States or chief inspector). Specifically, they are formed by a freely chosen word (the base), which restricts the selection of its collocate (e.g., rain restricts us to use heavy in English to express intensity). Recovery of collocations from corpora plays a major role in improving L2 resources, in addition to offering obvious advantages in NLP applications such as natural language analysis and generation, text paraphrasing / simplification, or machine translation (Hausmann, 1984; Bahns and Eldaw, 1993; Granger, 1998; Lewis and Conzett, 2000; Nesselhauf, 2005; Alonso Ramos et al., 2010).
Starting with the seminal work by Church and Hanks (1989), an extensive body of work has been produced on the detection of collocations in text corpora; cf., e.g., (Evert and Kermes, 2013; Evert, 2007; Pecina, 2008; Bouma, 2010; Garcia et al., 2017), as well as the Shared Task of the PARSEME European Cost Action on automatic recognition of verbal MWEs. However, mere lists of collocations are often insufficient for both L2 acquisition and NLP. Thus, a language learner may not know the difference between, e.g., come to fruition and bring to fruition, or between have [an] approach and take [an] approach; semantic labeling is required. The failure to identify the semantics of collocations also led, e.g., in earlier machine translation systems, to the necessity of defining collocation-specific cross-language transfer rules (Dorr, 1994; Orliac and Dillinger, 2003). The above motivates us to consider in this paper collocations and their classification in terms of LFs (Mel'čuk, 1996), their most fine-grained semantic typology (see Section 2.2), especially because, so far, this has only been discussed in a reduced number of works, and typically on a smaller scale (Wanner et al., 2006; Gelbukh and Kolesnikova, 2012).

LFs and the LexFunc dataset
An LF can be viewed as a function f(·) that associates, with a given base L (which is the argument or keyword of f), a set of (more or less) "synonymous collocates that are selected contingent on L to manifest the meaning corresponding to f" (Mel'čuk, 1996). The name of an LF is a Latin abbreviation of this meaning: for example, Oper for operāri ('do', 'carry out'), Magn for magnus ('great', 'intense'), and so forth. The LexFunc dataset consists of collocations categorized in terms of LFs. Table 1 lists the ten LFs used in this paper, along with a definition, example and frequency. The LFs have been selected so as to cover the most prominent syntactic patterns of collocations (verb+direct object, adjective+noun, and noun+noun). Our aim is to assess the idiosyncratic (but still semi-compositional) "collocationality" between a collocation's base and collocate. To this end, we benchmark standard relation classification baselines on the task of LF classification. Furthermore, we also explore an explicit encoding of relational properties by distributional relation vectors (see Section 3.2). Moreover, to contrast the LF categories in our LexFunc dataset with others typically found in the relation classification literature, we use ten categories from DiffVec (Vylomova et al., 2016), a dataset which was particularly designed to explore the role of vector difference in supervised relation classification. The rationale for this is that, by subtraction, the features that are common to both words are "cancelled out". For instance, for madrid − spain, this operation can be expected to capture that the first word is a capital city and the second word is a country, and "remove" the fact that both words are related to Spain (Levy et al., 2014).
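The "cancelling out" intuition behind vector difference can be illustrated with hand-built toy feature vectors (entirely hypothetical, for illustration only; real embeddings are dense and learned, not interpretable one-hot features):

```python
import numpy as np

# Toy feature dimensions: [capital-ness, country-ness, Spain-ness, France-ness]
madrid = np.array([1., 0., 1., 0.])
spain  = np.array([0., 1., 1., 0.])
paris  = np.array([1., 0., 0., 1.])
france = np.array([0., 1., 0., 1.])

# The country-specific component is shared within each pair, so it cancels
# out under subtraction: both offsets encode the same "capital-of" relation.
print(madrid - spain)          # [ 1. -1.  0.  0.]
print(np.allclose(madrid - spain, paris - france))  # True
```

Under this (idealized) view, the offset vector retains only what distinguishes the two words, which is exactly the relational signal a classifier needs.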

Both for DiffVec and LexFunc, we run experiments on those categories for which we have at least 99 instances. We cast the relation classification task as a multi-class classification problem and use a stratified 2/3 portion of the data for training and the rest for evaluation. We consider each of the datasets in isolation, as well as a concatenation of both (referred to in Table 2 as DiffVec+LexFunc). The model we use is a linear SVM, implemented in scikit-learn, trained on a suite of vector composition operations (Section 3.1).
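A minimal sketch of this setup in scikit-learn (the random matrices stand in for actual pair representations and labels, which in the paper come from the composition operations of Section 3.1; hyperparameters here are library defaults, not the paper's):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Stand-in data: each row of X is a composed pair representation,
# y holds one of ten relation / LF labels.
rng = np.random.RandomState(0)
X = rng.randn(300, 50)
y = rng.randint(0, 10, size=300)

# Stratified 2/3 training portion, the rest for evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=2 / 3, stratify=y, random_state=0)

clf = LinearSVC().fit(X_tr, y_tr)
macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
```

Stratification keeps the per-class label proportions of the full dataset in both splits, which matters when class frequencies are as skewed as LF frequencies are.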

Modeling relations using word vectors
Let w1 and w2 be the vector representations of two words w1 and w2. We experiment with the following word-level operations: diff (w2 − w1), concat (w1 ⊕ w2), sum (w1 + w2), mult (w1 • w2, component-wise), and leftw (w1), the latter operation being included to explore the degree to which the data can be lexically memorized (Levy et al., 2015).
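These five operations amount to a few lines of NumPy (the function name and dispatch style are ours, for illustration):

```python
import numpy as np

def compose(w1: np.ndarray, w2: np.ndarray, op: str) -> np.ndarray:
    """Combine two word vectors into a single pair representation."""
    if op == "diff":      # w2 - w1
        return w2 - w1
    if op == "concat":    # w1 ⊕ w2 (doubles the dimensionality)
        return np.concatenate([w1, w2])
    if op == "sum":       # w1 + w2
        return w1 + w2
    if op == "mult":      # component-wise product
        return w1 * w2
    if op == "leftw":     # first word only (lexical-memorization probe)
        return w1
    raise ValueError(f"unknown operation: {op}")
```

Note that all operations except concat preserve the word-vector dimensionality, so concat-based classifiers see twice as many input features.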

Relation vectors
Because word embeddings are limited in the amount of relational information they can capture, a number of complementary approaches have emerged which directly learn vectors that capture the relation between concepts, typically using distributional statistics from sentences mentioning both words (Espinosa-Anke and Schockaert, 2018; Washio and Kato, 2018; Joshi et al., 2018; Jameel et al., 2018). Below we explore the potential of such relation vectors for semantic relation classification. Specifically, we trained them for all word pairs from DiffVec and LexFunc using two different variants of the SeVeN model (Espinosa-Anke and Schockaert, 2018). The corpus for training these vectors is a Wikipedia dump from January 2018, with GloVe (Pennington et al., 2014b) 300d pre-trained embeddings.

The first variant, referred to as rvAvg6, is based on averaging the vectors of the words that appear in sentences mentioning the two given target words. Since this approach differentiates between words that appear before the first word, after the second word, or in between the two, and takes into account the order in which the target words appear, it results in relation vectors with a dimensionality six times that of the considered word vectors. The second variant, referred to as rvAE, starts from the same high-dimensional relation vector, but then uses a conditional autoencoder to obtain a lower-dimensional and potentially higher-quality 300d vector.
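The rvAvg6 construction can be sketched roughly as follows (our own simplified reading of the averaging scheme: three positional slots per pair order, giving 6 blocks; function name, signature and details such as skipping out-of-vocabulary tokens are ours, not SeVeN's exact implementation):

```python
import numpy as np

def relation_vector(sentences, w1, w2, emb, dim=300):
    """Average context-word embeddings into six positional slots:
    before / between / after the pair, separately for each order in
    which (w1, w2) appear. Returns a 6*dim vector."""
    slots = [[] for _ in range(6)]
    for toks in sentences:
        if w1 not in toks or w2 not in toks:
            continue
        i, j = toks.index(w1), toks.index(w2)
        offset = 0 if i < j else 3          # which order the pair appears in
        lo, hi = min(i, j), max(i, j)
        for k, tok in enumerate(toks):
            if tok in (w1, w2) or tok not in emb:
                continue                     # skip targets and OOV tokens
            slot = 0 if k < lo else (1 if k < hi else 2)
            slots[offset + slot].append(emb[tok])
    return np.concatenate(
        [np.mean(s, axis=0) if s else np.zeros(dim) for s in slots])
```

The six-fold blow-up in dimensionality is what motivates the second variant, which compresses this vector back down with an autoencoder.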

Results
Table 2 shows the experimental results for DiffVec, LexFunc and both datasets together. The first five rows show the performance of the word embedding operations, whereas the configurations in the remaining rows also include a relation vector.

Discussion
We highlight two major conclusions. First, despite vector difference and component-wise multiplication being the most popular vector operations for encoding relations between words, also in more expensive neural architectures for relation modeling (Washio and Kato, 2018; Joshi et al., 2018), vector concatenation alone proves to be a strong baseline. Moreover, the overall best method (concat+rvAvg6) obtains performance gains over the standard diff method ranging from +5.89% on DiffVec to +7.98% on LexFunc and +10.08% on the combined dataset. This suggests that while vector differences may encode relational properties, important information is lost when only this operation is considered.
Second, despite collocations being a well-studied topic, recognizing lexical functions emerges as a challenging problem. They seem difficult to classify, not only among themselves, but also when coupled with other lexical semantic relations. This may be due to the fact that collocations are idiosyncratic lexical co-occurrences which are syntactically bound: the base and collocate embeddings should account for these properties, rather than over-relying on the contexts in which they appear.
In the following section we present an analysis of the main sources of confusion in the LexFunc and DiffVec+LexFunc settings.

Problematic LFs
We aim to gain an understanding of recurrent errors made by both the best performing model (concat+rvAvg6) and diff. Figure 1 shows confusion matrices for the two datasets involving LFs, namely LexFunc and DiffVec+LexFunc. We are particularly interested in pinpointing which LFs are most difficult to classify, and whether there is any particular label that attracts most predictions. For example, in Fig. 1a we see a strong source of confusion in the diff model between the 'bon' and 'magn' labels. Both are noun-adjective combinations and both are used as intensifiers, but they subtly differ in that only one conveys a perceived degree of positiveness (e.g., resounding vs. crushing victory). Thus, combining their vectors produces clearly similar representations that confuse the classifier, a scenario which is only partly alleviated by the use of relation vectors (Fig. 1b).
The case of 'oper1' (perform) and 'real1' (accomplish) also proves problematic. The number of light verbs among the collocates of these LFs is notably high in the former, amounting to 48%; 'real1' is more semantic, with almost 11% light verbs. Interestingly, however, these labels are almost never confused with the 'event' label from DiffVec (Figs. 1c and 1d), even though it also contains relations with light verbs such as break or pay.
Finally, one last source of confusion that warrants discussion involves 'magn' and 'antimagn', two noun-adjective collocation types which differ in that the former conveys a notion of intensity, whereas the latter conveys weakness (e.g., 'faint admiration' or 'slight advantage'). These two LFs typically include antonymic collocates (e.g., 'weak' and 'strong' as collocates for the base 'argument'), and antonyms are known to have similar distributional vectors (Mrkšić et al., 2016; Nguyen et al., 2016), which in all likelihood constitutes a source of confusion.

Conclusions and Future Work
In this paper, we have discussed the task of distributional collocation classification. We have used a set of collocations categorized by lexical functions, as introduced in Meaning-Text Theory (Mel'čuk, 1996), and evaluated a wide range of vector representations of relations. In addition, we have used the DiffVec dataset (Vylomova et al., 2016) to provide a frame of reference, as this dataset has been extensively studied in the distributional semantics literature, mostly for evaluating the role of vector difference. We found that, despite this operation being the go-to representation for lexical relation modeling, concatenation works as well or better, and clear improvements can be obtained by incorporating explicitly learned relation vectors. However, even with these improvements, categorizing LFs proves to be a difficult task.
In the future, we would like to experiment with more data, so that enough training material can be obtained for less frequent LFs. To this end, we could benefit from the supervised approach proposed by Rodríguez-Fernández et al. (2016), and then filter by pairwise correlation strength metrics such as PMI. Another exciting avenue would involve exploring cross-lingual transfer of LFs, taking advantage of recent developments in unsupervised cross-lingual embedding learning (Artetxe et al., 2017; Conneau et al., 2017).

Table 1 :
Statistics, definitions and examples of the LexFunc dataset. The indices indicate the argument structure of the LF: '1' stands for "first actant is the grammatical subject"; '0' for "the base is the grammatical subject".

Table 2 :
Experimental results of several baselines on different multiclass settings for relation classification.