A Dataset for Noun Compositionality Detection for a Slavic Language

This paper presents the first gold-standard resource for Russian annotated with compositionality information of noun compounds. The compound phrases are collected from the Universal Dependencies treebanks according to part-of-speech patterns, such as ADJ+NOUN or NOUN+NOUN, using the gold-standard annotations. Each compound phrase is annotated by two experts and a moderator according to the following schema: the phrase can be compositional, non-compositional, or ambiguous (i.e., depending on the context, it can be interpreted as either compositional or non-compositional). We conduct an experimental evaluation of models and methods for predicting compositionality of noun compounds in unsupervised and supervised setups. We show that methods from previous work evaluated on the proposed Russian-language resource achieve performance comparable with results on English corpora.


Introduction
The quality of many natural language processing applications depends heavily on the quality of vector representations of text elements. Mainstream NLP research encompasses many works on building distributional semantic models (DSMs) and on methods for combining vector representations of atomic elements such as words into representations of larger fragments: phrases, sentences, texts. A simple but strong baseline for this task is to average the word embeddings of a text fragment (sometimes weighted, e.g., according to IDF). Although the resulting vector representation is rough compared to what could be achieved by more elaborate neural network encoding methods, this baseline has been shown to perform well in many tasks (Weston et al., 2013; Mikolov et al., 2013; Mitchell and Lapata, 2008; Anke and Schockaert, 2018).
The main advantages of such methods are computational efficiency and an ability to use them in an unsupervised setting, while neural encoders would commonly require heavy computational power, labeled datasets, and substantial time for training.
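The averaging baseline can be sketched in a few lines of Python. This is a minimal illustration, not the setup of any cited work: the function names, the smoothed IDF formula (`log(N / df) + 1`), and the toy two-dimensional vectors are all assumptions made for the example.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Compute a smoothed IDF weight for every token in tokenized documents.

    The "+ 1.0" smoothing is an assumption of this sketch.
    """
    n_docs = len(documents)
    doc_freq = Counter(token for doc in documents for token in set(doc))
    return {tok: math.log(n_docs / df) + 1.0 for tok, df in doc_freq.items()}


def average_embedding(tokens, embeddings, weights=None):
    """Return the (optionally weighted) mean of the vectors of known tokens."""
    known = [t for t in tokens if t in embeddings]
    if not known:
        raise ValueError("no token has an embedding")
    dim = len(embeddings[known[0]])
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in known:
        w = 1.0 if weights is None else weights.get(tok, 1.0)
        weight_sum += w
        for i, x in enumerate(embeddings[tok]):
            total[i] += w * x
    return [x / weight_sum for x in total]


# Toy 2-dimensional vectors, invented for illustration only.
toy_embeddings = {"open": [1.0, 0.0], "sea": [0.0, 1.0]}
plain = average_embedding(["open", "sea"], toy_embeddings)
weighted = average_embedding(
    ["open", "sea"], toy_embeddings, weights={"open": 1.0, "sea": 3.0}
)
```

Here `plain` is the unweighted mean and `weighted` shifts the result towards the token with the larger weight. Neither variant can represent a meaning that is absent from both parts.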
However, simple averaging of word embeddings is often too naïve. Idiomatic noun phrases are one case where averaging the phrase parts yields a wrong result, since the meaning of such phrases is metaphorical and cannot be directly "summed up" from the meanings of their components. It would therefore be beneficial to have a DSM that tackles this problem by providing a distinct embedding for the whole phrase.
In this work, we focus on the task of predicting compositionality of noun phrases in Russian-language texts. The goal is to develop a resource and methods for distinguishing compositional compounds, whose meaning can be split into parts, from non-compositional ones, which have a single indivisible meaning and for which we would like to have a dedicated embedding. The ability to detect compositionality of noun compounds is considered beneficial for many tasks, including machine translation, semantic parsing, and word sense disambiguation.
The contribution of this paper is two-fold: (1) We present the first gold-standard dataset for Russian annotated with compositionality information of noun compounds. (2) We provide an experimental evaluation of models and methods for predicting compositionality of noun compounds. We show that the methods from previous work trained on the proposed Russian-language resource achieve performance comparable with results on English corpora.

Related Work
The construction of datasets capturing compositionality can be traced back to the early 2000s: Baldwin and Villavicencio (2002) proposed chunk-based extraction methods for English verb-prepositional combinations and provided binary judgments on whether they should be considered phrasal verbs. In a follow-up paper, Baldwin et al. (2003) used the same framework to retrieve 1,710 Noun-Noun compounds from the 1996 Wall Street Journal corpus. The authors used LSA to calculate the similarity between a phrase and its components, in one of the early attempts at compositionality prediction. McCarthy et al. (2003) evaluated 116 candidate English phrasal verbs using three annotators' judgments on a scale from 0 to 10. Venkatapathy and Joshi (2005) annotated 800 verb-object collocations obtained from the British National Corpus on a scale from 1 to 6, where 1 stands for total non-compositionality and 6 for complete compositionality. The dataset developed by Reddy et al. (2011) contained 90 English noun compounds and used an average of 30 judgments to assign each phrase a compositionality score. This work provided compositionality assessments for both the phrase and its constituents, which makes it possible to relate human judgments to measures of semantic distance computed from the embeddings of a compound and of its individual parts. This dataset was later extended to 180 phrases, with two parallel sets for French and Portuguese. English Noun-Noun compounds were mapped to Noun-Prep-Noun and Noun-Adj constructions according to their grammatical equivalents. Farahmand et al. (2015) presented a considerably larger dataset of 1,042 Noun-Noun compounds annotated by 4 experts.
We should also note some works on compositionality detection datasets for non-English languages. Gurrutxaga and Alegria (2013) studied 1,200 Basque Noun-Verb collocations and addressed a classification task with three classes: idiom, collocation, and free combination. Roller et al. (2013) provided 244 German compounds with compositionality scores from 1 to 7, each averaged over 30 judgments. The PARSEME project (Savary et al., 2015) is devoted to the multilingual annotation of multiword expressions (MWEs) of arbitrary length and syntactic structure. By design, PARSEME is better suited for MWE extraction tasks than for compositionality evaluation. This dataset includes annotated verbal MWEs for several Slavic languages. Jana et al. (2019) explored the use of hyperbolic embeddings for noun compositionality detection, comparing them to Euclidean embeddings.
Most of the experiments on noun compositionality were conducted for English, and, to the best of our knowledge, there are to date no datasets for the compositionality detection task for any Slavic language that are structurally similar to those of Reddy et al. (2011) and Farahmand et al. (2015).

Table 1: Annotation agreement metrics for our dataset.

Agreement Metric         Value
Pearson's correlation    0.541
Cronbach's alpha         0.700

Data Collection
The compound phrases are collected from the Russian Universal Dependencies (UD) treebanks (Nivre et al., 2016) according to part-of-speech patterns, such as adjective (ADJ) + noun (NOUN) or noun + noun. Relying on the gold-standard UD annotations means that no preprocessing, POS tagging, or disambiguation is required. We use all Russian treebanks available in the UD project. They consist of texts from the following genres: news, nonfiction, and fiction. To extract nominal compounds, we loop over all nouns and select only those that have a noun or adjective dependent (i.e., are the "head" of another noun or adjective). We filter out infrequent compounds, and from the list of frequent compounds we randomly select 1,000 compounds to be annotated. Note that this procedure is coarse and does not rely on a more precise definition of a compound, such as the exact type of the dependency between the head and the dependent token. Each compound is lowercased and lemmatized. Stress characters are omitted. The head noun is given in the nominative case and in the singular (if it exists), and dependent adjectives are put in grammatical agreement with the head noun in case and gender, while dependent nouns remain unchanged.

[Table 2: Examples of non-compositional (0), compositional (1) and ambiguous (2) compounds.]
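The extraction loop described above can be sketched as follows. This is a simplified illustration rather than the authors' script. It reads the standard ten-column CoNLL-U format used by UD and counts (dependent lemma, head lemma, pattern) triples. The sample sentence is invented and transliterated, and the sketch keeps plain lemmas without re-inflecting adjectives to agree with the head.

```python
from collections import Counter

# Invented, transliterated sample in CoNLL-U layout (tab-separated columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
SAMPLE_CONLLU = """\
# text = Korabl vyshel v otkrytoe more.
1\tKorabl\tkorabl\tNOUN\t_\t_\t2\tnsubj\t_\t_
2\tvyshel\tvyjti\tVERB\t_\t_\t0\troot\t_\t_
3\tv\tv\tADP\t_\t_\t5\tcase\t_\t_
4\totkrytoe\totkrytyj\tADJ\t_\t_\t5\tamod\t_\t_
5\tmore\tmore\tNOUN\t_\t_\t2\tobl\t_\t_
6\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""


def read_sentences(conllu_text):
    """Yield sentences as lists of token dicts, skipping comments."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():  # skip multiword ranges and empty nodes
            continue
        head = int(cols[6]) if cols[6].isdigit() else -1
        sentence.append(
            {"id": int(cols[0]), "lemma": cols[2].lower(),
             "upos": cols[3], "head": head}
        )
    if sentence:
        yield sentence


def extract_compounds(conllu_text, dependent_pos=("ADJ", "NOUN")):
    """Count ADJ+NOUN and NOUN+NOUN pairs whose head token is a noun."""
    counts = Counter()
    for sentence in read_sentences(conllu_text):
        by_id = {tok["id"]: tok for tok in sentence}
        for tok in sentence:
            head = by_id.get(tok["head"])
            if head is None or head["upos"] != "NOUN":
                continue
            if tok["upos"] in dependent_pos:
                pattern = tok["upos"] + "+NOUN"
                counts[(tok["lemma"], head["lemma"], pattern)] += 1
    return counts


compounds = extract_compounds(SAMPLE_CONLLU)
```

A frequency filter and random sampling, as described in the text, could then be applied to `compounds`. The threshold is not stated in the text, so it is left out of the sketch.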

Annotation Setup and Agreement
Each compound phrase in the selected list is annotated by two experts according to the following schema: (0) the phrase is non-compositional; (1) the phrase is compositional; (2) the phrase is ambiguous, which means that the compositionality of the phrase depends on the context. After that, the annotators' answers are reviewed by a moderator. Out of the 1,000 randomly selected compounds, the moderator samples 220 and resolves the disagreements left by the first two annotators. We calculate the agreement metrics of the first two annotators on the dataset of 1,000 compounds. The annotators achieved substantial agreement (Pearson's correlation of 0.541 and Cronbach's alpha of 0.700; see Table 1). We note that the typical cases that are hard to annotate are compounds whose meaning tends to be compositional in a metaphorical way, e.g., "otkrytoe more" [open sea], and compounds that contain polysemous words, e.g., "hod dela" [the progress of a legal case or the course of business].
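The two agreement metrics reported in Table 1 can be computed along the following lines. This is a minimal sketch that treats the labels 0, 1, and 2 as numeric scores, which is an assumption about how the figures were obtained. The example ratings are invented.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson's correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cronbach_alpha(ratings):
    """Cronbach's alpha; `ratings` holds one score list per annotator."""
    k = len(ratings)
    item_variance = sum(pvariance(r) for r in ratings)
    totals = [sum(scores) for scores in zip(*ratings)]
    return k / (k - 1) * (1 - item_variance / pvariance(totals))


# Invented labels for six compounds (0, 1, 2 as in the annotation schema).
annotator_a = [1, 1, 0, 2, 1, 0]
annotator_b = [1, 0, 0, 2, 1, 1]
r = pearson(annotator_a, annotator_b)
alpha = cronbach_alpha([annotator_a, annotator_b])
```

For two annotators whose scores have equal variance, alpha reduces to 2r / (1 + r). With r = 0.541 this gives roughly 0.70, which is in line with the two values in Table 1.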

Dataset Description
The resulting dataset consists of 220 compound phrases, each provided with full-sentence contexts collected from the source texts. The number of contexts is not fixed, as we extract all sentences that contain the compound from the UD treebanks. So far, the contexts are not annotated.
A few examples are provided in Table 2. Table 3 presents the cross-tabulation of compound pattern and compound compositionality. The contexts are not yet used in the experiments; however, one possible direction for future work is context-based compound disambiguation. Examples of compound contexts are presented in Figure 1.

Experiments
We evaluate various methods for compositionality detection presented in previous work. For the experiments, we train a distributional semantic model (DSM) that includes embeddings not just for single words but also for compounds. We achieve this by replacing all occurrences of compounds from the proposed resource in the training corpora with single tokens composed of their parts. We use two experimental setups in our work. First, the unsupervised setup follows a method and evaluation pipeline from previous work. In this setting, we rely solely on the similarity between a compound embedding and an embedding composed from its parts using an additive function. The value of the similarity should correlate with the annotators' judgments in the proposed resource.
Second, the supervised setup treats compositionality detection as a binary classification task. We train various supervised machine learning methods on vector representations of a compound and its parts to predict the compositionality class. For this setup, we train an additional DSM without any modifications (i.e., it does not contain embeddings for compounds), and the embeddings of compound parts are obtained from this unmodified supplementary model.
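The corpus modification used to obtain compound embeddings, replacing each occurrence of a compound with a single token, can be sketched as below. The underscore joiner, the requirement that the two parts be adjacent, and the transliterated example tokens are assumptions of this illustration.

```python
def merge_compounds(tokens, compounds, joiner="_"):
    """Replace every adjacent occurrence of a known two-word compound
    with a single token built from its parts."""
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in compounds:
            merged.append(joiner.join(pair))
            i += 2  # both parts are consumed by the merged token
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Hypothetical compound list and tokenized sentence.
known_compounds = {("otkrytoe", "more")}
sentence = ["korabl", "vyshel", "v", "otkrytoe", "more"]
merged_sentence = merge_compounds(sentence, known_compounds)
```

Training an embedding model on text processed this way yields one vector for `otkrytoe_more`. An unmodified copy of the corpus still provides vectors for the individual parts.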

Table 3: The number of compositional and non-compositional compounds in our dataset.

                         Adjective-Noun   Noun-Noun   Total
Non-compositional (0)                23          10      33
Compositional (1)                    71          96     167
Ambiguous (2)                         9          11      20
Total                               103         117     220

(Segalovich, 2003). A minimum frequency count of 2 is used. We performed experiments with several sets of hyperparameters (dimensionality and number of training epochs). We found that a dimensionality of 300 and five epochs give good or the best results across all considered settings; therefore, we report results only for this set of hyperparameters.
To simplify the task, we do not consider the contextual information of compounds in the experimental evaluation. This means that ambiguity is not taken into account and only phrases with compositionality classes 1 and 0 qualify for evaluation, which leaves 200 compounds. For three of them, the models lack an embedding, which leaves 197 phrases for the experiments: 164 are compositional and 33 are non-compositional according to the annotators (approximately a 0.83 to 0.17 ratio).

Unsupervised Setup
For the unsupervised setup, we calculate a metric from previous work that measures the similarity between the embedding of a compound as a whole and an additive embedding composed of its parts. Let $w_1$, $w_2$ be the words of a given compound and $v(\cdot)$ a function yielding the vector representation of a word or compound. Then the similarity metric is the cosine between the two embeddings, $\cos\big(v(w_1 w_2),\, v(w_1 + w_2)\big)$, where the additive embedding is the normalized sum $v(w_1 + w_2) = \frac{v(w_1)}{\lVert v(w_1) \rVert} + \frac{v(w_2)}{\lVert v(w_2) \rVert}$. In addition to cosine, we use similarity measures based on distance metrics between embeddings: Chebyshev distance ($L_\infty$-norm), Manhattan distance ($L_1$-norm), and Euclidean distance ($L_2$-norm). When using these distances, instead of the normalized sum, we use simple averaging: $v(w_1 + w_2) = \frac{1}{2}\big(v(w_1) + v(w_2)\big)$.
We evaluate how well these metrics predict compositionality using the Spearman rank correlation (Spearman's ρ) between them and the compositionality class in the annotated dataset, as in previous work.
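The unsupervised scoring and its evaluation can be illustrated with the following sketch. It assumes that "normalized sum" means the sum of the unit-normalized part vectors, and it uses a hand-written Spearman correlation with average ranks for ties. The vectors and labels are invented toy data.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def additive_normalized(v1, v2):
    """Sum of unit-normalized part vectors (assumed reading of 'normalized sum')."""
    return [a + b for a, b in zip(normalize(v1), normalize(v2))]


def additive_mean(v1, v2):
    return [(a + b) / 2 for a, b in zip(v1, v2)]


def minkowski(u, v, p):
    """L_p distance; p = math.inf gives the Chebyshev distance."""
    diffs = [abs(a - b) for a, b in zip(u, v)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1 / p)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data: (compound vector, part 1, part 2, compositionality class).
data = [
    ([1.0, 1.0], [1.0, 0.0], [0.0, 1.0], 1),
    ([-1.0, 1.0], [1.0, 0.0], [0.0, 1.0], 0),
    ([1.0, 0.5], [1.0, 0.0], [0.0, 1.0], 1),
    ([1.0, -0.5], [1.0, 0.0], [0.0, 1.0], 0),
]
labels = [label for _, _, _, label in data]
cos_scores = [cosine(c, additive_normalized(a, b)) for c, a, b, _ in data]
# Distances are negated so that larger values mean "more compositional".
neg_l2 = [-minkowski(c, additive_mean(a, b), 2) for c, a, b, _ in data]
rho_cos = spearman(cos_scores, labels)
rho_l2 = spearman(neg_l2, labels)
```

Negating the distances makes the sign of the correlation comparable with the cosine-based score.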

Results and Discussion
The results of the experimental evaluation are presented in Table 5 for the unsupervised setup and in Table 4 for the supervised setup. Of the presented metrics, $L_1$, $L_2$, and $L_\infty$ show a substantial negative correlation. This can be explained by the nature of embedding vectors: the larger the distance, the further the compound is from its components semantically. If the sense of the compound differs widely from the senses of its components, it is deemed non-compositional. To be comparable with previous papers, we report a positive correlation by negating the distance. Taking this into consideration, all metrics perform comparably on the dataset. We observe a moderate yet stable correlation between similarity and compositionality class.
For the supervised classification task, precision, recall, and F1 are presented alongside the Spearman rank correlation. As non-compositional compounds are in the minority in this dataset, and detecting idiomatic phrases is of greater practical interest, we report quality metrics for class 0 (non-compositional) to assess algorithm performance. LSVC, MLP, and NB achieve a higher ρ than the unsupervised counterpart. LSVC and MLP also give relatively high recall on non-compositional examples. Overall, the linear SVC (LSVC) and the multi-layer perceptron (MLP) perform better than the other models across all metrics.
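Reporting metrics for the minority class can be made concrete with a small sketch. The helper below computes precision, recall, and F1 for a chosen target class. The gold and predicted labels are invented, and the function name is hypothetical.

```python
def class_metrics(y_true, y_pred, target=0):
    """Precision, recall, and F1 for one target class (default: class 0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented labels: 0 = non-compositional, 1 = compositional.
gold = [0, 0, 1, 1, 1, 1]
pred = [0, 1, 0, 1, 1, 1]
precision, recall, f1 = class_metrics(gold, pred)

# A classifier that always predicts the majority class finds no idioms.
majority = class_metrics(gold, [1] * len(gold))
```

With 164 compositional and 33 non-compositional phrases, an always-compositional classifier would be about 83% accurate while having zero recall on class 0. This is why class 0 metrics are more informative here than accuracy.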

Conclusion
We presented the first Russian-language dataset of annotated noun compounds, in which each compound follows one of the noun compound patterns (noun+noun or adjective+noun) and is annotated as non-compositional, compositional, or ambiguous. Ambiguous compounds can be either compositional or not, depending on the context. Each compound is provided along with its sentence contexts. The inter-annotator agreement metrics show that the annotators' judgments agree well. We investigated the performance of various algorithms from previous work and showed that the achieved evaluation metrics are comparable with state-of-the-art results for English. We hope that our resource will foster research on compositionality detection for Russian and other Slavic languages.