The Role of Modifier and Head Properties in Predicting the Compositionality of English and German Noun-Noun Compounds: A Vector-Space Perspective

In this paper, we explore the role of constituent properties in English and German noun-noun compounds (corpus frequencies of the compounds and their constituents; productivity and ambiguity of the constituents; and semantic relations between the constituents), when predicting the degrees of compositionality of the compounds within a vector space model. The results demonstrate that the empirical and semantic properties of the compounds and the head nouns play a significant role.


Introduction
The past 20+ years have witnessed extensive discussion of whether and how the modifiers and the heads of noun-noun compounds such as butterfly, snowball and teaspoon influence the compositionality of the compounds, i.e., the degree of transparency vs. opaqueness of the compounds. The discussions took place mostly in psycholinguistic research, typically relying on reading time and priming experiments. For example, Sandra (1990) demonstrated in three priming experiments that both modifier and head constituents were accessed in semantically transparent English noun-noun compounds (such as teaspoon), but there were no effects for semantically opaque compounds (such as buttercup), when primed either on their modifier or head constituent. In contrast, Zwitserlood (1994) provided evidence that the lexical processing system is sensitive to morphological complexity independent of semantic transparency. Libben and his colleagues (Libben et al. (1997), Libben et al. (2003)) were the first to systematically categorise noun-noun compounds with nominal modifiers and heads into four groups representing all possible combinations of modifier and head transparency (T) vs. opaqueness (O) within a compound. Examples for these categories were car-wash (TT), strawberry (OT), jailbird (TO), and hogwash (OO). Libben et al. confirmed Zwitserlood's analyses that both semantically transparent and semantically opaque compounds show morphological constituency; in addition, the semantic transparency of the head constituent was found to play a significant role.
From a computational point of view, addressing the compositionality of noun compounds (and multi-word expressions more generally) is a crucial ingredient for lexicography and NLP applications, which need to know whether an expression should be treated as a whole or through its constituents, and what the expression means. For example, studies such as Cholakov and Kordoni (2014), Weller et al. (2014), Cap et al. (2015), and Salehi et al. (2015b) have integrated the prediction of multi-word compositionality into statistical machine translation.
Computational approaches to automatically predict the compositionality of noun compounds have mostly been realised as vector space models, and can be subdivided into two subfields: (i) approaches that aim to predict the meaning of a compound by composite functions, relying on the vectors of the constituents (e.g., Mitchell and Lapata (2010), Coecke et al. (2011), Baroni et al. (2014), and Hermann (2014)); and (ii) approaches that aim to predict the degree of compositionality of a compound, typically by comparing the compound vectors with the constituent vectors (e.g., Reddy et al. (2011), Salehi and Cook (2013), Schulte im Walde et al. (2013), Salehi et al. (2014; 2015a)). In line with subfield (ii), this paper aims to distinguish the contributions of modifier and head properties when predicting the compositionality of English and German noun-noun compounds in a vector space model. To date, computational research on noun compounds has largely ignored the influence of constituent properties on the prediction of compositionality. Individual pieces of research noticed differences in the contributions of modifier and head constituents towards the composite functions predicting compositionality (Reddy et al., 2011; Schulte im Walde et al., 2013), but so far the roles of modifiers and heads have not been distinguished. We use a new gold standard of German noun-noun compounds annotated with corpus frequencies of the compounds and their constituents; productivity and ambiguity of the constituents; and semantic relations between the constituents; and we extend three existing gold standards of German and English noun-noun compounds (Ó Séaghdha, 2007; von der Heide and Borgwaldt, 2009; Reddy et al., 2011) to include approximately the same compound and constituent properties.
Relying on a standard vector space model of compositionality, we then predict the degrees of compositionality of the English and German noun-noun compounds, and explore the influences of the compound and constituent properties. Our empirical computational analyses reveal that the empirical and semantic properties of the compounds and the head nouns play a significant role in determining the compositionality of noun compounds.

Related Work
Regarding relevant psycholinguistic research on the representation and processing of noun compounds, Sandra (1990) hypothesised that an associative prime should facilitate access and recognition of a noun compound, if a compound constituent is accessed during processing. His three priming experiments revealed that in transparent noun-noun compounds, both constituents are accessed, but he did not find priming effects for the constituents in opaque noun-noun compounds. Zwitserlood (1994) performed an immediate partial repetition experiment and a priming experiment to explore and to distinguish morphological and semantic structures in noun-noun compounds. On the one hand, she confirmed Sandra's results that there is no semantic facilitation of any constituent in opaque compounds. On the other hand, she found evidence for morphological complexity, independent of semantic transparency, and that both transparent and also partially opaque compounds (i.e., compounds with one transparent and one opaque constituent) produce semantic priming of their constituents. For the heads of semantically transparent compounds, a larger amount of facilitation was found than for the modifiers. Differences between the results of Sandra (1990) and Zwitserlood (1994) were supposedly due to different definitions of partial opacity, and different prime-target SOAs.
Libben and his colleagues (Libben et al. (1997), Libben (1998), and Libben et al. (2003)) were the first to systematically categorise noun-noun compounds with nominal modifiers and heads into four groups representing all possible combinations of a constituent's transparency (T) vs. opaqueness (O) within a compound: TT, OT, TO, OO. Libben's examples for these categories were car-wash (TT), strawberry (OT), jailbird (TO), and hogwash (OO). They confirmed Zwitserlood's analyses that both semantically transparent and semantically opaque compounds show morphological constituency, and also that the semantic transparency of the head constituent was found to play a significant role. Studies such as  and Kehayia et al. (1999) to a large extent confirmed the insights by Libben and his colleagues for French, Bulgarian, Greek and Polish.
Regarding related computational work, prominent approaches to model the meaning of a compound or a phrase by a composite function include Mitchell and Lapata (2010), Coecke et al. (2011), Baroni et al. (2014), and Hermann (2014). In this area, researchers combine the vectors of the compound/phrase constituents by mathematical functions such that the resulting vector optimally represents the meaning of the compound/phrase. This research is only marginally related to ours, since we are interested in the degree of compositionality of a compound, rather than its actual meaning.
Most closely related computational work includes distributional approaches that predict the degree of compositionality of a compound regarding a specific constituent, by comparing the compound vector to the respective constituent vector.

Noun-Noun Compounds
Our focus of interest is on noun-noun compounds, such as butterfly, snowball and teaspoon as well as car park, zebra crossing and couch potato in English, and Ahornblatt 'maple leaf', Feuerwerk 'fireworks', and Löwenzahn 'dandelion' in German, where both the grammatical head (in English and German, this is typically the rightmost constituent) and the modifier are nouns. We are interested in the degrees of compositionality of noun-noun compounds, i.e., the semantic relatedness between the meaning of a compound (e.g., snowball) and the meanings of its constituents (e.g., snow and ball). More specifically, this paper aims to explore factors that have been found to influence compound processing and representation, such as
• frequency-based factors, i.e., the frequencies of the compounds and their constituents (van Jaarsveld and Rattink, 1988; Janssen et al., 2008);
• the productivity (morphological family size), i.e., the number of compounds that share a constituent (de Jong et al., 2002); and
• semantic variables such as the relationship between compound modifier and head: a teapot is a pot FOR tea; a snowball is a ball MADE OF snow (Gagné and Spalding, 2009; Ji et al., 2011).
In addition, we were interested in the effect of ambiguity (of both the modifiers and the heads) regarding the compositionality of the compounds. Our explorations required gold standards of compounds that were annotated with all these compound and constituent properties. Since most previous work on computational predictions of compositionality has been performed for English and for German, we decided to re-use existing datasets for both languages, which however required extensions to provide all properties we wanted to take into account. We also created a novel gold standard. In the following, we describe the datasets.
German Noun-Noun Compound Datasets
As basis for this work, we created a novel gold standard of German noun-noun compounds: GhoST-NN (Schulte im Walde et al., 2016). The new gold standard was built such that it includes a representative choice of compounds and constituents from various frequency ranges, various productivity ranges, with various numbers of senses, and with various semantic relations. In the following, we describe the creation process in some detail, because the properties of the gold standard are highly relevant for the distributional models.
Relying on the 11.7 billion words in the web corpus DECOW14AX (Schäfer and Bildhauer, 2012; Schäfer, 2015), we extracted all words that were identified as common nouns by the TreeTagger (Schmid, 1994) and analysed as noun compounds with exactly two nominal constituents by the morphological analyser SMOR (Faaß et al., 2010). This set of 154,960 two-part noun-noun compound candidates was enriched with empirical properties relevant for the gold standard:
• corpus frequencies of the compounds and the constituents (i.e., modifiers and heads), relying on DECOW14AX;
• productivity of the constituents, i.e., how many compound types contained a specific modifier/head constituent;
• number of senses of the compounds and the constituents, relying on GermaNet (Hamp and Feldweg, 1997; Kunze, 2000).
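The productivity property listed above amounts to a type count per constituent. The following is a minimal sketch of that count, not the extraction pipeline used for the gold standard: the `(compound, modifier, head)` triple layout, the function name and the lemma forms in the toy data are assumptions made for illustration.

```python
from collections import Counter

def constituent_productivity(compounds):
    """Count how many distinct compound types contain each modifier and head.

    `compounds` is an iterable of (compound, modifier, head) triples; this
    layout is an assumption of the sketch, not a format from the paper.
    """
    types = set(compounds)  # productivity counts types, not corpus tokens
    modifier_productivity = Counter(modifier for _, modifier, _ in types)
    head_productivity = Counter(head for _, _, head in types)
    return modifier_productivity, head_productivity

# Toy German compounds; the constituent lemmas are illustrative only.
toy = [
    ("Haarpracht", "Haar", "Pracht"),
    ("Haarwäsche", "Haar", "Wäsche"),
    ("Haarpflege", "Haar", "Pflege"),
    ("Blütenpracht", "Blüte", "Pracht"),
    ("Farbenpracht", "Farbe", "Pracht"),
    ("Haarpracht", "Haar", "Pracht"),  # repeated token, counted once
]
mod_prod, head_prod = constituent_productivity(toy)
print(mod_prod["Haar"], head_prod["Pracht"])
```

Counting over a set keeps the measure independent of corpus frequency, which is tracked as a separate property.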
From the set of compound candidates we extracted a random subset that was balanced for
• the productivity of the modifiers: we calculated tertiles to identify modifiers with low/mid/high productivity;
• the ambiguity of the heads: we distinguished between heads with 1, 2 and >2 senses.
For each of the resulting nine categories (three productivity ranges × three ambiguity ranges), we randomly selected 20 noun-noun compounds from our candidate set, disregarding compounds with a corpus frequency < 2,000, and disregarding compounds containing modifiers or heads with a corpus frequency < 100. We refer to this dataset of 180 compounds balanced for modifier productivity and head ambiguity as GhoST-NN/S. We also created a subset of 5 noun-noun compounds for each of the 9 criteria combinations, by randomly selecting 5 out of the 20 selected compounds in each category. This small, balanced subset was then systematically extended by adding all compounds from the original set of compound candidates with either the same modifier or the same head as any of the selected compounds. Taking Haarpracht as an example (the modifier is Haar 'hair', the head is Pracht 'glory'), we added Haarwäsche, Haarkleid, Haarpflege, etc. as well as Blütenpracht, Farbenpracht, etc. We refer to the resulting dataset of 868 compounds as GhoST-NN/XL. The extension destroyed the coherent balance of criteria underlying our random extraction, but in return ensured a variety of compounds with either the same modifiers or the same heads.
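The balanced extraction described above can be sketched as stratified random sampling over a 3 × 3 grid. This is a hedged illustration, not the authors' procedure: the dictionary keys, the tertile cut-off convention, computing tertiles before the frequency filter, and skipping heads without any listed sense are all assumptions of the sketch.

```python
import random
from collections import defaultdict

def tertile_label(value, sorted_values):
    """Label `value` as low/mid/high relative to the tertile cut-offs."""
    n = len(sorted_values)
    lower, upper = sorted_values[n // 3], sorted_values[(2 * n) // 3]
    if value < lower:
        return "low"
    return "mid" if value < upper else "high"

def ambiguity_label(n_senses):
    """Map a sense count to the three ambiguity ranges (1, 2, >2)."""
    if n_senses < 1:
        return None  # assumption: heads without a listed sense are skipped
    return {1: "1", 2: "2"}.get(n_senses, ">2")

def balanced_sample(candidates, per_cell=20, min_compound_freq=2000,
                    min_constituent_freq=100, seed=0):
    """Draw up to `per_cell` compounds per productivity x ambiguity cell."""
    rng = random.Random(seed)
    # Assumption: tertiles are computed over all candidates before filtering.
    productivities = sorted(c["mod_productivity"] for c in candidates)
    cells = defaultdict(list)
    for c in candidates:
        if c["freq"] < min_compound_freq:
            continue
        if min(c["mod_freq"], c["head_freq"]) < min_constituent_freq:
            continue
        ambiguity = ambiguity_label(c["head_senses"])
        if ambiguity is None:
            continue
        productivity = tertile_label(c["mod_productivity"], productivities)
        cells[(productivity, ambiguity)].append(c)
    return {cell: rng.sample(items, min(per_cell, len(items)))
            for cell, items in cells.items()}

# Synthetic candidates: 900 compounds, productivities 1..30, 1..3 senses.
candidates = [
    {"compound": f"c{i}", "freq": 5000, "mod_freq": 500, "head_freq": 500,
     "mod_productivity": i % 30 + 1, "head_senses": i % 3 + 1}
    for i in range(900)
]
sample = balanced_sample(candidates)
print(len(sample), sum(len(items) for items in sample.values()))
```

With nine populated cells and 20 draws per cell, the synthetic run yields 180 compounds, matching the size of the balanced set described in the text.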
The two sets of compounds (GhoST-NN/S and GhoST-NN/XL) were annotated with the semantic relations between the modifiers and the heads, and compositionality ratings. Regarding semantic relations, we applied the relation set suggested by Ó Séaghdha (2007), because (i) he had evaluated his annotation relations and annotation scheme, and (ii) his dataset had a similar size to ours, so we could aim for comparing results across languages. Ó Séaghdha (2007) himself had relied on a set of nine semantic relations suggested by Levi (1978), and designed and evaluated a set of relations that took over four of Levi's relations (BE, HAVE, IN, ABOUT) and added two relations referring to event participants (ACTOR, INST(rument)) that replaced the relations MAKE, CAUSE, FOR, FROM, USE. An additional relation LEX refers to lexicalised compounds where no relation can be assigned. Three native speakers of German annotated the compounds with these seven semantic relations.
Regarding compositionality ratings, eight native speakers of German annotated all 868 gold-standard compounds with compound-constituent compositionality ratings on a scale from 1 (definitely semantically opaque) to 6 (definitely semantically transparent). Another five native speakers provided additional annotation for our small core subset of 180 compounds on the same scale. As final compositionality ratings, we use the mean compound-constituent ratings across the 13 annotators.
As an alternative gold standard for German noun-noun compounds, we used a dataset based on a selection of noun compounds by von der Heide and Borgwaldt (2009) that was previously used in computational models predicting compositionality (Schulte im Walde et al., 2013; Salehi et al., 2014). The dataset contains a subset of their compounds including 244 two-part noun-noun compounds, annotated with compositionality ratings on a scale between 1 and 7. We enriched the existing dataset with frequencies, and productivity and ambiguity scores, also based on DECOW14AX and GermaNet, to provide the same empirical information as for the GhoST-NN datasets. We refer to this alternative German dataset as VDHB.
English Noun-Noun Compound Datasets
Reddy et al. (2011) created a gold standard for English noun-noun compounds. Assuming that compounds whose constituents appeared either as their hypernyms or in their definitions tend to be compositional, they induced a candidate compound set with various degrees of compound-constituent relatedness from WordNet (Miller et al., 1990; Fellbaum, 1998) and Wiktionary. A random choice of 90 compounds that appeared with a corpus frequency > 50 in the ukWaC corpus (Baroni et al., 2009) constituted their gold-standard dataset and was annotated with compositionality ratings. Bell and Schäfer (2013) annotated the compounds with semantic relations using all of Levi's original nine relation types: CAUSE, HAVE, MAKE, USE, BE, IN, FOR, FROM, ABOUT. We refer to this dataset as REDDY.
Ó Séaghdha developed computational models to predict the semantic relations between modifiers and heads in English noun compounds (Ó Séaghdha, 2008; Ó Séaghdha and Copestake, 2013; Ó Séaghdha and Korhonen, 2014). As gold-standard basis for his models, he created a dataset of compounds, and annotated the compounds with semantic relations: He tagged and parsed the written part of the British National Corpus using RASP (Briscoe and Carroll, 2002), and applied a simple heuristic to induce compound candidates: He used all sequences of two or more common nouns that were preceded or followed by sentence boundaries or by words not representing common nouns. Of these compound candidates, a random selection of 2,000 instances was used for relation annotation (Ó Séaghdha, 2007) and classification experiments. The final gold standard is a subset of these compounds, containing 1,443 noun-noun compounds. We refer to this dataset as OS.
Both English compound datasets were enriched with frequencies and productivities, based on the ENCOW14AX corpus (http://corporafromtheweb.org/encow14/), which contains 9.6 billion words. We also added the number of senses of the constituents to both datasets, using WordNet. In addition, we collected compositionality ratings for a random choice of 396 compounds from the OS dataset relying on eight experts, in the same way as the GhoST-NN ratings were collected. Table 1 summarises the gold-standard datasets. They are of different sizes, but their empirical and semantic annotations have been aligned to a large extent, using similar corpora, relying on WordNets and similar semantic relation inventories based on Levi (1978).

VSMs Predicting Compositionality
Vector space models (VSMs) and distributional information have been a steadily increasing, integral part of lexical semantic research over the past 20 years (Turney and Pantel, 2010): They explore the notion of "similarity" between a set of target objects, typically relying on the distributional hypothesis (Harris, 1954; Firth, 1957) to determine co-occurrence features that best describe the words, phrases, sentences, etc. of interest.
In this paper, we use VSMs in order to model compounds as well as constituents by distributional vectors, and we determine the semantic relatedness between the compounds and their modifier and head constituents by measuring the distance between the vectors. We assume that the closer a compound vector and a constituent vector are to each other, the more compositional (i.e., the more transparent) the compound is, regarding that constituent. Correspondingly, the more distant a compound vector and a constituent vector are to each other, the less compositional (i.e., the more opaque) the compound is, regarding that constituent.
Our main questions regarding the VSMs are concerned with the influence of constituent properties on the prediction of compositionality. I.e., how do the corpus frequencies of the compounds and their constituents, the productivity and the ambiguity of the constituents, and the semantic relations between the constituents influence the quality of the predictions?

Vector Space Models (VSMs)
We created a standard vector space model for all our compounds and constituents in the various datasets, using co-occurrence frequencies of nouns within a sentence-internal window of 20 words to the left and 20 words to the right of the targets. The frequencies were induced from the German and English COW corpora, and transformed to local mutual information (LMI) values (Evert, 2005).
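The construction above can be illustrated with a small sketch that counts noun co-occurrences in a sentence-internal window and converts them to LMI scores, using the common definition LMI = O · log(O/E) with O the observed and E the expected co-occurrence frequency. Several details are assumptions of the sketch rather than facts about the original models: the `(lemma, pos)` sentence format, the tag name `"NN"`, the base-2 logarithm, computing marginals from the target-restricted matrix, and keeping negative scores.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, targets, window=20, noun_tag="NN"):
    """Count nouns within `window` tokens to the left and right of each target.

    Each sentence is a list of (lemma, pos) pairs, so a window never
    crosses a sentence boundary. The tag name "NN" is an assumption.
    """
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, (lemma, _) in enumerate(sentence):
            if lemma not in targets:
                continue
            start, end = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(start, end):
                context, pos = sentence[j]
                if j != i and pos == noun_tag:
                    counts[lemma][context] += 1
    return counts

def lmi_transform(counts):
    """Turn raw co-occurrence counts into LMI scores: O * log2(O / E)."""
    row_totals = {t: sum(ctx.values()) for t, ctx in counts.items()}
    col_totals = Counter()
    for ctx in counts.values():
        col_totals.update(ctx)
    total = sum(row_totals.values())
    vectors = {}
    for target, ctx in counts.items():
        vectors[target] = {}
        for context, observed in ctx.items():
            expected = row_totals[target] * col_totals[context] / total
            # Scores below zero are kept here; clipping them is a common variant.
            vectors[target][context] = observed * math.log2(observed / expected)
    return vectors

sentences = [
    [("Schnee", "NN"), ("fallen", "VV"), ("Ball", "NN")],
    [("Schneeball", "NN"), ("Schnee", "NN"), ("Ball", "NN")],
]
counts = cooccurrence_counts(sentences, {"Schneeball", "Schnee", "Ball"})
vectors = lmi_transform(counts)
print(counts["Schnee"]["Ball"], sorted(vectors["Schneeball"]))
```

Sparse dictionaries keep the sketch readable; at the scale of a web corpus a sparse matrix representation would be the more practical choice.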
Relying on the LMI vector space models, we used the cosine to determine the distributional similarity between the compounds and their constituents, which was in turn used to predict the degree of compositionality between the compounds and their constituents, assuming that the stronger the distributional similarity (i.e., the higher the cosine value), the larger the degree of compositionality. The vector space predictions were evaluated against the mean human ratings on the degree of compositionality, using the Spearman Rank-Order Correlation Coefficient ρ (Siegel and Castellan, 1988).
Table 2 presents the overall prediction results across languages and datasets. The mod column shows the ρ correlations for predicting only the degree of compositionality of compound-modifier pairs; the head column shows the ρ correlations for predicting only the degree of compositionality of compound-head pairs; and the both column shows the ρ correlations for predicting the degree of compositionality of compound-modifier and compound-head pairs at the same time.
Table 2: Overall prediction results (ρ).
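The prediction and evaluation steps (cosine between a compound vector and a constituent vector, then Spearman's ρ against the mean human ratings) can be sketched in plain Python as follows. The sparse-dictionary vector format and the numbers in the usage lines are hypothetical and serve only to show the mechanics.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(weight * v[key] for key, weight in u.items() if key in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical cosine predictions and mean ratings for four compound-head pairs.
predictions = [0.82, 0.15, 0.40, 0.61]
ratings = [5.4, 2.3, 1.9, 4.8]
print(round(spearman_rho(predictions, ratings), 3))
```

Because ρ depends only on ranks, the different rating scales of the gold standards (1 to 6 vs. 1 to 7) do not need to be normalised before evaluation.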

Overall VSM Prediction Results
The models for VDHB and REDDY represent replications of similar models in Schulte im Walde et al. (2013) and Reddy et al. (2011), respectively, but using the much larger COW corpora.
Overall, the both prediction results on VDHB are significantly better than all others but REDDY; and the prediction results on OS compounds are significantly worse than all others. All significance tests in this paper were performed by Fisher r-to-z transformation. We can also compare within-dataset results: Regarding the two GhoST-NN datasets and the REDDY dataset, the VSM predictions for the compound-head pairs are better than for the compound-modifier pairs. Regarding the VDHB and the OS datasets, the VSM predictions for the compound-modifier pairs are better than for the compound-head pairs. These differences do not depend on the language (according to our datasets), and are probably due to properties of the specific gold standards that we did not control. They are, however, also not the main point of this paper.
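A Fisher r-to-z comparison of two independent correlation coefficients can be sketched as below. The source does not state whether the tests were one- or two-tailed, or whether a Spearman-specific variance correction was applied; the sketch assumes the plain two-tailed test with standard error sqrt(1/(n1−3) + 1/(n2−3)), and the input values are hypothetical.

```python
import math

def fisher_r_to_z_test(r1, n1, r2, n2):
    """Compare two independent correlations via Fisher's r-to-z transformation.

    Returns the z statistic and a two-tailed p-value (two-tailed testing is
    an assumption of this sketch).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher transformation
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p

# Hypothetical correlations from two subsets of 60 compounds each.
z, p = fisher_r_to_z_test(0.6, 60, 0.3, 60)
print(round(z, 2), round(p, 3))
```

The test requires more than three items per sample, since the standard error is undefined otherwise.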

Influence of Compound Properties on VSM Prediction Results
Figures 1 to 5 present the core results of this paper: They explore the influence of compound and constituent properties on predicting compositionality. Since we wanted to optimise insight into the influence of the properties, we selected the 60 maximum instances and the 60 minimum instances for each property. For example, to explore the influence of head frequency on the prediction quality, we selected the 60 most frequent and the 60 least frequent compound heads from each gold-standard resource, and calculated Spearman's ρ for each set of 60 compounds with these heads.
Figure 1 shows that the distributional model predicts high-frequency compounds (red bars) better than low-frequency compounds (blue bars), across datasets. The differences are significant for GhoST-NN/XL.
Figure 1: Effect of compound frequency.
Figure 2 shows that the distributional model predicts compounds with low-frequency heads better than compounds with high-frequency heads (right panel), while there is no tendency regarding the modifier frequencies (left panel). The differences regarding the head frequencies are significant (p = 0.1) for both GhoST-NN datasets.
Figure 3 shows that the distributional model also predicts compounds with low-productivity heads better than compounds with high-productivity heads (right panel), while there is no tendency regarding the productivities of modifiers (left panel). The prediction differences regarding the head productivities are significant for GhoST-NN/S (p < 0.05).
Figure 4 shows that the distributional model also predicts compounds with low-ambiguity heads better than compounds with high-ambiguity heads (right panel), with one exception (GhoST-NN/XL), while there is no tendency regarding the ambiguities of modifiers (left panel). The prediction differences regarding the head ambiguities are significant for GhoST-NN/XL (p < 0.01).
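The subset comparison described above (correlating predictions and ratings separately for the lowest and highest instances of a property) can be sketched as follows. The item keys `prediction` and `rating`, the handling of ties at the cut-off (left to the sort order) and the refusal to build overlapping subsets are assumptions of this sketch; the Spearman helper is repeated so that the block runs on its own.

```python
import math

def _ranks(values):
    """Ranks from 1..n with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = _ranks(xs), _ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in rx)
                           * sum((b - my) ** 2 for b in ry))

def extreme_subset_correlations(items, property_key, k=60):
    """Spearman's rho for the k lowest and k highest items of one property."""
    if len(items) < 2 * k:
        raise ValueError("need at least 2 * k items for disjoint subsets")
    ranked = sorted(items, key=lambda item: item[property_key])
    subsets = {"low": ranked[:k], "high": ranked[-k:]}
    return {
        name: spearman_rho([item["prediction"] for item in subset],
                           [item["rating"] for item in subset])
        for name, subset in subsets.items()
    }

# Synthetic data: predictions track the ratings only for low head frequencies.
items = [
    {"head_freq": i, "prediction": float(i),
     "rating": float(i) if i < 100 else float(-i)}
    for i in range(200)
]
result = extreme_subset_correlations(items, "head_freq")
print(result)
```

The two ρ values returned for a property can then be compared with a significance test for independent correlations.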

Discussion
While modifier frequency, productivity and ambiguity did not show a consistent effect on the predictions, head frequency, productivity and ambiguity influenced the predictions such that the prediction quality for compounds with low-frequency, low-productivity and low-ambiguity heads was better than for compounds with high-frequency, high-productivity and high-ambiguity heads. The differences were significant only for our new GhoST-NN datasets. In addition, the compound frequency also had an effect on the predictions, with high-frequency compounds receiving better prediction results than low-frequency compounds. Finally, the quality of predictions also differed for compound relation types, with BE compounds predicted best, and ACTOR compounds predicted worst. These differences were ascertained mostly in the GhoST-NN and the OS datasets. Our results raise two main questions:
(1) What does it mean if a distributional model predicts a certain subset of compounds (with specific properties) "better" or "worse" than other subsets?
(2) What are the implications for (a) psycholinguistic and (b) computational models regarding the compositionality of noun compounds?
Regarding question (1), there are two possible explanations for why a distributional model predicts a certain subset of compounds better or worse than other subsets. On the one hand, one of the underlying gold-standard datasets could contain compounds whose compositionality scores are easier to predict than the compositionality scores of compounds in a different dataset. On the other hand, even if there were differences in individual dataset pairs, this would not explain why we consistently find modelling differences for head constituent properties (and compound properties) but not for modifier constituent properties. We therefore conclude that the effects of compound and head properties are due to the compounds' morphological constituency, with specific emphasis on the influences of the heads.
Looking at the individual effects of the compound and head properties that influence the distributional predictions, we hypothesise that high-frequency compounds are easier to predict because they have a better corpus coverage (and less sparse data) than low-frequency compounds, and that they contain many clearly transparent compounds (such as Zitronensaft 'lemon juice'), and at the same time many clearly opaque compounds (such as Eifersucht 'jealousy', where the literal translations of the constituents are 'eagerness' and 'addiction'). Concerning the decrease in prediction quality for more frequent, more productive and more ambiguous heads, we hypothesise that all of these properties are indicators of ambiguity, and the more ambiguous a word is, the more difficult it is to provide a unique distributional prediction, as distributional co-occurrence in most cases (including our current work) subsumes the contexts of all word senses within one vector. For example, more than half of the compounds with the most frequent and also with the most productive heads have the head Spiel, which has six senses in GermaNet and covers six relations (BE, IN, INST, ABOUT, ACTOR, LEX).
Regarding question (2), the results of our distributional predictions confirm psycholinguistic research that identified morphological constituency in noun-noun compounds: Our models clearly distinguish between properties of the whole compounds, properties of the modifier constituents, and properties of the head constituents. Furthermore, our models reveal the need to carefully balance the frequencies and semantic relations of target compounds, and to carefully balance the frequencies, productivities and ambiguities of their head constituents, in order to optimise experiment interpretations, while a careful choice of empirical modifier properties seems to play a minor role.
For computational models, our work provides similar implications. We demonstrated the need to carefully balance gold-standard datasets for multiword expressions according to the empirical and semantic properties of the multi-word expressions themselves, and also according to those of the constituents. In the case of noun-noun compounds, the properties of the nominal modifiers were of minor importance, but regarding other multi-word expressions, this might differ. If datasets are not balanced for compound and constituent properties, the qualities of model predictions are difficult to interpret, because it is not clear whether biases in empirical properties skewed the results. Our advice is strengthened by the fact that most significant differences in prediction results were demonstrated for our new gold standard, which includes compounds across various frequency, productivity and ambiguity ranges.

Conclusion
We explored the role of constituent properties in English and German noun-noun compounds, when predicting compositionality within a vector space model. The results demonstrated that the empirical and semantic properties of the compounds and the head nouns play a significant role. Therefore, psycholinguistic experiments as well as computational models are advised to carefully balance their selections of compound targets according to compound and constituent properties.