Automatic Semantic Classification of German Preposition Types: Comparing Hard and Soft Clustering Approaches across Features

This paper addresses an automatic classification of preposition types in German, comparing hard and soft clustering approaches and various window- and syntax-based co-occurrence features. We show that (i) the semantically most salient preposition features (i.e., subcategorised nouns) are the most successful, and that (ii) soft clustering approaches are required for the task but reveal quite different attitudes towards predicting ambiguity.

Semantic classifications are of great interest to computational linguistics, specifically regarding the pervasive problem of data sparseness in the processing of natural language. Such classifications have been used in applications such as word sense disambiguation (Dorr and Jones, 1996; Kohomban and Lee, 2005; McCarthy et al., 2007), parsing (Carroll et al., 1998; Carroll and Fang, 2004), machine translation (Prescher et al., 2000; Koehn and Hoang, 2007; Weller et al., 2014), and information extraction (Surdeanu et al., 2003; Venturi et al., 2009).
Regarding prepositions, comparatively little effort in computational semantics has gone beyond a specific choice of prepositions (such as spatial prepositions), towards a systematic classification of preposition senses, as in The Preposition Project (Litkowski and Hargraves, 2005). Distributional approaches towards preposition meaning and sense distinction have only recently started to explore salient preposition features, but with few exceptions (such as Baldwin (2006)) these approaches focused on token-based classification of preposition senses (Ye and Baldwin, 2006; O'Hara and Wiebe, 2009; Tratz and Hovy, 2009; Hovy et al., 2010; Hovy et al., 2011).
This paper addresses an automatic classification of preposition types in German, comparing various clustering approaches. We aim for an unsupervised setting that does not require predefined expensive resources, such as a token-based annotation of preposition senses. Our task is challenging, because (i) prepositions are notoriously ambiguous, (ii) interpreting preposition types out of context is more difficult than interpreting context-embedded tokens, (iii) there are no established lexical resources for type-based semantic classification other than for English, and (iv) there are no established evaluation measures for ambiguous linguistic classifications. We accept the challenges, identify salient preposition features, and demonstrate the necessity of applying soft (rather than hard) clustering in order to explore linguistic ambiguity.

Preposition Data
In the absence of any large-scale semantic hierarchical type classification, the German grammar book by Helbig and Buscha (1998)  classes that contained more than one preposition, and deleted prepositions that appeared fewer than 10,000 times in our web corpus containing 880 million words (cf. Section 2.2). This selection process resulted in 12 semantic classes covering between 2 and 27 prepositions each (cf. Table 1), and a more fine-grained version that sub-divided the three largest classes 'local', 'modal' and 'temporal' into 6, 10 and 7 sub-classes, respectively, and resulted in a total of 32 classes. The prepositions in the fine-grained version exhibit ambiguity rates of 1 (monosemous) up to 10. Out of the 49 preposition types, 23 are polysemous (46.9%).

Preposition Features
The corpus-based features for the German prepositions were induced from the SdeWaC corpus (Faaß and Eckart, 2013), a cleaned version of the German web corpus deWaC (Baroni et al., 2009) containing approx. 880 million words. We compare three categories of distributional features: (1) bag-of-words window co-occurrence features: we apply a standard bag-of-words model (BOW) relying on a window of 2 words to the left and to the right, and a continuous bag-of-words model (CBOW) using negative sampling with K = 15 (Mikolov et al., 2013); (2) direct syntactic dependency: we compare the most salient preposition-related dependencies: preposition-subcategorised nouns (nouns-dep, e.g., in Buch 'in book'), preposition-subcategorising nouns (nouns-gov, e.g., Buch von 'book by'), and preposition-subcategorising verbs (verbs-gov, e.g., reisen nach 'to travel to'); (3) 2nd-order syntactic co-occurrence: adjectives that modify nouns subcategorised by the prepositions, and adverbs that modify verbs subcategorising the prepositions.
The dependency information was extracted from a parsed version of the SdeWaC using Bohnet's MATE dependency parser (Bohnet, 2010; Scheible et al., 2013). All but the CBOW features were weighted according to positive pointwise mutual information.
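The positive pointwise mutual information weighting mentioned above can be sketched as follows. This is a minimal illustration and not the original feature pipeline: the function name `ppmi_weights`, the dictionary layout and the toy counts are assumptions made for this example only.

```python
import math
from collections import Counter

def ppmi_weights(cooc):
    """Turn raw co-occurrence counts into PPMI weights.

    cooc maps (preposition, feature) pairs to counts > 0 (hypothetical layout).
    """
    total = sum(cooc.values())
    prep_totals = Counter()
    feat_totals = Counter()
    for (prep, feat), n in cooc.items():
        prep_totals[prep] += n
        feat_totals[feat] += n
    weights = {}
    for (prep, feat), n in cooc.items():
        # PMI = log2( P(prep, feat) / (P(prep) * P(feat)) )
        pmi = math.log2(n * total / (prep_totals[prep] * feat_totals[feat]))
        # "Positive" PMI: negative associations are clipped to zero.
        weights[(prep, feat)] = max(0.0, pmi)
    return weights

# Toy counts, invented for illustration only.
toy_counts = {
    ("in", "Buch"): 8, ("in", "Haus"): 2,
    ("nach", "Buch"): 1, ("nach", "Haus"): 9,
}
weights = ppmi_weights(toy_counts)
```

In this toy example, pairs that co-occur less often than chance would predict (such as "in" with "Haus") receive a weight of zero, while the over-represented pairs keep their positive PMI value.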

Clustering Approaches
As we wanted to explore hard vs. soft clustering approaches on the same task, we chose k-Means as a standard hard clustering approach (relying on WEKA's spherical k-Means implementation), and compared it to various soft clustering approaches. We transferred the hard k-Means cluster analyses to soft cluster analyses, using two alternative methods. (1) The prep-based soft k-Means method (Springorum et al., 2013) calculated the mean cosine distance d for each preposition p to the centroids z_c of the clusters c, and assigned a preposition to a specific cluster if its distance to the respective cluster centroid was below a threshold t multiplied by the mean distance, with t = 0.05, 0.1, 0.15, ..., 0.95. Additionally, (2) we propose a hard-to-soft clustering transfer, prob-based soft k-Means, that converts the cosine distances between the prepositions and the hard cluster centroids to membership probabilities.
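The two hard-to-soft transfers can be illustrated with a short sketch. The prep-based rule follows the description above. For the prob-based variant the text does not give the conversion formula, so the normalised inverse distance used here is purely an assumption, as are all function names and the toy vectors.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def prep_based_soft(vector, centroids, t):
    """Clusters whose centroid lies closer than t times the mean centroid distance."""
    dists = [cosine_distance(vector, z) for z in centroids]
    mean_d = sum(dists) / len(dists)
    return [c for c, d in enumerate(dists) if d < t * mean_d]

def prob_based_soft(vector, centroids, eps=1e-9):
    """ASSUMPTION: memberships are normalised inverse distances.

    The paper only states that distances are converted to probabilities.
    """
    dists = [cosine_distance(vector, z) for z in centroids]
    inverse = [1.0 / (d + eps) for d in dists]
    total = sum(inverse)
    return [x / total for x in inverse]

# Toy centroids of a hard clustering and one preposition vector (invented values).
centroids = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = [0.9, 0.1]
strict = prep_based_soft(vec, centroids, t=0.5)
lenient = prep_based_soft(vec, centroids, t=0.95)
probs = prob_based_soft(vec, centroids)
```

A larger threshold t admits more clusters per preposition, which is why the choice of t directly controls how much ambiguity the prep-based method predicts.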
Instead of transferring a hard clustering to a soft clustering, we also directly applied soft clustering approaches: (1) The fuzzy c-Means algorithm extends k-Means by a cluster membership function for each preposition, f_m ∈ [0, 1]. (2) We applied Latent Semantic Clustering (LSC), an instance of the Expectation-Maximisation (EM) algorithm (Baum, 1972) for unsupervised training on unannotated data (Rooth et al., 1999). The cluster analyses define two-dimensional soft clusters (in our case: preposition-feature clusters) with cluster membership probabilities, which are able to generalise over hidden data. (3) We used Non-negative Matrix Factorisation (NMF), a factorisation approach with an inherent (soft) clustering property (Ding et al., 2005).
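To make the soft clustering property of NMF more concrete, the following sketch factorises a small non-negative preposition-feature matrix V into W and H and reads soft memberships off the rows of W. It uses the standard multiplicative update rules for the squared-error objective; whether the experiments used this NMF variant is not stated, and the row normalisation of W into memberships is an assumption of this example.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factorise non-negative v (n x m) into w (n x k) and h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h

def memberships(w, eps=1e-9):
    """ASSUMPTION: row-normalise w so that each row sums to (almost) 1."""
    return [[x / (sum(row) + eps) for x in row] for row in w]

# Toy preposition-feature matrix with two blocks of features (invented values).
V = [[5, 4, 0, 0], [4, 5, 0, 0], [0, 0, 3, 4], [0, 0, 4, 3]]
W, H = nmf(V, k=2)
M = memberships(W)
```

Each row of M can then be read as a membership distribution of one preposition over the k clusters, which is the kind of matrix that the thresholding step described next operates on.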
All variants of our hard-to-soft clustering approaches and the direct soft clustering approaches (except for k-Means/prep) resulted in a preposition-cluster membership matrix with values ∈ [0, 1]. We transferred the real-valued memberships to binary memberships by applying a threshold t to decide on cluster membership, again with t = 0.05, 0.1, 0.15, ..., 0.95. For each clustering approach and for each number of clusters k, we then identified the best threshold.
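The thresholding step and the search for the best threshold can be sketched as below. The names `binarise` and `best_threshold` are hypothetical, the choice of >= over > at the threshold is an assumption, and the toy scoring function merely stands in for the actual evaluation measure.

```python
def binarise(memberships, t):
    """Map each item to the set of clusters whose membership value reaches t.

    An item with no value reaching t ends up with an empty set.
    ASSUMPTION: the comparison is inclusive (>=).
    """
    return {item: {c for c, m in enumerate(row) if m >= t}
            for item, row in memberships.items()}

def best_threshold(memberships, score, thresholds):
    """Return the first threshold that maximises score(binarised assignment)."""
    return max(thresholds, key=lambda t: score(binarise(memberships, t)))

# Threshold grid 0.05, 0.10, ..., 0.95 as in the text.
thresholds = [round(0.05 * i, 2) for i in range(1, 20)]

# Toy membership rows and a toy score (both invented for illustration):
# the score counts prepositions that receive their gold number of senses.
toy_memberships = {"in": [0.6, 0.3, 0.1], "nach": [0.5, 0.5, 0.0]}
toy_gold_senses = {"in": 2, "nach": 2}

def toy_score(assignment):
    return sum(len(assignment[p]) == toy_gold_senses[p] for p in assignment)

chosen = best_threshold(toy_memberships, toy_score, thresholds)
```

In an actual experiment the scoring callable would be the f-score of the evaluation measure, computed separately for each clustering approach and each k.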

Evaluation
We chose the fuzzy extension of B-Cubed (Bagga and Baldwin, 1998) as evaluation measure, because it (a) is a pair-wise evaluation, which is considered most suitable for soft clustering evaluations, and (b) distinguishes between homogeneity and completeness of a clustering, and thus resembles an evaluation by precision and recall. Amigó et al. (2009) demonstrated the strengths of B-Cubed, and a similar version has been used in SemEval 2013 for Word Sense Induction (Jurgens and Klapaftis, 2013).
Pair-wise precision P determines the homogeneity of a cluster analysis, by calculating for each individual preposition p the proportion of prepositions p' in the same cluster c that also belong to the same gold-standard class g, cf. Equation (1). Pair-wise recall R determines the completeness of a cluster analysis, by calculating for each individual preposition p the proportion of prepositions p' in the same gold-standard class g that also belong to the same cluster c, cf. Equation (2). The overall B-Cubed precision and recall scores are the averages over all preposition-wise scores. We combined precision and recall by their harmonic mean, the f-score.
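As an illustration of pair-wise precision and recall for overlapping clusters, the sketch below implements the extended B-Cubed measure for overlapping clusterings described by Amigó et al. (2009). This is an assumption made for the example: the exact fuzzy variant behind Equations (1) and (2) may differ in detail. Both arguments map every preposition to a non-empty set of labels.

```python
def _directional(a, b):
    """Average per-item score, where pairs are selected by overlap in `a`."""
    per_item = []
    for e in a:
        scores = []
        for other in a:  # includes the pair (e, e) itself
            shared_a = len(a[e] & a[other])
            if shared_a:
                shared_b = len(b[e] & b[other])
                scores.append(min(shared_a, shared_b) / shared_a)
        per_item.append(sum(scores) / len(scores))
    return sum(per_item) / len(per_item)

def bcubed(clusters, gold):
    """Return (precision, recall, f-score) for two overlapping assignments.

    ASSUMPTION: clusters and gold have the same keys and no empty label sets.
    """
    precision = _directional(clusters, gold)
    recall = _directional(gold, clusters)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Toy gold standard and toy soft clustering (invented for illustration).
gold = {"in": {"local", "temporal"}, "nach": {"local", "temporal"},
        "wegen": {"causal"}}
clusters = {"in": {0, 1}, "nach": {0}, "wegen": {1}}
p, r, f = bcubed(clusters, gold)
```

In the toy example, precision suffers because "in" and "wegen" share a cluster without sharing a gold class, and recall suffers because "nach" receives only one cluster although it has two gold classes.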

Baselines
We created two baselines for our preposition clusterings. The hard baseline was computed for every number of clusters k ∈ [5, 40]. For each k, each preposition was randomly assigned to one of the k clusters, and the resulting hard cluster analysis was evaluated. The hard cluster assignments were repeated 1,000 times for each k, and the overall evaluation score for k clusters is the average score of the 1,000 runs. The soft baseline was also created by random assignment across 1,000 runs for each k, but, to integrate the fuzzy component, each preposition was assigned to n clusters, with n a random number between 1 and the number of gold-standard classes for that specific preposition. Note that this baseline is more informed than an entirely random baseline, because the information about the number of gold-standard classes for each preposition is very helpful. For example, the baseline assigns monosemous prepositions to only one cluster, and prepositions with three senses to a random number of clusters between 1 and 3.
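The two random baselines can be sketched as follows. The function names are hypothetical, the scoring function is passed in as a parameter, and two details are assumptions of this sketch: the n clusters of a preposition are drawn without repetition, and n is capped at k.

```python
import random

def hard_baseline(prepositions, k, rng):
    """Assign every preposition to exactly one random cluster out of k."""
    return {p: {rng.randrange(k)} for p in prepositions}

def soft_baseline(gold, k, rng):
    """Assign every preposition to n random clusters.

    n is drawn between 1 and the number of gold-standard classes of that
    preposition. ASSUMPTION: clusters are distinct and n is capped at k.
    """
    assignment = {}
    for p, classes in gold.items():
        n = min(k, rng.randint(1, len(classes)))
        assignment[p] = set(rng.sample(range(k), n))
    return assignment

def average_score(make_assignment, score, runs=1000):
    """Average an evaluation score over repeated random assignments."""
    return sum(score(make_assignment()) for _ in range(runs)) / runs

# Toy gold standard (invented) and a stand-in score: mean clusters per item.
toy_gold = {"in": {"local", "temporal", "modal"}, "wegen": {"causal"}}
rng = random.Random(0)
mean_size = average_score(
    lambda: soft_baseline(toy_gold, k=5, rng=rng),
    lambda a: sum(len(c) for c in a.values()) / len(a),
)
```

Replacing the stand-in score with the f-score of the evaluation measure and looping over k from 5 to 40 would yield one baseline value per number of clusters.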

Results
Figure 1 compares the fuzzy B-Cubed f-score values across the hard and soft clustering approaches, relying on the preposition-subcategorised nouns as one of the best features (cf. Figure 2). The plot demonstrates that (i) the hard k-Means clustering approach is the only one resulting in f-scores below the soft baseline, while (ii) the vast majority of soft clustering results lie above the soft baseline. Furthermore, (iii) there is a clear tendency for all soft clustering approaches to provide the best f-scores for similar numbers of clusters: 15 ≤ k ≤ 19. The overall best result is reached by NMF for a clustering with 17 clusters.
Figure 2 compares the f-scores across feature types, relying on NMF as the best clustering approach. The plot confirms that (i) across features, the vast majority of soft clustering results lie above the soft baseline. In addition, (ii) in the previously most successful range of 15 ≤ k ≤ 19 clusters, the preposition-subcategorised nouns represent the best features. (iii) The best cluster analyses relying on window vs. syntax features are similarly successful, and outperform 2nd-order co-occurrence features.
We checked the overall best cluster analysis (NMF, k = 17, nouns-dep) on the predicted degree of ambiguity (cf. Figure 3): for 23 out of the 26 monosemous prepositions, we correctly predicted one preposition sense; for 7 out of the 23 polysemous prepositions, we predicted the correct number of senses; for 9 out of the 23 polysemous prepositions, we predicted fewer senses than the gold standard defines; and for 7 out of the 23 polysemous prepositions, we predicted more senses than the gold standard defines.
Our best soft-clustering approach to the preposition classification task thus demonstrates its usefulness through quantitative B-Cubed evaluation and through reliable predictions of ambiguity.
[Figure 3: predicted degree of ambiguity per preposition for the best cluster analysis (NMF, k = 17, nouns-dep).]

Discussion
While the results in the previous section demonstrate the success of the type-based clustering, we were interested in two specific questions: (i) Where do the differences in the quality of the cluster analyses come from? (ii) Do the best cluster analyses present linguistically reliable and useful semantic classes? From a quantitative point of view, both questions have been addressed by the evaluation measure, fuzzy B-Cubed, which we chose for reasons outlined in Section 2.4. One should keep in mind, however, that there is an ongoing discussion about cluster comparison and cluster evaluation (Meila, 2007; Rosenberg and Hirschberg, 2007; Vinh and Bailey, 2010; Utt et al., 2014), which demonstrates uncertainty about an optimal measure, and which concerns us, especially regarding the linguistic aspects of soft clustering. In the following, we therefore provide qualitative analyses and discussions of the cluster approaches and analyses.
Ambiguity rate of soft-clustering approaches: We looked into the best cluster analysis for each soft-clustering approach, and checked the ambiguities. While the number of preposition types in the cluster analyses is similar across approaches (between 44 and 48), the ambiguity rate (i.e., the number of cluster assignments per preposition type) and the number of ambiguous preposition types (i.e., the number of prepositions assigned to more than one cluster) differ strongly. For example, k-Means/prob and NMF make an average of 3.1 and 3.7 assignments per preposition, respectively, in comparison to 2.2-2.4 assignments by the other approaches. On the other hand, while k-Means/prob defines almost all preposition types (43 out of 48) as ambiguous, NMF only defines 28 out of 46 prepositions as ambiguous. NMF (the best approach) thus shows a high ambiguity rate, but defines only about 60% of the prepositions as ambiguous.
Cluster sizes: Looking into the actual cluster analyses reveals that the sizes and the structures within the individual clusters differ strongly. The best k-Means/prep and k-Means/prob analyses (k = 16, F = 0.33, and k = 19, F = 0.34), for example, each contain 7 large clusters with 10-25 prepositions. All other clusters contain only 1-3 prepositions. In comparison, the best NMF analysis (k = 17, F = 0.43) contains only one cluster with three prepositions, and all other clusters but one contain ≥ 5 and ≤ 14 prepositions. The cluster sizes of the best NMF analysis are therefore more homogeneous than for other clustering approaches.
Optimal k: While fuzzy B-Cubed determined 15 ≤ k ≤ 19 as the optimal numbers of clusters for the soft-clustering approaches, we also looked into the NMF cluster analysis with k = 32, with NMF as the best approach and 32 as the number of gold-standard classes. The clusters are, again, very similar in size, including only one singleton and only one cluster with 9 prepositions. All other clusters contain 2-6 prepositions. The smaller cluster sizes allow manual evaluations. We can indeed find reliable semantic clusters, such as {an, auf, hinter, in, mit, nach, neben, um, vor}, where 7 out of 9 prepositions belong to the gold-standard class 'local: not target-oriented', which contains a total of 12 prepositions.

Conclusion
We presented variants of hard and soft clustering approaches across several sets of preposition features, to automatically classify preposition types into semantic classes.
While type-based classifications for highly ambiguous word classes are a computational challenge, our best approach (NMF-based classification with 17 clusters) reached an f-score of 0.43. The clustering experiments showed that (i) the semantically most salient preposition features are indeed the most successful, and that (ii) the clustering of highly ambiguous words requires soft rather than hard clustering approaches.
Most interestingly, a qualitative analysis zoomed into the assignment behaviour of the soft clustering approaches, and revealed different attitudes towards predicting ambiguity. NMF as the best approach predicted a high ambiguity rate but only for a restricted proportion of 60% of the preposition types. Furthermore, the distribution of cluster sizes was less skewed than for other approaches.