Improving Zero-Shot-Learning for German Particle Verbs by using Training-Space Restrictions and Local Scaling

Recent models in distributional semantics consider derivational patterns (e.g., use → use + f ul ) as the result of a compositional process, where base term and afﬁx are combined. We exploit such models for German particle verbs (PVs), and focus on the task of learning a mapping function between base verbs and particle verbs. Our models apply particle-verb motivated training-space restrictions relying on nearest neighbors, as well as recent advances from zero-shot-learning. The models improve the mapping between base terms and derived terms for a new PV derivation dataset, and also across existing derivation datasets for German and English.


Introduction
were the first to apply distributional semantic models (DSMs) to the task of deriving the meaning of morphologically complex words from their parts. They relied on high-dimensional vector representations to model the derived term (e.g., useful) as a result of a compositional process that combines the meanings of the base term (e.g., to use) and the affix (e.g., ful). For evaluation, they compared the predicted vector of the complex word with the original, corpus-based vector.
More recently, Kisselew et al. (2015) put the task of modeling derivation into the perspective of zero-shot-learning: instead of using cosine similarities they predicted the derived term by learning a mapping function between the base term and the derived term. Once the predicted vector was computed, a nearest neighbor search was applied to validate if the prediction corresponded to the derived term. In zero-shotlearning the task is to predict novel values, i.e., values that were never seen in training. More formally, zero-shot-learning trains a classifier f : X → Y that predicts novel values for Y (Palatucci et al., 2009). It is often applied across vector spaces, such as different domains (Mikolov et al., 2013;. The experiments by Kisselew et al. (2015) were performed over six derivational patterns for German (cf. Table 1), including particle verbs (PVs) with two different particle prefixes (an and durch), which were particularly difficult to predict. PVs such as anfangen (to start) are compositions of a base verb (BV) such as fangen (to catch) and a verb particle such as an. Predicting PV meaning is challenging because German PVs are highly productive (Springorum et al., 2013b;Springorum et al., 2013a), and the particles are notoriously ambiguous (Lechler and Roßdeutscher, 2009;Haselbach, 2011;Kliche, 2011;Springorum, 2011). Furthermore, the particles often trigger meaning shifts when they combine with base verbs (Springorum et al., 2013b), so the resulting PVs represent frequent cases of non-literal meaning.
In this paper, we focus on predicting the meanings of German PV derivations. Our models provide two contributions to the research field of predicting derivations: (i) We suggest a novel idea of restricting the available training data, which has a positive impact on the mapping quality. (ii) We integrate a correction method for popular nearest neighbors into our models, so-called hubs (Radovanović et al., 2010), to improve the prediction quality.   Table 2: New German PV derivation dataset.

Prediction Experiments
As in Kisselew et al. (2015), we treat every derivation type as a specific learning problem: we take a set of word pairs with a particular derivation pattern (e.g., "-in", Bäcker::Bäckerin), and divide this set into training and test pairs by performing 10-fold cross-validation. For the test pairs, we predict the vectors of the derived terms (e.g., −−−−−−→ Bäckerin). The search space includes all corpus words across parts-of-speech, except for the base term. The performance is measured in terms of recall-out-of-5 (McCarthy and Navigli, 2009), counting how often the correct derived term is found among the five nearest neighbors of the predicted vector.

Derivation Datasets
We created a new collection of German particle verb derivations 1 relying on the same resource as Kisselew et al. (2015), the semiautomatic derivational lexicon for German DE-rivBase (Zeller et al., 2013). From DErivBase, we induced all pairs of base verbs and particle verbs across seven different particles. Nonexisting verbs were manually filtered out. In total, our collection contains 1 410 BV-PV combinations across seven particles, cf. Table 2.
In addition, we apply our models to two existing collections for derivational patterns, the German dataset from Kisselew et al. (2015), comprising six derivational patterns with 80 in-stances each (cf. Table 1), and the English dataset from Lazaridou et al. (2013), comprising 18 derivational patterns (3 prefixes and 15 suffixes) and 7 449 instances (cf.

Word Embedding Vectors
We relied on the German and English COW web corpora 2 (Schäfer and Bildhauer, 2012) to obtain vector representations. The corpora contain 20 billion words and 9 billion words, respectively. We parsed the corpora using state-of-the-art pipelines integrating the MarMoT tagger and the MATE parser (Müller et al., 2013;Bohnet, 2010), and induced window co-occurrences for all corpus lemma-POS pairs and co-occurring nouns, verbs and adjectives in a 5-lemma window. We then created 400-dimensional word representations using the hyperwords toolkit (Levy et al., 2015), with context distribution smoothing of 0.75 and positive point-wise mutual information weighting together with singular value decomposition. The resulting vector space models contain approximately 460 000 lemmas for German and 240 000 lemmas for English.

Baseline
A baseline method that simply guesses the derived term has a chance of approx.
1 460 000 for German and 1 240 000 for English to predict the correct term. We thus apply a more informed baseline, the same as in Kisselew et al. (2015), and predict the derived term at exactly the same position as the base term.

Additive Method (AvgAdd)
AvgAdd is a re-implementation of the best method in Kisselew et al. (2015): 3 For each affix, the method learns a difference vector by computing the dimension-wise differences between the vector representations of base term A and derived term B . The method thus learns a centroid c for all relevant training pairs (N ) with the same affix: For each PV test instance with this affix, the learned centroid vector is added dimensionwise to the vector representation of the base term to predict a position for the derived term.

Restricting the Training Space (BestAdd)
Avg-Add learns a vector representation based on the full available training data for each derivational pattern. In this paper, we suggest a method BestAdd k that restricts the training items of a given base term to those BV-PV training instances that include the k nearest base verbs (using k = 1, 3, 5) according to their cosine. The motivation for our adjusted method relies on the observation that particles are very ambiguous and thus differ in their meanings across particle verbs. For example, the meanings of 'an' include a directed contact as in sprechen::ansprechen (to speak/to speak to s.o.) and in schreiben::anschreiben (to write/to write to s.o.), and also a start of an action as in spielen::anspielen (to play/to start playing) and in stimmen::anstimmen (to pitch/to start singing). We assume that base verbs that are distributionally similar also behave in a similar way when combined with a specific particle, and that a more restricted training set that is however specified for BV semantics outperforms a larger training set across wider BV meanings.

3CosMul
We also re-implemented 3CosMul (Levy and Goldberg, 2014), a method that has been proven successful in solving analogy tasks, such as man (A) is to ki ng (B) as woman (C) is to queen (D). 3CosMul does not explicitly predict a position in space but selects a target D in space that is close to B and C but not close to A. We applied 3Cos-Mul by always using the most similar training instance (as for BestAdd with k = 1).

Local Scaling
All methods introduced in the previous section perform a nearest neighbor search at the predicted position. We suggest to improve the prediction quality at this stage by mitigating the hubness problem . Hubs are objects in vector space that are likely to appear disproportionately often among nearest neighbors, without necessarily being semantically related. Hubness has been shown an intrinsic problem of high-dimensional spaces (Tomasev, 2014). In order to reduce hubness, three unsupervised methods to re-scale the high-dimensional distances have been proposed (Schnitzer et al., 2014): local scaling, global scaling, and shared nearest neighbors. We focus on a local scaling (LS) type of hubness-correcting distance measure, namely the non-iterative contextual measure N I (Jégou et al., 2007): N I relies on the average distance µ of x and y to their k nearest neighbors. It increases the similarity between x and y in cases where we observe low average similarities between x, y and its k nearest neighbors. Intuitively, if a word x is not even close to its nearest neighbors but comparably close to y then we increase the similarity between x and y. For 3CosMul, we adapt local scaling by scaling over the neighborhood information for all four parts (A, B, C and D) in the analogy: Table 4 presents macro-averaged recall-out-of-5 scores, giving equal weight to each derivation regardless of the number of instances. Across the three datasets, the default results (i.e., without local scaling) obtained with our novel method  Table 4: Macro-averaged recall-out-of-5 across methods, with and without local scaling N I 15 . BestAdd (with k = {3, 5}) are significantly 4 above AvgAdd (p < 0.01), the previously best method for the existing German and English datasets. BestAdd with k = 1 and 3CosMul perform at a similar level than AvgAdd, but for our new PV derivation dataset do not even outperform the baseline. Restricting the training process to a small selection of nearest neighbors therefore has a positive impact on the prediction quality. Furthermore, local scaling relying on k = 15 nearest neighbors (N I 15 ) improves the prediction results in all but one cases. These improvements are however not significant.

BestAdd and Local Scaling
The results in Table 4 also demonstrate that predicting particle verbs is the most challenging derivation task, as the results are significantly lower than for the other two datasets. Figure 1 once more illustrates the recall-out-of-5 results for our new PV dataset. In the following, we zoom into dataset derivation types.

Improvement across Derivation Types and Languages
Figures 2 to 4 break down the results from Table 4 across the German and English derivation types. 4 Significance relies on χ 2 .
The blue bars show the BestAdd 3 results, and the green stacked bars represent the additional gain using local scaling (NI 15 ). The yellow points correspond to baseline performance, and the dotted black lines to the AvgAdd results.
We can see that BestAdd 3 not only outperforms the previously best method AvgAdd on average but also for each derivation type. Also, local scaling provides an additional positive impact for all but one particle type in German, ab-, and for all but three derivation types in English, -able, -al, -less. At the same time, we can see that the impact of local scaling is different across derivation types. For example, looking into the data we observe that mit PVs are often wrongly mapped to nouns, and BestAdd and local scaling correct this behavior: The nearest neighbors of the verb erledigen (to manage sth.) with BestAdd 3 are Botengang (errand), Haushaltsarbeit (domestic work), Hausmeisterarbeit (janitor work), and further six compounds with the nominal head Arbeit (work). Additional local scaling predicts the correct PV miterledigen (to manage sth. in addition) as second nearest neighbor.

Recall-out-of-x across Particle Types
Figure 5 focuses on the particle types, but varies the strength of the evaluation measure. Relying on BestAdd 3 with local scaling NI 15 , we apply recall-out-of-x with x ∈ [1, 10]. With one exception (zu), all particle types achieve a performance of 15-23% for recall-out-of-5, so zu had a negative impact on the average score in Table 4.
Looking at recall-out-of-10, the performances go up to 20-30%. While PVs with the rather nonambiguous mit are again modeled best, also PVs with strongly ambiguous particles (such as an and auf ) are modeled well.  Figure 5: Recall-out-of-[1,10] across particles.

Conclusion
We suggested two ways to improve the prediction of derived terms for English and German. Both (i) particle-verb motivated trainingspace restrictions and (ii) local scaling to address hubness in high-dimensional spaces had a positive impact on the prediction quality of derived terms across datasets. Particle-specific explorations demonstrated the difficulty of this derivation, and differences across particle types.