Automatic Selection of Context Configurations for Improved Class-Specific Word Representations

This paper is concerned with identifying contexts useful for training word representation models for different word classes such as adjectives (A), verbs (V), and nouns (N). We introduce a simple yet effective framework for an automatic selection of class-specific context configurations. We construct a context configuration space based on universal dependency relations between words, and efficiently search this space with an adapted beam search algorithm. In word similarity tasks for each word class, we show that our framework is both effective and efficient. In particular, it improves the Spearman's rho correlation with human scores on SimLex-999 over the best previously proposed class-specific contexts by 6 (A), 6 (V) and 5 (N) rho points. With our selected context configurations, we train on only 14% (A), 26.2% (V), and 33.6% (N) of all dependency-based contexts, resulting in a reduced training time. Our results generalise: we show that the configurations our algorithm learns for one English training setup outperform previously proposed context types in another training setup for English. Moreover, because the configuration space is based on universal dependencies, the learned configurations can be transferred to German and Italian. We also demonstrate improved per-class results over other context types in these two languages.


Introduction
Dense real-valued word representations (embeddings) have become ubiquitous in NLP, serving as invaluable features in a broad range of tasks (Turian et al., 2010; Collobert et al., 2011; Chen and Manning, 2014). The omnipresent word2vec skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013) is still considered a robust and effective choice for a word representation model, due to its simplicity, fast training, as well as its solid performance across semantic tasks (Baroni et al., 2014; Levy et al., 2015). The original SGNS implementation learns word representations from local bag-of-words contexts (BOW). However, the underlying model is equally applicable with other context types (Levy and Goldberg, 2014a).
Recent work suggests that "not all contexts are created equal". For example, reaching beyond standard BOW contexts towards contexts based on dependency parses (Bansal et al., 2014; Melamud et al., 2016) or symmetric patterns (Schwartz et al., 2015, 2016) yields significant improvements in learning representations for particular word classes such as adjectives (A) and verbs (V). Moreover, Schwartz et al. (2016) demonstrated that a subset of dependency-based contexts which covers only coordination structures is particularly effective for SGNS training, both in terms of the quality of the induced representations and in the reduced training time of the model. Interestingly, they also demonstrated that despite the success with adjectives and verbs, BOW contexts are still the optimal choice when learning representations for nouns (N).
In this work, we propose a simple yet effective framework for selecting context configurations, which yields improved representations for verbs, adjectives, and nouns. We start with a definition of our context configuration space (Sect. 3.1). Our basic definition of a context refers to a single typed (or labeled) dependency link between words (e.g., the amod link or the dobj link). Our configuration space then naturally consists of all possible subsets of the set of labeled dependency links between words. We employ the universal dependencies (UD) scheme to make our framework applicable across languages. We then describe (Sect. 3.2) our adapted beam search algorithm that aims to select an optimal context configuration for a given word class.
We show that SGNS requires different context configurations to produce improved results for each word class. For instance, our algorithm detects that the combination of amod and conj contexts is effective for adjective representation. Moreover, some contexts that boost representation learning for one word class (e.g., amod contexts for adjectives) may be uninformative when learning representations for another class (e.g., amod for verbs). By removing such dispensable contexts, we are able both to speed up the SGNS training and to improve representation quality.
We first experiment with the task of predicting similarity scores for the A/V/N portions of the benchmarking SimLex-999 evaluation set, running our algorithm in a standard SGNS experimental setup (Levy et al., 2015). When SGNS is trained with our learned context configurations, it outperforms SGNS trained with the best previously proposed context type for each word class: the improvements in Spearman's ρ rank correlations are 6 (A), 6 (V), and 5 (N) points. We also show that by building context configurations we obtain improvements on the entire SimLex-999 (4 ρ points over the best baseline). Interestingly, this context configuration is not the optimal configuration for any word class.
We then demonstrate that our approach is robust by showing that transferring the optimal configurations learned in the above setup to three other setups yields improved performance. First, the above context configurations, learned with the SGNS training on the English Wikipedia corpus, have an even stronger impact on SimLex-999 performance when SGNS is trained on a larger corpus. Second, the transferred configurations also result in competitive performance on the task of solving class-specific TOEFL questions. Finally, we transfer the learned context configurations across languages: these configurations improve the SGNS performance when trained with German or Italian corpora and evaluated on class-specific subsets of the multilingual SimLex-999 (Leviant and Reichart, 2015), without any language-specific tuning.
Several recent studies examined the effect of context types on word representation learning. Melamud et al. (2016) compared three context types on a set of intrinsic and extrinsic evaluation setups: BOW, dependency links, and substitute vectors. They show that the optimal type largely depends on the task at hand, with dependency-based contexts displaying strong performance on semantic similarity tasks. Vulić and Korhonen (2016) extended the comparison to more languages, reaching similar conclusions. Schwartz et al. (2016) showed that symmetric patterns are useful as contexts for V and A similarity, while BOW still works best for nouns. They also indicated that coordination structures, a particular dependency link, are more useful for verbs and adjectives than the entire set of dependencies. In this work, we generalise their approach: our algorithm systematically and efficiently searches the space of dependency-based context configurations, yielding class-specific representations with substantial gains for all three word classes.
Previous attempts at specialising word representations for a particular relation (e.g., similarity vs relatedness, antonyms) operate in one of two frameworks: (1) modifying the prior or the regularisation of the original training procedure (Yu and Dredze, 2014; Wieting et al., 2015; Liu et al., 2015; Kiela et al., 2015; Ling et al., 2015b); (2) post-processing procedures which use lexical knowledge to refine previously trained word vectors (Faruqui et al., 2015; Wieting et al., 2015; Mrkšić et al., 2017). Our work suggests that the induced representations can be specialised by directly training the word representation model with carefully selected contexts.

Context Selection: Methodology
The goal of our work is to develop a methodology for the identification of optimal context configurations for word representation model training. We hope to get improved word representations and, at the same time, cut down the training time of the word representation model. Fundamentally, we are not trying to design a new word representation model, but rather to find valuable configurations for existing algorithms.
Figure 1: Extracting dependency-based contexts. Top: An example English sentence from (Levy and Goldberg, 2014a), now UD-parsed. Middle: the same sentence in Italian, UD-parsed. Note the similarity between the two parses, which suggests that our context selection framework may be extended to other languages. Bottom: prepositional arc collapsing. The uninformative short-range case arc is removed, while a "pseudo-arc" specifying the exact link (prep:with) between discovers and telescope is added.
The motivation to search for such training context configurations lies in the intuition that the distributional hypothesis (Harris, 1954) should not necessarily be made with respect to BOW contexts. Instead, it may be restated as a series of statements according to particular word relations. For example, the hypothesis can be restated as: "two adjectives are similar if they modify similar nouns", which is captured by the amod typed dependency relation. This could also be reversed to reflect noun similarity by saying that "two nouns are similar if they are modified by similar adjectives". In another example, "two verbs are similar if they are used as predicates of similar nominal subjects" (the nsubj and nsubjpass dependency relations).
First, we have to define an expressive context configuration space that contains potential training configurations and is effectively decomposed so that useful configurations may be sought algorithmically. We can then continue by designing a search algorithm over the configuration space.

Context Configuration Space
We focus on the configuration space based on dependency-based contexts (DEPS) (Padó and Lapata, 2007; Utt and Padó, 2014). We choose this space for several reasons. First, dependency structures are known to be very useful in capturing functional relations between words, even if these relations are long distance. Second, they have been proven useful in learning word embeddings (Levy and Goldberg, 2014a; Melamud et al., 2016). Finally, owing to the recent development of the Universal Dependencies (UD) annotation scheme (McDonald et al., 2013; Nivre et al., 2016), it is possible to reason over dependency structures in a multilingual manner (e.g., Fig. 1). Consequently, a search algorithm in such DEPS-based configuration space can be developed for multiple languages based on the same design principles. Indeed, in this work we show that the optimal configurations for English translate to improved representations in two additional languages, German and Italian.
Given a (UD-)parsed training corpus, for each target word $w$ with modifiers $m_1, \ldots, m_k$ and a head $h$, the word $w$ is paired with context elements $m_1\_r_1, \ldots, m_k\_r_k, h\_r_h^{-1}$, where $r$ is the type of the dependency relation between the head and the modifier (e.g., amod), and $r^{-1}$ denotes an inverse relation. To simplify the presentation, we adopt the assumption that all training data for the word representation model are in the form of such (word, context) pairs (Levy and Goldberg, 2014a,c), where word is the current target word, and context is its observed context (e.g., BOW, positional, dependency-based). A naive version of DEPS extracts contexts from the parsed corpus without any post-processing. Given the example from Fig. 1, the DEPS contexts of discovers are: scientist_nsubj, stars_dobj, telescope_nmod.
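The naive DEPS extraction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the arc list is written by hand for the Fig. 1 sentence rather than produced by a parser, the function names are hypothetical, and the ASCII suffix `-1` is an assumed stand-in for the inverse-relation marker.

```python
from collections import defaultdict

# Hand-written (head, relation, modifier) arcs for the Fig. 1 example.
# These are illustrative assumptions, not parser output.
FIG1_ARCS = [
    ("scientist", "amod", "Australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "stars"),
    ("discovers", "nmod", "telescope"),
    ("telescope", "case", "with"),
]


def extract_pairs(arcs):
    """Return (word, context) pairs for every arc, in both directions."""
    pairs = []
    for head, rel, mod in arcs:
        pairs.append((head, f"{mod}_{rel}"))     # head is paired with m_r
        pairs.append((mod, f"{head}_{rel}-1"))   # modifier is paired with h_r^{-1}
    return pairs


def contexts_by_word(pairs):
    """Group the observed contexts by target word, preserving order."""
    grouped = defaultdict(list)
    for word, context in pairs:
        grouped[word].append(context)
    return dict(grouped)


CONTEXTS = contexts_by_word(extract_pairs(FIG1_ARCS))
print(CONTEXTS["discovers"])  # ['scientist_nsubj', 'stars_dobj', 'telescope_nmod']
```

Note that no post-processing is applied here, so the short-range case arc is still present among the contexts of telescope.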
DEPS not only emphasises functional similarity, but also provides a natural implicit grouping of related contexts. For instance, all pairs with the shared relation $r$ and $r^{-1}$ are taken as an $r$-based context bag, e.g., the pairs {(scientist, Australian_amod), (Australian, scientist_amod$^{-1}$)} from Fig. 1 are inserted into the amod context bag, while {(discovers, stars_dobj), (stars, discovers_dobj$^{-1}$)} are labelled with dobj.
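The grouping into relation-based context bags can be illustrated with a short sketch. It assumes that contexts are strings of the form `word_relation`, with a trailing `-1` marking the inverse relation; both the string encoding and the helper names are assumptions made for this example only.

```python
from collections import defaultdict


def bag_label(context):
    """Return the relation label of a context, ignoring the inverse marker."""
    relation = context.rsplit("_", 1)[1]
    return relation[:-2] if relation.endswith("-1") else relation


def group_into_bags(pairs):
    """Map each relation label to the multiset (list) of its (word, context) pairs."""
    bags = defaultdict(list)
    for word, context in pairs:
        bags[bag_label(context)].append((word, context))
    return dict(bags)


# The four pairs mentioned in the text for the Fig. 1 sentence.
PAIRS = [
    ("scientist", "Australian_amod"),
    ("Australian", "scientist_amod-1"),
    ("discovers", "stars_dobj"),
    ("stars", "discovers_dobj-1"),
]

BAGS = group_into_bags(PAIRS)
print(sorted(BAGS))  # ['amod', 'dobj']
```

A bag is kept as a list rather than a set so that repeated pairs from a large corpus retain their counts.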
Assume that we have obtained $M$ distinct dependency relations $r_1, \ldots, r_M$ after parsing and post-processing the corpus. The $j$-th individual context bag, $j = 1, \ldots, M$, labelled $r_j$, is a bag (or a multiset) of (word, context) pairs where context has one of the following forms: $v\_r_j$ or $v\_r_j^{-1}$, where $v$ is some vocabulary word. A context configuration is then simply a set of individual context bags, e.g., $R = \{r_i, r_j, r_k\}$, also labelled as $R$: $r_i + r_j + r_k$. We call a configuration consisting of $K$ individual context bags a $K$-set configuration (e.g., in this example, $R$ is a 3-set configuration).
Footnote 2: A note on the nomenclature and notation: Each context configuration may be seen as a set of context bags, as it does not allow for repetition of its constituent context bags. For simplicity and clarity of presentation, we use dependency relation types (e.g., $r_i$ = amod, $r_j$ = acl) as labels for context bags. The reader has to be aware that a configuration $R = \{r_i, r_j, r_k\}$ is not by any means a set of relation types/names, but is in fact a multiset of all (word, context) pairs belonging to the corresponding context bags labelled with $r_i$, $r_j$, $r_k$.
Although a brute-force exhaustive search over all possible configurations is possible in theory and for small pools (e.g., for adjectives, see Tab. 2), it becomes challenging or practically infeasible for large pools and large training data. For instance, based on the pool from Tab. 2, the search for the optimal configuration would involve trying out $2^{10} - 1 = 1023$ configurations for nouns (i.e., training 1023 different word representation models). Therefore, to reduce the number of visited configurations, we present a simple heuristic search algorithm inspired by beam search (Pearl, 1984).
Figure 2: An illustration of Alg. 1. The search space is presented as a DAG with direct links between origin configurations (e.g., $r_i + r_j + r_k$) and all their children configurations obtained by removing exactly one individual bag from the origin (e.g., $r_i + r_j$, $r_j + r_k$). After automatically constructing the initial pool (line 1), the entry point of the algorithm is the $R_{Pool}$ configuration (line 2). Thicker blue circles denote visited configurations, while the gray circle denotes the best configuration found.
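To make the size of the exhaustive search concrete, the following sketch enumerates every non-empty subset of a pool of bags. The bag labels `r0` to `r9` are placeholders chosen for this example (they are not the actual noun pool), and `all_configurations` is a hypothetical helper name.

```python
from itertools import combinations


def all_configurations(pool):
    """Yield every non-empty subset of the pool, largest configurations first."""
    pool = list(pool)
    for size in range(len(pool), 0, -1):
        for subset in combinations(pool, size):
            yield frozenset(subset)


# Placeholder labels for a pool of 10 individual context bags.
POOL = [f"r{i}" for i in range(10)]

NUM_CONFIGS = sum(1 for _ in all_configurations(POOL))
print(NUM_CONFIGS)  # 1023, i.e. 2**10 - 1
```

Since each configuration requires training a separate word representation model, the count grows too quickly for exhaustive evaluation on large pools.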

Algorithm 1: Best Configuration Search
Input: Set of $M$ individual context bags, $S$
Output: Best configuration $R_o$

Class-Specific Configuration Search
Alg. 1 provides a high-level overview of the algorithm. An example of its flow is given in Fig. 2. Starting from $S$, the set of all possible $M$ individual context bags, the algorithm automatically detects the subset $S_K \subseteq S$, $|S_K| = K$, of candidate individual bags that are used as the initial pool (line 1 of Alg. 1). The selection is based on some fitness (goal) function $E$. In our setup, $E(R)$ is Spearman's ρ correlation with human judgment scores obtained on the development set after training the word representation model with the configuration $R$. The selection step relies on a simple threshold: we use a threshold of ρ ≥ 0.2 without any fine-tuning in all experiments with all word classes. We find this step to facilitate efficiency at a minor cost to accuracy. For example, since amod denotes an adjectival modifier of a noun, an efficient search procedure may safely remove this bag from the pool of candidate bags for verbs.
The search algorithm then starts from the full $K$-set $R_{Pool}$ configuration (line 3) and tests the $K$ $(K-1)$-set configurations where exactly one individual bag $r_i$ is removed to generate each such configuration (line 10). It then retains only the set of configurations that score higher than the origin $K$-set configuration (lines 11-12, see Fig. 2). Using this principle, it continues searching only over lower-level $(l-1)$-set configurations that further improve performance over their $l$-set origin configuration. It stops if it reaches the lowest level or if it cannot improve the goal function any more (line 15). The best scoring configuration is returned (n.b., not guaranteed to be the global optimum).
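The search procedure described above can be sketched as follows. This is an illustrative reading of the prose, not a transcription of Alg. 1: the function names are hypothetical, the fitness function is a toy stand-in with invented per-bag gains (in the paper, evaluating a configuration means training SGNS and computing Spearman's ρ on the development set), and returning the best of all evaluated configurations, including the 1-set ones scored during pool selection, is an assumption.

```python
def search_best_configuration(bags, fitness, threshold=0.2):
    """Top-down search over configurations, following only improving children."""
    scores = {}

    def evaluate(config):
        # Cache scores: each configuration would be one full training run.
        if config not in scores:
            scores[config] = fitness(config)
        return scores[config]

    # Pool selection: keep bags whose 1-set configuration reaches the threshold.
    pool = frozenset(b for b in bags if evaluate(frozenset([b])) >= threshold)
    if not pool:
        raise ValueError("no individual bag reaches the threshold")

    frontier = {pool}
    while frontier:
        improved = set()
        for origin in frontier:
            if len(origin) == 1:
                continue  # lowest level reached
            for bag in origin:
                child = origin - {bag}  # remove exactly one bag
                if evaluate(child) > evaluate(origin):
                    improved.add(child)
        frontier = improved  # empty when nothing improves, which ends the loop

    best = max(scores, key=scores.get)
    return best, scores[best]


# Toy fitness with invented numbers, used only to exercise the search.
TOY_GAIN = {"amod": 0.20, "conj": 0.15, "acl": -0.02, "nmod": -0.05, "nummod": -0.25}


def toy_fitness(config):
    return 0.3 + sum(TOY_GAIN[b] for b in sorted(config))


BEST, BEST_SCORE = search_best_configuration(TOY_GAIN, toy_fitness)
print(sorted(BEST))
```

With these toy gains the nummod bag is filtered out by the threshold, and the descent from the 4-set pool ends at the 2-set configuration amod + conj. The numbers are made up and do not reproduce any result reported in the paper.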
In our experiments with this heuristic, the search for the optimal configuration for verbs is performed only over 13 1-set configurations plus 26 other configurations (39 out of 133 possible configurations). For nouns, the advantage of the heuristic is even more dramatic: only 104 out of 1026 possible configurations were considered during the search.

Experimental Setup

Implementation Details
Word Representation Model We experiment with SGNS (Mikolov et al., 2013), the standard and very robust choice in vector space modeling (Levy et al., 2015). In all experiments we use word2vecf, a reimplementation of word2vec able to learn from arbitrary (word, context) pairs. For details concerning the implementation, we refer the reader to (Goldberg and Levy, 2014; Levy and Goldberg, 2014a).
The SGNS preprocessing scheme was replicated from (Levy and Goldberg, 2014a; Levy et al., 2015). After lowercasing, all words and contexts that appeared fewer than 100 times were filtered. When considering all dependency types, the vocabulary spans approximately 185K word types. Further, all representations were trained with d = 300 (very similar trends are observed with d = 100, 500).
The same setup was used in prior work (Schwartz et al., 2016; Vulić and Korhonen, 2016). Keeping the representation model fixed across experiments and varying only the context type allows us to attribute any differences in results to a sole factor: the context type. We plan to experiment with other representation models in future work.
Universal Dependencies as Labels The adopted UD scheme leans on the universal Stanford dependencies (de Marneffe et al., 2014) complemented with the universal POS tagset (Petrov et al., 2012). It is straightforward to "translate" previous annotation schemes to UD (de Marneffe et al., 2014). Providing a consistently annotated inventory of categories for similar syntactic constructions across languages, the UD scheme facilitates representation learning in languages other than English, as shown in (Vulić and Korhonen, 2016; Vulić, 2017).
Individual Context Bags Standard post-parsing steps are performed in order to obtain an initial list of individual context bags for our algorithm: (1) Prepositional arcs are collapsed (Levy and Goldberg, 2014a; Vulić and Korhonen, 2016; see Fig. 1). Following this procedure, all pairs where the relation r has the form prep:X (where X is a preposition) are subsumed under a context bag labelled prep; (2) Similar labels are merged into a single label (e.g., direct (dobj) and indirect objects (iobj) are merged into obj); (3) Pairs with infrequent and uninformative labels are removed (e.g., punct, goeswith, cc).
Coordination-based contexts are extracted as in prior work (Schwartz et al., 2016), distinguishing between left and right contexts extracted from the conj relation; the label for this bag is conjlr. We also utilise the variant that does not make the distinction, labeled conjll. If both are used, the label is simply conj=conjlr+conjll. Consequently, the individual context bags we use in all experiments are: subj, obj, comp, nummod, appos, nmod, acl, amod, prep, adv, compound, conjlr, conjll.
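The three post-parsing steps can be sketched on the Fig. 1 example as follows. Several simplifications are assumptions of this sketch rather than details from the paper: arcs are keyed by word form instead of token position, the nmod arc is relabelled as prep:X when its modifier carries a case arc, and only the merge and removal examples named in the text are included in the tables.

```python
# Label tables limited to the examples given in the text.
MERGE = {"dobj": "obj", "iobj": "obj"}
DROP = {"punct", "goeswith", "cc"}

# Hand-written (head, relation, modifier) arcs for the Fig. 1 sentence.
FIG1_ARCS = [
    ("scientist", "amod", "Australian"),
    ("discovers", "nsubj", "scientist"),
    ("discovers", "dobj", "stars"),
    ("discovers", "nmod", "telescope"),
    ("telescope", "case", "with"),
]


def collapse_prepositions(arcs):
    """Step 1: drop case arcs and relabel the governing nmod arc as prep:X."""
    preposition_of = {head: mod for head, rel, mod in arcs if rel == "case"}
    collapsed = []
    for head, rel, mod in arcs:
        if rel == "case":
            continue  # the short-range case arc is removed
        if rel == "nmod" and mod in preposition_of:
            rel = f"prep:{preposition_of[mod]}"
        collapsed.append((head, rel, mod))
    return collapsed


def bag_of(rel):
    """Steps 1 and 2: map a relation to the label of its context bag."""
    if rel.startswith("prep:"):
        return "prep"
    return MERGE.get(rel, rel)


def postprocess(arcs):
    """Apply all three steps and return (head, relation, modifier, bag) tuples."""
    return [
        (head, rel, mod, bag_of(rel))
        for head, rel, mod in collapse_prepositions(arcs)
        if rel not in DROP  # step 3
    ]


PROCESSED = postprocess(FIG1_ARCS)
for arc in PROCESSED:
    print(arc)
```

The relation keeps its exact form (prep:with) for context strings, while the bag label groups all prepositions together.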

Training and Evaluation
We run the algorithm for context configuration selection only once, with the SGNS training setup described below. Our main evaluation setup is presented below, but the learned configurations are tested in additional setups, detailed in Sect. 5. The parser was trained using default settings (SVM MIRA with 20 iterations, no further parameter tuning) on the TRAIN+DEV portion of the UD treebank annotated with UPOS tags. The data were then parsed with UD using the graph-based Mate parser v3.61 (Bohnet, 2010) with standard settings on TRAIN+DEV of the UD treebank.

Training Data
Evaluation We experiment with the verb pair (222 pairs), adjective pair (111 pairs), and noun pair (666 pairs) portions of SimLex-999. We report Spearman's ρ correlation between the ranks derived from the scores of the evaluated models and the human scores. Our evaluation setup is borrowed from Levy et al. (2015): we perform 2-fold cross-validation, where the context configurations are optimised on a development set, separate from the unseen test data. Unless stated otherwise, the reported scores are always the averages of the 2 runs, computed in the standard fashion by applying the cosine similarity to the vectors of words participating in a pair.
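The scoring step can be sketched without external libraries: cosine similarity for each word pair, followed by Spearman's ρ computed as the Pearson correlation of average ranks. The vectors and gold scores below are invented toy values, the function names are hypothetical, and skipping pairs with out-of-vocabulary words is an assumption of this sketch.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """Rank values from 1, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(vectors, scored_pairs):
    """Spearman's rho between cosine similarities and human scores."""
    model, gold = [], []
    for w1, w2, score in scored_pairs:
        if w1 in vectors and w2 in vectors:  # assumption: skip OOV pairs
            model.append(cosine(vectors[w1], vectors[w2]))
            gold.append(score)
    return spearman_rho(model, gold)


# Invented toy vectors and gold scores.
VECTORS = {"a": (1.0, 0.0), "b": (1.0, 0.1), "c": (0.0, 1.0), "d": (0.5, 0.5)}
PAIRS = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0)]
RHO = evaluate(VECTORS, PAIRS)
```

In this toy case the model ranking agrees perfectly with the gold ranking, so ρ equals 1 up to floating-point error.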

Baselines
Baseline Context Types We compare the context configurations found by Alg. 1 against baseline contexts from prior work:
-BOW: Standard bag-of-words contexts.
-DEPS-All: All dependency links without any context selection, extracted from dependency-parsed data with prepositional arc collapsing.
-COORD: Coordination-based contexts are used as fast lightweight contexts for improved representations of adjectives and verbs (Schwartz et al., 2016). This is in fact the conjlr context bag, a subset of DEPS-All.
The development set was used to tune the window size for BOW and POSIT (to 2) and the parameters of the SP extraction algorithm.
Baseline Greedy Search Algorithm We also compare our search algorithm to its greedy variant: at each iteration of lines 8-12 in Alg. 1, $R_n$ now keeps only the best configuration of size $l - 1$ that performs better than the initial configuration of size $l$, instead of all such configurations.
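The difference between the full search and its greedy variant lies in a single expansion step, which can be sketched as follows. The function name `next_level`, the `greedy` flag, and the toy fitness values are assumptions made for this illustration; they are not taken from Alg. 1.

```python
def next_level(frontier, fitness, greedy=False):
    """Return the (l-1)-set configurations that improve on their l-set origin.

    With greedy=True only the single best improving configuration is kept.
    """
    improved = {}
    for origin in frontier:
        if len(origin) < 2:
            continue  # a 1-set configuration has no children
        origin_score = fitness(origin)
        for bag in origin:
            child = origin - {bag}
            score = fitness(child)
            if score > origin_score:
                improved[child] = score
    if greedy and improved:
        return {max(improved, key=improved.get)}
    return set(improved)


# Toy fitness with invented per-bag gains, used only to exercise the step.
TOY_GAIN = {"amod": 0.20, "conj": 0.15, "acl": -0.02, "nmod": -0.05}


def toy_fitness(config):
    return 0.3 + sum(TOY_GAIN[b] for b in sorted(config))


POOL = frozenset(TOY_GAIN)
FULL_STEP = next_level({POOL}, toy_fitness)
GREEDY_STEP = next_level({POOL}, toy_fitness, greedy=True)
print(len(FULL_STEP), len(GREEDY_STEP))
```

In this toy example two children improve on the pool; the full search keeps both for the next level, whereas the greedy variant continues only from the one with the higher score.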

Main Evaluation Setup
Not All Context Bags are Created Equal First, we test the performance of individual context bags across SimLex-999 adjective, verb, and noun subsets. Besides providing insight on the intuition behind context selection, these findings are important for the automatic selection of class-specific pools (line 1 of Alg. 1). The results are shown in Tab. 1.
The experiment supports our intuition (see Sect. 3.2): some context bags are definitely not useful for some classes and may be safely removed when performing the class-specific SGNS training. For instance, the amod bag is indeed important for adjective and noun similarity, and at the same time it does not encode any useful information regarding verb similarity. compound is, as expected, useful only for nouns. Tab. 1 also suggests that some context bags (e.g., nummod) do not encode any informative contextual evidence regarding similarity, therefore they can be discarded. The initial results with individual context bags help to reduce the pool of candidate bags (line 1 in Alg. 1), see Tab. 2.
Table 3: Results on the SimLex-999 test data over (a) verbs and (b) nouns subsets. Only a selection of context configurations optimised for verb and noun similarity are shown. POOL-ALL denotes a configuration where all individual context bags from the verbs/nouns-oriented pools (see Table 2) are used. BEST denotes the best performing configuration found by Alg. 1. Other configurations visited by Alg. 1 that score higher than the best scoring baseline context type for each word class are in gray. Scores obtained using a greedy search algorithm instead of Alg. 1 are in italic, marked with a cross (†).
Table 4: Results on the SimLex-999 adjectives subset with adjective-specific configurations.
Searching for Improved Configurations Next, we test if we can improve class-specific representations by selecting class-specific configurations.
Results are summarised in Tables 3 and 4. Indeed, class-specific configurations yield better representations, as is evident from the scores: the improvements with the best class-specific configurations found by Alg. 1 are approximately 6 ρ points for adjectives, 6 points for verbs, and 5 points for nouns over the best baseline for each class. The improvements are visible even with configurations that simply pool all candidate individual bags (POOL-ALL), without running Alg. 1 beyond line 1. However, further careful context selection, i.e., traversing the configuration space using Alg. 1, leads to additional improvements for V and N (gains of 3 and 2.2 ρ points). Very similar improved scores are achieved with a variety of configurations (see Tab. 3), especially in the neighbourhood of the best configuration found by Alg. 1. This indicates that the method is quite robust: even sub-optimal solutions result in improved class-specific representations. Furthermore, our algorithm is able to find better configurations for verbs and nouns compared to its greedy variant. Finally, our algorithm generalises well: the best scoring configuration on the dev set is always the best one on the test set.
Training: Fast and/or Accurate? Carefully selected configurations are also likely to reduce SGNS training times. Indeed, the configuration-based model trains on only 14% (A), 26.2% (V), and 33.6% (N) of all dependency-based contexts. The training times and statistics for each context type are displayed in Tab. 5. All models were trained using parallel training on 10 Intel(R) Xeon(R) E5-2667 2.90GHz processors. The results indicate that class-specific configurations are not as lightweight and fast as SP or COORD contexts (Schwartz et al., 2016). However, they also suggest that such configurations provide a good balance between accuracy and speed: they reach peak performances for each class, outscoring all baseline context types (including SP and COORD), while training is still much faster than with "heavyweight" context types such as BOW, POSIT or DEPS-All. Now that we have verified the decrease in training time our algorithm provides for the final training, it makes sense to ask whether the configurations it finds are valuable in other setups. This will make the fast training of practical importance.

Generalisation: Configuration Transfer
Another Training Setup We first test whether the context configurations learned in Sect. 5.1 are useful when SGNS is trained in another English setup (Schwartz et al., 2016), with more training data and other annotation and parser choices, while evaluation is still performed on SimLex-999.
In this setup the training corpus is the 8B-word corpus generated by the word2vec script (code.google.com/p/word2vec/source/browse/trunk/). A preprocessing step now merges common word pairs and triplets to expression tokens (e.g., Bilbo_Baggins). The corpus is parsed with labelled Stanford dependencies (de Marneffe and Manning, 2008) using the Stanford POS Tagger (Toutanova et al., 2003) and the stack version of the MALT parser (Goldberg and Nivre, 2012). SGNS preprocessing and parameters are also replicated; we now train 500-dim embeddings as in prior work.
Table 6: Results on the A/V/N SimLex-999 subsets, and on the entire set (All) in the setup from Schwartz et al. (2016). d = 500. BEST-* are again the best class-specific configs returned by Alg. 1.
Results are presented in Tab. 6. The imported class-specific configurations, computed using a much smaller corpus (Sect. 5.1), again outperform competitive baseline context types for adjectives and nouns. The BEST-VERBS configuration is outscored by SP, but the margin is negligible. We also evaluate another configuration found using Alg. 1 in Sect. 5.1, which targets the overall improved performance without any finer-grained division to classes (BEST-ALL). This configuration (amod+subj+obj+compound+prep+adv+conj) outperforms all baseline models on the entire benchmark. Interestingly, the non-specific BEST-ALL configuration falls short of A/V/N-specific configurations for each class. This unambiguously implies that the "trade-off" configuration targeting all three classes at the same time differs from specialised class-specific configurations.
Experiments on Other Languages We next test whether the optimal context configurations computed in Sect. 5.1 with English training data are also useful for other languages. For this, we train SGNS models on the Italian (IT) and German (DE) Polyglot Wikipedia corpora with those configurations, and evaluate on the IT and DE multilingual SimLex-999 (Leviant and Reichart, 2015). Our results demonstrate similar patterns as for English, and indicate that our framework can be easily applied to other languages. For instance, the BEST-ADJ configuration (the same configuration as in Tab. 4 and Tab. 7) yields an improvement of 8 ρ points and 4 ρ points over the strongest adjectives baseline in IT and DE, respectively. We get similar improvements for nouns (IT: 3 ρ points, DE: 2 ρ points), and verbs (IT: 2, DE: 4).

TOEFL Evaluation
We also verify that the selection of class-specific configurations (Sect. 5.1) is useful beyond the core SimLex evaluation. To this end, we evaluate on the A, V, and N TOEFL questions (Landauer and Dumais, 1997). The results are summarised in Tab. 7. Despite the limited size of the TOEFL dataset, we observe positive trends in the reported results (e.g., V-specific configurations yield a small gain on verb questions), showcasing the potential of class-specific training in this task.

Conclusion and Future Work
We have presented a novel framework for selecting class-specific context configurations which yield improved representations for prominent word classes: adjectives, verbs, and nouns. Its design and dependence on the Universal Dependencies annotation scheme make it applicable in different languages. We have proposed an algorithm that is able to find a suitable class-specific configuration while making the search over the large space of possible context configurations computationally feasible. Each word class requires a different class-specific configuration to produce improved results on the class-specific subset of SimLex-999 in English, Italian, and German. We also show that the selection of context configurations is robust: once learned, a configuration may be effectively transferred to other data setups, tasks, and languages without additional retraining or fine-tuning.
In future work, we plan to test the framework with finer-grained contexts, investigating beyond POS-based word classes and dependency links. Exploring more sophisticated algorithms that can efficiently search richer configuration spaces is also an intriguing direction. Another research avenue is application of the context selection idea to other representation models beyond SGNS tested in this work, and experimenting with assigning weights to context subsets. Finally, we plan to test the portability of our approach to more languages.