Large-scale evaluation of dependency-based DSMs: Are they worth the effort?

This paper presents a large-scale evaluation study of dependency-based distributional semantic models. We evaluate dependency-filtered and dependency-structured DSMs in a number of standard semantic similarity tasks, systematically exploring their parameter space in order to give them a “fair shot” against window-based models. Our results show that properly tuned window-based DSMs still outperform the dependency-based models in most tasks. There appears to be little need for the language-dependent resources and computational cost associated with syntactic analysis.


Introduction
Distributional semantic models (DSMs) based on syntactic dependency relations (Padó and Lapata, 2007; Baroni and Lenci, 2010) represent a more linguistically informed version of the widely used window-based DSMs (Sahlgren, 2006; Bullinaria and Levy, 2007; Bullinaria and Levy, 2012). Both types of DSMs operationalize the meaning of a target word t as a set of co-occurrence patterns extracted from language corpora. While window-based DSMs adopt a surface-oriented perspective (two words co-occur if they appear within a certain span, e.g. of 4 tokens), dependency-based DSMs adopt a syntactic perspective on co-occurrence: "nearness" is defined by the presence of a syntactic relation between target and features (e.g. direct object, subject, adjectival modifier), which may also correspond to a path along several edges of a dependency graph. If syntactic relations are only used to determine co-occurrence contexts, we talk of dependency-filtered DSMs; if the type of relation is explicitly encoded in the context features (e.g. "subj dog"), we talk of dependency-typed DSMs.
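The three notions of co-occurrence can be illustrated with a minimal sketch. The toy sentence, the relation labels, the "-1" suffix for inverse relations and the "rel_word" feature format are all assumptions made for illustration; they are not the extraction pipeline used in this study, and only paths of length 1 are covered.

```python
from collections import Counter

# Toy parsed sentence: "the dog chased the black cat" (lemmatized).
# Edges are (head_index, relation, dependent_index); labels are illustrative only.
tokens = ["the", "dog", "chase", "the", "black", "cat"]
edges = [(2, "subj", 1), (2, "obj", 5), (1, "det", 0), (5, "det", 3), (5, "amod", 4)]


def window_contexts(tokens, target_idx, span):
    """Surface co-occurrence: every token within `span` positions of the target."""
    lo = max(0, target_idx - span)
    hi = min(len(tokens), target_idx + span + 1)
    return Counter(tokens[i] for i in range(lo, hi) if i != target_idx)


def dependency_contexts(tokens, edges, target_idx, typed):
    """Syntactic co-occurrence over direct dependencies (paths of length 1)."""
    contexts = Counter()
    for head, rel, dep in edges:
        if head == target_idx:
            other, label = dep, rel           # target governs the context word
        elif dep == target_idx:
            other, label = head, rel + "-1"   # inverse relation: target is the dependent
        else:
            continue
        # Typed features keep the relation label; filtered features keep only the word.
        feature = f"{label}_{tokens[other]}" if typed else tokens[other]
        contexts[feature] += 1
    return contexts


target = 2  # "chase"
print(window_contexts(tokens, target, span=2))
print(dependency_contexts(tokens, edges, target, typed=False))
print(dependency_contexts(tokens, edges, target, typed=True))
```

In this toy case the 2-token window around "chase" picks up "black" but misses "cat", whereas the dependency-based contexts contain the subject and the object and nothing else.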
The fortune of syntax-based models in distributional semantics has been mixed. Early work on dependency-filtered (Padó and Lapata, 2007) or dependency-typed (Rothenhäusler and Schütze, 2009; Baroni and Lenci, 2010) DSMs indicated that syntax-based semantic representations are indeed superior. These evaluation studies, however, were restricted to a specific corpus (BNC in Padó and Lapata (2007)) or task (noun clustering in Rothenhäusler and Schütze (2009)), or based on a very specific notion of co-occurrence (Baroni and Lenci, 2010). Meanwhile, extensive evaluation studies and parameter tuning led to significant improvements in the performance of window-based models (Bullinaria and Levy, 2007; Bullinaria and Levy, 2012), to the point that dependency-based DSMs currently hold the state of the art in only very few standard semantic similarity tasks; see Baroni et al. (2014) and  for an overview of the state of the art. Among recent comparative evaluation studies, only Kiela and Clark (2014) attempt a direct comparison between the parameter spaces of window-based and syntax-based DSMs: once again, window-based models are found to perform better (with the exception of models built from the large Google Books N-gram corpus), but the scope of this comparison is rather limited.
The aim of this paper is to establish a fair ground for the comparison between window-based and dependency-based DSMs. To that end, we take as a reference point the large parameter set evaluated by  and  for window-based models. We carry out a parallel evaluation for dependency-based DSMs using the same tasks, datasets, parameters (with some additional parameters specific to syntax-based models, such as the parser used and the type of allowed dependency relations) and model selection methodology, allowing for a direct comparison of the results.
We address the question of whether dependency-based models can significantly improve DSM performance if the parameters are properly set, and whether the degree of the improvement justifies the increased complexity of the extraction process. In either case, a more thorough understanding of the parameter space will be beneficial for applications that prefer dependency-based DSMs on general grounds, e.g. because of an integration with syntactic structure (Erk et al., 2010). While the evaluation reported here does not encompass predict-type models, we believe that our findings also apply to the usefulness of dependency information in neural word embeddings (Levy and Goldberg, 2014).

Evaluation setting
Tasks & Datasets
Our evaluation covers all tasks and datasets used by  and . For space reasons, we present detailed results for one representative dataset from each task 3: the TOEFL synonym test dataset (Landauer and Dumais, 1997) for the multiple-choice synonymy task (performance: accuracy); the Generalized Event Knowledge (GEK) dataset (McRae and Matzuki, 2009), a collection of 402 triples (target, consistent prime, inconsistent prime), for the multiple-choice semantic priming task (performance: accuracy) 4; the WordSim-353 (WS353) dataset, which contains 353 noun pairs with similarity/relatedness ratings (Finkelstein et al., 2002), for the task of predicting human similarity ratings (performance: Pearson's r); and the Almuhareb-Poesio (AP) dataset, containing 402 nouns grouped into 21 semantic classes (Almuhareb, 2006), for the noun clustering task (performance: purity).
3 If more than one dataset was available for a task, we preferred larger datasets (for which results are more reliable). Results for all datasets will be made available in the supplementary materials.
4 In contrast to the paradigmatic relation targeted by TOEFL (i.e., synonymy), the GEK dataset focuses on relatedness of a more syntagmatic nature. See  for more details on this dataset.
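As an illustration of two of the performance measures named in the task descriptions, the sketch below computes accuracy for a multiple-choice task and Pearson's r for similarity ratings. The items, the ratings and the lookup table of distances are invented for the example and merely stand in for a trained DSM; the function names are hypothetical.

```python
import math


def multiple_choice_accuracy(items, distance):
    """items: (target, candidates, correct) triples; the nearest candidate is chosen."""
    hits = 0
    for target, candidates, correct in items:
        choice = min(candidates, key=lambda c: distance(target, c))
        hits += choice == correct
    return hits / len(items)


def pearson_r(xs, ys):
    """Pearson correlation between two equally long sequences of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up distances standing in for a DSM (smaller = more related).
toy_distance = {
    ("big", "large"): 0.2, ("big", "small"): 0.5, ("big", "red"): 0.9,
    ("car", "automobile"): 0.1, ("car", "road"): 0.4, ("car", "banana"): 0.95,
}


def dist(a, b):
    return toy_distance[(a, b)]


# Synonym-test style items: the correct answer should be the nearest candidate.
toefl_like = [
    ("big", ["large", "small", "red"], "large"),
    ("car", ["road", "automobile", "banana"], "automobile"),
]
accuracy = multiple_choice_accuracy(toefl_like, dist)

# Rating prediction: correlate human ratings with negated model distances.
human = [9.0, 4.0, 1.0]
model = [-dist("car", "automobile"), -dist("car", "road"), -dist("car", "banana")]
correlation = pearson_r(human, model)
print(accuracy, correlation)
```

The same accuracy function would apply to priming triples if the consistent and inconsistent primes are treated as the two candidates for a target.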

DSM parameters
We employ a large vocabulary of target words (27,522 lemma types), based on the vocabulary of Distributional Memory (Baroni and Lenci, 2010) and extended to cover all items in our datasets. After extracting dependency paths from the source corpora, the DSMs were compiled using the UCS toolkit and the wordspace package for R (Evert, 2014). We evaluate the following parameters:
Source corpus (abbreviated in the plots as corpus): BNC, WaCkypedia EN, and ukWaC;
Format of dependency relations (dep.style): basic vs. collapsed with propagation of conjuncts (De Marneffe et al., 2006; De Marneffe and Manning, 2008);
Annotation pipeline (parser): TreeTagger (Schmid, 1995) and MALT parser (Nivre, 2003) vs. the bidirectional POS tagger and neural network parser of Stanford CoreNLP (Chen and Manning, 2014);
Path length (path.length): we include paths with a maximum length of 1, 2, 3, 4 or 5 edges;
Type of dependency relations (dep.type): paths composed only of core dependencies (main actants of the sentence) vs. paths that also allow external dependencies (inter-clausal relations and conjuncts);
Threshold for context selection (orig.dim): we select the 5k, 10k, 20k, 50k, or 100k most frequent context dimensions;
Score for feature weighting (score): frequency, tf.idf, Dice coefficient, simple log-likelihood, Mutual Information (MI), t-score, or z-score;
Feature transformation (transformation): an additional square root, sigmoid (tanh), or logarithmic transformation applied to feature scores vs. no transformation;
Number of latent SVD dimensions (red.dim): we project vectors into 1000 dimensions using randomized SVD (Halko et al., 2009), then select the first 100, 300, 500, 700, or 900 latent dimensions;
Number of skipped SVD dimensions (dim.skip): we exclude the first 0, 50 or 100 latent dimensions (i.e., those with the highest singular values); previous work on window-based DSMs (Bullinaria and Levy, 2012) showed that model performance improves when the initial components of the reduced matrix (i.e., those with the highest variance) are discarded;
Distance metric (metric): cosine distance (i.e. the angle between vectors) vs. Manhattan distance;
Index of distributional relatedness (rel.index): the semantic relatedness of words a and b in a DSM is quantified either by their metric distance d(a, b) or by neighbor rank (rank of b among the neighbors of a for TOEFL and GEK, mean of log rank(a, b) and log rank(b, a) for WS353 and AP).
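The last two parameters in this list (metric and rel.index) can be sketched in a few lines. The vectors below are invented, the function names are hypothetical, and excluding the word itself from its own neighbor list is an assumption of this sketch rather than a detail stated above.

```python
import math


def cosine_distance(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cos))


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def neighbor_rank(space, a, b, metric):
    """Rank of b among the neighbors of a (1 = nearest); a itself is not counted."""
    d_ab = metric(space[a], space[b])
    closer = sum(
        1 for w in space
        if w not in (a, b) and metric(space[a], space[w]) < d_ab
    )
    return closer + 1


def symmetric_log_rank(space, a, b, metric):
    """Mean of log rank(a, b) and log rank(b, a)."""
    forward = math.log(neighbor_rank(space, a, b, metric))
    backward = math.log(neighbor_rank(space, b, a, metric))
    return (forward + backward) / 2


# Invented 3-dimensional vectors standing in for rows of a DSM.
space = {
    "dog": [4.0, 1.0, 0.0],
    "cat": [3.0, 2.0, 0.0],
    "puppy": [4.0, 1.5, 0.0],
    "car": [0.0, 1.0, 5.0],
}

print(cosine_distance(space["dog"], space["cat"]))
print(neighbor_rank(space, "dog", "cat", cosine_distance))
print(symmetric_log_rank(space, "dog", "cat", cosine_distance))
```

Neighbor rank requires the distances from a to every other word in the vocabulary, which is where its additional computational cost relative to plain distance comes from.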
Among the evaluated parameters, parser, dep.type and dep.style are specific to dependency-based DSMs. Path.length is the dependency-based equivalent of window size in a bag-of-words DSM. The comparison between filtered vs. typed DSMs can be considered roughly equivalent to the comparison between undirected and directed windows in a bag-of-words DSM. All the other parameters are shared with window-based DSMs.
Evaluation methodology
We tested all possible combinations of the parameters described above, resulting in a total of 806,400 runs per model class (filtered vs. typed), which were generated and evaluated on a large HPC cluster within approximately 6 weeks. To meaningfully interpret the evaluation results, we apply a model selection methodology that is sensitive to parameter interactions and robust to overfitting. Following Lapesa and Evert (2013), we analyze the influence of individual parameters and their interactions using general linear models with performance (accuracy, correlation, purity) as a dependent variable and the model parameters as independent variables, including all two-way interactions. Analysis of variance, which is straightforward for our full factorial design, is used to quantify the impact of each parameter or interaction. Robust optimal parameter settings are identified with the help of effect displays (Fox, 2003), which show the partial effect of one or two parameters by marginalizing over all other parameters. Unlike coefficient estimates, they allow an intuitive interpretation of the effect sizes of categorical variables irrespective of the dummy coding scheme used.
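The idea of marginalizing over all other parameters can be sketched on a toy grid. The sketch below simply averages observed performance over the remaining parameters, which is a simplified stand-in for effect displays computed from a fitted linear model; the grid is a small hypothetical subset of levels and the performance values are made up.

```python
from itertools import product
from statistics import mean

# Hypothetical mini-grid; the levels are a small subset chosen for illustration.
grid = {
    "metric": ["cosine", "manhattan"],
    "score": ["frequency", "MI"],
    "red.dim": [100, 300, 500],
}


def toy_performance(run):
    """Made-up accuracy for one run; stands in for a real evaluation result."""
    acc = 0.50 + {100: 0.0, 300: 0.03, 500: 0.04}[run["red.dim"]]
    acc += 0.20 if run["metric"] == "cosine" else 0.0
    acc += 0.10 if run["score"] == "MI" else 0.0
    # Interaction: MI helps more in combination with cosine.
    acc += 0.05 if run["metric"] == "cosine" and run["score"] == "MI" else 0.0
    return acc


# Full factorial design: one run per combination of parameter levels.
runs = [dict(zip(grid, levels)) for levels in product(*grid.values())]
results = [(run, toy_performance(run)) for run in runs]


def marginal_effect(results, *params):
    """Mean performance per level combination of `params`, averaged over all other parameters."""
    groups = {}
    for run, perf in results:
        key = tuple(run[p] for p in params)
        groups.setdefault(key, []).append(perf)
    return {key: mean(values) for key, values in groups.items()}


by_metric = marginal_effect(results, "metric")
by_metric_and_score = marginal_effect(results, "metric", "score")
print(len(runs), by_metric)
print(by_metric_and_score)
```

Passing two parameter names yields the table behind a two-way display, from which an interaction such as the one built into the toy scores can be read off.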

Results
As model runs without dimensionality reduction performed consistently worse than the corresponding SVD-reduced runs, we only report results for the latter in this paper.

Impact of parameters
We use a feature ablation approach to assess which parameters have the strongest impact on model performance. The ablation value of a parameter is the proportion of variance accounted for by the parameter together with all its interactions (corresponding to the reduction in adjusted R² of the model fit if the parameter were left out). Figures 1 and 2 visualize the feature ablation values of all evaluated parameters in the dependency-filtered and dependency-typed setting, respectively. Table 1 shows R² for the full model as well as all major interactions (partial R² > 1%).
Parameters can be divided into three groups. First, a group of parameters with a strong impact on model performance, which is dominated by metric in both settings. Metric also has strong interactions with many other parameters. Further parameters in this group are score and transformation, again with a strong interaction across all datasets and both settings (this interaction was also found to be the strongest for window-based DSMs), as well as corpus. Second, a group of parameters with an intermediate impact includes the two SVD-related parameters (red.dim and dim.skip) and, to a lesser extent, the number of context dimensions (orig.dim) and the relatedness index (rel.index). Path.length only affects dependency-filtered models on the GEK dataset (which directly involves syntagmatic relatedness) and, to a lesser extent, on AP (which encodes cohyponymy). It is almost irrelevant in the dependency-typed setting. This is probably because direct dependency relations already capture the "core" of the semantic space, while the information contributed by longer paths is neutralized by the additional noise. Third, a group of irrelevant parameters, which comprises the details of the dependency scheme (dep.style and dep.type) as well as the parser used.
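A minimal sketch of the ablation value, under simplifying assumptions: it uses plain rather than adjusted R², and it exploits the fact that in a balanced full factorial design the variance explained by main effects and two-way interactions can be obtained from sums of squares between cell means, without fitting a regression. The grid and scores are made up and the function names are hypothetical.

```python
from itertools import combinations, product
from statistics import mean

# Hypothetical mini-grid and made-up scores; not results from this study.
grid = {
    "metric": ["cosine", "manhattan"],
    "score": ["frequency", "MI"],
    "red.dim": [100, 300, 500],
}


def toy_performance(run):
    acc = 0.50 + {100: 0.0, 300: 0.03, 500: 0.04}[run["red.dim"]]
    acc += 0.20 if run["metric"] == "cosine" else 0.0
    acc += 0.10 if run["score"] == "MI" else 0.0
    acc += 0.05 if run["metric"] == "cosine" and run["score"] == "MI" else 0.0
    return acc


runs = [dict(zip(grid, levels)) for levels in product(*grid.values())]
results = [(run, toy_performance(run)) for run in runs]


def between_ss(results, params):
    """Sum of squares between the cells defined by `params` (balanced design assumed)."""
    grand = mean(perf for _, perf in results)
    cells = {}
    for run, perf in results:
        cells.setdefault(tuple(run[p] for p in params), []).append(perf)
    return sum(len(v) * (mean(v) - grand) ** 2 for v in cells.values())


def r_squared(results, params):
    """Plain R^2 of a model with all main effects and two-way interactions of `params`."""
    grand = mean(perf for _, perf in results)
    total = sum((perf - grand) ** 2 for _, perf in results)
    pair_ss = sum(between_ss(results, pair) for pair in combinations(params, 2))
    main_ss = sum(between_ss(results, (p,)) for p in params)
    # Each main effect is counted in (k - 1) pairwise sums; remove the surplus (k - 2) copies.
    explained = pair_ss - (len(params) - 2) * main_ss
    return explained / total


def ablation(results, params, left_out):
    """Drop in R^2 when `left_out` and all its interactions are removed from the model."""
    rest = [p for p in params if p != left_out]
    return r_squared(results, params) - r_squared(results, rest)


params = list(grid)
ablation_values = {p: ablation(results, params, p) for p in params}
print(ablation_values)
```

Because an interaction is credited to both parameters involved, ablation values defined this way can sum to more than the R² of the full model.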
Best parameter values
In this section, we identify the best parameter settings by inspecting partial effect plots. We focus on dependency-filtered models because they consistently achieve better results and only discuss the dependency-typed ones when the best parameters are different. As for window-based DSMs, the Manhattan metric always performs much worse than cosine distance; the different behaviour of the two metrics also accounts for most of the interactions listed in Table 1. We therefore exclude runs with Manhattan metric from further analysis and the effect plots below. The two bigger corpora are always a better choice (Figure 3), with a preference for ukWaC in the multiple-choice tasks. Neighbor rank (Figure 4) outperforms distance, but the increased computational cost may only be justified for AP and WS353; the effect is much stronger for unreduced models in all tasks. As far as path length (Figure 5) is concerned, datasets containing syntagmatic (GEK) or non-attributional relatedness (WS353) need longer paths to reach optimal performance. While the TOEFL task only requires 5k context dimensions (Figure 6), more dimensions are necessary for AP and WS353 (20k and 50k) and even more for GEK (100k). Performance in all tasks improves with an increasing number of reduced dimensions, but 300 appear to be sufficient for AP and WS353 (Figure 7); skipping the first 50 latent dimensions is beneficial for all tasks except AP (Figure 8). The strong interaction between score and transformation, displayed in Figure 12 for the AP dataset and in Figure 13 for GEK, indicates a preference for simple log-likelihood with log transformation or MI without any transformation (similar tendencies to AP hold for the remaining datasets). Parameters which are not explanatory can be set to the most "economic" value: MALT for parser, basic for dependency style, and core for dependency type.
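The two preferred score/transformation combinations can be sketched as follows. The formulas are commonly used definitions of these association measures and are assumptions of this sketch, as is the clamping of negative associations to zero; the exact variants used in the evaluated models are not spelled out above, and the counts are invented.

```python
import math


def expected_frequency(f_target, f_context, n):
    """Expected co-occurrence frequency under independence."""
    return f_target * f_context / n


def mutual_information(o, e):
    """Pointwise MI in bits; negative associations are clamped to zero (assumed variant)."""
    if o <= 0:
        return 0.0
    return max(0.0, math.log2(o / e))


def simple_ll(o, e):
    """Simple log-likelihood; set to zero when observed <= expected (assumed variant)."""
    if o <= e:
        return 0.0
    return 2 * (o * math.log(o / e) - (o - e))


# Transformations applied to the scores; sign-preserving where the score may be negative.
TRANSFORMATIONS = {
    "none": lambda x: x,
    "root": lambda x: math.copysign(math.sqrt(abs(x)), x),
    "log": lambda x: math.copysign(math.log(abs(x) + 1), x),
    "sigmoid": math.tanh,
}

# Invented counts: target occurs 1000 times, context 500 times, 30 joint occurrences.
o = 30
e = expected_frequency(1000, 500, 1_000_000)

mi_plain = TRANSFORMATIONS["none"](mutual_information(o, e))
ll_logged = TRANSFORMATIONS["log"](simple_ll(o, e))
print(mi_plain, ll_logged)
```

In this example the raw simple log-likelihood score is far larger than the MI score, and the logarithmic transformation brings it back to a comparable range, which is one plausible reading of why these two combinations behave similarly.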
Let us now briefly turn to dependency-typed models.
Table 3: General best settings (filtered and typed) (b.set). For comparison, we also show the performance of the optimized window-based DSM from  or Lapesa et al. (2014) (b.bow), and the state of the art for the task (soa).
Table 3 reports the parameter values of general settings for the dependency-filtered (Filter) and dependency-typed (Typed) models and their performance on the four datasets.

Conclusion
We presented the results of a large-scale evaluation study of syntax-based DSMs. We show that, even after extensive parameter tuning, syntax-based DSMs outperform comparable window-based models in only one task out of four (noun clustering). We found many similarities to window-based DSMs: a significant core of the parameter space (metric, score, transformation, relatedness index) is common to both types of models, in terms of their impact on performance as well as the best parameter values; path length trades off between paradigmatic similarity and non-attributional relatedness, in the same way window size does; most tasks require more SVD dimensions than are commonly used, and synonymy is better modeled by discarding the first SVD dimensions. It is left for future work to establish to what extent our conclusions generalize to different languages 11 and to more linguistically challenging tasks (e.g., prediction of thematic fit ratings).
log; metric: cosine; rel.index: rank.
11 For example, DSM evaluation on German reveals a mixed picture: on the one hand, Bott and Schulte im Walde (2015) found no advantage for syntax-based models over bag-of-words ones in a rather linguistic task, the prediction of particle verb compositionality; on the other, Utt and Padó (2014) did find advantages in the use of syntactic information in the German counterparts of TOEFL and WS353.