Is “Universal Syntax” Universally Useful for Learning Distributed Word Representations?

Recent comparative studies have demonstrated the usefulness of dependency-based contexts (DEPS) for learning distributed word representations for similarity tasks. In English, DEPS tend to perform better than the more common, less informed bag-of-words contexts (BOW). In this paper, we present the first cross-linguistic comparison of different context types for three different languages. DEPS are extracted from “universal parses” without any language-specific optimization. Our results suggest that the universal DEPS (UDEPS) are useful for detecting functional similarity (e.g., verb similarity, solving syntactic analogies) among languages, but their advantage over BOW is not as prominent as previously reported on English. We also show that simple “post-parsing” filtering of useful UDEPS contexts leads to consistent improvements across languages.


Introduction
Dense real-valued distributed representations of words known as word embeddings (WEs) have become ubiquitous in NLP, serving as invaluable features in a broad range of NLP tasks, e.g., (Turian et al., 2010; Collobert et al., 2011; Chen and Manning, 2014). The omnipresent word2vec skip-gram model with negative sampling (SGNS) (Mikolov et al., 2013b) is still considered the state-of-the-art word representation model, due to its simplicity, fast training, as well as its solid and robust performance across a wide variety of semantic tasks (Baroni et al., 2014; Levy et al., 2015).
The original implementation of SGNS learns word representations from local bag-of-words contexts (BOW). However, the underlying SGNS model is equally applicable to other context types.
Recent comparative studies have demonstrated the usefulness of dependency-based contexts (DEPS) (Padó and Lapata, 2007) for learning word representations. In comparison with BOW, syntactic contexts steer the induced semantic spaces towards functional similarity (e.g., tiger:cat) rather than towards topical similarity/relatedness (e.g., tiger:jungle). DEPS-based embeddings outperform the less informed BOW-based embeddings in a variety of similarity tasks (Bansal et al., 2014; Levy and Goldberg, 2014a; Hill et al., 2015; Melamud et al., 2016). However, these studies have all focused solely on English. A comparison extending to additional languages is required before any cross-lingual generalisations can be drawn.
Following recent initiatives on language-agnostic and cross-linguistically consistent universal natural language processing (i.e., universal POS (UPOS) tagging and dependency (UD) parsing) (Nivre et al., 2015), this paper is concerned with two important questions:
(Q1) Can one usefully replace the DEPS extraction pipeline optimised for tools developed for English with a pipeline that relies on language-universal syntactic processing (UDEPS)?
(Q2) Are UDEPS universally better than BOW for learning distributed word representations in other languages?
Regarding Q1, the results show that it is possible to replace original DEPS with UDEPS for English and to obtain benchmarking results with only a slight drop in performance. As for Q2, the framework is not equally effective in other languages, as suggested by the performance in Italian and German, which sheds new light on the usefulness of BOW and dependency-based contexts. Further, the results reveal that even a simple preliminary "post-parsing" selection of useful UDEPS contexts leads to consistent improvements across languages, especially in detecting functional similarity.
This focused contribution is the first cross-linguistic comparison of different context types for learning word representations in three languages, reaching beyond English. It also constitutes a first completely language-universal and widely applicable framework for UDEPS extraction.

Methodology
Universal Multilingual Resources The departure point in our experiments is the Universal Dependencies project (McDonald et al., 2013; Nivre et al., 2015), which develops cross-linguistically consistent treebank annotation. The annotation scheme leans on the universal Stanford dependencies (de Marneffe et al., 2014) complemented with the Google universal POS tagset (Petrov et al., 2012) and the Interset interlingua for morphological tagsets (Zeman and Resnik, 2008). It provides a universal and consistent inventory of categories for similar syntactic constructions across languages.
The main aim of the "universal initiative" is to facilitate cross-lingual and multilingual learning (e.g., multilingual parser development, typologies) by capturing structural similarities across languages and by exploiting connections that exist naturally between them (Berg-Kirkpatrick and Klein, 2010; McDonald et al., 2011; Cohen et al., 2011; Naseem et al., 2012). Here, we test the ability of such a universal annotation scheme to encode potentially useful semantic knowledge cross-linguistically; in this case, to yield more informed UDEPS contexts for improved word embeddings.
The extraction of UDEPS as the new variant of dependency-based contexts is completely language-agnostic on purpose: exactly the same procedure is followed for each language in comparison in order to make the representation learning framework completely universal.

Context Types
Prequel: Representation Model For all the context types, we opt for the standard and robust choice in vector space modeling: SGNS (Mikolov et al., 2013b; Levy et al., 2015). In all our experiments we use word2vecf, a reimplementation of word2vec which is capable of learning from arbitrary (word, context) pairs. Keeping the representation model fixed across experiments and varying only the context type allows us to attribute any differences in results to a sole factor: the context type.
Figure 1: An example of extracting dependency-based contexts from UD parses (UDEPS) in English and Italian. Top: the example sentence in English taken from (Levy and Goldberg, 2014a), now UD-parsed. Middle: the same sentence in Italian, UD-parsed. Note the very similar structure of the two parses. Bottom: the intuition behind UDEPS-ARC. The uninformative short-range case arc between with and telescope is removed, and another "pseudo-arc" now specifying the exact link type (i.e., case_with) between discovers and telescope is added.
BOW The English sentence from Fig. 1 is used as the running example for all context types. Given the target word w and the window size k, the BOW context simply comprises all 2k word pairs (w, v), where v is found in the window of k words preceding w or k words following w, e.g., BOW with k = 2 extracts the following contexts v for the word discovers from Fig. 1: Australian, scientist, stars, with. Note that BOW may miss valid longer-range contexts (e.g., telescope) while including some accidental (e.g., Australian) or uninformative ones (e.g., with).
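The window-based extraction described above can be sketched in a few lines. This is a minimal illustration rather than the word2vec implementation: the function name `bow_contexts` and the token list are assumptions made for the example, and the sentence is the running example with its words taken from the text.

```python
from typing import List


def bow_contexts(tokens: List[str], position: int, k: int) -> List[str]:
    """Return the words found up to k positions before and after tokens[position]."""
    left = tokens[max(0, position - k):position]
    right = tokens[position + 1:position + 1 + k]
    return left + right


# Running example sentence (tokenisation assumed for illustration).
sentence = ["Australian", "scientist", "discovers", "stars", "with", "telescope"]
target = sentence.index("discovers")

print(bow_contexts(sentence, target, k=2))
# ['Australian', 'scientist', 'stars', 'with']
```

With k = 2 the longer-range context telescope falls outside the window, which is the locality limitation noted for BOW.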
POSIT A more informed variant of BOW is positional contexts. It includes extra information on the actual sequential position of each context word (Levy and Goldberg, 2014b). Given the same example, POSIT with k = 2 extracts the following contexts for discovers: Australian_-2, scientist_-1, stars_+1, with_+2. This context type has not been studied systematically in relation to learning WEs. POSIT suffers from the same issues with locality as BOW, but its shallow positional annotations may capture additional shallow syntactic phenomena in the data. Therefore, POSIT may be considered a link from BOW towards DEPS.
UDEPS-NAIVE Given a corpus of parsed sentences, for each target w with modifiers $m_1, \ldots, m_k$ and head h, w is paired with context elements $m_1\_r_1, \ldots, m_k\_r_k, h\_r_h^{-1}$, where r is the type of the UD relation between the head and the modifier (e.g., amod), and $r^{-1}$ denotes an inverse relation. A naive version of the UD-based model extracts contexts from the parsed corpus without any post-processing. The UDEPS-NAIVE contexts of discovers are now: scientist_nsubj, stars_dobj, telescope_nmod. They capture longer-range relations (e.g., telescope) and filter "accidental contexts" (e.g., Australian). In addition, the typed dependencies reveal more than POSIT and BOW about the nature of the relation in context.
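As a rough sketch of the two context types just described, the snippet below extracts positional and naive dependency contexts from a toy encoding of the running example. The `(word, head index, relation)` tuples, the function names, and the `^-1` suffix for inverse relations are assumptions made for illustration; a real pipeline would read the parser output instead. Positional offsets are counted from the target word, so stars sits at +1 and with at +2. Contexts are keyed by word form here, which only works because every word in the toy sentence is distinct.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy encoding of a UD parse: (word, head index, relation).
# A head index of -1 marks the root.
Token = Tuple[str, int, str]

PARSE: List[Token] = [
    ("Australian", 1, "amod"),
    ("scientist", 2, "nsubj"),
    ("discovers", -1, "root"),
    ("stars", 2, "dobj"),
    ("with", 5, "case"),
    ("telescope", 2, "nmod"),
]


def posit_contexts(words: List[str], position: int, k: int) -> List[str]:
    """Window contexts annotated with their signed offset from the target."""
    contexts = []
    for offset in range(-k, k + 1):
        j = position + offset
        if offset != 0 and 0 <= j < len(words):
            contexts.append(f"{words[j]}_{offset:+d}")
    return contexts


def udeps_naive(parse: List[Token]) -> Dict[str, List[str]]:
    """Pair each head with modifier_rel and each modifier with head_rel^-1."""
    contexts: Dict[str, List[str]] = defaultdict(list)
    for word, head, rel in parse:
        if head < 0:  # the root has no head, hence no inverse context
            continue
        head_word = parse[head][0]
        contexts[head_word].append(f"{word}_{rel}")
        contexts[word].append(f"{head_word}_{rel}^-1")
    return dict(contexts)


words = [word for word, _, _ in PARSE]
print(posit_contexts(words, words.index("discovers"), k=2))
print(udeps_naive(PARSE)["discovers"])
```

Note that the naive extraction also emits the pair (telescope, with_case), since every arc in the parse is used as-is.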

UDEPS-ARC However, UDEPS-NAIVE also produces uninformative context pairs such as (telescope, with_case), and it does not specify the type of e.g. the nmod relation between discovers and telescope, which are linked through the preposition with. Our intuition is that a simple post-hoc intervention into the UDEPS context extraction may yield even more focused contexts. UDEPS-ARC leans on the idea of arc collapsing from prior work (Levy and Goldberg, 2014a; Melamud et al., 2016) that we now adjust to the UD annotation scheme. The difference to UDEPS-NAIVE is as follows: For each pair of words linked through case (e.g., discovers and telescope), we introduce a new "pseudo-arc" which is typed by the actual case/preposition. This results in a new context for discovers: telescope_case_with and also for telescope: discovers_case_with^{-1} (Fig. 1). In addition, we remove the uninformative case arc and its associated contexts: (with, telescope_case^{-1}), (telescope, with_case) from the training pairs.
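One possible reading of this post-parsing step is sketched below. It assumes the same hypothetical toy encoding of the running example as `(word, head index, relation)` tuples, writes inverse relations with a `^-1` suffix, and keeps the original nmod arc alongside the new pseudo-arc, since the description above only mentions removing the case arc. The function name `udeps_arc` is likewise illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical toy encoding of a UD parse: (word, head index, relation).
# A head index of -1 marks the root.
Token = Tuple[str, int, str]

PARSE: List[Token] = [
    ("Australian", 1, "amod"),
    ("scientist", 2, "nsubj"),
    ("discovers", -1, "root"),
    ("stars", 2, "dobj"),
    ("with", 5, "case"),
    ("telescope", 2, "nmod"),
]


def udeps_arc(parse: List[Token]) -> Dict[str, List[str]]:
    """Naive extraction, except that case arcs are collapsed into pseudo-arcs."""
    contexts: Dict[str, List[str]] = defaultdict(list)
    for word, head, rel in parse:
        if head < 0:  # root: no head, nothing to emit
            continue
        head_word = parse[head][0]
        if rel == "case":
            # Skip the case arc itself and link the case-marked word
            # (e.g., telescope) to its own head (e.g., discovers) instead.
            grand = parse[head][1]
            if grand >= 0:  # assumption: no pseudo-arc if the noun is the root
                grand_word = parse[grand][0]
                label = f"case_{word}"
                contexts[grand_word].append(f"{head_word}_{label}")
                contexts[head_word].append(f"{grand_word}_{label}^-1")
            continue
        contexts[head_word].append(f"{word}_{rel}")
        contexts[word].append(f"{head_word}_{rel}^-1")
    return dict(contexts)


arc = udeps_arc(PARSE)
print(sorted(arc["discovers"]))
print(sorted(arc["telescope"]))
```

In this sketch the preposition with no longer appears as a target word or as a bare case context; it survives only inside the typed pseudo-arc label.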

Experimental Setup
Evaluation Our cross-linguistic study is made possible not only thanks to the "universal NLP" initiative but also owing to the benchmarking evaluation sets for other languages beyond English (i.e., IT, DE) that have very recently become available, e.g., (Leviant and Reichart, 2015). We evaluate SGNS with different context types from sect. 2.1 across the three languages on two benchmarking tasks and datasets: (1) semantic similarity on SimLex-999 (Hill et al., 2015) translated and re-scored by native speakers in EN, DE, and IT (Leviant and Reichart, 2015), and (2) word analogies on the Google dataset (Mikolov et al., 2013a) made available in IT (Berardi et al., 2015) and DE (Köper et al., 2015) only recently.
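The two evaluation tasks can be illustrated with a small self-contained sketch. It assumes the usual protocol for such benchmarks: Spearman's rank correlation between cosine similarities and human scores for the similarity task, and top-1 retrieval with the additive vector-offset method (often called 3CosAdd) for analogies. The text above does not state which analogy objective was used, so the latter is an assumption, and the two-dimensional embeddings and scores are invented toy values, not data from SimLex-999 or the Google dataset.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values: List[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            result[idx] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman's rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def similarity_score(emb: Dict[str, Vector], gold: List[Tuple[str, str, float]]) -> float:
    """Correlate model similarities with human scores over covered pairs."""
    covered = [(a, b, s) for a, b, s in gold if a in emb and b in emb]
    predicted = [cosine(emb[a], emb[b]) for a, b, _ in covered]
    return spearman(predicted, [s for _, _, s in covered])


def analogy_top1(emb: Dict[str, Vector], a: str, b: str, c: str) -> str:
    """Answer a:b :: c:? by the word closest to b - a + c, excluding the query words."""
    query = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], query))


# Invented toy embeddings and scores, for illustration only.
emb = {
    "tiger": [1.0, 0.1], "cat": [0.9, 0.2], "jungle": [0.1, 1.0],
    "dancing": [1.0, 0.0], "danced": [1.0, 1.0],
    "sleeping": [3.0, 0.0], "slept": [3.0, 1.0], "honest": [0.0, 3.0],
}
gold = [("tiger", "cat", 9.0), ("tiger", "jungle", 2.0), ("cat", "jungle", 3.0)]

rho = similarity_score(emb, gold)
answer = analogy_top1(emb, "dancing", "danced", "sleeping")
```

Accuracy on an analogy set would then be the fraction of questions for which the returned word matches the gold answer.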

UPOS Tagging and UD Parsing
The Wikipedia corpora were UPOS-tagged using a state-of-the-art system TurboTagger (Martins et al., 2013). 5 TurboTagger was trained using suggested settings without any further parameter fine-tuning (SVM MIRA with 20 iterations) on the TRAIN+DEV portion of the UD treebank annotated with UPOS tags. Following that, the Wikipedia data were UD-parsed 6 using the graph-based Mate parser v3.61 (Bohnet, 2010) 7 and the same regime: suggested settings on the TRAIN+DEV UD treebank portion. 8 The performance of the models measured on the TEST portion of the UD treebanks is reported in Tab. 1.
4 https://sites.google.com/site/rmyeid/projects/polyglot
5 http://www.cs.cmu.edu/~ark/TurboParser/
6 Besides EN, DE, and IT, we also UPOS-tagged and UD-parsed Wikipedias in NL, ES, and HR. We believe that the full UPOS-tagged and UD-parsed Wikipedias in six languages are a valuable asset for future research and we plan to make the resource publicly available at: http://ltl.mml.cam.ac.uk/resources/
7 https://code.google.com/archive/p/mate-tools/
8 We opted for the Mate parser due to its speed, simplicity, and state-of-the-art performance according to very recent parser evaluations (Choi et al., 2015).
The results are consistent with prior work on the UD treebanks, e.g., (Tiedemann, 2015).
Training Setup The SGNS preprocessing scheme for English was replicated from (Levy and Goldberg, 2014a) and extended to the other two languages: all tokens were converted to lowercase, and words and contexts that appeared fewer than 100 times were filtered. Exactly the same vocabularies were used with all context types (approx. 185K distinct EN words, 163K DE words, and 83K IT words). The word2vecf SGNS was trained using standard settings: 15 epochs, 15 negative samples, global learning rate 0.025, subsampling rate 1e-4. All WEs were trained with d = 50, 100, 300, 500, 600. BOW-based WEs were trained with k = 2 (BOW-2), proven to be the (near-)optimal choice across various semantic tasks in related work (Levy and Goldberg, 2014a; Melamud et al., 2016). The same k was used for POSIT-based WEs (POSIT-2).
... SimLex pairs, and 0.378 on verb pairs. A comparison with UDEPS-ARC reveals only a slight drop in performance when switching to language-agnostic UDEPS (see Fig. 2(a), Q1). However, the results are heavily dependent on the actual language: the claims made for English (i.e., DEPS ≥ BOW) do not extend to other languages (Q2). A comparison of results from Tab. 1 with the task evaluation also shows that excellent tagging and parsing results do not guarantee a strong task performance.

Results and Discussion
The results over the verb subset of SimLex also reveal that claims established with English are not necessarily general and true with other languages. For instance, while it has been noted that modeling verb similarity is indeed a difficult problem in English as evidenced by lower correlation scores on SimLex (see Fig. 2(a) and e.g. (Schwartz et al., 2015)), verbs are apparently easier to model in Italian (Fig. 2(c)), and a real challenge in German.
Table 3: Rankings based on Acc 1 scores over syntactic analogy groups (from the Google dataset). A = UDEPS-ARC, N = UDEPS-NAIVE, B = BOW-2, P = POSIT-2. d = 300.
The results on the analogy task from Tab. 2 suggest the evident advantage of more abundant (but less informed) BOW contexts across all languages. This finding is completely in line with the analyses from prior work on English, e.g., Levy and Goldberg (2014a) report that "DEPS perform dramatically worse than BOW contexts on analogy tasks", but without providing any exact numbers.
Nonetheless, the relative ranking of context types over syntactic analogy sets as highlighted in Tab. 3 marks the evident advantage of the more-informed POSIT and UDEPS-ARC on analogies referring to functional similarity. UDEPS-ARC in German outperforms all other context types on all syntactic analogies, except for the nationality-adjective relation. The strongest performance of UDEPS is detected with syntactic analogies where two words in the analogy pair are perfectly replaceable in the given context (e.g., past-tense: dancing-danced, sleeping-slept or opposite: sure-unsure, honest-dishonest).
We can also see that POSIT displays a strong performance in detecting functional similarity across all three languages in both tasks (e.g., see the results in Tab. 3, where it outperforms BOW). This finding suggests that POSIT should be included as a strong baseline in any follow-up work.
We also analysed the influence of the training data size by learning EN WEs from the EN Wikipedia comprising roughly 13M sentences (same size as the IT Wikipedia). As Tab. 4 shows, the absolute scores are naturally lower with less training data, and we observe a decrease in the performance of UDEPS. However, the decrease is small: these results demonstrate that the reduced performance of UDEPS in IT and DE cannot be attributed solely to smaller training datasets and sparsity of (word, context) pairs. Finally, the consistent improvements of UDEPS-ARC over UDEPS-NAIVE for all three languages on both tasks show the importance of a careful post-hoc selection of informative contexts. Future work will delve deeper into the informative context selection for the WE learning.

Conclusion and Future Work
We have presented the first comparison of different context types for learning word embeddings for multiple languages. Dependency-based contexts in different languages are for the first time extracted from "universal" parses made possible by the Universal Dependencies initiative, without any language-specific optimisation.
In sum, our comparison provides no clear answer to the question posed by the title of this paper. However, it shows conclusively that different context types yield semantic spaces with different properties, and that the optimal context type depends on the actual application and language. The usefulness of universal dependency-based contexts is evident with a simple post-parsing context extraction scheme in tasks oriented towards syntactic/functional similarity.
This first cross-linguistic analysis, covering only a small set of languages from the same (Indo-European) phylum, also reveals that training word embeddings in languages other than English is not trivial, and that some Anglo-centric assumptions do not extend to other languages (Bender, 2011). It is therefore essential not to generalise results on English to other languages without clear empirical evidence. Yet, a broader cross-linguistic study involving more languages from other families (with UD treebanks available) and additional experimentation is warranted in order to better guide research on "universal NLP" and language-independent word representation learning.