Cross-Lingual Syntactically Informed Distributed Word Representations

We develop a novel cross-lingual word representation model which injects syntactic information through dependency-based contexts into a shared cross-lingual word vector space. The model, termed CL-DepEmb, is based on the following assumptions: (1) dependency relations are largely language-independent, at least for related languages and prominent dependency links such as direct objects, as evidenced by the Universal Dependencies project; (2) word translation equivalents take similar grammatical roles in a sentence and are therefore substitutable within their syntactic contexts. Experiments with several language pairs on word similarity and bilingual lexicon induction, two fundamental semantic tasks emphasising semantic similarity, suggest the usefulness of the proposed syntactically informed cross-lingual word vector spaces. Improvements are observed in both tasks over standard cross-lingual “offline mapping” baselines trained using the same setup and an equal level of bilingual supervision.

Another line of work has demonstrated that syntactically informed dependency-based (DEPS) word vector spaces in monolingual settings (Lin, 1998; Padó and Lapata, 2007; Utt and Padó, 2014) are able to capture finer-grained distinctions compared to vector spaces based on standard bag-of-words (BOW) contexts. Dependency-based vector spaces steer the induced word embeddings (WEs) towards functional similarity (e.g., tiger:cat) rather than topical similarity/relatedness (e.g., tiger:jungle). They support a variety of similarity tasks in monolingual settings, typically outperforming BOW contexts for English (Bansal et al., 2014; Hill et al., 2015; Melamud et al., 2016). However, despite the steadily growing landscape of cross-lingual (CL) WE models, each requiring a different form of cross-lingual supervision to induce a shared cross-lingual vector space (SCLVS), syntactic information is still typically discarded in the SCLVS learning process.
To bridge this gap, in this work we develop a new cross-lingual WE model, termed CL-DEPEMB, which injects syntactic information into a SCLVS. The model is supported by the recent initiatives on language-agnostic annotations for universal language processing (i.e., universal POS (UPOS) tagging and dependency (UD) parsing) (Nivre et al., 2015). Relying on cross-linguistically consistent UD-typed dependency links in two languages plus a word translation dictionary, the model assumes that one-to-one word translations are substitutable within their syntactic contexts in both languages. It constructs hybrid cross-lingual dependency trees which can be used to extract monolingual and cross-lingual dependency-based contexts (further discussed in Sect. 2 and illustrated by Fig. 1).
In summary, our focused contribution is a new syntactically informed cross-lingual WE model which takes advantage of the normalisation provided by the Universal Dependencies project to facilitate the syntactic mapping across languages. We report results on two semantic tasks, monolingual word similarity (WS) and bilingual lexicon induction (BLI), which evaluate the monolingual and cross-lingual quality of the induced SCLVS. We observe consistent improvements over baseline CL WE models which require the same level of bilingual supervision (i.e., a word translation dictionary). For this supervision setting, we show a clear benefit of joint online training compared to standard offline models which construct two separate monolingual BOW-based or DEPS-based WE spaces, and then map them into a SCLVS using dictionary entries as done in (Mikolov et al., 2013a; Vulić and Korhonen, 2016b, inter alia).

Methodology
Representation Model In all experiments, we opt for a standard and robust choice in vector space modeling: skip-gram with negative sampling (SGNS) (Mikolov et al., 2013b; Levy et al., 2015). We use word2vecf, a reimplementation of word2vec which is capable of learning from arbitrary (word, context) pairs, thus clearly emphasising the role of context in WE learning.
(Universal) Dependency-Based Contexts A standard procedure to extract dependency-based contexts (DEPS) (Padó and Lapata, 2007; Utt and Padó, 2014) from monolingual data is as follows. Given a parsed training corpus, for each target $w$ with modifiers $m_1, \ldots, m_k$ and a head $h$, $w$ is paired with context elements $m_1\_r_1, \ldots, m_k\_r_k, h\_r_h^{-1}$, where $r$ is the type of the dependency relation between the head and the modifier (e.g., amod), and $r^{-1}$ denotes an inverse relation. When extracting DEPS, we adopt the post-parsing prepositional arc collapsing procedure (Levy and Goldberg, 2014a) (see Fig. 1a-1b).
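The extraction procedure above can be sketched in a few lines of Python. This is only a minimal illustration, not the released implementation: the toy token format (form, 1-based head index, relation), the function name `extract_deps`, and the `_`/`-1` string encoding of contexts are assumptions made for this example, and prepositional arc collapsing is omitted.

```python
from typing import List, Tuple

# Assumed toy format: (form, head_index, relation).
# head_index is 0 for the root, otherwise the 1-based position of the head.
Token = Tuple[str, int, str]


def extract_deps(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Return (word, context) pairs for one dependency-parsed sentence."""
    pairs = []
    for form, head, rel in sentence:
        if head == 0:
            continue  # the root has no incoming arc, so nothing to emit
        head_form = sentence[head - 1][0]
        # The head is paired with its modifier and the relation type: (h, m_r)
        pairs.append((head_form, f"{form}_{rel}"))
        # The modifier is paired with its head via the inverse relation: (m, h_r-1)
        pairs.append((form, f"{head_form}_{rel}-1"))
    return pairs


# Toy sentence, lowercased: "australian scientist discovers stars"
sentence = [
    ("australian", 2, "amod"),
    ("scientist", 3, "nsubj"),
    ("discovers", 0, "root"),
    ("stars", 3, "dobj"),
]
pairs = extract_deps(sentence)
for word, context in pairs:
    print(word, context)
```

Each arc yields two pairs, so the three arcs of the toy sentence produce six (word, context) lines in the plain two-column form that a pair-based SGNS trainer can consume.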
Cross-Lingual DEPS: CL-DEPEMB First, a UD-parsed monolingual training corpus is obtained in both languages $L_1$ and $L_2$. The use of the interlingual UD scheme enables linking dependency trees in both languages (see the structural similarity of the two sentences in English (EN) and Italian (IT), Fig. 1a-1b). For instance, the link between EN words Australian and scientist as well as IT words australiano and scienzato is typed amod in both trees. This link generates the following monolingual EN DEPS: (scientist, Australian_amod), (Australian, scientist_amod$^{-1}$) (similar for IT). Now, assume that we possess an EN-IT translation dictionary D with pairs $[w_1, w_2]$ which contains entries [Australian, australiano] and [scientist, scienzato]. Given the observed similarity in the sentence structure, and the fact that words from a translation pair tend to take similar UPOS tags and similar grammatical roles in a sentence, we can substitute $w_1$ with $w_2$ in all DEPS in which $w_1$ participates (and vice versa, replace $w_2$ with $w_1$). Using the substitution idea, besides the original monolingual EN and IT DEPS contexts, we now generate additional hybrid cross-lingual EN-IT DEPS contexts: (scientist, australiano_amod), (australiano, scientist_amod$^{-1}$), (scienzato, Australian_amod), (Australian, scienzato_amod$^{-1}$) (again, we can also generate such hybrid IT-EN DEPS contexts).
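The substitution idea can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the released code: the arc format (head, modifier, relation), the helper names `deps_pairs` and `hybrid_pairs`, and the choice to swap one word per generated arc are assumptions for illustration. Tokens are lowercased here.

```python
from typing import Dict, List, Tuple

Arc = Tuple[str, str, str]  # assumed format: (head, modifier, relation)
Pair = Tuple[str, str]      # (word, context)


def deps_pairs(head: str, mod: str, rel: str) -> List[Pair]:
    """The two DEPS pairs generated by a single dependency arc."""
    return [(head, f"{mod}_{rel}"), (mod, f"{head}_{rel}-1")]


def hybrid_pairs(arcs: List[Arc], dictionary: Dict[str, str]) -> List[Pair]:
    """Monolingual pairs plus hybrid pairs obtained by dictionary substitution."""
    out: List[Pair] = []
    for head, mod, rel in arcs:
        out.extend(deps_pairs(head, mod, rel))  # original monolingual pairs
        if head in dictionary:
            # Replace the head with its translation, keep the modifier
            out.extend(deps_pairs(dictionary[head], mod, rel))
        if mod in dictionary:
            # Replace the modifier with its translation, keep the head
            out.extend(deps_pairs(head, dictionary[mod], rel))
    return out


en_arcs = [("scientist", "australian", "amod")]
en_it = {"australian": "australiano", "scientist": "scienzato"}
result = hybrid_pairs(en_arcs, en_it)
for word, context in result:
    print(word, context)
```

For the single amod arc this yields the two monolingual EN pairs and the four hybrid EN-IT pairs listed in the text. The hybrid IT-EN pairs would follow from calling the same function on the IT arcs with the dictionary reversed.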
CL-DEPEMB then trains jointly on such extended DEPS contexts containing both monolingual and cross-lingual (word, context) dependency-based pairs. With CL-DEPEMB, words are considered similar if they often co-occur with similar words (and their translations) in the same dependency relations in both languages. For instance, words discovers and scopre might be considered similar as they frequently co-occur as predicates for the nominal subjects (nsubj) scientist and scienzato, and stars and stelle are their frequent direct objects (dobj). An illustrative example of the core idea behind CL-DEPEMB is provided in Fig. 1.
Offline Models vs CL-DEPEMB (Joint) CL-DEPEMB uses a dictionary D as the bilingual signal to tie two languages into a SCLVS. A standard CL WE learning scenario in this setup is as follows (Mikolov et al., 2013a; Vulić and Korhonen, 2016b): (1) two separate monolingual WE spaces are induced using SGNS; (2) dictionary entries from D are used to learn a mapping function mf from the $L_1$ space to the $L_2$ space; (3) when mf is applied to all $L_1$ word vectors, the transformed $L_1$ space together with the $L_2$ space is a SCLVS. Monolingual WE spaces may be induced using different context types (e.g., BOW or DEPS). Since the transformation is done after training, these models are typically termed offline CL WE models.
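Steps (2) and (3) of the offline scenario can be sketched as follows, assuming that mf is a linear map W fitted by minimising the squared error between mapped source vectors and their dictionary translations. The function names `learn_mapping` and `apply_mapping`, the plain gradient descent solver, and the two-dimensional toy vectors are assumptions for illustration only.

```python
from typing import List

Vector = List[float]


def learn_mapping(src: List[Vector], tgt: List[Vector],
                  lr: float = 0.1, epochs: int = 500) -> List[Vector]:
    """Fit W (d_tgt x d_src) so that W x is close to z for dictionary pairs (x, z)."""
    d_src, d_tgt = len(src[0]), len(tgt[0])
    W = [[0.0] * d_src for _ in range(d_tgt)]
    for _ in range(epochs):
        for x, z in zip(src, tgt):
            pred = [sum(W[i][j] * x[j] for j in range(d_src)) for i in range(d_tgt)]
            for i in range(d_tgt):
                err = pred[i] - z[i]
                for j in range(d_src):
                    # Gradient of the squared error (constant factor folded into lr)
                    W[i][j] -= lr * err * x[j]
    return W


def apply_mapping(W: List[Vector], x: Vector) -> Vector:
    """Map one source-language vector into the target-language space."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


# Toy "dictionary": target vectors are the source vectors rotated by 90 degrees
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
W = learn_mapping(src, tgt)
mapped = apply_mapping(W, [2.0, 0.0])  # a vector not seen during fitting
print(mapped)
```

Because the mapping is learned only after both monolingual spaces are fixed, the dictionary cannot influence how the individual word vectors are trained, which is the main contrast with joint training.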
On the other hand, given a dictionary link $[w_1, w_2]$ between an $L_1$ word $w_1$ and an $L_2$ word $w_2$, our CL-DEPEMB model performs online training: it uses the word $w_1$ to predict syntactic neighbours of the word $w_2$ and vice versa. In fact, we train a single SGNS model with a joint vocabulary on two monolingual UD-parsed datasets with additional cross-lingual dependency-based training examples fused with standard monolingual DEPS pairs. From another perspective, the CL-DEPEMB model trains an extended dependency-based SGNS model now composed of four joint SGNS models between the following language pairs: $L_1 \rightarrow L_1$, $L_1 \rightarrow L_2$, $L_2 \rightarrow L_1$, $L_2 \rightarrow L_2$ (see Fig. 1). 4

Experimental Setup
We report results with two language pairs: English-German/Italian (EN-DE/IT) due to the availability of comprehensive test data for these pairs (Leviant and Reichart, 2015; Vulić and Korhonen, 2016a). (Bohnet, 2010). The SGNS preprocessing scheme is standard (Levy and Goldberg, 2014a): all tokens were lowercased, and words and contexts that appeared fewer than 100 times were filtered out. 7 We report results with d = 300-dimensional WEs, as similar trends are observed with other values of d.
4 A similar idea of extended joint CL training was discussed previously by Luong et al. (2015) and Coulmance et al. (2015). In this work, we show that expensive parallel data and word alignment links are not required to produce a SCLVS. Further, instead of using BOW contexts, we demonstrate how to use DEPS contexts for joint training in the CL settings.
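The lowercasing and frequency filtering used in the SGNS preprocessing can be sketched as below. This is a hedged illustration rather than the actual pipeline: the function name `filter_pairs` is hypothetical, frequencies are counted once over the lowercased pairs, and the toy call uses a threshold of 2 instead of 100 so that the effect is visible on three pairs.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (word, context)


def filter_pairs(pairs: Iterable[Pair], min_count: int = 100) -> List[Pair]:
    """Lowercase all pairs, then drop those with a rare word or a rare context."""
    lowered = [(w.lower(), c.lower()) for w, c in pairs]
    word_freq = Counter(w for w, _ in lowered)
    ctx_freq = Counter(c for _, c in lowered)
    return [
        (w, c)
        for w, c in lowered
        if word_freq[w] >= min_count and ctx_freq[c] >= min_count
    ]


toy_pairs = [
    ("Scientist", "Australian_amod"),
    ("scientist", "australian_amod"),
    ("scientist", "young_amod"),
]
kept = filter_pairs(toy_pairs, min_count=2)
print(kept)
```

In the toy call the context young_amod occurs only once and is removed, while both occurrences of australian_amod survive because lowercasing merges the two surface forms.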

Implementation
The code for generating monolingual and cross-lingual dependency-based (word, context) pairs for the word2vecf SGNS training using a bilingual dictionary D is available at: https://github.com/cambridgeltl/cl-depemb/.
Translation Dictionaries We report results with a dictionary D labelled BNC+GT: a list of the 6,318 most frequent EN lemmas in the BNC corpus (Kilgarriff, 1997) translated to DE and IT using Google Translate (GT), and subsequently cleaned by native speakers. A similar setup was used by (Mikolov et al., 2013a; Vulić and Korhonen, 2016b). We also experiment with dict.cc, a freely available large online dictionary (http://www.dict.cc/), and find that the relative model rankings stay the same in both evaluation tasks irrespective of the chosen D.
Baseline Models CL-DEPEMB is compared against two relevant offline models which also learn using a seed dictionary D: (1) OFF-BOW2 is a linear mapping model from (Mikolov et al., 2013a; Vulić and Korhonen, 2016b) which trains two SGNS models with a window size of 2, a standard value (Levy and Goldberg, 2014a); we also experiment with more informed positional BOW contexts (Schütze, 1993; Levy and Goldberg, 2014b) (OFF-POSIT2); (2) OFF-DEPS trains two DEPS-based monolingual WE spaces and linearly maps them into a SCLVS. Note that OFF-DEPS uses exactly the same information (i.e., UD-parsed corpora plus dictionary D) as CL-DEPEMB.

Results and Discussion
Evaluation Tasks Following Luong et al. (2015) and Duong et al. (2016), we argue that good cross-lingual word representations should preserve both monolingual and cross-lingual representation quality. Therefore, similar to (Duong et al., 2016; Upadhyay et al., 2016), we test cross-lingual WEs in two core semantic tasks: monolingual word similarity (WS) and bilingual lexicon induction (BLI).
7 Exactly the same vocabularies were used with all models (∼185K distinct EN words, 163K DE words, and 83K IT words). All word2vecf SGNS models were trained using standard settings: 15 epochs, 15 negative samples, global [...]
[Table caption, partially recovered: ... 1 scores). For SL-TRANS we also report results on the verb translation subtask (numbers in square brackets).]
Word Similarity Word similarity experiments were conducted on the benchmarking multilingual SimLex-999 evaluation set (Leviant and Reichart, 2015) which provides monolingual similarity scores for 999 word pairs in English, German, and Italian. The results for the three languages are displayed in Tab. 1. These results suggest that CL-DEPEMB is the best performing and most robust model in our comparison across all three languages, providing the first insight that the online training with the extended set of DEPS pairs is indeed beneficial for modeling true (functional) similarity.
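A word similarity evaluation of this kind is commonly scored with Spearman's ρ between model cosine similarities and human ratings. The sketch below assumes this protocol. The function names, the decision to skip pairs with out-of-vocabulary words, and all vectors and gold scores are toy assumptions, not SimLex-999 data or results.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of the ranks (assumes neither input is constant)."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate_ws(vectors: Dict[str, List[float]],
                gold: List[Tuple[str, str, float]]) -> float:
    """Spearman correlation between cosine similarities and gold scores."""
    preds, scores = [], []
    for w1, w2, score in gold:
        if w1 in vectors and w2 in vectors:  # assumption: skip OOV pairs
            preds.append(cosine(vectors[w1], vectors[w2]))
            scores.append(score)
    return spearman(preds, scores)


# Toy vectors and made-up gold ratings, for illustration only
vectors = {"tiger": [1.0, 0.0], "cat": [0.9, 0.1],
           "jungle": [0.0, 1.0], "car": [-1.0, 0.0]}
gold = [("tiger", "cat", 9.0), ("tiger", "jungle", 3.0), ("tiger", "car", 1.0)]
rho = evaluate_ws(vectors, gold)
print(rho)
```

Since ρ depends only on the ranking of pairs, a model is rewarded for ordering tiger:cat above tiger:jungle rather than for matching the absolute rating values.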
We also carry out tests in English using another word similarity metric: QVEC, which measures how well the induced word vectors correlate with a matrix of features from manually crafted lexical resources and is better aligned with downstream performance (Tsvetkov et al., 2015). The results are again in favour of CL-DEPEMB with a QVEC score of 0.540 (BNC+GT) and 0.543 (dict.cc), compared to those of OFF-BOW2 (0.496), OFF-POSIT2 (0.510), and OFF-DEPS (0.528).

Bilingual Lexicon Induction
The gap between the online CL-DEPEMB model and the offline baselines is now even more prominent, and there is a huge difference in performance between OFF-DEPS and CL-DEPEMB, two models using exactly the same information for training.
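Bilingual lexicon induction in a shared space can be scored by retrieving, for each source word, its nearest target-language neighbour and checking it against a gold translation. The following sketch assumes cosine similarity and Top 1 accuracy. The function names and toy vectors are hypothetical, and the actual test sets and retrieval details are not specified in this section.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def translate(word: str,
              src_vecs: Dict[str, List[float]],
              tgt_vecs: Dict[str, List[float]]) -> str:
    """Return the target-language word closest to `word` in the shared space."""
    query = src_vecs[word]
    return max(tgt_vecs, key=lambda cand: cosine(query, tgt_vecs[cand]))


def top1_accuracy(test_pairs: List[Tuple[str, str]],
                  src_vecs: Dict[str, List[float]],
                  tgt_vecs: Dict[str, List[float]]) -> float:
    """Fraction of source words whose nearest target neighbour is the gold translation."""
    correct = sum(
        1 for src, gold in test_pairs
        if translate(src, src_vecs, tgt_vecs) == gold
    )
    return correct / len(test_pairs)


# Toy shared space: EN and IT vectors live in the same two dimensions
en_vecs = {"discovers": [1.0, 0.1], "stars": [0.0, 1.0]}
it_vecs = {"scopre": [0.9, 0.2], "stelle": [0.1, 1.0]}
test_pairs = [("discovers", "scopre"), ("stars", "stelle")]
acc = top1_accuracy(test_pairs, en_vecs, it_vecs)
print(acc)
```

The same routine applies to offline and joint models alike, since both ultimately place the two vocabularies in one vector space.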
Experiments on Verbs Following prior work, e.g., (Bansal et al., 2014; Melamud et al., 2016; Schwartz et al., 2016), we further show that WE models which capture functional similarity are especially important for modelling particular "more grammatical" word classes such as verbs and adjectives. Therefore, in Tab. 1 and Tab. 2 we also report results on verb similarity and translation. The results indicate that injecting syntax into cross-lingual word vector spaces leads to clear improvements on modelling verbs in both evaluation tasks.
We further verify this intuition by running experiments on another word similarity evaluation set which specifically targets verb similarity: SimVerb-3500 (Gerz et al., 2016) contains similarity scores for 3,500 verb pairs. The results of CL-DEPEMB on SimVerb-3500 with dict.cc are provided in Tab. 3, further indicating the usefulness of syntactic information in multilingual settings for improved verb representations.
Similar trends are observed with adjectives: e.g., CL-DEPEMB with dict.cc obtains a ρ correlation score of 0.585 on the adjective subset of DE SimLex while the best baseline score is 0.417; for IT these scores are 0.334 vs. 0.266.

Conclusion and Future Work
We have presented a new cross-lingual word embedding model which injects syntactic information into a cross-lingual word vector space, resulting in improved modeling of functional similarity, as evidenced by improvements on word similarity and bilingual lexicon induction tasks for several language pairs. Future work may bring further advances through more accurate dependency parsers applicable across different languages (Ammar et al., 2016), selection and filtering of reliable dictionary entries (Peirsman and Padó, 2010; Vulić and Moens, 2013b; Vulić and Korhonen, 2016b), and more sophisticated approaches to constructing hybrid cross-lingual dependency trees (Fig. 1). Other cross-lingual semantic tasks such as lexical entailment (Mehdad et al., 2011; Vyas and Carpuat, 2016) or lexical substitution (Mihalcea et al., 2010) may also benefit from syntactically informed cross-lingual representations. We also plan to test the portability of the proposed framework, which relies on the abstractive assumption of language-universal dependency structures, to more language pairs, including ones outside the Indo-European language family.