Robust Cross-Lingual Hypernymy Detection Using Dependency Context

Cross-lingual Hypernymy Detection involves determining if a word in one language (“fruit”) is a hypernym of a word in another language (“pomme” i.e. apple in French). The ability to detect hypernymy cross-lingually can aid in solving cross-lingual versions of tasks such as textual entailment and event coreference. We propose BiSparse-Dep, a family of unsupervised approaches for cross-lingual hypernymy detection, which learns sparse, bilingual word embeddings based on dependency contexts. We show that BiSparse-Dep can significantly improve performance on this task, compared to approaches based only on lexical context. Our approach is also robust, showing promise for low-resource settings: our dependency-based embeddings can be learned using a parser trained on related languages, with negligible loss in performance. We also crowd-source a challenging dataset for this task on four languages – Russian, French, Arabic, and Chinese. Our embeddings and datasets are publicly available.


Introduction
Translation helps identify correspondences in bilingual texts, but other asymmetric semantic relationships can improve language understanding when translations are not exactly equivalent. One such relationship is cross-lingual hypernymy: identifying that écureuil ("squirrel" in French) is a kind of rodent, or that ворона ("crow" in Russian) is a kind of bird. The ability to detect hypernyms across languages serves as a building block in a range of cross-lingual tasks, including Recognizing Textual Entailment (RTE) (Negri et al., 2012, 2013), constructing multilingual taxonomies (Fu et al., 2014), event coreference across multilingual news sources (Vossen et al., 2015), and evaluating Machine Translation output (Padó et al., 2009).
* These authors contributed equally. 1 https://github.com/yogarshi/bisparse-dep/
Building models that can robustly identify hypernymy across the spectrum of human languages is a challenging problem that is further compounded in low-resource settings. At first glance, translating words to English and then identifying hypernyms in a monolingual setting may appear to be a sufficient solution. However, this approach cannot capture many phenomena. For instance, the English words cook, leader and supervisor can all be hypernyms of the French word chef, as the French word does not have an exact translation in English covering its possible usages. Translating chef to cook and then determining hypernymy monolingually precludes identifying leader or supervisor as hypernyms of chef. Similarly, language-specific usage patterns can also influence hypernymy decisions. For instance, the French word chroniqueur translates to chronicler in English, but is more frequently used in French to refer to journalists (making journalist its hypernym). This motivates approaches that directly detect hypernymy in the cross-lingual setting by extending distributional methods for detecting monolingual hypernymy, as in our prior work (Vyas and Carpuat, 2016). State-of-the-art distributional approaches (Roller and Erk, 2016; Shwartz et al., 2017) for detecting monolingual hypernymy require syntactic analysis (e.g., dependency parsing), which may not be available for many languages. Additionally, limited training resources make unsupervised methods more desirable than supervised hypernymy detection approaches (Roller and Erk, 2016). Furthermore, monolingual distributional approaches cannot be applied directly to the cross-lingual task, because the vector spaces of two languages need to be aligned using a cross-lingual resource (a bilingual dictionary, for instance).
We tackle these challenges by proposing BISPARSE-DEP, a family of robust, unsupervised approaches for identifying cross-lingual hypernymy. BISPARSE-DEP uses a cross-lingual word embedding model learned from a small bilingual dictionary and a variety of monolingual syntactic contexts extracted from a dependency-parsed corpus. BISPARSE-DEP exhibits robust behavior along multiple dimensions. In the absence of a dependency treebank for a language, it can learn embeddings using a parser trained on related languages. When exposed to less monolingual data, or a lower quality bilingual dictionary, BISPARSE-DEP degrades only marginally. In all these cases, it compares favorably with models that have been supplied with all necessary resources, showing promise for low-resource settings. We extensively evaluate BISPARSE-DEP on a new crowd-sourced cross-lingual dataset, with over 2900 hypernym pairs, spanning four languages from distinct families (French, Russian, Arabic and Chinese), and release the datasets for future evaluations.

Related Work
Cross-lingual Distributional Semantics Cross-lingual word embeddings have been shown to encode semantics across languages in tasks such as word similarity (Faruqui and Dyer, 2014) and lexicon induction (Vulić and Moens, 2015). Our work stands apart in two respects. (1) In contrast to tasks involving similarity and synonymy (symmetric relations), the focus of our work is on detecting asymmetric relations across languages, using cross-lingual embeddings. (2) Unlike most previous work, we use dependency context instead of lexical context to induce cross-lingual embeddings, which allows us to abstract away from language-specific word order, and (as we show) improves hypernymy detection.
More closely related is our prior work (Vyas and Carpuat, 2016), where we used lexical context based embeddings to detect cross-lingual lexical entailment. In contrast, the focus of this work is on hypernymy, a more well-defined relation than entailment. We also improve upon our previous approach by using dependency based embeddings (§6.1), and show that the improvements hold even in data-scarce settings (§6.3). Finally, we perform a more comprehensive evaluation on four languages paired with English, instead of just French.
Dependency Based Embeddings In monolingual settings, dependency based embeddings have been shown to outperform window based embeddings on many tasks (Bansal et al., 2014; Hill et al., 2014; Melamud et al., 2016). Roller and Erk (2016) showed that dependency embeddings can help in recovering Hearst patterns (Hearst, 1992) like "animals such as cats", which are known to be indicative of hypernymy. Shwartz et al. (2017) demonstrated that dependency based embeddings are almost always superior to window based embeddings for identifying hypernyms in English. Our work uses dependency based embeddings in a cross-lingual setting, a less explored research direction. A key novelty of our work also lies in its use of syntactic transfer to derive dependency contexts. This scenario is more relevant in a cross-lingual setting, where treebanks might not be available for many languages.

Our Approach -BISPARSE-DEP
We propose BISPARSE-DEP, a family of approaches that uses sparse, bilingual, dependency based word embeddings to identify cross-lingual hypernymy. Figure 1 shows an overview of the end-to-end pipeline of BISPARSE-DEP. The two key components of this pipeline are:
(1) Dependency based contexts (§3.1), which help us generalize across languages with minimal customization by abstracting away language-specific word order. We also discuss how to extract such contexts in the absence of a treebank in the language (§3.2) using a (weak) dependency parser trained on related languages.
(2) Bilingual sparse coding (§3.3), which allows us to align dependency based word embeddings in a shared semantic space using a small bilingual dictionary. The resulting sparse bilingual embeddings can then be used with an unsupervised entailment scorer (§3.4) to predict hypernymy for cross-lingual word pairs.
Figure 1: The BISPARSE-DEP approach, which learns sparse bilingual embeddings using dependency based contexts. The resulting sparse embeddings, together with an unsupervised entailment scorer, can detect hypernyms across languages (e.g., pomme is a fruit).

Dependency Based Context Extraction
The context of a word can be described in multiple ways using its syntactic neighborhood in a dependency graph. For instance, in Figure 2, we describe the context for a target word (traveler) in the following two ways:
• FULL context (Padó and Lapata, 2007; Baroni and Lenci, 2010; Levy and Goldberg, 2014): Children and parent words, concatenated with the label and direction of the relation (e.g., roamed#nsubj^-1 and tired#amod are contexts for traveler).
• JOINT context (Chersoni et al., 2016): Parent concatenated with each of its siblings (e.g., roamed#desert and roamed#seeking are contexts for traveler).
These two contexts exploit different amounts of syntactic information: JOINT does not require labeled parses, unlike FULL. The JOINT context combines parent and sibling information, while FULL keeps them as distinct contexts. Both encode directionality into the context, either through label direction or through sibling-parent relations. We use word-context co-occurrences generated using these contexts in a distributional semantic model (DSM) in lieu of window based contexts to generate dependency based embeddings.
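The two context types can be illustrated with a small sketch. The toy parse below is an assumption made for illustration: it is loosely modeled on the traveler example, and the labels attached to "desert" and "seeking", as well as the `Token` structure and function names, are hypothetical rather than taken from the paper.

```python
from collections import namedtuple

# Hypothetical toy parse; head == 0 marks the root. The labels on "desert"
# and "seeking" are assumed for illustration only.
Token = namedtuple("Token", "idx word head label")

SENTENCE = [
    Token(1, "tired", 2, "amod"),
    Token(2, "traveler", 3, "nsubj"),
    Token(3, "roamed", 0, "root"),
    Token(4, "desert", 3, "dobj"),
    Token(5, "seeking", 3, "advcl"),
]


def full_contexts(sentence, target_idx):
    """FULL: parent and children, each tagged with relation label and direction."""
    by_idx = {t.idx: t for t in sentence}
    target = by_idx[target_idx]
    contexts = []
    if target.head != 0:
        parent = by_idx[target.head]
        # "^-1" marks the inverse direction (target is the dependent).
        contexts.append(f"{parent.word}#{target.label}^-1")
    for tok in sentence:
        if tok.head == target_idx:
            contexts.append(f"{tok.word}#{tok.label}")
    return contexts


def joint_contexts(sentence, target_idx):
    """JOINT: parent concatenated with each sibling; no labels are needed."""
    by_idx = {t.idx: t for t in sentence}
    target = by_idx[target_idx]
    if target.head == 0:
        return []
    parent = by_idx[target.head]
    return [
        f"{parent.word}#{tok.word}"
        for tok in sentence
        if tok.head == target.head and tok.idx != target_idx
    ]


print(full_contexts(SENTENCE, 2))
print(joint_contexts(SENTENCE, 2))
```

Counting how often each (word, context) pair occurs over a parsed corpus would then give the co-occurrence matrix used by the DSM.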

Dependency Contexts without a Treebank
Using dependency contexts in multilingual settings may not always be possible, as dependency treebanks are not available for many languages. To circumvent this issue, we use related languages to train a weak dependency parser.
We train a delexicalized parser using treebanks of related languages, where the word form based features are turned off, so that the parser is trained on purely non-lexical features (e.g., POS tags). The rationale behind this is that related languages show common syntactic structure that can be transferred to the original language, with delexicalized parsing (Zeman and Resnik, 2008; McDonald et al., 2011, inter alia) being one popular approach.
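One simple way to picture delexicalization is as a preprocessing step on the training treebank. The sketch below assumes the treebanks are stored in the 10-column, tab-separated CoNLL-U layout (ID, FORM, LEMMA, UPOS, ...); it blanks the lexical columns so that only POS tags and tree structure remain. The function name is hypothetical, and the paper's parser may instead disable lexical features internally.

```python
def delexicalize_conllu(text):
    """Replace FORM and LEMMA with "_" in every token line of a CoNLL-U string."""
    out = []
    for line in text.splitlines():
        # Keep blank lines (sentence breaks) and comment lines untouched.
        if not line or line.startswith("#"):
            out.append(line)
            continue
        cols = line.split("\t")
        if len(cols) == 10:
            cols[1] = "_"  # FORM
            cols[2] = "_"  # LEMMA
        out.append("\t".join(cols))
    return "\n".join(out)


sample = (
    "1\tchat\tchat\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tdort\tdormir\tVERB\t_\t_\t0\troot\t_\t_"
)
print(delexicalize_conllu(sample))
```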

Bilingual Sparse Coding
Given a dependency based co-occurrence matrix described in the previous section(s), we generate BISPARSE-DEP embeddings using the framework from our prior work (Vyas and Carpuat, 2016), which we henceforth call BISPARSE. BISPARSE generates sparse, bilingual word embeddings using a dictionary learning objective with a sparsity-inducing ℓ1 penalty. We give a brief overview of this approach, the full details of which can be found in our prior work.
For two languages with vocabularies v_e and v_f, and monolingual dependency embeddings X_e and X_f, BISPARSE solves a joint dictionary learning objective in which S is a translation matrix, and A_e and A_f are sparse matrices which are bilingual representations in a shared semantic space. The translation matrix S (of size v_e × v_f) captures correspondences between the vocabularies (of size v_e and v_f) of two languages. For instance, each row of S can be a one-hot vector that identifies the word in f that is most frequently aligned with the e word for that row in a large parallel corpus, thus building a one-to-many mapping between the two languages.
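The objective itself is not reproduced in the text above. As a hedged sketch only, assuming the usual form of bilingual sparse coding in the BISPARSE framework, it can be written as below. The dictionaries D_e and D_f and the weights λ_e, λ_f and λ_x are symbols assumed here (they are not defined in this section), and the exact constraints should be checked against Vyas and Carpuat (2016).

```latex
\[
\begin{aligned}
\operatorname*{arg\,min}_{A_e,\, D_e,\, A_f,\, D_f}\;
  & \sum_{i=1}^{v_e} \tfrac{1}{2}\,\bigl\lVert A_{e_i} D_e^{\top} - X_{e_i} \bigr\rVert_2^2
    + \lambda_e \bigl\lVert A_{e_i} \bigr\rVert_1 \\
+ & \sum_{j=1}^{v_f} \tfrac{1}{2}\,\bigl\lVert A_{f_j} D_f^{\top} - X_{f_j} \bigr\rVert_2^2
    + \lambda_f \bigl\lVert A_{f_j} \bigr\rVert_1 \\
+ & \sum_{i,j} \tfrac{1}{2}\,\lambda_x\, S_{ij}\, \bigl\lVert A_{e_i} - A_{f_j} \bigr\rVert_2^2 \\
\text{s.t.}\quad & A_k \ge 0,\qquad \bigl\lVert D_{k_i} \bigr\rVert_2^2 \le 1,\qquad k \in \{e, f\}
\end{aligned}
\]
```

Read this way, the first two sums are monolingual sparse reconstruction terms with an ℓ1 penalty, and the last sum pulls together the representations of word pairs that S marks as translations.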

Unsupervised Entailment Scorer
A variety of scorers can be used to quantify the directional relationship between two words, given feature representations of these words (Lin, 1998;Weeds and Weir, 2003;Lenci and Benotto, 2012).
Once the BISPARSE-DEP embeddings are constructed, we use BalAPinc (Kotlerman et al., 2009) to score word pairs for hypernymy. BalAPinc is based on the distributional inclusion hypothesis (Geffet and Dagan, 2005) and computes the geometric mean of 1) LIN (Lin, 1998), a symmetric score that captures similarity, and 2) APinc, an asymmetric score based on average precision.
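To make the scorer concrete, the sketch below implements BalAPinc on sparse vectors stored as feature-to-weight dictionaries. It follows the definitions of LIN and APinc as we understand them from Kotlerman et al.; the function names and the `top_n` parameter (standing in for the maximum number of features per word) are assumptions, not the authors' implementation.

```python
import math


def ranked_features(vec, top_n=None):
    """Active (positive-weight) features, sorted by decreasing weight."""
    feats = sorted((f for f, w in vec.items() if w > 0), key=lambda f: (-vec[f], f))
    return feats if top_n is None else feats[:top_n]


def lin(u, v, fu, fv):
    """Symmetric LIN similarity over the selected features."""
    shared = set(fu) & set(fv)
    num = sum(u[f] + v[f] for f in shared)
    den = sum(u[f] for f in fu) + sum(v[f] for f in fv)
    return num / den if den else 0.0


def apinc(fu, fv):
    """Asymmetric APinc: are u's top-ranked features included in v's?"""
    if not fu:
        return 0.0
    rank_v = {f: r for r, f in enumerate(fv, start=1)}
    included, total = 0, 0.0
    for r, f in enumerate(fu, start=1):
        if f in rank_v:
            included += 1
            precision = included / r                 # P(r)
            rel = 1.0 - rank_v[f] / (len(fv) + 1)    # graded relevance in v
            total += precision * rel
    return total / len(fu)


def balapinc(u, v, top_n=None):
    """Geometric mean of LIN and APinc for the direction u -> v."""
    fu, fv = ranked_features(u, top_n), ranked_features(v, top_n)
    return math.sqrt(lin(u, v, fu, fv) * apinc(fu, fv))


# Toy vectors: the narrow word's features are a subset of the broad word's.
narrow = {"a": 3.0, "b": 2.0}
broad = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0}
print(balapinc(narrow, broad), balapinc(broad, narrow))
```

On the toy vectors, the score is higher in the narrow-to-broad direction than in the reverse, which is the asymmetry the distributional inclusion hypothesis relies on.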

Crowd-Sourcing Annotations
There is no publicly available dataset to evaluate models of hypernymy detection across multiple languages. While ontologies like Open Multilingual WordNet (OMW) (Bond and Foster, 2013) and BabelNet (Navigli and Ponzetto, 2012) contain cross-lingual links, these resources are semi-automatically generated and hence contain noisy edges. Thus, to get reliable and high-quality test beds, we collect evaluation datasets using CrowdFlower. Our datasets span four languages from distinct families (French (Fr), Russian (Ru), Arabic (Ar) and Chinese (Zh)), each paired with English.
To begin the annotation process, we first pool candidate pairs using hypernymy edges across languages from OMW and BabelNet, along with translations from monolingual hypernymy datasets (Baroni and Lenci, 2011; Baroni et al., 2012; Kotlerman et al., 2010).

Annotation Setup
The annotation task requires annotators to be fluent in both English and the non-English language. To ensure only fluent speakers perform the task, for each language, we provide task instructions in the non-English language itself. Also, we restrict the task to annotators verified by CrowdFlower to have those language skills. Finally, annotators also need to pass a quiz based on a small amount of gold standard data to gain access to the task. Annotators choose between three options for each word pair (p_f, q_e), where p_f is a non-English word and q_e is an English word: "p_f is a kind of q_e", "q_e is a part of p_f" and "none of the above". Word pairs labeled with the first option are considered positive examples, while those labeled "none of the above" are considered negative. The second option was included to filter out meronymy examples that were part of the noisy pool. We leave it to the annotator to infer whether the relation holds between any senses of p_f or q_e, if either of them is polysemous.
For every candidate hypernym pair (p_f, q_e), we also ask annotators to judge its reversed and translated hyponym pair (q_f, p_e). For instance, if (citron, food) is a hypernym candidate, we also show annotators (aliments, lemon), which is a potential hyponym candidate (potential, because as mentioned in §1, translation need not preserve semantic relationships). The purpose of presenting the hyponym pair (q_f, p_e) is two-fold. First, it emphasizes the directional nature of the task. Second, it identifies hyponym pairs, which we use as negative examples. The hyponym pairs are challenging since differentiating them from hypernyms truly requires detecting asymmetry.
Each pair was judged by at least 5 annotators, and judgments with 80% agreement (at least 4 annotators agree) are considered for the final dataset. This is a stricter condition than in certain monolingual hypernymy datasets, for instance EVALution (Santus et al., 2015), where agreement by 3 annotators is deemed sufficient. Inter-annotator agreement measured using Fleiss' Kappa (Fleiss, 1971) was 58.1 (French), 53.7 (Russian), 53.2 (Arabic) and 55.8 (Chinese). This indicates moderate agreement, on par with agreement obtained on related fine-grained semantic tasks (Pavlick et al., 2015). We cannot compare with monolingual hypernymy annotator agreement as, to the best of our knowledge, such numbers are not available for existing test sets. Dataset statistics are shown in Table 1.
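The agreement filter described above can be sketched as follows. The label strings and the function name are hypothetical; only the thresholds (at least 5 judgments, at least 80% agreement) come from the text.

```python
from collections import Counter


def aggregate(judgments, min_annotators=5, min_agreement=0.8):
    """Return the majority label if enough annotators agree, else None."""
    if len(judgments) < min_annotators:
        return None
    label, count = Counter(judgments).most_common(1)[0]
    return label if count / len(judgments) >= min_agreement else None


# 4 of 5 annotators agree, so the pair is kept with its majority label.
print(aggregate(["kind_of", "kind_of", "kind_of", "kind_of", "none"]))
# Only 3 of 5 agree, so the pair is dropped from the final dataset.
print(aggregate(["kind_of", "kind_of", "kind_of", "none", "none"]))
```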
We observed that annotators were able to agree on pairs containing polysemous words where hypernymy holds for some sense. For instance, for the French-English pair (avocat, professional), the French word avocat can either mean lawyer or avocado, but the pair was annotated as a positive example. Hence, we leave it to the annotators to handle polysemy by choosing the most appropriate sense.
Cohyponyms are words sharing a common hypernym. For instance, bière ("beer" in French) and vodka are cohyponyms since they share a common hypernym in alcool/alcohol. We choose cohyponyms for the second test set because: (a) They require differentiating between similarity (a symmetric relation) and hypernymy (an asymmetric relation).

Data and Evaluation Setup
Training BISPARSE-DEP requires a dependency parsed monolingual corpus, and a translation matrix for jointly aligning the monolingual vectors. We compute the translation matrix using word alignments derived from parallel corpora (see corpus statistics in Table 2). While we use parallel corpora to generate the translation matrix to be comparable to baselines (§5.2), we can obtain the matrix from any bilingual dictionary. The monolingual corpora are parsed using Yara Parser (Rasooli and Tetreault, 2015), trained on the corresponding treebank from the Universal Dependency Treebank (McDonald et al., 2013) (UDT-v1.4). Yara Parser was chosen as it is fast, and competitive with state-of-the-art parsers (Choi et al., 2015). The monolingual corpora were POS-tagged using TurboTagger (Martins et al., 2013). We induce dependency contexts for words by first thresholding the language vocabulary to the top 50,000 nouns, verbs and adjectives. A co-occurrence matrix is computed over this vocabulary using the context types in §3.1.
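As an illustration of how a translation matrix could be derived from word alignments, the sketch below builds a one-hot row per English word, pointing at its most frequently aligned foreign word (the construction described in §3.3). The input format (a list of aligned word pairs), the function name and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter, defaultdict


def translation_matrix(aligned_pairs, vocab_e, vocab_f):
    """Build S (len(vocab_e) x len(vocab_f)) with at most one 1 per row."""
    counts = defaultdict(Counter)
    for e_word, f_word in aligned_pairs:
        counts[e_word][f_word] += 1

    col = {f: j for j, f in enumerate(vocab_f)}
    S = [[0] * len(vocab_f) for _ in vocab_e]
    for i, e_word in enumerate(vocab_e):
        in_vocab = {f: c for f, c in counts[e_word].items() if f in col}
        if in_vocab:
            # Most frequent alignment wins; ties are broken alphabetically.
            best = max(sorted(in_vocab), key=in_vocab.get)
            S[i][col[best]] = 1
    return S


pairs = [("fruit", "fruit"), ("fruit", "fruit"), ("fruit", "pomme"), ("apple", "pomme")]
S = translation_matrix(pairs, ["fruit", "apple", "zebra"], ["fruit", "pomme"])
print(S)
```

Words with no in-vocabulary alignment (such as the hypothetical "zebra" row) simply get an all-zero row; a real setup would likely store S as a sparse matrix.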

Inducing Dependency Contexts
The entries of the word-context co-occurrence matrix are reweighted using Positive Pointwise Mutual Information (Bullinaria and Levy, 2007). The resulting matrix is reduced to 1000 dimensions using SVD (Golub and Kahan, 1965). These vectors are used as X_e, X_f in the setup from §3.3 to generate 100-dimensional sparse bilingual vectors.
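The PPMI reweighting step can be sketched in a few lines. The dictionary-of-counts input and the function name are assumptions made for illustration; the subsequent SVD step is not shown.

```python
import math
from collections import defaultdict


def ppmi(counts):
    """Map {(word, context): count} to {(word, context): PPMI weight}."""
    total = sum(counts.values())
    word_tot, ctx_tot = defaultdict(float), defaultdict(float)
    for (w, c), n in counts.items():
        word_tot[w] += n
        ctx_tot[c] += n

    weights = {}
    for (w, c), n in counts.items():
        if n <= 0:
            weights[(w, c)] = 0.0
            continue
        # PMI = log( P(w, c) / (P(w) * P(c)) ), clipped at zero.
        pmi = math.log(n * total / (word_tot[w] * ctx_tot[c]))
        weights[(w, c)] = max(0.0, pmi)
    return weights


toy_counts = {("a", "x"): 8, ("a", "y"): 2, ("b", "x"): 2, ("b", "y"): 8}
print(ppmi(toy_counts))
```

In the toy counts, the pair ("a", "x") co-occurs more often than chance and keeps a positive weight, while ("a", "y") co-occurs less often than chance and is clipped to zero.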
Evaluation We use accuracy as our evaluation metric, as it is easy to interpret when the classes are balanced (Turney and Mohammad, 2015). Both evaluation datasets, HYPER-HYPO and HYPER-COHYPO, are split into 1:2 dev/test splits. BalAPinc has two tunable parameters: (1) a threshold on the BalAPinc score above which all examples are labeled as positive, and (2) the maximum number of features to consider for each word. We use the dev set to tune these two parameters as well as the various hyper-parameters associated with the models.
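Threshold tuning on the dev split can be pictured as a simple sweep. This is a minimal sketch under the assumption that candidate thresholds are taken from the observed dev scores; the function name and the candidate set are hypothetical.

```python
def tune_threshold(scores, labels):
    """Pick the threshold maximizing accuracy, predicting positive if score > t."""
    candidates = [float("-inf")] + sorted(set(scores))
    best_t, best_acc = None, -1.0
    for t in candidates:
        correct = sum((s > t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:  # keep the lowest threshold among ties
            best_t, best_acc = t, acc
    return best_t, best_acc


dev_scores = [0.1, 0.4, 0.6, 0.9]
dev_labels = [0, 0, 1, 1]
print(tune_threshold(dev_scores, dev_labels))
```

The selected threshold would then be applied unchanged to the test split; the feature cap can be tuned with an outer loop over the same procedure.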

Contrastive Approaches
We compare our BISPARSE-DEP embeddings with the following approaches:
MONO-DEP (Translation baseline) For word pair (p_f, q_e) in test data, we translate p_f to English using the most common translation in the translation matrix. Hypernymy is then determined using sparse, dependency based embeddings in English.
BISPARSE-LEX (Window context) Predecessor of the BISPARSE-DEP model from our previous work (Vyas and Carpuat, 2016). This model induces sparse, cross-lingual embeddings using window based context.
BIVEC+ (Window context) Our extension of the BIVEC model of Luong et al. (2015). BIVEC generates dense, cross-lingual embeddings using window based context, by substituting aligned word pairs within a window in parallel sentences. By default, BIVEC only trains using parallel data, and so we initialize it with monolingually trained window based embeddings to ensure fair comparison.
CL-DEP (Dependency context) The model from Vulić (2017), which induces dense, dependency based cross-lingual embeddings by translating syntactic word-context pairs using the most common translation, and jointly training a word2vecf (footnote 7) model for both languages. Vulić (2017) showed improvements for word similarity and bilingual lexicon induction. We report the first results using CL-DEP on this task.
Table 2: Training data statistics for different languages. Note that while we use parallel corpora for computing translation dictionaries, our approach does not require them, and can work with any bilingual dictionary.

Evaluating Robustness of BISPARSE-DEP
We investigate how robust BISPARSE-DEP is when exposed to data scarce settings. Evaluating on a truly low resource language is complicated by the fact that obtaining an evaluation dataset for such a language is difficult. Therefore, we simulate such settings for the languages in our dataset in multiple ways.
No Treebank If a treebank is not available for a language, dependency contexts have to be induced using treebanks from other languages (§3.2), which can affect the quality of the dependency-based embeddings. To simulate this, we train a delexicalized parser for the languages in our dataset.
We use treebanks from Slovenian, Ukrainian, Serbian, Polish, Bulgarian, Slovak and Czech (40k sentences) for training the Russian parser, and treebanks from English, Spanish, German, Portuguese, Swedish and Italian (66k sentences) for training the French parser. UDT does not (yet) have languages in the same family as Arabic or Chinese, so for the sake of completeness, we train Arabic and Chinese parsers on delexicalized treebanks of the language itself. After delexicalized training, the Labeled Attachment Score (LAS) on the UDT test set dropped by 12.6 to 27.0 points across languages: from 76.6% to 60.0% for Russian, from 83.7% to 71.1% for French, from 76.3% to 62.4% for Arabic and from 80.3% to 53.3% for Chinese. The monolingual corpora are then parsed with these weaker parsers, and co-occurrences and dependency contexts are computed as before.
7 bitbucket.org/yoavgo/word2vecf/
Subsampling Monolingual Data To simulate low-resource behavior along another axis, we subsample the monolingual corpora used by BISPARSE-DEP to induce monolingual vectors, X_e and X_f. Specifically, we learn X_e and X_f using progressively smaller corpora.

Quality of Bilingual Dictionary
We study the impact of the quality of the bilingual dictionary used to create the translation matrix S. This experiment involves using increasingly smaller parallel corpora to induce the translation dictionary.

Experiments
We aim to answer the following questions: (a) Are dependency based embeddings superior to window based embeddings for identifying cross-lingual hypernymy? (§6.1) (b) Does directionality in the dependency context help cross-lingual hypernymy identification? (§6.2) (c) Are our models robust in data-scarce settings? (§6.3) (d) Is the answer to (a) predicated on the choice of entailment scorer? (§6.4)

Dependency v/s Window Contexts
We compare the performance of models described in §5.2 with the BISPARSE-DEP (FULL and JOINT) models. We evaluate the models on the two test splits described in §4.2: HYPER-HYPO and HYPER-COHYPO.
Table 3: Comparing the different approaches from §5.2 with our BISPARSE-DEP approach on HYPER-HYPO and HYPER-COHYPO (random baseline = 0.5). Bold denotes the best score for each language, and the * on the best score indicates a statistically significant (p < 0.05) improvement over the next best score, using McNemar's test (McNemar, 1947).
Across both datasets, BISPARSE-DEP models outperform window based models and the translation baseline on average. Table 3a shows the results on HYPER-HYPO. First, the benefit of cross-lingual modeling (as opposed to translation) is evident in that almost all models (except CL-DEP on French) outperform the translation baseline. Among dependency based models, BISPARSE-DEP (FULL) and CL-DEP consistently outperform both window models, while BISPARSE-DEP (JOINT) outperforms them on all except Russian. BISPARSE-DEP (JOINT) is the best model overall for two languages (French and Chinese), CL-DEP for one (Arabic), with no statistically significant differences between BISPARSE-DEP (JOINT) and CL-DEP for Russian. This confirms that dependency context is more useful than window context for cross-lingual hypernymy detection.
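For readers unfamiliar with McNemar's test, the sketch below computes a two-sided exact (binomial) p-value from the two discordant counts: b, the number of test pairs that only the first model classifies correctly, and c, the number that only the second model classifies correctly. Whether the reported results use this exact variant or the chi-square approximation is not stated, so this is an assumption, as is the function name.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(b, c)
    # Under the null hypothesis, each disagreement is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: model A alone is right on 10 pairs, model B alone on 0.
print(mcnemar_exact(10, 0))
```

Pairs that both models get right, or both get wrong, do not affect the statistic.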

Hyper-Cohypo Results
The trends observed on HYPER-HYPO also hold on HYPER-COHYPO, i.e., dependency based models continue to outperform window based models (Table 3b).
Overall, BISPARSE-DEP (FULL) performs best in this setting, followed closely by BISPARSE-DEP (JOINT). This suggests that the sibling information encoded in JOINT is useful to distinguish hypernyms from hyponyms (HYPER-HYPO results), while the dependency labels encoded in FULL help to distinguish hypernyms from cohyponyms. Also note that all models improve significantly on the HYPER-COHYPO set, suggesting that discriminating hypernyms from cohyponyms is easier than discriminating them from hyponyms.
While the BISPARSE-DEP models generally performed better than window models on both test sets, CL-DEP was not as consistent (e.g., it was worse than the best window model on HYPER-COHYPO). As shown by Turney and Mohammad (2015), BalAPinc is designed for sparse embeddings and is likely to perform poorly with dense embeddings. This explains the relatively inconsistent performance of CL-DEP.
Besides establishing the challenging nature of our crowd-sourced set, the experiments on HYPER-COHYPO and HYPER-HYPO also demonstrate the ability of the BISPARSE-DEP models to discriminate between different lexical semantic relations (viz. hypernymy and cohyponymy) in a cross-lingual setting. We will investigate this ability more carefully in future work.

Ablating Directionality in Context
The contexts used by the FULL and JOINT BISPARSE-DEP models encode directional information (§3.1), either in the form of label direction (FULL) or using sibling information (JOINT). Does such directionality in the context help to capture the asymmetric relationship inherent to hypernymy? To answer this, we evaluate a third BISPARSE-DEP model which uses UNLABELED dependency contexts. This is similar to the FULL context, except that we do not concatenate the label of the relation to the context word (parent or children). For instance, for traveler in Figure 2, the contexts will be roamed and tired.
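If FULL contexts are stored as strings of the form word#label (an assumption based on the examples in §3.1, with a hypothetical "^-1" suffix marking the inverse direction), the UNLABELED variant amounts to dropping everything after the separator:

```python
def unlabeled(full_context):
    """Strip the relation label and direction from a FULL context string."""
    return full_context.split("#", 1)[0]


print([unlabeled(c) for c in ["roamed#nsubj^-1", "tired#amod"]])
```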
Experiments on both HYPER-HYPO and HYPER-COHYPO (bottom row, Tables 3a and 3b) highlight that directional information is indeed essential -UNLABELED almost always performs worse than FULL and JOINT, and in many cases worse than even window based models.

No Treebank
We run experiments (Table 4) for all languages with a version of BISPARSE-DEP that uses the FULL context type for both English and the non-English (target) language, but where the target language contexts are derived from a corpus parsed using a delexicalized parser (§5.3). This model compares favorably on all language pairs against the best window based and the best dependency based model. In fact, it almost consistently outperforms the best window based model by several points, and is only slightly worse than the best dependency-based model.
Further analysis revealed that the good performance of the delexicalized model is due to the relative robustness of the delexicalized parser on frequent contexts in the co-occurrence matrix. Specifically, we found that in French and Russian, the most frequent contexts were derived from amod, nmod, nsubj and dobj edges. For instance, the nmod edge appears in 44% of Russian contexts and 33% of the French contexts. The delexicalized parser predicts both the label and direction of the nmod edge correctly with an F1 of 68.6 for Russian and 69.6 for French. In contrast, a fully-trained parser achieves an F1 of 76.7 for Russian and 76.8 for French for the same edge.

Small Monolingual Corpus
In Figure 4, we use increasingly smaller monolingual corpora (10%, 20%, 40%, 60% and 80%) sampled at random to induce the monolingual vectors for the BISPARSE-DEP (FULL) model. Trends (Figure 4) [...] data. Robust performance with smaller monolingual corpora is helpful since large-enough monolingual corpora are not always easily available.
Quality of Bilingual Dictionary Bilingual dictionaries derived from smaller amounts of parallel data are likely to be of lower quality than those derived from larger corpora. Hence, to analyze the impact of dictionary quality on BISPARSE-DEP (FULL), we use increasingly smaller parallel corpora to induce the bilingual dictionaries used as the translation matrix S (§3.3). We use the top 10%, 20%, 40%, 60% and 80% of sentences from the parallel corpora. The trends in Figure 4 show that even with a lower quality dictionary, BISPARSE-DEP performs better than BISPARSE-LEX.
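The two subsampling schemes in these robustness experiments (random samples of the monolingual corpus, and the top portion of the parallel corpus) can be sketched as below. The function names, the fixed seed and the use of rounding for the sample size are assumptions made for illustration.

```python
import random

FRACTIONS = (0.1, 0.2, 0.4, 0.6, 0.8)


def random_subsample(sentences, fraction, seed=0):
    """Random sample of the monolingual corpus (without replacement)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    k = round(len(sentences) * fraction)
    return rng.sample(sentences, k)


def top_fraction(sentences, fraction):
    """First `fraction` of the parallel corpus, in its original order."""
    return sentences[: round(len(sentences) * fraction)]


corpus = list(range(10))  # stand-in for a list of sentences
print([len(random_subsample(corpus, f)) for f in FRACTIONS])
print(top_fraction(corpus, 0.2))
```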

Choice of Entailment Scorer
We change the entailment scorer from BalAPinc to SLQS (Santus et al., 2014) and redo experiments from §6.1 to see if the conclusions drawn depend on the choice of the entailment scorer. SLQS is based on the distributional informativeness hypothesis, which states that hypernyms are less "informative" than hyponyms, because they occur in more general contexts. The informativeness E_u of a word u is defined to be the median entropy of its top N dimensions, E_u = median_{k=1}^{N} H(c_k), where H(c_k) denotes the entropy of dimension c_k. The SLQS score for a pair (u, v) is the relative difference in entropies, SLQS(u → v) = 1 − E_u / E_v. Shwartz et al. (2017) found SLQS to be more successful than other metrics in monolingual hypernymy detection.
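A minimal sketch of SLQS follows, assuming the usual definition from Santus et al. (2014), SLQS(u → v) = 1 − E_u / E_v, with vectors stored as context-to-weight dictionaries and the entropy of each context dimension precomputed. All function names and the toy numbers are hypothetical.

```python
import math
from statistics import median


def context_entropy(cooccurrence_counts):
    """Shannon entropy (in bits) of one context's distribution over words."""
    total = sum(cooccurrence_counts)
    probs = [n / total for n in cooccurrence_counts if n > 0]
    return -sum(p * math.log2(p) for p in probs)


def informativeness(vec, entropies, n):
    """E_u: median entropy of the word's top-n contexts by weight."""
    top = sorted(vec, key=lambda c: (-vec[c], c))[:n]
    return median(entropies[c] for c in top)


def slqs(u, v, entropies, n=50):
    """Positive when v's typical contexts are more general (higher entropy)."""
    return 1.0 - informativeness(u, entropies, n) / informativeness(v, entropies, n)


# Toy setup: context "c2" is spread over more words than "c1".
entropies = {"c1": context_entropy([1, 1]), "c2": context_entropy([1, 1, 1, 1])}
print(slqs({"c1": 3.0}, {"c2": 3.0}, entropies, n=1))
```

A positive score suggests that v is the more general term, so v is the hypernym candidate for u.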
The trends observed in these experiments are consistent with those in §6.1 -both BISPARSE-DEP models still outperform window-based models. Also, the delexicalized version of BISPARSE-DEP outperforms the window-based models, showing that the robust behavior demonstrated in §6.3 is also invariant across metrics.
We also found that using BalAPinc led to better results than SLQS. For both BISPARSE-DEP models, BalAPinc wins across the board for two languages (Russian and Chinese), and wins half the time for the other two languages compared to SLQS. We leave detailed comparison of these and other scores to future work.

Conclusion
We introduced BISPARSE-DEP, a new distributional approach for identifying cross-lingual hypernymy, based on cross-lingual embeddings derived from dependency contexts. We showed that BISPARSE-DEP is superior on the cross-lingual hypernymy detection task when compared to standard window based models and a translation baseline. Further analysis also showed that BISPARSE-DEP is robust to various low-resource settings. In principle, BISPARSE-DEP can be used for any language that has a bilingual dictionary with English and a "related" language with a treebank. We also introduced crowd-sourced cross-lingual hypernymy datasets for four languages for future evaluations.
Our approach has the potential to complement existing work on creating cross-lingual ontologies such as BabelNet and the Open Multilingual Wordnet, which are noisy because they are compiled semi-automatically, and have limited language coverage. In general, distributional approaches can help refine ontology construction for any language where sufficient resources are available.
It remains to be seen how our approach performs for other language pairs beyond simulated low-resource settings. While our delexicalized parsing based approach exhibits robustness, we anticipate that replacing the delexicalized parser with more sophisticated transfer parsing strategies (Rasooli and Collins, 2017; Aufrant et al., 2016) might improve parser performance and be beneficial in such settings. We aim to explore these and other directions in the future.