Sparse Bilingual Word Representations for Cross-lingual Lexical Entailment

We introduce the task of cross-lingual lexical entailment, which aims to detect whether the meaning of a word in one language can be inferred from the meaning of a word in another language. We construct a gold standard for this task, and propose an unsupervised solution based on distributional word representations. As commonly done in the monolingual setting, we assume a word e entails a word f if the prominent context features of e are a subset of those of f. To address the challenge of comparing contexts across languages, we propose a novel method for inducing sparse bilingual word representations from monolingual and parallel texts. Our approach yields an F-score of 70%, and significantly outperforms strong baselines based on translation and on existing word representations.


Introduction
Multilingual Natural Language Processing lacks techniques to automatically compare and contrast the meaning of words across languages. Machine translation (Koehn, 2010) lets us discover translation correspondences in bilingual texts, but a word and its translation often do not cover the exact same semantic space: distinct word senses might translate differently (Gale et al., 1992; Diab and Resnik, 2002, among others); semantic relations and associations do not always translate, an important issue when constructing multilingual ontologies (Fellbaum and Vossen, 2012); and words in parallel text might be translated non-literally due to lexical gaps (Santos, 1990; Bentivogli and Pianta, 2000) or decisions of the translator, as becomes clear when comparing multiple translations of the same source text (Bhagat and Hovy, 2013).
As a result, correct word translations found in parallel corpora exhibit a variety of relations including equivalence, hypernymy, and meronymy. For instance, even after removing noisy examples (Johnson et al., 2007) from a Machine Translation bilexicon induced from parallel corpora (Koehn et al., 2007), we find that the French word "appartement" (apartment) is linked to related but not strictly equivalent English words, such as "home" or "condo".
house ||| foyer (foyer)
house ||| maison (house)
house ||| chambre (chamber)
home ||| appartement
condo ||| appartement
apartment ||| appartement

We aim to design models that capture these differences and similarities in word meaning across languages, beyond translation correspondences. As a first step, we introduce cross-lingual lexical entailment, the task of detecting whether the meaning of a word in one language can be inferred from the meaning of a word in another language. In monolingual settings, lexical entailment has received significant attention as a representation-agnostic way of modeling lexical semantics, and as a step toward textual inference (Turney and Mohammad, 2015; Levy et al., 2015; Pavlick et al., 2015). We hypothesize that the cross-lingual task can help do the same with multilingual texts.
Building on prior work on the monolingual task, we take an unsupervised approach, and use a directional semantic similarity metric motivated by the distributional inclusion hypothesis (Geffet and Dagan, 2005a; Kotlerman et al., 2010): we assume a word e entails a word f if the prominent context features of e are a subset of those of f. However, we face a new challenge in the cross-lingual case: how can we represent and compare word contexts across languages? Our solution leverages recent work on sparse representations for natural language processing. We develop sparse bilingual word representations that represent contexts based on interpretable dimensions that are aligned across languages.
As we will see, this approach successfully detects cross-lingual lexical entailment (with an F-score of 70%), and significantly outperforms strong baselines based (1) on machine translation, and (2) on existing dense bilingual word representations. Along the way, we construct a new dataset to evaluate cross-lingual lexical entailment, and also show the benefits of our approach in the monolingual setting.

A Cross-Lingual View of Lexical Entailment
Zhitomirsky-Geffet and Dagan (2009) formalize lexical entailment as a substitutional relationship. Under their definition, given a word pair (w, v), w entails v if the following two conditions are fulfilled:
1. The meaning of a possible sense of w implies a possible sense of v, and
2. w can substitute for v in a sentence, such that the meaning of the modified sentence entails the meaning of the original sentence.
As a result, monolingual lexical entailment includes various semantic relations, such as synonymy, hypernymy, some meronymy relations, but also cause-effect relations (murder entails death), and other associations (ocean entails water) (Kotlerman et al., 2010).
We extend this definition to the cross-lingual case by modifying the second condition. Given a word pair (w, v), where w is a word in language e and v is a word in language f, w entails v if:
1. The meaning of a possible sense of w implies a possible sense of v, and
2. Given a sentence T in f containing v, w can substitute for v in the translation of T in e, such that the meaning of the modified sentence entails the meaning of the original sentence.
Cross-lingual lexical entailment helps us refine our understanding of semantic mappings across languages: while the French word ouvrier can be translated as worker in English, knowing that worker does not entail ouvrier could be useful in many multilingual applications, including machine translation and its evaluation, question answering or entity linking in multilingual texts.
As can be seen in Table 2, lexical entailment is not always preserved by translation: while aspirin entails the English word drug, it does not entail the French drogue, which only refers to the narcotic sense of drug and not to its medicinal sense.
When evaluating lexical entailment, we use the same approach as in monolingual tasks (Baroni et al., 2012; Baroni and Lenci, 2011; Kotlerman et al., 2010; Turney and Mohammad, 2015): given a bilingual word pair, systems are asked to make a binary true/false decision on whether the first word entails the second. We describe the collection of gold standard annotations in Section 5.2.

Unsupervised Detection of Lexical Entailment
We choose to detect lexical entailment without supervision. As in the monolingual case, detection can be done using a scoring function which quantifies the directional semantic similarity of an input word pair. On monolingual tasks, despite reaching better performance, supervised systems do not really learn entailment relations for word pairs, but instead learn when a particular word in the pair is a "prototypical hypernym" (Levy et al., 2015). Thus, we limit our investigation to unsupervised models. As a result, our approach only requires a small number of annotated examples to tune the scoring threshold. We use the balAPinc score (Kotlerman et al., 2009), which is based on the distributional inclusion hypothesis (Geffet and Dagan, 2005b): given feature representations of the contexts of two words u and v, u is assumed to entail v if all features of u tend to appear within the features of v.
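Formally, balAPinc is the geometric mean of a symmetric similarity score, LIN (Lin, 1998), and an asymmetric score, APinc. Given a directional entailment pair $(u \rightarrow v)$, assume we are given ranked feature lists $FV_u$ and $FV_v$ for words $u$ and $v$ respectively. Let $w_u(f)$ denote the weight of a particular feature $f$ in $FV_u$. LIN is defined by

$$\mathrm{LIN}(u, v) = \frac{\sum_{f \in FV_u \cap FV_v} \left[ w_u(f) + w_v(f) \right]}{\sum_{f \in FV_u} w_u(f) + \sum_{f \in FV_v} w_v(f)} \tag{1}$$

APinc is a modified asymmetric version of the Average Precision metric used in Information Retrieval:

$$\mathrm{APinc}(u \rightarrow v) = \frac{\sum_{r=1}^{|FV_u|} \left[ P(r) \cdot rel'(f_r) \right]}{|FV_u|} \tag{2}$$

where $f_r$ is the feature at rank $r$ in $FV_u$,

$$P(r) = \frac{|\{\text{features of } v \text{ among the top } r \text{ features of } u\}|}{r}, \qquad rel'(f) = \begin{cases} 1 - \dfrac{rank(f, FV_v)}{|FV_v| + 1} & \text{if } f \in FV_v \\ 0 & \text{otherwise} \end{cases}$$

so that $\mathrm{balAPinc}(u \rightarrow v) = \sqrt{\mathrm{LIN}(u, v) \cdot \mathrm{APinc}(u \rightarrow v)}$.

Thus, to use balAPinc for cross-lingual lexical entailment, we need a ranked list of features that capture information about the context of words in two languages. In the monolingual case, features are dimensions in a distributional semantic space. For the cross-lingual task, we need to represent words in two languages in the same space, or in spaces where a one-to-one mapping between dimensions exists.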
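The scoring function can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments. It assumes the definitions of Kotlerman et al. (2010): LIN is the weight of the shared features divided by the total weight of both feature lists, APinc averages precision-at-rank over the features of u, with each included feature discounted by its rank in the feature list of v, and balAPinc is the geometric mean of the two. It further assumes that each word is given as a dictionary from feature to weight, and that the ranked feature list consists of the features with non-zero weight sorted by decreasing weight. All function names are hypothetical.

```python
import math


def ranked_features(weights):
    """Features with positive weight, sorted by decreasing weight (ties by name)."""
    items = [(f, w) for f, w in weights.items() if w > 0]
    items.sort(key=lambda fw: (-fw[1], fw[0]))
    return [f for f, _ in items]


def lin(wu, wv):
    """Symmetric LIN similarity between two weighted feature dictionaries."""
    fu = {f for f, w in wu.items() if w > 0}
    fv = {f for f, w in wv.items() if w > 0}
    shared = fu & fv
    numerator = sum(wu[f] + wv[f] for f in shared)
    denominator = sum(wu[f] for f in fu) + sum(wv[f] for f in fv)
    return numerator / denominator if denominator else 0.0


def apinc(wu, wv):
    """Asymmetric APinc score for the directional pair u -> v."""
    fu = ranked_features(wu)
    fv = ranked_features(wv)
    if not fu:
        return 0.0
    rank_in_v = {f: r for r, f in enumerate(fv, start=1)}
    included = 0
    total = 0.0
    for r, f in enumerate(fu, start=1):
        if f in rank_in_v:
            included += 1
            precision_at_r = included / r
            relevance = 1.0 - rank_in_v[f] / (len(fv) + 1)
            total += precision_at_r * relevance
    return total / len(fu)


def balapinc(wu, wv):
    """Geometric mean of LIN and APinc for the directional pair u -> v."""
    return math.sqrt(lin(wu, wv) * apinc(wu, wv))


# Toy vectors (invented weights): a narrow word and a broader word.
narrow = {"cancer": 1.0}
broad = {"cancer": 0.5, "cells": 0.4, "growth": 0.3}
forward = balapinc(narrow, broad)   # narrow -> broad
backward = balapinc(broad, narrow)  # broad -> narrow
print(forward > backward)
```

On this toy pair the score is higher in the narrow-to-broad direction, since the single feature of the narrow word is included in the features of the broad word but not the other way round.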

Learning Sparse Bilingual Word Representations
As we will see in Section 9, the literature offers a wealth of methods for learning representations that capture the contexts of words in two different languages. However, they have been evaluated on tasks that do not require much semantic analysis, such as translation lexicon induction or document categorization. In contrast, detecting lexical entailment requires the ability to capture more subtle semantic distinctions. This requires bilingual representations to capture both the full range of word contexts observed in original language texts, as well as cross-lingual correspondences from translated texts.
We propose a new model that uses sparse nonnegative embeddings to represent word contexts as interpretable dimensions, and facilitate context comparisons across languages. This is an instance of sparse coding, which consists of modeling data vectors as sparse linear combinations of basis elements. In contrast with dimensionality reduction techniques such as PCA, the learned basis vectors need not be orthogonal, which gives more flexibility to represent the data (Mairal et al., 2009). These models have been introduced as word representations in monolingual settings (Murphy et al., 2012) with the goal of obtaining interpretable, cognitively-plausible representations. We review the monolingual models, before introducing our novel bilingual formulation.

Review: Learning Monolingual Sparse Representations
Previous work (Murphy et al., 2012) on obtaining sparse monolingual representations is based on a variant of the Nonnegative Matrix Factorization problem. Given a matrix $X$ containing $v$ dense word representations arranged row-wise, sparse representations for the $v$ words can be obtained by solving the following optimization problem:

$$\underset{A, D}{\arg\min} \; \sum_{i=1}^{v} \|X_i - A_i D\|_2^2 + \lambda \|A_i\|_1 \quad \text{s.t.} \quad A \geq 0, \; \|D_k\|_2^2 \leq 1 \;\; \forall k \tag{3}$$

where $X_i$ and $A_i$ denote the $i$-th rows of $X$ and $A$, and $D_k$ the $k$-th row of $D$. The first term in objective 3 aims to factorize the dense representation matrix $X$ into two matrices, $A$ and $D$, such that the $\ell_2$ reconstruction error is minimized. The second term is an $\ell_1$ regularizer on $A$ which encourages sparsity, where the level of sparsity is controlled by the $\lambda$ hyperparameter. This, together with the non-negativity constraint, helps in obtaining sparsified and interpretable representations in $A$, since non-negativity has been shown to correlate with interpretability. Note that the objective function on its own is degenerate, since it can be trivially optimized by making the entries of $D$ arbitrarily large and choosing correspondingly small values as entries of $A$. To avoid this, an additional constraint is imposed on $D$.
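To make the role of each term concrete, the following sketch evaluates an objective of this form on a toy example. It is an illustration under stated assumptions, not the authors' code: it assumes the factorization X ≈ AD with sparse codes A (one row per word) and a dictionary D, uses plain Python lists instead of a linear algebra library, and all names and values are hypothetical. It also reproduces the degeneracy mentioned above: scaling D up and A down leaves the reconstruction unchanged while shrinking the sparsity penalty.

```python
def matmul(A, D):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * d for a, d in zip(row, col)) for col in zip(*D)] for row in A]


def sparse_coding_objective(X, A, D, lam):
    """Squared reconstruction error of X by A*D plus lam times the l1 norm of A."""
    R = matmul(A, D)
    reconstruction = sum(
        (x - r) ** 2 for x_row, r_row in zip(X, R) for x, r in zip(x_row, r_row)
    )
    l1_penalty = sum(abs(a) for row in A for a in row)
    return reconstruction + lam * l1_penalty


# Toy data (invented): two words, two dense dimensions, two sparse dimensions.
X = [[1.0, 0.0], [0.0, 2.0]]
A = [[1.0, 0.0], [0.0, 2.0]]
D = [[1.0, 0.0], [0.0, 1.0]]

# Without a constraint on D, rescaling lowers the objective for free.
A_small = [[a / 10.0 for a in row] for row in A]
D_large = [[d * 10.0 for d in row] for row in D]

print(sparse_coding_objective(X, A, D, lam=0.1))
print(sparse_coding_objective(X, A_small, D_large, lam=0.1))
```

The second value is lower than the first even though both factorizations reconstruct X exactly, which is why the norm of the dictionary has to be bounded.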

Proposed Bilingual Model
Learning bilingual word representations for entailment requires two sources of information:
• Monolingual distributional representations independently learned from large amounts of text in each language. We denote them as two input matrices, $X_e$ and $X_f$, of respective sizes $v_e \times n_e$ and $v_f \times n_f$. Each row in $X_e$ is the representation of a particular word in the first language, e, while $X_f$ holds the word representations for the other language, f.
• Cross-lingual correspondences that enable comparison across languages. We define a "score" matrix $S$ of size $v_e \times v_f$, which captures high-confidence correspondences between the vocabularies of the two languages. There are many ways of defining $S$. As a starting point, we define each row of $S$ as a one-hot vector that identifies the word in f that is most frequently aligned with the e word for that row in a large parallel corpus. This reduction leads to a many-to-one mapping from e to f, which captures translation ambiguity by allowing multiple words in e to be aligned to the same word in f.
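The construction of the score matrix can be sketched as follows. This is a minimal illustration that assumes the word alignments are already available as a list of (e word, f word) links extracted from a parallel corpus; the function name, the tie-breaking rule, and the toy counts are assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_score_matrix(aligned_pairs, vocab_e, vocab_f):
    """Return S as a list of rows: one-hot on the most frequently aligned f word."""
    f_index = {f_word: j for j, f_word in enumerate(vocab_f)}
    counts = defaultdict(Counter)
    for e_word, f_word in aligned_pairs:
        if f_word in f_index:
            counts[e_word][f_word] += 1

    S = []
    for e_word in vocab_e:
        row = [0] * len(vocab_f)
        if e_word in counts:
            # Highest count wins; ties are broken alphabetically (an assumption).
            best_f = min(counts[e_word].items(), key=lambda kv: (-kv[1], kv[0]))[0]
            row[f_index[best_f]] = 1
        S.append(row)
    return S


# Toy alignment links (invented counts).
links = (
    [("home", "maison")] * 2
    + [("home", "appartement")]
    + [("condo", "appartement")]
    + [("apartment", "appartement")] * 3
)
vocab_e = ["home", "condo", "apartment", "river"]
vocab_f = ["maison", "appartement"]
S = build_score_matrix(links, vocab_e, vocab_f)
print(S)
```

In this toy example both "condo" and "apartment" map to "appartement", which illustrates the many-to-one mapping, while an e word without any alignment link keeps an all-zero row.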
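We formulate the following optimization problem to obtain sparse bilingual representations:

$$
\begin{aligned}
\underset{A_e, D_e, A_f, D_f}{\arg\min} \quad & \sum_{i=1}^{v_e} \|X_{e,i} - A_{e,i} D_e\|_2^2 + \lambda_e \|A_{e,i}\|_1 \\
+ \; & \sum_{j=1}^{v_f} \|X_{f,j} - A_{f,j} D_f\|_2^2 + \lambda_f \|A_{f,j}\|_1 \\
+ \; & \lambda_x \sum_{i=1}^{v_e} \sum_{j=1}^{v_f} S_{ij} \|A_{e,i} - A_{f,j}\|_2^2 \\
\text{s.t.} \quad & A_e \geq 0, \; A_f \geq 0, \; \|D_{e,k}\|_2^2 \leq 1, \; \|D_{f,k}\|_2^2 \leq 1 \;\; \forall k
\end{aligned} \tag{4}
$$

The first two rows and the constraints in Equation 4 can be understood as in Equation 3: they encourage sparsity in word representations for each language. The third row imposes bilingual correspondence constraints, weighted by the regularizer $\lambda_x$: it encourages words in e and f that are strongly aligned according to $S$ to have similar representations.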
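The structure of the bilingual objective can be illustrated with a short sketch. It assumes one sparse coding loss per language (squared reconstruction error plus an l1 penalty on the codes) and a cross-lingual term that sums, over all word pairs, the alignment score times the squared distance between the two sparse codes. Constant factors, names, and the toy values are assumptions of this example, and constraints are not enforced here.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def sparse_coding_loss(X, A, D, lam):
    """Squared reconstruction error of X by A*D plus lam times the l1 norm of A."""
    loss = 0.0
    for x_row, a_row in zip(X, A):
        recon = [sum(a * d for a, d in zip(a_row, col)) for col in zip(*D)]
        loss += squared_distance(x_row, recon) + lam * sum(abs(a) for a in a_row)
    return loss


def bilingual_objective(Xe, Ae, De, Xf, Af, Df, S, lam_e, lam_f, lam_x):
    """Two monolingual sparse coding losses plus the alignment-weighted penalty."""
    cross = sum(
        S[i][j] * squared_distance(Ae[i], Af[j])
        for i in range(len(Ae))
        for j in range(len(Af))
    )
    return (
        sparse_coding_loss(Xe, Ae, De, lam_e)
        + sparse_coding_loss(Xf, Af, Df, lam_f)
        + lam_x * cross
    )


# Toy example (invented): one word per language, aligned to each other.
Xe, De = [[1.0]], [[1.0]]
Xf, Df = [[1.0]], [[1.0]]
S = [[1]]
mismatched = bilingual_objective(Xe, [[1.0]], De, Xf, [[0.0]], Df, S, 0.1, 0.1, 2.0)
matched = bilingual_objective(Xe, [[1.0]], De, Xf, [[1.0]], Df, S, 0.1, 0.1, 2.0)
print(mismatched, matched)
```

The objective is lower when the two aligned words receive the same sparse code, which is the behavior the cross-lingual term is meant to encourage.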

Optimization
Equations 3 and 4 define non-differentiable, non-convex optimization problems, and finding the globally optimal solution is not feasible. However, various methods used to solve convex problems work well in practice. We use Forward Backward Splitting, a proximal gradient method for which an efficient generic solver, FASTA, is available (Goldstein et al., 2015; Goldstein et al., 2014). FASTA (Fast Adaptive Shrinkage/Thresholding Algorithm) is designed to minimize functions of the form $f(Ax) + g(x)$, where $f$ is a differentiable function, $g$ is a function (possibly non-differentiable) for which we can calculate the proximal operator, and $A$ is a linear operator. For the objective function in our model, the $\ell_1$ terms form $g$ and the $\ell_2$ terms form $f$.
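A single forward-backward (proximal gradient) update can be sketched as follows. This is a hand-written illustration of the general technique, not the FASTA solver or its interface: it updates the sparse code of one word with the dictionary held fixed, takes a gradient step on the squared reconstruction error, and then applies the proximal operator of the l1 penalty combined with the non-negativity constraint, which is a one-sided soft threshold. The fixed step size and all names are assumptions.

```python
def prox_nonneg_l1(value, threshold):
    """Proximal operator of threshold*|x| restricted to x >= 0."""
    return max(0.0, value - threshold)


def forward_backward_step(a, D, x, lam, step):
    """One proximal gradient update of a sparse code a for target vector x.

    a: current code (length k); D: dictionary as k rows of length n;
    x: dense vector to reconstruct (length n).
    """
    k, n = len(D), len(x)
    recon = [sum(a[j] * D[j][c] for j in range(k)) for c in range(n)]
    residual = [recon[c] - x[c] for c in range(n)]
    # Gradient of ||x - a D||^2 with respect to a (forward step).
    grad = [2.0 * sum(residual[c] * D[j][c] for c in range(n)) for j in range(k)]
    # Backward step: one-sided soft thresholding.
    return [prox_nonneg_l1(a[j] - step * grad[j], step * lam) for j in range(k)]


# Toy example (invented): identity dictionary, one strong and one weak feature.
D = [[1.0, 0.0], [0.0, 1.0]]
x = [1.0, 0.05]
code = [0.0, 0.0]
for _ in range(5):
    code = forward_backward_step(code, D, x, lam=0.2, step=0.5)
print(code)
```

In this toy run the weak feature is driven exactly to zero while the strong one is only shrunk, which shows how the l1 term and the non-negativity constraint produce sparse codes.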
We have now described all components of the model required to detect bilingual lexical entailment: solving objective 4 as described yields sparse representations for words in the two languages that can be compared directly using the balAPinc metric.
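Since the approach is unsupervised, turning similarity scores into binary entailment decisions only requires a threshold tuned on a small number of annotated examples. The sketch below shows one plausible way to do this. It assumes that the threshold is chosen to maximize F-score on the tuning examples and that candidate thresholds are the observed scores; neither detail is specified in the text above, and the names and toy scores are hypothetical.

```python
def f_score(gold, predicted):
    """F1 score of boolean predictions against boolean gold labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def tune_threshold(scores, gold):
    """Return (threshold, F-score) maximizing F-score; pairs scoring >= threshold are positive."""
    best_threshold, best_f = None, -1.0
    for candidate in sorted(set(scores)):
        f = f_score(gold, [s >= candidate for s in scores])
        if f > best_f:
            best_threshold, best_f = candidate, f
    return best_threshold, best_f


# Toy tuning set (invented scores and labels).
scores = [0.9, 0.7, 0.4, 0.2]
gold = [True, True, False, False]
threshold, best = tune_threshold(scores, gold)
print(threshold, best)
```

The tuned threshold is then applied unchanged to the scores of the test pairs.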

Existing Monolingual Test Suites
A comprehensive suite of lexical entailment test beds is available for English (Levy et al., 2015). They were constructed either by asking humans to annotate entailment relations directly (Kotlerman et al., 2010), or by deriving entailment labels from semantic relation annotations (Baroni et al., 2012; Baroni and Lenci, 2011; Turney and Mohammad, 2015). Each test set has 900 to 1300 positive examples of lexical entailment, i.e., word pairs (w, v) such that w → v. All but one are balanced.

Creating a Cross-Lingual Test Set
We select French as the second language: it is a good starting point for studying cross-lingual entailment, as it is a resource-rich language with many available bilingual annotators. We will construct data sets for more distant language pairs in future work.
We aim to construct a balanced test set of positive and negative bilingual entailment examples in the spirit of the existing English test beds. While it is attractive to leverage existing English examples, we cannot translate them directly as entailment relations might be affected by translation ambiguity (as illustrated in Table 2).
We therefore obtain annotated bilingual examples using a two-step process: (1) automatic translation of monolingual examples, and (2) manual annotation through crowdsourcing. For a sample of positive examples of entailment $w_e \rightarrow v_e$ in the monolingual datasets, we generate up to two French translations for $v_e$, $v_{f_1}$ and $v_{f_2}$, using the top translations from BabelNet (Navigli and Ponzetto, 2012) and Google Translate. $v_{f_1}$ and $v_{f_2}$ are then paired back with $w_e$, thus generating two unannotated cross-lingual examples. Annotation is crowdsourced on Crowdflower: for each example pair $(w_e, v_f)$, workers are asked to label it as true ($w_e \rightarrow v_f$) or false ($w_e \nrightarrow v_f$). We select the positive examples annotated with high agreement, and obtain the same number of negative examples by applying the same translation process to negative examples.

Crowdsourcing Cross-lingual Entailment Judgments
Detecting lexical entailment for bilingual word pairs is a non-trivial annotation task, and requires a good command of both French and English. For quality control, we first ask a bilingual speaker in our group to conduct a pilot annotation task, which we use to evaluate workers' ability to perform the task. In addition, Crowdflower allows us to present this task to only workers who have a proven knowledge of French, and to georestrict the task to countries most likely to have French-English bilinguals.
This result first shows that we can indeed generate a gold standard for the challenging task of crosslingual lexical entailment using such crowdsourcing techniques. We ensure high-quality annotations by selecting all 945 (w, v) where four or more annotators agree that w → v.
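The selection rule can be sketched as a simple filter over annotator votes. The data layout, the function name, and the votes below are invented for illustration; only the rule itself (keep a pair as positive when four or more annotators judge that the first word entails the second) comes from the text.

```python
def select_high_agreement(votes_by_pair, min_positive=4):
    """Return the sorted pairs for which at least min_positive annotators voted 'entails'."""
    return sorted(
        pair for pair, votes in votes_by_pair.items() if sum(votes) >= min_positive
    )


# Toy annotations (invented votes; True means the annotator judged w -> v).
votes_by_pair = {
    ("aspirin", "médicament"): [True, True, True, True, True],
    ("aspirin", "drogue"): [True, False, False, False, False],
    ("endurance", "force"): [True, True, True, False, False],
}
positives = select_high_agreement(votes_by_pair)
print(positives)
```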
In addition, the degree of agreement sheds light on how the notion of lexical entailment is interpreted by non-expert annotators. In Table 3, we present randomly selected examples for each agreement level. In the clearest negative examples, either the two words are unrelated (e.g., art and serpent, which means snake in English) or the directionality is wrong (e.g., animal ← reptile). The middle rows, where one to three annotators chose to annotate the word pair as negative, contain less clear-cut cases, including association relations (e.g., endurance → force), and examples where entailment judgments require taking into account more subtle word sense or translation distinctions (e.g., bookmark can be translated as marque for a positive example, but the most frequent sense of marque translates into English as brand, for which the entailment relation does not hold).

To validate the translated negative examples, we asked a bilingual speaker to manually check a random sample of 100 such translated pairs, which were all found to remain negative.

Experimental Conditions
We estimate the following models for evaluation on the test sets described in the previous section.

Sparse Bilingual Model
Estimating sparse bilingual representations as described in Section 4.2 first requires learning dense monolingual representations in two languages ($X_e$ and $X_f$). Second, we construct $S$ by word-aligning large amounts of parallel corpora using a fast implementation of IBM Model 2 (Dyer et al., 2013). We combine Europarl (Koehn, 2005), News Commentary, and Wikipedia to create a large parallel corpus of 72M English tokens and 78M French tokens. All corpora are tokenized and lowercased.

Contrastive Models
We also learn two other sets of 100-dimensional word representations, as a basis for comparison.
First, we learn sparse monolingual English word representations, which will be used in monolingual lexical entailment experiments (Section 7.1). These are trained using the non-negative sparse method described in Section 4.1, on the same 4.9B word English corpus that was used for learning bilingual representations.
Second, we learn dense bilingual word representations using BiCVM (Hermann and Blunsom, 2014), to use as a baseline for our cross-lingual lexical entailment experiments (Section 7.2). BiCVM uses sentence aligned parallel corpora to learn representations for words in two languages, with the objective that when these representations are composed into representations for parallel sentences, the Euclidean distance between the parallel sentences should be minimized. We learn English-French vectors on the parallel corpora described in Section 6.1.

Monolingual Tasks
We first evaluate the monolingual version of our sparse model on English test sets. While our focus is on the cross-lingual setting, the monolingual evaluation lets us compare a version of our newly proposed approach with existing lexical entailment results (Levy et al., 2015), obtained using dense word representations compared with cosine similarity. This is not a controlled comparison, as training conditions are not comparable. Nevertheless it is reassuring to see that sparse word representations are roughly on par with previously published results. This suggests that they indeed provide good features for discovering entailment relations, using both cosine and balAPinc as metrics. Results (Table 4) show that sparse representations lead to performance comparable to previous approaches, thus providing a strong motivation for using the same representations for the cross-lingual task.

English Dataset               Levy et al.   Sparse+cosine   Sparse+balAPinc
Baroni et al. (2012)          .788          .745            .744
Baroni and Lenci (2011)       .197          .552            .546
Kotlerman et al. (2010)       .461          .620            .618
Turney and Mohammad (2015)    .642          .576            .587

Table 4: Evaluating sparse representations on monolingual lexical entailment (F-score): we compare previously published unsupervised results (Levy et al.) to our sparse word representations. While this is not a controlled comparison, we can see that our word representations yield roughly comparable performance to prior work.

We evaluate our proposed approach on the new English-French lexical entailment test set. We evaluate the impact of choosing a sparse representation by comparing our approach to the dense bilingual word representations obtained with the BiCVM model (Section 6.2). We also evaluate the usefulness of bilingual vs. monolingual word representations: given a bilingual example $(w_e, v_f)$, we translate $v_f$ into English using Google Translate, and then detect lexical entailment using English sparse representations for the English pair $(w_e, v_e)$ as described in Section 6.2.

Cross-lingual Task
Results are summarized in Table 5. First, we observe that balAPinc outperforms cosine for all word representations, confirming that the directional metric is better suited to discovering lexical entailment. Second, all sparse models significantly outperform the model based on dense representations, which suggests that sparsity helps discover useful context features. Finally, our proposed approach (balAPinc with features from sparse bilingual representations) yields the best result overall, performing better than the second best model (cosine with features from sparse bilingual representations) by approximately 1.6 points. This difference is highly statistically significant (at p < 0.01) according to McNemar's test (Dietterich, 1998). Our model also outperforms translation followed by monolingual entailment, confirming the need for models that directly compare the meaning of words across languages, instead of using translation as a proxy.
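For readers unfamiliar with McNemar's test, the sketch below shows how such a significance test compares two classifiers on the same test set: only the examples on which exactly one system is correct matter. It assumes the chi-square form of the statistic with continuity correction and one degree of freedom; which variant was used for the reported result is not stated above, and the function name and toy predictions are hypothetical.

```python
import math


def mcnemar_p_value(gold, pred_a, pred_b):
    """Two-sided p-value of McNemar's test (chi-square form, continuity corrected)."""
    # b: system A correct and system B wrong; c: the opposite disagreement.
    b = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a == g and o != g)
    c = sum(1 for g, a, o in zip(gold, pred_a, pred_b) if a != g and o == g)
    if b + c == 0:
        return 1.0
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(statistic / 2.0))


# Toy predictions (invented): A alone is right on 30 examples, B alone on 10.
gold = [True] * 40
pred_a = [True] * 30 + [False] * 10
pred_b = [False] * 30 + [True] * 10
p_value = mcnemar_p_value(gold, pred_a, pred_b)
print(p_value < 0.01)
```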

Examining bilingual dimensions learned
One motivation for using sparse representations is that they yield interpretable dimensions: one can summarize a dimension using the top scoring words in its column. Interpreting five randomly selected dimensions learned in our bilingual model (Table 6) shows that we indeed learn English and French dimensions that align well, but that are not identical, reflecting the difference in contexts observed in monolingual English vs. French corpora, as needed to detect lexical entailment.
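Summarizing a dimension by its top scoring words amounts to sorting one column of the sparse representation matrix. A minimal sketch, assuming the representations are stored as a dictionary from word to a list of non-negative weights; the function name and the toy weights are invented for illustration.

```python
def top_words(embeddings, dimension, n=5):
    """Return the n words with the largest positive weight on the given dimension."""
    scored = [
        (vector[dimension], word)
        for word, vector in embeddings.items()
        if vector[dimension] > 0
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [word for _, word in scored[:n]]


# Toy sparse embeddings (invented weights) with two dimensions.
embeddings = {
    "virus": [0.9, 0.0],
    "infection": [0.7, 0.0],
    "cancer": [0.5, 0.1],
    "maison": [0.0, 0.8],
}
print(top_words(embeddings, dimension=0, n=2))
print(top_words(embeddings, dimension=1, n=2))
```

Applying the same function to the English and French matrices for the same dimension index gives the side-by-side summaries used to inspect how well dimensions align across languages.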

Sparse Vectors Help Capture Distributional Inclusion
One advantage of our sparse representations over dense bilingual representations is that they can better leverage an asymmetric scoring function like balAPinc. Consider the following two pairs from our dataset: (mesothelioma, tumeur) and (tumor, mésothéliome). The former is a positive example since mesothelioma → tumeur, but the latter is negative (since not all tumors are mesotheliomas). Cosine similarity is unable to differentiate between these two cases, assigning a high score to both these pairs, causing both of them to be labeled positive. However, balAPinc with sparse representations teases them apart by giving a high score to the first pair and a low score to the second.
In the bilingual sparse model, mesothelioma and mésothéliome have only one non-zero entry (in the dimension corresponding to ['virus', 'viruses', 'infection', 'cells', 'cancer']) whereas tumeur and tumor have five non-zero entries in their representations. Based on the distributional inclusion hypothesis, this difference in the number of non-zero entries is a strong basis for mesothelioma → tumor.

Benefits of Bilingual Modeling
Examining the results of the approach based on translation followed by monolingual entailment confirms the problems raised by sense ambiguities.
Consider the English word drug, which can be translated into the French drogue when used in the narcotics sense, and médicament when used in the medicinal sense. Thus the pair (antibiotic, drogue), which is correctly labeled as negative in the cross-lingual case, gets converted to (antibiotic, drug) by translation and is then incorrectly labeled as positive. Similarly, the pair (coriander, herbe), which is positive in the cross-lingual case, gets translated to (coriander, grass) because the French herbe is primarily aligned to the English grass (rather than herb). The translated pair is labeled negative.

Related Work
Bilingual Word Representations. Much recent work targets the problem of learning low-dimensional multilingual word representations, using matrix decomposition techniques such as Principal Component Analysis and Canonical Correlation Analysis (Gaussier et al., 2004; Jagarlamudi and Daumé III, 2012; Gardner et al., 2015), Latent Dirichlet Allocation (Mimno et al., 2009; Jagarlamudi and Daumé III, 2010), and neural distributional representations (Klementiev et al., 2012; Gouws et al., 2015; Lu et al., 2015, among others). However, these models have typically been evaluated on translation induction or document categorization, which, unlike lexical entailment, focus on capturing coarse cross-lingual correspondences.
Sparse Word Representations. While co-occurrence matrices and their PPMI transformed variants are early examples of sparse representations, recent work has leveraged Nonnegative Sparse Embedding (NNSE) (Murphy et al., 2012). These models have been augmented to incorporate different types of linguistically motivated constraints, such as compositionality of words into phrases (Fyshe et al., 2015), or a hierarchical regularizer that captures knowledge of word relations.
Sparse representations have also been used for monolingual lexical entailment in the Boolean Distributional Semantic Model (Kruszewski et al., 2015), which shares our hypothesis on the usefulness of sparsity in meaning representations. However, they are meant to be used in different settings: while the boolean features can interestingly capture formal semantics, they are not as useful in our unsupervised setting, since they do not provide the feature rankings required to use the balAPinc metric.

Cross-Lingual Semantic Analysis
To the best of our knowledge, lexical entailment has not been previously addressed in a cross-lingual setting. The long tradition of lexical semantic analysis in cross-lingual settings has mostly focused on using translations to characterize word meaning (Diab and Resnik, 2002; Carpuat and Wu, 2007; Lefever and Hoste, 2010; McCarthy et al., 2013, among others). An exception is Cross-lingual Textual Entailment (Mehdad et al., 2010), which aims to detect whether an English hypothesis H entails a text T written in another language. We plan to use our lexical models to address this task in the future.

Conclusion
In this work, we introduced the task of cross-lingual lexical entailment, which aims to detect whether the meaning of a word in one language can be inferred from the meaning of a word in another language. We constructed a dataset with gold annotations through crowdsourcing, and presented a top-performing solution based on novel sparse bilingual word representations that leverage both word co-occurrence patterns in monolingual corpora and bilingual correspondences learned in parallel text.
A key limitation of this work is that we address lexical entailment out of context, based on word representations that collapse multiple word senses into a single vector. This could be addressed in future work by adapting existing methods for learning sense-specific representations for dense vectors (Jauhar et al., 2015; Ettinger et al., 2016; Reisinger and Mooney, 2010; Guo et al., 2014; Huang et al., 2012; Neelakantan et al., 2015) to our sparse representations, and by targeting cross-lingual textual entailment tasks, which focus on full sentences rather than isolated words. We also plan to study lexical entailment on more languages and example types, as well as investigate the usefulness of our bilingual representations in higher level multilingual applications such as machine translation.