Weakly Supervised Cross-lingual Semantic Relation Classification via Knowledge Distillation

Words in different languages rarely cover the exact same semantic space. This work characterizes differences in meaning between words across languages using semantic relations that have been used to relate the meaning of English words. However, because of translation ambiguity, semantic relations are not always preserved by translation. We introduce a cross-lingual relation classifier trained only with English examples and a bilingual dictionary. Our classifier relies on a novel attention-based distillation approach to account for translation ambiguity when transferring knowledge from English to cross-lingual settings. On new English-Chinese and English-Hindi test sets, the resulting models largely outperform baselines that more naively rely on bilingual embeddings or dictionaries for cross-lingual transfer, and approach the performance of fully supervised systems on English tasks.


Introduction
Natural Language Processing (NLP) often uses translation lexicons for projecting models and data from one language to another, under the assumption that words and their translations in these lexicons are synonyms (Mayhew et al., 2017; Tsvetkov et al., 2014). However, in practice, translation lexicons include semantic relations other than synonymy, as can be seen (Table 1) in examples drawn from the MUSE dictionary (Lample et al., 2018). Peirsman and Padó (2011) show that distributional translation lexicons contain hyponyms and co-hyponyms, and that treating all translations as synonyms hurts cross-lingual projection performance. In the Paraphrase Database (Ganitkevitch et al., 2013), Pavlick et al. (2015) find that the diversity of semantic relations discovered in word-aligned parallel corpora yields paraphrases that span the lexical relations defined in the natural logic framework of MacCartney and Manning (2009).

Lexicon Entry        Semantic Relation
writer, [Hindi]      writer is more specific than [Hindi] (creator)
council, [Hindi]     council is more general than [Hindi] (council of ministers)
father, [Hindi]      father is mutually exclusive with [Hindi] (father's brother)

Table 1: Semantic relations between word pairs in an English-Hindi lexicon (Lample et al., 2018).

These non-synonymous translations are not just noise. They can be found even in high-quality parallel corpora, since the strategies used by professional translators to deal with words that have no direct equivalent in the target language include replacement by near-synonyms, hypernyms, or negated antonyms (Baker, 2011; Venuti, 2012; Chesterman, 2016).
In this work, we classify semantic relations between words in different languages. Given a word pair (water, [Chinese]), the classification task is to select one of the five entailment classes (Figure 1) defined under the natural logic framework of MacCartney and Manning (2009). This cross-lingual task cannot be solved by translation, as translation does not preserve semantic relations. Nor can we assume that labeled examples exist for all language pairs, and learning from English labeled examples is complicated by translation ambiguity (Figure 1).
We introduce BILEXNET, a neural classifier for semantic relations based on cross-lingual distributional and path-based features inspired by the monolingual LEXNET model (Shwartz and Dagan, 2016a,b) (Section 3). We then design a novel training procedure for BILEXNET that leverages weak supervision, in the form of examples translated from English, via a knowledge distillation technique guided by translation dictionaries (Section 4). We collect and release MULTILEXREL, a crowdsourced benchmark to evaluate models for this task on a high-resource (English-Chinese) and a low-resource (English-Hindi) language pair (Section 5). Experiments show that BILEXNET substantially outperforms translation baselines and approaches the performance of a fully supervised English semantic relation classifier (Section 7). Code for BILEXNET and the MULTILEXREL dataset are available at https://github.com/yogarshi/BiLexNet/.

Figure 1: On the left, we illustrate cross-lingual semantic relation classification: given the pair (water, [Chinese]) as input, the task is to select the Equivalence class (in bold/green) from the five possible relations. On the right, we show that semantic relations change under translation: [Chinese] translates to both liquid and water, and its semantic relations with water differ accordingly.

Related Work
Cruse (1986) describes how lexical relations can be organized into congruence relations of identity (synonymy: dog-canine), inclusion (hyponymy: dog-animal), overlap (compatibility: dog-pet), and disjunction (incompatibility: cat-dog), as well as antonymy (open-shut), and how these relations can be used to characterize differences in meaning. Variations on these fundamental relations have been used within semantic networks such as WordNet (Fellbaum, 1998), or as the basis of a framework for inference without formal logic representations (MacCartney and Manning, 2009). Recent work on semantic relation prediction largely focuses on a single relation between words in the same language, mostly English (Nastase et al., 2013; Vulić and Mrkšić, 2018; Glavaš and Ponzetto, 2017; Ono et al., 2015). Methods that deal with multiple semantic relations are fewer (Turney, 2008), and recent shared tasks have shown that this is a challenging problem, especially when ontologies and other structured resources are not available and models are trained only on raw corpora.
In cross-lingual settings, studies of semantic relations between words are mostly limited to the translation equivalence relation. Dictionary induction aims to automatically discover words that are translations of each other using monolingual or comparable corpora (Rapp, 1995;Fung and Yee, 1998;Irvine and Callison-Burch, 2017). The task is typically framed as unsupervised learning, and models rely on distributional properties to discover words that have the same meaning in the two languages. Recently, dictionary induction has also been used to evaluate multilingual word embeddings (Lample et al., 2018;Artetxe et al., 2018;Søgaard et al., 2018), leading to significant advances on the state-of-the-art.
However, Peirsman and Padó (2011) show that automatically induced bilingual lexicons exhibit multiple semantic relations including not only synonymy, but also hypernymy and co-hyponymy. This has prompted work on identifying hypernymy in cross-lingual settings (Vyas and Carpuat, 2016;Upadhyay et al., 2018). This paper broadens the scope of relations studied in cross-lingual settings, and addresses for the first time the task of distinguishing between multiple semantic relations for words in different languages.
Previous work studying semantic relations in multiple languages has focused on the different task of cross-lingual transfer. In such settings, the focus is on identifying semantic relations between two words in the same language, without training data in that language. There are broadly two strategies to solve this problem. One line of work (Glavaš and Vulić, 2018;Mrkšić et al., 2017) uses model transfer, where a single model is trained on data from a high-resource language, and is then ported to the target language using cross-lingual embeddings. In contrast, Roth and Upadhyay (2019) translate training data from English into a target language using a combination of unsupervised cross-lingual embeddings (Artetxe et al., 2018) and monolingual information from the target language.
Our approach to cross-lingual semantic relation classification builds on the monolingual classifier LEXNET (Shwartz and Dagan, 2016a,b), which achieved the highest performance (45 F1) among participating teams on the CogALex-V shared task on identifying semantic relations without ontologies or structured information. We adapt LEXNET to make cross-lingual predictions by modeling cross-lingual relations using lexico-syntactic paths from both languages.
Finally, our training procedure uses knowledge distillation to alleviate the lack of annotated cross-lingual pairs. Knowledge distillation was originally proposed to compress a model with many parameters (the teacher model) into a model with fewer parameters (the student model). It has also been used successfully to learn mappings between languages (Nakashole and Flauger, 2017), and to transfer knowledge from models trained on one language to a different target language for text classification (Xu and Yang, 2017) and belief tracking (Chen et al., 2018a), in settings where the classification labels are translation invariant. This work adapts distillation to a setting where labels may change when samples are translated.

BILEXNET: a Classifier for Cross-Lingual Semantic Relations

The task of classifying semantic relations is a multi-class classification problem, where the classes are the set of five semantic relations from Pavlick et al. (2015): Equivalence (X is the same as Y), Forward Entail (X is more specific than/is a type of Y), Backward Entail (X is more general than/encompasses Y), Exclusion (X is mutually exclusive with/is opposite to Y), and Other (X is unrelated or related in other ways to Y). We choose these relations as they have been useful in describing lexical relations between English paraphrases (Pavlick et al., 2015), and in downstream natural language inference systems (MacCartney and Manning, 2007, 2009).

Our classifier, BILEXNET, adapts the LEXNET English classifier (Shwartz and Dagan, 2016a,b) to cross-lingual settings. BILEXNET represents the input word pair (x, y) by a feature vector v_{xy}, consisting of complementary distributional and path-based features, i.e., v_{xy} = [v_x; v_y; v_{paths(x,y)}]. The distributional semantic properties of x and y are captured by bilingual word embeddings v_x and v_y, while v_{paths(x,y)} encodes lexico-syntactic paths that represent the relation between x and y in context (Hearst, 1992; Snow et al., 2004; Shwartz et al., 2016). For classification, v_{xy} is input to a multi-class classifier, parameterized as a feed-forward neural network with a single hidden layer:

\hat{l} = \mathrm{softmax}(W_2 \cdot \tanh(W_1 \cdot v_{xy}))    (1)

where W_1 and W_2 are the weights of the network, and the biases are omitted for simplicity.
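To make the classifier concrete, here is a minimal sketch in PyTorch of the forward pass in Equation 1. All dimensions, the tanh non-linearity, and the class name are illustrative assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class RelationClassifier(nn.Module):
    """Feed-forward classifier over v_xy = [v_x; v_y; v_paths(x,y)].

    A sketch of Equation 1; all sizes are placeholder assumptions."""

    def __init__(self, embed_dim=300, path_dim=60, hidden_dim=100, n_classes=5):
        super().__init__()
        input_dim = 2 * embed_dim + path_dim  # [v_x; v_y; v_paths(x,y)]
        self.hidden = nn.Linear(input_dim, hidden_dim)  # W_1
        self.out = nn.Linear(hidden_dim, n_classes)     # W_2

    def forward(self, v_x, v_y, v_paths):
        v_xy = torch.cat([v_x, v_y, v_paths], dim=-1)
        # Equation 1: softmax(W_2 . tanh(W_1 . v_xy))
        return torch.softmax(self.out(torch.tanh(self.hidden(v_xy))), dim=-1)
```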

Cross-lingual Paths
In LEXNET, a lexico-syntactic path is the sequence of edges that lead from x to y in the dependency tree of a sentence. Each edge contains the word and part-of-speech tag of the source node, the dependency label of the edge, and the edge direction between two subsequent nodes (see Figure 2 for an example of the path connecting pigs and animals). The vector representation of each edge is formed by the concatenation of the vectors of these four components. v paths(x,y) is obtained by encoding the sequence of edges using an LSTM (Hochreiter and Schmidhuber, 1997).
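The edge and path encoding can be sketched as follows; the component embedding sizes and the use of a unidirectional LSTM are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class PathEncoder(nn.Module):
    """Encodes one lexico-syntactic path: each edge vector concatenates
    word, POS, dependency-label, and direction embeddings, and the edge
    sequence is summarized by an LSTM. Sizes are placeholder assumptions."""

    def __init__(self, n_words, n_pos, n_deps, dims=(100, 10, 10, 5), hidden=60):
        super().__init__()
        self.word = nn.Embedding(n_words, dims[0])
        self.pos = nn.Embedding(n_pos, dims[1])
        self.dep = nn.Embedding(n_deps, dims[2])
        self.dir = nn.Embedding(3, dims[3])  # up / down / alignment link
        self.lstm = nn.LSTM(sum(dims), hidden, batch_first=True)

    def forward(self, words, pos, deps, dirs):
        # Each argument is a LongTensor of shape (1, path_length).
        edges = torch.cat([self.word(words), self.pos(pos),
                           self.dep(deps), self.dir(dirs)], dim=-1)
        _, (h_n, _) = self.lstm(edges)
        return h_n[-1]  # final hidden state is the path representation
```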
In English, these paths are extracted from sentences where x and y co-occur. However, when x and y are in different languages, a new path definition is required. For a cross-lingual pair (x_e, y_f), we extract cross-lingual paths v_{paths(x_e, y_f)} from a word-aligned parallel corpus (Figure 2). We first extract all parallel sentences that contain x_e on the source side and y_f on the target side. For each sentence pair, using word alignments, we extract x_f, the target word aligned to x_e, and y_e, the source word aligned to y_f. We then extract a path connecting the two words in the source sentence, i.e., x_e and y_e. Since different languages can encode the same information differently due to structural divergences (Dorr, 1994), we also extract a corresponding path connecting the two words in the target sentence, i.e., x_f and y_f. Thus, if the parallel corpus contains m sentence pairs where x_e occurs on the source side and y_f on the target side, we extract a total of 2m paths. All 2m paths are encoded using a single LSTM, and their encodings are averaged to form v_{paths(x_e, y_f)}.
Two special cases arise from this definition. First, a path can be a single alignment link if x_e and y_f are aligned to each other, i.e., x_f = y_f and y_e = x_e. Second, if no path is found in the corpus, v_{paths(x_e, y_f)} is set to the zero vector.
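The extraction procedure can be sketched as follows. The corpus iterator, the tree objects, and the shortest_path helper are hypothetical interfaces introduced for illustration only.

```python
def cross_lingual_paths(x_e, y_f, parallel_corpus, shortest_path):
    """Sketch of cross-lingual path extraction from a word-aligned
    parallel corpus; all helper interfaces are hypothetical."""
    paths = []
    for src_tree, tgt_tree, align in parallel_corpus:
        if x_e not in src_tree.words or y_f not in tgt_tree.words:
            continue
        x_f = align.target_of(x_e)  # target word aligned to x_e
        y_e = align.source_of(y_f)  # source word aligned to y_f
        if x_f is None or y_e is None:
            continue
        if x_f == y_f and y_e == x_e:
            # Special case 1: x_e and y_f are directly aligned.
            paths.append(["<ALIGNMENT_LINK>"])
            continue
        paths.append(shortest_path(src_tree, x_e, y_e))  # source-side path
        paths.append(shortest_path(tgt_tree, x_f, y_f))  # target-side path
    # All paths are encoded with one LSTM and averaged into v_paths(x_e, y_f);
    # if no paths were found (special case 2), the zero vector is used instead.
    return paths
```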

Weakly Supervised Training via Knowledge Distillation
Cross-lingual examples that would enable fully supervised training of BILEXNET are hard to obtain: examples of relations such as synonymy or hypernymy can be derived from multilingual WordNets (Bond and Foster, 2013), but such resources are not available for many languages, and they cover only a subset of semantic relations. Instead, we introduce a dictionary-guided variant of knowledge distillation to train BILEXNET. This procedure relies only on a set of monolingual labeled examples, which are readily available for various lexical relations in English, and a translation dictionary that maps words in the source language to the target language.

Our approach transfers knowledge from a monolingual teacher model to a cross-lingual student model. The teacher model is a monolingual LEXNET model M_e trained on the source-language examples S = {(x_{e,i}, y_{e,i}, l_i)}, where x_{e,i} and y_{e,i} are a pair of words in the same language and l_i ∈ R^c is a one-hot encoding of the relation between x_{e,i} and y_{e,i} (c is the number of possible relations). Given (x_{e,i}, y_{e,i}, l_i) ∈ S, M_e is trained by minimizing the cross-entropy loss between the predicted output \hat{l}_{e→e} and the gold label l_i:

L_{CE}(M_e) = -\sum_i l_i \cdot \log \hat{l}_{e→e}    (2)

BILEXNET plays the role of the student model (denoted M_{ef}) and is trained to make predictions that agree with those of the teacher model. The student model is trained using weak supervision, obtained by using a bilingual dictionary D to translate the right side of each training pair into the target language, yielding S' = {(x_{e,i}, T_i, l_i)}, where T_i = {t_{i1}, ..., t_{in}} is the set of translations of y_{e,i} in D. S' provides only weak supervision because semantic relations are not translation invariant (Figure 1), and hence the label l_i is not correct for every (x_{e,i}, t_{ik}) pair.
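Constructing the weakly supervised set S' is then a simple dictionary lookup, sketched below; the dictionary interface is an assumption.

```python
def build_weak_supervision(S, dictionary):
    """Builds S' by translating the right side of each monolingual pair
    with the bilingual dictionary D. `dictionary` is assumed to map a
    word to its list of translations."""
    S_prime = []
    for x_e, y_e, label in S:
        translations = dictionary.get(y_e, [])
        if translations:  # pairs with no dictionary entry yield no weak example
            S_prime.append((x_e, translations, label))
    return S_prime
```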
To extract useful training signal from this weak supervision, we use an attention mechanism that guides the model to attend to translations that preserve the monolingual label. The attention component constructs the input representation for the cross-lingual model M_{ef} in Equation 1 by averaging the representations of all translation candidates, giving more weight to those that are likely to preserve the monolingual label. Given a training sample (x_{e,i}, T_i, l_i) with T_i = {t_{i1}, ..., t_{in}}, the score for a candidate translation t_{ik} is computed by a feed-forward network f (with one hidden layer) over the word embeddings of x_{e,i} and t_{ik}, along with l, an embedding of the gold label l_i. The label embedding l is provided to help select translations that are consistent with the correct label for the monolingual pair. The scores for all translations are converted to probabilities using the softmax function, and the input features \tilde{v}_i for the student model M_{ef} are the sum of the feature vectors obtained from each translation, weighted by these probabilities.
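A minimal sketch of this attention component follows; the scorer architecture, embedding sizes, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TranslationAttention(nn.Module):
    """Scores each candidate translation t_ik from the embeddings of
    x_e and t_ik plus an embedding of the gold label, then mixes the
    per-candidate feature vectors with the softmaxed scores."""

    def __init__(self, embed_dim=300, n_classes=5, label_dim=20, hidden=50):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes, label_dim)
        self.scorer = nn.Sequential(
            nn.Linear(2 * embed_dim + label_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_embed, cand_embeds, label_id, cand_features):
        # x_embed: (embed_dim,); cand_embeds: (n, embed_dim)
        # label_id: scalar LongTensor; cand_features: (n, feature_dim)
        n = cand_embeds.size(0)
        label = self.label_embed(label_id).expand(n, -1)
        x = x_embed.expand(n, -1)
        scores = self.scorer(torch.cat([x, cand_embeds, label], dim=-1))
        probs = torch.softmax(scores.squeeze(-1), dim=0)
        # Attended student input: probability-weighted sum of candidate features.
        return (probs.unsqueeze(-1) * cand_features).sum(dim=0)
```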
The student model is then trained to minimize the distillation objective:

L(M_{ef}) = \alpha \cdot CE(l_i, \hat{l}_{e→f}) + (1 - \alpha) \cdot KL(\hat{l}_{e→e} \,\|\, \hat{l}_{e→f})    (3)

where \hat{l}_{e→e} is computed using M_e, \hat{l}_{e→f} is computed by feeding the attended representation \tilde{v}_i to M_{ef}, and \alpha is an interpolation parameter. The first term is again a cross-entropy loss, measuring how well the cross-lingual model M_{ef} predicts the relation given \tilde{v}_i. The second term uses the KL-divergence (Kullback and Leibler, 1951) to penalize differences between the predictions of M_{ef} on the cross-lingual input \tilde{v}_i and the predictions of M_e on the monolingual input v_{x_{e,i} y_{e,i}}. As is typical in distillation, we flatten the softmax of both inputs to the KL-divergence term with a temperature parameter τ.
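The objective can be sketched as below; this reconstruction follows the description above (some distillation variants also scale the KL term by τ², which we omit here).

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_labels, alpha, tau):
    """Sketch of Equation 3: cross-entropy against the monolingual gold
    labels (class indices), plus a temperature-flattened KL term that
    matches the teacher's predictions."""
    ce = F.cross_entropy(student_logits, gold_labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    )
    return alpha * ce + (1.0 - alpha) * kl
```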

MULTILEXREL: A Dataset for Cross-lingual Semantic Relations
Existing resources containing annotated cross-lingual lexical relations are limited in scope, quality, and quantity. For instance, in previous work (Upadhyay et al., 2018), we provided datasets annotated for cross-lingual hypernymy, but did not consider other relations. On the other hand, resources such as bilingual dictionaries or the Open Multilingual WordNet (Bond and Foster, 2013) can be mined for examples of synonyms, hypernyms, and hyponyms, but such examples are likely to be noisy because these resources are created in a semi-automatic way.
In this work, we crowdsource (via http://figure-eight.com/) MULTILEXREL, a set of new high-quality annotations for English-Hindi (En-Hi) and English-Chinese (En-Zh) word pairs, using the natural logic relations laid out in Section 3. We leverage monolingual annotations to speed up the process and to enable comparisons between monolingual and cross-lingual models. We use Google Translate to translate one side of a randomly sampled subset of the gold-standard dataset of semantic relations created by Pavlick et al. (2015), and ask crowdworkers whether the semantic relation still holds after translation. Each example is annotated by five annotators, and annotations are aggregated using MACE (Hovy et al., 2013), a Bayesian model that estimates the trustworthiness of annotators and accordingly assigns a label to each instance. The distributions of the five relations for each language pair are shown in Table 2.

[Table 2: distribution of the five relations over the En-Hi and En-Zh sets; the counts are not preserved in this copy.]

All in all, the size of the training set is ∼20K pairs. Like previous work (Levy et al., 2015), we ensure a lexical split: the English words in the test data are not seen in the training data. This makes the task challenging, as it prevents the model from memorizing patterns of words such as their "prototypicality" for certain relations, i.e., whether certain words are likely to appear in specific relations. One simple way to realize such a split is sketched below.
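This is an illustrative sketch only, assuming pairs are (english_word, target_word, label) triples; the authors' exact splitting procedure may differ.

```python
import random

def lexical_split(pairs, test_fraction=0.2, seed=0):
    """Splits (english_word, target_word, label) triples so that English
    words appearing in test pairs never appear in training pairs."""
    en_words = sorted({x for x, _, _ in pairs})
    random.Random(seed).shuffle(en_words)
    test_words = set(en_words[: int(test_fraction * len(en_words))])
    test = [p for p in pairs if p[0] in test_words]
    train = [p for p in pairs if p[0] not in test_words]
    return train, test
```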
Validation data Since we assume no access to labeled cross-lingual examples, we need to define a validation set using the resources available to us. We construct a validation set by randomly removing 1000 pairs from the training data, and automatically translating the right side of each example with the bilingual dictionary used for training. This process yields a noisy validation set, which is solely used for tuning hyper-parameters.
Unlabeled Resources The bilingual dictionary for knowledge distillation is obtained from the MUSE project (Lample et al., 2018) for En-Hi, while the MDBG dictionary is used for En-Zh. We use FastText bilingual embeddings (Bojanowski et al., 2017). We extract English paths for the monolingual model from a dump of the English Wikipedia. Cross-lingual paths are extracted from a random sample of the WMT18 parallel corpora for En-Zh (∼5M sentences) and from the IIT Bombay English-Hindi corpus (Kunchukuttan et al., 2018) for En-Hi (∼1M sentences). All corpora are parsed using YaraParser (Rasooli and Tetreault, 2015), trained on the treebank of the corresponding language from the Universal Dependencies (v2.2) project (McDonald et al., 2013). Tokenization is performed using the Moses tokenizer for English (Koehn et al., 2007), the Indic NLP tokenizer for Hindi, and the Jieba word segmenter for Chinese.

Model Configurations and Baselines
Model Configuration The path-encoder LSTM has two layers with 60 hidden units each, with dropout (Srivastava et al., 2014).

Baselines Our experiments aim to assess how the direct cross-lingual modeling of semantic relations in BILEXNET impacts predictions, and to isolate the impact of key training components: knowledge distillation and translation selection via attention. We compare against the following baselines:

RANDOM BASELINE: Randomly assign one of the five semantic relations to each word pair.
TRANSLATION BASELINE: This baseline combines dictionary translation with the English-only system ENLEXNET to gauge the relative difficulty of predicting semantic relations across languages rather than in English only. Each pair (x_e, y_f) in the test set is translated into English using the bilingual dictionary D. Since y_f can have multiple translations, we pair x_e with each of them and use ENLEXNET to predict a relation for each resulting pair. The relation for (x_e, y_f) is then chosen as the most general relation among those predicted for the translated pairs, according to the order in which they appear in Table 2. (We also experimented with a voting-based approach for combining the predictions, but it generally performed worse.) A sketch of this combination rule is given below.

BILEXNET (NO DISTILLATION): A simple strategy for cross-lingual transfer consists of seeding a vanilla LEXNET model with bilingual embeddings in the source and target languages before training. This strategy has been used successfully for other NLP tasks (Klementiev et al., 2012; Guo et al., 2015, inter alia). By keeping the embeddings fixed, we can use source-language data to train the monolingual LEXNET model with features based on source embeddings and source-language paths, as usual. At inference time, the model uses both the source and target embeddings as input, along with the cross-lingual paths defined above.
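To make the TRANSLATION BASELINE's combination rule concrete, here is the sketch referenced above; the English model interface and the fallback for out-of-dictionary words are assumptions, and the generality ordering comes from Table 2, whose exact contents are not preserved in this copy.

```python
def translation_baseline(x_e, y_f, dictionary, en_model, relation_order):
    """Translate y_f into English, predict a relation for each
    (x_e, translation) pair with the English model, and return the most
    general prediction. `relation_order` lists relations from most to
    least general (the Table 2 order); `en_model.predict` is a
    hypothetical interface."""
    predictions = {en_model.predict(x_e, t) for t in dictionary.get(y_f, [])}
    if not predictions:
        return "Other"  # assumption: back off when y_f has no translation
    return min(predictions, key=relation_order.index)
```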
SPECIALIZED TENSOR MODEL (STM): How does a model that has primarily been used for comparing words in the same language perform on cross-lingual comparisons? Our final baseline aims to answer this question. Proposed by Glavaš and Vulić (2018), STM is a neural architecture for identifying semantic relations that achieves state-of-the-art performance on two English datasets. STM is based on the hypothesis that specialized word embeddings are necessary to accurately disambiguate between semantic relations. More precisely, STM assumes that different specializations of generic word embeddings are needed to recognize different relations, and that interactions between the specialized vectors can be used to identify semantic relations. These specializations are implemented as K feed-forward neural networks. Given a word pair, STM takes as input a pair of generic word embeddings, which are then specialized by the K transformations. Each pair of corresponding specialized embeddings is used to compute a score based on a non-linear transformation of their bilinear product. Finally, the K scores obtained from the K pairs of specialized embeddings are used as features to train a multi-class classifier.
Besides English, STM has also been used for cross-lingual transfer, where a model trained on one language (say English) is used to test on word pairs in another language (say German). Here, we use STM in a new setting: predicting the semantic relation between two words in different languages.
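For reference, the STM architecture described above can be sketched as follows; layer sizes and non-linearities are assumptions, and the official implementation should be consulted for the exact details.

```python
import torch
import torch.nn as nn

class STM(nn.Module):
    """K feed-forward specializations of generic embeddings, one bilinear
    score per specialization, and a classifier over the K scores."""

    def __init__(self, embed_dim=300, k=5, spec_dim=300, n_classes=5):
        super().__init__()
        self.specializers = nn.ModuleList(
            [nn.Sequential(nn.Linear(embed_dim, spec_dim), nn.Tanh())
             for _ in range(k)]
        )
        self.bilinear = nn.ModuleList(
            [nn.Bilinear(spec_dim, spec_dim, 1) for _ in range(k)]
        )
        self.classifier = nn.Linear(k, n_classes)

    def forward(self, v_x, v_y):
        # v_x, v_y: (batch, embed_dim) generic word embeddings
        scores = [torch.tanh(b(f(v_x), f(v_y)))
                  for f, b in zip(self.specializers, self.bilinear)]
        return self.classifier(torch.cat(scores, dim=-1))
```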
We use the official implementation of STM with the same bilingual embeddings used by BILEXNET, and tune three hyperparameters on the validation set.

Results
Table 3 summarizes the results obtained on the MULTILEXREL test sets. BILEXNET achieves F1 scores that are roughly double those obtained by the random baseline for 5-way classification.
Impact of cross-lingual modeling We assess the impact of direct cross-lingual modeling in BILEXNET by comparing against the TRANSLATION BASELINE. By using a translation dictionary to naïvely convert cross-lingual relation prediction into an English-only task, the translation baseline achieves F1 scores 8 to 13 points higher than RANDOM for both language pairs. This difference can be attributed to easy examples where English semantic relations are preserved by simple dictionary translation. BILEXNET further improves F1 by 8 to 15 points over the TRANSLATION BASELINE, primarily by improving recall.
Supervised English system Without cross-lingual training samples, we cannot compare weakly supervised and fully supervised training of BILEXNET in a controlled fashion. However, the supervised monolingual ENLEXNET model (Section 6) evaluated on the En-En test set offers a reference point: remarkably, the F1 scores of BILEXNET are only 1 to 3 points lower than those of the supervised English model (∼44 on the En-En test set).

Impact of knowledge distillation
We compare the full BILEXNET model to the naïve baseline, BILEXNET (NO DISTILLATION), which relies only on embeddings for cross-lingual transfer and does not perform cross-lingual distillation. This approach performs on par with or slightly better than the translation baseline, but ∼9 points worse than the full BILEXNET model, losing on both precision and recall.

[Table 3: Precision (P), Recall (R), and F1-score (F) for BILEXNET and contrastive baselines on the two MULTILEXREL test sets. All configurations (except STM) are trained with five random seeds; we report the mean score and standard deviation. The full BILEXNET model performs best and is consistently better with attention. The scores themselves are not preserved in this copy.]

This result confirms the benefit of aligning training and test conditions for our model via knowledge distillation, rather than relying solely on embeddings. These results are consistent with prior findings on distributional representations:

• Distributional representations have difficulty discriminating between multiple semantic relations (Chersoni et al., 2016). As such, relying solely on word embeddings for cross-lingual transfer can cause loss of knowledge during transfer.

• Syntactic divergences cause differences between paths in the source and target languages. This can create a distribution shift between the features seen by the classifier at training and test time, hurting performance. Again, word embeddings are not sufficient to bridge the gap between the distributions of the two languages (Chen et al., 2018b).

Impact of attention
We test the impact of the attention model in BILEXNET by removing it, and instead translating training samples for distillation using the single most frequent translation. Removing attention yields small but consistent degradations, suggesting that attending to multiple translations is beneficial, but leaves room for improvement. We analyze the behavior of the attention model in the next section.
Specialization Finally, we observe that the F1 scores of STM are significantly worse than those of BILEXNET. In fact, it is the weakest model for En-Zh, and it is only 3 points better than the translation baseline for En-Hi. The relatively poor performance of STM highlights that our cross-lingual task, which directly compares words in two languages, is fundamentally different from the cross-lingual transfer task, where models trained in one language are ported to other languages. Thus, models such as STM, which have been designed for transfer, may not be directly suitable for our task.

Analysis
This section further breaks down the results, and highlights some successes and failures of BILEXNET to guide future work on cross-lingual semantic relation classification.

Performance Per Class
We break down the performance of the BILEXNET model per target relation (Table 4). The Equivalence and Exclusion classes are the hardest to predict correctly, which is consistent with our monolingual results and those of prior work (Shwartz and Dagan, 2016a): distributional models have trouble distinguishing synonyms from antonyms (Yih et al., 2012), and synonyms rarely co-occur in the same sentence, so path-based methods are less useful for these classes. However, in BILEXNET, words of the Equivalence class can occur in a parallel sentence pair where they are aligned to each other. This provides a direct signal for examples of this class, which helps discriminate between Equivalence and Exclusion.
[Table 4: per-class performance of BILEXNET for En-Hi, En-Zh, and En-En; the per-class scores are not preserved in this copy.]

The largest fraction of errors is caused by the model predicting Other instead of a specific relation. This suggests that special treatment of this class might improve performance, perhaps via a multi-step process that first filters out pairs not related under the relations we target, and then performs 4-way classification on the remaining examples. This is similar to the CogALex shared task, where the first part of the task is to eliminate completely unrelated pairs before predicting relations for the remaining pairs. However, filtering out unrelated pairs is an easier task than filtering out pairs in the Other category.
Missing cross-lingual paths Cross-lingual paths might not exist for all word pairs, particularly for language pairs with limited parallel data such as En-Hi. In such cases, BILEXNET relies only on word embeddings as features to predict semantic relations. We assess the impact of missing paths in the En-Hi setting by comparing classification performance on pairs that have cross-lingual paths (70% of the test set) against pairs that do not. The former subset has a higher F1 score (44.6) than the latter (40.2), mainly due to differences in recall. This difference in performance also confirms that cross-lingual paths complement word embeddings, in the same way that monolingual paths do.
Attention Analysis We complement the ablation experiments in Table 3 by examining a random sample of 25 monolingual training pairs (x_e, y_e) where y_e has multiple translations in the bilingual dictionary. We manually check for how many pairs the model places the highest attention weight on a translation that preserves the relation label of the monolingual pair. This happens in 64% of the cases (16 out of 25). The attention model is often able to choose the right translation of y_e based on the context provided by x_e and the gold label. For example, given the monolingual example (drop, fall, Forward Entail), the model places the highest weight on the Hindi word that captures the "moving downward" sense. On the other hand, for the example (autumn, fall, Equivalence), the model correctly identifies the Hindi word for the season as the right translation. There remains substantial room for improving the attention component, however. Some failure cases among the 25 examples occur for pairs where the set of translations of y_e contains an incorrect translation that is completely unrelated to x_e or y_e. For example, given (country, uganda), the model chooses the Hindi word that transliterates to candle rather than the transliteration of uganda. This is an extreme example, but such errors are also more likely to occur when the noisy translation is in the same domain as x_e and y_e. Fixing such errors could further improve the training process.

Conclusion
This work contributes data and models for the task of classifying semantic relations between words in different languages using only monolingual English supervision. We introduced MULTILEXREL, a dataset of about 1000 English-Hindi and 900 English-Chinese word pairs annotated with the natural logic lexical entailment classes of MacCartney and Manning (2007), and BILEXNET, a cross-lingual relation classification model.
We also proposed a knowledge distillation algorithm for BILEXNET, which only needs annotated monolingual examples and a bilingual dictionary. Unlike previous uses of knowledge distillation for cross-lingual transfer, our approach does not assume that labels are translation invariant, and relies on an attention mechanism to select translations that best explain a given label. Experiments show that this method largely outperforms baselines that use bilingual embeddings or dictionaries more naïvely for cross-lingual transfer, and that it approaches the performance of fully supervised systems on an English-only version of the task.