CLUSE: Cross-Lingual Unsupervised Sense Embeddings

This paper proposes a modularized sense induction and representation learning model that jointly learns bilingual sense embeddings that align well in the vector space, where the cross-lingual signal in the English-Chinese parallel corpus is exploited to capture the collocation and distributed characteristics in the language pair. The model is evaluated on the Stanford Contextual Word Similarity (SCWS) dataset to ensure the quality of monolingual sense embeddings. In addition, we introduce Bilingual Contextual Word Similarity (BCWS), a large and high-quality dataset for evaluating cross-lingual sense embeddings, which is the first attempt at measuring whether the learned embeddings are indeed aligned well in the vector space. The proposed approach shows superior quality of sense embeddings evaluated in both monolingual and bilingual spaces.


Introduction
Word embeddings have recently become a basic component in most NLP tasks for their ability to capture semantic and distributed relationships learned in an unsupervised manner. A higher similarity between word vectors can indicate similar meanings of words. Therefore, embeddings that encode semantics have been shown to serve as a good initialization and to benefit several NLP tasks. However, word embeddings do not allow a word to have different meanings in different contexts, a phenomenon known as polysemy. For example, "apple" may have different meanings in fruit and technology contexts. Several attempts have been made to tackle this problem by inferring multi-sense word representations (Reisinger and Mooney, 2010; Neelakantan et al., 2014; Li and Jurafsky, 2015; Lee and Chen, 2017). These approaches relied on the "one-sense per collocation" heuristic (Yarowsky, 1993), which assumes that the presence of nearby words correlates with the sense of the word of interest. However, this heuristic provides only a weak signal for discriminating sense identities, and it requires a large amount of training data to achieve competitive performance.
Considering that different senses of a word may be translated into different words in a foreign language, Guo et al. (2014) and Šuster et al. (2016) proposed to learn multi-sense embeddings using this additional signal. For example, "bank" in English can be translated into banque or banc in French, depending on whether the sense is financial or geographical. Such information allows the model to identify which sense a word belongs to. However, the drawback of these models is that the trained foreign language embeddings are not aligned well with the original embeddings in the vector space.

This paper addresses these limitations by proposing a bilingual modularized sense induction and representation learning system. Our learning framework is the first pure sense representation learning approach that allows us to utilize two different languages to disambiguate words in English. To fully use the linguistic signals provided by bilingual language pairs, it is necessary to ensure that the embeddings of the two languages are related to each other (i.e., they align well in the vector space). We solve this by proposing an algorithm that jointly learns sense representations between languages. The contributions of this paper are four-fold:
• We propose the first system that maintains purely sense-level cross-lingual representation learning with linear-time sense decoding.
• We are among the first to propose a single objective for modularized bilingual sense embedding learning.
• We are the first to introduce a high-quality dataset for directly evaluating bilingual sense embeddings.
• Our experimental results show state-of-the-art performance for both monolingual and bilingual contextual word similarities.

Related Work
Representation learning has been studied extensively. This work focuses on bridging sense embeddings and cross-lingual embeddings, and on introducing a newly collected bilingual dataset for better evaluation.
Sense Embeddings Reisinger and Mooney (2010) first proposed multi-prototype embeddings to address the lexical ambiguity when using a single embedding to represent multiple meanings of a word. Huang et al. (2012); Neelakantan et al. (2014); Li and Jurafsky (2015); Bartunov et al. (2016) utilized neural networks as well as the Bayesian non-parametric method to learn sense embeddings. Lee and Chen (2017) first utilized a reinforcement learning approach and proposed a modularized framework that separates learning of senses from that of words. However, none of them leverages the bilingual signal, which may be helpful for disambiguating senses.

Cross-Lingual Word Embeddings Klementiev et al. (2012) first pointed out the importance of learning cross-lingual word embeddings in the same space and proposed the cross-lingual document classification (CLDC) dataset for extrinsic evaluation. Gouws et al. (2015) trained directly on monolingual data and extracted a bilingual signal from a smaller set of parallel data. Kočiskỳ et al. (2014) used a probabilistic model that simultaneously learns alignments and distributed representations for bilingual data by marginalizing over word alignments. Hermann and Blunsom (2014) learned word embeddings by minimizing the distances between compositional representations of parallel sentence pairs. Šuster et al. (2016) reconstructed the bag-of-words representation of semantically equivalent sentence pairs to learn word embeddings. Shi et al. (2015) proposed a training algorithm in the form of matrix decomposition, and induced cross-lingual constraints for simultaneously factorizing monolingual matrices. Luong et al. (2015) extended the skip-gram model to bilingual corpora where contexts of bilingual word pairs were jointly predicted. Wei and Deng (2017) proposed a variational autoencoding approach that explicitly models the underlying semantics of the parallel sentence pairs and guides the generation of the sentence pairs.
Although the above approaches aimed to learn cross-lingual embeddings jointly, they fused different meanings of a word in one embedding, leading to lexical ambiguity in the vector space model. Guo et al. (2014) adopted the heuristic that different meanings of a polysemous word can usually be represented by different words in another language, and clustered bilingual word embeddings to induce senses. Šuster et al. (2016) proposed an encoder, which uses parallel corpora to choose a sense for a given word, and a decoder that predicts context words based on the chosen sense. Bansal et al. (2012) proposed an unsupervised method for clustering the translations of a word, such that the translations in each cluster share a common semantic sense. Upadhyay et al. (2017) leveraged cross-lingual signals in more than two languages. However, they either used pretrained embeddings or learned only for the English side, which is undesirable since cross-lingual embeddings should be jointly learned such that they align well in the embedding space.

Evaluation Datasets Several datasets can be used to assess the performance of learned sense embeddings. Huang et al. (2012) presented SCWS, the first and only dataset that contains word pairs and their sentential contexts for measuring the quality of sense embeddings. However, it is a monolingual dataset constructed in English, so it cannot evaluate cross-lingual semantic word similarity. On the other hand, while Camacho-Collados et al. (2017) proposed a cross-lingual semantic similarity dataset, it ignored the contextual words and kept only word pairs, making it impossible to judge sense-level similarity. In this paper, we present an English-Chinese contextual word similarity dataset in order to benchmark experiments on bilingual sense embeddings.

CLUSE: Cross-Lingual Unsupervised Sense Embeddings
Our proposed model borrows the idea of modularization from Lee and Chen (2017), which treats the sense induction and representation modules separately to avoid mixing word-level and sense-level embeddings together. Our model consists of four different modules illustrated in Figure 1, where sense induction modules decide the senses of words, and two sense representation learning modules optimize the sense collocated likelihood for learning sense embeddings within a language and between two languages in a joint manner. All modules are detailed below.

Figure 1: Sense induction modules decide the senses of words, and two sense representation learning modules optimize the sense collocated likelihood for learning sense embeddings within a language and between two languages. Two languages are treated equally and optimized iteratively. (Panel shown: Monolingual Sense Representation Learning (EN); example sentence: "Apple company designs the best cellphone in the world".)

Notations
We denote our parallel corpus without word alignment as $C$, where $C^{en}$ is the English part and $C^{zh}$ is the Chinese part. Our English vocabulary is $W^{en}$ and our Chinese vocabulary is $W^{zh}$. Moreover, $C^{en}_t$ and $C^{zh}_t$ are the t-th sentence-level parallel sentences in English and Chinese respectively. In the following sections, we treat English as the major language and Chinese as an additional bilingual signal, while their roles can be mutually exchanged. Specifically, English and Chinese iteratively become the major language during the training procedure.

Bilingual Sense Induction Module
The bilingual sense induction module takes a parallel sentence pair as input and determines which sense identity a target word belongs to given the bilingual contextual information. Formally, for the t-th English sentence $C^{en}_t$, we aim to decode the most probable sense $z_{ik} \in Z_i$ for the i-th word $w_i \in W^{en}$ in $C^{en}_t$, where $Z_i$ is the set of sense candidates for $w_i$ and $1 \le k \le |Z_i|$. We assume that the meaning of $w_i$ can be determined by its surrounding words, or the so-called local context,

$$c_i = \{w_{i-m}, \dots, w_{i-1}, w_{i+1}, \dots, w_{i+m}\},$$

where $m$ is the context window size. Aside from monolingual information, it is desirable to exploit the parallel sentences as additional bilingual contexts to enable cross-lingual embedding learning. Note that word alignment is not required in this work, so we consider the whole parallel bilingual sentence during training. For training efficiency, we sample $M$ words from the parallel bilingual sentence, keeping their original relative order, or pad the sentence to length $M$ if it is shorter than $M$. Formally, given the t-th parallel bilingual sentence $C^{zh}_t$, the bilingual context of $w_i$ is therefore

$$c'_i = \{w'_1, \dots, w'_M\}, \quad w'_j \in C^{zh}_t.$$

To ensure efficiency, the continuous bag-of-words (CBOW) model is applied, where it takes word-level input tokens and outputs sense-level identities. Specifically, given an English word embedding matrix $P^{en}$, the local context can be modeled as the average of word embeddings from its context, $\frac{1}{|c_i|} \sum_{w_j \in c_i} P^{en}_j$. Similarly, we can model the bilingual contextual information given the Chinese word embedding matrix $P^{zh}$ using the CBOW formulation and obtain $\frac{1}{M} \sum_{w'_j \in c'_i} P^{zh}_j$. We linearly combine the contextual information from different languages as:

$$\bar{C}_i = \alpha \cdot \frac{1}{|c_i|} \sum_{w_j \in c_i} P^{en}_j + (1 - \alpha) \cdot \frac{1}{M} \sum_{w'_j \in c'_i} P^{zh}_j. \tag{1}$$

The likelihood of selecting each sense identity $z_{ik}$ for $w_i$ can be formulated as a Bernoulli distribution with a sigmoid function $\sigma(\cdot)$:

$$p(z_{ik} \mid \bar{C}_i) = \sigma\big((Q^{en}_{ik})^T \bar{C}_i\big), \tag{2}$$

where $Q^{en}$ is a 3-dimensional tensor whose dimensions index the words in $W^{en}$, the senses $z_{ik}$ of a specific word $i$ in $W^{en}$, and the corresponding latent variable, respectively. Therefore, $Q^{en}_{ik}$ retrieves the latent variable of the k-th sense of the i-th English word.

Finally, we can induce the sense identity $z^*_{ik}$ given the contexts of a word $w_i$ from different languages, $c_i$ and $c'_i$:

$$z^*_{ik} = \arg\max_{z_{ik}} \, p(z_{ik} \mid \bar{C}_i). \tag{3}$$
In order to allow the module to explore other potential sense identities, we apply an ε-greedy algorithm (Mnih et al., 2013) for exploration in the training procedure.
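The induction step described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`average`, `sense_probabilities`, `select_sense`) are hypothetical, vectors are plain lists, and falling back to the monolingual context when no bilingual context is supplied is an assumption made here for convenience.

```python
import math
import random


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors (CBOW)."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sense_probabilities(local_ctx, bilingual_ctx, sense_latents, alpha):
    """Mix the two CBOW contexts with weight alpha, then score every sense.

    local_ctx / bilingual_ctx: lists of word vectors (rows of P_en / P_zh).
    sense_latents: one latent vector per sense of the target word (Q_en[i]).
    """
    mono = average(local_ctx)
    if alpha == 1.0 or not bilingual_ctx:
        # Assumption: with no bilingual signal, use the monolingual context only.
        mixed = mono
    else:
        bi = average(bilingual_ctx)
        mixed = [alpha * a + (1.0 - alpha) * b for a, b in zip(mono, bi)]
    return [sigmoid(sum(q * c for q, c in zip(latent, mixed)))
            for latent in sense_latents]


def select_sense(probs, epsilon=0.05, rng=random):
    """Epsilon-greedy decoding: usually the argmax sense, sometimes a random one."""
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=probs.__getitem__)
```

With a small α the Chinese context dominates the decision, and with α = 1 the function reduces to purely monolingual induction.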

Monolingual Sense Induction Module
This module is the degenerate case of the bilingual sense induction module with α = 1, which occurs when no parallel bilingual signal exists. In other words, every bilingual sense induction module falls back to this case during the training process presented in Algorithm 1. The only difference is that it cannot access the bilingual information. The purpose of this module is to maintain the stability of sense induction and to decode the sampled bilingual sense identity, which will later be used in the bilingual sense representation learning module. As shown in Figure 1, given the monolingual context of a word, this module selects its sense identity using (2) and (3) with α = 1.

Monolingual Sense Representation Learning Module
Given the decoded sense identities from the sense induction module, the skip-gram architecture (Mikolov et al., 2013) is applied, considering that it only requires two decoded sense identities for stochastic training. We first create an input English sense representation matrix $U^{en}$ and an English collocation estimation matrix $V^{en}$ as the learning targets. Given a target word $w_i$ and its collocated word $w_j$ in the t-th English sentence $C^{en}_t$, we map them to their sense identities as $z^*_{ik} = s_i$ and $z^*_{jl} = s_j$ by the sense induction module and maximize the sense collocation likelihood. The skip-gram objective can be formulated as $p(s_j \mid s_i)$:

$$p(s_j \mid s_i) = \frac{\exp\big((U^{en}_{s_i})^T V^{en}_{s_j}\big)}{\sum_{s_k} \exp\big((U^{en}_{s_i})^T V^{en}_{s_k}\big)}, \tag{4}$$

where $s_k$ iterates over all possible English sense identities in the denominator. This formulation shares the same architecture as skip-gram but operates on senses rather than words. Note that the Chinese sense representation learning module is built similarly.

Bilingual Sense Representation Learning Module
To ensure that the sense embeddings of two different languages align well, we hypothesize that the target sense identity $s_i$ predicts not only the sense identity $s_j$ of $w_j$ in $C^{en}_t$ but also one sampled sense identity $s_l$ of $w_l$ from the parallel sentence $C^{zh}_t$, where $s_l$ is decoded by the Chinese monolingual sense induction module. Specifically, the bilingual skip-gram objective can be formulated using the English sense embedding matrix $U^{en}$ and the bilingual collocation estimation matrix $V^{zh}$ as:

$$p(s_l \mid s_i) = \frac{\exp\big((U^{en}_{s_i})^T V^{zh}_{s_l}\big)}{\sum_{s_k} \exp\big((U^{en}_{s_i})^T V^{zh}_{s_k}\big)}, \tag{5}$$

where $s_k$ iterates over all possible Chinese sense identities in the denominator.
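As a rough illustration of the sense-level skip-gram likelihood, the following sketch computes a softmax over all sense identities. The function name `collocation_probability` is hypothetical and the matrices are toy lists of lists; the same function would serve the monolingual case (rows of an English collocation matrix) and the bilingual case (rows of a Chinese collocation matrix), since only the matrix passed in changes.

```python
import math


def collocation_probability(u_target, collocation_matrix, j):
    """Softmax p(s_j | s_i), with u_target = U[s_i] and rows V[s_k]."""
    scores = [sum(a * b for a, b in zip(u_target, v)) for v in collocation_matrix]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    return exps[j] / sum(exps)
```

The denominator touches every sense identity, which is why the full softmax becomes expensive for large sense vocabularies.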

Joint Learning
In this learning framework, the gradient cannot be back-propagated from the representation module to the induction module due to the use of the arg max operator. It is therefore desirable to connect these two modules in such a way that they can improve each other through their own estimations. In one direction, forwarding the prediction of the sense induction module to the sense representation learning module is trivial, while in the other direction, we treat the estimated collocation likelihood as the reward for the induction module.

First note that calculating the partition function in the denominator of (4) and (5) is intractable since it involves a computationally expensive summation over all sense identities. In practice, we adopt the negative sampling technique (Mikolov et al., 2013) and rewrite (4) and (5) as:

$$\log p(s_j \mid s_i) \approx \log \sigma\big((U^{en}_{s_i})^T V^{en}_{s_j}\big) + \sum_{k=1}^{N} \mathbb{E}_{s_k \sim p_{neg}(s)} \Big[\log \sigma\big(-(U^{en}_{s_i})^T V^{en}_{s_k}\big)\Big], \tag{6}$$

$$\log p(s_l \mid s_i) \approx \log \sigma\big((U^{en}_{s_i})^T V^{zh}_{s_l}\big) + \sum_{k=1}^{N} \mathbb{E}_{s'_k \sim p_{neg}(s')} \Big[\log \sigma\big(-(U^{en}_{s_i})^T V^{zh}_{s'_k}\big)\Big], \tag{7}$$

where $p_{neg}(s)$ and $p_{neg}(s')$ are the distributions over all English senses and all Chinese senses for negative samples respectively, and $N$ is the number of negative samples. The rewritten objective for optimizing the two sense representation learning modules is the same as maximizing (6) and (7). Moreover, we can utilize the probability of correctly classifying the skip-gram sense pair as the reward signal. The intuition is that a correctly decoded sense identity is more likely to predict its neighboring sense identity than incorrectly decoded ones.

This learning framework can now be viewed as a reinforcement learning agent solving a one-step Markov Decision Process (Sutton and Barto, 1998; Lee and Chen, 2017). For the bilingual modules, the state, action, and reward correspond to the bilingual context $\bar{C}_i$, the sense $z_{ik}$, and $\sigma\big((U^{en}_{s_i})^T V^{zh}_{s_l}\big)$ respectively. For the monolingual modules, the state, action, and reward correspond to the monolingual context $c_i$, the sense $z_{ik}$, and $\sigma\big((U^{en}_{s_i})^T V^{en}_{s_j}\big)$.
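A minimal sketch of the negative-sampling objective and the reward signal follows. It assumes a uniform negative-sampling distribution, which the text does not specify, and the helper names (`negative_sampling_objective`, `collocation_reward`, `sample_negatives`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def negative_sampling_objective(u_target, v_positive, v_negatives):
    """log sigma(u.v_pos) + sum over sampled negatives of log sigma(-u.v_neg)."""
    objective = math.log(sigmoid(dot(u_target, v_positive)))
    for v_neg in v_negatives:
        objective += math.log(sigmoid(-dot(u_target, v_neg)))
    return objective


def collocation_reward(u_target, v_positive):
    """Probability of classifying the observed sense pair as a true pair."""
    return sigmoid(dot(u_target, v_positive))


def sample_negatives(collocation_matrix, n, exclude, rng=random):
    """Draw n negative rows uniformly (an assumption), skipping the positive row."""
    candidates = [k for k in range(len(collocation_matrix)) if k != exclude]
    return [collocation_matrix[rng.choice(candidates)] for _ in range(n)]
```

The reward lies in (0, 1), so it can be compared directly with the Bernoulli probability produced by the induction module.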
Finally, we can optimize both the bilingual and monolingual sense induction modules ($P$ and $Q$ from (2)) by minimizing the cross entropy loss between the decoded sense probability and the reward. We also include an entropy regularization term as suggested in (Šuster et al., 2016) to let the sense induction module converge faster and make more confident predictions. Formally,

$$\min_{P, Q} \; H\Big(\sigma\big((U^{en}_{s_i})^T V^{zh}_{s_l}\big),\, p(z^*_{ik} \mid \bar{C}_i)\Big) + \lambda E\big(p(z^*_{ik} \mid \bar{C}_i)\big), \tag{8}$$

$$\min_{P, Q} \; H\Big(\sigma\big((U^{en}_{s_i})^T V^{en}_{s_j}\big),\, p(z^*_{ik} \mid c_i)\Big) + \lambda E\big(p(z^*_{ik} \mid c_i)\big), \tag{9}$$

where $H$ is the cross entropy and $E$ is the entropy of the selection probability, weighted by $\lambda$. Note that the major language is switched iteratively between the two languages. Algorithm 1 presents the full learning procedure.

Algorithm 1 (excerpt)
    [...] (2) and (3)
    return $z^*_{ik}$, $p(z^*_{ik} \mid \bar{C}_i)$
end function

function TRAINSRL(maj, bi, $s_i$, $s_j$)
    if maj == bi then
        optimize $U^{maj}$, $V^{maj}$ by (6) given $s_i$, $s_j$
    else
        optimize $U^{maj}$, $V^{bi}$ by (7) given $s_i$, $s_j$
    end if
    return collocation prob of ($s_i$, $s_j$)
end function

function TRAINSI(maj, bi, r, pred)
    if maj == bi then
        optimize $P^{maj}$, $Q^{maj}$ by (9) given r, pred
    else
        optimize $P^{maj}$, $Q^{bi}$ by (8) given r, pred
    end if
end function
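The induction loss can be illustrated with a short sketch. It assumes the binary (Bernoulli) form of both the cross entropy and the entropy term, which seems natural given the sigmoid selection probability but is an assumption here; `induction_loss` is a hypothetical name, and the clipping constant is added only to keep the logarithms finite.

```python
import math


def induction_loss(reward, p_selected, lam=1.0):
    """Cross entropy between reward and selection probability, plus lam * entropy.

    reward: collocation probability returned by the representation module.
    p_selected: probability the induction module assigned to the decoded sense.
    """
    tiny = 1e-12
    p = min(max(p_selected, tiny), 1.0 - tiny)  # avoid log(0)
    cross_entropy = -(reward * math.log(p) + (1.0 - reward) * math.log(1.0 - p))
    entropy = -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return cross_entropy + lam * entropy
```

Under this form, a high reward pulls the selection probability toward 1, and the entropy term further penalizes indecisive probabilities near 0.5.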

New Dataset-Bilingual Contextual Word Similarity (BCWS)
We propose a new dataset to measure the bilingual contextual word similarity. English and Chinese are chosen as our language pair for three reasons:
1. They are among the most widely used languages in the world.
2. English and Chinese belong to completely different language families, making it interesting to explore the syntactic and semantic differences between them.
3. Chinese is a language that requires segmentation; this dataset can also help researchers experiment on different segmentation levels and investigate how segmentation affects the sense similarity.

Table 1:
English Sentence | Chinese Sentence | Score
Judges must give both sides an equal opportunity to <state> their cases. | 我非常喜歡這個故事，它<告訴>我們一些 [...] | 7.00
[...] | [...] (The owner of the fruit stall seemed surprised that someone bought this <unpopular> product, talking me few words about "you are such a pro".) | [...]

This dataset also provides a direct measure to determine whether the two language embeddings align well in the vector space. Note that we focus on the word level, which differs from (Klementiev et al., 2012), who also measured cross-lingual embedding similarity but relied on the more ambiguous document-level classification.
Our dataset contains 2093 question pairs, where each pair consists of exactly one English and one Chinese sentence; note that they are not parallel but come with their own sentential contexts, as shown in Table 1. Eleven raters (see footnote 2) were recruited to annotate this dataset. Each rater gives a score ranging from 1.0 (different) to 10.0 (same) for each question to indicate the semantic similarity of bilingual word pairs based on sentential clues. The annotated dataset shows very high inter-rater consistency; we leave one rater out and calculate the Spearman correlation between that rater and the average of the rest, and the average correlation is about 0.83, indicating the human-level performance (the corresponding number in SCWS is 0.52).
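The leave-one-out agreement measure described above can be sketched as follows. This is a plain-Python illustration with hypothetical function names; it assumes every rater scored every question and that ties receive average ranks.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def leave_one_out_agreement(scores):
    """scores[r][q] is rater r's score for question q.

    For each rater, correlate their scores with the mean of all other raters,
    then average the correlations over raters.
    """
    n_raters, n_questions = len(scores), len(scores[0])
    correlations = []
    for r in range(n_raters):
        others = [
            sum(scores[o][q] for o in range(n_raters) if o != r) / (n_raters - 1)
            for q in range(n_questions)
        ]
        correlations.append(spearman(scores[r], others))
    return sum(correlations) / n_raters
```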
We describe the construction of BCWS below.

Chinese Multi-Sense Word Extraction
We utilize the Chinese Wikipedia dump to extract the 10000 most frequent Chinese words that are nouns, adjectives, or verbs based on Chinese WordNet (Huang et al., 2010). In order to test the sense-level representations, we discard single-sense words to ensure that the selected words are polysemous. Words with more than 20 senses are also deleted, since those senses are too fine-grained and hard even for humans to disambiguate. We denote the list of Chinese words $l_c$.

Footnote 2: They are all native Chinese speakers whose scores are at least 29 in the TOEFL reading section or 157 in the GRE verbal section.
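The filtering steps above can be sketched as below. The input format (frequency-sorted triples of word, part of speech, and sense count) and the function name `select_candidates` are assumptions made for illustration, as is the order of operations: the part-of-speech filter is applied before taking the top-N words, and the sense-count filter afterwards.

```python
def select_candidates(entries, top_n=10000, max_senses=20,
                      allowed_pos=("noun", "adjective", "verb")):
    """entries: (word, pos, n_senses) triples sorted by descending frequency."""
    frequent = [e for e in entries if e[1] in allowed_pos][:top_n]
    # Keep polysemous words only, and drop overly fine-grained ones.
    return [word for word, _, n_senses in frequent if 2 <= n_senses <= max_senses]
```

For example, with hypothetical sense counts, `select_candidates([("制服", "noun", 2), ("水", "noun", 1)])` keeps only the polysemous word "制服".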

English Candidate Word Extraction
We have to find an English counterpart for each Chinese word in $l_c$. We utilize BabelNet (Navigli and Ponzetto, 2010), a free and open-source knowledge resource, to serve as our bilingual dictionary. To be more concrete, we first query the selected Chinese word using the free API call provided by BabelNet to retrieve all WordNet senses. For example, the Chinese word "制服" has two major meanings:
• a type of clothing worn by members of an organization
• force to submit or subdue.

Hence, we can obtain two candidate English words, "uniform" and "subjugate". Each word in $l_c$ retrieves its associated English candidate words in this way, which yields the dictionary $D$.
Enriching Semantic Relationship Note that $D$ is merely a simple translation mapping between Chinese and English words. It is desirable to have a more complicated and interesting relationship between bilingual word pairs. Hence, we traverse $D$, and for each English word we find its hyponyms, hypernyms, holonyms and attributes, and add the additional words into $D$. In our example, we may obtain {制服: [uniform, subjugate, livery, clothing, repress, dominate, enslave, dragoon...]}. We sample 2 English words if the number of English candidate words is more than 5, 3 English words if it is more than 10, and 1 English word otherwise to form the final bilingual pairs. For example, a bilingual word pair (制服, enslave) can be formed accordingly. After this step, we obtain 2093 bilingual word pairs $P$.
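The sampling rule above can be written down directly. The sketch below uses hypothetical function names and assumes sampling without replacement, which the text does not state explicitly.

```python
import random


def num_pairs_to_sample(num_candidates):
    """3 words if more than 10 candidates, 2 if more than 5, otherwise 1."""
    if num_candidates > 10:
        return 3
    if num_candidates > 5:
        return 2
    return 1


def sample_pairs(chinese_word, candidates, rng=random):
    """Form bilingual pairs by sampling English candidates without replacement."""
    k = min(num_pairs_to_sample(len(candidates)), len(candidates))
    return [(chinese_word, english) for english in rng.sample(candidates, k)]
```

For the "制服" example with its eight listed candidates, this rule would yield two bilingual pairs.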
Adding Contextual Information Given the bilingual word pairs $P$, appropriate contexts should be found in order to form the full sentences for human judgment. For each Chinese word, we randomly sample one example sentence in Chinese WordNet that matches the PoS tag we selected in section 4. For each English word, we traverse the whole English Wikipedia dump to find the sentences that contain the target English word. We then sample one sentence where the target word is tagged with the matching PoS tag.

Experimental Setup
Two sets of parallel data are used in the experiments, one for English-Chinese (EN-ZH) and another for English-German (EN-DE).

Hyperparameter Settings
In our experiments, we use a mini-batch size of 512; the context window size for the major language is set to m = 5, and we sample M = 20 words for the bilingual context. For the exploration of the sense induction module, we set ε = 0.05. The λ of entropy regularization is set to 1. For negative sampling in (6) and (7), we pick N = 25. The fixed learning rate is set to 0.025. The embedding dimension is 300, and the number of senses per word is set to 3 for Chinese, German, and English alike ($|Z_i| = 3$). This setting is for a fair comparison with prior work.

[...] better than the baselines that learn cross-lingual word embeddings. This indicates that the sense-level information is critical for precise vector representations. In addition, all results for AvgSimC and MaxSimC are the same in the proposed model, showing that the learned selection distribution is reliable for sense decoding.

Monolingual Embedding Evaluation
Because our model considers multiple languages and learns the embeddings jointly, the multilingual objective makes learning more difficult due to additional noise. In order to ensure the quality of the monolingual sense embeddings, we also evaluate our learned English sense embeddings on the benchmark SCWS data. Comparing the results between training on EN-ZH and training on EN-DE, all results using EN-ZH are better than those using EN-DE. The probable reason is that the language difference between English and Chinese is larger than that between English and German; parallel Chinese sentences therefore provide informative cues for learning better sense embeddings. Furthermore, our proposed model achieves performance comparable or superior to the current state-of-the-art monolingual sense embeddings proposed by Lee and Chen (2017) when trained on our monolingual data.

Sensitivity of Bilingual Contexts
To investigate how much the bilingual sense induction module relies on another language, the re[...]

To justify the usefulness of the bilingual signal, we compare our model with Lee and Chen (2017), which used a monolingual signal in a similar modular framework. Our method outperforms theirs in terms of MaxSimC on both EN-ZH and EN-DE. However, this trend is not observed on AvgSimC. The reason may be that the bilingual signal is indicative but noisy, which largely affects AvgSimC due to its weighted sum operation. MaxSimC only picks the most probable senses, which makes it robust to noise.
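To make the contrast between the two metrics concrete, the sketch below follows the standard definitions of AvgSimC and MaxSimC from the multi-prototype literature (Reisinger and Mooney, 2010), which this text does not spell out. The function names are hypothetical, and the sense weights are assumed to be normalized so that they sum to 1 for each word.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def avg_sim_c(probs_a, senses_a, probs_b, senses_b):
    """Similarity of every sense pair, weighted by both selection probabilities."""
    return sum(pa * pb * cosine(sa, sb)
               for pa, sa in zip(probs_a, senses_a)
               for pb, sb in zip(probs_b, senses_b))


def max_sim_c(probs_a, senses_a, probs_b, senses_b):
    """Similarity between the single most probable sense of each word."""
    ia = max(range(len(probs_a)), key=probs_a.__getitem__)
    ib = max(range(len(probs_b)), key=probs_b.__getitem__)
    return cosine(senses_a[ia], senses_b[ib])
```

When the selection distributions are one-hot the two metrics coincide, while any probability mass on a wrong sense leaks into AvgSimC but leaves MaxSimC untouched.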
In addition, our performance slightly degrades as α increases for EN-DE, and the best performance is obtained when α is small, indicating that

Extrinsic Evaluation
We further evaluate our bilingual sense embeddings using a downstream task, cross-lingual document classification (CLDC), with a standard setup (Klementiev et al., 2012). To be more concrete, a set of labeled documents in language A is available to train a classifier, and we are interested in classifying documents in another language B at test time, which tests semantic transfer of information across different languages. We use the averaged sense embeddings as word embeddings for a fair comparison. The result is shown in Table 3. We can see that our proposed model achieves performance comparable or even superior to most prior work in the DE2EN direction; however, the same conclusion does not hold for the EN2DE direction. The reason may be that we test the model that works best on BCWS and hence are not able to tune hyperparameters on the development set of CLDC. In addition, we use the average of sense vectors as input word embeddings, which may introduce some noise into the resulting vectors. In sum, the comparable performance on the downstream task shows the practical usage and the potential extension of the proposed model.
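Collapsing sense embeddings into a single word embedding, as used for the CLDC input, amounts to a simple unweighted mean. The sketch below uses a hypothetical function name and toy vectors.

```python
def word_vector_from_senses(sense_vectors):
    """Unweighted mean of a word's sense vectors, used as its word embedding."""
    dim = len(sense_vectors[0])
    return [sum(v[d] for v in sense_vectors) / len(sense_vectors)
            for d in range(dim)]
```

Because every sense contributes equally regardless of how often it occurs, rare senses can pull the averaged vector away from the dominant meaning, which is one way the averaging may add noise.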

Qualitative Analysis
Some examples of our learned sense embeddings are shown in Table 4. The first sense of "Apple" is clearly related to fruit and things to eat, while the second one refers to the tech company Apple Inc. Most English and Chinese nearest neighbors match the meanings of the induced senses, but there is still some noise, which is underlined. For example, "cake" should be a neighbor of the first sense rather than the second one. The same observation applies to "iphone" and "spring". In our second example, "uniform", the first sense is related to outfits and clothes, while the second is related to engineering terms. However, "even" appears in the outfit and clothes sense, which is incorrect. The reason may be that the parallel corpus is not large enough for the model to accurately distinguish all senses via unsupervised learning. Hence, utilizing external resources such as bilingual dictionaries, or designing a new model that can use existing large monolingual corpora like Wikipedia, is left as future work.

Conclusion
This paper presents the first purely sense-level cross-lingual representation learning model with efficient sense induction, where several monolingual and bilingual modules are jointly optimized. The proposed model achieves superior performance on both bilingual and monolingual evaluation datasets. A newly collected dataset for evaluating bilingual contextual word similarity is presented, which provides potential research directions for future work.