Learning to Represent Bilingual Dictionaries

Bilingual word embeddings have been widely used to capture the similarity of lexical semantics in different human languages. However, many applications, such as cross-lingual semantic search and question answering, can benefit greatly from cross-lingual correspondence between sentences and lexicons. To bridge this gap, we propose a neural embedding model that leverages bilingual dictionaries. The proposed model is trained to map literal word definitions to cross-lingual target words, for which we explore different sentence encoding techniques. To enhance the learning process on limited resources, our model adopts several critical learning strategies, including multi-task learning on different bridges of languages, and joint learning of the dictionary model with a bilingual word embedding model. Experimental evaluation focuses on two applications. The results of the cross-lingual reverse dictionary retrieval task show our model's promising ability to comprehend bilingual concepts based on descriptions, and highlight the effectiveness of the proposed learning strategies in improving performance. Meanwhile, our model effectively addresses the bilingual paraphrase identification problem and significantly outperforms previous approaches.


Introduction
Cross-lingual semantic representation learning has attracted significant attention recently. Various approaches have been proposed to align words of different languages in a shared embedding space (Ruder et al., 2017). By offering task-invariant semantic transfers, these approaches critically support many cross-lingual NLP tasks including neural machine translation (NMT) (Devlin et al., 2014), bilingual document classification (Zhou et al., 2016), knowledge alignment (Chen et al., 2018b) and entity linking (Upadhyay et al., 2018).
While many existing approaches have been proposed to associate lexical semantics between languages (Chandar et al., 2014; Gouws et al., 2015; Luong et al., 2015a), modeling the correspondence between lexical and sentential semantics across different languages is still an unresolved challenge. We argue that learning to represent such cross-lingual and multi-granular correspondence is desirable and natural for multiple reasons. One reason is that learning word-to-word correspondence has a natural limitation, considering that many words do not have direct translations in another language. For example, schadenfreude in German, which means a feeling of joy that comes from knowing the troubles of other people, has no proper English counterpart word. To appropriately learn the representations of such words in bilingual embeddings, we need to capture their meanings based on their definitions.
Besides, modeling such correspondence is also highly beneficial to many application scenarios. One example is cross-lingual semantic search of concepts (Hill et al., 2016), where the lexemes or concepts are retrieved based on sentential descriptions (see Fig. 1). Others include discourse relation detection in bilingual dialogue utterances (Jiang et al., 2018), multilingual text summarization (Nenkova et al., 2012), and educational applications for foreign language learners. Finally, it is natural in foreign language learning that a human learns foreign words by looking up their meanings in the native language (Hulstijn et al., 1996). Therefore, learning such correspondence essentially mimics human learning behaviors.
Figure 1: An example illustrating the two cross-lingual tasks, pairing the English definition "A male descendant in relation to his parents." with the French definition "Tout être humain du sexe masculin considéré par rapport à son père et à sa mère, ou à un des deux seulement.", both of which describe the English word son. The cross-lingual reverse dictionary retrieval finds cross-lingual target words based on descriptions. In terms of cross-lingual paraphrases, the French sentence (which means any male being considered in relation to his father and mother, or only one of them) describes the same meaning as the English sentence, but has much more content details.
However, realizing such a representation learning model is a non-trivial task, inasmuch as it requires a comprehensive learning process to effectively compose the semantics of arbitrary-length sentences in one language, and associate that with single words in another language. Consequently, this objective also demands high-quality cross-lingual alignment that bridges between single words and sequences of words. Such alignment information is generally not available in the parallel corpora and seed lexicons that are utilized by bilingual word embeddings (Ruder et al., 2017).
To incorporate the representations of bilingual lexical and sentential semantics, we propose an approach to capture the mapping from definitions to the corresponding foreign words by leveraging bilingual dictionaries. The proposed model BilDRL (Bilingual Dictionary Representation Learning) first constructs a word embedding space with pre-trained bilingual word embeddings. Based on cross-lingual word definitions, a sentence encoder is trained to realize the mapping from literal descriptions to target words in the bilingual word embedding space, for which we investigate multiple encoding techniques. To enhance cross-lingual learning on limited resources, BilDRL conducts multi-task learning on both directions of a language pair. Moreover, BilDRL enforces a joint learning strategy of bilingual word embeddings and the sentence encoder, which seeks to gradually adjust the embedding space to better suit the representation of cross-lingual word definitions.
To show the applicability of BilDRL, we conduct experiments on two useful cross-lingual tasks (see Fig. 1). (i) Cross-lingual reverse dictionary retrieval seeks to retrieve words or concepts given descriptions in another language. This task helps users find foreign words based on notions or descriptions, and is especially beneficial to users such as translators, foreign language learners and technical writers using non-native languages. We show that BilDRL achieves promising results on this task, while bilingual multi-task learning and joint learning dramatically enhance the performance. (ii) Bilingual paraphrase identification asks whether two sentences in different languages essentially express the same meaning, which is critical to question answering or dialogue systems that apprehend multilingual utterances (Bannard and Callison-Burch, 2005). This task is challenging, as it requires a model to comprehend cross-lingual paraphrases that are inconsistent in grammar, content details and word orders. BilDRL maps sentences to the lexicon embedding space. This process reduces the problem to evaluating the similarity of lexicon embeddings, which can be easily solved by a simple classifier. BilDRL performs well with even a small amount of data, and significantly outperforms previous approaches.

Related Work
We discuss two lines of relevant work. Bilingual word embeddings. Various approaches have been proposed for training bilingual word embeddings. These approaches fall into two families: off-line mappings and joint training.
The off-line mapping based approach fixes the structures of pre-trained monolingual embeddings, and induces bilingual projections based on seed lexicons (Mikolov et al., 2013a). Some variants of this approach improve the quality of projections by adding constraints such as orthogonality of transforms, normalization and mean centering of embeddings (Xing et al., 2015; Artetxe et al., 2016; Vulić et al., 2016). Others adopt canonical correlation analysis to map separate monolingual embeddings to a shared embedding space (Faruqui and Dyer, 2014; Doval et al., 2018).
Unlike off-line mappings, joint training models simultaneously update word embeddings and cross-lingual alignment. In doing so, such approaches generally capture more precise cross-lingual semantic transfer (Ruder et al., 2017; Upadhyay et al., 2018). While a few such models still maintain separate embedding spaces for each language (Artetxe et al., 2017), more of them maintain a unified space for both languages. The cross-lingual semantic transfer by these models is captured from parallel corpora with sentential or document-level alignment, using techniques such as bilingual bag-of-words distances (BilBOWA) (Gouws et al., 2015), Skip-Gram (Coulmance et al., 2015) and sparse tensor factorization (Vyas and Carpuat, 2016). Neural sentence modeling. Neural sentence models seek to capture phrasal or sentential semantics from word sequences. They often adopt encoding techniques such as recurrent neural encoders (RNN) (Kiros et al., 2015), convolutional encoders (CNN) (Chen et al., 2018a), and attentive encoders to represent the composed semantics of a sentence as an embedding vector. Recent works have focused on apprehending pairwise correspondence of sentential semantics by adopting multiple neural sentence models in one learning architecture, including Siamese models for detecting discourse relations of sentences (Sha et al., 2016), and sequence-to-sequence models for tasks like style transfer (Shen et al., 2017), text summarization (Chopra et al., 2016) and translation (Wu et al., 2016).
On the other hand, fewer efforts have been devoted to characterizing the associations between sentential and lexical semantics. Hill et al. (2016) and Ji et al. (2017) learn off-line mappings between monolingual descriptions and lexemes to capture such associations. Eisner et al. (2016) adopt a similar approach to capture emojis based on descriptions. To the best of our knowledge, there has been no previous approach that learns to capture the correspondence of sentential and lexical semantics in a multilingual scenario. This is exactly the focus of our work, in which the proposed strategies of multi-task learning and joint learning are critical to the corresponding learning process under limited resources. Utilizing such correspondence, our approach also sheds light on addressing discourse relation detection in a multilingual scenario.

Modeling Bilingual Dictionaries
We hereby begin our modeling with the formalization of bilingual dictionaries. We use L to denote the set of languages. For a language l ∈ L, V_l denotes its vocabulary, where for each word w ∈ V_l, bold-faced w ∈ R^k denotes its embedding vector. An l_i-l_j bilingual dictionary D(l_i, l_j) (or simply D_ij) contains dictionary entries (w_i, S_j), in which w_i ∈ V_{l_i} is a target word, and S_j is a cross-lingual definition that describes the word w_i with a sequence of words in language l_j. For example, a French-English dictionary D(Fr, En) could include a French word appétit accompanied by its English definition desire for, or relish of food or drink. Note that, for a word w_i, multiple definitions in l_j may coexist.
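As a minimal illustration of this formalization, a dictionary D(l_i, l_j) can be held as a map from each target word to its (possibly multiple) cross-lingual definitions. The helper name and data layout below are our own illustration, not the paper's implementation:

```python
def make_dictionary(entries):
    """Build D(l_i, l_j): map each target word w_i in language l_i to the
    list of its definitions, each a word sequence in language l_j."""
    dictionary = {}
    for target_word, definition in entries:
        dictionary.setdefault(target_word, []).append(definition.split())
    return dictionary

# A French-English dictionary D(Fr, En) with the entry from the text;
# a single target word may accumulate multiple cross-lingual definitions.
d_fr_en = make_dictionary([
    ("appétit", "desire for, or relish of food or drink"),
])
```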
BilDRL is constructed and improved through three stages, as depicted in Fig. 2. A sentence encoder is first used to learn from a bilingual dictionary the association between words and definitions. Then in a pre-trained bilingual word embedding space, multi-task learning is conducted on both directions of a language pair. Lastly, joint learning with word embeddings is enforced to simultaneously adjust the embedding space during the training of the dictionary model, which further enhances the cross-lingual learning process.
It is noteworthy that NMT (Wu et al., 2016) may be considered an ostensibly relevant method to ours. However, NMT does not apply to our problem setting because it has major differences from our work in the following perspectives: (i) In terms of data modalities, NMT has to bridge between corpora of the same granularity, i.e. either between sentences or between lexicons. This is unlike BilDRL, which captures multi-granular correspondence of semantics across different modalities, i.e. sentences and words; (ii) As for learning strategies, NMT relies on an encoder-decoder architecture using end-to-end training (Luong et al., 2015b), while BilDRL employs joint learning of a dictionary-based sentence encoder and a bilingual embedding space.

Encoders for Lexical Definitions
BilDRL models a dictionary using a neural sentence encoder E(S), which composes the meaning of a sentence into a latent vector representation. We hereby introduce this model component, which is designed to be a GRU encoder with self-attention. Besides that, we also investigate other widely-used neural sequence encoders.

Attentive GRU Encoder
The GRU encoder is a computationally efficient alternative of the LSTM (Cho et al., 2014). Each unit consists of a reset gate r_t and an update gate z_t to track the state of the sequence. Given the vector representation w_t of an incoming item w_t, GRU updates the hidden state h_t^(1) as a linear combination of the previous state h_{t-1}^(1) and the candidate state h̃_t^(1):

h_t^(1) = z_t ∘ h̃_t^(1) + (1 − z_t) ∘ h_{t-1}^(1)

The update gate z_t balances between the information of the previous sequence and the new item, where M_z and N_z are two weight matrices, b_z is a bias vector, and σ is the sigmoid function:

z_t = σ(M_z w_t + N_z h_{t-1}^(1) + b_z)

The candidate state h̃_t^(1) is calculated similarly to that in a traditional recurrent unit, as below. The reset gate r_t thereof controls how much information of the past sequence should contribute to the candidate state:

h̃_t^(1) = tanh(M_h w_t + r_t ∘ (N_h h_{t-1}^(1)) + b_h)
r_t = σ(M_r w_t + N_r h_{t-1}^(1) + b_r)

While a GRU encoder can stack multiple of the above GRU layers, without an attention mechanism, the last state h_{|S|}^(1) of the last layer represents the overall meaning of the encoded sentence S. The self-attention mechanism (Conneau et al., 2017) seeks to highlight the important units in an input sentence when capturing its overall meaning, which is calculated as below:

u_t = tanh(M_a h_t^(1) + b_a)
a_t = exp(u_t^⊤ u_S) / Σ_{t'=1}^{|S|} exp(u_{t'}^⊤ u_S)
h_t^(2) = |S| · a_t · u_t

Here u_S is a trainable context vector that can be seen as a high-level representation of the input sequence. By measuring the similarity of each u_t with u_S, the normalized attention weight a_t, which highlights an input that contributes significantly to the overall meaning, is produced through a softmax. Note that the scalar |S| is multiplied along with a_t to u_t, so as to keep the weighted representations h_t^(2) from losing the magnitude of the inputs.

Figure 2: Joint learning architecture of BilDRL. The same color indicates shared parameters.
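The self-attentive pooling step of this encoder can be sketched in plain Python as follows. The weight matrix M_a, bias b_a, and context vector u_S are toy stand-ins for trained parameters, and the hidden states would come from the GRU layers:

```python
import math

def attentive_pooling(hidden_states, M_a, b_a, u_S):
    """Self-attentive pooling over GRU hidden states: project each state
    through tanh, score it against the context vector u_S, normalize the
    scores with a softmax, and rescale by |S| to preserve magnitude."""
    # u_t = tanh(M_a h_t + b_a) for each position t
    us = []
    for h in hidden_states:
        u = [math.tanh(sum(M_a[i][j] * h[j] for j in range(len(h))) + b_a[i])
             for i in range(len(b_a))]
        us.append(u)
    # similarity of each u_t with the context vector u_S, then softmax
    scores = [sum(u[i] * u_S[i] for i in range(len(u_S))) for u in us]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # weighted units, rescaled by |S| so the representation keeps magnitude
    S = len(hidden_states)
    return [[S * a * ui for ui in u] for a, u in zip(weights, us)]

# toy 2-step sequence with 2-dimensional hidden states; the identity
# projection and all-ones context vector are illustrative assumptions
h2 = attentive_pooling(
    hidden_states=[[0.1, 0.2], [0.3, -0.1]],
    M_a=[[1.0, 0.0], [0.0, 1.0]],
    b_a=[0.0, 0.0],
    u_S=[1.0, 1.0],
)
```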

Other Encoders
We also experiment with other widely used neural sentence modeling techniques, which are however outperformed by the attentive GRU in our tasks. These techniques include the vanilla GRU, CNN (Kalchbrenner et al., 2014), and linear bag-of-words (BOW) (Hill et al., 2016). We briefly introduce the latter two techniques in the following. A convolutional encoder applies a kernel M_c ∈ R^{h×k} to sliding h-grams of the input, such that a sequence of latent vectors H = [h_1^(1), ..., h_{|S|-h+1}^(1)] is produced, where each latent vector leverages the significant local semantic features from each h-gram. Following convention, we apply dynamic max-pooling to extract robust features from the convolution outputs, and use the mean-pooling results of the last layer to represent the sentential semantics. The linear bag-of-words (BOW) encoder (Ji et al., 2017; Hill et al., 2016) is realized by the sum of projected word embeddings of the input sentence, i.e. E(S) = Σ_{t=1}^{|S|} M_b w_t.
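For instance, the linear BOW encoder E(S) = Σ_t M_b w_t reduces to a projected sum, as in this sketch (the identity projection and the toy vectors are assumptions for illustration, not trained parameters):

```python
def bow_encode(word_vectors, M_b):
    """Linear bag-of-words encoder: sum of projected word embeddings,
    E(S) = sum over t of M_b w_t."""
    dim_out = len(M_b)
    out = [0.0] * dim_out
    for w in word_vectors:
        for i in range(dim_out):
            out[i] += sum(M_b[i][j] * w[j] for j in range(len(w)))
    return out

# two 2-d word vectors with an identity projection
enc = bow_encode([[1.0, 2.0], [3.0, 4.0]], M_b=[[1.0, 0.0], [0.0, 1.0]])
# → [4.0, 6.0]
```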

Basic Learning Objective
The objective of learning the dictionary model is to map the encodings of cross-lingual word definitions to the target word embeddings. This is realized by minimizing the following L_2 loss, in which E_ij is the dictionary model that maps from descriptions in l_j to words in l_i:

L_{D_ij} = (1/|D_ij|) Σ_{(w_i, S_j) ∈ D_ij} ||E_ij(S_j) − w_i||_2^2

The above defines the basic model variants of BilDRL that learn on a single dictionary. For word representations in the learning process, BilDRL initializes the embedding space using pre-trained word embeddings. Note that, without adopting the joint learning strategy in Section 3.4, the learning process does not update the word embeddings that are used to represent the definitions and target words. While other forms of loss such as cosine proximity (Hill et al., 2016) and hinge loss (Ji et al., 2017) may also be used in the learning process, we find that the L_2 loss consistently leads to better performance in our experiments.
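A sketch of this basic objective, assuming the definition encodings and target embeddings have already been computed (the mean over a batch is our illustrative choice of reduction, and the numeric values are toys):

```python
def basic_objective(batch):
    """Mean squared L2 distance between each definition encoding
    E_ij(S_j) and its target word embedding w_i. `batch` is a list of
    (encoding, target_embedding) pairs."""
    total = 0.0
    for enc, w in batch:
        total += sum((e - t) ** 2 for e, t in zip(enc, w))
    return total / len(batch)

loss = basic_objective([([0.0, 0.0], [1.0, 1.0]),
                        ([1.0, 1.0], [1.0, 1.0])])
# first pair contributes 2.0, second 0.0 → mean 1.0
```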

Bilingual Multi-task Learning
In cases where entries in a bilingual dictionary are not amply provided, learning the above dictionary model on one ordered language pair may fall short due to insufficient alignment information. One practical solution is to conduct a bilingual multi-task learning process. In detail, given a language pair (l_i, l_j), we learn the dictionary model E_ij on both dictionaries D_ij and D_ji with shared parameters. Correspondingly, we rewrite the previous learning objective function as below, in which D = D_ij ∪ D_ji:

L_D = (1/|D|) Σ_{(w, S) ∈ D} ||E_ij(S) − w||_2^2
This strategy non-trivially requires the same dictionary model to represent semantic transfer in both directions of the language pair. To fulfill this requirement, we initialize the embedding space using the BilBOWA embeddings (Gouws et al., 2015), which provide a unified embedding space that resolves both monolingual and cross-lingual semantic relatedness of words. In practice, we find this simple multi-task strategy to bring significant improvement to our cross-lingual tasks. Note that, besides BilBOWA, other joint-training bilingual embeddings in a unified space (Doval et al., 2018) can also support this strategy, for which we leave the comparison to future work.
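The batch construction for this multi-task strategy can be sketched as follows; the entry contents and direction tags are illustrative placeholders, not the paper's data format:

```python
import random

def multitask_batches(d_ij, d_ji, batch_size, seed=0):
    """Merge entries of both dictionary directions (D = D_ij ∪ D_ji) into
    one shared training pool; each shuffled batch may mix both directions,
    so the same encoder parameters serve semantic transfer both ways."""
    pool = [("ij", e) for e in d_ij] + [("ji", e) for e in d_ji]
    rng = random.Random(seed)
    rng.shuffle(pool)
    return [pool[k:k + batch_size] for k in range(0, len(pool), batch_size)]

batches = multitask_batches(d_ij=["e1", "e2"], d_ji=["f1", "f2"], batch_size=2)
```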

Joint Learning Objective
While the above learning strategies are based on a fixed embedding space, we lastly propose a joint learning strategy. During the training process, this strategy simultaneously updates the embedding space based on both the dictionary model and the bilingual word embedding model. The learning is through asynchronous minimization of the following joint objective function:

J = L_D + λ_1 (L_i^SG + L_j^SG) + λ_2 Ω_ij^A

where λ_1 and λ_2 are two positive hyperparameters. L_i^SG and L_j^SG are the original Skip-Gram losses (Mikolov et al., 2013b) to separately obtain word embeddings on monolingual corpora of l_i and l_j. Ω_ij^A, defined as below, is the alignment loss that minimizes the bag-of-words distances for aligned sentence pairs (S_i, S_j) in parallel corpora C_ij:

Ω_ij^A = Σ_{(S_i, S_j) ∈ C_ij} || (1/|S_i|) Σ_{w ∈ S_i} w − (1/|S_j|) Σ_{w' ∈ S_j} w' ||_2^2
The joint learning process adapts the embedding space to better suit the dictionary model, which is shown to further enhance the cross-lingual learning of BilDRL.
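Numerically, combining the components of the joint objective is a simple weighted sum. The sketch below assumes the component losses have already been computed on their respective corpora, and uses λ values of 0.1 as in our experimental setting; the numeric loss values are toys:

```python
def joint_objective(l_dict, l_sg_i, l_sg_j, omega_align, lam1=0.1, lam2=0.1):
    """Joint loss J = L_D + λ1 (L_i^SG + L_j^SG) + λ2 Ω_ij^A, combining the
    dictionary loss, two monolingual Skip-Gram losses, and the alignment
    loss on parallel sentences."""
    return l_dict + lam1 * (l_sg_i + l_sg_j) + lam2 * omega_align

j = joint_objective(1.0, 2.0, 3.0, 4.0)
# 1.0 + 0.1 * 5.0 + 0.1 * 4.0 = 1.9
```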

Training
To initialize the embedding space, we pre-train BilBOWA on the parallel corpora of Europarl v7 (Koehn, 2005) and the monolingual corpora of the tokenized Wikipedia dump (Al-Rfou et al., 2013). For models without joint learning, we use AMSGrad (Reddi et al., 2018) to optimize the parameters. Each model without bilingual multi-task learning is trained on batched samples from its individual dictionary, whereas multi-task learning models are trained on batched samples from two dictionaries. Within each batch, entries of different directions of languages can be mixed together. For joint learning, we conduct an efficient multi-threaded asynchronous training (Mnih et al., 2016) with AMSGrad. In detail, after initializing the embedding space based on pre-trained BilBOWA, parameter updating based on the four components of J occurs across four worker threads. Two monolingual threads select batches of monolingual contexts from the Wikipedia dumps of the two languages for Skip-Gram, one alignment thread randomly samples parallel sentences from Europarl, and one dictionary thread extracts samples of entries for a bilingual multi-task dictionary model. Each thread makes a batched update to the model parameters asynchronously for each term of J. The asynchronous training of all threads continues until the dictionary thread finishes its epochs.
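A minimal sketch of this asynchronous scheme, with placeholder updates standing in for the AMSGrad steps of each worker; the thread counts follow the description above, while the step count and shared-state layout are illustrative:

```python
import threading

# Four workers (two monolingual, one alignment, one dictionary) each apply
# batched updates to shared parameters; training stops when the dictionary
# worker finishes its steps, as in the scheme described above.
params = {"updates": 0}
lock = threading.Lock()
stop = threading.Event()

def worker(n_steps=None):
    steps = 0
    while not stop.is_set():
        with lock:
            params["updates"] += 1   # placeholder for one batched update
        steps += 1
        if n_steps is not None and steps >= n_steps:
            stop.set()               # the dictionary thread ends training

threads = [threading.Thread(target=worker) for _ in range(3)]  # mono x2, align
dict_thread = threading.Thread(target=worker, kwargs={"n_steps": 100})
for t in threads + [dict_thread]:
    t.start()
for t in threads + [dict_thread]:
    t.join()
```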

Experiments
We present experiments on two multilingual tasks: the cross-lingual reverse dictionary retrieval task and the bilingual paraphrase identification task.

Datasets
The experiment of cross-lingual reverse dictionary retrieval is conducted on a trilingual dataset Wikt3l. This dataset is extracted from Wiktionary, which is one of the largest freely available multilingual dictionary resources on the Web. Wikt3l contains dictionary entries of the language pairs (English, French) and (English, Spanish), which form En-Fr, Fr-En, En-Es and Es-En dictionaries on four bridges of languages in total. Two types of cross-lingual definitions are extracted from Wiktionary: (i) cross-lingual definitions provided under the Translations sections of Wiktionary pages; (ii) monolingual definitions for words that are linked to a cross-lingual counterpart with an inter-language link of Wiktionary. We exclude all the definitions of stop words in constructing the dataset, and list the statistics in Table 1.
Table 1: Statistics of the Wikt3l dataset.
Dictionary | En-Fr | Fr-En | En-Es | Es-En
#Target words | 15,666 | 16,857 | 8,004 | 16,986
#Definitions | 50,412 | 58,808 | 20,930 | 56,610

Since existing datasets for paraphrase identification are merely monolingual, we contribute another dataset WBP3l for cross-lingual sentential paraphrase identification. This dataset contains 6,000 bilingual sentence pairs respectively for the En-Fr and En-Es settings. Within each bilingual setting, positive cases are formed as pairs of descriptions aligned by inter-language links, which exclude the word descriptions in Wikt3l. To generate a negative case, we retrieve the neighboring words of the source word in the embedding space, and use ConceptNet (Speer et al., 2017) to filter out the synonyms of the source word, so as to prevent generating false negative cases. Then we randomly pick one word from the filtered neighbors and pair its cross-lingual definition with the English definition of the source word to create the negative case. This process ensures that each negative case is endowed with limited dissimilarity of sentence meanings, which makes the decision more challenging. For each language setting, we randomly select 70% of the pairs for training, 5% for validation, and the remaining 25% for testing. Note that each language setting of this dataset matches the quantity and partitioning of sentence pairs in the widely-used Microsoft Research Paraphrase Corpus benchmark for monolingual paraphrase identification (Yin et al., 2016; Das and Smith, 2009). Several examples from the dataset are shown in Table 2. The datasets and the processing scripts are available at https://github.com/muhaochen/bilingual_dictionaries.
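The negative-case generation can be sketched as below. The kNN and ConceptNet lookups are replaced by toy inputs, and the helper name is our own illustration:

```python
import random

def make_negative_case(source_word, en_definition, neighbors, synonyms,
                       cross_defs, seed=0):
    """Filter the synonyms of the source word out of its embedding-space
    neighbors, pick one remaining word, and pair its cross-lingual
    definition with the English definition of the source word."""
    candidates = [w for w in neighbors if w not in synonyms]
    rng = random.Random(seed)
    chosen = rng.choice(candidates)
    return (en_definition, cross_defs[chosen], 0)  # label 0 = not a paraphrase

# toy stand-ins for the embedding kNN and the ConceptNet synonym set
neg = make_negative_case(
    "son", "a male descendant in relation to his parents",
    neighbors=["fils", "garçon"], synonyms={"fils"},
    cross_defs={"garçon": "jeune être humain de sexe masculin"},
)
```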

Cross-lingual Reverse Dictionary Retrieval
The objective of this task is to enable cross-lingual semantic retrieval of words based on descriptions.
Besides comparing variants of BilDRL that adopt different sentence encoders and learning strate-Languages En-Fr Fr-En En-Es Es-En Metric P @1 P @10 MRR P @1 P @10 MRR P @1 P @10 MRR P @1 P @10 MRR  Table 3: Cross-lingual reverse dictionary retrieval results by BilDRL variants. We report P @1, P @10, and MRR on four groups of models: (i) basic dictionary models that adopt four different encoding techniques (BOW, CNN, GRU and ATT); (ii) models with the two best encoding techniques that enforce the monolingual retrieval approach by Hill et al. (2016) (GRUmono and ATT-mono); (iii) models adopting bilingual multi-task learning (GRU-MTL and ATT-MTL); (iv) joint learning that employs the best dictionary model of ATT-MTL (ATT-joint).
gies, we also compare with the monolingual retrieval approach proposed by Hill et al. (2016). Instead of directly associating cross-lingual word definitions, this approach learns definition-toword mappings in a monolingual scenario. When it applies to the multilingual setting, given a lexical definition, it first retrieves the corresponding word in the source language. Then, it looks for semantically related words in the target language using bilingual word embeddings. As discussed in Section 3, NMT does not apply to this task due that it cannot capture the multi-granular correspondence between a sentence and a word.
Evaluation Protocol. Before training the models, we randomly select 500 word definitions from each dictionary respectively as test cases, and exclude these definitions from the training data. Each of the basic BilDRL variants is trained on one bilingual dictionary. The monolingual retrieval models are trained to fit the target words in the original languages of the word definitions, which are also provided in Wiktionary. BilDRL variants with multi-task or joint learning use both dictionaries of the same language pair. In the test phase, for each test case (w_i, S_w^j) ∈ D_ij, the prediction performs a kNN search from the definition encoding E_ij(S_w^j), and records the rank of w_i within the vocabulary of l_i. We limit the vocabularies to all words that appear in the Wikt3l dataset, which involve around 45k English words, 44k French words and 36k Spanish words. To prevent the surface information of the target word from appearing in the definition, we also mask out any translation of the target word occurring in the definition with a wildcard token <concept>. We aggregate three metrics on the test cases: the accuracy P@1 (%), the proportion of ranks no larger than 10, P@10 (%), and the mean reciprocal rank MRR.
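The three metrics can be aggregated from the 1-based rank of each test case's target word, e.g.:

```python
def retrieval_metrics(ranks):
    """Aggregate P@1, P@10 (percentages) and MRR from the 1-based rank of
    each test case's target word in the kNN search."""
    n = len(ranks)
    p1 = 100.0 * sum(r == 1 for r in ranks) / n
    p10 = 100.0 * sum(r <= 10 for r in ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    return p1, p10, mrr

# toy ranks for four test cases
p1, p10, mrr = retrieval_metrics([1, 2, 20, 5])
# P@1 = 25.0, P@10 = 75.0, MRR = (1 + 0.5 + 0.05 + 0.2) / 4 = 0.4375
```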
We pre-train BilBOWA based on the original configuration by Gouws et al. (2015) and obtain 50-dimensional initializations of the bilingual word embedding spaces respectively for the English-French and English-Spanish settings. For the CNN, GRU, and attentive GRU (ATT) encoders, we stack five of the corresponding encoding layers with hidden sizes of 200, and two affine layers are applied to the final output for dimension reduction. This encoder architecture consistently yields the best performance in our tuning. Through comprehensive hyperparameter tuning, we fix the learning rate α to 0.0005, the exponential decay rates of AMSGrad β_1 and β_2 to 0.9 and 0.999, the coefficients λ_1 and λ_2 both to 0.1, and the batch size to 64. Kernel size and pooling size are both set to 2 for the CNN. Word definitions are zero-padded (short ones) or truncated (long ones) to a sequence length of 15, since most definitions (over 92%) are within 15 words in the dataset. Training is limited to 1,000 epochs for all models as well as for the dictionary thread of asynchronous joint learning, in which all models are able to converge.

Results. Results are reported in Table 3 in four groups. The first group compares four different encoding techniques for the basic dictionary models. GRU thereof consistently outperforms CNN and BOW, since the latter two fail to capture the important sequential information of descriptions. ATT, which weighs among the hidden states, shows notable improvements over GRU. When we equip the two better encoding techniques with the monolingual retrieval approach (GRU-mono and ATT-mono), we find that the way of learning the dictionary models towards monolingual targets and retrieving cross-lingual related words introduces more imprecision into the task. For the models of the third group that conduct multi-task learning in two directions of a language pair, the results show significant enhancement of performance in both directions.
For the final group of results, we incorporate the best variant of multi-task models into the joint learning architecture, which leads to compelling improvement of the task on all settings. This demonstrates that properly adapting the word embeddings jointly with the bilingual dictionary model effectively constructs an embedding space that better suits the representation of both bilingual lexical and sentential semantics.
In general, this experiment has identified the proper encoding techniques of the dictionary model. The proposed strategies of multi-task and joint learning effectively contribute to the precise characterization of the cross-lingual correspondence of lexical and sentential semantics, which has led to very promising capability of cross-lingual reverse dictionary retrieval.

Bilingual Paraphrase Identification
The bilingual paraphrase identification problem is a binary classification task with the goal of deciding whether two sentences in different languages express the same meaning. Note that paraphrases have similar meanings, but can largely differ in content details and word orders; hence, they are essentially different from translations. We have found that even the well-recognized Google NMT frequently caused distortions to short sentence meanings, and led to results that were close to random guesses by the baseline classifiers after translation. BilDRL provides an effective solution by transferring sentential meanings to word-level representations and learning a simple classifier. We evaluate three variants of BilDRL on this task using WBP3l: the multi-task BilDRL with GRU encoders (BilDRL-GRU-MTL), the multi-task BilDRL with attentive GRU encoders (BilDRL-ATT-MTL), and the joint learning BilDRL with attentive GRU encoders (BilDRL-ATT-joint). We compare against several baselines of neural sentence pair models that are proposed for monolingual paraphrase identification. These models include Siamese structures of CNN (BiCNN) (Yin and Schütze, 2015), RNN (BiLSTM) (Mueller and Thyagarajan, 2016), attentive CNN (ABCNN) (Yin et al., 2016), attentive GRU (BiATT), and BOW (BiBOW). To support the reasoning of cross-lingual semantics, we provide the baselines with the same BilBOWA embeddings.

Table 4: Accuracy and F1-scores of bilingual paraphrase identification. For BilDRL, the results of three model variants are reported: BilDRL-GRU-MTL and BilDRL-ATT-MTL are models with bilingual multi-task learning, and BilDRL-ATT-joint is the best ATT-based dictionary model variant deployed with both multi-task and joint learning.
Evaluation protocol. BilDRL transfers each sentence into a vector in the word embedding space. Then, for each sentence pair in the training set, a Multi-layer Perceptron (MLP) with a binary softmax loss is trained on the subtraction of the two vectors as a downstream classifier. Baseline models are trained end-to-end, each of which directly uses a parallel pair of encoders with shared parameters and an MLP that is stacked on the subtraction of the two sentence vectors. Note that some works use concatenation (Yin and Schütze, 2015) or Manhattan distances (Mueller and Thyagarajan, 2016) of sentence vectors instead of their subtraction (Jiang et al., 2018), which we find to be less effective on small amounts of data.
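A sketch of the classifier input, with a single sigmoid unit standing in for the MLP with binary softmax; the weights and vectors are toy values, not trained parameters:

```python
import math

def pair_features(vec_a, vec_b):
    """Classifier input: element-wise subtraction of the two sentence
    vectors that BilDRL maps into the word embedding space."""
    return [a - b for a, b in zip(vec_a, vec_b)]

def mlp_score(features, weights, bias):
    """One sigmoid unit standing in for the downstream MLP classifier."""
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# identical sentence vectors → zero difference → maximally uncertain score
score = mlp_score(pair_features([0.5, 1.0], [0.5, 1.0]),
                  weights=[1.0, 1.0], bias=0.0)
# → 0.5
```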
We apply the configurations of the sentence encoders from the last experiment to corresponding baselines, so as to show the performance under controlled variables. Training of a classifier is terminated by early-stopping based on the validation set. Following convention (Hu et al., 2014;Yin et al., 2016), we report the accuracy and F1 scores.
Results. This task is challenging due to the heterogeneity of cross-lingual paraphrases and the limitedness of learning resources. The results in Table 4 show that all the baselines, among which BiATT consistently outperforms the others, merely reach slightly over 60% accuracy in both the En-Fr and En-Es settings. We believe this comes down to the fact that sentences of different languages are often drastically heterogeneous in both lexical semantics and the sentence grammar that governs the composition of words. Hence, it is not surprising that previous neural sentence pair models, which capture the semantic relation of bilingual sentences directly from all participating words, fall short at the multilingual task. BilDRL, however, effectively leverages the correspondence of lexical and sentential semantics to simplify the task to an easier entailment task in the lexicon space, for which the multi-task learning BilDRL-ATT-MTL outperforms the best baseline by 3.80% and 4.80% of accuracy respectively in the two language settings, while BilDRL-ATT-joint, employing joint learning, further improves accuracy by another 3.26% and 1.06%. Both also show a notable increment in F1.

Conclusion and Future Work
In this paper, we propose a neural embedding model BilDRL that captures the correspondence of cross-lingual lexical and sentential semantics. We experiment with multiple forms of neural sentence encoders and identify the best technique. The two learning strategies, bilingual multi-task learning and joint learning, are effective at enhancing cross-lingual learning with limited resources, and achieve promising performance on the cross-lingual reverse dictionary retrieval and bilingual paraphrase identification tasks by associating lexical and sentential semantics. An important direction of future work is to explore whether the word-sentence alignment can improve bilingual word embeddings. Applying BilDRL to bilingual question answering and semantic search systems is another important direction.