ArbEngVec : Arabic-English Cross-Lingual Word Embedding Model

Word Embeddings (WE) are increasingly popular and widely applied in many Natural Language Processing (NLP) applications due to their effectiveness in capturing semantic properties of words; Machine Translation (MT), Information Retrieval (IR) and Information Extraction (IE) are among such areas. In this paper, we propose ArbEngVec, an open-source project that provides several Arabic-English cross-lingual word embedding models. To train our bilingual models, we use a large dataset with more than 93 million pairs of Arabic-English parallel sentences. In addition, we perform both extrinsic and intrinsic evaluations of the different word embedding model variants. The extrinsic evaluation assesses the performance of the models on cross-language Semantic Textual Similarity (STS), while the intrinsic evaluation is based on the Word Translation (WT) task.


Introduction
Distributed word representations in vector space (Word Embeddings) are one of the most successful applications of deep learning for capturing the semantic and syntactic properties of words. Lately, many NLP tasks have been enriched using tools based on mono- and cross-lingual word embedding models. For instance, Mono-Lingual Word Embeddings (MLWE) have been widely used in information retrieval (Vulić and Moens, 2015a), sentiment analysis (Tang et al., 2014), text classification (Lai et al., 2015), semantic textual similarity (Kenter and De Rijke, 2015; Nagoudi and Schwab, 2017) and plagiarism detection.
Cross-Lingual Word Embeddings (CLWE) are a more challenging task because knowledge is transferred between two or more different languages (Doval et al., 2018). Recently, cross-lingual word embeddings have been used to address several problems, e.g. machine translation (Zou et al., 2013), cross-language information retrieval (Vulić and Moens, 2015a; Zhou et al., 2012), cross-language semantic similarity (Ataman et al., 2016; Nagoudi et al., 2017b) and plagiarism detection across multiple languages (Ferrero et al., 2017; Barrón-Cedeño et al., 2013). Many cross-lingual word embedding models have been developed, particularly for English, but Arabic has received comparatively little attention.
In this paper, we propose six Arabic-English cross-lingual word embedding models 1 . To train these models, we have used a large collection with more than 93 million pairs of parallel Arabic-English sentences.
The rest of this paper is organised as follows: in Section 2 we provide a quick overview of work related to cross-lingual word embedding models. We describe our dataset collection and the preprocessing steps in Section 3. Section 4 presents our proposed cross-lingual models. Section 5 presents the evaluation results. Section 6 concludes the paper with our main findings and points to possible directions for future work.

Related works
While we focus on cross-lingual word embedding models, the interested reader may refer to a number of research studies on mono-lingual word embeddings in general (Collobert and Weston, 2008; Turian et al., 2010; Mnih and Hinton, 2009; Mikolov et al., 2013c; Peters et al., 2018).
In the cross-lingual context, several word embedding models have been proposed. Blunsom and Hermann (2014) introduced a Bilingual Compositional Model (BiCVM). Leveraging the fact that aligned sentences share the same meaning, BiCVM learns bilingual word embedding vectors from a sentence-aligned corpus.
Vulić and Moens (2015b) introduced the Bilingual Word Embedding Skip-Gram (BWESG) model, which is built in three main steps: i) prepare a Skip-Gram Negative Sampling architecture (Mikolov et al., 2013b) that handles document-aligned comparable data; ii) collect bilingual document pairs; iii) shuffle each pair to produce a pseudo-bilingual document that serves as the architecture's training input.
Luong et al. (2015) proposed a Bilingual Skip-Gram model (BiSkip). BiSkip adapts the Skip-Gram of Mikolov et al. (2013b) to train two different languages at the same time: the architecture is modified to obtain two pivots and two contexts, and a training pass is run for each combination. Choosing two Germanic languages (English and German) makes it easier to predict the appropriate target-language pivot and context from the source ones, by simply aligning the target word at position [i * T/S] with the source word at position i, where S and T are the source and target sentence lengths respectively.
Chen et al. (2018) presented an Adversarial Deep Averaging Network (ADAN) for cross-lingual sentiment classification. They trained several bilingual WE models, one of them on the United Nations (UN) English-Arabic parallel aligned corpus (Ziemski et al., 2016) with Bilingual Bag-of-Words without Alignments (BilBOWA) (Gouws et al., 2015). Additionally, ADAN replaces the softmax and regularization terms with less costly alternatives.
Recently, Devlin et al. (2018) proposed a deep learning method called Bidirectional Encoder Representations from Transformers (BERT). BERT overcomes the limitations of next- and previous-token prediction by using Masked Language Modeling (MLM) (Taylor, 1953): 15% of the input tokens are masked before being fed to a Transformer encoder (Vaswani et al., 2017). Devlin et al. (2018) extended their work by applying the same architecture to Wikipedia corpora in 104 different languages, without requiring a single alignment signal, and matching, if not outperforming, state-of-the-art scores on many NLP tasks such as Part-Of-Speech Tagging and Named Entity Recognition. However, BERT demands significantly more computational effort (Wu and Dredze, 2019). Table 1 summarises the cross-language embedding models mentioned above according to architecture, training corpus, target languages and evaluation methods.

Corpus Used
The main objective of this work is to provide efficient Arabic-English cross-lingual word embedding models across different text domains. We used a large dataset of parallel Arabic-English sentences mainly extracted from the Open Parallel Corpus Project 2 (OPUS) (Tiedemann, 2012). OPUS covers 90 languages and more than 2.7 billion parallel sentences, drawn from multiple domains and sources including: MultiUN (Daniel Tapias, 2010), OpenSubtitles (Creutz, 2018), Tanzil (Zarrabi-Zadeh, 2007), News-Commentary, United Nations (UN) (Ziemski et al., 2016), Wikipedia, TED 2013 3 , GNOME 4 , Tatoeba 5 , Global Voices 6 , KDE4 7 and Ubuntu 8 . To train our models, we extracted more than 93.9 million Arabic-English parallel sentences from the whole collection; this alignment contains more than 800 million Arabic tokens and 1 billion English tokens. More details about our dataset are given in Table 2.

Preprocessing and Normalization
Preprocessing is an important step in building any word embedding model, as it can significantly affect the end results. We first remove punctuation marks, non-letters, URLs, emojis and emoticons from the Arabic and English sentences. Additionally, we normalize the Arabic sentences using the preprocessing suggested by Nagoudi et al.
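As an illustration, the cleaning step above can be sketched as follows. This is a minimal sketch, not a reproduction of the exact rules of Nagoudi et al.; the alef/ta-marbuta/alef-maqsura mappings shown are common Arabic normalization choices that we assume here for illustration.

```python
import re

def clean(text):
    """Remove URLs and anything that is not an Arabic or Latin letter."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)      # strip URLs
    text = re.sub(r"[^\u0621-\u064Aa-zA-Z\s]", " ", text)   # keep Arabic/Latin letters only
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace

def normalize_arabic(text):
    """Common Arabic normalization: unify alef variants, ta marbuta, alef maqsura."""
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                 # ta marbuta -> ha
    return text.replace("\u0649", "\u064A")                 # alef maqsura -> ya
```

For example, `clean("Hello, world! visit http://example.com now")` returns `"Hello world visit now"`.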


Proposed Models
In this section, we present our proposed ArbEngVec models. To learn these models, we rely basically on different ways of combining each pair of parallel sentences into a single training input, as described below.

Parallel Mode
To show that the shuffling methods add cross-lingual improvements, we also trained a model without any alignment. For example, let S ar and S en be a pair of Arabic and English sentences, with S en = "The young boys are brothers".
The pair (S ar , S en ) is fed directly to the training by simple concatenation: "young, boys, brothers, , , ".

Word by Word Alignment Mode
The second method uses the same corpus but aligns each pair word by word, taking sentence lengths into account and starting the alignment from the longer sentence (the words of the shorter sentence are interleaved with those of the longer one). This method works best for pairs of almost equal length, so the stop-word removal preprocessing step is highly beneficial here. Continuing with the sentences of the previous example, the training input is: " young, , boys, , brothers, ".

Random Shuffling Mode
In this method, we put each pair of bilingual sentences into a single list of their words and shuffle that list randomly, independently from the rest of the corpus, obtaining a list of mixed English-Arabic tokens. As shown in our example: " young, , , boys, brothers, ".
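The three input-construction modes above can be sketched as follows. This is our reading of the descriptions, with toy token lists standing in for real Arabic-English sentence pairs; in particular, the interleaving order in the word-by-word mode is an assumption.

```python
import random

def parallel_mode(src, tgt):
    """Parallel mode: simple concatenation of the two token lists, no alignment."""
    return src + tgt

def word_by_word_mode(src, tgt):
    """Word-by-word mode: interleave tokens, starting from the longer sentence
    so the shorter sentence's words are surrounded by the longer one's."""
    long_s, short_s = (src, tgt) if len(src) >= len(tgt) else (tgt, src)
    merged = []
    for i, word in enumerate(long_s):
        merged.append(word)
        if i < len(short_s):
            merged.append(short_s[i])
    return merged

def random_shuffle_mode(src, tgt, seed=None):
    """Random shuffling mode: mix the pair's tokens and shuffle them,
    independently from the rest of the corpus."""
    merged = src + tgt
    random.Random(seed).shuffle(merged)
    return merged
```

Each merged list then serves as one training "sentence" for the embedding model.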

Parameters and Training Environment
Training word embedding models requires choosing some parameters that affect the resulting vectors. For our CBOW models we used the parameter values recommended by Mikolov et al. (2013c): vector size 300, window = 5, and frequency threshold = 100. For the Skip-Gram models we chose Negative Sampling with negative = 5 instead of Hierarchical Softmax. Worth mentioning that all models were trained for 10 epochs with the Gensim tool (Řehůřek and Sojka, 2011).
Concerning the training environment, we used the Google Colaboratory 9 research project (also known as Colab, https://colab.research.google.com/) for training our model variants. Colab is a ready-to-use development environment that requires nothing but a browser: it provides free access to a 12 GB GPU, connects to a personal Google Drive account for saving and loading files, and offers many other services that can be plugged in.

Evaluation
Multilingual models are usually measured against two evaluation aspects: maintaining the monolingual behaviour and providing the cross-lingual one. By deciding on shuffling, we willingly sacrifice part of the former to strengthen the latter: preserving the model's monolingual behaviour requires keeping words in a semantically meaningful order, which is exactly what happens in our parallel (non-shuffling) model, at the cost of a completely skewed cross-lingual aspect. To quantify this trade-off, we evaluated our models through Semantic Textual Similarity as an extrinsic task and Word Translation as an intrinsic task.

Intrinsic Evaluation
In this step, we focused on word translation following the evaluation procedure of Gouws et al. (2015). We generated 1,000 tuples by first choosing 1,000 random words from the model vocabulary, and then finding their k-closest (k most similar) cross-lingual words based on cosine similarity in each of our six ArbEngVec models. We used five different values of k, generating the 1-closest, 2-closest, 3-closest, 5-closest and 10-closest words. For example, Table 4 shows the 5-closest words of and weapons in our random Skip-Gram model. Afterwards, we compute the accuracy for each value of k: a word couple scores 1 if it represents a translation, which we verify by comparing the word provided by our model against the Google Translate API's bag of words; if this comparison is negative we check manually, and if the manual check is also negative the couple scores 0. Finally, we average the 1,000 scores. Results for the six studied models are provided in Table 3.
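The k-closest retrieval underlying this evaluation can be sketched with plain cosine similarity; the toy vectors below are illustrative, whereas a real run would query the trained model's cross-lingual vocabulary.

```python
import numpy as np

def k_closest(query_vec, vocab, vectors, k=5):
    """Return the k words whose embeddings have the highest
    cosine similarity to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    m = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = m @ q                      # cosine similarity to every vocab word
    top = np.argsort(-sims)[:k]       # indices of the k highest similarities
    return [vocab[i] for i in top]
```

In practice this is what Gensim's `most_similar` does internally; each retrieved word is then checked against a reference translation to score 1 or 0.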
Discussion. As Table 3 shows, the parallel results are dim bilingually, but the monolingual aspect is preserved, especially in the CBOW variant. This is illustrated in Table 5, which gives the same 5-closest words of and weapon using the Parallel CBOW model. Switching to the word-by-word alignment method, both variants give promising results, notably Skip-Gram, which is on average 59.26% above CBOW; this is a consequence of getting word translation pairs within the context window range, although since Arabic and English are structurally different, this alignment method has its shortcomings. Finally, the random shuffle variants give the best results, again with Skip-Gram better than CBOW, by an average of 2.44%.

Extrinsic Evaluation
Extrinsic evaluation means observing the model's performance on a real-world Natural Language Processing task; our choice fell on the Semantic Textual Similarity (STS) task. To estimate the semantic similarity between Arabic-English sentence pairs, we used the WE-based approach proposed by Nagoudi et al. (2017b) jointly with our ArbEngVec models. We used the STS2017-Eval 10 datasets drawn from the shared task SemEval-2017 Task 1: STS Cross-lingual Arabic-English (Cer et al., 2017). The sentence pairs of STS2017-Eval were manually labelled by five annotators, and the similarity score is the average of the annotators' judgments. To evaluate the performance of each model, we calculate the Pearson correlation between our assigned semantic similarity scores and the human judgments. Discussion. These results indicate that with the parallel alignment the correlation rate is very low for both architectures, due to the distance between every word and its translation in the concatenated sentence pair. However, with the word-by-word alignment the correlation rate clearly improves, to 49.4% and 73.6% with the CBOW and Skip-Gram models respectively. Additionally, the observed results indicate that the random shuffling method with the Skip-Gram model is the best performing method, with a correlation rate of 75.7%.
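A common WE-based STS recipe, and our reading of the approach rather than an exact reproduction of Nagoudi et al. (2017b), represents each sentence as the average of its word vectors, scores a pair by cosine similarity, and compares the system scores to the gold labels with Pearson correlation:

```python
import numpy as np

def sentence_vec(tokens, emb):
    """Average the embeddings of the tokens found in the model."""
    vecs = [emb[w] for w in tokens if w in emb]
    return np.mean(vecs, axis=0)

def sts_score(s1, s2, emb):
    """Cosine similarity between the two averaged sentence vectors."""
    v1, v2 = sentence_vec(s1, emb), sentence_vec(s2, emb)
    return float(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def pearson(x, y):
    """Pearson correlation between system scores x and gold labels y."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.linalg.norm(xc) * np.linalg.norm(yc)))
```

Here `emb` would be the cross-lingual ArbEngVec vectors, so an Arabic sentence and its English counterpart land in the same vector space before the cosine is taken.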

Models Visualization
As part of the discussion, we illustrate our models using pyplot scatter plots with the t-SNE algorithm of Maaten and Hinton (2008). We build these visualizations by choosing 20 arbitrary words from our vocabulary, running a 4-closest similarity query for each word, and projecting all of them onto a 2-dimensional plot. Starting with the parallel mode models, the charts show that Arabic markers lie far from the English ones compared to markers of the same language. The same can be said of the word-by-word CBOW variant, with the two languages less distant but the marker clusters still rarely containing translation pairs. Finally, the random variant charts make it clear that close markers do include translation pairs, alongside mono- and cross-lingual similarities; the six model charts are shown in Figure 2. For the Skip-Gram variant in particular, the t-SNE feature reduction seemingly abstracts away both languages' characteristics: as Figure 3 shows, words and their translations most often appear next to each other.
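The projection behind these plots can be sketched as follows; random toy vectors stand in for the real embeddings, and in the paper's setting the input would be the 20 query words plus their 4-closest neighbours (100 points in total).

```python
import numpy as np
from sklearn.manifold import TSNE

# Toy stand-in for the embeddings of the 20 query words and their
# 4-closest neighbours: 100 vectors of dimension 300.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(100, 300))

# Project to 2-D with t-SNE; each row of `coords` can then be drawn
# with pyplot's scatter and annotated with its word.
coords = TSNE(n_components=2, perplexity=30, init="pca",
              random_state=0).fit_transform(vectors)
```

Translation pairs appearing as neighbouring markers in the 2-D plot is then a qualitative sign that the cross-lingual space is well aligned.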

Conclusion
In this paper, we have presented the open-source project named ArbEngVec, which provides several Arabic-English cross-lingual word embedding models. The embedding models are learned from a large dataset of parallel Arabic-English sentences. Additionally, we evaluated the ArbEngVec models via extrinsic and intrinsic evaluations. In the extrinsic evaluation, we used the cross-language semantic similarity task to test the capability of our models to capture the semantic and syntactic properties of words in two different languages, while in the intrinsic evaluation we employed the embedding vectors for the word translation task. As future work, we plan to combine these models with other classical NLP techniques, including word sense disambiguation and named entity recognition, to further improve Arabic-English cross-language semantic similarity and plagiarism detection. We also aim to find better word alignment methods to improve feature capturing in the transfer between Semitic and Germanic languages.