Semantic Similarity of Arabic Sentences with Word Embeddings

Semantic textual similarity underlies many applications and plays an important role in areas such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system for calculating the semantic similarity between Arabic sentences. The main idea is to represent words as vectors in a multidimensional space in order to capture their semantic and syntactic properties. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to help identify the words that are most descriptive in each sentence. The performance of the proposed system is evaluated through the Pearson correlation between the assigned semantic similarity scores and human judgments.


Introduction
Text similarity is an important task in several application fields, such as information retrieval, plagiarism detection, machine translation, topic detection, text classification and text summarization. Finding the similarity between two texts, paragraphs or sentences relies on measuring, directly or indirectly, the similarity between words.
There are two known types of word similarity: lexical and semantic. The first treats words as streams of characters: words are lexically similar if they share the same characters in the same order (Manning et al., 2008). There are many lexical similarity measures; the best known are Damerau-Levenshtein (Levenshtein, 1966), Needleman-Wunsch (Needleman and Wunsch, 1970), LCS (Chvatal and Sankoff, 1975) and Jaro-Winkler (Winkler, 1999).
The second type aims to quantify the degree to which two words are semantically related. For example, they can be synonyms, refer to the same thing, or be used in the same context. The classical way to measure this semantic similarity is to use linguistic resources such as WordNet (Miller, 1995), HowNet (Dong and Dong, 2003), BabelNet (Navigli and Ponzetto, 2012) or Dbnary (Sérasset, 2015). However, word embedding techniques can be a more effective alternative to these linguistic databases (Mikolov et al., 2013a).
In this article we focus our investigation on measuring the semantic similarity between short Arabic sentences using word embedding representations. We also consider the IDF weighting and Part-of-Speech tagging techniques in order to improve the identification of words that are highly descriptive in each sentence.
The rest of this article is organized as follows. The next section describes work related to word representations in vector space. In Section 3, we present three variants of our proposed word embedding-based system. Section 4 describes the experimental results of this study. Finally, our conclusion and some future research directions are given in Section 5.

Word Embedding Models
Representing words as vectors in a multidimensional space makes it possible to capture the semantic and syntactic properties of the language (Mikolov et al., 2013a). These representations can serve as a fundamental building block for many Natural Language Processing (NLP) applications. Several techniques have been proposed in the literature to build vector space representations.
For instance, Collobert and Weston (2008) proposed a unified system based on a deep neural network architecture, trained jointly on several well-known NLP tasks, including chunking, Part-of-Speech tagging, Named Entity Recognition and Semantic Role Labeling. Their word embedding model is stored in a matrix $M \in \mathbb{R}^{d \times |D|}$, where $D$ is a dictionary of all unique words in the training data, and each word is embedded into a $d$-dimensional vector. Sentences are represented using the embeddings of their constituent words. A similar idea was independently proposed and used by Turian et al. (2010). Mnih and Hinton (2009) proposed another way to represent words in vector space, the Hierarchical Log-Bilinear Model (HLBL). Like virtually all neural language models, the HLBL model represents each word with a real-valued feature vector. For an n-gram, HLBL concatenates the embeddings of the first $n-1$ words ($w_1, \ldots, w_{n-1}$) and learns a neural linear model to predict the last word $w_n$.
Mikolov et al. (2013c) used a recurrent neural network (RNN) (Mikolov et al., 2010) to build a neural language model. The RNN encodes the context word by word and predicts the next word. The weights of the trained network are used as the word embedding vectors. Mikolov et al. (2013a; 2013b) proposed two other approaches to building word representations in vector space, using a simplified version of the neural language model of Bengio et al. (2003). They replaced the hidden layer with a simple projection layer in order to boost performance. Two models are presented in their work: the continuous bag-of-words model (CBOW) (Mikolov et al., 2013a) and the skip-gram model (SKIP-G) (Mikolov et al., 2013b).
The first model, CBOW (Mikolov et al., 2013a), predicts a pivot word from a window of contextual words around it. Given a sequence of words $S = w_1, w_2, \ldots, w_i$, the CBOW model learns to predict each word $w_k$ from its surrounding words $(w_{k-l}, \ldots, w_{k-1}, w_{k+1}, \ldots, w_{k+l})$.
The second model, SKIP-G, predicts the surrounding words of the current pivot word $w_k$ (Mikolov et al., 2013b). Pennington et al. (2014) proposed Global Vectors (GloVe) to build a word representation model. GloVe uses the global statistics of word-word co-occurrence to build a co-occurrence matrix $M$. Then, $M$ is used to calculate the probability that word $w_i$ appears in the context of another word $w_j$; this probability $P(i \mid j)$ represents the relationship between the two words.
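The difference between the two prediction tasks can be illustrated with a small sketch that only enumerates the training examples each model would see; it does not train any network. The function names, the toy English sentence and the window size are assumptions made for illustration and do not come from the cited works.

```python
def cbow_pairs(tokens, window):
    """Yield (context_words, target_word) pairs.

    CBOW predicts the target word from the words around it.
    """
    for k, target in enumerate(tokens):
        left = tokens[max(0, k - window):k]
        right = tokens[k + 1:k + 1 + window]
        yield left + right, target


def skipgram_pairs(tokens, window):
    """Yield (pivot_word, context_word) pairs.

    Skip-gram predicts each surrounding word from the pivot word.
    """
    for context, pivot in cbow_pairs(tokens, window):
        for word in context:
            yield pivot, word


# Toy sentence (hypothetical), with a window of one word on each side.
sentence = ["joseph", "went", "to", "college"]
cbow = list(cbow_pairs(sentence, window=1))
skip = list(skipgram_pairs(sentence, window=1))
```

With this toy input, the second CBOW example pairs the context `["joseph", "to"]` with the target `"went"`, whereas skip-gram produces the two separate examples `("went", "joseph")` and `("went", "to")`.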

Model Used
In (Mikolov et al., 2013a), the methods of Collobert and Weston (2008), Turian et al. (2010), Mnih and Hinton (2009) and Mikolov et al. (2013c) were evaluated and compared, and the authors show that CBOW and SKIP-G are significantly faster to train and more accurate than these techniques. For this reason, we used the Arabic CBOW word representation model proposed by Zahran et al. (2015). To train this model, they used a large collection drawn from different sources and containing more than 5.8 billion words, including Arabic Wikipedia (WikiAr, 2006).
Training the Arabic CBOW model requires choosing several parameters that affect the resulting vectors. All the parameters used by Zahran et al. (2015) are shown in Table 1.

Word Similarity
We used the CBOW model to identify near matches between two words $w_i$ and $w_j$ (e.g. synonyms, singular and plural forms, feminine forms, or closely related meanings). The similarity between $w_i$ and $w_j$ is obtained by comparing their vector representations $v_i$ and $v_j$ respectively. The similarity between $v_i$ and $v_j$ can be evaluated using the cosine similarity, the Euclidean distance, the Manhattan distance or any other similarity measure. For example, consider three Arabic words meaning "university", "evening" and "faculty". The similarity between them is measured by computing the cosine similarity between their vectors. The resulting scores indicate that the words for "faculty" and "university" are semantically closer than the words for "evening" and "university".
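As a minimal sketch of the comparison described above, the cosine similarity between two word vectors can be computed as follows. The three-dimensional vectors are hypothetical toy values chosen for illustration; they are not taken from the CBOW model of Zahran et al., whose vectors have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention of this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for real embeddings.
university = [0.9, 0.8, 0.1]
faculty = [0.8, 0.9, 0.2]
evening = [0.1, 0.2, 0.9]

sim_faculty = cosine_similarity(university, faculty)
sim_evening = cosine_similarity(university, evening)
```

With these toy values, `sim_faculty` is larger than `sim_evening`, which mirrors the qualitative observation in the example above.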

Sentence Similarity
Let $S_1 = w_1, w_2, \ldots, w_i$ and $S_2 = w_1, w_2, \ldots, w_j$ be two sentences whose word vectors are $(v_1, v_2, \ldots, v_i)$ and $(v_1, v_2, \ldots, v_j)$ respectively. We used three methods to measure the similarity between sentences. Figure 1 illustrates an overview of the procedure for computing the similarity between two candidate sentences in our system.

Figure 1: The architecture of the proposed system

In the following, we explain our proposed methods to compute the semantic similarity between sentences.

No Weighting Method
A simple way to compare two sentences is to sum their word vectors. This method can be applied to sentences of any length. The similarity between $S_1$ and $S_2$ is obtained by calculating the cosine similarity between $V_1$ and $V_2$, where:

$$V_1 = \sum_{k=1}^{i} v_k, \qquad V_2 = \sum_{k=1}^{j} v_k$$

For example, let $S_1$ and $S_2$ be two sentences, where $S_1$ means "Joseph went to college". The similarity between $S_1$ and $S_2$ is obtained as follows:

Step 1: Sum the word vectors of each sentence to obtain $V_1$ and $V_2$.

Step 2: Calculate the similarity between $S_1$ and $S_2$ as the cosine similarity between $V_1$ and $V_2$:

$$sim(S_1, S_2) = \cos(V_1, V_2) = 0.71$$

In order to improve the similarity results, we used two weighting functions based on the Inverse Document Frequency (IDF) (Salton and Buckley, 1988) and on Part-of-Speech tagging (POS tagging) (Schwab, 2005) (Lioma and Blanco, 2009).
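The two steps of the no-weighting method can be sketched as follows. The embedding table, the English placeholder tokens and the second sentence are hypothetical and serve only to make the example runnable; skipping words that are missing from the embedding table is also an assumption of this sketch.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(words, embeddings):
    """Step 1: sum the vectors of the words of a sentence."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            # Assumption: out-of-vocabulary words are ignored.
            continue
        total = [t + x for t, x in zip(total, vec)]
    return total


# Hypothetical toy embeddings with English placeholder tokens.
embeddings = {
    "joseph": [0.2, 0.1, 0.7],
    "went": [0.5, 0.4, 0.1],
    "college": [0.9, 0.8, 0.1],
    "goes": [0.5, 0.5, 0.1],
    "university": [0.8, 0.9, 0.2],
}
s1 = ["joseph", "went", "college"]
s2 = ["joseph", "goes", "university"]  # hypothetical second sentence

# Step 2: cosine similarity between the two sentence vectors.
v1 = sentence_vector(s1, embeddings)
v2 = sentence_vector(s2, embeddings)
similarity = cosine_similarity(v1, v2)
```

Because the vectors are simply summed, every word contributes equally to the sentence vector, which is the limitation that the two weighting schemes address.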

IDF Weighting Method
In this variant, the Inverse Document Frequency (IDF) is used to produce a composite weight for each word in each sentence. The IDF weighting of words (Salton and Buckley, 1988) is traditionally used in information retrieval (Turney and Pantel, 2010) and can be employed in our system. The idf weight measures how much information a word provides: a term that occurs infrequently is good for discriminating between documents (in our case, sentences).
This technique uses a large collection of documents (a background corpus), generally of the same genre as the input corpus that is to be analysed. To compute the idf weight of each word, we used the BBC and CNN Arabic corpus (Saad and Ashour, 2010) as a background corpus. The idf of each word is determined by the formula:

$$idf(w) = \log\left(\frac{S}{W_S}\right)$$

where $S$ is the total number of sentences in the corpus and $W_S$ is the number of sentences containing the word $w$. The similarity between $S_1$ and $S_2$ is obtained by calculating the cosine similarity $\cos(V_1, V_2)$ between $V_1$ and $V_2$, where:

$$V_1 = \sum_{k=1}^{i} idf(w_k) \cdot v_k, \qquad V_2 = \sum_{k=1}^{j} idf(w_k) \cdot v_k$$

and $idf(w_k)$ is the weight of the word $w_k$ in the background corpus.

Step 1: Sum the word vectors of each sentence, each multiplied by its idf weight, to obtain $V_1$ and $V_2$.

Step 2: Calculate the similarity. The cosine similarity is applied to compute a similarity score between $V_1$ and $V_2$.

We note that the similarity obtained for the two sentences is better than with the previous method.
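A minimal sketch of the IDF weighting is given below, under the assumption that the weight of a word is the logarithm of the total number of background sentences divided by the number of sentences containing that word. The tiny background corpus, the function names and the default weight given to words absent from the background corpus are hypothetical choices made for illustration.

```python
import math


def idf_weights(corpus_sentences):
    """Return idf(w) = log(S / W_S) for every word of the background corpus.

    S is the number of sentences; W_S is the number of sentences
    that contain the word w.
    """
    total = len(corpus_sentences)
    counts = {}
    for sentence in corpus_sentences:
        for word in set(sentence):  # count each word once per sentence
            counts[word] = counts.get(word, 0) + 1
    return {word: math.log(total / n) for word, n in counts.items()}


def idf_sentence_vector(words, embeddings, weights, default_weight=1.0):
    """Sum the word vectors of a sentence, each scaled by its idf weight."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        # Assumption: words unseen in the background corpus get default_weight.
        weight = weights.get(word, default_weight)
        total = [t + weight * x for t, x in zip(total, vec)]
    return total


# Hypothetical background corpus of four tokenized sentences.
background = [
    ["joseph", "went", "to", "college"],
    ["mary", "went", "to", "school"],
    ["joseph", "went", "home"],
    ["to", "be"],
]
weights = idf_weights(background)

# Hypothetical two-dimensional embeddings.
embeddings = {"went": [1.0, 0.0], "college": [0.0, 1.0]}
vector = idf_sentence_vector(["went", "college"], embeddings, weights)
```

In this toy corpus, "college" appears in one sentence out of four and "went" in three, so "college" receives the larger weight and dominates the sentence vector.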

Part-of-Speech Weighting Method
An alternative technique is to apply Part-of-Speech tagging (POS tagging) to identify the words that are most descriptive in each input sentence (Schwab, 2005) (Lioma and Blanco, 2009). For this purpose, we used the POS tagger for Arabic proposed by Gahbiche-Braham et al. (2012) to estimate the part of speech of each word in a sentence. Then, a weight is assigned to each type of tag in the sentence, for example verb = 0.4, noun = 0.5, adjective = 0.3, preposition = 0.1, etc. The similarity between $S_1$ and $S_2$ is obtained in three steps (Schwab, 2005) as follows:

Step 1: POS tagging. In this step, the POS tagger of Gahbiche-Braham et al. (2012) is used to estimate the POS of each word in the sentence:

$$Pos\_tag(S_1) = Pos_{w_1}, Pos_{w_2}, \ldots, Pos_{w_i}$$
$$Pos\_tag(S_2) = Pos_{w_1}, Pos_{w_2}, \ldots, Pos_{w_j}$$

The function $Pos\_tag(S_i)$ returns, for each word $w_k$ in $S_i$, its estimated part of speech $Pos_{w_k}$.

Step 2: POS weighting. The weight of each part of speech can be fixed empirically; we relied on the training data of SemEval for this purpose. The sentence vectors are then computed as:

$$V_1 = \sum_{k=1}^{i} Pos\_weight(Pos_{w_k}) \cdot v_k, \qquad V_2 = \sum_{k=1}^{j} Pos\_weight(Pos_{w_k}) \cdot v_k$$

where $Pos\_weight(Pos_{w_k})$ is the function that returns the weight of the POS tag of $w_k$.

Step 3: Calculate the similarity. Finally, the similarity between $S_1$ and $S_2$ is obtained by calculating the cosine similarity between $V_1$ and $V_2$ as follows:

$$sim(S_1, S_2) = \cos(V_1, V_2)$$

Example: let us continue with the same example as above, where $S_1$ means "Joseph went to college".
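The three steps can be sketched as follows, using the example weights quoted in the text (verb = 0.4, noun = 0.5, adjective = 0.3, preposition = 0.1). The sketch assumes that the sentences have already been tagged: the (word, tag) pairs below are written by hand and stand in for the output of a real Arabic POS tagger. The toy embeddings, the English placeholder tokens and the zero weight given to unlisted tags are likewise assumptions.

```python
import math

# Example weights quoted in the text; other tags are not specified there.
POS_WEIGHTS = {"verb": 0.4, "noun": 0.5, "adjective": 0.3, "preposition": 0.1}


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pos_sentence_vector(tagged_words, embeddings, pos_weights=POS_WEIGHTS):
    """Step 2: sum the word vectors, each scaled by the weight of its POS tag."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word, tag in tagged_words:
        vec = embeddings.get(word)
        if vec is None:
            continue  # assumption: out-of-vocabulary words are ignored
        weight = pos_weights.get(tag, 0.0)  # assumption: unlisted tags weigh 0
        total = [t + weight * x for t, x in zip(total, vec)]
    return total


# Hypothetical toy embeddings.
embeddings = {
    "joseph": [0.2, 0.1, 0.7],
    "went": [0.5, 0.4, 0.1],
    "to": [0.3, 0.3, 0.3],
    "college": [0.9, 0.8, 0.1],
    "university": [0.8, 0.9, 0.2],
}

# Step 1 (assumed already done): hand-written tags standing in for a tagger.
tagged_s1 = [("joseph", "noun"), ("went", "verb"),
             ("to", "preposition"), ("college", "noun")]
tagged_s2 = [("joseph", "noun"), ("went", "verb"),
             ("to", "preposition"), ("university", "noun")]

# Step 3: cosine similarity between the POS-weighted sentence vectors.
v1 = pos_sentence_vector(tagged_s1, embeddings)
v2 = pos_sentence_vector(tagged_s2, embeddings)
similarity = cosine_similarity(v1, v2)
```

With such weights, nouns and verbs shape the sentence vector far more than prepositions, so function words have little influence on the final score.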

Test Sample
To measure the performance of our system effectively, a large collection is necessary. We used a dataset of 750 sentence pairs drawn from the publicly available Microsoft Research Video Description Corpus (MSR-Video) (MSRvideo, 2016) and manually translated into Arabic. The sentence pairs were manually tagged by four annotators, and the similarity score is the mean of the annotators' scores. This score is a real number between 0 (indicating that the meanings of the sentences are completely independent) and 1 (signifying meaning equivalence).

3 http://alt.qcri.org/semeval2017/task1/data/uploads/

Preprocessing
In order to normalize the sentences for the semantic similarity step, a set of preprocessing operations is performed on the data set. All sentences went through the following steps:
1. Remove stop words.
2. Remove punctuation marks, diacritics and non-letters.
3. Normalize variant letter forms to a single canonical letter.
4. Replace specific final-letter sequences with their normalized form.
5. Normalize numerical digits to the token "Num".
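Part of this preprocessing pipeline can be sketched as follows. The sketch covers stop-word removal, the removal of punctuation, diacritics and non-letters, and the replacement of digits by the token "Num"; the letter-normalization steps are left out. The stop-word list is a hypothetical three-word sample, and the order of the operations is a choice of this sketch (digits are replaced first so that the non-letter filter does not discard them).

```python
import re
import unicodedata

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"في", "من", "على"}


def preprocess(sentence, stop_words=STOP_WORDS):
    """Return the list of normalized tokens of a sentence."""
    # Replace each run of digits with the token "Num". This is done first
    # so that the digits are not discarded by the non-letter filter below.
    text = re.sub(r"\d+", " Num ", sentence)
    # Delete combining marks (Unicode category Mn), which include the
    # Arabic diacritics.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Replace punctuation and every other non-letter character with a space.
    text = "".join(ch if ch.isalpha() else " " for ch in text)
    # Tokenize on whitespace and drop the stop words.
    return [token for token in text.split() if token not in stop_words]


# Usage with an English placeholder sentence and a matching stop-word list.
tokens = preprocess("a cat sat in 2017!", stop_words={"a", "in"})
```

Diacritics are deleted rather than replaced with spaces so that a vocalized word stays a single token.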

Results
To evaluate the performance of our system, our three approaches were assessed on the 750 sentence pairs of the MSR-Video corpus. An example of our results is shown in Table 2. The sentence pairs in Table 2 were selected randomly from our dataset. It can be seen that the similarity estimates provided by our system are fairly consistent with human judgements. However, the similarity score is not good enough when two sentences share the same words but have totally different meanings, as in the last pair of sentences.
We also calculated the Pearson correlation between our assigned semantic similarity scores and human judgements. The results are presented in Table 3.

Approach                Correlation
Basic method            72.33%
IDF-weighting method    78.20%
POS tagging method      79.69%

These results indicate that the no-weighting method reaches a correlation of 72.33%. Both the IDF-weighting and POS tagging approaches clearly improve on it, raising the correlation above 78% (78.20% and 79.69% respectively).
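For reference, the Pearson correlation between a list of system scores and the corresponding human judgements can be computed as in the sketch below. The two short score lists are hypothetical values used only to exercise the function; they are not taken from the evaluation data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four sentence pairs.
system_scores = [0.71, 0.35, 0.90, 0.10]
human_scores = [0.80, 0.40, 0.95, 0.05]
correlation = pearson(system_scores, human_scores)
```

The coefficient lies between -1 and 1; multiplying it by 100 gives a percentage of the kind reported in Table 3.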

Conclusion and Future Work
In this article, we presented an innovative word embedding-based system to measure semantic relations between Arabic sentences. This system relies on the semantic properties of words captured by the word embedding model. To make further progress in the analysis of semantic sentence similarity, this article showed how IDF weighting and Part-of-Speech tagging can support the identification of the words that are most descriptive in each sentence. Our experiments showed that these techniques improve the correlation results. The performance of the proposed system was evaluated through the Pearson correlation between our assigned semantic similarity scores and human judgements. As future work, we expect that the semantic similarity results can be further improved by a hybridisation of the IDF weighting and POS tagging techniques.