LIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting

This article describes our proposed system, LIM-LIG, designed for SemEval-2017 Task 1: Semantic Textual Similarity (Track 1). LIM-LIG proposes an innovative enhancement to a word-embedding-based model for measuring the semantic similarity between Arabic sentences. The main idea is to exploit word representations as vectors in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied to the examined sentences to support the identification of the words that are highly descriptive in each sentence. The LIM-LIG system achieves a Pearson's correlation of 0.74633, ranking 2nd among all participants in the Arabic monolingual pairs STS task organized within the SemEval 2017 evaluation campaign.


Introduction
Semantic Textual Similarity (STS) is an important task in several application fields, such as information retrieval, machine translation, plagiarism detection and others. STS measures the degree of similarity between the meanings of two text sequences (Agirre et al., 2015). Since SemEval 2013, STS has been one of the official shared tasks.
This is the first year in which SemEval has organized an Arabic monolingual pairs STS track. The challenge in this task lies in interpreting the semantic similarity of two given Arabic sentences as a continuous-valued score ranging from 0 to 5. Arabic STS measurement could be very useful in several areas, including disguised plagiarism detection, word-sense disambiguation, latent semantic analysis (LSA) and paraphrase identification. A very important advantage of the SemEval evaluation campaign is that it enables the evaluation of several different systems on common datasets, which makes it possible to produce new annotated datasets that can be used in future NLP research.
In this article we present our LIM-LIG system, devoted to enhancing the measurement of semantic similarity between Arabic sentences. For the STS task (Arabic monolingual pairs) of SemEval 2017, the LIM-LIG system proposes three methods to measure this similarity: the No weighting, IDF weighting and Part-of-speech weighting methods. The best submitted method (Part-of-speech weighting) achieves a Pearson's correlation of 0.7463, ranking 2nd in the Arabic monolingual STS task. In addition, we proposed another method after the competition, named the Mixed method. With this method, the correlation reached 0.7667, which represents the best score among the different methods involved in the Arabic monolingual STS task.

Word Embedding Models
In the literature, several techniques have been proposed to build word embedding models.
For instance, Collobert and Weston (2008) proposed a unified system based on a deep neural network architecture. Their word embedding model is stored in a matrix $M \in \mathbb{R}^{d \times |D|}$, where $D$ is a dictionary of all unique words in the training data, and each word is embedded into a $d$-dimensional vector. Mnih and Hinton (2009) proposed the Hierarchical Log-Bilinear Model (HLBL). The HLBL model concatenates the embeddings of the first $n-1$ words $(w_1, ..., w_{n-1})$ and learns a neural linear model to predict the last word $w_n$. Mikolov et al. (2013a, 2013b) proposed two other approaches to build word representations in vector space. The first one, the continuous bag-of-words model CBOW (Mikolov et al., 2013a), predicts a pivot word from its context by using a window of contextual words around it. Given a sequence of words $S = w_1, w_2, ..., w_i$, the CBOW model learns to predict each word $w_k$ from its surrounding words $(w_{k-l}, ..., w_{k-1}, w_{k+1}, ..., w_{k+l})$. The second model, SKIP-G, predicts the surrounding words of the current pivot word $w_k$ (Mikolov et al., 2013b). Pennington et al. (2014) proposed Global Vectors (GloVe) to build a word representation model. GloVe uses the global statistics of word-word co-occurrence to calculate the probability of word $w_i$ appearing in the context of another word $w_j$; this probability $P(i/j)$ represents the relationship between the words.
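To make the difference between the CBOW and SKIP-G models concrete, the following Python sketch enumerates the training instances that a window of size l induces on a toy sentence. It only illustrates how contexts and pivot words are paired, not the neural training itself. The function names `cbow_pairs` and `skipgram_pairs` and the English toy sentence are assumptions made for illustration.

```python
def cbow_pairs(words, window):
    """Return (context, pivot) pairs: CBOW predicts the pivot from its context."""
    pairs = []
    for k, pivot in enumerate(words):
        left = words[max(0, k - window):k]       # up to `window` words before w_k
        right = words[k + 1:k + 1 + window]      # up to `window` words after w_k
        pairs.append((left + right, pivot))
    return pairs


def skipgram_pairs(words, window):
    """Return (pivot, context_word) pairs: Skip-gram predicts each context word."""
    return [(pivot, c)
            for context, pivot in cbow_pairs(words, window)
            for c in context]


# Toy sentence (hypothetical, not taken from the training corpus).
sentence = ["joseph", "went", "to", "college"]
for context, pivot in cbow_pairs(sentence, window=1):
    print(context, "->", pivot)
```

Both functions see the same windows; only the direction of prediction differs.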

Model Used
Mikolov et al. (2013a) evaluated and compared the methods of Collobert and Weston (2008), Turian et al. (2010), Mnih and Hinton (2009) and Mikolov et al. (2013c), and showed that CBOW and SKIP-G are significantly faster to train, with better accuracy, than these techniques. For this reason, we have used the CBOW word representation model for Arabic (footnote 1) proposed by Zahran et al. (2015). To train this model, they used a large collection from different sources counting more than 5.

Words Similarity
We used the CBOW model to identify near matches between two words $w_i$ and $w_j$. The similarity between $w_i$ and $w_j$ is obtained by comparing their vector representations $v_i$ and $v_j$ respectively. The similarity between $v_i$ and $v_j$ can be evaluated using the cosine similarity, the Euclidean distance, the Manhattan distance or any other similarity measure. For example, let " " (university), " " (evening) and " " (faculty) be three words. The similarity between them is measured by computing the cosine similarity between their vectors. The results show that the words " " (faculty) and " " (university) are semantically closer than " " (evening) and " " (university).
Footnote 1: https://sites.google.com/site/mohazahran/data
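The word-level comparison described in this section can be sketched in a few lines of Python. The three-dimensional vectors below are invented for illustration and keyed by the English glosses of the example words; they are not values from the Zahran et al. (2015) model, and returning 0 for a zero vector is a convention chosen here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "university": [0.9, 0.8, 0.1],
    "faculty": [0.8, 0.9, 0.2],
    "evening": [0.1, 0.2, 0.9],
}

sim_faculty = cosine(toy_vectors["faculty"], toy_vectors["university"])
sim_evening = cosine(toy_vectors["evening"], toy_vectors["university"])
print(sim_faculty > sim_evening)
```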

Sentences similarity
Let $S_1 = w_1, w_2, ..., w_i$ and $S_2 = w_1, w_2, ..., w_j$ be two sentences, each word being represented by its vector in the word embedding model. There exist several ways to compare two sentences. For this purpose, we have used four methods to measure the similarity between sentences. Figure 1 illustrates an overview of the procedure for computing the similarity between two candidate sentences in our system. In the following, we explain our proposed methods to compute the semantic similarity between sentences.

No Weighting Method
A simple way to compare two sentences is to sum their word vectors. In addition, this method can be applied to sentences of any size. The similarity between $S_1$ and $S_2$ is obtained by calculating the cosine similarity between $V_1$ and $V_2$, where:
$V_1 = \sum_{k=1}^{i} v_k$, $V_2 = \sum_{k=1}^{j} v_k$
with $v_k$ the vector of the $k$-th word $w_k$ of the sentence concerned.
For example, let $S_1$ and $S_2$ be two sentences: $S_1$ = " " (Joseph went to college).
The similarity between $S_1$ and $S_2$ is obtained as follows:
Step 1: Sum of the word vectors
Step 2: Calculate the similarity
The similarity between $S_1$ and $S_2$ is obtained by calculating the cosine similarity between $V_1$ and $V_2$: $sim(S_1, S_2) = \cos(V_1, V_2) = 0.71$
In order to improve the similarity results, we have used two weighting functions based on the Inverse Document Frequency (IDF) (Salton and Buckley, 1988) and Part-of-Speech tagging (POS tagging) (Schwab, 2005; Lioma and Blanco, 2009).
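The two steps of the no weighting method (sum the word vectors, then take the cosine) can be sketched as follows. This is a minimal illustration under stated assumptions: the two-dimensional embeddings are invented, sentences are given as token lists, and words absent from the embedding model are simply skipped, a case the text does not address.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_vector(words, embeddings, dim):
    """Step 1: V = sum of the word vectors v_k (unknown words are skipped)."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is not None:
            total = [t + x for t, x in zip(total, vec)]
    return total


def no_weighting_similarity(s1, s2, embeddings, dim):
    """Step 2: sim(S1, S2) = cos(V1, V2)."""
    return cosine(sentence_vector(s1, embeddings, dim),
                  sentence_vector(s2, embeddings, dim))


# Hypothetical toy embeddings, for illustration only.
toy = {"joseph": [1.0, 0.0], "went": [0.5, 0.5],
       "college": [0.0, 1.0], "university": [0.1, 0.9]}
score = no_weighting_similarity(["joseph", "went", "college"],
                                ["joseph", "went", "university"], toy, dim=2)
print(round(score, 3))
```

Because the sum is commutative, this representation ignores word order, which is one reason the per-word weights of the following methods matter.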

IDF Weighting Method
In this variant, the Inverse Document Frequency (IDF) concept is used to produce a composite weight for each word in each sentence. The idf weight serves as a measure of how much information the word provides: a term that occurs infrequently is good for discriminating between documents (in our case, sentences). This technique uses a large collection of documents (a background corpus), generally of the same genre as the input corpus that is to be semantically verified. In order to compute the idf weight of each word, we have used the BBC and CNN Arabic corpus (Saad and Ashour, 2010) as a background corpus. The idf of each word is determined by the formula $idf(w) = \log(S / W_S)$, where $S$ is the total number of sentences in the corpus and $W_S$ is the number of sentences containing the word $w$. The similarity between $S_1$ and $S_2$ is obtained by calculating the cosine similarity $\cos(V_1, V_2)$ between $V_1$ and $V_2$, where:
$V_1 = \sum_{k=1}^{i} idf(w_k) \cdot v_k$, $V_2 = \sum_{k=1}^{j} idf(w_k) \cdot v_k$
with $w_k$ the $k$-th word of the sentence concerned and $v_k$ its vector.
Step 1: Sum of vectors with IDF weights
Step 2: Calculate the similarity
The cosine similarity is applied to compute a similarity score between $V_1$ and $V_2$: $sim(S_1, S_2) = \cos(V_1, V_2) = 0.78$
We note that the similarity score between the two sentences is higher than with the previous method.
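As an illustration of the idf formula and of the IDF-weighted sum, the sketch below computes sentence-level idf values on a tiny invented background corpus. The helper names, the English toy corpus and the `default_idf` fallback for words unseen in the background corpus are assumptions; the text does not specify how such words are handled.

```python
import math


def idf_table(background_sentences):
    """idf(w) = log(S / W_S): S is the number of sentences in the corpus,
    W_S the number of sentences containing w."""
    total = len(background_sentences)
    sentence_counts = {}
    for sentence in background_sentences:
        for word in set(sentence):             # count each word once per sentence
            sentence_counts[word] = sentence_counts.get(word, 0) + 1
    return {w: math.log(total / c) for w, c in sentence_counts.items()}


def idf_weighted_vector(words, embeddings, idf, dim, default_idf=0.0):
    """V = sum of idf(w_k) * v_k. Words missing from `idf` receive
    `default_idf` (an assumption made for this sketch)."""
    total = [0.0] * dim
    for word in words:
        vec = embeddings.get(word)
        if vec is None:
            continue                            # word not in the embedding model
        weight = idf.get(word, default_idf)
        total = [t + weight * x for t, x in zip(total, vec)]
    return total


# Hypothetical four-sentence background corpus.
corpus = [["joseph", "went", "to", "college"],
          ["she", "went", "to", "work"],
          ["they", "went", "home"],
          ["college", "is", "far"]]
idf = idf_table(corpus)
print(idf["went"], idf["college"], idf["joseph"])
```

In this toy corpus the frequent word "went" receives the smallest weight and the rare word "joseph" the largest, which is the discriminating behaviour the method relies on.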

Part-of-speech weighting Method
An alternative technique is to apply Part-of-Speech tagging (POS tagging) to identify the words that are highly descriptive in each input sentence (Lioma and Blanco, 2009). For this purpose, we have used the POS tagger for the Arabic language proposed by G. Braham et al. (2012) to estimate the part of speech of each word in the sentence. Then, a weight is assigned to each type of tag in the sentence, for example verb = 0.4, noun = 0.5, adjective = 0.3, preposition = 0.1, etc.
The similarity between $S_1$ and $S_2$ is obtained in three steps (Ferrero et al., 2017) as follows:
Step 1: POS tagging. In this step, the POS tagger of G. Braham et al. (2012) is used to estimate the POS of each word in the sentence:
$Pos\_tag(S_1) = Pos_{w_1}, Pos_{w_2}, ..., Pos_{w_i}$
$Pos\_tag(S_2) = Pos_{w_1}, Pos_{w_2}, ..., Pos_{w_j}$
The function $Pos\_tag(S_i)$ returns, for each word $w_k$ in $S_i$, its estimated part of speech $Pos_{w_k}$.
Step 2: POS weighting. The weight of each part of speech can be fixed empirically; we relied on the training data of SemEval-2017 (Task 1) to fix the POS weights. The weighted sentence vectors are:
$V_1 = \sum_{k=1}^{i} Pos\_weight(Pos_{w_k}) \cdot v_k$, $V_2 = \sum_{k=1}^{j} Pos\_weight(Pos_{w_k}) \cdot v_k$
where $Pos\_weight(Pos_{w_k})$ is the function that returns the weight of the POS tag of $w_k$, and $v_k$ is the vector of $w_k$.
Step 3: Calculate the similarity. Finally, the similarity between $S_1$ and $S_2$ is obtained by calculating the cosine similarity between $V_1$ and $V_2$: $sim(S_1, S_2) = \cos(V_1, V_2)$.
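Step 2 above can be sketched as a weighted sum over (word, tag) pairs. The tagger itself is outside the scope of this sketch: the tags are supplied by hand, the tag labels are hypothetical rather than the tagger's actual tagset, and the weights are the example values quoted in the text rather than the tuned ones. Unknown tags fall back to `default_weight`, which is an assumption.

```python
# Example weights quoted in the text; the tag labels are hypothetical.
POS_WEIGHTS = {"verb": 0.4, "noun": 0.5, "adj": 0.3, "prep": 0.1}


def pos_weighted_vector(tagged_words, embeddings, dim,
                        pos_weights=POS_WEIGHTS, default_weight=0.0):
    """V = sum of Pos_weight(Pos_wk) * v_k over (word, tag) pairs."""
    total = [0.0] * dim
    for word, tag in tagged_words:
        vec = embeddings.get(word)
        if vec is None:
            continue                             # word not in the embedding model
        weight = pos_weights.get(tag, default_weight)
        total = [t + weight * x for t, x in zip(total, vec)]
    return total


# Hand-tagged toy sentence and invented 2-dimensional embeddings.
tagged = [("joseph", "noun"), ("went", "verb"), ("to", "prep"), ("college", "noun")]
toy = {"joseph": [1.0, 0.0], "went": [0.5, 0.5],
       "to": [0.2, 0.2], "college": [0.0, 1.0]}
print(pos_weighted_vector(tagged, toy, dim=2))
```

Step 3 then applies the cosine to the two vectors obtained this way, exactly as in the unweighted case.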

Example:
Let us continue with the same example, and suppose that the POS weights are:
POS tag | verb | noun | noun prop | adj | prep
Weight  | 0.4  | 0.5  | 0.7       | 0.3 | 0.1
Step 1: POS tagging. The function $Pos\_tag(S_i)$ is applied to each sentence.

Mixed weighting
We proposed another method after the competition, which uses both the IDF and the POS weightings simultaneously. The similarity between $S_1$ and $S_2$ is obtained as follows:
$V_1 = \sum_{k=1}^{i} idf(w_k) \cdot Pos\_weight(Pos_{w_k}) \cdot v_k$, $V_2 = \sum_{k=1}^{j} idf(w_k) \cdot Pos\_weight(Pos_{w_k}) \cdot v_k$
$sim(S_1, S_2) = \cos(V_1, V_2)$
If we apply this method to the previous example, using the same weights as in Sections 3.2 and 3.3, we obtain: $sim(S_1, S_2) = \cos(V_1, V_2) = 0.87$.
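One way to read the mixed method is that each word vector is scaled by the product of its idf and POS weights before summing. The sketch below implements that reading with a pluggable weight function; the product form, the helper names, the toy values and the zero fallback for unknown words or tags are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_vector(tagged_words, embeddings, dim, weight_fn):
    """V = sum of weight_fn(word, tag) * v_k over (word, tag) pairs."""
    total = [0.0] * dim
    for word, tag in tagged_words:
        vec = embeddings.get(word)
        if vec is None:
            continue
        weight = weight_fn(word, tag)
        total = [t + weight * x for t, x in zip(total, vec)]
    return total


def mixed_weight(idf, pos_weights):
    """Build a weight function returning idf(word) * Pos_weight(tag)."""
    def weight(word, tag):
        return idf.get(word, 0.0) * pos_weights.get(tag, 0.0)
    return weight


# Hypothetical toy values, for illustration only.
toy = {"joseph": [1.0, 0.0], "went": [0.5, 0.5],
       "college": [0.0, 1.0], "university": [0.1, 0.9]}
idf = {"joseph": 2.0, "went": 0.5, "college": 1.0, "university": 1.0}
pos_weights = {"verb": 0.4, "noun": 0.5}
w = mixed_weight(idf, pos_weights)

s1 = [("joseph", "noun"), ("went", "verb"), ("college", "noun")]
s2 = [("joseph", "noun"), ("went", "verb"), ("university", "noun")]
sim = cosine(weighted_vector(s1, toy, 2, w), weighted_vector(s2, toy, 2, w))
print(round(sim, 3))
```

Passing `lambda word, tag: 1.0` as `weight_fn` recovers the no weighting method, so the four variants differ only in the per-word weight.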

Preprocessing
In order to normalize the sentences for the semantic similarity step, a set of preprocessing operations is performed on the data set. All sentences went through the following steps:
1. Remove stop-words, punctuation marks, diacritics and non-letters.
2. Normalize  to  and  to .
3. Replace final  followed by  with .
4. Normalize numerical digits to Num.
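Steps 1 and 4 of the list above can be sketched with Python's standard `re` module. Steps 2 and 3 are omitted because they depend on specific letters; the stop-word list is a placeholder supplied by the caller; and treating the range U+064B to U+0652 as the Arabic diacritics to strip is an assumption of this sketch.

```python
import re

ARABIC_DIACRITICS = re.compile(r"[\u064B-\u0652]")   # fathatan .. sukun (assumed range)
NON_LETTER = re.compile(r"[^\w\s]|_")                # punctuation and other symbols
NUMBER = re.compile(r"\d+")


def preprocess(sentence, stop_words):
    """Strip diacritics and punctuation, drop stop-words, map numbers to 'Num'."""
    # Diacritics are removed first: they are not word characters, so the
    # punctuation pass below would otherwise split words around them.
    text = ARABIC_DIACRITICS.sub("", sentence)
    text = NON_LETTER.sub(" ", text)
    tokens = []
    for token in text.split():
        if NUMBER.fullmatch(token):
            tokens.append("Num")
        elif token not in stop_words:
            tokens.append(token)
    return tokens


# English toy input with a hypothetical stop-word list.
print(preprocess("He bought 12 books, yesterday!", {"He"}))
```

Note that a decimal such as 3.5 is split at the separator by this sketch and yields two Num tokens; a real pipeline may prefer to handle such cases before removing punctuation.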

Tests and Results
To evaluate the performance of our system, our four approaches were assessed on the 250 sentence pairs of the STS 2017 Monolingual Arabic Evaluation Sets v1.1. We calculate the Pearson correlation between our assigned semantic similarity scores and human judgements. The results are presented in Table 1. These results indicate that when the no weighting method is used, the correlation reaches 59.57%. Both the IDF weighting and POS tagging approaches significantly improve the correlation, to more than 73% (73.09% and 74.63% respectively). The Mixed method achieves the best correlation (76.67%) of the different techniques involved in the Arabic monolingual pairs STS task.
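For reference, the Pearson correlation used as the evaluation measure can be computed as in the sketch below. The score lists are invented and are not the evaluation data. Since Pearson's r is unchanged by a positive rescaling of either variable, cosine scores do not need to be mapped to the 0 to 5 range of the human judgements before being correlated.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical system scores (cosine values) and gold scores (0 to 5 scale).
system_scores = [0.91, 0.35, 0.62, 0.78, 0.12]
gold_scores = [4.6, 1.2, 3.0, 3.8, 0.4]
r = pearson(system_scores, gold_scores)
print(round(r, 4))
```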

Conclusion and Future Work
In this article, we presented an innovative word-embedding-based system to measure semantic relations between Arabic sentences. This system is based on the semantic properties of words captured by the word embedding model. In order to make further progress in the analysis of semantic sentence similarity, this article showed how IDF weighting and Part-of-Speech tagging are used to support the identification of the words that are highly descriptive in each sentence. In the experiments, we have shown how these techniques improve the correlation results. The performance of our proposed system was confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgements. As future work, we plan to combine these methods with other classical NLP techniques such as n-grams, fingerprinting and linguistic resources.