Arabizi sentiment analysis based on transliteration and automatic corpus annotation

Arabizi is a form of writing Arabic text that relies on Latin letters, numerals and punctuation rather than Arabic letters. In the literature, the difficulties associated with Arabizi sentiment analysis have been underestimated, principally because of the complexity of Arabizi. In this paper, we present an approach to automatically classify the sentiment of Arabizi messages as positive or negative. In the proposed approach, Arabizi messages are first transliterated into Arabic. Afterwards, we classify the sentiment of the transliterated corpus using an automatically annotated training corpus. For validation, shallow machine learning algorithms such as Support Vector Machine (SVM) and Naive Bayes (NB) are used. Experimental results show that the NB algorithm outperforms all the others. The highest achieved F1-score reaches 78% and 76% on the manually and automatically transliterated datasets, respectively. Ongoing work aims at improving the transliteration module and the annotated sentiment dataset.


Introduction
Sentiment analysis (SA), also called opinion mining, is the field of study that analyzes people's opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes; it represents a large problem space (Liu, 2012). To determine whether a document or a sentence expresses a positive or negative sentiment, three main approaches are commonly used: the lexicon-based approach (Taboada et al., 2011), the machine learning (ML) based approach (Maas et al., 2011) and the hybrid approach (Khan et al., 2015). English has the greatest number of sentiment analysis studies, while research is more limited for other languages, including Arabic and its dialects (Alayba et al., 2017; Guellil and Boukhalfa, 2015).
ML-based sentiment analysis is the dominant approach in the literature, but it requires annotated training data. One of the major problems in processing Arabic and its dialects is the lack of resources. Another prominent problem is the non-standard romanization (called Arabizi) that Arabic speakers often use in social media. Arabizi uses Latin letters, numerals and punctuation to write Arabic words (for example, the word "mli7", which combines Latin letters and a numeral, is the romanized form of an Arabic word meaning "good"). To the best of our knowledge, limited work has been conducted on sentiment analysis of Arabizi (Duwairi et al., 2016; Guellil et al., 2018). The reason behind this lack of contributions is the complexity of Arabizi. Most research efforts therefore transform Arabizi into Arabic before further processing. This transformation is known as transliteration: the process of converting a text written in one script or alphabet into another (Guellil et al., 2017c; Kaur and Singh, 2014). To bridge this gap, this paper proposes an approach that determines the sentiment of Arabizi messages after transliterating them. The remainder of this paper is organized as follows. Section 2 presents an overview of Arabizi. Section 3 presents related work on SA and machine transliteration (MT). Section 4 presents the proposed approach and its components. Section 5 presents the experiments. Finally, Section 6 concludes with some future directions.

Arabizi: An overview
Arabic speakers on social media, discussion forums, Short Messaging Systems (SMS), and online chat applications often use a non-standard romanization called "Arabizi" (Darwish, 2013; Bies et al., 2014). For example, the sentence "rani fer7ana" (meaning "I am happy") is written in Arabizi. Hence, Arabizi is Arabic text written using Latin characters, numerals and some punctuation (Darwish, 2013). The challenge behind Arabizi is the presence of many forms of the same word. For example, it has been argued that the word meaning "God willing" can be written in 69 different ways.
Related work

Machine learning Arabic sentiment analysis

ML-based sentiment analysis requires annotated data. Among the corpora presented in the literature and focused on MSA, we cite: LABR (Aly and Atiya, 2013), AWATIF (Abdul-Mageed and Diab, 2012), ASTD (Nabil et al., 2015) and ArTwitter (Abdulla et al., 2013). LABR contains 63,257 comments annotated with stars ranging from 1 to 5. AWATIF is a multi-genre corpus containing 10,723 sentences manually annotated as objective or subjective. ASTD contains 10,000 Arabic tweets classified as objective, subjective positive, subjective negative or subjective mixed. ArTwitter contains 2,000 tweets manually annotated as positive or negative. However, most of the aforementioned corpora rely on costly manual annotation, and almost none of these resources are publicly available. In addition, the constructed corpora are dedicated to certain dialects and neglect others (especially Maghrebi dialects such as Moroccan or Algerian).

Arabizi Transliteration
The proposed approach is inspired by the work presented in (van der Wees et al., 2016), where the authors used a table extracted from Wikipedia for the passage from Arabizi to Arabic. The originality of our transliteration approach compared to this work is the treatment of the ambiguities related to Arabizi transliteration, namely: (a) ambiguity of the vowels, where each vowel can be replaced by different Arabic letters or by the NULL character; (b) ambiguity of characters having the same or similar sounds, for example the letters 's' and 'c', which can be replaced by two different Arabic letters; and (c) ambiguity related to the transliteration direction: unlike the works in (Guellil et al., 2017c,b), the passage rules we define go from Arabizi to Arabic, since the reverse direction may cause several ambiguities. The proposed approach is also inspired by the works presented in (Guellil et al., 2017c,b; Nouvel et al.), which use a language model to determine the best possible candidate for an Arabizi word. However, those works rely on a parallel corpus consisting of messages transliterated from Arabizi to Arabic. Such a corpus is usually built manually, which is very time- and effort-consuming. Hence, we avoid using a parallel corpus between Arabizi and Arabic and instead apply a language model (based on a large corpus extracted from social media) to extract the best candidate.
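
The candidate-selection idea can be sketched as a simple unigram language model over a large corpus. This is a simplification (the paper does not specify the order of its language model), and `build_unigram_lm`, `best_candidate` and the toy corpus are hypothetical names and data:

```python
from collections import Counter

def build_unigram_lm(corpus_tokens):
    """Count word frequencies in a (large) Arabic social-media corpus."""
    return Counter(corpus_tokens)

def best_candidate(candidates, lm):
    """Pick the candidate with the highest corpus frequency."""
    return max(candidates, key=lambda w: lm.get(w, 0))

# Toy romanized corpus standing in for the large Facebook corpus.
corpus = "mlyh mlyh krht hyaty mlyh krht".split()
lm = build_unigram_lm(corpus)
print(best_candidate(["mlih", "mlyh", "mleh"], lm))  # → mlyh
```

In practice the candidates would be Arabic strings produced by the passage rules, and the language model would be trained on the millions of Arabic-script messages described in the experimental setup.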

Arabizi Sentiment Analysis
Different works have been proposed for handling Arabizi (Darwish, 2013; Guellil and Azouaou, 2016). However, to the best of our knowledge, limited work has been conducted on sentiment analysis of Arabizi (Duwairi et al., 2016; Guellil et al., 2018). In (Duwairi et al., 2016), the authors apply a transliteration step before proceeding to sentiment classification. However, their approach presents two major drawbacks: (1) it relies on a very basic table for the passage from Arabizi to Arabic, which cannot handle Arabizi ambiguities; (2) it uses a small, manually constructed annotated corpus (containing 3,026 messages) of Arabizi messages, which are then transliterated into Arabic. In (Guellil et al., 2018), the authors automatically construct an annotated Arabizi sentiment corpus and apply sentiment classification directly, without any transliteration step. However, they faced several ambiguity problems, which resulted in a low F1-score of 66%. In contrast, the purpose of our paper is to present an approach to Arabizi sentiment analysis that includes a transliteration step. The sentiment analysis (training) corpus contains Arabic messages (Modern Standard Arabic (MSA) and Dialectal Arabic (DA), especially Algerian dialect) and is constructed automatically. For the transliteration step, this paper focuses on the treatment of ambiguities (especially vowels).

Methodology
This paper presents an approach for Arabizi sentiment analysis. Figure 1 summarizes the main steps of the proposed approach:

• Automatic construction of an Arabic sentiment lexicon
• Automatic annotation of Arabic messages
• Arabizi transliteration
• Sentiment classification of Arabic messages

Automatic construction of Arabic sentiment lexicon
In this study, the sentiment lexicon is constructed by translating an existing English lexicon, namely SOCAL (Taboada et al., 2011), into Arabic. We opt for SOCAL rather than other lexicons such as SentiWordNet (Baccianella et al., 2010) or SentiStrength (Thelwall et al., 2010) because SOCAL contains a large number of terms and, in this study, we focus not on the context of a term but only on its global valence. The terms are translated using the Glosbe API (https://glosbe.com/en/arq/excellent), which takes an English word as input and returns a set of equivalents in other languages. In this work, we focus on Arabic and its dialects (MSA + dialect). We chose this API because, to the best of our knowledge, it is the only API dealing with resource-scarce dialects such as the Algerian dialect. After the automatic translation, each translated word is assigned the score of the English word from which it was translated. For example, all the translations of the English word 'excellent' (score +5), such as (bAhy), (lTyf) and (mlyH), are assigned a score of +5. In this way, 6,769 terms were obtained, including negative sentiment terms (scores ranging from -1 to -5) and positive terms (scores ranging from +1 to +5). Since some Arabic sentiment words result from different English words with different sentiment scores, such Arabic words are assigned the average of those scores. Lastly, the resulting lexicon was manually reviewed to retain only correct sentiment words. The final lexicon contains 1,745 terms (in Algerian dialect), of which 968 are negative, 6 are neutral and 771 are positive. We apply our approach to the Algerian dialect in order to compare our results with those obtained in (Guellil et al., 2018).
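
The score-transfer and averaging procedure can be sketched as follows. Here `translate` is a hypothetical stand-in for the Glosbe API call, and the toy lexicon entries (romanized, since the Arabic script cannot be reproduced here) are invented:

```python
from collections import defaultdict

def translate(english_word):
    """Hypothetical stand-in for the Glosbe API lookup."""
    toy = {"excellent": ["bAhy", "lTyf", "mlyH"],
           "nice":      ["lTyf"]}
    return toy.get(english_word, [])

socal = {"excellent": 5, "nice": 3}  # tiny SOCAL-like English lexicon

# Each translation inherits the score of its English source word.
scores = defaultdict(list)
for en_word, score in socal.items():
    for ar_word in translate(en_word):
        scores[ar_word].append(score)

# Arabic words reached from several English words get the average score.
lexicon = {w: sum(s) / len(s) for w, s in scores.items()}
print(lexicon["lTyf"])  # reached from 'excellent' (+5) and 'nice' (+3) → 4.0
```

The manual review step that filters out incorrect translations is omitted here, as it has no algorithmic content.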

Automatic Annotation of Arabic messages
The constructed lexicon is used to automatically assign a sentiment score to Arabic messages and thus to build a large sentiment corpus. To calculate the score, we consider: (1) opposition; (2) multi-word expressions (because the constructed lexicon contains multi-word entries); (3) Arabic morphology, handled by a simple rule-based light stemmer that strips Arabic prefixes and suffixes; (4) negation, which can reverse polarity. Negation in some Arabic dialects is usually expressed as an attached prefix, a suffix, or a combination of both.
To score a message, the sentiment scores of all words in the message are averaged. Finally, a balanced dataset is constructed by keeping the same number of positive and negative messages. The resulting corpus contains 255,008 messages (127,504 in each of the positive and negative sets).
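
A minimal sketch of this scoring scheme, using a toy lexicon and a single illustrative romanized negation prefix ('ma'); the function names and data are hypothetical, and the opposition, multi-word and stemming rules are omitted for brevity:

```python
NEGATION_PREFIXES = ("ma",)  # illustrative romanized negation prefix (hypothetical)

def word_score(word, lexicon):
    """Look a word up, reversing polarity when a negation prefix is attached."""
    if word in lexicon:
        return lexicon[word]
    for prefix in NEGATION_PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in lexicon:
            return -lexicon[word[len(prefix):]]
    return 0.0

def message_score(message, lexicon):
    """Average the sentiment scores of all words in the message."""
    words = message.split()
    return sum(word_score(w, lexicon) for w in words) / len(words)

lex = {"mlyh": 3.0, "khayb": -2.0}
print(message_score("rani mamlyh", lex))  # average of 0 and -3.0 → -1.5
```

A message whose averaged score is positive goes into the positive set, and vice versa; the balancing step then truncates the larger set.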

Arabizi Transliteration
The proposed transliteration approach includes four important steps: (1) preprocessing of the Arabic corpus and the Arabizi messages; (2) definition and application of passage rules for Algerian Arabizi; (3) generation of the different candidates; (4) extraction of the best candidate. This component receives as input a set of messages written in Arabizi and a voluminous corpus written in DA extracted from Facebook. All these messages are preprocessed (e.g. removing character exaggeration). Afterwards, a set of passage rules is applied (e.g. the letter 'a' can be replaced by an Arabic letter, or by nothing when it represents a diacritic). By applying the different replacements and rules, each Arabizi word generates several Arabic candidates. For example, the word "kraht" generates 32 possible candidates and the word "7iati" generates 16, with exactly one correct transliteration in each case. To extract the best candidate for the transliteration of a given Arabizi word into Arabic, a language model is constructed and applied.
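
The candidate-generation step can be sketched as follows, assuming a toy subset of replacement rules written with romanized placeholder symbols (the real rules map to Arabic letters, which cannot be reproduced here); `RULES` and `candidates` are illustrative names:

```python
from itertools import product

# Illustrative subset of the Arabizi→Arabic passage rules, with romanized
# placeholders standing in for Arabic letters.
RULES = {
    "7": ["H"],        # '7' has a single replacement
    "a": ["A", ""],    # a vowel may map to a letter or to nothing (diacritic)
    "i": ["y", ""],
    "t": ["t", "p"],
}

def candidates(arabizi_word):
    """Expand every character by its possible replacements (Cartesian product)."""
    options = [RULES.get(ch, [ch]) for ch in arabizi_word]
    return ["".join(combo) for combo in product(*options)]

cands = candidates("7iati")
print(len(cands))  # 1 * 2 * 2 * 2 * 2 = 16 candidates, as for "7iati" above
```

The full rule set is larger (covering every ambiguous Arabizi character), which is why "kraht" yields 32 candidates in the paper; the language model then selects one candidate from this set.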

Sentiment classification of Arabic messages
In this paper, different classification models are compared. For vectorization, document embeddings (the Doc2vec algorithm presented in (Le and Mikolov, 2014)) are used with default parameters. Both methods presented in (Le and Mikolov, 2014) were applied: (1) Distributed Memory (PV_DM) and (2) Distributed Bag of Words (PV_DBOW).
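
As an illustrative sketch of the classification stage (not the paper's exact pipeline): the paper vectorizes with Doc2vec, but the following minimal example substitutes bag-of-words counts so that it stays self-contained, and the toy romanized messages and labels are invented:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Invented toy training data (1 = positive, 0 = negative).
train_msgs = ["rani ferhan bezaf", "mlyh bezaf", "khayb bezaf", "rani hzin"]
train_labels = [1, 1, 0, 0]

# Bag-of-words counts stand in for the Doc2vec document embeddings.
X = CountVectorizer().fit_transform(train_msgs)

# Compare the two shallow classifiers evaluated in the paper.
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X, train_labels)
    print(type(clf).__name__, clf.score(X, train_labels))
```

In the paper's setting, `X` would instead be the PV_DM, PV_DBOW, or concatenated PV_DBOW + PV_DM document vectors of the 255,008 automatically annotated messages.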

Experimental Setup
The proposed approach is applied to a Maghrebi dialect (Algerian Arabizi), which suffers from a lack of available tools and other resources required for automatic sentiment analysis. The Algerian dialect (DALG) is presented at length in (Meftouh et al., 2012). However, the resources dedicated to the treatment of MSA cannot be directly applied to DALG. In this context, two large corpora were extracted from Facebook using RestFB. The first, extracted in September 2017, contains 8,673,285 messages, of which 3,668,575 are written in Arabic letters. The second, extracted in November 2017, contains 15,407,910 messages, of which 7,926,504 are written in Arabic letters. The first corpus was used for the transliteration task and the second for the sentiment annotation task. For testing our transliteration approach, we used Corpus_50, a part of Cottrell's corpus used in (Guellil et al., 2017c,b,a). For testing our sentiment analysis approach, we used Corpus_500, an Algerian Arabizi corpus annotated in (Guellil et al., 2018), containing 250 positive and 250 negative messages.

Experimental results
The first experiment evaluates the transliteration module. The transliteration of Corpus_50 achieves an accuracy of up to 74.76% (compared with 45.35% in (Guellil et al., 2017c)). These results show the efficacy of the proposed transliteration approach. For sentiment analysis, we used Corpus_500. This dataset was transliterated automatically with the transliteration module. To validate the quality of the automatic transliteration, the dataset was also transliterated manually by native speakers of the Algerian dialect. The automatic transliteration of this dataset achieves an accuracy of up to 72.05%. Afterwards, we carried out two types of experiments: (1) SA on the test corpus transliterated automatically and (2) SA on the test corpus transliterated manually. Table 1 presents the performance of the different shallow classification algorithms in terms of Precision (P), Recall (R) and F1-score (F1) for the Doc2vec methods (PV_DBOW, PV_DM and PV_DBOW + PV_DM) on the Tr_automatic and Tr_manual datasets (the datasets transliterated automatically and manually, respectively).

Results and errors analysis
Based on the experiments and analysis, four major observations emerge: (1) the results on Tr_manual are slightly better than those on Tr_automatic (because a transliteration mistake generally affects only one letter); (2) the PV_DBOW implementation of Doc2vec achieved the best results; (3) among the classifiers, NB performed best; (4) the results presented in Table 1 largely outperform those presented in (Guellil et al., 2018) (which reach at most 66%). However, we were not able to compare our results with those presented in (Duwairi et al., 2016) because their data are not available. The most common errors are as follows: • The principal error in the transliteration process is related to the technique for choosing the best candidate. The idea of the language model is to select the candidate with the highest number of occurrences; however, in some cases this technique returns an incorrect candidate. For example, the word "rakom", meaning "you are", is transliterated as the word meaning "a number" rather than as its correct transliteration. A solution to this problem is to integrate other parameters, such as distance, when determining the best candidate.
• Some sentiment classification errors are due to transliteration errors. For example, "khlwiya", meaning good and quiet, is wrongly transliterated to a word meaning "empty". Improving the transliteration will therefore improve the sentiment classification.
• Other sentiment classification errors are due to errors in the automatically annotated (training) corpus. For example, a message meaning "Djabou, the excellence of the name is sufficient" was annotated as negative when it is positive. Manually reviewing the automatic annotation will definitely improve the results.
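
One hypothetical way to realize the suggested "distance" parameter is to combine corpus frequency with an edit-distance penalty when ranking candidates; the function names, weight, counts and romanized placeholder strings below are all illustrative assumptions, not the paper's method:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, single-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def rank(candidates, source, lm_counts, weight=1.0):
    """Score = corpus frequency minus a weighted distance to the source form."""
    return max(candidates,
               key=lambda c: lm_counts.get(c, 0) - weight * edit_distance(c, source))

# Romanized toy example: the frequent 'rqm' ("a number") no longer beats the
# closer 'rakm' ("you are") once the distance penalty is applied.
lm = {"rqm": 50, "rakm": 10}
print(rank(["rqm", "rakm"], "rakom", lm, weight=25.0))  # → rakm
```

With frequency alone, the more common but incorrect candidate wins; the distance term rewards candidates whose form stays close to the source word, which is one way to address the "rakom" error described above.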