Classifying Arab Names Geographically

Different names may be popular in different countries; hence, a person's name may give a clue to their country of origin. Along with other features, mapping names to countries can be helpful in a variety of applications, such as country tagging of Twitter users. This paper describes a collection of Arabic Twitter user names, written either in Arabic script or transliterated into Latin characters, along with the users' stated geographical locations. To classify previously unseen names, we trained naive Bayes and Support Vector Machine (SVM) multi-class classifiers using primarily bag-of-words features. We are able to map Arabic user names to specific Arab countries with 79% accuracy and to specific regions (Gulf, Egypt, Levant, Maghreb, and others) with 94% accuracy. For transliterated Arabic names, the accuracy per country and per region was 67% and 83% respectively. The approach is generic and language independent, and can be used to collect and classify names for other countries or regions; considering language-dependent name features (such as compound names and person titles) yields better results.


Introduction
Geo-locating tweets and tweeps (Twitter users) has captured significant attention in recent years. Geographical information is important for many applications such as transliteration, social studies, directed advertisement, dialect identification, and Automatic Speech Recognition (ASR), among others. In social studies, researchers may be interested in studying the views and opinions of tweeps from specific geographical locations. Similarly, tweets can offer a tool for linguists to study different linguistic phenomena. For ASR, training language models using dialectal Arabic tweets associated with different regions of the Arab world was shown to reduce the recognition error rate for dialectal Egyptian Arabic by 25% (Ali et al., 2014).
Previous work has looked at a variety of features that may geo-locate tweets and tweeps, such as the dialect of tweets, words appearing in tweets, a tweep's social network, etc. In this work we examine the power of tweep names in predicting a tweep's location or region of origin. We define geographic units at two different levels, namely the country level and the region level. Country-level geographic units are defined by political boundaries regardless of the size and proximity of different geographic entities. Thus, Qatar and Bahrain, as well as Lebanon and Syria, are considered different units. At the region level, we conflate nearby countries into regions. Conflation was guided by previous work on dialects, where dialects were categorized into five regional language groups, namely Egyptian (EGY), Maghrebi (MGR), Gulf (Arabian Peninsula) (GLF), Iraqi (IRQ), and Levantine (LEV) (Zbib et al., 2012; Cotterell et al., 2014). Sometimes the Iraqi dialect is considered one of the Gulf dialects (Cotterell et al., 2014); in this paper we consider Iraq a part of the Gulf region.
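For concreteness, the country-to-region conflation can be sketched as a simple lookup table. Iraq is grouped with the Gulf as stated above; the ISO-style country codes and the exact membership of the "other" group are illustrative assumptions, not taken from the paper:

```python
# Conflate Arab countries into the dialect-based regions described above.
# Iraq (IQ) is grouped with the Gulf, as stated in the text; other
# assignments follow common dialect groupings and are assumptions.
COUNTRY_TO_REGION = {
    "EG": "EGY",                                          # Egypt
    "SA": "GLF", "KW": "GLF", "QA": "GLF", "BH": "GLF",   # Gulf
    "AE": "GLF", "OM": "GLF", "YE": "GLF", "IQ": "GLF",   # Iraq -> Gulf
    "SY": "LEV", "LB": "LEV", "JO": "LEV", "PS": "LEV",   # Levant
    "MA": "MGR", "DZ": "MGR", "TN": "MGR", "LY": "MGR",   # Maghreb
    "SD": "OTHER",                                        # others
}

def region_of(country_code):
    """Map a country code to its dialect region (None if unknown)."""
    return COUNTRY_TO_REGION.get(country_code)
```

A region-level classifier can then reuse country-level labels by passing them through this lookup.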
Thus the goal of this work is to build a classifier that can predict a tweep's country/region of residence/origin. To build the classifier, we obtained tweep names and their self-declared locations from Twitter. Many tweeps use pseudonyms, such as "white knight", or fake and irregular names, such as "in phantasmagoria" or "Eastern Province". Hence, identifying fake tweep names may be necessary, and locations need to be mapped to countries. We built multiple classifiers using either a naive Bayes or a Support Vector Machine (SVM) classifier with bag-of-words features, namely word unigrams. We also considered improvements that entailed using character n-gram features and word position weighting. We tried to collect tweets for all 22 Arab countries, but we did not find Arabic tweets from Mauritania, Somalia, Djibouti, or Comoros. The contributions of this paper are:
1. We show that we can use Twitter as a source for collecting person names for different Arab countries by mapping user locations to Arab countries.
2. We show that we can build a classifier of Arabic names at the country level or region level with reasonable accuracy.
3. We show the characteristics of Arabic names and how they differ among countries and regions.
The paper is organized as follows: Section 2 surveys previous work on person name classification; Section 3 describes some features of Arabic names, including dialectal variation in transliteration; Section 4 describes how names are collected from Twitter, cleaned, and classified; Section 5 shows the results of the name classification experiments; and Section 6 presents conclusions and future work.

Previous Work
The problem of classifying names at the country level is not well explored, and as far as we know, there are no studies on Arabic person name classification. Some work has been done on clustering and classifying person names by origin. Fei et al. (2005) used the LDC bilingual person name lists to build a name clustering and classification framework. They considered that several origins may share the same pattern of transliteration and applied their technique to a name transliteration task by building letter n-gram language models for source and target languages. They clustered names into typical origin clusters (English, Chinese, German, Arabic, etc.).
Balakrishnan (2006) extracted a list of person names from the employee database of a multinational organization covering 9 countries: US, UK, France, Germany, Canada, Japan, Italy, India, and China. An equal number of names (1,000) was chosen from each country. He used pattern search for first and second names, with k-nearest neighbor and Levenshtein edit distance to measure the distance between two names. He reported a classification accuracy of 0.67 with a supervised training set and 0.63 with an unsupervised one.
Fu et al. (2010) noted that humans often correctly identify the origins of person names, and that there seem to be distinctive patterns in names that distinguish origins. They constructed an ontology containing all linguistic knowledge that can directly contribute to language origin identification, and employed it for the analysis of name structure. They reported an average performance of 87.54% using a maximum entropy (ME)-based language identifier for 8 languages (Arabic, Chinese, English, French, German, Japanese, Russian, and Spanish-Portuguese).
Rao et al. (2010) classified latent user attributes, including gender, age, and regional origin, using features such as n-gram models and the number of followers/followees (social graph information), among others. Mahmud et al. (2012) collected tweets using the geo-tag filter option on Twitter until they received tweets from 100 unique users in each of the top 100 US cities. They used this corpus to infer users' home locations at the city level and reported a recall of 0.7 for the 100 cities.
Huang et al. (Huang et al., 2014) discussed the challenges of detecting the nationality of Twitter users using profile features and they studied the effectiveness of different features for inferring nationalities. They reported an accuracy of 83.8% for these nationality groups: Qatari, Arabs, Western, Southeast Asia, Indian, and Others. They mentioned that due to the unbalanced data distribution, the performance of less populated groups is not very high. We observe similar results in this paper.

Compound Names
A single Arabic name is typically made up of one word, but sometimes it may be composed of two or three words. We refer to single names with more than one word as "compound names". Some words, such as (Allh, meaning "God") and (Aldyn, meaning "religion"), trail other words, as in (Ebd Allh, meaning "slave of Allah"), constructing the name "Abdullah", and (SlAH Aldyn, meaning "perfection of religion"), constructing the name "Salahudin" (Saladin). In some countries, father and family names are often preceded by words meaning "son of", such as (bn).
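The compound-name patterns above can be captured with a small joining routine. The word lists here are limited to the examples given in the text, and the function name is a hypothetical helper, not from the paper:

```python
# Words that attach to the preceding token (e.g. "Allh", "Aldyn") and
# words that attach to the following token (e.g. "bn", "son of").
TRAILING = {"Allh", "Aldyn"}   # join with the word before them
LEADING = {"bn"}               # join with the word after them

def join_compounds(tokens):
    """Merge single-name tokens into compound names (Buckwalter forms)."""
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in TRAILING and out:
            out[-1] = out[-1] + " " + tok          # e.g. "Ebd" + "Allh"
        elif tok in LEADING and i + 1 < len(tokens):
            out.append(tok + " " + tokens[i + 1])  # e.g. "bn" + next name
            i += 1
        else:
            out.append(tok)
        i += 1
    return out
```

Treating each joined span as one single name keeps compounds like "Ebd Allh" intact as a single unigram feature.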

Dialectal Variations of Names
Names in Arabic are normally written without diacritics; when they are transliterated, these hidden diacritics surface, along with dialectal differences in pronunciation among countries, as shown in Table 2 (Buckwalter transliteration is used exclusively in this paper). Since we classify names written in both Arabic and Latin scripts, spelling variations can perhaps help ascertain the country/region of origin.

Religion and Gender
Names can also be indicative of other attributes such as religion and gender. For example, the names ($nwdp - "Shnouda"), (Ebd AlHsyn - "Abdul Hussein"), and (Emr - "Omar") are typically Coptic, Shia, and Sunni respectively. As for gender, feminine names frequently end with (p, A', Y, A), as in (FaTmp - "Fatima") and (hnA' - "Hannah"), while second names, whether father or family names, are mostly masculine. Though guessing a tweep's religion and gender is interesting, it is beyond the scope of this paper.


Data Collection
Twitter user profiles contain user-declared information such as the Twitter account name, screen name (user name), user location, and description. User names are normally written in Arabic or Latin characters, and user locations are written in full or abbreviated, formal or informal forms, etc., as shown in Figure 1. We used the Twitter4J interface to the Twitter API to collect Arabic tweets during the whole of March 2014, searching with the query "lang:ar", which matches any Arabic tweet. In all, we collected 175 million tweets authored by 5.5 million unique tweeps. We used the users' self-declared locations to map them to countries. We mapped the locations using the GeoNames geographical database, which contains 8M place names, along with a database of the 10,000 most commonly used user locations on Twitter (Mubarak et al., 2014). If a location referred to two or more different countries, as in "UK and Kuwait", it was removed. User locations were successfully mapped to one of the Arab countries for 1M unique user names. After name cleaning (described later in this section), we have 170K Names_arb and 182K Names_trans that are considered valid names mapped to exactly one country.
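The location-to-country mapping can be sketched as follows, with a toy gazetteer standing in for the GeoNames and Twitter-locations databases (the entries and function are illustrative assumptions); locations matching more than one country are discarded, as described above:

```python
# Toy stand-in for the GeoNames / common-Twitter-locations lookup.
GAZETTEER = {
    "kuwait": "KW",
    "cairo": "EG",
    "uk": "GB",
    "riyadh": "SA",
}

def map_location(location):
    """Return a single country code, or None if unknown or ambiguous."""
    countries = {GAZETTEER[w] for w in location.lower().split()
                 if w in GAZETTEER}
    if len(countries) == 1:
        return countries.pop()
    # Unknown, or refers to two+ countries (e.g. "UK and Kuwait").
    return None
```

A real implementation would also match multi-word place names and abbreviations rather than single tokens.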
Per-country distributions are shown in Figure 2 and Figure 3. One interesting observation from these figures is that people from Saudi Arabia (SA) form the majority in both cases and tend to write their names in Arabic, while people from Egypt (EG) tend to write their names transliterated. We opted not to limit our collection to tweeps with geo-tagged tweets (tweets with latitude and longitude), because geo-tagged tweets represent less than 1% of the total number of tweets; we found that 0.3% of the collected tweets were geo-tagged. Table 3 shows some examples of the collected names. We took samples of 200 random names from each set and found that 70% of the names are real person names and the rest are fake. We plan to separate fake names from real ones in future work.
Name cleaning included removing words composed of single letters, characters outside the Arabic and Latin alphabets, entries consisting of a single word, and entries containing stopwords. Names were normalized in the manner described by Darwish et al. (2012), which involved removing diacritics and kashidas; normalizing the different forms of alef, ya and alef maqsoura, and ha and ta marbouta; and mapping letters from other languages that use the Arabic script, such as Farsi, to Arabic letters. Further, titles, such as Dr., and numbers were removed. We also identified compound names as described earlier.
For example, the user name "Dr. Abdullah Bin Fahad AL MUTAIRI1973" will be normalized to "abdullah bin fahad al mutairi".
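The cleaning of Latin-script names can be sketched as follows (the title list is an illustrative assumption; normalization of Arabic-script forms such as alef, ya, and ta marbouta would be handled analogously):

```python
import re

# Illustrative title list; the paper's actual list is not specified.
TITLES = {"dr.", "dr", "mr.", "mr", "eng.", "prof."}

def clean_name(name):
    """Lowercase a Latin-script name, drop numbers, titles, and
    single-letter tokens."""
    name = name.lower()
    name = re.sub(r"\d+", "", name)               # drop numbers (e.g. "1973")
    tokens = [t for t in name.split() if t not in TITLES]
    tokens = [t for t in tokens if len(t) > 1]    # drop single letters
    return " ".join(tokens)
```

Applied to the example above, this yields "abdullah bin fahad al mutairi".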

(Table 3: examples of collected user names with real/unreal judgments.)

Name Classification Experiments
Given the 170K Names_arb and 182K Names_trans that we collected, we randomly split each set into 80/20 training and testing splits. We used word unigrams as features. We also examined giving first and last names different weights, and using character trigrams as a back-off for unseen words. Further, we trained two classifiers, namely a naive Bayes classifier and an SVM classifier. For the naive Bayes classifier, when a name was not observed during training, either in general or for a class, we used the KenLM language modeling toolkit to compute a smoothing probability for it (Heafield, 2011). Our baseline involved tagging all test items with the majority class, meaning every tweep would be assigned to SA at the country level and to the Gulf at the region level. Table 4 shows the baseline results. Table 5 and Table 6 show the results for Names_arb per country and per region respectively, using word unigrams only. Similarly, Table 7 and Table 8 show the results for Names_trans per country and per region respectively, using word unigrams only. Micro and macro averages refer to computing metrics per test example or averaging per-country results, respectively. As can be seen, the naive Bayes classifier performed better than the SVM classifier for the vast majority of countries and in overall accuracy and F-measure. The SVM classifier mostly had higher precision at lower recall.
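The basic unigram setup can be illustrated with a toy naive Bayes classifier. The example names and labels below are invented, and simple add-one smoothing stands in for the KenLM smoothing used in the paper:

```python
import math
from collections import Counter, defaultdict

def train(names, labels):
    """Count class priors and per-class word-unigram frequencies."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for name, label in zip(names, labels):
        for w in name.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(model, name):
    """Pick the class with the highest log posterior under add-one
    smoothing (a simplification of the paper's KenLM smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for c in class_counts:
        lp = math.log(class_counts[c] / total)
        n_c = sum(word_counts[c].values())
        for w in name.split():
            lp += math.log((word_counts[c][w] + 1) / (n_c + len(vocab) + 1))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

A shared family name ("al otaibi" below) pulls an unseen first name toward the right country:

```python
model = train(["mohammed al otaibi", "ahmed hassan",
               "rami haddad", "fatima al qahtani"],
              ["SA", "EG", "LEV", "SA"])
print(predict(model, "khaled al otaibi"))  # -> SA
```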
In further experiments, we exclusively used the naive Bayes classifier. We tried two modifications of the classifier. The first involved giving different weights to the different single names in the full name, such that a person's last name gets a higher weight than his/her first name. The intuition is that different countries may have different common family names that indicate place of origin, family, or tribe. The weight of a word based on its position is determined by the following formula: w_i = 1/(n - i + 1), where i ranges from 1 to n, the number of single names in the full name. Thus the last single name gets a weight of 1, and the preceding single names get weights of 1/2, 1/3, etc. (from end to beginning).
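The position weighting described above (the last single name weighted 1, the preceding ones 1/2, 1/3, and so on toward the beginning) can be sketched as follows; the function name is a hypothetical helper:

```python
def weighted_names(full_name):
    """Pair each single name with its position weight 1/(n - i + 1),
    so the last single name gets weight 1, the one before it 1/2, etc."""
    names = full_name.split()
    n = len(names)
    return [(w, 1.0 / (n - idx + 1)) for idx, w in enumerate(names, start=1)]
```

For "abdullah bin fahad", this yields weights 1/3, 1/2, and 1 for the three single names respectively.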
The second modification entailed using a character trigram model as a back-off for out-of-vocabulary words that were not seen during training. We used KenLM to train a character trigram model on all the names in the training set (Heafield, 2011). Table 9 and Table 10 compare the plain Bayesian classifier against the classifier with single-name weighting and character trigram back-off for Names_arb at the country and region levels respectively. Table 11 and Table 12 compare the same for Names_trans. As the results show, both methods improved overall accuracy, with consistent improvements in precision and improvements in recall most of the time. Single-name weighting had the greater effect on precision.

Conclusion and Future Work
In this paper, we presented our work on classifying person names by country or region of origin. To construct training data, we collected the names of Twitter users who authored Arabic tweets, along with their self-declared locations, which we mapped to Arab countries and regions. We experimented with Bayesian and SVM classifiers, and the Bayesian classifier outperformed the SVM classifier most of the time. Adding position information and backing off to a character trigram model for names not observed during training generally improved results. Classifying user names at the region level generally yielded better results than at the country level.
Because the majority of user names written in Arabic come from the Gulf region (93%), the improvement over the majority-class baseline was modest. However, when we applied the same approach to classifying transliterated user names, accuracy increased by 52% and 20% at the country and region levels respectively, and F-measure increased by 135% and 46% at the country and region levels respectively.
In the future, we want to incorporate the user name feature in conjunction with other features for geo-locating Twitter users. We also plan to test our classifier on names collected from outside Twitter for each country, to explore other ways of collecting user names from regions such as the Maghreb, and to detect more information from user profiles, such as gender and religion.