Toward Micro-Dialect Identification in Diaglossic and Code-Switched Environments

Although the prediction of dialects is an important language processing task, with a wide range of applications, existing work is largely limited to coarse-grained varieties. Inspired by geolocation research, we propose the novel task of Micro-Dialect Identification (MDI) and introduce MARBERT, a new language model with striking abilities to predict a fine-grained variety (as small as that of a city) given a single, short message. For modeling, we offer a range of novel spatially and linguistically-motivated multi-task learning models. To showcase the utility of our models, we introduce a new, large-scale dataset of Arabic micro-varieties (low-resource) suited to our tasks. MARBERT predicts micro-dialects with 9.9% F1, ∼76× better than a majority class baseline. Our new language model also establishes a new state-of-the-art on several external tasks.


Introduction
Sociolinguistic research has shown how language varies across geographical regions, even for areas as small as different parts of the same city (Labov, 1964; Trudgill, 1974). These pioneering studies often used fieldwork data from a handful of individuals and focused on small sets of carefully chosen features, often phonological. Inspired by this early work, researchers have used geographically tagged social media data from hundreds of thousands of users to predict user location (Paul and Dredze, 2011; Amitay et al., 2004; Han et al., 2014; Rahimi et al., 2017; Huang and Carley, 2019b; Tian et al., 2020; Zhong et al., 2020) or to develop language identification tools (Lui and Baldwin, 2012; Zubiaga et al., 2016; Jurgens et al., 2017a; Dunn and Adams, 2020). Whether it is possible at all to predict the micro-varieties of the same general language is a question that remains, to the best of our knowledge, unanswered. In this work, our goal is to investigate this specific question by introducing the novel task of Micro-Dialect Identification (MDI). Given a single sequence of characters (e.g., a single tweet), the goal of MDI is to predict the particular micro-variety (defined at the level of a city) of the community of users to which the posting user belongs. This makes MDI different from geolocation in at least two ways: in geolocation, (1) a model consumes a collection of messages (e.g., 8-85 messages in popular datasets (Huang and Carley, 2019a)) and (2) predicts the location of the posting user (i.e., user-level). In MDI, a model takes as input a single message and predicts the micro-variety of that message (i.e., message-level).
While user location and micro-dialect (MD) are conceptually related (e.g., with a tag such as Seattle for the first, and Seattle English for the second), they arguably constitute two different tasks. This is because location is an attribute of a person who authored a Wikipedia page (Overell, 2009) or posted on Facebook (Backstrom et al., 2010) or Twitter (Han et al., 2012), whereas MD is a characteristic of language within a community of speakers who, e.g., use similar words to refer to the same concepts in the real world or pronounce certain sounds in the same way. To illustrate, consider a scenario where the same user is at different locations at different times. While a geolocation model is required to predict these different locations (for that same person), an MDI model takes as its target predicting the same micro-variety for texts authored by the person (regardless of the user location). After all, while the language of a person can, and does, change when they move from one region to another, such a change takes time.
Concretely, although to collect our data we use location as an initial proxy for user MD, we do not simply exploit data where some number n of posts (usually n = 10) came from the location of interest (as is usually the case in geolocation). Rather, to the extent it is possible, we take the additional necessary step of manually verifying that a user does live in a given region, and has not moved from a different city or country (at least recently, see Section 2). We hypothesize that, if we are able to predict user MD based on such data, we will have empirical evidence suggesting MD does exist and can be detected. As it turns out, while it is almost impossible for humans to detect MD (see Section 2.3 for a human annotation study), our models predict varieties as small as those of cities surprisingly well (9.9% F1, ∼76× higher than a majority class baseline, based on a single, short message) (Sections 6 and 7).
Context. MDI can be critical for multilingual NLP, especially for social media in global settings. In addition to potential uses to improve machine translation, web data collection and search, and pedagogical applications (Jauhiainen et al., 2019), MDI can be core for essential real-time applications in health and well-being (Paul and Dredze, 2011), recommendation systems (Quercia et al., 2010), event detection (Sakaki et al., 2010), and disaster response (Carley et al., 2016). Further, as technology continues to play an impactful role in our lives, access to nuanced NLP tools such as MDI becomes an issue of equity (Jurgens et al., 2017b). The great majority of the world's currently known 7,111 living languages (source: https://www.ethnologue.com), however, are not NLP-supported. This limitation also applies to closely related languages and varieties, even those that are widely spoken.
We focus on one such situation for the Arabic language, a large collection of similar varieties with ∼400 million native speakers. For Arabic, currently available NLP tools are limited to the standard variety of the language, Modern Standard Arabic (MSA), and a small set of dialects such as Egyptian, Levantine, and Iraqi. Varieties comprising dialectal Arabic (DA) differ amongst themselves and from MSA at various levels, including phonological and morphological (Watson, 2007), lexical (Salameh and Bouamor, 2018; Abdul-Mageed et al., 2018; Qwaider et al., 2018), syntactic (Benmamoun, 2011), and sociological (Bassiouney, 2020). Most main Arabic dialects, however, remain understudied. The situation is even more acute for MDs, where very limited knowledge (if at all) currently exists. The prospect of research on Arabic MDs is thus large.
A major limitation to developing robust and equitable language technologies for Arabic language varieties has been the absence of large, diverse data. A number of pioneering efforts, including shared tasks (Zampieri et al., 2014; Malmasi et al., 2016; Zampieri et al., 2018), have been invested to bridge this gap by collecting datasets. However, these works either depend on automatic geocoding of user profiles (Abdul-Mageed et al., 2018), which is not quite accurate, as we show in Section 2; use a small set of dialectal seed words as a basis for the collection (Zaghouani and Charfi, 2018; Qwaider et al., 2018), which limits text diversity; or are based on translation of a small dataset of sentences rather than naturally-occurring text (Salameh and Bouamor, 2018), which limits the ability of resulting tools. The recent Nuanced Arabic Dialect Identification (NADI) shared task (Abdul-Mageed et al., 2020a) aims at bridging this gap.
In this work, following Gonçalves and Sánchez (2014), Doyle (2014), and Sloan and Morgan (2015), among others, we use location as a surrogate for dialect to build a very large-scale Twitter dataset (∼6 billion tweets), and automatically label a subset of it (∼500M tweets) with coverage for all 21 Arab countries at the nuanced levels of state and city (i.e., micro-dialects). In a departure from geolocation work, we then manually verify user locations, excluding ∼37% of users. We then exploit our data to develop highly effective hierarchical and multi-task learning models for detecting MDs.
Other motivations for choosing Arabic as the context for our work include that (1) Arabic is a diaglossic language (Ferguson, 1959; Bassiouney, 2020) with a so-called 'High' variety (used in educational and formal settings) and 'Low' variety (used in everyday communication). This allows us to exploit diaglossia in our models. In addition, (2) for historical reasons, different people in the Arab world code-switch into different foreign languages (e.g., English in Egypt, French in Algeria, Italian in Libya). This affords investigating the impact of code-switching on our models, thereby bringing yet another novelty to our work. Further, (3) while recent progress in transfer learning using language models such as BERT (Devlin et al., 2018) has proved strikingly useful, Arabic remains dependent on multilingual models such as mBERT, trained on the restricted Wikipedia domain with limited data. Although an Arabic-focused language model, AraBERT (Baly et al., 2020), was recently introduced, it is limited to MSA rather than dialects. This makes AraBERT sub-optimal for social media processing, as we show empirically. We thus present a novel Transformer-based Arabic language model, MARBERT, for MDs. Our new model exploits a massive dataset of 1B posts, and proves very powerful: it establishes new SOTA on a wide range of tasks. Given the impact self-supervised language models such as BERT have made, our work has the potential to be a key milestone in Arabic (and perhaps multilingual) NLP.
To summarize, we offer the following contributions: (1) We collect a massive dataset from Arabic social media and exploit it to develop a large human-labeled corpus for Arabic MDs.
(2) For modeling, we introduce a novel, spatially motivated hierarchical attention multi-task learning (HA-MTL) network that is suited to our tasks and that proves highly successful. (3) We then introduce linguistically guided multi-task learning models that leverage the diaglossic and code-switching environments in our social data. (4) We offer a new, powerful Transformer-based language model trained with self-supervision for Arabic MDs. (5) Using our powerful model, we establish new SOTA results on several external tasks.
The rest of the paper is organized as follows: In Section 2, we introduce our methods for data collection and annotation. Section 3 covers our experimental datasets. We present our methods and various models in Section 4, and our new micro-dialectal language model, MARBERT, in Section 5. We investigate model generalization in Section 6, and the impact of removing MSA from our data in Section 7. Section 8 discusses our findings. We compare to other works in Section 9, review related literature in Section 10, and conclude in Section 11.

Data Acquisition and Labeling
We first acquire a large user-level dataset covering the whole Arab world. We then use information in user profiles (available only for a subset of users) to automatically assign city, state, and country labels to each user. Since automatic labels can be noisy (e.g., due to typos in city names, use of different languages in user profiles), we manually fix resulting errors. To further account for issues with human mobility (e.g., a user from one country moving to another), we manually inspect user profiles, tweets, and network behavior and verify assigned locations. Finally, we propagate city, state, and country labels from the user to the tweet level (each tweet gets the label assigned to its user). We now describe our data methods in detail.

A Large User-Level, Tagged Collection
Figure 1: All 21 Arab countries in our data, with states demarcated in thin black lines within each country. All 319 cities from our user location validation study, shown as colored circles, are overlaid within their respective states.
To develop a large-scale dataset of Arabic varieties, we use the Twitter API to crawl up to 3,200 tweets from each of ∼2.7 million users collected from Twitter with bounding boxes around the Arab world. Overall, we acquire ∼6 billion tweets. We then use the Python geocoding library geopy to identify user location in terms of countries (e.g., Morocco) and cities (e.g., Beirut). Geopy (https://github.com/geopy) is a client for several popular geocoding web services aimed at locating the coordinates of addresses, cities, countries, and landmarks across the world using third-party geocoders; in particular, we use the Nominatim geocoder for OpenStreetMap data (https://nominatim.openstreetmap.org). With Nominatim, geopy depends on user-provided geographic information in Twitter profiles, such as names of countries or cities, to assign user location. Out of the 2.7 million users, we acquired both 'city' and 'country' labels for ∼233K users, who contribute ∼507M tweets. The total number of cities initially tagged was 705, but we manually map these to 646 after correcting several mistakes in the results returned by geopy (more information about manual correction of city tags is in Section A.1 in the Appendix). Geopy also returned a total of 235 states/provinces that correspond to the 646 cities, which we also manually verified. We found all state names to be correct and to correspond to their respective cities and countries.
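To make the geocoding step concrete, the following is a minimal sketch (not our exact pipeline) of resolving a free-text profile location with geopy's Nominatim geocoder; the address-component keys used below vary by region and are illustrative:

```python
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="mdi-data-collection")

def resolve_profile_location(profile_text):
    """Map a free-text Twitter profile location to (city, state, country)."""
    hit = geolocator.geocode(profile_text, language="en", addressdetails=True)
    if hit is None:
        return None
    addr = hit.raw.get("address", {})
    # Nominatim may return "town" or "village" instead of "city"
    city = addr.get("city") or addr.get("town") or addr.get("village")
    return city, addr.get("state"), addr.get("country")

print(resolve_profile_location("Beirut"))  # e.g., ('Beirut', ..., 'Lebanon')
```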

Validation of User Location
Even after manually correcting location labels, it cannot be guaranteed that a user actually belongs to (i.e., is a local of) the region (city, state, and country) they were assigned. Hence, we manually verify user location through an annotation task. To the extent it is possible, this helps us avoid assigning false MD labels to users whose profile information was captured correctly in the previous step but who are in fact not locals of the automatically labeled places. Before location verification, we exclude cities that have < 500 tweets and users with < 30 tweets from the data. This initially gives us 417 cities. We then ask two native Arabic annotators to consider the automatic label for each task (city and country; note that we have already manually established the link between states and their corresponding cities and countries) and assign one label from the set {local, non-local, unknown} per task for each user in the collection. We provide annotators with links to users' Twitter profiles, and instruct them to base their decisions on each user's network and posting content and behavior. As a result, we found that 81.00% of geopy tags for country are correct, but only 62.29% for city. This validates the need for manual verification. Ultimately, we could verify a total of 3,085 users for both country and city, covering all 21 countries but only 319 cities (more information about manual user verification is in Section A.2 of the Appendix). Figure 1 shows a map of all 21 Arab countries, each divided into its states, with cities overlaid as small colored circles.

Can Humans Detect Micro-Dialect?
We were curious to know whether humans can identify micro-dialect from a single message, and so we performed a small annotation study. We extracted a random set of 1,050 tweets from our labeled collection and asked two native speakers from two non-neighboring Arab countries to tag each tweet with a country and then (choosing from a drop-down menu) a state label. Annotators found the state-level task very challenging (or rather "impossible", to quote one annotator), and so we did not complete it. Hence, we also did not go down to the level of city, since it became clear it would be almost impossible for humans. For country, annotators reported trying to identify larger regions first (e.g., Western Arab world countries) and then picking a specific country (e.g., Morocco). To facilitate the task, we asked annotators to assign an "unknown" tag when unsure. We calculated inter-annotator agreement and found it at Cohen's Kappa (Cohen, 1960) K = 0.16 ("poor" agreement). When we computed agreement on the subset of data where both annotators assigned an actual country label (i.e., rather than "unknown"; n = 483 tweets), we found the Kappa (K) to increase to 0.47 ("moderate" agreement). Overall, the annotation study emphasizes the challenges humans face when attempting to identify dialects (even, sometimes, at the level of country).
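For reference, agreement figures like the above can be computed as in the following sketch, where annotator labels are assumed to be country names or "unknown":

```python
from sklearn.metrics import cohen_kappa_score

def agreement(ann1, ann2):
    """Cohen's Kappa overall and on the subset where both gave a real label."""
    overall = cohen_kappa_score(ann1, ann2)
    keep = [i for i, (a, b) in enumerate(zip(ann1, ann2))
            if a != "unknown" and b != "unknown"]
    resolved = cohen_kappa_score([ann1[i] for i in keep],
                                 [ann2[i] for i in keep])
    return overall, resolved
```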

Datasets
Preprocessing. To keep only high-quality data, we remove all retweets, reduce all consecutive sequences of the same character to only 2, replace usernames with <USER> and URLs with <URL>, and remove all tweets with fewer than 3 Arabic words. This gives ∼277.4K tweets. We tokenize input text only lightly, by splitting off punctuation. Ultimately, we extract the following datasets for our experiments:

Micro-Ara (Monolingual). Extracted from our manually verified users, this is our core dataset for modeling. We randomly split it into 80% training (TRAIN), 10% development (DEV), and 10% test (TEST). To limit the GPU time needed for training, we cap the number of tweets in TRAIN for any given country at 100K. We describe the distribution of classes in Micro-Ara in Tables B.1 and B.2 in the Appendix; we note that Micro-Ara is reasonably balanced. Table 1 shows our data splits.
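A minimal sketch of the preprocessing steps described above follows; the exact regular expressions we used may differ slightly (e.g., in how retweets, usernames, and URLs are matched):

```python
import re

ARABIC_WORD = re.compile(r"[\u0621-\u064A]+")  # Arabic letters (hamza..yeh)

def preprocess(tweet):
    if tweet.startswith("RT "):                    # drop retweets
        return None
    tweet = re.sub(r"http\S+", "<URL>", tweet)     # mask URLs
    tweet = re.sub(r"@\w+", "<USER>", tweet)       # mask usernames
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)   # squeeze char runs to 2
    if len(ARABIC_WORD.findall(tweet)) < 3:        # keep >= 3 Arabic words
        return None
    return tweet
```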
CodSw (Code-Switching). As explained in Section 1, Arabic speakers code-switch to various foreign languages. We hypothesize the distribution of foreign languages will vary across different regions (which proves to be true, as we show in Figure 2), thereby providing modeling opportunities that we capture in a multi-task setting (in Section 4.5). Hence, we introduce a code-switching dataset (CodSw) by tagging the non-Arabic content in all tweets in our wider collection with the langid tool (Lui and Baldwin, 2012). Keeping only tweets with at least 3 Arabic words and at least 4 non-Arabic words, we acquire ∼ 934K tweets. CodSw is diverse, with a total of 87 languages. We split CodSw as is shown in Table 1.
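An illustrative filter for building CodSw is sketched below, with langid applied to the non-Arabic portion; the token-level details (e.g., restricting foreign words to Latin script) are assumptions for the sketch, not our exact procedure:

```python
import re
import langid

ARABIC_TOKEN = re.compile(r"^[\u0621-\u064A]+$")
LATIN_TOKEN = re.compile(r"^[A-Za-z]+$")

def codsw_filter(tweet):
    """Return the embedded language if the tweet qualifies for CodSw."""
    tokens = tweet.split()
    arabic = [t for t in tokens if ARABIC_TOKEN.match(t)]
    foreign = [t for t in tokens if LATIN_TOKEN.match(t)]
    if len(arabic) >= 3 and len(foreign) >= 4:
        lang, _score = langid.classify(" ".join(foreign))
        return lang  # e.g., 'en', 'fr', 'it'
    return None
```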
DiaGloss (Diaglossia). We also explained in Section 1 that Arabic is a diaglossic language, with MSA as the "High" variety and dialects as the "Low" variety. MDs share various linguistic features (e.g., lexica) with MSA, but to varying degrees. We use an auxiliary task whose goal is to tease apart MSA from dialectal varieties. We use the existence of diacritics (at least 5) as a proxy for MSA (unlike dialects, MSA is usually diacritized), and direct responses (vs. tweets or retweets) as a surrogate for dialectness. In each class, we keep 500K tweets, for a total of 1M tweets, split as in Table 1. We refer to this dataset as DiaGloss.
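The two DiaGloss proxies can be sketched as follows; the is_reply flag stands in for however direct responses are detected in the tweet metadata and is an assumed field:

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # fathatan .. sukun

def diagloss_label(text, is_reply):
    """Proxy labels: >= 5 diacritics -> MSA; direct response -> dialect."""
    if len(DIACRITICS.findall(text)) >= 5:
        return "msa"
    if is_reply:
        return "dialect"
    return None  # leave unlabeled when neither proxy fires
```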

Methods
BiGRUs and BERT. We perform dialect identification at the country, state, and city levels. We use two main neural network methods: Gated Recurrent Units (GRUs) (Cho et al., 2014), a variation of Recurrent Neural Networks (RNNs), and Google's Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018). We model each task independently, but also under multi-task conditions.

Multi-Task Learning. Multi-Task Learning (MTL) is based on the intuition that many real-world tasks involve predictions about closely related labels or outcomes. For related tasks, MTL helps achieve inductive transfer between the various tasks by leveraging additional sources of information from some of the tasks to improve performance on the target task (Caruana, 1993). By using training signals for related tasks, MTL allows a learner to prefer hypotheses that explain more than one task (Caruana, 1997) and also helps regularize models. In some of our models, we leverage MTL by training a single network for our city, state, and country tasks, where network layers are shared but each of the 3 tasks has an independent output.
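A minimal PyTorch sketch of this hard parameter sharing follows; attention pooling is simplified to mean pooling here, so this is an illustration of the shared-trunk idea rather than our exact architecture:

```python
import torch
import torch.nn as nn

class MTLDialect(nn.Module):
    """One shared BiGRU trunk, one output head per task."""
    def __init__(self, vocab_size, n_city, n_state, n_country,
                 emb_dim=300, hidden=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.trunk = nn.GRU(emb_dim, hidden, num_layers=3,
                            batch_first=True, bidirectional=True)
        self.heads = nn.ModuleDict({
            "city": nn.Linear(2 * hidden, n_city),
            "state": nn.Linear(2 * hidden, n_state),
            "country": nn.Linear(2 * hidden, n_country),
        })

    def forward(self, token_ids):
        states, _ = self.trunk(self.emb(token_ids))
        pooled = states.mean(dim=1)  # stand-in for attention pooling
        return {task: head(pooled) for task, head in self.heads.items()}

# training minimizes the sum of the three cross-entropy losses
```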

Baselines
For all our experiments, we remove diacritics from the input text. We use two baselines: the majority class in TRAIN (Baseline I) and a single-task BiGRU (Baseline II, described below). For all our experiments, we tune model hyper-parameters and identify best architectures on DEV. We run all models for 15 epochs (unless otherwise indicated), with an early stopping 'patience' of 5 epochs, choosing the model with the highest performance on DEV as our best model. We then run each best model on TEST, and report accuracy and macro F1 score.

Single-Task BiGRUs. As a second baseline (Baseline II), we build 3 independent networks (one for each of the 3 tasks) using the same architecture and model capacity. Each network has 3 hidden BiGRU layers, with 1,000 units each. More information about each of these networks and how we train them is in Section C.1 in the Appendix. Table 2 presents our results on TEST.
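Throughout the paper, accuracy and macro F1 are computed in the standard way, e.g. with scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    return {"acc": accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro")}
```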

Multi-Task BiGRUs
With MTL, we design a single network to learn the 3 tasks simultaneously. In addition to our hierarchical attention MTL (HA-MTL) network, we design two architectures that differ as to how we endow the network with the attention mechanism. We describe these next. We provide illustrations of our MTL networks in Figures C.1 and C.2 in the Appendix.
Shared and Task-Specific Attention. We first design networks with attention at the same level in the architecture, using the same hyper-parameters as the single-task networks. We have two configurations:

Shared Attention. This network has 3 hidden BiGRU layers, each with 1,000 units (500 in each direction). All 3 layers are shared across the 3 tasks, including the third layer. Only the third layer has attention applied. We call this setting MTL-common-attn.

Task-Specific Attention. This network is similar to the previous one in that the first two hidden layers are shared, but differs in that the third (attention) layer is task-specific (i.e., independent for each task). We call this setting MTL-spec-attn. This architecture allows each task to specialize its own attention within the same network. As Table 2 shows, both MTL-common-attn and MTL-spec-attn improve over each of the two baselines (with the first performing generally better).

Hierarchical Attention MTL (HA-MTL)
Instead of 'flat' attention, we turn to hierarchical attention (spatially motivated, e.g., by how a smaller region is part of a larger one): We design a single network for the 3 tasks but with supervision at different layers. Overall, the network has 4 BiGRU layers (each with a total of 1,000 units), the bottom-most of which has no attention. Layers 2, 3, and 4 each have dot-product attention applied, followed directly by one task-specific fully-connected layer with softmax for class prediction. We experiment with two supervision orders, city-first and country-first; in both scenarios, state is supervised at the middle layer. These two architectures allow information to flow at different granularities: While the city-first network tries to capture first what is, in the physical world, the more fine-grained level (city), the country-first network does the opposite. Again, we use the same hyper-parameters as the single-task and MTL networks, but with a dropout rate of 0.70, which we find to work better. As Table 2 shows, our proposed HA-MTL models significantly outperform the single-task and other BiGRU MTL models. They also outperform Baseline II by 12.36%, 10.01%, and 13.22% acc on city, state, and country prediction, respectively, demonstrating their effectiveness on the task.
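The following PyTorch sketch illustrates the city-first variant (supervision at layers 2, 3, and 4 for city, state, and country, respectively); the dot-product attention pooling here is a simplified stand-in and may differ in detail from our implementation:

```python
import torch
import torch.nn as nn

class HAMTL(nn.Module):
    """City-first HA-MTL: supervise city at layer 2, state at 3, country at 4."""
    def __init__(self, vocab_size, n_city, n_state, n_country,
                 emb_dim=300, hidden=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        in_dims = [emb_dim, 2 * hidden, 2 * hidden, 2 * hidden]
        self.grus = nn.ModuleList(
            nn.GRU(d, hidden, batch_first=True, bidirectional=True)
            for d in in_dims)
        self.heads = nn.ModuleList(
            nn.Linear(2 * hidden, n) for n in (n_city, n_state, n_country))

    @staticmethod
    def attend(h):
        # dot-product self-attention followed by pooling over time steps
        scores = torch.softmax(torch.bmm(h, h.transpose(1, 2)), dim=-1)
        return torch.bmm(scores, h).mean(dim=1)

    def forward(self, token_ids):
        h, _ = self.grus[0](self.emb(token_ids))   # layer 1: no attention
        logits = []
        for gru, head in zip(self.grus[1:], self.heads):
            h, _ = gru(h)                          # layers 2-4
            logits.append(head(self.attend(h)))    # per-layer task supervision
        return logits                              # [city, state, country]
```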

Single-Task BERT
We use the BERT-Base, Multilingual Cased model released by the authors. For fine-tuning, we use a maximum sequence length of 50 words and a batch size of 32. We set the learning rate to 2e-5 and train for 15 epochs, as mentioned earlier. As Table 2 shows, BERT performs consistently better on the three tasks. It outperforms the best of our two HA-MTL networks by 5.32% (city), 5.10% (state), and 3.38% (country) acc. To show how a small network (and hence one deployable on machines with limited capacity, with quick inference time) can be trained on knowledge acquired by a bigger one, we distill (Hinton et al., 2015; Tang et al., 2019) the BERT representation (big) into a BiGRU (small). We provide related results in Section C.2 in the Appendix.
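A sketch of that distillation objective follows: teacher logits from mBERT are matched by the student BiGRU under mean-squared error, per Tang et al. (2019). The teacher call assumes a Hugging Face-style classification model and is illustrative:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits):
    # student is trained to regress the teacher's logits (Tang et al., 2019)
    return F.mse_loss(student_logits, teacher_logits)

def distill_step(student, teacher, student_batch, teacher_batch):
    with torch.no_grad():
        teacher_logits = teacher(**teacher_batch).logits  # frozen teacher
    student_logits = student(student_batch)
    return distillation_loss(student_logits, teacher_logits)
```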

Multi-Task BERT
We investigate two linguistically-motivated auxiliary tasks trained with BERT, as follows:

Exploiting Diaglossia. As Table 2 shows, a diaglossia-based auxiliary task improves over single-task BERT for both city (0.55% acc) and country (0.56% acc).

Exploiting Code-Switching. We run 4 experiments with our CodSw dataset, as follows: (1) with the two tasks supervised at the city level, (2) at the country level, and (3 & 4) with the levels reversed (city-country vs. country-city). Although the code-switching dataset is automatically labeled, we find that supervising with its country-level labels helps improve our MDI on city (0.14% acc) and on country (0.87% acc). Related results are shown in Table 2.
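Both auxiliary setups can be sketched as a shared encoder with one head per task, with batches from the two tasks interleaved and their losses summed; the names below (and the use of the Hugging Face API) are illustrative assumptions rather than our exact implementation:

```python
import torch.nn as nn
from transformers import AutoModel

class MTLBert(nn.Module):
    """Shared BERT encoder, main MDI head + auxiliary (e.g., CodSw) head."""
    def __init__(self, model_name, n_main, n_aux):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.main_head = nn.Linear(hidden, n_main)  # e.g., city labels
        self.aux_head = nn.Linear(hidden, n_aux)    # e.g., embedded language

    def forward(self, input_ids, attention_mask, task="main"):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] representation
        head = self.main_head if task == "main" else self.aux_head
        return head(cls)
```

We now describe our new language model, MARBERT.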

MARBERT: A New Language Model
We introduce MARBERT, a new language model trained with self-supervision on 1B tweets from our unlabeled Twitter collection (described in Section 2). We train MARBERT with a 100K WordPiece vocabulary, for 14 epochs, with a batch size of 256 and a maximum sequence length of 128. Training took 14 days on 8 Google Cloud TPUs. We use the same network architecture as mBERT, but without the next-sentence prediction objective, since tweets are short. MARBERT is trained on a much larger token count than BERT (Devlin et al., 2018) (15.6B vs. 3.3B) and on 5× bigger data than AraBERT (Baly et al., 2020) (126GB vs. 24GB of text). Unlike AraBERT, which is focused on MSA, MARBERT has diverse coverage of dialects in addition to MSA. As Table 3 shows, MARBERT significantly outperforms all other models across the 3 tasks, with improvements of 4.13% and 3.54% acc over mBERT and AraBERT, respectively, on the country task. We also run a set of MTL experiments with MARBERT, fine-tuning it with a diaglossia auxiliary task, a code-switching auxiliary task, and both together. As Table 3 shows, MTL brings no sizeable acc improvements, helping the country task only slightly (0.30% acc gain with CodSw). This reflects MARBERT's already-powerful representation, which leaves little need for MTL. We provide an error analysis of MARBERT's MDI in Section E.1 of the Appendix.

Model Generalization
For the experiments reported thus far, we have split our Micro-Ara (monolingual) dataset randomly at the tweet level. While this helped us cover the full list of our 319 cities, including cities from which we have as few as a single user, this split does not prevent tweets from the same user from being divided across TRAIN, DEV, and TEST. In other words, while the tweets across the splits are unique (not shared), the users who posted them are not. We hypothesize this may have the consequence of allowing our models to acquire knowledge about user identity (idiolect) that interacts with our classification tasks. To test this hypothesis, we run a set of experiments with different data splits where users in TEST are never seen in TRAIN. To allow the model to see enough users during training, we split the data only into TRAIN and TEST and use no DEV set. We use the same hyper-parameters identified in previous experiments, with the exception of the number of epochs, where we report the best epoch identified on TEST. To alleviate the concern about the absence of a DEV set, we run each experiment 3 times, each time choosing a different TEST set, and average performance over the 3 TEST sets; this is generally similar to cross-validation. For this set of experiments, we first remove all cities from which we have only one user (79 cities) and run experiments across 3 different settings (narrow, medium, and wide). We provide a description of these 3 settings in Section D.1 in the Appendix. For the narrow setting only, we also run with the same code-switching and diaglossic auxiliary tasks (individually and combined) as before. We use mBERT fine-tuned on each respective TRAIN as our baseline for the current experiments.
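A user-disjoint split of the kind used here can be obtained by grouping tweets by user ID, e.g. as in the following sketch (field names illustrative):

```python
from sklearn.model_selection import GroupShuffleSplit

def user_disjoint_split(tweets, labels, user_ids, test_size=0.1, seed=0):
    """Split so that no user contributes tweets to both TRAIN and TEST."""
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(gss.split(tweets, labels, groups=user_ids))
    return train_idx, test_idx
```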
As Table 4 shows, MARBERT significantly (p < 0.01) outperforms the strong mBERT baseline across the 3 settings. With the narrow setting on MDs, MARBERT reaches 8.12% F1 (61 cities). These results drop to 5.81% (for medium, 116 cities) and 3.59% (for wide, 240 cities). We also observe a positive impact from the combined code-switching and diaglossic auxiliary tasks. All results are also several folds better than a majority class city baseline (city of Abu Dhabi, UAE; not shown in Table 4). For example, results acquired with the two (combined) auxiliary tasks are 4.7× better in acc and 229× better in F1 than the majority class.
Importantly, although results in Table 4 are not comparable to those described in Section 5 (due to the different data splits), these results suggest that our powerful Transformer models in Section 5 may have made use of user-level information (which may have inflated performance). To further investigate this issue in a reasonably comparable setup, we apply the models based on the narrow, medium, and wide settings and our single-task MARBERT model (shown in Table 3) to a completely new test set. This additional evaluation iteration, which we describe in Section D.2 of the Appendix, verifies the undesirable effect of sharing users between the data splits. For this reason, we strongly advise against sharing users across data splits for tweet-level tasks, even if the overall dataset involves several thousand users.

Impact of Removing MSA
Our dataset comprises posts either solely in MSA or in MSA mixed with dialectal content. Since MSA is shared across different regions, filtering it out is likely to enhance system ability to distinguish posts across our 3 classification tasks. We test (and confirm) this hypothesis by removing MSA from both TRAIN and TEST in our narrow setting data (from Section 6) and fine-tuning MARBERT on the resulting ('dialectal') data only. To filter out MSA, we apply an in-house MSA vs. dialect classifier (acc = 89.1%, F1 = 88.6%) to the data and remove tweets predicted as MSA; more information about the MSA vs. dialect model is in Section D.3 of the Appendix. We cast a more extensive investigation of the interaction between dialects and MSA vis-a-vis our classification tasks, including based on manually-filtered MSA, as future research. As Table 5 shows, this procedure results in higher performance across the 3 classification tasks (MARBERT is significantly better than mBERT, with p < 0.03 for city, p < 0.01 for state, and p < 0.0004 for country). For micro-dialects, performance is at 14.09% acc. and 9.87% F1. Again, this performance is much better (3.3× better acc and 75.9× better F1) than the majority class baseline (city of Abu Dhabi, UAE, in 2 of our 3 runs).

Table 5: Performance on predicted dialectal data.

Discussion
As we showed in Sections 6 and 7, our models are able to predict variation at the city level significantly better than all competitive baselines. This is the case even when we do not remove MSA content, though better results are acquired after removing it. A question arises as to whether the models are indeed capturing micro-linguistic variation between the different cities, or simply depending on different topical and named-entity distributions in the data. To answer this question, we visualize attention in ∼250 examples from our TEST set using one of our MARBERT-narrow models reported in Table 4. Our analysis reveals that the model does capture micro-dialectal variation. We provide two example visualizations in Section E in the Appendix demonstrating the model's micro-dialectal predictive power. Still, we also observe that the model makes use especially of names of places. For this reason, we believe future research should control for topical and named-entity cues in the data.
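For reference, per-head attention maps of the kind visualized in Section E can be extracted as sketched below, using the Hugging Face API as an assumed stand-in for our tooling:

```python
import torch
from transformers import AutoModel, AutoTokenizer

def attention_maps(model_name, text):
    """Return per-layer attention tensors (batch, heads, seq, seq) and tokens."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, output_attentions=True)
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.attentions, tok.convert_ids_to_tokens(enc["input_ids"][0])
```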

Comparisons and Impact
Comparisons to Other Dialect Models. In the absence of similarly-spirited nuanced language models, we compare our work to existing models trained at the country level. These include the tweet-level 4-country SHAMI (Qwaider et al., 2018), which we split into TRAIN (80%), DEV (10%), and TEST (10%) for our experiments, thus using less training data than Qwaider et al. (2018) (who use cross-validation). We also compare to Zhang and Abdul-Mageed (2019), the winning team in the 21-country MADAR Shared Task 2 (Bouamor et al., 2019b). Note that the shared task targets user-level dialect based on a collection of tweets, which our models are not designed to directly predict (since we rather take a single tweet as input, making our task harder). For this purpose, we train two models, one on MADAR data (Shared Tasks 1 and 2) and another on our Micro-Ara+MADAR data. We also develop models using the 17-country Arap-Tweet (Zaghouani and Charfi, 2018); noting that its authors did not perform classification on their data, we include a unidirectional 1-layered GRU with 500 units as a baseline for Arap-Tweet. Note that we do not report on the dataset described in Abdul-Mageed et al. (2018), since it is automatically labeled and hence noisy. We also do not compare to the dataset in Salameh and Bouamor (2018), since it is small and not naturally occurring (only 2,000 translated sentences per class), and the authors have already reported linear classifiers outperforming a deep learning model due to the small data size. As Table 7 shows, our models achieve new SOTA on all three tasks by a significant margin. Our results on MADAR show that if we have up to 100 messages from a user, we can detect their MD at 80.69% acc.
Impact on External Tasks: We further demonstrate the utility of MARBERT by fine-tuning it on several external tasks, where it establishes new SOTA results. Compared to existing dialect datasets, ours is also more balanced and more diverse. It is also, by far, the most fine-grained.

Conclusion
We introduced the novel task of MDI and offered a large-scale, manually-labeled dataset covering 319 city-based Arabic micro-varieties. We also introduced several novel MTL scenarios for modeling MDs, including at hierarchical levels and with linguistically-motivated auxiliary tasks inspired by diaglossic and code-switching environments. We have also exploited our own data to train MARBERT, a very large and powerful masked language model covering all Arabic varieties. Our models establish new SOTA on a wide range of tasks, thereby demonstrating their value.

A.1 Correction of City and State Tags
City-Level. Investigating examples of the geolocated data, we observed that geopy made some mistakes. To solve the issue, we decided to manually verify the information returned by geopy for all 705 assumed 'cities'. For this manual verification, we used Wikipedia, Google Maps, and web search as sources of information while checking city names. We found that geopy made mistakes in 7 cases as a result of misspelled city names in the queries we sent (as coming from user profiles). We also found that 44 cases were not assigned the correct city name as the first 'solution': geopy provided us with a maximum of 7 solutions for a query, with the best solutions sometimes being names of hamlets, villages, etc., rather than cities; in many cases, we found the correct solution to fall between the 2nd and 4th solutions. A third problem was that some city names (as coming from user profiles) were written in non-Arabic (e.g., English or French). We solved this issue by requiring geopy to also return the English version of a city name, and exclusively using that English version. Ultimately, we acquired a total of 646 cities.

State-Level. As explained, geopy returned a total of 235 states/provinces that correspond to the 646 identified (manually fixed) cities. We also manually verified all the state names and their correspondence to the cities and countries. We found no issues with state tags.

A.2 Validation of User Location
We trained the two annotators and instructed them to examine the profile information of each user on Twitter, providing a link to the profile. We asked them to consider various sources of information as a basis for their decisions, including (1) the profile picture, (2) the profile's textual description (including user-provided location), (3) the actual name of the user (if available), (4) at least 10 tweets, (5) the followers and followees of the user, and (6) the user's network behavior, such as 'likes'.
Each annotator was responsible for ∼50% of the usernames and was given a random sample of 100 users for each city (though some cities had fewer than 100 users), along with the Twitter handles and the automatically assigned city and country labels. We asked the annotators to label the first 10 accounts in each city, and only add more if the city proved especially challenging (as we observed to be the case in a pilot analysis of a few cities). Annotators ended up labeling a total of 4,953 accounts (∼11.88 users per city), of whom 4,012 users were verified for country and 3,085 for both country and city locations. We found that 81.00% of geopy tags for country are correct, but only 62.29% for city (which reduced our final city count to 319). As a final sanity check, a third annotator reviewed the labels for a random sample of 20 users from each annotator and agreed fully.

C.1 Single-Task BiGRUs

As mentioned in Section 4.1, our second baseline (Baseline II) comprises 3 independent networks (one for each of the 3 tasks) using the same architecture and model capacity. Each network has 3 hidden BiGRU layers, with 1,000 units each (500 units from left to right and 500 units from right to left). We add dot-product attention only to the third hidden layer. We trim each sequence at 50 words and use a batch size of 8. Each word in the input sequence is represented as a vector of 300 dimensions that are learned directly from the data. Word vector weights W are initialized with a normal distribution, with µ = 0 and σ = 0.05, i.e., W ∼ N(0, 0.05). For optimization, we use Adam (Kingma and Ba, 2014) with a fixed learning rate of 1e-3. For regularization, we use dropout (Srivastava et al., 2014) with a value of 0.5 on each of the 3 hidden layers.
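A PyTorch sketch matching this description is below; the attention pooling is a simplified dot-product stand-in, and dropout placement in nn.GRU (between stacked layers) approximates, but may not exactly match, our setup:

```python
import torch
import torch.nn as nn

class SingleTaskBiGRU(nn.Module):
    def __init__(self, vocab_size, n_classes, emb_dim=300, hidden=500):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        nn.init.normal_(self.emb.weight, mean=0.0, std=0.05)  # W ~ N(0, 0.05)
        self.gru = nn.GRU(emb_dim, hidden, num_layers=3, batch_first=True,
                          bidirectional=True, dropout=0.5)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, ids):                                    # ids: (B, <=50)
        h, _ = self.gru(self.emb(ids))                         # (B, T, 1000)
        attn = torch.softmax(torch.bmm(h, h.transpose(1, 2)), dim=-1)
        return self.out(torch.bmm(attn, h).mean(dim=1))        # pooled logits

model = SingleTaskBiGRU(vocab_size=50_000, n_classes=319)      # city task
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```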

C.2 Distill BERT
We distill mBERT knowledge into our HA-MTL BiGRUs. In other words, we use the output of the mBERT logit layer as input to our city-first and country-first HA-MTL BiGRUs to optimize a mean-squared-error objective rather than a cross-entropy function (following Equation 3 in Tang et al. (2019)). The network architecture of the HA-MTL BiGRU is otherwise the same as before, but we train for 20 epochs rather than 15. As Table C.1 shows, both of these networks (HA-MTL-Dist-city 1st and HA-MTL-Dist-country 1st in the table) acquire sizeable improvements over the equivalent, non-distilled BiGRUs. Although these distillation models still fall short of BERT, the goal behind them is to yield performance as close as possible to BERT, albeit with a smaller network that can be deployed on machines with limited capacity and offers quicker inference (we perform model inference on the DEV set with a batch size of 128 on a single NVIDIA V100 GPU). Concretely, a HA-MTL-BiGRU model learns the 3 tasks of city, state, and country together, compared to single-task BERT, where 3 different models are needed for these 3 tasks. In terms of the number of parameters, this means the multi-task BiGRU distillation model has 11.6× fewer parameters. The HA-MTL-BiGRU is also faster at inference.

Figure C.1: Illustration of the MTL (spec-attn) network for city, state, and country. The three tasks share 2 hidden layers, with each task having its independent attention layer.

D.1 Data Settings

(1) Our narrow setting samples from cities with at least 16 users, placing 3 users in TEST and the rest (13 or more) in TRAIN. This gives us 61 cities, 48 states, and 11 countries. (2) Our medium setting is similar to narrow, but we sample from cities with at least 13 users instead of 16. We use 3 users for TEST and the rest (10 users or more) for TRAIN. This results in 116 cities, 90 states, and 17 countries.
(3) Our wide setting has data from a single user from a given city in TEST and the rest of the users from the same city in TRAIN. This setting allows more coverage (240 cities, 158 states, and all 21 countries), at the cost of having as few as only two users for a given city in TRAIN. Figure D.1 shows the distribution of users over the 3 data settings. In addition, Table D.2 shows the sizes of the TRAIN and TEST sets in each of the 3 runs, across each of the 3 settings.

D.2 Comparison of Models on a Completely New TEST Set
As mentioned in Section 6, we evaluate our models from the narrow, medium, and wide settings and our single-task MARBERT model (shown in Table 3) on a completely new test set. This allows a more direct comparison between these models, including testing the impact of sharing users across the various data splits (as is the case for single-task MARBERT) or the lack thereof (as is the case for the narrow, medium, and wide settings models). We now introduce GeoAra, our new evaluation dataset.

GeoAra Dataset. GeoAra is a dataset of tweets with city labels from 20 Arab countries (the same countries as in Micro-Ara, with the exception of Djibouti). To build GeoAra, we run a crawler on each of the 319 cities in our gold data for a total of 10 months (Jan. 2019 - Oct. 2019). We acquire a total of 4.7M tweets from all the cities. We collect Twitter user IDs from users who posted consistently from a single location over the whole 10 months (n = 390,396), and crawl the timelines of 148K users; we note that this is more conservative than previous geolocation works (e.g., Han et al. (2012)) that take the majority-class city of a user who posted 10 tweets as the label. Note that Micro-Ara (our monolingual dataset) was collected in 2016 and 2017. This means GeoAra involves data from a significantly different (more recent) period than Micro-Ara (2 years later). We then keep only users who posted at least 10 tweets. This leaves us with 101,960 users from 147 cities. From GeoAra, we create a DEV set from a random sample of 100K tweets (from 908 users) and a TEST set from a random sample of 97,834 tweets (from 1,053 users); the two splits do not identically match in size since we also needed to create a specific TRAIN split from the same dataset (that TRAIN split is not part of the current work, so we leave it out). We do not share users between the TRAIN, DEV, and TEST splits.

As Table D.1 shows, although all 4 models degrade on GeoAra, single-task MARBERT (MARB-319 in the table) suffers most. This further suggests that this particular model captured user-level knowledge that may have allowed it to perform much higher on the TEST set in Table 3 than it would have if user data were not shared across the various splits. In addition, even though our narrow setting model covers only 61 cities, it is the one that performs best on both the DEV and TEST GeoAra splits. This might be because this model is trained on the largest number of users per city (at least 13 users for each city), which allows it to generalize well on these cities. An error analysis may reveal more information on the performance of these particular models on GeoAra. We cast further investigation of this issue as future research.

D.3 MSA vs. Dialect Classifier
As described in Section 7, we apply an in-house MSA vs. dialect classifier on our narrow data setting to remove MSA. Our in-house classifier fine-tunes MARBERT on the MSA-dialect TRAIN split described in Elaraby and Abdul-Mageed (2018). This binary classifier performs at 89.1% accuracy and 88.6% F1 on the Elaraby and Abdul-Mageed (2018) MSA-dialect TEST set. Running this model on our narrow setting data gives us the TRAIN and TEST splits with predicted labels described in Table D.3.

E Discussion
As discussed in Section 8, we visualize attention in ∼250 examples from our TEST set using our MARBERT-narrow model fine-tuned on split B in Table D.2. We provide visualizations from two examples here. Example 1: Figure D.2 shows a visualization of a sentence from the city of Asuit, Egypt, that the model correctly predicted. Left: Attention layer #3 of the model has several heads attending to lexical micro-dialectal cues related to Asuit. Most notably, tokens characteristic of the language of the correct city are attended to. Namely, the word that is part of the metaphorical expression meaning "what a devil" receives attention in heads 1-3, and the word for "man" (as used in the city of Asuit) receives attention in head 2. These cues usually co-occur with the token meaning "you screwed [somebody]", which is also characteristic of the Southern Egyptian region and the city of Asuit. This is clear in the image on the right, where that token attends to other micro-dialectal cues in the sequence.
Example 2: Figure D.3 shows a visualization of a sentence from the city of Marsa Matrouh, Egypt, that was incorrectly predicted as Tobruq, Libya. Even though the model makes a prediction error here, its error is meaningful in that it chooses a city located in the vicinity of the gold city. Interestingly, this means the city-level model can pick a nearby city in a different country rather than a far-away city in the same country. This reflects how micro-dialects paint a more nuanced (and linguistically plausible) picture. It also suggests that country-level dialect models rest on arbitrary assumptions, by virtue of depending on political boundaries, which are not always what defines language variation.

E.1 Brief Error Analysis
We provide a brief error analysis of single-task MARBERT (described in Table 3 of the paper) in Table E.1.

Table E.1: Top wrongly predicted cities in our DEV set based on mBERT. For each gold city, we provide the average distance from the cities with which it was confused (we call it avg. error distance) and the countries to which the confused cities belong, followed by the percentage with which cities of each country were confused with the gold city.

In the future, we also plan to carry out a more extensive (including manual) error analysis based on the tweets involved.