Wetin dey with these comments? Modeling Sociolinguistic Factors Affecting Code-switching Behavior in Nigerian Online Discussions

Multilingual individuals code switch between languages as a part of a complex communication process. However, most computational studies have examined only one or a handful of contextual factors predictive of switching. Here, we examine Naija-English code switching in a rich contextual environment to understand the social and topical factors eliciting a switch. We introduce a new corpus of 330K articles and accompanying 389K comments labeled for code switching behavior. In modeling whether a comment will switch, we show that topic-driven variation, tribal affiliation, emotional valence, and audience design all play complementary roles in behavior.


Introduction
Multilingual individuals frequently switch between different languages throughout a discourse, a process known as code switching (Heller, 2010; Gambäck and Das, 2016). This switching process is thought to be driven by a variety of factors, including grammatical constraints (Pfaff, 1979; Poplack, 1980), audience design (Gumperz, 1977; Bell, 1984), and the desire to evoke a specific perception of the speaker's identity (Niedzielski, 1999; Schmid, 2001). In common social situations, many of these factors are in play, yet we often do not know how they interact. Here, we present a large-scale study of code switching in Nigeria between English and Naijá, the widely spoken Nigerian creole, to quantify which factors predict switching.
Computational studies of code switching have largely focused on linguistic aspects of switching (Solorio and Liu, 2008; Adel et al., 2013; Vyas et al., 2014; Hartmann et al., 2018). However, several recent works have begun to examine the contextual factors that influence switching behavior, finding that the topic driving a discussion spurs language variation (Shoemark et al., 2017; Stewart et al., 2018) and that individuals are sensitive to the scope of their audience when choosing a language (Papalexakis et al., 2014; Pavalanathan and Eisenstein, 2015). Given that social context is known to strongly influence code switching (Gumperz, 1977; Thomason and Kaufman, 2001; Gardner-Chloros and Edwards, 2004), our work builds on these recent advances to quantify the impact of social and contextual factors on code switching.
* Authors contributed equally.
Here, we examine the social and contextual factors predictive of English-Naijá code switching in online discussions across five major Nigerian newspapers. Our work makes three contributions towards computational sociolinguistics. First, we introduce a massive new corpus of Naijá and English text that presents code switching behavior in context, using 330K articles and 389K comments from nine years of longitudinal data. Second, we develop a new classifier for distinguishing Naijá and English, identifying over 24K cases of code switching. Third, we show that although topic-driven variation drives much of code switching behavior, tribal affiliation, emotional valence, and audience design play important roles in which language is used.

Identifying Naijá and English
Naijá is an English creole spoken by approximately 80 million people throughout Nigeria, with 3 to 5 million speaking it as a first language (Uchechukwu Ihemere, 2006), leading many popular services to generate content in Naijá, e.g., BBC Pidgin. While official business is frequently conducted in English, Naijá is considered the main language of social interaction in Nigeria (Ifeanyi Onyeche, 2004).
As news media, all six datasets use a formal register in their style, which does not necessarily match that of the comments. Therefore, to supplement the news data, two annotators labeled a sample of 2,500 comments across all sites. As Naijá is less frequent, the sample was bootstrapped to potentially contain more Naijá by first training our classifier (described next) on the news data and then sampling comments uniformly across its posterior distribution. A held-out set of 682 randomly sampled comments (not bootstrapped) was additionally doubly annotated (Krippendorff α=0.511) as a test set, 9.5% of which were Naijá; note that due to class imbalance, α represents a highly conservative estimate of agreement.

Method and Experimental Setup
Our goal is to create a classifier that identifies whether a sentence contains Naijá. English is significantly more frequent in our news dataset and therefore we downsample English to a 9:1 ratio following the
As a primarily spoken language, Naijá has significant orthographic variation in its spelling (Deuber and Hinrichs, 2007). Therefore, we follow insights from language detection approaches (Lui and Baldwin, 2012; Jauhiainen et al., 2018; Zhang et al., 2018) and adopt character-based features, which are more robust to such variation. Here, character sequences of length 3 to 7 are used as features with a logistic regression with L2 loss. The resulting model is evaluated using AUC in two ways: using 5-fold cross validation within the training data and the held-out comment test set.
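To make the character-based features concrete, the following is a minimal sketch of extracting character sequences of length 3 to 7 from a sentence. The function name `char_ngrams` and the lowercasing step are our assumptions for illustration and are not taken from the released code; the resulting counts would then be passed to a logistic regression.

```python
from collections import Counter


def char_ngrams(text, n_min=3, n_max=7):
    """Count every character n-gram of length n_min..n_max in a sentence.

    Lowercasing is an assumption made for this sketch only.
    """
    text = text.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n over the sentence.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Two spellings of the same word still share sub-word features.
shared = set(char_ngrams("wetin")) & set(char_ngrams("weytin"))
print(sorted(shared))
```

Because variant spellings overlap in many of their character n-grams, a classifier over these features can be less sensitive to orthographic variation than one over whole words.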

Results
The classifier was highly accurate at learning to distinguish Naijá and English in the mostly-news training data, achieving a cross-validation AUC of 0.996, compared with the random baseline of 0.5. The model performed less accurately on the comments, which have a more informal register, achieving an AUC of 0.724.
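For readers less familiar with AUC, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen Naijá example receives a higher score than a randomly chosen English example, with ties counted as one half. The function name `auc` and the toy labels and scores are illustrative assumptions, not values from our data.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = Naijá) and classifier scores.
print(auc([1, 1, 0, 0], [0.9, 0.3, 0.5, 0.1]))
```

Under this definition, a classifier that scores examples at random is expected to reach 0.5, which is the random baseline reported above.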

Social Factors Influencing Switching
People code switch in part to signal a part of their identity (Nguyen, 2014), and online discussion provides an intersectional context that combines social and topic features that could each elicit the use of Naijá (Myers-Scotton, 1995). Here, we outline the social and contextual factors that could affect whether Naijá is used and identify specific research hypotheses to test.

Article Topic
The content of a discussion has the potential to elicit a response in a particular language, especially if content, language, and identity interrelate. For example, in online discussions of independence referendums, Shoemark et al. (2017) and Stewart et al. (2018) show evidence of topic-based language variation, with additional modulation based on expected audience. These results point to hypothesis H1 that we should observe topic-induced variation in which Naijá would be more frequent for certain topics.

Social Setting
The audience imagined by an author leads to differing code switching behavior, where computational studies have found that messages intended for broader audiences typically use the major language (Papalexakis et al., 2014; Shoemark et al., 2017). Similarly, Nguyen et al. (2015) note that individuals switch to a minority language during a conversation with other individuals. We operationalize audience design in three ways: (1) the number of prior comments to an article, which generally signals its potential audience size, (2) the depth of the comment in the discussion thread, which is often a signal of more interpersonal discussion (Aragón et al., 2017), and (3) the time of day the comment is made, as an expectation of future audience size. These three factors lead to hypothesis H2a that initial comments will be less likely to be in Naijá as they would have a wider audience and H2b that comments made to a smaller audience are more likely to be made in Naijá.
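The three audience-design variables can be derived from a threaded comment record roughly as follows. This is a sketch under assumed field names (`parent_id`, `posted_at`) and an assumed function name `audience_features`; the actual Disqus export format may differ.

```python
from datetime import datetime


def audience_features(comment, comments_by_id, num_prior_comments):
    """Return the three audience-design variables for one comment."""
    # Depth: number of ancestors between this comment and the top level.
    depth = 0
    parent_id = comment["parent_id"]
    while parent_id is not None:
        depth += 1
        parent_id = comments_by_id[parent_id]["parent_id"]
    return {
        "num_prior_comments": num_prior_comments,  # (1) audience size so far
        "depth": depth,                            # (2) position in the thread
        "hour": comment["posted_at"].hour,         # (3) time of day
    }


# Hypothetical three-comment thread: c1 <- c2 <- c3.
thread = {
    "c1": {"parent_id": None, "posted_at": datetime(2018, 5, 1, 9, 30)},
    "c2": {"parent_id": "c1", "posted_at": datetime(2018, 5, 1, 20, 15)},
    "c3": {"parent_id": "c2", "posted_at": datetime(2018, 5, 1, 21, 0)},
}
print(audience_features(thread["c3"], thread, num_prior_comments=2))
```

A top-level comment has depth 0 under this convention, which corresponds to the case where the comment has no parent commenter.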

Tribal Affiliation
Nigeria is home to individuals identifying with over a hundred different tribal identities, which are concentrated in different regions. These tribal affiliations are the strongest aspect of self-identity in present-day Nigeria (Mustapha, 2006) and have also historically served as sources of conflict due to social stratification along tribal and geographic lines (Akiwowo, 1964; Himmelstrand, 1969). Tribal identity and salience are closely linked with language in Nigeria (Bamiro, 2006), with individuals alternating between English, Naijá, and local languages to emphasize identity. Language choice is driven in part by these cultural identities (Gudykunst and Schmidt, 1987; Myers-Scotton, 1991; Moreno et al., 1998). We test hypothesis H3 that tribal affiliation will be predictive of code switching.
As our dataset does not initially come with tribal affiliation, we follow previous work (Rao et al., 2011; Fink et al., 2012) and train a classifier (described in Appendix A) to automatically label all article authors as Igbo, Hausa-Fulani, Yoruba, or other. These three tribes constitute over 71% of the population. Similar to prior work, our method attains an 81.0 F1 on author names, with slightly lower performance (67.7 F1) on the noisier commenter names.

Social Status
Code switching behavior is connected to perceived notions of status, especially the perceived status of each language in context (Genesee, 1982). Kim et al. (2014) note that higher status individuals tend to speak in the majority language. Here, we operationalize status through users' meta-data from Disqus, which provides their number of followers and acts as a proxy for their reputation on the platforms. Hypothesis H4 is that individuals with higher status are more likely to use the majority language, English.

Emotion
The language spoken by a bilingual individual is intimately connected to emotion (Rajagopalan, 2004). Indeed, individuals are more likely to swear in their native language (Dewaele, 2004; Rudra et al., 2016) or code switch when being impolite (Hartmann et al., 2018), underscoring an unconscious connection during emotional moments. Odebunmi (2012) notes that Naijá is used in the more formal setting of doctor-patient interactions to express emotions. These results suggest hypothesis H5 that in high-emotion settings, individuals are more likely to code switch into Naijá.

When is Naijá Used?
What sociocultural factors influence a person's choice of communicating in Naijá or English? Here, we analyze the comments from data in Table 1 to test the hypotheses from Section 3.

Experimental Setup
The Naijá-English classifier was run on all comments made to the 330K articles in the dataset, classifying each sentence within the comment separately. If any one sentence is classified as Naijá, we consider the comment to have code-switched, noting that we do not distinguish the level at which the switch occurs, e.g., word, phrase, or sentence (Gambäck and Das, 2016). Ultimately, this process resulted in 365,420 English and 24,232 Naijá-containing comments.
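The comment-level labeling rule can be sketched as below. The per-sentence probabilities stand in for the output of the sentence classifier, and the 0.5 decision threshold and the name `is_code_switched` are assumptions made for this illustration.

```python
def is_code_switched(sentence_probs, threshold=0.5):
    """Label a comment as Naijá-containing if any sentence is classified Naijá.

    sentence_probs: hypothetical P(Naijá) for each sentence in the comment.
    threshold: assumed decision threshold, not stated in the text.
    """
    return any(p >= threshold for p in sentence_probs)


# Hypothetical comments: one with a single Naijá sentence, one fully English.
print(is_code_switched([0.05, 0.91, 0.10]))
print(is_code_switched([0.05, 0.20]))
```

Because one Naijá sentence is enough to label the whole comment, this rule does not separate fully Naijá comments from those with a single switch.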
User-based statistics were extracted for each commenter from their Disqus profile. As only 15K individuals use Disqus accounts (4%), we include an additional binary indicator variable for whether the individual has an account. To test for the effect of content, a 20-topic LDA model (Blei et al., 2003) was run on the article text and included as variables (due to collinearity, topic 20 is excluded). We model tribal affiliation in four ways: (i) the commenter, (ii) the article author, and, where possible, (iii) the affiliation of the parent being replied to, and (iv) whether the parent explicitly mentions a tribe. For the first three, the "Other" category is the reference coding. Emotion is measured using VADER (Hutto and Gilbert, 2014), a lexicon designed for sentiment analysis in social media, on a scale of [-1,1]. We incorporate sentiment in four ways: (1-2) the sentiment scores of the post and its parent, using 0 for the parent's sentiment if the current comment has no parent, and (3-4) the absolute value of the sentiment and of the parent's sentiment. The latter two variables enable us to separately test whether any emotionality (positive or negative) influences the use of Naijá, rather than the particular direction. Each platform is included as a fixed effect to control for differences in baseline rates of Naijá. After testing for collinearity, all features had VIF<3.1, indicating the model's features are largely independent. As each hypothesis uses different regression variables, this low VIF also indicates that any results are likely not confounded by correlations within the data.
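The four sentiment variables can be assembled as in the following sketch. It assumes the sentiment scores have already been computed (e.g., a VADER compound score in [-1, 1]); the function and feature names are hypothetical.

```python
def sentiment_features(comment_score, parent_score=None):
    """Build the four sentiment variables for one comment.

    comment_score, parent_score: precomputed sentiment scores in [-1, 1].
    parent_score is None when the comment has no parent, in which case
    the parent's sentiment is set to 0 as described in the text.
    """
    parent = 0.0 if parent_score is None else parent_score
    return {
        "sentiment": comment_score,           # direction of the comment
        "parent_sentiment": parent,           # direction of the parent
        "abs_sentiment": abs(comment_score),  # emotionality, either direction
        "abs_parent_sentiment": abs(parent),
    }


# Hypothetical top-level comment with strongly negative sentiment.
print(sentiment_features(-0.6))
```

Keeping the signed and absolute values as separate variables lets the regression estimate the effect of emotional intensity separately from the effect of polarity.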

Results
A logistic regression model is fit using all the features, and the resulting coefficients, shown in Figure 1, provide support for all five hypotheses. However, the effect sizes of each hypothesis's variables differed substantially, pointing to the complexity of code switching behavior.
The strongest effects on Naijá usage in the comment section came from the topic of the article, supporting H1. Topics related to business, social issues, and tribal and electoral politics were more likely to see code switching into Naijá. However, topics related to more general, legislative politics and individual sectors of the economy do not promote Naijá usage.
Further, this trend is seen in the newspapers' relative rates: being more oriented towards business topics and targeting an educated audience, The Guardian features less code switching in its comment sections compared to The Punch, a tabloid with a wider audience (Marcus, 1999). In particular, the code switching effect is strongest for topics that relate to societal tensions (e.g., political, socioeconomic, and tribal). While prior work on topic-induced variation (Shoemark et al., 2017; Stewart et al., 2018) identified behaviors for political identity-based content (national referendums on independence), here we also observe that individuals are sensitive to audience for more domestic topics like education and health care.
The use of Naijá did vary by audience, with the strongest support for H2b. Comments deeper in a reply thread are more likely to be in Naijá, as are those made in the evening, when much of the discussion has already taken place and replies are more likely to be conversational with a particular person rather than commentary on the article. The total effect is seen by considering both the depth and the "Parent Commenter: None" indicator (i.e., the comment is at the top level). Such initial comments are much more likely to be in English; as the discussion turns more conversational, more Naijá is used. Our results agree with those of Nguyen et al. (2015), who found more minority language use in interpersonal communication.
The initial comments to an article (low sequence number) were less likely to be in Naijá (H2a; p<0.05), though the effect is relatively weaker.
Tribal affiliation had only a limited association with use of Naijá (H3), where Igbo commenters are more likely and Yoruba commenters are less likely to use Naijá. A subsequent model tested for interaction effects between author and parent tribe, which revealed only one significant trend: individuals from all tribes are more likely to reply to Yoruba commenters in Naijá. As Naijá is widely spoken throughout the country, compared with Standard English, which is spoken more frequently at higher socioeconomic levels (Faraclas, 2002), our results suggest its use is not to emphasize tribal affiliation.
The expectation of H4 was observed: higher status (as measured by number of followers) was predictive of use of the higher-prestige language (English), though the effect is relatively small and is estimated only from those users with Disqus accounts. As a complementary analysis, we performed a second test where we replace the number of followers with the number of total upvotes as a proxy of status, with the rationale that users who generate content that is well-received by the community might acquire a positive reputation. The regression results using total upvotes also found a similar weak effect of higher status users writing more in English (and highly similar coefficients for all other features). However, we note that this second analysis has a potential confound, as an English comment could be read by a wider audience and therefore receive more upvotes simply due to audience size rather than status. As all newspapers in our study are primarily read by a Nigerian national audience that is likely bilingual in English and Naijá, this potential effect is expected to be small. Nevertheless, given the limitations of both operationalizations of status, we view their similar results as tentative evidence of the effects of status on Naijá code switching in social discussions (H4).
The effects associated with H5 were strongly shown: when expressing any kind of sentiment, authors were much more likely to do it in Naijá, with a positive effect for using Naijá in positive sentiment comments. Surprisingly, a parent's use of sentiment was negatively associated with Naijá, indicating that a reaction to emotional language does not elicit a code switch. Given that our model controls for topics that may be more likely to elicit certain emotions, this result suggests that emotion is a driving factor in code switching behavior.

Conclusion
This work provides the first computational examination of code switching behavior in Naijá through introducing a large corpus of articles in Naijá and Nigerian Standard English, along with comments to these articles. We develop new methods for distinguishing these two languages and identify over 24K instances of code switching in the comments. Through examining code switching in an intersectional social context, our analysis provides evidence of complementary social factors influencing switching. Notably, we find that topical modulation has the largest effect on switching to Naijá, with use of emotion surpassing the effect for a few topics. However, as no one factor was sufficient for predicting code switching, our results point to the need for holistically modeling the social context when examining factors influencing code switching behavior. All data and code are made available at https://blablablab.si.umich.edu/projects/naija/.

A A Classifier for Tribal Affiliation
As our dataset does not come with tribal affiliations to start with, we first create a classifier to identify affiliations on the basis of name. Due to cultural norms in Nigeria, individuals' names often reveal their tribal affiliation (Rao et al., 2011; Fink et al., 2012), which lends itself to developing computational methods for distinguishing between the affiliations. Here, we develop a classifier for distinguishing between the three largest tribal affiliations: Hausa-Fulani (29%), Yoruba (21%), and Igbo (21%), which together account for over 71% of the population, thereby providing solid coverage of online users. Data for the tribal affiliation classifier was compiled using online databases and annotated names extracted from a held-out set of article authors and commenter names from the dataset of articles. The final training dataset included 493 Hausa-Fulani names, 500 Yoruba names, 351 Igbo names, and 511 "other" names, which encompassed Nigerian names not falling under the aforementioned three categories as well as non-Nigerian names (e.g., "The Editorial Board" or "flexingbenny"). Table 4 shows examples of names used in training. We note that some tribes' names have similar cultural origins and therefore our data could result in systematic misclassifications for some tribes; for example, both the Hausa and the Kanuri (an ethnic group comprising roughly 3-4% of the Nigerian population) share names that are Arabic in origin. Our model would likely label all such names as Hausa, though due to population size differences, the impact of such errors is likely to be small. A logistic regression classifier was trained using L2 regularization with character n-grams ranging from 2 to 5 in length. To evaluate perfor-

Model            Article Author   Commenter
Our method       0.81             0.68
majority class   0.12             0.17
random           0.24             0.21

3. While absolute performance on article authors is on par with similar approaches to classifying tribal affiliation (Rao et al., 2011; Fink et al., 2012), which applied their classifiers to clean name data, performance on commenter names is slightly lower due to noise from lexical variation, misspellings, and web extraction. Table 5 shows examples of names with tribal affiliation in the test data. The confusion matrix of the tribal affiliations, shown in Figure 2, reveals no systematic misclassification bias, suggesting that any errors will only increase variance in the downstream results without biasing findings towards one particular affiliation.

B Additional Naijá Classification Examples
Table 8 shows a sample of instances classified by the final trained language-distinguishing model. Instances are sampled uniformly across the posterior to show the variety of confidence scores.

C Additional Regression Details
Table 7 shows the full regression coefficients for the model depicted in Figure 1. We additionally show the most probable words for each topic in Table 6. Note that the final topic ("Security") was intentionally omitted from the regression to remove the effects of collinearity between topic probabilities.

Figure 1: Regression results for whether a comment will have Naijá in it. Error bars show standard error, with *** denoting p<0.001, ** p<0.01, and * p<0.05. Shaded regions group similar variables. Full results are detailed in Appendix Table 7.

Figure 2: Normalized confusion matrix of tribal affiliation classifier.

Although spo-

Data
A longitudinal sample of Nigerian news was collected from six major news sources; five of these are in Nigerian Standard English, while one is in Naijá. Table 1 summarizes the datasets. Articles span from 2010 to present day and all but the BBC Pidgin site allow users to comment on the article, with activity rates ranging significantly. Notably, all sites share a common commenting framework through Disqus, which allows

Table 2: High confidence Naijá classification examples

Table 5: Tribal affiliation test data examples