#WhoAmI in 160 Characters? Classifying Social Identities Based on Twitter Profile Descriptions

We combine social theory and NLP methods to classify English-speaking Twitter users' online social identity in profile descriptions. We conduct two text classification experiments. In Experiment 1 we use a 5-category online social identity classification based on identity and self-categorization theories. While we are able to automatically classify two identity categories (Relational and Occupational), automatic classification of the other three identities (Political, Ethnic/religious and Stigmatized) is challenging. In Experiment 2 we test a merger of these identities based on theoretical arguments. We find that by combining these identities we can improve the predictive performance of the classifiers in the experiment. Our study shows how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.


Introduction
Non-profit organizations increasingly use social media, such as Twitter, to mobilize people and organize cause-related collective action, such as health advocacy campaigns.
Studies in social psychology (Postmes and Brunsting, 2002; Van Zomeren et al., 2008; Park and Yang, 2012; Alberici and Milesi, 2013; Chan, 2014; Thomas et al., 2015) demonstrate that social identity motivates people to participate in collective action, which is the joint pursuit of a common goal or interest (Olson, 1971). Social identity is an individual's self-concept derived from social roles or memberships of social groups (Stryker, 1980; Tajfel, 1981; Turner et al., 1987; Stryker et al., 2000). The use of language is strongly associated with an individual's social identity (Bucholtz and Hall, 2005; Nguyen et al., 2014; Tamburrini et al., 2015). On Twitter, profile descriptions and tweets are online expressions of people's identities. Social media therefore provide an enormous amount of data for social scientists interested in studying how identities are expressed online via language.
We identify two main research opportunities in the study of online identity. First, online identity research is often confined to relatively small datasets, and social scientists rarely exploit computational methods to measure identity over social media. Such methods offer tools to enrich online identity research: Natural Language Processing (NLP) and Machine Learning (ML) methods help to quickly classify and draw inferences from vast amounts of data. Various studies investigate how to predict individual characteristics from language use on Twitter, such as age and gender (Rao et al., 2010; Burger et al., 2011; Al Zamal et al., 2012; Van Durme, 2012; Ciot et al., 2013; Nguyen et al., 2013; Nguyen et al., 2014; Preotiuc-Pietro et al., 2015), personality and emotions (Preotiuc-Pietro et al., 2015), political orientation and ethnicity (Rao et al., 2010; Pennacchiotti and Popescu, 2011; Al Zamal et al., 2012; Cohen and Ruths, 2013; Volkova et al., 2014), and profession and interests (Al Zamal et al., 2012; Li et al., 2014). Second, only a few studies combine social theory and NLP methods to study online identity in relation to collective action. One recent example uses the Social Identity Model of Collective Action (Van Zomeren et al., 2008) to study health campaigns organized on Twitter (Nguyen et al., 2015). The authors automatically identify participants' motivations to take action online by analyzing profile descriptions and tweets.
Along these lines, our study contributes to scaling up research on online identity. We explore automatic text classification of online identities based on a 5-category social identity classification built on theories of identity. We analyze 2,633 English-speaking Twitter users' 160-character profile descriptions to classify their social identities. We focus only on profile descriptions as they represent the most immediate, essential expression of an individual's identity.
We conduct two classification experiments: Experiment 1 is based on the original 5-category social identity classification, whereas Experiment 2 tests a merger of three categories for which automatic classification does not work in Experiment 1. We show that by combining these identities we can improve the predictive performance of the classifiers in the experiment.
Our study makes two main contributions. First, we combine social theory on identity and NLP methods to classify English-speaking Twitter users' online social identities. We show how social theory can be used to guide NLP methods, and how such methods provide input to revisit traditional social theory that is strongly consolidated in offline settings.
Second, we evaluate different classification algorithms in the task of automatically classifying online social identities. We show that computers can perform a reliable automatic classification for most social identity categories. In this way, we provide social scientists with new tools (i.e., social identity classifiers) for scaling-up online identity research to massive datasets derived from social media.
The rest of the paper is structured as follows. First, we illustrate the theoretical framework and the online social identity classification which guides the text classification experiments (Section 2). Second, we explain the data collection (Section 3) and methods (Section 4). Third, we report the results of the two experiments (Section 5 and 6). Finally, we discuss our findings and provide recommendations for future research (Section 7).

Theoretical Framework: a 5-category Online Social Identity Classification Grounded in Social Theory
We define social identity as an individual's self-definition based on social roles played in society or memberships of social groups. This definition combines two main theories in social psychology: identity theory (Stryker, 1980; Stryker et al., 2000) and social identity, or self-categorization, theory (Tajfel, 1981; Turner et al., 1987), which respectively focus on social roles and memberships of social groups. We combine these two theories because together they provide a more complete definition of identity (Stets and Burke, 2000): the likelihood of participating in collective action increases when individuals both identify themselves with a social group and are committed to the role(s) they play in the group (Stets and Burke, 2000). We create a 5-category online social identity classification based on previous studies of offline settings (Deaux et al., 1995; Ashforth et al., 2008; Ashforth et al., 2016). We apply this classification to Twitter users' profile descriptions as they represent the most immediate, essential expression of an individual's identity (Jensen and Bang, 2013). While tweets mostly feature statements and conversations, the profile description provides a dedicated, albeit limited (160 characters), space where users can write about the self-definitions they want to communicate on Twitter.
The five social identity categories of our classification are: (1) Relational identity: self-definition based on (reciprocated or unreciprocated) relationships that an individual has with other people, and on social roles played by the individual in society. Examples on Twitter are "I am the father of an amazing baby girl!", "Happily married to @John", "Crazy Justin Bieber fan", "Manchester United team is my family".
(2) Occupational identity: self-definition based on occupation, profession and career, individual vocations, avocations, interests and hobbies. Examples on Twitter are "Manager Communication expert", "I am a Gamer, YouTuber", "Big fan of pizza!", "Writing about my passions: love cooking traveling reading".
(3) Political identity: self-definition based on political affiliations, parties and groups, as well as being a member of social movements or taking part in collective action. Examples on Twitter are "Feminist Activist", "I am Democrat", "I'm a council candidate in local elections for []", "mobro in #movember", "#BlackLivesMatter".
(4) Ethnic/Religious identity: self-definition based on membership of ethnic or religious groups. Examples on Twitter are "God first", "Will also tweet about #atheism", "Native Washingtonian", "Scottish no Australian no-both?".
(5) Stigmatized identity: self-definition based on membership of a stigmatized group, which is considered different from what society defines as normal according to social and cultural norms (Goffman, 1959). Examples on Twitter are "People call me an affectionate idiot", "I know people call me a dork and that's okay with me". Twitter users also attach a stigma to themselves with an ironic tone. Examples are "I am an idiot savant", "Workaholic man with ADHD", "I didn't choose the nerd life, the nerd life chose me".
Social identity categories are not mutually exclusive. Individuals may have more than one social identity and embed all identities in their definition of the self. On Twitter, it is common to find users who express more than one identity in the profile description. For example, "Mom of 2 boys, wife and catholic conservative, school and school sport volunteer", "Proud Northerner, Arsenal fan by luck. Red Level and AST member. Gamer. Sports fan. English Civic Nationalist. Contributor at HSR. Pro-#rewilding".

Data Collection
We collect data by randomly sampling English tweets. From the tweets, we retrieve the users' profile descriptions. We remove all profiles (i.e., 30% of the total amount) where no description is provided.
We are interested in developing an automatic classification tool (i.e., a social identity classifier) that can be used to study the identities of both people engaged in online collective action and general Twitter users. For this purpose, we use two different sources to collect our data: (1) English tweets from the two-year (2013 and 2014) Movember cancer awareness campaign, which aims at changing the image of men's health (i.e., prostate and testicular cancer, mental health and physical inactivity); and (2) English random tweets posted in February and March 2015, obtained via the Twitter Streaming API. We select tweets from the UK, US and Australia, the three largest countries with native English speakers. For this selection, we use a country classifier, which has been found to be fairly accurate in predicting tweets' geolocation for these countries (Van der Veen et al., 2015). As only 2% of tweets on Twitter are geo-located, we use this classifier to obtain the data for our text classification.
From these two data sources, we obtain two Twitter user populations: Movember participants and random generic users. We sample from these two groups to have a similar number of profiles in our dataset. We obtain 1,611 Movember profiles and 1,022 random profiles. Our final dataset consists of 2,633 Twitter users' profile descriptions.

Methods
In this study, we combine qualitative content analysis with human annotation (Section 4.1) and text classification experiments (Section 4.2).

Qualitative Content Analysis with Human Annotation
We use qualitative content analysis to manually annotate our 2,633 Twitter users' profile descriptions. Two coders are involved in the annotation. The coders meet in training and testing sessions to agree upon rules and build a codebook that guides the annotation. The identity categories of our codebook are based on the 5-category social identity classification described in Section 2. In the annotation, a Twitter profile description is labeled with "Yes" or "No" for each category label, depending on whether the profile belongs to that category or not. Multiple identities may be assigned to a single Twitter user (i.e., identity categories are not mutually exclusive). We calculate the inter-rater reliability with Krippendorff's alpha. The definition of social identity is applicable only to one individual. Accounts that belong to more than one person, or to collectives, groups, or organizations (N=280), are annotated as "Not applicable", or "N/a" (Kalpha=0.8268). This category also includes individual profiles (N=900) for which: 1) no social identity category fits (e.g., profiles containing quotes/citations/self-promotion, or descriptions of individual attributes with no reference to social roles or group membership); and 2) ambiguous or incomprehensible cases.
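For two coders who both label every profile, Krippendorff's alpha for a nominal ("Yes"/"No") category can be computed directly from a coincidence matrix. The sketch below uses illustrative labels, not the study's annotation data, and assumes complete annotations with at least two distinct labels:

```python
from collections import Counter
from itertools import product

def krippendorff_alpha_nominal(pairs):
    """Krippendorff's alpha for two coders, nominal labels, no missing data.
    `pairs` is a list of (label_coder1, label_coder2), one per annotated unit.
    Assumes at least two distinct labels occur (otherwise alpha is undefined)."""
    o = Counter()                      # coincidence matrix o[(c, k)]
    for a, b in pairs:                 # each unit contributes two ordered pairs
        o[(a, b)] += 1
        o[(b, a)] += 1
    n = sum(o.values())                # total pairable values (2 per unit)
    marg = Counter()                   # marginal totals n_c
    for (c, _), v in o.items():
        marg[c] += v
    d_obs = sum(v for (c, k), v in o.items() if c != k)
    d_exp = sum(marg[c] * marg[k]
                for c, k in product(marg, repeat=2) if c != k)
    return 1.0 - (n - 1) * d_obs / d_exp

# Illustrative "Yes"/"No" judgments for one identity category
pairs = [("Yes", "Yes"), ("No", "No"), ("Yes", "No"), ("No", "No")]
print(round(krippendorff_alpha_nominal(pairs), 4))  # → 0.5333
```

Perfect agreement yields alpha = 1.0; agreement no better than chance yields values near 0.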
Looking at the distributions of social identity categories in the annotated profile descriptions provides an overview of the types of Twitter users in our data. We check whether these distributions differ between the two populations (i.e., Movember participants and random generic users), and find that each identity category is similarly distributed in the two groups (Figure 1). We conclude that the two populations are similar in their members' social identities. We use Krippendorff's alpha because it is considered one of the most reliable inter-coder agreement statistics in content analysis. We keep N/a profiles in our dataset so that the classifiers learn that such profiles are not examples of social identities. This choice considerably increases the number of negative examples relative to the positive ones used to detect the identity categories; however, we find that including or excluding N/a profiles makes no significant difference in classifier performance.

Figure 1 shows the distributions of social identity categories over the total number of annotated profiles (N=2,633). N/a profile descriptions account for 45% (N=1,180) of all profiles: organization/collective profiles are 11% (N=280), whereas profiles with no social identity or ambiguous cases are 34% (N=900). This means that only slightly more than half of the Twitter users in our dataset, the remaining 55% (N=1,453), express one or more social identities. Users mainly define themselves on the basis of their occupation or interests (Occupational identities: 36%) and of social roles played in society or relationships with others (Relational identities: 28%). By contrast, individuals rarely describe themselves in terms of political or social movement affiliation, ethnicity, nationality, religion, or stigmatized group membership: Political, Ethnic/religious and Stigmatized identities are less frequent (4%, 13% and 7%, respectively).

Automatic Text Classification
We use machine learning to automatically assign predefined identity categories to the 160-character Twitter profile descriptions (N=2,633) that are manually annotated as explained in Section 4.1. For each identity category, we want to classify whether or not a profile description belongs to that category. We thus treat social identity classification as a set of binary text classification problems, where each class label can take only two values (i.e., yes or no).
We use automatic text classification and develop binary classifiers in two experiments. Experiment 1 is based on the 5-category social identity classification explained in Section 2. In Experiment 1, we compare the classifiers' performance in two scenarios. First, we use a combined dataset containing both Movember participants and random generic users. Profiles are randomly assigned to a training set (Combined(1): N=2,338) and a test set (Combined(2): N=295). Second, we use separate datasets, i.e., random generic users as training set (Random: N=1,022) and Movember participants as test set (Movember: N=1,611), and vice versa.
Experiment 2 is a follow-up of Experiment 1 in which we use only the combined data. We test a merger of the three social identity categories (i.e., Political, Ethnic/religious and Stigmatized) for which we do not obtain acceptable results in Experiment 1.

Feature Extraction
We use TF-IDF weighting (Salton and Buckley, 1988) to extract useful features from the users' profile descriptions, measuring how important a word, or term, is in the text. Terms with a high TF-IDF score occur frequently in a given description but rarely across the collection, and thus provide the most information. In addition, we adopt standard text processing techniques, such as lowercasing and stop-word removal, to clean up the feature set (Sebastiani, 2002). We apply chi-square feature selection to the profile-description term matrix resulting from the TF-IDF weighting, selecting the terms that are most correlated with each specific identity category (Sebastiani, 2002).
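This pipeline can be sketched with scikit-learn (an assumption; the paper does not name its toolkit). The toy profiles, labels and the value of k below are illustrative only:

```python
# Sketch: TF-IDF weighting with lowercasing and stop-word removal,
# followed by chi-square feature selection for one identity category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

profiles = [
    "Manager and communication expert",
    "Father of an amazing baby girl",
    "Gamer, YouTuber, big fan of pizza",
    "Happily married, mom of two boys",
]
y = [1, 0, 1, 0]  # illustrative Occupational-identity labels (yes/no)

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(profiles)

# Keep the k terms most correlated with the category label; the paper
# searches k in {500, ..., 5000}, but k=5 fits this toy vocabulary.
selector = SelectKBest(chi2, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)
```

In the full experiments the chi-square scores are computed per category, so each binary classifier gets its own selected feature set.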

Classification Algorithms
In the automatic text classification experiments, we evaluate four classification algorithms. First, we use a Support Vector Machine (SVM) with a linear kernel, which requires fewer parameters to optimize and is faster than other kernel functions, such as the polynomial kernel (Joachims, 1998); balanced mode is used to automatically adjust weights for class labels. Second, Bernoulli Naive Bayes (BNB) is applied with the Laplace smoothing value set to 1. Third, Logistic Regression (LR) is trained with balanced weights for class labels. Fourth, the Random Forest (RF) classifier is trained with 100 trees, which speeds up the computation compared to a higher number of trees, for which no significant difference was found in classifier performance; the balanced-subsample technique is used to weight class labels.
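The four configurations above map naturally onto scikit-learn estimators; the sketch below is a plausible reading of the stated settings, not the authors' actual code, and the tiny feature matrix is illustrative:

```python
# Sketch of the four classifiers with the settings described in the text.
import numpy as np
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

classifiers = {
    # Linear-kernel SVM; "balanced" adjusts class weights automatically
    "SVM": SVC(kernel="linear", class_weight="balanced"),
    # Bernoulli Naive Bayes with Laplace smoothing set to 1
    "BNB": BernoulliNB(alpha=1.0),
    # Logistic Regression with balanced weights for class labels
    "LR": LogisticRegression(class_weight="balanced"),
    # Random Forest with 100 trees and balanced-subsample class weights
    "RF": RandomForestClassifier(n_estimators=100,
                                 class_weight="balanced_subsample"),
}

# Tiny illustrative feature matrix (e.g., two selected TF-IDF features)
X = np.array([[0.0, 0.9], [0.8, 0.0], [0.0, 0.7], [0.6, 0.0]])
y = np.array([1, 0, 1, 0])
for name, clf in classifiers.items():
    clf.fit(X, y)
```

Each estimator exposes the same fit/predict interface, which makes the per-category comparison straightforward.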

Evaluation Measures
We conduct an experimental evaluation of the classifiers to determine their performance, i.e., the degree of correct classification. We compare the four classification algorithms on the training sets using Stratified 10-Fold Cross Validation. This technique seeks to ensure that each fold is a good representative of the whole dataset, and it is considered better than regular cross validation in terms of bias-variance trade-offs (Kohavi and others, 1995). In feature selection, we evaluate subsets of different sizes (i.e., 500, 1000, 2000, 3000, 4000 and 5000) containing the features with the highest chi-square scores from the original feature set. We find that 1000 features are the most informative.
Furthermore, we calculate precision (P), recall (R) and F-score to assess the accuracy and completeness of the classifiers. The classification algorithm that provides the best performance according to F-score in the Stratified 10-Fold Cross Validation is then tested on the test sets to gain better insight into the classification results.
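The model-selection step can be sketched with scikit-learn's stratified splitter and cross-validated F1 scoring. The synthetic, class-imbalanced data below stands in for the profile features; everything else about this snippet is an assumption, not the paper's setup:

```python
# Sketch: Stratified 10-Fold Cross Validation scored with F1
# on synthetic, imbalanced data (mimicking rare identity categories).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=50,
                           weights=[0.8, 0.2], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression(class_weight="balanced", max_iter=1000)

# Mean F1 across the 10 folds; precision and recall can be obtained
# the same way with scoring="precision" / scoring="recall".
f1_scores = cross_val_score(clf, X, y, cv=cv, scoring="f1")
print(f"mean F1 = {f1_scores.mean():.3f}")
```

Stratification keeps the yes/no ratio roughly constant across folds, which matters for the rare categories (Political, Ethnic/religious, Stigmatized).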

Classification Experiment 1
In this section, we present the results of Experiment 1 on automatically identifying the 5 online social identities based on the annotated Twitter profile descriptions. In Section 5.1, we show the results of the Stratified 10-Fold Cross Validation on three training sets, i.e., Combined(1), Movember and Random. In Section 5.2, we illustrate and discuss the results of the best classification algorithm on the test sets.

Occupational identity. All classifiers provide very high precision (P>0.800) and recall (R>0.750) for the Occupational identity category (Table 1). The most precise classification algorithm is BNB in the Random set (P=0.859), whereas the classification algorithm with the highest recall is SVM in the combined set (R=0.793). According to F-scores, all classifiers provide good to excellent performances (F>0.700), except for BNB in the Random set (F=0.599). Classifiers trained on the combined set provide the highest F-scores, except for BNB, whose F-score is higher in the Movember set. By contrast, the Random set provides the lowest performances. Overall, SVM and LR provide the best F-scores in all three training sets.
Political, Ethnic/religious and Stigmatized identities. Classifiers perform less well in automatically classifying Political, Ethnic/religious and Stigmatized identities than Relational and Occupational ones (Table 2). Both precision and recall are only borderline acceptable (0.400<P,R<0.690) in all three training sets. When training SVM, BNB and RF, we get ill-defined precision and F-scores, which are consequently set to 0.0 for labels with no predicted samples (in Table 2, these values are marked with a *). As noted earlier in Figure 1, the low number of positive examples of Political, Ethnic/religious and Stigmatized identities in the data may cause this outcome. Classifiers trained on the combined and Movember sets provide similar results, whereas the Random set provides the lowest performance. Overall, the LR classifier provides the best F-scores for each category in all training sets.

LR Classifier Testing
The Stratified 10-Fold Cross Validation shows that the optimal classification algorithm for each identity category is LR. The LR classifier is evaluated on the test sets in order to gain better insight into the classification results. Since we use three training sets, we evaluate the classifier on three different test sets, as explained in Section 4.2.
According to the F-scores (Table 3), we are able to automatically classify Relational and Occupational identities in all three test sets. LR trained and tested on the combined data provides the best results (Relational: F=0.699; Occupational: F=0.766). Although in the Stratified 10-Fold Cross Validation the classifier trained on the Random set performs worse than the one trained on the Movember set, in the final testing the classifier performs better when we use Random as training set and Movember as test set (Relational: F=0.594; Occupational: F=0.737).
Final training and testing using LR on Political, Ethnic/religious and Stigmatized identities (Table 4) is affected by the low number of positive examples in the test sets, as these identities are less frequent in our annotated sample. Classifying Political identities is the most difficult task for the classifier in all three test sets, and the performance is very low (Combined(2): F=0.300; Random: F=0.266; Movember: F=0.098). Regarding Ethnic/religious and Stigmatized identities, LR provides borderline acceptable F-scores only on the combined data (Ethnic/religious: F=0.543; Stigmatized: F=0.425).

Discussion: Merging Identity Categories
In Experiment 1 we show that a classifier trained on the combined data performs better than a classifier trained only on Movember profiles or only on Random profiles. Our results are of sufficient quality for Relational and Occupational identities on the combined set, and thus we are able to automatically classify these social identities on Twitter using LR. Experiment 1 also shows that automatically classifying Political, Ethnic/religious and Stigmatized identities is a challenging task. Although LR provides acceptable F-scores in the Stratified 10-Fold Cross Validation, the classifier is not able to automatically classify those three identities. This may be due to the unbalanced distributions of identity categories in our data, which affect the text classification experiment.
Despite the unsatisfactory classifier performance in detecting Political, Ethnic/religious and Stigmatized identities, we conduct a second experiment to find an alternative way to classify these identities, given their importance in the study of collective action. In this way, we find that using NLP methods invites us to go back to the theory and revisit our framework.
People with strong Political, Ethnic/religious and/or Stigmatized identities are often more engaged in online and offline collective action (Ren et al., 2007; Spears et al., 2002). These identities have a collective, rather than individualistic, nature as they concern individual membership of one or multiple social groups. By sharing a common identity with other group members, individuals may feel more committed to the group's topic or goal. Consequently, they may engage in collective action on behalf of the group, even in cases of power struggle, i.e., when individuals have a politicized identity (Klandermans et al., 2002; Simon and Klandermans, 2001). Political, Ethnic/religious and/or Stigmatized identities are indeed action-oriented (Ren et al., 2007), rather than social statuses as Relational and Occupational identities are (Deaux et al., 1995). These three categories thus share a collective, action-oriented nature. Following these theoretical arguments, we decide to merge Political, Ethnic/religious and Stigmatized identities into one category, called PES identity (N=556). In this way, we also provide more positive examples to the classifiers. In Experiment 2, we train and test the four classification algorithms again on the PES identity using the combined data. In the next section, we present the results of this second experiment and show that by combining these identities we can improve the predictive performance of the classifiers.
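The merger itself is a simple label transformation: a profile counts as PES if it was annotated "Yes" for any of the three constituent categories. A minimal sketch with illustrative annotations (not the study's data):

```python
# Sketch: deriving the merged PES label from per-category annotations.
def pes_label(profile_labels):
    """profile_labels maps category name -> 'Yes'/'No' for one profile."""
    merged = ("Political", "Ethnic/religious", "Stigmatized")
    return "Yes" if any(profile_labels[c] == "Yes" for c in merged) else "No"

annotations = [
    {"Relational": "No", "Occupational": "Yes", "Political": "Yes",
     "Ethnic/religious": "No", "Stigmatized": "No"},
    {"Relational": "Yes", "Occupational": "No", "Political": "No",
     "Ethnic/religious": "No", "Stigmatized": "No"},
]
print([pes_label(p) for p in annotations])  # → ['Yes', 'No']
```

Because the merged positives are the union of three rare categories, the PES classifier sees several times more positive examples than any constituent classifier did.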
Classification Experiment 2

Table 5 shows the values of precision, recall and F-score in the Stratified 10-Fold Cross Validation on the training set (i.e., Combined(1): N=2,338), used to select the optimal classifier. Overall, all classifiers provide fairly acceptable performances for the PES identity category (0.500<F<0.650). Only when validating the BNB classifier do we obtain an ill-defined F-score (in Table 5, this value is marked with a *). RF is the most precise classification algorithm (P=0.758), whereas LR has the highest recall (R=0.608). As in Experiment 1, LR is the optimal classifier, with the highest F-score (F=0.623).
The LR classifier is evaluated on the test set (i.e., Combined(2): N=295) to gain better insight into the classification results. The classifier is highly precise in identifying PES identities (P=0.857). By contrast, recall is quite low (R=0.466), which affects the overall F-score.

Final Discussion and Conclusions
In this study, we explore the task of automatically classifying the Twitter social identities of Movember participants and random generic users in two text classification experiments. We are able to automatically classify two identity categories (Relational and Occupational) and a merger of three categories (Political, Ethnic/religious and Stigmatized). Furthermore, we find that a classifier trained on the combined data performs better than a classifier trained on one group (e.g., Random) and tested on the other (e.g., Movember). We make two main contributions from which both social theory on identity and NLP methods can benefit. First, by combining the two, we find that social theory can be used to guide NLP methods to quickly classify and draw inferences from vast amounts of social media data. Furthermore, using NLP methods can provide input to revisit traditional social theory that is often strongly consolidated in offline settings.
Second, we show that computers can perform a reliable automatic classification for most types of social identities on Twitter. Since there is already much earlier NLP work on inferring demographic traits, it may not be surprising that at least some of these identities can be easily inferred on Twitter. Our contribution lies in the second experiment, where we show that merging identities improves the predictive performance of the classifiers. In this way, we provide social scientists with three social identity classifiers (i.e., Relational, Occupational and PES identities) grounded in social theory that can scale up online identity research to massive datasets. Social identity classifiers may assist researchers interested in the relation between language and identity, and between identity and collective action. In practice, they can be exploited by organizations to target specific audiences and improve their campaign strategies.
Our study presents some limitations that future research may address. First, we retrieve users' profile descriptions from randomly sampled tweets, so people who tweet a lot have a greater chance of ending up in our data. Future research could explore alternative ways of retrieving profile descriptions that avoid biases of this kind.
Second, our social identity classifiers are based only on 160-character profile descriptions, which alone may not provide sufficient features for text classification. We plan to test the classifiers also on tweets, other profile information and network features. Furthermore, the 160-character limit constrains Twitter users to carefully select which identities to express in such a short space. In our study, we do not investigate identity salience, that is, the degree or probability that an identity is more prominent than others in the text. Future research combining sociolinguistics and NLP methods could investigate how semantics are associated with identity salience, and how individuals select and order their multiple identities in Twitter texts.
Third, in the experiments we use standard text classification techniques that are not particularly novel in NLP research. However, they are simple, effective ways to provide input for social theory. We plan to improve the classifiers' performance by including other features, such as n-grams and word clusters. Furthermore, we will explore larger datasets and include more training data for further experimentation with more complex techniques (e.g., neural networks, Word2Vec).