Twitter Homophily: Network Based Prediction of User’s Occupation

In this paper, we investigate the importance of social network information compared to content information in the prediction of a Twitter user’s occupational class. We show that the content information of a user’s tweets, the profile descriptions of a user’s follower/following community, and the user’s social network provide useful information for classifying a user’s occupational group. In our study, we extend an existing data set for this problem, and we achieve significantly better performance by using social network homophily that has not been fully exploited in previous work. In our analysis, we found that by using the graph convolutional network to exploit social homophily, we can achieve competitive performance on this data set with just a small fraction of the training data.


Introduction
Twitter (http://twitter.com) is a microblogging service launched in 2006, where a user can publish messages of up to 280 characters, called "tweets". Unlike many other social networking platforms, such as Facebook and LinkedIn, Twitter does not provide structured fields for users to fill in personal information. However, a user can write a short public summary of up to 160 characters about themselves, called a "Bio". Besides linguistic information from tweets and Bios, online social media is a rich source of network information. People's personal networks are homogeneous, i.e., friends tend to share attributes such as race, ethnicity, religion, and occupation, which is known as the homophily principle (McPherson et al., 2001). Such network information has been utilized in friend recommendation (Guy et al., 2010), community detection (Yang and Leskovec, 2013), etc. Figure 1 shows two users connected on Twitter. By looking at their Bios and tweets, it can be inferred that these users share the same occupational interest.
* Equal Contribution; work performed while both authors were visiting Singapore University of Technology and Design (SUTD).
Profiling users can enhance service quality and improve product recommendation, and hence is a widely studied problem. User occupational class prediction is an important component of user profiling and a sub-task of user demographic feature prediction. Existing approaches to predicting Twitter users' demographic attributes explore, select, and combine various features generated from text and network to achieve the best predictive performance in their respective classification tasks (Han et al., 2013; Miller et al., 2012; Preoţiuc-Pietro et al., 2015; Huang et al., 2015; Aletras and Chamberlain, 2018). The three categories of features are: account level features, tweet text features, and network based features. Past research has shown the distinctive usage of language across gender, age, location, etc. in tweets (Sloan et al., 2015; Cheng et al., 2010; Burger et al., 2011; Rao et al., 2010), which makes content based prediction effective.
As for user occupational class prediction, Preoţiuc-Pietro et al. (2015) built a dataset where users are assigned to hierarchical job categories. They used word cluster distribution features of content information to predict a user's occupational group. Aletras and Chamberlain (2018) used a user's following connections to learn a user embedding that serves as a feature input to the classification models. Because economic development stages differ between regions, the major job categories may vary significantly across regions. Sloan et al. (2015) summarized the occupation distribution of Twitter users in the UK by looking into their profiles.
In this paper, we analyze the usefulness of a user's network information over the user's tweets for predicting their occupational group. We extend the existing dataset for occupation classification (Preoţiuc-Pietro et al., 2015) by introducing network information about each user, i.e., follower/following IDs together with their Bio descriptions, and we construct a user-centric network to extract useful community and text based features. The acquired features from the network are then exploited using a graph neural network. The obtained results show the importance of network information over a user's tweet information for such a task.

Graph Convolutional Network
A graph convolutional network (GCN) propagates node features through its layers according to

$$\hat{A} = \tilde{D}^{-1/2} (A + \lambda I) \tilde{D}^{-1/2}, \qquad X^{(l+1)} = \sigma\left(\hat{A} X^{(l)} W^{(l)} + b^{(l)}\right) \qquad (1)$$

where $X^{(l)}$ is the feature matrix for all the nodes at layer $l$, with $X^{(0)}$ being the initial feature input of size $d_{nodes} \times d_{features}$, $A$ is the adjacency matrix of dimension $d_{nodes} \times d_{nodes}$, $\tilde{D}$ is the degree matrix of $A + \lambda I$, $\lambda$ is a hyperparameter controlling the weight of a node against its neighbourhood, and $W^{(l)}$ and $b^{(l)}$ are the trainable weights and bias of the $l$-th layer, respectively. In each layer of GCN, a node aggregates its direct neighbours' features according to $\hat{A}$ and linearly transforms the representation using $W^{(l)}$ and $b^{(l)}$. A nonlinear activation function $\sigma$ (e.g., ReLU) is then applied. The number of GCN layers determines how many hops away the neighbours' features are smoothed over for each node.

Experimental Setup

Data
We base our work on a publicly available Twitter dataset that maps 5,191 users to 9 major occupational classes (Preoţiuc-Pietro et al., 2015). The dataset contains user IDs (we call these users the main users henceforth) and the bag-of-words from tweets. The hierarchical structure of occupational classes in the data was defined based on the Standard Occupation Classification (SOC) from the UK.
To explore the role of network information in occupational class prediction, we extend the above dataset by crawling follower/following IDs (henceforth referred to as follow IDs) for each main ID (IDs corresponding to main users). For the crawled follow IDs, we further crawl their Bio descriptions. We refer to the extended dataset as ED. ED contains 4,557 main users with both followers and followings information. The remaining Twitter accounts could not be scraped for various reasons, such as account suspension and protected tweets. Table 1 shows the occupational class distribution of the main users in ED. In all our work, we discard the Bio information of the main users, as these Bios were used to annotate the dataset. We tokenize the Bio text of the follow IDs using the GloVe Twitter pre-processing guidelines. As for social network construction, we consider each follower/following relationship as an undirected edge. Social network information is passed between main IDs mainly through common follow IDs, so follow IDs that connect to only a few main IDs contribute little to the information flow.
Thus, we filter the graph by keeping only the follow IDs with more than 10 connections to the main IDs. All connections between main IDs are retained. The filtering step results in 29 main IDs losing all their connections. For each such isolated main ID, we retrieve all of its follow IDs that have at least one other main ID connection. After all these operations, we are able to construct an unweighted graph in which all the main IDs are connected. The filtered graph contains 34,630 unique users (including 4,557 main IDs) and 586,303 edges. Although the main users were not collected to be connected to each other (only 2,550 main IDs have at least one direct connection to another main ID), we find that they often share common follow IDs, which allows us to retrieve their social representations.
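The filtering procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the released implementation: the function name `filter_follow_graph`, the edge-list input format, and the `min_links` parameter are hypothetical, and the rescue step for isolated main IDs assumes that a rescued follow ID brings along all of its edges to main IDs (the text does not specify this detail).

```python
from collections import defaultdict

def filter_follow_graph(edges, main_ids, min_links=10):
    """Filter an undirected follow graph (hypothetical helper, not the authors' code).

    edges: iterable of (u, v) pairs; main_ids: IDs of the main users.
    Returns a set of frozenset({u, v}) edges that survive the filtering.
    """
    main_ids = set(main_ids)
    follow_to_main = defaultdict(set)   # follow ID -> main IDs it is linked to
    main_to_follow = defaultdict(set)   # main ID -> its follow IDs
    kept = set()
    for u, v in edges:
        if u == v:
            continue
        if u in main_ids and v in main_ids:
            kept.add(frozenset((u, v)))          # main-main edges are always retained
        elif u in main_ids or v in main_ids:
            main, follow = (u, v) if u in main_ids else (v, u)
            follow_to_main[follow].add(main)
            main_to_follow[main].add(follow)

    # Keep follow IDs with MORE than `min_links` connections to main IDs.
    for follow, mains in follow_to_main.items():
        if len(mains) > min_links:
            kept.update(frozenset((follow, m)) for m in mains)

    # Rescue main IDs that lost every edge: bring back their follow IDs that
    # touch at least one other main ID (assumption: with all their main edges).
    connected = {node for edge in kept for node in edge}
    for main in main_ids - connected:
        for follow in main_to_follow[main]:
            mains = follow_to_main[follow]
            if len(mains) >= 2:
                kept.update(frozenset((follow, m)) for m in mains)
    return kept
```

With `min_links=10` this mirrors the "more than 10 connections" threshold; a smaller value is convenient for experimenting on toy graphs.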
To compare with previous works, we also construct a partial network dataset that contains only following IDs of all the 4,557 main IDs. We refer to this partial dataset as PD. PD adheres to the same network construction methodology as ED.
We divide the dataset into training, development, and test sets using a stratified split with a ratio of 80%, 10%, and 10%. All the experimental results are reported on the same test set. The split information and the processed dataset ED can be found together with the code on GitHub: https://github.com/jqnap/Twitter-Occupation-Prediction.
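For readers who want to reproduce a comparable split on their own data, a stratified 80/10/10 split can be sketched as follows. This is a minimal standard-library illustration; the function name `stratified_split`, the seed, and the rounding rule are assumptions, and the actual split used in our experiments is the one released with the dataset.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split user IDs into train/dev/test while preserving class proportions.

    labels: dict mapping user ID -> occupational class.
    Hypothetical helper; rounding per class is an illustrative choice.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for user_id, label in labels.items():
        by_class[label].append(user_id)

    train, dev, test = [], [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])   # sort first so the shuffle is reproducible
        rng.shuffle(ids)
        n_train = round(ratios[0] * len(ids))
        n_dev = round(ratios[1] * len(ids))
        train += ids[:n_train]
        dev += ids[n_train:n_train + n_dev]
        test += ids[n_train + n_dev:]   # remainder goes to the test set
    return train, dev, test
```

Splitting within each class, rather than over the whole user list, keeps the class distribution of Table 1 roughly the same in all three sets.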

Features and Models
Node Embeddings: To encode the user-user social relationships of main IDs within the follow network, we learn latent representations of all IDs (node embeddings), which can be easily exploited for the prediction task. The embeddings are learned by forming node sequences using DeepWalk (Perozzi et al., 2014).
Based on the network processing strategy used in Aletras and Chamberlain (2018), we construct unweighted bipartite graphs using our filtered network. The two sides of a bipartite graph are follow IDs and main IDs, respectively. Note that the main ID-main ID connections would break the bipartiteness. To resolve this, we duplicate the main ID nodes on the follow IDs' side and then link the connections between main IDs. We construct such graphs for both ED and PD, and obtain a full graph (fG) and a partial graph (pG), respectively.
Next, we perform 10 random walks starting from each main ID, alternating between main IDs and followers/followings, with a walk length of 80. For each node, the walk sequences are used to generate embeddings using an approach similar to word2vec (Mikolov et al., 2013). We use the same hyper-parameters as in Aletras and Chamberlain (2018).
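The walk generation step can be illustrated with a short sketch. It assumes the bipartite graph is stored as a dictionary from node to neighbour list (a hypothetical format), and it only produces the node sequences; training the embeddings on these sequences with a skip-gram model, for example an off-the-shelf word2vec implementation, is not shown.

```python
import random

def random_walks(adj, start_nodes, walks_per_node=10, walk_length=80, seed=0):
    """Generate uniform random walks over an unweighted graph.

    adj: dict mapping node -> list of neighbours (hypothetical input format).
    On a bipartite graph each step necessarily switches sides, so walks
    alternate between main IDs and follow IDs without extra logic.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in start_nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adj.get(walk[-1])
                if not neighbours:      # dead end: stop this walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

Each walk plays the role of a sentence and each node ID the role of a word, which is why a word2vec-style model can be applied directly to the output.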
Text Features: To have a valid comparison with existing approaches, we construct two sets of text features: (1) bag-of-clusters (Preoţiuc-Pietro et al., 2015): we assign each word that appears in each main ID's concatenated tweets document to its corresponding word cluster, where the word clusters are obtained by applying spectral clustering (Ng et al., 2002; Shi and Malik, 2000) to word embeddings. Next, we calculate the cluster assignment frequencies for each main ID. (2) bag-of-words (BOW): since the initial dataset used the Bio information of the main users to annotate their occupations, we remove all the Bio information of main users. We keep only the 5,000 most frequent words from the Bios (of other users) and another 5,000 words from the tweet text as the dictionaries of two separate BOW vectors that are input to the model. We feed the obtained text features and node embedding features to both the Logistic Regression (LR) classifier and the Support Vector Machine (SVM) classifier. Both classifiers are trained following the one-vs-all approach for the 9-way classification task. ℓ2 regularization is used for LR, with a coefficient tuned on the development set. We use the RBF kernel for SVM, normalize the features before feeding them to SVM as inputs, and tune the regularization coefficient C using the development set.
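As a concrete illustration of the BOW features, the sketch below builds a dictionary of the most frequent words and maps a token list to a count vector. The helper names `build_vocab` and `bow_vector` are hypothetical, raw counts are an assumption (the weighting scheme is not specified above), and tokenization is assumed to have been done beforehand.

```python
from collections import Counter

def build_vocab(token_lists, max_words=5000):
    """Map the `max_words` most frequent tokens to column indices."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_words))}

def bow_vector(tokens, vocab):
    """Count vector over the vocabulary; out-of-vocabulary tokens are dropped."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec
```

Two such dictionaries would be built, one from the follow Bios and one from the tweets, so that each source yields its own BOW vector.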
GCN: In the case of GCN (as shown in Figure 2), we use its transductive semi-supervised setting. The inputs are the adjacency matrix of all the network IDs and a feature matrix of the Bios' bag-of-words. Specifically, we keep the input feature vectors corresponding to the main IDs as null (all zeros), since their Bios were discarded. We experiment with GCNs of 2, 3, and 4 convolutional layers. The 3-layer GCN slightly outperformed the

Text Features and Node Embeddings
As shown in Table 2, we compare our results using network information with existing methods: bag-of-clusters (Preoţiuc-Pietro et al., 2015) and DeepWalk on the followings graph concatenated with bag-of-clusters (Aletras and Chamberlain, 2018). We first conduct experiments on our collected ED dataset with 4,557 main users using existing methods. The best accuracy among existing methods, 55.0%, is given by the concatenated bag-of-clusters and DeepWalk embeddings.
Next, we investigate the performance of bag-of-words features from main ID tweets and follow Bios using logistic regression (LR) and support vector machines (SVM). From the experiments on tweets, we find that using the bag-of-words features achieves performance comparable to using the bag-of-clusters features. Thus we opt for the bag-of-words representation in subsequent experiments. The optimized model using Bio text features outperforms the one using tweet content. This suggests that the Bio descriptions of follow accounts provide more useful information than tweets. The reason could be the higher noise in tweets, whereas people are comparatively more careful when writing their Bios.
The next set of results uses follow network features. Based on Aletras and Chamberlain (2018), we run DeepWalk with 32-dimensional learned node representations and use them as input to LR and SVM. We achieve higher accuracy (55.3%) compared to tweets BOW (54.6%). However, the model is less effective than using follow Bio BOW. Combining both node representations and follow Bio BOW features further boosts the accuracy to 57.5%.

Table 2: Performance comparison, in terms of accuracy (%), of logistic regression (LR), support vector machines (SVM), and graph convolutional networks (GCN). The first two rows (marked with *) are existing approaches from Preoţiuc-Pietro et al. (2015) and Aletras and Chamberlain (2018). The numbers in brackets are the dimensions of the feature space. pG and fG refer to partial graph and full graph, respectively. We use F-Bio to denote "Follower Bio BOW".

GCN
To analyze the importance of Bios in conjunction with social network information, we exploit graph convolutional networks. With an accuracy of 59.9%, the model substantially outperforms existing approaches on tweets and partial network information. Our best result, 61.0% accuracy, is achieved by using GCN with one-hot encoding for nodes, which is significantly higher than existing methods. This shows that GCN is able to exploit the rich topological information of the network to learn social representations for users. We postulate that the GCN with Bio did not do better than a plain one-hot encoding for nodes because the main users do not have Bios: all the labeled nodes in the GCN have no Bio features, which makes learning difficult.
We visualize the GCN final layer representations of the training set (big ovals) and test set (dark colored dots) in Figure 3a. It can be observed that many test data samples are mapped to the correct occupation group, showing the capability of GCN to utilize Twitter network information for the prediction task. To analyze wrongly mapped test samples, we examine the confusion matrix shown in Figure 3b. We see that group 4 is often predicted as belonging to group 1 or 2. When we compare the jobs in groups 1, 2, and 4, we find that they contain similar types of sub-occupations, such as "financial account managers" and "finance officers", or "engineers" and "engineering technicians". The same phenomenon can be seen for group 9 and group 5.
Figure 3c compares the performance of two models, one using tweet-only features (LR-tweets) and one using follow network features (GCN-Bio), as a function of the fraction of training samples used for model learning. Even with 10% of the labeled training data, GCN with Bio-BOW features achieves accuracy comparable to existing models as well as to models trained on tweet BOW with the entire training set. This shows the significance of a user's network information.
We analyze the predictions on test samples made by GCN with Bio feature input and GCN with one-hot encoded input. We find that 11% of the test set's main IDs are correctly classified by only one of the two GCNs. This suggests that Bio features provide information complementary to the one-hot encoded input. In this work, the acquired network is dense. When the network is sparse, a one-hot representation of each ID seems infeasible, while BOW may generalize to the larger graph.
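To make the two input settings concrete, the following NumPy sketch runs an untrained 3-layer GCN forward pass on a toy graph, once with one-hot node features and once with Bio BOW features whose main-ID rows are zeroed. It is an illustration only: it assumes the commonly used symmetric normalization of the adjacency matrix, and the helper names, layer sizes, random weights, and toy graph are all hypothetical rather than taken from our released code.

```python
import numpy as np

def normalize_adjacency(A, lam=1.0):
    """Return D^{-1/2} (A + lam*I) D^{-1/2}, with D the degree matrix of A + lam*I."""
    A_tilde = A + lam * np.eye(A.shape[0])
    inv_sqrt_deg = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * inv_sqrt_deg[:, None] * inv_sqrt_deg[None, :]

def gcn_forward(A_hat, X, weights, biases):
    """Apply H <- A_hat @ H @ W + b per layer, with ReLU on all but the last layer."""
    H = X
    for i, (W, b) in enumerate(zip(weights, biases)):
        H = A_hat @ H @ W + b
        if i < len(weights) - 1:
            H = np.maximum(H, 0.0)   # ReLU
    return H                          # raw class scores, one row per node

def init_params(dims, rng):
    """Random weights and zero biases for consecutive layer sizes in `dims`."""
    weights = [rng.normal(scale=0.1, size=(d_in, d_out))
               for d_in, d_out in zip(dims[:-1], dims[1:])]
    biases = [np.zeros(d_out) for d_out in dims[1:]]
    return weights, biases

# Toy graph: nodes 0-1 are main IDs, nodes 2-4 are follow IDs.
A = np.array([[0, 0, 1, 1, 0],
              [0, 0, 0, 1, 1],
              [1, 0, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 1, 0, 0, 0]], dtype=float)
A_hat = normalize_adjacency(A)
rng = np.random.default_rng(0)

# Setting 1: one-hot node identities (feature dimension = number of nodes).
X_onehot = np.eye(5)
# Setting 2: Bio BOW features; main IDs get all-zero rows since their Bios are discarded.
X_bio = rng.random((5, 4))
X_bio[:2] = 0.0

scores_onehot = gcn_forward(A_hat, X_onehot, *init_params([5, 8, 8, 9], rng))
scores_bio = gcn_forward(A_hat, X_bio, *init_params([4, 8, 8, 9], rng))
```

The sketch also shows why one-hot input scales poorly: its feature dimension grows with the number of nodes, whereas the BOW dimension is fixed by the dictionary size.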
While occupational class prediction could be used to improve service quality, we note that the use of network information might result in unintended consequences, such as segregation based on race and ethnicity in online spaces. To mitigate such risks, it would be useful in future work to incorporate explainable predictions, building on work such as Xie and Lu (2019).

Conclusion and Future Work
Previous works have used tweets or a fraction of the network information to extract features for occupation classification. To analyze the importance of network information, we extended an existing Twitter dataset with users' social media connections (follow information). We showed that by using only follow information as an input to graph convolutional networks, one can achieve significantly higher accuracy on the prediction task than existing approaches that utilize tweet-only information or partial network structure.
Directions of future research include the adaptation of our methods to a large-scale, sparsely connected social network. One might also want to investigate the inductive settings of GCN (Hamilton et al., 2017) to predict demographic information of a user from outside the black network.