Demographic Inference on Twitter using Recursive Neural Networks

In social media, demographic inference is a critical task in order to gain a better understanding of a cohort and to facilitate interacting with one’s audience. Most previous work has made independence assumptions over topological, textual and label information on social networks. In this work, we employ recursive neural networks to break down these independence assumptions to obtain inference about demographic characteristics on Twitter. We show that our model performs better than existing models including the state-of-the-art.


Introduction
Social media is a popular public platform for communicating, sharing information and expressing opinions. Millions of users discuss a variety of topics such as politics or sports. Valuable insights can be obtained by analysing social media content (e.g., mining consumer preferences), and, consequently, social media data is now a valuable resource. Accordingly, social media analytics have received much attention among researchers and companies (Wan and Paris, 2014; Valdes et al., 2015; Zubiaga et al., 2016).
A common approach to inferring demographic characteristics is to use supervised classifiers trained on textual features. The main limitation of this approach is that it makes little use of the network topology. Several network embedding methods have been proposed to learn distributed dense representations for vertices in graphs, such as DeepWalk (Perozzi et al., 2014) and LINE (Tang et al., 2015). While these two models can capture the topological structure of social networks, their performance is still limited, as the text features associated with vertices are not considered. For instance, the text messages that Twitter users post (tweets) offer great potential to enhance the vertex embeddings. Text-Associated DeepWalk (TADW) was proposed to enhance the discriminative power of the vertex embeddings by incorporating the text information into the embedding generation process. Although this matrix factorisation framework is effective on the vertex classification task, it can produce suboptimal embeddings because the label information is not exploited in the unsupervised framework. More recently, Pan et al. (2016) proposed the Tri-Party Deep Network Representation (TriDNR), a method that incorporates the label information in addition to the text and topological information. However, in TriDNR, the text in a vertex is assumed to be independent of neighbour vertices. Furthermore, the two optimisation problems (learning vertex embeddings and training discriminative classifiers) are tackled separately. This is also true of TADW.
In this paper, we tackle the problem of inferring demographic characteristics from social networks as a vertex classification task on graphs. We employ recursive neural networks (RNNs) to infer three demographic attributes of Twitter users (age, gender and user type) based on network topology, text content and label information. Our model breaks down the independence assumption by leveraging RNNs on paths in graphs. We show that our model achieves better performance than existing models, including the state-of-the-art. High performance is achieved using neural network models alone; more importantly, we find that different demographic inference tasks benefit from different topological sizes of RNNs. To our knowledge, there has been little previous work applying neural network based methods to the problem of inferring social media demographics.

Graph Recursive Neural Networks
RNNs are deep learning models that recursively compose the vector of a parent unit from those of child units over a given structure in topological order (Pollack, 1990). They have been shown to be very effective for various natural language processing (NLP) tasks, capturing syntactic and semantic composition (Socher et al., 2011; Qian et al., 2015). In this section, we describe the framework of Graph RNNs (GRNNs) (Xu et al., 2017) to classify the vertices of a graph. This framework allows us to infer the demographic characteristics of social media users. We formally define the problem of Twitter vertex classification as follows. A Twitter social network is defined as $G = (V, E)$, where $V$ is the set of vertices (users) and $E$ is the set of edges (relationships) between the vertices. Each edge $e \in E$ is an ordered pair $(v_i, v_j)$ of vertices. Each vertex $v_i$ is associated with a pair $(x_i, y_i)$, where $x_i \in X$ is a feature vector and $y_i \in Y$ is a particular label that depends on $x_i$. $X$ and $Y$ thus denote the set of feature vectors and the set of possible predicted labels in $G$, respectively. Our goal is then to predict the most likely label $\hat{y}_t \in Y$ for the target vertex $v_t \in V$ to be classified, using an RNN with parameters $\theta$:
$$\hat{y}_t = \arg\max_{y_t \in Y} P_\theta(y_t \mid v_t, G, X).$$
GRNNs contain four components that will be presented in this section: 1) Graph-to-Tree Conversion, 2) Word Embedding layer, 3) Recursive Neural Unit layer and 4) Softmax Output layer, as illustrated in Figure 1.

Graph-to-Tree Conversion
A graph is converted to tree structures before constructing a RNN for each tree. Specifically, we construct a tree T = (V T , E T ) of depth d rooted at v t using a breadth-first search algorithm from G. V T and E T are the sets of vertices and edges in the tree. (v, w) ∈ E T denotes an edge from a parent vertex v to a child vertex w.
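The conversion step can be illustrated with a minimal sketch. It assumes the graph is given as an adjacency dictionary and that each vertex is placed in the tree at most once (the first time the breadth-first search reaches it); the function name `graph_to_tree` and the toy graph are hypothetical and not taken from the original implementation.

```python
from collections import deque

def graph_to_tree(adjacency, root, depth):
    """Build a depth-limited breadth-first tree rooted at `root`.

    adjacency: dict mapping each vertex to an iterable of its neighbours.
    Returns a dict mapping every tree vertex to the list of its children.
    Assumption: a vertex is added only the first time it is reached.
    """
    children = {root: []}
    queue = deque([(root, 0)])
    while queue:
        vertex, level = queue.popleft()
        if level == depth:
            continue  # do not expand vertices at the maximum depth
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in children:
                children[neighbour] = []
                children[vertex].append(neighbour)
                queue.append((neighbour, level + 1))
    return children

# Hypothetical toy graph: a -> b, c; b -> d; c -> d; d -> e
toy_graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
tree_d0 = graph_to_tree(toy_graph, "a", 0)
tree_d1 = graph_to_tree(toy_graph, "a", 1)
tree_d2 = graph_to_tree(toy_graph, "a", 2)
```

With d = 0 the tree contains only the target vertex, so the model sees text alone; each additional level adds one more hop of neighbours.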

Word Embeddings
Let $S_i = \{w_1, w_2, \ldots, w_R\}$ be the text (e.g., tweets) consisting of $R$ words that is associated with a vertex $v_i$. Every word $w_r$ is converted into a real-valued vector $e_r$ by looking up the embedding matrix $E \in \mathbb{R}^{d_w \times |V|}$, where $d_w$ is the size of the word embedding and $|V|$ is the vocabulary size (in this subsection, $E$ and $V$ denote the embedding matrix and the vocabulary rather than the edge and vertex sets of $G$). The matrix $E$ is initialised using the Skip-gram model (Mikolov et al., 2013). $S_i$ is then fed into the next layer as a real-valued feature vector $x_i = \{e_1, e_2, \ldots, e_R\}$.
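The lookup can be sketched as follows. The vocabulary, the embedding values and the helper names are hypothetical; the table is stored with one row per word (the transpose of the layout described above), and averaging the word vectors into a single fixed-size vector is an assumption made here for illustration, since the text does not state how the sequence of embeddings is reduced.

```python
# Hypothetical toy vocabulary and embedding table with d_w = 3.
vocabulary = {"good": 0, "game": 1, "vote": 2}
embedding_table = [
    [0.1, 0.2, 0.3],   # "good"
    [0.0, -0.5, 0.4],  # "game"
    [0.7, 0.1, -0.2],  # "vote"
]

def embed_text(words, vocabulary, embedding_table):
    """Return the embedding vectors of the in-vocabulary words, in order."""
    return [embedding_table[vocabulary[w]] for w in words if w in vocabulary]

def mean_pool(vectors, size):
    """Average vectors into one fixed-size vector (assumed reduction)."""
    if not vectors:
        return [0.0] * size
    return [sum(v[j] for v in vectors) / len(vectors) for j in range(size)]

# Out-of-vocabulary words are skipped in this sketch.
x_i = embed_text(["good", "unknownword", "vote"], vocabulary, embedding_table)
x_i_pooled = mean_pool(x_i, 3)
```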

Recursive Neural Units
Once we construct a tree from a graph, we build a RNN using one of two types of recursive neural units (RNUs) for each vertex v k ∈ T : Naive Recursive Neural Unit (NRNU) and Long Short-Term Memory Unit (LSTMU).

Graph Naive Recursive Neural Net
Each NRNU for a vertex $v_k$ takes its feature vector $x_k$ and a hidden state $\tilde{h}_k$ as input. Max pooling produces $\tilde{h}_k$ from all hidden state vectors $h_r$ of the child vertices $v_r$ of $v_k$. A hidden state vector $h_k$ of $v_k$ is obtained using weight matrices, followed by the non-linear function tanh:
$$\tilde{h}_k = \max_{v_r \in C(v_k)} h_r, \qquad h_k = \tanh\left(W^{(h)} x_k + U^{(h)} \tilde{h}_k + b^{(h)}\right),$$
where $C(v_k)$ denotes the set of child vertices of $v_k$ and $h_r$ is the hidden state of a child vertex $v_r$. $W^{(h)}$ and $U^{(h)}$ are weight matrices, and $b^{(h)}$ is a bias vector. In this paper, we refer to the GRNN that uses the NRNU as its RNU as the Graph Naive Recursive Neural Network (GNRNN).
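A minimal pure-Python sketch of this unit, applied bottom-up over a tree, may make the recursion concrete. It assumes that a leaf (a vertex without children) receives an all-zero pooled state and that each vertex already has a fixed-size feature vector; the helper names and the toy weights are hypothetical.

```python
import math

def max_pool(vectors, size):
    """Element-wise max over child hidden states (zeros for a leaf, by assumption)."""
    if not vectors:
        return [0.0] * size
    return [max(v[j] for v in vectors) for j in range(size)]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def nrnu(x_k, child_states, W, U, b):
    """Compute tanh(W x_k + U h~_k + b), with h~_k the max-pooled child states."""
    h_tilde = max_pool(child_states, len(b))
    Wx = matvec(W, x_k)
    Uh = matvec(U, h_tilde)
    return [math.tanh(wx + uh + bias) for wx, uh, bias in zip(Wx, Uh, b)]

def encode(vertex, children, features, W, U, b):
    """Recursively compute the hidden state of `vertex` from its subtree."""
    child_states = [encode(c, children, features, W, U, b) for c in children[vertex]]
    return nrnu(features[vertex], child_states, W, U, b)

# Hypothetical toy tree and parameters (hidden size 2, identity weights).
children = {"a": ["b", "c"], "b": [], "c": []}
features = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0]}
identity = [[1.0, 0.0], [0.0, 1.0]]
h_root = encode("a", children, features, identity, identity, [0.0, 0.0])
```

The root state `h_root` depends on both children through the pooled vector, which is how information from neighbouring users reaches the target vertex.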

Graph Long Short-Term Memory Net
LSTMU (Hochreiter and Schmidhuber, 1997) was originally proposed to tackle sequential labelling problems, and it is able to model long-range dependencies by incorporating gated memory cells. At each time step, LSTMU takes the sequential input vector and the previous hidden state vector to produce the next hidden state. In this study, LSTMU is employed as an RNU to represent a vertex in a tree, and it naturally captures the relationships between vertices. For a vertex $v_k$, LSTMU takes $x_k$ and $\tilde{h}_k$ as input, and generates the input, forget and output gate signals, denoted as $i_k$, $f_k$ and $o_k$ respectively. It produces a memory cell state $c_k$ and a hidden state $h_k$ for the vertex $v_k$ through gated updates, in which $\odot$ refers to the element-wise product and $\sigma$ indicates the sigmoid function. $W^{(*)}$, $U^{(*)}$ and $b^{(*)}$ are LSTM parameters. We call a tree-structured network topology consisting of LSTMUs a Graph Long Short-Term Memory Net (GLSTMN).
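For orientation, one plausible form of the gated updates is sketched below. It is an assumption based on the standard LSTM formulation combined with the max pooling used for the NRNU, not a reproduction of the equations of the original model: the candidate vector $u_k$, the pooled memory cell $\tilde{c}_k$ and the child set $C(v_k)$ are notation introduced here, and the way child memory cells are combined may differ in the actual GLSTMN.

```latex
\begin{align*}
\tilde{h}_k &= \max_{v_r \in C(v_k)} h_r, &
\tilde{c}_k &= \max_{v_r \in C(v_k)} c_r, \\
i_k &= \sigma\left(W^{(i)} x_k + U^{(i)} \tilde{h}_k + b^{(i)}\right), &
f_k &= \sigma\left(W^{(f)} x_k + U^{(f)} \tilde{h}_k + b^{(f)}\right), \\
o_k &= \sigma\left(W^{(o)} x_k + U^{(o)} \tilde{h}_k + b^{(o)}\right), &
u_k &= \tanh\left(W^{(u)} x_k + U^{(u)} \tilde{h}_k + b^{(u)}\right), \\
c_k &= i_k \odot u_k + f_k \odot \tilde{c}_k, &
h_k &= o_k \odot \tanh(c_k).
\end{align*}
```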

Softmax Output
Finally, after the hidden states of all vertices in $T$ have been calculated, the hidden state $h_t$ is fed into a softmax classifier to predict the label of the target vertex $v_t$:
$$P_\theta(y_t \mid v_t, G, X) = \mathrm{softmax}\left(W^{(s)} h_t + b^{(s)}\right), \qquad \hat{y}_t = \arg\max_{y_t \in Y} P_\theta(y_t \mid v_t, G, X).$$
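The output layer can be sketched in a few lines. The weights, the root hidden state and the helper names below are hypothetical toy values; subtracting the maximum score before exponentiation is a common numerical-stability measure and does not change the resulting probabilities.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(h_t, W_s, b_s, labels):
    """Return (predicted label, class probabilities) for the root state h_t."""
    scores = [sum(w * h for w, h in zip(row, h_t)) + bias
              for row, bias in zip(W_s, b_s)]
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda j: probabilities[j])
    return labels[best], probabilities

# Hypothetical root hidden state and softmax parameters for two classes.
label, probs = predict([0.5, -0.2], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0],
                       ["Male", "Female"])
```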

Experimental Setup
In this section, we provide an overview of datasets and the models that are evaluated in the experiments.

Datasets
We evaluate the effectiveness of our model on three types of social networks, built for gender, age and user type classification. Twitter users follow others or are followed, and two types of relationships are used to build the social networks: friend and follower. A friend relationship associates a user with an account that the user follows (the user's friend, in Twitter's terminology). Follower relationships indicate that a user receives all the tweets from those the user follows.
• Gender (Volkova, 2014) is a Twitter social network encoding friend relationships between users. The labels of this network are Male and Female.
• Age (Volkova, 2014) is a Twitter social network encoding friend relationships between users. The labels of this network are Young (18-23 years old) and Old (25-30 years old).
• UserType (Kim et al., 2017) is a Twitter social network encoding follower relationships between users. The labels represent the types of Twitter users: Individual, Organisation and other.
To generate text features of vertices, up to 1K tweets per user are used in Gender and Age, whereas Twitter user profile descriptions are used in UserType. All words are stemmed, and then stop words and words with document frequency less than 10 are removed. The statistics of the datasets are summarised in Table 1.
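The vocabulary filtering step can be sketched as follows. The sketch assumes that tokens are already stemmed and that one "document" corresponds to the text of one user, which the description above does not state explicitly; the function name and the toy data are hypothetical, and the toy call lowers the threshold from 10 to 2 so that the effect is visible.

```python
from collections import Counter

def filter_vocabulary(documents, stop_words, min_df=10):
    """Drop stop words and words whose document frequency is below min_df.

    documents: list of token lists, one per user (assumed already stemmed).
    """
    document_frequency = Counter()
    for tokens in documents:
        document_frequency.update(set(tokens))  # count each word once per document
    keep = {word for word, df in document_frequency.items()
            if df >= min_df and word not in stop_words}
    return [[word for word in tokens if word in keep] for tokens in documents]

# Hypothetical toy corpus of three users.
toy_documents = [["the", "game", "win"], ["the", "game"], ["vote"]]
filtered = filter_vocabulary(toy_documents, stop_words={"the"}, min_df=2)
```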

Evaluated Models
We compare the GRNN model with several existing models to assess vertex classification performance.
• Lexica (LX) (Sap et al., 2014): a lexicon-based method produced from Twitter to calculate gender and age scores. These scores are used to predict the labels of users.
• Logistic Regression (LR) (Hosmer Jr et al., 2013): a commonly used linear model in the NLP community, only using textual contents in vertices. Bag-of-words feature vectors are generated without incorporating any topological information of a network.
• Label Propagation (LP) (Wang and Zhang, 2006): a graph-based semi-supervised learning model, where label probabilities are propagated to all unlabelled neighbours. The probability derivation steps are terminated for the remaining vertices when all label probabilities converge.
• Text-Associated DeepWalk (TADW): an unsupervised vertex embedding learning method. Low-dimensional representations of vertices are induced both from their texts and graph relationships based on inductive matrix factorisation.
• Tri-Party Deep Network Representation (TriDNR) (Pan et al., 2016): two neural networks incorporating the texts, relationships and labels of vertices in graphs. As in TADW, unlabelled vertices are classified using Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) trained on learned vertex embeddings.

Experiment Settings
In our experiments, we follow the standard experimental protocol for the vertex classification task. More specifically, we evaluate classification accuracy with different training ratios, increasing from 20% to 80%. For each training ratio, we randomly generate 5 different training datasets. For each training dataset, we run 10 trials and record the highest accuracy on the corresponding testing dataset. We then report the average accuracy over the training datasets with the same ratio.
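The protocol can be summarised in a short sketch. The function `train_and_score` is a hypothetical callback standing in for training a model and returning its test accuracy, and the intermediate training ratios used in the example are an assumption, since only the range from 20% to 80% is stated.

```python
import random

def evaluate_protocol(vertices, train_and_score, ratios,
                      n_splits=5, n_trials=10, seed=0):
    """Average, per training ratio, the best accuracy of each random split."""
    rng = random.Random(seed)
    results = {}
    for ratio in ratios:
        best_per_split = []
        for _ in range(n_splits):
            shuffled = list(vertices)
            rng.shuffle(shuffled)
            cut = int(len(shuffled) * ratio)
            train, test = shuffled[:cut], shuffled[cut:]
            # Run several trials on the same split and keep the highest accuracy.
            accuracies = [train_and_score(train, test, trial)
                          for trial in range(n_trials)]
            best_per_split.append(max(accuracies))
        results[ratio] = sum(best_per_split) / n_splits
    return results

def dummy_scorer(train, test, trial):
    """Placeholder scorer: accuracy grows with the trial index (illustration only)."""
    return trial / 10.0

summary = evaluate_protocol(list(range(10)), dummy_scorer,
                            ratios=(0.2, 0.4, 0.6, 0.8))
```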
We test six different model architectures using NRNU and LSTMU with three different tree depths (d = 0, d = 1, d = 2). The tree depth corresponds to the number of hops between users in a social network. For all the GRNN models, the size of the hidden units is set to 200. We use Adagrad (Duchi et al., 2011) with a batch size of 20 as the optimisation method that automatically adapts the learning rate in training. The initial learning rate is set to 0.1 for LR and LP, and 0.01 for all GRNNs.
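How Adagrad adapts the learning rate can be seen from its update rule: each parameter's step is divided by the square root of the running sum of its squared gradients, so frequently updated parameters take smaller steps. The sketch below uses the initial learning rate of 0.01 stated above; the value of `epsilon` and the function name are assumptions.

```python
import math

def adagrad_step(params, grads, accum, learning_rate=0.01, epsilon=1e-8):
    """Apply one in-place Adagrad update.

    accum holds the running sum of squared gradients for each parameter.
    """
    for j, g in enumerate(grads):
        accum[j] += g * g
        params[j] -= learning_rate * g / (math.sqrt(accum[j]) + epsilon)
    return params, accum

# Two steps with the same gradient: the second step is smaller than the first.
params, accum = [1.0], [0.0]
adagrad_step(params, [2.0], accum)
after_first = params[0]
adagrad_step(params, [2.0], accum)
after_second = params[0]
```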

Results and Analysis
In this section, we present the experimental results and analysis on vertex classification for the three networks. Numbers in bold represent the highest performance in each column in all tables. As shown in Table 2, GNRNN and GLSTMN perform better than the other evaluated models for the task of gender prediction, with no performance gain when the depth of the tree is increased. Namely, GLSTMN d0 is the best performing model for this task. Table 3 shows that GLSTMN also achieves the best performance for age prediction. For this task, however, performance increases with the depth of the tree. For the task of user type classification, Table 4 shows that we attain the best performance with GLSTMN at tree depth d = 1 for the training ratio of 80%. Note that we could not train the GRNN models with tree depth 2 (GNRNN d2 and GLSTMN d2) on the UserType dataset because our Linux server with 64 GB of memory ran out of memory. As shown in Table 1, the average degree of UserType is roughly 10 times larger than that of Gender or Age, where the degree of a vertex is the number of edges connecting it to adjacent vertices. This means that the GRNN models could have approximately 625 (= 25 × 25) vertices on average for a tree depth of 2. Our experiments show that a dense graph consisting of high-degree vertices is intractable under the GRNN model.
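The growth in tree size can be made explicit with a rough estimate that treats the average degree as a uniform branching factor and ignores vertices reached more than once (both simplifying assumptions). The degree of 25 follows the figure quoted above; the degree of 2.5 for the sparser networks is only inferred from the "roughly 10 times" comparison and is illustrative.

```python
def vertices_per_level(average_degree, depth):
    """Estimated vertex count at each level of a tree with uniform branching."""
    return [average_degree ** level for level in range(depth + 1)]

def estimated_tree_size(average_degree, depth):
    """Estimated total number of vertices in the tree, root included."""
    return sum(vertices_per_level(average_degree, depth))

dense_levels = vertices_per_level(25, 2)        # root, first hop, second hop
dense_total = estimated_tree_size(25, 2)
sparse_total = estimated_tree_size(2.5, 2)      # assumed degree, illustration only
```

Under these assumptions the second hop alone contributes 625 vertices per target user in the dense network, against fewer than 10 vertices in total for a network with one tenth of the degree.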
Interestingly, LR and LP are effective methods for Gender and Age compared to TADW and TriDNR. In particular, TADW performs poorly on Gender, and TriDNR marginally outperforms GLSTMN on the dense social network, UserType.
To summarise, we observed four findings:
3. GRNNs have different optimal tree depths for each demographic inference task. Tree depth does not improve inference performance for Gender, indicating that text information alone is sufficient without incorporating network information. For Age, tree depth allows GRNNs to be more effective, although increasing it does not consistently lead to better performance for GLSTMN. Similarly, tree depth increases the performance of GRNNs for UserType. We hypothesise that this may be related to the nature of social interactions (e.g., Twitter users in a similar age group are more likely to interact).
4. The matrix factorisation-based methods (TADW and TriDNR) work relatively well on datasets with high-degree vertices, whereas the GRNN models achieve relatively good performance on graphs containing low-degree vertices.

Conclusions and Future Work
In this paper, we tackled the demographic inference problem on Twitter as vertex classification on a graph using GRNNs, demonstrating their effectiveness against strong models on selected datasets. The RNN framework provides an effective way to incorporate network, text and label information for Twitter demographic inference. However, different demographic inference tasks benefit from different tree depths of the GRNN models. In future work, we plan to employ other state-of-the-art deep learning models (Kipf and Welling, 2017) that vary in the nature of the dependency between network, text and label information for demographic inference, to confirm the effectiveness of our proposals.