Learning Linguistic Descriptors of User Roles in Online Communities

Understanding the ways in which users interact with different online communities is crucial to social network analysis and community maintenance. We present an unsupervised neural model to learn linguistic descriptors for a user’s behavior over time within an online community. We show that the descriptors learned by our model capture the functional roles that users occupy in communities, in contrast to those learned via a standard topic-modeling algorithm, which simply reflect topical content. Experiments on the social media forum Reddit show how the model can provide interpretable insights into user behavior. Our model uncovers linguistic differences that correlate with user activity levels and community clustering.


Introduction
Social scientists and community maintainers are interested not only in the topics that users discuss in online communities, but also the manner and patterns of behavior within those discussions (Welser et al., 2007). For example, a community maintainer might be interested in knowing the proportion of users that come to the community seeking support versus the proportion of users that actively provide that support. A social scientist might be interested in how these different types of functional roles interact with different behaviors, community cohesion, user activity levels, etc.
Methods for detecting and characterizing these functional roles have had mixed results. Previous computational methods for detecting these roles required significant hand-engineering (Welser et al., 2007) or relied on non-textual features particular to one online community (Welser et al., 2011). Creating general frameworks for automatically characterizing these functional behaviors is difficult because such models must rely primarily on text from community discussions. When examining multiple communities, the social-functional aspects of language can be obscured by differences in subject matter and jargon, and because in many communities the roles that users can occupy are not predefined or specified in any way. A key technical challenge, then, is automatically identifying linguistic variation that signals the varying social function of a user's posts, as opposed to variation that simply arises from topical differences across communities. In this work we explore an unsupervised neural network-based method for learning linguistic descriptors that reflect the social roles present in different communities. Unlike standard topic modeling, we seek to learn descriptors that are independent of the subject matter of any one particular community.
We apply our method to data from a collection of sub-communities from the social media forum Reddit. We find that our method is able to pick up on the abstract, functional-communicative roles that users occupy in the communities, while a baseline topic model learns concrete, topical descriptors. Analyzing the behavior of users associated with these descriptors, we find significant and intuitive differences between users of different activity levels and between users with different levels of clustering within their social network.

Related Work
The idea of social roles and methods for identifying them are fairly long-standing concepts; we refer readers to Welser et al. (2007) for an in-depth review of the early history. Welser et al. (2007), analyzing Usenet forum posts, exemplifies early approaches for identifying social roles, which primarily relied on creating visualizations of authorship and reply networks and then manually inspecting them to identify patterns. More recent work has leveraged more sophisticated computational models and methods, such as information cascades, to identify a particular role: that of social leaders and influencers (Lü et al., 2011; Tsai et al., 2014; Rosenthal, 2014). Jain et al. (2014) attempt to identify a broader set of functional roles, but their work is limited to online arguments and relies on human annotation of content. Our work attempts to provide a more general unsupervised method for identifying many functional roles in online communities, and we identify or define these roles in terms of the stylistic word choices that users make.
A separate line of work has investigated the process of "socialization" in online communities and the dynamics of multi-community engagement. Nguyen and Rosé (2011) show how users begin to use more informal language and refer more to other members as they spend more time in a community. Danescu-Niculescu-Mizil et al. (2013) build off this work and show that user lifespan can be predicted based upon the language of a user. They also characterize important linguistic differences for active long-term users, compared to inactive newer users, such as a decreased use in first-person pronouns. Tan and Lee (2015) extend this line of work to the multi-community setting, tracking thousands of users across various sub-communities on Reddit.
Lastly, our work draws heavily on Iyyer et al. (2016), which presents an unsupervised neural network, called a relationship modeling network (RMN), for modeling the relationships between fictional characters in literary works. Importantly, the model is able to learn descriptors for relationships that are independent of the book in which they appear, even though books vary significantly in language due to temporal and geographical differences between authors. The RMN also learns the trajectory, or progression of descriptors, of the character relationships over time. Iyyer et al. (2016) provide an in-depth analysis demonstrating the effectiveness of their model, including crowdsourced verification of the quality of the learned descriptors and relationship trajectories, a thorough comparison against a hidden Markov model baseline, and quantitative analysis of results in line with previous literary scholarship.

Model
Given the success of the relationship modeling network (RMN), we closely follow Iyyer et al. (2016) in adapting the model to our setting. The model uses a recurrent neural network to reconstruct spans of text, represented with word vector embeddings, using a small set of interpretable embedding-based descriptors. Unlike the original RMN, which learns linguistic descriptors of relationships between character dyads in novels, we seek to model the relationship between users and online communities. In particular, whereas their method learns embeddings of characters and books, we replace them with embeddings of users and communities, respectively. The intuition in applying this model to our setting is that these embeddings should function as offsets that account for idiosyncratic or superficial variation between the language of different users or communities. By "subtracting" away this superficial variation, the system can pick up on the core variation that corresponds to different social functional roles. Figure 1 provides a high-level overview of the model. The technical details of the model largely follow those of Iyyer et al. (2016). For completeness, we provide a formal description here.
Formally, we have a corpus of $N$ users and $M$ communities, where each user $u_i$ makes a sequence of posts $[p_{ij}^{(1)}, p_{ij}^{(2)}, \ldots]$ to community $c_j$, and each post is a fixed-length sequence of $l$ word tokens drawn from a vocabulary $V$; i.e., $p_{ij}^{(t)} = [w_1, w_2, \ldots, w_l]$, with possibly some padding. During training, the model learns $K$ descriptors, represented as $d_{word}$-dimensional vectors $\{r_k\}_{k=1}^{K}$, and a function for assigning each post a score for each of the descriptors. Representing the descriptors as $d_{word}$-dimensional vectors allows us to interpret them by looking at their nearest word-embedding neighbors. This representation differs from a topic learned by a topic model, where a topic is a probability distribution over words. The model represents the descriptors as rows of a descriptor matrix $R \in \mathbb{R}^{K \times d_{word}}$.

Figure 1: RMN architecture. For each post, the post text, user, and community are first embedded, concatenated, and passed through a linear and nonlinear layer (red). Next, a recurrent layer computes a distribution over descriptors using the previous distribution and a softmax (blue). Finally, to train, we create a reconstruction vector and compare it against the original text embedding (green).
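Since each descriptor lives in the same space as the word embeddings, interpreting a descriptor amounts to a nearest-neighbor lookup. A minimal NumPy sketch of this lookup (function and variable names are ours, not from the authors' code):

```python
import numpy as np

def nearest_words(R, E_word, vocab, top_n=10):
    """For each descriptor (row of R), return the top_n vocabulary words
    whose embeddings have the highest cosine similarity to it."""
    # Normalize rows so that dot products equal cosine similarities.
    R_n = R / np.linalg.norm(R, axis=1, keepdims=True)
    E_n = E_word / np.linalg.norm(E_word, axis=1, keepdims=True)
    sims = R_n @ E_n.T                          # (K, |V|) similarity matrix
    order = np.argsort(-sims, axis=1)[:, :top_n]
    return [[vocab[j] for j in row] for row in order]
```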
For each post, the model computes a bag-of-words representation as an average of word vector embeddings from embedding matrix $E_{word}$:

$$v_{post}^{(t)} = \frac{1}{l} \sum_{i=1}^{l} E_{word}[w_i].$$

We also obtain embeddings $v_{user}$, $v_{comm}$ for the user and the community via respective embedding matrices $E_{user} \in \mathbb{R}^{N \times d_{user}}$ and $E_{comm} \in \mathbb{R}^{M \times d_{comm}}$, also learned during training. Following Iyyer et al. (2016), we concatenate these embeddings and pass them through a linear layer parameterized by $W_h \in \mathbb{R}^{(d_{word}+d_{user}+d_{comm}) \times d_{hid}}$, followed by a nonlinear ReLU layer:

$$h_t = \mathrm{ReLU}(W_h [v_{post}^{(t)}; v_{user}; v_{comm}]).$$

To convert $h_t$ to scores over the descriptors, the model then computes a softmax.¹ However, in order to also make use of information from previous posts between the user and the community, the distribution is computed using the distribution from the previous post:

$$d_t = \alpha \cdot \mathrm{softmax}(W_d h_t) + (1 - \alpha) \cdot d_{t-1},$$

where $W_d$ projects the hidden state onto the $K$ descriptors. This recurrent aspect allows the model to consider the previous state of the relationship and also lets us track the progression of the relationship. The parameter $\alpha$ controls the degree to which $d_t$ depends on the previous distribution and can be either a hyperparameter or a learned parameter.

¹ See Iyyer et al. (2016) for a discussion of the use of a softmax versus other nonlinear functions.
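The forward computation described above can be sketched for a single post as follows. This is an illustrative NumPy rendering, not the authors' implementation; the parameter names (`E_word`, `W_h`, `W_d`, `alpha`) are our stand-ins for the quantities in the text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rmn_step(post_word_ids, user_id, comm_id, d_prev, params, alpha=0.5):
    """One forward step of the RMN for a single post: average the word
    embeddings, concatenate with user/community embeddings, apply a
    linear + ReLU layer, then recurrently mix the descriptor softmax
    with the previous post's distribution."""
    E_word, E_user, E_comm, W_h, W_d = (
        params[k] for k in ("E_word", "E_user", "E_comm", "W_h", "W_d"))
    v_post = E_word[post_word_ids].mean(axis=0)           # bag-of-words average
    x = np.concatenate([v_post, E_user[user_id], E_comm[comm_id]])
    h = np.maximum(0.0, x @ W_h)                          # linear + ReLU
    d_t = alpha * softmax(h @ W_d) + (1 - alpha) * d_prev # recurrent mixing
    return d_t, v_post
```

Because the softmax output and `d_prev` are both probability distributions, their convex combination `d_t` remains one.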
To train the model, we have it act as an autoencoder: for each post, the post's scores over the descriptors should accurately capture the meaning of the original post. We formalize this into a training objective $J(\theta)$ by defining a reconstruction vector

$$r_t = R^{\top} d_t$$

and attempting to make it similar, in terms of cosine distance, to the original post representation $v_{post}^{(t)}$:

$$J(\theta) = \sum_{t} \sum_{v_n \in S} \max\left(0,\; 1 - \mathrm{cos\_sim}(r_t, v_{post}^{(t)}) + \mathrm{cos\_sim}(r_t, v_n)\right),$$

where $\mathrm{cos\_sim}(v, w)$ is the cosine similarity between vectors $v$ and $w$, and $S$ is a set of averaged bag-of-words representations $v_n$ for a randomly sampled subset of posts from the entire dataset. $J(\theta)$ seeks to minimize the cosine distance between the reconstructed post and the original post while maximizing the cosine distance between the reconstruction and the negative samples. Finally, to encourage the model to learn distinct descriptors, Iyyer et al. (2016) add an orthogonality penalty $X(\theta) = \lVert R R^{\top} - I \rVert$, where $I$ is the identity matrix. The final training objective is then

$$L(\theta) = J(\theta) + \lambda X(\theta),$$

where $\lambda$ is a hyperparameter controlling the degree to which the model is penalized for learning semantically similar descriptors.

For our experiments, we use text data from user comments on Reddit (reddit.com), a social media forum with over 200 million unique users as of June 2016. Reddit allows users to post and comment in multiple sub-communities, known as subreddits, which cover a wide array of subject matter from current events to specific video games to cooking.
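The objective can be sketched directly from the definitions above. Again this is an illustrative NumPy rendering rather than the authors' code; `lam` stands for the hyperparameter λ, and the loss is shown for a single post with its negative samples:

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def rmn_loss(R, d_t, v_post, neg_samples, lam=1.0):
    """Hinge reconstruction loss plus orthogonality penalty for one post.
    R: (K, d_word) descriptor matrix; d_t: (K,) descriptor distribution;
    v_post: averaged word embeddings of the post; neg_samples: list of
    averaged embeddings of randomly sampled posts."""
    r_t = R.T @ d_t                                  # reconstruction vector
    hinge = sum(max(0.0, 1.0 - cos_sim(r_t, v_post) + cos_sim(r_t, v_n))
                for v_n in neg_samples)
    # Orthogonality penalty on row-normalized descriptors.
    R_n = R / np.linalg.norm(R, axis=1, keepdims=True)
    ortho = np.linalg.norm(R_n @ R_n.T - np.eye(R.shape[0]))
    return hinge + lam * ortho
```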

Data
To prevent our model from being misled by structurally abnormal subreddits and to scope our domain of inquiry, we chose to focus our experiments on a subset of video game-related subreddits, using all publicly available comments from 2014. We manually selected 75 subreddits, where each community is dedicated to the discussion of a particular video game (e.g., r/Halo3). Limiting to subreddits that discuss specific video games has the benefit of providing a sample of subreddits that are all similar in both social structure and scope.

Preprocessing
To build our dataset, we consider all users that have made at least 50 comments to a subreddit, and we sample up to 50 users from each subreddit. Then, for each subreddit-user pair, we sample at most 100 of their comments. For the vocabulary, we lowercase the text, filter out conjunctions and articles, and then remove words that do not appear in at least 20% of the subreddits. We found that restricting the vocabulary in this way removed words concentrated in a few subreddits, thereby encouraging the model to learn more general descriptors.
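The vocabulary-filtering step can be sketched as follows. This is a simplified illustration with whitespace tokenization; the `stopwords` argument stands in for the conjunction/article list, which the paper does not enumerate:

```python
from collections import defaultdict

def build_vocab(comments_by_subreddit, min_frac=0.2, stopwords=()):
    """Keep lowercased words that appear in at least `min_frac` of the
    subreddits, after removing stopwords."""
    present_in = defaultdict(set)          # word -> set of subreddits it occurs in
    for sub, comments in comments_by_subreddit.items():
        for comment in comments:
            for w in comment.lower().split():
                if w not in stopwords:
                    present_in[w].add(sub)
    n = len(comments_by_subreddit)
    return {w for w, subs in present_in.items() if len(subs) / n >= min_frac}
```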
For the word embedding matrix $E_{word}$, we pretrain 300-dimensional word vectors using a skip-gram word2vec model trained on all subreddit data from 2014, and we do not fine-tune these embeddings during training. User and community embeddings are initialized randomly and fine-tuned during training. To summarize, our final dataset consists of 3.3 × 10^5 comments, 75 subreddits, 2,575 users, and a ~10^4-word vocabulary. See Table 1 for experimental parameters.

Results
Our analysis (see Section 5.1) reveals that the descriptors (or topics) learned via the RMN model are qualitatively different from those produced by latent Dirichlet allocation (LDA; Blei et al. (2003)) and that the RMN descriptors more effectively capture functional or stylistic aspects of language that (at least intuitively) correspond to the communication patterns of stereotypical social roles. We then show how the unsupervised RMN model can be used to gain insights into user behavior: Section 5.2 shows how the descriptors are differentially expressed by users of varying activity levels and by users who have varying amounts of clustering in their local social network. Finally, Section 5.3 explores the latent dimensionality of the space of functional user roles using the learned RMN descriptor distributions.

Descriptors learned
We present a subset of the learned RMN descriptors in Table 2. For comparison, we also trained an LDA model on the same data (with identical pre-processing) and present the topics learned. In the case of LDA, the words describing each topic are simply the words with the highest within-topic probabilities; in contrast, the words corresponding to the RMN descriptors are selected by finding the closest words to the descriptor vectors in terms of cosine similarity. From the examples shown, we can see that the descriptors learned by the RMN tend to be more abstract and functional, capturing concepts such as asking for advice, while the topics learned via LDA are more concrete and subreddit-specific; for example, the first LDA topic shown in Table 2 is specific to "shooter"-type games, while the second is specific to fantasy role-playing games.

Table 2: Example descriptors/topics learned by the RMN and LDA models (top) and two examples that were judged to be largely incoherent/non-useful (bottom). The coherent LDA topics correspond to superficial subreddit-specific topical content, while the coherent RMN descriptors capture functional aspects of language use (e.g., a user asking for advice, or providing positive acknowledgment). The incoherent LDA topics consist of mixtures of (somewhat) semantically related concrete terms. The RMN model tends to fail by producing either difficult-to-interpret sets of stopwords or interpretable, but uninteresting, sets of functionally related words (e.g., Spanish stopwords).
The learned RMN descriptors also have some intuitive mappings to standard user roles. Some correspond to anti-social or "troll"-like behavior, such as example descriptors 2 and 8 in Table 2; similarly, example descriptor 5 corresponds to "maven"-like behavior (providing technical advice), while 4 likely represents the language of inexperienced, or so-called "newbie", users, a point which we confirm in Section 5.2.
Not all the learned descriptors have such intuitive mappings, but this does not imply that they are not informative with respect to the functional roles users play in communities. Example RMN descriptor 1, which contains language discussing "the other" (e.g., "them", "they") does not map to one of these well-known categories; however, it might still have functional relevance (e.g., in the social process of outgroup derogation (Tajfel et al., 1971)).
Of course, not all the descriptors learned by the RMN are perfect. In addition to the non-functional descriptors learned, a small number of descriptors (1-3) lack a clear, coherent interpretation (e.g., example 9 in Table 2). Furthermore, some descriptors did indeed seem to capture more topical information (e.g., example 3 in Table 2 is specific to competitive gaming). We note, however, that all of these behaviors were also observed in the LDA topics. Table 5 in the appendix lists the full set of descriptors and topics learned by both methods.

Descriptor quality
Previous work applying the RMN framework to fictional novels has shown that humans judge the generated RMN descriptors to be more coherent than topics generated by LDA (Iyyer et al., 2016). Manual inspection of the 50 descriptors/topics learned by each model in our study supported this finding, though we found that the majority produced by both methods were reasonably coherent. That said, the top-10 words for LDA topics contained a large number of repeated terms. Of the 500 top-10 words generated by LDA (10 each for 50 topics), 241, or 48%, occur in more than one topic. The word "game", for example, occurs as a top word in 16 of the 50 LDA topics. In contrast, only 7% of the top-10 descriptor words appeared in more than one descriptor for the RMN model.
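The repeated-top-word statistic reported above can be computed in a few lines of Python (assuming, as is standard, that words are unique within each topic's own top-10 list):

```python
from collections import Counter

def repeated_topword_fraction(top_words_per_topic):
    """Fraction of all top-word slots whose word appears in the top-word
    list of more than one topic/descriptor."""
    counts = Counter(w for words in top_words_per_topic for w in words)
    slots = [w for words in top_words_per_topic for w in words]
    return sum(1 for w in slots if counts[w] > 1) / len(slots)
```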

Functional vs. topical descriptors
A key qualitative trend evident in the learned descriptors (Table 2) is that the RMN role descriptors appear to capture more functional aspects of language use, e.g., asking for advice or discussions of agreement/disagreement, while the LDA topics capture more concrete, topical, and subreddit-specific language, e.g., "guns" or "dragons".
We quantify this qualitative insight in two ways. First, we note that the RMN descriptors are less subreddit-specific and occur in a greater diversity of subreddits (after controlling for absolute frequency). In particular, we compute the relative subreddit frequency of a word $w_i$ as

$$s_r(w_i) = \frac{s(w_i)}{\log f(w_i)},$$

where $s(w_i)$ is the number of subreddits $w_i$ occurs in and $f(w_i)$ is its absolute frequency. We found that $s_r(w_i)$ was significantly higher for the 500 RMN descriptor words compared to those from LDA (Figure 2.A; p < 0.05, Mann-Whitney U-test). Normalizing by the logarithm of the absolute frequency is necessary because higher-frequency words will simply occur in more subreddits by chance, and the median LDA descriptor-word frequency is ~10× higher than that of the RMN model.⁷

We also found that the RMN descriptor words were significantly more abstract (Figure 2.B; p < 0.0001, Mann-Whitney U-test), as judged by human ratings from the Brysbaert et al. (2014) concreteness lexicon. The relative abstractness of the RMN descriptor words highlights the functional nature of the descriptors learned by the RMN model. This finding is further reinforced by the fact that the RMN descriptor words contain far more verbs than those from LDA: the RMN descriptors are evenly balanced between verbs and nouns (132 verbs, 134 nouns), while the LDA descriptors are overwhelmingly nouns (98 verbs, 258 nouns).

⁷ The use of a logarithm in the denominator is motivated by the power-law scaling between type and token counts in a corpus (Egghe, 2007).

Table 3: Descriptors that are most predictive of activity levels. The top-3 correspond to the most positively predictive descriptors, while the bottom-3 correspond to the most negatively predictive.
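The relative subreddit frequency defined above is straightforward to compute from per-subreddit word counts; a small sketch (assuming $f(w) > 1$, so the logarithm in the denominator is positive):

```python
import math

def relative_subreddit_frequency(word, subreddit_counts):
    """s_r(w) = s(w) / log f(w): number of subreddits the word occurs in,
    normalized by the log of its total frequency. `subreddit_counts` maps
    each subreddit name to a {word: count} dict. Assumes f(word) > 1."""
    s = sum(1 for counts in subreddit_counts.values() if counts.get(word, 0) > 0)
    f = sum(counts.get(word, 0) for counts in subreddit_counts.values())
    return s / math.log(f)
```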

Examining user behavior
We now show how the learned RMN descriptors reveal valuable insights into users' behaviors.

Describing active vs. non-active users
First, we investigated the extent to which user activity levels are associated with differential language use by regressing the number of comments a user made in 2014 on their average RMN descriptor distribution. We employed a negative binomial regression model, since the comment counts are integer-valued and heavy-tailed (McCullagh and Nelder, 1989). Table 3 shows the top-3 most positively and negatively predictive descriptors (according to Wald z-statistics), all of which are significant at p < 0.01 by Wald's z-tests. Interestingly, we see that one of the most positive predictors of high activity levels is the topic containing terms used to refer to "the other" (e.g., "them", "they"); this topic also contains words such as "tend" and "typically", indicating that it captures users' references to a stereotyped out-group. This has important social implications, as it potentially highlights the tendency for highly active users to engage in in-group/out-group dynamics. The other topics predictive of high activity levels include one filled with informal "bro" language and a topic related to increasing/decreasing (for which a social interpretation is unclear).
In contrast, the topics most associated with low activity levels include one related to asking for advice or suggestions, along with a topic related to discussions of "humanity" and "sacrifice". This is in line with anthropological theories of social roles such as legitimate peripheral participation (Lave and Wenger, 1991), which holds that new users in a community initially participate via simple and low-risk tasks in order to become familiar with the community's jargon and norms. On Reddit, engaging in in-group/out-group behavior could be costly if users do not have a good understanding of the community-specific norms behind those behaviors. The low-risk actions on Reddit often take the form of question-asking, as newcomers are encouraged to ask questions and seek the advice of more veteran members of the community.

Associating network structure with descriptors

In addition to user activity levels, we also examined how a user's position in their social network is associated with the use of different RMN role descriptors. For this experiment, we constructed social networks by connecting all users who commented together within the same comment chain and whose comments were separated by at most two other comments. We then computed the users' degrees and local clustering coefficients (Watts and Strogatz, 1998) within these networks.
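The local clustering coefficient used here can be computed directly from an adjacency-set representation of the co-comment network; a small sketch for an undirected graph:

```python
from itertools import combinations

def local_clustering(adj, node):
    """Local clustering coefficient (Watts & Strogatz, 1998): the fraction
    of pairs of `node`'s neighbours that are themselves connected.
    `adj` maps each node to the set of its neighbours."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0   # undefined for degree < 2; report 0 by convention
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))
```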
We performed regression analysis via ordinary least squares to relate the different RMN descriptors to the logarithm of a user's clustering coefficient. As with the analysis on activity levels, we performed regression with a vector of RMN descriptor weights for each user that is averaged over all their comments in the dataset. We also controlled for user degree and their activity level in the regression (both log-transformed). Table 4 shows the top-3 most positive and negative predictors in this regression (according to their t-statistics), all of which are significant at the p < 0.01 level. We see that users with highly clustered interactions are more likely to express subjective attitudes (e.g., "realized", "wished", "hope") and are more likely to discuss temporal aspects of their lives (e.g., "day", "busy", "evenings"), perhaps indicating that high clustering during interactions is associated with more personal or in-depth discussions. In contrast, the most predictive topics in the negative direction were more focused on material aspects of gaming, including a topic discussing the purchasing of video games ("grabbed", "bought") and one discussing video game hardware ("desktop", "hardware", "optimized").
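The regression setup (averaged descriptor weights plus log-transformed controls as covariates, log clustering coefficient as the response) can be sketched with an ordinary least-squares fit via NumPy. The significance testing by t-statistics that the analysis above relies on is omitted in this sketch:

```python
import numpy as np

def ols_coefficients(descriptor_weights, response, controls):
    """OLS fit of `response` (e.g., log clustering coefficient) on the
    per-user average descriptor weights plus control covariates
    (e.g., log degree and log activity level). Returns the coefficient
    vector [intercept, descriptor coefs..., control coefs...]."""
    X = np.column_stack([np.ones(len(response)), descriptor_weights, controls])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    return beta
```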

Number of types of users
Given that we found the learned RMN role descriptors to be related to social aspects of user behavior in informative ways, it is natural to investigate how much user variation there is along the learned role descriptor axes. In other words, how many types of users are there?
We investigated this question by performing principal components analysis on the set of user-descriptor vectors, where each user is assigned a vector corresponding to the weight of their average comment along the RMN role descriptor axes (as was done in the regression analysis). We also performed an identical analysis on average subreddit-descriptor vectors. Figure 3.A shows the proportion of variance explained by the top-k principal components for both users and subreddits. We see that it takes ~6 latent dimensions to explain 80% of the variance across subreddits, while it takes ~12 latent dimensions to explain the same amount of variance in user behavior. This indicates that, despite the fact that the descriptor axes are regularized to be orthogonal during learning, they still contain redundant information and users cluster in predictable ways. However, we also see that there is far more variation at the user level than at the subreddit level, which indicates that the RMN descriptors are not simply recapitulating subreddit distinctions. We also formally tested the extent to which the descriptors of users are simply determined by the language of subreddits. Figure 3.B shows the absolute Pearson correlation between the principal components of users and subreddits. This correlation is extremely high for the first two principal components but quickly drops off and becomes noise by the fifth principal component. This indicates that a large proportion of variance in user behavior is not simply explained by users being active in certain subreddits, and reinforces the notion that RMN topics capture community-independent aspects of users' linguistic behavior that correspond to functional social roles.
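The variance-explained analysis can be reproduced from the user-descriptor (or subreddit-descriptor) matrix with a centered SVD; a sketch where rows are users or subreddits and columns are descriptor weights:

```python
import numpy as np

def explained_variance_ratio(X):
    """Proportion of variance explained by each principal component of X
    (rows = entities, columns = descriptor weights)."""
    Xc = X - X.mean(axis=0)                     # centre each descriptor axis
    _, s, _ = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2                                # variance along each component
    return var / var.sum()

def components_for_variance(X, target=0.8):
    """Smallest number of principal components whose cumulative
    explained-variance ratio reaches `target`."""
    ratios = explained_variance_ratio(X)
    return int(np.searchsorted(np.cumsum(ratios), target) + 1)
```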

Conclusion
We adapted a neural network model to learn functional descriptors of how users behave and interact in online communities, and we showed that these descriptors better captured the abstract or functional properties of language use compared to descriptors learned by a standard topic model. We then showed that the learned descriptors are useful for providing interpretable linguistic characterizations of different user behaviors. Our results highlight the usefulness of the RMN framework as an unsupervised, quantitative tool for uncovering and characterizing user roles in online communities.
This unsupervised approach to discovering stereotypical communication patterns offers a powerful complement to social network and interaction-based methods of discovering social roles. However, one limitation of this study is that we do not formally map the learned descriptors to more traditional social role categories; this is an important direction for future work.
An interesting extension of the model would be to take into account the immediate context in which a post is made. Because the function of a post is partially determined by what it is responding to, additional context may lead to more salient descriptors.