Community Member Retrieval on Social Media Using Textual Information

This paper addresses the problem of community membership detection using only text features in a scenario where a small number of positive labeled examples defines the community. The solution introduces an unsupervised proxy task for learning user embeddings: user re-identification. Experiments with 16 different communities show that the resulting embeddings are more effective for community membership identification than common unsupervised representations.


Introduction
Active users of social media often like identifying other users with common interests and values. Or, a user may want to find other users that share characteristics with specific accounts that they follow, e.g., cartoonists or local food trucks. Members of such communities of interest are often identifiable via their social network connections, and shared social connections are clearly important in recommendations. However, shared connections often reflect a subset of a person's interests, and there may be users of interest where any shared connections are distant. In addition, there may be scenarios where there is no explicit social graph, or the full graph is expensive to obtain. In such cases, the language of tweets, blogs, etc. is helpful in identifying users with particular interests.
In this paper, we represent users in terms of the text in their communications and introduce a scenario where a user can define a "community" by providing a small number of example accounts that are used to train a system for retrieving similar users. Note that our use of the term "community" differs from other online contexts, where members explicitly self-identify with a community (e.g., by joining a discussion forum or using a specific hashtag). The community is in the eye of the user issuing the query.
We frame the task of community membership detection as a retrieval problem. A small set of representative accounts selected by the user forms the query, and the system retrieves additional community members from a large index of accounts. The task is loosely related to entity set expansion (Pantel et al., 2009). We make no assumptions about the type of communities that can be handled, and no labeled data is available other than the query. Because the training set (query) is minimal, unsupervised learning is useful for the text representation. We propose the proxy task of person re-identification for learning a user embedding, where the goal is for two embeddings from the same user to be closer to each other than to the embedding of a random user. The hypothesis is that a representation useful for detecting similarities between posts from the same person made at different times will also do well at identifying similarities between people in the same community. This hypothesis stems from observations that people with shared interests often talk about topics related to these interests, and that they tend to have shared jargon and other similarities in language use (Nguyen and Rosé, 2011; Danescu-Niculescu-Mizil et al., 2013; Tran and Ostendorf, 2016).
In this paper, we demonstrate experimentally that the re-identification proxy task is useful with simple models that are suited to the retrieval scenario, and present analyses showing that the approach learns to emphasize words associated with individual interests and polarizing issues.

Model
The model for community detection includes: i) a mapping from a user's text (a collection of tweets) to a k-dimensional embedding, and ii) a binary classifier for detecting whether a candidate user belongs to the target community. The novel contribution of the work is the proxy re-identification task for learning the user embedding.
User Embedding Model. The mapping from text to an embedding could leverage any document-level representation. We focus on a simple weighted bag-of-words neural model for direct comparison to other popular methods, motivated by the fact that many virtual communities form around shared interests in particular topics. Specifically, let $c_{p,i}$ denote the number of times person $p$ uses word $v_i \in V$, where $V$ is the vocabulary, and let $w_{p,i} = \log(c_{p,i} + 1)$ be the log-scaled word count. Then the user embedding is
$$u_p = \frac{E^\top w_p}{\lVert E^\top w_p \rVert_2},$$
where $w_p = [w_{p,1}, \ldots, w_{p,|V|}]$ and $E \in \mathbb{R}^{|V| \times k}$ is the matrix of word embeddings.
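The weighted bag-of-words mapping described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `user_embedding`, the toy vocabulary of five words, and the random matrix `E` are hypothetical, and the unit-length scaling follows the unit vector normalization mentioned in the experiment configuration.

```python
import numpy as np

def user_embedding(counts, E):
    """Map raw word counts (length |V|) to a unit-length k-dimensional embedding."""
    w = np.log(counts + 1.0)      # log-scaled counts: w_{p,i} = log(c_{p,i} + 1)
    u = E.T @ w                   # count-weighted sum of the word embeddings
    norm = np.linalg.norm(u)
    return u / norm if norm > 0 else u  # unit-normalize (guard against all-zero input)

# Toy example: |V| = 5 words, k = 3 dimensions (values are illustrative only).
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 3))
counts = np.array([3, 0, 1, 0, 2])
u = user_embedding(counts, E)
```

Because log(0 + 1) = 0, words the user never used contribute nothing to the embedding, so only the rows of `E` for observed words matter.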
Person Re-identification Learning. The embedding matrix $E$ is learned using a person re-identification objective that encourages embeddings from the same person to be closer than embeddings from different people. We build on the triplet loss function that Schroff et al. (2015) used to train a face recognition system. Specifically:
$$\mathcal{L} = \sum \max\left(0,\; d(u_{p_1^1}, u_{p_1^2}) - d(u_{p_1^1}, u_{p_2^1}) + \alpha\right),$$
where the sum runs over sampled triplets, $d(x, y)$ is the cosine distance between $x$ and $y$, and $\alpha$ is the margin of the triplet loss. $u_{p_1^1}$ and $u_{p_1^2}$ are embeddings made from distinct subsets of a single person's Tweets, and $u_{p_2^1}$ is an embedding made from a subset of another person's Tweets. In practice, we estimate the loss function by randomly sampling triplets $(p_1^1, p_1^2, p_2^1)$ from a large training set.
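As a rough illustration of the re-identification objective, the following sketch computes a triplet loss with cosine distance for a single triplet. The margin value of 0.1 and the function names are assumptions made for this example; the text does not state the margin used in the experiments.

```python
import numpy as np

def cosine_distance(x, y):
    """Cosine distance: 1 minus the cosine similarity of x and y."""
    return 1.0 - float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def triplet_loss(anchor, positive, negative, margin=0.1):
    """Hinge-style triplet loss; the margin is a hypothetical value.

    anchor and positive come from distinct tweet subsets of the same person,
    negative comes from a tweet subset of a different person.
    """
    same = cosine_distance(anchor, positive)
    other = cosine_distance(anchor, negative)
    return max(0.0, same - other + margin)

a = np.array([1.0, 0.0])
easy = triplet_loss(a, a, -a)   # same-person pair is much closer: loss is zero
hard = triplet_loss(a, -a, a)   # same-person pair is farther: loss is positive
```

The loss is zero whenever the same-person pair is closer than the different-person pair by at least the margin, so training only updates the embeddings on triplets that violate that ordering.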

Classifier. A logistic regression model with L2 regularization is used for the classifier, because it is simple but powerful and our scenario has little training data. Simplicity is important because the classifier should be trainable in real-time after receiving the query. The classifier objective is to discriminate the embeddings from the users in the query from a set of user embeddings from the general collection. For the $i$-th user, let $y_i \in \{0, 1\}$ be the binary label indicating whether the user belongs to a particular community and $u_i$ be the user embedding. The logistic regression model computes the probability that the user belongs to the community according to
$$P(y_i = 1 \mid u_i) = \sigma(\theta^\top u_i + b),$$
where $\theta$ and $b$ are the classifier weights and bias, and $\sigma(x) = 1/(1 + e^{-x})$. During evaluation, the users in the index are ranked according to the maximum log probability ratio
$$\log \frac{P(y_i = 1 \mid u_i)}{P(y_i = 0 \mid u_i)} = \theta^\top u_i + b.$$
Because the classifier is linear, we can quickly retrieve the top matching users from the index using approximate nearest-neighbor search (Kushilevitz et al., 2000). The technique is scalable up to hundreds of millions of users and beyond.
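To make the ranking step concrete, the sketch below scores every indexed user with a linear model and returns the top matches. It is a simplified illustration: the weights `theta` and bias `b` are hypothetical values rather than a trained classifier, and the search is exact brute force rather than the approximate nearest-neighbor search mentioned above.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def rank_users(index_embeddings, theta, b, top_k):
    """Rank indexed users by log-odds of community membership (exact search)."""
    scores = index_embeddings @ theta + b      # linear score = log probability ratio
    order = np.argsort(-scores)[:top_k]        # highest score first
    return order, scores[order]

# Toy index of four users with 2-dimensional embeddings (illustrative values).
index_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]])
theta = np.array([2.0, 0.5])   # hypothetical classifier weights
b = -0.5                       # hypothetical bias
top, top_scores = rank_users(index_embeddings, theta, b, top_k=2)

# The log-odds of the logistic model equal the linear score.
p = sigmoid(index_embeddings @ theta + b)
log_odds = np.log(p / (1.0 - p))
```

Since the log-odds are a monotone function of the inner product between `theta` and the user embedding, ranking reduces to a maximum inner product search, which is what makes an approximate nearest-neighbor index applicable.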

Data
All data was collected using the Twitter API. We used 1,035 randomly selected items from the list of trending topics in the USA during the period April-June 2017 to query for users and collected their most recent 2,000 tweets. Example trending topics are #Quantico, RonaldoCristiano, and #MayDay2017. (The full list is available with the data.) Each user had at least one Tweet that mentioned a trending topic, but their other Tweets could be on any topic.
We refer to this collection as the "general population," because it was not targeted towards any particular community. In total, we collected around 80,000 such users and used roughly 36,000 for learning user embeddings, 1,000 for learning the community classifiers, and 43,000 for evaluation. The text is mostly in English, but some of it is in Spanish, French, or other languages. A list of the tweet IDs is available.
To support evaluation with the community detection task, we conducted a second collection (contemporaneous with the first) targeting members that we had identified as belonging to one of 16 communities (Table 2). To define a "community," volunteers manually selected a set of users that fit a theme that they were familiar with. Thus, the specific 16 communities were determined based on themes of interest to the authors and their friends and colleagues, where we could be reasonably confident about membership decisions. In addition, we tried to avoid themes that might be biased towards well-known celebrities, and we made an effort to have diversity in the characteristics of the communities. The communities were selected to span a range of topics, sizes (6-130 accounts), individuals vs. organizations, and other characteristics. A few of the communities consist of organizations rather than individuals, such as the high school drama departments and the Pittsburgh food truck communities. (The community names are invented by the authors for purposes of describing the data in this paper; they are not part of the retrieval task.)
The text is lower-cased and some punctuation is removed using regular expressions. Words are formed by splitting on white space. While this strategy will not work for languages that do not delimit words by spaces, these make up a negligible portion of the data. A 174k vocabulary was created by extracting the unique types that were seen in the tweets from the general population, as well as selected bigrams extracted with the open source Gensim library using a point-wise mutual information criterion (Řehůřek and Sojka, 2010). The vocabulary included roughly 49k bigrams, 36k usernames, and 17k hashtags. Usernames, hashtags, and URLs are not treated specially and can be part of the vocabulary just like any other word if they occur frequently enough.
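The preprocessing described above might look roughly like the following sketch. The specific punctuation pattern is a hypothetical choice, since the text does not list the regular expressions that were used, and bigram extraction is omitted for brevity.

```python
import re
from collections import Counter

# Hypothetical punctuation set; '#' and '@' are kept so that hashtags and
# usernames survive as ordinary vocabulary items.
PUNCT = re.compile(r"[.,!?;:\"()]")

def tokenize(tweet):
    """Lower-case, strip selected punctuation, and split on white space."""
    return PUNCT.sub(" ", tweet.lower()).split()

def count_vector(tweets, vocab_index):
    """Count how often each vocabulary word occurs across a user's tweets."""
    counts = Counter(tok for t in tweets for tok in tokenize(t))
    vec = [0] * len(vocab_index)
    for word, idx in vocab_index.items():
        vec[idx] = counts.get(word, 0)
    return vec

# Toy vocabulary and tweets (illustrative only).
vocab_index = {"#mayday2017": 0, "rally": 1, "@user": 2}
tweets = ["Rally today! #MayDay2017", "See you at the rally, @user."]
vec = count_vector(tweets, vocab_index)
```

Out-of-vocabulary tokens such as "today" are simply dropped, which mirrors the statement that words enter the vocabulary only if they occur frequently enough.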

Experiment Configuration
The experiments involved comparing different methods of learning user embeddings, all with a weighted bag-of-words modeling assumption:
• Weighted word2vec (W2V) using default skip-gram training (Mikolov et al., 2013);
• Latent Dirichlet allocation (LDA) (Blei et al., 2003), using default settings from the Scikit Learn library (Pedregosa et al., 2011);
• Person re-identification with random initialization (RE-ID); and
• Person re-identification with W2V initialization (RE-ID, W2V init).
Both count-weighted W2V and LDA have been used as unsupervised representations in Twitter classification tasks, as noted in Section 5. Default configurations are used because there is insufficient data to have a separate validation set.
For all methods, the same vocabulary, final dimension (128), unit vector normalization strategy, and logistic regression model training were used. The embeddings are trained on the 36k user general data, randomly sampling pairs of users $p_1$ and $p_2$ and then sampling 50 tweets at a time without replacement to create $u_{p_1^1}$, $u_{p_1^2}$, and $u_{p_2^1}$. The logistic regression models are trained on the 1K user general training pool, using the 50 most recent tweets for each user. Because there are so few labeled examples for most communities, training and evaluation are done using a leave-one-out strategy with the positive samples but including all of the 1K negative samples. For each of the N classifiers (corresponding to N labeled samples), the test set is the left-out positive example and the 43K general user test pool. Also because of training limitations, there is no tuning of the regularization weight; the default weight of 1.0 is used. Tuning may be useful given a collection of training and testing communities. Performance is averaged over the N classifiers (corresponding to the N labeled samples). Two evaluation criteria are used: a retrieval metric (inverse mean reciprocal rank or 1/MRR) (Voorhees et al., 1999) and a detection metric (area under the curve or AUC).
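The two evaluation criteria can be illustrated with a small sketch for the leave-one-out setting, where each fold has a single held-out positive scored against a pool of negatives. The function names and the toy scores are hypothetical, and the handling of ties (ignored for the rank, counted as one half for AUC) is an assumption of this example.

```python
import numpy as np

def reciprocal_rank(pos_score, neg_scores):
    """1 / rank of the held-out positive among the negatives (rank 1 is best)."""
    rank = 1 + int(np.sum(neg_scores > pos_score))
    return 1.0 / rank

def inverse_mrr(reciprocal_ranks):
    """Inverse of the mean reciprocal rank (lower is better)."""
    return 1.0 / float(np.mean(reciprocal_ranks))

def auc_single_positive(pos_score, neg_scores):
    """AUC for one positive: fraction of negatives scored below it (ties count half)."""
    below = np.mean(neg_scores < pos_score)
    ties = np.mean(neg_scores == pos_score)
    return float(below + 0.5 * ties)

# Two toy leave-one-out folds: (score of held-out positive, scores of negatives).
folds = [
    (0.9, np.array([0.1, 0.5, 0.95])),
    (0.8, np.array([0.2, 0.3, 0.1])),
]
rr = [reciprocal_rank(p, n) for p, n in folds]
aucs = [auc_single_positive(p, n) for p, n in folds]
score = inverse_mrr(rr)
```

In this toy case the positives are ranked second and first, so the mean reciprocal rank is 0.75 and 1/MRR is about 1.33. Reporting the inverse puts the metric on a rank-like scale, where 1.0 corresponds to always ranking the held-out member first.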

Results
Table 1 shows retrieval results averaged across all communities. The RE-ID model outperforms the W2V and LDA baselines for both criteria, with substantial gains in 1/MRR (lower is better). Further, the version of RE-ID initialized with word2vec did better than the one that was initialized randomly, even though the randomly initialized version was trained for twice as long. A breakdown of the best model performance by community is given in Table 2. Sample size does not seem to be a good indicator of performance: the two smallest communities (Cartoonists, Fresno City Council) had the worst and one of the best results, respectively. Anecdotally, we observed that the sample of cartoonists were more likely to Tweet about topics outside their main interest (e.g., politics or sports). We hypothesize that the diversity of interests of the members of a community affects the difficulty of the retrieval task, but our test set is too small to confirm this hypothesis.

These results may underestimate performance, because there is a chance that some users in the general population test data may actually belong to one or more of our test communities, i.e., there could be mislabeled data. To assess the potential impact, we manually checked the top ten false positives for each community for mislabeled users. We did discover some mislabeled examples for the economist, hedge fund manager, and ultramarathon runner communities. For the most part, the top ranked users from the general population tended to be people from related communities. For example, the top false ultimate frisbee users contained people who wrote about their participation in tournaments for other sports such as soccer.

Analysis
The finding that the W2V-initialized RE-ID model is significantly better than W2V raises the question: how do the embeddings learned by the re-identification task differ from the ones learned by the word2vec objective? To investigate this, we looked at the 1,000 words in the RE-ID model with embeddings that were farthest (in Euclidean distance) from their word2vec initialization. These top words disproportionately contain Twitter user handles, so some social network structure is captured. Using agglomerative clustering, we found groups of words that centered around frequent words used in particular regions (foreign words, dialects) or cultures (sociolects), associated with hobbies or interests (specific sports, music genres, gaming), or polarizing topics (political parties, controversial issues). At least one of the top tokens was the username of an account later identified as being sponsored by the Russian government to spread propaganda during the United States presidential election, e.g., "ten gop" in Table 4 of the Appendix.
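Selecting the words whose embeddings moved farthest from their initialization is a simple row-wise distance computation. The sketch below assumes two aligned embedding matrices (before and after re-identification training) and a vocabulary list in the same row order; the names and toy values are hypothetical.

```python
import numpy as np

def most_changed_words(E_init, E_final, vocab, n):
    """Return the n words whose embeddings moved farthest (Euclidean distance)."""
    shift = np.linalg.norm(E_final - E_init, axis=1)  # one distance per vocabulary row
    order = np.argsort(-shift)[:n]                    # largest shift first
    return [vocab[i] for i in order]

# Toy example with four words and 2-dimensional embeddings (illustrative only).
vocab = ["the", "@handle", "#tag", "of"]
E_init = np.zeros((4, 2))
E_final = np.array([[0.1, 0.0], [3.0, 4.0], [0.0, 1.0], [0.0, 0.0]])
top_words = most_changed_words(E_init, E_final, vocab, n=2)
```

With the real matrices, `n` would be 1,000 and the resulting list could then be passed to a clustering step to look for thematic groups.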
We also looked at which communities are closest in the embedding space. We represent a community with the average of the member embeddings and use a normalized cosine distance for similarity. The two nearest neighbors are Mathematicians and NLP researchers, which are also close to the next two nearest neighbors, Hedge Fund Managers and Professional Economists.
To interpret what the model as a whole captured, we found the top scoring tweets for each held-out user (creating an embedding for a single tweet) according to the logistic regression model. Representative examples include "recurrent neural network grammars simplified and analyzed" for NLP Researchers, and "we're looking forward to seeing you opening night may 24th love the cast of high school musical" for High School Drama clubs. Examples for additional communities are included in the appendix. The results provide insight into the community member identification decision.
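The community comparison can be sketched as follows: average the member embeddings of each community and find the pair of centroids with the smallest cosine distance. Plain cosine distance is used here as a stand-in, since the exact normalization is not specified, and the community names and vectors are toy placeholders.

```python
import numpy as np

def centroid(member_embeddings):
    """Represent a community by the mean of its member embeddings."""
    return np.mean(member_embeddings, axis=0)

def cosine_distance(x, y):
    """Cosine distance: 1 minus the cosine similarity of x and y."""
    return 1.0 - float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def closest_pair(centroids):
    """Return (distance, name_a, name_b) for the two nearest community centroids."""
    names = sorted(centroids)
    best = None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            d = cosine_distance(centroids[a], centroids[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    return best

# Toy communities with 2-dimensional member embeddings (illustrative only).
members = {
    "A": np.array([[1.0, 0.0], [0.9, 0.1]]),
    "B": np.array([[0.8, 0.2], [1.0, 0.1]]),
    "C": np.array([[0.0, 1.0], [0.1, 0.9]]),
}
centroids = {name: centroid(emb) for name, emb in members.items()}
best = closest_pair(centroids)
```

With 16 communities there are only 120 pairs, so an exhaustive comparison like this is inexpensive.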

Related Work
One notion of community detection involves discovering different communities within a collection of users (Chen et al., 2009; Di, 2011; Fani et al., 2017). A related task is making recommendations of friends or people to follow (Gupta et al., 2013; Yu et al., 2016). In contrast, our task involves identifying other members of a community, which is specified in terms of a set of example users. These tasks use different learning frameworks (our work uses supervised learning), but the features (social network and/or text cues) are relevant across tasks. Our task is perhaps more similar to using social media text to predict author characteristics such as personality (Golbeck et al., 2011), gang membership (Wijeratne et al., 2016), geolocation (Han et al., 2014), political affiliation (Makazhanov et al., 2014), occupational class (Preoţiuc-Pietro et al., 2015), and more. Again, a commonality across tasks is the frequent use of unsupervised representations of textual features.
In representing text, a common assumption is that community language reflects topical interests, so representations aimed at topic modeling have been used, including LDA (Pennacchiotti and Popescu, 2011) and tf-idf weighted word2vec embeddings (Boom et al., 2016; Wijeratne et al., 2016). Yu et al. (2016) compute a user embedding by averaging tweet embeddings. Other work investigates methods for learning embeddings that integrate text and social network (graph or text-based) features (Benton et al., 2016).
The work closest to ours is by Fani et al. (2017), which learns embeddings that are close for like-minded users, where like-minded pairs are identified by a deterministic algorithm that leverages timing of related posts. Our approach requires no additional heuristics for defining user similarity, but instead relies on an objective that maximizes self-similarity and minimizes similarity to other users randomly sampled from a large general pool.
Our person re-identification proxy task makes use of the triplet loss used to learn person embeddings for face recognition (Schroff et al., 2015). In image processing, person re-identification refers to the task of tracking people who have left the field of view of one camera and are later seen by another camera (Bedagkar-Gala and Shah, 2014). It is different from our proxy task and the methods are not the same.

Conclusion
In summary, this paper defines a task of community member retrieval based on their tweets, introduces a person re-identification task to allow community definition with a small number of examples, and shows that the method gives very good results compared to word2vec and LDA baselines. Analyses show that the user embeddings learned efficiently represent user interests. The text embeddings are largely complementary to the social network features used in other studies, so performance gains can be expected from feature combination.
While our experiments use a bag-of-words representation, as in most related work, the re-identification training objective proposed here can easily be used with other methods for deriving document embeddings, e.g., (Le and Mikolov, 2014; Kim, 2014).

Table 1: Performance of different model variants.

Table 2: W2V+RE-ID results by community.