Learning Invariant Representations of Social Media Users

The evolution of social media users’ behavior over time complicates user-level comparison tasks such as verification, classification, clustering, and ranking. As a result, naive approaches may fail to generalize to new users or even to future observations of previously known users. In this paper, we propose a novel procedure to learn a mapping from short episodes of user activity on social media to a vector space in which the distance between points captures the similarity of the corresponding users’ invariant features. We fit the model by optimizing a surrogate metric learning objective over a large corpus of unlabeled social media content. Once learned, the mapping may be applied to users not seen at training time and enables efficient comparisons of users in the resulting vector space. We present a comprehensive evaluation to validate the benefits of the proposed approach using data from Reddit, Twitter, and Wikipedia.


Introduction
Social media presents a number of challenges for characterizing user behavior, chief among them that the topics of discussion and their participants evolve over time. This makes it difficult to understand and combat harmful behavior, such as election interference or radicalization (Thompson, 2011; Mihaylov and Nakov, 2016; Ferrara et al., 2016; Keller et al., 2017).
This work focuses on the fundamental problem of learning to compare social media users. We propose a procedure to learn embeddings of small samples of users' online activity, which we call episodes. This procedure involves learning the embedding using a metric learning objective that causes episodes by the same author to map to nearby points. Through this embedding users may be efficiently compared using cosine similarity. This representation immediately enables several tasks:
Verification. Determining if two episodes have the same author.
Classification. Labeling authors via their k-nearest neighbors.
Clustering. Grouping users via off-the-shelf methods like k-means or agglomerative clustering.
Ranking and retrieval. Sorting episodes according to their distances to a given episode.
The problem considered in this paper is most closely related to author attribution on social media. However, prior work in this area has primarily focused on classifying an author as a member of a closed and typically small set of authors (Stamatatos, 2009; Schwartz et al., 2013; Shrestha et al., 2017). In this paper, we are concerned with an open-world setting where we wish to characterize an unbounded number of users, some observed at training time, some appearing only at test time. A further challenge is that the episodes being compared may be drawn from different time periods. With these challenges in mind, the primary contributions described in this paper are as follows:
§3 A training strategy in which a user's history is dynamically sampled at training time to yield multiple short episodes drawn from different time periods as a means of learning invariant features of the user's identity;
§4 A user embedding that can be trained end-to-end and which incorporates text, timing, and context features from a sequence of posts;
§5 Reddit and Twitter benchmark corpora for open-world author comparison tasks, which are substantially larger than previously considered;
§6 Large-scale author ranking and clustering experiments, as well as an application to Wikipedia sockpuppet verification.
Figure 1: The map f θ takes an episode as input and outputs a vector. Here A denotes a multi-head self-attention layer, C a stack of 1D convolutions, E an embedding lookup, M an MLP, and P a pooling layer.

Preliminaries
Broadly speaking, a corpus of social media data consists of the actions of a number of users. Each action consists of all available information from a given platform detailing what exactly the user did, which for purposes of this work we take to include: (1) a timestamp recording when the action occurred, from which we extract a tuple t of temporal features, (2) unstructured text content x of the action, and (3) a categorical feature r specifying the context of the action. Thus an action is a tuple of the form (t, x, r). This formulation admits all three platforms considered in this work and therefore serves as a good starting point. However, incorporating features specific to particular platforms, such as image, network, and moderation features, might also provide useful signal. In our experiments we use a data-driven subword representation (Kudo, 2018) of x, which admits multilingual and non-linguistic content, as well as misspellings and abbreviations, all of which are useful in characterizing authors. We use a simple discrete time feature for t, namely the hour of the day, although others might be helpful, such as durations between successive actions. In our Reddit experiments we take r to be the subreddit to which a comment was posted. On Twitter we take r to be a flag indicating whether the post was a tweet or a retweet.
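As an illustration of the (t, x, r) formulation, the following minimal sketch represents an action as a small record and derives the hour-of-day feature from a Unix timestamp. The names `Action` and `make_action` are hypothetical, the timestamp is assumed to be in UTC, and the text is kept as a raw string rather than the subword representation used in the experiments.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Action:
    t: int  # temporal feature: hour of the day, 0-23
    x: str  # unstructured text content (raw string here, not subword ids)
    r: str  # categorical context, e.g. a subreddit name or a tweet/retweet flag


def make_action(timestamp: float, text: str, context: str) -> Action:
    """Build an action from a Unix timestamp (assumed UTC), text, and context."""
    hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
    return Action(t=hour, x=text, r=context)


# 1470053400 is 2016-08-01 12:10:00 UTC, so the time feature is 12.
example = make_action(1470053400, "an example comment", "AskReddit")
```

Other temporal features, such as durations between successive actions, would be added as further fields of the same record.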

Learning Invariant Representations
We organize the actions of each user into short sequences of chronologically ordered and ideally contiguous actions, which we call episodes. This paper is concerned with devising a notion of distance between episodes for which episodes by the same author are closer to one another than episodes by different authors. Such a distance function must necessarily be constructed on the basis of past social media data. But in the future, authors' behavior will evolve and new authors will emerge.
We would like episodes by the same author to be nearby, irrespective of when those episodes took place, possibly after the distance function was created. A given user will discuss different topics, cycle through various moods, develop new interests, and so on, but distinctive features like uncommon word usage, misspellings, or patterns of activity will persist for longer and therefore provide useful signal for the distance function.
We would also like the distance to be meaningful when applied to episodes by users who didn't exist when the distance function was created. To this end, the features it considers must necessarily generalize to new users. For example, common stylometric features will be shared by many users, including new users, but their particular combination is distinctive of particular users (Orebaugh and Allnutt, 2009; Layton et al., 2010).
Rather than heuristically defining such a distance function, for example, based on word overlap between the textual content of the episodes, we instead introduce a parameterized embedding f θ shown in Figure 1 that provides a vector representation of an episode. Then the desired distance between episodes can be taken to be the distance between the corresponding vectors. We fit the embedding f θ using metric learning to simultaneously decrease the distance between episodes by the same user and increase the distance between episodes by different users (Bromley et al., 1994; Wang et al., 2014).
But doing so requires knowledge of the true author of an episode, something which is not generally available. Therefore we take account names to be an approximation of latent authorship. Of course, account names are not always a reliable indicator of authorship on social media, as the same individual may use multiple accounts, and multiple individuals may use the same account. As such, we expect a small amount of label noise in our data, to which neural networks have proven robust in several domains (Krause et al., 2016; Rolnick et al., 2017).
We fit f θ to a corpus of social media data using stochastic gradient descent on batches of examples, where each example consists of an episode of a given length drawn uniformly at random from the full history of each user's actions. By construction, a metric learning objective with this batching scheme will encourage the embedding of episodes drawn from the same user's history to be close. In order to accomplish this, the model will need to distinguish between ephemeral and invariant features of a user. The invariant features are those that enable the model to consistently distinguish a given user's episodes from those of all other users.
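The sampling step can be sketched as follows. This is an illustrative reading rather than the released implementation: it assumes an episode is a contiguous window whose starting offset is drawn uniformly from a user's chronologically ordered history, and the names `sample_episode` and `sample_batch` are hypothetical.

```python
import random


def sample_episode(history, length, rng=random):
    """Return `length` contiguous actions starting at a uniformly random offset."""
    if len(history) < length:
        raise ValueError("history is shorter than the requested episode length")
    start = rng.randrange(len(history) - length + 1)
    return history[start:start + length]


def sample_batch(histories, length, rng=random):
    """Draw one episode per user; `histories` maps an account name to ordered actions.

    The account name serves as the (noisy) author label for the episode.
    """
    return [(author, sample_episode(actions, length, rng))
            for author, actions in histories.items()]
```

Because a fresh window is drawn each time a user is visited, the same account contributes episodes from different time periods over the course of training.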

The Model
We now describe a mapping f θ parameterized by a vector θ from the space of user episodes to R D . The model is illustrated in Figure 1. This embedding induces a notion of distance between episodes that depends on which of the two proposed loss functions from §4.2 is used to train f θ . We illustrate the embeddings resulting from both losses in Figure 2.

The encoder
One approach to define f θ might be to manually define features of interest, such as stylometric or surface features (Solorio et al., 2014; Sari et al., 2018). However, when large amounts of data are available, it is preferable to use a data-driven approach to representation learning. Therefore we define f θ using a neural network as follows. The network is illustrated in Figure 1.
Encoding actions. First, we embed each action (t, x, r) of an episode. We encode the time features t and the context r, both assumed to be discrete, using a learned embedding lookup. We next embed every symbol of x, again using a learned embedding lookup, and apply one-dimensional convolutions of increasing widths over this list of vectors, similar to Kim (2014) and Shrestha et al. (2017). We then apply the ReLU activation and take the componentwise maximum of the list of vectors to reduce the text content to a single, fixed-dimensional vector. We optionally apply dropout at this stage if training. Finally, we concatenate the time, text, and context vectors to yield a single vector representing the action.
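To make the text pathway concrete, here is a minimal pure-Python sketch of convolution, ReLU, and max-over-positions pooling followed by concatenation. It is only a sketch under stated assumptions: vectors are plain lists, each filter is a list of per-position weight vectors, biases and dropout are omitted, and the names `conv_relu_maxpool` and `encode_action` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conv_relu_maxpool(symbols, filters):
    """One feature per filter: the max over positions of ReLU(filter response)."""
    features = []
    for filt in filters:  # filt is a list of `width` weight vectors
        width = len(filt)
        responses = [
            max(0.0, sum(dot(w, symbols[i + j]) for j, w in enumerate(filt)))
            for i in range(len(symbols) - width + 1)
        ]
        # If the text is shorter than the filter there are no windows; use 0.0.
        features.append(max(responses, default=0.0))
    return features


def encode_action(time_vec, text_symbols, context_vec, filters):
    """Concatenate the time, pooled text, and context vectors of one action."""
    return time_vec + conv_relu_maxpool(text_symbols, filters) + context_vec
```

Mixing filters of several widths in `filters` corresponds to the convolutions of increasing widths described above; the pooled output has one component per filter regardless of the text length.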
Embedding episodes. Next we combine the vector representations of the actions of an episode. For this purpose, one option is a recurrent neural network (RNN). However, recurrent models are biased due to processing inputs sequentially, and suffer from vanishing and exploding gradients. Therefore we propose the use of self-attention layers, which avoid the sequential biases of RNNs and admit efficient implementations.
In our particular formulation, we use several layers of multi-head self-attention, each taking the output of the previous layer as input; architectural details of the encoder layers follow those of the Transformer architecture proposed by Vaswani et al. (2017). We apply mean pooling after every layer to yield layer-specific embeddings, which we concatenate. We project the result to the desired embedding dimension D using an MLP, with both its input and output batch normalized (Ioffe and Szegedy, 2015).

The loss function
For the purpose of training the embedding f θ we compose it with a discriminative classifier g φ : R D → R Y with parameters φ predicting the author of an episode, where Y is the number of authors in the training set. We estimate θ and φ jointly using a standard cross-entropy loss on a corpus of examples with their known authors. Once the model is trained, the auxiliary projection g φ is discarded. Two possibilities for g φ are proposed below.
Softmax (SM). We introduce a weight matrix $W \in \mathbb{R}^{Y \times D}$ and define the map $g_\phi(z) = \mathrm{softmax}(Wz)$ with parameters $\phi = W$. When using this loss function, one compares embeddings using Euclidean distance.
Angular margin (AM). Following Deng et al. (2019) we again introduce a weight matrix $W \in \mathbb{R}^{Y \times D}$ whose rows now serve as class centers for the training authors. Given the embedding $z \in \mathbb{R}^{D}$ of an episode, let $z' = z / \|z\|$ be the normalization of $z$ and let $W'$ be obtained from $W$ by normalizing its rows. Then the entries of $w = W' z'$ give the cosines of the angles between $z$ and the class centers. Let $w'$ be obtained from $w$ by modifying the entry corresponding with the correct author by adding a fixed margin $m > 0$ to the corresponding angle (see footnote 2). Finally, define $g_\phi(z) = \mathrm{softmax}(s w')$ where $s > 0$ is a fixed scale constant. When using this loss function, one compares embeddings using cosine similarity.
Figure 2: Projections of embeddings of user episodes. (a) Embeddings obtained using SM loss. (b) Embeddings obtained using AM loss. Each point is the result of mapping an episode to a single point in $\mathbb{R}^{512}$ and projected to $\mathbb{R}^{2}$ using t-SNE. The colors of the points correspond with the 50 different authors of the underlying episodes. We emphasize that the episodes shown here were not seen by the model at training time.
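The angular margin computation can be sketched in a few lines. The code below is an illustration under stated assumptions, not the released implementation: it works on plain Python lists, the function names are hypothetical, and the margin is applied with the identity cos(θ + m) = cos θ cos m − sin θ sin m, taking sin θ as the non-negative square root of 1 − cos² θ.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def am_logits(z, W, label, m=0.5, s=64.0):
    """Scaled cosines with the margin m added to the angle of the correct author."""
    z_hat = normalize(z)
    cosines = [sum(a * b for a, b in zip(normalize(row), z_hat)) for row in W]
    cos_t = max(-1.0, min(1.0, cosines[label]))  # guard against rounding error
    sin_t = math.sqrt(1.0 - cos_t * cos_t)       # discards the sign of the angle
    cosines[label] = cos_t * math.cos(m) - sin_t * math.sin(m)  # cos(theta + m)
    return [s * c for c in cosines]


def cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_norm - logits[label]
```

Adding the margin lowers the logit of the correct author, so an episode must lie closer in angle to its class center to reach the same loss as it would without the margin.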

Corpora for Large-Scale Author Identification

Reddit benchmark
Reddit is a large, anonymous social media platform with a permissive public API. Using Reddit consists of reading and posting comments, which consist of informal text, primarily in English, each appearing within a particular subreddit, which we treat as a categorical feature providing useful contextual signal in characterizing users. We introduce a new benchmark author identification corpus derived from the API (Gaffney and Matias, 2018) containing Reddit comments by 120,601 active users for training and 111,396 held-out users for evaluation. The training split contains posts published in 2016-08 while the evaluation split contains posts published in 2016-09. In both cases, we restrict to users publishing at least 100 comments but not more than 500. The lower bound ensures that we have sufficient evidence for any given user for training, while the upper bound is intended to mitigate the impact of bots and atypical users. The evaluation split is disjoint from the training split and contains comments by 42,121 novel authors not contributing to the training split.
Footnote 2: One way to calculate $\cos(\theta + m)$ from $\cos\theta$ is $\cos\theta \cos m - \sin\theta \sin m$, where $\sin\theta$ is calculated as $\sqrt{1 - \cos^{2}\theta}$. Note that this calculation discards the sign of $\theta$.
Validation. For model selection, we use the first 75% of each user's chronologically ordered posts from the training set, with the final 25% reserved for validation. For example, in our ranking experiments described in §6.3 we use these held-out comments as candidate targets, using ranking performance to inform hyper-parameter choice.
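The per-user validation split amounts to a chronological cut. A minimal sketch, assuming each post is a (timestamp, text) pair and using the hypothetical name `chronological_split`:

```python
def chronological_split(posts, train_fraction=0.75):
    """Split one user's posts into an early part for fitting and a late part for validation."""
    ordered = sorted(posts, key=lambda post: post[0])  # post = (timestamp, text)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]
```

Cutting by time rather than at random means every validation post is later than every post used for fitting, which matches the goal of comparing episodes from different periods.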

Twitter benchmark
The microblog domain is sufficiently distinct from Reddit that it is suitable as an additional case study. For this purpose, we sample 169,663 active Twitter users from three months of 2016 as separate training, development, and test sets (2016-08 through 2016-10). We use three months because we rely on a sub-sampled collection of Twitter, as little as 1% of all posts published, resulting in significantly fewer posts by each user than on Reddit. Another consequence of this sub-sampling is that the collection violates our assumptions regarding contiguous user actions.

Experiments
In the experiments described below, we refer to our method as IUR for Invariant User Representations.

Baseline methods
In order to validate the merit of each of our modeling contributions, we compare against three baseline models described below. To the best of our knowledge, we are the first to consider using metric learning to learn embeddings from episodes of user activity. We are also the first to consider doing so in open-world and large-scale settings. As such, the neural baseline described below uses the training scheme proposed in this paper, and was further improved to be more competitive with the proposed model.
Neural author identification. We use the architecture proposed by Shrestha et al. (2017) for closed-set author attribution in place of our f θ . At the level of individual posts this architecture is broadly similar to ours in that it applies 1D convolutions to the text content. To extend it to episodes of comments, we simply concatenate the text content into a single sequence with a distinguished end-of-sequence marker. Note that the timing and context features may also be viewed as sequences, and in experiments with these features we run a separate set of one-dimensional filters over them. All max-over-time pooled features are concatenated depthwise. By itself, this model failed to produce useful representations; we found it necessary to apply the batch-normalized MLP described in §4.1 to the output layer before the loss. To train the model, we follow the procedure described in §4.2 to compose the embedding with the SM loss function, optimize the composition using cross-entropy loss, and discard the SM factor after training.

Document vectors. By concatenating all the textual content of an episode we can view the episode as a single document. This makes it straightforward to apply classical document indexing methods to the resulting pseudo-document. As a representative approach, we use TF-IDF with cosine distance (Robertson, 2004). We note that TF-IDF is also well-defined with respect to arbitrary bags-of-items, and we make use of this fact to represent a user according to the sequence of subreddits to which they post as a further baseline in §6.3.
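A minimal sketch of this baseline follows. The weighting variant (raw term counts multiplied by log(N/df) + 1) and the function names are assumptions made for illustration; the source does not specify which TF-IDF variant was used. Each document is a list of items, which may be word pieces or subreddit names.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Sparse TF-IDF vectors (dicts) for a list of item lists."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in Counter(doc).items()}
            for doc in documents]


def cosine_distance(u, v):
    """One minus the cosine similarity of two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return 1.0 - dot / norm if norm else 1.0
```

Because the vectors are keyed by arbitrary hashable items, the same two functions cover both the text baseline and the subreddit bag-of-items baseline.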
Author verification models. We use the SCAP n-gram profile method of Frantzeskou et al. (2007). Two episodes are compared by calculating the size of the intersection of their n-gram profiles. We use profiles of fixed length 64 in our experiments.
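The profile comparison can be sketched as below. This is an illustrative reading: a profile is taken to be the set of the most frequent character n-grams of the concatenated episode text, the value n = 3 is an assumption (the source states only the profile length of 64), and the function names are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n=3, profile_size=64):
    """Set of the `profile_size` most frequent character n-grams of `text`."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram for gram, _ in counts.most_common(profile_size)}


def scap_similarity(text_a, text_b, n=3, profile_size=64):
    """Size of the intersection of the two n-gram profiles."""
    return len(ngram_profile(text_a, n, profile_size)
               & ngram_profile(text_b, n, profile_size))
```

The score is an integer between 0 and the profile length, so ranking by it produces many ties when episodes are short.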

Model hyperparameters and training
Below we list our hyperparameter choices for the IUR model, which we define in §4.
For both Twitter and Reddit, we estimate the sub-word vocabulary on training data using an inventory of 65,536 word pieces, including a distinguished end-of-sequence symbol. We truncate comments to 32 word pieces, padding if necessary. We restrict to the 2048 most popular subreddits, mapping all others to a distinguished unk symbol. We encode word pieces and subreddits as 256-dimensional vectors. The architecture for the text content uses four convolutions of widths 2, 3, 4, 5 with 256 filters per convolution. We use two layers of self-attention with 4 attention heads per layer, and hidden layers of size 512. Other details such as use of layer normalization match the recommendations of Vaswani et al. (2017).
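The truncation, padding, and unk-mapping steps can be sketched as follows. The padding id of 0, the `<unk>` string, the reading of "most popular" as most frequent in the training data, and all function names are assumptions made for illustration.

```python
from collections import Counter


def pad_or_truncate(piece_ids, length=32, pad_id=0):
    """Cut a list of word-piece ids to `length`, padding on the right if it is shorter."""
    return (piece_ids + [pad_id] * length)[:length]


def build_context_vocab(contexts, size=2048, unk="<unk>"):
    """Assign ids to the `size` most frequent contexts; id 0 is reserved for unk."""
    vocab = {unk: 0}
    for name, _ in Counter(contexts).most_common(size):
        vocab[name] = len(vocab)
    return vocab


def lookup(vocab, name, unk="<unk>"):
    """Id of a context, falling back to the unk id for anything outside the vocabulary."""
    return vocab.get(name, vocab[unk])
```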
We train all variations of the IUR for a fixed budget of 200,000 iterations of stochastic gradient descent with momentum 0.9 and a piecewise constant learning rate schedule that starts at 0.1 and is decreased by a factor of 10 at 100,000 and 150,000 iterations. The final MLP has one hidden layer of dimension 512 with output also of dimension D = 512. For the angular margin loss we take m = 0.5 and s = 64 as suggested in Deng et al. (2019).
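Read literally, the schedule holds the learning rate at 0.1 and divides it by 10 at each of the two stated iterations. A minimal sketch of that reading, with the hypothetical name `learning_rate` and the assumption that each drop takes effect at the boundary iteration itself:

```python
def learning_rate(step, base=0.1, boundaries=(100_000, 150_000), factor=10.0):
    """Learning rate at a given SGD iteration under step decay."""
    rate = base
    for boundary in boundaries:
        if step >= boundary:
            rate /= factor
    return rate
```

Under this reading the rate is 0.1 for the first half of the 200,000-iteration budget, 0.01 for the next quarter, and 0.001 for the last quarter.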

Reddit ranking experiment
Given a query episode by a known user, our author ranking experiment consists of returning a list of target episodes ranked according to their similarity to the query. The problem arises in the moderation of social media content when, say, a user attempts to circumvent an account ban by using another account.
Experimental setup. Recall that we train all Reddit models on the 2016-08 split. In this experiment we draw episodes from the first half of 2016-09 as queries and the second half of 2016-09 as targets. Specifically, for [...]
We compare models using mean reciprocal rank (MRR), median rank (MR), and recall-at-k (R@k) for various k. The MRR is the mean over all 25,000 queries of the reciprocal of the position of the correct target in the ranked list. The MR is the median over the queries of the position of the correct target. The R@k is the proportion of the queries for which the correct target appears among the first k ranked targets.
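The three metrics follow directly from the rank of the correct target for each query. The sketch below is illustrative: it ranks targets by cosine similarity of their embeddings, breaks ties in favor of the correct target, and uses k = 1 and k = 8 only as example values; the function names are hypothetical.

```python
import math
from statistics import mean, median


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank_of_correct(query, targets, correct):
    """1-based rank of targets[correct] when sorted by decreasing cosine similarity."""
    scores = [cosine(query, target) for target in targets]
    return 1 + sum(score > scores[correct] for score in scores)


def summarize(ranks, ks=(1, 8)):
    """MRR, median rank, and recall-at-k from the rank of each query's correct target."""
    out = {"MRR": mean(1.0 / r for r in ranks), "MR": median(ranks)}
    for k in ks:
        out[f"R@{k}"] = mean(1.0 if r <= k else 0.0 for r in ranks)
    return out
```

For example, ranks of 1, 2, and 4 over three queries give an MRR of (1 + 1/2 + 1/4) / 3, a median rank of 2, and a recall-at-1 of one third.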
Results. The results of this experiment are shown in Table 1. For each combination of features considered, the rankings based on the proposed IUR embeddings consistently outperform all other methods considered, both neural and classical. We also report results on several variations of our model, noted in parentheses. First, using the proposed architecture for f θ but the softmax loss results in ranking performance comparable to the baseline system. Second, using a recurrent architecture rather than self-attention to aggregate information across an episode results in significantly worse performance. Finally, omitting time features results in worse performance.
Performance on novel users. As described above, the experiments presented in Table 1 involved ranking episodes by test authors, some of whom had been seen during training, and some new to the model. To better understand the ability of the proposed embedding to generalize to new users, we performed a further evaluation in which authors were restricted to those not seen at training time. For the IUR incorporating all features, this yielded an MRR of 0.50, while our extension of Shrestha et al. (2017) obtains 0.38 for the same queries. Both methods produce salient embeddings of novel users, but IUR retains an edge over the baseline.
Varying episode length. As described above, the experiments presented in Table 1 involved episodes of length exactly 16. In Figure 3, we report results of a further ranking experiment in which we vary the episode length, both at training time and at ranking time. For both the proposed IUR and our extension of Shrestha et al. (2017), performance increases as episode length increases. Furthermore, even for the shortest episodes considered, the proposed approach performs better. This illustrates that the choice of episode length should be decided on an application-specific basis. For example, for social media moderation, it may be desirable to quickly identify problematic users on the basis of as few posts as possible.

Twitter ranking experiment
We repeat the experiment described in §6.3 using data from Twitter in place of Reddit, and with the further difference that the queries were drawn from 2016-08 and the targets from 2016-10 as a mitigation of the 1% sub-sampling of the Twitter collection. Unlike the Reddit dataset, all three data splits contain posts by exactly the same authors. The results are shown in Table 2.

Wikipedia sockpuppet verification
In this section we describe an experiment for the task of sockpuppet verification on Wikipedia using the dataset collected by Solorio et al. (2014). Wikipedia allows editors to open cases against other editors for using suspected sockpuppet accounts to promote their contributions. We have reorganized the dataset into pairs of episodes by different accounts. Half of our examples contain a pair deemed by the community to have the same author, while half have been deemed to have different authors. The task is to predict whether a pair of episodes was composed by the same author.
We are interested in whether the text-only version of our IUR model, trained on Reddit data, is able to transfer effectively to this task. This domain is challenging because in many cases sockpuppet accounts are trying to hide their identity, and furthermore, Wikipedia talk pages contain domain-specific markup which is difficult to reliably strip or normalize. Naturally we expect that the identities of Wikipedia editors do not overlap with Reddit authors seen at training time, since the data is drawn from different time periods and from different platforms.
As a baseline, we compare to BERT, a generic text representation model trained primarily on Wikipedia article text (Devlin et al., 2018). While BERT is not specifically trained for author recognition tasks, BERT has obtained state-of-the-art results in many pairwise text classification tasks including natural language inference, question pair equivalence, question answering, and paraphrase recognition. The BERT model used here has 110 million parameters compared to 20 million for our embedding.
Setup. Because many comments are short, we preprocess the data to ensure that each comment has at least 5 whitespace-separated tokens. We restrict to users contributing at least 8 such comments. This left us with 180 cases which we split into 72 for training, and 54 each for validation and testing. We fine-tune both the cased and uncased pretrained English BERT models for our sockpuppet detection task using public models and software. In order to combine the comments comprising an episode for BERT, we explored different strategies, including encoding each comment separately. We found that simply combining comments together and using a long sequence length of 512 gave the best validation performance. For our model, we fine-tune by fitting an MLP on top of our embeddings using binary cross entropy and keeping other parameters fixed. Both methods are tuned on validation data, and the best hyperparameter configuration is then evaluated on held-out test data.
Results. Results are reported in Table 3. The best validation performance is obtained by the cased BERT model. However, both BERT models appear to overfit the training data as test performance is significantly lower. Regarding the proposed IUR model, we see that its performance on validation data is comparable to BERT while generalizing better to held-out test data. For reference, Solorio et al. (2013) report accuracy of 68.83 using the same data using an SVM with hand-crafted features; however, neither their experimental splits nor their model are available for purposes of a direct comparison.

Clustering users
For certain tasks it is useful to identify groups of accounts shared by the same author or to identify groups of accounts behaving in a similar fashion (Solorio et al., 2013; Tsikerdekis and Zeadally, 2014). To this end, we experiment with how well a clustering algorithm can partition authors on the basis of the cosine similarity of their IUR episode embeddings.
Procedure. Using the pre-trained Reddit IUR model, we embed five episodes of length 16 by 5000 users selected uniformly at random, all drawn from the held-out 2016-09 split. The embeddings are clustered using affinity propagation, hiding both the identities of the users as well as the true number of users from the algorithm (Frey and Dueck, 2007). Ideally the algorithm will arrive at 5000 clusters, each containing exactly five episodes by the same author. Clustering performance is evaluated using normalized mutual information (NMI), homogeneity (H), and completeness (C) (Rosenberg and Hirschberg, 2007). NMI involves a ratio of the mutual information of the clustering and ground truth. Homogeneity is a measure of cluster purity. Completeness measures the extent to which data points by the same author are elements of the same cluster. All three measures lie in the interval [0, 1] where 1 is best. The results are shown in Table 4.
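For reference, the three clustering measures can be computed from label entropies as sketched below. Homogeneity and completeness follow the conditional-entropy definitions of Rosenberg and Hirschberg (2007); normalizing the mutual information by the arithmetic mean of the two entropies is an assumption, since the text does not state which normalization was used. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """H(a | b) for two parallel lists of labels."""
    n = len(a)
    joint = Counter(zip(a, b))
    marginal = Counter(b)
    return -sum(c / n * math.log(c / marginal[y]) for (_, y), c in joint.items())


def clustering_scores(truth, predicted):
    """NMI, homogeneity, and completeness of `predicted` against `truth`."""
    h_truth, h_pred = entropy(truth), entropy(predicted)
    h_truth_given_pred = conditional_entropy(truth, predicted)
    h_pred_given_truth = conditional_entropy(predicted, truth)
    homogeneity = 1.0 if h_truth == 0 else 1.0 - h_truth_given_pred / h_truth
    completeness = 1.0 if h_pred == 0 else 1.0 - h_pred_given_truth / h_pred
    denom = (h_truth + h_pred) / 2.0
    nmi = 1.0 if denom == 0 else (h_truth - h_truth_given_pred) / denom
    return {"NMI": nmi, "H": homogeneity, "C": completeness}
```

Placing every episode in its own cluster yields perfect homogeneity but poor completeness, which is why the two measures are reported together.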

Related Work
This work considers the problem of learning to compare users on social media. A related task which has received considerably more attention is predicting user attributes (Han et al., 2014; Sap et al., 2014; Dredze et al., 2013; Culotta et al., 2015; Volkova et al., 2015; Goldin et al., 2018). The inferred user attributes have proven useful for social science and public health research (Mislove et al., 2011; Morgan-Lopez et al., 2017). While author attributes like gender or political leaning may be useful for population-level studies, they are inadequate for identifying particular users.
More generally, learning representations for downstream tasks using unsupervised training has recently emerged as an effective way to mitigate the lack of task-specific training data (Peters et al., 2018; Devlin et al., 2018). In the context of social media data, unsupervised methods have also been explored to obtain vector representations of individual posts on Twitter (Dhingra et al., 2016; Vosoughi et al., 2016). Our approach is distinguished from this prior work in several respects. First, we embed episodes consisting of multiple documents, which involves aggregating features. Second, for each document, we encode both textual features as well as associated meta-data. Finally, our training procedure is discriminative, embedding episodes into a vector space with an immediately meaningful distance.
When social network structure is available, for example on Twitter via followers, it may be used to learn user embeddings (Tang et al., 2015; Grover and Leskovec, 2016; Kipf and Welling, 2016). Graph representations have successfully been combined with content-based features; for example, Benton et al. (2016) propose matrix decomposition methods that exploit complementary features of Twitter authors. Graph-based embeddings have proven useful in downstream applications such as entity linking (Yang et al., 2016). However, such methods are not applicable when network structure is unavailable or unreliable, such as with new users or on social media platforms like Reddit. In this work, we are motivated in part by adversarial settings such as moderation, where it is desirable to quickly identify the authorship of novel users on the basis of sparse evidence.
The most closely related work is author identification on social media. However, previous work in this area has largely focused on distinguishing among small, closed sets of authors rather than the open-world setting of this paper (Mikros and Perifanos, 2013; Ge et al., 2016). For example, Schwartz et al. (2013) consider the problem of assigning single tweets to one of a closed set of 1000 authors. Overdorf and Greenstadt (2016) consider the problem of cross-domain authorship attribution and consider 100 users active on multiple platforms. In a different direction, Sari et al. (2018) seek to identify stylistic features contributing to successful author identification and consider a closed set of 62 authors. In contrast, the present work is concerned with problems involving several orders of magnitude more authors. This scale precludes methods where similarity between examples is expensive to compute, such as the method of Koppel and Winter (2014).
Prior work on detecting harmful behavior like hate speech has focused on individual documents such as blog posts or comments (Spertus, 1997; Magu et al., 2017; Pavlopoulos et al., 2017; Davidson et al., 2017; de la Vega and Ng, 2018; Basile et al., 2019; Zampieri et al., 2019). Recently, there have been some efforts to incorporate user-level information. For example, for the supervised task of abuse detection, Mishra et al. (2018) find consistent improvements from incorporating user-level features.

Conclusion
Learning meaningful embeddings of social media users on the basis of short episodes of activity poses a number of challenges. This paper describes a novel approach to learning such embeddings using metric learning coupled with a novel training regime designed to learn invariant user representations. Our experiments show that the proposed embeddings are robust with respect to both novel users and data drawn from future time periods. To our knowledge, we are the first to tackle open-world author ranking tasks by learning a vector space with a meaningful distance.
There are several natural extensions of this work. An immediate extension is to further scale up the experiments to Web-scale datasets consisting of millions of users, as has been successfully done for face recognition (Kemelmacher-Shlizerman et al., 2016). Sorting episodes according to their distances to a query can be made efficient using a number of approximate nearest neighbor techniques (Indyk and Motwani, 1998; Andoni and Indyk, 2006).
We are also considering further applications of the proposed approach beyond those in this paper. For example, by restricting the features considered in the encoder to text alone or to text and temporal features, it would be interesting to explore cross-domain author attribution (Stamatatos et al., 2018). It would also be interesting to explore community composition on the basis of the proposed embeddings (Newell et al., 2016; Waller and Anderson, 2019).
Finally, it bears mentioning that the proposed model presents a double-edged sword: methods designed to identify users engaging in harmful behavior could also be used to identify authors with legitimate reasons to remain anonymous, such as political dissidents, activists, or oppressed minorities. On the other hand, methods similar to the proposed model could be developed for such purposes and not shared with the broader community. Therefore, as part of our effort to encourage positive applications, we release source code to reproduce our key results.