A Deep Metric Learning Approach to Account Linking

We consider the task of linking social media accounts that belong to the same author in an automated fashion on the basis of the content and meta-data of the corresponding document streams. We focus on learning an embedding that maps variable-sized samples of user activity–ranging from single posts to entire months of activity–to a vector space, where samples by the same author map to nearby points. Our approach does not require human-annotated data for training purposes, which allows us to leverage large amounts of social media content. The proposed model outperforms several competitive baselines under a novel evaluation framework modeled after established recognition benchmarks in other domains. Our method achieves high linking accuracy, even with small samples from accounts not seen at training time, a prerequisite for practical applications of the proposed linking framework.


Introduction
The scale and anonymity of social media pose systematic challenges for manual moderation efforts (Pennycook et al., 2020; Broniatowski et al., 2018). These challenges have motivated the development of automated methods to identify abusive content, such as Davidson et al. (2017), which considers automatically classifying hate speech, or Alvari et al. (2019), which deals with detecting violent extremism, both in the Twitter domain.
However, automatic moderation remains a difficult problem. Indeed, existing methods based on hand-constructed resources such as keyword lists may fail to adapt to novel trends (Corbett-Davies and Goel, 2018) whereas automatic methods based on statistics of large corpora may exhibit harmful biases (Caliskan et al., 2017). Additionally, individual posts may fail to contain sufficient information to reliably identify them as harmful. This work considers account-level moderation. Specifically, we consider the problem of determining whether two document streams share the same author, based on samples from those streams rather than individual documents. This capability has numerous applications, such as detecting users attempting to circumvent account bans, identifying sockpuppet accounts, and detecting coordinated disinformation campaigns involving multiple authors controlling multiple accounts.
As a motivating application, we consider the enforcement of account bans on anonymous platforms, such as Reddit. Given a new account, the problem is to automatically identify whether it matches any previously banned account, which amounts to making binary decisions about whether pairs of accounts share the same author. Variations of this problem have been studied before. For example, Schwartz et al. (2013) learn a classifier to determine whether the author of a Twitter comment belongs to a small, closed set of authors. In contrast, we are interested in an open-world setting, requiring binary decisions about arbitrary pairs of accounts. This introduces a number of challenges.
First, any individual comment may be too short to serve as the basis for linking accounts. Figure 1a illustrates this empirically using a variation of our model, where embeddings of individual comments from the same account fail to coalesce, making it difficult to assert that an account has the same author as another account. See §4 for further experimental details. Therefore, we focus on aggregating information across contiguous sequences of documents. Figure 1b illustrates the impact of aggregation using our full model, where aggregations of contiguous sequences of documents from the same account exhibit an approximate convergence behavior as the number of documents aggregated increases. In fact, the motivating application above requires linking accounts on the basis of samples of widely varying sizes. Indeed, banned accounts typically have many documents, all of which we would like to consider, while new accounts generally have few documents, which nevertheless must be linked to banned accounts as quickly as possible to mitigate abusive behavior.
The second challenge is that of spurious associations. For example, over a short period of time, an author may discuss only a single, narrow topic. While a naïve model based on word statistics alone might be sufficient to link such an account to another, this approach would fail to generalize over longer periods of time due to topic drift. Both our training procedure and evaluation framework have been designed to ensure that our model learns the appropriate invariances to identify same-authorship, rather than being correct for the wrong reasons (McCoy et al., 2019). Namely, samples from an account are drawn from different time periods in each training iteration (see §2.3), while the evaluation data consists of posts by accounts not seen at training time and postdates all of the training data (see §4.2).
Finally, the numbers of banned and new accounts may be quite large, requiring a still larger number of pairwise comparisons. For this reason, our proposed approach to account linking consists of embedding variable-sized samples from document streams into a metric space whereby samples likely to have been composed by the same author map to nearby points. Under this embedding, comparisons between document streams amount to pairwise distance calculations, so our approach is highly scalable and amenable to various optimizations, such as approximate nearest neighbor methods.
Our primary contributions are the following:
• We provide a simple but effective data augmentation strategy which enables embedding variable-sized samples. In addition, we successfully train such an embedding on a large-scale dataset consisting of more than 300 million comments from 1 million distinct accounts using scalable losses.
• We propose a novel framework to assess account linking performance focused on challenging conditions and minimizing the impact of incidental authorship features, such as topic. In particular, we propose benchmark datasets as well as verification metrics tailored to our application.
Our code, data splits, and scripts to reproduce our experiments are available at http://github.com/noa/naacl2021.

Learning embeddings of document streams
We treat a document stream as a sequence of timestamped actions $a_1, a_2, \ldots, a_L$ where each $a_i$ is a structure containing the data comprising an action. The possible contents of $a_i$ are specific to the document stream, but include at least a timestamp $t_i$ such that $t_1 < t_2 < \cdots < t_L$.
In this work we focus on textual content published on social media platforms, although the approach would easily extend to allow $a_1, a_2, \ldots, a_L$ to contain other modalities such as images or video, which would be handled similarly. In addition, we also avail of certain categorical features contained in $a_1, a_2, \ldots, a_L$ such as hashtags or the subreddit to which a comment was posted.
Figure 2: Illustration of the proposed model architecture. Each action $a_i$ consists of text content $x_i \in \mathbb{Z}^\ell$, a subreddit feature $r_i \in \mathbb{Z}$, and is published at time $t_i \in \{0, 1, \ldots, 23\}$. These elements are combined as shown using various embedding lookups $E$, one-dimensional convolutions $C$, and linear projections $P$, along with an attention mechanism $A$ and a max pooling layer $M$.
A sample from a document stream is a contiguous subsequence of its actions. We introduce an embedding $f_\theta$ in §2.1 mapping a variable-sized sample to a point in a vector space, such that the Euclidean distance between the embeddings of two samples quantifies the likelihood that they belong to the same author.

Architecture
We define an embedding $f_\theta$ as follows. This embedding is illustrated in Figure 2. Consider a sample $a = (a_1, a_2, \ldots, a_M)$ where each action $a_i$ consists of a subreddit feature $r_i$, a timestamp $t_i$, and text content $x_i$. We encode $t_i$ as the corresponding hour of the day and $r_i$ by lookup in the list of the 2048 most common subreddits, resulting in $t_i \in \{0, 1, \ldots, 23\}$ and $r_i \in \{0, 1, \ldots, 2048\}$, where $r_i = 2048$ when the subreddit is not among the top 2048. We encode the text feature using the SentencePiece unigram subword model (Kudo, 2018), resulting in $x_i \in \mathbb{Z}^\ell$, where the parameter $\ell$ is defined in §2.2. Note that the chosen vocabulary size impacts the amount of content that can be encoded with $\ell$ integers. Further details on the choice of text encoding are provided in Appendix C.
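The encoding of the time and subreddit features described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, timestamps are assumed to be Unix seconds interpreted in UTC, and the three-entry subreddit list stands in for the 2048 most common subreddits.

```python
from datetime import datetime, timezone

def build_subreddit_index(top_subreddits):
    """Map each listed subreddit to an id in 0..K-1 (K = 2048 in the paper)."""
    return {name: i for i, name in enumerate(top_subreddits)}

def encode_action(unix_time, subreddit, index):
    """Return (hour_of_day, subreddit_id) for a single action."""
    # Assumption: timestamps are Unix seconds and hours are taken in UTC.
    hour = datetime.fromtimestamp(unix_time, tz=timezone.utc).hour
    # Every subreddit outside the list shares the single id K.
    subreddit_id = index.get(subreddit, len(index))
    return hour, subreddit_id

# Toy list for illustration only.
index = build_subreddit_index(["AskReddit", "politics", "funny"])
print(encode_action(1480000000, "politics", index))
print(encode_action(1480000000, "a_small_subreddit", index))
```

With the full list of 2048 subreddits, the out-of-list id would be 2048, matching the range given above.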
We replace each token of $x_i$ with a corresponding learned embedding in $\mathbb{R}^N$ for all $1 \le i \le M$, resulting in $M$ matrices in $\mathbb{R}^{\ell \times N}$. We apply one-dimensional convolutions of widths 2, 3, and 4 along the first axis of each, concatenate the convolved matrices along their second axes, max-pool along the first axis, and concatenate the results with learned embeddings of the corresponding subreddit features and one-hot encodings of the corresponding time features, resulting in $M$ vectors, which we aggregate using dot-product attention. The resulting sequence of vectors is projected to a single vector through max-pooling followed by two fully-connected layers with bias, resulting in the encoding $f_\theta(a) \in \mathbb{R}^D$ of the sample $a$.
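To make the shapes in this description concrete, the following NumPy sketch runs a forward pass with small random weights. Several details are assumptions not stated in the text: each convolution is max-pooled separately before concatenation (which avoids padding and yields the same output shape), attention is scaled dot-product self-attention without learned projections, and a ReLU sits between the two fully-connected layers. All names and dimensions are illustrative.

```python
import numpy as np

def init_params(rng, vocab=100, n_sub=9, N=8, F=6, D=10):
    """Random toy parameters; the real model is far larger."""
    H = 3 * F + N + 24  # text features + subreddit embedding + one-hot hour
    return {
        "E_tok": rng.normal(size=(vocab, N)),
        "E_sub": rng.normal(size=(n_sub, N)),
        "kernels": [rng.normal(size=(w, N, F)) for w in (2, 3, 4)],
        "W1": rng.normal(size=(H, D)), "b1": np.zeros(D),
        "W2": rng.normal(size=(D, D)), "b2": np.zeros(D),
    }

def conv1d_maxpool(tokens, kernel):
    """Valid 1-D convolution along the token axis, then max over positions."""
    width = kernel.shape[0]
    n_pos = tokens.shape[0] - width + 1
    windows = np.stack([tokens[i:i + width] for i in range(n_pos)])
    feats = np.einsum("pwn,wnf->pf", windows, kernel)  # (positions, F)
    return feats.max(axis=0)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def embed_sample(x, r, t, p):
    """x: (M, ell) token ids, r: (M,) subreddit ids, t: (M,) hours."""
    rows = []
    for x_i, r_i, t_i in zip(x, r, t):
        tok = p["E_tok"][x_i]  # (ell, N)
        text = np.concatenate([conv1d_maxpool(tok, k) for k in p["kernels"]])
        rows.append(np.concatenate([text, p["E_sub"][r_i], np.eye(24)[t_i]]))
    h = np.stack(rows)  # (M, H)
    attended = softmax(h @ h.T / np.sqrt(h.shape[1])) @ h  # assumed attention form
    pooled = attended.max(axis=0)  # max-pool over the M actions
    hidden = np.maximum(pooled @ p["W1"] + p["b1"], 0.0)  # assumed ReLU
    return hidden @ p["W2"] + p["b2"]  # point in R^D

rng = np.random.default_rng(0)
params = init_params(rng)
x = rng.integers(0, 100, size=(5, 32))
print(embed_sample(x, rng.integers(0, 9, size=5), rng.integers(0, 24, size=5), params).shape)
```

Because the pooling over actions removes the dependence on $M$, the same parameters apply to samples of any size, which is what makes training on variable-sized samples possible.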

Text Sampling
The variance in the lengths of documents poses computational challenges when aggregating large samples. Therefore we resort to truncating each document to a fixed number $\ell$ of tokens, padding any documents containing fewer than $\ell$ tokens. We take $\ell = 32$ after observing that Reddit posts have an average length of approximately 43 tokens (see Appendix D). We also experimented with a more complicated text sampling strategy, namely sampling contiguous segments of $\ell$ tokens from each post uniformly at random during training. While this approach leverages all available textual information by affording slightly different samples of each post in each iteration of training, we found that it yields similar results and complicates the comparison to our primary baseline model, which uses the prefix method described above.
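The two text sampling strategies can be sketched as follows. The function names, the padding id of 0, and right-padding are assumptions made for illustration; the source only specifies truncation to a fixed length and padding of shorter documents.

```python
import random

def prefix_sample(tokens, length=32, pad_id=0):
    """Keep the first `length` tokens, right-padding shorter documents."""
    clipped = tokens[:length]
    return clipped + [pad_id] * (length - len(clipped))

def random_segment(tokens, length=32, pad_id=0, rng=random):
    """Keep a contiguous window of `length` tokens chosen uniformly at random."""
    if len(tokens) <= length:
        return prefix_sample(tokens, length, pad_id)
    start = rng.randrange(len(tokens) - length + 1)
    return tokens[start:start + length]

doc = list(range(1, 44))  # a 43-token document, the reported average length
print(len(prefix_sample(doc)), len(random_segment(doc)), len(prefix_sample([5, 6, 7])))
```

Both functions return exactly `length` ids, so documents can be stacked into a fixed-shape array regardless of their original lengths.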

Sample selection
During training we randomly select document stream samples of sizes varying between $R = 1$ and $S = 16$, which we regard as hyperparameters of the model. To select a sample from the stream $a_1, a_2, \ldots, a_L$ we first choose its length $M$, with $R \le M \le S$. Selecting $M$ according to Beta(3, 1) provides an expected sample size closer to $S$ than to $R$, a tradeoff that allows the model to quickly learn features of a document stream by exposing it to larger samples most of the time, while still maintaining the flexibility to handle samples of varying sizes. Indeed, the latter is critical in the evaluation described in §4.3, which requires linking large samples to small samples. The density function of Beta(3, 1) is shown in Figure 3 together with that of the uniform distribution Unif(0, 1) for comparison. We explore the benefits of Beta(3, 1) and other related distributions in §4.5.
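A sketch of this sampling step is given below. The exact mapping from a Beta(3, 1) draw to an integer size, and the uniform choice of starting position, are assumptions made here for illustration; the text states only that sizes lie between $R$ and $S$ and follow Beta(3, 1).

```python
import numpy as np

def sample_length(rng, R=1, S=16, a=3.0, b=1.0):
    """Draw a size in {R, ..., S} by rescaling a Beta(a, b) variate (assumed mapping)."""
    x = rng.beta(a, b)
    return min(S, R + int(x * (S - R + 1)))

def select_sample(stream, rng, R=1, S=16):
    """Return a contiguous subsequence of `stream` with a randomly drawn size."""
    M = min(sample_length(rng, R, S), len(stream))
    start = rng.integers(0, len(stream) - M + 1)  # assumed uniform start position
    return stream[start:start + M]

rng = np.random.default_rng(0)
sizes = [sample_length(rng) for _ in range(10000)]
print(min(sizes), max(sizes), sum(sizes) / len(sizes))
```

Since Beta(3, 1) has mean 0.75, the average size under this mapping sits well above the midpoint of the range, in line with the stated preference for larger samples.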

Scalable deep metric learning losses
Deep metric learning methods aim to embed observations into a low-dimensional space such that instances from the same class map to nearby points under a chosen metric, such as Euclidean distance. In our setting, we take the instances to be document stream samples and the classes to be the corresponding accounts, which serve as proxies for latent authorship. Therefore, training the mapping $f_\theta$ defined in §2.1 using metric learning affords an embedding under which samples by the same author map to nearby points.
Recent work in deep metric learning has introduced a number of training objectives with state-of-the-art performance on computer vision tasks (Kim et al., 2020; Wang et al., 2019). Unfortunately, many of these objectives scale linearly with the number $K$ of classes considered due to a costly linear projection onto $\mathbb{R}^K$. Note that because account names in effect provide labels for the corresponding document streams, we may use raw social media content to fit our model directly, availing of a virtually unlimited source of data. We stipulate that the ability to exploit larger amounts of data may be more important than per-example efficiency, and therefore consider the classical triplet loss (Schroff et al., 2015) in our experiments, whose complexity does not depend on $K$. In particular, we use semi-hard negative mining with a fixed margin penalty. We also consider the top-k loss recently proposed by Lu et al. (2019), which optimizes precision-at-k as follows. Given targets ranked by similarity to a query, top-k arranges for as many matches as possible to be among the top k ranked targets. It accomplishes this by penalizing only those targets that would need to move the smallest amount in order to maximize the number of matching targets among the top k. Like triplet loss, top-k also uses an additive margin penalty to separate classes. See Appendix A for further experimental details on both loss functions.
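As an illustration of triplet loss with semi-hard negative mining, the NumPy sketch below evaluates the loss on one batch. It is not the training code: it uses plain Euclidean distance, picks for each anchor and positive the closest negative that lies farther than the positive, and falls back to the farthest negative when no such negative exists. That fallback is one common convention and is an assumption here. Note that the cost depends only on the batch size, not on the total number of accounts.

```python
import numpy as np

def semihard_triplet_loss(emb, labels, margin=0.2):
    """Mean hinge loss over (anchor, positive) pairs with mined negatives."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    losses = []
    for a in range(len(labels)):
        negatives = dist[a][~same[a]]
        if negatives.size == 0:
            continue  # no other account in the batch
        for p in np.flatnonzero(same[a]):
            if p == a:
                continue
            farther = negatives[negatives > dist[a, p]]
            # Semi-hard choice: closest negative beyond the positive (assumed fallback otherwise).
            d_neg = farther.min() if farther.size else negatives.max()
            losses.append(max(dist[a, p] - d_neg + margin, 0.0))
    return float(np.mean(losses)) if losses else 0.0

separated = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]
print(semihard_triplet_loss(separated, [0, 0, 1, 1]))
```

The default margin of 0.2 matches the value reported for triplet loss in Appendix A.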

Related work
The separate but related problem of closed-world author attribution has received considerable attention. For example, the PAN 2019 challenge (Daelemans et al., 2019) employed a closed-world setting with a small number of authors that are the same at training and test time. That task also considered longer documents, obviating the need for aggregating evidence of authorship across multiple documents.
Generic text embedding methods such as the universal sentence encoder (Cer et al., 2018) and BERT (Devlin et al., 2019) are fit using auxiliary tasks, such as conditional language modeling. In the case of BERT, this is usually followed by supervised fine-tuning for a downstream task of interest. In this work, we are interested in learning representations that are immediately useful for our account linking task. However, because a large corpus of task-specific training data may be collected without human supervision, the benefits of generative pre-training are diminished in our setting. Indeed, the parameters of the text encoding are learned from a random initialization in all our experiments. Our approach is further distinguished from generic embedding methods by featuring a multi-document embedding, mapping a sequence of documents to a single vector, where each document may consist of both text and metadata.
The most closely related prior work is the Invariant User Representation (IUR) proposed by Andrews and Bishop (2019), whose approach is broadly similar to ours, but only considers samples of a fixed size. Our approach may be viewed as a generalization of that work in support of the account linking task. In addition we use a simpler dot product attention mechanism and introduce the use of scalable metric learning losses in §2.4, which enable us to train our model on an order of magnitude more data than previously considered. We validate these improvements in §4.2 using the ranking task proposed by Andrews and Bishop (2019). We also adapt IUR to serve as a baseline in our primary linking task in §4.3.
We believe that our treatment of account linking as a pairwise recognition task between document stream samples and our proposed general-purpose evaluation protocols are both novel. However, in prior work, there have been several platform-specific approaches to account linking. For example, Silvestri et al. (2015) explore a heuristic approach to linking accounts across social media platforms. Separately, on platforms with rich social network information, graph-matching methods have been explored (Fan, 2012). Our focus is on content-based account linking, which is more general than prior methods we are aware of. Some other related but distinct problems include detecting deceptive accounts (Van Der Walt and Eloff, 2018) and authorship classification of short messages (Ishihara, 2011).

Experiments
We conduct evaluations on the two primary tasks illustrated in Figure 4. First, our ranking evaluation described in §4.2 is motivated by information retrieval needs. Although ranking is not the focus of this paper, it provides an assessment of the quality of the learned embedding in terms of similarity judgements and facilitates comparison with the baseline model IUR. In addition, we use the ranking evaluation to monitor training using development data. Second, our linking evaluation described in §4.3 measures decision performance using detection metrics (Van Leeuwen and Brümmer, 2007). Both evaluations involve setting up two sets of samples as described below, the queries and the targets. For each query, there is exactly one target drawn from the same document stream. Roughly speaking, both evaluations involve matching targets with their corresponding queries.

MUD: a Web-scale training dataset
Reddit is currently one of the most popular social media platforms, where anonymous users interact primarily by posting comments to discussion threads. Together with its text content, each comment is labeled by its publication time and the subreddit to which it was posted, a categorical feature roughly indicating its topic.
We construct a dataset consisting of 300 million Reddit posts from 1 million users published over an entire year to be used to train our proposed model. This Million User Dataset (MUD) consists of all posts by authors who published at least 100 and at most 1000 posts between July 2015 and June 2016, where the lower bound ensures a sufficiently long history from which to sample, and the upper bound is intended to reduce the impact of bot and spam accounts. We obtained the data by drawing from the existing Pushshift Reddit corpus (Baumgartner et al., 2020). Some further statistics of MUD are shown in Table 8.

Ranking evaluation
As shown in Figure 4a, the ranking experiment consists of ranking the targets by similarity to each query. For compatibility, we mimic the experimental setup from Andrews and Bishop (2019), which proposes separate sets of queries and targets to be used for training and testing. We adopt the training split for validation and the testing split for evaluation, although we train our model on MUD (see §4.1). We select hyperparameters based on dev split performance (see Appendix A). The test split consists of samples, each of size exactly 16, although we train the proposed model using samples from MUD of varying sizes as described in §2.3. Note that the posts comprising MUD precede those of both IUR splits in publication time, ensuring that our training data is disjoint from IUR's test data.
Of the 111,396 authors contributing to the test split, 69,275 or 62% contribute to the IUR training split. In contrast, MUD has only 39,529 users in common with the test split, a significantly smaller overlap than IUR. In principle, the increase in novel users at test time puts the proposed model at a disadvantage because it places more importance on generalization to novel users.
We report recall-at-k (R@k) and mean reciprocal rank (MRR), calculated exactly as in Andrews and Bishop (2019). MRR is the expected value of the reciprocal of the position of the correct target in the ranked list. R@k is the probability that the unique target composed by the same author as a given query appears in the top k ranked results. We limit ourselves to R@4 and R@8 as proxies for the "first page" of search results returned to a user issuing a query.
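The two metrics can be computed from embeddings as in the sketch below. It assumes that target i is the unique match for query i and that ties are resolved in favor of the correct target; the exact tie handling used by Andrews and Bishop (2019) is not stated here, so this is an illustration only.

```python
import numpy as np

def ranking_metrics(query_emb, target_emb, ks=(4, 8)):
    """Return MRR and R@k, assuming target i matches query i."""
    q = np.asarray(query_emb, dtype=float)
    t = np.asarray(target_emb, dtype=float)
    d = np.linalg.norm(q[:, None, :] - t[None, :, :], axis=-1)
    correct = np.diag(d)
    # Rank of the correct target: 1 + number of targets strictly closer.
    ranks = 1 + (d < correct[:, None]).sum(axis=1)
    out = {"MRR": float(np.mean(1.0 / ranks))}
    for k in ks:
        out[f"R@{k}"] = float(np.mean(ranks <= k))
    return out

queries = np.eye(3)
targets = np.eye(3)[[0, 2, 1]]  # the last two targets are swapped
print(ranking_metrics(queries, targets, ks=(1, 2)))
```

In this toy example the first query ranks its target first and the other two rank theirs second, so MRR is 2/3, R@1 is 1/3, and R@2 is 1.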
The results of this evaluation calculated with the test split are shown in Table 1. Note that the full version of the proposed model significantly outperforms the previously published state-of-the-art. 2 We conclude that although the angular margin loss used by Andrews and Bishop (2019) is considered state-of-the-art, the simpler triplet loss outperforms it, most likely because it admits the use of a considerably larger dataset. We remark that the models trained with top-k performed only slightly better than those trained with triplet loss, an observation consistent with recent findings that when matching experimental conditions, the choice of ranking loss is less important than previously believed (Musgrave et al., 2020).
In addition, Figure 5 shows the results of the evaluation performed after every hour of training. 3 Note that after only six hours of training the full model outperforms the baseline. Figure 5 also shows the learning curve for an ablation of our model that eliminates the subreddit feature. We observe that this ablated model performs almost as well as the full-featured baseline, which suggests that the proposed approach may be effective in domains where only text and timestamps are available.
2 A paired sign test of the differences in ranking between IUR and the proposed model is significant at the $p < 10^{-15}$ level.
3 Although we generated Figure 5 post-hoc using test data, we did not use test data for model or hyperparameter selection.

A new framework for account linking
While the ranking experiments in §4.2 were designed to measure the quality of the learned embedding, they do not directly measure task performance: moderation applications require decisions rather than rankings. To this end, we propose an account linking benchmark modeled after the problem of enforcing account bans, in which a fixed number of accounts are linked against novel accounts at test-time. Compared to the ranking experiments, the key difference is that we introduce a distinguished subset of authors from which we have accumulated a significant number of previously published documents to serve as queries. The procedure is illustrated in Figure 4b.
Because the subreddit feature serves as a proxy for topic, restricting to a single subreddit results in a more challenging problem by increasing the likelihood that the comments considered deal with similar topics. To this end, we repeat the following procedure for each of the five most popular subreddits. Each result of the experiment reported in Tables 2 and 9 is the average over the five subreddits of the corresponding results calculated using those subreddits individually.
Given a specified subreddit, we first randomly select 100 distinguished accounts, each publishing at least 100 posts to that subreddit in November 2016. The queries in the experiment consist of the 100 most recently published posts to the subreddit by each of the distinguished accounts in November 2016. In addition, the distinguished accounts must have published at least 16 posts to the subreddit between December 2016 and May 2017 to serve as the corresponding targets, as described below.
Next we randomly select 4900 accounts distinct from the distinguished accounts, each publishing at least 16 posts to the subreddit between December 2016 and May 2017. The targets in the experiment consist of the 4 most recently published posts to the subreddit by each of the 5000 accounts.
Performance metrics. For every query and target, each model considered returns a score, with smaller scores associated with a higher likelihood that the query and the target have the same author. For example, the proposed model returns the distance between their embeddings under the model. A decision rule to predict an author match is obtained by thresholding this score with respect to a chosen operating point. In production settings, one adjusts the operating point to obtain acceptable rates of false positives and false negatives. In our running application of ban enforcement, these types of errors correspond respectively to mistakenly banning an innocent user and failing to ban a new account of a banned user. Because the severities of these types of errors differ, we consider the detection cost function $C_{\mathrm{det}} = \pi C_{-} P_{-} + (1 - \pi) C_{+} P_{+}$ proposed by Van Leeuwen and Brümmer (2007), where $P_{-}$ and $P_{+}$ are empirical probabilities of false negatives and false positives, $C_{-}$ and $C_{+}$ are the costs of false negatives and false positives, and $\pi$ is the a priori probability of a match. We take $\pi = 0.05$ and we set $C_{-} = 1$ and $C_{+} = 2$, reflecting our presumption that banning an innocent account is more severe than failing to recognize a banned user. Our choices of $C_{-}$ and $C_{+}$ are only meant to reflect the asymmetric nature of the problem, although in practice these costs would be highly platform-specific.
We report the minimum value of $C_{\mathrm{det}}$ over all operating points (minDCF) and the value of $P_{+}$ at the operating point for which $P_{-} = P_{+}$, also known as the equal error rate (EER).
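The following sketch computes both metrics from a list of scores and ground-truth labels by sweeping the threshold over every observed score. On finite data the two error rates rarely coincide exactly, so the EER is approximated here by averaging them at the threshold where they are closest; that approximation and the function name are assumptions.

```python
import numpy as np

def detection_metrics(scores, is_match, pi=0.05, c_miss=1.0, c_fa=2.0):
    """Return (EER, minDCF); smaller scores mean a more likely author match."""
    scores = np.asarray(scores, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    # Predict a match when score <= threshold; -inf rejects everything.
    thresholds = np.concatenate([[-np.inf], np.unique(scores)])
    p_miss = np.array([np.mean(scores[is_match] > th) for th in thresholds])
    p_fa = np.array([np.mean(scores[~is_match] <= th) for th in thresholds])
    dcf = pi * c_miss * p_miss + (1 - pi) * c_fa * p_fa
    i = np.argmin(np.abs(p_miss - p_fa))
    eer = (p_miss[i] + p_fa[i]) / 2  # approximate crossing point
    return float(eer), float(dcf.min())

print(detection_metrics([0.1, 0.6, 0.4, 0.9], [True, True, False, False]))
```

Here `c_miss` and `c_fa` play the roles of $C_{-}$ and $C_{+}$, and the defaults mirror the values $\pi = 0.05$, $C_{-} = 1$, and $C_{+} = 2$ used in the experiments.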
Baseline models. We compare the proposed method with three baselines. First we consider TF-IDF vector representations of the concatenated text content of a sample, which are compared using cosine similarity. Next, we consider universal sentence encodings (Cer et al., 2018), which are compared using angular distance. We experimented with two versions of this baseline, namely embedding the concatenation of the text content of the documents in a sample, and averaging the embeddings of the individual documents. Since we found the concatenated version to perform better, we only report on this variation. Finally, we consider IUR (Andrews and Bishop, 2019). Because this model only embeds samples of size 16, we pad samples containing fewer than 16 posts. To handle samples containing more than 16 posts, we organize the sample into contiguous groups of at most 16 posts, apply the embedding to each group, and average the embeddings.
Table 2 (columns: Model, Training Length, EER, minDCF).
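The adaptation of a fixed-size model to samples of arbitrary size can be sketched as follows. Here `embed_fixed` is a hypothetical stand-in for any model that accepts exactly 16 posts, and representing padding with `None` is an assumption; the sample is assumed to be non-empty.

```python
import numpy as np

def embed_any_size(posts, embed_fixed, size=16, pad=None):
    """Split into contiguous groups of at most `size`, pad, embed, and average."""
    groups = [posts[i:i + size] for i in range(0, len(posts), size)]
    vectors = []
    for group in groups:
        padded = list(group) + [pad] * (size - len(group))
        vectors.append(embed_fixed(padded))
    return np.mean(vectors, axis=0)

# Stand-in "model": counts real posts and total slots in its input.
toy_model = lambda g: np.array([sum(p is not None for p in g), len(g)], dtype=float)
print(embed_any_size(list(range(40)), toy_model))
```

A 40-post sample is split into groups of 16, 16, and 8 posts, the last of which is padded, and the three group embeddings are averaged with equal weight.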
Results. Table 2 compares the linking performance of the three baseline models along with two variations of the proposed model arising from varying the sizes of the training samples. Note that both variations of the proposed model outperform the baselines. A further variation on this experiment is reported in Table 9 in which the queries are drawn from the training dataset, better reflecting the context of the motivating example. Figure 6 shows the effect of the sizes of the targets on linking performance. Note that performance rapidly improves with larger samples, relative to the baselines. This trend is promising for our motivating application of ban enforcement, where it is desirable to recognize banned users as early as possible. Figure 7 shows receiver operating characteristic (ROC) curves, which plot false positive rates against true positive rates as the operating point varies.

Embeddings of variable-sized samples
Our experiments in §4.3 show that linking samples from newly created accounts to those of distinguished authors is more successful when using as much historical data from the distinguished authors as possible. However, computational constraints typically inhibit embedding full account histories during training. Instead, in §4.3 we embed large samples of distinguished authors' histories using models trained on samples of sizes up to a maximum tractable length S. We take S = 16 in our experiments as described in §2.3. Here, we examine the ability of a model trained on samples of sizes at most S to generalize to samples of sizes greater than S.
We also compare to a further baseline that averages single-post embeddings produced by a variation of the proposed model trained on single posts, which we denote by Avg. This is in contrast with the proposed model, which aggregates embeddings of multiple posts using an attention mechanism. Table 3 shows the ranking performance of a number of variations of the proposed model trained with triplet loss and all features (TPS) on samples of fixed or varying sizes as specified. These results demonstrate that a model trained on variable-sized samples appears to generalize well to much longer samples. We observe substantially better performance compared to the simple averaging baseline, and only a slight decrease in performance compared to the fixed length models as the evaluation sample size increases beyond the lengths seen at training time.

Selecting the distribution of sizes
As mentioned in §2.3, we use Beta (3, 1) to select sample sizes during training. We hypothesize that a negatively skewed distribution tends to improve training efficiency by supplying longer samples most of the time, while retaining the ability to handle shorter samples. To evaluate this claim, we investigate several distributions of varying degrees of negative skew. Table 4 shows the ranking performance of variations of the proposed model trained on samples of sizes varying between 1 and 16 posts and evaluated on samples of size 16. These models differ only in the distribution used to select sample sizes. Indeed, the negatively skewed distributions do improve ranking performance over the uniform distribution, although the choice of negatively skewed distribution appears to be mostly immaterial.

Future work
This work motivates a number of interesting research questions. First, the proposed model makes use of publication times, but only avails of the hour of the day. It would be interesting to examine continuous-time variants of our encoder that incorporate relative time differences between actions when aggregating their embeddings, in light of the fact that patterns of user activity might be highly discriminative. For example, bots and spammers typically post at certain times of day and with particular frequencies. Separately, the proposed data augmentation methods we use to handle variable-sized samples may also be applicable in other settings, such as multi-document summarization (Liu and Lapata, 2019). Finally, the scores we use to determine author matches could be calibrated, providing confidence estimates associated with the account linking decisions.
To our knowledge, this work is the first to demonstrate the feasibility of a general-purpose account linking framework at web scale. Indeed, Figure 6 shows that performance improves as the size of the target increases, suggesting a speed-accuracy trade-off that can be tuned for different application settings. Expanding on an idea above, if confidence estimates were available, they could be used to inform the necessary sample sizes to achieve an acceptable level of risk.

A Hyperparameter selection
Top-k loss involves a number of hyperparameters. In addition to k and the margin penalty m, one must also select the number n + of targets from each class presented in every batch. We found that the values of k, m, n + suggested by Lu et al. (2019) did not perform well in our setting. We therefore conducted a small grid search for these values, selecting the optimal configuration based on validation scores. We considered k ∈ {4, 8}, n + ∈ {4, 16}, and m ∈ {0.05, 0.40}, resulting in k = 4, n + = 8, and m = 0.25.
In experiments with the triplet loss, we use a fixed margin m = 0.2. We arrived at this value through the grid search illustrated in Table 5. We do not use dropout regularization, as previous work has shown dropout can degrade performance when training with larger datasets (Lan et al., 2019).

B Implementation details
Mixed precision. We use mixed precision training (Micikevicius et al., 2017). This reduces the GPU memory consumed at training time by about half through the use of half precision floats, enables faster forward and backward pass computations, and allows for a larger batch size.
Multi-GPU training. Using 2 V100 GPUs to train the model significantly speeds up the process by quadrupling the effective batch size. Our models have an average training time of around 72 hours.
Simplified text encoding. We limit our convolutional text encoders to windows of 2, 3, and 4 subwords, excluding the largest window of 5 used by other models. Surprisingly, this did not impact ranking performance, which suggests that a small receptive field is sufficient for purposes of comparing authorship. As further support for this claim, we also experimented with larger receptive fields than 5, which reduced ranking performance.
Model Hyper-parameters. Our reported models are trained with an embedding dimension of D = 1024, 512 convolutional filters, and an attention mechanism producing outputs of dimension 512. Additionally the subword and subreddit embedding dimensions are both N = 512. Table 6 shows the numbers of parameters of the proposed model when using various combinations of text content (T), publication time (P), and subreddit (S).

Model    Parameters
T        44.0M
TP       44.1M
TPS      45.5M
Table 6: Numbers of parameters in various trained models.

C Text encoding
We consider two methods to encode raw text content into integer arrays, namely taking the integer values of the corresponding UTF-8 encoded bytes directly, and using the SentencePiece unigram subword model (Kudo, 2018). SentencePiece tokenizes select character groupings according to a pretrained vocabulary of a specified size, which can be orders of magnitude larger than the vocabulary of size $2^8$ used by the byte encoding. For experiments conducted in §4, we used SentencePiece with a vocabulary size of $2^{16}$. The potentially large disparity in vocabulary size between encoding methods can result in text encoded as integer arrays of significantly different lengths. In light of the need to truncate these arrays at training, we hypothesize that subword encoding with a large vocabulary results in better model performance as more textual information is captured after truncation.
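The byte encoding, and the effect of truncation on it, can be illustrated with the short sketch below. The function name and the example sentence are illustrative, and the truncation length of 32 follows §2.2. A subword encoder is not shown because it requires a trained vocabulary.

```python
def byte_encode(text):
    """Integer values of the UTF-8 bytes; every id lies in 0..255."""
    return list(text.encode("utf-8"))

text = "Naïve models based on word statistics may fail to generalize."
ids = byte_encode(text)

# Truncating to 32 ids keeps only the first 32 bytes of the text.
kept = bytes(ids[:32]).decode("utf-8", errors="ignore")
print(len(ids), repr(kept))
```

Under the byte encoding, 32 ids cover at most 32 characters, whereas 32 subword ids can span several times as much text, which is the intuition behind the hypothesis above.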
To evaluate this claim, we fit additional SentencePiece subword models with vocabulary sizes $2^{12}$ and $2^{14}$ on text content from MUD. Table 7 provides ranking performance of variations of the proposed model trained with only the text feature on samples of fixed size 16 and evaluated on samples of fixed size 16. We remark that increases in vocabulary size do correspond with increases in ranking performance. We found that increasing the vocabulary size beyond $2^{16}$ did not increase performance further.
Table 7: Ranking performance of models using various subword vocabulary sizes. All models use SentencePiece (SP) or the byte encoding (Byte) described in Appendix C.

D Further statistics about MUD
Mean number of months a user was active: 9.9
Percentage of posts containing more than 64 tokens: 17.37%

E Further experiments
We repeat the linking experiment from §4.3 using queries that are observed at training time. We expect this to improve performance, something which is confirmed in Table 9.

Table 9: Same experiment as Table 2, but with queries observed at training time (columns: Model, Training Length, EER, minDCF).