Compositional Demographic Word Embeddings

Word embeddings are usually derived from corpora containing text from many individuals, thus leading to general purpose representations rather than individually personalized representations. While personalized embeddings can be useful to improve language model performance and other language processing tasks, they can only be computed for people with a large amount of longitudinal data, which is not the case for new users. We propose a new form of personalized word embeddings that use demographic-specific word representations derived compositionally from full or partial demographic information for a user (i.e., gender, age, location, religion). We show that the resulting demographic-aware word representations outperform generic word representations on two tasks for English: language modeling and word associations. We further explore the trade-off between the number of available attributes and their relative effectiveness and discuss the ethical implications of using them.


Introduction
Word embeddings are used in many natural language processing tasks as a way of representing language. Embeddings can be efficiently trained on large corpora using methods like word2vec or GloVe (Mikolov et al., 2013; Pennington et al., 2014), which learn one vector per word. These embeddings capture syntactic and semantic properties of the language of all individuals who contributed to the corpus. However, they are unable to account for user-specific word preferences (e.g., using the same word in different ways across different contexts), particularly for individuals whose usage deviates from the majority. These individual preferences are reflected in a word's nearest neighbors. As an example, Table 1 shows the way two users use the word "health" and the word's five nearest neighbors in their respective personalized embedding spaces. The word is used in similar contexts, where contextual embeddings may give similar representations, but it has different salient meanings in the personal space of each user. User A tends to talk more about preventative care and insurance, while user B tends to talk about people's experiences affecting their mental health. The typical approach in natural language processing (NLP) is to use one-size-fits-all language representations, which do not account for variation between people. This may not matter for people whose language style is well represented in the data, but could lead to worse support for others (Pavalanathan and Eisenstein, 2015; May et al., 2019; Kurita et al., 2019). While the way we produce language is not a direct consequence of our demographics or any other grouping, it is possible that by tailoring word embeddings to a group we can more effectively model and support the way its members use language.
Additionally, personalized embeddings can be useful for applications such as predictive typing systems that auto-complete sentences by providing suggestions to users, or dialog systems that follow the style of certain individuals or professionals (e.g., counselors, advisors). They can also be used to match the communication style of a user, which would signal cooperation from a dialog agent.
In this paper, we propose compositional demographic word embeddings as a way of building personalized word embeddings by leveraging data from users sharing the same demographic attributes (e.g., age: young, location: Europe). Our proposed method has the benefits of personalized word representations, while at the same time being applicable to users with limited or no data.
To implement and evaluate our proposed method, we build a large corpus of Reddit posts from 61,981 users for whom we extract self-reported values of up to four demographic properties: age, location, gender, and religion. We examine differences in word usage and association captured by the demographics we extracted and discuss the limitations and ethical considerations of using or drawing conclusions from this method. We explore the value of compositional demographic word embeddings on two English NLP tasks: language modeling and word associations. In both cases, we show that our proposed embeddings improve performance over generic word representations.

Related Work
Embedding Bias. Recent work on embeddings has revealed and attempted to remove racial, gender, religious, and other biases (Manzini et al., 2019; Bolukbasi et al., 2016). The bias in our corpora and embeddings has a societal impact and risks exclusion and demographic misrepresentation (Hovy and Spruit, 2016). This means that users of certain regions, ages, or genders may find NLP technologies more difficult to use. For instance, when using standard corpora for POS tagging, Hovy and Søgaard (2015) found that models perform significantly worse on younger people and ethnic minorities. Similarly, results on text-based geotagging show best results for men over 40 (Pavalanathan and Eisenstein, 2015).
Similar results are starting to be found in embeddings produced by contextual embedding methods (May et al., 2019;Kurita et al., 2019). We focus on non-contextual embedding methods because of their computational efficiency, which is crucial if many separate representations are being learned. Additionally, there may not be a large amount of available data for underrepresented groups and these contextualized models require billions of tokens for training. Recent work has also shown that static embeddings are competitive with contextualized ones in some settings (Arora et al., 2020).
Personalization. The closest work is Garimella et al. (2017)'s exploration of demographic-specific word embedding spaces. They trained word embeddings for male and female speakers who live in the USA and India using skip-gram architectures that learn a separate word matrix for each demographic group (e.g., male speakers from the USA).
Another line of work used discrete (Hovy, 2015) or continuous values (Lynn et al., 2017) to learn speaker embeddings: a single vector for each user. The speaker embedding is appended to the input of the recurrent or output layer, and trained simultaneously with the rest of the model. This idea applies to any contextual information type and was introduced as a way to condition language models on topics learned by topic modeling (Mikolov and Zweig, 2012). It has since been used as a way of representing users in tasks such as task-oriented and open-domain dialog (Wen et al., 2013; Li et al., 2016), information retrieval based on book preferences (Amer et al., 2016), query auto-completion (Jaech and Ostendorf, 2018), authorship attribution (Ebrahimi and Dou, 2016), sarcasm detection (Kolchinski and Potts, 2018), sentiment analysis (Zeng et al., 2017), and cold-start language modeling (Huang et al., 2016). Finally, a recent study by King and Cook (2020) compared how to improve a language model with user-specific data using priming and interpolation, depending on the amount of data available, learning a new model for each user.
More generally, personalization has been extensively applied to marketing, webpage layout, product and news recommendation, query completion, and dialog (Eirinaki and Vazirgiannis, 2003; Das et al., 2007). Welch et al. (2019a,b) explored predicting response time, common messages, and speaker relationships from personal conversation data. Zhang et al. (2018) conditioned dialog systems on artificially constructed personas, and Madotto et al. (2019) used meta-learning to improve this process. Goal-oriented dialog has used demographics (i.e., age, gender) to condition system response generation, showing that this relatively coarse-grained personalization improves system performance (Joshi et al., 2017).
One particularly relevant study by Gjurković and Šnajder (2018) presented a corpus of Reddit users with personality information, as well as some demographics for a subset of users. Unlike our approach, which is based on text content, they extract information from Reddit flairs, a type of user tag. Out of their set of 10,295 users, 2,253 are also in our set of users with one or more demographic labels (22% of theirs, 0.5% of ours), confirming the speculation in their paper that extracting demographics from text is a complementary approach that captures more information about the users in their data. Other work has used Reddit posts to identify users who were diagnosed with depression (Yates et al., 2017) and to construct personas for personalized dialog agents (Mazaré et al., 2018).
Language Models. To evaluate embeddings, we consider language modeling, a task that has long been used for speech recognition and translation, and more recently been widely used for model pretraining. A range of models have been developed, with progressively larger models trained on more data (e.g., Dai et al., 2019). Variations of the LSTM have consistently achieved state-of-the-art performance without massive compute resources (Merity et al., 2018a;Melis et al., 2019;Merity, 2019;Li et al., 2020). We use the AWD-LSTM (Merity et al., 2018b) in our experiments, as it achieves very strong performance, has a well-studied codebase, and can be trained on a single GPU in a day.

Dataset
Our first contribution is a new dataset. We use English Reddit comments as they are publicly available, are written by many users, and span multiple years. 1 We extract demographic properties of users from self-identification in their text.

Finding Demographic Information
Reddit users do not have profiles with personal information fields that we could scrape. Instead, we developed methods to extract demographic information from the content of user posts.
In order to determine what kind of information we can extract about users, we performed a preliminary analysis. We manually labeled a random sample of 132 statements that users made about themselves. We specifically searched for statements starting with phrases such as 'i am a' or 'i am an'. In our sample: 36% clearly stated the user's age, religion, gender, occupation, or location; 34% contained descriptive phrases that were difficult to categorize like 'i am a big guy' or 'i am a lazy person'; and 30% mentioned attributes such as sexual orientation, dietary restrictions, political affiliations, or hobbies that were rare overall.
Based on our analysis, we decided to focus on age, religion, gender, occupation, and location as the main attributes. 2 These were extracted as follows:

Age. We extracted the user's age using a regular expression. 3 During this process, we found users that were matched to different ages due to the corpus covering user activity across several years. In those cases, we removed users whose age difference was greater than the time span of our corpus. Additionally, we excluded users who said they were less than 13 years of age, as this violates the Reddit terms of service. We decided to split age into two groups, young and old, at a threshold of 30, as this split was used in previous work (Rao et al., 2010), and it gave a reasonable split for our data and for the data we used for testing word associations (Garimella et al., 2017).

Gender. Gender was extracted by searching for statements referring to oneself as a 'boy', 'man', 'male', or 'guy' for male, or a 'girl', 'woman', 'female', or 'gal' for female. Manual inspection revealed that some users indicated they were of both genders. In those cases, if one gender occurred less than one fifth of the time we took the majority reported gender; otherwise we removed the user from our dataset. We acknowledge that this approach excludes transgender, gender fluid, and a range of non-binary people, and may misgender people as well (see § 7 for further discussion of these issues).
Location. To obtain location information, we searched for phrases such as 'i am from' and 'i live in.' Whenever the next token is (1) tagged as a location by a named entity recognizer, (2) a noun, or (3) the word 'the', we select all subsequent tokens in the phrase as the user location. Manual inspection of matches showed that Reddit users are not consistent in the granularity of reported location: statements included a city, state, province, country, continent, or geographical region. Based on the number of users per country, we decided to merge some countries into region labels while leaving others separate, resulting in the following set of regions: USA, Asia, Oceania, UK, Europe, Canada. We further matched location statements to lexicons to resolve the location to one of these regions, removing common relative location words. 4 For the larger-population regions of Canada and the USA, we match statements using state abbreviations, province names, and highest-population cities, and in the USA we also match the capital cities. For other regions we only match the highest-population cities, as there were too many cases to cover.

Religion. To extract religion, we searched for the five largest global religious populations, 5 counting 'secular', 'atheist', and 'agnostic' as one non-religious group. We used a regular expression 6 and filtered out users who stated beliefs in more than one of these five groups.
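A minimal sketch of the pattern-based self-report extraction described above. The paper's actual regular expressions (referenced in its footnotes) are not reproduced here, so the patterns, function names, and term lists below are illustrative assumptions rather than the released implementation.

```python
import re

# Hypothetical age pattern; the real extraction rule may differ.
AGE_PATTERN = re.compile(r"\bi am (\d{1,2}) years? old\b")
GENDER_TERMS = {
    "male": {"boy", "man", "male", "guy"},
    "female": {"girl", "woman", "female", "gal"},
}

def extract_age(post):
    """Return a self-reported age if one is stated, else None."""
    match = AGE_PATTERN.search(post.lower())
    return int(match.group(1)) if match else None

def extract_gender(post):
    """Map 'i am a <term>' statements onto a binary gender label, else None."""
    tokens = set(re.findall(r"\bi am an? (\w+)\b", post.lower()))
    for label, terms in GENDER_TERMS.items():
        if tokens & terms:
            return label
    return None

print(extract_age("I am 25 years old and love hiking."))  # 25
print(extract_gender("i am a guy from nowhere"))          # male
```

In practice such rules are tuned for precision over recall, as the paper notes: users whose statements do not match are left 'unknown' rather than guessed.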

Post-processing
The resulting dataset was further filtered to remove known bots. 7 For the demographic data we consider two subsets. First, the set of users for which all four attributes are known (4Dem). With this set we perform ablation experiments on the number of known attributes in a controlled manner. However, it is important to note that this set may not be representative of most users on Reddit, as it focuses on users willing to divulge a range of demographic attributes. Our second sample addresses this by including users for whom we identify two or more of the demographic attributes (2+Dem). Statistics for these sets are described in Table 2, along with the training, development, and test splits used for the language modeling experiments.
The distribution of demographic values for each of these sets is shown in Figure 1. Looking at the set of all users in our data who have at least two known demographic attributes (2+Dem), we find that 83% of the time location is unknown. Age and religion are the next most frequently missing, at 53% and 34% respectively. Gender is more likely to be known than the other attributes: only 10% of users in this subset have an unknown gender. In a manual evaluation of our extracted attribute labels for a sample of 100 users (described in the appendix), we found accuracies of 94% for location and gender, 96% for age, and 78% for religion.

Generating Compositional Demographic Word Embeddings
We propose two methods for learning compositional demographic embeddings. The first learns a generic embedding for each word and a vector representation of each demographic attribute (including 'unknown'). This is memory efficient, as we need only 19 vectors to cover all of our attributes.
In the second method, for each word we learn (a) a generic embedding and (b) a vector for each demographic value. This is more expressive, but requires 20 vectors for each word.

Demographic Attribute Vectors
In this approach we jointly learn a matrix for words and a separate vector for each demographic value. The word matrix W ∈ R |V|×k has a row for each word in the vocabulary, giving a k-dimensional vector for each word. The demographic values are represented by another matrix D ∈ R |C|×k, where C is the set of all demographic values (e.g., male, female, christian, USA). The hidden layer is computed as h = W^T w + D^T g + D^T l + D^T r + D^T a, where w represents the one-hot encoding of an input word and g, l, r, a represent one-hot encodings of the speaker's gender, location, religion, and age values. This is a modified skip-gram architecture (Mikolov et al., 2013) with a hierarchical softmax, which sums five terms so that back-propagation updates the word representation as well as the demographic value vectors.
We use posts from all users to train embeddings for words that occur at least five times across all users. This yields a vocabulary of 503k words. We learn 100-dimensional embeddings with an initial learning rate of 0.025 and a window size of five.
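The summed hidden layer described above can be sketched as follows. With one-hot inputs, each matrix-vector product reduces to a row lookup, so the hidden state is simply the word vector plus the four demographic value vectors. Sizes here are illustrative (the real vocabulary is ~503k words); the random initialization and index values are assumptions for demonstration only.

```python
import numpy as np

V, C, k = 1000, 19, 100  # reduced sizes for illustration (real V is ~503k)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, k))  # one row per vocabulary word
D = rng.normal(scale=0.1, size=(C, k))  # one row per demographic value

def hidden_layer(word_id, gender_id, location_id, religion_id, age_id):
    # One-hot products become row lookups: h = W^T w + D^T g + D^T l + D^T r + D^T a
    return (W[word_id] + D[gender_id] + D[location_id]
            + D[religion_id] + D[age_id])

h = hidden_layer(word_id=42, gender_id=0, location_id=5,
                 religion_id=11, age_id=17)
assert h.shape == (k,)
```

Because all five terms feed a single summed hidden state, gradients from the skip-gram objective flow into both the word row and the four demographic rows on every update.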

Demographic Word Matrices
When learning demographic matrices we separately run our skip-gram model for each of the demographic attributes (e.g., gender) and learn a generic word matrix W_G ∈ R |V|×k and a value-specific word matrix W_v ∈ R |V|×k for each value v of the given attribute A (e.g., male, female), ∀v ∈ A. This changes the hidden layer calculation to h = W_h W_G w + W_h W_v w, with hidden layer weights W_h. The model then learns a generic word representation in the matrix W_G, while learning the value-specific impact on the meaning of each word.
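A sketch of this per-attribute variant, run here for the gender attribute: the hidden state combines a generic lookup and a value-specific lookup through shared hidden weights. Matrix sizes, the random initialization, and the two-value dictionary are illustrative assumptions, not the paper's released code.

```python
import numpy as np

V, k = 1000, 100  # illustrative sizes
rng = np.random.default_rng(1)
W_G = rng.normal(scale=0.1, size=(V, k))  # generic word matrix
W_v = {v: rng.normal(scale=0.1, size=(V, k)) for v in ("male", "female")}
W_h = rng.normal(scale=0.1, size=(k, k))  # shared hidden layer weights

def hidden_layer(word_id, value):
    # h = W_h W_G w + W_h W_v w; the one-hot word w makes each product a row lookup
    return W_h @ W_G[word_id] + W_h @ W_v[value][word_id]

h = hidden_layer(7, "female")
assert h.shape == (k,)
```

Training one such model per attribute is what makes the representations compositional: at test time, the value-specific matrices for a user's known attributes can be combined with the generic matrix.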

Differences Across Demographic Embeddings.
In order to understand what our embeddings capture, we examine words that have different representations across demographics. We can look at the nearest neighbors of a given query word across the embedding spaces for different demographics. We perform this analysis on both the demographic matrices and vectors, and find less variation in the neighbors when using demographic vectors, which makes them less interesting for this analysis. Table 3 shows examples of words with low overlap in nearest neighbors for demographic matrices, illustrating the differences in word meaning across groups.

Language Modeling
We first examine the usefulness of our embeddings by showing that they can help us better model a user's language. We consider two experiments. First, we focus on compositional demographic embeddings and sample 50k posts from our corpus for training the language model and 5k each for validation and test. Next, we compare with a user-specific model on a sample of our data with text from just 100 users who each have a large amount of data available in our corpus, with an average of 3.2 million tokens per user.
In both experiments, we use the language model developed by Merity et al. (2018b,a). As discussed in § 2, this model was recently state-of-the-art and has been the basis of many variations. We modify it to initialize the word embeddings with the ones we provide and to concatenate multiple embedding vectors as input to the recurrent layers. The rest of the architecture is unaltered. We tried adding rather than concatenating and found no improvement. We chose to concatenate the inputs with the intuition that the network would learn how to combine the information itself.
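The concatenation step described above can be sketched as follows: each token's generic embedding is joined with one embedding per demographic attribute, and the widened vector is what the unmodified recurrent layers receive. The dimensions and random tensors are illustrative assumptions.

```python
import numpy as np

k, seq_len = 100, 6
rng = np.random.default_rng(2)
generic = rng.normal(size=(seq_len, k))  # fixed, pretrained generic embeddings
# one demographic-specific embedding sequence per attribute (age, gender,
# location, religion)
demographic = [rng.normal(size=(seq_len, k)) for _ in range(4)]

# Concatenate along the feature axis; the recurrent layers see a
# 5k-dimensional input per token and learn how to combine the information.
inputs = np.concatenate([generic] + demographic, axis=-1)
assert inputs.shape == (seq_len, 5 * k)
```

Concatenation (rather than addition) leaves the combination to the network, matching the intuition stated above; the recurrent layer's input size simply grows with the number of concatenated embeddings.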
We explored various hyperparameter configurations on our validation set and found the best results using dropout with the same mask for generic and demographic-specific embeddings, untied weights, and fixed input embeddings. Untying and fixing input embeddings is supported by concurrent work (Welch et al., 2020b). Each model is trained for 50 epochs. We use the version from the epoch that had the best validation set perplexity, a standard metric in language modeling that measures how well the model's predicted probability distribution matches the data.

Table 4 shows results for our demographic personalization methods, which are designed to handle new users for whom we have demographics but not much text data. The first method, demographic vectors, performs no better than generic embeddings. This is surprising, since prior work has achieved success on a range of tasks with this kind of representation (see § 2). We suspect that for language modeling the variations are too fine-grained to be captured by a single vector. However, demographic matrices do improve significantly over generic embeddings. A model with all demographics improves the most, but we also see improvements when only one demographic value is known.

Demographic Perplexity Evaluation
The LSTM hidden layer size is the same across models, but the change in the input size affects the total number of parameters. To control for this, we ran our baseline model and the model initialized with generic embeddings with a larger input size, matching the number of parameters in our best models. As shown in Table 4, this increase in parameters does not improve performance.

Table 5: Perplexity for language models with no demographics (0D) or with all four demographic matrices (4D), with results broken down by demographic values.

Table 4 shows results when using no demographics (top 4 rows), one demographic at a time (rows 6-9), and all four demographics (row 10). Each attribute improves perplexity, with age and gender improving it more than location and religion. Additionally, we break down the performance of our demographic matrices language model for each of the demographic groups; these results are shown in Table 5. We do see worse performance on some minorities as compared to other groups for the same model, although that is not always the case (gender, for instance, shows better perplexities for female than for male, and the Muslim group shows lower perplexity than the Christian group, which has substantially more data). When we use the demographic word embeddings in our model, we are able to improve performance for all demographic groups, including minorities.

Ablation Experiments
We also find that performance on the 'unknown' group improves in all cases, with our largest improvement on 'unknown' religion. The 'unknown' value explicitly models people in our dataset who have either (1) stated this demographic information with a value that we model but not in a way that our regular expressions identify, (2) stated this demographic information with a value that we do not model, or (3) not stated this demographic information. In the second case, it is apparently useful to know which demographic groups the speaker does not belong to. In the third case, it may be that not sharing this particular piece of information (while sharing other personal information) says something about what the speaker will tend to say.

Comparison with User Representations
For users with a lot of data, it is possible to train a user-specific model, with embeddings that capture their unique language use. We would expect this to be better than our demographic embeddings, but it is only feasible for users with a lot of data. This experiment compares our demographic approach with a user-specific approach.
We create a model for each user using the sample that has a large amount of data for 100 users (3.2 million tokens each on average) as done in concurrent work (Welch et al., 2020a). We tried two approaches, user vectors and user matrices, which are analogous to our demographic vectors and matrices. The difference is that rather than having a separate vector / matrix for each demographic we have a separate vector / matrix for each user. Our split sizes for language model experiments are the same as the demographic experiments.
Results. Table 6 shows results for generic embeddings, user vectors, user matrices, and demographic matrices. We find that user vectors, as used widely in previous work (Kolchinski and Potts, 2018; Li et al., 2016), do not improve performance. Both our demographic and user matrices improve over generic embeddings, with comparable performance between the two. While we chose 100 users with a lot of data, each had less data than the amount used to train each demographic-specific model. The relationship between the amount of data, its similarity to a user's writing, and the effect on performance is an interesting open question.

Table 6: Comparing our demographic-based approach with two user-specific approaches. Perplexities are generally lower than in previous tables because the threshold for rare words being mapped to UNK was higher.

Demographic Word Associations
As a second evaluation, we consider word associations, a core task in NLP that probes the relatedness or similarity between words. Data is collected for the task by presenting a stimulus word (e.g., cat) and asking people what other words come to mind (e.g., dog or mouse). Earlier systems relied on resources such as WordNet to solve the task, but most recent work has used word embeddings.
Data. For our evaluation, we use data from Garimella et al. (2017). They constructed a word association dataset and experimented with learning separate word embedding matrices for different demographic groups. To collect the data, they (1) asked crowd workers to write one word associated with a single word prompt and (2) asked the workers their gender, age, location, occupation, ethnicity, education, and income. Only gender and location information was released, but the authors provided age information upon request.
Evaluation. As in prior work, we consider evaluation metrics defined in terms of: f w , the number of people who listed word w for a stimulus; f max , the highest f w across all words chosen for a stimulus; and t, the number of participants given a stimulus. best is f w divided by f max , where w is the word in the embedding space closest to the stimulus word; ooN (out-of-N) is f w /t for the N words in the embedding space closest to the stimulus word; both are averaged over all stimulus words. We consider two experiments. One directly matches Garimella et al. (2017), testing each demographic group separately. Since our interest is in compositionality, we also introduce a setting where the data is split into eight disjoint sets, one for each combination of the three attributes.
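The metrics above can be sketched in a few lines. The reading of ooN below, taking the highest f_w among the N nearest neighbors, is one reasonable interpretation of the definition; the toy counts and neighbor list are invented for illustration.

```python
def best_score(counts, neighbors):
    # counts: word -> f_w for one stimulus; neighbors: nearest first.
    # best = f_w / f_max for the single closest word w.
    f_max = max(counts.values())
    return counts.get(neighbors[0], 0) / f_max

def oo_n(counts, neighbors, n, t):
    # out-of-N: highest f_w among the N nearest neighbors, over t participants.
    return max(counts.get(w, 0) for w in neighbors[:n]) / t

counts = {"dog": 12, "mouse": 5, "kitten": 3}  # toy responses for "cat"
neighbors = ["kitten", "dog", "feline", "mouse"]
print(best_score(counts, neighbors))     # 3/12 = 0.25
print(oo_n(counts, neighbors, 3, t=20))  # 12/20 = 0.6
```

Both scores are then averaged over all stimulus words, as described above, so an embedding space scores well when its nearest neighbors coincide with frequently listed associations.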
Models. Garimella et al. (2017) proposed two methods, which we merge by taking the best result from either one. We considered only our demographic matrix embeddings, as they performed best on language modeling. For the experiment with separate demographics, we use the appropriate embeddings. For the experiment with combinations of demographics, we concatenate the embeddings.

Table 7: Comparison of demographic-aware word association similarities for our embeddings using (G)eneric or (G)eneric+(D)emographic, and the best results of the two variants of the composite skip-gram model (C-SGM) from Garimella et al. (2017). We show improved results for (US), (IN)dia, (M)ale, and (F)emale, and provide new results using age for (Y)ounger than 30 and (O)lder.

Table 8: Results on the 8 disjoint word association subsets for each combination of attributes. Similarities concatenate three embeddings that are each either generic, or specific to that demographic attribute. Overall, using age and gender in combination gives the best performance, though using all three is better on oo3. † indicates statistically significant improvement (permutation test, p < 0.001) over the next best model on the marked metric.
We also compare to concatenation of generic embeddings learned for each attribute (this performs better than any individual generic embedding).
Results. Table 7 shows results on the single-demographic experiment. We achieve higher performance, but that may come from the change in training dataset: their models are trained on 67.6m tokens of blog data, while ours are trained on 1,400m tokens of Reddit data. We see a larger gain for the US than the IN evaluation; this may be because in our data location is unknown for many users and India is underrepresented (so much so that we aggregate it into all of Asia). Table 8 shows results in the multi-demographic setting. We include only the best pair (age and gender) due to space. We have seen in earlier experiments that location does not perform as well as the other attributes, and found the same trend here. Overall, composing demographic-based representations helps, with a combination of all three attributes consistently performing well on the oo3 metric, while having two helps on the best metric. Generic embeddings only score the highest on one subset: Male, India, Young.

Limitations and Ethical Considerations
This work uses demographic information to modify language representation. This type of work is encouraged by the numerous arguments outlined by Perez (2019), which demonstrate the need for demographic data disaggregation in order to make decisions and build technologies that are equitable for all. We view our work as an initial investigation of differences in language model performance across demographics and of how technology can be improved for the identified groups. Our results in Tables 4 and 5 show that using demographic information can enable the development of language tools that improve performance for all groups compared to simply training on all data.
Although we show that some aspects of language production are correlated with demographic information, we do not believe the way we speak is a direct or sole consequence of one's demographics, nor do we claim that demographics are the ideal information source for modeling language, or that our findings will necessarily hold for populations sampled significantly differently than in our study. It is also possible that using demographics in embedding construction could accentuate bias, although this remains to be studied. Those who use our method should account for this possibility.
Our study uses four demographic variables and only covers a subset of the potential values of each. For instance, we do not use the same granularity across locations, nor do we include all locations, religions, or gender identities, and we simplify age into ranges. The groups 'secular', 'agnostic', and 'atheist' are merged into one broader group. Our sample is further biased by the choice of platform, as each platform contains text from different populations. Users in our sample are predominantly young, male, atheist, and live in the United States.
When using gender as a study variable, we followed the recommendations of Larson (2017). Our "gender" extraction method does not refer to biological sex. After running gender extraction patterns, users are assigned to either the 'male', 'female', or 'unknown' label, meaning that on the basis of these phrases one's gender identity is assumed to be binary or to be a gender identity unknown to our model, which may include those who are transgender, non-binary, or those who do not wish to disclose their gender. However, we are aware that the use of regular expressions for the extraction of demographic attributes can lead to false positives and false negatives (error rates are provided in the supplemental material) and that there exists a bias in using these strategies, as populations that do not wish to be identified are less likely to explicitly make such statements. For transparency, our released code includes the scripts used to assign demographic labels.
Above we discussed concerns for incorrect demographic assignment when developing models. There are also potential negative consequences when using these models in a deployed system. Our embeddings can only be used when the demographics of a user are known. This may be acceptable if the user voluntarily self-reports their demographics with the understanding that they will alter the predictions they receive. However, if demographics are automatically inferred there is a risk of misattribution, which depending on the application may have negative consequences.
A separate consideration is the environmental impact of this approach. Compared to the standard method, our approach does involve training more models, but the cost of inference is likely only marginally higher. We believe the additional cost in training is worth the benefits to individual users.
Finally, we acknowledge that components of our method could potentially be used for user profiling (Rangel et al., 2013) and/or surveillance of target populations, thus exposing members of underrepresented groups to harms such as discrimination and coercion and threatening intellectual freedom (Richards, 2013). Similarly, the language models could be used to generate text in the style of a target population or at least to estimate the label distribution of a given text, which would help obfuscate the identity of the author (Potthast et al., 2018). This obfuscation could help hide an author's identity in order to avoid surveillance or could be used maliciously to infiltrate communities online. We advocate against the use of our methods for these or other ethically questionable applications.

Conclusions
We proposed a novel method of generating word representations by composing demographic-specific word vectors. Through experiments on two core language processing tasks, language modeling and word associations, we show that demographic-aware word representations outperform generic embeddings. We also find that demographic matrices perform much better than demographic vectors. Through several ablation analyses we show that word embeddings that leverage multiple demographic attributes give better performance than those using single attributes.
To support future work that can help model individuals and demographics, our code is available at http://lit.eecs.umich.edu. Our data is not available due to licensing restrictions but can be redownloaded and processed with our scripts. We hope this will support work on solutions for NLP applications and resources that can better serve minorities and underrepresented groups.
In order to verify the accuracy of our demographic attribute assignment, we manually annotated a sample of 100 users from the dataset. Our extraction of attributes with regular expressions and rules was designed for high precision; it is likely that more attributes marked 'unknown' by our extraction could be filled in upon manual inspection. To evaluate the retrieved attributes for these 100 users, we viewed the set of all posts that matched our extraction rules and attempted to annotate age, religion, gender, and location. The annotation instructions were to identify the values of these four attributes based on the annotator's interpretation of the text of the posts. Then, for cases where the extracted attribute is not 'unknown', we calculate the percentage of times the extracted and annotated values agree. We obtain 94% for location and gender, 78% for religion, and 96% for age. It should also be noted that, despite the annotators' best efforts, it is not possible to know the actual ground-truth values.
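The agreement computation amounts to the following sketch; the function name and data layout are illustrative:

```python
def attribute_agreement(extracted, annotated):
    """Percent agreement between rule-extracted and manually annotated
    attribute values, computed only where extraction is not 'unknown'."""
    compared = [(e, a) for e, a in zip(extracted, annotated) if e != "unknown"]
    if not compared:
        return 0.0
    matches = sum(1 for e, a in compared if e == a)
    return 100.0 * matches / len(compared)
```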

B Reproducibility Criteria
For each item in the list we have a section below with the relevant information.

B.1 Experimental Results
A clear description of the mathematical setting, algorithm, and/or model
The model we use is described in Merity et al. (2018b). We modify it to support weight freezing and initialization. In Section 2 of that paper, where the weight-dropped LSTM is described, we concatenate our vectors for user-specific and demographic representations to the input x_t.
The embeddings are obtained from the model described in Bamman et al. (2014) for the demographic and user matrices. To obtain demographic vectors, we treat C, from Section 2, as a matrix whose rows represent the demographic attributes of a speaker (e.g., male, female) independent of the word used. The model updates in the same way, changing a generic word vector and the relevant demographic attribute vectors when backpropagating.
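A minimal sketch of this concatenation, with illustrative module names and dimensions that do not necessarily match the released code:

```python
import torch
import torch.nn as nn

class DemographicInputLayer(nn.Module):
    """Concatenate a generic word embedding with one vector per
    demographic attribute (illustrative dimensions)."""
    def __init__(self, vocab_size, n_attr_values, word_dim=400, dem_dim=50):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # One row per demographic attribute value (e.g., male, female, ...).
        self.dem_emb = nn.Embedding(n_attr_values, dem_dim)

    def forward(self, word_ids, attr_ids):
        # word_ids: (batch, seq); attr_ids: (batch, n_attrs)
        w = self.word_emb(word_ids)                   # (batch, seq, word_dim)
        d = self.dem_emb(attr_ids).flatten(1)         # (batch, n_attrs * dem_dim)
        d = d.unsqueeze(1).expand(-1, w.size(1), -1)  # broadcast over time steps
        return torch.cat([w, d], dim=-1)              # input x_t to the LSTM
```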
A link to downloadable source code, with specification of all dependencies, including external libraries
• AWD-LSTM code is available from https://github.com/salesforce/awd-lstm-lm.

Description of computing infrastructure used
Each model is trained on one NVIDIA Tesla V100 GPU.
Average runtime for each approach
Our methods take between 260 and 1450 seconds per epoch, depending on the approach.

Number of parameters in each model
The model that uses all four demographic attributes has the most parameters, at 249,752,492. Our smallest model, used for the user representation comparison, has 48,066,614.
Corresponding validation performance for each reported test result
Validation perplexities are reported for the 2+Dem validation set.

Explanation of evaluation metrics used, with links to code
Perplexity is a standard metric for language models and is implemented in the code of Merity et al. (2018b). The word association metrics best, oo3, and oo10 are described in Garimella et al. (2017); we have reimplemented these metrics in order to compare to their results.
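Under one common reading of these metrics, a cue counts as correct when the gold association appears among the model's top-k candidates, with 'best' corresponding to k=1; see Garimella et al. (2017) for the exact definitions. A hypothetical sketch:

```python
def out_of_k_accuracy(ranked_predictions, gold, k):
    """Fraction of cues whose gold association appears in the model's
    top-k ranked candidates ('best' is k=1, oo3 is k=3, oo10 is k=10)."""
    hits = sum(1 for preds, g in zip(ranked_predictions, gold) if g in preds[:k])
    return hits / len(gold)
```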

B.2 Hyperparameter Search
In our initial experiments on the 2+Dem validation set, we chose the highest-performing hyperparameters from the following list. Where a single value is listed, we used this value as described in the code of Merity et al. (2018b). Parameters were manually tuned, and the setting with the best validation perplexity was used for all experiments. We also experimented with embedding dropout masks. We initially used separate masks for the generic and concatenated demographic-specific embeddings, but if one vector is masked and not the other, not all information about that word is masked. When we instead applied embedding dropout with the same mask to each concatenated vector, perplexity dropped by several points.
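A minimal sketch of a shared word-level dropout mask, assuming inverted dropout scaling; the function name and array layout are illustrative:

```python
import numpy as np

def shared_embedding_dropout(generic, demographic, p, rng):
    """Drop whole word rows with probability p, using the SAME mask for
    the generic and demographic-specific embedding tables, so a dropped
    word carries no information from either table.

    generic, demographic: (vocab, dim_g) and (vocab, dim_d) arrays.
    """
    keep = (rng.random(generic.shape[0]) >= p).astype(generic.dtype)
    scale = keep / (1.0 - p)  # inverted dropout scaling for kept rows
    return generic * scale[:, None], demographic * scale[:, None]
```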
The vocabulary size for demographic experiments was 502k, while the experiments for individual users had a vocabulary size of 177k words.

Relevant statistics such as number of examples
See Section 3 of the paper and Table 2 for details of the 2+Dem and 4Dem experiments, and Section 5.2 for details on the user representation comparison.

Details of train/validation/test splits See
Explanation of any data that were excluded, and all pre-processing steps See Section 3.2.
A link to a downloadable version of the data
The Reddit data from 2007-2015 is available from https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/.
Our subset of demographic labeled comments will not be available due to licensing restrictions, but can be reconstructed using our scripts and the source data linked to here. Our code can also be used to label more Reddit data from after this collection was posted.
For new data collected, a complete description of the data collection process, such as instructions to annotators and methods for quality control
See Section 3 for data collection details and the beginning of this supplemental material for details on the annotation process.