Demographer: Extremely Simple Name Demographics

The lack of demographic information available when conducting passive analysis of social media content can make it difficult to compare results to traditional survey results. We present DEMOGRAPHER, a tool that predicts gender from names, using name lists and a classifier with simple character-level features. By relying only on a name, our tool can make predictions even without extensive user-authored content. We compare DEMOGRAPHER to other available tools and discuss differences in performance. In particular, we show that DEMOGRAPHER performs well on Twitter data, making it useful for simple and rapid social media demographic inference.


Introduction
To study the attitudes and behaviors of a population, social science research often relies on surveys. Due to a variety of factors, including cost, speed, and coverage, many studies have turned to new sources of survey data over traditional methods like phone or in-person interviews. These include web-based data sources, such as internet surveys or panels, as well as passive analysis of social media content. The latter is particularly attractive since it does not require active recruitment or engagement of a survey population. Rather, it builds on data that can be collected from social media platforms.
Many major social media platforms, such as Twitter, lack the demographic and location characteristics available for traditional surveys. The lack of these data prevents comparisons to traditional survey results. There have been a number of attempts to automatically infer user attributes from available social media data, such as a collection of messages for a user. These efforts have led to author attribute, or demographic, inference (Mislove et al., 2011; Volkova et al., 2015b; Burger et al., 2011; Volkova et al., 2015a; Pennacchiotti and Popescu, 2011; Schwartz et al., 2013; Ciot et al., 2013; Alowibdi et al., 2013; Culotta et al., 2015) and geolocation tasks (Eisenstein et al., 2010; Han et al., 2014; Rout et al., 2013; Compton et al., 2014; Cha et al., 2015; Jurgens et al., 2015; Rahimi et al., 2016).
A limitation of these content analysis methods is their reliance on multiple messages for each user (or, in the case of social network based methods, data about multiple followers or friends for each user of interest). For example, we may wish to better understand the demographics of users who tweet a particular hashtag. While having tens or hundreds of messages for each user can improve prediction accuracy, collecting more data for every user of interest may be prohibitive either in terms of API access, or in terms of the time required. In this vein, several papers have dealt with the task of geolocation from a single tweet, relying on the user's profile location, time, tweet content and other factors to make a decision (Osborne et al., 2014; Dredze et al., 2016). This includes tools like Carmen and TwoFishes. For demographic prediction, several papers have explored using names to infer gender and ethnicity (Rao et al., 2011; Liu and Ruths, 2013; Chang et al., 2010), although there has not been an analysis of the efficacy of such tools using names alone on Twitter.
This paper surveys existing software tools for determining a user's gender based on their name. We compare these tools in terms of accuracy on annotated datasets and coverage of a random collection of tweets. Additionally, we introduce a new tool, DEMOGRAPHER, which makes predictions for gender based on names. Our goal is to provide a guide for researchers as to which software tools are most effective for this setting. We describe DEMOGRAPHER and then provide comparisons to other tools.

Demographer
DEMOGRAPHER is a Python tool for predicting the gender of a Twitter user based only on the name of the user as provided in the profile. It is designed to be a lightweight and fast tool that gives accurate predictions when possible, and withholds predictions otherwise. DEMOGRAPHER relies on two underlying methods: name lists that associate names with genders, and a classifier that uses features of a name to make predictions. These can also be combined to produce a single prediction given a name.
The tool is modular so that new methods can be added, and the existing methods can be retrained given new data sources.
Not every first name (given name) is strongly associated with a gender, but many common names can identify gender with high accuracy. DEMOGRAPHER captures this through the use of name lists, which assign each first name to a single gender, or provide statistics on the gender breakdown for a name. Additionally, name morphology can indicate the gender of new or uncommon names (for example, names containing the string "anna" are often associated with Female). We use these ideas to implement the following methods for name classification.
Name list: This predictor uses a given name list to build a mapping between name and gender. We assign scores for female and male based on what fraction of times that name was associated with females and males (respectively) in the name list. This model is limited by its data source; it makes no predictions for names not included in the name list. Other tools in our comparison also take this approach.
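The name list predictor described above can be sketched in a few lines. This is a minimal illustration, not the DEMOGRAPHER source: the function names, the `(name, gender, count)` record layout, and the toy records are all assumptions made for the example.

```python
from collections import defaultdict

def build_name_list(records):
    """Build a name -> {"F": count, "M": count} mapping.

    `records` is an iterable of (name, gender, count) tuples with gender
    in {"F", "M"}; this layout is assumed for illustration.
    """
    counts = defaultdict(lambda: {"F": 0, "M": 0})
    for name, gender, count in records:
        counts[name.lower()][gender] += count
    return dict(counts)

def name_list_scores(name_list, name):
    """Return (female_score, male_score) as fractions, or None if unlisted."""
    entry = name_list.get(name.lower())
    if entry is None:
        return None  # the name list makes no prediction for unseen names
    total = entry["F"] + entry["M"]
    if total == 0:
        return None
    return entry["F"] / total, entry["M"] / total

# Toy data (hypothetical counts, not taken from any real name list).
toy_records = [("Anna", "F", 98), ("Anna", "M", 2), ("John", "M", 50)]
toy_list = build_name_list(toy_records)
anna_scores = name_list_scores(toy_list, "anna")
unseen_scores = name_list_scores(toy_list, "Zyx")
```

Returning `None` for an unlisted name mirrors the coverage limitation noted above: the predictor abstains rather than guessing.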
Classifier: We extract features based on prefix and suffix of the name (up to character 4-grams, and including whether the first and final letters are vowels) and the entire name. We train a linear SVM with L2 regularization. For training, we assume names are associated with their most frequent gender. This model increases coverage relative to the name list, with a modest reduction in accuracy. When combined with a threshold (below which the model would make no prediction), this model has high precision but low recall.
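The character-level features described above might be extracted as follows. This is a sketch under stated assumptions: the feature string format, the function name `name_features`, and the choice of `aeiou` as the vowel set are illustrative and not taken from the DEMOGRAPHER implementation.

```python
VOWELS = set("aeiou")  # assumed vowel set; the source does not specify one

def name_features(name, max_n=4):
    """Return a set of binary feature strings for a single name.

    Features: prefixes and suffixes of length 1..max_n, whether the first
    and final letters are vowels, and the entire name.
    """
    name = name.lower()
    feats = {"name=" + name}
    for n in range(1, min(max_n, len(name)) + 1):
        feats.add("prefix=" + name[:n])
        feats.add("suffix=" + name[-n:])
    if name:
        feats.add("first_is_vowel=" + str(name[0] in VOWELS))
        feats.add("last_is_vowel=" + str(name[-1] in VOWELS))
    return feats

anna_feats = name_features("Anna")
```

Feature sets of this kind can be one-hot encoded and passed to any linear SVM implementation with L2 regularization; the source does not say which library was used.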

Other Tools
For comparison, we evaluate four publicly available gender prediction tools. More detailed descriptions can be found at their respective webpages.
Gender.c: We implement and test a Python version of the gender prediction tool described in Michael (2007), which uses a name list with both gender and country information. The original software is written in C and the name list contains 32,254 names and name popularity by country.
Gender Guesser: Pérez (2016) uses the same data set as Gender.c, and performs quite similarly (in terms of accuracy and coverage).
Gender Detector: Vanetta (2016) draws on US Social Security Administration data (which we also use for training DEMOGRAPHER), as well as data from other global sources, as collected by Open Gender Tracking's Global Name Data project.
Genderize IO: Strømgren (2016) resolves first names to gender based on information from user profiles from several social networks. The tool is accessed via a web API, and results include gender, probability, and confidence expressed as a count. According to the website, when we ran our experiments the tool included 216,286 distinct names from 79 countries and 89 languages. It provides limited free access and larger query volumes for a fee.
Localization: Several tools include the option to provide a locale for a name to improve accuracy. For example, Jean is typically male in French and female in English. We excluded localization since locale is not universally available for all users. We leave it to future work to explore its impact on accuracy.

Training Data
Both the classifier and the name list in DEMOGRAPHER are built from US Social Security data (Social Security Administration, 2016), which contains 68,357 unique names. The data is divided by year, with counts of the number of male and female children given each name in each year. Since it only includes names of Americans with Social Security records, it may not generalize internationally.
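Because the data is divided by year, the per-year counts have to be summed before a name list or training labels can be derived. The sketch below assumes each yearly file holds comma-separated `name,sex,count` rows; the function names and the toy rows are hypothetical, and ties between genders are broken arbitrarily in favor of "F".

```python
import csv
import io
from collections import Counter

def aggregate_years(year_files):
    """Sum counts per (name, sex) over several yearly files.

    Each element of `year_files` is a text file-like object whose rows are
    assumed to look like: name,sex,count
    """
    totals = Counter()
    for handle in year_files:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            name, sex, count = row
            totals[(name.lower(), sex)] += int(count)
    return totals

def majority_labels(totals):
    """Assign each name its most frequent gender (ties go to "F", arbitrarily)."""
    names = {name for name, _ in totals}
    return {n: ("F" if totals[(n, "F")] >= totals[(n, "M")] else "M")
            for n in names}

# Toy yearly files with made-up counts.
year_a = io.StringIO("Mary,F,7065\nJohn,M,9655\nLeslie,F,10\n")
year_b = io.StringIO("Mary,F,6919\nLeslie,M,30\n")
toy_totals = aggregate_years([year_a, year_b])
toy_labels = majority_labels(toy_totals)
```

The aggregated totals serve the name list (as gender fractions per name), while the majority labels correspond to the training assumption that each name belongs to its most frequent gender.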

Evaluation Data
Wikidata: We extracted 2,279,678 names with associated gender from Wikidata. We use 100,000 for development, 100,000 for test, and reserve the rest for training in future work. While data for other genders is available on Wikidata, we selected only names that were associated with either Male or Female. This matches the labels available in the SSA data used for training, as well as the other gender prediction tools we compare against. This dataset is skewed heavily male (more than 4 names labeled male for every female), so we also report results on a balanced (subsampled) version.
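A balanced version of a skewed dataset can be produced by downsampling the majority label to the size of the minority label. The following is one simple way to do this; the source does not describe the exact subsampling procedure, so the function name, the fixed seed, and the toy data are assumptions.

```python
import random

def balanced_subsample(examples, seed=0):
    """Downsample every label group to the size of the smallest group.

    `examples` is a list of (name, label) pairs. A fixed seed keeps the
    subsample reproducible.
    """
    by_label = {}
    for name, label in examples:
        by_label.setdefault(label, []).append((name, label))
    if not by_label:
        return []
    smallest = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced

# Toy skewed data: four male-labeled names for every female-labeled name.
toy_examples = [("m%d" % i, "M") for i in range(8)] + [("f0", "F"), ("f1", "F")]
toy_balanced = balanced_subsample(toy_examples)
```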
Annotated Twitter: These names are drawn from the "name" field from a subset of 58,046 still publicly tweeting users from the Burger et al. (2011) dataset (user IDs released with Volkova et al. (2013)). Of these, 30,364 are labeled Female and 27,682 are labeled Male. The gender labels are obtained by following links to Twitter users' blogger profile information (containing structured gender self-identification information).
Unannotated Twitter: Since the annotated Twitter data contains predominantly English speakers (and its users may not be representative of the general Twitter population, who do not link to external websites), we also evaluate model coverage over a sample of Twitter data: the 1% feed from July 2016, containing 655,963 tweets and 526,256 unique names.

Processing
All data is lowercased for consistency. For the Twitter data, we use a regular expression to extract the first string of one or more (Latin) alphabetic characters from the name field, if one exists. This may or may not be the user's actual given name (or even a given name at all). Note that most of the tools do not handle non-Latin scripts, which limits their usefulness in international settings.
Table 1 reports results for Wikidata in terms of accuracy (percent of correctly predicted names, counting only cases where the tool made a prediction), coverage (the percent of the full test set for which the tool made a prediction), F1 (the harmonic mean of accuracy and coverage), and the number of names labeled per second. The corresponding result for the balanced version of the dataset is in parentheses.
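The name extraction step and the three quality metrics defined above can be sketched together. The exact regular expression is not given in the source; the pattern below matches ASCII letters only, which is an assumption, and the function names are illustrative.

```python
import re

# Assumed pattern: the first run of ASCII letters. Accented Latin letters
# (e.g. "é") are not matched by this simplification.
FIRST_LATIN_TOKEN = re.compile(r"[A-Za-z]+")

def extract_first_name(name_field):
    """Return the first alphabetic token, lowercased, or None if absent."""
    match = FIRST_LATIN_TOKEN.search(name_field)
    return match.group(0).lower() if match else None

def evaluate(gold, predicted):
    """Compute (accuracy, coverage, F1).

    `predicted` holds a label or None (no prediction) per item. Accuracy is
    measured only over items with a prediction; coverage is the fraction of
    items with a prediction; F1 is their harmonic mean.
    """
    answered = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    coverage = len(answered) / len(gold) if gold else 0.0
    accuracy = (sum(g == p for g, p in answered) / len(answered)
                if answered else 0.0)
    denom = accuracy + coverage
    f1 = 2 * accuracy * coverage / denom if denom else 0.0
    return accuracy, coverage, f1

toy_metrics = evaluate(["F", "M", "F", "M"], ["F", "M", None, "F"])
```

In the toy call, three of four items receive a prediction and two of those three are correct, so coverage is 3/4, accuracy is 2/3, and F1 is 12/17.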

Results
Tools make different tradeoffs between accuracy, coverage, and speed. Both Gender.c and Gender Guesser have high accuracy and fairly high coverage at high speed (with Gender.c being the fastest of the tools evaluated). Gender Detector has slightly higher accuracy, but this comes at the cost of both coverage and speed (it is second slowest). Genderize IO has the best F1 among all name list based approaches, but stands out for lower accuracy and higher coverage. We show five settings of DEMOGRAPHER: name list only (fast, accurate, but with only fairly high coverage), classifier (slow, and either high coverage with no threshold or high accuracy with a high threshold), and the combined versions (which fall in between the name list and classifier in terms of speed, accuracy, and coverage). The combined DEMOGRAPHER with no threshold performs best out of all tools in terms of F1.
Table 2 shows results on Twitter data. The Coverage column shows the percentage of the unlabeled Twitter data for which each tool was able to make a prediction. These numbers are quite a bit lower than for Wikidata and the labeled Twitter set (the names in the labeled sample contain less non-Latin alphabet text than those in the unlabeled sample). This may be due to there being many non-names in the Twitter name field, or the use of non-Latin alphabets, which many of the tools do not currently handle. DEMOGRAPHER provides the best coverage, as it can make predictions for previously unobserved names based on character-level features. For F1 we report results on gender-annotated Twitter. DEMOGRAPHER, in its combined setting, performs best, with Genderize IO also performing fairly well.
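The source does not spell out how the name list and classifier are combined. One plausible scheme, sketched here purely as an assumption, is to consult the name list first and fall back to the classifier only for unlisted names, abstaining when the classifier score falls below a threshold. All names in the sketch (`combined_predict`, `toy_lookup`, `toy_score`) are hypothetical.

```python
def combined_predict(name, name_list_lookup, classifier_score, threshold=0.0):
    """Predict "F", "M", or None (no prediction) for a name.

    name_list_lookup(name) returns "F", "M", or None for unlisted names.
    classifier_score(name) returns a signed score (positive means "F"),
    such as the decision value of a linear model.
    """
    label = name_list_lookup(name)
    if label is not None:
        return label  # the name list answers first
    score = classifier_score(name)
    if abs(score) < threshold:
        return None  # low-confidence classifier output: abstain
    return "F" if score > 0 else "M"

# Toy stand-ins for the two underlying methods.
toy_lookup = {"john": "M"}.get

def toy_score(name):
    # Hypothetical scorer: confident "F" for names ending in "a",
    # weakly "M" otherwise.
    return 1.5 if name.endswith("a") else -0.2

listed = combined_predict("john", toy_lookup, toy_score, threshold=1.0)
confident = combined_predict("zorana", toy_lookup, toy_score, threshold=1.0)
abstained = combined_predict("zork", toy_lookup, toy_score, threshold=1.0)
no_threshold = combined_predict("zork", toy_lookup, toy_score)
```

Raising the threshold trades coverage for accuracy: with a threshold of 1.0 the weakly scored name is left unlabeled, while with no threshold every name receives a label.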
We raise the following concerns, to be addressed in future work. The international nature of the Twitter data takes its toll on our models, as both the name list and classifier are based on US Social Security data. Clearly, more must be done to handle non-Latin scripts and to evaluate improvements based on language or localization (and appropriately localized training data). Our tool also makes the assumption that the user's given name appears first in the name field, that the name contains only characters from the Latin alphabet, and that the user's name (and their actual gender) can be classified as either Male or Female, all of which are known to be false assumptions and would need to be taken into consideration in situations where it is important to make a correct prediction (or no prediction) for an individual. We know that not all of the "name" fields actually contain names, but we do not know how the use of non-names in that field may be distributed across demographic groups. We did not evaluate whether thresholding had a uniform impact on prediction quality across demographic groups. Failing to produce accurate predictions (or any prediction at all) due to these factors could introduce bias into the sample and subsequent conclusions. One possible way to deal with some of these issues would be to incorporate predictions based on username, such as those described in Jaech and Ostendorf (2015).

Conclusions
We introduce DEMOGRAPHER, a tool that can produce high-accuracy and high-coverage results for gender inference from a given name. Our tool is comparable to or better than existing tools (particularly on Twitter data). Depending on the use case, users may prefer higher accuracy or higher coverage versions, which can be produced by changing thresholds for classification decisions.