Celebrity Profiling

Celebrities are among the most prolific users of social media, promoting their personas and rallying followers. This activity is closely tied to genuine writing samples, which makes them worthy research subjects in many respects, not least profiling. With this paper we introduce the Webis Celebrity Corpus 2019. For its construction the Twitter feeds of 71,706 verified accounts have been carefully linked with their respective Wikidata items, crawling both. After cleansing, the resulting profiles contain an average of 29,968 words per profile and up to 239 pieces of personal information. A cross-evaluation that checked the correct association of Twitter account and Wikidata item revealed an error rate of only 0.6%, rendering the profiles highly reliable. Our corpus comprises a wide cross-section of local and global celebrities, forming a unique combination of scale, profile comprehensiveness, and label reliability. We further establish the state of the art’s profiling performance by evaluating the winning approaches submitted to the PAN gender prediction tasks in a transfer learning experiment. They are only outperformed by our own deep learning approach, which we also use to exemplify celebrity occupation prediction for the first time.


Introduction
Author profiling is about predicting personal traits of individual authors based on their writing style. Frequently studied traits are demographics such as gender, age, native language or dialect, and even personality. Applications of author profiling include marketing, social science, risk assessment, and forensics. Given the high expectations that are implied by these and similar applications, the creation of a valid automatic profiler for a given trait, let alone many, depends on the availability of carefully constructed corpora. Corpus construction for author profiling has always been difficult for lack of large-scale distant supervision sources that provide for genuine pieces of writing from many different authors alongside personal information. In part, the aforementioned selection of demographics that are frequently studied reflects the availability of corresponding ground truth. In this regard, one source of ground truth, available in large quantities, high diversity of traits, and near-perfect label reliability, has been overlooked: celebrities.
The contributions of our research are threefold: First, in Section 2, we survey the state of the art in constructing author profiling corpora for the first time, compiling a taxonomy of construction strategies applied. Second, in Section 3, we report on the construction of the first large-scale corpus of celebrity profiles, describing our acquisition approach based on a reliable matching of Twitter accounts to Wikidata items. Third, in Section 4, we carry out a prediction experiment on the most widely studied trait, gender, comparing the performance of our own deep learning approach with that of the four best-performing ones submitted to the recent PAN author profiling competitions from 2015 to 2018. Moreover, we exemplify the prediction of celebrity occupations.

Related Work
We analyzed 29 publications on author profiling, the authors of which explicitly describe their data acquisition and corpus construction strategies. The strategies have been reviewed, abstracted, and mapped into a taxonomy, which in turn enabled us to identify specific quality criteria. Early work on author profiling was carried out by Pennebaker et al. (2003), Koppel et al. (2002), Schler et al. (2006), and Argamon et al. (2009); recent works add novel traits, trait relations, multilingualism, and microblogs. The largest annual shared task on author profiling is part of the PAN competition (Rangel Pardo et al., 2013, 2014, 2015, 2016, 2017b, 2018). Profiling research related to aspects such as behavioral traits (Kumar et al., 2018), medical conditions (Choudhury et al., 2013), or native language identification (NLI) has been excluded from our survey, since these have developed into subfields in their own right. Three criteria describe the quality of the surveyed resources: the representativeness of the targeted population, the comprehensiveness in terms of author, text, and label size, and the reliability of label attributions. Table 2 shows our taxonomy of label acquisition strategies for reliability and comprehensiveness evaluation: labels provided by the author or by others (A/O), labels provided independently or on request (I/R), and labels retrieved in structured or unstructured form (S/U). The six resulting strategies, disregarding R-U combinations as inapplicable, describe the general strategy and hint at possible issues: (1) subjectivity or misunderstandings by experts, volunteer annotators, or crowdsourcing workers versus deception and self-serving bias by author-self-reported labels, (2) self-selection bias and per-author cost in requested labels versus few and stale trait choices in independent reporting, and (3) imprecision, incompleteness, and misunderstandings in unstructured versus restricted choices in structured labeling.
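The count of six strategies follows from combining the three binary axes and dropping the requested-unstructured combinations. The following minimal sketch enumerates them; the three-letter codes and the dictionary names are our own illustrative shorthand, not notation taken from Table 2.

```python
from itertools import product

# Axis values as named in the taxonomy; the dictionaries are illustrative.
SOURCE = {"A": "provided by the author", "O": "provided by others"}
MODE = {"I": "provided independently", "R": "provided on request"}
FORM = {"S": "structured", "U": "unstructured"}


def label_acquisition_strategies():
    """Enumerate all axis combinations except requested-unstructured (R-U)."""
    strategies = []
    for source, mode, form in product(SOURCE, MODE, FORM):
        if mode == "R" and form == "U":
            continue  # disregarded as inapplicable in the text
        strategies.append(source + mode + form)
    return strategies


print(label_acquisition_strategies())
```

Of the 2 x 2 x 2 = 8 combinations, the two that pair R with U are removed, which leaves six codes, among them OIS (others, independent, structured).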

The Webis Celebrity Corpus
This section introduces the Webis Celebrity Corpus 2019, detailing how we identified celebrities at scale, compiled a large corpus of their writing, and linked it with Wikidata to obtain personal profiles. A corpus analysis and validation follows.

Who is a Celebrity?
To operationalize the term "celebrity", we say that a person has a celebrity-like status, be it locally or globally, if he or she possesses a verified Twitter account, and at the same time, is deemed notable enough to be the subject of a Wikipedia article and a Wikidata item. Importantly, Twitter verifies "that an account of public interest is authentic" (Twitter, 2018), awarding a blue checkmark badge. Notability at Wikipedia pertains to people who are "worthy of notice," "remarkable," or "famous or popular" (Wikipedia, 2018). While verified accounts also include organizations, and while most notable people at Wikipedia/Wikidata are not considered celebrities, it is their intersection which provides for a good approximation. To collect celebrity profiles at scale, we join these sources of information.

Corpus Construction
We crawled all 297,878 verified Twitter accounts and linked them with Wikidata items. This is a non-trivial task: a Twitter account name and its corresponding Wikidata item need not have an exact string match, and there may be false matches. Table 3a shows the six candidate names we obtained from the unique, static Twitter "@"-names and the free-form display names. Table 3b shows the linking results. Accounts were marked as human or not human based on Wikidata's "instance of" property. In the sequence of name candidates I-VI, a human match was kept, even if successive candidates matched non-human items. If items differed between languages for the same candidate, matches were marked ambiguous. Matches containing one of the eight death-related Wikidata properties and a date of death before Twitter's launch in March 2006 were marked memorial. All mismatches identified during our subsequent corpus validation were marked as error. After excluding matches with private timelines, 71,706 valid account-item matches remained.
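The linking procedure can be illustrated with a minimal sketch. It is not our production code and rests on several assumptions: "reference name" in Table 3a is read as the "@"-name, rule I is assumed to keep spaces so that rules IV-VI can split on them, rule III is omitted because its exact procedure is not spelled out here, and the ambiguity check across languages is left out. The function names, the `lookup` callable, the item dictionaries, and the rule that a human match at any rank is preferred are all hypothetical.

```python
import re
from datetime import date

TWITTER_LAUNCH = date(2006, 3, 1)  # "March 2006"; the exact day is an assumption
RULE_ORDER = ["I", "II", "III", "IV", "V", "VI"]


def name_candidates(display_name, handle):
    """Return {rule: candidate name} for rules I, II, IV, V, and VI."""
    # Rule I: keep only alphanumeric characters (plus separating spaces).
    cleaned = re.sub(r"[^\w\s]|_", "", display_name)
    rule_i = " ".join(cleaned.split())
    parts = rule_i.split(" ")
    # Rule II: split the "@"-name at capital letters.
    pieces = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", handle.lstrip("@"))
    candidates = {"I": rule_i, "II": " ".join(pieces)}
    if len(parts) > 2:
        candidates["IV"] = parts[0] + " " + parts[-1]  # first and last part
        candidates["VI"] = " ".join(parts[:-2])        # all but the last two
    if len(parts) > 1:
        candidates["V"] = " ".join(parts[:-1])         # all but the last
    return candidates


def classify_account(candidates, lookup):
    """Walk the candidates in rule order and keep the first human match.

    `lookup` maps a name to an item dict with the keys 'is_human' and,
    optionally, 'date_of_death', or to None if nothing matches.
    """
    first_non_human = None
    for rule in RULE_ORDER:
        name = candidates.get(rule)
        item = lookup(name) if name else None
        if item is None:
            continue
        if item["is_human"]:
            died = item.get("date_of_death")
            if died is not None and died < TWITTER_LAUNCH:
                return "memorial", item
            return "human", item
        first_non_human = first_non_human or item
    if first_non_human is not None:
        return "not human", first_non_human
    return "no match", None


# Usage with a made-up account and a made-up two-entry knowledge base.
cands = name_candidates("Dr. Jane Q. Public", "@JaneQPublic")
kb = {"Jane Q Public": {"id": "Q0", "is_human": True, "date_of_death": None}}
print(cands)
print(classify_account(cands, kb.get)[0])
```

In this example the display name itself finds no item, but the candidate derived from the "@"-name does, which is the situation the candidate sequence is meant to cover.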

Corpus Analysis
The corpus we created contains 29,968 words on average per author and 1,523 different Wikidata properties, of which 239 are personal traits relevant for profiling. Table 4 shows a selection of those traits, the most common value and for how many celebrities they are annotated. The remaining properties split into 1,224 external references (i.e., links to other sites) and 60 miscellaneous properties (mostly internal references and multimedia data). Of the 239 traits, 45 are attributed to more than 1,000, and 5 to more than 55,000 users simultaneously. The extracted Wikidata properties are highly specific and frequently feature over 100 different values per property within our corpus, although most are Zipf-distributed and can easily be aggregated or reduced to smaller dimensions, as we will demonstrate with occupation in Section 4. It should be noted that labels, such as ethnicity, religion, and native language, are present mostly for minorities rather than the majority.
Table 3a: Name candidate generation rules.
I    only alphanumeric characters of the display name
II   reference name split at capitalization
III  reference name split at display name
IV   first and last part from I, split at spaces
V    all but the last part from I
VI   all but the last two parts from I

We collected an average of 2,181 tweets per celebrity and 156,411,899 tweets in total (≈ 3 billion words), covering 98.05% of all their tweets. Of all collected tweets, 29.3% are retweets and 20.9% replies. Of the 49.7% remaining tweets, an average of 989 (13,938 words) per celebrity are longer than 20 characters and do not contain links, yielding a conservative estimate of tweets amenable to style analysis. Although celebrities tweeted in 50 different languages, 77% of all timelines consisted of tweets exclusively written in English, followed by 7% in Spanish and 4% in French, while 2,104 celebrities tweeted in at least two languages.
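The conservative filter for tweets amenable to style analysis can be sketched as follows. The tweet dictionary with the keys 'text', 'is_retweet', and 'is_reply' is a hypothetical representation, and the URL pattern is a simple approximation of link detection rather than the exact criterion we applied.

```python
import re

# Approximate link detection; an assumption made for this sketch.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def is_style_sample(tweet, min_chars=20):
    """Keep original tweets longer than `min_chars` that contain no links."""
    if tweet["is_retweet"] or tweet["is_reply"]:
        return False
    text = tweet["text"].strip()
    return len(text) > min_chars and URL_PATTERN.search(text) is None


timeline = [
    {"text": "Just finished rehearsal, what a night!", "is_retweet": False, "is_reply": False},
    {"text": "Thank you!", "is_retweet": False, "is_reply": False},
    {"text": "New single out now https://example.com/x", "is_retweet": False, "is_reply": False},
    {"text": "RT: a great evening with wonderful people", "is_retweet": True, "is_reply": False},
]
print([is_style_sample(t) for t in timeline])
```

Only the first tweet passes: the second is too short, the third contains a link, and the fourth is a retweet.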

Corpus Reliability and Limitations
Regarding the representativeness of our sample from the population of celebrities, we may cautiously claim to have obtained a wide cross-section of people of elevated status. However, celebrities are excluded who do not use Twitter, whose account is not verified (which becomes exceedingly unlikely the more famous they are), or who have no Wikipedia article about themselves. There are no reliable estimates of the true number of celebrities worldwide, but it is safe to assume that our corpus has a bias towards Western culture, and particularly English-speaking celebrities. Regarding profile comprehensiveness, our corpus provides for comparably long samples of writing per author and a rich set of traits, although many traits are available only for a subset of profiles. Most celebrities provide genuine writing samples of themselves at Twitter, but some employ public relations staff to manage their account. Though a problem for generic author profiling, this does not impede celebrity profiling. Celebrities craft public personas as their own unique brands. If a celebrity decides to employ staff to do so, approving their impersonations, these personas are no less genuine and normative than personally crafted personas.
The information about the traits of celebrities obtained from Wikidata can be considered highly reliable. Dedicated volunteers collect all kinds of personal information about celebrities, which is often referenced and under constant review by other Wikipedia and Wikidata editors. As per our taxonomy of label acquisition strategies in Table 2, we employ an OIS strategy: we obtain labels from third-party expert annotators (O), who are independent (I), supplying data in structured form (S).

Evaluation
To investigate the usefulness of our corpus for author profiling, we carry out a first large-scale profiling experiment by predicting celebrity occupation and gender and evaluating four state-of-the-art approaches that won the PAN 2015-2018 author profiling competitions. Instead of retraining their prediction models, we use the models for gender inference as they have been trained on the PAN training datasets provided to participants of the respective years. Additionally, we train our own baseline gender model on celebrity profiles. Gender is a suitable benchmark trait that is frequently studied in the related work and a recurring trait prediction task at PAN. We observe a successful model transfer, which corroborates that our corpus and the PAN corpora capture the same underlying concept of gender.

Preprocessing and Baselines
For our experiments, we extracted a subset of 45,475 English-speaking profiles from our corpus with the traits gender and occupation and split it 70/30 into training and test sets. Table 3c shows this dataset in comparison to the PAN datasets. Our subset has 1,379 different occupations annotated, which we manually assigned to eight groups: sports, performer, creator, politics, manager, science, professional, and religious. We preprocessed the text by lowercasing, replacing mentions with <user>, hashtags with <hashtag>, hyperlinks with <url>, number-groups with <numbers>, the most frequent emoticons with <smiley>, and we removed all punctuation sequences beyond basic English punctuation marks. As baseline models for gender and for occupation prediction, we adapted the convolutional neural network (CNN) for text classification introduced by Kim (2014). Our variant of this model builds on the 100-dimensional GloVe (Pennington et al., 2014) Twitter embeddings, uses four parallel 1D-convolution layers with 128 filters each for 1-, 2-, 3-, and 4-grams, a 64-node dense layer for concatenation after the convolutions, and a final classification layer. The models for occupation and gender only differ in the last classification layer and the loss function used, to accommodate binary (gender) and categorical (occupation) ground truth. We limited the vocabulary to the most common 100,000 words and padded the word sequence for each author to 5,000 words, which is roughly the average per-author word count between ours and the PAN datasets. In our tests on the celebrity profiles, this hyperparameter setting achieves more consistent results than fewer or shorter n-gram filters, smaller dense layers, shorter or longer sequence length, or a larger vocabulary. Note that our corpus has labels for more than the two sexes male and female. The PAN data did not, however, so we excluded profiles with other genders from our experiments, leaving their investigation for future work.
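The preprocessing steps can be sketched with a few regular expressions. This is a minimal illustration under stated assumptions, not our exact pipeline: the emoticon pattern, the set of characters treated as basic punctuation, the order of the replacements, and the <pad> token are choices made for this example, and angle brackets are kept only so that the placeholder tokens survive the punctuation filter.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION = re.compile(r"@\w+")
HASHTAG = re.compile(r"#\w+")
# A small, assumed set of frequent emoticons such as :) ;-) :( :D :P.
SMILEY = re.compile(r"(?<!\w)[:;=]-?[)(dp](?!\w)")
NUMBERS = re.compile(r"\b\d+(?:[.,]\d+)*\b")
# Everything outside word characters, whitespace, and basic punctuation.
NOT_BASIC = re.compile(r"[^\w\s.,!?;:'\"()<>-]")


def preprocess(text):
    """Lowercase, insert placeholder tokens, and drop non-basic punctuation."""
    text = text.lower()
    text = URL.sub("<url>", text)        # first, since links may contain '#' or digits
    text = MENTION.sub("<user>", text)
    text = HASHTAG.sub("<hashtag>", text)
    text = SMILEY.sub("<smiley>", text)  # before the punctuation filter
    text = NUMBERS.sub("<numbers>", text)
    text = NOT_BASIC.sub("", text)
    return " ".join(text.split())


def pad_tokens(tokens, length=5000, pad="<pad>"):
    """Truncate or pad a token list to a fixed length."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))


sample = "Loving the new track by @DJ_Example!!! #NowPlaying https://example.com/x 10,000 streams :)"
print(preprocess(sample))
```

The order of the substitutions matters: replacing hyperlinks first avoids turning fragments of a URL into hashtag or number placeholders, and emoticons have to be replaced before the punctuation filter would otherwise strip or keep their characters individually.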
Table 5 shows all models' transfer performance between populations on gender. In general, all models generalize well to the respectively unseen datasets but perform best on the data they have been specifically trained for. The largest differences can be observed on the sub-1,000 author dataset PAN15, where the model of Álvarez-Carmona et al. (2015) suffers a significant performance loss, and on PAN16, where the model of Busger op Vollenbroek et al. (2016) performs notably better on the celebrity data. The latter came as a surprise to us; it may be explained by the longer samples of writing per profile in our corpus. This hypothesis is also supported by the large increase in accuracy of the baseline model after retraining for two epochs with the PAN15 and PAN16 training datasets, respectively. The occupation model achieved an accuracy of 0.7111.

Evaluation Results
Altogether, the results of our experiments show that profiling models trained on a random choice of people generalize to celebrities, and vice versa. Our corpus can hence be used for generic author profiling, while providing significantly richer profiles in terms of writing samples and as of yet unexplored personal traits. The scale of our corpus allows for the training of deep learning models, which, at least on our corpus, outperform the state of the art. We expect that further fine-tuning of the model architecture will yield significant improvements.

Conclusion
This paper introduces the Webis Celebrity Corpus 2019, the first corpus of its kind comprising a total of 71,706 celebrity profiles, 239 profiling-relevant labels, and 3 billion words. Its quality is due to Twitter's verification process, Wikidata's accuracy, and our low-error linking strategy between the two sites. Its generalizability for gender prediction has been demonstrated using state-of-the-art approaches.
Our corpus formed the basis for the first celebrity profiling competition, organized as part of the PAN evaluation lab (Wiegmann et al., 2019). The traits studied were the degree of fame, occupation, age, and gender, introducing fame and occupations as novel, celebrity-specific profiling traits, and revisiting the well-known traits age and gender.
In future work, we plan to improve the corpus by incorporating verified accounts from other social networks and by inferring new labels for as yet unlabeled celebrities through link prediction.