Increasing In-Class Similarity by Retrofitting Embeddings with Demographic Information

Most text-classification approaches represent the input based on textual features, either feature-based or continuous. However, this ignores strong non-linguistic similarities like homophily: people within a demographic group use language more similar to each other's than to that of non-group members. We use homophily cues to retrofit text-based author representations with non-linguistic information, and introduce a trade-off parameter. This approach increases in-class similarity between authors, and improves classification performance by making classes more linearly separable. We evaluate the effect of our method on two author-attribute prediction tasks with various training-set sizes and parameter settings. We find that our method can significantly improve classification performance, especially when the number of labels is large and limited labeled data is available. It is potentially applicable as a preprocessing step to any text-classification task.


Introduction
Predicting socio-demographic author characteristics is becoming ever more relevant with the pervasive use of user-generated content. Classifying user attributes such as age and gender is useful for a number of applications both in the public sector, where it can support the investigation of crime (in forensic linguistics) or the determination of social policies, and in the private sector, where companies want to profile a potential consumer market, targeting communication strategies and advertising to specific communities. Furthermore, recent work in NLP has shown that incorporating author attributes in various NLP tasks can also improve performance (Volkova et al., 2013; Hovy, 2015; Lynn et al., 2017; Preotiuc-Pietro et al., 2016).
In these tasks, authors are typically represented via their linguistic profiles, i.e., information available in the text. This includes both word-based features as well as continuous representations (embeddings). Generally, linguistic features are divided into content-related and strictly stylistic features. While the former can be effectively represented by (n-grams of) words which capture the topic and meaning of a text, the latter focus on the use of function words, expressions, pronouns, syntactic structures, etc. There is evidence in the literature that content-related text characteristics are more effective than stylistic features for gender and age prediction (Fatima et al., 2017; Rosenthal and McKeown, 2011). This effect is the consequence of a non-linguistic auto-selection process known as homophily: people within a demographic group tend to be more similar to each other than to members of other groups, and subjects belonging to different groups are therefore naturally more prone to discuss different topics.
Despite the large amount of available social media data (in April 2018, Facebook had more than two billion active users, YouTube and WhatsApp one-and-a-half billion each, and Twitter 330 million; see statista.com), we often encounter scenarios with limited availability of ground-truth user attributes, leading to remarkable performance differences compared to, say, blogs. This difference is due to the shortness of social media texts and the wider range of topics (Rangel et al., 2016), which weaken linguistic profile features. In such cases, improving author representations beyond the linguistic profiles can be especially useful.
We implement this intuition of leveraging demographic homophily by using retrofitting (Faruqui et al., 2015), a method introduced to refine word vectors to reflect semantic similarity information from lexicons. In our case, we increase the similarity between the (linguistically-based) continuous author representations within each class (here: age or gender). Authors who share the same gender or age therefore get more similar vector representations (see Section 3.2). This effectively increases class separability and can thereby improve classification performance.
We also experiment with a trade-off parameter α, which controls the relative influence of the retrofitting process vs. the original embedding vector on the retrofit representation, allowing us to explore the effect of both factors on the final prediction outcome.
In order to extend the in-class homophily information to unlabeled data, we induce a transformation matrix to translate between the original and retrofitted embedding space. This matrix can be applied to unlabeled data to transform the author representations in the test set.
We use a set of almost 100K authors to predict age and gender. In order to explore limited-resource scenarios, we experiment with a range of training-set sizes. Our results indicate that demographic retrofitting of linguistic representations substantially increases classification performance for age and gender prediction, especially in low-resource scenarios.
It is an easy, fast, and efficient preprocessing step that can substantially improve classification performance. We demonstrate our method on author-attribute prediction, but believe it can potentially be applied to any text-classification task.
Contributions In this paper, we introduce demographic retrofitting based on in-class homophily, and make the following contributions: 1. We present a substantial expansion of the original retrofitting algorithm (Faruqui et al., 2015). In contrast to prior work, which relies on external ontologies, our method relies solely on the information contained within the training data.
2. We show how to generalize the transformation from training data to unlabeled data, using a translation matrix.

Data
We use data from , a collection of reviews of online companies from various countries, including author information. We select all reviews written in English from American and British sources that include both age and gender of the author and are at least 10 tokens long (shorter reviews tend to be mistokenized URLs or replies). We aggregate the reviews by user, so that each instance is a collection of texts from a unique user. This leaves us with 98,608 individual users and about 8M words (roughly 80 words per instance). For each user, we use the age (discretized by decade) and gender (self-stated as binary, and augmented by  based on the users' first names) as target variables. We minimally preprocess the text data, collapsing all numbers into 0s, and tokenizing via spacy (Honnibal and Johnson, 2015).
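The filtering and aggregation steps above can be sketched as follows. This is our own illustration: a simple whitespace tokenizer stands in for the spacy tokenizer used in the paper, and the function names and input format (a list of (user_id, text) pairs) are hypothetical.

```python
import re

def preprocess_review(text, min_tokens=10):
    """Tokenize (whitespace stand-in for spacy), collapse numbers into 0s,
    and drop reviews under min_tokens."""
    tokens = text.split()
    if len(tokens) < min_tokens:
        return None  # short reviews tend to be mistokenized URLs or replies
    return [re.sub(r"\d+", "0", t) for t in tokens]

def aggregate_by_user(reviews):
    """Group token lists by user id, so each instance is one author's text."""
    users = {}
    for user_id, text in reviews:
        tokens = preprocess_review(text)
        if tokens is not None:
            users.setdefault(user_id, []).extend(tokens)
    return users
```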

Methodology
In our experiments, we are interested in the effect of homophily-inducing retrofitting on author-attribute prediction. In order to evaluate this effect, we compare the performance of author representations based on linguistic input to the performance of the same representations retrofitted to the author-attribute class in question. In this section, we outline the details of the different steps.

Linguistic author representations
We train Doc2Vec, a paragraph2vec (Le and Mikolov, 2014) implementation, on the corpus, inducing a 98K-by-300 matrix D, where each row represents an author. We follow the parametrization suggested by Lau and Baldwin (2016), setting the window size to 15, the minimum word frequency to 10, the number of negative samples to 5, and the downsampling rate to 0.00001, and run for 1000 iterations. We use the resulting author embeddings as input to the author-attribute classifier (see 3.4). We induce the author embeddings over the entire corpus of 98K authors, without recourse to age or gender information.
As a comparison, we also create a bag-of-words (BOW) representation with the same dimensionality. We use χ² as the selection criterion to find the top 300 words in the training data, separately for both age and gender classification.
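The χ²-based selection can be sketched with scikit-learn (the function name and toy usage are ours):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

def bow_top_k(train_texts, train_labels, k=300):
    """Bag-of-words restricted to the k words with the highest chi-squared
    score w.r.t. the target labels (age or gender)."""
    vec = CountVectorizer()
    X = vec.fit_transform(train_texts)
    selector = SelectKBest(chi2, k=min(k, X.shape[1])).fit(X, train_labels)
    return vec, selector
```

At test time, the fitted vectorizer and selector are applied in sequence to produce the 300-dimensional BOW representation.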

Retrofitting
Our goal is to enhance the author representations, which are based on linguistic similarity, with demographic information about the target variable (say, age). In order to introduce this information into the vector space, we rely on retrofitting, by increasing the similarity of authors within the same target group (say, people in their 20s). We thereby separate the target classes in embedding space, making them easier to differentiate by a classifier. Faruqui et al. (2015) introduced retrofitting of word vectors based on external ontologies, such as WordNet (Miller, 1995) or PPDB (Ganitkevitch et al., 2013). Instead of these resources, we map each labeled author to the list of all other authors with the same label in the training data. Formally, we create a set Ω containing tuples of authors (d_i, d_j | y_i = y_j). We do this separately for each demographic dimension, age and gender.

Figure 1: Schematic representation of 500 authors colored by age group, without (a) and with (b) retrofitting.
During retrofitting, we iteratively update the author representations in the training data (initially linguistically based) to increase the cosine similarity between authors within the same class (as defined in Ω). This creates a retrofitted matrix D̃_train from the original author matrix D_train. The update for an author representation d_i is a weighted combination of the original embedding and the average over all its current neighbors:

d̃_i = α · d_i + β · (1 / |N_i|) · Σ_{j ∈ N_i} d̃_j

where d_i is the original linguistic representation vector, N_i = {j : (i, j) ∈ Ω} is the set of all authors in the same label group, and α and β are hyper-parameters that control the trade-off between the original representation and the updates from the neighboring embeddings during retrofitting. In Faruqui et al. (2015), α = β. In contrast, we define β = 1 − α. By varying α from 0 to 1, we can control the strength of the retrofitting process: α = 1 simply reproduces the original matrix, i.e., D̃ = D, whereas α = 0 only relies on the neighborhood updates after the initialization. Figure 1 shows a sample of 500 users in a non-retrofitted (1a) and retrofitted (1b) 3D embedding space, colored by class. The color distribution shows how people belonging to the same group are drawn closer to each other in embedding space when using retrofitting.
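The update rule can be sketched in numpy as follows. This is our own minimal implementation: it assumes at least two authors per class, runs a fixed number of passes (Faruqui et al. use 10 iterations), and excludes each author from its own neighbor average.

```python
import numpy as np

def retrofit(D, labels, alpha=0.1, n_iter=10):
    """Pull each author embedding toward the current mean of the other
    authors with the same label; alpha=1 reproduces D unchanged."""
    D = np.asarray(D, dtype=float)
    D_ret = D.copy()
    labels = np.asarray(labels)
    for _ in range(n_iter):
        for y in np.unique(labels):
            idx = labels == y
            n = idx.sum()
            # mean over each author's current neighbors (excluding itself)
            neighbor_mean = (D_ret[idx].sum(axis=0) - D_ret[idx]) / (n - 1)
            D_ret[idx] = alpha * D[idx] + (1 - alpha) * neighbor_mean
    return D_ret
```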

Translation
We can only retrofit the embeddings of authors in the training set D_train, since we need information about the class label in order to construct Ω. However, the retrofitting process changes the configuration of the embedding space (into D̃_train), so a separating hyperplane learned on D̃_train will not be applicable to a test set D_test in the original embedding space.
In order to extend the homophily information to authors in the test set, we use a translation matrix T (a 300 × 300 matrix), which approximates the transformation from the original training data matrix D_train into the retrofitted matrix D̃_train. We obtain T by minimizing the least-square difference in D_train · T = D̃_train.
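Obtaining T via least squares is a one-liner with numpy's solver (the helper name is ours):

```python
import numpy as np

def learn_translation(D_train, D_train_retro):
    """Least-squares solution T of D_train @ T ~= D_train_retro
    (a 300 x 300 matrix in the paper's setting)."""
    T, *_ = np.linalg.lstsq(D_train, D_train_retro, rcond=None)
    return T
```

Applying `D_test @ T` then produces the transformed test representations.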
T captures the retrofitting operation, and allows us to modify the test subjects' representations as if age and gender were known, despite the absence of class information. In particular, by applying T to the matrix of the author embeddings in the unlabeled test set D_test, we obtain a retrofitted version D̃_test that preserves the transformation learned on the training data. Since the least-square approximation is not perfect, we find that in practice fitting a classifier on the approximation D_train · T works better than using D̃_train, acting as a regularizer.

Classification
We retrofit the author embeddings in the training set (see 3.2) and learn a translation matrix to transform the representations of the remaining authors in the test set. We train three separate Logistic Regression classifiers: one on the author embeddings, one on the retrofitted embeddings, and one on BOW features. It is technically possible to retrofit BOW representations as well, but in practice, the classifier does not converge, as word count-based vectors do not represent a continuous space that captures latent similarities.
We then use the three classifiers to predict the author attributes of the remaining authors in the test data set. We evaluate the results via micro-F1 score (averaged over 100 runs), since our tasks include imbalanced multi-class scenarios: micro-F1 weights the contribution of each class according to its relative size and is therefore more informative than accuracy.
Since we are interested in the effect of the training set size on performance, we vary the number of available training examples from 1000 to 10,000, using the remaining authors as test set. For each training set size, we collect 100 random subsamples and average over them.
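One run of the evaluation above can be sketched as follows (our own thin wrapper around scikit-learn; the retrofitted and translated matrices from the previous sections would serve as the feature inputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit a Logistic Regression classifier and report micro-F1 on the test set."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average="micro")
```

In the paper's setup, this is repeated over 100 random subsamples per training-set size and the scores are averaged.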

Results
The learning curves in Figure 2 show that retrofitting outperforms both the original author embeddings and BOW representations for age (top) and gender (bottom) prediction in terms of F1. The effect is stronger when little training data is available. We evaluate the statistical significance of the difference between results with retrofitting and original embeddings via a bootstrap sampling test. We do not test against BOW, since it is consistently lower than the embeddings. The resulting p-values are given in the respective figures. For gender classification, there are small, but not significant improvements with retrofitting. By contrast, for age classification, small values of α (0.01, 0.1, and 0.25) result in significantly better classification than any other method.
The performance differences between the methods are generally more pronounced for age prediction, which has 10 possible labels, than for gender prediction (two labels). The difference in optimal α value suggests a relation between α and the size of the label space.
In both tasks, the best result is achieved by choosing a low α, i.e., by giving more weight to the demographic association of the users than to their linguistic feature representations. In practice, this value should be determined via cross-validation; here, we show different levels of α in order to give some intuition of its effect on performance. Note that the curves for the original embeddings and BOW are unaffected by α and do not change; we repeat them in each figure for comparison. Increasing α eventually converges to the original embeddings, but we see that even intermediate values can be close to the original embeddings.

Related Work
The first studies to apply statistical NLP techniques to author-attribute prediction are Koppel et al. (2002); Argamon et al. (2003), using the British National Corpus (BNC). The same authors also introduced the use of blogs as a data source.
A big contribution in this field, however, came from the shared tasks of the PAN workshops (Rangel et al., 2013, 2014, 2015, 2016). Research has identified a variety of linguistic features, ranging from "stylistic features with n-grams models, parts-of-speech, collocations, LDA, different readability indexes, vocabulary richness, correctness or verbosity" (Rangel et al., 2016). However, none of these papers used demographic information directly in the author representations.
Closest to our method are Lopez-Monroy et al. (2013), who propose the use of second-order representations. They create specific profiles for the target classes, and exploit them for the creation of the profile of each document. In both cases, the linguistic representation of the documents passes through a class-related profile.
Figure 2: Learning curves (micro-F1) for three classifiers on age (top) and gender (bottom) prediction for different values of α: a) α = 0.01, b) α = 0.1, c) α = 0.25, d) α = 0.5, e) α = 0.75. Retrofitting influence decreases from left to right. All data points are averaged over 100 runs. The shaded area is the 95%-confidence interval for retrofitting. p-values denote the statistical difference between original and retrofitted embeddings according to a bootstrap test.

The methods applied in the PAN workshops also reflect the recent research trend towards word embeddings, which we explore in this paper. Bayot and Gonçalves (2016) first used word2vec embeddings as input features to an SVM classifier, followed by the use of convolutional (CNN) and recurrent neural networks (RNN) by Miura et al. (2017). Markov et al. (2016) also created document representations through word2vec, using a Logistic Regression classifier.