Towards Robust and Privacy-preserving Text Representations

Written text often provides sufficient clues to identify the author, their gender, age, and other important attributes. Consequently, the authorship of training and evaluation corpora can have unforeseen impacts, including differing model performance for different user groups, as well as privacy implications. In this paper, we propose an approach to explicitly obscure important author characteristics at training time, such that the representations learned are invariant to these attributes. Evaluating on two tasks, we show that this leads to increased privacy in the learned representations, as well as models that are more robust to varying evaluation conditions, including out-of-domain corpora.


Introduction
Language is highly diverse, and differs according to the author, their background, and personal attributes such as gender, age, education and nationality. This variation can have a substantial effect on NLP models learned from text, leading to significant variation in inferences across corpora that differ in, for example, the author's native language, gender and age. Training corpora are never truly representative, and therefore models fit to these datasets are biased in the sense that they are much more effective for texts from certain groups of users, e.g., middle-aged white men, and considerably poorer for other parts of the population (Hovy, 2015). Moreover, models fit to language corpora often fixate on author attributes which correlate with the target variable, e.g., gender correlating with class skews (Zhao et al., 2017), or translation choices (Rabinovich et al., 2017). This signal, however, is rarely fundamental to the task of modelling language, and is better considered as a confounding influence. These auxiliary learning signals can mean the models do not adequately capture the core linguistic problem. In such situations, removing these confounds should give better generalisation, especially for out-of-domain evaluation, a similar motivation to research in domain adaptation based on selection biases over text domains (Blitzer et al., 2007; Daumé III, 2007).
Another related problem is privacy: texts convey information about their author, often inadvertently, and many individuals may wish to keep this information private. Consider the case of the AOL search data leak, in which AOL released detailed search logs of many of their users in August 2006 (Pass et al., 2006). Although they deidentified users in the data, the log itself contained sufficient personally identifiable information to allow many of these individuals to be identified (Jones et al., 2007). Other sources of user text, such as emails, SMS messages and social media posts, would likely pose similar privacy issues. This raises the question of how the corpora, or models built thereupon, can be distributed without exposing this sensitive data. This is the problem of differential privacy, which is more typically applied to structured data, and often involves data masking, addition of noise, or other forms of corruption, such that formal bounds can be stated in terms of the likelihood of reconstructing the protected components of the dataset (Dwork, 2008). This often comes at the cost of an accuracy reduction for models trained on the corrupted data (Shokri and Shmatikov, 2015; Abadi et al., 2016).
Another related setting is where latent representations of the data are shared, rather than the text itself, which might occur when sending data from a phone to the cloud for processing, or trusting a third party with sensitive emails for NLP processing, such as grammar correction or translation. The transferred representations may still contain sensitive information, however, especially if an adversary has preliminary knowledge of the training model, in which case they can readily reverse engineer the input, for example, by a GAN attack algorithm (Hitaj et al., 2017). This is true even when differential privacy mechanisms have been applied.
Inspired by the above works, and recent successes of adversarial learning (Goodfellow et al., 2014; Ganin et al., 2016), we propose a novel approach for privacy-preserving learning of unbiased representations. Specifically, we employ Ganin et al.'s approach to training deep models with adversarial learning, to explicitly obscure individuals' private information. The learned (hidden) representations of the data can thereby be transferred without compromising the authors' privacy, while still supporting high-quality NLP inference. We evaluate on the tasks of POS tagging and sentiment analysis, protecting several demographic attributes (gender, age, and location), and show empirically that doing so does not hurt accuracy, but instead can lead to substantial gains, most notably in out-of-domain evaluation. Compared to differential privacy, we report gains rather than losses in performance, but note that we provide only empirical improvements in privacy, without any formal guarantees.

Methodology
We consider a standard supervised learning situation, in which inputs x are used to compute a representation h, which then forms the parameterisation of a generalised linear model, used to predict the target y. Training proceeds by minimising a differentiable loss, e.g., cross entropy, between predictions and the ground truth, in order to learn an estimate of the model parameters, denoted $\theta_M$.
Overfitting is a common problem, particularly in deep learning models with large numbers of parameters, whereby h learns to capture specifics of the training instances which do not generalise to unseen data. Some types of overfitting are insidious, and cannot be adequately addressed with standard techniques like dropout or regularisation. Consider, for example, the authorship of each sentence in the training set in a sentiment prediction task. Knowing the author, and their general disposition, will likely provide strong clues about their sentiment with respect to any sentence. Moreover, given the ease of authorship attribution, a powerful learning model might learn to detect the author from their text, and use this to predict the sentiment, rather than basing the decision on the semantics of each sentence. This might be the most efficient use of model capacity if there are many sentences by this individual in the training dataset, yet will lead to poor generalisation to test data authored by unseen individuals. Moreover, this raises privacy issues when h is known by an attacker or malicious adversary. Traditional privacy-preserving methods, such as added noise or masking, applied to the representation will often incur a cost in terms of a reduction in task performance. Differential privacy methods are unable to protect the user privacy of h under adversarial attacks, as described in Section 1.
Figure 1: Proposed model architectures, showing a single training instance $(x_i, y_i)$ with two protected attributes, $b_i$ and $b_j$. D indicates a discriminator, and the red dashed and blue lines denote adversarial and standard loss, respectively.
Therefore, we consider how to learn unbiased representations of the data with respect to specific attributes which we expect to behave as confounds in a generalisation setting. To do so, we take inspiration from adversarial learning (Goodfellow et al., 2014; Ganin et al., 2016). The architecture is illustrated in Figure 1.

Adversarial Learning
A key idea of adversarial learning, following Ganin et al. (2016), is to learn a discriminator model D jointly with the standard supervised model. Using gender as an example, a discriminator will attempt to predict the gender, b, of each instance from h, such that training involves joint learning of both the model parameters, $\theta_M$, and the discriminator parameters, $\theta_D$. However, the learning aims of these components are in opposition: we seek an h which leads to a good predictor of the target y, while being a poor representation for predicting gender. This leads to the objective (illustrated for a single training instance),
$$\hat{\theta} = \min_{\theta_M} \max_{\theta_D} \; \mathcal{X}(\hat{y}(x; \theta_M), y) - \lambda \cdot \mathcal{X}(\hat{b}(x; \theta_D), b), \tag{1}$$
where $\mathcal{X}$ denotes the cross entropy function, $\hat{y}$ and $\hat{b}$ are the predictions of the target and of the protected attribute, and $\lambda$ weights the second term. The negative sign of the second term, referred to as the adversarial loss, can be implemented by a gradient reversal layer during backpropagation (Ganin et al., 2016). To elaborate, training is based on standard gradient backpropagation for learning the main task. For the auxiliary task, we start with standard loss backpropagation, but the gradients are reversed in sign when they reach h. Consequently, the linear output components are trained to be good predictors, but h is trained to be maximally good for the main task and maximally poor for the auxiliary task.
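The gradient reversal step described above can be illustrated on a deliberately tiny model. The following is a minimal sketch under stated assumptions, not the authors' implementation: the representation is a scalar h = w·x, both the main predictor and the discriminator are single logistic units, and the names `adversarial_grads`, `w`, `u`, `v` and `lam` are hypothetical. Gradients are derived by hand so that the sign flip at h is explicit.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def adversarial_grads(x, y, b, w, u, v, lam):
    """Hand-derived gradients for one instance (toy scalar model, assumed).

    h = w * x                 shared representation
    y_hat = sigmoid(u * h)    main-task predictor
    b_hat = sigmoid(v * h)    discriminator for the protected attribute
    """
    h = w * x
    y_hat = sigmoid(u * h)
    b_hat = sigmoid(v * h)

    # For binary cross entropy, d(loss)/d(logit) = prediction - label.
    d_main = y_hat - y
    d_disc = b_hat - b

    # Output components receive ordinary gradients, so both heads are
    # trained to be good predictors.
    grad_u = d_main * h
    grad_v = d_disc * h

    # Gradient reversal: the discriminator gradient is multiplied by -lam
    # when it reaches h, so h is pushed to make the attribute harder to predict.
    grad_h = d_main * u + (-lam) * (d_disc * v)
    grad_w = grad_h * x
    return {"w": grad_w, "u": grad_u, "v": grad_v}
```

Setting `lam=0.0` recovers ordinary training of the main task, which gives a convenient way to check that only the encoder parameter `w` is affected by the reversal.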
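Furthermore, Equation 1 can be extended to scenarios with several (N) protected attributes,
$$\hat{\theta} = \min_{\theta_M} \max_{\{\theta_{D^i}\}_{i=1}^{N}} \; \mathcal{X}(\hat{y}(x; \theta_M), y) - \sum_{i=1}^{N} \lambda_i \cdot \mathcal{X}(\hat{b}_i(x; \theta_{D^i}), b_i), \tag{2}$$
where each protected attribute $b_i$ has its own discriminator $D^i$, with parameters $\theta_{D^i}$ and weight $\lambda_i$.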
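To make the multi-attribute objective concrete, the sketch below evaluates its value for a single instance from already-computed predictive distributions. It is an illustration only: the function names and the list-based representation of distributions are assumptions, and the min/max optimisation itself is not shown.

```python
import math

def cross_entropy(probs, gold):
    """Cross entropy of a predicted distribution against a gold class index."""
    return -math.log(probs[gold])

def adversarial_objective(y_probs, y, b_probs_list, b_list, lambdas):
    """Main-task loss minus the weighted sum of discriminator losses.

    y_probs      -- predicted distribution over target classes
    y            -- gold target class index
    b_probs_list -- one predicted distribution per protected attribute
    b_list       -- gold value (class index) of each protected attribute
    lambdas      -- one weight per protected attribute
    """
    main = cross_entropy(y_probs, y)
    adversarial = sum(
        lam * cross_entropy(b_probs, b)
        for lam, b_probs, b in zip(lambdas, b_probs_list, b_list)
    )
    return main - adversarial

# Two protected attributes, each weighted by 1e-3.
value = adversarial_objective(
    y_probs=[0.5, 0.5], y=0,
    b_probs_list=[[0.5, 0.5], [0.25, 0.75]], b_list=[1, 0],
    lambdas=[1e-3, 1e-3],
)
```

With a single attribute and a single weight, the same function reduces to the one-discriminator case.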

Experiments
In this section, we report experimental results for our methods with two very different language tasks.

POS-tagging
The first task is part-of-speech (POS) tagging, framed as a sequence tagging problem. Recent demographic studies have found that the author's gender, age and race can influence tagger performance (Jørgensen et al., 2016). Therefore, we use POS tagging to demonstrate that our model is capable of eliminating this type of bias, thereby leading to more robust models of the problem.
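Model Our model is a bi-directional LSTM for POS tag prediction (Hochreiter and Schmidhuber, 1997), formulated as:
$$h_i = \mathrm{LSTM}(x_i, h_{i-1})$$
$$h'_i = \mathrm{LSTM}'(x_i, h'_{i+1})$$
$$y_i \sim \mathrm{Categorical}(\phi([h_i; h'_i]))$$
for input sequence $x_i|_{i=1}^{n}$ with terminal hidden states $h_0$ and $h'_{n+1}$ set to zero, where $\phi$ is a linear transformation, and $[\cdot; \cdot]$ denotes vector concatenation.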
For the adversarial learning, we use the training objective from Equation 2 to protect gender and age, both of which are treated as binary values. The adversarial component is parameterised by feed-forward networks with one hidden layer, applied to the final hidden representation $[h_n; h'_0]$. For hyperparameters, we fix the size of the word embeddings and h to 300, and set all $\lambda$ values to $10^{-3}$. A dropout rate of 0.5 is applied to all hidden layers during training.
Data We use the TrustPilot English POS tagged dataset, which consists of 600 sentences, each labelled with both the sex and age of the author, and manually POS tagged based on the Google Universal POS tagset (Petrov et al., 2012). For the purposes of this paper, we follow Hovy and Søgaard's setup, categorising SEX into female (F) and male (M), and AGE into over-45 (O45) and under-35 (U35). We train the taggers both with and without the adversarial loss, denoted ADV and BASELINE, respectively.
For evaluation, we perform 10-fold cross validation, with a train:dev:test split using ratios of 8:1:1. We also follow the evaluation method of Hovy and Søgaard, reporting the tagging accuracy for sentences over different slices of the data based on SEX and AGE, and the absolute difference between the two settings.
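The evaluation protocol above can be sketched in a few lines. This is a hedged illustration rather than the authors' evaluation script: it assumes token-level accuracy, one group label per sentence, and a simple rotation in which fold k is the test set and the next fold is the dev set. All function names are hypothetical.

```python
def ten_fold_splits(n_items, n_folds=10):
    """Yield (train, dev, test) index lists in an 8:1:1 ratio per fold."""
    folds = [list(range(k, n_items, n_folds)) for k in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds  # assumed choice of dev fold
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, dev_k) for i in fold]
        yield train, folds[dev_k], folds[k]

def group_accuracy(gold, pred, groups):
    """Token-level tagging accuracy for each group (e.g. 'F' vs 'M')."""
    correct, total = {}, {}
    for gold_tags, pred_tags, group in zip(gold, pred, groups):
        for g, p in zip(gold_tags, pred_tags):
            total[group] = total.get(group, 0) + 1
            correct[group] = correct.get(group, 0) + (g == p)
    return {group: correct[group] / total[group] for group in total}

def accuracy_gap(acc, group_a, group_b):
    """Absolute accuracy difference between two groups (the delta column)."""
    return abs(acc[group_a] - acc[group_b])

# Toy data: three sentences, tagged by authors of two groups.
gold = [["DET", "NOUN"], ["PRON", "VERB"], ["NOUN", "VERB"]]
pred = [["DET", "NOUN"], ["PRON", "NOUN"], ["NOUN", "VERB"]]
acc = group_accuracy(gold, pred, groups=["F", "M", "F"])
gap = accuracy_gap(acc, "F", "M")
```

A smaller gap between the two slices indicates that the tagger treats the groups more evenly, which is the quantity the adversarial training is intended to reduce.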
Considering the tiny quantity of text in the TrustPilot corpus, we use the Web English Treebank (WebEng: Bies et al. (2012)), as a means of pre-training the tagging model. WebEng was chosen to be as similar as possible in domain to the TrustPilot data, in that the corpus includes unedited user generated internet content.
As a second evaluation set, we use a corpus of African-American Vernacular English (AAVE) from Jørgensen et al. (2016), which is used purely for held-out evaluation. AAVE consists of three very heterogeneous domains: LYRICS, SUBTITLES and TWEETS. Considering the substantial difference between this corpus and WebEng or TrustPilot, and the lack of any domain adaptation, we expect a substantial drop in performance when transferring models, but also expect a larger impact from bias removal using ADV training.
Results and analysis Table 1 shows the results for the TrustPilot dataset. Observe that the disparity in BASELINE tagger accuracy (the ∆ column) is larger for AGE than for SEX, consistent with the results of Hovy and Søgaard. Our ADV method leads to a sizeable reduction in the difference in accuracy across both SEX and AGE, showing that our model captures less of the bias signal and is more robust on the tagging task. Moreover, our method leads to a substantial improvement in accuracy across all the test cases. We speculate that this is a consequence of the regularising effect of the adversarial loss, leading to a better characterisation of the tagging problem. Table 2 shows the results for the AAVE held-out domain. Note that we do not have annotations for SEX or AGE, and thus we only report the overall accuracy on this dataset. ADV also significantly outperforms the BASELINE across the three held-out domains.
Combined, these results demonstrate that our model can learn representations that are relatively debiased with respect to gender and age, while simultaneously improving the predictive performance, both for in-domain and out-of-domain evaluation scenarios.

Sentiment Analysis
The second task we use is sentiment analysis, which also has broad applications to the online community, as well as privacy implications for the authors whose text is used to train our models. Many user attributes have been shown to be easily detectable from online review data, as used extensively in sentiment analysis (Potthast et al., 2017). In this paper, we focus on three demographic variables: gender, age, and location.
Model Sentiment is framed as a 5-class text classification problem, which we model using Kim's (2014) convolutional neural network (CNN) architecture, in which the hidden representation is generated by a series of convolutional filters followed by a max-pooling step, denoted simply as $h = \mathrm{CNN}(x; \theta_M)$. We follow the hyper-parameter settings of Kim (2014), and initialise the model with word2vec embeddings (Mikolov et al., 2013). We set the $\lambda$ values to $10^{-3}$ and apply a dropout rate of 0.5 to h.
As the discriminator, we also use a feed-forward model with one hidden layer, to predict the private attribute(s). We compare models trained with zero, one, or all three private attributes, denoted BASELINE, ADV-*, and ADV-all, respectively.
Data We again use the TrustPilot dataset, but now consider the RATING score as the target variable, rather than the POS tag. Each review is associated with three further attributes: gender (SEX), age (AGE), and location (LOC). To ensure that LOC cannot be trivially predicted based on the script, we discard all non-English reviews using langid.py (Lui and Baldwin, 2012), retaining only reviews classified as English with a confidence greater than 0.9. We then subsample 10k reviews for each location to balance the five location classes (US, UK, Germany, Denmark, and France), which were highly skewed in the original dataset. We use the same binary representation of SEX and AGE as in the POS task, following Hovy and Søgaard's setup.
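The filtering and balancing steps above could be implemented along the following lines. This is a sketch under assumptions: each review is taken to be a dict that already carries a language label and confidence from a language identifier, and the keys `lang`, `lang_conf` and `loc`, as well as the function name and the fixed random seed, are hypothetical. The call to the language identifier itself is left out.

```python
import random

def filter_and_balance(reviews, per_location=10000, min_conf=0.9, seed=0):
    """Keep confidently-English reviews, then subsample evenly per location."""
    english = [
        r for r in reviews
        if r["lang"] == "en" and r["lang_conf"] > min_conf
    ]

    # Group the surviving reviews by author location.
    by_loc = {}
    for r in english:
        by_loc.setdefault(r["loc"], []).append(r)

    # Draw the same number of reviews from every location (or all of them
    # if a location has fewer than per_location reviews).
    rng = random.Random(seed)
    balanced = []
    for loc in sorted(by_loc):
        items = by_loc[loc]
        balanced.extend(rng.sample(items, min(per_location, len(items))))
    return balanced

# Toy corpus: counts are illustrative only.
toy = (
    [{"lang": "en", "lang_conf": 0.95, "loc": "US"}] * 5
    + [{"lang": "en", "lang_conf": 0.50, "loc": "US"}] * 2
    + [{"lang": "de", "lang_conf": 0.99, "loc": "DE"}] * 3
    + [{"lang": "en", "lang_conf": 0.99, "loc": "DE"}] * 4
)
sample = filter_and_balance(toy, per_location=3)
```

Balancing by location matters for the later privacy evaluation, since it makes the majority-class baseline for LOC easy to interpret.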
To evaluate the different models, we perform 10-fold cross validation and report test performance in terms of the $F_1$ score for the RATING task, and the accuracy of each discriminator. Note that the discriminator can be applied to test data, where it plays the role of an adversarial attacker, by trying to determine the private attributes of users based on their hidden representation. That is, lower discriminator performance indicates that the representation conveys better privacy for individuals, and vice versa.
Results Table 3 shows the results of the different models. Note that all the private attributes can be easily detected in BASELINE, with results that are substantially higher than the majority class, although AGE and SEX are less well captured than LOC. The ADV trained models all maintain the task performance of the BASELINE method, yet they clearly have a substantial effect on the discrimination accuracy. The privacy of SEX and LOC is substantially improved, leading to discriminators with performance close to that of the majority class, i.e., the representation conveys little information about these attributes. AGE proves harder, although our technique still leads to privacy improvements. Note that AGE appears to be related to the other private attributes, in that its privacy is improved when optimising an adversarial loss for the other attributes (SEX and LOC). Overall, these results show that our approach learns hidden representations that hide much of the personal information of users, without affecting the sentiment task performance. This is a surprising finding, which augurs well for the use of deep learning as a privacy preserving mechanism when handling text corpora.
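The comparison between discriminator accuracy and the majority class, used above as the privacy measure, can be sketched as follows. The function names and the `leakage` field (accuracy in excess of the majority-class rate) are hypothetical conveniences introduced here for illustration; the paper itself reports the two accuracies side by side.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

def attacker_report(gold, pred):
    """Compare an attacker's accuracy on a private attribute to the baseline."""
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    baseline = majority_baseline(gold)
    return {
        "accuracy": accuracy,
        "majority": baseline,
        # Values near zero suggest the representation reveals little
        # beyond the label prior.
        "leakage": accuracy - baseline,
    }

# Toy attacker predictions for a binary attribute.
report = attacker_report(gold=["F", "F", "M", "F"], pred=["F", "M", "M", "F"])
```

In this toy case the attacker does no better than always guessing the majority label, which is the behaviour the adversarially trained representations aim for.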

Conclusion
We proposed a novel method for removing model biases by explicitly protecting private author attributes as part of model training, which we formulate as deep learning with an adversarial loss. We evaluated our method on POS tagging and sentiment classification, demonstrating that it results in increased privacy, while also maintaining, or even improving, task performance through increased model robustness.