Differentially Private Language Models Benefit from Public Pre-training

Language modeling is a keystone task in natural language processing. When training a language model on sensitive information, differential privacy (DP) allows us to quantify the degree to which our private data is protected. However, training algorithms which enforce differential privacy often lead to degradation in model quality. We study the feasibility of learning a language model which is simultaneously high-quality and privacy preserving by tuning a public base model on a private corpus. We find that DP fine-tuning boosts the performance of language models in the private domain, making the training of such models possible.


Introduction
Language modeling, the task of assigning a probability to sequences of words, is a key problem in natural language processing. Modern language models are data-driven, relying on a large corpus of text. Many such models are trained on corpora from a specific domain, such as Wikipedia or news articles (Radford et al., 2019a). These models often suffer from generalization issues when used to model language from a different domain. This motivates the use of model fine-tuning, in which the weights of a pre-trained language model are tuned by gradient descent on a second dataset of interest (Radford et al., 2019a; Devlin et al., 2019; Liu et al., 2019).
In some cases, we would like to fine-tune our model with respect to a dataset containing private information. As such, there is an obligation to preserve the privacy of individuals who contribute text to the private training corpus. For example, training a medical chat-bot may require learning a language model from transcribed patient-doctor conversations; it would be critical that this model not expose sensitive information about the patients whose conversations are used as training data. In recent years, differential privacy (DP) has been a key quantitative measure of privacy which allows one to use aggregate statistical information about a dataset while preserving the privacy of its individual datapoints. (All authors contributed equally. Code: https://github.com/dylan-slack/Finetuning-DP-Language-Models)
In the case of language modeling, we are interested in preserving the privacy of individuals who contribute text to a private corpus. As each individual who contributes to this dataset could potentially contribute several sentences, our notion of privacy is group differential privacy (Dwork and Roth, 2014), in which all sentences from a single individual are grouped. In practice, group DP is equivalent to DP with re-scaled parameters. A potential limitation of this approach is that the number of contributed sentences may not be uniform over users, leading to sub-optimal bounds on the privacy guarantee. There has been some success in directly training differentially private language models, but these often require access to large datasets in order to achieve a reasonable level of quality (McMahan et al., 2017). Other work has trained a differentially private base model which was then fine-tuned through active learning on a non-private dataset (Zhao et al., 2019).
We instead train a non-private base model on a large, public dataset, which we then fine-tune on a private, out-of-distribution dataset through differentially private stochastic gradient descent (DPSGD) (Abadi et al., 2016). By doing so, we successfully train a high-quality model which is differentially private with respect to our tuning dataset. Our experimental results show that DP fine-tuning not only boosts the performance of DP language modeling, but makes it possible.

Related Work
Training a feedforward neural network with DP is achievable through the popular DP-SGD algorithm (Abadi et al., 2016). However, this method may lead to significant decreases in the accuracy (or other metrics) of the resulting model. Recent work considers the use of metric privacy for language modeling (Fernandes et al., 2019; Feyisetan et al., 2020), which is a relaxation of differential privacy where noise is instead added to the vector embedding of a word. We leave the exploration of metric privacy for the private fine-tuning task as a direction for future work.
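As a rough illustration of the embedding-level noise idea (a hedged sketch, not the exact mechanism of Fernandes et al. or Feyisetan et al.; the vocabulary, embeddings, and noise distribution here are illustrative assumptions):

```python
import numpy as np

def noisy_word(word, emb, vocab, scale, rng):
    """Perturb a word's embedding with noise, then snap back to the
    nearest vocabulary word. This is the rough shape of metric-privacy
    text mechanisms; real mechanisms calibrate the noise distribution
    to the metric on the embedding space."""
    v = emb[word] + rng.normal(0.0, scale, size=emb[word].shape)
    return min(vocab, key=lambda w: np.linalg.norm(v - emb[w]))

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "car"]
emb = {w: rng.normal(size=8) for w in vocab}

# With zero noise, each word maps back to itself.
assert all(noisy_word(w, emb, vocab, 0.0, rng) == w for w in vocab)
```

Larger noise scales increasingly map a word to a nearby neighbor, which is what yields the (metric-relaxed) privacy guarantee.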
Many high-quality language models rely on some form of recurrent neural architecture, such as RNNs or LSTMs (Sherstinsky, 2018;Hochreiter and Schmidhuber, 1997). In (McMahan et al., 2017), the authors develop a method for training such models while achieving differential privacy. However, this approach requires a large private dataset, and the mechanisms to achieve privacy lead to a significant decrease in model quality.
In (Zhao et al., 2019), the authors attempt to train a language model which is simultaneously differentially private and of high quality. The first solution proposed in (Zhao et al., 2019) is to fine-tune the language model with publicly available data, but as this public data is likely distributed differently than the private data, the resulting model is likely mistuned. The second proposed approach is to augment the training data by actively selecting non-private data instances. This effectively reduces the privacy cost incurred during each training step, but still requires training with potentially out-of-distribution data.
In contrast, our work begins with a pre-trained model which only has access to publicly available data. This base model is then fine-tuned through DPSGD on our private domain of interest, resulting in a model that is both differentially private and tuned with respect to our protected dataset. By tuning a pre-trained public model, we achieve higher quality models without incurring any additional costs to our privacy budget.

Approach
Let D be a publicly available corpus, and P be a protected corpus whose contents we would like to keep private. Denote by X the fixed, shared vocabulary of these corpora. At a high level, our approach is to first train a language model M_D : X^n → [0, 1] on D. In practice, we choose a feedforward architecture for M_D due to limited computing resources. We then fine-tune this model with respect to P by running the DPSGD algorithm (Abadi et al., 2016) on batches of sentences from P.
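A minimal numpy sketch of such a feedforward language model (the dimensions here are illustrative assumptions, not the paper's exact configuration): embed the previous n tokens, concatenate, pass through a ReLU hidden layer, and emit a softmax distribution over the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n, d, h = 1000, 20, 32, 50  # vocab size, context length, embed dim, hidden width

E  = rng.normal(0, 0.1, (V, d))           # token embedding table
W1 = rng.normal(0, 0.1, (n * d, h)); b1 = np.zeros(h)
W2 = rng.normal(0, 0.1, (h, V));     b2 = np.zeros(V)

def next_token_probs(context):
    """context: the n previous token ids -> probability over the next token."""
    x = E[context].reshape(-1)             # concatenate the n embeddings
    z = np.maximum(x @ W1 + b1, 0.0)       # ReLU hidden layer
    logits = z @ W2 + b2
    e = np.exp(logits - logits.max())      # numerically stable softmax
    return e / e.sum()

p = next_token_probs(rng.integers(0, V, n))
assert p.shape == (V,) and abs(p.sum() - 1.0) < 1e-9
```

The actual model has three hidden layers (see appendix A); one is shown here to keep the sketch short.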

(ε, δ) Differential Privacy
Intuitively, an algorithm is (ε, δ)-DP if the output of the algorithm cannot be used to probabilistically determine the presence of a single instance in the database by more than a factor of exp(ε). We additionally allow this constraint to be violated with probability δ, with δ typically being small.
In the case of language modeling, an individual i may possibly contribute s_i ≥ 1 sentences to the private training corpus. To maintain the privacy of said individual, we require that our algorithm satisfy s_i-group differential privacy, meaning our algorithm cannot be used to determine the presence or absence of s_i sentences in the dataset. However, (ε, δ) s_i-group DP is equivalent to (ε/s_i, δ)-DP (Dwork and Roth, 2014). Hence, it is sufficient to consider the somewhat unintuitive notion of preserving the privacy of individual sentences in the training set. Any mechanism satisfying (ε, δ)-DP on individual sentences will then satisfy (ε/γ, δ)-DP with respect to contributing individuals, where γ = max_i {s_i}. Formally, an algorithm A satisfies (ε, δ)-DP if for all datasets D_1, D_2 differing by at most one instance, and for any set S of outcomes, we have

Pr[A(D_1) ∈ S] ≤ exp(ε) · Pr[A(D_2) ∈ S] + δ.

Smaller values of ε indicate a stronger privacy guarantee. We typically think of S as the result of some query on the outcome of A. A more complete treatment of differential privacy is available in (Dwork and Roth, 2014).
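As a concrete toy check of the definition (an illustration, not from the paper): the classical Laplace mechanism on a counting query with sensitivity 1 and noise scale 1/ε satisfies (ε, 0)-DP, so its output densities on neighbouring datasets never differ by more than a factor of exp(ε).

```python
import math

def laplace_pdf(x, mu, b):
    """Density of the Laplace distribution with mean mu and scale b."""
    return math.exp(-abs(x - mu) / b) / (2 * b)

eps = 0.5
b = 1.0 / eps            # Laplace scale for a sensitivity-1 count
count1, count2 = 10, 11  # neighbouring datasets differ by one record

# Check the DP inequality pointwise for the two output densities.
for x in [i / 10 for i in range(-50, 200)]:
    p1 = laplace_pdf(x, count1, b)
    p2 = laplace_pdf(x, count2, b)
    assert p1 <= math.exp(eps) * p2 + 1e-12
    assert p2 <= math.exp(eps) * p1 + 1e-12
```

The same inequality, with the δ slack term, is what DPSGD guarantees for the gradients computed on the private corpus.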

Differentially Private Fine-tuning
Differential privacy is achieved in SGD by adding appropriately scaled noise to the gradient of the loss function. In particular, we fix a noise scale σ² ∈ R and a gradient clipping level C ∈ R. For a batch {x_1, …, x_L} of size L, our loss is the average of the per-example losses ℓ(θ, x_i). For each x_i in our batch, we compute the clipped gradient

g(x_i) = ∇ℓ(θ, x_i) / max(1, ||∇ℓ(θ, x_i)||_2 / C),

which scales the gradient of the loss at x_i to have ℓ_2 norm at most C (Abadi et al., 2016).
We then add appropriately scaled zero-mean Gaussian noise to our gradients:

g̃ = (1/L) (Σ_{i=1}^{L} g(x_i) + N(0, σ²C² I)).

Our gradient signal used in training is this noisy average over the mini-batch, which we use to determine a descent direction as in SGD. Note that our noisy gradient is equal to the true (clipped) gradient in expectation, as we add mean-zero noise.
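A hedged numpy sketch of one such noisy gradient computation (shapes and parameter values are illustrative):

```python
import numpy as np

def dp_gradient(per_example_grads, C, sigma, rng):
    """Clip each per-example gradient to L2 norm at most C, sum,
    add N(0, sigma^2 C^2 I) noise, and average over the batch."""
    L = len(per_example_grads)
    clipped = [g / max(1.0, np.linalg.norm(g) / C)  # scale down if ||g|| > C
               for g in per_example_grads]
    noise = rng.normal(0.0, sigma * C, size=per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / L

rng = np.random.default_rng(0)
grads = [rng.normal(size=5) * 10 for _ in range(4)]  # a batch of 4 raw gradients
g = dp_gradient(grads, C=1.0, sigma=1.1, rng=rng)
assert g.shape == (5,)
```

The descent step itself is then ordinary SGD on this noisy gradient; all interaction with the private data is confined to the per-example gradients passed in.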
As our access to the private data occurs entirely in the calculation of g(x_i), with appropriately chosen parameters this method guarantees that our algorithm respects the specified level of privacy.
For a given noise scale σ, we can fix an acceptable privacy violation level δ and compute the resulting privacy parameter ε through the composition theorem proved in (Abadi et al., 2016). In appendix B (figure 2), we plot the (ε, δ)-privacy guarantees for various settings of σ. As expected, for a fixed δ, more noise (greater σ) results in a tighter privacy guarantee (smaller ε).
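For intuition about the σ-to-ε relationship, here is the classical analytic Gaussian-mechanism bound for a single noisy release (a textbook bound from Dwork and Roth, not the tighter moments-accountant composition the paper actually uses over many steps):

```python
import math

def gaussian_mechanism_eps(sigma, delta):
    """Per-release epsilon for Gaussian noise with std sigma times the
    sensitivity (classical analytic bound; valid when the result is < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) / sigma

# More noise -> smaller (stronger) epsilon, matching the trend in figure 2.
assert gaussian_mechanism_eps(10.0, 1e-5) < gaussian_mechanism_eps(5.0, 1e-5)
```

The moments accountant additionally exploits subsampling (the batch rate q) and tight composition over T steps, giving far better guarantees than naively composing this per-release bound.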
Throughout this section, we have assumed a maximum individual contribution size of γ = 1. When γ > 1, the only necessary change is a post-processing scaling ε → ε/γ, as ε is computed from parameters which are independent of γ.

Datasets
For our public dataset, we choose the Brown corpus (Francis and Kucera, 1979), as it is a fairly large corpus designed to represent modern English. For our private dataset, we use the Reddit comments dataset (Reddit, 2019). While this corpus is not truly private, we feel it represents the type of language data one might be interested in protecting: written language generated by individual users which likely contains personal information. We randomly select a subset of 10k comments for private training data and 5k comments for development and testing. For more details, see Appendix C.

Models and Evaluation
For our language models, we consider two feedforward architectures: a small network and a large network, each with three hidden layers but with varying numbers of nodes (see appendix A for details). For both architectures, we train three baseline models:
• A non-private model trained only on the public corpus.
• A non-private model trained only on the private corpus.
• A non-private model pre-trained on the public corpus and fine-tuned on the private corpus.
For each architecture, we compare these baseline models to a private model which is pre-trained on the public corpus and fine-tuned on the private corpus. For the private models, we hold δ = 1e−5 and set the gradient clipping level to 1.0. We train each private model with σ = 1.1 and with σ = 0.1. We also fine-tune OpenAI's pre-trained GPT-2 (Radford et al., 2019b) non-privately on both Brown and Reddit. For each model, we report perplexity scores.
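Perplexity is the exponentiated average negative log-likelihood the model assigns to the test tokens; a minimal sketch:

```python
import math

def perplexity(token_probs):
    """token_probs: the model probability assigned to each observed test token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that spreads mass uniformly over a vocabulary of size V
# has perplexity exactly V; lower perplexity means a better model.
V = 50
assert abs(perplexity([1.0 / V] * 100) - V) < 1e-9
```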

Results
GPT-2 Fine-tuning The GPT-2 model fine-tuned for three epochs on the Brown training set scored 40.0 perplexity on the held-out test set. The GPT-2 model fine-tuned for the same number of epochs on the Reddit training set scored 45.14 perplexity on its held-out test set.
Small Feedforward Neural Network Next, we trained and evaluated the small feedforward neural network using the evaluation scheme from section 4.2. Figure 1(a) shows the test-set perplexity for each of our models as a function of training iterations. We observe that each of the base non-private models converges at roughly the same rate, but the models trained on the Brown corpus converge to a lower perplexity than those trained on the Reddit corpus. We also note that the fine-tuned models achieve a significantly lower perplexity in fewer iterations, even with the inclusion of differential privacy mechanisms. The increase in perplexity seen in the base Reddit model may be indicative of overfitting.
Large Feedforward Neural Network Next, we train and evaluate the large feedforward neural network. The results can be found in figure 1(b). We found that the larger models performed similarly to the smaller ones. However, the larger model does significantly outperform its smaller counterpart when trained and evaluated on 10,000 comments sampled from the Reddit dataset. This can be seen when comparing figures 1(a) and 1(b): the "Reddit 10k / Reddit 10k" curve reaches a much lower value much sooner for the larger model. Another difference is that the larger model failed to reach finite perplexity when fine-tuned on Reddit 10k in a differentially private way with noise scale 1.1, while the smaller model succeeded.

Analysis
Fine-tuning improves DP perplexity We summarize the perplexities of our final small and large models in table 1 in the appendix. A σ² of zero indicates non-private training, while σ² > 0 indicates private training, with privacy increasing for larger σ². We additionally provide the ε values for the private models in figure 4. The perplexity scores for both the small and large feedforward language models are orders of magnitude worse than those of the GPT-2 models, indicating that they are not competitive with state-of-the-art language models. However, our results indicate that pre-training may significantly improve the perplexity of a differentially private language model. We were unsuccessful in training a differentially private model on the Reddit data alone, as all models tested gave unreasonably high perplexities (i.e., useless models). When DP fine-tuning was used to create a private language model for this domain, our small model outperformed the baseline models (except for its non-private equivalent). This indicates that pre-training may be highly valuable in facilitating the training of DP language models.
Qualitative Analysis We provide a sample of sentences generated from models fine-tuned on the Reddit 10k data set in table 2 in the appendix.
Aside from the state-of-the-art GPT-2 model, neither the small nor the large feedforward network is able to generate coherent sentences. Additionally, there is no discernible difference between the various levels of private fine-tuning. This is likely because feedforward neural networks are not strong language models. Even so, we still observe the benefits of pre-training for private fine-tuning with such models.

Conclusions
Training neural models with differential privacy often significantly degrades model performance. However, differential privacy could prove crucial when doing language modeling on private datasets. Our work shows that DP fine-tuning not only boosts the performance of DP language modeling, but makes it possible. We also compared our experiments across two different model sizes and found that increasing the model size while decreasing the number of training epochs does not significantly impact the results in the differentially private transfer learning scenario. Future research could experiment with stronger model architectures (e.g., LSTMs, transformers) instead of plain feedforward neural networks, as well as longer training, in order to increase performance.

A Architectures
We consider two language model architectures. We first use a feedforward neural network as our language model with three hidden layers consisting of 500, 250, and 50 nodes respectively (the "small" language model). Recent work suggests large language models may produce better results more quickly than smaller models (Li et al., 2020). Though that work considers transformer models, we also investigate training a larger feedforward neural network with three hidden layers consisting of 10,000, 5,000, and 1,000 nodes (the "large" language model), in the hope of speeding up differentially private training and gaining better performance.
For both models, we condition on the 20 previous tokens. We trained the public models using the Adam optimizer with a learning rate of 1e−3. To train the private models, we used the DPSGD optimizer from (Waites, 2019). We used the ReLU activation function on all hidden nodes and the softmax function on the output layer.
Lastly, we trained the small language model for 5 epochs during pre-training and 5 epochs during fine-tuning. We trained the large language model for 2 epochs during pre-training and 2 epochs during fine-tuning.

B (ε, δ)-Privacy Guarantees

Figure 2: (ε, δ)-privacy guarantees for q = 10⁻³, T = 10⁵, computed using the moments accountant (Abadi et al., 2016). Here, σ is a noise-scale parameter specified by the user. This helps us select a noise scale appropriate to a given application setting.

C Dataset Sizes
In figure 3, we provide the number of tokens used for training in each data set.

Figure 3: Number of training tokens in each dataset (columns: Dataset, Tokens).

Figure 4: The trade-off between ε and test perplexity for the small and large models from figure 1. We hold δ = 1e−5 and set the gradient clipping level to 1.0. We include the lowest test perplexity for each model. Recall that the large model with σ = 1.1 never converged to finite perplexity and is denoted NA.