Understanding Unintended Memorization in Language Models Under Federated Learning

Recent works have shown that language models (LMs), e.g., for next word prediction (NWP), have a tendency to memorize rare or unique sequences in their training data. Since useful LMs are often trained on sensitive data, it is critical to identify and mitigate such unintended memorization. Federated Learning (FL) has emerged as a novel framework for large-scale distributed learning tasks. It differs in many aspects from the well-studied central learning setting, where all the data is stored at a central server and minibatch stochastic gradient descent is used to conduct training. This work is motivated by our observation that NWP models trained under FL exhibited a remarkably lower propensity for such memorization than those trained in the central learning setting. We therefore initiate a formal study to understand the effect of different components of FL on unintended memorization in trained NWP models. Our results show that several differing components of FL play an important role in reducing unintended memorization. First, we discover that the clustering of data according to users, which happens by design in FL, has the most significant effect in reducing such memorization. Using the Federated Averaging optimizer with larger effective minibatch sizes for training causes a further reduction. We also demonstrate that training in FL with a user-level differential privacy guarantee results in models that can provide high utility while being resilient to memorizing out-of-distribution phrases with thousands of insertions across over a hundred users in the training set.


Introduction
There is a growing line of work (Fredrikson et al., 2015; Wu et al., 2016; Shokri et al., 2017; Carlini et al., 2018; Song and Shmatikov, 2019) demonstrating that neural networks can leak information about the underlying training data in unexpected ways. Many of these works show that language models (LMs), which include commonly-used next word prediction (NWP) models, are prone to unintentionally memorizing rarely-occurring phrases in the data. Large-scale LM training often involves training over sensitive data, and such memorization can result in blatant privacy leaks (e.g., Munroe, 2019). Thus, it is crucial to measure such memorization in trained LMs, and to identify mitigation techniques that ensure the privacy of the training data.
The framework of Federated Learning (FL) (McMahan et al., 2017a; McMahan and Ramage, 2017) has emerged as a popular approach for training neural networks on a large corpus of decentralized on-device data (e.g., Konečný et al., 2016; Konecný et al., 2016; Bonawitz et al., 2017; Hard et al., 2018; Bonawitz et al., 2019). FL operates in an iterative fashion: in each round, sampled client devices receive the current global model from a central server and compute an update on their locally-stored data, and the server aggregates these updates, e.g., using the Federated Averaging (FedAvg) algorithm (McMahan et al., 2017a), to build a new global model. A hallmark of FL is that each participating device only sends model weights to the central server; raw data never leaves the device, remaining locally cached. This, by itself, is not sufficient to provide formal privacy guarantees for the training data. However, this work is motivated by the observation (described in detail in Section 3) that NWP models trained in the canonical FL setting exhibited resilience to memorizing rare phrases in spite of hundreds of occurrences in the training data. Note that FL differs in many aspects from the well-studied (Shokri et al., 2017; Carlini et al., 2018; Song and Shmatikov, 2019) central learning setting, where all the data is stored at a central server and minibatch stochastic gradient descent (SGD) is used to conduct training. While training NWP models via central learning, we observed that phrases with even tens of occurrences were easily memorized, in line with prior work (Carlini et al., 2018) that showed the propensity of such models to memorize phrases with even a single occurrence in the training set. Thus, we initiate a formal study to understand the effect of the different components of FL, compared to the central learning setting, on unintended memorization in trained NWP models.

Figure 1: An illustration of our federated secret-sharer framework, using the canary "My SSN is 123-45-6789". (a) A user selected as a secret sharer for a canary. (b) An example in a secret sharer's local dataset being replaced by the canary.
We also study the extent to which a guarantee of Differential Privacy (DP) (Dwork et al., 2006c,a) reduces such memorization. DP has become the standard for performing learning tasks over sensitive data, and has been adopted by companies like Google (Erlingsson et al., 2014; Bittau et al., 2017; Erlingsson et al., 2020), Apple (Apple, 2017), Microsoft (Ding et al., 2017), and LinkedIn (Rogers et al., 2020), as well as the US Census Bureau (Kuo et al., 2018). Intuitively, DP prevents an adversary from confidently making conclusions about whether any particular user's data was used to train a model, even while having access to the model and arbitrary external side information.
The Federated Secret Sharer: We build on the "secret sharer" framework (Carlini et al., 2018) that was designed to measure the unintended memorization in generative models. At a high level, out-of-distribution examples (called canaries) are inserted into a training corpus, and a model trained on this corpus is then evaluated using various techniques to measure the extent to which the model has memorized the canaries. Since datasets in FL are inherently partitioned according to users, we adapt the secret sharer framework to the FL regime by introducing two parameters to control the presence of a canary in such settings. An illustration of our federated secret sharer framework is shown in Figure 1. Given a canary with parameters p_u and p_e, we let p_u be the probability with which each user in a dataset is selected to be a "secret sharer" of the canary (Figure 1a), whereas p_e denotes the probability with which each example in such a secret sharer's data is replaced by the canary (Figure 1b). We use Poisson sampling for both user selection and example replacement. The secret sharer selection phase precedes canary insertion to model real-world settings where occurrences of user-specific unique or rare out-of-distribution canaries are typically limited to a small group of users, but such users can exhibit high usage of those canaries.
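To make the two-stage sampling concrete, the following is a minimal sketch of the canary-insertion procedure described above, assuming per-user datasets are lists of sentences. The function and variable names (insert_canary, user_datasets) are our own illustrative choices, not from any released code.

    import random

    def insert_canary(user_datasets, canary, p_u, p_e, seed=0):
        # Stage 1 (Figure 1a): Poisson-sample secret sharers; each user is
        # selected independently with probability p_u.
        # Stage 2 (Figure 1b): each example of a selected user is replaced
        # by the canary independently with probability p_e.
        rng = random.Random(seed)
        secret_sharers = [u for u in user_datasets if rng.random() < p_u]
        for u in secret_sharers:
            user_datasets[u] = [canary if rng.random() < p_e else ex
                                for ex in user_datasets[u]]
        return secret_sharers

    # Example: 1000 users with 250 sentences each; each user becomes a
    # secret sharer w.p. 1%, and 10% of a sharer's sentences are replaced.
    users = {f"user_{i}": [f"sentence {j}" for j in range(250)]
             for i in range(1000)}
    sharers = insert_canary(users, "My SSN is 123-45-6789", p_u=0.01, p_e=0.1)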
Contributions: Our empirical evaluations demonstrate the following key contributions. First, we observe that clustering training data according to users, which happens by design in distributed learning settings like FL, has a significant effect in reducing unintended memorization for NWP models. Next, given a dataset partitioned by users, we show that switching the optimizer from SGD to Federated Averaging and increasing the effective minibatch size provides a further reduction in such memorization. Lastly, we demonstrate that training in FL with user-level differential privacy (DP) results in models that can provide comparable utility while being resilient to memorizing canaries with thousands of insertions spread across over a hundred users in the training set. Prior work (Carlini et al., 2018) has shown that models trained with record-level DP do not exhibit unintended memorization for a single insertion of a canary. We provide evidence of models being resilient to memorizing canaries at orders-of-magnitude higher insertion counts, under the stronger user-level notion of privacy.

Related Work
Apart from (Carlini et al., 2018), which this work builds upon, other works (Song and Shmatikov, 2019) have also studied memorization in generative text models. The FL paradigm, which is a major focus of this work, has been used to train multiple production-scale models (Hard et al., 2018; Ramaswamy et al., 2019; Chen et al.). Kairouz et al. (2019) provides an excellent overview of the state-of-the-art in the field, along with a suite of interesting open problems. This work also studies the effectiveness of a user-level DP guarantee in reducing unintended memorization. While many works on DP focus on record-level DP guarantees (which usually cannot be directly extended to strong user-level DP guarantees), recent works (e.g., McMahan et al., 2017b; Jain et al., 2018; Augenstein et al., 2020; Andrew et al., 2021) have designed techniques tailored to user-level DP guarantees.

Contrasting Federated Learning with Central Learning
Now, we take a deeper look at how the well-studied central learning framework differs from the canonical setting of FL for LM training. We are interested in differences that might have an effect on unintended memorization. We identify three such components: (1) Data Processed per Update: Central learning typically ingests data as records/sentences. On the other hand, FL operates at the granularity of a user, with each user having their own set of sentences locally. Typically, the amount of data processed per model update in central learning is much smaller in comparison to FL.
(2) Learning Technique: In central learning, the model is updated via SGD on a minibatch of records. In the canonical setting of FL, a model update typically corresponds to Federated Averaging over a minibatch of users: an average of the differences between the current model and the model obtained after several local SGD steps on each user's data (see the sketch below).
(3) Independent and Identically Distributed (IID) Data: To reduce variance in learning, the data in central learning is shuffled before training (and/or each update involves a randomly sampled minibatch). Thus, each minibatch can be treated as drawn IID from the data. Datasets in FL are naturally grouped according to potentially heterogeneous users, resulting in non-IID data even though each minibatch of users may be randomly sampled.

To provide the strongest user-level DP while obtaining high utility, we conduct experiments only for our largest minibatch size of 5K users. The results are presented in Table 2.
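To make the contrast between the update rules in (1) and (2) concrete, here is a minimal sketch of one model update under each paradigm. The helpers grad_fn, sample_records, sample_users, and local_sgd are schematic placeholders under our own assumptions, not the paper's implementation.

    import numpy as np

    def sgd_update(weights, sample_records, grad_fn, lr=0.1):
        # Central learning: one SGD step on a minibatch of records/sentences.
        return weights - lr * grad_fn(weights, sample_records())

    def fedavg_update(weights, sample_users, local_sgd, server_lr=1.0):
        # Canonical FL: average the per-user deltas, where each delta is the
        # difference between the model after several local SGD steps on one
        # user's data and the current global model (McMahan et al., 2017a).
        deltas = [local_sgd(weights, user) - weights for user in sample_users()]
        return weights + server_lr * np.mean(deltas, axis=0)

Note that one FedAvg update aggregates the entire local datasets of a minibatch of users, which is why its effective minibatch size (in records) is far larger than a typical SGD minibatch.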

Discussion
Clustering data according to users: The results from our experiments strongly indicate that clustering data according to users significantly reduces unintended memorization. This is evident when considering the measurements in Table 1 in pairs where the only differing component is whether the data is IID or not. The number of epochs taken over the dataset to train the models on which we measure memorization is the same for any particular minibatch size, irrespective of whether the data is IID. Thus, the number of times the inserted canaries were encountered during training is the same. However, the amount of memorization observed is always lower when the data is Non-IID. This effect is more pronounced in the settings where FedAvg is used as the training method. For instance, for a minibatch size of b_u = 500 users, training with FedAvg on IID data results in 66 (56) canaries showing up as memorized via the RS (BS) method. The same configuration on Non-IID data results in the RS method classifying only 21 canaries as memorized, and the BS method not being able to extract any canary even after 8000 rounds of training. In addition to the data being clustered, the inserted canaries are clustered as well, which we conjecture plays a crucial role in reducing such memorization.
Varying data per update: Fixing the optimizer to SGD/FedAvg and the data to be IID/non-IID, we do not see any significant effect of varying the minibatch size on such memorization.

Training non-IID user data with FedAvg and larger effective minibatches: The smallest minibatch size for our FedAvg experiments is 500 users, and as each user contains ≈250 records, the effective minibatch size is ≈125K records. In comparison, the largest minibatch size for which we are able to conduct SGD training is 256 records. Focusing on the results in Table 1 using Non-IID data, we find that using FedAvg with larger effective minibatches per round causes a significant reduction in unintended memorization when compared to training with SGD and smaller minibatches.

Training with DP-FedAvg in FL: Our aim is to test the extent to which NWP models trained with DP-FedAvg in FL are resilient to such memorization. By definition, a user-level DP guarantee is intended to be resilient to changes w.r.t. any one user's data. Some of our inserted canaries are shared by ∼100 users (with ∼24.5K occurrences in the training data). In spite of such high levels of canary insertion, and our FL models already exhibiting the least amount of unintended memorization (Table 1), we see that training with DP-FedAvg results in significantly reduced memorization. Our results are noteworthy as, in spite of exhibiting extremely low unintended memorization, our DP model also provides utility comparable to a model trained via FedAvg, along with a user-level guarantee of (18.8, 10⁻⁷)-DP. While strengthening the privacy guarantee of DP-FedAvg by increasing the noise added to the model update in each training round can further reduce such memorization, it can also start to significantly affect model utility. Designing methods that improve the privacy-utility trade-off is an interesting direction, which is beyond the scope of this work.

Conclusion
In this work, we conduct a formal study to understand the effect of the different components of Federated Learning (FL) on unintended memorization in trained next word prediction (NWP) models, as compared to the well-studied central learning setting. From our results, we observe that the components of FL exhibit a synergy in reducing such memorization. To our surprise, user-based clustering of data (which occurs as a natural consequence of the FL setting) has the most significant effect on the reduction. Moreover, training using Federated Averaging with larger effective minibatches reduces such memorization further. Lastly, we observe that training in FL with a user-level differential privacy guarantee results in models that can provide comparable utility while being resilient to memorizing canaries with thousands of insertions across over a hundred users in the training set.

Recent work (Karimireddy et al., 2019) has shown that, in general, such heterogeneity in the training data can result in slower and unstable convergence due to factors such as "client-drift". For all of our experiments with non-IID data, we observe that the utility of the trained models is comparable to those trained on IID data, and we leave further exploration of why client-drift may not play a significant role in our experiments to future work. Next, while our extensive evaluation is for a practical NWP model on a real-world benchmark dataset, the degree of unintended memorization in general can depend on the model architecture and the dataset used for training. Lastly, the secret-sharer line of methods for measuring unintended memorization operates at the granularity of a record. For future work, it will be interesting to design stronger attacks targeting data at the granularity of a user, and to measure the resilience of models trained via FL against such memorization.

A.1 Measuring Unintended Memorization
Following (Carlini et al., 2018), this work assumes a threat model of curious or malevolent users having black-box query access to models, in that they see only the models' output probabilities (or logits). We also assume that users can adaptively query models multiple times, thus posing the threat of extracting uncommon word combinations. We now describe the Secret Sharer framework of (Carlini et al., 2018). First, random sequences called canaries are inserted into the training data. The canaries are constructed from a fixed format sequence. For instance, to instantiate the framework for a character-level model, the format could be "My SSN is xxx-xx-xxxx", where each x takes a random value from the digits 0-9. Next, the target model is trained on the modified dataset containing the canaries. Lastly, methods like Random Sampling and Beam Search (both formally defined in Section 3) are used to efficiently measure the extent to which the model has "memorized" the inserted random canaries, and whether it is possible for an adversary with partial knowledge to extract them. For instance, if a canary is classified as memorized via our Beam Search method, then given black-box access to the trained model, an adversary knowing only the first word of the inserted canary can extract it completely with a simple beam search.
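As a small illustration of the canary construction just described, the following sketch fills a fixed format with random digits. The helper name make_canary is ours, not from the original framework's code.

    import random

    def make_canary(fmt="My SSN is xxx-xx-xxxx", rng=random.Random(42)):
        # Fill every placeholder 'x' in the format with a uniformly random
        # digit; all other characters are kept verbatim.
        return "".join(str(rng.randrange(10)) if c == "x" else c for c in fmt)

    print(make_canary())  # a randomly filled, SSN-formatted canary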

A.2 Differential Privacy
To establish the notion of differential privacy (Dwork et al., 2006c,b), we first define neighboring datasets. We refer to a pair of datasets D, D′ as neighbors if D′ can be obtained by the addition or removal of all the examples associated with one user from D; this notion of neighbors yields a user-level DP guarantee.

Definition A.1 (Differential privacy (Dwork et al., 2006c,b)). A randomized algorithm A is (ε, δ)-differentially private if, for any pair of neighboring datasets D and D′, and for all events S in the output range of A, we have

Pr[A(D) ∈ S] ≤ e^ε · Pr[A(D′) ∈ S] + δ
where the probability is taken over the random coins of A.
For meaningful privacy guarantees, ε is assumed to be a small constant, and δ ≪ 1/|D|. To train models with DP guarantees, we follow the variant of DP Federated Averaging (DP-FedAvg) (McMahan et al., 2017b) used in (Augenstein et al., 2020), where the only change is sampling fixed-size minibatches in each training round.

A.3 Differentially Private Federated Averaging
We now present the technique used to train our DP model in FL. It closely follows the DP-FedAvg technique in (McMahan et al., 2017b), in that per-user updates are clipped to have a bounded L₂ norm, and calibrated Gaussian noise is added to the weighted average update used for computing the model to be sent in the next round. A slight difference between the DP-FedAvg algorithm in (McMahan et al., 2017b) and our approach is the way in which client devices are sampled to participate in a given federated round of computation. DP-FedAvg uses Poisson sampling, where for each round, each user is selected independently with a fixed probability. In this work (following (Augenstein et al., 2020)), we instead use fixed-size federated rounds, where a fixed number of users is randomly sampled to participate in each round. For reference, we provide pseudo-code for the technique in Algorithm 1.

Other optimizers: For Adam, we set the learning rate to 10⁻⁴. First, we observe that using Momentum increases the observed unintended memorization while yielding utility similar to SGD. On the other hand, we see that using Adam decreases such memorization, but the utility of the models is also noticeably reduced compared to SGD. We observe a similar trend when Adam is combined with FedAvg.
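Since Algorithm 1 is not reproduced here, the following is a minimal sketch of one fixed-size DP-FedAvg round as described above. The local_sgd signature and the default hyperparameter values are illustrative assumptions; the clip norm of 0.2 and cohort size of 5000 mirror the settings discussed in this appendix.

    import numpy as np

    def dp_fedavg_round(weights, all_users, local_sgd, clip_norm=0.2,
                        noise_multiplier=1.0, users_per_round=5000,
                        rng=np.random.default_rng(0)):
        # Fixed-size sampling (Augenstein et al., 2020): a uniformly random
        # cohort of exactly users_per_round users, instead of the per-user
        # Poisson sampling of McMahan et al. (2017b).
        cohort = rng.choice(all_users, size=users_per_round, replace=False)
        clipped = []
        for user in cohort:
            delta = local_sgd(weights, user) - weights  # per-user model delta
            norm = np.linalg.norm(delta)
            clipped.append(delta * min(1.0, clip_norm / max(norm, 1e-12)))
        avg = np.mean(clipped, axis=0)
        # Calibrated Gaussian noise on the averaged update; its scale grows
        # with the clip norm and shrinks with the cohort size.
        noise = rng.normal(0.0, noise_multiplier * clip_norm / users_per_round,
                           size=avg.shape)
        return weights + avg + noise

Setting noise_multiplier to 0 in this sketch recovers the clipping-only "FedAvg+Clip" ablation discussed below.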
Training with DP-FedAvg on IID data: Now, we present in Table 5 the results of using DP-FedAvg as the optimizer on IID data consisting of synthetic users. We observe that using DP-FedAvg provides a significant reduction in unintended memorization compared to using FedAvg.
Table 5: Unintended memorization and utility for models trained with (18.8, 10⁻⁷)-DP using DP-FedAvg on IID data and 5K users/round for 100 epochs.

Only Clipping: To bound the contribution of each participating user, DP-FedAvg clips each user update before aggregating them over a minibatch of users and adding calibrated noise to guarantee DP. Following (Carlini et al., 2018), we present results (rows containing "FedAvg+Clip" in Table 6) for the case when user updates are clipped to a value of 0.2, but no noise is added. This results in an (∞, δ)-DP guarantee for any δ ∈ (0, 1), which is vacuous as a privacy guarantee. However, this experiment helps us observe the extent to which clipping alone reduces the propensity for unintended memorization exhibited by trained models. With IID data, for the setting evaluated in Section 3 (8000 rounds, i.e., 100 epochs for a minibatch size of 5000 users), we observe that the RS method extracts 58 canaries with clipping, which is 7 fewer than without clipping. The BS method extracts 49, which is 9 fewer than without clipping. For Non-IID data, we observe a similar but more pronounced trend: the RS method extracts 11 canaries with clipping, which is 15 fewer than without clipping, whereas the BS method is not able to extract any of the inserted canaries.
Non-IID    FedAvg        26, 2    24.5, 58.2
           FedAvg+Clip   11, 0    24, 61.5
Table 6: Unintended memorization, lowest (by insertion frequency) canary configuration memorized, and utility for models trained with Clipping/DP and 5000 users/round for 100 epochs. The models trained with DP-FedAvg satisfy (18.8, 10⁻⁷)-DP.

Evaluating for Same Training Epochs: In Table 7, we provide the results of evaluating models trained for the same number of epochs over the training data. For the runs using SGD, we start with a batch size of 32 records and a tuned learning rate of 0.005, and we increase the learning rate by ≈√2 for every 2× increase in the batch size. For all the experiments with FedAvg, we find that using a constant learning rate provides the best utility across the different batch sizes, and thus we keep it fixed.

For the models trained with SGD, for both IID and non-IID data, we observe that unintended memorization remains comparable across batch sizes. However, we see a decrease in utility as the batch size increases. The decrease in utility is also observed for models trained using FedAvg, but we additionally observe a significant drop in such memorization when training is performed with at least 1000 users per round. Moreover, once training involves at least 2000 users on non-IID data, both the RS and BS methods are unsuccessful in classifying any of the 90 inserted canaries as memorized.
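For concreteness, the SGD learning-rate tuning rule described above can be written as follows; the helper name sgd_lr is ours.

    import math

    def sgd_lr(batch_size, base_batch=32, base_lr=0.005):
        # Scale the tuned learning rate by ~sqrt(2) per doubling of the
        # batch size, starting from 0.005 at a batch size of 32 records.
        return base_lr * math.sqrt(2) ** math.log2(batch_size / base_batch)

    for b in [32, 64, 128, 256]:
        print(b, round(sgd_lr(b), 4))  # 0.005, 0.0071, 0.01, 0.0141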