Private Release of Text Embedding Vectors

Ensuring strong theoretical privacy guarantees on text data is a challenging problem that is usually attained at the expense of utility. To improve the practicality of privacy-preserving text analyses, however, it is essential to design algorithms that better optimize this tradeoff. To address this challenge, we propose a release mechanism that takes any (text) embedding vector as input and releases a corresponding private vector. The mechanism satisfies an extension of differential privacy to metric spaces. Our idea, based on first randomly projecting the vectors to a lower-dimensional space and then adding noise in this projected space, generates private vectors that achieve strong theoretical guarantees on their utility. We support our theoretical proofs with empirical experiments on multiple word embedding models and NLP datasets, achieving in some cases more than 10% gains over the existing state-of-the-art privatization techniques.


Introduction
Privacy has emerged as a topic of strategic consequence across all computational fields. Differential Privacy (DP) is a mathematical definition of privacy proposed by (Dwork et al., 2006). Ever since its introduction, DP has been widely adopted, and as of today it has become the de facto privacy definition in the academic world, with wide adoption in industry as well, e.g., (Erlingsson et al., 2014; Dajani et al., 2017; Team, 2017; Uber Security, 2017). DP provides provable protection against adversaries with arbitrary side information and computational power, allows clear quantification of privacy losses, and satisfies graceful composition over multiple accesses to the same data. In DP, two parameters ε and δ control the level of privacy. Very roughly, ε is an upper bound on the amount of influence a single data point has on the information released, and δ is the probability that this bound fails to hold, so the definition becomes more stringent as ε, δ → 0.
The definition with δ = 0 is referred to as pure differential privacy, and the one with δ > 0 as approximate differential privacy.
Within the field of Natural Language Processing (NLP), the traditional approach to privacy was to apply anonymization techniques such as k-anonymity (Sweeney, 2002) and its variants. While this offers an intuitive way of expressing privacy guarantees as a function of an aggregation parameter k, all such methods are provably non-private (Korolova et al., 2009). Given the sheer increase in data gathering occurring across a multiplicity of connected platforms, much of it via user-generated voice conversations, text queries, or other language-based metadata (e.g., user annotations), it is imperative to advance the development of DP techniques in NLP.
Vector embeddings are a popular approach for capturing the "meaning" of text and a form of unsupervised learning useful for downstream tasks. Word embeddings were popularized via embedding schemes such as WORD2VEC (Mikolov et al., 2013), GLOVE (Pennington et al., 2014), and FASTTEXT (Bojanowski et al., 2017). There is also a growing literature on creating embeddings for sentences, documents, and other textual entities, as well as embeddings in other domains such as computer vision (Goodfellow et al., 2016).
Recent works such as (Fernandes et al., 2019; Feyisetan et al., 2019, 2020) have attempted to directly adapt the methods of DP to word embeddings by borrowing ideas from the privacy methods used for map location data (Andrés et al., 2013). In the DP literature, one standard way of achieving privacy is to add properly calibrated noise to the output of a function (Dwork et al., 2006). This is also the premise behind these previously proposed DP-for-text techniques, which are based on adding noise to the vector representation of words in a high-dimensional embedding space, followed by additional post-processing steps. The privacy guarantee of such a method is quite straightforward. However, the main issue is that the magnitude of the DP noise scales with the dimensionality of the vector, which leads to a considerable degradation in utility when these techniques are applied to vectors produced by popular embedding techniques. In this paper, we seek to overcome this curse of dimensionality arising from the differential privacy requirement. Also, unlike previous results that focused on word embeddings, we address the general problem of privately releasing vector embeddings, making our scheme more widely applicable.

Related Work
Vector representations of words, sentences, and documents have all become basic building blocks in NLP pipelines and algorithms. Hence, it is natural to consider privacy mechanisms that target these representations. Most relevant to this paper is the privacy mechanism proposed in (Feyisetan et al., 2020), which works by computing the vector representation x of a word in the embedding space, applying noise N calibrated to the global metric sensitivity to obtain a perturbed vector v = x + N, and then swapping the original word with another word whose embedding is closest to v. (Feyisetan et al., 2020) showed that this mechanism satisfies the (ε, 0)-Lipschitz privacy definition. However, the issue with this mechanism is that the magnitude (norm) of the added noise is proportional to the dimension d, which we avoid by projecting the vectors down before the noise addition step. Our focus here is also more general and not just on word embeddings. Additionally, we provide theoretical guarantees on our privatized vectors. We experimentally compare with this approach.
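For concreteness, the following is a minimal sketch of this kind of nearest-neighbor noise mechanism (not the authors' exact implementation); the embedding table emb and the function name are our own illustrative choices. The sampler uses the standard fact that a density proportional to exp(−ε‖z‖) in d dimensions has a Gamma(d, 1/ε)-distributed norm.

```python
import numpy as np

def m1_privatize(word, emb, eps):
    """Sketch of a nearest-neighbor noise mechanism in the style of
    (Feyisetan et al., 2020): perturb the word's vector with noise of
    density p(z) ∝ exp(-eps * ||z||), then release the vocabulary word
    whose embedding is closest to the perturbed point."""
    x = emb[word]                      # emb: illustrative {word: np.ndarray}
    d = x.shape[0]
    u = np.random.normal(size=d)       # uniform direction on the sphere
    u /= np.linalg.norm(u)
    r = np.random.gamma(shape=d, scale=1.0 / eps)  # ||noise|| ~ Gamma(d, 1/eps)
    v = x + r * u
    words = list(emb)
    dists = [np.linalg.norm(emb[w] - v) for w in words]
    return words[int(np.argmin(dists))]
```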
The privacy mechanisms of (Fernandes et al., 2019; Feyisetan et al., 2019) are also based on similar noise addition ideas. However, (Fernandes et al., 2019) utilize the Earth mover's metric to measure distances (instead of the Euclidean metric), and (Feyisetan et al., 2019) perturb vector representations of words in a high-dimensional hyperbolic space (instead of a real space). In this paper, we focus on the Euclidean space, as it captures the most common choice of metric space for vector models.
Over the past decade, a large body of work has been developed to design basic algorithms and tools for achieving DP, understanding the privacy-utility trade-offs in different data access setups, and integrating DP with machine learning and statistical inference. We refer the reader to (Dwork and Roth, 2013) for a more comprehensive overview.
Dimensionality reduction for word embeddings using PCA was explored in (Raunak et al., 2019) for computational efficiency purposes. In this paper, we use random projections for dimensionality reduction, which helps reduce the magnitude of noise needed for privacy. Another issue with PCA-like schemes is that there are strong lower bounds (that scale with the dimension d of the vectors) on the amount of distortion needed for achieving differentially private PCA in the local privacy model (Wang and Xu, 2020).
Random projections have been used as a tool to design differentially private algorithms in other problem settings too (Blocki et al., 2012;Wang et al., 2015;Kenthapadi et al., 2013;Zhou et al., 2009;Kasiviswanathan and Jin, 2016).

Preliminaries
We denote [n] = {1, . . . , n}. Vectors are represented in column-wise fashion. We measure the distance between embeddings through the Euclidean metric. For a vector x, ‖x‖ denotes the Euclidean (L2-)norm and ‖x‖₁ denotes its L1-norm. For sets S, T, the Minkowski sum is S + T = {a + b : a ∈ S, b ∈ T}. N(0, σ²) denotes the Gaussian distribution with mean 0 and variance σ².

Privacy Motivations for Text
The privacy concerns around word embedding vectors stem from how they are created. For example, embeddings created using neural models inherit the side effects of unintended memorization that come with such models (Carlini et al., 2019). Similarly, it has been demonstrated that text generation models that encode language representations also suffer from various degrees of information leakage (Song and Shmatikov, 2019; Lyu et al., 2020). While this might not be concerning for off-the-shelf models trained on public data, it becomes important for word embeddings trained on non-public data.
Recent studies (Song and Raghunathan, 2020; Thomas et al., 2020) have shown that word embeddings are vulnerable to three types of attacks: (1) embedding inversion, where the vectors can be used to recreate some of the input training data; (2) attribution inference, where sensitive attributes (such as authorship) of the input data are revealed even when they are independent of the task at hand; and (3) membership inference, where an attacker is able to determine whether data from a particular user was used to train the word embedding model.
The privacy consequences are further amplified depending on the domain of data under consideration. For example, a study by (Abdalla et al., 2020) on word embeddings in the medical domain demonstrated that: (1) they were able to reconstruct up to 68.5% of full names from the embeddings, i.e., embedding inversion; (2) they were able to retrieve sensitive information associated with specific patients in the corpus, i.e., attribution inference; and (3) by using the distance between the vector of a patient's name and a billing code, they could differentiate between patients that were billed and those that were not, i.e., membership inference.
These findings all underscore the need to release text embeddings using a rigorous notion of privacy, such as differential privacy, that preserves user privacy and mitigates the attacks described above.

Background on Differential Privacy.
Differential privacy (Dwork et al., 2006) gives a formal standard of privacy, requiring that, for all pairs of datasets that differ in one element, the distribution of outputs should be similar. In this paper, we use the notion of local differential privacy (LDP) (Kasiviswanathan et al., 2011).
A randomized algorithm A : X → Z is (ε, δ)-local differentially private (LDP) if for any two data points x, x′ ∈ X and all (measurable) sets U ⊆ Z,

Pr[A(x) ∈ U] ≤ e^ε · Pr[A(x′) ∈ U] + δ.

The probability is taken over the random coins of A. Here, we think of δ as being cryptographically small, whereas ε is typically thought of as a moderately small constant. The above definition considers every pair of x and x′ (considered as adjacent for the purposes of DP). The LDP notion requires that a given x has a non-negligible probability of being transformed into any other x′ ∈ X, no matter how unrelated (far apart) x and x′ are. However, for text embeddings, this strong requirement makes it virtually impossible to enforce that the semantics of a word are approximately preserved by the privatized vector (Feyisetan et al., 2020). To address this problem, we work with a modification of the above definition, referred to as Lipschitz (or metric) privacy, that is better suited for metric spaces defined through embedding models. Lipschitz privacy is closely related to LDP, where the adjacency relation is defined through the Hamming metric, but it also generalizes to include Euclidean, Manhattan, and Chebyshev metrics, among others (Chatzikokolakis et al., 2013; Andrés et al., 2013; Chatzikokolakis et al., 2015; Fernandes et al., 2019; Feyisetan et al., 2019, 2020). Similar to differential privacy, Lipschitz privacy is preserved under post-processing and composition of mechanisms (Koufogiannis et al., 2016).

Definition 1 (Lipschitz Privacy (Dwork et al., 2006; Chatzikokolakis et al., 2013)). Let (X, d) be a metric space. A randomized algorithm A : X → Z is (ε, δ)-Lipschitz private if for any two data points x, x′ ∈ X and all (measurable) sets U ⊆ Z,

Pr[A(x) ∈ U] ≤ e^{ε·d(x,x′)} · Pr[A(x′) ∈ U] + δ.

An alternate, equivalent way of stating this is that with probability at least 1 − δ over an outcome a drawn from either A(x) or A(x′), |ln(Pr[A(x) = a] / Pr[A(x′) = a])| ≤ ε·d(x, x′). The key difference between Lipschitz privacy and LDP is that the latter corresponds to a particular instance of the former when the distance function is given by d(x, x′) = 1 for every x ≠ x′.
In this paper, the metric space of interest is defined by embeddings, which organize discrete objects in a continuous real space such that objects that are "similar" result in vectors that are "close" in the embedded space. For the distance measure, we focus on the Euclidean metric, d(x, x′) = ‖x − x′‖, which is known to capture semantic similarity between discrete words in a continuous space.
For a function f : X → R^m, the most basic technique in differential privacy to release f(x) is to answer f(x) + ν, where ν is instance-independent additive noise (e.g., Laplace or Gaussian) with standard deviation proportional to the global sensitivity of the function f.

Definition 2 (Global sensitivity). For a function f : X → R^m over a metric space (X, d), define the global sensitivity of f as

∆_f = sup_{x,x′ ∈ X, x ≠ x′} ‖f(x) − f(x′)‖ / d(x, x′).

Dimensionality Reduction.
Dimensionality reduction is the problem of embedding a set from a high-dimensional space into a low-dimensional space, while preserving certain properties of the original high-dimensional set. Perhaps the most fundamental result for dimensionality reduction is the Johnson-Lindenstrauss (JL) lemma, which states that any set of p points in high dimensions can be embedded into O(log(p)/β²) dimensions while preserving the Euclidean norm of all points within a multiplicative factor between 1 − β and 1 + β. In fact, one can embed an infinite continuum of points into lower dimensions while preserving the Euclidean norms of all points up to a multiplicative distortion. A classical result due to (Gordon, 1988) characterizes the relation between the "size" of the set and the required dimensionality of the embedding on the unit sphere. Before stating the result, we need to introduce the notion of Gaussian width, which captures the L2-geometric complexity of X.

Definition 3 (Gaussian Width). Given a closed set X ⊂ R^d, its Gaussian width ω(X) is defined as:

ω(X) = E_g[sup_{x ∈ X} ⟨g, x⟩], where g ∼ N(0, I_d).

Many popular sets have low Gaussian width (Vershynin, 2016). For example, if X contains unit vectors in R^d that are c-sparse (at most c non-zero entries), then ω(X) = O(√(c log(d/c))). If X contains vectors that are sparse in the L1-sense, say ‖x‖₁ ≤ s for all x ∈ X, then ω(X) = O(s√(log d)). Notice that in all these cases ω(X)² is exponentially smaller than d.
The following is a restatement of the original Gordon's theorem that is better suited for this paper.

Theorem 1 (Gordon's Theorem (Gordon, 1988)). Let β ∈ (0, 1), let X be a subset of the unit d-dimensional sphere, and let Φ ∈ R^{m×d} be a matrix with i.i.d. entries from N(0, 1/m). If m = Ω((ω(X) + √(log(1/γ)))²/β²), then with probability at least 1 − γ,

1 − β ≤ ‖Φx‖ ≤ 1 + β for all x ∈ X.

In particular, for a set of points X ⊂ R^d, the same projection preserves the norms of all points up to a multiplicative factor of 1 ± β. Since for any set X with |X| = p we have ω(X)² = O(log p), the above theorem is a generalization of the JL lemma. By a simple manipulation and adjusting β, Theorem 1 can be restated for preserving inner products.

Corollary 2. Under the setting of Theorem 1, for a set of points X in R^d,

|⟨Φx, Φx′⟩ − ⟨x, x′⟩| ≤ β ‖x‖ ‖x′‖

holds for all x, x′ ∈ X with probability at least 1 − γ, if m = Ω((ω(X) + √(log(1/γ)))²/β²).
The above result also holds if we replace the Gaussian random matrix Φ by a sparse random matrix (Bourgain et al., 2015). For simplicity, we use a Gaussian matrix Φ for projection.
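As a quick numerical illustration of Theorem 1 (the sizes below are arbitrary choices of ours), one can check that a Gaussian matrix with N(0, 1/m) entries nearly preserves the norms of unit vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, p = 10_000, 400, 100                       # illustrative sizes
X = rng.normal(size=(p, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # p points on the unit sphere
Phi = rng.normal(scale=np.sqrt(1.0 / m), size=(m, d))  # entries i.i.d. N(0, 1/m)
proj_norms = np.linalg.norm(X @ Phi.T, axis=1)
# All projected norms should lie in [1 - beta, 1 + beta] for a small beta:
print(proj_norms.min(), proj_norms.max())
```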

Our Approach
The main issue arising in constructing differentially private vector embeddings is that direct noise addition to the vectors (as in (Feyisetan et al., 2020)) requires the L2-norm of the noise vector to scale almost linearly with the dimensionality of the vector. To overcome this dimension dependence, our mechanism is based on the idea of performing a dimensionality reduction and then adding noise to the projected vector. By carefully balancing the dimensionality of the vectors against the magnitude of the noise needed for DP, the mechanism achieves superior performance overall.
We will add noise calibrated to the sensitivity of the dimensionality reduction function f_Φ(x) = Φx. The noise is sampled from the distribution over the output space with density p(z) ∝ exp(−ε‖z‖/∆_f). Sampling from this distribution is simple, as noted in (Wu et al., 2017). The following simple claim (which holds for all functions f) shows that this mechanism satisfies Definition 1. All the missing proofs from this section are collected in Appendix C. Let us first investigate the global sensitivity of f_Φ using Theorem 1. Instead of considering a fixed bound on the global sensitivity, we provide a probabilistic upper bound.
Let β ∈ (0, 1) be a fixed constant. Consider the mechanism that publishes A(x) = Φx + κ, where κ is drawn from the distribution with density p(z) ∝ exp(−ε‖z‖/(1 + β)). Given a set of sensitive words (x_1, . . . , x_n), we can apply A independently to each x_i. Algorithm PRIVEMB summarizes the mechanism. Since each vector is perturbed independently, the algorithm can be invoked locally. We now establish the privacy guarantee of PRIVEMB. The δ factor comes in from Lemma 4 because we only have a probabilistic bound on the global sensitivity, i.e., there exist pairs x, x′ for which the bound of 1 + β on the global sensitivity could fail. For example, imagine a situation where there are n users, each holding a sensitive word (embedding). Given access to a common Φ, they can perturb their words locally and transmit only the perturbed vectors.
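A minimal sketch of the mechanism just described follows (the pseudocode of Algorithm PRIVEMB itself is in the paper; the variable names here are ours). It uses the fact, noted in the utility analysis below, that the norm of κ follows Γ(m, (1 + β)/ε):

```python
import numpy as np

def priv_emb(x, Phi, eps, beta):
    """Release a private version of the embedding vector x in R^d.
    Phi is an m x d matrix with i.i.d. N(0, 1/m) entries, shared by all
    users. Publishes w = Phi @ x + kappa, where kappa has density
    p(z) ∝ exp(-eps * ||z|| / (1 + beta)) over R^m."""
    m = Phi.shape[0]
    u = np.random.normal(size=m)                             # uniform direction
    u /= np.linalg.norm(u)
    r = np.random.gamma(shape=m, scale=(1.0 + beta) / eps)   # ||kappa|| ~ Gamma(m, (1+beta)/eps)
    return Phi @ x + r * u

# Each user perturbs locally with the common Phi, e.g.:
# d, m = 300, 40
# Phi = np.random.default_rng(0).normal(scale=np.sqrt(1.0 / m), size=(m, d))
# w = priv_emb(x, Phi, eps=10.0, beta=0.5)
```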
While the idea behind Algorithm PRIVEMB is simple, it is widely applicable and effective. As an example, consider vector representations of text such as Bag-of-K-grams, which creates representations that are sparse in some very high-dimensional space (say, c-sparse vectors). In this case, even though d could be extremely large, we can project these vectors to a space of dimension ≈ c log(d/c) (due to their low Gaussian width) and add noise in the projected space to achieve privacy. On the other hand, the privacy mechanism of (Feyisetan et al., 2020), with noise magnitude proportional to d, would completely destroy the information in these vectors.

Utility Analysis of Alg. PRIVEMB
We now provide utility performance bounds for Algorithm PRIVEMB. As mentioned earlier, these are the first theoretical utility analyses for any private vector embedding scheme. We start with two important properties of interest, based on distances and inner products, that commonly arise when dealing with text embeddings. Our next result compares the loss of a linear model trained on these private vector embeddings to the loss of a similar model trained on the original vector embeddings. All our error bounds depend on m ≈ ω(Ran(M))², where Ran(M) denotes the range of the embedding model M.
We start with a simple observation about the magnitude of the noise vector. Consider κ drawn from the noise distribution with density p(z) ∝ exp(−ε‖z‖/(1 + β)). The Euclidean norm of κ is distributed according to the Gamma distribution Γ(m, (1 + β)/ε) (Wu et al., 2017), and therefore concentrates tightly around its mean m(1 + β)/ε.

Distance Approximation Guarantee
Our first result compares the distances between the private vectors to the distances between the original vectors.
Proposition 7. Consider Algorithm PRIVEMB. With probability at least 1 − δ, for all pairs x_i, x_j, the distance ‖w_i − w_j‖ approximates ‖x_i − x_j‖ up to a multiplicative 1 ± β factor plus an additive error term that scales with m/ε rather than d/ε.

As a baseline, consider the privatization mechanism proposed by (Feyisetan et al., 2020), which computes a privatized version of an embedding vector x by adding noise N to the original vector. Formally, (Feyisetan et al., 2020) defined a mechanism where the private vector v_i is constructed from x_i as v_i = x_i + N_i, where N_i is drawn from the distribution over R^d with density p(z) ∝ exp(−ε‖z‖). Since the noise vector N_i is d-dimensional, its Euclidean norm tightly concentrates around its mean E[‖N_i‖] = O(d/ε). Therefore, with high probability, an additive distance distortion of order d/ε holds for the mechanism proposed in (Feyisetan et al., 2020). In our mechanism, however, the dependence on d is replaced by m, which, as argued above, is generally much smaller than d. On the flip side, PRIVEMB satisfies (ε, δ)-Lipschitz privacy with δ > 0, whereas the mechanism in (Feyisetan et al., 2020) achieves the stronger (ε, 0)-Lipschitz privacy.

Inner-Product Approximation Guarantee
Word embeddings seek to capture word similarity, so similar words (e.g., synonyms) have embeddings with high inner product. We now compare the inner product between the private vectors to the inner product between the original embedding vectors.
Proposition 8. Consider Algorithm PRIVEMB. With probability at least 1 − δ, for all pairs x_i, x_j, the inner product ⟨w_i, w_j⟩ approximates ⟨x_i, x_j⟩ up to an additive error that again scales with m rather than d.

Performance on Linear Models
We now discuss the performance of the private vectors (w_1, . . . , w_n) when used with common machine learning models. Given n datapoints (x_1, y_1), . . . , (x_n, y_n) drawn from some universe R^d × R (where y_i represents the label of point x_i), we consider the problem of learning a linear model on this labeled data. We assume that the x_i's are sensitive whereas the y_i's are publicly known. Such situations arise commonly in practice. For example, consider a drug company investigating the effectiveness of a drug trial over n users. Here, y_i could represent the response to the drug for user i, which is known to the drug company, whereas x_i could encode the medical history of user i, which the user would like to keep private.
We focus on a broad class of models where the loss functions have the form ℓ(⟨x, θ⟩; y) for a parameter θ ∈ R^d, where ℓ : R × R → R. This captures a variety of learning problems; e.g., linear regression is captured by setting ℓ(⟨x, θ⟩; y) = (y − ⟨x, θ⟩)², logistic regression by ℓ(⟨x, θ⟩; y) = ln(1 + exp(−y⟨x, θ⟩)), and the support vector machine by ℓ(⟨x, θ⟩; y) = hinge(y⟨x, θ⟩), where hinge(a) = 1 − a if a ≤ 1 and 0 otherwise. We assume that the function ℓ is convex and Lipschitz in its first parameter. Let λ denote the Lipschitz parameter of the loss function over the first parameter, i.e., |ℓ(a; y) − ℓ(b; y)| ≤ λ|a − b| for all a, b ∈ R.
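In code, these three example losses take the form below, each as a function of the margin a = ⟨x, θ⟩ and the label y:

```python
import numpy as np

# Losses written as l(a; y) with a = <x, theta>:
def squared_loss(a, y):   return (y - a) ** 2              # linear regression
def logistic_loss(a, y):  return np.log1p(np.exp(-y * a))  # logistic regression
def hinge_loss(a, y):     return max(0.0, 1.0 - y * a)     # SVM (hinge)
```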
On the data (x_1, y_1), . . . , (x_n, y_n), the (empirical) training loss for a parameter θ is defined as (1/n) Σ_{i=1}^n ℓ(⟨x_i, θ⟩; y_i), and the goal in training (empirical risk minimization) is to minimize this loss over a parameter space Θ. Let θ* be a true minimizer, i.e., θ* ∈ argmin_{θ∈Θ} (1/n) Σ_{i=1}^n ℓ(⟨x_i, θ⟩; y_i). Our goal is to compare the loss of the model trained on the privatized points (w_1, y_1), . . . , (w_n, y_n), where the w_i's are produced by Algorithm PRIVEMB, to the true minimum loss (1/n) Σ_{i=1}^n ℓ(⟨x_i, θ*⟩; y_i). Let ‖Θ‖, defined as sup_{θ∈Θ} ‖θ‖, denote the diameter of Θ. The following proposition states our result.

Proposition 9. Consider Algorithm PRIVEMB. With probability at least 1 − δ, the training loss of the model learned on the privatized points exceeds the true minimum loss by bounded error terms. In the above result, the error terms are negligible if β ≪ 1/(λτ‖Θ‖) and ε ≫ λ√(m ln(2nm/δ)) ‖Θ‖. Though in our experiments (see Section 5), we observe good performance with the private vectors even when β and ε do not satisfy these conditions.
Another point to note is that our setting, where we train ML models over a differentially private data release, differs from the traditional literature on differentially private empirical risk minimization, where the goal is to release only a private version of the model parameter θ, and not the data itself; see, e.g., (Chaudhuri et al., 2011; Bassily et al., 2014). In particular, this means that results from traditional differentially private empirical risk minimization do not carry over to our setting. Our data release setup allows training any number of ML models on the private vectors without paying a composition cost in the privacy guarantees (as post-processing does not affect the privacy guarantee), which is a desirable property.
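The post-processing property means the released vectors can be fed directly to standard training code; a sketch (scikit-learn is our illustrative choice of library, not prescribed by the paper):

```python
from sklearn.linear_model import LogisticRegression

def train_on_private(W, y):
    """W: n x m array of privatized vectors from PRIVEMB; y: public labels.
    Since W is already differentially private, any number of models may be
    trained on it with no additional privacy cost (post-processing)."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(W, y)
    return clf
```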

Experimental Evaluations
We carry out four experiments to demonstrate the improvement of our approach (Algorithm PRIVEMB), denoted M2, over the (ε, 0)-Lipschitz privacy mechanism proposed in (Feyisetan et al., 2020), denoted M1. The first three map to the theoretical guarantees described in Section 4: (1) the distance approximation guarantee, (2) the inner-product approximation guarantee, and (3) the performance on linear models. The final experiment provides further evidence for the performance of these private vectors on downstream classification tasks. All our experiments are on embeddings generated by GLOVE (Pennington et al., 2014) and FASTTEXT (Bojanowski et al., 2017); the dimensionality of the embeddings is d = 300 in both cases. Due to space constraints, we present the FASTTEXT results in Appendix B.
The value of δ is kept constant at 10⁻⁶ for all experiments involving our scheme. We set ω(Ran(M)) = √(log d). The parameter β only affects the utility guarantee; Algorithm PRIVEMB is (ε, δ)-Lipschitz private for any value of β. In our experiments, corroborating our theoretical guarantees, we vary β to illustrate its effect on the guarantees. Recall that higher values of β result in lower-dimensional vectors, so setting β appropriately lets one trade off the loss of utility due to dimension reduction against the gain in utility due to the smaller noise needed for lower-dimensional vectors.
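For instance, an illustrative projected dimension m can be derived from Corollary 2; the constant factor (set to 1 here) and the choice γ = 10⁻⁶ are our assumptions:

```python
import numpy as np

def choose_m(d, beta, gamma=1e-6):
    """Projected dimension suggested by m = Omega((omega + sqrt(log(1/gamma)))^2 / beta^2),
    with omega(Ran(M)) = sqrt(log d) as in the experiments."""
    omega = np.sqrt(np.log(d))
    return int(np.ceil((omega + np.sqrt(np.log(1.0 / gamma))) ** 2 / beta ** 2))

print(choose_m(300, beta=0.5))  # d = 300 as for GLOVE / FASTTEXT
```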
We also vary the privacy parameter ε in our experiments. While lower values of ε are certainly desirable, it is widely known that differentially private algorithms for certain problems (such as those arising in complex domains like NLP) require slightly larger values of ε to provide reasonable utility in practice (Fernandes et al., 2019; Feyisetan et al., 2020; Xie et al., 2018; Ma et al., 2020). The related work on differentially private release of text embeddings by Fernandes et al. (2019) is one such example.

Distance Approximation Guarantees
This experiment compares the distance between pairs of private vectors to that between the corresponding original vectors. We sampled 100 word vectors from the vocabulary. For each of these 100 vectors, we compare the distance to another set of 100 randomly sampled vectors. These 100 × 100 pairs of vectors were kept constant across all experiment runs. For each embedding model, the results capture |‖w_i − w_j‖ − ‖x_i − x_j‖|. Results. Fig. 1 shows the experiment outcomes across the different values of ε, β, and embeddings. Lower values on the y-axis indicate better results, in that the distances between the private vectors are a good approximation to the actual distances between the original vectors. Overall, the guarantees of our approach M2 are better than those of M1, as observed by the smaller distance differences across all conditions. Next, the results also highlight that for both mechanisms, as expected, the guarantees improve as ε increases, due to the introduction of less noise (note the different scales across ε). Finally, the results reveal that for a given value of ε, as the value of β increases, the guarantees of our scheme improve. This can be viewed through the guarantees of Proposition 7, which consist of two terms: the first term increases with β, and the second term, due to its dependence on 1/β² (through m), decreases with β. Since the second (noise) term generally dominates, we get an improvement with β, suggesting that it is advantageous to pick a larger β in practice.

Inner Prod Approximation Guarantees
This experiment compares the inner product between pairs of private vectors to that between the corresponding original vectors. The setup here is identical to the distance approximation experiments (i.e., the same 100 × 100 word pairs and mix of ε and β). The results capture |⟨w_i, w_j⟩ − ⟨x_i, x_j⟩|. Results. The results in Fig. 2 show the experiment outcomes across ε, β, and embeddings. Similar to the findings in Fig. 1, the results of M2 are an improvement over M1, with the same patterns of improvement. For a fixed privacy budget ε, the performance of M2 is better than that of M1, and the gap increases as β increases. Again, this suggests that one should pick a larger β.

Performance on Linear Models
We built a simple binary SVM linear model to classify single keywords into two classes, positive and negative, based on their conveyed sentiment. The dataset used was a list from (Hu and Liu, 2004) consisting of 4783 negative and 2006 positive words. We selected the subset of words that occur in both the GLOVE and FASTTEXT embeddings and capped both lists to an equal number of words; the resulting datasets each had 1899 words. The purpose of this experiment was to explore the behaviors of M1 and M2 at different values of ε and β for a linear model. Results shown are over 10 runs. Results. The results on the performance of linear models are presented in Fig. 3. The performance metrics are (i) accuracy on a randomly selected 20% test set, and (ii) the area under the ROC curve (AUC). Higher values on the y-axis indicate better results. The findings follow from our first two experiments, demonstrating that for a fixed privacy guarantee, the utility of M2 is better than that of M1, and the gap between the performance of M2 and M1 increases as β increases.
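A sketch of this experiment's core loop, reusing the priv_emb sketch from above (dataset loading and the embedding lookup are assumed, and the split and classifier settings are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score

def evaluate_private_svm(X, y, Phi, eps, beta):
    """X: n x d original word vectors; y: binary sentiment labels."""
    W = np.stack([priv_emb(x, Phi, eps, beta) for x in X])   # privatize
    W_tr, W_te, y_tr, y_te = train_test_split(W, y, test_size=0.2)
    clf = LinearSVC().fit(W_tr, y_tr)
    acc = accuracy_score(y_te, clf.predict(W_te))
    auc = roc_auc_score(y_te, clf.decision_function(W_te))
    return acc, auc
```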

Performance on NLP Datasets
We further evaluated M2 against M1 at fixed values of ε and β on classification tasks over 5 NLP datasets. The experiments were done, and can be replicated, using SentEval (Conneau and Kiela, 2018), an evaluation toolkit for sentence embeddings, by replacing the default embeddings with the private embeddings. From the previous experiments, we know that it is better to pick a larger β, so we set β = 0.9 here.
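A sketch of how such a SentEval run can be wired up; the task path, the averaging choice for sentence vectors, and the surrounding variables emb, Phi, eps, beta are assumptions on our part, not the paper's setup:

```python
import numpy as np
import senteval  # https://github.com/facebookresearch/SentEval

def batcher(params, batch):
    """Return one vector per sentence: here, the mean of its privatized
    word vectors (averaging is an illustrative choice)."""
    reps = []
    for sent in batch:
        vecs = [priv_emb(emb[w], Phi, eps, beta) for w in sent if w in emb]
        reps.append(np.mean(vecs, axis=0) if vecs else np.zeros(Phi.shape[0]))
    return np.vstack(reps)

params = {'task_path': 'data/senteval', 'usepytorch': False, 'kfold': 5}
se = senteval.engine.SE(params, batcher)
results = se.eval(['MR', 'CR', 'MPQA', 'SST5', 'TREC'])
```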
Results. Table 1 presents the results from the experiments and summarizes the datasets: MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), MPQA (Wiebe et al., 2005), SST-5 (Socher et al., 2013), and TREC-6 (Li and Roth, 2002). We also present results of two non-private baselines on all the datasets, based on InferSent and SkipThought as described in (Conneau et al., 2017). The evaluation metrics are train and test accuracies; higher scores therefore indicate better utility. Not surprisingly, because of the noise addition, there is a performance drop when we compare the private mechanisms to the non-private baselines. However, the results reinforce our findings that the utility afforded by M2 is better than that of M1 at fixed values of ε. Some of the improvements are remarkably significant, e.g., +7% on the CR dataset and +20% on TREC-6. Summary of the Results. Overall, these experiments demonstrate that PRIVEMB offers better utility than the embedding privatization scheme of (Feyisetan et al., 2020).

Concluding Remarks
In this paper, we introduced an (ε, δ)-Lipschitz private algorithm for generating real-valued embedding vectors. Our mechanism works by first reducing the dimensionality of the vectors through a random projection, and then adding noise calibrated to the sensitivity of the dimensionality reduction function. The mechanism can be utilized for any well-defined embedding model, including but not limited to word, sentence, and document embeddings. We prove theoretical bounds that show how various properties of interest for vector embeddings are well-approximated by the private vectors, and our empirical results across multiple embedding models and NLP datasets demonstrate the superior utility guarantees.