Privacy-Preserving News Recommendation Model Learning

News recommendation aims to display news articles to users based on their personal interest. Existing news recommendation methods rely on centralized storage of user behavior data for model training, which may lead to privacy concerns and risks due to the privacy-sensitive nature of user behaviors. In this paper, we propose a privacy-preserving method for news recommendation model training based on federated learning, where the user behavior data is locally stored on user devices. Our method can leverage the useful information in the behaviors of a massive number of users to train accurate news recommendation models, and meanwhile removes the need for centralized storage of this data. More specifically, on each user device we keep a local copy of the news recommendation model and compute gradients of the local model based on the user behaviors on this device. The local gradients from a group of randomly selected users are uploaded to the server, where they are aggregated to update the global model. Since the model gradients may contain some implicit private information, we apply local differential privacy (LDP) to them before uploading for better privacy protection. The updated global model is then distributed to each user device for local model update. We repeat this process for multiple rounds. Extensive experiments on a real-world dataset show the effectiveness of our method in news recommendation model training with privacy protection.


Introduction
With the development of the Internet and mobile Internet, online news websites and Apps such as Yahoo! News and Toutiao have become very popular ways for people to obtain news information (Okura et al., 2017). Since massive news articles are posted online every day, users of online news services face heavy information overload (Zheng et al., 2018). Different users usually prefer different news information. Thus, personalized news recommendation, which aims to display news articles to users based on their personal interest, is a useful technique to improve user experience and has been widely used in many online news services (Wu et al., 2019b). The research on news recommendation has attracted much attention from both academia and industry (Okura et al., 2017; Lian et al., 2018; Wu et al., 2019a).
Many news recommendation methods have been proposed in recent years (Wu et al., 2019b; Zhu et al., 2019b). These methods usually recommend news based on the matching between the news representation learned from news content and the user interest representation learned from historical user behaviors on news. For example, Okura et al. (2017) proposed to learn news representations from the content of news articles via an autoencoder, and to learn user interest representations from the clicked news articles via a Gated Recurrent Unit (GRU) network. They ranked candidate news articles using the dot product of the news and user interest representations. These approaches all rely on the centralized storage of user behavior data, such as news click histories, for model training. However, users' behaviors on news websites and Apps are privacy-sensitive, and their leakage may bring catastrophic consequences. Unfortunately, the centralized storage of user behavior data on the server may lead to serious privacy concerns from users and risks of large-scale private data leakage.
In this paper, we propose a privacy-preserving method for news recommendation model training. Instead of storing user behavior data on a central server, in our method it is locally stored on (and never leaves) users' personal devices, which can effectively reduce privacy concerns and risks (McMahan et al., 2017). Since the behavior data of a single user is insufficient for model training, we propose a federated learning based framework named FedNewsRec to coordinate massive user devices to collaboratively learn an accurate news recommendation model without the need for centralized storage of user behavior data. In our framework, each user device keeps a local copy of the news recommendation model. Since the user behaviors on news websites or Apps stored on the user device can provide important supervision information on how the current model performs, we compute the model gradients based on these local behaviors. The local gradients from a group of randomly selected users are uploaded to the server, where they are aggregated to update the global news recommendation model maintained by the server. The updated global model is then distributed to each user device for local model update. We repeat this process for multiple rounds until the training converges. Since the local model gradients may also contain some implicit private information about users' behaviors on their devices, we apply the local differential privacy (LDP) technique to these local model gradients before uploading them to the server, which can better protect user privacy at the cost of a slight performance sacrifice. We conduct extensive experiments on a real-world dataset. The results show that our method can achieve satisfactory performance in news recommendation by coordinating massive users for model training, and at the same time can well protect user privacy.
The major contributions of this work include: (1) We propose a privacy-preserving method to train accurate news recommendation models by leveraging the behavior data of massive users, while removing the need for its centralized storage to protect user privacy.
(2) We propose to apply local differential privacy to protect the private information in local gradients communicated between user devices and server.
(3) We conduct extensive experiments on a real-world dataset to verify the proposed method in terms of recommendation accuracy and privacy protection.

News Recommendation
News recommendation can be formulated as a problem of matching between news articles and users.
There are three core tasks in news recommendation, i.e., how to model the content of news articles (news representation), how to model the personal interest of users in news (user representation), and how to measure the relevance between news content and user interest. For news representation, many feature-based methods have been applied. For example, Lian et al. (2018) represented news using URLs, categories and entities. Recently, many deep learning based news recommendation methods represent news from its content using neural networks. For example, Okura et al. (2017) used a denoising autoencoder to learn news representations from news content. Wu et al. (2019c) proposed to learn news representations from news titles via a multi-head self-attention network. For user representation, existing news recommendation methods usually model user interest from users' historical behaviors on news platforms. For example, Okura et al. (2017) learned user representations from previously clicked news using a GRU network. An et al. (2019) proposed a long- and short-term user representation model (LSTUR) for user interest modeling. It captures long-term user interest via user ID embeddings and short-term user interest from the latest news click behaviors via GRU. For measuring the relevance between user interest and news content, the dot product of user and news representation vectors is widely used (Okura et al., 2017; Wu et al., 2019a). Some methods also explore cosine similarity (Zhu et al., 2019b), feed-forward networks, and feature-interaction networks (Lian et al., 2018).
These existing news recommendation methods all rely on centrally stored user behavior data for model training. However, users' behaviors on news platforms are privacy-sensitive. The centralized storage of user behavior data may lead to serious privacy concerns of users. In addition, news platforms bear a heavy responsibility to prevent user data leakage, and are under high pressure to meet strict user privacy protection regulations like GDPR. Different from existing news recommendation methods, in our method the user behavior data is locally stored on personal devices, and only the model gradients are communicated between user devices and the server. Since the model gradients contain much less user information than the raw behavior data, and they are further processed by the local differential privacy (LDP) technique, our method can protect user privacy much better than existing news recommendation methods.

Federated Learning
Federated learning (McMahan et al., 2017) is a privacy-preserving machine learning technique which can leverage the rich data of massive users to train shared intelligent models without the need to centrally store the user data. In federated learning the user data is locally stored on user mobile devices and never uploaded to server. Instead, each user device computes a model update based on the local data, and the locally-computed model updates from many users are aggregated to update the shared model. Since model updates usually contain much less information than the raw user data, the risks of privacy leakage can be effectively reduced. Federated learning requires that the labeled data can be inferred from user interactions for supervised model learning, which can be perfectly satisfied in our news recommendation scenario, since the click and skip behaviors on news websites and Apps can provide rich supervision information.
Federated learning has been applied to training query suggestion models for smartphone keyboards and topic models (Jiang et al., 2019). There are also some explorations in applying federated learning to recommendation (Ammad et al., 2019; Chai et al., 2019). For example, Ammad et al. (2019) proposed a federated collaborative filtering (FCF) method. In FCF, the personal rating data is locally stored on the user client and is used to compute local gradients of user embeddings and item embeddings. The user embeddings are locally maintained on the user client and are directly updated using the local gradient on each client. The item embeddings are maintained by a central server and are updated using the aggregated gradients of many clients. Chai et al. (2019) proposed a federated matrix factorization method, which is very similar to FCF. However, these methods require all users to participate in the federated learning process to train their embeddings, which is not practical in real-world recommendation scenarios. Besides, these methods represent items by their IDs and thus cannot handle new items, while in news recommendation many new articles are posted every day. Thus, these federated learning based recommendation methods have inherent drawbacks and are not suitable for news recommendation.

Local Differential Privacy
Local differential privacy (LDP) is an important technique for providing privacy guarantees in sensitive information collection and analysis (Ren et al., 2018). It has attracted increasing attention since user privacy protection has become a more and more important issue (Kairouz et al., 2014; Qin et al., 2016). A classical scenario of LDP is that there is a set of users, and each user u has a private value v, which is sent to an untrusted third-party aggregator so that the aggregator can learn some statistical information about the private value distribution among the users (Cormode et al., 2018). LDP can guarantee that the leakage of private information for each individual user is bounded by applying a randomized algorithm M to the private value v and sending the perturbed value M(v) to the aggregator for statistical information inference. The randomized algorithm M(·) is said to satisfy ε-local differential privacy if and only if for two arbitrary input private values v and v′ the following inequality holds:

Pr[M(v) = y] ≤ e^ε · Pr[M(v′) = y], for all y ∈ range(M),

where ε ≥ 0 is usually called the privacy budget. A smaller ε means better private information protection. In many works (Sarathy and Muralidhar, 2010; Duchi et al., 2013), M(·) is implemented by adding Laplace noise to the private value. In this paper we apply the LDP technique to the model gradients which are generated on user devices from user behaviors and uploaded to the server, to better protect user privacy and remove the need for a trusted server.
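As a concrete illustration of the Laplace-noise approach mentioned above, the following is a minimal sketch (the function name and parameters are ours, not from the paper): adding Laplace noise with scale λ = sensitivity/ε to a private value yields an ε-LDP release for inputs whose pairwise distance is bounded by the sensitivity.

```python
import numpy as np

def laplace_mechanism(v, sensitivity, epsilon, rng=None):
    """Perturb a private scalar v with zero-mean Laplace noise.

    With noise scale lam = sensitivity / epsilon, releasing
    M(v) = v + La(0, lam) satisfies epsilon-local differential
    privacy for inputs whose pairwise distance is at most
    `sensitivity`.
    """
    rng = rng or np.random.default_rng()
    lam = sensitivity / epsilon
    return v + rng.laplace(loc=0.0, scale=lam)
```

Note that the noise is unbiased: averaging many perturbed releases recovers the true value, which is why an aggregator can still learn accurate population statistics while each individual contribution stays protected.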

FedNewsRec for Privacy-Preserving News Recommendation
In this section we introduce our FedNewsRec method for privacy-preserving news recommendation model training. We first describe the news recommendation model. Then we describe the details of FedNewsRec.

Basic News Recommendation Model
Following previous works (Wu et al., 2019a), the news recommendation model in our method can be decomposed into two core submodels, i.e., a news model to learn news representations and a user model to learn user representations. News Model The news model aims to learn news representations to model news content. Its architecture is shown in Fig. 1. Following (Wu et al., 2019b), we learn news representations from news titles. The news model contains four layers stacked from bottom to top. The first layer is word embedding, which converts the word sequence in a news title into a sequence of semantic word embedding vectors. The second layer is a CNN network, which is used to learn word representations by capturing local contexts. The third layer is a multi-head self-attention network (Vaswani et al., 2017), which can learn contextual word representations by modeling the long-range relatedness between different words. The fourth layer is an attention network, which is used to build a news representation vector t from the output of the multi-head self-attention network by selecting informative words.
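The fourth layer described above can be sketched as an additive attention pooling over the contextual word vectors; the code below is a generic sketch (the exact scoring function and parameterization in the paper may differ, and all names here are ours):

```python
import numpy as np

def attentive_pooling(H, q, W):
    """Collapse contextual word vectors H (L x d) into one news
    vector t by attending to informative words:
        a_i = softmax_i(q^T tanh(W h_i)),   t = sum_i a_i h_i
    """
    scores = np.tanh(H @ W.T) @ q                  # (L,) word attention scores
    scores = scores - scores.max()                 # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # normalized attention weights
    return alpha @ H                               # (d,) news vector t
```

Because the weights are non-negative and sum to one, the resulting news vector is a convex combination of the word vectors, weighted toward the words the query deems informative.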
User Model The user model is used to learn user representations to model their personal interest. Its architecture is shown in Fig. 2. Following (Okura et al., 2017), we learn user representations from users' clicked news articles. Motivated by the LSTUR model proposed by An et al. (2019), we learn representations of users by capturing both long-term and short-term interests. The difference is that in LSTUR the embeddings of user IDs are used to model long-term interest, while in our user model it is learned from all the historical behaviors through a combination of a multi-head self-attention network and an attentive pooling network. This is because in the federated learning scenario it is not practical for all users to participate in the process of model training, so the ID embeddings of many users in LSTUR cannot be learned. For short-term user interest modeling, our user model applies a GRU network to the recent behaviors of users, which is the same as in LSTUR.

Model Training from User Behavior Users' behaviors on news websites and Apps can provide useful supervision information to train news recommendation models. For example, if a user u clicks a news article t which has a low ranking score predicted by the model, we can tune the model to give a higher ranking score to this user-news pair. We propose to train the news recommendation model based on both click and non-click behaviors. More specifically, following (Wu et al., 2019b), for each news t_i^c clicked by user u, we randomly sample H news which are displayed in the same impression but not clicked. Assume this user has B_u click behaviors in total; then the loss function of the news recommendation model with parameter set Θ is defined as:

L_u = − Σ_{i=1}^{B_u} log [ exp(s(u, t_i^c)) / ( exp(s(u, t_i^c)) + Σ_{j=1}^{H} exp(s(u, t_{i,j}^{nc})) ) ],   (3)

where t_i^c and t_{i,j}^{nc} are clicked and non-clicked news articles shown in the same impression, and s(u, t) is the ranking score of news t for user u, which is defined as the dot product of their embedding vectors, i.e., s(u, t) = u^T t.
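The negative-sampling loss of Eq. (3) can be sketched as follows, assuming precomputed user and news embeddings (the function name and array layout are our own choices):

```python
import numpy as np

def news_rec_loss(u, clicked, non_clicked):
    """Sketch of the loss in Eq. (3) for one user.

    u:           user embedding, shape (d,)
    clicked:     embeddings of the B_u clicked news, shape (B_u, d)
    non_clicked: embeddings of the H sampled non-clicked news for
                 each click, shape (B_u, H, d)
    """
    pos = clicked @ u                                   # s(u, t_i^c), shape (B_u,)
    neg = non_clicked @ u                               # s(u, t_{i,j}^{nc}), shape (B_u, H)
    logits = np.concatenate([pos[:, None], neg], axis=1)
    m = logits.max(axis=1, keepdims=True)               # stable log-sum-exp
    logz = (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))).ravel()
    return float(np.sum(logz - pos))                    # -sum_i log softmax(pos_i)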

The Framework of FedNewsRec
Next, we introduce our FedNewsRec framework for privacy-preserving news recommendation model training, which is shown in Fig. 3. In our FedNewsRec framework, the user behaviors on news platforms (websites or Apps) are locally stored on user devices and never uploaded to the server. In addition, the servers which provide news services neither record nor collect the user behaviors, which can reduce the privacy concerns of users and the risks of data leakage. Since an accurate news recommendation model can effectively improve users' news reading experience, and the behavior data from a single user is far from sufficient for training an accurate and unbiased model, in our FedNewsRec framework we propose to coordinate a large number of user devices to collectively train intelligent news recommendation models.
Following (McMahan et al., 2017), each user device which participates in the model training is called a client. Each client has a copy of the current news recommendation model Θ which is maintained by the server. Assume user u's client has accumulated a set of behaviors on news platforms, denoted as B_u. We then compute a local gradient of the model Θ according to the behaviors B_u and the loss function defined in Eq. (3), which is denoted as g_u = ∂L_u/∂Θ. Although the local model gradient g_u is computed from a set of behaviors rather than a single behavior, it may still contain some private information about user behaviors (Zhu et al., 2019a). Thus, for better privacy protection, we apply the local differential privacy (LDP) technique to the local model gradients. Denote the randomized algorithm applied to g_u as M, which is defined as:

M(g_u) = clip(g_u, δ) + n,   (4)
n ∼ La(0, λ),   (5)

where n is Laplace noise with zero mean. The parameter λ controls the strength of the Laplace noise, and a larger λ brings better privacy protection. The function clip(x, y) limits each value of x to the range [−y, y]. It is motivated by studies which show that gradient clipping can help avoid potential gradient explosion and is beneficial for model training. We denote the randomized gradient as g̃_u = M(g_u).
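The clip-then-perturb step of Eqs. (4)-(5) can be sketched as below (a minimal illustration; the function name is ours):

```python
import numpy as np

def randomize_gradient(g, delta, lam, rng=None):
    """LDP module sketch: M(g) = clip(g, delta) + n with n ~ La(0, lam).

    Each gradient value is first clipped to [-delta, delta], then
    perturbed with zero-mean Laplace noise of scale lam.
    """
    rng = rng or np.random.default_rng()
    clipped = np.clip(g, -delta, delta)
    noise = rng.laplace(loc=0.0, scale=lam, size=np.shape(g))
    return clipped + noise
```

Clipping first bounds the sensitivity of each gradient value, which is what makes the subsequent Laplace noise yield a finite privacy budget.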
After the clipping and randomization operations, it is more difficult to infer the raw user behaviors from the gradients. The user client uploads the randomized local model gradient g̃_u to the server. In our FedNewsRec framework, the server maintains the news recommendation model and updates it using the model gradients from a large number of users. In each round, the server randomly selects a fraction r (e.g., 10%) of the user clients and sends the current news recommendation model Θ to them. Then it collects and aggregates the local model gradients from the selected user clients as follows:

g = (1 / Σ_{u∈U} |B_u|) Σ_{u∈U} |B_u| · g̃_u,   (6)

where U is the set of users selected for the learning process in this round, and B_u is the set of behaviors of user u used for local model gradient computation. The aggregated gradient g is then used to update the global news recommendation model Θ maintained by the server:

Θ ← Θ − η · g,   (7)

where η is the learning rate. The updated global model is then distributed to user devices to update their local models. This process is repeated until the model training converges.
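The server-side aggregation and model update described above can be sketched as a single round of behavior-count-weighted gradient averaging followed by a gradient-descent step (a simplified illustration with our own naming; real models hold many parameter tensors rather than one array):

```python
import numpy as np

def federated_update(theta, local_grads, behavior_counts, eta):
    """One server round: aggregate the perturbed local gradients,
    weighting each selected user u by its number of behaviors |B_u|,
    then update the global model by gradient descent."""
    total = float(sum(behavior_counts))
    g = sum(b * g_u for g_u, b in zip(local_grads, behavior_counts)) / total
    return theta - eta * g
```

Weighting by |B_u| gives users with more local behaviors proportionally more influence on the aggregated gradient, mirroring the weighted averaging used in FedAvg-style training (McMahan et al., 2017).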

Discussions on Privacy Protection
Next, we discuss why our FedNewsRec framework can protect user privacy in news recommendation model training. First, in our method the private user behavior data is stored on users' own devices and is never uploaded to the server; only the model gradients inferred from the local user behaviors are communicated with the server. According to the data processing inequality (McMahan et al., 2017), these gradients never contain more private information than the raw user behaviors, and usually contain much less. Thus, user privacy can be better protected compared with the centralized storage of user behavior data used in existing news recommendation methods. Second, the local model gradients are computed from a group of user behaviors instead of a single behavior, which makes it difficult to infer a specific behavior from the local model gradients uploaded to the server. Third, we apply the local differential privacy technique to the local model gradients before uploading by adding Laplace noise to them, which strengthens the protection of the private information in the local model gradients. According to (Choi et al., 2018), Laplace noise in LDP can achieve ε-local differential privacy with

ε = max_{v,v′} |v − v′| / λ,

where v and v′ are arbitrary values in the local model gradient. Since the clipping operation bounds max_{v,v′} |v − v′| by 2δ in our FedNewsRec framework, the upper bound of the privacy budget is 2δ/λ. We can see that by increasing λ (i.e., the strength of the noise) we can achieve a smaller privacy budget, which means better privacy protection. However, strong noise will hurt the accuracy of the aggregated gradients. Thus, λ should be selected based on the trade-off between privacy protection and model performance.
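The privacy-budget bound above amounts to a one-line computation (a sketch; the function name is ours):

```python
def privacy_budget_bound(delta, lam):
    """Clipping to [-delta, delta] bounds the distance between any
    two gradient values by 2*delta, so the Laplace mechanism with
    noise scale lam yields a privacy budget of at most 2*delta/lam."""
    return 2.0 * delta / lam
```

For instance, with the hyper-parameter values reported later in the paper (δ = 0.005, λ = 0.015) this bound evaluates to 2/3, and increasing λ at fixed δ tightens (lowers) the bound.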

Dataset and Experimental Settings
Our experiments were conducted on a public news recommendation dataset (named Adressa) collected from a Norwegian news website (Gulla et al., 2017) and another real-world dataset (named MSN-News) collected from Microsoft News (https://www.msn.com/en-us). Our dataset and codes will be publicly available at https://github.com/JulySinceAndrew/FedNewsRec-EMNLP-Findings-2020. For the Adressa dataset, following Hu et al. (2020), we used user logs in the first five days to construct the training data. In experiments we used the 300-dimensional pretrained GloVe embeddings to initialize word embeddings. The number of self-attention heads is 20 and the output dimension of each head is 20. The dimension of the GRU hidden state is 400. H in Eq. (3) is 4. The fraction r of users participating in model training in each round is 2%. The learning rate η in Eq. (7) is 0.5. δ in Eq. (4) is 0.005 and λ in Eq. (5) is 0.015. These hyper-parameters were all selected via cross-validation on the training set.

Effectiveness Evaluation
First, we verify the effectiveness of the proposed FedNewsRec method. We compared it with many methods, including: (1) FM (Rendle, 2012), factorization machine, a classic method for recommendation; (2) DFM (Lian et al., 2018), a deep fusion model for recommendation; (3) EBNR (Okura et al., 2017), using an autoencoder for news modeling and a GRU network (Cho et al., 2014) for user modeling; (4) DKN (Wang et al., 2018), using a knowledge-aware CNN network for news representation in news recommendation; (5) DAN (Zhu et al., 2019b), using CNN to learn news representations from both news titles and entities and using LSTM to learn user representations; (6) NAML (Wu et al., 2019a), learning news representations via attentive multi-view learning; (7) NPA (Wu et al., 2019b), using personalized attention networks to learn news and user representations; (8) NRMS (Wu et al., 2019d), learning representations of news and users via multi-head self-attention networks; (9) FCF (Ammad et al., 2019), a federated collaborative filtering method for recommendation; (10) CenNewsRec, which has the same news recommendation model as FedNewsRec but is trained on centralized user behavior data.
The results are listed in Table 2. First, comparing FedNewsRec with SOTA news recommendation methods such as NRMS, NPA and EBNR, our method achieves comparable and even better performance in news recommendation. This validates the effectiveness of our approach in learning accurate models for personalized news recommendation. Moreover, different from these existing news recommendation methods, which are all trained on centrally stored user behavior data, in FedNewsRec the user behavior data is stored on local user devices and is never uploaded. Thus, our method can train accurate news recommendation models and meanwhile better protect user privacy.
Second, our method performs better than existing federated learning based recommendation methods like FCF (Ammad et al., 2019). The performance of FCF in news recommendation is not good. This is because FCF requires each user and each item to participate in the training process to learn their embeddings. However, in practical applications not all users can participate in the training, for various reasons. In addition, news articles on online news platforms expire very quickly, and new articles continuously emerge. Thus, many candidate items are new items unseen in the training data, which cannot be handled by FCF. In our method we learn news representations from news content and user representations from user behaviors using neural models. Therefore, our method can handle the problem of new users and new items, and is more suitable for the news recommendation scenario.
Third, FedNewsRec performs worse than CenNewsRec, which has the same news recommendation model as FedNewsRec but is trained on centralized user behavior data. This is intuitive, since centralized data is more beneficial for model training than decentralized data. In addition, in FedNewsRec we apply the local differential privacy technique with Laplace noise to protect the private information in model gradients, which makes the aggregated gradient for model update less accurate. Fortunately, the performance gap between FedNewsRec and CenNewsRec is not large. Thus, our FedNewsRec method achieves much better privacy protection at the cost of an acceptable performance decline. These results validate the effectiveness of our method.

Influence of User Number
In this section, we explore whether our FedNewsRec method can exploit the useful behavior information of massive users in a federated way to train accurate news recommendation models. In the following sections, we only show the experimental results on the MSN-News dataset. We randomly select different numbers of users for model training, and use all users for evaluation. The experimental results are shown in Fig. 4, from which we have several observations. First, when the number of users is small (e.g., 1,000), the performance of the news recommendation model trained on the behavior data of these users is not satisfactory. This is because the behaviors of a single user are usually very limited, and the behavior data of a small number of users is insufficient to train an accurate news recommendation model. This result validates the motivation of our FedNewsRec method to coordinate a large number of users in a federated way for model training. Second, when the number of users participating in training increases, the performance of FedNewsRec improves. This indicates that FedNewsRec can effectively exploit the useful behavior information from different users to collectively train an accurate news recommendation model, which validates the effectiveness of our framework. Third, when the number of users is large enough, further incorporating more users brings only marginal performance improvement. This result shows that a reasonable number of users is sufficient for news recommendation model training, and it is unnecessary to involve too many or all users, which is costly and impractical.

Hyper-parameter Analysis
In this section, we explore the influence of hyper-parameters on our method. We show the results for two important hyper-parameters, i.e., δ in Eq. (4) and λ in Eq. (5), which serve in the local differential privacy module of our FedNewsRec framework. The results are shown in Fig. 5. In Fig. 5(a) we show the performance of our method with different λ and δ values. We find that a large λ value leads to performance decline. This is because a larger λ means stronger Laplace noise added to the gradients in the LDP module, making the aggregated gradient for model update less accurate. In addition, our method tends to have better performance when δ is larger. This is because fewer gradients are affected by the clip operation when δ is larger. In Fig. 5(b) we show the upper bound of the privacy budget ε introduced in Section 3.3, with different λ and δ values. We find that with a larger λ value and a smaller δ value, the privacy budget becomes lower, which means better privacy protection. This is intuitive, since a larger λ value and a smaller δ value indicate that stronger noise is added and more gradient values are clipped, making it more difficult to recover the original model gradients. Combining Fig. 5(a) and Fig. 5(b), we can see that better privacy protection is achieved at some sacrifice of performance, and we need to select λ and δ values based on the trade-off between privacy protection and news recommendation performance.

Convergence Analysis
Next we explore the convergence of the model training in FedNewsRec, and the results are shown in Fig. 6. We can see that the training process can converge in about 1,500 rounds under different settings of r (i.e., ratio of selected users for model training in each round). It indicates that FedNewsRec can train news recommendation model efficiently.

Effectiveness of User Model
In this section, we conduct ablation studies to evaluate the effectiveness of the short- and long-term user interest modeling in our user model. The experimental results are shown in Fig. 7, from which we have several observations. First, after removing the short-term user embedding, the performance of our method declines. This is because users sometimes tend to read news related to topics they recently cared about. Our user model learns the short-term user embedding from the sequence of users' recently clicked news via a GRU network, which can effectively capture users' short-term interest. Thus, removing the short-term user embedding makes the unified user embedding lose some information about users' recent reading preferences and causes a performance decline. Second, after removing the long-term user embedding, the performance of our method also declines. This is because users may read some news according to their long-term interests, which may not be reflected by their recent reading history. To address this issue, our user model learns the long-term user embedding by capturing the relatedness among users' clicked news, which can effectively capture users' long-term interest. After removing it, the unified user embedding loses the information of the long-term interest, which hurts the recommendation accuracy.

Conclusion
In this paper, we propose a privacy-preserving method for news recommendation model training. Different from existing methods which rely on centralized storage of user behavior data, in our method the user behaviors are locally stored on user devices. We propose a FedNewsRec framework to coordinate a large number of users to collectively train accurate news recommendation models from their behavior data without the need to upload it. In our method, each user client computes local model gradients based on the user behaviors on the device and sends them to the server. The server coordinates the training process and maintains a global news recommendation model. It aggregates the local model gradients from massive users and updates the global model using the aggregated gradient. Then the server sends the updated model to the user clients, and this process is repeated for multiple rounds. To further protect the private information in the local model gradients, we apply local differential privacy to them by adding Laplace noise. Experiments on real-world data show that our method achieves comparable performance to SOTA news recommendation methods, and meanwhile can better protect user privacy.