PTUM: Pre-training User Model from Unlabeled User Behaviors via Self-supervision

User modeling is critical for many personalized web services. Many existing methods model users based on their behaviors and the labeled data of target tasks. However, these methods cannot exploit the useful information in unlabeled user behavior data, and their performance may not be optimal when labeled data is scarce. Motivated by pre-trained language models, which are pre-trained on large-scale unlabeled corpora to empower many downstream tasks, in this paper we propose to pre-train user models from large-scale unlabeled user behavior data. We propose two self-supervision tasks for user model pre-training. The first is masked behavior prediction, which can model the relatedness between historical behaviors. The second is next K behaviors prediction, which can model the relatedness between past and future behaviors. The pre-trained user models are fine-tuned in downstream tasks to learn task-specific user representations. Experimental results on two real-world datasets validate the effectiveness of our proposed user model pre-training method.


Introduction
User modeling is a critical technique for many personalized web services such as personalized news and video recommendation (Okura et al., 2017; Covington et al., 2016). Many existing methods model users from their behaviors (Zhou et al., 2018; Ouyang et al., 2019). For example, Covington et al. (2016) proposed the YouTubeNet model for video recommendation, which models users from their watched videos and search tokens. Zhou et al. (2018) proposed a deep interest network (DIN) for click-through rate (CTR) prediction, which models users from their behaviors on an e-commerce platform based on their relevance to the candidate ads. Okura et al. (2017) proposed to use a GRU network for news recommendation, which models users from their clicked news. However, these methods mainly rely on sufficient labeled data to train user models, and their performance may not be optimal when training data is scarce. In addition, they only model task-specific user information and do not exploit the universal user information encoded in user behaviors.
In recent years, pre-trained language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) have achieved great success in many NLP tasks, such as reading comprehension and machine translation. Many language models are pre-trained on large unlabeled corpora via self-supervision tasks such as masked LM and next sentence prediction to model contexts (Devlin et al., 2019). These language models can learn universal language representations from large unlabeled corpora and empower many different downstream tasks when the labeled data for these tasks is insufficient (Qiu et al., 2020).
Motivated by pre-trained language models, in this paper we propose pre-trained user models (PTUM), which can learn universal user models from unlabeled user behaviors. We propose two self-supervision tasks for user model pre-training. The first is masked behavior prediction, which aims to infer a randomly masked behavior of a user based on her other behaviors. It can help the user model capture the relatedness between historical user behaviors. The second is next K behaviors prediction, which aims to predict the K future behaviors based on past ones. It can help the user model capture the relatedness between past and future behaviors. The pre-trained user model is further fine-tuned in downstream tasks to learn task-specific user representations. We conduct experiments on two real-world datasets for user demographic prediction and ads CTR prediction. The results validate that our PTUM method can consistently boost the performance of many user models by pre-training them on unlabeled user behaviors.
Pre-trained User Model

Framework of User Model
Before introducing our PTUM method for user model pre-training, we first briefly introduce the general framework of many existing behavior-based user modeling methods. As shown in Fig. 1, the core components are a behavior encoder, which encodes each behavior and its position into a behavior embedding, and a user encoder, which learns user embeddings from behavior embeddings. There are many options for the behavior encoder (Wu et al., 2019b), and likewise for the user encoder, such as GRU (Hidasi et al., 2016), attention networks (Wu et al., 2019a) and Transformer (Sun et al., 2019). In these existing methods, user models are trained end-to-end on the labeled data of the target task, which can only capture task-specific information. Thus, in this paper we propose to pre-train user models from unlabeled user behavior data via self-supervision, which can exploit the universal user information encoded in user behaviors.
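As a purely illustrative sketch of this two-level framework, the following toy Python code uses mean pooling as a stand-in for both encoders; the function names, the toy dimension, and the additive position vectors are our own assumptions, not the paper's implementation:

```python
# Toy sketch of the two-level user model framework (illustrative only):
# a behavior encoder maps each behavior and its position to a behavior
# embedding, and a user encoder aggregates behavior embeddings into a
# user embedding. Real models use CNN, self-attention, GRU, etc. here.

DIM = 4  # toy embedding dimension

def behavior_encoder(word_vectors, position_vector):
    """Encode one behavior (a list of word vectors) plus its position."""
    pooled = [sum(ws) / len(ws) for ws in zip(*word_vectors)]
    return [p + q for p, q in zip(pooled, position_vector)]

def user_encoder(behavior_embeddings):
    """Aggregate behavior embeddings into a single user embedding."""
    return [sum(xs) / len(xs) for xs in zip(*behavior_embeddings)]

# Two toy behaviors, each a list of 4-dimensional word vectors.
behaviors = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
]
positions = [[0.1] * DIM, [0.2] * DIM]

behavior_embs = [behavior_encoder(b, p) for b, p in zip(behaviors, positions)]
user_emb = user_encoder(behavior_embs)
assert len(user_emb) == DIM
```

Any sequence model can be swapped in for either function, as long as the behavior encoder produces one vector per behavior and the user encoder maps the resulting sequence to a single vector.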

Pre-training
We propose two self-supervision tasks for pre-training user models on unlabeled user behaviors: masked behavior prediction (MBP) and next K behaviors prediction (NBP). Their details are introduced as follows.

Task 1: Masked Behavior Prediction (MBP). Modeling the relatedness between user behaviors is important for user modeling (Sun et al., 2019). Inspired by the masked LM task proposed in BERT (Devlin et al., 2019) for language model pre-training, we propose a Masked Behavior Prediction (MBP) task to pre-train user models, as shown in Fig. 2(a). Different from words, which are usually easy to infer from their contexts, user behaviors are diverse and more difficult to predict. Thus, unlike BERT, which masks a fraction of words, we only randomly mask one behavior of a user. The goal of this task is to infer whether a candidate behavior r is the masked behavior of the target user u based on her other behaviors. We use a user model to encode the behavior sequence of the user u into her embedding u, and use a behavior encoder to obtain the candidate behavior embedding r. The relevance score ŷ between the user u and the candidate behavior r is evaluated by a predictor with the function ŷ = f(u, r). Motivated by DSSM (Huang et al., 2013), we use negative sampling to construct self-labeled samples for user model pre-training by packing the masked behavior r of a user u with P randomly sampled behaviors from other users. Then, we predict the relevance scores between the user embedding and the embeddings of these P + 1 candidate behaviors using the predictor, and normalize these scores via the softmax function to obtain the probability of each candidate behavior belonging to this user. We formulate the masked behavior prediction task as a multi-class classification problem and use the cross-entropy loss for pre-training:

$$\mathcal{L}_{MBP} = -\sum_{\mathcal{S}_1}\sum_{i=1}^{P+1} y_i \log(\hat{y}_i), \quad (1)$$

where y_i and ŷ_i are the gold and predicted labels of the i-th candidate, and S_1 is the dataset for user model pre-training constructed from the masked behavior prediction task.

Task 2: Next K Behaviors Prediction (NBP). Modeling the relatedness between past and future behaviors is also important for user modeling (Zhou et al., 2019). Thus, we propose a Next K Behaviors Prediction task to help user models grasp the relatedness between past and multiple future behaviors, as shown in Fig. 2(b). The goal is to infer whether a candidate behavior r_{N+k} is the k-th future behavior of the target user u based on her past N behaviors. We use a user model to obtain the user embedding and a behavior encoder to obtain the candidate behavior embeddings. Similar to the MBP task, we use a predictor to compute the relevance score ŷ_k between the user embedding u and each candidate behavior embedding r_{N+k}. We also use negative sampling, packing each real future behavior together with P behaviors from other users to construct labeled samples for model pre-training. The task is then formulated as K parallel multi-class classification problems, and its loss function is formulated as follows:

$$\mathcal{L}_{NBP} = -\frac{1}{K}\sum_{\mathcal{S}_2}\sum_{k=1}^{K}\sum_{i=1}^{P+1} y_{i,k}\log(\hat{y}_{i,k}), \quad (2)$$
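Both pre-training objectives can be sketched as (P+1)-way softmax classifications with a dot-product predictor, as in the following toy Python code. This is a hedged illustration only: the helper names, the convention that index 0 is the true candidate, and the toy embeddings are all our own assumptions, not the paper's code.

```python
import math

# Hedged sketch of the MBP and NBP objectives: each is a (P+1)-way
# softmax classification over one true behavior and P sampled
# negatives, scored by a dot-product predictor f(u, r) = u . r.

def dot(u, r):
    return sum(a * b for a, b in zip(u, r))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def candidate_loss(user_emb, true_emb, negative_embs):
    """Cross-entropy over P+1 candidates; index 0 is the true behavior."""
    probs = softmax([dot(user_emb, c) for c in [true_emb] + negative_embs])
    return -math.log(probs[0])

def mbp_loss(user_emb, masked_emb, negatives):
    """Task 1: predict the single masked behavior."""
    return candidate_loss(user_emb, masked_emb, negatives)

def nbp_loss(user_emb, future_embs, negatives_per_step):
    """Task 2: K parallel classifications over the next K behaviors."""
    losses = [candidate_loss(user_emb, f, negs)
              for f, negs in zip(future_embs, negatives_per_step)]
    return sum(losses) / len(losses)

# Toy 2-dimensional example with P = 2 negatives per classification.
u = [0.5, 0.5]
negs = [[-1.0, 0.0], [0.0, -1.0]]
l_mbp = mbp_loss(u, [1.0, 1.0], negs)
l_nbp = nbp_loss(u, [[1.0, 0.5], [0.5, 1.0]], [negs, negs])
assert l_mbp > 0.0 and l_nbp > 0.0
```

In practice the same negative-sampling-plus-softmax pattern underlies both tasks; only the prediction target differs (one masked historical behavior versus K future behaviors).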

where y_{i,k} and ŷ_{i,k} are the gold and predicted labels of the i-th candidate for the k-th future behavior, and S_2 is the dataset constructed from the NBP task. We pre-train the user model on the MBP and NBP tasks jointly, and the final loss function to be optimized is

$$\mathcal{L} = \mathcal{L}_{MBP} + \lambda\,\mathcal{L}_{NBP}, \quad (3)$$

where λ is a non-negative coefficient controlling the relative importance of the NBP task.
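The final pre-training objective above is simply a weighted sum of the two task losses; a trivial sketch (the function name is ours, and the numeric inputs are placeholders):

```python
def ptum_loss(loss_mbp, loss_nbp, lam=1.0):
    """Joint PTUM objective: L = L_MBP + lambda * L_NBP."""
    return loss_mbp + lam * loss_nbp

# With lambda = 0 the NBP signal is ignored; with a very large lambda
# it dominates, matching the trade-off explored in the hyper-parameter
# analysis.
total = ptum_loss(0.4, 0.6, lam=1.0)
assert abs(total - 1.0) < 1e-9
```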

Datasets and Experimental Settings
We conduct experiments on two tasks. The first is user demographic prediction. We construct a dataset (denoted as Demo) by collecting the webpage browsing behaviors of 20,000 users over one month (from 06/21/2019 to 07/20/2019), along with their age and gender labels, from a commercial search engine. The task is to infer the ages and genders of users from the titles of their browsed webpages. In this dataset, there are 12,769 male users and 7,231 female users; 103 users are under twenty, 2,895 between twenty and forty, 7,453 between forty and sixty, and 9,549 over sixty. We use 80% of users for training, 10% for validation and the rest for test. The second task is ads CTR prediction, for which we used a previously released dataset (denoted as CTR). This dataset contains the titles and descriptions of ads, ad impression logs, and the webpage browsing behaviors of 374,584 users over one month (from 01/01/2019 to 01/31/2019). The task is to infer whether a user will click a candidate ad based on the ad texts and the titles of browsed webpages. We use the logs in the last week for test, and the rest for training and validation (9:1 split). Since webpage browsing behaviors are used in both datasets, for model pre-training we use the titles of the browsed webpages of 500,000 users over about six months (from 05/01/2019 to 10/26/2019), collected from the same platform as the Demo dataset. Detailed dataset statistics are shown in Table 1.

In our experiments, the word embeddings were 300-dimensional. The predictor function is implemented as a dot product. The number K of future behaviors to predict was 2, the coefficient λ was 1.0, and the negative sampling ratio P was 4. These hyperparameters were tuned on the validation data; complete hyperparameter settings and analysis are included in the supplementary material.
To evaluate the performance of different methods, we used accuracy and macro F-score on the Demo dataset, and AUC and AP on the CTR dataset. Each experiment was repeated 10 times independently.
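For the Demo dataset, macro F-score averages the per-class F1 scores rather than weighting classes by frequency, which matters given the skewed gender and age distributions above. A minimal sketch of the standard definition (not the paper's evaluation code; the toy counts are ours):

```python
def macro_f1(tp, fp, fn):
    """Macro F-score from per-class true-positive, false-positive and
    false-negative counts (parallel lists, one entry per class)."""
    f1s = []
    for t, p, n in zip(tp, fp, fn):
        prec = t / (t + p) if t + p else 0.0
        rec = t / (t + n) if t + n else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Two-class toy example: class 0 has F1 = 0.8, class 1 has F1 = 2/3,
# so the macro average is (0.8 + 2/3) / 2.
score = macro_f1([2, 1], [1, 0], [0, 1])
assert abs(score - (0.8 + 2 / 3) / 2) < 1e-9
```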

Performance Evaluation
In this section, we verify the effectiveness of our proposed PTUM method for user model pre-training. We choose several state-of-the-art user models and compare their performance with that of their variants pre-trained by our PTUM method. On the Demo dataset, the models to be compared include: (1) HAN (Yang et al., 2016), a hierarchical attention network, which uses attentional LSTMs to learn behavior and user representations.
(2) HURA (Wu et al., 2019c), hierarchical user representation with attention model, which uses CNN and attention networks to learn behavior and user representations.
(3) HSA (Wu et al., 2019b), using hierarchical multi-head self-attention to learn behavior and user representations. On the CTR dataset, the models to be compared include: (1) GRU4Rec (Hidasi et al., 2016), using GRU networks to learn behavior and user representations.
(2) NativeCTR, using CNN and attention networks to learn behavior representations and behavior attention to learn user representations.
(3) BERT4Rec (Sun et al., 2019), using Transformers to learn behavior and user representations. The results on the two datasets under different ratios of training data are shown in Tables 2 and 3, respectively. We find that pre-trained user models consistently outperform their variants trained in an end-to-end manner. This is because pre-trained user models can capture the universal user information encoded in unlabeled user behaviors to help learn better user representations. In addition, the advantage of pre-trained user models is larger when training data is scarcer. This may be because pre-trained user models can exploit the complementary information provided by large-scale unlabeled user behavior data to reduce the dependency on labeled training data. Besides, fine-tuning pre-trained user models is necessary, possibly because fine-tuning with task-specific labeled data helps learn user representations specialized for downstream tasks.

Ablation Study
We conducted several ablation studies to verify the effectiveness of the two proposed self-supervision tasks for user model pre-training, i.e., masked behavior prediction and next K behaviors prediction, by removing one or both of them from PTUM. The results of HSA on the Demo dataset and BERT4Rec on the CTR dataset are shown in Figs. 3(a) and 3(b), respectively. We find that the masked behavior prediction task can effectively enhance pre-trained user models. This may be because the MBP task helps user models capture the relatedness between historical user behaviors, which is critical for user modeling (Sun et al., 2019). In addition, the next K behaviors prediction task can also improve model performance. This may be because the NBP task helps the user model grasp the relatedness between past and future user behaviors, which is also beneficial for user modeling (Zhou et al., 2019). Besides, combining the two tasks yields better performance, since both the relatedness among historical behaviors and that between past and future behaviors can be modeled.

Hyper-parameter Analysis
In this section, we explore the influence of two key hyper-parameters on our approach, i.e., the coefficient λ in Eq. (3) and the number K of future behaviors in the NBP task. We first vary the coefficient λ to compare the performance of PTUM under different values, and the results on the Demo and CTR datasets are shown in Figs. 4(a) and 4(b). From these results, we find the performance is not optimal under a small λ. This may be because the useful self-supervision signals in the NBP task are not fully exploited. When λ becomes too large, the performance begins to decline, possibly because the NBP task is over-emphasized and the MBP task is underweighted. Thus, setting λ = 1 to balance the two tasks may be most suitable.
Then, we vary the number K to explore its influence on the performance of PTUM, and the results are shown in Figs. 5(a) and 5(b). According to these results, the performance of pre-trained user models in downstream tasks is not optimal at K = 1. This is probably because the relatedness between the last input behavior and the first future behavior can be strong, and the model may overfit this short-term relatedness; simply predicting the single next behavior is therefore not optimal. In addition, we find the performance is sub-optimal when K is too large. This may be because user behaviors are diverse and difficult to predict accurately over the long term. Thus, a moderate K (e.g., K = 2) may be most appropriate.

Conclusion
In this paper, we propose PTUM, an effective method for pre-training user models from unlabeled user behaviors. In our method, we propose two self-supervision tasks for user model pre-training: masked behavior prediction and next K behaviors prediction, which help user models capture the relatedness among historical behaviors and the relatedness between past and future behaviors. Extensive experiments on two real-world datasets and different tasks show that pre-training user models can consistently boost the performance of various user modeling methods.