Transferring User Interests Across Websites with Unstructured Text for Cold-Start Recommendation



Introduction
While collaborative filtering (CF) approaches are among the most successful methods for building recommender systems, their performance deteriorates dramatically in cold-start situations: prediction accuracy is low for users or items with very few ratings. Content-based recommender systems can also suffer from the cold-start problem. For instance, content-based nearest-neighbor models (Pazzani and Billsus, 2007) may be ineffective if some users have too little information to generate a meaningful set of neighbors.
Two types of solutions have been proposed to address the cold-start problem. The first is to create hybrid recommendation models that impose a content-based model on a CF model to enrich the information for users or items with sparse rating profiles (Burke, 2002; Burke, 2007). The second is to transfer information from auxiliary domains as a remedy for cold-start individuals (Deng et al., 2015). This paper aims to combine these two strategies.
Although transfer learning is gradually gaining popularity for handling the cold-start issue (Roy et al., 2012), most existing approaches assume a homogeneous model in which observations in both domains are of the same type. That is, to transfer knowledge to a rating-based (or text-based) recommender system, the source system must also be rating-based (or text-based). Some earlier works even require the ratings from both domains to be in the same format (Li et al., 2009), or assume specific structured text, such as user-provided tags (Shi et al., 2011; Deng et al., 2015). In this work, by contrast, no source-domain ratings are available, and unstructured user-generated content is treated as the auxiliary data. We propose a heterogeneous transfer learning framework that uses unstructured auxiliary text to obtain a better target-domain CF model.
As no single service satisfies all social needs, users nowadays hold multiple accounts across many websites. Furthermore, an account linking mechanism is often available on these websites, which allows a precise mapping between the accounts of the same user to be built. One major application of our approach is to improve the recommendation quality in the target service using auxiliary data obtained from another, seemingly irrelevant service.
For instance, consider a new user on YouTube. The initial recommended videos for this user are likely to be irrelevant because very little information is available. However, with the account linking mechanism, a YouTube account can be linked to a Twitter account with a simple click. Our goal is to use the content generated by this user on Twitter, even though that content may be unrelated to the user's video preferences, to produce a better video recommendation list on YouTube.
Although the idea seems intuitive, such a cross-website transfer learning approach faces several difficulties. The biggest challenge is that most users do not use multiple services (e.g. social media sites) for the same purpose. Usually a user registers for multiple services because each of them serves its own purpose. As a result, we cannot assume that the source-domain text contains direct mentions of target-domain items. For example, a regular YouTube viewer does not necessarily tweet about the videos he or she has viewed, so simple methods such as keyword matching are likely to fail. The same reasoning also implies that, when transferring knowledge across websites or services, the assumption of a shared rating format or structured text is overly optimistic. Even websites serving the same purpose often violate this assumption, let alone websites of different types. We therefore expect the source and target services to contain heterogeneous information (e.g. content vs. ratings). In our model, we make a weaker assumption: regardless of the type of information available in each domain, users who are similar in one domain should have similar tastes in the other. Thus, instead of directly transferring content from the source to the target domain, we transfer the similarity between users and use it as a guide to improve the CF model in the target domain.

LDA-MF Model
We first introduce an intuitive model that realizes the above idea, and point out several intrinsic weaknesses that make it unsuitable for cross-website transfer learning.
Here we rely on the probabilistic matrix factorization (PMF) model as our target-domain CF model. In the PMF model, each user latent factor is modeled (a priori) by a zero-mean Gaussian. To incorporate source-domain information into the target-domain PMF model, for each user $i$, a topic vector $\theta_i$ is extracted from the source-domain text content and assigned as the prior mean of this user's PMF latent factor $u_i$, that is,
$$u_i \sim N(\theta_i, \lambda_U^{-1} I), \tag{1}$$
where $\lambda_U$ is the precision parameter and $I$ is the identity matrix. Unlike in the original PMF model, the prior distributions of different user latent factors are no longer identical. Users with similar source-domain topic vectors are expected to have latent factors that are close in the target-domain latent space. This property allows the similarity between users to be transferred from the source domain to the target domain.
With the latent Dirichlet allocation (LDA) (Blei et al., 2003) topic model being used, the graphical model is depicted in Figure 1. This model is similar in structure to the recently proposed collaborative topic regression (CTR) (Wang and Blei, 2011) model. The main difference is that, instead of modeling descriptions of items, our problem models user-generated content from the source domain. We call this model the LDA-MF model.
Figure 1: The LDA-MF model.
Although LDA-MF does incorporate knowledge from the source domain, it has certain weaknesses that need to be addressed. The most significant drawback is that the dimensionalities of the LDA topic vector $\theta_i$ and the PMF user latent factor $u_i$ are required to be equal. These two variables are of very different natures. One is extracted from text data in the source domain to model the topics of the user-generated content, and the other is generated from the rating data in the target domain to model the latent interests of users. Assuming that the optimal dimensionalities for LDA and PMF are equal is overly strong. In practice, if we choose the dimensionality to optimize the predictive power of PMF (e.g. by cross-validation on the rating data), the LDA model is likely to yield sub-optimal results, and vice versa. The experiments shown later confirm this concern. Furthermore, since the two variables model different types of observations coming from different websites, the underlying meanings of the latent dimensions are unlikely to be identical. By treating the LDA topic vector as the prior mean of the PMF user latent factor, the latent dimensions are forced to be aligned one-to-one, which is again a strong assumption. Finally, the topic vectors are drawn from the Dirichlet distribution, which has a bounded (and positive) support $S$, while the latent factors in PMF are unbounded Gaussian random vectors. If the optimal solution of $u_i$ is far from $S$, the performance of the model could be affected, particularly in the cold-start situation where data is sparse and the prior plays an important role.

Nearest-Neighbor Transfer MF Model
To alleviate the drawbacks of the LDA-MF model, we propose the nearest-neighbor transfer matrix factorization (NT-MF) model to transfer user interests across websites. The entire framework is depicted in Figure 2. We begin by describing the scenario in which our model operates.
First, there is a rating-based recommender system (i.e. PMF) in the target domain, which suffers from the cold-start problem. The target domain might or might not contain content information. For example, in the video recommendation task, we can use the titles of all videos rated by a user to generate content information in the target domain. Such information is not available for cold-start users, since they have not rated any videos. However, some content information is available for these users in the source domain, for example the content of a user's tweets. As previously mentioned, this type of auxiliary text data is immediately available when a user connects the accounts from the two domains. Therefore, we assume this auxiliary text data is available for all users.

Model Outline
Next, we describe the high-level concept of our model. As described previously, the hypothesis encoded by the LDA-MF model is too strong, because the PMF latent factor is forced to inherit certain mathematical properties from the LDA topic vector. Here we loosen the constraint and only require that users have similar distributions over the target-domain PMF latent factors if their source-domain topic vectors are highly similar.
This is a reasonable hypothesis, since our objective is to make the target-domain rating matrix factorize in a way that is consistent with the knowledge extracted from source-domain text. After all, the factorization of an incomplete matrix is not unique, and there is no reason that the latent factor should match the topic vector of the user. In fact, our hypothesis implies a different distribution over the PMF latent factor for each user, i.e. $u_i \sim N(\mu_i, \Sigma_i)$, where $(\mu_i, \Sigma_i)$ are unknown parameters that are (possibly) different for each user.
To estimate the unknown parameters of a distribution, we normally need a set of observations, $(u_i^{(1)}, u_i^{(2)}, \ldots)$. However, the parameters here belong to a distribution over a latent variable, which is non-trivial to estimate since we have no observations of the user latent factor. An exhaustive search over the parameter space is obviously intractable. Even if we treat the entire model as a hierarchical model and learn the parameters indirectly from rating data, the cold-start problem immediately arises and prevents us from learning a representative distribution for users.
We propose using the latent factors of the nearest-neighbors to estimate the unknown parameters of the distribution for a user. That is, the latent factors of the nearest-neighbors, $\{u_l\}_{l \in \mathrm{kNN}(i)}$, are regarded as a set of pseudo observations that replace the unavailable data $(u_i^{(1)}, u_i^{(2)}, \ldots)$. However, the definition of closeness is not based on target-domain rating data, but is computed from the topic vectors obtained from the content in the source domain (and the target domain, if available). Our model is thus not hampered by the cold-start problem.
Note that, in addition to the set of k-nearest-neighbors $\mathrm{kNN}(i)$, we also have the corresponding similarity scores $\mathrm{sim}(i, l)$ between each neighbor $l$ and user $i$. The similarity scores, along with the list of nearest-neighbors, are transferred to the target domain to form a set of weighted samples $D$, which can be used to estimate the unknown parameters $(\mu_i, \Sigma_i)$:
$$D = \{(u_l, w_l) : l \in \mathrm{kNN}(i)\}, \quad w_l \propto \mathrm{sim}(i, l). \tag{2}$$
The main purpose of assigning a sample weight $w_l$ to each pseudo observation $u_l$ is that users with a higher source-domain similarity to user $i$ then have a larger impact on the estimation of the target-domain parameters $(\mu_i, \Sigma_i)$. In other words, with this model specification, the similarity between users is transferred across domains.
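The construction of the weighted sample set can be sketched as follows. This is a minimal illustration, assuming the weights are the similarity scores normalized to sum to one and that similarities are non-negative; the function name `build_weighted_samples` and the toy data are hypothetical and not part of the original model description.

```python
def build_weighted_samples(latent_factors, neighbor_ids, similarities):
    """Pair each neighbor's latent factor u_l with a normalized weight w_l.

    latent_factors: dict mapping user id -> list of floats (current u_l).
    neighbor_ids:   ids in kNN(i), found from source-domain topic vectors.
    similarities:   sim(i, l) for each neighbor, assumed non-negative.
    """
    total = sum(similarities)
    if total <= 0:
        raise ValueError("similarities must have a positive sum")
    # Normalizing makes the weights add up to one.
    return [(latent_factors[l], s / total)
            for l, s in zip(neighbor_ids, similarities)]


# Toy example: user 0 has two neighbors found in the source domain.
U = {1: [1.0, 0.0], 2: [0.0, 2.0]}
D = build_weighted_samples(U, neighbor_ids=[1, 2], similarities=[3.0, 1.0])
```

Because only the neighbor list and the similarity scores cross the domain boundary, nothing in this step depends on the rating data of user i.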

Inference in NT-MF Model
To perform inference in our model, we adopt the maximum a posteriori (MAP) strategy and alternately update the user and item latent factors (i.e. by block coordinate ascent), similar to some previous solutions (Salakhutdinov and Mnih, 2007; Wang and Blei, 2011).
To solve for the optimal user latent factor $u_i$, we first need to estimate the unknown parameters $(\mu_i, \Sigma_i)$. Therefore, in our coordinate ascent algorithm, unlike in the original PMF model, we update the user latent factors one by one. That is, all user latent factors are regarded as fixed constants except for the one, $u_i$, being updated. By doing so, for each user $i$, a set of pseudo observations of $u_i$ (Eq. 2) is available. Using these pseudo observations, the unknown parameters $(\mu_i, \Sigma_i)$ can then be estimated with standard techniques such as maximum likelihood estimation (MLE). Once an estimator of $(\mu_i, \Sigma_i)$ is obtained, we can analytically solve for the MAP solution of the user latent factor $u_i$. Then we move on to the next user, and the coordinate ascent procedure continues. These two steps, namely the estimation of the unknown parameters and the updating of the latent factors, are repeated until convergence.
One advantage of this procedure is that the list of nearest-neighbors and the similarities in Eq. 2 need not be recomputed during inference, which avoids expensive recomputation of pairwise similarities. It is also worth noting that, unlike other transfer-based approaches, this optimization procedure requires neither rating information nor structured text from the source domain. This adds a further level of flexibility to our framework for transferring user interests across websites.

Case Study: Inferring Unknown Mean
To clarify the previous discussion, we present a simple but detailed case study of how an NT-MF model and its optimization procedure can be derived. The latent factor $u_i$ for each user is assumed to be generated from a multivariate normal distribution with unknown mean $\mu_i$ and a known precision parameter $\lambda_U$, which is shared among the users.
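The generative process proceeds as follows:
1. For each user $i$, draw the user latent factor $u_i \sim N(\mu_i, \lambda_U^{-1} I)$.
2. For each item $j$, draw the item latent factor $v_j \sim N(0, \lambda_V^{-1} I)$.
3. For each observed user-to-item pair $(i, j)$, draw the rating $r_{ij} \sim N(u_i^\top v_j, \lambda_0^{-1})$,
where $\lambda_0$ is the precision parameter of the ratings, and $\lambda_U$, $\lambda_V$ are the precision parameters of the users and items, respectively. We use the notation $N(x \mid \mu, \Sigma)$ to denote the Gaussian pdf with mean $\mu$ and covariance $\Sigma$. The model is optimized by maximizing the log-posterior of the latent variables (an additive constant is omitted),
$$\mathcal{L} = -\frac{\lambda_0}{2} \sum_{i,j} \gamma_{ij} (r_{ij} - u_i^\top v_j)^2 - \frac{\lambda_U}{2} \sum_i \| u_i - \mu_i \|^2 - \frac{\lambda_V}{2} \sum_j \| v_j \|^2, \tag{3}$$
where $\gamma_{ij}$ is an indicator variable that is equal to 1 if item $j$ is rated by user $i$, and 0 otherwise. To solve the MAP problem, we first need to estimate the unknown parameters of the distribution, which in this case is the mean vector $\mu_i$. The likelihood function over the pseudo observations is
$$L(\mu_i) = \prod_{l \in \mathrm{kNN}(i)} N(u_l \mid \mu_i, \lambda_U^{-1} I). \tag{4}$$
Taking the derivative of the logarithm of Eq. 4 with respect to $\mu_i$ and setting it to zero, we obtain
$$\lambda_U \sum_{l \in \mathrm{kNN}(i)} (u_l - \mu_i) = 0, \tag{5}$$
which implies that the MLE of $\mu_i$ is the sample mean. However, since we are dealing with a set of weighted samples, the sample mean is replaced by the weighted average (the weights $w_l$ are assumed to add up to one):
$$\hat{\mu}_i = \sum_{l \in \mathrm{kNN}(i)} w_l u_l. \tag{6}$$
Our model yields an intuitive result: to estimate the mean vector $\mu_i$ of $u_i$, we can simply take the weighted average of the latent factors $u_l$ of the nearest-neighbors as an estimator, where the weights are the similarity scores between the textual profiles of user $i$ and its neighbors.
Given $\mu_i$, we can now maximize Eq. 3 with respect to $u_i$ and $v_j$. Taking the derivative of Eq. 3 with respect to $u_i$ and $v_j$ and setting it to zero, we obtain the update equations
$$u_i = \Big( \lambda_0 \sum_j \gamma_{ij} v_j v_j^\top + \lambda_U I \Big)^{-1} \Big( \lambda_0 \sum_j \gamma_{ij} r_{ij} v_j + \lambda_U \mu_i \Big), \tag{7}$$
$$v_j = \Big( \lambda_0 \sum_i \gamma_{ij} u_i u_i^\top + \lambda_V I \Big)^{-1} \Big( \lambda_0 \sum_i \gamma_{ij} r_{ij} u_i \Big). \tag{8}$$
With Eq. 6 to Eq. 8 at hand, we can iteratively solve for $\mu_i$, $u_i$ and $v_j$ for all users and items until the model converges.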
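The alternating procedure built on Eq. 6 to Eq. 8 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes the closed-form updates that follow from a Gaussian prior centered at the neighbor-weighted mean and a zero-mean Gaussian prior on items, holds the mean fixed while each user is updated, and runs a fixed number of sweeps instead of testing convergence. The function name `nt_mf`, the hyperparameter values, and the toy data are hypothetical.

```python
import numpy as np


def nt_mf(R, mask, neighbors, weights, K=2, lam0=1.0, lam_u=0.1, lam_v=0.1,
          n_iters=20, seed=0):
    """Block coordinate ascent for the unknown-mean NT-MF case study.

    R, mask:   (n_users, n_items) ratings and 0/1 indicators gamma_ij.
    neighbors: neighbors[i] is an index array for kNN(i).
    weights:   weights[i] holds the matching w_l, summing to one.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = 0.1 * rng.standard_normal((n_users, K))
    V = 0.1 * rng.standard_normal((n_items, K))
    eye = np.eye(K)
    for _ in range(n_iters):
        for i in range(n_users):
            # Weighted average of the neighbors' current latent factors.
            mu_i = weights[i] @ U[neighbors[i]]
            # Closed-form MAP update of u_i with mu_i held fixed.
            rated = mask[i] == 1
            Vi = V[rated]
            A = lam0 * Vi.T @ Vi + lam_u * eye
            rhs = lam0 * Vi.T @ R[i, rated] + lam_u * mu_i
            U[i] = np.linalg.solve(A, rhs)
        for j in range(n_items):
            # Closed-form MAP update of v_j under a zero-mean prior.
            raters = mask[:, j] == 1
            Uj = U[raters]
            A = lam0 * Uj.T @ Uj + lam_v * eye
            rhs = lam0 * Uj.T @ R[raters, j]
            V[j] = np.linalg.solve(A, rhs)
    return U, V


# Toy data: users 0 and 1 have ratings, user 2 is a cold-start user.
R = np.array([[1.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
mask = np.array([[1, 1, 1], [1, 1, 1], [0, 0, 0]])
neighbors = [np.array([1]), np.array([0]), np.array([0, 1])]
weights = [np.array([1.0]), np.array([1.0]), np.array([0.5, 0.5])]
U, V = nt_mf(R, mask, neighbors, weights)
```

For the user with no observed ratings, the data term vanishes and the update returns the weighted neighbor mean itself, which is how the prior carries the prediction in the cold-start case.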
This case study shows that NT-MF eliminates the three major drawbacks of the previously mentioned LDA-MF model. First, the topic vectors and the user latent factors are not required to have equal dimensionalities, which allows the optimal dimensionality to be chosen for each model. Second, the mean vector, that is, the kNN weighted average in Eq. 6, is a linear combination of a set of user latent factors; as a result, the latent dimensions of $u_i$ and $\mu_i$ are naturally aligned. Third, the mean vector $\mu_i$ has the same support as the user latent factor $u_i$, which avoids the risk of prior misspecification in cold-start situations.

Experiment
We use YouTube video recommendation to test the usefulness of NT-MF under the cold-start scenario. The NT-MF model used in this section follows the optimization procedure derived in Section 3.3.

Dataset and Statistics
To construct a dataset containing both the users' rating history and textual information, we begin with the user profile pages on Google+. A large proportion of Google+ users provide links to their profile pages on other social network services (e.g. Twitter). More importantly, if a user owns a YouTube account, a link to the user's YouTube channel is automatically added to their Google+ profile. This makes a fully aligned dataset available. Users' Twitter accounts are obtained via their Google+ profile pages, and the concatenation of a user's tweets is regarded as the auxiliary text data. It has been shown that concatenating the tweets yields more representative user topic vectors (Hong and Davison, 2010). We refer to this text data as the Twitter corpus.
Videos in a user's liked or favorite playlists are considered to have a rating $r_{ij} = 1$. Other videos are assigned $r_{ij} = 0$. In other words, we are dealing with a one-class collaborative filtering (OCCF) problem (Pan et al., 2008). We adopt the same strategy as in (Wang and Blei, 2011) to deal with OCCF. First, all ratings are assumed to be observed, i.e. $\gamma_{ij} = 1$ for all user-item pairs. Next, a confidence parameter $c_{ij}$ is introduced to reduce the influence of the huge number of zeroes during model optimization. The confidence parameter takes the place of the original rating precision parameter $\lambda_0$ and is defined in (Wang and Blei, 2011) as $c_{ij} = a$ if $r_{ij} = 1$ and $c_{ij} = b$ otherwise ($a > b > 0$). All derivations in the previous sections carry over directly with $\lambda_0$ replaced by $c_{ij}$.
The titles of the liked videos are concatenated and treated as the text data in the target domain (which we refer to as the YouTube corpus). As for the vocabulary, stopwords are first removed, and then 5000 words are selected from the YouTube corpus based on their TF-IDF scores (Blei and Lafferty, 2009). On average, each user's Twitter text data contains 5149 words and 1193 distinct terms, and each user's YouTube text data contains 158 words and 116 distinct terms. These statistics are in accordance with our assumption that text data in the source domain is abundant compared with that in the target domain.
So that the prediction results can be validated, every user in the dataset has at least 10 liked videos. Videos with fewer than 5 likes are removed from the dataset. After data cleansing, there are 7328 users and 18691 videos in the dataset. The maximum number of likes received by a video is 98, and the average is 19.1. Among all videos, 92% are liked by fewer than 40 users. The maximum number of likes given by a user is 908, and the average is 48.8. Among all users, 89% have liked fewer than 100 videos. The sparsity (ratio of zeroes to the total number of entries) of the rating matrix is 99.74%, which illustrates the difficulty of this recommendation task.
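To make the role of the confidence parameter concrete, the sketch below builds the confidence values and applies them in a single user update, where each rating gets its own precision instead of one shared value. It is a hedged illustration under the Gaussian-prior setting of the case study; the values a = 1.0 and b = 0.01 and the function names are assumptions chosen for the example, not settings reported in the text.

```python
import numpy as np


def confidence_matrix(R, a=1.0, b=0.01):
    """c_ij = a where r_ij = 1 and b elsewhere, with a > b > 0."""
    if not a > b > 0:
        raise ValueError("require a > b > 0")
    return np.where(R == 1, a, b)


def update_user(r_i, c_i, V, mu_i, lam_u):
    """MAP update of one user factor with per-entry precisions c_ij."""
    K = V.shape[1]
    # Each item vector is weighted by the confidence of its rating.
    A = (V * c_i[:, None]).T @ V + lam_u * np.eye(K)
    rhs = V.T @ (c_i * r_i) + lam_u * mu_i
    return np.linalg.solve(A, rhs)


# Toy example: one user who liked the first of three videos.
R = np.array([[1, 0, 0]])
C = confidence_matrix(R)
u = update_user(R[0], C[0], V=np.eye(3), mu_i=np.zeros(3), lam_u=1.0)
```

With a small b, the many zero entries still pull predictions toward zero, but far less strongly than the few observed likes pull them toward one.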
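The reported statistics can be cross-checked against each other with a few lines of arithmetic. The snippet below uses only the numbers given above; because the per-user average is rounded, the derived total number of likes is approximate.

```python
n_users, n_videos = 7328, 18691
avg_likes_per_user = 48.8

total_entries = n_users * n_videos
# The average is rounded in the text, so this count is approximate.
approx_likes = n_users * avg_likes_per_user
sparsity = 1.0 - approx_likes / total_entries
avg_likes_per_video = approx_likes / n_videos

print(f"entries in rating matrix: {total_entries}")
print(f"sparsity: {sparsity:.2%}")
print(f"average likes per video: {avg_likes_per_video:.1f}")
```

The derived sparsity and the average number of likes per video agree with the 99.74% and 19.1 figures reported above.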

Evaluation and Scenario
We choose the area under the ROC curve (AUC) as the evaluation metric. AUC is often used to compare models when there is severe class imbalance, which is the case in our OCCF problem since we regard all zeroes as observed. All reported results are the average over 5 random data splits.
Similar to the experiments performed in (Wang and Blei, 2011), we test the performance of each model under two different scenarios. The first is the task of in-matrix prediction. In this task, the likes received by each video are partitioned into three sets, namely the training, validation and testing sets, with a ratio of 3:1:1. There are no cold-start users in the in-matrix prediction task.
The second task is out-of-matrix prediction, where the users are partitioned into three sets with the same 3:1:1 ratio. To make the two tasks comparable, we randomly split the data until the number of observations in each of the three sets is close to that of the in-matrix task. Users in the testing set are all cold-start users. The only data we have when making predictions for the cold-start users is the auxiliary text data.
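For reference, AUC can be computed from ranks alone, which makes its insensitivity to class imbalance easy to see. The sketch below is a generic rank-based implementation with average ranks for ties; whether the reported AUC is computed per user or over all user-video pairs is not stated here, so the function simply takes one list of labels and one list of scores. The name `auc` is hypothetical.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    labels: 0/1 values; scores: predicted ratings. Tied scores share the
    average of the ranks they occupy.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and scores[order[end + 1]] == scores[order[pos]]):
            end += 1
        avg_rank = (pos + end) / 2.0 + 1.0  # 1-based average rank of the group
        for t in range(pos, end + 1):
            ranks[order[t]] = avg_rank
        pos = end + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


perfect = auc([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3])
partial = auc([1, 0, 0, 1], [0.4, 0.6, 0.2, 0.8])
tied = auc([1, 0], [0.5, 0.5])
```

In the second toy call, three of the four positive-negative pairs are ordered correctly, so the value is 0.75 regardless of how many negatives there are relative to positives.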
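The difference between the two protocols is what gets partitioned: likes within each video for in-matrix prediction, and whole users for out-of-matrix prediction. The sketch below illustrates this under simplifying assumptions; it does not implement the repeated re-splitting used to match the number of observations across tasks, and all function names and toy inputs are hypothetical.

```python
import random


def split_3_1_1(items, rng):
    """Shuffle and split a list into train/validation/test at a 3:1:1 ratio."""
    items = list(items)
    rng.shuffle(items)
    n_train = (3 * len(items)) // 5
    n_valid = len(items) // 5
    return (items[:n_train],
            items[n_train:n_train + n_valid],
            items[n_train + n_valid:])


def in_matrix_split(likes_by_video, rng):
    """Partition the likes (user ids) received by every video."""
    return {v: split_3_1_1(users, rng) for v, users in likes_by_video.items()}


def out_of_matrix_split(user_ids, rng):
    """Partition the users; test users contribute no training ratings."""
    return split_3_1_1(user_ids, rng)


rng = random.Random(0)
train_u, valid_u, test_u = out_of_matrix_split(range(10), rng)
per_video = in_matrix_split({"v1": [0, 1, 2, 3, 4]}, rng)
```

In the out-of-matrix case every like of a test user is held out, so at prediction time only that user's auxiliary text is available.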

Baseline Methods
• LDA: We run linear regression on the LDA features to predict the ratings. This model serves as a content-based baseline.
• UKNN: The user-kNN algorithm (Herlocker et al., 1999) based on LDA features is implemented. This model serves as a neighborhood-based baseline.
• PMF: PMF (Salakhutdinov and Mnih, 2007) is a classic and widely-used CF model. It uses only the rating information, and thus is not capable of performing the out-of-matrix task.
• LDA-MF: This model is implemented as described in Section 2. It is similar to CTR (Wang and Blei, 2011) in structure. Since the optimization of the full model converges poorly, we pre-train the LDA part of the model and fix the topic vectors when optimizing the PMF part.
All hyperparameters are tuned on the validation set. For efficiency and storage reasons, the k-nearest-neighbors for UKNN and NT-MF are computed approximately with the FLANN library (Muja and Lowe, 2014). The symmetric Kullback-Leibler divergence is chosen as the distance metric between topic vectors. For all baseline methods, we use K to denote the dimensionality of the latent variables. However, when discussing NT-MF, since the number of topics can differ from the number of user latent factors, we use T to denote the former and K to denote the latter to avoid confusion.
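The distance computation can be illustrated with a small exact search. Note the substitutions: this sketch uses brute-force search rather than FLANN's approximate search, and it takes the symmetric divergence to be the sum of the two directed KL divergences, since the text does not say whether a factor of one half is applied. The function names and the smoothing constant `eps` are hypothetical.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p) for two discrete distributions."""
    total = 0.0
    for pi, qi in zip(p, q):
        # Clamp to eps so zero probabilities do not break the logarithm.
        pi, qi = max(pi, eps), max(qi, eps)
        total += (pi - qi) * math.log(pi / qi)
    return total


def k_nearest(topic_vectors, i, k):
    """Exact k nearest neighbors of user i under symmetric KL divergence."""
    dists = [(symmetric_kl(topic_vectors[i], t), l)
             for l, t in enumerate(topic_vectors) if l != i]
    return [l for _, l in sorted(dists)[:k]]


theta = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
nearest_to_0 = k_nearest(theta, i=0, k=1)
```

The brute-force search costs one divergence per pair of users, which is why an approximate index becomes attractive for thousands of users.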

In-Matrix Prediction
In this section, we discuss in-matrix prediction. First, we test the models' general performance on different corpora. Normally, the optimal number of topics is not the same for different corpora. Since the LDA model performs best with K = 50 on the YouTube corpus and K = 200 on the Twitter corpus, we report the results when K is set to these two values.
Figure 3(a) shows the results when no source-domain information is available and thus no transfer learning is performed. That is, all models are provided only with the YouTube ratings and the YouTube corpus. Because the YouTube corpus is small, the LDA model yields a lower AUC when more topics are used, which indicates overfitting. The same reason also leads to the limited improvement of LDA-MF over PMF. Using neighborhood information alone, UKNN performs poorly. On the other hand, as a model that brings neighborhood information into PMF, NT-MF outperforms all baselines significantly. This analysis shows that, although using either content (LDA) or neighborhood (UKNN) information alone is insufficient to generate good predictions, both can effectively improve the factorization of the rating matrix if used correctly.
To demonstrate the advantage of transfer learning, we study the scenario where only source-domain text and target-domain ratings are available. That is, the YouTube corpus in the previous analysis is replaced with the Twitter corpus. The result is shown in Figure 3(b). Compared with Figure 3(a), we can see that although the Twitter corpus is larger than the YouTube corpus, it leads to worse performance for LDA and UKNN. Content information from the noisy Twitter corpus alone is not sufficient to capture the rating behavior of users. However, by integrating the content information and rating history, both LDA-MF and NT-MF benefit from a larger corpus.
In the following analyses, we use data from both websites. For LDA, PMF and LDA-MF, we merge the two corpora by summing up the word counts. For UKNN and NT-MF, however, there is a more elegant way to combine the knowledge from different websites. First, we compute user similarity separately from the two corpora. Then, the two sets of similarity scores are weighted and averaged. Finally, the nearest-neighbors are computed based on this set of newly generated similarity scores. By applying this strategy to NT-MF, not only can $\theta_i$ and $u_i$ differ in dimensionality, but the optimal number of topics can also be used for each corpus. Regardless of K, we use T = 50 for YouTube and T = 200 for Twitter in our NT-MF model. The result is shown in Figure 3(c). Comparing it with Figure 3(b), we can see that the AUC of NT-MF increases while that of LDA-MF remains unchanged. UKNN also benefits from this strategy. These facts show that our strategy of averaging the similarities is more advantageous than merging the two corpora directly.
Next, as a preliminary investigation of the performance on cold-start users, in Figure 4(a) we plot the cumulative AUC with respect to the total number of observed ratings. NT-MF outperforms the other methods in terms of cumulative AUC regardless of the number of observed ratings. The advantage of NT-MF over the baseline methods is even greater as the number of observed ratings decreases (except for LDA). To make this clear, we plot the difference in AUC between NT-MF and the baseline methods in Figure 4(b). This phenomenon sheds light on the advantage of NT-MF under the cold-start scenario.
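The similarity-averaging strategy can be sketched as follows. This is an illustration under stated assumptions: similarity scores are taken as given (how a divergence is converted into a similarity is not specified here), a neighbor missing from one corpus is treated as having similarity zero there, and the mixing weight `alpha` and all names are hypothetical.

```python
def combine_similarities(sim_a, sim_b, alpha=0.5):
    """Weighted average of two similarity tables keyed by neighbor id.

    alpha is the weight placed on the first corpus.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    keys = set(sim_a) | set(sim_b)
    return {l: alpha * sim_a.get(l, 0.0) + (1.0 - alpha) * sim_b.get(l, 0.0)
            for l in keys}


def top_k(sim, k):
    """Ids of the k most similar users; ties are broken by smaller id."""
    return sorted(sim, key=lambda l: (-sim[l], l))[:k]


# Similarities of candidate neighbors 1..3 to one user, per corpus.
youtube_sim = {1: 0.9, 2: 0.2, 3: 0.4}
twitter_sim = {1: 0.3, 2: 0.8, 3: 0.1}
both = top_k(combine_similarities(youtube_sim, twitter_sim, alpha=0.5), k=2)
twitter_only = top_k(combine_similarities(youtube_sim, twitter_sim, alpha=0.0), k=2)
```

Because each corpus is only used to produce similarity scores, each topic model can keep its own number of topics, which merging the word counts would not allow.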

Out-of-Matrix Prediction
In this section, we discuss out-of-matrix prediction. Users in the testing set are all completely cold-start users. That is, we are provided only with the Twitter corpus when making predictions for these users. Therefore, our previous strategy of averaging the similarities applies only to users in the training set. For this study we adopt the strategy of merging the two corpora instead of averaging the similarities. The number of topics T = 150 is chosen for NT-MF based on the validation AUC.
The result is presented in Figure 5. We plot the AUC against the dimensionality of the latent variables K. It can be observed that NT-MF beats all baseline methods regardless of K. Compared with Figure 3, the out-of-matrix AUC is much lower, which reflects the difficulty of cold-start recommendation.
Under the cold-start scenario, the latent factor used in the prediction phase is taken to be the prior mean for the MF-based models. For LDA-MF the prior mean is the topic vector $\theta_i$, while for NT-MF it is the weighted average $\mu_i$ given by Eq. 6. Since $\theta_i$ is used in place of $u_i$ in the LDA-MF model when generating predictions, the curves of LDA and LDA-MF look very similar. A paired t-test at the 0.05 significance level shows no statistically significant difference between these two methods when K = 10 (p = 0.48) and K = 20 (p = 0.09). Although $u_i = \theta_i$ is fixed for the cold-start users in the LDA-MF model, as K becomes larger the item latent factors can carry more information from the rating data, which results in a higher AUC than LDA. However, since the dimensionalities of the LDA part and the PMF part must match, the inference procedure of LDA-MF becomes very slow when K is large. To make better use of the available data, computational efficiency must be sacrificed.
On the other hand, note that NT-MF achieves the highest AUC when K = 50. In fact, not only does NT-MF beat all baseline methods under different K values, it also outperforms the best LDA-MF model (K = 200) with fewer latent factors (K = 20). Unlike in LDA-MF, the latent factors of the cold-start users are not fixed in NT-MF. Therefore, NT-MF can represent the information in a more concise way. In this case, NT-MF is better than LDA-MF in terms of both execution speed and predictive power.
In Figure 6 we investigate the effect of different values of K and T. For each curve, we can see that the performance is about the same for K ≥ 50. This is in accordance with the observation that NT-MF does not need as many latent factors as LDA-MF to achieve the same level of performance. Also, while increasing the number of topics T improves the performance in general, increasing T from 150 to 200 gives no significant improvement. The most important observation is that the highest AUC is achieved when K = 50 and T = 150. In other words, the optimal number of topics is different from the optimal number of user latent factors. This further justifies the advantage of NT-MF over previous methods.
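Prediction for a cold-start user in NT-MF can be sketched as a two-step computation: form the prior mean from the trained latent factors of the user's neighbors, then score every item with it. The snippet is a minimal illustration with hypothetical names and toy values; it assumes the neighbors and weights were obtained from the auxiliary text and that the weights sum to one.

```python
import numpy as np


def prior_mean(U_train, neighbor_ids, weights):
    """Weighted average of the neighbors' trained latent factors."""
    return np.asarray(weights) @ U_train[np.asarray(neighbor_ids)]


def cold_start_scores(mu_i, V):
    """Predicted rating for every item j: the inner product of mu_i and v_j."""
    return V @ mu_i


U_train = np.array([[1.0, 0.0], [0.0, 1.0]])        # trained user factors
V = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # trained item factors
mu = prior_mean(U_train, neighbor_ids=[0, 1], weights=[0.75, 0.25])
scores = cold_start_scores(mu, V)
ranking = np.argsort(-scores).tolist()
```

Unlike the LDA-MF case, the vector used for scoring lives in the same K-dimensional space as the trained user factors, whatever the number of topics used to find the neighbors.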

Related Work
Although they do not directly aim to solve the problem we have proposed, there exist some models that have a similar structure or adopt similar ideas.
As previously mentioned, LDA-MF is similar in structure to CTR. Collaborative topic Poisson factorization (CTPF) (Gopalan et al., 2014) combines the ideas of CTR and Poisson factorization (Gopalan et al., 2013) for better performance. We have also tried CTPF on our dataset; nevertheless, there is no significant improvement over LDA-MF.
Recently, the neighborhood-aware probabilistic matrix factorization (NHPMF) model was proposed (Wu et al., 2012) as a method to combine kNN and PMF. It was originally proposed to leverage tagging data for improving PMF. This model can also be applied to our problem if we use the Twitter corpus in place of the unavailable tagging data. However, in the NHPMF model, the mean parameters are not treated as constants when the user latent factors are updated. As a result, an extra term appears in the gradient formula, which leads to an $O(k^2)$ time complexity, with $k$ being the number of nearest-neighbors considered. By contrast, the computation of the weighted average (i.e. Eq. 6) takes $O(k)$ time. We have implemented NHPMF for comparison. As we increase $k$, NHPMF becomes significantly slower than NT-MF, while its performance is no better than that of NT-MF on our dataset.

Conclusion
In this work, we propose NT-MF, a cross-website transfer learning model that integrates content, neighborhood and rating information to alleviate the cold-start problem. A significant improvement over previous methods is demonstrated on a real-world cross-website dataset. The improvement is even more significant under the cold-start scenario.
So far we use the LDA topic vector to represent a user. As future work, different aspects of text can be taken into account to generate a more comprehensive user model. For example, writing styles or opinion mining may provide different insights on user behavior. Another possible extension is to apply our idea to more realistic settings such as large-scale and online recommender systems.