Enhanced Personalized Search using Social Data


Abstract
Search personalization that considers the social dimension of the web has attracted a significant volume of research in recent years. A user profile is usually needed to represent a user's interests in order to tailor future searches. Previous research has typically constructed a profile solely from a user's usage information. When the user has only limited activities in the system, the effect of the user profile on search is also constrained. This research addresses the setting where a user has only a limited amount of usage information. We build enhanced user profiles from a set of annotations and resources that users have marked, together with an external knowledge base constructed according to usage histories. We present two probabilistic latent topic models to simultaneously incorporate social annotations, documents and the external knowledge base. Our web search strategy is achieved using personalized social query expansion. We introduce a topical query expansion model to enhance the search by utilizing individual user profiles. The proposed approaches have been intensively evaluated on a large public social annotation dataset. Results show that our models significantly outperformed existing personalized query expansion methods which use user profiles solely built from past usage information in personalized search.

Introduction
On today's social web, users can enrich the social context of web pages. Most notably, users can often freely tag web pages with annotations (Gupta et al., 2011). These tags can be high quality descriptors of the web pages' topics and a good indicator of web users' interests. However, the uncontrolled manner of social tagging results in the use of an unrestricted vocabulary. This makes searching through the collection difficult and generally less accurate. Thus the social annotation or bookmarking system demonstrates an extreme example of the vocabulary mismatch problem encountered in personalized web search. To tackle the problem, various personalized query expansion (QE) and result re-ranking techniques have been proposed and evaluated (Bouadjenek et al., 2016).
There have been some attempts to achieve personalized QE using social data. For example, researchers have considered selecting the most related tags from a user's profile to expand queries (Bender et al., 2008; Bertier et al., 2009; Bouadjenek et al., 2011). Local analysis and co-occurrence based user profile representation have also been adopted to expand the query (Chirita et al., 2007; Biancalana et al., 2013). Recently, Zhou et al. proposed a query expansion framework based on individual user profiles (Zhou et al., 2012a). In their work, terms in the user profile are modeled according to their associations, which can be defined by co-occurrence statistics or by a tag-topic model.
All of the previously mentioned systems depend upon historical usage information being available in an individual user profile (Sugiyama et al., 2004; Teevan et al., 2005; Bennett et al., 2012; Zhou et al., 2014; Guha et al., 2015; Zhou et al., 2016). This information is pivotal when tailoring search results to the preferences of specific individuals. However, in some cases a user may have had very limited previous interactions with the system. With little usage information at hand, the personalized search experience is poor. Furthermore, historical usage information alone may not be enough to personalize search.
In this paper, we extend personalized search using social data in two directions. First, we exploit external knowledge bases to enhance the user profile built from a user's historical usage information. We build queries from the user tags and annotated web pages. Subsequently, we fetch the relevant documents from an external corpus to be included in the user profile. We then propose to incorporate the user's annotations, web page content information and external documents through two statistical models, which we have named the Mixture Enhanced User Profiling (MEUP) model and the Separated Enhanced User Profiling (SEUP) model. Both models infer latent topics, their probabilities of being relevant and a multinomial distribution of topics for the documents being considered. MEUP mixes the tags, annotated documents and external documents together to infer unified latent topics, while SEUP is an extension of MEUP which learns topics that are shared between the two groups of document-aligned pairs. Second, we propose a topical query expansion model to personalize web search by utilizing the user profiles. In the topical QE model, profile terms are weighted by their topical relevance to the query terms in order to expand the query.
Experimental results show that the Enhanced User Profiling models together with the topical QE model can significantly improve retrieval performance over user profiles built solely from a user's historical information. Improvements were observed both for users with a rich amount of usage information and for users with a small amount. We also demonstrate that the proposed approach outperforms existing QE methods proposed for personalized search using social data.
The contributions of this paper can be summarized as follows:
i. We tackle the challenge of personalized web search using social data in a novel way by enhancing user profiles that are built solely from users' historical usage information.
ii. We propose and systematically evaluate two novel generative models to construct enriched user profiles with the help of external corpora in the context of personalized search using social data.
iii. We suggest and evaluate a novel query expansion method. Instead of relying solely on lexical relevance information between query terms and profile terms, we also consider the topical relevance between them to expand the query.

Personalized Search Using Social Media
In personalized search using social media (Jamali and Ester, 2010; Lin et al., 2013), the search process is either performed over "social" data gathered from Web 2.0 applications such as social bookmarking systems, wikis, blogs etc., or it readapts the web search results produced by search engines by using social data (Carman et al., 2008; Bouadjenek et al., 2016). For example, Vallet et al. (2010) investigated how the ranking of search engine results can be improved with respect to users if the users' social information is taken into consideration. A similar approach was also explored by Noll and Meinel (2007), where the system re-ranked Google search results based on social bookmarks and tags harvested from del.icio.us. However, data sparsity poses a challenge to this approach, as not all web pages returned by search engines are tagged in the del.icio.us dataset.

Personalized Results Re-Ranking
Because of this problem, researchers started to use social data as a test collection to develop personalized techniques. In this setting, personalization usually follows one of two general approaches. The first approach submits a query to the collection but re-ranks the returned results based on an individual user profile. Xu et al. (2008) re-rank the results according to the topical relevance of documents and users' interests. Carmel et al. (2009) investigated personalized results re-ranking based on the user's social relations. Wang and Jin (2010) explored gathering data from multiple online social systems for adaptive search personalization.
Bouadjenek et al. (2013a; 2013b) propose to use social data and user relationships to enhance document representation for re-ranking purposes. Though this group of work is attractive, if relevant items cannot be fetched in the first place, the results still tend to be unsatisfactory, however complex the re-ranking process.

Personalized Query Expansion
Another group of work modifies or augments a user's original query. This approach is termed query expansion (Zhou et al., 2015). Researchers have considered tag-tag relationships for personalized query expansion, by selecting the most related tags from a user's profile (Bender et al., 2008; Bertier et al., 2009). However, tags cannot be relied upon to consistently provide precise descriptions of resources for use when searching. Local analysis and co-occurrence based user profile representation have also been adopted to expand the query (Chirita et al., 2007; Biancalana and Micarelli, 2009). However, the expansion terms are solely based on lexical matching between the query and the terms which exist in the user profile. Zhou et al. proposed a query expansion framework based on individual user profiles (Zhou et al., 2012a; Zhou et al., 2012b). In their work, terms in the user profile are modeled according to their associations, which can be defined by co-occurrence statistics or by a tag-topic model. The method simultaneously incorporates annotations and web documents in a latent graph, regularized by terms extracted from the top-ranked documents.
However, all of the previously mentioned systems consider constructing user profiles solely from past usage information. In contrast, in this paper we extend personalized web search using social data by exploiting an external knowledge base to enhance the user profile.

Problem Definition and Solution Overview
In social annotation and bookmarking systems such as del.icio.us or CiteULike, users can label interesting web resources with primarily short and unstructured annotations in natural language called tags. These web resources are denoted by a URL on the del.icio.us website. Textual content can be crawled by following a URL that refers to a document or web page. Please refer to Table 1 for the basic notations used in this paper. Formally, social tagging data can be represented by a tuple $\mathcal{F} := (U, T, D, A)$, where $U$, $T$ and $D$ are the sets of users, tags and documents. $A \subseteq U \times T \times D$ is a ternary relation, whose elements are called tag assignments or annotations (or bookmarks). The set of annotations of a user $u$ is defined as $A_u := \{(t, d) \mid (u, t, d) \in A\}$. The tag vocabulary of a user is given as $T_u := \{t \mid (t, d) \in A_u\}$. We define terms extracted from a user's set of documents as $W_u := \{w \mid w \in D_u\}$, where $w$ denotes a word/term in the annotated documents. Similarly, we define terms extracted from a user's set of external documents as $W_u^{exter} := \{w \mid w \in D_u^{exter}\}$, where $D_u^{exter}$ denotes a user's set of external documents from an external corpus $C_{exter}$. In a typical personalized search scenario, given a source query $q$ and a set of words in the user profile $\{w_1, w_2, \dots, w_n\}$, the goal is to return a ranked list of profile terms to be added to the query, for a second round retrieval of results.
Our personalization approach consists of three main steps (see Table 2): fetching external documents; user profile modelling; and personalized query expansion. We enhance a user's historical usage information in step one. We first concatenate all tags in $T_u$ into a query $q_t$ (line 1).
Then, for each document in $D_u$, we extract the terms with the highest inverse document frequency (idf) scores as queries $q_d$ (lines 2-3, where the EXTRACTTOP function returns the top λ terms). Next we send the queries in $Q_{exter}$ ($q_t \cup q_d$) to an external corpus $C_{exter}$ to fetch $D_u^{exter}$ together with their retrieval scores $r_{d,i}$ (lines 5-7; the number of documents retrieved by each query is controlled by the parameter γ).
Step two integrates $T_u$ (here all tags are concatenated and viewed as a single document), $D_u$, $D_u^{exter}$ and their retrieval scores $r_{d,i}$ into a topic model such that a multinomial distribution of topics specific to each document can be inferred (lines 8-14; we omit the procedure for the SEUP model because of its similarity to the simpler model, see the next section). In the last step, the algorithm uses the output of step two to build a topical query expansion model in order to expand the original query (lines 15-18). Note that steps one and two can be executed off-line to improve the efficiency of the algorithm.
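As a rough illustration of step one, the query-building part can be sketched as below. This is a minimal sketch under stated assumptions, not the procedure of Table 2 itself: the function names (`idf_scores`, `extract_top`, `build_external_queries`) are hypothetical, documents are assumed to be already tokenized, and idf is computed over the user's own annotated documents because the text does not say which collection the idf statistics come from.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Inverse document frequency of every term over a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def extract_top(doc, idf, lam):
    """Hypothetical stand-in for EXTRACTTOP: the lam distinct terms with the highest idf.

    Ties are broken alphabetically so the result is deterministic.
    """
    terms = sorted(set(doc), key=lambda t: (-idf.get(t, 0.0), t))
    return terms[:lam]


def build_external_queries(user_tags, user_docs, lam=10):
    """Return one query made of all tags, followed by one query per annotated document."""
    idf = idf_scores(user_docs)  # assumption: idf over the user's own documents
    tag_query = list(user_tags)
    doc_queries = [extract_top(doc, idf, lam) for doc in user_docs]
    return [tag_query] + doc_queries
```

Each returned query would then be sent to the external corpus, keeping the top γ documents and their retrieval scores.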

Enhanced User Profiling Models
In this section we describe how to model user profiles (i.e. step two in Table 2). We present two Enhanced User Profiling (EUP) models for this purpose.

Mixture Enhanced User Profiling
Topic discovery in the EUP models is influenced not only by term co-occurrences, but also by the retrieval scores of documents. To avoid normalization, we employ a log-normal distribution for retrieval scores to infer latent topics via the documents and their relevance probabilities.
The MEUP model developed here is a generative model of retrieval scores and the words in the documents. In its generative process, the retrieval scores of terms in the same document are the same and are calculated by a language model retrieval function (Manning et al., 2008) for retrieved documents in $D_u^{exter}$. The retrieval scores for $T_u$ (here all tags are concatenated and viewed as a single document) and documents in $D_u$ are set to one. We normalize the scores by the maximum score in the retrieval list. We use a fixed number of latent topics $K$. The posterior distribution of topics depends on two sets of information: the terms and the retrieval scores of the documents.
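To make the idea of a joint generative model of words and retrieval scores more concrete, the following sketch draws one toy document. It is only one plausible reading of the description, assuming an LDA-style process in which every token draws a topic, then a word from that topic's multinomial and a score from that topic's log-normal distribution. The function name and all parameter names are hypothetical. In the observed data every token of a document carries the same score; the sketch draws a score per token purely to show the generative view.

```python
import random


def generate_document(n_words, theta, phi, mu, sigma, rng):
    """Draw (word, score, topic) triples for one document.

    theta: topic proportions of the document (sums to 1).
    phi:   list of dicts, phi[k][word] = probability of word under topic k.
    mu, sigma: per-topic parameters of the log-normal score distribution.
    """
    topics = list(range(len(theta)))
    triples = []
    for _ in range(n_words):
        k = rng.choices(topics, weights=theta)[0]          # topic for this token
        vocab = list(phi[k])
        word = rng.choices(vocab, weights=[phi[k][v] for v in vocab])[0]
        score = rng.lognormvariate(mu[k], sigma[k])        # positive retrieval score
        triples.append((word, score, k))
    return triples
```

A log-normal distribution is supported on positive values only, which matches retrieval scores that are positive after normalization.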
In this model, exact inference is intractable. We use Gibbs Sampling (Heck and Thomas, 2015) to perform approximate inference. We employ a conjugate prior for the multinomial distributions, and integrate out $\theta$ and $\phi$. In the sampling procedure, we need to calculate the conditional distribution $P(z_{d,i} = k \mid \cdot)$ (line 12 in Table 2). By using Gibbs Sampling, the topic of each word is sampled from this conditional distribution, in which $n_{d,k,\neg i}$ counts the number of times that the topic with index $k$ has been sampled from the multinomial distribution specific to document $d$, with the current $z_{d,i}$ not counted. Another counter variable $n_{k,w_{d,i},\neg i}$ counts the number of times $w_{d,i}$ has been generated by topic $k$, not counting the current $w_{d,i}$. A dot denotes summation over all values of the variable whose index that dot takes.
$\alpha_{z_{d,i}}$ and $\beta_{w_{d,i}}$ are elements from $\alpha$ and $\beta$, respectively. After that we can calculate the posterior estimates of $\theta$ and $\phi$ (line 14 in Table 2).
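The sampling step can be illustrated with a small sketch of the per-token topic weights. This is an assumption-laden sketch rather than the update formula of the paper: it uses the standard collapsed Gibbs form for LDA (document-topic count times topic-word ratio) multiplied by a log-normal density of the token's retrieval score, and it assumes symmetric scalar priors `alpha` and `beta`. All function and argument names are hypothetical.

```python
import math


def lognormal_pdf(r, mu, sigma):
    """Density of a log-normal distribution at r > 0."""
    return math.exp(-((math.log(r) - mu) ** 2) / (2 * sigma ** 2)) / (
        r * sigma * math.sqrt(2 * math.pi)
    )


def topic_weights(n_dk, n_kw, n_k, r, mu, sigma, alpha, beta, vocab_size):
    """Unnormalized probability of each topic for one token.

    n_dk[k]: tokens of this document assigned to topic k (current token excluded).
    n_kw[k]: times this word was assigned to topic k (current token excluded).
    n_k[k]:  total tokens assigned to topic k (current token excluded).
    r:       retrieval score attached to the token.
    """
    weights = []
    for k in range(len(n_dk)):
        doc_part = n_dk[k] + alpha
        word_part = (n_kw[k] + beta) / (n_k[k] + vocab_size * beta)
        weights.append(doc_part * word_part * lognormal_pdf(r, mu[k], sigma[k]))
    return weights
```

A sampler would normalize these weights, draw the new topic for the token, update the counts, and periodically re-estimate the per-topic `mu` and `sigma` from the log-scores of the tokens assigned to each topic.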

Separated Enhanced User Profiling
In the MEUP model, $T_u$, $D_u$ and $D_u^{exter}$ are mixed together to infer unified latent topics. However, the MEUP model may miss important information when the topics are learned. Our SEUP model extends the MEUP model by learning topics which are shared between document-aligned pairs. In order to do this, we create pseudo-aligned documents between $T_u$, $D_u$ and $D_u^{exter}$. This procedure works as follows. For each external document in $D_u^{exter}$ retrieved by a query from $Q_{exter}$, which is formed through step one of our approach, we treat the document (from $D_u^{exter}$) and the query (from $Q_{exter}$) as pseudo-aligned documents in two groups, which we name the source group and the target group. By using the aligned documents, we propose a model to learn the latent topics between the two groups.
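The pairing procedure described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `retrieve(query, gamma)` callback that returns `(document, score)` tuples from the external corpus; the dictionary keys are illustrative and do not correspond to names used in the paper.

```python
def build_aligned_pairs(queries, retrieve, gamma=5):
    """Pair every step-one query with each external document it retrieves.

    queries:  list of queries, each a list of terms.
    retrieve: hypothetical callback returning up to gamma (document, score) tuples.
    """
    pairs = []
    for query in queries:
        for document, score in retrieve(query, gamma):
            # One pseudo-aligned pair: the query on one side, the external document on the other.
            pairs.append({"query": query, "external": document, "score": score})
    return pairs
```

A query that retrieves several documents therefore appears in several pairs, once per retrieved document.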
Note that in this case, there is a comparable document set aligned at the document level. Therefore, $\theta$ can be viewed as a group-independent factor, shared among comparable aligned documents. Hence, the generative process for the SEUP model is slightly different from that of the MEUP model: $\theta$ is drawn once for each document pair. The update formulas for Gibbs sampling in the SEUP model are similar to those of the MEUP model. The meaning of the symbols used in the SEUP model is the same as in the MEUP model, except this time for two groups E and C. In the two EUP models, the multinomial distributions of topics specific to each document and each word can be easily inferred.

Topical Query Expansion
In step three of our approach to personalization, we use the output from step two to build a QE model that calculates the weights of the profile terms to be added to the initial query. In this section we detail this process.
Given the query $q = \{q_i\}_{i=1}^{n}$ of independent query terms, the probability of the query generating a word $w$ is defined as (see also (Lavrenko and Croft, 2001; Ganguly et al., 2012)): $P(w \mid q) = \prod_{i=1}^{n} P(w \mid q_i)$. We further assume that there is a set of relevant documents $\{D_j\}_{j=1}^{R}$ related to the query and the word being considered, where $R$ is the number of documents. Incorporating this set of documents into the above equation leads to: $P(w \mid q) \propto \prod_{i=1}^{n} \sum_{j=1}^{R} P(w \mid D_j) P(q_i \mid D_j)$. The calculation discards the uniform prior for $P(q_i)$, and takes the uniform prior of documents outside the summation. As we already have outputs from step two, the documents inside the user profile can be used as a set of relevant documents in the above calculation. In addition, because we now have latent topics related to each document and each word, there is no longer a direct dependency of $w$ on $D_j$ and $q$. In this case, in order to estimate $P(w \mid D_j)$, we can marginalize the probability over the latent topic variables $z_k$, then we have: $P(w \mid D_j) = \sum_{k=1}^{K} P(w \mid z_k) P(z_k \mid D_j)$. Similarly, the probability $P(q_i \mid D_j)$ becomes: $P(q_i \mid D_j) = \sum_{k=1}^{K} P(q_i \mid z_k) P(z_k \mid D_j)$. So that the probability of the query generating a word can be re-defined as: $P(w \mid q) \propto \prod_{i=1}^{n} \sum_{j=1}^{R} \left( \sum_{k=1}^{K} P(w \mid z_k) P(z_k \mid D_j) \right) \left( \sum_{k=1}^{K} P(q_i \mid z_k) P(z_k \mid D_j) \right)$. In the SEUP model, we use one side of the word-topic distributions, from the group that contains tags and annotated documents, to calculate the weighting. All the profile terms $\{w_1, w_2, \dots, w_n\} = T_u \cup W_u \cup W_u^{exter}$ are ranked by their probability of being generated by the given query (lines 16-17 in Table 2), and the top terms are chosen to expand the query.
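The ranking of profile terms can be illustrated with a short sketch. It assumes that the score of a profile term is a product over query terms of a sum over profile documents, where both the term and the query term are generated from a document through its topic mixture. The function names and the input layout (`doc_topic` as per-document topic proportions, `topic_word` as per-topic word probabilities) are hypothetical, and the inputs would come from the step-two topic model.

```python
def term_given_doc(term, theta, topic_word):
    """P(term | document), marginalized over the latent topics of the document."""
    return sum(theta[k] * topic_word[k].get(term, 0.0) for k in range(len(theta)))


def expansion_scores(query_terms, profile_terms, doc_topic, topic_word):
    """Score each profile term: product over query terms of a sum over profile documents."""
    scores = {}
    for w in profile_terms:
        score = 1.0
        for q in query_terms:
            score *= sum(
                term_given_doc(w, theta, topic_word) * term_given_doc(q, theta, topic_word)
                for theta in doc_topic
            )
        scores[w] = score
    return scores


def expand_query(query_terms, profile_terms, doc_topic, topic_word, n_terms=5):
    """Return the n_terms highest scoring profile terms (ties broken alphabetically)."""
    scores = expansion_scores(query_terms, profile_terms, doc_topic, topic_word)
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:n_terms]
```

Because the match goes through topics rather than surface forms, a profile term can score highly even if it never co-occurs lexically with the query terms, as long as both are likely under topics that dominate the same profile documents.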

Evaluation
In the following section we describe experiments which have been designed to evaluate the proposed method. We start the section by discussing the experimental settings, and then we present and analyze the results.

Experimental Setup
In order to evaluate the proposed methods on real-world data, we selected two del.icio.us datasets: socialbm0311 and deliciousT140, which are public and are described and analyzed in (Zubiaga et al., 2009; Zubiaga et al., 2013). The deliciousT140 dataset is made up of 144,574 unique URLs, all of them with their corresponding social tags retrieved from del.icio.us. However, this dataset does not contain the bookmarking activities of individual users. We therefore also used the socialbm0311 dataset, which contains the complete bookmarking activity for almost 2 million users. After matching the documents in deliciousT140 with the bookmark activities in socialbm0311, we obtained a total of 5,153,720 bookmark activities, 259,511 users, 131,283 web pages and 137,870 tags. We used a public parser to parse the web pages in order to get their textual content.
We constructed two corpora from different external knowledge bases. The first corpus was obtained from the largest encyclopedia, Wikipedia. A Wikipedia snapshot was obtained on 14 August 2014, which contained a collection of 4,634,369 articles. The second corpus consists of English news documents from the Glasgow Herald 1995, Los Angeles Times 1994 and Los Angeles Times 2002, a collection made available by the CLEF AdHoc-News Test Suites (2004-2008), which we refer to as CLEF. This collection contains 304,630 documents.
To investigate the effects of enhanced user profiles, we selected two groups of users as test users. One group contains 1,000 randomly selected users with no more than 50 bookmarks (referred to as User-SMALL) and the other group contains 1,000 randomly selected users with more than 200 bookmarks (referred to as User-LARGE). These two groups represent users with small and rich amounts of past usage information, respectively. The English terms were processed by down-casing the alphabetic characters, removing the stop words and stemming words using the Porter stemmer. For each user, 75% of his/her tags with annotated web pages were used to create the user profile and the other 25% were used as a test collection.
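The grouping and the per-user split can be sketched as below. This is only an illustration: the function names are hypothetical, and the random shuffle with a fixed seed is an assumption, since the text does not say how the 75% portion is chosen.

```python
import random


def user_group(n_bookmarks):
    """Assign a user to a test group by bookmark count; None if in neither group."""
    if n_bookmarks <= 50:
        return "User-SMALL"
    if n_bookmarks > 200:
        return "User-LARGE"
    return None


def split_bookmarks(bookmarks, train_fraction=0.75, seed=0):
    """Split one user's bookmarks into a profile part and a held-out test part."""
    shuffled = list(bookmarks)
    random.Random(seed).shuffle(shuffled)  # assumption: random split
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The profile part feeds the user profile construction, while the tags and pages in the held-out part serve as queries and relevance judgments.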
The evaluation method used by previous researchers in personalized social search (Xu et al., 2008; Wang and Jin, 2010; Zhou et al., 2012a) is employed. The main assumption is as follows: any documents tagged by $u$ with $t$ are considered relevant for the personalized query $(u, t)$ ($u$ submits the query $t$).
The following evaluation metrics were chosen to measure the effectiveness of the various approaches: the normalized discounted cumulative gain (NDCG), mean reciprocal rank (MRR) and mean average precision (MAP) (Voorhees, 1999; Järvelin and Kekäläinen, 2000). The average performance over all users is calculated. Statistically significant differences were determined using a paired t-test at a confidence level of 95%.
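For reference, the three metrics can be computed per query as in the sketch below, assuming binary relevance (a document is relevant if the user tagged it with the query tag). The function names are illustrative, and details such as a rank cutoff for NDCG are assumptions, since the text does not specify them. MRR and MAP are the means of the per-query reciprocal rank and average precision.

```python
import math


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of the relevant documents."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def ndcg(ranked, relevant, k=None):
    """Binary-gain NDCG with a log2 discount, optionally cut off at rank k."""
    ranked = ranked[:k] if k else ranked
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked, start=1)
        if doc in relevant
    )
    ideal_hits = min(len(relevant), len(ranked))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```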

Experimental Runs
The proposed approach is applied to social search personalization by means of query expansion. We evaluate our proposed models and compare them with several state-of-the-art methods as follows.
LM: A popular and quite robust language model retrieval method which has previously demonstrated good results (Zhai and Lafferty, 2001). We compute the Kullback-Leibler divergence between the query and document language models as described in (Zhai and Lafferty, 2001).
LMRM: A relevance model that involves pseudo-relevance feedback in the language model, as in (Lavrenko and Croft, 2001). We include this model as a competitive non-personalized query expansion baseline.
LMRM-external: A modified version of the relevance model, as described in (Diaz and Metzler, 2006). Instead of using the top-ranked documents as pseudo-relevance documents, this model uses external corpora to obtain the relevance documents. We include this model as a strong non-personalized baseline, as we also use external corpora in our models. In the experiments, this method acquires external documents from the Wikipedia corpus and CLEF.
Co-occur: This method has been used by several researchers. The selection of expansion terms is based on their co-occurrence statistics with the query terms and other terms inside the user model. We used this approach because it previously demonstrated satisfactory performance (Chirita et al., 2007).
Co-tag: Pure tag-tag relationships are also favored by many researchers. This method is based on the co-tagging activities a user performed (Bender et al., 2008; Bertier et al., 2009; Bouadjenek et al., 2011). In this case, the user profiles contain training tags with their co-tagging statistics computed using the Jaccard coefficient.
Tag-topic-regu: Zhou et al. (2012a) proposed a query expansion framework based on regularizing the smoothness of word associations over a connected graph using terms extracted from top-ranked documents. The user profiles are built according to a Tag-Topic model in a latent graph. We include the highest performing method from their work for comparison.
MEUP: Our first proposed method, which uses the MEUP model and the topical query expansion method for social web search.
SEUP: Our alternative proposed method, which uses the SEUP model and the topical query expansion method to personalize search.
The number of documents retrieved by each query in step one is set to γ = 5 empirically. The parameter λ used in the EXTRACTTOP function is set to 10. For the EUP modeling, $\alpha$ and $\beta$ were set to $50/K$ and 0.01. In the expansion method, the number of expansion terms is set to 5. All the parameters in the other baseline models are set according to the tuning procedures in the original papers.
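As an illustration of the co-tagging statistics used by the Co-tag baseline, the sketch below computes the Jaccard coefficient between the sets of documents annotated with two tags. The data layout and function names are hypothetical, and the cited works may compute the statistic over a different unit (for example bookmarks rather than documents).

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cotag_scores(bookmarks, query_tag):
    """Jaccard score between query_tag and every other tag in a user's bookmarks.

    bookmarks: list of (doc_id, tags) pairs from the user's training data.
    """
    docs_by_tag = {}
    for doc_id, tags in bookmarks:
        for tag in tags:
            docs_by_tag.setdefault(tag, set()).add(doc_id)
    query_docs = docs_by_tag.get(query_tag, set())
    return {
        tag: jaccard(query_docs, docs)
        for tag, docs in docs_by_tag.items()
        if tag != query_tag
    }
```

The highest scoring tags would then be appended to the query as expansion terms.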

Results
First we examine the performance of the methods proposed in this paper together with the three non-personalized baselines on the overall test users, as shown in Table 3. Statistically significant differences are marked as l and w with respect to the LMRM and LMRM-external baselines, as these two methods work better than the simpler LM method. As illustrated by the results, the LM model was the lowest performer on all evaluation metrics for both groups of users. This result shows that merely borrowing common lexical-matching techniques from traditional information retrieval will not solve the personalized search problem. With the help of pseudo-relevance feedback, the LMRM and LMRM-external methods work consistently better than the LM baseline. This demonstrates the power of query expansion. Specifically, the technique that explores external corpora to obtain the relevant documents works better than the method which simply uses top-ranked documents. The results are consistent with previous research (Diaz and Metzler, 2006). The improvements are more noticeable when using Wikipedia as the external corpus. However, all the non-personalized baselines are outperformed by the personalized approaches, including our proposed methods MEUP and SEUP, all with statistically significant results. This illustrates that non-personalized query expansion methods can only bring limited improvements, while methods with additional terms from the user profiles can greatly improve retrieval effectiveness.
Next we evaluate the performance of the proposed methods compared to several personalized baselines that use only the users' past information for query expansion, i.e. Co-occur, Co-tag and Tag-topic-regu methods.
As seen from Table 3, three conclusions emerge. First, MEUP and SEUP both outperform all previously proposed personalization methods, on all metrics measured, with both external corpora and for both groups of users. Moreover, the difference between our proposed methods and the baseline runs is always significant. We believe that the strong performance of our methods is due to the fact that they do not only consider a user's past usage information, but also use an external knowledge base to enhance the user profiling process. Second, the SEUP method works consistently better than the MEUP method. This result confirms that merely mixing the documents from the historical evidence and external knowledge bases will miss some important information. By treating the documents as a pseudo-aligned corpus, we obtain much better performance. The highest improvement over the best performing baseline run reaches 54.95% (for the SEUP method on the MRR metric when compared to Tag-topic-regu, using Wikipedia as the external corpus in the User-SMALL group). Third, further improvements are achieved by using Wikipedia as the external corpus rather than the CLEF collection. A possible reason, as pointed out by Diaz and Metzler (2006), is that an external corpus is likely to be a better source of expansion terms if it has better topic coverage over the target corpus. The results also show that the improvements over baseline models in the User-SMALL group are more noticeable than in the User-LARGE group. However, the differences are small. This result confirms that our methods work well both for users with small amounts and for those with rich amounts of past usage information.
Table 3. Overall results. Statistically significant differences between our methods and LMRM, LMRM-External, Co-occur, Co-tag, Tag-topic-regu are indicated by l, w, o, t, r respectively.
We now examine the effect of the number of latent topics used in MEUP and SEUP on performance. We vary the number of topics in both methods from 5 to 30. The results are shown in Figure 1, using Wikipedia as the external corpus in the User-SMALL group (we omit the other results as they were similar). As can be seen from the figure, the highest performance is reached when the number of latent topics is 20 in MEUP and 25 in SEUP. When the number of topics continues to grow, the performance starts to degrade. However, even the lowest scoring run still outperformed the strongest baseline. Across all topic numbers, SEUP still outperforms MEUP.

Conclusion and Future Work
In this paper, we tackle the challenge of personalized web search using social data in a novel way by building enhanced user profiles from the annotations and resources the user has marked, together with an external knowledge base. We present two probabilistic latent topic models to simultaneously incorporate social annotations, documents and the external knowledge base. In addition, we introduce a topical query expansion model to enhance the search by utilizing individual user profiles. The proposed methods performed well on the social data crawled from the web, delivering statistically significant improvements over non-personalized baselines and over representative personalized baseline systems that construct user profiles from a user's historical usage information only. The results also confirm that our proposed methods work well for both active and less active users. In future research, we aim to automatically estimate the number of topics to be used in the EUP models. We also plan to explore the use of more external resources and novel latent semantic models to enhance performance.