A Hierarchical Knowledge Representation for Expert Finding on Social Media

Expert finding on social media benefits both individuals and commercial services. In this paper, we use a 5-level tree representation to model posts on social media and cast expert finding as a matching problem between the learned user tree and the domain tree. We enhance the traditional approximate tree matching algorithm and incorporate word embeddings to improve the matching result. Experiments conducted on Sina Microblog demonstrate the effectiveness of our work.


Introduction
Expert finding has attracted growing interest among social media researchers following its successful application to traditional media such as academic publications. As already observed, social media users tend to follow others for professional interests and knowledge (Ramage et al., 2010). This forms the basis for mining expertise and finding experts on social media, which in turn supports services such as user recommendation and question answering.
Despite the demand for access to expertise, identifying domain experts on social media remains challenging. Social media content is noisy; the tags with which users describe themselves are one example. Such noise is an inherent drawback for feature-based learning methods (Krishnamurthy et al., 2008). Data imbalance and sparseness also limit the performance of otherwise promising latent semantic analysis methods such as LDA-like topic models (Blei et al., 2003; Ramage et al., 2009). When some topics co-occur more frequently than others, the strict assumptions of these topic models are not met and many nonsensical topics are generated (Zhao and Jiang, 2011; Pal et al., 2011; Quercia et al., 2012). Furthermore, experts are harder to define than celebrities: they cannot be judged simply by their number of followers, because the knowledge conveyed in what they say is essential. This is why network-based methods fail (Java et al., 2007; Weng et al., 2010; Pal et al., 2011).
The challenges mentioned above stem from insufficient representations. They motivate us to propose a more flexible domain expert finding framework that explores representations able to handle the complexity of social media data. The basic idea is as follows. Experts talk about professional knowledge in their posts, so these posts are expected to contain more domain knowledge than the posts of ordinary users. We determine whether users are experts in a specific domain by matching their professional knowledge against the domain knowledge. The key is how to capture such information for both users and domains with an appropriate representation, the lack of which is, in our view, why most previous work falls short.
To go beyond feature-based classification methods and vector representation inference in expert finding, a potential solution is to incorporate semantic information into knowledge modeling. We achieve this by representing user posts with a hierarchical tree structure that captures correlations among words and topics. To tackle the data sparseness problem, we attach word embeddings to tree nodes to further enhance the semantic representation and to support semantic matching. Expert finding is then cast as the problem of determining the edit distance between the user tree and the domain tree, which is computed with an approximate tree matching algorithm.
The main contribution of this work is to integrate the hierarchical tree representation with structure matching to profile the knowledge of users and domains. Using such trees allows us to flexibly incorporate more information into the data representation, such as the relations between latent topics and the semantic similarities between words. Experiments conducted on Sina Microblog demonstrate the effectiveness of the proposed framework and the corresponding methods.

Knowledge Representation with Hierarchical Tree
To capture correlations between topics, the Pachinko Allocation Model (PAM) (Li and McCallum, 2006) uses a directed acyclic graph (DAG) in which leaves represent individual words in the vocabulary and each interior node represents a correlation among its children. In particular, a multi-level PAM can reveal interconnections between sub-level nodes by inferring the corresponding super-level nodes. This desirable property enables us to capture hierarchical relations among both intra-level and inter-level nodes and thereby enhance the representation of users' posts. More importantly, the inter-level hierarchy helps distribute words from generic super-level topics to specific sub-level topics.
In this work, we exploit a 5-level PAM to learn the hierarchical knowledge representation for each individual user and domain. As shown in Figure 1, the 5-level hierarchy consists of one root topic $r$, $I$ topics at the second level $X = \{x_1, x_2, \ldots, x_I\}$, $J$ topics at the third level $Y = \{y_1, y_2, \ldots, y_J\}$, $K$ topics at the fourth level $Z = \{z_1, z_2, \ldots, z_K\}$, and words at the bottom. The whole hierarchy is fully connected.
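The path a token takes through this hierarchy can be sketched in code. The snippet below is a minimal illustration, assuming the standard PAM generative process (Li and McCallum, 2006): per-document multinomials are drawn from Dirichlet priors and each token walks root, x-topic, y-topic, z-topic, word. The function names and the argument layout (`alpha`, `gamma`, `delta`, `phi` as nested lists) are hypothetical and are not taken from the paper's implementation.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one multinomial parameter vector from Dirichlet(alpha)."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw an index according to a discrete distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(n_tokens, alpha, gamma, delta, phi, rng):
    """Generate (x, y, z, w) index paths for one document.

    alpha: length-I Dirichlet parameter of the root (assumed layout)
    gamma: I rows of length J, one Dirichlet parameter per x-topic
    delta: J rows of length K, one Dirichlet parameter per y-topic
    phi:   K rows of length V, fixed word distributions of the z-topics
    """
    # Per-document multinomials over the sub-level topics.
    theta_x = sample_dirichlet(alpha, rng)
    theta_y = [sample_dirichlet(row, rng) for row in gamma]
    theta_z = [sample_dirichlet(row, rng) for row in delta]
    paths = []
    for _ in range(n_tokens):
        x = sample_index(theta_x, rng)      # root -> second level
        y = sample_index(theta_y[x], rng)   # second -> third level
        z = sample_index(theta_z[y], rng)   # third -> fourth level
        w = sample_index(phi[z], rng)       # fourth level -> word
        paths.append((x, y, z, w))
    return paths
```

Because the hierarchy is fully connected, every x-topic has a distribution over all y-topics and every y-topic over all z-topics, which is why `gamma` and `delta` are full matrices in this sketch.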
Figure 1: 5-level PAM
Each topic in the 5-level PAM is associated with a distribution $g(\cdot)$ over its children. In general, $g(\cdot)$ can be any distribution over discrete variables. Here, we use a set of Dirichlet compound multinomial distributions associated with the root, the second-level topics and the third-level topics, with Dirichlet parameters $\alpha$, $\gamma$ and $\delta$ respectively. They are used to sample the multinomial distributions $\theta_x$, $\theta_y$ and $\theta_z$ over the corresponding sub-level topics. For the fourth-level topics, we use fixed multinomial distributions $\{\phi_{z_k}\}_{k=1}^{K}$, sampled once for the whole dataset from a single Dirichlet distribution $g(\beta)$. Figure 2 illustrates the plate notation of this 5-level PAM.
By integrating out the sampled multinomial distributions $\theta_x$, $\theta_y$, $\theta_z$, $\phi$ and summing over $x$, $y$, $z$, we obtain the Gibbs sampling distribution for word $w = w_m$ in document $d$, which is expressed in terms of the following counts. $n_r^{(d)}$ is the number of occurrences of the root $r$ in document $d$, which is equivalent to the number of tokens in the document. $n_i^{(d)}$, $n_{ij}^{(d)}$ and $n_{jk}^{(d)}$ are respectively the numbers of occurrences of $x_i$, $y_j$ and $z_k$ sampled from their upper-level topics. $n_k$ is the number of occurrences of the fourth-level topic $z_k$ in the whole dataset and $n_{km}$ is the number of occurrences of word $w_m$ in $z_k$. The subscript $-w$ indicates all observations or topic assignments except word $w$.
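A sampler consistent with these counts can be written by extending the collapsed Gibbs update of four-level PAM (Li and McCallum, 2006) by one topic level. The expression below is a reconstruction under that assumption and is not necessarily the exact form used here. The superscript $(d)$ marks per-document counts, $n_j^{(d)}$ (the number of occurrences of $y_j$ in document $d$) and the component subscripts on $\alpha$, $\gamma$, $\delta$ and $\beta$ are assumed notation, and all counts exclude the current token.

```latex
\begin{aligned}
P(x_w = x_i,\; y_w = y_j,\; z_w = z_k \mid \mathbf{D}, \mathbf{x}_{-w}, \mathbf{y}_{-w}, \mathbf{z}_{-w}, \alpha, \gamma, \delta, \beta)
\;\propto\;&
\frac{n_i^{(d)} + \alpha_i}{n_r^{(d)} + \sum_{i'=1}^{I} \alpha_{i'}}
\times
\frac{n_{ij}^{(d)} + \gamma_{ij}}{n_i^{(d)} + \sum_{j'=1}^{J} \gamma_{ij'}} \\
&\times
\frac{n_{jk}^{(d)} + \delta_{jk}}{n_j^{(d)} + \sum_{k'=1}^{K} \delta_{jk'}}
\times
\frac{n_{km} + \beta_m}{n_k + \sum_{m'} \beta_{m'}}
\end{aligned}
```

Each factor corresponds to one step down the hierarchy: choosing $x_i$ under the root, $y_j$ under $x_i$, $z_k$ under $y_j$, and finally the word $w_m$ under $z_k$.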
With the Dirichlet parameter $\alpha$ fixed for the root and $\beta$ fixed as the prior, what remains is to estimate $\gamma$ and $\delta$ from data so as to capture the different correlations among topics. To avoid iterative methods, which are often computationally expensive, we approximate these two Dirichlet parameters with the moment matching algorithm, as in (Minka, 2000; Casella and Berger, 2001; Shafiei and Milios, 2006). With smoothing techniques, we update the estimate of $\gamma$ in each iteration of Gibbs sampling, where $N_i$ is the number of documents with nonzero counts of super-level topic $x_i$. Parameter estimation of $\delta$ is the same as that of $\gamma$.
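The idea of moment matching can be illustrated with a short sketch. It is a generic variant, assuming the per-document proportions of the children of one super-topic are matched to the Dirichlet mean and variance, with the precision taken as the geometric mean of the per-child estimates. The function name, the additive `smoothing` constant and the fallback precision of 1.0 are hypothetical choices and do not reproduce the paper's exact update.

```python
import math


def moment_match_dirichlet(doc_counts, smoothing=1e-3):
    """Estimate Dirichlet parameters of one super-topic by moment matching.

    doc_counts: one list per document holding the counts of each child
    topic sampled under the super-topic in that document.
    """
    # Only documents where the super-topic occurs contribute (these are N_i).
    active = [c for c in doc_counts if sum(c) > 0]
    if not active:
        raise ValueError("super-topic never occurs in the given documents")
    n_docs = len(active)
    n_children = len(active[0])
    props = [[v / sum(c) for v in c] for c in active]
    mean = [sum(p[k] for p in props) / n_docs for k in range(n_children)]
    var = [sum((p[k] - mean[k]) ** 2 for p in props) / n_docs
           for k in range(n_children)]
    # For a Dirichlet, var_k = mean_k * (1 - mean_k) / (precision + 1).
    log_precisions = []
    for m, v in zip(mean, var):
        if v > 0 and 0 < m < 1:
            s = m * (1 - m) / v - 1
            if s > 0:
                log_precisions.append(math.log(s))
    if log_precisions:
        precision = math.exp(sum(log_precisions) / len(log_precisions))
    else:
        precision = 1.0  # hypothetical fallback when the variance is degenerate
    # Additive smoothing keeps every parameter strictly positive.
    return [precision * m + smoothing for m in mean]
```

A single pass over the counts suffices, which is the reason for preferring this approximation to an iterative maximum-likelihood fit inside every Gibbs iteration.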

Expert Finding with Approximate Tree Matching
Once the hierarchical representations of users and domains have been generated, we can determine whether or not a user is an expert on a domain based on their matching degree, a problem analogous to tree-to-tree correction using edit distance (Selkow, 1977; Shasha and Zhang, 1990; Wagner, 1975; Wagner and Fischer, 1974; Zhang and Shasha, 1989). Given two trees $T_1$ and $T_2$, a typical edit distance-based correction approach transforms $T_1$ into $T_2$ with a sequence of editing operations $S = \langle s_1, s_2, \ldots, s_k \rangle$ such that $s_k(s_{k-1}(\ldots(s_1(T_1))\ldots)) = T_2$. Each operation is assigned a cost $\sigma(s_i)$ that represents the difficulty of making that operation. Summing the costs of all necessary operations gives the total cost $\sigma(S) = \sum_{i=1}^{k} \sigma(s_i)$, which defines the matching degree of $T_1$ and $T_2$.
We assume that an expert masters only part of the professional knowledge of a domain rather than the whole, and we therefore revise a traditional approximate tree matching algorithm (Zhang and Shasha, 1989) to calculate the matching degree. This assumption makes particular sense when the domain concerned is quite general. Let $T_d$ and $T_u$ denote the learned domain knowledge tree and the user knowledge tree, respectively. We match $T_d$ to the trees that remain after cutting all possible sets of disjoint sub-trees from $T_u$, and we assign no cost when sub-trees are missing in the matching process. We define two types of operations. Substitution operations edit dissimilar words on tree nodes, while insertion and deletion operations act on the tree structure. Expert finding then amounts to calculating the minimum matching cost between $T_d$ and $T_u$. If the cost is smaller than an empirically defined threshold $\lambda_d$, we identify user $u$ as an expert on domain $d$.
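The matching with free cuts can be sketched as follows. This is a deliberately simplified illustration, not the Zhang and Shasha (1989) algorithm: it aligns ordered children top-down in the style of Selkow (1977), lets any sub-tree of the user tree be cut at zero cost, and charges `MAX_VALUE` per node for a domain sub-tree that has no counterpart. The names `Node`, `tree_size`, `exact_cost`, `match_cost` and `is_expert` are hypothetical, as is the per-node charging scheme; the value 100 for `MAX_VALUE` is the one reported in the experiments.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

MAX_VALUE = 100.0  # insertion/deletion cost per node (value from the experiments)


@dataclass(frozen=True)
class Node:
    label: str
    children: Tuple["Node", ...] = ()


def tree_size(t: Node) -> int:
    """Number of nodes in the tree rooted at t."""
    return 1 + sum(tree_size(c) for c in t.children)


def exact_cost(a: str, b: str) -> float:
    """Placeholder substitution cost: 0 for identical labels, 1 otherwise."""
    return 0.0 if a == b else 1.0


def match_cost(d: Node, u: Node,
               sub: Callable[[str, str], float] = exact_cost) -> float:
    """Cost of matching domain tree d to user tree u; cuts of u are free."""
    dc, uc = d.children, u.children
    m, n = len(dc), len(uc)
    # table[i][j]: best cost of matching the first i domain children
    # against the first j user children.
    table = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        table[i][0] = table[i - 1][0] + MAX_VALUE * tree_size(dc[i - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = min(
                # domain sub-tree has no counterpart in the user tree
                table[i - 1][j] + MAX_VALUE * tree_size(dc[i - 1]),
                # user sub-tree is cut at no cost
                table[i][j - 1],
                # the two sub-trees are matched recursively
                table[i - 1][j - 1] + match_cost(dc[i - 1], uc[j - 1], sub),
            )
    return sub(d.label, u.label) + table[m][n]


def is_expert(d: Node, u: Node, threshold: float,
              sub: Callable[[str, str], float] = exact_cost) -> bool:
    """User u counts as an expert on domain d if the cost is below threshold."""
    return match_cost(d, u, sub) < threshold
```

The `sub` argument is left pluggable so that an embedding-based word similarity can replace the exact-match placeholder without touching the structural part of the matching.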
To alleviate the sparseness problem caused by exact letter-to-letter matching in tree-node mapping, we incorporate word embeddings (Bengio et al., 2003) into the substitution operation. We apply the word2vec skip-gram model (Mikolov et al., 2013a; Mikolov et al., 2013b) to encode each word in our vocabulary as a dense real-valued vector and directly use the similarity generated by word2vec as the tree-node similarity. The costs of insertion and deletion operations are explained in Section 4. In practice, all three costs can be defined according to application needs. In brief, by combining the hierarchical representation of the tree structure with word embeddings on tree nodes, we achieve our goal of enhancing semantics.
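One way an embedding similarity could feed the substitution operation is sketched below. The mapping from similarity to cost (one minus the cosine similarity, clamped to the range 0 to 1) is an assumption made for illustration and is not the paper's Eq. (1); `substitution_cost`, the `embeddings` dictionary and the `oov_cost` fallback for out-of-vocabulary words are hypothetical names.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def substitution_cost(word_a, word_b, embeddings, oov_cost=1.0):
    """Cost of substituting word_a with word_b on a tree node.

    Identical words cost nothing; words without an embedding fall back
    to oov_cost; otherwise the cost shrinks as the similarity grows.
    """
    if word_a == word_b:
        return 0.0
    if word_a not in embeddings or word_b not in embeddings:
        return oov_cost
    sim = cosine_similarity(embeddings[word_a], embeddings[word_b])
    return min(1.0, max(0.0, 1.0 - sim))
```

With such a cost, two different but related words are cheaper to substitute than two unrelated ones, which is what relieves the sparseness of exact matching.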

Experiments
The experiments are conducted on 5 domains (Beauty Blogger, Beauty Doctor, Parenting, E-Commerce, and Data Science) in Sina Microblog, a Twitter-like microblog service in China. To learn PAM, we manually select 40 users in each domain from the official expert lists released by Sina Microblog and crawl all of their posts. On average, there are 113,924 posts in each domain. Note that the expert lists are not of high quality, so we manually verify them to filter out noise. For evaluation, we select another 80 users in each domain from the expert list, with 40 verified as experts and the other 40 as non-experts.
Since no state-of-the-art Chinese word embeddings are publicly available, we use another Sina Microblog dataset provided by pennyliang, which contains 25 million posts and nearly 100 million tokens in total, to learn 50-dimensional word embeddings. We pre-process the data with the Rwordseg segmentation package and discard nonsensical words with the pullword package. When learning the 5-level PAM, we fix $\alpha = 0.25$ and $\beta = 0.25$ and, from top to bottom, set $I = 10$, $J = 20$ and $K = 20$ as the numbers of second-, third- and fourth-level topics, respectively. We initialize $\gamma$ and $\delta$ with 0.25. For tree matching, we define the cost of the tree-node substitution operation between words $a$ and $b$ as in Eq. (1). The costs of the insertion and deletion operations for tree-structure matching are MAX_VALUE, which we set to 100 experimentally. The threshold $\lambda_d$ used to determine experts is set to 12 times MAX_VALUE.
We compare PAM with n-gram models (unigram and bigram), LDA (Blei et al., 2003) and Twitter-LDA (Zhao and Jiang, 2011). We set $\beta$ in LDA and Twitter-LDA to 0.01 and $\gamma$ in Twitter-LDA to 20. For $\alpha$, we adopt the commonly used $50/T$ heuristic, where the number of topics is $T = 50$. For a fair comparison, all methods use the tokens obtained after pullword preprocessing as input for feature extraction. Following Zhao and Jiang (2011), we train four $\ell_2$-regularized logistic regression classifiers using the LIBLINEAR package (Fan et al., 2008) on the top 200 unigrams and bigrams ranked by Chi-squared and on the 100-dimensional topic vectors induced by LDA and Twitter-LDA, respectively. We also compare our model with and without word embeddings to demonstrate the effectiveness of this semantic enhancement. The results are presented in Table 1.
In general, LDA, Twitter-LDA and PAM outperform unigram and bigram, showing the strength of latent semantic modeling. Of the two LDA-based models, Twitter-LDA yields better precision than LDA because it is better able to model short posts on social media. It adds a background word distribution to filter out noisy words and assumes that a single post belongs to a single topic.
Our 5-level PAM yields a further improvement over Twitter-LDA. We attribute this to the advantages of tree representations over vector feature representations, the effective approximate tree matching algorithm and the complementary word embeddings. As mentioned in Section 1, LDA and other topic models such as Twitter-LDA share the assumption that topics are independent of each other. This assumption, however, is too strict for real-world data. Our tree-like 5-level PAM relaxes it with two additional layers of super-topics modeled with Dirichlet compound multinomial distributions, which is the key to capturing topic correlations. Furthermore, by allowing partial matching and incorporating word embeddings, we overcome the sparseness problem.
While macro-averages give equal weight to each domain, micro-averages give equal weight to each user. The significant difference between the macro- and micro-scores in Table 1 is caused by the different nature of the 5 domains. In fact, the posts of experts in the E-Commerce domain are to some extent noisy and contain many words irrelevant to the domain knowledge. Meanwhile, the posts of experts in the Data Science domain are less distinguishable. The higher micro-recalls of PAM demonstrate that it generalizes better than LDA and Twitter-LDA.
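The difference between the two averaging schemes can be made concrete with a short sketch. The helper names and the `(tp, fp, fn)` tuple layout are hypothetical, and any counts passed in are purely illustrative inputs rather than results from Table 1.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_micro(counts):
    """counts maps each domain name to a (tp, fp, fn) tuple."""
    per_domain = [precision_recall(*c) for c in counts.values()]
    # Macro: average the per-domain scores, so every domain weighs the same.
    macro_p = sum(p for p, _ in per_domain) / len(per_domain)
    macro_r = sum(r for _, r in per_domain) / len(per_domain)
    # Micro: pool the counts first, so every user weighs the same.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro_p, micro_r = precision_recall(tp, fp, fn)
    return {"macro": (macro_p, macro_r), "micro": (micro_p, micro_r)}
```

When one domain is much harder than the others, its low per-domain score pulls the macro-average down as strongly as any other domain, whereas in the micro-average its influence depends only on how many of the pooled decisions it contributes.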

Conclusion
In this paper, we formulate the expert finding task as a tree matching problem over a hierarchical knowledge representation. The experimental results demonstrate the advantage of the 5-level PAM with semantic enhancement over n-gram models and LDA-like models. In future work, we will incorporate more information to enrich the hierarchical representation.