A Trio Neural Model for Dynamic Entity Relatedness Ranking

Measuring entity relatedness is a fundamental task for many natural language processing and information retrieval applications. Prior work often studies entity relatedness in a static setting and in an unsupervised manner. However, entities in the real world are often involved in many different relationships, and consequently entity relations are very dynamic over time. In this work, we propose a neural network-based approach that leverages public attention as supervision. Our model is capable of learning rich and different entity representations in a joint framework. Through extensive experiments on large-scale datasets, we demonstrate that our method achieves better results than competitive baselines.


Introduction
Measuring semantic relatedness between entities is an inherent component in many text mining applications. In search and recommendation, the ability to suggest the most related entities for an entity-bearing query has become a standard feature of popular Web search engines (Blanco et al., 2013). In natural language processing, entity relatedness is an important factor for various tasks, such as entity linking (Hoffart et al., 2012) or word sense disambiguation (Moro et al., 2014).
However, prior work on semantic relatedness often neglects the time dimension and considers entities and their relationships as static. In practice, many entities are highly ephemeral (Jiang et al., 2016), and users seeking information related to those entities would like to see fresh information. For example, users looking up the entity Taylor Lautner during 2008-2012 might want to be recommended entities such as The Twilight Saga, due to Lautner's well-known performance in the film series; however, the same query in August 2016 should be served with entities related to his appearances in more recent films such as "Scream Queens" and "Run the Tide". In addition, much previous work derives semantic relatedness from co-occurrence-based computations or heuristic functions without directly optimizing for the final goal. We believe that a desirable framework should treat entity semantic relatedness not as a separate step but as an integral part of the process, for instance by learning it in a supervised manner.
In this work, we address the problem of entity relatedness ranking, that is, designing semantic relatedness models that are optimized for ranking systems such as top-k entity retrieval or recommendation. In this setting, the goal is not to quantify the semantic relatedness between two entities based on their occurrences in the data, but to optimize the partial order of the related entities in the top positions. This problem differs from traditional entity ranking (Kang et al., 2015) in that traditional entity rankings are driven by user queries and are optimized to their (ad-hoc) information needs, while entity relatedness ranking also aims to uncover the meaning of the relatedness from the data. In other words, while conventional entity semantic relatedness learns from data (the editors' or content providers' perspective), and entity ranking learns from the user's perspective, entity relatedness ranking strikes a trade-off between these views. Such a hybrid approach can benefit applications such as exploratory entity search (Miliaraki et al., 2015), where users have a specific goal in mind, but at the same time are open to other related entities.
We also tackle the issue of dynamic ranking and design a supervised learning model that takes into account the temporal contexts of entities and leverages collective attention from public sources. As an illustration, when one looks into the Wikipedia page of Taylor Lautner, each navigation to another Wikipedia page indicates the user's interest in the corresponding target entity given her initial interest in Lautner. Collectively, the navigation traffic observed over time is a good proxy for the shift of public attention to the entity (Figure 1).
In addition, while previous work mainly focuses on one aspect of the entities, such as textual profiles or linking graphs, we propose a trio neural model that learns the low-level representations of entities from three different aspects: content, structure and time. For the time aspect, we propose a convolutional model to embed and attend to local patterns of the past temporal signals in the Euclidean space. Experiments show that our trio model outperforms traditional approaches in ranking correlation and recommendation tasks. Our contributions are summarized as follows.
• We present the first study of dynamic entity relatedness ranking using collective attention.
• We introduce an attention-based convolutional neural network (CNN) to capture the temporal signals of an entity.
• We propose a joint framework to incorporate multiple views of the entities, from both the content provider's and the user's perspectives, for entity relatedness ranking.
Related Work

Entity Relatedness and Recommendation
Most existing semantic relatedness measures (e.g., derived from Wikipedia) can be divided into two major types: (1) text-based and (2) graph-based. For the first type, traditional methods mainly focus on a high-dimensional semantic space based on occurrences of words (Markovitch, 2007, 2009) or concepts (Aggarwal and Buitelaar, 2014). In recent years, embedding methods that learn low-dimensional word representations have been proposed. Hu et al. (2015) leverage entity embeddings on knowledge graphs to better learn the distributional semantics. Ni et al. (2016) use an adapted version of Word2Vec, where each entity in a Wikipedia page is considered as a term. Graph-based measures usually take advantage of the hyperlink structure of the entity graph (Witten and Milne, 2008; Guo and Barbosa, 2014). Recent graph embedding techniques (e.g., DeepWalk (Perozzi et al., 2014)) have not been directly used for entity relatedness in Wikipedia, yet their performance has been studied and shown to be very competitive in recent related work (Zhao et al., 2015; Ponza et al., 2017).
Entity relatedness is also studied in connection with the entity recommendation task. The Spark system (Blanco et al., 2013) first introduced the task for Web search, and Yu et al. (2014) and Zhang et al. (2016a) exploit user click logs and entity pane logs for global and personalized entity recommendation. However, these approaches are optimized to user information needs and do not target the global and temporal dimension. Recently, Zhang et al. (2016b) and Tran et al. (2017) proposed time-aware probabilistic approaches that combine 'static' entity relatedness with temporal factors from different sources. Nguyen et al. (2018) studied the task of time-aware ranking for entity aspects and proposed an ensemble model to address the problem of competing sub-features.

Neural Network Models
Neural Ranking. Deep neural ranking models in IR and NLP can be generally divided into two groups: representation-focused and interaction-focused models. The representation-focused approach (Huang et al., 2013) independently learns a representation for each ranking element (e.g., query and document) and then employs a similarity function. On the other hand, the interaction-focused models are designed based on the early interactions between the ranking pairs as the input of the network. For instance, Lu and Li (2013) and Guo et al. (2016) build interactions (i.e., local matching signals) between two pieces of text and train a feed-forward network for computing the matching score. This enables the model to capture various interactions between ranking elements, whereas with the former approach the model can only observe the input elements in isolation.
Attention networks. In recent years, attention-based NN architectures, which learn to focus their "attention" on specific parts of the input, have shown promising results on various NLP tasks. In most cases, attention is applied to sequential models to capture global context (Luong et al., 2015). An attention mechanism often relies on a context vector that facilitates outputting a "summary" over all (deterministic soft) or a sample (stochastic hard) of the input states. Recent work proposed a CNN with an attention-based framework to model local context representations of textual pairs (Yin et al., 2016), or combined CNNs with LSTMs to model time-series data (Ordóñez and Roggen, 2016; Lin et al., 2017) for classification and trend prediction tasks.

Preliminaries
We denote as named entities any real-world objects registered in a database. Each entity has a textual document (e.g. content of a home page), and a sequence of references to other entities (e.g., obtained from semantic annotations), called the entity link profile. All link profiles constitute an entity linking graph. In addition, two types of information are included to form the entity collective attention.
Temporal signals. Each entity can be associated with a number of properties such as view counts, content edits, etc. Given an entity $e$, a time point $n$ and $D$ properties, the temporal signals form a (univariate or multivariate) time series $X \in \mathbb{R}^{D \times T}$ consisting of $T$ real-valued vectors $x_{n-T}, \cdots, x_{n-1}$, where $x_t \in \mathbb{R}^{D}$ captures the past signals of $e$ at time point $t$.
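As a minimal sketch of this definition, the window $X$ can be cut out of per-property time series as follows. The function name `temporal_window`, the property names and the toy values are illustrative assumptions, not part of the original setup.

```python
from typing import Dict, List


def temporal_window(series: Dict[str, List[float]], n: int, T: int) -> List[List[float]]:
    """Return X as D rows of length T covering time points n-T .. n-1."""
    if T <= 0 or n - T < 0:
        raise ValueError("need T > 0 and at least T observations before time point n")
    rows = []
    for name in sorted(series):  # fixed property order gives a stable row layout
        values = series[name]
        if n > len(values):
            raise ValueError(f"property {name!r} is not observed up to time point n-1")
        rows.append(list(values[n - T:n]))  # the value at time point n itself is excluded
    return rows


# Toy monthly signals for one entity (D = 2 properties, 5 observed months).
signals = {
    "page_views": [120.0, 90.0, 300.0, 410.0, 380.0],
    "edits": [3.0, 1.0, 7.0, 9.0, 4.0],
}
X = temporal_window(signals, n=4, T=3)
```

Only observations strictly before the studied time point enter the window, which mirrors the indices $n-T, \cdots, n-1$ above.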
Entity Navigation. In many systems, the user navigation between two entities is captured, e.g., search engines can log the total click-through of documents of the target entity presented in search results of a query involving the source entity. Following learning to rank approaches (Kang et al., 2015), we use this information as the ground truth in our supervised models. Given two entities $e_1, e_2$, the navigation signal from $e_1$ to $e_2$ at time point $t$ is denoted by $y^{t}_{\{e_1, e_2\}}$.

Problem Definition
In our setting, it is not required to have a predefined, static function quantifying the semantic relatedness between two entities. Instead, the setting can capture a family of functions $F$ whose prior distribution depends on a time parameter. We formalize the concepts below.
Dynamic Entity Relatedness between two entities $e_s$ and $e_t$, where $e_s$ is the source entity and $e_t$ is the target entity, at a given time $t$, is a function (denoted by $f_t(e_s, e_t)$) with the following properties.
• asymmetric: $f_t(e_s, e_t) \neq f_t(e_t, e_s)$ in general
Dynamic Entity Relatedness Ranking. Given a source entity $e_s$ and a time point $t$, rank the candidate entities $e_t$ by their semantic relatedness.

Datasets and Their Dynamics
In this work we use Wikipedia data as the case study for our entity relatedness ranking problem due to its rich knowledge and dynamic nature. It is worth noting that, although we experiment on Wikipedia, our framework is universal and can be applied to other sources of entities with available temporal signals and entity navigation. We use Wikipedia pages to represent entities and page views as the temporal signals (details in section 6.1).
Clickstream. For entity navigation, we use the clickstream dataset generated from the Wikipedia webserver logs from February until September 2016. These datasets contain an accumulation of transitions between two Wikipedia articles with their respective counts on a monthly basis. We study only actual pages (e.g., excluding disambiguation or redirect pages). In the following, we provide the first analysis of the clickstream data to gain insights into the temporal dynamics of the entity collective attention in Wikipedia. Figure 2a illustrates the distribution of entities by click frequencies, and the correlation of top popular entities (measured by total navigations) across different months is shown in Figure 2b. In general, we observe that the user navigation activities in the top popular entities are very dynamic. As shown in Table 1, 24.31% of the entities in the top-10,000 most active entities of September 2016 do not appear in the same list for the previous month, and 30.61% are new compared with 5 months before. In addition, 71% of the entities in the top-10,000 have navigations to new entities compared to the previous month, with approximately 18 new entities navigated to on average. Thus, the datasets are naturally very dynamic and sensitive to change. The substantial amount of missing past click logs on the newly-formed relationships also raises the necessity of a dynamic measuring approach.
Figure 3 shows the overall architecture of our framework, which consists of three major components: time-, graph- and content-based networks. Each component can be considered as a separate sub-ranking network. Each network accepts a tuple of three elements/representations as an input in a pair-wise fashion, i.e., the source entity $e_s$, the target entity $e_t$ with higher rank (denoted as $e^{(+)}$) and the one with lower rank (denoted as $e^{(-)}$). For the content network, each element is a sequence of terms, coming from the entity textual representation. For the graph network, we learn the embeddings from the entity linking graph. For the time network, we propose a new convolutional model learning from the entity temporal signals. More details are described below.

Neural Ranking Model Overview
The entity relatedness ranking can be handled by a point-wise ranking model that learns to predict the relatedness score directly. However, as the navigational frequency distribution is often skewed at the top, supervision guided by long-tail navigations would be prone to errors. Hence, instead of explicitly learning a calibrated scoring function, we opt for a pair-wise ranking approach. When applied to ranking top-k entities, this approach has the advantage of correctly predicting partial orders of different relatedness functions $f_t$ at any time point regardless of their non-transitivity (Cheng et al., 2012).
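To make the pair-wise setting concrete, the sketch below turns the navigation counts of one source entity into training tuples $(e_s, e^{(+)}, e^{(-)})$. The helper `pairwise_tuples`, the tie handling and the counts are assumptions for illustration only; they are not taken from the clickstream data.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def pairwise_tuples(source: str, nav_counts: Dict[str, int]) -> List[Tuple[str, str, str]]:
    """Build (e_s, e_plus, e_minus) tuples from navigation counts of one source entity."""
    # Sort targets by navigation count, most navigated first.
    ranked = sorted(nav_counts.items(), key=lambda item: item[1], reverse=True)
    tuples = []
    for (higher, c_high), (lower, c_low) in combinations(ranked, 2):
        if c_high > c_low:  # ties carry no preference and are skipped (assumption)
            tuples.append((source, higher, lower))
    return tuples


# Illustrative counts only.
counts = {"Scream Queens": 120, "The Twilight Saga": 500, "Run the Tide": 120}
training_tuples = pairwise_tuples("Taylor Lautner", counts)
```

With these counts, only the two pairs that place The Twilight Saga above a less navigated target are produced; the tied pair is dropped.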
This work builds upon the idea of interaction-based deep neural models, i.e., learning soft semantic matches from the source-target entity pairs. Note that we do not aim for a Siamese architecture (Chopra et al., 2005) (i.e., as in representation-based models), where the weight parameters are shared across networks. The reason is that such a network produces a symmetric relation, violating the asymmetric property of the relatedness function $f_t$ (section 3.2). Concretely, each deep network $\psi$ consists of an input layer $z_0$, $n-1$ hidden layers and an output layer $z_n$. Each hidden layer $z_i$ is a fully-connected network that computes the transformation $z_i = \sigma(w_i \cdot z_{i-1} + b_i)$, where $w_i$ and $b_i$ are the weight matrix and bias at hidden layer $i$, and $\sigma$ is a non-linear function such as the rectified linear unit (ReLU). The final score under the trio setup is summed from multiple networks:
$$\phi(e_s, e_t) = \psi_{\mathrm{time}}(e_s, e_t) + \psi_{\mathrm{graph}}(e_s, e_t) + \psi_{\mathrm{content}}(e_s, e_t) \quad (1)$$
In the next section we describe the input representations $z_0$ for each network.
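The layer transformation and the summed trio score can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the single linear output unit, the helper names (`affine`, `network_score`, `trio_score`) and all weights are hypothetical and do not reproduce the trained networks.

```python
from typing import Dict, List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight matrix w_i, bias vector b_i)


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def affine(weights: List[List[float]], bias: List[float], z: Sequence[float]) -> List[float]:
    """Compute w . z + b for one layer."""
    return [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weights, bias)]


def network_score(layers: Sequence[Layer], z0: Sequence[float]) -> float:
    """Apply z_i = relu(w_i . z_{i-1} + b_i) on hidden layers, then a linear output."""
    z = list(z0)
    for weights, bias in layers[:-1]:
        z = relu(affine(weights, bias, z))
    weights, bias = layers[-1]
    return affine(weights, bias, z)[0]  # single linear output unit (assumption)


def trio_score(networks: Dict[str, Sequence[Layer]], inputs: Dict[str, Sequence[float]]) -> float:
    """Sum the scores of the component networks."""
    return sum(network_score(layers, inputs[name]) for name, layers in networks.items())


# Toy weights shared by all three components purely to keep the example short.
toy = [([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]), ([[1.0, 2.0]], [0.1])]
networks = {"time": toy, "graph": toy, "content": toy}
inputs = {"time": [2.0, 1.0], "graph": [0.0, 1.0], "content": [1.0, 1.0]}
score = trio_score(networks, inputs)
```

In the actual model each component has its own parameters and its own input representation $z_0$; the sketch only shows how the component scores are added.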

Content-based representation learning
To learn the entity representation from its content, we rely on the entity textual document (word-based) as well as its link profile (entity-based) (section 3.1). Since the vocabulary size of entities and words is often very large, the conventional one-hot vector representation becomes expensive. Hence, we adopt the word hashing technique from (Huang et al., 2013), which breaks a term into character trigraphs and thus can dramatically reduce the vector dimensionality. We then rely on embeddings to learn the distributed representations and build up the soft semantic interactions via input concatenation. Let $E : V \to \mathbb{R}^{m}$ be the embedding function, where $V$ is the vocabulary and $m$ is the embedding size, and let $w : V \to \mathbb{R}$ be the weighting function that learns the global term importance. With a compositionality function $\oplus$ defined as the weighted element-wise sum of word embedding vectors, the word-based representation for entity $e$ is hence $\bigoplus_{i=1}^{|e|} \big(E(t_i), w(t_i)\big)$, where $t_i$ denotes the $i$-th term of $e$. For the entity-based representation, we break down the surface form of a linked entity into a bag of words and apply the same procedure. The concatenation of the two representations for the tuple $\langle e_s, e^{(+)}, e^{(-)} \rangle$ is then input to the deep feed-forward network.
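The two ingredients of this representation, word hashing and the weighted element-wise sum, can be illustrated with a small sketch. The boundary marker `#` follows the usual description of word hashing; the lookup tables `embed` and `weight`, their values, and the choice to skip out-of-vocabulary terms are hypothetical stand-ins for the learned functions $E$ and $w$.

```python
from typing import Dict, List, Sequence


def letter_trigrams(term: str) -> List[str]:
    """Break a term into character trigrams after adding boundary markers."""
    marked = f"#{term.lower()}#"
    return [marked[i:i + 3] for i in range(len(marked) - 2)]


def compose(terms: Sequence[str], embed: Dict[str, List[float]],
            weight: Dict[str, float], size: int) -> List[float]:
    """Weighted element-wise sum of the embedding vectors of the given terms."""
    out = [0.0] * size
    for term in terms:
        if term not in embed:  # out-of-vocabulary terms are skipped (assumption)
            continue
        w = weight.get(term, 0.0)
        for j, value in enumerate(embed[term]):
            out[j] += w * value
    return out


# Hypothetical learned tables with embedding size m = 2.
embed = {"twilight": [1.0, 0.0], "saga": [0.0, 2.0]}
weight = {"twilight": 0.5, "saga": 0.25}
representation = compose(["twilight", "saga", "the"], embed, weight, size=2)
trigrams = letter_trigrams("good")
```

Here `trigrams` is `['#go', 'goo', 'ood', 'od#']` and `representation` is `[0.5, 0.5]`.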

Graph-based representation
To obtain the graph embedding for each entity, we adopt the idea of DeepWalk (Perozzi et al., 2014), which learns the embedding by predicting the vertex sequence generated by random walk. Concretely, given an entity $e$, we learn to predict the sequence of entity references $S_e$, which can be considered as the graph-wise context in the Skip-gram model. We then adopt the matching histogram mapping in (Guo et al., 2016) for the soft interaction of the ranking model. Specifically, denote the bag-of-entities representation of $e_s$ as $C_{e_s}$, and that of $e_t$ as $C_{e_t}$; we discretize the soft matching (calculated by cosine similarity of the embedding vectors) of each entity pair in $(C_{e_s}, C_{e_t})$ into different bins. The logarithms of the count values of each bin then constitute the interaction vector. This soft interaction is similar in spirit to the traditional link-based model (Witten and Milne, 2008), where the relatedness measure is based on the overlap of incoming links.
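A minimal sketch of the matching histogram follows, assuming equal-width bins over the cosine range $[-1, 1]$ and $\log(\text{count} + 1)$ to keep empty bins finite. Both choices, like the number of bins and the toy vectors, are assumptions for illustration rather than the settings used in the experiments.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def matching_histogram(source_vecs: Sequence[Sequence[float]],
                       target_vecs: Sequence[Sequence[float]],
                       bins: int = 5) -> List[float]:
    """Bin pairwise cosine similarities and return log(count + 1) per bin."""
    counts = [0] * bins
    for u in source_vecs:
        for v in target_vecs:
            sim = max(-1.0, min(1.0, cosine(u, v)))
            # Map [-1, 1] onto bin indices; similarity 1.0 falls into the last bin.
            index = min(int((sim + 1.0) / 2.0 * bins), bins - 1)
            counts[index] += 1
    return [math.log(c + 1.0) for c in counts]


# Toy embeddings for the linked entities of a source and a target entity.
C_source = [[1.0, 0.0], [0.0, 1.0]]
C_target = [[1.0, 0.0], [-1.0, 0.0]]
interaction = matching_histogram(C_source, C_target)
```

The resulting vector has one entry per bin and can be fed to the feed-forward network as the input $z_0$ of the graph component.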

Attention-based CNN for temporal representation
For learning the representation from entity temporal signals, the intuition is to model the low-level temporal correlation between two multivariate time series. Specifically, we learn to embed these time series of equal size $T$ into a Euclidean space, such that similar pairs are close to each other. Our embedding function takes the form of a convolutional neural network (CNN), shown in Figure 4. The architecture rests on four basic layers: a 1-D convolutional layer (that restricts the slide only along the time window dimension, following (Zheng et al., 2014)), a batch-norm layer, an attention-based layer and a fully connected layer.
Convolution layer: A 1-D convolution operation involves applying a filter $w_f \in \mathbb{R}^{1 \times w \times D}$ (i.e., a matrix of weight parameters) to each subsequence $X^{i}_{e}$ of window size $w$ to produce a new abstraction $F^{c}_{i}$ from $w_f \cdot L^{i}_{t:t+w-1,D} + b$, where $L^{i}_{t:t+w-1,D}$ denotes the concatenation of $w$ vectors in the lookup layer representing the subsequence $X^{i}_{e}$ and $b$ is a bias term. The convolutional layer is followed by a batch normalization (BN) layer (Ioffe and Szegedy, 2015), to speed up the convergence and help improve generalization.
Attention Mechanism: We apply an attention layer on the convolutional outputs. Conceptually, attention mechanisms allow NN models to focus selectively on only the important features, based on attention weights that are often derived from the interaction with the target or within the input itself (self-attention) (Vaswani et al., 2017). We adopt the former approach, with the intuition that the time-spatial patterns should not be treated equally, but the ones near the studied time should gain more focus. To ensure that the features in $F^{c}_{i}$ associated with different timestamps are rewarded differently, the attention weights are guided by a time-decay weight function, in a recency-favoring fashion. More formally, let $A \in \mathbb{R}^{(T-w+1) \times 1}$ be the time context vector and $F^{c}_{i} \in \mathbb{R}^{1 \times (T-w+1)}$ the output of convolution for $X$. Then the $k$-th column of the re-weighted feature map $F^{h}_{i}$ is derived by:
$$F^{h}_{i}[:, k] = A[k] \cdot F^{c}_{i}[:, k], \quad k = 1, \cdots, T-w+1$$
The time context vector $A$ is generated by a decay weight function, since each column $k$ in the vector is associated with a time $t_k$ which is $T-k+w$ time units away from the studied time $t$.
Decay weight function: we leverage the polynomial curve for the function,
$$PD(t_i, t) = \frac{1}{(t - t_i)^{\alpha} + 1},$$
where $\alpha$ defines the decay rate. It is worth noting that when $\alpha$ is increased, the attention layer acts just like a pooling layer¹. Stacking up multiple convolutional layers is possible; in this case $|A|$ is the size of the previous layer. The attention layer is only applied to the last convolution layer in our architecture. The output of the attention layer is then passed to a fully-connected layer with non-linear activation to obtain the temporal representation.
¹ Note that, for clear visualization, we put flattening before the attention layer in Figure 4.
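The decay-guided re-weighting can be sketched as follows, under the assumption that the distance of column $k$ to the studied time is $T-k+w$ as stated in the text and that the weights are used without further normalization. The function names and the toy feature map are hypothetical.

```python
from typing import List, Sequence


def polynomial_decay(distance: float, alpha: float) -> float:
    """PD = 1 / (distance^alpha + 1), where distance = t - t_i."""
    return 1.0 / (distance ** alpha + 1.0)


def time_context(T: int, w: int, alpha: float) -> List[float]:
    """Weights A[k] for k = 1 .. T-w+1, using the distance T-k+w from the text."""
    return [polynomial_decay(T - k + w, alpha) for k in range(1, T - w + 2)]


def reweight(feature_map: Sequence[float], context: Sequence[float]) -> List[float]:
    """Scale each convolution output column by its time-context weight."""
    if len(feature_map) != len(context):
        raise ValueError("feature map and context vector must have the same length")
    return [a * f for a, f in zip(context, feature_map)]


A_slow = time_context(T=6, w=2, alpha=1.0)   # distances 7, 6, 5, 4, 3
A_fast = time_context(T=6, w=2, alpha=4.0)
F_c = [0.2, 0.4, 0.1, 0.3, 0.5]              # toy convolution output, length T-w+1
F_h = reweight(F_c, A_slow)

# A larger alpha concentrates relatively more weight on the most recent column.
ratio_slow = A_slow[-1] / A_slow[0]
ratio_fast = A_fast[-1] / A_fast[0]
```

Comparing `ratio_slow` with `ratio_fast` shows how a larger $\alpha$ makes the most recent column dominate the older ones, which is the sense in which the layer approaches a pooling operation.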

Learning and Optimization
Finally, we describe the optimization and training procedure of our network. We use a logarithmic loss that can lead to better probability estimation at the cost of accuracy. Our network minimizes the cross-entropy loss function as follows:
$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ P_{\{e_s, e^{(+)}, e^{(-)}\}_i} \log \bar{y}_i + \big(1 - P_{\{e_s, e^{(+)}, e^{(-)}\}_i}\big) \log (1 - \bar{y}_i) \Big] + \lambda \|\theta\|_2^2$$
where $N$ is the training size and $\bar{y}$ is the output of the sigmoid layer on the predicted label. $\theta$ contains all the parameters of the network and $\lambda \|\theta\|_2^2$ is the L2 regularization. $P_{\{e_s, e^{(+)}, e^{(-)}\}_i}$ is the probability that $e^{(+)}$ is ranked higher than $e^{(-)}$ derived from entity navigation, P
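A small sketch of a regularized pair-wise cross-entropy of this kind is given below. Two points are assumptions and not taken from the text: the sigmoid is applied to the score difference between $e^{(+)}$ and $e^{(-)}$, and the target probability is the share of navigations going to $e^{(+)}$. All function names are hypothetical.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def target_probability(nav_pos: float, nav_neg: float) -> float:
    """Assumed target: share of the navigations that go to e(+)."""
    total = nav_pos + nav_neg
    return 0.5 if total == 0 else nav_pos / total


def pairwise_log_loss(score_diffs: Sequence[float], targets: Sequence[float],
                      params: Sequence[float], lam: float) -> float:
    """Mean cross-entropy over training tuples plus lam * ||params||_2^2."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for diff, p in zip(score_diffs, targets):
        y = min(max(sigmoid(diff), eps), 1.0 - eps)
        total -= p * math.log(y) + (1.0 - p) * math.log(1.0 - y)
    return total / len(score_diffs) + lam * sum(t * t for t in params)


p = target_probability(30, 10)
loss = pairwise_log_loss([0.0], [0.5], params=[], lam=0.0)
regularized = pairwise_log_loss([0.0], [0.5], params=[1.0, 2.0], lam=0.1)
```

With a zero score difference and a target of 0.5, the unregularized loss equals $\log 2$; the regularized variant adds $0.1 \cdot (1^2 + 2^2) = 0.5$.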

Dataset
To recap from Section 4.1, we use the clickstream datasets from 2016. We also use the corresponding Wikipedia article dumps, with over 4 million entities represented by actual pages. Since the content of a Wikipedia article is often long, in this work we make use of only its abstract section. To obtain the temporal signals of an entity, we use page view statistics of Wikipedia articles and aggregate the counts by month. We fetch the data from June 2014 up until the studied time, which results in a length of 27 months.
Seed entities and related candidates. To extract popular and trending entities, we extract from the clickstream data the top 10,000 entities based on the number of navigations from major search engines (Google and Bing) at the studied time. Getting the subset of related entity candidates (for efficiency purposes) has been well addressed in related work (Guo and Barbosa, 2014; Ponza et al., 2017). In this work, we do not propose such a method and simply assume the use of an appropriate one. In the experiment, we choose only candidates which are visited from the seed entities at the studied time. We filtered out entity-candidate pairs with too few navigations (fewer than 10) and considered the top-100 candidates.

Models for Comparison
In this paper, we compare our models against the following baselines. Wikipedia Link-based (WLM): Witten and Milne (2008) proposed a low-cost measure of semantic relatedness based on the Wikipedia entity graph, inspired by Normalized Google Distance.
DeepWalk (DW): DeepWalk (Perozzi et al., 2014) learns representations of vertices in a graph with a random walk generator and language modeling. We chose not to compare with the matrix factorization approach in (Zhao et al., 2015): even though it allows the incorporation of different relation types (i.e., among entity, category and word), the iterative computation over large graphs is very expensive. When considering only entity-entity relations, its performance is reported to be rather similar to DW.
Entity2Vec Model (E2V): entity embedding learning using the Skip-Gram model (Mikolov et al., 2013). E2V utilizes textual information to capture latent word relationships. Similar to Zhao et al. (2015) and Ni et al. (2016), we use Wikipedia articles as the training corpus to learn word vectors and preserve the hyperlinks between entities.
RankSVM: Ceccarelli et al. (2013) learned entity relatedness from a set of 28 handcrafted features, using the traditional learning-to-rank method, RankSVM. We put together additional well-known temporal features (Kanhabua et al., 2014;Zhang et al., 2016b) (i.e., time series cross correlation, trending level and predicted popularity based on page views) and report the results of the extended feature set.
For our approach, we tested different combinations of content (denoted as Content$_{Emb}$), graph (Graph$_{Emb}$) and time (TS-CNN-Att) networks. We also test the content and graph networks with pretrained entity representations (i.e., ParaVecs-DM and DeepWalk).

Experimental Setup
Evaluation procedures. The time granularity is set to months. The studied time $t_n$ of our experiments is September 2016. From the seed queries, we use 80% for training, 10% for development and 10% for testing, as shown in Table 2. Note that, for the time-aware setting and to avoid leakage and bias as much as possible, the data for training and development (including supervision) are taken up until time $t_n - 1$. Specifically, for content and graph data, only $t_n - 1$ is used.
Metrics. We use two correlation coefficients, Pearson and Spearman, which have been used often throughout the literature, cf. (Dallmann et al., 2016; Ponza et al., 2017). The Pearson index focuses on the difference between predicted and correct relatedness scores, while Spearman focuses on the ranking order among entity pairs. Our work studies the strength of the dynamic relatedness between entities, hence we focus more on the Pearson index. However, traditional correlation metrics do not consider the positions in the ranked list (correlations at the top or bottom are treated equally). For this reason, we adjust the metric to consider the rankings at specific top-k positions, which consequently can be used to measure the correlation for only the top items in the ranking (based on the ground truth). In addition, we use the Normalized Discounted Cumulative Gain (NDCG) measure to evaluate the recommendation tasks.
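One possible reading of the top-k adjustment is sketched below: the items are restricted to the k highest-scored entities in the ground truth before the Pearson coefficient is computed. The exact adjustment used in the experiments is not spelled out in the text, so the function `pearson_at_k`, the zero default for missing predictions and the toy scores are assumptions.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 when a variance is zero."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def pearson_at_k(truth: Dict[str, float], predicted: Dict[str, float], k: int) -> float:
    """Correlation restricted to the k top entities according to the ground truth."""
    top = sorted(truth, key=truth.get, reverse=True)[:k]
    return pearson([truth[e] for e in top], [predicted.get(e, 0.0) for e in top])


# Toy navigation counts (ground truth) and predicted relatedness scores.
truth = {"a": 100.0, "b": 50.0, "c": 10.0, "d": 1.0}
predicted = {"a": 0.9, "b": 0.5, "c": 0.7, "d": 0.1}
top2 = pearson_at_k(truth, predicted, k=2)
overall = pearson_at_k(truth, predicted, k=len(truth))
```

In this toy case the two top entities are ordered correctly, so the top-2 correlation is higher than the correlation over the full list, where the misplaced entity `c` lowers the score.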

Experimental Tasks
We evaluate our proposed method in two different scenarios: (1) relatedness ranking and (2) entity recommendation. The first task evaluates how well we can mimic the ranking given by the entity navigation. Here we use the raw number of navigations in the Wikipedia clickstream. The second task is formulated as follows: given an entity, suggest the top-k entities most related to it right now. Since there is no standard ground truth for this temporal task, we constructed two relevance ground truths. The first one is the proxy ground truth, in which the relevance grade is automatically assigned from the (top-100) most navigated target entities. The graded relevance score is then given as the reversed rank order. For this, all entities in the test set are used. The second one is based on human judgments with a 5-level graded relevance scale, i.e., from 4 (highly relevant) to 0 (not (temporally) relevant). Two human experts evaluated a subset of 20 entities (randomly sampled from the test set), with 600 entity pairs (approx. 30 per seed, using the pooling method). The ground-truth size is comparable to that of the widely used ground truth for static relatedness assessment, KORE (Hoffart et al., 2012). The Cohen's Kappa agreement is 0.72. The performance of the best-performing models on this dataset is then tested with a paired t-test against the WLM baseline.
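The proxy ground truth and its use in NDCG can be sketched as follows. The linear-gain form of DCG with a $\log_2$ discount is one common variant and is an assumption here, as are the helper names and the toy entity identifiers.

```python
import math
from typing import Dict, List, Sequence


def proxy_grades(navigated: Sequence[str], top: int = 100) -> Dict[str, float]:
    """Grade = reversed rank order among the most navigated targets (sorted descending)."""
    cut = list(navigated[:top])
    return {entity: float(len(cut) - i) for i, entity in enumerate(cut)}


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked: Sequence[str], grades: Dict[str, float], k: int) -> float:
    gains: List[float] = [grades.get(entity, 0.0) for entity in ranked]
    ideal = sorted(grades.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Targets of one seed entity, already sorted by navigation count (toy identifiers).
grades = proxy_grades(["a", "b", "c"])
perfect = ndcg_at_k(["a", "b", "c"], grades, k=3)
reversed_order = ndcg_at_k(["c", "b", "a"], grades, k=3)
```

With the full top-100 list, the same construction yields the fine-grained 100-level scale of the proxy setting, while the human judgments would supply grades from 0 to 4 directly.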

Results on Relatedness Ranking
We report the performance of the relatedness ranking on the left side of Table 3, with the Pearson and Spearman metrics. Among existing baselines, we observe that link-based approaches, i.e., WLM and DeepWalk, perform better than the others for top-k correlation, whereas temporal models yield substantial improvement overall. Specifically, TS-CNN-Att performs better than the no-attention model in most cases, improving by 11% for Pearson@10 and by 3% when considering the total rank. Our trio model performs well overall and gives the best results for total rank. The duo models (combining the base with either pretrained DW or PV) also deliver improvements over the sole temporal ones. We also observe additional gains when combining the temporal base with pretrained DW and PV altogether.

Results on Entity Recommendation
Here we report the results on the nDCG metrics. Table 3 (right side) shows the results for the two ground-truth settings (proxy and human). We observe that the baselines perform well on this task compared with conventional temporal models, significantly so in the proxy setting. This can be explained by the fact that 'static' entity relations are ranked high by the non-time-aware baselines and hence are still rewarded under a fine-grained grading scale (100 levels). The margin becomes smaller in the human setting, with the standard 5-level scale. All the models with pretrained representations perform poorly. This shows that, for this task, an early interaction-based approach is more suitable than one based purely on representations.

Additional Analysis
We present an anecdotal example of top-selected entities for Kingsman: The Golden Circle in Table 4. While the content-based model favors old relations like the preceding movies, TS-CNN puts the popular actress Halle Berry or the recently released X-Men: Apocalypse on top. The latter is not ideal as there is no solid relationship between the two movies. One implication is that the two entities are ranked high more because of their own popularity than because of the strength of their relationship to the source entity. The Trio model addresses the issue by taking other perspectives into account; it also balances the recency and long-term factors and gives the best ranking performance.
Analysis on decay hyper-parameter. We study the effect of the decay parameter on performance. Figure 5a illustrates the results on Pearson$_{\mathrm{all}}$ and nDCG@10 for the trio model. It can be seen that while nDCG slightly increases, the Pearson score peaks when $\alpha$ is in the range [1.5, 3.5]. Additionally, we show the convergence analysis on $\alpha$ for TS-CNN-Att in Figure 6. A bigger $\alpha$ tends to converge faster, but to a significantly higher loss when $\alpha$ is over 5.5 (omitted from the figure).
Performances on different entity types. We show in Figures 5b and 5c the model performances on the person and event types. WLM performs worse for the latter, which suggests that link-based methods tend to slowly adapt

Conclusion
In this work, we presented a trio neural model to solve the dynamic entity relatedness ranking problem. The model jointly learns rich representations of entities from textual content, graph and temporal signals. We also propose an effective CNN-based attentional mechanism for learning the temporal representation of an entity. Experiments on ranking correlations and top-k recommendation tasks demonstrate the effectiveness of our approach over existing baselines. For future work, we aim to incorporate more temporal signals, and to investigate different 'trainable' attention mechanisms that go beyond the time-based decay, for instance by incorporating latent topics.