hyperdoc2vec: Distributed Representations of Hypertext Documents

Hypertext documents, such as web pages and academic papers, are of great importance in delivering information in our daily life. Although effective on plain documents, conventional text embedding methods suffer from information loss if directly adapted to hyper-documents. In this paper, we propose a general embedding approach for hyper-documents, namely, hyperdoc2vec, along with four criteria characterizing necessary information that hyper-document embedding models should preserve. Systematic comparisons are conducted between hyperdoc2vec and several competitors on two tasks, i.e., paper classification and citation recommendation, in the academic paper domain. Analyses and experiments both validate the superiority of hyperdoc2vec over other models w.r.t. the four criteria.


Introduction
The ubiquitous World Wide Web has boosted research interest in hypertext documents, e.g., personal webpages (Lu and Getoor, 2003), Wikipedia pages (Gabrilovich and Markovitch, 2007), as well as academic papers (Sugiyama and Kan, 2010). Unlike independent plain documents, a hypertext document (hyper-doc for short) links to another hyper-doc by a hyperlink or citation mark in its textual content. Given this essential distinction, hyperlinks or citations are worth specific modeling in many tasks such as link-based classification (Lu and Getoor, 2003), web retrieval (Page et al., 1999), entity linking (Cucerzan, 2007), and citation recommendation (He et al., 2010).
To model hypertext documents, various efforts (Cohn and Hofmann, 2000; Kataria et al., 2010; Perozzi et al., 2014; Zwicklbauer et al., 2016; Wang et al., 2016) have been made to depict networks of hyper-docs as well as their content. Among potential techniques, distributed representation (Mikolov et al., 2013; Le and Mikolov, 2014) tends to be promising since its validity and effectiveness are proven for plain documents on many natural language processing (NLP) tasks.
Conventional attempts at utilizing embedding techniques in hyper-doc-related tasks generally fall into two types. The first type (Berger et al., 2017; Zwicklbauer et al., 2016) simply downcasts hyper-docs to plain documents and feeds them into word2vec (Mikolov et al., 2013) (w2v for short) or doc2vec (Le and Mikolov, 2014) (d2v for short). These approaches involve downgrading hyperlinks and inevitably omit certain information in hyper-docs. However, no previous work investigates the information loss, and how it affects the performance of such downcasting-based adaptations. The second type designs sophisticated embedding models to fulfill certain tasks, e.g., citation recommendation (Huang et al., 2015b), paper classification (Wang et al., 2016), and entity linking (Yamada et al., 2016). These models are limited to specific tasks, and it is yet unknown whether embeddings learned for those particular tasks can generalize to others. Based on the above facts, we are interested in two questions:
• What information should hyper-doc embedding models preserve, and what nice properties should they possess?
• Is there a general approach to learning task-independent embeddings of hyper-docs?
To answer the two questions, we formalize the hyper-doc embedding task, and propose four criteria, i.e., content awareness, context awareness, newcomer friendliness, and context intent awareness, to assess different models. Then we discuss simple downcasting-based adaptations of existing approaches w.r.t. the above criteria, and demonstrate that none of them satisfy all four. To this end, we propose hyperdoc2vec (h-d2v for short), a general embedding approach for hyper-docs. Different from most existing approaches, h-d2v learns two vectors for each hyper-doc to characterize its roles of citing others and being cited. Owing to this, h-d2v is able to directly model hyperlinks or citations without downgrading them. To evaluate the learned embeddings, we employ two tasks in the academic paper domain¹, i.e., paper classification and citation recommendation. Experimental results demonstrate the superiority of h-d2v. Comparative studies and controlled experiments also confirm that h-d2v benefits from satisfying the above four criteria.

arXiv:1805.03793v1 [cs.CL] 10 May 2018
We summarize our contributions as follows:
• We propose four criteria to assess different hyper-document embedding models.
• We propose hyperdoc2vec, a general embedding approach for hyper-documents.
• We systematically conduct comparisons with competing approaches, validating the superiority of h-d2v in terms of the four criteria.

Related Work
Network representation learning is a related topic to ours since a collection of hyper-docs resembles a network. To embed nodes in a network, Perozzi et al. (2014) propose DeepWalk, where nodes and random walks are treated as pseudo words and texts, and fed to w2v for node vectors. Tang et al. (2015b) explicitly embed second-order proximity via the number of common neighbors of nodes. Grover and Leskovec (2016) extend DeepWalk with second-order Markovian walks. To improve classification tasks, Tu et al. (2016) explore a semi-supervised setting that accesses partial labels. Compared with these models, h-d2v learns from both documents' connections and contents while they mainly focus on network structures.
Document embedding for classification is another area where document embeddings are widely applied. Le and Mikolov (2014) employ learned d2v vectors to build different text classifiers. Tang et al. (2015a) apply the method in (Tang et al., 2015b) on word co-occurrence graphs for word embeddings, and average them for document vectors. For hyper-docs, Ganguly and Pudi (2017) and Wang et al. (2016) target paper classification in unsupervised and semi-supervised settings, respectively. However, unlike h-d2v, they do not explicitly model citation contexts. Yang et al. (2015)'s approach also addresses embedding hyper-docs, but involves matrix factorization and does not scale.
Citation recommendation is a direct downstream task to evaluate embeddings learned for a certain kind of hyper-docs, i.e., academic papers. In this paper we concentrate on context-aware citation recommendation (He et al., 2010). Some previous studies adopt neural models for this task. Huang et al. (2015b) propose Neural Probabilistic Model (NPM) to tackle this problem with embeddings. Their model outperforms non-embedding ones (Kataria et al., 2010; Tang and Zhang, 2009; Huang et al., 2012). Ebesu and Fang (2017) also exploit neural networks for citation recommendation, but require author information as additional input. Compared with h-d2v, these models are limited to a task-specific setting.
Embedding-based entity linking is another topic that exploits embeddings to model certain hyper-docs, i.e., Wikipedia (Huang et al., 2015a; Yamada et al., 2016; Sun et al., 2015; Fang et al., 2016; He et al., 2013; Zwicklbauer et al., 2016), for entity linking (Shen et al., 2015). It resembles citation recommendation in the sense that linked entities highly depend on the contexts. Meanwhile, it requires extra steps like candidate generation, and can benefit from sophisticated techniques such as collective linking (Cucerzan, 2007).

Preliminaries
We introduce notations and definitions, then formally define the embedding problem. We also propose four criteria for hyper-doc embedding models w.r.t. their appropriateness and informativeness.

Notations and Definitions
Let $w \in W$ be a word from a vocabulary $W$, and $d \in D$ be a document id (e.g., web page URLs and paper DOIs) from an id collection $D$. After filtering out non-textual content, a hyper-document $H$ is reorganized as a sequence of words and doc ids.

[Figure 1(a): Hyper-documents. Labels: source doc (Zhao and Gildea, 2010), context words $C$, target docs (Papineni et al., 2002) and (Koehn et al., 2007). Example text: "… We also evaluate our model by computing the machine translation BLEU score (Papineni et al., 2002) using the Moses system (Koehn et al., 2007) …"]

Problem Statement
Given a corpus of hyper-docs $\{H_d\}_{d \in D}$ with $D$ and $W$, we want to learn document and word embedding matrices $\mathbf{D} \in \mathbb{R}^{k \times |D|}$ and $\mathbf{W} \in \mathbb{R}^{k \times |W|}$ simultaneously. The $i$-th column $\mathbf{d}_i$ of $\mathbf{D}$ is a $k$-dimensional embedding vector for the $i$-th hyper-doc with id $d_i$. Similarly, $\mathbf{w}_j$, the $j$-th column of $\mathbf{W}$, is the vector for word $w_j$. Once embeddings for hyper-docs and words are learned, they can facilitate applications like hyper-doc classification and citation recommendation.

Criteria for Embedding Models
A reasonable model should learn how contents and hyperlinks in hyper-docs impact both $\mathbf{D}$ and $\mathbf{W}$. We propose the following criteria for models:
• Content aware. Content words of a hyper-doc play the main role in describing it, so the document representation should depend on its own content. For example, the words in Zhao and Gildea (2010) should affect and contribute to its embedding.
• Context aware. Hyperlink contexts usually provide a summary for the target document. Therefore, the target document's vector should be impacted by words that others use to summarize it, e.g., paper Papineni et al. (2002) and the word "BLEU" in Figure 1(a).
• Newcomer friendly. In a hyper-document network, it is inevitable that some documents are not referred to by any hyperlink in other hyper-docs. If such "newcomers" do not get embedded properly, downstream tasks involving them are infeasible or deteriorated.
• Context intent aware. Words around a hyperlink, e.g., "evaluate … by" in Figure 1(a), normally indicate why the source hyper-doc makes the reference, e.g., for general reference or to follow/oppose the target hyper-doc's opinion or practice. Vectors of those context words should be influenced by both documents to characterize such semantics or intents between the two documents.
We note that the first three criteria are for hyper-docs, while the last one is desired for word vectors.

Representing Hypertext Documents
In this section, we first give the background of two prevailing techniques, word2vec and doc2vec. Then we present two conversion approaches for hyper-documents so that w2v and d2v can be applied. Finally, we address their weaknesses w.r.t. the aforementioned four criteria, and propose our hyperdoc2vec model. In the remainder of this paper, when the context is clear, we mix the use of terms hyper-doc/hyperlink with paper/citation.

word2vec and doc2vec
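w2v (Mikolov et al., 2013) has proven effective for many NLP tasks. It integrates two models, i.e., cbow and skip-gram, both of which learn two types of word vectors, i.e., IN and OUT vectors. cbow sums up IN vectors of context words and makes the sum predictive of the current word's OUT vector. skip-gram uses the IN vector of the current word to predict its context words' OUT vectors. As a straightforward extension to w2v, d2v also has two variants: pv-dm and pv-dbow. pv-dm works in a similar manner as cbow, except that the IN vector of the current document is regarded as a special context vector to average. Analogously, pv-dbow uses the IN document vector to predict its words' OUT vectors, following the same structure as skip-gram. Therefore in pv-dbow, words' IN vectors are omitted.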

Adaptation of Existing Approaches
To represent hyper-docs, a straightforward strategy is to convert them into plain documents in a certain way and apply w2v and d2v. Two conversions following this strategy are illustrated below.
Citation as word. This approach is adopted by Berger et al. (2017).² As Figure 1(b) shows, document ids $D$ are treated as a collection of special words. Each citation is regarded as an occurrence of the target document's special word. After applying standard word embedding methods, e.g., w2v, we obtain embeddings for both ordinary words and special "words", i.e., documents. In doing so, this approach allows target documents to interact with context words, and thus produces context-aware embeddings for them.
Context as content. It is often observed in academic papers that, when citing others' work, an author briefly summarizes the cited paper in the citation context. Inspired by this, we propose a context-as-content approach as in Figure 1(c). To start, we remove all citations. Then all citation contexts of a target document $d_t$ are copied into $d_t$ as additional contents to make up for the lost information. Finally, d2v is applied to the augmented documents to generate document embeddings. With this approach, the generated document embeddings are both context- and content-aware.
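The two conversions can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a hyper-doc is taken to be a list of tokens in which doc ids carry a hypothetical `DOC:` prefix, the paper ids are made up, and the window is counted in token positions. None of these names come from the original implementation.

```python
CITE_PREFIX = "DOC:"  # hypothetical marker that distinguishes doc ids from words


def is_citation(token):
    return token.startswith(CITE_PREFIX)


def citation_as_word(hyper_docs):
    """Keep doc ids in the token stream as special 'words' (input for w2v)."""
    return {doc_id: list(tokens) for doc_id, tokens in hyper_docs.items()}


def context_as_content(hyper_docs, window=50):
    """Remove citations, then copy each citation context into its target doc."""
    augmented = {doc_id: [t for t in tokens if not is_citation(t)]
                 for doc_id, tokens in hyper_docs.items()}
    for tokens in hyper_docs.values():
        for i, tok in enumerate(tokens):
            if not is_citation(tok):
                continue
            target = tok[len(CITE_PREFIX):]
            # Up to `window` token positions on each side, other citations dropped.
            left = [t for t in tokens[max(0, i - window):i] if not is_citation(t)]
            right = [t for t in tokens[i + 1:i + 1 + window] if not is_citation(t)]
            augmented.setdefault(target, []).extend(left + right)
    return augmented


docs = {
    "zhao2010": ["evaluate", "by", "BLEU", "DOC:papineni2002", "score"],
    "papineni2002": ["a", "metric"],
}
augmented_docs = context_as_content(docs, window=2)
```

In this toy input, the target paper's token list gains the words surrounding its citation, while the citing paper loses the citation token. That is exactly why the first conversion yields context-aware vectors only for cited documents and the second mixes contexts into contents.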

hyperdoc2vec
Besides citation-as-word with w2v and context-as-content with d2v (denoted by d2v-cac for short), there is also an alternative using d2v on documents with citations removed (d2v-nc for short). We made a comparison of these approaches in Table 1 in terms of the four criteria stated in Section 3.3. It is observed that none of them satisfy all criteria, where the reasons are as follows.

² It is designed for document visualization purposes.
First, w2v is not content aware. Following our examples in the academic paper domain, consider the paper (hyper-doc) Zhao and Gildea (2010) in Figure 1(a). From w2v's perspective in Figure 1(b), "… computing the machine translation BLEU …" and other text no longer have association with Zhao and Gildea (2010), thus not contributing to its embedding. In addition, papers that have just been published and have not yet obtained citations will not appear as special "words" in any text. This makes w2v newcomer-unfriendly, i.e., unable to produce embeddings for them. Second, being trained on a corpus without citations, d2v-nc is obviously not context aware. Finally, in both w2v and d2v-cac, context words interact with the target documents without treating the source documents as backgrounds, which forces IN vectors of words with context intents, e.g., "evaluate" and "by" in Figure 1(a), to simply remember the target documents, rather than capture the semantics of the citations.
The above limitations are caused by the conversions of hyper-docs where certain information in citations is lost. For a citation $\langle d_s, C, d_t \rangle$, citation-as-word only keeps the co-occurrence information between $C$ and $d_t$. Context-as-content, on the other hand, mixes $C$ with the original content of $d_t$. Both approaches implicitly downgrade citations $\langle d_s, C, d_t \rangle$ to $\langle C, d_t \rangle$ for adaptation purposes.
To learn hyper-doc embeddings without such limitations, we propose hyperdoc2vec. In this model, two vectors of a hyper-doc $d$, i.e., IN and OUT vectors, are adopted to represent the document in its two roles. The IN vector $\mathbf{d}^I$ characterizes $d$ being a source document. The OUT vector $\mathbf{d}^O$ encodes its role as a target document. We note that learning those two types of vectors is advantageous. It enables us to model citations and contents simultaneously without sacrificing information on either side. Next, we describe the details of h-d2v in modeling citations and contents.
To model citations, we adopt the architecture in Figure 2. It is similar to pv-dm, except that documents rather than words are predicted at the output layer. For a citation $\langle d_s, C, d_t \rangle$, to allow context words $C$ to interact with both vectors, we average $\mathbf{d}_s^I$ of $d_s$ with word vectors of $C$, and make the resulting vector predictive of $\mathbf{d}_t^O$ of $d_t$. Formally, for all citations $\mathcal{C} = \{\langle d_s, C, d_t \rangle\}$, we aim to optimize the following average log probability objective:

$$\max_{\mathbf{D}^I, \mathbf{D}^O, \mathbf{W}^I} \frac{1}{|\mathcal{C}|} \sum_{\langle d_s, C, d_t \rangle \in \mathcal{C}} \log P(d_t \mid d_s, C) \tag{1}$$

To model the probability $P(d_t \mid d_s, C)$ where $d_t$ is cited in $d_s$ with $C$, we average their IN vectors

$$\mathbf{x} = \frac{1}{1 + |C|} \Big( \mathbf{d}_s^I + \sum_{w \in C} \mathbf{w}^I \Big) \tag{2}$$

and use $\mathbf{x}$ to compose a multi-class softmax classifier on all OUT document vectors

$$P(d_t \mid d_s, C) = \frac{\exp(\mathbf{x}^\top \mathbf{d}_t^O)}{\sum_{d \in D} \exp(\mathbf{x}^\top \mathbf{d}^O)} \tag{3}$$

To model contents' impact on document vectors, we simply consider an additional objective function that is identical to pv-dm, i.e., enumerate words and contexts, and use the same input architecture as Figure 2 to predict the OUT vector of the current word. Such convenience owes to the fact that using two vectors makes the model parameters compatible with those of pv-dm. Note that combining the citation and content objectives leads to a joint learning framework. To facilitate easier and faster training, we adopt an alternative pre-training/fine-tuning or retrofitting framework (Faruqui et al., 2015). We initialize with a predefined number of pv-dm iterations, and then optimize Eq. 1 based on the initialization.
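The citation model described above (averaging the source document's IN vector with the context words' IN vectors, then applying a softmax over all OUT document vectors) can be illustrated with a minimal sketch. It assumes plain Python lists as vectors; the function names, paper ids, and toy values are hypothetical and are not taken from the authors' Gensim-based implementation.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def average_input(doc_in, word_ins):
    """Average the source doc's IN vector with the IN vectors of the context words."""
    total = list(doc_in)
    for vec in word_ins:
        for j, value in enumerate(vec):
            total[j] += value
    return [v / (1 + len(word_ins)) for v in total]


def citation_probabilities(x, doc_outs):
    """Softmax over all OUT document vectors, giving P(d | d_s, C) for each d."""
    scores = {d: dot(x, vec) for d, vec in doc_outs.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {d: math.exp(s - top) for d, s in scores.items()}
    norm = sum(exps.values())
    return {d: e / norm for d, e in exps.items()}


# Toy 2-dimensional vectors (made-up values, for illustration only).
d_s_in = [1.0, 0.0]                    # IN vector of the source paper
context_in = [[0.0, 1.0], [2.0, 2.0]]  # IN vectors of two context words
doc_out = {"papineni2002": [1.0, 1.0], "koehn2007": [0.0, -1.0]}

x = average_input(d_s_in, context_in)
probs = citation_probabilities(x, doc_out)
```

Because the source document enters the average alongside the context words, the same context can point to different targets under different source papers, which is the property the downcasting-based conversions lose.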
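For efficient training, we adopt negative sampling with $n$ negative samples,

$$\log \sigma(\mathbf{x}^\top \mathbf{d}_t^O) + \sum_{i=1}^{n} \mathbb{E}_{d_i \sim P_N(d)} \log \sigma(-\mathbf{x}^\top \mathbf{d}_i^O) \tag{4}$$

and use it to replace every $\log P(d_t \mid d_s, C)$. Following Huang et al. (2015b), we adopt a uniform distribution on $D$ as the distribution $P_N(d)$.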
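A negative-sampling surrogate with a uniform noise distribution over document ids can be sketched as follows. This is a standard word2vec-style formulation written under assumptions: the expectation is replaced by `n_neg` sampled documents, vectors are plain lists, and all names are hypothetical rather than the authors' code.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def negative_sampling_objective(x, target_id, doc_outs, n_neg, rng):
    """log sigmoid(x . d_t^O) plus log sigmoid(-x . d_i^O) for n_neg sampled docs."""
    doc_ids = sorted(doc_outs)
    value = log_sigmoid(dot(x, doc_outs[target_id]))
    for _ in range(n_neg):
        # Uniform noise distribution over all doc ids; the target itself may be
        # drawn, which this sketch does not filter out.
        noise_id = rng.choice(doc_ids)
        value += log_sigmoid(-dot(x, doc_outs[noise_id]))
    return value


doc_out = {"papineni2002": [1.0, 1.0], "koehn2007": [0.0, -1.0]}
x = [1.0, 1.0]
positive_only = negative_sampling_objective(x, "papineni2002", doc_out, 0, random.Random(0))
with_noise = negative_sampling_objective(x, "papineni2002", doc_out, 3, random.Random(0))
```

Since uniform sampling can draw any document id, documents that are never cited still receive updates to their OUT vectors, which is consistent with the newcomer-friendliness argument in the text.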
Unlike the other models in Table 1, h-d2v satisfies all four criteria. We refer to the example in Figure 2 to make the points clear. First, when optimizing Eq. 1 with the instance in Figure 2, the update to $\mathbf{d}^O$ of Papineni et al. (2002) depends on $\mathbf{w}^I$ of context words such as "BLEU". Second, we pre-train $\mathbf{d}^I$ with contents, which makes the document embeddings content aware. Third, newcomers can depend on their contents for $\mathbf{d}^I$, and update their OUT vectors when they are sampled³ in Eq. 4. Finally, the optimization of Eq. 1 enables mutual enhancement between vectors of hyper-docs and context intent words, e.g., "evaluate by". Under the background of a machine translation paper Zhao and Gildea (2010), the above two words help point the citation to the BLEU paper (Papineni et al., 2002), thus updating its OUT vector. The intent "adopting tools/algorithms" of "evaluate by" is also better captured by iterating over many document pairs with them in between.

Experiments
In this section, we first introduce datasets and basic settings used to learn embeddings. We then discuss additional settings and present experimental results of the two tasks, i.e., document classification and citation recommendation, respectively.

Table 5: F1 on DBLP when newcomers are discarded.

Datasets and Experimental Settings
We use three datasets from the academic paper domain, i.e., NIPS⁴, ACL anthology⁵, and DBLP⁶, as shown in Table 3. They all contain full text of papers, and are of small, medium, and large size, respectively. We apply ParsCit … 2010), we take 50 words before and after a citation as the citation context. Gensim (Řehůřek and Sojka, 2010) is used to implement all w2v and d2v baselines as well as h-d2v. We use cbow for w2v and pv-dbow for d2v, unless otherwise noted. For all three baselines, we set the (half) context window length to 50. For w2v, d2v, and the pv-dm-based initialization of h-d2v, we run 5 epochs following Gensim's default setting. For h-d2v, its iteration is set to 100 epochs with 1000 negative samples. The dimension size $k$ of all approaches is 100. All other parameters in Gensim are kept as default.
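The citation-context setting (50 words before and after a citation) can be made concrete with a small sketch that turns a tokenized paper into source, context, target triples. It assumes, purely for illustration, that citations appear in the token stream as doc ids with a hypothetical `DOC:` prefix and that the window is counted in token positions; ParsCit output is not modeled here.

```python
CITE_PREFIX = "DOC:"  # hypothetical marker that distinguishes doc ids from words


def extract_citations(source_id, tokens, window=50):
    """Return (source id, context words, target id) triples for one hyper-doc."""
    triples = []
    for i, tok in enumerate(tokens):
        if not tok.startswith(CITE_PREFIX):
            continue
        # Up to `window` token positions on each side of the citation mark.
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        context = [t for t in neighbours if not t.startswith(CITE_PREFIX)]
        triples.append((source_id, context, tok[len(CITE_PREFIX):]))
    return triples


tokens = ["computing", "the", "BLEU", "score", "DOC:papineni2002",
          "using", "the", "Moses", "system", "DOC:koehn2007"]
triples = extract_citations("zhao2010", tokens, window=2)
```

With a window of 2, the toy sentence yields one triple per cited paper. In the setting described above the window would be 50, so neighbouring citations in the same sentence typically share most of their context words.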

Document Classification
In this task, we classify the research fields of papers given their vectors learned on DBLP. To obtain labels, we use Cora⁸, a small dataset of Computer Science papers and their field categories. We keep the first levels of the original categories, e.g., "Artificial Intelligence" of "Artificial Intelligence - Natural Language Processing", leading to 10 unique classes. We then intersect the dataset with DBLP, and obtain 5,975 labeled papers. For w2v and h-d2v outputting both IN and OUT document vectors, we use IN vectors or concatenations of both vectors as features. For newcomer papers without w2v vectors, we use zero vectors instead. To enrich the features with network structure information, we also try concatenating them with the output of DeepWalk (Perozzi et al., 2014), a representative network embedding model. The model is trained on the citation network of DBLP with an existing implementation⁹ and default parameters. An SVM classifier with RBF kernel is used. We perform 5-fold cross validation, and report Macro- and Micro-F1 scores.

⁴ https://cs.nyu.edu/roweis/data.html
⁵ http://clair.eecs.umich.edu/aan/index.php (2013 release)
⁶ http://zhou142.myweb.cs.uwindsor.ca/academicpaper.html This page has been unavailable recently. They provide a larger CiteSeer dataset and a collection of DBLP paper ids. To better interpret results from the Computer Science perspective, we intersect them and obtain the DBLP dataset.
⁷ https://github.com/knmnyn/ParsCit
⁸ http://people.cs.umass.edu/~mccallum/data.html
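For readers unfamiliar with the two reported metrics, the following sketch computes Macro- and Micro-F1 for single-label multi-class predictions. It is a generic illustration, not the evaluation script used in the paper; the function name and the toy labels ("AI", "DB") are made up.

```python
def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    # Macro: unweighted mean of per-class F1. Micro: F1 over pooled counts.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


macro_f1, micro_f1 = f1_scores(["AI", "AI", "DB", "DB"], ["AI", "DB", "DB", "DB"])
```

Macro-F1 weights every class equally, so it is more sensitive to small classes, whereas Micro-F1 pools all decisions and coincides with accuracy in the single-label case.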

Classification Performance
Table 4 presents the classification results. We have the following observations. First, adding DeepWalk information almost always leads to better classification performance, except for Macro-F1 of the d2v-cac approach.
Second, owing to their difference in context awareness, d2v-cac consistently outperforms d2v-nc in terms of all metrics and settings.
Third, w2v has the worst performance. The reason may be that w2v is neither content aware nor newcomer friendly. We will elaborate more on the impacts of the two properties in Section 5.2.2.
… space of the classifiers and the training variance. For w2v with or without DeepWalk, it is also the case. This may be because information in w2v's IN and OUT vectors is fairly redundant.

Impacts of Content Awareness and Newcomer Friendliness
Because content awareness and newcomer friendliness are highly correlated in Table 1, to isolate and study their impacts, we decouple them as follows. In the 5,975 labeled papers, we keep 2,052 with at least one citation, and redo experiments in Table 4. By carrying out such controlled experiments, we expect to remove the impact of newcomers, and compare all approaches only with respect to different content awareness. In Table 5, we provide the new scores obtained. By comparing Tables 4 and 5, we observe that w2v benefits from removing newcomers with zero vectors, while all newcomer friendly approaches get lower scores because of fewer training examples. Despite this change, w2v still cannot outperform the other approaches, which reflects the positive impact of content awareness on the classification task. It is also interesting that DeepWalk becomes very competitive. This implies that structure-based methods favor networks with better connectivity. Finally, we note that Table 5 is based on controlled experiments with intentionally skewed data. The results are not intended for comparison among approaches in practical scenarios.

Citation Recommendation
When writing papers, it is desirable to recommend proper citations for a given context. This could be achieved by comparing the vectors of the context and previous papers. We use all three datasets for this task. Embeddings are trained on papers before 1998, 2012, and 2009, respectively. The remaining papers in each dataset are used for testing.
We compare h-d2v with all approaches in Section 4.2, as well as NPM¹⁰ (Huang et al., 2015b) mentioned in Section 2, the first embedding-based approach for the citation recommendation task. Note that the inference stage involves interactions between word and document vectors and is non-trivial. We describe our choices below.
First, for w2v vectors, Nalisnick et al. (2016) suggest that the IN-IN similarity favors word pairs with similar functions (e.g., "red" and "blue"), while the IN-OUT similarity characterizes word co-occurrence or compatibility (e.g., "red" and "bull"). For citation recommendation that relies on the compatibility between context words and cited papers, we hypothesize that the IN-for-OUT (or I4O for short) approach will achieve better results. Therefore, for w2v-based approaches, we average IN vectors of context words, then score and rank OUT document vectors by dot product.
Second, for d2v-based approaches, we use the learned model to infer a document vector $\mathbf{d}$ for the context words, and use $\mathbf{d}$ to rank IN document vectors by cosine similarity. Among multiple attempts, we find this choice to be optimal.
Third, for h-d2v, we adopt the same scoring and ranking configurations as for w2v.
Finally, for NPM, we adopt the same ranking strategy as in Huang et al. (2015b). Following them, we focus on top-10 results and report the Recall, MAP, MRR, and nDCG scores.
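The IN-for-OUT (I4O) ranking used for w2v and h-d2v can be sketched as below: average the IN vectors of the context words, then rank documents by the dot product with their OUT vectors. The helper names, the handling of out-of-vocabulary words (skipped here), and the toy vectors are assumptions made for illustration only; a simple recall helper is included to show how a top-k list would be scored.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def rank_in_for_out(context_words, word_in, doc_out, top_k=10):
    """Average IN vectors of known context words, rank docs by OUT dot product."""
    vecs = [word_in[w] for w in context_words if w in word_in]
    if not vecs:
        return []
    dim = len(vecs[0])
    query = [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]
    ranked = sorted(doc_out, key=lambda d: dot(query, doc_out[d]), reverse=True)
    return ranked[:top_k]


def recall_at_k(ranked, relevant):
    """Fraction of ground-truth papers that appear in the ranked list."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


# Toy vectors (made-up values, for illustration only).
word_in = {"BLEU": [1.0, 0.0], "score": [0.0, 1.0]}
doc_out = {"papineni2002": [1.0, 1.0], "koehn2007": [-1.0, 0.0], "zhao2010": [0.5, 0.0]}

top2 = rank_in_for_out(["BLEU", "score", "unknown"], word_in, doc_out, top_k=2)
recall = recall_at_k(top2, ["papineni2002", "koehn2007"])
```

Replacing `doc_out` with IN document vectors would give the IN-for-IN (I4I) variant that the hypothesis above expects to perform worse.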

Recommendation Performance
In Table 6, we report the citation recommendation results. Our observations are as follows.
First, among all datasets, all methods perform relatively well on the medium-sized ACL dataset. This is because the smallest NIPS dataset provides too few citation contexts to train a good model. Moreover, DBLP requires a larger dimension size $k$ to store more information in the embedding vectors. We increase $k$ and report the Rec@10 scores in Figure 3. We see that all approaches have better performance when $k$ increases to 200, though d2v-based ones start to drop beyond this point.
Second, the I4I variant of w2v has the worst performance among all approaches. This observation validates our hypothesis in Section 5.3.
Third, the d2v-cac approach outperforms its variant d2v-nc in terms of all datasets and metrics. This indicates that context awareness matters in the citation recommendation task.
Fourth, the performance of NPM is sandwiched between those of w2v's two variants. We have tried our best to reproduce it. Our explanation is that NPM is citation-as-word-based, and only depends on citation contexts for training. Therefore, it is only context aware but neither content aware nor newcomer friendly, and behaves like w2v.
Finally, when retrofitting pv-dm, h-d2v generally has the best performance. When we substitute pv-dm with random initialization, the performance deteriorates to varying degrees on different datasets. This implies that content awareness is also important, if not as important as context awareness, on the citation recommendation task.

Impact of Context Intent Awareness
In this section, we analyze the impact of context intent awareness. We use … kernels and default parameters. Following Teufel et al. (2006), we use 10-fold cross validation. Figure 4 depicts the F1 scores. Scores of Teufel et al. (2006)'s approach are from the original paper. We omit d2v-nc because it is very inferior to d2v-cac. We have the following observations.
First, Teufel et al. (2006)'s feature-engineering-based approach has the best performance. Note that we cannot obtain their original cross validation split, so the comparison may not be fair and is only for consideration in terms of numbers.
Second, among all embedding-based methods, h-d2v has the best citation function classification results, which are close to Teufel et al. (2006)'s.
Finally, the d2v-cac vectors are only good at Neutral, the largest class. On the other classes and global F1, they are outperformed by w2v vectors.
To study how citation function affects citation recommendation, we combine the 2,824 labeled citation contexts and another 1,075 labeled contexts the authors published later to train an SVM, and apply it to the DBLP testing set to get citation functions. We evaluate citation recommendation performance of w2v (I4O), d2v-cac, and h-d2v on a per-citation-function basis. In Figure 5, we break down Rec@10 scores on citation functions. On the six largest classes (marked by solid dots), h-d2v outperforms all competitors. To better investigate the impact of context intent awareness, Table 9 shows recommended papers of the running example of this paper. Here, Zhao and Gildea (2010) cited the BLEU metric (Papineni et al., 2002) and Moses tools (Koehn et al., 2007) of machine translation. However, the additional words "machine translation" lead both w2v and d2v-cac to recommend many machine translation papers. Only our h-d2v manages to recognize the citation function "using tools/algorithms (PBas)", and concentrates on the citation intent to return the right papers in top-5 results.

Conclusion
We focus on the hyper-doc embedding problem. We propose that hyper-doc embedding algorithms should be content aware, context aware, newcomer friendly, and context intent aware. To meet all four criteria, we propose a general approach, hyperdoc2vec, which assigns two vectors to each hyper-doc and models citations in a straightforward manner. In doing so, the learned embeddings satisfy all criteria, which no existing model is able to do. For evaluation, paper classification and citation recommendation are conducted on three academic paper datasets. Results confirm the effectiveness of our approach. Further analyses also demonstrate that possessing the four properties helps h-d2v outperform other models.
(a), from w2v's perspective in Figure 1(b), "… computing the machine translation BLEU …" and other text no longer have association with

Figure 3: Varying $k$ on DBLP. The scores of w2v keep increasing to 26.63 at $k = 1000$, and then begin to drop. Even at the cost of a larger model and longer training/inference time, it still cannot outperform h-d2v's 30.37 at $k = 400$.
… appear in the hyper-doc of $d_s$, i.e., $H_{d_s}$, we stipulate that a hyperlink $\langle d_s, C, d_t \rangle$ is formed. Herein $d_s, d_t \in D$ are ids of the source and target documents, respectively; $C \subseteq W$ are context words. Figure 1(a) exemplifies a hyperlink.

Table 1: Analysis of tasks and approaches w.r.t. desired properties.
has proven effective for many NLP tasks. It integrates two models, i.e., cbow and skip-gram, both of which learn two types of word vectors, i.e., IN and OUT vectors. cbow sums up IN vectors of context words and makes the sum predictive of the current word's OUT vector. skip-gram uses the IN vector of the current word to predict its context words' OUT vectors. As a straightforward extension to w2v, d2v also has two variants: pv-dm and pv-dbow. pv-dm works in a similar manner as cbow, except that the IN vector of the current document
Table 7 analyzes the impact of newcomer friendliness. Contrary to what is done in Section 5.2.2, we only evaluate on testing examples where at least one ground-truth paper is a newcomer. Please note that newcomer unfriendly approaches do not

Table 9: Papers recommended by different approaches for a citation context in Zhao and Gildea (2010).