DocTag2Vec: An Embedding Based Multi-label Learning Approach for Document Tagging

Tagging news articles or blog posts with relevant tags from a collection of predefined ones is coined as document tagging in this work. Accurate tagging of articles can benefit several downstream applications such as recommendation and search. In this work, we propose a novel yet simple approach called DocTag2Vec to accomplish this task. We substantially extend Word2Vec and Doc2Vec – two popular models for learning distributed representation of words and documents. In DocTag2Vec, we simultaneously learn the representation of words, documents, and tags in a joint vector space during training, and employ the simple k-nearest neighbor search to predict tags for unseen documents. In contrast to previous multi-label learning methods, DocTag2Vec directly deals with raw text instead of provided feature vector, and in addition, enjoys advantages like the learning of tag representation, and the ability of handling newly created tags. To demonstrate the effectiveness of our approach, we conduct experiments on several datasets and show promising results against state-of-the-art methods.


Introduction
Every hour, several thousand blog posts are shared on social media; for example, blogging sites such as Tumblr (tumblr.com) had more than 70 billion posts by January 2014 across different communities (Chang et al., 2014). In order to reach the right audience or community, authors often assign keywords or "#tags" (hashtags) to these blog posts. Besides being topic markers, hashtags have also been shown to serve as group identities (Bruns and Burgess, 2011) and as brand labels (Page, 2012). On Tumblr, authors are allowed to create their own tags or choose existing tags to label their blogs. Creating or choosing tags for maximum outreach can be a tricky task, and authors may not be able to assign all the relevant tags. To alleviate this problem, algorithm-driven document tagging has emerged as a potential solution in recent times. Automatically tagging blogs has several downstream applications, e.g., blog search, clustering similar blogs, showing topics associated with trending tags, and personalization of blog posts. For better user engagement, a personalization algorithm could match user interests with the tags associated with a blog post. (This work was done when the author was an intern at Yahoo.)

From a machine learning perspective, document tagging is by nature a multi-label learning (MLL) problem, where the input space is a certain feature space X of documents and the output space is the power set 2^Y of a finite set of tags Y. Given training data Z ⊂ X × 2^Y, we want to learn a function f : X → 2^Y that predicts tags for unseen documents. As shown in Figure 1a, during training, a standard MLL algorithm (big blue box) typically attempts to fit the prediction function (small blue box) to the feature vectors of documents and the corresponding tags. Note that the feature vectors are generated separately before training, and the tags for each document are encoded as a |Y|-dimensional binary vector, with one representing presence and zero otherwise.
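As a small illustration (the tag names are hypothetical, not from the paper), the binary tag encoding described above can be sketched as:

```python
def encode_tags(doc_tags, tag_vocab):
    """Encode a document's tag set as a |Y|-dimensional binary vector:
    one entry per tag in the (ordered) tag vocabulary, 1 if present."""
    return [1 if t in doc_tags else 0 for t in tag_vocab]
```

A standard MLL algorithm then fits the prediction function to pairs of (feature vector, binary tag vector).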
In the prediction phase, the learned prediction function outputs relevant tags for the feature vector of an unseen document. Following such a paradigm, many generic algorithms have been developed for MLL (Weston et al., 2011; Bhatia et al., 2015). With the surge of text content created by users online, such as blog posts, Wikipedia entries, etc., algorithms for document tagging face several challenges. Firstly, time-sensitive news articles are generated on a daily basis, and it is important for an algorithm to assign tags before they lose freshness. Secondly, new tagged documents are continually fed into the training system, so incrementally adapting the system to new training data without re-training from scratch is also critical. Thirdly, we might face a very large set of candidate tags that can change dynamically, as new things are invented.
In view of the aforementioned challenges, in this paper we propose a new and simple approach for document tagging: DocTag2Vec. Our approach is motivated by the line of work on learning distributed representations of words and documents, e.g., Word2Vec (Mikolov et al., 2013) and Doc2Vec (a.k.a. Paragraph Vector) (Le and Mikolov, 2014). Word2Vec and Doc2Vec aim at learning low-dimensional feature vectors (i.e., embeddings) for words and documents from a large corpus in an unsupervised manner, such that similarity between words (or documents) can be reflected by some distance metric on their embeddings. The general assumption behind Word2Vec and Doc2Vec is that more frequent co-occurrence of two words inside a small neighborhood of a document should imply higher semantic similarity between them (see Section 2.2 for details). DocTag2Vec extends this idea to documents and tags by positing that a document and its associated tags should share high semantic similarity, which allows us to learn the embeddings of tags along with documents (see Section 2.3 for details). Our method has two striking differences compared with standard MLL frameworks: firstly, it works directly with raw text and does not need feature vectors extracted in advance; secondly, DocTag2Vec produces tag embeddings, which carry semantic information that is generally not available from standard MLL frameworks. During training, DocTag2Vec directly takes the raw documents and tags as input and learns their embeddings using stochastic gradient descent (SGD). In terms of prediction, a new document is first embedded using the Doc2Vec component inside DocTag2Vec, and tags are then assigned by searching for the nearest tags embedded around the document. Overall, the proposed approach has the following merits.
• The SGD training supports the incremental adjustment of DocTag2Vec to new data.
• The prediction uses a simple k-nearest neighbor search among tags instead of documents, so its running time does not grow with the amount of training data.
• Since our method represents each individual tag by its own embedding vector, it is easy to dynamically incorporate new tags.
• The output tag embeddings can be used in other applications.
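As a concrete sketch of the prediction step (our own minimal illustration, assuming the document embedding has already been inferred by the Doc2Vec component and the tag embeddings already learned), the k-nearest neighbor search among tags can be written as:

```python
import math

def predict_tags(doc_vec, tag_vecs, k):
    """Return the k tags whose embeddings are nearest to the document
    embedding by cosine similarity. tag_vecs maps tag -> embedding."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    # rank all tags by similarity to the document and keep the top k
    return sorted(tag_vecs, key=lambda t: cos(doc_vec, tag_vecs[t]),
                  reverse=True)[:k]
```

Note that the search is over the (comparatively small) set of tags, not over training documents, which is why prediction cost does not grow with the training set.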
Related Work: Multi-label learning has found several applications in social media and on the web, such as sentiment and topic analysis (Huang et al., 2013a), social text stream analysis (Ren et al., 2014), and online advertising (Agrawal et al., 2013). MLL has also been applied to diverse Natural Language Processing (NLP) tasks. However, to the best of our knowledge, we are the first to propose an embedding-based MLL approach for an NLP task. MLL has been applied to the Word Sense Disambiguation (WSD) problem for polysemic adjectives (Boleda et al., 2007); Huang et al. (2013b) proposed a joint model to predict sentiment and topic for tweets, and Surdeanu et al. (2012) proposed a multi-instance MLL based approach for relation extraction with distant supervision.
Paper Organization: The rest of the paper is organized as follows. In Section 2, we first give a brief review of the Word2Vec and Doc2Vec models, and then present the training and prediction steps of our proposed extension, DocTag2Vec. In Section 3, we demonstrate the effectiveness of our DocTag2Vec approach through experiments on several datasets. In the end, Section 4 is dedicated to conclusions and future work.

Proposed Approach
In this section, we present the details of DocTag2Vec. For ease of exposition, we first introduce some mathematical notation, followed by a brief review of two widely-used embedding models: Word2Vec and Doc2Vec.

Notation
We let V be the size of the vocabulary (i.e., the set of unique words), N be the number of documents in the training set, M be the size of the tag set, and K be the dimension of the embedding vector space. We denote the vocabulary as W = {w_1, …, w_V}, the set of documents as D = {d_1, …, d_N}, and the set of tags as T = {t_1, …, t_M}. Each document d ∈ D is basically a sequence of n_d words, represented by (w_1^d, w_2^d, …, w_{n_d}^d), and is associated with M_d tags {t_1^d, …, t_{M_d}^d}. Here the subscript d of n and M indicates that the number of words and tags differs from document to document. For convenience, we use w_i ∈ R^K, d_i ∈ R^K and t_i ∈ R^K to denote the embedding vectors of word w_i, document d_i and tag t_i, respectively, and collect them into W = [w_1, …, w_V] ∈ R^{K×V} as the matrix for word embeddings, D = [d_1, …, d_N] ∈ R^{K×N} as the matrix for document embeddings, and T = [t_1, …, t_M] ∈ R^{K×M} as the matrix for tag embeddings. Sometimes we may use the symbol d_i interchangeably with its embedding vector to refer to the i-th document, and use d_d to denote the vector representation of document d. Similar conventions apply to word and tag embeddings. Besides, we let σ(·) be the sigmoid function, i.e., σ(a) = 1/(1 + exp(−a)).

Word2Vec and Doc2Vec
The proposed approach is inspired by Word2Vec, an unsupervised model for learning embeddings of words. Essentially, Word2Vec embeds all words in the training corpus into a low-dimensional vector space, so that the semantic similarities between words can be reflected by some distance metric (e.g., cosine distance) defined on their vector representations. The way to train the Word2Vec model is to minimize the loss function associated with a certain classifier, with respect to both the feature vectors (i.e., word embeddings) and the classifier parameters, such that nearby words are able to predict each other. For example, in the continuous bag-of-words (CBOW) framework, Word2Vec specifically minimizes the following average negative log probability,

    (1/N) Σ_{d∈D} Σ_{i=1}^{n_d} −log p(w_i^d | w_{i−c}^d, …, w_{i+c}^d),   (1)

where c is the size of the context window inside which words are defined as "nearby". To ensure the conditional probability above is legitimate, one usually needs to evaluate a partition function, which may lead to a computationally prohibitive model when the vocabulary is large. A popular choice to bypass this issue is hierarchical softmax (HS) (Morin and Bengio, 2005), which factorizes the conditional probability into a product of simple terms. Hierarchical softmax relies on the construction of a binary tree B with V leaf nodes, each of which corresponds to a particular word in the vocabulary W. HS is parameterized by a matrix H ∈ R^{K×(V−1)}, whose columns are respectively mapped to a unique non-leaf node of B. Additionally, we define Path(w) = {(u, v) ∈ B | edge (u, v) is on the path from the root to word w}. Then the negative log probability is given as

    −log p(w_i^d | w_{i−c}^d, …, w_{i+c}^d) = Σ_{(u,v)∈Path(w_i^d)} −log σ((−1)^{1−child(u,v)} ⟨h_u, g_i^d⟩),

where child(u, v) is equal to 1 if v is the left child of u and 0 otherwise, h_u is the column of H mapped to node u, and g_i^d is the input feature summarizing the context (explained below). Figure 2a shows the model architecture of CBOW Word2Vec.
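To make the factorization concrete, the following sketch (our own minimal illustration, not the paper's implementation) computes the hierarchical-softmax log probability by walking the root-to-leaf path of a word:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def hs_log_prob(path, h, g):
    """log p(w | context) under hierarchical softmax.

    path: list of (node, child) pairs on the root-to-leaf path of w,
          where child = 1 if the path goes to the left child, else 0
    h:    map from non-leaf node to its K-dim parameter vector (a column of H)
    g:    K-dim projection-layer input (e.g., sum of context embeddings)
    """
    logp = 0.0
    for node, child in path:
        dot = sum(hj * gj for hj, gj in zip(h[node], g))
        # the left branch contributes sigma(<h_u, g>) and the right branch
        # sigma(-<h_u, g>), so the two branch probabilities at every
        # non-leaf node sum to one -- no partition function is needed
        logp += math.log(sigmoid(dot if child == 1 else -dot))
    return logp
```

Since sibling probabilities sum to one at every node, the leaf probabilities form a valid distribution over the vocabulary at the cost of O(log V) per word.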
Basically, g_i^d is the input feature for the HS classifier, corresponding to the projection layer in Figure 2a; it essentially summarizes the feature vectors of the context words surrounding w_i^d, e.g., by summation, g_i^d = Σ_{−c≤j≤c, j≠0} w_{i+j}^d, and other options like averaging of the w_{i+j}^d can also be applied. This Word2Vec model can be directly extended to the Distributed Memory (DM) Doc2Vec model by conditioning the probability of w_i^d on d as well, i.e., p(w_i^d | w_{i−c}^d, …, w_{i+c}^d, d), which amounts to including the document embedding d in the projection layer. The architecture of the DM Doc2Vec model is illustrated in Figure 2b. Instead of optimizing a rigorously defined probability function, both Word2Vec and Doc2Vec can also be trained using other objectives, e.g., negative sampling (NEG) (Mikolov et al., 2013).
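A minimal sketch of the DM projection layer described above (summation variant; function and variable names are ours):

```python
def dm_projection(doc_vec, word_vecs, context_ids):
    """Projection-layer input g for the DM model: the document embedding
    plus the sum of the context word embeddings (averaging is another
    common choice). word_vecs maps a word id to its embedding."""
    g = list(doc_vec)  # start from the document embedding
    for w in context_ids:
        g = [gi + wi for gi, wi in zip(g, word_vecs[w])]
    return g
```

Dropping the `doc_vec` term recovers the plain CBOW Word2Vec projection.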

Training for DocTag2Vec
Our approach, DocTag2Vec, extends the DM Doc2Vec model by adding another component for learning tag embeddings. In addition to predicting the target word w_i^d using the context w_{i−c}^d, …, w_{i+c}^d, as shown in Figure 2c, DocTag2Vec also uses the document embedding to predict each associated tag, with the hope that they are closely embedded. The joint objective is given by

    Σ_{d∈D} Σ_{i=1}^{n_d} −log p(w_i^d | w_{i−c}^d, …, w_{i+c}^d, d) + α Σ_{d∈D} Σ_{j=1}^{M_d} −log p(t_j^d | d),   (2)

where α is a tuning parameter. As discussed for Word2Vec, the problem of evaluating a costly partition function is also faced by the newly introduced probability p(t|d). Different from the conditional probability of w_i^d, the probability p(t|d) cannot be modeled using hierarchical softmax, as the columns of the parameter matrix do not have a one-to-one correspondence to tags (remember that we need to obtain a vector representation for each tag). Motivated by the idea of negative sampling used in Word2Vec, rather than sticking to a proper probability function, we come up with the following objective for learning tag embeddings,

    ℓ(t_j^d, d) = −log σ(⟨t_j^d, d⟩) − r · E_{t∼p}[log σ(−⟨t, d⟩)],   (3)

where p is a discrete distribution over all tag embeddings {t_1, …, t_M} and r is an integer-valued hyperparameter. The goal of such an objective is to differentiate the tag t_j^d from draws according to p, which is chosen as the uniform distribution for simplicity in our practice. The final loss function for DocTag2Vec is then the combination of the word-prediction loss (1) (with the document-conditioned probability of the DM model) and the tag embedding objective with negative sampling (3),

    L(W, D, T, H) = Σ_{d∈D} Σ_{i=1}^{n_d} −log p(w_i^d | w_{i−c}^d, …, w_{i+c}^d, d) + α Σ_{d∈D} Σ_{j=1}^{M_d} ℓ(t_j^d, d).   (4)

We minimize L(W, D, T, H) using stochastic gradient descent (SGD). To avoid exact calculation of the expectation in negative sampling, at each iteration we sample r i.i.d. instances of t from the distribution p, denoted {t_p^1, t_p^2, …, t_p^r}, to stochastically approximate the expectation, i.e.,

    r · E_{t∼p}[log σ(−⟨t, d⟩)] ≈ Σ_{s=1}^{r} log σ(−⟨t_p^s, d⟩).

For prediction, the embedding d of a test document is first inferred via the Doc2Vec component, and the tag predictor f_k(d) returns the k tags with the largest entries of T̃^⊤ d, where T̃ is the column-normalized version of T, so that the entries of T̃^⊤ d are proportional to the cosine similarities between the tags and the document. To boost the prediction performance of DocTag2Vec, we further apply the bootstrap aggregation (a.k.a. bagging) technique to DocTag2Vec.
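The stochastic update for the tag component can be sketched as follows; this is our own simplified illustration (not the paper's code) that updates only the document and tag embeddings for a single (document, tag) pair, with uniform negative sampling:

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sgd_tag_step(doc, tag, candidate_tags, r, lr):
    """One SGD step on the negative-sampling tag objective: pull the true
    tag embedding and the document embedding together, and push r tags
    sampled uniformly from candidate_tags away from the document.
    Vectors are plain lists and are updated in place."""
    negatives = [random.choice(candidate_tags) for _ in range(r)]
    pairs = [(tag, 1.0)] + [(t, 0.0) for t in negatives]
    doc_grad = [0.0] * len(doc)
    for t, label in pairs:
        score = sigmoid(sum(di * ti for di, ti in zip(doc, t)))
        coeff = lr * (label - score)  # derivative of the log-sigmoid terms
        for j in range(len(doc)):
            doc_grad[j] += coeff * t[j]  # accumulate gradient w.r.t. doc
            t[j] += coeff * doc[j]       # update the tag embedding
    for j in range(len(doc)):
        doc[j] += doc_grad[j]
```

A full training step would additionally update the word embeddings and HS parameters via the Doc2Vec component.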
Essentially, we train b DocTag2Vec learners on different randomly sampled subsets of the training data, resulting in b different tag predictors f_k̃^1(·), …, f_k̃^b(·) along with their tag embedding matrices T^1, …, T^b. In general, the number of nearest neighbors k̃ used by an individual learner can be different from k. In the end, we combine the predictions from the different models by selecting, from the union ∪_{j=1}^{b} f_k̃^j(d_d), the k tags with the largest aggregated similarities with d_d.

Experiments

Datasets
In this subsection, we briefly describe the datasets included in our experiments. It is worth noting that the DocTag2Vec method needs raw text as input instead of extracted features; therefore, many benchmark datasets for evaluating multi-label learning algorithms are not suitable for our setting. For the experiments, we primarily focus on diversity of the tag sources, which capture different aspects of the documents. The statistics of all datasets are provided in Table 1.
Public datasets:
• Wiki10: The Wiki10 dataset contains a subset of English Wikipedia documents, tagged collaboratively by users of the social bookmarking site Delicious. We remove two uninformative tags, "wikipedia" and "wiki", from the collected data.
• WikiNER: WikiNER has a larger set of English Wikipedia documents. The tags for each document are the named entities inside it, detected automatically by a named entity recognition (NER) algorithm.

Proprietary datasets:
• Relevance Modeling (RM): The RM dataset consists of two sets of financial news articles, in Chinese and Korean respectively. Each article is tagged with related ticker symbols of companies, given by editorial judgement.
• News Content Taxonomy (NCT): The NCT dataset is a collection of news articles annotated by editors with topical tags from a taxonomy tree. The closer a tag is to the root, the more general its topic. For such hierarchically structured tags, we also evaluate our method separately on tags of general topics (depth = 2) and specific topics (depth = 3).

Baselines and Hyperparameter Setting
The baselines include one of the state-of-the-art multi-label learning algorithms, SLEEC (Bhatia et al., 2015), a variant of DM Doc2Vec, and an unsupervised entity linking system, FastEL (Blanco et al., 2015), which is specific to the WikiNER dataset. SLEEC is based on non-linear dimensionality reduction of the binary tag vectors, and uses a sophisticated objective function to learn the prediction function. For comparison, we use the TF-IDF representation of a document as the input feature vector for SLEEC, as it yields better results than embedding-based features like the Doc2Vec feature. To extend DM Doc2Vec for tagging purposes, we basically replace the document d shown in Figure 2b with the tags t_1^d, …, t_{M_d}^d, and train the Doc2Vec model to obtain the tag embeddings. During testing, we perform the same steps as DocTag2Vec to predict the tags, i.e., inferring the embedding of the test document followed by k-NN search. FastEL is an unsupervised approach for entity linking of web-search queries that walks over a sequence of words in the query and aims to maximize the likelihood of linking a text span to an entity in Wikipedia. The FastEL model calculates the conditional probability of an entity given every substring of the input document, while avoiding the computation of entity-to-entity joint dependencies, thus making the process efficient. We built the FastEL model using query logs spanning 12 months and Wikipedia anchor text extracted from Wikipedia dumps dated November 2015. We choose an entity linker baseline because it is a simple way of detecting topics/entities that are semantically associated with a document.
Regarding hyperparameter settings, both SLEEC and DocTag2Vec aggregate multiple learners to enhance prediction accuracy, and we set the number of learners to 15. For SLEEC, we tune the rest of its hyperparameters using grid search. For SLEEC and DocTag2Vec, we set the number of epochs for SGD to 20 and the window size c to 8. To train each individual learner, we randomly sample 50% of the training data. In terms of the nearest neighbor search, we set k = 10 for Wiki10 and WikiNER while keeping k = 5 for the others. For the rest of the hyperparameters, we also apply grid search to find the best ones. For DocTag2Vec, we additionally need to set the number of negative tags r and the weight α in (4). Typically r ranges from 1 to 5, and r = 1 gives the best performance on the RM and NCT datasets. Empirically, a good choice for α is between 0.5 and 5. For FastEL, we consider a sliding window of size 5 over the raw text (no punctuation) of a document to generate entity candidates. We limit the number of candidates per document to 50.

Results
We use precision@k as the evaluation metric. Figure 3 shows the precision plots of the different approaches against choices of k on the Wiki10, WikiNER and RM datasets. On Wiki10, we see that the precision of our DocTag2Vec is close to that delivered by SLEEC, while Doc2Vec performs much worse. We observe a similar result on WikiNER except for precision@1, but our precision catches up as k increases. For the RM dataset, SLEEC outperforms our approach, and we conjecture that the gap is due to the small size of the training data, from which DocTag2Vec is not able to learn good embeddings. For the NCT dataset, we also train DocTag2Vec incrementally, i.e., each time we feed only 100 documents to DocTag2Vec and let it run SGD, and we keep doing so until all training samples are presented. As shown in Figure 4, our DocTag2Vec outperforms the Doc2Vec baseline, and delivers competitive or even better precision in comparison with SLEEC. Also, the incremental training does not sacrifice much precision, which makes DocTag2Vec even more appealing. The overall recall of DocTag2Vec is also slightly better than that of SLEEC, as shown in Table 2. Figures 5 and 6 show the precision plots against the number of learners b and the number of nearest neighbors k̃ for individual learners, respectively.
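For reference, the precision@k metric used here is simply:

```python
def precision_at_k(predicted, relevant, k):
    """precision@k: the fraction of the top-k predicted tags that appear
    in the document's ground-truth tag set."""
    return sum(1 for t in predicted[:k] if t in relevant) / float(k)
```

The reported numbers are averages of this quantity over all test documents.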
It is not difficult to see that after b = 10, adding more learners gives no significant improvement in precision. For the nearest neighbor search of individual learners, k̃ = 5 suffices.
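A small sketch (function and variable names ours) of the bagging aggregation discussed above, where each of the b learners proposes its k̃ nearest tags and the per-tag cosine similarities are summed across learners:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def aggregate_predict(doc_vec, tag_matrices, k, k_tilde):
    """Bagging aggregation: each learner proposes its k_tilde nearest
    tags (by cosine similarity between the document embedding and the
    learner's own tag embeddings); similarities are summed per tag
    across learners, and the k highest-scoring tags are returned.

    tag_matrices: one {tag: embedding} dict per learner.
    """
    scores = {}
    for tags in tag_matrices:
        ranked = sorted(tags, key=lambda t: cosine(doc_vec, tags[t]),
                        reverse=True)
        for t in ranked[:k_tilde]:  # only this learner's top-k_tilde vote
            scores[t] = scores.get(t, 0.0) + cosine(doc_vec, tags[t])
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Tags proposed by many learners accumulate larger aggregated scores, which is what makes the ensemble more robust than any single learner.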

Case Study for NCT dataset
For the NCT dataset, when we examine the predictions for individual articles, it turns out, surprisingly, that there are a significant number of cases where DocTag2Vec outputs better tags than those from editorial judgement. Among all these cases, we include a few in Table 3 to show that the tags given by DocTag2Vec are sometimes superior. For the first article, we can see that the three predicted tags are all related to the topic; in particular, the one with the highest similarity, /Nature & Environment/Environment/Climate Change, seems more pertinent than the editor's choice. Similarly, we predict /Finance/Investment & Company Information/Company Earnings as the most relevant topic for the second article, which is more precise than its parent, /Finance/Investment & Company Information. Besides, our approach can even uncover tags wrongly assigned by the editor. The last piece of news is apparently about the NBA and should have the tag /Sports & Recreation/Basketball, as predicted, while the editor annotates it with the incorrect one, /Sports & Recreation/Baseball. On the other hand, looking at the similarity scores associated with the predicted tags, we can see that a higher score in general implies higher aboutness, so the score can also serve as a quantification of prediction confidence.

Conclusions and Future Work
In this paper, we present a simple method for document tagging based on the popular distributed representation learning models Word2Vec and Doc2Vec. Compared with classical multi-label learning methods, our approach provides several benefits, such as allowing incremental updates of the model, handling dynamic changes of the tag set, and producing feature representations for tags. Document tagging can benefit a number of applications on social media: if text content over the web is correctly tagged, articles or blog posts can be pushed to the right users who are likely to be interested, and such personalization will potentially improve user engagement. In the future, we consider extending our approach in a few directions. Given that tagged documents are often costly to obtain, it would be interesting to extend our approach to a semi-supervised setting, where we can incorporate large amounts of unannotated documents to enhance our model. On the other hand, with the recent progress in graph embedding for social networks (Yang et al., 2016; Grover and Leskovec, 2016), we may be able to improve the tag embeddings by exploiting the representations of users and the interactions between tags and users on social networks.