Label-Specific Document Representation for Multi-Label Text Classification

Multi-label text classification (MLTC) aims to tag most relevant labels for the given document. In this paper, we propose a Label-Specific Attention Network (LSAN) to learn a label-specific document representation. LSAN takes advantage of label semantic information to determine the semantic connection between labels and document for constructing label-specific document representation. Meanwhile, the self-attention mechanism is adopted to identify the label-specific document representation from document content information. In order to seamlessly integrate the above two parts, an adaptive fusion strategy is proposed, which can effectively output the comprehensive label-specific document representation to build multi-label text classifier. Extensive experimental results demonstrate that LSAN consistently outperforms the state-of-the-art methods on four different datasets, especially on the prediction of low-frequency labels. The code and hyper-parameter settings are released to facilitate other researchers.


Introduction
Text classification is a fundamental text mining task including multi-class classification and multilabel classification. The former only assigns one label to the given document, while the latter classifies one document into different topics. In this paper, we focus on multi-label text classification (MLTC) because it has become one of the core tasks in natural language processing and has been widely applied in topic recognition (Yang et al., 2016), question answering (Kumar et al., 2016), sentimental analysis (Cambria et al., 2014) and so on. With the boom of big data, MLTC becomes significantly challenging because it has to handle the massive documents, words and labels simultaneously. Therefore, it is an emergency to develop effective multi-label text classifier for various practical applications.
Multi-label text classification allows for the coexistence of more than one label in a single document, thus, there are semantical correlations among labels because they may share the same subsets of document. Meanwhile, the document may be long and complicated semantic information may be hidden in the noisy or redundant content. Furthermore, most documents fall into few labels while a large number of "tail labels" only contain very few positive documents. To handle these issues, researchers pay much attention on three facets: 1) how to sufficiently capture the semantic patterns from the original documents, 2) how to extract the discriminative information related to the corresponding labels from each document, and 3) how to accurately mine the correlation among labels.
Till now, in the community of machine learning and natural language processing, researchers have paid tremendous efforts on developing MLTC methods in each facet. Among them, deep learning-based methods such as CNN (Liu et al., 2017;Kurata et al., 2016), RNN (Liu et al., 2016), combination of CNN and RNN (Lai et al., 2015;Chen et al., 2017), attention mechanism (Yang et al., 2016;You et al., 2018), (Adhikari et al., 2019) and etc., have achieved great success in document representation. However, most of them only focus on document representation but ignore the correlation among labels. Recently, some methods including DXML , EX-AM (Du et al., 2018), SGM (Yang et al., 2018), GILE (Pappas and Henderson, 2019) are proposed to capture the label correlations by exploiting label structure or label content. Although they obtained promising results in some cases, they still cannot work well when there is no big difference between label texts (e.g., the categories Management vs Management moves in Reuters News), which makes them hard to distinguish.
In MLTC task, one document may contain multiple labels, and each label can be taken as one aspect or component of the document, thus, the overall semantics of the whole document can be formed by multiple components. Motivated by the above-mentioned observations, we propose a novel Label-Specific Attention Network model (LSAN) to learn document representation by sufficiently exploiting the document content and label content. To capture the label-related component from each document, we adopt the self-attention mechanism to measure the contribution of each word to each label. Meanwhile, LSAN takes advantage of label texts to embed each label into a vector like word embedding, so that the semantic relations between document words and labels can be explicitly computed. Thereafter, an adaptive fusion strategy is designed to extract the proper amount of information from these two aspects and construct the label-specific representation for each document. We summarize the main contributions: • A label-specific attention network model is proposed to handle multi-label text classification task by considering document content and label texts simultaneously.
• An adaptive fusion strategy is first designed to adaptively extract the proper semantical information to construct label-specific document representation.
• The performance of LSAN is thoroughly investigated on four widely-used benchmark datasets in terms of several evaluation metrics, indicating its advantage over the stateof-the-art baselines.
The rest of the paper is organized as follows. Section 2 describes the proposed LSAN model for multi-label text classification. The experiments on real-word datasets are conducted in Section 3 and their results are discussed in detail. Section 4 lists the related work. The brief conclusions and future work are given in Section 5.

Proposed Method
In this section, we introduce the proposed labelspecific attention network, as shown in Figure 1.
LSAN consists of two main parts. The first part is to capture the label-related components from each document by exploiting both document content and label texts. The second part aims to adaptively extract the proper information from two aspects. Finally, the classification model can be trained on the fused label-specific document representations.

Preliminaries
denote the set of documents, which consists of N documents with corresponding labels Y = {y i ∈ {0, 1} l }, here l is the total number of labels. Each document contains a sequence of words. Each word can be encoded to a low-dimensional space and represented as a d-dimension vector via word2vector technique (Pennington et al., 2014). Let x i = {w 1 , · · · , w p , · · · , w n } denote the i-th document, w p ∈ R k is the p-th word vector in the document, n is the number of words in document.
For text classification, each label contains textual information. Thus, similar to the document word, one label can be represented as an embedding vector and the label set will be encoded by a trainable matrix C ∈ R l×k . Given the input documents and their associated labels D, MLTC aims to train a classifier to assign the most relevant labels to the new coming documents.
Input Text Representation: To capture the forward and backward sides contextual information of each word, we adopt the bidirectional long short-term memory (Bi-LSTM)  language model to learn the word embedding for each input document. At time-step p, the hidden state can be updated with the aid of input and (p − 1)-th step output.
where w p is the embedding vector of the p-th word in the corresponding document, and − → h p , ← − h p ∈ R k indicate the forward and backward word context representations respectively. Then, the whole document can be represented by Bi-LSTM as follows.
In this case, the whole document set can be taken as a matrix H ∈ R 2k×n .

Label-Specific Attention Network
In this subsection, we will give the proposed attention network model for label-specific document representation learning. It aims to determin the label-related component from each document. Actually, this strategy is intuitive for text classification. For example, regarding the text "June a friday, in the lawn, a war between the young boys of the football game starte", it is assigned into two categories youth and sports. Obviously, the content "young boy" is much more related to youth than to sports, while "football game" should be directly related to sports. Next, we will show how to capture this characteristic with our model.

Self-attention Mechanism
As mentioned above, the multi-label document may be tagged by more than one label, and each document should have the most relative contexts with its corresponding labels. In other words, each document may contain multiple components, and the words in one document make different contributions to each label. To capture different components for each label, we adopt the self-attention mechanism (Lin et al., 2017), which has been successful used in various text mining tasks (Tan et al., 2018;Al-Sabahi et al., 2018;You et al., 2018). The label-word attention score (A s ∈ R l×n ) can be obtained by where W 1 ∈ R da×2k and W 2 ∈ R l×da are the socalled self-attention parameters to be trained. d a is a hyper-parameter we can set arbitrarily. Each row A (s) j· (an n-dim row vector where n is the total number of words) indicates the contribution of all words to the j-th label. Then, we can obtain the the linear combination of the context words for each label with the aid of label-word attention score (A (s) ) as follows.
which can be taken as a new representation of the input document along the j-th label. Then the whole matrix M (s) ∈ R l×2k is the label-specific document representation under the self-attention mechanism.

Label-Attention Mechanism
Self-attention mechanism can be taken as contentbased attention because it only considers the document content. As we all know, labels have specific semantics in text classification, which is hidden in the label texts or descriptions. To make use of the semantic information of labels, they are preprocessed and represented as a trainable matrix C ∈ R l×k in the same latent k-dim space with the words. Once having the word embedding from Bi-LSTM in (1) and the label embedding in C, we can explicitly determine the semantic relation between each pair of word and label. A simple way is calculating the dot product between where − → A (l) ∈ R l×n and ← − A (l) ∈ R l×n indicate the forward and backward sides semantic relation between words and labels. Similar to the previous self-attention mechanism, the label-specific document representation can be constructed by linear combining the label's context words as follows.
Finally, the document can be re-represented along all labels via . This representation is based on the label texts, thus, we called it as label-attention mechanism.

Adaptive Attention Fusion Strategy
Both M (s) and M (l) are label-specific document representation, but they are different. The former focuses on the document content, while the later prefers to the semantic correlation between document content and label text. In order to take advantage these two parts, in this subsection, an attention fusion strategy is proposed to adaptively extract proper amount of information from them and build comprehensive label-specific document representation.
More specifically, two weight vectors (α, β ∈ R l ) are introduced to determine the importances of the above two mechanisms, which can obtained by a fully connected layer on the input M (s) and M (l) .
are the parameters to be trained. α j and β j indicate the importances of self-attention and label-attention to construct the final document representation along the j-th label respectively. Therefore, we add the constraint on them: Then, we can obtain the final document representation along the j-th label based on fusion weights as follows.
The label-specific document representation along all labels can be described as a matrix M ∈ R l×2k .

Label Prediction
Once having the comprehensive label-specific document representation, we can build the multilabel text classifier via a multilayer perceptron with two fully connected layers. Mathematically, the predicted probability of each label for the coming document can be estimated viâ Here W 5 ∈ R b×2k , W 6 ∈ R b are the trainable parameters of the fully connected layer and output layer respectively. f is the ReLU nonlinear activation function. The sigmoid function is used to transfer the output value into a probability, in this case, the cross-entropy loss can be used as the loss function which has been proved suitable for multilabel text classification task (Nam et al., 2014).
(yij log(ŷij)) where N is the number of training documents, l is the number of labels,ŷ ij ∈ [0, 1] is the predicted probability, and y ij ∈ {0, 1} indicates the ground truth of the i-th document along the j-th label.

Experiments
In this section, we evaluate the proposed model on four datasets (with various number of labels from 54 to 3956) by comparing with the state-of-the-art methods in terms of widely used metrics.

Experimental Setting
Datasets: There are several multi-label text datasets, however, only few of them have label text information. Thus, in this paper, four benchmark multi-label datasets including three small-scale datasets (RCV1, AAPD and EUR-Lex) and one medium-scale dataset (KanShan-Cup) are used to construct the experiments.
• AAPD 2 (Yang et al., 2018) collects the abstract and the corresponding subjects of 55840 papers from arXiv in the filed of computer science. For the first three data sets, only last 500 words were kept for each document, while the last 50 N is the number of training instances, M is the number of test instances, D is the total number of words, L is the total number of classes,L is the average number of labels per document, L is the average number of documents per label,W is the average number of words per document in the training set,W is the average number of words per document in the testing set.
words were used in KanShan-Cup dataset. Once the document has less than the predifined number of words, we extend it by padding zeros. All methods are trained and tested on the given training and testing datasets which are summarized in Table 1. Evaluation Metrics: We use two kinds of metric, precision at top K (P @k) and the Normalized Discounted Cumulated Gains at top K (nDCG@k) to evaluate the prediction performance. P @k and nDCG@k are defined according to the predicted score vectorŷ ∈ R l and the ground truth label vector y ∈ {0, 1} l as follows.
where rank k (y) is the label indexes of the top k highest scores of the current prediction result.
y 0 counts the number of relevant labels in the ground truth label vector y. Baseline Models: The proposed LSAN is a deep neural network model, thus the recent state-of-theart deep learning-based MLTC methods are selected as baselines.
• XML-CNN: (Liu et al., 2017) adopts Convolutional Neural Network (CNN) and a dynamic pooling technique to extract high-level feature for multi-label text classification. • SGM: (Yang et al., 2018) applies a sequence generation model from input document to output label to construct the multi-label text classifier. • DXML:  tries to explore the label correlation by considering the label structure from the label co-occurrence graph.
• AttentionXML: (You et al., 2018) builds the label-aware document representation only based on the document content, thus, it can be taken as one special case of our proposed LSAN with arbitrarily setting α = 0.
• EXAM: (Du et al., 2018) is the most similar work to LSAN because both of them exploit the label text to learn the interaction between words and labels. However, EXAM suffers from the situation where different labels have similar text.

Comparison Results and Discussion
In this section, the proposed LSAN is evaluated on four benchmark datasets by comparing with five baselines in terms of P @K and nDCG@K(K = 1, 3, 5). Table 2 and Table 3 show the averaged performance of all test documents. According to the formula of P @K and nDCG@K, we know P @1 = nDCG@1, thus only nDCG@3 and nDCG@5 are listed in Table 3. In each line, the best result is marked in bold. From Table 2 and 3, we can make a number of observations about these results. Firstly, XML-CNN is worse than other four methods because it only considers the document content but ignores the label correlation which has been proven very important for multi-label learning. Secondly, At-tentionXML is superior to EXAM on datasets R-CV1 and Kanshan-Cup, because these two datases have hierarchical label structures. In this case, parent label and child label may contain similar text, which makes them hard to distinguish according to the text-based embedding and further reduce the performance of EXAM. By compar-    ing with EXAM and the proposed LSAN, however, AttentionXML performs worse on EUR-Lex dataset The main reason is that AttentionXML only focuses on the document content, which will make it not sufficiently trained once there are only few documents in some labels. Fortunately, EX-AM and LSAN benefit from the label texts. Last one, as expected, is that LSAN consistently outperforms all baselines on all experimental datasets. This result further confirms that the proposed adaptive attention fusion strategy is much helpful to learn the label-specific document representation for multi-label text classification.

Comparison on Sparse Data
In order to verify the performance of LSAN on low-frequency labels, we divided labels in EUR-Lex into three groups according to their occurring frequency. Figure 2 shows the distribution of label frequency on EUR-Lex, F is the frequency of label. Among it, nearly 55% of labels occur between 1 and 5 times to form the first label group (Group1). The labels appearing 5-37 times are assigned into Group2, which is 35% of the whole label set. The remaining 10% frequent labels form the last group (Group3). Obviously, Group1 is much harder than other two groups due to the lack of training documents. Figure 3 shows the prediction results in terms of P @1, P @3 and P @5 obtained by AttentionXML, EXAM and LSAN. Three methods become better and better from Group1 to Group3, which is reasonable because more and more documents are included in each label from Group1 to Group3. L-SAN significantly improves the prediction performance on Group1. Especially, LSAN obtains an average of more than 83.82%, 182.55%, 244.62% gain on three metrices for group 1 to Attention-XML, and 3.85%, 27.19%, 58.27% gain to EX-AM. This result demonstrates the superiority of the proposed model on multi-label text data with tail labels.

Ablation Test
The proposed LSAN can be taken as a joint attention strategy including three parts. One is selfattention based on document content (denoted as A). The second one is label-attention based on label text (denoted as L). Another one is fusion attention by adaptively integrating A and L with proper weights (denoted as W). In this section, we try to demonstrate the effection of each component via an ablation test. Figure 4 lists the prediction results on four datasets in terms of P @1, P @3 and P @5. (i.e., S+L gets better results than S and L). S prefers to finding the useful content when constructing the label-specific document representation, but it ignores the label information. L takes adantage of label text to explicitly determine the semantic relation between documents and labels, however, label text is not easy to distinguish the difference between labels (e.g., Management vs. Management movies). Therefore, coupling with both attentions is really reasonable. Furthermore, adaptively extracting proper amount of information from these two attentions benefits the final multi-label text classification.
To further verify the effectiveness of attention adaptive fusion, Figure 5 lists the distribution of weights on Self-attention and Label-attention on two representative datasets, one for sparse data (EUR-Lex) and the other for dense data (AAPD). As expected, the label-attention is much more useful than self-attention for sparse data, vice versa for dense data. In dense data, each label has sufficient documents, therefore, self-attention can sufficiently obtain label-specific document representation. On the other hand, label text is helpful to extract the semantic relations between labels and documents. Results on other two datasets have the similar trends which are omitted due to page limitation.
For investigating the effect of label-attention, we visualize the attention weights on the original document using heat map, as shown in Figure 6. Among it, the example AAPD document belongs to two categories Computer Vision and Neural and Evolutionary Computing. From the attention weights, we can see that each category has its own most related words, which confirms that the proposed label specific attention network is able to extract the label-aware content and further construct label-specific document representation.

Related work
In the line of MLTC, most works focus on two issues, one is document representation learning and Figure 5: Weight distribution for two components on EUR-Lex (left subfigure) and AAPD (right subfigure). Horizontal axis is the range of weight from 0 to 1 with 0.1 gap. Vertical axis is the frequency that the specific range occurs in current label group. the other is label correlation detection.
For document representation, along with the recent success in CNN, many works are proposed based on CNN (Kim, 2014;Liu et al., 2017;Chen et al., 2017), which can capture the local correlations from the consecutive context windows. Although they obain promising results, these methods suffer from the limitation of window size so that they cannot determine the long-distance dependency of text. Meanwhile, they treat all words equally no matter how noisy the word is. Later, RNN and attention mechanism are introduced to get brilliant results (Yang et al., 2016). To implicitly learn the document representation for each label, the self-attention mechanism (Lin et al., 2017) is adopted for multi-label classification (You et al., 2018).
To determine the label correlation among multilabel data, in literatures, researchers proposed various methods. Kurata et al. (2016) adopt an initialization method to leverage label co-occurrence information. SLEEC (Bhatia et al., 2015) divides dataset into several clusters, and in each cluster it detects embedding vectors by capturing non-linear label correlation. DXML  establishes an explicit label co-occurrence graph to explore label embedding in low-dimension laten-t space. Yang et al. (2018) use sequence-tosequence(Seq2Seq) model to consider the correlations between labels. Recently, the textual information of labels are used to guide MLTC. EX-AM (Du et al., 2018) introduces the interaction mechanism to incorporate word-level matching signals into the text classification task. GILE (Pappas and Henderson, 2019) proposes a joint inputlabel embedding model for neural text classification. Unfortunately, they cannot work well when there is no big difference between label texts.

Conclusions and Future Work
A new label-specific attention network, in this paper, is proposed for multi-label text classification. It makes use of document content and label text to learn the label-specific document representation with the aid of self-attention and label-attention mechanisms. An adaptive fusion is designed to effectively integrate these two attention mechanisms to improve the final prediction performance. Extensive experiments on four benchmark datasets prove the superiority of LSAN by comparing with the state-of-the-art methods, especially on the dataset with large subset of low-frequency labels.
In real applications, more precious information can be collected, such as label description, la-bel topology (e.g., hierarchical structure) and etc. Therefore, it is interesting to extend the current model with such extra information.