A Deep Reinforced Sequence-to-Set Model for Multi-Label Classification

Multi-label classification (MLC) aims to predict a set of labels for a given instance. Based on a pre-defined label order, the sequence-to-sequence (Seq2Seq) model trained via maximum likelihood estimation method has been successfully applied to the MLC task and shows powerful ability to capture high-order correlations between labels. However, the output labels are essentially an unordered set rather than an ordered sequence. This inconsistency tends to result in some intractable problems, e.g., sensitivity to the label order. To remedy this, we propose a simple but effective sequence-to-set model. The proposed model is trained via reinforcement learning, where reward feedback is designed to be independent of the label order. In this way, we can reduce the dependence of the model on the label order, as well as capture high-order correlations between labels. Extensive experiments show that our approach can substantially outperform competitive baselines, as well as effectively reduce the sensitivity to the label order.


Introduction
Multi-label classification (MLC) aims to assign multiple labels to each sample. It can be applied in many real-world scenarios, such as text categorization (Schapire and Singer, 2000) and information retrieval (Gopal and Yang, 2010). Due to the complex dependency between labels, a key challenge for the MLC task is how to effectively capture high-order correlations between labels (Zhang and Zhou, 2014).
To capture high-order correlations between labels, one line of research focuses on exploring the hierarchical structure of the label space (Prabhu and Varma, 2014; Jernite et al., 2017; Peng et al., 2018; Singh et al., 2018), while another line strives to extend specific learning algorithms (Zhang and Zhou, 2006; Baker and Korhonen, 2017; Liu et al., 2017). However, most of this work tends to result in intractable computational costs (Chen et al., 2017).
The code is available at https://github.com/lancopku/Seq2Set.
Recently, based on a pre-defined label order, Nam et al. (2017) and Yang et al. (2018) succeeded in applying the sequence-to-sequence (Seq2Seq) model to the MLC task, showing its powerful ability to capture high-order label correlations and achieving excellent performance. However, the Seq2Seq model suffers from some thorny flaws on the MLC task. The output labels are essentially an unordered set with swapping-invariance (swapping any two elements leaves the set unchanged), rather than an ordered sequence. This inconsistency usually leads to some intractable problems, e.g., sensitivity to the label order. Previous work (Vinyals et al., 2016) has shown that the order has a great impact on the performance of the Seq2Seq model. Therefore, the performance of the classifier is sensitive to the pre-defined label order. Besides, even if the model accurately predicts all true labels, it may still incur an unreasonable training loss because the order of its output is inconsistent with the pre-defined label sequence.
Therefore, in this work, we propose a simple but effective sequence-to-set model, which aims at alleviating the dependence of the model on the label order. Instead of maximizing the log-likelihood of pre-defined label sequences, we apply reinforcement learning (RL) (Sutton et al., 1999) to guide the model training. The designed reward not only comprehensively evaluates the quality of the output labels, but also satisfies swapping-invariance of the set, which leads to a reduction in the dependence of the model on the label order.
The main contributions of this paper are summarized as follows:
• We propose a simple but effective sequence-to-set (Seq2Set) model based on reinforcement learning, which not only captures the correlations between labels, but also alleviates the dependence on the label order.
• Experimental results show that our Seq2Set model can outperform baselines by a large margin. Further analysis demonstrates that our approach can effectively reduce the sensitivity of the model to the label order.

Overview
Here we define some necessary notations and describe the MLC task. Given a text sequence x containing m words, the MLC task aims to assign a subset y containing n labels in the total label set Y to x. From the perspective of sequence learning, once the order of output labels is pre-defined, the MLC task can be regarded as the generation of target label sequence y conditioned on the source text sequence x.

Neural Sequence-to-Set Model
Our proposed Seq2Set model consists of an encoder E and a set decoder D, which are introduced in detail as follows.
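Encoder E: We implement the encoder E as a bidirectional LSTM. Given the input text $x = (x_1, \cdots, x_m)$, the encoder computes the hidden states of each word as follows:
$$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}\big(\overrightarrow{h}_{i-1}, e(x_i)\big) \tag{1}$$
$$\overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}\big(\overleftarrow{h}_{i+1}, e(x_i)\big) \tag{2}$$
where $e(x_i)$ is the embedding of $x_i$. The final representation of the $i$-th word is $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, where semicolon denotes vector concatenation.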
Set decoder D: Owing to the strong ability of the LSTM to model sequence dependency, we also implement D as an LSTM model to capture high-order correlations between labels. In particular, the hidden state $s_t$ of the set decoder D at time-step $t$ is computed as:
$$s_t = \mathrm{LSTM}\big(s_{t-1}, [e(y_{t-1}); c_t]\big) \tag{3}$$
where $[e(y_{t-1}); c_t]$ denotes the concatenation of vectors $e(y_{t-1})$ and $c_t$, $e(y_{t-1})$ is the embedding of the label $y_{t-1}$ generated at the last time-step, and $c_t$ is the context vector obtained by the attention mechanism. Readers can refer to Bahdanau et al. (2015) for more details. Finally, the set decoder D samples a label $y_t$ from the output probability distribution, which is computed as follows:
$$o_t = W_2 f(W_1 s_t + U c_t) \tag{4}$$
$$y_t \sim \mathrm{softmax}(o_t + I_t) \tag{5}$$
where $W_1$, $W_2$, and $U$ are trainable parameters, $f$ is a nonlinear activation function, and $I_t \in \mathbb{R}^{|Y|}$ is the mask vector that prevents D from generating repeated labels:
$$(I_t)_i = \begin{cases} -\infty & \text{if the } i\text{-th label has been predicted at the previous } t-1 \text{ time-steps,} \\ 0 & \text{otherwise.} \end{cases}$$
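The role of the mask vector can be illustrated with a minimal sketch. It assumes the mask is applied by adding negative infinity to the scores of already generated labels before the softmax; the helper names `masked_softmax` and `sample_label_set` and the fixed per-step scores are hypothetical and stand in for the decoder outputs, which in the actual model depend on the previously generated labels.

```python
import math
import random

def masked_softmax(logits, generated):
    """Softmax over o_t + I_t, where I_t is -inf for labels already generated."""
    masked = [-math.inf if i in generated else o for i, o in enumerate(logits)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_label_set(step_logits, rng):
    """Sample one label per step; labels picked earlier are masked out.

    Assumes fewer steps than labels, so at least one label is always available.
    """
    generated = []
    for logits in step_logits:
        probs = masked_softmax(logits, set(generated))
        label = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        generated.append(label)
    return generated

# Hypothetical scores for 4 labels over 3 decoding steps.
rng = random.Random(0)
labels = sample_label_set([[2.0, 1.0, 0.5, 0.1]] * 3, rng)
```

Because masked labels receive zero probability, the sampled sequence never contains a repeated label, so it can be read directly as a set.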

MLC as a RL Problem
In order to alleviate the dependence of the model on the label order, here we model the MLC task as an RL problem. Our set decoder D can be viewed as an agent, whose state at time-step $t$ is the sequence of labels generated so far, $(y_1, \cdots, y_{t-1})$. A stochastic policy defined by the parameters of D decides the action, which is the prediction of the next label. Once a complete label sequence $y$ is generated, the agent D observes a reward $r$. The training objective is to minimize the negative expected reward:
$$L(\theta) = -\mathbb{E}_{y \sim p_\theta}[r(y)] \tag{6}$$
where $\theta$ refers to the model parameters. In our model, we use the self-critical policy gradient algorithm (Rennie et al., 2017). For each training sample in the minibatch, the gradient of Eq. (6) can be approximated as:
$$\nabla_\theta L(\theta) \approx -\big[r(y^s) - r(y^g)\big] \nabla_\theta \log p_\theta(y^s) \tag{7}$$
where $y^s$ is the label sequence sampled from the probability distribution $p_\theta$ and $y^g$ is the label sequence generated with the greedy search algorithm. $r(y^g)$ in Eq. (7) is the baseline, which aims to reduce the variance of the gradient estimate and to enhance the consistency between model training and testing, thereby alleviating exposure bias (Ranzato et al., 2016).
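As an illustration of the self-critical update, the sketch below computes a surrogate loss for one training sample, assuming the usual form from Rennie et al. (2017): the reward of the sampled sequence minus the reward of the greedy sequence, multiplied by the negative log-probability of the sample. The function names and the toy numbers are hypothetical; a real implementation would obtain the gradient through automatic differentiation.

```python
import math

def self_critical_loss(sample_log_probs, reward_sample, reward_greedy):
    """Surrogate loss: -(r(y^s) - r(y^g)) * log p(y^s).

    The advantage is treated as a constant, so only the log-probabilities
    of the sampled labels carry gradient.
    """
    advantage = reward_sample - reward_greedy
    return -advantage * sum(sample_log_probs)

def grad_wrt_logits(probs, action, advantage):
    """Gradient of -advantage * log softmax(logits)[action] w.r.t. the logits,
    for a single decoding step (one-hot of the action minus the probabilities)."""
    return [-advantage * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

# Toy example: the sample earns a higher reward than the greedy output,
# so gradient descent raises the probability of the sampled label.
log_probs = [math.log(0.5), math.log(0.25)]
loss = self_critical_loss(log_probs, reward_sample=0.8, reward_greedy=0.4)
grad = grad_wrt_logits([0.5, 0.25, 0.25], action=1, advantage=0.4)
```

When the sampled sequence scores no better than the greedy one, the advantage is zero or negative, and the update either vanishes or pushes probability away from the sample.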

Reward Design
Table 1: Performance of different systems. "HL", "0/1 Loss", "F1", "Precision", and "Recall" denote hamming loss, subset zero-one loss, micro-F1, micro-precision, and micro-recall, respectively. "+" indicates higher is better and "-" is opposite. The best performance is highlighted in bold.
The ideal reward is supposed to be a good measure of the quality of the generated labels. Besides, in order to free the model from the strict restriction of label order, it should also satisfy swapping-invariance of the output label set. Motivated by this, we design the reward $r$ as the F1 score calculated by comparing the generated labels $y$ with the ground-truth labels $y^*$:
$$r(y) = F_1(y, y^*)$$
We also tried other reward designs, such as hamming accuracy. Results show that the reward based on the F1 score gives the best overall performance.
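The swapping-invariance of an F1-based reward can be checked with a short sketch. It assumes the reward compares the generated and ground-truth labels as sets; the function names and label names are hypothetical, and the position-wise comparison is included only as an order-sensitive contrast.

```python
from itertools import permutations

def f1_reward(predicted, gold):
    """F1 between two label collections, compared as sets (order is ignored)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def positionwise_accuracy(predicted, gold_sequence):
    """Order-sensitive comparison against a fixed sequence, for contrast only."""
    hits = sum(p == g for p, g in zip(predicted, gold_sequence))
    return hits / len(gold_sequence)

gold = ["economics", "markets", "sports"]  # hypothetical label names
# Every ordering of the correct labels earns the same reward ...
rewards = {f1_reward(p, gold) for p in permutations(gold)}
# ... while a position-wise comparison varies with the ordering.
accuracies = {positionwise_accuracy(p, gold) for p in permutations(gold)}
```

Here `rewards` contains a single value, whereas `accuracies` contains several, which mirrors the motivation for a reward that does not depend on the label order.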

Datasets
We conduct experiments on the RCV1-V2 corpus (Lewis et al., 2004), which consists of a large number of manually categorized newswire stories. The total number of labels is 103. We adopt the same data split as Yang et al. (2018).

Settings
We tune hyper-parameters on the validation set based on the micro-F1 score. The vocabulary size is 50,000 and the batch size is 64. We set the embedding size to 512. Both the encoder and the set decoder are 2-layer LSTMs with a hidden size of 512, but the former is bidirectional. We pre-train the model for 20 epochs via the MLE method. The optimizer is Adam (Kingma and Ba, 2015) with a learning rate of $10^{-3}$ for pre-training and $10^{-5}$ for RL. Besides, we use dropout (Srivastava et al., 2014) to avoid overfitting and clip the gradients (Pascanu et al., 2013) to a maximum norm of 8.

Baselines
We compare our approach with the following competitive baselines:
• BR-LR (Boutell et al., 2004) amounts to independently training one binary classifier (logistic regression) for each label.
• PCC-LR (Read et al., 2011) transforms the MLC task into a chain of binary classification (logistic regression) problems.
• FastXML (Prabhu and Varma, 2014) learns a hierarchy of training instances and optimizes the objective at each node of the hierarchy.
• Seq2Seq (Nam et al., 2017; Yang et al., 2018) adapts the Seq2Seq model to perform multi-label classification.

Evaluation Metrics
The evaluation metrics include: subset zero-one loss, the fraction of instances whose predicted label set does not exactly match the ground-truth set; hamming loss, the fraction of wrongly predicted labels among all labels; and micro-F1, the F1 score computed from the true positives, false positives, and false negatives aggregated over all classes. Micro-precision and micro-recall are also reported for reference.
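For concreteness, the three main metrics can be sketched as follows. This is a minimal illustration under the standard definitions, with micro-F1 computed from counts pooled over all instances and classes; the function names and toy predictions are hypothetical and are not taken from the released code.

```python
def hamming_loss(preds, golds, num_labels):
    """Fraction of wrongly predicted labels among all instance-label pairs."""
    wrong = sum(len(set(p) ^ set(g)) for p, g in zip(preds, golds))
    return wrong / (len(golds) * num_labels)

def subset_zero_one_loss(preds, golds):
    """Fraction of instances whose predicted set is not exactly the gold set."""
    return sum(set(p) != set(g) for p, g in zip(preds, golds)) / len(golds)

def micro_f1(preds, golds):
    """F1 from true positives and set sizes pooled over all instances."""
    tp = sum(len(set(p) & set(g)) for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision = tp / sum(len(set(p)) for p in preds)
    recall = tp / sum(len(set(g)) for g in golds)
    return 2 * precision * recall / (precision + recall)

# Toy data: 3 instances, 4 possible labels (indices 0..3).
preds = [[0, 1], [2], [1, 3]]
golds = [[1, 0], [2, 3], [1]]
hl = hamming_loss(preds, golds, num_labels=4)
zero_one = subset_zero_one_loss(preds, golds)
f1 = micro_f1(preds, golds)
```

All three functions compare label sets, so the order in which a model emits its labels does not affect the scores.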

Results and Discussion
Here we conduct an in-depth analysis on the model and experimental results. For simplicity, we use BR to represent the baseline BR-LR.

Experimental Results
The comparison between our approach and all baselines is presented in Table 1. The proposed Seq2Set model outperforms all baselines by a large margin on all evaluation metrics. Compared to BR, which completely ignores the label correlations, our Seq2Set model achieves a 12.05% reduction in hamming loss. This shows that modeling high-order label correlations can largely improve results. Compared to Seq2Seq, which imposes strict requirements on the label order, our Seq2Set model achieves a 3.95% reduction in hamming loss on the RCV1-V2 dataset. This indicates that our approach can achieve substantial improvements by reducing the dependence of the model on the label order.

Reducing Sensitivity to Label Order
To verify that our approach can reduce the sensitivity to the label order, we randomly shuffle the order of the label sequences. Table 2 presents the performance of various models on the label-shuffled RCV1-V2 dataset. Results show that for the shuffled label order, BR is not affected, but the performance of Seq2Seq declines drastically. The reason is that the decoder of Seq2Seq is essentially a conditional language model. It relies heavily on a reasonable label order to model the intrinsic association between labels, while the labels in this case are unordered. In contrast, our model's performance on subset zero-one loss drops by only 1.2%, while that of Seq2Seq drops by 9.3%. This shows that our Seq2Set model is more robust to disturbances in the label order. Our model is trained via reinforcement learning and the reward feedback is independent of the label order, which reduces sensitivity to the label order.

Improving Model Universality
The labels in the RCV1-V2 dataset exhibit a long-tail distribution. However, in real-world scenarios, there are other common label distributions, e.g., the uniform distribution (Lin et al., 2018a). Therefore, here we analyze the universality of the Seq2Set model, that is, whether it achieves stable improvements in performance under different label distributions. In detail, we remove the k most frequent labels in turn on the RCV1-V2 dataset and perform the evaluation on the remaining labels. The larger the k, the more uniform the label distribution. Figure 1 shows changes in the performance of different systems.
First, as the number of removed high-frequency labels increases, the performance of all methods deteriorates. This is reasonable because predicting low-frequency labels is relatively difficult. However, compared to other methods, the performance of the Seq2Seq model degrades much more sharply. We suspect this is because it is difficult to define a reasonable order for uniformly distributed labels, while Seq2Seq imposes strict requirements on the label order. This conflict may damage performance. In contrast, as shown in Figure 1, the advantage of Seq2Set over Seq2Seq continues to grow as more labels are removed. This illustrates that our Seq2Set model has excellent universality and works for different label distributions. Our approach not only has the ability of Seq2Seq to capture label correlations, but also alleviates the strict requirements of Seq2Seq for label order via reinforcement learning. This sidesteps the difficulty of pre-defining a reasonable label order under a uniform distribution, leading to excellent universality.
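The label-removal procedure used in this analysis can be sketched as follows. This is only an illustration: the function name and the toy label sets are hypothetical, and how instances left with no labels are treated is an assumption (here they are kept with an empty label list).

```python
from collections import Counter

def remove_top_k_labels(label_sets, k):
    """Drop the k most frequent labels from every instance's label list."""
    counts = Counter(label for labels in label_sets for label in labels)
    top_k = {label for label, _ in counts.most_common(k)}
    # Assumption: instances whose labels are all removed are kept as empty lists.
    return [[label for label in labels if label not in top_k]
            for labels in label_sets]

# Toy label sets with a skewed distribution: "a" is the most frequent label.
label_sets = [["a", "b"], ["a", "c"], ["a", "b", "d"]]
without_top1 = remove_top_k_labels(label_sets, k=1)
without_top2 = remove_top_k_labels(label_sets, k=2)
```

As k grows, the remaining label counts become closer to each other, which corresponds to the more uniform distributions examined in Figure 1.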

Error Analysis
We find that all methods perform poorly when predicting low-frequency (LF) labels compared to high-frequency (HF) labels. This is reasonable because samples assigned LF labels are sparse, making it hard for the model to learn an effective pattern to make predictions. Figure 2 shows the results of different methods on HF labels and LF labels. However, compared to other systems, our proposed Seq2Set model achieves better performance on both LF labels and HF labels. Besides, the relative improvements achieved by our approach are greater on LF labels than HF labels. In fact, the distribution of LF labels is relatively more uniform. As analyzed in Section 4.3, the label order problem is more serious in the uniform distribution. Our Seq2Set model can reduce the dependence on the label order via reinforcement learning, leading to larger improvements in performance on the LF labels.
Figure 2: Performance of different systems on the HF labels and LF labels. "Impv-BR" and "Impv-Seq2Seq" denote the improvement of our model compared to BR-LR and Seq2Seq, respectively.

Related Work
Multi-label classification (MLC) aims to assign multiple labels to each sample in the dataset. Early work on exploring the MLC task focuses on machine learning algorithms, mainly including problem transformation methods and algorithm adaptation methods. Problem transformation methods, such as BR (Boutell et al., 2004), LP (Tsoumakas and Katakis, 2006) and CC (Read et al., 2011), aim at mapping the MLC task into multiple single-label learning tasks. Algorithm adaptation methods strive to extend specific learning algorithms to handle multi-label data directly. The corresponding representative work includes ML-DT (Clare and King, 2001), Rank-SVM (Elisseeff and Weston, 2001), ML-KNN (Zhang and Zhou, 2007), and so on. In addition, some other methods, including ensemble methods (Tsoumakas et al., 2011) and joint training (Li et al., 2015), can also be used for the MLC task. However, they can only be used to capture the first- or second-order label correlations (Chen et al., 2017), or are computationally intractable when high-order label correlations are considered.
In recent years, some neural network models have also been successfully used for the MLC task. For instance, the BP-MLL proposed by Zhang and Zhou (2006) applies a fully-connected network and the pairwise ranking loss to perform classification. Nam et al. (2013) further replace the pairwise ranking loss with the cross-entropy loss function. Kurata et al. (2016) present an initialization method to model label correlations by leveraging neurons. Chen et al. (2017) present an ensemble approach of CNN and RNN so as to capture both global and local semantic information. Liu et al. (2017) use a dynamic max pooling scheme and a hidden bottleneck layer for better representations of documents. Graph convolution operations are employed by Peng et al. (2018) to capture non-consecutive and long-distance semantics. The two milestones are Nam et al. (2017) and Yang et al. (2018), both of which utilize the Seq2Seq model to capture the label correlations. Going a step further, Lin et al. (2018b) propose a semantic-unit-based dilated convolution model and Zhao et al. (2018) present a label-graph based neural network equipped with a soft training mechanism to capture label correlations. Most recently, Qin et al. (2019) propose new training objectives based on set probability to effectively model the mathematical characteristics of the set.

Conclusion
In this work, we present a simple but effective sequence-to-set model based on reinforcement learning, which aims to reduce the stringent requirements of the sequence-to-sequence model for label order. The proposed model not only captures high-order correlations between labels, but also reduces the dependence on the order of output labels. Experimental results show that our Seq2Set model can outperform competitive baselines by a large margin. Further analysis demonstrates that our approach can effectively reduce the sensitivity to the label order.