Zero-shot Text Classification via Reinforced Self-training

Zero-shot learning is a challenging problem because no labeled data is available for unseen classes during training, especially when classes share low similarity. In this situation, transferring from seen classes to unseen classes is extremely hard. To tackle this problem, we propose a self-training based method to efficiently leverage unlabeled data. Traditional self-training methods use fixed heuristics to select instances from unlabeled data, and their performance varies across datasets. We propose a reinforcement learning framework to learn the data selection strategy automatically and provide more reliable selection. Experimental results on both benchmarks and a real-world e-commerce dataset show that our approach significantly outperforms previous methods in zero-shot text classification.

classes to unseen classes. These methods assume that semantically similar classes share similar image features; however, they may fail in cases where classes share low similarity.
This problem becomes even more salient in typical NLP tasks such as text classification. For example, consider a 10-class emotion classification task (Yin et al., 2019) in which the model is trained on the class "sadness" but makes predictions on instances from the class "joy". Most emotions are relatively independent: the way we express one emotion differs considerably from the way we express others. As a result, for an unseen class we can hardly find a similar class in the seen class set. Transferring from seen classes to unseen classes can be extremely hard because matching patterns that can be shared among classes are rare.
Essentially, ZSL methods aim to learn a matching model between the feature space and the semantic space, which correspond to the text and the label respectively in text classification. Matching patterns between text and label can be roughly divided into class-invariant patterns and class-specific ones. The former are shared among classes, while the latter depend on a certain class. Table 1 illustrates this definition. The string match between label and text, highlighted in red, indicates a simple matching pattern that can be shared among classes. By contrast, the words highlighted in blue indicate a matching pattern that is specific to a certain class and cannot easily be transferred among classes. If the model is trained on sentence 1, it can make a correct prediction on sentence 2 but will probably fail on sentence 3.
Table 1: Example sentences with class-invariant and class-specific matching patterns.

Label   Sentence
fear    1. One day, when I realized that I was alone, I felt fear of loneliness.
guilty  2. I felt guilty when I lied to my parents.
guilty  3. I wished secretly and lied to a friend because I didn't want her to stay in my house.

There are mainly two ways to deal with this troublesome zero-shot learning situation: (1) integrating more external knowledge to better describe classes and build more sophisticated connections between them (Rios and Kavuluru, 2018; Zhang et al., 2019); (2) integrating unlabeled data to improve generalization performance. Existing works mainly adopt the former solution, while little attention is paid to the latter. In this paper, we focus on the latter and propose a self-training based method to leverage unlabeled data. The basic idea of self-training (McClosky et al., 2006; Sagae, 2010) is to select unlabeled instances that are predicted with high confidence and add them to the training set.
It is natural to expect that if we add sentence 2 to the training set, the model becomes capable of learning the class-specific pattern, since sentences 2 and 3 share intra-class similarity. In this way, we can mine class-specific features through class-invariant features. However, directly applying traditional self-training to zero-shot learning raises some problems: (1) traditional self-training methods use manually designed heuristics to select data, so manual adjustment of the selection strategy is costly (Chen et al., 2018); (2) due to the severe domain shift (Fu et al., 2015), traditional self-training may not provide reliable selection. To alleviate these problems, we present a reinforcement learning framework to learn a data selection policy, which selects unlabeled data automatically and provides more reliable selection.
The contributions of our work can be summarized as follows: • We propose a self-training based method to leverage unlabeled data in zero-shot text classification. Our method alleviates the domain shift problem and enables transfer between classes that share low similarity and few connections.
• We propose a reinforcement learning framework to learn data selection policy automatically instead of using manually designed heuristics.
• Experimental results on both benchmarks and a real-world e-commerce dataset show that our method outperforms previous methods by a large margin of 15.4% and 5.4% on average in generalized and non-generalized ZSL respectively.
2 Related Work

Zero-shot Learning
Zero-shot learning has been widely studied in image classification, in which training classes and testing classes are disjoint (Lampert et al., 2013; Larochelle et al., 2008; Rohrbach et al., 2011). The general idea of zero-shot learning is to transfer knowledge from seen classes to unseen classes (Wang et al., 2019). Most methods focus on learning a matching model between the image feature space and the class semantic space, such as visual attributes (Lampert et al., 2009), word embeddings of class names (Socher et al., 2013), and class hierarchy (Socher et al., 2013). Similar methods have been adopted for zero-shot text classification. (Dauphin et al., 2013) associated text with class labels through a semantic space, which is learned by deep neural networks trained on large amounts of search engine query log data. (Nam et al., 2016) proposed an approach to embed text and label into a joint space while sharing word representations between text and label. (Pushp and Srivastava, 2017) proposed three neural networks to learn the relationship between text and tags, trained on a large text corpus. (Rios and Kavuluru, 2018) incorporated word embeddings and hierarchical class structure using a GCN (Kipf and Welling, 2016) for multi-label zero-shot medical records classification. (Zhang et al., 2019) proposed a two-phase framework together with data augmentation and feature augmentation, in which four kinds of semantic knowledge (word embeddings, class descriptions, class hierarchy, and knowledge graph) were incorporated.
These works benefit from large training corpora and external semantic knowledge; however, none of them has tried to leverage unlabeled unseen-class data in zero-shot text classification, namely transductive zero-shot learning (Xian et al., 2018). Some work utilizes unlabeled data in image classification to alleviate the domain shift problem, including (Fu et al., 2012; Rohrbach et al., 2013; Li et al., 2015; Fu et al., 2015). As far as we know, our work is the first to explore transductive zero-shot learning in text classification.

Self-training
Self-training is a widely used algorithm in semi-supervised learning (Triguero et al., 2015). The basic process of self-training is to iteratively select high-confidence data from unlabeled data and add these pseudo-labeled data to the training set. Self-training has shown its effectiveness for various natural language processing tasks, including text classification (Drury et al., 2011; Van Asch and Daelemans, 2016), named entity recognition (Kozareva et al., 2005), and parsing (McClosky et al., 2006, 2008; Huang and Harper, 2009). However, there are two main drawbacks of self-training. Firstly, its data selection strategy is simply confidence-based, which may not provide reliable selection (Chen et al., 2011) and causes error accumulation. Secondly, self-training relies on a pre-defined confidence threshold, which varies among datasets, and manual adjustment is costly.

Reinforcement Learning for Data Selection
There have been some works applying reinforcement learning to data selection in semi-supervised learning, including active learning (Fang et al., 2017), self-training (Chen et al., 2018), and co-training (Wu et al., 2018). These works share a similar framework that uses a deep Q-network (Mnih et al., 2015) to learn a data selection strategy guided by the performance change of the model. This process is time-consuming because the reward is immediate, which means the classifier is retrained and evaluated after each instance is selected. Reinforcement learning has also been applied in relation extraction to alleviate the noisy-label problem caused by distant supervision: (Feng et al., 2018; Qin et al., 2018) proposed a policy network to automatically identify wrongly-labeled instances in the training set. Earlier, (Fan et al., 2017) proposed an adaptive data selection strategy, enabling the model to dynamically choose different data at different training stages.

3 Methodology

Problem Formulation and Overview
Here we first formalize the zero-shot text classification problem. Let Y^s and Y^u denote the seen and unseen class sets respectively, where Y^s ∩ Y^u = ∅. Labeled data D^s = {(x_i, y_i)} is available only for seen classes, where x_i represents the i-th text and y_i represents the corresponding label. As shown in Figure 1, a ZSL method turns a classification problem into a matching problem between text and class label. During training, we learn a matching model f(x, y; θ) from the seen class data D^s and then make predictions on unseen classes:

ŷ = argmax_{y ∈ Y^u} f(x, y; θ)

where θ refers to the parameters of f. For transductive ZSL, both the labeled seen data D^s and the unlabeled unseen data D^u are available during training.

To tackle zero-shot text classification, we develop a reinforced self-training framework. Figure 2 shows an overview of the framework. Its goal is to let an agent automatically select high-quality data from unseen classes and use these data to improve the base matching model. Specifically, we first train the base matching model on seen class data and make predictions on unseen class data. To make selection more efficient, the agent selects from a subset of the unlabeled data instead of all unlabeled data at each iteration: we rank the instances by prediction confidence and take a certain ratio of instances at each iteration. The agent is responsible for selecting data from this subset and filtering out negative instances. The reward is determined by the performance of the matching model on the validation set. We introduce the details of our method in the following subsections.
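The iterative process described above (train, rank unlabeled instances by confidence, let the agent filter a top-ranked subset, add accepted pseudo-labeled instances) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `KeywordMatcher` and `ThresholdAgent` are toy stand-ins for the BERT matching model and the learned policy.

```python
class KeywordMatcher:
    """Toy stand-in for the matching model: the score is 1.0 if the
    label word occurs in the text, else 0.0 (a class-invariant pattern)."""
    def __init__(self, unseen_labels):
        self.unseen_labels = unseen_labels

    def fit(self, training_set):
        pass  # a real model would be retrained here

    def score(self, text, label):
        return 1.0 if label in text else 0.0


class ThresholdAgent:
    """Toy stand-in for the RL agent: accept above a confidence cutoff."""
    def __init__(self, cutoff=0.5):
        self.cutoff = cutoff

    def select(self, confidence, text):
        return confidence > self.cutoff


def reinforced_self_training(model, agent, seen_data, unlabeled_texts,
                             n_iterations=3, subset_ratio=0.5):
    training_set = list(seen_data)
    pool = list(unlabeled_texts)
    for _ in range(n_iterations):
        model.fit(training_set)
        # Pseudo-label each unlabeled text with its highest-scoring label.
        scored = []
        for x in pool:
            conf, y = max((model.score(x, y), y) for y in model.unseen_labels)
            scored.append((x, conf, y))
        # Rank by confidence and present only the top slice to the agent.
        scored.sort(key=lambda t: t[1], reverse=True)
        subset = scored[: max(1, int(len(scored) * subset_ratio))]
        accepted = [(x, y) for x, conf, y in subset if agent.select(conf, x)]
        training_set.extend(accepted)
        taken = {x for x, _ in accepted}
        pool = [x for x in pool if x not in taken]
    return training_set
```

A real agent would score the state (confidence plus the instance representation) rather than apply a fixed threshold; the loop structure is the same.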

The Base Matching Model
Our RL-based data selection framework is model-agnostic: any matching model is compatible. Here we adopt the widely used pre-trained model BERT (Devlin et al., 2018) as the base matching model. For seen classes, given text x and label y, we generate {(x, y') | y' ∈ Y^s} as training instances, in which (x, y') is a positive training instance if y' = y. We take the text as the premise and transform the label into its corresponding hypothesis provided in (Yin et al., 2019). The input sequence of BERT is therefore packed as "[CLS] x [SEP] hypothesis of y' [SEP]", where [CLS] and [SEP] are special start and separator tokens, as shown in Figure 3. The BERT encoder is composed of multi-layer bidirectional Transformers (Vaswani et al., 2017). We use the hidden vector c_{x,y'} ∈ R^H corresponding to [CLS] in the final layer as the aggregate representation. We add a linear layer to compute the matching score and train with a binary cross-entropy loss:

p_{x,y'} = σ(W c_{x,y'} + b)

L = − Σ_{(x,y')} [ 1(y' = y) log p_{x,y'} + (1 − 1(y' = y)) log(1 − p_{x,y'}) ]

where W and b are the parameters of the linear layer, W ∈ R^H, b ∈ R, H is the hidden dimension size, p_{x,y'} indicates the matching score between x and y', 1(·) is the indicator function, and σ(·) is the sigmoid function.
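As a sketch, the instance generation and input packing just described might look like the following. The hypothesis templates here are illustrative placeholders, not the exact templates from (Yin et al., 2019).

```python
def make_training_instances(text, gold_label, seen_labels, hypotheses):
    """Pair a seen-class text with every seen label y': the pair is a
    positive instance (label 1) iff y' equals the gold label, and each
    pair is packed as "[CLS] x [SEP] hypothesis of y' [SEP]" for BERT."""
    instances = []
    for label in seen_labels:
        packed = f"[CLS] {text} [SEP] {hypotheses[label]} [SEP]"
        instances.append((packed, 1 if label == gold_label else 0))
    return instances
```

In practice a BERT tokenizer would build the token IDs and segment IDs from the premise/hypothesis pair; the string form above just makes the packing explicit.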

Reinforcement Learning for Self-training
The conventional self-training method simply selects data predicted with high confidence, i.e., it is purely confidence-based. We formalize data selection as a sequential decision-making process and introduce an RL framework that combines the confidence-based strategy with a performance-driven strategy.
We describe the whole process in Algorithm 1. The details of the RL modules are described below.

State
For each text x, we obtain prediction scores {p_{x,y} | y ∈ Y^u}. The label y* with the maximum matching score is taken as the pseudo label. At time step t, the current state s_t consists of two parts: the prediction confidence p_{x,y*} and the representation of the arriving instance c_{x,y*}. We take the hidden vector corresponding to [CLS] as the representation of the current instance (x, y*). The policy network takes p_{x,y*} and c_{x,y*} as input and outputs the probability of selecting the instance.

Action
At each step, the agent takes an action for the current instance (x, y*): whether to select it or not. At time step t, a_t = 1 means the agent accepts the current instance and adds it to the training set; a_t = 0 means rejection. The action value is obtained by sampling from the policy network's output P(a | s_t).
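The sampling step can be written directly; this sketch assumes the policy outputs a single select-probability for the binary action.

```python
import random

def sample_action(p_select, rng=random):
    """Sample a_t from the policy output: 1 = accept the instance and
    add it to the training set, 0 = reject it."""
    return 1 if rng.random() < p_select else 0
```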

Reward
If wrongly-labeled instances are added to the training set, they degrade the performance of the matching model. The reward therefore guides the agent to select instances that are consistent with the training set. The reward is determined by the performance of the matching model on the validation set, which consists of two parts: the seen validation set D^s_dev and the unseen validation set D^u_dev. D^u_dev comes from the pseudo-labeled data, which guides newly-selected data to be consistent with previously-selected data. More specifically, after each batch of selection, we train the matching model on the selected instances and evaluate it on the validation set, using macro-F1 as the evaluation metric. Assuming there are N3 batches in one episode, we obtain two F1 sequences, F^s = {F^s_1, F^s_2, ..., F^s_N3} for the seen validation set and F^u = {F^u_1, F^u_2, ..., F^u_N3} for the unseen validation set. For batch k, the reward is formulated as:

r_k = λ · (F^s_k − μ(F^s)) / σ(F^s) + (1 − λ) · (F^u_k − μ(F^u)) / σ(F^u)

where λ controls the weight of the seen and unseen classes, and μ and σ represent the mean and standard deviation of an F1 sequence, respectively.
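Under the assumption that the normalization uses the per-episode mean and standard deviation as described, the batch rewards can be computed as:

```python
from statistics import mean, pstdev

def batch_rewards(f_seen, f_unseen, lam=0.5):
    """Normalized-F1 reward per selection batch: z-score each macro-F1
    sequence over one episode, then mix the seen and unseen parts with
    weight lam (the lambda in the reward formula)."""
    def z(scores):
        mu, sigma = mean(scores), pstdev(scores)
        # Guard against a constant sequence (zero standard deviation).
        return [(s - mu) / sigma if sigma > 0 else 0.0 for s in scores]
    zs, zu = z(f_seen), z(f_unseen)
    return [lam * a + (1 - lam) * b for a, b in zip(zs, zu)]
```

The z-scoring makes rewards comparable across episodes: a batch is rewarded for beating the episode's average F1, not for its absolute F1.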

Policy Network
We adopt a multi-layer perceptron (MLP) as the policy network. The policy network receives the state (the prediction confidence p_{x,y*} and the representation of the arriving instance c_{x,y*}) and outputs the probability of each action.
The MLP is computed as:

h_1 = ReLU(W_1 s_t + b_1)
h_2 = ReLU(W_2 h_1 + b_2)
P(a | s_t) = softmax(W_3 h_2)

We use ReLU as the activation function; W_1, W_2, W_3, b_1, b_2 are the parameters of the MLP, and P(a | s_t) is the probability distribution over actions.
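A minimal forward pass matching this parameterization (two ReLU layers followed by a softmax over the two actions) in plain Python, with toy weights standing in for learned parameters:

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def affine(W, x, b=None):
    """Matrix-vector product with optional bias, W given as a list of rows."""
    out = [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]
    if b is not None:
        out = [o + b_i for o, b_i in zip(out, b)]
    return out

def policy_forward(state, W1, b1, W2, b2, W3):
    """Two ReLU layers and a softmax over {select, reject}, matching
    the parameter list W1, W2, W3, b1, b2 above."""
    h1 = relu(affine(W1, state, b1))
    h2 = relu(affine(W2, h1, b2))
    logits = affine(W3, h2)
    m = max(logits)                      # stabilize the softmax
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The state vector here is the concatenation of the scalar confidence and the instance representation, as described in the State subsection.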

Optimization
To learn an optimal data selection policy, we aim to maximize the expected total reward:

J(φ) = E_{P_φ(a|s)} [ R(s, a) ]

where R(s, a) is the state-action value function and φ is the parameter of the policy network. We update φ via policy gradient (Sutton et al., 2000):

φ ← φ + η ∇_φ J(φ)

where η is the learning rate. For a batch B_k, we sample an action a_t for each state s_t according to the policy P_φ(a|s). After one episode, we compute the rewards {r_k}, k = 1..N3, by Equation 4. The gradient can be approximated by

∇_φ J(φ) ≈ (1/N3) Σ_{k=1}^{N3} (1/|B_k|) Σ_{t ∈ B_k} r_k ∇_φ log P_φ(a_t | s_t)

where |B_k| is the number of instances in one batch and r_k is the reward of batch B_k. The parameters of the policy network are updated after each episode.
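To make the update concrete, here is a REINFORCE-style sketch with a deliberately tiny policy: a one-parameter logistic over the confidence, P(select | s) = σ(φ·p). The actual policy network is the MLP described earlier; this only illustrates how batch rewards weight the log-probability gradients in the approximation above.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reinforce_update(phi, batches, eta=0.1):
    """One policy-gradient step. Each batch is (confidences, actions, r_k);
    the batch reward r_k scales the averaged grad-log-prob of its actions,
    and the result is averaged over the episode's batches."""
    grad = 0.0
    for confidences, actions, r_k in batches:
        g_batch = 0.0
        for c, a in zip(confidences, actions):
            p = sigmoid(phi * c)
            # d/dphi log P(a|s) = (a - p) * c for a Bernoulli-logistic policy
            g_batch += (a - p) * c
        grad += r_k * g_batch / len(actions)
    return phi + eta * grad / len(batches)
```

Note the sign behavior: accepted actions in a positively rewarded batch and rejected actions in a negatively rewarded batch both push φ toward selecting high-confidence instances.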
Algorithm 1 Reinforced self-training for zero-shot text classification
Require: labeled seen data D^s, seen validation set D^s_dev
1: Initialize pseudo-labeled data D^p ← ∅
2: for i = 1 → N1 do  // iteration i
3:     Train matching model f with instances from D^s and D^p

4 Experiments

Datasets
We use two kinds of datasets in our experiments. The first comes from the recently released benchmarks for zero-shot text classification (Yin et al., 2019), including three datasets: topic, emotion, and situation classification. Since some texts in the situation dataset have multiple labels, we remove them and keep only single-label texts. To keep consistent with Equation 1, the "none" type is not included in the unseen classes. Each dataset is prepared with two versions of label partitions with non-overlapping labels, so as to prevent models from over-fitting to one particular partition.
To further evaluate our method in a real-world scenario, we construct a new dataset from an e-commerce platform, where the texts are user search queries. The seen classes Y^s are the product categories that users click on after searching. The unseen classes Y^u are pre-defined user preference classes. A user preference refers to a product attribute that users prefer, such as the efficacy of cosmetic products or the style of furniture. The user preferences and product categories are disjoint, so the task can be formalized as a zero-shot learning problem. We annotate a 10-class user preference dataset for evaluation, with 1,000 instances per class. Following (Yin et al., 2019), we create two versions of unseen classes, each with 5 non-overlapping classes. The statistics of the datasets are shown in Table 2.
Policy network pre-training is widely used by reinforcement learning based methods to accelerate the training of the RL agent (Silver et al., 2016; Xiong et al., 2017; Qin et al., 2018). We use seen class data to pre-train the agent, enabling it to distinguish negative instances, and set an early-stop criterion to avoid overfitting to seen class data.

Baseline Methods
We compare our method with the following baselines: (1) Word2vec measures how well a label matches the text by computing the cosine similarity of their representations, where both the text and label representations are averages of word embeddings.
(2) Label similarity (Veeranna et al.) also uses word embeddings to compute semantic similarity: it computes the cosine similarity between the class label and every n-gram (n = 1, 2, 3) of the text and takes the maximum similarity as the final matching score. (3) FC and RNN+FC refer to architecture 1 and architecture 2 proposed in (Pushp and Srivastava, 2017).
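For illustration, the Word2vec baseline reduces to averaging embeddings and taking a cosine similarity. The tiny embedding table in the usage below is hypothetical; a real run would load pre-trained word vectors.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_vec(words, emb):
    """Average the embeddings of in-vocabulary words; zero vector if none."""
    vecs = [emb[w] for w in words if w in emb]
    dim = len(next(iter(emb.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def word2vec_score(text, label, emb):
    """Word2vec baseline: cosine similarity between the averaged word
    embeddings of the text and of the label."""
    return cosine(avg_vec(text.split(), emb), avg_vec(label.split(), emb))
```

The Label similarity baseline differs only in scoring every n-gram of the text against the label and keeping the maximum instead of averaging the whole text.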
We also compare multiple variants of our model: (1) BERT refers to the base matching model without self-training and RL; (2) BERT+self-training refers to the traditional self-training method, which selects instances with high confidence. The confidence threshold has a great impact on performance: with different thresholds, the number of selected instances differs, changing the performance of the model. To provide a fair comparison, we record the number of instances k selected in each iteration of the RL selection process and, for self-training, select the top k instances in each iteration. (3) BERT+RL refers to the full model of our method.
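The fair-comparison setup for the self-training baseline is simply a top-k cut on confidence, with k copied from the RL run; a minimal sketch:

```python
def topk_self_training_select(scored_pool, k):
    """Confidence-based baseline selection: take the k highest-confidence
    pseudo-labeled instances, each given as (text, confidence, label)."""
    return sorted(scored_pool, key=lambda t: t[1], reverse=True)[:k]
```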
We use macro-F1 as the evaluation metric in our experiments since the datasets are not well balanced. We report results in two ZSL settings: generalized and non-generalized. In non-generalized ZSL, at test time we assign an instance to a label from the unseen classes only (Y^u), while in generalized ZSL, the label comes from both seen and unseen classes (Y^s ∪ Y^u). The harsh policy in testing (Yin et al., 2019) is not adopted in our experiments.
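Macro-F1 averages per-class F1 so that rare classes count as much as frequent ones, which is why it suits these imbalanced datasets; a reference implementation:

```python
def macro_f1(golds, preds, labels):
    """Macro-averaged F1: compute precision/recall/F1 per class,
    then take the unweighted mean over classes."""
    f1s = []
    for c in labels:
        tp = sum(1 for g, p in zip(golds, preds) if g == c and p == c)
        fp = sum(1 for g, p in zip(golds, preds) if g != c and p == c)
        fn = sum(1 for g, p in zip(golds, preds) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```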

[Table 3: generalized ZSL results (macro-F1) on the Topic, Emotion, Situation, and E-commerce datasets, label partitions I and II.]

Table 3 shows the experimental results on the benchmarks and the real-world e-commerce dataset in the generalized setting. Among the baselines, Word2vec and Label similarity are unsupervised approaches and cannot achieve desirable results, as their effectiveness heavily relies on the similarity between text and label; they therefore perform poorly on datasets like emotion detection. Label similarity performs slightly better than Word2vec, which shows that max aggregation over n-grams is better than the mean aggregation used in the Word2vec method. As for the supervised methods, FC obtains slightly better results than RNN+FC on most datasets. Since the number of categories and the scale of the training data are small, RNN+FC may overfit the seen class data and fail to generalize to unseen classes. Among the variants of our method, the full model BERT+RL outperforms all other baselines; on average, it achieves an improvement of 15.4% over BERT. The base matching model BERT performs better than the previous baselines, showing good generalization that benefits from pre-training on a large-scale corpus. BERT+self-training augments the base matching model with unlabeled data and shows superior performance to BERT. Last but not least, our full model BERT+RL shows substantial improvement over BERT+self-training on most datasets. Under the condition that the number of selected instances remains the same, the reinforced selection strategy still yields better performance than the simple confidence-based strategy, which demonstrates the effectiveness of our RL policy.

Results
For the non-generalized ZSL setting, we obtain similar results, presented in Table 4. On average, BERT+RL achieves an improvement of 5.4% over BERT. We notice that the improvement is more significant in generalized ZSL than in non-generalized ZSL. The reason is that a model trained on seen class data tends to be biased towards seen classes, resulting in poor performance in the generalized setting (Song et al., 2018). Our approach relieves this bias in favour of seen classes by incorporating pseudo-labeled unseen class data.

Figure 4: Performance with regard to the selected instance ratio. The RL data selection strategy does not rely on a manually-set ratio and yields consistently better performance than the competitors in most cases.

Impact of Selection Ratio
When selecting the same number of instances per iteration, the previous experimental results show that our reinforced selection strategy yields better performance than the greedy strategy. We define the selection ratio as the proportion of selected instances among all unlabeled instances, and vary it among {0.2, 0.4, 0.6, 0.8, 1.0} for the self-training method; in each iteration, the corresponding number of top-confidence instances is selected and added to the training set. Figure 4 shows the performance with different selection ratios in the generalized ZSL setting. Clearly, the performance of the self-training method varies with the ratio of instances selected, and the optimal ratio also varies across datasets. In contrast, our reinforced data selection strategy does not rely on a manually-set ratio and yields consistently better performance than the self-training method in most cases.

Case Study
In Table 5, we list some examples to further reveal the differences between the BERT and BERT+RL methods. The left part of the table lists texts predicted by BERT with the highest confidence. These texts share a simple matching pattern: the label word appears in the text (highlighted in red). Such simple patterns are exactly the class-invariant patterns defined previously, which can be shared among classes. In the right part of the table, we select texts that are misclassified by BERT but predicted correctly by BERT+RL. These texts are harder to distinguish since their matching patterns are more class-dependent and cannot be directly transferred from other classes; a model trained only on other classes would undoubtedly fail in such cases. Our method first tackles the easy instances and then adds them to the training set iteratively. With the integration of instances exhibiting easy patterns, the model gradually learns harder patterns. In this way, our method learns to transfer even between classes with low similarity.

Conclusion
In this paper, we propose a reinforced self-training framework for zero-shot text classification. To enable transfer between classes with low similarity, our method essentially turns a zero-shot learning problem into a semi-supervised learning problem. In this way, our approach can leverage unlabeled data and alleviate the domain shift between seen and unseen classes. Beyond that, we use reinforcement learning to learn the data selection policy automatically, obviating the need for manual adjustment. Experimental results on both benchmarks and a real-world e-commerce dataset demonstrate the effectiveness of integrating unlabeled data and of the reinforced data selection policy.