X-Class: Text Classification with Extremely Weak Supervision

In this paper, we explore text classification with extremely weak supervision, i.e., only relying on the surface text of class names. This is a more challenging setting than the seed-driven weak supervision, which allows a few seed words per class. We opt to attack this problem from a representation learning perspective—ideal document representations should lead to nearly the same results between clustering and the desired classification. In particular, one can classify the same corpus differently (e.g., based on topics and locations), so document representations should be adaptive to the given class names. We propose a novel framework X-Class to realize the adaptive representations. Specifically, we first estimate class representations by incrementally adding the most similar word to each class until inconsistency arises. Following a tailored mixture of class attention mechanisms, we obtain the document representation via a weighted average of contextualized word representations. With the prior of each document assigned to its nearest class, we then cluster and align the documents to classes. Finally, we pick the most confident documents from each cluster to train a text classifier. Extensive experiments demonstrate that X-Class can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets.


Introduction
Weak supervision has been recently explored in text classification to save human effort. Typical forms of weak supervision include a few labeled documents per class (Jo and Cinarel, 2019), a few seed words per class (Meng et al., 2020a), and other similar open data (Yin et al., 2019). Though much weaker than a fully annotated corpus, these forms still require non-trivial, corpus-specific knowledge from experts. For example, nominating seed words requires experts to consider their relevance not only to the desired classes but also to the input corpus. To acquire a few labeled documents per class, unless the classes are balanced, one needs to sample and annotate a much larger number of documents to cover the minority classes.
In this paper, we focus on extremely weak supervision, i.e., only relying on the surface text of class names. This setting is much more challenging than the ones above, and can be considered as almost-unsupervised text classification.
We opt to attack this problem from a representation learning perspective: ideal document representations should lead to nearly the same result between clustering and the desired classification. Recent advances in contextualized representation learning using neural language models have demonstrated the capability of clustering text into domains with high accuracy (Aharoni and Goldberg, 2020). Specifically, a simple average of word representations is sufficient to group documents on the same topic together. However, the same corpus could be classified using various criteria other than topics, such as locations and sentiments. As visualized in Figure 1, such class-invariant representations separate topics well but mix up locations. Therefore, it is necessary to make document representations adaptive to the user-specified class names.
We propose a novel framework, X-Class, to conduct text classification with extremely weak supervision, as illustrated in Figure 2. First, we estimate class representations by incrementally adding the most similar word to each class and recalculating its representation. Following a tailored mixture of class attention mechanisms, we obtain the document representation via a weighted average of contextualized word representations. These representations are based on pre-trained neural language models, and they are supposed to lie in the same latent space. We then adopt clustering methods (e.g., Gaussian Mixture Models) to group the documents into K clusters, where K is the number of desired classes. The clustering method is initialized with the prior knowledge that each document is assigned to its nearest class. We preserve this assignment so we can easily align the final clusters to the classes. In the end, we pick confident documents from each cluster to form a pseudo training set, on which we can train any document classifier. In our implementation, we use BERT as both the pre-trained language model and the text classifier.

Figure 2: An overview of our X-Class. Given a raw input corpus and user-specified class names, we first estimate a class-oriented representation for each document. We then align documents to classes with confidence scores by clustering. Finally, we train a supervised model (e.g., BERT) on the confident document-class pairs.

Compared with existing weakly supervised methods, X-Class has stronger and more consistent performance on 7 benchmark datasets, even though some of those methods use at least 3 seed words per class. It is also worth mentioning that X-Class has a much milder requirement on the existence of class names in the corpus, whereas existing methods rely on a variety of contexts of the class names.

Our contributions are summarized as follows.
• We advocate an important but not-well-studied problem: text classification with extremely weak supervision.
• We develop a novel framework, X-Class, to attack this problem from a representation learning perspective. It estimates high-quality, class-oriented document representations based on pre-trained neural language models, so that confident clustering examples can form a pseudo training set for any document classifier to train on.
• We show that on 7 benchmark datasets, X-Class achieves comparable and even better performance than existing weakly supervised methods that require more human effort.
Reproducibility. We will release both the datasets and the code on GitHub.

Preliminaries
In this section, we formally define the problem of text classification with extremely weak supervision and then review preliminaries on BERT (Devlin et al., 2019), attention (Luong et al., 2015), and Gaussian Mixture Models.

Problem Formulation. The extremely weak supervision setting confines our input to only a set of documents D_i, i ∈ {1, ..., n} and a list of class names c_j, j ∈ {1, ..., k}. The class names are expected to provide hints about the desired classification objective, considering that different criteria (e.g., topics, sentiments, and locations) could classify the same set of documents. Our goal is to build a classifier that categorizes a (new) document into one of the classes based on the class names.
Seed-driven weak supervision requires carefully designed label-indicative keywords that concisely define what a class represents. This requires human experts to understand the corpus extensively. One of our motivations is to relax this burdensome requirement. Interestingly, in experiments, our proposed X-Class using extremely weak supervision can offer comparable and even better performance than the seed-driven methods.
BERT. BERT is a pre-trained masked language model with a transformer architecture (Devlin et al., 2019). It takes one or more sentences as input, breaks them up into word-pieces, and generates a contextualized representation for each word-piece. To handle long documents with BERT, we apply a sliding-window technique. To retrieve representations for words, we average the representations of each word's word-pieces. BERT has been widely adopted as a backbone in a large variety of NLP tasks. In our work, we utilize BERT for two purposes: (1) representations for words in the documents and (2) the supervised text classifier.

Attention. Attention mechanisms assign weights to a sequence of vectors, given a context vector (Luong et al., 2015). They first estimate a hidden state h̃_j = K(h_j, c) for each vector h_j, where K is a similarity measure and c is the context vector. The hidden states are then transformed into a distribution via a softmax function. In our work, we use attention to assign weights to representations, which we then average accordingly.

Gaussian Mixture Model. The Gaussian Mixture Model (GMM) is a classical clustering algorithm (Duda and Hart, 1973). It assumes that each cluster is generated through a Gaussian process. Given an initialization of the cluster centers and the covariance matrix, it iteratively optimizes the point-cluster memberships and the cluster parameters following an Expectation-Maximization framework. Unlike K-Means, it does not restrict clusters to a perfect ball-like shape. Therefore, we apply GMM to cluster our document representations.
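For concreteness, here is a minimal sketch of this generic attention scheme (the function and variable names are ours, not from any specific library; K is instantiated as cosine similarity):

```python
import numpy as np

def attention_weights(vectors, context):
    """Toy attention: cosine similarity to a context vector, then softmax.

    vectors is an (n, d) array and context a (d,) array; both are
    hypothetical inputs, with K instantiated as cosine similarity.
    """
    sims = vectors @ context / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(context) + 1e-12
    )
    exp = np.exp(sims - sims.max())   # softmax with a stability shift
    return exp / exp.sum()            # a distribution over the vectors

def weighted_average(vectors, context):
    # Average the vectors according to their attention weights.
    return attention_weights(vectors, context) @ vectors
```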

Our X-Class Framework
As shown in Figure 2, our X-Class framework contains three modules: (1) class-oriented document representation estimation, (2) document-class alignment through clustering, and (3) text classifier training based on confident labels.

Class-oriented Document Representation
Ideally, we wish to have some document representations such that clustering algorithms can find k clusters very similar to the k desired classes.
We propose to estimate the document representations and class representations based on pre-trained neural language models; Algorithm 1 gives an overview. In our implementation, we use BERT as an example. For each document, we want its document representation to be similar to the class representation of its desired class.

Algorithm 1: Class-Oriented Document Representation Estimation
Input: n documents D_i, k class names c_j, the maximum number of class-indicative words T, and the attention mechanism set M.
  Compute t_{i,j} (contextualized word representations).
  Compute s_w for all words (Eq. 1).
  // class representation estimation

Aharoni and Goldberg (2020) demonstrated that contextualized word representations generated by BERT preserve the domain (i.e., topic) information of documents. Specifically, they generated document representations by averaging the contextualized representations of the constituent words, and they observed that these document representations are very similar among documents belonging to the same topic. This observation motivates us to "classify" documents by topic in an unsupervised way. However, this unsupervised method may not work well on criteria other than topics. For example, as shown in Figure 1, such document representations work well for topics but poorly for locations.
We therefore incorporate information from the given class names and obtain class-oriented document representations. We break this module down into two parts: (1) class representation estimation and (2) document representation estimation.

Class Representation Estimation. Inspired by seed-driven weakly supervised methods, we argue that a few keywords per class are enough to capture the semantics of the user-specified classes. Intuitively, the class name is the first keyword we can start with. We propose to incrementally add new keywords to each class to enrich our understanding. Figure 3 shows an overview of our class representation estimation.

First, for each word, we obtain its static representation by averaging the contextualized representations of all its occurrences in the input corpus. For words that are broken into word-piece tokens, we average all the token representations as the word's representation. Specifically, we define the static representation s_w of a word w as

    s_w = ( Σ_{(i,j): D_{i,j} = w} t_{i,j} ) / |{(i,j): D_{i,j} = w}|    (Eq. 1)

where D_{i,j} is the j-th word in document D_i and t_{i,j} is its contextualized word representation. Ethayarajh (2019) adopted a similar strategy for estimating static representations with BERT. Such static representations serve as anchors to initialize our understanding of the classes.

A straightforward way to enrich a class representation is to take a fixed number of words similar to the class name and average them. However, this suffers from two issues: (1) setting the same number of keywords for all classes may hurt the minority classes, and (2) a simple average may shift the semantics away from the class name itself. As an extreme example, when 99% of documents are about sports and the remaining 1% are about politics, it is not reasonable to add as many keywords to politics as to sports, as doing so would make the politics representation diverge.
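As a minimal sketch (the names are ours, not the paper's released code), Eq. 1 can be computed in one pass over the corpus, assuming the contextualized representations t_{i,j} are already available:

```python
from collections import defaultdict

import numpy as np

def static_word_representations(docs, contextualized):
    """Eq. 1: average the contextualized representations of all
    occurrences of each word.

    docs[i][j] is the j-th word of document D_i;
    contextualized[i][j] is its contextualized representation t_{i,j}.
    """
    sums = defaultdict(lambda: 0.0)
    counts = defaultdict(int)
    for doc, reps in zip(docs, contextualized):
        for word, rep in zip(doc, reps):
            sums[word] = sums[word] + np.asarray(rep)
            counts[word] += 1
    return {w: sums[w] / counts[w] for w in sums}
```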
To address these two issues, we iteratively find the next keyword for each class and recalculate the class representation as a weighted average over all the keywords found so far. We stop this iterative process when the new representation is no longer consistent with the previous one. In this way, different classes adaptively receive different numbers of keywords. Specifically, we define a comprehensive representation x_l for a class l as a weighted average over a ranked list of keywords K_l, where top-ranked keywords are expected to have static representations more similar to the class representation. Assuming that the similarities follow a Zipf's law distribution (Powers, 1998), we define the weight of the i-th keyword as 1/i. That is,

    x_l = ( Σ_{i=1}^{|K_l|} (1/i) · s_{K_{l,i}} ) / ( Σ_{i=1}^{|K_l|} (1/i) )

where K_{l,i} denotes the i-th keyword in the list. For a given class, the first keyword in this list is always the class name. In the i-th iteration, we retrieve the out-of-list word whose static representation is most similar to the current class representation. We then calculate a new class representation based on all i + 1 words. We stop this expansion when we already have enough (e.g., T = 100) keywords, or when the new class representation cannot reproduce the same set of top-i keywords in our list. In our experiments, some classes indeed stop before reaching 100 keywords.
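The expansion loop with its consistency check can be sketched as follows (a simplified illustration under our own naming, assuming the static representations are unit-normalized):

```python
import numpy as np

def estimate_class_representation(class_name, static_reps, T=100):
    """Sketch of iterative keyword expansion with a consistency check.

    static_reps maps each word to a unit-normalized static representation;
    the function and variable names are ours, not the paper's code.
    """
    keywords = [class_name]
    class_rep = static_reps[class_name]
    words = list(static_reps)
    vectors = np.stack([static_reps[w] for w in words])
    while len(keywords) < T:
        # Retrieve the out-of-list word most similar to the current class rep.
        sims = vectors @ class_rep / np.linalg.norm(class_rep)
        order = [words[i] for i in np.argsort(-sims)]
        candidate = next(w for w in order if w not in keywords)
        new_keywords = keywords + [candidate]
        # Zipf-weighted average over the enlarged keyword list.
        weights = np.array([1.0 / (i + 1) for i in range(len(new_keywords))])
        new_rep = weights @ np.stack([static_reps[w] for w in new_keywords])
        new_rep /= weights.sum()
        # Consistency check: the new rep must rank the current keywords
        # as its own top-i words; otherwise stop expanding.
        new_sims = vectors @ new_rep / np.linalg.norm(new_rep)
        top = {words[i] for i in np.argsort(-new_sims)[: len(keywords)]}
        if top != set(keywords):
            break
        keywords, class_rep = new_keywords, new_rep
    return class_rep, keywords
```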
Document Representation Estimation. Intuitively, the content of each document should stick to its underlying class. For example, the sentence "I cheered for Lakers winning NBA" covers the sports and happy classes, but not arts, politics, or sad. Therefore, we assume that each word in a document is either similar to its desired class's representation or unrelated to all classes. Based on this assumption, we upgrade the simple average of contextualized word representations (Aharoni and Goldberg, 2020) to a weighted average. Specifically, we follow the popular attention mechanisms to assign weights to words based on their similarities to the class representations. Figure 4 shows an overview of our document representation estimation.

We propose to employ a mixture of attention mechanisms to make the estimation more robust. For the j-th word in the i-th document, D_{i,j} = w, there are two possible representations: (1) the contextualized word representation t_{i,j} and (2) the static representation s_w. The contextualized representation disambiguates words with multiple senses by considering the context, while the static version accounts for outliers that may exist in documents. Therefore, it is reasonable to use either of them as the word representation e for the attention mechanisms. Given the class representations x_c, we define two attention mechanisms:
• one-to-one: h_{i,j} = max_c cos(e, x_c), the maximum similarity to a single class. This is useful for detecting words that are specifically similar to one class, such as NBA to sports.
• one-to-all: h_{i,j} = cos(e, avg_c x_c), the similarity to the average of all classes. This ranks words by how related they are to the general set of classes in focus.
Combining the 2 choices of e and the 2 attention mechanisms yields 4 ways to compute each word's attention weight. We further fuse these attention weights in an unsupervised way. Instead of using the similarity values directly, we rely on rankings. Specifically, we sort the words in decreasing order of attention weight to obtain 4 ranked lists. Following previous work (Tao et al., 2018), we take the geometric mean of the 4 ranks of each word to form a unified ranked list. As in class representation estimation, we follow Zipf's law and assign a weight of 1/r to the word ranked at the r-th position in the unified list. Finally, we obtain the document representation E_i as the weighted average of the t_{i,j} with these weights.
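A sketch of this mixed-attention weighting for a single document, including the rank fusion and the 1/r weights described above (the names and array shapes are our assumptions):

```python
import numpy as np
from scipy.stats import gmean  # geometric mean for rank fusion

def document_representation(tokens_ctx, tokens_static, class_reps):
    """Sketch of the mixed-attention document representation.

    tokens_ctx: (m, d) contextualized reps t_{i,j} of one document's words;
    tokens_static: (m, d) static reps s_w of the same words;
    class_reps: (k, d) class representations x_c. Names are ours.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    avg_class = class_reps.mean(axis=0)
    ranks = []
    for e in (tokens_ctx, tokens_static):            # 2 choices of e
        one_to_one = np.array([max(cos(v, x) for x in class_reps) for v in e])
        one_to_all = np.array([cos(v, avg_class) for v in e])
        for scores in (one_to_one, one_to_all):      # 2 attention mechanisms
            # Rank 1 = highest attention score.
            ranks.append(np.argsort(np.argsort(-scores)) + 1)
    fused = gmean(np.stack(ranks), axis=0)           # geometric mean of ranks
    unified = np.argsort(np.argsort(fused)) + 1      # unified ranked list
    weights = 1.0 / unified                          # Zipf-style 1/r weights
    weights /= weights.sum()
    return weights @ tokens_ctx                      # weighted average of t_{i,j}
```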

Document-Class Alignment
One straightforward way to align documents to classes is to simply find the most similar class based on their representations. However, document representations do not necessarily distribute in a ball shape around the class representation; the dimensions of the representation can be freely correlated.
To address this challenge, we leverage the Gaussian Mixture Model (GMM) to capture the covariances of the clusters. Specifically, we set the number of clusters to the number of classes k and initialize the cluster parameters with the prior knowledge that each document D_i is assigned to its nearest class L_i, as follows:

    L_i = argmax_c cos(E_i, x_c)

The mean of each cluster is initialized as the average representation of the documents assigned to it.
We use a tied covariance matrix across all clusters since we believe the classes are similar in granularity. We cluster the documents while remembering the class each cluster was initialized from; in this way, we can align the final clusters to the classes. Considering the potential redundant noise in these representations, we also apply principal component analysis (PCA) for dimensionality reduction, following the experience in topic clustering (Aharoni and Goldberg, 2020). By default, we fix the PCA dimension to P = 64.
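A minimal sketch of this alignment step with scikit-learn (our own naming; it assumes every class receives at least one document under the prior):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def align_documents(doc_reps, class_reps, P=64):
    """Sketch of document-class alignment with PCA + tied-covariance GMM.

    doc_reps: (n, d) document representations E_i;
    class_reps: (k, d) class representations x_c. Names are ours.
    """
    # Prior: each document is assigned to its nearest class by cosine.
    norm_docs = doc_reps / np.linalg.norm(doc_reps, axis=1, keepdims=True)
    norm_cls = class_reps / np.linalg.norm(class_reps, axis=1, keepdims=True)
    prior = np.argmax(norm_docs @ norm_cls.T, axis=1)          # L_i

    # Prune redundant noise in the representations.
    reduced = PCA(n_components=P).fit_transform(doc_reps)

    # Initialize each cluster mean from the documents of its prior class,
    # so the final clusters stay aligned with the classes.
    k = class_reps.shape[0]
    means = np.stack([reduced[prior == c].mean(axis=0) for c in range(k)])
    gmm = GaussianMixture(n_components=k, covariance_type="tied",
                          means_init=means)
    gmm.fit(reduced)
    posteriors = gmm.predict_proba(reduced)                    # confidence
    return posteriors.argmax(axis=1), posteriors.max(axis=1)
```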

Text Classifier Training
The alignment between documents and classes produces high-quality pseudo labels for the documents in the training set. To generalize this knowledge to unseen documents, we train a text classifier using these pseudo labels as ground truth. This is a classical noisy-training scenario (Angluin and Laird, 1987; Goldberger and Ben-Reuven, 2017). Since we know how confident we are about each instance (i.e., the posterior probability of its assigned cluster in the GMM), we select the most confident instances to train a text classifier (e.g., BERT). By default, we set a confidence threshold δ = 50%, i.e., the top 50% of instances are selected for classifier training.
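A sketch of this confidence-based selection (the function name and signature are ours):

```python
import numpy as np

def select_confident(pseudo_labels, confidences, delta=0.5):
    """Keep the top-delta fraction of documents by GMM posterior confidence.

    pseudo_labels and confidences come from the alignment step above; the
    selected pairs form the pseudo training set for the final classifier
    (e.g., BERT).
    """
    n_keep = int(len(pseudo_labels) * delta)
    keep = np.argsort(-np.asarray(confidences))[:n_keep]  # most confident first
    return keep, np.asarray(pseudo_labels)[keep]
```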

Experiments
We conduct extensive experiments to show and ablate the performance of X-Class.

Compared Methods
We compare with two seed-driven weakly supervised methods. WeSTClass (Meng et al., 2018) generates pseudo-labeled documents via word embeddings of keywords and employs a self-training module to obtain the final classifier; we use the CNN version of WeSTClass, as it is reported to outperform the HAN version. ConWea (Mekala and Shang, 2020) utilizes pre-trained neural language models to make the weak supervision contextualized. In our experiments, we feed at least 3 seed words per class to both methods.
We also compare with LOTClass (Meng et al., 2020b), which works under the extremely weak supervision setting. In their experiments, it mostly relies on class names but uses a few keywords to elaborate on some difficult classes; in our experiments, we feed only the class names to it.

Table 1: An overview of our 7 benchmark datasets (AGNews, 20News, NYT-Small, NYT-Topic, NYT-Location, Yelp, and DBpedia). They cover various domains and classification criteria. The imbalance factor of a dataset refers to the ratio of its largest class's size to its smallest class's size.

We denote our method as X-Class. To further understand the effects of different modules, we evaluate four ablation versions. X-Class-Rep refers to the prior labels L_i derived from the class-oriented document representations. X-Class-Align refers to the labels obtained after document-class alignment. X-Class-ExactT skips the consistency check when estimating class representations and uses exactly T class words. X-Class-KMeans uses K-Means (Lloyd, 1982) instead of GMM during document-class alignment.
We also present the performance of supervised models, which serves as an upper bound for X-Class. Specifically, Supervised refers to a BERT model cross-validated on the training set with 2 folds (matching our confidence selection threshold).

Datasets
Many different datasets have been adopted to evaluate weakly supervised methods across different works, which makes systematic comparison difficult.
In this paper, we pool the most popular datasets to establish a benchmark on weakly supervised text classification. Table 1 provides an overview of our carefully selected 7 datasets, covering different text sources (e.g., news, reviews, and Wikipedia articles) and different criteria of classes (e.g., topics, locations, and sentiment).

• AGNews from Zhang et al. (2015) (used in LOTClass) and 20News (used in WeSTClass and ConWea) are for topic categorization in news.
• NYT-Small (used in WeSTClass and ConWea) is for topic classification of New York Times news.
• NYT-Topic (used in Meng et al. (2020a)) is another, larger dataset collected from the New York Times for topic categorization.
• NYT-Location (used in Meng et al. (2020a)) is the same corpus as NYT-Topic but labeled by location. It is worth pointing out that many documents in this dataset mention several countries simultaneously, so simply matching location names will not lead to satisfactory results.
• Yelp from Zhang et al. (2015) (used in WeSTClass) is for sentiment analysis of reviews.
• DBpedia from Zhang et al. (2015) (used in LOTClass) is for topic classification based on titles and descriptions in DBpedia.

Experimental Settings
For all X-Class experiments, we report the performance under one fixed random seed. By default, we set T = 100, P = 64, and δ = 50%. For contextualized token representations t_{i,j}, we use BERT-base-uncased, whose lowercasing groups more occurrences of each word together. For the text classifier, we fine-tune BERT with the Hugging Face implementation (Wolf et al., 2019), with all hyper-parameters unchanged.
For both WeSTClass and ConWea, we tried our best to find keywords for the new datasets; Table 3 shows, as an example, the seed words selected for the NYT-Small dataset. For LOTClass, we tune their hyper-parameters match threshold and mcp epoch, and report the best performance during their self-training process.

Performance Comparison and Analysis
From Table 2, one can see that X-Class achieves the best overall performance. It trails LOTClass and ConWea by only 1% to 2% on AGNews and NYT-Topics, respectively. Note that ConWea consumes at least 3 keywords per class.
It is noteworthy that X-Class approaches the supervised upper bound within a small margin, especially on the NYT-Small dataset.

Ablation on Modules. X-Class-Rep achieves high scores (e.g., on both NYT-Topics and NYT-Locations), showing the success of our class-oriented representations. The improvement of X-Class-Align over X-Class-Rep demonstrates the usefulness of our clustering module. Comparing X-Class with X-Class-Align, it is also clear that the classifier training is beneficial.

Ablation on Consistency Check. The consistency check in class representation estimation allows an adaptive number of keywords for each class. Without it, the class understanding diverges and performance degrades, as shown in Table 2.

Ablation on Clustering Methods. Table 2 also shows that K-Means performs poorly on most datasets. This matches our previous analysis, as K-Means assumes a hard spherical boundary, while GMM models the boundary softly, like an ellipse.

Figure 6: Effects of Attention Mechanisms (unweighted, one-to-one, one-to-all, one-to-one-static, one-to-all-static, and the default mixture). We focus on X-Class-Align to show their direct effects.

Effect of Attention
In Figure 5, we visualize our class-oriented document representations and the unweighted variants using t-SNE (Rauber et al., 2016). While the simple-average representations are well separated, much like the class-oriented representations, on NYT-Topics, they are heavily mixed up on NYT-Locations and Yelp. We conjecture that this is because BERT representations carry topic information as their most significant feature.

We have also tried using different attention mechanisms in X-Class. From the results in Figure 6, one can see that using a single mechanism, though not under-performing by much, is less stable than our proposed mixture. The unweighted variant works well on the four datasets that focus on news topics but not well enough on locations and sentiments.

Hyper-parameter Sensitivity in X-Class
Figure 7 visualizes the performance trends w.r.t. the three hyper-parameters in X-Class: the limit on class words T in class representation estimation, the PCA dimension P in document-class alignment, and the confidence threshold δ in text classifier training.

Intuitively, a class does not have too many highly relevant keywords. Figure 7(a) confirms this: the performance of X-Class is relatively stable unless T grows too large (e.g., to 1000).
Choosing a proper PCA dimension prunes redundant information from the embeddings and improves the running time. However, if P is too small or too large, performance may suffer due to information loss or redundancy; one can observe this expected trend in Figure 7(b) on all datasets. Typically, we want to select a reasonable number of confident training samples for text classifier training. Too few training samples (i.e., too small a δ) lead to insufficient training data, while too many (i.e., too large a δ) lead to overly noisy training data. Figure 7(c) shows that δ ∈ [0.3, 0.9] is a good choice on all datasets.

Figure 7: Hyper-parameter Sensitivity in X-Class. For T and P, we report the performance of X-Class-Align to explore their direct effects.

Requirements on Class Names
Compared with previous works (Meng et al., 2020b), our X-Class has a significantly milder requirement on human-provided class names in terms of both quantity and quality. We conducted an experiment (Table 4) for X-Class on 20News and NYT-Small by deleting all but one occurrence of each class name from the input corpus; in other words, each user-provided class name appears only once in the corpus. Interestingly, the performance of X-Class drops by less than 1%, still outperforming all compared methods. In contrast, the most recent work, LOTClass (Meng et al., 2020b), requires a wide variety of contexts of the class names in the input corpus to ensure the quality of the class vocabulary generated in its very first step.

X-Class for Hierarchical Classification
There are two straightforward ways to extend X-Class to hierarchical classification: (1) X-Class-End: we give all fine-grained class names as input to X-Class and conduct classification in an end-to-end manner; and (2) X-Class-Hier: we first give only the coarse-grained class names to X-Class and obtain coarse-grained predictions; then, for each coarse-grained class and its predicted documents, we create a new X-Class classifier based on the fine-grained class names.

Table 5: Micro-/Macro-F1 scores for fine-grained classification on NYT-Small (5 coarse-grained and 26 fine-grained classes). All compared methods use 3 keywords per class. LOTClass failed to discover documents with category-indicative terms and is thus not reported. § refers to numbers coming from other papers.

We experiment with hierarchical classification on the NYT-Small dataset, which has annotations for 26 fine-grained classes. We also introduce WeSHClass (Meng et al., 2019), the hierarchical version of WeSTClass, for comparison. LOTClass is not investigated here due to its poor coarse-grained performance on this dataset. The results in Table 5 show that X-Class-Hier performs best and is a better solution than X-Class-End. We conjecture that this is because the similarities between fine-grained classes differ drastically (one pair of fine-grained classes can be much more similar than another pair). Overall, we show that our method can be applied to a hierarchy of classes.
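A sketch of X-Class-Hier's recursion, where x_class(docs, class_names) is a hypothetical wrapper around the full pipeline above, returning one predicted class name per document:

```python
def x_class_hier(corpus, hierarchy):
    """Sketch of X-Class-Hier: classify coarse-grained first, then refine.

    hierarchy maps each coarse class name to its fine-grained class names;
    both x_class and hierarchy are our own hypothetical names.
    """
    coarse_preds = x_class(corpus, list(hierarchy))
    fine_preds = {}
    for coarse, fine_names in hierarchy.items():
        # Run a fresh X-Class instance on the documents predicted as `coarse`.
        idx = [i for i, p in enumerate(coarse_preds) if p == coarse]
        sub_preds = x_class([corpus[i] for i in idx], fine_names)
        fine_preds.update(zip(idx, sub_preds))
    return coarse_preds, fine_preds
```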

Related Work
We discuss related work from two angles.

Weakly supervised text classification. Weakly supervised text classification has attracted much attention from researchers (Tao et al., 2018; Meng et al., 2020a; Meng et al., 2020b). The general pipeline is to generate a set of document-class pairs and train a supervised model on them. Most previous work utilizes keywords to find such pseudo training data, which requires an expert who understands the corpus well. In this paper, we show that it is possible to reach similar, and often better, performance on various datasets without such guidance from experts.
A recent work (Meng et al., 2020b) also studied the same topic: extremely weak supervision for text classification. It follows a similar idea to Meng et al. (2020a) and further utilizes BERT to query replacements of class names to find keywords for classes, identifying potential classes for documents via string matching. Compared with LOTClass, our X-Class has a less strict requirement on class names being present in the corpus and works well even when a class name occurs only once (refer to Section 4.7).

BERT for topic clustering. Aharoni and Goldberg (2020) showed that document representations obtained by averaging token representations from BERT preserve domain information well. We borrow this idea to improve our document representations through clustering. Our work differs from theirs in that our document representations are guided by the given class names.

Conclusions and Future Work
We propose X-Class for text classification with extremely weak supervision, i.e., classifying text with only class names as supervision. X-Class leverages BERT representations to generate class-oriented document representations, which we then cluster to form document-class pairs that are finally fed to a supervised model for training. We further set up benchmark datasets for this task that cover different text sources (news and reviews) and various class types (topics, locations, and sentiments). Through extensive experiments, we show the strong performance and stability of our method.
There are two promising directions to explore. First, within the extremely weak supervision setting, we can extend to many other natural language tasks to eliminate human effort, such as Named Entity Recognition and Entity Linking. Second, building on the results with extremely weak supervision, we can envision an unsupervised version of text classification, where machines suggest class names and classify documents automatically.