Contextualized Weak Supervision for Text Classification

Weakly supervised text classification based on a few user-provided seed words has recently attracted much attention from researchers. Existing methods mainly generate pseudo-labels in a context-free manner (e.g., string matching); therefore, the ambiguous, context-dependent nature of human language has long been overlooked. In this paper, we propose a novel framework, ConWea, providing contextualized weak supervision for text classification. Specifically, we leverage contextualized representations of word occurrences and seed word information to automatically differentiate multiple interpretations of the same word, and thus create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand seed words in an iterative manner. This process not only adds new contextualized, highly label-indicative keywords but also disambiguates the initial seed words, making our weak supervision fully contextualized. Extensive experiments and case studies on real-world datasets demonstrate the necessity and significant advantages of using contextualized weak supervision, especially when the class labels are fine-grained.


Introduction
Weak supervision in text classification has recently attracted much attention from researchers, because it alleviates the burden on human experts of annotating massive documents, especially in specific domains. One popular form of weak supervision is a small set of user-provided seed words for each class. Typical seed-driven methods follow an iterative framework: generate pseudo-labels using some heuristics, learn the mapping between documents and classes, and expand the seed set (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015).
Most, if not all, existing methods generate pseudo-labels in a context-free manner; therefore, the ambiguous, context-dependent nature of human language has long been overlooked. Suppose the user gives "penalty" as a seed word for the sports class, as shown in Figure 1. The word "penalty" has at least two different meanings: the penalty in sports-related documents and the fine or death penalty in law-related documents. If the pseudo-label of a document is decided based only on the frequency of seed words, some documents about law may be mislabeled as sports. More importantly, such errors will further introduce wrong seed words, and thus be propagated and amplified over the iterations.
In this paper, we introduce contextualized weak supervision to train a text classifier based on user-provided seed words. The "contextualized" here is reflected in two places: the corpus and the seed words. Every word occurrence in the corpus may be interpreted differently according to its context; every seed word, if ambiguous, must be resolved according to its user-specified class. In this way, we aim to improve the accuracy of the final text classifier.
We propose a novel framework, ConWea, as illustrated in Figure 1. It leverages contextualized representation learning techniques, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), together with user-provided seed information to first create a contextualized corpus. This contextualized corpus is further utilized to train the classifier and expand seed words in an iterative manner. During this process, contextualized seed words are introduced by expanding and disambiguating the initial seed words. Specifically, for each word, we develop an unsupervised method to adaptively decide its number of interpretations, and accordingly, group all its occurrences based on their contextualized representations. We design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words. We repeat the iterative classification and seed word expansion process until convergence.
To the best of our knowledge, this is the first work on contextualized weak supervision for text classification. It is also worth mentioning that our proposed framework is compatible with almost any contextualized representation learning model and text classification model. Our contributions are summarized as follows:
• We propose a novel framework enabling contextualized weak supervision for text classification.
• We develop an unsupervised method to automatically group occurrences of the same word into an adaptive number of interpretations based on contextualized representations and user-provided seed information.
• We design a principled ranking mechanism to identify words that are discriminative and highly label-indicative.
• We have performed experiments on real-world datasets for both coarse- and fine-grained text classification tasks. The results demonstrate the superiority of using contextualized weak supervision, especially when the labels are fine-grained. Our code is made publicly available on GitHub 1 .

Overview
Problem Formulation. The input to our problem contains (1) a collection of n text documents D = {D_1, D_2, ..., D_n} and (2) m target classes C = {C_1, C_2, ..., C_m} and their seed words S = {S_1, S_2, ..., S_m}. We aim to build a high-quality document classifier from these inputs, assigning a class label C_j ∈ C to each document D_i ∈ D.

1 https://github.com/dheeraj7596/ConWea
Note that all these words could be upgraded to phrases if phrase mining techniques were applied as preprocessing. In this paper, we stick to words.
Framework Overview. We propose a framework, ConWea, enabling contextualized weak supervision. Here, "contextualized" is reflected in two places: the corpus and the seed words. Therefore, we have developed two novel techniques accordingly to make both contextualizations happen.
First, we leverage contextualized representation learning techniques (Peters et al., 2018; Devlin et al., 2019) to create a contextualized corpus. We choose BERT (Devlin et al., 2019) as an example in our implementation to generate a contextualized vector of every word occurrence. We assume the user-provided seed words are of reasonable quality: the majority of the seed words are not ambiguous, and the majority of the occurrences of each seed word are about the semantics of the user-specified class. Based on these two assumptions, we are able to develop an unsupervised method to automatically group occurrences of the same word into an adaptive number of interpretations, harvesting the contextualized corpus.
Second, we design a principled comparative ranking method to select highly label-indicative keywords from the contextualized corpus, leading to contextualized seed words. Specifically, we start with all possible interpretations of the seed words and train a neural classifier. Based on the predictions, we compare and contrast the documents belonging to different classes, and rank contextualized words by how label-indicative, frequent, and unusual they are. During this process, we eliminate the wrong interpretations of the initial seed words and also add more highly label-indicative contextualized words. This entire process is visualized in Figure 1. We denote the number of iterations between classifier training and seed word expansion as T, which is the only hyper-parameter in our framework. We discuss these two novel techniques in detail in the following sections. To make our paper self-contained, we will also briefly describe the pseudo-label generation and the document classifier.

Document Contextualization
We leverage contextualized representation techniques to create a contextualized corpus. The key objective of this contextualization is to disambiguate different occurrences of the same word into several interpretations. We treat every word separately, so in the rest of this section, we focus on a given word w. Specifically, given a word w, we denote all its occurrences as w_1, ..., w_n, where n is its total number of occurrences in the corpus.
Contextualized Representation. First, we obtain a contextualized vector representation b_{w_i} for each occurrence w_i. Our proposed method is compatible with almost any contextualized representation learning model. We choose BERT (Devlin et al., 2019) as an example in our implementation to generate a contextualized vector for each word occurrence. In this contextualized vector space, we use the cosine similarity to measure the similarity between two vectors. Two word occurrences w_i and w_j of the same interpretation are expected to have a high cosine similarity between their vectors b_{w_i} and b_{w_j}. For ease of computation, we normalize all contextualized representations into unit vectors.
Choice of Clustering Method. We model the word occurrence disambiguation problem as a clustering problem. Specifically, we propose to use the K-Means algorithm (Jain and Dubes, 1988) to cluster all contextualized representations b_{w_i} into K clusters, where K is the number of interpretations. We prefer K-Means because (1) cosine similarity and Euclidean distance are equivalent for unit vectors and (2) it is fast, and we cluster a significant number of times.
Automated Parameter Setting. We choose the value of K purely based on a similarity threshold τ, which is introduced to decide whether two clusters belong to the same interpretation by checking whether the cosine similarity between their center vectors is greater than τ. Intuitively, we should keep increasing K until no two clusters share the same interpretation. Therefore, we choose K to be the largest number such that the similarity between any two cluster centers is no more than τ:

K = max { K | cos(c_i, c_j) ≤ τ, ∀ 1 ≤ i < j ≤ K }    (1)

where c_i refers to the i-th cluster center vector after clustering all contextualized representations into K clusters. In practice, K is usually no more than 10, so we increase K gradually until the constraint is violated.
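This adaptive choice of K can be sketched as follows. The function name, the `max_k` cap, and the use of scikit-learn's `KMeans` are our own illustration, not the authors' released code; the input is assumed to be a matrix of unit-normalized contextualized vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

def adaptive_kmeans(vectors, tau, max_k=10):
    """Cluster unit-normalized occurrence vectors, increasing K until two
    cluster centers become more similar than tau (i.e., the same
    interpretation); return the labels for the largest valid K."""
    best_labels = None
    for k in range(1, min(max_k, len(vectors)) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
        centers = km.cluster_centers_
        centers = centers / np.linalg.norm(centers, axis=1, keepdims=True)
        sims = centers @ centers.T          # pairwise cosine similarities
        np.fill_diagonal(sims, -1.0)        # ignore self-similarity
        if sims.max() > tau:                # constraint violated: stop
            break
        best_labels = km.labels_
    return best_labels
```

For a word with two clearly separated senses, the loop typically stops at K = 3 (two centers of a split sense nearly coincide) and returns the K = 2 assignment.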
We pick τ based on user-provided seed information instead of hand-tuning. As mentioned, we make two "majority" assumptions: (1) for any seed word, the majority of its occurrences follow the interpretation intended by the user; and (2) the majority of the seed words are not ambiguous: they have only one interpretation. Therefore, for each seed word s, we take the median of the pairwise cosine similarities between its occurrences.
Then, we take the median of these medians over all seed words as τ. Mathematically,

τ = median_{s ∈ S} ( median_{1 ≤ i < j ≤ n_s} cos(b_{s_i}, b_{s_j}) )

where s_1, ..., s_{n_s} denote the occurrences of seed word s. This nested median makes the choice of τ safe and robust to outliers. For example, consider the word "windows" in the 20Newsgroup corpus. The word "windows" has two interpretations in this corpus: one represents an opening in a wall and the other an operating system. We first compute the pairwise similarities between all its occurrences and plot the histogram as shown in Figure 2(a). From this plot, we can see that its median value is about 0.7. We apply the same procedure to all seed words and obtain τ by the nested median above; τ is calculated to be 0.82. Based on this value, we gradually increase K for "windows", ending up with K = 2. We visualize its K-Means clustering results using t-SNE (Maaten and Hinton, 2008) in Figure 2(b). Similar results can be observed for the word "penalty", as shown in Figure 2(c). These examples demonstrate how our document contextualization works for each word.
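The nested-median computation of τ can be sketched as below, assuming each seed word's occurrence vectors are already unit-normalized; the dictionary layout is our own illustration.

```python
import numpy as np

def similarity_threshold(seed_occurrences):
    """tau = median over seed words of the median pairwise cosine
    similarity among that seed word's occurrence vectors.

    seed_occurrences: dict mapping seed word -> (n_s, dim) array of
    unit-normalized contextualized vectors of its occurrences."""
    per_word_medians = []
    for vecs in seed_occurrences.values():
        sims = vecs @ vecs.T                      # all pairwise cosines
        iu = np.triu_indices(len(vecs), k=1)      # keep pairs with i < j
        per_word_medians.append(np.median(sims[iu]))
    return float(np.median(per_word_medians))
```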
In practice, to make this more efficient, one can subsample the occurrences instead of enumerating all pairs in a brute-force manner.
Contextualized Corpus. The interpretation of each occurrence of w is decided by the cluster ID to which its contextualized representation belongs. Specifically, each occurrence w_i of the word w is replaced in the corpus by ŵ_i as follows:

ŵ_i = w      if K = 1
ŵ_i = w$j    if K > 1 and b_{w_i} belongs to cluster j

By applying this to all words and their occurrences, the corpus is contextualized. The pseudo-code for corpus contextualization is shown in Algorithm 1.
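The replacement step can be sketched as below, following the w$<cluster-id> notation that appears in the case study (e.g., "offer$0"). The `cluster_of` and `n_clusters` mappings are hypothetical names standing in for the clustering output.

```python
def contextualize(tokens, cluster_of, n_clusters):
    """Rewrite a tokenized document: each occurrence of a word with more
    than one interpretation becomes word$<cluster-id>; unambiguous words
    are left unchanged.

    cluster_of: dict mapping (word, position) -> cluster id.
    n_clusters: dict mapping word -> its number of interpretations K."""
    out = []
    for pos, w in enumerate(tokens):
        if n_clusters.get(w, 1) > 1:
            out.append(f"{w}${cluster_of[(w, pos)]}")
        else:
            out.append(w)
    return out
```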

Pseudo-Label and Text Classifier
We generate pseudo-labels for the unlabeled contextualized documents and train a classifier on these pseudo-labels, similar to many other weakly supervised methods (Agichtein and Gravano, 2000; Riloff et al., 2003; Kuipers et al., 2006; Tao et al., 2015). These two parts are not the focus of this paper; we briefly introduce them to make the paper self-contained.
Pseudo-Label Generation. There are several ways to generate pseudo-labels from seed words. As a proof of concept, we employ a simple but effective method based on counting. Each document is assigned the label whose seed words have the maximum aggregated term frequency. Let tf(ŵ, d) denote the term frequency of a contextualized word ŵ in the contextualized document d and S_c the set of seed words of class c; the document d is assigned the label l(d) as follows:

l(d) = argmax_c Σ_{ŵ ∈ S_c} tf(ŵ, d)

Document Classifier. Our framework is compatible with any text classification model. We use Hierarchical Attention Networks (HAN) (Yang et al., 2016) as an example in our implementation. HAN considers the hierarchical structure of documents (document, sentences, words) and includes an attention mechanism that finds the most important words and sentences in a document while taking the context into consideration. There are two levels of attention: word-level attention identifies the important words in a sentence, and sentence-level attention identifies the important sentences in a document. The overall architecture of HAN is shown in Figure 3. We train a HAN model on the contextualized corpus with the generated pseudo-labels. The predicted labels are used in seed expansion and disambiguation.
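The counting-based pseudo-label generation above amounts to a one-line argmax; a minimal sketch (function and argument names are our own):

```python
from collections import Counter

def pseudo_label(doc_tokens, seed_words):
    """Assign the class whose (contextualized) seed words have the highest
    aggregated term frequency in the document.

    doc_tokens: list of tokens of one contextualized document.
    seed_words: dict mapping class name -> list of seed words S_c."""
    tf = Counter(doc_tokens)
    return max(seed_words, key=lambda c: sum(tf[s] for s in seed_words[c]))
```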

Seed Expansion and Disambiguation
Seed Expansion. Given the contextualized documents and their predicted class labels, we propose to rank contextualized words and add the top few into the seed word sets. The core element of this process is the ranking function. An ideal seed word for class C_j is an unusual word that appears with significant frequency only in the documents belonging to C_j. Hence, for a given class C_j and a word w, we measure its ranking score based on the following three aspects:
• Label-Indicative. Since our pseudo-label generation follows the presence of seed words in a document, ideally the posterior probability of a document belonging to class C_j after observing the presence of word w (i.e., P(C_j | w)) should be very close to 100%. Therefore, we use P(C_j | w) as our label-indicative measure:

LI(C_j, w) = P(C_j | w) = f_{C_j,w} / f_{C_j}

where f_{C_j} refers to the total number of documents predicted as class C_j, and among them, f_{C_j,w} documents contain the word w. All these counts are based on the prediction results on the input unlabeled documents.
• Frequent. Ideally, a seed word of class C_j appears with significant frequency in the documents belonging to C_j. To measure this, we first compute the average frequency of w in the documents predicted as class C_j. Since the average frequency is unbounded, we apply the tanh function to scale it, resulting in the frequency score

F(C_j, w) = tanh( f_{C_j}(w) / f_{C_j} )

where f_{C_j}(w) is the total frequency of word w in documents predicted as class C_j.
• Unusual. We want our highly label-indicative and frequent words to be unusual. To incorporate this, we consider the inverse document frequency (IDF). Let n be the number of documents in the corpus D and f_{D,w} the document frequency of word w; the IDF of w is computed as follows:

IDF(w) = log( n / f_{D,w} )

Similar to previous work (Tao et al., 2015), we combine these three measures using the geometric mean, resulting in the ranking score of a word w for class C_j:

R(C_j, w) = ( LI(C_j, w) · F(C_j, w) · IDF(w) )^{1/3}
Based on this aggregated score, we add the top words to expand the seed word set of class C_j.
Seed Disambiguation. While the majority of user-provided seed words are clean, some may have multiple interpretations in the given corpus. We propose to disambiguate them based on the ranking. We first consider all possible interpretations of an initial seed word, generate the pseudo-labels, and train a classifier. Using the classified documents and the ranking function, we rank all possible interpretations of the same initial seed word. Because the majority of occurrences of a seed word are assumed to belong to the user-specified class, the intended interpretation should be ranked the highest. Therefore, we retain only the top-ranked interpretation of each seed word.
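Putting the three measures together, the ranking score can be sketched as below. The parameter names are our own shorthand for the counts defined above, and the inputs are assumed to come from the classifier's predictions on the unlabeled corpus.

```python
import math

def ranking_score(f_c, f_cw, freq_cw, n_docs, df_w):
    """Geometric mean of label-indicativeness, scaled frequency, and IDF.

    f_c:     number of documents predicted as class C_j
    f_cw:    number of those documents containing word w
    freq_cw: total frequency of w in documents predicted as C_j
    n_docs:  total number of documents in the corpus
    df_w:    document frequency of w over the whole corpus"""
    label_indicative = f_cw / f_c              # P(C_j | w)
    frequent = math.tanh(freq_cw / f_c)        # tanh-scaled average frequency
    unusual = math.log(n_docs / df_w)          # inverse document frequency
    return (label_indicative * frequent * unusual) ** (1.0 / 3.0)
```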
After this step, we have fully contextualized our weak supervision, including the initial user-provided seed words.

Experiments
In this section, we evaluate our framework and a wide range of compared methods on coarse- and fine-grained text classification tasks under the weakly supervised setting.

Datasets
Following previous work (Tao et al., 2015), we use two news datasets in our experiments. The dataset statistics are provided in Table 1. Here are some details.
• The New York Times (NYT): The NYT dataset contains news articles written and published by The New York Times. These articles are classified into 5 coarse-grained genres (e.g., arts, sports) and 25 fine-grained categories (e.g., dance, music, hockey, basketball).
• The 20 Newsgroups (20News): The 20News dataset 2 is a collection of newsgroup documents partitioned into 6 coarse-grained groups (e.g., recreation, computers) and 20 fine-grained classes (e.g., graphics, windows, baseball, hockey).
We perform coarse- and fine-grained classification on both the NYT and 20News datasets. The NYT dataset is imbalanced in both fine-grained and coarse-grained classification. 20News is nearly balanced in fine-grained classification but imbalanced in coarse-grained classification. Being aware of these facts, we adopt micro- and macro-F1 scores as evaluation metrics.

Compared Methods
We compare our framework with a wide range of methods described below:
• IR-TF-IDF treats the seed word set of each class as a query. The relevance of a document to a label is computed by the aggregated TF-IDF values of the label's seed words. The label with the highest relevance is assigned to each document.
• Dataless (Chang et al., 2008) uses only label surface names as supervision and leverages Wikipedia to derive vector representations of labels and documents. Each document is labeled based on the document-label similarity.
• Word2Vec first learns word vector representations (Mikolov et al., 2013) for all terms in the corpus and derives label representations by aggregating the vectors of the respective seed words. Finally, each document is labeled with the most similar label based on cosine similarity.
• Doc2Cube (Tao et al., 2015) considers label surface names as the seed set and performs multidimensional document classification by learning dimension-aware embeddings.
• WeSTClass (Meng et al., 2018) leverages seed information to generate pseudo documents and refines the model through a self-training module that bootstraps on real unlabeled documents.
We denote our framework as ConWea, which includes corpus contextualization, seed word disambiguation, and iterative classification & keyword expansion. Besides, we have three ablated versions: ConWea-NoCon refers to the variant of ConWea trained without corpus contextualization; ConWea-NoSeedExp is the variant of ConWea without the seed expansion module; ConWea-WSD refers to the variant of ConWea with the contextualization module replaced by the Lesk algorithm (Lesk, 1986), a classic word sense disambiguation (WSD) algorithm.

2 http://qwone.com/~jason/20Newsgroups/
We also present the results of HAN-Supervised, under the supervised setting, for reference. We use an 80-10-10 train-validation-test split and report its results on the test set. All weakly supervised methods are evaluated on the entire datasets.

Experiment Settings
We use the pre-trained BERT-base-uncased model to obtain contextualized word representations. Following Devlin et al. (2019), we concatenate the averaged word-piece vectors of the last four layers.
The seed words are obtained as follows: we asked 5 human experts to nominate 5 seed words per class, and then considered the majority words (i.e., > 3 nominations) as our final set of seed words. For every class, we mainly use the label surface name as the seed word. For some multi-word class labels (e.g., "international business"), we have multiple seed words, but never more than four per class. The same seed words are used for all compared methods for fair comparison.
For ConWea, we set T = 10. For any method using word embeddings, we set the dimension to 100. We use the public implementations of WeSTClass and Dataless with the hyper-parameters mentioned in their original papers.

Performance Comparison
We summarize the evaluation results of all methods in Table 2. As one can observe, our proposed framework achieves the best performance among all the compared weakly supervised methods. We discuss the effectiveness of ConWea as follows:
• Our proposed framework ConWea outperforms all the other methods by significant margins. By contextualizing the corpus and resolving the interpretations of seed words, ConWea achieves strong performance, demonstrating the necessity and the importance of using contextualized weak supervision.
• We observe that in fine-grained classification, the advantages of ConWea over other methods are even more significant. This can be attributed to the contextualization of the corpus and seed words: the subtle ambiguity between words that hampers other methods is something ConWea can distinguish and predict correctly once the corpus is properly contextualized.
• The comparison between ConWea and the ablation ConWea-NoSeedExp demonstrates the effectiveness of our seed expansion. For example, for fine-grained labels on the 20News dataset, the seed expansion improves the micro-F1 score from 0.58 to 0.65.
• The comparison between ConWea and the two ablations ConWea-NoCon and ConWea-WSD demonstrates the effectiveness of our contextualization. Our contextualization, building upon BERT (Devlin et al., 2019), is adaptive to the input corpus, without requiring any additional human annotations. In contrast, WSD methods (e.g., Lesk, 1986) are typically trained for a general domain. If one wants to apply WSD to a specific corpus, additional annotated training data might be required to match our performance, which defeats the purpose of a weakly supervised setting. Therefore, we believe that our contextualization module has unique advantages. Our experimental results further confirm this reasoning empirically.
For example, for coarse-grained labels on the 20News dataset, the contextualization improves the micro-F1 score from 0.53 to 0.62.
• We observe that ConWea performs quite close to the supervised method, for example, on the NYT dataset. This demonstrates that ConWea is quite effective in closing the performance gap between the weakly supervised and supervised settings.

Parameter Study
The only hyper-parameter in our algorithm is T, the number of iterations of iterative expansion and classification. We conduct experiments to study the effect of the number of iterations on performance. The plot of performance w.r.t. the number of iterations is shown in Figure 4. We observe that the performance increases initially and gradually converges after 4 or 5 iterations, at which point the expanded seed words remain almost unchanged. While there is some fluctuation, a reasonably large T, such as T = 10, is a good choice.

Number of Seed Words
We vary the number of seed words per class and plot the F1 score in Figure 5. One can observe that, in general, the performance increases as the number of seed words increases. There is a slightly different pattern on the 20News dataset when the labels are fine-grained. We conjecture that this is caused by the subtlety of seed words in fine-grained cases: additional seed words may bring some noise. Overall, three seed words per class are enough for reasonable performance.

Case Study
We present a case study to showcase the power of contextualized weak supervision. Specifically, we investigate the differences between the expanded seed words in the plain corpus and the contextualized corpus over iterations. Table 3 shows a column-by-column comparison for the class For Sale on the 20News dataset. The class For Sale refers to documents advertising goods for sale. Starting with the same seed sets in both corpora, we observe from Table 3 that in the second iteration "space" becomes part of the expanded seed set in the plain corpus. Here "space" has two interpretations: one stands for the physical universe beyond the Earth and the other for an area of land. This error gets propagated and amplified over the iterations, further introducing wrong seed words like "nasa", "shuttle" and "moon", related to the first interpretation. The seed set for the contextualized corpus avoids this problem and adds only the words with appropriate interpretations. Also, one can see that the initial seed word "offer" has been disambiguated as "offer$0".

Related Work
We review the literature about (1) weak supervision for text classification methods, (2) contextualized representation learning techniques, (3) document classifiers, and (4) word sense disambiguation.

Weak Supervision for Text Classification
Weak supervision has been studied for building document classifiers in various forms, including hundreds of labeled training documents (Tang et al., 2015; Miyato et al., 2016; Xu et al., 2017), class/category names (Song and Roth, 2014; Tao et al., 2015; Li et al., 2018), and user-provided seed words (Tao et al., 2015). In this paper, we focus on user-provided seed words as the source of weak supervision. Along this line, Doc2Cube (Tao et al., 2015) expands label keywords from label surface names and performs multidimensional document classification by learning dimension-aware embeddings; PTE (Tang et al., 2015) utilizes both labeled and unlabeled documents to learn task-specific text embeddings, which are later fed to logistic regression classifiers; WeSTClass (Meng et al., 2018) leverages seed information to generate pseudo documents and introduces a self-training module that bootstraps on real unlabeled data to refine the model. This method was later extended to handle hierarchical classification based on a pre-defined label taxonomy (Meng et al., 2019). However, all these forms of weak supervision are context-free. Here, we propose to use contextualized weak supervision.

Contextualized Word Representations
Contextualized word representations originated in machine translation (MT). CoVe (McCann et al., 2017) generates contextualized representations for a word based on pre-trained MT models. More recently, ELMo (Peters et al., 2018) leverages neural language models to replace the MT models, which removes the dependency on massive parallel texts and takes advantage of nearly unlimited raw corpora. Many models leveraging language modeling to build sentence representations (Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2019) emerged at almost the same time. Language models have also been extended to the character level (e.g., Akbik et al., 2018), which can generate contextualized representations for character spans. Our proposed framework is compatible with all the above contextualized representation techniques. In our implementation, we choose BERT to demonstrate the power of using contextualized supervision.

Word Sense Disambiguation
Word Sense Disambiguation (WSD) is one of the challenging problems in natural language processing. Typical WSD models (Lesk, 1986; Zhong and Ng, 2010; Yuan et al., 2016; Raganato et al., 2017; Le et al., 2018; Tripodi and Navigli, 2019) are trained for a general domain. Recent works (Li and Jurafsky, 2015; Mekala et al., 2016; Gupta et al., 2019) have also shown that machine-interpretable representations of words that consider their senses improve document classification. However, if one wants to apply WSD to a specific corpus, additional annotated training data might be required to match our performance, which defeats the purpose of a weakly supervised setting.
In contrast, our contextualization, building upon BERT (Devlin et al., 2019), is adaptive to the input corpus without requiring any additional human annotations. Therefore, our framework is more suitable than WSD under the weakly supervised setting. Our experimental results have verified this reasoning and showed the superiority of our contextualization module over WSD in weakly supervised document classification tasks.

Document Classifier
The document classification problem has long been studied. In our implementation of the proposed ConWea framework, we use HAN (Yang et al., 2016), which considers the hierarchical structure of documents and includes attention mechanisms to find the most important words and sentences in a document. CNN-based text classifiers (Kim, 2014; Lai et al., 2015) are also popular and achieve strong performance.
Our framework is compatible with all the above text classifiers. We choose HAN just for a demonstration purpose.

Conclusions and Future Work
In this paper, we proposed ConWea, a novel contextualized weakly supervised classification framework. Our method leverages contextualized representation techniques and initial user-provided seed words to contextualize the corpus. This contextualized corpus is further used to resolve the interpretation of seed words through iterative seed word expansion and document classifier training. Experimental results demonstrate that our model outperforms previous methods significantly, thereby signifying the superiority of contextualized weak supervision, especially when labels are fine-grained.
In the future, we are interested in generalizing contextualized weak supervision to hierarchical text classification problems. Currently, we perform coarse- and fine-grained classifications separately; there should be more useful information embedded in the tree structure of the label hierarchy. Also, extending our method to other types of textual data, such as short texts, multi-lingual data, and code-switched data, is a potential direction.