Self-Training with Weak Supervision

State-of-the-art deep neural networks require large-scale labeled training data that is often expensive to obtain or not available for many tasks. Weak supervision in the form of domain-specific rules has been shown to be useful in such settings to automatically generate weakly labeled training data. However, learning with weak rules is challenging due to their inherent heuristic and noisy nature. An additional challenge is rule coverage and overlap, where prior work on weak supervision only considers instances that are covered by weak rules, thus leaving valuable unlabeled data behind. In this work, we develop a weak supervision framework (ASTRA) that leverages all the available data for a given task. To this end, we leverage task-specific unlabeled data through self-training with a model (student) that considers contextualized representations and predicts pseudo-labels for instances that may not be covered by weak rules. We further develop a rule attention network (teacher) that learns how to aggregate student pseudo-labels with weak rule labels, conditioned on their fidelity and the underlying context of an instance. Finally, we construct a semi-supervised learning objective for end-to-end training with unlabeled data, domain-specific rules, and a small amount of labeled data. Extensive experiments on six benchmark datasets for text classification demonstrate the effectiveness of our approach with significant improvements over state-of-the-art baselines.

Introduction
The success of state-of-the-art neural networks crucially hinges on the availability of large amounts of annotated training data. While recent advances in language model pre-training (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019) reduce the annotation bottleneck, they still require large amounts of labeled data to obtain state-of-the-art performance on downstream tasks. However, it is prohibitively expensive to obtain large-scale labeled data for every new task, posing a significant challenge for supervised learning.

* Most of the work was done while the first author was an intern at Microsoft Research.
1 ASTRA: weAkly-supervised Self-TRAining. Our code is publicly available at https://github.com/microsoft/ASTRA.

Figure 1: ASTRA leverages domain-specific rules, a large amount of (task-specific) unlabeled data, and a small amount of labeled data via iterative self-training.

In order to mitigate labeled data scarcity, recent works have tapped into weak or noisy sources of supervision, such as regular expression patterns (Augenstein et al., 2016), class-indicative keywords (Ren et al., 2018b; Karamanolakis et al., 2019), alignment rules over existing knowledge bases (Mintz et al., 2009; Xu et al., 2013), or heuristic labeling functions (Ratner et al., 2017; Bach et al., 2019; Badene et al., 2019; Awasthi et al., 2020). These different types of sources can be used as weak rules for heuristically annotating large amounts of unlabeled data. For instance, consider the question type classification task from the TREC dataset with regular expression patterns such as: label all questions containing the token "when" as numeric (e.g., "When was Shakespeare born?"). Approaches relying on such weak rules typically suffer from the following challenges. (i) Noise: rules by their heuristic nature rely on shallow patterns and may predict wrong labels for many instances. For example, the question "When would such a rule be justified?" refers to circumstances rather than numeric expressions. (ii) Coverage: rules generally have low coverage, as they assign labels to only specific subsets of instances. (iii) Conflicts: different rules may generate conflicting predictions for the same instance, making it challenging to train a robust classifier.
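To make the rule format concrete, below is a minimal sketch of the "when" pattern as a labeling function in Python; the function name and label strings are illustrative, not taken from the benchmark rule sets.

```python
import re
from typing import Optional

def lf_when(question: str) -> Optional[str]:
    """Weak rule sketch: label questions containing the token "when" as numeric.

    Returns a class name, or None when the rule does not apply (abstains).
    Illustrative only; the benchmark rules come from Ratner et al. (2017)
    and Awasthi et al. (2020).
    """
    if re.search(r"\bwhen\b", question, flags=re.IGNORECASE):
        return "NUM"  # numeric-value question type
    return None  # rule abstains

print(lf_when("When was Shakespeare born?"))            # NUM (correct)
print(lf_when("When would such a rule be justified?"))  # NUM (noisy label)
print(lf_when("Who wrote Hamlet?"))                     # None (not covered)
```

The three calls illustrate exactly the three challenges above: a correct firing, a noisy firing, and an uncovered instance.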
To address the challenges of conflicting and noisy rules, existing approaches learn weights indicating how much to trust individual rules. In the absence of large-scale manual annotations, the rule weights are usually learned via the mutual agreement and disagreement of rules over unlabeled data (Ratner et al., 2017; Platanios et al., 2017; Sachan et al., 2018; Bach et al., 2019; Ratner et al., 2019; Awasthi et al., 2020). For instance, such techniques up-weight rules that agree with each other (as they are more likely to be correct) and down-weight rules that disagree. An important drawback of these approaches is low coverage: since rules assign weak labels to only a subset of the data, there is little rule overlap from which to compute agreement. For instance, in our experiments on six real-world datasets, we observe that 66% of the instances are covered by fewer than 2 rules and 40% of the instances are not covered by any rule at all. Rule sparsity limits the effectiveness of previous approaches and leads to strong assumptions, such as that each rule has the same weight across all instances (Ratner et al., 2017; Bach et al., 2019; Ratner et al., 2019), or that additional supervision is available in the form of labeled "exemplars" used to create the rules in the first place (Awasthi et al., 2020). Most importantly, all these works ignore (as a data pre-processing step) unlabeled instances that are not covered by any of the rules, thus leaving potentially valuable data behind.
Overview of our method. In this work, we present a weak supervision framework, namely ASTRA, that considers all task-specific unlabeled instances and domain-specific rules without strong assumptions about the nature or source of the rules. ASTRA makes effective use of a small amount of labeled data, a large amount of task-specific unlabeled data, and domain-specific rules through iterative teacher-student co-training (see Figure 1). A student model based on contextualized representations provides pseudo-labels for all instances, thereby allowing us to leverage all unlabeled data, including instances that are not covered by any heuristic rules. To deal with the noisy nature of the heuristic rules and the student's pseudo-labels, we develop a rule attention (teacher) network that learns to predict the fidelity of these rules and pseudo-labels conditioned on the context of the instances to which they apply. We develop a semi-supervised learning objective based on minimum entropy regularization to learn all of the above jointly, without requiring additional rule-exemplar supervision.
Overall, we make the following contributions:
• We propose an iterative self-training mechanism for training deep neural networks with weak supervision by making effective use of task-specific unlabeled data and domain-specific heuristic rules. The self-trained student model's predictions augment the weak supervision framework with instances that are not covered by rules.
• We propose a rule attention teacher network (RAN) for combining multiple rules and student model predictions with instance-specific weights conditioned on the corresponding contexts. Furthermore, we construct a semi-supervised learning objective for training RAN without strong assumptions about the structure or nature of the weak rules.
• We demonstrate the effectiveness of our approach on several benchmark datasets for text classification, where our method significantly outperforms state-of-the-art weak supervision methods.

Self-Training with Weak Supervision
We now present our approach, ASTRA, which leverages a small amount of labeled data, a large amount of unlabeled data, and domain-specific heuristic rules. Our architecture has two main components: the base student model (Section 2.1) and the rule attention teacher network (Section 2.2), which are iteratively co-trained in a self-training framework. Formally, let $\mathcal{X}$ denote the instance space and $\mathcal{Y} = \{1, \dots, K\}$ denote the label space for a $K$-class classification task. We consider a small set of manually-labeled examples $D_L = \{(x_l, y_l)\}$, where $x_l \in \mathcal{X}$ and $y_l \in \mathcal{Y}$, and a large set of unlabeled examples $D_U = \{x_i\}$. We also consider a set of pre-defined heuristic rules $R = \{r_j\}$, where each rule $r_j$ has the general form of a labeling function that takes as input an instance $x_i \in \mathcal{X}$ (and potentially additional side information), and either assigns a weak label $q_i^j \in \{0, 1\}^K$ (one-hot encoding) or does not apply, i.e., does not assign a label for $x_i$. Our goal is to leverage $D_L$, $D_U$, and $R$ to train a classifier that, given an unseen test instance $x \in \mathcal{X}$, predicts a label $y \in \mathcal{Y}$. In the rest of this section, we present our ASTRA framework for addressing this problem.
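As a rough illustration of this setup, the following sketch (with hypothetical helper names) represents each rule $r_j$ as a function returning a class index or None, and collects the one-hot weak labels $q_i^j$ for an instance; it is a reading aid, not the ASTRA implementation.

```python
import numpy as np

K = 6  # e.g., a 6-class task such as TREC-6

def one_hot(label: int, num_classes: int = K) -> np.ndarray:
    """One-hot encoding of a weak label, q_i^j in {0,1}^K."""
    q = np.zeros(num_classes)
    q[label] = 1.0
    return q

def apply_rules(x, rules):
    """Return the weak labels of all rules that fire on instance x.

    Each rule maps an instance to a class index in {0, ..., K-1}, or to
    None when it abstains; the returned list corresponds to the rule set
    R_i and may be empty (an uncovered instance).
    """
    weak_labels = []
    for r in rules:
        y = r(x)
        if y is not None:
            weak_labels.append(one_hot(y))
    return weak_labels
```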

Base Student Model
Our self-training framework starts with a base model trained on the available small labeled set $D_L$. The model is then applied to unlabeled data $D_U$ to obtain pseudo-labeled instances. In classic self-training (Riloff, 1996; Nigam and Ghani, 2000), the student model's pseudo-labeled instances are directly used to augment the training dataset and iteratively re-train the student. In our setting, we augment the self-training process with weak labels drawn from our teacher model that also considers the rules in $R$ (described in the next section). The overall self-training process can be formulated as:

$$\min_\theta \; \mathbb{E}_{(x_l, y_l) \in D_L} \big[ -\log p_\theta(y_l \mid x_l) \big] \; + \; \lambda \, \mathbb{E}_{x \in D_U} \, \mathbb{E}_{y \sim q_{\phi^*}(y \mid x)} \big[ -\log p_\theta(y \mid x) \big] \quad (1)$$

where $p_\theta(y|x)$ is the conditional distribution under the student's parameters $\theta$; $\lambda \in \mathbb{R}$ is a hyper-parameter controlling the relative importance of the two terms; and $q_{\phi^*}(y \mid x)$ is the conditional distribution under the teacher's parameters $\phi^*$ from the last iteration, which is fixed in the current iteration.
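A minimal PyTorch-style sketch of this objective, assuming batched student logits and fixed teacher soft labels as inputs (tensor names are ours):

```python
import torch
import torch.nn.functional as F

def student_loss(logits_l, y_l, logits_u, q_teacher, lam=1.0):
    """Self-training objective for the student (a sketch of Eq. (1)).

    logits_l:  student logits on a labeled batch, shape (B_l, K)
    y_l:       gold labels, shape (B_l,), dtype long
    logits_u:  student logits on an unlabeled batch, shape (B_u, K)
    q_teacher: fixed soft labels q_{phi*}(y|x) from the previous teacher
               iteration, shape (B_u, K)
    lam:       hyper-parameter weighting the unlabeled term
    """
    sup = F.cross_entropy(logits_l, y_l)
    # Cross-entropy of the student against the teacher's soft distribution.
    unsup = -(q_teacher * F.log_softmax(logits_u, dim=-1)).sum(-1).mean()
    return sup + lam * unsup
```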

Rule Attention Teacher Network (RAN)
Our Rule Attention Teacher Network (RAN) aggregates multiple weak sources of supervision with trainable weights and computes a soft weak label $q_i$ for an unlabeled instance $x_i$. One potential drawback of relying only on heuristic rules is that a lot of data is left behind: heuristic rules by nature (e.g., regular expression patterns, keywords) apply to only a subset of the data. Therefore, a substantial number of instances are not covered by any rules and thus are not considered in prior weakly supervised learning approaches (Ratner et al., 2017; Awasthi et al., 2020). To address this challenge and leverage contextual information from all available task-specific unlabeled data, we use the pseudo-labels predicted by the base student model (from Section 2.1). To this end, we apply the student to the unlabeled data $x \in D_U$ and obtain pseudo-label predictions $p_\theta(y|x)$. These predictions augment the set of already available weak rule labels and increase coverage.
Let $R_i \subset R$ be the set of all heuristic rules that apply to instance $x_i$. The objective of RAN is to aggregate the weak labels predicted by all rules $r_j \in R_i$ and the student pseudo-label $p_\theta(y|x_i)$ to compute a soft label $q_i$ for every instance $x_i$ in the unlabeled set $D_U$. In other words, RAN treats the student as an additional weak rule. Aggregating all rule labels into a single label $q_i$ via simple majority voting (i.e., predicting the label assigned by the majority of rules) may not be effective, as it treats all rules equally, while in practice certain rules are more accurate than others.
RAN predicts pseudo-labels $q_i$ by aggregating rules with trainable weights $a_i^{(\cdot)} \in [0, 1]$ that capture their fidelity towards an instance $x_i$ as:

$$q_i = \frac{1}{Z_i} \Big( \sum_{j:\, r_j \in R_i} a_i^j \, q_i^j \; + \; a_i^S \, p_\theta(y \mid x_i) \; + \; a_i^u \, u \Big) \quad (2)$$

where $a_i^j$ and $a_i^S$ are the fidelity weights for the heuristic rule labels $q_i^j$ and the student-assigned pseudo-label $p_\theta(y|x_i)$ for an instance $x_i$, respectively; $u$ is a uniform rule distribution that assigns equal probabilities to all $K$ classes, $u = [\frac{1}{K}, \dots, \frac{1}{K}]$; $a_i^u$ is the weight assigned to the "uniform rule" for $x_i$, computed as a function of the rest of the rule weights: $a_i^u = |R_i| + 1 - \sum_{j:\, r_j \in R_i} a_i^j - a_i^S$; and $Z_i$ is a normalization coefficient ensuring that $q_i$ is a valid probability distribution. Here, $u$ acts as a uniform smoothing factor that prevents overfitting in sparse settings, for instance, when a single weak rule applies to an instance.
According to Eq. (2), a rule $r_j$ with a higher fidelity weight $a_i^j$ contributes more to the computation of $q_i$. If $a_i^j = 1 \;\forall r_j \in R_i \cup \{p_\theta\}$, then RAN reduces to majority voting. If $a_i^j = 0 \;\forall r_j \in R_i \cup \{p_\theta\}$, then RAN ignores all rules and predicts $q_i = u$. Note the distinction of our setting from recent works like Snorkel (Ratner et al., 2017), which learns global rule weights $a_i^j = a^j \;\forall x_i$, ignoring instance-specific rule fidelity. Our proposed approach is flexible but at the same time challenging, as we do not assume prior knowledge of the internal structure of the labeling functions $r_j \in R$.
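The aggregation in Eq. (2), including the uniform-rule weight and normalization, can be sketched in a few lines of numpy (argument names are ours):

```python
import numpy as np

def ran_aggregate(rule_labels, a_rules, p_student, a_student, K):
    """Aggregate weak labels into a soft teacher label q_i (Eq. (2) sketch).

    rule_labels: list of one-hot arrays q_i^j for the rules in R_i
    a_rules:     fidelity weights a_i^j in [0, 1], one per rule
    p_student:   student pseudo-label distribution p_theta(y|x_i), shape (K,)
    a_student:   student fidelity weight a_i^S in [0, 1]
    """
    u = np.full(K, 1.0 / K)  # uniform smoothing "rule"
    # The uniform rule's weight grows as the other weights shrink;
    # it is non-negative because every fidelity weight is at most 1.
    a_u = len(rule_labels) + 1 - sum(a_rules) - a_student
    q = sum(a * ql for a, ql in zip(a_rules, rule_labels))
    q = q + a_student * p_student + a_u * u
    return q / q.sum()  # normalization, playing the role of Z_i
```

With all fidelity weights set to 1 this reduces to (smoothed) majority voting, and with all weights set to 0 it returns the uniform distribution, matching the two limiting cases above.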
In order to effectively compute rule fidelities, RAN considers instance embeddings that capture the context of instances beyond the shallow patterns considered by rules. In particular, we model the weight $a_i^j$ of rule $r_j$ as a function of the context of the instance $x_i$ and $r_j$ through an attention-based mechanism. Consider $h_i \in \mathbb{R}^d$ to be the hidden state representation of $x_i$ from the base student model. Also, consider the (trainable) embedding of each rule $r_j$ to be $e_j = g(r_j) \in \mathbb{R}^d$. We use $e_j$ as a query vector with sigmoid attention to compute instance-specific rule attention weights as:

$$a_i^j = \sigma\big(f(h_i)^\top e_j\big) \quad (3)$$

where $f$ is a multi-layer perceptron that projects $h_i$ to $\mathbb{R}^d$ and $\sigma(\cdot)$ is the sigmoid function. Rule embeddings allow us to exploit the similarity between different rules in terms of the instances to which they apply, and to further leverage their semantics for modeling agreement. RAN computes the student's weight $a_i^S$ using the same procedure as for the rule weights $a_i^j$. Note that the rule predictions $q_i^j$ are considered fixed, while we estimate their attention weights. The above coupling between rules and instances via their corresponding embeddings $e_j$ and $h_i$ allows us to obtain representations where similar rules apply to similar contexts, and to model their agreements via the attention weights $a_i^j$. To this end, the trainable parameters of RAN ($f$ and $g$) are shared across all rules and instances. Next, we describe how to train RAN.

Figure 3: Variation in unsupervised entropy loss with instance-specific rule predictions and attention weights encouraging rule agreement. Consider this illustration with two rules for a given instance. When rule predictions disagree ($q^1 \neq q^2$), the minimum loss is achieved for attention weights $a^1{=}0, a^2{=}1$ or $a^1{=}1, a^2{=}0$. When rule predictions agree ($q^1 = q^2$), the minimum loss is achieved for attention weights $a^1{=}a^2{=}1$. For instances covered by three rules, if $q^1 = q^2 \neq q^3$, the minimum loss is achieved for $a^1{=}a^2{=}1$ and $a^3{=}0$.
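The sigmoid attention of Eq. (3) can be sketched as follows in PyTorch; the 128-dimensional rule embeddings follow the appendix, while the two-layer MLP for $f$ is our assumption:

```python
import torch
import torch.nn as nn

class RuleAttention(nn.Module):
    """Sigmoid attention over rules (a sketch of Eq. (3)).

    f projects the student's hidden state h_i into the rule embedding
    space; each rule r_j has a trainable embedding e_j = g(r_j).
    """
    def __init__(self, hidden_dim, num_rules, rule_dim=128):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(hidden_dim, rule_dim),
                               nn.ReLU(),
                               nn.Linear(rule_dim, rule_dim))
        self.rule_emb = nn.Embedding(num_rules, rule_dim)  # g

    def forward(self, h, rule_ids):
        # h: (B, hidden_dim); rule_ids: (B, R) indices of rules in R_i
        e = self.rule_emb(rule_ids)      # (B, R, rule_dim)
        proj = self.f(h).unsqueeze(1)    # (B, 1, rule_dim)
        # a_i^j = sigmoid(f(h_i) . e_j), computed independently per rule
        return torch.sigmoid((proj * e).sum(-1))  # (B, R), values in [0, 1]
```

Because the attention is a per-rule sigmoid rather than a softmax, the weights do not compete with each other: several agreeing sources can all receive weight close to 1.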

Semi-Supervised Learning of ASTRA
Learning to predict instance-specific weights $a_i^{(\cdot)}$ for the weak sources (including rules and student pseudo-labels) is challenging due to the absence of any explicit knowledge about source quality and the limited amount of labeled training data. We thus treat the weights $a_i^{(\cdot)}$ as latent variables and propose a semi-supervised objective for training RAN with supervision at the coarser level of $q_i$:

$$\min_\phi \; \mathbb{E}_{(x_i, y_i) \in D_L} \big[ -\log q_\phi(y_i \mid x_i) \big] \; + \; \mathbb{E}_{x_i \in D_U} \Big[ -\sum_{y \in \mathcal{Y}} q_\phi(y \mid x_i) \log q_\phi(y \mid x_i) \Big] \quad (4)$$

Given task-specific labeled data $D_L$, the first term in Eq. (4) minimizes the cross-entropy loss between the teacher's label $q_i$ and the corresponding clean label $y_i$ for the instance $x_i$. This term penalizes weak sources that assign labels $q_i^{(\cdot)}$ contradicting the ground-truth label $y_i$, by assigning them a low instance-specific fidelity weight $a_i^{(\cdot)}$. The second term in Eq. (4) minimizes the entropy of the aggregated pseudo-label $q_i$ on unlabeled data $D_U$. Minimum entropy regularization is effective in settings with small amounts of labeled data by leveraging unlabeled data (Grandvalet and Bengio, 2005), and is highly beneficial in our setting because it encourages RAN to predict weights that maximize rule agreement. Since the teacher label $q_i$ is obtained by aggregating weak labels $q_i^{(\cdot)}$, entropy minimization encourages RAN to predict higher instance-specific weights $a_i^{(\cdot)}$ for sources that agree in their labels for $x_i$, and lower weights when weak sources disagree, aggregated across all unlabeled instances. Figure 3 plots the minimum entropy loss over unlabeled data for two scenarios where two rules agree or disagree with each other for a given instance. The optimal instance-specific fidelity weights $a_i^{(\cdot)}$ are 1 when rules agree with each other, thereby assigning credit to both rules, and 1 for only one of them when they disagree. We use this unsupervised entropy loss in conjunction with the cross-entropy loss over labeled data to ensure grounding.

End-to-end Learning: Algorithm 1 presents an overview of our learning mechanism. We first use the small amount of labeled data to train a base student model that generates pseudo-labels to augment the heuristic rules over unlabeled data. Our RAN network computes fidelity weights to combine these different weak labels via minimum entropy regularization and obtain an aggregated pseudo-label for every unlabeled instance. This is used to re-train the student model, with the above student-teacher training repeated until convergence.

Algorithm 1: Self-training with Weak Supervision
Input: Small amount of labeled data $D_L$; task-specific unlabeled data $D_U$; weak rules $R$
Output: Student $p_\theta^*(\cdot)$, RAN teacher $q_\phi^*(\cdot)$
1: Train student $p_\theta(\cdot)$ using $D_L$
2: Repeat until convergence:
   2.1: Train teacher $q_\phi(\cdot)$ using $D_L$, $D_U$ through Eq. (2) and (4)
   2.2: Apply teacher $q_\phi(\cdot)$ to $D_U$ to compute aggregated pseudo-labels $q_i$
   2.3: Train student $p_\theta(\cdot)$ on $D_U$ with teacher pseudo-labels through Eq. (1) and fine-tune on $D_L$
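Eq. (4) translates almost directly into code; here is a hedged PyTorch sketch of the RAN training loss, where the teacher's soft labels (outputs of Eq. (2)) are assumed to be valid probability distributions passed in as tensors:

```python
import torch

def ran_loss(q_labeled, y_l, q_unlabeled, eps=1e-8):
    """Semi-supervised RAN objective (a sketch of Eq. (4)).

    q_labeled:   teacher soft labels on labeled instances, shape (B_l, K)
    y_l:         gold labels, shape (B_l,), dtype long
    q_unlabeled: teacher soft labels on unlabeled instances, shape (B_u, K)
    """
    # Cross-entropy between the aggregated label q_i and the gold y_i.
    ce = -torch.log(q_labeled.gather(1, y_l.unsqueeze(1)) + eps).mean()
    # Minimum entropy regularization over unlabeled data: favors
    # attention weights under which the weak sources agree.
    ent = -(q_unlabeled * torch.log(q_unlabeled + eps)).sum(-1).mean()
    return ce + ent
```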

Experiments
Datasets. We evaluate our framework on the following six benchmark datasets for weak supervision from Ratner et al. (2017) and Awasthi et al. (2020): (1) Question classification from TREC-6 into 6 categories (Abbreviation, Entity, Description, Human, Location, Numeric-value); (2) Spam classification of SMS messages; (3) Spam classification of YouTube comments; (4) Income classification on the CENSUS dataset: whether a person earns more than $50K or not; (5) Slot-filling on restaurant search queries in the MIT-R dataset: each token is classified into 9 classes (Location, Hours, Amenity, Price, Cuisine, Dish, Restaurant Name, Rating, Other); (6) Relation classification in the Spouse dataset: whether pairs of people mentioned in a sentence are/were married or not. Table 1 shows the dataset statistics along with the amounts of labeled data, unlabeled data, and domain-specific rules for each dataset. For a fair comparison, we use exactly the same set of rules as previous work on these benchmark datasets. The rules include regular expression patterns, lexicons, and knowledge bases for weak supervision. Most of the rules were constructed manually, except for the CENSUS dataset, where the rules were automatically extracted and have a coverage of 100%.
On average across all the datasets, 66% of the instances are covered by fewer than 2 rules, whereas 40% are not covered by any rule at all, demonstrating the sparsity of our setting. We also report the accuracy of the rules in terms of majority voting on the task-specific unlabeled datasets. Additional details on the datasets and examples of rules are presented in the Appendix.

Table 2: ASTRA learns rule-specific and instance-specific attention weights and leverages task-specific unlabeled data where no rules apply.

Method | Weights Rules | Weights Instances | Unlabeled (no rules)
Majority | - | - | -
Snorkel (Ratner et al., 2017) | ✓ | - | -
PosteriorReg (Hu et al., 2016) | ✓ | - | -
L2R (Ren et al., 2018a) | - | ✓ | -
ImplyLoss (Awasthi et al., 2020) | ✓ | ✓ | -
Self-train | - | - | ✓
ASTRA | ✓ | ✓ | ✓

Evaluation. We train ASTRA five times on five different random splits of the labeled training data and evaluate on held-out test data. We report the average performance as well as the standard deviation across runs. We report the same evaluation metrics as prior works (Ratner et al., 2017; Awasthi et al., 2020) for a fair comparison.

Model configuration. Our student model consists of embeddings from pre-trained language models like ELMO (Peters et al., 2018) or BERT (Devlin et al., 2019) for generating contextualized representations of an instance, followed by a softmax classification layer. The RAN teacher model consists of a rule embedding layer and a multi-layer perceptron that maps the contextualized representation of an instance to the rule embedding space. Refer to the Appendix for more details.

Baselines. We compare our method with the following methods: (a) Majority predicts the majority vote of the rules, with ties resolved by predicting a random class. (b) LabeledOnly trains classifiers using only labeled data (fully supervised baseline). (c) Self-train (Nigam and Ghani, 2000; Lee, 2013) leverages both labeled and unlabeled data for iterative self-training on pseudo-labeled predictions over task-specific unlabeled data; this baseline ignores domain-specific rules. (d) Snorkel (Ratner et al., 2017) learns rule-specific weights by modeling rule agreement over unlabeled data (we also compare against Snorkel+Labeled, which additionally uses the labeled data). (e) PosteriorReg (Hu et al., 2016) trains classifiers using rules as soft constraints via posterior regularization. (f) L2R (Ren et al., 2018a) learns to re-weight instances. (g) ImplyLoss (Awasthi et al., 2020) leverages exemplar-based supervision as additional knowledge for learning instance-specific and rule-specific weights by minimizing an implication loss over unlabeled data; this requires maintaining a record of all instances used to create the weak rules in the first place. Table 2 summarizes the different methods, contrasting how they learn weights (rule-specific or instance-specific) and whether they leverage task-specific unlabeled data not covered by any rules.
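For reference, baseline (a) amounts to the following small sketch, with ties broken uniformly at random as described above:

```python
import random
from collections import Counter

def majority_vote(rule_predictions, num_classes):
    """Majority baseline: most common rule label, with random tie-breaking.

    rule_predictions: list of class indices from the rules that fired on
    the instance (may be empty, in which case we guess a random class).
    """
    if not rule_predictions:
        return random.randrange(num_classes)
    counts = Counter(rule_predictions)
    top = max(counts.values())
    tied = [c for c, n in counts.items() if n == top]
    return random.choice(tied)
```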

Experimental Results
Overall results. Table 3 summarizes the main results across all datasets. Among the semi-supervised methods that leverage weak supervision from domain-specific rules, ASTRA outperforms Snorkel by 6.1% in average accuracy across all datasets by learning instance-specific rule weights in conjunction with self-training over unlabeled instances where weak rules do not apply. Similarly, ASTRA improves over ImplyLoss, a recent work and the best-performing baseline, by 3.1% on average. Notably, in contrast to ImplyLoss, our method does not require additional supervision at the level of the exemplars used to create rules.

Self-training over unlabeled data. Recent works for tasks like image classification (Li et al., 2019; Xie et al., 2020; Zoph et al., 2020), neural sequence generation (Zhang and Zong, 2016; He et al., 2019), and few-shot text classification (Mukherjee and Awadallah, 2020; Wang et al., 2020) show the effectiveness of self-training methods in exploiting task-specific unlabeled data with stochastic regularization techniques like dropout and data augmentation. We make similar observations for our weakly supervised tasks, where classic self-training ("Self-train"), leveraging only a few task-specific labeled examples and lots of unlabeled data, outperforms weakly supervised methods like Snorkel and PosteriorReg that have additional access to domain-specific rules.

Self-training with weak supervision. Our framework ASTRA provides an efficient method to incorporate weak supervision from domain-specific rules into the self-training framework and improves by 6% over classic self-training.
To better understand the benefits of our approach compared to classic self-training, consider Figure 4, which depicts the gradual performance improvement over iterations. The student models in classic self-training and ASTRA have exactly the same architecture. However, the latter is guided by a better teacher (RAN) that learns to aggregate noisy rules and pseudo-labels over unlabeled data.

Impact of rule sparsity and coverage for weak supervision. In this experiment, we compare the performance of various methods while varying the proportion of available domain-specific rules. To this end, we randomly choose a subset of the rules (varying the proportion from 10% to 100%) and train the various weak supervision methods. For each setting, we repeat experiments with multiple rule splits and report aggregated results in Figure 5. We observe that ASTRA is effective across all settings, with the most impact at high levels of rule sparsity. For instance, with 10% of the domain-specific rules available, ASTRA outperforms ImplyLoss by 12% and Snorkel+Labeled by 19%. This improvement is made possible by incorporating self-training in our framework to obtain pseudo-labels for task-specific unlabeled instances, and further re-weighting them against the domain-specific rules via the rule attention network. Correspondingly, Table 4 shows the increase in data coverage for every task, given by the proportion of unlabeled instances that are covered by at least two weak sources (multiple rules and pseudo-labels), in contrast to considering the rules alone. Table 5 reports ablation experiments that evaluate the impact of the various components of ASTRA.
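The sparsity and coverage statistics above (and the coverage numbers in Table 4) boil down to counting weak sources per instance; a simple sketch, assuming a non-empty list of per-instance source counts:

```python
def coverage_stats(num_sources_per_instance):
    """Fraction of instances covered by 0 and by fewer than 2 weak sources.

    num_sources_per_instance: list with the number of weak sources (rules,
    plus the student pseudo-label if counted) firing on each instance.
    """
    n = len(num_sources_per_instance)
    uncovered = sum(1 for k in num_sources_per_instance if k == 0) / n
    sparse = sum(1 for k in num_sources_per_instance if k < 2) / n
    return uncovered, sparse
```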

Ablation Study
The ASTRA teacher marginally outperforms the student model in aggregate, as it has access to the domain-specific rules. The ASTRA student, which is self-trained over task-specific unlabeled data and guided by an efficient teacher model, significantly outperforms the other state-of-the-art baselines.
Through minimum entropy regularization in our semi-supervised learning objective (Eq. (4)), ASTRA leverages the agreement between the various weak sources (including rules and pseudo-labels) over task-specific unlabeled data. Removing this component results in an accuracy drop of 1.4% in aggregate, demonstrating its usefulness.
Fine-tuning the student on labeled data is important for effective self-training: ignoring $D_L$ in step 2.3 of Algorithm 1 leads to 1.6% lower accuracy than ASTRA.
There is a significant performance drop on removing the student's pseudo-labels ($p_\theta(\cdot)$) from the rule attention network in Eq. (2). This significantly limits the coverage of the teacher, which now ignores unlabeled instances where weak rules do not apply, thereby degrading the overall performance by 3.2%.

Case Study: TREC-6 Dataset

Table 6 shows a question in the TREC-6 dataset that was correctly classified by the ASTRA teacher as an "Entity" type (ENTY). Note that majority voting over the four weak rules that apply to this instance (Rules 8, 24, 42, and 61) leads to an incorrect prediction of the "Human" (HUM) type. The ASTRA teacher aggregates all the heuristic rule labels and the student pseudo-label with their (computed) fidelity weights to make the correct prediction. Refer to Table 7 for more illustrative examples of how ASTRA aggregates various weak supervision sources, with the corresponding attention weights shown in parentheses. In Example 1, where no rules apply, the student leverages the context of the sentence (e.g., the semantics of "president") to predict the HUM label. In Example 2, the teacher down-weights the incorrect student (as well as conflicting rules) and up-weights the appropriate rule to predict the correct ENTY label. In Example 3, ASTRA predicts the correct label ENTY relying only on the student, as both rules report noisy labels.

Related Work
In this section, we discuss related work on self-training and learning with noisy labels or rules. Refer to Hedderich et al. (2021) for a thorough survey of approaches addressing low-resource scenarios.
Self-Training. Self-training (Yarowsky, 1995; Nigam and Ghani, 2000; Lee, 2013), one of the earliest semi-supervised learning approaches (Chapelle et al., 2009), trains a base model (student) on a small amount of labeled data; applies it to pseudo-label (task-specific) unlabeled data; uses the pseudo-labels to augment the labeled data; and re-trains the student in an iterative manner. Self-training has recently been shown to obtain state-of-the-art performance for tasks like image classification (Li et al., 2019; Xie et al., 2020; Zoph et al., 2020). A typical issue in self-training is error propagation from noisy pseudo-labels. This is addressed in ASTRA via the rule attention network, which computes the fidelity of pseudo-labels instead of directly using them to re-train the student.

Learning with Noisy Labels. Classification under label noise from a single source has been an active research topic (Frénay and Verleysen, 2013). A major line of research focuses on correcting noisy labels by learning label corruption matrices (Patrini et al., 2017; Hendrycks et al., 2018; Zheng et al., 2021). More related to our work are the instance re-weighting approaches (Ren et al., 2018b; Shu et al., 2019), which learn to up-weight instances with cleaner labels and down-weight instances with noisy labels. However, these operate only at the instance level and do not consider rule-specific importance. Our approach learns both instance- and rule-specific fidelity weights and substantially outperforms Ren et al. (2018a).

Learning with Weak Rules. Awasthi et al. (2020) learn rule-specific and instance-specific weights but assume access to the labeled exemplars that were used to create the rules in the first place. Most importantly, all these works ignore unlabeled instances that are not covered by any of the rules, while our approach leverages all unlabeled instances via self-training.

Conclusions and Future Work
We developed a weak supervision framework, ASTRA, that efficiently trains classifiers by integrating task-specific unlabeled data, a small amount of labeled data, and domain-specific knowledge expressed as rules. Our framework improves data coverage by employing self-training with a student model that considers contextualized representations of instances and predicts pseudo-labels for all instances, including those that are not covered by the heuristic rules. Additionally, we developed a rule attention network, RAN, that aggregates various weak sources of supervision (heuristic rules and student pseudo-labels) with instance-specific weights, and we employed a semi-supervised objective for training RAN without strong assumptions about the nature or structure of the weak sources. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach, particularly at high levels of rule sparsity. In future work, we plan to extend our framework to support a broader range of natural language understanding tasks and to explore alternative techniques for rule embedding.

Ethical Considerations
In this work, we introduce a framework for training neural network models with few labeled examples and domain-specific knowledge. This work is likely to accelerate the progress of NLP applications for domains with limited annotated resources but access to domain-specific knowledge. Not only is it expensive to acquire large amounts of labeled data for every task and language, but in many cases we also cannot perform large-scale labeling due to access constraints from privacy and compliance concerns. To this end, our framework can be used for applications in finance, legal, healthcare, retail, and other domains where the adoption of deep neural networks may have been hindered by the lack of large-scale manual annotations on sensitive data. While our framework accelerates the progress of NLP, it also carries the societal implications of automation, such as job losses for workers who provide annotations as a service. Additionally, it involves deep neural models that are compute-intensive and have a negative environmental impact in terms of carbon footprint. The latter concern is partly alleviated in our work by leveraging pre-trained language models rather than training from scratch, leading to more efficient and faster computation.

A Appendix
For reproducibility, we provide details of our implementation (Section A.1), datasets (Section A.2), and experimental results (Section A.3). Our code is available at https://github.com/microsoft/ASTRA.

A.1 Implementation Details
We now describe implementation details for each component of ASTRA: the base student model and the rule attention teacher network. Table 8 shows our hyperparameter search configuration. We choose the optimal hyperparameters by manual tuning based on development set performance. Table 9 shows the hyperparameters and model architecture details for each dataset. For a fair comparison, we use the same architectures as previous approaches, but we expect further improvements from exploring different architectures.
Base Student Model Our student model consists of an instance embedding layer (e.g., ELMO (Peters et al., 2018), BERT (Devlin et al., 2019), logistic regression), a multilayer perceptron with two hidden layers, and a softmax classification layer for predicting labels.
Rule Attention Teacher Network Our RAN teacher model consists of a 128-dimensional rule embedding layer, a multilayer perceptron for mapping the contextualized representation for an instance to the rule embedding space, and a sigmoid attention layer.
Iterative Teacher-Student Training At each iteration, we train the RAN teacher on unlabeled data and fine-tune it on clean labeled data. Also at each iteration, we train the student on pseudo-labeled teacher data and fine-tune it on clean labeled data. We run a maximum of 25 self-training iterations (with an early-stopping patience of 3) and keep each model from the iteration with the highest validation performance.
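Putting the pieces together, the iterative loop of Algorithm 1 with this early-stopping scheme might look as follows; the student/teacher method names are illustrative placeholders, not the actual ASTRA API:

```python
def astra_train(student, teacher, D_L, D_U, max_iters=25, patience=3):
    """Iterative teacher-student co-training (a sketch of Algorithm 1).

    `student` and `teacher` are assumed to expose fit/predict helpers;
    the method names below are our own, hypothetical interface.
    """
    student.fit(D_L)                         # step 1: train on labeled data
    best_val, stale = float("-inf"), 0
    for _ in range(max_iters):
        pseudo = student.predict_proba(D_U)  # student pseudo-labels
        teacher.fit(D_L, D_U, pseudo)        # step 2.1: Eq. (2) and (4)
        soft = teacher.predict_proba(D_U)    # step 2.2: aggregated labels q_i
        student.fit_pseudo(D_U, soft)        # step 2.3: Eq. (1) ...
        student.fit(D_L)                     # ... then fine-tune on D_L
        val = student.evaluate_dev()
        if val > best_val:
            best_val, stale = val, 0         # keep the best iteration
        else:
            stale += 1
            if stale >= patience:
                break                        # early stopping
    return student, teacher
```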

A.2 Dataset Details
We evaluate our framework on the following six benchmark datasets for weak supervision from Ratner et al. (2017) and Awasthi et al. (2020). All datasets are in English. Table 11 shows detailed dataset statistics. We consider the same test sets as previous work. For a robust evaluation of our model's performance, we split each dataset into five random train/validation/unlabeled splits and report the average performance and standard deviation across runs. For a fair comparison, we use the same splits and evaluation procedure across all methods and baselines.
TREC: Question classification from TREC-6 into 6 categories: Abbreviation, Entity, Description, Human, Location, Numeric-value. Table 12 reports a sample of regular expression rules out of the 68 rules used in the TREC dataset. TREC has 13 keyword-based (coverage=62%) and 55 regular expression-based (coverage=57%) rules.
CENSUS: Binary income classification on the UCI CENSUS dataset on whether a person earns more than $50K or not. This is a non-textual dataset and is considered to evaluate the performance of our approach under the low sparsity setting, since the 83 rules are automatically extracted and have a coverage of 100%.

A.3 Experimental Result Details
We now discuss detailed results on each dataset. To be consistent with previous work, we report accuracy scores for the TREC, YouTube, and CENSUS datasets and macro-average F1 scores for the SMS, Spouse, and MIT-R datasets.