Denoising Multi-Source Weak Supervision for Neural Text Classification

We study the problem of learning neural text classifiers without using any labeled data, but only easy-to-provide rules as multiple weak supervision sources. This problem is challenging because rule-induced weak labels are often noisy and incomplete. To address these two challenges, we design a label denoiser, which estimates source reliability using a conditional soft attention mechanism and then reduces label noise by aggregating rule-annotated weak labels. The denoised pseudo labels then supervise a neural classifier to predict soft labels for unmatched samples, which addresses the rule coverage issue. We evaluate our model on five benchmarks for sentiment, topic, and relation classification. The results show that our model consistently outperforms state-of-the-art weakly-supervised and semi-supervised methods, and achieves comparable performance with fully-supervised methods even without any labeled data. Our code can be found at https://github.com/weakrules/Denoise-multi-weak-sources.


Introduction
Many NLP tasks can be formulated as text classification problems, such as sentiment analysis (Badjatiya et al., 2017), topic classification (Zhang et al., 2015), relation extraction (Krebs et al., 2018), and question answering tasks such as slot filling (Pilehvar and Camacho-Collados, 2018). Recent years have witnessed the rapid development of deep neural networks (DNNs) for this problem, from convolutional neural networks (CNN; Kim, 2014; Kalchbrenner et al., 2014) and recurrent neural networks (RNN; Lai et al., 2015) to extra-large pre-trained language models (Devlin et al., 2019; Dai et al., 2019; Liu et al., 2019). DNNs' power comes from their capability to fit complex functions based on large-scale training data. However, in many scenarios, labeled data are limited, and manually annotating them at a large scale is prohibitively expensive.
Weakly-supervised learning is an attractive approach to address the data sparsity problem. It labels massive data with cheap labeling sources such as heuristic rules or knowledge bases. However, the major challenges of using weak supervision for text classification are two-fold: 1) the created labels are highly noisy and imprecise. The label noise issue arises because heuristic rules are often too simple to capture rich contexts and complex semantics for texts; 2) each source only covers a small portion of the data, leaving the labels incomplete. Seed rules have limited coverage because they are defined over the most frequent keywords, but real-life text corpora often have long-tail distributions, so the instances containing only long-tail keywords cannot be annotated.
Existing works (Ratner et al., 2017; Meng et al., 2018; Zamani et al., 2018; Awasthi et al., 2020) attempt to use weak supervision for deep text classification. Ratner et al. (2017) propose a data programming method that uses labeling functions to automatically label data and then trains discriminative models with these labels. However, data annotated in this way only cover instances directly matched by the rules, leading to limited model performance on unmatched data. Meng et al. (2018) propose a deep self-training method that uses weak supervision to learn an initial model and updates the model by its own confident predictions. However, the self-training procedure can overfit the label noise and is prone to error propagation. Zamani et al. (2018) solve query performance prediction (QPP) by boosting multiple weak supervision signals in an unsupervised way. However, they choose the most informative labelers by an ad-hoc user-defined criterion, which may not generalize to all domains. Awasthi et al. (2020) assume that human labelers over-generalize rules to increase coverage, and learn restrictions on the rules to correct for wrongly generalized labels. However, their method requires knowing which rules were generated from which samples, so it cannot deal with other kinds of labeling sources such as knowledge bases or third-party tools.
We study the problem of using multiple weak supervision sources (e.g., domain experts, pattern matching) to address the challenges in weakly-supervised text classification. While each source is weak, multiple sources can provide complementary information for each other. There is thus potential to leverage these multiple sources to infer the correct labels, by estimating source reliability in different feature regimes and then aggregating the weak labels. Moreover, since each source covers different instances, it is promising to leverage multiple sources to bootstrap on unlabeled data and address the label coverage issue. Motivated by the above, we propose a model with two reciprocal components. The first is a label denoiser with a conditional soft attention mechanism (Bahdanau et al., 2014) (§ 3.2). Conditioned on the input text features and weak labels, it first learns reliability scores for the labeling sources, emphasizing the annotators whose opinions are informative for the particular corpus. It then denoises the rule-based labels with these scores. The other is a neural classifier that learns distributed feature representations for all samples (§ 3.3). To leverage unmatched samples, it is supervised by both the denoised labels and its own confident predictions on unmatched data. These two components are integrated into an end-to-end co-training framework, benefiting each other through cross-supervision losses: the rule denoiser loss, the neural classifier loss, and the self-training loss (§ 3.4).
We evaluate our model on four classification tasks, including sentiment analysis, topic classification, spam classification, and information extraction. The results on five benchmarks show that: 1) the soft-attention module effectively denoises the noisy training data induced from weak supervision sources, achieving 84% denoising accuracy; and 2) the co-training design improves the prediction accuracy on unmatched samples, achieving at least a 9% accuracy increase on them. In terms of overall performance, our model consistently outperforms SOTA weakly supervised methods (Ratner et al., 2017; Meng et al., 2018; Zamani et al., 2018), the semi-supervised method (Tarvainen and Valpola, 2017), and the fine-tuning method (Howard and Ruder, 2018) by 5.46% on average.


Problem Definition
In weakly supervised text classification, we do not have access to clean labeled data. Instead, we assume external knowledge sources providing labeling rules as weak supervision signals.
Definition 1 (Weak Supervision). A weak supervision source specifies a set of labeling rules R = {r_1, r_2, . . ., r_k}. Each rule r_i declares a mapping f → C, meaning any documents that satisfy the feature f are labeled as C.
We assume there are multiple weak supervision sources providing complementary information for each other. A concrete example is provided below.
Example 1 (Multi-Source Weak Supervision). Figure 1 shows three weak sources for the sentiment analysis of Yelp reviews. The sources use 'if-else' labeling functions to encode domain knowledge from different aspects. The samples that cannot be matched by any rules remain unlabeled.
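To make the rule format concrete, below is a minimal Python sketch of such 'if-else' labeling functions for Yelp sentiment; the keywords and the integer label encoding are illustrative stand-ins, not the actual rules used in our experiments.

```python
POS, NEG, ABSTAIN = 1, 0, -1  # ABSTAIN: the rule does not match the document

def lf_mood(doc: str) -> int:
    """Source 1: mood keywords (hypothetical rule)."""
    if "pleased" in doc:
        return POS
    if "disappointed" in doc:
        return NEG
    return ABSTAIN

def lf_service(doc: str) -> int:
    """Source 2: service keywords (hypothetical rule)."""
    if "friendly" in doc or "attentive" in doc:
        return POS
    if "rude" in doc:
        return NEG
    return ABSTAIN

def lf_general(doc: str) -> int:
    """Source 3: general sentiment keywords (hypothetical rule)."""
    if "awful" in doc or "terrible" in doc:
        return NEG
    return ABSTAIN

print(lf_general("The staff made an awful dining experience somewhat tolerable."))  # -> 0 (NEG)
```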
Problem Formulation Formally, we have: 1) a corpus D = {d_1, . . ., d_n} of text documents; 2) m target classes C = {C_1, . . ., C_m}; and 3) multiple weak supervision sources, each providing a set of labeling rules as in Definition 1. Our goal is to learn a classifier from D, using only these multiple weak supervision sources, to accurately classify any newly arriving documents.

Challenges
Although the use of automatic weak annotators largely reduces human labeling efforts, using rule-induced labeled data has two drawbacks: label noise and label incompleteness.

Weak labels are noisy since user-provided rules are often simple and do not fully capture the complex semantics of human language. In the Yelp example with eight weak supervision sources, the annotation accuracy is 68.3% on average. Label noise hurts the performance of text classifiers, especially deep classifiers, because such complex models easily overfit the noise. Moreover, the source coverage ranges from 6.8% to 22.2%. Such limited coverage arises because user-provided rules are specified over common lexical features, but real-life data are long-tailed, leaving many samples unmatched by any labeling rule.

Our Method
We begin with an overview of our method and then introduce its two key components as well as the model learning procedure.

The Overall Framework
Our method addresses the above challenges by integrating weakly annotated labels from multiple sources with the text data in an end-to-end framework consisting of a label denoiser and a deep neural classifier, as illustrated in Figure 2.
Label denoiser & self-denoising We handle the label noise issue by building a label denoiser that iteratively denoises itself to improve the quality of the weak labels. This label denoiser estimates the source reliability using a conditional soft attention mechanism, and then aggregates the weak labels via weighted voting of the labeling sources to achieve "pseudo-clean" labels. The reliability scores are conditioned on both the rules and the document feature representations. They effectively emphasize the opinions of informative sources while down-weighting those of unreliable sources, thus making the rule-induced predictions more accurate.

Neural classifier & self-training
To address the low-coverage issue, we build a neural classifier that learns distributed representations for text documents and classifies each of them, whether rule-matched or not. It is supervised by both the denoised weakly labeled data and its own highly confident predictions on unmatched data.

The Label Denoiser
When aggregating multiple weak supervision sources, it is key for the model to attend to the more reliable sources, where source reliability should be conditioned on the input features. This enables the model to aggregate multi-source weak labels more effectively. Given k labeling sources, we obtain the weak label matrix Ỹ ∈ R^{n×k} through rule matching. Specifically, as shown in the Rule Matching step of Figure 3, by Definition 1, if a document is matchable by a given rule, it is assigned the rule-induced label C; otherwise, the document remains unlabeled, represented as -1. The k labeling sources thus generate k weak labels for each document. We then estimate the source reliability and aggregate the complementary weak labels to obtain "pseudo-clean" labels.
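The rule-matching step can be sketched as follows; the keyword rules are hypothetical stand-ins in the spirit of Example 1, and -1 encodes an abstaining source as described above.

```python
import numpy as np

POS, NEG, ABSTAIN = 1, 0, -1

# Three illustrative single-keyword sources (not the actual rule set).
sources = [
    lambda d: POS if "pleased" in d else ABSTAIN,   # mood keywords
    lambda d: POS if "friendly" in d else ABSTAIN,  # service keywords
    lambda d: NEG if "awful" in d else ABSTAIN,     # general keywords
]

docs = [
    "Our server was friendly and attentive.",
    "He was pleased, but the calamari was awful.",
    "The vegetable taco was full of mush.",         # matched by no source
]

# Weak label matrix Y_tilde of shape (n, k); -1 marks unmatched entries.
Y_tilde = np.array([[rule(d) for rule in sources] for d in docs])
print(Y_tilde)
# [[-1  1 -1]
#  [ 1 -1  0]
#  [-1 -1 -1]]  <- the last document is unmatched by all sources and goes to D_U
```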

Parameterization of source reliability
We introduce a soft attention mechanism, conditioned on both the weak labels and the feature representation B, to estimate the source reliability. Formally, we denote the denoised "pseudo-clean" labels by Ŷ = [ŷ_1, . . ., ŷ_n]^T; the initial labels Ŷ_0 are obtained by simple majority voting over Ỹ.
The core of the label denoiser is an attention net: a two-layer feed-forward neural network that predicts the attention score for matched samples. Formally, we specify a reliability score a_j for each labeling source to represent its annotation quality, and the scores are normalized to satisfy $\sum_{j=1}^{k} a_j = 1$. For one document d_i, its attention score q_{i,j} for labeling source R_j is

$$q_{i,j} = W_2 \tanh(W_1 B_i), \quad (1)$$

where W_1, W_2 denote the neural network weights and tanh is the activation function. For each document, its conditional labeling-source score vector $A_i = [a_{i,1}, a_{i,2}, \ldots, a_{i,k}]^T$ is calculated over matched annotators as

$$a_{i,j} = \frac{\exp(q_{i,j})\,\chi_C(\tilde{y}_{i,j})}{\sum_{j'=1}^{k} \exp(q_{i,j'})\,\chi_C(\tilde{y}_{i,j'})},$$

where $\chi_C$ is the indicator function of whether source j annotates document d_i. We then average the conditional source scores A_i over all the n matched samples to obtain the source reliability vector A; the weight of the j-th annotator (j = 1, 2, . . ., k) is

$$a_j = \frac{1}{n} \sum_{i=1}^{n} a_{i,j}. \quad (2)$$
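A minimal PyTorch sketch of this estimator, under our reading of the reconstructed formulas (1) and (2); all dimensions and tensors are illustrative.

```python
import torch
import torch.nn as nn

n, k, d, d_h = 4, 3, 8, 16                  # docs, sources, feature dim, hidden dim
B = torch.randn(n, d)                       # document feature representations
Y_tilde = torch.tensor([[ 1, -1,  0],       # weak labels; -1 = source abstains
                        [ 1,  1, -1],
                        [-1,  0,  0],
                        [ 0, -1,  1]])

W1, W2 = nn.Linear(d, d_h), nn.Linear(d_h, k)   # two-layer attention net, Eq. (1)

q = W2(torch.tanh(W1(B)))                   # raw scores q_{i,j}, shape (n, k)
matched = Y_tilde != -1                     # indicator chi_C of matched sources
q = q.masked_fill(~matched, float("-inf"))  # restrict attention to matched sources
A_i = torch.softmax(q, dim=1)               # conditional per-document scores
A = A_i.mean(dim=0)                         # source reliability vector, Eq. (2)
print(A)                                    # sums to 1 across the k sources
```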

The updated higher-quality labels Ŷ then supervise the rule-covered samples in D L to generate better soft predictions and guide the neural classifier later.

Rule-based classifier prediction
At epoch t of our co-training framework, we learn the reliability score A(t) and the soft predictions Ẑ(t) supervised by the "pseudo-clean" labels from the previous epoch, Ŷ(t−1). We then renew the "pseudo-clean" labels Ŷ(t) using the score A(t) by (2). Specifically, given m target classes and k weak annotators, the prediction probability ẑ_i for d_i is obtained by weighting the noisy labels Ỹ_i with their corresponding conditional reliability scores A_i: ẑ_i = softmax(Ỹ_i ⊗ A_i), where the masked matrix multiplication ⊗ masks the labeling sources that do not annotate document i,

$$(\tilde{Y}_i \otimes A_i)_c = \sum_{j=1}^{k} \chi(\tilde{y}_{i,j} = c)\, a_{i,j}, \quad c = 1, \ldots, m, \quad (3)$$

and the resulting masked scores are normalized via softmax. We finally aggregate the m adjusted scores into the soft prediction vector $\hat{z}_i = [\hat{z}_{i,1}, \ldots, \hat{z}_{i,m}]^T$.
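The masked weighted vote can be sketched for a single document as follows, under our reading of the reconstructed operator in (3); the reliability scores reuse the illustrative values from the Case Study section, where rules 1 and 2 vote POSITIVE and rule 3 votes NEGATIVE.

```python
import torch

m = 2                                           # classes: 0 = NEG, 1 = POS
y_i = torch.tensor([1, 1, 0])                   # weak labels from k = 3 rules
a_i = torch.tensor([0.1074, 0.1074, 0.2482])    # conditional reliability scores

scores = torch.zeros(m)
for j in range(len(y_i)):
    if y_i[j] != -1:                            # mask abstaining sources
        scores[y_i[j]] += a_i[j]                # sum scores of sources voting for a class
z_hat_i = torch.softmax(scores, dim=0)          # soft prediction over classes
print(z_hat_i.argmax().item())                  # -> 0 (NEGATIVE), as in the Case Study
```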

The Neural Classifier
The neural classifier is designed to handle all the samples, both matched and unmatched. The unmatched corpus, where the documents cannot be annotated by any source, is denoted as D_U. In our model, we use pre-trained BERT (Devlin et al., 2019) as the feature extractor, and then feed the text embeddings B into a feed-forward neural network to obtain the final predictions. For each document d_i, the soft prediction is

$$\bar{z}_i = f_\theta(B_i),$$

where f_θ denotes the two-layer feed-forward neural network and θ denotes its trainable parameters.
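A minimal sketch of this classifier follows. Using the HuggingFace transformers library, the [CLS] embedding as the sentence feature, and a tanh hidden activation are implementation assumptions; the paper specifies pre-trained BERT and a two-layer feed-forward head but not these details.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

class FeedForwardHead(nn.Module):
    """Two-layer feed-forward head f_theta; d_h=128 is the default from the Parameter Study."""
    def __init__(self, d_in=768, d_h=128, m=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, d_h), nn.Tanh(), nn.Linear(d_h, m))

    def forward(self, B):
        return torch.softmax(self.net(B), dim=-1)   # soft predictions z_bar

f_theta = FeedForwardHead()

docs = ["Service is above average.", "The taco lacked flavor and texture."]
enc = tokenizer(docs, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():                               # features only; BERT is not fine-tuned here
    B = bert(**enc).last_hidden_state[:, 0]         # [CLS] embedding as the sentence feature
print(f_theta(B).shape)                             # torch.Size([2, 2])
```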

The Training Objective
The rule denoiser loss ℓ_1 is the loss of the rule-based classifier over D_L. We use the "pseudo-clean" labels Ŷ to self-train the label denoiser and define ℓ_1 as the negative log-likelihood of ŷ_i under the soft predictions ẑ_i:

$$\ell_1 = -\frac{1}{|D_L|} \sum_{d_i \in D_L} \log \hat{z}_{i,\hat{y}_i}.$$

The neural classifier loss ℓ_2 is the loss of the neural classifier over D_L. Similarly, we regard the negative log-likelihood from the neural network outputs Z̄ to the pseudo-clean labels Ŷ as the training loss:

$$\ell_2 = -\frac{1}{|D_L|} \sum_{d_i \in D_L} \log \bar{z}_{i,\hat{y}_i}.$$

The unsupervised self-training loss ℓ_3 is the loss of the neural classifier over D_U. To further enhance the label quality of D_U, we apply the temporal ensembling strategy (Laine and Aila, 2016), which aggregates the predictions of multiple previous network evaluations into an ensemble prediction to alleviate noise propagation. For a document d_i ∈ D_U, the neural classifier outputs z̄_i are accumulated into the ensemble outputs Z_i by the update

$$Z_i \leftarrow \alpha Z_i + (1 - \alpha)\, \bar{z}_i,$$

where α is a momentum term that controls how far the ensemble looks back into the training history. We also construct target vectors p_i by bias correction, namely

$$p_i = Z_i \,/\, (1 - \alpha^t),$$

where t is the current epoch. Then, we minimize the Euclidean distance between p_i and z̄_i:

$$\ell_3 = \frac{1}{|D_U|} \sum_{d_i \in D_U} \lVert p_i - \bar{z}_i \rVert_2^2.$$

Overall Objective The final training objective is to minimize the overall loss ℓ:

$$\ell = c_1 \ell_1 + c_2 \ell_2 + c_3 \ell_3, \quad (8)$$

where 0 ≤ c_1, c_2, c_3 ≤ 1 are hyper-parameters that balance the three losses and satisfy c_1 + c_2 + c_3 = 1.
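Below is a minimal PyTorch sketch of this three-part objective, following our reconstruction of the equations above; all tensors, shapes, and values are illustrative, and the default loss weights come from the Parameter Study section.

```python
import torch
import torch.nn.functional as F

c1, c2, c3 = 0.2, 0.7, 0.1                   # default weights (Parameter Study)
alpha, t = 0.6, 5                            # ensembling momentum, current epoch

y_hat  = torch.tensor([0, 1])                # pseudo-clean labels on D_L
z_rule = torch.rand(2, 2).softmax(dim=1)     # denoiser soft predictions on D_L
z_nn_L = torch.rand(2, 2).softmax(dim=1)     # neural predictions on D_L
z_nn_U = torch.rand(3, 2).softmax(dim=1)     # neural predictions on D_U

l1 = F.nll_loss(torch.log(z_rule), y_hat)    # rule denoiser loss
l2 = F.nll_loss(torch.log(z_nn_L), y_hat)    # neural classifier loss

Z = torch.zeros(3, 2)                        # ensemble outputs; persist across epochs
Z = alpha * Z + (1 - alpha) * z_nn_U         # accumulate network evaluations
p = (Z / (1 - alpha ** t)).detach()          # bias-corrected targets p_i
l3 = F.mse_loss(z_nn_U, p)                   # squared Euclidean distance (averaged)

loss = c1 * l1 + c2 * l2 + c3 * l3           # overall objective, Eq. (8)
print(loss.item())
```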

Model Learning and Inference
Algorithm 1 sketches the training procedure. The two classifiers provide supervision signals for both themselves and their peers, iteratively improving their classification abilities. In the test phase, the corpus is fed into our model together with the corresponding annotated noisy labels. The final target C_i for a document i is predicted by ensembling the soft predictions. If the two predictions from the label denoiser and the neural classifier conflict with each other, we choose the one with higher confidence, where the confidence scores are the softmax outputs.
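A sketch of this confidence-based ensembling rule, with illustrative softmax outputs:

```python
import torch

z_rule = torch.tensor([0.35, 0.65])   # label denoiser softmax output
z_nn   = torch.tensor([0.80, 0.20])   # neural classifier softmax output

c_rule, y_rule = z_rule.max(dim=0)    # confidence and predicted class per model
c_nn,   y_nn   = z_nn.max(dim=0)

if y_rule == y_nn:
    prediction = y_rule.item()
else:                                 # conflict: keep the more confident prediction
    prediction = (y_rule if c_rule > c_nn else y_nn).item()
print(prediction)                     # -> 0 here, since 0.80 > 0.65
```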

Experiments
Datasets We evaluate our method on five benchmark datasets. Table 1 shows the statistics of these datasets and the quality of the weak labels (the details of each annotation rule are given in appendix A.4). Creating such rules requires very light effort, yet covers a considerable number of data samples (e.g., 54k in agnews).
Baselines We compare our model with the following advanced methods: 1) Snorkel (Ratner et al., 2017); 2) ImplyLoss (Awasthi et al., 2020); 3) NeuralQPP (Zamani et al., 2018); 4) WeSTClass (Meng et al., 2018); 5) the semi-supervised Mean Teacher model (MT; Tarvainen and Valpola, 2017); 6) the fine-tuning method ULMFiT (Howard and Ruder, 2018); and 7) BERT-MLP (Devlin et al., 2019).

Comparison with Baselines
We first compare our method with the baselines on the five datasets. For a fair comparison, all the methods use a pre-trained BERT-based model for feature extraction and use the same neural architecture as the text classification model. All the baselines use the same set of weak labels Ỹ for model training, except for WeSTClass, which only requires seed keywords as weak supervision (we extract these keywords from the predicates of our rules).
Table 2 shows the performance of all the methods on the five datasets. As shown, our model consistently outperforms all the baselines across all the datasets, which demonstrates its strength and robustness. Our model is also very time-efficient (4.5 minutes on average), with trainable parameters coming only from two simple MLP networks (0.199M parameters).
Similar to our method, Snorkel, NeuralQPP, and ImplyLoss also denoise the weak labels from multiple sources, with the following ideas: 1) Snorkel uses a generative modeling approach; 2) ImplyLoss adds a regularization term to address the rule over-generalization issue, but it requires clean data indicating which document corresponds to which rule; without such information in our setting, this strong baseline cannot perform well; 3) NeuralQPP selects the most informative weak labelers via a boosting method. The performance gaps verify the effectiveness of our conditional soft attention design and co-training framework.
WeSTClass is similar to our method in that it also uses self-training to bootstrap on unlabeled samples to improve its performance. The major advantage of our model over WeSTClass is that it uses two different predictors (rule-based and neural classifier) to regularize each other. Such a design not only better reduces label noise but also makes the learned text classifier more robust.
Finally, ULMFiT and BERT-MLP are strong baselines based on language-model fine-tuning. MT is a well-known semi-supervised model that achieved strong results in image classification. However, in the weakly supervised setting, these methods do not perform well due to label noise. The results show that ULMFiT and MT suffer from such noise, whereas our model is noise-tolerant and more suitable for weakly supervised settings. Among the baselines, BERT-MLP performs best overall, so we further compare it with ours from more perspectives below.

Effectiveness of label denoising
To study the effectiveness of label denoising, we first compare the label noise ratio in the training set given by the majority-voted pseudo labels (the initial Ŷ_0 defined in § 3.2) against that of our denoised pseudo labels. Figure 4 shows that, after applying our denoising model, the label noise is reduced by 4.49% (youtube), 4.74% (imdb), 12.6% (yelp), 3.87% (agnews), and 8.06% (spouse) within the matched samples. If we count all the samples, the noise reduction is much more significant: 23.92% on average. These results show the effectiveness of our model in denoising weak labels.
Train a classifier with denoised labels We further study how the denoised labels benefit the training of supervised models. To this end, we feed the labels generated by majority voting and the denoised ones generated by our model into two state-of-the-art supervised models: ULMFiT and BERT-MLP (described in § 4.1). Table 3 shows that the denoised labels significantly improve the performance of the supervised models on all the datasets.

Effectiveness of handling rule coverage
We proceed to study how effective our model is when dealing with the low-coverage issue of weak supervision. To this end, we evaluate the performance of our model on the samples covered by different numbers of rules. As shown in Figure 5, the strongest baseline (BERT-MLP) trained with majority-voted labels performs poorly on samples that are matched by few rules or even no rules.
In contrast, after applying our model, the performance on those less matched samples improves significantly.This is due to the neural classifier in our model, which predicts soft labels for unmatched samples and utilizes the information from the multiple sources through co-training.

Incorporating Clean Labels
We also study how our model can further benefit from a small amount of labeled data. While our model uses weak labels by default, it can easily incorporate clean labeled data by replacing the weak labels with clean ones and fixing them during training. We study the performance of our model in this setting and compare it with the fully-supervised BERT-MLP model trained on the same amount of clean labeled data. As shown in Table 4, the results of combining our denoised labels with a small amount of clean labels are encouraging: doing so further improves the performance of our model, which consistently outperforms the fully-supervised BERT-MLP model. When the labeled ratio is small, the improvement over the fully-supervised model is particularly large: 6.28% higher accuracy with 0.5% clean labels and 3.84% with 5% clean labels on average. When the ratio of clean labels is large, the performance improvement becomes marginal.
The performance improvement over the fully-supervised model is relatively smaller on the yelp and agnews datasets. The reason is likely that the text genres of yelp and agnews are similar to the corpora used in BERT pre-training, so the supervised model quickly reaches its peak performance with a small amount of labeled data.

Ablation Study
We perform ablation studies to evaluate the effectiveness of the three components of our model: the label denoiser, the neural classifier, and the self-training over unmatched samples. By removing one or more of them, we obtain four settings: 1) Rule-only: without the neural classifier and self-training; 2) Neural-only: without the label denoiser and self-training; 3) Neural-self: without the label denoiser; 4) Rule-Neural: without self-training. Settings 3) and 4) are supervised by the initial simple majority-voted labels. Table 5 shows the results. We find that all three components are key to our model because: 1) the rule-based label denoiser iteratively obtains higher-quality pseudo labels from the weak supervision sources; and 2) the neural classifier extracts extra supervision signals from unlabeled data through self-training.

Case Study
We provide an example from the Yelp dataset to illustrate the denoising process of our model. A reviewer says: "My husband tried this place. He was pleased with his experience and he wanted to take me there for dinner. We started with calamari which was so greasy we could hardly eat it... The bright light is the service. Friendly and attentive! The staff made an awful dining experience somewhat tolerable." The ground-truth sentiment is NEGATIVE.
This review is labeled by three rules as follows: 1) keyword-mood, pleased → POSITIVE; 2) keyword-service, friendly → POSITIVE; 3) keyword-general, awful → NEGATIVE. The majority-voted label is thus POSITIVE, which is wrong. After applying our method, the learned conditional reliability scores for the three rules are 0.1074, 0.1074, and 0.2482, which emphasizes rule 3); the denoised weighted majority vote is thus NEGATIVE, which is correct.

Parameter Study
The primary parameters of our model include: 1) the dimension d_h of the hidden layers in the label denoiser and the feature-based classifier; 2) the learning rate lr; 3) the weights c_1, c_2, and c_3 of the loss terms ℓ_1, ℓ_2, and ℓ_3 in (8); and 4) the momentum term α, which we fix to 0.6 following the implementation of Laine and Aila (2016). By default, we set d_h = 128, lr = 0.02, c_1 = 0.2, c_2 = 0.7, and c_3 = 0.1, as our model achieves overall good performance with these values. The search space of d_h is 2^6 to 2^9, lr is 0.01 to 0.1, and c_1 and c_3 are 0.1 to 0.9 (note that c_2 = 1 − c_1 − c_3). The hyperparameter configuration for the best performance reported in Table 2 is given in appendix A.3.
We test the effect of each hyperparameter by fixing the others to their default values. In Figure 6 (a) and (b), we find the performance is stable except when the loss weight is too large. For (c) and (d), except on the spouse dataset when lr is too small or d_h is too large (instability because the dataset is small), our model is robust to the hyperparameters when they are in a reasonable range. We also report the overall performance for all the search trials in Table 10 of appendix A.3.

Related Work
Learning from Noisy Supervision. Our work is closely related to existing work on learning from noisy supervision. To deal with label noise, several studies (Brodley and Friedl, 1999; Smith and Martinez, 2011; Yang et al., 2018) adopt a data cleaning approach that detects and removes mislabeled instances. This is achieved by outlier detection (Brodley and Friedl, 1999), a-priori heuristics (Smith and Martinez, 2011), self-training (Liang et al., 2020), or reinforcement learning (Yang et al., 2018; Zhang et al., 2020). One drawback of this data cleaning approach is that it can discard many samples and incur information loss. Different from data cleaning, some works adopt a data correction approach. The most prominent idea in this line is to estimate the noise transition matrix among labels (Sukhbaatar and Fergus, 2014; Sukhbaatar et al., 2014; Goldberger and Ben-Reuven, 2016; Wang et al., 2019; Northcutt et al., 2019) and then use the transition matrices to re-label the instances or adapt the loss functions. Specifically, Wang et al. (2019) and Northcutt et al. (2019) generate label noise by flipping clean labels based on such noise transition matrices. They are thus not applicable to our weak supervision setting, where no clean labels are given. Meanwhile, reweighting strategies have been explored to adjust the input training data. These techniques weigh training samples according to prediction confidence (Dehghani et al., 2017), a one-sided noise assumption (Zhang et al., 2019), a clean set (Ren et al., 2018), or the similarity of their descent directions (Yang et al., 2018). Recently, a few studies (Veit et al., 2017; Hu et al., 2019) have also explored designing denoising modules for neural networks. However, our method differs from them in that: (1) our method learns conditional reliability scores for multiple sources; and (2) these methods still require clean data for denoising, while ours does not.
Learning from Multi-Source Supervision. The crowdsourcing area also faces the problem of learning from multiple sources (i.e., crowd workers). Different strategies have been proposed to integrate the annotations for the same instance, such as estimating confidence intervals for workers (Joglekar et al., 2015) or leveraging approval voting (Shah et al., 2015). Compared with crowdsourcing, our problem is different in that the multiple sources provide only feature-level noisy supervision instead of instance-level supervision.
More related to our work are data programming methods (Ratner et al., 2016, 2017, 2019) that learn from multiple weak supervision sources. One seminal work in this line is Snorkel (Ratner et al., 2017), which treats true labels as latent variables in a generative model and weak labels as noisy observations. The generative model is learned to estimate the latent variables, and the denoised training data are used to learn classifiers. Our approach differs from data programming methods in that we use a soft attention mechanism to estimate source reliability, which is integrated into neural text classifiers to improve the performance on unmatched samples.
Self-training. Self-training is a classic technique for learning from limited supervision (Yarowsky, 1995). The key idea is to use a model's confident predictions to update the model itself iteratively. However, one major drawback of self-training is that it is sensitive to noise, i.e., the model can be misguided by its own wrong predictions and suffer from error propagation (Guo et al., 2017).
Although self-training is a common technique in semi-supervised learning, only a few works, such as WeSTClass (Meng et al., 2018), have applied it to weakly-supervised learning. Our self-training differs from WeSTClass in two aspects: 1) it performs weighted aggregation of the predictions from multiple sources, which generates higher-quality pseudo labels and makes the model less sensitive to errors in any single source; 2) it uses temporal ensembling, which aggregates historical pseudo labels and alleviates noise propagation.

Conclusion
We have proposed a deep neural text classifier learned not from excessive labeled data, but from unlabeled data plus weak supervision. Our model learns from multiple weak supervision sources using two components that co-train each other: (1) a label denoiser that estimates source reliability to reduce label noise on the matched samples, and (2) a neural classifier that learns distributed representations and predicts over all the samples. The two components are integrated into a co-training framework and benefit from each other. In our experiments, we find that our model not only outperforms state-of-the-art weakly supervised models but also benefits supervised models with its denoised labeled data. Our model makes it possible to train accurate deep text classifiers using easy-to-provide rules, which is appealing in low-resource text classification scenarios. As future work, we are interested in further denoising the weak supervision with automatic rule discovery, as well as extending the co-training framework to other tasks beyond text classification.

Figure 1: The annotation process for three weak supervision sources. "POS" and "NEG" are the labels for the sentiment analysis task.

Figure 2: Overview of the cross-training between the rule-based classifier and the neural classifier.

Figure 3: The detailed model architecture. Our model mainly consists of two parts: (1) the label denoiser, including the conditional soft attention reliability estimator and the instance-wise multiplication; and (2) the neural classifier, which computes sentence embeddings using the pre-trained Transformer and performs classification.
Figure 4: The label noise ratio of the initial majority-voted labels and our denoised labels in the training set.

Figure 5: Accuracy on low-resource samples (matched by a small number of rules) in the Youtube dataset.

Figure 6: The prediction accuracy over different parameter settings.

Algorithm 1 Training process of our model
Require: D_L, D_U, C, B, Ỹ; g_W(x) and f_θ(x): feed-forward rule-based and neural classifiers with trainable parameters W and θ; s: number of training iterations
1: Ŷ ← Ŷ_0, initialized by simple majority voting
2: for t ← 1 to s do
3:    ŷ_i, A, ẑ_{i∈D_L} ← g_W(Ỹ_i, B_i, ŷ_i), learn the reliability scores and evaluate the attention-network outputs supervised by the "pseudo-clean" labels from (1) and (3)
4:    renew the pseudo labels Ŷ by (2)
5:    update θ, W using ADAM by (8)
6: end for
7: return W, θ

Table 1: Data statistics. C is the number of classes. Cover is the fraction of rule-induced samples. Acc. refers to the precision of the labeling sources (number of correct samples / number of matched samples). Cover and Acc. are in %.

Table 3: Classification accuracy of two supervised methods with labels generated by majority voting and denoised labels generated by our model.

Table 4: The classification accuracy of BERT-MLP and our model with ground-truth labeled data.