Learning with Noisy Labels for Sentence-level Sentiment Classification

Deep neural networks (DNNs) can fit (or even over-fit) the training data very well. If a DNN model is trained using data with noisy labels and tested on data with clean labels, the model may perform poorly. This paper studies the problem of learning with noisy labels for sentence-level sentiment classification. We propose a novel DNN model called NetAb (as shorthand for convolutional neural Networks with Ab-networks) to handle noisy labels during training. NetAb consists of two convolutional neural networks, one with a noise transition layer for dealing with the input noisy labels and the other for predicting ‘clean’ labels. We train the two networks using their respective loss functions in a mutual reinforcement manner. Experimental results demonstrate the effectiveness of the proposed model.


Introduction
It is well known that sentiment annotation or labeling is subjective (Liu, 2012). Annotators often disagree, especially crowd-workers who are not well trained. As a result, annotated datasets often contain many labeling errors. In this paper, we study whether it is possible to build accurate sentiment classifiers even with noisy-labeled training data. Sentiment classification aims to classify a piece of text according to the polarity of the sentiment expressed in the text, e.g., positive or negative (Pang and Lee, 2008; Liu, 2012). In this work, we focus on sentence-level sentiment classification (SSC) with labeling errors.
As we will see in the experiment section, noisy labels in the training data can be highly damaging, especially for DNNs, because they easily fit the training data and memorize their labels even when the training data are corrupted with noisy labels. Collecting datasets annotated with clean labels is costly and time-consuming, as DNN-based models usually require a large number of training examples. Researchers and practitioners typically have to resort to crowdsourcing. However, as mentioned above, crowdsourced annotations can be quite noisy.
Research on learning with noisy labels dates back to the 1980s (Angluin and Laird, 1988). It is still vibrant today (Mnih and Hinton, 2012; Natarajan et al., 2013, 2018; Menon et al., 2015; Gao et al., 2016; Liu and Tao, 2016; Khetan et al., 2018; Zhan et al., 2019) as it is highly challenging. We will discuss the related work in the next section.
This paper studies the problem of learning with noisy labels for SSC. Formally, we study the following problem.
Problem Definition: Given noisy-labeled training sentences $S = \{(x_1, y_1), ..., (x_n, y_n)\}$, where $x_i$ ($i = 1, ..., n$) is the i-th sentence and $y_i \in \{1, ..., c\}$ is the sentiment label of this sentence, the noisy-labeled sentences are used to train a DNN model for an SSC task. The trained model is then used to classify sentences with clean labels into one of the c sentiment labels.
In this paper, we propose a convolutional neural NETwork with AB-networks (NETAB) to deal with noisy labels during training, as shown in Figure 1. We will introduce the details in the subsequent sections. Basically, NETAB consists of two convolutional neural networks (CNNs), one for learning sentiment scores to predict 'clean' labels and the other for learning a noise transition matrix to handle input noisy labels. We call the two CNNs A-network and AB-network, respectively. The rationale is that (1) DNNs memorize easy instances first and gradually adapt to hard instances as training epochs increase (Arpit et al., 2017); and (2) noisy labels are theoretically flipped from the clean/true labels by a noise transition matrix (Sukhbaatar et al., 2015; Goldberger and Ben-Reuven, 2017; Han et al., 2018a,b). We propose a CNN model with a transition layer to estimate the noise transition matrix for the input noisy labels, while exploiting another CNN to predict 'clean' labels for the input training (and test) sentences. In training, we pre-train A-network in early epochs and then train AB-network and A-network with their own loss functions in an alternating manner. To our knowledge, this is the first work that addresses the noisy label problem in sentence-level sentiment analysis. Our experimental results show that the proposed model outperforms the state-of-the-art methods.

Related Work
Our work is related to sentence sentiment classification (SSC). SSC has been studied extensively (Hu and Liu, 2004; Pang and Lee, 2005; Zhao et al., 2008; Narayanan et al., 2009; Täckström and McDonald, 2011; Wang and Manning, 2012; Yang and Cardie, 2014; Kim, 2014; Tang et al., 2015; Wu et al., 2017). None of them can handle noisy labels. Since many social media datasets are noisy, researchers have tried to build robust models (Gamon, 2004; Barbosa and Feng, 2010). However, they treat noisy data as additional information and do not specifically handle noisy labels. A noise-aware classification model in (Zhan et al., 2019) is trained using data annotated with multiple labels. Another work exploited the connection between users and noisy sentiment labels in social networks. Since these two works use multiple-labeled data or users' information (we only use single-labeled data, and we do not use any additional information), their settings differ from ours.
Our work is closely related to DNN-based approaches to learning with noisy labels. DNN-based approaches have explored three main directions: (1) training DNNs on selected samples (Malach and Shalev-Shwartz, 2017; Jiang et al., 2018; Ren et al., 2018; Han et al., 2018b), (2) modifying the loss function of DNNs with regularization biases (Mnih and Hinton, 2012; Jindal et al., 2016; Patrini et al., 2017; Ghosh et al., 2017; Ma et al., 2018; Zhang and Sabuncu, 2018), and (3) plugging an extra layer into DNNs (Sukhbaatar et al., 2015; Bekker and Goldberger, 2016; Goldberger and Ben-Reuven, 2017; Han et al., 2018a). All these approaches were proposed for image classification where training images were corrupted with noisy labels. Some of them require the noise rate to be known a priori in order to tune their models during training (Patrini et al., 2017; Han et al., 2018b). Our approach combines direction (1) and direction (3), and trains two networks jointly without knowing the noise rate. We have used five recent existing methods in our experiments for SSC. The experimental results show that they are inferior to our proposed method.

Proposed Model
Our model builds on CNN (Kim, 2014). The key idea is to train two CNNs alternately, one for addressing the input noisy labels and the other for predicting 'clean' labels. The overall architecture of the proposed model is given in Figure 1. Before going further, we first introduce a proposition, a property, and an assumption below.
Proposition 1 Noisy labels are flipped from clean labels by an unknown noise transition matrix.
Proposition 1 is reformulated from (Han et al., 2018a) and has been investigated in (Sukhbaatar et al., 2015; Goldberger and Ben-Reuven, 2017; Bekker and Goldberger, 2016). This proposition shows that if we know the noise transition matrix, we can use it to recover the clean labels. In other words, we can apply the noise transition matrix to clean labels to deal with noisy labels. Given these, we ask the following question: How can we estimate such an unknown noise transition matrix?
Below we give a solution to this question based on the following property of DNNs.
Property 1 DNNs tend to prioritize memorization of simple instances first and then gradually memorize hard instances.
Arpit et al. (2017) further investigated this property of DNNs. In our setting, simple instances are sentences with clean labels and hard instances are those with noisy labels. We also have the following assumption.
Assumption 1 The noise rate of the training data is less than 50%.
This assumption is usually satisfied in practice because without it, it is hard to tackle the input noisy labels during training.
Based on the above preliminaries, we need to estimate the noise transition matrix $Q \in \mathbb{R}^{c \times c}$ ($c = 2$ in our case, i.e., positive and negative), and train two classifiers $\ddot{y} \sim P(\ddot{y} \mid x, \theta)$ and $\hat{y} \sim P(\hat{y} \mid x, \vartheta)$, where $x$ is an input sentence, $\ddot{y}$ is its noisy label, $\hat{y}$ is its 'clean' label, and $\theta$ and $\vartheta$ are the parameters of the two classifiers. Note that both $\ddot{y}$ and $\hat{y}$ here are the prediction results from our model, not the input labels. We propose to formulate the probability of the sentence $x$ being labeled as $j$ as

$$P(\ddot{y} = j \mid x, \theta) = \sum_{i} P(\ddot{y} = j \mid \hat{y} = i) \, P(\hat{y} = i \mid x, \vartheta) \tag{1}$$

where $P(\ddot{y} = j \mid \hat{y} = i)$ is an item (the ji-th item) in the noise transition matrix $Q$. We can see that the noise transition matrix $Q$ is applied to the 'clean' scores $P(\hat{y} \mid x, \vartheta)$ to tackle noisy labels.
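To make Eq. (1) concrete, the following is a minimal sketch of how a transition matrix maps 'clean' scores to noisy-label scores for a single sentence. The function name `noisy_probs`, the zero-based class indices, and all numeric values are illustrative assumptions, not values from the paper.

```python
def noisy_probs(clean_probs, Q):
    """Apply Eq. (1): P(noisy = j | x) = sum_i Q[j][i] * P(clean = i | x).

    Q[j][i] stores P(noisy = j | clean = i), so every column of Q sums to 1.
    """
    c = len(clean_probs)
    return [sum(Q[j][i] * clean_probs[i] for i in range(c)) for j in range(c)]


# Hypothetical transition matrix for c = 2 (class 0 = positive, class 1 = negative):
# 20% of clean positives are flipped to negative,
# 10% of clean negatives are flipped to positive.
Q = [[0.8, 0.1],
     [0.2, 0.9]]

clean = [0.7, 0.3]              # illustrative 'clean' scores for one sentence
noisy = noisy_probs(clean, Q)   # scores over the noisy labels
```

Because each column of Q is a probability distribution, the output remains a valid distribution over the c labels.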
We now present our model NETAB and introduce how NETAB performs Eq. (1). As shown in Figure 1, NETAB consists of two CNNs. The intuition here is that we use one CNN to model $P(\hat{y} = i \mid x, \vartheta)$ and use another CNN to model $P(\ddot{y} = j \mid x, \theta)$. Meanwhile, the CNN modeling $P(\ddot{y} = j \mid x, \theta)$ estimates the noise transition matrix $Q$ to deal with noisy labels. Thus we add a transition layer into this CNN.
More precisely, in Figure 1, the CNN with a clean loss models $P(\hat{y} = i \mid x, \vartheta)$. We call this CNN the A-network. The other CNN with a noisy loss models $P(\ddot{y} = j \mid x, \theta)$. We call this CNN the AB-network. AB-network shares all the parameters of A-network except the parameters from the Gate unit and the clean loss. In addition, AB-network has a transition layer to estimate the noise transition matrix $Q$. In this way, A-network predicts 'clean' labels, and AB-network handles the input noisy labels.
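We use cross-entropy with the predicted labels $\ddot{y}$ and the input labels $y$ (given in the dataset) to compute the noisy loss, formulated as

$$\mathcal{L}_{noisy} = -\frac{1}{|\ddot{S}|} \sum_{x \in \ddot{S}} \sum_{i} I(y = i) \log P(\ddot{y} = i \mid x, \theta) \tag{2}$$

where $I$ is the indicator function (if $y = i$, $I = 1$; otherwise, $I = 0$), and $|\ddot{S}|$ is the number of sentences used to train AB-network in each batch.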
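Similarly, we use cross-entropy with the predicted labels $\hat{y}$ and the input labels $y$ to compute the clean loss, formulated as

$$\mathcal{L}_{clean} = -\frac{1}{|\hat{S}|} \sum_{x \in \hat{S}} \sum_{i} I(y = i) \log P(\hat{y} = i \mid x, \vartheta) \tag{3}$$

where $|\hat{S}|$ is the number of sentences used to train A-network in each batch.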
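Both losses have the same cross-entropy form and differ only in which network produces the predicted scores and which sentences are in the batch. A minimal sketch, assuming zero-based integer labels and a hypothetical helper name `cross_entropy`:

```python
import math


def cross_entropy(pred_probs, input_labels):
    """Mean cross-entropy over a batch.

    pred_probs[k] is the predicted distribution for sentence k and
    input_labels[k] is its input label y (zero-based here by assumption).
    """
    total = 0.0
    for probs, y in zip(pred_probs, input_labels):
        # The indicator I(y = i) keeps only the term for the input label y.
        total += -math.log(probs[y])
    return total / len(input_labels)


# Illustrative batch of two sentences.
batch_probs = [[0.9, 0.1], [0.2, 0.8]]
batch_labels = [0, 1]
loss = cross_entropy(batch_probs, batch_labels)
```

Passing the AB-network scores gives the noisy loss, and passing the A-network scores on the gated sentences gives the clean loss.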
Next we introduce how our model learns the parameters ($\vartheta$, $\theta$ and $Q$). An embedding matrix $v$ is produced for each sentence $x$ by looking up a pre-trained word embedding database (e.g., GloVe.840B (Pennington et al., 2014)). Then an encoding vector $h = CNN(v)$ (and $u = CNN(v)$) is produced for each embedding matrix $v$ in A-network (and AB-network). A softmax classifier gives us $P(\hat{y} = i \mid x, \vartheta)$ (i.e., 'clean' sentiment scores) on the learned encoding vector $h$. As the noise transition matrix $Q$ indicates the transition values from clean labels to noisy labels, we compute $Q$ with trainable parameters: $W_i$ is a trainable parameter matrix, and $b_i$ and $f_i$ are two trainable parameter vectors. They are trained in the AB-network. Finally, $P(\ddot{y} = j \mid x, \theta)$ is computed by Eq. (1).
Table 1: Summary statistics of the datasets. Number of positive (P) and negative (N) sentences in (noisy and clean) training data, validation data, and test data. The second column shows the statistics of sentences extracted from the 2,000 reviews of each dataset. The last three columns show the statistics of the sentences in three clean-labeled datasets, see "Clean-labeled Datasets".
In training, NETAB is trained end-to-end. Based on Proposition 1 and Property 1, we pre-train A-network in early epochs (e.g., 5 epochs). Then we train AB-network and A-network in an alternating manner. The two networks are trained using their respective cross-entropy losses. Given a batch of sentences, we first train AB-network. Then we use the scores predicted from A-network to select some possibly clean sentences from this batch and train A-network on the selected sentences. Specifically, we use the predicted scores to compute sentiment labels by $\arg\max_i \{\ddot{y} = i \mid \ddot{y} \sim P(\ddot{y} \mid x, \theta)\}$. Then we select the sentences whose resulting sentiment label equals the input label. The selection process is marked by a Gate unit in Figure 1. When testing a sentence, we use A-network to produce the final classification result.
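The gating and alternating schedule described above can be sketched as follows. This is only an outline under stated assumptions: `step_a`, `step_ab`, and `predict_a` are hypothetical callbacks standing in for the actual network updates and forward pass, labels are zero-based, and pre-training A-network on every sentence of each batch is an assumption, since the text does not specify it.

```python
def gate(scores, input_labels):
    """Return indices of sentences whose arg-max predicted label equals the input label."""
    selected = []
    for idx, (probs, label) in enumerate(zip(scores, input_labels)):
        predicted = max(range(len(probs)), key=lambda i: probs[i])
        if predicted == label:
            selected.append(idx)
    return selected


def train(batches, num_epochs, pretrain_epochs, step_a, step_ab, predict_a):
    """Alternate AB-network and A-network updates after an A-network warm-up."""
    for epoch in range(num_epochs):
        for sentences, labels in batches:
            if epoch < pretrain_epochs:
                # Warm-up: A-network sees the whole batch (assumption).
                step_a(sentences, labels)
                continue
            # 1) Update AB-network on the full noisy batch.
            step_ab(sentences, labels)
            # 2) Keep only sentences whose predicted label agrees with the input label.
            keep = gate(predict_a(sentences), labels)
            if keep:
                # 3) Update A-network on the possibly clean subset.
                step_a([sentences[k] for k in keep], [labels[k] for k in keep])
```

At test time only the A-network forward pass (here `predict_a`) would be used, matching the description above.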

Experiments
In this section, we evaluate the performance of the proposed NETAB model. We conduct two types of experiments. (1) We corrupt clean-labeled datasets to produce noisy-labeled datasets to show the impact of noise on sentiment classification accuracy.
(2) We collect some real noisy data and use them to train models to evaluate the performance of NETAB.
Clean-labeled Datasets. We use three clean-labeled datasets. The first one is the movie sentence polarity dataset from (Pang and Lee, 2005). The other two datasets are laptop and restaurant datasets collected from SemEval-2016. The former consists of laptop review sentences and the latter consists of restaurant review sentences. The original datasets (i.e., Laptop and Restaurant) were annotated with aspect polarity in each sentence. We used all sentences with only one polarity (positive or negative) for their aspects. That is, we only used sentences with aspects having the same sentiment label in each sentence. Thus, the sentiment of each aspect gives the ground-truth as the sentiments of all aspects are the same.
For each clean-labeled dataset, the sentences are randomly partitioned into a training set and a test set with 80% and 20%, respectively. Following (Kim, 2014), we also randomly select 10% of the test data for validation to check the model during training. Summary statistics of the training, validation, and test data are shown in Table 1.
Noisy-labeled Training Datasets. For the above three domains (movie, laptop, and restaurant), we collected 2,000 reviews for each domain from the same review source. We extracted sentences from each review and assigned the review's label to its sentences. Like previous work, we treat 4 or 5 stars as positive and 1 or 2 stars as negative. The data is noisy because a positive (negative) review can contain negative (positive) sentences, and there are also neutral sentences. This gives us three noisy-labeled training datasets. We still use the same test sets as those for the clean-labeled datasets. Summary statistics of all the datasets are shown in Table 1.
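The review-to-sentence labeling can be illustrated with a small sketch. The regular-expression sentence splitter and the handling of 3-star reviews are assumptions made for illustration; the text does not state how sentences were extracted or whether 3-star reviews were collected.

```python
import re


def label_sentences(review_text, stars):
    """Assign the review-level sentiment label to every sentence of the review."""
    if stars >= 4:
        label = "positive"      # 4 or 5 stars
    elif stars <= 2:
        label = "negative"      # 1 or 2 stars
    else:
        return []               # 3-star reviews are skipped here (assumption)
    # Naive splitter: break after ., ! or ? followed by whitespace (assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", review_text) if s.strip()]
    return [(s, label) for s in sentences]


pairs = label_sentences("Great screen. The battery dies fast.", 5)
```

The second sentence in this example is negative but inherits the positive review label, which is exactly the kind of label noise the datasets contain.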
Experiment 1: Here we use the clean-labeled data (i.e., the last three columns in Table 1). We corrupt the clean training data by switching the labels of some random instances based on a noise rate parameter. Then we use the corrupted data to train NETAB and CNN (Kim, 2014).
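A minimal sketch of the label corruption step, assuming that exactly a `noise_rate` fraction of the training labels is switched to a different class. The function name, the zero-based labels, and the fixed seed are illustrative choices; the text only states that labels of random instances are switched based on a noise rate parameter.

```python
import random


def corrupt_labels(labels, noise_rate, num_classes=2, seed=0):
    """Return a copy of `labels` with a `noise_rate` fraction switched to another class."""
    rng = random.Random(seed)
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.sample(range(len(labels)), n_flip)
    noisy = list(labels)
    for idx in flip_idx:
        # Choose uniformly among the other classes; for two classes this is a flip.
        others = [c for c in range(num_classes) if c != labels[idx]]
        noisy[idx] = rng.choice(others)
    return noisy


clean_labels = [0, 1] * 50
noisy_labels = corrupt_labels(clean_labels, noise_rate=0.3)
changed = sum(a != b for a, b in zip(clean_labels, noisy_labels))
```

With two classes and a noise rate of 0.5, the corrupted labels carry no information about the clean ones, which is consistent with Assumption 1 requiring a noise rate below 50%.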
Test accuracy is plotted as a curve over the noise rates. From the figure, we can see that the test accuracy drops from around 0.8 to 0.5 when the noise rate increases from 0 to 0.5, but our NETAB outperforms CNN. The results clearly show that the performance of CNN drops substantially as the noise rate increases.
Experiment 2: Here we use the real noisy-labeled training data to train our model and the baselines, and then test on the test data in Table 1. Our goal is twofold. First, we want to evaluate NETAB using real noisy data. Second, we want to see whether sentences with review-level labels can be used to build effective SSC models.
Baselines. We use one strong non-DNN baseline, NBSVM (with unigram or bigram features) (Wang and Manning, 2012), and six DNN baselines. The first DNN baseline is CNN (Kim, 2014), which does not handle noisy labels. The other five were designed to handle noisy labels.
The comparison results are shown in Table 2. From the results, we can make the following observations. (1) Our NETAB model achieves the best ACC and F1 on all datasets except for F1 of negative class on Laptop. The results demonstrate the superiority of NETAB.
(2) NETAB outperforms the baselines designed for learning with noisy labels. These baselines are inferior to ours as they were tailored for image classification. Note that we found no existing method to deal with noisy labels for SSC.
Training Details. We use the publicly available pre-trained embedding GloVe.840B (Pennington et al., 2014) to initialize the word vectors and the embedding dimension is 300.
For each baseline, we obtain the system from its author and use its default parameters. As the DNN baselines (except CNN) were proposed for image classification, we change the input channels from 3 to 1. For our NETAB, we follow Kim (2014) to use window sizes of 3, 4 and 5 words with 100 feature maps per window size, resulting in 300-dimensional encoding vectors. The input sentence length is set to 40. The network parameters are updated using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. The learning rate is clipped gradually using a norm of 0.96 in performing the Adam optimization. The dropout rate is 0.5 in the input layer. The number of epochs is 200 and the batch size is 50.

Conclusions
This paper proposed a novel CNN-based model for learning sentence-level sentiment classification from data with noisy labels. The proposed model learns to handle noisy labels during training by training two networks alternately. The learned noise transition matrices are used to tackle noisy labels. Experimental results showed that the proposed model markedly outperforms a wide range of baselines. We believe that learning with noisy labels is a promising direction as it is often easy to collect noisy-labeled training data.