From Random to Supervised: A Novel Dropout Mechanism Integrated with Global Information

Dropout is used to avoid overfitting by randomly dropping units from a neural network during training. Inspired by dropout, this paper presents GI-Dropout, a novel dropout method that integrates global information to improve neural networks for text classification. Unlike the traditional dropout method, in which all units are dropped randomly with the same probability, we use explicit instructions based on global information about the dataset to guide the training process. With GI-Dropout, the model is encouraged to pay more attention to inapparent features or patterns. Experiments demonstrate the effectiveness of dropout with global information on seven text classification tasks, including sentiment analysis and topic classification.


Introduction
Recently, neural networks have achieved remarkable results in natural language processing (NLP). Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are two popular types of neural network architecture, and both are widely applied to various NLP tasks. CNNs are known for their strong ability to extract position-invariant features, while RNNs excel at modeling sequences (Yin et al., 2017). In sentence classification tasks, models based on CNNs or RNNs aim to represent sentences as appropriate embeddings, which are supposed to encode semantic features for the classification.
However, because of computational complexity and memory limitations, neural networks are often trained on mini-batches, in which global information is gathered implicitly rather than explicitly. To facilitate the learning process, Li et al. (2017) extract global semantic features from the training dataset and encode them into CNN filters with a novel initialization mechanism. This approach gains significant improvements in sentiment analysis and topic classification tasks.
Unlike most machine learning methods, neural networks can extract features with little need for feature engineering. In general, the stronger a model's ability to learn features automatically, the better its performance. However, during training, neural networks tend to focus on some distinctive words or phrases and ignore other noteworthy patterns, which may result in overfitting, especially on a small dataset. Dropout was proposed to avoid this problem (Hinton et al., 2012; Srivastava et al., 2014). The key idea of dropout is to randomly drop units from the neural network during training and to use scaled-down weights for these units at test time.
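As a minimal sketch of the standard mechanism described above (not the authors' code), the following pure-Python function applies Bernoulli dropout to a list of activations. The names `dropout` and `keep_prob` are illustrative assumptions. Scaling the activations by the retention probability at test time is used here as a stand-in for scaling the outgoing weights.

```python
import random


def dropout(activations, keep_prob, training, rng=random):
    """Standard (Bernoulli) dropout on a list of activations."""
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    if training:
        # Every unit is kept independently with the same probability.
        return [a if rng.random() < keep_prob else 0.0 for a in activations]
    # At test time all units are used, scaled by the retention probability.
    return [a * keep_prob for a in activations]
```

Every unit shares one `keep_prob` here, which is the property that GI-Dropout relaxes.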
Inspired by the above works, we propose a novel dropout method guided by global information (GI-Dropout). In our method, we force the model to pay more attention to features that are inapparent or of low frequency by dropping words that are prominent and easy to learn. Unlike the traditional dropout method, where neurons are dropped randomly with the same probability, we encode global information into dropout. Specifically, we drop words based on their importance, which is calculated from the training data via a novel Naïve Bayes (NB) weighting technique.
With this dropout method, neural networks tend to extract not only the obvious features but also the less obvious features that are also helpful for classification. By integrating our method into a classic CNN model for text classification (Kim, 2014) and a novel self-attentive RNN (Lin et al., 2017), we observe significant improvements on various benchmarks. The advantages of our approach are as follows:
1. Global information is directly obtained from the training data without any external resources;
2. GI-Dropout is simple but effective, and could be easily applied to other DNN models;
3. The computation brought by our method is relatively small, resulting in little additional training cost.
arXiv:1808.08149v3 [cs.CL] 10 Oct 2018

Related Work
Recently, neural networks have dominated the state-of-the-art results on a wide range of NLP tasks. For text classification, Kim (2014) [...] RNNs also achieve comparable performance in this area. Tang et al. (2015) show that gated RNNs perform well on document-level sentiment classification. Lin et al. (2017) propose an enhanced model that extracts an interpretable sentence embedding by introducing a self-attention mechanism, and it yields a significant performance gain compared with other sentence embedding methods. Yin et al. (2017) make a systematic comparison of CNNs and RNNs, showing that the two networks provide complementary information for text classification tasks, and that which architecture performs better depends on how important it is to understand global, long-range semantics.
To improve the semantic understanding abilities of the models, some works aim to encode prior knowledge into the networks. For example, Hu et al. (2016) present a framework that encapsulates logical structured knowledge into a neural network. Li et al. (2017) encode global semantic features into the convolutional filters instead of initializing them randomly, which helps the filters focus on learning useful n-grams.
Another effective way to facilitate the learning process is to exploit the dropout mechanism. If a model pays too much attention to a few distinct patterns, it can easily overfit, especially on a small dataset. Hinton et al. (2012) introduce Binary (regular) Dropout, showing that it can prevent co-adaptation of neurons by randomly dropping units from the neural network during training, so as to reduce overfitting. Later, Srivastava et al. (2014) show that multiplying the outputs of the neurons by a random variable drawn from a Gaussian distribution works as well as, or perhaps better than, regular dropout. Ba and Frey (2013) present standout, an adaptive dropout method in which each variable's dropout probability is calculated by a binary belief network that can be trained jointly with the neural network. Kingma et al. (2015) introduce variational dropout, a generalization of Gaussian dropout in which the dropout rates are also learned during training.
The existing dropout methods are often mathematically derived or learned jointly with the downstream task, and global information is not explicitly utilized. Different from previous works, we focus on how to utilize global information to help model training via dropout. As depicted in Figure 3, GI-Dropout is applied at the input of the baseline models, unlike prior dropout methods, which control units inside the network rather than input words in the text.
In this work, we use global information to guide the dropout method by dropping words based on their importance. Hence, neural networks are able to extract not only the obvious features but also the less obvious features that are also helpful for classification.

Our method
The intuition behind our method is straightforward. Since neural networks aim to capture semantic features and classify sentences by these features, we encourage models to pay more attention to unobvious features by dropping words according to their importance. Some features are so distinctive that a model can learn them easily. However, a sentence may have more than one feature that can contribute to class prediction. For instance, in "The story is sad and very boring", "boring" is of strong polarity and indicates negative emotion. Because of the very strong impact of "boring", neural networks may not be sensitive to other features like "sad", which is also helpful for sentiment classification. In GI-Dropout, a word with a higher importance score has a greater probability of being dropped. Thus, models are forced to learn unobvious features and can achieve better performance in prediction.

Importance Score
Firstly, we compute an importance score for each word. Intuitively, the word "unique" is much more important than "movie" for determining the polarity of reviews. Naïve Bayes (NB) weighting is an effective technique for determining the importance of words (Martineau and Finin, 2009; Wang and Manning, 2012; Li et al., 2017). The NB weight $r_c^w$ of word $w$ in class $c$ is calculated as follows:
$$r_c^w = \frac{(n_c^w + \alpha) / \lVert n_c \rVert_1}{(n_{\bar{c}}^w + \alpha) / \lVert n_{\bar{c}} \rVert_1} \quad (1)$$
where $n_c^w$ is the count of word $w$ in class $c$, $n_{\bar{c}}^w$ is the count of word $w$ in the other classes, $\lVert n_c \rVert_1$ is the count of all the word occurrences in class $c$, $\lVert n_{\bar{c}} \rVert_1$ is the count of all the word occurrences in the other classes, and $\alpha$ is a smoothing parameter, set to 1 in this paper.
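The following is a minimal sketch of NB weighting, assuming the weight takes the smoothed count-ratio form used by Li et al. (2017): the smoothed relative frequency of a word in class c divided by its smoothed relative frequency in the other classes. The function name `nb_weights` and the toy corpus are illustrative and not taken from the paper.

```python
from collections import Counter


def nb_weights(docs, labels, target, alpha=1.0):
    """Smoothed NB ratio of every word for class `target` vs. all other classes."""
    in_class, others = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (in_class if label == target else others).update(tokens)
    total_in = sum(in_class.values())    # all word occurrences in class c
    total_out = sum(others.values())     # all word occurrences in the other classes
    if total_in == 0 or total_out == 0:
        raise ValueError("both the target class and the other classes need tokens")
    vocab = set(in_class) | set(others)
    return {
        w: ((in_class[w] + alpha) / total_in) / ((others[w] + alpha) / total_out)
        for w in vocab
    }


docs = [["unique", "movie"], ["warm", "movie"], ["boring", "movie"], ["dull", "movie"]]
labels = ["pos", "pos", "neg", "neg"]
weights = nb_weights(docs, labels, "pos")
```

On this toy corpus, "unique" receives a larger weight for the positive class than the neutral word "movie", which in turn receives a larger weight than "boring".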
To avoid low-frequency words being recognized as important words, we propose an improved NB weighting method based on (1), in which $\log_\beta n_c^w$ is introduced as a frequency factor; this improved weight is referred to as (2). The base $\beta$ is a hyperparameter.
For the positive class in the movie review dataset (MR), the scores of words like "unique" and "warm" should be large, since they appear much more frequently in positive texts than in negative texts. As for neutral words like "the" and "movie", their scores should be small. For a word $w$, we select its maximum score over all classes as its importance score:
$$r^w = \max_c r_c^w \quad (3)$$
In Figure 1, we show the top 30 key words of each class in the customer review dataset (CR). We aim to drop these key words with higher probabilities and encourage the model to pay more attention to other, less obvious features.
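Taking the maximum score over classes and listing the top key words can be sketched as follows. The per-class scores are assumed to have been computed beforehand. The helper names `importance_scores` and `top_k_words` are hypothetical.

```python
def importance_scores(class_scores):
    """Map {class: {word: score}} to {word: maximum score over all classes}."""
    importance = {}
    for scores in class_scores.values():
        for word, score in scores.items():
            if word not in importance or score > importance[word]:
                importance[word] = score
    return importance


def top_k_words(importance, k):
    """Return the k words with the highest importance scores."""
    return sorted(importance, key=importance.get, reverse=True)[:k]


class_scores = {
    "pos": {"unique": 2.0, "movie": 1.0, "boring": 0.5},
    "neg": {"unique": 0.5, "movie": 1.0, "boring": 2.0},
}
importance = importance_scores(class_scores)
```

A word that is strongly indicative of any single class, such as "boring" for the negative class, therefore ends up with a high importance score.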

Dropout Probability
As shown in Section 3.1, we compute word importance scores from the whole training data. It is a simple yet effective way to represent the global information. After obtaining the scores, we compress them into [0, 1). The GI-Dropout probability of word $w$ is obtained by applying a scale function to $r$, the importance score of $w$ calculated via (2). A word is never dropped when its probability is 0. The β in (2) is a key parameter. As shown in Figure 2, after tuning β, the GI-Dropout probability of a word and its probability rank follow Zipf's Law. Zipf (1935) states that, given a sample of words, the frequency of any word is inversely proportional to its rank in the frequency table. Replacing the frequency with the GI-Dropout probability, we get a variant of Zipf's Law. The experiments will show that the β value that produces this Zipfian shape on SST-1 (0.95) is not a coincidence.
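One simple way to check whether a set of dropout probabilities is approximately Zipfian is to fit a line to log(probability) against log(rank). Under an inverse-proportional relationship the slope is close to -1. The sketch below is an illustrative diagnostic, not the procedure used in the paper, and `zipf_slope` is a hypothetical helper name.

```python
import math


def zipf_slope(values):
    """Least-squares slope of log(value) against log(rank); rank 1 is the largest value."""
    positive = sorted((v for v in values if v > 0), reverse=True)
    if len(positive) < 2:
        raise ValueError("need at least two positive values")
    xs = [math.log(rank) for rank in range(1, len(positive) + 1)]
    ys = [math.log(v) for v in positive]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic probabilities that are exactly inversely proportional to rank.
probs = [0.9 / rank for rank in range(1, 51)]
slope = zipf_slope(probs)
```

Words with probability 0 are skipped because their logarithm is undefined.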

GI-Dropout Method
As illustrated in Figure 3, we implement a GI-Dropout layer before the neural network. Models without our dropout method can be viewed as the special case in which no word is dropped in the GI-Dropout layer, i.e. the dropout probabilities of all words are 0.
In this paper, every word in the training data has a score that measures its importance via the novel NB weighting method, as well as a dropout probability calculated by the proposed scale function. During training, words are dropped according to their dropout probabilities.
The implementation of our dropout method is straightforward. In the embedding layer, we get the word embedding $e_i$ of word $w_i$ by looking it up in the embedding table. After that, the word can be dropped according to its GI-Dropout probability. For word $w_i$, we set $e_i$ to the zero vector if it needs to be dropped. In this way, the neural network will not learn features from words whose embeddings are zero vectors. It is worth noting that the dropout probabilities of words differ from each other, unlike the traditional dropout method, where all the neurons are dropped with the same probability. The dropout probabilities, which are encoded with global information, guide the model to pay attention to unobvious patterns.
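The procedure above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation. The function name `gi_dropout` and the dictionary `drop_prob` are hypothetical, and words missing from `drop_prob` are assumed never to be dropped. No rescaling is applied, and dropping is disabled outside training, matching the evaluation behaviour described in this section.

```python
import random


def gi_dropout(embeddings, tokens, drop_prob, training, rng=random):
    """Zero the embedding of each token with its own word-specific probability."""
    if not training:
        # Evaluation and testing: all probabilities are treated as 0.
        return [list(e) for e in embeddings]
    out = []
    for token, emb in zip(tokens, embeddings):
        p = drop_prob.get(token, 0.0)  # assumption: unknown words are never dropped
        if rng.random() < p:
            out.append([0.0] * len(emb))  # dropped: replace with the zero vector
        else:
            out.append(list(emb))         # kept unchanged, no rescaling
    return out


tokens = ["sad", "boring"]
embeddings = [[1.0, 2.0], [3.0, 4.0]]
drop_prob = {"sad": 0.0, "boring": 1.0}
train_out = gi_dropout(embeddings, tokens, drop_prob, training=True)
eval_out = gi_dropout(embeddings, tokens, drop_prob, training=False)
```

With the extreme probabilities chosen here, "boring" is always zeroed during training and "sad" is always kept, mirroring the example in Figure 3.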
In the traditional dropout method, all the neurons are used in testing, but their weights are scaled down by a factor p (the same p as in training), since some of the units emit nothing to the next layer during training. In our method, by contrast, the dropout probabilities of all words are set to 0 during evaluation and testing so as to use all the patterns, and scaling is not needed.
[...] 2017) also achieves outstanding performance in many sentence classification tasks. We adopt these two models to evaluate GI-Dropout.

Datasets
Following Kim (2014), we evaluate the performance of the proposed approach on various datasets. We use the same seven datasets as Kim (2014), including both sentiment analysis and topic classification tasks:
MR: Movie reviews sentiment dataset.
SST-1: Stanford Sentiment Treebank with 5 sentiment labels (Socher et al., 2013). The data consists of phrase-level and sentence-level instances. To stay consistent with Kim (2014), we train the model on both phrases and sentences but only test on sentences.
The statistics of the datasets can be seen in Table 1.

CNN Model
CNNs use filters to capture semantic features of n-grams. After that, max-pooling is introduced to force the network to capture the most useful local features produced by the convolutional layers (Collobert et al., 2011). The simple CNN model in Kim (2014) consists of an embedding layer, one convolution and pooling layer, and one fully connected layer. Four model variations are provided in Kim (2014), and we choose the CNN-non-static model as our baseline. The hyperparameters of the CNN are described in Table 2. The architecture of the model integrated with GI-Dropout is shown in Figure 4.
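To make the convolution and pooling step concrete, the sketch below applies a single filter over windows of consecutive word embeddings, followed by ReLU and max-over-time pooling. It is a simplified illustration with one filter and no fully connected layer, and `conv_max_over_time` is a hypothetical name.

```python
def conv_max_over_time(embeddings, filt, bias=0.0):
    """Apply one filter over word windows, then ReLU and max-over-time pooling.

    embeddings: n word vectors of size d.
    filt: h rows of size d, i.e. a filter spanning a window of h words.
    """
    h, n = len(filt), len(embeddings)
    if n < h:
        raise ValueError("sentence is shorter than the filter window")
    feature_map = []
    for start in range(n - h + 1):
        s = bias
        for offset in range(h):
            word_vec, filt_row = embeddings[start + offset], filt[offset]
            s += sum(x * w for x, w in zip(word_vec, filt_row))
        feature_map.append(max(0.0, s))  # ReLU
    return max(feature_map)              # keep the strongest n-gram response


sentence = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_over_time(sentence, bigram_filter)
```

A word whose embedding has been set to the zero vector contributes nothing to any window it appears in, so the pooled feature has to come from the remaining words.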

Self-attentive RNN Model
Long Short-Term Memory (LSTM) is a specific recurrent neural network (RNN) architecture which is good at modeling temporal sequences and [...] 2017) consists of a bidirectional LSTM (biLSTM) and the self-attention mechanism. The self-attention mechanism is used to replace the max pooling or averaging step after the biLSTM. Multiple hops of attention are performed to extract semantic features in different aspects of the sentence.

Hyperparameters (surviving table rows):
Activation function: ReLU
Pooling method: max-over-time
MLP dropout rate: 0.5
In brief, suppose we have a sentence of $n$ tokens, and let the hidden unit number for each unidirectional LSTM be $u$. After the biLSTM layer, we get $H$, which has size $n$-by-$2u$. The attention mechanism takes the whole set of LSTM hidden states $H$ as input, and outputs a vector of weights $a$:
$$a = \mathrm{softmax}\left(w_{s2} \tanh\left(W_{s1} H^{T}\right)\right) \quad (5)$$
where $W_{s1}$ is a weight matrix with a shape of $d_a$-by-$2u$, and $w_{s2}$ is a vector of parameters with size $d_a$, where $d_a$ is a hyperparameter. To extract $r$ different aspects of the sentence, Lin et al. (2017) present multiple hops of attention, i.e. they extend $w_{s2}$ into an $r$-by-$d_a$ matrix, denoted $W_{s2}$. In the end, the annotation vector $a$ becomes the annotation matrix $A$.
The sentence embedding is:
$$M = AH \quad (6)$$
The paper then uses a 2-layer MLP with ReLU activation function to predict the labels of texts. We also use a 2-layer ReLU output MLP with 2000 hidden units. During training we use a 0.5 dropout rate on the MLP. The hyperparameters are described in Table 3.
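The attention computation can be sketched in pure Python as follows. The sketch assumes the multi-hop form A = softmax(W_s2 tanh(W_s1 H^T)), with the softmax taken over the n tokens, and the sentence embedding M = AH as in Lin et al. (2017). The helper names and toy matrices are illustrative, and the biLSTM that would produce H is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    m = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(H, W_s1, W_s2):
    """H: n x 2u hidden states; W_s1: d_a x 2u; W_s2: r x d_a.

    Returns (A, M) where A is the r x n annotation matrix and M = A H is r x 2u.
    """
    H_T = [list(col) for col in zip(*H)]                                  # 2u x n
    hidden = [[math.tanh(v) for v in row] for row in matmul(W_s1, H_T)]   # d_a x n
    A = [softmax(row) for row in matmul(W_s2, hidden)]                    # r x n
    M = matmul(A, H)                                                      # r x 2u
    return A, M


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 tokens, 2u = 2
W_s1 = [[0.5, -0.5], [0.1, 0.2]]           # d_a = 2
W_s2 = [[1.0, 0.0], [0.0, 1.0]]            # r = 2 attention hops
A, M = self_attention(H, W_s1, W_s2)
```

Each row of A is a probability distribution over the tokens, so each row of M is a weighted average of the hidden states for one attention hop.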

Experiment Settings
We apply our method to the two baseline models. For a fair comparison, we use the same hyperparameter settings as the two baselines for training and testing. For datasets that do not have test sets, we split them for cross-validation with fixed random seeds. We train all the models using early stopping and set the time delay to 10.
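A minimal sketch of early stopping is given below. It assumes that the time delay corresponds to the number of consecutive epochs without validation improvement that are tolerated before training stops, often called patience. This interpretation, the function name and the `run_epoch` callback are assumptions made for illustration.

```python
def train_with_early_stopping(run_epoch, max_epochs, patience=10):
    """Run epochs until validation accuracy has not improved for `patience` epochs.

    run_epoch(epoch) trains for one epoch and returns the validation accuracy.
    Returns (best accuracy, epoch at which it was reached).
    """
    best_acc, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        acc = run_epoch(epoch)
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_acc, best_epoch


# Simulated validation curve: improves for three epochs, then plateaus.
curve = [0.5, 0.6, 0.7] + [0.65] * 20
calls = []


def fake_epoch(epoch):
    calls.append(epoch)
    return curve[epoch]


result = train_with_early_stopping(fake_epoch, max_epochs=len(curve), patience=10)
```

In practice the model parameters from the best epoch would also be saved and restored. That bookkeeping is omitted here.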

Effectiveness of GI-Dropout
Results on 7 datasets are listed in Table 4. Experiments show that the models with GI-Dropout outperform both CNN and self-attentive RNN baselines by a significant margin.
To test whether global information makes the key contribution, we conduct another experiment in which all words are dropped with the same probability at the GI-Dropout layer. Grid search is used to find the best result, which is listed in the "Dropout-same-prob" row.
The one-layer CNN provides a very strong baseline. The first line is the result of the CNN-non-static model in Kim (2014). We reproduce the experimental results in the "CNN-baseline" row.
By integrating our GI-Dropout mechanism, performance improves significantly further on both the CNN and RNN models. Compared with Dropout-same, there is a clear advantage: the results on all of the datasets are improved.
The comparison between GI-Dropout and Dropout-same indicates that GI-Dropout benefits from global information, which provides explicit semantic information to guide the training process.
Even when compared with other models with complex architectures, GI-Dropout models achieve the best accuracy on most datasets, especially in SST-1 and SST-2.

Further Analysis of Our Method
With GI-Dropout, we drop words according to their importance scores. The higher the score of a word, the greater its chance of being dropped. We further analyze why GI-Dropout works so well, and the relationship between β and accuracy.
GI-Dropout helps models to learn inapparent features. To test whether the method indeed helps models to learn the inapparent features, we conduct experiments in which the top-k apparent words (those with the highest importance scores) are removed from the test cases in SST-2. Results are shown in Table 6. We observe that the CNN baseline model is more sensitive to the apparent features, and that the GI-Dropout model still obtains relatively good results even when we remove the top 1000 apparent words. This suggests that, with the help of GI-Dropout, the model pays more attention to the inapparent features.
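The test-time ablation described above can be sketched as follows: the k words with the highest importance scores are removed from every test sentence before it is passed to a trained model. The function name `remove_top_k` and the toy scores are illustrative assumptions.

```python
def remove_top_k(sentences, importance, k):
    """Delete the k highest-scoring words from every tokenized sentence."""
    top = set(sorted(importance, key=importance.get, reverse=True)[:k])
    return [[t for t in tokens if t not in top] for tokens in sentences]


importance = {"boring": 3.0, "sad": 2.0, "story": 1.0}
test_sentences = [["the", "story", "is", "sad", "and", "very", "boring"]]
without_top1 = remove_top_k(test_sentences, importance, 1)
without_top2 = remove_top_k(test_sentences, importance, 2)
```

Accuracy would then be measured on the filtered sentences for increasing values of k and compared across models.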
GI-Dropout helps models to reduce overfitting to the apparent features. Frequent words can easily induce the model to focus on limited features and activate a part of the units with large scores. This can be seen by analyzing the cases in which the proposed model makes a correct prediction and the baseline makes an incorrect prediction:
(1) provide -lrb- s -rrb- nail-biting suspense and credible characters without relying on technology-of-the-moment technique or pretentious dialogue. [8]
(2) the screenplay sabotages the movie's strengths at almost every juncture.
(3) this is cool, slick stuff, ready to quench the thirst of an audience that misses the summer blockbusters.
The baseline model is prone to focus only on the prominent features, e.g. "pretentious" (negative) in case (1), "strengths" (positive) in case (2) and "misses" (negative) in case (3), and then makes wrong predictions. Even though there are some important words indicating the opposite polarity, e.g. "without" in case (1), "sabotages" in case (2), and "cool", "slick" and "quench" in case (3), the model cannot make use of these features efficiently.
By integrating our GI-Dropout method, the model can learn not only the obvious features, e.g. "strengths", but also the less obvious features, e.g. "sabotages". Thus, it makes correct predictions in all the above cases.
The relationship between β and accuracy. Another point worth noting is the value of β in Equation 2. As shown in Figure 2, the probability of a word and its rank follow Zipf's Law when β is 0.95 in SST-1. In fact, for each dataset, there is an appropriate β value for Equation 2 under which the dropout probability and its rank approximately follow a Zipfian distribution. We assume that a β setting in accord with Zipf's Law could have an important positive effect on model performance. To examine this hypothesis, we further test the influence of different β values on the CNN and RNN models. As expected, Table 5 shows that both the CNN and the RNN achieve their best results on SST-1 with β set to 0.95.
[8] Words in bold denote the apparent features with high importance scores, e.g. "pretentious" appears 159 times in positive texts and 5 in negative texts. Underlined words represent unobvious features that also contribute to the class prediction.

Conclusion
This paper proposes GI-Dropout, a novel dropout method which utilizes global information and guides neural networks to extract not only obvious features but also unobvious features.
This idea is inspired by dropout, in which units are dropped randomly during training with the same probability. Unlike the traditional dropout method, we use global information to guide our dropout based on the importance of the words.
By integrating this mechanism, we encode global information explicitly into the model via a novel Naïve Bayes weighting method. We find that the model can become sensitive to some inapparent patterns, which is of great help to classification. Experimental results demonstrate the effectiveness of GI-Dropout on multiple text classification tasks. In addition, our method requires no external resources and relatively little computation. It is simple but effective and could be easily applied to other NLP tasks.

Figure 1: Top 30 key words of each class in Customer Review dataset
Figure 3: GI-Dropout. In this case, the word embedding of "boring" is dropped and set to the zero vector while that of "sad" is not.

Table 1: Datasets summary. c: Number of target classes. l: Average sentence length. N: Dataset size. V: Vocabulary size. Test: Test set size (CV means there is no standard train/test split and thus 10-fold CV is used).

Table 4: Effectiveness of GI-Dropout. Dropout-same means dropping units with the same probability.

Table 6: Accuracy decline when removing top-k apparent words in SST-2.