Is it Time to Swish? Comparing Deep Learning Activation Functions Across NLP tasks

Activation functions play a crucial role in neural networks because they are the nonlinearities to which the success of deep learning has partly been attributed. One of the currently most popular activation functions is ReLU, but several competitors have recently been proposed or 'discovered', including LReLU functions and swish. While most works compare newly proposed activation functions on few tasks (usually from image classification) and against few competitors (usually ReLU), we perform the first large-scale comparison of 21 activation functions across eight different NLP tasks. We find that a largely unknown activation function performs most stably across all tasks, the so-called penalized tanh function. We also show that it can successfully replace the sigmoid and tanh gates in LSTM cells, leading to a 2 percentage point (pp) improvement over the standard choices on a challenging NLP task.


Introduction
Activation functions are a crucial component of neural networks because they turn an otherwise linear classifier into a non-linear one, which has proven key to the high performances witnessed across a wide range of tasks in recent years. While different activation functions such as sigmoid or tanh are often equivalent on a theoretical level, in the sense that they can all approximate arbitrary continuous functions (Hornik, 1991), different activation functions often show very diverse behavior in practice.
For example, sigmoid, one of the activation functions dominating neural network practice for several decades, eventually turned out to be less suitable for learning because (according to accepted wisdom) its small derivative may lead to vanishing gradients. In this respect, the so-called ReLU function (Glorot et al., 2011) has proven much more suitable. It has an identity derivative in the positive region and is thus claimed to be less susceptible to vanishing gradients. It has therefore (arguably) become the most popular activation function. The recognition of ReLU's success has led to various proposed extensions (Maas et al., 2013; He et al., 2015; Klambauer et al., 2017), but none has reached the same popularity, most likely because of ReLU's simplicity and because the reported gains tended to be inconsistent or marginal across datasets and models (Ramachandran et al., 2017).
Activation functions have been characterized by a variety of properties deemed important for successful learning, such as ones relating to their derivatives, monotonicity, and whether their range is finite or not. However, in recent work, Ramachandran et al. (2017) employed automatic search to find high-performing novel activation functions, where their search space contained compositions of elementary unary and binary functions such as max, min, sin, tanh, or exp. They found many functions violating properties deemed useful, such as non-monotonic activation functions or functions violating the gradient-preserving property of ReLU. Indeed, their most successful function, which they call swish, violates both of these conditions. However, as with previous works, they also only evaluated their newly discovered as well as their (rectifier) baseline activation functions on few different datasets, usually taken from the image classification community such as CIFAR (Krizhevsky, 2009) and ImageNet (Russakovsky et al., 2015), and using few types of different networks, such as the deep convolutional networks abounding in the image classification community (Szegedy et al., 2016).
To our best knowledge, there exists no large-scale empirical comparison of different activation functions across a variety of tasks and network architectures, and even less so within natural language processing (NLP). Thus, the question which activation function really performs best and most stably across different NLP tasks and popular NLP models remains unanswered to date.
In this work, we fill this gap. We compare (i) 21 different activation functions, including the 6 top performers found from automatic search in Ramachandran et al. (2017), across (ii) three popular NLP task types (sentence classification, document classification, sequence tagging) comprising 8 individual tasks, (iii) using three different popular NLP architectures, namely, MLPs, CNNs, and RNNs. We also (iv) compare all functions across two different dimensions, namely: top vs. average performance.
We find that a largely unknown activation function, penalized tanh (Xu et al., 2016), performs most stably across our different tasks. We also find that it can successfully replace tanh and sigmoid activations in LSTM cells. We further find that the majority of top-performing functions found in Ramachandran et al. (2017) do not perform well for our tasks. An exception is swish, which performed well across several tasks, but less stably than penalized tanh and other functions.


Theory

Activation functions We consider 21 activation functions, 6 of which are "novel" and proposed in Ramachandran et al. (2017). The functional form of these 6 is given in Table 1, together with the sigmoid function.
The remaining 14 are: tanh, sin, relu, lrelu-0.01, lrelu-0.30, maxout-2, maxout-3, maxout-4, prelu, linear, elu, cube, penalized tanh, selu. We briefly describe them: lrelu-0.01 and lrelu-0.30 are the so-called leaky relu (LReLU) functions (Maas et al., 2013); the idea behind them is to avoid zero activations/derivatives in the negative region of relu. Their functional form is given in Table 1. prelu (He et al., 2015) generalizes the LReLU functions by allowing the slope in the negative region to be a learnable parameter. The maxout functions (Goodfellow et al., 2013) are different in that they introduce additional parameters and do not operate on a single scalar input. For example, maxout-2 is the operation that takes the maximum of two inputs: max{xW + b, xV + c}, so the number of learnable parameters is doubled. maxout-3 is the analogous function that takes the maximum of three inputs. As shown in Goodfellow et al. (2013), maxout can approximate any convex function. sin is the standard sine function, proposed for neural network learning, e.g., in Parascandolo et al. (2016), where it was shown to enable faster learning on certain tasks than more established functions. penalized tanh (Xu et al., 2016) has been defined in analogy to the LReLU functions, which can be thought of as "penalizing" the identity function in the negative region. The reported good performance of penalized tanh on CIFAR-100 (Krizhevsky, 2009) lets the authors speculate that the slope of activation functions near the origin may be crucial for learning. linear is the identity function, f(x) = x. cube is the function f(x) = x^3, proposed in Chen and Manning (2014) for an MLP used in dependency parsing. elu (Clevert et al., 2015) has been proposed as (yet another) variant of relu that assumes negative values, making the mean activations more zero-centered. selu is a scaled variant of elu used in Klambauer et al. (2017) in the context of so-called self-normalizing neural nets.
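A minimal NumPy sketch (not the authors' code) of a few of the scalar functions above may make the definitions concrete; penalized tanh uses the default 0.25 slope from Xu et al. (2016):

```python
import numpy as np

# Illustrative NumPy versions of a few of the scalar activation functions
# compared in this work (a sketch, not the original implementation).
def relu(x):
    return np.maximum(0.0, x)

def lrelu(x, a=0.01):                 # lrelu-0.01; a=0.30 gives lrelu-0.30
    return np.where(x > 0, x, a * x)

def penalized_tanh(x):                # tanh, "penalized" in the negative region
    return np.where(x > 0, np.tanh(x), 0.25 * np.tanh(x))

def swish(x):                         # x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def elu(x, a=1.0):                    # negative values pull mean activations toward zero
    return np.where(x > 0, x, a * (np.exp(x) - 1.0))

def cube(x):
    return x ** 3
```

Note how penalized tanh mirrors the LReLU construction: the positive region is untouched and the negative region is scaled down, here applied to tanh instead of the identity.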
Properties of activation functions Many properties of activation functions have been speculated to be crucial for successful learning. Some of these are listed in Table 2, together with brief descriptions.

Experiments
We conduct experiments using three neural network types and three types of NLP tasks, described in §3.1, §3.2, and §3.3 below.

MLP & Sentence Classification
Model We experiment with a multi-layer perceptron (MLP) applied to sentence-level classification tasks. That is, input to the MLP is a sentence or short text, represented as a fixed-size vector embedding. The output of the MLP is a label which classifies the sentence or short text. We use two sentence representation techniques, namely, Sent2Vec (Pagliardini et al., 2018), of dimensionality 600, and InferSent (Conneau et al., 2017), of dimensionality 4096. Our MLP has the form:

x_i = f(W_i x_{i-1} + b_i),  i = 1, . . . , N
y = softmax(W x_N + b)

where x_0 is the input representation, x_1, . . . , x_N are hidden layer representations, and y is the output, a probability distribution over the classes in the classification task. Vectors b and matrices W are the learnable parameters of our network. The activation function is given by f and ranges over the choices described in §2.
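The forward pass above can be sketched as follows; the dimensions and random initialization are illustrative only (a 600-d Sent2Vec-style input, one hidden layer, three classes), not the actual experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mlp_forward(x0, weights, biases, f):
    # hidden layers: x_i = f(W_i x_{i-1} + b_i)
    x = x0
    for W, b in zip(weights[:-1], biases[:-1]):
        x = f(W @ x + b)
    # output layer: y = softmax(W x_N + b), a distribution over classes
    return softmax(weights[-1] @ x + biases[-1])

# toy dimensions: 600-d input (as for Sent2Vec), 64 hidden units, 3 classes
d_in, d_h, n_cls = 600, 64, 3
Ws = [rng.normal(size=(d_h, d_in)) * 0.01, rng.normal(size=(n_cls, d_h)) * 0.01]
bs = [np.zeros(d_h), np.zeros(n_cls)]
y = mlp_forward(rng.normal(size=d_in), Ws, bs, np.tanh)
```

Swapping `np.tanh` for any of the functions in §2 changes only the argument `f`, which is exactly the experimental variable here.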
Data We use four sentence classification tasks, namely: movie review classification (MR), subjectivity classification (SUBJ), question type classification (TREC), and classifying whether a sentence contains an argumentation structure of a certain type (claim, premise, major claim) or else is non-argumentative (AM). The first three datasets are standard sentence classification datasets and contained in the SentEval framework. We choose the AM dataset for task diversity, and derive it by projecting token-level annotations in the dataset from Stab and Gurevych (2017) to the sentence level. In the rare case (<5% of the cases) when a sentence contains multiple argument types, we choose one based on the ordering Major Claim (MC) > Claim (C) > Premise (P). Datasets and examples are listed in Table 3.
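The projection of token-level annotations to one sentence label under the ordering MC > C > P might be sketched as follows; the label strings are hypothetical placeholders, not the dataset's actual tag names:

```python
# Hypothetical sketch of deriving a sentence label from token-level argument
# annotations, using the stated ordering Major Claim > Claim > Premise.
PRIORITY = {"MajorClaim": 0, "Claim": 1, "Premise": 2}

def sentence_label(token_types):
    args = [t for t in token_types if t in PRIORITY]
    if not args:
        return "Non-argumentative"
    return min(args, key=PRIORITY.get)   # highest-priority component type wins
```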
Approach We consider 7 "mini-experiments":
• (1): MR dataset with Sent2Vec-unigram embeddings as input and 1% of the full data as training data; (2): the same mini-experiment with 50% of the full data as training data. In both cases, the dev set comprises 10% of the full data and the rest is for testing.
• (3, 4): SUBJ with InferSent embeddings and likewise 1% and 50% of the full data as train data, respectively.
• (5): The TREC dataset with its original split into train and test; 50% of the train split is used as dev data.
• (6): The AM dataset with its original split into train, dev, and test (Eger et al., 2017), and with InferSent input embeddings; (7): the same mini-experiment with Sent2Vec-unigram embeddings.
We report accuracy for mini-experiments (1-5) and macro-F1 for (6-7). We report macro-F1 for (6-7) because the AM dataset is imbalanced. The motivation behind choosing different input embeddings for different tasks was to investigate a wider variety of conditions. Choosing subsets of the full data had the same intention.
For all 7 mini-experiments, we draw the same 200 randomly chosen hyperparameters from the ranges indicated in Table 4. All experiments are conducted in keras. For each of the 21 different activation functions detailed in §2, we conduct each mini-experiment with the 200 randomly chosen hyperparameters. All activation functions use the same hyperparameters and the same train, dev, and test splits. We store two results for each mini-experiment, namely: (i) the test result corresponding to the best (best) dev performance; (ii) the average (mean) test result across all hyperparameters. The best result scenario mirrors standard optimization in machine learning: it indicates the score one can obtain with an activation function when the MLP is well-optimized. The mean result scenario is an indicator for what one can expect when hyperparameter optimization is 'shallow' (e.g., because computing times are prohibitive): it gives the average performance for randomly chosen hyperparameters. We note that we run each hyperparameter combination with 5 different random weight initializations and all reported scores (best dev score, best, and mean) are averages over these 5 random initializations.
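The best/mean protocol can be sketched as follows; `train_and_eval` is a hypothetical stand-in for an actual training run, and the hyperparameter ranges shown are illustrative rather than the full set from Table 4:

```python
import random

random.seed(0)

# Sketch of the evaluation protocol: draw hyperparameter settings once and
# share them across all activation functions, then record (i) the test score
# at the best dev score ("best") and (ii) the mean test score over all
# settings ("mean"). `train_and_eval` is a hypothetical stand-in returning a
# (dev_score, test_score) pair for one hyperparameter setting.
def draw_hyperparams():
    return {"layers": random.choice([1, 2, 3]),
            "dropout": random.uniform(0.0, 0.75),
            "units": random.choice([50, 100, 200])}

def best_and_mean(settings, train_and_eval):
    results = [train_and_eval(hp) for hp in settings]
    best_test = max(results, key=lambda r: r[0])[1]          # test at best dev
    mean_test = sum(t for _, t in results) / len(results)    # average test
    return best_test, mean_test

settings = [draw_hyperparams() for _ in range(200)]
```

Sharing `settings` across all activation functions is what makes the best and mean columns directly comparable between functions.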
Finally, we set the following hyperparameters for all MLP experiments: patience of 10 for early stopping, batch size 16, 100 epochs for training.
Results Figure 1 shows best and mean results, averaged over all 7 mini-experiments, for each activation function. To make individual scores comparable across mini-experiments, we perform max normalization and divide each score by the maximum score achieved in any given mini-experiment (for best and mean, respectively) before averaging. (In the hyperparameter ranges of Table 4, N(µ, s) denotes the normal distribution with mean µ and std s; µ = m is the default value from keras for the specific optimizer, and a drawn learning rate below 0 is set to m.) For best, the top performers are the rectifier functions (relu, lrelu-0.01, prelu) as well as maxout and penalized tanh. The newly discovered activation functions lag behind, with the best of them being minsin and swish. linear is worst, together with elu and cube. Overall, however, the difference between the best activation function, relu, and the worst, linear, is only roughly 2pp. This means that if hyperparameter search is done carefully, the choice of activation function is less important for these sentence classification tasks. Particularly the (binary) tasks MR and SUBJ appear robust against the choice of activation function, with the difference between the best and worst function always being less than 1pp, in all settings. For TREC and AM, the situation is slightly different: for TREC, the difference is 2pp (swish vs. maxsig) and for AM, it is 3pp using InferSent embeddings (swish vs. cube) and 12pp using Sent2Vec embeddings (relu vs. linear). It is noteworthy that swish wins 2 out of the 3 cases in which the choice of activation function really matters. mean results are very different from best results. Here, somewhat surprisingly, the oscillating
sin function wins, followed by penalized tanh, maxout and swish. The difference between the best mean function, sin, and the worst, cube, is more than 30pp. This means that using cube is much riskier and requires more careful hyperparameter search compared to sin and the other top performers.
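The max normalization used for averaging scores across mini-experiments can be sketched as:

```python
# Max normalization as described above: each activation function's score on a
# mini-experiment is divided by the best score any function achieved on that
# mini-experiment, then the normalized scores are averaged.
def max_normalized_average(scores_per_experiment):
    # scores_per_experiment: list of dicts {activation_name: score}
    totals = {}
    for scores in scores_per_experiment:
        top = max(scores.values())
        for name, s in scores.items():
            totals[name] = totals.get(name, 0.0) + s / top
    n = len(scores_per_experiment)
    return {name: t / n for name, t in totals.items()}
```

A function that wins every mini-experiment thus receives an averaged score of exactly 1 (100%), which is how the percentage figures in this and the following sections should be read.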

CNN & Document Classification
Model Our second paradigm is document classification using a CNN. This approach has been popularized in NLP by the ground-breaking work of Kim (2014). Even though shallow CNNs do not reach state-of-the-art results on large datasets anymore (Johnson and Zhang, 2017), simple approaches like (shallow) CNNs are still very competitive for smaller datasets (Joulin et al., 2016). Our model operates on the token level and first embeds a sequence of tokens x_1, . . . , x_n, represented as 1-hot vectors, into learnable embeddings x_1, . . . , x_n. The model then applies a 1D convolution on top of these embeddings. That is, a filter w of size h takes h successive embeddings x_{i:i+h-1}, performs a scalar product and obtains a feature c_i:

c_i = f(w · x_{i:i+h-1} + b)

Here, f is the activation function and b is a bias term. We take the number n_k of different filters as a hyperparameter. When our network has multiple layers, we stack another convolutional layer on top of the first (in total we have n_k outputs at each time step), and so on. Our penultimate layer is a global max pooling layer that selects the maximum from each feature map. A final softmax layer terminates the network.
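The convolution-plus-pooling step can be sketched as follows (illustrative only, not the actual implementation):

```python
import numpy as np

# Sketch of the 1D convolution step: a filter of size h slides over the token
# embeddings; each window yields one feature via the activation f, and global
# max pooling then keeps only the maximum of the resulting feature map.
def conv1d_features(X, w, b, f):
    # X: (n, d) token embeddings; w: (h, d) filter -> (n - h + 1,) features
    h = w.shape[0]
    return np.array([f(np.sum(w * X[i:i + h]) + b)
                     for i in range(len(X) - h + 1)])

def global_max_pool(feature_map):
    return feature_map.max()
```

With n_k different filters, this step is repeated n_k times, yielding one pooled value per filter before the final softmax layer.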
Data We use two document classification tasks, namely: 20 Newsgroup (NG) and Reuters-21578 R8 (R8). Both datasets are standard document classification datasets. In NG, the goal is to classify each document into one of 20 newsgroup classes (alt.atheism, sci.med, sci.space, etc.). In R8, the goal is to classify Reuters news text into one of eight classes (crude, earn, grain, interest, etc.). We used the preprocessed files from https://www.cs.umb.edu/~smimarog/textmining/datasets/ (in particular, stopwords are removed and the text is stemmed).
Approach We consider 4 mini-experiments:
• (1, 2): NG dataset with 5% and 50%, respectively, of the full data as train data. In both cases, 10% of the full data is used as dev data, and the rest as test data.
• (3, 4): Same as (1, 2) for R8.
We report accuracy for all experiments. We use a batch size of 64, 50 epochs for training, and a patience of 10. For all mini-experiments, we again draw 200 randomly chosen hyperparameters from the ranges indicated in Table 4. The hyperparameters and train/dev/test splits are the same for all activation functions.
Results Figure 2 shows best and mean results, averaged over all mini-experiments. This time, the winners for best are elu, selu (again two members from the rectifier family), and maxout-3, but the difference between maxout-3 and several lower ranked functions is minimal. The cube function is again worst and sigmoid and cosid have similarly bad performance. Except for minsin, the newly proposed activation functions from Ramachandran et al. (2017) again considerably lag behind. The most stable activation functions are the maxout functions as well as penalized tanh, tanh and sin.

RNN & Sequence Tagging
Model Our third paradigm is sequence tagging, a ubiquitous model type in NLP. In sequence tagging, a sequence of input tokens w_1, . . . , w_K is mapped to a sequence of labels y_1, . . . , y_K. Classical sequence tagging tasks include POS tagging, chunking, NER, discourse parsing (Braud et al., 2017), and argumentation mining (Eger et al., 2017; Schulz et al., 2018). We use a standard recurrent net for sequence tagging, whose form is:

h_i = f(V w_i + U h_{i-1} + b)
y_i = softmax(W h_i + c)

Here, w_i are (pre-trained) word embeddings of words w_i. Vectors b, c and matrices U, V, W are parameters to be learned during training. The above describes an RNN with only one hidden layer, h_i, at each time step, but we consider the generalized form with N ≥ 1 hidden layers; we also choose a bidirectional RNN in which the hidden outputs of a forward RNN and a backward RNN are combined. RNNs are particularly deep networks (indeed, the depth of the network corresponds to the length of the input sequence), which makes them particularly susceptible to the vanishing gradient problem (Pascanu et al., 2012). Initially, we do not consider the more popular LSTMs here, for reasons indicated below. However, we include a comparison after discussing the RNN performance.
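A minimal sketch of this one-layer, unidirectional tagger (with a per-token softmax output; dimensions illustrative):

```python
import numpy as np

# Sketch of the recurrent tagger: at each step, the hidden state is updated
# from the current word embedding and the previous hidden state, and a
# distribution over labels is emitted.
def rnn_tag_probs(word_embs, U, V, W, b, c, f):
    h = np.zeros(U.shape[0])
    out = []
    for w in word_embs:                   # w: pre-trained word embedding
        h = f(V @ w + U @ h + b)          # h_i = f(V w_i + U h_{i-1} + b)
        z = W @ h + c                     # logits over the label set
        e = np.exp(z - z.max())
        out.append(e / e.sum())           # y_i = softmax(W h_i + c)
    return out
```

Note that the same activation f is applied once per token, so for a K-token sentence the gradient passes through f K times, which is why activation choice matters so much more here than in the shallow MLP.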
Data We use two sequence tagging tasks, namely: English POS tagging (POS), and token-level argumentation mining (TL-AM) using the same dataset (consisting of student essays) as for the sentence-level experiments. In token-level AM, we tag each token with a BIO label plus the component type, i.e., the label space is Y = {B, I} × {MC, C, P} ∪ {O}, where 'O' is the label for non-argumentative tokens. The motivation for using TL-AM is that, putatively, AM has more long-range dependencies than POS or similar sequence tagging tasks such as NER, because argument components are much longer than named entities and component labels also depend less on the current token.
Approach We consider 6 mini-experiments:
• (1): TL-AM with Glove-100d word embeddings and 5% of the original training data as train data; (2): the same with 30% of the original training data as train data. In both cases, dev and test follow the original splits (Eger et al., 2017).
• (3, 4): Same as (1) and (2) but with 300d Levy word embeddings (Levy and Goldberg, 2014).
• (5, 6): POS with Glove-100d word embeddings and 5% and 30%, respectively, of the train data of a pre-determined train/dev/test split (13k/13k/178k tokens). Dev and test are fixed in both cases.
We report macro-F1 for mini-experiments (1-4) and accuracy for (5-6). For our RNN implementations, we use the accompanying code of (the state-of-the-art model of) Reimers and Gurevych (2017), which is implemented in keras. The network uses a CRF layer as an output layer. We use a batch size of 32, train for 50 epochs and use a patience of 5 for early stopping.
Results Figure 3 shows best and mean results, averaged over all 6 mini-experiments, for each activation function. We exclude prelu and the maxout functions because the keras implementation does not natively support these activation functions for RNNs. We also exclude the cube function because it performed very badly. Unlike for sentence classification, there are much larger differences between the activation functions. For example, there is almost 20pp difference between the best best activation functions (relu, lrelu-0.01, swish, penalized tanh) and the worst ones (linear, cosid, and sigmoid); the differences would have been even larger had we included cube. Interestingly, this difference is mostly due to the TL-AM task: for POS, there is only 3pp difference between the best function (sigmoid (sic!), though with almost zero margin to the next best ones) and the worst one (linear), while this difference is almost 40pp for TL-AM. This appears to confirm our concerns regarding the POS tagging task as not being challenging enough due to a lack of, e.g., long-range dependencies.
The four best best activation functions in Figure 3 are also the functions with the best mean results, i.e., they are most stable over different hyperparameters. The clear winner in this category is penalized tanh with 100% mean score, followed by swish with 91%. Worst is cosid with 30%. It is remarkable how large the difference between tanh and penalized tanh is both for best and mean (7pp and 20pp, respectively), which is much larger than the differences between the analogous pair of LReLU and relu. This appears to make a strong case for the importance of the slope around the origin, as suggested in Xu et al. (2016).
LSTM vs. RNN Besides an RNN, we also implemented a more popular RNN model with (bidirectional) LSTM blocks in place of standard hidden layers. Standard LSTM units follow the equations (simplified, biases omitted):

f_t = σ(W_f x_t + U_f h_{t-1})
i_t = σ(W_i x_t + U_i h_{t-1})
o_t = σ(W_o x_t + U_o h_{t-1})
c_t = f_t ⊙ c_{t-1} + i_t ⊙ τ(W_c x_t + U_c h_{t-1})
h_t = o_t ⊙ τ(c_t)

where f_t and i_t are perceived of as gates that control information flow, x_t is the input at time t and h_t is the hidden layer activation (⊙ denotes element-wise multiplication). In keras (and most standard references), σ is the (hard) sigmoid function, and τ is the tanh function.
We ran an LSTM on the TL-AM dataset with Levy word embeddings and the 5% and 30% data size setups. We varied σ and τ independently, in each case keeping the other function at its default.
We find that the top two choices for τ are penalized tanh and tanh (margin of 10pp), given that σ is sigmoid. For τ = tanh, the best choices are σ = penalized tanh, sigmoid, and tanh. All other functions perform considerably worse. Thus, the top-performers are all saturating functions, indicating the different roles activation functions play in LSTMs-those of gates-compared to standard layers. It is worth mentioning that choosing σ or τ as penalized tanh is on average better than the standard choices for σ and τ . Indeed, choosing τ = σ = penalized tanh is on average 2pp better than the default choices of τ, σ.
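One simplified LSTM step with σ and τ made swappable, as in the experiment above, can be sketched as follows (weight shapes illustrative; biases omitted as in the simplified equations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def penalized_tanh(x):
    return np.where(x > 0, np.tanh(x), 0.25 * np.tanh(x))

# Sketch of one simplified LSTM step. P holds the input weights (W_f, W_i,
# W_o, W_c) followed by the recurrent weights (U_f, U_i, U_o, U_c); sigma is
# the gate nonlinearity, tau the input/output nonlinearity.
def lstm_step(x, h, cst, P, sigma, tau):
    Wf, Wi, Wo, Wc, Uf, Ui, Uo, Uc = P
    f_t = sigma(Wf @ x + Uf @ h)                     # forget gate
    i_t = sigma(Wi @ x + Ui @ h)                     # input gate
    o_t = sigma(Wo @ x + Uo @ h)                     # output gate
    cst = f_t * cst + i_t * tau(Wc @ x + Uc @ h)     # cell state update
    return o_t * tau(cst), cst

# e.g. the variant tested above: sigma = tau = penalized_tanh
```

Because the gates multiply the cell state, a gate function with unbounded range (such as relu) can blow the state up, which is consistent with the observation that only saturating functions work well in these positions.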
It is further worth mentioning that the best best results for the LSTM are roughly 5pp better (absolute) than the best corresponding choices for the simple RNN.

Analysis & Discussion
Winner statistics Each of the three meta-tasks (sentence classification, document classification, and sequence tagging) was won, on average, by a member of the rectifier family, namely relu (twice) and elu, for best. Also, in each case, cube and cosid were among the worst-performing activation functions. The majority of newly proposed functions from Ramachandran et al. (2017) ranked somewhere in the mid-field, with swish and minsin performing best in the best category. For the mean category, we regularly had the maxout functions as well as penalized tanh and sin among the top performers.
To get further insights, we computed a winner statistic across all 17 mini-experiments, counting how often each activation function was among the top 3. Table 5 shows the results, excluding prelu and the maxout functions because they were not considered in all mini-experiments.
best: penalized tanh (6), swish (6), elu (4), relu (4), lrelu-0.01 (4)
mean: penalized tanh (16), tanh (13), sin (10)

We see that penalized tanh and swish win here for best, followed by further rectifier functions. The mean category is clearly won by saturating activation functions with finite range. If this comparison were restricted to sentence and document classification, where we also included the maxout functions, then penalized tanh would have been outperformed by maxout for mean.
This appears to yield the conclusion that functions with limited range behave more stably across hyperparameter settings, while non-saturating functions tend to yield better top performances. The noteworthy exception to this rule is penalized tanh, which excels in both categories (the more expensive maxout functions would be further exceptions). If the slope around the origin of penalized tanh is responsible for its good performance, then this could also explain why cube is so bad, since it is very flat close to the origin.
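The top-3 winner statistic described above can be sketched as:

```python
from collections import Counter

# Sketch of the winner statistic: for each mini-experiment, the three
# best-scoring activation functions each receive one count.
def winner_counts(scores_per_experiment, k=3):
    counts = Counter()
    for scores in scores_per_experiment:
        counts.update(sorted(scores, key=scores.get, reverse=True)[:k])
    return counts
```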
Influence of hyperparameters To get some intuition about how hyperparameters affect our different activation functions, we regressed the score of the functions on the test set on all the employed hyperparameters. For example, we estimated:

y = α_0 + α_l · n_l + α_d · d + · · ·    (1)

where y is the score on the test set, n_l is the number of layers in the network, d is the dropout value, etc. The coefficients α_k for each regressor k are what we want to estimate (in particular, their size and their sign). We logarithmized certain variables whose scale was substantially larger than those of others (e.g., number of units, number of filters). For discrete regressors such as the optimizer we used binary dummy variables. We estimated Eq.
(1) independently for each activation function and for each mini-experiment. Overall, there was a very diverse pattern of outcomes, preventing us from drawing overly strong conclusions. Still, we observed that while all models performed better on average with fewer hidden layers, swish in particular was robust to more hidden layers (small negative coefficient α_l), as was, to a lesser degree, penalized tanh. In the sentence classification tasks, sin and the maxout functions were particularly robust to an increase in hidden layers. Since penalized tanh is a saturating function and sin even an oscillating one, we therefore conclude that preserving the gradient (derivative close to one) is not a necessary prerequisite to successful learning in deeper neural networks.
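The regression can be sketched with ordinary least squares on synthetic data; a real design matrix would additionally contain binary dummies for discrete regressors (e.g., the optimizer) and logarithms of large-scale regressors (e.g., number of units):

```python
import numpy as np

# Sketch of the hyperparameter regression: ordinary least squares of the
# test score on the hyperparameters, as in Eq. (1). The data here is a
# noiseless toy example, not the actual experimental results.
def fit_coefficients(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef                    # alpha_k for each regressor column of X

rng = np.random.default_rng(0)
n_l = rng.integers(1, 4, size=50).astype(float)    # number of hidden layers
d = rng.uniform(0.0, 0.75, size=50)                # dropout
X = np.column_stack([np.ones(50), n_l, d])         # intercept + regressors
y = 0.9 - 0.05 * n_l + 0.0 * d                     # toy scores: more layers hurt
alpha = fit_coefficients(X, y)
```

In this toy setup, a negative α for n_l (here −0.05) is exactly the signature of a function that degrades with depth; a coefficient near zero would indicate robustness to more hidden layers.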

Concluding remarks
We have conducted the first large-scale comparison of activation functions across several different NLP tasks (and task types) and using different popular neural network types. Our main focus was on so-called scalar activation functions, but we also partly included the more costly 'many-to-one' maxout functions. Our findings suggest that the rectifier functions (and the similarly shaped swish) can be top performers for each task, but their performance is unstable and cannot be predicted a priori. One of our major findings is that, in contrast, the saturating penalized tanh function performs much more stably in this respect and can with high probability be expected to perform well across tasks as well as across different choices of hyperparameters. This appears to make it the method of choice particularly when hyperparameter optimization is costly. When hyperparameter optimization is cheap, we recommend considering the activation function as another hyperparameter and choosing it, e.g., from the range of functions listed in Table 5 along with maxout.
Another major advantage of the penalized tanh function is that it may also take the role of a gate (because of its finite range) and thus be employed in more sophisticated neural network units such as LSTMs, where the rectifiers fail completely. In this context, we noticed that replacing sigmoid and tanh in an LSTM cell with penalized tanh leads to a 2pp increase on a challenging NLP sequence tagging task. Exploring whether this holds across more NLP tasks should be the subject of future work. Additionally, our research suggests that it is worthwhile to further explore penalized tanh, an arguably marginally known activation function. For instance, scaling factors other than 0.25 (the default value from Xu et al. (2016)) should be explored. As with prelu, the scaling factor can also be made part of the optimization problem.
Finally, we found that except for swish, none of the newly discovered activation functions from Ramachandran et al. (2017) made it into our top categories, suggesting that automatic search for activation functions should be conducted across multiple tasks in the future.