FENAS: Flexible and Expressive Neural Architecture Search

Architecture search is the automatic process of designing a model or cell structure that is optimal for a given dataset or task. Recently, this approach has shown good performance improvements (on language modeling and image classification) with reasonable training speed, using a weight-sharing approach called Efficient Neural Architecture Search (ENAS). In this work, we propose a novel architecture search algorithm called Flexible and Expressive Neural Architecture Search (FENAS), with a more flexible and expressive search space than ENAS in terms of more activation functions, input edges, and atomic operations. Unlike ENAS, our FENAS approach can reproduce the well-known LSTM and GRU architectures, and can also initialize with them to find architectures more efficiently. We explore this extended search space via evolutionary search and show that FENAS performs significantly better on several popular text classification tasks and similarly to ENAS on a standard language modeling benchmark. Further, we present ablations and analyses of our FENAS approach.


Introduction
Architecture search enables automatic ways of finding the best model architecture and cell structure for a given task or dataset, as opposed to the traditional approach of manually tuning among different architecture choices. Recently, this idea has been successfully applied to the tasks of language modeling and image classification (Zoph and Le, 2017; Cai et al., 2018; Liu et al., 2018a,b). The first approach of architecture search involved an RNN controller which samples a model architecture and uses the validation performance of this architecture trained on the given dataset as feedback (or reward) to sample the next architecture. However, this process is computationally very expensive, making it infeasible to run on a single GPU in a reasonable amount of time. Some recent attempts have made architecture search more computationally feasible (Negrinho and Gordon, 2017; Baker et al., 2017), with further performance improvements by Pham et al. (2018), who introduced Efficient Neural Architecture Search (ENAS) and achieved strong results on language modeling and image classification tasks.

[Figure 1: panels (a)-(c) show an example cell over inputs x[t] and h[t-1] with tanh, ReLU, and add nodes.]
In this work, we present a new architecture search approach called Flexible and Expressive Neural Architecture Search (FENAS) with a less restrictive and more flexible search space than ENAS. The FENAS search space has a larger number of activation functions (e.g., skip-based tanh and ReLU) and new atomic-level operations (e.g., addition, element-wise multiplication), as shown in Fig. 1. Importantly, unlike ENAS, FENAS can represent previous well-known human-designed architectures such as the Gated Recurrent Unit (GRU) and Long Short-Term Memory (LSTM) in its search space, since it allows a flexible number of input edges. Unlike ENAS, we do not use a weight-sharing strategy during the architecture search; instead, we use evolutionary search and initialize the population with known human-designed RNN architectures to search the space efficiently.
We conduct several experiments on a standard language modeling benchmark (PTB) and text classification tasks from the GLUE benchmark (Wang et al., 2019). To the best of our knowledge, we are the first to compare NAS methods on the full GLUE benchmark. Comparing our FENAS approach with previous NAS approaches, FENAS performs similarly on PTB and significantly better on several downstream GLUE tasks. Finally, we discuss various advantages of FENAS over ENAS and analyze the learned FENAS cell structure for PTB; e.g., the learned cell has fewer skip-connections and lower network complexity.
The early NAS approach (Zoph and Le, 2017) took many days and thousands of GPU hours to train even on simple datasets such as PTB (Marcus et al., 1994) and CIFAR-10 (Krizhevsky and Hinton, 2009). Recently, a weight-sharing strategy among search-space parameters was proposed by Pham et al. (2018), which reduced resource requirements to a few GPU days. Later, several variations of this approach were proposed, e.g., DARTS replaced RL with gradient descent. Li and Talwalkar (2019) and Sciuto et al. (2020) showed that a simple random search approach can give competitive results. In our work, we propose a new approach with a better search space, making NAS more flexible and expressive in comparison to Pham et al. (2018).

FENAS Method Details
Similar to ENAS, our method has two stages. In stage-1, we search for an optimal cell, and in stage-2, we train a model using the optimal cell structure. In the rest of this section, we describe our method and the search approach for learning the optimal cell.

Search Space
ENAS's search space is restrictive: every node has only one input from the previous nodes (we refer to Pham et al. (2018) for more details). In our work, we introduce Flexible and Expressive Neural Architecture Search (FENAS), in which every node has one or two inputs from the previous nodes and three levels of node functions (details in the next paragraph), and hence is more flexible and expressive than ENAS. Next, we describe the FENAS cell in detail.
At a structural level, the FENAS cell is similar to the ENAS cell: it has edges that represent weights and nodes that represent functions (see Fig. 1). Unlike the ENAS cell, FENAS has a larger number of node functions, which are divided into three types: (1) atomic functions (addition, subtraction, and element-wise product); (2) activation functions (tanh, ReLU, identity, and sigmoid); and (3) skip-based activation functions (tanh-skip, ReLU-skip, and sigmoid-skip). In comparison to ENAS, the atomic functions and skip connection-based activation functions are new in FENAS. Note that ENAS uses skip connections at every computational node, whereas FENAS learns for itself which computational nodes require skip connections. Edge weights are used when a node chooses a skip-based activation function. Nodes with plain activation functions can choose to have edge weights or not (i.e., just the identity function); in all other cases, edge weights are replaced with the identity function (the green edges in Fig. 1(b)). Let x(t) and h(t-1) be the inputs to the FENAS cell at time step t, and let h(t) be the corresponding output of the cell. Let h^t_k be the output of node k at time step t, and let h^t_i and h^t_j be the outputs of its parent nodes i and j, where i, j < k. An atomic node combines h^t_i and h^t_j element-wise; an activation node computes f_a(w_{i->k} h^t_i + w_{j->k} h^t_j); and a skip-based activation node additionally uses the skip-weights w^c_{i->k} and w^c_{j->k}. Here, f_a is any of the four activation functions (ReLU, tanh, identity, sigmoid), and w_{i->k} (w^c_{i->k}) and w_{j->k} (w^c_{j->k}) are the edge weights (skip-weights) from nodes i and j, respectively, to node k. FENAS also has an additional 'zero' node so as to allow a single input to a node; hence, every node has one or two input parent nodes (unlike exactly one parent node in ENAS). Architectures with more than two input nodes can be derived by increasing the node count. The FENAS search space is also flexible enough to contain known architectures. For example, Fig. 2a presents the GRU cell represented in the FENAS search space, where the inputs are x(t) and h(t-1).
FENAS's search space requires 10 computational nodes to represent the GRU cell. Note that even though ENAS has two inputs, it cannot represent the GRU cell in its search space because of its fixed skip connections and the single input to its computational nodes; this suggests that ENAS has a restrictive search space, and that our FENAS approach has more expressive power. FENAS can also represent the popular LSTM cell (Fig. 2b) by extending to three inputs (x(t), h(t-1), and c(t-1)) and two outputs (h(t) and c(t)). For this, we allow our approach to consider three inputs and also sample a computational node at the end that represents the output c(t).
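The three node-function levels above can be sketched as follows. This is an illustrative reading of the cell description, not the authors' implementation; in particular, the exact way skip-weights gate the output is not fully specified in the text, so the skip branch here is only one plausible interpretation.

```python
import numpy as np

# Hypothetical sketch of the three FENAS node-function types; all names
# and the skip-branch formula are illustrative assumptions.

ACTIVATIONS = {
    "tanh": np.tanh,
    "relu": lambda x: np.maximum(0.0, x),
    "identity": lambda x: x,
    "sigmoid": lambda x: 1.0 / (1.0 + np.exp(-x)),
}

ATOMIC = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,  # element-wise product
}

def node_output(kind, name, h_i, h_j, w_i=None, w_j=None):
    """Compute node k's output from its parent outputs h_i and h_j.

    kind: 'atomic', 'activation', or 'skip' (skip-based activation).
    w_i, w_j: optional edge-weight matrices; when absent, the edge acts
    as the identity function (the green edges in Fig. 1(b)).
    """
    if kind == "atomic":
        return ATOMIC[name](h_i, h_j)
    f = ACTIVATIONS[name]
    a = h_i if w_i is None else w_i @ h_i
    b = h_j if w_j is None else w_j @ h_j
    if kind == "activation":
        return f(a + b)
    # 'skip': combine the activated output with a direct path from the
    # parents -- one plausible reading of the skip-weight mechanism.
    return f(a + b) + h_i + h_j

h_i, h_j = np.ones(4), np.zeros(4)
out = node_output("atomic", "add", h_i, h_j)  # element-wise addition
```

A single-input node is obtained by passing the 'zero' node's output (a zero vector) as one of the two parents.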

Evolutionary Search for FENAS
In this work, we use an evolutionary search (ES) algorithm to find the optimal cell, following the aging-evolution approach of prior work. During ES, a population of P trained models is kept throughout the search phase; initially, the population is filled with random architectures, where every architecture in the FENAS search space is possible and equally likely. At each cycle, we sample S models from the population, each drawn uniformly at random with replacement. The model with the highest validation fitness among these S models is selected as the next parent. A new architecture, called the child model, is then constructed as a mutation of the selected parent architecture; in FENAS, a mutation is a simple random change to one computational node's operation. This child architecture is trained, evaluated, and added to the population. To keep the population size fixed, we remove the oldest model in the population when a new child model is added, a process known as aging evolution. Prior work suggested that the aging-evolution approach explores the search space better by not focusing on good models too early. After all cycles, the architecture of the best model trained during the whole search process is selected as the optimal cell.
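The aging-evolution loop described above can be sketched as below. The three callables (`random_arch`, `mutate`, `fitness`) are placeholders for the paper's components: in the actual method, `fitness` trains a child model and returns its validation fitness, and `mutate` randomly changes one computational-node operation.

```python
import random
from collections import deque

def aging_evolution(random_arch, mutate, fitness,
                    population_size=100, sample_size=25, cycles=5000):
    """Sketch of aging evolution: sample S models with replacement,
    mutate the fittest, and retire the oldest model each cycle."""
    population = deque()  # oldest model sits at the left end
    history = []          # every model ever trained during the search
    for _ in range(population_size):
        arch = random_arch()
        population.append((arch, fitness(arch)))
        history.append(population[-1])
    for _ in range(cycles):
        # Draw S models uniformly at random with replacement; the one
        # with the highest validation fitness becomes the parent.
        sample = [random.choice(population) for _ in range(sample_size)]
        parent = max(sample, key=lambda m: m[1])
        child = mutate(parent[0])
        population.append((child, fitness(child)))
        history.append(population[-1])
        population.popleft()  # aging: remove the oldest model
    # Return the best architecture seen over the whole search.
    return max(history, key=lambda m: m[1])[0]
```

As a toy usage example, searching over integers with fitness `-abs(a - 7)` converges toward 7; in FENAS the same loop operates over cell architectures instead.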
Another advantage of ES with FENAS is that human-designed cells (LSTM and GRU) are in its search space, so we can include these architectures in the initial population of the ES to start from a better state (experimental validation in Sec. 5.1). We also tried an RL-based weight-sharing (WS) strategy similar to ENAS during stage-1, but did not get the expected results (footnote 1), partly due to the reasoning discussed in Sciuto et al. (2020): even though ENAS is computationally very efficient, its WS approach does not converge to a local optimum.

Datasets
Penn Treebank. The Penn Treebank (PTB) is a standard English language modeling benchmark dataset (Marcus et al., 1994). We use the standard pre-processing steps following Zaremba et al. (2014) and Pham et al. (2018), which include lowercasing and removing numbers and punctuation. The vocabulary size is capped at 10,000 unique tokens, and we use the standard splits.
Footnote 1: We achieved a test perplexity of 59.2 with RL search on PTB, while our evolutionary search (ES) based approach achieved a better test perplexity of 56.8 (see Table 1).

Metrics
For the language modeling tasks, we report the perplexity (PPL) as the performance measure.
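Perplexity is the exponentiated average negative log-likelihood per token, so lower is better. A minimal computation:

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities:
    PPL = exp(-(1/N) * sum_t log p(w_t | w_<t))."""
    return math.exp(-sum(log_probs) / len(log_probs))

# Sanity check: a model that assigns probability 1/10000 to every token
# in a 10k-word vocabulary has perplexity exactly 10000.
uniform = [math.log(1 / 10000)] * 50
assert round(perplexity(uniform)) == 10000
```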

Training Details
In all our experiments, our hyperparameter choices are based on validation perplexity for the language modeling tasks and on validation accuracy for the text classification tasks. We do not perform any extensive hyperparameter search; we manually tune only dropout in the range [0.1, 0.5] for very few tasks. We use 9 computational nodes in all of our FENAS models. In stage-1, for both tasks, we use the evolutionary search algorithm with a population size of 100, a sample size of 25, and a total of 5000 cycles for learning the optimal FENAS cell structure.
Language Models. In stage-1 evolutionary search, the child model hidden size and word embedding size are set to 300. We train each child model for 20 epochs with a learning rate of 0.001 using the Adam optimizer (Kingma and Ba, 2015). We clip the norm of the gradient at 0.25, use l2 regularization weighted by 8e-6, tie word embeddings and softmax weights (Inan et al., 2017), and use variational dropout (Gal and Ghahramani, 2016) for both stages. In stage-2, we use a hidden size of 900 and a word embedding size of 900; other settings such as the stage-2 optimizer, learning rate, and dropout are the same as in previous work (Pham et al., 2018).
Text Classification Models. All the baseline models on the GLUE benchmark have the same settings apart from the vocabulary size. Each model has a two-layer bidirectional LSTM-RNN with a hidden size of 1500 and a word embedding size of 300, initialized with GloVe embeddings. The classifier is an MLP with a hidden size of 256. In all our models, we use the Adam optimizer with a learning rate of 0.0001 and a dropout of 0.2, and keep the maximum RNN length at 50. We use a batch size of 64. We refer to Appendix A for more training details on the FENAS and ENAS approaches.

Language Model on Penn Treebank
Table 1 presents the performance of various state-of-the-art language models (both manually-designed LSTM-based and architecture search based models) on the standard Penn Treebank (PTB) dataset. The ENAS, DARTS, and Random Search WS models use the same weight-sharing strategy with different search approaches in stage-1 to learn the optimal cell. Our FENAS method performs similarly to these models.
Computational Complexity. The FENAS search space is larger than ENAS's because of the additional activation functions and additional inputs to the computational nodes. The stage-1 search process for learning the optimal cell takes 8 and 0.5 GPU days on Nvidia Tesla P100s for FENAS and ENAS, respectively. For stage-2, the training time of FENAS is similar to that of the ENAS approach.
Random Search Baseline. It has been shown that an architecture sampled uniformly from the ENAS search space can also perform reasonably well (Li and Talwalkar, 2019; Sciuto et al., 2020). In fact, a random search with weight-sharing approach performed best on PTB (see Table 1). For the FENAS random baseline, we uniformly sampled 5 random architectures from the FENAS search space and trained them on PTB. The average perplexity of these 5 architectures is 126.67, which is substantially worse than the best learned cell, emphasizing the importance of a good search algorithm for FENAS.
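Uniform sampling from the search space can be sketched as below. The operation names and the encoding of an architecture as a list of (operation, parents) pairs are illustrative assumptions, not the paper's exact representation; the key points are that all ten node functions are equally likely and that the 'zero' node allows a one-input node.

```python
import random

# Simplified FENAS operation set: 3 atomic + 4 activation + 3 skip-based.
OPS = (["add", "sub", "mul"]
       + ["tanh", "relu", "identity", "sigmoid"]
       + ["tanh-s", "relu-s", "sigmoid-s"])

def random_fenas_arch(num_nodes=9, rng=random):
    """Draw one architecture uniformly from a simplified FENAS space.

    Each node picks an operation and one or two parents; parents may be
    the cell inputs x[t] and h[t-1] or any earlier node. Choosing the
    'zero' node as the second parent yields a single-input node.
    """
    arch = []
    for k in range(num_nodes):
        inputs = ["x[t]", "h[t-1]"] + list(range(k))
        op = rng.choice(OPS)
        parents = (rng.choice(inputs), rng.choice(inputs + ["zero"]))
        arch.append((op, parents))
    return arch

arch = random_fenas_arch()  # one uniformly sampled 9-node architecture
```

For the random baseline above, one would draw 5 such architectures and train each on PTB without any search.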
LSTM-RNN Initialization. In all our models, we include the LSTM cell in the initial population of the evolutionary search. To show the advantage of including human-designed cells, we perform an additional experiment where we do not include the LSTM cell, and observe that the search process is 24% slower in finding the best architecture.
Learned Cell Structure. Fig. 3 presents our learned FENAS cell on PTB. This cell has some computational nodes similar to those of the LSTM cell. Interestingly, it does not have any ReLU activation function, unlike the ENAS cell (Pham et al., 2018). Also, the FENAS cell uses skip connections only twice (nodes with '-s') and has roughly equal numbers of edges with and without learnable weights, accounting for its low network complexity.

Text Classification on GLUE Tasks
We move beyond language modeling tasks for NAS research and present novel results for several NAS methods on the full set of more realistic downstream GLUE benchmark tasks. We use the BiLSTM model discussed in Wang et al. (2019) as the baseline, and compare ENAS (Pham et al., 2018), ENAS with random search (ENAS-RS) (Li and Talwalkar, 2019), and our FENAS on the 9 GLUE tasks. We observe that FENAS significantly outperforms ENAS and the LSTM baseline on many GLUE datasets. To the best of our knowledge, this is the first detailed comparison of diverse NAS methods on the full GLUE benchmark, and we hope this will encourage further comparisons in future work.
Computational Complexity. The search time varies across GLUE tasks, but the average search time is 4 and 0.8 GPU days on Nvidia Tesla P100s for FENAS and ENAS models, respectively.

Conclusion
We presented a new architecture search algorithm, FENAS, which has more activation functions and more inputs to the computational nodes than the previous best algorithm (ENAS), thus achieving more flexible and expressive architectures. Our FENAS approach is also able to reproduce the well-known LSTM and GRU architectures, and can initialize with them to find architectures more efficiently. We also presented the first detailed comparison of several NAS methods on the full GLUE benchmark, achieving significant improvements on several text classification tasks.
Appendix A: Additional Training Details
For the GLUE tasks, we sample multiple child models in parallel. In stage-1, we use a hidden size of 1000 for large tasks (QNLI, MNLI, and QQP), and a hidden size of 300 for the rest of the tasks. We observe that cells learned using models with a smaller hidden size in stage-1 cannot transfer their best performance to the large hidden-size models that we use in stage-2, especially for large tasks; for this reason, we use a larger hidden size in stage-1 for large tasks. We further use only 2000 examples in stage-1 for large tasks to find the optimal cell. In stage-2, we keep the hidden size such that the overall model size is lower than that of the ENAS and LSTM baselines. We use 9 computational nodes in order to accommodate the LSTM architecture in the FENAS search space. The rest of the hyperparameters are the same as in the ENAS baseline.