Evaluating the Effectiveness of Efficient Neural Architecture Search for Sentence-Pair Tasks

Neural Architecture Search (NAS) methods, which automatically learn entire neural models or individual neural cell architectures, have recently achieved competitive or state-of-the-art (SOTA) performance on a variety of natural language processing and computer vision tasks, including language modeling, natural language inference, and image classification. In this work, we explore the applicability of a SOTA NAS algorithm, Efficient Neural Architecture Search (ENAS) (Pham et al., 2018), to two sentence-pair tasks, paraphrase detection and semantic textual similarity. We use ENAS to perform a micro-level search and learn a task-optimized RNN cell architecture as a drop-in replacement for an LSTM. We explore the effectiveness of ENAS through experiments on three datasets (MRPC, SICK, STS-B), with two different models (ESIM, BiLSTM-Max), and two sets of embeddings (Glove, BERT). In contrast to prior work applying ENAS to NLP tasks, our results are mixed: we find that ENAS architectures sometimes, but not always, outperform LSTMs, and that they perform similarly to random architecture search.


Introduction
Neural Architecture Search (NAS) methods aim to automatically discover neural architectures that perform well on a given task and dataset. These methods search over a space of possible model architectures, looking for ones that perform well on the task and will generalize to unseen data. There has been substantial prior work on how to define the architecture search space, search over that space, and estimate model performance (Elsken et al., 2019).
Recent works, however, cast doubt on the quality and performance of NAS-optimized architectures (Sciuto et al., 2020; Li and Talwalkar, 2019), showing that current methods fail to find the best performing architectures for a given task and perform similarly to random architecture search.
* Work completed while interning at Amazon.
In this work, we explore applications of a SOTA NAS algorithm, ENAS (Pham et al., 2018), to two sentence-pair tasks, paraphrase detection (PD) and semantic textual similarity (STS). We conduct a large set of experiments testing the effectiveness of ENAS-optimized RNN architectures across multiple models (ESIM, BiLSTM-Max), embeddings (BERT, Glove), and datasets (MRPC, SICK, STS-B). We are the first, to our knowledge, to apply ENAS to PD and STS, to explore applications across multiple embeddings and traditionally LSTM-based NLP models, and to conduct extensive SOTA hyperparameter tuning (HPT) across multiple ENAS-RNN architecture candidates.
Our experiments suggest that baseline LSTM models, with appropriate HPT, can sometimes match or exceed the performance of models with ENAS-RNNs. We also observe that random architectures sampled from the ENAS search space offer a strong baseline, and can sometimes outperform ENAS-RNNs. Given these observations, we recommend that researchers (i) conduct extensive HPT (preferably using automated methods) across various candidate architectures for the fairest comparisons; (ii) compare the performance of ENAS-RNNs against both standard architectures like LSTMs and RNN cells randomly sampled from the ENAS search space; and (iii) examine the computational (memory and runtime) requirements of ENAS methods alongside the gains observed.
Current SOTA approaches focus on learning new cell architectures as replacements for LSTM or convolutional cells (Zoph and Le, 2017; Pham et al., 2018; Liu et al., 2019; Jiang et al., 2019; Li et al., 2020) or entire model architectures to replace hand-designed models such as the transformer or DenseNet (So et al., 2019; Pham et al., 2018).
Recently, the superiority of NAS to random architecture search and traditional architectures with SOTA HPT methods has been called into question. Li and Talwalkar (2019) discuss reproducibility issues with current NAS methods and find that, on language modeling and image classification tasks, NAS algorithms perform similarly to random architecture search. Similarly, Sciuto et al. (2020) find minimal differences in performance between NAS and random search and that the popular weight-sharing strategy (Pham et al., 2018) decreases performance. With this in perspective, we conduct a study to investigate the value added by ENAS to two NLP tasks, PD and STS, which, to our knowledge, have not been explored in previous NAS literature.

Neural Architecture Search for Sentence-Pair Tasks
In this work, we explore applications of ENAS to two sentence-pair tasks, PD and STS. We select ENAS because prior work (Pasunuru and Bansal, 2019; Wang et al., 2020) has shown promising results applying it to a closely related task, NLI, with gains of up to 1.3% absolute over LSTMs and 1.6% over an RNN with a random architecture. Through our evaluations on PD and STS, we aim to study whether the ENAS methods used in prior work for NLI are generalizable and whether the results hold when applied to related tasks and datasets. ENAS models consist of two parts: 1) a search space over model architectures, i.e. child models, and 2) a controller that samples architectures from that search space. The primary contribution of ENAS is that all child models in the search space share their weights, so each child model does not have to be trained from scratch to evaluate it. Training the child models and controller proceeds as follows. First, the controller is fixed, and the child models are trained together for one epoch on the dataset, sampling a new architecture from the controller to use for each minibatch. Then, the child models' shared parameters are fixed, and the controller is updated: we sample child architectures from its policy and update the controller to maximize the expected reward on the dev set (e.g. dev set accuracy). This two-step process then repeats for a specified number of epochs. After training is complete, a number of child models are sampled from the controller and the best one is trained from scratch and evaluated on the test set. We refer the reader to Pham et al. (2018) for further details on ENAS.
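The alternating schedule described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the ENAS implementation: the controller is reduced to one logit per enumerated architecture (the real controller is a learned sequence model), and `train_step`, `dev_reward`, and the moving-average baseline are hypothetical stand-ins for shared-weight training and dev-set evaluation.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, rng):
    """Draw one architecture index from the controller's current policy."""
    return rng.choices(range(len(logits)), weights=softmax(logits), k=1)[0]


def enas_search(train_batches, dev_reward, train_step, num_archs,
                epochs=20, controller_steps=50, lr=0.1, seed=0):
    """Alternate shared-weight training and REINFORCE controller updates."""
    rng = random.Random(seed)
    logits = [0.0] * num_archs  # toy controller (assumption): one logit per architecture
    baseline = 0.0              # moving-average reward baseline (assumption)
    for _ in range(epochs):
        # Step 1: controller fixed; train the shared weights for one epoch,
        # sampling a new architecture for each minibatch.
        for batch in train_batches:
            train_step(sample(logits, rng), batch)
        # Step 2: shared weights fixed; update the controller to increase
        # the expected dev-set reward of the architectures it samples.
        for _ in range(controller_steps):
            arch = sample(logits, rng)
            reward = dev_reward(arch)
            baseline = 0.9 * baseline + 0.1 * reward
            probs = softmax(logits)
            for k in range(num_archs):
                grad_log_prob = (1.0 if k == arch else 0.0) - probs[k]
                logits[k] += lr * (reward - baseline) * grad_log_prob
    return logits


# Toy usage with made-up dev rewards for three hypothetical architectures.
calls = []
rewards = [0.1, 0.9, 0.3]
logits = enas_search(
    train_batches=range(4),
    dev_reward=lambda arch: rewards[arch],
    train_step=lambda arch, batch: calls.append((arch, batch)),
    num_archs=3,
)
best = max(range(3), key=lambda k: logits[k])
```

In the full procedure, the final step would sample several architectures from the trained controller and retrain the best one from scratch; the sketch stops at the trained controller logits.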
In this work, we follow the setup of Pasunuru and Bansal (2019), using standard LSTM-based NLP models and replacing the LSTMs with RNN cells sampled from the ENAS controller. We leave the rest of the model architecture (e.g. attention, pooling, output layers) the same, so the child model search space consists of every possible ENAS-RNN architecture with the standard model architecture around it. As with standard ENAS training, the parameters of the ENAS-RNNs and standard model architecture (e.g. final output layer) are shared across all child models.
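To make the drop-in replacement concrete, the sketch below runs one sampled cell over a sequence in plain Python. It assumes the architecture description used in Appendix A.5 (each node reads one earlier node and applies one of ReLU, Tanh, Sigmoid, or Identity) and averages the nodes that feed no other node to form the new hidden state. The function names and weight layout are hypothetical, and highway connections and biases are omitted for brevity, so this is an illustration rather than the cell used in the experiments.

```python
import math

ACTIVATIONS = {
    "relu": lambda v: max(0.0, v),
    "tanh": math.tanh,
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
    "identity": lambda v: v,
}


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cell_step(arch, weights, x_t, h_prev):
    """One time step. arch[j] = (input node index, op); node 0 has input None."""
    # Node 0 mixes the current input with the previous hidden state.
    pre = [a + b for a, b in zip(matvec(weights["x"], x_t),
                                 matvec(weights["h"], h_prev))]
    act = ACTIVATIONS[arch[0][1]]
    nodes = [[act(p) for p in pre]]
    # Every later node reads exactly one earlier node through its own matrix.
    for j in range(1, len(arch)):
        src, op = arch[j]
        act = ACTIVATIONS[op]
        nodes.append([act(p) for p in matvec(weights[(src, j)], nodes[src])])
    # Average the "loose ends" (nodes that no other node reads).
    used = {src for src, _ in arch[1:]}
    ends = [node for i, node in enumerate(nodes) if i not in used]
    return [sum(vals) / len(ends) for vals in zip(*ends)]


def run_sequence(arch, weights, xs, hidden_dim):
    h = [0.0] * hidden_dim
    for x_t in xs:  # explicit Python loop over time steps
        h = cell_step(arch, weights, x_t, h)
    return h


# Toy usage: a 3-node cell with identity/zero matrices so the result is easy to check.
eye = [[1.0, 0.0], [0.0, 1.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
arch = [(None, "tanh"), (0, "identity"), (0, "relu")]
weights = {"x": eye, "h": zero, (0, 1): eye, (0, 2): eye}
h_final = run_sequence(arch, weights, xs=[[1.0, -1.0]], hidden_dim=2)
```

A bidirectional layer would run the same cell once left-to-right and once right-to-left and concatenate the per-step states, exactly as a BiLSTM does.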

Experiments
We evaluate ENAS on three sentence-pair datasets (MRPC, SICK-R, STS-B) using two models, ESIM and BiLSTM-Max (BLM), and two sets of embeddings (BERT, Glove).

Training Discovered Architectures
After training ENAS, we sample 10 architectures from the controller. Just as during ENAS training, we then use these architectures as drop-in replacements for LSTMs, replacing a model's BiLSTM layer(s) with ENAS BiRNN(s). We then train the models from scratch and repeat HPT, extending the original LSTM hyperparameter search space with a choice over the 10 sampled architectures. We run 200 trials of HPT. We note that, unlike the CUDA implementations for LSTMs, it is non-trivial to implement highly optimized arbitrary ENAS-RNN architectures. We discuss these limitations and the overall compute dedicated for HPT on LSTM and ENAS-RNN based models in Appendix A.2.
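Extending the LSTM search space with a choice over the sampled architectures might look like the sketch below, written in the `_type`/`_value` search-space format used by NNI. The key name `rnn_architecture` and the two base entries are placeholders chosen for illustration (the values are borrowed from the restricted ranges in Appendix A.2.1); the actual search space is the one given in Table 3.

```python
import copy


def extend_with_architectures(lstm_space, num_archs=10, key="rnn_architecture"):
    """Return a copy of the LSTM search space plus a categorical choice
    over the indices of the sampled RNN architectures."""
    space = copy.deepcopy(lstm_space)
    space[key] = {"_type": "choice", "_value": list(range(num_archs))}
    return space


# Placeholder base space (assumption): only two hyperparameters are shown.
lstm_space = {
    "hidden_dim": {"_type": "choice", "_value": [384, 512, 768]},
    "batch_size": {"_type": "choice", "_value": [16, 32]},
}
enas_space = extend_with_architectures(lstm_space)
```

Treating the architecture index as one more categorical hyperparameter lets the tuner trade off architecture and training settings jointly, rather than tuning each of the 10 candidates in isolation.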
In addition to experiments replacing all BiLSTM layers with ENAS BiRNNs, we also examine mixing ENAS-RNN and LSTM layers in the multilayer ESIM model. Specifically, we experiment with replacing only the 1st BiLSTM layer in ESIM with an ENAS BiRNN and with replacing only the 2nd BiLSTM layer. These models have the same hyperparameter search space as the ESIM model with ENAS-RNNs in both layers (i.e. the same possible ENAS-RNN architectures), but we tune and evaluate them separately (see Table 1, rows 5-6, 11-12).

Table 2: Evaluation of how well ENAS-RNNs transfer to other datasets and compare to random search. We report Pearson correlation for SICK-R and STS-B and accuracy for MRPC. In the RNN column, "E" stands for ENAS-RNN, "L" stands for LSTM, and "RND" for random RNN. For ESIM we use an ENAS or random RNN in the 1st layer and an LSTM in the 2nd layer.

Results
Table 1 lists the dev and test results for all datasets, embeddings, and models. We focus our discussion on the test results. On the whole, the results are mixed. (BLM, ENAS) outperforms (BLM, LSTM) across all datasets and embeddings by an average of 1.9%. (ESIM, ENAS), on the other hand, fails to consistently outperform (ESIM, LSTM). ESIM models with ENAS-RNNs in both layers lag behind LSTMs by 0.9%, on average.
Focusing first on BLM, we find that (BLM, ENAS) outperforms (BLM, LSTM) by an average of 2.1% across all three datasets using BERT (row 8) and 1.7% using Glove (row 14). These results parallel those of Pasunuru and Bansal (2019), who find that (BLM, ENAS) with ELMo embeddings (Peters et al., 2018) outperforms (BLM, LSTM) on two NLI datasets and is on par on a third. However, both in our experiments and those of Pasunuru and Bansal (2019), the 6 node ENAS-RNNs have more parameters than the corresponding LSTM models (the exact ratio depends on the input and hidden dimensions), making it difficult to get a clear picture of the effects of just changing the RNN architecture. To control for this, in §4.1 we conduct experiments comparing ENAS-RNNs to RNNs randomly sampled from the same search space.
Examining ESIM, the results are mixed. ESIM models with ENAS-RNNs in both layers (rows 4, 10) are worse than (ESIM, LSTM) on 4 of 6 (dataset, embedding) configurations. The best (ESIM, ENAS) performance is actually achieved using a mix of ENAS-RNNs and LSTMs across different layers. In fact, the only configuration in which (ESIM, ENAS) outperforms (ESIM, LSTM) across all three datasets is (BERT, ENAS / LSTM) (row 5), where we only replace the first LSTM layer with an ENAS-RNN. The gains, however, are modest compared to those of the BLM model, improving over (ESIM, LSTM) by 0.73% on average. Further, when the embeddings are changed to Glove, (Glove, ENAS / LSTM) (row 11), (ESIM, ENAS) underperforms (ESIM, LSTM) across all 3 datasets by nearly 2% on average. Since we do not observe similar performance gains with ESIM as with BLM, we hypothesize that optimization of specific RNN architectures might matter less as model complexity (e.g. number of layers) increases. We suggest future work further examine the importance of ENAS as it relates to model complexity, especially on tasks where an RNN's architecture might have a higher impact on modeling performance.

Random & Transfer Architectures
In addition to comparisons to LSTMs, we evaluate two common claims about NAS methods: 1) NAS outperforms random search (Pham et al., 2018; Zoph and Le, 2017; Luo et al., 2018; Liu et al., 2019), and 2) NAS architectures are transferable to related datasets and tasks (Zoph and Le, 2017; Liu et al., 2019; Luo et al., 2018). We choose two configurations to evaluate these claims: (i) (Glove, BLM) and (ii) (BERT, ESIM, ENAS / LSTM), with ENAS-RNNs only in the first layer and the second BiLSTM layer kept. We chose these configurations since they perform well relative to LSTMs and, between them, cover all embeddings and models.
For claim #1, we first randomly sample 10 RNN architectures from the ENAS search space. Then, just as for the ENAS-RNNs, we perform 200 HPT trials, replacing the 10 ENAS-RNN candidates with the 10 randomly sampled RNN candidates. For claim #2, we test the transferability of SICK-R and MRPC cells to/from each other. We do not evaluate the transferability of STS-B cells, since STS-B contains data from SICK-R and MRPC. We again perform 200 HPT trials, but with the other dataset's ENAS-RNN cells in the search space. Table 2 shows our results. We again focus on test results. For claim #1, we find mixed results, with ENAS outperforming random search by an average of 1.33% in the configuration (BERT, ESIM, ENAS / LSTM) (rows 1-4), but performing worse than or on par with random search on (Glove, BLM) (rows 5-8) (average 0.9% decrease). These results contrast with those of Pham et al. (2018) and Pasunuru and Bansal (2019), who report gains over random search on language modeling (25.4% decrease in perplexity) and NLI datasets (1.53% increase in accuracy). We hypothesize that these differences are due, in part, to our emphasis on creating strong baselines by searching over multiple architectures and performing extensive HPT for all models and settings.
For claim #2, we find that transfer architectures underperform dataset-specific ENAS architectures by 0.58% and random architectures by 0.7%, on average. Only one architecture (row 1, SICK to MRPC) outperforms either of the corresponding random or dataset-specific architectures. Together with our findings for claim #1, these results cast further doubt on the ability of ENAS to find the best architecture for a specific task, its superiority to well-tuned random architectures, and the transferability of its discovered architectures.

Conclusion
Unlike prior work applying ENAS to NLP, we find that ENAS-RNNs only outperform LSTMs and random search on some (dataset, embedding, model) configurations. Our findings parallel recent work (Li and Talwalkar, 2019; Sciuto et al., 2020) that questions the effectiveness of current NAS methods and their superiority to random architecture search and SOTA HPT methods. Given our mixed results, we recommend that researchers: (i) extensively tune hyperparameters for standard (e.g. LSTM) and randomly sampled architectures to create strong baselines; (ii) benchmark ENAS performance across multiple simple and complex model architectures (e.g. BLM & ESIM); and (iii) present computational requirements alongside gains observed with ENAS methods.

A Implementation Details
All models were implemented with PyTorch and run on Amazon p3 instances (16GB Nvidia V100).

A.1 Embeddings
Experiments with BERT used the Huggingface Transformers library (Wolf et al., 2019). Experiments with Glove vectors used 300-dimensional vectors trained on Wikipedia 2014 + Gigaword 5. Glove vectors were not updated during training, and out-of-vocabulary tokens were replaced with the token "[UNK]" with an embedding of all 0s (≈ 6% of tokens are OOV). In initial experiments, we found no differences between our all-0 embeddings and embeddings randomly initialized according to a Gaussian distribution.
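The out-of-vocabulary handling described above can be illustrated with a minimal sketch. The helper names `embed` and `oov_rate` and the tiny 3-dimensional vocabulary are hypothetical and stand in for the 300-dimensional Glove table.

```python
def embed(tokens, vectors, dim=300, unk_token="[UNK]"):
    """Look up frozen vectors; tokens missing from the table are replaced
    by unk_token, which maps to an all-zero vector."""
    unk_vector = [0.0] * dim
    mapped = [t if t in vectors else unk_token for t in tokens]
    embeddings = [list(vectors.get(t, unk_vector)) for t in mapped]
    return mapped, embeddings


def oov_rate(tokens, vectors):
    """Fraction of tokens that have no pretrained vector."""
    return sum(t not in vectors for t in tokens) / len(tokens)


# Toy usage with a made-up 3-dimensional vocabulary.
toy_vectors = {
    "a": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
    "sat": [0.7, 0.8, 0.9],
}
tokens = ["a", "cat", "zzyzx", "sat"]
mapped, embeddings = embed(tokens, toy_vectors, dim=3)
rate = oov_rate(tokens, toy_vectors)
```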

A.2 Hyperparameter Tuning
All HPT was run using Microsoft NNI's parallel implementation of TPE with concurrency 8. Table 3 contains the search space for our experiments. Table 5 contains the best hyperparameter settings for all of our experiments.

A.2.1 Memory Limitations for HPT with ENAS-RNNs
In order for a model to fit on a single GPU (16GB Nvidia V100), we had to limit the search space slightly for models using both ENAS-RNNs and BERT embeddings. This is because the ENAS-RNN search space contains weight matrices $W^{(h)}_{\ell,j}$ between each pair of nodes $\ell, j$ in the RNN search space DAG, which greatly expands memory usage (see Pham et al. (2018), Section 2.1 and Appendix A). For both BLM and ESIM models, hidden dimensions were limited to [384,512,768]. Further, for ESIM models with ENAS-RNNs in both layers, the batch size was also limited to [16,32].
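A back-of-the-envelope count shows why one matrix per node pair is costly. The sketch below is an estimate under stated assumptions, not a measurement of the models used here: it counts one hidden-by-hidden matrix per node pair plus the input and recurrent projections of the first node, and exposes `mats_per_edge` as a hypothetical knob for variants that keep extra matrices per edge (e.g. for highway-style gates). The LSTM count uses the textbook form with a single bias vector per gate.

```python
def lstm_params(input_dim, hidden_dim):
    """Four gates, each with input and recurrent weights plus one bias."""
    return 4 * (hidden_dim * (input_dim + hidden_dim) + hidden_dim)


def enas_shared_params(input_dim, hidden_dim, num_nodes=6, mats_per_edge=1):
    """Weights held by the shared search-space cell (biases ignored).

    first: projections of x_t and h_{t-1} into node 0.
    pairs: one hidden x hidden matrix for each pair of nodes l < j.
    """
    first = hidden_dim * (input_dim + hidden_dim)
    pairs = num_nodes * (num_nodes - 1) // 2
    return mats_per_edge * (first + pairs * hidden_dim * hidden_dim)


# Example with input and hidden size 768 (the largest allowed hidden size above).
n_lstm = lstm_params(768, 768)
n_enas = enas_shared_params(768, 768)
ratio = n_enas / n_lstm
```

Under these assumptions the shared 6-node cell holds roughly twice as many weights as an LSTM of the same size, and the number of pairwise matrices grows quadratically with the number of nodes.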

A.2.2 Timing limitations for HPT with ENAS-RNNs
Since our ENAS-RNNs are, similar to prior NAS research code, implemented using a Python for-loop over time steps, the implementation is significantly slower (≈ 25x) than the CUDA-optimized LSTM equivalent. Thus, due to computational limits, we only perform 200 trials of HPT for the models with ENAS-RNNs (vs. 500 for models with LSTMs). Though the number of HPT trials is lower than for LSTMs, due to their slow speed, the total compute time devoted to tuning the ENAS-RNN models is roughly 10x higher or more. As an example, Table 4 shows the total compute time dedicated to HPT for BLM models (both LSTM-based models and ENAS-RNN based models), measured as the total number of hours spent on a single p3.16xlarge instance to finish all HPT trials. Note that the models with ENAS-RNNs are not always exactly 10x slower than the LSTM equivalent: since we are also searching over batch size during HPT, runtimes can vary significantly.
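The "roughly 10x" figure follows from the trial counts and the per-model slowdown quoted above. A minimal sketch of the arithmetic, assuming every trial costs the same within a model family (which ignores the batch-size effect noted above); the function name is hypothetical:

```python
def relative_hpt_compute(trials_a, cost_per_trial_a, trials_b, cost_per_trial_b):
    """Total tuning compute of family A relative to family B."""
    return (trials_a * cost_per_trial_a) / (trials_b * cost_per_trial_b)


# 200 ENAS-RNN trials at ~25x the per-trial cost vs. 500 LSTM trials.
ratio = relative_hpt_compute(200, 25.0, 500, 1.0)
```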

A.3 Memory Limitations for Training ENAS
As noted in §3.3, we train the ENAS child models (BLM, ESIM) using the same parameters as the corresponding best LSTM model for the given (dataset, embeddings, model) configuration. For the configuration (STS-B, BERT, ESIM), the corresponding ENAS child models would not fit on a single GPU (16GB Nvidia V100). This is due to the large memory footprint of ENAS, as discussed in §A.2. Thus, for (STS-B, BERT, ESIM) we decrease the batch size from 64 to 32 and the hidden dimensions from 1152 to 768.

Table 4: Compute time spent on HPT for BLM models (both LSTM-based models and ENAS-RNN based models). Compute time measured as total number of hours on a single p3.16xlarge instance. All HPT was run using Microsoft NNI's parallel implementation of TPE with concurrency 8 (one trial running on each of the 8 GPUs in the p3.16xlarge instance).

A.4 ESIM: Differences Between Training Child Models with ENAS and Training Models from Scratch
As described in §3.3, when training the ESIM child models jointly with the ENAS controller, we replace both of ESIM's BiLSTMs with the sampled ENAS-RNN architectures. We do this for each (dataset, embedding) configuration, thus running 6 total instances of ENAS (3 datasets * 2 embeddings). After the ENAS training is complete, we sample 10 ENAS-RNN architectures from the trained controller. However, when training ESIM models from scratch, as described in §3.4, we experiment with 1) replacing both LSTM layers with the ENAS-RNN architecture (same as during ENAS training), 2) only replacing the 1st layer, and 3) only replacing the 2nd layer. We treat each ESIM layer configuration as its own model and tune its hyperparameters separately. Thus, for example, for the configuration (SICK-R, BERT, ESIM) we perform 200 trials of HPT for the configuration with ENAS-RNNs in both layers, 200 trials of HPT for the configuration with an ENAS-RNN in layer 1 and an LSTM in layer 2, and finally 200 trials of HPT for the configuration with an LSTM in layer 1 and an ENAS-RNN in layer 2. Note, however, that these three separate instances of HPT share the same search space over ENAS-RNN architectures: all three are searching over the same 10 ENAS-RNNs sampled from the same controller. In total, we run 18 different instances of HPT (3 datasets * 2 embeddings * 3 layer configs). The results from each configuration are presented separately in Table 1 (in the main portion of the paper).
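The run counts above can be checked by enumerating the configurations. The sketch below is purely illustrative; the variable names are hypothetical.

```python
from itertools import product

datasets = ["MRPC", "SICK-R", "STS-B"]
embeddings = ["BERT", "Glove"]
# (layer 1, layer 2) cell types for the ESIM models trained from scratch.
layer_configs = [("ENAS", "ENAS"), ("ENAS", "LSTM"), ("LSTM", "ENAS")]

# One ENAS search per (dataset, embedding) pair.
enas_runs = list(product(datasets, embeddings))
# One separate HPT instance per (dataset, embedding, layer config).
hpt_runs = list(product(datasets, embeddings, layer_configs))

total_trials = 200 * len(hpt_runs)  # 200 HPT trials per instance
```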

A.5 RNN Architectures Sampled from ENAS Search Space
Table 6 shows the architectures of all RNNs used in our experiments (ENAS-RNNs, transferred ENAS-RNNs, random RNNs). Each architecture is numbered 1-26. Table 5, which displays the hyperparameter settings for each model and configuration, lists which RNN architecture each configuration uses. Note that some of the architectures are the same across different model configurations. This is due to two reasons:
• As discussed in §3.4 and §A.4, we experiment with mixing ENAS-RNN and LSTM layers in the multi-layer ESIM model. The ESIM models with ENAS-RNNs in both layers share the same possible ENAS-RNN architectures as the corresponding ESIM models with an ENAS-RNN only in the 1st layer or 2nd layer.
• We sampled 10 total random architectures from the ENAS-RNN search space, then used those same 10 architectures in the search spaces for all (dataset, model, embedding) configurations. Thus, some configurations might use the same architecture.
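Drawing random architectures from a search space of this shape could be sketched as follows. This assumes uniform sampling over each node's input index and operation, with the first node reading the cell input directly; the sampling distribution, function names, and encoding are illustrative assumptions rather than the procedure used to produce Table 6.

```python
import random

OPS = ("relu", "tanh", "sigmoid", "identity")


def sample_architecture(num_nodes=6, rng=random):
    """Return [(input_index, op), ...]; node 0 reads x_t and h_{t-1} (input None)."""
    arch = [(None, rng.choice(OPS))]
    for node in range(1, num_nodes):
        # Each later node picks one earlier node as input and one activation.
        arch.append((rng.randrange(node), rng.choice(OPS)))
    return arch


def loose_ends(arch):
    """Indices of nodes that no other node uses as input."""
    used = {inp for inp, _ in arch if inp is not None}
    return [i for i in range(len(arch)) if i not in used]


rng = random.Random(0)
candidates = [sample_architecture(rng=rng) for _ in range(10)]
```

Because each node can only read earlier nodes, every sampled description is a valid DAG, and the last node is always a loose end.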

A.6 Datasets
For MRPC and STS-B, we use the data provided by GLUE. For SICK-R, we use the data provided by SemEval-2014 Task 1. We use scikit-learn to split the provided SICK-R training data into train and dev splits. For our experiments with BERT, we use the BertTokenizer from the Huggingface Transformers library (Wolf et al., 2019). We cap each sentence pair at a certain number of total wordpiece tokens (SICK: 64, MRPC: 128, STS-B: 128). For our experiments with Glove, we use spacy (Honnibal and Montani, 2017).

Table 5: Hyperparameter values used for all experiments. In the RNN column, "E" stands for ENAS-RNN, "L" stands for LSTM, "R" for random RNN, and "T" for transfer. All floating point values have been rounded to 4 significant figures after the decimal point. Variational dropout is applied before each RNN layer. For models with RNNs from the ENAS search space (all models except those with LSTMs), the column 'Architecture #' displays which RNN architecture it uses. The number corresponds to the row number in Table 6. For ESIM models, the two hidden dimension values refer to (RNN layer 1, RNN layer 2) and the two dropout numbers refer to standard dropout (applied after the 'enhancement' layer, in the final MLP layer). For BLM models, the two dropout numbers refer to standard dropout applied (after the RNN layer, before the final projection).

Table 6: ... Table 5 by the column 'Architecture #'. Node # Input refers to the index of the previous node used as input to the current node. Node # Op refers to the elementwise operation applied at each node (Relu, Tanh, Sigmoid, Identity). Please see Pham et al. (2018) for more details on the ENAS RNN search space.