Learning Architectures from an Extended Search Space for Language Modeling

Neural architecture search (NAS) has advanced significantly in recent years, but most NAS systems restrict search to learning architectures of a recurrent or convolutional cell. In this paper, we extend the search space of NAS. In particular, we present a general approach to learn both intra-cell and inter-cell architectures (which we call ESS). For a better search result, we design a joint learning method to perform intra-cell and inter-cell NAS simultaneously. We implement our model in a differentiable architecture search system. For recurrent neural language modeling, it outperforms a strong baseline significantly on the PTB and WikiText data, with a new state-of-the-art on PTB. Moreover, the learned architectures show good transferability to other systems. For example, they improve state-of-the-art systems on the CoNLL and WNUT named entity recognition (NER) tasks and the CoNLL chunking task, indicating a promising line of research on large-scale pre-learned architectures.


Introduction
Neural models have shown remarkable performance improvements in a wide range of natural language processing (NLP) tasks. Systems of this kind can broadly be characterized as following a neural network design: we model the problem via a pre-defined neural architecture, and the resulting network is treated as a black-box family of functions for which we find parameters that can generalize well on test data. This paradigm leads to many successful NLP systems based on well-designed architectures. The earliest of these make use of recurrent neural networks (RNNs) for representation learning (Bahdanau et al., 2015; Wu et al., 2016), whereas recent systems have successfully incorporated fully attentive models into language generation and understanding (Vaswani et al., 2017).
In designing such models, careful engineering of the architecture plays a key role in reaching the state of the art, though it is in general extremely difficult to find a good network structure. The next obvious step is toward automatic architecture design. A popular method to do this is neural architecture search (NAS). In NAS, the common practice is that we first define a search space of neural networks, and then find the most promising candidate in the space by some criteria. Previous efforts to make NAS more accurate have focused on improving search and network evaluation algorithms. But the search space is still restricted to a particular scope of neural networks. For example, most NAS methods are applied to learn the topology in a recurrent or convolutional cell, but the connections between cells are still made in a heuristic manner as usual (Zoph and Le, 2017; Elsken et al., 2019).
Note that the organization of these sub-networks remains important as to the nature of architecture design. For example, the first-order connectivity of cells is essential to capture the recurrent dynamics in RNNs. More recently, it has been found that additional connections of RNN cells improve LSTM models by accessing longer history on language modeling tasks (Melis et al., 2019). Similar results appear in Transformer systems. Dense connections of distant layers help in learning a deep Transformer encoder for machine translation (Shen et al., 2018). A natural question that arises is: can we learn the connectivity of sub-networks for better architecture design?
In this paper, we address this issue by enlarging the scope of NAS and learning connections among sub-networks that are designed in either a handcrafted or automatic way (Figure 1). We call this the Extended Search Space method for NAS (or ESS for short). Here, we choose differentiable architecture search as the basis of this work because it is efficient and gradient-friendly. We present a general model of differentiable architecture search to handle an arbitrary search space of NAS, which offers a unified framework for describing intra-cell NAS and inter-cell NAS. Also, we develop a joint approach to learning both high-level and low-level connections simultaneously. This enables the interaction between intra-cell NAS and inter-cell NAS, and thus the ability to learn the full architecture of a neural network.
Our ESS method is simple to implement. We experiment with it in an RNN-based system for language modeling. On the PTB and WikiText data, it outperforms a strong baseline significantly, by 4.5 and 2.4 perplexity points respectively. Moreover, we test the transferability of the learned architecture on other tasks. Again, it shows promising improvements on both NER and chunking benchmarks, and yields new state-of-the-art results on NER tasks. This indicates a promising line of research on large-scale pre-learned architectures. More interestingly, it is observed that the inter-cell NAS is helpful in modeling rare words. For example, it yields a bigger improvement on the rare entity recognition task (WNUT) than on the standard NER task (CoNLL).

Related work
NAS is a promising method toward AutoML (Hutter et al., 2018), and has been recently applied to NLP tasks (So et al., 2019; Jiang et al., 2019; Li and Talwalkar, 2019). Several research teams have investigated search strategies for NAS. The very early approaches adopted evolutionary algorithms to model the problem (Angeline et al., 1994; Stanley and Miikkulainen, 2002), while Bayesian and reinforcement learning methods later made substantial progress in computer vision and NLP (Bergstra et al., 2013; Baker et al., 2017; Zoph and Le, 2017). More recently, gradient-based methods were successfully applied to language modeling and image classification based on RNNs and CNNs (Liu et al., 2019a). In particular, differentiable architecture search has been of great interest to the community because of its efficiency and compatibility with off-the-shelf tools of gradient-based optimization.
Despite their great success, previous studies restricted themselves to a small search space of neural networks. For example, most NAS systems were designed to find an architecture of a recurrent or convolutional cell, while the remaining parts of the network are handcrafted (Zhong et al., 2018; Brock et al., 2018; Elsken et al., 2019). For a larger search space, prior work optimized the normal cell (i.e., the cell that preserves the dimensionality of the input) and the reduction cell (i.e., the cell that reduces the spatial dimension) simultaneously and explored a larger region of the space than single-cell search. But it is still rare to see studies on the issue of search space, though it is an important factor in NAS. On the other hand, it has been shown that additional connections between cells help in RNN or Transformer-based models (He et al., 2016; Huang et al., 2017; Wang et al., 2018, 2019). These results motivate us to take a step toward the automatic design of inter-cell connections and thus search in a larger space of neural architectures.

Inter-Cell and Intra-Cell NAS
In this work we use RNNs for description. We choose RNNs because of their effectiveness at preserving past inputs for sequential data processing tasks. Note that although we will restrict ourselves to RNNs for our experiments, the method and discussion here can be applied to other types of models.

Problem Statement
For a sequence of input vectors $\{x_1, ..., x_T\}$, an RNN makes a cell on top of every input vector. The RNN cell receives information from previous cells and input vectors. The output at time step $t$ is defined to be:
$$h_t = \pi(\hat{h}_{t-1}, \hat{x}_t) \tag{1}$$
where $\pi(\cdot)$ is the function of the cell, $\hat{h}_{t-1}$ is the representation vector of previous cells, and $\hat{x}_t$ is the representation vector of the inputs up to time step $t$. More formally, we define $\hat{h}_{t-1}$ and $\hat{x}_t$ as functions of cell states and model inputs, like this:
$$\hat{h}_{t-1} = f(h_{[0,t-1]}; x_{[1,t-1]}) \tag{2}$$
$$\hat{x}_t = g(x_{[1,t]}; h_{[0,t-1]}) \tag{3}$$
where $h_{[0,t-1]} = \{h_0, ..., h_{t-1}\}$ and $x_{[1,t]} = \{x_1, ..., x_t\}$. $f(\cdot)$ models the way that we pass information from previous cells to the next. Likewise, $g(\cdot)$ models the case of input vectors. These functions offer a general method to model connections between cells. For example, one can obtain a vanilla recurrent model by setting $\hat{h}_{t-1} = h_{t-1}$ and $\hat{x}_t = x_t$, while more inter-cell connections can be considered if sophisticated functions are adopted for $f(\cdot)$ and $g(\cdot)$. While previous work focuses on searching for the desirable architecture design of $\pi(\cdot)$, we take $f(\cdot)$ and $g(\cdot)$ into account and describe a more general case here. We separate two sub-problems out from NAS for a conceptually cleaner description:
• Intra-Cell NAS. It learns the architecture of a cell (i.e., $\pi(\cdot)$).
• Inter-Cell NAS. It learns the way of connecting the current cell with previous cells and input vectors (i.e., $f(\cdot)$ and $g(\cdot)$).
In the following, we describe the design and implementation of our inter-cell and intra-cell NAS methods.
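The decomposition above can be illustrated with a minimal sketch. It assumes a toy cell (`toy_cell`, an element-wise tanh of the sum) in place of a learned $\pi(\cdot)$, and the names `run_rnn`, `vanilla_f` and `vanilla_g` are hypothetical. The two default connection functions reproduce the vanilla recurrent model; replacing them with richer functions of the full histories corresponds to inter-cell search.

```python
import math

def vanilla_f(h_hist, x_hist):
    # Vanilla connection: pass only the most recent cell state h_{t-1}.
    return h_hist[-1]

def vanilla_g(x_hist, h_hist):
    # Vanilla connection: pass only the current input x_t.
    return x_hist[-1]

def toy_cell(h_hat, x_hat):
    # Hypothetical stand-in for the cell function pi(.), not a learned cell.
    return [math.tanh(h + x) for h, x in zip(h_hat, x_hat)]

def run_rnn(xs, h0, pi=toy_cell, f=vanilla_f, g=vanilla_g):
    hs = [h0]                          # cell states h_0 .. h_{t-1}
    for t in range(1, len(xs) + 1):
        h_hat = f(hs, xs[:t - 1])      # f sees h_[0,t-1] and x_[1,t-1]
        x_hat = g(xs[:t], hs)          # g sees x_[1,t] and h_[0,t-1]
        hs.append(pi(h_hat, x_hat))    # h_t = pi(h_hat, x_hat)
    return hs[1:]

states = run_rnn([[0.0], [1.0]], [0.0])
```

Only `f` and `g` need to change to alter how cells are connected, which is what separates inter-cell search from the intra-cell search over `pi`.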

Differentiable Architecture Search
For search algorithms, we follow the method of differentiable architecture search (DARTS). It is gradient-based and runs orders of magnitude faster than earlier methods (Real et al., 2019). DARTS represents networks as a directed acyclic graph (DAG) and searches for the appropriate architecture on it. For a DAG, the edge $o^{i,j}(\cdot)$ between node pair $(i, j)$ performs an operation to transform the input (i.e., tail) to the output (i.e., head). Following Liu et al. (2019a) and others, we choose operations from a list of activation functions, e.g., sigmoid and identity. A node represents the intermediate states of the networks. Node $i$ weights the vectors from all predecessor nodes ($j < i$) and simply sums over them. Let $s_i$ be the state of node $i$. We define $s_i$ to be:
$$s_i = \sum_{j<i} \sum_{k} \theta_k^{i,j} \cdot o_k^{i,j}(s_j \cdot W_j) \tag{4}$$
where $W_j$ is the parameter matrix of the linear transformation, and $\theta_k^{i,j}$ is the weight indicating the importance of $o_k^{i,j}(\cdot)$. Here the subscript $k$ means the operation index. $\theta_k^{i,j}$ is obtained by softmax normalization over edges between nodes $i$ and $j$:
$$\theta_k^{i,j} = \frac{\exp(w_k^{i,j})}{\sum_{k'} \exp(w_{k'}^{i,j})}$$
with $w_k^{i,j}$ the corresponding unnormalized architecture weight. In this way, the induction of discrete networks is reduced to learning continuous variables $\{\theta_k^{i,j}\}$ at the end of the search process. This enables the use of efficient gradient descent methods. Such a model encodes an exponentially large number of networks in a graph, and the optimal architecture is generated by selecting the edges with the largest weights.
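A minimal sketch of this continuous relaxation follows. It is an illustration under simplifying assumptions, not the system used in the experiments: states are scalars, the linear transformation $W_j$ is reduced to a scalar `scale[j]`, and the names `node_state` and `derive_edge` are hypothetical.

```python
import math

# Candidate operations on an edge (a subset of activation functions).
OPS = {
    "identity": lambda v: v,
    "tanh": math.tanh,
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
    "relu": lambda v: max(0.0, v),
}

def softmax(ws):
    m = max(ws)                          # subtract max for numerical stability
    es = [math.exp(w - m) for w in ws]
    z = sum(es)
    return [e / z for e in es]

def node_state(prev_states, arch_w, scale):
    """Relaxed state of node i: weighted sum over predecessors j and ops k."""
    total = 0.0
    for j, s_j in enumerate(prev_states):
        theta = softmax(arch_w[j])       # importance of each op on edge (i, j)
        for th, op in zip(theta, OPS.values()):
            total += th * op(s_j * scale[j])
    return total

def derive_edge(arch_w_edge):
    """Discretize one edge by keeping the operation with the largest weight."""
    names = list(OPS)
    best = max(range(len(names)), key=lambda k: arch_w_edge[k])
    return names[best]

s = node_state([0.0], [[0.0, 0.0, 0.0, 0.0]], [1.0])
chosen = derive_edge([0.1, 2.0, -1.0, 0.0])
```

Because `node_state` is a smooth function of `arch_w`, the architecture weights can be trained by gradient descent together with the model parameters, and `derive_edge` is applied only once search has finished.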
The common approach to DARTS constrains the output of the generated network to be the last node that averages the outputs of all preceding nodes. Let $s_n$ be the last node of the network. We have
$$s_n = \frac{1}{n-1} \sum_{i=1}^{n-1} s_i$$
Given the input vectors, the network found by DARTS generates the result at the final node $s_n$. Here we present a method to fit this model into intra-cell and inter-cell NAS. We re-formalize the function for which we find good architectures as $F(\alpha; \beta)$. $\alpha$ and $\beta$ are two groups of the input vectors. We create DAGs on them individually. This gives us two DAGs with $s^{\alpha}$ and $s^{\beta}$ as the last nodes. Then, we make the final output by a Hadamard product of $s^{\alpha}$ and $s^{\beta}$, like this,
$$F(\alpha; \beta) = s^{\alpha} \odot s^{\beta}$$
See Figure 2 for the network of an example $F(\alpha; \beta)$. This method transforms the NAS problem into two learning tasks. The design of two separate networks allows the model to group related inputs together, rather than putting everything into a "magic" system of NAS. For example, for the inter-cell function $f(\cdot)$, it is natural to learn the pre-cell connection from $h_{[0,t-1]}$, and learn the impact of the model inputs from $x_{[1,t-1]}$. It is worth noting that the Hadamard product of $s^{\alpha}$ and $s^{\beta}$ is doing something very similar to the gating mechanism which has been widely used in NLP (Bradbury et al., 2017; Gehring et al., 2017). For example, one can learn $s^{\beta}$ as a gate and control how much $s^{\alpha}$ is used for the final output. Table 1 gives the design of $\alpha$ and $\beta$ for the functions used in this work.
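The two-group construction can be sketched as follows, assuming the intermediate node states of each DAG have already been computed. The helper names (`average`, `hadamard`, `F`) are hypothetical, and the averaging mirrors the final-node rule described above.

```python
def average(vectors):
    """Final node of a DAG: element-wise mean of the preceding node states."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def hadamard(a, b):
    """Element-wise product of two vectors of equal length."""
    return [x * y for x, y in zip(a, b)]

def F(alpha_states, beta_states):
    """Combine the two DAG outputs: s_alpha gated element-wise by s_beta."""
    s_alpha = average(alpha_states)
    s_beta = average(beta_states)
    return hadamard(s_alpha, s_beta)

# With s_beta in [0, 1], the second group behaves like a gate on the first.
gated = F([[2.0, 4.0]], [[0.5, 0.0]])
# With s_beta fixed to 1, the output reduces to a single-group search.
ungated = F([[1.0, 3.0], [3.0, 5.0]], [[1.0, 1.0]])
```

The second call shows the degenerate case in which one group is fixed to 1, so only the other group contributes to the result.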
Another note on $F(\alpha; \beta)$: the grouping reduces a big problem into two cheaper tasks. This is particularly important for building affordable NAS systems because the computational cost increases exponentially as more input nodes are involved. Our method instead has a linear time complexity if we adopt a reasonable constraint on group size.

The Intra-Cell Search Space
The search of intra-cell architectures is straightforward. Since $\beta = 1$ and $s^{\beta} = 1$ (see Table 1), we are basically performing NAS on a single group of input vectors $\hat{h}_{t-1}$ and $\hat{x}_t$. We follow Liu et al. (2019a) and force the input of networks to be a single-layer network of $\hat{h}_{t-1}$ and $\hat{x}_t$. This can be described as
$$e_1 = \tanh(\hat{h}_{t-1} \cdot W^{(h)} + \hat{x}_t \cdot W^{(x)})$$
where $W^{(h)}$ and $W^{(x)}$ are parameters of the transformation, and tanh is the non-linear transformation. $e_1$ is the input node of the graph. See Figure 3 for intra-cell NAS of an RNN model.
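As a small worked example of this input node, the sketch below computes the single-layer combination of the two representation vectors in plain Python. The function names and the toy weight matrices are assumptions made for illustration only.

```python
import math

def vec_mat(v, W):
    """Row vector v (length d_in) times matrix W (d_in x d_out)."""
    return [sum(v[i] * W[i][j] for i in range(len(v)))
            for j in range(len(W[0]))]

def input_node(h_hat, x_hat, W_h, W_x):
    """Input node e_1: tanh of the sum of the two linear projections."""
    a = vec_mat(h_hat, W_h)
    b = vec_mat(x_hat, W_x)
    return [math.tanh(p + q) for p, q in zip(a, b)]

W_h = [[1.0, 0.0], [0.0, 1.0]]   # toy parameters, not learned values
W_x = [[0.0, 0.0], [0.5, 0.0]]
e1 = input_node([1.0, 0.0], [0.0, 1.0], W_h, W_x)
```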

The Inter-Cell Search Space
To learn $\hat{h}_{t-1}$ and $\hat{x}_t$, we can run the DARTS system as described above. However, Eqs. (2-3) define a model with a varying number of parameters for different time steps, to which our architecture search method is not straightforwardly applicable. Apart from this, a long sequence of RNN cells makes the search intractable.
Function JOINTLEARN(rounds, w, W)
1: for i in range(1, rounds) do
2:     while intra-cell model not converged do
3:         Update intra-cell w^(intra) and W
4:     while inter-cell model not converged do
5:         Update inter-cell w^(inter) and W
6: Derive architecture based on w
7: return architecture
Figure 4: Joint search algorithm.
For a simplified model, we re-define $f(\cdot)$ and $g(\cdot)$ as:
$$f(h_{[0,t-1]}; x_{[1,t-1]}) = f'(h_{t-1}; x_{[t-m,t-1]}) \tag{8}$$
$$g(x_{[1,t]}; h_{[0,t-1]}) = g'(x_t; h_{[t-m,t-1]}) \tag{9}$$
where $m$ is a hyper-parameter that determines how much history is considered. Eq. (8) indicates a model that learns a network on $x_{[t-m,t-1]}$ (i.e., $\beta = x_{[t-m,t-1]}$). Then, the output of the learned network (i.e., $s^{\beta}$) is used as a gate to control the information that we pass from the previous cell to the current cell (i.e., $\alpha = \{h_{t-1}\}$). Likewise, Eq. (9) defines a gate on $h_{[t-m,t-1]}$ and controls the information flow from $x_t$ to the current cell.
Learning $f'(\cdot)$ and $g'(\cdot)$ fits our method well due to the fixed number of input vectors. Note that $f'(\cdot)$ has $m$ input vectors $x_{[t-m,t-1]}$ for learning the gate network. Unlike what we do in intra-cell NAS, we do not concatenate them into a single input vector. Instead, we create a node for every input vector, that is, the input vector $e_i = x_{t-i}$ links with node $s_i$. We restrict $s_i$ to only receive inputs from $e_i$ for better processing of each input. This can be seen as a pruned network for the model described in Eq. (4). See Figure 3 for an illustration of inter-cell NAS.
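The gate on the previous cell state can be sketched as below. This is a simplified illustration, assuming a fixed sigmoid on every node instead of a learned mixture of operations and omitting the linear transformations; `gate_from_history` and `f_prime` are hypothetical names.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gate_from_history(history, m, node_op=sigmoid):
    """Gate network: one node per input vector, then an element-wise mean."""
    recent = history[-m:]                              # last m inputs
    # Each node s_i receives only its own input e_i (the pruned network).
    nodes = [[node_op(v) for v in e] for e in recent]
    return [sum(col) / len(nodes) for col in zip(*nodes)]

def f_prime(h_prev, x_history, m=3):
    """Pass h_{t-1} to the current cell, gated by the last m inputs."""
    gate = gate_from_history(x_history, m)
    return [h * g for h, g in zip(h_prev, gate)]

h_hat = f_prime([2.0, -4.0], [[0.0, 0.0]] * 3, m=3)
h_hat_short = f_prime([1.0], [[100.0], [0.0], [0.0]], m=2)
```

The second call shows that inputs older than `m` steps do not affect the gate, which keeps the number of input nodes fixed regardless of the sequence length. The gate on $x_t$ built from past cell states follows the same pattern with the roles of the two histories swapped.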

Joint Learning for Architecture Search
Our model is flexible. For architecture search, we can run intra-cell NAS, or inter-cell NAS, or both of them as needed. However, we found that simply joining intra-cell and inter-cell architectures might not be desirable because both methods were restricted to a particular region of the search space, and the simple combination of them could not guarantee the global optimum.
This necessitates the inclusion of interactions between intra-cell and inter-cell architectures into the search process. Generally, the optimal inter-cell architecture depends on the intra-cell architecture used in search, and vice versa. A simple method that considers this issue is to learn two models in a joint manner. Here, we design a joint search method to make use of the interaction between intra-cell NAS and inter-cell NAS. Figure 4 shows the algorithm. It runs for a number of rounds. In each round, we first learn an optimal intra-cell architecture by fixing the inter-cell architecture, and then learn a new inter-cell architecture by fixing the intra-cell architecture just found.
Obviously, a single run of intra-cell (or inter-cell) NAS is a special case of our joint search method. For example, one can turn off the inter-cell NAS part (lines 4-5 in Figure 4) and learn intra-cell architectures solely. In a sense, the joint NAS method extends the search space of individual intra-cell (or inter-cell) NAS. Both intra-cell and inter-cell NAS shift to a new region of the parameter space in a new round. This implicitly explores a larger number of underlying models. As shown in our experiments, joint NAS learns intra-cell architectures unlike those of the individual intra-cell NAS, which leads to better performance in language modeling and other tasks.
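The alternating structure of the joint search can be sketched on a toy problem. The code below is an illustration only: each group of architecture weights is a single scalar, the loss is a hypothetical coupled quadratic (`toy_grad`), and a fixed number of inner steps stands in for the convergence tests. The model parameters and the final discretization step are omitted.

```python
def joint_learn(rounds, w_intra, w_inter, loss_grad, lr=0.1, inner_steps=50):
    """Alternate between intra-cell and inter-cell updates for several rounds."""
    for _ in range(rounds):
        # Intra-cell phase: the inter-cell weights are held fixed.
        for _ in range(inner_steps):
            g_intra, _ = loss_grad(w_intra, w_inter)
            w_intra -= lr * g_intra
        # Inter-cell phase: the intra-cell weights just found are held fixed.
        for _ in range(inner_steps):
            _, g_inter = loss_grad(w_intra, w_inter)
            w_inter -= lr * g_inter
    return w_intra, w_inter

def toy_grad(a, b):
    # Gradient of the toy loss (a - 1)^2 + (b + 2)^2 + a * b.
    # The a * b term couples the two groups, so each optimum depends on the other.
    return 2.0 * (a - 1.0) + b, 2.0 * (b + 2.0) + a

w_a, w_b = joint_learn(20, 0.0, 0.0, toy_grad)
single_a, single_b = joint_learn(1, 0.0, 0.0, toy_grad)
```

In this toy setting a single round stops at a point that is optimal for each group only given the other group's earlier value, whereas repeated rounds move both groups toward the joint optimum at $(8/3, -10/3)$.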

Experiments
We experimented with our ESS method on Penn Treebank and WikiText language modeling tasks and applied the learned architecture to NER and chunking tasks to test its transferability.

Experimental Setup
For the language modeling tasks, the training and evaluation data came from two sources.
• Penn Treebank (PTB). We followed the standard preprocessed version of PTB (Mikolov et al., 2010). It consisted of 929k training words, 73k validation words and 82k test words. The vocabulary size was set to 10k.
• WikiText-103 (WT-103). We also used WikiText-103.
NER and chunking tasks were also used to test the transferability of the pre-learned architecture. We transferred the intra-cell and inter-cell networks learned on WikiText-103 to the CoNLL-2003 (English) and WNUT-2017 NER tasks and the CoNLL-2000 chunking task. The CoNLL-2003 task focused on newswire text, while WNUT-2017 contained a wider range of English text, which is more difficult to model.
Our ESS method consisted of two components, including recurrent neural architecture search and architecture evaluation. During the search process, we ran our ESS method to search for the intra-cell and inter-cell architectures jointly. In the second stage, the learned architecture was trained and evaluated on the test dataset.
For architecture search on language modeling tasks, we applied 5 activation functions as the candidate operations, including drop, identity, sigmoid, tanh and relu. On the PTB modeling task, the recurrent cell had 8 nodes. The inter-cell architecture received 3 input vectors from the previous cells and consisted of the same number of intermediate nodes. By default, we trained our ESS models for 50 rounds. We set the batch size to 256 and used 300 hidden units for the intra-cell model. The learning rate was set to $3 \times 10^{-3}$ for the intra-cell architecture and $1 \times 10^{-3}$ for the inter-cell architecture. The BPTT (Werbos, 1990) length was 35. For the search process on WikiText-103, we developed a more complex model to encode the representation. There were 12 nodes in each cell and 5 nodes in the inter-cell networks. The batch size was 128 and the number of hidden units was 300, the same as on the PTB task. We set the intra-cell and inter-cell learning rates to $1 \times 10^{-3}$ and $1 \times 10^{-4}$, respectively. A larger BPTT window size (70) was applied for WikiText-103. All experiments were run on a single NVIDIA 1080Ti.
After the search process, we trained the learned architectures on the same data. To make the results comparable with previous work, we copied the setup in Merity et al. (2018b). For PTB, the size of hidden layers was set to 850 and the model was trained for 3,000 epochs. For WikiText-103, we enlarged the number of hidden units to 2,500 and trained the model for 30 epochs. Additionally, we transferred the learned architecture to NER and chunking tasks with the setting in Akbik et al. (2019). We only modified the batch size to 24 and the hidden size to 512.

Language Modeling tasks
Here we report the perplexity scores, number of parameters and search cost on the PTB and WikiText-103 datasets (Table 2). First of all, the joint ESS method improves the performance on language modeling tasks significantly. Moreover, it does not introduce many parameters. Our ESS method achieves a state-of-the-art result on the PTB task. It outperforms the manually designed Mogrifier-LSTM by 4.5 perplexity points on the test set. On the WikiText task, it still yields an improvement of 2.4 perplexity points over the strong NAS baseline (DARTS). These results indicate that ESS is robust and can learn better architectures by enlarging the scope of the search space.
Figure 5: Perplexity on the validation data (PTB) vs. number of nodes in intra and inter-cell.
Also, we find that searching for the appropriate connections among cells plays a more important role in improving the model performance. We observe that the intra-cell NAS (DARTS) system underperforms the inter-cell counterpart with the same number of parameters. This is because well-designed intra-cell architectures (e.g., Mogrifier-LSTM) are actually competitive with the NAS structures, whereas the fragile connections among different cells greatly restrict the representation space. The additional inter-cell connections are able to encode much richer context. Nevertheless, our ESS method does not outperform the manually designed Transformer-XL model on the WikiText-103 dataset, even though ESS works better than other RNN-based NAS methods. This is partially due to the better ability of Transformer-XL to capture the language representation. Note that RNNs are not good at modeling long-distance dependencies even if more history states are considered. Applying ESS to Transformer is a promising direction, but it is beyond the scope of this work.

Sensitivity Analysis
To modulate the complexity of the intra-cell and inter-cell architectures, we study the system behaviors under different numbers of intermediate nodes (Figure 5). Fixing the number of model parameters, we compare these systems under different numbers of intra-cell and inter-cell nodes. Due to limited space, we show only the results on PTB in the following sensitivity analysis. We observe that an appropriate choice of node number (8 nodes for intra-cell and 3 nodes for inter-cell) brings a consistent improvement. More interestingly, we find that too many nodes for the inter-cell architecture do not improve the representation ability of the model. This is reasonable because more inter-cell nodes mean considering more history in our system. But for language modeling, the current state is more likely to be relevant to the most recent words. Too many inputs to the gate networks raise difficulties in modeling.
Table 3: Difference in word loss (normalized by word counts) on validation data when searching intra and inter-cell jointly. The left column contains the words with eight best improvements (larger absolute value of ∆loss) and right column presents the most frequent words in the validation data.
We observe that our ESS method leads to a model that is easier to train. The left part of Figure 6 plots the validation perplexity at different training steps. The loss curve of joint ESS goes down significantly as the training proceeds. More interestingly, our joint learning method makes the model achieve a lower perplexity than the intra-cell NAS system. This indicates that better networks can be obtained in the search process. Additionally, the convergence can be observed from the right part of Figure 6. Here we apply Mean Absolute Deviation (MAD) to define the distance between edge weights and the initial uniform distribution. It is obvious that both the intra-cell and inter-cell architectures change little at the final search steps.
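One way to compute such a distance is sketched below. The exact averaging is an assumption (here the mean is taken over every operation weight on every edge), and `mad_from_uniform` is a hypothetical name.

```python
def mad_from_uniform(edge_weights):
    """Mean absolute deviation of normalized edge weights from uniform (1/K)."""
    total, count = 0.0, 0
    for theta in edge_weights:        # one normalized weight vector per edge
        k = len(theta)
        for p in theta:
            total += abs(p - 1.0 / k)
            count += 1
    return total / count

at_start = mad_from_uniform([[0.25, 0.25, 0.25, 0.25]])
one_hot = mad_from_uniform([[1.0, 0.0, 0.0, 0.0]])
```

The value is zero at initialization and grows as the weights concentrate on particular operations, so a curve that flattens over the last search steps indicates that the architecture has stopped changing.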
Models                                      F1
(Lample et al., 2016)                       90.94
LSTM-CRF + ELMo (Peters et al., 2018)       92.22
LSTM-CRF + Flair (Akbik et al., 2019)       93.18
GCDT + BERT LARGE (Liu et al., 2019b)       93.47
CNN Large + ELMo (Baevski et al., 2019)     93.50
DARTS + Flair (Jiang et al., 2019)          93.13
I-DARTS + Flair (Jiang et al., 2019)        93
In order to understand the advantage of inter-cell connections, we detail the model contribution on each word of the validation data. Specifically, we compute the difference in word loss (i.e., log perplexity) between methods with and without inter-cell NAS. The words with the eight best improvements are shown in the left column of Table 3. We observe that the rare words in the training set obtain more significant improvements. In contrast, the most frequent words lead to a very modest decrease in loss (right column of Table 3). This is because the connections between multiple cells enable learning rare word representations from more histories. Common words, in contrast, can obtain this information from rich contexts, so more inputs from previous cells do not bring much useful information.
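The per-word comparison described above can be sketched as follows, assuming token-level losses from the two systems are available for the same validation text. The function names are hypothetical, and the sign convention (negative values mean a lower loss with inter-cell NAS) is an assumption.

```python
from collections import defaultdict

def per_word_loss(tokens, losses):
    """Average loss of each word type, normalized by its count."""
    total, count = defaultdict(float), defaultdict(int)
    for tok, loss in zip(tokens, losses):
        total[tok] += loss
        count[tok] += 1
    return {tok: total[tok] / count[tok] for tok in total}

def loss_difference(tokens, losses_with, losses_without):
    """Per-word loss with inter-cell NAS minus loss without it."""
    a = per_word_loss(tokens, losses_with)
    b = per_word_loss(tokens, losses_without)
    return {tok: a[tok] - b[tok] for tok in a}

tokens = ["the", "cat", "the"]                 # toy data, not from the paper
diff = loss_difference(tokens, [1.0, 2.0, 3.0], [1.0, 4.0, 3.0])
most_improved = sorted(diff, key=diff.get)[0]
```

Sorting the resulting dictionary by value gives the words with the largest improvements, while sorting by frequency gives the comparison set of common words.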
Additionally, we visualize the learned intra-cell architecture in Figure 7(a). The networks are jointly learned with the inter-cell architecture. Compared with the results of intra-cell NAS (Figure 7(b)), the learned network is shallower. The inter-cell architectures have deeper networks. This in turn reduces the need for intra-cell capacity. Thus a very deep intra-cell architecture might not be necessary if we learn the whole model jointly.

Transferring to Other Tasks
After architecture search, we test the transferability of the learned architecture. In order to apply the model to other tasks, we directly use the architecture searched on WikiText-103 and train the parameters.
Models                                      F1
Cross-BiLSTM-CNN (Aguilar et al., 2018)
(Yang and Zhang, 2018)                      95.06
BiLSTM-CRF + IntNet (Xin et al., 2018)      95.29
Flair (Akbik et al., 2019)                  96.72
GCDT + BERT LARGE (Liu et al., 2019b)       97
For the two NER tasks, it achieves new state-of-the-art F1 scores (Table 4 and Table 5). ELMo, Flair and BERT LARGE refer to the pre-trained language models. We apply these word embeddings to the learned architecture during the model training process. For the chunking task, the learned architecture also shows better performance than other NAS methods (Table 6). Moreover, we find that our pre-learned neural networks yield bigger improvements on the WNUT-2017 task. The difference between the two NER tasks is that WNUT-2017 is a long-tail emerging entity recognition task. It focuses on identifying unusual, previously unseen entities in the context of emerging discussions. As discussed in the sensitivity analysis, the additional inter-cell NAS is good at learning the representations of rare words. Therefore, it makes sense to have a bigger improvement on WNUT-2017.

Conclusions
We have proposed the Extended Search Space (ESS) method of NAS. It learns intra-cell and inter-cell architectures simultaneously. Moreover, we present a general model of differentiable architecture search to handle an arbitrary search space. Meanwhile, the high-level and low-level sub-networks can be learned in a joint fashion. Experiments on two language modeling tasks show that ESS yields improvements of 4.5 and 2.4 perplexity points over a strong RNN-based baseline. More interestingly, it is observed that transferring the pre-learned architectures to other tasks also obtains a promising performance improvement.