HAT: Hardware-Aware Transformers for Efficient Natural Language Processing

Transformers are ubiquitous in Natural Language Processing (NLP) tasks, but they are difficult to deploy on hardware because of their intensive computation. To enable low-latency inference on resource-constrained hardware platforms, we propose to design Hardware-Aware Transformers (HAT) with neural architecture search. We first construct a large design space with arbitrary encoder-decoder attention and heterogeneous layers. Then we train a SuperTransformer that covers all candidates in the design space and efficiently produces many SubTransformers with weight sharing. Finally, we perform an evolutionary search with a hardware latency constraint to find a specialized SubTransformer dedicated to running fast on the target hardware. Extensive experiments on four machine translation tasks demonstrate that HAT can discover efficient models for different hardware (CPU, GPU, IoT device). When running the WMT’14 translation task on Raspberry Pi-4, HAT achieves 3× speedup and 3.7× smaller size over the baseline Transformer, and 2.7× speedup and 3.6× smaller size over the Evolved Transformer, with 12,041× less search cost and no performance loss. HAT is open-sourced at https://github.com/mit-han-lab/hardware-aware-transformers.


Introduction
Transformer (Vaswani et al., 2017) has been widely used in natural language processing tasks. By stacking multiple identical encoder/decoder layers with attention modules, it provides a significant performance improvement over previous convolutional or recurrent neural network models (Kim, 2014).
Nevertheless, it is challenging to deploy Transformers on mobile devices due to the high computation cost. For instance, in order to translate a sentence with only 30 words, a Transformer-Big model needs to execute 13G FLOPs and takes 20 seconds on a Raspberry Pi. Such long latency will hurt the user experience on edge devices. Thus we need hardware-efficient Transformers (Figure 1).

Figure 1: Framework for searching Hardware-Aware Transformers. We first train a SuperTransformer that contains numerous sub-networks, then conduct an evolutionary search with hardware latency feedback to find one specialized SubTransformer for each hardware.

There are two common pitfalls when evaluating the efficiency of a Transformer. (1) FLOPs does not reflect the measured latency. Although FLOPs is used as a metric for efficiency in prior work (Howard et al., 2017; Wu et al., 2020), it is not a good latency proxy. As in Figure 2 (Right), models with the same FLOPs can result in very different measured latencies. (2) Different hardware prefers different Transformer architectures. As in Table 1, the Transformer model optimized for one hardware platform is sub-optimal for another because latency is influenced by different factors on different hardware platforms. For example, the embedding dimension has a significant impact on the Raspberry Pi latency but hardly influences the GPU latency (Figure 2).
Inspired by the success of Neural Architecture Search (NAS) (Bender et al., 2018; Guo et al., 2019; Pham et al., 2018; Cai et al., 2019a), we propose to search for Hardware-Aware Transformers (HAT) by directly involving the latency feedback into the design loop. In this way, we do not need FLOPs as the latency proxy and can search specialized models for various hardware.
We first construct a large search space with arbitrary encoder-decoder attention and heterogeneous Transformer layers. The traditional Transformer has an information bottleneck between the encoder and decoder. Arbitrary encoder-decoder attention breaks the bottleneck, allowing all decoder layers to attend to multiple and different encoder layers instead of only the last one. Thus low-level information from the encoder can also be used by the decoder. Motivated by Figure 2, we introduce heterogeneous Transformer layers to allow different layers to have different architectures, adapting to various hardware.
To perform a low-cost search in such a large design space, we first train a Transformer supernet, the SuperTransformer, which contains many SubTransformers sharing the weights. We train all SubTransformers simultaneously by optimizing SubTransformers uniformly sampled from the SuperTransformer. The performance of a SubTransformer with weights inherited from the SuperTransformer can provide a good relative performance approximation for different architectures trained from scratch. Unlike conventional NAS, we only need to pay the SuperTransformer training cost once and can evaluate all the models in the design space with it. Finally, we conduct an evolutionary search to find the best SubTransformer under the hardware latency constraint. Experiments show that HAT can be naturally combined with model compression techniques such as quantization and knowledge distillation.

Proposed Approaches
An overview of the HAT framework is shown in Figure 3. We first train a SuperTransformer with a large design space. Then, for a given hardware platform, we collect a dataset of (SubTransformer architecture, measured latency) pairs for different models, and train a latency predictor. Finally, we conduct an evolutionary search with a latency constraint to find an efficient model specialized for the target hardware.

Design Space
We construct a large design space by breaking two conventions in the Transformer design: (1) All decoder layers only attend to the last encoder layer; (2) All the layers are identical.
Arbitrary Encoder-Decoder Attention. Different encoder layers extract features on different abstraction levels. Conventionally, all the decoder layers only attend to the last encoder layer. This forms an information bottleneck that forces all the decoder layers to learn solely from the high abstraction level and ignore the low-level information. To break the bottleneck, we propose Arbitrary Encoder-Decoder Attention to learn the most suitable connections between the encoder and the decoder. Each decoder layer can choose multiple encoder layers to attend to. The key and value vectors from encoder layers are concatenated in the sentence length dimension (Figure 4) and fed to the encoder-decoder cross attention module. The mechanism is efficient because it introduces no additional parameters. The latency overhead is also negligible. For example, with each decoder layer attending to two encoder layers, the latency of Transformer-Base on Nvidia TITAN Xp GPU increases by only 0.4%. It improves the model capacity by allowing attention to different abstraction levels.
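The concatenation along the sentence length dimension can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it uses a single head, omits the learned query/key/value projections for brevity, and the function name `cross_attention` and all shapes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, encoder_outputs, num_attended):
    """Single-head cross attention over the last `num_attended` encoder layers.

    queries:         (tgt_len, dim) decoder states
    encoder_outputs: list of (src_len, dim) arrays, one per encoder layer
    Projections are omitted here (an assumption made to keep the sketch short).
    """
    # Concatenate along the sentence-length axis: (num_attended * src_len, dim).
    memory = np.concatenate(encoder_outputs[-num_attended:], axis=0)
    scores = queries @ memory.T / np.sqrt(queries.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ memory, weights

rng = np.random.default_rng(0)
enc = [rng.standard_normal((30, 512)) for _ in range(6)]  # 6 encoder layers
q = rng.standard_normal((5, 512))                         # 5 decoder positions
out, w = cross_attention(q, enc, num_attended=2)
```

Because only the attended sequence becomes longer, the parameter count is unchanged; each decoder position simply distributes its attention over `num_attended * src_len` positions.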

Heterogeneous Transformer Layers. Previous Transformers repeat one architecture for all layers. In HAT, instead, different layers are heterogeneous, with different numbers of heads, hidden dim, and embedding dim. In attention layers, different heads are used to capture various dependencies. However, Voita et al. (2019) show that many heads are redundant. We therefore make the attention head number elastic so that each attention module can decide its necessary number of heads.
In the FFN layer, the input features are cast to a higher dimension (hidden dim), followed by an activation layer. Traditionally, the hidden dim is set as 2× or 4× of the embedding dim, but this is sub-optimal since different layers need different capacities depending on the feature extraction difficulty. We hence make the hidden dim elastic.
Moreover, we also support elastic embedding dim for the encoder and decoder, but it is kept consistent inside the encoder and inside the decoder. The numbers of encoder and decoder layers are also elastic to learn the proper level of feature encoding and decoding. Other design choices, such as the length of Q, K, V vectors in attention modules, can be naturally incorporated in our framework, which we leave for future work.

SuperTransformer
It is critical to have a large design space in order to find high-performance models. However, training all the models and comparing their BLEU scores is infeasible. We thus propose SuperTransformer, a supernet for performance approximation, which can judge the performance of a model without fully training it. The SuperTransformer is the largest model in the search space with weight sharing (Pham et al., 2018; Liu et al., 2019; Cai et al., 2019a). Every model in the search space (a SubTransformer) is a part of the SuperTransformer. All SubTransformers share the weights of their common parts. For elastic embedding dim, all SubTransformers share the front portion of the longest word embedding and corresponding FC layer weights. As in Figure 5, for elastic FFN hidden dim, the front part of the FC weights is shared. For elastic head number in attention modules, the whole Q, K, V vectors (the lengths are fixed in our design space) are shared by dividing into head number parts. Elastic layer numbers let all SubTransformers share the first several layers. In the SuperTransformer training, all possible SubTransformers are uniformly sampled, and the corresponding weights are updated. In practice, the SuperTransformer only needs to be trained for the same steps as a baseline Transformer model, which is fast and low-cost. After training, we can get the performance proxy of sampled models in the design space by evaluating the corresponding SubTransformers on the validation set without training.
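Sharing "the front part of the FC weights" can be pictured as slicing the largest weight matrix. The sketch below is a hedged illustration in NumPy, not the released code; the helper `sub_linear` and the concrete sizes are assumptions.

```python
import numpy as np

def sub_linear(x, super_weight, super_bias, out_dim):
    """Apply a fully-connected layer using only the front slice of shared weights.

    x:            (..., in_dim) input; in_dim may be smaller than the supernet's
    super_weight: (max_out_dim, max_in_dim) weight of the largest layer
    super_bias:   (max_out_dim,) bias of the largest layer
    """
    in_dim = x.shape[-1]
    weight = super_weight[:out_dim, :in_dim]  # a view, so the weights stay shared
    bias = super_bias[:out_dim]
    return x @ weight.T + bias

rng = np.random.default_rng(0)
super_w = rng.standard_normal((3072, 640))  # largest hidden dim x largest embed dim
super_b = np.zeros(3072)
x = rng.standard_normal((4, 512))           # a SubTransformer with embed dim 512
hidden = sub_linear(x, super_w, super_b, out_dim=1024)
```

Since the slice is a view of the larger matrix, an update made while training one SubTransformer is immediately visible to every other SubTransformer that uses the same front portion.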

Evolutionary Search for SubTransformer
Given a latency requirement, we perform an evolutionary search to find a satisfactory SubTransformer. There are two ways to evaluate the hardware latency of a SubTransformer: (1) Online measurement, in which we measure the models during the search process. (2) Offline, where we train a latency predictor to provide the latency. We apply the offline method here because it is fast and accurate. For the online method, a single sampled SubTransformer requires hundreds of inferences to get an accurate latency, which lasts for minutes and slows down the searching. For the offline method, we encode the architecture of a SubTransformer into a feature vector, and predict its latency instantly with a multi-layer perceptron (MLP). Trained with thousands of real latency data points, the predictor yields high accuracy (Figure 6). Note that the predicted latency is only used in the search process, and we report real measured latency in the experiment section. Compared with deducing a closed-form latency model for each hardware, the latency predictor method is more general and faster.
We use an evolutionary algorithm to conduct the search process. As in Figure 3, the search engine queries the latency predictor for SubTransformer latency, and validates the loss on the validation set. The engine only adds SubTransformers with latency smaller than the hardware constraint to the population. We then train the searched models from scratch to obtain the final performance.
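The search loop can be sketched as follows. This is a simplified illustration under stated assumptions: the search space is flattened to four global choices (the real space is per-layer), `toy_latency` and `toy_loss` are hypothetical stand-ins for the latency predictor and the inherited-weight validation loss, and the default population sizes follow the values reported in the implementation details (population 125, 25 parents, 50 mutations with probability 0.3, 50 crossovers, 30 iterations).

```python
import random

SPACE = {
    "embed_dim": [512, 640],
    "hidden_dim": [1024, 2048, 3072],
    "heads": [4, 8],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
}

def sample(rng):
    return {k: rng.choice(v) for k, v in SPACE.items()}

def mutate(cfg, rng, prob):
    # Each gene is resampled independently with probability `prob`.
    return {k: (rng.choice(SPACE[k]) if rng.random() < prob else v) for k, v in cfg.items()}

def crossover(a, b, rng):
    # Each gene is inherited from one of the two parents at random.
    return {k: rng.choice([a[k], b[k]]) for k in a}

def toy_latency(cfg):  # hypothetical latency model in seconds, NOT the trained predictor
    return 0.002 * cfg["embed_dim"] * cfg["decoder_layers"] / 6 + 0.0003 * cfg["hidden_dim"]

def toy_loss(cfg):  # hypothetical validation loss, NOT measured with a SuperTransformer
    return 6.0 - 0.3 * cfg["decoder_layers"] - cfg["embed_dim"] / 2000 - cfg["hidden_dim"] / 8000

def evolutionary_search(latency_fn, loss_fn, limit, iterations=30, population=125,
                        parents=25, mutations=50, crossovers=50, mutate_prob=0.3, seed=0):
    rng = random.Random(seed)

    def fill(n, make):
        # Keep only candidates whose predicted latency is below the constraint.
        out, tries = [], 0
        while len(out) < n and tries < 100 * n:
            tries += 1
            cand = make()
            if latency_fn(cand) < limit:
                out.append(cand)
        return out

    pop = fill(population, lambda: sample(rng))
    if not pop:
        raise ValueError("no candidate satisfies the latency constraint")
    for _ in range(iterations):
        top = sorted(pop, key=loss_fn)[:parents]
        mutated = fill(mutations, lambda: mutate(rng.choice(top), rng, mutate_prob))
        crossed = fill(crossovers, lambda: crossover(rng.choice(top), rng.choice(top), rng))
        pop = top + mutated + crossed  # 25 + 50 + 50 = 125 with the defaults
    return min(pop, key=loss_fn)

best = evolutionary_search(toy_latency, toy_loss, limit=1.2)
```

Keeping the parents in the next generation is one plausible reading of the reported population sizes; other selection schemes would fit the description equally well.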
Evaluation Metrics. For evaluation, we use a beam size of four and a length penalty of 0.6 for WMT, and a beam size of five for IWSLT (Vaswani et al., 2017). All BLEUs are calculated with case-sensitive tokenization, but we also apply the compound splitting BLEU for WMT, the same as Vaswani et al. (2017). We test the model with the lowest validation set loss for WMT and the last ten checkpoints averaged for IWSLT.
We test the latency of the models by measuring translation from a source sentence to a target sentence with the same length. The length is the average output length on the test set: 30 for WMT and 23 for IWSLT. For each model, we measure the latency 300 times, remove the fastest and slowest 10%, and then take the average of the remaining 80%. We conduct experiments on three representative hardware platforms: Raspberry Pi-4 with an ARM Cortex-A72 CPU, Intel Xeon E5-2640 CPU, and Nvidia TITAN Xp GPU.
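The measurement protocol (300 runs, drop the fastest and slowest 10%, average the rest) amounts to a trimmed mean. A minimal sketch follows; the function names and the `run_once` callable are assumptions for illustration.

```python
import time

def trimmed_mean(values, trim=0.10):
    """Average after discarding the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(round(len(ordered) * trim))
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def measure_latency(run_once, repeats=300, trim=0.10):
    """Time `run_once` (a hypothetical callable that translates one sentence)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return trimmed_mean(timings, trim)

# Ten slow outliers among 300 runs do not affect the reported latency.
example = trimmed_mean([1.0] * 290 + [100.0] * 10)
```

Trimming both tails makes the reported number robust to occasional stalls (for example, background processes) as well as to unusually fast runs.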

Implementation Details
SuperTransformer Setups. The SuperTransformer for WMT has the following design space: [512, 640] for embedding dim, [1024, 2048, 3072] for hidden dim, [4, 8] for the head number in all attention modules, [1, 2, 3, 4, 5, 6] for decoder layer number. Due to decoder auto-regression, the encoder only accounts for less than 5% of the measured latency; we therefore fix the encoder layer number at 6. For arbitrary encoder-decoder attention, each decoder layer can choose to attend to the last one, two, or three encoder layers. The SuperTransformer design space for IWSLT is the same as WMT except for [2048, 1024, 512] for hidden dim and [4, 2] for head number. We set the Q, K, V vector dim fixed as 512. The design space contains around 10^15 possible SubTransformers and covers a wide range of model size and latency (largest = 6× smallest). We train the SuperTransformers of WMT for 40K steps and 50K steps for IWSLT.
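As an illustration of the WMT design space listed above, the following sketch uniformly samples one SubTransformer configuration. The dictionary layout and key names are assumptions, not the format used in the released code.

```python
import random

WMT_SPACE = {
    "embed_dim": [512, 640],
    "ffn_hidden_dim": [1024, 2048, 3072],
    "attention_heads": [4, 8],
    "decoder_layers": [1, 2, 3, 4, 5, 6],
    "attended_encoder_layers": [1, 2, 3],  # last one, two, or three encoder layers
}
ENCODER_LAYERS = 6  # fixed, since the encoder contributes little to latency

def sample_subtransformer(rng):
    def layer(is_decoder):
        cfg = {
            "ffn_hidden_dim": rng.choice(WMT_SPACE["ffn_hidden_dim"]),
            "self_attn_heads": rng.choice(WMT_SPACE["attention_heads"]),
        }
        if is_decoder:
            cfg["enc_dec_attn_heads"] = rng.choice(WMT_SPACE["attention_heads"])
            cfg["attended_encoder_layers"] = rng.choice(WMT_SPACE["attended_encoder_layers"])
        return cfg

    num_decoder_layers = rng.choice(WMT_SPACE["decoder_layers"])
    return {
        # The embedding dim is elastic but shared by all layers of one stack.
        "encoder_embed_dim": rng.choice(WMT_SPACE["embed_dim"]),
        "decoder_embed_dim": rng.choice(WMT_SPACE["embed_dim"]),
        "encoder": [layer(False) for _ in range(ENCODER_LAYERS)],
        "decoder": [layer(True) for _ in range(num_decoder_layers)],
    }

cfg = sample_subtransformer(random.Random(0))
```

Hidden dim and head numbers are drawn per layer, which is what makes the layers heterogeneous, while each embedding dim is drawn once per stack.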
Hardware-Aware Evolutionary Search Setups. The input of the latency predictor is a feature vector of the SubTransformer architecture with ten elements: layer number, embed dim, average hidden dim, and average self-attention heads, for both encoder and decoder; plus average encoder-decoder attention heads, and the average number of encoder layers each decoder layer attends to. A dataset of 2000 (SubTransformer architecture, measured latency) samples for each hardware is collected, and split into train:valid:test=8:1:1. We normalize the features and latency, and train a three-layer MLP with 400 hidden dim and ReLU activation. We choose three layers because this is more accurate than the one-layer model, and more than three layers do not improve accuracy further. With the predictor, we conduct an evolutionary search for 30 iterations in the SuperTransformer, with population 125, parents population 25, mutation population 50 with 0.3 probability, and crossover population 50.
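The ten-element feature vector and the predictor's forward pass might look like the sketch below. Several things are assumptions: the configuration layout, reading "three-layer" as three linear layers (10 → 400 → 400 → 1), the normalization constants in `FEATURE_SCALE`, and the random untrained weights, so the output is not a real latency.

```python
import numpy as np

def encode(cfg):
    """Build the ten-element architecture feature vector described above."""
    enc, dec = cfg["encoder"], cfg["decoder"]
    mean = lambda layers, key: float(np.mean([layer[key] for layer in layers]))
    return np.array([
        len(enc), cfg["encoder_embed_dim"],
        mean(enc, "ffn_hidden_dim"), mean(enc, "self_attn_heads"),
        len(dec), cfg["decoder_embed_dim"],
        mean(dec, "ffn_hidden_dim"), mean(dec, "self_attn_heads"),
        mean(dec, "enc_dec_attn_heads"), mean(dec, "attended_encoder_layers"),
    ], dtype=np.float64)

class LatencyMLP:
    """Three linear layers with ReLU in between; weights here are untrained."""
    def __init__(self, in_dim=10, hidden=400, seed=0):
        rng = np.random.default_rng(seed)
        dims = [in_dim, hidden, hidden, 1]
        self.weights = [rng.standard_normal((a, b)) * np.sqrt(2.0 / a)
                        for a, b in zip(dims[:-1], dims[1:])]
        self.biases = [np.zeros(b) for b in dims[1:]]

    def __call__(self, x):
        h = np.asarray(x, dtype=np.float64)
        for i, (w, b) in enumerate(zip(self.weights, self.biases)):
            h = h @ w + b
            if i < len(self.weights) - 1:
                h = np.maximum(h, 0.0)  # ReLU on hidden layers only
        return h[..., 0]

# Hypothetical normalization: divide each feature by its largest value in the WMT space.
FEATURE_SCALE = np.array([6, 640, 3072, 8, 6, 640, 3072, 8, 8, 3], dtype=np.float64)

example = {
    "encoder_embed_dim": 512,
    "decoder_embed_dim": 512,
    "encoder": [{"ffn_hidden_dim": 2048, "self_attn_heads": 8} for _ in range(6)],
    "decoder": [
        {"ffn_hidden_dim": 3072, "self_attn_heads": 8,
         "enc_dec_attn_heads": 8, "attended_encoder_layers": 1},
        {"ffn_hidden_dim": 1024, "self_attn_heads": 4,
         "enc_dec_attn_heads": 4, "attended_encoder_layers": 3},
    ],
}
features = encode(example)
prediction = LatencyMLP()(features / FEATURE_SCALE)
```

In the described pipeline the weights would be fitted on the 2000 measured samples per hardware platform; this sketch only shows the data flow from architecture to a scalar prediction.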
Training Settings. Our training settings are in line with Wu et al. (2019b) and Wu et al. (2020). For WMT, we train for 40K steps with the Adam optimizer and a cosine learning rate (LR) scheduler (Kingma and Ba, 2015; Loshchilov and Hutter, 2017), where the LR is linearly warmed up from 10^−7 to 10^−3, and then cosine annealed. For IWSLT, we train for 50K steps with an inverse square root LR scheduler. The baseline Transformers are trained with the same settings as the searched SubTransformers for fair comparisons.
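The WMT schedule (linear warm-up from 10^−7 to 10^−3, then cosine annealing) can be written as a small function. The warm-up length and the final learning rate are not stated in the text above, so `warmup_steps=10_000` and `min_lr=1e-7` are assumptions for illustration only.

```python
import math

def lr_at(step, total_steps=40_000, warmup_steps=10_000,
          init_lr=1e-7, peak_lr=1e-3, min_lr=1e-7):
    """Learning rate at a given training step (warm-up length and min_lr are assumed)."""
    if step < warmup_steps:
        # Linear warm-up from init_lr to peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Cosine annealing from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s) for s in range(0, 40_001, 5_000)]
```

The schedule peaks exactly at the end of warm-up and decays smoothly afterwards, so the largest updates happen early and the final steps fine-tune with a very small learning rate.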

HAT Performance Comparisons
In Figures 7 and 8 and Appendix Table 8, we compare HAT with Transformer baselines on four tasks. The embedding dims are 512 and 1024 for the Transformer-Base and Big, respectively. The hidden dims are 4× and 2× of the embedding dim for WMT and IWSLT. The IWSLT models are smaller to prevent overfitting (Wu et al., 2019b). We obtain a series of baseline models with layer number scaling (yellow) and dimension scaling (blue). We set different latency constraints on the three hardware platforms to get a series of HAT models. HAT consistently outperforms baselines with a large gap under different latency constraints. On ARM CPU, HAT is 3× faster and 3.7× smaller than Transformer-Big with the same BLEU. On Intel CPU, HAT achieves over 2× speedup. On Nvidia GPU, the blue dashed line is nearly vertical, indicating that dimension scaling can hardly reduce the latency. In this case, HAT can still find models with low latency and high performance.
We further compare various aspects of HAT with Transformer (Vaswani et al., 2017) and Evolved Transformer (So et al., 2019) in Table 2. HAT achieves up to 1.6×, 3×, and 3.4× speedup with up to 1.4×, 3.7×, and 4× smaller size than baselines. We report FLOPs for translating a 23-token sentence for IWSLT and 30 for WMT. We show the overall GPU hours for training the SuperTransformer and the searched SubTransformer. We also calculate the cloud computing costs with different modes: "preemptable" is cheaper ($0.74/h) than "on-demand" ($2.48/h) (Strubell et al., 2019). HAT is highly affordable since the total GPU-hour is over 12000× smaller than the Evolved Transformer, and is even smaller than Transformer-Big by virtue of the compact model size.
In Table 3, we compare HAT with other recent models. We scale down all models to have BLEU scores similar to Levenshtein for fair comparisons. We adopt the average iteration time of 2.88 for decoding (Gu et al., 2019), without limiting the length of the output sentence (12 tokens after decoding). HAT runs 1.3× faster than Transformer with higher BLEU, and 1.9× faster than Levenshtein with 0.7 higher BLEU. Under similar latency, HAT also outperforms Lite Transformer. These results demonstrate HAT's effectiveness in lower latency scenarios. Our framework can also be adopted to speed up those models.

Model                                    Latency   BLEU
Transformer (Vaswani et al., 2017)       4.3s      25.85
Levenshtein (Gu et al., 2019)            6.5s      25.20
Evolved Transformer (So et al., 2019)    3.7s      25.40
Lite Transformer (Wu et al., 2020)       3.4s      25.79
HAT (Ours)                               3.4s      25.92
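The speedup figures quoted for Table 3 follow directly from the latency column. A short check, using only the numbers reported in the table (the helper name `speedup` is an assumption):

```python
# (latency in seconds, BLEU) as reported in Table 3
table3 = {
    "Transformer": (4.3, 25.85),
    "Levenshtein": (6.5, 25.20),
    "Evolved Transformer": (3.7, 25.40),
    "Lite Transformer": (3.4, 25.79),
    "HAT": (3.4, 25.92),
}

def speedup(baseline, model="HAT"):
    """Ratio of baseline latency to model latency."""
    return table3[baseline][0] / table3[model][0]

vs_transformer = round(speedup("Transformer"), 1)   # 4.3 / 3.4
vs_levenshtein = round(speedup("Levenshtein"), 1)   # 6.5 / 3.4
bleu_gain = round(table3["HAT"][1] - table3["Levenshtein"][1], 1)
```

Rounded to one decimal place, these ratios reproduce the 1.3× and 1.9× speedups and the 0.7 BLEU difference stated in the paragraph.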

Analysis
Design Insights. For all HAT WMT models in Figure 7, 10% of all decoder layers attend to three encoder layers, and 40% attend to two encoder layers. This demonstrates the necessity of arbitrary encoder-decoder attention.
In Appendix Figure 12, we visualize the models specialized for different hardware mentioned in Table 1. We find that the GPU model is wide but shallow; the Raspberry Pi model is deep but thin. The phenomenon echoes our latency profiling (Figure 2), as GPU latency is insensitive to embedding and hidden dim, but Raspberry Pi is highly sensitive. It guides manual designs: on GPU, we can reduce the layer number and increase dimension to reduce latency and keep high performance.
Ablation Study. HAT achieves higher BLEU with 1.5× lower latency and 1.5× smaller size compared with the largest SubTransformer (Table 4). This suggests that larger models do not always provide better performance, and demonstrates the effectiveness of HAT. We also compare the evolutionary search with random search (Figure 9). Evolutionary search can find models with lower losses than random search.

Table 5: The performance of SubTransformers with inherited weights is close to that of SubTransformers trained from scratch, and the two have the same relative performance order.

Figure 10: The search cost measured in pounds of CO2 emission. Our framework for searching HAT reduces the cost by four orders of magnitude compared with the Evolved Transformer (So et al., 2019).

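SubTransformer Performance Proxy. All SubTransformers inside the SuperTransformer are uniformly sampled and thus equally trained, so the performance order is well-preserved during training. We conduct experiments to show the effectiveness of the SubTransformer performance proxy as in Table 5 and Appendix Figure 11. The BLEUs of SubTransformers with inherited weights and weights trained from scratch are very close. More importantly, they also have the same relative performance order. Therefore, we can rely on the proxy to search for high-performance model architectures, significantly reducing the search cost.
Low Search Cost. As shown in Table 2 and Figure 10, the search cost of HAT is 12,041× lower than the Evolved Transformer. Although both use evolutionary search, the key difference is that the Evolved Transformer needs to train all individual models and sort their final performance to pick the top ones; in contrast, HAT trains all models together inside the SuperTransformer and sorts their performance proxy to pick the top ones. The superior performance of HAT indicates that the performance proxy is accurate enough to find good models.

Related Work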
Transformer. Transformer (Vaswani et al., 2017) has prevailed in sequence modeling (Ng et al., 2019; Junczys-Dowmunt, 2018). By stacking identical blocks, the model obtains a large capacity but incurs high latency. Recently, a research trend is to modify the Transformer to improve the performance (Chen et al., 2018; Wu et al., 2019b; Sukhbaatar et al., 2019; Wang et al., 2019). Among them, Wu et al. (2019b) introduced a convolution-based module to replace the attention; Wang et al. (2019) proposed to train deep Transformers by propagating multiple layers together in the encoder. Zhang et al. (2018) and Kim et al. (2019) also proposed AAN and SSRU to replace the attention mechanism. HAT is orthogonal to them and can be combined with them to search for efficient architectures with those new modules. Another trend is to apply non- or partially-autoregressive models to cut down the iteration number for decoding (Gu et al., 2019; Akoury et al., 2019; Wei et al., 2019; Gu et al., 2018). Although they reduce latency, they sometimes suffer from low performance. Bapna et al. (2018) explored using learned linear combinations of encoder outputs as decoder inputs, while HAT concatenates the outputs without linear combinations, thus better preserving the low-level information. Wu et al. (2020) investigated mobile settings for NLP tasks and proposed a multi-branch Lite Transformer. However, it relied on FLOPs for efficient model design, which is an inaccurate proxy for hardware latency (Figure 2). There are also works (Kim and Rush, 2016; Junczys-Dowmunt et al., 2018; Kim et al., 2019; Yan et al., 2020) using Knowledge Distillation (KD) to obtain small student models. Our method is orthogonal to KD and can be combined with it to improve the efficiency further. There are also hardware accelerators (Ham et al., 2020; Zhang et al., 2020) for attention and fully-connected layers in the Transformer to achieve efficient processing.
Neural Architecture Search. In the computer vision community, there has been an increasing interest in automating efficient model design with Neural Architecture Search (NAS) (Zoph and Le, 2017; Zoph et al., 2018; Pham et al., 2018; He et al., 2018). Some applied black-box optimization such as evolutionary search (Wang et al., 2020b) and reinforcement learning (Cai et al., 2019b; He et al., 2018; Wang et al., 2018, 2020a; Mao et al., 2019); some leveraged backpropagation with differentiable architecture search (Liu et al., 2019). Some also incorporated hardware constraints into the optimization, such as MNasNet (Tan et al., 2019), ProxylessNAS (Cai et al., 2019b), FBNet (Wu et al., 2019a) and APQ (Wang et al., 2020b). To reduce the NAS cost, supernet-based methods (Pham et al., 2018; Bender et al., 2018; Guo et al., 2019) apply a proxy for sub-network performance and adopt search algorithms to find good sub-networks. For NLP tasks, the benefits of architecture search have not been fully investigated. Recently, So et al. (2019) proposed the Evolved Transformer to search for architectures under model size constraints and surpassed the original Transformer baselines. However, it suffered from very high search costs (250 GPU years), making it unaffordable to search specialized models for various hardware and tasks. In addition, hardware latency feedback was not taken into account for better case-by-case specializations. Since different hardware has distinct architecture and features (Cong et al., 2018), feedback from hardware is critical for efficient NLP.

Conclusion
We propose the Hardware-Aware Transformers (HAT) framework to address the challenge of efficiently deploying Transformer models on various hardware platforms. We conduct hardware-aware neural architecture search in an ample design space with an efficient weight-shared SuperTransformer, consuming four orders of magnitude less cost than the prior Evolved Transformer, and discover high-performance low-latency models. We hope HAT can open up an avenue towards efficient Transformer deployments for real-world applications.

Figure 11: The validation loss of SubTransformers is a good performance proxy for the BLEU of SubTransformers trained from scratch. The larger the validation loss, the lower the BLEU score. (Plot axis: SubTransformer trained from-scratch BLEU, the target.)

A Appendix for "HAT: Hardware-Aware Transformers for Efficient Natural Language Processing"

A.1 SubTransformer Performance Proxy
In Figure 11, we show the relationship between the validation loss of SubTransformers directly inherited from the SuperTransformer, and the BLEU score of the SubTransformers trained from scratch. We can observe that the larger the validation loss, the lower the BLEU score. Therefore, the validation loss can be a good performance proxy.
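One way to quantify how well a proxy preserves the performance order is a rank correlation such as Kendall's tau. The sketch below is an illustration only: the paper does not state that it uses this statistic, and the loss and BLEU values are invented placeholders rather than results from Figure 11.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical values for illustration only (not taken from the paper).
inherited_val_loss = [4.6, 4.4, 4.3, 4.1]
from_scratch_bleu = [25.8, 26.9, 27.6, 28.1]
tau = kendall_tau(inherited_val_loss, from_scratch_bleu)
```

A value near −1 means that a lower inherited validation loss reliably corresponds to a higher from-scratch BLEU, which is the property the search relies on.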

A.2 Visualizations of Searched Models on WMT'14 En-De Task
We show the HAT models searched for the Raspberry Pi ARM Cortex-A72 CPU and the Nvidia TITAN Xp GPU in Figure 12. The searched model for Raspberry Pi is deep and thin, while that for GPU is shallow and wide. The BLEU scores of the two models are similar: 28.10 for the Raspberry Pi CPU and 28.15 for the Nvidia GPU.

A.3 Latency, BLEU and SacreBLEU of searched HAT models
In Table 8, we show the specific latency numbers, BLEU and SacreBLEU (Post, 2018) scores for searched HAT models in Figure 7 and Figure 8.

Figure 2: Latency of different Transformer models on different hardware. We find (1) FLOPs does not reflect the real measured latency; (2) Latency influencing factors of different hardware are contrasting. Thus we need to consider hardware latency feedback to design specialized models for different hardware.

Figure 4: Arbitrary Encoder-Decoder Attention. Each encoder-decoder attention in one decoder layer can attend to the outputs from multiple encoder layers, fully leveraging the features extracted by the encoder.

Figure 5: Weight Sharing of the SuperTransformer. All SubTransformers share the front portion of word embeddings, and weights in the fully-connected layers.

Figure 6: The latency predictor is very accurate, with an average prediction error (RMSE) of 0.1s.

Figure 7: Inference latency and BLEU trade-offs of WMT'14 En-De and En-Fr on three hardware platforms. HAT consistently outperforms the baseline Transformers and achieves up to 3× faster inference speed and 3.7× smaller size over Transformer-Big. Specific latency, BLEU and SacreBLEU (Post, 2018) are in Appendix Table 8.

Figure 12: SubTransformers optimized for Raspberry Pi ARM CPU and Nvidia GPU on WMT'14 En-De task are different. The CPU model has BLEU 28.10, and the GPU model has BLEU 28.15.

Table 1: BLEU score and measured inference latency of HAT on WMT'14 En-De task. The efficient model for GPU is not efficient for ARM CPU and vice versa.

HAT achieves up to 3× speedup and 3.7× smaller size over Transformer-Big without loss of accuracy. With 12,041× less search cost, HAT outperforms the Evolved Transformer with 2.7× speedup and 3.6× smaller size. It also achieves up to 1.9× speedup over Levenshtein and Lite Transformers with no BLEU score loss. With 4-bit quantization, HAT can further reach 25× model size reduction.

Perform an evolutionary search with a hardware latency constraint to find the model with the lowest validation loss. (5) Finally, the searched model is trained from scratch to get the final performance.

Table 2: Comparisons of latency, model size, FLOPs, BLEU and training cost in terms of CO2 emissions (lbs) and cloud computing cost (USD) for Transformer, the Evolved Transformer and HAT. The training cost estimation is adapted from Strubell et al. (2019). The training time is for one Nvidia V100 GPU, and the latency is measured on the Raspberry Pi ARM CPU. The cloud computing cost is based on AWS.

Table 3: Raspberry Pi ARM CPU latency and BLEU comparisons with different models on WMT'14 En-De. HAT has the lowest latency with the highest BLEU.

Table 4: The searched HAT compared with the largest SubTransformer in the design space. Larger models do not necessarily have better performance. HAT models have lower latency, smaller size, and higher BLEU.


Table 8: Specific latency numbers, BLEU and SacreBLEU scores for searched HAT models in Figure 7 and Figure 8.