Scale down Transformer by Grouping Features for a Lightweight Character-level Language Model

This paper introduces a method that efficiently reduces the computational cost and parameter size of Transformer. The proposed model, referred to as Group-Transformer, splits the feature space into multiple groups, factorizes the calculation paths, and reduces the computation for group interaction. Extensive experiments on two benchmark tasks, enwik8 and text8, prove our model's effectiveness and efficiency in small-scale Transformers. To the best of our knowledge, Group-Transformer is the first attempt to design a Transformer with the group strategy, which is widely used for efficient CNN architectures.


Introduction
Character-level language modeling has become a core task in natural language processing (NLP), supporting applications such as classification (Zhang et al., 2015), sequence tagging (Guo et al., 2019a), question answering (He and Golub, 2016), and scene text recognition (Baek et al., 2019; Hwang and Sung, 2016), owing to its simplicity in generating text and its adaptability to other languages. Most previous approaches relied on recurrent neural networks (RNNs), but these suffer from high learning complexity caused by inherently long character sequences. Recently, Transformer (Vaswani et al., 2017) has shown promise in addressing this problem and has become a standard choice in general language modeling (Al-Rfou et al., 2019; Dai et al., 2019).
Transformers have achieved higher performance but have also grown in size through deeper and wider networks. TransformerXL (Dai et al., 2019) and GPT-2 (Radford et al., 2019), for instance, contain 277M and 1542M parameters, respectively. However, this trend toward large models is not suitable for edge-device applications that require a small memory footprint and fast real-time responsiveness, such as auto-correction and auto-completion (Gong et al., 2019). Contrary to the recent trend, character-level language models need to be scaled down while minimizing the performance degradation caused by the loss of capacity.
A simple way to obtain a lightweight Transformer is to reduce its width and depth, which directly determine the model complexity. However, reducing the width sacrifices the representation power of a high-dimensional feature space, and reducing the depth lowers the capacity to stack diverse dependencies between local information. To compensate for these losses, knowledge distillation (Sun et al., 2019) optimizes a scaled-down model with a large teacher network, and weight sharing (Bai et al., 2019) stacks a unified layer multiple times. These methods have shown promising results, but they still require a scaled-down model as a target model, or a unified layer for iterative use.
In this paper, we introduce a lightweight Transformer, referred to as Group-Transformer, that lowers model complexity without any modification to width and depth. Our method utilizes group-wise operations, inspired by the group convolution approaches (Zhang et al., 2018; Sandler et al., 2018) that have effectively compressed huge image processing models. The basic concept of group convolution is to partition the feature maps into multiple groups and process them individually, rather than fully connecting them. Figure 1 shows a brief overview of our proposed model utilizing the group strategy. Replacing all fully connected operations with group-wise operations reduces the model complexity, since only the intra-group connections remain.

Figure 1: Feature connections of Transformer and Group-Transformer for a single time step. All connections are categorized into "intra-group" and "inter-group", and Group-Transformer directly reduces computational complexity in "inter-group" connections.
Besides, Group-Transformer employs lightweight inter-group operations to compensate for the loss of inter-group correlations. The mutually exclusive calculation of the group strategy compromises performance, but modeling the interactions for all group pairs might be over-parameterized. Our inter-group operations share a common feature over groups in the attention layers and utilize a low-rank approximation in the feed-forward layers to model the inter-group information flow with few calculations.
We conducted extensive experiments on two benchmark datasets, enwik8 and text8, and found that Group-Transformer shows better performance than Transformers with a comparable number of parameters under 10M. Furthermore, when scaling down Transformer, Group-Transformer shows promising results compared to other scale-down methods. We provide further analysis to identify the contributions of our proposed modules in detail. To the best of our knowledge, Group-Transformer is the first attempt to build a lightweight Transformer with the group strategy.

Towards a Lightweight Transformer
Since Transformer has become a promising model for diverse NLP tasks, there have been attempts to improve its architectural efficiency, following two major approaches. The first is to restrict the dependencies between input tokens to reduce superfluous pair-wise calculations (Guo et al., 2019b; Sukhbaatar et al., 2019a). This approach provides time efficiency during inference, but it does not address the heavy parameterization of Transformer. The second approach is to develop a lightweight Transformer architecture while maintaining its properties. For example, (Sukhbaatar et al., 2019b) combined the multi-head attention and the position-wise feed-forward layer into a unified module with fewer parameters. Although the unified layer shows promising improvements, it still keeps the bottleneck property of the position-wise feed-forward layer; thus, its benefit can be marginal in small-size Transformer settings. (Tay et al., 2019) utilize quaternion algebra to build lightweight modules for Transformer. They also factorize the components of the embedding layer, but the expressive power can be limited by the connection of factorized components based on the quaternion principle. Our proposed model falls into the category of lightweight Transformer architectures. By adjusting the number of groups, Group-Transformer decreases the number of feature connections instead of tuning the size of the feature dimension.

Towards Lightweight Neural Networks
Building lightweight neural networks has attracted much attention for compressing large and deep state-of-the-art networks. One major approach utilizes a large pre-trained model to obtain a small variant. Network pruning and quantization (Han et al., 2015) directly compress the parameters identified in pre-trained models. Knowledge distillation (Hinton et al., 2015) transfers knowledge from a large-scale network to a small network. These approaches compress a pre-trained model effectively, but they still require a small student network. Another approach designs a lightweight network architecture with fewer parameters and calculations from the start. Low-rank approximation (Novikov et al., 2015) decomposes a big transition matrix into multiple small matrices. Group convolution (Krizhevsky et al., 2012; Zhang et al., 2018) factorizes feature spaces and processes them individually. Inspired by these two methods, Group-Transformer partitions the feature space, splits all feature connections into "intra-group" and "inter-group" connections, and conducts fewer calculations for the "inter-group" connections than the original Transformer.
The group strategy for NLP tasks has been investigated, but not on Transformers. (Kuchaiev and Ginsburg, 2017) proposed a group-wise RNN as a special form of ensembled RNNs. However, they did not consider the interactions between different groups. (Gao et al., 2018) combined the idea of ShuffleNet with the group-wise RNN and achieved promising results on language modeling as well as machine translation. In this work, we adopt the group strategy and build new inter-group operations suitable for the Transformer architecture.

Group-Transformer

Figure 2a shows the overall architecture of Group-Transformer. It consists of a group embedding (bottom grey box), which embeds a character into grouped features; group attention (yellow box), which contains attention modules to identify dependencies in the time domain; and a group feed-forward layer (green box), which re-configures the grouped features. When an input character is given, Group-Transformer converts the input into multiple group representations (blue dots and red dots), then processes and merges them to predict the next character. Figures 2b and 2c show the group-wise information flows (blue and red arrows) and the inter-group information flows (grey arrows) in the sub-modules. Without the inter-group information flows, the grouped features would be processed independently. We observed that inter-group modeling ensures that the groups become aware of one another and prevents different groups from holding the same information. The following subsections describe the architectural details of the sub-modules and their relations. For simplicity, we describe the processes for a single time step.

Group Embedding Layer
The group embedding layer identifies a set of embeddings to represent a token. The idea of representing a sentence, word, or even character using a set of vectors can widely be found in NLP models that embed input tokens by concatenating (or summing) their embeddings and their sub-units' embeddings (Verwimp et al., 2017; Bojanowski et al., 2017; Zhou et al., 2019). Similarly, we assume a single character $c$ to be represented with $G$ vector representations of groups, that is, $[u_{c1}, \cdots, u_{cG}]$ where $u_{cg} \in \mathbb{R}^{D_{group}}$, $1 \leq g \leq G$. When a character is given, the group embedding layer retrieves the corresponding set of vectors and passes it to the following group attention layer.
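As a minimal sketch (in NumPy, with hypothetical vocabulary and dimension sizes), the group embedding layer can be realized as one lookup table per group instead of a single $D_{model}$-wide table:

```python
import numpy as np

# Group embedding layer (NumPy sketch; vocabulary and dimension sizes are
# hypothetical): one lookup table per group instead of one D_model-wide table.
rng = np.random.default_rng(0)
VOCAB, G, D_GROUP = 204, 4, 64  # enwik8-like character vocabulary, 4 groups

tables = [rng.normal(size=(VOCAB, D_GROUP)) for _ in range(G)]

def group_embed(char_id):
    """Return the set [u_c1, ..., u_cG] of group embeddings for a character."""
    return [table[char_id] for table in tables]

vectors = group_embed(42)
```

Each character thus retrieves $G$ vectors of dimension $D_{group}$, which together play the role of the usual $D_{model}$-dimensional embedding.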

Group Attention
The attention mechanism identifies dependencies between features in the time domain and combines their information. It contains three steps: (1) identifying queries, keys, and values; (2) retrieving related features at different time steps; and (3) transforming the attended feature back into the input domain (Vaswani et al., 2017). The main focus of this paper is to apply a group strategy to the feature space of Transformer. Thus, we keep the second step identical to that of the original Transformer and focus on the first and third steps. Figure 2b illustrates the architecture of group attention. The multi-head attention module represents the second step, the operations below it identify the queries for the first step, and the operations above it transform the attention output for the third step. Group attention processes the grouped features with intra-group operations (white boxes) and inter-group operations (grey boxes).

Grouped Queries, Keys and Values
Let $[x_1, \cdots, x_G]$ be a set of input vectors where $x_g \in \mathbb{R}^{D_{group}}$ for the group $g$. Since the multi-head attention contains $H_{group}$ attention heads for a single group, group attention first calculates the query $q_{gh}$ for a group $g$ and its head $h$ as

$$q_{gh} = x_g W^{q\text{-}intra}_{gh} + \Big(\frac{1}{G}\sum_{g'=1}^{G} x_{g'}\Big) W^{q\text{-}inter}_{gh},$$

where $W^{q\text{-}intra}_{gh} \in \mathbb{R}^{D_{group} \times D_{head}}$ and $W^{q\text{-}inter}_{gh} \in \mathbb{R}^{D_{group} \times D_{head}}$ are linear weights describing the intra-group (white boxes) and inter-group (grey box) combinations, respectively, with $D_{head} = D_{group}/H_{group}$. The first term on the right-hand side identifies a specific feature for the head $h$ in the group $g$, and the second term determines head-wise features shared by all groups. Note that all heads are split into groups; thus, the total number of heads remains unchanged. Compared with a fully connected linear layer over the groups, this approach restricts the connections between the groups, so it requires fewer parameters and calculations.
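A minimal NumPy sketch of the grouped query computation follows; the shapes are hypothetical, and the shared inter-group input is assumed here to be the mean of the grouped features:

```python
import numpy as np

# Grouped query computation (NumPy sketch; shapes are hypothetical, and the
# shared inter-group input is assumed to be the mean of the grouped features).
rng = np.random.default_rng(0)
G, H_GROUP, D_GROUP = 4, 2, 64
D_HEAD = D_GROUP // H_GROUP

x = np.stack([rng.normal(size=D_GROUP) for _ in range(G)])  # grouped inputs
W_q_intra = rng.normal(size=(G, H_GROUP, D_GROUP, D_HEAD))
W_q_inter = rng.normal(size=(G, H_GROUP, D_GROUP, D_HEAD))

x_shared = x.mean(axis=0)  # feature shared by all groups (assumption)

def grouped_query(g, h):
    # intra-group term + head-wise term shared across all groups
    return x[g] @ W_q_intra[g, h] + x_shared @ W_q_inter[g, h]

q = grouped_query(1, 0)
```

Each head only ever multiplies $D_{group}$-dimensional inputs by $D_{group} \times D_{head}$ weights, which is where the parameter saving over a fully connected $D_{model} \times D_{head}$ projection comes from.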
The above decomposition into intra- and inter-group connections can also be applied to identify keys and values. However, we observed dramatic performance drops when applying the group strategy to two or more of the components among query, key, and value (see Section 4.5). The performance drops indicate that heads in a single group require information from other groups. Based on these experimental results, Group-Transformer uses fully connected linear layers to identify keys and values, as the original Transformer does.

Multi-head Attention
The headed elements identified above are used for connecting features in the time domain. In this step, positional encoding (Vaswani et al., 2017) plays an important role in making the features aware of their position in the input sequence. In this paper, we apply relative positional encoding, which describes long character sequences effectively. Following (Dai et al., 2019), we define the attention score map with the relative positional information, and the attention mechanism determines the attended feature $a_{gh}$ of the head $h$ in the group $g$.

Combination of Multiple Heads
The multiple heads $[a_{g1}, \cdots, a_{gH}]$ in the group $g$ are combined as

$$o_g = \sum_{h=1}^{H_{group}} a_{gh} W^{o\text{-}intra}_{gh} + \frac{1}{G}\sum_{g'=1}^{G}\sum_{h=1}^{H_{group}} a_{g'h} W^{o\text{-}inter}_{g'h},$$

where $W^{o\text{-}intra}_{gh} \in \mathbb{R}^{D_{head} \times D_{group}}$ and $W^{o\text{-}inter}_{gh} \in \mathbb{R}^{D_{head} \times D_{group}}$ are linear weights for combining intra-group and inter-group information, respectively. The final output is thus determined by a specific feature from its own group and a shared feature from all groups. These intra-group and inter-group modelings mainly contribute to reducing the number of parameters and calculations. Finally, the input $x_g$ is added to the output $o_g$ as $\hat{x}_g = x_g + o_g$ for a residual connection.
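The combination step can be sketched similarly in NumPy; the shared inter-group term is assumed here to be an average of all groups' head projections, which is one plausible reading of the description above:

```python
import numpy as np

# Combining attended heads into the group output (NumPy sketch; the shared
# inter-group term is assumed to be an average over all groups' projections).
rng = np.random.default_rng(0)
G, H_GROUP, D_GROUP = 4, 2, 64
D_HEAD = D_GROUP // H_GROUP

a = rng.normal(size=(G, H_GROUP, D_HEAD))           # attended features a_gh
W_o_intra = rng.normal(size=(G, H_GROUP, D_HEAD, D_GROUP))
W_o_inter = rng.normal(size=(G, H_GROUP, D_HEAD, D_GROUP))
x = rng.normal(size=(G, D_GROUP))                   # layer inputs x_g

def combine_heads(g):
    intra = sum(a[g, h] @ W_o_intra[g, h] for h in range(H_GROUP))
    shared = sum(a[gp, h] @ W_o_inter[gp, h]
                 for gp in range(G) for h in range(H_GROUP)) / G
    return x[g] + intra + shared  # residual connection: x_hat_g = x_g + o_g

x_hat = combine_heads(0)
```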

Required Resources of Group Attention
The multi-head attention of the original Transformer uses $4 \cdot O(D_{model}^2)$ parameters for the queries, keys, values, and outputs. In comparison, group attention keeps $2 \cdot O(D_{model}^2)$ parameters for the keys and values, but only $2 \cdot O(\frac{2}{G} D_{model}^2)$ parameters for the queries and outputs, where $D_{group} = D_{model}/G$. When the number of groups is 2, the number of group attention parameters is the same as in the original Transformer. However, when the group number increases to 4 or 8, the parameter count decreases to 75% or 62.5% of the original module.
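These counts can be verified with a few lines of arithmetic (weights only, biases ignored):

```python
# Parameter-count check for group attention (weights only, biases ignored).
D = 256  # D_model

def attention_params(G):
    # Keys and values stay fully connected: 2 * D^2 parameters.
    # Queries and outputs use intra- plus inter-group weights: 2 * (2/G) * D^2.
    return 2 * D * D + 2 * (2 / G) * D * D

baseline = 4 * D * D  # original multi-head attention
ratios = {G: attention_params(G) / baseline for G in (2, 4, 8)}
# ratios -> {2: 1.0, 4: 0.75, 8: 0.625}, i.e. 100%, 75%, and 62.5%
```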

Group Feed-forward Layer
The group feed-forward layer re-configures the outputs of the attention module, $\hat{x}_g$, by applying group-wise operations at each position. Figure 2c shows the architecture of the proposed module. As can be seen, the groups are shuffled (grey box) and support each other. As in the original module, the linear layers in our module project the input feature into a high-dimensional space with a non-linear activation and then transform the output back into the input space.
Given $G$ input features $[\hat{x}_1, \cdots, \hat{x}_G]$, the group feed-forward layer projects the grouped features into a high-dimensional space as

$$\bar{y}_g = \max\Big(0,\ \hat{x}_g W^{f1\text{-}intra}_{g} + \sum_{g' \neq g} \hat{x}_{g'} W^{f1\text{-}inter}_{g'g}\Big),$$

where $W^{f1\text{-}intra}_{g} \in \mathbb{R}^{D_{group} \times \tilde{D}_{group}}$ and $W^{f1\text{-}inter}_{g'g} \in \mathbb{R}^{D_{group} \times \tilde{D}_{group}}$ are linear weights mapping intra- and inter-group information into the $\tilde{D}_{group}$-dimensional space, which is relatively larger than $D_{group}$. Here, we introduce a low-rank matrix approximation on the inter-group transformation matrix $W^{f1\text{-}inter}_{g'g}$. Modeling the interactions between groups requires $G \times G$ weights, together with the additional weights projecting group $g'$ into the high-dimensional space of the target group $g$. With fully connected weights over all groups, as in the original Transformer, the feed-forward layer would still hold heavy weights and expensive calculations. To reduce this burden, we factorize the matrix $W^{f1\text{-}inter}_{g'g}$ into two matrices, $W^{f1\text{-}inter[1]}_{g'g} \in \mathbb{R}^{D_{group} \times M}$ and $W^{f1\text{-}inter[2]}_{g'g} \in \mathbb{R}^{M \times \tilde{D}_{group}}$, inspired by (Sainath et al., 2013) and (Novikov et al., 2015). The newly introduced dimension $M$ is smaller than $D_{group}$, and thus the number of parameters and calculations is reduced proportionally to the ratio between $M$ and $D_{group}$. In this paper, we set $M = D_{group}/G$ to control the dimension relative to the number of groups. Interestingly, this matrix factorization can be implemented efficiently with a group-wise linear transformation and a shuffle trick, as shown in Figure 2c.
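A minimal NumPy sketch of the low-rank inter-group expansion follows; the dimension sizes are hypothetical and the non-linearity is assumed to be ReLU:

```python
import numpy as np

# Low-rank inter-group expansion of the group feed-forward layer (NumPy sketch;
# dimensions are hypothetical and the non-linearity is assumed to be ReLU).
rng = np.random.default_rng(0)
G, D_GROUP = 4, 64
D_TILDE = 4 * D_GROUP      # expanded per-group dimension
M = D_GROUP // G           # low-rank bottleneck M = D_group / G

x_hat = rng.normal(size=(G, D_GROUP))
W_f1_intra = rng.normal(size=(G, D_GROUP, D_TILDE))
W1 = rng.normal(size=(G, G, D_GROUP, M))   # W^{f1-inter[1]}, rank-M factor
W2 = rng.normal(size=(G, G, M, D_TILDE))   # W^{f1-inter[2]}, rank-M factor

def expand(g):
    out = x_hat[g] @ W_f1_intra[g]
    for gp in range(G):
        if gp != g:
            # inter-group path through the rank-M bottleneck
            out = out + (x_hat[gp] @ W1[gp, g]) @ W2[gp, g]
    return np.maximum(out, 0.0)

y_bar = expand(0)
```

Because each inter-group path first compresses to $M$ dimensions, the pairwise weights cost $D_{group} M + M \tilde{D}_{group}$ instead of $D_{group} \tilde{D}_{group}$ parameters.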
Finally, a group-wise linear transformation is applied to the high-dimensional feature as

$$y_g = \bar{y}_g W^{f2}_{g},$$

where $W^{f2}_{g} \in \mathbb{R}^{\tilde{D}_{group} \times D_{group}}$ is a linear weight. For a residual connection, each grouped input feature is added to the output of the group feed-forward layer: $\hat{y}_g = \hat{x}_g + y_g$.

Required Resources of Group Feed-forward
The original position-wise feed-forward layer requires $2 \cdot O(D_{model}\tilde{D}_{model})$ parameters, where $\tilde{D}_{model}$ is the inner filter size. In comparison, the group feed-forward layer requires $\frac{3}{G} \cdot O(D_{model}\tilde{D}_{model})$ parameters, where $D_{group} = D_{model}/G$, $\tilde{D}_{group} = \tilde{D}_{model}/G$, and $M = D_{group}/G$. When the number of groups is 2, the group feed-forward layer uses 81% of the parameters of the original Transformer. When increasing the number of groups to 4 or 8, the number of parameters decreases proportionally to 40.6% or 20.3%.
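The stated ratios can be reproduced by counting the weights of the group feed-forward layer; here we assume $\tilde{D}_{model} = 4 D_{model}$ and count the low-rank inter-group factors over all $G \times G$ group pairs:

```python
# Weight count for the group feed-forward layer, assuming D_tilde = 4 * D and
# low-rank inter-group factors counted over all G x G group pairs.
D = 256            # D_model
D_TILDE = 4 * D    # inner filter size

def ff_params(G):
    d_g, dt_g = D // G, D_TILDE // G   # D_group, D_tilde_group
    m = d_g // G                        # low-rank dimension M = D_group / G
    intra_f1 = G * d_g * dt_g                 # W^{f1-intra}, one per group
    inter_f1 = G * G * (d_g * m + m * dt_g)   # factorized W^{f1-inter} pairs
    f2 = G * dt_g * d_g                       # W^{f2}, one per group
    return intra_f1 + inter_f1 + f2

baseline = 2 * D * D_TILDE  # original position-wise feed-forward layer
ratios = {G: ff_params(G) / baseline for G in (2, 4, 8)}
# ratios -> approximately {2: 0.8125, 4: 0.40625, 8: 0.2031}
```

Under these assumptions, the ratios work out to roughly 81%, 40.6%, and 20.3%, matching the figures quoted above.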

Dataset and Experimental Settings
We demonstrate the efficiency of the proposed Group-Transformer on two popular benchmark datasets, enwik8 and text8. The enwik8 dataset contains 100MB of English Wikipedia text with 204 unique characters, including alphabets, non-Latin characters, and special characters. In comparison, the text8 dataset provides 100MB of pre-processed text with only 27 unique characters, obtained by filtering superfluous content such as tables, citations, and punctuation, and by replacing the non-Latin characters with spelled-out equivalents (e.g., "15" becomes "one five"). For a fair comparison with previous works, we used the training/dev/test splits defined by (Mahoney, 2011) for both enwik8 and text8.
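The digit spell-out can be illustrated with a toy snippet; this is not the actual text8 preprocessing script (Mahoney, 2011), which applies several more rules, but it shows the character-level effect of this one:

```python
import re

# Toy illustration of the text8-style digit spell-out; the real preprocessing
# pipeline does more than this single rule.
DIGITS = {str(i): w for i, w in enumerate(
    ["zero", "one", "two", "three", "four",
     "five", "six", "seven", "eight", "nine"])}

def spell_out_digits(text):
    """Replace every digit with its spelled-out word, e.g. "15" -> "one five"."""
    text = re.sub(r"\d", lambda m: f" {DIGITS[m.group()]} ", text)
    return re.sub(r"\s+", " ", text).strip()
```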
Most experimental settings follow those of (Dai et al., 2019); the differences lie in the hyper-parameters that influence the size of the model. For regularization, we applied layer normalization (Ba et al., 2016) independently over groups, and dropout with probability p = 0.1 on the outputs of the group attention and the group feed-forward layer. The length of the input sequence was 512, with a cached 512-length memory of the previous sequence (Dai et al., 2019). We used the Adam optimizer with a learning rate of 2.5e-4, $\beta_1 = 0.9$, $\beta_2 = 0.999$, a batch size of 22, and 400,000 iterations, selecting the best model on the given validation set. The code and settings can be found at https://github.com/clovaai/group-transformer.

Scale-down Transformers by Adjusting Hyper-parameters
The number of groups can be interpreted as a hyper-parameter affecting the model size. Figure 3 shows the effect of three hyper-parameters: the number of layers, the size of the hidden dimension, and the number of groups. The default model is TransformerXL (Dai et al., 2019) with $L = 9$, $H_{model} = 8$, $D_{model} = 256$, and $\tilde{D}_{model} = 4 D_{model}$, from which we reduced the three hyper-parameters. Note that Group-Transformer splits the multi-heads into groups; thus, each group holds $H_{group} = H_{model}/G$ attention heads without any change in the total number of heads. Making the model thinner or shallower worsens its performance but lowers the required resources. Compared with these two reduction methods, the group strategy shows better performance than models requiring comparable resources. This experiment shows that feature grouping, the main idea of this paper, is more efficient for reducing model size and time complexity than tuning the other model hyper-parameters. It should be noted again that all models have the same total number of heads, because the group strategy binds the multi-heads into groups.

Table 1 shows the module-wise impact on the number of parameters and on performance. For a fair comparison, we set the baseline model to a reduced TransformerXL (Dai et al., 2019) with fewer than 8M parameters, and gradually reduced the model size by selectively replacing the attention and the feed-forward layer with the corresponding Group-Transformer modules. When replacing the feed-forward layer, we observe that the number of parameters decreases more in all cases than when replacing the attention module. Interestingly, when replacing both modules in the 2- and 4-group cases, the performance degradation is less than the sum of the individual performance losses, while the overall resource reduction is larger. For instance, the individual performance drops in the 4-group case are 0.023 bpc from the attention module and 0.027 bpc from the feed-forward layer, but their combination shows only a 0.037 bpc drop, less than the sum of the gaps. This result demonstrates the efficiency of using both group-wise modules concurrently.

Table 2: Ablation study in modeling query, key, and value with our group operations.

Ablation Study on Group-Transformer Modules
The group attention module involves identifying the attention elements (query, key, and value), conducting the attention mechanism, and configuring the output from the attended features. Although the group strategy can be applied to all three elements fed into the multi-head attention, doing so can isolate the information in each group. Table 2 investigates the effectiveness of the group strategy on the three elements. As can be seen, applying the group strategy to a single element yields similar performance in terms of both resources and accuracy. However, when applied to two or more elements, the results show only a marginal benefit in parameter size and dramatic performance drops for all group numbers. Based on this experiment, we choose the query as the only target of the group strategy among the three attention elements.

Ablation Study on Inter-group Operations
Here, we investigate the influence of the inter-group operations in our model. When the inter-group operations (grey boxes in Figures 2b and 2c) are removed, we observed performance degradations of 0.028 bpc for the 2-group Transformer and 0.051 bpc for the 4-group Transformer. These gaps are relatively large compared to the performance gap between TransformerXL and the Group-Transformers in Table 1. The results re-emphasize the importance of inter-group modeling in Group-Transformer. Figure 4 shows the similarity patterns between the multi-head attentions of our models and of the ablation models without the inter-group operations. As can be seen, the multi-head attention maps from the model without inter-group operations show high similarities among different groups, while the proposed model shows the opposite. These similarity patterns imply that, without the proposed inter-group operations, the model cannot fully exploit multi-head attention, which is designed to attend to multiple positions of the content.

Ablation Study on the Number of Groups

The number of groups in Group-Transformer is closely related to the parameter size. By increasing the group number, the model can be scaled down under the same hidden dimension. To identify the best number of groups, we set the maximum number of heads to 8 and compared the 1-, 2-, 4-, and 8-group cases. For a fair comparison, all models have the same number of layers (9), and the hidden size was adjusted so that each holds around 4.2M parameters. As can be seen in Table 3, the 1-group model is identical to the TransformerXL model, since it contains no inter-group operations and its hidden features are not split into groups. Consistent with the earlier results, the 2-, 4-, and 8-group models show better performance than the 1-group model. Interestingly, although the 8-group model has a wider hidden dimension than the others, it performs worse than the 2- and 4-group models. The 4-group model turns out to be the best performer at around the 4.2M parameter size.

Comparison Against Prior Character-level Language Models
We compare Group-Transformer against existing character-level language models with under 50M parameters in Table 4. Although the number of embedding vectors for characters is much lower than for word-level embeddings (Sukhbaatar et al., 2019a), most previous models have used more than 10M parameters. Recently, reported Transformer models have achieved under 1.2 bpc on enwik8 and text8, but models under 10M parameters have not been well explored. When developed with more than 40M parameters, Group-Transformer fails to outperform the others, even though it is wider and deeper than the prior works. However, at 8M and 4M parameters, the Group-Transformers outperform the scaled-down Transformers with 9 and 6 layers. The results indicate that the group strategy is superior for modeling a lightweight Transformer, as it retains a high-dimensional feature space under the same parameter size.

Extension to Word-level Language Modeling
The proposed method focuses on character-level language models, but it can be applied to other NLP tasks that use the Transformer architecture. For word-level language modeling, compressing the word embedding layer becomes the most important part of designing a lightweight language model, more so than the other layers of the Transformer. Therefore, we set the embedding dimension to 500 and adjusted the number of layers and the hidden dimension to obtain models with the same embedding parameters (134M) and model parameters (4.5M). For the bottleneck layer in the position-wise feed-forward layer, we used a dimension 4 times larger than each hidden dimension. Table 5 compares TransformerXL and the Group-Transformers; in all settings, the Group-Transformers show better performance than the baselines.

Table 5: Comparison with the prior word-level language models on wikitext-103. We report perplexity (ppl) on the test set as well as the number of parameters. Params* indicates the number of parameters excluding word embeddings.