Highway Transformer: Self-Gating Enhanced Self-Attentive Networks

Self-attention mechanisms have achieved striking state-of-the-art (SOTA) results on various sequence learning tasks, building on multi-headed dot product attention that attends to all global contexts at different locations. Through a pseudo information highway, we introduce a gated component, the self-dependency unit (SDU), that incorporates LSTM-styled gating to replenish internal semantic importance within the multi-dimensional latent space of individual representations. The subsidiary content-based SDU gates allow the modulated latent embeddings to flow through skip connections, yielding a clear margin of convergence speed under gradient descent algorithms. We further probe the role of the gating mechanism in aiding context-based Transformer modules, hypothesizing that SDU gates, especially on shallow layers, push the optimization faster towards suboptimal points.

Given the great promise of deep neural networks in language and vision, Transformer capitalizes on stacked multi-headed self-attention on top of the conventional encoder-decoder architecture in a sequence-to-sequence (seq2seq) manner, learning global soft alignments without an explicit recurrence mechanism. Multi-head dot product attention (MHDPA) not only underpins the parallel training of multiple heads but also captures long-term dependencies across arbitrarily long distances within the same context. The separated heads independently draw sub-level attentions within latent semantic sub-spaces of a fixed dimension, where different heads are presumed to implicitly signal different aspects of meaning (Vaswani et al., 2017). Additionally, residual connections between layers allow deep tandem stacks of identical modules by mitigating the degradation problem during training (He et al., 2016). Transformer architectures have thus displaced Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), as the model of choice for learning sequential data.
Recently, there have been plenty of works contending that gating mechanisms could play a vital role in, or even entirely substitute for, RNNs or Transformers in modeling language sequences. Dauphin et al. (2017) first claimed that non-recurrent networks can be highly competitive with conventional RNN-dominated models in LM. They proposed hierarchical gated temporal convolutional neural networks (CNNs) with Gated Linear Units (GLU) to replace the recurrent connections of RNNs and achieved strong performance with faster training. Gehring et al. (2017) integrated absolute positional embeddings, multi-step attention, GLU, and residual connections into entirely convolutional models that outperform strong LSTM models on NMT and abstractive summarization tasks. Wu et al. (2019) applied dynamic convolutions with shared depth-wise softmax-normalized filters on GLU-regulated inputs within a fixed receptive field rather than global contexts, challenging the common self-attention-dominated intuition.
However, all of the models mentioned above adopt stacked CNNs rather than self-attention networks (SANs) to attend to global contexts. It is well known that CNNs are good at learning local-region features rather than long-term dependencies, while SANs are adept at attending to global dependencies. Context-based self-attention captures the importance of relative relations within a valid context and is thus location-unaware. It focuses on the pairwise attention distributions between any two words but ignores the fundamental importance of feature-wise information.
Intuitively, to comprehend reading materials better, people need to consider not only the global contextual dependency but also the meanings of individual words. Based on this, we apply self-gating to Transformer blocks for seq2seq modeling, combining gating units, skip connections, and Transformers to jointly account for both the inner feature-wise importance and the relation-aware content-based attention distribution.
We adopt the self-dependency gating approach to intrinsically draw a binary importance ratio over the input itself and decide how much information of each feature to retain or remove. Our key contributions are:
• to illustrate that our self-dependency units on shallow Transformer layers can expedite convergence during both training and validation without hyperparameter tuning;
• to support the claim that Transformer layers at different depths attend to information of different aspects, wherein bottom layers focus on local-range encodings, substantiating the argument that the bottom layers of SANs learn more from local contexts (Yang et al., 2018);
• to empirically show that self-gating mechanisms are complementary to the recurrence mechanisms in R-Transformer and Transformer-XL.

Preliminaries
This section briefly introduces the background of Transformer and Highway Networks. SANs have been dominant in most SOTA sequence learning models, whose basic components consist of stacked Transformer modules. We conduct comparison experiments on the Transformer and two of its variants, Transformer-XL (Dai et al., 2019) and R-Transformer (Wang et al., 2019).

Multi-head Dot Product Attention
Scaled dot product attention (DPA) (Vaswani et al., 2017) computes global attention weights between pairs within the context across arbitrarily long distances, allowing simultaneous training and memory savings while avoiding the drawbacks of the sequential dependency of RNNs.
Given the input word representation X ∈ R^{L×dh}, where L is the sequence length, d is the input dimension of each head and h is the number of attention heads, DPA uses linear projections to acquire the query Q, key K and value V. Denoting the split input for the i-th head as X_i ∈ R^{L×d}, where i ∈ {1, · · · , h}, single-head self-attention can be calculated as:

Q_i = X_i W_i^Q,  K_i = X_i W_i^K,  V_i = X_i W_i^V
Attn(Q_i, K_i, V_i) = softmax( Q_i K_i^T / √d ) V_i

where 1/√d is a scaling factor to prevent the effect of large values. In LM tasks, the attention weights before the softmax function are masked so that each position only attends to history positions. MHDPA (Fig 1a) linearly projects the single DPA into h heads and performs the attention operation in parallel, jointly learning different semantic meanings in different subspaces (Vaswani et al., 2017). MHDPA can be calculated as:

MHDPA(X) = [ head_1 ∘ head_2 ∘ · · · ∘ head_h ] W^o,  head_i = Attn(Q_i, K_i, V_i)

where ∘ denotes the concatenation of h different heads and W^o ∈ R^{dh×dh} is a trainable weight.
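As an illustrative aid (not the authors' code), the masked DPA and multi-head concatenation described above can be sketched in NumPy; all function and variable names here are our own:

```python
import numpy as np

def dpa(Q, K, V, causal=True):
    """Scaled dot-product attention with an optional causal (LM) mask."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (L, L) pairwise scores
    if causal:
        # mask future positions so each step attends to history only
        scores = np.where(np.tril(np.ones_like(scores)) == 1, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                 # row-wise softmax
    return w @ V                                       # (L, d)

def mhdpa(X, Wq, Wk, Wv, Wo):
    """Wq/Wk/Wv: lists of h per-head (d, d) projections; Wo: (d*h, d*h)."""
    h = len(Wq)
    Xs = np.split(X, h, axis=-1)                       # per-head inputs X_i in R^{L x d}
    heads = [dpa(Xi @ q, Xi @ k, Xi @ v)
             for Xi, q, k, v in zip(Xs, Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo         # concatenate heads, then project
```

With the causal mask, the first position can only attend to itself, so the first output row of `dpa` equals the first row of V.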

Transformer
Absolute Positional Encoding Transformer applies a sinusoidal timing signal as the absolute positional encoding (PE) and directly adds the dense word embeddings E ∈ R^{L×dh} element-wise to the PE before feeding them into Transformer modules:

PE(pos, 2i) = sin( pos / 10000^{2i/d} )  (4)
PE(pos, 2i+1) = cos( pos / 10000^{2i/d} )  (5)

where 'pos' indicates the position in the sequence and i denotes the index along the embedding dimension.
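For concreteness, the sinusoidal PE can be generated with a few lines of NumPy (a sketch assuming an even embedding dimension d; the function name is ours):

```python
import numpy as np

def sinusoidal_pe(L, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(L)[:, None]                 # (L, 1) positions
    two_i = np.arange(0, d, 2)[None, :]         # even dimension indices 2i
    angle = pos / np.power(10000.0, two_i / d)  # (L, d/2) phase arguments
    pe = np.zeros((L, d))
    pe[:, 0::2] = np.sin(angle)                 # even dims: sine
    pe[:, 1::2] = np.cos(angle)                 # odd dims: cosine
    return pe
```

At position 0 the encoding alternates [0, 1, 0, 1, ...], since sin(0) = 0 and cos(0) = 1.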
Given input representations X, a Transformer component with a subsequent Layer Normalization (LN) is:

A = MHDPA(X)  (6)
U = LN(X + A)  (7)
F = FFN(U) = FF_2( ReLU( FF_1(U) ) )  (8)
O = LN(U + F)  (9)

where Eq. 8 is the position-wise feed-forward network (FFN) and O ∈ R^{L×dh} is the output of the Transformer layer. FF denotes a feed-forward fully-connected layer, and ReLU is used as the non-linear activation function.

Transformer-XL
Transformer-XL (Dai et al., 2019) injected relative PE and segment-level recurrence to provide historical information for LM tasks.
Relative Positional Encoding Transformer-XL decomposes the dot product of MHDPA, merges terms with similar positional-bias meanings, and reduces the trainable weights carrying global positional semantics. It incorporates partially trainable parameters with a relative sinusoidal PE inside the MHDPA operation. The relative attention score A^rel of Transformer-XL is:

A^rel_{i,j} = X_i^T W_q^T W_{k,E} X_j  (a, 10)
            + X_i^T W_q^T W_{k,R} R_{i−j}  (b, 11)
            + u^T W_{k,E} X_j  (c, 12)
            + v^T W_{k,R} R_{i−j}  (d, 13)

where W_{k,R} ∈ R^{d×d} and {u, v} ∈ R^d are trainable parameters. For any two positions i, j in the segment, R_{i−j} is the sinusoidal encoding for the relative distance i − j. The terms a, b, c, d in Eq. 10, 11, 12, 13 represent content-based addressing, the content-dependent positional bias, the global content bias between different positions, and the global positional bias, respectively.
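A naive NumPy sketch of this four-term decomposition follows; the per-position loop and the symmetric indexing R[|i−j|] are simplifications for illustration, not the paper's optimized computation:

```python
import numpy as np

def rel_attention_scores(X, Wq, WkE, WkR, R, u, v):
    """Sum of terms (a)-(d): content addressing (a), content-dependent
    positional bias (b), global content bias (c), global positional bias (d)."""
    L, d = X.shape
    q = X @ Wq                                   # queries, (L, d)
    kE = X @ WkE                                 # content keys, (L, d)
    a = q @ kE.T                                 # (a) content-based addressing
    c = np.tile(u @ kE.T, (L, 1))                # (c) depends on position j only
    b = np.zeros((L, L))
    d_term = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            r = WkR @ R[abs(i - j)]              # W_{k,R} R_{i-j} (|i-j| for simplicity)
            b[i, j] = q[i] @ r                   # (b) content-dependent positional bias
            d_term[i, j] = v @ r                 # (d) global positional bias
    return a + b + c + d_term
```

Setting W_{k,R}, u, and v to zero recovers the plain content-based scores, term (a) alone.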

Segment-level Recurrence
In Transformer-XL, the hidden states from the previous segment are cached and reused to inject history information and attend to contexts beyond a fixed length through multi-layer stacks.
The MHDPA is computed as:

M^{n−1}_τ = [ SG(X^{n−1}_{τ−1}) ∘ X^{n−1}_τ ]
Q = X^{n−1}_τ W_q,  K = M^{n−1}_τ W_k,  V = M^{n−1}_τ W_v

wherein the key and value are computed from M^{n−1}_τ, which concatenates the previous memory X^{n−1}_{τ−1} with the current segment inputs X^{n−1}_τ for the τ-th segment in the n-th layer, and SG(·) denotes stop-gradient, i.e., no backpropagation through the tensor.
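A hedged single-head NumPy sketch of this memory-augmented attention: queries come from the current segment only, while keys and values span the concatenated memory; NumPy's lack of autograd stands in for SG(·) here:

```python
import numpy as np

def attend_with_memory(X_cur, X_prev, Wq, Wk, Wv):
    """Queries from the current segment; keys/values over [SG(prev); cur]."""
    M = np.concatenate([X_prev, X_cur], axis=0)  # plays the role of M^{n-1}_tau
    d = X_cur.shape[-1]
    Q, K, V = X_cur @ Wq, M @ Wk, M @ Wv
    scores = Q @ K.T / np.sqrt(d)                # (L, Lm + L)
    Lm = X_prev.shape[0]
    # causal mask: current position i may see all memory plus current j <= i
    mask = np.tril(np.ones((X_cur.shape[0], M.shape[0])), k=Lm)
    scores = np.where(mask == 1, scores, -1e9)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                 # (L, d)
```

Because of the causal mask, perturbing a later position of the current segment leaves the output at earlier positions unchanged.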

R-Transformer
R-Transformer (Wang et al., 2019) employs short-range RNNs, termed localRNNs, to capture positional information without explicit PEs. localRNNs apply recurrent connections within a local context and shift right by one position at each time step. This can be seen as applying RNN cells, such as LSTM, over the same receptive fields as convolutional filters along the sequence direction.
None of the above Transformer models explicitly considers the essential feature-wise information. We augment the Transformer blocks of the above models with several gated units and empirically illustrate the effectiveness of gating units for convergence acceleration.

Highway Networks
Let us define non-linear transforms H, T and C; a Highway Network (Srivastava et al., 2015) is then defined as:

y = H(x) ⊙ T(x) + x ⊙ C(x)

where T(·) and C(·) denote the transform and carry gates that control the input transformation, and ⊙ denotes the Hadamard product. In practice the gates are commonly coupled as C(·) = 1 − T(·).
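A minimal highway-layer sketch in NumPy, using the common coupling C = 1 − T (the function name is ours):

```python
import numpy as np

def highway_layer(x, Wh, bh, Wt, bt):
    """y = H(x) * T(x) + x * C(x), with the coupled carry gate C = 1 - T."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    H = np.tanh(x @ Wh + bh)          # non-linear transform H(x)
    T = sigmoid(x @ Wt + bt)          # transform gate T(x)
    return H * T + x * (1.0 - T)      # carry gate C(x) = 1 - T(x)
```

With the gate fully closed (T ≈ 0) the layer copies its input through the carry path; fully open (T ≈ 1) it reduces to the plain transform H(x).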

Self-Dependency Units
Similar to GLU (Dauphin et al., 2017), which uses the inputs as sigmoidal gates, we apply Self-Dependency Units (SDU) that take the full inputs as their own gates and compute the element-wise product upon themselves (Fig 1b).
SDU(X) = T(X) ⊙ f(X),  T(X) = Ψ(X W_1 + b_1),  f(X) = X W_2 + b_2  (21)

where T(X) indicates the transform gate and Ψ is the gate function that confines the linear projection to a fixed range; {W_1, W_2} ∈ R^{d×d} and {b_1, b_2} ∈ R^d are trainable parameters. The element-wise gating function Ψ takes sigmoidal-curve functions to regulate the pointwise weights within a fixed region, which has a side effect of relative normalization. Specifically, we use the sigmoid function σ(x) = 1/(1 + exp(−x)) and its rescaled version tanh(x) = 2σ(2x) − 1, where x ∈ R.
We interpret the tanh function as an update gate that restricts the importance range to (−1, 1), while the σ function resembles the input gate in LSTMs, modulating how much information to retain at the feature-wise level.
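The two SDU gate choices can be sketched together in NumPy (an illustrative reimplementation, not the authors' code):

```python
import numpy as np

def sdu(X, W1, b1, W2, b2, gate="sigmoid"):
    """SDU(X) = Psi(X W1 + b1) * (X W2 + b2): a gate computed from the
    input modulates a linear projection of the same input, element-wise."""
    g = X @ W1 + b1
    psi = np.tanh(g) if gate == "tanh" else 1.0 / (1.0 + np.exp(-g))
    return psi * (X @ W2 + b2)
```

With the tanh gate, the modulation of each feature of f(X) is bounded in (−1, 1); with the σ gate it is bounded in (0, 1), matching the update-/input-gate interpretation above.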

Pseudo-highway Connection
MHDPA computes the multi-headed pairwise attention along the sequence dimension by measuring the distance between words, and it might overlook the fundamental importance of individual features. Rather than replacing MHDPA with the gating and convolution operations of dynamic convolutions (Wu et al., 2019), we simply add a new branch of inputs to enrich the representations of residual-connected MHDPA with augmented gating-modified encodings. The gated units are also added to the FFN modules to provide an additional self-adaptive information flow (Fig 1c).
From another perspective, SDU can be considered a self-dependency non-linear activation function with dynamic adaptation. The self-gating augmented Transformer module is calculated as:

U = LN( X + MHDPA(X) + SDU(X) )
O = LN( U + FFN(U) + SDU(U) )

where U and O represent the intermediate representation and the outputs.
Pseudo-highway Transformer When we take the σ gate as Ψ, we obtain a form similar to highway networks:

X + SDU(X) = X + σ(X W_1 + b_1) ⊙ f(X)
           = σ(X W_1 + b_1) ⊙ ( X + f(X) ) + ( 1 − σ(X W_1 + b_1) ) ⊙ X

where σ(·) can be seen as the transform gate and (1 − σ(·)) as the carry gate. This can thus be regarded as a form of highway network.
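The pseudo-highway rewriting is a pure algebraic identity, which can be checked numerically (with stand-in values for f(X) and the σ gate):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                          # residual input
f = rng.normal(size=(4, 8))                          # stands in for f(X) = X W2 + b2
s = 1.0 / (1.0 + np.exp(-rng.normal(size=(4, 8))))   # sigma gate values in (0, 1)

lhs = x + s * f                             # residual branch plus sigma-gated SDU
rhs = s * (x + f) + (1.0 - s) * x           # highway form: transform gate + carry gate
assert np.allclose(lhs, rhs)
```

The identity holds element-wise for any gate value in (0, 1), which is what licenses reading σ(·) as a transform gate and (1 − σ(·)) as its coupled carry gate.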

Variant Gated Connections
Highway Gate Similar to highway networks (Srivastava et al., 2015), let T(X) be the transform gate and (1 − T(X)) the carry gate; we obtain highway-network-like structures by regulating the encoding f(X) with the transform gate and controlling X with the carry gate:

o(X) = T(X) ⊙ f(X) + ( 1 − T(X) ) ⊙ X  (28)

where Eq. 28 is the element-wise summation of highway networks and o(·) represents the intermediate output.
Gated MHDPA Similar to the previous highway gates, we can apply the carry gate and transform gate to the attention and FFN units respectively:

U = LN( ( 1 − T(X) ) ⊙ MHDPA(X) + T(X) ⊙ f(X) + X )  (30)

Such gates can be regarded as dynamically adjusting the information flow between the feature-wise representations and SANs (Eq. 30).

Experiments and Results
We apply the aforementioned gating to the Transformer variants described in Section 2 on LM tasks and compare both the convergence process and the final performance. For fairness, we apply the SDU components with the same hyperparameters as the original papers 1 . Our code is available 2 .

vs. Transformer / R-Transformer
We first evaluate the gating units on the Penn Treebank (PTB) LM task. The SDU gates are added on Eq. 7, 9 for each Transformer block. All models in this section are trained on a single NVIDIA Titan Xp GPU.

Char-level PTB Hyperparameter and Training
The gated components are evaluated on character-level PTB LM tasks (see Appendix A.1 for hyperparameter settings). The loss and bits per character (bpc) serve as the metrics for the trained models. All models are trained for 100 epochs.
1 Some baseline results are slightly lower than those reported in the original papers using the code obtained from the authors, but are within the limits of experimental error and variance.
2 https://github.com/cyk1337/Highway-Transformer

Results of Transformer As shown in Table 1, all the gating-enhanced models conspicuously surpass the baseline in loss and perplexity on both the training and validation sets, revealing the positive influence of self-gating units in supporting Transformer blocks. Furthermore, Fig. 2 plots the corresponding convergence curves.

Results of RT It can be seen in Fig. 3 that supplementing SDUs increases the convergence speed of both training and evaluation, strengthening our previous claim. As for the final perplexity on the test set, σ-gate SDUs achieve better results than the baselines while tanh-gate SDUs perform slightly worse, as shown in Table 2. The influence of σ-gate SDUs might be due to the σ function compressing the input into dense non-zero ratios within (0, 1), resulting in a stable variation range.
In contrast, the zero-centered property and possibly zeroed values of tanh may make the corresponding units more prone to premature convergence during training. Besides, σ gates have empirically proved more stable than tanh gates in the follow-up experiments.

Results of Transformer Figure 4 shows a noticeable downward trend in the evaluation performance (i.e., the validation loss and perplexity) of the attention model with tanh and sigmoid functions over the first 30 epochs, again indicating the convergence-acceleration effect of our gated units. Also, σ-gate-enhanced models outperform the baseline on test perplexity, but models with tanh gates reach a plateau prematurely. As for the training curves, Transformers with SDUs show a remarkably sharper fall compared with the baseline model over the whole training period.
Results of RT As in Fig. 5 and

Sub-total
To sum up, gating units empirically expedite the convergence of Transformer blocks thanks to the enrichment of self-regulated features through skip connections. The σ-gate is stable and helps reach the plateau faster without hurting test performance, while the tanh-gate appears task- and data-dependent and can outperform σ-gate SDUs in some circumstances. Our proposed gated units are also complementary to the recurrent connections in RNNs and can boost performance on top of localRNN-encoded representations.
In the following experiments, we check whether it is necessary to apply gates on all layers and probe the effect of the SDU variants (i.e., "highway gate" and "gated MHDPA"). Due to the small size of PTB, we experiment on the larger LM dataset enwik8 and adopt the impressive Transformer-XL, one of the vital variant structures used in XLNet (Yang et al., 2019).

Results of 6-layer Transformer-XL
It is noticeable that Transformer-XL models with different gating variants all outperform the baseline by different margins in terms of both performance and convergence speed, as shown in Table 5. Fig. 6 shows that SDUs benefit convergence and validation performance compared with the baselines. Among them, σ-gate SDUs rank top with a 3.1% improvement in bpc on the dev set, followed by tanh gates, gated MHDPA, and the highway gate with 2.7%, 1.8%, and 1.7% gains, respectively. We attribute these improvements to the augmented refined representations learned by our gated units, which prevent the basic self-attention blocks from considering only contextual dependency. It also illustrates that SDUs do not conflict with the recurrence mechanisms in Transformer-XL.

Ablation Study
6-layer Transformer-XL To probe whether it is necessary to augment every Transformer layer with SDUs, we add gates on layers 1-3, layers 3-6, and layers 1-6, and also remove the gates on the FFN components (denoted "\FFN"), as in Table 5 (see Fig. 8 in Appendix B for detailed convergence curves). We find that supplementing tanh gates on the bottom three layers contributes most to the overall performance, while tanh gates on the top three layers can hinder test-set performance. Low-level Transformer blocks capture information from localness, while top layers usually focus on global long-range dependencies (Yang et al., 2018). Thus gates on bottom layers could aid in learning syntactic and superficial representations to some extent. It also indicates that our gates may be beneficial for encoding low-level fine-granularity representations rather than regulating semantic meaning on high-level layers.
12-layer Transformer-XL The previous experiments are all conducted on shallow models and illustrate the positive effects. To investigate the performance of deeply stacked models, we further extend our trials to a 12-layer Transformer-XL. All hyperparameters are the same as for the 6-layer Transformer-XL, as shown in Appendix A.3. Each model is trained for 400k steps, taking more than 100 hours on 4 x GeForce 2080Ti GPUs in parallel. The experimental results illustrate that SDU components expedite convergence during training (see Fig. 9 and 10 in Appendix C for details), but supplementing gated units on every Transformer block can lead to premature convergence. It is also observed that adding gated units to the bottom few layers strengthens the convergence process without impeding the final performance, as shown in Table 6. Fig. 7 shows that tanh gates on the bottom two layers promote the convergence process and further improve the bpc on the dev and test sets. Interestingly, the performance is not positively correlated with the number of gated layers. Enriching the bottom 2 layers with tanh- and σ-gated functions (denoted "+tanh L1-2" and "+σ L1-2" in Table 6) impressively benefits convergence in both training and evaluation and even marginally improves the final test bpc (see Fig. 9 and Fig. 10 in Appendix C for details). Therefore, lower layers benefit more from our proposed gated units than higher layers, again illustrating that SDUs can enhance feature-wise information on the shallow layers of deep-stacked Transformer components.

Gating Mechanism Analysis
It can be concluded that gating units boost convergence, especially on low-level layers; enhancing the bottom layers of deep-stacked models results in faster optimization. This may be because SDU gates enrich the original representations with adaptive self-dependency encodings: the final hidden state can be regarded as a revised representation that incorporates additional self-attentive features.
Meanwhile, we find that supplementing SDU gates does not add much time cost compared with the baselines; the total running time of each experimental setting on the 6-layer Transformer-XL is quite similar. It is argued that low-level Transformer layers learn local-region information while high-level layers pay more attention to global dependencies (Yang et al., 2018). Our experimental results verify that gated representations on bottom layers can strengthen performance by introducing additional gated encodings of localness.
Further, the visualization of the learned gate bias parameters of the 6-layer and 12-layer models, shown in Fig. 11 in Appendix D.1, presents a layer separation that grows with layer depth. This seamlessly verifies our earlier hypothesis that SDUs on shallow layers promote the learning process and attend to information different from that of top layers. The scatter plot in Fig. 12 in Appendix D.2 indicates that gates on different sublayers learn different aspects within the same representation space.
SDUs calculate the output by regulating the information flow of the inputs conditioned on themselves. Given a hidden dimension d, the additional trainable-parameter cost of each SDU unit in our experiments is O(2d(d + 1)). Meanwhile, convolutions along the sequence direction could substitute for the fully-connected feed-forward SDU to curtail the extra parameter cost. Such gating units offer good scalability, attaching to different Transformer structures with only minor implementation changes.
The gradient of our SDU components is:

∇[ Ψ(g(X)) ⊙ f(X) ] = Ψ(g(X)) ⊙ ∇f(X) + Ψ′(g(X)) ∇g(X) ⊙ f(X)

where f, g are linear projections and Ψ takes the tanh or σ function. The addition of the two terms provides an unimpeded information flow, which can be regarded as a multiplicative skip connection (Dauphin et al., 2017), while the second term usually vanishes due to the derivative of the gating function Ψ. Based on the experimental results, we hypothesize that this accelerates the optimization process towards a local minimum.
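The product-rule gradient above can be checked against finite differences in a scalar NumPy sketch (the parameter values below are arbitrary):

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def sdu_scalar(x, w1, b1, w2, b2):
    """Scalar SDU: Psi(g(x)) * f(x), with g, f linear and Psi = sigmoid."""
    return sigmoid(w1 * x + b1) * (w2 * x + b2)

def sdu_grad(x, w1, b1, w2, b2):
    """Analytic d/dx: Psi(g) * f'(x) + Psi'(g) * g'(x) * f(x),
    using sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))."""
    g, f = w1 * x + b1, w2 * x + b2
    s = sigmoid(g)
    return s * w2 + s * (1 - s) * w1 * f

# central finite-difference check of the product-rule gradient
x, eps = 0.3, 1e-6
params = (1.5, -0.2, 0.7, 0.1)
num = (sdu_scalar(x + eps, *params) - sdu_scalar(x - eps, *params)) / (2 * eps)
assert abs(num - sdu_grad(x, *params)) < 1e-6
```

The first term (Ψ ⊙ ∇f) stays bounded away from zero wherever the gate is open, which is the "unimpeded" flow the text refers to; the second term carries the Ψ′ factor and saturates.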

Related Work
In recent years, plenty of works have adopted gating units in CNNs to help learn sequential information. Dauphin et al. (2017) proposed stacked gated CNNs, incorporating GLUs into 1-dimensional convolutions and achieving results competitive with recurrent models on LM tasks. Based on this, Gehring et al. (2017) extended gated CNNs to fully convolutional seq2seq models. Notably, our SDU bears a resemblance to the Swish activation (Ramachandran et al., 2017) in its equation format: both use a sigmoidal function and a self-gating mechanism. However, Swish gates the input on itself in a tandem way, while the proposed SDU applies the gate after a linear projection and acts through a shunt connection in Transformer stacks.

Conclusion and Future Work
Our gating-enhanced architecture enjoys both the advantages of MHDPA and of a self-regulated gating mechanism, allowing a pseudo-highway information flow for better convergence while elastically introducing only a few trainable parameters. It outperforms or matches common Transformer variants without hyperparameter tuning. It is empirically shown that self-gating units on shallow layers provide more internal representations of importance and significantly benefit convergence. This also supports the argument that different levels of Transformer components attend to different semantic aspects, with lower levels paying more attention to local regions. In the future, it is worth interpreting the semantics that Transformer layers at different depths convey, which would be beneficial for computational efficiency.

Fig. 9 shows the curves of the tanh-gate-enhanced Transformer-XL during the training and evaluation process. Adding tanh gates on the first few layers greatly boosts convergence in both training and evaluation. Among them, "+tanh L1-2" presents a rapid convergence trend and marginally outperforms the baseline. The trainable biases of the SDU gates behave quite differently on the MHDPA and FFN sublayers, as in Fig. 11a, 11c for 6-layer models and Fig. 11b, 11d for 12-layer models. Also, the gate biases are similarly distributed across all 6 layers, as in Fig. 11e, while showing layer separation on the bottom few Transformer layers, as shown in Fig. 11f. This also matches the experimental evidence that SDU gates on 6-layer models all positively influence the final test performance, whereas gates only on the first few layers of 12-layer Transformers yield better results in both convergence speed and final test bpc.

D.2 Scatter Visualization
Fig. 12 illustrates the uniform distribution of gate biases on both the 6-layer and 12-layer Transformer-XL models. Due to the residual connections, the representation spaces can be seen as the same; hence the evenly distributed gate biases may learn different aspects accordingly, which matches our intuition.

Figure 11: Heatmap visualization of the learnable biases (i.e., b_1 in Eq. 21) of the σ gate units in the 6-layer (left column) and 12-layer (right column) Transformer-XL models, where the vertical axes represent the layer number, and "a1" and "b3" denote the 1st MHDPA sublayer and 3rd FFN sublayer, respectively. All gate biases are initialized as zeros with dimension 512 each.

Figure 12: Scatter visualization of SDU gate biases on the 6-layer and 12-layer Transformer-XL, where "layer2-SA" denotes the gate bias on the 2nd self-attention sublayer. We employ t-Distributed Stochastic Neighbor Embedding (t-SNE) to reduce the dimension from 512 to 2.