Exploiting Deep Representations for Neural Machine Translation

Advanced neural machine translation (NMT) models generally implement encoder and decoder as multiple layers, which allows systems to model complex functions and capture complicated linguistic structures. However, only the top layers of encoder and decoder are leveraged in the subsequent process, which misses the opportunity to exploit the useful information embedded in other layers. In this work, we propose to simultaneously expose all of these signals with layer aggregation and multi-layer attention mechanisms. In addition, we introduce an auxiliary regularization term to encourage different layers to capture diverse information. Experimental results on widely-used WMT14 English-German and WMT17 Chinese-English translation data demonstrate the effectiveness and universality of the proposed approach.


Introduction
Neural machine translation (NMT) models have advanced the machine translation community in recent years (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014). NMT models generally consist of two components: an encoder network that summarizes the input sentence into sequential representations, and a decoder network that, based on these representations, generates the target sentence word by word with an attention model (Bahdanau et al., 2015; Luong et al., 2015).
Nowadays, advanced NMT models generally implement encoder and decoder as multiple layers, regardless of the specific model architectures such as RNN (Zhou et al., 2016; Wu et al., 2016), CNN (Gehring et al., 2017), or Self-Attention Network (Vaswani et al., 2017). Several researchers have revealed that different layers are able to capture different types of syntactic and semantic information (Shi et al., 2016; Peters et al., 2018; Anastasopoulos and Chiang, 2018). For example, Shi et al. (2016) find that both local and global source syntax are learned by the NMT encoder and different types of syntax are captured at different layers.
* Zhaopeng Tu is the corresponding author of the paper. This work was conducted when Zi-Yi Dou was interning at Tencent AI Lab.
However, current NMT models only leverage the top layers of encoder and decoder in the subsequent process, which misses the opportunity to exploit useful information embedded in other layers. Recently, aggregating layers to better fuse semantic and spatial information has proven to be of profound value in computer vision tasks (Huang et al., 2017; Yu et al., 2018). In the natural language processing community, Peters et al. (2018) have shown that simultaneously exposing all layer representations outperforms methods that utilize just the top layer for transfer learning tasks.
Inspired by these findings, we propose to exploit deep representations for NMT models. Specifically, we investigate two types of strategies to better fuse information across layers, ranging from layer aggregation to multi-layer attention. While layer aggregation strategies combine hidden states at the same position across different layers, multi-layer attention allows the model to combine information in different positions. In addition, we introduce an auxiliary objective to encourage different layers to capture diverse information, which we believe would make the deep representations more meaningful.
We evaluated our approach on two widely used translation tasks: WMT14 English⇒German and WMT17 Chinese⇒English. We employed TRANSFORMER (Vaswani et al., 2017) as the baseline system since it has been shown to outperform other architectures on the two tasks (Vaswani et al., 2017; Hassan et al., 2018). Experimental results show that exploiting deep representations consistently improves translation performance over the vanilla TRANSFORMER model across language pairs. It is worth mentioning that TRANSFORMER-BASE with deep representations exploitation outperforms the vanilla TRANSFORMER-BIG model with fewer than half of the parameters.

Background: Deep NMT
Deep representations have proven to be of profound value in machine translation (Meng et al., 2016; Zhou et al., 2016). Multiple-layer encoders and decoders are employed to perform the translation task through a series of nonlinear transformations from the representation of input sequences to final output sequences. The layer can be implemented as RNN (Wu et al., 2016), CNN (Gehring et al., 2017), or Self-Attention Network (Vaswani et al., 2017). In this work, we take the advanced Transformer as an example, which will be used in experiments later. However, we note that the proposed approach is generally applicable to any other type of NMT architecture.
Specifically, the encoder is composed of a stack of L identical layers, each of which has two sub-layers. The first sub-layer is a self-attention network, and the second one is a position-wise fully connected feed-forward network. A residual connection (He et al., 2016) is employed around each of the two sub-layers, followed by layer normalization (Ba et al., 2016). Formally, the output of the first sub-layer $C_e^l$ and the second sub-layer $H_e^l$ are calculated as
$$\begin{aligned} C_e^l &= \mathrm{LN}\big(\mathrm{ATT}(Q_e^l, K_e^{l-1}, V_e^{l-1}) + H_e^{l-1}\big), \\ H_e^l &= \mathrm{LN}\big(\mathrm{FFN}(C_e^l) + C_e^l\big), \end{aligned} \tag{1}$$
where ATT(·), LN(·), and FFN(·) are the self-attention mechanism, layer normalization, and a feed-forward network with ReLU activation in between, respectively. $\{Q_e^l, K_e^{l-1}, V_e^{l-1}\}$ are query, key and value vectors that are transformed from the (l-1)-th encoder layer $H_e^{l-1}$.
The decoder is also composed of a stack of L identical layers. In addition to the two sub-layers above, each decoder layer inserts a third sub-layer $D_d^l$ to perform attention over the output of the encoder stack $H_e^L$:
$$\begin{aligned} C_d^l &= \mathrm{LN}\big(\mathrm{ATT}(Q_d^l, K_d^{l-1}, V_d^{l-1}) + H_d^{l-1}\big), \\ D_d^l &= \mathrm{LN}\big(\mathrm{ATT}(C_d^l, K_e^L, V_e^L) + C_d^l\big), \\ H_d^l &= \mathrm{LN}\big(\mathrm{FFN}(D_d^l) + D_d^l\big), \end{aligned} \tag{2}$$
where $\{Q_d^l, K_d^{l-1}, V_d^{l-1}\}$ are transformed from the (l-1)-th decoder layer $H_d^{l-1}$, and $\{K_e^L, V_e^L\}$ are transformed from the top layer of the encoder. The top layer of the decoder $H_d^L$ is used to generate the final output sequence.
A multi-layer network can be considered a strong feature extractor with extended receptive fields capable of linking salient features from the entire sequence. However, one potential problem with the vanilla Transformer, as shown in Figure 1a, is that both the encoder and decoder stack layers in sequence and only utilize the information in the top layer. While studies have shown that deeper layers extract more semantic and more global features (Zeiler and Fergus, 2014; Peters et al., 2018), these do not prove that the last layer is the ultimate representation for any task. Although residual connections have been incorporated to combine layers, these connections have been "shallow" themselves, and only fuse by simple, one-step operations (Yu et al., 2018). We investigate here how to better fuse information across layers for NMT models.
In the following sections, we simplify the equations to $H^l = \mathrm{LAYER}(H^{l-1})$ for brevity.

Proposed Approaches
In this section, we first introduce how to exploit deep representations by simultaneously exposing all of the signals from all layers (Sec 3.1). Then, to explicitly encourage different layers to incorporate various information, we propose one way to measure the diversity between layers and add a regularization term to our objective function to maximize the diversity across layers (Sec 3.2).

Deep Representations
To exploit deep representations, we investigate two types of strategies to fuse information across layers, from layer aggregation to multi-layer attention. While layer aggregation strategies combine hidden states at the same position across different layers, multi-layer attention allows the model to combine information in different positions.

Layer Aggregation
While the aggregation strategies are inspired by previous work, there are several differences since we have simplified and generalized from the original models, as described below.
Dense Connection. The first strategy is to allow all layers to directly access previous layers:
$$H^l = f(H^1, \ldots, H^{l-1}).$$
In this work, we mainly investigate whether densely connected networks, which have proven successful in computer vision tasks (Huang et al., 2017), also work for NMT. The basic strategy of densely connected networks is to connect each layer to every previous layer with a residual connection:
$$H^l = \mathrm{LAYER}(H^{l-1}) + \sum_{i=1}^{l-1} H^i.$$
Figure 1b illustrates the idea of this approach. Our implementation differs from Huang et al. (2017) in that we use an addition instead of a concatenation operation in order to keep the state size constant. Another reason is that the concatenation operation is computationally expensive, while residual connections are more efficient.
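The additive dense connection described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `dense_stack` and `add` are hypothetical names, each `layer` is any callable standing in for a full Transformer layer, and a hidden state is reduced to a single vector because the addition is applied position by position.

```python
def add(a, b):
    """Element-wise sum of two equally sized vectors."""
    return [x + y for x, y in zip(a, b)]

def dense_stack(h0, layers):
    """Run `layers` in sequence; the output of each layer also receives
    the sum of all earlier layer outputs (addition, not concatenation).

    h0 is the input to the first layer and is not part of the sum.
    Returns the list of layer outputs [H^1, ..., H^L].
    """
    outputs = []
    prev = h0
    for layer in layers:
        h = layer(prev)
        for earlier in outputs:      # residual link to every previous layer
            h = add(h, earlier)
        outputs.append(h)
        prev = h
    return outputs

# Toy usage: three "layers" that simply double their input.
double = lambda h: [2.0 * v for v in h]
states = dense_stack([1.0, 2.0], [double, double, double])
```

Because the states are summed rather than concatenated, every output keeps the size of the input vector, which is the property the text gives as the reason for choosing addition.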
While dense connection directly feeds previous layers to the subsequent layers, the following mechanisms maintain additional layers to aggregate standard layers, from shallow linear combination, to deep non-linear aggregation.
Linear Combination. As shown in Figure 1c, an intuitive strategy is to linearly combine the outputs of all layers:
$$\hat{H} = \sum_{l=1}^{L} W_l H^l,$$
where $\{W_1, \ldots, W_L\}$ are trainable matrices. While the strategy is similar in spirit to Peters et al. (2018), there are two main differences: (1) they use normalized weights while we directly use parameters that could be either positive or negative numbers, which may benefit from more modeling flexibility.
(2) they use a scalar that is shared by all elements in the layer states, while we use learnable matrices. The latter offers a more precise control of the combination by allowing the model to be more expressive than scalars (Tu et al., 2017).
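As a rough illustration of combining layers with one trainable matrix per layer instead of a shared scalar, the sketch below sums matrix-vector products over layers. The function names are hypothetical, the weights are fixed toy values rather than learned parameters, and multiplying the matrix on the left of a column vector is an assumption about orientation that the text does not specify.

```python
def matvec(w, h):
    """Multiply matrix w (list of rows) by vector h."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in w]

def linear_combination(layer_states, weights):
    """Sum of W_l applied to H^l over all layers.

    layer_states: one vector per layer (a single position, for brevity).
    weights: one square matrix per layer; entries may be negative and
    are not normalized across layers.
    """
    out = [0.0] * len(weights[0])
    for w, h in zip(weights, layer_states):
        out = [o + v for o, v in zip(out, matvec(w, h))]
    return out

# Toy usage: identity for layer 1, a 90-degree rotation for layer 2.
combined = linear_combination(
    [[1.0, 2.0], [3.0, 4.0]],
    [[[1.0, 0.0], [0.0, 1.0]], [[0.0, -1.0], [1.0, 0.0]]],
)
```

A scalar-per-layer scheme corresponds to the special case where each matrix is a multiple of the identity, which is why the matrix form is strictly more expressive.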
We also investigate strategies that iteratively and hierarchically merge layers by incorporating more depth and sharing, which have proven effective for computer vision tasks (Yu et al., 2018).
Iterative Aggregation. As illustrated in Figure 1d, iterative aggregation follows the iterated stacking of the backbone architecture. Aggregation begins at the shallowest, smallest scale and then iteratively merges deeper, larger scales. The iterative deep aggregation function I for a series of layers $H_1^l = \{H^1, \cdots, H^l\}$ with increasingly deeper and more semantic information is formulated as
$$\hat{H}^l = I(H_1^l) = \mathrm{AGG}(H^l, \hat{H}^{l-1}),$$
where we set $\hat{H}^1 = H^1$ and AGG(·, ·) is the aggregation function:
$$\mathrm{AGG}(x, y) = \mathrm{LN}\big(\mathrm{FF}([x; y]) + x + y\big).$$
As seen, in this work, we first concatenate x and y into z = [x; y], which is subsequently fed to a feed-forward network with a sigmoid activation in between. Residual connection and layer normalization are also employed. Specifically, both x and y have residual connections to the output. The choice of the aggregation function will be further studied in the experiment section.
Hierarchical Aggregation. While iterative aggregation deeply combines states, it may still be insufficient to fuse the layers because of its sequential architecture. Hierarchical aggregation, on the other hand, merges layers through a tree structure to preserve and combine feature channels, as shown in Figure 2. The original model proposed by Yu et al. (2018) requires the number of layers to be a power of two, which limits the applicability of these methods to a broader range of NMT architectures (e.g. six layers in Vaswani et al. (2017)).
To solve this problem, we introduce a CNN-like tree with the filter size being two, as shown in Figure 2a. Following Yu et al. (2018), we first merge aggregation nodes of the same depth for efficiency so that there would be at most one aggregation node for each depth. Then, we further feed the output of an aggregation node back into the backbone as the input to the next sub-tree, instead of only routing intermediate aggregations further up the tree, as shown in Figure 2b. The interaction between aggregation and backbone nodes allows the model to better preserve features. Formally, each aggregation node $\hat{H}^i$ is calculated as
$$\hat{H}^i = \mathrm{AGG}(H^{2i-1}, H^{2i}, \hat{H}^{i-1}).$$
The aggregation node at the top layer $\hat{H}^{L/2}$ serves as the final output of the network.
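The two deep aggregation schemes can be sketched as follows, under several stated assumptions: hidden states are single vectors, `ff` is any callable mapping the concatenated inputs back to the state size (a stand-in for the feed-forward network with sigmoid activation), layer normalization has no learned gain or bias, and the number of layers is even. All function names are hypothetical, and the way the aggregation node is fed back into the backbone is one reading of the description above.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def agg(inputs, ff):
    """LN(FF([x; y; ...]) + x + y + ...): every input has a residual path."""
    concat = [v for h in inputs for v in h]    # [x; y; ...]
    out = ff(concat)                           # must return a state-sized vector
    for h in inputs:
        out = [o + v for o, v in zip(out, h)]
    return layer_norm(out)

def iterative_aggregation(layer_states, ff):
    """Start from the lowest layer, then merge each deeper layer in turn."""
    merged = layer_states[0]
    for h in layer_states[1:]:
        merged = agg([h, merged], ff)
    return merged

def hierarchical_aggregation(h0, layers, ff):
    """Merge layers two at a time; each aggregation node is fed back
    into the backbone as the input of the next pair of layers."""
    if len(layers) % 2 != 0:
        raise ValueError("this sketch assumes an even number of layers")
    node, x = None, h0
    for i in range(0, len(layers), 2):
        h_odd = layers[i](x)
        h_even = layers[i + 1](h_odd)
        inputs = [h_odd, h_even] + ([node] if node is not None else [])
        node = agg(inputs, ff)
        x = node
    return node

# Toy stand-in for FF that contributes nothing, so only the residual paths act.
def zero_ff(concat):
    return [0.0, 0.0]

bump = lambda h: [v + 1.0 for v in h]
top = hierarchical_aggregation([0.0, 2.0], [bump, bump, bump, bump], zero_ff)
```

With four layers the loop creates two aggregation nodes, and the second one is returned, matching the statement that the top aggregation node serves as the final output.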

Multi-Layer Attention
Partially inspired by Meng et al. (2016), we also propose to introduce a multi-layer attention mechanism into deep NMT models, for more power of transforming information across layers. In other words, for constructing each hidden state in any layer l, we allow the self-attention model to attend to any layers lower than l, instead of just layer l-1:
$$\begin{aligned} C_{-i}^l &= \mathrm{ATT}(Q^l, K^{l-i}, V^{l-i}), \quad i = 1, \ldots, k, \\ C^l &= \mathrm{AGG}(C_{-1}^l, \ldots, C_{-k}^l), \end{aligned}$$
where $C_{-i}^l$ denotes the sequential vectors queried from layer l-i using a separate attention model, and AGG(·) is similar to the pre-defined aggregation function, transforming the k vectors $\{C_{-1}^l, \ldots, C_{-k}^l\}$ into a d-dimensional vector, which is subsequently used to construct the encoder and decoder layers via Eqn. 1 and 2, respectively. Note that multi-layer attention only modifies the self-attention blocks in both the encoder and decoder, and does not revise the encoder-decoder attention blocks.
Figure 3: Multi-layer attention allows the model to attend to multiple layers to construct each hidden state. We use two-layer attention for illustration, while the approach is applicable to any layers lower than l.
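A simplified sketch of attending to the k layers below the current one is given here. It makes several assumptions that go beyond the text: attention is single-head scaled dot-product without learned projections (the raw states serve as queries, keys and values), queries come from the layer directly below, and the aggregation is passed in as a callable, with a plain mean used as a toy stand-in. All names are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over one sequence."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def multi_layer_attention(lower_layers, k, agg):
    """lower_layers = [H^1, ..., H^(l-1)], each a list of position vectors.

    Runs one attention per layer among the k most recent lower layers
    (k >= 1), then aggregates the k results position by position.
    """
    queries = lower_layers[-1]
    attended = [attend(queries, h, h) for h in lower_layers[-k:][::-1]]
    return [agg([c[n] for c in attended]) for n in range(len(queries))]

def mean_agg(vectors):
    """Toy aggregation: element-wise mean of the attended vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

h1 = [[1.0, 0.0], [1.0, 0.0]]
h2 = [[0.0, 2.0], [0.0, 2.0]]
context = multi_layer_attention([h1, h2], 2, mean_agg)
```

The result would then take the place of the single self-attention output when the layer is built; each extra attended layer costs one more attention pass per layer.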

Layer Diversity
Intuitively, combining layers would be more meaningful if different layers are able to capture diverse information. Therefore, we explicitly add a regularization term to encourage diversity between layers:
$$\mathcal{L} = \mathcal{L}_{likelihood} + \lambda \mathcal{L}_{diversity},$$
where λ is a hyper-parameter and is set to 1.0 in this paper. Specifically, the regularization term measures the average of the distance between every two adjacent layers:
$$\mathcal{L}_{diversity} = \frac{1}{L-1} \sum_{l=1}^{L-1} D(H^l, H^{l+1}).$$
Here $D(H^l, H^{l+1})$ is the averaged cosine-squared distance between the states in layers $H^l = \{h_1^l, \ldots, h_N^l\}$ and $H^{l+1} = \{h_1^{l+1}, \ldots, h_N^{l+1}\}$:
$$D(H^l, H^{l+1}) = \frac{1}{N} \sum_{n=1}^{N} \big(1 - \cos^2(h_n^l, h_n^{l+1})\big).$$
The cosine-squared distance between two vectors is maximized when the two vectors are orthogonal and minimized when they are linearly dependent, which satisfies our goal.
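The diversity term can be computed directly from its description: one minus the squared cosine for each pair of same-position states, averaged over positions and then over adjacent layer pairs. The sketch below uses hypothetical function names, plain Python lists, and assumes no state is the zero vector (which would divide by zero).

```python
def cos_sq_distance(u, v):
    """1 - cos^2 of the angle between u and v (both must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u_sq = sum(a * a for a in u)
    norm_v_sq = sum(b * b for b in v)
    return 1.0 - (dot * dot) / (norm_u_sq * norm_v_sq)

def layer_distance(h_a, h_b):
    """Average cosine-squared distance over aligned positions."""
    return sum(cos_sq_distance(u, v) for u, v in zip(h_a, h_b)) / len(h_a)

def diversity(layers):
    """Average distance over every pair of adjacent layers."""
    pairs = list(zip(layers, layers[1:]))
    return sum(layer_distance(a, b) for a, b in pairs) / len(pairs)

# Toy usage: three layers with a single position each.
score = diversity([[[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 2.0]]])
```

The distance is 0 for parallel or anti-parallel vectors and 1 for orthogonal ones, so a pair of layers that only rescales or negates its states contributes nothing to the term. During training the score would be scaled by λ and combined with the likelihood objective.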

Setup
Dataset. To compare with the results reported by previous work (Gehring et al., 2017; Vaswani et al., 2017; Hassan et al., 2018), we conducted experiments on both Chinese⇒English (Zh⇒En) and English⇒German (En⇒De) translation tasks. For the Zh⇒En task, we used all of the available parallel data with maximum length limited to 50, consisting of about 20.62 million sentence pairs. We used newsdev2017 as the development set and newstest2017 as the test set. For the En⇒De task, we trained on the widely used WMT14 dataset consisting of about 4.56 million sentence pairs. We used newstest2013 as the development set and newstest2014 as the test set. Byte-pair encoding (BPE) was employed to alleviate the out-of-vocabulary problem (Sennrich et al., 2016) with 32K merge operations for both language pairs. We used the 4-gram NIST BLEU score (Papineni et al., 2002) as the evaluation metric, and the sign-test (Collins et al., 2005) to test for statistical significance.

Models.
We evaluated the proposed approaches on the advanced Transformer model (Vaswani et al., 2017), and implemented them on top of an open-source toolkit, THUMT (Zhang et al., 2017). We followed Vaswani et al. (2017) to set the configurations and train the models, and have reproduced their reported results on the En⇒De task. The parameters of the proposed models were initialized by the pre-trained model. We tried k = 2 and k = 3 for the multi-layer attention model, which allows the model to attend to the lower two or three layers. We have tested both Base and Big models, which differ in hidden size (512 vs. 1024), filter size (2048 vs. 4096) and the number of attention heads (8 vs. 16). All the models were trained on eight NVIDIA P40 GPUs, each of which was allocated a batch size of 4096 tokens. In consideration of computation cost, we studied model variations with the Base model on the En⇒De task, and evaluated overall performance with the Big model on both Zh⇒En and En⇒De tasks.
Table 1 shows the results on the WMT14 En⇒De translation task. As seen, the proposed approaches improve the translation quality in all cases, although there are still considerable differences among different variations.

Results
Model Complexity. Except for dense connection, all other deep representation strategies introduce new parameters, ranging from 14.7M to 33.6M. Accordingly, the training speed decreases because the new parameters have to be trained. Layer aggregation mechanisms only marginally decrease decoding speed, while multi-layer attention decreases decoding speed by 21% due to an additional attention process for each layer.
Layer Aggregation (Rows 2-5): Although dense connection and linear combination only marginally improve translation performance, iterative and hierarchical aggregation strategies achieve more significant improvements, which are up to +0.99 BLEU points better than the baseline model. This indicates that deep aggregations outperform their shallow counterparts by incorporating more depth and sharing, which is consistent with the results in computer vision tasks (Yu et al., 2018).
Multi-Layer Attention (Rows 6-7): Benefiting from the power of attention models, the multi-layer attention model also significantly outperforms the baseline, although it only attends to one or two additional layers. However, increasing the number of lower layers to be attended from k = 2 to k = 3 only gains marginal improvement, at the cost of slower training and decoding speeds. In the following experiments, we set k = 2 for the multi-layer attention model.
Table 2: Comparison with existing NMT systems on WMT14 English⇒German and WMT17 Chinese⇒English tasks. "+ Deep Representations" denotes "+ Hierarchical Aggregation + $\mathcal{L}_{diversity}$". "†" indicates a statistically significant difference (p < 0.01) from the TRANSFORMER baseline.
Layer Diversity (Rows 8-10): The introduced diversity regularization consistently improves performance in all cases by encouraging different layers to capture diverse information. Our best model outperforms the vanilla Transformer by +1.14 BLEU points. In the following experiments, we used hierarchical aggregation with diversity regularization (Row 8) as the default strategy.
Table 2 lists the results on both WMT17 Zh⇒En and WMT14 En⇒De translation tasks. As seen, exploiting deep representations consistently improves translation performance across model variations and language pairs, demonstrating the effectiveness and universality of the proposed approach. It is worth mentioning that TRANSFORMER-BASE with deep representations exploitation outperforms the vanilla TRANSFORMER-BIG model with fewer than half of the parameters.

Analysis
We conducted extensive analysis from different perspectives to better understand our model. All results are reported on the En⇒De task with TRANSFORMER-BASE.
Figure 4: BLEU scores on the En⇒De test set with respect to various input sentence lengths, for the Base, Hier., and Hier.+Div. models. "Hier." denotes hierarchical aggregation and "Div." denotes diversity regularization.

Length Analysis
Following Bahdanau et al. (2015), we grouped sentences of similar lengths together and computed the BLEU score for each group, as shown in Figure 4. Generally, the performance of TRANSFORMER-BASE goes up as input sentences get longer, which is superior to the performance of RNN-based NMT models on long sentences reported by Bentivogli et al. (2016). We attribute this to the strength of the self-attention mechanism in modeling global dependencies without regard to their distance. Clearly, the proposed approaches outperform the baseline model in all length segments, while there are still considerable differences between the two variations. Hierarchical aggregation consistently outperforms the baseline model, and the improvement goes up on long sentences. One possible reason is that long sentences indeed require deep aggregation mechanisms. Introducing diversity regularization further improves performance on most sentences (e.g. ≤ 45), while the improvement degrades on long sentences (e.g. > 45). We conjecture that complex long sentences may need to store duplicate information across layers, which conflicts with the diversity objective.
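Grouping test sentences by length before scoring can be sketched as below. The bucket width of 15 tokens and the function name are assumptions made for illustration; the text does not state the exact boundaries used for Figure 4. Each bucket holds sentence indices, so that hypotheses and references for the same group can be passed to whatever BLEU scorer is in use.

```python
from collections import defaultdict

def bucket_by_length(sentences, width=15):
    """Map each bucket's upper length bound to the indices of the
    tokenized sentences whose length falls in (upper - width, upper]."""
    buckets = defaultdict(list)
    for idx, tokens in enumerate(sentences):
        upper = ((len(tokens) - 1) // width + 1) * width
        buckets[upper].append(idx)
    return dict(buckets)

# Toy usage with source sentences of 3, 15, 16 and 50 tokens.
sources = [["w"] * 3, ["w"] * 15, ["w"] * 16, ["w"] * 50]
groups = bucket_by_length(sources)
```

Bucketing on the source side keeps the groups identical across systems, so per-bucket scores of different models are computed on the same sentences.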

Effect on Encoder and Decoder
Both encoder and decoder are composed of a stack of L layers, which may benefit from the proposed approach. In this experiment, we investigated how our models affect the two components, as shown in Table 3. Exploiting deep representations of encoder or decoder individually consistently outperforms the vanilla baseline model, and exploiting both components further improves the performance. These results provide support for the claim that exploiting deep representations is useful for both understanding the input sequence and generating the output sequence.
Table 4: Impact of residual connections and aggregation functions for hierarchical layer aggregation.

Impact of Aggregation Choices
As described in Section 3.1.1, the function of hierarchical layer aggregation is defined as $\mathrm{AGG}(x, y, z) = \mathrm{LN}(\mathrm{FF}([x; y; z]) + x + y + z)$, where FF(·) is a feed-forward network with a sigmoid activation in between. In addition, all the input layers {x, y, z} have residual connections to the output. In this experiment, we evaluated the impact of residual connection options, as well as different choices for the aggregation function, as shown in Table 4.
Concerning residual connections, if none of the input layers are connected to the output layer ("None"), the performance decreases. The translation performance is improved when the output is connected to only the top level of the input layers ("Top"), while connecting to all input layers ("All") achieves the best performance. This indicates that cross-layer connections are necessary to avoid the gradient vanishing problem.
Besides the feed-forward network with sigmoid activation, we also tried two other aggregation functions for FF(·): (1) a feed-forward network with a ReLU activation in between; and (2) the multi-head self-attention layer that constitutes the encoder and decoder layers in the TRANSFORMER model. As seen, all three functions consistently improve the translation performance, demonstrating the robustness of the proposed approaches.
Figure 5: The x-axis is the aggregation node and the y-axis is the input representation. $\hat{H}^i$ denotes the i-th aggregation layer, and $H^i$ denotes the i-th encoder layer. The rightmost and topmost positions on the x-axis and y-axis, respectively, represent the highest layer.

Visualization of Aggregation
To investigate the impact of diversity regularization, we visualized the exploitation of the input representations for hierarchical aggregation on the encoder side, as shown in Figure 5. Let $\mathcal{H}_i = \{H^{2i}, H^{2i-1}, \hat{H}^{i-1}\}$ be the input representations of the aggregation node $\hat{H}^i$. We calculated the exploitation of the j-th input as a score $s_j$ derived from $W_j$, the parameter matrix associated with the input $H_j$. The score $s_j$ is a rough estimation of the contribution of $H_j$ to the aggregation $\hat{H}^i$. We have two observations. First, the model tends to utilize the bottom layer more than the top one, indicating the necessity of fusing information across layers. Second, using the diversity regularization in Figure 5(b) can encourage each layer to contribute more equally to the aggregation. We hypothesize this is because the diversity regularization term encourages the different layers to contain diverse and equally important information.

Related Work
Representation learning is at the core of deep learning. Our work is inspired by technological advances in representation learning, specifically in the field of deep representation learning and representation interpretation.
Deep Representation Learning. Deep neural networks have advanced the state of the art in various communities, such as computer vision and natural language processing. One key challenge of training deep networks lies in how to transform information across layers, especially when the network consists of hundreds of layers.
In response to this problem, ResNet (He et al., 2016) uses skip connections to combine layers by simple, one-step operations. The densely connected network (Huang et al., 2017) is designed to better propagate features and losses through skip connections that concatenate all the layers in stages. Yu et al. (2018) design structures that iteratively and hierarchically merge the feature hierarchy to better fuse information.
Concerning machine translation, Meng et al. (2016) and Zhou et al. (2016) have shown that deep networks with advanced connecting strategies outperform their shallow counterparts. Due to their simplicity and effectiveness, skip connections have become a standard component of state-of-the-art NMT models (Wu et al., 2016; Gehring et al., 2017; Vaswani et al., 2017). In this work, we show that deep representation exploitation can further improve performance over simply using skip connections.
Representation Interpretation. Several researchers have tried to visualize the representation of each layer to help better understand what information each layer captures (Zeiler and Fergus, 2014). Concerning natural language processing tasks, Shi et al. (2016) find that both local and global source syntax are learned by the NMT encoder and different types of syntax are captured at different layers. Anastasopoulos and Chiang (2018) show that higher level layers are more representative than lower level layers. Peters et al. (2018) demonstrate that higher-level layers capture context-dependent aspects of word meaning while lower-level layers model aspects of syntax. Inspired by these observations, we propose to expose all of these representations to better fuse information across layers. In addition, we introduce a regularization to encourage different layers to capture diverse information.

Conclusion
In this work, we propose to better exploit deep representations that are learned by multiple layers for neural machine translation.
Specifically, the hierarchical aggregation with diversity regularization achieves the best performance by incorporating more depth and sharing across layers and by encouraging layers to capture different information. Experimental results on WMT14 English⇒German and WMT17 Chinese⇒English show that the proposed approach consistently outperforms the state-of-the-art TRANSFORMER baseline by +0.54 and +0.63 BLEU points, respectively. By visualizing the aggregation process, we find that our model indeed utilizes lower layers to effectively fuse the information across layers.
Future directions include validating our approach on other architectures such as RNN (Bahdanau et al., 2015) or CNN (Gehring et al., 2017) based NMT models, as well as combining with other advanced techniques (Shaw et al., 2018; Shen et al., 2018) to further improve the performance of TRANSFORMER.