Towards Better Modeling Hierarchical Structure for Self-Attention with Ordered Neurons

Recent studies have shown that a hybrid of self-attention networks (SANs) and recurrent neural networks (RNNs) outperforms both individual architectures, while not much is known about why the hybrid models work. With the belief that modeling hierarchical structure is an essential complementarity between SANs and RNNs, we propose to further enhance the strength of hybrid models with an advanced variant of RNNs, Ordered Neurons LSTM (ON-LSTM), which introduces a syntax-oriented inductive bias to perform tree-like composition. Experimental results on the benchmark machine translation task show that the proposed approach outperforms both individual architectures and a standard hybrid model. Further analyses on targeted linguistic evaluation and logical inference tasks demonstrate that the proposed approach indeed benefits from better modeling of hierarchical structure.


Introduction
Self-attention networks (SANs, Lin et al., 2017) have advanced the state of the art on a variety of natural language processing (NLP) tasks, such as machine translation (Vaswani et al., 2017), semantic role labelling (Tan et al., 2018), and language representations (Devlin et al., 2018). However, a previous study empirically reveals that the hierarchical structure of the input sentence, which is essential for language understanding, is not well modeled by SANs (Tran et al., 2018). Recently, hybrid models that combine the strengths of SANs and recurrent neural networks (RNNs) have outperformed both individual architectures on a machine translation task. We attribute the improvement to RNNs compensating for the limited ability of SANs to represent hierarchical structure, which is exactly the strength of RNNs (Tran et al., 2018).

* Work done when interning at Tencent AI Lab.
Starting with this intuition, we propose to further enhance the representational power of hybrid models with an advanced RNN variant, Ordered Neurons LSTM (ON-LSTM, Shen et al., 2019). ON-LSTM is better at modeling hierarchical structure because it introduces a syntax-oriented inductive bias, which enables RNNs to perform tree-like composition by controlling the update frequency of neurons. Specifically, we stack a SANs encoder on top of an ON-LSTM encoder (cascaded encoder). The SANs encoder is then able to extract richer representations from an input augmented with structural context. To reinforce the strength of modeling hierarchical structure, we propose to simultaneously expose both types of signals by explicitly combining the outputs of the SANs and ON-LSTM encoders.
We validate our hypothesis across a range of tasks, including machine translation, targeted linguistic evaluation, and logical inference. While machine translation is a benchmark task for deep learning models, the last two tasks focus on evaluating how much structural information is encoded in the learned representations. Experimental results show that the proposed approach consistently improves performance on all tasks, and that modeling hierarchical structure is indeed an essential complementarity between SANs and RNNs.
The contributions of this paper are:
• We empirically demonstrate that better modeling of hierarchical structure is an essential strength of hybrid models over vanilla SANs.
• Our study shows that augmenting RNNs with ordered neurons (Shen et al., 2019) produces promising improvements on machine translation, which addresses one potential criticism of ON-LSTM.

Approach
Partially motivated by prior work, we stack a SANs encoder on top of an RNNs encoder to form a cascaded encoder. In the cascaded encoder, hierarchical structure modeling is enhanced in the bottom RNNs encoder, based on which the SANs encoder is able to extract representations with richer hierarchical information. Let $X = \{x_1, \ldots, x_N\}$ be the input sequence. The representation of the cascaded encoder is calculated by

$$
\begin{aligned}
H^K_{\text{RNNs}} &= \mathrm{ENC}_{\text{RNNs}}(X), \\
H^L_{\text{SANs}} &= \mathrm{ENC}_{\text{SANs}}(H^K_{\text{RNNs}}),
\end{aligned}
$$

where $\mathrm{ENC}_{\text{RNNs}}(\cdot)$ is a $K$-layer RNNs encoder that reads the input sequence, and $\mathrm{ENC}_{\text{SANs}}(\cdot)$ is an $L$-layer SANs encoder that takes the output of the RNNs encoder as input.
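The composition above can be illustrated with a minimal sketch. The two layer functions are deliberately simplified stand-ins and are assumptions for illustration only: `rnn_layer` is a parameter-free element-wise recurrence and `san_layer` is scaled dot-product self-attention with queries, keys and values all equal to the input. Neither reproduces the trained encoders used in the experiments; the point is only the data flow, in which the K-layer recurrent stack reads the input and the L-layer self-attention stack reads the recurrent output.

```python
import math


def rnn_layer(seq, decay=0.5):
    """Toy recurrent layer (assumption): h_t = tanh(x_t + decay * h_{t-1})."""
    h = [0.0] * len(seq[0])
    out = []
    for x in seq:
        h = [math.tanh(xi + decay * hi) for xi, hi in zip(x, h)]
        out.append(h)
    return out


def san_layer(seq):
    """Toy self-attention layer (assumption): Q = K = V = input, one head."""
    d = len(seq[0])
    out = []
    for q in seq:
        # Scaled dot-product scores of this position against every position.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Each output row is a weighted average of the input rows.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def stack(layer, depth, seq):
    """Apply the same layer function `depth` times."""
    for _ in range(depth):
        seq = layer(seq)
    return seq


def cascaded_encoder(x, k_layers, l_layers):
    """Return (H_RNNs, H_SANs) for an N x d input given as a list of rows."""
    h_rnn = stack(rnn_layer, k_layers, x)      # H^K_RNNs = ENC_RNNs(X)
    h_san = stack(san_layer, l_layers, h_rnn)  # H^L_SANs = ENC_SANs(H^K_RNNs)
    return h_rnn, h_san


if __name__ == "__main__":
    x = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]
    h_rnn, h_san = cascaded_encoder(x, k_layers=3, l_layers=3)
    print(len(h_san), len(h_san[0]))
```

Both outputs keep the N × d shape of the input, which is what later allows them to be combined position by position.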
In this work, we replace the standard RNNs with the recently proposed ON-LSTM to better model hierarchical structure, and directly combine the two encoder outputs to build even richer representations, as described below.
Modeling Hierarchical Structure with Ordered Neurons. ON-LSTM introduces a new syntax-oriented inductive bias, Ordered Neurons, which enables LSTM models to perform tree-like composition without breaking their sequential form (Shen et al., 2019). Ordered neurons enable dynamic allocation of neurons to represent dependencies at different time scales by controlling the update frequency of neurons. The assumption behind ordered neurons is that some neurons always update more (or less) frequently than others, and that this order is pre-determined as part of the model architecture. Formally, ON-LSTM introduces novel ordered-neuron rules to update the cell state:

$$
\begin{aligned}
w_t &= \tilde{f}_t \circ \tilde{i}_t, \\
\hat{f}_t &= f_t \circ w_t + (\tilde{f}_t - w_t), \\
\hat{i}_t &= i_t \circ w_t + (\tilde{i}_t - w_t), \\
c_t &= \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t,
\end{aligned}
\tag{3}
$$

where the forget gate $f_t$, the input gate $i_t$ and the state $\hat{c}_t$ are the same as in the standard LSTM (Hochreiter and Schmidhuber, 1997). The master forget gate $\tilde{f}_t$ and the master input gate $\tilde{i}_t$ are newly introduced to control the erasing and the writing behaviors, respectively. $w_t$ indicates the overlap, and when the overlap exists ($\exists k,\ w_{tk} > 0$), the corresponding neurons are further controlled by the standard gates $f_t$ and $i_t$.

An ideal master gate is in binary format such as (0, 0, 1, 1, 1), which splits the cell state into two contiguous parts: the 0-part and the 1-part. The neurons corresponding to the 0-part and the 1-part are updated more and less frequently, respectively, so that the information in the 0-part neurons is kept for only a few time steps, while the information in the 1-part neurons lasts for more time steps. Since such binary gates are not differentiable, the goal turns to finding the splitting point $d$ (the index of the first 1 in the ideal master gate). To this end, Shen et al. (2019) introduced a new activation function:

$$
\mathrm{CU}(\cdot) = \mathrm{CUMSUM}(\mathrm{softmax}(\cdot)),
$$

where $\mathrm{softmax}(\cdot)$ produces a probability distribution (e.g., (0.1, 0.2, 0.4, 0.2, 0.1)) that indicates the probability of each position being the splitting point $d$. CUMSUM is the cumulative probability distribution, in which the $k$-th entry is the probability that $d$ falls within the first $k$ positions.
The output for the above example is (0.1, 0.3, 0.7, 0.9, 1.0), in which different values denote different update frequencies. It also equals the probability of each position's value being 1 in the ideal master gate. Since this ideal master gate is binary, $\mathrm{CU}(\cdot)$ is the expectation of the ideal master gate. Based on this activation function, the master gates are defined as

$$
\tilde{f}_t = \mathrm{CU}_f(x_t, h_{t-1}), \qquad \tilde{i}_t = 1 - \mathrm{CU}_i(x_t, h_{t-1}),
$$

where $x_t$ is the current input and $h_{t-1}$ is the hidden state of the previous step. $\mathrm{CU}_f$ and $\mathrm{CU}_i$ are two individual activation functions with their own trainable parameters.
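The activation function and the ordered-neuron cell update can be checked numerically with a short sketch. It assumes the gate pre-activations are already given as plain lists (in the model they are computed from $x_t$ and $h_{t-1}$ with trainable parameters, which are omitted here), and the function names are ours. The cell update follows the rule of Shen et al. (2019): overlap $w_t = \tilde{f}_t \circ \tilde{i}_t$, then $\hat{f}_t = f_t \circ w_t + (\tilde{f}_t - w_t)$, $\hat{i}_t = i_t \circ w_t + (\tilde{i}_t - w_t)$ and $c_t = \hat{f}_t \circ c_{t-1} + \hat{i}_t \circ \hat{c}_t$.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def cumsum(values):
    """Running sum: entry k is the sum of the first k values."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


def cu(logits):
    """CU(.) = CUMSUM(softmax(.)): expected value of the ideal binary master gate."""
    return cumsum(softmax(logits))


def on_lstm_cell_update(c_prev, f, i, c_hat, master_f, master_i):
    """Ordered-neuron cell update, applied neuron by neuron."""
    c = []
    for cp, ft, it, ch, mf, mi in zip(c_prev, f, i, c_hat, master_f, master_i):
        w = mf * mi                # overlap of the two master gates
        f_hat = ft * w + (mf - w)  # standard forget gate acts only inside the overlap
        i_hat = it * w + (mi - w)  # standard input gate acts only inside the overlap
        c.append(f_hat * cp + i_hat * ch)
    return c


if __name__ == "__main__":
    # Logits chosen so that softmax reproduces the example distribution in the text.
    probs = [0.1, 0.2, 0.4, 0.2, 0.1]
    master_f = cu([math.log(p) for p in probs])
    print([round(v, 6) for v in master_f])

    # Ideal binary master gates: neurons 0-1 are overwritten, neuron 2 lies in the
    # overlap and behaves like a standard LSTM cell, neurons 3-4 keep their content.
    print(on_lstm_cell_update([1.0] * 5, [0.5] * 5, [0.5] * 5, [2.0] * 5,
                              [0, 0, 1, 1, 1], [1, 1, 1, 0, 0]))
```

With the ideal binary gates in the second call, the low-index neurons take the new candidate value, the high-index neurons copy the previous cell state, and only the overlapping neuron mixes the two through the standard gates, which is the behavior described above for the 0-part and the 1-part.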
Table 1: Case-sensitive BLEU scores on the WMT14 English⇒German translation task. "↑ / ⇑": significantly better than the conventional self-attention counterpart (p < 0.05/0.01), tested by bootstrap resampling. "6L SANs" is the state-of-the-art Transformer model. "nL LSTM + mL SANs" denotes stacking n LSTM layers and m SANs layers successively. "Hybrid Model" denotes "3L ON-LSTM + 3L SANs".

Short-Cut Connection. Inspired by previous work on exploiting deep representations (Peters et al., 2018; Dou et al., 2018), we propose to simultaneously expose both types of signals by explicitly combining them with a simple short-cut connection (He et al., 2016). Similar to positional encoding injection in Transformer (Vaswani et al., 2017), we add the output of the ON-LSTM encoder to the output of the SANs encoder:

$$
\widehat{H} = H^K_{\text{ON-LSTM}} + H^L_{\text{SANs}},
$$

where $H^K_{\text{ON-LSTM}} \in \mathbb{R}^{N \times d}$ is the output of the ON-LSTM encoder, and $H^L_{\text{SANs}} \in \mathbb{R}^{N \times d}$ is the output of the SANs encoder.
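A minimal sketch of this combination, assuming both encoder outputs are stored as N × d lists of rows. The function name `short_cut` is ours and the input values are arbitrary illustrative numbers.

```python
def short_cut(h_on_lstm, h_sans):
    """Element-wise sum of two N x d encoder outputs (lists of rows)."""
    same_rows = len(h_on_lstm) == len(h_sans)
    same_cols = all(len(a) == len(b) for a, b in zip(h_on_lstm, h_sans))
    if not (same_rows and same_cols):
        raise ValueError("encoder outputs must share the shape N x d")
    return [[a + b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(h_on_lstm, h_sans)]


if __name__ == "__main__":
    h_on_lstm = [[1.0, 2.0], [3.0, 4.0]]  # illustrative values, N = 2, d = 2
    h_sans = [[0.5, 0.5], [1.0, -1.0]]
    print(short_cut(h_on_lstm, h_sans))
```

Because the sum is taken position by position, the shape check matters: the connection is only defined when both encoders emit one d-dimensional vector per input token.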

Experiments
We chose machine translation, targeted linguistic evaluation and logical inference tasks to conduct experiments in this work. The first two tasks evaluate and analyze the models on natural language, for which hierarchical structure is an inherent attribute. The third task aims to directly evaluate the effects of hierarchical structure modeling on an artificial language.

Machine Translation
For machine translation, we used the benchmark WMT14 English⇒German dataset. Sentences were encoded using byte-pair encoding (BPE) with a 32K word-piece vocabulary (Sennrich et al., 2016). We implemented the proposed approaches on top of TRANSFORMER (Vaswani et al., 2017), a state-of-the-art SANs-based model for machine translation, followed the setting in previous work (Vaswani et al., 2017) to train the models, and reproduced their reported results. We tested on both the Base and Big models, which differ in hidden size (512 vs. 1024), filter size (2048 vs. 4096) and number of attention heads (8 vs. 16). All the model variants were implemented on the encoder. The implementation details are introduced in Appendix A.

Baselines (Rows 1-3). Following Chen et al. (2018), the three baselines are implemented with the same framework and optimization techniques as used in Vaswani et al. (2017). The difference between them is that they adopt SANs, LSTM and ON-LSTM as basic building blocks, respectively. As seen, the three architectures achieve similar performance, each relying on its own representational strengths.
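For reference, the Base and Big settings described in the setup above can be written down as a small configuration sketch. The class and field names are ours and do not correspond to any particular toolkit; the per-head dimension is derived under the standard multi-head attention convention of Vaswani et al. (2017), in which the hidden size is split evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TransformerConfig:
    """Only the three hyperparameters that differ between Base and Big."""
    hidden_size: int
    filter_size: int
    num_heads: int

    @property
    def head_dim(self) -> int:
        # Multi-head attention splits the hidden size evenly across heads.
        if self.hidden_size % self.num_heads:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads


BASE = TransformerConfig(hidden_size=512, filter_size=2048, num_heads=8)
BIG = TransformerConfig(hidden_size=1024, filter_size=4096, num_heads=16)

if __name__ == "__main__":
    for name, cfg in (("Base", BASE), ("Big", BIG)):
        print(name, cfg.hidden_size, cfg.filter_size, cfg.num_heads, cfg.head_dim)
```

Under this convention both settings use 64-dimensional heads, so the Big model widens the layers and doubles the number of heads rather than making each head larger.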
Hybrid Models (Rows 4-7). We first followed Chen et al. (2018) to stack 6 RNNs layers and 4 SANs layers successively (Row 4), which consistently outperforms the individual models. This is consistent with previously reported results. In this setting, the ON-LSTM model significantly outperforms its LSTM counterpart (Row 5), and reducing the encoder depth still maintains the performance (Row 6). We attribute these results to the strength of ON-LSTM in modeling hierarchical structure, which we believe is an essential complementarity between SANs and RNNs. In addition, the Short-Cut connection improves translation performance by providing richer representations (Row 7).
Stronger Baseline (Rows 8-9). We finally conducted experiments on a stronger baseline, the TRANSFORMER-BIG model (Row 8), which outperforms its TRANSFORMER-BASE counterpart (Row 1) by 1.27 BLEU points. As seen, our model consistently improves performance over the stronger baseline by 0.72 BLEU points, demonstrating the effectiveness and universality of the proposed approach.

Table 3: Performance on the linguistic probing tasks, which evaluate the linguistic knowledge embedded in the learned representations. "S" and "O" denote the SAN and ON-LSTM baseline models. "H O" and "H S" are respectively the outputs of the ON-LSTM encoder and the SAN encoder in the hybrid model, and "Final" denotes the final output exposed to the decoder.

Assessing Encoder Strategies
The SANs encoder is able to extract richer representations if the input is augmented with sequential context. Moreover, to dispel the doubt that the improvement of the hybrid model comes from an increase in parameters, we investigate 8-layer LSTM and 10-layer SANs encoders (Rows 3-4), which have more parameters than the proposed hybrid model. The results show that the hybrid model consistently outperforms these variants with fewer parameters, so the improvement is unlikely to be due to a larger parameter count.

Targeted Linguistic Evaluation
To gain linguistic insights into the learned representations, we conducted probing tasks (Conneau et al., 2018) to evaluate the linguistic knowledge embedded in the final encoding representation learned by each model, as shown in Table 3. We evaluated SANs and the proposed hybrid model with Short-Cut connection on these 10 targeted linguistic evaluation tasks. The tasks and model details are described in Appendix A.2.
Experimental results are presented in Table 3. Several observations can be made here. The proposed hybrid model with short-cut produces more informative representations in most tasks ("Final" in "S" vs. in "Hybrid+Short-Cut"), indicating the effectiveness of the model. The only exceptions are the surface tasks, which is consistent with the conclusion in Conneau et al. (2018): as a model captures deeper linguistic properties, it tends to forget these superficial features. Short-cut further improves the performance by providing richer representations ("H S" vs. "Final" in "Hybrid+Short-Cut"). On syntactic tasks in particular, our proposed model surpasses the baseline by more than 13 points on average (74.36 vs. 60.66), which again verifies that ON-LSTM enhances the strength of modeling hierarchical structure for self-attention.

Logical Inference
We also verified the model's performance on the logical inference task proposed by Bowman et al. (2015). This task is well suited to evaluating the ability to model hierarchical structure. Models need to learn the hierarchical and nested structures of language in order to predict accurate logical relations between sentences (Bowman et al., 2015; Tran et al., 2018; Shen et al., 2019). The artificial language of the task has six types of words {a, b, c, d, e, f} in the vocabulary and three logical operators {or, and, not}. The goal of the task is to predict one of seven logical relations between two given sentences. These seven relations are: two entailment types (⊏, ⊐), equivalence (≡), exhaustive and non-exhaustive contradiction (∧, |), and two types of semantic independence (#, ⌣).
We evaluated the SANs, LSTM, ON-LSTM and the proposed model. We followed Tran et al. (2018) to use two hidden layers with Short-Cut connection in all models. The model details and hyperparameters are described in Appendix A.3. Figure 1 shows the results. The proposed hybrid model outperforms both the LSTM-based and the SANs-based baselines in all cases. Consistent with Shen et al. (2019), on the longer sequences (≥ 7) that were not included during training, the proposed model also obtains the best performance, and its gap over the other models is larger than on the shorter sequences (≤ 6). This verifies that the proposed model is better at modeling more complex hierarchical structure in sequences, and indicates that the hybrid model has a stronger generalization ability.
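The split between seen and unseen lengths can be sketched as follows. Several details are assumptions made for illustration and are not taken from the released data: that sequence length is measured by the number of logical operators, that a pair is bucketed by the longer of its two sentences, and that sentences are written with space-separated tokens and optional parentheses. The example pairs and their relation labels are hypothetical placeholders, not computed relations.

```python
OPERATORS = {"or", "and", "not"}
WORDS = {"a", "b", "c", "d", "e", "f"}


def count_operators(sentence):
    """Count logical operators in a sentence of the artificial language."""
    tokens = sentence.replace("(", " ").replace(")", " ").split()
    unknown = [t for t in tokens if t not in OPERATORS and t not in WORDS]
    if unknown:
        raise ValueError(f"unexpected tokens: {unknown}")
    return sum(t in OPERATORS for t in tokens)


def split_by_length(pairs, max_train_ops=6):
    """Split (premise, hypothesis, relation) triples into seen and unseen lengths.

    Assumption: a pair is as long as the longer of its two sentences.
    """
    seen, unseen = [], []
    for premise, hypothesis, relation in pairs:
        n_ops = max(count_operators(premise), count_operators(hypothesis))
        bucket = seen if n_ops <= max_train_ops else unseen
        bucket.append((premise, hypothesis, relation))
    return seen, unseen


if __name__ == "__main__":
    # Hypothetical examples; the labels are placeholders only.
    pairs = [
        ("( not a ) and ( b or c )", "a or b", "label-1"),
        ("not ( not ( not ( not ( not ( not ( not a ) ) ) ) ) )", "a", "label-2"),
    ]
    seen, unseen = split_by_length(pairs)
    print(len(seen), len(unseen))
```

Evaluating separately on the second bucket is what distinguishes generalization to deeper nesting from performance on structures of a depth already observed during training.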

Related Work
Improved Self-Attention Networks. Recently, there has been a large body of work on improving SANs in various NLP tasks (Yang et al., 2018; Yang et al., 2019a,b; Guo et al., 2019; Sukhbaatar et al., 2019), as well as in image classification (Bello et al., 2019) and automatic speech recognition (Mohamed et al., 2019) tasks. In these works, several strategies are proposed to improve SANs by enhancing local and global information. In this work, we enhance the SANs with the ON-LSTM to form a hybrid model, and thoroughly evaluate the performance on machine translation, targeted linguistic evaluation, and logical inference tasks.
Structure Modeling for Neural Networks in NLP. Structure modeling in NLP has been studied for a long time, as natural language sentences inherently have hierarchical structures (Chomsky, 1965; Bever, 1970). With the emergence of deep learning, tree-based models have been proposed to integrate syntactic tree structure into Recursive Neural Networks (Socher et al., 2013), LSTMs (Tai et al., 2015) and CNNs (Mou et al., 2016). As for SANs, prior work, including Hao et al. (2019a) and Ma et al. (2019), enhances SANs with neural syntactic distance, multi-granularity attention scope and structural position representations, which are generated from syntactic tree structures.
Closely related to our work, Hao et al. (2019b) find that integrating recurrence into the SANs encoder can provide more syntactic structure features to the encoder representations. Our work follows this direction and empirically evaluates structure modeling on the related tasks.

Conclusion
In this paper, we adopt the ON-LSTM, which models tree structure with a novel activation function and a structured gating mechanism, as the RNNs counterpart to boost the hybrid model. We also propose a modification of the cascaded encoder that explicitly combines the outputs of the individual components, to enhance the ability of hierarchical structure modeling in a hybrid model. Experimental results on machine translation, targeted linguistic evaluation and logical inference tasks show that the proposed models achieve better performance by modeling the hierarchical structure of sequences.