Self-Attention with Structural Position Representations

Although self-attention networks (SANs) have advanced the state of the art on various NLP tasks, one criticism of SANs concerns their limited ability to encode the positions of input words (Shaw et al., 2018). In this work, we propose to augment SANs with structural position representations to model the latent structure of the input sentence, which is complementary to the standard sequential position representations. Specifically, we use a dependency tree to represent the grammatical structure of a sentence, and propose two strategies to encode the positional relationships among words in the dependency tree. Experimental results on NIST Chinese-to-English and WMT14 English-to-German translation tasks show that the proposed approach consistently boosts performance over both the absolute and relative sequential position representations.


Introduction
In recent years, self-attention networks (SANs, Parikh et al., 2016; Lin et al., 2017) have achieved state-of-the-art results on a variety of NLP tasks (Vaswani et al., 2017; Strubell et al., 2018; Devlin et al., 2019). SANs perform the attention operation under a position-unaware "bag-of-words" assumption, in which the positions of the input words are ignored. Therefore, absolute position (Vaswani et al., 2017) or relative position (Shaw et al., 2018) representations are generally used to capture the sequential order of words in the sentence. However, several studies reveal that the sequential structure may not be sufficient for NLP tasks (Tai et al., 2015; Kim et al., 2017; Shen et al., 2019), since sentences inherently have hierarchical structures (Chomsky, 1965; Bever, 1970).
In response to this problem, we propose to augment SANs with structural position representations to capture the hierarchical structure of the input sentence. The starting point for our approach is a recent finding: the latent structure of a sentence can be captured by structural depths and distances (Hewitt and Manning, 2019). Accordingly, we propose absolute structural position to encode the depth of each word in a parsing tree, and relative structural position to encode the distance of each word pair in the tree.
We implement our structural encoding strategies on top of TRANSFORMER (Vaswani et al., 2017) and conduct experiments on both NIST Chinese⇒English and WMT14 English⇒German translation tasks.
Experimental results show that the structural position encoding strategies consistently boost performance over both the absolute and relative sequential position representations across language pairs. Linguistic analyses (Conneau et al., 2018) reveal that the proposed structural position representations improve translation performance by encoding richer syntactic information. Our main contributions are:
• Our study demonstrates the necessity and effectiveness of exploiting structural position encoding for SANs, which benefit from modeling syntactic depth and distance under the latent structure of the sentence.
• We propose structural position representations for SANs to encode the latent structure of the input sentence, which are complementary to their sequential counterparts.

Background
Self-Attention SANs produce representations by applying attention to each pair of elements from the input sequence, regardless of their distance.
Given an input sequence X = {x_1, . . . , x_I} ∈ R^{I×d}, the model first transforms

Figure 1: Illustration of (a) the standard sequential position encoding with absolute and relative positions (Vaswani et al., 2017; Shaw et al., 2018), and (b) the proposed structural position encoding. The relative positions in the example are for the word "talk".
it into queries Q ∈ R^{I×d}, keys K ∈ R^{I×d}, and values V ∈ R^{I×d}:

Q, K, V = X W^Q, X W^K, X W^V,    (1)

where {W^Q, W^K, W^V} ∈ R^{d×d} are trainable parameters and d indicates the hidden size. The output sequence is then calculated as

O = ATT(Q, K) V,    (2)

where ATT(·) is a dot-product attention model.
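As a concrete reference, the self-attention computation described above can be sketched in a few lines of NumPy (a minimal single-head sketch; the random inputs and variable names are purely illustrative):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head dot-product self-attention over an input X of shape (I, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v           # queries, keys, values: (I, d)
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # pairwise attention logits: (I, I)
    # softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # output sequence: (I, d)

rng = np.random.default_rng(0)
I, d = 6, 8
X = rng.normal(size=(I, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (6, 8)
```

Note that the attention weights depend only on the content of the words, which is why explicit position information must be injected separately.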

Sequential Position Encoding
To make use of the order of the sequence, information about the absolute or relative position of the elements is injected into the SAN:
• Absolute Sequential PE (Vaswani et al., 2017) is defined as

ABSPE(abs, 2i) = sin(abs / 10000^{2i/d}),  ABSPE(abs, 2i+1) = cos(abs / 10000^{2i/d}),    (3)

where abs is the absolute position in the sequence and i indexes the dimension of the position representation; that is, f(·) is sin(·) for even dimensions and cos(·) for odd dimensions. Vaswani et al. (2017) combine the fixed sequential position representation with the word embedding by element-wise addition and feed the combined representation to the SANs.
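The sinusoidal encoding above can be sketched as follows (a minimal illustration; the sequence length and hidden size are arbitrary):

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """ABSPE(abs, i): sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(max_len)[:, None]            # absolute positions: (max_len, 1)
    i = np.arange(d // 2)[None, :]               # dimension index:    (1, d/2)
    angles = pos / np.power(10000.0, 2 * i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

pe = sinusoidal_pe(50, 16)
# pe[abs] is added element-wise to the embedding of the word at position abs
print(pe.shape)  # (50, 16)
```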
• Relative Sequential PE (Shaw et al., 2018) is calculated as

RELPE(x_i, x_j) = R[rel],    (4)

where rel is the relative position to the queried word, which is used to index a learnable matrix R of relative position embeddings. Shaw et al. (2018) propose relation-aware SANs that take the relative sequential encodings as additional keys and values in the attention computation (Eq. 2).
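The relative indexing scheme can be sketched as follows (a simplified illustration of Shaw et al.'s clipped relative positions; the matrix R would be learned in practice, here it is random):

```python
import numpy as np

def relative_positions(length, r):
    """Clipped relative positions rel = clip(j - i, -r, r), shifted to [0, 2r]."""
    pos = np.arange(length)
    rel = pos[None, :] - pos[:, None]       # rel[i, j] = j - i
    return np.clip(rel, -r, r) + r          # non-negative indices into R

r, length, d = 2, 5, 4
R = np.random.default_rng(1).normal(size=(2 * r + 1, d))  # learnable matrix R
idx = relative_positions(length, r)
rel_emb = R[idx]        # (length, length, d): one embedding per query/key pair
print(idx[0])           # [2 3 4 4 4]
```

The shift by r makes every index non-negative, so each pairwise offset selects one row of R.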

Structural Position Representations
In this study, we choose the dependency tree to represent sentence structure for its simplicity in modeling syntactic relationships among input words. Figure 1 shows an example illustrating the idea of the proposed approach. From the perspective of the relationship path between words, sequential PE measures the sequential distance between words. As shown in Figure 1(a), for each word, the absolute sequential position represents the sequential distance to the beginning of the sentence, while the relative sequential position measures the relative distance to the queried word ("talk" in the example). The latent structure can be interpreted in various ways, from syntactic tree structures, e.g., constituency trees (Collins, 2003) or dependency trees (Kübler et al., 2009), to semantic graph structures, e.g., abstract meaning representation graphs (Banarescu et al., 2013). In this work, the dependency path, induced from the dependency tree, is adopted to provide a new perspective on modeling pairwise relationships. Figure 1 shows the difference between the sequential path and the dependency path: the sequential distance between the two words "held" and "talk" is 2, while their structural distance is only 1, as the word "talk" is the dependent of the head "held" (Nivre, 2005).
Absolute Structural Position We exploit the depth of a word in the dependency tree as its absolute structural position. Specifically, we treat the main verb (Tapanainen and Jarvinen, 1997) of the sentence as the origin and use the length of the dependency path from the target word to the origin as the absolute structural position:

abs_stru(x_i) = distance(x_i, origin | tree),    (5)

where x_i is the target word, tree is the given dependency structure, and the origin is the main verb of the tree.
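A sketch of computing absolute structural positions from head indices (the parse of the Figure 1 sentence shown here is an illustrative assumption, not necessarily the output of any particular parser):

```python
def absolute_structural_positions(heads, root):
    """Depth of each word in a dependency tree.
    heads[i] is the index of word i's head; the main verb (root) has depth 0."""
    def depth(i):
        return 0 if i == root else 1 + depth(heads[i])
    return [depth(i) for i in range(len(heads))]

# "Bush held a talk with Sharon" (0-based indices); assume "held" (index 1)
# is the main verb, "a" and "with" attach to "talk", "Sharon" to "with".
heads = [1, 1, 3, 1, 3, 4]
print(absolute_structural_positions(heads, root=1))  # [1, 0, 2, 1, 2, 3]
```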
In the field of NMT, BPE sub-words and the end-of-sentence symbol must be handled carefully, as they do not appear in the conventional dependency tree. In this work, we let BPE sub-words share the absolute structural position of their original word, and we assign the end-of-sentence symbol the first integer larger than the maximum absolute structural position in the dependency tree.
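This sub-word handling can be sketched as follows (assuming the common "@@" continuation marker used by the BPE toolkit; the segmentation shown is illustrative):

```python
def expand_positions_for_bpe(word_positions, bpe_tokens):
    """Each BPE sub-word shares its original word's structural position;
    the end-of-sentence symbol gets max(position) + 1."""
    positions, word_idx = [], 0
    for tok in bpe_tokens:
        positions.append(word_positions[word_idx])
        if not tok.endswith("@@"):      # "@@" marks a non-final sub-word piece
            word_idx += 1
    positions.append(max(word_positions) + 1)   # end-of-sentence symbol
    return positions

word_pos = [1, 0, 2, 1, 2, 3]
tokens = ["Bush", "held", "a", "ta@@", "lk", "with", "Sha@@", "ron"]
print(expand_positions_for_bpe(word_pos, tokens))  # [1, 0, 2, 1, 1, 2, 3, 3, 4]
```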
Relative Structural Position For the relative structural position rel_stru(x_i, x_j), we calculate it in the dependency tree following two hierarchical rules: (1) if x_i and x_j are on the same dependency edge, the relative structural position is the difference of their absolute structural positions; (2) otherwise, it is derived from the dependency path connecting them. Following Shaw et al. (2018), we use a clipping distance r to limit the maximum relative position.
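For words on the same dependency path, the relative structural position reduces to a signed depth difference. A hedged sketch with clipping follows (the parse and helper names are illustrative assumptions, and the sketch does not cover word pairs on different dependency edges):

```python
def path_to_root(i, heads, root):
    """Nodes on the dependency path from word i up to the main verb."""
    path = [i]
    while i != root:
        i = heads[i]
        path.append(i)
    return path

def rel_structural(i, j, heads, root, r):
    """Signed depth difference for words on the same dependency path,
    clipped to [-r, r] in the spirit of Shaw et al. (2018)."""
    pi, pj = path_to_root(i, heads, root), path_to_root(j, heads, root)
    assert j in pi or i in pj, "sketch covers only the same-path case"
    depth_diff = (len(pi) - 1) - (len(pj) - 1)
    return max(-r, min(r, depth_diff))

heads = [1, 1, 3, 1, 3, 4]   # illustrative parse of the Figure 1 sentence
print(rel_structural(3, 1, heads, root=1, r=16))   # "talk" relative to "held": 1
```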

Integrating Structural PE into SANs
We inherit the position encoding functions from the sequential approaches (Eq. 3 and Eq. 4) to implement the structural position encoding strategies. Since structural position representations capture position information complementary to their sequential counterparts, we also integrate the structural position encodings into SANs together with the sequential ones (Vaswani et al., 2017; Shaw et al., 2018).
For the absolute position, we use a nonlinear function to fuse the sequential and structural position representations^1:

abs(x_i) = f_abs(ABSPE(abs_seq), ABSPE(abs_stru)),    (6)

where f_abs is the nonlinear function, and ABSPE(abs_seq) and ABSPE(abs_stru) are the absolute sequential and structural position embeddings from Eq. 3 and Eq. 5, respectively. For the relative position, we follow Shaw et al. (2018) to extend the self-attention computation to consider pairwise relationships, and project the relative structural positions as described in Eq. 3 and Eq. 4.
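Since the paper specifies only that the fusion function is nonlinear, the following sketch combines the two encodings with an assumed tanh over a projected concatenation (the choice of fusion function and the projection matrix W are assumptions, not the paper's stated implementation):

```python
import numpy as np

def fuse_absolute_pe(pe_seq, pe_stru, W):
    """Hedged sketch of the nonlinear fusion of sequential and structural
    absolute position encodings: tanh over a linear projection of their
    concatenation (an illustrative choice of nonlinear function)."""
    concat = np.concatenate([pe_seq, pe_stru], axis=-1)   # (I, 2d)
    return np.tanh(concat @ W)                            # (I, d)

I, d = 6, 8
rng = np.random.default_rng(2)
pe_seq = rng.normal(size=(I, d))    # sequential position encodings
pe_stru = rng.normal(size=(I, d))   # structural position encodings
W = rng.normal(size=(2 * d, d))     # assumed trainable projection
fused = fuse_absolute_pe(pe_seq, pe_stru, W)
print(fused.shape)  # (6, 8)
```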

Related Work
There has been growing interest in improving the representation power of SANs (Dou et al., 2018, 2019; Yang et al., 2018, 2019a,b; Wu et al., 2018; Sukhbaatar et al., 2019). Among these approaches, a straightforward strategy is to augment the SANs with position representations (Shaw et al., 2018; Ma et al., 2019; Bello et al., 2019), since the position representations directly participate in the element-wise attention computation. In this work, we propose to augment SANs with structural position representations to model the latent structure of the input sentence. Our work is also related to structure modeling for SANs, as the proposed model utilizes the dependency tree to generate structural representations. Recently, Hao et al. (2019c,b) integrated recurrence into SANs and empirically demonstrated that the hybrid models achieve better performance by modeling sentence structure. Hao et al. (2019a) further make use of multi-head attention to form multi-granularity self-attention, capturing phrases of different granularities in source sentences. The difference is that we treat the position representation as a medium to transfer structure information from the dependency tree into the SANs.

Table 1: Impact of the position encoding components on the Chinese⇒English NIST02 development set using the Transformer-Base model. "Abs." and "Rel." denote absolute and relative position encoding, respectively. "Spd." denotes the decoding speed (sentences/second) on a Tesla M40; the speed of the structural position encoding strategies includes the dependency parsing step.

Experiment
We conduct experiments on the widely used NIST Chinese⇒English and WMT14 English⇒German data, and report the 4-gram BLEU score (Papineni et al., 2002).
Chinese⇒English The training dataset consists of about 1.25 million sentence pairs. The NIST 2002 (MT02) dataset is used as the development set, and the NIST 2003, 2004, 2005, and 2006 datasets are used as test sets. We use the byte-pair encoding (BPE) toolkit with 32K merge operations to alleviate the out-of-vocabulary problem.
English⇒German We use the dataset consisting of about 4.5 million sentence pairs as the training set. The newstest2013 and newstest2014 sets are used as the development and test sets, respectively. We also apply BPE with 32K merge operations to obtain sub-word units. We evaluate the proposed position encoding strategies on TRANSFORMER (Vaswani et al., 2017) and implement them on top of THUMT. We use the Stanford parser (Klein and Manning, 2003) to parse the sentences and obtain the absolute and relative structural positions as described in Section 3. When using relative structural position encoding, we use a clipping distance of r = 16. For a fair comparison, we validate the different position encoding strategies on the encoder and keep the TRANSFORMER decoder unchanged.
We study the variations of the BASE model on Chinese⇒English task, and evaluate the overall performance with the BIG model on both translation tasks.

Model Variations
We evaluate the importance of the proposed absolute and relative structural position encoding strategies and study the model variations with the Transformer-Base model on Chinese⇒English data. The experimental results on the development set are shown in Table 1.

Effect of Position Encoding
We first remove the sequential position encoding from the Transformer encoder (Model #1) and observe that translation performance degrades dramatically (28.33 − 44.31 = −15.98 BLEU), which demonstrates the necessity of position encoding.

Effect of Structural Position Encoding
We then validate the proposed structural position encoding strategies over the position-unaware model (Models #2-3). The absolute and relative structural position encoding strategies improve translation performance by 7.10 and 5.90 BLEU points, respectively, showing that both proposed structural positions benefit translation quality.

Combination of Sequential and Structural Position Encoding Strategies
We integrate the absolute and relative structural position encoding strategies into the Base model equipped with absolute sequential position encoding (Models #4-6). The two proposed approaches achieve improvements over the Base model with only a marginal decrease in decoding speed.
Finally, we validate the proposed structural position encoding over the Base model equipped with both absolute and relative sequential position encoding (Models #7-9). Relative sequential encoding obtains a 0.71 BLEU point improvement (Model #7 vs. Model #4), and structural position encoding achieves a further improvement of 0.65 BLEU points (Model #9 vs. Model #7), demonstrating the effectiveness of the proposed structural position encoding strategies.

Main Results
We validate the proposed structural encoding strategies with the Transformer-Big model on Chinese⇒English and English⇒German data, and list the results in Table 2. For Chinese⇒English, structural position encoding (+ Structural PE) outperforms the Transformer-Big by 0.59 BLEU points on average over the four NIST test sets. The relative sequential encoding approach (+ Relative Sequential PE) outperforms the Transformer-Big by 0.53 BLEU points, and structural position encoding (+ Structural PE) achieves a further improvement of up to +0.40 BLEU points, outperforming the Transformer-Big by 0.93 BLEU points. For English⇒German, a similar phenomenon is observed, which reveals that the proposed structural position encoding strategy consistently boosts translation performance over both the absolute and relative sequential position representations.

Table 2: Hao et al. (2019c) is a Transformer-Big model that adopts an additional recurrence encoder with an attentive recurrent network to model syntactic structure. "↑ / ⇑": significant over the Transformer-Big (p < 0.05/0.01), tested by bootstrap resampling (Koehn, 2004).

Table 3: Performance on linguistic probing tasks, conducted by evaluating the linguistic knowledge embedded in the Transformer-Base encoder outputs. "Base", "+ Rel. Seq. PE", and "+ Stru. PE" denote the Transformer-Base model, Transformer-Base with relative sequential PE, and Transformer-Base with both relative sequential PE and structural PE, respectively.

Linguistic Probing Evaluation
We conduct probing tasks (Conneau et al., 2018) to evaluate the structural knowledge embedded in the encoder outputs of the Base model variations trained on the En⇒De translation task.
We follow the model configurations of previous work for the probing classifiers. The experimental results on the probing tasks are shown in Table 3; the BLEU scores of "Base", "+ Rel. Seq. PE", and "+ Stru. PE" are 27.31, 27.99, and 28.30, respectively. From the table, we can see that: 1) adding the relative sequential positional embedding achieves an improvement over the baseline on semantic tasks (75.03 vs. 74.61), which may indicate that the model benefits more from semantic modeling; and 2) with the structural positional embedding, the model obtains an improvement on syntactic tasks (65.87 vs. 64.98), which indicates that the representations preserve more syntactic knowledge.

Conclusion
In this paper, we have presented a novel structural position encoding strategy to augment SANs by considering the latent structure of the input sentence. We extract absolute and relative structural positions from the dependency tree and integrate them into SANs. Experimental results on Chinese⇒English and English⇒German translation tasks have demonstrated that the proposed approach consistently improves translation performance over both the absolute and relative sequential position representations.
Future directions include inferring the structure representations from AMR (Song et al., 2019) or external SMT knowledge (Wang et al., 2017). Furthermore, the structural position encoding can also be applied to the decoder with RNN Grammars (Dyer et al., 2016; Eriguchi et al., 2017), which we leave for future work.