Extracting Syntactic Trees from Transformer Encoder Self-Attentions

This is a work in progress on extracting sentence tree structures from the encoder's self-attention weights of the Transformer neural network architecture when translating into another language. We visualize the structures and discuss their characteristics with respect to existing syntactic theories and annotations.


Introduction
Interpreting neural networks is a popular topic, and many works focus on analyzing networks with respect to learning syntax (Shi et al., 2016; Linzen et al., 2016; Blevins et al., 2018).
In particular, Vaswani et al. (2017) showed that the self-attentions in their Transformer architecture may be directly interpreted as syntactic dependencies between tokens. However, there is a potential problem in the fact that the attention mechanism on deeper layers operates on the previous-layer neurons, which already comprise mixed information from multiple source tokens.
Our goal is to infer source sentence tree structures from the encoder's self-attention energies used in the Transformer neural machine translation (NMT) system. We would like to visualize how the self-attention mechanism connects individual words (or wordpieces) of the sentence, to create various tree structures (e.g. constituency trees, undirected trees, dependency trees), and to discuss their characteristics with respect to the existing syntactic theories and annotations. We would also like to discuss results across various languages and natural language processing (NLP) tasks.
In this abstract, we present our preliminary results, analyzing the encoder in English-to-German NMT within the NeuralMonkey toolkit (Helcl and Libovický, 2017). We introduce aggregation of self-attention through layers to get a distribution over the input tokens for each encoder position and layer (Section 2). We then propose algorithms for constructing two types of syntactic trees (Sections 3 and 4), apply them to 42 sentences sampled from PennTB (Marcus et al., 1993), and compare the resulting structures to established syntax annotation styles, such as that of PennTB, UD (Nivre et al., 2016), or PDT (Böhmová et al., 2003).

Aggregated self-attention visualization
We use the default setting: the encoder is composed of 6 layers, each consisting of a 16-head self-attention mechanism and a fully connected feed-forward network, both bridged by residual connections. Each position in one layer can attend to all positions in the previous layer; the attention to the same position is boosted by the residual connection. When translating a single sentence by the Transformer, we would like to capture how much each input token affects each particular position on each layer in the encoder. This is done by aggregating the attention distributions through the layers. For each layer, we collect the self-attention distribution to the previous layer and add +1 to the same-position attention for the residual connection. The output is then normalized. So far, we take the attention distribution as the average attention over all the 16 heads.
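The layer-wise aggregation described above can be sketched as follows; the function name and the use of plain matrix products are our own illustration, not the toolkit's actual code:

```python
import numpy as np

def aggregate_attention(layer_attentions):
    """Aggregate per-layer self-attention into a distribution over input
    tokens for each position and layer.

    layer_attentions: list of (n, n) arrays, one per encoder layer, each
    row a distribution over the previous layer's positions (already
    averaged over the 16 heads).
    """
    n = layer_attentions[0].shape[0]
    # Before the first layer, each position holds exactly its own input token.
    prev = np.eye(n)
    aggregated = []
    for att in layer_attentions:
        # Add +1 to the same-position attention for the residual connection,
        # then renormalize each row to a distribution.
        att = att + np.eye(n)
        att = att / att.sum(axis=1, keepdims=True)
        # Compose with the aggregation of the layers below, so each row
        # becomes a distribution over the original input tokens.
        prev = att @ prev
        aggregated.append(prev)
    return aggregated
```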

Constituency trees extraction
In Figure 1, we can see that the self-attention mechanism is quite strong within phrases, which led us to the idea of extracting phrase-structure trees from it. We define the score of a constituent with span from position i to position j as

score(i, j) = ∏_{x=i..j} ∑_{y=i..j} w[x, y],

where w[x, y] is the attention weight of the token y in the position x. We then build a binary constituency tree by recursively splitting the sentence. When splitting a phrase with span (i, j), we look for a position k maximizing the scores of the two resulting phrases: arg max_k (score(i, k) · score(k + 1, j)).
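A minimal sketch of the splitting procedure, assuming one plausible reading of the constituent score (the product, over positions in the span, of the attention mass each position puts inside the span); the function names are hypothetical:

```python
import numpy as np

def span_score(w, i, j):
    """Score of span [i, j]: product, over positions x in the span, of the
    attention mass x puts inside the span (w[x, y] = attention weight of
    token y at position x).  An assumed reading of the paper's score."""
    return float(np.prod(w[i:j + 1, i:j + 1].sum(axis=1)))

def split(w, i, j):
    """Binary constituency tree over span [i, j]: pick the split point k
    maximizing span_score(i, k) * span_score(k + 1, j), then recurse."""
    if i == j:
        return i
    k = max(range(i, j),
            key=lambda k: span_score(w, i, k) * span_score(w, k + 1, j))
    return (split(w, i, k), split(w, k + 1, j))
```

The paper additionally assigns zero score to constituents that separate pieces of a single word, so wordpieces stay together; that constraint is omitted here for brevity.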
We also rejoin wordpieces into words, assigning zero scores to constituents separating pieces of a single word. One example is shown in Figure 2. When compared to PennTB, clauses, noun phrases, and shorter verb phrases are often well recognized. The differences are mainly inside them and in how they are composed together to form clauses.

Undirected trees extraction
First, for each pair of tokens i, j, we calculate a coattention score, expressing how common it is for the tokens to be attended to at the same time:

coatt(i, j) = ∑_x w[x, i] · w[x, j].

We then construct an undirected tree maximizing the coattention scores along its edges, using the algorithm of Kruskal (1956); see the bottom tree in Figure 2. We have found the resulting trees to bear surprising similarities to standard syntactic dependency trees (which, however, are directed).
For example, we observe many flat treelets, resembling headed syntactic phrases; the "phrase heads" (bold) are mostly content words, while the function words are mostly attached as leaf nodes (as in UD). We hypothesize that the encoder tries to concentrate the representation of the whole phrase onto the position of a single token, ideally one that already carries a lot of meaning.
Furthermore, the phrase treelets are then typically connected to each other via these heads (as in UD), and/or via a sort of connector tokens at phrase boundaries (underlined), such as commas, conjunctions, or prepositions (as in PDT).
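The coattention scores and the spanning-tree construction above can be sketched as follows, assuming the coattention of tokens i and j is the sum, over positions x, of the product of x's attention weights to i and to j; the names and this exact formula are our illustration:

```python
import numpy as np

def coattention(w):
    """Coattention score for each token pair (i, j): how strongly some
    position attends to both i and j at once (assumed reading:
    sum over x of w[x, i] * w[x, j])."""
    return w.T @ w

def kruskal_max_tree(scores):
    """Undirected spanning tree maximizing the scores along its edges:
    Kruskal's algorithm with a union-find, edges taken in descending
    order of score."""
    n = scores.shape[0]
    parent = list(range(n))

    def find(a):
        # Find the set representative, with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = sorted(((scores[i, j], i, j)
                    for i in range(n) for j in range(i + 1, n)),
                   reverse=True)
    tree = []
    for s, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree
```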

Future work
In the future, we would like to (1) analyze how the trees evolve through layers, (2) employ unsupervised or supervised selection of "more syntactic" heads, and (3) perform the experiments on more language pairs; in particular, we hope that translation into multiple languages could push the encoder towards a more syntactic internal representation.