Universal Dependencies according to BERT: both more specific and more general

This work focuses on analyzing the form and extent of syntactic abstraction captured by BERT by extracting labeled dependency trees from self-attentions. Previous work showed that individual BERT heads tend to encode particular dependency relation types. We extend these findings by explicitly comparing BERT relations to Universal Dependencies (UD) annotations, showing that they often do not match one-to-one. We suggest a method for relation identification and syntactic tree construction. Our approach produces significantly more consistent dependency trees than previous work, showing that it better explains the syntactic abstractions in BERT. At the same time, it can be successfully applied with only a minimal amount of supervision and generalizes well across languages.


Introduction and Related Work
In recent years, systems based on the Transformer architecture have achieved state-of-the-art results in language modeling (Devlin et al., 2018) and machine translation (Vaswani et al., 2017). Additionally, contextual embeddings obtained from the intermediate representations of these models have brought improvements in various NLP tasks. Multiple recent works analyze such latent representations (Linzen et al., 2019), observe syntactic properties in some Transformer self-attention heads, and extract syntactic trees from the attention matrices (Raganato and Tiedemann, 2018; Mareček and Rosa, 2019; Clark et al., 2019).
In our work, we focus on the comparative analysis of the syntactic structure, examining how the BERT self-attention weights correspond to Universal Dependencies (UD) syntax (Nivre et al., 2016). We confirm the findings of Vig and Belinkov (2019) and Voita et al. (2019) that in Transformer-based systems particular heads tend to capture specific dependency relation types (e.g. in one head the attention at the predicate is usually focused on the nominal subject).
We extend the understanding of syntax in BERT by examining the ways in which it systematically diverges from the standard annotation (UD). We attempt to bridge the gap between them in three ways:
• We modify the UD annotation to better match the BERT syntax (§3)
• We introduce a head ensemble method, combining multiple heads which capture the same dependency relation label (§4)
• We observe and analyze multipurpose heads, containing multiple syntactic functions (§7)
Finally, we apply our observations to improve the method of extracting dependency trees from attention (§5), and analyze the results both in a monolingual and a multilingual setting (§6).
Our method crucially differs from probing (Belinkov et al., 2017; Hewitt and Manning, 2019). We do not use treebank data to train a parser; rather, we extract dependency relations directly from selected attention heads. We only employ syntactically annotated data to select the heads; however, this means estimating only a small set of binary parameters, and only a small amount of data is sufficient for that purpose (§6.1).

Models and Data
We analyze the uncased base BERT model for English, which we will refer to as enBERT, and the uncased multilingual BERT model, mBERT, for English, German, French, and Czech. The code shared by Clark et al. (2019) substantially helped us in extracting attention weights from BERT.
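For illustration, the attention matrices can also be obtained directly with the HuggingFace transformers library rather than the code of Clark et al. (2019); the following minimal sketch (our own, with hypothetical variable names) only shows the shape of the objects involved:

```python
import torch
from transformers import BertModel, BertTokenizer

# enBERT corresponds to "bert-base-uncased";
# for mBERT, use "bert-base-multilingual-uncased" instead.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "a stable , green economy"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, heads, seq_len, seq_len)
attentions = torch.stack(outputs.attentions).squeeze(1)  # (12 layers, 12 heads, seq, seq)

# Note: wordpiece rows/columns still have to be merged back to UD tokens and the
# [CLS]/[SEP] positions removed before the attention can be compared to UD trees.
```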
To find syntactic heads, we use 1000 EuroParl multi-parallel sentences in the four languages, automatically annotated with UDPipe (Straka and Straková, 2017). For evaluation, we use the PUD treebanks from the CoNLL 2017 Shared Task (Nivre et al., 2017).

Adapting UD to BERT
Since the explicit dependency structure is not used in BERT training, the syntactic dependencies captured in the latent layers are expected to diverge from the annotation guidelines. After initial experiments, we have observed that some of the differences are systematic (see Table 1). Based on these observations, we modify the UD annotations in our experiments to better fit the BERT syntax, using UDApi (Popel et al., 2017). We note that for copulas and coordinations, the BERT syntax resembles, e.g., Surface-Syntactic UD (SUD) (Gerdes et al., 2018). Nevertheless, we decided to use our custom modification, since some systematic divergences between SUD and the latent representations occur as well.
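The exact modifications we apply are listed in Table 1 and implemented with UDApi. Purely as a hypothetical illustration of the kind of rewrite involved (not the actual rule set), the sketch below re-hangs a copula so that it heads its clause, in the spirit of SUD, over a simple parent-array representation; the label "comp" and the set of re-attached dependents are our own assumptions:

```python
def rehang_copula(heads, deprels):
    """Hypothetical illustration of one UD rewrite in the spirit of SUD:
    the copula becomes the head of the clause. heads[i] is the 0-based
    parent index of token i (-1 for root), deprels[i] its relation label.
    This is not the rule set of Table 1, only an example of the kind of
    transformation applied with UDApi in our experiments."""
    heads, deprels = list(heads), list(deprels)
    for cop, rel in enumerate(deprels):
        if rel != "cop":
            continue
        pred = heads[cop]  # the nominal/adjectival predicate the copula depends on
        # the copula takes over the predicate's attachment ...
        heads[cop], deprels[cop] = heads[pred], deprels[pred]
        # ... and the predicate is attached under the copula (hypothetical label)
        heads[pred], deprels[pred] = cop, "comp"
        # clause-level dependents of the predicate are re-attached to the copula
        for i, (h, r) in enumerate(zip(list(heads), list(deprels))):
            if i != cop and h == pred and r in {"nsubj", "csubj", "advmod", "mark", "punct"}:
                heads[i] = cop
    return heads, deprels
```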

Head Ensemble
In line with Voita et al. (2019) and other studies, we have noticed that a specific syntactic relation type can often be found in a specific head. Additionally, we observe that a single head often captures only a specific aspect or subtype of one UD relation type, motivating us to combine multiple heads to cover the full relation.
Figure 1 shows attention weights of two syntactic heads (right columns) and their average (left column). In the top row (purple), both heads identify the parent noun for an adjectival modifier: Head 9 in Layer 3 when their distance is two positions or less, and Head 10 in Layer 7 when they are further apart (as in "a stable , green economy"). Similarly, for the object-to-predicate relation (blue row), Head 9 in Layer 7 and Head 8 in Layer 3 capture pairs with shorter and longer positional distances, respectively.

Dependency Accuracy of Heads
To quantify the amount of syntactic information conveyed by a self-attention head with attention matrix A, we compute its dependency accuracy for a directed relation label:

\[ \mathrm{DepAcc}_{L,A} = \frac{\left|\{\, l_{i,j} \in L : \operatorname{argmax}(A[i]) = j \,\}\right|}{|L|} \]

where L is the set of all dependency relations with the same label (for instance, predicate → subject); l_{i,j} denotes a relation from the i-th to the j-th token of the sentence; and A[i] is the i-th row of the attention matrix A. Please note that the measure is sensitive to the direction of the relation (parent to dependent, p2d, or dependent to parent, d2p). In this article, when we say that a head with attention matrix A is syntactic for a directed relation type L, we mean that its DepAcc_{L,A} is high.
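As a minimal sketch (our own illustration, not the released code), DepAcc for one head and one directed relation label can be computed as follows, assuming the attention matrix has already been aligned with the UD tokenization:

```python
import numpy as np

def dep_acc(attn: np.ndarray, relations) -> float:
    """DepAcc for one head and one directed relation label.

    attn: square attention matrix aligned with UD tokens (attn[i] is the
          attention distribution of token i over all tokens).
    relations: iterable of (i, j) pairs, i.e. relations l_{i,j} oriented in
               the direction being evaluated (d2p or p2d).
    """
    relations = list(relations)
    if not relations:
        return 0.0
    hits = sum(1 for i, j in relations if int(np.argmax(attn[i])) == j)
    return hits / len(relations)
```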

Method
Having observed that some heads convey only partial information about a UD relation, we propose a method to combine the knowledge of multiple heads.
Our objective is to find a set of heads for each directed relation such that their attention weights, after averaging, have a high dependency accuracy. The algorithm is straightforward: we define the maximum number N of heads in the subset and sort the heads by their DepAcc on the development set; starting from the most syntactic one, we check whether including the head's attention matrix in the average would increase DepAcc, and if it does, the head is added to the ensemble. When there are already N heads in the ensemble, a newly added head may substitute one added before, so as to maximize the DepAcc of the averaged attention matrices. We set N to 4, as allowing larger ensembles does not improve the results significantly (Figure 2). The code will be released at https://github.com/Tom556/BERTHeadEnsembles.
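The selection loop can be sketched as follows; this is our own illustrative re-implementation, and the function names and data layout are assumptions rather than the released code:

```python
import numpy as np

def ensemble_dep_acc(ensemble, heads, gold):
    """DepAcc of the averaged attention matrices of the heads in `ensemble`.

    heads: {head_id: [one attention matrix per development sentence]}
    gold:  [list of (i, j) relation pairs per sentence] for one directed label.
    """
    hits = total = 0
    for s, rels in enumerate(gold):
        avg = np.mean([heads[h][s] for h in ensemble], axis=0)
        hits += sum(1 for i, j in rels if int(np.argmax(avg[i])) == j)
        total += len(rels)
    return hits / max(total, 1)

def select_ensemble(heads, gold, max_size=4):
    """Greedy selection of up to `max_size` heads for one directed relation."""
    ranked = sorted(heads, key=lambda h: ensemble_dep_acc([h], heads, gold), reverse=True)
    ensemble = []
    for h in ranked:
        if len(ensemble) < max_size:
            candidates = [ensemble + [h]]
        else:
            # a new head may substitute one selected earlier
            candidates = [ensemble[:k] + ensemble[k + 1:] + [h] for k in range(max_size)]
        best = max(candidates, key=lambda ens: ensemble_dep_acc(ens, heads, gold))
        if not ensemble or ensemble_dep_acc(best, heads, gold) > ensemble_dep_acc(ensemble, heads, gold):
            ensemble = best
    return ensemble
```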

Dependency Tree Construction
To extract dependency trees from self-attention weights, we use a method similar to Raganato and Tiedemann (2018), which employs a maximum spanning tree algorithm (Edmonds, 1966) and uses gold information about the root of the syntax tree.
We use the following steps to construct a labeled dependency tree:
1. For each non-clausal UD relation label, syntactic head ensembles are selected as described in Section 4, and the attention matrices in each ensemble are averaged. Hence, we obtain two matrices for each label (one for each direction: "dependent to parent" and "parent to dependent").
2. The "dependent to parent" matrix is transposed and averaged with the "parent to dependent" matrix. We use a weighted geometric average, where the weights correspond to the dependency accuracy values for the given direction.
3. We compute the final dependency matrix by max-pooling over all individual relation-label matrices from step 2. At the same time, we save the syntactic relation label that was used for each position in the final matrix.
4. In the final matrix, we set the row corresponding to the gold root to zero, to assure it will be the root in the final tree as well.
5. We use the Chu-Liu/Edmonds algorithm (Edmonds, 1966) to find the maximum spanning tree. Each edge is assigned the label saved in step 3 (a sketch of steps 2–5 is given below).
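The following sketch illustrates steps 2–5; it is our own simplified re-implementation, and the matrix orientation, function names, and the use of networkx's Edmonds implementation for the maximum spanning arborescence are assumptions rather than the released code:

```python
import numpy as np
import networkx as nx

def build_tree(label_matrices, dep_accs, gold_root):
    """label_matrices: {label: (d2p, p2d)} averaged ensemble matrices for one
    sentence; dep_accs: {label: (DepAcc_d2p, DepAcc_p2d)}; gold_root: index of
    the gold root token. In the combined matrix, rows index dependents and
    columns index candidate parents, so zeroing the gold root's row removes
    all of its candidate parents.
    """
    n = next(iter(label_matrices.values()))[0].shape[0]
    final = np.zeros((n, n))
    final_labels = np.full((n, n), "", dtype=object)

    for label, (d2p, p2d) in label_matrices.items():
        w_d2p, w_p2d = dep_accs[label]
        # step 2: weighted geometric average of the two directions
        combined = (d2p ** w_d2p * p2d.T ** w_p2d) ** (1.0 / (w_d2p + w_p2d))
        # step 3: max-pooling over relation labels, remembering the winner
        better = combined > final
        final[better] = combined[better]
        final_labels[better] = label

    # step 4: the gold root must not receive a parent
    final[gold_root, :] = 0.0

    # step 5: maximum spanning arborescence (Chu-Liu/Edmonds)
    graph = nx.DiGraph()
    for dep in range(n):
        for head in range(n):
            if dep != head and final[dep, head] > 0.0:
                graph.add_edge(head, dep, weight=final[dep, head],
                               label=final_labels[dep, head])
    tree = nx.maximum_spanning_arborescence(graph, attr="weight", preserve_attrs=True)
    # edges are (parent, dependent); each carries the label saved in step 3
    return [(h, d, tree.edges[h, d]["label"]) for h, d in tree.edges]
```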
It is important to note that the total number of heads used for tree construction can be at most 4 × 12 × 2 = 96 (number of heads per ensemble × number of considered labels × two directions). However, the number of heads actually used is typically much lower (see Table 3).
As far as we know, we are the first to construct labeled dependency trees from attention matrices in the Transformer. Moreover, we have extended the previous approach by using an ensemble of heads instead of a single head. Note that objects also include indirect objects (iobj).

Dependency Accuracy
In Table 2, we present results for the dependency accuracy (Section 4.1) of a single head, a four-head ensemble, and the positional baseline. Noticeably, a single attention head surpasses the baseline for every relation label in at least one direction. The average of 4 heads surpasses the baseline by more than 10% for every relation.
Ensembling brings the most considerable improvement for nominal subjects (p2d: +13.3 pp) and noun modifiers (p2d: +13.2 pp).The relative change of accuracy is more evident for clausal relations than non-clausal.Dependent to parent direction has higher accuracy for modifiers (except adverbial modifiers), functional relations, and objects, whereas parent to dependent favors other nominal relations (nominal subject and nominal modifiers).
Introducing the UD modifications (Section 3) had a significant effect for nominal subjects. Without these modifications, the accuracy for the parent to dependent direction would drop from 76.0% to 70.1%.

Selection Supervision
The selection of syntactic heads requires annotated data for accuracy evaluation. In Figure 3, we examine how many annotated sentences are sufficient, using 1, 10, 20, 50, 100, or 1000 sentences. The evaluation set was not altered.
For non-clausal relations (Figure 3a), head selection on just 10 annotated sentences allows us to surpass the positional baseline.Using over 20 examples brings only a minor improvement.For the more complex clausal relations (Figure 3b), the score improves steadily with more data.However, even for the full corpus, it is relatively low, since the clausal relations are less frequent in the corpus and harder to identify due to longer distances between dependent and parent.

Dependency Tree Construction
In Table 3, we report the evaluation results on the English PUD treebank (Nivre et al., 2017). We compare our approach with left- and right-branching baselines with gold root information, and with the highest score obtained by Raganato and Tiedemann (2018), who used the neural machine translation Transformer model and extracted whole trees from a single attention head. Also, they did not perform direction averaging. The results show that ensembling multiple attention heads for each relation label allows us to construct much better trees than the single-head approach. The number of unique heads used in the process turned out to be about half of the maximum possible number (96). This is because many heads appear in multiple ensembles. We examine this further in Section 7.
Furthermore, to the best of our knowledge, we are the first to produce labeled trees and report both UAS and LAS.
Just for reference, the recent unsupervised parser (Han et al., 2019) obtains 61.4% UAS.However, the results are not comparable, since the parser uses information about gold POS tags, and the results were measured on different evaluation data (WSJ Treebank).
Ablation We analyze how much the particular steps described in Section 5 influence the quality of the constructed trees. We also repeat the experimental setting proposed by Raganato and Tiedemann (2018) on the enBERT model, to see whether a language model is better suited to capture syntax than a translation system. Additionally, we alter the procedure described in Section 5 to analyze which decisions influence our results the most, i.e., we change:
• the size of the head ensembles;
• the number of sentences used for head selection;
• the use of one head ensemble for all relation labels in each direction (hence, we do not conduct the max-pooling described in Section 5, point 3).
In Table 3, we see that the method of Raganato and Tiedemann (2018) applied to enBERT produces slightly worse trees than the same method applied to the neural machine translation model. If we do not use ensembles, i.e., only one head is used per relation label and direction, our pipeline from Section 5 offers only a 0.2 pp rise in UAS and a poor LAS. The analysis shows that the introduction of head ensembles of size four brought the most significant improvement to our method of tree construction, roughly +15 pp for both variants (with and without labels).
Together with the findings in Section 6.1, this supports our claim that syntactic information is spread across many of the Transformer's heads. Interestingly, max-pooling over labeled matrices improves UAS only by 0.8 pp. Nevertheless, this step is necessary to construct labeled trees. The performance is competitive even with as little as 20 sentences used for head selection, which is in line with our findings from Section 6.1.

Multilingual Setting
Table 4 shows that for English, the dependency accuracy and UAS decreased only slightly when changing the model from enBERT to mBERT, while LAS saw a 0.1 pp increase. The model captures syntax comparably well in German and French. The worse results for Czech may be caused by the smaller amount of mBERT training data for this language, which leads to sentences being split into a higher number of shorter wordpieces than the corresponding sentences in the other considered languages.

In this experiment, we examine whether a single mBERT head can perform multiple syntactic functions in a multilingual setting. We choose an ensemble for each syntactic relation for each language. Figure 4 presents the sizes of the intersections between head sets for different languages and dependency types.

Multiple Syntactic Functions
We can see a significant overlap for the relations of adjective modifiers, auxiliaries, and determiners pointing to their governor. The shared heads tend to find the root of the syntactic phrase. Interestingly, common heads occur even for relations typically belonging to verb and noun phrases, such as auxiliaries and adjective modifiers. In our other experiments, we have noticed that these heads do not focus their attention on any particular part of speech. Similarly, objects and noun modifiers share at least one head for all languages. They have a similar function in a sentence; however, they connect to a verb and a noun, respectively. Such behavior was also observed in the monolingual model. Figure 5 presents attention weights of two heads that belong to the intersection of the adjective modifier, auxiliary, and determiner dependent-to-parent ensembles.

Multilingual
The representation of mBERT is language independent to some extent (Pires et al., 2019). Thus, a natural question to ask is whether the same mBERT heads encode the same syntactic relations for different languages. In particular, subject relations tend to be encoded by similar heads in different languages, and these heads rarely belong to an ensemble for other dependency labels.
Notably, for adjective modifiers, the French ensemble has two heads in common with the German one, although the preferred order of adjective and noun differs between these two languages. Attention weights of one of these heads for parallel sentences in French, German, English, and Czech are presented in Figure 6.

Conclusion
We have expanded the knowledge about the representation of syntax in the self-attention heads of the Transformer architecture. We modified the UD annotation to fit the BERT syntax better. We analyzed the phenomenon of information about one dependency relation being split among many heads, as well as the opposite situation, where one head has multiple syntactic functions.
Our method of head ensembling improved on previous results for dependency relation retrieval and the extraction of syntactic trees. As far as we know, this is the first work to conduct such an analysis for languages other than English.
We also hypothesize that the proposed method could improve dependency parsing in a low supervision setting.

Figure 1: Examples of two enBERT attention heads covering the same relation label, and their average. Gold relations are marked by red letters. An extended version can be found in the appendix.

Figure 3: Dependency accuracy against the number of sentences used for selection.

Figure 4: Number of mBERT heads shared between ensembles, both within and across languages.

Figure 5: Syntactic enBERT heads retrieving the parent for three relation labels: Adjective modifiers, AuXiliaries, and Determiners. UD relations are marked by A, X, and D, respectively.

Figure 6: A single mBERT head which identifies the noun heads of French adjective modifiers. It also partially captures the relation in German, English, and Czech, although these languages, unlike French, follow the "Adjective Noun" order.

Table 1: Comparison of the original Universal Dependencies annotations (edges above) and our modification (edges below).

Table 2: Dependency accuracy for single heads, 4-head ensembles, and positional baselines. The evaluation was done using the pretrained enBERT model and the modified UD described in Section 3.

Table 3: Evaluation results for different settings of dependency tree extraction. UD modifications were not applied here. (*In the Raganato+ experiments, the trees were induced from each encoder head, but we report only the results for the head with the highest UAS on 1000 test sentences.)

Table 4: Average dependency accuracy for non-clausal relations (with UD modification), and UAS and LAS of constructed trees (w/o UD modification). mBERT was used for all languages.