A Matter of Framing: The Impact of Linguistic Formalism on Probing Results

Deep pre-trained contextualized encoders like BERT (Devlin et al., 2019) demonstrate remarkable performance on a range of downstream tasks. A recent line of research in probing investigates the linguistic knowledge implicitly learned by these models during pre-training. While most work in probing operates on the task level, linguistic tasks are rarely uniform and can be represented in a variety of formalisms. Any linguistics-based probing study thereby inevitably commits to the formalism used to annotate the underlying data. Can the choice of formalism affect probing results? To investigate, we conduct an in-depth cross-formalism layer probing study in role semantics. We find linguistically meaningful differences in the encoding of semantic role and proto-role information by BERT depending on the formalism and demonstrate that layer probing can detect subtle differences between the implementations of the same linguistic formalism. Our results suggest that linguistic formalism is an important dimension in probing studies, along with the commonly used cross-task and cross-lingual experimental settings.


Introduction
The emergence of deep pre-trained contextualized encoders has had a major impact on the field of natural language processing. Boosted by the availability of general-purpose frameworks like AllenNLP and Transformers (Wolf et al., 2019), pre-trained models like ELMo and BERT (Devlin et al., 2019) have caused a shift towards simple architectures where a strong pre-trained encoder is paired with a shallow downstream model, often outperforming the intricate task-specific architectures of the past.
The versatility of pre-trained representations implies that they encode some aspects of general linguistic knowledge (Reif et al., 2019). Indeed, even an informal inspection of layer-wise intra-sentence similarities (Fig. 1) suggests that these models capture elements of linguistic structure, and that these elements differ depending on the layer of the model. A grounded investigation of these regularities makes it possible to interpret the model's behavior, design better pre-trained encoders and inform the development of downstream models. Such investigation is the main subject of probing, and recent studies confirm that BERT implicitly captures many aspects of language use, lexical semantics and grammar (Rogers et al., 2020).
Most probing studies use linguistics as a theoretical scaffolding and operate on a task level. However, there often exist multiple ways to represent the same linguistic task: for example, English dependency syntax can be encoded using a variety of formalisms, incl. Universal (Schuster and Manning, 2016), Stanford (de Marneffe and Manning, 2008) and CoNLL-2009 dependencies (Hajič et al., 2009), all using different label sets and syntactic head attachment rules. Any probing study inevitably commits to the specific theoretical framework used to produce the underlying data. The differences between linguistic formalisms, however, can be substantial.
Can these differences affect the probing results? This question is intriguing for several reasons. Linguistic formalisms are well-documented, and if the choice of formalism indeed has an effect on probing, cross-formalism comparison will yield new insights into the linguistic knowledge obtained by contextualized encoders during pre-training. If, alternatively, the probing results remain stable despite substantial differences between formalisms, this prompts further scrutiny of what the pre-trained encoders in fact encode. Finally, on the reverse side, cross-formalism probing might be used as a tool to empirically compare the formalisms and their language-specific implementations. To the best of our knowledge, we are the first to explicitly address the influence of formalism on probing.
Ideally, the task chosen for a cross-formalism study should be encoded in multiple formalisms using the same textual data to rule out the influence of the domain and text type. While many linguistic corpora contain several layers of linguistic information, having the same textual data annotated with multiple formalisms for the same task is rare. We focus on role semantics, a family of shallow semantic formalisms at the interface between syntax and propositional semantics that assign roles to the participants of natural language utterances, determining who did what to whom, where, when etc. Decades of research in theoretical linguistics have produced a range of role-semantic frameworks that have been operationalized in NLP: syntax-driven PropBank (Palmer et al., 2005), coarse-grained VerbNet (Kipper-Schuler, 2005), fine-grained FrameNet (Baker et al., 1998), and, recently, decompositional Semantic Proto-Roles (SPR) (Reisinger et al., 2015; White et al., 2016). The SemLink project (Bonial et al., 2013) offers parallel annotation for PropBank, VerbNet and FrameNet for English. This allows us to isolate the object of our study: apart from the role-semantic labels, the underlying data and conditions for the three formalisms are identical. SR3de (Mújdricza-Maydt et al., 2016) provides compatible annotation in three formalisms for German, enabling cross-lingual validation of our results. Combined, these factors make role semantics an ideal target for a cross-formalism probing study.
A solid body of evidence suggests that encoders like BERT capture syntactic and lexical-semantic properties, but only a few studies have considered probing for predicate-level semantics (Tenney et al., 2019b; Kovaleva et al., 2019). To the best of our knowledge, we are the first to conduct a cross-formalism probing study on role semantics, thereby contributing to the line of research on how and whether pre-trained BERT encodes higher-level semantic phenomena.
Contributions. This work studies the effect of the linguistic formalism on probing results. We conduct cross-formalism experiments on PropBank, VerbNet and FrameNet role prediction in English and German, and show that the formalism can affect probing results in a linguistically meaningful way; in addition, we demonstrate that layer probing can detect subtle differences between implementations of the same formalism in different languages. On the technical side, we advance the recently introduced edge and layer probing framework (Tenney et al., 2019b); in particular, we introduce anchor tasks, an analytical tool inspired by feature-based systems that allows deeper qualitative insights into the pre-trained models' behavior. Finally, advancing the current knowledge about the encoding of predicate semantics in BERT, we perform a fine-grained semantic proto-role probing study and demonstrate that semantic proto-role properties can be extracted from pre-trained BERT, contrary to the existing reports. Our results suggest that along with task and language, linguistic formalism is an important dimension to be accounted for in probing research.

BERT as Encoder
BERT is a Transformer (Vaswani et al., 2017) encoder pre-trained by jointly optimizing two unsupervised objectives: masked language model and next sentence prediction. It uses WordPiece (WP, Wu et al. (2016)) subword tokens along with positional embeddings as input, and gradually constructs sentence representations by applying token-level self-attention pooling over a stack of layers L. The result of BERT encoding is a layer-wise representation of the input wordpiece tokens with higher layers representing higher-level abstractions over the input sequence. Thanks to the joint pre-training objective, BERT can encode words and sentences in a unified fashion: the encoding of a sentence or a sentence pair is stored in a special token [CLS].
To facilitate multilingual experiments, we use the multilingual BERT-base (mBERT) published by Devlin et al. (2019). Although several recent encoders have outperformed BERT on benchmarks (Lan et al., 2019; Raffel et al., 2019), we use the original BERT architecture, since it allows us to inherit the probing methodology and to build upon the related findings.

Probing
Due to space limitations we omit high-level discussions on benchmarking (Wang et al., 2018) and sentence-level probing (Conneau et al., 2018a), and focus on the recent findings related to the representation of linguistic structure in BERT. Surface-level information generally tends to be represented in the lower layers of deep encoders, while higher layers store hierarchical and semantic information (Belinkov et al., 2017; Lin et al., 2019). Tenney et al. (2019a) show that the abstraction strategy applied by the English pre-trained BERT encoder follows the order of the classical NLP pipeline. Strengthening the claim about linguistic capabilities of BERT, Hewitt and Manning (2019) demonstrate that BERT implicitly learns syntax, and Reif et al. (2019) show that it encodes fine-grained lexical-semantic distinctions. Rogers et al. (2020) provide a comprehensive overview of BERT's properties discovered to date.
While recent results indicate that BERT successfully represents lexical-semantic and grammatical information, the evidence of its high-level semantic capabilities is inconclusive. Tenney et al. (2019a) show that the English PropBank semantics can be extracted from the encoder and follows syntax in the layer structure. However, out of all formalisms PropBank is most closely tied to syntax, and the results on proto-role and relation probing do not follow the same pattern. Kovaleva et al. (2019) identify two attention heads in BERT responsible for FrameNet relations. However, they find that disabling them in a fine-tuning evaluation on the GLUE (Wang et al., 2018) benchmark does not result in decreased performance.
Although we are not aware of any systematic studies dedicated to the effect of formalism on probing results, the evidence of such effects is scattered across the related work: for example, the aforementioned results in Tenney et al. (2019a) show a difference in layer utilization between constituent- and dependency-based syntactic probes and semantic role and proto-role probes. It is not clear whether this effect is due to the differences in the underlying datasets and task architecture, or the formalism per se.
Our probing methodology builds upon the edge and layer probing framework. The encoding produced by a frozen BERT model can be seen as a layer-wise snapshot that reflects how the model has constructed the high-level abstractions. Tenney et al. (2019b) introduce the edge probing task design: a simple classifier is tasked with predicting a linguistic property given a pair of spans encoded using a frozen pre-trained model. Tenney et al. (2019a) use edge probing to analyze the layer utilization of a pre-trained BERT model via scalar mixing weights learned during training. We revisit this framework in Section 3.

Role Semantics
We now turn to the object of our investigation: role semantics. For further discussion, consider the following synthetic example: a. Despite surface-level differences, the sentences express the same meaning, suggesting an underlying semantic representation in which these sentences are equivalent. One such representation is offered by role semantics, a shallow predicate-semantic formalism closely related to syntax. In terms of role semantics, Mary, book and John are semantic arguments of the predicate give, and are assigned roles from a pre-defined inventory, for example, Agent, Recipient and Theme.
Semantic roles and their properties have received extensive attention in linguistics (Fillmore, 1968;Levin and Rappaport Hovav, 2005;Dowty, 1991) and are considered a universal feature of human language. The size and organization of the role and predicate inventory are subject to debate, giving rise to a variety of role-semantic formalisms.
PropBank assumes a predicate-independent labeling scheme where predicates are distinguished by their sense (get.01), and semantic arguments are labeled with generic numbered core (Arg0-5) and modifier (e.g. AM-TMP) roles. Core roles are not tied to specific definitions, but an effort has been made to keep the role assignments consistent for similar verbs; Arg0 and Arg1 correspond to the Proto-Agent and Proto-Patient roles as per Dowty (1991). The semantic interpretation of core roles depends on the predicate sense.
VerbNet follows a different categorization scheme. Motivated by the regularities in verb behavior, Levin (1993) has introduced the grouping of verbs into intersective classes (ILC). This methodology has been adopted by VerbNet: for example, the VerbNet class get-13.5.1 would include verbs earn, fetch, gain etc. A verb in VerbNet can belong to several classes corresponding to different senses; each class is associated with a set of roles and licensed syntactic transformations. Unlike PropBank, VerbNet uses a set of approx. 30 thematic roles that have universal definitions and are shared among predicates, e.g. Agent, Beneficiary, Instrument.
FrameNet takes a meaning-driven stance on the role encoding by modeling it in terms of frame semantics: predicates are grouped into frames (e.g. Commerce_buy), which specify role-like slots to be filled. FrameNet offers fine-grained frame distinctions, and roles in FrameNet are frame-specific, e.g. Buyer, Seller and Money. The resource accompanies each frame with a description of the situation and its core and peripheral participants.
SPR follows the work of Dowty (1991) and discards the notion of categorical semantic roles in favor of feature bundles. Instead of a fixed role label, each argument is assessed via an 11-dimensional cardinal feature set including Proto-Agent and Proto-Patient properties like volitional, sentient, destroyed, etc. The feature-based approach eliminates some of the theoretical issues associated with categorical role inventories and allows for more flexible modeling of role semantics.
Each of the role labeling formalisms offers certain advantages and disadvantages (Giuglea and Moschitti, 2006; Mújdricza-Maydt et al., 2016). While being close to syntax and thereby easier to predict, PropBank does not contribute much semantics to the representation. On the opposite side of the spectrum, FrameNet offers rich predicate-semantic representations for verbs and nouns, but suffers from high granularity and coverage gaps (Hartmann et al., 2017). VerbNet takes a middle ground by following grammatical criteria while still encoding coarse-grained semantics, but only focuses on verbs and core (not modifier) roles. SPR avoids the granularity-generalization trade-off of the categorical inventories, but is yet to find its way into practical NLP applications.

Probing Methodology
We take the edge probing setup by Tenney et al. (2019b) as our starting point. Edge probing aims to predict a label given a pair of contextualized span or word encodings. More formally, we encode a WP-tokenized sentence $[wp_1, wp_2, \ldots, wp_k]$ with a frozen pre-trained model, producing contextual embeddings $[e_1, e_2, \ldots, e_k]$, each of which is a layered representation over $L = \{l_0, l_1, \ldots, l_m\}$ layers, with the encoding at layer $l_n$ for the wordpiece $wp_i$ further denoted as $e_i^n$. A trainable scalar mix is applied to the layered representation to produce the final encoding $e_i$ given the per-layer mixing weights $\{a_0, a_1, \ldots, a_m\}$ and a scaling parameter $\gamma$:
$$e_i = \gamma \sum_{n=0}^{m} \mathrm{softmax}(a)_n \, e_i^n$$
Given the source src and target tgt wordpieces encoded as $e_{src}$ and $e_{tgt}$, our goal is to predict the label $y$.
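The scalar mix described above can be illustrated with a small sketch. It assumes the usual form of the operation, a softmax-normalized weighted sum over layers multiplied by γ; the function names `softmax` and `scalar_mix` and the toy vectors are hypothetical and not taken from the authors' implementation.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layer_encodings, mixing_weights, gamma=1.0):
    """Collapse the per-layer vectors of one wordpiece into a single vector.

    layer_encodings: list of m+1 vectors (one per layer), all of equal length
    mixing_weights:  list of m+1 unnormalized scalars a_0..a_m
    gamma:           global scaling parameter
    """
    if len(layer_encodings) != len(mixing_weights):
        raise ValueError("need exactly one mixing weight per layer")
    probs = softmax(mixing_weights)
    dim = len(layer_encodings[0])
    mixed = [0.0] * dim
    for prob, layer in zip(probs, layer_encodings):
        for d in range(dim):
            mixed[d] += prob * layer[d]
    return [gamma * value for value in mixed]


# Toy example (made-up numbers): three layers, two-dimensional encodings.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = scalar_mix(layers, [0.0, 0.0, 0.0])             # every layer counts equally
peaked = scalar_mix(layers, [0.0, 100.0, 0.0], gamma=2.0)  # dominated by layer 1
```

With equal mixing weights the result is the plain average of the layers; a large weight on one layer makes the mix collapse onto that layer, which is what makes the learned weights interpretable as layer utilization.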
Due to its task-agnostic architecture, edge probing can be applied to a wide variety of unary (by omitting tgt) and binary labeling tasks in a unified manner, facilitating the cross-task comparison. The original setup has several limitations that we address in our implementation.
Regression tasks. The original edge probing only considers classification tasks. Many language phenomena, including positional information and semantic proto-roles, are naturally modeled as regression, and we extend the original model by supporting both classification and regression: the former achieved via softmax, the latter via direct linear regression to the target value.
Flat model. To decrease the models' own expressive power (Hewitt and Liang, 2019), we keep the number of parameters in our probing model as low as possible. While Tenney et al. (2019b) utilize pooled self-attentional span representations and a projection layer to enable cross-model comparison, we directly feed the wordpiece encoding into the classifier, using the first wordpiece of a word. To further increase the selectivity of the model, we directly project the source and target wordpiece representations into the label space, as opposed to the two-layer MLP classifier used in the original setup.
Separate scalar mixes. To enable fine-grained analysis of probing results, we train and analyze separate scalar mixes for source and target wordpieces, motivated by the fact that the classifier might utilize different aspects of their representation for prediction. Indeed, we find that the mixing weights learned for source and target wordpieces might show substantial and linguistically meaningful variation.
Sentence-level probes. Utilizing the BERT-specific sentence representation [CLS] allows us to incorporate the sentence-level natural language inference (NLI) probe into our kit.
Anchor tasks. We employ two analytical tools from the original layer probing setup. Mixing weight plotting compares layer utilization among tasks by visually aligning the respective learned weight distributions transformed via a softmax function. Layer center-of-gravity is used as a summary statistic for a task's layer utilization.
While the distribution of mixing weights along the layers allows us to estimate the order in which information is processed during encoding, it does not allow us to directly assess the similarity between the layer utilization of the probing tasks. Tenney et al. (2019a) have demonstrated that the order in which linguistic information is stored in BERT mirrors the traditional NLP pipeline. A prominent property of the NLP pipelines is their use of low-level features to predict downstream phenomena. In the context of layer probing, probing tasks can be seen as end-to-end feature extractors. Following this intuition, we define two groups of probing tasks: target tasks, the main tasks under investigation, and anchor tasks, a set of related tasks that serve as a basis for qualitative comparison between the targets. The softmax transformation of the scalar mixing weights allows us to treat them as probability distributions: the higher the mixing weight of a layer, the more likely the probe is to utilize information from this layer during prediction. We use Kullback-Leibler divergence to compare target tasks (e.g. role labeling in different formalisms) in terms of their similarity to lower-level anchor tasks (e.g. dependency relation and lemma). Note that the notion of anchor task is contextual: the same task can serve as a target and as an anchor, depending on the focus of the study.
... corpus annotations to the corresponding FrameNet and VerbNet senses and semantic roles. We use these mappings to enrich the CoNLL-2009 (Hajič et al., 2009) dependency role labeling data, also based on the original PropBank, with roles in all three formalisms via a semi-automatic token alignment procedure. The resulting corpus is substantially smaller than the original, but still an order of magnitude larger than SR3de (Table 1). Both corpora are richly annotated with linguistic phenomena on word level, including part-of-speech, lemma and syntactic dependencies.
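The modifications listed above (regression support, a flat projection, separate scalar mixes, unary sentence-level probes) can be combined into one small sketch. It is an illustration under stated assumptions, not the authors' AllenNLP code: the class name `FlatEdgeProbe`, the concatenation of the source and target vectors before the projection, and the zero initialization are all hypothetical choices, and training is omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix(layers, weights, gamma):
    """Softmax-weighted sum of per-layer vectors, scaled by gamma."""
    probs = softmax(weights)
    dim = len(layers[0])
    return [gamma * sum(p * layer[d] for p, layer in zip(probs, layers))
            for d in range(dim)]


class FlatEdgeProbe:
    """One linear projection over [src; tgt]; no hidden layer (assumed design)."""

    def __init__(self, n_layers, dim, n_outputs, unary=False):
        self.unary = unary
        # Separate scalar mixes: one weight vector and one gamma per token role.
        self.src_mix = [0.0] * n_layers
        self.tgt_mix = [0.0] * n_layers
        self.src_gamma = 1.0
        self.tgt_gamma = 1.0
        in_dim = dim if unary else 2 * dim
        # Zero initialization keeps this toy example deterministic.
        self.weight = [[0.0] * in_dim for _ in range(n_outputs)]
        self.bias = [0.0] * n_outputs

    def scores(self, src_layers, tgt_layers=None):
        features = mix(src_layers, self.src_mix, self.src_gamma)
        if not self.unary:
            features += mix(tgt_layers, self.tgt_mix, self.tgt_gamma)
        return [sum(w * f for w, f in zip(row, features)) + b
                for row, b in zip(self.weight, self.bias)]

    def classify(self, src_layers, tgt_layers=None):
        """Classification head: softmax over the label scores."""
        return softmax(self.scores(src_layers, tgt_layers))

    def regress(self, src_layers, tgt_layers=None):
        """Regression head: the single linear output is the prediction."""
        return self.scores(src_layers, tgt_layers)[0]


# Toy inputs (made-up numbers): three layers, two-dimensional encodings.
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]

label_probs = FlatEdgeProbe(n_layers=3, dim=2, n_outputs=4).classify(src, tgt)

regressor = FlatEdgeProbe(n_layers=3, dim=2, n_outputs=1)
regressor.weight[0] = [1.0, 1.0, 1.0, 1.0]
predicted_value = regressor.regress(src, tgt)

sentence_probe = FlatEdgeProbe(n_layers=3, dim=2, n_outputs=3, unary=True)
nli_probs = sentence_probe.classify(src)  # src plays the role of [CLS]
```

Because the only trainable parameters are the two scalar mixes and one projection matrix, any signal the probe picks up has to be linearly available in the mixed layer representation.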
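As a minimal sketch of the center-of-gravity statistic, the snippet below assumes it is computed as the expected layer index under the softmax-normalized mixing weights; the function name and the example weights are hypothetical.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def center_of_gravity(mixing_weights):
    """Expected layer index under the softmax-normalized mixing weights."""
    probs = softmax(mixing_weights)
    return sum(index * prob for index, prob in enumerate(probs))


# 13 weights, e.g. an embedding layer plus 12 Transformer layers (assumed layout).
flat_profile = center_of_gravity([0.0] * 13)
# Made-up weights that favor the upper layers.
late_profile = center_of_gravity([0.0] * 9 + [2.0] * 4)
```

A probe that relies on deeper layers yields a higher value, so the statistic gives a single number per task that can be compared across tasks and languages.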
The XNLI probe is sourced from the corresponding development split of the XNLI (Conneau et al., 2018b) dataset. The SPR probing tasks are extracted from the original data by Reisinger et al. (2015).

Probing tasks
Our probing kit spans a wide range of probing tasks, ranging from primitive surface-level tasks mostly utilized as anchors later to high-level semantic tasks that aim to provide a representational upper bound to predicate semantics. We follow the training, test and development splits from the original SR3de, CoNLL-2009 and SPR data. The XNLI task is sourced from the development set and only used for scalar mix analysis. To reduce the number of labels in some of the probing tasks, we collect frequency statistics over the corresponding training sets and only consider up to the 250 most frequent labels. Below we define the tasks in order of their complexity; Table 2 provides the probing task statistics, Table 3 compares the categorical role labeling formalisms in terms of granularity, and Table 4 provides examples. We evaluate the classification performance using Accuracy, while regression tasks are scored via Mean Squared Error.
Table 3: Number of role labels per formalism and language.
language    en    de
PropBank     5    10
VerbNet     23    29
FrameNet   189   300
Token type (ttype) predicts the type of a word. This requires contextual processing since a word might consist of several wordpieces.
Token position (token.ix) predicts the linear position of a word, cast as a regression task over the first 20 words in the sentence. Again, the task is non-trivial since it requires the words to be assembled from the wordpieces.
Part-of-speech (pos) predicts the language-specific part-of-speech tag for the given token.
Lexical unit (lex.unit) predicts the lemma and POS of the given word, a common input representation for the entries in lexical resources. We extract coarse POS tags by using the first character of the language-specific POS tag.
Dependency relation (deprel) predicts the dependency relation between the parent src and dependent tgt tokens.
Semantic role (role.[frm]) predicts the semantic role given a predicate src and an argument tgt token in one of the three role labeling formalisms: PropBank pb, VerbNet vn and FrameNet fn. Note that we only probe for the role label, and the model has no access to the verb sense information from the data.
Semantic proto-role (spr.[prop]) is a set of eleven regression tasks predicting the values of the proto-role properties as defined by Reisinger et al. (2015), given a predicate src and an argument tgt.
XNLI is a sentence-level NLI task directly sourced from the corresponding dataset. Given two sentences, the goal is to determine whether an entailment or a contradiction relationship holds between them. We use NLI to investigate the layer utilization of mBERT for high-level semantic tasks. We extract the sentence pair representation via the [CLS] token and treat it as a unary probing task.
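The label cut-off and the two evaluation metrics mentioned above can be sketched as follows. The text does not say how instances with infrequent labels are treated, so dropping them is an assumption made here for illustration; all function names are hypothetical.

```python
from collections import Counter


def frequent_labels(train_labels, max_labels=250):
    """Return the set of the max_labels most frequent training labels."""
    counts = Counter(train_labels)
    return {label for label, _ in counts.most_common(max_labels)}


def restrict(instances, kept_labels):
    """Keep only (input, label) pairs whose label survived the cut-off.

    Dropping the remaining instances is an assumption, not a documented step.
    """
    return [(x, y) for x, y in instances if y in kept_labels]


def accuracy(gold, predicted):
    """Share of exact label matches, used for classification probes."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def mean_squared_error(gold, predicted):
    """Average squared difference, used for regression probes."""
    return sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold)


# Toy data (made up): keep the two most frequent labels only.
kept = frequent_labels(["A", "A", "A", "B", "B", "C"], max_labels=2)
filtered = restrict([("x", "A"), ("y", "C")], kept)
```

The cut-off is computed on the training set only, so the label inventory is fixed before any development or test instance is seen.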
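Two of the task definitions above are simple enough to sketch directly: the lexical unit label built from a lemma and a coarse POS tag, and the regression targets for token position. The `lemma.POS` string format, the zero-based position index and the function names are assumptions made for illustration.

```python
def coarse_pos(pos_tag):
    """First character of a language-specific POS tag, e.g. 'VBD' -> 'V'."""
    return pos_tag[:1]


def lexical_unit(lemma, pos_tag, separator="."):
    """Join lemma and coarse POS into one label (format is an assumption)."""
    return lemma + separator + coarse_pos(pos_tag)


def token_position_targets(words, limit=20):
    """(word index, regression target) pairs for the first `limit` words.

    Zero-based positions are an assumption; the text only fixes the limit.
    """
    return [(index, float(index)) for index, _ in enumerate(words[:limit])]


english_unit = lexical_unit("give", "VBD")
german_unit = lexical_unit("Buch", "NN")
positions = token_position_targets(["w"] * 30)
```

Collapsing the tag to its first character merges fine-grained tags such as the English verb tags into one class, which keeps the lexical unit inventory smaller than a lemma-by-full-tag product.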

Results
Our models are implemented using AllenNLP. We train the probes for 20 epochs using the Adam optimizer with default parameters and a batch size of 32. Due to the frozen encoder and flat model architecture, the total runtime of the experiments is under 8 hours on a single Tesla V100 GPU.

General Trends
While absolute performance is secondary to our analysis, we report the probing task scores on respective development sets in Table 5. We observe that grammatical tasks score high, while core role labeling lags behind, in line with the findings of Tenney et al. (2019a). We observe lower scores for German role labeling, which we attribute to the lack of training data. Surprisingly, as we show below, this does not prevent the edge probe from learning to locate relevant role-semantic information in mBERT's layers.
Our results mirror the findings of Tenney et al. (2019a) about the sequential processing order in BERT. We observe that the layer utilization among tasks (Fig. 2) aligns for English and German (see footnote 4), although we note that in terms of center-of-gravity mBERT tends to utilize deeper layers for German probes. Basic word-level tasks are indeed processed early by the model, and XNLI probes focus on deeper levels, suggesting that the representation of higher-level semantic phenomena follows the encoding of syntax and predicate semantics.

The Effect of Formalism
Using separate scalar mixes for source and target tokens allows us to explore the cross-formalism encoding of role semantics by mBERT in detail. The layer utilization of the role labeling probes differs drastically for predicate and argument tokens. While the argument representation (role.* tgt) mostly focuses on the same layers as the dependency parsing probe, the layer utilization of the predicates (role.* src) is affected by the chosen formalism. PropBank predicate token mixing weights emphasize the same layers as dependency parsing, in line with the previously published results. However, the probes for VerbNet and FrameNet predicates (role.vn src and role.fn src) utilize the layers associated with ttype and lex.unit that contain lexical information. Coupled with the fact that both VerbNet and FrameNet assign semantic roles based on lexical-semantic predicate groupings (frames in FrameNet and verb classes in VerbNet), this suggests that the lower layers of mBERT implicitly encode predicate sense information; moreover, sense encoding for VerbNet utilizes deeper layers of the model associated with syntax, in line with VerbNet's predicate classification strategy. This finding confirms that the formalism can indeed have linguistically meaningful effects on probing results.
Footnote 4: Echoing the recent findings on mBERT's multilingual capacity (Pires et al., 2019; Kondratyuk and Straka, 2019).
Figure 3: Anchor task analysis of SRL formalisms.

Anchor Tasks in the Pipeline
We now use the scalar mixes of the role labeling probes as target tasks, and lower-level probes as anchor tasks to qualitatively explore the differences between how our role probes learn to represent predicates and semantic arguments (Fig. 3). The results reveal a distinctive pattern: while the predicate layer utilization src is similar to the scalar mixes learned for ttype and lex.unit, the learned argument representations tgt attend to the layers associated with dependency relation and POS probes, and the pattern reproduces for English and German. This aligns with the traditional separation of the semantic role labeling task into predicate disambiguation followed by semantic argument identification and labeling, along with the feature sets employed for these tasks (Björkelund et al., 2009). Note that the observation about the pipeline-like task processing within the BERT encoders thereby holds, albeit on a sub-task level.
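The anchor task comparison used here can be sketched as a Kullback-Leibler divergence between softmax-normalized mixing weights. The direction of the divergence (target relative to anchor), the function names and the toy weights are assumptions for illustration; they are not values from the experiments.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same layers; assumes q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def anchor_profile(target_weights, anchor_weights):
    """Divergence of one target probe from each named anchor probe."""
    target = softmax(target_weights)
    return {name: kl_divergence(target, softmax(weights))
            for name, weights in anchor_weights.items()}


# Made-up mixing weights over three layers.
anchors = {"lex.unit": [2.0, 1.0, 0.0], "deprel": [0.0, 1.0, 2.0]}
predicate_profile = anchor_profile([2.0, 1.0, 0.0], anchors)
closest_anchor = min(predicate_profile, key=predicate_profile.get)
```

A lower divergence means the target probe distributes its weight over the layers in much the same way as the anchor probe, which is the sense in which a predicate mix can be called similar to lex.unit or an argument mix similar to deprel.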

Formalism Implementations
Both layer and anchor task analysis reveal a prominent discrepancy between English and German role probing results: while the PropBank predicate layer utilization for English mostly relies on syntactic information, German PropBank predicates behave similarly to VerbNet and FrameNet. The difference in the number of role labels for English and German PropBank (Table 3) points to a difference between the two labeling schemes: while English PropBank labels are assigned in a predicate-independent manner, German PropBank, following the same numbered labeling scheme, keeps this scheme consistent within the frame. We assume that this incentivizes the probe to learn semantic verb groupings and is reflected in our probing results. The ability of the probe to detect subtle differences between formalism implementations constitutes a new use case for probing, and a promising direction for future studies.

Encoding of Proto-Roles
We now turn to the probing results for decompositional semantic proto-role labeling tasks. Unlike Tenney et al. (2019b), who used a multi-label classification probe, we treat SPR properties as separate regression tasks. The results in Table 6 show that the performance varies by property, with some of the properties attaining reasonably low MSE scores despite the simplicity of the probe architecture and the small dataset size. We do not observe a clear performance trend depending on whether the property is associated with Proto-Agent or Proto-Patient. Our fine-grained, property-level task design allows for more detailed insights into the layer utilization by the SPR probes (Fig. 4). The results indicate that while the layer utilization on the predicate side (src) shows no clear preference for particular layers (similar to the results obtained by Tenney et al. (2019a)), some of the proto-role features follow the pattern seen in the categorical role labeling and dependency parsing tasks for the argument tokens tgt. With few exceptions, we observe that the properties displaying that behavior are Proto-Agent properties; moreover, a close examination of the results on syntactic preference by Reisinger et al. (2015, p. 483) reveals that these properties are also the ones with strong preference for the subject position, including the outlier case of stationary, which in their data behaves like a Proto-Agent property. The correspondence is not strict, and we leave an in-depth investigation of the reasons behind these discrepancies for follow-up work.
Figure 4: Layer utilization for SPR properties (instigation, volition, awareness, sentient, change.of.location, exists.as.physical, created, destroyed, changes.possession, change.of.state, stationary), shown per layer for src and tgt.

Conclusion
We have demonstrated that the choice of linguistic formalism can have substantial, linguistically meaningful effects on role-semantic probing results. We have shown how probing classifiers can be used to detect discrepancies between formalism implementations, and presented evidence of semantic proto-role encoding in the pre-trained mBERT model. Our refined implementation of the edge probing framework coupled with the anchor task methodology enabled new insights into the processing of predicate-semantic information within mBERT. Our findings show that linguistic formalism is an important factor to be accounted for in probing studies. While our work illustrates this point using a single task and a single probing framework, the influence of linguistic formalism per se is likely to be present for any probing setup that builds upon linguistic material. An investigation of how, whether, and why formalisms affect probing results for tasks beyond role labeling and for frameworks beyond edge probing constitutes an exciting avenue for future research.