Asking without Telling: Exploring Latent Ontologies in Contextual Representations

The success of pretrained contextual encoders, such as ELMo and BERT, has generated a great deal of interest in what these models learn: do they, without explicit supervision, learn to encode meaningful notions of linguistic structure? If so, how is this structure encoded? To investigate this, we introduce latent subclass learning (LSL): a modification to existing classifier-based probing methods that induces a latent categorization (or ontology) of the probe's inputs. Without access to fine-grained gold labels, LSL extracts emergent structure from input representations in an interpretable and quantifiable form. In experiments, we find strong evidence of familiar categories, such as a notion of personhood in ELMo, as well as novel ontological distinctions, such as a preference for fine-grained semantic roles on core arguments. Our results provide unique new evidence of emergent structure in pretrained encoders, including departures from existing annotations which are inaccessible to earlier methods.


Introduction
The success of self-supervised pretrained models in NLP (Devlin et al., 2019; Peters et al., 2018a; Radford et al., 2019; Lan et al., 2020) has stimulated interest in how these models work and, motivated by their strong performance on many tasks (Wang et al., 2018), what they learn about language. Recent work on model analysis (Belinkov and Glass, 2019) indicates that they may learn a lot about linguistic structure, including part of speech (Belinkov et al., 2017a), syntax (Blevins et al., 2018; Marvin and Linzen, 2018), word sense (Peters et al., 2018a; Reif et al., 2019), and more (Tenney et al., 2019b; Liu et al., 2019a).
Many of these results are based on predictive methods, such as probing, which measure how well a linguistic variable can be predicted from intermediate representations. However, the ability of supervised probes to fit weak features makes it difficult to find unbiased answers about how those representations are structured (Saphra and Lopez, 2019; Voita et al., 2019). Descriptive methods like clustering and visualization explore this structure directly, but provide limited control and often regress to dominant categories such as lexical features (Singh et al., 2019) or word sense (Reif et al., 2019). This leaves open many questions: how are linguistic features like entity types, syntactic dependencies, or semantic roles represented by an encoder like ELMo (Peters et al., 2018a) or BERT (Devlin et al., 2019)? To what extent do familiar categories like PropBank roles or Universal Dependencies appear naturally? Do these unsupervised encoders learn their own categorization of language?

Figure 1: LSL overview. A probing classifier over contextual embeddings produces multi-class latent logits, which are marginalized into a single logit trained on binary classification. In this example, "Pierre Vinken" is identified as a named entity and assigned to latent class 2, which aligns well with the PERSON label. We treat the classes as clusters representing a latent ontology that describes the underlying representation space. Figure 2 visualizes latent logits in more detail.

* Work performed while at Google.
To tackle these questions, we propose a systematic method for extracting latent ontologies, or discrete categorizations of a representation space, which we call latent subclass learning; see Figure 1 for an overview. In LSL, we use a binary classification task (such as detecting entity mentions or syntactic dependency arcs) as weak supervision to induce a set of latent clusters relevant to that task (i.e., entity or dependency types). As with predictive methods, the choice of targets allows us to explore different phenomena, and the induced clusters can be quantified and measured against gold annotations. But also, as with descriptive methods, our clusters can be inspected and qualified directly, and observations have high specificity: agreement with external (e.g., gold) categories can be taken as strong evidence that those categories are salient in the representation space.
We describe the LSL classifier in Section 3, and apply it to the edge probing paradigm (Tenney et al., 2019b) in Section 4. In Section 5 we evaluate LSL on multiple encoders, including ELMo and BERT. We find that LSL induces stable and consistent ontologies, which include both striking rediscoveries of gold categories (for example, ELMo discovers personhood of named entities, and BERT similarly has a notion of dates) as well as novel ontological distinctions (such as fine-grained semantic roles for core arguments) which are not easily observed by fully supervised probes. Overall, we find unique new evidence of emergent latent structure in our encoders, while also revealing new properties of their representations which are inaccessible to earlier methods.

Background
Predictive analysis A common form of model analysis is predictive: assessing how well a linguistic variable can be predicted from a model, whether in intrinsic behavioral tests (Goldberg, 2019; Marvin and Linzen, 2018) or extrinsic probing tasks.
Probing involves training lightweight classifiers over features produced by a pretrained model, and assessing the model's knowledge by the probe's performance. Probing has been used for low-level properties such as word order and sentence length (Adi et al., 2017; Conneau et al., 2018), as well as phenomena at the level of syntax (Hewitt and Manning, 2019), semantics (Tenney et al., 2019b; Liu et al., 2019b; Clark et al., 2019), and discourse structure (Chen et al., 2019). Error analysis on probes has been used to argue that BERT may simulate sequential decision making across layers (Tenney et al., 2019a), or that it encodes its own, soft notion of syntactic distance (Reif et al., 2019).
Predictive methods such as probing are flexible: any task with data can be assessed. However, they only track predictability of pre-defined categories, limiting their descriptive power. In addition, a powerful enough probe, given enough data, may be insensitive to differences between encoders, making it difficult to interpret results based on accuracy (Saphra and Lopez, 2019; Zhang and Bowman, 2018). So, many probing experiments appeal to the ease of extraction of a linguistic variable (Pimentel et al., 2020). Existing work has measured this by controlling for the capacity of the probe, either by making relative claims between layers and encoders (Belinkov et al., 2017b; Blevins et al., 2018; Tenney et al., 2019b; Liu et al., 2019a) or using explicit measures to estimate and trade off probe capacity with accuracy (Hewitt and Liang, 2019; Voita and Titov, 2020). An alternative is to control the amount of supervision, whether by restricting training set size (Zhang and Bowman, 2018), comparing learning curves (Talmor et al., 2019), or using description length with online coding (Voita and Titov, 2020).
We extend this further by removing the distinction between gold categories in the training data and reducing the supervision to binary classification, as explained in Section 3. This extreme measure makes our test high specificity, in the sense that positive results (i.e., when comprehensible categories are recovered by our probe) are much stronger, since a category must be essentially invented without direct supervision.
Descriptive analysis In contrast to predictive methods, which assess an encoder against particular data, descriptive methods analyze models on their own terms, and include clustering, visualization (Reif et al., 2019), and correlation analysis techniques (Voita et al., 2019; Saphra and Lopez, 2019; Abnar et al., 2019; Chrupała and Alishahi, 2019). Descriptive methods produce high-specificity tests of what structure is present in the model, and facilitate discovery of new patterns that weren't hypothesized prior to testing. However, they lack the flexibility of predictive methods. Clustering results tend to be dominated by principal components of the embedding space, which correspond to only some salient aspects of linguistic knowledge, such as lexical features (Singh et al., 2019) and word sense (Reif et al., 2019). Alternatively, more targeted latent variable analysis techniques generally have a restricted inventory of inputs, such as layer mixing weights (Peters et al., 2018b) or transformer attention distributions (Clark et al., 2019). As a result of these issues, it is more difficult to discover the underlying structure corresponding to rich, layered ontologies. Our approach retains the advantages of descriptive methods, while admitting more control, as the choice of binary classification targets can guide the LSL model to discover structure relevant to a particular linguistic task.

Figure 2 (caption, fragment): … On the other hand, BERT strongly identifies dates (DATE) and organizations (ORG), and both models group numeric/quantitative entities together. Both models separate small CARDINAL numbers (roughly, seven or less) and group them with ORDINALs, separate from larger CARDINALs. The outlined areas in the bottom-right of the ELMo visualization include 2 and 4 induced clusters, respectively.

Linguistic ontologies
Questions of what encoders learn about language require well-defined linguistic ontologies, or meaningful categorizations of inputs, to evaluate against. Most analysis work uses formalisms from the classical NLP pipeline, such as part-of-speech and syntax from the Penn Treebank (Marcus et al., 1993) or Universal Dependencies (Nivre et al., 2015), semantic roles from PropBank (Palmer et al., 2005) or Dowty (1991)'s Proto-Roles (Reisinger et al., 2015), and named entities, which have a variety of available ontologies (Pradhan et al., 2007; Ling and Weld, 2012; Choi et al., 2018). Work on ontology-free, or open, representations suggests that the linguistic structure captured by traditional ontologies may be encoded in a variety of possible ways (Banko et al., 2007; He et al., 2015) while being annotatable at large scale (Fitzgerald et al., 2018). This raises the question: when looking for linguistic knowledge in pretrained encoders, what exactly should we expect to find? Predictive methods are useful for fitting an encoder to an existing ontology; but do our encoders latently hold their own ontologies as well? If so, what do they look like? That is the question we investigate in this work.

Approach
We propose a way to extract latent linguistic ontologies from pretrained encoders and systematically compare them to existing gold ontologies. We use a classifier based on latent subclass learning (Section 3.1), which is applicable in any binary classification setting. We propose several quantitative metrics to evaluate the induced ontologies (Section 3.2), providing a starting point for qualitative analysis (Section 5) and future research.

Latent Subclass Learning
Consider a logistic regression classifier over inputs x ∈ R^d. It outputs probabilities according to the following formula:

    P(y = 1 | x) = σ(w · x) = 1 / (1 + exp(−w · x)),

where w ∈ R^d is a learned parameter vector. Instead, we propose the latent subclass learning classifier:

    P_LSL(y = 1 | x) = Σ_{i=1}^{N} exp((Wx)_i) / (1 + Σ_{i=1}^{N} exp((Wx)_i)) = σ(logsumexp(Wx)),

where W ∈ R^{N×d} is a parameter matrix, and N is a hyperparameter corresponding to the number of latent classes.
This corresponds to (N+1)-way multiclass logistic regression with a fixed 0 baseline logit for a null class, but trained on binary classification by marginalizing over the N non-null classes (Figure 1). The vector Wx ∈ R^N may then be treated as a set of latent logits for a random variable C(x) ∈ {1, . . . , N} defined by the softmax distribution. Taking the hard maximum of Wx assigns a latent class Ĉ(x) to each input, which may be viewed as a weakly supervised clustering, learned on the basis of external supervision but not explicitly optimized to match prior gold categories.
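As a concrete illustration, the classifier can be sketched in a few lines of numpy (function and variable names here are our own, not taken from the paper's implementation):

```python
import numpy as np

def lsl_forward(W, x):
    """Latent subclass learning classifier.

    W: (N, d) parameter matrix of latent-class weights.
    x: (d,) input feature vector.
    Returns (p_positive, latent_posterior, hard_class).
    """
    logits = W @ x                       # latent logits Wx, shape (N,)
    # Append a fixed 0 logit for the null (negative) class and softmax.
    full = np.concatenate([[0.0], logits])
    probs = np.exp(full - full.max())    # stable softmax
    probs /= probs.sum()
    # Marginalize over the N non-null classes for the binary probability.
    p_positive = probs[1:].sum()
    # Posterior over latent classes, given a positive prediction.
    latent_posterior = probs[1:] / probs[1:].sum()
    # Hard cluster assignment C-hat(x).
    hard_class = int(np.argmax(logits))
    return p_positive, latent_posterior, hard_class
```

Equivalently, p_positive = σ(logsumexp(Wx)), so LSL reduces to ordinary logistic regression whose single logit is a soft maximum over latent-class logits.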
For the loss L LSL , we use the cross-entropy loss on P LSL . However, this does not necessarily encourage a diverse, coherent set of clusters; an LSL classifier may simply choose to collapse all examples into a single category, producing an uninteresting ontology. To mitigate this, we propose two clustering regularizers.
Adjusted batch-level negative entropy We wish for the model to induce a diverse ontology. One way to express this is that the expected latent class distribution has high entropy, i.e., we wish to maximize

    H(E_x[P(C(x))]).

In practice, we use the expectation over a batch. The maximum value this can take is the entropy of the uniform distribution over N items, or log N. Therefore, we wish to minimize the adjusted batch-level negative entropy loss:

    L_be = log N − H(E_x[P(C(x))]).

Instance-level entropy In addition to using all latent classes in the expected case, we also wish for the model to assign a single coherent class label to each input example. This can be done by minimizing the instance-level entropy loss:

    L_ie = E_x[H(P(C(x)))].

This also takes values in [0, log N], and we compute the expectation over a batch.
Loss We optimize the regularized LSL loss

    L = L_LSL + α L_be + β L_ie,

where α and β are hyperparameters, via gradient descent. Together, the regularizers encourage a balanced solution where the model uses many clusters yet gives each input a distinct assignment.
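The two regularizers can be sketched as follows (numpy, our own naming; as described above, both expectations are taken over a batch):

```python
import numpy as np

def lsl_regularizers(latent_probs, N):
    """Clustering regularizers over a batch.

    latent_probs: (B, N) array of per-input latent distributions P(C(x)).
    Returns (L_be, L_ie), each taking values in [0, log N].
    """
    eps = 1e-12  # numerical guard for log(0)
    # Batch-level: entropy of the mean distribution, subtracted from
    # its maximum possible value log N ("adjusted" so the minimum is 0).
    mean = latent_probs.mean(axis=0)
    H_mean = -(mean * np.log(mean + eps)).sum()
    L_be = np.log(N) - H_mean
    # Instance-level: mean entropy of each input's own distribution.
    H_each = -(latent_probs * np.log(latent_probs + eps)).sum(axis=1)
    L_ie = H_each.mean()
    return L_be, L_ie
```

Note the failure mode the text describes: uniform per-instance distributions drive L_be to 0 without producing a meaningful clustering, which is why L_ie is needed to push each input toward a confident assignment.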

Metrics
For the following metrics, we consider only points in the gold positive class.

B³ We compare induced ontologies to gold using the standard B-cubed (or B³) clustering metrics (Bagga and Baldwin, 1998). For each input point, this calculates the precision and recall of its predicted cluster against its gold cluster. These values are averaged over all points for aggregate scoring. B³ is argued to have favorable properties (Amigó et al., 2009) and allows for label-wise scoring by restricting to points with specific gold labels.
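For reference, the per-point B³ computation can be sketched as follows (our own implementation of the standard metric):

```python
from collections import Counter

def b_cubed(pred, gold):
    """B-cubed precision/recall/F1 of a clustering against gold labels.

    pred, gold: parallel lists giving each point's predicted cluster id
    and gold label. Per-point scores are averaged over all points.
    """
    joint = Counter(zip(pred, gold))   # (cluster, label) cell sizes
    pred_sizes = Counter(pred)
    gold_sizes = Counter(gold)
    n = len(pred)
    precision = recall = 0.0
    for (p, g), c in joint.items():
        # Each of the c points in this cell scores c/|its predicted cluster|
        # for precision and c/|its gold class| for recall.
        precision += c * c / pred_sizes[p]
        recall += c * c / gold_sizes[g]
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```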
Normalized PMI Pointwise mutual information (PMI) is commonly used as an association measure reflecting how likely two items (such as tokens in a corpus) are to occur together relative to chance (Church and Hanks, 1990). Normalized PMI (nPMI; Bouma, 2009) is a way of factoring out the effect of item frequency on PMI. Formally, the nPMI of two items x and y is

    nPMI(x, y) = log( P(x, y) / (P(x) P(y)) ) / (− log P(x, y)),

taking the limit value of −1 when they never occur together, 1 when they only occur together, and 0 when they occur independently. We use nPMI to analyze the co-occurrence of gold labels in predicted clusters: high-nPMI pairs are preferentially grouped together by the induced ontology, whereas low-nPMI pairs are preferentially distinguished.

Table 1 (caption, excerpt): Multi is the standard multi-class model trained directly on gold labels, and Single is the degenerate single-cluster baseline. Our clustering regularizers (batch and/or instance-level entropy), when taken together, yield a good tradeoff between diversity and uncertainty, though at some expense to binary classification accuracy.
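One way to estimate these label co-occurrence scores from induced clusters is sketched below. The counting scheme (drawing label pairs within each induced cluster) is our own assumption; the paper does not spell out its exact estimator:

```python
import math
from collections import Counter

def npmi_pairs(clusters):
    """nPMI between gold labels that co-occur in induced clusters.

    clusters: list of lists, each inner list holding the gold labels of
    the points assigned to one induced cluster. Co-occurrence events are
    unordered label pairs drawn within a cluster. Assumes more than one
    distinct pair overall (so -log P(x, y) is nonzero).
    Returns {(label_a, label_b): nPMI} for observed pairs.
    """
    pair_counts = Counter()
    label_counts = Counter()
    for labels in clusters:
        for i, a in enumerate(labels):
            label_counts[a] += 1
            for b in labels[i + 1:]:
                pair_counts[tuple(sorted((a, b)))] += 1
    total_pairs = sum(pair_counts.values())
    total_labels = sum(label_counts.values())
    out = {}
    for (a, b), c in pair_counts.items():
        p_xy = c / total_pairs
        p_x = label_counts[a] / total_labels
        p_y = label_counts[b] / total_labels
        out[(a, b)] = math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)
    return out
```

Labels that never share a cluster simply do not appear in the output, corresponding to the −1 limit of the measure.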
Diversity We desire fine-grained ontologies with many meaningful classes. The number of attested classes may not be a good measure of this, since it could include classes with very few members and no broad meaning. So we propose diversity:

    diversity = exp H(P(Ĉ)),

where P(Ĉ) is the empirical distribution of hard cluster assignments over the evaluation set. This increases as the clustering becomes more fine-grained and evenly distributed, with a maximum of N when P(Ĉ) is uniform. More generally, exponentiated entropy is sometimes referred to as the perplexity of a distribution, and corresponds (softly) to the number of classes required for a uniform distribution of the same entropy. In that sense, it may be regarded as the effective number of classes in an ontology. We use the predicted class Ĉ rather than its distribution C because we care about the diversity of the model's clustering, and not just uncertainty in the model.

Uncertainty In order for our learned classes to be meaningful, we desire distinct and coherent clusters. To measure this, we propose uncertainty:

    uncertainty = E_x[exp H(P(C(x)))].

This is also related to perplexity, but unlike diversity, it takes the expectation over the input after calculating the perplexity of the distribution. This reflects how many classes, on average, the model is confused between when provided with an input. Low values correspond to coherent clusters, with a minimum of 1 when every latent class is assigned with full confidence. As with diversity, we take the expectation over the evaluation set.
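Both metrics reduce to a few lines (numpy sketch, our own naming). Note where the expectation is taken relative to the exponentiation, which is the only difference between them:

```python
import numpy as np

def diversity(hard_assignments):
    """Exponentiated entropy of the empirical distribution of hard cluster
    assignments: the effective number of classes in the induced ontology."""
    _, counts = np.unique(hard_assignments, return_counts=True)
    p = counts / counts.sum()
    return float(np.exp(-(p * np.log(p)).sum()))

def uncertainty(latent_probs):
    """Mean per-input perplexity of the latent distributions: how many
    classes, on average, the model is confused between for one input."""
    eps = 1e-12
    H = -(latent_probs * np.log(latent_probs + eps)).sum(axis=1)
    return float(np.exp(H).mean())
```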

Experimental Setup
We adopt a similar setup to Tenney et al. (2019b) and Liu et al. (2019a), training probing models over several contextualizing encoders on a variety of linguistic tasks. While our interest is in linguistic structure, our model can be used in any binary classification setting, and our analysis methods apply whenever finer-grained labels are available to compare against.

Tasks
We cast several structure labeling tasks from Tenney et al. (2019b) as binary classification by adding negative examples, bringing the positive to negative ratio to 1:1 where possible.
Named entity labeling requires labeling noun phrases with entity types, such as person, location, date, or time. We randomly sample non-entity noun phrases as negatives.
Nonterminal labeling requires labeling phrase structure constituents with syntactic types, such as noun phrases and verb phrases. We randomly sample non-constituent spans as negatives.
Syntactic dependency labeling requires labeling token pairs with their syntactic relationship, such as a subject, direct object, or modifier. We randomly sample non-attached token pairs as negatives.
Semantic role labeling requires labeling predicates (usually verbs) and their arguments (usually syntactic constituents) with labels that abstract over syntactic relationships in favor of more semantic notions, such as agent, patient, and modifier roles involving e.g. time and place, or predicate-specific roles. We draw the closest non-attached predicate-argument pairs as negatives.

Table 2: Results by task for three pretrained encoding methods. All probing models were trained with the LSL loss and cluster regularization coefficients α = β = 1.5, and chosen by the best-of-5 consistency criterion detailed in Section 4.4. Uncertainty for all models was close to 1 and is omitted for space.
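The negative-sampling construction shared by these tasks can be sketched as follows (a hypothetical helper of our own, not from the paper's code):

```python
import random

def add_negatives(positives, candidates, seed=0):
    """Cast a labeling task as binary classification.

    positives: gold spans (or span pairs) for the task.
    candidates: pool of possible spans to sample negatives from.
    Keeps gold items as positives (label 1) and samples an equal number
    of non-gold candidates as negatives (label 0), giving a 1:1 ratio
    where possible.
    """
    rng = random.Random(seed)
    gold = set(positives)
    pool = [c for c in candidates if c not in gold]
    k = min(len(positives), len(pool))  # 1:1 where possible
    negatives = rng.sample(pool, k)
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]
```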
We use the English Web Treebank portion of Universal Dependencies 2.2 (Silveira et al., 2014) for syntactic dependencies, and the English portion of Ontonotes 5.0 (Weischedel et al., 2013) for all other tasks.

Encoders
We run experiments on the following encoders. ELMo encodes input tokens with 2-layer LSTMs (Hochreiter and Schmidhuber, 1997). BERT encodes tokens with a deep Transformer (Devlin et al., 2019); we use the BERT-large model. BERT-lex is a lexical baseline, encoding inputs with BERT-large's context-independent wordpiece embedding layer.

Probing Model
We use the model architecture of Tenney et al. (2019b), which classifies arbitrary spans or pairs of spans by leveraging pretrained encoders in the following way: 1) construct token representations by pooling across encoder layers with a learned scalar mix (Peters et al., 2018a), 2) construct span representations from these token representations using self-attentive pooling (Lee et al., 2017), and 3) concatenate those span representations and feed the result into a multi-layer perceptron to produce input features for the classification layer. This architecture allows for a unified model for all probing tasks and simplifies our experiments. For the classification layer, we use the LSL classifier (Section 3).

2 tfhub.dev/google/elmo/2
3 github.com/google-research/bert; uncased L-24 H-1024 A-16
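Steps 1) and 2) can be sketched in plain numpy (shapes and names are our own; the actual probe also includes learned projections and the MLP of step 3):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def span_representation(layer_acts, span, mix_logits, gamma, attn_w):
    """Scalar mix plus self-attentive span pooling.

    layer_acts: (L, T, d) activations from L encoder layers over T tokens.
    span: (start, end) token indices, end exclusive.
    mix_logits: (L,) learned scalar-mix weights (softmaxed over layers).
    gamma: learned scalar multiplier for the mix.
    attn_w: (d,) learned self-attention vector for span pooling.
    """
    # 1) Scalar mix: per-token weighted sum of layers (Peters et al., 2018a).
    s = softmax(mix_logits)
    tokens = gamma * np.einsum('l,ltd->td', s, layer_acts)
    # 2) Self-attentive pooling (Lee et al., 2017): attention-weighted
    #    average of the span's token vectors.
    start, end = span
    span_toks = tokens[start:end]       # (k, d)
    a = softmax(span_toks @ attn_w)     # (k,) attention over span tokens
    return a @ span_toks                # (d,) span representation
```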

Model selection
We run initial studies to determine hidden layer sizes and regularization coefficients. For all LSL probes, we use N = 32 latent classes.

Probe capacity Hewitt and Liang (2019) suggest that results with expressive probes may reflect the probe's learning capacity rather than structure encoded in the inputs. To mitigate this, we follow their advice and use a single hidden layer with the smallest dimensionality that does not sacrifice performance. For each task, we train binary logistic regression probes with a range of hidden sizes and select the smallest yielding at least 97% of the best model's performance. Details are in Appendix A.
Mitigating variance To mitigate variance across random restarts, we use a consistency-based model selection criterion: train 5 separate models, compute their pairwise B³ F1 scores, and choose the model with the highest F1 score on average.
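The selection criterion can be sketched as follows; for brevity, this sketch scores agreement with a simple pairwise (Rand-style) measure standing in for the pairwise B³ F1 described above:

```python
from itertools import combinations

def agreement(a, b):
    """Fraction of point pairs on which two clusterings agree, i.e. both
    group the pair together or both keep it apart (a simple stand-in for
    pairwise B-cubed F1)."""
    pairs = list(combinations(range(len(a)), 2))
    same = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return same / len(pairs)

def select_consistent(runs):
    """Pick the run whose clustering agrees most, on average, with the
    other runs (consistency-based model selection)."""
    scores = [
        sum(agreement(r, o) for j, o in enumerate(runs) if j != i)
        / (len(runs) - 1)
        for i, r in enumerate(runs)
    ]
    return max(range(len(runs)), key=scores.__getitem__)
```

Because agreement compares partitions rather than labels, runs that induce the same clusters under permuted cluster ids still count as consistent.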

Regularization coefficients
We run preliminary experiments using BERT-large on Universal Dependencies and Named Entity Labeling with ablations on our clustering regularizers. For each ablation, we choose the hyperparameter setting which yields the best F1 against gold.
Results Results, shown in Table 1, indicate that the batch-level entropy loss drives up both diversity and uncertainty, while the instance-level entropy loss drives them down. In combination, however, they produce the right balance, with uncertainty close to 1 while retaining diversity. Notably, the Named Entity labeling model has lower diversity without the instance-level loss than with it. Intuitively, this may happen because the batch-level entropy can be increased by driving up instance-level entropy, without changing the entropy of the expected distribution of predictions H(E_x[P(Ĉ(x))]). So by keeping the uncertainty down on each input, the instance-level entropy loss helps the batch-level entropy loss promote diversity in the induced ontology.

Table 3: Label-wise B³ F1 scores for Named Entities, sorted by decreasing BERT-large F1. Induced ontologies capture some labels surprisingly well, but are indifferent to more specialized categories which may require more world knowledge to distinguish.
Based on these results, we set α = β = 1.5 for L_be and L_ie for the main experiments.

Results and Analysis
We train and evaluate our final probing model on all combinations of task and encoder described in Section 4. Aggregate results are shown in Table 2. 5 Taking all metrics into account, contextualized encodings produce richer ontologies that agree more with gold than the lexical baseline does. In fact, BERT-lex has normalized PMI scores very close to zero across the board, encoding virtually no information about gold categories. For this reason, we omit it from the rest of the analysis.
It may be surprising that our induced ontologies have any relationship at all to gold classes, since the only extra supervision is in binary classification that collapses them together. Indeed, many tasks addressed here have multiple human-written ontologies, as discussed in Section 2. In our case, we let the model choose its own ontology. The resulting matches and mismatches with human-labeled ontologies provide a new lens with which to analyze both pretrained encoders and linguistic ontologies.

5 Results for more tasks and encoders are in Appendix B.
Named entities As shown in Table 3, neither BERT nor ELMo are sensitive to categories that are related to specialized world knowledge, such as languages, laws, and events. However, they are in tune with other types: ELMo discovers a clear PERSON category, whereas BERT has distinguished DATEs. Visualization of the clusters ( Figure 2) corroborates this, furthermore showing that the models have a sense of scalar values and measurement; indeed, instead of the gold distinction between ORDINAL and CARDINAL numbers, both models distinguish between small and large (roughly, seven or greater) numbers. See Appendix C for detailed nPMI scores.
Semantic roles Patterns in nPMI (Figure 3c) roughly match intuition: primary core arguments (ARG0, ARG1) are distinguished, as well as modals (ARGM-MOD) and negation (ARGM-NEG), while trailing arguments (ARG2-5) and modifiers (ARGM-TMP, LOC, etc.) form a large group. On one hand, this reflects surface patterns: primary core arguments tend to be close to the verb, with ARG0 on the left and ARG1 on the right; trailing arguments and modifiers tend to be prepositional phrases or subordinate clauses; and modals and negation are identified by lexical and positional cues. On the other hand, this also reflects error patterns in state-of-the-art systems, where label errors can sometimes be traced to ontological choices in PropBank, which distinguish between arguments and adjuncts that have very similar meaning (He et al., 2017; Kingsbury et al., 2002). While the number of induced classes roughly matches gold for most tasks, induced ontologies for semantic roles are considerably more diverse (Table 2). Among high-precision labels (Table 4), core arguments ARG0-2 are split apart most by the model. This follows intuition for PropBank core argument labels, which have predicate-specific meanings. Other approaches based on Frame Semantics (Baker et al., 1998; Fillmore et al., 2006), Proto-Roles (Dowty, 1991; Reisinger et al., 2015), or Levin classes (Levin, 1993; Schuler, 2005) have more explicit fine-grained roles. Comparison with these frameworks and investigation of learned clusters could be informative for future work on ontology design or unsupervised learning.

Discussion
Our exploration of latent ontologies has yielded some surprising results: ELMo knows people, BERT knows dates, and both sense scalar and measurable values, while distinguishing between small and large numbers. Both models preferentially split core semantic roles into many fine-grained categories, and seem to encode broad notions of syntactic and semantic structure. These findings contrast with those from fully-supervised probes, which produce strong agreement with existing annotations (Tenney et al., 2019b) but can also report false positives by fitting to weak patterns in large feature spaces (Zhang and Bowman, 2018;Voita and Titov, 2020). Instead, agreement of latent categories with known concepts can be taken as strong evidence that these concepts are present as important, salient features in the representation space.
This issue is particularly important when looking for deep, inherent understanding of linguistic structure, which by nature must generalize. For supervised systems, generalization is often measured by out-of-distribution objectives like out-of-domain performance (Ganin et al., 2016), transferability (Wang et al., 2018), or robustness to adversarial inputs (Jia and Liang, 2017). Recent work also advocates for counterfactual learning and evaluation (Qin et al., 2019; Kaushik et al., 2020) to mitigate confounds, or contrastive evaluation sets (Gardner et al., 2020) to rigorously test local decision boundaries. Overall, these techniques target discrepancies between salient features in a model and causal relationships in a task. In this work, we extract such features directly and investigate them by comparing induced and gold ontologies. This identifies some very strong cases of transferability from the binary detection task to detection tasks over gold subcategories, such as ELMo's people and BERT's dates (Table 3). Future work may investigate cross-task ontology matching to identify further cases of transferable features, or perhaps the emergence of categories signifying pipelined reasoning (Tenney et al., 2019a), surface patterns, or new, perhaps unexpected distinctions which can appear when going beyond existing schemas.

Figure 3: Pairwise gold label nPMIs on selected categories for ontologies induced from BERT-large on selected tasks; panel (c) shows semantic roles. Blue is positive nPMI, representing that gold labels are preferentially grouped together; red is negative nPMI, representing that gold labels are preferentially separated. Counts are summed over all 5 runs to better reflect the underlying representations, though variance was low and our observed trends hold across all runs.
Our results point to a general paradigm of probing with latent variables, for which LSL is just one potential technique. We have only scratched the surface of what may emerge with such methods: while our probing test is high specificity, it is low power; plenty of extant latent structure may still be missed. LSL probing may produce different ontologies due to many factors, such as tokenization (Singh et al., 2019), encoder architecture (Peters et al., 2018b), probe architecture (Hewitt and Manning, 2019), data distribution (Gururangan et al., 2018), pretraining task (Liu et al., 2019a; Wang et al., 2019a), or pretraining checkpoint. Any of these factors may be at work in the differences we observe between ELMo and BERT: for example, BERT's tokenization method may not as readily induce personhood features due to splitting of rare words (like names) in byte-pair encoding. Furthermore, concurrent work (Chi et al., 2020) has already found qualitative evidence of syntactic dependency types emergent in the special case of multilingual structural probes (Hewitt and Manning, 2019). With LSL, we provide a method that can be adapted to a variety of probing settings to both quantify and qualify this kind of structure.

Conclusion
We introduced a new classifier and model analysis method based on latent subclass learning: by factoring a binary classifier through a forced choice of latent subclasses, latent ontologies can be coaxed out of input features. Using this approach, we found that encoders such as BERT and ELMo hold stable, consistent latent ontologies on a variety of linguistic tasks. In these ontologies, we found clear connections to existing categories, such as personhood of named entities. We also found evidence of ontological distinctions beyond traditional gold categories, such as distinguishing large and small numbers, or preferring fine-grained semantic roles for core arguments. With latent subclass learning, we have shown a general technique to uncover some of these features discretely, providing a starting point for descriptive analysis of our models' latent ontologies. Potential future work may include investigating how LSL results vary with probe architecture, developing intrinsic quality measures on latent ontologies, or applying the technique to discover new patterns in settings where gold annotations are not present.

A Probe capacity tuning
Results from hidden size tuning experiments are shown in Figure 4.

C More Analysis Results
We show expanded comparative nPMI plots in Figure 5 and Figure 6. These use co-occurrence counts summed over 5 runs, and exhibit the same overall trends as each run.

Figure 4: Performance on hidden size tuning experiments for different tasks. Clockwise from top-left, they are nonterminals, named entities, semantic roles, and syntactic dependencies. coarse (red) is binary accuracy of a binary classifier, fine-binary (blue) is binary accuracy of a full multiclass classifier, and fine-full (green) is the full multiclass accuracy of the multiclass classifier. The black vertical line is the smallest hidden size that passes the 97% performance threshold for coarse.