Frame Identification as Categorization: Exemplars vs Prototypes in Embeddingland

Categorization is a central capability of human cognition, and a number of theories have been developed to account for properties of categorization. Even though many tasks in semantics also involve categorization of some kind, theories of categorization do not play a major role in contemporary research in computational linguistics. This paper follows the idea that embedding-based models of semantics lend themselves well to being formulated in terms of classical categorization theories. The benefit is a space of model families that enables (a) the formulation of hypotheses about the impact of major design decisions, and (b) a transparent assessment of these decisions. We instantiate this idea on the task of frame-semantic frame identification. We define four models that cross two design variables: (a) the choice of prototype vs. exemplar categorization, corresponding to different degrees of generalization applied to the input; and (b) the presence vs. absence of a fine-tuning step, corresponding to generic vs. task-adaptive categorization. We find that for frame identification, generalization and task-adaptive categorization both yield substantial benefits. Our prototype-based, fine-tuned model, which combines the best choices for these variables, establishes a new state of the art in frame identification.


Introduction
Categorization is the process of forming categories and assigning objects to them, and is a central capability of human cognition (Murphy, 2002). Not surprisingly, cognitive psychology has shown substantial interest in theories of categorization. Two such prominent theories are prototype and exemplar models. In prototype theory, categories are characterized in terms of a single representation, the prototype, which is an abstraction over individual objects and captures the 'essence' of the category (Posner and Keele, 1968;Rosch, 1975). In computational models, the prototype is often computed as the centroid of the objects of a category, and new objects are classified by their similarity to different categories' prototypes. As a result, the decision boundary between every pair of categories is linear. In contrast, exemplar theories represent categories in terms of the potentially large set of objects, called exemplars, that instantiate the category (Nosofsky, 1986;Hintzman, 1986). New objects are classified by similarity to nearby exemplars, so in a computational model this becomes similar to a nearest-neighbor classification. In exemplar models, the decision boundary between categories can become non-linear, enabling more complex behavior to be captured, but at the cost of higher training data requirements.
Prototype and exemplar theories are typically not at the center of attention in contemporary computational linguistics. One reason is arguably that, due to their origin in psychology, they tend to restrict themselves to cognitively plausible parameters and learning mechanisms (Nosofsky and Zaki, 2002;Lieto et al., 2017), whereas the focus of computational linguistics is very much on the use of novel machine learning techniques for applications. We nevertheless believe that categorization theory is still relevant for computational linguistics, and lexical semantics in particular. In fact, the emergence of distributed representations (embeddings) as a dominant representational paradigm has had a unifying effect on work in lexical semantics. The properties of high-dimensional embeddings provide a good match with the terms of frames, conceptual categories which have a set of lexical units that evoke the situation, and a set of frame elements that categorize the participants and that are expected to be realized linguistically. For instance, tell, explain, and say are all capable of expressing the STATEMENT frame which describes the situation where SPEAKER communicates a MESSAGE to a RECIPIENT.
Frame Semantics has been implemented in a number of NLP applications thanks to the Berkeley FrameNet resource (Baker et al., 1998). The latest FrameNet lexicon release provides definitions for over 1.2k frames, and 13,640 lexical units (i.e., predicate-frame pairs), where there are approximately 12 lexical units per frame. FrameNet also provides sentence annotations that mark, for each lexical unit, the frame that is evoked as well as its frame elements in running text. This annotated corpus has sparked a lot of interest in computational linguistics, and the prediction of frame-semantic structures (frames and frame elements) has become known as (frame-)semantic parsing (Gildea and Jurafsky, 2002;Das et al., 2014).

Frame Identification
In this paper, we focus on the first step of frame-semantic parsing called frame identification or frame assignment, where an occurrence of a predicate in context is labeled with its FrameNet frame. This is a categorization task that presents two main challenges:
Ambiguity. Most predicates in FrameNet are ambiguous, that is, they can evoke different frames depending on the context that they occur in. For example, treat has a medical sense (treat a disease) that evokes the MEDICAL INTERVENTION frame and a social sense (treat a person in some manner) that evokes the TREATING AND MISTREATING frame. These distinctions can be relatively subtle: say can evoke (among others) the frames STATEMENT and TEXT CREATION which differ mainly in the modality of the communicative act (said to his friend vs. said in his book).
Generalization. As conceptual categories, frames are clearly open classes. No resource can exhaustively list all frames or the predicates that can evoke them.
Frame identification was first modeled as a supervised classification task, based on linguistic features (Das et al., 2010). While such systems address the ambiguity problem to some degree, they tend to struggle with generalization. An alternative approach investigated the use of other machine-readable dictionaries (Green et al., 2004), but was not able to fundamentally overcome the generalization problem. Recent progress in supervised frame identification has come out of neural networks and distributed word representations (Peng et al., 2018;Hartmann et al., 2017). In these studies, frame-labeled corpora are used to learn embeddings for the frames as a side product of representation learning with different objective functions. Hermann et al. (2014) learned embeddings jointly for frames and the sentential contexts in which they were evoked. The current state-of-the-art in frame identification performs full-fledged semantic role labeling, i.e., it jointly assigns frames as well as frame elements, using a bi-directional LSTM architecture (Peng et al., 2018).

Distributed Representations of Word Meaning
Distributed representations of word meanings (embeddings) have become a standard representation format in semantics. These models are grounded in the distributional hypothesis (Harris, 1954), according to which similar words are expected to appear in similar contexts. Based on this hypothesis, word (and phrase) meaning is represented as vectors (embeddings) in a semantic space whose dimensions correlate with properties of the context, and in which closeness between two vectors indicates semantic relatedness.
Traditionally, "count" vectors were created by simply counting co-occurring context features, with the option of additional weighting and compression over those count vectors. Neural network-based "predict" vectors are learned by treating contextual features as parameters of an objective function that is optimized on a corpus. One of the first, and still popular, "predict" models is the word2vec Skipgram model (Mikolov et al., 2013). It optimizes a word embedding using a context bag of words. This model, however, learns representations only at the lexical level, so that occurrences of a word in different contexts (cf. treat in Section 2.1.1) are represented equally. This has changed with the latest generation of embedding models, such as AllenNLP's ELMo (Gardner et al., 2018) and Google's BERT (Devlin et al., 2018), which build contextualized embeddings for occurrences of words based on the context as well as their relative positions.
A second important recent development concerns the objective(s) used to learn the embedding. While traditional count vectors and early embedding models like Skipgram assume that embeddings are general, and trained in a task-agnostic fashion, there is an alternative thread of research that sees the training of embeddings as a side product of training task-specific neural network models on tasks like sentiment analysis or machine translation (Socher et al., 2013; Hill et al., 2017). The most recent models reconcile these two directions with a two-phase transfer learning setup. The first phase is pre-training, where task-agnostic embeddings are learned from large, unlabeled corpora. The second phase is fine-tuning, which adapts the pre-trained embeddings to a specific task using comparatively small amounts of task-specific labeled corpora.

Bidirectional Encoder Representations from Transformers (BERT)
BERT (Devlin et al., 2018) is a state-of-the-art embedding model that provides contextualized embeddings in a pre-training/fine-tuning setup. BERT is essentially a deep network of Transformer blocks (Vaswani et al., 2017) which use stacked self-attention mechanisms to capture relationships across different positions in a sequence. The two tasks that are used for pre-training are (masked) language modeling and recognition of discourse continuation (next-sentence prediction). Representations from the pre-training step are then pooled and fed to the fine-tuning stage for classification. Fine-tuning proceeds by adding an additional, task-specific layer on top of the pre-trained BERT embeddings that maps embeddings onto the desired task output. In addition to learning weights for this final, task-specific classification layer, this procedure also updates the pooled, pre-trained embedding through backpropagation.

Categorization Approaches to Frame Identification
This section defines our four embedding-based models for frame identification. As motivated in Section 1, we based our model space on two well-known dichotomies from categorization research: exemplar vs. prototype theory, and pure bottom-up processing vs. a combined bottom-up plus top-down processing. This setup results in a 2x2 matrix and a total of four models, as sketched in Figure 1. To focus solely on the problem of frame identification as a categorization task, we assume that the frame-evoking predicates have already been identified in the texts of interest.
The first dimension distinguishes prototype vs. exemplar models, shown in the figure as columns. We consider models to be exemplar-based when they only use representations for individual instances for their predictions (here, predicates in context), but do not compute aggregate embeddings at the category level (here, frames). One of the most straightforward implementations of this approach is nearest-neighbor classification (Daelemans and van den Bosch, 2005). In contrast, prototype models do not use instance representations at prediction time, but instead aggregate them into category-level representations. The standard softmax classification, for example, is a clear prototype approach by virtue of learning a weight vector for each class whose dot product with an input yields the unnormalized score of that class, which the softmax then turns into a probability. The geometric interpretation of this computation is a similarity between the prototype vector and the input, measured by the dot product.
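The prototype reading of softmax classification can be illustrated with a minimal sketch. The function name, the class labels "A" and "B", and the toy vectors are assumptions made for illustration; they do not come from the paper's implementation.

```python
import math

def softmax_over_class_vectors(x, class_vectors):
    """Score an input against one weight vector per class, then normalize.

    x: input embedding (list of floats).
    class_vectors: dict mapping a class label to its weight vector,
        which plays the role of the class prototype.
    """
    # Dot product between the input and each class vector.
    scores = {
        label: sum(a * b for a, b in zip(x, w))
        for label, w in class_vectors.items()
    }
    # Subtract the maximum score for numerical stability.
    top = max(scores.values())
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Toy 2-d vectors (made-up values).
probs = softmax_over_class_vectors(
    [1.0, 0.0], {"A": [2.0, 0.0], "B": [0.0, 2.0]}
)
```

The class whose weight vector has the largest dot product with the input receives the highest probability, which is what makes the weight vectors behave like prototypes.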
The second dimension, shown in the figure as rows, distinguishes pure bottom-up from bottom-up plus top-down models. In categorization research, bottom-up models assume that general similarity information "as given" is sufficient to perform the classification. In an embedding-based setup, this corresponds to models where embeddings are (pre-)trained in a task-independent fashion and applied to the task as-is. Thus, the bottom-up models form categories purely from contextual features that have been learned in a generalized, unsupervised fashion. In contrast, combined bottom-up plus top-down models assume that top-down information, such as a preconceived notion of a category or similarity metric, influences processing for this task. In the context of current embedding-based models, we treat the fine-tuning procedure in BERT (cf. Section 2.2.1), where representations are fine-tuned using a small amount of task-specific data, as an approximate top-down effect on categorization.

Bottom-up (Pre-trained Embeddings)
Bottom-up frame identification models use only the pre-trained embeddings to predict the frame of a lexical unit in context. The classification performed by these models shows how well frame classification can be carried out by relying on general lexical semantic relatedness, without explicit knowledge about frame-semantic grouping.
Bottom-up Exemplar. In exemplar theories, categorization proceeds by comparing a target instance to prior seen instances, and the target is assigned the same class as its closest seen instance. To classify a predicate in context, we perform single nearest neighbor classification: we compare its pre-trained, contextualized embedding to all pre-trained, contextualized embeddings of predicates in the training set, and assign the frame label of the closest training predicate. We use the standard embedding similarity metric, cosine similarity. In the example in Figure 1 (top left), the nearest neighbor to the test instance He got apples is I got one recently, which leads to the assignment of the COMMERCE BUY frame.
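The single nearest neighbor procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the 2-d vectors are made-up stand-ins for contextualized predicate embeddings, "OTHER_FRAME" is a placeholder label, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor_frame(target, exemplars):
    """Return the frame of the training exemplar most similar to target.

    exemplars: list of (embedding, frame_label) pairs.
    """
    _, best_frame = max(exemplars, key=lambda ex: cosine(target, ex[0]))
    return best_frame

# Toy training exemplars (made-up values and placeholder labels).
train = [
    ([0.9, 0.1], "COMMERCE_BUY"),
    ([0.8, 0.3], "COMMERCE_BUY"),
    ([0.1, 0.9], "OTHER_FRAME"),
]
predicted = nearest_neighbor_frame([0.7, 0.2], train)
```

Every training instance is kept and compared at prediction time, so the cost of a prediction grows with the size of the training set.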
Bottom-up Prototype. In the prototype model, the frame categories are formed by building a summary representation of all known instances in a category. We take advantage of the general effectiveness of averaged representations and compute frame prototypes as the unweighted centroid of all pre-trained, contextualized predicate embeddings for the frame's training instances. Frame classification then assigns a novel instance to the category of its most similar prototype. We again use cosine similarity, which is identical (modulo normalization) to softmax classification. The example in Figure 1 (top right) shows the prototypes of the two frames as dots, the "regions" of the two frames by background color, as well as the (linear) decision boundary between prototypes. The test instance He got apples is assigned to the COMMERCE BUY frame because it is closest to the prototype of that frame.
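A minimal sketch of the centroid-based prototype classifier, again assuming made-up 2-d vectors in place of contextualized embeddings and a placeholder label "OTHER_FRAME"; the function names are hypothetical.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def build_prototypes(instances):
    """Compute one unweighted centroid per frame.

    instances: list of (embedding, frame_label) pairs.
    """
    grouped = defaultdict(list)
    for embedding, frame in instances:
        grouped[frame].append(embedding)
    return {
        frame: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for frame, vectors in grouped.items()
    }

def classify(target, prototypes):
    """Assign the frame whose prototype is most similar to the target."""
    return max(prototypes, key=lambda f: cosine(target, prototypes[f]))

# Toy training instances (made-up values and placeholder labels).
train = [
    ([0.9, 0.1], "COMMERCE_BUY"),
    ([0.8, 0.3], "COMMERCE_BUY"),
    ([0.1, 0.9], "OTHER_FRAME"),
]
prototypes = build_prototypes(train)
predicted = classify([0.7, 0.2], prototypes)
```

In contrast to the exemplar model, only one vector per frame is compared at prediction time, regardless of how many training instances the frame has.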

Combined Bottom-up plus Top-down (Pre-trained and Fine-tuned Embeddings)
Bottom-up plus top-down frame identification models optimize the embeddings according to task-specific data and a loss function during a fine-tuning phase. Prediction then uses the fine-tuned embeddings instead of pre-trained ones.
Bottom-up plus Top-down Exemplar. For our exemplar-based model, we apply fine-tuning to make the embeddings for predicates that evoke the same frame more similar, and embeddings evoking different frames less similar. We frame the fine-tuning step as a binary classification task that decides, for a pair of predicates in sentence context, whether they evoke the same or different frames. The input consists of the concatenated contextualized embeddings of the two predicates. An example of the training input is below, where the BERT model adds a special [SEP] token between the text pair. The treated predicate from the first sentence evokes the MEDICAL INTERVENTION frame, whereas the got predicate in the second sentence evokes COMMERCE BUY.

Input Sequence: The doctor treated the patient [SEP] He got apples
Label: different

Formally, for each predicate in context i with frame f(i), we define P+(i) as a set of (positive) instances with the same frame, P−(i) as a set of (negative) instances with different frames, and same(i, i') as the binary prediction of the model for a pair of instances i and i'. We then define the objective function as a cross-entropy loss between the gold label (same / different) and same(i, i').
We select P − (i) from the set of frame candidates for a given lexical unit. For predicates with only a single frame, we randomly select a negative instance from the entire frame class inventory. For each predicate in the training data, we use two positive and negative instances, which we obtain by random sampling. At prediction time, we pair the target predicate with all instances of all frame candidates for this predicate and run them through the trained classification model, as shown in the bottom-left corner of Figure 1. For Unseen predicates (see Section 4), we pair target predicates with one randomly selected example from each frame in the entire frame inventory. We then label the target predicate with the frame of the instance with the highest same-frame probability. In this model, the top-down knowledge that is passed to the network corresponds to the similarity metric between frame-evoking predicates.
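The sampling and prediction steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data layout, the function names, the sentence "She treated him badly", the treatment of get as a single-frame predicate, and the token-overlap scorer standing in for the fine-tuned classifier are all hypothetical.

```python
import random
from collections import defaultdict

def build_pairs(instances, lexicon, n_samples=2, seed=13):
    """Sample same/different training pairs for each annotated predicate.

    instances: list of dicts with keys "text", "predicate", "frame".
    lexicon: dict mapping a predicate to the set of frames it can evoke.
    """
    rng = random.Random(seed)
    by_frame = defaultdict(list)
    for inst in instances:
        by_frame[inst["frame"]].append(inst)

    pairs = []
    for inst in instances:
        positives = [o for o in by_frame[inst["frame"]] if o is not inst]
        # Negatives come from the predicate's other candidate frames; a
        # single-frame predicate falls back to the whole frame inventory.
        other_frames = lexicon[inst["predicate"]] - {inst["frame"]}
        if not other_frames:
            other_frames = set(by_frame) - {inst["frame"]}
        negatives = [o for f in sorted(other_frames) for o in by_frame[f]]
        for pool, label in ((positives, "same"), (negatives, "different")):
            for other in rng.sample(pool, min(n_samples, len(pool))):
                pairs.append((inst["text"] + " [SEP] " + other["text"], label))
    return pairs

def predict_frame(target_text, candidates, same_probability):
    """Label the target with the frame of the highest-scoring candidate."""
    best = max(candidates, key=lambda o: same_probability(target_text, o["text"]))
    return best["frame"]

def toy_same_probability(a, b):
    """Stand-in for the fine-tuned classifier: share of overlapping tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Toy data; the lexicon entries are simplified for illustration.
instances = [
    {"text": "The doctor treated the patient", "predicate": "treat",
     "frame": "MEDICAL_INTERVENTION"},
    {"text": "She treated him badly", "predicate": "treat",
     "frame": "TREATING_AND_MISTREATING"},
    {"text": "He got apples", "predicate": "get", "frame": "COMMERCE_BUY"},
    {"text": "I got one recently", "predicate": "get", "frame": "COMMERCE_BUY"},
]
lexicon = {
    "treat": {"MEDICAL_INTERVENTION", "TREATING_AND_MISTREATING"},
    "get": {"COMMERCE_BUY"},
}
pairs = build_pairs(instances, lexicon)
candidates = [i for i in instances if i["frame"] in lexicon["get"]]
predicted = predict_frame("He got pears", candidates, toy_same_probability)
```

In the actual model, the same-frame probability would come from the fine-tuned BERT pair classifier rather than from token overlap.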
Bottom-up plus Top-down Prototype. For the prototype model, we fine-tune the embeddings specifically to learn frame classes (cf. the bottom right-hand example in Figure 1). Since we will train on full-text annotation (described in further detail in Section 4), frame identification proceeds as a token sequence classification, where each token is assigned a frame prediction. In the training data, the input to the model is the sequence of plain text tokens, and the gold class labels are the sequence of correct frame assignments. In the gold label sequence, non-predicates are assigned an outside (O) label. The loss function is a straightforward multi-class cross-entropy loss averaged over each class for every token. Here, the set of labels is the entire set of frames in the FrameNet lexicon plus the added 'outside' class label, resulting in a large set of possible classes (1,021 classes). At prediction time, the model predicts a frame label for each token in the input sequence independently. As is the case in the bottom-up prototype model, no global optimization takes place. We only consider predictions for predicates (according to the gold standard) for the purposes of evaluation.
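The token-level setup can be illustrated with a small sketch. It assumes that only got is an annotated predicate in the toy sentence, and it averages the negative log-probability of the gold label over tokens, which is one common reading of a token-level cross-entropy loss and may differ in detail from the averaging used in the paper. The function names are hypothetical.

```python
import math

def token_labels(tokens, predicate_frames):
    """Build the gold label sequence: a frame for each annotated
    predicate position and the outside label "O" everywhere else.

    predicate_frames: dict mapping a token index to its frame label.
    """
    return [predicate_frames.get(i, "O") for i in range(len(tokens))]

def mean_token_cross_entropy(token_probs, gold):
    """Average negative log-probability of the gold label over tokens.

    token_probs: one dict per token, mapping labels to probabilities.
    """
    total = sum(-math.log(p[g]) for p, g in zip(token_probs, gold))
    return total / len(gold)

tokens = ["He", "got", "apples"]
gold = token_labels(tokens, {1: "COMMERCE_BUY"})

# Made-up model output that gives the gold label probability 0.5 everywhere.
uniform_probs = [{"O": 0.5, "COMMERCE_BUY": 0.5} for _ in tokens]
loss = mean_token_cross_entropy(uniform_probs, gold)
```

Because every token contributes a training signal, including the many tokens labeled O, this setup supervises the whole sequence rather than a single predicate per input.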

Dataset
We work with the dataset sampled by Das et al. (2014) from the FrameNet Release 1.5 full-text annotations. This dataset contains a total of 78 documents with frame-annotated sentences drawn from the British National Corpus. In total, 39 documents were selected for training and 16 for development with a total of 19,582 target predicates, and 23 documents for testing with 4,458 target predicate annotations. This is the standard dataset used for evaluation of frame identification systems.

Model Setup and Hyperparameters
BERT provides several pre-trained models for English that were trained on the concatenation of the BooksCorpus (Zhu et al., 2015) and Wikipedia. We use the pre-trained BERT-large, cased model, trained with the highest number of layers (L=24), hidden units (H=1024), and self-attention heads (A=16). The final layer of the BERT transformer provides embeddings for each token in the sentence that can be interpreted as contextualized meaning representations. According to the authors of the BERT model, performance is shown to improve when the n final layers for each token are concatenated. We use n=4.
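The layer concatenation can be sketched as follows. The nested-list layout of the layer outputs and the concatenation order (earlier layers first) are assumptions for illustration. With H=1024 and n=4, the resulting token representation would have 4 × 1024 = 4096 dimensions.

```python
def concat_last_layers(layer_outputs, token_index, n=4):
    """Concatenate one token's vectors from the final n layers.

    layer_outputs: list over layers (first to last); each layer is a
        list of per-token vectors.
    """
    vector = []
    for layer in layer_outputs[-n:]:
        vector.extend(layer[token_index])
    return vector

# Toy stack: 6 layers, 3 tokens, hidden size 2. Each vector stores
# [layer index, token index] so the result is easy to inspect.
layers = [[[float(l), float(t)] for t in range(3)] for l in range(6)]
embedding = concat_last_layers(layers, token_index=1)
```

The resulting vector has length n times the hidden size, here 4 × 2 = 8.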
For the fine-tuned models, we re-used the hyperparameters of the pre-trained model. Since both of our fine-tuning tasks are classification tasks, we add a standard softmax classification layer with cross-entropy loss on top of BERT (described earlier in Section 3.2). Due to the computational cost of attention mechanisms, the fine-tuned models require a limit on the maximum sequence length. We set the sequence length to 180 in the prototype model, which in this case means that even long sentences can be fed to the model. The exemplar model, on the other hand, takes two text sequences as input (see Section 3.1) which doubles the overall size of the input sequence. We increased the maximum sequence length to 200 for this model to keep as many tokens as possible in training while also being computationally feasible.
We note that our prototype model required a large number of epochs to converge. Most tasks in the BERT paper achieve near-optimal accuracy with 3-4 training epochs, while our model required about 30 epochs. We attribute this to the number of classes (at most 4 classes in the BERT paper, more than 1000 classes for frame identification). The exemplar model is more in line with other BERT tasks, and we perform 5 training epochs for it.

Evaluation Metrics
The general evaluation metric for frame identification is accuracy: the relative frequency of correct assignments to predicates. Since the task of frame identification is moot for single-frame lexical units, frame identification systems standardly (Peng et al., 2018; Hermann et al., 2014) report accuracy on two different subsets of the data: (1) all instances from the test set, called "Full Lexicon", because it includes lexical units that are unambiguous; and (2) only instances of predicates from the test set that can evoke multiple frames, called "Ambiguous". In the data set we use, the test partition contains 2,029 ambiguous predicates out of a total of 4,458 predicate instances.
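The two accuracy figures can be computed as in the following sketch. The record layout, the function names, and the toy lexicon and predictions are assumptions for illustration.

```python
def accuracy(records):
    """Share of records whose predicted frame equals the gold frame.

    records: list of (predicate, gold_frame, predicted_frame) tuples.
    """
    correct = sum(1 for _, gold, pred in records if gold == pred)
    return correct / len(records)

def evaluate(records, lexicon):
    """Report accuracy on all instances and on ambiguous predicates only.

    lexicon: dict mapping a predicate to the set of frames it can evoke.
    """
    ambiguous = [r for r in records if len(lexicon[r[0]]) > 1]
    return {
        "Full Lexicon": accuracy(records),
        "Ambiguous": accuracy(ambiguous),
    }

# Toy lexicon and predictions (made-up values).
lexicon = {
    "treat": {"MEDICAL_INTERVENTION", "TREATING_AND_MISTREATING"},
    "bomb": {"WEAPON"},
}
records = [
    ("treat", "MEDICAL_INTERVENTION", "MEDICAL_INTERVENTION"),
    ("treat", "TREATING_AND_MISTREATING", "MEDICAL_INTERVENTION"),
    ("bomb", "WEAPON", "WEAPON"),
    ("bomb", "WEAPON", "WEAPON"),
]
scores = evaluate(records, lexicon)
```

Unambiguous predicates are trivially correct when the lexicon is consulted, so the "Ambiguous" figure is typically the lower and more informative of the two.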
In addition, some prior work reports specific metrics on infrequent predicates, for which prediction is particularly challenging. "Unseen" reports accuracy for predicates that are completely unseen in the training data, with predictions made over all possible frames, meaning that the frame lexicon is not used for evaluation at test time. "Rare" reports accuracy on predicates that occur fewer than 11 times in the training data.
Table 1 shows the performance of the four models as well as prior results from recent literature. Regarding the impact of the exemplar and prototype dimensions that we introduced in Section 3, we find that the exemplar model does worse overall than the prototype model in both configurations (difference in overall "Full Lexicon" accuracy: 2% for bottom-up, 7% for bottom-up plus top-down). This indicates that the prototype setup is better suited to the task than the exemplar one, at least on the data we experimented with. Second, we see a substantial effect of top-down processing (fine-tuning): 1.5% for exemplars, over 6% for prototypes. The clear winner is the bottom-up plus top-down (fine-tuned) prototype model: with an accuracy of 91.26%, it outperforms the previous state of the art (Peng et al., 2018). This shows that frame categorization can indeed profit from task-based optimization. That being said, it is worth noting that even the bottom-up prototype model with only generic pre-training performs at or above the level of the supervised SEMAFOR model, which incorporated linguistic and ontological features in a log-linear model. Thus, the bottom-up vector space models do have a claim to robust performance.
Accuracy on "Ambiguous" predicates largely mirrors the patterns we find on "Full Lexicon" accuracy. These results bolster the interpretation that both prototype representation and fine-tuning lead to clear gains. Results on "Rare" and "Unseen" predicates are more difficult to compare due to lack of reported results (marked as NA).
The numbers for "Rare", again, seem to follow the "Full Lexicon" trend, and outperform the state of the art. The results for the "Unseen" category do so too, but are below the previously reported results. The reason is that Das et al. (2014) apply additional processing to unseen predicates based on a context similarity graph. For simple supervised classification without the extra component, comparable to our 30.20% setting, they report an accuracy of 23.08%.

Sentence Length
Next, we aim to determine how much the sentence length affects predictions of classes in the bottom-up versus the bottom-up plus top-down models. Results are shown in Figure 2. We find that the performance of the bottom-up models declines as sentence length increases, and the opposite is seen in the top-down prototype model.
The most natural explanation for this pattern starts from the realization that the BERT model incorporates long-range dependencies via its self-attention mechanisms. That is, these long-range dependencies, coupled with the bidirectionality in the BERT model, introduce a rich notion of context. However, in the full-text annotations for fair comparison. which is exactly what we observe. In contrast, fine-tuning of the self-attention weights can apparently turn long sentences into an asset by providing rich context hints for improved frame classification.
The outlier in this analysis is the fine-tuned (bottom-up plus top-down) exemplar model, whose performance fluctuates between that of the fine-tuned prototype model and that of the bottom-up models. Given the analysis of the previous paragraph, this may not be surprising: the supervision provided to the fine-tuned exemplar model is less informative than that for the prototype model (cf. Section 3.2), since the exemplar supervision does not name the frame(s) involved and only provides information for one predicate pair in a potentially long sequence. Arguably, this makes it much more difficult for BERT to properly adapt its self-attention weights.

Frame-level and Predicate-level Analysis
We now look at the most accurate frames and predicates from our best model and compare the accuracies for these inputs across our four models. This analysis gives us insight regarding what types of semantic information are already learned by the bottom-up models versus the knowledge that is gained by learning frame-specific semantics in the top-down setting.
Table 2 shows the analysis at the frame level. The best model assigns three frames perfectly. For one of them, CAPABILITY, there is a dramatic performance gap, where the other models show accuracies of 0.73 and less. This frame includes lexical units such as can.v and able.a, which are both frequent and unspecific and therefore somewhat difficult to learn without frame-specific tuning. The same is true for three other frames, POSSESSION, TEMPORAL COLLOCATION, and LOCATIVE RELATION, which also have a high number of frequent, ambiguous predicates including modals and prepositions (have.v, in.prep, on.prep). The final frame, WEAPON, behaves rather differently in that the models perform almost equally well. Since the predicates in this frame form a coherent topic and tend to be low in ambiguity (bomb.n, missile.n, shotgun.n), they are quite easily learned with only generalized embeddings.
The analysis at the predicate level is shown in Table 3. We see a distinction very similar to the frame   v, know.v, can.v, and in.prep profit hugely from frame-specific fine-tuning since their pre-trained, contextualized embeddings are presumably more widely spread out. In contrast, the people.n predicate performs well in all models including the bottom-up ones.

Discussion and Conclusions
In this paper, we have taken up an old strand of research in cognitive psychology, categorization, and demonstrated how such research contributes to computational lexical semantics. We have argued that theories of categorization have something valuable to offer to neural embedding-based models of natural language semantics, namely a framework in which to ground model design decisions and understand their consequences. We have considered two dimensions: (a) the distinction between prototype and exemplar categorization, where prototype models produce a summary representation of their categories, while exemplar models represent the input objects themselves; and (b) the decision between pure similarity-driven "bottom-up" categorization, and task-specific "top-down" categorization, which finds its direct counterpart in current embedding models in the distinction between pre-trained and fine-tuned embeddings. Along these two dimensions, we have defined four models for frame-semantic frame identification with BERT embeddings. Empirically, we found that for this task it worked best (a) to learn category representations via a prototype, and (b) to fine-tune the representations with a small amount of frame-labeled data. Further analysis showed that the benefit of the fine-tuning was in particular to improve model performance in the face of abstractness and ambiguity: while all models work well on frames describing coherent, concrete topics and containing concrete predicates drawn from their topics (WEAPON), only fine-tuned models perform well on frames that capture abstract semantic generalizations that do not correspond to coherent regions in embedding space (LOCATIVE RELATION) or on ambiguous predicates such as can.v, which is able to evoke five frames, among them PRESERVING, CAPABILITY, LIKELIHOOD, and POSSIBILITY.
While the benefit of fine-tuning is expected based on previous work, the relative performance of prototype and exemplar models was less predictable. Our analysis indicates that the outcome of our study, a win for prototypes, is presumably tied to the study's use of full-text frame annotation, which can be exploited straightforwardly in a prototype setting to tune the long-distance dependencies captured by BERT's self-attention mechanisms.