SemEval-2019 Task 2: Unsupervised Lexical Frame Induction

This paper presents Unsupervised Lexical Frame Induction, Task 2 of the International Workshop on Semantic Evaluation in 2019. Given a set of prespecified syntactic forms in context, the task requires that verbs and their arguments be clustered to resemble semantic frame structures. Results are useful in identifying polysemous words, i.e., those whose frame structures are not easily distinguished, as well as discerning semantic relations of the arguments. Evaluation of unsupervised frame induction methods fell into two tracks: Task A) Verb Clustering based on FrameNet 1.7; and B) Argument Clustering, with B.1) based on FrameNet’s core frame elements, and B.2) on VerbNet 3.2 semantic roles. The shared task attracted nine teams, of whom three reported promising results. This paper describes the task and its data, reports on methods and resources that these systems used, and offers a comparison to human annotation.


Introduction
SemEval 2019 Task 2 focused on the unsupervised semantic labeling of a set of prespecified (semantically) unlabeled structures (Figure 1). Unsupervised learning methods analyze these structures (Figure 1a) to augment them with semantic labels (Figure 1b). The shape of the manually labeled input frames is constrained to an acyclic connected tree of lexical items (words and multi-word units) of maximum depth 1, where just one root governs several arguments. To determine the arguments and label them with semantic information, the task used Berkeley FrameNet (FN) (Ruppenhofer et al., 2016) and the guidelines for this task (Q. Zadeh and Petruck, 2019).
We compared the system results for unsupervised semantic tagging with human-annotated (gold-standard) data in three different subtasks (Figure 2). To evaluate the systems, we computed distributional similarities between their unsupervised labeled data and the human-annotated reference data. To compute similarities, we used general-purpose numerical measures for evaluating text clustering, in particular the BCUBED F-SCORE (Bagga and Baldwin, 1998) as the single figure of merit to rank the systems. The most important result of the shared task is the creation of a benchmark for a future complex task. This benchmark includes a moderately sized, manually annotated set of frames, where only the verbs of each were included, along with their core frame elements (which uniquely define a frame, as Ruppenhofer et al. describe). To complement FN's core frame elements, which have highly specific meanings, the benchmark also includes the annotated argument structures of the verbs based on the generic semantic roles proposed for verb classes in VerbNet 3.2 (Kipper et al., 2000; Palmer et al., 2017). The benchmark comes with simplified annotation guidelines and a modular annotation system with browsing and editing capabilities. Complementing the benchmark are several state-of-the-art competing baselines from the participants, which serve as a point of departure for future improvements.
The rest of this paper is organized as follows: Section 2 contextualizes this task; Section 3 offers a detailed task description; Section 4 describes the data; Section 5 introduces the evaluation metrics and baselines; Section 6 characterizes the participating systems and the unsupervised methods that participants used; Section 7 provides evaluation scores and additional insight about the data; and Section 8 presents concluding remarks.

Background
Frame Semantics (Fillmore, 1976) and other theories (Gamerschlag et al., 2014) that adopt typed feature structures for representing knowledge and linguistic structures have developed in parallel over several decades in theoretical linguistic studies about the syntax-semantics interface, as well as in empirical corpus-driven applications in natural language processing. Building repositories of (lexical) semantic frames is a core component in all of these efforts. In formal studies, lexical semantic frame knowledge bases instantiate foundational theories with tangible examples, e.g., to provide supporting evidence for the theory. Practically, frame semantic repositories play a pivotal role in natural language understanding and semantic parsing, both as inspiration for a representation format and for training data-driven machine learning systems, which is required for tasks such as information extraction, question-answering, text summarization, among others.
However, manually developing frame semantic databases and annotating corpus-derived illustrative examples to support analyses of frames are resource-intensive tasks. The most well-known frame semantic (lexical) resource is FrameNet (Ruppenhofer et al., 2016), which only covers a (relatively) small set of the vocabulary of contemporary English. While NLP research has integrated FrameNet data into semantic parsing, e.g., Swayamdipta et al. (2018), these methods cannot extend beyond previously seen training labels, tagging out-of-domain semantics as unknown at best. This limitation does not hinder unsupervised methods, which will port and extend the coverage of semantic parsers, a common challenge in semantic parsing (Hartmann et al., 2017).
Unsupervised frame induction methods can serve as an assistive semantic analytic tool, to build language resources and facilitate linguistic studies. Since the focus is usually to build language resources, most systems (Pennacchiotti et al. (2008); Green et al. (2004)) have used a lexical semantic resource like WordNet (Miller, 1995) to extend coverage of a resource like FrameNet. Some methods, e.g., Modi et al. (2012) and Kallmeyer et al. (2018), tried to extract FrameNet-like resources automatically without additional semantic information. Others (Ustalov et al. (2018); Materna (2012)) addressed frame induction only for verbs with two arguments.
Lastly, unsupervised frame induction methods can also facilitate linguistic investigations by capturing information about the reciprocal relationships between statistical features and linguistic or extra-linguistic observations (e.g., Reisinger et al. (2015)). This task aimed to benchmark a class of such unsupervised frame induction methods. The ambitious goal of this task was the unsupervised induction of frame semantic structures from tokenized and morphosyntactically labeled text corpora. We sought to achieve this goal by building an evaluation benchmark for three tasks. Task A dealt with unsupervised labeling of verb lemmas with their frame meaning. Task B involved unsupervised argument role labeling, where B.1 benchmarked unsupervised labeling of frame-specific frame elements (FEs) based on FN, and B.2 benchmarked unsupervised role labeling of arguments in Case Grammar terms (Fillmore, 1968) and against a set of generic semantic roles, taken primarily from VerbNet.

Task Description
The task was unsupervised in that it forbade the use of any explicit semantic annotation (only permitting morphosyntactic annotation). Instead, we encouraged the use of unsupervised representation learning methods (e.g., word embeddings, Brown clusters) to obtain semantic information. Hence, systems learn and assign semantic labels to test records without appealing to any explicit training labels. For development purposes, participants received a small labeled development set.

Task A: Clustering Verbs
The goal of this task was to identify verbs that evoke the same frame. The task involved labeling verb uses in context to resemble their categorization based on Frame Semantics (Figure 2a). Here, we used FN 1.7 as the reference for frame definitions. Hence, the task constituted the unsupervised induction of FN's lexical units, where a lexical unit (LU) is a pairing of a lemma and a frame. For example, we expected that the LUs auction.v, retail.v, sell.v, etc., which evoke the typed situation of COMMERCE SELL, be labeled with the same unsupervised tag.
The task resembles word sense induction in that it assigns a class (or sense) label to a verb. In word sense induction (WSI), labels are determined and evaluated on word forms (lemma + part-of-speech, e.g., sell.v or auction.n). WSI evaluations assume that the inventories of senses (the sets $S_i$) for different word forms $f_i$ are devised independently. For instance, assuming $f_1$ is labeled with the set of senses $S_1$ and $f_2$ with $S_2$, then $S_1 \cap S_2 \neq \emptyset$ only if $f_1 = f_2$; and, if $f_1 \neq f_2$, then $S_1 \cap S_2 = \emptyset$ (as in other SemEval benchmarks, including Agirre and Soroa (2007); Manandhar et al. (2010); Jurgens and Klapaftis (2013); Navigli and Vannella (2013)). For instance, in WSI evaluations based on OntoNotes (Hovy et al., 2006), six different labels from $S_{sell}$ are assigned to the lemma sell.v, and one label $s$ is assigned to auction.v, knowing that $s \notin S_{sell}$. Typically, lexical semantic relationships among members of the sets $S_i$ (e.g., synonymy, antonymy) are then analyzed independently of WSI (e.g., Lenci and Benotto (2012); Girju et al. (2007); McCarthy and Navigli (2007)). In contrast, this task assumes that the sense inventory is defined independently of word forms.
This task involves uncovering the mapping between word forms $f$ and members of $S$ such that different word forms (i.e., $f_i \neq f_j$) can be mapped to the same meaning (label), and the same meaning (label) can be mapped to several word forms. We defined $S$ with respect to FrameNet and assumed that its typed-situation frames are units of meaning. So, COMMERCE SELL captures the meaning associated with both sell.v and auction.v, as well as other selling-related words. Hence, in some sense, Task A goes beyond the ordinary WSI task as it also demands identifying (unspecified) lexical semantic relationships between verbs.

Task B.1: Unsupervised Frame Semantic Argument Labeling
Taking the frames as primary and defining roles relative to each frame, the aim of Task B.1 was to cluster prespecified verb-headed argument structures according to the principles of Frame Semantics, where FrameNet served as the reference for evaluation. This task amounted to unsupervised labeling of frames and core FEs (Figure 2b). Because FrameNet defines FEs frame-specifically, Task B.1 entails Task A. Given a set of semantically unlabeled arguments as input (e.g., Figure 1a), the root nodes (i.e., verbs) are clustered and assigned to a set of unsupervised frame labels $\pi_i$ ($1 \le i \le n$, where $n$ is the number of latent frames). Then, the arguments are labeled with semantic role labels (FEs) interpreted locally given the frame. That is, for any pair of distinct frame labels $\pi_x$ and $\pi_y$, the set of roles $R_x$ assigned to arguments under $\pi_x$ is assumed to be independent of the set $R_y$ for $\pi_y$ ($R_x \cap R_y = \emptyset$).
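One way to read this locality assumption is that a role label only has meaning together with the frame label of its verb. The following is a minimal Python sketch of that reading; the function name `frame_local_roles` and the toy labels are illustrative assumptions, not part of the task's tooling.

```python
def frame_local_roles(frame_labels, role_labels):
    """Pair each argument's role label with the frame label of its verb.

    Both lists are aligned per argument. Pairing keeps the role inventories
    of different frames disjoint, even when a system reuses role names.
    """
    if len(frame_labels) != len(role_labels):
        raise ValueError("need one frame label and one role label per argument")
    return list(zip(frame_labels, role_labels))


# Toy example (hypothetical labels): two induced frames both use the name "r1".
labels = frame_local_roles(["pi_1", "pi_1", "pi_2"], ["r1", "r2", "r1"])
roles_pi_1 = {pair for pair in labels if pair[0] == "pi_1"}
roles_pi_2 = {pair for pair in labels if pair[0] == "pi_2"}
```

Under this pairing, the role named `r1` under `pi_1` and the role named `r1` under `pi_2` count as different labels.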

Task B.2: Unsupervised Case Role Labeling
We defined Subtask B.2 in parallel to Subtask B.1; it draws on an idea from Case Grammar. The arguments of a verb in a set of prespecified subcategorization frames were clustered according to a common set of generic semantic roles (Figure 2c). Here, the task assumed that semantic roles are universal and generic (e.g., Agent, Patient). Their configuration determines the argument structure of verb-headed phrases. We evaluated this unsupervised labeling of arguments with semantic roles independently of the class, sense, and word form of a verb. We compared the role labels against a set of semantic roles from VerbNet 3.2 (Kipper et al., 2000). Given a verb instance, no guarantee exists that input argument structures for B.2 and B.1 would be the same.

Evaluation Dataset
The dataset consists of manual annotations for verb-headed frame structures anchored in tokenized sentences. These frame structures were manually annotated using the guidelines for this task (Q. Zadeh and Petruck, 2019). For example, as already illustrated, the verb come from.v is annotated in terms of FN's ORIGIN frame and its core FEs, as Example 1 shows.

(1) Criticism of futures COMES FROM Wall Street.
    Criticism: ENTITY; Wall Street: ORIGIN

Also, using the set of 32 generic semantic role labels in VerbNet 3.2 and two additional roles, COGNIZER and CONTENT, we annotated arguments of the verb as follows.

    Criticism: THEME; come from; Wall Street: SOURCE

We assumed unique identifiers for sentences, e.g., #s1 for Example 1. The evaluation record for come from.v (Task A) appears below, where #s1 4 5 specifies the position of the verb in the sentence (Example 1). We stripped off the manually asserted labels from the records and passed them to systems for assigning unsupervised labels. Later, a scorer program (Section 5) compared the system-generated labels with the manually assigned labels.

Data Sampling
We sampled data from the Wall Street Journal (WSJ) corpus of the Penn Treebank. Kallmeyer et al. (2018) provided frame annotations similar to those in this task for a portion of WSJ sentences, using SemLink and EngVallex (Cinková et al., 2014) to generate frame semantic annotations semi-automatically. That work was based on FrameNet and the Prague Dependency Treebank (PSD) (Hajič et al., 2012) from the Broad-coverage Semantic Dependency resource (Oepen et al., 2016). We started by annotating a portion of the records in Kallmeyer et al. (2018), and later deviated from this subset to create a more representative sample of the overall diversity and distribution of verbs in the WSJ corpus using a stratified random sampling method.

Guidelines
The annotation guidelines for this task were slightly different from those of FrameNet and various semantic dependency treebanks. In contrast to FN, which annotates a full span of text as an argument filler, or PropBank, which annotates syntactic constituents of arguments of verbs (Palmer et al., 2005), we identified the text spans and only annotated a single word or a multi-word unit (MWU), i.e., the semantic head of the span, like annotations in Oepen et al. (2016) and Abstract Meaning Representation (Banarescu et al., 2013). To illustrate, in Example 1, FN would annotate Criticism of futures as filling the FE ENTITY. We only annotated Criticism, understanding it as the LU that evokes JUDGMENT COMMUNICATION, which in turn represents the meaning of the whole text span. Thus, we assumed that another frame $f_a$ fills an argument of a frame. We annotated only the main content word(s) that evoke(s) $f_a$; these main words are the semantic heads. Multi-word unit semantic heads (e.g., named entities, word form combinations) are annotated as if they were a single word form, such as Wall Street (Example 1), excluding modifiers. In contrast to semantic dependency structures (e.g., DELPH-IN MRS-Derived Semantic Dependencies, Enju Predicate-Argument Structures, and Tectogrammatical Representation in PSD (Oepen et al., 2016)), we did not commit to the underlying syntactic structure of the sentence since we were not obliged to relabel only syntactic structures. Rather, we annotated words and MWUs if the frame analysis permitted doing so. 5

Annotation Procedure
We annotated the data in a modular manner and in a semi-controlled environment using an annotation system developed for this purpose. The procedure consisted of four steps: 1) Reading and Comprehension; 2) Choosing a Frame; 3) Annotating Arguments; and 4) Rating, Commenting, or Revising. We tracked and logged all changes in the data as well as annotator interaction with the annotation system upon starting to annotate. The tool measured the time that annotators spent on each record and each annotation step, as well as how annotators moved between steps.
In Step 1, annotators viewed a sentence with one highlighted verb, as in Example 2.
(2) Criticism of futures COMES from Wall Street.
The goal of this step was understanding the meaning of the verb and its semantic function, and identifying semantic heads of arguments and their associated words or MWUs. To continue, an annotator had to confirm that they understood the verb's meaning and could identify its semantic arguments. Without confirmation, the annotator terminated the annotation process for that input sentence and went on to the next one.
If confirmed, Step 2 required the annotator to choose the frame that the verb evoked. This step may have included annotating multi-word phrasal verbs, e.g., COMES+FROM (Example 2). The annotation system assisted by providing a list of likely frames for the verb, including a LU lookup function (as in FN), an extended set of LUs derived via statistical methods, and previously logged annotations. After reviewing the definitions of the proposed frames, annotators chose one, or annotated the verb form with a different existing FN frame. Otherwise, the annotator terminated the process and the record moved to the list of "skipped items".
5 Q. Zadeh and Petruck describe the issues in detail.
The annotation of arguments, Step 3, required that annotators label the core FEs of the chosen frame by first identifying their semantic head, which first may have required marking MWUs, e.g., Wall+Street in Example 3, below.
(3) Criticism of futures comes from Wall Street.
The tool listed the core FEs and their definitions, and checked the integrity of record annotations to ensure that each core FE was annotated only once. In parallel, annotators added the verb's subcategorization frame and its semantic role. We did not annotate null instantiated FEs (although FN does). During Step 3, annotators could go back to the previous step and change their choice of frame type.
For Step 4, annotators rated their annotation, stating their opinion on how well the annotated instance fit FrameNet's definition and how it compared to other annotated instances. In a sense, annotators measured their confidence in the assigned labels. They did so by selecting a number on a scale from 1 to 5, with 1 meaning not confident at all and 5 the most confident, i.e., the annotation fit the chosen FrameNet frame, its definition, and examples perfectly. Annotators had the option to add free text comments on each record.
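The integrity check mentioned for Step 3 can be pictured with a short Python sketch. It is not the annotation system's actual code: the function name `check_core_fes`, the record layout (a list of FE label and semantic head pairs), and the returned dictionary are assumptions made for illustration. The core FE set for ORIGIN follows Example 1.

```python
from collections import Counter


def check_core_fes(annotations, core_fes):
    """Report core FEs annotated more than once and labels outside the core set.

    annotations: list of (fe_label, semantic_head) pairs for one record.
    core_fes: set of core FE labels of the chosen frame.
    """
    counts = Counter(fe for fe, _head in annotations)
    return {
        "duplicated": sorted(fe for fe, c in counts.items() if c > 1),
        "not_core": sorted(fe for fe in counts if fe not in core_fes),
    }


origin_core = {"ENTITY", "ORIGIN"}  # core FEs used in Example 1
ok = check_core_fes([("ENTITY", "Criticism"), ("ORIGIN", "Wall Street")], origin_core)
bad = check_core_fes([("ENTITY", "Criticism"), ("ENTITY", "futures")], origin_core)
```

A record would pass such a check only when both lists in the result are empty.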
The annotation procedure was rarely straightforward. Given the interdependence of Steps 2 and 3, annotators usually moved back and forth between them. In Step 2 an annotator might believe that a target verb did not belong in any existing FN frame. Likewise, annotators could terminate the annotation process even upon reaching the last step.

Quality Control
At least two annotators verified all annotation used in the evaluation. A main annotator annotated all records in the dataset; two other annotators verified or disputed those annotations. If annotators could not reach an agreement, we removed the record from the SemEval dataset.
A full analysis of annotator disagreement goes beyond the scope of this work. While the source of annotator disagreement may seem trivial and simple (e.g., only one annotator understood the sentence correctly), we believe that some sentences may have more than one interpretation, all of which are plausible. Like the disagreement resulting from incorrect frame assignment, deciding what frame a verb evokes may be challenging; and resolving the dilemma is not always simple. Choosing between two related frames (e.g., BUILDING vs. INTENTIONALLY CREATE, related via Inheritance in FN), or identifying metaphorical and non-metaphorical uses of a verb requires subtle and sophisticated understanding of the semantics of the language, and of Frame Semantics. At times, disagreements pointed to more complex linguistic issues that remain in debate, e.g., choosing the semantic head of a syntactically complex argument, treating quantifiers, conjunctions, etc. Table 1 shows a statistical summary of the annotation task. The SemEval column reports the statistics for the final set of records, i.e., gold records with double-agreement between annotators, and which we used to evaluate the systems. Total reports the statistics of all analyzed records, from which we chose our SemEval data. Skipped and InProg show the statistics for discarded records and records without a final decision, respectively. Dev shows the statistics for the development set.

Summary statistics
Each of the rows reports a value of a component of the data or annotator interaction with the data. Confidence reports the average of annotator-assigned confidence scores for annotations per record. Although interpreting this measure demands more work, the averages appear to be as expected. Specifically, SemEval is higher in value than both InProg and Skipped, facts that we associate with double agreement and the choice reviewing process. Still, many records with high confidence scores remained as InProg given the lack of double agreement. Table 5 (Appendix A.1) lists the top 10 frames annotated with their respective highest and lowest confidence ratings averaged by their frequency in SemEval.
The last two rows of Table 1 are meta-data on the annotation process. Time reports the total time annotators spent in active annotation, engaged in the steps described above (742 hours), excluding the reviewing process (Section 4.3.1) and including the time to annotate MWUs. Total-Move is the total number of logical moves for frame annotation between annotators and the annotation system, i.e., logged changes in the process of frame and core FE annotation. This number excludes annotation of verb subcategorization with generic semantic roles. In SemEval, annotated frames had an average of 2.15 arguments, requiring a minimum of five logical moves to annotate (MWU-less sentences). However, on average, each SemEval record required 14.8 moves. This number is even higher for InProg (18.2); we believe that it indicates the complexity of the annotation task. Table 4 (Appendix A.1) further details annotator activity, with time spent and moves per annotation step. As expected, frame annotation of verbs (Step 2) was the most time-consuming part of the task.

Development Dataset
Shared task participants received a development set consisting of 600 records from a total of 4,620 records, where Table 4 shows the statistics. The development set contained gold annotations for all three subtasks.

Evaluation Metrics
For all subtasks, as figures of merit, we report the performance of participating systems with measures for evaluating text clustering techniques. These include the classic measures of Purity (PU) and inverse-Purity (IPU) and their harmonic mean (PIF) (Steinbach et al., 2000), as well as BCubed precision (BCP) and recall (BCR) and their harmonic mean (BCF) (Bagga and Baldwin, 1998).
To compute these measures for the pairing of reference-labeled data and unsupervised-labeled data (with each having an exact set of annotated items), we built a contingency table T with rows for gold labels and columns for unsupervised system labels. We filled the table with the number of intersecting items, as done in cross-tabulation of results in classification tasks to compute precision and recall. For Task A (Section 3), T tracks the unsupervised system labels and the gold reference labels assigned to verbs. For Task B.1, we labeled the rows and columns of T with tuples (l v , l a ), where l v labels the frame evoking verb and l a labels the FE filler. For Task B.2, the rows and columns in T track the unsupervised system labels and the gold reference labels (generic semantic roles) assigned to arguments.
These performance measures reflect a notion of similarity between the distribution of unsupervised labels and that of the gold reference labels, given certain criteria. Specifically, they define the notions of consistency and completeness of automatically generated clusters based on the evaluation data. Each method measures consistency and completeness in its own way, and alone may lack sufficient information for a clear understanding and analysis of system performance (Amigó et al., 2009). But, as the single metric for system ranking, we used the BCF measure, given its satisfactory behavior in certain situations. Note that we modeled the task and its evaluation as hard clustering, where a record receives only one label, without overlap in any generated category of items.
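The measures described in this section can be sketched in a few lines of Python. This is not the task's official scorer: the function names are illustrative, and the sketch assumes a hard clustering in which the gold and system label lists are aligned over the same items. Following the text above, PIF and BCF are computed as plain harmonic means.

```python
from collections import Counter


def contingency(gold, pred):
    """Count items per (gold label, system label) pair: the table T."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and system labels must cover the same non-empty item set")
    return Counter(zip(gold, pred))


def harmonic_mean(a, b):
    return 2 * a * b / (a + b) if a + b > 0 else 0.0


def purity_scores(gold, pred):
    """Return (PU, IPU, PIF)."""
    table = contingency(gold, pred)
    n = len(gold)
    best_per_cluster = {}  # system label -> largest overlap with one gold class
    best_per_class = {}    # gold label -> largest overlap with one system cluster
    for (g, p), count in table.items():
        best_per_cluster[p] = max(best_per_cluster.get(p, 0), count)
        best_per_class[g] = max(best_per_class.get(g, 0), count)
    pu = sum(best_per_cluster.values()) / n
    ipu = sum(best_per_class.values()) / n
    return pu, ipu, harmonic_mean(pu, ipu)


def bcubed_scores(gold, pred):
    """Return (BCP, BCR, BCF)."""
    table = contingency(gold, pred)
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    n = len(gold)
    # Each of the `c` items in cell (g, p) has precision c/|p| and recall c/|g|.
    bcp = sum(c * c / pred_sizes[p] for (g, p), c in table.items()) / n
    bcr = sum(c * c / gold_sizes[g] for (g, p), c in table.items()) / n
    return bcp, bcr, harmonic_mean(bcp, bcr)


# Toy example with hypothetical labels: five verb instances, two gold frames.
gold = ["sell", "sell", "sell", "buy", "buy"]
pred = ["c1", "c1", "c2", "c2", "c2"]
pu, ipu, pif = purity_scores(gold, pred)
bcp, bcr, bcf = bcubed_scores(gold, pred)
```

For Task B.1, the same functions apply when each label is a tuple of a verb label and an argument label.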

Baselines
Similar to other clustering tasks, we use baselines of random, all-in-one-cluster (AIN1), and one-cluster-per-instance (1CPI). Additionally, we adapted the baseline of the most frequent sense in WSI for these tasks by introducing the one-cluster-per-head (1CPH) baseline in Task A, and one-cluster-per-syntactic-category (1CPG) for verb argument clustering in Task B.2. For Task B.1, we built a baseline, 1CPHG, for labeling verbs with their lemmas (as in 1CPH) and FEs with their grammatical relation to their heads (as in 1CPG). We included two more labels, lcmpx and rcmpx, for frame fillers with no direct syntactic relation to the head verb, occurring to the left or to the right of the verb, respectively.
Both 1CPH and 1CPG (and their combination for Task B.1) are hard to beat because of the long-tailed frequency distribution of our test data. For example, most verbs frequently instantiate one particular frame and rarely other ones. Similarly, a particular role (FE) frequently is filled by words that have a particular grammatical relation to its governing verb; e.g., most subjects of most verb forms receive the agent label in their subcategorization frame (or an agent-like element in their Frame Semantics representations). The labels chosen for grammatical relations evidently influence the 1CPG and 1CPHG scores. Values reported later (specifically, Tables 6 and 2) could be improved by employing heuristics, e.g., relabeling enhanced dependencies using a few rules.
We also employed two system baselines, one unsupervised and one supervised. For the unsupervised one, we trained the system with data from Kallmeyer et al. (2018). For the supervised one, we used OPEN-SESAME, a state-of-the-art supervised FrameNet tagger (Swayamdipta et al., 2018). After converting its output to the format of the present task, we evaluated it like the other systems. Both systems were trained out of the box with no additional tuning.
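The label-assignment baselines above are simple enough to sketch in Python. The record fields used here (`lemma`, `verb_lemma`, `relation`, `position`, `verb_position`) and the relation name `nsubj` are hypothetical; the task's actual record format and dependency labels may differ.

```python
import random


def ain1(items):
    """All-in-one-cluster: every item receives the same label."""
    return ["c0"] * len(items)


def one_cpi(items):
    """One-cluster-per-instance: every item receives its own label."""
    return [f"c{k}" for k in range(len(items))]


def random_labels(items, n_clusters, seed=0):
    """Random baseline: draw one of n_clusters labels per item."""
    rng = random.Random(seed)
    return [f"c{rng.randrange(n_clusters)}" for _ in items]


def one_cph(verbs):
    """One-cluster-per-head (Task A): label each verb with its lemma."""
    return [v["lemma"] for v in verbs]


def relation_label(arg):
    """Grammatical relation to the head verb, or lcmpx/rcmpx when there is none."""
    if arg["relation"] is not None:
        return arg["relation"]
    return "lcmpx" if arg["position"] < arg["verb_position"] else "rcmpx"


def one_cpg(args):
    """One-cluster-per-syntactic-category (Task B.2)."""
    return [relation_label(a) for a in args]


def one_cphg(args):
    """Combination for Task B.1: (verb lemma, grammatical relation) tuples."""
    return [(a["verb_lemma"], relation_label(a)) for a in args]


# Toy arguments for Example 1 (field values are assumptions for illustration).
args = [
    {"verb_lemma": "come from", "relation": "nsubj", "position": 1, "verb_position": 4},
    {"verb_lemma": "come from", "relation": None, "position": 6, "verb_position": 4},
]
b2_labels = one_cpg(args)
b1_labels = one_cphg(args)
```

Because these baselines only read lemmas and syntactic relations, they respect the restriction to morphosyntactic annotation.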

System Descriptions
We received submissions from nine teams (13 participants). Only three chose to submit system description papers. Arefyev et al. provided a solution for Task A and Task B.2, using both sets of these results to address Task B.1. For Task A, they used language models and Hearst-like patterns to tune and obtain contextualized vector representations for the verbs in the test set. A hierarchical agglomerative clustering method followed, where hyperparameters were set with labeled and unlabeled records from the development and test sets. For Task B.2, they employed a logistic regression trained over the development set to identify only the most frequent labels. The classifier was based on features obtained from a language model and hand-crafted rules. Using logistic regression and training this algorithm with the development set remains an issue of concern, given the intended unsupervised scenario. While we objected to using the development set to train a supervised system for this subtask, we still report its scores. The differences between its results and those of the other systems may be informative. Still, we considered Arefyev et al.'s results for Task B only complementarily, not to rank the systems.
[...] (Ustalov et al., 2018), and Biemann's clustering algorithm (Biemann, 2006).
Table 2 reports the BCF scores for system submissions along with a baseline for each task. As the table shows, each system performs best in only one of the tasks. We report Arefyev et al.'s submission for Tasks B.1 and B.2 only to show the benefit of using a small amount of training data and a supervised method together with a clustering algorithm, provided that such training data is available. Finding the optimal (actual) number of clusters is an open research area. Participants knew the number of clusters: whereas Arefyev et al. and Anwar et al. used this information, Ribeiro et al. opted for a statistical method tuned with data that we provided.

Results and Data Analysis
The baseline systems, the unsupervised method of Kallmeyer et al. (2018) [...] of all systems regarding BCF. This result is not surprising since that work did not effectively handle MWUs in the test set, where only the head of the MWU was kept. However, the output of Open-SESAME and its low BCF were indeed surprising.
We fed Open-SESAME the sentences from the test set; it identified approximately 5k frames. However, the overlap with the test set was only 1,216 records (an identification problem in Open-SESAME). These 1,216 records exhibit a mismatch between 536 of the arguments and their respective target verbs. We ignored the system's extra or incorrectly generated arguments, and replaced the missing items with those of the 1CPHG baseline records. We then evaluated the resulting records against the task's gold data in the same way as the participants' submissions. As Table 3 shows, the unsupervised method outperforms the supervised system for all tasks by a wide margin, i.e., the unsupervised label set can carry more information than does the supervised label set.
We also compared results with respect to the confidence measure that annotators assigned to records. First, we split the evaluation records according to their assigned confidence value into five subsets $E_i$, $1 \le i \le 5$, such that subset $E_1$ contained only records with confidence value 1, $E_2$ contained only records with confidence value 2, etc. Then we evaluated system outputs on each subset $E_i$ and logged the resulting BCF. Later, we performed this evaluation cumulatively by adding to each $E_i$ the records from all $E_j$ with $i < j$. Interpreting the obtained values requires careful attention (e.g., changes in the prior probabilities of gold clusters and their cardinality must be taken into account). Overall, however, we observed a similar trend for all systems, namely, as expected, a positive correlation between the confidence value and BCF. Thus, what human annotators usually found hard to annotate, automatic systems also found hard to cluster. (The reverse relation does not hold.) Or, pessimistically, the level of noise in annotations increases as their associated confidence decreases. (Table 7 in Appendix A.2 details the results.)
Finally, we wanted to identify the frames that machines found difficult to cluster. To estimate difficulty, we used the differences in BCF under the following conditions. For each system, we repeated the evaluation process $n$ times (where $n$ is the number of gold labels for a task). In each iteration $i$, $1 \le i \le n$, we removed all data items of gold category $i$. We measured the resulting BCF in the given iteration and subtracted it from the system's BCF over the entire gold set. To cancel frequency effects, we normalized the differences by the number of gold data instances. For example, we removed all records annotated as COMMERCE SELL from the evaluation set $E$ to form $E'$. We computed the BCF of the systems over $E'$ ($E' \subset E$), and measured $d = E_{BCF} - E'_{BCF}$. We interpreted a positive difference as an easy-to-cluster gold category $i$, and a negative difference as a hard-to-cluster gold category $i$. Table 8 and Table 9 show a summary of the results for Task A and Task B.2, respectively. All systems performed similarly for approximately 30% of the gold classes. Comparing differences across systems and the baselines of 1CPH and 1CPG reveals (possibly) interesting information. Thus, for example, in Task A, most systems found COMMERCE SELL hard and COMMERCE BUY easy to cluster. Interestingly, a set of six verbs evokes each frame: buy, purchase, buy back, buy up, buy out, buy into for COMMERCE BUY; and sell, retail, auction, place, deal, resell for COMMERCE SELL. From these two sets of verbs, three are polysemous: buy in the former, and place and deal in the latter. Does the morphology of the verbs (e.g., buy-back, resell) make one easy to cluster? Alternatively, are other factors at play, such as the number of verb instances? How these factors might influence the proposed naive BCF-difference model is an open question.
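The leave-one-category-out estimate of difficulty can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the analysis code behind Tables 8 and 9: the function names are hypothetical, BCF is computed as the harmonic mean of BCubed precision and recall for a hard clustering, and the normalization divides by the number of removed instances of the category, which is one possible reading of the normalization step.

```python
from collections import Counter


def bcubed_f(gold, pred):
    """Harmonic mean of BCubed precision and recall for a hard clustering."""
    table = Counter(zip(gold, pred))
    gold_sizes, pred_sizes = Counter(gold), Counter(pred)
    n = len(gold)
    bcp = sum(c * c / pred_sizes[p] for (g, p), c in table.items()) / n
    bcr = sum(c * c / gold_sizes[g] for (g, p), c in table.items()) / n
    return 2 * bcp * bcr / (bcp + bcr) if bcp + bcr > 0 else 0.0


def category_differences(gold, pred):
    """Normalized d = BCF(E) - BCF(E') for each gold category removed from E."""
    full = bcubed_f(gold, pred)
    diffs = {}
    for category in sorted(set(gold)):
        kept = [(g, p) for g, p in zip(gold, pred) if g != category]
        if not kept:  # a single-category gold set leaves nothing to score
            continue
        reduced = bcubed_f([g for g, _ in kept], [p for _, p in kept])
        removed = len(gold) - len(kept)
        diffs[category] = (full - reduced) / removed  # assumed normalization
    return diffs


# Toy example: category "A" is clustered perfectly, category "B" is split.
gold = ["A", "A", "B", "B"]
pred = ["x", "x", "y", "z"]
d = category_differences(gold, pred)
```

In the toy example, removing the well-clustered category lowers BCF (positive difference, easy), while removing the split category raises it (negative difference, hard).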

Concluding Remarks
We have presented the SemEval 2019 task on unsupervised lexical frame induction. We described the task in detail, provided a summary of methods that participants developed, and compared the results. Although much room for improvement of the task remains, we consider it a step forward. It employed a well-motivated typology of lexical frames to distinguish lexical frame induction tasks. The evaluation data derived from annotations of a well-known resource, namely a portion of WSJ sentences, perhaps the most annotated corpus of English. These features provide opportunities for future investigation, in particular in studies related to reciprocal relations between syntactic and lexical semantic frame structures.
One reason to promote using unsupervised methods is their inherent flexibility to embrace unknown data. These methods have a high margin of tolerance for noise, and perform better than supervised methods with insufficient training data. For unsupervised methods, obtaining or generating training data is easier than for supervised methods because they simply do not require annotation. For example, all participant systems could collect similar unlabeled training data from only syntactically annotated corpora to generate more unlabeled records. Ultimately, such methods can achieve respectable performance, and produce clusters that are more informative than both the unlabeled input and supervised categories (under certain situations). As shown, unsupervised methods can even outperform a state-of-the-art Frame Semantics parser by a wide margin (Section 7), while a very large gap remains for improvements in future work.

A Appendices
A.1 Appendix I: Annotation Process
Table 6 extends Table 2. Section 5 defines the abbreviations. A horizontal line separates participating systems and the baselines.
Table 7: Changes in BCF score of systems relative to changes in evaluation records based on assigned confidence measure.