HHMM at SemEval-2019 Task 2: Unsupervised Frame Induction using Contextualized Word Embeddings

We present our system for semantic frame induction that showed the best performance in Subtask B.1 and finished as the runner-up in Subtask A of the SemEval 2019 Task 2 on unsupervised semantic frame induction (Qasem-iZadeh et al., 2019). Our approach separates this task into two independent steps: verb clustering using word and their context embeddings and role labeling by combining these embeddings with syntactical features. A simple combination of these steps shows very competitive results and can be extended to process other datasets and languages.


Introduction
Recent years have seen a lot of interest in computational models of frame semantics, with the availability of annotated sources like Prop-Bank (Palmer et al., 2005) and FrameNet (Baker et al., 1998). Unfortunately, such annotated resources are very scarce due to their language and domain specificity. Consequently, there has been work that investigated methods for unsupervised frame acquisition and parsing (Lang and Lapata, 2010;Modi et al., 2012;Kallmeyer et al., 2018;. Researchers have used different approaches to induce frames, including clustering verb-specific arguments as per their roles (Lang and Lapata, 2010), subject-verb-object triples , syntactic dependency representation using dependency formats like CoNLL (Modi et al., 2012;, and latent-variable PCFG models (Kallmeyer et al., 2018).
The SemEval 2019 task of semantic frame and role induction consists of three subtasks: (A) learning the frame type of the highlighted verb from the context in which it has been used; (B.1) clustering the highlighted arguments of the verb into specific roles as per the frame type of that verb, e.g., Buyer, Goods, etc.; (B.2) clustering the arguments into generic roles as per VerbNet classes (Schuler, 2005), without considering the frame type of the verb, i.e., Agent, Theme, etc.
Our approach to frame induction is similar to the word sense induction approach by , which uses tf-idf-weighted context word embeddings for a shared task on word sense induction by . In this unsupervised task, our approach for clustering mainly consists of exploring the effectiveness of already available pre-trained models. 1 Main contributions of this paper are: 1. a method that uses contextualized distributional word representations (embeddings) for grouping verbs to frame type clusters (Subtask A); 2. a method that combines word and context embeddings for clustering arguments of verbs to frame slots (Subtasks B.1 and B.2).
The key difference of our approach with respect to prior work by  and Kallmeyer et al. (2018) is that we have only used pre-trained embeddings to disambiguate the verb senses and then combined these embeddings with additional features for semantic labeling of the verb roles. 2 The remainder of the paper is organized as follows. The methodology and the results for each subtask are discussed in Sections 2, 3 and 4 respectively, followed by the conclusion in Section 5.

Subtask A: Grouping Verbs to Frame Type Clusters
In this subtask, each sentence has a highlighted verb, which is usually the predicate. The goal is to label each highlighted verb according to the frame evoked by the sentence. The gold standard for this subtask is based on the FrameNet (Baker et al., 1998) definitions for frames.

Method
Since sentences evoking the same frame should receive the same labels, we used a verb clustering approach and experimented with a number of pre-trained word and sentence embeddings models, namely Word2Vec (Mikolov et al., 2013), ELMo (Peters et al., 2018), Universal Sentence Embeddings (Conneau et al., 2017), and fast-Text (Bojanowski et al., 2017). This setup is similar to treating the frame induction task as a word sense disambiguation task (Brown et al., 2011). We experimented with embedding different lexical units, such as verb (V), its sentence (context, C), subject-verb-object (SVO) triples, and verb arguments. Combination of context and word representations (C+W) from Word2Vec and ELMo turned out to be the best combination in our case.
We used the standard Google News Word2Vec embedding model by Mikolov et al. (2013). Since this model is trained on individual words only and the SemEval dataset contained phrasal verbs, such as fall back and buy out, we have considered only the first word in the phrase. If this word is not present in the model vocabulary, we fall back to a zero-filled vector. When aggregating a context into a vector, we used the tf-idf-weighted average of the word embeddings for this context as proposed by . We tuned these weights on the development dataset.
We used the ELMo contextualized embedding model by Peters et al. (2018) that generates vectors of a whole context. Similarly to fast-Text (Bojanowski et al., 2017), ELMo can produce character-level word representations to handle outof-vocabulary words. In all our experiments we used the same pre-trained ELMo model available on TensorFlow Hub. 3 Among all the layers of this model, we used the mean-pooling layer for word and context embeddings.

Results and Discussion
We experimented with different clustering algorithms provided by scikit-learn (Pedregosa et al., 2011), namely agglomerative clustering, DB-SCAN, and affinity propagation. After the model selection on the development dataset, we have chosen agglomerative clustering for further evaluation. Although both ELMo and Word2Vec showed the best results on the development dataset with single linkage, we opted average linkage after analyzing t-SNE plots (van der Maaten and Hinton, 2008). Table 1 shows our results obtained on Subtask A. Our final submission ( ) used agglomerative clustering of normalized vectors obtained by concatenating the context and verb vectors from the Word2Vec model. In particular, we found that the best performance is attained for Manhattan affinity and 150 clusters. During our post-competition experiments (Ȯ), we found that ELMo performed better than Word2Vec when a higher number of clusters, 235, was specified.

Subtask B.1: Clustering Arguments of Verbs to Frame-Specific Slots
In this subtask, each sentence has a set of highlighted nouns or noun phrases corresponding to the slots of the evoked frame. Additionally, each sentence is provided with the same highlighted verb as in Subtask A (Section 2). The goal is to label each highlighted verb according to the evoked frame and to assign each highlighted to-  ken a frame-specific semantic role identifier. The gold standard for this subtask is annotated with FrameNet frames and roles (Baker et al., 1998).

Method
Since Subtask B.1 asks to assign role labels to highlighted tokens as per the frame type of the verb, we attempted this by merging the output of verb frame types from Subtask A (Section 2) and the output of generic role labels from Subtask B.2 (Section 4). We used UKN (unknown) slot identifier for the tokens present in Subtask B.1, but missing in Subtask B.2. Table 2 shows the results from merging our solutions for Subtasks A and B.2, as described in Sections 2 and 4, correspondingly. For our final submission ( ), we merged the frame types obtained by clustering the Word2Vec embeddings of the sentence (context, C) and verb (word, W), and the role labels obtained by clustering the vector of inbound dependencies (ID). However, we observed that the logistic regression model demonstrated better performance in Subtask B.2 than any clustering technique we tried, including our final submission ( ) and the baselines. But this performance was further improved by combining the results from post-competition experiments of Subtask A and Subtask B.2 (Ȯ).

Subtask B.2: Clustering Arguments of Verbs to Generic Roles
In Subtask B.2, similarly to Subtask B.1 (Section 3), each sentence has a set of highlighted nouns or noun phrases that correspond to the slots of the evoked frame. The goal is to label each highlighted token with a high-level generic class, such as Agent or Patient. However, unlike Subtask B.1, the verb frame labeling part is omitted. The gold standard for this subtask is annotated as according to the VerbNet classes (Schuler, 2005).

Method
When addressing this subtask, we experimented with combining the embeddings of the word (W) filling the role, its sentence (context, C), and the highlighted verb (V). To handle the out-ofvocabulary roles in the case of Word2Vec embeddings, each role was tokenized and embeddings for each token were averaged. If a token is still not present in the vocabulary, then a zero-filled vector was used as its embedding. During prototyping we developed several features that improved the performance score, namely inbound dependencies (ID), which represent the dependency label from the head to the role (dependent) and two trivial baselines: Boolean (B) and 123.
We built a negative one-hot encoding feature vector to represent the inbound dependencies of the word corresponding to the role. Thus, for each dependency of the given role (in case of a multiword expression), we fill -1 if the dependency relationship holds, otherwise 0 is filled. During our experiments for the development test, we also used the outbound dependencies, which represent the dependency label from the role (head) to the dependent words. So we used -1 for inbound and 1 for outbound. But since they did not perform well in comparison to inbound dependencies, they were not considered for submitted runs.
For the Boolean baseline, given the position of the verb in the sentence p v and the position of the target token p t , we assign the role 0 to t if p v < p t , otherwise 1. For the 123 baseline, we assign its index to each highlighted slot filler. For example, if five slots need to be labelled, the first one will be labelled as 1 and the last one will be labelled as 5.  formed LPCFG (Kallmeyer et al., 2018) and all the standard baselines, including cluster per dependency role (OneClustPerGrType), on the development dataset. 4 Similarly to our solution for Subtask A (Section 2), we tried different clustering algorithms to cluster arguments of verbs to generic roles and found that the best clustering performance is shown by agglomerative clustering with Euclidean affinity, Ward's method linkage, and two clusters. Our final submission ( ) used the combination of inbound dependencies and Word2Vec embedding for sentence (context, C), which performed marginally better than the cluster per dependency role (OneClustPerGrType) baseline, but still not better than such trivial baselines as Boolean or 123. Replacing Word2Vec with ELMo in our postcompetition experiments have lowered the performance further.

Results and Discussion
In order to estimate our upper bound of the performance, we compared our best-performing clustering algorithm, i.e., agglomerative clustering, to a logistic regression model. We found that the combination of sentence (context, C), target word (W), and verb (V) vectors, enhanced with our other features, shows substantially better results than a simple clustering model ( ). However, we did not observe a noticeable difference between the per-4 On the development dataset for Subtask B.2, the Boolean baseline demonstrated B-Cubed F1 = 57.98, while LPCFG and cluster per dependency role yielded F1 = 40.05 and F1 = 50.79, correspondingly. formance of the underlying embedding models.
As the model was trained on the development dataset that contained 20 roles in contrast to the test set which contained 32 roles, this approach has its limitations due to this difference of the number and meaning of roles. We believe that the performance could be improved using semisupervised clustering methods, yet during prototyping with the pairwise-constrained k-Means algorithm (Basu et al., 2004) we did not observe any performance improvements.

Conclusion
We presented an approach for unsupervised semantic frame and role induction that uses word and context embeddings. It separates the task into two independent steps: verb clustering and role labelling, using combination of these embeddings enhanced with syntactical features. Our approach showed the best performance in Subtask B.1 and also finished as the runner-up in Subtask A of this shared task, and it can be easily extended to process other datasets and languages.