Neural GRANNy at SemEval-2019 Task 2: A combined approach for better modeling of semantic relationships in semantic frame induction

We describe our solutions for semantic frame and role induction subtasks of SemEval 2019 Task 2. Our approaches got the highest scores, and the solution for the frame induction problem officially took the first place. The main contributions of this paper are related to the semantic frame induction problem. We propose a combined approach that employs two different types of vector representations: dense representations from hidden layers of a masked language model, and sparse representations based on substitutes for the target word in the context. The first one better groups synonyms, the second one is better at disambiguating homonyms. Extending the context to include nearby sentences improves the results in both cases. New Hearst-like patterns for verbs are introduced that prove to be effective for frame induction. Finally, we propose an approach to selecting the number of clusters in agglomerative clustering.


Introduction
SemEval-2019 Task 2 consisted of three subtasks. This paper presents solutions to all three, each of which performed better than the other submitted approaches. The first solution officially took first place in the competition; the other two were tuned on the development set provided by the organizers, which was then interpreted as using additional corpora.
Semantic Frame Induction (Subtask A) is the task of grouping target word occurrences in a text corpus according to their frame (meaning and semantic argument structure). Target words are usually verbs, nouns, and adjectives (these have argument structure; however, in the shared task dataset only verbs were present). For instance, the verbs rise, fall and climb in the sentences The dollar is rising, which makes Russian economy unstable and The dollar fell 1% in September after climbing 2% in August should be clustered together, while the verb climb in sentences like People climb mountains should be clustered separately. For the sake of brevity, occurrences of different words sharing the same frame will be called synonyms, and occurrences of the same word belonging to different frames will be called homonyms. This may violate the traditional meaning of these terms. For instance, fall and rise are not considered synonyms in the classical sense.
Semantic Role Induction refers to finding realizations of semantic arguments in text and relating them to corresponding semantic frame slots. Generic role induction (subtask B.2) requires a small number of frame-independent roles like Agent, Patient, Theme, etc. Frame-specific role induction (subtask B.1) allows labeling arguments of each frame independently from other frames. For instance, Microsoft in Microsoft bought Github and Google in Google opened new offices should be labeled as the same role in B.2 but may be labeled differently in B.1. For further details please refer to QasemiZadeh et al. (2019).
In this paper, we focused mainly on the Frame Induction subtask. The main contributions for this subtask are the following. A combined approach to semantic frame induction is introduced, which first clusters dense representations obtained from hidden layers of a masked LM and afterward clusters sparse bag-of-words representations of possible substitutes for a word in context. This approach resulted in better clustering of both synonyms and homonyms. New Hearst-like patterns designed specifically for verbs were used, and they proved to be beneficial for Semantic Frame Induction. Also, a simple but effective semi-supervised approach to selecting the number of clusters for agglomerative clustering was proposed. Finally, we proposed extending the context with neighboring sentences, which has shown consistent improvements for both of our representations.
For solving subtask B.2 we used a semi-supervised approach of training logistic regression over features that were partly designed and partly learned in an unsupervised fashion. To ensure the best performance on verbs that were not present in the training data (the majority of examples in the test set), we used cross-validation with a lexical split to select optimal features and hyperparameters. For solving subtask B.1 we trivially reused labels from B.2.

Related Work
This section describes previous work which our approach is based on.
Word Sense Induction (WSI) is the task of clustering occurrences of an ambiguous word according to their meaning, which is similar to Frame Induction. One of the major differences from Frame Induction is that WSI doesn't require grouping together different words with similar meanings; however, we adopt some ideas from WSI in this work. Instead of the graph or vector representations of word co-occurrence information traditionally used to solve the WSI task, Baskaya et al. (2013) proposed exploiting an n-gram language model (LM) to generate possible substitutes for an ambiguous word in a particular context. Their approach was one of the best in the SemEval-2013 WSI shared task (Jurgens and Klapaftis, 2013). Struyanskiy and Arefyev (2018) proposed pretraining a SOTA neural machine translation model built from Transformer blocks (Vaswani et al., 2017) to restore target words hidden from its input (replaced with a special token CENTERWORD). After pretraining, they exploited both predicted output embeddings to represent ambiguous words and attention weights to better weigh relevant context words in a word2vec weighted average representation. A combination of these representations achieved SOTA results on one of the datasets from the RUSSE'2018 Word Sense Induction for the Russian language shared task (Panchenko et al., 2018).
Amrami and Goldberg (2018) develop ideas from Baskaya et al. (2013), exploiting the neural bidirectional LM ELMo (Peters et al., 2018) instead of an n-gram LM for generating substitutes. To improve results further they propose using dynamic symmetric patterns "T and _", "_ and T" (here "T" stands for the target word and "_" for the position at which we collect LM predictions). For instance, to represent the word orange in He wears an orange shirt, instead of predicting what comes after wears in He wears, they predict what comes after and in He wears orange and (similarly, for the backward LM they predict what comes before and in and orange shirt). This provides more information to the LM because the ambiguous word is not hidden, and forces it to produce co-hyponyms of that word instead of all possible continuations given a one-sided context. Other important contributions include lemmatizing substitutes to remove grammatical bias from representations (which was especially important for verbs) and using IDF weights to penalize frequent substitutes, which are probably worse for discriminating between senses. They achieve SOTA results on the SemEval-2013 WSI dataset.
Devlin et al. (2018) proposed BERT (Bidirectional Encoder Representations from Transformers). Like the model from Struyanskiy and Arefyev (2018), BERT is a deep NN built from Transformer blocks and pretrained on the task of restoring words hidden from its input (replaced with a special token [MASK], hence they named it masked LM). However, they used much deeper models, pretrained them on much more data, and predicted hidden words at each timestep rather than generating them as an output sequence. Also, an additional next sentence prediction task was used to pretrain the model for sentence pair classification (like paraphrase detection and NLI). BERT has shown better results than previous SOTA models on a wide spectrum of natural language processing tasks.

Semantic Frame Induction
In this section, we describe our approaches to building vector representations of an occurrence of the target word (which is always a verb in the SemEval-2019 Frame Induction task dataset). The first approach exploits dense vector representations of the target word in a context obtained from hidden layers of the BERT model. Another approach builds sparse TF-IDF BOW vectors from substitutes generated for the target word by the BERT masked LM. We found that each model has its own downsides when used with non-trainable distance functions like cosine and Euclidean, and with traditional clustering algorithms like agglomerative clustering, DBSCAN, and affinity propagation. The first approach didn't discriminate different senses of the same verb; the second one had problems with clustering together similar senses of different verbs. In preliminary experiments, we tried fixing the first problem by learning a distance function instead of using a fixed one, but this didn't help, presumably due to the very small amount of labeled data provided and restrictions on using additional labeled data. So our best performing algorithm is two-stage: it groups examples into a relatively small number of large clusters using the first representation (merging synonyms together while not taking homonyms into consideration) and then splits each of them into smaller clusters using the second representation (disambiguating homonyms). Finally, we describe our approach to clustering these vector representations and propose a technique for selecting the appropriate number of clusters.

BERT Hidden Representations
In the preliminary experiments, we compared dense representations from different layers of two BERT models pretrained on English texts: bert-base-uncased and bert-large-uncased, the latter with 3x more weights. While being significantly slower, the large model didn't show better clustering results on the development set, so we stuck to the base model. Presumably, fine-tuning the large model to the final task could reveal its superiority, but this would require much more labeled data than was provided. Interestingly, a weighted average of word2vec embeddings for context words proposed for WSI in Arefyev et al. (2018) showed similar results, which also supports the hypothesis that distance functions like cosine or Euclidean are not appropriate for BERT hidden representations.
BERT-base consists of 12 Transformer blocks with 12 attention heads each; the hidden state dimensionality is 768. It was pretrained on lowercased texts split into subword units. Hyperparameters were selected on the development set; the best results were achieved using the outputs of layer 6 at the timestep when the first subword of the verb was fed in. Also, better results were achieved when input texts were lemmatized. This can be explained by the large grammatical bias of LMs also noticed by Amrami and Goldberg (2018): it is much easier to correctly predict grammatical attributes like number, gender, and tense from contexts, so when losses like cross-entropy are used it is more beneficial to assign higher probabilities to all verbs with the correct tense than to all verbs with the correct meaning. This results in a large distance between occurrences of the same verb in the same meaning but in different tenses.

Substitutes Representations
We adopt ideas from Amrami and Goldberg (2018) for our second approach to Frame Induction, with several important differences. First, we propose new patterns which are more suitable for verbs. Secondly, we use BERT, which proved to be better than ELMo for generating substitutes in our series of preliminary experiments. This is likely due to the fact that BERT takes into account the whole context in all of its layers, unlike the bidirectional LM in ELMo, which consists of two independently trained language models, one using only the right context and another only the left context. Lastly, we do hard clustering instead of the soft clustering required for SemEval-2013 WSI; hence we do not sample from distributions predicted by the LM, but instead take the most probable substitutes. We found this approach works better than doing soft clustering and then selecting the most probable cluster for each example.
To generate substitutes, a masked LM based on the bert-base-uncased model was utilized. It is likely that the large model could generate better substitutes, but we left it for future work. Non-lemmatized lowercased text was passed through all the layers of the model. We didn't add the biases of the last linear layer, in order to obtain less frequent but more contextually suitable subwords. We took the K most probable substitutes to represent each example (K=40 was selected on the development set), lemmatized them to get rid of grammatical bias, and then built TF-IDF bag-of-words vectors.
To improve results we employ symmetric patterns. Symmetric patterns were first proposed in Hearst (1992) and then used in many cases, including Widdows and Dorow (2002), Panchenko et al. (2012), Schwartz et al. (2015), to extract lexical relations like hyponymy, hypernymy, co-hyponymy, etc. from texts, and to augment lexical resources. However, we were not aware of any Hearst-like patterns designed specifically for verbs. Along with the "T and _" pattern and the trivial "T" and "_" patterns, we proposed and experimented with the "T and then _", "T and will _" and "T and then will _" patterns. We suppose that the meaning of a verb is better described not by its hypernyms or co-hyponyms (which are traditionally extracted for nouns using patterns like "_ such as T" or "T and _") but rather by preceding and following events, which are better extracted by the proposed patterns. The "T and then _" pattern has shown the best results both for the development and the test sets. For instance, to generate substitutes for the verb build in They are building phones we pass They are building and then [MASK] phones and collect predictions at the masked timestep. We found that, among others, substitutes like export, distribute, ship are generated for the Manufacturing frame and establish, open, close for the Building frame of the verb build, which allows discriminating between them. See Appendix A for examples.
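The pattern substitution and the TF-IDF step described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `apply_pattern` and `tfidf_vectors` are hypothetical, the input is treated as a list of whole-word tokens (BERT itself works on subword units), and the plain `log(N / df)` IDF variant is an assumption, since the exact TF-IDF weighting is not specified here. The substitutes are assumed to have been generated by the masked LM and lemmatized beforehand.

```python
import math
from collections import Counter

MASK = "[MASK]"  # mask token used by BERT

def apply_pattern(tokens, target_idx, pattern="T and then _"):
    """Replace the target token with the pattern: 'T' -> target word, '_' -> [MASK]."""
    filled = [tokens[target_idx] if p == "T" else MASK if p == "_" else p
              for p in pattern.split()]
    return tokens[:target_idx] + filled + tokens[target_idx + 1:]

def tfidf_vectors(substitute_lists):
    """Build sparse TF-IDF bag-of-words vectors (dicts) from per-example substitutes.

    Assumption: idf = log(N / df), so a substitute generated for every
    example receives zero weight.
    """
    n = len(substitute_lists)
    df = Counter(s for subs in substitute_lists for s in set(subs))
    vectors = []
    for subs in substitute_lists:
        tf = Counter(subs)
        vectors.append({s: c * math.log(n / df[s]) for s, c in tf.items()})
    return vectors

# The example from the text: "They are building phones" with target "building".
masked_input = apply_pattern(["They", "are", "building", "phones"], 2)
```

Here `masked_input` is the token sequence that would be passed to the masked LM, with predictions collected at the `[MASK]` position.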

Clustering
We experimented with the K-means, DBSCAN, Affinity Propagation and Agglomerative clustering algorithms implemented in scikit-learn (Pedregosa et al., 2011) and found agglomerative clustering to achieve the best results. To select hyperparameters of agglomerative clustering for dense representations (number of clusters and distance functions between points and clusters) we used a simple yet effective semi-supervised approach: merge the development and test sets (labeled and unlabeled respectively) and perform a grid search for hyperparameters that provide clustering with the optimal value of the target metric (BCubed-f1 in our case) on the labeled subset. Almost always, optimal results were obtained using cosine distance for points and average linkage for clusters (average distance between elements).
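The semi-supervised selection of the number of clusters can be illustrated with a small, dependency-free sketch. It is not the implementation used for the submission (which relied on scikit-learn): the naive average-linkage routine, the function names, and the toy points and labels below are all assumptions made for the example. The idea shown is the one described above: cluster labeled and unlabeled examples together, then score each candidate number of clusters by BCubed-f1 on the labeled subset only.

```python
import math
from collections import Counter

def bcubed_f1(pred, gold):
    """BCubed F1 for two parallel lists of cluster ids and gold labels."""
    n = len(pred)
    pred_size, gold_size = Counter(pred), Counter(gold)
    both = Counter(zip(pred, gold))
    precision = sum(both[p, g] / pred_size[p] for p, g in zip(pred, gold)) / n
    recall = sum(both[p, g] / gold_size[g] for p, g in zip(pred, gold)) / n
    return 2 * precision * recall / (precision + recall)

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def average_linkage(points, a, b):
    """Average pairwise cosine distance between two clusters of point indices."""
    total = sum(cosine_distance(points[i], points[j]) for i in a for j in b)
    return total / (len(a) * len(b))

def agglomerative(points, k):
    """Naive average-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        _, x, y = min((average_linkage(points, clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        merged = clusters.pop(y)          # y > x, so index x stays valid
        clusters[x].extend(merged)
    labels = [0] * len(points)
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels

def select_n_clusters(cluster_fn, labeled, candidates):
    """Pick k maximizing BCubed F1 on the labeled subset; labeled maps index -> gold frame."""
    idx = sorted(labeled)
    gold = [labeled[i] for i in idx]
    def score(k):
        labels = cluster_fn(k)            # labels for ALL points, labeled or not
        return bcubed_f1([labels[i] for i in idx], gold)
    return max(candidates, key=score)

# Toy data (made up): six points, five of them labeled with hypothetical frames.
points = [(1, 0), (1, 0.1), (0, 1), (0.1, 1), (1, 1), (1, 0.9)]
labeled = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
best_k = select_n_clusters(lambda k: agglomerative(points, k), labeled, range(1, 6))
```

In practice the grid would also cover the distance function and linkage type, as described in the text.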

Combined Approach
Our best performing submission combined the techniques described above. At phase 1, we clustered dense representations using the proposed semi-supervised agglomerative clustering. At phase 2, we split each cluster separately using sparse representations and conventional agglomerative clustering with cosine distance and average linkage (selected on the development set). We didn't use the semi-supervised tuning again because at that stage most clusters didn't contain labeled examples. During the blind evaluation period, we simply split each cluster into two (this method is denoted as Combined below). In the post-evaluation period, we experimented with more sophisticated approaches. Our best results (denoted as Combined2) were obtained when the number of clusters at phase 2 was selected using the silhouette score, and small clusters (with fewer than 20 examples) or clusters with different target verbs were left intact.
Also, during the post-evaluation period, we tried extending the context with nearby sentences (sentences with adjacent IDs in the Penn Treebank corpus). This allowed us to incorporate more information about the preceding and following events, which resulted in improved performance of both representations. In Combined2 we passed a large context of at most 7 sentences to the left and to the right for dense representations, and a smaller context of 2 sentences on both sides for sparse representations (selected on the development set).
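The control flow of the second phase can be sketched as below. This is a hedged illustration, not the submitted code: `combined_clustering` and `split_fn` are hypothetical names, and `split_fn` stands in for whatever clustering of sparse representations is applied inside one phase-1 cluster (splitting into two for Combined, silhouette-selected agglomerative clustering for Combined2). The sketch only encodes the Combined2 rule that small clusters and clusters mixing different target verbs are left intact.

```python
from collections import defaultdict

def combined_clustering(coarse_labels, verbs, split_fn, min_size=20):
    """Refine phase-1 clusters; returns one (coarse_id, sub_id) pair per example.

    coarse_labels: phase-1 cluster id per example (from dense representations).
    verbs:         target verb per example.
    split_fn:      hypothetical callable; takes the example indices of one
                   cluster and returns a sub-cluster id for each of them.
    """
    groups = defaultdict(list)
    for i, c in enumerate(coarse_labels):
        groups[c].append(i)

    final = [None] * len(coarse_labels)
    for c, members in groups.items():
        single_verb = len({verbs[i] for i in members}) == 1
        if len(members) < min_size or not single_verb:
            sub = [0] * len(members)      # leave the cluster intact
        else:
            sub = split_fn(members)       # disambiguate homonyms
        for i, s in zip(members, sub):
            final[i] = (c, s)
    return final

# Toy usage: one large single-verb cluster and one small mixed-verb cluster.
coarse = [0] * 25 + [1] * 3
verbs = ["build"] * 25 + ["sell", "buy", "sell"]
final = combined_clustering(coarse, verbs, split_fn=lambda m: [i % 2 for i in m])
```

The parity-based `split_fn` in the usage example is a placeholder chosen only to make the sketch runnable.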

Dataset and Experiments
Due to limitations imposed by the task, we restricted ourselves to only using labeled data provided by the organizers. For the majority of our experiments, we used the development set, which consisted of 600 examples of 35 verbs clustered into 41 frames. There are many examples of synonymy in this dataset but not so many of homonymy. Almost all ambiguous verbs have fewer than 5 examples for all frames except their most frequent frame; hence we used only the verbs join and believe (54/9 and 12/8 examples of their first/second most frequent frame respectively) to select hyperparameters, which likely resulted in suboptimal performance on the test set.
For internal evaluation of different representations and hyperparameter selection, we used the following procedure: the development set or its subset was clustered many times using agglomerative clustering with all feasible hyperparameter values, and the maximum BCubed-f1 value (maxB3f1) was taken as the score for the representation. This allowed us to compare the clusterability of different representations while avoiding the problems of selecting the number of clusters and other hyperparameters. Of course, there is a possibility that other clustering algorithms might perform better with different representations; however, we didn't see improvements from using other clustering algorithms and stuck to agglomerative clustering. We denote the proportion of synonyms sharing a common cluster as recall for synonyms and the proportion of homonyms put in separate clusters as recall for homonyms. Figure 1 shows both metrics depending on the number of clusters for agglomerative clustering of the whole development set. It is evident that up to a relatively large number of clusters (30) almost all synonyms are correctly clustered together when using dense representations, yet homonyms are clustered together as well, which gives almost 1.0 recall for synonyms and nearly 0.0 recall for homonyms. MaxB3f1 of approximately 0.94 is achieved at around 25-28 clusters (depending on the context size), where synonyms are still clustered almost perfectly. At the same time, sparse representations split homonyms into different clusters even at very small numbers of clusters, but simultaneously split synonyms as well, achieving a lower maxB3f1 of 0.91 in a wider range of 25-40 clusters. To solve this problem, our final solution clusters dense representations first and then splits large clusters containing examples of the same verb (to prevent splitting synonyms) into a small number of clusters to improve recall for homonyms.
In the test-set comparison (Table 2), the verb baseline never clusters synonyms together. Dense representation with semi-supervised agglomerative clustering slightly underestimates the number of clusters in the test set (similarly to the development set), resulting in the highest recall due to merged synonyms. The combined approach splits some clusters, hurting BCubed-recall a bit but increasing BCubed-precision even more, which results in a better BCubed-f1. The last row shows that selecting the number of clusters which maximizes the silhouette score (an unsupervised approach) instead of BCubed-f1 on the labeled subset gives much worse results; hence our semi-supervised approach is beneficial. Finally, we noticed that the largest cluster had all the examples of both sell and buy, which were among the most frequent verbs in the test set. In FrameNet, they are assigned to the Commerce_sell and Commerce_buy frames respectively, which is a questionable solution since these are just different ways to put into words the same type of event with the same participants (something like commercial-transfer-of-property). We simply moved all examples of the verb sell into a separate cluster, which gave a significant improvement in BCubed-f1. However, this result is out of competition due to the manual postprocessing. Yet, our best result without manual postprocessing is still ranked first.
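The two diagnostic metrics, recall for synonyms and recall for homonyms, can be computed as in the sketch below. The pair-level reading is an assumption: the text defines the metrics as proportions of synonyms and homonyms, and here each proportion is taken over pairs of occurrences. The function name and the frame labels in the toy example are hypothetical.

```python
from itertools import combinations

def synonym_homonym_recall(verbs, frames, clusters):
    """Return (recall for synonyms, recall for homonyms) over occurrence pairs.

    Synonym pair: different verbs, same gold frame -> correct if in the same cluster.
    Homonym pair: same verb, different gold frames -> correct if in separate clusters.
    A metric is None when the data contains no pair of that kind.
    """
    syn_total = syn_ok = hom_total = hom_ok = 0
    for i, j in combinations(range(len(verbs)), 2):
        same_cluster = clusters[i] == clusters[j]
        if verbs[i] != verbs[j] and frames[i] == frames[j]:
            syn_total += 1
            syn_ok += same_cluster
        elif verbs[i] == verbs[j] and frames[i] != frames[j]:
            hom_total += 1
            hom_ok += not same_cluster
    syn_recall = syn_ok / syn_total if syn_total else None
    hom_recall = hom_ok / hom_total if hom_total else None
    return syn_recall, hom_recall

# Toy example with hypothetical frame labels, following the rise/fall/climb example.
verbs = ["rise", "fall", "climb", "climb"]
frames = ["Change", "Change", "Change", "Motion"]
one_big_cluster = synonym_homonym_recall(verbs, frames, [0, 0, 0, 0])
ideal_clustering = synonym_homonym_recall(verbs, frames, [0, 0, 0, 1])
```

Putting everything into one cluster gives perfect recall for synonyms and zero recall for homonyms, which mirrors the behavior reported for dense representations at small numbers of clusters.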
In Table 3 we report the results of clustering the test set depending on the pattern and the context size used to build sparse representations at phase 2. In addition to standard metrics, we report maxB3F1, which removes the effect of a suboptimally selected number of clusters from the comparison. Our proposed pattern seems to give a small but consistent improvement, as does context extension. A context of 1-3 sentences on both sides is a reasonable choice for sparse representations.

Semantic Role Induction
After looking at examples from the development set we decided that subtask B.2 (generic semantic role induction) could be solved much more effectively using a classifier than any kind of clustering, because generic roles look more like a high-level linguistic abstraction than something naturally occurring in texts. We used the development set to train logistic regression on top of representations extracted from BERT and several hand-crafted features. BERT was pretrained in an unsupervised fashion on large corpora, and this results in much better generalization of our semi-supervised approach compared to a logistic regression trained only on hand-crafted features (see the ablation analysis below). To select hyperparameters we used cross-validation with a lexical split (i.e. there were no common verbs in the train and test subsets for each fold) to ensure the best performance on new verbs not seen during training. This approach was rejected as using an additional labeled corpus to train a supervised component. However, we hardly see how the development set provided by the organizers can be considered additional.
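A lexical split for cross-validation can be sketched as follows. The function name, the number of folds, and the round-robin assignment of verbs to folds are assumptions made for illustration; the only property taken from the text is that no verb appears in both the train and test subsets of a fold.

```python
def lexical_folds(verbs, n_folds=5):
    """Yield (train_idx, test_idx) pairs such that no verb occurs on both sides."""
    unique = sorted(set(verbs))
    for f in range(n_folds):
        held_out = set(unique[f::n_folds])   # verbs assigned to this fold
        test = [i for i, v in enumerate(verbs) if v in held_out]
        train = [i for i, v in enumerate(verbs) if v not in held_out]
        yield train, test

# Toy usage: the verb of each labeled argument in a hypothetical development set.
verbs_demo = ["buy", "buy", "sell", "join", "join", "believe"]
folds = list(lexical_folds(verbs_demo, n_folds=3))
```

Each fold then simulates the test condition in which most verbs were not seen during training.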

Model Description and Results
We trained a logistic regression classifier for the 14 most frequent semantic roles in the development set. Following the recommendations of Devlin et al. (2018), we used outputs from the last four layers of BERT as features. These outputs were taken for the two timesteps at which the target argument and its corresponding verb were fed. To be exact, we found that the first subword of the verb (for instance, buy for buy out) and the last subword of the argument (Union for European Union) performed best. Additionally, we used several hand-designed features.

Conclusions
We show how neural language models can be effectively used for unsupervised inference of semantic structures. To improve the result of semantic frame induction we used a combined approach that utilizes two different vector representations, and adjusted our clustering algorithm accordingly.
The design stemmed from our analysis of problems in the use of neural language models for the purpose of semantic frame induction; the experiments showed that the issues may be strongly related to how the models treat such linguistic phenomena as synonymy and homonymy. Designing a system that addresses this problem directly allowed us to improve the result significantly. We think that our result could be additionally improved by finding better parameters and/or model combinations. We also think that further research in this direction could lead to neural language models that explicitly address various linguistic phenomena by design, for even better inference of semantic properties.

A Examples of generated substitutes
To show how substitutes can disambiguate homonyms we generated substitutes for examples of two most frequent frames for several verbs.
For each verb we excluded rare substitutes with $P(\mathrm{subs} \mid \mathrm{frame}_i) < 0.4$ for both frames. Then we sorted the rest according to the probability ratio $\frac{P(\mathrm{subs} \mid \mathrm{frame}_1)}{P(\mathrm{subs} \mid \mathrm{frame}_2) + 10^{-6}}$. Table 5 shows substitutes with the largest and the smallest ratio (most discriminating substitutes).
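The filtering and ranking step can be written as a short sketch. The function name and the probabilities in the usage example are made up for illustration; the threshold of 0.4 and the smoothing constant 1e-6 are the ones given above, and a substitute missing from a frame's dictionary is assumed to have probability zero for that frame.

```python
def discriminating_substitutes(p_frame1, p_frame2, min_prob=0.4, eps=1e-6):
    """Rank substitutes by P(subs | frame 1) / (P(subs | frame 2) + eps).

    p_frame1, p_frame2: dicts mapping a substitute to P(subs | frame).
    Substitutes that are rare (below min_prob) for BOTH frames are excluded.
    """
    candidates = set(p_frame1) | set(p_frame2)
    kept = [s for s in candidates
            if p_frame1.get(s, 0.0) >= min_prob or p_frame2.get(s, 0.0) >= min_prob]

    def ratio(s):
        return p_frame1.get(s, 0.0) / (p_frame2.get(s, 0.0) + eps)

    return sorted(kept, key=ratio, reverse=True)

# Made-up probabilities for two frames of the verb build (illustration only).
p_manufacturing = {"export": 0.8, "ship": 0.6, "open": 0.05, "make": 0.5, "rare": 0.1}
p_building = {"open": 0.7, "establish": 0.5, "make": 0.5, "rare": 0.1}
ranked = discriminating_substitutes(p_manufacturing, p_building)
```

The head of `ranked` holds substitutes typical for the first frame and the tail those typical for the second.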

B Features and ablation analysis for Generic Semantic Role Induction subtask
We used the following hand-crafted features: an indicator that the argument is to the left of the verb and an indicator that the particle by is between them; categorical features for the output syntactic relation of the argument, the last relation in the path between the argument and the verb, the part of speech of the first word of the argument, the number of words and the number of words starting with a capital letter in the argument. All these features were concatenated; categorical features were encoded with one-hot vectors. In the preliminary experiments we noticed that hand-crafted features performed well by themselves but didn't improve results when concatenated with BERT outputs; this was resolved by multiplying the features by 10 (we attribute the effect to the very high dimensionality of BERT outputs compared to hand-crafted features, which requires harmonizing the variance each of them adds to the scalar product in the logistic regression). We tried multiplying each feature by its own constant determined analytically from its dimensionality, but this worsened the results, so we left it for future work.
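The feature construction described above amounts to one-hot encoding, scaling, and concatenation. A minimal sketch follows, assuming hypothetical helper names and a hypothetical relation vocabulary; only the scaling factor of 10 comes from the text.

```python
def one_hot(value, vocabulary):
    """Encode a categorical value as a one-hot list over a fixed vocabulary."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

def build_features(bert_output, handcrafted, scale=10.0):
    """Concatenate BERT outputs with hand-crafted features multiplied by a constant."""
    return list(bert_output) + [scale * x for x in handcrafted]

# Toy usage: a 2-dimensional stand-in for BERT outputs plus two hand-crafted parts.
relations = ["nsubj", "dobj", "nmod"]           # hypothetical relation vocabulary
handcrafted = [1.0] + one_hot("dobj", relations)  # left-of-verb indicator + relation
features = build_features([0.1, 0.2], handcrafted)
```

The resulting vector would be the input to the logistic regression classifier.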
Table 6 shows results for subtask B.2 after removing features from the input representation or using different BERT layers instead of the last four. For the ablation analysis, we selected the L2-regularization strength using cross-validation with a lexical split after removing each feature while leaving all other hyperparameters intact. The features with the largest contribution to the result are (from most to least important) the BERT output at the argument, the BERT output at the verb, the last relation in the path from the argument to the verb, the indicator that the particle by is between them (which was designed to fix errors due to passive voice) and the indicator that the argument is to the left of the verb. All other features' contributions (not shown) are small. Remarkably, removing all BERT features gives a very large decrease in performance (-18 B3F1), while removing only the outputs at the argument/verb gives only a moderate decrease (-5.5/-3.5 B3F1), which can be explained by the deeply bidirectional nature of BERT resulting in some information about both the verb and the argument being present in each of these outputs. Finally, we tried using other BERT layers instead of the last four (layers 8-11) and found that intermediate layers perform best. For instance, layer 8 can replace the last four layers with very little decrease in performance, while the last two layers (10, 11) concatenated perform noticeably worse but much better than the first layers.

Table 1
Table 2 compares results on the test set. The verb baseline assigns the first token of the verb to each example as its cluster id. It overestimates the real number of clusters in the test set (149), giving the highest precision but very low recall because synonyms are never clustered together.

Table 3: Subtask A, effect of pattern and context size.

Table 4 shows our submission results. We also display results when using only BERT features and only hand-designed features, which suggest that both of them contribute positively.