UFRGS&LIF at SemEval-2016 Task 10: Rule-Based MWE Identification and Predominant-Supersense Tagging

This paper presents our approach towards the SemEval-2016 Task 10 – Detecting Minimal Semantic Units and their Meanings. Systems are expected to provide a representation of lexical semantics by (1) segmenting tokens into words and multiword units and (2) providing a supersense tag for segments that function as nouns or verbs. Our pipeline rule-based system uses no external resources and was implemented using the mwetoolkit. First, we extract and filter known MWEs from the training corpus. Second, we group input tokens of the test corpus based on this lexicon, with special treatment for non-contiguous expressions. Third, we use an MWE-aware predominant-sense heuristic for supersense tagging. We obtain an F-score of 51.48% for MWE identification and 49.98% for supersense tagging.


Introduction
Accurate segmentation and semantic disambiguation of minimal text units is a major challenge in the general pipeline of NLP applications. A machine translation system, for example, needs to decide what is the intended meaning for a given word or phrase in its context, so that it may translate it into an equivalent meaning in the target language.
While determining the meaning of single words is a difficult task on its own, the problem is compounded by the pervasiveness of Multiword Expressions (MWEs). MWEs are semantic units that span over multiple lexemes in the text (e.g. dry run, look up, fall flat). Their meaning cannot be inferred by applying regular composition rules on the meanings of their component words. The task of semantic tagging is thus deeply intertwined with the identification of multiword expressions.
This paper presents our solution to the DiMSUM shared task (Schneider et al., 2016), where the evaluated systems are expected to perform both semantic tagging and multiword identification. Our pipeline system first detects and groups MWEs and then assigns supersense tags, as two consecutive steps. For MWE identification, we use a task-specific instantiation of the mwetoolkit (Ramisch, 2015), handling both contiguous and non-contiguous MWEs with some degree of customization (Cordeiro et al., 2015). Additionally, MWE type-level candidates are extracted without losing track of their token-level occurrences, to guarantee that all the MWE occurrences learned from the training data are projected onto the test corpus. For semantic tagging we adopted a predominant-sense heuristic.
In the remainder of this paper, we present related work (§ 2), then we present and discuss the results of the MWE identification subsystem (§ 3) and of the supersense tagging subsystem (§ 4). We then conclude and share ideas for future improvements (§ 5).

Related Work
Practical solutions for rule-based MWE identification include tools like jMWE (Kulkarni and Finlayson, 2011), a library for direct lexicon projection based on preexisting MWE lists. Finite-state transducers can also be used to take into account the internal morphology of component words and perform efficient tokenization based on MWE dictionaries (Savary, 2009). The problem of MWE identification has also been modeled using supervised machine learning. Probabilistic MWE taggers usually encode the data using a begin-inside-outside scheme and learn CRF-like taggers on it (Constant and Sigogne, 2011; Schneider et al., 2014). The mwetoolkit (Ramisch, 2015) provides command-line programs that allow one to discover new MWE candidate lists, filter them and project them back on text according to some parameters. Our system uses the mwetoolkit as the basis for MWE identification.
Word sense disambiguation (WSD) methods can be roughly classified into knowledge-based, supervised and unsupervised. Knowledge-based methods use lexico-semantic taxonomies like WordNet to calculate the similarity between context and target words (Lesk, 1986). Supervised approaches generally use context-sensitive classifiers (Cabezas et al., 2001). Unsupervised approaches using clustering and distributional similarity (Brody and Lapata, 2008; Goyal and Hovy, 2014) can also be employed for WSD. Both supervised and unsupervised WSD techniques have also been used to distinguish literal from idiomatic uses of MWEs (Fazly et al., 2009; Diab and Bhutada, 2009). Nonetheless, systematically choosing the most frequent sense is a surprisingly good baseline that is not always easy to beat (McCarthy et al., 2007; Navigli, 2009). This was also verified for MWE disambiguation (Uchiyama et al., 2005). Thus, in this work, we implemented a simple supervised predominant-sense heuristic and will investigate more sophisticated WSD techniques as future work.

MWE Identification
Our MWE identification algorithm uses 6 different rule configurations, targeting different MWE classes. Three of these are based on data from the training corpus, while the other three are unsupervised. The parameters of each configuration are optimized on a held-out development set, consisting of 1/9 of the training corpus. The final system is the union of all configurations; when there is an overlap, we favor longer MWEs.
For the 3 supervised configurations, annotated MWEs are extracted from the training data and then filtered: we only keep combinations that have been annotated often enough in the training corpus. In other words, we keep MWE candidates whose proportion of annotated instances with respect to all occurrences in the training corpus is above a threshold t, discarding the rest. The thresholds were manually chosen based on what seemed to yield better results on the development set. Finally, we project the resulting list of MWE candidates on the test data, that is, we segment as MWEs the test token sequences that are contained in the lexicon extracted from the training data. These configurations are:
CONTIG Contiguous MWEs annotated in the training corpus are extracted and filtered with a threshold of t = 40%. That is, we create a lexicon containing all contiguous lemma+POS sequences for which at least 40% of the occurrences in the training corpus were annotated. The resulting lexicon is projected on the test corpus whenever that contiguous sequence of words is seen.
GAPPY Non-contiguous MWEs are extracted from the training corpus and filtered with a threshold of t = 70%. The resulting MWEs are projected on the test corpus using the following rule: an MWE is deemed to occur if its component words appear sequentially with at most a total of 3 gap words in between them.
NOUN 2 -KN Collect all noun-noun sequences in the test corpus that also appear at least once in the training corpus (known compounds), and filter them with a threshold of t = 70%. The resulting list is projected onto the test corpus.
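The filtering and projection steps of the supervised configurations can be illustrated with a minimal Python sketch. It is not the mwetoolkit implementation: the function names `filter_lexicon` and `find_gappy`, the greedy left-to-right matching strategy, and the toy counts are all assumptions made for illustration. Tokens can be any hashable value, such as (lemma, POS) tuples; plain lemmas are used here for brevity.

```python
from collections import Counter


def filter_lexicon(annotated_counts, total_counts, threshold):
    """Keep candidates annotated in at least `threshold` of their occurrences.

    Both arguments map a candidate (a tuple of tokens) to a count: how often
    it was annotated as an MWE, and how often it occurred at all.
    """
    return {
        mwe
        for mwe, annotated in annotated_counts.items()
        if annotated / total_counts[mwe] >= threshold
    }


def find_gappy(tokens, mwe, max_gaps=3):
    """Find in-order occurrences of `mwe` with at most `max_gaps` gap tokens.

    Hypothetical greedy matcher: each component is matched at its leftmost
    possible position. Returns a list of index tuples, one per match.
    """
    matches = []
    for start, token in enumerate(tokens):
        if token != mwe[0]:
            continue
        indices, pos, gaps = [start], start + 1, 0
        for component in mwe[1:]:
            # Skip gap tokens while the total gap budget allows it.
            while (pos < len(tokens) and tokens[pos] != component
                   and gaps < max_gaps):
                pos += 1
                gaps += 1
            if pos < len(tokens) and tokens[pos] == component:
                indices.append(pos)
                pos += 1
            else:
                break  # component not found within the gap budget
        else:
            matches.append(tuple(indices))
    return matches


# Toy counts, invented for illustration (not taken from the corpus).
annotated = Counter({("look", "up"): 4, ("go", "to"): 1})
total = Counter({("look", "up"): 5, ("go", "to"): 10})
lexicon = filter_lexicon(annotated, total, threshold=0.7)

sentence = ["look", "the", "word", "up"]
spans = find_gappy(sentence, ("look", "up"), max_gaps=3)
```

With `max_gaps=0`, the same matcher reduces to contiguous projection, so one function can serve both the CONTIG and GAPPY settings in this sketch.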
We further developed 3 additional configurations based on empirical findings. We identify MWEs in the test corpus based on POS-tag patterns, without any filtering (and thus without looking at the training corpus):
NOUN 2 -UKN Collect all noun-noun sequences in the test corpus that never appear in the training corpus (unknown compounds), and project all of them back on the test corpus.
PROPN 2..∞ Collect sequences of two or more contiguous words with POS-tag PROPN and project all of them back onto the test corpus.
VP Collect verb-particle candidates and project them back onto the test corpus. A verb-particle candidate is a pair of words under these constraints: the first word must have POS-tag VERB and cannot have lemma go or be. The two words may be separated by a N or PROPN. The second word must be in a list of frequent non-literal particles. Finally, the particle must be followed by a word with one of these POS-tags: ADV, ADP, PART, CONJ, PUNCT. Even though we might miss some cases, this final delimiter avoids capturing regular verb-PP sequences.
Table 1 presents the results for each isolated configuration (evaluated on the test corpus, with all MWEs). These results are calculated based on the fuzzy metrics of the shared task (Schneider et al., 2014), where partial MWE matches are taken into account. Our final MWE identification system is the union of all rule configurations described above.

Error Analysis
[...] recall for N_N compounds. The most common false positive errors are presented below.
• Not in the same phrase In 19 cases, our system has identified two Ns that are not in the same phrase; e.g. *when I have a problem customer services don't want to know. In order to realize that these nouns are not related, we would need parsing information. Nonetheless, it is not clear whether an off-the-shelf parser could solve these ambiguities in the absence of punctuation.
• Partial N_N_N 17 cases have been missed due to only the first two nouns in the MWE being identified; e.g. *Try the memory foam pillows! -instead of memory foam pillows.
• Partial ADJ_N_N 10 cases have been missed; e.g. *My sweet pea plants arrived 00th May completely dried up and dead! -instead of sweet pea plants. These cases are a consequence of the fact that we do not look for adjective-noun pairs (see ADJ_N errors below).
• Compositional N_N In 24 cases, our system identified a compositional compound; e.g. *Quality gear guys, excellent! Semantic features would be required to filter such cases out.
• Questionable N tags 10 false noun compounds were found due to words such as today being tagged as nouns (e.g. *I'm saving gas today). Another 5 cases had adjectives classified as nouns: *Maybe this is a kind of an artificial way to read an e-book.
VERB_ADP errors Most of the VERB_ADP expressions were caught by the VP configuration, but we still had some false negatives. In 7 cases, the underlying particle was not in our list (e.g. I regret ever going near their store), while in 9 other cases, the particle was followed by a noun phrase (e.g. Givin out Back shots). 5 of the missed MWEs could have been found by also accepting a SCONJ or the end of the line as a delimiter after the particle. Most of the false positives were due to the verb being followed by an indirect object or prepositional phrase. We believe that disambiguating these cases would require valency information, either from a lexicon or automatically acquired from large corpora (Preiss et al., 2007).
ADJ_N errors While the few ADJ_N pairs that our system identified were usually correct MWEs, most of the annotated cases were missed. This is because we do not specifically look for adjective-noun pairs, due to the high likelihood of them being compositional. For example, a simple ADJ_N annotation scheme (as performed in NOUN 2 -UKN) would have achieved a precision of only 69/505 = 14%. Out of all annotated sentences, in 23 cases the noun is transparent, and we could replace the adjective by a synonym; e.g. I guess people are going again next week, do you think you'll go? (which could be replaced by the following week). In another 17 cases, the noun is transparent and the adjective suggestive of the global meaning, even though it is fixed; e.g. 23 is the lucky number (but not *fortunate number, albeit related to luck).
These cases could be dealt with using fixedness tests such as substitution and permutation (Fazly et al., 2009;Ramisch et al., 2008).
PROPN_PROPN errors Since our system looks for all occurrences of adjacent PROPN pairs, we obtain near-perfect recall for PROPN_PROPN compounds. Most false positives were caused by possessives or personal titles, which were annotated as part of the MWE in the gold standard.
VERB_PART errors The results for VERB_PART are similar to the ones found for VERB_ADP: 3 false negatives are due to the particle not being in our list, and in another 7 cases they are followed by a noun phrase. Additionally, in 6 cases the particle was followed by a verb (e.g. Stupid Kilkenny didn't get to meet @Royseven). 4 false positives were CONTIG cases of go to being identified as an MWE (e.g. *In my mother's day, she didn't go to college). In the training corpus, this MWE had been annotated 57% of the time, but in future constructions (e.g. Definitely not going to purchase a car from here). Canonical forms would be easy to model with a specific contextual rule of the form going to verb.
PROPN_N errors While the few PROPN_N pairs we found were all correct MWEs, most of the annotated cases were missed. These cases did not earn special attention during the development of the system due to an incorrectly perceived infrequency. However, using only an annotation scheme such as NOUN 2 -UKN, we could have achieved a precision of 72% for these MWEs.
N_N_N errors The occurrence of N_N_N sequences is rare in the training corpus, and we did not specifically look for them, which explains our recall of 0%. By annotating the longest sequence of Ns in the corpus (NOUN 2..∞ ), we could have obtained a precision of 56% and recall of 91% for N_N_N. The precision of N_N would also increase to 70% (with a recall of 93%). If we then replace NOUN 2 by NOUN 2..∞ , the full-system's F-score increases to 56.23%.
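The idea of annotating the longest sequence of Ns can be sketched as a search for maximal runs of a given POS tag. This is only an illustrative Python sketch, not the code behind the figures above: the function name `longest_pos_runs` and its parameters are hypothetical.

```python
def longest_pos_runs(pos_tags, target="NOUN", min_len=2):
    """Return (start, end) spans of maximal runs of `target`, end exclusive.

    Only runs of at least `min_len` tokens are kept, so a pattern such as
    NOUN 2..∞ corresponds to min_len=2.
    """
    runs = []
    start = None
    # A trailing None sentinel closes a run that reaches the sentence end.
    for i, tag in enumerate(list(pos_tags) + [None]):
        if tag == target:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# "the memory foam pillows arrived today", with an invented POS tagging.
tags = ["DET", "NOUN", "NOUN", "NOUN", "VERB", "NOUN"]
noun_runs = longest_pos_runs(tags)
```

Because each run is maximal, a three-noun compound is returned as one span rather than as a partial two-noun match.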
ADP_N errors The false positives were ambiguous determinerless PPs that can be compositional or not according to the context. For instance, the system identified *Try them all, in order after seeing The Big Lebowski is in order tonight. False negatives were mainly due to threshold-based filters, like at all and in peace. Unsupervised MWE discovery on large corpora using context-sensitive association measures could have helped in these cases.
VERB_N errors We only generated 4 false positives, which look like light-verb constructions missed by the annotators (give ride, place order). False negatives include 8 cases of gerunds POS-tagged as verbs (e.g. to listen to flying saucers), which are actually similar to the ADJ_N cases discussed above. We also found 7 false negatives, mainly light-verb constructions, that were not present in the training corpus (take place, take control).
DET_N errors 8 false negatives were compositional time adjuncts (e.g. this morning, this season). False positives are mainly cases that seem inconsistent between training and test data concerning frequent quantifiers (e.g. a lot, a bit, a couple).
Noun compounds (two or more Ns in a row) account for a significant proportion of MWEs in the training corpus (601/4232 = 14%) and an even larger proportion in the test corpus (203/837 = 24%). The NOUN 2 rule sets were essential to obtaining good results. If we removed NOUN 2 from our system, its global performance would drop to a fuzzy F1 = 33.79%.
The domain of the corpus does not seem to have a great influence on our method's performance. Our lowest performance is on the Reviews subcorpus (fuzzy F1 = 49.57%) and our best performance is on TED (fuzzy F1 = 56.76%).
Some of the missed MWEs are questionable and we feel that our system should not annotate them. These include regular verbal chains (shouldn't have, have been), infinitival and selected preposition to (to take, go to) and compositional noun phrases (this Saturday). Fortunately, these cases correspond to a small proportion of the data.

Supersense Tagging
Supersense tagging takes place after MWE identification. Sense tags are coarse top-level WordNet synsets. The tagsets for nouns and verbs have 26 and 15 supersense tags, respectively. We use a predominant-sense heuristic to perform WSD.
Before tagging the test data, our system collects all annotated supersense tags from MWEs in the training corpus. We create a mapping with entries of the form $(w_1, w_2, \ldots, w_N) \rightarrow S$, where each MWE component $w_i = (\text{lemma}_i, \text{POStag}_i)$. This mapping indicates the most frequent tag $S$ associated with a given MWE. Single words are treated as length-1 MWEs and are also added to this mapping.
The supersense tagging algorithm then goes through all segmented units (MWEs or single words) in the test corpus and annotates them according to the most common tag seen in the training set. If a tag has not been seen for a given word or MWE, we do not tag it at all. This heuristic is very simple and not very realistic. Nonetheless, it allowed us to have a minimal supersense tagger quickly and then focus on accurate MWE identification as the main contribution of our system.
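The predominant-sense heuristic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system's actual code: the function names `train_predominant_sense` and `tag_units`, the data layout, and the example tags are hypothetical.

```python
from collections import Counter, defaultdict


def train_predominant_sense(training_units):
    """Map each unit to its most frequent supersense tag in the training data.

    `training_units` is an iterable of (unit, tag) pairs, where a unit is a
    tuple of (lemma, POS) components; single words are length-1 units.
    """
    counts = defaultdict(Counter)
    for unit, tag in training_units:
        counts[unit][tag] += 1
    return {unit: c.most_common(1)[0][0] for unit, c in counts.items()}


def tag_units(units, sense_map):
    """Tag each segmented unit; units unseen in training get None (no tag)."""
    return [sense_map.get(unit) for unit in units]


# Toy training data, invented for illustration.
bank = (("bank", "NOUN"),)
look_up = (("look", "VERB"), ("up", "ADP"))
training = [
    (bank, "n.group"),
    (bank, "n.group"),
    (bank, "n.location"),
    (look_up, "v.cognition"),
]
sense_map = train_predominant_sense(training)

unseen = (("pillow", "NOUN"),)
predicted = tag_units([bank, look_up, unseen], sense_map)
```

Returning `None` for unseen units mirrors the decision not to tag words or MWEs that were never observed with a tag in the training set.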

Error Analysis
Tables 3 and 4 show the confusion matrices of our system for the 10 most common tags. Each row corresponds to a gold tag and contains the distribution of predicted tags. The perfect system would have numbers only in the main diagonal and zeros everywhere else. The skewed distribution of supersense tags makes our simple heuristic quite effective when the MWE/word has been observed in the training data.
Known nouns seem easy to tag. Most of our errors come from the fact that we did not observe instances of a noun in the training data, and thus did not assign it any tag (column "skipped"). Some distinctions seem harder than others due to similar semantic classes: attributive/cognition and event/time.
The occurrence of verbs in the training data is less of a problem than their polysemy. Stative verbs correspond to the large majority of verbs in the dataset. This is magnified by the nature of the corpus: reviews tend to use stative verbs to talk about product characteristics, and tweets often use them to describe the state of the author. While very frequent, stative verbs are also difficult to disambiguate: most false negatives were tagged as change verbs while most false positives were tagged as social verbs. Some distinctions seem extremely hard to make, especially for less frequent supersense tags like contact/motion and perception/cognition.

Conclusions and Future Work
We developed a simple rule-based system that was able to obtain competitive results. Its main advantage is that it was very quick to implement in the context of the generic framework of the mwetoolkit. The system is freely available as part of the official mwetoolkit release. The main limitation of our system is that it cannot properly take unseen MWEs into account and generalize from seen instances. Moreover, most of our rule sets are highly language dependent.
Ideas for future improvements include:
• Adding specific rules for verb-particle constructions, probably based on a lexicon of idiomatic combinations.
• Replacing the CONTIG method by a sequence tagger for contiguous MWEs (e.g. using a CRF), in order to identify unknown MWEs based on generalizations made from known MWEs (Constant and Sigogne, 2011;Schneider et al., 2014).
• Using semantic-based association measures and semantic-based features based on word embeddings to target idiomatic MWEs (Salehi et al., 2015).
• Developing a more realistic WSD algorithm for supersense tagging, able to tag unseen words and MWEs and to take context into account.