SemEval-2016 Task 10: Detecting Minimal Semantic Units and their Meanings (DiMSUM)

This task combines the labeling of multiword expressions and supersenses (coarse-grained classes) in an explicit, yet broad-coverage paradigm for lexical semantics. Nine systems participated; the best scored 57.7% F1 in a multi-domain evaluation setting, indicating that the task remains largely unresolved. An error analysis reveals that a large number of instances in the data set are either hard cases, which no systems get right, or easy cases, which all systems correctly solve.


Introduction
Grammatical analysis tasks, e.g., part-of-speech tagging, are rather successful applications of natural language processing (NLP). They are comprehensive, i.e., they operate under the assumption that all grammatically-relevant parts of a sentence will be analyzed: We do not expect a POS tagger to only know a subset of the tags in the language. Most POS tags accommodate unseen words and adapt readily to new text genres. Together, these factors indicate a representation which achieves broad coverage.
Explicit analysis of lexical semantics, by contrast, has been more difficult to scale to broad coverage owing to limited comprehensiveness and extensibility. The dominant paradigm of fine-grained word sense disambiguation, WordNet (Fellbaum, 1998), is difficult to annotate in corpora, results in considerable data sparseness, and does not readily generalize to out-of-vocabulary words. While the main corpus with WordNet senses, SemCor (Miller et al., 1993), does reflect several text genres, it is hard to expand SemCor-style annotations to new genres, such as social web text or transcribed speech. This severely limits the applicability of SemCor-based NLP tools and restricts opportunities for linguistic studies of lexical semantics in corpora.
To address this limitation, in the DiMSUM 2016 shared task, we challenged participants to analyze the lexical semantics of English sentences with a tagset integrating multiword expressions and noun and verb supersenses (following Schneider and Smith, 2015), on multiple nontraditional genres of text. By moving away from fine-grained sense inventories and lexicalized, language-specific annotation, we take a step in the direction of broad-coverage, coarse-grained lexical semantic analysis. We believe this departure from the classical lexical semantics paradigm will ultimately prove fruitful for a variety of NLP applications in a variety of genres.
The integrated lexical semantic representation (§2, §3) has been annotated in an extensive benchmark data set comprising several nontraditional domains (§4). Objective, controlled evaluation procedures (§5) facilitate a comparison of the 9 systems submitted as part of the official task (§6). While the systems range in performance, all are below 60% in our composite evaluation, suggesting that further work is needed to make progress on this difficult task.

Background
Multiword expressions. Most contemporary approaches to English syntactic and semantic analysis treat space-separated words as the basic units of structure. However, this fails to reflect the basic units of meaning for sentences with noncompositional or idiosyncratic expressions, such as:
(1) The staff leaves a lot to be desired .
(2) I googled restaurants in the area and Fuji Sushi came up and reviews were great so I made a carry out order of : L 17 .
In these sentences, a lot, leaves ... to be desired, Fuji Sushi, came up, made ... order, and carry out are all multiword expressions (MWEs): their combined meanings can be thought of as "prepackaged" in a single lexical expression that happens to be written with spaces. MWEs such as these have attracted a great deal of attention within computational semantics; see Baldwin and Kim (2010) for a review. Schneider et al. (2014b) introduced an English corpus resource annotated for heterogeneous MWEs, suitable for training and evaluating general-purpose MWE identification systems (Schneider et al., 2014a). Prior to that, most MWE evaluations were focused on particular constructions such as noun compounds (recently: Constant and Sigogne, 2011; Green et al., 2012; Ramisch et al., 2012; Vincze et al., 2013), though the corpus and identification system of Vincze et al. (2011) target several kinds of MWEs.
Importantly, the MWEs in Schneider et al.'s (2014b) corpus are not required to be contiguous, but may contain gaps (viz.: made ... order). The corpus also contains qualitative labels indicating the strength of MWEs, either strong (mostly non-compositional) or weak (compositional but idiomatic). For simplicity we only include strong MWEs in this task.

Supersenses. As noted above, relying on WordNet-like fine-grained, lexicalized senses creates problems for annotating at a large scale and covering new domains and languages. Named entity recognition (NER) does not suffer from these problems, as it uses a much smaller number of coarse-grained classes. However, these classes only apply to a subset of the nouns in a sentence and exclude verbs and adjectives. They therefore provide far from complete coverage in a corpus.
Noun and verb supersenses (Ciaramita and Altun, 2006) offer a middle ground in granularity: they generalize named entity classes to cover all nouns (with 26 classes), but also cover verbs (15 classes; see table 1), and provide a human-interpretable high-level clustering. WordNet supersenses for adjectives and adverbs nominally exist, but are based on morphosyntactic rather than semantic properties. There is, however, recent work on developing supersense taxonomies for English adjectives and prepositions (Tsvetkov et al., 2014). The inventory for nouns and verbs originates from the top-level organization of WordNet, but can be applied directly to annotate new data, including out-of-vocabulary words in English or other languages (Schneider et al., 2012; Johannsen et al., 2014). Similar to NER, supersense tagging approaches have generally used statistical sequence models and have been evaluated in English, Italian, Chinese, Arabic, and Danish. 3 Features based on supersenses have been exploited in downstream semantics tasks such as preposition sense disambiguation, noun compound interpretation, question generation, and metaphor detection (Ye and Baldwin, 2007; Heilman, 2011; Hovy et al., 2013; Tsvetkov et al., 2013).
Relationship between MWEs and supersenses. We believe that MWEs and supersenses should be tightly coupled: idiomatic combinations such as MWEs are best labeled holistically, since their joint supersense category will often differ from that of the individual words. For example, spill the beans in its literal interpretation would receive supersenses V.CONTACT and N.FOOD, whereas the idiomatic interpretation, 'divulge a secret', is represented as an MWE holistically tagged as V.COMMUNICATION. Schneider and Smith (2015) develop this idea at length, and provide a web reviews data set with the integrated annotation. Here, we expand the paradigm to additional domains and compare the performance of several systems.
3 Evaluations used English SemCor (Ciaramita and Altun, 2006; Paaß and Reichartz, 2009), English-Italian MultiSemCor (Picca et al., 2008, 2009; Attardi et al., 2010), the Italian Syntactic-Semantic Treebank and Italian Wikipedia (Attardi et al., 2010; Rossi et al., 2013), Chinese Cilin (Qiu et al., 2011), Arabic Wikipedia, and the Danish CLARIN Reference Corpus (Martínez Alonso et al., 2015).
Figure 1: Illustration of the target representation. MWE positional markers are shown above the sentence and noun and verb supersenses below the sentence. Links illustrate the behavior of the MWE tags. The supersense labeling must respect the MWEs; thus, V.COGNITION applies to a four-word unit: to, be, and desired must not receive separate supersenses from leaves.

Representation
The analysis for each sentence is represented as a sequence of paired MWE and supersense tags. Figure 1 illustrates the MWE part above the sentence and the supersense part below the sentence.
The MWE portion is a BIO-style (Ramshaw and Marcus, 1995) positional marker. Of the schemes discussed by Schneider et al. (2014a), we adopt the 6-tag scheme, which uses case to allow gaps in an MWE (lowercase tag variants mark tokens within a gap). The positions are thus O, o, B, b, I, i. Systems are expected to ensure that the full tag sequence for a sentence is valid: global validity can be enforced with first-order constraints to prohibit invalid bigrams such as O I and b I (see Schneider et al., 2014a for details).
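The validity requirement can be illustrated with a small checker. This is a minimal sketch, not the official task script: the regular expression is one assumed reading of the 6-tag scheme (an MWE opens with B, may contain gap tokens o or a nested gap MWE b i+, and must close with I), and the function name is hypothetical.

```python
import re

# Assumed regular-expression reading of the 6-tag scheme (not quoted from
# the task materials): a sentence is a sequence of single-token O's and
# MWEs; an MWE starts with B, may contain gap tokens (o) or a nested
# MWE inside the gap (b followed by one or more i), and must end in I.
VALID_SEQUENCE = re.compile(r"(O|B(o|bi+|I)*I+)+")


def is_valid_tagging(tags):
    """Return True if a list of one-letter MWE tags is a valid sentence."""
    return VALID_SEQUENCE.fullmatch("".join(tags)) is not None


# Sentence (1): The staff leaves a lot to be desired .
example = ["O", "O", "B", "b", "i", "I", "I", "I", "O"]
print(is_valid_tagging(example))
```

Under this reading, the bigrams O I and b I named above are both rejected, because I must continue an MWE opened by B and b must be followed by i.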
Because strong MWEs receive a supersense as a unit (if at all), I and i are never accompanied by a supersense label. O or o indicates that the token is not part of any MWE, but many such tokens do bear a noun or verb supersense.
This task uses a CoNLL-style main file format consisting of one line per token, each line having 9 tab-delimited columns. Scripts to convert to and from the .sst format, which displays one sentence per line and contains annotations in a JSON data structure, are provided as well.
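A reader for such a file might look like the following sketch. The text above only states that there are 9 tab-delimited columns per token; the column names used here, and the convention that a blank line separates sentences, are assumptions made for illustration and should be checked against the released data.

```python
from collections import namedtuple

# Hypothetical column names: the source only says there are 9 columns.
Token = namedtuple(
    "Token",
    "offset word lemma pos mwe parent strength supersense sent_id",
)


def read_sentences(lines):
    """Group tab-delimited token lines into sentences.

    Assumes (as in most CoNLL-style files) that a blank line ends a sentence.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != 9:
            raise ValueError(f"expected 9 columns, got {len(fields)}")
        current.append(Token(*fields))
    if current:
        sentences.append(current)
    return sentences
```

Keeping empty fields as empty strings (rather than dropping them) preserves the fixed column positions, which matters because I and i tokens carry no supersense.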

Data
The task built upon two existing data sets of social web text, which were harmonized to form the training data. Four new samples from three domains were annotated to form the test set. The train and test sets are summarized in tables 2 and 3 and are publicly available on the web. The domains covered are online customer reviews, tweets, and TED talks. This section describes, for each domain, how its component data sets were sampled, preprocessed, and annotated.

Annotation Process
We compiled data sets from various sources, with varying degrees of existing pre-annotation. Unless already provided, we added Universal POS tags as defined by the Universal Dependencies project (Nivre et al., 2015), and baseline supersenses (heuristically using the most frequent WordNet sense, and in some cases grouping sequences of proper nouns as MWEs). The pre-annotated supersenses were then manually corrected by a trained annotator, who simultaneously annotated the sentence for comprehensive MWEs.
The annotator (a linguist) was trained by the first author of this paper using Schneider and Smith's (2015) web interface and annotation guidelines. Prior to starting on the data sets for this task, the annotator devoted approximately 8 hours to training practice on a separate data set which already had a gold standard. Periodic feedback was given on initial annotations as the annotator grew accustomed to the conventions. The annotator spent approximately 50 hours on DiMSUM data (not including the initial training phase), which amounts to roughly 90 seconds per sentence.
In order to estimate inter-annotator agreement (IAA), the first author independently annotated a sample of Ritter tweets (§4.3) in 6 groups of 11 sentences, spaced out across the main annotator's annotation batches. IAA estimates for these sets ranged from 60% to 75% F1 for MWEs, and 67%-80% accuracy for supersenses (on tokens which had supersenses in both annotations). Resources did not allow for more systematic double annotation and IAA estimation throughout the data.

REVIEWS
Training. The REVIEWS part of the training data consists of the STREUSLE corpus (Schneider et al., 2014b; Schneider and Smith, 2015), 6 comprising comprehensive multiword expression and supersense annotations on a 55,000-token portion of the English Web Treebank (EWTB; Bies et al., 2012) made up of 723 online user reviews for services (such as restaurants and beauticians). STREUSLE annotation was done by linguists, who took pains to establish conventions and resolve disagreements. Each sentence was annotated independently by at least 2 annotators; disagreements were resolved by negotiation.
The task release is based on version 2.1 of STREUSLE, with weak MWEs removed and Penn Treebank-style POS tags replaced with Universal POS tags. 7
Test. The test portion comprises 340 sentences (6,357 tokens) from the online review site Trustpilot, a subset of the data used in  (the website as a general resource was described in ). The reviews were chosen to obtain a demographic balance (by age, gender, and location), and contained gold POS tags.

TWEETS
Training. Johannsen et al. (2014) recently annotated two samples of 987 Twitter messages (18,000 words) with supersenses: (a) the POS+NER-annotated data set of Ritter et al. (2011), and (b) Plank et al.'s (2014) sample of 200 tweets. 8 Annotators were shown pre-annotations from a heuristic supersense chunking/tagging system (based on the most frequent sense of each word) and asked to correct the boundaries and supersense labels. Though there was no explicit MWE annotation phase, many of the multiword chunks tagged with a noun or verb supersense would be considered MWEs.
5 On the surface, this might be taken to mean that the accuracy of the heuristic baseline used for pre-annotation is 81%. However, because the annotator saw the pre-annotation, we expect that this agreement rate is higher than if the gold standard had been produced from scratch.
6 http://www.ark.cs.cmu.edu/LexSem/
7 The PTB-to-UPOS conversion script is available at: http://tiny.cc/ptb2upos
8 The supersense-annotated tweets are available at https://github.com/coastalcph/supersense-data-twitter.
We fully reannotated both data sets to match the conventions of the REVIEWS data from the STREUSLE corpus. The annotator examined every sentence and corrected any MWE or supersense decisions deemed to be inconsistent with the guidelines.
Test. Our test set consists of 500 tweets (6,627 tokens) taken from the Tweebank corpus (Kong et al., 2014), 9 which already contained some gold-standard MWEs. We converted the POS tags from gold ARK TweetNLP POS + FUDG dependencies to UPOS and had an annotator supply supersenses.

TED TALKS
Test. To test the broad-coverage aspect of the submitted systems, the test set contained a "surprise" domain. We opted to sample transcribed sentences from TED talks. Because individual TED talks tend to heavily repeat vocabulary, we took the first 10 sentences from each of 16 documents in order to achieve a lexically diverse sample. Specifically, we chose (a) 100 sentences (2,187 tokens) from the 10 talks in the NAIST-NTT Ted Talk Treebank 10 (Neubig et al., 2014) (which in turn is a subset of the IWSLT training data); and (b) 60 sentences (1,329 tokens) from the IWSLT test data (Cettolo et al., 2012). 11 The latter 6 documents were chosen to maximize language pair diversity. 12 We induced parts of speech by conversion from the gold PTB trees for the NAIST-NTT data, and for the remaining data, by automatic tagging with an averaged structured perceptron model (Rungsted 13) trained on the English Universal Dependencies v1.2 treebank (Nivre et al., 2015). 14
9 http://www.cs.cmu.edu/~ark/TweetNLP/
10 http://ahclab.naist.jp/resource/tedtreebank/
11 https://wit3.fbk.eu/
12 These 6 talks are known to have been translated from English into (at least) the following languages: {ar, de, es, fa, he, hi, it, ko, nl, th, vi, zh}. Additionally, we note that 4 of the documents have Czech (cs) translations, while the other 2 have French (fr) translations. Neubig et al. (2014) report that all the 10 documents in the NAIST-NTT Treebank have been translated from English into the following 18 languages: {ar, bg, de, el, es, fr, he, it, ja, ko, nl, pl, pt-BR, ro, ru, tr, zh-CN, zh-TW}. Many additional languages are represented for subsets of the documents.

Comparing Domains
A natural question to ask about lexical semantic annotations is whether we observe strong differences between domains. For example, which kinds of multiword expressions and which kinds of supersenses occur more often in some domains than in others? In this section, we report our observations but do not make any strong claims about their generality, for the following reasons: the samples are not necessarily representative of their domains overall, and, in fact, may have been sampled in a biased way (e.g., the Lowlands sample was limited to tweets containing a URL, and as a result, most of these tweets are headlines and advertisements). Furthermore, the annotation procedures differed by subcorpus, likely biasing the results.
MWEs. Figure 2 summarizes MWEs in the seven subcorpora with respect to syntactic status. Colors represent the POS tag of the first word in the MWE. Starting with proper nouns, the blue bars indicate POS tags that tend to begin nominal MWEs (noun, adjective, determiner, etc.). Red bar POS tags are characteristic of verbal MWEs. The remaining bars are prepositional (dark green) and other miscellaneous tags, which collectively comprise no more than 10% of the MWEs in each subcorpus. It is worth noting that in this plot, subcorpora within the same domain are sometimes more divergent than subcorpora in different domains. Lowlands stands out as having a large share of proper noun MWEs, presumably due to the headline-oriented nature of the sample. STREUSLE has the smallest proportion of nominal MWEs, perhaps owing to the way it was annotated: initial rounds of STREUSLE annotation targeted MWEs only, with noun and verb supersenses added only later; whereas in the other data sets, MWE and supersense annotation were performed jointly, so annotator attention may have been focused on nominal and verbal expressions rather than other MWEs.
13 https://github.com/coastalcph/rungsted
14 http://hdl.handle.net/11234/1-1548
Supersenses. In the spirit of Schneider et al. (2012), we performed an analysis to see which supersenses were more characteristic of some domains than others. Figure 3 plots the relative frequency (out of all supersense-labeled units) of each supersense in each of the three domains. We use the REVIEWS domain as base frequency: relative to that, the x-axis is the supersense's occurrence rate in the TWEETS domain, and the y-axis represents the rate for the TED talks.
These plots show some clear outliers: among nouns (left plot), N.GROUP and N.FOOD are overrepresented in REVIEWS relative to the other domains, which is unsurprising because restaurants and other businesses are prominent in this subcorpus. On the other hand, N.PERSON is underrepresented in REVIEWS. N.TIME and N.COMMUNICATION are more popular in the TWEETS domain than the others. Among verbs (right plot), V.STATIVE is underrepresented, apparently due to the relative rarity of the copula (which often can be safely omitted in headlines and other telegraphic messages without obscuring the meaning).

Evaluation
Submission conditions. We invited submissions in multiple data conditions. The open condition encouraged participants to make wide use of any and all available resources, including for distant or direct supervision. A closed condition encouraged controlled comparisons of algorithms by limiting their training to specific resources distributed for the task. Lastly, we allowed for a semi-supervised closed condition, in which use of a specific large unlabeled corpus, the Yelp Academic Dataset, 15 was permitted. Teams were permitted to submit no more than one run per condition. Only one team submitted a system in the semi-supervised closed condition.
Figure 3: Supersense rate differences by domains, compared to reviews data set. Circle area proportional to the supersense's total frequency across all domains. Noun supersenses on the left, verb supersenses on the right. Each domain's rate is microaveraged across its subcorpora; thus, larger subcorpora weigh more heavily than smaller subcorpora in the same domain.
All conditions had access to: 1) the annotated data we provided; 2) Brown clusterings (Brown et al., 1992) computed from large corpora of tweets and web reviews; 16 and 3) the English WordNet lexicon. The input at test time included POS tags.
No sentence-level metadata was provided in the input at test time: test set sentence IDs were obscured to hide the source domain, and the order of sentences was randomized to remove document structure. The training data, however, marked the domain from which the sentence was drawn (REVIEWS or TWEETS); systems were free to make use of this information, so long as it was not required as part of the input at test time.
Scoring. We provided an evaluation script to allow participants to check the format of system output and to compute all official scores.
15 https://www.yelp.com/academic_dataset
16 I.e., TweetNLP clusters (http://www.cs.cmu.edu/~ark/TweetNLP/) and the Yelp Academic Dataset clusters used in AMALGrAM (http://www.cs.cmu.edu/~ark/LexSem/).
The MWE measure looks at precision, recall, and F1 of the identified MWEs. Tokens not involved in a predicted or gold MWE do not factor into this measure. To award partial credit for partial overlap between a predicted MWE and a gold MWE, these scores are computed based on links between consecutive tokens in an expression (Schneider et al., 2014a). The tokens must appear in order but do not need to be adjacent. The precision is the proportion of predicted links whose words both belong to the same expression in the gold standard. Recall is the same as precision, but swapping the predicted and gold annotations. Figure 4 defines this measure in detail and illustrates the calculations for an example.
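The link-based measure can be sketched in a few lines. This is an illustrative reimplementation under stated assumptions, not the official evaluation script: each MWE is represented as a set of token indices, the function names are hypothetical, and the prediction below for sentence (1) is invented for the example.

```python
from fractions import Fraction


def links(mwes):
    """Links between consecutive tokens of each MWE (token indices, in order)."""
    return [
        (a, b)
        for mwe in mwes
        for a, b in zip(sorted(mwe), sorted(mwe)[1:])
    ]


def link_precision(predicted, gold):
    """Share of predicted links whose two tokens share one gold expression."""
    group_of = {tok: i for i, mwe in enumerate(gold) for tok in mwe}
    pred_links = links(predicted)
    matched = sum(
        1 for a, b in pred_links
        if a in group_of and group_of[a] == group_of.get(b)
    )
    return Fraction(matched, len(pred_links)) if pred_links else None


# Tokens: The(0) staff(1) leaves(2) a(3) lot(4) to(5) be(6) desired(7) .(8)
gold = [{2, 5, 6, 7}, {3, 4}]    # leaves ... to be desired; a lot
predicted = [{3, 4}, {5, 6, 7}]  # hypothetical output that misses "leaves"

precision = link_precision(predicted, gold)
recall = link_precision(gold, predicted)  # recall swaps the two annotations
```

In this invented case every predicted link is supported by the gold analysis, while the gold link from leaves to to is not recovered, so the prediction receives partial rather than zero credit for the gappy expression.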
To isolate the supersense classification performance, we compute precision, recall, and F1 of the supersense-labeled word tokens. The numerator of both precision and recall is the number of tokens labeled with the correct supersense. (This interacts slightly with MWE identification, however, as supersenses are only marked on the first token of MWEs. We do not mark supersenses on all words of the MWE to avoid giving MWEs a disproportionate influence on the supersense score.) Finally, combined precision, recall, and F1 aggregate the MWE and supersense subscores. The combined precision ratio is computed from the MWE and supersense precision ratios by adding their numerators and denominators, and likewise for combined recall (see the example in figure 4).
MWE Precision: The proportion of predicted links whose words both belong to the same expression in the gold standard. MWE Recall: Same as precision, but swapping the predicted and gold annotations.
Figure 4: A REVIEWS sentence with MWE and supersense analyses: gold above and hypothetical prediction below. MWE precision of the bottom annotation relative to the top one is 2/5. (Note that a link between words w1 and w2 is "matched" if, in the other annotation, there is a path between w1 and w2.) The MWE recall value is 3/4. Supersense precision and recall are both 1/2. Combined precision/recall scores add the respective subscores' numerators and denominators: thus, combined precision is (2+1)/(5+2) = 3/7, and combined recall is (3+1)/(4+2) = 2/3. Combined F1 is their harmonic mean, i.e. 12/23.
Within each domain, scores are computed as microaverages. The official tri-domain scores reported here are domain macroaverages: per-domain measures are aggregated with the three domains weighted equally. The main score, tri-domain combined F1, is the arithmetic mean of the three per-domain combined F1 scores. (Some system papers report domain microaverages, which give less influence to the TED domain because it is the smallest of the domains in the test set.)
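The arithmetic in the Figure 4 caption can be checked with exact fractions. The sketch below uses only the counts stated in that caption; the helper names are hypothetical.

```python
from fractions import Fraction


def combine(mwe, sst):
    """Add numerators and denominators of two (matched, total) count pairs."""
    return Fraction(mwe[0] + sst[0], mwe[1] + sst[1])


def f1(p, r):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    return 2 * p * r / (p + r) if p + r else Fraction(0)


# Counts from the Figure 4 caption: MWE P = 2/5, MWE R = 3/4, supersense P = R = 1/2.
precision = combine((2, 5), (1, 2))
recall = combine((3, 4), (1, 2))
combined_f1 = f1(precision, recall)
```

Adding numerators and denominators, rather than averaging the two ratios, weights each subtask by the number of decisions it contributes.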
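The difference between the official macroaverage and a pooled microaverage can be illustrated as follows. The per-domain counts are invented purely for the example and do not come from the task data; the function names are hypothetical.

```python
def f1_from_counts(p_num, p_den, r_num, r_den):
    """F1 from precision and recall numerators and denominators."""
    p = p_num / p_den if p_den else 0.0
    r = r_num / r_den if r_den else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def macro_f1(counts):
    """Arithmetic mean of per-domain F1 (each domain weighted equally)."""
    scores = [f1_from_counts(*c) for c in counts.values()]
    return sum(scores) / len(scores)


def micro_f1(counts):
    """F1 over counts pooled across domains (larger domains weigh more)."""
    totals = [sum(column) for column in zip(*counts.values())]
    return f1_from_counts(*totals)


# Hypothetical (p_num, p_den, r_num, r_den) per domain, for illustration only.
counts = {
    "REVIEWS": (60, 100, 60, 120),
    "TWEETS": (50, 100, 50, 100),
    "TED": (10, 40, 10, 20),
}
print(macro_f1(counts), micro_f1(counts))
```

With these invented counts, the small and low-scoring TED domain pulls the macroaverage below the microaverage, which is the effect described in the parenthetical remark above.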

Entries and Results
Six teams 18 participated in the task, submitting a total of nine unique system entries prior to the deadline. We give an overview of these systems and analyze their performance.

Synopsis of approaches
From the UFRGS&LIF team (Cordeiro et al., 2016), S106 detects MWEs by heuristic pattern-matching against sequences in the training data, and predicts the most frequent supersense observed for each type in the training data.
From the UTU team (Björne and Salakoski, 2016), S211, S254, and S255 match word sequences against a variety of resources and then choose a supersense with an ensemble of classifiers. The method performs reasonably well for supersenses, but is weak at detecting MWEs.
18 None of the teams included any DiMSUM organizers.
The UW-CSE team (Hosseini et al., 2016) experimented with a sequence CRF as well as a double-chained CRF, with separate chains for MWE tags and supersenses, and some parameters shared between them. The closed-condition and open-condition feature sets were drawn from AMALGrAM (Schneider and Smith, 2015). Of the official submissions, S248 used a single-chain CRF and S249 a double-chained CRF. A full comparison demonstrates that the double-chained CRF performs best on the combined measure in both the closed and open conditions.
From the ICL-HD team (Kirilin et al., 2016), S214 uses the AMALGrAM sequence tagger (Schneider and Smith, 2015) with an augmented feature set that leverages word embeddings and a knowledge base. The word embedding features, the knowledge base-derived features, and their union all improve over the condition with no new features, with respect to both MWE performance and supersense performance. The best results for the combined measure are obtained with the word embedding features (but not the knowledge base features). The word embeddings are shown to be somewhat complementary to AMALGrAM's Brown cluster features: ablating either reduces performance.
From the WHUNlp team (Tang et al., 2016), S108 uses a pipeline where a sequence CRF first identifies MWEs, and a maximum entropy classifier then predicts a supersense independently for each lexical expression. Each of these models has a small number of feature templates recording words and POS tags.
From the VectorWeavers team (Scherbakov et al., 2016), S227 relies on neural network classifiers to detect MWE boundaries and label supersenses, using features based on word embeddings and syntactic parses. Results show that syntax helps identify MWE boundaries accurately, and that simple incremental composition functions can help construct useful MWE representations.

Overall results
The main results appear in table 4. The first column of table 4 gives the ranking of the systems. Several systems may share a rank if they do not produce significantly different predictions, as detailed below. The score is the combined supersense and MWE measure, macroaveraged over the three test set domains as described above. The final column indicates the resource condition: systems entered in the open condition (all resources allowed) are designated "++"; "+" indicates the more restricted semi-supervised closed condition, while the remaining systems are in the closed condition (most restrictive). Details of the resource conditions and scoring appear in §5.
Ranking and significance. The overall best scoring system, with a combined measure of 57.77%, is S214. The competition, however, is close: S249 scored 57.71%, and S248 obtained a combined score of 57.10%. To check whether the predictions of the systems are significantly different from each other, we use McNemar's test. According to this test, the predictions of the highest-ranking and the next-highest-ranking system are not significantly different at p < .05. The third highest ranking system performs significantly worse than the top system, but is not significantly different from the second-place system. We therefore decided to rank all three systems together. In general, adjacent entries in the sorted scoring table are ranked together if the difference between them is not statistically significant according to the test.
Drilling down. The two highest-scoring systems were in the open condition, taking advantage of additional resources. The best system in the closed condition is S248, which is very similar to S249; recall that its predictions, overall, are not statistically worse. Table 5 reveals one striking difference, however: in MWE scores for TWEETS, S249 bests S248 by nearly 7 points.
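A paired significance check of this kind can be sketched as below. The text names McNemar's test but does not say which variant was used or what counts as one paired decision, so the exact (binomial) form and the per-instance correct/incorrect flags here are assumptions; the function name is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from paired correct/incorrect flags.

    Only discordant pairs (one system right, the other wrong) matter.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the two systems never disagree
    # Probability of a split at least this uneven under a fair coin.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Two systems would then share a rank when the returned p-value is not below .05, which mirrors the grouping rule stated above.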
When scores in the 3 domains are compared for each system, there is surprisingly little difference overall. We expected that the TED domain would be most difficult because it is not represented in the training data, but the scores in table 5 give no clear indication that this is the case. Perhaps systems escaped domain bias because the training data included two highly divergent genres; or perhaps other aspects of the data sets (e.g., topic) matter more for this task than differences in genre.

Easy and hard decisions
Overall, the results clearly show that the joint supersense and MWE tagging task is not yet resolved. Given the wide range of participating systems and previous work, it is reasonable to assume that the task itself is not easy. On the other hand, it is not uniformly hard. In fact, some decisions are relatively easy, in the sense that most or all systems get them right; whereas others are hard, in that none or very few systems produce the correct answer. Figure 5 explores this for the supersense-tagging subtask. The tallest bars are near the left and right sides of the graph, representing the hard and easy instances, respectively. Hard instances account for about 25% of instances where the gold data has a supersense, which also puts an upper bound on any system combination. Even an oracle system allowed to choose the best prediction for each instance from among all the systems would still not push the accuracy above 75%.
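The easy/hard breakdown and the oracle bound can be computed from a table of per-instance outcomes. The sketch below assumes such a table is available as lists of booleans (one list per system); the toy values and function names are invented for illustration.

```python
from collections import Counter


def systems_correct_histogram(correct):
    """Count instances by how many systems labeled them correctly.

    correct[s][i] is True if system s got instance i right.
    """
    return Counter(sum(flags) for flags in zip(*correct))


def oracle_accuracy(correct):
    """Accuracy of an oracle that picks any correct system per instance."""
    solved = [any(flags) for flags in zip(*correct)]
    return sum(solved) / len(solved)


# Toy outcomes for 3 hypothetical systems on 4 instances.
toy = [
    [True, True, False, False],
    [True, False, False, True],
    [True, False, False, False],
]
histogram = systems_correct_histogram(toy)
upper_bound = oracle_accuracy(toy)
```

Instances in the zero bucket of the histogram are exactly those that no combination of the submitted systems can recover, which is why their share bounds the oracle from above.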
The distribution of easy and hard instances varies a lot between labels, though. As shown for supersenses in figure 6, individual labels range from the fairly easy (e.g. V.STATIVE and V.COMMUNICATION) to the more difficult (e.g. N.ATTRIBUTE and V.CONTACT). The most common supersense, V.STATIVE, is easy because it has few distinct lexical forms (the ten most common lemmas make up more than 77% of the instances). Examples of V.STATIVE lemmas include be, have, use, and get.
Supersenses may be difficult for more than one reason. For instance, V.CONTACT (e.g., deliver, receive, and take) has more distinct forms than V.STATIVE and also a more complex mapping between lemmas and supersenses. In contrast, person names, job titles, etc. that should be tagged as N.PERSON are rarely ambiguous with respect to supersense. The main challenge in that case is that the category is open-ended and not in general evident from syntactic structure.

System correlation
Finally, we examine whether the submitted approaches capture different aspects of the task. I.e., could we produce a better system by combining the individual systems? We cannot estimate this from the results tables, since, combinatorially, there are many ways to obtain a given score. However, we can estimate it from the prediction overlap between systems. The N × N labeled matrix in figure 7 shows how the N systems relate to each other. Each cell compares the predictions of two systems a and b in the joint supersense and MWE task. The value of a cell T_{a,b} is the number of correct predictions made by a that were not correctly predicted by b. This is an asymmetric measure of predictive similarity. A single low number indicates one out of two things: either the systems are similar, or a is better than b. When the sum T_{a,b} + T_{b,a} is small, the two systems make similar predictions.
Clustering the systems in figure 7 (shown on the left side of the plot) results in groups that correspond to the ranking in table 4. Inside the cluster of systems ranked at 1, the asymmetric predictive advantage ranges between 267 and 469. Lower-ranked systems all have a smaller predictive advantage with respect to the top-ranked systems. The best combination system would thus likely be between two of the rank-1 systems. However, the gains are small, and overall the systems seem to extract the same knowledge, or subsets of the same knowledge, out of the training data.
Figure 7: System clusters. Each cell compares the predictions of two systems i and j with respect to a gold standard. The value in the i, j-th cell is the number of predictions that i got right but j did not.
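The asymmetric overlap measure can be written down directly. This is a minimal sketch assuming per-instance correctness flags for each system; the system names and toy values are placeholders, not task results.

```python
def advantage_matrix(correct):
    """T[a][b] = number of instances system a got right and system b got wrong."""
    names = list(correct)
    return {
        a: {
            b: sum(1 for x, y in zip(correct[a], correct[b]) if x and not y)
            for b in names
        }
        for a in names
    }


# Toy correctness flags for two hypothetical systems on four instances.
toy = {
    "A": [True, True, True, False],
    "B": [True, False, False, True],
}
T = advantage_matrix(toy)
dissimilarity = T["A"]["B"] + T["B"]["A"]  # small sum = similar predictions
```

The symmetric sum is what a clustering step would consume, while the individual entries indicate how much one system could gain from being combined with another.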

Conclusion
This shared task featured broad-coverage lexical semantic analysis that combines MWE identification and supersense tagging. The semantic tagset strikes a balance between the extremely difficult fine-grained distinctions in classical WSD, and the restrictiveness of the NER task. To guard against domain bias, we provided training data from two different genres, namely online reviews and tweets, as well as a test-only data set with TED talk transcripts. The training and test data sets are publicly available at https://github.com/dimsum16/dimsum-data.
The best scoring systems obtained 57.7% F1 on a composite measure over the two subtasks of MWE and supersense tagging, averaged over the three test domains. This level of performance suggests that the task is not yet resolved. Furthermore, our error analysis suggests that the submitted systems arrived at similar generalizations from the training data. Substantially improving performance would thus seem to require novel approaches.