COMETA: A Corpus for Medical Entity Linking in the Social Media

Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman's language. Meanwhile, there is a growing need for applications that can understand the public's voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines from string- to neural-based models we shed light on the ability of these systems to perform complex inference on entities and concepts under 2 challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of data.


Introduction
Social media has become a dominant means for users to share their opinions, emotions and daily experience of life. A large body of work has shown that informal exchanges such as online forums can be leveraged to supplement traditional approaches to a broad range of public health questions such as monitoring suicidal risk and depression (Benton et al., 2017b), domestic abuse (Schrading et al., 2015), cancer (Nzali et al., 2017), and epidemics (Aramaki et al., 2011;Joshi et al., 2019).
One of the widely exercised steps to establish a semantic understanding of social media is Entity Linking (EL), i.e., the task of linking entities within a text to a suitable concept in a reference Knowledge Graph (KG) (Liu et al., 2013;Yang and Chang, 2015;Yang et al., 2016;Ran et al., 2018). However, it is well-documented that poorly composed contexts, the ubiquitous presence of colloquialisms, shortened forms, typing/spelling mistakes, and out-of-vocabulary words introduce challenges for effective utilisation of social media text (Baldwin et al., 2013;Michel and Neubig, 2018). These challenges are exacerbated in EL for user generated content (UGC) in the health domain for two main reasons: lack of dedicated annotated resources for training EL models, and entanglement of the aforementioned challenges in general social media with the inherent complexity of the health domain and its terminology (see Table 1).
For example, in Figure 1 we show sentences taken from social media where the semantics of the concept linking is complex and context-dependent. In the first case, "diagnosed with gad where by  benzos at", benzos is a colloquial form of benzodiazepines, a type of sedative, and if correctly resolved can provide a contextual clue to assign the appropriate sense to the polysemous term gad: an abbreviation for generalised anxiety disorder rather than e.g. glutamate decarboxylase. In the second example, "went to get bloods done at 11 30", the word bloods could be interpreted literally as blood; however, in this case it clearly refers to a blood test, and it can be correctly resolved only by considering the full context in which it is used.
In this paper we open up a new avenue for EL research specifically targeted at the important domain of health in social media through the release of a new resource: the Corpus of Online Medical EnTities (COMETA), consisting of 20K biomedical entity mentions in English from publicly available and anonymous health discussions on Reddit. Each mention has been expert-annotated with KG concepts 1 from SNOMED CT (Donnelly, 2006) 2, a structured medical vocabulary of ca. 350K concepts widely used to code Electronic Health Records (EHRs). As we show, COMETA provides a high quality yet challenging benchmark for developing EL techniques, especially for concepts not encountered during training (zero-shot concepts). Due to its semantic diversity the corpus represents an important pathway to knowledge integration between layman's language, EHRs and research evidence.
Through a set of experiments we shed light on the challenges in this domain for several EL baselines utilising a diverse range of techniques from basic string-matching to low-dimensional entity embeddings (Bojanowski et al., 2017), KG structure embeddings (Grover and Leskovec, 2016;Agarwal et al., 2019), and context aware BERT embeddings (Devlin et al., 2019;Lee et al., 2020). We show that a simple augmentation of the mainstream BERT model with a Multi-Level Attention module can improve its effectiveness in capturing the contextual nuances of highly diverse layman's language in the health domain. Our experimental results illustrate that the best solution needs to combine multiple views of data and still heavily relies on basic techniques, while the remaining performance gap highlights the challenging nature of COMETA. We summarise these challenges and underline some of the key areas that are indispensable for further progress in this domain.
1 Throughout the paper concept refers to nodes in a KG (i.e., SNOMED), term/entity refers to the surface form mention of a concept in text, and context refers to the text in which a term appears. Also, SCTID denotes SNOMED CT Identifier.
2 We use the July 2019 release of the international edition.

Related Work and Datasets
Entity Linking. EL (Bunescu and Pasca, 2006) is an important task that has sparked attention in recent years due to its wide-scale potential to aid in knowledge acquisition, e.g. the complementary problems of cross-document coreference resolution (Dredze et al., 2016), semantic relatedness (Dor et al., 2018), geo-coding (Gritta et al., 2017) and relation extraction (Koch et al., 2014). Systems that link entities to Wikipedia (Wikification) (Liu et al., 2013;Roth et al., 2014) and scientific literature to biomedical ontologies (Zheng et al., 2015) have been the focus of attention for many years. Generic EL systems such as Babelfy (Moro et al., 2014) and Tagme (Ferragina and Scaiella, 2011) identify and map entities to Wikipedia and WordNet (Miller et al., 1990) but do not directly integrate the coding standards of healthcare KGs such as SNOMED. Medical EL systems such as cTAKES (Savova et al., 2010) and MetaMap (Aronson and Lang, 2010) were designed to perform medical EL on EHRs but limited evidence e.g. (Denecke, 2014) points to a large drop in recall on UGC such as patient forums.
Medical EL in Social Media. There are several medical EL corpora based on scientific publications (Verspoor et al., 2012;Mohan and Li, 2019), EHRs (Suominen et al., 2013) and death certificates (Goeuriot et al., 2017). However, none of these EL corpora dealt with the challenges of UGC.
Due to under-reporting of drug side effects (Freifeld et al., 2014) pharmacovigilance datasets have been among the popular UGC benchmarks for evaluating medical EL. The earliest corpus in this domain was CADEC (Karimi et al., 2015) where 1253 AskAPatient posts (6754 concept mentions) were annotated based on a search for the drugs Diclofenac and Lipitor. Another dataset, Twitter ADR (Nikfarjam et al., 2015), consists of 1784 posts (1280 concept mentions) based on a search for 81 drug names, while TwiMed (Alvaro et al., 2017) provides a comparable corpus of 1K PubMed and 1K Twitter texts (3144 concept mentions) based on a search for 30 drugs. Limsopatham and Collier (2016) introduced two Twitter datasets (201 and 1436 concept mentions) with mappings to the SIDER-4 database (Kuhn et al., 2016), and RedMed (Lavertu and Altman, 2019) used Reddit to build a lexicon of alternative spellings for 2978 drugs to improve EL on social media. Closest to our work is MedRed (Scepanovic et al., 2020), a medical Named Entity Recognition corpus of 2K Reddit posts based on forums for 18 diseases. However we note several key differences to our work: our corpus is four times larger, provides two levels of mapping to general and context-specific concepts and has a much greater diversity of concepts rather than just symptoms and drugs ( §3.3).

The COMETA Corpus
The COMETA corpus satisfies multiple properties which we will explain throughout this section: CONSISTENCY. COMETA has been annotated by biomedical experts to a high quality using SNOMED CT concepts (SCTIDs), a standard for clinical information interchange (§3.2); SCALE AND SCOPE. To the best of our knowledge, with 20K concept mentions, it is the largest UGC corpus for medical EL. Annotated entities cover a wide range of concepts including symptoms, diseases, anatomical expressions, chemicals, genes, devices and procedures across a range of conditions (§3.3); DISTRIBUTION. We release the full corpus along with two sampling strategies (Stratified and Zero-shot) to prevent over-optimistic reporting of performance (Tutubalina et al., 2018): while Stratified is designed to show the ability of systems to recognise known concepts with possibly novel mentions, Zero-shot is designed to test for recognising novel concepts (§3.4).

Collection
In order to build our corpus, we crawled health-themed forums on Reddit using Pushshift (Baumgartner et al., 2020) and Reddit's own APIs. We chose forums satisfying strict constraints, i.e. selecting subreddits where: (i) new content was posted daily, (ii) the quality of the content was sufficient (e.g. avoiding spam-ridden forums), (iii) the focus was the personal experiences or questions of the users. 3 Applying these criteria, we selected a list of 68 subreddits (see Appendix A.1 for the full list) and crawled all the threads from 2015 to 2018, obtaining a collection of more than 800K discussions. This collection was then pruned by removing deleted posts, comments by bots or moderators, and so on. In order to obtain the candidate entities, we trained the Flair NER system (Akbik et al., 2018) on a corpus of patient discussions from the health forum HealthUnlocked 4; we then used this system to find medical entities in a random sub-sample of 100K discussions of our Reddit set, resulting in over 65K distinct named entities being discovered.
Following the standard practices for ethical health research in social media outlined in (Benton et al., 2017a), we then anonymised the corpus to preserve, as far as possible, the privacy of the users. We removed personally identifiable data from messages and we selected terms that were mentioned by at least five users to avoid using terminology particular to a specific user.
Finally, after anonymisation, we hired two professional annotators with Ph.D. qualification in the biomedical domain to annotate the most popular 8K tagged entities with SNOMED concepts.

Consistency
The annotation process consisted of two steps: FIRST STEP. We showed the first annotator an entity and up to six random sentences in which it appeared. If the entity was unambiguous, e.g. left ankle, the annotator had to associate it with the relevant SCTID (e.g. SCTID: 51636004 - Left Ankle) and up to three sentences correctly representing it. Moreover, the first annotator was required to mark NER system mistakes (e.g., wrong type, wrong span, or non-medical entity) to ensure the inclusion of high quality entities. Only 2.1% of the entities were rejected, confirming the quality of our NER system. SECOND STEP. The second annotator then tackled the ambiguous entities, selecting up to three possible specific senses, and associating each sense with the relevant examples. This way, we obtained two levels of annotation: the General level, concerned with the literal meaning of the term, and the Specific level, which takes into account the context in which the entity appears. For example, in the sentence "Regarding my eyes, I'm not experiencing cloudiness.", the literal interpretation of the entity cloudiness corresponds to the General SNOMED concept SCTID: 81858005 - Cloudy (qualifier value); however, a context-sensitive assignment which takes into account the word eyes maps the entity to the Specific concept SCTID: 246636008 - Hazy vision. The Specific level requires contextual information to be effectively incorporated in the linking step, and hence constitutes a more challenging EL task.
The final corpus contains 20015 entities, each of which is assigned a General and a Specific SCTID and accompanied by an example sentence from Reddit where the entity is used. We also provide the link to the Reddit thread where the sentence appears (see Appendix A.2 for a sample). Also, contrary to other corpora, we exclude NIL entities, i.e. entities without a corresponding concept in SNOMED.

Assessing Annotation Quality
Similar to Mohan and Li (2019), we assessed the quality of the annotation process by asking two pairs of assessors 5 to assess the quality of 1K random annotations (500 per pair of assessors). Each annotation was scored from 1 (Very Poor) to 5 (Excellent). For example, for the term chronic back pain, linking it to SCTID: 134407002 - Chronic back pain entails a score of 5, to SCTID: 61968008 - Syringe entails a score of 1, and to SCTID: 77568009 - Back entails a score of 3, since the selected node is not correct but it identifies the location of the concept; see Table 8 in Appendix A.3 for more details on the instructions we provided to the assessors.
Outcome. Out of 1K examples, both assessors assigned the maximum score of 5 to 93.5% and at least 4 to 96.8% of both the general and specific level annotations. This is a good indication of the quality of the annotations and is in line with Mohan and Li (2019)'s findings. Further investigation of weakly scored entities (3.2% of examples) highlights the unique challenges that emerge in this domain. We provide two representative examples: EXAMPLE 1. Regarding the entity "UI" in the sentence "If you're having GI problems, UI issues and/or ED issues please get the breath test for H.Pylori.", the annotator assigned the SCTID: 68566005 - Urinary tract infectious disease. One assessor agreed with the annotator's judgement in considering "UI" an abbreviation of "Urinary infection", while the other assessor assigned only a score of 3, considering it an abbreviation of "Urinary incontinence". Given the sentence, however, both interpretations are plausible. EXAMPLE 2. Consider the entity "pissed off" in the sentence "And to top it off my stomach becomes bloated and pissed off.". Here, "pissed off" is used figuratively to indicate some form of discomfort; however, the annotator assigned SCTID: 75408008 - Feeling angry, which both assessors flagged as incorrect. Nevertheless, neither assessor could suggest a better SNOMED concept, as this phrase does not identify a precise disease. These ambiguities exemplify why performing EL in the UGC domain can be hard even for humans and highlight the complexity found in laymen's medical conversations.

Scale and Scope
The corpus contains 6404 unique terms, 19911 unique example strings, 3645 unique general concepts (SCTIDs), and 4003 unique specific concepts (SCTIDs). Each general and specific concept is represented on average with more than 1 surface form, while some concepts had more than 15 surface forms, for example SCTID: 5935008 - Oral contraception, SCTID: 225013001 - Feeling bad, and SCTID: 34000006 - Crohn's Disease.
Additionally, each concept was accompanied by an average of at least 5 example sentences (median of 3), while 4.5% of entities were linked to different general and specific SNOMED concepts (i.e., due to polysemy or contextual cues). We note that 31 entities are associated to more than one general SCTID, while 453 are associated to more than one specific SCTID.

Distribution
We provide the COMETA corpus in two different sampled splits: STRATIFIED SPLIT. Each SNOMED concept appearing in the test/development sets appears at least once in the training set. The stratification by SCTID results in 100% coverage of concepts in test/development, but at the surface-form level it covers only 58% of the entities in the test set. ZERO-SHOT SPLIT. Development and test sets contain only novel concepts for which no training data was available.
In other words, the Stratified split is designed to ensure that the model encounters the same concepts in the training, development and test set, but possibly with different surface forms; the Zero-Shot split, instead, exposes models to unseen terms and concepts in the development and testing sets, making it the hardest of the two settings ( §4). We argue that Zero-Shot is a more realistic setting since obtaining training data that covers all 350K SNOMED concepts involves a very expensive annotation effort. The statistics for the splits are shown in Table 2.
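The contrast between the two splits can be illustrated with a toy sketch. The function names, the (term, SCTID) record layout and the held-out fraction are assumptions made for illustration; this is not the procedure used to produce the released splits.

```python
import random
from collections import defaultdict

def zero_shot_split(records, test_fraction=0.2, seed=0):
    """Hold out whole concepts: no test SCTID ever appears in training."""
    concepts = sorted({sctid for _, sctid in records})
    rng = random.Random(seed)
    rng.shuffle(concepts)
    n_test = max(1, int(len(concepts) * test_fraction))
    held_out = set(concepts[:n_test])
    train = [r for r in records if r[1] not in held_out]
    test = [r for r in records if r[1] in held_out]
    return train, test

def stratified_split(records, seed=0):
    """Every test SCTID also appears in training, possibly with another surface form."""
    by_concept = defaultdict(list)
    for rec in records:
        by_concept[rec[1]].append(rec)
    rng = random.Random(seed)
    train, test = [], []
    for sctid in sorted(by_concept):
        group = by_concept[sctid][:]
        rng.shuffle(group)
        train.append(group[0])        # guarantees training coverage of the concept
        if len(group) > 1:
            test.append(group[1])     # one further mention is held out for testing
            train.extend(group[2:])
    return train, test

# Toy records: (surface form, SCTID). Illustrative only.
records = [
    ("bloodwork", "396550006"), ("blood test", "396550006"),
    ("UTIs", "68566005"), ("urinary infection", "68566005"),
    ("cloudiness", "246636008"),
]
zs_train, zs_test = zero_shot_split(records)
st_train, st_test = stratified_split(records)
```

In the zero-shot variant the train and test concept sets are disjoint by construction, whereas in the stratified variant the test concepts are a subset of the training concepts.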

Experiments and Results
In this section we conduct a diverse set of EL experiments, where we apply different simple and complex paradigms to link the annotated entities (and the sentences in which they appear) with the corresponding SNOMED concepts. We follow previous works in biomedical entity linking and use top-k Accuracy (k ∈ {1, 10}) to evaluate performance of EL systems (D'Souza and Ng, 2015). Note that Acc@10 is only computed for systems returning a ranked list and measures if the correct concept is contained within the top 10 concepts returned by the system. We also report Mean Reciprocal Rank (MRR, Craswell (2018)), which instead measures the position of the correct concept in the list of concepts returned by the system. Details about training as well as model and hardware configurations are available in Appendix A.5. Our baselines cover both string/dictionary-based algorithms ( §4.1) which are good at capturing surface-level similarities, and neural models capable of incorporating contextual information ( §4.2), where we experiment with a new Multi-Level Attention mechanism based on BERT to allow more efficient incorporation of context. Finally, to achieve the best possible performance, we combine these models in a back-off setting where we leverage the benefits of each paradigm ( §4.3). When describing the results, we will report the results on the general split and place the results on the specific split in parentheses.
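As a reference for how these metrics behave, the following minimal sketch computes Acc@k and MRR over ranked candidate lists. The function names and the convention that a missing gold concept contributes a reciprocal rank of 0 are assumptions for illustration.

```python
def accuracy_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold concept is among the top-k candidates."""
    hits = sum(1 for ranked, g in zip(ranked_lists, gold) if g in ranked[:k])
    return hits / len(gold)

def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold concept (0 when it is not returned at all)."""
    total = 0.0
    for ranked, g in zip(ranked_lists, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)

# Three toy queries with placeholder concept identifiers.
ranked = [["A", "B"], ["B", "A"], ["C", "D"]]
gold = ["A", "A", "X"]
acc1 = accuracy_at_k(ranked, gold, 1)      # only the first query is a top-1 hit
acc10 = accuracy_at_k(ranked, gold, 10)    # the first two queries contain the gold concept
mrr = mean_reciprocal_rank(ranked, gold)   # (1 + 1/2 + 0) / 3
```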

Dictionary and String-based Baselines
As a first step, we experimented with a set of naïve systems based on string matching and edit distance. 6 These baselines ignore the context around the entities, since they simply try to match entities against SNOMED labels.
Dictionary. A dictionary mapping each entity to its SNOMED concept is built from the training set. If an entity is mapped to multiple SNOMED labels, the dictionary records the most frequent one.
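The dictionary baseline, which records the most frequent concept for each entity, could be sketched as follows. The sketch assumes the mapping is built from (term, SCTID) pairs and that terms are lower-cased before lookup; both the function name and the normalisation step are illustrative choices, and the identifiers in the example are placeholders.

```python
from collections import Counter, defaultdict

def build_dictionary(pairs):
    """Map each term to the concept it is most frequently annotated with."""
    counts = defaultdict(Counter)
    for term, sctid in pairs:
        counts[term.lower()][sctid] += 1   # lower-casing is an assumption
    return {term: c.most_common(1)[0][0] for term, c in counts.items()}

# Placeholder identifiers, for illustration only.
pairs = [("gad", "C1"), ("GAD", "C1"), ("gad", "C2"), ("benzos", "C3")]
lookup = build_dictionary(pairs)
prediction = lookup.get("gad")        # most frequent concept for "gad"
unseen = lookup.get("bloodwork")      # None: the baseline has no answer for unseen terms
```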
String-Matching Edit-Distance. For every term, a string-matching search is conducted on its surface form against all the SNOMED node labels. Note that every SNOMED node has multiple alternative surface forms, resulting in 2-36 comparisons per entity. We count a hit if the entity is matched with any of the node's surface forms based on exact match, Levenshtein ratio or Stoilos distance, two strong string matching heuristics, which are defined as follows: given two strings $x$, $y$, the Levenshtein ratio (or normalised Levenshtein distance, Yujian and Bo (2007)) is defined as $\frac{\mathrm{Lev}(x, y)}{\max(|x|, |y|)}$, where Lev is the Levenshtein distance (Levenshtein, 1966) between $x$ and $y$; the Stoilos distance (Stoilos et al., 2005) is defined as the similarity of two strings, $\mathrm{comm}(x, y) - \mathrm{diff}(x, y) + \mathrm{winkler}(x, y)$, where the first and second terms are commonality and difference scores computed based on lengths of substrings of $x$, $y$ that are matched/unmatched and the third term is Jaro-Winkler distance (Winkler, 1999). Both edit distance metrics were tuned to offer the best trade-off between true and false positives in the development set; further details are provided in Appendix A.6.
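The Levenshtein ratio defined above can be computed with a standard dynamic-programming routine. The sketch below is a plain reference implementation; the function names and the handling of two empty strings are assumptions, and the tuned matching threshold used in the experiments is not reproduced here.

```python
def levenshtein(x, y):
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            cost = 0 if cx == cy else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def levenshtein_ratio(x, y):
    """Lev(x, y) / max(|x|, |y|); 0 means identical, 1 means nothing shared."""
    if not x and not y:
        return 0.0   # convention for two empty strings (assumption)
    return levenshtein(x, y) / max(len(x), len(y))

ratio = levenshtein_ratio("kitten", "sitting")   # 3 edits over a maximum length of 7
```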
cTAKES. cTAKES (Savova et al., 2010) is a heavily engineered system for processing clinical text. We report on its EL pipeline which is based on several dictionary-based and advanced string matching techniques for resolving abbreviations, acronyms, spelling variants, and synonymy. 7
QuickUMLS. QuickUMLS (Soldaini and Goharian, 2016) is a fast approximate dictionary matching system for medical concept extraction using SimString (Okazaki and Tsujii, 2010) as its back-end. We restrict its search space to the SNOMED CT subset of UMLS. As QuickUMLS predicts UMLS CUIs instead of SCTIDs, we map predicted CUIs to SCTIDs through the UMLS API. 8 When multiple plausible mappings exist, we count a hit if any one of them matches. 9
Results. Table 3 summarises the results for the dictionary and string-based baselines. The dictionary method can serve as a strong baseline on the Stratified split, where its performance is barely matched by the more complex string-matching techniques. The most complex strategy, Stoilos distance, outperforms the other string-based techniques, and interestingly is on par with the highly complex cTAKES system while performing significantly better than QuickUMLS. It is worth noting that cTAKES obtained 95.7% in an EL task on an EHR dataset (Savova et al., 2010), highlighting the greater difficulty of the task when performed on the layman's language typical of UGC.
Additionally, contrary to cTAKES, none of the string-based baselines relies on external resources, which might offer an improvement in resolving some abbreviations or acronyms that our string-based systems miss and cTAKES disambiguates correctly (e.g. "ADHD" to SCTID: 406506008 - Attention deficit hyperactivity disorder). We leave further exploitation of such resources for future work.

Neural-based Baselines
For our neural setting, we define the problem as a cross-space mapping task by representing COMETA entities (along with their contexts) and SNOMED concepts using different text- and graph-based representation learning techniques, and then mapping the learned representations from the textual space to the SNOMED concept space.
Entity Embeddings. We experimented both with "traditional" and contextual embedding techniques. To generate the entity embeddings we use FastText (FT, Bojanowski et al. (2017)) and BioBERT (Lee et al., 2020), a PubMed-specialised version of BERT (Devlin et al., 2019). The former was trained and the latter was further specialised on the set of 800K Reddit discussions described earlier (§3.1). 10 In the case of multi-word terms, their embeddings were generated via averaging. 11 The dimensionality of the embeddings was 300 for FastText and 768 for BERT, and we denote them as FT-term and BERT-term, respectively. We acknowledge that there are alternatives to BioBERT such as SciBERT (Beltagy et al., 2019) and ClinicalBERT (Alsentzer et al., 2019). In our own experiments, we discovered that the further specialisation on Reddit discussions is more important than the choice of base model. That said, we leave explorations of other *BERT models on COMETA for future work.
Multi-Level Attention for BERT. As noted by Ethayarajh (2019) the deeper BERT goes, the more "contextualized" its representation becomes. However, interpreting semantics of entities requires contextual knowledge in different degrees and always taking the last layer's output may not be the best solution. In order to address this issue, we propose a Multi-Level Attention (denoted as BERT-term$_{\text{MLA}}$) module on top of BERT to further enhance the representation extracted from BERT by learning how much to attend to each layer for producing an entity representation. The attention weight of the $i$-th layer is computed as $a_i = [B_i \cdot A]_+$, where $[\cdot]_+ = \max(0, \cdot)$, $B_i \in \mathbb{R}^d$ denotes the representation from the $i$-th level of BERT, $d$ denotes the dimensionality (i.e., here $d = 768$), and $A \in \mathbb{R}^d$ denotes a trainable attention memory vector. We further normalise $a_i$ using a softmax layer, $w_i \stackrel{\text{def}}{=} \mathrm{softmax}(a_i)$. Finally, a weighted sum over all layers produces the attention-fused representation, i.e. BERT-term$_{\text{MLA}} = \sum_{i}^{L} w_i B_i$.
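The forward pass of the Multi-Level Attention module can be sketched in a few lines. The sketch assumes that the softmax is taken across the layer scores and that the memory vector is given rather than learned; the function name and the toy two-dimensional vectors are purely illustrative.

```python
import math

def multi_level_attention(layers, memory):
    """Fuse per-layer entity vectors B_i with an attention memory vector A.

    layers: list of L vectors, each of length d (one per BERT layer)
    memory: vector of length d (trainable in the model, fixed in this sketch)
    """
    # a_i = [B_i . A]_+  (dot product followed by ReLU)
    scores = [max(0.0, sum(b * a for b, a in zip(layer, memory))) for layer in layers]
    # w = softmax(a), computed across layers with the usual max-shift for stability
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    # sum_i w_i B_i
    dim = len(memory)
    fused = [sum(w * layer[j] for w, layer in zip(weights, layers)) for j in range(dim)]
    return weights, fused

# Two toy "layers" of dimension 2; a zero memory vector gives uniform weights.
weights, fused = multi_level_attention([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because of the ReLU, layers whose dot product with the memory vector is negative all receive the same minimal score, so the module can only promote layers, not actively suppress one below the others.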
Concept Embeddings. We experimented by embedding SNOMED concepts with two modalities: (i) their labels, to exploit textual information, and (ii) their corresponding nodes in the KG, to incorporate the graph structure. Label embeddings were produced by running FastText (denoted as FT-label) and BERT (denoted as BERT-label) on the label, both trained as described above; for concepts with multiple labels (e.g., SCTID: 61685007 - Lower extremity, Lower limb, Leg), the mean of the label representations is used. For node embeddings, we based our choice of model on the findings reported in Agarwal et al. (2019) and opted for their best reported model for SNOMED, i.e. node2vec (Grover and Leskovec, 2016) with the suggested parameters and vector size 300. 12
10 Note that BERT here is used as a feature extractor. We tried finetuning BERT jointly with the alignment model, but performance got worse due to overfitting. We leave properly finetuned BERT models on COMETA as future work.
11 We tried replacing the entity embeddings with sentence embeddings via RNN/transformers, however, the performance was much worse. We speculate this was due to polluting the informative signal of an entity with its surrounding words. We leave further exploration of this to future work.
Ensemble Embeddings. We also considered several embeddings that integrate multiple views of the data via (i) concatenation (denoted as ⊕) of the entity embeddings (e.g., FT-term ⊕ BERT-term MLA), and (ii) concatenation of label and node2vec embeddings for concepts (e.g., FT-label ⊕ BERT-label ⊕ node2vec).
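The two combination operations used for embeddings, averaging over multiple labels and concatenating different views, amount to the following. The helper names and the tiny vectors are assumptions for illustration; real embeddings would have 300 or 768 dimensions.

```python
def mean_vector(vectors):
    """Element-wise mean, e.g. over the embeddings of a concept's alternative labels."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concatenate(*views):
    """The ⊕ operation: join several views of the same item into one vector."""
    joined = []
    for view in views:
        joined.extend(view)
    return joined

# Toy example: a concept with two labels and a 1-dimensional node embedding.
label_embedding = mean_vector([[1.0, 2.0], [3.0, 4.0]])
concept_embedding = concatenate(label_embedding, [0.5])
```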
Alignment Model. We adopt a linear transformation followed by ReLU (Nair and Hinton, 2010) for aligning entity and concept embeddings, and we train the model with a max-margin triplet loss, where α (= 0.2) is a pre-set margin, s(·, ·) is the cosine similarity, P and T are the sets of all predictions and target embeddings in a mini-batch, and, given a prediction $p$ and its corresponding ground truth $t$, $\bar{t}$ denotes a negative target embedding.
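A max-margin triplet objective of this kind can be sketched as follows. The sketch assumes the common hinge form max(0, α − s(p, t) + s(p, t̄)) summed over all other targets in the mini-batch as negatives; the exact aggregation used in the experiments may differ, and all function names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity s(u, v); defined as 0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def align(entity, weight, bias):
    """Linear transformation followed by ReLU, mapping an entity into concept space."""
    return [max(0.0, sum(w * e for w, e in zip(row, entity)) + b)
            for row, b in zip(weight, bias)]

def triplet_loss(predictions, targets, margin=0.2):
    """Hinge loss over in-batch negatives (assumed form, see the lead-in)."""
    loss = 0.0
    for i, p in enumerate(predictions):
        positive = cosine(p, targets[i])
        for j, negative in enumerate(targets):
            if j != i:
                loss += max(0.0, margin - positive + cosine(p, negative))
    return loss

# Identity alignment on toy 2-d vectors: predictions already match their targets.
identity = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
predictions = [align(t, identity, [0.0, 0.0]) for t in targets]
loss = triplet_loss(predictions, targets)
```

With perfectly aligned predictions every positive similarity exceeds each negative one by more than the margin, so no triplet contributes to the loss.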
Results. The results of the neural baselines are presented in Table 4. All individual baselines (n.1 to n.4) fall behind the string-matching methods on Acc@1. This can be due to the fact that on average for each entity-concept pair there are fewer than 4 examples even in the stratified training set, making it difficult for the trained model to generalise well. This issue is more evident in the zero-shot setting.
The ensemble neural baselines compensate for the lack of training signal by leveraging multiple views of the data. As expected, combining both surface and node embeddings of the concepts (n.5) offers a slight improvement, but still fails to match the string-matching baselines. Finally, concatenation of the entity embeddings with our proposed BERT-term MLA representation, and of the label embeddings with BERT-label (n.6) outperforms all previous baselines on the stratified split, but still falls behind the string-based baselines on zero-shot. Compared to Acc@1, while the overall ranking of models remains the same, MRR and Acc@10 are more forgiving. The significant gap between Acc@1 and Acc@10 suggests that a re-ranking step (Liu, 2009) applied to top-10 candidates could further boost the performance. We leave further exploration of this idea to future work.

Back-off Baselines
To obtain the best possible performance, we experimented with a deterministic back-off procedure (denoted as +) that applies the Dictionary and backs off to a String-Matching model (§4.1) and finally to the best ensemble model (§4.2; model n.6 in Table 4) for handling the missed cases.
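The back-off procedure is a simple cascade: each stage is consulted only when the previous one returns no answer. In the sketch below the three linkers are toy stand-ins (a lookup table, an exact label match, and a stub in place of the neural model), and all names are assumptions for illustration.

```python
def back_off_link(term, linkers):
    """Try each (name, linker) in order and return the first non-None prediction."""
    for name, linker in linkers:
        prediction = linker(term)
        if prediction is not None:
            return name, prediction
    return None, None

# Toy stand-ins for the three stages (illustrative data only).
dictionary = {"cloudiness": "246636008"}
snomed_labels = {"left ankle": "51636004"}

def dictionary_linker(term):
    return dictionary.get(term.lower())

def string_linker(term):
    return snomed_labels.get(term.lower())   # exact match only in this sketch

def neural_linker(term):
    # Stub: a real model would return the nearest concept in embedding space.
    return "386891004"

linkers = [("dictionary", dictionary_linker),
           ("string", string_linker),
           ("neural", neural_linker)]

stage, sctid = back_off_link("Remicaid", linkers)   # falls through to the last stage
```

Ordering the stages from most to least precise means that the neural model only has to handle the cases the cheaper methods cannot resolve.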
Results. Table 5 reports the Back-off baseline results. The immediate gain on performance compared to each individual counterpart indicates that each model is equipped to tackle only a subset of the underlying challenges in the data. The back-off model combining dictionary, Stoilos distance, and the ensemble neural approach achieves our best performance across both splits (model b.8 in Table 5). As expected, the neural baselines contribute much less in the Zero-Shot split with a meagre 4% (3%) improvement, compared to the 8% (7%) increase on the Stratified split. Even if their overall contribution is limited, we were able to verify that our neural baselines are actually able to exploit the context as expected. For example, w.r.t. the issues typical of the UGC domain we identified in Section 1, we found neural methods helpful in resolving acronyms ("UTIs" to SCTID: 68566005 - Urinary Tract Infection), colloquial synonyms ("bloodwork" to SCTID: 396550006 - Blood Test), compositionality ("drenched in sweat" to SCTID: 415690000 - Sweating), complex inference (e.g., "Oral Cancer" to SCTID: 363505006 - Malignant tumour of oral cavity), or even spelling errors combined with alternative product names ("Remicaid" to SCTID: 386891004 - Infliximab, i.e. the active principle of Remicade). This last example is specifically interesting, since the label Remicade is not present in SNOMED but the pre-training of embeddings on medical texts (§4.2) allowed the neural baselines to pick up the correct node.

Discussion
The COMETA corpus introduces a challenging scenario for entity linking systems from both ML and NLP perspectives. In this section we summarise these challenges and our findings, and shed light on aspects that demand future attention: Low-Resource Regime and Learning. Compared to similar corpora, COMETA has the largest scale. However, from a learning perspective the lack of sufficient regularity in the data could still take its toll at test phase. This is a natural consequence of high productivity of layman's language in social media, while emerging and unforeseen topics such as pandemics (e.g., COVID-19) could also contribute to the problem. In fact, we observed the daunting task that systems face in the zero-shot setting, where in the absence of sufficient training signal, string-based methods offer a strong baseline which is hard to beat for neural counterparts.
While we artificially control this in the stratified split we still believe the zero-shot setting draws a more detailed picture of challenges an EL system needs to tackle in a real-world scenario. Further exploration of solutions such as transfer learning across domains (i.e., from medical literature to layman's domain) is beyond the focus of this work, nonetheless COMETA provides the framework for designing and testing such solutions.
Cross-Modality Alignment. While Agarwal et al. (2019) report superior performance of node2vec embeddings on several graph-based tasks on SNOMED, this success does not translate into EL as it relies on mapping across modalities (i.e., text-to-graph). Alternatively, when we replaced the node2vec with concept-label embeddings (produced by FT/BERT) the performance was significantly improved. This suggests that aligning different modalities may require a more complex alignment model or stronger training signals. We leave further exploration of this to future work.

Conclusion
We presented COMETA, a corpus unique for its scale and coverage, which is curated to maintain high quality annotations of medical terms in layman's language on Reddit with concepts from the SNOMED knowledge graph. Different evaluation scenarios were designed to compare the performance of conventional dictionary/string-matching techniques against the mainstream neural counterparts and revealed that these models complement each other very well and the best performance is achieved by combining these paradigms. Nonetheless, the missing performance of 28-46% (depending on the evaluation scenario) encourages future research in this area to take this corpus as a challenging yet reliable evaluation benchmark for further development of models specific to this domain. COMETA is available by contacting the last author via e-mail or following the instructions on https://www.siphs.org/. We release the pre-trained embeddings and the code to replicate our baselines online at https://github.com/cambridgeltl/cometa.

Acknowledgments
Funding: This work was supported by the UK EPSRC (EP/M005089/1). We kindly acknowledge Molecular Connections Pvt. Ltd 13 for their work on annotating our data.

A Appendices
A.1 Full List of Subreddits
Table 6 reports the list of 68 subreddits crawled for COMETA.

A.2 Example from COMETA
Table 7 provides examples from COMETA and illustrates the structure of each line in the corpus.

A.3 Example from Assessor Guidelines
Table 8 provides an example from the guideline sent to assessors.

A.4 Distribution of Concepts in Stratified and Zero-Shot Splits
Figure 3 provides the detailed distribution of SNOMED Concepts in Stratified and Zero-Shot splits.

A.5 Reproducibility
Table 9 and Table 10 describe the hardware and hyperparameters used for the experiments we describe.

A.6 Stoilos Distance
The commonality function comm(x, y) is defined as
$$\mathrm{comm}(x, y) = \frac{2 \cdot \sum_i |\text{max common substring}|}{(|x| + |y|)/2}$$
where the max common substring between $x$, $y$ is computed in an iterative manner: first, that of the original $x$, $y$ is computed; then the common sub-string is removed and the search is repeated for the next max common substring until a threshold of length 3 is met (common sub-strings with length < 3 are not considered). The difference function, diff(x, y), is based on the unmatched parts of $x$, $y$ from the last step. We denote them as $u_x$, $u_y$. Their lengths are normalised using a Hamacher product (Hamacher et al., 1978), a parametric triangular norm:
$$\mathrm{diff}(x, y) = \frac{|u_x| \cdot |u_y|}{p + (1 - p)(|u_x| + |u_y| - |u_x| \cdot |u_y|)}$$
We choose $p = 0.6$.

Table 7: The structure of the dataset; column names are denoted by bold text, and column types are denoted by monospaced text. The released dataset contains two additional columns, marking the label for the corresponding General and Specific SCTID respectively. However, since a label may appear in multiple nodes, we recommend to always use SCTIDs to retrieve the target nodes. Please note that the data in this table is used for illustration purposes only and it might not be contained in the released corpus.

| Quality | Evaluation | Term | Proposed Node | Explanation |
| 5: Excellent | The SNOMED node matches exactly the term or is a synonym of the term. | Chronic back pain | Chronic back pain, 134407002 | Exact match. |
| 4: Good | The SNOMED node is conceptually similar and taxonomically close (1-2 edges) to the target term, e.g. is a close ancestor/descendant or a sibling. | Chronic back pain | Back pain, 161891005 | 'Back pain' is the direct ancestor of 'Chronic back pain'. |
| 3: Fair | The SNOMED node is conceptually related and reasonably close (1 to 3 edges) to the target term, both taxonomically or via attributes (finding site, etc.) | Chronic back pain | Back, 77568009 | 'back' is the 'finding site' of 'Chronic back pain'. |
| 2: Poor | The SNOMED node is conceptually distant from the term, and there is a reasonably long (3-4 edges) path from it to the correct node | Chronic back pain | Torso, 22943007 | 'Chronic Back Pain' is located in the 'Torso', so they are somewhat related, and the two nodes are not far (distance 3) |
| 1: Very Poor | The SNOMED node is completely unrelated with the term, and the path between the correct node and the target one is very long (> 5). | Chronic back pain | Syringe, 61968008 | 'Chronic Back Pain' and 'Syringe' have high distance (5), and the concepts are completely unrelated. |

Figure 3: The categories in the dataset by split. The outer pie is the training set, the middle pie is the test set, the inner pie is the development set.