First Steps towards Building a Medical Lexicon for Spanish with Linguistic and Semantic Information

We report the work-in-progress of collecting MedLexSp, a unified medical lexicon for the Spanish language, featuring terms and inflected word forms mapped to Unified Medical Language System (UMLS) Concept Unique Identifiers (CUIs), semantic types and groups. First, we leveraged a list of term lemmas and forms from a previous project, and mapped them to UMLS terms and CUIs. To enrich the lexicon, we used both domain corpora (e.g. Summaries of Product Characteristics and MedlinePlus) and natural language processing techniques such as string distance methods or generation of syntactic variants of multi-word terms. We also added term variants by mapping their CUIs to missing items available in the Spanish versions of standard thesauri (e.g. Medical Subject Headings and World Health Organization Adverse Drug Reactions terminology). We enhanced the vocabulary coverage by gathering missing terms from resources such as the Anatomical Therapeutic Chemical classification, the National Cancer Institute (NCI) Dictionary of Cancer Terms, OrphaData, or the Nomenclátor de Prescripción for drug names. Part-of-Speech information is being included in the lexicon, and the current version amounts to 76 454 lemmas and 203 043 inflected forms (including conjugated verbs, number and gender variants), corresponding to 30 647 UMLS CUIs. MedLexSp is distributed freely for research purposes.


Introduction
Current machine-learning and deep-learning-based methods are data-intensive; however, in domains such as Medicine, sufficient data are not always available, due to ethical concerns or privacy issues, especially when dealing with Patient Protected Information. Moreover, some tasks demand high-precision outcomes, which need either supervised approaches with annotated data or hybrid methods (e.g. rule-based and dictionary-based). In order to overcome the data bottleneck, richly structured terminological thesauri enhance the annotation and concept normalization of domain corpora to be used subsequently in supervised models. More importantly, to achieve comparable benchmarks, domain resources should integrate standard terminologies and coding schemes.
In this context, we aim at providing a computational lexicon to be used in the pre-processing of text data used in more complex Natural Language Processing (NLP) tasks. The work presented here reports the first steps towards building the Medical Lexicon for Spanish (MedLexSp). MedLexSp is conceived as a unified resource with linguistic information (lemmas, inflected forms and part-of-speech), concepts mapped to Unified Medical Language System® (hereafter, UMLS) (Bodenreider, 2004) Concept Unique Identifiers (CUIs), and semantic information (UMLS types and groups). Figure 1 is a sample of the lexicon. MedLexSp is primarily aimed at named entity recognition (NER), and it can be used in the pre-annotation step of an NER pipeline. It can also help lemmatization and feed general-purpose Part-of-Speech taggers applied to medical texts, as done in previous works (Oronoz et al., 2013). Because it gathers semantic data of terms, it can ease relation extraction tasks.
Our work makes several contributions. We provide a resource to be distributed for research purposes in the BioNLP community. MedLexSp includes inflected forms (singular/plural, masculine/feminine) and conjugated verb forms of term lemmas, which are mapped to UMLS Concept Unique Identifiers. Verb terms are also mapped to Concept Unique Identifiers; this is in line with current work on expanding terminologies.
Figure 1: Sample of the MedLexSp lexicon. In each entry, field 1 is the UMLS CUI of the entity; field 2, the lemma; field 3, the variant forms; field 4, the Part-of-Speech; field 5, the semantic type(s); and field 6, the semantic group.
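The six-field entry layout described in the caption of Figure 1 can be read into a small record type. The sketch below is only illustrative: the field separator (`|`), the list separator (`;`) and the sample line are assumptions made for this example, not the actual distribution format of MedLexSp.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LexiconEntry:
    cui: str                   # field 1: UMLS CUI
    lemma: str                 # field 2: lemma
    forms: List[str]           # field 3: variant forms
    pos: str                   # field 4: Part-of-Speech
    semantic_types: List[str]  # field 5: semantic type(s)
    semantic_group: str        # field 6: semantic group


def parse_entry(line: str, field_sep: str = "|", list_sep: str = ";") -> LexiconEntry:
    """Parse one entry; both separators are hypothetical."""
    fields = line.rstrip("\n").split(field_sep)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    cui, lemma, forms, pos, types, group = fields
    return LexiconEntry(
        cui=cui,
        lemma=lemma,
        forms=forms.split(list_sep),
        pos=pos,
        semantic_types=types.split(list_sep),
        semantic_group=group,
    )


# Made-up sample line, for illustration only.
entry = parse_entry("C0010200|tos|tos;toses|N|Sign or Symptom|DISO")
```

Keeping the variant forms and semantic types as lists makes it straightforward to index every surface form back to its CUI.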
Section 2 gives an overview of medical thesauri, and Section 3 describes the methods used to gather terms (both corpora and NLP techniques), map them to UMLS CUIs, and enrich the lexicon. Section 4 reports descriptive statistics of the current version, and Section 5, the results of an evaluation conducted during development. We discuss some limitations and conclude in Section 6.

Health thesauri and taxonomies
Medical thesauri and controlled vocabularies aggregate listings of domain terms, and also gather information about the type of term (e.g. synonym or preferred term), a semantic descriptor (e.g. DRUG or FINDING), a unique concept identifier, and very often a term definition or hierarchical relations between concepts (e.g. IS A). Thesauri are essential for indexing and populating databases, domain-specific information retrieval, and standardized codification (Cimino, 1996).
Medical thesauri vary according to the application (we only give examples related to our work). The Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) (Donnelly, 2006) aims at encoding verbatim mentions in clinical texts, and gathers ontological relations between concepts. To report drug reactions in pharmacovigilance, the World Health Organization created the Adverse Reactions Terminology (WHO ART), although the Medical Dictionary for Regulatory Activities (MedDRA) (Brown et al., 1999) is now preferred. The Medical Subject Headings (MeSH) are developed by the National Library of Medicine for indexing biomedical articles. Lastly, the World Organization of Family Doctors produced the International Classification of Primary Care (ICPC) to classify data aimed at family and primary care physicians (WONCA, 1998).
Medical taxonomies or classifications gather essential domain knowledge. Some examples are the International Classification of Diseases, version 10 (ICD-10) (WHO, 2004), or the Anatomical Therapeutic Chemical (ATC) classification of pharmacological substances (WHO, 2019).

Medical Lexicons
Medical lexicons provide a structured representation of terms and their linguistic information (lemmas, inflection, or surface variants); hence, they are essential for NLP tasks. Unlike medical thesauri or classifications, they do not register term hierarchies, classifications or ontological relations, but they can encode semantic information and, occasionally, argument structure and corpus-based frequency data (Thompson et al., 2011).
Initiatives to collect medical lexicons have been conducted for English (McCray et al., 1994; Johnson, 1999; Davis et al., 2012), German (Weske-Heck et al., 2002), French (Zweigenbaum et al., 2005) or Swedish, even in multilingual initiatives (Markó et al., 2006). For Spanish, some efforts were sparked when a team at the National Library of Medicine (Divita et al., 2007) started to build an equivalent of the MetaMap tool (Aronson, 2001). Other teams conducted experiments to automate the creation of a Spanish MetaMap by applying machine translation and domain ontologies (Carrero et al., 2008). These initiatives, to the best of our knowledge, did not result in a Spanish lexicon available for medical NLP.
Besides medical lexicons, domain-specific vocabularies were collected for Biology (Thompson et al., 2011). With a different perspective and goal, Consumer Health Vocabularies have been collected to bridge the gap between patients' expressions and healthcare professionals' jargon (Zeng and Tse, 2006; Keselman et al., 2007).

The Unified Medical Language System
The Unified Medical Language System® (UMLS) (Bodenreider, 2004)

Methods for Creating Medical Lexicons
We restrict ourselves here to a shallow overview of approaches and will not consider taxonomy or ontology building. Methods for widening medical vocabularies include generating syntactic-level variants of multi-word terms (Jacquemin, 1999), inferring derivation rules from string similarity matches and morphological relations between derivational variants (Grabar and Zweigenbaum, 2000), gathering inflected variants semi-automatically (Cartoni and Zweigenbaum, 2010), or deriving terms from corpora (more below).
Graeco-Latin components are very productive for coining medical terms; thus, several BioNLP systems integrate morphology-based lexical resources. Examples include decomposing terms morphosemantically and deriving their definitions (Namer and Zweigenbaum, 2004), or mapping queries to concepts and indexing documents in cross-lingual information retrieval, based on a subword-based morpheme thesaurus (Markó et al., 2005). In this line, generating paraphrase equivalents of neoclassical compounds (e.g. thyromegalia → enlarged thyroid) is an approach with potential for deriving new terms, and concept normalization systems (Thompson and Ananiadou, 2018) already implement it. Because string similarity measures and edit distance patterns are used for normalization (e.g. Tsuruoka et al., 2007; Kate, 2015) and terminology mapping (Dziadek et al., 2017), these approaches are also powerful for expanding medical lexicons from a set of reference terms. Decomposition of multi-word terms and synonym expansion of their components are also alternative strategies applied in normalization systems (Tseytlin et al., 2016).
Corpus-derived medical terminology construction requires collecting domain texts and applying term extraction methods, among others: computing graphs of relations between parse trees and word dependency similarities (Nazarenko et al., 2001), using parallel corpora to map cognates or aligned words (Sbrissia et al., 2004; Deléger et al., 2009), linking terms or abbreviations to their definitions or expanded word forms in the text where they occur (Yu and Agichtein, 2003; McCrae and Collier, 2008), using dictionary features to identify polysemy (Pezik et al., 2008), combining text mining techniques with databases (Thompson et al., 2011), or having experts review terms, a method which has been used to build disease-specific vocabularies (Wang et al., 2016).
Lastly, to develop Consumer Health Vocabularies (CHV), a variety of techniques have been used: analysis by experts of Medline queries (Zeng and Tse, 2006), term recognition methods and collaborative review of user logs in medical sites (Zeng et al., 2007), hybrid methods combining n-gram extraction, the C-value, and dictionary look-up (Doing-Harris and Zeng-Treitler, 2011), co-occurrence analysis of terms and seed words (Jiang and Yang, 2013), or approaches based on similarity measures between CHV lexicons and reference lexicons (Seedorff et al., 2013).

Figure 2 depicts the methods used to collect the MedLexSp lexicon. In a first step (left part of Figure 2), we leveraged the lemmas and word forms obtained from a Spanish medical lexicon, mostly corpus-derived; we will refer to it as the base list. We only used the subset of lemmas and forms that could be mapped automatically to UMLS CUIs (exact string match). In a second step, we added missing variants of terms using different methods:
• Testing string distance metrics to match terms in the base list to variants that remained unmatched: e.g. eccema ↔ eczema ('eczema', C0013595).
• Adding acronyms and abbreviations of the terms included in the base list: e.g. aneurisma abdominal aórtico ('abdominal aortic aneurysm', C0162871) → AAA.
• Extending the base list by mapping the CUIs of the terms in the subset to gather missing variants of synonymous terms: e.g. eccema ('eczema', C0013595) ↔ dermatitis eccematosa ('eczematous dermatitis', C0013595). We considered several sources from the UMLS (e.g. Spanish Medical Subject Headings (MeSH), SNOMED CT or the WHO ART terminology) and external sources such as the Anatomical Therapeutic Chemical classification, the National Cancer Institute (NCI) Dictionary of Cancer Terms, the Nomenclátor de prescripción (AEMPS, 2019), OrphaData (INSERM, 2019), or the Spanish Drug Effect database (SDEdb) (Segura-Bedmar et al., 2015).
• Including subsets of missing terms from thesauri if attested in domain texts. And vice versa, extracting corpus-derived terms from domain texts: synonymous terms from MedlinePlus, and terms from Summaries of Product Characteristics (Segura-Bedmar and Martínez, 2017).
The next subsections explain each method.

Leveraging an Inflected Lexicon
We started from a list of medical terms collected in a previous project on Spanish medical terminology; we will refer to it as the base list. We collected this resource by combining different methods (Moreno Sandoval and Campillos Llanos, 2015) applied to a corpus of 4204 Spanish medical texts (around 4 million tokens) (Moreno-Sandoval and Campillos-Llanos, 2013). To extract candidate medical terms for the base list, we combined rule-based techniques (Part-of-Speech tagging and filtering through medical affixes), corpus-based methods (comparing word forms from a general corpus and from the domain corpus), and statistical methods, namely the Log-Likelihood ratio (Dunning, 1993). Before including them in the list, we checked the terms selected by means of those three methods in medical sources, e.g. the dictionary published by the Spanish Royal Academy of Medicine (RANME, 2011). This base list was used to build an automatic term extractor (Campillos Llanos et al., 2013), and amounted to 38 354 entries.
Because one of the goals of MedLexSp is concept normalization by using standard domain terminologies, we did not include the full base list. We only used terms that could be assigned UMLS Concept Unique Identifiers (CUIs) in the UMLS Metathesaurus version 2018AB, namely from those terminologies of special biomedical or clinical interest (e.g. SNOMED CT, WHO ART or Medical Subject Headings) with available Spanish translations. We mapped 18 263 lemmas to CUIs, which means 47.61% of the entries of the original lexicon. CUIs were assigned according to an exact match criterion. For example, donación ('donation') is not matched with donación de tejido ('Tissue Donation', C0080231), because the latter refers to a donation subtype. Note that the current version of MedLexSp does not include the full list of terms from MeSH or SNOMED CT, but only those which were originally mapped from the base list to UMLS terms with CUIs.
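The exact-match criterion can be sketched as a plain dictionary lookup. This is a minimal illustration under assumptions: the function names, the lowercasing and whitespace normalization, and the toy index are ours, and the real UMLS index would be built from the Metathesaurus files rather than typed by hand.

```python
from typing import Dict, Iterable, Set


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(term.lower().split())


def map_exact(lemmas: Iterable[str], umls_index: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Keep only lemmas whose whole normalized string is a UMLS term."""
    mapped = {}
    for lemma in lemmas:
        cuis = umls_index.get(normalize(lemma))
        if cuis:
            mapped[lemma] = cuis
    return mapped


# Toy index holding only the examples quoted in the text.
umls_index = {
    "donación de tejido": {"C0080231"},
    "eczema": {"C0013595"},
}
mapped = map_exact(["donación", "eczema", "Donación de tejido"], umls_index)
```

Because the lookup uses the whole string as key, donación receives no CUI even though it is a substring of donación de tejido, which mirrors the behaviour described above.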

Enriching the Lexicon
String distance metrics
We tested mapping terms from the subset of entities with CUIs to terms in the UMLS by applying an edit distance (Levenshtein, 1966) of less than 2. This allowed us to map hyphenated variants to terms without a hyphen (e.g. creatina-cinasa ↔ creatina cinasa, 'creatine kinase', C0010287), compound terms that are often written as single words (dietil éter ↔ dietiléter, 'diethyl ether', C0014994), or terms with minimal morphological variation (eccema ↔ eczema, 'eczema', C0013595). A total of 1463 terms with CUIs were matched to the original base list.
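To make the distance threshold concrete, the following sketch implements the standard Levenshtein distance with dynamic programming and retrieves candidates at distance 1 (i.e. less than 2). The helper `near_matches` and its parameters are hypothetical names for this example; the original implementation is not described in the text.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion and substitution."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def near_matches(term: str, candidates: Iterable[str], max_distance: int = 1) -> List[str]:
    """Candidates that differ from `term` by 1..max_distance edits."""
    return [c for c in candidates if 0 < levenshtein(term, c) <= max_distance]


umls_terms = ["eczema", "creatina cinasa", "dietiléter", "edema"]
matches = near_matches("eccema", umls_terms)
```

A threshold this tight keeps the candidate list short, but short words can still collide, so a manual check of the matches remains advisable.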

Derivational variants
In line with previous work (Grabar and Zweigenbaum, 2000), we collected a list of equivalent derivational variants of terms. Using this list, we assigned a CUI to the corresponding derivational variant: e.g. the CUI of páncreas (C0030274) was also ascribed to pancreático ('pancreatic'). The current version gathers a total of 801 derivational variants with CUIs.

Conjugated verbs
Most terms in the UMLS or standard terminologies are noun or adjective phrases. This limits the named entity recognition of medical concepts expressed with verbs in free text; given a context such as el paciente tose ('the patient coughs'), the concept of 'coughing' would not be identified. To widen the scope of concept normalization, verb terms were mapped to CUIs from derived nouns: e.g. tos ('coughing', C0010200) → toser ('to cough', C0010200). We again used a list of correspondences between verbs and deverbal nouns. We included the conjugated forms of verb lemmas in each verb entry of the lexicon. We used a Python script that relies on the lexicon of a Spanish Part-of-Speech tagger (Moreno Sandoval and Guirao, 2006) to generate all conjugated forms of verb terms: e.g. toser ('to cough') → tose ('he/she coughs'), tosiendo ('coughing'), etc. The current version includes a total of 295 single- or multi-word verb items.
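The propagation of CUIs from deverbal nouns to verbs can be sketched as two table lookups. All three tables below are toy stand-ins: in particular, the conjugated forms are assumed to come from an external resource (such as the tagger lexicon mentioned above) and are not generated by this code.

```python
from typing import Dict, List

# Toy data; real tables would be loaded from the resources described above.
noun_cuis: Dict[str, str] = {"tos": "C0010200"}
verb_to_noun: Dict[str, str] = {"toser": "tos"}
conjugated_forms: Dict[str, List[str]] = {"toser": ["tose", "tosiendo"]}


def build_verb_entries(
    verb_to_noun: Dict[str, str],
    noun_cuis: Dict[str, str],
    conjugated_forms: Dict[str, List[str]],
) -> Dict[str, dict]:
    """Give each verb the CUI of its deverbal noun and attach its forms."""
    entries = {}
    for verb, noun in verb_to_noun.items():
        cui = noun_cuis.get(noun)
        if cui is None:
            continue  # the noun has no CUI, so the verb stays unmapped
        entries[verb] = {
            "cui": cui,
            "forms": [verb] + conjugated_forms.get(verb, []),
            "pos": "V",
        }
    return entries


verb_entries = build_verb_entries(verb_to_noun, noun_cuis, conjugated_forms)
```

With such an entry in the lexicon, a surface form like tose can be normalized to the same concept as the noun tos.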

Affixes and lexical roots
In a first step, we collected affixes and roots from several sources. Firstly, we leveraged a list used in a previous experiment (Sandoval et al., 2013). This list amounts to 1719 forms and considers morphological variants of affixes (e.g. the prefix cardio- may have accented variant forms in Spanish, such as cardió-). Secondly, we translated into Spanish several affixes and roots from the Specialist Lexicon® (McCray et al., 1994) and then added variant forms. In a second step, we assigned UMLS CUIs to affixes and roots in the list. The current list gathers a total of 161 entries (82 prefixes and 79 suffixes) with 134 different CUIs and 386 variant forms. Note that many affixes and roots were not included because they are too underspecified to be assigned to a CUI, or are not restricted to the medical domain (e.g. kilo- expresses a quantitative concept).

Abbreviations and acronyms
Firstly, we gathered a list of equivalences between full forms and abbreviations and acronyms; we used three sources: 1) the collection of Spanish abbreviations and acronyms used in hospitals, collected by medical doctors (Yetano and Alberola, 2003); 2) abbreviations and acronyms used in the 2nd IberEval Challenge 2018 on Biomedical Abbreviation Recognition and Resolution (Intxaurrondo et al., 2018); and 3) Spanish abbreviations and acronyms from Wikipedia. Secondly, we matched the resulting list of equivalent terms (acronyms and full forms) to UMLS terms, adding the corresponding CUIs to those missing acronyms. For example, the full term virus de Epstein-Barr ('Epstein-Barr virus') has CUI C0014644, and we also assigned this code to the corresponding acronym in Spanish (VEB). With this method, we assigned CUIs to 1225 items.

Syntactic variants of terms
To widen the coverage of terms mapped to CUIs, we generated variants of multi-word entities by swapping the word order of their components. Then, we tried to match each new variant to entities with CUIs. For example, aneurisma aórtico abdominal ('aortic abdominal aneurysm') has CUI C0162871, and we assigned the same CUI to the generated variant aneurisma abdominal aórtico ('abdominal aortic aneurysm'). With this method, we gathered a total of 154 variants of terms with CUIs in the base list.
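A possible way to generate such word-order variants is to permute the modifiers that follow the head word. The sketch below assumes that the first token is the head and stays in place, which is a simplification chosen for this example; the exact reordering rules used for MedLexSp are not specified in the text.

```python
from itertools import permutations
from typing import Dict, Set


def word_order_variants(term: str) -> Set[str]:
    """Reorder the modifiers after the first token (assumed to be the head)."""
    tokens = term.split()
    if len(tokens) < 3:
        return set()  # nothing to swap with fewer than two modifiers
    head, modifiers = tokens[0], tokens[1:]
    variants = set()
    for ordering in permutations(modifiers):
        candidate = " ".join([head, *ordering])
        if candidate != term:
            variants.add(candidate)
    return variants


def propagate_cuis(terms_with_cui: Dict[str, str], unmapped: Set[str]) -> Dict[str, str]:
    """Assign a CUI to unmapped base-list terms that equal a generated variant."""
    assigned = {}
    for term, cui in terms_with_cui.items():
        for variant in word_order_variants(term):
            if variant in unmapped:
                assigned[variant] = cui
    return assigned


assigned = propagate_cuis(
    {"aneurisma aórtico abdominal": "C0162871"},
    {"aneurisma abdominal aórtico", "tos"},
)
```

Only variants that already occur in the base list receive a CUI, so unattested orderings produced by the permutation are never added to the lexicon.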

Mapping UMLS term variants through CUIs
We gathered synonymous variants referring to each corresponding concept by using the UMLS CUIs from the terms included in the base list. To avoid including noisy terms and keep only those adequate for biomedical natural language processing, we first cleaned the terms from the terminologies we used. To do so, we applied methods for cleaning term strings (Aronson et al., 2008; Hettne et al., 2010; Névéol et al., 2012; Hellrich et al., 2015). We deleted paraphrastic terms that include a description or specification of the entity type in the term string. These terms commonly come from Spanish SNOMED CT. For example, we deleted tos (hallazgo), 'cough (finding)' (CUI C0010200) and kept the term tos ('cough'). Likewise, we removed most anatomic terms beginning with estructura de ('structure of'): e.g. for the term estructura del ojo ('structure of eyeball', C0015392), we only kept the synonym ojo ('eyeball'). Lastly, terms in the WHO ART terminology needed to be accented and reversed regarding word order: e.g. disociativa, reaccion → reacción disociativa ('dissociative reaction', C0012746).
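The three cleaning operations mentioned above (removing type descriptors, removing the estructura de prefix, and reordering inverted WHO ART strings) can be approximated with a few regular expressions. The tag list, the patterns and the accent table below are illustrative assumptions, far smaller than what a real cleaning step would need.

```python
import re

# Illustrative subset of SNOMED CT type descriptors in Spanish.
SNOMED_TAGS = ("hallazgo", "trastorno", "procedimiento")
TAG_RE = re.compile(r"\s*\((?:%s)\)\s*$" % "|".join(SNOMED_TAGS))
STRUCTURE_RE = re.compile(r"^estructura de(?:l| la| los| las)? ", re.IGNORECASE)

# Hypothetical lookup for restoring accents; a real one would be much larger.
ACCENTS = {"reaccion": "reacción"}


def strip_semantic_tag(term: str) -> str:
    """'tos (hallazgo)' -> 'tos'."""
    return TAG_RE.sub("", term)


def strip_structure_prefix(term: str) -> str:
    """'estructura del ojo' -> 'ojo'."""
    return STRUCTURE_RE.sub("", term)


def reorder_inverted(term: str) -> str:
    """'modifier, head' -> 'head modifier'; other strings are left untouched."""
    if term.count(",") != 1:
        return term
    modifier, head = (part.strip() for part in term.split(","))
    return f"{head} {modifier}"


def restore_accents(term: str) -> str:
    """Replace unaccented tokens that appear in the lookup table."""
    return " ".join(ACCENTS.get(token, token) for token in term.split())


cleaned = restore_accents(reorder_inverted("disociativa, reaccion"))
```

Restricting the tag pattern to a closed list avoids stripping parentheses that are a legitimate part of a term.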
We also applied an exact-match mapping of Spanish terms from the base list to the English component of the UMLS. This method allowed us to obtain the CUIs of terms unavailable in Spanish terminologies, which remain unchanged in the Spanish language, namely Latin scientific names (e.g. Campylobacter fetus, C0006814), compound terms with Graeco-Latin roots (e.g. abdominalgia, C0000737), English acronyms that are broadly used in medical discourse without a Spanish translation (e.g. GABA, 'gamma-aminobutyric acid', C0016904), or international brand drug names (e.g. abilify®). In these cases, the same word is used in both English and Spanish.
We manually revised the list of mapped terms to discard homonymous terms with a different meaning in English (e.g. TIP® is a brand name of a medical drug, but it also means 'point' or 'suggestion' in English).
We extended the list of terms by extracting the information related to rare diseases from OrphaData (INSERM, 2019). We also added terms of pharmacological substances and international non-proprietary names from the Spanish Drug Effect database (SDEdb) (Segura-Bedmar et al., 2015) and the Nomenclátor de prescripción (AEMPS, 2019), a resource published and updated regularly by the Spanish Agency of Drugs and Food Products.
For all these procedures and sources, we applied semi-automatic methods to generate the singular and plural inflected forms of the missing terms that were mapped through CUIs. We used the Pattern Python library (Smedt and Daelemans, 2012) to create plural forms of terms, which were revised manually before being included in MedLexSp.

Corpus-derived terms
When we started adding variant terms from thesauri, the question of where to stop adding terms came up. In the first version, we decided not to include all terms available in MeSH or SNOMED CT terminologies, given that these thesauri contain terms that are often not necessary in clinical or biomedical NER tasks (e.g. names of trees, wild animals, professions or abstract concepts). On the other hand, to make the resource comprehensive, we needed to complement the base list with supplementary terms from thesauri. Hence, in order to decide which items to include in a first version, we computed term frequencies using a medical corpus from a previous project (4 million tokens) (Moreno-Sandoval and Campillos-Llanos, 2013). We currently include terms from the Spanish MeSH and SNOMED CT that were missing in the base list, if they were documented in that corpus. By limiting the inclusion to such a subset of terms, we aim at providing quality enriched data (i.e. with revised inflected forms) in a reasonable time and manner.
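The attestation criterion can be illustrated with a simple frequency filter. The function names, the `min_freq` parameter and the toy corpus are assumptions for this sketch; matching is done on lowercased whole words and ignores inflection, which the actual procedure may have handled differently.

```python
import re
from typing import Dict, Iterable


def count_term(term: str, text: str) -> int:
    """Count whole-word, case-insensitive occurrences of `term` in `text`."""
    pattern = r"(?<!\w)" + re.escape(term.lower()) + r"(?!\w)"
    return len(re.findall(pattern, text.lower()))


def attested_terms(candidates: Iterable[str], corpus: Iterable[str], min_freq: int = 1) -> Dict[str, int]:
    """Keep thesaurus candidates that occur at least `min_freq` times in the corpus."""
    texts = list(corpus)
    frequencies = {}
    for term in candidates:
        total = sum(count_term(term, text) for text in texts)
        if total >= min_freq:
            frequencies[term] = total
    return frequencies


# Toy corpus and candidate list (roble, 'oak', stands for an unattested tree name).
corpus = ["El paciente presenta tos y fiebre.", "Fiebre persistente sin tos."]
attested = attested_terms(["fiebre", "tos", "roble"], corpus)
```

Raising `min_freq` would trade coverage for a smaller set of terms whose inflected forms need manual revision.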
In a different vein, and similarly to former work (Calleja et al., 2017), we extracted terms from Summaries of Product Characteristics (SPCs). We used Easy Drug Package Leaflets (EasyDPL), a corpus of 306 texts annotated with medical drugs and pathological entities (1400 drug effects) (Segura-Bedmar and Martínez, 2017). We annotated these texts and compared our output annotation against this dataset. We used a purely dictionary-based named-entity recogniser with modules for normalization (e.g. lowercasing), tokenization and lemmatization, implemented in spaCy; then, the MedLexSp lexicon was used for exact string matching. We did not use pre- or post-processing rules in the current version (e.g. rules of term composition).
In several iterative rounds, we annotated the texts, identified the unannotated entities, and added them to the lexicon. We did not add (although annotated in the corpus) entities without a CUI, e.g. coordinated entities (e.g. pies y manos frías, 'cold hands and feet') or too specific, post-modified terms (e.g. dolor de cabeza intenso, 'intense headache'; only 'headache' has CUI C0018681). By using SPCs, we added 837 term entries to MedLexSp, and we ensure that it includes common terms referring to adverse drug reactions and medical drugs.
Lastly, for Consumer Health Vocabulary terms, we extracted synonyms in MedlinePlus Spanish. This resource provides terms in patient language that were missing: e.g. ojo vago ('lazy eye') is a synonym of ambliopía ('amblyopia', C0002418). We added 783 term entries from this resource. In addition, we collected 6110 cancer-related terms from the Spanish version of the National Cancer Institute Dictionary.
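A dictionary-based annotator of this kind can be sketched without any external library as a greedy longest-match lookup over lowercased tokens. This is not the spaCy pipeline used by the authors: the tokenizer, the `max_len` window and the toy lexicon are assumptions, and lemmatization is left out.

```python
import re
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[Tuple[str, int, int]]:
    """Lowercased word tokens with their character offsets."""
    return [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def annotate(text: str, lexicon: Dict[str, str], max_len: int = 5) -> List[Tuple[int, int, str]]:
    """Greedy longest-match lookup; returns (start, end, CUI) spans."""
    tokens = tokenize(text)
    spans = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = " ".join(tok for tok, _, _ in tokens[i:i + n])
            if key in lexicon:
                spans.append((tokens[i][1], tokens[i + n - 1][2], lexicon[key]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


# Toy lexicon with two terms quoted in the text.
lexicon = {"dolor de cabeza": "C0018681", "tos": "C0010200"}
text = "El paciente refiere dolor de cabeza intenso y tos."
spans = annotate(text, lexicon)
surface = [text[start:end] for start, end, _ in spans]
```

In the toy sentence, dolor de cabeza intenso is annotated only up to dolor de cabeza, which reproduces the behaviour with post-modified terms described above.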

Semantic and linguistic information
We added to each CUI and lexical entry the corresponding semantic type(s) and group from the UMLS. To avoid noise when annotating biomedical texts semantically, we disfavoured semantic types of the semantic group Concepts and Ideas (CONC, e.g. Quantitative Concept, Functional Concept or Qualitative Concept), which are rather unspecified. We only included terms from that group if no other semantic label was available. If a concept or term can be assigned to two different groups, the element labelled with CONC is not included in our lexicon. For example, the term inhalación ('inhalation') can be related to concept C0004048 (semantic type Organism Function, and group PHYS) and also to concept C4521689 (semantic type Intellectual Product, and group CONC). In this case, we only preserve the lexical entry of concept C0004048 and we rule out the entry of concept C4521689.
We have also started adding the Part-of-Speech (PoS) category of each entry in the lexicon. For multi-word terms, the category of the head term is selected; e.g. enfermedad de Crohn ('Crohn's disease') is categorized as N ('noun'). We are currently testing different techniques to predict the PoS and automate the assignment of categories to each entry, but the results are still not fully satisfactory.
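The rule for the CONC group can be expressed as a small filter: entries of a term labelled CONC are dropped whenever the same term also has an entry in another group. The dictionary keys and the function name are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List


def drop_conc_when_ambiguous(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep CONC entries only for terms without any other semantic group."""
    by_term = defaultdict(list)
    for entry in entries:
        by_term[entry["term"]].append(entry)
    kept = []
    for candidates in by_term.values():
        specific = [e for e in candidates if e["group"] != "CONC"]
        kept.extend(specific if specific else candidates)
    return kept


entries = [
    {"term": "inhalación", "cui": "C0004048", "type": "Organism Function", "group": "PHYS"},
    {"term": "inhalación", "cui": "C4521689", "type": "Intellectual Product", "group": "CONC"},
]
filtered = drop_conc_when_ambiguous(entries)
```

A term whose only label is CONC passes through unchanged, in keeping with the statement that such terms are included if no other semantic label is available.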

Statistics
Table 1 shows the count of entries in the lexicon according to each source or procedure applied to map terms to UMLS CUIs. Note that the full count exceeds the count of term entries in the current version of MedLexSp, given that some terms were gathered through different methods simultaneously. Table 2 shows the descriptive statistics of the lexicon: counts of lemmas and word forms, and total number of CUIs. Lastly, Table 3 shows a preliminary count of PoS categories in the current version of the lexicon. Note that most entries are nouns or need revision (UNKN stands for 'unknown'); this task is currently being undertaken.
Finally, Figure 3 depicts the distribution of semantic groups. Of note, some groups are underrepresented, due to the corpora and thesauri used to collect terms. For example, few entities belong to the GENE group, which implies that the coverage of the current version of MedLexSp is not adequate for tasks in the Genomics domain.
The number of lemmas/word forms is lower than in other UMLS-based resources because: 1) we did not include the full thesauri, but only terms from the original base list that were mapped to [...] As explained, descriptors and qualifiers were removed: e.g. SNOMED CT term fiebre (hallazgo) ('fever (finding)', C0015967) was shortened to fiebre. We also ruled out some concepts belonging to semantic groups that can be noisy for clinical or medical NER tasks, such as CONC or GEOG; e.g. hierro is related to concept C0302583, 'iron', CHEM; or to concept C0454671, 'Island of Hierro', GEOG (the latter concept was discarded).

Development Evaluation
We analysed the coverage of the lexicon with regard to UMLS semantic groups. We applied the dictionary-based NER tool described above to a gold standard available in the community. We focused on analysing the annotation of a few UMLS groups (DISO, CHEM, PROC and ANAT) and assessed how well the lexicon annotated them with regard to the gold standard. We quantified the matched annotations in terms of precision, recall and F1-measure by using the BRAT-Eval script (Verspoor et al., 2013).
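For reference, the exact-match variant of these measures can be computed over sets of annotated spans as in the sketch below. This is not the BRAT-Eval script: the span representation (start, end, group) and the toy annotations are assumptions, and approximate matching is not covered.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, semantic group)


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match scores: a prediction counts only if offsets and group agree."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations: three of the four predictions match the gold standard.
gold = {(20, 35, "DISO"), (46, 49, "DISO"), (60, 70, "CHEM"), (80, 85, "ANAT")}
predicted = {(20, 35, "DISO"), (46, 49, "DISO"), (60, 70, "CHEM"), (90, 95, "PROC")}
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Computing the scores separately for each semantic group only requires filtering both sets by the third element of each span before calling the function.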
A first version of MedLexSp was evaluated with the Spanish texts from the MANTRA corpus (Kors et al., 2015), which gathers 100 texts from the European Medicines Agency (1961 tokens) and 100 texts from Medline (1087 tokens). These texts are available in BRAT format and were annotated with UMLS CUIs, semantic types and groups. We preprocessed the annotated texts for mapping reference annotations to UMLS semantic groups.
With this dataset, we achieved an overall F-measure of 0.83 (exact match) and of 0.87 (approximate match), although the performance var[...] 4). In our error analysis, we observed that unmatched entities were misspellings (e.g. *deteción instead of detección, 'detection'), discontinuous entities (e.g. hinchazón de la piel in hinchazón y hormigueo de la piel, 'swelling and tingling of skin'), or entities whose scope was wrongly annotated.

Discussion and Conclusions
The lexicon is being developed by means of hybrid NLP methods and corpus-derived terms. We combine the mapping of corpus terms to available thesauri with the reverse direction: thesaurus terms missing in the lexicon were checked against domain texts, so that only the attested subset is included in a first version. Interestingly, searching terms from thesauri in a corpus showed us that many of those terms have low frequencies. From a subset of 56 813 MeSH terms missing in the base list, only 6 676 (11.75%) occurred in the corpus we used (Moreno-Sandoval and Campillos-Llanos, 2013). Although this is due to the influence of the text types, it also reflects the difference between terms in thesauri and terms in real usage. This is another argument for the need for dedicated lexicons combined with NLP methods to achieve successful NER results. A limitation of our evaluation procedure is the restriction to a very small set of texts; hence, results are not comparable to other tasks or text types. To provide more generalizable results, we need to evaluate the MedLexSp lexicon with another annotated medical corpus in Spanish, but such a resource is not freely available to date.
We assume the lexicon is not task-independent. To avoid ambiguity, terms would need to be filtered according to the semantic types needed. For example, terms from the Occupation or Discipline group could be removed for most NER tasks. We are also aware of the limits of a purely lexicon-based approach. Contexts of variation occur in multi-word terms with coordinated terms (e.g. cáncer de mama y ovario, 'breast and ovarian cancer') and adjective modifiers. For example, MedLexSp includes the term cáncer de mama ('breast cancer'), but not common variants such as cáncer de mama derecha ('right breast cancer') or cáncer de una mama ('cancer of one breast'). Both phenomena need specific processing techniques.
Mapping concepts to terms differing across varieties of the Spanish language was not exhaustive. As we started mainly from a set of corpus-derived terms, most terms belong to the variety used in the texts (i.e. Peninsular Spanish). However, since we used other terminological sources, terms from other varieties were included: e.g. virus sincitial respiratorio ('respiratory syncytial virus', C0035236) is a term preferred in Spain or Colombia, but we have the variant virus sincicial respiratorio (most frequent in Chile or Argentina). These aspects nonetheless need improvement in future versions, in the same way as the coverage of terms from Consumer Health Vocabularies.
Lastly, we are interested in exploring embedding-based methods for term expansion, and in evaluating the lexicon with a broader set of domain texts.

[...] thesauri included in MedLexSp were obtained through a distribution and usage agreement from the corresponding institutions that develop them. In addition, some material in the UMLS Metathesaurus is from copyrighted sources of the respective copyright holders. Users of the UMLS Metathesaurus are solely responsible for compliance with any copyright, patent or trademark restrictions and are referred to the copyright, patent or trademark notices appearing in the original sources, all of which are hereby incorporated by reference.
The version of MedLexSp freely available for research does not include terms or coding data from terminological sources under copyright; only the subset of data in MedLexSp without usage restrictions is accessible.
We acknowledge the intellectual property rights of the institutions that develop the sources from which we extracted subsets of terms to compile the lexicon, and that gave permission (or provide a licence to reuse their data) to distribute these subsets of terms: the National Library of Medicine maintains the resource and the Medical Subject Headings, and BIREME/OPS (Latin-American and Caribbean Center on Health Sciences Information) is in charge of the Spanish translation (Descriptores en Ciencias de la Salud, DeCS); the National Cancer Institute publishes the Dictionary of Cancer Terms; the French National Institute of Health and Medical Research (INSERM) supports OrphaNet and gathers the information provided in OrphaData; the World Health Organization produces the Adverse Drug Reactions terminology, the International Classification of Diseases, version 10, and the Anatomical Therapeutic Chemical classification; the Spanish translation of the International Classification of Primary Care (ICPC) is supported by the World Organization of Family Doctors; and the Spanish Agency of Drugs and Food Products (AEMPS) publishes the Nomenclátor de prescripción. MedLexSp also gathers some terms from the Spanish version of the Medical Dictionary for Regulatory Activities (MedDRA), which is maintained by the Maintenance and Support Services Organization (MSSO). However, the distributed version of MedLexSp does not include terms coming solely from the MedDRA sources, because of copyright restrictions. In addition, MedLexSp includes a subset of the Spanish version of SNOMED Clinical Terms®, which is used by permission of the International Health Terminology Standards Development Organization (IHTSDO; all rights reserved). SNOMED CT® was originally created by The College of American Pathologists.

Figure 2: Methods used to collect the MedLexSp lexicon.

Figure 3: Distribution of semantic groups in the lexicon

Table 1: Count of lexical entries according to each source or procedure to map terms to UMLS CUIs.

Table 2: Descriptive statistics of the lexicon