The ProfNER shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora

Detection of occupations in texts is relevant for a range of important application scenarios, such as competitive intelligence, sociodemographic analysis, legal NLP or health-related occupational data mining. Despite this importance and the heterogeneity of data types in which occupations are mentioned, text mining efforts to recognize them have been limited, mainly due to the lack of clear annotation guidelines and high-quality Gold Standard corpora. Social media data can be regarded as a relevant source of information for real-time monitoring of at-risk occupational groups in the context of pandemics such as COVID-19, facilitating intervention strategies for occupations in direct contact with infectious agents or affected by mental health issues. To evaluate current NLP methods and to generate resources, we have organized the ProfNER track at SMM4H 2021, providing participants with a Gold Standard corpus of manually annotated tweets (human IAA of 0.919) following annotation guidelines available in Spanish and English, an occupation gazetteer, a machine-translated version of the tweets, and FastText embeddings. Out of 35 registered teams, 11 submitted a total of 27 runs. Best-performing participants built systems based on recent NLP technologies (e.g. transformers) and achieved an F1-score of 0.93 in Text Classification and 0.839 in Named Entity Recognition. Corpus: https://doi.org/10.5281/zenodo.4309356


Introduction
The number of social media users and the volume of content are rapidly growing, with over 700 million tweets posted daily (James, 2019). The Social Media Mining 4 Health (SMM4H) effort (Klein et al., 2020; Magge et al., 2021) promotes the development and evaluation of NLP and text mining resources to automatically extract relevant health-related information from social media data. As social media content is produced directly by users at a global scale, a variety of application scenarios of medical importance have been explored so far, including the use of social media for pharmacovigilance (Nikfarjam et al., 2015), medication adherence (Belz et al., 2019) or tracking the spread of infectious and viral diseases (Rocklöv et al., 2019; Zadeh et al., 2019).
The Spanish-speaking community on Twitter is large, exceeding 30 million users (Tankovska, 2019), which has motivated text mining efforts for health-related applications, in particular on drug-related effects (Segura-Bedmar et al., 2014, 2015; Ramírez-Cifuentes et al., 2020).
One of the current challenges for health-related social media applications is to generate more actionable knowledge that can drive the design of intervention plans or policies to improve population health. This is particularly true for infectious disease outbreaks like the COVID-19 pandemic, where certain occupational groups and subpopulations have been at higher risk, due to direct exposure to infected persons, a higher degree of social interaction, work-related travel to high-risk areas or mental health problems associated with work-induced stress. Early detection and characterization of at-risk professions is critical to design and prioritize preventive and therapeutic measures or even vaccination plans.
To date, occupational text mining has been used in clinical narratives (Dehghan et al., 2016) and to explore injuries in the construction sector (Cheng et al., 2012), mainly in English. However, it has not yet been systematically used in social media and clinical content in Spanish.
To implement a central NLP component for occupational data mining, namely the automatic detection of occupation mentions in social media texts, we have organized the ProfNER (SMM4H 2021) shared task. In this article, occupation mentions are all those elements that indicate the employment information of a person. Within occupations, we have identified three main labels: (i) "profession", occupations that provide a person with a wage or livelihood (e.g., nurse); (ii) "activity", unpaid occupations (e.g., activist); and (iii) "working status" (e.g., retired).
Resources released for the ProfNER track included annotation guidelines in Spanish and English, a consistency analysis to characterize the quality and difficulty of the track, a large collection of manually annotated occupation mentions in tweets, as well as FastText word embeddings generated from a very large social media dataset.
We foresee that the occupation mention recognition systems resulting from this track could serve as a key component for more advanced text mining tools integrating technologies related to opinion mining, sentiment analysis, gender-inequality analysis, hate speech or fake news detection. Moreover, there is also a clear potential for exploring occupation recognition results for safety management, risk behavior detection and social services intervention strategies.

Shared Task Goal
ProfNER focuses on the automatic recognition of professions and working status mentions on Twitter posts related to the COVID-19 pandemic in Spanish.

Subtracks
We have structured the ProfNER track into two independent subtracks: one on classifying whether a tweet mentions occupations at all, and another on finding the exact text spans referring to occupations.
This setting was chosen due to the high class imbalance in social media data: only 23.3% of the Gold Standard tweets contain mentions of occupations. Detecting relevant tweets first (Subtrack A) may therefore help detect the occupation mentions themselves (Subtrack B).
Subtrack A: Tweet binary classification. This subtrack required classifying tweets into those containing at least one occupation mention and those containing none.
Subtrack B: NER offset detection and classification. Participants had to find the beginning and end of relevant mentions and classify them into the corresponding category.

Shared Task Setting
The ProfNER shared task was organized in three phases run on CodaLab:

Practice phase. The training and validation subsets of the Gold Standard were released. During this period, participants built their systems and assessed their performance on the validation partition.
Evaluation phase. The test and background partitions were released without annotations. Participants had to generate predictions for both the test and the background sets, but were evaluated only on the test set predictions. This discouraged manual annotation of the test set and assessed whether systems were able to scale to larger data collections. Each team was allowed to submit up to two runs.
Post-evaluation phase. The competition is kept alive on CodaLab. Interested stakeholders can still assess how their systems perform.

Evaluation: Metrics and Baseline
For Subtrack A, systems were ranked based on the F1-score for the positive class (tweets that contain a mention). For Subtrack B, the primary evaluation metric was the micro-averaged F1-score. In this second subtrack, a prediction counted as correct only if its span matched the Gold Standard annotation exactly and had the same category. Only the PROFESION (profession) and SITUACION LABORAL (working status) categories were considered in the evaluation.
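As an illustration, strict-match micro-averaged F1 can be computed over sets of entity tuples. This is a minimal sketch, not the official evaluation script; the `strict_f1` helper and the tuple representation of entities are our own illustrative assumptions:

```python
def strict_f1(gold, pred):
    """Micro-averaged strict-match F1 for NER. A predicted entity counts as a
    true positive only if its span AND category exactly match a gold
    annotation. Entities are (tweet_id, start, end, category) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)                       # exact span + category matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Note that under this strict criterion a prediction with the right span but the wrong category (or an off-by-one boundary) counts as both a false positive and a false negative.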
In addition, every system was compared to a baseline prediction: a Levenshtein lexical lookup approach with a sliding window of varying length.
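The baseline idea can be sketched roughly as follows, assuming a gazetteer of (term, label) pairs; the actual baseline implementation, its window lengths and its distance threshold may differ:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lexicon_lookup(text, lexicon, max_dist=1):
    """Slide windows of varying token length over the text and flag spans
    whose edit distance to a gazetteer entry is at most max_dist.
    Returns (start_offset, end_offset, label) tuples."""
    tokens = text.split()
    offsets, pos = [], 0
    for tok in tokens:                      # character offsets of each token
        start = text.index(tok, pos)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    matches = []
    for n in range(1, 4):                   # window lengths of 1-3 tokens
        for i in range(len(tokens) - n + 1):
            span = " ".join(tokens[i:i + n])
            for entry, label in lexicon:
                if levenshtein(span.lower(), entry.lower()) <= max_dist:
                    matches.append((offsets[i][0], offsets[i + n - 1][1], label))
    return matches
```

The fuzzy matching makes the baseline tolerant to the misspellings common in tweets, at the cost of spurious matches for short tokens.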

ProfNER Gold Standard
The ProfNER Gold Standard is a collection of 10,000 COVID-related tweets in Spanish annotated with 4 types of occupation mentions: PROFESION (in English, "profession"), ACTIVIDAD ("activity"), SITUACION LABORAL ("working status") and FIGURATIVA (indicating that occupations are used in a figurative context).
The corpus has been split into training (60%), development (20%) and test (20%) sets. In addition, 25,000 extra tweets are provided without annotations as a background set. Table 1 contains an overview of the corpus statistics.
The corpus was carefully selected to include documents relevant to the challenges of the COVID-19 pandemic. It was obtained from a Twitter crawl that used keywords related to the pandemic (such as "Covid-19") and lockdown measures (like "confinamiento" and "distanciamiento", the Spanish terms for lockdown and social distancing), as well as hashtags such as "#yomequedoencasa" (#istayathome), to retrieve relevant tweets. Finally, we only kept tweets written from Spain and in Spanish.
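A minimal sketch of such keyword-based filtering (the keyword set shown is only an illustrative subset of the actual crawl list):

```python
# Illustrative subset of the pandemic-related crawl keywords and hashtags.
KEYWORDS = {"covid-19", "confinamiento", "distanciamiento", "#yomequedoencasa"}

def is_relevant(tweet_text):
    """Keep a tweet if it contains any pandemic-related keyword or hashtag."""
    text = tweet_text.lower()
    return any(kw in text for kw in KEYWORDS)
```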
We filtered the tweets using the location attribute of the user profile, looking for the names of Spanish cities with more than 50K inhabitants, province names, autonomous region names, and any location specified simply as "Spain". The list of place names was obtained from the Instituto Nacional de Estadística (INE): https://www.ine.es/dynt3/inebase/es/index.htm?padre=517&capsel=525

Gold Standard Quality. The corpus was manually annotated by expert linguists in an iterative process that included the creation of custom-made annotation guidelines, described in Section 3.3. Furthermore, we performed a consistency analysis of the corpus: 10% of the documents were annotated both by an internal annotator and by the expert linguists. The Inter-Annotator Agreement (pairwise agreement) is 0.919.

(Table 1: corpus statistics per partition, with columns for documents, annotations, tweets with mentions, and tokens.)

Gold Standard Format. Tweets are provided as plain UTF-8 text files in composed Unicode form, one tweet per file, with the tweet ID as the file name. The tweet classification annotations are distributed in a tab-separated file. The Named Entity Recognition (NER) annotations are distributed in Brat standoff format (Stenetorp et al., 2012) and in a tab-separated file (Fig. 2).

Translation to English. The ProfNER shared task attracted participants from many non-Spanish-speaking countries. Moreover, more resources exist for social media text processing in English than in Spanish. For that reason, we have provided a machine-translated version of the ProfNER corpus as an additional resource. This eases comparison with systems working in English, supports participants with previous experience in English, and enables exploring the use of machine-translated corpora.
The complete ProfNER corpus (originally in Spanish) was translated into English by means of a state-of-the-art machine translation system based on recurrent neural networks.
The ProfNER Gold Standard and the English translation are distributed in Zenodo (Miranda-Escalada et al., 2020b).
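For reference, the entity lines of a Brat standoff .ann file have the form `T1<TAB>LABEL start end<TAB>mention text` and can be read with a few lines of code. The sketch below assumes contiguous spans (no ";"-separated discontinuous fragments), which is the common case:

```python
def parse_brat_ann(ann_text):
    """Parse entity ('T') lines of a Brat standoff .ann file into
    (entity_id, label, start, end, text) tuples. Relation, event and
    note lines (R*, E*, #*, ...) are skipped."""
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue
        ent_id, span_info, mention = line.split("\t")
        label, start, end = span_info.split()
        entities.append((ent_id, label, int(start), int(end), mention))
    return entities
```

The start and end values are character offsets into the corresponding UTF-8 text file, which is why the composed Unicode normalization of the tweets matters for reproducing the spans.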

ProfNER Silver Standard Corpus
The ProfNER test set was released together with an additional collection of 25,000 unannotated tweets as a background set. Participants generated predictions for both the test and background sets; the participant predictions over the background set form the basis of the ProfNER Silver Standard corpus.

ProfNER Guidelines
The creation of robust guidelines ensures dataset quality and replicability. Their main objectives are: (1) to capture all possible mentions of the entities of interest, especially occupations (e.g., "trazadores de contagios", contact tracers); and (2) to apply constraints to the mentions in order to obtain well-defined, replicable expressions (e.g., "empleada en carpintería mecánica", employed in machine carpentry).
ProfNER's guidelines were created from scratch and iteratively refined to achieve maximum richness of mentions, until the Inter-Annotator Agreement was sufficiently high. All in all, six batches of annotations, corrections and reviews were required, reaching an agreement of 0.919. The final version includes 54 rules that describe the concepts to annotate and the associated constraints. They are divided into four major groups: (i) 12 general rules, explaining the classification, orthographic and typographical aspects to be considered; (ii) 22 positive rules, explaining what should be deemed an occupation; (iii) 11 negative rules, showing elements that should not be annotated; and (iv) 9 special cases of annotation. All rules are provided with illustrative corpus examples.
The guidelines were originally written in Spanish and later translated into English by a professional translator; both versions are freely available in Zenodo (Farré-Maduell et al., 2020), through the medical NLP community: https://zenodo.org/communities/medicalnlp/

ProfNER Embeddings
To our knowledge, we have released the first COVID-related embeddings in Spanish. They are especially suited for the ProfNER use case, since they were trained on 140 million COVID-related Spanish Twitter posts.
URLs and user mentions were substituted by the standard tokens URL and @MENTION, respectively. Embeddings were trained with FastText (Bojanowski et al., 2017) with an embedding size of 300. CBOW and Skip-gram models, in cased and uncased versions, are available in Zenodo.
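The URL and mention substitution described above could be reproduced roughly as follows; the exact regular expressions used when building the released embeddings are an assumption on our part:

```python
import re

URL_RE = re.compile(r"https?://\S+")   # illustrative URL pattern
MENTION_RE = re.compile(r"@\w+")       # illustrative @mention pattern

def normalize_tweet(text):
    """Replace URLs and user mentions with placeholder tokens, mirroring
    the preprocessing applied before training the ProfNER embeddings."""
    text = URL_RE.sub("URL", text)
    text = MENTION_RE.sub("@MENTION", text)
    return text
```

The released binary models can then be loaded, e.g., with gensim's `load_facebook_model` utility for FastText vectors.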

ProfNER occupations gazetteer
We have released the ProfNER gazetteer of occupations in Spanish, a resource that combines terms from multiple terminologies (DeCS, ESCO, SNOMED CT and WordNet) with occupations detected by Stanford CoreNLP in a large collection of Spanish social media profiles. The gazetteer can be found in Zenodo (Asensio et al., 2021).

Results
Participation Overview. Since this is the first task on the detection of occupational entities in social media, ProfNER has received considerable attention from the community. Indeed, 35 teams registered for the task (31 for Subtrack A and 29 for Subtrack B), and there were 27 submissions (15 in Subtrack A and 12 in Subtrack B) from 11 teams. Participant teams came from both industry (3) and academia (8).

System Results. Table 2 shows an overview of participants' results. In the tweet classification subtrack, three systems, from both academia and industry, achieved similar performance (F1-scores of 0.92 or 0.93). In the NER subtrack, the best-performing system was developed by Recognai, an industry participant, with a 0.839 F1-score. This system was based on a transformer architecture. The second best-performing system was from MIC-NLP, a partnership between Siemens AG and the Ludwig Maximilian University of Munich. They obtained a 0.824 F1-score combining contextualized embeddings with a BiLSTM-CRF. It is noteworthy that the best systems in terms of F1-score were also the systems with the highest recall, but not the highest precision.
Result analysis. Among the entities that ProfNER participants rarely detect, the proportion of SITUACION LABORAL mentions is larger than in the entire corpus. Besides, some mentions with punctuation marks (particularly # or @) and capital letters are rarely predicted. For instance, "ministro del @interiorgob", "OFICIALES DE MESA" or "PENSIONISTAS" are never predicted. Notably, even though correct boundary detection remains a challenge for NER (Li et al., 2020), in our corpus entity length does not seem to influence predictability.

Discussion
To the best of our knowledge, ProfNER is the first occupational data mining effort in social media. It is also the first shared task on health applications of social media in Spanish. Specifically for the shared task, we have built a pioneering Gold Standard for Named Entity Recognition (NER) of occupations in social media in Spanish, the ProfNER corpus. It was generated following the ProfNER annotation guidelines, which are shared with the community in Spanish and in English. Finally, we have trained and released the first word embeddings for our use case (Twitter posts in Spanish related to COVID-19) and a gazetteer of relevant terms.
In addition, the ProfNER shared task can be used as a template for future shared tasks on the recognition of occupations in social media. Indeed, the English translation of the annotation guidelines facilitates this line of research.
ProfNER has aroused interest from both academia and industry, and, interestingly, teams from non-Spanish-speaking countries have participated in the task. Tweet classification systems already reach high performance; however, the detection and classification of occupational mentions can still be improved.
We propose using the whole ecosystem of occupational text mining resources generated in this shared task (corpus, systems, guidelines, etc.) to build and evaluate novel systems for subpopulation characterization in social media during the current pandemic. Such systems also have the potential to assist public health policy makers in the prevention and management of current and future epidemics.
Moreover, beyond classical evaluation scenarios focusing on traditional quality metrics like precision and recall, there is also a need for shared tasks and community benchmark efforts that take into account the involvement of end users, as was the case of the BioCreative interactive task (Arighi et al., 2011), or the technical evaluation of the robustness and implementation of named entity components (Leitner et al., 2008; Pérez-Pérez et al., 2016), especially when considering the volume and rapid change of social media data.