The Second Cross-Lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages

We describe the Second Multilingual Named Entity Challenge in Slavic languages. The task is recognizing mentions of named entities in Web documents, their normalization, and cross-lingual linking. The Challenge was organized as part of the 7th Balto-Slavic Natural Language Processing Workshop, co-located with the ACL-2019 conference. Eight teams participated in the competition, which covered four languages and five entity types. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all four languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.


Introduction
Due to rich inflection and derivation, free word order, and other morphological and syntactic phenomena exhibited by Slavic languages, analysis of named entities (NEs) in these languages poses a challenging problem (Przepiórkowski, 2007; Piskorski et al., 2009). Fostering research on detection and normalization of NEs, and on the closely related problem of cross-lingual, cross-document entity linking, is of paramount importance for improving multilingual and cross-lingual information access in these languages. This paper describes the Second Shared Task on multilingual NE recognition (NER), which aims at addressing these problems in a systematic way. The shared task was organized in the context of the 7th Balto-Slavic Natural Language Processing Workshop co-located with the ACL 2019 conference. The task covers four languages (Bulgarian, Czech, Polish and Russian) and five types of NE: person, location, organization, product, and event. The input text collection consists of documents collected from the Web, each collection centered on a certain "focal" entity. The rationale of such a setup is to foster the development of "all-round" NER and cross-lingual entity linking solutions, which are not tailored to specific, narrow domains. This paper also serves as an introduction and a guide for researchers wishing to explore these problems using the training and test data.
This paper is organized as follows. Section 3 describes the task; Section 4 describes the annotation of the dataset. The evaluation methodology is introduced in Section 5. Participant systems are described in Section 6 and the results obtained by these systems are presented in Section 7. Conclusions and lessons learned are discussed in Section 8.

Prior Work
The work we describe here builds on the First Shared Task on Multilingual Named Entity Recognition, Normalization and cross-lingual Matching for Slavic Languages (Piskorski et al., 2017), which, to the best of our knowledge, was the first attempt at such a shared task covering several Slavic languages.
Similar shared tasks have been organized previously. The first non-English monolingual NER evaluations, covering Chinese, Japanese, Spanish, and Arabic, were carried out in the context of the Message Understanding Conferences (MUCs) (Chinchor, 1998) and the ACE Programme (Doddington et al., 2004). The first shared task focusing on multilingual named entity recognition, which covered several European languages, including Spanish, German, and Dutch, was organized in the context of CoNLL conferences (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003). The NE types covered in these campaigns were similar to the NE types covered in our Challenge. Also related to our task is Entity Discovery and Linking (EDL) (Ji et al., 2014, 2015), a track of the NIST Text Analysis Conferences (TAC). EDL aimed to extract entity mentions from a collection of documents in multiple languages (English, Chinese, and Spanish), and to partition the entities into cross-document equivalence classes, by either linking mentions to a knowledge base or directly clustering them. An important difference between EDL and our task is that we do not link entities to a knowledge base.
Related to cross-lingual NE recognition is NE transliteration, i.e., linking NEs across languages that use different scripts. A series of NE Transliteration Shared Tasks were organized as a part of NEWS, the Named Entity Workshops (Duan et al., 2016), focusing mostly on Indian and Asian languages. In 2010, the NEWS Workshop included a shared task on Transliteration Mining (Kumaran et al., 2010), i.e., mining of names from parallel corpora. This task included corpora in English, Chinese, Tamil, Russian, and Arabic.
Prior work targeting NEs specifically for Slavic languages includes tools for NE recognition for Croatian (Karan et al., 2013;Ljubešić et al., 2013), a tool tailored for NE recognition in Croatian tweets (Baksa et al., 2017), a manually annotated NE corpus for Croatian (Agić and Ljubešić, 2014), tools for NE recognition in Slovene (Štajner et al., 2013;Ljubešić et al., 2013), a Czech corpus of 11,000 manually annotated NEs (Ševčíková et al., 2007), NER tools for Czech (Konkol and Konopík, 2013), tools and resources for fine-grained annotation of NEs in the National Corpus of Polish (Waszczuk et al., 2010;Savary and Piskorski, 2011) and a recent shared task on NE Recognition in Russian (Alexeeva et al., 2016).

Task Description
The data for the shared task consists of sets of documents in four Slavic languages: Czech, Polish, Russian, and Bulgarian. To accommodate entity linking, each set of documents is chosen to focus on one specific entity, e.g., a person, an organization or an event. The documents were obtained from the Web, by posing a keyword query to a search engine and extracting the textual content from the Web pages.
The task is to recognize, classify, and "normalize" all named-entity mentions in each of the documents, and to link across languages all named mentions referring to the same real-world entity. Formally, the Multilingual Named Entity Recognition task includes three sub-tasks:
• Named Entity Mention Detection and Classification: Recognizing all named mentions of entities of five types: persons (PER), organizations (ORG), locations (LOC), products (PRO), and events (EVT).
• Name Normalization: Mapping each named mention of an entity to its corresponding base form. By "base form" we generally mean the lemma ("dictionary form") of the inflected word-form. In some cases normalization should go beyond inflection and transform a derived word into a base word's lemma, e.g., in case of personal possessives (see below). Multi-word names should be normalized to the canonical multi-word expression, rather than a sequence of lemmas of the words making up the multi-word expression.
• Entity Linking: Assigning a unique identifier (ID) to each detected named mention of an entity, in such a way that mentions referring to the same real-world entity are assigned the same ID, referred to as the cross-lingual ID.
The task does not require positional information of the named entity mentions. Thus, for all occurrences of the same form of a NE mention (e.g., an inflected variant, an acronym or abbreviation) within a given document, no more than one annotation should be produced. Furthermore, distinguishing typographical case is not necessary since the evaluation is case-insensitive. If the text includes lowercase, uppercase or mixed-case variants of the same entity, the system should produce only one annotation for all of these mentions. For instance, for "BREXIT" and "Brexit" (provided that they refer to the same NE type), only one annotation should be produced. Note that recognition of common-noun or pronominal references to named entities is not part of the task.
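The one-annotation-per-form rule can be illustrated with a minimal Python sketch. It is not part of the official task tooling: the function name `unique_annotations`, the tuple layout, and the sample identifiers are assumptions made for illustration only. Forms are compared after lower-casing, and a form that refers to two different entities keeps two annotations.

```python
def unique_annotations(annotations):
    """Keep one annotation per (lower-cased form, type, ID) within a document.

    Each annotation is a hypothetical (mention, base, category, entity_id)
    tuple; the first spelling encountered is the one that is kept.
    """
    seen = set()
    result = []
    for mention, base, category, entity_id in annotations:
        # Case is ignored; type and ID are kept in the key so that one
        # word-form referring to two entities still yields two annotations.
        key = (mention.lower(), category, entity_id)
        if key not in seen:
            seen.add(key)
            result.append((mention, base, category, entity_id))
    return result


# Made-up sample data: two case variants of one event, and one word-form
# used for both a person and a location.
sample = [
    ("BREXIT", "Brexit", "EVT", "EVT-Brexit"),
    ("Brexit", "Brexit", "EVT", "EVT-Brexit"),
    ("Washington", "Washington", "PER", "PER-Washington"),
    ("Washington", "Washington", "LOC", "LOC-Washington"),
]
deduplicated = unique_annotations(sample)
```

Here "BREXIT" and "Brexit" collapse into a single annotation, while the two readings of "Washington" are both retained.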

Named Entity Classes
The task defines the following five NE classes. . ", only "Jan Kowalski" is recognized as a person name. Initials and pseudonyms are considered named mentions of persons and should be recognized. Similarly, named references to groups of people (that do not have a formal organization unifying them) should also be recognized, e.g., "Ukrainians." In this context, mentions of a single member belonging to such groups, e.g., "Ukrainian," should be assigned the same cross-lingual ID as plural mentions, i.e., "Ukrainians" and "Ukrainian" when referring to the nation receive the same cross-lingual ID.
Personal possessives derived from a person's name should be classified as a Person, and the base form of the corresponding name should be extracted. For instance, in "Trumpov tweet" (Croatian) one is expected to classify "Trumpov" as PER, with the base form "Trump."
Locations (LOC): All toponyms and geopolitical entities (cities, counties, provinces, countries, regions, bodies of water, land formations, etc.), including named mentions of facilities (e.g., stadiums, parks, museums, theaters, hotels, hospitals, transportation hubs, churches, railroads, bridges, and similar facilities).
In case named mentions of facilities also refer to an organization, the LOC tag should be used. For example, from the text "The Schiphol Airport has acquired new electronic gates" the mention "The Schiphol Airport" should be classified as LOC.
Organizations (ORG): All organizations, including companies, public institutions, political parties, international organizations, religious organizations, sport organizations, educational and research institutions, etc.
Organization designators and potential mentions of the seat of the organization are considered to be part of the organization name. For instance, from the text "...Citi Handlowy w Poznaniu..." (a bank in Poznań), the full phrase "Citi Handlowy w Poznaniu" should be extracted.
When a company name is used to refer to a service, e.g., "na Twitterze" (Polish for "on Twitter"), the mention of "Twitter" is considered to refer to a service/product and should be tagged as PRO. However, when a company name refers to a service, expressing an opinion of the company, e.g., "Fox News", it should be tagged as ORG.

Complex and Ambiguous Entities
In case of complex named entities, consisting of nested named entities, only the top-most entity should be recognized. For example, from the text "George Washington University" one should not extract "George Washington", but only the top-level entity.
In case one word-form (e.g., "Washington") is used to refer to more than one real-world entity in different contexts in the same document (e.g., a person and a location), the system should return two annotations, associated with different cross-lingual IDs.
In case of coordinated phrases, like "European and British Parliament," two names should be extracted (as ORG). The lemmas would be "European" and "British Parliament", and the IDs should refer to "European Parliament" and "British Parliament" respectively.
In rare cases, plural forms might have two annotations. For example, in the phrase "a border between Irelands", the mention "Irelands" should be extracted twice, with identical lemmas but different IDs.

System Input and Response
Input Document Format: Documents in the collection are represented in the following format. The first five lines contain meta-data. The text to be processed begins on the sixth line and runs to the end of the file. The <URL> field stores the origin from which the text document was retrieved. The values of the meta-data fields were computed automatically (see Section 4 for details). The values of <CREATION-DATE> and <TITLE> were not provided for all documents, due to unavailability of such data or due to errors in parsing during data collection.
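As a rough illustration of this layout, the following Python sketch separates the five meta-data lines from the text body. The function name `split_document` is hypothetical, and because the exact order and syntax of the meta-data fields are not reproduced here, the sketch returns the header lines uninterpreted.

```python
def split_document(raw):
    """Split a raw input document into its meta-data lines and its text.

    Assumption: the first five lines are meta-data and everything from the
    sixth line onward is the text to be processed, as described above.
    """
    lines = raw.split("\n")
    metadata = [line.strip() for line in lines[:5]]
    text = "\n".join(lines[5:])
    return metadata, text


# Placeholder content; real documents carry actual meta-data values.
raw_example = "m1\nm2\nm3\nm4\nm5\nFirst line of text.\nSecond line of text."
metadata, text = split_document(raw_example)
```

Since <CREATION-DATE> and <TITLE> may be missing, downstream code should tolerate empty values in the header lines.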
System Response. For each input file, the system should return one output file as follows. The first line should contain only the <DOCUMENT-ID>, which corresponds to the input. Each subsequent line contains one annotation, as tab-separated fields:

<MENTION> TAB <BASE> TAB <CAT> TAB <ID>
The <MENTION> field should be the NE as it appears in text. The <BASE> field should be the base form of the entity. The <CAT> field stores the category of the entity (ORG, PER, LOC, PRO, or EVT) and <ID> is the cross-lingual identifier. The cross-lingual identifiers may consist of an arbitrary sequence of alphanumeric characters. An example document in Czech and the corresponding response is shown in Figure 2.
For detailed descriptions of the tasks and guidelines, please refer to the web page of the shared task.
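The response layout can be sketched in Python as follows. This is an illustrative sketch rather than the official tooling: the function names `format_response` and `parse_response`, the document ID, and the sample identifiers are all assumptions.

```python
def format_response(document_id, annotations):
    """Serialize annotations into the tab-separated response layout.

    The first line holds only the document ID; each following line holds
    one annotation as MENTION, BASE, CAT and ID separated by tabs.
    """
    lines = [document_id]
    for mention, base, category, entity_id in annotations:
        lines.append("\t".join((mention, base, category, entity_id)))
    return "\n".join(lines) + "\n"


def parse_response(text):
    """Read a response back into (document_id, list of 4-tuples)."""
    lines = text.splitlines()
    document_id = lines[0].strip()
    annotations = [tuple(line.split("\t")) for line in lines[1:] if line.strip()]
    return document_id, annotations


# Made-up example values; real IDs are chosen by each system.
example = [("Ryanairu", "Ryanair", "ORG", "ORG-Ryanair")]
serialized = format_response("doc-1", example)
parsed_id, parsed = parse_response(serialized)
```

A round trip through both functions returns the original ID and annotations, which is a convenient sanity check before submitting output.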
HTML parsing results may include not only the main text of a Web page, but also some additional text, e.g., labels from menus, user comments, etc., which may not constitute well-formed utterances in the target language. This occurred in a small fraction of texts processed. Some of these texts were included in the test dataset in order to maintain the flavor of "real-data." However, obvious HTML parser failures (e.g., extraction of JavaScript code, extraction of empty texts, etc.) were removed from the data sets. Some of the documents were polished further by removing erroneously extracted boilerplate content. The resulting set of partially "cleaned" documents was used to manually select documents for each language and topic, for the final datasets.
Documents were annotated using Inforex (github.com/CLARIN-PL/Inforex), a web-based system for annotation of text corpora (Marcinczuk et al., 2017). Inforex allows parallel access and resource sharing by multiple annotators. It let us share a common list of entities, and perform entity linking semi-automatically: for a given entity, an annotator sees a list of entities of the same type inserted by all annotators and can select an entity ID from the list. A snapshot of the Inforex interface is in Figure 1.
In addition, Inforex keeps track of all lemmas and IDs inserted for each surface form, and inserts them automatically, so in many cases the annotator only confirms the proposed values, which speeds up the annotation process a great deal. All annotations were made by native speakers. After annotation, we performed automatic and manual consistency checks, to reduce annotation errors, especially in entity linking.
Using Inforex allowed us to annotate data much faster than in the first edition of the shared task. Thus we were able to annotate larger datasets and provide participants with training data. (In the first edition participants received only test data.) Data statistics are presented in Table 1.
Documents about ASIA BIBI and BREXIT were used for training and distributed to the participating teams with annotations. The testing datasets, RYANAIR and NORD STREAM, were released to the participants two days before the submission deadline. The participants did not know the topics in advance, and did not receive the annotations. Thus, we pushed participants to build a general solution for Slavic NER, rather than to optimize their models toward a particular set of names.

Evaluation Methodology
The NER task (exact case-insensitive matching) and Name Normalization (or "lemmatization") were evaluated in terms of precision, recall, and F1-measure. For NER, two types of evaluations were carried out:
• Relaxed: An entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one annotation of a named mention of this entity (regardless of whether the extracted mention is in base form);
• Strict: The system response should include exactly one annotation for each unique form of a named mention of an entity in a given document, i.e., identifying all variants of an entity is required.
In relaxed evaluation we additionally distinguish between exact and partial matching: in the latter case, an entity mentioned in a given document is considered to be extracted correctly if the system response includes at least one partial match of a named mention of this entity. We evaluate systems at several levels of granularity: we measure performance for (a) all NE types and all languages, (b) each given NE type and all languages, (c) all NE types for each language, and (d) each given NE type per language.
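The difference between the two schemes can be sketched in Python. This is a simplified illustration and not the official evaluation script: the function names, the data layout, and the sample data are assumptions, the relaxed variant covers only exact matching, and only its recall side is shown.

```python
def strict_scores(key_mentions, response_mentions):
    """Precision, recall and F1 over unique (lower-cased form, type) pairs."""
    key = {(m.lower(), c) for m, c in key_mentions}
    response = {(m.lower(), c) for m, c in response_mentions}
    correct = len(key & response)
    precision = correct / len(response) if response else 0.0
    recall = correct / len(key) if key else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def relaxed_recall(key_entities, response_mentions):
    """Fraction of key entities with at least one exactly matching form.

    key_entities maps a (hypothetical) entity ID to the list of
    (form, type) pairs annotated for that entity in the document.
    """
    response = {(m.lower(), c) for m, c in response_mentions}
    found = sum(
        1
        for forms in key_entities.values()
        if any((m.lower(), c) in response for m, c in forms)
    )
    return found / len(key_entities) if key_entities else 0.0


# Made-up data: the system misses one inflected form and adds a spurious name.
key_entities = {
    "EVT-Brexit": [("Brexit", "EVT"), ("Brexitu", "EVT")],
    "PER-May": [("Theresa May", "PER")],
}
key_mentions = [pair for forms in key_entities.values() for pair in forms]
response = [("BREXIT", "EVT"), ("Theresa May", "PER"), ("London", "LOC")]

strict_p, strict_r, strict_f1 = strict_scores(key_mentions, response)
relaxed_r = relaxed_recall(key_entities, response)
```

In this example the missed inflected form lowers strict recall, whereas relaxed recall is unaffected because each key entity is matched by at least one form.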
In the name normalization task, we take into account only correctly recognized entity mentions and only those that were normalized (on both the annotation and system's sides). Formally, let $N_{correct}$ denote the number of all correctly recognized entity mentions for which the system returned a correct base form. Let $N_{key}$ denote the number of all normalized entity mentions in the gold-standard answer key and $N_{response}$ denote the number of all normalized entity mentions in the system's response. We define precision and recall for the name normalization task as:
$$\mathrm{Recall} = \frac{N_{correct}}{N_{key}}, \qquad \mathrm{Precision} = \frac{N_{correct}}{N_{response}}$$
In evaluating document-level, single-language and cross-lingual entity linking we adopted the Link-Based Entity-Aware metric (LEA) (Moosavi and Strube, 2016), which considers how important the entity is and how well it is resolved. LEA is defined as follows. Let $K = \{k_1, k_2, \ldots, k_{|K|}\}$ denote the set of key entities and $R = \{r_1, r_2, \ldots, r_{|R|}\}$ the set of response entities, i.e., each $k_i \in K$ ($r_i \in R$) stands for the set of mentions of the same entity in the key entity set (response entity set). LEA recall and precision are then defined as follows:
$$\mathrm{Recall}_{LEA} = \frac{\sum_{k_i \in K} imp(k_i) \cdot res(k_i)}{\sum_{k_z \in K} imp(k_z)}, \qquad \mathrm{Precision}_{LEA} = \frac{\sum_{r_i \in R} imp(r_i) \cdot res(r_i)}{\sum_{r_z \in R} imp(r_z)}$$
where $imp$ and $res$ denote the measure of importance and the resolution score for an entity, respectively. In our setting, we define $imp(e) = \log_2 |e|$ for an entity $e$ (in $K$ or $R$), where $|e|$ is the number of mentions of $e$; i.e., the more mentions an entity has, the more important it is. The logarithm is used to avoid biasing the importance toward the more frequent entities. The resolution score of key entity $k_i$ is computed as the fraction of correctly resolved co-reference links of $k_i$:
$$res(k_i) = \sum_{r_j \in R} \frac{link(k_i \cap r_j)}{link(k_i)}$$
where $link(e) = (|e| \times (|e| - 1))/2$ is the number of unique co-reference links in $e$. For each $k_i$, LEA checks all response entities to determine whether they are partial matches for $k_i$. Analogously, the resolution score of response entity $r_i$ is computed as the fraction of co-reference links in $r_i$ that are extracted correctly:
$$res(r_i) = \sum_{k_j \in K} \frac{link(r_i \cap k_j)}{link(r_i)}$$
LEA brings several benefits. For example, LEA considers resolved co-reference relations instead of resolved mentions and has more discriminative power than other metrics for co-reference resolution (Moosavi and Strube, 2016).
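The LEA definitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the official scorer: entities are represented as sets of mention identifiers, response entities are assumed not to overlap, and entities with a single mention are skipped, since their importance $\log_2 1$ is zero and they contain no links. The function names are hypothetical.

```python
import math


def link(size):
    """Number of unique co-reference links among `size` mentions."""
    return size * (size - 1) // 2


def lea_side(entities, others):
    """Importance-weighted average resolution score of `entities`.

    Passing (key, response) yields LEA recall; passing (response, key)
    yields LEA precision.
    """
    numerator = 0.0
    denominator = 0.0
    for entity in entities:
        if len(entity) < 2:
            # Assumption: singletons have zero importance and no links,
            # so they are skipped to avoid dividing by zero.
            continue
        importance = math.log2(len(entity))
        resolved = sum(link(len(entity & other)) for other in others)
        numerator += importance * resolved / link(len(entity))
        denominator += importance
    return numerator / denominator if denominator else 0.0


def lea(key, response):
    """Return (precision, recall, F1) for two partitions of mentions."""
    key = [set(entity) for entity in key]
    response = [set(entity) for entity in response]
    recall = lea_side(key, response)
    precision = lea_side(response, key)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: mention "c" is attached to the wrong entity.
key_entities = [{"a", "b", "c"}, {"d", "e"}]
response_entities = [{"a", "b"}, {"c", "d", "e"}]
precision, recall, f1 = lea(key_entities, response_entities)
```

In this example the three-mention key entity keeps only one of its three links, while the two-mention entity is fully resolved, so the score falls between the two resolution values, weighted by importance.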
It is important to note at this stage that the evaluation was carried out in "case-insensitive" mode: all named mentions in system response and test corpora were lower-cased.

Participant Systems
Sixteen teams from eight countries registered for the shared task. Half of the registered teams submitted results by the deadline. Five teams submitted descriptions of their systems in the form of a Workshop paper. The remaining teams submitted short descriptions of their systems.
We briefly review the systems; complete descriptions appear in the corresponding papers.
CogComp used multi-source BiLSTM-CRF models, using solely the BERT multilingual embeddings (Devlin et al., 2019), which directly allows the model to train on datasets in multiple languages. The team submitted several models trained on different combinations of input languages. They found that multi-source training with multilingual BERT outperforms single-source training. Cross-lingual (even cross-script) training worked remarkably well. Multilingual BERT can handle train/test sets with mismatching tagsets in certain situations. The best performing models were trained on a combination of data in four languages, while adding English to the training data worsened the overall performance (Tsygankova et al., 2019).
CTC-NER is a baseline prototype of a NER component of an entity recognition system currently under development at the Cognitive Technologies Center, Russia. The system has a hybrid architecture, combining rule-based and ML techniques, where the ML-component is loosely related to (Antonova and Soloviev, 2013). As the system processes Russian, English and Ukrainian, the team submitted output only for Russian.
IIUWR.PL combines Flair, Polyglot and BERT. Additional training corpora were used: KPWr for Polish, CNEC for Czech, and data extracted using heuristics from Wikipedia. Lemmatization is partially trained on Wikipedia and PolEval corpora, and partially rule-based. Entity linking is rule-based, and uses WikiData and FastText (Bojanowski et al., 2017).
JRC-TMA-CC is a hybrid system combining a rule-based approach and machine learning techniques. It is a corpus-driven system, lightweight and highly multilingual, exploiting both automatically created lexical resources, such as JRC-Names (Ehrmann et al., 2017), and external resources, such as BabelNet (Jacquet et al., 2019a). The main focus of the approach is on generating the possible inflected variants for known names (Jacquet et al., 2019b).
RIS is a modified BERT model, which uses CRF as the top-most layer (Arkhipov et al., 2019). The model was initialized with an existing BERT model trained on 100 languages.
Sberiboba uses multilingual BERT embeddings, summed with learned weights and followed by BiLSTM, attention layers and NCRF++ on the top (Emelianov and Artemova, 2019). Multilingual BERT is used only for the embeddings, with no fine-tuning for the tasks.
TLR used a standard end-to-end architecture for sequence labeling, namely LSTM-CNN-CRF (Ma and Hovy, 2016). It was combined with contextual embeddings using a weighted average (Reimers and Gurevych, 2019) of a BERT model pre-trained for multiple languages (including all of the languages of the Task).
As seen from these descriptions, most of the teams use the BERT model, except NLP Cube, which uses another deep learning model (LSTM), and JRC-TMA-CC, which uses rule-based processing of Slavic inflection.
Figure 3 shows system performance averaged across all languages and the two test corpora. We present results for seven teams, since CTC-NER submitted results only for Russian. For each team, we present their best-performing model. As the plots show, the best performing model, CogComp, yields an F-measure of 91% according to the relaxed partial evaluation, and 85.6% according to the strict evaluation. Also, the only hybrid model, JRC-TMA-CC, reaches the highest precision (93.7% relaxed partial, and 88.6% strict) but lower recall (54.4% relaxed partial, 42.7% strict).

Evaluation Results
Five teams submitted results for cross-lingual entity linking. The best results for each team, averaged across the two corpora, are presented in Figure 4 and in Table 2. The best performing model, IIUWR.PL, yields an F-measure of 45%. As seen from the plot, for this task it is harder to balance recall and precision: the first two models obtain much higher precision, while the last three obtain much higher recall. The two best-performing models used rule-based entity linking. Note that in our setting the performance on entity linking depends on performance on name recognition and normalization: a system had to link entities that it extracted from documents upstream, rather than link a correct set of entities.
Tables 3 and 4 present the F-measure for all tasks, split by language, for the RYANAIR and NORD STREAM datasets; Table 2 shows performance on the final phase, cross-lingual entity linking. We show one top-performing model for each team. For recognition, we present only the relaxed evaluation, since results obtained on the three evaluation schemes are correlated, as can be seen from Figure 3.
The tables indicate that the test corpora present approximately the same level of difficulty for the participating systems, since the values in both tables are similar. The only exception is single-language document linking, which seems to be much harder for the RYANAIR dataset, especially for Russian. This needs to be investigated further.
In Table 5 we present the results of the evaluation by entity type. As seen in the table, performance was higher overall for LOC and PER, and substantially lower for ORG and PRO, which corresponds with our findings from the First shared task, where ORG and MISC were the most problematic categories (Piskorski et al., 2017). The PRO category also exhibits higher variation across languages and corpora than other categories, which might point to some annotation artefacts. The results for the EVT category are less informative, since there are few examples of this category in the dataset, as seen in Table 1.
To stimulate further research into NER for Slavic languages, including cross-lingual entity linking, our training and test datasets, the detailed annotations, and scripts used for evaluations are made available to the public on the Shared Task's Web page. The annotation interface is released by the Inforex team, to support annotation of additional data for expanded future tests.
This challenge covered four Slavic languages. For future editions of the Challenge, we plan to expand the training and test datasets, covering a wider range of entity types, and supporting cross-lingual entity linking. We also plan to cover a wider set of languages, including non-Slavic ones, and recruit more annotators as the SIGSLAV community expands. We will also undertake further refinement of the underlying annotation guidelines, which is always a highly complex task in a real-world setting. More complex phenomena also need to be addressed, e.g., coordinated NEs, contracted versions of multiple NEs, etc.
We hope that this work will stimulate research into robust, end-to-end NER solutions for processing real-world texts in Slavic languages.