JRC TMA-CC: Slavic Named Entity Recognition and Linking. Participation in the BSNLP-2019 shared task

We report on the participation of the JRC Text Mining and Analysis Competence Centre (TMA-CC) in the BSNLP-2019 Shared Task, which focuses on named-entity recognition, lemmatisation and cross-lingual linking. We propose a hybrid system combining a rule-based approach and light ML techniques. We use multilingual lexical resources such as JRC-NAMES and BABELNET together with a named entity guesser to recognise names. In a second step, we combine known names with wild cards to increase recognition recall by also capturing inflection variants. In a third step, we increase precision by filtering these name candidates with automatically learnt inflection patterns derived from name occurrences in large news article collections. Our major requirement is to achieve high precision. We achieved an average of 65% F-measure with 93% precision on the four languages.


Introduction
Multilingual Named Entity Recognition (NER) and the grounding of names to real-world entities are essential components of the JRC TMA-CC's large-scale, multi-annual and highly multilingual media monitoring effort, the Europe Media Monitor (EMM).
EMM has been analysing online news articles since 2003, reaching a current average of 320K articles per day from about 12K news sources in up to 70 languages. EMM clusters related news, categorises them into thousands of categories, detects breaking news and tracks topics over short periods of time. For a subset of about two dozen languages, EMM recognises and disambiguates entity mentions. The EMM-NER component constitutes the backbone of our submissions to the BSNLP-2019 Shared Task.

Approach
We submitted results for four system instances, all of which are based on our in-house NER system NERONE (Ehrmann et al., 2017; Steinberger et al., 2015), which we describe first.
NERONE identifies and disambiguates mentions of persons, organisations, locations, events and products by first looking up known names and then guessing new names. The list of known names contains about 1.2 million names: 600 000 unique entities have an average of 2 variants each, the largest number of variants for a single entity being 6 200. The guessing of new names is based on large lexical resources (1.5 million entries) and ca. 200 language-agnostic recognition patterns using the finite-state formalism described in (Piskorski, 2007). NERONE continuously updates the list of known names: newly guessed names can become part of the list if they are considered reliable enough. Reliability is mostly based on the frequency of the newly guessed name, the number of languages in which it appears and the number of sources in which it appears. Once eligible, a name is automatically added as a new known name or merged as a new variant of an existing name, including across languages and scripts (Steinberger et al., 2011). On average, 150 new variants and new names are automatically added to the list of known names every day. This list of known names (JRC-NAMES) is distributed publicly, together with the name variants, the titles, and the language and date in which each name was found (Ehrmann et al., 2017). Based on previous work focused on multi-word entities (Jacquet et al., 2019), we furthermore added 2.1 million names and variants of the relevant entity categories from BabelNet (Navigli and Ponzetto, 2012). In the disambiguation steps, names that are part of a larger name are ignored (e.g. John F Kennedy within John F Kennedy Airport), and location names are disregarded if a homographic entity name of another category exists (e.g. Мартин (Martin), which could be both a small city in Slovakia and a person name).
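The two disambiguation rules described above can be illustrated with a minimal sketch. This is not NERONE's implementation: the `Match` type and the `disambiguate` function are hypothetical names, and we assume that lookup matches arrive as character spans labelled with an entity category, with homographic readings reported as separate matches over the same span.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    """A hypothetical lookup match: a character span plus an entity category."""
    start: int      # offset of the first character of the mention
    end: int        # offset just past the last character of the mention
    text: str
    category: str   # e.g. "PER", "LOC", "ORG"


def disambiguate(matches):
    """Apply the two rules from the text to a list of lookup matches."""
    kept = []
    for m in matches:
        # Rule 1: ignore a name that lies inside a strictly longer matched name.
        nested = any(
            o.start <= m.start and m.end <= o.end
            and (o.end - o.start) > (m.end - m.start)
            for o in matches
        )
        if nested:
            continue
        # Rule 2: drop a location reading if the same span also has a
        # reading of another category (homographic entity name).
        if m.category == "LOC" and any(
            o.start == m.start and o.end == m.end and o.category != "LOC"
            for o in matches
        ):
            continue
        kept.append(m)
    return kept
```

With the examples from the text, a PER match for John F Kennedy inside a LOC match for John F Kennedy Airport is dropped, and a LOC reading of Мартин is dropped when a PER reading covers the same span.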
In the remaining part of this Section we describe the four approaches explored, all of which are built on top of NERONE, which is known to have high precision, but low recall. We modified it to extend the recall, knowing that the precision will fall (NERONE with wildcards), then tried different levels of filtering to optimise the balance between precision and recall.
It is important to emphasise at this point that the four NER approaches presented in this paper are JRC's contribution (as one of the co-organisers of the Shared Task) to the provision of 'good' baseline systems to compare against.

JRC-TMA-CC-1: NERONE
The JRC-TMA-CC-1 variant uses NERONE as described above, with only a slight adaptation for location recognition. As our list of known location names (LOC), derived from GeoNames, is very short for some languages, we merged the LOC lists for the Cyrillic-script languages Russian, Bulgarian, Bosnian, Macedonian and Serbian, and we did the same for the Latin-script West Slavic languages Polish, Czech and Slovak. This corresponds to an update of 200 000 entries in the existing resource of 1.3 million location names.

JRC-TMA-CC-2: NERONE + wildcards
In addition to the system used in JRC-TMA-CC-1, we added wildcards to each name part of all entity types except for the GeoNames-derived LOC lists. The objective is to increase recall by also capturing morphological variants of the known names. During morphological inflection, suffixes can be added to the base form of the name (e.g. Andrej Babiš inflected as Andrejem Babišem), but final letters can also be replaced (suffix replacement, e.g. Garbině Muguruzaová inflected as Garbiňe Muguruzaovou). We therefore removed the last two letters of each name part and added a wildcard (Garbině Muguruzaová would become Garbi% Muguruzao%). To avoid over-generating wildcard patterns, we did not remove letters from name parts that are three letters or shorter, and we removed only one letter from four-letter words. Note that we use the term 'suffix' not in the morphological sense, but simply to denote the final letters of a name string.
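The truncation rule can be sketched as follows. The function names are hypothetical, and two points are assumptions not stated in the text: name parts of three letters or fewer still receive a trailing wildcard (they are simply not truncated), and strings are normalised to Unicode NFC so that a letter with a diacritic counts as one letter.

```python
import unicodedata


def wildcard_part(part: str) -> str:
    """Turn one name part into a wildcard pattern ending in '%'."""
    # Assumption: NFC normalisation, so that e.g. 'š' counts as one letter.
    part = unicodedata.normalize("NFC", part)
    if len(part) <= 3:
        stem = part          # short parts: no letters removed
    elif len(part) == 4:
        stem = part[:-1]     # four-letter parts: remove one letter only
    else:
        stem = part[:-2]     # longer parts: remove the last two letters
    return stem + "%"


def wildcard_name(name: str) -> str:
    """Apply the truncation rule to every whitespace-separated name part."""
    return " ".join(wildcard_part(p) for p in name.split())
```

For instance, `wildcard_name("Garbině Muguruzaová")` yields `Garbi% Muguruzao%`, and `wildcard_name("Josef Mill")` yields `Jos% Mil%`.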

JRC-TMA-CC-3: NERONE + wildcards and suffix filtering
Due to the vast number of different names, of which some can also be a string subset of longer names, the wildcards do occasionally overgenerate, i.e. capture names that are not variants, but names in their own right (e.g. Josef Mill would create the wildcard pattern Jos% Mil% which would wrongly match Josefa Miller as a possible inflection of Josef Mill). Submission JRC-TMA-CC-3 is based on the previous method, but here we aim to reduce such false positives (increase Precision) by filtering the names matched with the wildcards against a list of the more frequent suffix replacement rules.
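The over-generation problem is easy to reproduce. The sketch below assumes that the wildcard `%` matches zero or more word characters at the end of a name part (the exact matching semantics used in NERONE are not specified here), and `pattern_to_regex` is a hypothetical helper name.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Compile a pattern such as 'Jos% Mil%' into a regular expression.

    Assumption: a trailing '%' matches zero or more word characters.
    """
    parts = []
    for part in pattern.split():
        if part.endswith("%"):
            parts.append(re.escape(part[:-1]) + r"\w*")
        else:
            parts.append(re.escape(part))
    # Name parts are separated by whitespace; the match is bounded by word edges.
    return re.compile(r"\b" + r"\s+".join(parts) + r"\b")


rx = pattern_to_regex("Jos% Mil%")
# A plausible inflected variant of Josef Mill is matched ...
inflected = rx.search("s Josefem Millem") is not None
# ... but so is a different name, which is the false positive to be filtered.
other_name = rx.search("Josefa Miller") is not None
```

Both `inflected` and `other_name` are true here, which is why a further filtering step on the matched endings is needed.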
To create such suffix replacement rules, we first searched an average of 2 million news articles per language for all our known names, using the wildcards described for JRC-TMA-CC-2, to gather possible inflections of names. This resulted in variant frequency lists for each name (see Table 1 for examples of collected variants). We then applied the following algorithm:
1. We hypothesise that the main form according to BabelNet and JRC-NAMES is indeed the main (base) form of the name. We have found good empirical evidence that this is true.
2. Tokenise all the names.
3. For each token T_m from the main variant, find the corresponding token T_d from one of the derivations.
4. Find the common part of the tokens T_m and T_d. For example (cf. first case in Table 1), the common part of Kotleby and Kotleba is Kotleb.
5. Find the difference between the two forms and produce a list of candidate suffix rules. In this last case, the rules are y → a ; by → ba ; eby → eba.
6. When the first token is completely contained in the second one, as with Marian and Mariana, we extract a rule by taking the last two letters of the main form and the corresponding ending of the derived form: an → ana.
7. The inflection rules are gathered and we calculate various statistics, for example the conditional probability that the first part of a rule is transformed into the second part of the rule. The statistics were collected from the list of word variants.
Table 2 shows some examples of inflection rules obtained with this algorithm. This list was then used to filter acceptable inflections according to the initial base form: only those suffix replacement rules that had a probability higher than 0.01 were considered valid. If an inflection found for a name corresponded only to one of the eliminated low-probability suffix replacement rules, it was not considered.
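The rule extraction and the conditional probability can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the common part is taken to be the longest common prefix, candidate rules carry at most two letters of left context (matching the three rules listed for Kotleby and Kotleba), and the probability of a rule is estimated as its frequency-weighted count divided by the total count of all rules sharing the same left-hand side.

```python
from collections import Counter
from os.path import commonprefix


def candidate_suffix_rules(main, derived, max_context=2):
    """Return candidate (main_suffix, derived_suffix) rules for one token pair."""
    if main == derived:
        return []
    common = commonprefix([main, derived])
    if common == main:
        # Containment case (Marian / Mariana): last two letters of the main
        # form, rewritten to those letters plus the extra ending.
        tail = main[-2:]
        return [(tail, tail + derived[len(main):])]
    rules = []
    for k in range(min(max_context, len(common)) + 1):
        # Prepend the last k letters of the common part as context.
        ctx = common[len(common) - k:]
        rules.append((ctx + main[len(common):], ctx + derived[len(common):]))
    return rules


def rule_probabilities(pairs):
    """Estimate P(derived_suffix | main_suffix) from (main, derived, freq) triples."""
    rule_counts, lhs_counts = Counter(), Counter()
    for main, derived, freq in pairs:
        for lhs, rhs in candidate_suffix_rules(main, derived):
            rule_counts[(lhs, rhs)] += freq
            lhs_counts[lhs] += freq
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}
```

Calling `candidate_suffix_rules("Kotleby", "Kotleba")` reproduces the rules y → a, by → ba and eby → eba from step 5, and `candidate_suffix_rules("Marian", "Mariana")` reproduces an → ana from step 6.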

JRC-TMA-CC-4: NERONE + wildcards and less strict suffix filtering
This variant is identical to JRC-TMA-CC-3, except that the filtering threshold is lowered to 0.001.
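The effect of the two thresholds can be illustrated with a small sketch. All names here are hypothetical, and the probabilities in `TOY_RULE_PROBS` are invented for illustration only; they are not values learnt from the news collections. We also assume that a candidate is accepted when every name part is either unchanged or explained by at least one rule whose probability exceeds the threshold.

```python
# Invented toy probabilities, for illustration only (not learnt values).
TOY_RULE_PROBS = {
    ("ef", "efa"): 0.2,
    ("ef", "efem"): 0.15,
    ("ll", "llem"): 0.05,
    ("ll", "ller"): 0.004,
}


def part_accepted(base, candidate, rule_probs, threshold):
    """Check whether one name part is a plausible inflection of its base form."""
    if base == candidate:
        return True
    for (lhs, rhs), prob in rule_probs.items():
        if prob <= threshold:
            continue  # rule eliminated by the probability threshold
        if base.endswith(lhs) and candidate.endswith(rhs):
            # The rule applies only if what precedes the suffixes is identical.
            if base[:len(base) - len(lhs)] == candidate[:len(candidate) - len(rhs)]:
                return True
    return False


def name_accepted(base_name, candidate_name, rule_probs, threshold):
    """Accept a candidate only if every name part passes the suffix filter."""
    base_parts, cand_parts = base_name.split(), candidate_name.split()
    if len(base_parts) != len(cand_parts):
        return False
    return all(
        part_accepted(b, c, rule_probs, threshold)
        for b, c in zip(base_parts, cand_parts)
    )
```

With these toy values, `name_accepted("Josef Mill", "Josefa Miller", TOY_RULE_PROBS, 0.01)` is false while the same call with threshold 0.001 is true, showing how the lower threshold trades precision for recall.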

Results
While the Shared Task was subdivided into three subtasks, namely Entity Recognition, Normalisation and Linking, our contribution focused less on the normalisation subtask and more on recognition, with a priority on precision scores, and on cross-lingual entity linking. Table 3 shows the results obtained by the four systems we submitted. The reported scores are F-measure scores only. For each evaluation category and each language, the score in bold is the highest F-measure obtained. Given the description of the four systems, we expected the JRC-TMA-CC-1 system to obtain high precision but low recall, the JRC-TMA-CC-2 system to obtain high recall but low precision, and JRC-TMA-CC-3 and JRC-TMA-CC-4 to filter the overly noisy recognition of JRC-TMA-CC-2 and deliver a better balance between precision and recall, and therefore a better F-measure. This is what we observed when evaluating on the training set for the four languages. On the test set, the same holds for Polish and Bulgarian, for both the relaxed partial and the strict recognition evaluation; it holds to a lesser extent for Russian on the Nord-Stream topic and for Czech on the Ryanair topic. According to the error logs, these differences appear to be due to the mis-recognition of key entities for these specific topics.
In addition to the F-measure scores reported in Table 3, the high precision scores we obtained for all languages are worth mentioning. We obtained the best precision ranking compared to the other shared task participants for all four languages. Averaged over both topics, our JRC-TMA-CC-4 system obtained a precision of 94.4%, 90.2%, 95.4% and 93.7% for Czech, Russian, Bulgarian and Polish, respectively (at the price of lower recall). This precision is also quite evenly distributed across entity types: for PER, LOC, ORG, PRO and EVT, we obtained 92.4%, 95.9%, 89.2%, 96.0% and 83.3%, respectively.
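To make the precision/recall trade-off concrete, the relation between the three measures can be sketched as follows. We assume the balanced F-measure (F1), i.e. the harmonic mean of precision and recall; the function names are hypothetical. Applying the inverse formula to averaged figures, such as the 93% precision and 65% F-measure quoted in the abstract, gives only an indicative recall, since an F-measure of averages is not the average of F-measures.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(precision: float, f1: float) -> float:
    """Solve F = 2PR / (P + R) for R, giving R = F * P / (2P - F)."""
    denominator = 2 * precision - f1
    if denominator <= 0:
        raise ValueError("no valid recall exists for this precision and F-measure")
    return f1 * precision / denominator


# Indicative only: averaged figures from the abstract, as fractions.
approx_recall = implied_recall(0.93, 0.65)
```

Under these assumptions the implied recall is roughly one half, consistent with the statement that the high precision comes at the price of lower recall.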
The fact that we were able to improve our existing system with quite a simple adaptation is promising and encourages us to pursue this process of name ending/inflection filtering further. Concerning the entity-linking evaluation, Table 3 shows results for each single language and, more importantly, for cross-lingual linking. Despite the low recall of our four systems compared to other teams, our F-measure scores are ranked 2nd for both single-language and cross-lingual linking. We will have to analyse the error logs in more detail to investigate possible improvements. We also observe that in almost all languages and topics, the best results are obtained by the JRC-TMA-CC-2 system, which is most likely correlated with its high recall.

Related Work
NER systems are often the first step in event detection, question answering, information retrieval, co-reference resolution, topic modelling, etc. The first NER task was organised by Grishman and Sundheim (1996) in the Sixth Message Understanding Conference. Early NER systems were based on handcrafted rules (Chiticariu et al., 2010), lexicons, orthographic features and ontologies. These systems were followed by NER systems based on feature engineering and machine learning (Nadeau and Sekine, 2007). Few NER systems address inflected languages such as the Slavic ones. Among others, the task of matching morphological variants of names in Polish text has been tackled by optimising string similarity calculations for inflections. Pajzs et al. (2014) experimented with name lemmatisation and inflection variant generation in Hungarian, a highly inflected and agglutinative language. Gareev et al. (2013) describe NER for the highly inflective Russian language. The first edition of the Shared Task on Slavic NER was organised in the context of BSNLP 2017 (Piskorski et al., 2017).

Conclusions and Future Work
We presented a lightweight method to improve the performance of our in-house NER system NERONE for the recognition and linking of inflected named entities in inflected languages without delving into the morphological rules and proper name declension paradigms of each of the languages. We learnt potential name inflection patterns by searching for suffix variants of known names in large volumes of text. We then changed the known-name lookup part of NERONE by replacing the last letters of each name with wildcards to capture inflectional variants. We used the newly captured potential name inflections to reduce the number of wrong wildcard matches. As expected, we achieved good precision scores (94.4%, 90.2%, 95.4% and 93.7% for Czech, Russian, Bulgarian and Polish, respectively) and unbalanced F-measures, ranging from too low (58.0% and 57.4% for Czech and Polish) to reasonably good (73.5% and 74.2% for Russian and Bulgarian). One of the main drivers for developing the described extension of NERONE was to contribute to the provision of 'good' baseline systems for the BSNLP-2019 Shared Task.
The proposed systems could be improved in many ways, including, among others: (a) expansion of the set of inflection patterns to guess new names; (b) integration of a classifier to distinguish the readings of entities that can designate different entity types (e.g. BBC as an organisation or as a product); (c) expansion of the lookup of geographical names; (d) integration of a mechanism to distinguish the Czech female gender marker -ova from case markers, as it behaves differently: forms such as Merkelova are the Czech nominative base form of the name of the German Chancellor Merkel, and inflections apply to Merkelova instead of to our name list's base form Merkel; (e) introduction of additional heuristics to narrow down the possible name mention matches, since the automatically generated groups of name inflection variants, from which we learn the inflection patterns, contain errors because the wildcards match too generously; and (f) updating and completing our list of geographical names, as the coverage for different languages currently ranges from over 100,000 geographical names to below 3,000.