TLR at BSNLP2019: A Multilingual Named Entity Recognition System

This paper presents our participation in the shared task on multilingual named entity recognition at BSNLP2019. Our strategy is based on a standard neural architecture for sequence labeling. In particular, we use a mixed model which combines multilingual contextual and language-specific embeddings. Our only submitted run is based on a voting schema over multiple models, one for each of the four languages of the task (Bulgarian, Czech, Polish, and Russian) and another for English. Results for named entity recognition are encouraging for all languages, ranging from 60% to 83% in terms of the Strict and Relaxed metrics, respectively.


Introduction
Correctly detecting mentions of entities in text documents in multiple languages is a challenging task (Ji et al., 2014, 2015; Ji and Nothman, 2016; Ji et al., 2017). This is especially true when documents relate to news, because of the huge range of topics covered by newspapers. In this context, the shared task on multilingual named entity recognition (NER) invites participants to test their systems under a multilingual setup. Four languages are addressed in BSNLP2019: Bulgarian (bg), Czech (cz), Polish (pl), and Russian (ru). Similarly to the first edition of this task in 2017 (Piskorski et al., 2017), participants are required to recognize, normalize, and link entities from raw texts written in multiple languages. Our participation focuses solely on the recognition of entities; the other steps will be covered in our future work.
In order to build a unique NER system for multiple languages, we decided to contribute a solution based on an end-to-end system with little to no language-specific pre-processing. We explored an existing neural architecture, the LSTM-CNNs-CRF (Ma and Hovy, 2016), initially proposed for NER in English. This neural model is based on word embeddings to represent each token in a sentence. In order to have a unique embedding space, we propose to use a transformer-based (Vaswani et al., 2017) contextual embedding called BERT (Devlin et al., 2019). This pre-trained model includes multilingual representations that are context-aware. However, as noted by Reimers and Gurevych (2019), contextual embeddings provide multiple layers that are challenging to combine. To overcome this problem, we used the weighted average strategy they successfully tested with ELMo (Peters et al., 2018).
The results of our participation are quite encouraging. Regarding the Relaxed Partial metric, our run achieves 80.26% on average over the four languages and the two topics that compose the test collection. In order to present comparative results against the state of the art, we ran experiments using two extra datasets under the standard CoNLL evaluation setup. The remainder of this paper is organized as follows: Section 2 introduces the related work, while Section 3 presents the proposed multilingual model. Section 4 presents the results, and conclusions are drawn in Section 5.

Related Work
Named entity recognition has been largely studied through the organization of shared tasks over the last two decades (Nadeau and Sekine, 2007; Yadav and Bethard, 2018). The large variety of models can be grouped into three types: rule-based (Chiticariu et al., 2010), gazetteer-based (Sundheim, 1995), and statistical models (Florian et al., 2003). The latter type is a current hot topic in research, in particular with the return of neural models. Two main contributions have recently redrawn the landscape of models for sequence labelling tasks such as NER: the proposal of new architectures (Ma and Hovy, 2016; Lample et al., 2016) and the use of contextualized embeddings (Peters et al., 2018; Reimers and Gurevych, 2019), or even the use of both of them (Devlin et al., 2019). The use of contextualized embeddings is a clear advantage for several kinds of neural NER systems; however, as pointed out by Reimers and Gurevych (2019), combining the multiple vectors proposed by these models is computationally expensive.

TLR System: A Neural-based Multilingual NER Tagger
This section describes our model, which is based on a standard end-to-end architecture for sequence labeling, namely LSTM-CNNs-CRF (Ma and Hovy, 2016). We combined this architecture with contextual embeddings using a weighted average strategy (Reimers and Gurevych, 2019) applied to a pre-trained model covering multiple languages (Devlin et al., 2019), including all languages of the task. We trained a NER model for each of the four languages and predict labels based on a classical voting strategy. As an example, the overall architecture of our model for Polish using the sentence "Wielka Brytania z zadowoleniem przyjęła porozumienie z Unią Europejską" (or "United Kingdom welcomes agreement with the European Union" in English) is depicted in Figure 1.
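The per-token feature assembly described in the following subsections can be sketched as below. The extractors are stand-in stubs (random vectors) to show the data flow only; the char-CNN output size and the case features are assumptions, while the fastText (300) and BERT (768) dimensions follow the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in feature extractors; a real system would call the trained models.
def fasttext_vec(tok, lang):   # language-specific embedding (300-d)
    return rng.standard_normal(300)

def bert_vec(tok):             # multilingual contextual embedding (768-d)
    return rng.standard_normal(768)

def char_cnn(tok):             # char-level representation (size assumed)
    return rng.standard_normal(30)

def case_encoding(tok):        # simple capitalization features (assumed)
    return np.array([tok.istitle(), tok.isupper()], dtype=float)

def token_features(tokens, lang):
    """Concatenate all per-token features into one matrix, which the
    BiLSTM-CRF tagger then consumes to emit one NER tag per token."""
    return np.stack([np.concatenate([fasttext_vec(t, lang), bert_vec(t),
                                     char_cnn(t), case_encoding(t)])
                     for t in tokens])

feats = token_features(["Wielka", "Brytania"], "pl")
# feats.shape == (2, 1100): two tokens, 300 + 768 + 30 + 2 features each
```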

FastText Embedding
In this layer, we used pre-trained embeddings for each language, trained on Common Crawl and Wikipedia using fastText (Bojanowski et al., 2017; Grave et al., 2018). These models were trained using the continuous bag-of-words (CBOW) strategy with position weights, 300 dimensions, character n-grams of length 5, a window of size 5, and 10 negative samples. The four languages of the task are covered by these publicly available pre-trained embeddings (Grave et al., 2018). We used the fastText library to ensure that every token (including tokens in other alphabets) has a corresponding vector, thus avoiding out-of-vocabulary tokens.
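The no-OOV property comes from fastText's subword mechanism: a word vector is the normalized sum of hashed character n-gram vectors, so any string maps to a vector. A minimal sketch of that mechanism, with the trained embedding matrix replaced by seeded random vectors for illustration:

```python
import zlib
import numpy as np

DIM, BUCKETS, NGRAM = 300, 2_000_000, 5

def char_ngrams(token):
    padded = f"<{token}>"                 # fastText-style word boundary markers
    if len(padded) <= NGRAM:
        return [padded]
    return [padded[i:i + NGRAM] for i in range(len(padded) - NGRAM + 1)]

def subword_vector(token):
    # Deterministic hash -> bucket -> pseudo-embedding lookup; a real model
    # would look the bucket up in a trained matrix instead.
    vec = np.zeros(DIM)
    for ng in char_ngrams(token):
        bucket = zlib.crc32(ng.encode("utf-8")) % BUCKETS
        vec += np.random.default_rng(bucket).standard_normal(DIM)
    return vec / len(char_ngrams(token))

v = subword_vector("Brytania")            # even unseen tokens get a vector
```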

Multilingual BERT
We used the multilingual pre-trained embeddings of BERT. In particular, we used the model trained on 104 languages, including the four of this task. This model is composed of 12 layers with 768 dimensions each, for a total of 110M parameters. Directly using all 12 layers can be hard to compute on a desktop computer. To cope with this problem, we used the weighted average strategy proposed by Reimers and Gurevych (2019) and combined only the first two layers. When a token was composed of multiple BERT subword tokens, we averaged them to obtain a unique vector per token.
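Both steps can be sketched as follows, assuming scalar softmax-normalized layer weights as in Reimers and Gurevych (2019); the shapes (seq_len × 768 per layer) follow the multilingual BERT base model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_layer_average(layer_states, layer_weights):
    """Combine L hidden-state matrices (each seq_len x 768) into one,
    using softmax-normalized scalar weights, one per layer."""
    w = softmax(np.asarray(layer_weights, dtype=float))
    stacked = np.stack(layer_states)              # (L, seq_len, 768)
    return (w[:, None, None] * stacked).sum(axis=0)

def merge_subwords(piece_vectors, groups):
    """Average the BERT subword vectors of each original token;
    `groups` lists, per token, the row indices of its pieces."""
    return np.stack([piece_vectors[idx].mean(axis=0) for idx in groups])
```

In a trained system the layer weights would be learned jointly with the tagger; with equal weights the strategy reduces to a plain mean of the selected layers.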

Char Representation
We used the character representation strategy proposed by Ma and Hovy (2016), where character embeddings are combined using a convolutional neural network (CNN). Thus, an embedding vector is learned for each character by iterating through the entire collection. Note that the four languages include unique characters, which makes sharing patterns between languages harder. To deal with this problem, we transliterated each token to the Latin alphabet using the unidecode library as a preprocessing step. This conversion is only applied at this layer and is not used elsewhere.
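The transliteration step can be approximated with the standard library. The paper uses unidecode, which also transliterates Cyrillic; the NFKD-based fallback below only strips combining diacritics and is shown for illustration only.

```python
import unicodedata

def to_latin(token):
    # Decompose accented characters and drop the combining marks,
    # e.g. "přijala" -> "prijala". Unlike unidecode, this does NOT
    # transliterate Cyrillic script.
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```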

Language-Dependent and Independent Features
In Figure 1, we observe that the "char representation", "multilingual BERT", and "case encoding" layers are language-independent features. All the corresponding processing steps are applied without considering the language, including the transliteration to the Latin alphabet. It means that some tokens are transliterated even when they are already in a Latin alphabet. On the other hand, the "fastText embedding" layer is clearly a language-dependent feature. However, we intentionally reduce the language dependency by instantiating the architecture in Figure 1 as many times as the number of languages involved in the task, i.e. four times. Each time, we switched the "fastText embedding" model for the one corresponding to each language, which makes a total of four different NER models. Our final prediction is obtained by applying a simple majority voting schema between these four NER models.

Figure 1: Architecture of a single-language model of our system. Note that for each token we provide a unique NER prediction.
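The majority vote itself can be sketched as below; the tie-breaking rule (first-seen tag wins) is an assumption, as the paper does not state how ties between the four models are resolved.

```python
from collections import Counter

def vote(per_model_tags):
    """per_model_tags: one BIO tag sequence per language model; returns
    the per-token majority tag (ties: first-seen tag wins)."""
    voted = []
    for token_tags in zip(*per_model_tags):
        voted.append(Counter(token_tags).most_common(1)[0][0])
    return voted

tags = vote([["B-ORG", "O"], ["B-ORG", "O"], ["B-LOC", "O"], ["B-ORG", "B-PER"]])
# -> ["B-ORG", "O"]: three of four models agree on each token
```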

Experimental Setup
We followed the configuration proposed by the task organizers. Two topics, "nord stream" and "ryanair", were used to test our models. These topics include 1100 documents in the four languages. Further details can be found in the 2019 shared task overview paper (Piskorski et al., 2019). For training, we used the documents provided for this task, but also the Czech, Polish, and Russian documents from the previous round of the same task in 2017 (Piskorski et al., 2017). We additionally added the training examples from the CoNLL2003 collection in English (Sang and De Meulder, 2003): 13,879 train, 3,235 dev, and 3,422 test sentences. We report the officially proposed metrics as well as the standard F1 metric for the CoNLL2003 dataset.

Official Results
The official results of our unique run are presented in Table 1, identified as TLR-1. Note that only NER metrics are presented for the four languages. We have added the results for each language model using the partial annotations provided by the organizers. Each result is identified with the language used for the "fastText embedding" layer in Figure 1. Based on strict recognition, in most of the cases (6 out of 8, with differences smaller than 0.4 points), the use of the correct language embedding improves the recognition of the respective language. However, the voting schema outperforms the individual models on average. This suggests that a system aware of the language of the input sentence could provide better results than our voting schema.

Unofficial Results
In order to compare our system to the state of the art, we evaluated our architecture on the CoNLL2003 dataset. Our results using two and six layers are presented in Table 2. Note that English is not part of our target languages, so an underperformance of 2.5 points is acceptable for our system; more experiments using the English-only BERT model will be performed in our future work. It is also worth noting that using more BERT layers improves our results. However, the amount of memory used also increases manifold. We set the number of layers (a hyperparameter) to two due to our computational constraints, despite the performance degradation for English.
The number of epochs (also a hyperparameter) was set using the BSNLP2017 dataset (for ru, cs, and pl) combined with CoNLL2003 as a validation set for our final models. Results for these combined datasets are presented in Table 3. Surprisingly, our results are very similar regardless of the fastText embedding used. This suggests that our architecture is able to generalize its predictions across several target languages. Note that the worst results are obtained by the Bulgarian model, but no test examples were included for this language. In contrast, we believe that the examples provided in the other languages were rich enough to help the predictions (also in English).

Conclusion
This paper presents the TLR participation in the shared task on multilingual named entity recognition at BSNLP2019. Our system is a combination of multiple representations, including character information, multilingual embeddings, and language-specific embeddings. However, we combine them in such a way that the result can be seen as a generic multilingual NER system for a large number of languages (104 in total). Although the top participants outperform our average score of 80.26% on "Relaxed Partial" (Piskorski et al., 2019), the strength of the proposed strategy lies in the fact that it can be easily adapted to new languages and topics without extra effort.

Table 1 :
Evaluation results of our TLR submission. We have added extra results for the strict metric using each single model based on one of the four languages.

Table 2 :
Evaluation results on the CoNLL2003 dataset, an English-only dataset.