Tuning Multilingual Transformers for Language-Specific Named Entity Recognition

Our paper addresses multilingual named entity recognition for four languages: Russian, Bulgarian, Czech, and Polish. We solve this task with the BERT model, using a multilingual model covering one hundred languages as the base for transfer to the four Slavic languages. Unsupervised pre-training of the BERT model on these four languages allows it to significantly outperform baseline neural approaches and multilingual BERT. A further improvement is achieved by extending BERT with a word-level CRF layer. Our system was submitted to the BSNLP 2019 Shared Task on Multilingual Named Entity Recognition and demonstrated top performance in the multilingual setting for two competition metrics. We have open-sourced the NER models and the BERT model pre-trained on the four Slavic languages.


Introduction
Named Entity Recognition (further, NER) is the task of recognizing named entities in running text and detecting their type. For example, in the sentence Asia Bibi is from Pakistan, the following NER classes can be detected: [Asia Bibi] PER is from [Pakistan] LOC . The commonly used BIO annotation for this sentence is shown in Figure 1.
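To make the BIO scheme concrete, a minimal sketch of the annotation for the example sentence (the token and tag lists here are illustrative, mirroring Figure 1):

```python
# BIO ("Begin / Inside / Outside") tags for the example sentence.
# B-X opens an entity of type X, I-X continues it, O marks non-entity tokens.
tokens = ["Asia", "Bibi", "is", "from", "Pakistan", "."]
tags   = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]

for tok, tag in zip(tokens, tags):
    print(f"{tok}\t{tag}")
```

The B-/I- distinction is what lets the scheme separate two adjacent entities of the same type.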
A named entity recognizer can be trained on a single target-task dataset like any other sequence tagging model. However, it often benefits from additional data from a different source, either labeled or unlabeled, which is known as transfer learning. To enrich the model, one can either train it on several tasks simultaneously (Collobert et al., 2011), which makes its word representations more flexible and robust, or pre-train it on large amounts of unlabeled data to exploit the virtually unlimited sources available on the Web and then fine-tune it on a specific task (Dai and Le, 2015; Howard and Ruder, 2018).
One of the most powerful unsupervised models is BERT (Devlin et al., 2018), a multi-layer Transformer trained on the objectives of masked word recovery and next sentence prediction. The original model was trained on vast amounts of data covering 104 languages, which makes its representations useful for almost any task. Our contribution is three-fold: first, multilingual BERT embeddings with a dense layer on top clearly beat a BiLSTM-CRF over FastText embeddings trained on the four target languages. Second, language-specific BERT, trained only on Wikipedia and a news dump for the target languages, significantly outperforms multilingual BERT. Third, we adopt a CRF layer as the top module over the outputs of the BERT-based model and demonstrate that it improves performance even further.

Model Architecture
Our model extends the recently introduced BERT (Devlin et al., 2018) model. BERT itself is a multi-layer Transformer (Vaswani et al., 2017) which takes as input a sequence of subtokens, obtained using WordPiece tokenization (Wu et al., 2016), and produces a sequence of context-based embeddings of these subtokens. When a word-level task such as NER is being solved, the embeddings of word-initial subtokens are passed through a dense layer with softmax activation to produce a probability distribution over output labels. We refer the reader to the original paper for details; see also Figure 2.
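The selection of word-initial subtoken embeddings can be sketched as follows; the "##" continuation prefix follows WordPiece convention, and the hidden states are a random stand-in for real BERT outputs:

```python
import numpy as np

# Toy WordPiece-style tokenization: continuation pieces start with "##".
subtokens = ["Asia", "Bi", "##bi", "is", "from", "Pakistan"]
hidden = np.random.rand(len(subtokens), 8)   # stand-in for BERT hidden states

# Keep only word-initial subtokens; these positions feed the dense+softmax head.
initial_mask = [not t.startswith("##") for t in subtokens]
word_states = hidden[initial_mask]

print(word_states.shape)  # one vector per word: (5, 8)
```

Only these five vectors receive a label, so the output sequence length matches the number of words, not subtokens.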
We modify BERT by adding a CRF layer in place of the dense one; such layers are commonly used in other works on neural sequence labeling (Lample et al., 2016) to ensure output consistency. Like the dense layer, it transforms a sequence of word-initial subtoken embeddings into a sequence of probability distributions; however, each prediction depends not only on the current input but also on the previous prediction.
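A minimal sketch of how a trained CRF layer decodes, assuming per-token emission scores and a learned tag-to-tag transition matrix (both invented here for illustration); Viterbi search picks the globally best tag sequence rather than the per-token argmax:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Most likely tag sequence given per-token emission scores (steps x tags)
    and a tag-to-tag transition score matrix (the CRF parameters)."""
    n_steps, n_tags = emissions.shape
    score = emissions[0].copy()
    backptr = []
    for t in range(1, n_steps):
        # score of reaching each tag j at step t via every previous tag i
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for ptrs in reversed(backptr):
        best.append(int(ptrs[best[-1]]))
    return best[::-1]

# Tiny example with 2 tags; the transition score strongly penalizes 0 -> 0,
# so the greedy per-token choice [0, 0, 1] is overridden.
emissions = np.array([[2.0, 1.0], [2.0, 1.9], [0.5, 1.0]])
transitions = np.array([[-5.0, 0.0], [0.0, 0.0]])
print(viterbi_decode(emissions, transitions))  # [0, 1, 1]
```

This is exactly the kind of consistency constraint useful for BIO tagging, e.g. forbidding I-PER directly after O.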

Transfer from Multilingual Language Model
There are two basic options for building a multilingual system: to train a separate model for each language or to use a single multilingual model for all languages. We follow the second approach, since it enriches the model with data from related languages, which was shown to be beneficial in recent studies (Mulcaire et al., 2018). The original BERT embedder is essentially multilingual, since it was trained on the 104 languages with the largest Wikipedias 1 . However, for our four Slavic languages (Polish, Czech, Russian, and Bulgarian) we do not need the full inventory of multilingual subtokens. Moreover, the original WordPiece tokenization may lack Slavic-specific n-grams, which makes the input sequence longer and the training process more problematic and computationally expensive.
Hence, we retrain the Slavic BERT on stratified Wikipedia data for Czech, Polish, and Bulgarian and news data for Russian. Our main innovation is the training procedure: training BERT from scratch is extremely expensive computationally, so we initialize our model with the multilingual one. We rebuild the vocabulary of subword tokens using subword-nmt 2 . When a single Slavic subtoken consists of multiple multilingual subtokens, we initialize it as the average of their vectors, resembling (Bojanowski et al., 2016). All weights of the transformer layers are initialized with the multilingual weights.

Target Task and Dataset
The 2019 edition of the Balto-Slavic Natural Language Processing (BSNLP) (Piskorski et al., 2019) shared task aims at recognizing mentions of named entities in web documents in Slavic languages. The input text collection consists of sets of news articles from online media, each collection revolving around a certain entity or an event. The corpus was obtained by crawling the web and parsing the HTML of relevant documents. The 2019 edition of the shared task covers 4 languages (Bulgarian, Czech, Polish, Russian) and focuses on recognition of five types of named entities including persons (PER), locations (LOC), organizations (ORG), events (EVT) and products (PRO).
The dataset consists of pairs of files: a news text and a file with mentions of entities and their tags. The training part of the dataset contains two groups of documents: news about Asia Bibi and news about Brexit. The Brexit part is substantially bigger, therefore we used it for training and the Asia Bibi part for validation.

Pre- and Post-processing
We use NLTK (Loper and Bird, 2002) sentence tokenizers for Polish and Czech. Due to the absence of a Bulgarian sentence tokenizer in NLTK, we apply the English one instead. For Russian we use the DeepMIPT sentence tokenizer 3 . We replace all UTF separators and space characters with regular spaces. Due to the mismatch between the BSNLP 2019 data format and the common format for tagging tasks, we first convert the dataset to BIO format to obtain training data. After getting predictions in BIO format, we transform them back to the labeling scheme proposed by the Shared Task organizers. This step may introduce extra errors, so we partially correct them using post-processing.
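The span-to-BIO conversion can be sketched as follows; the token-index span format here is a simplified stand-in for the actual BSNLP annotation format, which lists entity surface forms:

```python
def spans_to_bio(tokens, spans):
    """Convert (start_token, end_token, type) entity spans to BIO tags.
    A simplified stand-in for the BSNLP-to-BIO conversion step;
    `end` is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype
        for i in range(start + 1, end):
            tags[i] = "I-" + etype
    return tags

tokens = ["Asia", "Bibi", "is", "from", "Pakistan", "."]
spans = [(0, 2, "PER"), (4, 5, "LOC")]
print(spans_to_bio(tokens, spans))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O']
```

The inverse step (BIO back to the Shared Task scheme) simply collects each maximal B-/I- run into one mention.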
We found that the model sometimes predicts a single opening quote without a closing one, so we filter out all unpaired quotation marks in the predicted entities. At the prediction stage we perform inference over a sliding window of two sentences with overlap to reduce the impact of sentence tokenization errors.
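The sliding-window construction can be sketched as follows (a minimal version; how predictions from overlapping windows are reconciled is left out):

```python
def two_sentence_windows(sentences):
    """Overlapping windows of two consecutive sentences, so every sentence
    except the first is also seen together with its left context."""
    if len(sentences) < 2:
        return [tuple(sentences)]
    return [(sentences[i], sentences[i + 1]) for i in range(len(sentences) - 1)]

sents = ["S1.", "S2.", "S3."]
print(two_sentence_windows(sents))
# [('S1.', 'S2.'), ('S2.', 'S3.')]
```

Because each interior sentence appears in two windows, an entity split by a wrong sentence boundary still appears whole in at least one window.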
The Shared Task also included an entity normalization subtask: for example, the phrase "Верховным судом Пакистана" (Supreme+Ins Court+Ins of Pakistan+Gen) should be normalized to "Верховный суд Пакистана". We used the UDPipe 2.3 (Straka et al., 2016) lemmatizers, whose output was corrected using language-specific rules. For example, "Пакистана" (Pakistan+Gen) should not be lemmatized, because in Russian noun modifiers remain in the Genitive.

Model Parameters
Below we give the parameters used for transferring multilingual BERT to the Slavic languages. The training took 9 days on a DGX-1 with eight P-100 16Gb GPUs. We train BERT in two stages: first we train the full BERT on sequences of 128 subtokens, then we train only the positional embeddings on sequences of length 512. We found that both initialization from multilingual BERT and reassembling the embeddings speed up the convergence of the model. The parameters of all BERT-based NER models are:

• Batch size: 16
• BERT layers learning rate: 1e-5
• Top layers learning rate: 3e-4
• Optimizer: AdamOptimizer

In contrast to the original BERT paper (Devlin et al., 2018), we use different learning rates for the task-specific top layers and the BERT layers when training BERT-based NER models. We found that this modification leads to faster convergence and higher scores. We evaluate the model every 10 batches on the whole validation set and choose the one that performs best on it. Although this strategy is very time consuming, we found it crucial for gaining an extra couple of points. For all experiments we used the span F1 score for validation.
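The two-learning-rate setup can be sketched as a simple parameter grouping; the parameter names here are illustrative, not the actual variable names of our implementation:

```python
# Assign a smaller learning rate to the BERT body than to the task-specific
# head, per the hyperparameters listed above. Names are hypothetical.
BERT_LR, TOP_LR = 1e-5, 3e-4

def lr_for(param_name):
    """Route BERT-body parameters to the small LR, everything else to the
    larger head LR."""
    return BERT_LR if param_name.startswith("bert.") else TOP_LR

params = ["bert.encoder.layer.0.weight", "bert.embeddings.weight",
          "classifier.weight", "crf.transitions"]
groups = {name: lr_for(name) for name in params}
print(groups)
```

In a framework like TensorFlow or PyTorch this corresponds to building separate optimizer parameter groups, one per learning rate.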
Our best model uses a CRF layer and maintains exponential moving averages of the model parameters.
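One EMA update step can be sketched as follows (the decay value is illustrative; the shadow copy, not the raw weights, is used at evaluation time):

```python
def ema_update(shadow, params, decay=0.999):
    """One step of exponential moving average over model parameters:
    shadow <- decay * shadow + (1 - decay) * params."""
    return {k: decay * shadow[k] + (1 - decay) * params[k] for k in shadow}

# Toy one-parameter model to show the update rule.
shadow = {"w": 1.0}
params = {"w": 0.0}
shadow = ema_update(shadow, params, decay=0.9)
print(shadow)  # {'w': 0.9}
```

Averaging smooths out the noise of individual gradient steps, which typically yields a slightly better evaluation checkpoint.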

Results
We evaluated the Slavic BERT NER model on the BSNLP 2019 Shared Task dataset. The model is compared with two baselines: a Bi-LSTM-CRF (Lample et al., 2016) and a NER model based on multilingual BERT. For the Bi-LSTM-CRF we use FastText word embeddings trained on the same data as Slavic BERT. Table 1 presents the scores of our model on the development set (Asia Bibi documents) when training on the Brexit documents. We report the standard span-level F1-score based on the CoNLL-2003 evaluation script (Sang and De Meulder, 2003) and three official evaluation metrics (Piskorski et al., 2019) 4 : Relaxed Partial Matching (RPM), Relaxed Exact Matching (REM), and Strict Matching (SM). Our system showed top performance in the multilingual setting for all mentioned metrics except RPM.
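The span-level F1 used for validation can be sketched as below, treating entities as (start, end, type) triples; this is a minimal version of the CoNLL-style exact-match scoring, not the official evaluation script:

```python
def span_f1(gold, pred):
    """Span-level F1: a predicted entity counts as correct only if its
    boundaries and type exactly match a gold entity (CoNLL-2003 style)."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (4, 5, "LOC")}
pred = {(0, 2, "PER"), (4, 5, "ORG")}   # one correct span, one wrong type
print(span_f1(gold, pred))  # 0.5
```

Relaxed variants of the official metrics additionally give credit for partial boundary overlap.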
Even without CRF, the multilingual BERT model significantly outperforms the Bi-LSTM-CRF model. Adding a CRF layer strongly increases performance for both the multilingual and the Slavic BERT models. Slavic BERT is the top performing model: the error rate of Slavic BERT-CRF is more than one third lower than that of the multilingual BERT baseline.
We experimented with transfer learning from other NER corpora. We used three corpora as sources for transfer: the Russian NER corpus (Mozharova and Loukachevitch, 2016), the Bulgarian BulTreeBank (Simov et al., 2004; Georgiev et al., 2009), and the BSNLP 2017 Shared Task dataset (Piskorski et al., 2017) 6 with Czech, Russian, and Polish data. For pre-training we use a stratified sample from the concatenated dataset. The set of tags for the task-specific layer includes all tags that occur in at least one dataset. After pre-training we replace the task-specific layer with one suited for the BSNLP 2019 dataset and train until convergence. We find this approach beneficial for models without CRF; however, the CRF-enhanced model without NER pre-training demonstrates slightly higher scores.

Table 2 presents a detailed evaluation report across the 4 languages for the top performing Slavic BERT-CRF model. Note that the languages with Latin script (Polish and Czech) demonstrate higher scores than the Cyrillic-based ones (Russian and Bulgarian). The low scores for Russian might be caused by dataset imbalance, since Russian covers only 7.7% of the whole BSNLP dataset; however, Bulgarian constitutes 39% but shows even lower quality, especially in terms of recall. We have two explanations: first, incorrect sentence tokenization, since we used the English sentence tokenizer for Bulgarian (this may explain the skew towards precision); second, Russian and Bulgarian are much less closely related to each other than Czech and Polish are, so they gain less from the additional multilingual data.

Releasing the Models
We release the best BERT-based NER model along with the BERT model pre-trained on the four competition languages 7 . We provide the code for inference of our NER model as well as for using the pre-trained BERT. The BERT model is fully compatible with the original BERT repository.

Conclusion
We have established that BERT models pre-trained on task-specific languages and initialized from the multilingual model significantly outperform multilingual baselines on the task of Named Entity Recognition. We also demonstrated that adding a word-level CRF layer on top improves the quality of both extended models. We hope our approach will be useful for fine-tuning language-specific BERTs not only for Named Entity Recognition but for other NLP tasks as well.