Multilingual Named Entity Recognition Using Pretrained Embeddings, Attention Mechanism and NCRF

In this paper we tackle the multilingual named entity recognition task. We use the BERT language model as an embedder, with a bidirectional recurrent network, attention, and an NCRF layer on top. We apply multilingual BERT only as an embedder, without any fine-tuning. We test our model on the dataset of the BSNLP shared task, which consists of texts in the Bulgarian, Czech, Polish and Russian languages.


Introduction
Sequence labeling is one of the most fundamental NLP problems and underlies many tasks such as named entity recognition (NER), chunking, word segmentation and part-of-speech (POS) tagging. It has traditionally been investigated with statistical approaches, where conditional random fields (CRF) (Lafferty et al., 2001) have proven to be an effective framework that takes discrete features as the representation of the input sequence (Sathiya and Sellamanickam, 2007). With the advances of deep learning, neural sequence labeling models have achieved state-of-the-art results for many tasks (Peters et al., 2017).
In this paper, we consider a neural network solution for multilingual named entity recognition in Bulgarian, Czech, Polish and Russian for the BSNLP 2019 Shared Task (Piskorski et al., 2019). Our solution is based on the BERT language model (Devlin et al., 2018) and uses a bidirectional LSTM (Hochreiter and Schmidhuber, 1996), Multi-Head attention (Vaswani et al., 2017), NCRF++ (Yang and Zhang, 2018) (a neural network version of the CRF++ framework for sequence labeling), and a Pooling Classifier for language classification on top as a source of additional information.

Data Format
The data consists of raw documents and annotations, provided separately by the organizers. Each annotation contains a set of extracted entities and their types, without duplication. We convert each raw document and its corresponding annotations to a labeled sequence and predict a named entity label for each token in the input sentence. The documents are categorized into topics; the first release of the dataset contains two topics, named "brexit" and "asia bibi".
For more details about the dataset and the task, refer to the description on the shared task web page. In this work we focused on Named Entity Mention Detection (Named Entity Recognition).

System Description
We propose modeling the task as sequence labeling and language classification jointly, with a neural architecture that learns additional information about the text. The model consists of one encoder, which is built from the pretrained multilingual BERT model followed by several trainable layers, and two decoders. The first decoder generates output tags, while the second identifies the language of the input sentence. The system architecture is presented in Figure 1.
The BERT embeddings layer uses Google's original implementation of the multilingual BERT language model. Each sentence is preprocessed as described in the BERT paper (Devlin et al., 2018): 1. Convert the input text sequence to WordPiece tokens (Wu et al., 2016) with a 30,000-token vocabulary and pad to 512 tokens.
2. Prepend the "[CLS]" token and append the "[SEP]" token.
3. Mark all tokens as members of part "A" of the input sequence.
In contrast to the original BERT paper (Devlin et al., 2018), we keep the "B" ("Begin") prefix for labels and predict "X" labels at the training stage. The BERT network is used only to embed the input text and is not fine-tuned during training: we freeze all layers except dropout, which decreases overfitting. Our code is available at https://github.com/anonymize/slavic-ner.
We take the hidden outputs of all BERT layers as the output of this part of the network and pass them to the next level. Thus the output shape is 12 × 768 for each token of the 512-token padded input sequence.
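The frozen-embedder setup described above can be sketched in PyTorch. The `DummyEncoder` below is only a stand-in for the real 12-layer multilingual BERT (loading the actual pretrained weights is out of scope here); what the sketch shows is the freezing pattern, the active dropout, and the 12 × 768 per-token output shape:

```python
import torch
import torch.nn as nn

def freeze_except_dropout(model: nn.Module) -> nn.Module:
    """Freeze all trainable weights; dropout has no parameters and
    stays active in train() mode, which regularizes the frozen embedder."""
    for param in model.parameters():
        param.requires_grad = False
    return model

# Stand-in for the 12-layer multilingual BERT encoder: any module
# whose forward returns one hidden state per layer.
class DummyEncoder(nn.Module):
    def __init__(self, num_layers=12, hidden=768):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(hidden, hidden) for _ in range(num_layers))
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        hidden_states = []
        for layer in self.layers:
            x = self.dropout(torch.tanh(layer(x)))
            hidden_states.append(x)
        # shape: (num_layers, batch, seq_len, hidden) = 12 x B x 512 x 768
        return torch.stack(hidden_states)

encoder = freeze_except_dropout(DummyEncoder())
tokens = torch.randn(2, 512, 768)   # already-embedded WordPiece inputs
states = encoder(tokens)
print(states.shape)                 # torch.Size([12, 2, 512, 768])
```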

BERT Weighting
Here we compute a weighted sum of all BERT hidden outputs from the previous part:

e = γ · Σ_{i=1}^{m} s_i · b_i

where
• m = 12 is the number of hidden layers in BERT;
• b_i is the output of the i-th BERT hidden layer;
• γ and s_i are trainable task-specific parameters.
As we do not fine-tune BERT, we have to adapt its outputs to our specific sequence labeling task. The suggested weighting approach is similar to ELMo (Peters et al., 2018), but with a lower number of weighting parameters s_i. It lets the network learn the importance of each BERT output layer for this task without losing the information about the text that is stored across all BERT outputs.
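This layer-weighting step can be written as a small module. A sketch of our reading of the formula, with softmax-normalized weights as in the ELMo paper (the normalization is an assumption; the section above does not state it explicitly):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerWeighting(nn.Module):
    """ELMo-style scalar mixing of the m = 12 BERT hidden layers:
    e = gamma * sum_i softmax(s)_i * b_i."""
    def __init__(self, num_layers: int = 12):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(num_layers))  # per-layer weights
        self.gamma = nn.Parameter(torch.ones(1))        # task-specific scale

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, hidden)
        w = F.softmax(self.s, dim=0).view(-1, 1, 1, 1)
        return self.gamma * (w * hidden_states).sum(dim=0)

mix = LayerWeighting()
h = torch.randn(12, 2, 512, 768)   # all-layer BERT outputs
out = mix(h)
print(out.shape)  # torch.Size([2, 512, 768])
```

At initialization the weights are uniform, so the layer starts out as a plain average of the 12 outputs and learns task-specific importances during training.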

Recurrent Part
This part contains two LSTM networks for the forward and backward passes, each with 512 hidden units, so that the output representation is 1024-dimensional for each token. We use a recurrent layer to learn dependencies between tokens in the input sequence (Hochreiter and Schmidhuber, 1996).
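In PyTorch, a bidirectional LSTM with these dimensions is a one-liner; a minimal sketch of the layer described above:

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over the 768-dim mixed BERT embeddings; with 512
# hidden units per direction the per-token representation is 1024-dim.
rnn = nn.LSTM(input_size=768, hidden_size=512,
              batch_first=True, bidirectional=True)

x = torch.randn(2, 512, 768)      # (batch, seq_len, embedding_dim)
output, (h_n, c_n) = rnn(x)
print(output.shape)               # torch.Size([2, 512, 1024])
```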

Multi-Head Attention
After the recurrent layer, we apply a self-attention mechanism to capture additional dependencies between tokens in a sequence. This can be denoted as D(d_h | S), where D is some hidden dependency, d_h is the h-th attention head, and S is the whole sequence. Each head can learn its own dependencies, such as morphological, syntactic or semantic relationships between words (tokens); presumably, these dependencies may look as shown in Figure 2. The attention mechanism can also compensate for the limitations of the recurrent layer when working with long sequences (Bahdanau et al., 2015). In our architecture, we use the multi-head attention block proposed in "Attention Is All You Need" (Vaswani et al., 2017), with 6 heads and key and value dimensions of 64.
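With 6 heads of dimension 64 over a 1024-dim model, the per-head projections do not tile the model dimension evenly, so a sketch with explicit projection matrices (rather than a stock layer that requires the embedding size to be divisible by the head count) is closer to the Vaswani et al. formulation assumed here:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product self-attention (Vaswani et al., 2017) with
    h = 6 heads and per-head key/value dimension 64; the concatenated
    head outputs are projected back to the 1024-dim model size."""
    def __init__(self, d_model=1024, heads=6, d_k=64):
        super().__init__()
        self.heads, self.d_k = heads, d_k
        self.w_q = nn.Linear(d_model, heads * d_k)
        self.w_k = nn.Linear(d_model, heads * d_k)
        self.w_v = nn.Linear(d_model, heads * d_k)
        self.w_o = nn.Linear(heads * d_k, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        def split(proj):  # (b, t, h*d_k) -> (b, h, t, d_k)
            return proj(x).view(b, t, self.heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q), split(self.w_k), split(self.w_v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        attn = F.softmax(scores, dim=-1)          # (b, h, t, t)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.w_o(ctx)

attn = MultiHeadSelfAttention()
y = attn(torch.randn(2, 512, 1024))
print(y.shape)  # torch.Size([2, 512, 1024])
```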

Inference for NER Task
After the input sequence is encoded, we obtain the final representation of each token. This representation is passed to a linear layer with a tanh activation, producing a 14-dimensional vector, which equals the number of entity labels (including the auxiliary "pad" and "[CLS]" labels). The inference layer takes the extracted token sequence representations as features and assigns labels to the token sequence. As the inference layer, we use an NCRF++ layer instead of a vanilla CRF; it captures label dependencies by adding transition scores between neighboring labels. NCRF++ supports a CRF trained with the sentence-level maximum log-likelihood loss. During decoding, the Viterbi algorithm is used to search for the label sequence with the highest probability. NCRF++ also extends the decoding algorithm with support for n-best output (Yang and Zhang, 2018). We set the n-best parameter to 11, matching our 11 meaningful labels; in this decision we followed the original article (Yang and Zhang, 2018).
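The Viterbi decoding step inside the CRF can be illustrated with a bare-bones sketch. This is a simplified 1-best version (no start/stop transitions, no n-best), not the actual NCRF++ implementation:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Find the highest-scoring label sequence given per-token emission
    scores and label-to-label transition scores (standard Viterbi)."""
    T, L = emissions.shape                 # (seq_len, num_labels)
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        # total[i, j]: best score ending in label j at step t via label i
        total = score[:, None] + transitions + emissions[t][None, :]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

emissions = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
transitions = np.array([[0.5, -1.0], [-1.0, 0.5]])  # discourage label switches
print(viterbi(emissions, transitions))  # -> [0, 0, 0]
```

Note how the transition penalty overrides the middle token's emission preference for label 1: this is exactly the label-dependency effect the CRF layer adds over per-token classification.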

Inference for Language Classification
We also train our system for language classification. For the classification inference, we use the Pooling Linear Classifier block proposed in the ULMFiT paper (Howard and Ruder, 2018). We pass the output sequence representation H from the multi-head attention part to different poolings and concatenate the results (as shown in Figure 1):

h_c = [h_0; maxpool(H); meanpool(H)]

where [;] denotes concatenation and h_0 is the first output vector of the multi-head attention part (which corresponds to the "[CLS]" label).
The result of the concat pooling (3 × 1024 = 3072-dimensional) is passed to a linear layer, which predicts probabilities for the four language classes (Bulgarian, Czech, Polish and Russian).
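A minimal sketch of this concat-pooling classifier, assuming the h_0 / max-pool / mean-pool reading of the ULMFiT block described above:

```python
import torch
import torch.nn as nn

class PoolingClassifier(nn.Module):
    """ULMFiT-style concat pooling [h_0; maxpool(H); meanpool(H)]
    followed by a linear layer over the four language classes."""
    def __init__(self, d_model=1024, n_classes=4):
        super().__init__()
        self.linear = nn.Linear(3 * d_model, n_classes)

    def forward(self, H):                    # H: (batch, seq_len, d_model)
        h0 = H[:, 0]                         # the "[CLS]" position
        pooled = torch.cat(
            [h0, H.max(dim=1).values, H.mean(dim=1)], dim=-1)
        return self.linear(pooled)           # (batch, 4) class logits

clf = PoolingClassifier()
logits = clf(torch.randn(2, 512, 1024))
print(logits.shape)  # torch.Size([2, 4])
```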

Postprocessing Prediction
After obtaining labels for the sequence of WordPiece tokens, we have to convert the predictions to word-level labels in order to extract named entities. Each WordPiece token in a word is matched with the label predicted by the neural network. We then use an ensemble vote over the labels: we count all predicted labels for one word, except "X", and select the label with the highest number of votes.
For the final prediction, we merge consecutive token sequences whose label is not "O" ("Other") into spans and write them to the resulting entity set.
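The two postprocessing steps above can be sketched in plain Python; the function names and the exact tie-breaking in the vote are our assumptions, not the authors' code:

```python
from collections import Counter

def vote_word_labels(wordpiece_labels, word_ids):
    """Majority vote over the WordPiece predictions of each word,
    ignoring the auxiliary "X" label."""
    words = {}
    for label, wid in zip(wordpiece_labels, word_ids):
        if label != "X":
            words.setdefault(wid, Counter())[label] += 1
    return [words[wid].most_common(1)[0][0] for wid in sorted(words)]

def labels_to_spans(word_labels):
    """Merge consecutive non-"O" words into entity spans (start, end, type)."""
    spans, start = [], None
    for i, label in enumerate(word_labels + ["O"]):  # sentinel ends last span
        if label.startswith("B-") or label == "O":
            if start is not None:
                spans.append((start, i - 1, word_labels[start][2:]))
                start = None
        if label.startswith("B-"):
            start = i
        elif label.startswith("I-") and start is None:
            start = i                                # tolerate I- without B-
    return spans

word_labels = vote_word_labels(["B-PER", "X", "I-PER", "O"], [0, 0, 1, 2])
print(word_labels)                   # -> ['B-PER', 'I-PER', 'O']
print(labels_to_spans(word_labels))  # -> [(0, 1, 'PER')]
```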

Data Conversion
At the training stage we divide the input data into two parts: the training set (the "brexit" topic) and the development set (the "asia bibi" topic). Hence we train the system on one topic and evaluate it on another. Because the input contains raw text and annotations while BERT takes a word sequence as input, we convert the data to word-level IOB markup (Ramshaw and Marcus, 1995). Each word is then tokenized by the WordPiece tokenizer and its word label is matched with IOBX labels.
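Since the annotations list entity surface forms without character offsets, the conversion has to project them onto the tokenized text. A simplified sketch of such a projection by exact sequence matching (the real conversion must additionally handle overlaps, inflection and Unicode issues mentioned in the error analysis):

```python
def to_iob(words, entities):
    """Project entity annotations (surface form -> type) onto
    word-level IOB tags by exact word-sequence matching."""
    tags = ["O"] * len(words)
    for surface, etype in entities.items():
        target = surface.split()
        for i in range(len(words) - len(target) + 1):
            if words[i:i + len(target)] == target and tags[i] == "O":
                tags[i] = "B-" + etype
                for j in range(i + 1, i + len(target)):
                    tags[j] = "I-" + etype
    return tags

words = "Theresa May met MPs in London".split()
entities = {"Theresa May": "PER", "London": "LOC"}
print(to_iob(words, entities))
# -> ['B-PER', 'I-PER', 'O', 'O', 'O', 'B-LOC']
```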
At the prediction stage, the resulting labels are produced by the voting classifier, after which we transform word predictions to span markup. The results of the development-stage evaluation are described in Table 1.
After the evaluation stage, we train our network on all input data ("brexit" and "asia bibi") to make the final predictions on the blind test set.

Training Procedure
The proposed neural network was trained with a joint loss:

L = L_SL + L_clf

where L_SL is the maximum log-likelihood loss (Yang and Zhang, 2018) for the sequence labeling task and L_clf is the cross-entropy loss for the language classification.
We use Adam with a learning rate of 1e-4, β1 = 0.8, β2 = 0.9, L2 weight decay of 0.01, learning rate warm-up, and linear decay of the learning rate. Gradient clipping was also applied to the weights with clip = 1.0.
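The optimizer setup can be sketched as follows; the Adam hyperparameters and clipping value are from this section, while the warm-up and total step counts are illustrative assumptions (the paper does not report them), and the tiny linear model is only a placeholder for the full network:

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder for the full network

# Adam with lr 1e-4, betas (0.8, 0.9), L2 weight decay 0.01.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                             betas=(0.8, 0.9), weight_decay=0.01)

# Warm-up followed by linear decay, via LambdaLR; step counts are assumed.
warmup_steps, total_steps = 100, 1000
def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One training step with gradient clipping at 1.0:
loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
print(scheduler.get_last_lr()[0])
```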
The proposed architecture was trained on one GPU with a batch size of 16 for up to 150 epochs, but training was stopped at epoch 80 because the loss function had ceased to decrease. The model required only around 3 GB of GPU memory, whereas fine-tuning the whole BERT model would have required more than 8 GB. The whole training procedure took around five hours on one GPU, including evaluation on the development set at each epoch.
The final model was trained on the union of the training and development datasets.

Evaluation Results
As a baseline for the BSNLP Shared Task, we use a simple CRF tagger, which obtains an exact word-level F1 score of 0.372 on the development dataset.
In the end, we use the joint model for the named entity recognition and language classification tasks, because the model without the classification part scored several percent lower than the proposed final model. This suggests that the joint model pays attention to language-specific morphology and to connections between words within one language. For the proposed architecture, the training-stage evaluation was performed on the development dataset. Table 1 shows span-level precision, recall, and F1-measure. On the development set, we obtained a language classification F1 score of 0.998 and multilingual named entity recognition F1 scores of 0.70 for exact word-level matching and 0.578 for exact full-entity matching. We also trained the model without language classification, which resulted in an F1 score of 0.66; this confirms the impact of language classification. Our model significantly outperforms the CRF baseline.
The evaluation on the test dataset is presented in Table 2 (relaxed partial matching).

Error Analysis
First of all, we faced some errors when converting from the original data format (raw text and annotations) to word markup and back to the original format after predictions were made. These problems stem from extra spaces, bad Unicode symbols, and symbols absent from the WordPiece vocabulary. Other errors are caused by neural network prediction failures: the model tends to overfit on the negative label "O", so many entity tokens are incorrectly predicted as "O". Lastly, the infrequent labels "PRO" and "EVT" are often confused.

Related Work
The related work has several strands. First, our work follows the recent trend of using pretrained neural language models (Devlin et al., 2018; Peters et al., 2018; Howard and Ruder, 2018). The main difference from the original BERT approach to named entity recognition (Devlin et al., 2018) is that we use BERT only as input embeddings for the sequence, without fine-tuning. From the ELMo paper (Peters et al., 2018) we take the approach of weighting the different network outputs to obtain the final sequence representation.
From the ULMFiT work we took the part related to the final decoding for classification (the Pooling Classifier), without the proposed language model (Howard and Ruder, 2018). Second, we model NER as a joint sequence labeling and classification task, following other joint architectures (Liu and Lane, 2016; Nguyen et al., 2016).

Conclusion and Future Work
We have proposed a neural network architecture that performs multilingual named entity recognition for Bulgarian, Czech, Polish and Russian without any additional labeled data. This implementation makes it possible to train the model even on a modern personal computer with a GPU. The architecture can also be applied to other tasks that can be reformulated as sequence labeling, in any other language.
As next steps in the study of the underlying architecture, we could increase or decrease the number of units in each layer, or remove the recurrent or multi-head attention layer. As improvements to the system, we could fine-tune the BERT embeddings and put additional layers on top of BERT, or use other modern language models as input.

Figure 1: The system architecture

Table 1: Evaluation metrics on the development dataset

Table 2: Evaluation metrics on the test dataset