The birth of Romanian BERT

Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.


Introduction
A revolution started in natural language processing (NLP) a few years ago, when the first Transformer-based model (Vaswani et al., 2017) demonstrated a significant increase in state-of-the-art results compared to previous neural approaches. The bidirectional BERT (Devlin et al., 2018) language model has since been widely adopted as the baseline for transformer models, and it has been successfully applied to a broad range of NLP tasks, from standard language modeling to question answering, text summarization, and machine translation.
A number of papers have been dedicated to studying why and how this model performs so well, including comparison to classical NLP (Tenney et al., 2019), investigation of the newly introduced multi-head attention mechanism (Michel et al., 2019), and analyses of what BERT learns (Clark et al., 2019). A good recent review is presented by Rogers et al. (2020). Following BERT, a number of variations of language models using the transformer architecture have been introduced, including extended models such as XLM (Lample and Conneau, 2019) and XLNet, as well as more efficient ones like DistilBERT (Sanh et al., 2019), ALBERT (Lan et al., 2019), and ELECTRA (Clark et al., 2020). However, the vast majority of these studies have focused only on English models. While Google has released a multilingual BERT model trained on 100+ languages, only recently have monolingual models for other languages started to appear: FlauBERT for French, BERTje for Dutch (de Vries et al., 2019), and FinBERT for Finnish (Virtanen et al., 2019). But none for Romanian, until now. While multilingual BERT can also be used for Romanian, a monolingual model can bring an increase in accuracy that carries over to downstream task performance. Thus, we here introduce Romanian BERT. This paper focuses on the technical and practical aspects of Romanian BERT, covering corpus composition and cleaning in Section 2, the training process in Section 3, and evaluation on Romanian datasets in Section 4.

Corpus
The unannotated texts used to pre-train Romanian BERT are drawn from three publicly available corpora: OPUS, OSCAR, and Wikipedia.
OPUS OPUS (Tiedemann, 2012) is a collection of translated texts from the web. It is an open-

[Tokenization example, Romanian BERT (cased): Cinci bicicliști au plecat din Craiova spre Șo ##p ##âr ##lița .]
Wikipedia The Romanian Wikipedia is publicly available for download. We used the February 2020 Wikipedia dump, which contained approx. 0.4GB of text after cleaning.
All corpora were subjected to the same cleaning procedure, with Wikipedia also needing extra cleaning as the XML extraction still contained markup tokens. The cleaning involved several sequential steps:
• Remove all lines under a minimum length.
• Normalize dashes (there are several types of Unicode dashes) and other characters.
• Reduce multiple spaces.
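The steps above can be sketched as a per-line filter. This is a minimal sketch: the minimum length and the set of dash characters are illustrative, not the exact values used for the Romanian BERT corpus.

```python
import re
from typing import Optional

# Unicode dash variants normalized to a plain hyphen (illustrative subset).
DASHES = "\u2010\u2011\u2012\u2013\u2014\u2015"

def clean_line(line: str, min_len: int = 20) -> Optional[str]:
    """Apply the cleaning steps described above to a single line.

    Returns the cleaned line, or None if the line is too short to keep.
    """
    line = line.strip()
    if len(line) < min_len:             # drop lines under a minimum length
        return None
    for dash in DASHES:                 # normalize Unicode dash variants
        line = line.replace(dash, "-")
    line = re.sub(r" {2,}", " ", line)  # reduce multiple spaces
    return line
```

Running the filter over every line of a raw dump and discarding the `None` results yields the cleaned text.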

Pretraining
The pretraining process starts with building a vocabulary on the available corpus. Using byte-pair encoding (BPE), we generated cased and uncased vocabularies containing 50,000 word pieces. Character coverage was set to 2000.
Vocabulary plays an important part in the performance of a language model. Broadly speaking, the better sentences are tokenized (roughly, the fewer pieces each word is broken into), the better the model is expected to perform. Comparing M-BERT's vocabulary with ours (Table 3), we see that on average, Romanian BERT is able to encode a word in ∼1.4 tokens, while M-BERT can reach up to 2 tokens/word for the cased vocabulary. The table also shows that Romanian BERT has an order of magnitude fewer unknown tokens than M-BERT on the same text.
The models were trained with the initial 900K steps on a sequence length of 128 and the rest with the maximum length of 512. Figure 1 shows the progress for both models. The sudden increase in loss at the 900K-step mark is due to the switch to the 512 sequence length, but both models quickly recover from it. The models were trained using a batch size of 140 per GPU (for the 128 sequence length), and then 20 (for the 512 sequence length). The optimizer used was the Layer-wise Adaptive Moments optimizer for Batch training (LAMB (You et al., 2019)), with warm-up over the first 1% of steps up to a learning rate of 1e-4, followed by decay. Eight Nvidia Volta V100 GPUs with 32GB memory were used, and the pretraining process took around 2 weeks per model.
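The tokens-per-word and unknown-token statistics compared in Table 3 can be computed with a small helper. The toy tokenizer below is an illustrative stand-in for the real BPE vocabularies, not the actual model tokenizer.

```python
from typing import Callable, Dict, List, Tuple

def vocab_stats(words: List[str],
                tokenize: Callable[[str], List[str]],
                unk_token: str = "[UNK]") -> Tuple[float, float]:
    """Return (average subword tokens per word, unknown-token rate) --
    the two vocabulary statistics compared between the models."""
    pieces = [tokenize(w) for w in words]
    total = sum(len(p) for p in pieces)
    unk = sum(tok == unk_token for p in pieces for tok in p)
    return total / len(words), unk / total

# Illustrative stand-in for a real BPE tokenizer (NOT the actual vocabulary).
_TOY: Dict[str, List[str]] = {
    "cinci": ["cinci"],
    "Șopârlița": ["Șo", "##p", "##âr", "##lița"],
}

def toy_tokenize(word: str) -> List[str]:
    return _TOY.get(word, ["[UNK]"])
```

With a real tokenizer plugged in for `tokenize`, the same function reproduces the style of comparison shown in the table.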

Evaluation
We evaluate Romanian BERT on three downstream tasks:
• Simple Universal Dependencies: one model per task, evaluating labeling performance on UPOS (Universal Part-of-Speech) and XPOS (eXtended Part-of-Speech) tags.
• Joint Universal Dependencies: a single model trained jointly on all tasks, evaluating UPOS, UFeats (Universal Features), Lemmas, and Dependency Parsing, for which we compute the Labeled Attachment Score.
• Named Entity Recognition: a single model predicting BIO-style labels.
For each task we compare Romanian BERT with M-BERT, in both cased and uncased versions. To mitigate the effect of the random seed, we run each experiment five times and report the mean score.
All results listed here are reproducible by using the evaluation scripts provided on GitHub.

Simple Universal Dependencies
For this token labeling task we used the Romanian RRT (Barbu Mititelu et al., 2016) dataset from Universal Dependencies (UD), and evaluated the performance of the language models for UPOS and XPOS tagging using the macro-averaged F1, as proposed by Zeman et al. (2018).
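A simplified, per-token version of the macro-averaged F1 can be sketched as follows; the official shared-task scorer additionally handles token alignment between system and gold segmentations.

```python
from collections import Counter
from typing import List

def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Macro-averaged F1 over tag classes: compute per-class F1 from
    true positives / false positives / false negatives, then average
    uniformly so that rare tags weigh as much as frequent ones."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    classes = set(gold) | set(pred)
    f1s = []
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(classes)
```

Note that under macro-averaging a single rare tag that is systematically mislabeled pulls the score down as much as a frequent one.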
Methodology The model itself is straightforward: on top of BERT's output layer we directly use a linear layer (with a fixed dropout of 0.1) that projects BERT's outputs into the desired number of classes, depending on the UPOS or XPOS task. Cross-entropy loss is applied to the softmaxed linear layer. We perform two tests for each UPOS/XPOS task: a frozen test, where we train only the added last layer ("freezing" the language model weights), and a full test, where all parameters are trainable. The frozen test can give additional insight into a model because its performance rests more on the pretrained weights than on the fine-tuned ones, meaning that differences in frozen model performance should be more pronounced than in the fully tuned setting.
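A minimal numpy sketch of this classification head (dropout omitted): the hidden states stand in for BERT's per-token outputs, and in the frozen setting only `W` and `b` would receive gradient updates.

```python
import numpy as np

def token_classification_loss(hidden: np.ndarray,
                              labels: np.ndarray,
                              W: np.ndarray,
                              b: np.ndarray) -> float:
    """Linear head over per-token hidden states followed by softmax
    cross-entropy, as in the tagging setup described above.

    hidden : (n_tokens, d) contextual embeddings from the encoder
    labels : (n_tokens,) integer class ids (e.g. UPOS or XPOS tags)
    W, b   : projection to n_classes; in the "frozen" test only these
             head parameters are trained.
    """
    logits = hidden @ W + b                        # (n_tokens, n_classes)
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax per token
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))
```

The loss approaches zero as the projection pushes each token's probability mass onto its gold tag.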

Results
The results are summarized in Table 4. They show that Romanian BERT outperforms M-BERT on all subtasks, with differences in scores ranging from 0.13% (UPOS, non-frozen, cased) to 6% (XPOS, frozen, cased). Moreover, our assumption that the differences between the frozen variants would be larger holds.
Surprisingly, the uncased version of Romanian BERT achieved the highest performance on all experiments, although the cased version should be at something of an advantage on this task, as capital letters indicate proper nouns, etc. We believe these results may be due to a better match between the uncased vocabulary (which necessarily contains more full words) and the RRT corpus.

Joint Universal Dependencies
For this task we evaluated the language models on the same dataset as in the Simple task, namely RRT. We evaluate the models using the standard CoNLL shared task evaluation tools (Zeman et al., 2018) and report the scores for UPOS (giving us a comparison to the same UPOS task using a simpler model), UFeats (Universal Features of each word), Lemmas, and the Labeled Attachment Score (LAS).
Methodology Although we evaluated the models on the same dataset, the methodology was different. We use an external tool to perform evaluation: UDify (Kondratyuk, 2019), a Transformer-based model that performs joint training and prediction on every UD subtask in a single step. It implements a prediction layer on top of the contextualized embeddings for each task and a layer-wise dot-product attention that calculates a weighted sum for all intermediate representations of a token.
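The weighted sum over a token's intermediate representations can be sketched as an ELMo-style scalar mix with softmax-normalized per-layer weights; this is a simplification of UDify's layer-wise attention, not its exact implementation.

```python
import numpy as np

def layer_mix(layer_states: np.ndarray, scalars: np.ndarray) -> np.ndarray:
    """Weighted sum of each token's representations across encoder layers.

    layer_states : (n_layers, n_tokens, d) intermediate representations
    scalars      : (n_layers,) learned per-layer weights
    """
    w = np.exp(scalars - scalars.max())            # softmax over layers
    w /= w.sum()
    return np.tensordot(w, layer_states, axes=1)   # (n_tokens, d)
```

Each task head then receives its own mixture of the encoder's layers rather than only the top layer.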

Results
The results shown in

Named Entity Recognition
Romanian BERT was evaluated on RONEC (Dumitrescu and Avram, 2019), a fine-grained NER corpus. The standard BIO tagging was used, and models were evaluated according to Segura Bedmar et al. (2013), reporting the macro-averaged F1 metrics for entity type, partial, strict, and exact matches.
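Strict matching, for instance, counts a prediction correct only when both the span boundaries and the entity type agree with the gold span. A simplified sketch of reading spans off BIO labels (stray I- tags without a preceding B- are dropped rather than repaired):

```python
from typing import List, Tuple

def bio_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Extract (entity_type, start, end) spans from a BIO sequence,
    with `end` exclusive. Comparing the resulting span sets between
    gold and predicted sequences gives the strict-match counts."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # "O" sentinel flushes last span
        if lab.startswith("B-") or lab == "O" or \
           (lab.startswith("I-") and lab[2:] != etype):
            if start is not None:
                spans.append((etype, start, i))
            start, etype = (i, lab[2:]) if lab.startswith("B-") else (None, None)
    return spans
```

Intersecting the gold and predicted span sets yields true positives for strict matching; relaxing the type or boundary condition gives the type and partial variants.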
Methodology The methodology used for this task is identical to the one used for the Simple Universal Dependencies task. However, we do not evaluate the models on their frozen versions.
Results Table 6 summarizes the NER results. Unsurprisingly, we find that cased Romanian BERT has the best performance on this task, improving on M-BERT scores by 0.78% to 1.96% on all metrics. The uncased version of Romanian BERT placed second in the evaluation, outperforming both the cased and uncased versions of M-BERT.

Conclusions
We have introduced the first BERT model trained solely on a 15GB Romanian corpus, obtained by thoroughly cleaning the OSCAR, OPUS, and Wikipedia sub-corpora. We have shown that Romanian BERT outperforms the only other available model that works on Romanian, the multilingual M-BERT.
As the current corpus is further cleaned, more text is added, and tweaks to improve vocabulary coverage are performed, new versions of Romanian BERT, as well as future models, will be released as open source.