German’s Next Language Model

In this work we present the experiments that led to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation-driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. All trained models will be made publicly available to the research community.


Introduction
Deep Transformer-based language models have shown state-of-the-art results for various Natural Language Processing tasks like text classification, NER and question answering (Devlin et al., 2019). They are first pretrained on large unlabeled text corpora and then fine-tuned on the downstream task. In this work we present a set of German BERT and ELECTRA models, the best of which, GELECTRA Large, significantly improves upon state-of-the-art performance on the GermEval18 hate speech detection task by about +4% / +2.5% for the coarse and fine variants of the task respectively. This model also reaches SoTA on the GermEval14 NER task, outperforming the previous best by over +4%. While performant, such models are prohibitively large for many users and so we also present a new GBERT model which matches deepset BERT, the previous best German BERT, in size but outperforms it by +2.23% F1 averaged over three tasks.
In the process of pretraining the language models, we also a) quantify the effect of increasing the training data by an order of magnitude and b) verify that whole word masking has a positive effect on BERT models.
Because of the computational expense of training large language models from scratch, we adopt a downstream-oriented evaluation approach to ensure that we get the best performance from a limited number of runs. This involves regularly checkpointing the model over the course of pretraining, evaluating these checkpoints on a set of classification and NER tasks and selecting as final the checkpoint which shows the best performance. This stands in contrast to approaches where the final model is simply saved after a fixed number of steps. Our method is also an important tool in diagnosing pretraining and we hope that it will be of use to other teams looking to train effective language models on a budget.
(Howard and Ruder, 2018) and FLAIR (Akbik et al., 2018) are LSTM based and these were able to set new performance benchmarks on downstream tasks like text classification, PoS tagging and NER. More recent approaches use Transformer-based (Vaswani et al., 2017) architectures and examples include GPT-2 (Radford et al., 2019), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), ALBERT (Lan et al., 2020) and ELECTRA (Clark et al., 2020).
In this work we focus on BERT and ELECTRA models. BERT uses a masked language modeling (MLM) strategy to corrupt an input sentence by replacing some tokens with a [MASK] symbol. The model is then trained to re-construct the original token. However, this method of training is somewhat restricted in that the model only learns from the masked out tokens which typically make up about 15% of the input tokens.
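The masking step described above can be illustrated with a minimal sketch. It is a simplification and not the original BERT implementation: every selected token is replaced by [MASK], and the function name `mask_tokens`, the seed and the example sentence are assumptions made for illustration only.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace each token with [MASK] with probability mask_prob.

    Returns the corrupted sequence and a list of labels that holds the
    original token at masked positions and None everywhere else, so that
    a loss would only be computed on the masked positions.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok   # the model has to reconstruct this token
            masked[i] = MASK
    return masked, labels

# Hypothetical example sentence
tokens = ["das", "ist", "ein", "kurzer", "Satz"]
masked, labels = mask_tokens(tokens, mask_prob=0.5, seed=1)
```

Positions whose label is None contribute no training signal, which is the restriction that ELECTRA sets out to remove.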
ELECTRA addresses this problem by introducing a new pretraining task called Replaced Token detection. Instead of masking out tokens, a subset of the input tokens are substituted by a synthetically generated token. The model is then trained to classify whether each input token is original or substituted, thus allowing for gradient updates at every input position. Practically speaking, this is achieved by having a discriminator that performs the replaced token detection and a generator which provides plausible token substitutes. These two components are trained jointly and are both Transformer based.
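A rough sketch of how the discriminator's training targets arise is given below. The generator is replaced by uniform sampling from a toy vocabulary, which is an assumption made purely for illustration (in ELECTRA the substitutes come from a jointly trained Transformer); the names `corrupt_for_rtd` and `vocab` are hypothetical.

```python
import random

def corrupt_for_rtd(tokens, vocab, replace_prob=0.15, seed=0):
    """Build a corrupted sequence and per-token labels for replaced token detection.

    Label 1 means "substituted", label 0 means "original". A substitute that
    happens to equal the original token is labelled as original.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            new_tok = rng.choice(vocab)  # stand-in for a sample from the generator
            corrupted.append(new_tok)
            labels.append(int(new_tok != tok))
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

# Hypothetical toy vocabulary and sentence
vocab = ["Haus", "Baum", "geht", "schnell", "der"]
tokens = ["der", "Hund", "geht", "nach", "Hause"]
corrupted, labels = corrupt_for_rtd(tokens, vocab, replace_prob=0.5, seed=3)
```

Every position receives a label, so every input position contributes to the loss, in contrast to the roughly 15% of positions used in masked language modeling.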
The BERT model received an update when the original authors added Whole Word Masking, whereby masking one subword token requires that all other tokens in the word are also masked out. The authors report that this method improves the training signal by removing the easiest cases and show that it improves performance in their tasks.
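The idea can be sketched as follows, assuming WordPiece-style tokens in which a leading "##" marks the continuation of the previous word. This is a simplified illustration and not the authors' implementation: each word is masked independently with a fixed probability, and the function name `whole_word_mask` is an assumption.

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, seed=0):
    """Mask all subword tokens of a word together.

    Tokens starting with "##" are treated as continuations of the
    preceding token and always share its masking decision.
    """
    # Group token indices into words
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    rng = random.Random(seed)
    masked = list(tokens)
    for indices in words:
        if rng.random() < mask_prob:
            for i in indices:
                masked[i] = "[MASK]"
    return masked

# Hypothetical WordPiece split of "Sprachmodelle sind groß"
tokens = ["Sprach", "##modell", "##e", "sind", "groß"]
masked = whole_word_mask(tokens, mask_prob=0.5, seed=2)
```

Without this grouping, a model could often recover a masked subword such as "##modell" from its unmasked neighbours "Sprach" and "##e", which is the kind of easy case that Whole Word Masking removes.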
There is also a line of work that looks into bringing language modeling techniques that were first developed on English to other languages. These include but are not limited to monolingual models such as CamemBERT (Martin et al., 2020) and FlauBERT for French, Finnish BERT (Virtanen et al., 2019) and German BERTs by DBMDZ and deepset. For a more comprehensive list, see Nozza et al. (2020).
Some models are also capable of supporting multiple languages such as multilingual BERT (mBERT Base) and XLM-RoBERTa (Conneau et al., 2019). Multilingual BERT is a multilingual model for 104 different languages trained on Wikipedia dumps. The XLM-RoBERTa model is trained on 2.5TB of data from a cleaned Common Crawl corpus (Wenzek et al., 2020) for 100 different languages.
It is worth emphasizing here that systems trained on naturally occurring data will learn pre-existing cultural biases around gender (Bolukbasi et al., 2016), race and religion (Speer, 2017). Critical evaluation of machine learning methods is more important than ever as NLP is gaining broader adoption. Researchers have been advocating for better documentation of decisions made during the construction of a dataset (Gebru et al., 2018), explicit statements of a dataset's "ingredients" (Holland et al., 2018) and recognition of the dataset characteristics that may lead to exclusion, overgeneralisation and underexposure (Bender and Friedman, 2018). These topics will be addressed in Section 3.1.

Pretraining Data
We have available to us a range of different German language corpora that we use in different combinations for our model pretraining. OSCAR (Ortiz Suárez et al., 2019) is a set of monolingual corpora extracted from Common Crawl. The Common Crawl texts are pre-processed (e.g. HTML entities are removed) and a language classification model is used to sort texts by language. We use the unshuffled version of the German OSCAR corpus, resulting in 145GB of text. The Wikipedia dump for German is preprocessed with the WikiExtractor script, forming a corpus of size 6GB. The OPUS project (Tiedemann, 2012) has collected texts from various domains such as movie subtitles, parliament speeches and books and these comprise a collection of around 10GB. From Open Legal Data (Ostendorff et al., 2020) there is a dataset of about 2.4GB of German court decisions. Table 1 shows an overview of all datasets.
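As a quick check on the relative sizes, the figures quoted above can be tallied as follows. The sizes are the approximate values given in the text; the variable names are illustrative.

```python
# Approximate corpus sizes in GB, as quoted in the text
corpus_gb = {
    "OSCAR": 145.0,
    "OPUS": 10.0,
    "Wikipedia": 6.0,
    "Open Legal Data": 2.4,
}

total_gb = sum(corpus_gb.values())
shares = {name: size / total_gb for name, size in corpus_gb.items()}

for name, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1%}")
```

On these figures the combined corpus comes to about 163GB, of which OSCAR accounts for roughly 89%.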
As discussed in Section 2, our pretrained language models will learn pre-existing biases from the training datasets. The main portion (89%) of our training data, namely the OSCAR dataset, uses texts scraped from the internet, which is in some respects problematic. First off, this dataset contains a lot of explicit and indecent material. While we filtered out many of these documents through keyword matching, we cannot guarantee that this method was successful in every case. Furthermore, many websites contain unverified information and any dataset containing this kind of text can lead to a skewed model that reflects commonly found lies and misconceptions. This includes gender, racial and religious biases, which are found in textual data of all registers, and so we advise anyone using our model to recognise that it will not always build true and accurate representations of real-world concepts. We implore users of the model to seriously consider these issues before deploying it in a production setting, especially in situations where impartiality matters, such as journalism, and institutional decision making like job applications or insurance assessments.

GermEval18
For text classification we use GermEval18 (Coarse) and GermEval18 (Fine) which are both hate speech classification tasks (Wiegand et al., 2018). GermEval18 (Coarse) requires a system to classify a tweet into one of two classes: OFFENSE if the tweet contains some form of offensive language, and OTHER if it does not. GermEval18 (Fine) extends the coarse-grained task and contains four classes: OTHER for nonoffensive tweets as well as PROFANITY, INSULT and ABUSE which are all subclasses of OFFENSE from the coarse variant of the task.
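The relationship between the two label sets can be written down as a simple mapping. The label names are those given above; the helper `to_coarse` is a hypothetical name used for illustration.

```python
# Every fine-grained label maps to exactly one coarse label
FINE_TO_COARSE = {
    "OTHER": "OTHER",
    "PROFANITY": "OFFENSE",
    "INSULT": "OFFENSE",
    "ABUSE": "OFFENSE",
}

def to_coarse(fine_label):
    """Collapse a GermEval18 (Fine) label into its GermEval18 (Coarse) label."""
    return FINE_TO_COARSE[fine_label]
```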

GermEval14
For NER, we use the GermEval14 (Benikova et al., 2014) shared task. The data is sampled from German Wikipedia and News Corpora and contains over 31,000 sentences and 590,000 tokens. The dataset is one of the largest NER datasets for German and features an advanced annotation schema that allows for nested annotations. The four main classes (PERSON, ORGANISATION, LOCATION and OTHER) each have part and derivative variants (e.g. LOCpart or PERderiv) resulting in 12 classes in total.
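The 12 classes follow from combining the four main classes with the three variants. In the sketch below, the abbreviations PER and LOC are taken from the examples above, while ORG and OTH are assumed abbreviations for ORGANISATION and OTHER.

```python
# Main classes (ORG and OTH are assumed abbreviations)
MAIN_CLASSES = ["PER", "ORG", "LOC", "OTH"]
# Plain class, part variant, derivative variant
VARIANTS = ["", "part", "deriv"]

NER_CLASSES = [main + variant for main in MAIN_CLASSES for variant in VARIANTS]
print(NER_CLASSES)
```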

Method
To train our German BERT and ELECTRA we use the TensorFlow training scripts from the official repositories. We train models that match the size of the original BERT Base, BERT Large, ELECTRA Base and ELECTRA Large. The hyperparameters used for training can be found in Table 2. The base models were trained on a single Google Cloud TPU v3 (8 cores) while large models were trained on pods of 16 TPUs v3 (128 cores).
Table 2: Hyperparameters for language model pretraining.

Models
In total, we trained 7 separate models with different combinations of data and model size as well as Whole Word Masking (WWM) for BERT models. The German DBMDZ BERT Base is the same size as BERT Base and was trained using the OPUS and Wikipedia corpora. It serves as our baseline model. We train four BERT variants of it, each referred to as GBERT and each using the same cased vocabulary as DBMDZ BERT Base. These match BERT Base in size unless they have the "Large" suffix, in which case they match BERT Large:
• GBERT Data: trained on all available data without Whole Word Masking
• GBERT WWM: trained on the same data as DBMDZ BERT Base but uses Whole Word Masking
• GBERT Data + WWM: trained on all available data and uses Whole Word Masking
• GBERT Large: trained on all available data and uses Whole Word Masking
We also trained three ELECTRA variants of DBMDZ BERT Base, each referred to as GELECTRA models, which also match the size of the original ELECTRA Base unless they have the "Large" suffix, in which case they match ELECTRA Large:
• GELECTRA: trained on the same data as DBMDZ BERT Base
• GELECTRA Data: trained on all available data
• GELECTRA Large: trained on all available data
The best models of each architecture and size are uploaded to the Hugging Face model hub as deepset/gbert-base, deepset/gbert-large, deepset/gelectra-base and deepset/gelectra-large.
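One plausible way to load the published checkpoints is via the `AutoTokenizer` and `AutoModel` classes of the Transformers library, as sketched below. The helper `load_model` is a hypothetical name, and calling it assumes that `transformers` (with a PyTorch backend) is installed and that the model hub is reachable.

```python
# Model identifiers as listed in the text
MODEL_IDS = [
    "deepset/gbert-base",
    "deepset/gbert-large",
    "deepset/gelectra-base",
    "deepset/gelectra-large",
]

def load_model(model_id):
    """Download and return (tokenizer, model) for one of the model identifiers."""
    # Imported lazily so that the identifier list is usable without transformers
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model

# Example (requires network access):
# tokenizer, model = load_model("deepset/gbert-base")
```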

Evaluation
In our approach, models are evaluated continuously during pretraining. Model checkpoints are saved at regular intervals and converted into PyTorch models using Hugging Face's Transformers library (Wolf et al., 2019). Using the FARM framework, we evaluate the performance of each checkpoint on GermEval18 (Coarse) and GermEval18 (Fine) which are both hate speech classification tasks (Wiegand et al., 2018). Using Hugging Face's Transformers we also evaluate on GermEval14 (Benikova et al., 2014) which is a NER task.
In BERT, the vector corresponding to the [CLS] token serves as a representation of the whole input sequence, while in ELECTRA, all word vectors are combined through a feed forward layer. In both cases, this input sequence representation is passed through a single layer Neural Network in order to perform prediction. In the NER task, each vector corresponding to the first token in a word is passed through a single layer Neural Network and the resulting prediction is applied to the whole word.
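The NER convention of predicting on the first subword token and applying the result to the whole word can be sketched as follows. The sketch assumes WordPiece-style "##" continuation markers and operates on already computed per-token predictions; the function name, the example sentence and the tag names are illustrative and not taken from the evaluation code.

```python
def word_level_predictions(tokens, token_preds):
    """Keep only the prediction of each word's first subword token.

    Continuation tokens (prefixed with "##") are merged into the
    preceding word and their own predictions are ignored.
    """
    words, preds = [], []
    for tok, pred in zip(tokens, token_preds):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # extend the current word, drop its prediction
        else:
            words.append(tok)      # first token of a new word
            preds.append(pred)
    return list(zip(words, preds))

# Hypothetical tokens and per-token predictions
tokens = ["Angela", "Mer", "##kel", "besucht", "Ber", "##lin"]
token_preds = ["B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(word_level_predictions(tokens, token_preds))
# [('Angela', 'B-PER'), ('Merkel', 'I-PER'), ('besucht', 'O'), ('Berlin', 'B-LOC')]
```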
Each checkpoint is evaluated 3 times on each document classification task since we observed significant variance across different runs. Each of these runs is performed with early stopping and a different seed each time. For NER, the model is evaluated just once without early stopping. For each checkpoint, the reported performance is the average, across GermEval18 (Coarse), GermEval18 (Fine) and GermEval14, of the single best run on each task. Table 3 summarizes the most important details and parameters of each task. For all experiments, we use an Nvidia V100 GPU to accelerate training. For each model, we choose the checkpoint that shows the best performance.
For comparison, we also run this evaluation pipeline on the two publicly available German BERT models (deepset German BERT Base and DBMDZ German BERT Base) as well as multilingual models such as mBERT Base and XLM-RoBERTa Large.
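The selection procedure can be sketched as below, under the reading that a checkpoint's score is the mean over the three tasks of the best run on each task. All scores in the example are invented placeholders rather than results from our experiments, and the function names are hypothetical.

```python
def checkpoint_score(runs_per_task):
    """Average, over tasks, of the best run on each task."""
    best_per_task = [max(runs) for runs in runs_per_task.values()]
    return sum(best_per_task) / len(best_per_task)

def select_best_checkpoint(checkpoints):
    """Return the training step of the checkpoint with the highest score."""
    return max(checkpoints, key=lambda step: checkpoint_score(checkpoints[step]))

# Invented placeholder scores: three seeds per classification task, one NER run
checkpoints = {
    100_000: {"coarse": [0.70, 0.72, 0.71], "fine": [0.40, 0.42, 0.41], "ner": [0.80]},
    200_000: {"coarse": [0.74, 0.73, 0.75], "fine": [0.45, 0.44, 0.43], "ner": [0.82]},
}
best_step = select_best_checkpoint(checkpoints)
```

Taking the maximum over seeds rather than the mean dampens the effect of unlucky runs, at the cost of a somewhat optimistic estimate.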

Results
The downstream performance graphs in Figure 1 show that the models are capable of learning with most of the gains being made in the first phase of training and more incremental gains coming later. The best checkpoints come at different points for different models as can be seen in Table 4.
Table 5 shows the evaluation results for each model's best checkpoint on each of the three downstream tasks, with comparison to benchmark models and previous SoTA results. For GermEval18, results from the best-performing systems are reported (Wiegand et al., 2018). For GermEval14 we report the result that can be achieved using the FLAIR framework (Akbik et al., 2019).
In GermEval18 (Coarse), GBERT Data + WWM, XLM-RoBERTa Large, GBERT Large and GELECTRA Large all improve upon the previous SoTA. GELECTRA Large does so with the largest margin, reaching a score that is +3.93% better. In GermEval18 (Fine), XLM-RoBERTa Large beats the previous best by +1.39% and GELECTRA Large sets a new SoTA that is better than the previous by +2.45%. In GermEval14, all 7 trained models exceed the previous SoTA, with GELECTRA Large showing a +4.3% improvement over the previous best.
These results indicate that adding extra data gives a consistent but modest performance boost to our language models. GBERT Data outperforms DBMDZ BERT Base by +0.25%, GBERT Data + WWM outperforms GBERT WWM by +0.93% and GELECTRA Data outperforms GELECTRA by +1.59%. For the BERT models, Whole Word Masking also shows a consistent positive impact with GBERT WWM outperforming DBMDZ BERT Base by +1.70% and GBERT Data + WWM outperforming GBERT Data by +2.38%.

Model Size
The large models that we train show much stronger performance than the base models. GBERT Large outperforms GBERT Data + WWM by +2.33% averaged F1 and GELECTRA Large outperforms GELECTRA Data by +5.31%. It must be noted, however, that their differing training regimes mean that the large models are trained on many more tokens than their base counterparts. In future, we would also be interested in training larger models with less data in order to better quantify the gains that come from model size and the gains that come from the extra data.

Training Length
From the downstream evaluation graphs in Figure 1, it is clear that the models gain most of their performance after a relatively small number of training steps. GBERT WWM and GBERT Data + WWM both show an upward trend in the second half of model training, suggesting they could still benefit from continued training. There is also a clear upward trend over the course of the training of GELECTRA and GELECTRA Data, suggesting these models are undertrained. It should also be noted that none of the models exhibit any clear signs of overfitting or performance degradation and may improve with further training.

ELECTRA Efficiency
One of the central claims of the ELECTRA paper is that it is capable of learning more efficiently than MLM based language models. This is exemplified by the comparison of GBERT Large and GELECTRA Large. By the end of their 1 million steps of training, GELECTRA Large has seen only half the number of tokens that GBERT Large has, due to its smaller batch size, and yet outperforms it by +1.47% averaged F1.
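The relation between batch size and the number of tokens seen can be made explicit with a small calculation. The batch sizes and sequence length below are hypothetical placeholders chosen only to show a factor of two; they are not the values used in our experiments, and the product is an upper bound since padded positions are counted as well.

```python
def tokens_seen(steps, batch_size, seq_len):
    """Upper bound on the number of tokens processed during training."""
    return steps * batch_size * seq_len

STEPS = 1_000_000
SEQ_LEN = 512  # hypothetical sequence length

# Hypothetical batch sizes that differ by a factor of two
tokens_larger_batch = tokens_seen(STEPS, batch_size=512, seq_len=SEQ_LEN)
tokens_smaller_batch = tokens_seen(STEPS, batch_size=256, seq_len=SEQ_LEN)

print(tokens_smaller_batch / tokens_larger_batch)  # 0.5
```

With the number of steps and the sequence length held equal, halving the batch size halves the number of tokens seen.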

Instabilities
The dip in performance around 2 million steps for the base sized GBERT models (see Figure 1) coincides with a point in our training regime at which model training is stopped, saved and then reloaded. While we suspect that these two events are related, it was beyond the scope of this project to investigate the exact reasons.

Conclusion
The set of German models which we trained varies in terms of training regime and model architecture. We hope that the results that we present here will serve as important data points to other NLP practitioners who are looking to train language models from scratch but are limited by compute. Our experiments should give other teams a sense of the batch sizes and training lengths that make for efficient model training. On top of this, we also present a set of GELECTRA and GBERT models which, according to our evaluations, set new SoTA performance for both large and base sized models on GermEval18 and GermEval14.