CamemBERT: a Tasty French Language Model

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.


Introduction
Pretrained word representations have a long history in Natural Language Processing (NLP), from non-neural methods (Brown et al., 1992; Ando and Zhang, 2005; Blitzer et al., 2006) to neural word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and to contextualised representations (Peters et al., 2018; Akbik et al., 2018). Approaches shifted more recently from using these representations as an input to task-specific architectures to replacing these architectures with large pretrained language models. These models are then fine-tuned to the task at hand with large improvements in performance over a wide range of tasks (Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019; Raffel et al., 2019). These transfer learning methods exhibit clear advantages over more traditional task-specific approaches, probably the most important being that they can be trained in an unsupervised manner. They nevertheless come with implementation challenges, namely the amount of data and computational resources needed for pretraining, which can reach hundreds of gigabytes of uncompressed text and require hundreds of GPUs (Yang et al., 2019; Liu et al., 2019). The latest transformer architecture uses as much as 750GB of plain text and 1024 TPU v3 chips for pretraining (Raffel et al., 2019). This has limited the availability of these state-of-the-art models to the English language, at least in the monolingual setting. Even though multilingual models give remarkable results, they are often larger and their results still lag behind their monolingual counterparts (Lample and Conneau, 2019). This is particularly inconvenient as it hinders their practical use in NLP systems as well as the investigation of their language modeling capacity, something that remains to be investigated in the case of, for instance, morphologically rich languages. We take advantage of the newly available multilingual corpus OSCAR (Ortiz Suárez et al., 2019) and train a monolingual language model for French using the RoBERTa architecture. We pretrain the model, which we dub CamemBERT, and evaluate it on four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI). CamemBERT improves the state of the art for most tasks over previous monolingual and multilingual approaches, which confirms the effectiveness of large pretrained language models for French. We summarise our contributions as follows:
• We train a monolingual BERT model on the French language using recent large-scale corpora.
• We evaluate our model on four downstream tasks (POS tagging, dependency parsing, NER and natural language inference (NLI)), achieving state-of-the-art results in most tasks, confirming the effectiveness of large BERT-based models for French.
• We release our model in a user-friendly format for popular open-source libraries so that it can serve as a strong baseline for future research and be useful for French NLP practitioners.

Related Work
From non-contextual to contextual word embeddings The first neural word vector representations were non-contextualised word embeddings, most notably word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Mikolov et al., 2018), which were designed to be used as input to task-specific neural architectures. Contextualised word representations such as ELMo (Peters et al., 2018) and flair (Akbik et al., 2018) improved the expressivity of word embeddings by taking context into account. They improved the performance of downstream tasks when they replaced traditional word representations. This paved the way towards larger contextualised models that replaced downstream architectures in most tasks. These approaches, trained with language modeling objectives, range from LSTM-based architectures such as ULMFiT (Howard and Ruder, 2018) to the successful transformer-based architectures such as GPT2 (Radford et al., 2019), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and more recently ALBERT (Lan et al., 2019) and T5 (Raffel et al., 2019).
Non-contextual word embeddings for languages other than English Since the introduction of word2vec (Mikolov et al., 2013), many attempts have been made to create monolingual models for a wide range of languages.
For non-contextual word embeddings, the first two attempts were by Al-Rfou et al. (2013) and Bojanowski et al. (2017), who created word embeddings for a large number of languages using Wikipedia. Later, Grave et al. (2018) trained fastText word embeddings for 157 languages using Common Crawl and showed that using crawled data significantly increased the performance of the embeddings relative to those trained only on Wikipedia.

Contextualised models for languages other than English
Following the success of large pretrained language models, they were extended to the multilingual setting with multilingual BERT, a single multilingual model for 104 different languages trained on Wikipedia data, and later XLM (Lample and Conneau, 2019), which greatly improved unsupervised machine translation. A few monolingual models have been released: ELMo models for Japanese, Portuguese, German and Basque, and BERT for Simplified and Traditional Chinese and German. However, to the best of our knowledge, no particular effort has been made toward training models for languages other than English at a scale similar to the latest English models (e.g. RoBERTa, trained on more than 100GB of data).

CamemBERT
Our approach is based on RoBERTa (Liu et al., 2019), which replicates and improves the initial BERT by identifying key hyper-parameters for more robust performance. In this section, we describe the architecture, training objective, optimisation setup and pretraining data that were used for CamemBERT. CamemBERT differs from RoBERTa mainly in the addition of whole-word masking and the use of SentencePiece tokenisation (Kudo and Richardson, 2018).
Architecture Similar to RoBERTa and BERT, CamemBERT is a multi-layer bidirectional Transformer (Vaswani et al., 2017). Given the widespread usage of Transformers, we do not describe them in detail here and refer the reader to (Vaswani et al., 2017). CamemBERT uses the original BERT BASE configuration: 12 layers, 768 hidden dimensions and 12 attention heads, which amounts to 110M parameters.
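The 110M figure can be checked with a rough count. The sketch below is only an approximation under assumptions that the text does not state: a feed-forward size of 3072 (four times the hidden size), 512 learned positions, a 32k-token vocabulary, and no pooler or separate language modeling head. The number of attention heads does not change the count.

```python
# Rough parameter count for a BERT-BASE-sized encoder.
# Assumptions (not stated in the text): feed-forward size 4 * hidden = 3072,
# 512 learned positions, 32k-token vocabulary, no pooler or separate LM head.
def encoder_params(vocab=32_000, hidden=768, layers=12, ffn=3072, positions=512):
    # Token and position embeddings, plus the embedding LayerNorm (scale + bias).
    embeddings = vocab * hidden + positions * hidden + 2 * hidden
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers of the feed-forward block, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two LayerNorms per layer, each with scale and bias.
    layer_norms = 2 * 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    return embeddings + layers * per_layer

total = encoder_params()
print(f"{total / 1e6:.1f}M parameters")
```

Under these assumptions the 12 Transformer layers account for roughly 85M parameters and the embeddings for roughly 25M.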
Pretraining objective We train our model on the Masked Language Modeling (MLM) task. Given an input text sequence composed of N tokens x_1, ..., x_N, we select 15% of tokens for possible replacement. Among those selected tokens, 80% are replaced with the special <MASK> token, 10% are left unchanged and 10% are replaced by a random token. The model is then trained to predict the initial masked tokens using cross-entropy loss. Following RoBERTa, we dynamically mask tokens instead of fixing them statically for the whole dataset during preprocessing. This improves variability and makes the model more robust when training for multiple epochs. Since we segment the input sentence into subwords using SentencePiece, the input tokens to the model can be subwords.
An upgraded version of BERT and Joshi et al. (2019) have shown that masking whole words instead of individual subwords leads to improved performance. Whole-word masking (WWM) makes the training task more difficult because the model has to predict a whole word instead of predicting only part of the word given the rest. We therefore used WWM for CamemBERT by first randomly sampling 15% of the words in the sequence and then considering all subword tokens in each of these words for candidate replacement. This amounts to a proportion of selected tokens that is close to the original 15%. These tokens are then either replaced by <MASK> tokens (80%), left unchanged (10%) or replaced by a random token (10%). Subsequent work has shown that the next sentence prediction (NSP) task originally used in BERT does not improve downstream task performance (Lample and Conneau, 2019; Liu et al., 2019); we therefore do not use NSP.
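The whole-word masking procedure described above can be sketched as follows. This is an illustrative sketch rather than the fairseq implementation: the function name and arguments are hypothetical, a subword starting with "▁" is assumed to open a new word (as in SentencePiece output), and at least one word is always selected, which the text does not specify.

```python
import random

def whole_word_mask(tokens, vocab, mask_token="<MASK>", word_rate=0.15, rng=random):
    """Return (masked_tokens, target_positions) for one sequence.

    Assumption: a token starting with "▁" opens a new word, as in
    SentencePiece output. All names here are illustrative.
    """
    # Group subword indices into words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("▁") or not words:
            words.append([i])
        else:
            words[-1].append(i)

    # Sample 15% of the words (at least one, by assumption),
    # then select every subword of each sampled word.
    n_words = max(1, round(word_rate * len(words)))
    chosen = rng.sample(words, n_words)
    targets = sorted(i for word in chosen for i in word)

    masked = list(tokens)
    for i in targets:
        r = rng.random()
        if r < 0.8:
            masked[i] = mask_token          # 80%: replace with the mask token
        elif r < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: leave the token unchanged
    return masked, targets
```

Calling the function again on the same sequence at each epoch gives a different mask every time, which corresponds to the dynamic masking borrowed from RoBERTa.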
Optimisation Following (Liu et al., 2019), we optimise the model using Adam (Kingma and Ba, 2014) (β1 = 0.9, β2 = 0.98) for 100k steps. We use large batch sizes of 8192 sequences. Each sequence contains at most 512 tokens. We require each sequence to contain only complete sentences. Additionally, we used the DOC-SENTENCES scenario from (Liu et al., 2019), which consists of not mixing multiple documents in the same sequence and which showed slightly better results.
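The two constraints on sequence construction (complete sentences only, no mixing of documents) can be illustrated with a short packing routine. This is a minimal sketch with a hypothetical function name; it ignores special tokens and assumes that a sentence longer than the limit is truncated, a case the text does not address.

```python
def pack_sequences(documents, max_tokens=512):
    """Pack tokenised sentences into sequences of at most max_tokens tokens.

    documents: list of documents, each a list of sentences, each a list of tokens.
    Sequences contain only whole sentences and never span two documents.
    Assumption: a sentence longer than max_tokens is truncated.
    """
    sequences = []
    for doc in documents:
        current = []
        for sentence in doc:
            sentence = sentence[:max_tokens]
            # Start a new sequence if this sentence would not fit.
            if current and len(current) + len(sentence) > max_tokens:
                sequences.append(current)
                current = []
            current = current + sentence
        if current:
            # Flush at the document boundary so documents are never mixed.
            sequences.append(current)
    return sequences
```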
Segmentation into subword units We segment the input text into subword units using SentencePiece (Kudo and Richardson, 2018). SentencePiece is an extension of Byte-Pair encoding (BPE) (Sennrich et al., 2016) and WordPiece (Kudo, 2018) that does not require pre-tokenisation (at the word or token level), thus removing the need for language-specific tokenisers. We use a vocabulary size of 32k subword tokens. These are learned on 10^7 sentences sampled from the pretraining dataset. For simplicity, we do not use subword regularisation (i.e. sampling from multiple possible segmentations) in our implementation.
Pretraining data Pretrained language models can be significantly improved by using more data (Liu et al., 2019; Raffel et al., 2019). We therefore use French text extracted from Common Crawl; in particular, we use OSCAR (Ortiz Suárez et al., 2019), a pre-classified and pre-filtered version of the November 2018 Common Crawl snapshot. OSCAR is a set of monolingual corpora extracted from Common Crawl, specifically from the plain text WET format distributed by Common Crawl, which removes all HTML tags and converts all text encodings to UTF-8. OSCAR follows the same approach as (Grave et al., 2018) by using a language classification model based on the fastText linear classifier (Joulin et al., 2016; Grave et al., 2017) pretrained on Wikipedia, Tatoeba and SETimes, which supports 176 different languages. OSCAR performs a deduplication step after language classification and introduces no specialised filtering scheme other than keeping only paragraphs containing 100 or more UTF-8 encoded characters, which keeps OSCAR quite close to the original crawled data. We use the unshuffled version of the French OSCAR corpus, which amounts to 138GB of uncompressed text and 32.7B SentencePiece tokens.
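The two filtering steps applied after language classification can be sketched as follows. This is not the OSCAR pipeline itself: the function name is hypothetical, the language classification step is omitted, deduplication is assumed to be exact-match on whitespace-stripped text, and length is assumed to be counted in Unicode code points.

```python
def filter_paragraphs(paragraphs, min_chars=100):
    """Drop exact duplicates, then keep paragraphs of at least min_chars characters.

    Assumptions: deduplication is exact-match on whitespace-stripped text and
    length is counted in Unicode code points; the real pipeline may differ.
    """
    seen = set()
    kept = []
    for paragraph in paragraphs:
        text = paragraph.strip()
        if text in seen:
            continue          # duplicate of a paragraph already seen
        seen.add(text)
        if len(text) >= min_chars:
            kept.append(text)
    return kept
```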

Part-of-speech tagging and dependency parsing
We first evaluate CamemBERT on the two downstream tasks of part-of-speech (POS) tagging and dependency parsing. POS tagging is a low-level syntactic task, which consists in assigning to each word its corresponding grammatical category. Dependency parsing consists in predicting the labeled syntactic tree capturing the syntactic relations between words. We run our experiments using the Universal Dependencies (UD) paradigm and its corresponding UD POS tag set (Petrov et al., 2011) and UD treebank collection version 2.2 (Nivre et al., 2018), which was used for the CoNLL 2018 shared task. We perform our work on the four freely available French UD treebanks in UD v2.2: GSD, Sequoia, Spoken, and ParTUT. GSD (McDonald et al., 2013) is the second-largest treebank available for French after the FTB (described in subsection 4.2); it contains data from blogs, news articles, reviews, and Wikipedia. The Sequoia treebank (Candito and Seddah, 2012; Candito et al., 2014) comprises more than 3000 sentences from the French Europarl, the regional newspaper L'Est Républicain, the French Wikipedia and documents from the European Medicines Agency. Spoken is a corpus converted automatically from the Rhapsodie treebank (Lacheret et al., 2014; Bawden et al., 2014) with manual corrections. It consists of 57 sound samples of spoken French with orthographic transcription and phonetic transcription aligned with sound (word boundaries, syllables, and phonemes), and with syntactic and prosodic annotations. Finally, ParTUT is a conversion of a multilingual parallel treebank developed at the University of Turin, consisting of a variety of text genres, including talks, legal texts, and Wikipedia articles, among others; ParTUT data is derived from the already-existing parallel treebank Par(allel)TUT (Sanguinetti and Bosco, 2015). Table 1 contains a summary comparing the sizes of the treebanks. We evaluate the performance of our models using the standard UPOS accuracy for POS tagging, and Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS) for dependency parsing. We assume gold tokenisation and gold word segmentation as provided in the UD treebanks.
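For reference, the two parsing metrics can be computed as in the following minimal sketch. It is an illustration with a hypothetical function name, not the official CoNLL 2018 evaluation script, and it relies on the gold tokenisation assumed here so that gold and predicted words align one to one.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    gold, predicted: lists of (head_index, relation_label) tuples, one per word,
    aligned word by word (gold tokenisation is assumed).
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    # UAS: the predicted head is correct.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the predicted head and the relation label are correct.
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * head_ok / n, 100.0 * both_ok / n
```

By construction, LAS can never exceed UAS, since a labeled attachment is only counted when the head is already correct.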
Baselines To demonstrate the value of building a dedicated version of BERT for French, we first compare CamemBERT to the multilingual cased version of BERT (designated as mBERT).We then compare our models to UDify (Kondratyuk, 2019).UDify is a multitask and multilingual model based on mBERT that is near state-of-the-art on all UD languages including French for both POS tagging and dependency parsing.
It is relevant to compare CamemBERT to UDify on those tasks because UDify is the work that pushed furthest the performance of end-to-end fine-tuning of a BERT-based model on downstream POS tagging and dependency parsing. Finally, we compare our model to UDPipe Future (Straka, 2018), a model ranked 3rd in dependency parsing and 6th in POS tagging during the CoNLL 2018 shared task (Seker et al., 2018). UDPipe Future provides us with a strong baseline that does not make use of any pretrained contextual embedding. In an updated version of the paper, we will compare to the more recent cross-lingual language model XLM (Lample and Conneau, 2019), as well as to the state-of-the-art CoNLL 2018 shared task results with predicted tokenisation and segmentation.

Named Entity Recognition
Named Entity Recognition (NER) is a sequence labeling task that consists in predicting which words refer to real-world objects, such as people, locations, artifacts and organisations. We use the French Treebank (FTB) (Abeillé et al., 2003). A large proportion of the entity mentions in the treebank are multi-word entities. For NER we therefore report the 3 metrics that are commonly used to evaluate models: precision, recall, and F1 score. Here precision measures the percentage of entities found by the system that are correctly tagged, recall measures the percentage of named entities present in the corpus that are found, and the F1 score combines both precision and recall, giving a general idea of a model's performance.
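The three metrics can be made concrete with a small sketch. It assumes exact matching at the entity level (a predicted entity is correct only if its span and type both match a gold entity), which is the usual convention but is not spelled out in the text; the function name and the (start, end, type) representation are illustrative.

```python
def entity_prf(gold_entities, predicted_entities):
    """Entity-level precision, recall and F1.

    Each entity is a hashable (start, end, type) triple; a prediction counts as
    correct only if all three match a gold entity (exact-match assumption).
    """
    gold, pred = set(gold_entities), set(predicted_entities)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

Under exact matching, a multi-word entity with a single wrong boundary counts as both a missed gold entity and a spurious prediction, which is why the large share of multi-word mentions makes all three metrics informative.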
Baselines Most of the advances in NER have been achieved on English, particularly focusing on the CoNLL 2003 (Sang and Meulder, 2003) and the Ontonotes v5 (Pradhan et al., 2012; Pradhan et al., 2013) English corpora. NER was traditionally tackled using Conditional Random Fields (CRF) (Lafferty et al., 2001), which are well suited to the task; CRFs were later used as decoding layers for Bi-LSTM architectures (Huang et al., 2015; Lample et al., 2016), showing considerable improvements over CRFs alone. These Bi-LSTM-CRF architectures were later enhanced with contextualised word embeddings, which yet again brought major improvements to the task (Peters et al., 2018; Akbik et al., 2018). Finally, large pretrained architectures set the current state of the art, showing a small yet important improvement over previous NER-specific architectures (Devlin et al., 2019; Baevski et al., 2019).
In non-English NER, the CoNLL 2002 shared task included NER corpora for Spanish and Dutch (Sang, 2002), while CoNLL 2003 included a German corpus (Sang and Meulder, 2003). Here the recent efforts of Straková et al. (2019) set the state of the art for Spanish and Dutch, while Akbik et al. (2018) did so for German.
In French, no extensive work has been done due to the limited availability of NER corpora. We compare our model with the strong baselines set by Dupont (2018), who trained both CRF and BiLSTM-CRF architectures on the FTB and enhanced them using heuristics and pretrained word embeddings.

Natural Language Inference
We also evaluate our model on the Natural Language Inference (NLI) task, using the French part of the XNLI dataset (Conneau et al., 2018). NLI consists in predicting whether a hypothesis sentence is entailed by, neutral with respect to, or in contradiction with a premise sentence. The XNLI dataset is the extension of the Multi-Genre NLI (MultiNLI) corpus (Williams et al., 2018) to 15 languages, obtained by translating the validation and test sets manually into each of those languages. The English training set is also machine translated for all languages. The dataset is composed of 122k train, 2490 valid and 5010 test examples. As usual, NLI performance is evaluated using accuracy.
To evaluate a model on a language other than English (such as French), we consider the two following settings:
• TRANSLATE-TEST: The French test set is machine translated into English, and then used with an English classification model. This setting provides a reasonable, although imperfect, way to circumvent the fact that no such data set exists for French, and results in very strong baseline scores.
• TRANSLATE-TRAIN: The French model is finetuned on the machine-translated English training set and then evaluated on the French test set.This is the setting that we used for CamemBERT.
Baselines For the TRANSLATE-TEST setting, we report results of the English RoBERTa to act as a reference.
In the TRANSLATE-TRAIN setting, we report the best scores from previous literature along with ours. BiLSTM-max is the best model in the original XNLI paper, the mBERT results for French are those reported by Wu and Dredze (2019), and XLM (MLM+TLM) is the best model presented in (Conneau et al., 2018).

Experiments
In this section, we measure the performance of CamemBERT by evaluating it on the four aforementioned tasks: POS tagging, dependency parsing, NER and NLI.

Experimental Setup
Pretraining We use the RoBERTa implementation in the fairseq library (Ott et al., 2019). Given our large batch size (8192), our learning rate is warmed up for 10k steps up to a peak value of 0.0007 instead of the original 0.0001. The learning rate then fades to zero with polynomial decay. We pretrain our model on 256 Nvidia V100 GPUs (32GB each) for 100k steps, which takes 17 hours.
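The shape of this learning rate schedule can be sketched as a function of the training step. The sketch makes two assumptions that the text does not confirm: the warm-up is linear from zero, and the polynomial decay has power 1.0 (that is, linear decay). The function name is illustrative and this is not the fairseq scheduler itself.

```python
def learning_rate(step, peak=7e-4, warmup_steps=10_000, total_steps=100_000, power=1.0):
    """Warm-up to `peak`, then polynomial decay to zero at `total_steps`.

    Assumptions: linear warm-up from zero and power=1.0 (linear decay);
    the text only says "warmed up" and "polynomial decay".
    """
    if step < warmup_steps:
        return peak * step / warmup_steps
    # Fraction of the decay phase that is still ahead, clipped at zero.
    remaining = max(0, total_steps - step) / (total_steps - warmup_steps)
    return peak * remaining ** power
```

With these settings the rate peaks at step 10k and is back to half the peak value at step 55k, midway through the decay phase.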
Fine-tuning For each task, we append the relevant predictive layer on top of CamemBERT's Transformer architecture. Following the work done on BERT (Devlin et al., 2019), for sequence classification and sequence labeling we append a linear layer to the <s> special token and to the first subword token of each word, respectively. For dependency parsing, we plug in a bi-affine graph predictor head inspired by Dozat and Manning (2017), following the work done on multilingual parsing with BERT by Kondratyuk (2019). We refer the reader to these two articles for more details on this module.
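Attaching a word-level prediction layer to the first subword of each word requires knowing where words start in the subword sequence. A minimal sketch of that bookkeeping is given below; it assumes SentencePiece-style pieces in which a word-initial piece starts with "▁" and skips the <s> and </s> special tokens. The function name is hypothetical and this is not the code used in our experiments.

```python
def first_subword_indices(tokens, word_start_marker="▁"):
    """Indices of the first subword of each word in a SentencePiece-style sequence.

    Assumption: pieces that open a word start with "▁"; the special tokens
    <s> and </s> are skipped.
    """
    specials = {"<s>", "</s>"}
    return [i for i, tok in enumerate(tokens)
            if tok not in specials and tok.startswith(word_start_marker)]
```

The hidden states at the returned positions would then be fed to the linear layer, giving one prediction per word.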
We fine-tune independently CamemBERT for each task and each dataset.We optimise the model using the Adam optimiser (Kingma and Ba, 2014) with a fixed learning rate.
We run a grid search on a combination of learning rates and batch sizes. We select the best model on the validation set out of the first 30 epochs. For all tasks except NLI, we do not apply any regularisation techniques such as weight decay, learning rate warmup or discriminative fine-tuning, although these might push performance even further. We show that fine-tuning CamemBERT in a straightforward manner leads to state-of-the-art results on most tasks and outperforms the existing BERT-based models in most cases. The POS tagging, dependency parsing, and NER experiments are run using Hugging Face's Transformers library (Wolf et al., 2019), extended to support CamemBERT and dependency parsing. The NLI experiments use the fairseq library following the RoBERTa implementation.

Results
Part-of-Speech tagging and dependency parsing For POS tagging and dependency parsing, we compare CamemBERT to three other near state-of-the-art models in Table 2. CamemBERT outperforms UDPipe Future by a large margin for all treebanks and all metrics. Despite a much simpler optimisation process, CamemBERT beats UDify on all the available French treebanks. CamemBERT also demonstrates higher performance than mBERT on those tasks. We observe a larger error reduction for parsing than for tagging. For POS tagging, we observe error reductions of 0.71% for GSD, 0.81% for Sequoia, 0.7% for Spoken and 0.28% for ParTUT. For parsing, we observe error reductions in LAS of 2.96% for GSD, 3.33% for Sequoia, 1.70% for Spoken and 1.65% for ParTUT.
Natural Language Inference: XNLI On the XNLI benchmark, CamemBERT obtains improved performance over multilingual language models in the TRANSLATE-TRAIN setting (81.2 vs. 80.2 for XLM) while using less than half the parameters (110M vs. 250M).
Previous work with mBERT showed increased performance in NER for German, Dutch and Spanish when it is used as a contextualised word embedding for an NER-specific model (Straková et al., 2019). Our results, however, suggest that the multilingual setting in which mBERT was trained is simply not enough to use it alone and fine-tune it for French NER: it performs worse than even simple CRF models, suggesting that monolingual models could be better at NER.

Discussion
CamemBERT displays improved performance compared to prior work for the 4 downstream tasks considered. This confirms the hypothesis that pretrained language models can be effectively fine-tuned for various downstream tasks, as observed for English in previous work. Moreover, our results also show that dedicated monolingual models still outperform multilingual ones. We explain this point in two ways. First, the scale of data is possibly essential to the performance of CamemBERT. Indeed, we use 138GB of uncompressed text vs. 57GB for mBERT. Second, with more data comes more diversity in the pretraining distribution. Reaching state-of-the-art performance on 4 different tasks and 6 different datasets requires robust pretrained models. Our results suggest that the variability in the downstream tasks and datasets considered is handled more efficiently by a general language model than by Wikipedia-pretrained models such as mBERT.

Conclusion
CamemBERT improves the state of the art for multiple downstream tasks in French. It is also lighter than other BERT-based approaches such as mBERT or XLM. By releasing our model, we hope that it can serve as a strong baseline for future research in French NLP, and we expect our experiments to be reproduced in many other languages. We will publish an updated version in the near future in which we will explore and release models trained for longer, with additional downstream tasks, baselines (e.g. XLM) and analysis. We will also train additional models with potentially cleaner corpora such as CCNet (Wenzek et al., 2019) for more accurate performance evaluation and more complete ablation.

Table 1: Sizes in number of tokens, words and phrases of the 4 treebanks used in the evaluations of POS tagging and dependency parsing.

Table 2: Final POS and dependency parsing scores of CamemBERT, mBERT (fine-tuned in the exact same conditions as CamemBERT) and UDify (as reported in the original paper) on 4 French treebanks (French GSD, Spoken, Sequoia and ParTUT), reported on test sets (4 averaged runs) assuming gold tokenisation. Best scores in bold, second to best underlined.

Table 3: Accuracy of models for French on the XNLI test set. Best scores in bold, second to best underlined.

Table 4: Results for NER on the FTB. Best scores in bold, second to best underlined.
Named-Entity Recognition For named entity recognition, our experiments show that CamemBERT achieves a slightly better precision than the traditional CRF-based SEM architectures described above in Section 4.2 (CRF and Bi-LSTM+CRF), but shows a dramatic improvement in finding entity mentions, raising the recall score by 3.5 points. Both improvements result in a 2.36 point increase in the F1 score with respect to the best SEM architecture (BiLSTM-CRF), giving CamemBERT the state of the art for NER on the FTB. One other important finding is the results obtained by mBERT.