A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages

We use the multilingual OSCAR corpus, extracted from Common Crawl via language classification, filtering and cleaning, to train monolingual contextualized word embeddings (ELMo) for five mid-resource languages. We then compare the performance of OSCAR-based and Wikipedia-based ELMo embeddings for these languages on the part-of-speech tagging and parsing tasks. We show that, despite the noise in the Common-Crawl-based OSCAR data, embeddings trained on OSCAR perform much better than monolingual embeddings trained on Wikipedia. They actually equal or improve the current state of the art in tagging and parsing for all five languages. In particular, they also improve over multilingual Wikipedia-based contextual embeddings (multilingual BERT), which almost always constitutes the previous state of the art, thereby showing that the benefit of a larger, more diverse corpus surpasses the cross-lingual benefit of multilingual embedding architectures.


Introduction
One of the key elements that has pushed the state of the art considerably in neural NLP in recent years has been the introduction and spread of transfer learning methods to the field. These methods can normally be classified into two categories according to how they are used:
• Feature-based methods, which involve pretraining real-valued vectors ("embeddings") at the word, sentence, or paragraph level, and using them in conjunction with a specific architecture for each individual downstream task.
• Fine-tuning methods, which introduce a minimal number of task-specific parameters, and instead copy the weights from a pre-trained network and then tune them to a particular downstream task.
Embeddings or language models can be divided into fixed, meaning that they generate a single representation for each word in the vocabulary; and contextualized, meaning that a representation is generated based on both the word and its surrounding context, so that a single word can have multiple representations, each one depending on how it is used.
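To make the fixed/contextualized distinction concrete, here is a toy sketch (the vectors and the context rule are invented purely for illustration and do not correspond to any real model): a fixed embedding is a plain lookup table, while a contextualized embedding is a function of both the word and its sentence.

```python
# Toy illustration only: invented 2-d vectors, not a trained model.
fixed = {"bank": [0.1, 0.9]}  # fixed embedding: one vector per word

def contextualized(word, sentence):
    # Contextualized embedding: the representation depends on the
    # surrounding context, so "bank" near "river" gets a different
    # vector than "bank" near "savings".
    base = fixed[word]
    if "river" in sentence:
        return [base[0] + 1.0, base[1]]
    return base

print(contextualized("bank", ["river", "bank"]))    # shifted vector
print(contextualized("bank", ["savings", "bank"]))  # base vector
```

A real contextualized model replaces the hand-written rule with a neural network over the whole sentence, but the interface is the same: one vector per occurrence rather than one vector per vocabulary entry.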
In practice, most fixed embeddings are used as feature-based models. The most notable examples are word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText, all of which are extensively used in a wide variety of applications nowadays. On the other hand, contextualized word representations and language models have been developed using both feature-based architectures, the most notable examples being ELMo and Flair (Peters et al., 2018; Akbik et al., 2018), and transformer-based architectures, which are commonly used in a fine-tuning setting, as is the case of GPT-1 and GPT-2 (Radford et al., 2018, 2019), BERT and its derivatives (e.g. Lan et al., 2019), and more recently T5 (Raffel et al., 2019). All of them have repeatedly improved the state of the art in many downstream NLP tasks over the last year.
In general, the main advantage of using language models is that they are mostly built in an unsupervised manner and can be trained with raw, unannotated plain text. Their main drawback is that enormous quantities of data seem to be required to train them properly, especially in the case of contextualized models, for which larger corpora are thought to be needed to properly address polysemy and cover the wide range of uses that commonly exist within languages.
For gathering data in a wide range of languages, Wikipedia is a commonly used option. It has been used to train fixed embeddings (Al-Rfou et al., 2013) and, more recently, the multilingual BERT, hereafter mBERT. However, for some languages, Wikipedia might not be large enough to train good quality contextualized word embeddings. Moreover, Wikipedia data all belong to the same specific genre and style. To address this problem, one can resort to text crawled from the internet, the largest and most widespread dataset of crawled text being Common Crawl. Such an approach generally solves the quantity and genre/style coverage problems but might introduce noise in the data, an issue which has earned the corpus some criticism, most notably by Trinh and Le (2018) and Radford et al. (2019). Using Common Crawl also leads to data management challenges, as the corpus is distributed in the form of a large set of plain text files, each containing a large quantity of unclassified multilingual documents from different websites.
In this paper we study the trade-off between quantity and quality of data for training contextualized representations. To this end, we use the OSCAR corpus (Ortiz Suárez et al., 2019), a freely available multilingual dataset obtained by performing language classification, filtering and cleaning of the whole Common Crawl corpus. OSCAR was created by reproducing a previously proposed pipeline, with a simple improvement on its filtering method. We then train OSCAR-based and Wikipedia-based ELMo contextualized word embeddings (Peters et al., 2018) for 5 languages: Bulgarian, Catalan, Danish, Finnish and Indonesian. We evaluate the models by attaching them to the UDPipe 2.0 architecture (Straka, 2018) for dependency parsing and part-of-speech (POS) tagging. We show that the models using the OSCAR-based ELMo embeddings consistently outperform the Wikipedia-based ones, suggesting that big high-coverage noisy corpora might be better than small high-quality narrow-coverage corpora for training contextualized language representations. We also establish a new state of the art for both POS tagging and dependency parsing in 6 different treebanks covering all 5 languages.
The structure of the paper is as follows. In Section 2 we describe the recent related work. In Section 3 we present, compare and analyze the corpora used to train our contextualized embeddings, and the treebanks used to train our POS tagging and parsing models. In Section 4 we examine and describe in detail the model used for our contextualized word representations, as well as the parser and the tagger we chose to evaluate the impact of corpora in the embeddings' performance in downstream tasks. Finally we provide an analysis of our results in Section 5 and in Section 6 we present our conclusions.

Related work
Since the introduction of word2vec (Mikolov et al., 2013), many attempts have been made to create multilingual language representations. For fixed word embeddings, the most notable works are those of Al-Rfou et al. (2013), who created word embeddings for a large number of languages using Wikipedia, and the later fastText word embeddings trained for 157 languages using Common Crawl, whose authors in fact showed that using crawled data significantly increases the performance of the embeddings, especially for mid- to low-resource languages.
Regarding contextualized models, the most notable non-English contribution has been that of mBERT, which is distributed as (i) a single multilingual model for 100 different languages trained on Wikipedia data, and (ii) a single multilingual model for both Simplified and Traditional Chinese. Four monolingual fully trained ELMo models have been distributed for Japanese, Portuguese, German and Basque; 44 monolingual ELMo models were also released by the HIT-SCIR team (Che et al., 2018) during the CoNLL 2018 Shared Task (Zeman et al., 2018), but their training sets were capped at 20 million words. A German BERT (Chan et al., 2019) as well as a French BERT model (called CamemBERT) (Martin et al., 2019) have also been released. In general, no particular effort to create a set of high-quality monolingual contextualized representations has been made yet, at least not on a scale comparable to what was done for fixed word embeddings.
For dependency parsing and POS tagging, the most notable non-English-specific contribution is that of the CoNLL 2018 Shared Task (Zeman et al., 2018), where the 1st place (LAS ranking) was awarded to the HIT-SCIR team (Che et al., 2018), who used the Deep Biaffine parser and an extension of it, coupled with deep contextualized ELMo embeddings (Peters et al., 2018) (capping the training set at 20 million words). The 1st place in universal POS tagging was awarded to Smith et al. (2018), who used two separate instances of an existing neural tagger.
More recent developments in POS tagging and parsing include work that couples another CoNLL 2018 Shared Task participant, UDPipe 2.0 (Straka, 2018), with mBERT, greatly improving the scores of the original model, and UDify (Kondratyuk and Straka, 2019), which adds an extra attention layer on top of mBERT, plus a deep bi-affine attention layer for dependency parsing and a Softmax layer for POS tagging. UDify is actually trained by concatenating the training sets of 124 different UD treebanks, creating a single POS tagging and dependency parsing model that works across 75 different languages.

Corpora
We train ELMo contextualized word embeddings for 5 languages: Bulgarian, Catalan, Danish, Finnish and Indonesian. We train one set of embeddings using only Wikipedia data, and another set using only Common-Crawl-based OSCAR data. We chose these languages primarily because they are morphologically and typologically different from one another, but also because all of the OSCAR datasets for these languages were of a sufficiently manageable size that the ELMo pre-training was feasible in less than one month. Contrary to the HIT-SCIR team (Che et al., 2018), we do not impose any cap on the amount of data, and instead use the entirety of Wikipedia or OSCAR for each of our 5 chosen languages.

Wikipedia
Wikipedia is the biggest online multilingual open encyclopedia, comprising more than 40 million articles in 301 different languages. Because articles are curated by language and written in an open collaboration model, its text tends to be of very high quality in comparison to other free online resources. This is why Wikipedia has been extensively used in various NLP applications (Wu and Weld, 2010; Mihalcea, 2007; Al-Rfou et al., 2013). We downloaded the XML Wikipedia dumps (from April 4, 2019) and extracted the plain text from them using Giuseppe Attardi's wikiextractor.py script. We present the number of words and tokens available for each of our 5 languages in Table 1. We decided against deduplicating the Wikipedia data as the corpora are already quite small. We tokenize the 5 corpora using UDPipe (Straka and Straková, 2017).

OSCAR
Common Crawl is a non-profit organization that produces and maintains an open, freely available repository of crawled data from the web. Common Crawl's complete archive consists of petabytes of monthly snapshots collected since 2011. Common Crawl snapshots are not classified by language, and contain a certain level of noise (e.g. one-word "sentences" such as "OK" and "Cancel" are unsurprisingly very frequent). This is what motivated the creation of the freely available multilingual OSCAR corpus (Ortiz Suárez et al., 2019), extracted from the November 2018 snapshot, which amounts to more than 20 terabytes of plain text. In order to create OSCAR from this Common Crawl snapshot, Ortiz Suárez et al. (2019) reproduced a previously proposed pipeline to process, filter and classify Common Crawl. More precisely, language classification was performed using the fastText linear classifier (Joulin et al., 2016), which was trained to recognize 176 languages and was shown to have an extremely good accuracy to processing time trade-off. The original filtering step consisted in only keeping the lines exceeding 100 bytes in length. However, considering that Common Crawl is a multilingual UTF-8 encoded corpus, this 100-byte threshold creates a huge disparity between ASCII and non-ASCII encoded languages. The filtering step used to create OSCAR therefore consisted in only keeping the lines containing at least 100 UTF-8-encoded characters. Finally, the OSCAR corpus is deduplicated, i.e. for each language, only one occurrence of a given line is included.
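The difference between the two filtering criteria can be sketched in a few lines of Python (a minimal illustration of the thresholds described above, not the actual OSCAR pipeline):

```python
# Contrast the original 100-byte filter with OSCAR's 100-character filter.

def keep_line_bytes(line: str, threshold: int = 100) -> bool:
    # Original filter: keep lines exceeding `threshold` bytes once
    # UTF-8 encoded.
    return len(line.encode("utf-8")) > threshold

def keep_line_chars(line: str, threshold: int = 100) -> bool:
    # OSCAR filter: keep lines containing at least `threshold` characters,
    # regardless of how many bytes each character takes in UTF-8.
    return len(line) >= threshold

# A 60-character Cyrillic line occupies ~120 bytes in UTF-8, so the
# byte-based filter keeps it while the character-based one drops it;
# a 60-character ASCII line is rejected by both.
cyrillic = "ф" * 60
ascii_line = "a" * 60
print(keep_line_bytes(cyrillic), keep_line_chars(cyrillic))
print(keep_line_bytes(ascii_line), keep_line_chars(ascii_line))
```

This is exactly the disparity mentioned above: for non-ASCII scripts a byte threshold accepts much shorter lines than a character threshold does.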
As we did for Wikipedia, we tokenize OSCAR corpora for the 5 languages we chose for our study using UDPipe. Table 2 provides quantitative information about the 5 resulting tokenized corpora.
We note that the original Common-Crawl-based corpus created to train the fastText embeddings is not freely available. Since we ran the experiments described in this paper, a new pipeline for creating Common-Crawl-based corpora, named CCNet (Wenzek et al., 2019), has been published; although it includes specialized filtering which might result in a cleaner corpus than OSCAR, the resulting CCNet corpus itself was not published. We thus chose to keep OSCAR, as it remains the only very large scale, Common-Crawl-based corpus currently available and easily downloadable.

Noisiness
We wanted to address the criticisms of Common Crawl voiced by Trinh and Le (2018) and Radford et al. (2019), so we devised a simple method to measure how noisy the OSCAR corpora are for our 5 languages. We randomly extract a number of lines from each corpus, such that the resulting random sample contains one million words; we remove tokens that are capitalized or contain fewer than 4 UTF-8-encoded characters, which limits the bias against Wikipedia, a resource that traditionally contains a large quantity of proper nouns and acronyms. We then test whether the words appear in the corresponding GNU Aspell (http://aspell.net/) dictionary. We repeat this task for each of the 5 languages, for both the OSCAR and the Wikipedia corpora. We compile in Table 3 the number of out-of-vocabulary tokens for each corpus. As expected, this simple metric shows that the OSCAR samples generally contain more out-of-vocabulary words than the Wikipedia ones. However, the difference in magnitude between the two is strikingly lower than one would expect in view of the criticisms by Trinh and Le (2018) and Radford et al. (2019), thereby validating the usability of Common Crawl data when it is properly filtered, as was achieved by the OSCAR creators. We even observe that, for Danish, the number of out-of-vocabulary words in OSCAR is lower than in Wikipedia.
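The sampling and dictionary-lookup procedure can be sketched as follows (a simplified stand-in: the real experiment queried the GNU Aspell dictionaries, which we replace here with a plain Python set, and the token counts are tiny for illustration):

```python
import random

def sample_words(lines, target_words, min_chars=4, seed=0):
    """Draw lines at random and collect tokens until `target_words` words
    are gathered, discarding capitalized tokens and tokens shorter than
    `min_chars` characters (a rough guard against the proper nouns and
    acronyms that are frequent in Wikipedia)."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    words = []
    for line in shuffled:
        for tok in line.split():
            if tok[:1].isupper() or len(tok) < min_chars:
                continue
            words.append(tok)
        if len(words) >= target_words:
            break
    return words[:target_words]

def oov_rate(words, dictionary):
    # Fraction of sampled words absent from the dictionary: the
    # out-of-vocabulary proxy for noisiness used in Table 3.
    return sum(1 for w in words if w not in dictionary) / len(words)
```

In the actual study the sample target is one million words per corpus and the dictionary is the language-specific Aspell word list.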

Experimental Setting
The main goal of this paper is to show the impact of training data on contextualized word representations when applied to particular downstream tasks. To this end, we train different versions of Embeddings from Language Models (ELMo) (Peters et al., 2018) on both the Wikipedia and OSCAR corpora, for each of our 5 selected languages. We save the models' weights after different numbers of epochs for each language, in order to test how corpus size affects the embeddings and to see whether and when overfitting happens when training ELMo on smaller corpora.
We take each of the trained ELMo models and use them in conjunction with the UDPipe 2.0 (Straka, 2018) architecture for dependency parsing and POS tagging. We train UDPipe 2.0 using gold tokenization and segmentation for each of our ELMo models; the only thing that changes from one training run to another is the ELMo model, as the hyperparameters always remain at their default values (except for the number of training tokens) (Peters et al., 2018).

Contextualized word embeddings
Embeddings from Language Models (ELMo) (Peters et al., 2018) is an LSTM-based language model. More precisely, it uses a bidirectional language model, which combines a forward and a backward LSTM-based language model. ELMo also computes a context-independent token representation via a CNN over characters.
We train ELMo models for Bulgarian, Catalan, Danish, Finnish and Indonesian using the OSCAR corpora on the one hand and the Wikipedia corpora on the other. We train each model for 10 epochs, as was done for the original English ELMo (Peters et al., 2018). We save checkpoints at the 1st, 3rd and 5th epochs in order to investigate concerns about possible overfitting on smaller corpora (Wikipedia in this case) raised by the original ELMo authors.

UDPipe 2.0
For our POS tagging and dependency parsing evaluation, we use UDPipe 2.0, which has a freely available, ready-to-use implementation. This architecture was submitted as a participant in the CoNLL 2018 Shared Task (Zeman et al., 2018), obtaining 3rd place in the LAS ranking. UDPipe 2.0 is a multi-task model that predicts POS tags, lemmas and dependency trees jointly.
The original UDPipe 2.0 implementation computes 3 different embeddings, namely:
• Pre-trained word embeddings: In the original implementation, the Wikipedia version of the fastText embeddings is used; we replace them with the newer Common-Crawl-based fastText embeddings.
• Trained word embeddings: Randomly initialized word representations that are trained with the rest of the network.
• Character-level word embeddings: Computed using bidirectional GRUs of dimension 256.

Once the embedding step is completed, the concatenation of all vector representations for a word is fed to two shared bidirectional LSTM (Hochreiter and Schmidhuber, 1997) layers. The output of these two BiLSTMs is then fed to two separate task-specific modules:
• The tagger- and lemmatizer-specific bidirectional LSTMs, with Softmax classifiers on top, which process their output and generate the UPOS, XPOS, UFeats and Lemmas. The lemma classifier also takes the character-level word embeddings as input.
• The parser-specific bidirectional LSTM layer, whose output is then passed to a bi-affine attention layer that produces labeled dependency trees.
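For intuition, the embedding step can be summarized with simple dimension bookkeeping (our reading of the architecture; every size below except the 256-per-direction character GRUs is an illustrative assumption rather than the official configuration):

```python
def input_dim(pretrained=300,   # fastText vectors (commonly 300-d)
              trained=512,      # randomly initialized trained embeddings
              char=2 * 256,     # bidirectional character GRUs, 256 each way
              contextual=1024): # ELMo vector, when attached as in our setup
    # The per-word input to the two shared BiLSTM layers is simply the
    # concatenation of all available representations.
    return pretrained + trained + char + contextual

print(input_dim())              # with the illustrative sizes above
print(input_dim(contextual=0))  # baseline configuration without ELMo
```

The key design point is that the contextual ELMo vector is just one more component of this concatenation, which is why swapping in embeddings trained on different corpora requires no change to the rest of the network.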

Treebanks
To train the selected parser and tagger (cf. Section 4.2) and evaluate the pre-trained language models in our 5 languages, we run our experiments using the Universal Dependencies (UD) paradigm and its corresponding UD POS tag set (Petrov et al., 2012). We use all the treebanks available for our five languages in version 2.2 of the UD treebank collection (Nivre et al., 2018), which was used for the CoNLL 2018 Shared Task; we thus perform our evaluation tasks on 6 different treebanks (see Table 4 for treebank size information).
• Bulgarian BTB: Created at the Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, it consists of legal documents, news articles and fiction pieces.
• Catalan-AnCora: Built on top of the Spanish-Catalan AnCora corpus (Taulé et al., 2008), it contains mainly news articles.
• Indonesian-GSD: Includes mainly blog entries and news articles.

Parsing and POS tagging results
We use UDPipe 2.0 without contextualized embeddings as our baseline for POS tagging and dependency parsing. However, we did not train the model without contextualized word embeddings ourselves; we instead take the scores as they are reported in Kondratyuk and Straka (2019). We also compare our UDPipe 2.0 + ELMo models against the state-of-the-art results (assuming gold tokenization) for these languages, which are obtained either by UDify (Kondratyuk and Straka, 2019) or by UDPipe 2.0 + mBERT. Results for UPOS, UAS and LAS are shown in Table 5. We achieve the state of the art on all three metrics in each of the languages with the UDPipe 2.0 + ELMo OSCAR models. We also see that in every single case the UDPipe 2.0 + ELMo OSCAR result surpasses the UDPipe 2.0 + ELMo Wikipedia one, suggesting that the size of the pre-training data plays an important role in downstream task results. This also supports our hypothesis that the OSCAR corpora, being multi-domain, exhibit a better coverage of the different styles, genres and uses present at least in these 5 languages.
Taking a closer look at the results for Danish, we see that ELMo Wikipedia, which was trained on a mere 300MB corpus, does not show any sign [...] that are closely related to high-resource languages, it might also significantly degrade the representations for more isolated, or simply more morphologically rich, languages like Finnish. In contrast, our monolingual approach with UDPipe 2.0 + ELMo OSCAR improves the previous state of the art considerably, by more than 2 points for some metrics. Note however that Indonesian, which might also be seen as a relatively isolated language, does not behave in the same way as Finnish.

Impact of the number of training epochs
An important topic we wanted to address with our experiments was that of overfitting and of the number of epochs for which one should train the contextualized embeddings. The ELMo authors have stated that increasing the number of training epochs is generally better, arguing that training the ELMo model for longer reduces held-out perplexity and further improves downstream task performance. This is why we intentionally fully pre-trained the ELMo Wikipedia models to the 10 epochs of the original ELMo paper, as its authors also expressed concern over the possibility of overfitting on smaller corpora. We thus save checkpoints for each of our ELMo models at the 1, 3, 5 and 10 epoch marks so that we can properly probe for overfitting. The scores of all checkpoints are reported in Table 6. Here again we do not train the UDPipe 2.0 baselines without embeddings ourselves; we simply report the scores published in Kondratyuk and Straka (2019).
The first striking finding is that even though all our Wikipedia datasets are smaller than 1GB in size (except for Catalan), none of the ELMo Wikipedia models shows any sign of overfitting: the results continue to improve on all metrics the longer we train the ELMo models, with the best results consistently being those of the fully trained 10-epoch ELMos. For all of our Wikipedia models except those for Catalan and Indonesian, we see sub-baseline results at 1 epoch; training the model for longer is better, even if the corpora are small.
The ELMo OSCAR models exhibit exactly the same behavior as the ELMo Wikipedia models, with scores that continue to improve the longer they are pre-trained, except in the case of Finnish. Here we actually see an unexpected behavior in which model performance plateaus around the 3rd to 5th epoch. This is surprising because the Finnish OSCAR corpus is more than 20 times bigger than our smallest Wikipedia corpus, the Danish Wikipedia, which did not exhibit this behavior. As previously mentioned, Finnish is morphologically richer than the other languages for which we trained ELMo; we hypothesize that the representation space given by the ELMo embeddings might not be big enough to extract more features from the Finnish OSCAR corpus beyond the 5th epoch mark. Testing this hypothesis, however, would require training a larger language model such as BERT, which is sadly beyond our computing infrastructure limits (cf. Subsection 5.3). We do note that pre-training our current language model architectures on a morphologically rich language like Finnish might actually better expose the limits of our existing approaches to language modeling.
One last thing that is important to note with respect to the number of training epochs is that even though we fully pre-trained our ELMo Wikipedia and ELMo OSCAR models to the recommended 10 epoch mark and then compared them against one another, the number of training steps between the two pre-trained models differs drastically due to the big difference in corpus size (for Indonesian, for instance, 10 epochs correspond to 78K steps for ELMo Wikipedia and to 2.6M steps for ELMo OSCAR; the complete picture is provided in Table 8 in the Appendix). In fact, we can see in Table 6 that all the UDPipe 2.0 + ELMo OSCAR(1) models perform better than the UDPipe 2.0 + ELMo Wikipedia(1) models across all metrics. We thus believe that talking in terms of training steps as opposed to training epochs might be a more transparent way of comparing two pre-trained models.
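The step-vs-epoch comparison can be made concrete with a back-of-the-envelope conversion (the corpus sizes and tokens-per-step figure below are illustrative assumptions, not our exact training configuration):

```python
def epochs_to_steps(corpus_tokens: int, tokens_per_step: int, epochs: int) -> int:
    """Approximate optimizer steps for a given number of passes over the
    corpus, assuming a fixed number of tokens consumed per step."""
    return (corpus_tokens // tokens_per_step) * epochs

# Two models "trained for 10 epochs" can differ by orders of magnitude in
# steps when their corpora do: e.g. a ~100M-token Wikipedia vs. a
# multi-billion-token OSCAR corpus (sizes here are made up for illustration).
wiki_steps = epochs_to_steps(100_000_000, 12_800, 10)
oscar_steps = epochs_to_steps(3_300_000_000, 12_800, 10)
print(oscar_steps // wiki_steps)  # the OSCAR model takes ~33x more steps
```

With equal epoch counts, the model trained on the larger corpus simply sees far more gradient updates, which is why step counts are the fairer unit of comparison.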

Computational cost and carbon footprint
Considering the discussion above, we believe an interesting follow-up to our experiments would be training the ELMo models for more of the languages included in the OSCAR corpus. However training ELMo is computationally costly, and one way to estimate this cost, as pointed out by Strubell et al. (2019), is by using the training times of each model to compute both power consumption and CO 2 emissions.
In our setup we used two different machines, each one having 4 NVIDIA GeForce GTX 1080 Ti graphics cards and 128GB of RAM, the difference between the machines being that one uses a single Intel Xeon Gold 5118 processor, while the other uses two Intel Xeon E5-2630 v4 processors. One GeForce GTX 1080 Ti card is rated at around 250W. We follow Strubell et al. (2019) in order to compute the total power required to train one ELMo model:

p_t = 1.58 t (c·p_c + p_r + g·p_g) / 1000

where t is the total training time in hours, c and g are the number of CPUs and GPUs respectively, p_c is the average power draw (in Watts) from all CPU sockets, p_r the average power draw from all DRAM sockets, and p_g the average power draw of a single GPU. We estimate the total power consumption by adding GPU, CPU and DRAM consumptions, and then multiplying by the Power Usage Effectiveness (PUE), which accounts for the additional energy required to support the compute infrastructure; we use a PUE coefficient of 1.58, the 2018 global average for data centers (Strubell et al., 2019). In Table 7 we report the training times in both hours and days, as well as the total power draw (in Watts) of the system used to train each individual ELMo model. We use this information to compute the total power consumption of each ELMo model, also reported in Table 7. We can further estimate the CO2 emissions in kilograms of each single model by multiplying the total power consumption by the average CO2 emissions per kWh in France (where the models were trained). According to RTE (Réseau de transport d'électricité / Electricity Transmission Network, https://www.rte-france.com/fr/eco2mix/eco2mix-co2), the average emissions were around 51g of CO2 per kWh in November 2019, when the models were trained. Thus the total CO2 emissions in kg for one single model can be computed as:

CO2e = 0.051 p_t

All emissions for the ELMo models are also reported in Table 7.
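The estimate can be written as a small helper, assuming the Strubell et al. (2019) formula and the constants given in this section (the wattage figures in the example call are placeholders, not measurements from our machines):

```python
PUE = 1.58              # 2018 global average PUE for data centers
CO2_KG_PER_KWH = 0.051  # French grid average, November 2019 (RTE)

def total_power_kwh(hours, n_cpus, p_cpu, p_dram, n_gpus, p_gpu):
    """p_t = 1.58 * t * (c*p_c + p_r + g*p_g) / 1000, in kWh.

    p_cpu / p_dram are average draws (in W) across all CPU / DRAM
    sockets, p_gpu the average draw of a single GPU."""
    return PUE * hours * (n_cpus * p_cpu + p_dram + n_gpus * p_gpu) / 1000.0

def co2_kg(p_t_kwh):
    # CO2e = 0.051 * p_t, with p_t in kWh.
    return CO2_KG_PER_KWH * p_t_kwh

# Example: 10 days (240 h) on one CPU at 105 W, 50 W of DRAM draw, and
# 4 GPUs at 250 W each; the draws are illustrative placeholders.
p_t = total_power_kwh(240, 1, 105, 50, 4, 250)  # ≈ 438 kWh, ≈ 22 kg CO2
```

The GPU term dominates in any multi-GPU setup, so training-time reductions translate almost linearly into emission reductions.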
We do not report the power consumption or the carbon footprint of training the UDPipe 2.0 architecture, as each model took less than 4 hours to train on a machine using a single NVIDIA Tesla V100 card. Also, this machine was shared during training time, so it would be extremely difficult to accurately estimate the power consumption of these models.
Even though it would have been interesting to replicate all our experiments and computational cost estimations with state-of-the-art fine-tuning models such as BERT, XLNet, RoBERTa or ALBERT, we recall that these transformer-based architectures are extremely costly to train, as noted by the BERT authors on the official BERT GitHub repository (https://github.com/google-research/bert), and are currently beyond the scope of our computational infrastructure. However, we believe that ELMo contextualized word embeddings remain a useful model that still provides an extremely good trade-off between performance and training cost, even setting new state-of-the-art scores in parsing and POS tagging for our five chosen languages and performing better than the multilingual mBERT model.

Conclusions
In this paper, we have explored the use of the Common-Crawl-based OSCAR corpora to train ELMo contextualized embeddings for five typologically diverse mid-resource languages. We have compared them with Wikipedia-based ELMo embeddings on two classical NLP tasks, POS tagging and parsing, using state-of-the-art neural architectures. Our goal was to explore whether the noisiness of Common Crawl data, often invoked to criticize the use of such data, could be compensated by its larger size; for some languages, the OSCAR corpus is several orders of magnitude larger than the corresponding Wikipedia. Firstly, we found that when properly filtered, Common Crawl data is not massively noisier than Wikipedia. Secondly, we showed that embeddings trained using OSCAR data consistently outperform Wikipedia-based embeddings, to the extent that they allow us to improve the state of the art in POS tagging and dependency parsing for all 6 chosen treebanks. Thirdly, we observed that more training epochs generally result in better embeddings even when the training data is relatively small, as is the case for Wikipedia.
Our experiments show that Common-Crawl-based data such as the OSCAR corpus can be used to train high-quality contextualized embeddings, even for languages for which more standard textual resources lack volume or genre variety. This could result in better performance on a number of NLP tasks for many less highly resourced languages.