Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.


Introduction
Thousands of languages are spoken in our world (Eberhard et al., 2019), but technologies like machine translation (MT) and automatic speech recognition (ASR) are only available in about 100 of them. As internet access becomes increasingly common with the spread of smartphones (Biggs, 2017), bringing technologies that can help lower language and literacy barriers to more languages is ever more important.
Unfortunately, bringing language technologies to more languages is costly, as for many technologies, extending to an additional language has generally required the use of large parallel labeled datasets. For example, ASR systems are usually trained on large sets of audio recordings and transcriptions, while MT systems have historically needed a set of bilingual sentence pairs. Increasingly, small parallel datasets do exist for many languages (Mayer and Cysouw, 2014; Agić and Vulić, 2019; Artetxe et al., 2020; Ardila et al., 2020), but those resources were either produced at high cost, or are restricted to narrow domains. Parallel resources, which rarely occur naturally, remain scarce for most languages.
Monolingual text data, which is more commonly produced, is also used in building out language technologies: for example, in training language models, which are used in many applications ranging from next-word prediction in keyboard input software (Ouyang et al., 2017) to ASR and MT (Buck et al., 2014). Historically, though, a monolingual text corpus by itself has not been sufficient to build ASR and MT systems in a new language: at least some parallel data was typically necessary.
Recently, however, significant progress has been made in cross-lingual learning for NLP tasks (Klementiev et al., 2012; Ammar et al., 2016; Lample and Conneau, 2019; Pfeiffer et al., 2020): for example, some approaches appear capable of extending machine translation models to new languages with only monolingual data (Artetxe et al., 2017; Lample et al., 2017), and similar findings have been reported for other NLP tasks (Hu et al., 2020). For ASR it is possible to combine a

LangID Approaches for Web Corpora
To create text corpora in as many languages as possible, we needed a broad-coverage, accurate LangID model for our web crawl. We cover existing work and describe our model, built along similar lines.

Previous Implementations
A rich literature exists on building text corpora from the web: for example, the Web as Corpus workshops have focused on the challenges around identifying relevant pages, extracting clean text, content deduplication, and many other relevant topics (Barbaresi et al., 2020; Jakubíček et al., 2020). We use an internal web crawler, which is equipped with robust text extraction and de-duplication features, and focus on expanding its LangID component.
A comprehensive recent survey on LangID is Jauhiainen et al. (2018). Naturally, LangID systems have been applied to web crawls before: Buck et al. (2014) published n-gram language models for 175 languages based on Common Crawl data. The Corpora Collection at Leipzig University (Goldhahn et al., 2012) and the Corpus of Global Language Use (Dunn, 2020) offer corpora in 252 and 148 languages, respectively. The largest language coverage is probably An Crúbadán, which does not leverage LangID, and found (small amounts of) web data in about 2,000 languages (Scannell, 2007). Our work is probably most similar to OSCAR (Ortiz Suárez et al., 2019) and CCNet (Wenzek et al., 2019), which mined Common Crawl data for 166 and 174 language varieties, respectively. However, we believe depth of mining and LangID robustness can limit the quality of datasets produced by these projects: a preliminary inspection of the (often small) low-resource language corpora produced by these LangID-based projects reveals the sort of data noise we describe in this paper, which may render them unusable for NLP applications. These Common Crawl-based datasets are also smaller than our final, filtered dataset, which is ≈20x larger than CCNet and ≈180x larger than OSCAR for shared low-resource languages (see Appendix D).
One relevant LangID implementation appearing in the above works is Dunn (2020), achieving an F1 above 0.95 for 464 languages, and offering a thorough evaluation on different data sources and domains. The only LangID systems with higher coverage that we are aware of are those developed by Brown (2012; 2014), with the most recent version covering as many as 1,366 language varieties with accuracy above 99%. These numbers are impressive, but as we will see, even such high accuracy on test sets will not suffice to derive useful monolingual corpora from a real-world web crawl.

Our LangID Implementation
The LangID model we built is similar in approach to previously described systems: we use an n-gram based CLD3 model (Bakalov et al., 2016), consisting of a single-hidden-layer feed-forward neural network over bag-of-n-gram features and script-count features, which we trained on an aggregation of proprietary and publicly available text corpora covering 1,629 language varieties, with an average of 800K tokens per language. Some of the data came from sources with language tags, like Wikipedia, while another subset was created using a text elicitation task where we prompted native speakers to write sentences in their language. For some languages, we also relied on data extracted by Corpus Crawler (Brawer, 2017), a tool which mines text from sites with known in-language content. Using these corpora, we trained several LangID models on increasingly large sets of languages. As Table 1 shows, these models perform well on their held-out test sets. We balanced the data to have the same size dataset for each language before training. Since the relatively uncommon languages we are targeting have little web data compared to languages like English, balancing the data helps keep recall high enough to capture whatever scarce data there might be on the web for less common languages. Additionally, practically speaking, weighting training data according to the estimated prevalence of each language on the web at large-for example, with orders of magnitude more English examples than Quechua examples-would likely make model training difficult from a computational and stability perspective. However, it is worth stressing that evaluating a model on balanced data overestimates the performance of a model on the highly imbalanced web, especially with respect to precision, as we will see in Section 3.1.

Failure Modes of LangID Models on Web Text
Despite our LangID models performing well on the held-out test sets, when applied to real-life web data the models were not as accurate as we had expected. We performed an initial limited crawl with a 648-language model, but some quick evaluations showed that the results were highly noisy, so we performed a full crawl of ≈100B documents with a 224-language model to isolate the problems for closer analysis. This model had comparable performance to the models in Table 1, with a median F1 of 96.8 on held-out eval sets. As first-pass filtering, we performed document-consistency filtering: we ran the LangID model on every sentence in each document, and then took the most commonly predicted language as the document language. We only kept sentences where the sentence-level and document-level labels matched. All datasets were also de-duplicated. This approach may have decreased recall on multilingual pages, but it reduced the severe noise problems, and helped reduce disk storage needs.
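For concreteness, a minimal sketch of this document-consistency filter (`langid_predict` is an illustrative stand-in for any sentence-level classifier, not our production API):

```python
from collections import Counter

def document_consistency_filter(sentences, langid_predict):
    """Label every sentence, take the majority label as the document-level
    language, and keep only sentences whose label matches it."""
    if not sentences:
        return None, []
    labels = [langid_predict(s) for s in sentences]
    doc_lang = Counter(labels).most_common(1)[0][0]
    kept = [s for s, label in zip(sentences, labels) if label == doc_lang]
    return doc_lang, kept
```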
While we expected some accuracy loss due to the domain mismatch between clean training data and noisy web text (Dunn, 2020), even after document-consistency filtering the LangID labels were so noisy that the corpora for the majority of languages in our crawl were unusable for any practical NLP task. Table 2 presents some representative samples of noise. Beyond various kinds of noise, we also found a high number of unexpected misclassifications, as in the Oromo case in Table 2. The following sections detail important classes and sources of noise.

Table 2: Examples of several representative classes of noise in our initial web-crawl corpora.

Massive Class Imbalances: 99% Accuracy Is Not Enough
Precision, unlike recall or false positive rate (FPR), is a function of the class balance in a dataset. Measuring precision on a balanced dataset may give misleading impressions about real-world performance. For example, consider a LangID model that has 99% precision, 99% recall, and 0.01% FPR on a particular language on a balanced development set. Imagine however that there are 100 billion pages on the web, of which 10,000 are in the target language: in this scenario, the resulting web-crawled dataset will be mostly out-of-language, containing just under a tenth of a percent of sentences in the target language (see calculations in Appendix B)-insufficient for most NLP applications. Yet this assumes a relatively low FPR; for languages with a high FPR with respect to a much more common language, like Nigerian Pidgin with English, the situation is even more dire. As can be seen from this example, calculations of precision (and by extension, F1) are misleading when applied to real-world data with different class balances than the development set. In the general case, for a classifier with recall r and false positive rate f, if we estimate that the language of interest constitutes x% of the total web text, we get:

$$p_{\mathrm{crawl}} = \frac{r \cdot x}{r \cdot x + f \cdot (100 - x)}$$

Therefore, any evaluation of LangID models should also report the false positive rate (ideally with respect to major languages on the internet, like English) along with their precision and recall. This class-imbalance effect exacerbates the problems described in the following sections.
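A few lines of Python reproduce the worked example above (fractions rather than percentages; `crawl_precision` is an illustrative helper, not part of our pipeline):

```python
def crawl_precision(recall, fpr, in_lang_fraction):
    """Expected precision of a crawled corpus: TP / (TP + FP), with
    TP = r * x and FP = f * (1 - x) per unit of web text (see Appendix B)."""
    tp = recall * in_lang_fraction
    fp = fpr * (1.0 - in_lang_fraction)
    return tp / (tp + fp)

# 99% recall, 0.01% FPR, 10,000 in-language pages out of 100 billion:
print(f"{crawl_precision(0.99, 0.0001, 1e-7):.4%}")  # ~0.0989%: just under 0.1%
```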

General Internet Noise and Creativity
There are many kinds of web noise that are known to cause problems both with LangID and in downstream tasks, such as abbreviations ("g2g", "hbu"), leetspeak ("n00b"), hashtags ("#99problems"), or non-standard Unicode encodings (like a LATIN CAPITAL LETTER W instead of a CYRILLIC CAPITAL LETTER WE). Some of these problems can be handled automatically (Prasad et al., 2018;Chua et al., 2018). However, our efforts in scaling the LangID models in our web crawl to hundreds of languages uncovered greater depths to internet noise, alongside even more creative ways of using text. As a result of the sheer size of the web, any small pathologies of a LangID model are hugely magnified: we observed that our models tend to pick up on particular genres of internet noise for each separate language, resulting in corpora for some languages that mostly showcase a rich array of particular types of oddities.
For example, in our initial crawls, what purported to be the corpus for Varhadi picked up large amounts of badly-encoded PDFs; Aymara and Turkmen were made up mostly of misrendered non-Unicode text; Dimli had mostly invalid HTML; Dogri offered a rich array of Zalgo-like ornamentation; Fula was awash in URLs; Ilocano caught vast amounts of garbled Javascript; and Zhuang captured German sentences involving the Unicode SOFT HYPHEN character. In each of these cases, the majority of the crawled corpus consisted of whichever class of noise the LangID classifier had decided to assign to that language, drowning out any in-language sentences.
In another interesting twist, one might expect that languages which are written in scripts that are not used for any other language would have clean corpora, as the unique connection between the script and the language means that any LangID model gets 100% F1 on development sets. However, this underestimates the creativity of the internet: the Cherokee syllabary, for example, contains characters that look similar to Latin characters, which are consequently repurposed to give words in other languages an aesthetic effect (see example in Table 2), while other scripts, such as Balinese, are used commonly for purely decorative purposes alongside content in entirely unrelated languages. Some script-unique languages like Divehi do yield high-precision corpora right from the get-go, but they are the lucky few.

Artifacts from Character N-gram Modeling
Many error modes seem to be direct consequences of n-gram count-based models, and are also common in public corpora crawled using n-gram models like FastText (Grave, 2017)-Appendix E explores these phenomena in the OSCAR (Ortiz Suárez et al., 2019) corpus. Here are a few important classes of pathologies we discovered; see Table 2 for examples of each, and Appendix C for frequency statistics: 1. Unlucky overlap of frequent n-grams with high-prevalence languages: Token frequencies in natural text follow a power law distribution (Zipf, 1935), so that the most common n-grams in a language will be present in a majority of all of its sentences. If one of these common n-grams happens to occur in a sentence in a different language, LangID models can over-trigger. We observed this with Oromo, where 50% of the crawled dataset was actually English sentences containing the word "essay" at least three times, misleading the model due to high counts for the n-grams "essa", "ess", "sa", "a", "e", "s", and "y", all of which are top Oromo n-grams (see Appendix Table 12).
2. Repeated n-graaaaaaaaams: By repeating an n-gram sequence an arbitrary number of times, which is rare in clean training text but common on the internet, the class probability of a language may be ramped up, even if the language is clearly wrong-cf. adversarial examples (Goodfellow et al., 2015).

3. A N T S P E A K: A surprisingly common internet phenomenon is to find text with space-separated characters, l i k e t h i s (Channing, 2020). Standard n-gram models-or even SentencePiece models (Kudo and Richardson, 2018)-can't handle this without special-casing. This affects about one to two languages per major script: we found that most of our "Chechen" data was actually R u s s i a n, most of our "Lambadi" T e l u g u, our "Santali" B e n g a l i, and some of our "Sepedi" E n g l i s h. (Shallow heuristics for flagging the last two pathologies are sketched below.)
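Below is a minimal sketch of such heuristics (the regular expression and thresholds are illustrative, not the ones used in our pipeline):

```python
import re

# Any 1-5 character unit repeated five or more times in a row ("graaaaaaaaam").
REPEATED_NGRAM_RE = re.compile(r"(.{1,5}?)\1{4,}")

def has_repeated_ngram(sentence):
    return bool(REPEATED_NGRAM_RE.search(sentence))

def antspeak_fraction(sentence):
    """Fraction of whitespace-separated tokens that are single characters;
    values near 1.0 flag text w r i t t e n  l i k e  t h i s."""
    tokens = sentence.split()
    return sum(len(t) == 1 for t in tokens) / len(tokens) if tokens else 0.0
```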

Languages with High-Prevalence Cousins
The case of languages with high-prevalence cousins is a specific, quite common instance of the class-imbalance problem, and it requires somewhat different techniques to mitigate (see Section 4). Crawling the web for a low-resource language ("target language") that is closely related to a language that is highly prevalent on the internet ("distractor language") can yield a dataset consisting mostly of the distractor language. A particularly salient example is Nigerian Pidgin (i.e. Naija, 'pcm') and English ('en'), which are similar enough (see Appendix Table 11 for examples) that typical LangID models will have high false positive rates between the two. Because of the prevalence of English on the internet, along with this high degree of confusability, building a high-precision web-crawled text corpus for languages like Nigerian Pidgin is exceedingly difficult.

Languages with Out-of-Model Cousins
A variant on the above are languages that are not supported by the LangID model, which interfere with related languages that are supported. For example, a majority of our Uyghur crawl was actually Kazakh and Kyrgyz in the Arabic script; our model had been trained to recognize Kazakh and Kyrgyz, but only in the Cyrillic alphabet. Table 2 gives an example Kazakh sentence that was labeled as Uyghur.

Unrepresentative Training Data
Sometimes training data may be too clean to be accurate on out-of-domain, noisy web data; yet other times it may be too noisy, too homogeneous, or contain systematic biases. For example, for some languages, training data (especially data sourced from Wikipedia) had high quantities of special characters and templated data (esp. from censuses). Templated data may be harmful for n-gram models, by skewing the token distributions away from that of normal text, though there is some evidence that neural models may be less affected by token distributions than by latent structure (Papadimitriou and Jurafsky, 2020).
Other training data may also have issues; for instance, in our elicited Chechen data, the CYRILLIC LETTER PALOCHKA (not found on many keyboards) was represented with the ASCII digit "1". Our model therefore may not handle Chechen text containing the correct code point, or other substitutes, very well.

Improving LangID Precision on Web Text
Monolingual web-text corpora afflicted by the issues described in Section 3 will likely prove unusable for practical purposes. We report on two distinct approaches we found helpful in improving precision.

Tunable-precision Filtering with Curated Wordlists
We experimented with token-based filtering techniques, which are simple to implement and fast to perform on large corpora. Since the LangID models in our crawl operated on character n-grams, token-based approaches may have complementary behavior and can side-step particular failure modes. For instance, since a sentence with the word "essay" likely contains mostly non-Oromo words, the havoc caused by the n-gram "essa" described in Section 3.3 is neatly sidestepped by checking against a curated list of known Oromo words. Such filtering approaches have the added benefit of tunable precision, allowing us to adjust the cleanliness of our corpora depending on the noise tolerance of downstream tasks.

Percent-Threshold filtering
The simplest approach to token-based filtering is to remove any sentence where less than x% of its tokens appear in a clean list of known words for the language, such as one would find in a standard dictionary. We used in-house lists with a median of ≈15K words per language, which were obtained through frequency sorting followed by human curation. The one parameter for filtering-the percentage of in-vocabulary words-provides a simple, interpretable way to tune for precision/recall. We call this method Percent-Threshold Wordlist Filtering.
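A minimal sketch of this filter, assuming whitespace tokenization and lowercase matching (the default threshold of 20% is the value we evaluate in Section 5):

```python
def percent_threshold_filter(sentences, wordlist, threshold=0.20):
    """Keep a sentence iff at least `threshold` of its tokens appear in the
    curated wordlist for the language."""
    vocab = {w.lower() for w in wordlist}
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if tokens and sum(t in vocab for t in tokens) / len(tokens) >= threshold:
            kept.append(sentence)
    return kept
```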

TF-IDF based filtering
Percent-Threshold Wordlist Filtering is effective for a majority of the problems we saw, where the text is nonsense or in an entirely different language, but it will not help where the mislabeled text is in a similar language, as in Nigerian Pidgin ('pcm'), which has very high lexical overlap with English ('en'), meaning that such filtering will still retain most English sentences and fail to increase precision. This problem will occur with any language that has high lexical overlap with a major language. Where there is extensive borrowing of loanwords, the languages may even be unrelated, as for Chuvash and Russian. Some words, however, are highly effective language markers: for example, "wetin" is common in Nigerian Pidgin, but does not occur in English. We therefore propose to keep any sentence that has at least one word from a small list of common tokens that are distinctive to that particular language, and are not shared with its more prevalent cousins. We call this Disjunctive Wordlist Filtering.
First, we compute TF-IDF, where each language's LangID training set is treated as one "document". However, this suffers from one crucial flaw: the idf formulation of TF-IDF weights each document equally, so a word will be equally penalized whether it occurs in English or in K'iche'. For practical purposes, we care mainly about filtering out common distractor-language text on the internet, so we only want to penalize those languages.
This motivates a simple variant on TF-IDF which we call TF-IIF, or Term Frequency-Inverse Internet Frequency. This measure is the ratio of the frequency of a token in our per-language corpus (TF) to the frequency of that token across the entire internet (IIF), which we approximate from a sample of 7 million randomly selected web sentences. In practice we find that performance improves slightly when accounting for both IDF and IIF, yielding the TF-IDF-IIF score. Formally, for a token t in a language l, with a frequency function f(term, corpus), language-specific corpora D_l, and an internet sample D_inet:

$$\mathrm{TFIDFIIF}(t, l) = \underbrace{\frac{f(t, D_l)}{|D_l|}}_{\mathrm{TF}} \cdot \underbrace{\log \frac{|\{D_{l'}\}|}{|\{l' : f(t, D_{l'}) > 0\}|}}_{\mathrm{IDF}} \cdot \underbrace{\frac{|D_{\mathrm{inet}}|}{f(t, D_{\mathrm{inet}})}}_{\mathrm{IIF}}$$

With a ranked TF-IDF-IIF list for each language, we then pick the top N words for each language such that we have at least r% recall on our dev sets. While it is tempting to choose the same r for all languages (e.g. 95%), different languages can behave quite differently with such filters, with small changes in recall sometimes leading to large changes in precision. We had best results by choosing r ∈ [0.75, 1.0], and then determining the ideal precision-recall trade-off on a per-language basis. With this paper, we publicly release the TF-IDF-IIF wordlists we used, covering the top 100 tokens for each of about 500 languages.
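The sketch below implements the ranking above and the resulting disjunctive filter, under the assumptions that corpora are represented as token-frequency Counters and that unseen internet tokens receive add-one smoothing (our pipeline's exact normalization may differ):

```python
import math
from collections import Counter

def tfidf_iif_ranking(lang_corpora, inet_counts, inet_total):
    """Rank each language's tokens by TF * IDF * IIF.

    lang_corpora: {language: Counter of token frequencies in its training data}
    inet_counts:  Counter of token frequencies in the random web sample
    inet_total:   total number of tokens in the web sample
    """
    n_langs = len(lang_corpora)
    doc_freq = Counter()  # number of languages whose corpus contains the token
    for counts in lang_corpora.values():
        doc_freq.update(counts.keys())
    ranked = {}
    for lang, counts in lang_corpora.items():
        total = sum(counts.values())
        def score(token):
            tf = counts[token] / total
            idf = math.log(n_langs / doc_freq[token])
            iif = inet_total / (inet_counts.get(token, 0) + 1)  # add-one smoothing
            return tf * idf * iif
        ranked[lang] = sorted(counts, key=score, reverse=True)
    return ranked

def disjunctive_filter(sentences, marker_words):
    """Keep any sentence containing at least one distinctive marker word."""
    markers = {w.lower() for w in marker_words}
    return [s for s in sentences if markers & set(s.lower().split())]
```

In practice, one would take the top N tokens of `ranked[lang]` such that recall on a held-out dev set reaches the chosen r, as described above.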

Semi-Supervised LangID
A separate approach from filtering is to improve our original LangID model. Utilizing large unsupervised text corpora to improve the quality of neural networks has become increasingly important in NLP (Devlin et al., 2018; Wang et al., 2018). Following this line of work, we use the noisy data crawled with our n-gram LangID model to improve the quality of our LangID system by leveraging self-supervised approaches, yielding a Semi-Supervised LangID system (SS-LID). Specifically, following the text-to-text self-supervised approach outlined in Raffel et al. (2019), we train a Transformer Big model (Vaswani et al., 2017) by sampling equally from the crawled data for 212 languages. We co-train this self-supervised task with the LangID task in a text-to-text setting, with the hope of improving the quality of LangID on noisy open-domain web text. To reduce the confounding effect of using a higher-capacity transformer, we also train a baseline transformer on just the LangID task.
We evaluate these SS-LID models and compare against the n-gram based LangID model in Table 3. In addition to F1, precision, and recall, we report FPR, whose importance we discussed in Section 3.1. All values are macro-averaged over the shared 212 languages. To distinguish between apparently well-performing models we also report the relative error reduction with respect to the n-gram model, which for an error metric ε we define as

$$\Delta\varepsilon = \frac{\varepsilon_b - \varepsilon_t}{\varepsilon_b},$$

where ε_b is the baseline model error and ε_t the test model error. We see that the Transformer LangID model outperforms the n-gram model by a large margin, especially on precision and FPR. The SS-LID models improve further upon this model, notably with a 40% reduction in FPR. It is worth noting that these improvements are on the clean eval set, despite the additional training objective being on the noisy web crawl. We suspect the improvements are even greater on web-type data, which is partially validated by the evaluation on web text in Section 5.

Evaluation Methodology: Principles and Suggestions
Ideally, LangID models would be evaluated on a large, noisy test set, representative of real-life web data. Since such sets do not currently exist, we recommend having human annotators evaluate crawled corpora to ensure quality meets the threshold for downstream use (which will vary per application). For automatic metrics, we suggest focusing on false positive rate and recall rather than precision and recall, and comparing models using relative error reduction to amplify differences between apparently highly-performant models, as we did above in Section 4.2.
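For concreteness, a minimal sketch of the suggested metrics computed from raw confusion counts (the helper names are illustrative):

```python
def langid_metrics(tp, fp, tn, fn):
    """Recall, false positive rate, and precision from confusion counts."""
    return {
        "recall": tp / (tp + fn),
        "fpr": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

def relative_error_reduction(eps_baseline, eps_test):
    """Delta-epsilon from Section 4.2: (eps_b - eps_t) / eps_b."""
    return (eps_baseline - eps_test) / eps_baseline
```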

Evaluating our Systems
We asked human annotators to evaluate LangID quality for our web-crawled text in a subset of the languages. First, we filtered the web crawl with several methods. We then randomly sampled 100-1,000 sentences from each of these filtered data sets, and asked annotators (who were fluent speakers, or who spoke a closely related language) to indicate whether each sentence was in the target language. Table 4 presents the results of this evaluation for a selection of languages (full results on seventeen languages in Appendix Table 5). For each language, we show the precision of the method from the human annotations, and the recall of the same filter on our clean dev sets. For the percent-threshold filtering we evaluated a threshold of 20%, and for the disjunctive wordlist filtering we used the top N TF-IDF-IIF words per language such that the recall on our held-out eval set was at least 90%.
We see that the initial datasets were extremely noisy, with a median value of 5% of sentences being in-language. The filtering methods drastically increased the percentage of correctly LangID'd sentences, with values of up to 99% in-language, while maintaining high recall. However, the best filtering method varies widely by language. The neural SS-LID model has the highest precision for Bhojpuri and Swiss German, both of which also suffer most from the High-Prevalence-Cousin issue among these languages. However, it does much more poorly than wordlist-based approaches on Oromo and Cherokee. In the latter case, we found that SS-LID was unable to discard English sentences written in Cherokee syllabics.
It is worth re-emphasizing that the thresholds in Table 4 were chosen somewhat arbitrarily for the purpose of illustration. Since precision is tunable in the word-based approaches, precision can be increased further, though at growing cost to recall-a trade-off to make depending on downstream noise tolerance.
For Guinea-Bissau Creole, which has both a High-Prevalence Cousin (Portuguese) and an Out-of-Model Cousin (Papiamentu), none of our filtering methods were effective (see Appendix). Swiss German, in the same situation, barely scraped by. Future work should investigate additional techniques for such cases-although the most effective solution may be as simple as using a hand-curated TF-IDF-IIF list, which looked promising in preliminary experiments in Nigerian Pidgin.

Web-crawled Dataset and Comparison with other Public Datasets
Using the above methods[4], we performed a deep crawl of the web (touching >100B webpages) with a 600-language LangID model. Using percent-threshold filtering[5], we made a recall-focused dataset, then post-filtered it with an SS-LID model for high precision, yielding a larger, cleaner set than is found in similar corpora. More details and comparisons to public corpora (OSCAR, CCNet) are in Appendices D and E.

Future Work
Our approach yielded usable monolingual text corpora in ≈600 languages. Internal user experience research suggests the web may now contain at least some amount of monolingual text in thousands of languages, so we plan to scale up with more multilingual LangID models, like our 1,629-language model.
Truly covering the linguistic richness of the web will also need crawling approaches to be fine-tuned further. Text for some languages may only be found in PDF files (Bustamante et al., 2020), and some scripts are commonly represented in non-Unicode fonts-such as Kruti Dev for Devanagari, requiring separate detection for conversion into Unicode-encoded Devanagari (Singh and Goyal, 2013). Applying OCR may also help handle non-Unicode text, and can uncover textual content within images. And many languages that are not officially written in the Latin alphabet have informal transliterated orthographies (Roark et al., 2020); our models can identify the most common ones, but we could cover more.
Finally, our work focused on a web crawl, but many new internet users primarily use their language online on social media platforms and in chat messages (Soria, 2018). Other work has looked at applying LangID to social media (Jaech et al., 2016; Blodgett et al., 2017; Vo and Khoury, 2019). Our techniques should help improve LangID accuracy in this challenging domain, too.

Conclusion
Language Identification (LangID) is by no means a solved problem, and n-gram models are much worse than popularly believed. We trained LangID models covering up to 1,629 languages, but found that even seemingly high-quality models (> 95 F1) were nearly unusable in practice for low-resource languages. We described and analyzed several major issues encountered in applying LangID to a real-life web crawl. These practical problems included large amounts of noise, much of which appears to be natural language and can't be easily filtered out; insufficient expressiveness of n-gram models; issues with related languages; and a massive class imbalance problem, meaning that even 99% F1 can be insufficient.
To solve these issues, we developed two major improvements to our LangID system: tunable-precision filtering methods (for which we release wordlists in about 500 languages) and semi-supervised neural models. These allowed us to create usable monolingual text corpora across hundreds of languages based on our deep web crawl, with much more and cleaner data per language than previously published approaches. Such corpora hold great promise for bringing technologies like MT and ASR to more languages, and we believe it should be possible to use the approaches we outlined to create monolingual corpora in many more languages, which should help extend language technology even further.

[4] Our process is also summarized in Appendix K for those interested in replicating it.
[5] In this case, we used larger wordlists than those used for the analysis above, in order to stress recall.

A Complete human evaluation results
A more complete version of Table 4 is given here in Table 5, containing the full set of seventeen languages we evaluated. The only additional information it shows over Table 4 is the percentage of the web-crawl each method filters out, for more context into how these methods will behave in practice. (Keep in mind that, while the precision and % filtered rows are measured on the noisy web crawl, the recall is measured on the held-out eval set.)

Table 5: More complete comparison of different filtering approaches for different languages. For each example language, we report 1. the precision of the crawl (percent of in-language sentences), as judged by human raters over a sample of 100 sentences per filtering method; 2. the recall of this method on our held-out eval sets; and 3. the percentage of the crawl removed by this filtering method. *Starred languages were omitted from the table in the main paper. †G.B. = Guinea-Bissau

B Massive Class Imbalance: Worked Example
This section shows the methodology for the example in Section 3.1, where we examine by way of example a LangID model with 99% precision, 99% recall, and 0.01% FPR for a given language. If we approximate that there are 100 billion pages on the web, of which 10,000 are in the language we are seeking, we can analyze the precision of the web crawl using the quantities of True Positives (TP), True Negatives (TN), False Negatives (FN), and False Positives (FP). For the dataset resulting from the web crawl, we can therefore say that TN + FP ≈ 100B − 10k ≈ 100B, and TP + FN ≈ 10k. One can now calculate p_crawl, the precision on the resulting crawl of the web:

$$\mathrm{TP} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} \cdot (\mathrm{TP} + \mathrm{FN}) = r \cdot (\mathrm{TP} + \mathrm{FN}) = 0.99 \cdot 10\mathrm{k} = 9.9\mathrm{k}$$

$$\mathrm{FP} = \frac{\mathrm{FP}}{\mathrm{TN} + \mathrm{FP}} \cdot (\mathrm{TN} + \mathrm{FP}) = \mathrm{FPR} \cdot (\mathrm{TN} + \mathrm{FP}) = 0.0001 \cdot 100\mathrm{B} = 10\mathrm{M}$$

$$p_{\mathrm{crawl}} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} = \frac{9.9\mathrm{k}}{9.9\mathrm{k} + 10\mathrm{M}} \approx 0.1\%$$

C Statistics on languages most affected by different types of noise

Many of the types of noise mentioned in Section 3.2 are hard to quantify without significant extra work. For instance, it would require building special classifiers for misrendered PDFs, non-Unicode fonts, creative use of Unicode, and so on-and these may need to be stronger than an n-gram classifier, since after all these are mistakes of an n-gram classifier. Issues like out-of-model cousins are even trickier, probably requiring human ratings. However, some types of noise can be quantified using simple approximations; Table 6 reports the results.

Table 6: Quantification of the incidence of a few noise phenomena, along with their most affected languages in our web-crawl.

D Details on the web-mined datasets
As described in Section 6, the dataset we mined has two versions: one focused on recall (called recall in the table), and one focused on precision (called sslid(recall) in the table). Table 7 compares these two datasets with public benchmarks.
Since the purpose of this crawl was to focus on low-resource languages, we mined a smaller portion of the internet for the ∼100 highest-resource languages, and did not do any filtering on these languages. For this reason, in addition to the stats on the entire dataset, we report the stats on the dataset omitting the highest-resource 100 languages, to give a fairer approximation of the size of datasets for truly low-resource languages. We also report stats restricted to the languages shared among the three datasets, again omitting the ∼100 highest-resource languages.
Please note that these datasets are hard to compare to public benchmarks, as they draw from a wider swath of the internet, and are much more highly multilingual. Therefore, the comparison with public data sources in this table should not be interpreted as giving information about the nature of the filtering methods described in this paper.

Table 7: Comparison between the two versions of our dataset and the public datasets CCNet and OSCAR.
Although the statistics look similar on the full dataset, we see that the public datasets are heavily skewed towards higher-resource languages. When excluding the 100 highest-resource languages ("100+"), or looking only at shared low-resource languages ("shared"), we see that the public datasets have 20x to 200x less data than our crawl was able to identify.

E Comparison with OSCAR Corpus
While the analyses in the main paper focused on evaluating the quality of the data we crawled, publicly available datasets have similar issues. This section briefly analyzes the OSCAR corpus (Ortiz Suárez et al., 2019), which, although an excellent resource for many languages, has lower-quality content for some languages. All analyses are performed on the deduplicated OSCAR corpus, which is cleaner.
Language             Phenomenon        % of crawl
Central Bicol        A N T S P E A K   100.0%
Neapolitan           A N T S P E A K   100.0%
Emilian-Romagnol     A N T S P E A K   55.8%
Somali               n-graaaaams       88.1%
Cantonese            n-graaaaams       57.1%
Asturian             n-graaaaams       53.0%

Table 8: Most-affected languages in the OSCAR corpus for two common error modes of n-gram models.

Please note that it is hard to compare OSCAR directly with our dataset. One notable confound is that the two datasets are drawing from different portions of the web. Another confound is the degree of multilinguality and the subset of languages chosen (this paper tends to focus on longer-tail languages than OSCAR). A further large confound is that OSCAR uses the FastText LangID model (Grave, 2017), which does not upsample training data, and therefore will tend to have lower recall and higher precision.
Applying the heuristic analyses from Section C, we see that repeated n-gram and A N T S P E A K issues are also very common in the OSCAR corpus (the other phenomena from Table 6, however, were mostly absent). Table 8 reports the three most affected languages per phenomenon, and Figure 1 shows a representative sample of two of these corpora. In both of these cases, the dataset consisted only of such noise, and had no in-language content.
To further analyze the cleanliness of the OSCAR corpus, we performed an analysis similar to that in Section 5, to determine the percentage of each dataset that was in-language. Table 9 summarizes these findings, along with the percentage of each corpus remaining after percent-threshold filtering with our wordlists. We only look at the thirty lowest-resource languages in the corpus. We find that the percent in-language varies widely by language, ranging from 0% to 100%. However, many of the corpora have relatively high precision, with the average precision being just over 89%. At the same time, this accords with a low average recall, with the median dataset size being only 37 sentences. It is interesting to note that wordlist filtering corresponds quite well with human-judged precision, with a Pearson's R of 87.3%.

Table 9: The 30 lowest-resource languages in OSCAR, and 1. their human-judged percent in-language (i.e. precision); 2. the percentage remaining after applying percent-threshold wordlist filtering; and 3. the total number of sentences in the (deduplicated) corpus. Languages for which we lacked wordlists are marked with "N/A".

F Notes on Curated Wordlist Approaches
For languages written in unsegmented scripts (where spaces are not used in between words; for example, Mandarin), leveraging the curated wordlists during the filtering techniques is not as straightforward. When given a sentence to check for valid words, we would first need to run a segmentation model in order to split the sentence into words, but segmentation models need to be trained on specific languages and do not usually support lower-resource languages. To handle languages written in such writing systems, we included all valid characters in the language as part of the wordlist, so that we could fall back to character-level checks for any sentences written in these scripts. This means that any somewhat reasonable language data using the same script will be kept, even if it is a different language.
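A minimal sketch of this fallback, assuming we have a curated wordlist for segmented scripts and a valid-character set for unsegmented ones (the `segmented` flag is an illustrative way to switch between the two):

```python
def in_vocab_fraction(sentence, wordlist, charset, segmented=True):
    """In-vocabulary fraction used for percent-threshold filtering.

    For unsegmented scripts (segmented=False), fall back to checking each
    non-space character against the language's valid character set."""
    if segmented:
        units, vocab = sentence.split(), set(wordlist)
    else:
        units, vocab = [c for c in sentence if not c.isspace()], set(charset)
    return sum(u in vocab for u in units) / len(units) if units else 0.0
```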

G Wordlist-based Language ID
For languages with little or no sentence-level training data, even an n-gram LangID model is not practical to train. We therefore additionally explored pure wordlist-based models: specifically, we experimented with a Word-Based LangID system (WB-LID), which assigns a LangID label to a sentence by counting, for each candidate language, how many of that language's known words appear in the sentence, and predicting the language with the highest count, with extra weight granted to "unique words" that appear in only a single language's wordlist. WB-LID's simple architecture cannot match an n-gram LangID model for most languages (Table 10), and we decided not to use the outputs of WB-LID as a filter in this work. However, the approach seems stable and scalable to more languages, and may be worth exploring in the future as a LangID system for languages where no sentence data can be found to train an n-gram model.
LangID system        493 Languages   590 Languages
n-gram LangID        97%             96%
Word-Based LangID    75%             76%

Table 10: Performance (median F1) of the n-gram LangID system vs the Word-Based LangID system on development sets. For the dev sets shown in this comparison, we only include languages for which we had both sentence data to train the n-gram model and known wordlists to train WB-LID. We remove any known words from our WB-LID system that do not appear in the sentence data used to train the n-gram model. The n-gram model is trained on all sentence data for the supported languages.
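A minimal sketch of the WB-LID scoring described above (the weight granted to unique words is an illustrative value; we do not specify the exact weighting here):

```python
from collections import defaultdict

def build_word_index(wordlists):
    """Invert {language: wordlist} into {word: set of languages}."""
    word_to_langs = defaultdict(set)
    for lang, words in wordlists.items():
        for word in words:
            word_to_langs[word].add(lang)
    return word_to_langs

def wblid_predict(sentence, word_to_langs, unique_weight=2.0):
    """Score each language by its known words in the sentence, up-weighting
    words unique to a single language, and predict the argmax."""
    scores = defaultdict(float)
    for token in sentence.lower().split():
        langs = word_to_langs.get(token, set())
        for lang in langs:
            scores[lang] += unique_weight if len(langs) == 1 else 1.0
    return max(scores, key=scores.get) if scores else None
```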

H Illustration of the High-Prevalence-Cousin problem
Although the issue of highly similar varieties is very common and may be familiar to speakers of most languages in the world, English-speaking researchers may be less familiar with it, since close relatives of English do not generally receive a lot of attention in the literature. As an illustration, Table 11 gives examples of Nigerian Pidgin sentences that are easily confusable with English.

I N-gram overlap between Oromo and English

As alluded to in Section 3.3, Oromo has the peculiar error mode that our n-gram model massively over-triggers on English, despite the two languages bearing little to no resemblance to each other, as a result of the frequent 4-gram "essa".

Table 12: Top 10 most common 4-grams in a) Oromo LangID training data, b) the "Oromo" crawl of the web, and c) English LangID training data. Each 4-gram is presented with its index among the top 1,000 most common Oromo 4-grams. We can see from the n-gram lists that the "Oromo" crawl is majority English, over-triggering because of the 4-gram "essa", from the English word "essay". In fact, 50% of sentences in the "Oromo" crawl contain the word "essay" at least three times! The other common n-grams in this table from the "Oromo" crawl are epiphenomenal, reflecting only English words that tend to occur in English sentences about essays.

J Correlation of filtering precision with relevant variables
When do some filtering methods work better than others? We do not have enough data points to make strong statements (N=17), but some trends may be worth commenting on here. In Table 13, we look at how the precision of unfiltered data and the three proposed filtering methods correlates with 1) the size of the crawled dataset, and 2) dialectal relatedness to common languages online. We hypothesize that variable (1) is a combination of variable (2) with non-linguistic noise artifacts, so looking at these two variables can give us an idea of which methods are better at general noise filtering (from train-data pathologies, etc.) and at distinguishing related languages. Unfortunately, "dialectal relatedness to common languages online" is hard to quantify. As a rough approximation, we introduce four heuristic "confusability classes":

1. Class 1: No obviously confusable languages
2. Class 2: Confusable low-resource languages, or a slightly confusable high-resource language
3. Class 3: Medium-confusable high-resource language
4. Class 4: Very confusable high-resource language

To perform the regression we assign these classes the values {1, 2, 3, 4}. Per-language assignments are given in Table 14.
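A sketch of this correlation analysis, assuming parallel per-language lists of filter precision, unfiltered dataset size, and confusability class (numpy stands in for whatever statistics package one prefers):

```python
import numpy as np

def precision_correlations(precisions, n_segments, confusion_classes):
    """Pearson correlation of per-language precision with log dataset size
    and with the heuristic confusability class."""
    p = np.asarray(precisions, dtype=float)
    log_size = np.log(np.asarray(n_segments, dtype=float))
    classes = np.asarray(confusion_classes, dtype=float)
    return {
        "log(#segments)": float(np.corrcoef(log_size, p)[0, 1]),
        "confusability": float(np.corrcoef(classes, p)[0, 1]),
    }
```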
Based on the numbers in Table 13, it looks like both wordlist filtering methods perform similarly, while the SS-LID method is noticeably better when languages are more confusable, and possibly slightly worse when there are larger datasets (signalling more confusion with non-linguistic or out-of-domain noise).

Table 13: Pearson correlation of the precision of three filtering methods (and unfiltered data) with two relevant variables. Number of segments (i.e. number of sentences in the "unfiltered" dataset) is passed through a log transform first, since the size of the unfiltered datasets follows a log distribution. For an explanation of the "confusion rank", please see Appendix Section J and Table 14.

Table 14: Heuristic judgement of "confusability" for use in the regression in Table 13. Please note that this is not a rigorous quantification of these languages and may contain mistakes. For explanations of the "classes", please see the text. *Note that Chuvash is considered "high" overlap because of polluted training data.

K Complete Recipe
This section is simply a concise description of the steps we took to create our dataset, in the form of suggestions for someone interested in creating a similar dataset.