Corpora of social media in minority Uralic languages

This paper presents an ongoing project aimed at creation of corpora for minority Uralic languages that contain texts posted on social media. Corpora for Udmurt and Erzya are fully functional; Moksha and Komi-Zyrian are expected to become available in late 2018; Komi-Permyak and Meadow and Hill Mari will be ready in 2019. The paper has a twofold focus. First, I describe the pipeline used to develop the corpora. Second, I explore the linguistic properties of the corpora and how they could be used in certain types of linguistic research. Apart from being generally “noisier” than edited texts in any language (e.g. in terms of higher number of out-of-vocabulary items), social media texts in these languages present additional challenges compared to similar corpora of major languages. One of them is language identification, which is impeded by frequent code switching and borrowing instances. Another is identification of sources, which cannot be performed by entirely automatic crawling. Both problems require some degree of manual intervention. Nevertheless, the resulting corpora are worth the effort. First, the language of the texts is close to the spoken register. This contrasts to most newspapers and fiction, which tend to use partially artificial standardized varieties. Second, a lot of dialectal variation is observed in these corpora, which makes them suitable for dialectological research. Finally, the social media corpora are comparable in size to the collections of other texts available in the digital form for these languages. This makes them a valuable addition to the existing resources for these languages.


Introduction
There are seven minority Uralic languages in the Volga-Kama area and adjacent regions of Russia 1 : Komi (Zyrian, Permyak), Udmurt, Mari (Meadow, Hill), Erzya and Moksha. All these languages fall in the middle of the Uralic spectrum in terms of the number of speakers. Similarly, they all belong to the middle level of digital vitality: based on the amount of digital resources available for them, Kornai (2016) calls them digitally "borderline" languages. Their sociolinguistic situation is also rather similar; see Blokland and Hasselblatt (2003) for an overview. All of them have had intensive contact with the dominant Russian language; almost all their speakers are bilingual in Russian; the number of speakers is on the decline. Despite the fact that all of these languages have some official status in the respective regions, their use in the public sphere and education is very limited.
Social media have been a target for both NLP and linguistic research for a long time now. However, the overwhelming majority of papers deal with social media texts in English or one of several other major languages. Smaller languages are severely underrepresented in this domain. There are corpora of social media texts in large Uralic languages, e.g. the Suomi24 forum corpus for Finnish (Aller Media Oy, 2014), and investigations based on them, e.g. Venekoski et al. (2016). All minority Uralic languages spoken in Russia lack such corpora.
Collecting social media corpora for the seven languages listed above is the central part of my ongoing project. There are notable differences between social media in these languages and those in major languages, which pose certain challenges for corpus development. First, they are smaller in size by several orders of magnitude. While, for example, the Edinburgh Twitter Corpus contains 2.26 billion tokens of tweets in English collected within a 2.5-month span (Petrović et al., 2010), all corpora I am dealing with do not exceed 3 million tokens despite representing an 11-year period. This scarcity of data makes every single post valuable. Another difference is ubiquitous code switching instances and Russian borrowings, which makes reliable language tagging a necessity. Yet another challenge comes from the fact that many social media users are not well acquainted with, or consciously avoid, the literary norm. On the one hand, this means that dialectal variation can be studied in Uralic social media corpora, but on the other, it makes morphological annotation more difficult.
The paper is organized as follows. In Section 2, I describe how I find, harvest and process the social media texts. In Section 3, I consider the linguistic and sociolinguistic properties of collected texts and discuss how that could be beneficial for certain kinds of research. In Section 4, I briefly describe the web interface through which the corpora are available.

Identifying and harvesting texts
A common approach to harvesting various kinds of texts from the web is to apply some kind of automatic crawling, which takes a small set of URLs as a seed and then follows the hyperlinks to find more content. Unfortunately, it is almost impossible to use this approach without adjustments for languages with small digital presence. Most links that appear in pages written in such languages lead to texts written in the dominant language (Russian in this case), and sifting through all of them to find relevant pages or fragments would require too much computational power.
In order to make text harvesting more efficient and less time-consuming, I try to make the seed as close to the comprehensive URL list as possible. Only after processing all pages from that list do I apply limited crawling. When identifying the pages for the seed list, I build upon a strategy proposed and used by Orekhov et al. (2016) for collecting and researching minority languages of Russia on the Internet, as well as on the results obtained by them. A slightly different version of the same strategy was previously used by Scannell (2007) in the Crúbadán project for similar purposes. This approach involves searching for relevant pages with a conventional search engine, using a manually compiled small set of tokens which are frequent in the relevant language, but do not exist or are very infrequent in any other language. This contrasts to the strategy employed by the "Finno-Ugric Languages and the Internet" project (Jauhiainen et al., 2015), which relied on large-scale crawling and subsequent fully automatic filtering by language.
Out of a dozen social media services with presence in Russia, I currently limit my search to vkontakte 2 , which is by far the most popular of them both in relevant regions and in Russia as a whole. My preliminary research shows that in major Western social media, such as Facebook or Twitter, texts in minority Uralic languages are almost nonexistent. However, there is at least one other Russian resource, odnoklassniki 3 , which seems to contain texts in these languages in quantities that may justify the effort needed to process them. Odnoklassniki is more popular with the older generation and apparently has varying popularity across regions. For example, it seems that there are more texts in Erzya there than in Udmurt. Nevertheless, relevant texts in vkontakte clearly outnumber those in odnoklassniki. Additionally, I download forums not associated with any social media service, if their primary language is one of those I am interested in. So far, I have only found forums of such kind for Erzya.
Although there are also blogs available in these languages, I did not include them in the social media corpora. Baldwin et al. (2013) show that the language of blogs could be placed somewhere between edited formal texts and social media by a number of parameters. This is true for most (although not all) blogs in minority Uralic languages, which on average contain less code-switching than social media and where the language variety seems closer to the literary standard. Nevertheless, blogs are undoubtedly a valuable source for linguistic research, which is why I downloaded them as well and included them in the "support corpora" (see below).
As a starting point, I take the URL lists of vkontakte pages collected by Orekhov et al. (2016). 4 I manually check all of them and remove those that were misattributed (which sometimes happens because the lists were compiled in an unsupervised fashion). An example of an erroneously marked page is a Russian group dedicated to Korean pop music where the users share the lyrics in Cyrillic transcription. Apparently, a transcribed Korean word coincided with one of frequent Udmurt tokens, which is why it ended up tagged as Udmurt.
As a second step, I perform a manual search in the Yandex search engine with an additional check in Google, using the same strategy as Orekhov et al. (2016). This allows me to enhance the original lists with URLs that were missed or did not exist in 2015, when the lists were compiled.
When the initial list of URLs is ready, I download the texts (posts and comments) and the metadata using the vkontakte API. The amount of data is small enough to be downloaded within several days through a free public API limited to 3 queries per second. The texts with some of the metadata are stored in simple JSON files. User metadata is cached and stored in another JSON file to avoid downloading it multiple times for the same user. Obviously, only texts and metadata open to the general public can be downloaded this way.
The final stage of the harvesting process involves limited crawling. The messages written by individual users are automatically language-tagged. For each user, I count the number of messages in the relevant language authored by them. All users that have at least 2 messages are added to the URL list and their "walls" (personal pages with texts and comments written by them or addressed to them) are downloaded as well. The threshold of 2 messages was chosen to cut off instances of erroneous language tagging, which happen especially often with short messages. Besides, users with small message counts tend to have no texts in the relevant languages on their walls anyway.
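The user selection step of the limited crawling can be sketched as follows. This is only an illustration under assumptions: the record layout (pairs of user ID and language tag) and the function name are hypothetical, and only the threshold of 2 messages comes from the description above.

```python
from collections import Counter

MIN_MESSAGES = 2  # threshold named in the text


def select_users(messages, target_lang, min_messages=MIN_MESSAGES):
    """Return IDs of users with at least `min_messages` messages in `target_lang`.

    `messages` is an iterable of (user_id, language_tag) pairs; this layout
    is hypothetical and chosen only for the example.
    """
    counts = Counter(user for user, lang in messages if lang == target_lang)
    return sorted(user for user, n in counts.items() if n >= min_messages)


# Toy data: only u1 has two messages tagged as Udmurt.
messages = [
    ("u1", "udm"), ("u1", "udm"), ("u1", "rus"),
    ("u2", "udm"), ("u2", "rus"),
    ("u3", "rus"),
]
selected = select_users(messages, "udm")
```

The walls of the selected users would then be appended to the URL list for download.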

Language tagging
The social media texts in minority Uralic languages are interspersed with Russian, so language tagging is of crucial importance to the project. There are standard techniques for language tagging, the most popular probably being the one based on character n-gram frequencies (Cavnar and Trenkle, 1994). It is impossible, however, to achieve sufficient quality on minority Uralic social media data with these methods. The first problem is that the texts that have to be classified are too short. Mixing languages within one message is extremely common, which is why at least sentence-level tagging is needed in this case. In an overview of several n-gram-based methods, Vinosh Babu and Baskaran (2005) note that, although generally it is easy to achieve 95% or higher precision with such methods, "for most of the wrongly identified cases the size of the test data was less than 500 bytes, which is too small". This is always the case with the sentences, which most of the time contain fewer than 10 words. What's more, sentences in the relevant languages contain lots of Russian borrowings and place names, which would shift their n-gram-based counts closer to those of Russian. Classifying short segments with additional issues like that is still problematic with the methods commonly used at present (Jauhiainen et al., 2018, 60-61).
Instead of a character-based classification, I use a process which is mostly dictionary-based and deals with words rather than character n-grams as basic counting units. In a nutshell, it involves tokenization of the sentence, dictionary lookup for each word and tagging the sentence with the language most words can be attributed to. The classification is three-way: each sentence is tagged as either Uralic, or Russian, or "unknown". The last category is inevitable, although the corresponding bin is much smaller than the first two. It contains sentences written in another language (English, Tatar, Finnish and Hungarian are among the most common), sentences that comprise only emoji, links and/or hashtags, and those that are too difficult to classify due to intrasentential code switching. In the paragraphs below, I describe the algorithm in greater detail.
Before processing, certain frequent named entities, such as names of local newspapers and organizations, are cut out with a manually prepared regex. This is important because such names, despite being written in a Uralic language, often appear in Russian sentences unchanged. After that, the sentence is split into tokens by whitespaces and punctuation-detecting regular expressions. Only word tokens without any non-Cyrillic characters or digits are considered.
There are three counters: number of unambiguously Russian tokens (cntR), number of unambiguously Uralic tokens (cntU), and number of tokens that could belong to either language (cntBoth). Each word is compared to the Russian and Uralic frequency lists, which were compiled earlier.
If it only appears on one of them without any remarks, the corresponding counter is incremented. If it appears only in the Uralic list, but is tagged as either a Russian borrowing or a place name without any inflectional morphology, cntBoth is incremented. The same happens if the word is on both lists, unless it is much more frequent, or its 6-character suffix is more common (in terms of type frequency), in one than in the other. (Exact thresholds here and in the paragraph below are adjusted manually and are slightly different for different languages of the sample.) In the latter case, the corresponding counter, cntR or cntU, is incremented.
After all words have been processed, rule-based classification is performed. If one of the counters is greater than the others and most tokens in the sentence have been attributed to one of the languages, the sentence is tagged according to the winning counter. If there are many ambivalent words and either no Uralic words or some clearly Russian words in the sentence, it is classified as Russian. Finally, if counter-based rules fail, the sentence is checked against manually prepared regexes that look for certain specific character n-grams characteristic of one language and rare in the other. If this test also does not produce a definitive answer, the sentence is classified as "unknown".
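The counter-based part of this procedure can be illustrated with a simplified sketch. It is not the actual implementation: the word lists are toy sets, the frequency and suffix comparisons for words found in both lists are omitted (such words always count as ambiguous), the regex fallback is left out, and the numeric thresholds ("most tokens", "many ambivalent words") are assumptions set to one half.

```python
import re

TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters


def tag_sentence(sentence, russian_words, uralic_words, uralic_ambiguous=frozenset()):
    """Tag a sentence as 'uralic', 'russian' or 'unknown' (simplified sketch).

    `uralic_ambiguous` stands for Uralic-list entries marked as Russian
    borrowings or uninflected place names; all three sets are hypothetical inputs.
    """
    cnt_r = cnt_u = cnt_both = n_tokens = 0
    for token in TOKEN_RE.findall(sentence.lower()):
        if not re.fullmatch(r"[\u0400-\u04FF]+", token):
            continue  # only purely Cyrillic word tokens are counted
        n_tokens += 1
        in_r, in_u = token in russian_words, token in uralic_words
        if in_r and in_u:
            cnt_both += 1
        elif in_u:
            if token in uralic_ambiguous:
                cnt_both += 1
            else:
                cnt_u += 1
        elif in_r:
            cnt_r += 1
    if n_tokens == 0:
        return "unknown"
    # Rule 1: a clear winner, with most tokens attributed to a language.
    if (cnt_r + cnt_u) * 2 > n_tokens:
        if cnt_u > cnt_r and cnt_u > cnt_both:
            return "uralic"
        if cnt_r > cnt_u and cnt_r > cnt_both:
            return "russian"
    # Rule 2: many ambiguous words and no Uralic or some clearly Russian words.
    if cnt_both * 2 >= n_tokens and (cnt_u == 0 or cnt_r > 0):
        return "russian"
    return "unknown"  # the regex fallback described above is omitted here


# Toy word lists, for illustration only.
RUSSIAN = {"он", "живет", "в", "деревне", "школа"}
URALIC = {"со", "гуртын", "улэ", "школа"}

tag_u = tag_sentence("Со гуртын улэ", RUSSIAN, URALIC)
tag_r = tag_sentence("Он живет в деревне", RUSSIAN, URALIC)
```

Counting out-of-vocabulary tokens in `n_tokens` is a design choice of this sketch: it makes a sentence with many unrecognized words less likely to be tagged on the strength of a single known word.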
There is a certain kind of text in minority-language social media that poses a serious challenge to this approach. In all languages I have worked with, there are groups designed for learning the language. They often contain lists of sentences or individual words with Russian translations. A simplistic approach to sentence segmentation places most of such translation pairs inside one sentence, which is then impossible to classify as belonging to one of the languages. To alleviate this problem, the language classifier tries splitting sentences by hyphens, slashes or other sequences commonly used to separate the original from the translation. If both parts can be classified with greater certainty than the entire fragment, and they have different language tags, the sentence remains split.
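A possible shape of this splitting step is sketched below. The text does not define how certainty is measured, so the sketch assumes a hypothetical classifier interface that returns a tag together with a confidence score; the separator set and the toy classifier are likewise assumptions.

```python
import re

# Hypothetical separator set: hyphens, dashes, slashes or '=' between spaces.
SEPARATOR_RE = re.compile(r"\s+[-\u2013\u2014/=]+\s+")


def split_translation_pair(sentence, classify):
    """Split a sentence into two parts if both parts classify better apart.

    `classify(text)` must return (tag, confidence); this interface is assumed.
    """
    parts = SEPARATOR_RE.split(sentence, maxsplit=1)
    if len(parts) != 2:
        return [sentence]
    _, whole_conf = classify(sentence)
    (tag_a, conf_a), (tag_b, conf_b) = classify(parts[0]), classify(parts[1])
    if tag_a != tag_b and min(conf_a, conf_b) > whole_conf:
        return parts
    return [sentence]


# Toy classifier over one-word lists, for illustration only.
RUSSIAN = {"лес"}
URALIC = {"нюлэс"}


def toy_classify(text):
    words = text.lower().split()
    n_r = sum(w in RUSSIAN for w in words)
    n_u = sum(w in URALIC for w in words)
    tag = "uralic" if n_u > n_r else "russian" if n_r > n_u else "unknown"
    return tag, max(n_r, n_u) / max(len(words), 1)


pair = split_translation_pair("нюлэс - лес", toy_classify)
```

Requiring whitespace around the separator keeps hyphenated words intact.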
During the initial language tagging, "borderline" sentences, i.e. those whose cntR and cntU counters had close values, were written to a separate file. I manually checked some of them and corrected the classification if it was wrong. During the second run of tagging, each sentence was first compared to this list of pre-tagged sentences. The tagging procedure described above was only applied to sentences that were not on that list. Finally, an extended context was taken into account. If a sentence classified as "unknown" was surrounded by at least 3 sentences with the same language tag (at least one before and at least one after it), its class was switched to that of the neighboring sentences.
The resulting accuracy is high enough for practical purposes and definitely higher than an n-gram-based approach would achieve. Tables 1 and 2 show the figures for Udmurt and Erzya. The evaluation is based on a random sample that contained 200 sentences for each of the languages. Actual cases of misclassification comprise only about 2% of sentences classified as Uralic. An additional 3% accounts for problematic cases, e.g. code switching with no clear main/matrix language. The share of sentences classified as "unknown" is 2.5% for the Udmurt/Russian pair and 1.3% for the Erzya/Russian pair; most of them are indeed not classifiable. Note that the figures below refer to sentences rather than tokens. Given that wrong classification overwhelmingly occurs in short sentences (1-4 words), precision measured in tokens would be much higher.
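The context rule admits more than one reading. The sketch below assumes one of them: the neighbors must be contiguous runs of identically tagged sentences directly before and after the "unknown" sentence, and the two runs together must contain at least 3 sentences. The function name and tag labels are hypothetical.

```python
def smooth_unknown(tags, min_neighbors=3):
    """Retag 'unknown' sentences that sit inside a run of one language."""
    result = list(tags)
    for i, tag in enumerate(tags):
        if tag != "unknown" or i == 0 or i == len(tags) - 1:
            continue
        lang = tags[i - 1]
        if lang == "unknown" or tags[i + 1] != lang:
            continue
        # Length of the same-language run immediately before position i.
        before, j = 0, i - 1
        while j >= 0 and tags[j] == lang:
            before += 1
            j -= 1
        # Length of the same-language run immediately after position i.
        after, j = 0, i + 1
        while j < len(tags) and tags[j] == lang:
            after += 1
            j += 1
        if before + after >= min_neighbors:
            result[i] = lang
    return result


smoothed = smooth_unknown(["uralic", "uralic", "unknown", "uralic"])
```

The loop reads from the original `tags` and writes to a copy, so one retagged sentence does not trigger further changes.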
Table 2: Accuracy of language tagging for Erzya.
Sentences tagged as | correct | wrong language | mix / other
Erzya | 94.5% | 2.5% | 3%
Russian | 97% | 1% | 2%
The described approach requires much more training data and annotation than the n-gram-based classification. Specifically, it relies on word lists for the respective languages that are long enough, contain some morphological annotation, annotation for Russian loanwords and place names, and frequency information. Such lists are readily available for Russian; I used a frequency list that is based on the Russian National Corpus and contains about 1 million types. However, it is much more problematic to obtain such lists for the Uralic languages. In order to do so, I had to collect a "support corpus" with clean texts and no Russian insertions for each of the languages first. Fortunately, this is achievable because there are enough non-social-media digital texts in them on the web. First and foremost, for each language there are one or several newspapers that publish articles in it. Apart from that, there are translations of the Bible, blogs (surprisingly, unlike social media, most of them do not contain chaotic code switching) and fiction. By contrast, Wikipedia, which is often a primary source of training data for major languages, is of little use for this purpose because Wikipedias in these languages mostly contain low-quality and/or automatically generated articles (Orekhov and Reshetnikov, 2014). The resulting lists contain around 230,000 types for Udmurt and around 100,000 types for Erzya, Moksha and Komi-Zyrian.
Although I am primarily interested in the Uralic data, all Russian and unclassified sentences are also included in the corpus. Omitting them in mixed posts would obviously be detrimental for research because it would be impossible to restore the context of Uralic sentences and therefore, in many cases, fully understand their meaning. However, posts written entirely in Russian are also not removed if their authors or the groups where they appear have Uralic posts as well. This effectively makes my corpora bilingual, although not in a sense traditionally associated with this term (Barrière, 2016). One reason why this is done is facilitating sociolinguistic investigations of language choice in communication. Another is enabling research of contact-induced phenomena in Russian spoken by native speakers of the Uralic languages. A number of corpus-based papers have been published recently about regional contact- or substrate-influenced varieties of Russian, e.g. by Daniel et al. (2010) about Daghestan or Stoynova (2018) about Siberia and the Russian Far East. The availability of corpora that contain Russian produced by Uralic speakers could lead to similar research being carried out on Uralic material.

Filtering and anonymization
After the language tagging, the texts undergo filtering, which includes spam removal, deduplication and anonymization.
Since the actual content is not that important for linguistic research, there is nothing inherently wrong with having spam sentences in the corpus, as long as they are written in a relevant language. However, the main problem with spam is that it is repetitive, which biases the statistics. In order to limit this effect, I manually checked sentences that appeared more than N times in the corpus (with N varying from 2 to 5, depending on the size of the corpus). Those that could be classified as being part of automatically generated messages or messages intended for large-scale multiple posting were added to the list of spam sentences. If they contained variable parts, such as usernames, those were replaced with regex equivalents of a wildcard. Such variable parts make template sentences resistant to ordinary duplicate search, which justifies treating them separately. Most of such sentences come from online games, digital postcards or chain letters. The resulting list contains about 800 sentences and sentence templates. Sentences in texts that match one of the templates are replaced with a <SPAM> placeholder. Posts where more than half of sentences were marked as spam are removed.
Text duplication is a serious problem for social media texts, which are designed for easily sharing and propagating messages. Posts published through the "share" button are marked as copies in the JSON returned by vkontakte API. If multiple copies of the same post appear in different files, they are identified by their post ID. Only one copy is left in place, and all others are replaced by the <REPOST> placeholder. However, this procedure does not solve the problem entirely. Many posts are copies of texts that originate outside of vkontakte, and some copies of vkontakte posts are made by copy-pasting (and possibly editing) rather than with the "share" function. As an additional measure, posts that are longer than 90 characters are compared to each other in lowercase and with whitespaces deleted. If several identical posts are found, all but one are replaced with the placeholder. However, there are still many duplicates or half-duplicates left, which becomes clear when working with the corpora. Some of the duplicates, despite obviously coming from the same source, have slight differences in punctuation, spelling or even grammar, which means they were edited. It is a nontrivial question whether such half-copies should be removed. In any case, this remains a serious problem for the corpora in question. By my informal estimate, as much as 15% of the tokens found in the corpora could actually belong to near-duplicates. Before applying a more advanced approach in the future, e.g. a shingle-based one (Broder, 2000), the near-duplicates have to be carefully analyzed to determine what has to be removed and what has to stay.
The final step of the filtering is anonymization. The purpose of anonymization is to avoid the possibility of identifying the users by removing their personal data. Usernames and IDs of the users are replaced with identifiers such as F_312. The numbers in the labels are random, but consistent throughout each corpus. This way, the corpus users still can identify texts written by the same person (which could be important for dialectological or sociolinguistic research) without knowing their name. The names of the groups are not removed because there is no one-to-one correspondence between groups and users. Similarly, user mentions in texts are removed. Just like on other major social media platforms, user mentions in vkontakte are automatically enhanced with the links to the user pages and therefore are easily recognizable. All such mentions are replaced with a <USER> placeholder. All hyperlinks are replaced with a <LINK> placeholder. Finally, user metadata is aggregated (see Subsection 2.4). Only the anonymized corpus files are uploaded to the publicly accessible server.
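The template matching could look roughly like the following sketch. The template shown is invented for the example (the real list of about 800 entries is not reproduced in the paper), and the function name and return convention are assumptions.

```python
import re

# Hypothetical template: the variable username is replaced with a wildcard.
SPAM_TEMPLATES = [re.compile(r"\S+ sent you a gift!")]


def filter_spam(sentences, templates=SPAM_TEMPLATES):
    """Replace spam sentences with <SPAM>; return None if the post is dropped.

    A post is dropped when more than half of its sentences match a template.
    """
    marked = [
        "<SPAM>" if any(t.fullmatch(s) for t in templates) else s
        for s in sentences
    ]
    n_spam = marked.count("<SPAM>")
    if n_spam * 2 > len(sentences):
        return None
    return marked


kept = filter_spam(["Hello there.", "Anna sent you a gift!", "See you."])
dropped = filter_spam(["Anna sent you a gift!"])
```

Using `fullmatch` rather than `search` means that a sentence is only marked when the whole sentence fits the template.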
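The additional exact-match comparison can be sketched as below. Only the 90-character threshold, the lowercasing and the whitespace removal come from the description; the function names are hypothetical, and keeping the first copy in place is an assumption.

```python
import re

MIN_LENGTH = 90  # only posts longer than this are compared


def normalize(text):
    """Lowercase the text and delete all whitespace."""
    return re.sub(r"\s+", "", text.lower())


def deduplicate(posts, min_length=MIN_LENGTH):
    """Keep the first copy of each long post; replace later copies with <REPOST>."""
    seen = set()
    result = []
    for post in posts:
        if len(post) > min_length:
            key = normalize(post)
            if key in seen:
                result.append("<REPOST>")
                continue
            seen.add(key)
        result.append(post)
    return result


long_post = "Some long text " * 10  # 150 characters
posts = [long_post, long_post.upper().replace(" ", "  "), "short", "short"]
deduplicated = deduplicate(posts)
```

Short posts are left alone because identical short messages (greetings, one-word replies) are likely to be independent.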
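A minimal sketch of consistent pseudonymization and link replacement follows. The class name, the label prefix and the URL pattern are assumptions made for the example; the mention format used by vkontakte is not described in the text, so mentions are not handled here.

```python
import random
import re


class Pseudonymizer:
    """Map user IDs to random labels that stay the same within one corpus."""

    def __init__(self, seed=None, max_id=10**6):
        self._rng = random.Random(seed)
        self._labels = {}
        self._used = set()
        self._max_id = max_id

    def label(self, user_id, prefix="U"):
        if user_id not in self._labels:
            number = self._rng.randrange(self._max_id)
            while number in self._used:  # avoid giving two users one label
                number = self._rng.randrange(self._max_id)
            self._used.add(number)
            self._labels[user_id] = f"{prefix}_{number}"
        return self._labels[user_id]


def replace_links(text):
    """Replace http(s) URLs with a <LINK> placeholder (simplified pattern)."""
    return re.sub(r"https?://\S+", "<LINK>", text)


anonymizer = Pseudonymizer(seed=0)
first = anonymizer.label("id1")
again = anonymizer.label("id1")
other = anonymizer.label("id2")
```

The mapping from real IDs to labels would have to stay out of the published files, since it reverses the anonymization.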

Metadata and annotation
Each post together with its comments is conceptualized as a separate document in the corpus. There are post-level and sentence-level metadata. Both include information about the authors: the owner of the page (post-level) and the actual author of the post or comment (sentence-level), which may or may not coincide. Additionally, sentence-level metadata includes type of message (post/repost/comment), year of creation, and language of the sentence.
Author-related metadata primarily comes from the user profiles. It includes sex (which is an obligatory field) and, if the user indicated it, also their age, place of birth and current location. Simply copying the values for the latter three parameters would make it possible to identify the authors. However, these values are extremely important for any kind of sociolinguistic or dialectological research, so they have to be accessible in some way. As a compromise, these values are presented only in aggregated form. Exact year of birth is replaced with a 5-year span (1990-1995, 1995-2000, etc.) in all corpora. The solution for the geographical values has only been applied to the Udmurt corpus so far. The exact locations there are replaced with areas: districts (район) for Udmurtia and neighboring regions with significant Udmurt minorities; regions (область/республика/край) for other places in Russia; and countries otherwise. The correspondence between the exact values and areal values was established manually and stored in a CSV table, which at the moment has around 800 rows for Udmurt. Since there are a lot of ways to spell a place name (including using Udmurt names, which do not coincide with the official Russian ones), this is a time-consuming process 5 , which is why I have not done that for the other corpora yet.
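The aggregation of birth years can be sketched in a few lines. Because the spans listed above share their boundary years, the sketch assumes half-open intervals, so that 1995 falls into 1995-2000; this reading and the function name are assumptions.

```python
def birth_year_span(year, width=5):
    """Return the 5-year span label for a birth year (lower bound inclusive)."""
    start = year - year % width
    return f"{start}-{start + width}"


span = birth_year_span(1993)
```

The geographical aggregation would be a lookup in the manually compiled table rather than a computation, so it is not sketched here.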
In order to make sure the birth places the users indicate are real at least most of the time, I read posts written by a sample of users. It is common for speakers in this region to live in cities and towns, but maintain ties with their original villages and describe them in their posts. In such descriptions, the speakers often explicitly indicate that they were born in that village. Additionally, place of origin is an important part of identity. This is why opening sections of most interviews in local press contain the information about the village where the interviewee was born, along with their name and occupation. All this makes birth place information easily verifiable. In most cases, the place name indicated by the users was corroborated by the information I found in the texts. There were several cases, however, when instead of naming the exact place, the users wrote the district center closest to the real place of birth. This paradoxically makes the aggregated version of geographical data more accurate than the exact one.
The token-level annotation in the corpora includes lemmatization, part-of-speech and full morphological annotation, morpheme segmentation and glossing. This annotation is carried out automatically using rule-based analyzers, with the details (coverage, presence of disambiguation, etc.) varying from language to language. Additionally, the dictionaries used for morphological analysis were manually annotated for Russian borrowings, place names and other proper names, which is required for high-quality language tagging. Russian sentences were annotated with the mystem 3 analyzer (Segalovich, 2003).
Social media texts in any language tend to be more "noisy" and difficult for straightforward NLP processing, having higher out-of-vocabulary rates (Baldwin et al., 2013). There are both standard and language-specific problems in this respect in the Uralic social media. The former include typos, deliberate distortions and lack of diacritics. An example of the latter is significant dialectal variation, which was to a certain extent accounted for in the morphological analyzers. The variation is explained by the facts that these languages were standardized only in the 1930s and that many people are not sufficiently well acquainted with the literary standards (or choose not to adhere to them).
The most frequent typos were included in the dictionaries. Some kinds of distortions, such as repeating a character multiple times, were removed before a token was morphologically analyzed (but not in the texts). Lack of diacritics is a common problem in Udmurt, Komi and Mari texts, as alphabets of these languages contain Cyrillic letters with diacritics that are absent from a standard Russian keyboard. They can be either omitted or represented in a roundabout way. Interestingly, the same letters are represented differently in different languages. In Udmurt, double dots above a letter are commonly represented by a colon or (less frequently) a double quote following it, e.g. ӧ = о: / o". In Komi, the letter о in this context is most often capitalized or replaced with the zero digit. In all languages, similar-looking characters from Latin-based character sets can be inserted instead of Cyrillic ones. Alphabets of Erzya and Moksha coincide with that of Russian. Nevertheless, double dots above ё are often omitted, following the pattern used in Russian texts (where their use is optional). All these irregularities are taken care of during automatic processing.
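Some of these normalizations can be sketched for Udmurt as follows. This is a heuristic illustration rather than the procedure actually used: the lookalike table is partial, runs of three or more identical characters are collapsed to one, and a colon or double quote is treated as a diacritic substitute only when another letter follows it. All names are hypothetical.

```python
import re

# Base letter -> Udmurt letter with diaeresis (o, i, zh, z, ch).
DIAERESIS = {
    "\u043e": "\u04e7", "\u0438": "\u04e5", "\u0436": "\u04dd",
    "\u0437": "\u04df", "\u0447": "\u04f5",
}
DIAERESIS_RE = re.compile("([" + "".join(DIAERESIS) + "])[:\"](?=\\w)")

# Partial table of Latin lookalikes: a e o p c x y -> Cyrillic counterparts.
LATIN_TO_CYRILLIC = str.maketrans(
    "aeopcxy", "\u0430\u0435\u043e\u0440\u0441\u0445\u0443"
)


def normalize_token(token):
    """Normalize a token before morphological analysis (sketch)."""
    token = token.lower()
    if re.search(r"[\u0400-\u04FF]", token):
        # Map lookalikes only in tokens that already contain Cyrillic letters.
        token = token.translate(LATIN_TO_CYRILLIC)
    token = re.sub(r"(.)\1{2,}", r"\1", token)  # collapse long repetitions
    return DIAERESIS_RE.sub(lambda m: DIAERESIS[m.group(1)], token)


# The letters o, v, o, l with a colon after each o.
restored = normalize_token("\u043e:\u0432\u043e:\u043b")
```

The normalized form would be passed to the analyzer only, leaving the original spelling in the corpus text.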

Size and distribution of metadata values
After the language tagging, the corpus files were filtered to exclude users who wrote exclusively or almost exclusively in Russian. For each user wall, the number of sentences classified as Russian, Uralic or "unknown" was calculated. The file was excluded from the corpus either if it contained at most 3 Uralic sentences constituting less than 10% of all sentences, or if it contained at most 10 Uralic sentences constituting less than 1% of all sentences. If the number of sentences classified as "unknown" was suspiciously high, the file was checked manually.
The sizes of the corpora after filtering are listed in Table 3. The two columns on the right give sizes of Uralic and Russian parts of each corpus in tokens. It has to be borne in mind that some of the tokens belong to near-duplicates (see Subsection 2.2), so the actual sizes after proper deduplication may be lower. The figures for Komi-Zyrian and Moksha are preliminary; however, it is clear that the total size of the Moksha vkontakte segment is tiny compared to the rest of the languages.
increase in the number of texts, which continued until 2014-2015, number of Erzya vkontakte texts started going down. Permic segments of vkontakte, by contrast, continued growing, although Udmurt had a two-year plunge. The number of groups also seems to grow continuously: Pischlöger (2017) reported 90 groups in 2013 and 162 groups in 2016 for Udmurt. Komi-Zyrian speakers were adopting social media at a lower pace, but at the moment, the Komi-Zyrian segment outnumbers the Udmurt one in terms of token counts.
The Erzya forums enjoyed peak popularity around 2010. The reason for that was most probably the discussions about development of an artificial unified Mordvin language out of the two existing literary standards, Erzya and Moksha. This idea was advocated by Zaics (1995) and Keresztes (1995) and supported by Mosin (2014). The initiative belonged to people in positions of power rather than e.g. writers or teachers (Rueter, 2010, 7) and was vehemently opposed by Erzya language activists. This possibility was actively discussed in 2009, which energized the activists and led to the spike in the number of forum posts. The controversy seems to have abated since then, and both forums are now defunct (although still accessible).
The gender composition is even more different in Udmurt and Erzya (counting only vkontakte texts), as can be seen from Table 5. Three quarters of texts authored by users (rather than groups) in Erzya were written by males, while in Udmurt it is the females who contribute more. The Udmurt picture is actually close to the average: according to a 2017 study by Brand Analytics 6 , 58.4% of all posts in vkontakte are written by females. I do not have any explanation for this disparity.
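The exclusion rule can be written down directly. In this sketch the function name is hypothetical, and treating an empty file as excluded is an assumption that the text does not address.

```python
def should_exclude(n_uralic, n_total):
    """Decide whether a user wall is dropped as (almost) exclusively Russian."""
    if n_total == 0:
        return True  # assumption: a wall with no sentences is not kept
    share = n_uralic / n_total
    return (n_uralic <= 3 and share < 0.10) or (n_uralic <= 10 and share < 0.01)


excluded = should_exclude(3, 100)  # 3 Uralic sentences, 3% of the wall
```

The two conditions complement each other: the first removes walls with a handful of stray Uralic sentences, the second removes very large walls where even 10 such sentences are negligible.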

Linguistic properties
Literary standards were developed for minority Uralic languages only in the 1930s, although written literature in them existed earlier. During the Soviet times, the standard language was taught at schools, however, this is not obligatory anymore and even unavailable in many places. Dialectal variation is still significant within each language. While older speakers generally try to follow the literary standard when writing, the younger generation may not know it well enough. Their written speech is therefore influenced by their native dialects, as well as by Russian. This contrasts to official texts and press in these languages, where puristic attitudes prevail. In Udmurt, the official register with its neologisms is hardly comprehensible for many speakers (Edygarova, 2013). In Erzya, neologisms in press are often accompanied by their Russian translations in parentheses because otherwise nobody would understand them (Janurik, 2015). Texts in the social media are much closer to the spoken varieties, which makes them better suited for the research of the language as it is spoken today. Dialectal variation is observable in vocabulary and morphology. Frequently occurring non-standard suffixes include, for example, the infinitive in -n and present tense in -ko in Udmurt, or dative in -ńe and 1pl possessive in -mok in Erzya. This makes dialectological research on the social media corpora possible in principle. The main obstacle to such research is corpus size. Only a minority of users indicate their place of origin. Divided by the number of districts these people were born, this leaves a really small number of geographically attributed tokens for all districts except the most populous ones (several thousand to several dozen thousand tokens in the case of Udmurt). In order to see a reliable areal distribution of a phenomenon, that phenomenon has to be really frequent in texts.
As a test case, I used three dialectological maps collected for Udmurt using traditional methods: the distribution of the affirmative particles ben/bon (Maksimov, 2007b); the word for 'forest' (Maksimov, 2007a); and the word for 'plantain (Plantago; a small plant common in Udmurtia)' (Maksimov, 2013). The distribution of the affirmative particles was clearly recoverable from the corpus data: having an average frequency of over 1000 ipm, they had enough occurrences in most districts. The distribution obtained from the corpus coincided with the one from the dialectological map, although it had lower resolution. Out of 7 different names for the forest available on the dialectological map (excluding phonetic variants), 5 were present among the geographically attributed tokens of the corpus (ńules, telʼ, śik, ćašša, surd). The overwhelming majority of occurrences in all districts belonged to the literary variant, ńules, while each of the other variants had only a handful of examples. Nevertheless, all these occurrences were attested exactly in the districts where they were predicted to appear by the dialectological map. Finally, the map for the plantain had 27 variants. Given the number of available options and the low frequency of this word, it is not surprising that its distribution turned out to be completely unrecoverable from the corpus. To sum up, it is possible to obtain some information on areal distributions of high- or middle-frequency phenomena from the social media corpora. However, in most cases this information can only be used as a preliminary survey and has to be supplemented by fieldwork or other methods to draw reliable conclusions.
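The kind of comparison described above can be illustrated with a minimal sketch. It assumes that the geographically attributed part of a corpus is available as a sequence of (district, word) pairs; the function name, the input format and the min_hits threshold are hypothetical and are not part of the actual pipeline. For every district, the sketch computes the frequency of each variant in items per million tokens (ipm) and skips districts where the variants occur too rarely to say anything.

```python
from collections import Counter, defaultdict


def variant_ipm_by_district(tokens, variants, min_hits=1):
    """Return {district: {variant: ipm}} for the given variants.

    tokens   -- iterable of (district, word) pairs (assumed input format)
    variants -- collection of competing word forms, e.g. {"ben", "bon"}
    min_hits -- hypothetical threshold: districts with fewer occurrences
                of all variants taken together are left out
    """
    variants = set(variants)
    totals = Counter()            # all tokens per district
    hits = defaultdict(Counter)   # variant occurrences per district
    for district, word in tokens:
        totals[district] += 1
        if word in variants:
            hits[district][word] += 1

    result = {}
    for district, total in totals.items():
        if sum(hits[district].values()) < min_hits:
            continue  # too little evidence for this district
        result[district] = {
            v: hits[district][v] * 1_000_000 / total for v in sorted(variants)
        }
    return result


if __name__ == "__main__":
    # Toy data with invented district labels, for illustration only.
    toy = (
        [("district_A", "ben")] * 2 + [("district_A", "other")] * 998
        + [("district_B", "bon")] * 1 + [("district_B", "other")] * 999
    )
    for district, freqs in variant_ipm_by_district(toy, {"ben", "bon"}).items():
        print(district, freqs)
```

With a threshold above 1, sparsely attested districts drop out of the result, which mirrors the lower resolution of the corpus-based map compared to the dialectological one.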
All social media corpora (as well as the "support corpora", see Subsection 2.4) are or will be available for linguistic research through an online interface 7 . Udmurt and Erzya corpora are already online. Komi-Zyrian and Moksha are being processed and will be available in December 2018. Komi-Permyak and both Mari corpora are scheduled for release in the first half of 2019.
Unfortunately, for reasons of copyright and privacy protection, it is hardly possible to simply redistribute the source files freely. Instead, I currently employ a solution whereby the texts are only available through a search interface where users can make queries and get search hits. The search hits appear in shuffled order, and for each sentence found, only a limited number of context sentences can be seen. This solution is commonly applied in the web-as-corpus approach. 8 All data is anonymized (see Subsection 2.3). Tsakorpus 9 is used as the corpus platform. Queries can include any layer of annotation or metadata and support regular expressions and Boolean functions. Additionally, all code used for data processing will be available under the MIT license in a public repository.
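The idea of shuffled hits with a limited context window can be sketched as follows. This is only an illustration of the general principle and not the way Tsakorpus implements it; the function name, the whitespace-based matching and the max_context parameter are assumptions made for the example.

```python
import random


def shuffled_hits(sentences, query, max_context=1, seed=None):
    """Return the sentences containing `query`, each with limited context.

    sentences   -- list of sentences in their original corpus order
    query       -- a single word form, matched against whitespace tokens
    max_context -- number of neighbouring sentences shown on each side
    seed        -- optional seed, only to make the shuffle reproducible
    """
    rng = random.Random(seed)
    positions = [i for i, s in enumerate(sentences) if query in s.split()]
    # Shuffling the hits prevents a user from reading them in corpus order
    # and reassembling longer stretches of the source texts.
    rng.shuffle(positions)
    return [
        sentences[max(0, i - max_context): i + max_context + 1]
        for i in positions
    ]


if __name__ == "__main__":
    # Placeholder sentences; "w" stands for the queried word form.
    corpus = ["a b", "c w d", "e f", "g h", "w i", "j k"]
    for window in shuffled_hits(corpus, "w", max_context=1, seed=0):
        print(window)
```

Each hit exposes at most 2 * max_context + 1 sentences, so the amount of running text that can be recovered from a single query stays bounded.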

Conclusion
In this paper, I described an ongoing project aimed at creating social media corpora for seven medium-sized minority Uralic languages. The processing pipeline for these corpora includes semi-supervised identification of the texts (mostly in the vkontakte social networking service), downloading them through the API, language tagging, filtering and anonymization, and morphological annotation. The corpora and tools used to build them are or will be publicly available. Sizes of the corpora vary, but do not exceed 3 million tokens written in the Uralic languages. Apart from those, each corpus also contains Russian sentences written by native speakers of the Uralic languages or in groups where Uralic texts have been posted; Russian parts of the corpora are several times larger than the Uralic ones. The corpora are better suited for sociolinguistic research than more traditional resources and contain texts written in a less formal register than those of press and fiction. Greater dialectal variation in the texts makes them a possible source for dialectological investigations, which, however, have to be supported by independent sources to draw reliable conclusions. In any case, given the scarcity of texts available digitally for the languages in question, the social media corpora will be a valuable resource for any kind of corpus-based linguistic research on them.