MOROCO: The Moldavian and Romanian Dialectal Corpus

In this work, we introduce the MOldavian and ROmanian Dialectal COrpus (MOROCO), which is freely available for download at https://github.com/butnaruandrei/MOROCO. The corpus contains 33564 samples of text (with over 10 million tokens) collected from the news domain. The samples belong to one of the following six topics: culture, finance, politics, science, sports and tech. The data set is divided into 21719 samples for training, 5921 samples for validation and another 5924 samples for testing. For each sample, we provide corresponding dialectal and category labels. This allows us to perform empirical studies on several classification tasks such as (i) binary discrimination of Moldavian versus Romanian text samples, (ii) intra-dialect multi-class categorization by topic and (iii) cross-dialect multi-class categorization by topic. We perform experiments using a shallow approach based on string kernels, as well as a novel deep approach based on character-level convolutional neural networks containing Squeeze-and-Excitation blocks. We also present and analyze the most discriminative features of our best performing model, before and after named entity removal.


Introduction
The high number of evaluation campaigns on spoken or written dialect identification conducted in recent years (Malmasi et al., 2016; Rangel et al., 2017; Zampieri et al., 2017, 2018) proves that dialect identification is an interesting and challenging natural language processing (NLP) task, actively studied by researchers nowadays. Motivated by this recent interest in dialect identification, we introduce the Moldavian and Romanian Dialectal Corpus (MOROCO), which is composed of 33564 text samples collected from the news domain.
Romanian is part of the Balkan-Romance group that evolved from several dialects of Vulgar Latin, which separated from the Western Romance branch of languages in the fifth century (Coteanu et al., 1969). In order to distinguish Romanian within the Balkan-Romance group, comparative linguistics refers to it as Daco-Romanian. Along with Daco-Romanian, which is currently spoken in Romania, there are three other dialects in the Balkan-Romance branch, namely Aromanian, Istro-Romanian and Megleno-Romanian. Moldavian is a subdialect of Daco-Romanian that is spoken in the Republic of Moldova and in northeastern Romania. The delimitation of the Moldavian subdialect, as with all other Romanian dialects, is made primarily by analyzing its phonetic features and only marginally by morphological, syntactical and lexical characteristics. Although the spoken dialects in Romania and Moldova differ, the two countries share the same literary standard (Minahan, 2013). Some linguists (Pavel, 2008) consider that the border between Romania and the Republic of Moldova does not correspond to any isoglosses significant enough to justify a dialectal division. One question that arises in this context is whether we can train a machine to accurately distinguish literary text samples written by people in Romania from literary text samples written by people in the Republic of Moldova. If we can construct such a machine, then what are the discriminative features it employs? Our corpus, formed of text samples collected from Romanian and Moldavian news websites, enables us to answer these questions. Furthermore, MOROCO provides a benchmark for the evaluation of dialect identification methods.
To this end, we consider two state-of-the-art methods, string kernels (Ionescu and Butnaru, 2017; Ionescu et al., 2014) and character-level convolutional neural networks (CNNs) (Ali, 2018; Belinkov and Glass, 2016), which obtained the first two places in the Arabic Dialect Identification Shared Task of the 2018 VarDial Evaluation Campaign (Zampieri et al., 2018). We also experiment with a novel CNN architecture inspired by the recently introduced Squeeze-and-Excitation (SE) networks (Hu et al., 2018), which exhibit state-of-the-art performance in object recognition from images. To our knowledge, we are the first to introduce Squeeze-and-Excitation networks in the text domain.
As we provide category labels for the collected text samples, we can perform additional experiments on various topic categorization tasks. One type of task is intra-dialect multi-class categorization by topic, i.e. the task is to classify the samples written either in the Moldavian dialect or in the Romanian dialect into one of the following six topics: culture, finance, politics, science, sports and tech. Another type of task is cross-dialect multi-class categorization by topic, i.e. the task is to classify the samples written in one dialect, e.g. Romanian, into six topics, using a model trained on samples written in the other dialect, e.g. Moldavian. These experiments are aimed at showing whether the considered text categorization methods are robust to the dialect shift between training and testing.
In summary, our contribution is threefold:
• We introduce a novel large corpus containing 33564 text samples written in the Moldavian and the Romanian dialects.
• We introduce Squeeze-and-Excitation networks to the text domain.
• We analyze the discriminative features that help the best performing method, string kernels, in (i) distinguishing the Moldavian and the Romanian dialects and in (ii) categorizing the text samples by topic.
We organize the remainder of this paper as follows. We discuss related work in Section 2. We describe the MOROCO data set in Section 3. We present the chosen classification methods in Section 4. We show empirical results in Section 5, and we provide a discussion on the discriminative features in Section 6. Finally, we draw our conclusion in Section 7.
Related Work

Arabic. The Arabic Online news Commentary (AOC) corpus (Zaidan and Callison-Burch, 2011) is the first available dialectal Arabic data set. Although AOC contains 3.1 million comments gathered from Egyptian, Gulf and Levantine news websites, the authors labeled only around 0.05% of the data set through the Amazon Mechanical Turk crowdsourcing platform. A subsequent work constructed a data set of audio recordings, Automatic Speech Recognition transcripts and phonetic transcripts of Arabic speech collected from the Broadcast News domain. That data set was used in the 2016, 2017 and 2018 VarDial Evaluation Campaigns (Malmasi et al., 2016; Zampieri et al., 2017, 2018). Alsarsour et al. (2018) collected the Dialectal ARabic Tweets (DART) data set, which contains around 25K manually-annotated tweets. The data set is well-balanced over five main groups of Arabic dialects: Egyptian, Maghrebi, Levantine, Gulf and Iraqi. Bouamor et al. (2018) presented a large parallel corpus of 25 Arabic city dialects, which was created by translating selected sentences from the travel domain.
Other languages. The Nordic Dialect Corpus (Johannessen et al., 2009) contains about 466K spoken words from Denmark, the Faroe Islands, Iceland, Norway and Sweden. The authors transcribed each dialect using the standard official orthography of the corresponding country. Francom et al. (2014) introduced the ACTIV-ES corpus, which represents a cross-dialectal record of the informal language use of Spanish speakers from Argentina, Mexico and Spain. The data set is composed of 430 TV or movie subtitle files. The DSL corpus collection (Tan et al., 2014) comprises news data from various corpora to emulate the diverse news content across different languages. The collection is comprised of six language variety groups. For each language, the collection contains 18K training sentences, 2K validation sentences and 1K test sentences. The ArchiMob corpus (Samardžić et al., 2016) contains manually-annotated transcripts of Swiss German speech collected from four different regions: Basel, Bern, Lucerne and Zurich. The data set was used in the 2017 and 2018 VarDial Evaluation Campaigns (Zampieri et al., 2017, 2018). Kumar et al. (2018) constructed a corpus of five Indian dialects consisting of 307K sentences. The samples were collected from printed stories, novels and essays in books, magazines and newspapers, by scanning the pages, passing them through an OCR engine and proofreading the output.
Romanian. To our knowledge, the only empirical study on Romanian dialect identification was conducted by Ciobanu and Dinu (2016), who used a short list of 108 parallel words in a binary classification task in order to discriminate between Daco-Romanian words and Aromanian, Istro-Romanian and Megleno-Romanian words. Different from Ciobanu and Dinu (2016), we conduct a large scale study on 33K documents that contain a total of about 10 million tokens.

MOROCO
In order to build MOROCO, we collected text samples from the top five most popular news websites in Romania and the Republic of Moldova, respectively. Since news websites in the two countries belong to different Internet domains, the text samples can be automatically labeled with the corresponding dialect. We selected news from six different topics, for which we found at least 2000 text samples in both dialects. For each dialect, we illustrate the distribution of text samples per topic in Figure 1. In both countries, we notice that the most popular topics are finance and politics, while the least popular topics are culture and science. It is important to note that, in order to obtain the text samples, we removed all HTML tags and replaced consecutive space characters with a single space character. We further processed the samples in order to eliminate named entities. Previous research (Abu-Jbara et al., 2013; Nicolai and Kondrak, 2014) found that named entities such as country or city names can provide clues about the native language of English learners. We decided to remove named entities in order to prevent classifiers from making decisions based on features that are not truly indicative of the dialects or the topics. For example, named entities representing city names in Romania or Moldova can provide clues about the dialect, while named entities representing politicians' or football players' names can provide clues about the topic. The identified named entities are replaced with the token $NE$.
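The preprocessing steps described above can be sketched as follows. This is a minimal illustration with hypothetical helper names; the text does not specify which named entity recognizer produced the entity spans, so they are taken as given input here.

```python
import re

def clean(raw_html):
    """Remove HTML tags and collapse consecutive whitespace, as described
    for the MOROCO samples."""
    text = re.sub(r"<[^>]+>", " ", raw_html)     # strip HTML tags
    text = re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace
    return text

def mask_entities(text, entity_spans):
    """Replace named entity spans with the $NE$ token.

    entity_spans: list of (start, end) character offsets into `text`,
    assumed to come from an external NER tool. Spans are replaced right
    to left so earlier offsets stay valid.
    """
    for start, end in sorted(entity_spans, reverse=True):
        text = text[:start] + "$NE$" + text[end:]
    return text
```

For instance, `mask_entities(clean("<p>Ana  are   mere</p>"), [(0, 3)])` masks the leading name and yields `"$NE$ are mere"`.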
In the experiments, we present results before and after named entity removal, in order to illustrate the effect of named entities.
In order to allow proper comparison in future research, we divided MOROCO into a training, a validation and a test set. We used stratified sampling in order to produce a split that preserves the distribution of dialects and topics across all subsets. Table 1 shows some statistics of the number of samples as well as the number of tokens in each subset. We note that the entire corpus contains 33564 samples with more than 10 million tokens in total. On average, there are about 309 tokens per sample.
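A stratified split of this kind can be sketched as below. This is an illustrative helper, not the authors' actual script; the ratios are hypothetical values chosen to roughly match the reported 21719/5921/5924 split.

```python
import random

def stratified_split(samples, labels, ratios=(0.65, 0.175, 0.175), seed=0):
    """Split samples into train/validation/test subsets while preserving
    the joint (dialect, topic) label distribution.

    Each label group is shuffled and divided proportionally, so every
    subset keeps roughly the same label proportions as the full corpus.
    """
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    train, val, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(ratios[0] * len(group))
        n_val = int(ratios[1] * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```

Because the proportional cut is applied per label group, the union of the three subsets is always the full sample set, with no leakage between splits.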
Since we provide both dialectal and category labels for each sample, we can perform several tasks on MOROCO:
• Binary classification by dialect - the task is to discriminate between the Moldavian and the Romanian dialects.
• Moldavian (MD) intra-dialect multi-class categorization by topic - the task is to classify the samples written in the Moldavian dialect into six topics.
• Romanian (RO) intra-dialect multi-class categorization by topic - the task is to classify the samples written in the Romanian dialect into six topics.
• MD→RO cross-dialect multi-class categorization by topic - the task is to classify the samples written in the Romanian dialect into six topics, using a model trained on samples written in the Moldavian dialect.
• RO→MD cross-dialect multi-class categorization by topic - the task is to classify the samples written in the Moldavian dialect into six topics, using a model trained on samples written in the Romanian dialect.

Methods
String kernels. Kernel functions (Shawe-Taylor and Cristianini, 2004) capture the intuitive notion of similarity between objects in a specific domain. For example, in text mining, string kernels can be used to measure the pairwise similarity between text samples, simply based on character n-grams. Various string kernel functions have been proposed to date (Ionescu et al., 2014; Lodhi et al., 2002; Shawe-Taylor and Cristianini, 2004). Recently, the presence bits string kernel and the histogram intersection kernel obtained state-of-the-art results in a broad range of text classification tasks such as dialect identification (Ionescu and Butnaru, 2017), native language identification, sentiment analysis (Giménez-Pérez et al., 2017) and automatic essay scoring (Cozma et al., 2018). In this paper, we opt for the presence bits string kernel, which allows us to derive the primal weights and analyze the most discriminative features. For two strings over an alphabet Σ, x, y ∈ Σ*, the presence bits string kernel is formally defined as:

k^{0/1}(x, y) = Σ_{s ∈ Σ^n} in_s(x) · in_s(y),

where in_s(x) is 1 if string s occurs as a substring in x, and 0 otherwise. In our empirical study, we experiment with character n-grams in a range of lengths, and employ the Kernel Ridge Regression (KRR) binary classifier. During training, KRR finds the vector of weights that has both small empirical error and small norm in the Reproducing Kernel Hilbert Space generated by the kernel function.
The ratio between the empirical error and the norm of the weight vector is controlled through the regularization parameter λ.
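The presence bits kernel and the KRR training step can be sketched as follows. This is a naive reference implementation with hypothetical function names; practical implementations compute the kernel matrix far more efficiently than by pairwise set intersection.

```python
import numpy as np

def ngram_set(text, n):
    """All distinct character n-grams occurring in a string."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def presence_kernel(texts_a, texts_b, n=6):
    """Presence bits string kernel: k(x, y) sums in_s(x) * in_s(y) over all
    n-grams s, i.e. it counts the distinct n-grams present in both x and y."""
    sets_a = [ngram_set(t, n) for t in texts_a]
    sets_b = [ngram_set(t, n) for t in texts_b]
    return np.array([[len(sa & sb) for sb in sets_b] for sa in sets_a],
                    dtype=float)

def krr_train(K, y, lam=1e-5):
    """Kernel Ridge Regression dual weights: alpha = (K + lam * I)^-1 y,
    where lam is the regularization parameter lambda."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def krr_predict(K_test_train, alpha):
    """Predict scores for test samples from their kernel values against
    the training samples; the sign gives the binary label."""
    return K_test_train @ alpha
```

A test sample sharing many character n-grams with the positive-class training texts receives a positive score, and vice versa.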
Character-level CNN. Convolutional networks (LeCun et al., 1998; Krizhevsky et al., 2012) have been employed for solving many NLP tasks such as part-of-speech tagging (Santos and Zadrozny, 2014), text categorization (Johnson and Zhang, 2015; Kim, 2014), dialect identification (Ali, 2018; Belinkov and Glass, 2016), machine translation (Gehring et al., 2017) and language modeling (Kim et al., 2016). Many CNN-based methods rely on words, the primary reason being the aid provided by word embeddings (Mikolov et al., 2013; Pennington et al., 2014) and their ability to learn semantic and syntactic latent features. Seeking to eliminate pre-trained word embeddings from the pipeline, some researchers have tried to build end-to-end models that use characters as input, in order to solve text classification (Belinkov and Glass, 2016) or language modeling tasks (Kim et al., 2016). At the character level, the model can learn unusual character sequences such as misspellings, and can cope with words unseen during training. This appears to be particularly helpful in dialect identification, since some state-of-the-art dialect identification methods (Ionescu and Butnaru, 2017) use character n-grams as features.
In this paper, we draw our inspiration from previous work on character-level CNNs in order to design a lightweight character-level CNN architecture for dialect identification. One way to represent characters in a character-level CNN is to map every character from an alphabet of size t to a discrete value using a 1-of-t encoding. For example, given the alphabet Σ = {a, b, c}, the encoding for the character a is 1, for b is 2, and for c is 3. Each character from the input text is encoded, and only a fixed size l of the input is kept. In our case, we keep the first l = 5000 characters, zero-padding the documents that are shorter than this length. We compose an alphabet of 105 characters that includes uppercase and lowercase letters, Moldavian and Romanian diacritics (such as ă, â, î, ş and ţ), digits, and 33 other symbol characters. Characters that do not appear in the alphabet are encoded as a blank character.
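The 1-of-t encoding with truncation and zero-padding can be sketched as below (hypothetical helper names). Here index 0 is reserved for padding, and characters outside the alphabet fall back to the blank character, as described above.

```python
def build_vocab(alphabet):
    """1-of-t encoding: map each alphabet character to a distinct integer
    starting at 1; 0 is reserved for zero-padding. The alphabet is assumed
    to contain the blank character ' '."""
    return {ch: i + 1 for i, ch in enumerate(alphabet)}

def encode(text, vocab, length=5000):
    """Encode a document: truncate to `length` characters, map unknown
    characters to the blank character's code, and zero-pad short inputs."""
    blank = vocab[" "]
    ids = [vocab.get(ch, blank) for ch in text[:length]]
    ids += [0] * (length - len(ids))
    return ids
```

With the full 105-character alphabet and l = 5000, every document becomes a fixed-length integer sequence suitable as input to the embedding layer.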
As illustrated in the left-hand side of Figure 2, our architecture is seven blocks deep, containing one embedding layer, three convolutional and max-pooling blocks, and three fully-connected blocks. The first two convolutional layers are based on one-dimensional filters of size 7, while the third one is based on one-dimensional filters of size 3. A thresholded Rectified Linear Unit (ReLU) activation function (Nair and Hinton, 2010) follows each convolutional layer. The max-pooling layers are based on one-dimensional filters of size 3 with stride 3. After the third convolutional block, the activation maps pass through two fully-connected blocks with thresholded ReLU activations. Each of these two fully-connected blocks is followed by a dropout layer with a dropout rate of 0.5. The last fully-connected layer is followed by softmax, which provides the final output. All convolutional layers have 128 filters, and the threshold used for the thresholded ReLU is 10^-6. The network is trained with the Adam optimizer (Kingma and Ba, 2015) using categorical cross-entropy as the loss function.
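To make the spatial dimensions concrete, the following sketch traces the feature map length through the three convolutional and max-pooling blocks, under the assumption, not stated above, that the convolutions use 'valid' padding with stride 1:

```python
def conv1d_len(n, k, stride=1):
    """Output length of a 1-D 'valid' convolution with filter size k."""
    return (n - k) // stride + 1

def pool1d_len(n, k, stride):
    """Output length of a 1-D max-pooling layer with filter size k."""
    return (n - k) // stride + 1

# Walk the l = 5000 character input through the three conv + pool blocks
# (conv filter sizes 7, 7, 3; pooling of size 3 with stride 3).
n = 5000
for conv_k in (7, 7, 3):
    n = conv1d_len(n, conv_k)
    n = pool1d_len(n, 3, stride=3)
print(n)  # spatial length entering the fully-connected blocks
```

Under these padding assumptions, the 5000-character input shrinks to a few hundred positions (each with 128 channels) before the fully-connected blocks.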
Squeeze-and-Excitation Networks. Hu et al. (2018) argued that the convolutional filters close to the input layer are not aware of the global appearance of the objects in the input image, as they operate at the local level. To alleviate this problem, Hu et al. (2018) proposed to insert Squeeze-and-Excitation blocks after each convolutional block that is closer to the network's input. The SE blocks are formed of two layers, squeeze and excitation. The activation maps of a given convolutional block are first passed through the squeeze layer, which aggregates the activation maps across the spatial dimension in order to produce a channel descriptor. This layer can be implemented through a global average pooling operation. In our case, the size of the output after the squeeze operation is 1 × 128, since our convolutional layers are one-dimensional and each layer contains d = 128 filters. The resulting channel descriptor enables information from the global receptive field of the network to be leveraged by the layers near the network's input. The squeeze layer is followed by an excitation layer based on a self-gating mechanism, which aims to capture channel-wise dependencies. The self-gating mechanism is implemented through two fully-connected layers, the first being followed by ReLU activations and the second being followed by sigmoid activations. The first fully-connected layer acts as a bottleneck layer, reducing the input dimension (given by the number of filters d) by a reduction ratio r. This is achieved by assigning d/r units to the bottleneck layer. The second fully-connected layer increases the size of the output back to 1 × 128. Finally, the activation maps of the preceding convolutional block are reweighted (using the 1 × 128 outputs provided by the excitation layer as weights) to generate the output of the SE block, which can then be fed directly into subsequent layers.
Thus, SE blocks are alternative pathways designed to recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels. We insert SE blocks after each convolutional block, as illustrated in the right-hand side of Figure 2.
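A forward pass through a one-dimensional SE block can be sketched in NumPy as follows (hypothetical function name; with d = 128 and r = 64, `w1` would have shape 128 × 2 and `w2` shape 2 × 128):

```python
import numpy as np

def se_block(activations, w1, b1, w2, b2):
    """Squeeze-and-Excitation forward pass for 1-D feature maps.

    activations: (length, d) output of a convolutional block.
    w1, b1: bottleneck fully-connected layer of d/r units (d x d/r weights).
    w2, b2: expansion fully-connected layer back to d units.
    """
    # Squeeze: global average pooling over the spatial dimension gives
    # a 1 x d channel descriptor.
    z = activations.mean(axis=0)
    # Excitation: bottleneck FC + ReLU, then expansion FC + sigmoid
    # (the self-gating mechanism producing per-channel weights in [0, 1]).
    h = np.maximum(z @ w1 + b1, 0.0)
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))
    # Recalibrate: rescale every channel of the activation maps by its
    # excitation weight.
    return activations * s
```

The output has the same shape as the input, so the block can be dropped in after any convolutional block without altering the rest of the architecture.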

Experiments
Parameter tuning. In order to tune the parameters of each model, we used the MOROCO validation set. We first carried out a set of preliminary dialect classification experiments to determine the optimal n-gram length for the presence bits string kernel and the regularization parameter λ of the KRR classifier. We present the results of these preliminary experiments in Figure 3. We notice that both λ = 10^-4 and λ = 10^-5 are good regularization choices, with λ = 10^-5 being slightly better for all n-gram lengths between 5 and 8. Although 6-grams, 7-grams and 8-grams attain almost equally good results, the best choice according to the validation results is 6-grams. Therefore, in the subsequent experiments, we employ the presence bits string kernel based on n-grams of length 6 and KRR with λ = 10^-5.
For the baseline CNN, we set the learning rate to 5 · 10^-4 and use mini-batches of 128 samples during training. We use the same parameters for the SE network. Both deep networks are trained for 50 epochs. For the SE blocks, we set the reduction ratio to r = 64, which results in a bottleneck layer with two neurons. We also tried lower reduction ratios, e.g. 32 and 16, but we obtained lower performance for these values.
Results. In Table 2, we present the accuracy rates, the weighted F1 scores and the macro-averaged F1 scores obtained by the three classification models (string kernels, CNN and SE networks) for all the classification tasks, on the validation set as well as the test set. Regarding the binary classification by dialect task, we notice that all models attain good results, above 90%. SE blocks bring only minor improvements over the baseline CNN. Our deep models, CNN and CNN+SE, attain results around 93%, while the string kernels obtain results above 94%. We thus conclude that written text samples from the Moldavian and the Romanian dialects can be accurately discriminated by both shallow and deep learning models. This answers our first question from Section 1.
Regarding the Moldavian intra-dialect 6-way categorization (by topic) task, we notice that string kernels perform quite well in comparison with the CNN and the CNN+SE models. In terms of the macro-averaged F1 scores, SE blocks bring improvements higher than 1% over the baseline CNN. In the MD→RO cross-dialect 6-way categorization task, our models attain the lowest performance on the Romanian test set. We note that in both cross-dialect settings, we use the validation set from the same dialect as the training set, in order to prevent any use of information about the test dialect during training. The Romanian intra-dialect 6-way categorization task seems to be much more difficult than the Moldavian intra-dialect categorization task, since all models obtain scores that are roughly 20% lower. In terms of the macro-averaged F1 scores, SE blocks bring improvements of around 4% over the baseline CNN. However, the results of CNN+SE are still well below those of the presence bits string kernel. Regarding the RO→MD cross-dialect 6-way categorization task, we find that the models learned on the Romanian training set obtain better results on the Moldavian (cross-dialect) test set than on the Romanian (intra-dialect) test set. Once again, this provides additional evidence that the 6-way categorization by topic task is more difficult for Romanian than for Moldavian. In all the intra-dialect and cross-dialect 6-way categorization tasks, we observe a high performance gap between deep and shallow models. These results are consistent with the recent reports of the VarDial Evaluation Campaigns (Malmasi et al., 2016; Zampieri et al., 2017, 2018), which point out that shallow approaches such as string kernels (Ionescu and Butnaru, 2017) surpass deep models in dialect and similar language discrimination tasks.
Although deep models obtain generally lower results, our proposal of integrating Squeeze-and-Excitation blocks seems to be a steady step towards improving CNN models for language identification, as SE blocks improve performance across all the experiments presented in Table 2, and, in some cases, the performance gains are considerable.

Discussion
In Table 3, we present comparative results before and after named entity removal (NER). We selected only the KRR based on the presence bits string kernel for this comparative study, since it provides the best performance among the considered baselines. The experiment reveals that named entities can artificially raise the performance by more than 1% in some cases, which is consistent with observations in previous works (Abu-Jbara et al., 2013; Nicolai and Kondrak, 2014).
Table 3: Accuracy rates, weighted F1 scores and macro-averaged F1 scores (in %) of the KRR based on the presence bits string kernel for the five evaluation tasks, before and after named entity removal (NER).
In order to understand why the KRR based on the presence bits string kernel works so well in discriminating the Moldavian and the Romanian dialects, we conduct an analysis of some of the most discriminative features (n-grams), which are listed in Table 4.
Table 4: Examples of n-grams from the Moldavian and the Romanian dialects that are weighted as more discriminative by the KRR based on the presence bits string kernel, before and after named entity removal (NER). The n-grams are placed between square brackets and highlighted in bold, shown inside the words in which they occur, and translated to English.
player', Romanians prefer to use 'jucător de tenis' for the same concept.
In a similar manner, we look at examples of features weighted as discriminative by the KRR based on the presence bits string kernel for categorization by topic. Table 5 lists discriminative n-grams for all six categories inside MOROCO, before and after NER. When named entities are left in place, we notice that the KRR classifier selects some interesting named entities as discriminative. For example, news in the politics domain make many references to politicians such as Liviu Dragnea (the leader of the Social-Democrat Party in Romania), Igor Dodon (the current president of Moldova) or Dacian Cioloş (a former prime minister of Romania). News that mention NASA (the National Aeronautics and Space Administration) or the Max Planck institute are likely to be classified into the science domain by the KRR classifier. After Simona Halep reached first place in the Women's Tennis Association (WTA) ranking, many sports news reported on her performances, which leads the classifier to choose 'Simona' or 'Halep' as discriminative n-grams. References to the Internet or the Facebook social network indicate that the respective news samples are from the tech domain, according to our classifier. When named entities are removed, KRR seems to choose plausible words for each category. For instance, it relies on n-grams such as 'muzică' (music) or 'artist' to classify a news sample into the culture domain, or on n-grams such as 'campion' (champion) or 'fotbal' (football) to classify a news sample into the sports domain.

Conclusion
In this paper, we presented a novel and large corpus of the Moldavian and Romanian dialects. We also introduced Squeeze-and-Excitation networks to the NLP domain, performing comparative experiments using shallow and deep state-of-the-art baselines. Finally, we provided an analysis of the most discriminative features.
Table 5: Examples of n-grams from the six different categories in MOROCO that are weighted as more discriminative by the KRR based on the presence bits string kernel, before and after named entity removal (NER). The n-grams are placed between square brackets and highlighted in bold, shown inside the words in which they occur, and translated to English.