KINNEWS and KIRNEWS: Benchmarking Cross-Lingual Text Classification for Kinyarwanda and Kirundi

Recent progress in text classification has been focused on high-resource languages such as English and Chinese. For low-resource languages, among them most African languages, the lack of well-annotated data and effective preprocessing hinders progress and the transfer of successful methods. In this paper, we introduce two news datasets (KINNEWS and KIRNEWS) for multi-class classification of news articles in Kinyarwanda and Kirundi, two low-resource African languages. The two languages are mutually intelligible, but while Kinyarwanda has been studied in Natural Language Processing (NLP) to some extent, this work constitutes the first study on Kirundi. Along with the datasets, we provide statistics, guidelines for preprocessing, and monolingual and cross-lingual baseline models. Our experiments show that training embeddings on the relatively higher-resourced Kinyarwanda yields successful cross-lingual transfer to Kirundi. In addition, the design of the created datasets allows for a wider use in NLP beyond text classification in future studies, such as representation learning, cross-lingual learning with more distant languages, or as a basis for new annotations for tasks such as parsing, POS tagging, and NER. The datasets, stopwords, and pre-trained embeddings are publicly available at https://github.com/Andrews2017/KINNEWS-and-KIRNEWS-Corpus .


Introduction
The availability of large monolingual and labeled corpora, paired with innovations in neural text processing, has led to a rapid improvement in the quality of text classification in recent years. 1 However, the effectiveness of deep-learning-based text classification models depends on the amount of monolingual and labeled data. Low-resource languages have traditionally been left behind because few prepared resources exist from which data could be extracted (Joshi et al., 2020). However, nowadays, the increase in internet use in many African developing countries has made access to information easier. This in turn has enabled the news agencies of those countries to cover many stories in their native languages. For example, BBC News now provides online news in Arabic, Amharic, Hausa, Kiswahili, Somali, Oromo, Igbo, Nigerian Pidgin, Tigrigna, Kinyarwanda, and Kirundi. 2 This development makes news the most reliable source of data for low-resource languages. We explore this opportunity for the example of Kinyarwanda and Kirundi, two African low-resource Bantu languages, and build news classification benchmarks from online news articles. Our goal is to enable NLP researchers to include Kinyarwanda and Kirundi in the evaluation of novel text classification approaches, and to diversify the current NLP landscape.
Kinyarwanda is one of the official languages of Rwanda 3 and belongs to the Niger-Congo language family. According to The New Times, 4 it is spoken by approximately 30 million people from four different countries.

Table 1: Two examples of news titles from our datasets that show the similarity level between Kinyarwanda and Kirundi. Words that are the same in both languages are shown in bold.

Joshi et al. (2020) classify the state of NLP for both Kinyarwanda and Kirundi as "Scraping-By", which means that they have been mostly excluded from previous NLP research and require the creation of dedicated resources for future inclusion in NLP research. To this aim, we introduce the two datasets KINNEWS and KIRNEWS for multi-class text classification in this paper. They consist of news articles written in Kinyarwanda and Kirundi, collected from local news websites and newspapers. KINNEWS samples are annotated with fourteen classes, while KIRNEWS samples are annotated with twelve classes, based on the agreement of the two annotators for each dataset. We describe a data cleaning pipeline, and we introduce the first ever stopword list for each language for preprocessing purposes. We present word embedding techniques for these two low-resource languages, and evaluate various classic and neural machine learning models. Together with the data, these baselines and preprocessing tools are made publicly available as benchmarks for future studies. In addition, pre-trained embeddings are published to facilitate studies on other NLP tasks for Kinyarwanda and Kirundi.
In the following, we will first discuss previous work on Kinyarwanda, Kirundi, and low-resource African languages in general in Section 2, and then describe the dataset creation in Section 3. We then present a range of experiments for text classification on the collected data in Section 4, concluding with an outlook on future work in Section 5.


Related Work

The multilingual JW300 corpus (Agić and Vulić, 2019) covers over 300 languages and has facilitated machine translation research for many low-resource languages (Tiedemann and Thottingal, 2020), amongst them many African languages (∀ et al., 2020b; ∀ et al., 2020a) which had not been subject to machine translation before.
Beyond the multilingual JW300 corpus, a few works have created new datasets for individual African languages. For example, Emezue and Dossou (2020) introduced the FFR project for creating a corpus of Fon-French (FFR) parallel sentences. Closer to our work, Marivate et al. (2020) created news classification benchmarks for Setswana and Sepedi, two Bantu languages of South Africa. In contrast to our work, their corpus is limited to headlines, while we provide both headlines and the full articles. Our dataset is also several orders of magnitude larger, since we include data from more sources and spent considerable effort on expanding an initial set of news sources.
While there is practically no NLP research on Kirundi, there are a few recent studies on Kinyarwanda for the tasks of Morphological Analysis (Muhirwe, 2009), Part-of-Speech (POS) tagging (Fang and Cohn, 2016; Duong et al., 2014; Cardenas et al., 2019), Parsing (Sun et al., 2014; Mielens et al., 2015), Automated Speech Recognition (Dalmia et al., 2018), Language Modeling (Andreas, 2020), and Named Entity Recognition (Rijhwani et al., 2020). Most of these works are largely based on a single Kinyarwanda dataset. This dataset contains transcripts of testimonies by survivors of the Rwandan genocide, was provided by the Kigali Genocide Memorial Center, and contains 90 annotated sentences with fourteen distinct POS tags. However, it is not suitable for text classification, and to the best of our knowledge, there are no publicly available datasets for Kinyarwanda and Kirundi text classification; this is the gap that this paper addresses. These works have also focused on either word alignment or monolingual approaches, and did not explore a cross-lingual approach.
We hope that the publication of our benchmarks will inspire the creation of similar datasets. As a result, this would allow the inclusion of more African low-resource languages in cross-lingual studies and benchmarks such as XTREME (Hu et al., 2020), a multi-task benchmark for the evaluation of cross-lingual generalization of multilingual representations across 40 languages, which already includes higher-resourced African languages like Afrikaans and Swahili. For past multi-lingual efforts like XTREME, one guiding factor for language selection has been the size of the Wikipedia in the respective languages. The number of Wikipedia articles in local languages is often interpreted as a measure of digital maturity and a pragmatic estimator for the success of un- or self-supervised NLP methods, but this ignores societal and human factors that (cyclically) influence the activity of Wikipedia editor communities. In this work, we want to showcase the impact of manual collection of data sources beyond Wikipedia. The number of news articles that we could retrieve for Kirundi and Kinyarwanda exceeds the number of available Wikipedia articles by far (616 and 1828 Wikipedia articles, respectively). 11

Dataset Creation
In this section, we first describe the process of data collection, then the annotation, and finally our data cleaning pipeline for KINNEWS and KIRNEWS. In general, the copyright for the published news articles remains with the original authors or publishers. Our work can be seen as an additional pre-processing and modeling pipeline, with an annotation layer on top.

Collection Process
KINNEWS KINNEWS is collected from fifteen news websites and five newspapers from Rwanda. An initial seed of news sources was retrieved from two websites which list newspapers from Rwanda. 12 These lists also include Rwandan news sources that publish in other languages such as French and English, so we selected only those which publish in Kinyarwanda. Additionally, we expanded the initial seed through Google Search, querying manually selected Kinyarwanda keywords and phrases such as "Iterambere ry'umugore mu Rwanda" ("Women's development in Rwanda") to find all news sources that had published the searched phrase or related news, and adding them to the list of Kinyarwanda news sources.
KIRNEWS We used the same process for KIRNEWS, which was collected from eight news sources in total. However, collection was more challenging, since most of the news sources listed on the overview websites 13 publish their news in either French or English. Only one news website was found that publishes in Kirundi. The same seed expansion technique as described above was therefore used to find three more news websites and four newspapers that publish in Kirundi, which was very time-consuming. We hope that this kind of seed expansion can in future be automated with the help of NLP technology for Kirundi.
Document Structure Each data sample in both KINNEWS and KIRNEWS consists of the news headline and the article's content. We separate the title from the content to ease the annotation process, since an annotator may sometimes annotate a news article based on its title alone, without reading the whole article. In addition, the original source URLs are recorded with the extracted articles, so that meta-information about dates or authors, or multi-modal content such as embedded images and captions, can be retrieved if needed. Future NLP studies may also exploit this structure, e.g. for headline prediction or automatic summarization.

Annotation Process
The news articles we collected were initially assigned different categories by the publishers: 48 different categories in KINNEWS and 26 in KIRNEWS. However, many of these categories were related, so to reduce noisy samples, the annotators agreed on grouping related categories together, resulting in fourteen categories for KINNEWS and twelve categories for KIRNEWS in total (details in Appendix A).
In both datasets, each category is assigned a numerical label (label), ranging from 1 to 14 for KINNEWS and from 1 to 12 for KIRNEWS, as well as an English label (en label) to help those who do not understand Kinyarwanda or Kirundi see what an article relates to. Moreover, Kinyarwanda labels (kin label) for KINNEWS and Kirundi labels (kir label) for KIRNEWS are also provided.
Then, based on these agreed categories, two annotators per dataset, all linguistics graduates and native speakers of the respective language, attentively reviewed each news article and annotated it based on its title and content. If they were unsure about an article's category, they annotated it as neutral with a numerical label of 0; these articles were later removed from the final dataset to focus on clearly classifiable data.

Dataset Cleaning
For each language, we provide a cleaned version and a raw version. The cleaning is done in two stages: (1) removal of special characters, and (2) stopword removal. Low-resource languages generally lack language processing tools and resources (Baumann and Pierrehumbert, 2014; Muis et al., 2018), and Kinyarwanda and Kirundi are among the languages without any language-specific processing tools (tokenizers, lemmatizers, stemmers, stopword filters, normalizers, etc.).

Special Characters Removal
Documents retrieved from the internet are often noisy. To obtain high-quality, clean datasets, we remove the non-alphanumeric characters ;.?/\| #$%-<>()[]{}&*~'+-=^ as well as \n, \r, \t, and URLs from the text. Note that removing punctuation might lose sentence boundary information within an article. However, this does not hurt the performance of our models, because they are trained on word-based features within each article. And since the raw data is provided, punctuation can be restored for other applications, for example for developing language-specific tokenizers.
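A minimal sketch of this cleaning step in Python is shown below. The character class and URL pattern are reconstructed from the list above; the function name and exact regex escaping are our own choices, not the authors' released script:

```python
import re

# URL pattern and special-character class reconstructed from the
# description above; escaping choices are ours.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
CHAR_RE = re.compile(r"[;.?/\\|#$%<>()\[\]{}&*~'+=^-]")
WS_RE = re.compile(r"[\n\r\t]+|\s{2,}")

def clean_text(text):
    """Remove URLs, the listed special characters, and stray whitespace."""
    text = URL_RE.sub(" ", text)   # drop URLs first, before '.' and '/' go
    text = CHAR_RE.sub(" ", text)  # replace special characters by spaces
    text = WS_RE.sub(" ", text)    # collapse newlines/tabs/double spaces
    return text.strip()
```

Replacing characters with spaces rather than deleting them avoids accidentally gluing adjacent words together.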
Stopwords Since stopwords play an important role in semantic text preprocessing, we create the first stopword lists for both Kinyarwanda and Kirundi. Using additional data from the Kinyarwanda Bible 14 and drawing on the annotators' knowledge of both languages, we found that two-letter words such as "mu" ("in") and "ku" ("on"/"at"), three-letter words such as "uyu" ("this") and "iyo" ("that"), and four-letter words such as "muri" ("in") have high frequency (see Figure 1) but do not carry a significant role in training text classification models. These and similar words were used to create a list of 80 stopwords for Kinyarwanda and 59 stopwords for Kirundi, listed in Table 2. The words in the stopword lists were then removed from the respective cleaned news datasets.
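The frequency analysis behind this can be sketched as follows. This is an illustrative reconstruction (the function name and thresholds are ours); the final lists were curated by native speakers, not taken from frequency counts alone:

```python
from collections import Counter

def stopword_candidates(tokens, max_len=4, top_k=50):
    """Rank short, highly frequent words as stopword candidates.

    Only words of up to `max_len` characters are considered, mirroring
    the two-, three-, and four-letter examples in the text.
    """
    counts = Counter(t for t in tokens if len(t) <= max_len)
    return [word for word, _ in counts.most_common(top_k)]
```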

Dataset Statistics
The datasets contain a total of 21,268 and 4,612 news articles, distributed across 14 and 12 categories for KINNEWS and KIRNEWS, respectively. As shown in Table 3, politics-related articles are the majority in both datasets, while education- and history-related articles are the minority in KINNEWS and KIRNEWS, respectively. An in-depth evaluation of the similarity between Kinyarwanda and Kirundi using the created datasets shows that the two languages share 27,489 unique words, corresponding to 32% of all unique words in KIRNEWS, using the raw versions of the datasets, and 22,841 unique words, corresponding to 36.2% of all unique words in KIRNEWS, using the cleaned versions.
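The reported overlap figures correspond to a simple intersection of the two vocabularies, which can be computed as sketched below (the tokenization used for the official numbers is the one described earlier in this section):

```python
def vocabulary_overlap(tokens_kin, tokens_kir):
    """Return the number of shared unique words and their share of the
    Kirundi vocabulary, as in the 32%/36.2% figures above."""
    vocab_kin, vocab_kir = set(tokens_kin), set(tokens_kir)
    shared = vocab_kin & vocab_kir
    return len(shared), len(shared) / len(vocab_kir)
```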

Word Embedding Training
Many African low-resource languages, including Kinyarwanda and Kirundi, do not enjoy the benefit of recent pre-trained models such as GloVe (Pennington et al., 2014), BERT (Devlin et al., 2018), XLNet (Yang et al., 2019), or fastText (Grave et al., 2018), because these models were trained on higher-resource languages exclusively. Recent text classification approaches for low-resource languages rely on a transfer learning approach that uses features of resource-rich languages, learned by pre-trained word embeddings, to train low-resource models. However, this technique might not be effective enough, or not applicable at all, when there is no parallel corpus between the resource-rich and resource-poor languages.
Since our datasets contain a reasonable number of sentences, the features to train our neural-network-based models are obtained by training Word2Vec embeddings (Mikolov et al., 2013) from scratch. Word2Vec is trained using the gensim framework 16 with a window size of 5, ignoring all words with a total frequency lower than 5, removing stopwords, special characters, and URLs, and using the skip-gram training algorithm with hierarchical softmax. We train two versions with different dimensions for each language: one with 50 dimensions (W2V-Kin-50) and the other with 100 dimensions (W2V-Kin-100) for Kinyarwanda, and correspondingly W2V-Kir-50 and W2V-Kir-100 for Kirundi.
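The embedding training can be sketched with gensim as below. The hyperparameters (window 5, minimum frequency 5, skip-gram with hierarchical softmax) are the ones stated above; the argument names follow gensim >= 4.0 (`vector_size` was called `size` in earlier releases), and the `NewsSentences` helper is our own illustration:

```python
# Hyperparameters as described in the text; sg=1 selects skip-gram,
# hs=1 selects hierarchical softmax.
W2V_PARAMS = dict(vector_size=100, window=5, min_count=5, sg=1, hs=1)

class NewsSentences:
    """Stream tokenized articles (title + content) as lists of words."""
    def __init__(self, articles):
        self.articles = articles  # iterable of (title, content) pairs

    def __iter__(self):
        for title, content in self.articles:
            yield (title + " " + content).lower().split()

def train_embeddings(articles):
    """Train Word2Vec from scratch on cleaned, stopword-filtered text."""
    from gensim.models import Word2Vec  # assumes gensim is installed
    model = Word2Vec(sentences=NewsSentences(articles), **W2V_PARAMS)
    return model.wv  # KeyedVectors, e.g. the W2V-Kin-100 vectors
```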

Text Classification Task
Monolingual For a monolingual approach, we train and evaluate our baseline models using KINNEWS for Kinyarwanda and KIRNEWS for Kirundi separately. This means that we are using exclusively the data available for each task, ignoring the similarity of both languages.
Cross-lingual Cross-lingual transfer has been leveraged for many low-resource applications (Agić et al., 2015; Buys and Botha, 2016; Adams et al., 2017; Fang and Cohn, 2017; Cotterell and Duh, 2017). Most commonly, these approaches rely on machine translation and word alignments between resource-rich and low-resource languages. In this paper, however, we follow a simpler approach that exploits the fact that both languages are mutually intelligible, and does not require parallel or aligned resources. We train the baseline models using KINNEWS and embeddings learned from Kinyarwanda, and test them on KIRNEWS. This simulates a scenario in which we have no training data for Kirundi at all. Alternatively, we train and test embedding-based models on KIRNEWS using the Kinyarwanda embeddings. This models a scenario in which embeddings in a higher-resourced related language are available, together with a small training set in the target language. We only investigate the transfer from Kinyarwanda to Kirundi and not the reverse, since our Kinyarwanda data is much larger than our Kirundi data. 17
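The mechanism that makes this transfer possible can be illustrated with a toy featurizer: a Kirundi document is represented through the Kinyarwanda embeddings of its in-vocabulary tokens, while Kirundi-only words are skipped. This is a simplification of the actual models, which feed the pre-trained embeddings into an embedding layer and fine-tune them; it is only meant to show the role of the shared vocabulary, and all names are ours:

```python
def average_embedding(tokens, kin_vectors, dim):
    """Average the Kinyarwanda vectors of the tokens found in the
    shared vocabulary; unknown (Kirundi-only) words are skipped."""
    vecs = [kin_vectors[t] for t in tokens if t in kin_vectors]
    if not vecs:
        return [0.0] * dim  # document with no vocabulary overlap at all
    # component-wise mean over the in-vocabulary word vectors
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

The higher the lexical overlap between the two languages, the fewer tokens are skipped, which is why mutual intelligibility translates into usable features.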

Baseline Models
We perform benchmark experiments on the datasets using several different classic and neural approaches.
In all experiments, we use the pre-processed (cleaned) versions of the datasets, because the raw versions contain too much noise. The training and validation sets are split with a ratio of 9:1.
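The 9:1 split can be sketched as follows; seeded shuffling is our own assumption, as the randomization procedure is not stated:

```python
import random

def train_val_split(samples, val_ratio=0.1, seed=42):
    """Shuffle and split samples into train/validation at a 9:1 ratio."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_val = int(len(samples) * val_ratio)
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val
```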

Classic Models
For all classic machine learning approaches, we use Term Frequency-Inverse Document Frequency (TF-IDF) values of unigram input features. The maximum number of features is set depending on the training set and method. All of the models below are implemented with the scikit-learn framework and use its default hyperparameters:

• Multinomial Naive Bayes (MNB)

• Logistic Regression (LR)

• Support Vector Machine (SVM) trained with SGD

16 https://radimrehurek.com/gensim/models/word2vec.html

17 We remove tourism- and fashion-related samples from KINNEWS to obtain a training set compatible with the KIRNEWS test set, which does not contain articles from these two categories.
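One of the classic baselines (the SVM trained with SGD on TF-IDF unigrams) can be sketched with scikit-learn as below. The `max_features` parameter stands in for the per-dataset feature limits mentioned above; everything else uses the library defaults, as in the experiments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

def svm_sgd_baseline(max_features=None):
    """TF-IDF unigram features feeding a linear SVM trained with SGD."""
    return make_pipeline(
        TfidfVectorizer(ngram_range=(1, 1), max_features=max_features),
        SGDClassifier(loss="hinge"),  # hinge loss = linear SVM objective
    )
```

The MNB and LR baselines are obtained the same way, swapping in `MultinomialNB` and `LogisticRegression` for the classifier.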

Neural Models
For neural models, we use the pre-trained embeddings as input and fine-tune them on the task (except for the character-based model). This mimics approaching an arbitrary NLP task with little training data but with available word embeddings. The following neural models are implemented:

• Character-level Convolutional Neural Network (Char-CNN): We use the small Char-CNN model for text classification proposed in (Zhang et al., 2015) with default hyperparameters, except that we remove from the alphabet the letters 'q' and 'x', which occur in neither Kinyarwanda nor Kirundi. Thus, the alphabet used in our model consists of 68 characters instead of the 70 characters in the original paper. The input feature length is also changed from 1,014 to 1,500 to capture most of the text of interest, since our datasets have relatively long news articles. The properties that make Char-CNN a good choice for low-resource languages are that (1) it does not require any data preprocessing and (2) it does not use word embeddings, which makes it effective when processing very noisy data.
• Convolutional Neural Network (CNN): We use the CNN for sentence classification proposed in (Kim, 2014) with default hyperparameters, except that we change the number of feature maps from 100 to 150 and the mini-batch size from 50 to 32. The model is trained on the two Word2Vec embeddings with dimensions 50 and 100, using different numbers of epochs and features depending on the training set and embedding dimension.
• Bidirectional Gated Recurrent Unit (BiGRU): We design a model that consists of a 2-layer bidirectional GRU (Cho et al., 2014) followed by a softmax linear layer. It uses dropout of 0.5 and a batch size of 32. The hidden layer dimension is set to either 256 or 128; like the CNN, the model is trained on the two Word2Vec embeddings with dimensions 50 and 100, and different numbers of epochs and features are used depending on the training set and embedding dimension, as for the previous models.
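The Char-CNN input encoding described above can be sketched as follows. The 68-character alphabet is the original 70-character alphabet of Zhang et al. (2015) with 'q' and 'x' removed; the exact symbol string and the function name below are our own reconstruction:

```python
# Original Char-CNN alphabet minus 'q' and 'x'; the exact symbol set is
# reconstructed from the description in the text.
ALPHABET = "abcdefghijklmnoprstuvwyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+=<>()[]{}"
FEATURE_LEN = 1500  # raised from the original 1,014 for long articles

def quantize(text, alphabet=ALPHABET, length=FEATURE_LEN):
    """One-hot encode the first `length` characters of `text`.

    Characters outside the alphabet, and padding positions past the end
    of the text, become all-zero columns, as in Zhang et al. (2015).
    """
    index = {c: i for i, c in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(length)]
    for pos, ch in enumerate(text.lower()[:length]):
        if ch in index:
            matrix[pos][index[ch]] = 1
    return matrix
```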

Monolingual Text Classification
The experimental results for monolingual text classification are shown in Tables 5 and 6. In each table, the benchmark results of the classic TFIDF-based models and the neural embedding-based models are grouped separately, and we highlight the result of the best model in each group. As shown in Tables 5 and 6, in the group of classic TFIDF-based models, SVM yields the best accuracy on both datasets compared to LR and MNB. It has high predictive power thanks to its separating hyperplane, which helps avoid overfitting and separates the classes very effectively. Another good property of SVM is that, through its support vectors, it can use a relatively small amount of data to obtain good predictions, which makes it perform well on both KINNEWS and KIRNEWS. In this group, MNB performs worst on both datasets; however, it gives relatively good results while requiring fewer features than the other methods.

In the group of neural embedding-based models, performance depends on the dataset: BiGRU performs best on KINNEWS, which is larger than KIRNEWS, while CNN performs best on KIRNEWS. A possible reason is that BiGRU needs a larger amount of data to outperform the CNN. Char-CNN performs worst on both datasets, likely because the limited computational resources used in the experiments forced us to limit the input feature length to 1,500, while the average length of a news article in each dataset is far greater than that. It thus remains an open challenge for future work to use longer input features and achieve better results. The result might also be interpreted as a pointer towards the general "pre-train and fine-tune" regime, which places classification models in an initial representation space that reflects word relations in the input.

Cross-lingual Text Classification
The results of the cross-lingual approaches in Table 7 show that, in the group of classic machine learning models, MNB surprisingly performs better than SVM and LR. The reason might be that MNB is a generative model while the others are discriminative models. Similar to the monolingual experiments, BiGRU performs best when trained on KINNEWS, while CNN performs best when trained on KIRNEWS. Interestingly, the Char-CNN suffers much more from the transfer, which illustrates that the embeddings reflect the similarities of the two languages on an abstract level that allows transfer much better than low-level features trained from scratch.
Neural models trained on KINNEWS and tested on KIRNEWS do not reach the results obtained when training and testing on KIRNEWS alone, which is not surprising, since only the latter setup sees Kirundi training data. Both setups, however, benefit from Kinyarwanda word embeddings from the same domain, which include vectors for many words shared with Kirundi. This shows that pretrained Kinyarwanda word embeddings, which are easier to obtain in high quality since data retrieval for Kinyarwanda is much easier, can be effectively used to train Kirundi text classification models, even in a zero-shot scenario without any labeled or unlabeled data for Kirundi.
Based on the performed experiments, the results for our example languages Kinyarwanda and Kirundi show that text classification based on cross-lingual transfer between mutually intelligible low-resource languages is possible without creating any word alignments or parallel translation data between the languages. All that is required is sufficient data in one of the languages to train the word embeddings.
Analysing the classification errors of the highest-scoring cross-lingual models, we find that the MNB model characteristically places many articles about "education" in the "politics" category, while the neural models are more accurate in this distinction. All models tend to confuse "relationship" articles with the "politics" category, which might be due to an overlap in common vocabulary focused on interactions between people. The most accurate classification is generally obtained for sports articles. Complete confusion matrices are displayed in Appendix B.

Conclusion and Future Work
In this paper, we built the first news text classification benchmarks for Kinyarwanda and Kirundi, two low-resource Bantu languages. We described the data collection process, provided guidelines for data cleaning, and evaluated classic and neural text classification models as initial baselines. We found fairly strong cross-lingual generalization of embedding models trained on the resource-richer Kinyarwanda to the lower-resource Kirundi, owing to their mutual intelligibility. This gives hope for future studies of languages from the Rwanda-Rundi language family, which have not been studied in NLP at all and would otherwise classify as "Left-Behind" according to Joshi et al. (2020), or analogously of other extremely low-resource languages with a slightly higher-resourced "sibling".
Future studies on the new datasets will investigate (1) contextualized embeddings, e.g. BERT, and (2) subword modeling. Furthermore, the datasets may be enriched with other linguistic annotations, such as named entities, and serve as a resource for NLP tasks other than text classification.