Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. ‘Low-resourced’-ness is a complex problem going beyond data availability and reflects systemic problems in society. In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, MT is centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all necessary agents required in the MT development process. We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets, MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

Introduction
Language prevalence in societies is directly bound to the people and places that speak this language. Consequently, resource-scarce languages in an NLP context reflect the resource scarcity in the society from which the speakers originate (McCarthy, 2017). Through the lens of a machine learning researcher, "low-resourced" identifies languages for which few digital or computational data resources exist, often classified in comparison to another language (Gu et al., 2018; Zoph et al., 2016). However, to the sociolinguist, "low-resourced" can be broken down into many categories: low density, less commonly taught, or endangered, each carrying slightly different meanings (Cieri et al., 2016). In this complex definition, the "low-resourced"-ness of a language is a symptom of a range of societal problems, e.g., authors oppressed by colonial governments have been imprisoned for writing novels in their languages, which reduced the publications in those languages (Wa Thiong'o, 1992), and fewer PhD candidates come from oppressed societies due to low access to tertiary education (Jowi et al., 2018). This results in fewer linguistic resources and fewer researchers from those regions working on NLP for their languages. Therefore, the problem of "low-resourced"-ness relates not only to the available resources for a language, but also to the lack of geographic and language diversity of NLP researchers themselves.
The NLP community has awakened to the fact that it has a diversity crisis in terms of limited geographies and languages (Caines, 2019; Joshi et al., 2020): research groups are extending NLP research to low-resourced languages (Guzmán et al., 2019; Hu et al., 2020; Wu and Dredze, 2020), and workshops have been established (Haffari et al., 2018; Axelrod et al., 2019). We scope the rest of this study to machine translation (MT) using parallel corpora only, and refer the reader to Joshi et al. (2019) for an assessment of low-resourced NLP in general.

Contributions.
We diagnose the problems of MT systems for low-resourced languages by reflecting on what agents and interactions are necessary for a sustainable MT research process. We identify which agents and interactions are commonly omitted from existing low-resourced MT research, and assess the impact that their exclusion has on the research. To involve the necessary agents and facilitate required interactions, we propose participatory research to build sustainable MT research communities for low-resourced languages. The feasibility and scalability of this method are demonstrated with a case study on MT for African languages, where we present its implementation and outcomes, including novel translation datasets, benchmarks for over 30 target languages contributed and evaluated by language speakers, and publications authored by participants without formal training as scientists.

Background
Cross-lingual Transfer. With the success of deep learning in NLP, language-specific feature design has become rare, and cross-lingual transfer methods have come into bloom (Upadhyay et al., 2016; Ruder et al., 2019) to transfer progress from high-resourced to low-resourced languages (Adams et al., 2017; Wang et al., 2019; Kim et al., 2019). The most diverse benchmark for multilingual transfer by Hu et al. (2020) allows measurement of the success of such transfer approaches across 40 languages from 12 language families. However, the inclusion of languages in the set of benchmarks is dependent on the availability of monolingual data for representation learning with previously annotated resources. The content of the benchmark tasks is English-sourced, and human performance estimates are taken from English. Most cross-lingual representation learning techniques are Anglo-centric in their design (Anastasopoulos and Neubig, 2019).

Multilingual Approaches. Multilingual MT (Dong et al., 2015; Firat et al., 2016a,b; Wang et al., 2020) addresses the transfer of MT from high-resourced to low-resourced languages by training multilingual models for all languages at once. Aharoni et al. (2019) and Arivazhagan et al. (2019) train models to translate between English and 102 languages, for the 10 most high-resourced African languages on private data, and otherwise on public TED talks (Qi et al., 2018). Multilingual training often outperforms bilingual training, especially for low-resourced languages. However, with multilingual parallel data being also Anglo-centric, the capabilities to translate from English versus into English vastly diverge (Zhang et al., 2020).
Another recent approach, mBART (Liu et al., 2020), leverages both monolingual and parallel data and also yields improvements in translation quality for lower-resource languages such as Nepali, Sinhala and Gujarati. While this provides a solution for small quantities of training data or monolingual resources, the extent to which standard BLEU evaluations reflect translation quality is not clear yet, since human evaluation studies are missing.
Targeted Resource Creation. Guzmán et al. (2019) develop evaluation datasets for low-resourced MT between English and Nepali, Sinhala, Khmer and Pashto. They highlight many problems with low-resourced translation: tokenization, content selection, and translation verification, illustrating increased difficulty translating from English into low-resourced languages, and highlight the ineffectiveness of accepted state-of-the-art techniques on morphologically-rich languages. Despite involving all agents of the MT process (Section 3), the study does not involve data curators or evaluators that understood the languages involved, and resorts to standard MT evaluation metrics. Additionally, how this effort-intensive approach would scale to more than a handful of languages remains an open question.

The Machine Translation Process
We reflect on what enables a sustainable process for MT research on parallel corpora in terms of the required agents and interactions, visualized in Figure 1. Content creators, translators, and curators form the dataset creation process, while the language technologists and evaluators are part of the model creation process. Stakeholders (not displayed) create demand for both processes.
Stakeholders are people impacted by the artifacts generated by each agent in the MT process, and can typically speak and read the source or the target languages. To benefit from MT systems, the stakeholders need access to technology and electricity.
Content Creators produce content in a language, where content is any digital or nondigital representation of language. For digital content, content creators require keyboards and access to technology.
Translators translate the original content, including crowd-workers, researchers, or translation professionals. They must understand the language of the content creator and the target language. A translator needs content to translate, provided by content creators. For digital content, the translator requires keyboards and technology access.
Curators are defined as individuals involved in the content selection for a dataset (Bender and Friedman, 2018), requiring access to content and translations. They should understand the languages in question for quality control and encoding information.
Language Technologists are defined as individuals using datasets and computational linguistic techniques to produce MT models between language pairs. Language technologists require language preprocessors, MT toolkits, and access to compute resources.
Evaluators are individuals who measure and analyse the performance of an MT model, and therefore need knowledge of both source and target languages. To report on the performance of models, evaluators require quality metrics, as well as evaluation datasets. Evaluators provide feedback to the Language Technologists for improvement.
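The agent definitions above can be read as a small dependency table. The following is a minimal illustrative sketch only: the resource labels, the `REQUIREMENTS` table, the `missing_resources` helper, and the example scenario are hypothetical shorthand introduced here for the requirements named in the text, not part of any released tooling.

```python
# Requirements per agent, paraphrased from the definitions above.
# The short labels are illustrative shorthand, not a fixed vocabulary.
REQUIREMENTS = {
    "stakeholder": {"technology", "electricity"},
    "content creator": {"keyboard", "technology"},
    "translator": {"content", "keyboard", "technology"},
    "curator": {"content", "translations"},
    "language technologist": {"preprocessors", "mt toolkit", "compute"},
    "evaluator": {"quality metrics", "evaluation data"},
}


def missing_resources(available):
    """Map each blocked agent to the required resources it lacks."""
    available = set(available)
    return {
        agent: sorted(needs - available)
        for agent, needs in REQUIREMENTS.items()
        if needs - available
    }


# Hypothetical scenario: content and toolkits exist, but no suitable
# keyboards, translations, compute, preprocessors, or evaluation data.
available = {"technology", "electricity", "content", "mt toolkit", "quality metrics"}
gaps = missing_resources(available)

for agent, needs in gaps.items():
    print(f"{agent}: missing {', '.join(needs)}")
```

In this hypothetical scenario only the stakeholder is unconstrained, which mirrors the point that a single missing resource (such as a keyboard) can block several agents at once.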

Limitations of Existing Approaches
If we place a high-resource MT pair such as English-to-French into the process defined above, we observe that each agent nowadays has the necessary resources and historical stakeholder demand to perform their role effectively. A "virtuous cycle" emerged where available content enabled the development of MT systems that in turn drove more translations, more tools, more evaluation and more content, which cycled back to improving MT systems. By contrast, parts of the process for existing low-resourced MT are constrained. Historically, many low-resourced languages had low demand from stakeholders for content creation and translation (Wa Thiong'o, 1992). Due to missing keyboards or limited access to technology, content creators were not empowered to write digital content (Adam, 1997; van Esch et al., 2019). This is a chicken-or-egg problem, where existing digital content in a language would attract more stakeholders, which would incentivize content creators (Kaffee et al., 2018). As a result, primary data sources for NLP research, such as Wikipedia, often have only a few hundred articles for low-resourced languages despite large speaker populations (see Table 1). Due to limited demand, existing translations are often domain-specific and small in size, such as the JW300 corpus (Agić and Vulić, 2019), whose content was created for missionary purposes.
When data curators are not part of the societies from where these languages originate, they are often unable to identify data sources or translators for languages, prohibiting them from checking the validity of the created resource. This creates problems in encoding, orthography or alignment, resulting in noisy or incorrect translation pairs (Taylor et al., 2015). This is aggravated by the fact that many low-resourced languages do not have a long written history to draw from and therefore might be less standardized and use multiple scripts. In collaboration with content creators, data curators can contribute to standardization or at least recognize potential issues for data processing further down the line.
As discussed in Section 1, language technologists are fewer in low-resourced societies. Furthermore, the techniques developed in high-resourced societies might be inapplicable due to compute, infrastructure or time constraints. Aside from the problem of education and complexity, existing techniques may not apply due to linguistic and morphological differences in the languages, or the scale, domain, or quality of the data (Hu et al., 2020; Pires et al., 2019).
Evaluators usually resort to potentially unsuitable automatic metrics due to time constraints or missing connections to stakeholders (Guzmán et al., 2019). The main evaluators of low-resourced NLP that is developed today typically cannot use human metrics due to the inability to speak the languages, or the lack of reliable crowdsourcing infrastructure, identified as one of the core weaknesses of previous approaches (in Section 2).
In summary, many agents in the MT process for low-resourced languages are either missing invaluable language and societal knowledge, or the necessary technical resources, knowledge, connections, and incentives to form interactions with other agents in the process.

Participatory Research Approach
We propose one way to overcome the limitations in Section 3.1: ensuring that the agents in the MT process originate from the countries where the low-resourced languages are spoken or can speak the low-resourced languages. Where this condition cannot be satisfied, at least a knowledge transfer between agents should be enabled. We hypothesize that using a participatory approach will allow researchers to improve the MT process by iterating faster and more effectively.
Participatory research, unlike conventional research, emphasizes the value of research partners in the knowledge-production process where the research process itself is defined collaboratively and iteratively. The "participants" are individuals involved in conducting research without formal training as researchers. Participatory research describes a broad set of methodologies, organised in terms of the level of participation. At the lowest level is crowd-sourcing, where participants are involved solely in data collection. The highest level-extreme citizen science-involves participation in the problem definition, data collection, analysis and interpretation (English et al., 2018).
Crowd-sourcing has been applied to low-resourced language data collection (Guevara-Rukoz et al., 2020; Millour and Fort, 2018), but existing studies highlight how the disconnect between the data creation process and model creation process causes challenges. In seeking to create cross-disciplinary teams that emphasize the values in a societal context, a participatory approach which involves participants in every part of the scientific process appears pertinent to solving the problems for low-resourced languages highlighted in Section 3.1.
To show how more involved participatory research can benefit low-resource language translation, we present a case study in MT for African languages.
African languages account for a small fraction of available language resources, and NLP research rarely considers African languages. In the taxonomy of Joshi et al. (2020), African languages are assigned categories ranging from "The Left Behinds" to "The Rising Stars", with most languages not having any annotated data. Even monolingual resources are sparse, as shown in Table 1.
In addition to a lack of NLP datasets, the African continent lacks NLP researchers. In 2018, only five out of the 2695 affiliations of the five major NLP conferences were from African institutions (Caines, 2019). ∀ et al. (2020) attribute this to a culmination of circumstances, in particular their societal embedding (Alexander, 2009) and socio-economic factors, hindering participation in research activities and events, leaving researchers disconnected and distributed across the continent. Consequently, existing data resources are harder to discover, especially since these are often published in closed journals or are not digitized (Mesthrie, 1995).
For African languages, the implementation of a standard crowd-sourcing pipeline, as used for example for collecting task annotations for English, is currently infeasible due to the challenges outlined in Section 3 and above. Additionally, no standard MT evaluation set for all of the languages in focus exists, nor are there prior published systems that we could compare all models against for a more insightful human evaluation. We therefore resort to intrinsic evaluation, and rely on this work becoming the first benchmark for future evaluations.
We invite the reader to adopt a meta-perspective of this case study as an empirical experiment: the hypothesis is that participatory research can facilitate low-resourced MT development; the experimental methodology is the set of strategies and tools employed to bring together distributed participants, enabling each language speaker to train, contribute, and evaluate their models. The experiment is evaluated in terms of the quantity and diversity of participants and languages, and the variety of research artifacts, in terms of benchmarks, human evaluations, publications, and the overall health of the community. While a set of novel human evaluation results is presented, they serve as a demonstration of the value of a participatory approach, rather than the empirical focus of the paper.

Methodology
To overcome the challenge of recruiting participants, a number of strategies were employed. Starting from local demand at a machine learning school (Deep Learning Indaba (Engelbrecht, 2018)), meetups and universities, distant connections were made through Twitter, conference workshops, 4 and eventually press coverage 5 and research publications. 6 To overcome the limited tertiary education enrollments in Sub-Saharan Africa (Jowi et al., 2018), no prerequisites were placed on researchers joining the project. For the agents outlined in Section 3, no fixed roles are imposed onto participants. Instead, they join with a specific interest, background, or skill aligning them best to one or more of the agents. To obtain cross-disciplinarity, we focus on the communication and interaction between participants to enable knowledge transfer between missing connections (identified in Section 3.1), allowing a fluidity of agent roles. For example, someone who initially joined with the interest of using machine translation for their local language (as a stakeholder) to translate education material might turn into a junior language technologist when equipped with tools, introductory material and mentoring, and guide content creation more specifically for resources needed for MT.
4 ICLR AfricaNLP 2020: https://africanlpworkshop.github.io/
5 https://venturebeat.com/2019/11/27/the-masakhane-project-wants-machine-translation-and-ai-to-transform-africa/
6 https://github.com/masakhane-io/masakhane-community/blob/master/publications.md
To bridge large geographical divides, the community lives online. Communication occurs on GitHub and Slack with weekly video conference meetings and reading groups. Meeting notes are shared openly so that continuous participation is not required and time commitment can be organized individually. Sub-interest groups have emerged in Slack channels to allow focused discussions. Agendas for meetings and reading groups are public and democratically voted upon. In this way, the research questions evolve based on stakeholder demands, rather than being imposed by external forces.
The lack of compute resources and prior exposure to NLP is overcome by providing tutorials for training a custom-size Transformer model with JoeyNMT (Kreutzer et al., 2019) on Google Colab. International researchers were not prohibited from joining. As a result, mutual mentorship relations emerged, whereby international researchers with more language technology experience guided research efforts and enabled data curators or translators to become language technologists. In return, African researchers introduced the international language technologists to African stakeholders, languages and context.

Research Outcomes
Participants. A growth to over 400 participants of diverse disciplines, from at least 20 countries, has been achieved within the past year, suggesting the participant recruitment process was effective. Appendix A contains detailed demographics of a subset of participants from a voluntary survey in February 2020. 86.5% of participants responded positively when asked if the community helped them find mentors or collaborators, indicating that the health of the community is positive. This is also reflected in joint research publications of new groups of collaborators.
Research Artifacts. As a result of mentorship and knowledge exchange between agents of the translation process, our implementation of participatory research has produced artifacts for NLP research, namely datasets, benchmarks and models, which are publicly available online. Additionally, over 10 participants have gone on to publish works addressing language-specific challenges at conference workshops (e.g., Dossou and Emezue, 2020; Orife, 2020; Orife et al., 2020; Öktem et al., 2020; Van Biljon et al., 2020; Martinus et al., 2020; Marivate et al., 2020).
Dataset Creation. The dataset creation process is ongoing, with new initiatives still emerging. We showcase a few initiatives below to demonstrate how bridging connections between agents facilitates the MT process.

1. A team of Nigerian participants, driven by the internal demand to ensure that accessible and representative data of their culture is used to train models, are translating their own writings, including personal religious stories and undergraduate theses, into Yoruba and Igbo.
2. A Namibian participant, driven by a passion to preserve the culture of the Damara, is hosting collaborative sessions with Damara speakers, to collect and translate phrases that reflect Damara culture around traditional clothing, songs, and prayers. 10
3. Creating a connection between a translator in South Africa's parliament and a language technologist has enabled the process of data curation, allowing access to data from the parliament in South Africa's languages (which are public but obfuscated behind internal tools). 11
These stories demonstrate the value of including curators, content creators, and translators as participants.
Benchmarks. We publish 45 benchmarks for neural translation models from English into 32 distinct African languages, and from French into two additional languages, as well as from English into three different languages. 12 Most were trained on the JW300 corpus (Agić and Vulić, 2019 (Tiedemann, 2012), and data translated or curated by participants were added. Language pairs were selected based on the individual demands of each of the 32 participants, who voluntarily contributed the benchmarks they valued most. 16 of the selected target languages are categorized as "Left-behind" and 11 are categorized as "Scraping by" in the taxonomy of Joshi et al. (2020). The benchmarks are hosted publicly, including model weights, configurations and preprocessing pipelines for full reproducibility. The benchmarks are submitted by individuals or groups of participants in the form of a GitHub Pull Request. This ensures that benchmark contributors can be contacted and that they experience ownership of their contributions.
10 https://github.com/masakhane-io/masakhane-khoekhoegowab
11 http://bit.ly/raw-parliamentarytranslations
12 Benchmark scores can be found in Appendix C.
13 https://tatoeba.org/

Human MT Evaluation
To our knowledge, there is no prior research on human evaluation specifically for machine translations of low-resourced languages. Until now, NLP practitioners were left with the hope that successful evaluation methodologies for high-resource languages would transfer well to low-resourced languages. This lack of study is due to the missing connections between the community of speakers (content creators and translators), and the language technologists. MT evaluations by humans are often done either within a group of researchers from the same lab or field (e.g. for WMT evaluations), or via crowdsourcing platforms (Post et al., 2012). Speakers of low-resource languages are traditionally underrepresented in these groups, which makes such studies even harder (Joshi et al., 2019; Guzmán et al., 2019). One might argue that human evaluation should not be attempted before reaching a viable state of quality, but we found that early evaluation results in an improved understanding of the individual challenges of the target languages, strengthens the network of the community, and most importantly, improves the connection and knowledge transfer between language technologists, content creators and curators.
The "low-resourced"-ness of the addressed languages poses challenges for evaluation beyond interface design or recruitment of evaluators proficient in the target language. For the example of Igbo, evaluators had to find solutions for typing diacritics without a suitable keyboard. In addition, Igbo has many dialects and variations which the MT model is uninformed of. Medical or technical terminology (e.g., "data") is difficult to translate and whether to use loan words required discussion. Target language news websites were found to be useful for resolving standardization or terminology questions. Solutions for each language were shared and often also applicable for other languages.
Data. The models are trained on JW300 data. To gain real-world quality estimates beyond religious context, we assess the models' out-of-domain generalization by translating an English COVID-19 survey with 39 questions and statements regarding COVID-19, where the human-corrected and approved translations can directly serve the purpose of gathering responses. The domain is challenging as it contains medical terms and new vocabulary. Furthermore, we evaluate a subset of the Multitarget TED test data (Duh, 2018). The obtained translations enrich the TED datasets, adding new languages for which no prior translations exist. The size of the TED evaluations varies from 30 to 120 sentences. Details are given in Table 3, Appendix B.
Evaluators. 11 participants of the community volunteered to evaluate translations in their language(s), often involving family or friends to determine the most correct translations. The evaluator role is therefore taken by both stakeholders and language technologists. Within only 10 days, we gathered a total of 707 evaluated translations covering Igbo (ig), Nigerian Pidgin (pcm), Shona (sn), Luo (luo), Hausa (ha, twice by two different annotators), Kiswahili (sw), Yoruba (yo), Fon (fon) and Dendi (ddn). We did not impose prescriptions in terms of number of sentences to evaluate, or time to spend, since this was voluntary work, and guidelines or estimates for the evaluation of translations into these languages are non-existent.
Evaluation Technique. Instead of a direct assessment (Graham et al., 2013) often used in benchmark MT evaluations (Barrault et al., 2019; Guzmán et al., 2019), we opt for post-editing. Post-edits are grounded in actions that can be analyzed in terms of e.g. error types for further investigations, while direct assessments require expensive calibration (Bentivogli et al., 2018). Embedded in the community, these post-edit evaluations create an asset for the interaction of various agents: for the language technologists for domain adaptation, or for the content creators, curators, or translators for guidance in standardization or domain choice.
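To illustrate what "grounded in actions" can mean in practice, the sketch below counts word-level edit actions between a model output and its post-edit using Python's standard `difflib`. It is a hedged illustration, not the analysis used in the study: the helper name `post_edit_actions` and the English example sentences are invented here, and `difflib` alignments are not guaranteed to be minimal edit scripts.

```python
from collections import Counter
from difflib import SequenceMatcher


def post_edit_actions(hypothesis, post_edit):
    """Count word-level actions that turn the MT output into its post-edit."""
    hyp, ref = hypothesis.split(), post_edit.split()
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    counts = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # untouched words are not edit actions
        # Size of an action: number of words on the longer side of the span.
        counts[tag] += max(i2 - i1, j2 - j1)
    return counts


# Invented example pair (hypothetical model output and its correction).
hyp = "the survey ask about you health"
pe = "the survey asks about your health today"
actions = post_edit_actions(hyp, pe)
print(dict(actions))
```

Aggregating such counts over a test set gives a rough profile of how much of the post-editing effort went into replacements versus insertions or deletions; mapping actions to linguistic error types would still require a speaker of the language.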
Results. Table 2 reports evaluation results in terms of BLEU evaluated on the benchmark test set from JW300, and human-targeted TER (HTER) (Snover et al., 2006), BLEU (Papineni et al., 2002) and ChrF (Popović, 2015) against human-corrected model translations. For ha we find modest agreement between evaluators: Spearman's ρ = 0.56 for sentence-BLEU measurements of the post-edits compared to the original hypotheses. Generally, we observe that the JW300 score is misleading, overestimating model quality (except yo). Training data size appears to be a more reliable predictor of generalization abilities, illustrating the danger of chasing a single benchmark. However, ig and yo both have comparable amounts of training data, JW300 scores, and carry diacritics, but exhibit very different evaluation performances, in particular on COVID. This can be explained by the large variations of ig as discussed above: training data and model output are not consistent with respect to one dialect, while the evaluator had to decide on one. We also find differences in performance across domains, with the TED domain appearing easier for pcm and ig, while the yo model performs better on COVID.
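For readers unfamiliar with these measures, the following sketch shows the idea behind two of them in plain Python. It rests on stated simplifications: `approx_hter` counts only word insertions, deletions and substitutions (full TER as defined by Snover et al. also allows block shifts), and `spearman` is applied to made-up per-sentence scores rather than the sentence-BLEU values from the study. All function names and example values are hypothetical.

```python
def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]


def approx_hter(hypothesis, post_edit):
    """Edits needed to reach the post-edit, per post-edit word (no shifts)."""
    hyp, ref = hypothesis.split(), post_edit.split()
    return edit_distance(hyp, ref) / max(len(ref), 1)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            result[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return result


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented example: one hypothesis and its post-edit.
score = approx_hter("the survey ask about you health",
                    "the survey asks about your health today")

# Invented per-sentence scores from two hypothetical annotators.
annotator_a = [0.1, 0.4, 0.3, 0.8]
annotator_b = [0.2, 0.3, 0.5, 0.9]
rho = spearman(annotator_a, annotator_b)
print(round(score, 3), round(rho, 3))
```

A lower value of `approx_hter` means fewer corrections were needed, and a `rho` close to 1 means the two annotators rank the sentences similarly. Established toolkits should be preferred for reportable scores; this sketch only makes the definitions concrete.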

Conclusion
We proposed a participatory approach as a solution to sustainably scaling NLP research to low-resourced languages. Having identified key agents and interactions in the MT development process, we implement a participatory approach to build a community for African MT. In the process, we discovered successful strategies for distributed growth and communication, knowledge sharing and model building. In addition to publishing benchmarks and datasets for previously understudied languages, we show how the participatory design of the community enables us to conduct a human evaluation study of model outputs, which has been one of the limitations of previous approaches to low-resourced NLP. The sheer volume and diversity of participants, languages and outcomes, and the fact that for many of the languages featured this paper constitutes the first time that human evaluation of an MT system has been performed, are evidence of the value of participatory approaches for low-resourced MT. For future work, we will (1) continue to iterate, analyze and widen our benchmarks and evaluations, (2) build richer and more meaningful datasets that reflect priorities of the stakeholders, (3) expand the focus of the existing community for African languages to other NLP tasks, and (4) help implement similar communities for other geographic regions with low-resourced languages.

Appendix A
Figure 2 shows the demographics for a subset of participants from a voluntary survey conducted in February 2020. Between then and now (May 2020), the community has grown by 30%, so these figures have to be seen as a snapshot. Nevertheless, we can see that the educational background and the occupation are fairly diverse, with a majority of undergraduate students (not necessarily Computer Science).

Appendix B
Table 3 reports the number of sentences that were post-edited in the human evaluation study reported in Section 4.

Appendix C
data comes tokenized with Polyglot.
The table also features the target categories according to Joshi et al. (2020) as of 28 May 2020.