Endangered Languages meet Modern NLP

This tutorial will focus on NLP for endangered language documentation and revitalization. First, we will acquaint the attendees with the process and the challenges of language documentation, showing how the needs of the language communities and the documentary linguists map to specific NLP tasks. We will then present the state-of-the-art in NLP applied in this particularly challenging setting (extremely low-resource datasets, noisy transcriptions, limited annotations, non-standard orthographies). In doing so, we will also analyze the challenges of working in this domain and expand on both the capabilities and the limitations of current NLP approaches. Our ultimate goal is to motivate more NLP practitioners to work in this very important direction, and to provide them with the tools and an understanding of the limitations and challenges, both of which are needed in order to have an impact.


Description
Computational Linguistics and Natural Language Processing (NLP) have taken immense strides, spearheaded by neural methods and large data collections. The result is ubiquitous language technology and vast amounts of research on new tasks and products. However, the vast majority of the world's languages have been mostly ignored, including the most vulnerable among them: endangered languages.
The lack of communication between the NLP community and the documentary linguistics community is partly to blame (Bird, 2009). Even though field and documentary linguists produce resources and use NLP methods, this is done in isolation, as computational methods are seen as a means towards the final goal, which typically is language description. The extreme pace of language loss and the urgent need for language revitalization, however, require that we make use of existing documentation and go beyond language description: enter 21st century NLP.
Himmelmann's 20-year-old radical vision (Himmelmann, 1998) for a data-centric approach to language documentation (which sparked the creation of modern documentary linguistics) has slowly begun to materialize (McDonnell et al., 2018). For example, the Workshops on the Use of Computational Methods in the Study of Endangered Languages (Comput-EL) (Good et al., 2014; Arppe et al., 2017; Arppe et al., 2019) have provided a small forum for the much-needed discussion between NLP practitioners and documentary and field linguists.
Meanwhile, increasing attention is being devoted to NLP research on endangered languages and to bringing modern technologies to them. For example, mobile applications have been developed for data collection (Bird et al., 2014) and are actively used in documentation projects; automatic speech recognition models have been created to aid with automatic phonetic or orthographic transcription, focusing on indigenous Australian languages or on tonal languages from China and the Americas (Michaud et al., 2018); machine translation systems for under-represented languages have been presented as new corpora have been collected (Abbott and Martinus, 2018; Abate et al., 2018); cross-lingual transfer has been successfully applied to tagging, morphological analysis, and inflection (McCarthy et al., 2019; Anastasopoulos and Neubig, 2019); multitask and active learning are being used for learning from continuous annotations on multiple tasks (Gerstenberger et al., 2017; Anastasopoulos et al., 2018; Chaudhary et al., 2019); approaches dedicated to indigenous polysynthetic languages have been developed (Kann et al., 2018); and computational methods have been used to study or discover typological features from large collections of text (Asgari and Schütze, 2017; Malaviya et al., 2017).
In this tutorial, we will outline the language documentation process and revitalization efforts, while also mapping them to concrete computational tasks. We will then focus on the machine learning approaches tailored to tackle these tasks under this very data-constrained setting. An overview of many of those NLP methods, as applied to language documentation, can be found in co-proposer Anastasopoulos' PhD dissertation (Anastasopoulos, 2019). Other surveys focus on the state of language technologies within specific geographic areas, such as co-proposer Cox's overview of Canadian languages or a survey focusing on indigenous American languages.
The goal of our tutorial will be two-fold. On one hand, we will aim to acquaint the audience with the needs of the documentary linguistics community, and cover the already existing computational research in the field. On the other hand, we will discuss the capabilities and limitations of current computational approaches, so that the participants will know when and how to apply NLP methods, as well as how they could collect data and create corpora that can be used by NLP methods to aid both documentation and computational work. Ideally, by the end of our tutorial, the attendees will be familiar with the current research and the state-of-the-art in NLP for endangered language documentation and revitalization and be aware of the many standing challenges that lie ahead.
First, we will introduce the challenges posed by language documentation and revitalization, such as the transcription bottleneck, and how machine learning methods can fit into the pipeline. Unlike what is considered a typical NLP setting, working with endangered language data has intricate nuances: the lack of a standard orthography, or even the complete absence of a writing system; the extremely limited amount of data; language typologies that differ widely from anything used or tested in prior work; and even the lack of established benchmarks. We will discuss these nuances in depth, along with how they relate to, or can be remedied with, existing NLP research. We will also provide example code for many of these methods, and show how standard NLP pipelines need to be modified in order to account for these nuances. Finally, we will close our tutorial by discussing open problems and challenges.
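As a small illustration of why the lack of a standard orthography matters for a pipeline, consider evaluating an automatic transcription against a reference. The sketch below is not taken from the tutorial materials: the function names and the example strings are hypothetical, and the choice of NFC normalization plus lowercasing is an assumption that may be inappropriate for orthographies in which case or decomposed diacritics are meaningful. It normalizes both strings before computing a character error rate, so that the same accented letter typed in two different Unicode forms is not counted as an error.

```python
import unicodedata


def normalize_orthography(text: str) -> str:
    """Apply Unicode NFC normalization, lowercase, and collapse whitespace.

    Assumption: these choices suit the orthography at hand; they are not universal.
    """
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters (two-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance between normalized strings, divided by reference length."""
    ref, hyp = normalize_orthography(ref), normalize_orthography(hyp)
    if not ref:
        raise ValueError("reference transcription is empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical strings (not real language data): the same accented vowel,
# once precomposed and once as a base letter plus a combining acute accent.
composed = "kan\u00e1"
decomposed = "kana\u0301"

raw_errors = edit_distance(composed, decomposed)             # compared without normalization
normalized_cer = character_error_rate(composed, decomposed)  # compared after normalization
```

Without normalization the two visually identical strings differ by two character edits; after normalization they compare as equal. Which normalization is appropriate is a decision to make together with the language community and the linguists who produced the transcriptions.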
Relevance to linguistics community: This tutorial will bring together two linguistics communities, documentary/field linguists and NLP practitioners. As a result, we hope, the tutorial will build a community of computational researchers who are not only interested in NLP for endangered languages, but also aware of the current approaches and challenges. Elevating endangered languages NLP research is necessary for bringing these under-represented communities into the spotlight, as speakers of such endangered languages frequently lack the skills to build NLP tools themselves.
Tutorial type: The proposed tutorial combines the introductory and cutting-edge tutorial types. The acquaintance of computer scientists with the language documentation process will, by necessity, be at an introductory level. At the same time, though, the tutorial will cover cutting-edge NLP methods and their application to the endangered languages domain.

Tutorial Structure
The tutorial will be structured in order to be informative for both linguists (documentary, computational, or otherwise) and for computer scientists who are interested in performing computational work for language documentation and revitalization.
We aim for a three-hour tutorial that covers a reasonable range of all important aspects of this area. Times for the proposed structure are approximate, and they might be adjusted as we refine the tutorial content.

Recommended Reading List
The reading list is indicative of the multi-disciplinary coverage of this tutorial. We highlight the tutorial presenters in the papers they have co-authored:

Diversity Considerations
This tutorial has been constructed with a focus on encouraging diversity in all aspects:
• The content will aim to encourage diversity in Computational Linguistics and NLP, by promoting and encouraging research on the most under-represented group of languages: extremely low-resource and endangered ones.
• We will use real endangered language data for all examples, ranging from Mesoamerican languages to European dialects to indigenous languages of Asia.
• The instructors' team is fairly diverse both in terms of gender and in terms of seniority (one postdoctoral associate, three assistant professors). The presenters are affiliated with three different institutions in two different countries (the US and Canada).

Prerequisites
• Machine Learning: a basic understanding of modern neural models.
• Programming and other tools: All code examples will be provided in Python, so knowledge of Python and basic command-line tools will be necessary in order to follow along.

Tutorial Presenters
Christopher Cox is an Assistant Professor in the School of Linguistics and Language Studies at Carleton University. His research centres on issues in language documentation, description, and revitalization, with a special focus on the creation and application of corpora representing Indigenous and minority languages. For the past twenty years, he has been involved with community-based language documentation, education, and revitalization efforts, most extensively in partnership with speakers of Plautdietsch, the traditional language of the Dutch-Russian Mennonites, and with Dene communities in western and northern Canada. He has served as an invited instructor in the area of language documentation and revitalization for community-based, national, and international events, delivering workshops for the Canadian Indigenous Languages and Literacy Development Institute (CILLDI), the Institute on Collaborative Language Research (CoLang), and the American Association for Corpus Linguistics (AACL), among others. website: https://carleton.ca/slals/people/cox-christopher/
Hilaria Cruz is a field linguist with a focus on indigenous languages of Mexico, especially the Chatino languages (Otomanguean), which she speaks natively. She is also part of an interdisciplinary community of linguists and computer scientists who are working to create tools for automatic or semi-automatic transcription and analysis of audio and visual information for endangered languages. They have created a time-aligned speech corpus of Chatino data that is transcribed, annotated with parts of speech, and translated, and that is available on an open-source basis. As a native person and field linguist, she has had many opportunities to teach linguistics in diverse settings and with diverse groups of students. She has taught general linguistics at the university level, in community organizations, and within Chatino communities. Her research interests include ASR for endangered languages, Chatino morphology, and promoting reading and writing in indigenous languages.
Graham Neubig is an assistant professor at Carnegie Mellon University specializing in natural language processing and machine learning. One of his major research interests is methods for low-resource language processing, and specifically for aiding the documentation of endangered languages. He has previously given well-attended tutorials at NLP conferences (EMNLP and NAACL) and at the Lisbon and CIFAR Machine Learning Summer Schools. He has won a number of best paper awards at NLP venues (e.g. EMNLP2016, EACL2017, NAACL2019) and given a number of invited talks on the proposal topic of low-resource language processing, including at Google, UMass Amherst, and New York University. website: http://www.phontron.com/

Details
Breadth: We aim to provide the first wide-coverage overview of NLP approaches for endangered languages. Out of the 25 referenced works (which is not an exhaustive list of the works we will cover) only 7 are co-authored by the presenters, and we expect only about 30% of the tutorial to be based on the presenters' prior work. Similarly, 60% of the suggested reading has not been co-authored by the presenters.
Audience Size Estimate: The first three iterations of the Comput-EL workshops (which are very relevant to the theme of our tutorial) have been steadily growing in size, while a regional meet-up organized in Pittsburgh by co-presenters Anastasopoulos and Neubig drew about 40 participants. Similar tutorials on the use of NLP at the International Conference on Language Documentation and Conservation (ICLDC) also draw large crowds. Given this interest, we expect a healthy audience of at least 60 participants. To our knowledge, no similar tutorial has been given before at any NLP conference.
Technical Equipment: Internet access; participants may bring their laptops in order to follow along with the code examples.
Preferred Venues: All venues are appropriate for this tutorial, and the presenters do not anticipate major conflicts.
Open Access: We agree to the publication of our slides and a video recording of the tutorial. We will additionally make all other materials (complete reading list, software, example data) openly available online.