[LREC2026 Tutorial] Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies

April 18, 2026 | BY shurokabe

Event Notification Type:

Other

Location:

LREC2026

Saturday, 16 May 2026

Country:

Spain

City:

Palma

Dear colleagues,

Do you care about improving language technologies beyond mainstream languages? Do you wonder how to collect data for low-resource languages, how to create a first translation system, and how to adapt efficiently to various downstream tasks from there?

We are pleased to announce an upcoming LREC2026 tutorial:
'Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies.'

This tutorial is aimed at NLP practitioners, researchers, and developers working with multilingual and low-resource languages who are interested in building more equitable, inclusive, and socially impactful language technologies.

Tutorial overview
The tutorial covers the full lifecycle of NLP technology development for a language, including:

Data collection and corpus creation (e.g., web crawling and annotation)
Parallel sentence mining and machine translation
Downstream applications such as text classification and multimodal reasoning
Strategies for addressing data scarcity, cultural variance, and reproducibility
Fair and community-informed development practices

Who should attend

Researchers and practitioners in NLP and multilingual technologies
Corpus builders and linguists working on under-represented languages
Developers interested in low-resource or inclusive NLP
Students and early-career researchers

Scope and highlights

Case studies spanning 10+ languages from diverse language families and geopolitical contexts
Coverage of both digitally resource-rich and severely under-represented languages
Emphasis on hands-on methods and applied modeling frameworks

Save the date and place:
Saturday, 16 May 2026, morning session, Room 6

More information:
https://tum-nlp.github.io/low-resource-tutorial/

Stay tuned for our website – we will fully open-source the tutorial materials!

Additionally, we would like to gather an overview of the practices and challenges researchers face when collecting and annotating datasets for under-represented languages.
If you are such a researcher, are working on a particularly unusual language, or simply have experience to share on the topic, please fill in this form to take part in an interview: https://forms.gle/L81hpvZGfemyMjtX7

Organisers:

Ekaterina (Katya) Artemova, Toloka.ai
Laurie Burchell, Common Crawl Foundation
Daryna Dementieva, Technical University of Munich
Shu Okabe, Technical University of Munich
Mariya Shmatova, Toloka.ai
Pedro Ortiz Suarez, Common Crawl Foundation

See you at LREC!

Best regards,
The Tutorial Organisers
