VarDial Evaluation Campaign 2022

Event Notification Type: 
Call for Participation
Abbreviated Title: 
VarDial 2022
Location: 
COLING 2022
Wednesday, 12 October 2022 to Monday, 17 October 2022
Country: 
Republic of Korea
City: 
Gyeongju
Contact: 
Yves Scherrer
Submission Deadline: 
Wednesday, 6 July 2022

Within the scope of the ninth VarDial workshop, co-located with COLING 2022, we are organizing an evaluation campaign on similar languages, varieties and dialects with three shared tasks. To participate and receive the training data, please fill in the registration form available on the workshop website:
https://sites.google.com/view/vardial-2022/shared-tasks

The tasks we are organizing this year are the following (please check the website for more information):

1. French Cross-Domain Dialect Identification (FDI)

In the 2022 French Dialect Identification (FDI) shared task, participants have to train a model on news samples collected from one set of publication sources and evaluate it on news samples collected from a different set of publication sources. Not only the sources but also the topics differ. Participants therefore have to build a model for cross-domain 4-way dialect classification, in which the model is required to discriminate between the French (FH), Swiss (CH), Belgian (BE) and Canadian (CA) dialects across different news samples. The corpus is divided into training, development and test sets, such that the publication sources and topics are distinct across splits. The training set contains 358,787 samples. The development set is composed of 18,002 samples. Another set of 36,733 samples is kept for the final evaluation.

2. Identification of Languages and Dialects of Italy (ITDI)

We provide participants with Wikipedia dumps (“pages-articles-multistream.xml.bz2”, from 1 March 2022) of 11 languages and dialects of Italy for training (Piedmontese, Venetian, Sicilian, Neapolitan, Emilian-Romagnol, Tarantino, Sardinian, Ligurian, Friulian, Ladin, Lombard). The raw Wikipedia dump of Standard Italian may also be used as training data, but there will not be any instances of Standard Italian in the development and test sets. Please use the provided script to download (and extract, if you wish) the dumps, to make sure you work with the correct kind and date of dump.
The task is classification, i.e. the model is required to discriminate between different language varieties. As the training data is provided in the form of raw Wikipedia dumps, careful pre-processing of the data is part of the task. The task is closed; participants are therefore not allowed to use external data to train their models. The only exceptions are off-the-shelf pre-trained language models from the HuggingFace model hub or similar sources, the use of which has to be clearly stated. The test set will contain newly collected text samples from a subset of the language varieties provided for training. The systems will be evaluated at the sentence level.
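Since pre-processing the raw dumps is part of the task, the following is a minimal, unofficial sketch of one possible first step: streaming article text out of a compressed dump with the Python standard library only. It is not the script provided by the organizers. The function names (local_name, iter_articles, iter_dump) are hypothetical, and the choice to keep only main-namespace pages and to skip redirects is an assumption rather than a requirement of the task.

```python
import bz2
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream.

    Assumption: only main-namespace pages (ns == "0") that are not
    redirects are of interest as training material.
    """
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {}
        is_redirect = False
        for child in elem.iter():
            name = local_name(child.tag)
            if name in ("title", "ns", "text"):
                fields[name] = child.text or ""
            elif name == "redirect":
                is_redirect = True
        elem.clear()  # drop the page's children to limit memory use
        if fields.get("ns") == "0" and not is_redirect and fields.get("text"):
            yield fields.get("title", ""), fields["text"]


def iter_dump(path):
    """Open a .xml.bz2 dump (multistream files are handled by bz2) and stream it."""
    with bz2.open(path, "rb") as stream:
        yield from iter_articles(stream)
```

The yielded text is still raw wikitext. Removing templates, links and other markup, and splitting the result into sentences to match the sentence-level evaluation, are separate steps that this sketch leaves out.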

3. Dialectal Extractive Question Answering (DialQA)

The Dialectal Extractive Question Answering Shared Task invites participants to build QA systems that are robust to dialectal variation. The task builds on existing QA benchmarks (TyDi-QA and SD-QA): specifically, it uses portions of the SD-QA dataset, which recorded dialectal variations of TyDi-QA questions. Participants may (a) use the baseline automatic speech recognition (ASR) outputs for each dialect with the aim of building a robust text-based QA system, (b) use the provided audio recordings of the questions with the aim of building a dialect-robust ASR system, which can then be evaluated with a baseline QA system, or (c) do both. The shared task provides development and test data for 5 varieties of English (Nigeria, USA, South India, Australia, Philippines), 4 varieties of Arabic (Algeria, Egypt, Jordan, Tunisia), and 2 varieties of Kiswahili (Kenya, Tanzania), as well as code for training baseline systems with modified TyDi-QA data.

The test sets will be released on June 30 and submissions are due on July 6. The system description papers will be due on July 29.

Of course, VarDial also accepts research papers focusing on computational methods and language resources for closely related languages, language varieties, and dialects. The full call for papers can be found here:
https://sites.google.com/view/vardial-2022/call-for-papers