Eval4NLP 2021: A Shared Task on Explainable Quality Estimation

Event Notification Type: Call for Participation
Location: in conjunction with EMNLP 2021, Wednesday, 10 November 2021
Contact Email: eval4nlp [at] gmail.com
Contact: Eval4NLP organizers
Submission Deadline: Wednesday, 1 September 2021

--------------------------------------------
First Call for Participation
--------------------------------------------
The 2nd Workshop on "Evaluation & Comparison of NLP Systems" (co-located with EMNLP 2021) is organizing a shared task on Explainable Quality Estimation. The call for participation is given below. For more details, please visit our shared task website: https://eval4nlp.github.io/sharedtask.html

--------------------------------------------
Important Dates
--------------------------------------------
All deadlines are 11:59 pm, UTC-12 (“Anywhere on Earth”).

  • Training and development data release: May 24, 2021
  • Test data release: August 20, 2021
  • Submission deadline: September 1, 2021
  • System paper submission deadline: September 17, 2021
  • Workshop day: November 10, 2021

--------------------------------------------
Overview
--------------------------------------------
Recent Natural Language Processing (NLP) systems based on pre-trained representations from Transformer language models, such as BERT and XLM-RoBERTa, have achieved outstanding results in a variety of tasks. This boost in performance, however, comes at the cost of interpretability, which can undermine users’ trust in these new technologies.

In this shared task, we focus on evaluating machine translation (MT) as an example of this problem. Specifically, we look at the task of quality estimation (QE), also known as reference-free evaluation, which aims to predict the quality of MT output at inference time without access to reference translations. Although several QE systems have been proposed, few of them focus on explaining their results (i.e., the predicted quality scores) to users.

Therefore, this shared task consists of building a quality estimation system that (i) predicts a quality score for an input pair of source text and MT hypothesis and (ii) provides word-level evidence for its prediction as an explanation. In other words, the explanation should highlight the specific errors in the MT output that lead to the predicted quality score. As we would like to foster progress on the plausibility of explanations, we evaluate how similar the generated explanations are to human explanations, using a test set with manually annotated rationales.

--------------------------------------------
Task Description
--------------------------------------------
The repository linked below contains datasets, evaluation scripts, and instructions on how to produce baseline results.
https://github.com/eval4nlp/SharedTask2021

Data. The training and development data are the Estonian-English (Et-En) and Romanian-English (Ro-En) partitions of the MLQE-PE dataset (Fomicheva et al., 2020). Sentence-level QE systems can be trained using the sentence-level quality scores. Word-level labels derived from post-editing can be used for development purposes. However, we discourage participants from using the word-level data for training, as the goal of the shared task is to explore word-level quality estimation in an unsupervised setting, i.e., as a rationale extraction task.
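For orientation only, the sketch below shows one way the sentence-level training data could be loaded with pandas. The file name and column names are hypothetical placeholders; the actual file layout is documented in the MLQE-PE repository and in the shared task repository linked above.

    # Sketch: loading sentence-level training data (hypothetical file layout).
    import pandas as pd

    train = pd.read_csv("train.et-en.tsv", sep="\t")   # hypothetical path
    sources = train["source"].tolist()                 # hypothetical column name
    hypotheses = train["translation"].tolist()         # hypothetical column name
    scores = train["score"].tolist()                   # hypothetical column name
    print(len(sources), "training pairs loaded")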

As test data, we will collect sentence-level quality scores and word-level error annotations for these two language pairs. We will also provide a zero-shot test set for the German-Chinese (De-Zh) and German-Russian (De-Ru) language pairs, for which no sentence-level or word-level annotations will be available at training time. Human annotators will be asked to mark translation errors in the MT output, as well as the corresponding words in the source sentence, as an explanation for the overall sentence scores.

Evaluation. The aim of the evaluation is to assess the quality of the explanations, not of the sentence-level predictions. Therefore, the main evaluation metrics will be AUC and AUPRC scores for the word-level explanations (a minimal sketch of how these metrics can be computed follows the list below).

  • Since the explanations are required to correspond to translation errors, these statistics will be computed for the subset of translations that contain errors according to human annotation.
  • We also ask participants to provide the sentence-level predictions of their models; Pearson correlation with human judgments will be used to measure the overall performance of the system.
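The following is a minimal sketch of how these metrics could be computed with scikit-learn and SciPy. It is not the official evaluation script (which is provided in the repository above); the variable names and the toy values are assumptions for illustration.

    # Minimal sketch of the evaluation metrics (not the official script).
    # Assumes gold word-level labels are 1 for error tokens and 0 otherwise,
    # and that explanation scores are higher for likely error tokens.
    from scipy.stats import pearsonr
    from sklearn.metrics import roc_auc_score, average_precision_score

    def word_level_scores(gold_labels, explanation_scores):
        """AUC and AUPRC for one MT sentence that contains at least one error."""
        auc = roc_auc_score(gold_labels, explanation_scores)
        auprc = average_precision_score(gold_labels, explanation_scores)
        return auc, auprc

    def sentence_level_score(gold_scores, predicted_scores):
        """Pearson correlation between predicted and human sentence scores."""
        return pearsonr(gold_scores, predicted_scores)[0]

    # Toy example (purely illustrative values):
    auc, auprc = word_level_scores([0, 1, 0, 1], [0.1, 0.8, 0.3, 0.6])
    r = sentence_level_score([0.2, 0.9, 0.5], [0.3, 0.7, 0.4])
    print(auc, auprc, r)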

Baseline. We provide links to the TransQuest sentence-level QE models (Ranasinghe et al., 2020), which were among the top-performing submissions at the WMT 2020 QE shared task. Participants can use these models and explore post-hoc approaches to rationale extraction. Participants are also free to train their own QE models and to explore architectures that allow word-level interpretation of model predictions. As a baseline, we will use TransQuest as the QE model and LIME (Ribeiro et al., 2016), a model-agnostic explanation method, for rationale extraction.
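To illustrate the general idea of post-hoc rationale extraction (this is not the official baseline code, which is available in the repository above), the sketch below applies LIME's LimeTextExplainer to a generic sentence-level QE predictor. The function qe_predict is a hypothetical placeholder for a model such as TransQuest; since LimeTextExplainer expects classifier-style probabilities, the regression score is wrapped into two pseudo-classes.

    # Sketch: post-hoc rationale extraction with LIME on top of a QE model.
    # `qe_predict` is a hypothetical placeholder for any sentence-level QE model
    # (e.g. TransQuest); it should map MT hypotheses to quality scores in [0, 1].
    import numpy as np
    from lime.lime_text import LimeTextExplainer

    def qe_predict(mt_sentences, source):
        # Placeholder: replace with a real model scoring (source, hypothesis) pairs.
        return np.array([0.5 for _ in mt_sentences])

    def make_classifier_fn(source):
        # LimeTextExplainer expects a function returning class "probabilities",
        # so we expose the quality score as P(good) and 1 - score as P(bad).
        def classifier_fn(mt_sentences):
            scores = qe_predict(mt_sentences, source)
            return np.stack([1.0 - scores, scores], axis=1)
        return classifier_fn

    # bow=False keeps word positions, which matters for word-level explanations.
    explainer = LimeTextExplainer(class_names=["bad", "good"], bow=False)

    source = "..."      # source sentence
    hypothesis = "..."  # MT output to be explained
    explanation = explainer.explain_instance(
        hypothesis,
        make_classifier_fn(source),
        num_features=20,  # number of highlighted tokens
        num_samples=500,  # LIME perturbation samples
    )
    # Word-level attributions: negative weights point towards the "bad" class.
    print(explanation.as_list())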

--------------------------------------------
Submission
--------------------------------------------

  • We will use CodaLab as a platform for participants to submit their predictions for the test dataset. The competition and the submission link will be available soon.
  • For submission tracks and formats, please visit our shared task website.
  • Participating teams will be invited to submit papers describing their systems for inclusion in the workshop proceedings.

--------------------------------------------
Awards for Best Submissions
--------------------------------------------
The authors of the best submissions will receive monetary awards, kindly sponsored by the Artificial Intelligence Journal and Salesforce Research. More details will be available soon.

--------------------------------------------
Organizers
--------------------------------------------
Yang Gao, Royal Holloway University of London, UK
Steffen Eger, Technische Universität Darmstadt, Germany
Wei Zhao, Technische Universität Darmstadt, Germany
Piyawat Lertvittayakumjorn, Imperial College London, UK
Marina Fomicheva, University of Sheffield, UK

--------------------------------------------
Contact Information
--------------------------------------------
Email: eval4nlp [at] gmail.com
Website: https://eval4nlp.github.io/