WMT 2016 Shared Task on Cross-lingual Pronoun Prediction

Event Notification Type: Call for Participation
Location: Co-located with ACL 2016, Friday, 12 August 2016
City: Berlin
Country: Germany
Submission Deadline: Sunday, 8 May 2016

CALL FOR PARTICIPATION

========================================================
WMT 2016 Shared Task on Cross-lingual Pronoun Prediction
========================================================

Website: http://www.statmt.org/wmt16/pronoun-task.html
At WMT 2016 (co-located with ACL 2016)

We are pleased to announce an exciting cross-lingual pronoun prediction task for people interested in (discourse-aware) machine translation, anaphora resolution and machine learning in general.

In the cross-lingual pronoun prediction task, participants are asked to predict a target-language pronoun given a source-language pronoun in the context of a sentence. For example, in the English-to-French sub-task, the goal is to predict the correct translation of "it" or "they" into French (ce, elle, elles, il, ils, ça, cela, on, OTHER). You may use any type of information that can be extracted from the documents. We provide training and development data and a simple baseline system using an N-gram language model.

Participants are invited to submit systems for the English-French and English-German language pairs, for both directions.

More details can be found below, and on our website: http://www.statmt.org/wmt16/pronoun-task.html

Important Dates:

2nd February 2016, Release of training data
4th April 2016, Release of test data
11th April 2016, System submission
8th May 2016, Paper submission deadline
5th June 2016, Notification of acceptance
22nd June 2016, Camera-ready deadline

Mailing list: https://groups.google.com/forum/#!forum/wmt-2016-cross-lingual-pronoun-p...

-------------------------------------------------------------------------
Acknowledgements:
The organisation of this task has received support from the following project: Discourse-Oriented Statistical Machine Translation funded by the Swedish Research Council (2012-916)
-------------------------------------------------------------------------

=========================
Detailed Task Description
=========================

OVERVIEW

Pronoun translation poses a problem for current state-of-the-art SMT systems as pronoun systems do not map well across languages, e.g., due to differences in gender, number, case, formality, or humanness, and to differences in where pronouns may be used. Translation divergences typically lead to mistakes in SMT, as when translating the English "it" into French ("il", "elle", or "cela"?) or into German ("er", "sie", or "es"?). One way to model pronoun translation is to treat it as a cross-lingual pronoun prediction task.

We propose such a task, which asks participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provide a lemmatised target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. In the translation, the words aligned to a subset of the source-language third-person pronouns are substituted by placeholders. The aim of the task is to predict, for each placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the documents.

The cross-lingual pronoun prediction task will be similar to the task of the same name at DiscoMT 2015:

http://www.idiap.ch/workshop/DiscoMT/shared-task

Participants are invited to submit systems for the English-French and English-German language pairs, for both directions.

TASK DESCRIPTION

In the cross-lingual pronoun prediction task, you are given a source-language document with a lemmatised and POS-tagged human-authored translation and a set of word alignments between the two languages. In the translation, the lemmatised tokens aligned to a subset of the source-language third-person pronouns are substituted by placeholders. Your task is to predict, for each placeholder, the fully inflected word token that should replace the placeholder from a small, closed set of classes. For example, in the English-to-German and English-to-French sub-tasks, you provide the fully inflected German or French translation of the English pronoun in the context sketched by the lemmatised and tagged target side. You may use any type of information that you can extract from the documents.

Lemmatised and POS-tagged target-language data is provided in place of fully inflected text. The provision of lemmatised data is intended both to provide a challenging task and to simulate a scenario that is more closely aligned with working with machine translation system output. POS tags provide additional information which may be useful in the disambiguation of lemmas (e.g. noun vs. verb) and in the detection of patterns of pronoun use.

The pronoun prediction task will be run for the following sub-tasks:
* English-to-German
* German-to-English
* English-to-French
* French-to-English

Details of the source-language pronouns and the prediction classes for each of the above sub-tasks are provided in the section below. The different combinations of source-language pronoun and target-language prediction classes represent some of the different problems that SMT systems face when translating pronouns for a given language pair and translation direction.

The task will be evaluated automatically by matching the predictions against the words found in the reference translation. We will compute the overall accuracy, as well as precision, recall and F-score for each class. The primary score for the evaluation is the macro-averaged F-score over all classes. Compared to accuracy, the macro-averaged F-score favours systems that consistently perform well on all classes and penalises systems that maximise the performance on frequent classes while sacrificing infrequent ones.
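To illustrate why the macro-averaged F-score is stricter than accuracy, here is a minimal sketch in Python. It is not the official scorer: the function names and the toy data are invented for illustration, and it assumes that the average is taken over every class that occurs in either the reference or the predictions, which may differ from how the official evaluation defines the class set.

```python
from collections import Counter

def per_class_prf(gold, pred):
    """Return {class: (precision, recall, f1)} for classes seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1)
    return scores

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F-scores."""
    scores = per_class_prf(gold, pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy data (invented): one frequent class and one infrequent class.
gold = ["il"] * 8 + ["elle"] * 2
always_il = ["il"] * 10  # a system that only ever predicts the frequent class

print(accuracy(gold, always_il))            # 0.8
print(round(macro_f1(gold, always_il), 3))  # 0.444
```

The system that ignores the infrequent class still reaches 80% accuracy, but its macro-averaged F-score is roughly halved because the "elle" class contributes an F-score of zero.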

The data supplied for the classification task consists of parallel source-target text with word alignments. In the target-language text, a subset of the words aligned to source-language occurrences of a specified set of pronouns have been replaced by placeholders of the form REPLACE_xx, where xx is the index of the source-language word the placeholder is aligned to. Your task is to predict one of the classes listed in the relevant source-target section below, for each occurrence of a placeholder.
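As a rough illustration of the placeholder convention, the following Python sketch locates REPLACE_xx tokens in a tokenised target sentence and looks up the aligned source word. It rests on several assumptions that are not stated in this announcement: that the target side is available as a list of whitespace-separated tokens, that POS tags have been stripped, and that xx is a zero-based index into the source tokens. The actual file format is documented on the website and may differ; the function name and the example sentence are hypothetical.

```python
import re

# Assumption: a placeholder is a whole token of the form REPLACE_<digits>.
PLACEHOLDER = re.compile(r"^REPLACE_(\d+)$")

def find_placeholders(target_tokens):
    """Yield (target_position, source_index) for each REPLACE_xx token."""
    for position, token in enumerate(target_tokens):
        match = PLACEHOLDER.match(token)
        if match:
            yield position, int(match.group(1))

# Invented example: English source, lemmatised French target.
source = "It is raining .".split()
target = "REPLACE_0 pleuvoir .".split()

for position, source_index in find_placeholders(target):
    # Assumption: source_index is zero-based.
    print(position, source_index, source[source_index])  # 0 0 It
```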

The training, development and test datasets have been filtered to remove non-subject position pronouns. Additional filtering has also been applied to the test set to remove erroneous pronoun examples and thereby ensure the fair and accurate evaluation of system performance. For more information on the format of the data files and their filtering, please see the website.

The complete test data for the classification task, including reference translations and word alignments, will be released on 4th April 2016. Your submission is due on 11th April 2016.

SOURCE-LANGUAGE PRONOUN SETS AND TARGET-LANGUAGE PREDICTION CLASS DETAILS

The following sections describe the set of source-language pronouns and target-language classes to be predicted, for each of the four sub-tasks. Please note that the sub-tasks are asymmetric in terms of the source-language pronouns and prediction classes. The selection of the source-language pronouns and their target-language prediction classes for each sub-task is based on the variation that is possible when translating a given source-language pronoun.

For example, when translating the English pronoun "it" into French, a decision must be made as to the gender of the French pronoun, with "il" and "elle" both providing valid options. The translation of the English pronouns "he" and "she" into French, however, does not require such a decision. These may simply be mapped 1-to-1, as "il" and "elle" respectively. The translation of "he" and "she" from English into French is therefore not considered an "interesting" problem and as such, these pronouns are excluded from the source-language set for the English->French sub-task.

In the opposite translation, the French pronoun "il" may be translated as "it" or "he", and "elle" as "it" or "she". As a decision must be taken as to the appropriate target-language translation of "il" and "elle", these are included in the set of source-language pronouns for the French->English sub-task.

You should *always* predict either a word token or "OTHER". See the prediction class lists below for the word tokens to predict for each sub-task.

English-to-French

This sub-task will concentrate on the translation of subject position "it" and "they" from English into French. The following prediction classes exist for this sub-task:

* ce: The French pronoun ce (sometimes with elided vowel as c') as in the expression c'est "it is"
* elle: Feminine singular subject pronoun
* elles: Feminine plural subject pronoun
* il: Masculine singular subject pronoun
* ils: Masculine plural subject pronoun
* cela: Demonstrative pronouns. Includes "cela", "ça", the misspelling "ca", and the rare elided form "ç'"
* on: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted
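The grouping of surface forms into classes in the list above can be sketched as a small lookup in Python. This is only an illustration, not part of the official tools: the names are hypothetical, and the lower-casing and apostrophe normalisation are assumptions about how one might prepare the tokens.

```python
# Prediction classes for English-to-French, as listed above.
FR_CLASSES = {"ce", "elle", "elles", "il", "ils", "cela", "on", "OTHER"}

# Surface variants that the list above folds into a class.
FR_VARIANTS = {"c'": "ce", "ça": "cela", "ca": "cela", "ç'": "cela"}

def french_class(token):
    """Map a French surface token (or None for 'nothing at all') to a class."""
    if token is None:
        return "OTHER"
    # Assumptions: case is ignored and typographic apostrophes are normalised.
    word = token.strip().lower().replace("\u2019", "'")
    word = FR_VARIANTS.get(word, word)
    return word if word in FR_CLASSES else "OTHER"

for token in ["c'", "Ça", "ca", "elles", "le", None]:
    print(repr(token), "->", french_class(token))
```

Any token outside the closed set, including an empty translation, falls back to OTHER, so the function always returns one of the eight classes.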

French-to-English

This sub-task will concentrate on the translation of subject position "elle", "elles", "il", and "ils" from French into English. The following prediction classes exist for this sub-task:

* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* this: Demonstrative pronouns (singular). Includes both "this" and "that"
* these: Demonstrative pronouns (plural). Includes both "these" and "those"
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted

English-to-German

This sub-task will concentrate on the translation of subject position "it" and "they" from English into German. The following prediction classes exist for this sub-task:

* er: Masculine singular subject pronoun
* sie: Feminine singular subject pronoun
* es: Neuter singular subject pronoun
* man: Indefinite pronoun
* OTHER: Some other word, or nothing at all, should be inserted

German-to-English

This sub-task will concentrate on the translation of subject position "er", "sie" and "es" from German into English. The following prediction classes exist for this sub-task:

* he: Masculine singular subject pronoun
* she: Feminine singular subject pronoun
* it: Non-gendered singular subject pronoun
* they: Non-gendered plural subject pronoun
* you: Second person pronoun (with both generic and deictic uses)
* this: Demonstrative pronouns (singular). Includes both "this" and "that"
* these: Demonstrative pronouns (plural). Includes both "these" and "those"
* there: Existential "there"
* OTHER: Some other word, or nothing at all, should be inserted
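The four class inventories above can be collected into a single table, which makes it easy to check that a system only ever outputs a valid class for its sub-task. The Python sketch below is one possible way to do this; the sub-task keys ("en-fr" and so on) and the function name are hypothetical and are not identifiers used by the task's own tools or data files.

```python
# Source pronouns and prediction classes per sub-task, as listed above.
SUBTASKS = {
    "en-fr": {
        "source": {"it", "they"},
        "classes": {"ce", "elle", "elles", "il", "ils", "cela", "on", "OTHER"},
    },
    "fr-en": {
        "source": {"elle", "elles", "il", "ils"},
        "classes": {"he", "she", "it", "they", "this", "these", "there", "OTHER"},
    },
    "en-de": {
        "source": {"it", "they"},
        "classes": {"er", "sie", "es", "man", "OTHER"},
    },
    "de-en": {
        "source": {"er", "sie", "es"},
        "classes": {"he", "she", "it", "they", "you", "this", "these",
                    "there", "OTHER"},
    },
}

def invalid_predictions(subtask, predictions):
    """Return (position, label) pairs that are not valid classes for the sub-task."""
    classes = SUBTASKS[subtask]["classes"]
    return [(i, label) for i, label in enumerate(predictions)
            if label not in classes]

# "il" is a French class, so it is flagged for English-to-German.
print(invalid_predictions("en-de", ["er", "il", "OTHER"]))  # [(1, 'il')]
```

The table also makes the asymmetry between directions visible: German-to-English has nine classes, while English-to-German has five.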