DiscoMT 2015 Shared Task on Pronoun Translation

Event Notification Type: Call for Participation
Location: EMNLP 2015 Workshop on Discourse in Machine Translation
Dates: Thursday, 17 September 2015 to Tuesday, 22 September 2015
Country: Portugal
City: Lisbon

===============================================
DiscoMT 2015 Shared Task on Pronoun Translation
===============================================

Website: https://www.idiap.ch/workshop/DiscoMT/shared-task
In connection with EMNLP 2015 (http://emnlp2015.emnlp.org)

We are happy to announce an exciting new challenge for people
interested in (discourse-aware) machine translation, anaphora
resolution and machine learning in general. The EMNLP 2015 Workshop on
Discourse in Machine Translation features two shared tasks:

Task 1: Pronoun-Focused Machine Translation
Task 2: Cross-Lingual Pronoun Prediction

Task 1 requires machine translation (from English to French) and
focuses on the evaluation of translated pronouns. We provide training
data and a baseline SMT model to get started.

Task 2 is a straightforward classification task in which one has to
predict the correct translation of a given pronoun in English (it or
they) into French (ce, elle, elles, il, ils, ça, cela, on, OTHER). We
provide training and development data and a simple baseline system
using an N-gram language model.

More details of the two tasks are attached below and can be found at
our website: https://www.idiap.ch/workshop/DiscoMT/shared-task

Important Dates:

 4 May 2015     Release of the MT test set (Task 1)
10 May 2015     Submission of translations (Task 1)
11 May 2015     Release of the classification test set (Task 2)
18 May 2015     Submission of classification results (Task 2)
28 May 2015     System paper submission deadline
September 2015  Workshop in Lisbon

Mailing list: https://groups.google.com/d/forum/discomt2015

Downloads: https://www.dropbox.com/sh/c8qnpag5z29jyh6/AAAAqk1TE9-UvcgEnfccdRwxa?dl=0
Download alternative 1: http://opus.lingfil.uu.se/DiscoMT2015/
Download alternative 2: http://stp.lingfil.uu.se/~joerg/DiscoMT2015/

-------------------------------------------------------------------------
Acknowledgements:
Funding for the manual evaluation of the pronoun-focused translation
task is generously provided by the European Association for Machine
Translation (EAMT).
-------------------------------------------------------------------------

==========================
Detailed Task Description:
==========================

* Overview

The DiscoMT 2015 shared task will consist of two subtasks, relevant to
both the MT and discourse communities: pronoun-focused translation, a
practical MT task, and cross-lingual pronoun prediction, a
classification task that requires no specific MT expertise and is
interesting as a machine learning task in its own right. For groups
wishing to participate in both tasks, one possibility is to convert a
system for the classification task into an MT feature model using
existing software such as the Docent decoder (Hardmeier et al., ACL
2013). Both tasks use the English–French language pair, which offers
sufficiently high baseline MT performance to produce basically
intelligible output, and whose two languages show interesting
differences in their pronoun systems.

* Task 1: Pronoun-Focused Translation Task

In the pronoun-focused translation task, you are given a collection of
English input documents, which you are asked to translate into
French. This task is the same as for other MT shared tasks such as
that of WMT. The difference is in the way the translations are
evaluated. Instead of checking the overall translation quality, we
specifically look at how the English subject pronouns it and they were
translated. The principal evaluation will be carried out manually and
will focus specifically on the correctness of pronoun
translation. Thanks to a grant from the EAMT, the manual evaluation
will be run by the organisers and participants don't have to
contribute evaluations. Automatic reference-based metrics are
available for development purposes.

The texts in the test corpus will consist of transcripts of TED
talks. The training data contains an in-domain corpus of TED talks as
well as some additional data from Europarl and news texts. To make the
participating systems as comparable as possible, we ask you to
constrain the training data of your system to the resources listed
below as far as you can, but this is not a strict requirement and we
do accept submissions using additional resources. If your system uses
any resources other than those of the official data release, please
specify exactly what was included in your system description paper.
For the same reason, we also suggest that you use the tokeniser
provided by us unless you have a good reason to do otherwise.

The test set will be supplied in the XML source format of the 2009
NIST MT evaluation, which is described on the last page of this
document. See the development set included in the data release for an
example. Your translation should be submitted in the XML translation
format of the 2009 NIST MT evaluation. We also need you to submit, in
a separate file, word alignments linking occurrences of the pronouns
it and they (case-insensitive) to the corresponding words generated by
your MT system. The format of the word alignments should be the same
as that of the alignments included in the cross-lingual pronoun
prediction data (see below). Word alignments can be obtained, for
instance, by running the Moses SMT decoder with the
-print-alignment-info option or by parsing the segment-level comments
added to the output by the Docent decoder. You may submit alignments
for the complete sentence if that is easier for you, but only the
links for it and they will be used. If your MT system cannot output
word alignments, please contact the shared task organisers to discuss
how to proceed; we will try to find a solution. More details on how to
submit will be added to our website later.
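As an illustration, extracting the required links from full-sentence
alignments is straightforward (a minimal sketch; the function name is
ours, and alignments are assumed to be zero-based (src, tgt) index
pairs, as in the cross-lingual pronoun prediction data):

```python
PRONOUNS = {"it", "they"}

def pronoun_links(source_tokens, alignment):
    """Keep only the alignment links whose English source word is
    'it' or 'they' (case-insensitive).  Full-sentence alignments may
    be submitted as-is; only these links are used in the evaluation."""
    return [(s, t) for s, t in alignment
            if source_tokens[s].lower() in PRONOUNS]
```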

The test set will be released on 4 May 2015, and your translations are
due on 10 May 2015. Note that we will ensure that each document in the
test set contains an adequate number of challenging pronouns, so the
corpus-level distribution of the pronouns in the test set may differ
from that of the training corpus. However, each document will be a
complete TED talk with a naturally occurring ensemble of pronouns.

* Task 2: Cross-Lingual Pronoun Prediction

In the cross-lingual pronoun prediction task, you are given an English
document with a human-generated French translation and a set of word
alignments between the two languages. In the French translation, the
words aligned to the English third-person subject pronouns it and they
are substituted by placeholders. Your task is to predict, for each
placeholder, the word that should go there from a small, closed set of
classes, using any information you can extract from the documents. The
following classes exist:

ce      the French pronoun ce (sometimes with elided vowel as c'), as
        in the expression c'est 'it is'
elle    feminine singular subject pronoun
elles   feminine plural subject pronoun
il      masculine singular subject pronoun
ils     masculine plural subject pronoun
ça      demonstrative pronoun (including the misspelling ca and the
        rare elided form ç')
cela    demonstrative pronoun
on      indefinite pronoun
OTHER   some other word, or nothing at all, should be inserted

This task will be evaluated automatically by matching the predictions
against the words found in the reference translation, computing
overall accuracy as well as precision, recall and F-score for each class. The
primary score for the evaluation is the macro-averaged F-score over
all classes. Compared to accuracy, the macro-averaged F-score favours
systems that consistently perform well on all classes and penalises
systems that maximise the performance on frequent classes while
sacrificing infrequent ones.
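For development purposes, the primary score can be computed as follows
(a minimal sketch; the function name is ours, but the class set and
the macro-averaging follow the task description):

```python
from collections import Counter

CLASSES = ["ce", "elle", "elles", "il", "ils", "ça", "cela", "on", "OTHER"]

def macro_f1(gold, pred, classes=CLASSES):
    """Per-class precision/recall/F1 and their macro average.

    gold, pred: parallel lists of class labels, one per placeholder.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    f_scores = {}
    for c in classes:
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f_scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # The macro average weights every class equally, regardless of frequency.
    return sum(f_scores.values()) / len(classes), f_scores
```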

The data supplied for the classification task consists of parallel
English-French text with word alignments. In the French text, a subset
of the words aligned to English occurrences of it and they have been
replaced by placeholders of the form REPLACE_xx, where xx is the index
of the English word the placeholder is aligned to. Your task is to
predict one of the classes listed above for each occurrence of a
placeholder.

The training and development data is supplied in a file format with
five tab-separated columns:

1. the class label
2. the word actually removed from the text (may be different from the
   class label for class OTHER and in some edge cases)
3. the English source segment
4. the French target segment with pronoun placeholders
5. the word alignment (a space-separated list of alignments of the form
   SRC-TGT, where SRC and TGT are zero-based word indices in the source
   and target segment, respectively)

A single segment may contain more than one placeholder. In that case,
columns 1 and 2 contain multiple space-separated entries in the order
of placeholder occurrence. A document segmentation of the data is
provided in separate files for each corpus. These files contain one
line per segment, but the precise format varies depending on the type
of document markup available for the different corpora. In the
development and test data, the files have a single column containing
the ID of the document the segment is part of.

Here's an example line from one of the training data files:

elles Elles They arrive first . REPLACE_0 arrivent en premier . 0-0 1-1 2-3 3-4
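Such a line can be parsed with a few lines of Python (a minimal
sketch; the function name is ours):

```python
def parse_line(line):
    """Split one training line into its five tab-separated fields."""
    labels, removed, src, tgt, align = line.rstrip("\n").split("\t")
    alignment = [tuple(map(int, pair.split("-"))) for pair in align.split()]
    return {
        "labels": labels.split(),    # one class label per placeholder
        "removed": removed.split(),  # the words actually removed
        "source": src.split(),       # English tokens
        "target": tgt.split(),       # French tokens with REPLACE_xx
        "alignment": alignment,      # (src_idx, tgt_idx), zero-based
    }
```

For the example above, this yields the label list ["elles"], and the
alignment link (0, 0) connects the placeholder REPLACE_0 to the
English source word "They".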

The test set will be supplied in the same format, but with columns 1
and 2 (elles and Elles) empty, so each line starts with two tab
characters. Your submission should have the same format as column 1
above, so a correct solution would contain the class label elles in
this case. Each line should contain as many space-separated class
labels as there are REPLACE tags in the corresponding segment. For
each segment not containing any REPLACE tags, an empty line should be
emitted. Additional tab-separated columns may be present in the
submission, but will be ignored. Note in particular that you are not
required to predict the second column. The submitted files should be
encoded in UTF-8 (like the data we provide).
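Producing a correctly formatted submission from per-segment
predictions might then look like this (a minimal sketch; the function
name is ours):

```python
def format_submission(predictions):
    """predictions: list over segments; each entry is a (possibly
    empty) list of class labels, one per REPLACE tag in that segment.
    Returns the submission file content as one string."""
    # Segments without REPLACE tags produce an empty line, as required.
    return "".join(" ".join(labels) + "\n" for labels in predictions)
```

Writing the returned string to a file opened with encoding="utf-8"
satisfies the encoding requirement.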

The test set will be the same as for the pronoun-focused translation
task. The complete test data for the classification task, including
reference translations and word alignments, will be released on 11 May
2015, after the completion of the translation task. Your submission is
due on 18 May 2015. Details on how to submit will be added to our
website later.

Note: If you create a classifier for this task but do not have an MT
system of your own, you might consider using your classifier as a
feature function in the document-level SMT decoder Docent to create a
submission for the pronoun translation task.

* Discussion Group

If you are interested in participating in the shared task, we
recommend that you sign up to our discussion group to make sure you
don't miss any important information. Feel free to ask any questions
you may have about the shared task!

https://groups.google.com/d/forum/discomt2015

* Training Data and Tools

All training and development data for both subtasks can be downloaded
from the following location:

https://www.dropbox.com/sh/c8qnpag5z29jyh6/AAAAqk1TE9-UvcgEnfccdRwxa?dl=0
Download alternative 1: http://opus.lingfil.uu.se/DiscoMT2015/
Download alternative 2: http://stp.lingfil.uu.se/~joerg/DiscoMT2015/

The Dropbox folder contains many files; see the README linked below.
To create a system for the pronoun classification task, you should
start with the classification training data. For the pronoun-focused
translation task, we provide the original training data, preprocessed
data sets including full word alignments, and a complete pre-trained
phrase-based SMT system. To minimise preprocessing differences among
the submitted systems, we suggest (but do not require) that you start
from the most processed version of the data that is usable for the
type of system you plan to build.

Look at the README file for more information about the individual
files we provide: http://stp.lingfil.uu.se/~joerg/DiscoMT2015/README

* Classification Baseline

We provide a baseline model for the classification task that looks
only at language model scores (using KenLM). The language model must
be in KenLM's binary format, which is the case for the
"corpus.5.trie.kenlm" file included in the "baseline-all" tarball.
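The baseline's decision rule is, in essence, to insert each candidate
pronoun into the placeholder slot and keep the one the language model
scores highest. The sketch below illustrates this principle with a
generic scoring callable in place of the actual KenLM model; the
function and candidate list are our own simplification, not the
organisers' code:

```python
# The empty string stands in for the OTHER class (insert nothing).
CANDIDATES = ["ce", "elle", "elles", "il", "ils", "ça", "cela", "on", ""]

def predict(target_tokens, placeholder_idx, lm_score):
    """Fill one REPLACE slot with the candidate the LM likes best.

    lm_score: a callable mapping a sentence string to a score
    (in the real baseline, a log-probability from a KenLM model).
    """
    best, best_score = None, float("-inf")
    for cand in CANDIDATES:
        tokens = list(target_tokens)
        # Replace the placeholder token with the candidate, or drop it.
        tokens[placeholder_idx:placeholder_idx + 1] = [cand] if cand else []
        score = lm_score(" ".join(tokens))
        if score > best_score:
            best, best_score = cand, score
    return best or "OTHER"
```

In the real baseline, lm_score would wrap the binary KenLM model, and
a penalty such as the --null-penalty option used below can be applied
to the empty candidate's score.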

Results with default options on TEDdev (same data as tst2010):

ce : P = 110/ 129 = 85.27% R = 110/ 148 = 74.32% F1 = 79.42%
cela : P = 4/ 15 = 26.67% R = 4/ 10 = 40.00% F1 = 32.00%
elle : P = 6/ 13 = 46.15% R = 6/ 30 = 20.00% F1 = 27.91%
elles : P = 4/ 12 = 33.33% R = 4/ 16 = 25.00% F1 = 28.57%
il : P = 35/ 137 = 25.55% R = 35/ 55 = 63.64% F1 = 36.46%
ils : P = 86/ 94 = 91.49% R = 86/ 139 = 61.87% F1 = 73.82%
on : P = 3/ 10 = 30.00% R = 3/ 10 = 30.00% F1 = 30.00%
ça : P = 16/ 22 = 72.73% R = 16/ 61 = 26.23% F1 = 38.55%
OTHER : P = 225/ 315 = 71.43% R = 225/ 278 = 80.94% F1 = 75.89%

i.e. a macro-averaged F1 of 46.96%

Results with "--null-penalty -2.0":

ce : P = 121/ 145 = 83.45% R = 121/ 148 = 81.76% F1 = 82.59%
cela : P = 4/ 21 = 19.05% R = 4/ 10 = 40.00% F1 = 25.81%
elle : P = 7/ 15 = 46.67% R = 7/ 30 = 23.33% F1 = 31.11%
elles : P = 5/ 14 = 35.71% R = 5/ 16 = 31.25% F1 = 33.33%
il : P = 36/ 143 = 25.17% R = 36/ 55 = 65.45% F1 = 36.36%
ils : P = 99/ 109 = 90.83% R = 99/ 139 = 71.22% F1 = 79.84%
on : P = 3/ 13 = 23.08% R = 3/ 10 = 30.00% F1 = 26.09%
ça : P = 19/ 32 = 59.38% R = 19/ 61 = 31.15% F1 = 40.86%
OTHER : P = 211/ 255 = 82.75% R = 211/ 278 = 75.90% F1 = 79.17%

i.e. a macro-averaged F1 of 48.35%