Findings of the 2017 DiscoMT Shared Task on Cross-lingual Pronoun Prediction

We describe the design, the setup, and the evaluation results of the DiscoMT 2017 shared task on cross-lingual pronoun prediction. The task asked participants to predict a target-language pronoun given a source-language pronoun in the context of a sentence. We further provided a lemmatized target-language human-authored translation of the source sentence, and automatic word alignments between the source sentence words and the target-language lemmata. The aim of the task was to predict, for each target-language pronoun placeholder, the word that should replace it from a small, closed set of classes, using any type of information that can be extracted from the entire document. We offered four subtasks, each for a different language pair and translation direction: English-to-French, English-to-German, German-to-English, and Spanish-to-English. Five teams participated in the shared task, making submissions for all language pairs. The evaluation results show that most participating teams outperformed two strong baseline systems based on n-gram language models by a sizable margin.


Introduction
Pronoun translation poses a problem for machine translation (MT) as pronoun systems do not map well across languages, e.g., due to differences in gender, number, case, formality, or humanness, as well as because of language-specific restrictions about where pronouns may be used. For example, when translating the English it into French an MT system needs to choose between il, elle, and cela, while translating the same pronoun into German would require a choice between er, sie, and es. This is hard as selecting the correct pronoun may need discourse analysis as well as linguistic and world knowledge. Null subjects in pro-drop languages pose additional challenges as they express person and number within the verb's morphology, rendering a subject pronoun or noun phrase redundant. Thus, translating from such languages requires generating a pronoun in the target language for which there is no pronoun in the source.
Neural machine translation (NMT) generally yields higher-quality translations, but it is harder to analyze, and thus little is known about how well it handles pronoun translation. Yet, it is clear that it has access to a larger context than phrase-based statistical MT (SMT) models, potentially spanning multiple sentences, which can improve pronoun translation (Jean et al., 2017a).
Motivated by these challenges, the DiscoMT 2017 workshop on Discourse in Machine Translation offered a shared task on cross-lingual pronoun prediction. This was a classification task, asking the participants to make predictions about which pronoun should replace a placeholder in the target-language text. The task required no MT expertise and was designed to be interesting as a machine learning task in its own right, e.g., for researchers working on co-reference resolution.
Source: me ayudan a ser escuchada
Literal: "me help 3.Pers.Pl to be heard"
Target: REPLACE help me to be heard
POS tags: PRON VERB PRON PART AUX VERB
Reference: They help me to be heard
Figure 1: Spanish→English example with a null subject.
The shared task targets subject pronouns, and this year this also includes null subjects, e.g., as shown in Figure 1. In linguistics, this characteristic is known as pro-drop, since an invisible pronoun pro is assumed to occupy the subject position. Whenever a null subject is used, the grammatical person features are inferred from the verb (Neeleman and Szendrői, 2005). In pro-drop languages, an explicit pronoun is used mostly for stressing the subject, since mentioning the pronoun in every subject position results in an output that is perceived as less fluent (Clemens, 2001). However, in impersonal sentences, using a subject pronoun is not an option; it is ungrammatical.
We further target the problem of functional ambiguity, whereby pronouns with the same surface form may perform multiple functions (Guillou, 2016). For example, the English pronoun it may function as an anaphoric, pleonastic, or event reference pronoun. An anaphoric pronoun corefers with a noun phrase (NP). A pleonastic pronoun does not refer to anything, but it is required by syntax to fill the subject position. An event reference pronoun may refer to a verb phrase (VP), a clause, an entire sentence, or a longer passage of text. These different functions may entail different translations in another language.
Previous studies have focused on the translation of anaphoric pronouns. In this case, a well-known constraint of languages with grammatical gender is that agreement must hold between an anaphoric pronoun and the NP with which it corefers, called its antecedent. The pronoun and its antecedent may occur in the same sentence (intra-sentential anaphora) or in different sentences (inter-sentential anaphora). Most MT systems translate sentences in isolation, so inter-sentential anaphoric pronouns are translated without knowledge of their antecedent, and pronoun-antecedent agreement cannot be guaranteed.
The above constraints start playing a role in pronoun translation when several translation options are possible for a given source-language pronoun, since a large number of options is likely to hurt translation quality. In other words, pronoun types that exhibit significant translation divergence are more likely to be wrongly translated by an MT system that is not aware of the above constraints. For example, when translating the English pronoun she into French, there is one main option, elle; yet, there are some exceptions, e.g., in references to ships. However, several options exist for the translation of anaphoric it: il (for an antecedent that is masculine in French) or elle (for a feminine antecedent), but also cela, ça or sometimes ce (non-gendered demonstratives).
The challenges that pronouns pose for machine translation have gradually raised interest in the research community for a shared task that would make it possible to compare various competing proposals and to quantify the extent to which they improve the translation of different pronouns for different language pairs and different translation directions. However, evaluating pronoun translation comes with its own challenges, as reference-based evaluation, which is standard for machine translation in general, cannot easily take into account legitimate variations of translated pronouns or their placement in the sentence. Thus, building upon experience from DiscoMT 2015 (Hardmeier et al., 2015) and WMT 2016, this year's cross-lingual pronoun prediction shared task has been designed to test the capacity of the participating systems for translating pronouns correctly, in a framework that allows for objective evaluation, as we will explain below.
ce OTHER
ce|PRON qui|PRON
It 's an idiotic debate . It has to stop .
REPLACE_0 etre|VER un|DET débat|NOM idiot|ADJ REPLACE_6 devoir|VER stopper|VER .|.
0-0 1-1 2-2 3-4 4-3 6-5 7-6 8-6 9-7 10-8
Figure 2: English→French example from the development dataset. First come the gold class labels, followed by the pronouns (these are given for training, hidden for test), then the English input, the French lemmatized and PoS-tagged output with REPLACE placeholders, and finally word alignments. Here is a French reference translation (not given to the participants): C'est un débat idiot qui doit stopper.

Task Description
Similarly to the setup of the WMT 2016 shared task, the participants had to predict a target-language pronoun given a source-language pronoun in the context of a sentence, which in turn was given in the context of a full document. We further provided a lemmatized and part-of-speech (POS) tagged target-language human-authored translation of the source sentence, as well as automatic token-level alignments between the source-sentence words and the target-language lemmata. In the translation, we substituted the words aligned to a subset of the source-language third-person subject pronouns by placeholders. The aim of the task was to predict, for each such placeholder, the pronoun class (we group some pronouns in an equivalence class, e.g., cela/ça, and we further have a catch-all OTHER class for translations such as lexical noun phrases, paraphrases or nothing at all, when the pronoun is not translated) that should replace it from a small, closed set, using any type of information that can be extracted from the text of the entire document. Thus, the evaluation can be performed in a fully automatic way, by comparing whether the class predicted by the system is identical to the reference one, assuming that the constraints of the lemmatized target text allow only one correct class.
Figure 2 shows an English→French example sentence from the development dataset. It contains two pronouns to be predicted, which are indicated by REPLACE placeholders in the target sentence. The first it corresponds to ce, while the second it corresponds to qui (which can be translated in English as which), which belongs to the OTHER class, i.e., does not need to be predicted as a word but rather as the OTHER class. This example illustrates some of the difficulties of the task: the two source sentences are merged into one target sentence, the second it is translated as a relative pronoun instead of a subject one, and the second French verb has a rare intransitive usage.
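The data format of Figure 2 can be read with a few lines of code. The following is a minimal sketch, not the organizers' tooling: it assumes that the five fields are separated by tabs and that each placeholder is written as REPLACE joined by an underscore to the index of the aligned source pronoun (an assumption that is consistent with the alignment links in the example). The names Example, parse_example and placeholders are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    classes: List[str]                 # gold class labels, one per placeholder
    pronouns: List[str]                # target tokens hidden behind the placeholders
    source: List[str]                  # source-language tokens
    target: List[str]                  # target lemma|POS tokens and placeholders
    alignments: List[Tuple[int, int]]  # (source index, target index) links


def parse_example(line: str) -> Example:
    """Split one line into its five fields (tab separation is an assumption)."""
    classes, pronouns, source, target, align = line.rstrip("\n").split("\t")
    links = []
    for link in align.split():
        src, tgt = link.split("-")
        links.append((int(src), int(tgt)))
    return Example(classes.split(), pronouns.split(), source.split(),
                   target.split(), links)


def placeholders(example: Example) -> List[Tuple[int, int, str]]:
    """Return (target position, source pronoun index, gold class) triples."""
    slots = [(pos, int(tok.split("_")[1]))
             for pos, tok in enumerate(example.target)
             if tok.startswith("REPLACE_")]
    return [(pos, src, cls) for (pos, src), cls in zip(slots, example.classes)]


line = "\t".join([
    "ce OTHER",
    "ce|PRON qui|PRON",
    "It 's an idiotic debate . It has to stop .",
    "REPLACE_0 etre|VER un|DET débat|NOM idiot|ADJ REPLACE_6 devoir|VER stopper|VER .|.",
    "0-0 1-1 2-2 3-4 4-3 6-5 7-6 8-6 9-7 10-8",
])
example = parse_example(line)
print(placeholders(example))  # [(0, 0, 'ce'), (5, 6, 'OTHER')]
```

A system would replace the gold class in each triple by its own prediction; at test time the first two fields are hidden, so only the positions and the source indices are available.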
Table 1 shows the set of source-language pronouns and the target-language classes to be predicted for each of the subtasks in all editions of the task. Note that the subtasks are asymmetric in terms of the source-language pronouns and the prediction classes. The selection of the source-language pronouns and their target-language prediction classes for each subtask is based on the variation that is to be expected when translating a given source-language pronoun. For example, when translating the English pronoun it into French, a decision needs to be made as to the gender of the French pronoun, with il and elle both providing valid options. Alternatively, a non-gendered pronoun such as cela may also be used.
Compared to the WMT 2016 version of the task, this year we replaced the French-English language pair with Spanish-English, which allowed us to evaluate the system performance when dealing with null subjects on the source-language side. As in the WMT 2016 task, we provided a lemmatized and POS-tagged reference translation instead of fully inflected text as was used in the DiscoMT 2015 task. This representation, while still artificial, arguably provides a more realistic MT-like setting. MT systems cannot be relied upon to generate correctly inflected surface form words, and thus the lemmatized, POS-tagged representation encourages greater reliance on other information from the source and the target language texts.

Data Sources
The training dataset comprises Europarl, News and TED talks data. The development and the test datasets consist of TED talks. Below we describe the TED talks, the Europarl and News data, the method used for selecting the test datasets, and the steps taken to pre-process the training, the development, and the test datasets.

TED Talks
TED is a non-profit organization that "invites the world's most fascinating thinkers and doers [...] to give the talk of their lives". Its website makes the audio and the video of TED talks available under the Creative Commons license. All talks are presented and captioned in English, and translated by volunteers world-wide into many languages. In addition to the availability of (audio) recordings, transcriptions and translations, TED talks pose interesting research challenges from the perspective of both speech recognition and machine translation. Therefore, both research communities are making increased use of them in building benchmarks.
TED talks address topics of general interest and are delivered to a live public audience whose responses are also audible on the recordings. The talks generally aim to be persuasive and to change the viewers' behaviour or beliefs. The genre of the TED talks is transcribed planned speech.
Previous analysis has shown that TED talks differ from other text types with respect to pronoun use (Guillou et al., 2014). TED speakers frequently use first- and second-person pronouns (singular and plural): first-person to refer to themselves and their colleagues or to themselves and the audience, and second-person to refer to the audience, the larger set of viewers, or people in general. TED speakers often use the pronoun they without a specific textual antecedent, in sentences such as "This is what they think." They also use deictic and third-person pronouns to refer to things in the spatio-temporal context shared by the speaker and the audience, such as props and slides. In general, pronouns are common, and anaphoric references are not always clearly defined.
For the DiscoMT 2017 task on cross-lingual pronoun prediction, the TED training and development sets come either from the MT tasks of the IWSLT evaluation campaigns or from past editions of the task (Hardmeier et al., 2015). The test sets are built from 16 TED talks that were never used in any previous evaluation campaign: 8 talks form the test sets from English to German and to French, and the other 8 form the test sets from German and from Spanish to English. More details are provided below.

Europarl and News
For training purposes, in addition to TED talks, we further made available the Europarl (Koehn, 2005) and News Commentary corpora for all language pairs but Spanish-English, for which only TED talks and Europarl were available. We used the alignments provided by OPUS, including the document boundaries from the original sources. For Europarl, we used ver. 7 of the data release, and for News Commentary we used ver. 9.

Test Set Selection
We selected the test data from talks added recently to the TED repository such that:

1. The talks have been transcribed (in English) and translated into both German and French.
2. They were not used in the IWSLT evaluation campaigns, nor in the DiscoMT 2015 or WMT 2016 test sets.
3. They amount to a number of words suitable for evaluation purposes (tens of thousands).

Once we found the talks satisfying these criteria, we automatically aligned them at the segment level. Then, we extracted a number of TED talks from the collection, following the criteria outlined in Section 3.1 above. Finally, we manually checked the sentence alignments of these selected TED talks in order to fix potential errors introduced by either automatic or human processing. Table 2 shows some statistics about the test datasets we prepared for each subtask.  In total, we selected 16 TED talks for testing, which we split into two groups as follows: 8 TED talks for the English to French/German direction, and 8 TED talks for the Spanish/German to English direction. Another option would have been to create four separate groups of TED talks, one for each subtask. However, we chose the current setup as using a smaller set of documents reduced the manual effort in correcting the automatic sentence alignment of the documents.

More detailed information about the TED talks that we included in the test datasets is shown in Tables 3 and 4, for translating from and into English, respectively. We used the same English TED talks for the English to French/German and Spanish/German to English subtasks. Note however that differences in the alignment of the sentences lead to different segmentation of the parallel texts for the different language pairs. Moreover, minor corrections to the sentence alignment and to the text itself, which we applied manually, resulted in small differences in the number of tokens for the same English TED talk when paired with the French vs. the German translation.
Note that when selecting these TED talks, we tried to pick talks that include more pronouns from the rare classes. For example, for the English to French/German dataset, we wished to include documents that contained more feminine pronouns in the French and in the German translations.

Data Preparation
Next, we processed all datasets following the same procedure as last year. In particular, we extracted examples for pronoun prediction based on automatic word alignment, and we used filtering techniques to exclude non-subject pronouns. We further converted the data to a lemmatized version with coarse POS tags (Petrov et al., 2012). For all languages except Spanish, we used the TreeTagger (Schmid, 1994) with its built-in lemmatizer. Then, we converted the TreeTagger's POS tags to the target coarse POS tags using pre-defined mappings. For French, we clipped the morphosyntactic information and we reduced the number of verb form tags to just one. For Spanish, we used UDPipe (Straka et al., 2016), which includes universal POS tags and a lemmatizer.
In previous years, the automatic alignments used for the task were optimized to improve the precision and recall of pronoun alignments. For the repeated language pairs, we reused the best performing alignment strategies from 2015 and 2016. For English→French and Spanish→English, we used GIZA++ (Och and Ney, 2003) model 4 with grow-diag-final-and as symmetrization. For English↔German, we used GIZA++ HMM (Vogel et al., 1996) alignment with intersection for symmetrization. In all cases, we used fast_align (Dyer et al., 2013) as backoff for sentences that are longer than the 100-word limit of GIZA++.
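To make the symmetrization step concrete, here is a minimal sketch of intersection symmetrization, with the union shown for contrast. It is not the GIZA++ or Moses implementation; the function names and the toy links are hypothetical. Intersection keeps only the links found in both alignment directions, which favours precision, whereas heuristics such as grow-diag-final-and start from the intersection and add selected links from the union.

```python
from typing import Set, Tuple

Link = Tuple[int, int]  # (source index, target index)


def flip(links: Set[Link]) -> Set[Link]:
    """Turn (target, source) pairs into (source, target) pairs."""
    return {(s, t) for (t, s) in links}


def intersect(src_to_tgt: Set[Link], tgt_to_src: Set[Link]) -> Set[Link]:
    """Keep only the links proposed by both directional alignments."""
    return src_to_tgt & flip(tgt_to_src)


def union(src_to_tgt: Set[Link], tgt_to_src: Set[Link]) -> Set[Link]:
    """Keep every link proposed by at least one directional alignment."""
    return src_to_tgt | flip(tgt_to_src)


# Toy directional alignments (made up for illustration).
forward = {(0, 0), (1, 1), (2, 1)}    # source-to-target run: (source, target)
backward = {(0, 0), (1, 1), (2, 3)}   # target-to-source run: (target, source)

print(sorted(intersect(forward, backward)))  # [(0, 0), (1, 1)]
print(sorted(union(forward, backward)))      # [(0, 0), (1, 1), (2, 1), (3, 2)]
```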

Example Selection
In order to select the acceptable target classes, we computed the frequencies of pronouns aligned to the ambiguous source-language pronouns based on the POS-tagged training data. Using these statistics, we defined the sets of predicted labels for each language pair. Based on the counts, we also decided to merge small classes such as the demonstrative pronouns these and those.
For English-French/German and German-English, we identified examples based on the automatic word alignments. We included cases in which multiple words were aligned to the selected pronoun if one of them belonged to the set of accepted target pronouns. If this was not the case, we used the shortest word aligned to the pronoun as the placeholder token.
Finding a suitable position to insert a placeholder on the target-language side for a source-language pronoun that was unaligned required using a heuristic. For this purpose, we first used the alignment links for the surrounding source-language words in order to determine the likely position for the placeholder token. We then expanded the window in both directions until we found an alignment link. We inserted the placeholder before or after the linked token, depending on whether the aligned source-language token was in the left or in the right context of the selected target pronoun. If no link was found in the entire sentence (which was an infrequent case), we used a position similar to the position of the selected pronoun within the source-language sentence.
For Spanish-English, the process was a bit different given that English subject pronouns are often realized as null subjects in Spanish. For this language pair, we identified the examples based on the parse of both the source and the target languages. From the Spanish parse, we took all verbal phrases (i.e., phrases that had the POS tags VERB, AUX and ADJ as heads) in the segment and we retained those in the third person without an overt subject, i.e., without an "nsubj" or "nsubjpass" arc. We then identified the corresponding English verb using the alignment links. Since English pronouns are aligned to the NULL token, we relied on the English parse, looking for previously identified verbs with an overt subject.
Finally, we inserted the placeholder in the position of the English pronoun with the position of the Spanish verb concatenated to it. In the case of verb phrases that include multiple tokens (e.g., had been reading), we used the position of the first word in the verb phrase. As before, when no alignment link was found, we used a position similar to the position of the selected pronoun within the source-language sentence. Unfortunately, and contrary to the other language pairs, we found many cases for which there was no alignment link in the entire sentence: 26,277/87,528 for IWSLT, 160/638 for TEDdev, and 187,103/712,728 for Europarl.
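The heuristic for placing a placeholder when the source-language pronoun is unaligned can be sketched as follows. This is an illustrative reconstruction from the description above, not the script used for the task. Several details are assumptions: the left neighbour is checked before the right one at each window size, the placeholder goes immediately after the rightmost (or before the leftmost) target token linked to that neighbour, and the fallback maps the relative source position linearly onto the target sentence. The function name and its parameters are hypothetical.

```python
from typing import Dict, List


def placeholder_position(pron_idx: int, src_len: int, tgt_len: int,
                         links: Dict[int, List[int]]) -> int:
    """Return the target index at which to insert the placeholder.

    links maps a source token index to the target indices aligned to it;
    the pronoun at pron_idx itself is assumed to be unaligned.
    """
    for dist in range(1, src_len):
        left, right = pron_idx - dist, pron_idx + dist
        if left >= 0 and links.get(left):
            # Aligned word in the left context: insert right after its target token.
            return max(links[left]) + 1
        if right < src_len and links.get(right):
            # Aligned word in the right context: insert right before its target token.
            return min(links[right])
    # No link in the whole sentence: use a similar relative position.
    return round(pron_idx * tgt_len / max(src_len - 1, 1))


# Source "A B it C" with the pronoun at index 2 (toy sentences, made up).
print(placeholder_position(2, 4, 3, {0: [0], 1: [1], 3: [2]}))  # 2
print(placeholder_position(2, 4, 3, {3: [2]}))                  # 2
print(placeholder_position(2, 4, 6, {}))                        # 4
```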

Subject Filtering
As we have explained above, the shared task focused primarily on subject pronouns. However, in English and German, some pronouns are ambiguous between subject and object position, e.g., the English it and the German es and sie. In order to address this issue, in 2016 we introduced filtering of object pronouns based on dependency parsing. This filtering removed all pronoun instances that did not have a subject dependency label (see footnote 6). For joint dependency parsing and POS-tagging, we used Mate Tools (Bohnet and Nivre, 2012), with default models. Since in 2016 we found that this filtering was very accurate, this year we relied on automatic filtering alone for the training, the development, and the test datasets. Note that since only subject pronouns can be realized as pro-dropped pronouns in Spanish, subject filtering was not necessary.
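A minimal sketch of such a filter is given below. It is not the filtering script used for the task. The Token type and the function name are hypothetical, and the label set is an assumption: "SBJ" and "SB" stand in for the subject labels of the English and German parser models, respectively. The example also shows why the filter is strict for German: a pronoun tagged as an expletive (EP) is dropped because EP is not a subject label.

```python
from typing import List, NamedTuple, Set


class Token(NamedTuple):
    form: str
    deprel: str  # dependency label assigned by the parser


# Assumed subject labels; the real inventories depend on the parser models.
SUBJECT_LABELS = {"SBJ", "SB"}


def subject_pronoun_positions(tokens: List[Token], pronouns: Set[str]) -> List[int]:
    """Return positions of target pronouns that carry a subject label."""
    return [i for i, tok in enumerate(tokens)
            if tok.form.lower() in pronouns and tok.deprel in SUBJECT_LABELS]


english = [Token("It", "SBJ"), Token("works", "ROOT"), Token("and", "COORD"),
           Token("I", "SBJ"), Token("like", "CONJ"), Token("it", "OBJ")]
german = [Token("Es", "EP"), Token("regnet", "ROOT")]

print(subject_pronoun_positions(english, {"it"}))  # [0]
print(subject_pronoun_positions(german, {"es"}))   # []
```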

Baseline Systems
The baseline system is based on an n-gram language model (LM). The architecture is the same as that used for the WMT 2016 cross-lingual pronoun prediction task (see footnote 7). In 2016, most systems outperformed this baseline, and for the sake of comparison, we thought that it was adequate to include the same baseline system this year. Another reason to use an LM-based baseline is that it represents an important component for pronoun translation in a full SMT system. The main assumption here is that the amount of information that can be extracted from the translation table of an SMT system would be insufficient or inconclusive. As a result, pronoun prediction would be influenced primarily by the language model.
We provided baseline systems for each language pair. Each baseline is based on a 5-gram language model for the target language, trained on word lemmata constructed from news texts, parliament debates, and the TED talks of the training/development portions of the datasets. The additional monolingual news data comprises the shuffled news texts from WMT, including the 2014 editions for German and English, and the 2007-2013 editions for French. The German corpus contains a total of 46 million sentences with 814 million lemmatized tokens, the English one includes 28 million sentences and 632 million tokens, and the French one covers 30 million sentences with 741 million tokens. These LMs are the same ones that we used in 2016.
Footnote 6: In 2016, we found that this filtering was too aggressive for German, since it also removed expletives, which had a different tag: EP. Still, we decided to use the same filtering this year, to keep the task stable and the results comparable.
Footnote 7: https://bitbucket.org/yannick/discomt_baseline
The baseline system fills the REPLACE token gaps by using a fixed set of pronouns (those to be predicted) and a fixed set of non-pronouns (which includes the most frequent items aligned with a pronoun in the provided test set) as well as the NONE option (i.e., do not insert anything in the hypothesis). The baseline system may be optimized using a configurable NONE penalty that accounts for the fact that n-gram language models tend to assign higher probability to shorter strings than to longer ones.
We report two official baseline scores for each subtask. The first one is computed with the NONE penalty set to an unoptimized default value of zero. The second one uses a NONE penalty set to an optimized value, which is different for each subtask. We optimized this value with a grid search over values between 0 and −4 with a step of 0.5, using the TEDdev2 dataset for Spanish-English and the WMT 2016 dataset for the other language pairs. The optimized values differ slightly from the values optimized on the less balanced data from 2016, but the differences in the resulting evaluation scores are minor.
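The interplay between the NONE option, its penalty, and the grid search can be illustrated with a small sketch. This is not the released baseline code. The language model is replaced by a toy scoring function, the tuning criterion is plain accuracy (an assumption, since the metric used for tuning is not stated here), and all names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Scorer = Callable[[List[str]], float]  # hypothetical LM: log-score of a lemma sequence
DevItem = Tuple[List[str], int, str]   # (lemmatized sentence, gap index, gold filler)


def fill_gap(tokens: List[str], gap: int, candidates: Sequence[str],
             score: Scorer, none_penalty: float = 0.0) -> str:
    """Return the best-scoring filler for the gap, or "NONE" to insert nothing."""
    best, best_score = "NONE", float("-inf")
    options: List[Optional[str]] = [None, *candidates]
    for cand in options:
        filler = [] if cand is None else [cand]
        total = score(tokens[:gap] + filler + tokens[gap + 1:])
        if cand is None:
            total += none_penalty  # a negative value makes the empty option less attractive
        if total > best_score:
            best, best_score = ("NONE" if cand is None else cand), total
    return best


def tune_penalty(dev: Sequence[DevItem], candidates: Sequence[str], score: Scorer) -> float:
    """Grid search over 0, -0.5, ..., -4 using accuracy on the development items."""
    grid = [-k / 2 for k in range(9)]

    def accuracy(penalty: float) -> float:
        hits = sum(fill_gap(t, g, candidates, score, penalty) == gold for t, g, gold in dev)
        return hits / len(dev)

    return max(grid, key=accuracy)  # ties go to the value closest to zero


def toy_score(tokens: List[str]) -> float:
    # Toy stand-in for a 5-gram LM: shorter strings score higher, "il" earns a bonus.
    return -1.0 * len(tokens) + (0.6 if "il" in tokens else 0.0)


sentence = ["REPLACE", "parler"]
print(fill_gap(sentence, 0, ["il", "elle"], toy_score))        # NONE
print(fill_gap(sentence, 0, ["il", "elle"], toy_score, -0.5))  # il
print(tune_penalty([(sentence, 0, "il")], ["il", "elle"], toy_score))  # -0.5
```

With a zero penalty the toy scorer prefers the shorter string and inserts nothing; a penalty of −0.5 is enough to tip the decision towards the pronoun, which is the effect the penalty is meant to have.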

Submitted Systems
A total of five teams participated in the shared task, submitting primary systems for all subtasks. Most teams also submitted contrastive systems, which have unofficial status for the purpose of ranking, but are included in the tables of results.

TurkuNLP
The TurkuNLP system (Luotolahti et al., 2017) improves on last year's system by the same team (Luotolahti et al., 2016), mainly through a task-specific pre-training scheme for the vocabulary embeddings. The system is a recurrent neural network built from stacked Gated Recurrent Units (GRUs). The model takes eight sequences as input: the target-token context, the target-POS context, the target-token-POS context, and the source-token context, each represented twice, once for the left and once for the right context. As a ninth input, the neural network takes the source-language token that is aligned to the pronoun to be predicted. All input sequences are fed into an embedding layer followed by two layers of GRUs. The values in the last layer form a vector, which is concatenated with the pronoun alignment embeddings to form a larger vector, which is then used to make the final prediction using a dense neural network. The pre-training modifies the skip-gram model of WORD2VEC (Mikolov et al., 2013): along with the skip-gram token context, all target-sentence pronouns are predicted as well, in order to induce embeddings suitable for the task. The pre-training is performed using WORD2VECF (Levy and Goldberg, 2014).

Uppsala
The UPPSALA system (Stymne et al., 2017) is based on a neural network that uses a BiLSTM representation of the source and of the target sentences, respectively. The source sentences are preprocessed using POS tagging and dependency parsing, and then are represented by embeddings for words, POS tags, dependency labels, and a character-level representation based on a one-layer BiLSTM. The target sentences are represented by embeddings for the provided lemmata and POS tags. These representations are fed into separate two-layer BiLSTMs. The final layer includes a multi-layer perceptron that takes the BiLSTM representations of the target pronoun, of the source pronoun, of the dependency head of the source pronoun (this is not used for Spanish as it is a pro-drop language) and the original embeddings of the source pronouns.
In order to address the imbalanced class distribution, sampling of 10% of the data is used in each epoch. For the primary system, all classes are sampled equally, as long as there are enough instances for each class. Although this sampling method biases the system towards macro-averaged recall, on the test data the system performed very well in terms of both macro-averaged recall and accuracy. The secondary system uses a sampling method in which the samples are proportional to the class distribution in the development dataset.
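The two sampling schemes can be sketched as follows. This is an illustration under assumptions, not the UPPSALA implementation: the per-class quota is computed by rounding, the function name and the class_share parameter are hypothetical, and a class that has fewer instances than its quota simply contributes all of them.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

Example = Tuple[str, str]  # (feature placeholder, class label)


def sample_epoch(examples: List[Example], rng: random.Random, fraction: float = 0.10,
                 class_share: Optional[Dict[str, float]] = None) -> List[Example]:
    """Draw the training sample for one epoch.

    With class_share=None every class gets an equal share of the budget;
    otherwise class_share gives the desired proportion of each class,
    e.g. the class distribution of the development dataset.
    """
    by_class: Dict[str, List[Example]] = defaultdict(list)
    for ex in examples:
        by_class[ex[1]].append(ex)
    budget = int(len(examples) * fraction)
    sample: List[Example] = []
    for label, items in by_class.items():
        share = 1.0 / len(by_class) if class_share is None else class_share.get(label, 0.0)
        quota = min(round(budget * share), len(items))  # small classes give all they have
        sample.extend(rng.sample(items, quota))
    rng.shuffle(sample)
    return sample


# Toy, heavily imbalanced training data (made up for illustration).
data = [("x", "il")] * 900 + [("x", "elle")] * 80 + [("x", "OTHER")] * 20
balanced = sample_epoch(data, random.Random(0))
print(Counter(label for _, label in balanced))
```

On this toy data the balanced sample contains 33 instances each of il and elle and all 20 instances of OTHER, so the rare classes are seen far more often than under uniform sampling.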

NYU
The NYU system (Jean et al., 2017b) uses an attention-based neural machine translation model and three variants that incorporate information from the preceding source sentence. The sentence is added as an auxiliary input using additional encoder and attention models. The systems are not specifically designed for pronoun prediction and may be used to generate complete sentence translations. They are trained exclusively on the data provided for the task, using the text only and ignoring the provided POS tags and alignments.

UU-Hardmeier
The UU-HARDMEIER system (Hardmeier, 2017) is an ensemble of convolutional neural networks combined with a source-aware n-gram language model. The neural network models evaluate the context in the current and in the preceding sentence of the prediction placeholder (in the target language) and the aligned pronoun (in the source language) with a convolutional layer, followed by max-pooling and a softmax output layer. The n-gram language model is identical to the source-aware n-gram model of Hardmeier (2016) and Loáiciga et al. (2016). It makes its prediction using Viterbi decoding over a standard n-gram model. Information about the source pronoun is introduced into the model by inserting the pronoun as an extra token before the placeholder. The posterior distributions of the n-gram model and of various training snapshots and different configurations of the neural network are linearly interpolated with weights tuned on the development dataset to make the final predictions.
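Linear interpolation of posterior distributions can be written in a few lines. The sketch below is generic and not taken from the UU-HARDMEIER code; the two component distributions and the weights are made-up numbers, whereas in the system described above the weights are tuned on the development dataset.

```python
from typing import Dict, List


def interpolate(distributions: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Weighted average of class posteriors; the weights are normalized to sum to one."""
    total = sum(weights)
    classes = sorted(set().union(*distributions))
    return {c: sum(w * d.get(c, 0.0) for d, w in zip(distributions, weights)) / total
            for c in classes}


def predict(distributions: List[Dict[str, float]], weights: List[float]) -> str:
    """Return the class with the highest interpolated probability."""
    mixed = interpolate(distributions, weights)
    return max(mixed, key=mixed.get)


# Hypothetical posteriors from an n-gram model and from one neural network snapshot.
ngram = {"il": 0.5, "elle": 0.3, "OTHER": 0.2}
network = {"il": 0.2, "elle": 0.7, "OTHER": 0.1}

mixed = interpolate([ngram, network], [0.4, 0.6])
print(predict([ngram, network], [0.4, 0.6]))  # elle
```

Here the interpolated probability of elle is 0.4 × 0.3 + 0.6 × 0.7 = 0.54, so the ensemble follows the network even though the n-gram model alone would have chosen il.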

UU-Stymne16
The UU-STYMNE16 system uses linear SVM classifiers, and it is taken unchanged from the system description for the 2016 shared task (Stymne, 2016). It is based mainly on local features, and anaphora is not explicitly modeled. The features used include source pronouns, local context words/lemmata, target POS n-grams with two different POS tagsets, dependency heads of pronouns, alignments, and position of the pronoun. A joint tagger and dependency parser (Bohnet and Nivre, 2012) is used on the source text in order to produce some of the features. Overall, the source pronouns, the local context and the dependency features performed best across all language pairs. Stymne (2016) describes several variations of the method, including both one-step and two-step variants, but the submitted system is based on one-step classification. It uses optimized features trained on all data. This is the system that is called Final 1-step (all training data) in the original system description paper. Note that this system is not identical to the 2016 submission, but it is the system that performed best in additional post-task experiments on the 2016 test data for most language pairs.

Evaluation
While in 2015 we used macro-averaged F1 as an official evaluation measure, this year we followed the setup of 2016, where we switched to macro-averaged recall, which was also recently adopted by some other competitions, e.g., by SemEval-2016/2017 Task 4 (Nakov et al., 2016; Rosenthal et al., 2017). Moreover, as in 2015 and 2016, we also report accuracy as a secondary evaluation measure (but we abandon F1 altogether).
Macro-averaged recall ranges in [0, 1], where a value of 1 is achieved by the perfect classifier, and a value of 0 is achieved by the classifier that misclassifies all examples. The value of 1/C, where C is the number of classes, is achieved by a trivial classifier that assigns the same class to all examples (regardless of which class is chosen), and is also the expected value of a random classifier.
The advantage of macro-averaged recall over accuracy is that it is more robust to class imbalance. For instance, the accuracy of the majority-class classifier may be much higher than 1/C if the test dataset is imbalanced. Thus, one cannot interpret the absolute value of accuracy (e.g., is 0.7 a good or a bad value?) without comparing it to a baseline that must be computed for each specific test dataset. In contrast, for macro-averaged recall, it is clear that a value of, e.g., 0.7, is well above both the majority-class and the random baselines, which are both always 1/C (e.g., 0.5 with two classes, 0.33 with three classes, etc.). Similarly to accuracy, standard F1 and macro-averaged F1 are both sensitive to class imbalance for the same reason; see Sebastiani (2015) for more detail and further discussion.
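The contrast between the two measures is easy to verify numerically. The following sketch is not the official scorer; it computes both measures on a made-up, imbalanced test set with three classes, averaging recall over the classes that occur in the gold labels.

```python
from collections import Counter
from typing import List


def macro_recall(gold: List[str], pred: List[str]) -> float:
    """Average of the per-class recalls over the classes present in gold."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of examples whose predicted class equals the gold class."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy test set: 8 x il, 1 x elle, 1 x OTHER, scored against a majority-class classifier.
gold = ["il"] * 8 + ["elle"] + ["OTHER"]
majority = ["il"] * 10

print(accuracy(gold, majority))      # 0.8
print(macro_recall(gold, majority))  # 0.3333333333333333
```

The majority-class classifier reaches an accuracy of 0.8 on this data, but its macro-averaged recall is exactly 1/C = 1/3, since it recovers all instances of one class and none of the other two.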

Results
The evaluation results are shown in Tables 5-8. The first column in the tables shows the rank of the primary systems with respect to the official metric: macro-averaged recall. The second column contains the team's name and its submission type: primary vs. contrastive. The following columns show the results for each system, measured in terms of macro-averaged recall (official metric) and accuracy (unofficial, supplementary metric).
The subindices show the rank of the primary systems with respect to the evaluation measure in the respective column. As described in Section 4, we provide two official baseline scores for each subtask. The first one is computed with the NONE penalty set to a default value of zero. The second baseline uses a NONE penalty set to an optimized value. Note that these optimized penalty values are different for each subtask; the exact values are shown in the tables.
German→English. The results are shown in Table 5. All five participating teams outperformed the baselines, which are in the high-mid 30s for macro-averaged recall, by a wide margin. The top systems, TURKUNLP and UPPSALA, scored 68.88 and 68.55 in macro-averaged recall, respectively. The unofficial accuracy metric yields quite a different ranking, with TURKUNLP having the lowest accuracy among the five primary systems.
English→German. The results are shown in Table 6. For this direction, there is a gap of ten percentage points between the first and the second systems, UPPSALA and TURKUNLP, respectively. The clear winner is UPPSALA, with a macro-averaged recall of 78.38. For the unofficial accuracy metric, UPPSALA is again the winner, closely followed by NYU.
Spanish→English. The results are shown in Table 7. This language pair is the most difficult one, with the lowest scores overall, for both evaluation measures. Yet, all teams comfortably outperformed the baseline on both metrics, by a margin of at least 8-9 points. The best-performing primary system here is TURKUNLP, with a macro-averaged recall of 58.82. However, it is nearly tied with UPPSALA, and both are somewhat close to NYU. It is noteworthy, though, that the highest-scoring system on macro-averaged recall is the contrastive system of NYU; NYU also has the second-best accuracy, outperformed only by UPPSALA.

Table 6: Results for English→German.
English→French. The evaluation results for English→French are shown in Table 8. We should note that this is the only language pair and translation direction that was present in all three editions of the shared task on cross-lingual pronoun prediction so far. The best-performing system here is TURKUNLP, with macro-averaged recall of 66.89. Then, there is a gap of 3-4 percentage points to the second and to the third systems, UPPSALA (macro-averaged recall of 63.55) and UU-HARDMEIER (macro-averaged recall of 62.86), respectively. With respect to the secondary accuracy measure, the best-performing system was that of UU-HARDMEIER, followed by UPPSALA and UU-STYMNE16. Note that all participating systems outperformed the baselines on both metrics and by a huge margin of 15-30 points absolute; in fact, this is the highest margin of improvement over the baselines across all four language pairs and translation directions.
Overall results. TURKUNLP achieved the highest score on the official macro-averaged recall measure for three of the four language pairs; the exception was English→German, where the winner was UPPSALA. However, on accuracy, TURKUNLP was not as strong, and ended up fifth for three language pairs. This is in contrast to UPPSALA, which also performed well on accuracy, being first for three of the four language pairs. This incongruity between the evaluation measures did not occur in 2016, when macro-averaged recall and accuracy were aligned quite closely.
When we compare the best 2017 scores with the best 2016 scores for the three repeated language pairs, we can note some differences. For German→English, the scores are higher in 2017, but for the other language pairs, the scores are lower. However, we cannot draw any conclusions from this, since the test datasets, and particularly the class distributions, are different.

Tables 9-12 show the recall for each participating system, calculated with respect to each pronoun class. Note that for most classes, the LM baselines perform worse than the participating systems. It is also clear that some classes are considerably easier than others, and that rare classes are often difficult. For German→English (Table 9), no team has managed to predict the single instance of "these", and only TURKUNLP has found one of the two instances of "this", which considerably boosted their macro-averaged recall.
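The effect of a single rare-class hit on the two metrics can be made concrete with a small sketch. All counts below are invented for illustration and are not taken from Tables 9-12; the three-class setup and the names `system_a` and `system_b` are hypothetical.

```python
def accuracy_from_counts(correct, support):
    """Overall accuracy: total correct predictions over total examples."""
    return sum(correct.values()) / sum(support.values())


def macro_recall_from_counts(correct, support):
    """Unweighted mean of per-class recall (correct / gold count per class)."""
    return sum(correct[c] / support[c] for c in support) / len(support)


# Hypothetical gold class counts; "this" is a rare class with two instances.
support = {"OTHER": 60, "it": 30, "this": 2}

# Correctly predicted instances per class for two hypothetical systems.
system_a = {"OTHER": 48, "it": 24, "this": 0}  # strong on frequent classes
system_b = {"OTHER": 45, "it": 21, "this": 1}  # weaker overall, one rare hit

acc_a = accuracy_from_counts(system_a, support)
acc_b = accuracy_from_counts(system_b, support)
mr_a = macro_recall_from_counts(system_a, support)
mr_b = macro_recall_from_counts(system_b, support)

# Contribution of one correct rare-class prediction to macro recall:
# (1 / 2 instances) / 3 classes.
rare_hit_gain = (1 / support["this"]) / len(support)

print(f"A: accuracy={acc_a:.3f}  macro recall={mr_a:.3f}")
print(f"B: accuracy={acc_b:.3f}  macro recall={mr_b:.3f}")
print(f"gain from one rare-class hit: {rare_hit_gain:.3f}")
```

Under these assumed counts, system A has the higher accuracy while system B has the higher macro-averaged recall, which is the kind of ranking disagreement between the two measures noted above.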
For English→German (Table 10), there are eight instances of er, but for this class there is a lot of variance, with the best systems having a recall of 75.0, while for several systems it is 0.
For Spanish→English (Table 11), unlike the other pairs, the classes are rather uniformly distributed, the OTHER class, in particular, not being the most frequent one. Besides, although he, she, and it all have 12-15 instances, he and she have low overall recall, while for it it is quite high.
For English→French (Table 12), the feminine pronouns elle and elles have been notoriously difficult to predict in previous work on this task. We can see that this is also the case this year. However, TURKUNLP achieved a better score for the feminine singular elle than for the masculine singular il, and UPPSALA was better at predicting the feminine plural elles than the masculine plural ils.
Overall, it is hard to see systematic differences across the participating systems: all systems tend to perform well on some classes and poorly on others, even though there is some variation. However, it is clear that Spanish→English is more difficult than the other language pairs: compared to German→English, the scores are considerably lower for the classes he, she, they and OTHER, which these two language pairs share. Another clear observation is that for you and there, the scores are lower for Spanish→English than for the other language pairs for all systems, except for NYU-CONTRASTIVE.

Discussion
Unlike in 2016, this year all participating teams managed to outperform the corresponding baselines. Note, however, that these baselines are based on n-gram language models, which are considered competitive with SMT, while most systems this year used neural architectures. In fact, four of the systems used neural networks and they all outperformed the SVM-based UU-STYMNE system, which was among the best in 2016. Moreover, the systems used language-independent approaches which they applied to all language pairs and translation directions. With the exception of dependency parsers, none of the systems made use of additional tools, nor tried to address coreference resolution explicitly. Instead, they relied on modeling the sentential and intersentential context. Table 13 summarizes the sources of information that the systems used.
One of the original goals of the task was to improve our understanding of the process of pronoun translation. In this respect, however, we can only suggest that context should be among the most important factors, since this is what neural methods are very good at learning. Interestingly, the two best-performing systems, TURKUNLP and UPPSALA, used only intra-sentential context, but still performed better than the two systems that used inter-sentential information. Linguistically, it is easy to motivate using inter-sentential information for resolving anaphora; yet, none of the current systems targeted anaphora explicitly. We can conclude that making use of inter-sentential information for the task remains an open challenge.
Last year, the participating systems had difficulties with language pairs that had English on the source side. However, this year the hardest language pair was Spanish→English, which has English on the target side. This result reflects the difficulty of translating null subjects, which are as underspecified as the pronouns it and they when translating into French or German. We should further note that the example extraction process for Spanish focused on cases of third person verbs with null subjects. In other words, the use of Spanish pronouns vs. null subjects is not considered since overt Spanish pronouns were excluded.
As mentioned earlier, the macro-averaged recall and the accuracy metrics did not correlate well this year, suggesting that the official metric may need some re-thinking. The motivation for using macro-averaged recall was to avoid over-rewarding a system that performs well on high-frequency classes. It is not clear, however, that a system optimized to favor macro-averaged recall is strictly better than one that has higher accuracy.
Another question is how realistic our baselines are with respect to NMT systems. Our n-gram language model-based baselines were competitive with respect to phrase-based SMT systems trained with fully inflected target text, as evidenced by the higher scores achieved by the baselines with English on the source side. Given the recent rise of NMT and also in view of the strong performance of the NYU team, who submitted a full-fledged NMT system that uses intra-sentential information, it might be a good idea to adopt a similar system as a baseline in the future.

Table 13: Sources of information and key characteristics of the submitted systems.
We should note, however, that full-fledged NMT systems present challenges with respect to automatic evaluation, just like full-fledged phrase-based SMT systems do. The problem is that we cannot just compare the pronouns that a machine translation system has generated to the pronouns in a reference translation, as in doing so we might miss the legitimate variation of certain pronouns, as well as variations in gender or number of the antecedent itself. Human judges are thus required for reliable evaluation. In particular, the DiscoMT 2015 shared task on pronoun-focused translation (Hardmeier et al., 2015) included a protocol for human evaluation. This approach, however, has a high cost, which grows linearly with the number of submissions to the task, and it also makes subsequent research and direct comparison to the participating systems very hard. This is why in 2016, we reformulated the task as one about cross-lingual pronoun prediction, which allows us to evaluate it as a regular classification task; this year we followed the same formulation. While this eliminates the need for manual evaluation, it yielded a task that is only indirectly related to machine translation, and one that can be seen as artificial, e.g., because it does not allow an MT system to generate full output, and because the provided output is lemmatized.
In future editions of the task, we might want to go back to machine translation, but to adopt a specialized evaluation measure that would focus on pronoun translation, so that we can automate the process of evaluation at least partially, e.g., as proposed by Luong and Popescu-Belis (2016).

Conclusions
We have described the design and the evaluation of the shared task on cross-lingual pronoun prediction at DiscoMT 2017. We offered four subtasks, each for a different language pair and translation direction: English→French, English→German, German→English, and Spanish→English. We followed the setup of the WMT 2016 task, and for Spanish→English, we further introduced the prediction of null subjects, which proved challenging.
We received submissions from five teams, with four teams submitting systems for all language pairs. All participating systems outperformed the official n-gram-based language model-based baselines by a sizable margin. The two top-performing teams used neural networks and only intra-sentential information, ignoring the rest of the document. The only non-neural submission was ranked last, indicating the fitness of neural networks for this task. We hope that the success in the cross-lingual pronoun prediction task will soon translate into improvements in pronoun translation by end-to-end MT systems.