Findings of the 2015 Workshop on Statistical Machine Translation

This paper presents the results of the WMT15 shared tasks, which included a standard news translation task, a metrics task, a tuning task, a task for run-time estimation of machine translation quality, and an automatic post-editing task. This year, 68 machine translation systems from 24 institutions were submitted to the ten translation directions in the standard translation task. An additional 7 anonymized systems were included, and were then evaluated both automatically and manually. The quality estimation task had three subtasks, with a total of 10 teams, submitting 34 entries. The pilot automatic post-editing task had a total of 4 teams, submitting 7 entries.


Introduction
We present the results of the shared tasks of the Workshop on Statistical Machine Translation (WMT) held at EMNLP 2015. This workshop builds on eight previous WMT workshops (Koehn and Monz, 2006; Callison-Burch et al., 2007, 2009, 2010, 2011, 2012; Bojar et al., 2013, 2014). This year we conducted five official tasks: a translation task, a quality estimation task, a metrics task, a tuning task 1, and an automatic post-editing task. In the translation task (§2), participants were asked to translate a shared test set, optionally restricting themselves to the provided training data. We held ten translation tasks this year, between English and each of Czech, French, German, Finnish, and Russian. The Finnish translation tasks were new this year, providing a lesser resourced data condition on a challenging language pair. The system outputs for each task were evaluated both automatically and manually.
1 The metrics and tuning tasks are reported in separate papers (Stanojević et al., 2015a,b).
The human evaluation (§3) involves asking human judges to rank sentences output by anonymized systems. We obtained large numbers of rankings from researchers who contributed evaluations proportional to the number of tasks they entered. We made data collection more efficient and used TrueSkill as the ranking method.
The quality estimation task (§4) this year included three subtasks: sentence-level prediction of post-editing effort scores, word-level prediction of good/bad labels, and document-level prediction of Meteor scores. Datasets were released with English→Spanish news translations for sentence and word level, and English↔German news translations for document level.
The first round of the automatic post-editing task ( §5) examined automatic methods for correcting errors produced by an unknown machine translation system. Participants were provided with training triples containing source, target and human post-editions, and were asked to return automatic post-editions for unseen (source, target) pairs. This year we focused on correcting English→Spanish news translations.
The primary objectives of WMT are to evaluate the state of the art in machine translation, to disseminate common test sets and public training data with published performance numbers, and to refine evaluation and estimation methodologies for machine translation. As before, all of the data, translations, and collected human judgments are publicly available. 2 We hope these datasets serve as a valuable resource for research into statistical machine translation and automatic evaluation or prediction of translation quality.

Overview of the Translation Task
The recurring task of the workshop examines translation between English and other languages. As in the previous years, the other languages include German, French, Czech and Russian.
Finnish replaced Hindi as the special language this year. Finnish is a lesser resourced language compared to the other languages and has challenging morphological properties. Finnish also represents a different language family that we had not tackled since we included Hungarian in 2008 and 2009 (Callison-Burch et al., 2008, 2009).
We created a test set for each language pair by translating newspaper articles and provided training data, except for French, where the test set was drawn from user-generated comments on the news articles.

Test data
The test data for this year's task was selected from online sources, as before. We took about 1500 English sentences and translated them into the other 5 languages, and then an additional 1500 sentences from each of the other languages and translated them into English. This gave us test sets of about 3000 sentences for our English-X language pairs, which were either originally written in English and translated into X, or vice versa.
For the French-English discussion forum test set, we collected 38 discussion threads each from the Guardian for English and from Le Monde for French. See Figure 1 for an example.
The composition of the test documents is shown in Table 1.
The stories were translated by the professional translation agency Capita, funded by the EU Framework Programme 7 project MosesCore, and by Yandex, a Russian search engine company. 3 All of the translations were done directly, and not via an intermediate language.

Training data
As in past years we provided parallel corpora to train translation models, monolingual corpora to train language models, and development sets to tune system parameters. Some training corpora were identical to last year's (Europarl 4, United Nations, French-English 10^9 corpus, CzEng, Common Crawl, Russian-English parallel data provided by Yandex, Russian-English Wikipedia Headlines provided by CMU), some were updated (News Commentary, monolingual data), and new corpora were added (Finnish Europarl, Finnish-English Wikipedia Headline corpus).
3 http://www.yandex.com/
Some statistics about the training materials are given in Figure 2.

Submitted systems
We received 68 submissions from 24 institutions. The participating institutions and their entry names are listed in Table 2; not every system appeared in all translation tasks. We also included 1 commercial off-the-shelf MT system and 6 online statistical MT systems, which we anonymized.
For presentation of the results, systems are treated as either constrained or unconstrained, depending on whether their models were trained only on the provided data. Since we do not know how they were built, these online and commercial systems are treated as unconstrained during the automatic and human evaluations.

Human Evaluation
Following what we had done for previous workshops, we again conduct a human evaluation campaign to assess translation quality and determine the final ranking of candidate systems. This section describes how we prepared the evaluation data, collected human assessments, and computed the final results.
Figure 1 (example from the discussion forum test set): This is perfectly illustrated by the UKIP numbties banning people with HIV. You mean Nigel Farage saying the NHS should not be used to pay for people coming to the UK as health tourists, and saying yes when the interviewer specifically asked if, with the aforementioned in mind, people with HIV were included in not being welcome. You raise a straw man and then knock it down with thinly veiled homophobia. Every time I or my family need to use the NHS we have to queue up behind bigots with a sense of entitlement and chronic hypochondria. I think the straw man is yours. Health tourism as defined by the right wing loonies is virtually none existent. I think it's called democracy. So no one would be affected by UKIP's policies against health tourism so no problem. Only in UKIP La La Land could Carswell be described as revolutionary. Quoting the bollox The Daily Muck spew out is not evidence. Ah, shoot the messenger. The Mail didn't write the report, it merely commented on it. Whoever controls most of the media in this country should undead be shot for spouting populist propaganda as fact. I don't think you know what a straw man is. You also don't know anything about my personal circumstances or identity so I would be very careful about trying to eradicate a debate with accusations of homophobia. Farage's comment came as quite a shock, but only because it is so rarely addressed. He did not express any homophobic beliefs whatsoever. You will just have to find a way of getting over it. I'm not entirely sure what you're trying to say, but my guess is that you dislike the media reporting things you disagree with. It is so rarely addressed because unlike Fararge and his Thatcherite loony disciples who think aids and floods are a signal from the divine and not a reflection on their own ignorance in understanding the complexities of humanity as something to celebrate,then no.
Table 1 (fragment, Russian sources): 168.ru (1), aif (6), altapress.ru (1), argumenti.ru (8), BBC Russian (1), dp.ru (2), gazeta.ru (4), interfax (2), Kommersant (12), lenta.ru (8), lgng (3), mk (5), novinite.ru (1), rbc.ru (1), rg.ru (2), rusplit.ru (1), Sport Express (6), vesti.ru (10).
Table 2: Participants in the shared translation task. Not all teams participated in all language pairs. The translations from the commercial and online systems were not submitted by their respective companies but were obtained by us, and are therefore anonymized in a fashion consistent with previous years of the workshop.
This year's evaluation campaign differed from last year in several ways:
• In previous years each ranking task compared five different candidate systems which were selected without any pruning or redundancy cleanup. This had resulted in a noticeable amount of near-identical ranking candidates in WMT14, making the evaluation process unnecessarily tedious as annotators ran into a fair amount of ranking tasks containing very similar segments which are hard to inspect. For WMT15, we perform redundancy cleanup as an initial preprocessing step and create multi-system translations. As a consequence, we get ranking tasks with varying numbers of candidate systems. To avoid overloading the annotators we still allow a maximum of five candidates per ranking task. If we have more multi-system translations, we choose randomly.
A brief example should illustrate this more clearly: say we have the following two candidate systems:
    sysA="This, is 'Magic'"
    sysX="this is magic"
After lowercasing, removal of punctuation and whitespace normalization, which are our criteria for identifying near-identical outputs, both would be collapsed into a single multi-system:
    sysA+sysX="This, is 'Magic'"
The first representative of a group of near-identical outputs is used as a proxy representing all candidates in the group throughout the evaluation.
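The collapsing step in this example can be sketched in a few lines of Python. This is only an illustration of the three stated criteria (lowercasing, punctuation removal, whitespace normalization), not the code used in the campaign; the function names `normalize` and `collapse` are hypothetical, and restricting punctuation to ASCII via `string.punctuation` is an assumption.

```python
import string

def normalize(text):
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())

def collapse(outputs):
    """Group systems whose normalized outputs are identical.

    `outputs` maps system name -> translation. Returns a list of
    (joined system name, representative translation) pairs, where the
    representative is the output of the first system in each group.
    """
    groups = {}
    for system, text in outputs.items():
        groups.setdefault(normalize(text), []).append((system, text))
    return [
        ("+".join(name for name, _ in members), members[0][1])
        for members in groups.values()
    ]

example = collapse({"sysA": "This, is 'Magic'", "sysX": "this is magic"})
```

Here `example` holds a single entry named `sysA+sysX` whose text is the first system's original output, matching the example above.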
While there is a good chance that users would have used some of the stripped information, e.g., case to differentiate between the two systems relative to each other, the collapsed system's comparison result against the other candidates should be a good approximation of how human annotators would have ranked them individually. We get a near 2x increase in the number of pairwise comparisons, so the general approach seems helpful.
• After dropping external, crowd-sourced translation assessment in WMT14 we ended up with approximately seventy-five percent less raw comparison data. Still, we were able to compute good confidence intervals on the clusters based on our improved ranking approach.
This year, due to the aforementioned cleanup, annotators spent their time more efficiently, resulting in an increased number of final ranking results. We collected a total of 542,732 individual "A > B" judgments this year, nearly double the amount of data compared to WMT14.
• Last year we compared three different models of producing the final system rankings: Expected Wins (used in WMT13), Hopkins and May (HM) and TrueSkill (TS). Overall, we found the TrueSkill method to work best which is why we decided to use it as our only approach in WMT15.
We keep using clusters in our final system rankings, providing a partial ordering (clustering) of all evaluated candidate systems. The semantics remain unchanged from previous years: systems in the same cluster could not be meaningfully distinguished and hence are considered to be of equal quality.

Evaluation campaign overview
WMT15 featured the largest evaluation campaign to date. Similar to last year, we decided to collect researcher-based judgments only. A total of 137 individual annotator accounts were actively involved. Users came from 24 different research groups and contributed judgments on 9,669 HITs. Overall, these correspond to 29,007 individual ranking tasks (plus some more from incomplete HITs), each of which would have spawned exactly 10 individual "A > B" judgments last year, so we expected at least 290,070 binary data points. Due to our redundancy cleanup, we are able to get a lot more, namely 542,732. We report our inter/intra-annotator agreement scores based on the actual work done (otherwise, we'd artificially boost scores based on inferred rankings) and use the full set of data to compute clusters (where the inferred rankings contribute meaningful data).
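The arithmetic behind these figures can be checked with a short calculation. The sketch assumes three ranking tasks per HIT (as stated in the data collection description) and five candidates per task under last year's setup; the variable names are illustrative only.

```python
from math import comb

hits = 9669
tasks_per_hit = 3           # each HIT consists of three ranking tasks
candidates_per_task = 5     # fixed number of candidates before the redundancy cleanup

ranking_tasks = hits * tasks_per_hit
pairs_per_task = comb(candidates_per_task, 2)    # pairwise judgments per ranking task
expected_judgments = ranking_tasks * pairs_per_task

collected = 542732
gain = collected / expected_judgments            # ratio of collected to expected judgments
```

The ratio `gain` comes out a little under 1.9, which is the "nearly double" increase attributed to the redundancy cleanup.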
Human annotation effort was exceptional and we are grateful to all participating individuals and teams. We believe that human rankings provide the best decision basis for machine translation evaluation and it is great to see contributions on this large a scale. In total, our human annotators spent 32 days and 20 hours working in Appraise.
The average annotation time per HIT amounts to 4 minutes 53 seconds. Several annotators passed the mark of 100 HITs annotated, some worked for more than 24 hours. We don't take this enormous amount of effort for granted and will make sure to improve the evaluation platform and overall process for upcoming workshops.

Data collection
The system ranking is produced from a large set of pairwise judgments on the translation quality of candidate systems. Annotations are collected in an evaluation campaign that enlists participants in the shared task to help. Each team is asked to contribute one hundred "Human Intelligence Tasks" (HITs) per primary system submitted.
Each HIT consists of three so-called ranking tasks. In a ranking task, an annotator is presented with a source segment, a human reference translation, and the outputs of up to five anonymized candidate systems, randomly selected from the set of participating systems, and displayed in random order. This year, we perform redundancy cleanup as an initial preprocessing step and create multi-system translations. As a consequence, we get ranking tasks with varying numbers of candidate outputs.
There are two main benefits to this approach:
• Annotators are more efficient as they don't have to deal with near-identical translations which are notoriously hard to differentiate; and
• Potentially, we get higher quality annotations as near-identical systems will be assigned the same "A > B" ranks, improving consistency.
As in previous years, the evaluation campaign is conducted using Appraise 5 (Federmann, 2012), an open-source tool built using Python's Django framework. At the top of each HIT, the following instructions are provided: You are shown a source sentence followed by several candidate translations. Your task is to rank the translations from best to worst (ties are allowed).
Annotators can decide to skip a ranking task but are instructed to do this only as a last resort, e.g., if the translation candidates shown on screen are clearly misformatted or contain data issues (wrong language or similar problems). Only a small number of ranking tasks were skipped in WMT15. A screenshot of the Appraise ranking interface is shown in Figure 3. Annotators are asked to rank the outputs from 1 (best) to 5 (worst), with ties permitted. Note that a lower rank is better. The joint rankings provided by a ranking task are then reduced to the fully expanded set of pairwise rankings produced by considering all $\binom{n}{2} \le 10$ combinations of all $n \le 5$ outputs in the respective ranking task.
For example, consider the following annotation provided among outputs A, B, F, H, and J: [table missing in source]
As the number of outputs n depends on the number of corresponding multi-system translations in the original data, we get varying numbers of resulting binary judgments. Assuming that outputs A and F from above are actually near-identical, the annotator this year would see a shorter ranking task: [table missing in source]
Both examples would be reduced to the following set of pairwise judgments: [list missing in source]
Here, A > B should be read as "A is ranked higher than (worse than) B". Note that by this procedure, the absolute value of ranks and the magnitude of their differences are discarded. Our WMT15 approach including redundancy cleanup makes it possible to obtain these judgments at a lower cognitive cost for the annotators. This partially explains why we were able to collect more results this year.
5 https://github.com/cfedermann/Appraise
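The reduction from one ranking task to pairwise judgments can be sketched as follows. The ranks used here are hypothetical and chosen only for illustration (the original example values are not reproduced), and `expand_ranking` is an illustrative name, not part of Appraise. A collapsed multi-system label such as "A+F" is split into its member systems, which then share a rank.

```python
from itertools import combinations

def expand_ranking(ranks):
    """Reduce one ranking task to its set of pairwise judgments.

    `ranks` maps an output label to its rank (1 = best). Labels joined
    with "+" denote collapsed multi-system outputs. Returns a list of
    (left, relation, right) tuples, where ">" means the left output has
    the higher (worse) rank, following the reading used in the text.
    """
    per_system = {}
    for label, rank in ranks.items():
        for system in label.split("+"):
            per_system[system] = rank
    judgments = []
    for a, b in combinations(sorted(per_system), 2):
        if per_system[a] > per_system[b]:
            relation = ">"
        elif per_system[a] < per_system[b]:
            relation = "<"
        else:
            relation = "="
        judgments.append((a, relation, b))
    return judgments

# Hypothetical ranks: the full task, and the shorter task with A and F collapsed.
full = expand_ranking({"A": 1, "B": 3, "F": 1, "H": 2, "J": 5})
short = expand_ranking({"A+F": 1, "B": 3, "H": 2, "J": 5})
```

Both calls yield the same ten judgments, which is why the shorter task costs the annotator less effort without losing pairwise data.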
For WMT13, nearly a million pairwise annotations were collected from both researchers and paid workers on Amazon's Mechanical Turk, in a roughly 1:2 ratio. Last year, we collected data from researchers only, an ability that was enabled by the use of TrueSkill for producing the partial ranking for each task ( §3.4). This year, based on our redundancy cleanup we were able to nearly double the amount of annotations, collecting 542,732. See Table 3 for more details.

Annotator agreement
Each year we calculate annotator agreement scores for the human evaluation as a measure of the reliability of the rankings. We measured pairwise agreement among annotators using Cohen's kappa coefficient (κ) (Cohen, 1960). If P(A) is the proportion of times that the annotators agree, and P(E) is the proportion of time that they would agree by chance, then Cohen's kappa is:
$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$
Note that κ is basically a normalized version of P(A), one which takes into account how meaningful it is for annotators to agree with each other by incorporating P(E). The values for κ range from 0 to 1, with zero indicating no agreement and 1 perfect agreement. We calculate P(A) by examining all pairs of outputs 6 which had been judged by two or more judges, and calculating the proportion of time that they agreed that A < B, A = B, or A > B. In other words, P(A) is the empirical, observed rate at which annotators agree, in the context of pairwise comparisons.
Figure 3: Screenshot of the Appraise interface used in the human evaluation campaign. The annotator is presented with a source segment, a reference translation, and up to five outputs from competing systems (anonymized and displayed in random order), and is asked to rank these according to their translation quality, with ties allowed.
As for P(E), it captures the probability that two annotators would agree randomly. Therefore:
$$P(E) = P(A<B)^2 + P(A=B)^2 + P(A>B)^2$$
Note that each of the three probabilities in P(E)'s definition is squared to reflect the fact that we are considering the chance that two annotators would agree by chance. Each of these probabilities is computed empirically, by observing how often annotators actually rank two systems as being tied. Table 4 shows final κ values for inter-annotator agreement for WMT11-WMT15 while Table 5 details intra-annotator agreement scores, including the division of researchers (WMT13 r) and MTurk (WMT13 m) data. The exact interpretation of the kappa coefficient is difficult, but according to Landis and Koch (1977), 0-0.2 is slight, 0.2-0.4 is fair, 0.4-0.6 is moderate, 0.6-0.8 is substantial, and 0.8-1.0 is almost perfect.
Table 3: Amount of data collected in the WMT15 manual evaluation campaign. The final four rows report summary information from previous editions of the workshop. Note how many rankings we get for Czech language pairs. These include systems from the tuning shared task. Finnish, as a new language, sees a shortage of rankings for Finnish→English. Interest in French seems to have lowered this year with only seven systems. Overall, we see a nice increase in pairwise rankings, especially considering that we have dropped crowd-source annotation and are instead relying on researchers' judgments exclusively.
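A minimal sketch of this agreement computation is given below. It assumes the input is a list of label pairs, one per doubly-judged pair of outputs, with labels "<", "=" or ">", and it estimates the three chance probabilities by pooling the labels of both annotators; that pooling and the function name `cohen_kappa` are assumptions made for illustration.

```python
from collections import Counter

def cohen_kappa(label_pairs):
    """Kappa over pairwise labels given by two annotators.

    `label_pairs` is a list of (label_1, label_2); each label is one of
    "<", "=", ">". Undefined (division by zero) if only one label occurs.
    """
    n = len(label_pairs)
    # P(A): observed rate at which the two labels coincide.
    p_agree = sum(1 for first, second in label_pairs if first == second) / n
    # P(E): sum of squared empirical label probabilities.
    counts = Counter(label for pair in label_pairs for label in pair)
    total = sum(counts.values())
    p_chance = sum((count / total) ** 2 for count in counts.values())
    return (p_agree - p_chance) / (1 - p_chance)

perfect = cohen_kappa([("<", "<"), (">", ">")])
chance_level = cohen_kappa([("<", "<"), (">", ">"), ("<", ">"), (">", "<")])
```

In the second call the annotators agree on half of the items while chance agreement is also one half, so κ is zero.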
The inter-annotator agreement rates improve for most language pairs. On average, these are the best scores we have ever observed in one of our evaluation campaigns, including in WMT11, where results were inflated due to inclusion of the reference in the agreement rates. The results for intra-annotator agreement are more mixed: some improve greatly (Czech and German) while others degrade (French, Russian). Our special language, Finnish, also achieves very respectable scores. On average, again, we see the best intra-annotator agreement scores since WMT11.
It should be noted that the improvement is not caused by the "ties forced by our redundancy cleanup". If two systems A and F produced near-identical outputs, they are collapsed to one multi-system output AF and treated jointly in our agreement calculations, i.e. only in comparison with other outputs. It is only the final TrueSkill scores that include the tie A = F.

Producing the human ranking
The collected pairwise rankings are used to produce the official human ranking of the systems. For WMT14, we introduced a competition among multiple methods of producing this human ranking, selecting the method based on which could best predict the annotations in a portion of the collected pairwise judgments. The results of this competition were that (a) the competing metrics produced almost identical rankings across all tasks but that (b) one method, TrueSkill, had less variance across randomized runs, allowing us to make more confident cluster predictions. In light of these findings, this year, we produced the human ranking for each task using TrueSkill in the following fashion, following procedures adopted for WMT12: We produce 1,000 bootstrap-resampled runs over all of the available data. We then compute a rank range for each system by collecting the absolute rank of each system in each fold, throwing out the top and bottom 2.5%, and then clustering systems into equivalence classes containing systems with overlapping ranges, yielding a partial ordering over systems at the 95% confidence level.
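The resampling and clustering procedure can be sketched as below. Note that the scoring function here is a simple win ratio standing in for TrueSkill, which is not reimplemented; the function names, the (winner, loser) input format, and the omission of ties are all assumptions made to keep the sketch self-contained.

```python
import random
from collections import defaultdict

def win_ratio_scores(judgments):
    """Stand-in scorer (not TrueSkill): share of comparisons each system wins."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in judgments:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    return {system: wins[system] / total[system] for system in total}

def rank_ranges(judgments, runs=1000, seed=0):
    """Bootstrap-resample the judgments; return a 95% rank range per system."""
    rng = random.Random(seed)
    systems = sorted({system for pair in judgments for system in pair})
    ranks = {system: [] for system in systems}
    for _ in range(runs):
        sample = rng.choices(judgments, k=len(judgments))
        scores = win_ratio_scores(sample)
        ordered = sorted(systems, key=lambda s: -scores.get(s, 0.0))
        for position, system in enumerate(ordered, start=1):
            ranks[system].append(position)
    cut = runs * 25 // 1000  # discard the top and bottom 2.5% of ranks
    ranges = {}
    for system in systems:
        kept = sorted(ranks[system])[cut:runs - cut]
        ranges[system] = (kept[0], kept[-1])
    return ranges

def cluster(ranges):
    """Group systems with overlapping rank ranges into equivalence classes."""
    clusters, current, current_high = [], [], None
    for system in sorted(ranges, key=lambda s: ranges[s]):
        low, high = ranges[system]
        if current and low > current_high:
            clusters.append(current)
            current, current_high = [], None
        current.append(system)
        current_high = high if current_high is None else max(current_high, high)
    clusters.append(current)
    return clusters
```

Calling `cluster(rank_ranges(judgments))` yields the partial ordering; systems that end up in the same inner list could not be separated at the 95% level under the stand-in scorer.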
The full list of the official human rankings for each task can be found in Table 6, which also reports all system scores, rank ranges, and clusters for all language pairs and all systems. The official interpretation of these results is that systems in the same cluster are considered tied. Given the large number of judgments that we collected, it was possible to group on average about two systems in a cluster, even though the systems in the middle are typically in larger clusters.
In Figures 4 and 5, we plotted the human evaluation result against everybody's favorite metric BLEU (some of the outlier online systems are not included to make the graphs viewable). The plots clearly suggest that a fair comparison of systems of different kinds cannot rely on automatic scores. Rule-based systems receive a much lower BLEU score than statistical systems (see for instance English-German, e.g., PROMT-RULE). The same is true to a lesser degree for statistical syntax-based systems (see English-German, UEDIN-SYNTAX) and online systems that were not tuned to the shared task (see Czech-English, CU-TECTO vs. the cluster of tuning task systems TT-*).
Table 5: κ scores measuring intra-annotator agreement, i.e., self-consistency of judges, for the past few years of the human evaluation campaign. Scores are much higher for WMT15 which makes sense as we enforce annotation consistency through our initial preprocessing which joins near-identical translation candidates into multi-system entries. It seems that the focus on actual differences in our annotation tasks as well as the possibility of having "easier" ranking scenarios for n < 5 candidate systems results in a higher annotator agreement, both for inter- and intra-annotator agreement scores.

Quality Estimation Task
The fourth edition of the WMT shared task on quality estimation (QE) of machine translation (MT) builds on the previous editions of the task (Callison-Burch et al., 2012; Bojar et al., 2013, 2014), with tasks including both sentence and word-level estimation, using new training and test datasets, and an additional task: document-level prediction. The goals of this year's shared task were:
• Advance work on sentence- and word-level quality estimation by providing larger datasets.
• Investigate the effectiveness of quality labels, features and learning methods for document-level prediction.
• Explore differences between sentence-level and document-level prediction.
• Analyse the effect of training data sizes and quality for sentence and word-level prediction, particularly the use of annotations obtained from crowdsourced post-editing.
Figure: Human evaluation scores versus BLEU scores for the German-English and Czech-English language pairs illustrate the need for human evaluation when comparing systems of different kind. Confidence intervals are indicated by the shaded ellipses. Rule-based systems and to a lesser degree syntax-based statistical systems receive a lower BLEU score than their human score would indicate. The big cluster in the Czech-English plot are tuning task submissions.

Three tasks were proposed: Task 1 at sentence level (Section 4.3), Task 2 at word level (Section 4.4), and Task 3 at document level (Section 4.5). Tasks 1 and 2 provide the same dataset with English-Spanish translations generated by a statistical machine translation (SMT) system, while Task 3 provides two different datasets, for two language pairs: English-German (EN-DE) and German-English (DE-EN) translations taken from all participating systems in WMT13 (Bojar et al., 2013). These datasets were annotated with different labels for quality: for Tasks 1 and 2, the labels were automatically derived from the post-editing of the machine translation output, while for Task 3, scores were computed based on reference translations using Meteor (Banerjee and Lavie, 2005). Any external resource, including additional quality estimation training data, could be used by participants (no distinction between open and closed tracks was made). As presented in Section 4.1, participants were also provided with a baseline set of features for each task, and a software package to extract these and other quality estimation features and perform model learning, with suggested methods for all levels of prediction. Participants, described in Section 4.2, could submit up to two systems for each task.
Data used to build MT systems or internal system information (such as model scores or n-best lists) were not made available this year as multiple MT systems were used to produce the datasets, especially for Task 3, including online and rule-based systems. Therefore, as a general rule, participants could only use black-box features.

Baseline systems
Sentence-level baseline system: For Task 1, QUEST 7 was used to extract 17 MT system-independent features from the source and translation (target) files and parallel corpora:
• Number of tokens in the source and target sentences.
• Average source token length.
• Average number of occurrences of the target word within the target sentence.
• Number of punctuation marks in source and target sentences.
• Language model (LM) probability of source and target sentences based on models for the WMT News Commentary corpus.
• Average number of translations per source word in the sentence as given by IBM Model 1 extracted from the WMT News Commentary parallel corpus, and thresholded such that P(t|s) > 0.2 / P(t|s) > 0.01.
• Percentage of unigrams, bigrams and trigrams in frequency quartiles 1 (lower frequency words) and 4 (higher frequency words) in the source language extracted from the WMT News Commentary corpus.
• Percentage of unigrams in the source sentence seen in the source side of the WMT News Commentary corpus.
7 https://github.com/lspecia/quest
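A few of the surface features in this list can be computed without any external resources, as the following sketch shows. It is not the QUEST implementation: whitespace tokenization, ASCII punctuation, non-empty sentences, and the feature names are all assumptions, and the language model, IBM Model 1 and corpus frequency features are left out because they need trained models.

```python
import string

PUNCTUATION = set(string.punctuation)

def surface_features(source, target):
    """Surface features for one whitespace-tokenized, non-empty sentence pair."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    return {
        "src_tokens": len(src_tokens),
        "tgt_tokens": len(tgt_tokens),
        "avg_src_token_length": sum(len(t) for t in src_tokens) / len(src_tokens),
        # Average occurrences of a target word within the target sentence
        # (token count divided by the number of distinct tokens).
        "avg_tgt_word_occurrences": len(tgt_tokens) / len(set(tgt_tokens)),
        "src_punctuation": sum(1 for ch in source if ch in PUNCTUATION),
        "tgt_punctuation": sum(1 for ch in target if ch in PUNCTUATION),
    }

features = surface_features("the cat sat , the end .", "el gato , el fin .")
```

Each sentence pair thus maps to a fixed-length numeric vector, which is the form a regression model expects as input.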
These features were used to train a Support Vector Regression (SVR) algorithm using a Radial Basis Function (RBF) kernel within the SCIKIT-LEARN toolkit. 8 The γ, and C parameters were optimised via grid search with 5-fold cross validation on the training set. We note that although the system is referred to as "baseline", it is in fact a strong system. It has proved robust across a range of language pairs, MT systems, and text domains for predicting various forms of post-editing effort (Callison-Burch et al., 2012; Bojar et al., 2013, 2014).
Word-level baseline system: For Task 2, the baseline features were extracted with the MARMOT tool 9. For the baseline system we used a number of features that have been found the most informative in previous research on word-level quality estimation. Our baseline set of features is loosely based on the one described in Luong et al. (2014). It contains the following 25 features:
• Word count in the source and target sentences, source and target token count ratio. Although these features are sentence-level (i.e. their values will be the same for all words in a sentence), the length of a sentence might influence the probability of a word being incorrect.
• Target token, its left and right contexts of one word.
• Source token aligned to the target token, its left and right contexts of one word. The alignments were produced with the force_align.py script, which is part of cdec (Dyer et al., 2010). It allows new parallel data to be aligned with a pre-trained alignment model built with the cdec word aligner (fast_align). The alignment model was trained on the Europarl corpus (Koehn, 2005).
• Boolean dictionary features: whether target token is a stopword, a punctuation mark, a proper noun, a number.
• Target language model features:
  - The order of the highest order n-gram which starts or ends with the target token.
  - Backoff behaviour of the n-grams where t_i is the target token (the backoff behaviour is computed as described in Raybaud et al. (2011)).
• The order of the highest order n-gram which starts or ends with the source token.
• Boolean pseudo-reference feature: 1 if the token is contained in a pseudo-reference, 0 otherwise. The pseudo-reference used for this feature is the automatic translation generated by an English-Spanish phrase-based SMT system trained on the Europarl corpus with standard settings. 10
• The part-of-speech tags of the target and source tokens.
• The number of senses of the target and source tokens in WordNet.
We model the task as a sequence prediction problem and train our baseline system using the Linear-Chain Conditional Random Fields (CRF) algorithm with the CRF++ tool. 11
Document-level baseline system: For Task 3, the baseline features for sentence-level prediction were used. These are aggregated by summing or averaging their values for the entire document. Features that were summed: number of tokens in the source and target sentences and number of punctuation marks in source and target sentences. All other features were averaged. The implementation for document-level feature extraction is available in QUEST++. 12 These features were then used to train an SVR algorithm with RBF kernel using the SCIKIT-LEARN toolkit. The γ, and C parameters were optimised via grid search with 5-fold cross validation on the training set.
Table 7 lists all participating teams submitting systems to any of the tasks. Each team was allowed up to two submissions for each task and language pair. In the descriptions below, participation in specific tasks is denoted by a task identifier.

DCU-SHEFF (Task 2):
The system uses the baseline set of features provided for the task. Two pre-processing data manipulation techniques were used: data selection and data bootstrapping. Data selection filters out sentences which have the smallest proportion of erroneous tokens and are assumed to be the least useful for the task. Data bootstrapping enhances the training data with incomplete training sentences (e.g. the first k words of a sentence of length N, where k < N). This technique creates additional data instances and boosts the importance of errors occurring in the training data. The combination of these techniques doubled the F1 score for the "BAD" class, as compared to a model trained on the entire dataset given for the task. The labelling was performed with a CRF model trained using the CRF++ tool, as in the baseline system.
Table 7: Participants in the WMT15 quality estimation shared task.
[...] and fine-tuned for the quality estimation classification task by back-propagating word-level prediction errors using stochastic gradient descent. In addition to the continuous space deep model, a shallow linear classifier was trained on the provided baseline features and their quadratic expansion. One of the submitted systems (QUETCH) relies on the deep model only, the other (QUETCHPLUS) is a linear combination of the QUETCH system score, the linear classifier score, and binary and binned baseline features. The system combination yielded significant improvements, showing that the deep and shallow models each contribute complementary information to the combination.
LORIA (Task 1): The LORIA system for Task 1 is based on a standard machine learning approach where source-target sentences are described by numerical vectors and SVR is used to learn a regression model between these vectors and quality scores. Feature vectors used the 17 baseline features, two Latent Semantic Indexing (LSI) features and 31 features based on pseudo-references. The LSI approach considers source-target pairs as documents, and projects the TF-IDF words-documents matrix into a reduced numerical space. This leads to a measure of similarity between a source and a target sentence, which was used as a feature. Two of these features were used based on two matrices, one from the Europarl corpus and one from the official training data. Pseudo-references were produced by three online systems. These features measure the intersection between n-gram sets of the target sentence and of the pseudo-references. Three sets of features were extracted from each online system, and a fourth feature was extracted measuring the inter-agreement among the three online systems and the target system.
RTM-DCU (Tasks 1, 2, 3): RTM-DCU systems are based on referential translation machines (RTM) (Biçici, 2013; Biçici and Way, 2014). RTMs propose a language independent approach and avoid the need to access any task- or domain-specific information or resource. The submissions used features that indicate the closeness between instances to the available training data, the difficulty of translating them, and the presence of acts of translation for data transformation. SVR was used for document and sentence-level prediction tasks, also in combination with feature selection or partial least squares, and global linear models with dynamic learning were used for the word-level prediction task.
SAU (Task 2): The SAU submissions used a CRF model to predict the binary labels for Task 2. They rely on 12 basic features and 85 combination features. The ratio between OK and BAD labels was found to be 4:1 in the training set. Two strategies were proposed to solve this problem of label ratio imbalance. The first strategy is to replace "OK" labels with sub-labels to balance label distribution, where the sub-labels are OK_B, OK_I, OK_E and OK (depending on the position of the token in the sentence). The second strategy is to reconstruct the training set to include more "BAD" words.
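The first strategy can be illustrated with a short sketch. The exact relabelling scheme is not specified here, so the code below assumes that positions are counted within each run of consecutive "OK" tokens (beginning, inside, end) and that isolated "OK" tokens keep the plain label; both the scheme and the underscore spelling of the sub-labels are assumptions.

```python
def split_ok_labels(tags):
    """Replace OK with position-dependent sub-labels inside each run of OK tags."""
    out = list(tags)
    i = 0
    while i < len(tags):
        if tags[i] != "OK":
            i += 1
            continue
        # Find the last index j of the run of OK tags starting at i.
        j = i
        while j + 1 < len(tags) and tags[j + 1] == "OK":
            j += 1
        if j > i:  # a run of length 1 keeps the plain OK label
            out[i] = "OK_B"
            out[j] = "OK_E"
            for k in range(i + 1, j):
                out[k] = "OK_I"
        i = j + 1
    return out

relabelled = split_ok_labels(["OK", "OK", "OK", "BAD", "OK", "BAD", "OK", "OK"])
```

Spreading the majority class over several sub-labels reduces the dominance of any single label during training; at prediction time all sub-labels would be mapped back to "OK".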
SHEFF-NN (Tasks 1, 2): SHEFF-NN submissions were based on (i) a Continuous Space Language Model (CSLM) to extract additional features for Task 1 (SHEF-GP and SHEF-SVM), (ii) a Continuous Bag-of-Words (CBOW) model to produce word embeddings as features for Task 2 (SHEF-W2V), and (iii) a combination of features produced by QUEST++ and a feature produced with word embedding models (SHEF-QuEst++). SVR and Gaussian Processes were used to learn prediction models for Task 1, and a CRF algorithm for binary tagging models in Task 2 (Pystruct Linear-chain CRF trained with a structured SVM for system SHEF-W2V, and CRFSuite Adaptive Regularisation of Weight Vector (AROW) and Passive Aggressive (PA) algorithms for system SHEF-QuEst++). Interesting findings for Task 1 were that (i) CSLM features always bring improvements whenever added to either baseline or complete feature sets and (ii) CSLM features alone perform better than the baseline features. For Task 2, the results obtained by SHEF-W2V are promising: although it uses only features learned in an unsupervised fashion (CBOW word embeddings), it was able to outperform the baseline as well as many other systems. Further, combining the source-to-target cosine similarity feature with the ones produced by QUEST++ led to improvements in the F1 of "BAD" labels.
UAlacant (Task 2): The submissions of the Universitat d'Alacant team were obtained by applying the approach in (Esplà-Gomis et al., 2015b), which uses any source of bilingual information available as a black-box in order to spot sub-segment correspondences between a sentence S in the source language and a given translation hypothesis T in the target language. These sub-segment correspondences are used to extract a collection of features that is then used by a multilayer perceptron to determine the word-level predicted score. Three sources of bilingual information available online were used: two online machine translation systems, Apertium 13 and Google Translate; and the bilingual concordancer Reverso Context. 14 Two submissions were made for Task 2: one using only the 70 features described in (Esplà-Gomis et al., 2015b), and one combining them with the baseline features provided by the task organisers.
UGENT (Tasks 1, 2): The submissions for the word-level task used 55 new features in combination with the baseline feature set to train binary classifiers. The new features try to capture either accuracy (meaning transfer from source to target sentence) using word and phrase alignments, or fluency (well-formedness of target sentence) using language models trained on word surface forms and on part-of-speech tags. Based on the combined feature set, SCATE-MBL uses a memory-based learning (MBL) algorithm for binary classification. SCATE-HYBRID uses the same feature set and forms a classifier ensemble using CRFs in combination with the MBL system for predicting word-level quality. For the sentence-level task, SCATE-SVM-single uses a single feature to train SVR models, which is based on the percentage of words that are labelled as "BAD" by the word-level quality estimation system SCATE-HYBRID. SCATE-SVM adds 16 new features to this single feature and the baseline feature set to train SVR models using an RBF kernel. Additional language resources are used to extract the new features for both tasks.
HIDDEN (Task 3): This submission, whose creators preferred to remain anonymous, estimates the quality of a given document by explicitly identifying potential translation errors in it. Translation error detection is implemented as a combination of human expert knowledge and different language processing tools, including named entity recognition, part-of-speech tagging and word alignments.

USAAR-USHEF (Task
In particular, the system looks for patterns of errors defined by human experts, taking into account the actual words and the additional linguistic information. With this approach, a wide variety of errors can be detected: from simple misspellings and typos to complex lack of agreement (in gender, number and tense), or lexical inconsistencies. Each error category is assigned an "importance", again according to human knowledge, and the amount of error in the document is computed as the weighted sum of the identified errors. Finally, the documents are sorted according to this figure to generate the final submission to the ranking variant of Task 3.
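The scoring and ranking step of such an approach can be sketched as below. The error categories and importance weights are invented for illustration; the actual categories and weights used by the submission are not given here.

```python
# Hypothetical error categories and importance weights, for illustration only.
IMPORTANCE = {"misspelling": 1.0, "agreement": 2.0, "lexical_inconsistency": 3.0}

def error_amount(error_counts):
    """Weighted sum of the errors detected in one document."""
    return sum(IMPORTANCE[cat] * n for cat, n in error_counts.items())

def rank_documents(docs):
    """Return document ids ordered from least to most weighted error."""
    return sorted(docs, key=lambda doc_id: error_amount(docs[doc_id]))

# Toy error counts per document (invented).
docs = {
    "doc_a": {"misspelling": 3, "agreement": 1},
    "doc_b": {"lexical_inconsistency": 2},
    "doc_c": {"misspelling": 1},
}
ranking = rank_documents(docs)
```

Because only the order of documents matters for the ranking variant, the absolute scale of the weights is irrelevant; only their relative sizes affect the output.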

Task 1: Predicting sentence-level quality
This task consists in scoring (and ranking) translation sentences according to the percentage of their words that need to be fixed. It is similar to Task 1.2 in WMT14. HTER (Snover et al., 2006b) is used as quality score, i.e. the minimum edit distance between the machine translation and its manually post-edited version in [0,1]. As in previous years, two variants of the results could be submitted:
• Scoring: An absolute HTER score for each sentence translation, to be interpreted as an error metric: lower scores mean better translations.
• Ranking: A ranking of sentence translations for all source sentences from best to worst. For this variant, it does not matter how the ranking is produced (from HTER predictions or by other means). The reference ranking is defined based on the true HTER scores.
Data The data is the same as that used for the WMT15 Automatic Post-editing task, 15 as kindly provided by Unbabel. 16 Source segments are tokenized English sentences from the news domain with at least four tokens. Target segments are tokenized Spanish translations produced by an online SMT system. The human post-editions are a manual revision of the target, collected using Unbabel's crowd post-editing platform. HTER labels were computed using the TERCOM tool 17 with default settings (tokenised, case insensitive, exact matching only), but with scores capped to 1. As training and development data, we provided English-Spanish datasets with 11,271 and 1,000 source sentences, their machine translations, post-editions and HTER scores, respectively. As test data, we provided an additional set of 1,817 English-Spanish source-translation pairs produced by the same MT system used for the training data.
15 http://www.statmt.org/wmt15/ape-task.html
16 https://unbabel.com/
17 http://www.cs.umd.edu/~snover/tercom/
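To make the label concrete, the sketch below computes a simplified, capped HTER-like score. It is not TERCOM: it uses a plain word-level Levenshtein distance with case-insensitive exact matching and ignores block shifts, so its values can be higher than true TER when words are reordered. Function names are ours.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h.lower() == r.lower() else 1  # case-insensitive exact match
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def approx_hter(mt, post_edit):
    """Number of edits divided by post-edition length, capped at 1."""
    hyp, ref = mt.split(), post_edit.split()
    if not ref:
        return 1.0 if hyp else 0.0
    return min(1.0, word_edit_distance(hyp, ref) / len(ref))

# Invented example: one word is misplaced in the MT output.
score = approx_hter("la casa es azul grande", "la casa grande es azul")
```

Without shifts, the misplaced word costs one deletion and one insertion (2 edits over 5 reference words); a shift-aware tool would count it as a single edit.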
Evaluation Evaluation was performed against the true HTER label and/or ranking, using the same metrics as in previous years:
• Scoring: Mean Absolute Error (MAE) (primary metric, official score for ranking submissions), Root Mean Squared Error (RMSE).
• Ranking: DeltaAvg (primary metric) and Spearman's ρ rank correlation.
Additionally, we included Pearson's r correlation against the true HTER label, as suggested by Graham (2015).
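For reference, the scoring metrics and Pearson's r can be computed as in the following sketch, which uses their standard definitions; the toy predictions are invented and the function names are ours.

```python
from math import sqrt

def mae(pred, gold):
    """Mean absolute difference between predictions and true labels."""
    return sum(abs(p - g) for p, g in zip(pred, gold)) / len(gold)

def rmse(pred, gold):
    """Root of the mean squared difference; penalises large errors more than MAE."""
    return sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))

def pearson_r(pred, gold):
    """Pearson's correlation coefficient between predictions and true labels."""
    n = len(gold)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gold)
    return cov / sqrt(var_p * var_g)

# Toy HTER labels and predictions (invented).
gold = [0.0, 0.2, 0.4, 0.6]
pred = [0.1, 0.2, 0.5, 0.5]
scores = (mae(pred, gold), rmse(pred, gold), pearson_r(pred, gold))
```

Note that Pearson's r is undefined for constant predictions (zero variance), whereas MAE and RMSE remain defined; this is one way in which the two families of metrics treat low-variance systems differently.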
Statistical significance on MAE and DeltaAvg was computed using a pairwise bootstrap resampling (1K times) approach with 95% confidence intervals. 18 For Pearson's r correlation, we measured significance using the Williams test, as also suggested in (Graham, 2015).
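A generic paired bootstrap comparison on MAE can be sketched as follows. The exact resampling procedure used for the official results is not detailed here, so this should be read as one common formulation: test instances are resampled with replacement, and we count how often one system has lower MAE than the other on the same resample.

```python
import random

def paired_bootstrap_mae(pred_a, pred_b, gold, samples=1000, seed=0):
    """Fraction of resamples in which system A has lower MAE than system B."""
    rng = random.Random(seed)
    n = len(gold)
    err_a = [abs(p - g) for p, g in zip(pred_a, gold)]
    err_b = [abs(p - g) for p, g in zip(pred_b, gold)]
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample instances with replacement
        # Comparing summed errors is equivalent to comparing MAE on the same resample.
        if sum(err_a[i] for i in idx) < sum(err_b[i] for i in idx):
            wins += 1
    return wins / samples

# Toy data (invented): system A is exact, system B is always off by 0.1.
gold = [0.1, 0.4, 0.3, 0.8, 0.5]
pred_a = list(gold)
pred_b = [g + 0.1 for g in gold]
win_rate = paired_bootstrap_mae(pred_a, pred_b, gold)
```

Under this formulation, system A would be considered significantly better than system B at the 95% level if it wins in at least 95% of the resamples.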
Results Table 8 summarises the results for the ranking variant of Task 1. They are sorted from best to worst using the DeltaAvg metric scores as primary key and the Spearman's ρ rank correlation scores as secondary key.
The results for the scoring variant are presented in Table 9, sorted from best to worst by using the MAE metric scores as primary key and the RMSE metric scores as secondary key.
Pearson's r coefficients for all systems against HTER are given in Table 10. As discussed in (Graham, 2015), the results according to this metric can rank participating systems differently. In particular, we note the SHEF/GP submission, which is deemed significantly worse than the baseline system according to MAE, but substantially better than the baseline according to Pearson's correlation. Graham (2015) argues that the use of MAE as evaluation score for quality estimation tasks is inadequate, as MAE is very sensitive to variance. This means that a system that outputs predictions with high variance is more likely to have a high MAE score, even if the distribution follows that of the true labels. Interestingly, according to Pearson's correlation, the systems are ranked exactly in the same way as according to our DeltaAvg metric. The only difference is that the 4th place is now considered significantly different from the three winning submissions. Graham (2015) also argues that the significance tests used with MAE, based on randomised resampling, assume that the data is independent, which is not the case. Therefore, we apply the suggested Williams significance test for this metric.

Task 2: Predicting word-level quality
The goal of this task is to evaluate the extent to which we can detect word-level errors in MT output. Often, the overall quality of a translated segment is significantly harmed by specific errors in a small proportion of words. Various classes of errors can be found in translations, but for this task we consider all error types together, aiming at making a binary distinction between 'GOOD' and 'BAD' tokens. The decision to bucket all error types together was made because of the lack of sufficient training data that could allow consideration of more fine-grained error tags.
Data This year's word-level task uses the same dataset as Task 1, for a single language pair: English-Spanish. Each instance of the training, development and test sets consists of the following elements:
• Source sentence (English).
• Manual post-edition of the automatic translation.
The binary labels for the datasets were acquired automatically with the TERCOM tool (Snover et al., 2006b). 19 This tool computes the edit distance between a machine-translated sentence and its reference (in this case, its post-edited version). It identifies four types of errors: substitution of a word with another word, deletion of a word (word was omitted by the translation system), insertion of a word (a redundant word was added by the translation system), and word or sequence of words shift (word order error). Every word in the machine-translated sentence is tagged with one of these error types or not tagged if it matches a word from the reference.
All the untagged (correct) words were tagged with "OK", while the words tagged with substitution and insertion errors were assigned the tag "BAD". The deletion errors are not associated with any word in the automatic translation, so we could not consider them. We also disabled the shift errors by running TERCOM with the option '-d 0'. The reason is that searching for shifts introduces significant noise in the annotation. The system cannot discriminate between cases where a word was really shifted and where a word (especially common words such as prepositions, articles and pronouns) was deleted in one part of the sentence and then independently inserted in another part of this sentence, i.e. to correct an unrelated error. The statistics of the datasets are outlined in Table 11.
Evaluation Submissions were evaluated in terms of classification performance against the original labels. The main evaluation metric is the average F1 for the "BAD" class. Statistical significance on F1 for the "BAD" class was computed using approximate randomization tests. 20
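The labelling scheme described in the Data paragraphs above (matched words tagged "OK", substituted or inserted words tagged "BAD", deletions ignored, no shifts) can be approximated with the sketch below. It uses Python's difflib alignment as a stand-in for TERCOM; difflib does not guarantee a minimum-edit-distance alignment, so the tags may differ from the official labels in some cases.

```python
from difflib import SequenceMatcher

def label_mt_words(mt_tokens, pe_tokens):
    """Tag each MT token OK if it aligns to an identical post-edited token, else BAD."""
    tags = ["BAD"] * len(mt_tokens)
    matcher = SequenceMatcher(a=mt_tokens, b=pe_tokens, autojunk=False)
    for op, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if op == "equal":
            tags[i1:i2] = ["OK"] * (i2 - i1)
        # "replace" and "delete" opcodes leave the MT tokens tagged BAD.
        # "insert" opcodes cover words missing from the MT output: there is
        # no MT token to tag, so they are ignored, as in the task data.
    return tags

# Invented example: "negro" is redundant and "pescado" is missing in the MT output.
tags = label_mt_words("el gato negro come".split(), "el gato come pescado".split())
```

In this example the missing word "pescado" leaves no trace in the tag sequence, which illustrates why deletion errors cannot be represented in this labelling scheme.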

Results
The results for Task 2 are summarised in Table 12. The results are ordered by F1 score for the error (BAD) class. Using the F1 score for the word-level estimation task has a number of drawbacks. First of all, we cannot use it as the single metric to evaluate the system's quality. The F1 score of the class "BAD" becomes an inadequate metric when one is also interested in the tagging of correct words. In fact, a naive baseline which tags all words with the class "BAD" would yield 31.75 F1 score for the "BAD" class in the test set of this task, which is close to some of the submissions and by far exceeds the baseline, although this tagging is uninformative.
We could instead use the weighted F1 score, which would lead to a single F1 figure where every class is given a weight according to its frequency in the test set. However, we believe the weighted F1 score does not reflect the real quality of the systems either. Since there are many more instances of the "GOOD" class than there are of the "BAD" class, the performance on the "BAD" class does not contribute much weight to the overall score, and changes in accuracy of error prediction on this less frequent class can go unnoticed. The weighted F1 score for the strategy which tags all words as "GOOD" would be 72.66, which is higher than the score of many submissions. However, similar to the case of tagging all words as "BAD", this strategy is uninformative. In an attempt to find more intuitive ways of evaluating word-level tasks, we introduce a new metric called sequence correlation. It gives higher importance to the instances of the "BAD" class and is robust against uninformative tagging.
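The behaviour of the two trivial taggers can be checked on toy data with the sketch below. The label distribution is invented (20% "BAD" tokens) and is not that of the task data, and the label "OK" is used for the correct class; the functions follow the standard definitions of per-class and frequency-weighted F1.

```python
def f1_for_class(gold, pred, cls):
    """Standard F1 for one class, computed from token-level counts."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(gold, pred, classes=("OK", "BAD")):
    """Per-class F1 weighted by the class frequency in the gold labels."""
    return sum(f1_for_class(gold, pred, c) * gold.count(c) / len(gold) for c in classes)

# Toy gold labels with 20% BAD tokens (not the task data).
gold = ["OK"] * 8 + ["BAD"] * 2
all_bad = ["BAD"] * len(gold)
all_ok = ["OK"] * len(gold)

f1_bad_all_bad = f1_for_class(gold, all_bad, "BAD")
weighted_all_ok = weighted_f1(gold, all_ok)
```

For a proportion p of "BAD" tokens, the all-"BAD" tagger has precision p and recall 1, so its F1 for the "BAD" class is 2p/(1+p); neither trivial tagger conveys any information about where the errors are.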
The basis of the sequence correlation metric is the number of matching labels in the reference and the hypothesis, analogously to a precision metric. However, it has some additional features that are aimed at making it more reliable. We consider the tagging of each sentence separately as a sequence of tags. We divide each sequence into sub-sequences tagged by the same tag, for example, the sequence "OK BAD OK OK OK" will be represented as a list of 3 sub-sequences: [ "OK", "BAD", "OK OK OK" ]. Each sub-sequence also carries information on its position in the original sentence. The sub-sequences of the reference and the hypothesis are then intersected, and the number of matching tags in the corresponding sub-sequences is computed so that every sub-sequence can be used only once. Let us consider the following example:

Reference: OK BAD OK OK OK
Hypothesis: OK OK OK OK OK
Here, the reference has three sub-sequences, as in the previous example, and the hypothesis consists of only one sub-sequence which coincides with the hypothesis itself, because all the words were tagged with the "OK" label. The precision score for this sentence will be 0.8, as 4 of 5 labels match in this example. However, we notice that the hypothesis sub-sequence covers two matching sub-sequences of the reference: word 1 and words 3-5. According to our metric, the hypothesis sub-sequence can be used for the intersection only once, giving either 1 of 5 or 3 of 5 matching words. We choose the highest value and get the score of 0.6. Thus, the intersection procedure downweighs the uninformative hypotheses where all words are tagged with one tag.
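The span matching in this example can be reproduced with the sketch below. It covers only the unweighted intersection step: the tag weights and the span-count ratio of the full metric are omitted, and the one-to-one constraint is enforced greedily by largest overlap, which is a simplification of ours rather than the official implementation.

```python
def spans(tags):
    """Split a tag sequence into (tag, start, end) runs, with end exclusive."""
    runs, start = [], 0
    for i in range(1, len(tags) + 1):
        if i == len(tags) or tags[i] != tags[start]:
            runs.append((tags[start], start, i))
            start = i
    return runs

def matched_tokens(reference, hypothesis):
    """Greedy one-to-one matching of same-tag spans by largest overlap."""
    pairs = []
    for ri, (rt, rs, re_) in enumerate(spans(reference)):
        for hi, (ht, hs, he) in enumerate(spans(hypothesis)):
            overlap = min(re_, he) - max(rs, hs)
            if rt == ht and overlap > 0:
                pairs.append((overlap, ri, hi))
    used_r, used_h, total = set(), set(), 0
    for overlap, ri, hi in sorted(pairs, reverse=True):
        # Each reference span and each hypothesis span may be used only once.
        if ri not in used_r and hi not in used_h:
            used_r.add(ri)
            used_h.add(hi)
            total += overlap
    return total

reference = "OK BAD OK OK OK".split()
hypothesis = "OK OK OK OK OK".split()
unweighted_score = matched_tokens(reference, hypothesis) / len(reference)
```

The single hypothesis span is matched to the longer of the two "OK" reference spans, which yields 3 matching words out of 5.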
In order to compute the sequence correlation we need to get the set of spans for each label in both the prediction and the reference, and then intersect them. A set of spans of each tag t in the string w is computed as follows:
Here λ_t is the weight of a tag t in the overall result. It is inversely proportional to the number of instances of this tag in the reference:
where c_t(ŷ) is the number of words labelled with the label t in the prediction. Thus we give equal importance to all tags.
The sum of matching spans is also weighted by the ratio of the number of spans in the hypothesis and the reference. This is done to downweigh the system tagging if the number of its spans differs from the number of spans provided in the gold standard. This ratio is computed as follows: This ratio is 1 if the number of spans is equal for the hypothesis and the reference, and less than 1 otherwise.
The final score for a sentence is produced as follows:
$$\mathrm{SeqCor}(y,\hat{y}) = \frac{r(y,\hat{y}) \cdot \mathrm{Int}(y,\hat{y})}{|y|}$$
Then the overall sequence correlation for the whole dataset is the average of sentence scores.
Table 13 shows the results of the evaluation according to the sequence correlation metric. The results for the two metrics are quite different: one of the highest scoring submissions according to the F1-BAD score is only the third under the sequence correlation metric, and vice versa: the submissions with the highest sequence correlation feature in 3rd place according to F1-BAD score. However, the system rankings produced by the two metrics are correlated: the Spearman's correlation coefficient between them is 0.65.
The sequence correlation metric gives preference to systems that use sequence labelling (modelling dependencies between the assigned tags). We consider this a desirable feature, as we are generally not interested in maximising the prediction accuracy for individual words, but in maximising the accuracy for word-level labelling in the context of the whole sentence. However, using the TER alignment to tag errors cannot capture "phrase-level errors", and each token is considered independently when the dataset is built. This is a fundamental issue with the current definition of the word-level quality estimation that we intend to address in future work.
Our intuition is that the sequence correlation metric should be closer to human perception of word-level QE than F1 scores. The goal of word-level QE is to identify incorrect segments of a sentence, and the sequence correlation metric evaluates how well the sentence is segmented into correct and incorrect phrases. A system can get very high F1 score by (almost) randomly assigning a correct tag to a word, and giving very little information on correct and incorrect areas in the text. That was illustrated by the WMT14 word-level QE task results, where the baseline strategy that assigned tag "BAD" to all words had significantly higher F1 score than any of the submissions.

Task 3: Predicting document-level quality
Predicting the quality of units larger than sentences can be useful in many scenarios. For example, consider a user searching for information about a product on the web. The user can only find reviews in German but he/she does not speak the language, so he/she uses an MT system to translate the reviews into English. In this case, predictions on the quality of individual sentences in a translated review are not as informative as predictions on the quality of the entire review.
With the goal of exploring quality estimation beyond sentence level, this year we proposed a document-level task for the first time. Due to the lack of large datasets with machine translated documents (by various MT systems), we consider short paragraphs as documents. The task consisted in scoring and ranking paragraphs according to their predicted quality.
Data The paragraphs were extracted from the WMT13 translation task test data (Bojar et al., 2013), using submissions from all participating MT systems. Source paragraphs were randomly chosen using the paragraph markup in the SGML files. For each source paragraph, a translation was taken from a different MT system so as to select approximately the same number of instances from each MT system. We considered EN-DE and DE-EN as language pairs, extracting 1,215 paragraphs for each language pair. 800 paragraphs were used for training and 415 for test.
Since no human annotation exists for the quality of entire paragraphs (or documents), Meteor against reference translations was used as quality label for this task. Meteor was calculated using its implementation within the Asiya toolkit, with the following settings: exact match, tokenised and case insensitive (Giménez and Màrquez, 2010).
Evaluation The evaluation of the paragraph-level task was the same as that for the sentence-level task. MAE and RMSE are reported as evaluation metrics for the scoring task, with MAE as official metric for systems ranking. For the ranking task, DeltaAvg and Spearman's ρ correlation are reported, with DeltaAvg as official metric for systems ranking. To evaluate the significance of the results, bootstrap resampling (1K times) with 95% confidence intervals was used. Pearson's r correlation scores with the Williams significance test are also reported.
Results Table 14 summarises the results of the ranking variant of Task 3. 21 They are sorted from best to worst using the DeltaAvg metric scores as primary key and the Spearman's ρ rank correlation scores as secondary key. RTM-DCU submissions achieved the best scores: RTM-SVR was the winner for EN-DE, and RTM-FS-SVR for DE-EN. For EN-DE, the HIDDEN system did not show significant difference against the baseline. For DE-EN, USHEF/QUEST-DISC-BO, USAAR-USHEF/BFF and HIDDEN were not significantly different from the baseline.
The results of the scoring variant are given in Table 15, sorted from best to worst by using the MAE metric scores as primary key and the RMSE metric scores as secondary key. Again the RTM-DCU submissions scored the best for both language pairs. All systems were significantly better than the baseline. However, the difference between the baseline system and all submissions was much lower in the scoring evaluation than in the ranking evaluation.
21 Results for MAE, RMSE and DeltaAvg are multiplied by 100 to improve readability.
Following the suggestion in (Graham, 2015), Table 16 shows an alternative ranking of systems considering Pearson's r correlation results. The alternative ranking differs from the official ranking in terms of MAE: for EN-DE, RTM-DCU/RTM-FS-SVR is no longer in the winning group, while for DE-EN, USHEF/QUEST-DISC-BO and USAAR-USHEF/BFF did not show statistically significant difference against the baseline. However, as with Task 1 these results are the same as the official ones in terms of DeltaAvg.

Discussion
In what follows, we discuss the main findings of this year's shared task based on the goals we had previously identified for it.

Advances in sentence- and word-level QE
For sentence-level prediction, we used similar data and quality labels as in previous editions of the task: English-Spanish, news text domain and HTER labels to indicate post-editing effort. The main differences this year were: (i) the much larger size of the dataset, (ii) the way post-editing was performed, by a large number of crowdsourced translators, and (iii) the MT system used, an online statistical system. We will discuss items (i) and (ii) later in this section. Regarding (iii), the main implication of using an online system was that one could not have access to many of the resources commonly used to extract features, such as the SMT training data and lexical tables. As a consequence, surrogate resources were used for certain features, including many of the baseline ones, which made them less effective. To avoid relying on such resources, novel features were explored, for example those based on deep neural network architectures (word embeddings and continuous space language models by SHEFF-NN) and those based on pseudo-references (n-gram overlap and agreement features by LORIA).
While it is not possible to compare results directly with those published in previous years, for sentence level we can observe the following with respect to the corresponding task in WMT14 (Task 1.2):
• In terms of scoring, according to the primary metric, MAE, in WMT15 all systems except one were significantly better than the baseline. In both WMT14 and WMT15 only one system was significantly worse than the baseline. However, in WMT14 four others (out of nine) performed no differently from the baseline. This year, no system tied with the baseline: the remaining seven systems were significantly better than the baseline. One could say systems are consistently better this year. It is worth mentioning that the baseline remains the same, but as previously noted, the resources used to extract baseline features are likely to be less useful this year given the mismatch between the data used to produce them and the data used to build the online SMT system.
• In terms of ranking, in WMT14 one system was significantly worse than the baseline, and the four remaining systems were significantly better. This year, all eight submissions are significantly better than the baseline. This can once more be seen as progress from last year's results. These results as well as the general ranking of systems were also found following Pearson's correlation as metric, as suggested by Graham (2015).
For the word-level task, a comparison with the WMT14 corresponding task is difficult to perform, as in WMT14 we did not have a meaningful baseline. The baseline used then for binary classification was to tag all words with the label "BAD". This baseline outperformed all the submissions in terms of F1 for the "BAD" class, but it cannot be considered an appropriate baseline strategy (see Section 4.4). This year the submissions were compared against the output of a real baseline system and the set of baseline features was made available to participants. Although the baseline system itself performed worse than all the submitted systems, some other systems benefited from adding baseline features to their feature sets (UAlacant, UGENT, HDCL).
Considering the feature sets and methods used, the number of participants in the WMT14 word-level task was too small to draw reliable conclusions: four systems for English-Spanish and one system for all other three language pairs. The larger number of submissions this year is already a positive result: 16 submissions from eight teams. Inspecting the systems submitted this and last year, we can speculate about the most promising techniques. Last year's winning system used a neural network trained on pseudo-reference features (namely, features extracted from n-best lists) (Camargo de Souza et al., 2014). This year's winning systems are also based on pseudo-reference features (UAlacant) and deep neural network architectures (HDCL). Luong et al. (2013) had previously reported that pseudo-reference features improve the accuracy of word-level predictions. The two most recent editions of this shared task seem to indicate that the state of the art in word-level quality estimation relies upon such features, as well as the ability to model the relationship between the source and target languages using large datasets.

Effectiveness of quality labels, features and learning methods for document-level QE
The task of paragraph-level prediction received fewer submissions than the other two tasks: four submissions for the scoring variant and five for the ranking variant, for both language pairs. This is understandable as it was the first time the task was run. Additionally, paragraph-level QE is still fairly new as a task. However, we were able to draw some conclusions and learn valuable lessons for future research in the area.
By and large, most features are similar to those used for sentence-level prediction. Discourse-aware features showed only marginal improvements relative to the baseline system (USHEF systems for EN-DE and DE-EN). One possible reason for that is the way the training and test data sets were created, including paragraphs with only one sentence. Therefore, discourse features could not be fully explored as they aim to model relationships and dependencies across sentences, as well as within sentences. In future, data will be selected more carefully in order to consider only paragraphs or documents with more sentences.
Systems applying feature selection techniques, such as USAAR-USHEF/BFF, did not obtain major improvements over the baseline. However, they provided interesting insights by finding a minimum set of baseline features that can be used to build models with the same performance as the entire baseline feature set. These are models with only three features selected as the best combination by exhaustive search.
The winning submissions for both language pairs and variants, from RTM-DCU, explored features based on the source and target side information. These include distributional similarity, closeness of test instances to the training data, and indicators for translation quality. External data was used to select "interpretants", which contain data close to both training and test sets to provide context for similarity judgements.
In terms of quality labels, one problem observed in previous work on document-level QE (Scarton et al., 2015b) is the low variation of scores (in this case, Meteor) across instances of the dataset. Since the data collected for this task included translations from many different MT systems, this was not the case. Table 17 shows the average and standard deviation (STDEV) values for the datasets (both training and test set together). Although the variation is substantial, the average value of the training set is a good predictor. In other words, if we consider the average of the training set scores as the prediction value for all data points in the test set, we obtain results as good as the baseline system. For our datasets, the MAE figure for EN-DE is 10, and for DE-EN 7, the same as the baseline system. We can only speculate that automatically assigned quality labels based on reference translations such as Meteor are not adequate for this task. Other automatic metrics tend to behave similarly to Meteor for document-level (Scarton et al., 2015b). Therefore, finding an adequate quality label for document-level QE remains an open issue. Having humans directly assign quality labels is much more complex than in the sentence and word level cases. Annotation of entire documents, or even paragraphs, becomes a harder, more subjective and much more costly task. For future editions of this task, we intend to collect datasets with human-targeted document-level labels obtained indirectly, e.g. through post-editing.
algorithms specifically targeted at document-level prediction.
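The mean-predictor check described above (using the training-set average as the prediction for every test instance) amounts to the following computation. The scores below are invented toy values, not the task data, and the function name is ours.

```python
from statistics import mean

def mean_predictor_mae(train_scores, test_scores):
    """MAE obtained by predicting the training-set mean for every test instance."""
    prediction = mean(train_scores)
    return mean(abs(prediction - s) for s in test_scores)

# Toy Meteor-like scores (invented).
train = [0.30, 0.35, 0.40, 0.45]
test = [0.25, 0.40, 0.50]
baseline_mae = mean_predictor_mae(train, test)
```

If a trained system does not clearly beat this figure, its MAE says little about whether it has learned anything beyond the central tendency of the labels.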

Differences between sentence-level and document-level QE
The differences between sentence and document-level prediction have not been explored to a great extent. Apart from the discourse-aware features by USHEF, the baseline and other features explored by participating teams for document-level prediction were simple aggregations of sentence-level feature values. Also, none of the submitted systems use sentence-level predictions as features for paragraph-level QE. Although this technique is possible in principle, its effectiveness has not yet been proved.  report promising results when using word-level prediction for sentence-level QE, but inconclusive results when using sentence-level prediction for document-level QE. They considered BLEU, TER and Meteor as quality labels, all leading to similar findings. Once more the use of inadequate quality labels for document-level prediction could have been the reason.
No submission evaluated different machine learning algorithms for this task. The same algorithms as those used for sentence-level prediction were applied by all participating teams.

Effect of training data sizes and quality for sentence and word-level QE
As previously mentioned, the post-editions used for this year's sentence and word-level tasks were obtained through a crowdsourcing platform where translators volunteered to post-edit machine translations. As such, one can expect that not all post-editions will reach the highest standards of professional translation. Manual inspection of a small sample of the data, however, showed that the post-editions were of high quality, although stylistic differences are evident in some cases. This is likely due to the fact that different editors, with different styles and levels of expertise, worked on different segments. Another factor that may have influenced the quality of the post-editions is that segments were fixed out of context. For word level, in particular, a potential issue is that the labelling of the words was done completely automatically, using a tool for alignment based on minimum edit distance (TER).
On the positive side, this dataset is much larger than any we have used before for prediction at any level: nearly 12K segments for training/development, as opposed to a maximum of 2K in previous years. For sentence-level prediction we did not expect massive gains from larger datasets, as it has been shown that small amounts of data can be as effective as, or even more effective than, the entire collection, if selected in a clever way (Beck et al., 2013a,b). However, it is well known that data sparsity is an issue for word-level prediction, so we expected a large dataset to improve results considerably for this task.
Unfortunately, having access to a large number of samples did not seem to bring much improvement in word-level prediction accuracy. The main reason was that the number of erroneous words in the training data was too small compared to the number of correct words: 50% of the sentences had zero incorrect words (15% of the sentences) or fewer than 15% incorrect words (35% of the sentences). Participants used various data manipulation strategies to improve results: filtering of the training data, as in the DCU-SHEFF systems; alternative labelling of the data which discriminates between "OK" labels at the beginning, middle, and end of a good segment; and insertion of additional incorrect words, as in the SAU-KERC submissions. Additionally, most participants in the word-level task leveraged additional data in some way, which points to the need for even larger but more varied post-edited datasets in order to make significant progress in this task.

Automatic Post-editing Task
This year WMT hosted for the first time a shared task on automatic post-editing (APE) for machine translation. The task consists in automatically correcting the errors present in a machine-translated text. As pointed out in Parton et al. (2012) and Chatterjee et al. (2015b), from the application point of view, APE components would make it possible to:
• Improve MT output by exploiting information unavailable to the decoder, or by performing deeper text analysis that is too expensive at the decoding stage;
• Cope with systematic errors of an MT system whose decoding process is not accessible;
• Provide professional translators with improved MT output quality to reduce (human) post-editing effort;
• Adapt the output of a general-purpose MT system to the lexicon/style requested in a specific application domain.
The first pilot round of the APE task focused on the challenges posed by the "black-box" scenario in which the MT system is unknown and cannot be modified. In this scenario, APE methods have to operate at the downstream level (that is after MT decoding), by applying either rule-based techniques or statistical approaches that exploit knowledge acquired from human post-editions provided as training material. The objectives of this pilot were to: i) define a sound evaluation framework for the task, ii) identify and understand the most critical aspects in terms of data acquisition and system evaluation, iii) make an inventory of current approaches and evaluate the state of the art and iv) provide a milestone for future studies on the problem.

Task description
Participants were provided with training and development data consisting of (source, target, human post-edition) triplets, and were asked to return automatic post-editions for a test set of unseen (source, target) pairs.

Data
Training, development and test data were created by randomly sampling from a collection of English-Spanish (source, target, human post-edition) triplets drawn from the news domain. 22 Instances were sampled after applying a series of data cleaning steps aimed at removing duplicates and those triplets in which any of the elements (source, target, post-edition) was either too long or too short compared to the others, or included tags or special problematic symbols. The main reason for random sampling was to induce some homogeneity across the three datasets and, in turn, to increase the chances that correction patterns learned from the training set can also be applied to the test set. The downside of losing information yielded by text coherence (an aspect that some APE systems might take into consideration) has hence been accepted in exchange for a higher error repetitiveness across the three datasets. Table 18 provides some basic statistics about the data.
The training and development sets consist of 11,272 and 1,000 instances, respectively. In each instance:
• The source (SRC) is a tokenized English sentence having a length of at least 4 tokens. This constraint on the source length was imposed in order to increase the chances of working with grammatically correct full sentences instead of phrases or short keyword lists;
• The target (TGT) is a tokenized Spanish translation of the source, produced by an unknown MT system;
• The human post-edition (PE) is a manually-revised version of the target. PEs were collected by means of a crowdsourcing platform developed by the data provider.
Test data (1,817 instances) consists of (source, target) pairs having characteristics similar to those in the training set. Human post-editions of the test target instances were held out to measure system performance.
The data creation procedure adopted, as well as the origin and the domain of the texts pose specific challenges to the participating systems. As discussed in Section 5.4, the results of this pilot task can be partially explained in light of such challenges. This dataset, however, has three major advantages that made it suitable for the first APE pilot: i) it is relatively large (hence suitable to apply statistical methods), ii) it was not previously published (hence usable for a fair evaluation), iii) it is freely available (hence easy to distribute and use for evaluation purposes).

Evaluation metric
System performance is evaluated by computing the distance between automatic and human post-editions of the machine-translated sentences present in the test set (i.e. for each of the 1,817 target test sentences). This distance is measured in terms of Translation Error Rate (TER) (Snover et al., 2006a), an evaluation metric commonly used in MT-related tasks (e.g. in quality estimation) to measure the minimum edit distance between an automatic translation and a reference translation. 23 Systems are ranked based on the average TER calculated on the test set by using the TERcom 24 software: lower average TER scores correspond to higher ranks. Each run is evaluated in two modes, namely: i) case insensitive and ii) case sensitive. Evaluation scripts to compute TER scores in both modalities have been made available to participants through the APE task web page. 25
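As a rough illustration of the metric, the sketch below computes a word-level edit distance normalised by the reference length. It is a simplified approximation and not TERcom: it counts insertions, deletions and substitutions only and omits the shift operation that TER also allows, so it can overestimate the true score. Averaging per-sentence scores is likewise an assumption made for this sketch, and all function names are hypothetical.

```python
def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def approx_ter(hypothesis, reference, case_sensitive=True):
    """Shift-free approximation of TER (in percent) for one sentence pair."""
    if not case_sensitive:
        hypothesis, reference = hypothesis.lower(), reference.lower()
    hyp, ref = hypothesis.split(), reference.split()
    return 100.0 * edit_distance(hyp, ref) / max(len(ref), 1)

def average_ter(hypotheses, references, case_sensitive=True):
    """Mean of the per-sentence approximate TER scores."""
    scores = [approx_ter(h, r, case_sensitive)
              for h, r in zip(hypotheses, references)]
    return sum(scores) / len(scores)

print(approx_ter("the cat sat", "the cat sat down"))            # one missing word
print(approx_ter("The cat", "the cat", case_sensitive=False))   # identical after lowercasing
```

The `case_sensitive` flag mirrors the two evaluation modes: a casing-only difference counts as a substitution in the case sensitive mode and as a match otherwise.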

Baseline
The official baseline is calculated by averaging the distances computed between the raw MT output and the human post-edits. In practice, the baseline APE system is a system that leaves all the test targets unmodified. 26 Baseline results computed for both evaluation modalities (case sensitive/insensitive) are reported in Tables 20 and 21.
Monolingual translation as another term of comparison. To get further insights about the progress with respect to previous APE methods, participants' results are also analysed with respect to another term of comparison: a re-implementation of the state-of-the-art approach first proposed by Simard et al. (2007). 27 For this purpose, a phrase-based SMT system based on Moses is used. Translation and reordering models were estimated following the Moses protocol with default setup using MGIZA++ (Gao and Vogel, 2008) for word alignment. For language modeling we used the KenLM toolkit (Heafield, 2011) for standard n-gram modeling with an n-gram length of 5. Finally, the APE system was tuned on the development set, optimizing TER with Minimum Error Rate Training (Och, 2003). The results of this additional term of comparison, computed for both evaluation modalities (case sensitive/insensitive), are also reported in Tables 20 and 21. For each submitted run, the statistical significance of performance differences with respect to the baseline and the re-implementation of Simard et al. (2007) is calculated with the bootstrap test (Koehn, 2004).

23 Edit distance is calculated as the number of edits (word insertions, deletions, substitutions, and shifts) divided by the number of words in the reference. Lower TER values indicate better MT quality.
24 http://www.cs.umd.edu/~snover/tercom/
25 http://www.statmt.org/wmt15/ape-task.html
26 In this case, since edit distance is computed between each machine-translated sentence and its human-revised version, the actual evaluation metric is the human-targeted TER (HTER). For the sake of clarity, since TER and HTER compute edit distance in the same way (the only difference is in the origin of the correct sentence used for comparison), henceforth we will use TER to refer to both metrics.
27 This is done based on the description provided in Simard et al. (2007). Our re-implementation, however, is not meant to officially represent such approach. Discrepancies with the actual method are indeed possible due to our misinterpretation or to wrong guesses about details that are missing in the paper.

                 Tokens                       Types                      Lemmas
                 SRC      TGT      PE         SRC     TGT     PE         SRC     TGT    PE
Train (11,272)   238,335  257,643  257,879    23,608  25,121  27,101     13,701  7,624  7,689
Dev (1,000)      21,617   23,213   23,098     5,482   5,760   5,966      3,765   2,810  2,819
Test (1,817)     38,244   40,925   40,903     7,990   8,498   8,816      5,307   3,778  3,814

Table 18: Data statistics.
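The bootstrap test mentioned above can be sketched as paired bootstrap resampling over per-sentence scores. This is a minimal illustration under stated assumptions: Koehn (2004) describes the procedure for BLEU, and applying it to per-sentence TER values, the number of resamples, and the function name are choices made here, not details confirmed for the official evaluation.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A has a lower
    total score (i.e. lower TER) than system B."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_samples):
        # Draw n sentence indices with replacement; use the same sample for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) < sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_samples

# Hypothetical per-sentence TER scores (illustrative only).
system_a = [10.0, 20.0, 30.0, 25.0]
system_b = [11.0, 21.0, 31.0, 26.0]
print(paired_bootstrap(system_a, system_b))
```

A fraction of at least 0.95 is commonly read as A being better than B at the 95% confidence level.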

Participants
Four teams participated in the APE pilot task by submitting a total of seven runs. Participants are listed in Table 19; a short description of their systems is provided in the following.
Abu-MaTran. The Abu-MaTran team submitted the output of two statistical post-editing (SPE) systems, both relying on the Moses phrase-based statistical machine translation toolkit and on sentence-level classifiers. The first element of the pipeline, the SPE system, is trained on the automatic translation of the News Commentary v8 corpus from English to Spanish, aligned with its reference. This translation is obtained with an out-of-the-box phrase-based SMT system trained on Europarl v7. Both the translation and the post-editing systems use a 5-gram Spanish LM with modified Kneser-Ney smoothing, trained on News Crawl 2011 and 2012 with KenLM (Heafield, 2011). The second element of the pipeline is a binary classifier that selects the better translation between the given MT output and its automatic post-edition. Two different approaches are investigated: a regression model based on 180 hand-crafted features, trained with a Support Vector Machine (SVM) with a radial basis function kernel to estimate the sentence-level HTER score, and a Recurrent Neural Network (RNN) classifier that uses context word embeddings as input and classifies each word of a sentence as good or bad. An automatic translation to be post-edited is first decoded by the SPE system, then fed into one of the classifiers, identified as SVM180feat and RNN. The HTER estimator selects the translation with the lower score, while the binary word-level classifier selects the translation with fewer bad tags. The official evaluation of the shared task shows an advantage of the RNN approach over the SVM.
FBK. The two runs submitted by FBK (Chatterjee et al., 2015a) are based on combining the statistical phrase-based post-editing approach proposed by Simard et al. (2007) and its most significant variant proposed by Béchara et al. (2011). The APE systems are built in an incremental manner. At each stage of the APE pipeline, the best configuration of a component is decided and then used in the next stage. The APE pipeline begins with the selection of the best language model from several language models trained on different types and quantities of data. The next stage addresses the possible data sparsity issues raised by the relatively small size of the training data. Indeed, an analysis of the original phrase table obtained from the training set revealed that a large part of its entries consists of instances that occur only once in the training data. This has the obvious effect of collecting potentially unreliable "translation" (or, in the case of APE, correction) rules. The problem is exacerbated by the "context-aware" approach proposed by Béchara et al. (2011), which builds the phrase table by joining source and target tokens, thus breaking down the co-occurrence counts into smaller numbers. To cope with this problem, a novel feature (neg-impact) is designed to prune the phrase table by measuring the usefulness of each translation option. The higher the value of the neg-impact feature, the less useful the translation option. After pruning, the final stage of the APE pipeline tries to improve the decoder's ability to select the correct translation rule by introducing new task-specific features integrated in the model. These features measure the similarity and the reliability of the translation options and help to improve the precision of the resulting APE system.
LIMSI. For the first edition of the APE shared task LIMSI submitted two systems (Wisniewski et al., 2015). The first one is based on the approach of Simard et al. (2007) and considers the APE task as a monolingual translation between a translation hypothesis and its post-edition. This straightforward approach does not succeed in improving translation quality. The second submitted system implements a series of sieves, each applying a simple post-editing rule. The definition of these rules is based on an analysis of the most frequent error corrections and aims at: i) predicting word case; ii) predicting exclamation and interrogation marks; and iii) predicting verbal endings. Experiments with this approach show that this system also hurts translation quality. An in-depth analysis revealed that this negative result is mainly explained by two reasons: i) most of the post-edition operations are nearly unique, which makes it very difficult to generalize from a small amount of data; and ii) even when they are not, the high variability of post-editing, already pointed out by Wisniewski et al. (2013), results in predicting legitimate corrections that have not been made by the annotators, thereby preventing any improvement over the baseline.

USAAR-SAPE. The USAAR-SAPE system (Pal et al., 2015b) is designed with three basic components: corpus preprocessing, hybrid word alignment and a state-of-the-art phrase-based SMT system integrated with the hybrid word alignment. The preprocessing of the training corpus is carried out by stemming the Spanish MT output and the PE data using Freeling (Padró and Stanilovsky, 2012). The hybrid word alignment method combines different kinds of word alignment: GIZA++ word alignment with the grow-diag-final-and (GDFA) heuristic (Koehn, 2010), SymGiza++ (Junczys-Dowmunt and Szał, 2011), the Berkeley aligner (Liang et al., 2006), and the edit distance-based aligners (Snover et al., 2006a; Lavie and Agarwal, 2007). These different word alignment tables (Pal et al., 2013) are combined by a mathematical union method. For the phrase-based SMT system, various maximum phrase lengths for the translation model and n-gram settings for the language model are used. The best results in terms of BLEU (Papineni et al., 2002) score are achieved by a maximum phrase length of 7 for the translation model and a 5-gram language model.
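Combining alignments by union can be pictured with the following minimal sketch. It assumes that each aligner's output for a sentence pair is available as a set of (MT token index, PE token index) links; the function names and the toy links are hypothetical, and the actual USAAR-SAPE implementation may differ.

```python
def union_alignments(*alignments):
    """Union of several word alignments for one sentence pair.

    Each alignment is an iterable of (mt_index, pe_index) links."""
    merged = set()
    for links in alignments:
        merged.update(links)
    return sorted(merged)

def to_link_string(links):
    """Serialise links in the common 'i-j' text format for alignment files."""
    return " ".join(f"{i}-{j}" for i, j in links)

# Hypothetical links produced by two aligners for the same sentence pair.
statistical_links = [(0, 0), (1, 1)]
edit_distance_links = [(1, 1), (2, 1)]
print(to_link_string(union_alignments(statistical_links, edit_distance_links)))
```

Taking the union favours recall: any link proposed by at least one aligner is kept, at the cost of also keeping links on which the aligners disagree.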

Results
The official results achieved by the participating systems are reported in Tables 20 and 21. The seven runs submitted are sorted based on the average TER they achieve on test data. Table 20 shows the results computed in case sensitive mode, while Table 21 provides scores computed in the case insensitive mode.
Both rankings reveal an unexpected outcome: none of the submitted runs was able to beat the baselines (i.e. average TER scores of 22.91 and 22.22 respectively for case sensitive and case insensitive modes). All differences with respect to such baselines, moreover, are statistically significant. In practice, this means that what the systems learned from the available data was not reliable enough to yield valid corrections of the test instances. A deeper discussion about the possible causes of this unexpected outcome is provided in Section 5.4.
Unsurprisingly, for all participants the case insensitive evaluation results are slightly better than the case sensitive ones. Although the two rankings are not identical, none of the systems was particularly penalized by the case sensitive evaluation. Indeed, individual differences in the two modes are always close to the same value (approximately 0.7 TER points) measured for the two baselines. In light of this, and considering the importance of case sensitive evaluation in some language settings (e.g. having German as target), future rounds of the task will likely prioritize this stricter evaluation mode.
[Table 20 (case sensitive), partial: (Simard et al., 2007) 23.839; Abu-MaTran Contrastive 24.715]
[Table 21 (case insensitive), partial: (Simard et al., 2007) 23.130; Abu-MaTran Contrastive 23.705]
Overall, the close results achieved by participants reflect the fact that, despite some small variations, all systems share the same underlying statistical approach of Simard et al. (2007). As anticipated in Section 5.1, in order to get a rough idea of the extent of the improvements over this state-of-the-art method, we replicated it and considered its results as another term of comparison in addition to the baselines. As shown in Tables 20 and 21, the results achieved by our implementation of Simard et al. (2007) are 23.839 and 23.130 TER for the case sensitive and case insensitive evaluations, respectively. Compared to these scores, most of the submitted runs achieve better performance, with positive average TER differences that are always statistically significant. We interpret this as a good sign: despite the difficulty of the task, the novelties introduced by each system allowed significant steps forward with respect to a prior reference technique.

Discussion
To better understand the results and gain useful insights about this pilot evaluation round, we perform two types of analysis. The first one is focused on the data, and aims to understand the possible reasons for the difficulty of the task. In particular, by analysing the challenges posed by the origin and the domain of the text material used, we try to find indications for future rounds of the APE task. The second type of analysis focuses on the systems and their behaviour. Although they share the same underlying approach and achieve similar results, we aim to check whether interesting differences can be captured by a more fine-grained analysis that goes beyond rough TER measurements.

Data analysis
In this section we investigate the possible relation between participants' results and the nature of the data used in this pilot task (e.g. quantity, sparsity, domain and origin). For this purpose, we take advantage of a new dataset, the Autodesk Post-Editing Data corpus, 28 which was publicly released after the organisation of the APE pilot task. Although it was not usable for this first round, its characteristics make it particularly suitable for our analysis purposes. In particular: i) Autodesk data predominantly covers the domain of software user manuals (that is, a restricted domain compared to a general one like news), and ii) post-edits come from professional translators (that is, at least in principle, a more reliable source of corrections compared to a crowdsourced workforce). To guarantee a fair comparison, English-Spanish (source, target, human post-edition) triplets drawn from the Autodesk corpus are split into training, development and test sets under the constraint that the total number of target words and the TER in each set should be similar to those of the APE task splits. In this setting, performance differences between systems trained on the two datasets will only depend on the different nature of the data (e.g. domain). Statistics of the training sets are reported in Table 22 (APE task data are the same as in Table 18).
The impact of data sparsity. A key issue in most evaluation settings is the representativeness of the training data with respect to the test set used. In the case of the statistical approach at the core of all the APE task submissions, this issue is even more relevant given the limited amount of training data available. In the APE scenario, data representativeness refers to the extent to which the correction patterns learned from the training set can also be applied to the test set (as mentioned in Section 5.1, in the data creation phase random sampling from an original data collection was applied for this purpose). From this point of view, dealing with restricted domains such as software user manuals should be easier than working with news data. Indeed, restricted domains are more likely to feature smaller vocabularies, to be more repetitive (or, in other terms, less sparse) and, in turn, to yield a higher applicability of the learned error correction patterns.
To check the relation between task difficulty and data repetitiveness, we compared different monolingual indicators (i.e. number of types and lemmas, and repetition rate, RR 29) computed on the APE and the Autodesk source, target and post-edited sentences. Although both datasets have the same amount of target tokens, Table 22 shows that the APE training set has nearly twice as many types and lemmas as the Autodesk data, which indicates the presence of less repeated information. A similar conclusion can be drawn by observing that the Autodesk dataset has a repetition rate that is more than twice the value computed for the APE task data.
29 Repetition rate measures the repetitiveness inside a text by looking at the rate of non-singleton n-gram types (n=1...4) and combining them using the geometric mean. A larger value means more repetitions in the text. For more details see Cettolo et al. (2014).
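The repetition rate can be sketched as follows, following the description given in the footnote: for each n-gram order from 1 to 4, the fraction of n-gram types that occur more than once, combined by geometric mean. This is a simplified sketch computed over the whole text at once; refinements described by Cettolo et al. (2014) are not reproduced here, and the function name and toy texts are hypothetical.

```python
from collections import Counter
from math import prod

def repetition_rate(tokens, max_n=4):
    """Geometric mean, over n = 1..max_n, of the rate of non-singleton n-gram types."""
    rates = []
    for n in range(1, max_n + 1):
        counts = Counter(tuple(tokens[i:i + n])
                         for i in range(len(tokens) - n + 1))
        if not counts:
            return 0.0  # text shorter than n tokens
        non_singletons = sum(1 for c in counts.values() if c > 1)
        rates.append(non_singletons / len(counts))
    return prod(rates) ** (1.0 / max_n)

repetitive = "a b a b a b".split()
varied = "a b c d e f".split()
print(repetition_rate(repetitive), repetition_rate(varied))
```

A text in which every n-gram type is a singleton at some order receives a rate of zero, which is why sparse, varied data scores low on this indicator.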
This monolingual analysis does not provide any information about the level of repetitiveness of the correction patterns made by the post-editors, because it does not link the target and the post-edited sentences. To investigate this aspect, two instances of the re-implemented approach of Simard et al. (2007) introduced in Section 5.1 are trained on the APE and the Autodesk training sets, respectively. We consider the distribution of the frequency of the translation options in the phrase table as a good indicator of the level of repetitiveness of the corrections in the data. For instance, a large number of translation options that appear just once or only a few times in the data indicates a higher level of sparseness. As expected given the limited size of the training set, the vast majority of the translation options in both phrase tables are singletons, as shown in Table 23. Nevertheless, the Autodesk phrase table is more compact (731k versus 1,066k) and contains 10% fewer singletons than the APE task phrase table. This confirms that the APE task data is more sparse and suggests that it might be easier to learn more applicable correction patterns from the Autodesk domain-specific data.
To verify this last statement, the two APE systems are evaluated on their own test sets. As previously shown, the system trained on the APE task data is not able to improve over the performance achieved by a system that leaves all the test targets unmodified (see Table 20). On the contrary, starting from a baseline of 23.57, the system trained on the Autodesk data is able to reduce the TER by 3.55 points (20.02). Interestingly, the Autodesk APE system is able to correctly fix the target sentences and improve the TER by 1.43 points even with only 25% of the training data. These results confirm our intuitions about the usefulness of repetitive data and show that, at least in restricted-domain scenarios, automatic post-editing can be successfully used as an aid to improve the output of an MT system.

Professional vs. Crowdsourced post-editions
Unlike the Autodesk data, for which the post-editions are created by professional translators, the APE task data contains crowdsourced MT corrections collected from unknown (likely non-expert) translators. One risk, given the high variability of valid MT corrections, is that the crowdsourced workforce follows post-editing attitudes and criteria that differ from those of professional translators. Professionals tend to: i) maximize productivity by making only the necessary and sufficient corrections to improve translation quality, and ii) follow consistent translation criteria, especially for domain terminology. Such a tendency will likely result in coherent and minimally post-edited data from which learning and drawing statistics is easier. This is not guaranteed with crowdsourced workers, who do not have specific time or consistency constraints. This suggests that non-professional post-editions and the correction patterns learned from them will feature less coherence, higher noise and higher sparsity.
To assess the potential impact of these issues on data representativeness (and, in turn, on the task difficulty), we analyse a subset of the APE test instances (221 triples randomly sampled) in which target sentences were post-edited by professional translators. The analysis focuses on TER scores computed between:
3. The crowdsourced post-editions and the professional ones, using the latter as references (avg. TER = 29.18).
The measured values indicate a tendency of non-professionals to correct more often than, and differently from, the professional translators. Interestingly, and similar to the findings of Potet et al. (2012), crowdsourced post-editions feature a distance from the professional ones that is even higher than the distance between the original target sentences and the experts' corrections (29.18 vs. 23.85). If we consider the output of professional translators as a gold standard (made of coherent and minimally post-edited data), these figures suggest a higher difficulty in handling crowdsourced corrections. Further insights can be drawn from the analysis of the word-level corrections produced by the two translator profiles. To this aim, word insertions, deletions, substitutions and phrase shifts are extracted using the TERcom software, similar to Blain et al. (2012) and Wisniewski et al. (2013). For each error type, the ratio between the number of unique edit operations and the total number of edit operations performed is computed. This quantity provides us with a measure of the level of repetitiveness of the errors, with 100% indicating that all the error patterns are unique, and small values indicating that most of the errors are repeated. Our results show that non-experts generally have larger ratio values than the professional translators (insertion +6%, substitution +4%, deletion +4%). This seems to support our hypothesis that, independently of their quality, post-editions collected from non-experts are less coherent than those derived from professionals. It is unlikely that different crowdsourced workers will apply the same corrections in the same contexts. If this hypothesis holds, the difficulty of this APE pilot task could be partially ascribed to this unavoidable intrinsic property of crowdsourced data. This aspect, however, should be further investigated to draw definite conclusions.
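The repetitiveness ratio discussed above can be illustrated with a short sketch. It assumes that the ratio is the number of distinct edit operations of a given type divided by the total number of operations of that type, which is our reading of the statement that 100% means all error patterns are unique. The substitution pairs below are invented for illustration and are not taken from the task data.

```python
def uniqueness_ratio(edits):
    """Percentage of distinct edit operations among all operations of one type."""
    if not edits:
        return 0.0
    return 100.0 * len(set(edits)) / len(edits)

# Hypothetical substitutions as (MT word, post-edited word) pairs.
crowd_substitutions = [("casa", "hogar"), ("el", "la"),
                       ("rojo", "roja"), ("de", "del")]
expert_substitutions = [("el", "la"), ("el", "la"),
                        ("rojo", "roja"), ("el", "la")]

print(uniqueness_ratio(crowd_substitutions))   # every correction is different
print(uniqueness_ratio(expert_substitutions))  # the same correction recurs
```

Lower values mean that the same corrections recur, which is the property a statistical APE system needs in order to generalise from training to test data.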

System/performance analysis
The TER results presented in Tables 20 and 21 show small differences between participants, but they do not shed light on the real behaviour of the systems. To this aim, in this section the submitted runs are analysed by taking into consideration the changes made by each system to the test instances (case sensitive evaluation mode). In particular, Table 24 provides the number of modified, improved and deteriorated sentences, together with the percentage of edit operations performed (insertions, deletions, substitutions, shifts). Looking at these numbers, the following conclusions can be drawn.
Although it varies considerably between the submitted runs, the number of modified sentences is quite small. Moreover, a general trend can be observed: the best systems are the most conservative ones. This situation likely reflects the aforementioned data sparsity and coherence issues. A small fraction of the correction patterns found in the training set seems to be applicable also to the test set, and the risk of performing corrections that are either wrong, redundant, or different from those in the reference post-editions is rather high.
From the system point of view, the context in which a learned correction pattern will be applied is crucial. For instance, the same word substitution (e.g. "house" → "home") is not applicable in all contexts. While sometimes it will be necessary (Example 1: "The house team won the match"), in some contexts it is optional (Example 2: "I was in my house") or wrong (Example 3: "He worked for a brokerage house"). Unfortunately, the unnecessary word replacement in Example 2 (human post-editors would likely leave it untouched) would increase the TER of the sentence exactly as much as the clearly wrong replacement in Example 3.
From the evaluation point of view, not penalising such correct but unnecessary corrections is also crucial. Similar to MT, where a source sentence can have many valid translations, in the APE task a target sentence can have many valid post-editions. Indeed, nothing prevents some correct post-editions from being considered as "deteriorated" sentences in our evaluation simply because they differ from the human post-editions used as references. As in MT, this well-known variability problem might penalise good systems, thus calling for alternative evaluation criteria (e.g. based on multiple references or sensitive to paraphrase matches). Interestingly, for all the systems the number of modified sentences is higher than the sum of the improved and the deteriorated ones. The difference is represented by modified sentences for which the corrections do not yield TER variations. This grey area makes the evaluation problem related to variability even more evident.
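The relation between modified, improved and deteriorated sentences can be made concrete with a small sketch. It assumes that, for each test sentence, the MT output, the APE output and the TER of each against the human post-edition are available; the record layout, the function name and the values are hypothetical.

```python
def summarise_changes(records):
    """records: iterable of (mt_output, ape_output, ter_before, ter_after)."""
    counts = {"modified": 0, "improved": 0, "deteriorated": 0, "same_ter": 0}
    for mt_output, ape_output, ter_before, ter_after in records:
        if ape_output == mt_output:
            continue  # the system left the sentence untouched
        counts["modified"] += 1
        if ter_after < ter_before:
            counts["improved"] += 1
        elif ter_after > ter_before:
            counts["deteriorated"] += 1
        else:
            counts["same_ter"] += 1  # changed, but equally distant from the reference
    return counts

# Hypothetical records (illustrative values only).
records = [
    ("a b c", "a b c", 20.0, 20.0),  # untouched
    ("a b c", "a b d", 20.0, 10.0),  # improved
    ("a b c", "a x c", 20.0, 30.0),  # deteriorated
    ("a b c", "a y c", 20.0, 20.0),  # modified, same TER
]
print(summarise_changes(records))
```

The `same_ter` bucket corresponds to the grey area described above: sentences that were changed without moving closer to or further from the single reference.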
The analysis of the edit operations performed by each system is not particularly informative. As with the overall performance results, the proportion of correction types the systems perform reflects the adoption of the same underlying statistical approach. The distribution of the four types of edit operations is almost identical, with a predominance of lexical substitutions (55.7%-57.7%) and rather few phrasal shifts (8.0%-8.6%).

Lessons learned and outlook
The objectives of this pilot APE task were to: i) define a sound evaluation framework for future rounds, ii) identify and understand the most critical aspects in terms of data acquisition and system evaluation, iii) make an inventory of current approaches and evaluate the state of the art, and iv) provide a milestone for future studies on the problem. With respect to the first point, improving the evaluation is possible, but no major issues emerged that would require radical changes in future evaluation rounds. For instance, using multiple references or a metric sensitive to paraphrase matches to cope with variability in post-editing would certainly help.
Concerning the most critical aspects of the evaluation, our analysis highlighted the strong dependence of system results on data repetitiveness/representativeness. This calls into question the actual usability of text material coming from general domains like news and, probably, of post-editions collected from crowdsourced workers (this aspect, however, should be further investigated to draw definite conclusions). Nevertheless, it is worth noting that collecting a large, unpublished, public, domain-specific and professional-quality dataset is a goal that is hard to achieve and will always require compromise solutions.
Regarding the approaches proposed, this first experience was a conservative but, at the same time, promising first step. Although participants performed the task sharing the same statistical approach to APE, the slight variants they explored allowed them to outperform the widely used monolingual translation technique. Moreover, the analysis of the results also suggests a possible limitation of this state-of-the-art approach: by always performing all the applicable correction patterns, it runs the risk of deteriorating the input translations that it was supposed to improve. This limitation, common to all the participating systems, points to a major difference between the APE task and the MT framework. In MT the system must always process the entire source sentence by translating all of its words into the target language. In the APE scenario, instead, the system has another option for each word: keeping it untouched. A reasonable (and this year unbeaten) baseline is in fact a system that applies this conservative strategy to all the words. By raising this and other issues as promising research directions, attracting researchers' attention to a challenging application-oriented task, and establishing a sound evaluation framework to measure future advancements, this pilot has substantially achieved its goals, paving the way for future rounds of the APE evaluation exercise.

A Pairwise System Comparisons by Human Judges
Tables 25-34 show pairwise comparisons between systems for each language pair. The numbers in each of the tables' cells indicate the percentage of times that the system in that column was judged to be better than the system in that row, ignoring ties. Bolding indicates the winner of the two systems. Because there were so many systems and data conditions, the significance of each pairwise comparison needs to be quantified. We applied the Sign Test to measure which comparisons indicate genuine differences (rather than differences that are attributable to chance). In the following tables ⋆ indicates statistical significance at p ≤ 0.10, † indicates statistical significance at p ≤ 0.05, and ‡ indicates statistical significance at p ≤ 0.01, according to the Sign Test.
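As an illustration of how a single table cell could be derived, the following minimal sketch computes the win percentage (ignoring ties) and a Sign Test marker from raw win counts. It is a sketch under assumptions, not the script used for the tables: the text does not state whether the test was one-sided or two-sided, so a two-sided exact binomial test is assumed, and the function names, the example counts, and the ⋆ symbol for the p ≤ 0.10 level are ours.

```python
from math import comb


def sign_test_p_value(wins_a, wins_b):
    """Exact two-sided Sign Test p-value; ties must already be excluded.

    Assumption: two-sided test against the null hypothesis that each
    system is equally likely to win a non-tied comparison.
    """
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # Probability of seeing k or fewer wins for the weaker system.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def significance_marker(p):
    """Map a p-value to a significance marker (symbols assumed)."""
    if p <= 0.01:
        return "‡"
    if p <= 0.05:
        return "†"
    if p <= 0.10:
        return "⋆"
    return ""


def pairwise_cell(wins_col, wins_row):
    """Percentage of non-tied judgments won by the column system, plus marker."""
    n = wins_col + wins_row
    pct = 100.0 * wins_col / n if n else float("nan")
    return pct, significance_marker(sign_test_p_value(wins_col, wins_row))


# Hypothetical counts: the column system wins 13 of 18 non-tied comparisons.
pct, marker = pairwise_cell(13, 5)
print(f"{pct:.1f}{marker}")
```

Because ties are discarded before the test, a pair of systems that is mostly tied contributes few observations and will rarely reach significance, however lopsided the remaining judgments are.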