Findings of the WMT 2017 Biomedical Translation Shared Task

Automatic translation of documents is an important task in many domains, including the biological and clinical domains. The second edition of the Biomedical Translation task at the Conference on Machine Translation focused on the automatic translation of biomedical-related documents between English and various European languages. This year, we addressed ten languages: Czech, German, English, French, Hungarian, Polish, Portuguese, Spanish, Romanian and Swedish. Test data included both scientific publications (from the Scielo and EDP Sciences databases) and health-related news (from the Cochrane and UK National Health Service websites). Seven teams participated in the task, submitting a total of 82 runs. Herein we describe the datasets, the participating systems and the results of both the automatic and manual evaluation of the translations.


Introduction
Automatic translation of texts allows readers to gain access to information present in documents written in a language in which the reader is not fluent. We identify two main use cases of machine translation (MT) in the biomedical domain: (a) making health information available to health professionals and the general public in their own language; and (b) assisting health professionals and researchers in writing reports of their research in English. In addition, it creates an opportunity for natural language processing (NLP) tools to be applied to domain-specific texts in languages for which few domain-relevant tools are available; i.e., the texts can be translated into a language for which there are more resources.
The second edition of the Biomedical Translation task at the Conference on Machine Translation (WMT) builds on the first edition (Bojar et al., 2016) by offering seven additional language pairs and new test sets. This year, we expanded the biomedical task to a total of ten languages, namely, Czech (cs), German (de), English (en), French (fr), Hungarian (hu), Polish (pl), Portuguese (pt), Spanish (es), Romanian (ro) and Swedish (sv). Test sets included scientific publications from the Scielo and EDP Sciences databases and health-related news from Cochrane and the UK National Health Service (NHS).
Participants were challenged to build systems to translate from English into all the other languages, as well as from French, Spanish and Portuguese into English. We provided both training and development data, but the teams were allowed to use additional in-domain or out-of-domain training data. After the release of the test sets, the participants had 10 days to submit results (automatic translations) for any of the test sets and languages. We allowed up to three runs per team for each language pair and test set.
We evaluated the submissions both automatically and manually. In this work, we report details of the challenge, the test sets, the participating teams, the results they obtained and the quality of the automatic translations.

Training and test sets
We released test sets from four sources, namely, Scielo, EDP, Cochrane and NHS, as presented in Table 1. For training and development data, we referred participants to various biomedical corpora: (a) the Biomedical Translation Corpora Repository, which includes titles from MEDLINE and the Scielo corpus; (b) the UFAL Medical Corpus, which includes EMEA and PatTR Medical, among others; and (c) the development data from the Khresmoi project. We provide details of the test sets below.
Scielo. Similar to last year, this dataset consisted of titles and abstracts of scientific publications retrieved from the Scielo database and addressed the following language pairs: es/en, en/es, pt/en and en/pt. There were not enough articles indexed in 2017 with French titles or abstracts, so we relied on another source for the en/fr and fr/en language pairs (namely, EDP, as described below). As in the previous edition, we crawled the Scielo site for publications containing titles and abstracts in both English and Spanish, or in both English and Portuguese. We considered only articles published in 2017 up to the crawl date (April 2017). We tokenized the documents using Apache OpenNLP (with specific models for each language). The test set was created by automatically aligning sentences with the GMA tool. We manually checked the alignment of a sample and confirmed that around 88% of the sentences were correctly aligned.
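As an illustration of how such a sample-based accuracy figure can be derived, the sketch below computes a point estimate and a normal-approximation confidence interval from manual correct/incorrect labels; the function and the sample are hypothetical, not part of the task pipeline:

```python
import math

def alignment_accuracy(labels, z=1.96):
    """Estimate alignment accuracy from a manually checked sample.

    labels: booleans, True if a sampled sentence pair was judged correctly
    aligned. Returns (point_estimate, margin) for an approximate 95% interval.
    """
    n = len(labels)
    p = sum(labels) / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin

# Hypothetical sample: 88 of 100 pairs judged correct -> 0.88 +/- ~0.06.
print(alignment_accuracy([True] * 88 + [False] * 12))
```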
EDP. Titles and abstracts of scientific publications were collected from the open-access publisher EDP Sciences on March 15, 2017. The corpus comprises a selection of titles and abstracts of articles published in five journals in the fields of Health and Life & Environmental Sciences. The articles were originally written in French, but the journals also publish the titles and abstracts in English, using a translation provided by the authors. The dataset was pre-processed for sentence segmentation using the Stanford CoreNLP toolkit and aligned using YASA. Manual evaluation conducted on a sample suggests that 94% of the sentences are correctly aligned, with about 20% of the sentence pairs exhibiting additional content in one of the languages.
Cochrane and NHS. The test data was produced during the course of the KConnect and HimL projects. It contains health-related documents from Cochrane and NHS that were manually translated by experts from English into eight languages: cs, de, fr, hu, pl, ro, es and sv.

Participating teams and systems
We received submissions from seven teams, as summarized in Table 2. The teams came from five countries (Germany, Japan, Poland, the UK and the USA) on three continents, and include both research institutions and a company. An overview of the teams and their systems is provided below.
Hunter (Hunter College, City University of New York). The Hunter College system is based on Moses (EMS), SRILM and GIZA++ (Xu et al., 2017). For the translation model, they generated word alignments using GIZA++ and mGIZA. For the language model, they relied on an interpolation of models that includes 6-grams with Kneser-Ney smoothing. Different corpora were used for the various languages.
kyoto (Kyoto University). The system from Kyoto University builds on the team's previous work (Cromieres, 2016). The participants describe it as a classic neural machine translation (NMT) system; however, we do not have further information regarding the datasets that were used to train and tune the system for the WMT challenge.
Lilt (Lilt Inc.). The system from Lilt Inc. (https://lilt.com/) uses an in-house implementation of a sequence-to-sequence model with Bahdanau-style attention. The final submissions are ensembles of models fine-tuned on different parts of the available data.
LMU (Ludwig Maximilian University of Munich). LMU Munich participated with an en2de NMT system (Huck and Fraser, 2017). A distinctive feature of their system is a linguistically informed, cascaded target word segmentation approach. Fine-tuning for the domain of health texts was done using in-domain sections of the UFAL Medical Corpus v1.0 as a training corpus: starting from a pre-trained model, training continued on the in-domain medical data only, with the learning rate set to 0.00001. The HimL tuning sets were used for validation, and they tested separately on the Cochrane and NHS24 parts of the HimL devtest set.
PJIIT (Polish-Japanese Academy of Information Technology). PJIIT developed translation model training pipelines, adapted the training settings for each language pair, and implemented byte pair encoding (BPE, i.e., subword units) in their systems (Wolk and Marasek, 2017). Only the official parallel text corpora and monolingual models of the challenge evaluation campaign were used to train language models and to develop, tune, and test their system. PJIIT explored the use of domain adaptation techniques, symmetrized word alignment models, unsupervised transliteration models and the KenLM language modeling tool.
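PJIIT's mention of KenLM suggests a brief illustration. The sketch below queries a trained n-gram language model through the kenlm Python bindings; the model path is a placeholder, and this does not reproduce PJIIT's actual setup (nor Hunter's SRILM-based interpolation described above):

```python
import kenlm  # Python bindings for the KenLM toolkit

# Load an ARPA or binarized n-gram model (the path is a placeholder).
model = kenlm.Model("medical.6gram.arpa")

sentence = "the patient was treated with antibiotics"

# Total log10 probability of the sentence, with begin/end-of-sentence markers.
print(model.score(sentence, bos=True, eos=True))

# Per-word perplexity, a common way to compare candidate language models.
print(model.perplexity(sentence))
```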
uedin-nmt (University of Edinburgh). The systems from the University of Edinburgh used NMT models trained with Nematus, an attentional encoder-decoder (Sennrich et al., 2017). Their setup follows the one from last year. This team again built BPE-based models with parallel and back-translated monolingual training data. New approaches this year included the use of deep architectures, layer normalization, and more compact models due to weight tying and improvements in BPE segmentation.
UHH (University of Hamburg). The UHH team developed an SMT system trained on a variety of in-domain and out-of-domain data (Duma and Menzel, 2017); further details are given in their system paper.
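Several of the systems above (PJIIT, uedin-nmt) segment words into subword units with byte pair encoding. As a toy illustration of the merge-learning step of BPE (Sennrich et al.), and not any team's implementation, consider the following sketch:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Learn BPE merges from a word-frequency dict (toy illustration).

    word_freqs: dict mapping words to corpus frequencies.
    Returns the list of learned merge operations (pairs of symbols).
    """
    # Represent each word as a tuple of symbols plus an end-of-word marker.
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Frequent character pairs become subword units.
print(learn_bpe({"lower": 5, "lowest": 2, "newer": 6, "wider": 3}, 10))
```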

Evaluation
In this section, we present an overview of the submissions to the Biomedical Task and results in terms of both automatic and manual evaluation.

Submissions
An overview of the submissions is shown in Table 3. The participating teams submitted a total of 82 runs. No submissions were received for Swedish (en/sv) or Hungarian (en/hu).

Baselines
We provided baseline results only for the EDP and Scielo test sets, but not for the language pairs included in the Cochrane and NHS test sets.
Baseline. For the Scielo and EDP test sets, we compared the participants' results to our baseline system, which used the same approach as applied in last year's challenge (Bojar et al., 2016) for the evaluation of the Scielo dataset. The statistical machine translation (SMT) system used for the baseline was Moses (Koehn et al., 2007) with default settings. For es2en, en2es, fr2en, en2fr, pt2en and en2pt, the baseline system was trained following the same procedure.
LIMSI baseline. For additional comparison, we also provided the results of an en2fr Moses-based system prepared by Ive et al. for their participation in the WMT16 biomedical track, which reflects the state of the art for this language pair (Ive et al., 2016a). The system uses the in-domain parallel data provided for the biomedical task in 2016, as well as additional in-domain and out-of-domain data. However, we did not perform SOUL re-scoring.

Automatic evaluation
In this section, we provide the results of the automatic evaluation and rank the various systems based on those results. For the automatic evaluation, we computed BLEU scores at the sentence level using the multi-bleu and tokenization scripts provided by Moses (tokenizer and truecaser). For all test sets and language pairs, we compared the automatic translations to the reference translations provided with each test set.
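As a rough approximation of this scoring pipeline, the sketch below computes corpus-level BLEU with NLTK over pre-tokenized, truecased files; the file names are placeholders, and the result will not match Moses' multi-bleu.perl exactly:

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical one-sentence-per-line files, already tokenized and truecased.
with open("hyp.tok.txt", encoding="utf-8") as f:
    hypotheses = [line.split() for line in f]
with open("ref.tok.txt", encoding="utf-8") as f:
    references = [[line.split()] for line in f]  # one reference per sentence

# Mild smoothing avoids zero scores on segments with missing n-gram matches.
smooth = SmoothingFunction().method1
print("BLEU = {:.2f}".format(
    100 * corpus_bleu(references, hypotheses, smoothing_function=smooth)))
```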
Results for the Scielo test sets are presented in Table 4. All three runs from the UHH team, for all four language pairs, obtained a much higher BLEU score than our baseline. This is not surprising, however, given the simplicity of the methods used in the baseline system.
The BLEU scores for the EDP test set are presented in Table 5. While all system runs score above the baseline, only the kyoto system outperforms the stronger baseline for en2fr. We rank the various submissions as follows:
• fr2en: Hunter (runs 1,2) < baseline < UHH (runs 1,2) < UHH (run 3) < kyoto (run 1).
The BLEU scores for the Cochrane test sets are presented in Table 6. The scores range from as low as 12.45 (for Polish) to as high as 48.99 (for Spanish). All scores were particularly high for Spanish (close to 50), but rather low for Polish and Czech (all below 30). While the BLEU values did not vary much for French (all around 30), they ranged from 14 to 41 for Romanian. We rank the various submissions for each language based on these scores.
[Table 5: Results for the EDP test sets. * indicates the primary run as declared by the participants.]
Finally, the BLEU scores for the NHS dataset are presented in Table 7. The scores range from as low as 10.56 (for Romanian, the lowest score across all test sets and languages) to as high as 41.22 (for Spanish). All scores were particularly high for Spanish (around 40), but rather low for Polish, Czech and Romanian (all below 30). We rank the various submissions for each language as shown below:
• cs: PJIIT (run 1) < uedin-nmt (run 1).
The BLEU values for NHS were generally lower than those obtained by the same teams on the Cochrane test sets. However, the rankings of systems and runs are nearly the same for the Cochrane and NHS test sets. The only exceptions were for French, where run 3 from UHH scored higher than the team's other runs, and for Polish, where the scores for Hunter and PJIIT (runs 1,3) were nearly the same.

Manual evaluation
We required teams to identify a primary run for each language pair in case they submitted more than one run. These are the runs for which we performed manual evaluation. The following runs were considered primary: Hunter (run1), kyoto (run2 for en/fr, run1 for fr/en), lilt (run1), LMU (run1), PJIIT (run3 for pl, otherwise run1), uedin-nmt (run1), UHH (run3).
We computed pairwise combinations of translations, either between two automated systems or between one automated system and the reference translation. We compared all (primary) systems to the reference translation, as well as to each other. We ran manual validation for all target languages and test sets. The human validators were native speakers of the languages and were either members of the participating teams or colleagues from the research community. The validation task was carried out using the Appraise tool (https://github.com/cfedermann/Appraise) (Federmann, 2010). For each pairwise comparison, we validated a total of 100 randomly chosen sentence pairs. The validation consisted of reading the two sentences (A and B), i.e., translations from two systems or from the reference, and choosing one of the options below (a small bookkeeping sketch of this protocol follows the list):
• A<B: when the quality of translation B was higher than A.
• A=B: when both translations had similar quality.
• A>B: when the quality of translation A was higher than B.
• Flag error: when the translations did not seem to be derived from the same input sentence. This usually stems from errors in the automatic corpus alignment (for the Scielo and EDP datasets).
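A minimal bookkeeping sketch for this protocol, assuming the per-pair judgments have already been collected (the labels and counts below are hypothetical):

```python
from collections import Counter
import random

def sample_pairs(pairs, n=100, seed=0):
    """Draw n sentence pairs for manual validation, as in the task setup."""
    return random.Random(seed).sample(pairs, min(n, len(pairs)))

def tally(judgments):
    """Aggregate the four manual-validation outcomes for one system pair."""
    counts = Counter(judgments)
    return {label: counts.get(label, 0)
            for label in ("A<B", "A=B", "A>B", "flag")}

# Hypothetical judgments for one pairwise comparison of 100 sentence pairs.
print(tally(["A<B"] * 53 + ["A=B"] * 24 + ["A>B"] * 20 + ["flag"] * 3))
```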
The manual validation for the Scielo test sets is presented in Table 8, for the comparison of the only participating team (UHH) to the reference translation. For en2es, the automatic translation scored lower than the reference in 53 out of 100 pairs, but still beat the reference translation in 23 pairs. For en2pt, the automatic translation was better in only 13 sentence pairs, while it achieved quality similar to the reference translation in 31 cases. For translations from Spanish or Portuguese into English, the reference scored better than UHH in around the same proportion of pairs, while the latter beat the reference in only very few cases.
We present the results for the manual evaluation of the EDP test sets in Table 9. Based on the number of times that a translation was validated as being better than another, we ranked the systems for each language as listed below:
• en2fr: Hunter < UHH < kyoto = reference
• fr2en: Hunter < UHH < kyoto < reference
Results for manual validation of the Cochrane test sets are presented in Table 10. We rank the various systems as shown below:
• cs: PJIIT < uedin-nmt < reference
• de: UHH < Hunter = PJIIT < Lilt < LMU < uedin-nmt = reference
• fr: UHH < Hunter < reference
• pl: Hunter = PJIIT < uedin-nmt < reference
• es: UHH < reference
• ro: Hunter < PJIIT < uedin-nmt < reference
Results for manual validation of the NHS test sets are presented in Table 11. We rank the various systems as shown below:
• cs: PJIIT < uedin-nmt < reference
• de: Hunter = UHH < PJIIT < Lilt < LMU = uedin-nmt < reference
• fr: UHH < Hunter < reference
• pl: Hunter < PJIIT < uedin-nmt < reference
• es: UHH < reference
• ro: Hunter < PJIIT < uedin-nmt < reference
For Polish in the NHS test set, the evaluator skipped too many sentences (68 out of 100) to enable a reliable comparison between Hunter and PJIIT. We nevertheless ranked the PJIIT system higher than Hunter, given that the former was scored better 21 times, in contrast to 7 for the latter. However, there is inadequate data to support a clear difference between the two systems; indeed, both systems achieved similar quality for this language in the Cochrane test set.

Discussion
In this section we present, for each target language, some insights from the evaluation and the quality of the translations, as well as future work that we plan for the next edition of the challenge.

Performance of the systems
The results obtained by the teams raise interesting points of discussion regarding the impact of the methods and the amount of training data. Considering all the results in Tables 4-7, the highest BLEU score (48.99) of all runs across all test sets was obtained by the UHH system for en2es (Cochrane test set). The same team also scored high (above 40) for the NHS en2es test set and for the Scielo pt2en test set. The only other team that obtained BLEU scores in the same range (above 40) was uedin-nmt, for the Cochrane en2ro test set.
No automatic system was able to outperform or match the reference translations in the manual evaluation; hence, all the automated systems still have room for improvement. Interestingly, the best-performing system on the EDP en2fr dataset (kyoto) compared very favorably to the reference and was found to be equal to or better than the reference in 62% (58/93) of the manually evaluated sentences. In general, the kyoto and uedin-nmt systems seemed to consistently outperform the other competitors. Regarding the comparison to last year's edition of the challenge, we can only draw conclusions for the Scielo test set. The only participating team (UHH) obtained much higher BLEU scores for en2pt (39 vs. 19), pt2en (43 vs. 21) and es2en (37 vs. 30). However, results for en2es were only slightly higher than last year's (36 vs. 33).
As the performance of the methods in the biomedical domain improves, it will make sense to introduce additional domain-oriented evaluation measures that provide a document-level assessment focused on the clinical validity of the translations, rather than on grammatical correctness and fluency.

Best-performing methods
For language pairs that received submissions from several systems, such as en2de over the Cochrane and NHS data, the systems based on neural networks (e.g., uedin-nmt and LMU) performed substantially better than those based on SMT (e.g., UHH and Hunter). In many runs, the difference in BLEU score was greater than 10 points. The superiority of NMT systems was also observed in the EDP test set, as implemented in the kyoto system. However, we also note that a state-of-the-art statistical system relying on rich in-domain and out-of-domain data still performs well (as seen in the strong results of the LIMSI system).
Finally, some teams submitted more than one run, but we only observed significant differences in BLEU scores in a few cases, namely, kyoto (EDP en2fr test set), PJIIT (Cochrane/NHS pl test sets) and uedin-nmt (Cochrane/NHS pl and ro test sets). In the case of the PJIIT systems, the best-performing one is an extended version of the base SMT system that includes domain adaptation, among other additional features. In the case of the uedin-nmt system, the best-performing run relied on advanced techniques, such as right-to-left re-ranking.

Differences across languages
Even when teams relied on the same or similar methods for the different languages, a given system may perform better for certain languages than for others. This is probably due to the amount (or quality) of training data available for each language, and also to the linguistic properties of the language pair in question.
For instance, the UHH team developed an SMT system that was trained on a variety of in-domain and out-of-domain data. This system achieved good performance for English, Portuguese and Spanish (around 30-48 BLEU), but their results for German were much poorer (around 18-22). Indeed, the system obtained the lowest rank for German on the Cochrane and NHS test sets. The participants report that this is probably due to the amount of training data available for this language (personal communication), even though other teams obtained much higher BLEU scores for those same test sets, e.g., up to 37 points in the case of the uedin-nmt system.
Such differences across languages (of more than 10-20 BLEU points) were also observed for other systems. For instance, scores for the uedin-nmt system ranged from 22 (for Czech) to 41 (for Romanian). Interestingly, the scores for the Hunter system ranged from 10 (for Romanian, in contrast to the higher scores of the uedin-nmt system) to 30 (for French). The Hunter team seems to have used the same approach across all languages, trained on a variety of corpora. On the other hand, the uedin-nmt team seems to have used slightly different network architectures for each language (Sennrich et al., 2017).

Differences across datasets
Given that the methods and corpora seem to be largely the same for a particular language, differences in BLEU scores across the test sets are probably related to the characteristics of the test sets themselves. Few teams submitted runs for more than one test set, and only one team (UHH) submitted runs for all test sets (for one particular language).
For Spanish, the UHH team obtained considerably different BLEU scores for Scielo (around 36), NHS (around 41) and Cochrane (around 48). However, their system paper does not give much insight into the reasons for these differences (Duma and Menzel, 2017). We can hypothesize that the lower scores on the Scielo dataset are due to the fact that the reference translation is not a perfect translation of the source document and that sentence alignment was performed automatically.
For French, the Hunter team obtained lower scores on the EDP test set (around 17) and higher ones on the NHS (almost 23) and Cochrane test sets (around 30). Similarly, the UHH team obtained lower scores for EDP (around 22) and higher ones for Cochrane and NHS (around 31-32). The reason for these differences is probably the same as for the Scielo test set: EDP is an automatically acquired test set whose documents were automatically aligned. While the quality of the automatic alignment is high (estimated at 88% accuracy for Scielo and 94% for EDP), we also note that the translations in these test sets were created by the authors of the articles, who are neither professional translators nor native speakers of all the languages involved.
On the other hand, differences also occurred between the Cochrane and NHS test sets, although these were manually translated by professionals. Such differences were small for most systems for German (24 vs. 20 for Hunter, 22 vs. 19 for UHH, and 25 vs. 21 for PJIIT, for Cochrane and NHS, respectively). However, some cases show larger differences, such as the uedin-nmt system for Romanian (41 vs. 29 for Cochrane and NHS, respectively). We observed that the average sentence length is higher for Cochrane (with some very long sentences included), while the NHS test set contains many short sentence fragments. Both can be problematic for MT: systems may scramble long sentences and trip up on sentence fragments, since most of the training data consists of full sentences.
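Such length differences are easy to quantify with a simple profile of each test set; the sketch below uses whitespace tokenization, and the file names are placeholders:

```python
from statistics import mean, median

def length_profile(path):
    """Summarize whitespace-token sentence lengths for one test set file."""
    with open(path, encoding="utf-8") as f:
        lengths = [len(line.split()) for line in f if line.strip()]
    return {"sentences": len(lengths), "mean": round(mean(lengths), 1),
            "median": median(lengths), "max": max(lengths)}

# Hypothetical file names for the English sides of the two test sets.
for name in ("cochrane.en.txt", "nhs.en.txt"):
    print(name, length_profile(name))
```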

Differences between manual and automatic evaluations
We checked for differences between the manual and automatic evaluations, i.e., whether a team performed better than another in the manual evaluation but the other way round in the automatic evaluation. We observed small differences for Polish (Cochrane and NHS test sets) between the Hunter and PJIIT teams, but these are probably not significant, and the two systems probably have similar performance. We observed the same for the UHH and Hunter systems for German (NHS test set). However, we found a more interesting contradiction between the Hunter and UHH systems for French in both the Cochrane and NHS test sets. UHH obtained higher BLEU scores than Hunter (32-33 vs. 30 and 31-33 vs. 23, for Cochrane and NHS, respectively). However, in the manual evaluation, our expert chose Hunter as being better than UHH in many more sentences (40 vs. 8 and 67 vs. 6, respectively).

Quality of the automatic translations
We provide an overview of the quality of the translations and the common errors that we identified during the manual validation.
Czech: The outputs of the weaker system, PJIIT, were rather unsurprising, featuring a wide range of well-known issues of phrase-based SMT, including inflection errors that violate both long-distance and short-distance morphological agreement, missing or surplus negation, untranslated and uninflected rare words, wrong disambiguation of word meanings, etc. On the other hand, the quality of the neural uedin-nmt system is remarkably better, with no negation errors spotted, agreement errors generally limited to long-distance dependencies, only rare disambiguation errors (often domain-specific, e.g., "drug", "study", "review"), and a much bolder attempt at handling unknown or rare words. On the one hand, we spotted cases where it would have been better to leave the word untranslated, or to perform only a modest transliteration, as in "haemoglobin", which is similar enough to the "hemoglobin" used in Czech to be understandable as it is, but got translated to "hemoroidy" ("hemorrhoids") instead; on the other hand, both correct and incorrect translations of rare words were nearly always correctly inflected. Occasionally, we also noticed a missing or surplus word, especially with auxiliaries such as reflexive pronouns or forms of the verb "be".
English: Overall, the assessor found that the quality of translations into English had improved since 2016. Some of the problems observed in the prior year persisted, including inappropriate capitalization of terms (terms were capitalized although they were neither proper nouns nor acronyms) in some translations. Other issues, such as incorrect word order as well as untranslated and missing words, were also observed. Especially in fr2en translations, incorrect word order occurred when the French noun-before-adjective order was erroneously preserved in English; for instance, "douleur oro-faciale" was translated as "pain orofacial". Sometimes, however, untranslated words could still be deciphered because the French words were similar to their English equivalents, such as "biomatériaux" vs. "biomaterials", and "tolérance immunologique" vs. "immunological tolerance". As for missing words, translations were severely impacted when entire phrases were omitted, for instance when two consequences of a procedure were reduced to only one.
French: The quality of translations varied from poor to good. The issues that we encountered were similar to last year's and included grammatical errors, such as incorrect subject/verb or adjective/noun agreement, untranslated passages, and incorrect lexical choices due to a lack of word sense disambiguation. One recurring mistake was the translation of the term "female" as "femelle" (appropriate for animals) instead of "femme" (appropriate for humans). This year, the best systems showed an ability to successfully translate some acronyms. However, complex hyphenated terms remained challenging (for example, "38-year-old", "mid-60s", "immunoglobulin-like").
German: Overall, the quality of translations into German ranges from very good to poor. When comparing the different systems, the translation with the better syntax, grammar and use of technical terms was preferred; when both translations were equally bad, they were rated as equal. Poor translations are mostly characterized by incorrect syntax and grammar. Syntactic errors are usually due to missing predicates, the use of two or more predicates in one sentence, and odd word order, especially in long sentences; this often made sentences confusing or even incomprehensible. Typical grammar errors included incorrect conjugation of verbs, such as "wir suchte" instead of "wir suchten" (we searched). In well-performing systems, syntax and grammar are mostly correct; their differences from the reference are often due to not using the most appropriate word, which does not affect the meaning of the sentence, although a native speaker would prefer a different word. All systems seem to have problems with certain technical terms, usually when the German translation is very different from the English term. For instance, "to restart a person's heart" is often translated word by word as "Neustart des Herzen", while in German this procedure is called "Reanimation des Herzens". The pairwise evaluation of the two best-performing teams (LMU and uedin-nmt) indicates that they often produce similar sentences in terms of grammar and token order.
Portuguese: Only one team (UHH) submitted translations for Portuguese (Scielo dataset). In comparison to the submissions from the previous challenge (Bojar et al., 2016), we found the quality of the translations considerably better. As expected, longer sentences usually contained more mistakes and were harder to understand than shorter sentences, usually due to the wrong placement of commas and conjunctions (e.g., "and"). For instance, the translation "diâmetro tubular, altura do epitélio seminífero e integridade" was derived from the English version of the reference clause "diâmetro dos túbulos seminíferos, altura e integridade do epitélio seminífero". However, the same can also be said of some reference sentences, which could have been of higher quality. Regarding the more common mistakes, we observed missing articles, such as "Extratos vegetais" versus "Os extratos das espécies vegetais". We observed fewer instances of untranslated English words in comparison to last year, which seems to indicate a better coverage of the biomedical terminology. In some sentences, terms were simply skipped by the translation system, as in "método de manometria de alta resolução" for "high-resolution manometry method for esophageal manometry". The same mistake was observed for acronyms, e.g., DPS (death of pastures syndrome) instead of SMP (síndrome da morte das pastagens). However, we also found correct translations of acronyms, e.g., SII (síndrome do intestino irritável) for IBS (irritable bowel syndrome). Finally, we observed other minor mistakes: (a) nominal agreement, e.g., "O fortalecimento muscular progressiva"; (b) wrong word order, e.g., "plantadas áreas florestais" instead of "áreas florestais plantadas"; (c) wrong verb tense, e.g., "coeficiente de correlação linear de Pearson spearmans determinado" instead of "determinou"; (d) wrong verb conjugation, e.g., "a umidade relativa, temperatura, velocidade do vento e intensidade de luz foi...", instead of "foram"; and (e) missing contraction, e.g., "em as" instead of "nas".
Spanish: Compared to last year's challenge translations, the quality of the translations into Spanish is significantly better. Despite some small variations, many of the produced translations are valid translations of the original text. There are still cases with mistakes, such as with verb tenses: "a menudo oír voces", which should be "a menudo oyen voces". There are translations with similar but not entirely identical meaning, such as "hace aparecer" vs. the reference translation "ocurren". In some cases, there are incorrect phrases, such as "teléfono NHS informar sobre" vs. the reference translation "llame por teléfono el sistema informativo de NHS en". Translation systems seem to handle agreement between masculine/feminine and singular/plural articles better than last year. In addition, the number of missing words is lower in the Spanish submissions.
Romanian: The quality varied from good translations to clearly underperforming ones. When both translations were good, the grammatically correct one was preferred. When one used awkward language or did not use domain-specific terms such as "traumatism cranian" or "presiune intracraniana", the other one was preferred. We noticed that these translations can be very dangerous, especially when the form is good (and thus the appearance of quality is high). For instance, in one case, "vasopressor" was translated as "vasodilatatoare", which is the precise antonym. A frequent mistake was the translation of "trials" as "procese", which would have been correct for lawsuits but not for clinical trials. Somewhat confusing was the translation of "norepinephrine" as "noradrenaline": they look different but are two names for the same substance. For the bad and very bad translations, errors abounded to the point that both were equally useless and therefore marked as equal (in the sense of equally bad); this happened quite often. In general, we preferred translations that did not mislead and could still be understood despite their many flaws. Among the frequent translation errors, we identified the following: untranslated words; grammatical errors (case, gender); random characters and even Cyrillic (for no apparent reason); and context that was frequently not taken into account (e.g., "shots" translated as "gloante" and "impuscaturi", words having to do with weapons, not syringes). Other strange errors included unrelated words from other fields, such as "subcontractantul copolimerului" or "transductoare AFC".

Conclusions
We presented the results of the second edition of the Biomedical Translation task at the Conference on Machine Translation. The shared task addressed a total of ten languages and received submissions from seven teams. In comparison to last year, we observed an increase in the performance of the systems, in terms of higher BLEU scores, as well as an improvement in the quality of the translations, as observed during the manual validation. The methods used by the systems included statistical and neural machine translation techniques, and also incorporated many advanced features to boost performance, such as domain adaptation.
Despite the comprehensive evaluation presented here, there is still room for improvement in our methodology. All test sets translated by professionals were rather small (up to 1,000 sentences), which means that some of our conclusions might not hold on a larger benchmark. Further, we did not perform statistical tests when ranking the various systems and runs in the manual and automatic evaluations. Furthermore, each combination of two translations, or of one translation and the reference, was evaluated by a single expert, given the high number of submissions and the difficulty of finding available experts. On the other hand, most results obtained through manual validation were consistent with the automatic evaluation, suggesting that automatic scoring is sufficiently meaningful.