Translation of Biomedical Documents with Focus on Spanish-English

For the WMT 2018 shared task of translating documents pertaining to the Biomedical domain, we developed a scoring formula that uses an unsophisticated yet effective method of weighting term frequencies and integrated it in a data selection pipeline. The method was applied to five language pairs and performed best on Portuguese-English, where a BLEU score of 41.84 placed it third out of seven runs submitted by three institutions. In this paper, we describe our method and results with a special focus on Spanish-English, where we compare it against a state-of-the-art method. Our contribution to the task lies in introducing a fast, unsupervised method for selecting domain-specific data for training models which obtain good results using only 10% of the general domain data.


Introduction
The 2018 Biomedical Translation Task, held as part of the Third Conference on Machine Translation, aims at evaluating systems on scientific publications from Medline (Neves et al., 2018). The task is particularly challenging as there is still not enough bilingual medical data available for training high quality Machine Translation (MT) systems. We develop and apply a data selection method on five out of the nine language pairs addressed by the task: English-Spanish, Spanish-English, English-Portuguese, Portuguese-English and English-Romanian.
Data selection, as a domain adaptation technique, exploits all available (bilingual) general domain corpora with the purpose of extracting sentences that have a strong relationship to a given in-domain corpus. All sentences from the general domain pool are scored according to a similarity function and, after being sorted, the most similar ones are selected to take part in the MT training pipeline. The subsampling is usually done using a threshold, which is either a number of sentence pairs or a percentage of the sentences to be considered in-domain.
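The score, sort and threshold procedure described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in our experiments: the function name select_top, its parameters and the assumption that a higher score means "more similar" are all hypothetical, and the scores are taken as given.

```python
import math
from typing import List, Sequence, Tuple

def select_top(pairs: Sequence[Tuple[str, str]],
               scores: Sequence[float],
               fraction: float = 0.10,
               higher_is_better: bool = True) -> List[Tuple[str, str]]:
    """Keep the best-scoring `fraction` of the general domain sentence pairs.

    `higher_is_better` is an assumption of this sketch: some scoring
    functions treat low values as "closer to the in-domain corpus".
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n_keep = max(1, math.ceil(fraction * len(pairs)))
    # Sort indices by score so that the sentence pairs stay aligned.
    order = sorted(range(len(pairs)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    return [pairs[i] for i in order[:n_keep]]
```

Using a percentage instead of an absolute number of sentence pairs keeps the threshold comparable across general domain corpora of different sizes.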
We introduce a data selection method which is fast to apply and yields good results when compared with a strong baseline and a state-of-the-art method. At its core, the method relies on term frequencies and a newly developed similarity function. On the one hand, no models need to be trained and the method is unsupervised, but on the other hand, the method does not consider the context of the words or their semantics. Nevertheless, the results are very encouraging, with BLEU (Papineni et al., 2002) scores between 31.05 and 41.84 for four language pairs. The paper is structured as follows: the next section briefly presents related work, Section 3 describes the experimental settings along with a description of our algorithm, Section 4 gives an overview of the results obtained in the task and of additional experiments, and the last section presents conclusions and future work.

Related work
Related work in data selection is ample; therefore, this section only mentions methods that fall in the same category as our method. We also briefly describe the widely known state-of-the-art method of performing data selection, introduced by Axelrod et al. (2011), since it is the method chosen for comparing results in this paper.
Our scoring function relies heavily on term frequency. Therefore, it falls in the category of TF-IDF (Term Frequency - Inverse Document Frequency) based approaches. Hildebrand et al. (2005) use TF-IDF to produce vector representations of sentences. The cosine of the angle between the sentence vectors is then interpreted as the similarity between the sentences. In a similar approach, a weighting scheme based on TF-IDF by means of unseen n-grams and sentence length is applied, and the cosine is also used as the means of determining sentence similarities. In contrast to these methods, we use only the term frequency in computing our similarity scores and we make no use of the cosine. Instead, we focus on the relative difference between the frequencies of a term in the general domain and in the in-domain, and simply multiply it by a weighting scheme that has empirically proved to be effective. Our method is also related to the other methods from the TF-IDF category with respect to its simplicity.
To compare our results with other approaches we apply the modified Moore-Lewis method, which is based on Moore and Lewis (2010): given the source side of an in-domain corpus and a random subsample of the source side of a general domain corpus, a language model (LM) is trained on each of them. The sentences from the general domain are scored by the difference between the cross-entropy of a sentence according to the in-domain LM and the cross-entropy of the same sentence according to the general domain LM. Axelrod et al. (2011) modified the scoring by applying the same procedure also to the target side of the corpora and afterwards summing the scores. We refer to this method as MML (modified Moore-Lewis) in the rest of the paper.
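To make the cross-entropy difference concrete, the sketch below scores sentences with add-one smoothed unigram models. This is only an illustration under simplifying assumptions: the class UnigramLM and the functions moore_lewis and mml are hypothetical names, and real MML setups (including the Moses implementation we used) rely on higher-order n-gram LMs rather than unigrams.

```python
import math
from collections import Counter

class UnigramLM:
    """Toy add-one smoothed unigram LM (a stand-in for a real n-gram LM)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # one extra slot for unseen words

    def cross_entropy(self, sentence):
        """Per-word cross-entropy of the sentence in bits."""
        toks = sentence.split()
        if not toks:
            return 0.0
        logp = sum(math.log2((self.counts[t] + 1) / (self.total + self.vocab))
                   for t in toks)
        return -logp / len(toks)

def moore_lewis(sentence, lm_in, lm_gen):
    """Cross-entropy difference; lower means closer to the in-domain."""
    return lm_in.cross_entropy(sentence) - lm_gen.cross_entropy(sentence)

def mml(src, tgt, lm_in_src, lm_gen_src, lm_in_tgt, lm_gen_tgt):
    """Bilingual variant: sum of the source-side and target-side differences."""
    return (moore_lewis(src, lm_in_src, lm_gen_src)
            + moore_lewis(tgt, lm_in_tgt, lm_gen_tgt))
```

With this scoring, sentences are sorted in ascending order, so the lowest scores are selected first.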

Experiments
This section describes the experimental settings including the corpora and the tools used, as well as the data selection algorithm we developed.

Corpora
The general domain data consisted of a concatenation of the Commoncrawl corpora and the Wikipedia (Wolk and Marasek, 2014) corpora for English-Spanish and Spanish-English, Paracrawl and Wikipedia for English-Portuguese and Portuguese-English, and Paracrawl for English-Romanian. For the in-domain, we used the EMEA (Tiedemann, 2012) corpora for all language pairs and the Scielo corpora (health and biological) provided by the WMT 2016 Biomedical task (Neves et al., 2016) for all language pairs except English-Romanian, where Scielo training data was not available.
The development set for the English-Spanish and Spanish-English experiments was a concatenation of the Khresmoi development set from the Medical Task of WMT 2014 and the ECDC corpus made available by UFAL. The motivation for using a concatenation of two medical development sets is that we aimed at diversity in the medical data. Even though ECDC is a very small corpus consisting of only 2357 sentence pairs (for English-Spanish), combining it with Khresmoi (500 sentence pairs) would have resulted in a fairly large development set, which would have made the tuning of the SMT systems very time and memory intensive. Therefore, we applied a cleaning step to ECDC, keeping only sentences with a minimum of 20 words and a maximum of 80 words. After this preprocessing step, the ECDC set was down to 850 sentences, resulting in a total development set of 1350 sentences. For the experiments involving Portuguese, a sample of 1000 sentences from the Scielo development set from WMT 2016 was used for tuning purposes. For the Romanian experiments, a sample of 1000 sentences was also used, but taken from the ECDC corpus.
Statistics for every corpus used for training the MT systems, including the number of sentences after preprocessing, are given in Table 1.
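The length-based cleaning step can be illustrated with a short sketch. The function name filter_by_length is hypothetical, and applying the limits to both sides of each sentence pair with whitespace tokenization is an assumption of this example.

```python
def filter_by_length(pairs, min_words=20, max_words=80):
    """Keep sentence pairs whose two sides both have min_words..max_words tokens.

    Checking both sides and splitting on whitespace are assumptions of
    this sketch, not a description of the exact cleaning script.
    """
    def in_range(sentence):
        return min_words <= len(sentence.split()) <= max_words

    return [(src, tgt) for src, tgt in pairs if in_range(src) and in_range(tgt)]
```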

Tools
For text processing we used the NLTK toolkit (Bird et al., 2009), the WordNet (Fellbaum, 1998) lemmatizer for English and the Snowball stemmer (Porter, 2001) for Spanish, Portuguese and Romanian.
The SMT systems were trained using the Moses toolkit (Koehn et al., 2007) and the Experiment Management System (Koehn, 2010). The preprocessing of the data consisted of tokenization, cleaning, lowercasing and normalizing punctuation. Our language model (LM) was obtained by interpolating (Schwenk and Koehn, 2008) the LM estimated using the general domain data and the LM estimated on the in-domain data. We used the SRILM toolkit (Stolcke, 2002) and Kneser-Ney discounting (Kneser and Ney, 1995) for estimating 5-gram LMs. All the experiments benefited from the interpolated language model, including the strong baseline and the MML experiment. As for the chosen state-of-the-art method, MML, we used the implementation available from Moses. Tuning of the systems was done with MERT (Och, 2003), and word alignment with GIZA++ (Och and Ney, 2003) using the default grow-diag-final-and alignment symmetrization method.

Data selection using Term Frequency
Using bags of words to represent sentences and term frequency to compute similarity became unpopular due to its limitations, namely the lack of semantic information and the disregard for the context of words (Le and Mikolov, 2014). However, through the work presented here we aim at applying this straightforward method to data selection for SMT with a new weighting scheme. Our scoring algorithm builds a profile consisting of word frequencies for each domain, for both the source language and the target language. To build the profile for a corpus, all of its sentences undergo a preprocessing step: tokenization, lowercasing, removal of stop words and lemmatization, or stemming in case a lemmatizer was not available for a language (procedure PreprocessCorpus). Numbers and punctuation marks are ignored, so that only words contribute to the scoring. For counting word occurrences we used the ngram-count script from SRILM.
Algorithm 1 can be applied either on the source or on the target sides of the corpora. For example, when considering the source side, for every sentence from the lemmatized (or stemmed) general domain data, we iterate through all its words. Given a sentence s and a word w, we square the relative difference between the term frequency of w in the in-domain profile, count(w, IN_side), and the term frequency of w in the general domain profile, count(w, GEN_side). We use the same relative difference formula as Kešelj et al. (2003), who use character n-grams and profiles built from the most frequent character n-grams for authorship attribution. In contrast, we used all the words appearing in the corpora and modified the formula by introducing a weighting scheme. Note that due to the squaring, the direction of the subtraction does not matter. The difference is multiplied by a weight and the arithmetic mean of count(w, IN_side) and count(w, GEN_side). The weight represents the impact that w made in the sentence and was determined empirically. The score of the sentence is the sum of the scores of its words. When using only the formula from Kešelj et al. (2003) adapted to our data selection task, the results are of poor quality. Our contribution to the formula lies in introducing the weighting scheme, which gives much better results than the original formula. To profit from both the source and the target corpora, summing up the scores for the source language and the scores for the target language seems to be an attractive solution. We refer to our method as DSTF (Data Selection via Term Frequency).
The method has a very important advantage compared to state-of-the-art methods: scoring a general domain corpus is very fast (on average, the scoring step took half an hour). The results are satisfactory and are presented in the following section.
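The profile construction can be sketched as follows. This is a simplified, dependency-free illustration rather than our implementation: the regular-expression tokenizer, the tiny STOP_WORDS set and the normalize hook (a placeholder for the WordNet lemmatizer or the Snowball stemmer) are assumptions of the example, and counting is done with a Counter instead of SRILM's ngram-count.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real run would use a full list
# for the language at hand.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "was"}

def preprocess(sentence, normalize=lambda word: word):
    """Lowercase, keep alphabetic tokens only, drop stop words, then
    normalize each word (`normalize` stands in for a lemmatizer or stemmer)."""
    tokens = re.findall(r"[^\W\d_]+", sentence.lower())  # letters only
    return [normalize(tok) for tok in tokens if tok not in STOP_WORDS]

def build_profile(sentences, normalize=lambda word: word):
    """Word frequency profile of one side (source or target) of a corpus."""
    profile = Counter()
    for sentence in sentences:
        profile.update(preprocess(sentence, normalize))
    return profile
```

One profile is built per domain and per side, so a bilingual setup needs four profiles: in-domain and general domain, each for the source and the target language.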
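A minimal sketch of the word-level and sentence-level scoring is given below. It implements only the squared relative difference of Kešelj et al. (2003), in which the difference is normalized by the arithmetic mean of the two counts, multiplied by a weight. The weight_fn parameter is a hypothetical placeholder, since the empirically determined weighting scheme is not reproduced here, and all function names are assumptions of this example.

```python
from collections import Counter

def word_score(word, in_profile, gen_profile, weight=1.0):
    """Squared relative difference of the two counts, times a weight."""
    f_in, f_gen = in_profile[word], gen_profile[word]
    mean = (f_in + f_gen) / 2
    if mean == 0:  # word absent from both profiles contributes nothing
        return 0.0
    relative_difference = (f_in - f_gen) / mean
    return relative_difference ** 2 * weight

def sentence_score(words, in_profile, gen_profile, weight_fn=lambda w: 1.0):
    """Sum of the word scores; `weight_fn` is a placeholder weighting scheme."""
    return sum(word_score(w, in_profile, gen_profile, weight_fn(w))
               for w in words)

def combined_score(src_words, tgt_words, src_profiles, tgt_profiles):
    """Sum of the source-side and target-side sentence scores.

    Each *_profiles argument is an (in_profile, gen_profile) tuple.
    """
    return (sentence_score(src_words, *src_profiles)
            + sentence_score(tgt_words, *tgt_profiles))
```

The profiles are expected to be Counter objects, so that unseen words have a count of zero. Which end of the resulting ranking is treated as most similar to the in-domain is left to the caller in this sketch.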

Results
We report the automatic evaluation results obtained in the WMT task for five language pairs and then we present further experiments for the Spanish-English language pair. BLEU was used as an evaluation metric by the WMT Biomedical organizers and in addition to BLEU we also used METEOR (Lavie and Agarwal, 2005) for further evaluating the Spanish-English experiments.

WMT Biomedical Results
Each team was allowed to submit a maximum of three runs. For every language pair on which we evaluated our method, we submitted three runs as follows: the first run only considers the scores obtained using the English side of the training corpora, the second run uses only the non-English side of the training corpora, and for the third run the scores for both the source and the target sides were summed up to form a single score.
The aim of data selection is to identify in the general domain pool the top N sentences most similar to an in-domain corpus, where N is determined empirically and is usually a small number or percentage. For this paper we experimented with N = 10%, since the maximum number of runs allowed was three and we had three variations of the method, but we intend to conduct a range of experiments with more percentage values in future work. Table 2 presents the number of sentence pairs that were subsampled along with the total number of sentence pairs that were used in the training of the MT systems. The BLEU results obtained using DSTF are encouraging: a BLEU score of 41.84 for Portuguese-English placed our method third out of seven runs submitted by three institutions. For English-Portuguese, our BLEU scores are close to 34 for all runs. The Spanish-English automatic evaluation achieved scores around 35-36 and English-Spanish around 31. The lowest BLEU scores were measured for English-Romanian, where we obtained scores close to 14. This is not surprising considering that, compared to the other language pairs, less biomedical training data was available. In particular, no Scielo training corpus was available, while translating from English into a morphologically rich language like Romanian is already considered difficult. The BLEU scores for each run are given in Table 3. We note that the differences between the runs are insignificant for every language pair except one; we therefore conclude that any of the algorithm variations can be successfully applied as a fast data selection technique that yields good translations (BLEU scores between 31 and 42 for four out of five language pairs).

Spanish-English Additional Experiments
For Spanish-English, the best performing variant of our method was run 1, which uses only the English side of the corpora in the algorithm. We evaluated our DSTF-EN method against a strong baseline (that uses an interpolated LM), a baseline trained using only the in-domain data and the state-of-the-art method MML for the Spanish-English language pair. Following recommendations from Clark et al. (2011) and standard practices, we tuned the systems three times and report the averaged BLEU scores in Table 4. According to the BLEU scores, our method outperformed both baselines and gained almost 1 BLEU point over MML. The strong baseline is very competitive with both data selection methods. This can easily be explained, since the system relies on the same interpolated language model as DSTF-EN and MML. There is a difference of 3 BLEU points between our results and the baseline trained only on the in-domain data, and a difference of almost half a BLEU point between the strong baseline and our method. With respect to the METEOR scores, our method again outperforms the state-of-the-art approach.
In order to determine whether our method (DSTF-EN) outperforms the state-of-the-art method (MML) from a statistical point of view, we applied paired bootstrap resampling (Koehn, 2004). The MT-ComparEval tool (Klejch et al., 2015; Sudarikov et al., 2016) was used for this purpose, where the source, reference and one or more system translations are used in the analysis. For our analysis we selected the best translation of each system according to their BLEU scores (we tuned three times and averaged the BLEU scores). Figure 1 depicts the paired bootstrap resampling BLEU graph (left side) and the F-measure graph (right side). The x-axis represents the 1000 resamples of the test set and the y-axis represents the difference in BLEU (respectively F-measure) between DSTF-EN and MML for all resamples. The p-value from the first graph in Figure 1 reports that in 11 cases out of the 1000 resamples, the state-of-the-art method performed better in terms of BLEU than our method (marked with a small red area in the graph). A similar behaviour can be observed in the right graph of Figure 1, where in 34 cases out of 1000, MML outperformed DSTF-EN in terms of F-measure. Therefore, our method wins over the state-of-the-art in 96.6% of the cases when using the F-measure and in 98.9% of the cases when evaluating with BLEU (large green areas in the graphs). We conclude that our method achieves a statistically significant improvement over the state-of-the-art method when selecting the 10% of the general domain sentences that are most similar to the in-domain.
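The resampling procedure can be sketched generically as follows. This is an illustration of paired bootstrap resampling in the spirit of Koehn (2004), not the MT-ComparEval implementation: the function names are hypothetical, and exact_match is a toy corpus-level metric standing in for BLEU or the F-measure.

```python
import random

def exact_match(hypotheses, references):
    """Toy corpus-level metric (stand-in for BLEU): share of exact matches."""
    return sum(h == r for h, r in zip(hypotheses, references)) / len(references)

def paired_bootstrap(metric, hyps_a, hyps_b, refs, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A beats system B.

    Each resample draws len(refs) sentence indices with replacement and
    evaluates both systems on the same drawn sentences.
    """
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        sample_refs = [refs[i] for i in idx]
        score_a = metric([hyps_a[i] for i in idx], sample_refs)
        score_b = metric([hyps_b[i] for i in idx], sample_refs)
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_resamples
```

Under this reading, a returned value of 0.989 corresponds to the 989 out of 1000 resamples in which DSTF-EN obtained the higher BLEU score.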

Conclusions and Future Work
We introduced an unsophisticated data selection method based on word frequencies which scores general domain corpora in half an hour (on average when considering all general corpora for five language pairs). Our method yields good results in the WMT task, as well as in comparison with a state-of-the-art method and a strong baseline (for Spanish-English). Further analysis and experiments will be carried out in future work to assess whether the improvement of our method over the state-of-the-art that we observed for Spanish-English is also statistically significant for other language pairs.