Automatic Threshold Detection for Data Selection in Machine Translation

We present in this paper the participation of the University of Hamburg in the Biomedical Translation Task of the Second Conference on Machine Translation (WMT 2017). Our contribution lies in adopting a new direction for performing data selection for Machine Translation via Paragraph Vector and a Feed Forward Neural Network Classifier. Continuous distributed vector representations of the sentences are used as features for the binary classifier. Most approaches in data selection rely on scoring and ranking general domain sentences with respect to their similarity to the in-domain and setting a range of thresholds for selecting a percentage of them for training various MT systems. The novelty of our method consists in developing an automatic threshold detection paradigm for data selection which provides an efficient and simple way for selecting the most similar sentences to the in-domain. Encouraging results are obtained using this approach for seven language pairs and four data sets.


Introduction
Data selection for Machine Translation (MT) is a standard domain adaptation technique that aims to select, from various general domain data, the sentences that are most similar to sentences from the in-domain. Irrespective of whether vast or small amounts of in-domain data are available, one of the advantages of data selection consists in providing more in-domain data selected from large amounts of general domain data. Two difficult tasks arise when performing data selection: what method to use for scoring the sentences from the general domain according to their similarity to the in-domain, and how many of the scored sentences to keep for later use in training MT systems.
Standard state-of-the-art methods resolve the first difficulty by means of information retrieval, perplexity or edit distance methods. However, the second difficulty remains a challenge. No standard start threshold or increment threshold has been defined in the community. Axelrod et al. (2011), for example, use the top N = {35k, 70k, 150k} sentence pairs from the scored general domain data, while Biçici and Yuret (2011) increasingly select N ∈ {100, 200, 500, 1000, 2000, 3000, 5000, 10000} instances for each test sentence for training and Kirchhoff and Bilmes (2014) select subsets of 10%, 20%, 30% and 40% of the data.
We present a time and resource efficient method of performing data selection using Paragraph Vector (Le and Mikolov, 2014) for representing the sentences and a Feed Forward Neural Network Classifier for determining which general domain sentences should be considered similar to the in-domain. The paragraph vectors and the binary classifiers are trained using standard parameters and have the great advantage of removing the need to experiment with different sentence selection thresholds. Therefore, we call our method automatic threshold detection for data selection (ATD).
The method has been applied in the Biomedical translation task of the Second Conference on Machine Translation (WMT) 2017 (Yepes et al., 2017). The in-domain corpora were made available by the competition and the general domain corpora we have chosen to select data from are the Wikipedia corpora (Wolk and Marasek, 2014) and the Commoncrawl corpora. Experiments were performed on the language pairs English-French, English-Spanish, English-Portuguese and English-German (both directions for all language pairs except for English-German as the competition did not require German-English translations). Good results have been obtained for all language pairs. The paper is structured as follows: related work is presented in Section 2, then the data, tools and data selection method are described in Section 3. Section 4 contains the experimental results and the last section presents conclusions and suggestions for future work.

Related work
Given a large pool of general domain data and a small amount of in-domain data, selecting the sentences from the general domain that are most similar to the in-domain is referred to in the literature as data selection. The workflow of performing data selection includes developing a metric or function that scores general domain sentences according to their relevance to the in-domain and experimenting with various ratios of top ranked sentences in order to obtain the best result in terms of one or more MT evaluation metrics.
Recently, a new direction has gained interest by making use of Word or Paragraph Vectors (embeddings). Chen and Huang (2016) use word embeddings along with in-domain selected sentences as positive samples and randomly selected sentences from the general domain as negative samples in training convolutional networks that yield good results. Also, Duma and Menzel (2016) developed a new scoring method using Paragraph Vectors with positive results.
In this paper, we apply Paragraph Vectors for training FFNN classifiers that categorize the general domain sentences as being in-domain or out-of-domain. One of the most challenging tasks in data selection consists in finding the optimal threshold (how many of the scored sentences to select). It is a time-consuming process in which several experiments need to be performed, usually aiming to obtain the best BLEU score. Moreover, there is no general consensus in the community regarding the increment ratio. We contribute to the state-of-the-art with a method that overcomes this challenge by means of a binary classifier: the problem of data selection is simplified by reducing the task of scoring and experimenting with different thresholds to a binary decision (keep/discard a general domain sentence).
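To make the contrast concrete, the following minimal sketch compares threshold-based selection (rank the sentences and keep the top N for several hand-picked values of N) with a single keep/discard decision per sentence. The sentences, the scores and the function names are hypothetical and serve only as an illustration; they are not taken from our experiments.

```python
def select_top_n(scored, n):
    """Threshold-based selection: rank by score and keep the n best sentences."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:n]]


def select_by_classifier(scored, is_in_domain):
    """Classifier-based selection: one keep/discard decision per sentence."""
    return [sentence for sentence, score in scored if is_in_domain(score)]


# Hypothetical general domain sentences with made-up in-domain probabilities.
scored = [
    ("the patient received 5 mg daily", 0.91),
    ("the match ended in a draw", 0.12),
    ("adverse reactions were reported", 0.78),
    ("tickets are sold online", 0.30),
]

# Threshold-based: one candidate training set (and one MT system) per value of n.
candidates = {n: select_top_n(scored, n) for n in (1, 2, 3)}

# Classifier-based: a single selected set, with no n to tune.
# The lambda is a stand-in for the decision of a trained binary classifier.
selected = select_by_classifier(scored, lambda p: p >= 0.5)
```

With the threshold-based variant every value of n leads to a separate MT training run, whereas the classifier-based variant yields exactly one selected set.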

Experiments
This section describes the corpora and tools used, as well as the automatic threshold detection method we propose.

Data and tools
All SMT models were developed using the Moses phrase-based MT toolkit (Koehn et al., 2007) and the Experiment Management System (Koehn, 2010). The preprocessing of the data consisted in tokenization, cleaning (6-80), lowercasing and normalizing punctuation. The tuning and the test sets were provided by WMT 2016 (Bojar et al., 2016) and WMT 2017.
The SRILM toolkit (Stolcke, 2002) and Kneser-Ney discounting (Kneser and Ney, 1995) were used to estimate 5-gram language models (LM). All the trained SMT systems use a strong LM built by interpolating a LM for the in-domain and a LM for the general domain with weights that are tuned to minimize the perplexity on the tuning set (Schwenk and Koehn, 2008).
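As an illustration of how an interpolation weight can be tuned, the sketch below estimates the weight lam of a two-model mixture p(w) = lam * p_in(w) + (1 - lam) * p_gen(w) with expectation-maximization, which lowers the perplexity on a tuning set. The per-token probabilities are invented and the function names are hypothetical; in practice the probabilities would come from the trained language models, and the sketch is not claimed to reproduce the exact procedure of the toolkit we used.

```python
import math


def perplexity(lam, p_in, p_gen):
    """Perplexity of the mixture lam*p_in + (1-lam)*p_gen over the tuning tokens."""
    log_sum = sum(math.log(lam * a + (1.0 - lam) * b) for a, b in zip(p_in, p_gen))
    return math.exp(-log_sum / len(p_in))


def tune_weight(p_in, p_gen, lam=0.5, iterations=50):
    """EM updates for the weight of the in-domain language model."""
    for _ in range(iterations):
        # E-step: posterior probability that each token came from the in-domain LM.
        posteriors = [lam * a / (lam * a + (1.0 - lam) * b) for a, b in zip(p_in, p_gen)]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


# Made-up probabilities that each LM assigns to the tokens of a tuning set.
p_in = [0.20, 0.05, 0.10, 0.30, 0.02]
p_gen = [0.05, 0.04, 0.01, 0.10, 0.08]

best = tune_weight(p_in, p_gen)
print(f"weight={best:.3f}, perplexity={perplexity(best, p_in, p_gen):.2f}")
```

Each EM iteration cannot increase the perplexity on the tuning set, so the tuned weight is at least as good as the initial value of 0.5.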
For word alignment we used GIZA++ (Och and Ney, 2003) with the default grow-diag-final-and alignment symmetrization method. Tuning of the SMT systems was performed with MERT (Och, 2003).
Commoncrawl and Wikipedia were used as general domains for all language pairs except for EN↔PT where no Commoncrawl data was provided by WMT. As for the in-domain corpora, EMEA (Tiedemann, 2012) was used for all language pairs and Muchmore, ECDC, Pattr and Pubmed (all from UFAL Medical Corpus) for those language pairs where data was available. We also made use of the training data provided by the previous Biomedical task from 2016. The corpora corresponding to the general domain were concatenated into a single data source and the same procedure was applied for the in-domain corpora. The size of the corpora is presented in the following table (since the bilingual corpora remain the same for both cases of translating Language1 to Language2 and vice-versa, we mention only one direction in the table).

Automatic Threshold Detection for Data Selection
The data selection method we used for the WMT Biomedical task is described in this section with a special focus on Paragraph Vector and the FFNN classifier employed in developing the automatic threshold detection.

Paragraph Vector
Sentences were represented using Paragraph Vectors (Le and Mikolov, 2014) which give a continuous distributed vector representation of the input. Paragraph Vector is an extension of word embeddings (Mikolov et al., 2013) to phrases or sentences. Given a sentence, Paragraph Vector learns its representation by mapping context words and a paragraph identifier to the word to be predicted. The paragraph token acts like a memory of the topic of the sentence (Le and Mikolov, 2014). While the word vectors are shared between all paragraphs, the paragraph vector is shared among all the contexts generated from the same sentence. We used the gensim toolkit (Řehůřek and Sojka, 2010; https://radimrehurek.com/gensim/models/doc2vec.html) that implements Doc2Vec (Paragraph Vectors). We present results using a Doc2Vec model trained with PV-DBOW (Distributed Bag of Words) applying the default parameters of size 200 for the vectors and window of 10 (the maximum distance between the predicted word and context words used for prediction within a document).
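For reference, the PV-DBOW training objective can be written as follows. This is a sketch in our own notation, following the description in Le and Mikolov (2014): d_s denotes the vector of sentence s, u_w and b_w the output parameters of word w, and V the vocabulary. Practical implementations usually approximate the softmax, for example with negative sampling or hierarchical softmax.

```latex
\max_{\{d_s\},\,\{u_w\},\,\{b_w\}} \; \sum_{s} \sum_{w \in s} \log p(w \mid d_s),
\qquad
p(w \mid d_s) = \frac{\exp\left(u_w^{\top} d_s + b_w\right)}
                     {\sum_{w' \in V} \exp\left(u_{w'}^{\top} d_s + b_{w'}\right)}
```

In this variant the sentence vector alone is used to predict words sampled from the sentence, without context words in the input.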

Feed-forward Neural Network Classifier
The Feed-Forward Neural Network uses a supervised learning algorithm that receives as input the Paragraph Vectors for the labeled sentences. The feed-forward neural network classifier was trained using the Python library sknn. We report here results obtained using a fully connected Tanh layer of 200 units with dropout p=0.5 and a Softmax output layer. The optimal dropout value was selected in accordance with the findings from Srivastava et al. (2014).
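The architecture described above can be sketched as a forward pass in plain Python. This is an illustration only: the weights are random rather than trained, the helper names and the class indices are hypothetical, the dropout shown is the common "inverted" formulation, and the code does not reproduce the sknn implementation. It shows how a Tanh hidden layer with dropout feeds a two-class Softmax whose larger output gives the keep/discard decision.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases for a fully connected layer (untrained)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(weights, biases, x):
    """Affine transformation W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output, rng=None, p_drop=0.5):
    """Tanh hidden layer, optional inverted dropout, softmax over two classes."""
    h = [math.tanh(v) for v in dense(hidden[0], hidden[1], x)]
    if rng is not None:  # training mode: drop each hidden unit with probability p_drop
        h = [0.0 if rng.random() < p_drop else v / (1.0 - p_drop) for v in h]
    return softmax(dense(output[0], output[1], h))


rng = random.Random(0)
hidden = init_layer(200, 200, rng)  # 200-dimensional sentence vector -> 200 Tanh units
output = init_layer(200, 2, rng)    # assumed classes: 0 = out-of-domain, 1 = in-domain

# Random stand-in for the paragraph vector of one general domain sentence.
sentence_vector = [rng.gauss(0.0, 1.0) for _ in range(200)]

probs = forward(sentence_vector, hidden, output)  # inference: dropout disabled
keep = probs[1] > probs[0]                        # keep the sentence if in-domain wins
```

Because the Softmax has two outputs, comparing them amounts to a fixed decision at probability 0.5, so no selection threshold has to be tuned.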
We experimented with both the source and the target language, in order to determine the best use of classified data given our settings.
For each of the language pairs we trained classifiers on ≈200K sentences with an equal number of positive and negative samples. The positive samples were randomly selected from the in-domain data and the negative samples were randomly selected from the general domain data.
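The construction of such a balanced training set can be sketched as follows. The toy corpora, the function name and the label convention (1 for in-domain, 0 for general domain) are assumptions made for illustration.

```python
import random


def build_training_set(in_domain, general_domain, n_per_class, seed=0):
    """Sample n_per_class sentences from each corpus and label them 1 or 0."""
    rng = random.Random(seed)
    positives = rng.sample(in_domain, n_per_class)       # random in-domain sentences
    negatives = rng.sample(general_domain, n_per_class)  # random general domain sentences
    examples = [(s, 1) for s in positives] + [(s, 0) for s in negatives]
    rng.shuffle(examples)
    return examples


# Toy corpora standing in for the concatenated in-domain and general domain data.
in_domain = [f"in-domain sentence {i}" for i in range(1000)]
general_domain = [f"general sentence {i}" for i in range(5000)]

# With about 200K examples in total, n_per_class would be about 100K.
train = build_training_set(in_domain, general_domain, n_per_class=100)
```

Note that randomly drawn general domain sentences may occasionally resemble the in-domain, so the negative class is only approximately out-of-domain.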

Experimental results
We report in this section the BLEU (Papineni et al., 2002) scores obtained by our submissions, as well as the classifiers' accuracy. For each language pair and for each test set provided by the Biomedical task, we submitted three runs as follows:
• the selected sentences with the classifier trained on the source language data (run 1)
• the selected sentences with the classifier trained on the target language data (run 2)
• the union (without duplicates) of the selected sentences proposed by the two classifiers (run 3)
Intrinsic evaluation of the proposed data selection technique was performed by computing the classifier accuracy. Following the recommendations of Kohavi (1995), we employ the stratified cross-validation method with ten folds. The accuracy values were computed using scikit-learn (Pedregosa et al., 2011). The following table presents the FFNN classifier mean accuracy and standard deviation for each of the language pairs. The low values of standard deviation for all classifiers indicate the consistency of our proposed method.
This year four datasets were used in the evaluation: Scielo, EDP, Cochrane and NHS, which consist of scientific publications or health information texts. The format of the datasets differed: Scielo and EDP follow the BioC format, while Cochrane and NHS follow the format of the UFAL Corpus (sgm).
Table 3
The results of our submissions are presented with respect to the different datasets. Table 5 depicts all the BLEU scores of our submissions. For the Scielo dataset, our team was the only one that submitted runs. The organisers provided baselines for all language pairs and our best run improves by almost 9 BLEU points over the baseline for EN-PT and EN-ES, and by almost 7 BLEU points over the baseline for PT-EN and ES-EN. There were small differences between the results of the three runs, which suggests that any of the three methods could be used to obtain positive results.
For the EDP dataset (FR-EN and EN-FR) there were eight submissions. Our best run for EN-FR had a gain of around 10 BLEU points over the baseline, and our best run for FR-EN a gain of around 6 BLEU points. Considering our runs, there is a difference of 1 BLEU point between run 2 and run 3 for FR-EN and a difference of 0.5 BLEU points between run 3 and run 2 for EN-FR. This indicates that the union method provides the best results.
On the Cochrane and NHS datasets our team was the only one that submitted for EN-ES, obtaining high BLEU scores (48.99, 48.45 and 48.70 for Cochrane and 40.97, 41.20 and 41.22 for NHS). The differences between the runs are again very small. For EN-FR there were two teams participating. In our runs the union method gave better results for both datasets. For EN-DE there were six teams participating and the differences between our runs are again small.
In the general ranking among all participating teams, our team ranked first for EN-FR for the Cochrane and NHS datasets, second on FR-EN and third on EN-FR for the EDP datasets, last place on EN-DE for the Cochrane and NHS datasets, and was the only team submitting for Scielo (PT-EN, EN-PT, ES-EN, EN-ES) as well as for Cochrane and NHS (EN-ES). Lavie (2010) points out that BLEU scores above 30 reflect understandable translations, while scores over 50 are considered good and fluent translations. Of the 36 runs submitted by our team, 24 have BLEU scores between ≈32 and ≈49 (for six language pairs). Therefore, we conclude that the method presented obtains generally good translation results on a variety of language pairs.
Another important result consists in the fact that small amounts of general domain data were selected using ATD ranging from 3.1% up to 9.35%. This represents a promising direction for applying this method on much larger general domain corpora where selecting small amounts of data matters even more. The union of the selected sentences with the classifiers trained on the source and target languages ranges from 5.6% up to 12.1%.
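The relation between the three runs and the reported percentages can be illustrated with a small sketch. The sentence pair identifiers, the corpus size and the function name are hypothetical; the percentages printed here are not our results.

```python
def selection_stats(selected_ids, general_size):
    """Number of selected sentence pairs and their share of the general domain corpus."""
    return len(selected_ids), 100.0 * len(selected_ids) / general_size


general_size = 1000         # hypothetical number of general domain sentence pairs
run1 = {3, 17, 42, 56, 99}  # pairs kept by the classifier trained on the source side
run2 = {17, 42, 73, 250}    # pairs kept by the classifier trained on the target side
run3 = run1 | run2          # union; a set holds each pair only once

for name, run in (("run 1", run1), ("run 2", run2), ("run 3", run3)):
    count, share = selection_stats(run, general_size)
    print(f"{name}: {count} pairs ({share:.2f}% of the general domain data)")
```

Since the two classifiers agree on some sentence pairs, the union is smaller than the sum of the two individual selections.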
The following table presents the amount of general domain data selected using ATD for the three runs along with the percentage of general domain data that it represents:
The average duration for training the Doc2Vec models was ≈ 2.5 hours and the average duration for ten-fold cross-validation was ≈ 12 minutes, which represents an advantage in terms of time consumption since afterwards only one MT system needs to be trained.

Conclusions and Future Work
We presented the University of Hamburg participation in the WMT Biomedical task. The main contribution of our work consists in developing an automatic threshold detection method for data selection which yields good results for seven language pairs and four data sets. It requires little time for obtaining the general domain sentences that are considered most similar to the in-domain. For six of the seven language pairs, the BLEU scores that our method obtained are in the range between 32 and 49. Generally, the best results among our three runs are obtained using the union approach, but the differences from the other runs are small, suggesting that there is no clear preference for one of the approaches.
Since we evaluated our approach only with respect to the WMT task, we intend to apply it to other in-domains and language pairs, as well as to compare it directly with standard state-of-the-art methods.