Data Selection for IT Texts using Paragraph Vector

This paper presents an overview of the system submitted by the University of Hamburg to the IT domain shared translation task as part of the ACL 2016 First Conference on Machine Translation (WMT 2016). We have chosen data selection as a domain adaptation method. The filtering of the general domain data makes use of paragraph vectors as a novel approach for scoring the sentences. Experiments were conducted for English-German under the constrained condition.


Introduction
The WMT 2016 shared task of translating IT documents focuses on the translation of answers in a cross-lingual help-desk service. This paper describes the system submitted by the University of Hamburg to this task. We took part in the English-German translation track, in which twelve systems (seven constrained and five unconstrained ones) from four different organizations participated. The challenges for this task came from the fact that the available in-domain data for the constrained condition is very small. Moreover, the in-domain data differs considerably from any of the domains of the given general domain data.
We propose a method of data selection that filters the general domain data by applying a threshold on the similarity between vector representations of the sentences from the general domain and the in-domain. Sentences are described by paragraph vectors, which are trained together with word vectors in order to predict the upcoming words within that paragraph (Le and Mikolov, 2014). Given a sentence from the general domain, our procedure identifies a set of candidate sentences that are most similar to it. If at least one of the retrieved sentences comes from the in-domain, then the general domain sentence is considered similar to the in-domain; otherwise it is discarded. This binary decision has the advantage that only one MT system needs to be trained and the disadvantage that it yields only a fixed ratio of general domain data to be kept, depending on the chosen threshold.
In order to overcome the disadvantage that the paragraph vector method has, we extend it from using a binary decision filtering to scoring and ranking all the sentences from the general domain from which a certain amount of training sentences can be selected. This extended version is a prerequisite for being able to train and compare multiple MT systems using different ratios of data to be kept.
We first summarize related work in data selection for Statistical Machine Translation (SMT) in Section 2, then describe Paragraph vector, which we used for our data selection method, in Section 3. Section 4 presents the experimental settings of the submitted systems and Section 5 contains an overview of their performance in the shared IT task.

Related work
A range of different methods for domain adaptation of models for statistical machine translation have been developed, including mixture modeling, instance weighting, transductive learning, or data selection.
The data selection approach is the focus of this paper. In the state of the art, data selection is used at the corpus level, where the selected data is joined together, or at the model level, where several models are combined in the translation phase (Wang et al., 2013a). The main workflow of the data selection method consists of the following steps:
• scoring: a measure is used to determine how similar the sentences from the general domain are to the in-domain
• filtering: sentences from the general domain are selected if their similarity score is greater than a predefined threshold
• training: the selected sentences are used as additional training data to develop the language model, to weight the phrase pairs or for tuning purposes.
To compute the similarity score three approaches are commonly used: information retrieval inspired, perplexity-based and edit distance similarity inspired.
TF-IDF¹ term weighting as used in information retrieval was adopted by Hildebrand et al. (2005), where each sentence from the source side of the bilingual training data constitutes one document (represented using TF-IDF) and each sentence from the test data is used as a query. The cosine similarity is used to compute the relevance of the queries to the documents. Lü et al. (2007) also use the cosine similarity to select sentences for offline and online training data optimization. Tamchyna et al. (2012) present a method where sentences are extracted from the general domain by translating the source side of a test set and using it in computing the cosine similarity to the general domain.
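The information-retrieval style of scoring described above can be sketched in a few lines. This is a minimal illustration, not the implementation of any of the cited papers: the function names, the smoothed IDF formula and the whitespace-tokenized toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed IDF (an assumption of this sketch, not taken from the cited work).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in documents]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_documents(query, documents):
    """Score every training sentence (document) against one test sentence (query)."""
    vectors, idf = tfidf_vectors(documents)
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scored, reverse=True)


documents = [
    "install the printer driver".split(),
    "the weather is nice".split(),
    "click the install button".split(),
]
ranking = rank_documents("install driver".split(), documents)
```

Here `ranking` lists `(score, document index)` pairs from most to least relevant; the training sentences with the highest scores would be the ones kept.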
In Mandal et al. (2008) and in Axelrod et al. (2011) language model perplexity was used to score sentences. Foster et al. (2010) used phrase pairs instead of sentences and learned weights for them using in-domain features based on word frequencies and perplexities. In Mansour et al. (2011), the cross-entropy score is used for language model filtering together with a translation model score that estimates the likelihood that a source and a target sentence are a translation of each other. Toral et al. (2015) introduced linguistic information such as lemmas, named entities and part-of-speech tags into the preprocessing of the data and then ranked the sentences by perplexity.
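To make the idea of perplexity-based scoring concrete, the following sketch ranks general domain sentences by their perplexity under a language model estimated on in-domain text. It is a deliberately small stand-in: the add-one smoothed unigram model, the class and function names, and the toy data are assumptions of this example, whereas the cited approaches rely on higher-order n-gram models.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram language model (toy stand-in for an n-gram LM)."""

    def __init__(self, sentences):
        self.counts = Counter(tok for sent in sentences for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def prob(self, token):
        return (self.counts[token] + 1) / (self.total + self.vocab_size)

    def perplexity(self, sentence):
        # Assumes a non-empty tokenized sentence.
        log_prob = sum(math.log(self.prob(tok)) for tok in sentence)
        return math.exp(-log_prob / len(sentence))


def rank_by_perplexity(in_domain, general):
    """Sort general domain sentences from most to least in-domain-like."""
    lm = UnigramLM(in_domain)
    return sorted(general, key=lm.perplexity)


in_domain = ["click the button".split(), "open the menu".split()]
general = ["the weather is nice".split(), "click the menu".split()]
ranked = rank_by_perplexity(in_domain, general)
```

A lower perplexity means the in-domain model finds the sentence less surprising, so the head of `ranked` is what a perplexity filter would keep.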
The edit distance, which computes the minimum number of edits needed to transform a sentence from the general domain into a sentence from the in-domain, was used in Wang et al. (2013b). A combination of the three data selection approaches is presented in Wang et al. (2013a, 2013c).
We propose a new approach of filtering general domain sentences using paragraph vectors (Le and Mikolov, 2014) to determine sentence similarity in a high-dimensional vector space. To the knowledge of the authors, this is the first time Paragraph vector is applied to data selection for SMT.
¹ Term frequency - inverse document frequency
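For comparison with the vector-based approach, the edit-distance measure mentioned in the related work above can be sketched as a standard Levenshtein computation over tokens. Treating whole tokens as the edit unit and using unit costs are assumptions of this sketch; the cited work may differ in both respects.

```python
def edit_distance(source, target):
    """Minimum number of token insertions, deletions and substitutions."""
    previous = list(range(len(target) + 1))
    for i, src_tok in enumerate(source, start=1):
        current = [i]
        for j, tgt_tok in enumerate(target, start=1):
            substitution = previous[j - 1] + (src_tok != tgt_tok)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


distance = edit_distance("click the red button".split(), "click the button".split())
```

A general domain sentence with a small distance to some in-domain sentence would count as similar under this measure.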

Paragraph vector
In this section we describe Paragraph vector (Le and Mikolov, 2014) which stands at the core of the proposed data selection method. It has been successfully employed in sentiment detection and information retrieval tasks. Le and Mikolov (2014) propose an unsupervised framework that learns continuous distributed vector representations for phrases, sentences or documents.
The idea of learning paragraph vectors is similar to the approach used in learning word vectors (Mikolov et al., 2013): word vectors are used in predicting a word given its sentential context, and paragraph vectors apply the same idea to contexts sampled from a paragraph.
The model maps context words and a paragraph identifier to the word that is going to be predicted. The contexts have a fixed length and are sampled from a sliding window over the paragraph. The mapping is established by means of two matrices: one consisting of the trained paragraph vectors and the other consisting of word vectors. The paragraph vector is shared among all the contexts sampled from the same paragraph (but not among all paragraphs). The word vectors are shared between all the paragraphs. Paragraph and word vectors are combined during training and inference either by concatenation or by averaging. The paragraph and word vectors are trained on pairs consisting of the word to be predicted and a sampled context tagged by a paragraph identifier (Le and Mikolov, 2014).
We use single sentences as paragraphs. We adopted Paragraph vector because paragraph vectors reflect semantic relatedness, similar to word vectors. Moreover, we have chosen paragraph vectors for representing sentences because the approach does not require tuning, parsing or the availability of labeled data. The implementation of paragraph vectors we used is Doc2vec from the gensim toolkit² (Řehůřek and Sojka, 2010).
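The mapping from a paragraph identifier and context words to a predicted word, as described above, can be illustrated with a single forward pass. This is a toy sketch under stated assumptions: the vectors are random and untrained, the averaging variant is used, the output layer is a plain softmax over a four-word vocabulary, and all names (`pv_dm_predict`, `output_weights`, and so on) are hypothetical. It is not the gensim implementation.

```python
import math
import random


def pv_dm_predict(paragraph_vec, context_vecs, output_weights):
    """Average the paragraph vector with the context word vectors, then apply a softmax."""
    inputs = [paragraph_vec] + context_vecs
    dim = len(paragraph_vec)
    hidden = [sum(vec[d] for vec in inputs) / len(inputs) for d in range(dim)]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
vocab = ["click", "the", "start", "button"]
dim = 5
# One matrix of word vectors shared by all paragraphs, one vector per paragraph.
word_vecs = {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}
paragraph_vecs = {0: [rng.uniform(-0.5, 0.5) for _ in range(dim)]}
output_weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

# Distribution over the next word given paragraph 0 and the context "click the".
probs = pv_dm_predict(paragraph_vecs[0], [word_vecs["click"], word_vecs["the"]], output_weights)
```

Training would adjust both the word vectors and the paragraph vector so that the probability of the word actually following the context increases; after training, the paragraph vector serves as the sentence representation.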

Experiments
For all the submitted systems, we used only the data distributed for the shared IT task. For the general domain training data we chose Commoncrawl³ (made available by WMT) because it is a relatively large corpus and contains crawled data from a variety of domains including the IT domain. As in-domain training data we concatenated the corpora provided by the task. We tuned the systems with 2000 sentences from Batch1a and Batch2a provided by the shared task and evaluated them on Batch3a.
Our systems have been developed using the Moses phrase-based MT toolkit (Koehn et al., 2007) and the Experiment Management System (Koehn, 2010) that facilitates the preparation of scripts for experiments.

Data preprocessing
All the available data were tokenized, cleaned (i.e. restricted to a maximum sentence length of 80 words) and lowercased. The general domain data was filtered by removing the sentence pairs that do not pertain to the English-German language pair as well as sentences that contain non-alpha characters. In addition, punctuation was normalized using the normalize-punctuation.perl script. Approximately 25K sentences were removed because they were not considered English-German sentence pairs by the jlangdetect library⁴, and a further 650 sentences were discarded because they contained non-alpha characters. Table 1 presents some data statistics for both domains after preprocessing.
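The lowercasing and length-based cleaning steps can be sketched as follows. This is only an approximation of the preprocessing described above: whitespace splitting stands in for the actual tokenizer, dropping pairs with an empty side is an added assumption, and language identification, punctuation normalization and the non-alpha filter are left out. The function name `clean_corpus` is hypothetical.

```python
def clean_corpus(pairs, max_len=80):
    """Lowercase sentence pairs and drop pairs with an empty or overlong side."""
    cleaned = []
    for source, target in pairs:
        src_tokens = source.lower().split()
        tgt_tokens = target.lower().split()
        if not src_tokens or not tgt_tokens:
            continue  # assumption: pairs with an empty side are useless for training
        if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
            continue  # maximum sentence length of 80 words, as stated in the text
        cleaned.append((" ".join(src_tokens), " ".join(tgt_tokens)))
    return cleaned


pairs = [
    ("Click the Button .", "Klicken Sie auf die Schaltfläche ."),
    ("word " * 81, "kurz"),
    ("", "leer"),
]
cleaned = clean_corpus(pairs)
```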

Experimental settings
We performed word alignment using GIZA++ (Och and Ney, 2003) with the default grow-diag-final-and alignment symmetrization method. For the language model (LM) estimation we trained 5-gram LMs using the SRILM toolkit (Stolcke, 2002) with Kneser-Ney discounting (Kneser and Ney, 1995) on the target side of the Commoncrawl and IT corpora. When LM interpolation was needed, the in-domain LM and the general domain LM were interpolated using weights tuned to minimize the perplexity on the tuning set. The same data was used for tuning the systems with MERT (Och, 2003). For the BLEU-cased scores, the recaser was trained using the default configuration from the EMS script: a language model of order 3 trained using KenLM (Heafield, 2011). Due to time limitations, we did not try to further improve the recaser model.
² models/doc2vec.html
³ http://commoncrawl.org/
⁴ https://github.com/melix/jlangdetect
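The perplexity-driven tuning of the interpolation weight can be illustrated as follows. The sketch assumes that each of the two language models has already assigned a strictly positive probability to every token of the tuning set, and it uses a simple grid search over the weight; this is a stand-in chosen for clarity, not the optimizer used by the toolkit. The function names and the probabilities are made up for the example.

```python
import math


def interpolated_perplexity(lam, in_probs, gen_probs):
    """Perplexity of the mixture lam * P_in + (1 - lam) * P_gen on the tuning tokens."""
    log_sum = sum(
        math.log(lam * p_in + (1.0 - lam) * p_gen)
        for p_in, p_gen in zip(in_probs, gen_probs)
    )
    return math.exp(-log_sum / len(in_probs))


def tune_weight(in_probs, gen_probs, steps=100):
    """Grid search for the in-domain weight that minimizes tuning-set perplexity."""
    candidates = [k / steps for k in range(steps + 1)]
    return min(candidates, key=lambda lam: interpolated_perplexity(lam, in_probs, gen_probs))


# Hypothetical per-token probabilities from the in-domain and general domain LMs.
in_probs = [0.2, 0.1, 0.05, 0.3]
gen_probs = [0.05, 0.2, 0.1, 0.1]
best_lam = tune_weight(in_probs, gen_probs)
```

The resulting `best_lam` is the weight of the in-domain model; the general domain model receives `1 - best_lam`.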

Baselines
The baseline system UHBS_simple was trained on the concatenation of the in-domain data and the complete general domain data. The second baseline, UHBS_lmi, differed from UHBS_simple only in its language model, which was created by LM interpolation. The motivation for training a second, stronger baseline is that we intended to compare the translation results of the system submitted to the competition (UHDS_doc2vec) with those produced by a competitive approach.

Data selection using Doc2vec
In this section the submitted system UHDS_doc2vec is described.
The filtering procedure receives as input the bilingual in-domain corpus In, the bilingual general domain corpus Gen, the number N of most similar sentences that should be retrieved, and a threshold δ that will be described later. Our approach is monolingual, as we used only the source side of the corpus data to select sentences from the general domain corpus. To train the paragraph vectors we concatenated In and Gen, resulting in the data set C. Training the doc2vec model required tagging every sentence from the source side of the concatenated corpus, C_source, with its corresponding line number in the corpus and building a vocabulary from the tagged corpus. Therefore, a sentence that came from In was tagged with a number from [1, size_In] and a sentence that came from Gen was tagged with a number from [size_In + 1, size_In + size_Gen].
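The tagging scheme described above can be sketched in plain Python. The functions `tag_corpus` and `origin` are hypothetical helpers written for this illustration; in the actual system the tagged sentences would be handed to the doc2vec trainer, which is not shown here.

```python
def tag_corpus(in_domain_source, general_source):
    """Tag each source sentence of C = In + Gen with its 1-based line number."""
    concatenated = list(in_domain_source) + list(general_source)
    return [(tag, sentence.split()) for tag, sentence in enumerate(concatenated, start=1)]


def origin(tag, size_in, size_gen):
    """Map a tag back to the corpus it came from, using the ranges given in the text."""
    if 1 <= tag <= size_in:
        return "In"
    if size_in < tag <= size_in + size_gen:
        return "Gen"
    raise ValueError(f"tag {tag} is outside the concatenated corpus")


tagged = tag_corpus(["open the menu"], ["nice weather", "click ok"])
```

Because the in-domain sentences come first in the concatenation, the tag alone is enough to decide later whether a retrieved neighbour is an in-domain sentence.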
The doc2vec model was trained on the tagged C_source. After obtaining the doc2vec model M, the algorithm iterates through every sentence pair (s_i, t_i) ∈ Gen and retrieves Sim_{s_i}, the list of the top N sentences most similar to s_i, each given as a pair (index, score). This list is then filtered by comparing the scores to a prespecified threshold δ, creating a reduced data set FilteredCorpus. A sentence pair (s_i, t_i) is included in FilteredCorpus if at least one pair (index, score) ∈ Sim_{s_i} originates from the in-domain (index < size_In) and has a score > δ. With δ = 0.5 we selected 47% of the sentences of Gen. Systematic experiments with other values of δ are planned for future work. Finally, we trained the system UHDS_doc2vec on a concatenation of the reduced general domain corpus FilteredCorpus and the in-domain data In. Two separate language models were trained on the in-domain data In and the full general domain corpus Gen. They were interpolated, and the interpolated model was used in both UHBS_lmi (strong baseline) and UHDS_doc2vec (the submission to the competition). In Figure 1 the pseudocode for filtering the general domain corpus is presented.
Figure 1 (excerpt):
    for each sentence pair (s_i, t_i) ∈ Gen do
        if ∃(index, score) ∈ Sim_{s_i} : (index < size_In, score > δ) then
            add (s_i, t_i) to FilteredCorpus
Doc2vec filtering selects in one step all the general domain sentences similar to the in-domain, producing one FilteredCorpus. In the end, each sentence from Gen is either discarded or added to FilteredCorpus.
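The binary filtering decision can be written down compactly once the similarity lists are available. In the sketch below, the lists are passed in as precomputed `(index, score)` pairs rather than queried from a trained model, and indices are assumed to be 0-based so that the test `index < size_in` picks out exactly the in-domain sentences. The function name and the toy data are hypothetical.

```python
def filter_general_domain(general_pairs, similar, size_in, delta=0.5):
    """Keep a general domain pair if any of its top-N neighbours is an
    in-domain sentence whose similarity score exceeds the threshold delta."""
    filtered = []
    for pair, neighbours in zip(general_pairs, similar):
        if any(index < size_in and score > delta for index, score in neighbours):
            filtered.append(pair)
    return filtered


general_pairs = [
    ("install the driver", "installieren sie den treiber"),
    ("nice weather today", "schönes wetter heute"),
    ("restart the computer", "starten sie den computer neu"),
]
# similar[i] holds the N = 2 nearest neighbours of the i-th general domain sentence.
similar = [
    [(0, 0.81), (5, 0.40)],  # in-domain neighbour above the threshold: keep
    [(3, 0.90), (1, 0.20)],  # in-domain neighbour below the threshold: drop
    [(4, 0.70), (1, 0.50)],  # score equal to delta does not pass the strict test: drop
]
filtered_corpus = filter_general_domain(general_pairs, similar, size_in=2, delta=0.5)
```

Since each sentence pair is either kept or dropped, the size of the result is fully determined by `delta` and N, which is the limitation the scoring variant below addresses.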
In order to be able to compare our method with other data selection approaches, we replaced the binary decision step of the algorithm (the existence test in Figure 1) with a step that produces a score for each sentence s_i ∈ Gen (Figure 2). Therefore, in addition to the systems submitted to the WMT competition, we also conducted experiments with the extended Doc2vec algorithm and with a perplexity-based metric which represents the state of the art in data selection for MT (Axelrod et al., 2011). We name the method presented in Figure 2 SEF (Sentence Embedding Filtering) and the state-of-the-art method PPL (Perplexity).
In addition to the input parameters used by the algorithm presented in Figure 1, the adapted algorithm also receives as input a percentage P which determines the share of sentences to be selected from Gen. Given a sentence s_i ∈ Gen, the SEF method uses the similarity scores between s_i and its N most similar sentences for producing a final score. Moreover, since the position j in Sim_{s_i} matters, we multiply each intermediary score by the inverse position (N − j + 1). For example, if the most similar sentence to s_i is s_j, placed at the first position in Sim_{s_i}, then their score, score_ij, is multiplied by the highest possible value N. After scoring all the sentences from Gen, they are sorted by their score in descending order.
The comparison between SEF and PPL was evaluated on a range of percentages from 10% to 90%, in steps of 10.
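The position-weighted scoring can be sketched as below. Two points are assumptions of this illustration rather than statements of the paper: only in-domain neighbours (0-based `index < size_in`) are taken to contribute to the sum, and N is taken to be the length of the neighbour list. The function names are hypothetical.

```python
def sef_score(neighbours, size_in):
    """Sum the similarity scores of in-domain neighbours, weighting the
    neighbour at position j (1-based) by the inverse position N - j + 1."""
    n = len(neighbours)
    return sum(
        score * (n - j + 1)
        for j, (index, score) in enumerate(neighbours, start=1)
        if index < size_in  # assumption: only in-domain neighbours contribute
    )


def rank_general(similar, size_in):
    """Return the positions of the general domain sentences, best score first."""
    scores = [sef_score(neighbours, size_in) for neighbours in similar]
    return sorted(range(len(similar)), key=lambda i: scores[i], reverse=True)


similar = [
    [(5, 0.9), (6, 0.8)],  # no in-domain neighbour: score 0
    [(0, 0.9), (1, 0.8)],  # 0.9 * 2 + 0.8 * 1
    [(4, 0.9), (1, 0.6)],  # 0.6 * 1
]
ranking = rank_general(similar, size_in=2)
```

With this weighting, an in-domain sentence found at the top of the neighbour list counts N times as much as one found at the bottom.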
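Selecting the data for each run of such a comparison amounts to cutting the ranked list at different points. The sketch below assumes the sentence pairs are already sorted best first and rounds the cut-off down to a whole number of sentences; both helper names are hypothetical.

```python
def select_top_percent(ranked_pairs, percent):
    """Keep the best-scoring `percent` of an already ranked list (rounded down)."""
    keep = len(ranked_pairs) * percent // 100
    return ranked_pairs[:keep]


def selection_sweep(ranked_pairs, start=10, stop=90, step=10):
    """Build one training subset per selection ratio, from 10% to 90% by default."""
    return {p: select_top_percent(ranked_pairs, p) for p in range(start, stop + 1, step)}


ranked = [f"pair{i}" for i in range(20)]
sweep = selection_sweep(ranked)
```

Each entry of `sweep` would serve as the general domain portion of the training data for one MT system.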

Results
In this section we present the evaluation scores obtained in the WMT competition for the three submitted systems. Moreover, we present the evaluation scores for the SEF and PPL methods and discuss the results. Table 2 presents the BLEU (Papineni et al., 2002), the BLEU-cased and the TER (Snover et al., 2006) scores. According to their BLEU scores, the strong baseline, UHBS_lmi, performs almost on a par with the filtered general domain system, UHDS_doc2vec, but with respect to TER UHDS_doc2vec clearly outperforms the baseline. The results are encouraging, since our selection method filtered out more than 50% of the general domain data without a substantial loss of translation quality compared to the strong baseline.
Figure 2: Doc2vec filtering algorithm adapted to select a given percentage P of sentences (excerpt)
    for each sentence pair (s_i, t_i) ∈ Gen do
        for (index_j, score_j) ∈ Sim_{s_i} do
        …
    add (s_i, t_i) to FilteredCorpus_P
The BLEU and TER scores for the SEF and PPL methods are given in Table 3. The maximum BLEU score (37.12) was achieved by SEF when selecting 70% of Gen. The PPL method achieved its maximum BLEU score (36.75) at a 90% ratio of Gen, which is close to the score it already achieved at a 30% ratio (36.71). The SEF method achieved a similar score at a 30% ratio (36.65). The TER scores are very close for most of the steps, with the lowest score achieved by the PPL method at a 30% ratio (0.532). A very similar score was obtained by the SEF method at a 50% ratio (0.535). Overall, however, the best BLEU and TER scores were still achieved by the systems submitted to WMT, UHDS_doc2vec and UHBS_lmi.

Conclusions
In this paper we presented the system the University of Hamburg submitted to the WMT shared task of translating IT texts. We introduced a new method of data selection for filtering the general domain data by searching for sentences that are similar to the in-domain. The novel contribution of our approach consists in using paragraph vectors to capture crucial meaning aspects of a sentence and using them to determine inter-sentential similarity. With less than 50% of the general domain data, the system performs almost as well as the strong baseline in terms of BLEU. We also presented an adaptation of the paragraph vector filtering method that is able to select any required percentage of the general domain data, and we conducted experiments using a range of ratios for this method and a state-of-the-art method. The BLEU results indicated that the adapted paragraph vector method outperforms the state-of-the-art method.
Table 3: Evaluation results for SEF and PPL
These results make filtering using paragraph vector for scoring sentences particularly attractive for scenarios where a large pool of general domain data is available, but only a very small amount of in-domain data.