USAAR-SAPE: An English–Spanish Statistical Automatic Post-Editing System

We describe the USAAR-SAPE English–Spanish Automatic Post-Editing (APE) system submitted to the APE Task organized in the Workshop on Statistical Machine Translation (WMT) in 2015. Our system was able to improve upon the baseline MT system output by incorporating Phrase-Based Statistical MT (PBSMT) techniques into the monolingual Statistical APE (SAPE) task. The reported final submission crucially involves hybrid word alignment. The SAPE system takes raw Spanish Machine Translation (MT) output provided by the shared task organizers and produces post-edited Spanish text. The parallel data consist of English text, raw machine-translated Spanish output, and the corresponding manually post-edited versions. The major goal of the task is to reduce the post-editing effort by improving the quality of the MT output in terms of fluency and adequacy.


Introduction
In this paper, we present the submission of Saarland University (USAAR) to the WMT2015 APE task. The system combines a hybrid word alignment implementation with a monolingual PBSMT system for the language pair English-Spanish (EN-ES), translating from English into Spanish.
In order to achieve the desired translation quality, translations provided by MT systems need to be corrected by human translators. Automatic MT post-editing (APE) (Knight and Chander, 1994) is the method of improving raw MT output before human post-editing is performed on it. The objective is to decrease the number of errors produced by MT systems and thereby increase productivity in the translation process.
APE tasks usually focus on fluency errors produced by the MT system. The most frequent ones are incorrect lexical choices, incorrect word order, inserted words and deleted words. For the WMT2015 APE task, we adapted our system to automatically post-edit lexical choice errors, word insertions and deletions. The method is also able to correct word order errors to some extent.
The remainder of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 describes the components of our system, in particular the corpus preprocessing module, the hybrid word alignment module and the PBSMT model. In Section 4, we outline the complete experimental setup. Section 5 presents the results of the automatic and human evaluation, followed by the conclusion in Section 6.

Related Work
Various automatic or semi-automatic post-processing and automatic PE techniques have been developed to correct repetitive errors in MT output. Although MT output needs to be post-edited by humans to produce translations of publishable quality (Roturier, 2009; TAUS/CNGL Report, 2010), it is faster and cheaper to post-edit MT output than to translate from scratch. Recent studies have shown that, in some cases, the quality of MT output plus PE can exceed the quality of human translation (Fiederer and O'Brien, 2009; Koehn, 2009; De Palma and Kelly, 2009), and that productivity can increase as well (Zampieri and Vela, 2014). For MT to be used in a cost-effective and time-saving way, the PE process needs to be further optimised (TAUS/CNGL Report, 2010). Post-editing can also be used as an MT evaluation method. It requires at least source and target language skills, unlike ranking, which does not require specific skills: a homogeneous group of evaluators is enough to perform the task (Vela and van Genabith, 2015).
The aim of automatic post-editing (APE) is to improve the output of MT by post-processing it. One of the first approaches was introduced by Chen and Chen (1997), who proposed a combination of rule-based MT (RBMT) and statistical MT (SMT) systems aimed at merging the positive properties of each system type for a better machine translation output. Simard et al. (2007a) and Simard et al. (2007b) have shown how a PBSMT system can be used for automatic post-editing of an RBMT system for translations from English to French and French to English. Because RBMT systems tend to produce repetitive errors, they train an SMT system to correct these errors, with the aim of reducing the post-editing effort. The SMT system is trained with the output of the RBMT system as the source language and the reference human translations as the target language. Their evaluation shows that the post-edited output is of better quality than both the output of the RBMT system and the output of the same SMT system used in standalone translation mode. Lagarda et al. (2009) use an approach similar to Simard et al. (2007a) for translations from English to Spanish. The method was evaluated automatically and manually by comparing the APE output with the output of an RBMT system and an SMT system. The two corpora used in the evaluation were transcriptions of parliamentary speeches and medical protocols. The evaluation results show that the method improves the RBMT system on the transcriptions of parliamentary speeches. Rosa et al. (2012) and Mareček et al. (2011) applied APE to English-to-Czech MT output at the morphological level. Based on word alignment, the method learns during the training phase 20 hand-written rules based on the most frequent errors encountered in translation. The method addresses fluency in translation and corrects morphosyntactic categories of a word such as number, gender, case, person and dependency label.
Parton et al. (2012) present an approach to APE consisting of three stages: detecting errors, suggesting and ranking corrections for the errors, and applying the suggested corrections. For the last stage, Parton et al. (2012) developed two different methodologies, a rule-based APE and a feedback APE. The rule-based APE performs either insertions or replacements to address an identified error. The feedback APE, an approach similar to the one proposed by Parton and McKeown (2010), passes the possible corrections to the MT system, letting the MT decoder decide whether an error should be corrected and how. Parton et al. (2012) evaluated their approach with human evaluators and found that the adequacy of post-edited MT output improved for both rule-based and feedback APE. With respect to fluency, the human evaluation showed that the increase in adequacy is related to fluency for feedback APE, but not for rule-based APE.
Denkowski (2015) developed a method for integrating post-edited MT output into a translation model in real time by extracting a grammar for each input sentence. The method, based on Levenberg et al. (2010) and Lopez (2008), allows the indexing of the source and the post-edited MT output, as well as the union of the already existing sentence pairs with the new post-edited data. The system can also remember the rules that are consistent with the post-edited data. This way, rules learned from human corrections can be preferred. The experiments Denkowski (2015) ran on translation between English and Spanish and between English and Arabic, in both directions, show that translating with an adaptive grammar improves performance on post-editing tasks.

System Description
Our system is designed with three basic components: corpus preprocessing, hybrid word alignment and a PBSMT system integrated with the hybrid word alignment. The hybrid word alignment combines multiple word alignments into a single word alignment table, which is later used in the PBSMT system. Our SMT-based SAPE systems were trained on monolingual Spanish MT output and the manually post-edited output.

Corpus Preprocessing
For training our system we used the sentence-aligned training data provided by the organizers of the WMT2015 APE task. The training data consist of 11,272 parallel segments of English-to-Spanish MT output together with the post-edited versions of that output. The English source text, the machine-translated Spanish output and the corresponding post-edited version contain 238,335, 257,644 and 257,881 tokens respectively.
The preprocessing of the training corpus was carried out first by stemming the Spanish MT output and the PE data using Freeling (Padró and Stanilovsky, 2012).

Statistical Word Alignment
GIZA++ (Och and Ney, 2003) is a statistical word alignment tool which implements maximum likelihood estimators for the IBM-1 to IBM-5 models, an HMM alignment model and the IBM-6 model, covering many-to-many alignments. GIZA++ facilitates fast development of statistical machine translation (SMT) systems. Like GIZA++, the Berkeley Aligner (Liang et al., 2006) is used to align words across sentence pairs. It uses an extension of Cross Expectation Maximization and is jointly trained with HMM models. We use a third statistical word aligner, SymGiza++ (Junczys-Dowmunt and Szał, 2012), which modifies the counting phase of each GIZA++ model so that the symmetrized models can be updated between the chosen iterations of the original training algorithms. It computes symmetric word alignment models and can take advantage of multi-processor systems.

Edit Distance-Based Word Alignment
We use two different kinds of edit distance-based word aligners: an aligner based on TER (Translation Edit Rate) and the METEOR word aligner. TER (Snover et al., 2006) was developed for automatic evaluation of MT output. TER can align two strings such as the reference (in this case the PE translation) and the hypothesis (the MT output). In our work, the reference string is chosen as the confusion network skeleton, and the hypotheses are aligned independently against the skeleton. These pair-wise alignments may be consolidated to form a confusion network. TER measures the ratio of the number of edit operations required to turn a hypothesis H into the corresponding reference R to the total number of words in R. The allowable edit types are insertion (Ins), substitution (Sub), deletion (Del) and phrase shift (Shft). TER is computed as
$$\mathrm{TER}(H, R) = \frac{\mathrm{Ins} + \mathrm{Del} + \mathrm{Sub} + \mathrm{Shft}}{\text{total number of words in } R} \times 100\% \qquad (1)$$
METEOR (Lavie and Agarwal, 2007) is also an automatic MT evaluation metric which provides an alignment between the hypothesis (here the MT output) and the reference (here the PE translation). Given a pair of strings H and R to be compared, METEOR first establishes a word alignment between them. The alignment is provided by a mapping between the words in the hypothesis H and the reference translation R, which is built incrementally by the following sequence of word-mapping modules:
• Exact: maps two words if they are exactly the same
• Porter stem: maps two words if they are the same after they are stemmed using the Porter stemmer
• WN synonymy: maps two words if they are considered synonyms in WordNet
If multiple alignments exist, METEOR selects the alignment for which the word order in the two strings is most similar (i.e. the one with the fewest crossing alignment links). The final alignment between H and R is produced as the union of all stage alignments (e.g. exact, Porter stemming and WN synonymy).
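As a rough illustration of the two ideas in this subsection, the following Python sketch computes a TER percentage from given edit counts and builds a staged, METEOR-style word alignment. It is a simplified sketch under stated assumptions, not the implementation of either tool: `toy_stem` is a hypothetical stand-in for the Porter stemmer, the `synonyms` dictionary is a hypothetical stand-in for WordNet, and the greedy left-to-right matching does not reproduce METEOR's selection of the alignment with the fewest crossing links.

```python
from typing import Dict, List, Set, Tuple


def ter_score(ins: int, dele: int, sub: int, shft: int, ref_len: int) -> float:
    """TER as a percentage: number of edits divided by the reference length."""
    if ref_len <= 0:
        raise ValueError("reference must contain at least one word")
    return (ins + dele + sub + shft) * 100.0 / ref_len


def toy_stem(word: str) -> str:
    """Crude suffix stripper (assumption: stands in for a real stemmer)."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def staged_alignment(
    hyp: List[str], ref: List[str], synonyms: Dict[str, Set[str]]
) -> Set[Tuple[int, int]]:
    """Align hypothesis and reference positions in three stages.

    Each stage only considers words left unaligned by earlier stages;
    the result is the union of the links found in all stages.
    """
    stages = [
        lambda h, r: h == r,                      # exact match
        lambda h, r: toy_stem(h) == toy_stem(r),  # match after stemming
        lambda h, r: r in synonyms.get(h, set()) or h in synonyms.get(r, set()),
    ]
    links: Set[Tuple[int, int]] = set()
    used_hyp: Set[int] = set()
    used_ref: Set[int] = set()
    for match in stages:
        for i, h in enumerate(hyp):
            if i in used_hyp:
                continue
            for j, r in enumerate(ref):
                if j not in used_ref and match(h, r):
                    links.add((i, j))
                    used_hyp.add(i)
                    used_ref.add(j)
                    break
    return links


# Hypothetical MT output and post-edited sentence, for illustration only.
mt_output = "la casas es grande".split()
post_edit = "las casa es enorme".split()
alignment = staged_alignment(mt_output, post_edit, {"grande": {"enorme"}})
```

In this toy example the word "es" is linked in the exact stage, "casas" and "casa" in the stemming stage, and "grande" and "enorme" in the synonym stage, while "la" stays unaligned.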

Hybridization
The hybrid word alignment method combines two different kinds of word alignment: statistical alignments, produced by GIZA++ with the grow-diag-final-and (GDFA) heuristic (Koehn, 2010), SymGiza++ (Junczys-Dowmunt and Szał, 2012) and the Berkeley Aligner (Liang et al., 2006), and edit distance-based alignments (Snover et al., 2006; Lavie and Agarwal, 2007). To combine these word alignment tables we used a mathematical union method (Pal et al., 2013). For the union method, we hypothesise that all alignments are correct. Duplicate entries are removed.
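The union step can be sketched in a few lines of Python. This is a minimal illustration for a single sentence pair, assuming each aligner's output has already been parsed into a collection of (MT-output position, post-edit position) links; the function names and the sample links are hypothetical, and the `i-j` output string follows the alignment notation commonly used by phrase-based SMT toolkits.

```python
from typing import Iterable, List, Set, Tuple

Link = Tuple[int, int]  # (MT-output position, post-edit position)


def union_alignments(tables: Iterable[Iterable[Link]]) -> List[Link]:
    """Merge several alignment tables for one sentence pair.

    Every link proposed by any aligner is kept; using a set removes
    duplicate entries automatically.
    """
    merged: Set[Link] = set()
    for table in tables:
        merged.update(table)
    return sorted(merged)


def to_alignment_string(links: Iterable[Link]) -> str:
    """Serialise links in the common 'i-j' alignment notation."""
    return " ".join(f"{i}-{j}" for i, j in links)


# Hypothetical outputs of three aligners for the same sentence pair.
statistical_links = [(0, 0), (1, 1)]
second_statistical_links = [(0, 0), (1, 2)]
edit_distance_links = [(1, 1), (2, 2)]

hybrid = union_alignments(
    [statistical_links, second_statistical_links, edit_distance_links]
)
hybrid_line = to_alignment_string(hybrid)
```

Because the union keeps every proposed link, it favours recall over precision, which matches the working hypothesis that all alignments are correct.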

Phrase-Based SMT
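Translation is modelled in SMT as a decision process, in which the translation $e_1^I = e_1 \ldots e_i \ldots e_I$ of a source sentence $f_1^J = f_1 \ldots f_j \ldots f_J$ is chosen to maximize equation (4):
$$\operatorname*{argmax}_{I,\, e_1^I} P(e_1^I \mid f_1^J) = \operatorname*{argmax}_{I,\, e_1^I} P(f_1^J \mid e_1^I)\, P(e_1^I) \qquad (4)$$
where $P(f_1^J \mid e_1^I)$ is the translation model and $P(e_1^I)$ the target language model. In log-linear phrase-based SMT, the posterior probability $P(e_1^I \mid f_1^J)$ is directly modelled as a log-linear combination of features (Och and Ney, 2003), involving $M$ translational features and the language model, as in equation (5):
$$\log P(e_1^I \mid f_1^J) = \sum_{m=1}^{M} \lambda_m h_m(f_1^J, e_1^I, s_1^K) + \lambda_{LM} \log P(e_1^I) \qquad (5)$$
where $s_1^K = s_1 \ldots s_K$ denotes a segmentation of the source and target sentences into the sequences of phrases $(\hat{e}_1^K = \hat{e}_1 \ldots \hat{e}_K)$ and $(\hat{f}_1^K = \hat{f}_1 \ldots \hat{f}_K)$ respectively, such that (we set $i_0 = 0$) equation (6) holds:
$$\forall\, 1 \le k \le K: \quad s_k = (i_k, b_k, j_k), \quad \hat{e}_k = e_{i_{k-1}+1} \ldots e_{i_k}, \quad \hat{f}_k = f_{b_k} \ldots f_{j_k} \qquad (6)$$
and each feature $h_m$ in (5) can be rewritten as in (7):
$$h_m(f_1^J, e_1^I, s_1^K) = \sum_{k=1}^{K} \hat{h}_m(\hat{f}_k, \hat{e}_k, s_k) \qquad (7)$$
where $\hat{h}_m$ is a feature that applies to a single phrase pair. It thus follows (8):
$$\sum_{m=1}^{M} \lambda_m \sum_{k=1}^{K} \hat{h}_m(\hat{f}_k, \hat{e}_k, s_k) = \sum_{k=1}^{K} \hat{h}(\hat{f}_k, \hat{e}_k, s_k), \quad \text{where } \hat{h} = \sum_{m=1}^{M} \lambda_m \hat{h}_m \qquad (8)$$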
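The practical consequence of this decomposition is that a hypothesis can be scored phrase pair by phrase pair. The Python sketch below illustrates this with invented feature names, weights and values (all of them assumptions made for the example): summing the weighted features per phrase pair gives the same result as first summing each feature over the sentence and then applying the weights.

```python
import math
from typing import Dict, Sequence

Features = Dict[str, float]


def phrase_pair_score(features: Features, weights: Dict[str, float]) -> float:
    """Weighted sum of the feature values of a single phrase pair."""
    return sum(weights[name] * value for name, value in features.items())


def score_by_phrase(
    phrases: Sequence[Features],
    weights: Dict[str, float],
    lm_logprob: float,
    lm_weight: float,
) -> float:
    """Sum the combined phrase-pair scores, then add the language model term."""
    translation = sum(phrase_pair_score(f, weights) for f in phrases)
    return translation + lm_weight * lm_logprob


def score_by_feature(
    phrases: Sequence[Features],
    weights: Dict[str, float],
    lm_logprob: float,
    lm_weight: float,
) -> float:
    """Sum each feature over all phrase pairs first, then apply the weights."""
    translation = sum(
        weight * sum(f[name] for f in phrases) for name, weight in weights.items()
    )
    return translation + lm_weight * lm_logprob


# Hypothetical weights and feature values for a two-phrase segmentation.
weights = {"log_tm": 0.5, "phrase_penalty": -1.0}
phrases = [
    {"log_tm": -2.0, "phrase_penalty": 1.0},
    {"log_tm": -4.0, "phrase_penalty": 1.0},
]
by_phrase = score_by_phrase(phrases, weights, lm_logprob=-10.0, lm_weight=0.2)
by_feature = score_by_feature(phrases, weights, lm_logprob=-10.0, lm_weight=0.2)
assert math.isclose(by_phrase, by_feature)
```

Scoring per phrase pair is what allows a decoder to accumulate the score incrementally while it extends partial hypotheses.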

Experiments
We performed experiments on the development set provided by the organizers of the WMT2015 APE task.

Experimental Settings
The effectiveness of the present work is demonstrated using the standard log-linear PBSMT model. For building our SAPE system, we experimented with various maximum phrase lengths for the translation model and n-gram settings for the language model. We found that a maximum phrase length of 7 for the translation model and a 5-gram language model produce the best results in terms of BLEU (Papineni et al., 2002) scores for our SAPE model. The other experimental settings concerned the hybrid word alignment training algorithms (described in Section 3) and phrase extraction (Koehn et al., 2003). The reordering model was trained with the hierarchical, monotone, swap, left-to-right bidirectional (hier-mslr-bidirectional) method (Galley and Manning, 2008) and conditioned on both source and target language. The 5-gram target language model was trained using KenLM (Heafield, 2011). Phrase pairs that occur only once in the training data are assigned an unduly high probability mass (i.e. 1). To alleviate this shortcoming, we smoothed the phrase table using the Good-Turing smoothing technique (Foster et al., 2006). System tuning was carried out using Minimum Error Rate Training (MERT) (Och, 2003) optimised with k-best MIRA (Cherry and Foster, 2012) on a held-out development set. After the parameters were tuned, decoding was carried out on the held-out test set.
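The singleton problem mentioned above can be made concrete with a small Python sketch of Good-Turing count discounting. It uses the textbook adjusted count c* = (c + 1) n_{c+1} / n_c, where n_c is the number of phrase pairs seen exactly c times, and falls back to the raw count when n_{c+1} is zero. This is a simplified illustration, not the exact phrase table formulation of Foster et al. (2006); the phrase pairs and counts are invented, and dividing the discounted count by the raw source-phrase total is an assumption made to keep the example short.

```python
from collections import Counter
from typing import Dict, Tuple

PhrasePair = Tuple[str, str]  # (MT-output phrase, post-edited phrase)


def good_turing_counts(counts: Dict[PhrasePair, int]) -> Dict[PhrasePair, float]:
    """Return adjusted counts c* = (c + 1) * n_{c+1} / n_c.

    When no phrase pair has count c + 1, the raw count is kept (a common
    simplification for sparse count-of-count statistics).
    """
    count_of_counts = Counter(counts.values())
    adjusted: Dict[PhrasePair, float] = {}
    for pair, c in counts.items():
        n_c = count_of_counts[c]
        n_next = count_of_counts.get(c + 1, 0)
        adjusted[pair] = (c + 1) * n_next / n_c if n_next > 0 else float(c)
    return adjusted


def conditional_prob(
    pair: PhrasePair, numerators: Dict[PhrasePair, float], raw: Dict[PhrasePair, int]
) -> float:
    """Estimate p(target | source) as numerator / raw total count of the source."""
    source_total = sum(c for (src, _), c in raw.items() if src == pair[0])
    return numerators[pair] / source_total


# Invented phrase-pair counts: four singletons and one pair seen twice.
raw_counts = {
    ("el casa", "la casa"): 1,
    ("un mesa", "una mesa"): 1,
    ("de el", "del"): 1,
    ("casa rojo", "casa roja"): 1,
    ("haga clic", "haga clic"): 2,
}
smoothed = good_turing_counts(raw_counts)
unsmoothed_p = conditional_prob(
    ("de el", "del"), {k: float(v) for k, v in raw_counts.items()}, raw_counts
)
smoothed_p = conditional_prob(("de el", "del"), smoothed, raw_counts)
```

In this toy table the singleton pair has an unsmoothed probability of 1, whereas its discounted estimate is 0.5, so a phrase pair seen once no longer receives the full probability mass.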

Evaluation
The evaluation of our SAPE system was performed on 1,817 Spanish sentences. Two baselines were used: an MT baseline system and the APE system of Simard et al. (2007a). The evaluation was carried out using the HTER score (TER with human-targeted references). Seven groups made a submission to this year's WMT APE task. Among the seven systems, ours was ranked third, achieving an HTER score of 23.426 in the case-sensitive evaluation and 22.710 in the case-insensitive evaluation, and outperforming the baseline APE system, which scored 23.839 and 23.130 respectively.

Conclusion
This paper presents our system submitted to the English–Spanish APE Task at WMT2015. The system demonstrates the crucial role hybrid word alignment can play in SAPE tasks. Edit distance-based monolingual aligners provide alignments for our SAPE system. Incorporating hybrid word alignment into the state-of-the-art PBSMT pipeline provides additional improvements over the baseline APE system.