USAAR: An Operation Sequential Model for Automatic Statistical Post-Editing

Introduction
Translations produced by machine translation (MT) systems have improved substantially over the past few decades. This is particularly noticeable for some language pairs (e.g. English to German and English to French) and for domain-specific language (e.g. technical documentation). Texts produced by MT systems are now widely used in the translation and localization industry. MT output is post-edited by professional translators, and this has become an important part of the translation workflow. A number of studies confirm that post-editing MT output improves translators' productivity and may also impact translation quality and consistency (Guerberof, 2009; Plitt and Masselot, 2010; Zampieri and Vela, 2014).
In this respect, the ultimate goal of MT systems is to provide output that can be post-edited with as little effort as possible by human translators. One strategy to improve MT output is to apply automatic post-editing (APE) methods (Knight and Chander, 1994; Simard et al., 2007a; Simard et al., 2007b). APE methods work under the assumption that some errors made by MT systems are recurrent and can be corrected automatically in a post-processing stage, thus producing output that is more suitable for post-editing. APE methods are applied before human post-editing, increasing translators' productivity.
This paper presents a new approach to APE, submitted by the USAAR team to the Automatic Post-editing (APE) shared task at WMT-2016. Our system combines two models: a monolingual phrase-based model and an operation sequence model, with an edit-distance-based word alignment between English-German (EN-DE) machine translation output and the corresponding human post-edited version of the German translation (Turchi et al., 2016).
APE tasks usually focus on fluency errors produced by the MT system. The most frequent ones are incorrect lexical choices, incorrect word ordering, the insertion of a word and the deletion of a word. For the WMT-2016 APE task, we adopt the operation sequence model (OSM) from SMT to build our statistical APE (SAPE) system, inspired by the work of Durrani et al. (2011) and Durrani et al. (2015). In the OSM, translation and reordering operations are coupled in a single generative story: reordering decisions may depend on preceding translation decisions, and translation decisions may depend on preceding reordering decisions. The model provides a natural reordering mechanism and deals with both local and long-distance reorderings consistently.
The remainder of the paper is organized as follows. Section 2 describes our proposed system, in particular the OSM coupled with PB-SMT. In Section 3, we outline the data used for the experiments and the complete experimental setup. Section 4 presents the results of the automatic evaluation, followed by conclusions and future work in Section 5.

USAAR APE System
Our APE system is based on an operation sequence N-gram model which integrates translation and reordering operations into the phrase-based APE system. Traditional PB-SMT (Koehn et al., 2003) provides a powerful translation mechanism which can directly be modelled as a phrase-based SAPE (PB-SAPE) system (Simard et al., 2007a; Simard et al., 2007b), using target-language MT output (TL_MT) and the corresponding post-edited version (TL_PE) as a parallel training corpus. Like PB-SMT, PB-SAPE suffers from similar drawbacks, such as dependencies across phrases and the handling of discontinuous phrases. Our OSM-APE system is based on a phrase-based N-gram APE model; however, the reordering approach is essentially different: it considers all possible orderings of phrases instead of pre-calculated orientations. The model represents the post-editing process as a linear sequence of operations, such as the lexical generation of the post-edited translation and its ordering. The translation and reordering decisions are conditioned on the n previous translation and reordering decisions. The model can also consistently handle both local and long-range reorderings. A traditional OSM-based MT model consists of three kinds of operations:
• generation of a sequence of source and/or target words;
• insertion of gaps as explicit target positions, for reordering;
• forward and backward jump operations.
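To make the operation formalism concrete, the following sketch linearizes a monolingually word-aligned (mt, pe) pair into a sequence of GENERATE and JUMP operations. This is a deliberately simplified illustration assuming a 1-to-1 word alignment; the full OSM of Durrani et al. additionally uses gap insertion and more fine-grained operations.

```python
def operation_sequence(mt, pe, alignment):
    """Linearize a word-aligned (mt, pe) pair into a simplified
    operation sequence: GENERATE(mt_word, pe_word) and JUMP(offset).
    `alignment` maps each pe index to an mt index (a 1-to-1
    simplification of the full OSM operation set)."""
    ops = []
    cursor = 0                                 # next expected mt position
    for j, pe_word in enumerate(pe):
        i = alignment[j]                       # mt position generating pe_word
        if i != cursor:
            ops.append(("JUMP", i - cursor))   # reordering: move the cursor
        ops.append(("GENERATE", mt[i], pe_word))
        cursor = i + 1
    return ops

# Monotone pair: no jumps are needed
print(operation_sequence(["a", "b"], ["A", "B"], {0: 0, 1: 1}))
# → [('GENERATE', 'a', 'A'), ('GENERATE', 'b', 'B')]
# Swapped pair: jumps encode the reordering
print(operation_sequence(["a", "b"], ["B", "A"], {0: 1, 1: 0}))
# → [('JUMP', 1), ('GENERATE', 'b', 'B'), ('JUMP', -2), ('GENERATE', 'a', 'A')]
```

Because the reordering decision (JUMP) and the lexical decision (GENERATE) appear in one linear sequence, a single n-gram model over this sequence couples both kinds of decisions, as described above.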
The operation sequence is modelled with an n-gram model: the probability of the n-th operation depends on the n − 1 preceding operations. The generation of the post-edited output (pe) from a given MT output (mt) is modelled as a sequence of operations o_1, ..., o_J:

p(mt, pe) = ∏_{j=1}^{J} p(o_j | o_{j−n+1}, ..., o_{j−1})    (1)

The decoder searches for the best post-edited translation:

pe_best = argmax_{pe} p_lm(pe) · p_pr(pe) · p(mt, pe)    (2)

where p_lm(pe) is the language model and p_pr(pe) is the prior probability that marginalizes the joint probability p(mt, pe). The model is then represented in a log-linear framework (Och and Ney, 2003):

pe_best = argmax_{pe} ∑_i λ_i h_i(mt, pe)    (3)

which makes it possible to incorporate standard features along with several novel features that improve the accuracy. Here, λ_i is the weight associated with the feature h_i(mt, pe); the core features are p(mt, pe), p_pr(pe) and p_lm(pe). Apart from these, 8 additional features have been included in the log-linear model, among them the gap distance penalty, which measures the gaps between mt and pe positions created during the generation process.
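A toy illustration of the n-gram operation model described above: each operation is conditioned on its n − 1 predecessors, and an operation sequence is scored by the product (here, log-sum) of those conditional probabilities. The relative-frequency estimation and add-one smoothing below are simplifications of the smoothed n-gram model actually used.

```python
import math
from collections import defaultdict

def train_op_ngram(sequences, n=3):
    """Train a toy n-gram model over operation sequences and return a
    log-probability function, i.e. log p(o_j | o_{j-n+1}..o_{j-1})
    summed over the sequence (add-one smoothing; a simplified stand-in
    for the smoothed operation n-gram model of the OSM)."""
    ctx_counts, full_counts, vocab = defaultdict(int), defaultdict(int), set()
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        vocab.update(seq)
        for j in range(n - 1, len(padded)):
            ctx = tuple(padded[j - n + 1:j])
            ctx_counts[ctx] += 1
            full_counts[ctx + (padded[j],)] += 1
    V = len(vocab)

    def logprob(seq):
        padded = ["<s>"] * (n - 1) + list(seq)
        lp = 0.0
        for j in range(n - 1, len(padded)):
            ctx = tuple(padded[j - n + 1:j])
            num = full_counts[ctx + (padded[j],)] + 1   # add-one smoothing
            den = ctx_counts[ctx] + V
            lp += math.log(num / den)
        return lp
    return logprob

# Trained on monotone sequences, a sequence with a JUMP scores lower:
logp = train_op_ngram([["GEN", "GEN", "GEN"], ["GEN", "GEN"]], n=2)
print(logp(["GEN", "GEN"]), logp(["JUMP", "GEN"]))
```

In the real system, this sequence model is one feature h_i among several combined log-linearly with tuned weights λ_i.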

Experiment
The effectiveness of the present approach is demonstrated against the standard log-linear PB-SMT model used as our phrase-based SAPE (PB-SAPE) model. The MT outputs provided by the WMT-2016 APE task (cf. Table 1) are considered as the baseline system translations. For building our SAPE system, we experimented with various maximum phrase lengths for the translation model and n-gram settings for the language model. We found that a maximum phrase length of 10 for the translation model and a 6-gram language model produce the best results in terms of BLEU (Papineni et al., 2002) scores for our SAPE model. The other experimental settings concern the word alignment model between TL_MT and TL_PE, trained with three different aligners: the Berkeley Aligner (Liang et al., 2006), the METEOR aligner (Lavie and Agarwal, 2007) and TER (Snover et al., 2006). Phrase extraction (Koehn et al., 2003) and hierarchical phrase extraction (Chiang, 2005) are used to build our PB-SAPE and hierarchical phrase-based SAPE (HPB-SAPE) systems, respectively. The reordering model was trained with the hierarchical, monotone, swap, left-to-right bidirectional (hier-mslr-bidirectional) method (Galley and Manning, 2008) and conditioned on both source and target language. The 5-gram target language model was trained using KenLM (Heafield, 2011). Phrase pairs that occur only once in the training data are assigned an unduly high probability mass (i.e. 1). To compensate for this shortcoming, we performed smoothing of the phrase table using the Good-Turing smoothing technique (Foster et al., 2006). System tuning was carried out using Minimum Error Rate Training (MERT) (Och, 2003) optimized with k-best MIRA (Cherry and Foster, 2012) on a held-out development set of 500 sentences randomly extracted from the training data. Therefore, all models were built on 11,500 parallel TL_MT-TL_PE sentence pairs.
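To illustrate the phrase-table smoothing step, the following toy function applies the classic Good-Turing discount c* = (c+1) · N_{c+1} / N_c to phrase-pair counts, which pulls probability mass away from once-seen pairs. This is only a minimal sketch; the smoothing actually used (Foster et al., 2006) is more elaborate.

```python
from collections import Counter

def good_turing_discounts(counts):
    """Good-Turing discounted counts c* = (c+1) * N_{c+1} / N_c,
    where N_c is the number of phrase pairs seen exactly c times.
    Toy sketch: falls back to the raw count when N_{c+1} = 0."""
    freq_of_freq = Counter(counts.values())          # N_c
    discounted = {}
    for pair, c in counts.items():
        n_c = freq_of_freq[c]
        n_c1 = freq_of_freq.get(c + 1, 0)
        discounted[pair] = (c + 1) * n_c1 / n_c if n_c1 else float(c)
    return discounted

# Three singletons and one pair seen twice: singletons are discounted
counts = {("a", "A"): 1, ("b", "B"): 1, ("c", "C"): 1, ("d", "D"): 2}
print(good_turing_discounts(counts))
```

The effect is exactly the one motivated in the text: a phrase pair observed once no longer receives the full probability mass of 1.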
After the parameters were tuned, decoding was carried out on the held-out development test set ('Dev' in Table 1) as well as on the test set. Table 1 presents the statistics of the training, development and test sets released for the English-German APE task organized at WMT-2016. These data sets did not require any preprocessing in terms of encoding or alignment.

Results
We evaluated various APE system settings in our experiments. We start with the provided TL_MT output, which is considered as the baseline.
In the set of experiments reported in Table 2, three word alignment models (one statistically based aligner, the Berkeley aligner (Liang et al., 2006), and two edit-distance-based aligners, the METEOR aligner (Lavie and Agarwal, 2007) and the TER aligner (Snover et al., 2006)) are first integrated separately into both the PB-SAPE and the HPB-SAPE systems. As a result, there are three different PB-SAPE systems (Experiments 2, 3 and 4 in Table 2) and three different HPB-SAPE systems (Experiments 5, 6 and 7 in Table 2).
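The edit-distance-based aligners can be illustrated with a minimal monolingual word aligner that links MT tokens to post-edited tokens via a Levenshtein backtrace. This sketch omits the shift operations of TER and the stem/synonym/paraphrase matching of METEOR.

```python
def edit_distance_alignment(mt, pe):
    """Monolingual word alignment between MT output and its post-edit
    via a Levenshtein backtrace (a simplified stand-in for the TER and
    METEOR aligners). Returns (mt_index, pe_index) pairs for matched
    or substituted tokens; inserted/deleted tokens stay unaligned."""
    m, n = len(mt), len(pe)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if mt[i - 1] == pe[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete mt word
                          d[i][j - 1] + 1,         # insert pe word
                          d[i - 1][j - 1] + cost)  # match / substitute
    links, i, j = [], m, n                         # backtrace
    while i > 0 and j > 0:
        cost = 0 if mt[i - 1] == pe[j - 1] else 1
        if d[i][j] == d[i - 1][j - 1] + cost:
            links.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return list(reversed(links))

# The inserted word "sehr" stays unaligned; everything else is linked
print(edit_distance_alignment("das ist gut".split(), "das ist sehr gut".split()))
# → [(0, 0), (1, 1), (2, 3)]
```

Such alignments give the TL_MT-TL_PE phrase pairs from which the SAPE translation model is extracted.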
It is evident from Table 2 that the METEOR aligner performed better than the other two aligners. Therefore, our OSM-coupled PB-SAPE model ('OSM' in Table 2) used METEOR-based alignment. The experimental results show that, compared to the other systems in Table 2, our OSM-based model performed better in terms of the two evaluation metrics BLEU (Papineni et al., 2002) and TER. The evaluation also shows that both the PB-SAPE and the HPB-SAPE systems performed better than the baseline system on the development set. The submitted primary system (OSM in Table 2) achieves a 3.06% relative (1.99 absolute BLEU points) improvement over the baseline. The system shows similar improvements in terms of the TER evaluation measure.
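As a sanity check on the reported figures, a 1.99 absolute BLEU gain at 3.06% relative gain implies a development-set baseline of roughly 65 BLEU; that baseline value is an inference from the two reported numbers, not a figure stated in the paper.

```python
def relative_improvement(baseline, system):
    """Relative improvement in percent over a baseline score."""
    return 100.0 * (system - baseline) / baseline

# +1.99 absolute BLEU over an inferred ~65.0 baseline gives ~3.06% relative
print(round(relative_improvement(65.0, 65.0 + 1.99), 2))  # → 3.06
```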
According to the test set evaluation, our system achieves improvements similar to those observed on the development set. Table 3 shows that there are two types of baseline systems: (i) Baseline1, based on the raw MT output, and (ii) Baseline2, based on statistical APE (Simard et al., 2007b), a phrase-based system (Koehn et al., 2007).

Conclusion and Future Work
This paper presented the USAAR system submitted to the English-German APE task at WMT-2016. The system demonstrates the crucial role that METEOR-based alignment and an OSM-based SAPE model can play in SAPE tasks. The use of statistical aligners in the PB-SAPE/HPB-SAPE pipeline (http://www.statmt.org/moses/) improves the APE system; however, the performances with respect to the translations provided by the baseline are not promising. This is the reason for introducing edit-distance-based word alignment into the pipeline. The reason for using the OSM is that the model tightly couples translation and reordering. Moreover, the OSM considers all possible reorderings instead of searching only over a limited number of pre-calculated orderings. The proposed OSM-based SAPE approach was successful in improving over both the PB-SAPE and the HPB-SAPE performance.
The WMT-2016 APE shared task was a great opportunity to test APE methods that can later be applied in real-world post-editing and computer-aided translation (CAT) tools. We are currently working on implementing the APE methods described in this paper in CATaLog, a recently developed CAT tool that provides translators with suggestions originating from MT and from translation memories (TM) (Nayek et al., 2015; Pal et al., 2016). In so doing, we aim to provide better suggestions for post-editing, and we would like to investigate how this impacts human post-editing performance by carrying out user studies.