Touch-Based Pre-Post-Editing of Machine Translation Output

We introduce pre-post-editing, possibly the most basic form of interactive translation, as a touch-based interaction with iteratively improved translation hypotheses prior to classical post-editing. We report simulated experiments that yield very large improvements on classical evaluation metrics (up to 21 BLEU) as well as on a parameterized variant of the TER metric that takes into account the cost of matching/touching tokens, confirming the promising prospects of the novel translation scenarios offered by our approach.


Introduction
As shown by oracle studies (Wisniewski et al., 2010; Turchi et al., 2012; Marie and Max, 2013), Statistical Machine Translation (SMT) systems produce results that are of significantly lower quality than what could be produced from their available resources. As a pragmatic solution, human intervention is commonly used to improve automatic draft translations, in so-called post-editing (PE), but it is also studied earlier in the translation process in a variety of interactive strategies, including completion assistance and local translation choices (e.g. Foster et al., 2002; Koehn and Haddow, 2009; González-Rubio et al., 2013). Although interactive machine translation does facilitate the work of the SMT system in certain situations by allowing it to make efficient use of knowledge contributed by the human translator, post-editing has been shown to remain a faster alternative (Green et al., 2014). Nevertheless, this activity usually requires complex intervention from an expert translator (Carl et al., 2011).
In this work we reduce interaction with an SMT system to its most basic form: similarly to what a human translator is likely to do when first reading a draft translation to post-edit, we require a user to simply spot those segments of a draft translation that can participate in an acceptable translation. The corresponding information is then used by the SMT system in a soft way to improve the draft translation. This process may be repeated iteratively as long as enough improvements are obtained, and terminates with classical post-editing of the obtained translation; hence we dub it pre-post-editing (PPE). We resort to simulated pre-post-editing and post-editing, as in other works (Carl et al., 2011; Denkowski et al., 2014), to measure translation performance on some available reference translation using both classical metrics and a variant of the TER metric (Snover et al., 2006) in which, essentially, the cost of a token matching operation is a parameterized fraction of the cost of the other token edit operations. With the implementation of appropriate strategies in the SMT system, we show under reasonable assumptions that this approach has the potential to significantly reduce the amount of human effort required to obtain a final translation.
In the remainder of this article, we describe the technical details of pre-post-editing (Section 2), report experiments conducted on two translation directions and two domains (Section 3), and finally discuss our proposal and introduce our future work (Section 4).

Touch-based pre-post-editing
In our PPE framework, the human pre-post-editor has to mark n-grams from a translation hypothesis that can take part in a correct translation. The annotated n-grams are counted, as an n-gram can appear more than once in the same sentence, and a "positive" 6-gram language model (LM) (positive-lm) is trained on these counts. A "negative" LM (negative-lm) is also trained on the counts of the n-grams left unannotated. Then, all bi-phrases from the SMT system's phrase table that match an annotated n-gram, according to the source token alignments provided by the decoder, are removed from the main phrase table and stored in a separate "positive" phrase table (positive-pt). Conversely, n-grams containing at least one token left unannotated are considered incorrect, and the bi-phrases matching these n-grams are removed and stored in a "negative" phrase table (negative-pt).
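The n-gram bookkeeping described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the implementation used in the experiments: the function names, the representation of touched n-grams as `(start, end)` token spans, and the matching of bi-phrases on their target side only (ignoring the decoder's source alignments) are all hypothetical simplifications.

```python
from collections import Counter

MAX_ORDER = 6  # order of the positive and negative LMs mentioned in the text


def count_ngrams(tokens, spans, max_order=MAX_ORDER):
    """Count positive and negative n-grams for one translation hypothesis.

    tokens -- list of target tokens
    spans  -- hypothetical encoding of the touched n-grams as (start, end)
              token offsets, end exclusive
    """
    covered = [False] * len(tokens)
    for begin, stop in spans:
        for k in range(begin, stop):
            covered[k] = True

    positive, negative = Counter(), Counter()
    for start in range(len(tokens)):
        last_end = min(start + max_order, len(tokens))
        for end in range(start + 1, last_end + 1):
            ngram = tuple(tokens[start:end])
            if any(b <= start and end <= e for b, e in spans):
                # Lies inside a touched n-gram: counted as correct.
                positive[ngram] += 1
            elif not all(covered[start:end]):
                # Contains at least one untouched token: counted as incorrect.
                negative[ngram] += 1
    return positive, negative


def split_phrase_table(phrase_table, positive, negative):
    """Partition (source, target) bi-phrases by their target side.

    Assumption: the target side is a tuple of tokens; alignment-based
    matching on the source side is deliberately left out of this sketch.
    """
    positive_pt, negative_pt, main_pt = [], [], []
    for source, target in phrase_table:
        if target in positive:
            positive_pt.append((source, target))
        elif target in negative:
            negative_pt.append((source, target))
        else:
            main_pt.append((source, target))
    return positive_pt, negative_pt, main_pt
```

In this sketch an n-gram that straddles two adjacent touched spans is counted in neither model, since the user has not confirmed it as a unit; whether such n-grams should be handled differently is left open here.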
As source tokens can appear more than once in a source text, they are located: an identifier is concatenated to each token to make it unique in the source text. Tokens of the source phrases in the phrase table are located accordingly, so each bi-phrase is duplicated as needed to cover all located tokens. Using located tokens allows our PPE framework to distinguish correctly translated source tokens from incorrectly translated ones in the same sentence or text. Figure 1 shows an example of phrase table extraction, using located source tokens, for one iteration of PPE.
If an n-gram is annotated as correct, all its inner n-grams of lower order are also deemed correct. Although annotating translations of high quality may be less expensive when incorrect n-grams are explicitly annotated instead of correct ones, such annotations would not make it possible to identify correct n-grams inside incorrect ones, as illustrated in Figure 2. PPE can thus be worded as a simple problem for the pre-post-editor: which sequences of tokens should appear in the final translation?
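As an illustration of token location, the sketch below makes each source token unique and duplicates a source phrase for every located occurrence. The `#position` suffix and the function names are our own assumptions; the text only states that an identifier is concatenated to each token.

```python
def locate(tokens):
    """Make every source token unique by appending its position.

    The "#<position>" suffix is a hypothetical identifier format.
    """
    return [f"{tok}#{idx}" for idx, tok in enumerate(tokens)]


def locate_phrase(source_phrase, located_tokens):
    """Return one located copy of source_phrase per occurrence in the text."""
    # Strip the identifier again to compare against the plain phrase.
    plain = [tok.rsplit("#", 1)[0] for tok in located_tokens]
    phrase = tuple(source_phrase)
    n = len(phrase)
    return [
        tuple(located_tokens[i:i + n])
        for i in range(len(plain) - n + 1)
        if tuple(plain[i:i + n]) == phrase
    ]


located = locate(["la", "maison", "de", "la", "ville"])
copies = locate_phrase(["la"], located)  # one copy per occurrence of "la"
```

With located tokens, the two occurrences of "la" become distinct phrase table entries, so one can be moved to the positive table while the other goes to the negative table.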
Figure 1: Examples of some of the bi-phrases and n-grams extracted for phrase tables and language models according to a reference translation.

source: son impopularité semble être en grande partie due au chômage
PPE#0: his unpopularity seems to be owing largely to unemployment
PPE#1: his unpopularity seems to be largely owing to unemployment
target ref: his unpopularity seems to be largely owing to unemployment

The newly extracted phrase tables and LMs, along with the remainder of the original phrase table and the original LM, are used to re-decode the source text in a first iteration of PPE. A new PPE annotation can then be performed on the new translations. The newly extracted "positive" and "negative" phrase tables are merged with the corresponding phrase tables of the previous iteration. The extracted n-gram counts from the current iteration and the counts of the previous iterations are summed, and the LMs are re-trained with the updated counts. A new iteration of PPE is then performed with the updated models. The weights for all models, old or new, in the log-linear combination are found by tuning on a development set for each PPE iteration. Figure 3 illustrates 4 iterations of PPE from an initial translation hypothesis assuming a given target reference translation.
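The accumulation of counts and phrase tables over iterations can be sketched as below. The `PPEState` class and its field names are hypothetical; LM re-training, weight tuning and re-decoding are outside the scope of this sketch, and the handling of a bi-phrase that changes polarity between iterations is not specified in the text and is not addressed here.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PPEState:
    """Hypothetical container for what is carried across PPE iterations."""
    positive_counts: Counter = field(default_factory=Counter)
    negative_counts: Counter = field(default_factory=Counter)
    positive_pt: set = field(default_factory=set)
    negative_pt: set = field(default_factory=set)

    def update(self, pos_counts, neg_counts, pos_pt, neg_pt):
        # Counter.update adds the new counts to the existing ones,
        # i.e. counts of the current and previous iterations are summed.
        self.positive_counts.update(pos_counts)
        self.negative_counts.update(neg_counts)
        # Newly extracted bi-phrases are merged with those already stored.
        self.positive_pt |= set(pos_pt)
        self.negative_pt |= set(neg_pt)
```

After each call to `update`, the positive and negative LMs would be re-trained on the summed counts before the next decoding pass.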

Data and systems
We ran experiments on two translation tasks from different domains: the WMT'14 Medical translation task (medical) and the WMT'11 news translation task (news), for the language pair en-fr in both translation directions. For both tasks we trained two competitive phrase-based SMT systems using Moses (Koehn et al., 2007) and WMT data (see Table 1). Tuning for all systems, including our iteration-specific PPE systems, was performed with kb-mira (Cherry and Foster, 2012).

An adapted evaluation metric: TER PPE
source: c' est la réponse à une nouvelle prise de conscience selon laquelle les entreprises chinoises sont indispensables a la survie économique de Taiwan
PPE#0: this is the answer to a new awareness that Chinese companies are essential to the economic survival of Taiwan
PPE#1: it is the response to a new awareness that Chinese firms are essential to Taiwan's economic survival .
PPE#2: it is the reply to a new awareness that Chinese enterprises is essential to Taiwan's economic survival .
PPE#3: it is responding to a new awareness that Chinese businesses is essential to Taiwan's economic survival .
PPE#4: it is responding to a new awareness that Chinese business is essential to Taiwan's economic survival .
target ref: it is responding to a new awareness that Chinese business is essential to Taiwan's economic survival .
Figure 3: Example of a pre-post-editing trace for French to English translation (using the news task, cf. Section 3) using a given implicit target reference translation for simulating pre-post-editing and post-editing. Each newly touched phrase is indicated with a green background. Phrases with a gray background indicate previously touched phrases, but their tokens remain individually touchable by the user.

Classical MT evaluation metrics cannot take into account the interactive cost of PPE, and thus do not allow us to make direct comparisons with PE. We thus adapt the TER (Snover et al., 2006) metric, which typically uses 4 types of token edits: substitution (s), insertion (i), deletion (d) and shift (f). While these edit types all have a (debatable) uniform cost of 1, the operation of matching (m) a correct token is ignored. We posit that this operation is in fact performed by a human translator during PE (at the minimum, by recognizing and skipping tokens), and that it can be directly compared to our touch-based selection of tokens for PPE. However, we cannot at this stage of our work provide a realistic cost value for this operation, and so we introduce a match cost parameter α, and use the following as our PPE-aware metric:

\[ \mathrm{TER}_{\mathrm{PPE}} = \frac{s + i + d + f + \alpha \cdot m}{r} \]

where each operation symbol stands for the number of operations of that type and r is the number of tokens in the reference translation. Note that a null value for α makes TER PPE correspond to TER, while a value of 1 would indicate that a token matching/touch (m) is as costly as, e.g., a token rewriting (s). We anticipate that a realistic value for α given a reasonably skilled user should be rather small, but we will provide TER PPE results for the full range [0, 1].
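As a worked illustration, the sketch below computes the metric under our reading of the definitions above: the match count m, weighted by α, is added to the usual TER edit count before normalizing by the reference length r. The function name and the operation counts in the example are hypothetical and are not taken from the experiments.

```python
def ter_ppe(s, i, d, f, m, r, alpha=0.0):
    """PPE-aware TER: edits cost 1 each, matches/touches cost alpha each.

    s, i, d, f -- numbers of substitutions, insertions, deletions, shifts
    m          -- number of matched (touched) tokens
    r          -- number of tokens in the reference translation
    alpha      -- match cost parameter, in [0, 1]
    """
    if r <= 0:
        raise ValueError("reference length r must be positive")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (s + i + d + f + alpha * m) / r


# Hypothetical counts for one 20-token reference, swept over alpha.
scores = {
    alpha: ter_ppe(s=2, i=1, d=1, f=0, m=16, r=20, alpha=alpha)
    for alpha in (0.0, 0.25, 0.5, 1.0)
}
```

With α = 0 the function reduces to plain TER, and the score grows linearly with α, at a rate set by the proportion of matched tokens.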

Experimental results
To validate our approach, we initially used a simulated post-editing paradigm (Carl et al., 2011; Denkowski et al., 2014) in which non-post-edited reference translations are used in lieu of human post-editions. Results on TER (Snover et al., 2006) and BLEU (Papineni et al., 2002), tuning on both metrics, are provided in Tables 2 (news) and 3 (medical). First, we observe that whatever the metric and the task, the first iteration of PPE always yields a significant improvement over the initial Moses system (e.g. up to +9.8 BLEU and -8.2 TER for news fr→en). Unsurprisingly, tuning on a metric yields better results for the same metric for the first iteration; however, we note that this is not always true for the TER metric at later iterations (cf. news en→fr). More generally, tuning on the TER metric results in lower improvements for news, which are mostly concentrated on the first iterations; as systems tuned on BLEU have been found to produce better translations than systems tuned on TER (Cer et al., 2010), only BLEU tuning was used for medical. Improvements follow an interesting pattern over PPE iterations: for instance, on news fr→en, BLEU scores steadily increase after each new touch-based iteration and reach a gain of +21.1 BLEU and -12.3 TER over the initial Moses translation after 5 PPE iterations. Results are very comparable on both language pairs and both domains, e.g. gains of +12.1 BLEU and -9.7 TER are obtained on fr→en medical. The lesser amplitude of the gains obtained after 5 iterations may be attributed to the higher ini-

Table 3: PPE results on the medical task.

Figures 4 and 5 show how our TER PPE metric varies for different values of our α parameter (recall that α = 0 corresponds to TER). Essentially, whatever the value of α, we observe that any iteration of PPE dominates PE (Moses 1-best), but with a tendency to become as costly as PE for high, but probably unrealistic, values of α. Tuning with BLEU brings regular improvements as the number of iterations increases, while tuning with TER makes the amplitude of the gains decrease faster.
Furthermore, the results shown in Table 4 point to the complementarity between negative models (negative-lm and negative-pt) and positive models (positive-lm and positive-pt): removing one type of model leads to a drop of almost 10 BLEU points compared to the corresponding configuration using all models, in both translation directions. The language models (negative-lm and positive-lm) seem to play a more important role during PPE than the phrase tables (negative-pt and positive-pt), with a drop of 9.6 BLEU points on news fr→en when removing the language models against a significantly lower drop of 4.4 BLEU points when removing the phrase tables.

Discussion and future work
We have introduced pre-post-editing, a minimalist interactive machine translation paradigm where a user is only asked to spot text fragments that may be used in the final translation. Our approach is quite comparable to the two-pass procedure described by Luong et al. (2014), which uses word-level confidence estimation (e.g. Bach et al., 2011) to update the cost of the search graph hypotheses. However, unlike Luong et al.'s work, our PPE framework is efficiently multi-pass, updates the models over iterations and relies on more informative annotations made at the n-gram level. Our evaluation based on simulated post-editing has revealed a large potential for translation improvement. Interestingly, the type of interaction defined is very different from that expected of a post-editor or in existing interactive translation modes, and lends itself nicely to touch-based interaction. Furthermore, our proposal may in fact define a new role in Computer-Assisted Translation, with PPE being performed on the go on mobile devices by more people than available human translators, and even possibly by monolinguals of the target language, whose contribution may be more efficiently exploited than that of monolinguals of the source language (e.g. Resnik et al., 2010).

Figure 5: PPE results on the fr→en news task. (b) Tuned with BLEU.

In terms of usability, our future work will focus on two important questions: (a) studying the actual use of PPE in an interactive setting and tuning the α parameter of our TER PPE metric on HTER (Snover et al., 2006) traces, and (b) studying whether PPE alters in any positive way the work of the human translator performing the residual post-editing, in the hope that PE could become a less tedious task by nature. We further anticipate that some additions would improve our approach, including dealing early with out-of-vocabulary phrases, proposing local drop-down options (e.g. Koehn and Haddow, 2009), possibly clustered by senses, allowing the user to easily fix reordering issues, and adapting PPE to be discourse-aware (e.g. Ture et al., 2012).