Beyond Prefix-Based Interactive Translation Prediction

Current automatic machine translation systems require heavy human proof-reading to produce high-quality translations. We present a new interactive machine translation approach aimed at providing a natural collaboration between humans and translation systems. To that end, we grant the user complete freedom to validate and correct any part of the translations suggested by the system. Our approach is then designed according to the requirements placed by this unrestricted proof-reading protocol, in particular, the ability of the system to suggest new translations coherent with the set of potentially disjoint translation segments validated by the user. We evaluate our approach in a user-simulated setting where reference translations are considered the output desired by a human expert. Results show important reductions in the number of edits in comparison to decoupled post-editing and conventional prefix-based interactive translation prediction. Additionally, we provide evidence that it can also reduce the cognitive overload reported for interactive translation systems in previous user studies.


Introduction
Research in the field of machine translation (MT) aims at developing computer systems that reduce the effort required to generate translations, whether by assisting human translators or by directly replacing them. However, most research in MT has focused on the development of fully automatic MT approaches. Despite that, except for a handful of very constrained domains, current automatic MT technology still only achieves results that are not satisfactory in practice; automatic MT output still requires heavy human proof-reading to produce human-quality translations.
We present a new computer-assisted translation approach that integrates human translators and automatic MT into a tight feedback loop. In our approach, the user 1 and the MT system collaborate to generate translations through a series of interactions. At each interaction, the system proposes its best translation for the given input sentence. If the user finds it correct, then it is accepted and the process goes on with the next input sentence. Otherwise, the user makes some corrections that the system takes into account to improve the proposed translation. The rationale behind this interactive translation prediction (ITP) approach is to combine the accuracy provided by the human expert with the efficiency of the MT system, in contrast to decoupled post-editing (PE). Previous works, e.g. (Barrachina et al., 2009), have explored this paradigm; however, their practical implementation limits this general proof-reading approach to a prefix-based interaction where the user is forced to correct the errors in the sentence strictly according to the reading order.
Our main contribution, described in Section 3, is a new proof-reading protocol focused on providing a more natural interaction between the user and the system. Specifically, we give the user complete freedom to validate or to correct any part of the translation at any given interaction. As such, the user is no longer bound to correct the errors following the reading order as in previous prefix-based ITP works (Barrachina et al., 2009; González-Rubio et al., 2013; Green et al., 2014). Prefix-based interaction can be a frustrating and cognitively demanding limitation for the user, and may be a factor in the somewhat disappointing results of prefix-based ITP with users (Koehn, 2009; Underwood et al., 2014; Green et al., 2014). We design our approach to meet the requirements placed by the unrestricted proof-reading protocol, not the other way around. The most significant new feature is conditioned decoding, which generates translations coherent with a set of segments validated by the user. An evaluation involving human users would be most desirable to study the impact of any proof-reading protocol. However, such a study is expensive and time-consuming, and it would require taking into account additional sources of variation, namely the human factor, that may obscure the comparison between different approaches. Therefore, we chose to follow previous works, for instance (Barrachina et al., 2009), and carry out our experiments in a simulated setting intended to provide a direct and, more importantly, objective comparison to previous approaches (Section 4). Regarding evaluation, we propose a new metric to automatically estimate the cognitive load of potential users working in the different ITP environments. To the best of our knowledge, this is the first proposal in this respect. Results in Section 5 confirm the soundness of the proposed ITP approach. Reported figures show important reductions in both the number of corrections typed by the user and her estimated cognitive load.

Related Work
Common proof-reading MT protocols implement a decoupled PE process: first, the MT system returns a translation of a whole given document; next, a human reads it and corrects, in any order, the possible mistakes made by the system.
Interactive approaches (Isabelle and Church, 1998; Langlais and Lapalme, 2002; Tomás and Casacuberta, 2006) were proposed as a more sophisticated way of taking advantage of MT technology. Barrachina et al. (2009) presented a prefix-based ITP approach in which the user is assumed to proof-read each automatic translation, each time correcting the first error, if any, in the usual reading order. This can be a reasonable assumption in text or speech transcription (Toselli et al., 2007; Rodríguez et al., 2007), where the output sequence is generated monotonically with respect to the input data. However, it has always been an important handicap for translation due to the intrinsic reordering involved in the process.
ITP is a fruitful research field with diverse contributions from multiple authors (González-Rubio et al., 2010; Alabau et al., 2013), among others. We share with (Sanchis-Trilles et al., 2008) the idea of making a more sophisticated use of the mouse actions performed by the user while interacting with the system, and with (González-Rubio et al., 2013) the common ITP formulation for both phrase-based and hierarchical MT models. In particular, we significantly modify the prefix-based ITP implementation presented in the latter work to support the proposed unrestricted proof-reading protocol.
User studies of prefix-based ITP versus PE have shown that while users tend to make fewer corrections, overall translation time tends to be higher (Koehn, 2009; Underwood et al., 2014; Green et al., 2014). Coherently with these results, users also perceive prefix-based ITP as a more cognitively demanding task than PE. This is not surprising given that users are asked to proof-read one new translation (suffix) after each individual correction, which significantly increases the amount of text to be processed to generate a single translation. This is particularly frustrating when the user observes how a correct translation is replaced with a wrong one by the next suffix suggested by the system. Given that PE does not suffer from this effect, it provides a comprehensive explanation of the somewhat disappointing results reported for prefix-based ITP.
To the best of our knowledge, the only alternative to prefix-based proof-reading was proposed in the context of text recognition. Serrano et al. (2014) implement a constrained search procedure that profits from the monotonic alignment between input image, search states and user corrections to limit the set of possible transcriptions to those coherent with a set of (disjoint) user corrections. We apply a similar idea in a translation context and provide solutions to cope with the non-monotonicity inherent to the task.

Beyond Prefix-Based ITP
The goal of our approach is to give complete freedom to the user in her interaction with the system. The process starts when the MT system proposes a full translation of the source language sentence. Then, the user reads the translation and is allowed to validate (all or part of) the correct segments in it and to correct any of its potential errors. The system then takes this feedback into account to suggest a new translation that contains the segments validated by the user as well as the typed corrections. This process is repeated until the user validates the whole suggested translation. An example of this process is shown in Figure 1.

source (s): No era el hombre más honesto ni el más piadoso , pero era un hombre valiente .
desired translation (t): He was not the most honest or pious of men , but he was courageous .

BEGIN  MT:   It was not the most honest and the most pious man , but it was a brave man .
IT-1   User: It was not the most honest and the most pious of man , but it was a brave man .
       MT:   He was not the most honest or pious of men , but it was a brave man .
IT-2   User: He was not the most honest or pious of men , but it was courageous .
       MT:   He was not the most honest or pious of men , but he was courageous .
END    User: He was not the most honest or pious of men , but he was courageous .

Figure 1: Interactive translation of a Spanish sentence into English. First, the system suggests an initial translation. At iteration 1, the user validates the parts of the suggestion she considers to be right and introduces a correction by typing a word: "of". This defines a new user feedback with five segments: {"was not the most honest", "pious of", ", but", "was", "."}. Then, the system suggests a new translation that contains these segments in the given order. Iteration 2 is similar; the user validates the words "He" and "or", and she types a new correction: "courageous". The process ends when the user accepts the translation suggested by the system in the last step. Only two edits are required. In comparison, PE would have needed 10 edits.
The crucial MT feature is the generation of a new translation coherent with the segments already validated by the user. Formally, we represent such user feedback as a sequence of disjoint segments $f = \bar{f}_1, \ldots, \bar{f}_k, \ldots, \bar{f}_{|f|}$, where each $\bar{f}_k$ is a sequence of consecutive target language words. For example, the user feedback at iteration one in Figure 1 is composed of five disjoint segments: $\bar{f}_1 =$ "was not the most honest", $\bar{f}_2 =$ "pious of", $\bar{f}_3 =$ ", but", $\bar{f}_4 =$ "was" and $\bar{f}_5 =$ ".". Segments in $f$ do not overlap and do not necessarily cover the whole sentence. Prefix-based feedback in conventional ITP is a special case of this with only one segment starting at the beginning of the sentence.
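The segment-coherence requirement can be made concrete with a short sketch (our illustration, not the paper's code): user feedback is an ordered list of disjoint word segments, and a candidate translation is coherent with it when it contains every segment as consecutive words, in order and without overlap.

```python
def contains_in_order(translation, segments):
    """True if `translation` (a list of words) contains every segment
    (a list of consecutive words) in the given order, without overlap."""
    pos = 0
    for seg in segments:
        n = len(seg)
        # scan forward for the next occurrence of this segment
        while pos + n <= len(translation) and translation[pos:pos + n] != seg:
            pos += 1
        if pos + n > len(translation):
            return False
        pos += n  # consume the matched words: segments may not overlap
    return True

# User feedback after iteration 1 in Figure 1 (five disjoint segments)
feedback = [s.split() for s in
            ["was not the most honest", "pious of", ", but", "was", "."]]
```

With the Figure 1 data, the system's second suggestion ("He was not the most honest or pious of men , but it was a brave man .") is coherent with this feedback, while the initial suggestion ("It was not the most honest and the most pious man , but it was a brave man .") is not.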
Next, we describe the statistical formalization of our approach, the models actually used to implement such formalization, and the search procedures required to efficiently generate translations coherent with this generalized user feedback.

Statistical Framework
Our problem can be stated as follows: given a source sentence $s = s_1 \ldots s_{|s|}$ and some user feedback $f$, we must find the best target language translation $\hat{t}$. We can make the naïve Bayes assumption that $s$ and $f$ are statistically independent variables given $t$. This results in the basic equation for ITP with error correction (Ortiz-Martínez, 2011):

$$\hat{t} = \operatorname*{argmax}_{t} \Pr(t \mid s) \cdot \Pr(f \mid t) \qquad (1)$$

where, as we will see in Section 3.2, the distribution $\Pr(t \mid s)$ can be approximated by a machine translation model, and $\Pr(f \mid t)$ by an error correction model that measures the degree of compatibility between $f$ and $t$.
Note that by using a probability distribution $\Pr(f \mid t)$, any translation is compatible with a given user feedback to some degree. As a consequence, the translation returned by Equation (1) may still not contain the segments validated by the user; we need to identify the sub-string of the returned sentence that corresponds to each of the segments validated by the user. To solve this problem, we define an alignment $a = a_1, \ldots, a_{|f|}$ between the user-validated segments $f = \bar{f}_1, \ldots, \bar{f}_{|f|}$ and a list of segments $\bar{t} = \bar{t}_1, \ldots, \bar{t}_{|f|}$, where each $\bar{t}_k = t_{i_k} \ldots t_{j_k}$ is a sub-sequence of consecutive words in $t$. Each alignment link $a_k = \bar{t}_k$ indicates the particular segment in $t$ that should be replaced by the $k$-th user-validated segment $\bar{f}_k$ to make $t$ coherent with $f$. Unaligned words in $t$ constitute the free text that completes the gaps in between the user-validated segments in $f$ (González-Rubio et al., 2013). The alignment must also be monotonic to preserve the order of the user-validated segments. Formally, for every pair of alignment links $a_k = \bar{t}_k$ and $a_{k'} = \bar{t}_{k'}$: $k < k' \iff j_k < i_{k'}$. After including the alignment in Equation (1) and following a maximum approximation, we arrive at our final formulation of ITP with error correction:

$$(\hat{t}, \hat{a}) = \operatorname*{argmax}_{t, a} \Pr(t \mid s) \cdot \Pr(f, a \mid t) \qquad (2)$$

In practice, we combine the probability distributions in Equation (2) in a log-linear fashion, as is typically done in MT.

Models
Equation (2) includes two probability distributions: $\Pr(t \mid s)$ and $\Pr(f, a \mid t)$. The first one can be modeled by any of the multiple machine translation models proposed in the literature; (Koehn, 2009), for example, provides a good description of them. We will focus our exposition on the latter distribution, $\Pr(f, a \mid t)$, which evaluates the compatibility between a translation $t$ and some user feedback $f$ through the alignment $a$.
Following González-Rubio et al. (2013), we model $\Pr(f, a \mid t)$ as an error correction model based on the edit distance (Levenshtein, 1966). Given a candidate string and the corresponding reference string, we model edit distance as a Bernoulli process where each word of the candidate has a probability $p_e$ of being edited. Under this interpretation, the number of edits $\delta$ observed in a candidate of length $n$ is a random variable that follows a binomial distribution, $\delta \sim B(n, p_e)$. By assuming independence between the alignment links, we can model the error correction probability as:

$$\Pr(f, a \mid t) = \prod_{k=1}^{|f|} P_E(\bar{f}_k, a_k) \,, \qquad P_E(\bar{f}_k, a_k) = \binom{n_k}{\delta_k} \, p_e^{\delta_k} \, (1 - p_e)^{n_k - \delta_k}$$

where $P_E(\bar{f}_k, a_k)$ is the error correction probability of the $k$-th alignment link, given by the probability mass function of the binomial distribution; $n_k = |\bar{f}_k|$ is the length in words of the $k$-th user-validated segment $\bar{f}_k$; and $\delta_k$ is the edit distance between $\bar{f}_k$ and the segment $a_k = \bar{t}_k$ of $t$ aligned to it according to $a$. The editing probability $p_e$ is the single free parameter of this model. Alternatively, we could use a model based on a multinomial distribution that assigns different probabilities to different edit operations. Nevertheless, we adhere to the binomial approximation due to its simplicity.
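The binomial error-correction probability is simple to compute. A minimal sketch (ours, with a textbook word-level edit-distance routine; not the paper's implementation):

```python
from math import comb


def levenshtein(a, b):
    """Word-level edit distance between two token sequences."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            # prev holds the diagonal cell of the previous row
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (wa != wb))  # substitution
    return d[len(b)]


def p_edit_segment(f_k, t_k, p_e):
    """P_E: binomial probability of the observed number of edits when
    aligning user-validated segment f_k to hypothesis segment t_k."""
    n = len(f_k)
    delta = min(levenshtein(f_k, t_k), n)  # clamp to the support of B(n, p_e)
    return comb(n, delta) * p_e ** delta * (1 - p_e) ** (n - delta)
```

For instance, a perfectly matching two-word segment with $p_e = 0.1$ scores $0.9^2 = 0.81$, while one observed edit scores $\binom{2}{1} \cdot 0.1 \cdot 0.9 = 0.18$.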

Search
Next, we address the maximization problem posed by Equation (2). Following Barrachina et al. (2009), we split search into a two-step process. Given a source language sentence, we first generate a graph-based representation that contains its most probable translations. Then, we search it for the optimal translation and alignment according to Equation (2). In particular, we use hypergraphs to represent such a search space.
One important advantage of this approach is that it separates the proof-reading step from the MT engine used to generate the initial translations. As such, it provides a unified framework that accepts both phrase-based and hierarchical/syntax-based translation models.

Hypergraphs
A hypergraph (Gallo et al., 1993) is a generalization of a graph in which the edges (now called hyperedges) may connect several nodes (hypernodes) at the same time. Formally, a hypergraph is a weighted acyclic structure represented by a pair $H = \langle V, E \rangle$, where $V$ is a set of hypernodes and $E$ is a set of hyperedges. Each hyperedge $\varepsilon \in E$ connects a set of tail hypernodes $T(\varepsilon) = \{\tau_1, \ldots, \tau_{|T(\varepsilon)|}\}$, $\tau_l \in V$, to a head hypernode $H(\varepsilon) \in V$. A hypernode with no ingoing hyperedges is a leaf, while a hypernode with no outgoing hyperedges is a root. Each hypernode represents a partial translation generated during the MT decoding process. Each ingoing hyperedge $\varepsilon$ represents the rule applied to generate the partial solution in the head from the partial solutions in the tail hypernodes; as such, it has an associated probability $P(\varepsilon)$. Figure 2 shows an example hypergraph 2 . Two alternative translations are constructed from the leaf hypernodes (1, 2 and 3) up to the root hypernode (6). Hypergraphs provide a compact representation of the translation space that allows us to derive efficient search algorithms.
Hypergraphs are the natural representation for hierarchical MT models (Chiang, 2005; Zollmann and Venugopal, 2006). Note, however, that word-graphs (Ueffing et al., 2002), which are used to represent the search space of phrase-based models (Koehn et al., 2003), are a special case of hypergraphs in which hyperedges have at most one tail hypernode.
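A minimal representation of this structure can be sketched as follows (names and layout are illustrative, not the paper's implementation): nodes are stored in a set, and each head keeps the list of its ingoing hyperedges.

```python
from dataclasses import dataclass, field


@dataclass
class Hyperedge:
    head: int      # head hypernode H(e)
    tails: list    # tail hypernodes T(e)
    prob: float    # rule probability P(e)


@dataclass
class Hypergraph:
    nodes: set = field(default_factory=set)
    in_edges: dict = field(default_factory=dict)  # head -> [Hyperedge]

    def add_edge(self, head, tails, prob):
        self.nodes.add(head)
        self.nodes.update(tails)
        self.in_edges.setdefault(head, []).append(Hyperedge(head, tails, prob))

    def leaves(self):
        # hypernodes with no ingoing hyperedges
        return [v for v in self.nodes if v not in self.in_edges]

    def roots(self):
        # hypernodes that appear in no hyperedge's tail
        tails = {t for es in self.in_edges.values() for e in es for t in e.tails}
        return [v for v in self.nodes if v not in tails]


# Toy structure in the spirit of Figure 2: leaves 1-3, root 6,
# two alternative derivations through hypernodes 4 and 5.
hg = Hypergraph()
hg.add_edge(4, [1, 2], 0.6)
hg.add_edge(5, [2, 3], 0.4)
hg.add_edge(6, [4], 0.9)
hg.add_edge(6, [5], 0.8)
```

A word-graph is recovered as the special case where every `tails` list has at most one element.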

Search on hypergraphs
We formalize the maximization in Equation (2) as a bottom-up search problem. Starting from the leaf hypernodes, we keep track of the best solutions (partial translation and alignment) achievable at each hypernode. We define $Q(\nu, [m, n])$ as the probability of the most likely partial translation derivable from hypernode $\nu$ aligned to (accounting for) the user-validated segments from the $m$-th to the $n$-th; we will refer to this interval as the coverage of the partial solution. Given a node $\nu$ and a coverage, we compute its score from its ingoing hyperedges. Specifically, $Q(\nu, [m, n])$ will be equal to the maximum score of the partial solutions computed from any ingoing hyperedge. Partial solutions from an ingoing hyperedge $\varepsilon$ are defined as combinations of partial solutions on its tail hypernodes, under the constraint that the concatenation of their coverages equals $[m, n]$:

$$Q(\nu, [m, n]) = \max_{\varepsilon \in I(\nu)} \; \max_{c \in C(\varepsilon, [m, n])} \; P(\varepsilon) \prod_{(\tau_l, [m_l, n_l]) \in c} Q(\tau_l, [m_l, n_l])$$

where $I(\nu)$ are the ingoing hyperedges of $\nu$, $P(\varepsilon)$ is the probability of hyperedge $\varepsilon$, $C(\varepsilon, [m, n])$ is the set of valid combinations for hyperedge $\varepsilon$ and coverage $[m, n]$, and $c \in C(\varepsilon, [m, n])$ is one such valid combination. Leaf hypernodes represent the base cases of this recursion. For simplicity, we restrict them to be fully aligned to at most one user-validated segment 3 . That is, given a leaf hypernode $\lambda \in V$:

$$Q(\lambda, [m, m]) = P_{MT}(\lambda) \cdot P_E(w(\lambda), \bar{f}_m)$$

where $P_{MT}(\lambda)$ is the MT probability (language plus translation model) of $\lambda$, and $P_E(w(\lambda), \bar{f}_m)$ is the error correction probability between the target text covered by the leaf hypernode, $w(\lambda)$, and the $m$-th user-validated segment in $f$. The score of the optimal solution is given by $Q(\alpha, [1, |f|])$, where $\alpha \in V$ is the root hypernode. We can recover $(\hat{t}, \hat{a})$ through backtracking.
The process described above loops over all hyperedges and coverages (bounded by $|E| \cdot |f|^2$), evaluating all valid combinations (bounded by the number of coverage partitions). It can be implemented by an algorithm with a complexity in $O(|E| \, |f|^{2\tau})$, where $\tau$ is the average number of tail hypernodes per hyperedge (usually 2). In practice, our approach has a complexity in $O(|E| \, |f|^4)$.
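The recursion above can be sketched for the common binary case (hyperedges with exactly two tails), which is what gives the $O(|E|\,|f|^4)$ bound. This is a simplified illustration under our own representation, not the paper's algorithm: coverages are closed intervals `(m, n)` over the user-validated segments, `None` means "no segment covered", and valid combinations split a coverage into two consecutive (possibly empty) parts.

```python
def splits(cov):
    """All ways to split coverage `cov` between two tail hypernodes so
    that the concatenation of the parts equals `cov`."""
    if cov is None:
        return [(None, None)]
    m, n = cov
    return [(None, cov), (cov, None)] + \
           [((m, k), (k + 1, n)) for k in range(m, n)]


def search(in_edges, order, q_leaf, coverages):
    """Bottom-up dynamic program: Q[(node, cov)] is the probability of
    the best partial solution at `node` accounting exactly for the
    segments in `cov`.  `in_edges` maps a head node to a list of
    (prob, (tail1, tail2)); `order` lists nodes leaves-first."""
    Q = dict(q_leaf)  # base cases at the leaf hypernodes
    for v in order:
        for prob, (t1, t2) in in_edges.get(v, []):
            for cov in coverages:
                for c1, c2 in splits(cov):
                    if (t1, c1) in Q and (t2, c2) in Q:
                        s = prob * Q[(t1, c1)] * Q[(t2, c2)]
                        if s > Q.get((v, cov), 0.0):
                            Q[(v, cov)] = s
    return Q
```

With leaf base cases of the form $P_{MT}(\lambda) \cdot P_E(w(\lambda), \bar{f}_m)$ for singleton coverages (and $P_{MT}(\lambda)$ for empty coverage), the optimal score is read off at the root with full coverage, and the translation and alignment are recovered by backtracking.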

Corpus and MT systems
We tested the proposed methods on the Spanish-to-English (Es-En) partition of the Bulletin of the European Union (EU) corpus (Barrachina et al., 2009; González-Rubio et al., 2013). We tokenized the corpus, keeping the original case of the sentences. Table 1 shows the main figures of the corpus.
We estimated a hierarchical MT model on the train partition with the standard configuration of the Moses toolkit (Koehn et al., 2007). Log-linear weights were estimated by minimum error-rate training (Och, 2003) on the tune partition. Then, we automatically translated the tune and test partitions using the optimized model to obtain the corresponding hypergraphs. Next, we optimized the single free parameter $p_e$ of the error correction model (see Section 3.2) on the tune partition. Finally, we interactively translated both partitions according to the unrestricted ITP approach proposed in Section 3.

User Simulation
ITP evaluation with human translators is simply too slow and expensive to be applied on a frequent and ongoing basis during system development. Instead, we carried out an automatic evaluation with simulated users, which is faster and cheaper.
At each ITP iteration (see Figure 1), we have to decide which segments in the suggested translation should be validated, and which error should be corrected. To do that, we considered the reference translations in the corpus as the output that a human expert would want to obtain. Then, we align the suggested translation and the reference via edit distance: words aligned to themselves are marked as valid, while edited words are potential corrections to be typed by the simulated user.
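One iteration of this simulation can be sketched as follows (our illustration: `difflib`'s word-level matcher stands in for the edit-distance alignment; the paper's own alignment may differ):

```python
from difflib import SequenceMatcher


def simulate_iteration(hyp, ref):
    """One simulated ITP iteration: align hypothesis and reference word
    lists, validate the matched segments, and pick the first error (in
    reading order) as the reference word the user would type."""
    sm = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    validated, correction = [], None
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            validated.append(ref[j1:j2])          # matched words are valid
        elif correction is None and j1 < j2:
            correction = ref[j1]                  # first word to correct
    return validated, correction
```

For example, aligning hypothesis `["a", "b", "c"]` against reference `["a", "x", "c"]` validates the segments `["a"]` and `["c"]` and yields `"x"` as the correction to type.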
Without loss of generality, we introduced two restrictions: (1) the user can validate segments only at the first iteration, and (2) the simulated user always corrects the first (in reading order) error in the suggested translation. We are aware that the results obtained with this user simulation will be pessimistic, since it forbids behaviors that may improve user productivity, e.g. validating segments at each iteration or correcting more promising parts of the suggested translation. Our goal is not to match the behavior of a human translator, but to allow for a meaningful comparison against conventional ITP. Note that prefix-based proof-reading is a particular case of our user simulation with no segment validation.

Evaluation Metrics
ITP systems are evaluated according to the effort needed to generate the desired translations. This effort is usually estimated as the number of actions performed by the user while interacting with the system. In our user simulation, we consider two different actions: segment-validation and word-correction. Each segment validation requires the user to "click" on the initial and final words of the segment 4 . Each correction corresponds to an edit operation performed by the user. Specifically, we used the following measures in our experiments: Word stroke ratio (WSR): Proposed in (Tomás and Casacuberta, 2006) as the quotient between the number of words edited by the user (word-strokes) and the number of words in the final translation. Word-strokes are considered single actions with constant cost, independently of the length of the edited word.
Mouse action ratio (MAR): Proposed in (Barrachina et al., 2009) as the quotient between the number of "clicks" made by the user (mouse-actions) and the number of words in the final translation. In addition to the "clicks" for segment validation, we count one more mouse action per sentence, accounting for the final acceptance of the suggested translation.
Conceptually (Macklovitch et al., 2005), MAR can be seen as accounting for the cognitive part of the supervision process (understanding the translation and identifying the errors in it), while WSR accounts for the actual physical effort required to type the corrections. As such, both metrics are complementary and together express the total human effort involved in proof-reading a document.
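The two ratios reduce to simple quotients. A hedged sketch (variable names are ours; we assume two clicks per validated segment, on its initial and final words, as described above):

```python
def wsr(word_strokes, final_words):
    """Word stroke ratio: words edited by the user per word of the
    final translation (each stroke has constant unit cost)."""
    return word_strokes / final_words


def mar(validated_segments, final_words, sentences=1):
    """Mouse action ratio: two clicks per validated segment plus one
    final-acceptance click per sentence, per final-translation word."""
    return (2 * validated_segments + sentences) / final_words
```

For the Figure 1 example (2 typed words, 5 validated segments, a 16-word final translation), this would give WSR = 2/16 and MAR = 11/16 under these assumptions.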
We also evaluated the quality of the initial automatic translations generated by the system:

Bilingual evaluation understudy (BLEU):
Proposed in (Papineni et al., 2002), it is based on the precision of n-grams between the suggested translation and the reference; it also includes a brevity penalty to penalize short translations. This score ranges between 0 and 100, with 100 denoting a perfect translation.
Translation edit rate (TER): Proposed in (Snover et al., 2006), it measures the number of edit operations (substitution, insertion and deletion of single words, and swap of word sequences) divided by the number of words in the reference.
In addition to being an MT quality metric, TER can also be seen as a human-effort measure in PE scenarios. Therefore, we can use TER and WSR to compare human effort between PE and ITP. Finally, in order to assess the statistical significance of the results, we also provide 95% confidence intervals. These intervals were computed via pair-wise bootstrap resampling as proposed in (Zhang and Vogel, 2004).
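Pair-wise bootstrap resampling can be sketched as follows (our illustration of the general technique; the paper's exact procedure may differ): both systems are re-evaluated on the same resampled sentence sets, so the interval is on the score *difference* rather than on each score in isolation.

```python
import random


def bootstrap_ci(scores_a, scores_b, samples=1000, alpha=0.05, seed=0):
    """Confidence interval for mean(scores_a) - mean(scores_b) via
    pair-wise bootstrap: each resample draws the SAME sentence indices
    for both systems, so per-sentence scores stay paired."""
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = []
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        diffs.append(sum(scores_a[i] for i in idx) / n
                     - sum(scores_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int(samples * alpha / 2)]
    hi = diffs[int(samples * (1 - alpha / 2)) - 1]
    return lo, hi
```

If the interval excludes zero, the difference between the two systems is significant at the chosen level.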

Results
This section presents the results of the experiments performed to assess the unrestricted ITP approach proposed in Section 3. First, we compare our ITP approach to the prefix-based ITP scenario described in (Barrachina et al., 2009) and the decoupled PE approach. Then, we further study our approach by investigating the relationship between segment validation and typing effort. Finally, we provide evidence that the proposed unrestricted proof-reading protocol reduces the cognitive overload produced by the changing translation completions of prefix-based ITP approaches. Table 2 displays user-effort results for the proposed ITP approach against prefix-based ITP (Barrachina et al., 2009) 5 and a decoupled PE baseline. Automatic translation results are also displayed to give an idea of the difficulty of the task. We can observe how our approach clearly outperformed both prefix-based ITP and PE in terms of user typing effort, as measured by WSR and TER respectively. According to these results, a human translator assisted by our ITP system would need to correct only about one third of the words to generate the correct translations. In comparison, PE would require typing ∼41% of the words (17% more), while prefix-based ITP would require correcting more than half of them (55% more). Additionally, we can observe that prefix-based ITP was not better than the PE baseline in all cases. This result, coherent with previous works, e.g. (González-Rubio et al., 2013; Green et al., 2014), exemplifies the potential limitations of prefix-based ITP.
The large reductions in typing effort observed for our ITP approach came together with an important increase in the number of mouse actions.
Next, we focused on the differences between the prefix-based approach and ours. To do that, we carried out different experiments allowing the simulated user to validate an increasing number of words, from zero to infinity (corresponding, respectively, to the results of prefix-based ITP and our approach in Table 2). Figure 3 shows WSR (top) and MAR (bottom) for the Test partition as a function of the maximum number of words allowed to be validated by the user.
As we allowed the simulated user to validate more words, the number of words to be corrected (WSR) decreased dramatically. For example, we obtained a 10% relative reduction in WSR when we allowed the user to validate a maximum of 4 words. A similar trend (but in the opposite direction) can be observed for MAR: as we allowed the user to validate more words, the number of mouse actions increased until stabilizing. In other words, our ITP approach reduces user typing effort at the expense of an increase in the number of mouse actions. As said before, WSR and MAR account for different phenomena and thus have different costs from a human point of view (Macklovitch et al., 2005). It may seem that we have simply exchanged typing effort for cognitive effort. However, two considerations allow us to consider this a beneficial exchange. On the one hand, from a purely mechanistic point of view, typing a whole word usually requires more effort than "clicking" on it. On the other hand, from a cognitive point of view, the user has to read, understand, and evaluate the suggested translation in both prefix-based ITP and our approach. Hence, the difference in cognitive effort between these two approaches is most probably negligible. Nevertheless, these considerations should be tested with actual human users before reaching categorical conclusions.
The average response time of our Python prototype was below 3 seconds 6 . Obviously, this does not yet qualify as real-time performance. However, we expect an important reduction in response time from implementing our approach in a more efficient language.
We performed a final analysis to evaluate to what extent our proposal alleviates the main annoyance inherent to prefix-based ITP, namely correct words in a given suffix being overwritten by the next suggested suffix. This common effect, observed in several user studies, makes human users feel that the cognitive effort invested in evaluating each suggested translation was wasted.

6 The test machine was an Intel i5 CPU at 3.4 GHz.

Figure 4: Percentage of words suggested by the system that were correct but overwritten in subsequent translation suggestions, as a function of the number of words validated by the simulated user. This ratio can be seen as a measure of the user's cognitive overload.
To do that, we measured the number of correct words that were modified by subsequent translation suggestions, normalized by the total number of suggested words. Figure 4 displays this percentage as a function of the maximum number of words that can be validated by the simulated user. We can observe how, for prefix-based ITP (zero value on the x-axis), this cognitive overload is a very important phenomenon; more than 30% of the words suggested by the system were correct but modified by following suffix suggestions. As we allowed more words to be validated, this percentage steadily decreased down to zero. This indicates that our ITP proposal actually provides a mechanism to overcome the cognitive overload inherent to prefix-based ITP.
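A simplified version of this measurement can be sketched as follows (ours, not the paper's code; it compares word positions naively, whereas the paper's measurement aligns suggestions against the reference):

```python
def overwritten_ratio(suggestions, ref):
    """Fraction of suggested words that matched the reference at their
    position in one suggestion but differ at that position in the next
    one.  `suggestions` is the sequence of system outputs (word lists)
    across iterations; `ref` is the reference word list."""
    overwritten = suggested = 0
    for cur, nxt in zip(suggestions, suggestions[1:]):
        suggested += len(cur)
        for i, w in enumerate(cur):
            correct = i < len(ref) and w == ref[i]
            if correct and (i >= len(nxt) or nxt[i] != w):
                overwritten += 1  # a correct word the system rewrote
    suggested += len(suggestions[-1])
    return overwritten / suggested
```

A ratio near zero means the system almost never rewrote words it had already gotten right; for prefix-based ITP the measured value was above 30%.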

Summary
We have presented a new ITP approach where the user is no longer bound to interact with the system in a prefix-based fashion (Barrachina et al., 2009). The proposed ITP approach gives the user complete freedom to validate and correct any part of the suggested translations, thus providing a more natural working environment for human translators. We formalize the problem as an MT model with error correction which, in practice, is implemented as a constrained search on hypergraphs.
Simulated results showed that the proposed ITP approach drastically reduces the typing effort needed to generate translations, improving on the results of both decoupled PE and prefix-based ITP. This reduction in typing effort comes at the expense of a larger number of mouse actions required to validate correct segments of the suggested translations. However, since mouse actions are cheaper than typing full words, we can expect this exchange to reduce overall user effort. Nevertheless, this expectation should be confirmed in future experiments with actual human users. Finally, in addition to reducing user effort, we also provided evidence indicating that the proposed ITP approach can reduce the cognitive overload commonly reported by humans using prefix-based ITP systems.