Hierarchical Incremental Adaptation for Statistical Machine Translation

We present an incremental adaptation approach for statistical machine translation that maintains a flexible hierarchical domain structure within a single consistent model. Both weights and rules are updated incrementally on a stream of post-edits. Our multi-level domain hierarchy allows the system to adapt simultaneously towards local context at different levels of granularity, including genres and individual documents. Our experiments show consistent improvements in translation quality from all components of our approach.


Introduction
Suggestions from a machine translation system can increase the speed and quality of professional human translators (Guerberof, 2009; Plitt and Masselot, 2010; Green et al., 2013a, inter alia). However, querying a single fixed model for all documents fails to incorporate contextual information that can potentially improve suggestion quality. We describe a model architecture that adapts simultaneously to multiple genres and individual documents, so that translation suggestions are informed by two levels of contextual information.
Our primary technical contribution is a hierarchical adaptation technique for a post-editing scenario with incremental adaptation, in which users request translations of sentences in corpus order and provide corrected translations of each sentence back to the system (Ortiz-Martínez et al., 2010). Our learning approach resembles Hierarchical Bayesian Domain Adaptation (Finkel and Manning, 2009), but updates both the model weights and translation rules in real time based on these corrected translations (Mathur et al., 2013; Denkowski et al., 2014). Our adapted system can provide on-demand translations for any genre and document to which it has ever been exposed, using weights and rules for domains associated with each translation request.
Our weight adaptation is performed using a hierarchical extension to fast and adaptive online training (Green et al., 2013b), a technique based on AdaGrad (Duchi et al., 2011) and forward-backward splitting (Duchi and Singer, 2009) that can accurately set weights for both dense and sparse features (Green et al., 2014b). Rather than adjusting all weights based on each example, our extension adjusts offsets to a fixed baseline system. In this way, the system can adapt to multiple genres while preventing cross-genre contamination.
In large-scale experiments, we adapt a multigenre baseline system to patents, lectures, and news articles. Our experiments show that sparse models, hierarchical updates, and rule adaptation all contribute consistent improvements. We observe quality gains in all genres, validating our hypothesis that document and genre context are important additional inputs to a machine translation system used for post-editing.

Background
The log-linear approach to statistical machine translation models the predictive translation distribution $p(e \mid f; w)$ directly in log-linear form (Och and Ney, 2004):

$$p(e \mid f; w) = \sum_{r:\ \mathrm{src}(r) = f,\ \mathrm{tgt}(r) = e} \frac{1}{Z(f)} \exp\left[ w^\top \phi(r; c) \right]$$

where $f \in F$ is a string in the set of all source language strings $F$, $e \in E$ is a string in the set of all target language strings $E$, $r$ is a phrasal derivation with source and target projections $\mathrm{src}(r)$ and $\mathrm{tgt}(r)$, $w \in \mathbb{R}^d$ is the vector of model parameters, $\phi(\cdot) \in \mathbb{R}^d$ is a feature map computed using corpus $c$, and $Z(f)$ is an appropriate normalizing constant. During search, the maximum approximation is applied rather than summing over the derivations $r$.

Model. We extend a phrase-based system for which $\phi(r; c)$ includes 16 dense features:
• Two phrasal channel models and two lexical channel models (Koehn et al., 2003), the (log) count of the rule in the training corpus $c$, and an indicator for singleton rules in $c$.
• Six orientation models that score ordering configurations in $r$ by their frequency in $c$ (Koehn et al., 2007).
• A linear distortion penalty that promotes monotonic translation.
• An n-gram language model score, $p(e)$, which scores the target language projection of $r$ using statistics from a monolingual corpus.
• Fixed-value phrase and word penalties.

The elements of $\phi(r; c)$ may also include sparse features that have non-zero values for only a subset of rules, but typically do not depend on $c$ (Liang et al., 2006). In this paper, we use four types of sparse features: rule indicators, discriminative lexicalized reordering indicators, rule shape indicators, and alignment features (Green et al., 2014b).
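To make the relationship between the full distribution and the maximum approximation concrete, the following is a minimal sketch rather than the system's actual implementation. The feature names, weights, and toy derivations are hypothetical, and a real decoder searches a very large derivation space instead of enumerating a list.

```python
import math

def score(weights, features):
    """Unnormalized log-linear score w^T phi(r; c) for one derivation."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def best_derivation(weights, derivations):
    """Maximum approximation: keep only the single highest-scoring derivation."""
    return max(derivations, key=lambda d: score(weights, d["features"]))

def posterior(weights, derivations, target):
    """p(e | f; w): sum over derivations that yield `target`, normalized by Z(f)."""
    exp_scores = [(d["target"], math.exp(score(weights, d["features"])))
                  for d in derivations]
    z = sum(s for _, s in exp_scores)
    return sum(s for t, s in exp_scores if t == target) / z

# Hypothetical toy derivations of a single source sentence.
weights = {"lm": 1.0, "tm": 0.5, "word_penalty": -0.1}
derivations = [
    {"target": "the house", "features": {"lm": -2.0, "tm": -1.0, "word_penalty": 2.0}},
    {"target": "the house", "features": {"lm": -2.0, "tm": -3.0, "word_penalty": 2.0}},
    {"target": "the home", "features": {"lm": -3.0, "tm": -1.0, "word_penalty": 2.0}},
]
best = best_derivation(weights, derivations)
p_house = posterior(weights, derivations, "the house")
```

Here two derivations share the target "the house"; the sum credits both to that string, whereas the maximum approximation considers only the better one.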
The model parameters $w$ are chosen to maximize translation quality on a tuning set.

Adaptation. Domain adaptation for machine translation has improved quality using a variety of approaches, including data selection (Ceauşu et al., 2011), regularized online learning (Simianer et al., 2012; Green et al., 2013b), and input classification (Xu et al., 2007; Banerjee et al., 2010; Wang et al., 2012), and has also been investigated for multidomain tasks (Sennrich et al., 2013; Cui et al., 2013). Even without domain labels at either training or test time, multi-task learning can boost translation quality in a batch setting (Duh et al., 2010; Song et al., 2011).
Post-editing with incremental adaptation describes a particular mixed-initiative setting (Ortiz-Martínez et al., 2010; Hardt and Elming, 2010). For each $f$ in a corpus, the machine generates a hypothesis $e$, then a human provides a corrected translation $e^*$ to the machine. Observing $e^*$ can affect both the model parameters $w$ and the corpus $c$ from which rules and features are computed. For incremental adaptation, speed is essential, and so $w_i$ is typically computed with a single online update from $w_{i-1}$ using $(f_i, e^*_i)$ as the tuning example.
To alleviate the need for human intervention in the experiment cycle, simulated post-editing (Hardt and Elming, 2010; Denkowski et al., 2014) replaces each $e^*$ with a reference that is not a corrected variant of $e$. Thus, a standard test corpus can be used as an adaptation corpus. Prior work on online learning from post-edits has demonstrated the benefit of adjusting only $c$ (Ortiz-Martínez et al., 2010; Hardt and Elming, 2010) and further benefit from adjusting both $c$ and $w$ (Mathur et al., 2013; Denkowski et al., 2014). Incremental adaptation of both $c$ and the weights $w$ for sparse features is reported to yield large quality gains by Wäschle et al. (2013).
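The simulated post-editing protocol described above can be sketched as a simple loop. This is an illustration under stated assumptions: `translate` and `update` are hypothetical callables standing in for the decoder and the adaptation step, and the toy "model" is merely a memory of previously seen sentence pairs.

```python
def simulated_post_editing(stream, translate, update):
    """Translate each source sentence before its reference is revealed.

    stream    -- iterable of (source, reference) pairs in corpus order
    translate -- callable(source) -> hypothesis from the current model
    update    -- callable(source, reference) that adapts weights and/or rules
    """
    hypotheses = []
    for source, reference in stream:
        hypotheses.append(translate(source))  # this output is what gets evaluated
        update(source, reference)             # the reference stands in for the post-edit
    return hypotheses

# Hypothetical toy model: remembers every sentence pair it has been shown.
memory = {}

def translate(source):
    return memory.get(source, "<unk>")

def update(source, reference):
    memory[source] = reference

corpus = [("guten tag", "good day"), ("danke", "thanks"), ("guten tag", "good day")]
outputs = simulated_post_editing(corpus, translate, update)
```

Because each hypothesis is produced before the corresponding update, a repeated sentence benefits from adaptation only on its second occurrence.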

Hierarchical Incremental Adaptation
Our hierarchical approach to incremental adaptation uses document and genre information to adapt appropriately to multiple contexts. We assume that each sentence $f_i$ has a known set $D_i$ of domains, which identify the genre and individual document origin of the sentence. This set could be extended to include topics, individual translators, etc. Figure 1 shows the domains that we apply in experiments. All sentences in the baseline training corpus, the tuning corpus, and the adaptation corpus share a root domain.
Our adaptation is conceptually similar to hierarchical Bayesian domain adaptation (Finkel and Manning, 2009), but both weights and feature values depend on $D_i$, and we use $L_1$ regularization.

Weight Updates. Model tuning and adaptation are performed with AdaGrad, an online subgradient method with an adaptive learning rate that comes with good theoretical guarantees. AdaGrad makes the following update:

$$w_t = w_{t-1} - \eta\, G_t^{-1/2}\, \nabla \ell_t(w_{t-1}), \qquad G_t = G_{t-1} + \nabla \ell_t(w_{t-1})\, \nabla \ell_t(w_{t-1})^\top$$

where $G_t$ accumulates the outer products of past gradients. The loss function $\ell_t$ reflects the pairwise ordering between hypotheses. For feature selection, we apply an $L_1$ penalty via forward-backward splitting (Duchi and Singer, 2009). $\eta$ is the initial learning rate. See Green et al. (2013b) for details.
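As a rough illustration of how an AdaGrad step combines with an $L_1$ penalty via forward-backward splitting, the sketch below uses the common diagonal approximation of the gradient accumulator. The function name, the dictionary representation, and the values of `eta`, `l1`, and `eps` are assumptions made for this example, not settings reported in the paper.

```python
import math

def adagrad_l1_step(w, grad, sum_sq, eta=0.1, l1=0.01, eps=1e-8):
    """One diagonal AdaGrad step followed by an L1 proximal (soft-threshold) step.

    w, sum_sq -- dicts: feature name -> weight / accumulated squared gradient
    grad      -- dict: feature name -> (sub)gradient of the loss at the current w
    Only features present in `grad` are touched, which suits sparse features.
    """
    for name, g in grad.items():
        sum_sq[name] = sum_sq.get(name, 0.0) + g * g
        rate = eta / (math.sqrt(sum_sq[name]) + eps)  # per-feature learning rate
        half = w.get(name, 0.0) - rate * g            # forward (gradient) step
        shrunk = max(abs(half) - rate * l1, 0.0)      # backward (proximal) step
        w[name] = math.copysign(shrunk, half)
    return w

w, sum_sq = {"a": 0.0, "b": 0.0}, {}
adagrad_l1_step(w, {"a": 1.0, "b": 0.005}, sum_sq)
```

In this toy call, feature `a` receives a clear gradient signal and moves away from zero, while feature `b` has a gradient smaller than the penalty and is clipped back to zero, which is the feature-selection effect of the $L_1$ term.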
Our adaptation schema is an extension of frustratingly easy domain adaptation (FEDA) (Daumé III, 2007) to multiple domains with different regularization parameters, similar to Finkel and Manning (2009). Each feature value is replicated for each domain. Let $D$ denote the set of all domains present in the adaptation set. Given an original feature vector $\phi(r; c)$ for derivation $r$ of sentence $f_i$ with $D_i \subseteq D$, the replicated feature vector includes $|D|$ copies of $\phi(r; c)$, one for each $d \in D$, such that

$$\phi_d(r; c) = \begin{cases} \phi(r; c), & d \in D_i \\ 0, & \text{otherwise.} \end{cases}$$

The weights of this replicated feature space are initialized using the weights $w$ tuned for the baseline $\phi(r; c)$:

$$w_d = \begin{cases} w, & d \text{ is the root domain} \\ 0, & \text{otherwise.} \end{cases}$$

In this way, the root domain corresponds to the unadapted baseline weights, denoted as $\Theta_*$ in Finkel and Manning (2009). The idea is that we simultaneously maintain a generic set of weights that applies to all domains as well as domain-specific "offsets" that describe how each domain differs from the generic case. Model updates during adaptation are performed according to the same procedure as tuning updates, but now in the replicated space.
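The feature replication and weight initialization can be sketched as follows. This is a minimal illustration: the domain labels, the `ROOT` name, and the feature names are hypothetical, and replicated features are keyed by `(domain, feature)` pairs so that inactive domains are simply absent rather than stored as zeros.

```python
ROOT = "root"  # assumed label for the domain shared by all sentences

def replicate(features, domains):
    """Copy each feature once per active domain; inactive domains contribute zero."""
    return {(d, name): value for d in domains for name, value in features.items()}

def init_weights(baseline_weights):
    """Root copy starts at the tuned baseline; all other domains start as zero offsets."""
    return {(ROOT, name): w for name, w in baseline_weights.items()}

def score(weights, replicated):
    return sum(weights.get(key, 0.0) * value for key, value in replicated.items())

baseline = {"lm": 1.0, "tm": 0.5}
weights = init_weights(baseline)
phi = {"lm": -2.0, "tm": -1.0}
domains = [ROOT, "genre:patent", "doc:42"]  # hypothetical domain set D_i

before = score(weights, replicate(phi, domains))
weights[("genre:patent", "tm")] = 0.2       # a learned genre-specific offset
after_patent = score(weights, replicate(phi, domains))
after_news = score(weights, replicate(phi, [ROOT, "genre:news"]))
```

Before any update, the replicated model scores exactly like the baseline. After the patent offset is learned, patent sentences are scored differently while a news sentence is unaffected, which is how cross-genre contamination is avoided.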
Unlike Finkel and Manning (2009), this generalized FEDA model does not restrict the domains to be strictly hierarchically structured. We could, for example, include a domain for each translator that crosses different genres. However, all of our experimental evaluations maintain a hierarchical domain structure, leaving more general setups to future work.

Rules and Feature Values.
A derivation $r$ of sentence $f_i$ has features that are computed from the combination of the baseline training corpus $c_0$ and a genre-specific corpus that includes all sentence pairs from the tuning corpus as well as those sentence pairs $(f_j, e^*_j)$ from the adaptation corpus with $j < i$ that share the genre of $f_i$. We refer to this combined corpus as $c_i$. The tuning corpus is the same one that is used for parameter tuning in the baseline system. The adaptation corpus is our test set. Note that in our evaluation, each sentence is translated before it is used for adaptation, so that there is no contamination of results.
In order to extend the model efficiently within a streaming data environment, we make use of a suffix-array implementation for our phrase table (Levenberg et al., 2010).
Rather than combining corpus counts across these different sources, separate rules extracted from the baseline corpus and the genre-specific corpus exist independently in the derivation space, and features of each are computed only with one corpus. In this configuration, a large amount of out-of-domain evidence from the baseline model will not dampen the feature value adaptation effects of adding new sentence pairs from the adaptation corpus. The genre-specific phrases are distinguished by an additional binary provenance feature.
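The effect of keeping the two rule sources separate can be illustrated with a toy example. The sketch below is an assumption-laden simplification: it uses a single relative-frequency channel feature, plain count dictionaries instead of a suffix array, and invented counts.

```python
import math
from collections import Counter

def channel_feature(counts, source, target):
    """Log relative frequency log p(target | source) within ONE corpus."""
    total = sum(c for (s, _), c in counts.items() if s == source)
    return math.log(counts[(source, target)] / total)

def rules_for(source, baseline_counts, genre_counts):
    """Emit independent rules per corpus; genre rules carry a provenance flag."""
    rules = []
    for provenance, counts in ((0.0, baseline_counts), (1.0, genre_counts)):
        for (s, t) in counts:
            if s == source:
                rules.append({"target": t,
                              "log_p": channel_feature(counts, s, t),
                              "provenance": provenance})
    return rules

# Hypothetical counts: a large baseline corpus vs. a few post-edited patent sentences.
baseline_counts = Counter({("Anspruch", "claim"): 100, ("Anspruch", "entitlement"): 900})
genre_counts = Counter({("Anspruch", "claim"): 3})
rules = rules_for("Anspruch", baseline_counts, genre_counts)
genre_rule = [r for r in rules if r["provenance"] == 1.0][0]
pooled_p = (100 + 3) / (1000 + 3)  # what pooling the counts would give instead
```

With separate rules, three in-genre observations already yield a genre rule with probability 1 for "claim", whereas pooling the counts would leave that translation at roughly 0.10.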
In order to extract features from the genre-specific corpus, a word-level alignment must be computed for each $(f_j, e^*_j)$. We force decode using the adapted translation model for $f_j$. In order to avoid decoding failures, we insert high-cost single-word translation rules that allow any word in $f_j$ to align to any word in $e^*_j$.

Sparse Features. Applying a large number of sparse features would compromise responsiveness of our translation system and is thus a poor fit for real-time adaptive computer-assisted translation. However, features that can be learned on a single document are limited in number and can be discarded after the document has been processed. Therefore, document-level sparse features are a powerful means to fit our model to local context with a comparatively small impact on efficiency.

Experiments
We performed two sets of German→English experiments; Table 1 contains the results for both. Our first set of experiments was performed on the PatTR corpus (Wäschle and Riezler, 2012). We divided the corpus into training and development data by date and selected 2.4M parallel segments dated before 2000 from the "claims" section as bilingual training data, taking equal parts from each of the eight patent types A-H as classified by the Cooperative Patent Classification (CPC). From each type we further drew separate test sets and a single tune set, selecting documents with at least 10 segments and a maximum of 150 source words per segment, with around 2,100 sentences per test set and 400 sentences per type for the tune set. The "claims" section of this corpus is highly repetitive, which makes it ideal for observing the effects of incremental adaptation techniques.
To train the language and translation model we additionally leveraged all available bilingual and monolingual data provided for the EMNLP 2015 Tenth Workshop on Machine Translation. The total size of the bitext used for rule extraction and feature estimation was 6.4M sentence pairs. We trained a standard 5-gram language model with modified Kneser-Ney smoothing (Kneser and Ney, 1995; Chen and Goodman, 1998) using the KenLM toolkit (Heafield et al., 2013) on 4 billion running words. The bitext was word-aligned with mgiza (Och and Ney, 2003), and we used the phrasal decoder (Green et al., 2014a) with standard German-English settings for experimentation.
Our second set of experiments was performed on a mixed-genre corpus containing lectures, patents, and news articles. The standard dev and test sets of the IWSLT 2014 shared task were used for the lecture genre. Each document corresponded to an entire lecture. For the news genre, we used newstest2012 for tuning, newstest2013 for metaparameter optimization, and newstest2014 for testing. The tune set for the patent genre is identical to the first set of experiments, while the test set consists of the first 300 sentence pairs of each of the patent type specific test sets of the previous experiment. The documents in the news and patent genres contain around 20 segments on average.
Table 1: Each component is added on top of the previous line. All results in line + genre TM and below are statistically significant improvements over the baseline with 95% confidence. We also report the repetition rate of the test corpora.

Our evaluation proceeded in multiple stages. We first trained a set of background weights on the concatenated tune sets (baseline). Keeping these weights fixed, we performed an additional tuning run to estimate genre-level weights (+ genre weights). In the patent-only setup, we used patent CPC type as genre. Next, we trained a genre-specific translation model for each genre by first feeding the tune set and then the test set into our incremental adaptation learning method as a continuous stream of simulated post-edits (+ genre TM). After each sentence, we performed an update on the genre-specific weights. In separate experiments, we also included document-level weights as an additional domain (+ doc. weights) and included sparse features at the document level (+ sparse features).

Table 1 demonstrates that each component of this approach offered consistent incremental quality gains, but with varying magnitudes. For the patent experiments we report the average over our eight test sets (A-H) due to lack of space, but total improvement varied from +4.92 to +6.46 BLEU. In the mixed-genre experiments, BLEU increased by +2.27 on lectures, +0.97 on news, and +5.33 on patents. On all tasks, we observed statistically significant improvements over the baseline (95% confidence level) in the + genre TM, + doc. weights, and + sparse features experiments using bootstrap resampling (Koehn, 2004).
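The idea behind a paired bootstrap significance test can be sketched as below. This is a simplified stand-in: it compares sums of hypothetical per-sentence scores, whereas a test on BLEU would recompute the corpus-level metric on every resampled test set. The function name, sample count, and scores are assumptions for illustration only.

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B."""
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # draw sentences with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

# Hypothetical per-sentence quality scores for a baseline and an adapted system.
baseline = [0.30, 0.25, 0.40, 0.35, 0.20, 0.45, 0.30, 0.25]
adapted = [s + 0.05 for s in baseline]
confidence = paired_bootstrap(adapted, baseline)
significant = confidence >= 0.95
```

Both systems are scored on the same resampled sentence indices, which is what makes the comparison paired.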
These results demonstrate the efficacy of hierarchical incremental adaptation, although we would like to stress that the patent data was selected specifically for its high level of repetitiveness, and the large improvement in this genre would only be expected to arise in similarly structured domains. This property is quantified by the repetition rate measure (RR) reported in Table 1, which confirms the finding by Cettolo et al. (2014) that RR correlates with the effectiveness of adaptation.

Figure 2: BLEU difference between baseline + genre weights and our incremental adaptation approach, computed on a single segment from each document according to their order, i.e. the first segment from each document, then the second segment from each document, etc.

Analysis. Figure 2 shows BLEU score differences to the baseline + genre weights system for different subsets of the news and patent test sets. Each point is computed by document slicing, i.e. on a single segment from each document. The rightmost data point is the BLEU score we obtain by evaluating on the 20th segment of each document, grouped into a pseudo-corpus. Note that this group does not correspond to any number in Table 1, which reports BLEU on the entire test sets. Thus, we evaluate on all sentences that have learned from exactly $(i-1)$ segments of the same document, with $i = 1, \ldots, 19$. Although the graph is naturally very noisy (each score is computed on roughly 150 segments), we can clearly see that incremental adaptation learns on the document level: on average, the improvement over the baseline increases when proceeding further into the document.

Decoding speed. In our real-time computer-assisted translation scenario, a certain translation speed is required to allow for responsive user interaction. Table 2 reports the speed in words per second on the lecture data. Adding a genre-specific translation model results in a speed reduction by a factor of 12.6 due to the additional (forced) decoding run and weight updates. Sparse features slow the system down by a further factor of 2.4. However, the largest part of the computation time is incurred only when the user has finalized collaborative translation of one sentence and is busy reading the next source sentence. Further, the speed/quality tradeoff can be adjusted with pruning parameters.
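The document slicing behind Figure 2 groups segments by their position within their document. A minimal sketch follows, assuming a hypothetical stream of `(doc_id, score_delta)` pairs and averaging per-segment gains where the actual analysis computes BLEU on each pseudo-corpus.

```python
from collections import defaultdict

def document_slices(segments):
    """Group segments by their position inside their document.

    segments -- iterable of (doc_id, score_delta) in corpus order
    Returns {position: [score_delta, ...]} with 1-based positions.
    """
    position_in_doc = defaultdict(int)
    slices = defaultdict(list)
    for doc_id, delta in segments:
        position_in_doc[doc_id] += 1
        slices[position_in_doc[doc_id]].append(delta)
    return dict(slices)

def mean_by_position(slices):
    """Average gain for each within-document position, in ascending order."""
    return {pos: sum(vals) / len(vals) for pos, vals in sorted(slices.items())}

# Hypothetical per-segment gains over the baseline for two short documents.
segments = [("d1", 0.0), ("d1", 1.0), ("d1", 2.0), ("d2", 0.0), ("d2", 3.0)]
curve = mean_by_position(document_slices(segments))
```

Each position in the resulting curve aggregates only segments that were translated after the same number of in-document updates, so a rising curve indicates document-level learning.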

Conclusion
We have presented an incremental learning approach for MT that maintains a flexible hierarchical domain structure within a single consistent model. In our experiments, we define a three-level hierarchy with a global root domain as well as genre- and document-level domains. Further, we perform incremental adaptation by training a genre-specific translation model on the stream of incoming post-edits and adding document-level sparse features that do not significantly compromise efficiency. Our results show consistent contributions from each level of adaptation across multiple genres.