Phrase-Level Combination of SMT and TM Using Constrained Word Lattice

Constrained translation has improved statistical machine translation (SMT) by combining it with translation memory (TM) at sentence-level. In this paper, we propose using a constrained word lattice, which encodes input phrases and TM constraints together, to combine SMT and TM at phrase-level. Experiments on English–Chinese and English–French show that our approach is significantly better than previous combination methods, including sentence-level constrained translation and a recent phrase-level combination.


Introduction
The combination of statistical machine translation (SMT) and translation memory (TM) has proven to be beneficial in improving translation quality and has drawn attention from many researchers (Biçici and Dymetman, 2008; He et al., 2010; Koehn and Senellart, 2010; Ma et al., 2011; Wang et al., 2013; Li et al., 2014). Among various combination approaches, constrained translation (Koehn and Senellart, 2010; Ma et al., 2011) is a simple one and can be readily adopted.
Given an input sentence, constrained translation retrieves similar TM instances and uses matched segments to constrain the translation space of the input by generating a constrained input. Then an SMT engine is used to search for a complete translation of the constrained input.
Despite its effectiveness in improving SMT, previous constrained translation works at the sentence-level, which means that matched segments in a TM instance are either all adopted or all abandoned regardless of their individual quality (Wang et al., 2013). In this paper, we propose a phrase-level constrained translation approach which uses a constrained word lattice to encode the input and constraints from the TM together and allows a decoder to directly optimize the selection of constraints towards translation quality (Section 2).
We conduct experiments (Section 3) on English-Chinese (EN-ZH) and English-French (EN-FR) TM data. Results show that our method is significantly better than previous combination approaches, including sentence-level constrained methods and a recent phrase-level combination method. Specifically, it improves the BLEU (Papineni et al., 2002) score by up to +5.5% on EN-ZH and +2.4% on EN-FR over a phrase-based baseline (Koehn et al., 2003) and decreases the TER (Snover et al., 2006) error by up to -4.3%/-2.2%, respectively.

Constrained Word Lattice
A word lattice G = (V, E, Σ, φ, ψ) is a directed acyclic graph, where V is a set of nodes (including a start node and an end node), E ⊆ V × V is a set of edges, Σ is a set of symbols, φ : E → Σ is a label function and ψ : E → ℝ is a weight function. A constrained word lattice is a special case of a word lattice, which extends Σ with extra symbols (i.e. constraints).
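To make this definition concrete, here is a minimal Python sketch of the structure (the class and field names are ours, not from the paper); later sketches in this section reuse it:

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    head: int                    # source node
    tail: int                    # target node
    label: str                   # phi(e): an input word or a constraint phrase
    weight: float = 0.0          # psi(e)
    is_constraint: bool = False  # True if the label is a TM constraint

@dataclass
class Lattice:
    start: int = 0
    end: int = 0
    edges: list = field(default_factory=list)

    def add_edge(self, head, tail, label, weight=0.0, is_constraint=False):
        self.edges.append(Edge(head, tail, label, weight, is_constraint))
```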
A constraint is a target phrase which will appear in the final translation. Constraints can be obtained in two ways: addition (Ma et al., 2011) and subtraction (Koehn and Senellart, 2010). Figure 1 exemplifies the differences between them.
The construction of a constrained lattice is very similar to that of a word lattice, except that we need to label some edges with constraints.

Figure 1: An example of generating a constrained input in two ways: addition and subtraction. While addition replaces an input phrase with a target phrase from a TM instance (an example is marked by lighter gray), subtraction removes mismatched target words and inserts mismatched input words (darker gray). Constraints are specified by <>. Sentences are taken from Koehn and Senellart (2010).

The general process is:
1. Building an initial lattice for an input sentence. This produces a chain.
2. Adding phrasal constraints to the lattice, which produces extra nodes and edges. Figure 2 shows an example of a constrained lattice for the sentence in Figure 1. A minimal sketch of this two-step construction follows.
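The sketch below continues the Lattice sketch above. It simplifies in one respect: it assumes each constraint is keyed by an input span and is added as a single edge, whereas constraints from subtraction can introduce multi-edge paths with extra nodes (see Section 2.2).

```python
def build_constrained_lattice(input_words, constraints):
    """Step 1: build a chain lattice over the input sentence.
    Step 2: add one extra edge per phrasal constraint.

    `constraints` maps an input span (i, j), covering words i..j-1,
    to a target-language constraint phrase (our simplification).
    """
    lat = Lattice(start=0, end=len(input_words))
    # Step 1: a chain with one node per word boundary.
    for i, w in enumerate(input_words):
        lat.add_edge(i, i + 1, w)
    # Step 2: a constraint edge skips from node i straight to node j.
    for (i, j), phrase in constraints.items():
        lat.add_edge(i, j, phrase, is_constraint=True)
    return lat
```

For the addition example in Figure 1, the constraint dictionary would map the matched input span to the corresponding TM target phrase.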
In the rest of this section, we explain how to use addition and subtraction to build a constrained lattice, and describe the decoder for translating such a lattice. Notation used in this section: an input sentence f and a TM instance ⟨f′, e′, A⟩, where f′ is the TM source, e′ is the TM target and A is a word alignment between f′ and e′.

Addition
In the addition approach, matched input words are directly replaced by their translations from the retrieved TM instance, which means that addition follows the word order of the input sentence. This property makes it easy to obtain constraints for an input phrase.
For an input phrase f̄, we first find its matched phrase f̄′ in f′ via string edits³ between f and f′, so that f̄ = f̄′. Then, we extract its translation ē′ from e′, which is consistent with the alignment A (Och and Ney, 2004).
To build a lattice using addition, we directly add a new edge to the lattice which covers f̄ and is labeled by ē′. For example, dash-dotted lines in Figure 2 are labeled by constraints from addition.
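The consistency requirement can be made concrete with a short sketch (our own code, not the paper's implementation; `alignment` is a set of (source, target) index pairs between f′ and e′):

```python
def extract_consistent_phrase(src_span, alignment):
    """Return the target span (lo, hi) translating src_span = (i1, i2)
    if one consistent with the alignment exists, else None.

    A target span is consistent if no word inside it is aligned to a
    source word outside src_span (Och and Ney, 2004).
    """
    i1, i2 = src_span
    points = [(s, t) for (s, t) in alignment if i1 <= s < i2]
    if not points:
        return None  # unaligned source span: no translation extractable
    lo = min(t for _, t in points)
    hi = max(t for _, t in points) + 1
    # Consistency check: target words in [lo, hi) must not align
    # to source words outside [i1, i2).
    for s, t in alignment:
        if lo <= t < hi and not (i1 <= s < i2):
            return None
    return lo, hi
```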
³ String edits, as used in computing the Levenshtein distance (Levenshtein, 1966), include match, substitution, deletion and insertion, with the following priority in this paper: match > substitution > deletion > insertion.
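As an illustration of these prioritized string edits, a standard dynamic-programming sketch (our own code, applying the priority when backtracing among equally optimal edit paths):

```python
def string_edits(src, tgt):
    """Levenshtein edit script between two token sequences, breaking
    ties with the priority: match > substitution > deletion > insertion.
    Returns a list of (op, src_index, tgt_index) tuples."""
    n, m = len(src), len(tgt)
    # d[i][j] = edit distance between src[:i] and tgt[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i-1][j-1] + (src[i-1] != tgt[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    # Backtrace, preferring match, then substitution, deletion, insertion.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and src[i-1] == tgt[j-1] and d[i][j] == d[i-1][j-1]:
            ops.append(("match", i-1, j-1)); i, j = i-1, j-1
        elif i > 0 and j > 0 and d[i][j] == d[i-1][j-1] + 1:
            ops.append(("substitution", i-1, j-1)); i, j = i-1, j-1
        elif i > 0 and d[i][j] == d[i-1][j] + 1:
            ops.append(("deletion", i-1, None)); i -= 1
        else:
            ops.append(("insertion", None, j-1)); j -= 1
    return list(reversed(ops))
```

Here a "deletion" consumes a word of the first sequence only, i.e. an extra input word, which is the sense used in restriction 3 below.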

Subtraction
In subtraction, mismatched input words in f are inserted into e′ and mismatched words in e′ are removed. The insertion position is determined by A. The advantage of subtraction is that it keeps the word order of e′. This is important, since the reordering of target words is one of the fundamental problems in SMT, especially for language pairs with a high degree of syntactic reordering.
However, this property makes it hard to build a lattice from subtraction since, unlike addition, subtraction does not directly produce a constraint for an input phrase. Thus, for some generated constraints, there is no specific corresponding phrase in the input. Moreover, when adding a constraint to the lattice, we need to consider its context so that the lattice preserves the target word order.
To solve this problem, we propose to segment an input sentence into a sequence of phrases according to information from the matched TM instance (i.e. the string edits and the word alignment), and then create a constrained input for each phrase and add it to the lattice.
Formally, we produce a monotonic segmentation ⟨f̄_1, f̄′_1, ē′_1⟩ ⋯ ⟨f̄_N, f̄′_N, ē′_N⟩ for each sentence triple ⟨f, f′, e′⟩. Each tuple ⟨f̄_i, f̄′_i, ē′_i⟩ is obtained in two phases: (1) according to the alignment A, f̄′_i and ē′_i are produced; (2) based on the string edits between f and f′, f̄_i is recognized. The resulting tuples are subject to several restrictions (a simplified sketch follows the list):

1. Each ⟨f̄′_i, ē′_i⟩ is consistent with the word alignment A, and at least one word in f̄′_i is aligned to words in ē′_i.

2. Each boundary word in f̄_i is either the first word or the last word of f, or is aligned to at least one word in e′, so that mismatched input words in f̄_i which are unaligned can find their position in the current tuple.
3. The string edit for the first word of f̄_i, where i ≠ 1, is not "deletion". That means the first word is not an extra input word. This is because, in subtraction, the insertion position of a mismatched unaligned word depends on the alignment of the word before it.
4. No smaller tuples may be extracted without violating restrictions 1-3. This allows us to obtain a unique segmentation where each tuple is minimal.
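The following much-simplified sketch enforces only restrictions 1 and 4, reusing extract_consistent_phrase from above; restrictions 2 and 3 additionally need the string edits against the input f and are omitted here:

```python
def minimal_segments(src_len, alignment):
    """Greedy left-to-right segmentation of the TM source f':
    each emitted span is the smallest one, starting at the current
    position, whose projected target span is consistent with the
    alignment (restriction 1); taking the smallest such span gives
    minimal tuples (restriction 4)."""
    segments, start = [], 0
    while start < src_len:
        end = start + 1
        # Grow the span until a consistent target span exists.
        while end <= src_len:
            tgt_span = extract_consistent_phrase((start, end), alignment)
            if tgt_span is not None:
                segments.append(((start, end), tgt_span))
                break
            end += 1
        else:
            break  # trailing unaligned words: no consistent span remains
        start = end
    return segments
```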
After obtaining the segmentation, we create a constrained input for each f̄_i using subtraction and add it to the lattice by creating a path covering f̄_i. The path contains one or more edges, each of which is labeled either by an input word or by a constraint in the constrained input.

Decoding
The decoder for integrating word lattices into the phrase-based model (Koehn et al., 2003) works similarly to the phrase-based decoder, except that it tracks nodes instead of words (Dyer et al., 2008): given the topological order of nodes in a lattice, the decoder builds a translation hypothesis from left to right by selecting a range of untranslated nodes.
The decoder for a constrained lattice works similarly except that, for a constrained edge, the decoder can only build its translation directly from the constraint. For example, in Figure 2, the translation of the edge "1 → 5" is ", le texte du deuxième alinéa".
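As a toy illustration of the constraint rule only: a real decoder performs beam search with reordering and model scores over node ranges (Dyer et al., 2008), while this monotone walk reuses the Lattice sketch above and a hypothetical word-for-word phrase_table dict of our own:

```python
def decode_monotone(lat, phrase_table):
    """Follow one path from start to end: a constrained edge can only
    contribute its own constraint verbatim, while a word edge is
    translated with a toy phrase table (identity as fallback)."""
    # Index one outgoing edge per node, preferring constraint edges.
    out = {}
    for e in lat.edges:
        cur = out.get(e.head)
        if cur is None or (e.is_constraint and not cur.is_constraint):
            out[e.head] = e
    node, output = lat.start, []
    while node != lat.end:
        e = out[node]
        # The constraint rule: a constrained edge is "translated"
        # directly by its own label.
        output.append(e.label if e.is_constraint
                      else phrase_table.get(e.label, e.label))
        node = e.tail
    return " ".join(output)
```

With the lattice of Figure 2, an edge such as "1 → 5" would emit its constraint ", le texte du deuxième alinéa" unchanged.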

Experiment
In our experiments, a baseline system PB is built with the phrase-based model in Moses (Koehn et al., 2007). We compare our approach with three other combination methods. ADD combines PB with addition (Ma et al., 2011), while SUB combines PB with subtraction (Koehn and Senellart, 2010). WANG combines SMT and TM at phrase-level during decoding (Wang et al., 2013; Li et al., 2014): for each phrase pair applied to translate an input phrase, WANG finds its corresponding phrase pairs in a TM instance and then extracts features which are directly added to the log-linear framework (Och and Ney, 2002) as sparse features. We build three systems based on our approach: CWL_add only uses constraints from addition; CWL_sub only uses constraints from subtraction; CWL_both uses constraints from both.

Table 1 shows a summary of our datasets. The EN-ZH dataset is a translation memory from Symantec. Our EN-FR dataset is from the publicly available JRC-Acquis corpus. Word alignment is performed by GIZA++ (Och and Ney, 2003) with the heuristic function grow-diag-final-and.

Table 2: Experimental results of comparing our approach (CWL_x) with previous work. All scores reported are an average of 3 runs. Scores with * are significantly better than that of the baseline PB at p < 0.01. Bold scores are significantly better than that of all previous work at p < 0.01.
We use SRILM (Stolcke, 2002) to train a 5-gram language model on the target side of our training data with modified Kneser-Ney discounting (Chen and Goodman, 1996). Batch MIRA (Cherry and Foster, 2012) is used to tune weights. Case-insensitive BLEU [%] and TER [%] are used to evaluate translation results.

Table 2 shows experimental results on EN-ZH and EN-FR. We find that our method (CWL_x) significantly improves the baseline system PB by up to +5.5% BLEU score on EN-ZH and by +2.4% BLEU score on EN-FR. In terms of TER, our system significantly decreases the error by up to -4.3%/-2.2% on EN-ZH and EN-FR, respectively. Although ADD and SUB work well on EN-ZH compared to the baseline PB, they reduce translation quality on EN-FR. By contrast, their phrase-level counterparts (CWL_add and CWL_sub) bring consistent improvements over the baseline on both language pairs. This suggests that a combination approach based on constrained word lattices is more effective and robust than sentence-level constrained translation. Compared to the system WANG, our method also produces significantly better translations. In addition, our approach is simpler and easier to adopt than WANG.

Results
Compared with CWL_add, CWL_sub produces better translations. This may suggest that, for a constrained word lattice, subtraction generates a better sequence of constraints than addition, since it keeps the target words and their order. However, combining them (i.e. CWL_both) does not bring a further improvement. We assume the reason is that addition and subtraction share parts of the constraints generated from the same TM instance. For example, in Figure 2, the edge "1 → 5" based on addition and the edge "11 → 7" based on subtraction are labeled by the same constraint.

Influence of Fuzzy Match Scores
Since a fuzzy match scorer is used to select the best TM instance for an input sentence, and is thus an important factor in combining SMT and TM, it is interesting to know what impact it has on the translation quality of the various approaches. Table 3 shows statistics of each test subset on EN-ZH and EN-FR, where sentences are grouped by their fuzzy match scores. Figure 3 shows BLEU scores of systems evaluated on these subsets. We find that BLEU scores grow steadily as match scores become higher. While ADD achieves better BLEU scores than SUB on lower fuzzy ranges, SUB performs better than ADD on higher fuzzy scores. In addition, our approaches (CWL_x) are better than the baseline on all ranges but show much more improvement on ranges with higher fuzzy scores.
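The exact scorer is not reproduced in this excerpt. A common choice, which we assume here purely for illustration, normalizes the word-level edit distance by the length of the longer sentence, cf. Koehn and Senellart (2010); this sketch reuses string_edits from Section 2:

```python
def fuzzy_match_score(input_words, tm_source_words):
    """Fuzzy match score in [0, 1]: 1 minus the word-level edit
    distance normalized by the longer sentence length (an assumed
    definition, not necessarily the one used in the paper)."""
    ops = string_edits(input_words, tm_source_words)
    distance = sum(op != "match" for op, _, _ in ops)
    return 1.0 - distance / max(len(input_words), len(tm_source_words))
```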

Conclusion
In this paper, we propose a constrained word lattice to combine SMT and TM at phrase-level. This method uses a word lattice to encode all possible phrasal constraints together. These constraints come from two sentence-level constrained approaches: addition and subtraction. Experiments on English-Chinese and English-French show that, compared with previous combination methods, our approach produces significantly better translation results.
In the future, we would like to consider generating constraints from more than one fuzzy match and using fuzzy match scores or a more sophisticated function to weight constraints. It would also be interesting to know if our method will work better when discarding fuzzy matches with very low scores.