Graph-Based Translation Via Graph Segmentation

One major drawback of phrase-based translation is that it segments an input sentence into continuous phrases. To support linguistically informed source discontinuity, in this paper we construct graphs which combine bigram and dependency relations and propose a graph-based translation model. The model segments an input graph into connected subgraphs, each of which may cover a discontinuous phrase. We use beam search to combine translations of each subgraph left-to-right to produce a complete translation. Experiments on Chinese–English and German–English tasks show that our system is significantly better than the phrase-based model by up to +1.5/+0.5 BLEU scores. By explicitly modeling the graph segmentation, our system obtains further improvement, especially on German–English.


Introduction
Statistical machine translation (SMT) starts from sequence-based models. The well-known phrase-based (PB) translation model (Koehn et al., 2003) has significantly advanced the progress of SMT by extending translation units from single words to phrases. By using phrases, PB models can capture local phenomena, such as word order, word deletion, and word insertion. However, one of the significant weaknesses in conventional PB models is that only continuous phrases are used, so generalizations such as French ne ... pas to English not cannot be learned. To solve this, syntax-based models (Galley et al., 2004; Chiang, 2005; Marcu et al., 2006) take tree structures into consideration to learn translation patterns by using non-terminals for generalization.

Model                                        C    D    S
(Koehn et al., 2003)                         •         sequence
(Galley and Manning, 2010)                   •    •    sequence
Treelet models (e.g., Xiong et al., 2007)         •    tree
This work                                    •    •    graph

Table 1: Comparison between our work and previous work in terms of three aspects: keeping continuous phrases (C), allowing discontinuous phrases (D), and input structures (S).
However, the expressiveness of these models is confined by hierarchical constraints of the grammars used (Galley and Manning, 2010) since these patterns still cover continuous spans of an input sentence. By contrast, treelet-based models such as Xiong et al. (2007) take treelets from dependency trees as the basic translation units. These treelets are connected and may cover discontinuous phrases. However, such models cannot handle continuous phrases which are not connected in trees but could in fact be extremely important to system performance (Koehn et al., 2003). Galley and Manning (2010) directly extract discontinuous phrases from input sequences. However, without additional restrictions on discontinuity, the number of extracted rules can be very large and many of the rules can be unreliable.
Different from previous work (as shown in Table 1), in this paper we use graphs as input structures and propose a graph-based translation model to translate a graph into a target string. The basic translation unit in this model is a connected subgraph which may cover discontinuous phrases. The main contributions of this work are summarized as follows:
• We propose to use a graph structure to combine a sequence and a tree (Section 3.1). The graph contains both local relations between words from the sequence and long-distance relations from the tree.
• We present a translation model to translate a graph (Section 3). The model segments the graph into subgraphs and uses beam search to generate a complete translation from left to right by combining translation options of each subgraph.
• We present a set of sparse features to explicitly model the graph segmentation (Section 4). These features are based on edges in the input graph, each of which is either inside a subgraph or connects the subgraph with a previous subgraph.
• Experiments (Section 5) on Chinese-English and German-English tasks show that our model is significantly better than the PB model. After incorporating the segmentation model, our system achieves still further improvement.

Review: Phrase-based Translation
We first review the basic PB translation approach, which will be extended to our graph-based translation model. Given a pair of sentences $\langle S, T \rangle$, the conventional PB model is defined as Equation (1):

$$p(\bar{t}_1^I \mid \bar{s}_1^I) = \prod_{i=1}^{I} p(\bar{t}_i \mid \bar{s}_{a_i}) \, d(\bar{s}_{a_i}, \bar{s}_{a_{i-1}}) \qquad (1)$$

The target sentence $T$ is broken into $I$ phrases $\bar{t}_1 \cdots \bar{t}_I$, each of which is a translation of a source phrase $\bar{s}_{a_i}$. $d$ is a distance-based reordering model. Note that in the basic PB model, the phrase segmentation is not explicitly modeled, which means that different segmentations are treated equally (Koehn, 2010). The performance of PB translation relies on the quality of phrase pairs in a translation table. Conventionally, a phrase pair $\langle \bar{s}, \bar{t} \rangle$ has two properties: (i) $\bar{s}$ and $\bar{t}$ are continuous phrases. (ii) $\langle \bar{s}, \bar{t} \rangle$ is consistent with a word alignment $A$ (Och and Ney, 2004): $\forall (i, j) \in A,\ s_i \in \bar{s} \Leftrightarrow t_j \in \bar{t}$ and $\exists s_i \in \bar{s},\ t_j \in \bar{t},\ (i, j) \in A$.
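The consistency condition can be made concrete with a small check. The following is a minimal sketch, assuming phrases are represented as sets of word positions and the alignment as a set of (source, target) index pairs; the function name `is_consistent` and the toy alignment are illustrative and not taken from any toolkit. Because phrases are position sets, the same check applies to discontinuous source phrases.

```python
from typing import Set, Tuple

Alignment = Set[Tuple[int, int]]  # (source position, target position) links


def is_consistent(src: Set[int], tgt: Set[int], alignment: Alignment) -> bool:
    """Check a phrase pair, given as sets of word positions, against an alignment."""
    has_link = False
    for i, j in alignment:
        # Every link must lie entirely inside or entirely outside the pair.
        if (i in src) != (j in tgt):
            return False
        if i in src:
            has_link = True
    # At least one link must fall inside the phrase pair.
    return has_link


# Hypothetical toy alignment for a 3-word source and a 3-word target sentence.
toy_alignment = {(0, 0), (1, 2), (2, 1)}
ok = is_consistent({1, 2}, {1, 2}, toy_alignment)
bad = is_consistent({0, 1}, {0, 1}, toy_alignment)
gapped = is_consistent({0, 2}, {0, 1}, toy_alignment)  # discontinuous source side
print(ok, bad, gapped)
```

In the toy case, the pair with source positions {0, 1} fails because source word 1 is linked to target word 2, which lies outside the target phrase.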
PB decoders generate hypotheses (partial translations) from left to right. Each hypothesis maintains a coverage vector to indicate which source words have been translated so far. A hypothesis can be extended on the right by translating an uncovered source phrase. The translation process ends when all source words have been translated. Beam search (as in Figure 1) is taken as an approximate search strategy to reduce the size of the decoding space. Hypotheses which cover the same number of source words are grouped in a stack. Hypotheses can be pruned according to their partial translation cost and an estimated future cost.

Figure 1: Beam search for phrase-based MT. • denotes a covered source position; uncovered positions are shown with a different marker (Liu and Huang, 2014).

Graph-Based Translation
Our graph-based translation model extends PB translation by translating an input graph rather than a sequence to a target string. The graph is segmented into a sequence of connected subgraphs, each of which corresponds to a target phrase, as in Equation (2):

$$p(\bar{t}_1^I \mid G(\tilde{s}_1^I)) = \prod_{i=1}^{I} p(\bar{t}_i \mid G(\tilde{s}_{a_i})) \, d(\tilde{s}_{a_i}, \tilde{s}_{a_{i-1}}) \qquad (2)$$

where $G(\tilde{s}_i)$ denotes a connected source subgraph which covers a (possibly discontinuous) phrase $\tilde{s}_i$.

Building Graphs
As a more powerful and natural structure for sentence modeling, a graph can model various kinds of word-relations together in a unified representation. In this paper, we use graphs to combine two commonly used relations: bigram relations and dependency relations. Figure 2 shows an example of a graph. Each edge in the graph denotes either a dependency relation or a bigram relation.
Note that the graph we use in this paper is directed, connected, node-labeled and may contain cycles. Bigram relations are implied in sequences and provide local and sequential information on pairs of continuous words. Phrases connected by bigram relations (i.e. continuous phrases) are known to be useful to improve phrase coverage (Hanneman and Lavie, 2009). By contrast, dependency relations come from dependency structures which model syntactic and semantic relations between words. Phrases whose words are connected by dependency relations (also known as treelets) are linguistically motivated and thus more reliable.
By combining these two relations together in graphs, we can make use of both continuous and linguistically informed discontinuous phrases as long as they are connected subgraphs.
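A short sketch may help to show how such a graph admits both kinds of phrases. It assumes a sentence is given as a list of head indices (one per word, -1 for the root); the edge directions, the helper names `build_graph` and `is_connected`, and the toy parse are assumptions of this sketch. Connectivity is tested while ignoring edge direction, which is also an assumption.

```python
from collections import deque
from typing import List, Set, Tuple

Edge = Tuple[int, int]


def build_graph(heads: List[int]) -> Set[Edge]:
    """Combine bigram edges and dependency edges over word positions."""
    edges: Set[Edge] = set()
    for i in range(len(heads) - 1):
        edges.add((i, i + 1))  # bigram relation between neighbouring words
    for dependent, head in enumerate(heads):
        if head >= 0:
            edges.add((head, dependent))  # dependency relation
    return edges


def is_connected(nodes: Set[int], edges: Set[Edge]) -> bool:
    """True if `nodes` induce a connected subgraph (direction ignored)."""
    if not nodes:
        return False
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        if u in nodes and v in nodes:
            neighbours[u].add(v)
            neighbours[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in neighbours[u] - seen:
            seen.add(v)
            queue.append(v)
    return seen == nodes


# Hypothetical 4-word sentence: word 3 is the root, 0 depends on 3,
# 1 depends on 0, and 2 depends on 1.
toy_edges = build_graph([3, 0, 1, -1])
print(is_connected({1, 2}, toy_edges))  # continuous phrase, linked by a bigram edge
print(is_connected({0, 3}, toy_edges))  # discontinuous phrase, linked by a dependency edge
print(is_connected({0, 2}, toy_edges))  # discontinuous and not linked
```

Only the first two position sets would qualify as translation units under the connected-subgraph requirement.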

Training
Different from PB translation, the basic translation units in our model are subgraphs. Thus, during training, we extract subgraph-phrase pairs instead of phrase pairs on parallel graph-string sentences associated with word alignments. An example of a translation rule is as follows:

FIFA Shijiebei Juxing → FIFA World Cup was held

Note that the source side of a rule in our model is a graph which can be used to cover either a continuous phrase or a discontinuous phrase according to its match in an input graph during decoding. The algorithm for extracting translation rules is shown in Algorithm 1. This algorithm traverses each phrase pair $\langle \tilde{s}, \bar{t} \rangle$ which is within a length limit and consistent with a given word alignment (lines 1-2), and outputs $\langle G(\tilde{s}), \bar{t} \rangle$ if $\tilde{s}$ is covered by a connected subgraph $G(\tilde{s})$ (lines 6-8). A source phrase can be extended with unaligned source words which are adjacent to the phrase (lines 9-14). We use a queue Q to store all phrases which are consistently aligned to the same target phrase (line 3).

Algorithm 1: Algorithm for extracting translation rules from a graph-string pair.
Data: A word-aligned graph-string pair (G(S), T, A)
Result: A set of translation pairs R
1 for each phrase t in T : |t| ≤ L do
2     find the minimal (possibly discontinuous) phrase s in S so that |s| ≤ L and ⟨s, t⟩ is consistent with A;
...
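The extraction procedure described above can be sketched as follows. This is an illustrative reading of the description, not a transcription of Algorithm 1: it assumes that "adjacent" means neighbouring in the word sequence, represents phrases as position tuples, and tests connectivity while ignoring edge direction. The names `extract_rules` and `connected` and the toy data are hypothetical.

```python
from collections import deque


def connected(nodes, edges):
    """True if `nodes` induce a connected subgraph (edge direction ignored)."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == u and y in nodes and y not in seen:
                    seen.add(y)
                    queue.append(y)
    return seen == nodes


def extract_rules(n_src, n_tgt, alignment, edges, max_len=3):
    """Return a set of (source positions, (target start, target end)) pairs."""
    aligned_src = {i for i, _ in alignment}
    rules = set()
    for start in range(n_tgt):
        for end in range(start, min(start + max_len, n_tgt)):
            tgt = set(range(start, end + 1))
            # Minimal source phrase: all source words linked to the target phrase.
            src = frozenset(i for i, j in alignment if j in tgt)
            if not src or len(src) > max_len:
                continue
            # Consistency: no word of the source phrase is linked outside tgt.
            if any(i in src and j not in tgt for i, j in alignment):
                continue
            queue, seen = deque([src]), {src}
            while queue:
                phrase = queue.popleft()
                if connected(phrase, edges):
                    rules.add((tuple(sorted(phrase)), (start, end)))
                # Extend with unaligned source words next to the phrase.
                for i in phrase:
                    for nb in (i - 1, i + 1):
                        if 0 <= nb < n_src and nb not in phrase and nb not in aligned_src:
                            ext = phrase | {nb}
                            if len(ext) <= max_len and ext not in seen:
                                seen.add(ext)
                                queue.append(ext)
    return rules


# Hypothetical toy input: 4 source words (word 3 unaligned), 2 target words.
toy_links = {(0, 0), (2, 0), (1, 1)}
toy_graph = {(0, 1), (1, 2), (2, 3), (0, 2)}  # three bigram edges plus one dependency edge
toy_rules = extract_rules(4, 2, toy_links, toy_graph)
print(sorted(toy_rules))
```

In this toy input the discontinuous source phrase (0, 2) is extracted only because the dependency edge (0, 2) makes it a connected subgraph; with bigram edges alone it would be discarded.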

Model and Decoding
We define our model in the log-linear framework (Och and Ney, 2002) over a derivation $D = r_1 r_2 \cdots r_N$, as in Equation (3):

$$p(D) \propto \prod_{i} \phi_i(D)^{\lambda_i} \qquad (3)$$

where $r_i$ are translation rules, $\phi_i$ are features defined on derivations and $\lambda_i$ are feature weights.
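For reference, the log-linear model can be restated in the log domain, where the score of a derivation is a weighted sum of log feature values. The normalizer $Z$ and the notation $\hat{D}$ for the best derivation are introduced here for illustration only; since $Z$ does not depend on $D$, it can be ignored during search.

$$\log p(D) = \sum_{i} \lambda_i \log \phi_i(D) - \log Z, \qquad \hat{D} = \arg\max_{D} \sum_{i} \lambda_i \log \phi_i(D)$$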
In our experiments, we use the standard 9 features: two translation probabilities $p(G(\tilde{s}) \mid \bar{t})$ and $p(\bar{t} \mid G(\tilde{s}))$, two lexical translation probabilities $p_{lex}(\tilde{s} \mid \bar{t})$ and $p_{lex}(\bar{t} \mid \tilde{s})$, a language model $lm(t)$ over a translation $t$, a rule penalty, a word penalty, an unknown word penalty and a distortion feature $d$ for distance-based reordering. The calculation of the distortion feature $d$ in our model is different from the one used in conventional PB models, as we need to take discontinuity into consideration. In this paper, we use a distortion function defined in Galley and Manning (2010) to penalize discontinuous phrases that have relatively long gaps. Figure 3 shows an example of calculating distortion for discontinuous phrases.

Figure 3: Distortion calculation for both continuous and discontinuous phrases in a derivation.
Our graph-based decoder is very similar to the PB decoder except that, in our decoder, each hypothesis is extended by translating an uncovered subgraph instead of a phrase. Positions covered by the subgraph are then marked as translated.
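The decoding loop can be illustrated with a deliberately simplified stack decoder. This is a sketch under several assumptions: translation options are given directly as sets of covered positions (so continuous phrases and subgraphs covering discontinuous phrases are handled alike), the score uses only a rule log-probability and a plain distance-based jump penalty rather than the distortion function of Galley and Manning (2010), and there is no language model, future cost, or hypothesis recombination. All names and toy scores are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Option(NamedTuple):
    positions: Tuple[int, ...]  # source positions covered (may be discontinuous)
    target: str
    logp: float


class Hyp(NamedTuple):
    score: float
    coverage: int  # bitmask over source positions
    last_end: int  # right-most position of the previously translated unit
    text: str


def decode(n_words: int, options: List[Option], beam_size: int = 10,
           distortion_weight: float = 0.1) -> Optional[Hyp]:
    # stacks[k] groups hypotheses that cover exactly k source words
    stacks: List[List[Hyp]] = [[] for _ in range(n_words + 1)]
    stacks[0].append(Hyp(0.0, 0, -1, ""))
    for k in range(n_words):
        stacks[k].sort(key=lambda h: h.score, reverse=True)
        for hyp in stacks[k][:beam_size]:  # beam pruning
            for opt in options:
                mask = sum(1 << p for p in opt.positions)
                if hyp.coverage & mask:
                    continue  # overlaps with already translated positions
                jump = abs(min(opt.positions) - (hyp.last_end + 1))
                stacks[k + len(opt.positions)].append(Hyp(
                    hyp.score + opt.logp - distortion_weight * jump,
                    hyp.coverage | mask,  # mark positions as translated
                    max(opt.positions),
                    (hyp.text + " " + opt.target).strip(),
                ))
    final = stacks[n_words]
    return max(final, key=lambda h: h.score) if final else None


# Hypothetical options for the source "je ne veux pas" (positions 0..3).
toy_options = [
    Option((0,), "i", -0.1),
    Option((1, 3), "do not", -0.3),  # discontinuous unit "ne ... pas"
    Option((2,), "want", -0.2),
    Option((1,), "not", -1.0),
    Option((3,), "step", -1.5),
]
best = decode(4, toy_options)
print(best.text, round(best.score, 2))
```

With these toy scores, the derivation that uses the discontinuous unit wins over the word-by-word alternatives.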

Graph Segmentation Model
Each derivation in our graph-based translation model implies a sequence of subgraphs (also called a segmentation). By default, similar to PB translation, our model treats each segmentation equally as shown in Equation (2). However, previous work on PB translation has suggested that such segmentations provide useful information which can improve translation performance. For example, boundary information in a phrase segmentation can be used for reordering models (Xiong et al., 2006; Cherry, 2013).
In this paper, we are interested in directly modeling the segmentation using information from graphs. By making the assumption that each subgraph is only dependent on previous subgraphs, we define a generative process over a graph segmentation as in Equation (4):

$$P(G_1 \cdots G_I) = \prod_{i=1}^{I} P(G_i \mid G_1 \cdots G_{i-1}) \qquad (4)$$

Instead of training a stand-alone discriminative segmentation model to assign each subgraph a probability given previous subgraphs, we implement the model via sparse features, each of which is extracted at run-time during decoding and then directly added to the log-linear framework, so that these features can be tuned jointly with other features (of Section 3.3) to directly maximize the translation quality.

Since a segmentation is obtained by breaking up the connectivity of an input graph, it is intuitive to use edges to model the segmentation. According to Equation (4), for a current subgraph $G_i$, we only consider those edges which are either inside $G_i$ or connect $G_i$ with a previous subgraph. Based on these edges, we extract sparse features for each node in the subgraph. The set of sparse features is defined as follows:

$$\{n.w,\ n.c\} \times \{n'.w,\ n'.c\} \times \{C,\ P,\ H\} \times \{\mathit{in},\ \mathit{out}\}$$

where $n.w$ and $n.c$ are the word and class of the current node $n$, and $n'.w$ and $n'.c$ are the word and class of a node $n'$ connected to $n$. $C$, $P$, and $H$ denote that the node $n'$ is in the current subgraph $G_i$, the adjacent previous subgraph $G_{i-1}$, or another previous subgraph, respectively. Note that we treat the adjacent previous subgraph differently from others since information from the last previous unit is quite useful (Xiong et al., 2006; Cherry, 2013). $\mathit{in}$ and $\mathit{out}$ denote that the edge is an incoming edge or an outgoing edge for the current node $n$. Figure 4 shows an example of extracting sparse features for a subgraph.

Inspired by the success of sparse features in SMT (Cherry, 2013), in this paper we lexicalize only on the top-100 most frequent words. In addition, we group source words into 50 classes by using mkcls, which should provide useful generalization (Cherry, 2013) for our model.
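The feature extraction step can be sketched in a few lines. This is an illustration under stated assumptions: the graph is a list of directed (from, to) edges over word positions, word classes are supplied as a plain list (for example, as produced beforehand by a clustering tool), a word label is emitted only for words in a given frequent-word set, and all combinations of node labels are generated. The function names, the feature string format, and the toy data are hypothetical.

```python
from typing import List, Optional, Sequence, Set, Tuple


def node_labels(node: int, words: Sequence[str], classes: Sequence[int],
                frequent_words: Set[str]) -> List[str]:
    """Word label (only for frequent words) and class label of a node."""
    labels = []
    if words[node] in frequent_words:
        labels.append(f"W:{words[node]}")
    labels.append(f"C:{classes[node]}")
    return labels


def segmentation_features(current: Set[int], previous: List[Set[int]],
                          edges: List[Tuple[int, int]], words: Sequence[str],
                          classes: Sequence[int], frequent_words: Set[str]) -> List[str]:
    """Sparse features for the current subgraph, given earlier subgraphs in order."""

    def location(node: int) -> Optional[str]:
        if node in current:
            return "C"  # inside the current subgraph
        if previous and node in previous[-1]:
            return "P"  # in the adjacent previous subgraph
        if any(node in g for g in previous[:-1]):
            return "H"  # in some earlier subgraph
        return None  # not translated yet: the edge is ignored

    features = []
    for source, sink in edges:
        # An edge is outgoing for its source node and incoming for its sink node.
        for node, other, direction in ((source, sink, "out"), (sink, source, "in")):
            if node not in current:
                continue
            loc = location(other)
            if loc is None:
                continue
            for a in node_labels(node, words, classes, frequent_words):
                for b in node_labels(other, words, classes, frequent_words):
                    features.append(f"{a} {b} {loc} {direction}")
    return features


# Hypothetical 4-word input; subgraph {0} was translated first, then {3}.
toy_words = ["a", "b", "c", "d"]
toy_classes = [1, 2, 3, 4]
toy_edges = [(0, 1), (1, 2), (2, 3), (3, 1)]
toy_features = segmentation_features({1, 2}, [{0}, {3}], toy_edges,
                                     toy_words, toy_classes, {"b"})
for feature in toy_features:
    print(feature)
```

Each returned string would act as the name of one sparse feature whose weight is tuned together with the dense features.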

Experiment
We conduct experiments on Chinese-English (ZH-EN) and German-English (DE-EN) translation tasks. [...] and News-Test 2013 (WMT13) are test sets. We use mate-tools[2] to perform morphological analysis and parse German sentences (Bohnet, 2010). Then, MaltParser[3] converts a parse result into a projective dependency tree (Nivre and Nilsson, 2005).

Figure 4: An illustration of extracting sparse features for each node in a subgraph during decoding. The decoder segments the graph in Figure 2 into three subgraphs (solid rectangles) and produces a complete translation by combining translations of each subgraph (dashed rectangles). In this figure, the class of a word is randomly assigned.
Lexicalized features shown in the figure: W:Zai C:5 C in; W:Zai C:5 C out; W:Zai C:3 P out; W:Zai C:7 P in; W:Nanfei C:4 C in; W:Nanfei C:4 C out; W:Nanfei C:6 C in; W:Chenggong C:5 C out; W:Chenggong C:7 P in.
Class features shown in the figure: C:4 C:5 C in; C:4 C:5 C out; C:4 C:3 P out; C:4 C:7 P in; C:5 C:4 C in; C:5 C:4 C out; C:5 C:6 C in; C:6 C:5 C out; C:6 C:7 P in.

Settings
In this paper, we mainly report results from five systems under the same configuration. PBMT is built by the PB model in Moses (Koehn et al., 2007). Treelet extends PBMT by taking treelets as the basic translation units. We implement a Treelet model in Moses which produces translations from left to right and uses beam search for decoding. DTU extends the PB model by allowing discontinuous phrases (Galley and Manning, 2010). We implement DTU with source discontinuity in Moses. GBMT is our basic graph-based translation system, while GSM adds the graph segmentation model to GBMT. Both systems are implemented in Moses.

[2] http://code.google.com/p/mate-tools/
[3] http://www.maltparser.org/
Word alignment is performed by GIZA++ (Och and Ney, 2003) with the heuristic function grow-diag-final-and. We use SRILM (Stolcke, 2002) to train a 5-gram language model on the Xinhua portion of the English Gigaword corpus 5th edition with modified Kneser-Ney discounting (Chen and Goodman, 1996). Batch MIRA (Cherry and Foster, 2012) is used to tune weights. BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2011), and TER (Snover et al., 2006) are used for evaluation.

Table 3: Evaluation results. Each score is an average over three MIRA runs (Clark et al., 2011). * means a system is significantly better than PBMT at p ≤ 0.01. Bold figures mean a system is significantly better than Treelet at p ≤ 0.01. + means a system is significantly better than DTU at p ≤ 0.01. In this table, we mark a system by comparing it with previous ones.

Table 3 shows our evaluation results. We find that our GBMT system is significantly better than PBMT as measured by all three metrics across all test sets. Specifically, the improvements are up to +1.5/+0.5 BLEU, +0.3/+0.2 METEOR, and -0.8/-0.4 TER on ZH-EN and DE-EN, respectively. This improvement is reasonable as our system allows discontinuous phrases, which can reduce data sparsity and handle long-distance relations (Galley and Manning, 2010).

Another argument for discontinuous phrases is that they allow the decoder to use larger translation units, which tend to produce better translations (Galley and Manning, 2010). However, this argument was only verified on ZH-EN. Therefore, we are interested in seeing whether the same observation holds in our experiments on both language pairs. We count the translation rules used on MT02 and WMT11 according to their target lengths. The results are shown in Figure 5. We find that both DTU and GBMT indeed tend to use larger translation units on ZH-EN. However, on DE-EN more small translation units are used (we have the same finding on all test sets). We presume this is because long-distance reordering is performed more often on ZH-EN than on DE-EN. Since the distortion function d measures the reordering distance, we compare average distortion values: in PB, the average is 18.4 on ZH-EN MT02 and 3.5 on DE-EN WMT11. Our observations suggest that the argument that discontinuous phrases allow decoders to use larger translation units should be considered with caution when we explain the benefit of discontinuity on different language pairs.

Compared to PBMT, the Treelet system does not show consistent improvements. Our system achieves significantly better BLEU and METEOR scores than Treelet on both ZH-EN and DE-EN, and a better TER score on DE-EN. This suggests that continuous phrases are essential for system robustness since they help to improve phrase coverage (Hanneman and Lavie, 2009). Lower phrase coverage in Treelet results in more short phrases being used, as shown in Figure 5. In addition, we find that neither DTU nor our systems achieve consistent improvements over Treelet in terms of TER. We observed that both DTU and our systems tend to produce longer translations than Treelet, which might make the TER evaluation in our experiments unreliable, as TER favours shorter sentences (He and Way, 2010).

Results and Discussion
Since discontinuous phrases produced by using syntactic information are fewer in number but more reliable (Koehn et al., 2003), our GBMT system achieves comparable performance with DTU but uses significantly fewer rules, as shown in Table 4. After integrating the graph segmentation model to help subgraph selection, GBMT is further improved and the resulting system GSM has significantly better evaluation scores than DTU on both language pairs. However, our segmentation model is more helpful on DE-EN than on ZH-EN. We find that the number of features learned on ZH-EN (25K+) is much smaller than on DE-EN (49K+). This may result in lower feature coverage during decoding. The lower number of features on ZH-EN could be caused by the fact that the development set MT02 has many fewer sentences than WMT11. Accordingly, we suggest using a larger development set during tuning to achieve better translation performance when the segmentation model is integrated.

Our current model is more akin to addressing problems in phrase-based and treelet-based models by segmenting graphs into pieces rather than extracting a recursive grammar. Therefore, similar to those models, our model is weak at phrase reordering as well. However, we are interested in the potential power of our model when it incorporates lexical reordering (LR) models, and in comparing it with syntax-based models. Table 5 shows BLEU scores of the hierarchical phrase-based (HPB) system (Chiang, 2005) in Moses and GBMT combined with a word-based LR model (Koehn et al., 2005). We find that the LR model significantly improves our system. GBMT+LR is comparable with the Moses HPB model on Chinese-English and better than HPB on German-English.

Table 5: BLEU scores of a Moses hierarchical phrase-based system (HPB) and our system (GBMT) with a word-based lexical reordering model (LR).

Figure 6 shows three examples from MT04 to better explain the differences between the systems. Example 1 shows that systems which allow discontinuous phrases (namely Treelet, DTU, GBMT, and GSM) successfully translate a Chinese collocation "Yu ... Wuguan" to "have nothing to do with", while PBMT fails to catch the generalization since it only allows continuous phrases. In Example 2, Treelet translates a discontinuous phrase "Dui ... Zuofa" (to ... practice) only as "to", where an important target word "practice" is dropped. By contrast, bigram relations allow our systems (GBMT and GSM) to find a better phrase to translate: "De Zuofa" to "of practice". In addition, DTU translates a discontinuous phrase "De Zuofa ... Buman" to "dissatisfaction with the approach of". However, the phrase is actually not linguistically motivated and could be unreliable. By disallowing phrases which are not connected in the input graph, GBMT and GSM produce better translations. Example 3 illustrates that our graph segmentation model helps to select better subgraphs. After obtaining a partial translation "the government must", GSM chooses to translate a subgraph which covers a discontinuous phrase "Jixu ... Zuo" to "continue to make", while GBMT translates "Jixu Yu" (continue ... with) to "continue to work together with". By selecting the proper subgraph to translate, GSM performs a better reordering on the translation.

Figure 6: Translation examples from MT04 produced by different systems. Each source sentence is annotated by dependency relations and additional bigram relations (dotted red edges). We also annotate phrase alignments produced by our system GSM.

Example 2
PBMT: the united states government to brazil has repeatedly expressed its dissatisfaction .
Treelet: the government of brazil to the united states has on many occasions expressed their discontent .
DTU: the united states has repeatedly expressed its dissatisfaction with the approach of the government to brazil .
GBMT: the us government has repeatedly expressed dissatisfaction with the practice of brazil .
GSM: the us government has repeatedly expressed dissatisfaction with the practice of brazil .

Example 3
PBMT: the government and all sectors of society should continue to explore in depth and draw on collective wisdom .
Treelet: the government must continue to make in-depth discussions with various sectors of the community and the collective wisdom .
DTU: the government must continue to work together with various sectors of the community to make an in-depth study and draw on collective wisdom .
GBMT: the government must continue to work together with various sectors of the community in-depth study and draw on collective wisdom .
GSM: the government must continue to make in-depth discussions with various sectors of the community and draw on collective wisdom .
REF: the government must continue to hold thorough discussions with all walks of life to pool the wisdom of the masses .
Source (gloss): government must continue with society each community make in-depth discussion , draw collective wisdom .

Related Work
Starting from sequence-based models, SMT has been benefiting increasingly from complex structures.
Sequence-based MT: Since the breakthrough made by IBM on word-based models in the 1990s (Brown et al., 1993), SMT has developed rapidly. The PB model (Koehn et al., 2003) advanced the state-of-the-art by translating multi-word units, which makes it better able to capture local phenomena. However, a major drawback in PBMT is that only continuous phrases are considered. Galley and Manning (2010) extend PBMT by allowing discontinuity. However, without linguistic structure information such as syntax trees, sequence-based models can learn a large number of phrases, many of which may be unreliable.
Tree-based MT: Compared to sequences, trees provide recursive structures over sentences and can handle long-distance relations. Typically, trees used in SMT are either phrasal structures (Galley et al., 2004; Marcu et al., 2006) or dependency structures (Xiong et al., 2007; Xie et al., 2011; Li et al., 2014). However, conventional tree-based models only use linguistically well-formed phrases. Although they are more reliable in theory, discarding all phrase pairs which are not linguistically motivated is an overly harsh decision. Therefore, exploring more translation rules usually can significantly improve translation performance (Marcu et al., 2006; DeNeefe et al., 2007; Mi et al., 2008).
Graph-based MT: Compared to sequences and trees, graphs are more general and can represent more relations between words. In recent years, graphs have been drawing quite a lot of attention from researchers. Jones et al. (2012) propose a hypergraph-based translation model where hypergraphs are taken as a meaning representation of sentences. However, large corpora with annotated hypergraphs are not readily available for MT. Li et al. (2015) use an edge replacement grammar to translate dependency graphs which are converted from dependency trees by labeling edges. However, their model only focuses on subgraphs which cover continuous phrases.

Conclusion
In this paper, we extend the conventional phrase-based translation model by allowing discontinuous phrases. We use graphs which combine bigram and dependency relations together as inputs and present a graph-based translation model. Experiments on Chinese-English and German-English show our model to be significantly better than the phrase-based model as well as other more sophisticated models. In addition, we present a graph segmentation model to explicitly guide the selection of subgraphs. In experiments, this model further improves our system.
In the future, we will extend this model to allow discontinuity on target sides and explore the possibility of directly encoding reordering information in translation rules. We are also interested in using graphs for neural machine translation to see how it can translate and benefit from graphs.