Phrase Table Pruning via Submodular Function Maximization

Phrase table pruning is the act of removing phrase pairs from a phrase table to make it smaller, ideally removing the least useful phrases first. We propose a phrase table pruning method that formulates the task as a submodular function maximization problem and solves it with a greedy heuristic algorithm. The proposed method scales to large inputs and long phrases, and experiments show that it achieves higher BLEU scores than state-of-the-art pruning methods.


Introduction
A phrase table, a key component of phrase-based statistical machine translation (PBMT) systems, consists of a set of phrase pairs. A phrase pair is a pair of source and target language phrases, and is used as the atomic translation unit. Today's PBMT systems have to store and process large phrase tables that contain more than 100M phrase pairs, and their sheer size prevents PBMT systems from running in resource-limited environments such as mobile phones. Even if a computer has enough resources, large phrase tables increase turnaround time and hinder the rapid development of MT systems.
Phrase table pruning is the technique of removing ineffective phrase pairs from a phrase table to make it smaller while minimizing the performance degradation. Existing phrase table pruning methods use different metrics to rank the phrase pairs contained in the table, and then remove low-ranked pairs. Metrics used in previous work are frequency, conditional probability, and Fisher's exact test score (Johnson et al., 2007). Zens et al. (2012) evaluated many phrase table pruning methods, and concluded that the entropy-based pruning method (Ling et al., 2012; Zens et al., 2012) offers the best performance. The entropy-based pruning method uses entropy to measure the redundancy of a phrase pair, where we say a phrase pair is redundant if it can be replaced by other phrase pairs. The entropy-based pruning method runs in time linear in the number of phrase pairs. Unfortunately, its running time is also exponential in the length of the phrases contained in the phrase pairs, since it involves the problem of finding an optimal phrase alignment, which is known to be NP-hard (DeNero and Klein, 2008). Therefore, the method can be impractical if the phrase pairs consist of longer phrases.
In this paper, we introduce a novel phrase table pruning method that formulates and solves the phrase table pruning problem as a submodular function maximization problem. A submodular function is a kind of set function that satisfies the submodularity property. In general, the submodular function maximization problem is NP-hard; however, for monotone functions under a cardinality constraint, a simple greedy algorithm is known to obtain (1 − 1/e)-approximate solutions, i.e., solutions whose value is at least (1 − 1/e) times the optimum (Nemhauser et al., 1978). Since a greedy algorithm scales to large inputs, our method is applicable to large phrase tables.
One key factor of the proposed method is its carefully designed objective function that evaluates the quality of a given phrase table. In this paper, we use a simple monotone submodular function that evaluates the quality of a given phrase table by its coverage of a training corpus. Our method is simple, parameter free, and its computation time does not grow exponentially with longer phrases. We conduct experiments with two different language pairs, and show that the proposed method achieves higher BLEU scores than state-of-the-art pruning methods.

Submodular Function Maximization
Let Ω be a base set consisting of M elements, and let g : 2^Ω → R be a set function that, given X ⊆ Ω, returns a real value. If g is a submodular function, then it satisfies the condition

$$ g(X \cup \{x\}) - g(X) \ge g(Y \cup \{x\}) - g(Y), $$

where X, Y ∈ 2^Ω, X ⊆ Y, and x ∈ Ω \ Y. This condition represents the diminishing return property of a submodular function, i.e., the increase in the value of the function due to the addition of item x to Y is never larger than that obtained by adding x to any subset X ⊆ Y. We say a submodular function is monotone if g(Y) ≥ g(X) for any X, Y ∈ 2^Ω satisfying X ⊆ Y. Since a submodular function has many useful properties, it appears in a wide range of applications (Kempe et al., 2003; Lin and Bilmes, 2010; Kirchhoff and Bilmes, 2014).
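To make the two properties concrete, the following is a minimal sketch that checks them by brute force on a tiny base set. The helper names (`powerset`, `is_monotone_submodular`) and the toy functions `coverage` and `squared_size` are illustrative assumptions, not part of the proposed method, and the exhaustive check is only feasible for very small Ω.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of `items` as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_monotone_submodular(g, ground, tol=1e-12):
    """Exhaustively test monotonicity and diminishing returns of g on `ground`."""
    ground = frozenset(ground)
    subsets = list(powerset(ground))
    for X in subsets:
        for Y in subsets:
            if not X <= Y:
                continue
            if g(Y) < g(X) - tol:          # monotonicity: g(Y) >= g(X)
                return False
            for x in ground - Y:
                gain_X = g(X | {x}) - g(X)
                gain_Y = g(Y | {x}) - g(Y)
                if gain_X < gain_Y - tol:  # diminishing returns
                    return False
    return True

# Hypothetical toy data: each item covers a few numbered elements.
COVERS = {"a": {1, 2}, "b": {2, 3}, "c": {3}}

def coverage(X):
    """Number of distinct elements covered by the items in X."""
    covered = set()
    for x in X:
        covered |= COVERS[x]
    return len(covered)

def squared_size(X):
    """|X|^2 is monotone but NOT submodular (its gains grow with |X|)."""
    return len(X) ** 2

print(is_monotone_submodular(coverage, COVERS))      # coverage passes
print(is_monotone_submodular(squared_size, COVERS))  # squared size fails
```

The coverage function passes because an item can only add elements that are not yet covered, while the squared set size fails because its marginal gain grows as the set grows.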
The maximization problem of a monotone submodular function under cardinality constraints is formulated as

$$ \text{Maximize } g(X) \quad \text{subject to } X \in 2^{\Omega} \text{ and } |X| \le K, $$

where g(X) is a monotone submodular function and K is the parameter that defines the maximum cardinality. This problem is known to be NP-hard, but a greedy algorithm can find an approximate solution whose score is guaranteed to be at least (1 − 1/e) times the optimal score (Nemhauser et al., 1978). Algorithm 1 shows a greedy approximation method that can solve the submodular function maximization problem under cardinality constraints. This algorithm first sets X ← ∅, and then repeatedly adds to X the item x* ∈ Ω \ X that maximizes g(X ∪ {x*}) − g(X), until |X| = K.

Algorithm 1 Greedy algorithm for maximizing a submodular function
Input: Base set Ω, cardinality K
Output: X ∈ 2^Ω satisfying |X| = K.
  X ← ∅
  while |X| < K do
    x* ← argmax over x ∈ Ω \ X of g(X ∪ {x}) − g(X)
    X ← X ∪ {x*}
  return X

Assuming that the evaluation of g(X) can be performed in constant time, the running time of the greedy algorithm is O(MK), since each selection of x* requires O(M) evaluations of g, and these evaluations are repeated K times. If we naively apply the algorithm to situations where M is very large, then the algorithm may not finish in reasonable running time. However, an accelerated greedy algorithm can work with large inputs (Minoux, 1978; Leskovec et al., 2007), since it can drastically reduce the number of function evaluations from MK. We applied the accelerated greedy algorithm in the following experiments, and found it could solve the problems in 24 hours. Moreover, further enhancement can be achieved by applying distributed algorithms (Mirzasoleiman et al., 2013) and stochastic greedy algorithms (Mirzasoleiman et al., 2015).
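The greedy procedure and its accelerated variant can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `greedy`, `lazy_greedy`, `COVERS` and `coverage` are hypothetical names, and the lazy variant follows the general idea attributed to Minoux (1978) of keeping stale marginal gains in a priority queue as upper bounds, which is valid because submodularity means a gain can only shrink as X grows.

```python
import heapq

def greedy(g, ground, K):
    """Plain greedy: re-evaluates the gain of every remaining item per step."""
    ground = list(ground)
    X, order = set(), []
    while len(order) < min(K, len(ground)):
        base = g(X)
        best = max((x for x in ground if x not in X),
                   key=lambda x: g(X | {x}) - base)
        X.add(best)
        order.append(best)
    return order

def lazy_greedy(g, ground, K):
    """Lazy greedy: re-evaluates an item only when it reaches the heap top."""
    X, order = set(), []
    base = g(X)
    # Heap entries: (-gain, tie_breaker, item, number of selections when computed).
    heap = [(-(g({x}) - base), i, x, 0) for i, x in enumerate(ground)]
    heapq.heapify(heap)
    while heap and len(order) < K:
        neg_gain, i, x, stamp = heapq.heappop(heap)
        if stamp == len(order):
            # Gain is up to date for the current X; every other entry is an
            # upper bound on its true gain, so x is a true maximizer.
            X.add(x)
            order.append(x)
            base = g(X)
        else:
            # Stale entry: recompute its gain against the current X and re-insert.
            heapq.heappush(heap, (-(g(X | {x}) - base), i, x, len(order)))
    return order

# Hypothetical toy coverage function used only to exercise the two routines.
COVERS = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}

def coverage(X):
    return len(set().union(*(COVERS[x] for x in X)))

print(greedy(coverage, COVERS, 2))
print(lazy_greedy(coverage, COVERS, 2))
```

Both routines select the same items on this toy input; the lazy version typically needs far fewer evaluations of g when M is large, because most stale gains never reach the top of the heap.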

Phrase Table Pruning
We first define some notations.
Let Ω = {x_1, ..., x_M} be a phrase table that has M phrase pairs. Each phrase pair x_i consists of a source language phrase p_i and a target language phrase q_i, and is written as

$$ x_i = \langle p_i, q_i \rangle = \langle (p_{i1}, \ldots, p_{i|p_i|}),\ (q_{i1}, \ldots, q_{i|q_i|}) \rangle, $$

where p_ij represents the j-th word of p_i and q_ij represents the j-th word of q_i. Let t_i be the i-th translation pair contained in the training corpus, namely t_i = ⟨f_i, e_i⟩, where f_i and e_i are source and target sentences, respectively. Let N be the number of translation pairs contained in the corpus. f_i and e_i are represented as sequences of words f_i = (f_i1, ..., f_i|f_i|) and e_i = (e_i1, ..., e_i|e_i|), where f_ij is the j-th word of sentence f_i and e_ij is the j-th word of sentence e_i.

Definition 1. Let x_j = ⟨p_j, q_j⟩ be a phrase pair and t_i = ⟨f_i, e_i⟩ be a translation pair. We say x_j appears in t_i if p_j is contained in f_i as a subsequence and q_j is contained in e_i as a subsequence. We say phrase pair x_j covers word f_ik if x_j appears in ⟨f_i, e_i⟩ and f_ik is contained in the subsequence that equals p_j. Similarly, we say x_j covers e_ik if x_j appears in ⟨f_i, e_i⟩ and e_ik is contained in the subsequence that equals q_j.
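Definition 1 can be sketched in code as below. This is an illustration under one explicit assumption: "contained as a subsequence" is read as a contiguous run of words, which is how phrases are usually extracted in PBMT; the text itself does not spell this out. The function names and the two example phrase pairs are hypothetical.

```python
def occurrences(phrase, sentence):
    """Start indices at which `phrase` occurs contiguously in `sentence`."""
    phrase, sentence = tuple(phrase), tuple(sentence)
    m = len(phrase)
    if m == 0:
        return []
    return [s for s in range(len(sentence) - m + 1)
            if sentence[s:s + m] == phrase]

def appears(pair, translation):
    """True if both sides of the phrase pair occur in the sentence pair."""
    (p, q), (f, e) = pair, translation
    return bool(occurrences(p, f)) and bool(occurrences(q, e))

def covered_positions(pair, translation):
    """Return (source, target) sets of 0-based word positions covered by `pair`.

    Both sets are empty when the pair does not appear in the translation pair.
    """
    (p, q), (f, e) = pair, translation
    if not appears(pair, translation):
        return set(), set()
    src = {s + j for s in occurrences(p, f) for j in range(len(p))}
    tgt = {t + j for t in occurrences(q, e) for j in range(len(q))}
    return src, tgt

# Hypothetical example data (positions are 0-based here, 1-based in the text).
t1 = (("das", "Haus", "ist", "klein"), ("this", "house", "is", "small"))
pair_a = (("das", "Haus"), ("this", "house"))
pair_b = (("Haus",), ("house",))
pair_c = (("Haus",), ("home",))   # source side matches, target side does not

print(covered_positions(pair_a, t1))
print(covered_positions(pair_b, t1))
print(appears(pair_c, t1))
```

Note that a pair whose source side matches but whose target side does not (like `pair_c`) covers nothing, since covering requires the pair to appear in the translation pair as a whole.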
Using the above definitions, we now describe our phrase table pruning algorithm, which formulates the task as a combinatorial optimization problem. Since phrase table pruning is the problem of finding a subset of Ω, we formulate it as a submodular function maximization problem under cardinality constraints, i.e., the problem of finding X ⊆ Ω that maximizes the objective function g(X) while satisfying the condition |X| = K, where K is the size of the pruned phrase table. If g(X) is a monotone submodular function, we can apply Algorithm 1 to obtain a (1 − 1/e)-approximate solution. We use the following objective function.
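
$$ g(X) = \sum_{i=1}^{N} \sum_{k=1}^{|f_i|} \log\left[ c(X, f_{ik}) + 1 \right] + \sum_{i=1}^{N} \sum_{k=1}^{|e_i|} \log\left[ c(X, e_{ik}) + 1 \right], $$

where c(X, f_ik) is the number of phrase pairs contained in X that cover f_ik, the k-th word of the i-th source sentence f_i. Similarly, c(X, e_ik) is the number of phrase pairs contained in X that cover e_ik.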
As an example, suppose that a corpus consists of a single pair of sentences f_1 = "das Haus ist klein" and e_1 = "this house is small", and that X contains two phrase pairs x_1 and x_2 that both appear in ⟨f_1, e_1⟩ and both cover the word f_12 = "Haus", with no other phrase pair in X covering it. Then c(X, f_12) = 2.
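The objective can be evaluated directly from the cover counts. The sketch below assumes that g(X) sums log[c(X, w) + 1] over every source and target word w of the corpus (the form used in the proof), and that phrases match contiguously. The two phrase pairs are hypothetical stand-ins for x_1 and x_2, and all function names are illustrative.

```python
import math
from collections import Counter

def covered_positions(phrase, sentence):
    """0-based positions of `sentence` lying inside a contiguous match of `phrase`."""
    phrase, sentence = tuple(phrase), tuple(sentence)
    m = len(phrase)
    if m == 0:
        return set()
    return {s + j
            for s in range(len(sentence) - m + 1)
            if sentence[s:s + m] == phrase
            for j in range(m)}

def cover_counts(X, corpus):
    """Map ('f' or 'e', sentence index, word index) to the count c(X, word)."""
    counts = Counter()
    for i, (f, e) in enumerate(corpus):
        for p, q in X:
            src, tgt = covered_positions(p, f), covered_positions(q, e)
            if src and tgt:  # the pair must appear on both sides
                for k in src:
                    counts[("f", i, k)] += 1
                for k in tgt:
                    counts[("e", i, k)] += 1
    return counts

def objective(X, corpus):
    """g(X): sum of log(c + 1); uncovered words contribute log(1) = 0."""
    return sum(math.log(c + 1) for c in cover_counts(X, corpus).values())

corpus = [(("das", "Haus", "ist", "klein"), ("this", "house", "is", "small"))]
x1 = (("das", "Haus"), ("this", "house"))  # hypothetical phrase pair
x2 = (("Haus",), ("house",))               # hypothetical phrase pair

counts = cover_counts([x1, x2], corpus)
print(counts[("f", 0, 1)])          # cover count of "Haus"
print(objective([x1, x2], corpus))  # equals 2 * log(6) for this toy input
```

With these two pairs, "Haus" and "house" are each covered twice and "das" and "this" once, so g(X) = 2(log 2 + log 3) = 2 log 6. Recomputing all counts from scratch like this is only meant to mirror the definition; it is far too slow for a real phrase table.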
Intuitively, this objective function gives high scores to X if its phrase pairs cover many words of the training corpus. Moreover, since we take the logarithm of the cover counts c(X, f_ik) and c(X, e_ik), each additional phrase pair that covers an already covered word contributes less, so g(X) becomes high when X covers many different words. The objective function therefore prefers a pruned phrase table X whose phrase pairs appear frequently in the training corpus but have low redundancy. We prove the submodularity of the objective function below.
Proof. Clearly, every c(X, f_ik) and c(X, e_ik) is a monotone function of X, and it satisfies the diminishing return property (with equality) since c(X ∪ {x}, f_ik) − c(X, f_ik) = c(Y ∪ {x}, f_ik) − c(Y, f_ik) for any X ⊆ Y and x ∈ Ω \ Y, and likewise for e_ik. If a function h(X) is monotone and submodular, then φ(h(X)) is also monotone and submodular for any non-decreasing concave function φ : R → R. Since the logarithm is non-decreasing and concave, every log[c(X, f_ik) + 1] and log[c(X, e_ik) + 1] is a monotone submodular function. Finally, if h_1, ..., h_n are monotone and submodular, then Σ_i h_i is also monotone and submodular. Thus g(X) is monotone and submodular.
Computation costs. If we know all counts c(X, f_ik) and c(X, e_ik) for all f_ik and e_ik, then g(X ∪ {x}) can be evaluated in time linear in the number of words contained in the training corpus. Thus the computation time of our algorithm does not grow exponentially with longer phrases.
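The diminishing return property can also be seen directly from the marginal gain. The following short derivation is a sketch that assumes g(X) is the sum of log[c(X, w) + 1] over all corpus words w; the symbol W(x), denoting the set of corpus words covered by phrase pair x, is notation introduced here for illustration only.

```latex
% Marginal gain of adding a phrase pair x (not in X) to X.
% Words outside W(x) keep their counts and contribute nothing.
\begin{align*}
g(X \cup \{x\}) - g(X)
  &= \sum_{w \in W(x)} \Bigl( \log\bigl[c(X, w) + 2\bigr] - \log\bigl[c(X, w) + 1\bigr] \Bigr) \\
  &= \sum_{w \in W(x)} \log\!\left( 1 + \frac{1}{c(X, w) + 1} \right).
\end{align*}
% Each term is positive and decreases as c(X, w) grows.
% For X \subseteq Y we have c(X, w) \le c(Y, w), hence
% g(X \cup \{x\}) - g(X) \ge g(Y \cup \{x\}) - g(Y).
```

Every term is positive, which gives monotonicity, and shrinks as the cover count of w grows, which gives submodularity.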
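One way to exploit known cover counts is to keep them in a table and compute only the marginal gain of a candidate, touching just the words that the candidate covers. The sketch below is an assumption about how this bookkeeping might look, not the authors' implementation; `marginal_gain`, `select` and the word keys are hypothetical, and the covered word sets are assumed to be precomputed per phrase pair.

```python
import math
from collections import Counter

def marginal_gain(counts, covered):
    """g(X ∪ {x}) − g(X), given the current counts c(X, ·) and the words x covers.

    Each covered word w moves from log(c + 1) to log(c + 2); other words are unchanged.
    """
    return sum(math.log(counts[w] + 2) - math.log(counts[w] + 1) for w in covered)

def select(counts, covered):
    """Update the counts in place after adding the phrase pair to X."""
    for w in covered:
        counts[w] += 1

# Hypothetical precomputed cover sets; keys are (side, sentence index, word index).
cov_x1 = {("f", 0, 0), ("f", 0, 1), ("e", 0, 0), ("e", 0, 1)}
cov_x2 = {("f", 0, 1), ("e", 0, 1)}

counts = Counter()                       # X is empty, so all counts are zero
gain_1 = marginal_gain(counts, cov_x1)   # 4 * log(2)
select(counts, cov_x1)
gain_2 = marginal_gain(counts, cov_x2)   # 2 * log(3/2): the words are already covered once
select(counts, cov_x2)
print(gain_1, gain_2, gain_1 + gain_2)   # the gains add up to g({x1, x2}) = 2 * log(6)
```

The cost of one gain evaluation depends on how many corpus words the candidate covers, which is bounded by the corpus size and does not involve any search over phrase segmentations.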

Settings
We conducted experiments on the Chinese-English and Arabic-English datasets used in NIST OpenMT 2012. In each experiment, English was set as the target language. We used Moses (Koehn et al., 2007) as the phrase-based machine translation system. We used a 5-gram Kneser-Ney language model trained separately on the English GigaWord V5 corpus (LDC2011T07), a monolingual corpus distributed at WMT 2012, and Google Web 1T 5-gram data (LDC2006T13). Word alignments were obtained by running GIZA++ (Och and Ney, 2003), which is included in the Moses system. As the test data, we used 1378 segments for the Arabic-English dataset and 2190 segments for the Chinese-English dataset, where all test segments have 4 references (LDC2013T07, LDC2013T03). The tuning set consists of about 5000 segments gathered from the MT02 to MT06 evaluation sets (LDC2010T10, LDC2010T11, LDC2010T12, LDC2010T14, LDC2010T17). We set the maximum length of extracted phrases to 7. Table 1 shows the sizes of the phrase tables. Following the settings used in (Zens et al., 2012), we reduced the effects of other components by applying to all pruned tables the same feature weights, which were obtained by running the MERT training algorithm (Och, 2003) on the full-size phrase tables and the tuning data. We ran MERT 10 times to obtain 10 different sets of feature weights. The BLEU scores reported in the following experiments are the averages of the results obtained by using these different feature weights.
We adopt the entropy-based pruning method used in (Ling et al., 2012; Zens et al., 2012) as the baseline method, since it shows the best BLEU scores according to (Zens et al., 2012). We used the parameter value of the entropy-based method suggested in (Zens et al., 2012). We also compared with the significance-based method (Johnson et al., 2007), which uses Fisher's exact test to calculate significance scores of phrase pairs and prunes less-significant phrase pairs.

Language Pair      Number of phrase pairs
Arabic-English     234M
Chinese-English    169M

Table 1: Phrase table sizes.

Results
Figure 1 and Figure 2 show the BLEU scores of pruned tables. The horizontal axis is the number of phrase pairs contained in a table, and the vertical axis is the BLEU score. The values in the figures are the differences in BLEU score between the proposed method and whichever baseline method scores higher. In the experiment with the Arabic-English dataset, both methods can remove 80% of phrase pairs without losing 1 BLEU point, and the proposed method shows better performance than the baseline methods for all

Related Work
Previous phrase table pruning methods fall into two groups. Self-contained methods only use resources already used in the MT system, e.g., the training corpus and phrase tables. Entropy-based methods (Ling et al., 2012; Zens et al., 2012), a significance-based method (Johnson et al., 2007), and our method are self-contained methods. Non-self-contained methods exploit usage statistics for phrase pairs (Eck et al., 2007) and additional bilingual corpora (Chen et al., 2009). Since self-contained methods require no additional resources, they are easy to apply to existing MT systems. The effectiveness of the submodular function maximization formulation has been confirmed in various NLP applications, including text summarization (Lin and Bilmes, 2010; Lin and Bilmes, 2011) and training data selection for machine translation (Kirchhoff and Bilmes, 2014). These methods are used for selecting a subset that contains important items but not redundant items. This paper can be seen as applying the subset selection formulation to the phrase table pruning problem.

Conclusion
We have introduced a method that solves the phrase table pruning problem as a submodular function maximization problem under cardinality constraints. Finding an optimal solution of the problem is NP-hard, so we apply a scalable greedy heuristic to find (1 − 1/e)-approximate solutions. Experiments showed that our greedy algorithm, which uses a relatively simple objective function, can achieve better performance than state-of-the-art pruning methods.
Our proposed method can be easily extended by using other types of submodular functions. The objective function used in this paper is a simple one, but it could be enhanced by adding metrics used in existing phrase table pruning techniques, such as Fisher's exact test scores and entropy scores. Testing such enhancements of the objective function is an important future task.