A Comparative Study on Vocabulary Reduction for Phrase Table Smoothing

This work systematically analyzes the smoothing effect of vocabulary reduction for phrase translation models. We extensively compare various word-level vocabularies to show that the performance of smoothing is not significantly affected by the choice of vocabulary. This result provides empirical evidence that the standard phrase translation model is extremely sparse. Our experiments also reveal that vocabulary reduction is more effective for smoothing large-scale phrase tables.


Introduction
Phrase-based systems for statistical machine translation (SMT) (Zens et al., 2002; Koehn et al., 2003) have shown state-of-the-art performance over the last decade. However, due to the huge size of the phrase vocabulary, it is difficult to collect robust statistics for many phrase pairs. The standard phrase translation model thus tends to be sparse (Koehn, 2010).
A fundamental solution to sparsity problems in natural language processing is to reduce the vocabulary size. By mapping words onto a smaller label space, the models can be trained to have denser distributions (Brown et al., 1992; Miller et al., 2004; Koo et al., 2008). Examples of such labels are part-of-speech (POS) tags or lemmas.
In this work, we investigate vocabulary reduction for phrase translation models with respect to various vocabulary choices. We evaluate two types of smoothing models for the phrase translation probability using different kinds of word-level labels. In particular, we use automatically generated word classes (Brown et al., 1992) to obtain label vocabularies of arbitrary sizes and structures. Our experiments reveal that the vocabulary of the smoothing model has no significant effect on the end-to-end translation quality. For example, even a randomized label space leads to a decent improvement in BLEU and TER scores with the presented smoothing models.
We also test vocabulary reduction in translation scenarios of different scales, showing that the smoothing works better with more parallel corpora.
Related Work

Koehn and Hoang (2007) propose integrating a label vocabulary as a factor into the phrase-based SMT pipeline, which consists of the following three steps: mapping from words to labels, label-to-label translation, and generation of words from labels. Rishøj and Søgaard (2011) verify the effectiveness of word classes as factors. Assuming probabilistic mappings between words and labels, the factorization implies a combinatorial expansion of the phrase table with regard to different vocabularies. Wuebker et al. (2013) show a simplified case of factored translation by adopting a hard assignment from words to labels. In the end, they train the existing translation, language, and reordering models on word classes to build the corresponding smoothing models.
Other types of features are also trained on word-level labels, e.g. hierarchical reordering features (Cherry, 2013), an n-gram-based translation model (Durrani et al., 2014), and sparse word pair features (Haddow et al., 2015). The first and the third are trained with a large-scale discriminative training algorithm.
For all usages of word-level labels in SMT, a common and important question is which label vocabulary maximizes the translation quality. Bisazza and Monz (2014) compare class-based language models with diverse kinds of labels in terms of their performance in translation into morphologically rich languages. To the best of our knowledge, there is no published work on a systematic comparison between different label vocabularies, model forms, and training data sizes for smoothing phrase translation models, the most basic component in state-of-the-art SMT systems. Our work fulfills these needs with extensive translation experiments (Section 5) and quantitative analysis (Section 6) in a standard phrase-based SMT framework.

Word Classes
In this work, we mainly use unsupervised word classes by Brown et al. (1992) as the reduced vocabulary. This section briefly reviews the principle and properties of word classes. A word-class mapping c is estimated by a clustering algorithm that maximizes the following objective (Brown et al., 1992):

ĉ = argmax_c Σ_{i=1}^{I} [ log p(c(e_i) | c(e_{i-1})) + log p(e_i | c(e_i)) ]    (1)

for a given monolingual corpus {e_1^I}, where each e_1^I is a sentence of length I in the corpus and the sum is taken over all sentences. The objective guides c to prefer certain collocations of class sequences, e.g. an auxiliary verb class should succeed a class of pronouns or person names. Consequently, the resulting c groups words according to their syntactic or semantic similarity.
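For a fixed class mapping, this class-bigram objective can be evaluated directly. The following is a minimal sketch with maximum-likelihood (relative-frequency) estimates on a toy corpus; the function and variable names are our own, not from the paper:

```python
import math
from collections import Counter

def class_bigram_log_likelihood(corpus, c):
    """Sum over the corpus of log p(c(e_i)|c(e_{i-1})) + log p(e_i|c(e_i)),
    with all probabilities estimated by relative frequencies."""
    class_bigrams, class_histories, word_class = Counter(), Counter(), Counter()
    for sentence in corpus:
        prev = "<s>"  # sentence-boundary class
        for word in sentence:
            cls = c[word]
            class_bigrams[(prev, cls)] += 1
            class_histories[prev] += 1
            word_class[(word, cls)] += 1
            prev = cls
    emit_totals = Counter()  # denominators of p(e | c(e))
    for (word, cls), n in word_class.items():
        emit_totals[cls] += n
    ll = 0.0
    for sentence in corpus:
        prev = "<s>"
        for word in sentence:
            cls = c[word]
            ll += math.log(class_bigrams[(prev, cls)] / class_histories[prev])
            ll += math.log(word_class[(word, cls)] / emit_totals[cls])
            prev = cls
    return ll
```

A real clustering algorithm would search over mappings c to maximize this quantity; here only the evaluation for a given c is shown.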
Word classes have a big advantage for our comparative study: the structure and size of the class vocabulary can be arbitrarily adjusted by the clustering parameters. This makes it possible to easily prepare an abundant set of label vocabularies that differ in linguistic coherence and degree of generalization.

Smoothing Models
In the standard phrase translation model, the translation probability for each segmented phrase pair (f̃, ẽ) is estimated by relative frequencies:

p(f̃ | ẽ) = N(f̃, ẽ) / N(ẽ)    (2)

where N is the count of a phrase or a phrase pair in the training data. These counts are very low for many phrases due to the limited amount of bilingual training data. Using a smaller vocabulary, we can aggregate the low counts and make the distribution smoother. We now define two types of smoothing models for Equation 2 using a general word-label mapping c.
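As an illustration (not the authors' code), relative-frequency estimation from phrase pair counts can be sketched as:

```python
from collections import Counter

def phrase_translation_probs(phrase_pair_counts):
    """p(f|e) = N(f, e) / N(e), estimated by relative frequencies."""
    target_counts = Counter()
    for (f, e), n in phrase_pair_counts.items():
        target_counts[e] += n  # N(e) = sum over all f of N(f, e)
    return {(f, e): n / target_counts[e]
            for (f, e), n in phrase_pair_counts.items()}
```

For example, with counts {("das Haus", "the house"): 3, ("ein Haus", "the house"): 1}, the model assigns 0.75 and 0.25 to the two source phrases given "the house". A phrase pair seen only once receives an unreliably sharp estimate, which is exactly the sparsity problem the smoothing models address.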

Mapping All Words at Once (map-all)
For the phrase translation model, the simplest formulation of vocabulary reduction is obtained by replacing all words in the source and target phrases with the corresponding labels in a smaller space. Namely, we employ the following probability instead of Equation 2:

p_all(f̃ | ẽ) = N(c(f̃), c(ẽ)) / N(c(ẽ))    (3)

which we call map-all. Here, c(f̃) denotes the label sequence obtained by applying c to each word of f̃. This model resembles the word class translation model of Wuebker et al. (2013) except that we allow any kind of word-level labels. This model generalizes all words of a phrase without distinction between them. Also, the same formulation is applied to word-based lexicon models.
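A minimal sketch of map-all, assuming hard word-to-label assignments (all names are illustrative):

```python
from collections import Counter

def map_phrase(phrase, c):
    # Replace every word of the phrase by its label (hard assignment).
    return tuple(c[w] for w in phrase)

def map_all_probs(phrase_pair_counts, c_src, c_tgt):
    """p_all(f|e) = N(c(f), c(e)) / N(c(e)): counts aggregated on label phrases."""
    pair_counts, tgt_counts = Counter(), Counter()
    for (f, e), n in phrase_pair_counts.items():
        cf, ce = map_phrase(f, c_src), map_phrase(e, c_tgt)
        pair_counts[(cf, ce)] += n
        tgt_counts[ce] += n
    return {pair: n / tgt_counts[pair[1]] for pair, n in pair_counts.items()}
```

Phrase pairs whose words share classes now pool their counts, so rare pairs inherit statistics from their more frequent class-level analogues.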

Mapping Each Word at a Time (map-each)
More elaborate smoothing can be achieved by generalizing only a sub-part of the phrase pair. The idea is to replace one source word at a time with its respective label. For each source position j, we also replace the target words aligned to the source word f_j. For this purpose, we let a_j ⊆ {1, ..., |ẽ|} denote the set of target positions aligned to j. The resulting model takes a weighted average of the redefined translation probabilities over all source positions of f̃:

p_each(f̃ | ẽ) = Σ_{j=1}^{|f̃|} w_j · N(c^{(j)}(f̃), c^{(a_j)}(ẽ)) / N(c^{(a_j)}(ẽ))    (4)

where the superscripts of c indicate the positions that are mapped onto the label space, and w_j is a weight for each source position with Σ_j w_j = 1. We call this model map-each.

We illustrate this model with a pair of three-word phrases: f̃ = [f_1, f_2, f_3] and ẽ = [e_1, e_2, e_3], where f_1 is aligned to e_1, f_2 is unaligned, and f_3 is aligned to both e_2 and e_3 (see Figure 1 for the in-phrase word alignments). The map-each model score for this phrase pair is:

p_each(f̃ | ẽ) = w_1 · N(c^{(1)}(f̃), c^{(1)}(ẽ)) / N(c^{(1)}(ẽ))
              + w_2 · N(c^{(2)}(f̃), ẽ) / N(ẽ)
              + w_3 · N(c^{(3)}(f̃), c^{(2,3)}(ẽ)) / N(c^{(2,3)}(ẽ))

Figure 1: Word alignments of a pair of three-word phrases, where the alignments are depicted by line segments.
First of all, we replace f_1 and also e_1, which is aligned to f_1, with their corresponding labels. As f_2 has no alignment points, we do not replace any target word accordingly. f_3 triggers the class replacement of two target words at the same time. Note that the model implicitly encapsulates the alignment information.
We empirically found that the map-each model performs best with the following weight:

w_j = N(c^{(j)}(f̃), c^{(a_j)}(ẽ)) / Σ_{j'=1}^{|f̃|} N(c^{(j')}(f̃), c^{(a_j')}(ẽ))    (5)

which is the normalized count of the generalized phrase pair itself. Here, the count is relatively large when f_j, the word to be backed off, is less frequent than the other words in f̃. In contrast, if f_j is a very frequent word and one of the other words in f̃ is rare, the count becomes low due to that rare word. The same logic holds for the target words in ẽ. Consequently, Equation 5 puts more weight on positions where a rare word is replaced with its label. The intuition is that rare words are the main cause of unstable counts and should be backed off above all. We use this weight for all experiments in the next section.
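The map-each probability with these count-based weights can be sketched as follows; the counts over generalized phrase pairs are supplied as toy dictionaries, and all names are illustrative rather than the authors' implementation:

```python
def map_each_score(f, e, alignment, c_src, c_tgt, pair_counts, tgt_counts):
    """Weighted average over source positions j: map f_j and its aligned
    target words to labels, then look up generalized counts.
    `alignment` maps a source position to its aligned target positions."""
    terms = []
    for j, f_word in enumerate(f):
        cf = list(f)
        cf[j] = c_src[f_word]            # c^(j): map source word j
        ce = list(e)
        for i in alignment.get(j, ()):   # c^(a_j): map aligned target words
            ce[i] = c_tgt[e[i]]
        n_pair = pair_counts.get((tuple(cf), tuple(ce)), 0)
        n_tgt = tgt_counts.get(tuple(ce), 0)
        terms.append((n_pair, n_pair / n_tgt if n_tgt else 0.0))
    total = sum(n for n, _ in terms)     # normalizer of the weights w_j
    return sum((n / total) * p for n, p in terms) if total else 0.0
```

Positions whose generalized pair has a high count dominate the average, which is exactly the back-off behavior described above.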
In contrast, the map-all model merely replaces all words at once and ignores word alignments within phrase pairs.

Setup
We evaluate how much the translation quality is improved by the smoothing models of Section 4. The two smoothing models are trained in both source-to-target and target-to-source directions, and integrated as additional features in the log-linear combination of a standard phrase-based SMT system (Koehn et al., 2003). We also tested linear interpolation between the standard and smoothing models, but the results are generally worse than with log-linear interpolation. Note that the vocabulary reduction models cannot by themselves replace the corresponding standard models, since this leads to a considerable drop in translation quality (Wuebker et al., 2013).
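The log-linear integration can be sketched as follows; this is an illustration of the standard framework, not the toolkit's actual implementation, and the feature names are hypothetical:

```python
import math

def log_linear_score(feature_values, weights):
    """Hypothesis score as a weighted sum of log feature values.
    The smoothing models simply enter as additional features
    with their own tuned weights."""
    return sum(weights[name] * math.log(feature_values[name])
               for name in weights)
```

For instance, scoring a hypothesis with a standard phrase probability feature "p_phrase" and a map-all smoothing feature "p_all" just adds one more weighted log term to the sum.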
Our baseline systems include phrase translation models in both directions, word-based lexicon models in both directions, word/phrase penalties, a distortion penalty, a hierarchical lexicalized reordering model (Galley and Manning, 2008), a 4-gram language model, and a 7-gram word class language model (Wuebker et al., 2013). The model weights are trained with minimum error rate training (Och, 2003). All experiments are conducted with an open source phrase-based SMT toolkit Jane 2 (Wuebker et al., 2012).
To validate our experimental results, we measure the statistical significance using the paired bootstrap resampling method of Koehn (2004). Every result in this section is marked with ‡ if it is statistically significantly better than the baseline with 95% confidence, or with † for 90% confidence.

Comparison of Vocabularies
The presented smoothing models are dependent on the label vocabulary, which is defined by the word-label mapping c. Here, we train the models with various label vocabularies and compare their smoothing performance.
The experiments are done on the IWSLT 2012 German→English shared translation task. To rapidly perform repetitive experiments, we train the translation models on the in-domain TED portion of the dataset (roughly 2.5M running words on each side). We run the monolingual word clustering algorithm of Botros et al. (2015) on each side of the parallel training data to obtain class label vocabularies (Section 3).
We carry out comparative experiments regarding three factors of the clustering algorithm:

1) Clustering iterations. It has been shown that the number of iterations is the most influential factor in clustering quality (Och, 1995). We now verify its effect on translation quality when the clustering is used for phrase table smoothing.
As we run the clustering algorithm, we extract an intermediate class mapping after each iteration and train the smoothing models with it. The model weights are tuned for each iteration separately. The BLEU scores of the tuned systems are given in Figure 2. We use 100 classes on both the source and target sides. The score does not consistently increase or decrease over the iterations; it rather stays on a similar level (±0.2% BLEU) for all settings, with slight fluctuations. This is an important clue that the iterative optimization of word clustering contributes little to smoothing phrase translation models.
To see this more clearly, we keep the model weights fixed over different systems and run the same set of experiments. In this way, we focus only on the change of label vocabulary, removing the impact of nondeterministic model weight optimization. The results are given in Figure 3.
This time, the curves are even flatter, resulting in only ±0.1% BLEU difference over the iterations. More surprisingly, the models trained with the initial clustering, i.e. before the clustering algorithm has even started, are on a par with those trained with more optimized classes in terms of translation quality.

2) Initialization of the clustering. Since the clustering process has no significant impact on the translation quality, we hypothesize that the initialization may dominate the clustering. We compare five different initial class mappings:

• random: randomly assign words to classes
• top-frequent (default): top-frequent words have their own classes, while all other words are in the last class
• same-countsum: each class has almost the same sum of word unigram counts
• same-#words: each class has almost the same number of words
• count-bins: each class represents a bin of the total count range

Table 1 shows the translation results with the map-each model trained with these initializations, without running the clustering algorithm. We use the same set of model weights as in Figure 3. We find that the initialization method also does not affect the translation performance. As an extreme case, random clustering is also a fine candidate for training the map-each model.
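Two of these initializations can be sketched as follows (our own illustrative code, assuming hard word-to-class assignments):

```python
import random
from collections import Counter

def init_random(vocab, num_classes, seed=0):
    # random: assign each word to a uniformly drawn class
    rng = random.Random(seed)
    return {w: rng.randrange(num_classes) for w in vocab}

def init_top_frequent(unigram_counts, num_classes):
    # top-frequent: the (num_classes - 1) most frequent words get
    # singleton classes; all remaining words share the last class
    ranked = [w for w, _ in unigram_counts.most_common()]
    mapping = {w: i for i, w in enumerate(ranked[:num_classes - 1])}
    for w in ranked[num_classes - 1:]:
        mapping[w] = num_classes - 1
    return mapping
```

Either mapping can be fed directly to the smoothing models without any clustering iterations, which is the setting compared in Table 1.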

3) Number of classes.
This determines the vocabulary size of the label space, which eventually adjusts the degree of smoothing. Table 2 shows the translation performance of the map-each model with a varying number of classes. As before, there is no serious performance gap among different word classes, and POS tags and lemmas also conform to this trend. However, we observe a slight but steady degradation of translation quality (≈ -0.2% BLEU) when the vocabulary size is larger than a few hundred. We also lose statistical significance for BLEU in these cases. The reason could be: if the label space becomes larger, it gets closer to the original vocabulary, and therefore the smoothing model provides less additional information to add to the standard phrase translation model.

This series of experiments shows that the map-each model performs very similarly across vocabulary sizes and structures. From our internal experiments, this argument also holds for the map-all model. The results do not change even when we use a different clustering algorithm, e.g. bilingual clustering (Och, 1999). For the translation performance, the more important factor is the log-linear model training that finds an optimal set of weights for the smoothing models.

Comparison of Smoothing Models
Next, we compare the two smoothing models by their performance on four different translation tasks: IWSLT 2012 German→English, WMT 2015 Finnish→English, WMT 2014 English→German, and WMT 2015 English→Czech. We train 100 classes on each side with 30 clustering iterations, starting from the default (top-frequent) initialization. Table 3 provides the corpus statistics of all datasets used. Note that a morphologically rich language is on the source side for the first two tasks, and on the target side for the last two. According to the results (Table 4), the map-each model, which encourages backing off infrequent words, performs consistently better (up to +0.5% BLEU, -0.6% TER) than the map-all model in all cases.

Comparison of Training Data Size
Lastly, we analyze the smoothing performance for different training data sizes (Figure 4). The improvement in BLEU over the baseline decreases drastically as the training data get smaller. We argue that this is because the smoothing models only provide additional scores for phrases seen in the training data. With smaller training data, there are more out-of-vocabulary (OOV) words in the test set, which cannot be handled by the presented models.

Analysis
In Section 5.2, we have shown experimentally that more optimized or more fine-grained classes do not guarantee better smoothing performance. We now verify, by examining translation outputs, that the same level of performance is not by chance but due to similar hypothesis scoring across the different systems. Given a test set, we compare its translations generated from different systems as follows. First, for each translated set, we sort the sentences by how much the sentence-level TER is improved over the baseline translation. Then, we select the top 200 sentences from this sorted list, which represent the main contribution to the decrease of TER. In Table 5, we compare the top 200 TER-improved translations of the map-each model setups with different vocabularies.

Table 5: Comparison of translation outputs for the smoothing models with different vocabularies. "optimized" denotes 30 iterations of the clustering algorithm, whereas "non-optimized" means the initial (default) clustering.
In the fourth column, we trace the input sentences that are translated by the top-200 lists, and count how many of those inputs overlap across the given systems. Here, a large overlap indicates that two systems are particularly effective on a large common part of the test set, showing that they behaved analogously in the search process. The numbers in this column are computed against the map-each model setup trained with 100 optimized word classes (first row). For all map-each settings, the overlap is very large, around 90%.
To investigate further, we count how often the two translations of a single input are identical (the last column). This is normalized by the number of common input sentences in the top 200 lists between two systems. It is a straightforward measure to see if two systems discriminate translation hypotheses in a similar manner. Remarkably, all systems equipped with the map-each model produce exactly the same translations for the most part of the top 200 TER-improved sentences.
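The two measures can be sketched as follows; this is illustrative code under the assumption that sentence-level TER improvements over the baseline have been precomputed for each system:

```python
def top_improved(ter_improvements, k=200):
    """Indices of the k test sentences with the largest TER improvement
    over the baseline (ter_improvements[i] = baseline_TER_i - system_TER_i)."""
    ranked = sorted(range(len(ter_improvements)),
                    key=lambda i: ter_improvements[i], reverse=True)
    return set(ranked[:k])

def overlap_and_identical(top_a, top_b, hyps_a, hyps_b):
    """Fraction of shared inputs in the two top-k lists, and the fraction
    of those shared inputs with identical output translations."""
    common = top_a & top_b
    overlap = len(common) / len(top_a)
    identical = (sum(hyps_a[i] == hyps_b[i] for i in common) / len(common)
                 if common else 0.0)
    return overlap, identical
```

A high overlap with a high identical-translation ratio indicates that the two systems discriminate hypotheses in essentially the same way.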
We can see from this analysis that, even though a smoothing model is trained with essentially different vocabularies, it helps the translation process in basically the same manner. For comparison, we also compute the measures for a map-all model; they are far behind the high similarity among the map-each models. Indeed, for smoothing phrase translation models, changing the model structure for vocabulary reduction exerts a strong influence on the hypothesis scoring, yet changing the vocabulary does not.

Conclusion
Reducing the vocabulary via a word-label mapping is a simple and effective way of smoothing phrase translation models. By mapping one word of a phrase at a time, the translation quality can be improved by up to +0.7% BLEU and -0.8% TER over a standard phrase-based SMT baseline, which is superior to Wuebker et al. (2013).
Our extensive comparison among various vocabularies shows that different word-label mappings are almost equally effective for smoothing phrase translation models. This allows us to use any type of word-level label, e.g. a randomized vocabulary, for the smoothing, which saves a considerable amount of effort in optimizing the structure and granularity of the label vocabulary. Our analysis on sentence-level TER demonstrates that the same level of performance stems from the analogous hypothesis scoring.
We claim that this result emphasizes the fundamental sparsity of the standard phrase translation model. Too many target phrase candidates are originally undervalued, so giving them any reasonable amount of extra probability mass, e.g. by smoothing with random classes, is enough to broaden the search space and improve translation quality. Even if we change a single parameter in estimating the label space, it does not have a significant effect on scoring hypotheses, where many models other than the smoothed translation model, e.g. language models, are involved with large weights. Nevertheless, an exact linguistic explanation is still to be discovered.
Our results on varying training data sizes show that vocabulary reduction is more suitable for large-scale translation setups. This implies that OOV handling is more crucial than smoothing phrase translation models for low-resource translation tasks.
For future work, we plan to perform a similar set of comparative experiments on neural machine translation systems.