Learning Semantic Representations for Nonterminals in Hierarchical Phrase-Based Translation

In hierarchical phrase-based translation, coarse-grained nonterminal Xs may generate inappropriate translations due to the lack of sufficient information for phrasal substitution. In this paper we propose a framework to refine nonterminals in hierarchical translation rules with real-valued semantic representations. The semantic representations are learned via a weighted mean value and a minimum distance method using phrase vector representations obtained from a large-scale monolingual corpus. Based on the learned semantic vectors, we build a semantic nonterminal refinement model to measure semantic similarities between phrasal substitutions and nonterminal Xs in translation rules. Experiment results on Chinese-English translation show that the proposed model significantly improves translation quality on NIST test sets.


Introduction
Hierarchical phrase-based translation (Chiang, 2007) explores formal synchronous context-free grammar (SCFG) rules for translation. Two types of nonterminal symbols are used in translation rules: nonterminal X in ordinary SCFG rules and nonterminal S in glue rules, which are specially introduced to concatenate nonterminal Xs in a monotonic manner. Because the same generic symbol X is used for all ordinary nonterminals, it is difficult to distinguish and select proper translation rules.
In order to address this issue, researchers either use syntactic labels to annotate nonterminal Xs (Zollmann and Venugopal, 2006; Zollmann and Vogel, 2011; Li et al., 2012; Hanneman and Lavie, 2013), or employ syntactic information from parse trees to refine nonterminals with real-valued vectors (Venugopal et al., 2009; Huang et al., 2013). In addition to syntactic knowledge, semantic structures are also leveraged to refine nonterminals (Gao and Vogel, 2011). All these efforts focus on incorporating linguistic knowledge into hierarchical translation rules.
Unfortunately, syntactic or semantic parsers for many languages are not accessible due to the lack of labeled training data. In contrast, a large amount of unlabeled data are easily available. Therefore, can we mine syntactic or semantic properties for nonterminals from unlabeled data? Or can we exploit these data to refine nonterminals for SMT?
Learning semantic representations for terminals (words, multi-word phrases or sentences) from unlabeled data has achieved substantial progress in recent years (Mitchell and Lapata, 2008; Turian et al., 2010; Socher et al., 2010; Mikolov et al., 2013c; Blunsom et al., 2014). These representations have been used successfully in various NLP tasks. However, there has been no attempt to learn semantic representations for nonterminals from unlabeled data. In this paper we propose a framework to learn semantic representations for nonterminal Xs in translation rules. Our framework is built on real-valued vector representations learned for multi-word phrases, which are replaced by nonterminal Xs during hierarchical rule extraction. We propose a weighted mean value method and a minimum distance method to obtain nonterminal representations from the representations of their phrasal substitutions. We further build a semantic nonterminal refinement model that uses the semantic representations of nonterminals to compute similarities between phrasal substitutions and nonterminals. In doing so, we want to enhance phrasal substitution and translation rule selection during decoding.
The big challenge here is that thousands of target phrasal substitutions will be generated for one single nonterminal during decoding. Computing vector representations for all these phrases will be very time-consuming. We therefore introduce two different methods to handle it. In the first method, we project representations of source phrases onto their target counterparts linearly or nonlinearly via a neural network. These projected vectors are used as approximations to real target representations to compute semantic similarities. In the second method, we decode sentences in two passes. The first pass collects target phrase candidates from n-best translations of sentences generated by the baseline. The second pass calculates vector representations of these collected target phrases and then computes similarities between them and target-side nonterminals.
Our contributions are two-fold. First, we learn semantic representations for nonterminals from their phrasal substitutions with two different methods. To the best of our knowledge, this is the first attempt to induce semantic representations for nonterminals from unlabeled data in the context of SMT. Second, we successfully address the issue of time-consuming target-side phrase-nonterminal similarity computation mentioned above. We incorporate the source-side and target-side semantic nonterminal refinement models, as well as their combination, into the translation system based on the learned nonterminal representations. Experiment results show that our method can achieve an improvement of 1.16 BLEU points over the baseline system on NIST MT evaluation test sets.
The rest of this paper is organized as follows. Section 2 briefly reviews related work. Section 3 presents our approach of learning semantic vectors for nonterminals, followed by Section 4 describing the details of our semantic nonterminal refinement model. Section 5 introduces the integration of the proposed model into SMT. Experiment results are reported in Section 6. Finally, we conclude our work in Section 7.

Related Work
A variety of approaches have been explored for nonterminal refinement in hierarchical phrase-based translation. These approaches can be categorized into two groups: 1) augmenting the nonterminal symbol X with informative labels, and 2) attaching distributional linguistic knowledge to each nonterminal in hierarchical rules. The former only allows substitution operations with matched labels. The latter normally builds an additional model as a new feature of the log-linear model to incorporate attached knowledge.
Among approaches which directly refine the single label to more fine-grained labels, syntactic and semantic knowledge are explored in various ways. The syntactically augmented translation model (SAMT) proposed by Zollmann and Venugopal (2006) uses syntactic categories extracted from target-side parse trees to augment nonterminals in hierarchical rules. Unfortunately, there is a data sparseness problem in this model due to thousands of extracted syntactic categories. One solution to address this issue is to reduce the number of syntactic categories. Zollmann and Vogel (2011) use word tags, generated by either a POS tagger or unsupervised word class induction, instead of syntactic categories. Hanneman and Lavie (2013) coarsen the label set by introducing a label collapsing algorithm to SAMT grammars (Zollmann and Venugopal, 2006). Yet another solution is easing restrictions on label matching. Shen et al. (2009) penalize substitution with unmatched labels while Chiang (2010) uses soft match features to model substitutions with various labels. Similar to Zollmann and Venugopal (2006), Hoang and Koehn (2010) decorate some hierarchical rules with source-side syntax information and use undecorated, decorated, and partially decorated rules in their translation model. Mylonakis and Sima'an (2011) employ source-side syntax-based labels to define a joint probability synchronous grammar. Combinatory Categorial Grammar (CCG) labels or CCG contextual labels are also used to enrich nonterminals (Almaghout et al., 2011; Weese et al., 2012). Li et al. (2012) incorporate head information extracted from source-side dependency structures into translation rules. Besides, semantic knowledge is also used to refine nonterminals. Gao and Vogel (2011) utilize target-side semantic roles to form SRL-aware SCFG rules. Most of the approaches introduced here explicitly require syntactic or semantic parsers trained on manually labeled data.
On the other hand, efforts have also been directed towards attaching distributional linguistic knowledge to nonterminals. Venugopal et al. (2009) propose a preference grammar to annotate nonterminals based on preference distributions of syntactic categories. Huang et al. (2010) learn latent syntactic distributions for each nonterminal. They use these distributions to decorate nonterminal Xs in SCFG rules with real-valued feature vectors and utilize these vectors to measure the similarities between source phrases and applied rules. Similar to this work, Huang et al. (2013) utilize treebank tags based on dependency parsing to learn latent distributions. Cao et al. (2014) attach translation rules with dependency knowledge, which contains both dependency relations inside rules and dependency relations between rules and their contexts.
The difference of our work from these studies is that our semantic representations are learned from unlabeled bilingual (or monolingual) data and do not depend on any linguistic resources, e.g., parsers. We also believe that our model is able to exploit both syntactic and semantic information for nonterminals since vector representations learned in our way are able to capture both syntactic and semantic properties (Turian et al., 2010;Socher et al., 2010).

Learning Semantic Representations for Nonterminals
In our framework, semantic representations for nonterminal Xs are automatically induced from a word-aligned parallel corpus. In this section, we detail the essential components of our approach, i.e., how to learn semantic vectors for nonterminals and how to project source semantic vectors onto the target-language semantic space. Before discussing nonterminal representations, we briefly introduce vector representations for words and phrases.

Prerequisite: Learning Word and Phrase Representations
We employ a neural method, specifically the continuous bag-of-words model (Mikolov et al., 2013a), to learn high-quality vector representations for words. Once we complete the training of the continuous bag-of-words model, word embeddings form an embedding matrix $M \in \mathbb{R}^{d \times |V|}$, where $d$ is a pre-determined embedding dimensionality and each word $w$ in the vocabulary $V$ corresponds to a vector $v \in \mathbb{R}^{d}$. Given the embedding matrix $M$, mapping words to vectors can be done by simply looking up their respective columns in $M$. We further feed these learned word embeddings to recursive autoencoders (RAE) (Socher et al., 2011) for learning phrase representations. In traditional RAE (shown in Figure 1), given two input children representation vectors $c_1 \in \mathbb{R}^{d}$ and $c_2 \in \mathbb{R}^{d}$, their parent representation $p$ can be calculated as follows:

$$p = f^{(1)}\left(W^{(1)}[c_1; c_2] + b^{(1)}\right)$$

where $[c_1; c_2] \in \mathbb{R}^{2d}$ is the concatenation of the two children, $W^{(1)} \in \mathbb{R}^{d \times 2d}$ is a parameter matrix, $b^{(1)}$ is a bias term, and $f^{(1)}$ is an element-wise activation function such as tanh.
The above output representation p can be used as a child vector to construct the representation for a larger subphrase. This process is repeated until a binary tree covering the whole input phrase is generated.
In order to evaluate how well the parent vector represents its children, we can reconstruct the children in a reconstruction layer:

$$[c_1'; c_2'] = f^{(2)}\left(W^{(2)} p + b^{(2)}\right)$$

where $c_1'$ and $c_2'$ are the reconstructed children, $W^{(2)}$ is a parameter matrix for reconstruction, $b^{(2)}$ is a bias term for reconstruction, and $f^{(2)}$ is an element-wise activation function. For each node in the generated binary tree, we compute the Euclidean distance between the original input vectors and the reconstructed vectors to measure the reconstruction error:

$$E_{rec}([c_1; c_2]) = \frac{1}{2}\left\| [c_1; c_2] - [c_1'; c_2'] \right\|^2$$

By minimizing the total reconstruction error over all nonterminal nodes, we can learn the parameters of the RAE.
Socher et al. (2011) propose a greedy unsupervised RAE as an extension to the above traditional RAE. The main difference is that the traditional RAE requires a given tree structure, whereas the unsupervised RAE does not: it learns both the representations and the tree structures of phrases or sentences. In this work, we adopt the unsupervised RAE to learn vector representations for phrases.
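The encoding, reconstruction and greedy tree-building steps described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the helper names (`affine`, `rae_step`, `greedy_encode`, `random_params`) are hypothetical, tanh is assumed for both activation functions, the parameters are random rather than trained, and the reconstruction error is assumed to be half the squared Euclidean distance.

```python
import math
import random


def affine(W, x, b):
    """Return W x + b for a row-major matrix W (list of rows)."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def rae_step(c1, c2, params):
    """Encode two children into a parent and measure the reconstruction error."""
    W1, b1, W2, b2 = params
    children = c1 + c2                                  # concatenation [c1; c2]
    parent = [math.tanh(z) for z in affine(W1, children, b1)]
    recon = [math.tanh(z) for z in affine(W2, parent, b2)]
    error = 0.5 * sum((x - y) ** 2 for x, y in zip(children, recon))
    return parent, error


def greedy_encode(word_vectors, params):
    """Greedily merge the adjacent pair with the lowest reconstruction error
    until a single vector covers the whole phrase."""
    nodes = [list(v) for v in word_vectors]
    total_error = 0.0
    while len(nodes) > 1:
        candidates = [rae_step(nodes[i], nodes[i + 1], params)
                      for i in range(len(nodes) - 1)]
        best = min(range(len(candidates)), key=lambda i: candidates[i][1])
        parent, error = candidates[best]
        nodes[best:best + 2] = [parent]                 # replace the pair by its parent
        total_error += error
    return nodes[0], total_error


def random_params(d, seed=0):
    """Untrained parameters, for illustration only."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return mat(d, 2 * d), [0.0] * d, mat(2 * d, d), [0.0] * (2 * d)


params = random_params(d=3)
phrase = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
phrase_vector, phrase_error = greedy_encode(phrase, params)
```

In training, the parameters would be updated to minimize the summed reconstruction error; the sketch only shows the forward computation for a fixed parameter set.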

Inducing Nonterminal Representations from Phrase Representations
Figure 1: The architecture of a recursive autoencoder, adapted from Socher et al. (2011). Blue nodes are original vectors and yellow nodes are reconstructed vectors which are used to compute reconstruction errors.
As we extract hierarchical rules from phrases by replacing subphrases with nonterminal symbols, a nonterminal X is generalized from a number of subphrases. We believe that these subphrases determine syntactic and semantic properties of the nonterminal X. We therefore enrich each nonterminal X with a semantic vector induced from vector representations of phrases that are replaced by the nonterminal during rule extraction.
For an SCFG rule, we can learn semantic vectors for nonterminals on both the source and target side. Due to space limitations, we only introduce the procedure of learning nonterminal vectors on the source side. Semantic vectors on the target side can be learned analogously.
For each source-side nonterminal X of a hierarchical rule, we collect all source subphrases replaced by X in a source subphrase set $P = \{p_1, p_2, \cdots, p_m\}$. We also count the number of times each of these phrases is replaced by nonterminal X on the training data during rule extraction. We collect these numbers in a count set $C = \{c_1, c_2, \cdots, c_m\}$. Based on the phrase set $P$, the count set $C$ and the learned vector representations of the phrases in $P$, we can compute a semantic vector $\vec{v}_x$ for nonterminal X in each SCFG rule.
We propose two general approaches to obtain semantic vectors for nonterminals: a weighted mean value method and a minimum distance method. Given phrase vector representations $P_r = \{\vec{p}_1, \vec{p}_2, \ldots, \vec{p}_m\}$, we calculate the semantic vector for a nonterminal generalized from these phrases as follows.
Weighted mean value method (MV) computes the semantic vector $\vec{v}_x$ as the mean of the phrase vectors weighted by their counts:

$$\vec{v}_x = \frac{\sum_{i=1}^{m} c_i \, \vec{p}_i}{\sum_{i=1}^{m} c_i}$$

Minimum distance method (MD) finds a point in the semantic space that minimizes the sum of Euclidean distances from the vectors in $P_r$ to this point:

$$\vec{v}_x = \arg\min_{\vec{v}} \sum_{i=1}^{m} \left\| \vec{p}_i - \vec{v} \right\|$$

We use the stochastic gradient descent algorithm to find the minimal distance and the point $\vec{v}_x$. Similar to the center of gravity, the semantic vector $\vec{v}_x$ learned by this method acts as a semantic centroid for all vectors of phrases that are substituted by X. Nonterminals in different hierarchical translation rules will have different semantic centroids. These centroids help the translation model capture semantic diversity to a certain degree.
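A small sketch may help make the two induction methods concrete. It is an illustration under assumptions, not the code used in our experiments: the function names are hypothetical, the minimum distance objective is taken to be the unweighted sum of Euclidean distances, and the starting point, number of epochs and learning rate are arbitrary choices.

```python
import math
import random


def weighted_mean(vectors, counts):
    """MV: mean of the phrase vectors, weighted by substitution counts."""
    total = sum(counts)
    dim = len(vectors[0])
    return [sum(c * v[k] for v, c in zip(vectors, counts)) / total
            for k in range(dim)]


def sum_of_distances(point, vectors):
    """Objective of MD: sum of Euclidean distances from the point to all vectors."""
    return sum(math.sqrt(sum((a - b) ** 2 for a, b in zip(point, v)))
               for v in vectors)


def min_distance_point(vectors, lr=1e-3, epochs=1000, seed=0):
    """MD: stochastic gradient descent on the sum of Euclidean distances."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Assumed initialization: the unweighted mean of the phrase vectors.
    point = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    order = list(range(len(vectors)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            diff = [a - b for a, b in zip(point, vectors[i])]
            norm = math.sqrt(sum(d * d for d in diff))
            if norm < 1e-12:        # gradient is undefined exactly at a data point
                continue
            # The gradient of ||point - p_i|| with respect to point is diff / norm.
            point = [a - lr * d / norm for a, d in zip(point, diff)]
    return point


phrase_vectors = [[0.0, 0.0], [2.0, 4.0]]
counts = [1, 3]
v_mv = weighted_mean(phrase_vectors, counts)      # [1.5, 3.0]
v_md = min_distance_point(phrase_vectors)
```

Unlike the weighted mean, the minimum distance point has no closed form, which is why an iterative procedure is needed and why this method is the more expensive of the two.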

Mapping Source-Side Representations onto Target-Side Semantic Space
As we discussed in Section 1, directly learning vector representations for target phrases is very costly in practice. Inspired by Mikolov et al. (2013b), we adopt vector projection to alleviate this problem. Unlike Mikolov et al. (2013b), who map representations from the source side to the target side by learning a linear matrix on word alignments, we project source multi-word phrase representations onto the target semantic space in a nonlinear manner, as we believe that nonlinear relations between languages are more reasonable. Specifically, we use a multi-layer feed-forward neural network with one hidden layer. The network takes as input a vector $src$, which is learned in the source semantic space, and outputs a projected vector $p$; $W^{(3)}$ denotes the weight matrix for connections between input and hidden neurons, $W^{(4)}$ denotes the weight matrix for links between hidden neurons and output, and $b^{(3)}$ and $b^{(4)}$ are bias terms. To train the neural network, we optimize an objective over $N$ training examples that compares, for the $i$th example, the target vector representation $trg_i$ learned by RAE with the output $p_i$ of the neural network for the source vector representation $src_i$. The objective also includes a regularizer $R(\theta)$ on the parameters, which covers the parameter matrices $W^{(3)}$, $W^{(4)}$ and the bias terms $b^{(3)}$, $b^{(4)}$.
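The following sketch shows one plausible instantiation of such a projection network and its training objective. The specific choices are assumptions made for illustration only: a tanh hidden layer, a linear output layer, a mean squared error between projected and target vectors, and an L2 regularizer over all weights and biases with a hypothetical coefficient `lam`. The function names are likewise hypothetical.

```python
import math


def project(src, W3, b3, W4, b4):
    """Map a source phrase vector onto the target semantic space."""
    # Hidden layer: tanh(W3 src + b3)   (tanh is an assumed activation)
    hidden = [math.tanh(sum(w * x for w, x in zip(row, src)) + b)
              for row, b in zip(W3, b3)]
    # Output layer: W4 hidden + b4      (a linear output is assumed)
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(W4, b4)]


def objective(pairs, W3, b3, W4, b4, lam=1e-3):
    """Average squared error over (src, trg) pairs plus an L2 regularizer."""
    data_term = 0.0
    for src, trg in pairs:
        out = project(src, W3, b3, W4, b4)
        data_term += 0.5 * sum((t - o) ** 2 for t, o in zip(trg, out))
    data_term /= len(pairs)
    # Regularize both weight matrices and both bias vectors.
    squared = sum(w * w for M in (W3, W4) for row in M for w in row)
    squared += sum(b * b for b in b3) + sum(b * b for b in b4)
    return data_term + 0.5 * lam * squared


# Toy network: 2-dimensional input, 1 hidden unit, 2-dimensional output.
W3, b3 = [[0.0, 0.0]], [0.0]
W4, b4 = [[0.0], [0.0]], [1.0, 2.0]
projected = project([5.0, 6.0], W3, b3, W4, b4)   # equals b4 when all weights are zero
```

Dropping the hidden layer and keeping only a single weight matrix gives the linear variant of the projection.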

Semantic Nonterminal Refinement Model
In this section, we describe our semantic nonterminal refinement model on the basis of induced real-valued semantic vectors for nonterminals.

Nonterminal Representations in Hierarchical Rules
We incorporate learned semantic representations of nonterminals into hierarchical rules. In particular, ordinary hierarchical rules take the following form:

$$X \rightarrow \langle a \, X_s \, c,\; b \, X_t \, d \rangle$$

where $a$/$b$ and $c$/$d$ are strings of terminals on the source and target side, $s$ and $t$ are placeholders denoting the nonterminal X on the source or target side, and $X_s$ and $X_t$ are aligned to each other. Representations for nonterminals can be on either the source or target side. They are attached to hierarchical rules by annotating each nonterminal with a vector $\vec{v}_{x\cdot}$, the source- or target-side semantic representation for the nonterminal. In this way, we keep original translation rules intact and decorate nonterminals with their semantic representations.

The Model
The proposed semantic nonterminal refinement model estimates the semantic similarity between a phrase p and a nonterminal X. The phrase p and the nonterminal X will have a high similarity score in the representation space if they are semantically similar. The higher the semantic similarity score, the more compatible the nonterminal is with the corresponding phrase.
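There is another nonterminal S in glue rules, which are formalized as follows:

$$S \rightarrow \langle S_1 X_2,\; S_1 X_2 \rangle, \qquad S \rightarrow \langle X_1,\; X_1 \rangle$$

This nonterminal S is different from X. We therefore treat it as a special case in the computation of semantic similarity.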
In this work, we explore two approaches to compute similarity: one based on cosine similarity and the other based on Euclidean distance.
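Given a phrase vector representation $\vec{p}$ and a nonterminal semantic vector $\vec{v}_x$, Cosine Similarity (CS) is computed as:

$$\cos(\vec{p}, \vec{v}_x) = \frac{\vec{p} \cdot \vec{v}_x}{\|\vec{p}\| \, \|\vec{v}_x\|}$$

We set a constant $\alpha$ as the Cosine Similarity between a glue rule and its corresponding phrase:

$$\mathrm{SeSim} = \begin{cases} \cos(\vec{p}, \vec{v}_x) & \text{hierarchical rules} \\ \alpha & \text{glue rules} \end{cases}$$

As for Euclidean Distance (ED), it is computed according to the following formula:

$$\mathrm{ED}(\vec{p}, \vec{v}_x) = \left\| \vec{p} - \vec{v}_x \right\| = \sqrt{\sum_{k=1}^{d} (p_k - v_{x,k})^2}$$

and similarly we set a constant $\beta$ for glue rules:

$$\mathrm{SeSim} = \begin{cases} \mathrm{ED}(\vec{p}, \vec{v}_x) & \text{hierarchical rules} \\ \beta & \text{glue rules} \end{cases}$$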
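As a minimal sketch of the two similarity computations and the special handling of glue rules, consider the following. The function names and the `method` switch are hypothetical, the default values of `alpha` and `beta` are only example settings, and returning 0 for a zero-length vector is an assumption added to keep the example well defined.

```python
import math


def cosine(p, v):
    """Cosine similarity between a phrase vector and a nonterminal vector."""
    dot = sum(a * b for a, b in zip(p, v))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_p == 0.0 or norm_v == 0.0:   # assumed convention for zero vectors
        return 0.0
    return dot / (norm_p * norm_v)


def euclidean(p, v):
    """Euclidean distance between a phrase vector and a nonterminal vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, v)))


def sesim(p, v_x, is_glue, method="cs", alpha=0.0, beta=0.5):
    """Feature value for one rule application.

    Glue rules receive the constant alpha (CS) or beta (ED);
    hierarchical rules receive the computed value.
    """
    if method == "cs":
        return alpha if is_glue else cosine(p, v_x)
    if method == "ed":
        return beta if is_glue else euclidean(p, v_x)
    raise ValueError("method must be 'cs' or 'ed'")


score_rule = sesim([1.0, 0.0], [1.0, 0.0], is_glue=False)          # identical direction
score_glue = sesim([1.0, 0.0], [1.0, 0.0], is_glue=True, alpha=0.0)
```

Because glue rules bypass the vector computation entirely, the choice of the constant determines how glue rules are scored relative to hierarchical rules.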

Decoding
We incorporate the proposed model as a new feature into the hierarchical phrase-based translation system. Specifically, two features are added into the baseline system:
1. Source-side semantic similarity between source phrases and nonterminals
2. Target-side semantic similarity between target phrases and nonterminals
We compute source- and target-side similarities based on representations of nonterminals and phrasal substitutions for each applied rule, and sum up these similarities to calculate the total score of a derivation on the two features.
The integration of the source-side semantic nonterminal refinement model into the decoder is trivial. For the target-side model, however, we have to consider the efficiency issue mentioned in Section 1. We introduce two different methods to integrate the target-side model into the decoder: 1) projection and 2) two-pass decoding.
In the first integration method, a mapping neural network is trained to map source phrase representations onto the target semantic space as described in Section 3.3. The projection can be linear if we remove the hidden layer in the projection neural network. This is similar to the mapping matrix learned by Mikolov et al. (2013b). We calculate semantic similarities between projected representations of phrases and those of nonterminals.
In the two-pass decoding, we collect target phrase candidates from the 100-best translations generated by the baseline for each source sentence in the first pass and learn vector representations for these target phrase candidates. Then in the second pass, we decode the source sentence with our target semantic nonterminal refinement model using the learned target phrase vector representations. If a target phrase appears in the collected set, the target-side semantic nonterminal refinement model will calculate the semantic similarity between the target phrase and the corresponding nonterminal on the target semantic space; otherwise the model will give a penalty. This is because a phrase that is not used in the 100-best translations is not a desirable phrase.
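The two-pass scheme can be sketched as follows. This is an illustration under assumptions rather than the decoder integration itself: the n-best list is represented as lists of target phrases, `embed` stands in for the phrase representation learner, the penalty value is hypothetical, and all function names are invented for the example.

```python
import math


def cosine(p, v):
    """Cosine similarity; returns 0.0 for zero-length vectors (assumed convention)."""
    dot = sum(a * b for a, b in zip(p, v))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0


def collect_candidates(nbest):
    """First pass: gather every target phrase used in the n-best translations."""
    return {phrase for translation in nbest for phrase in translation}


def build_phrase_vectors(candidates, embed):
    """Compute vectors only for the collected candidates."""
    return {phrase: embed(phrase) for phrase in candidates}


def target_feature(applied, phrase_vectors, penalty=-1.0):
    """Second pass: sum similarities over the substitutions of one derivation.

    `applied` is a list of (target_phrase, nonterminal_vector) pairs.
    Phrases outside the collected set receive the (hypothetical) penalty.
    """
    score = 0.0
    for phrase, v_x in applied:
        vec = phrase_vectors.get(phrase)
        score += cosine(vec, v_x) if vec is not None else penalty
    return score


# Toy stand-in for the phrase encoder: NOT a real phrase representation.
def toy_embed(phrase):
    return [float(len(phrase)), 1.0]


nbest = [["the cat", "sat"], ["a cat", "sat"]]
vectors = build_phrase_vectors(collect_candidates(nbest), toy_embed)
score = target_feature([("sat", [3.0, 1.0]), ("unseen phrase", [1.0, 0.0])], vectors)
```

Restricting vector computation to the collected candidates is what keeps the second pass tractable: the expensive phrase encoder runs once per candidate rather than once per hypothesis explored by the decoder.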
The weights of these two features are tuned by Minimum Error Rate Training (MERT) (Och, 2003), together with the weights of other sub-models, on a development set. Figure 2 shows the architecture of the SMT system with the proposed semantic nonterminal refinement model.

Experiment
In this section, we conduct a series of experiments on Chinese-to-English translation using large-scale bilingual training data, aiming at the following questions:
1. Which approach is better for learning nonterminal representations, weighted mean value or minimum distance?
2. Can the target-side semantic nonterminal refinement model improve translation quality? And which method is better for integrating the target-side semantic model into translation, projection or two-pass decoding?
3. Does the combination of source and target semantic nonterminal refinement models provide further improvement?

Setup
Our training corpus contains 2.9M sentence pairs with 80.9M Chinese words and 86.4M English words from LDC data. We used NIST MT03 as our development set, NIST MT06 as our development test set and MT08 as our final test set. We ran Giza++ on the training corpus in both Chinese-to-English and English-to-Chinese directions and applied the "grow-diag-final" refinement rule (Koehn et al., 2003) to obtain word alignments. We used the SRI Language Modeling Toolkit (Stolcke, 2002) to train our language models. MERT (Och, 2003) was adopted to tune feature weights of the decoder. We used case-insensitive BLEU as our evaluation metric. In order to alleviate the instability of MERT, we followed Clark et al. (2011) to perform three runs of MERT and reported average BLEU scores over the three runs for all our experiments.
We used the word2vec toolkit to train our word embeddings and set the vector dimension d to 30. In our training experiment, we used the continuous bag-of-words model with a context window of size 5. The monolingual corpus used to pre-train word embeddings was extracted from the above parallel corpus. To train vector representations for multi-word phrases, we randomly selected 1M bilingual sentences as the training set and used the unsupervised greedy RAE following Socher et al. (2011). We used a learning rate of $10^{-3}$ for our minimum distance method, which learns the centroid of phrase representations as the vector representation of the corresponding nonterminal.
For the projection neural network in Section 3.3, we set 300 units for the hidden layer and a dimensionality of 30 for both input and output vectors. The learning rate was set to $10^{-3}$ and the regularization coefficient $\lambda_L$ was set to $10^{-3}$. To construct the training set for the projection neural network, we selected phrase pairs from our rule table and used their representations on the source and target side as training examples. We randomly selected 5M examples as the training set, 10k examples as the development set and 10k examples as the test set. The multi-layer projection neural network was trained with back-propagation and stochastic gradient descent with a mini-batch size of 5k.
Our baseline system is an in-house hierarchical phrase-based system (Chiang, 2007). The features used in the baseline system include a 4-gram language model trained on the Xinhua section of the English Gigaword corpus, a 3-gram language model trained on the target part of the bilingual training data, bidirectional translation probabilities, bidirectional lexical weights, a word count, a phrase count and a glue rule count.
In order to compare our proposed models with previous methods on nonterminal refinement, we re-implemented a syntax mismatch model (Syn-Mis) which was used by Huang et al. (2013) and integrated it into the hierarchical phrase-based system. The Syn-Mis model decorates each nonterminal with a distribution of head POS tags and uses this distribution to measure the degree of syntactic compatibility of translation rules with corresponding source spans. In order to obtain head POS tags for the Syn-Mis model, we used the Stanford dependency parser (Chang et al., 2009).
Table 1: BLEU scores of our models against the baseline and the Syn-Mis model. "*" and "+": significantly better than the baseline at significance level p < 0.01 and p < 0.05, respectively.

Different Approaches to Learn Vector Representations for Nonterminals
Our first group of experiments was carried out to investigate which approach is more appropriate for learning semantic vectors for nonterminals. We only used the source-side semantic nonterminal refinement model in these experiments. In order to validate the effectiveness of the proposed approaches for learning nonterminal semantic vectors, we combined the minimum distance method (MD) with the Euclidean Distance (ED) because both of them are distance-based, and combined the weighted mean value method (MV) with the Cosine Similarity model (CS) as they belong to vector-based approaches. We chose α = 1.0, 0, -1.0 and β = 0, 0.5, 1.0 for glue rules to study the impact of these parameters. We compared our model with the baseline and the Syn-Mis model.
Results are shown in Table 1. We observe that the two proposed approaches are able to achieve significant improvements over the baseline. (MV + CS) and (MD + ED) achieve absolute improvements of up to 1.09 and 0.81 BLEU points respectively (when α = 0 and β = 0.5) over the baseline on the development test set MT06. The approach (MV + CS) with α = 0 also outperforms Syn-Mis by 0.4 BLEU points on MT06 without using any syntactic information. The approach (MV + CS) achieves better performance and is more efficient than (MD + ED), where the computation of semantic centroids is time-consuming. Therefore, we adopt the approach (MV + CS) with α = 0 to learn semantic vectors for nonterminals and compute semantic similarities in the following experiments.
Table 2: Comparison of two-pass decoding, linear and nonlinear projection methods for integrating the target-side semantic nonterminal refinement model in terms of BLEU scores. "*" and "+": significantly better than the baseline at significance level p < 0.01 and p < 0.05, respectively.

Effect of the Target Semantic Nonterminal Refinement Models
In the second set of experiments, we further validate the effectiveness of semantic nonterminal vectors learned on the target side. In these experiments, learning vector representations and computing semantic similarities were performed on the target language semantic space. We also compared the two integration methods discussed in Section 5 for the target-side model. With regard to the projection method, we further compared the linear projection (the projection neural network without hidden layer) with the nonlinear projection (with hidden layer). Experiment results are shown in Table 2.
From Table 2, we can see that
• Two-pass decoding achieves the highest BLEU scores, which are higher than those of the baseline by 0.75 and 0.66 BLEU points on MT06 and MT08 respectively. The reason may be that noisy translation candidates are filtered out in the first pass. This finding is consistent with many other multiple-pass systems in natural language processing, e.g., two-pass parsing (Zettlemoyer and Collins, 2007).
• Nonlinear projection achieves an improvement of 0.62 BLEU points over the baseline on MT06. It outperforms linear projection method on both sets. These empirical results support our assumption that nonlinear relations between languages are more reasonable than linear relations.
• The results show that the target-side semantic nonterminal refinement model is also able to improve the baseline system, although the gain is less than that of the source-side counterpart.
Table 3: BLEU scores of the combination of the source- and target-side semantic nonterminal refinement models. "*" and "+": significantly better than the baseline at significance level p < 0.01 and p < 0.05, respectively.

Combination of the Source and Target Models
Finally, we integrated both the source- and target-side semantic nonterminal refinement models into the baseline system. In this experiment, we adopted nonlinear projection to obtain target semantic vector representations for target phrases. As shown in Table 3, these two models collectively achieve a gain of up to 1.16 BLEU points over the baseline and 0.41 BLEU points over the Syn-Mis model on average.

Conclusion
We have presented a framework to refine nonterminal X in hierarchical translation rules with semantic representations. The semantic vectors are derived from vector representations of phrasal substitutions, which are automatically learned using an unsupervised RAE. As the semantic nonterminal refinement model is capable of selecting more semantically similar translation rules, it achieves statistically significant improvements over the baseline on Chinese-to-English translation. Experiment results have shown that
• Using the (MV + CS) approach to learn semantic representations for nonterminals achieves better performance than (MD + ED) in terms of BLEU scores.
• The target-side semantic nonterminal refinement model is able to improve translation quality over the baseline. The two-pass decoding method is superior to the projection method.
• The simultaneous incorporation of the source- and target-side models can achieve further improvements over a single-side model.
For future work, we are interested in learning bilingual representations (Lauly et al., 2014; Gouws et al., 2014) for nonterminals. We would also like to extend our work by using more contextual lexical information to derive semantic vectors for nonterminals.