Generating Diverse Translations with Sentence Codes

Users of machine translation systems may wish to obtain multiple candidates translated in different ways. In this work, we attempt to obtain diverse translations by using sentence codes to condition sentence generation. We describe two methods for extracting the codes, with or without the help of syntactic information. For diverse generation, we sample multiple candidates, each conditioned on a unique code. Experiments show that the sampled translations have much higher diversity scores when reasonable sentence codes are used, while the translation quality remains on par with the baselines even under the strong constraint imposed by the codes. In a qualitative analysis, we show that our method is able to generate paraphrase translations with drastically different structures. The proposed approach can be easily adapted to existing translation systems, as no modification to the model is required.


Introduction
When using machine translation systems, users may desire to see different candidate translations other than the best one. In this scenario, users usually expect the system to show candidates with different sentence structures.
To obtain diverse translations, conventional neural machine translation (NMT) models allow one to sample translations using the beam search algorithm; however, the sampled candidates usually share similar sentence structures. Recently, various methods (Li et al., 2016; Xu et al., 2018) have been proposed for diverse generation. These methods encourage the model to use creative vocabulary to achieve high diversity. Although producing creative words benefits tasks in the dialog domain, when applied to machine translation it can hurt translation quality by changing the original meaning.
In this work, we are interested in generating multiple valid translations with high diversity. To achieve this, we propose to construct the codes based on semantics-level or syntax-level information of target-side sentences.
To generate diverse translations, we constrain the generation model by specifying a particular code as a semantic or syntactic assignment. More concretely, we prefix the target-side sentences with the codes. Then, an NMT model is trained with the original source sentences and the prefixed target sentences. As the model generates tokens in left-to-right order, the probability of emitting each word is predicted conditioned on the assigned code. As each assignment is supposed to correspond to a sentence structure, the candidate translations sampled with different assignments are expected to have high diversity.
We can think of such a model as a mixture-of-experts translation model, where each expert is capable of producing translations with a certain style indicated by the code. At inference time, code assignments are given to the model so that a selection of experts is picked to generate translations.
The key question is how to extract such sentence codes. Here, we explore two approaches. First, a simple unsupervised method is tested, which clusters the sentence embeddings and uses the cluster ids as the code assignments. Next, to capture only the structural variation of sentences, we turn to syntax. We encode the structure of constituency parse trees into discrete codes with a tree auto-encoder.
Experiments on two machine translation datasets show that a set of highly diverse translations can be obtained with a reasonable mechanism for extracting the sentence codes, while the sampled candidates still have BLEU scores on par with the baselines.
Proposed Approach

Extracting Sentence Codes
Our approach produces diverse translations by conditioning sentence generation with the sentence codes. Ideally, we would like the codes to capture the information about the sentence structures rather than utterances. To extract such codes from target sentences, we explore two methods.
Semantic Coding Model The first method extracts sentence codes from semantic information learned in an unsupervised manner. We cluster the sentence embeddings produced by pre-trained models into a fixed number of clusters, then use the cluster ids as discrete priors to condition the sentence generation. In this work, we test two semantic coding models. The first is based on BERT (Devlin et al., 2018), where the vectors corresponding to the "[CLS]" token are clustered.
The second model produces sentence embeddings by averaging FastText word embeddings (Bojanowski et al., 2017). Compared to the hidden states of BERT, word embeddings are expected to contain less syntactic information, as word order is ignored during training.
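The clustering step described above can be sketched as follows. This is a minimal k-means in plain NumPy; the toy embeddings, dimensionality, and cluster count are illustrative assumptions (the paper clusters BERT "[CLS]" vectors or averaged FastText embeddings, and a library implementation would be used in practice):

```python
import numpy as np

def kmeans_codes(embeddings, n_clusters, n_iters=20, seed=0):
    """Cluster sentence embeddings; the cluster id of each sentence
    becomes its discrete sentence code."""
    rng = np.random.default_rng(seed)
    X = np.asarray(embeddings, dtype=float)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iters):
        # assign each embedding to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        # recompute centers; keep the old center for empty clusters
        for c in range(n_clusters):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(axis=0)
    return assign  # one code id per sentence

# toy usage: two well-separated groups of "sentence embeddings"
emb = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (5, 4)),
                 np.random.default_rng(2).normal(5.0, 0.1, (5, 4))])
codes = kmeans_codes(emb, n_clusters=2)
```

On the toy data, sentences within each group receive the same code, so the code id can be prepended to the target sentence as a discrete prior.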
Syntactic Coding Model To explicitly capture syntactic diversity, we also consider deriving the sentence codes from the parse trees produced by a constituency parser. As utterance-level information is not desired, the terminal nodes are removed from the parse trees.
To obtain the sentence codes, we use a TreeLSTM-based auto-encoder similar to Socher et al. (2011), which encodes the syntactic information into a single discrete code. As illustrated in Fig. 1 (a), a TreeLSTM cell (Tai et al., 2015) computes a recurrent state based on a given input vector and the states of its N_i child nodes:

h_i = f_TreeLSTM(x_i, h_{i,1}, ..., h_{i,N_i}).  (1)

[Figure 1: Architecture of the TreeLSTM-based auto-encoder with a discretization bottleneck for learning the sentence codes.]

The tree auto-encoder model is shown in Fig. 1 (c), where the encoder computes a latent tree representation. As the decoder has to unroll the vector representation following a reversed tree structure to predict the non-terminal labels, the standard TreeLSTM equation cannot be applied directly. To compute along the reversed tree, we modify Eq. 1 to compute the hidden state of the j-th child node given the parent-node state h_i:

h_{i,j} = f_TreeLSTM^(j)(h_i),  (2)
where the internal implementation of the recurrent function is the same as in Eq. 1; however, each node has a different parameterization depending on its position among its siblings. Note that on the decoder side, no input vectors are fed to the recurrent computation. Finally, the decoder states are used to predict the target labels, and the model is optimized with a cross-entropy loss.
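The encoder-side recurrent update can be sketched as a minimal NumPy cell. The dimensions and random weights are illustrative assumptions (the paper's model is implemented with DGL), but the per-child-slot parameters mirror the per-sibling parameterization described above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class NaryTreeLSTMCell:
    """Minimal N-ary TreeLSTM cell (after Tai et al., 2015): each of the
    N child slots has its own parameters, matching the per-sibling
    parameterization described in the text."""

    def __init__(self, n_children, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        g = 3 + n_children  # gates: input, output, update, + N forget gates
        self.W = rng.normal(0.0, 0.1, (g, hid_dim, in_dim))
        self.U = rng.normal(0.0, 0.1, (g, n_children, hid_dim, hid_dim))

    def __call__(self, x, child_h, child_c):
        # x: (in_dim,); child_h, child_c: (N, hid_dim)
        pre = np.einsum('gij,j->gi', self.W, x) \
            + np.einsum('gkij,kj->gi', self.U, child_h)
        i, o, u = sigmoid(pre[0]), sigmoid(pre[1]), np.tanh(pre[2])
        f = sigmoid(pre[3:])                    # one forget gate per child
        c = i * u + (f * child_c).sum(axis=0)   # new cell state
        h = o * np.tanh(c)                      # new hidden state
        return h, c

cell = NaryTreeLSTMCell(n_children=2, in_dim=3, hid_dim=4)
h, c = cell(np.ones(3), np.zeros((2, 4)), np.zeros((2, 4)))
```

The decoder-side variant would drop the input term and apply a separate parameterization per child position, as described in the text.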
As the source sentence already provides hints about the target-side sentence structure, we feed the source information to the tree auto-encoder to encourage the latent representation to capture the syntax that cannot be inferred from the source sentence. To obtain the sentence codes from the latent tree representation, we apply improved semantic hashing (Kaiser and Bengio, 2018) to the hidden state of the root node, which discretizes the vector into an 8-bit code (binary vector). When performing improved semantic hashing, the forward pass computes two operations, binarization and a saturated sigmoid, resulting in two vectors. One of the two vectors is randomly selected for the subsequent computation. In the backward pass, however, the gradient always flows through the vector produced by the saturated sigmoid. As the model is trained together with the bottleneck, the codes are optimized directly to minimize the loss function.
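The forward pass of this discretization bottleneck can be sketched as follows (NumPy, forward computation only; the straight-through gradient behavior noted in the comments requires an autodiff framework, and the unit noise scale is an assumption):

```python
import numpy as np

def saturated_sigmoid(x):
    # clips 1.2 * sigmoid(x) - 0.1 into [0, 1], so it truly reaches 0 and 1
    return np.clip(1.2 / (1.0 + np.exp(-x)) - 0.1, 0.0, 1.0)

def improved_semantic_hash(logits, training=True, rng=None):
    """Discretize a vector of logits into {0, 1} bits.

    Forward pass in the style of Kaiser and Bengio (2018): compute both a
    binarized vector and a saturated-sigmoid vector, and randomly pick one
    of them during training.  In the backward pass (not modeled here) the
    gradient always flows through the saturated-sigmoid branch.
    """
    rng = rng or np.random.default_rng(0)
    x = logits + (rng.normal(size=logits.shape) if training else 0.0)
    v_hard = (x > 0.0).astype(float)   # binarization
    v_soft = saturated_sigmoid(x)      # differentiable surrogate
    if training and rng.random() < 0.5:
        return v_soft
    return v_hard

def bits_to_code(bits):
    # pack the binary bits into a single integer sentence code
    return int("".join(str(int(b)) for b in bits), 2)

# at inference time: no noise, deterministic binarization of the root state
root_state = np.array([2.5, -3.0, 1.0, -0.5, 4.0, -2.0, 0.8, -1.2])
code = bits_to_code(improved_semantic_hash(root_state, training=False))
```

At inference time the bottleneck is deterministic, so each root state maps to a single 8-bit code.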

Diverse Generation with Code Assignment
Once we obtain the sentence codes, we prefix the target-side sentences in the training data with the corresponding codes. The resultant target sentence has the form "c12 eoc Here is a translation.", where the "eoc" token separates the code from the words. We train a regular NMT model on the modified training dataset. To generate diverse translations, we first obtain the top-K codes from the probability distribution of the code prediction; that is, we select the K sentence codes with the highest probabilities. Then, conditioning on each code, we let beam search continue to generate the sentence, resulting in K translations conditioned on different codes.
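The data preparation and code-selection steps above can be sketched as follows (the token surface forms "c12"/"eoc" follow the example in the text; the small probability vector is a stand-in for the model's first-step code distribution):

```python
import numpy as np

def prefix_with_code(code_id, target_sentence):
    # training data: prepend the sentence code, separated by the "eoc" token
    return f"c{code_id} eoc {target_sentence}"

def top_k_codes(code_probs, k=3):
    # pick the k code ids with the highest probability at the first decoding
    # step; beam search then continues from each of them independently
    return np.argsort(code_probs)[::-1][:k].tolist()

pair = prefix_with_code(12, "Here is a translation.")
probs = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
codes = top_k_codes(probs, k=3)
```

Each selected code then seeds one beam-search run, yielding K candidate translations conditioned on different codes.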

Related Work
Existing works on diverse text generation fall into two major categories. Approaches in the first category sample diverse sequences by varying a hidden representation. Jain et al. (2017) generate diverse questions by injecting Gaussian noise into the latent variable of a VAE to encourage creative results. Xu et al. (2018) learn K shared decoders, each conditioned on a different pattern-rewriting embedding. The former method is evaluated by assessing the ability to generate unique and unseen results, whereas the latter is evaluated with the number of unique uni/bi-grams and the divergence of the word distributions produced by different decoders. Independently of this work, Shen et al. (2019) also explore mixture-of-experts models with an ensemble of learners. The paper discusses multiple training strategies and finds that multiple-choice learning works best.
The second category of approaches attempts to improve diversity by improving the decoding algorithm. Li et al. (2016) modify the scoring function in beam search to promote hypotheses containing words from different ancestral hypotheses; this approach is also evaluated with the number of unique uni/bi-grams. Kulikov et al. (2018) use an iterative beam search approach to generate diverse dialogs.
Compared to these works, we focus on generating translations with different sentence structures. We still use beam search to search for the best words at every decoding step under the constraint of the code assignment. Our approach also comes with the advantage that no modification to the NMT model architecture is required.

Experimental Settings
We evaluate our models on two machine translation datasets: the ASPEC Japanese-to-English dataset (Nakazawa et al., 2016) and the WMT14 German-to-English dataset. The datasets contain 3M and 4.5M bilingual pairs, respectively. For the ASPEC Ja-En dataset, we use the Moses toolkit (Koehn et al., 2007) to tokenize the English side and Kytea (Neubig et al., 2011) to tokenize the Japanese side. After tokenization, we apply byte-pair encoding (Sennrich et al., 2016) to segment the texts into subwords, forcing the vocabulary size of each language to be 40k. For the WMT14 De-En dataset, we use sentencepiece (Kudo and Richardson, 2018) to segment the words to ensure a vocabulary size of 32k.
In evaluation, we report tokenized BLEU for the ASPEC Ja-En dataset. For the WMT14 De-En dataset, BLEU scores are computed with the SacreBLEU toolkit (Post, 2018). For models that produce sentence codes during decoding, the codes are removed from the translation results before computing BLEU scores.

Obtaining Sentence Codes
For the semantic coding model based on BERT, we cluster the hidden states of the "[CLS]" token into 256 clusters with the k-means algorithm. The cluster ids are then used as sentence codes. For the model using FastText embeddings, pre-trained vectors (Common Crawl, 2M words) are used. Note that the number of clusters is a hyperparameter; here we choose it to match the number of unique codes in the syntax-based model.
To train the syntactic coding model, we parse the target-side sentences with the Stanford CFG parser (Klein and Manning, 2003). The TreeLSTM-based auto-encoder is implemented with DGL and trained using the AdaGrad optimizer for faster convergence. We found it helpful to pre-train the model without the discretization bottleneck to achieve higher label accuracy.

Quantitative Evaluation of Diversity
As we are interested in the diversity among sampled candidates, the diversity metric based on the divergence between word distributions (Xu et al., 2018) cannot be applied in this case. To quantitatively evaluate the diversity of the generated translations, we propose a BLEU-based discrepancy metric. Let Y be a list of candidate translations; we compute the diversity score as

DP(Y) = \frac{1}{|Y|(|Y|-1)} \sum_{y \in Y} \sum_{y' \in Y, y' \neq y} \big(1 - \Delta(y, y')\big),  (3)

where \Delta(y, y') returns the BLEU score of the two candidates. The equation gives a higher diversity score when each candidate contains more unique n-grams.
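The metric can be implemented directly; in this sketch, ∆ is a stand-in unigram-precision score rather than a real BLEU implementation (an assumption made to keep the example self-contained — the paper uses BLEU as ∆):

```python
from itertools import permutations

def unigram_precision(hyp, ref):
    # stand-in for Delta(y, y'): fraction of hypothesis tokens found in ref
    h, r = hyp.split(), ref.split()
    if not h:
        return 0.0
    return sum(min(h.count(w), r.count(w)) for w in set(h)) / len(h)

def diversity_score(candidates, delta=unigram_precision):
    # DP(Y) = 1 / (|Y| (|Y| - 1)) * sum over ordered pairs of (1 - delta)
    n = len(candidates)
    total = sum(1.0 - delta(y, y2) for y, y2 in permutations(candidates, 2))
    return total / (n * (n - 1))

dp_same = diversity_score(["a b c", "a b c", "a b c"])      # identical -> 0
dp_disjoint = diversity_score(["a b", "c d", "e f"])        # disjoint  -> 1
```

Identical candidates score 0 and fully disjoint candidates score 1, matching the intuition that the score rewards unique n-grams.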

Experiment Results
We use the Base Transformer architecture (Vaswani et al., 2017) for all models. The results are summarized in Table 1. We sample three candidates with each model and report the averaged diversity score. BLEU(%) is reported for the candidate with the highest confidence (log-probability). A detailed table with the BLEU scores of all three candidates can be found in the supplementary material; the BLEU scores of the second and third candidates are on par with the baseline.
We compare the proposed approach to three baselines. The first baseline samples three candidates using standard beam search. We also tested the diverse decoding approach (Li et al., 2016). The coefficient γ is chosen to maximize the diversity with no more than 0.5 BLEU degradation. The third baseline uses random codes for conditioning.
As shown in the table, the model based on BERT sentence embeddings achieves higher diversity on the ASPEC dataset, which contains only formal texts. However, it fails to deliver similar results on the WMT14 dataset, which is more informal. This may be due to the difficulty of clustering BERT vectors, which were never trained with clustering in mind. The model using FastText embeddings is shown to be more robust across the datasets, although it also fails to outperform the diverse decoding baseline on the WMT14 dataset.
In contrast, the syntax-based models achieve much higher diversity on both datasets. We found that the results generated by this model differ more in structure than in word choice. Comparing BLEU scores, no significant degradation in translation quality is observed. As a control experiment, using random codes does not contribute to diversity. As confirmation that the sentence codes have a strong impact on sentence generation, the models using codes derived from the references (oracle codes) achieve much higher BLEU scores.

Table 2 gives samples of the candidate translations produced by the models conditioned on different discrete codes, compared to the candidates produced by beam search. We can see that the candidate translations produced by beam search have only minor grammatical differences. In contrast, the translation results sampled with the syntactic coding model have drastically different grammars. Examining the results, we found that the syntax-based model tends to produce one translation in active voice and another in passive voice.
To summarize, we show that a diverse set of translations can be obtained with sentence codes when a reasonable external mechanism is used to produce the codes. When a good syntax parser is available, the syntax-based approach works better in terms of diversity. The source code for extracting discrete codes from parse trees will be made publicly available.