Mixture Content Selection for Diverse Sequence Generation

Generating diverse sequences is important in many NLP applications such as question generation or summarization that exhibit semantically one-to-many relationships between source and the target sequences. We present a method to explicitly separate diversification from generation using a general plug-and-play module (called SELECTOR) that wraps around and guides an existing encoder-decoder model. The diversification stage uses a mixture of experts to sample different binary masks on the source sequence for diverse content selection. The generation stage uses a standard encoder-decoder model given each selected content from the source sequence. Due to the non-differentiable nature of discrete sampling and the lack of ground truth labels for binary mask, we leverage a proxy for ground truth mask and adopt stochastic hard-EM for training. In question generation (SQuAD) and abstractive summarization (CNN-DM), our method demonstrates significant improvements in accuracy, diversity and training efficiency, including state-of-the-art top-1 accuracy in both datasets, 6% gain in top-5 accuracy, and 3.7 times faster training over a state-of-the-art model. Our code is publicly available at https://github.com/clovaai/FocusSeq2Seq.


Introduction
Generating target sequences given a source sequence has applications in a wide range of problems in NLP with different types of relationships between the source and target sequences. For instance, paraphrasing or machine translation exhibit a one-to-one relationship because the source and the target should carry the same meaning. On the other hand, summarization or question generation exhibit one-to-many relationships because * Most work done during internship at Clova AI. Focus 3: in december 1878 , tesla left graz and severed all relations with his family to hide the fact that he dropped out of school . (Ours) ⇒ what did tesla do to hide he dropped out of school? Figure 1: Sample questions produced by our method from given passage-answer pair (answer is underlined). Our method generates diverse questions, by selecting different tokens to focus (colored) in contrast to 3mixture decoder (Shen et al., 2019) that generates 3 identical questions: "what did tesla do to hide the fact that he dropped out of school?". a single source often results in diverse target sequences with different semantics. Fig. 1 shows different questions that can be generated from a given passage.
Encoder-decoder models (Cho et al., 2014) are widely used for sequence generation, most notably in machine translation where neural models are now often almost as good as human translators in some language pairs. However, a standard encoder-decoder often shows a poor performance when it attempts to produce multiple, diverse outputs. Most recent methods for diverse sequence generation leverage diversifying decoding steps through alternative search algorithms (Fan et al., 2018;Vijayakumar et al., 2018) or mixture of decoders Shen et al., 2019). These methods promote diversity at the decoding step, while a more focused selection of the source sequence can lead to diversifying the semantics of the generated target sequences.
In this paper, we present a method for diverse generation that separates diversification and generation stages. The diversification stage leverages content selection to map the source to multiple sequences, where each mapping is modeled by focusing on different tokens in the source (oneto-many mapping). The generation stage uses a standard encoder-decoder model to generate a target sequence given each selected content from the source (one-to-one mapping). We present a generic module called SELECTOR that is specialized for diversification. This module can be used as a plug-and-play to an arbitrary encoder-decoder model for generation without architecture change.
The SELECTOR module leverages a mixture of experts (Jacobs et al., 1991;Eigen et al., 2014) to identify diverse key contents to focus on during generation. Each mixture samples a sequential latent variable modeled as a binary mask on every source sequence token. Then an encoder-decoder model generates multiple target sequences given these binary masks along with the original source tokens. Due to the non-differentiable nature of discrete sampling, we adopt stochastic hard-EM for training SELECTOR. To mitigate the lack of ground truth annotation for the mask (content selection), we use the overlap between the source and target sequences as a simple proxy for the ground-truth mask.
We experiment on question generation and abstractive summarization tasks and show that our method achieves the best trade-off between accuracy and diversity over previous models on SQuAD (Rajpurkar et al., 2016) and CNN-DM (Hermann et al., 2015;See et al., 2017) datasets. In particular, compared to the recently-introduced mixture decoder (Shen et al., 2019) that also aims to diversify outputs by creating multiple decoders, our modular method not only demonstrates better accuracy and diversity, but also trains 3.7 times faster.

Related Work
Diverse Search Algorithms Beam search, the most commonly used search algorithm for decoding, is known to produce samples that are short, contain repetitive phrases, and share majority of their tokens. Hence several methods are intro-duced to diversify search algorithms for decoding. Graves (2013); Chorowski and Jaitly (2017) tune temperature hyperparameter in softmax function. Vijayakumar et al. (2018); Li et al. (2016b) penalize similar samples during beam search in order to obtain diverse set of samples. Cho (2016) adds random noise to RNN decoder hidden states. Fan et al. (2018) sample tokens from top-k tokens at each decoding step. Our method is orthogonal to these search-based strategies, in that they diversify decoding while our method diversifies which content to be focused during encoding. Moreover, our empirical results show diversification with stochastic sampling hurts accuracy significantly.
Deep Mixture of Experts Several methods adopt a deep mixture of experts (MoE) (Jacobs et al., 1991;Eigen et al., 2014) to diversify decoding steps. Yang et al. (2018) introduce soft mixture of softmax on top of the output layer of RNN language model. ; Shen et al. (2019) introduce mixture of decoders with uniform mixing coefficient to improve diversity in machine translation. Among these, the closest to ours is the mixture decoder (Shen et al., 2019) that also adopts hard-EM for training, where a minimum-loss predictor is assigned to each data point, which is also known as multiple choice learning (Guzman-Rivera et al., 2012;Lee et al., 2016). While Shen et al. (2019) makes RNN decoder as a MoE, we make SELECTOR as a MoE to diversify content selection and let the encoderdecoder models one-to-one generation. As shown in our empirical results, our method achieves a better accuracy-diversity trade-off while reducing training time significantly.
Variational Autoencoders Variational Autoencoders (VAE) (Kingma and Welling, 2013) are used for diverse generation in several tasks, such as language modeling (Bowman et al., 2016), machine translation Su et al., 2018;Shankar and Sarawagi, 2019), and conversation modeling (Serban et al., 2017;Wen et al., 2017;Zhao et al., 2017;Park et al., 2018;Wen and Luong, 2018;Gu et al., 2019). These methods sample diverse latent variables from an approximate posterior distribution, but often suffer from a posterior collapse where the sampled latent variables are ignored (Bowman et al., 2016;Park et al., 2018;Kim et al., 2018b;  Figure 2: Overview of diverse sequence-to-sequence generation methods. (a) refers to our two-stage approach described throughout Section. 3, (b) refers to search-based methods (Vijayakumar et al., 2018;Li et al., 2016b;Fan et al., 2018), and (c) refers to mixture decoders (Shen et al., 2019;. Dieng et al., 2018;Xu and Durrett, 2018;He et al., 2019;Razavi et al., 2019). This is also observed in our initial experiments and Shen et al. (2019), where MoE-based methods significantly outperforms VAE-based methods because of the posterior collapse. Moreover, we observe that sampling mixtures makes training more stable compared to stochastic sampling of latent variables. Furthermore, our latent structure as a sequence of binary variables is different from most VAEs which use a fixed-size continuous latent variable. This gives a finer-grained control and interpretability on where to focus, especially when source sequence is long.
Diversity-Promoting Regularization Adding regularization to objective functions is used to diversify generation. Li et al. (2016a) introduce a term maximizing mutual information between source and target sentences. Chorowski and Jaitly (2017); Kalyan et al. (2018) introduce terms enforcing knowledge transfer among similar annotations. Our work is orthogonal to these methods and can potentially benefit from adding these regularization terms to our objective function.
Content Selection in NLP Selecting important parts of the context has been an important step in NLP applications (Reiter and Dale, 2000).  (Vinyals et al., 2015) for extracting key phrases for question generation and Gehrmann et al. (2018) use content selector to limit copying probability for abstractive summarization. The main purpose of these approaches is to enhance accuracy, while our method uses diverse content selection to enhance both accuracy and diversity (refer to our empirical results). Additionally, our method allows models to learn how to utilize information from the selected content, whereas Gehrmann et al. (2018) manually limit the copying mechanism on non-selected contents.

Method
In this paper, we focus on sequence generation tasks such as question generation and summarization that have one-to-many relationship. More formally, given a source sequence x = (x 1 . . . x S ) ∈ X , our goal is to model a conditional multimodal distribution for the target sequence p(y|x) that assigns high values on p(y 1 |x) . . . p(y K |x) for K valid mappings x → y 1 . . . x → y K . For instance, Fig. 1 illustrates K = 3 different valid questions generated for a given passage.
For learning a multimodal distribution, it is not appropriate to use a generic encoder-decoder (Cho et al., 2014) that minimizes for the expected value of the log probability of all the valid mappings.
This can lead to a suboptimal mapping that is in the middle of the targets but not near any of them. As a solution, we propose to (1) introduce a latent variable called focus that factorizes the distribution into two stages, select and generate (Section 3.1), and (2) independently train the factorized distributions (Section 3.2).

Select and Generate
In order to factorize the multimodal distribution into the two stages (select and generate), we introduce a latent variable called focus. The intuition is that in the select stage we sample several meaningful focus, each of which indicates which part of the source sequence should be considered important. Then in the generate stage, each sampled focus biases the generation process towards being conditioned on the focused content. Formally, we model focus with a sequence of binary variable, each of which corresponds to each token in the input sequence, i.e. m = {m 1 . . . m S } ∈ {0, 1} S . The intuition is that m t = 1 indicates t-th source token x t should be focused during sequence generation. For instance, in Fig. 1, colored tokens (green, red, or blue) show that different tokens are focused (i.e. values are 1) for different focus samples (out of 3). We first use the latent variable m to factorize p(y|x), where p φ (m|x) is selector and p θ (y|x, m) is generator. The factorization separates focus selection from generation so that modeling multimodality (diverse outputs) can be solely handled in the select stage and generate stage can solely concentrate on the generation task itself. We now describe each component in more details.
Selector In order to directly control the diversity of the SELECTOR's outputs, we model it as a hard mixture of experts (hard-MoE) (Jacobs et al., 1991;Eigen et al., 2014), where each expert specializes in focusing on different parts of the source sequences. In Fig. 1 and 2, focus produced by each SELECTOR expert is colored differently. We introduce a multinomial latent variable z ∈ Z, where Z = {1 . . . K}, and let each focus m be assigned to one of K experts with uniform prior p(z|x) = 1 K . With this mixture setting, p(m|x) is recovered as follows.
We model SELECTOR with a single-layer Bidirectional Gated Recurrent Unit (Bi-GRU) (Cho et al., 2014) followed by two fully-connected layers and a Sigmoid activation. We feed (current hidden state h t , first and last hidden state h 1 , h S , and expert embedding e z ) to a fully-connected layers (FC). Expert embedding e z is unique for each expert and is trained from a random initialization. From our initial experiments, we found this parallel focus inference to be more effective than auto-regressive pointing mechanism (Vinyals et al., 2015;Subramanian et al., 2018). The distribution of the focus conditioned on the input x and the expert id z is the Bernoulli distribution of the resulting values, To prevent low quality experts from dying during training, we let experts share all parameters, except for the individual expert embedding e z . We also reuse the word embedding of the generator in the word embedding of SELECTOR to promote cooperative knowledge sharing between SELECTOR and generator. With these parameter sharing techniques, adding a mixture of SELECTOR experts increases only a slight amount of parameters (GRU, FC, and e z ) to sequence-to-sequence models.
Generator For maximum diversity, we sample one focus from each SELECTOR expert to approximate p φ (m|x). For a deterministic behavior, we threshold o z t with a hyperparameter th instead of sampling from the Bernoulli distribution. This gives us a list of focus m 1 . . . m K coming from K experts. Each focus m z = (m z 1 . . . m z S ) is encoded as embeddings and concatenated with the input embeddings of the source sequence x = (x 1 . . . x S ). An off-the-shelf generation function such as encoder-decoder can be used for modeling p(y|m, x), as long as it accepts a stream of input embeddings. We use an identical generation function with K different focus samples to produce K different diverse outputs.

Training
Marginalizing the Bernoulli distribution in Eq. 3 by enumerating all possible focus is intractable since the cardinality of focus space 2 S grows exponentially with source sequence length S. Policy gradient (Williams, 1992;Yu et al., 2017) or Gumbel-softmax (Jang et al., 2017;Maddison et al., 2017) are often used to propagate gradients through a stochastic process, but we empirically found that these do not work well. We instead create focus guide and use it to independently and directly train the SELECTOR and the generator. Formally, a focus guide m guide = (m ) is a simple proxy of whether a source token is focused during generation. We set t-th focus guide m guide t to 1 if t-th source token x t is focused in target sequence y and 0 otherwise. During training, m guide acts as a target for SELECTOR and is a given input for generator (teacher forcing). During inference,m is sampled from SELECTOR and fed to the generator.
In question generation, we set m guide t to 1 if there is a target question token which shares the same word stem with passage token x t . Then we set m guide t to 0 if x t is a stop word or is inside the answer phrase. In abstractive summarization, we generate focus guide using copy target generation by Gehrmann et al. (2018), where they set source document token x t is copied if it is part of the longest possible subsequence that overlaps with the target summary.
Alg. 1 describes the overall training process, which first uses stochastic hard-EM (Neal and Hinton, 1998;Lee et al., 2016) for training the SE-LECTOR and then the canonical MLE for training the generator. E-step (line 2-5 in Alg. 1) we sample focus from all experts and compute their losses − log p φ (m guide |x, z). Then we choose an expert z best with minimum loss.
M-step (line 6 in Alg. 1) we only update the parameters of the chosen expert z best .
Training Generator (line 7-8 in Alg. 1) The generator is independently trained using conventional teacher forcing, which minimizes − log p θ (y|x, m guide ).

Experimental Setup
We describe our experimental setup for question generation (Section 4.1) and abstractive summarization (Section 4.2). For both tasks, we use an off-the-shelf, task-specific encoder-decoder-based model for the generator and show how adding SE-LECTOR can help to diversify the output of an arbitrary generator. To evaluate the contribution of SELECTOR we additionally compare our method with previous diversity-promoting methods as the baseline (Section 4.3).

Question Generation
Question generation is the task of generating a question from a passage-answer pair. Answers are given as a span in the passage (See Fig.1).
Dataset We conduct experiments on SQuAD (Rajpurkar et al., 2016) and use the same dataset split of Zhou et al. (2017a), resulting in 86,635, 8,965, and 8,964 source-target pairs for training, validation, and test, respectively. Both source passages and target questions are single sentences. The average length of source passage and target question are 32 and 11 tokens.
Generator We use NQG++ (Zhou et al., 2017a) as the generator, which is an RNN-based encoderdecoder architecture with copying mechanism .

Abstractive Summarization
Abstractive summarization is the task of generating a summary sentence from a source document that consists of multiple sentences.
Dataset We conduct experiments on the nonanonymized version of CNN-DM dataset (Hermann et al., 2015;See et al., 2017), whose training, validation, test splits have size of 287,113, 13,368, and 11,490 source-target pairs, respectively. The average length of the source documents and target summaries are 386 and 55 tokens. Following See et al. (2017), we truncate source and target sentences to 400 and 100 tokens during training.
Generator We use Pointer Generator (PG) (See et al., 2017) as the generator for summarization, which also leverages RNN-based encoder-decoder architecture and copying mechanism, and uses coverage loss to avoid repetitive phrases.

Baselines
For each task, we compare our method with other techniques which promote diversity at the decoding step. In particular, we compare with recent diverse search algorithms including Truncated Sampling (Fan et al., 2018), Diverse Beam Search (Vijayakumar et al., 2018), and Mixture Decoder (Shen et al., 2019). We implement these methods with NQG++ and PG. For each method, we generate K = (3 and 5) hypotheses from each source sequence. For search-based baselines (Fan et al., 2018;Vijayakumar et al., 2018), we select the top-k candidates after generation. For mixture models (Shen et al. (2019) and ours), we conduct greedy decoding from each mixture for fair comparison with search-based methods in terms of speed/memory usage.
Beam Search This baseline keeps K hypotheses with highest log-probability scores at each decoding step.
Diverse Beam Search This baseline adds a diversity promoting term to log-probability when scoring hypotheses in beam search, Following Vijayakumar et al. (2018), we use hamming diversity and diversity strength λ = 0.5.
Truncated Sampling This baseline randomly samples words from top-10 candidates of the distribution at the decoding step (Fan et al., 2018).
Mixture Decoder This baseline constructs a hard-MoE of K decoders with uniform mixing coefficient (referred as hMup in Shen et al. (2019)) and conducts parallel greedy decoding. All decoders share all parameters but use different embeddings for start-of-sequence token.
Mixture Selector (Ours) We construct a hard-MoE of K SELECTORs with uniform mixing coefficient that infers K different focus from source sequence. Guided by K focus, generator conducts parallel greedy decoding.

Metrics: Accuracy and Diversity
We use metrics introduced by previous works (Ott et al., 2018;Vijayakumar et al., 2018;Zhu et al., 2018) to evaluate the diversity promoting approaches. These metrics are extensions over BLEU-4 (Papineni et al., 2002) and ROUGE-2 F 1score (Lin, 2004) and aim to evaluate the trade-off between accuracy and diversity.
Top-1 metric (⇑) This measures the Top-1 accuracy among the generated K-best hypotheses. The accuracy is measured using a corpus-level metric, i.e., BLEU-4 or ROUGE-2.

Oracle metric (⇑)
This measures the quality of the target distribution coverage among the Top-K generated target sequences (Ott et al., 2018;Vijayakumar et al., 2018). Given an optimal ranking method (oracle), this metric measures the upper bound of Top-1 accuracy by comparing the best hypothesis with the target. Concretely, we generate hypotheses {ŷ 1 . . .ŷ K } from each source x and keep the hypothesisŷ best that achieves the best sentence-level metric with the target y. Then we calculate a corpus-level metric with the greedily-selected hypotheses {ŷ (i),best } N i=1 and references Pairwise metric (⇓) Referred as self- (Zhu et al., 2018) or pairwise- (Ott et al., 2018) metric, this measures the within-distribution similarity. This metric computes the average of sentencelevel metrics between all pairwise combinations of hypotheses {ŷ 1 . . .ŷ K } generated from each source sequence x. Low pairwise metric indicates high diversity between generated hypotheses.

Human Evaluation Setup
We ask Amazon Mechanical Turkers (AMT) to compare our method with the baselines. For each method, we generate three questions / summaries from 100 passages sampled from SQuAD / CNN-DM test set. For every pair of methods, annotators are instructed to pick a set of questions / summaries that are more diverse. To evaluate accuracy, they see one question / summary selected out of 3 questions / summaries with highest logprobability from each method. They are instructed to select a question / summary that is more coherent with the source passage / document. Each annotator is asked to choose either a better method (resulting in "win" or "lose") or "tie" if their quality is indistinguishable. Diversity and accuracy evaluations are conducted separately, and every pair of methods are presented to 10 annotators 1 .

Implementation details
For all experiments, we tie the weights (Press and Wolf, 2017) of the encoder embedding, the decoder embedding, and the decoder output layers. This significantly reduces the number of parameters and training time until convergence. We train up to 20 epochs and select the checkpoint with the best oracle metric. We use Adam (Kingma and Ba, 2015) optimizer with learning rate 0.001 and momentum parmeters β 1 = 0.9 and β 2 = 0.999. Minibatch size is 64 and 32 for question generation and abstractive summarization. All models are implemented in PyTorch (Paszke et al., 2017) and trained on single Tesla P40 GPU, based on NAVER Smart Machine Learning (NSML) platform (Kim et al., 2018a). SELECTOR The GRU has the same size as the generator encoder, and the dimension of expert embedding e z is 300 for NQG++, and 128 for PG. From simple grid search over [0.1, 0.5], we obtain focus binarization threshold th 0.15. The size of focus embedding for the generator is 16.

Results
Diversity vs. Accuracy Trade-off Tables 1 and 2 compare our method with different diversitypromoting techniques in question generation and abstractive summarization. The tables show that our mixture SELECTOR method outperforms all baselines in Top-1 and oracle metrics and achieves the best trade-off between diversity and accuracy. Moreover, both mixture models are superior to all search-based methods in the trade-off between diversity and accuracy. Standard and diverse beam search methods score low both in terms of accuracy and diversity. Truncated sampling shows the lowest self-similarity (high diversity), but it achieves the lowest score on Top-1 accuracy. Notably, our method scores state-of-the-art BLEU-4 in question generation on SQuAD and ROUGE comparable to state-of-the-art methods in abstractive summarization in CNN-DM (See also Table 4 for state-of-the-art results in CNN-DM).

Diversity vs. Number of Mixtures
Here we compare the effect of number of mixtures in our SELECTOR and Mixture Decoder (Shen et al., 2019). Tables 1 and 2 show that pairwise similarity increases (diversity ⇓) when the number of mixtures increases for Mixture Decoder. While we observe a similar trend for SELECTOR in the question generation task,   See et al. (2017), and the rest are from our experiments using PG as a generator. Method prefixes are the numbers of generated summaries for each document. Best scores are bolded. effect in the summarization task i.e., pairwise similarity decreases (diversity ⇑) for SELECTOR. The abstractive summarization task on CNN-DM has a target distribution with more modalities than question generation task on SQuAD, which is more difficult to model. We speculate that our SELECTOR improves accuracy by focusing on more modes of the output distribution (diversity ⇑), whereas Mixture Decoder tries to improve the accuracy by concentrating on fewer modalities of the output distribution (diversity ⇓). The bottom rows  of Tables 1 and 2 show the upper bound performance of SELECTOR by feeding focus guide to generator during test time. In particular, we assume that the mask is the ground truth overlap between input and target sequences at test time. The gap between the oracle metric (top-k accuracy) and the upper bound is very small for question generation. This indicates that the top-k masks for question generation include the ground truth mask. Future work involves improving the content selection stage for the summarization task.  Comparison with State-of-the-art Table 4 compares the performance of SELECTOR with the state-of-the-art bottom-up content selection of Gehrmann et al. (2018) in abstractive summarization. SELECTOR passes focus embeddings at the decoding step, whereas the bottom-up selection method only uses the masked words for the copy mechanism. We set K, the number of mixtures of SELECTOR, to 1 to directly compare it with the previous work (Bottom-Up (Gehrmann et al., 2018)). We observe that SELECTOR not only outperforms Bottom-Up in every metric, but also achieves a new state-of-the-art ROUGE-1 and ROUGE-L on CNN-DM. Moreover, our method scores state-of-the-art BLEU-4 in question generation on SQuAD (Table 1). Method R-1 R-2 R-L PG (See et al., 2017) 39.53 17.28 36.38 Bottom-Up (Gehrmann et al., 2018) Table 4: Comparison of single-expert selector with state-of-the-art abstractive summarization methods on CNN-DM. R stands for ROUGE (Lin, 2004) Efficient Training Table 5 shows that SELEC-TOR trains up to 3.7 times faster than mixture decoder (Shen et al., 2019). Training time of mixture decoder linearly increases with the number of decoders, while parallel focus inference of SELEC-TOR makes additional training time negligible.

Human Evaluation
Qualitative Analysis To analyze how the generator uses the selected content, we visualize attention heatmap of NQG++ in Fig.3 for question generation. The figure shows different attention fo rm ed in no ve m be r 19 90 by th e eq ua l m er ge r of sk y te le vi si on an d br iti sh sa te lli te br oa dc as tin g, bs ky b be ca m e th e uk 's la rg es t di gi ta l su bs cr ip tio n te le vi si on co m pa ny .
who was the merger of sky television and british satellite broadcasting ?
fo rm ed in no ve m be r 19 90 by th e eq ua l m er ge r of sk y te le vi si on an d br iti sh sa te lli te br oa dc as tin g, bs ky b be ca m e th e uk 's la rg es t di gi ta l su bs cr ip tio n te le vi si on co m pa ny .
what team won the uk ?
fo rm ed in no ve m be r 19 90 by th e eq ua l m er ge r of sk y te le vi si on an d br iti sh sa te lli te br oa dc as tin g, bs ky b be ca m e th e uk 's la rg es t di gi ta l su bs cr ip tio n te le vi si on co m pa ny .

Conclusion
We introduce a novel diverse sequence generation method via proposing a content selection module, SELECTOR. Built upon mixture of experts and hard-EM training, SELECTOR identifies different key parts on source sequence to guide generator to output a diverse set of sequences.
SELECTOR is a generic plug-and-play module that can be added to an existing encoder-decoder model to enforce diversity with a negligible additional computational cost. We empirically demonstrate that our method improves both accuracy and diversity and reduces training time significantly compared to baselines in question generation and abstractive summarization. Future work involves incorporating SELECTOR for other generation tasks such as diverse image captioning.