DORB: Dynamically Optimizing Multiple Rewards with Bandits

Policy gradients-based reinforcement learning has proven to be a promising approach for directly optimizing non-differentiable evaluation metrics for language generation tasks. However, optimizing for a specific metric reward leads to improvements mostly in that metric only, suggesting that the model is gaming the formulation of that metric in a particular way, often without achieving real qualitative improvements. Hence, it is more beneficial to make the model optimize multiple diverse metric rewards jointly. While appealing, this is challenging because one needs to manually decide the importance and scaling weights of these metric rewards. Further, it is important to consider using a dynamic combination and curriculum of metric rewards that flexibly changes over time. Considering the above aspects, in our work, we automate the optimization of multiple metric rewards simultaneously via a multi-armed bandit approach (DORB), where at each round, the bandit chooses which metric reward to optimize next, based on expected arm gains. We use the Exp3 algorithm for bandits and formulate two approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit). We empirically show the effectiveness of our approaches via various automatic metrics and human evaluation on two important NLG tasks: question generation and data-to-text generation, including on an unseen-test transfer setup. Finally, we present interpretable analyses of the learned bandit curriculum over the optimized rewards.


Introduction
Recent advancements in end-to-end neural network-based approaches have shown wide success in various sequence generation tasks: machine translation (Sutskever et al., 2014; Luong et al., 2015), dialogue systems (Vinyals and Le, 2015; Serban et al., 2016), textual summarization (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017), image/video captioning (Bahdanau et al., 2015; Venugopalan et al., 2015; Pasunuru and Bansal, 2017a), question generation (Du et al., 2017; Du and Cardie, 2018; Zhang and Bansal, 2019), etc. In all of these tasks, cross-entropy loss optimization has been widely used as the standard optimization approach (Sutskever et al., 2014), but this approach suffers from the exposure-bias issue (Ranzato et al., 2016) and does not optimize for the non-differentiable automatic evaluation metrics that measure the quality of the generated sequence. Recently introduced policy gradient-based reinforcement learning approaches address these issues for sequence generation tasks by directly optimizing the non-differentiable evaluation metrics (Zaremba and Sutskever, 2015; Ranzato et al., 2016; Rennie et al., 2017).
However, optimizing for a particular metric/reward via policy gradient-based approaches often leads to improvement in mostly that specific metric, suggesting that this approach is gaming the metrics (Paulus et al., 2018). Weighted averages of multiple metrics or surrogate rewards have been explored, but these approaches have to deal with finding the optimal scale balance across different metrics. One can alternatively optimize multiple metrics via a mixing ratio (Pasunuru and Bansal, 2018), but this still needs careful tuning of the mixing ratio. Moreover, all these reward approaches are fixed and do not change over training, and all the metrics may not be important at every stage of the training. Thus, it might be useful to consider using a dynamic combination of metrics: which rewards to use early vs. later, or which rewards might be useful to come back to later in training, while considering the context of the full history of rewards, as well as the model's current state and the nature of the metric.
To this end, we present a multi-armed bandit approach (which we name the DORB framework) where the arms of the bandit are the choices of the metrics that we want to optimize as rewards. At every round, the bandit chooses the next possible metric to optimize based on its previous performance history over these metrics, hence allowing the automatic learning of an optimal curriculum of rewards. We explore this approach in the context of exploration vs. exploitation via the Exp3 algorithm (Auer et al., 2002b) with two novel approaches for bandit rewards: (1) Single Multi-reward Bandit (SM-Bandit); (2) Hierarchical Multi-reward Bandit (HM-Bandit). First, we present a reward scaling approach to maintain the metric rewards' range in [0, 1]. Next, we present our SM-Bandit, where at each round, the bandit's reward is based on the performance improvement from multiple sources. Here, we use the average of all the scaled metric rewards from multiple sources as the final reward to the bandit. Finally, we present our HM-Bandit, which consists of a single first-level controller, as well as K second-level multi-armed bandits. The first-level controller's goal is to find the under-performing reward metric, while the second-level bandits' goal is to trigger the specific metric optimizer that will lead to a promising improvement in this specific metric.
We validate the effectiveness of our approaches on two important generation tasks: question generation and data-to-text generation (including an unseen-test transfer setup) via both automatic evaluation metrics and human evaluation. For question generation, we present results on the SQuAD QG dataset (Du et al., 2017), and for data-to-text NLG, we choose the WebNLG dataset (Gardent et al., 2017). We show that our bandit-based approaches perform statistically significantly better (based on human evaluation) than strong single-reward-based RL models as well as non-bandit multi-reward methods such as the multi-task approach of Pasunuru and Bansal (2018). We further present various interpretable analyses of our bandit progress and learned rewards curriculum over different bandit approaches.

Related Works
Policy Gradient and Generative Models: Neural sequence-to-sequence models with cross-entropy optimization, potentially with attention mechanism (Bahdanau et al., 2015) and pointer-copy mechanism (See et al., 2017; Gulcehre et al., 2016; Vinyals et al., 2015a; Merity et al., 2018), are widely used in language generation tasks such as machine translation (Sutskever et al., 2014; Luong et al., 2015), abstractive summarization (Chopra et al., 2016; Nallapati et al., 2016), question generation (Du et al., 2017; Zhang and Bansal, 2019), video/image captioning (Xu et al., 2015; Vinyals et al., 2015b; Pasunuru and Bansal, 2017a; Zhou et al., 2018), as well as sentence simplification (Zhang and Lapata, 2017; Guo et al., 2018). However, often the final metrics of interest are not differentiable, and thus not compatible with standard maximum-likelihood-based training. Motivated by this, there has recently been a surge in applications of reinforcement learning techniques to language generation (Ranzato et al., 2016), in which the gradients of non-differentiable metrics are approximated using the scoring function (REINFORCE (Williams, 1992)). A few successful examples include image captioning (Rennie et al., 2017; Ren et al., 2017), abstractive summarization (Paulus et al., 2018; Chen and Bansal, 2018; Pasunuru and Bansal, 2018; Celikyilmaz et al., 2018), machine translation (Wu et al., 2016; Gu et al., 2017), sentence simplification (Zhang and Lapata, 2017), as well as video captioning (Pasunuru and Bansal, 2017b). Previous works have explored the problem of optimizing multiple rewards in the context of machine translation (Neubig and Watanabe, 2016). For example, the works of Duh et al. (2012) and Sankaran et al. (2013) are based on the theory of Pareto optimality. Our approach, instead, dynamically decides the trade-off among metrics, rather than exploring the set of static Pareto-optimal hypotheses.
The most related work along this line is Pasunuru and Bansal (2018), which simultaneously optimizes multiple rewards in an alternate fashion for abstractive summarization. In our work, we use a multi-armed bandit framework to dynamically switch among multiple diverse reward optimizations in the context of policy-gradient-based generative models. 1
Multi-Armed Bandit: Many control problems can be cast as multi-armed bandit problems, where the goal is to select a sequence of arms/actions in order to optimize a certain objective (e.g., expected future payoff) (Bubeck et al., 2012). One widely studied problem in the multi-armed bandit literature is finding the optimal trade-off between exploration and exploitation (Audibert et al., 2009; Macready and Wolpert, 1998; Auer et al., 2002a; Kveton et al., 2019; Bubeck et al., 2012). Some widely used bandit algorithms include ε-greedy (Sutton and Barto, 2018), Boltzmann exploration (Kaelbling et al., 1996), UCB (Auer et al., 2002a), Thompson sampling (Chapelle and Li, 2011), contextual bandits (Sharaf and Daumé III, 2019), as well as the Exp3 adversarial bandit (Auer et al., 2002b). In this work, we use Exp3, and a hierarchical version of it, for the problem of optimizing multiple rewards. 2 Multi-armed bandit algorithms have been used in a wide range of applications, such as online advertising (Chen et al., 2013), recommendation (Li et al., 2010), multi-task task selection (Guo et al., 2019a), and hyper-parameter optimization (Li et al., 2018; Merentitis et al., 2018). Recently, Graves et al. (2017) apply a non-stationary multi-armed bandit (in particular, the Exp3.S algorithm) to select an adaptive policy (curriculum) that a neural network follows to maximize learning efficiency. Sharma and Ravindran (2017) use multi-armed bandit sampling to choose which domain data (harder vs. easier) to feed as input to a single model (using different Atari games).
To our knowledge, we are the first ones to apply a multi-armed bandit to optimize multiple rewards in the context of text generation.

Multi-Reward Optimization
In this section, we first describe the policy gradients-based reinforcement learning (RL) approach for text generation tasks, and then discuss the need for a better multi-reward optimization approach for RL in the context of generation tasks. Lastly, we introduce our novel methods for multi-reward optimization via multi-armed bandits. Glossary: Agent: RL policy gradients; Bandit: multi-armed bandit; Controller: controller in HM-Bandit (see Fig. 2).
2 In our initial experiments, we experimented with a few other bandit approaches (UCB, contextual bandit, and variants of Exp3, e.g., Exp3.S), but we settled on our current Exp3 setting for performance and stability reasons within the scope of our methods and tasks.
Policy Gradient Background. Cross-entropy loss optimization is traditionally used for sequence generation tasks. However, the recent policy gradient-based reinforcement learning approach has shown two advantages over cross-entropy loss optimization: (1) it avoids the exposure-bias issue, i.e., the mismatch between the output distributions created by the different train-time and test-time decoding procedures under cross-entropy training; (2) it is able to directly optimize the non-differentiable evaluation metrics.
To this end, the REINFORCE algorithm (Williams, 1992; Zaremba and Sutskever, 2015) is used to learn a policy p_θ defined by the model parameters θ to predict the next action (tokens in our setup). Specifically, instead of minimizing the negative log-likelihood, we minimize the following loss:

L^{RL}(\theta) = -\mathbb{E}_{w^s \sim p_\theta}[r(w^s)]

where w^s is the sequence of sampled tokens and r(·) is the reward function that measures the quality of w^s. The derivative of this loss function can then be approximated using a single sample along with a baseline estimator \hat{b} to reduce variance:

\nabla_\theta L^{RL}(\theta) \approx -(r(w^s) - \hat{b}) \nabla_\theta \log p_\theta(w^s)

There are several ways to calculate the baseline estimator; in this work, we use the SCST mechanism (Rennie et al., 2017), which uses the reward of the greedy-decoded sequence as \hat{b}.
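As a minimal sketch, the SCST-style objective above can be computed for one sampled sequence as follows (the function name and the per-sequence scalar interface are illustrative assumptions; a real implementation computes this over batches inside an autodiff framework):

```python
def reinforce_scst_loss(log_probs, sampled_reward, greedy_reward):
    """REINFORCE loss with a self-critical (SCST) baseline.

    log_probs: list of log p_theta(w_t | w_<t) for the sampled sequence w^s
    sampled_reward: r(w^s), the reward of the sampled sequence
    greedy_reward: r(w_hat), the reward of the greedy-decoded sequence,
                   used as the baseline estimator b_hat (Rennie et al., 2017)
    Returns the scalar loss whose gradient (w.r.t. the log-probs)
    approximates -(r(w^s) - b_hat) * grad log p_theta(w^s).
    """
    advantage = sampled_reward - greedy_reward  # (r(w^s) - b_hat)
    return -advantage * sum(log_probs)
```

When the sampled sequence beats the greedy one (positive advantage), the loss pushes up the sampled tokens' log-probabilities; when it does worse, it pushes them down.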
Need for a better multi-reward optimization.
Often, an RL agent can improve the policy p_θ via multiple reward sources. However, efficient ways of optimizing multiple rewards in a policy gradient-based reinforcement learning setup have been less explored. Previous works have either explored using a weighted combination of multiple rewards (Zhang and Lapata, 2017; Li et al., 2016) or optimizing multiple rewards in an alternate fashion inspired by the multi-task learning setup (Pasunuru and Bansal, 2018). However, these approaches have the disadvantage of requiring either tuning of the reward-combination weights or a static, hand-tuned mixing ratio for the alternate optimization.
To this end, we explore multi-reward optimization via a multi-armed bandit approach (Bubeck et al., 2012; Lattimore and Szepesvári, 2019; Burtini et al., 2015). During training, the bandit explores/exploits the choice of reward functions in order to improve the overall performance of the model. In the remaining part of this section, we discuss various multi-armed bandit-based models for multi-reward optimization (Sec. 3.1) and reward settings (Sec. 3.2). Then, we present the two novel approaches, namely Single Multi-reward Bandit (SM-Bandit, Sec. 3.3) and Hierarchical Multi-reward Bandit (HM-Bandit, Sec. 3.4).

Figure 1: Overview of our multi-armed bandit reward selection framework DORB. At each step, the model outputs are scored based on a reward function (metric), where the choice of the reward function is dynamically controlled by the multi-armed bandit. Then the corresponding optimization is executed based on the chosen reward function. Finally, the observed validation performance metrics are given as feedback to the bandit.

Multi-Armed Bandit for Multi-Reward Optimization
Given a set of K candidate actions (arms) {a_1, a_2, ..., a_K}, the objective of a multi-armed bandit problem is to maximize the rewards earned through a sequence of lever pulls (actions). We call this reward the bandit reward. We view the problem of optimizing multiple rewards as a sequential design of experiments (Robbins, 1952), where the bandit's goal is to decide the next arm (loss function) to pull after each round in order to maximize the rewards it earns. Let {R_1, R_2, ..., R_K} be a set of different rewards from K sources which can measure the model/policy's performance. To directly maximize the performance on these K rewards, we need K different reinforcement learning-based loss functions. Let the loss function for R_i be:

L^{RL}_i(\theta) = -\mathbb{E}_{w^s \sim p_\theta}[R_i(w^s)]

Each of these K loss functions is considered as an arm of the multi-armed bandit (i.e., the arms/joysticks in Fig. 1), where pulling the i-th arm results in optimizing the reinforcement-based loss function L^{RL}_i (i.e., in Fig. 1, the main model parameters get updated). The goal of the bandit is to explore and exploit different loss functions and maximize its reward (the validation performance of the model, see Fig. 1). One widely studied problem is the trade-off between "exploitation" of the arm with the highest estimated payoff and "exploration" of less-known arms. For this, we use the popular Exp3 bandit algorithm (Auer et al., 2002b) (see Appendix A for more details on Exp3).
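As a concrete sketch, a minimal Exp3 implementation is shown below (the class interface, the `gamma` value, and the method names are illustrative assumptions; see Auer et al. (2002b) and Appendix A for the exact formulation we use):

```python
import math
import random

class Exp3:
    """Minimal Exp3 adversarial bandit (Auer et al., 2002b)."""

    def __init__(self, num_arms, gamma=0.1):
        self.k = num_arms
        self.gamma = gamma                # exploration rate in (0, 1]
        self.weights = [1.0] * num_arms

    def probabilities(self):
        total = sum(self.weights)
        # Mix the normalized weight distribution with uniform exploration.
        return [(1 - self.gamma) * w / total + self.gamma / self.k
                for w in self.weights]

    def choose_arm(self):
        probs = self.probabilities()
        return random.choices(range(self.k), weights=probs)[0]

    def update(self, arm, reward):
        """`reward` is assumed to already be scaled into [0, 1] (Sec. 3.2)."""
        probs = self.probabilities()
        estimated = reward / probs[arm]   # importance-weighted reward estimate
        self.weights[arm] *= math.exp(self.gamma * estimated / self.k)
```

In DORB, each arm corresponds to one RL loss function L^{RL}_i, and `update` is called with the validation-based bandit reward after each round of optimization.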

Bandit Reward Settings
Note that in this work, we have two sets of rewards: rewards used for optimizing the sequence generation model via policy gradients-based reinforcement learning (R1 in Fig. 1, Sec. 3), and rewards used for the bandit (R2 in Fig. 1). The rewards for the generation model are used to optimize the model w.r.t. the metric of interest, while the rewards for the bandit help the bandit decide which "metric of interest" the generation model should optimize.
In order to maintain a consistent magnitude/scale across metric rewards while using them for bandits, we use scaled rewards via the quantiles of the reward history, following Graves et al. (2017). Let R_t = {r_1, r_2, ..., r_t} be the history of unscaled rewards up to time step t. Let q^{lo}_t and q^{hi}_t be the lower and upper quantiles of R_t, respectively. 3 Then, the scaled reward \tilde{r}_t is defined as follows:

\tilde{r}_t = \min\left(1, \max\left(0, \frac{r_t - q^{lo}_t}{q^{hi}_t - q^{lo}_t}\right)\right)   (4)
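A small sketch of this quantile-based scaling follows (the choice of the 20th/80th percentiles as the lower/upper quantiles, and the function name, are illustrative assumptions rather than the paper's tuned values):

```python
import statistics

def scale_reward(history, r):
    """Scale a raw reward r into [0, 1] using lower/upper quantiles of
    the reward history (following Graves et al., 2017).

    history: list of unscaled rewards r_1, ..., r_t observed so far
    """
    deciles = statistics.quantiles(history, n=10)  # 9 cut points
    q_lo, q_hi = deciles[1], deciles[7]            # 20th and 80th percentiles
    if q_hi == q_lo:
        return 0.5                                 # degenerate (flat) history
    scaled = (r - q_lo) / (q_hi - q_lo)
    return min(1.0, max(0.0, scaled))              # clip to [0, 1]
```

Rewards far above the history's upper quantile saturate at 1, and rewards far below the lower quantile saturate at 0, keeping all metrics on a comparable scale for the bandit.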

Single Bandit with Multi-Reward
Often, we want to optimize multiple metrics in our RL approach. For this, we have to give a joint reward, coming from multiple sources (metrics in our case), to the bandit as the bandit reward. One could give a weighted combination of the rewards from multiple sources as the reward to the bandit; however, tuning these weights is intractable if the number of reward sources is large. Here, we introduce a new approach called Single Multi-reward Bandit (SM-Bandit), which avoids tuning and uses rewards from multiple sources as feedback to the bandit. Let L^{RL}_1, L^{RL}_2, and L^{RL}_3 be the reinforcement learning-based loss functions corresponding to three arms of the bandit: arm_1, arm_2, and arm_3, respectively. If arm_2 is selected at round t, then we optimize L^{RL}_2, measure all the unscaled metric scores on the validation set, and then calculate the corresponding scaled rewards for each metric. We average these scaled rewards and give that average as the reward to the bandit. The generalization of this reward for a K-armed bandit is:

r_t = \frac{1}{K} \sum_{i=1}^{K} \tilde{r}^i_t

where r_t is the bandit reward at round t and \tilde{r}^i_t is the scaled reward (Eq. 4) for the metric corresponding to arm_i at round t. This approach allows us to avoid tuning the balancing weights across the metrics that we optimize, and ensures that the bandit is improving all metrics, as the bandit's goal is to maximize the average of all metrics. A detailed procedure of SM-Bandit is described in Algorithm 1.
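The SM-Bandit feedback can be sketched in a few lines (the function name is illustrative; the inputs are the scaled per-metric rewards from Eq. 4):

```python
def sm_bandit_reward(scaled_rewards):
    """SM-Bandit feedback: the bandit reward at round t is the average
    of the K scaled metric rewards, so no per-metric balancing weights
    need to be tuned."""
    assert len(scaled_rewards) > 0
    return sum(scaled_rewards) / len(scaled_rewards)
```

Because every arm pull is scored by the same average over all metrics, an arm is reinforced only when optimizing its loss improves the metrics jointly, not just its own metric.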

Hierarchical Bandit with Multi-Reward
The SM-Bandit described in Sec. 3.3 aims to improve all metrics using a single bandit. In this section, we introduce another bandit-based variant that improves all metrics by using multiple bandits controlled by a controller, called Hierarchical Multi-reward Bandit (HM-Bandit, Fig. 2). The HM-Bandit consists of a single first-level controller (not a bandit; top row in Fig. 2) and K second-level multi-armed bandits (middle row in Fig. 2). The first-level controller's goal is to find the under-performing reward metric, while the second-level bandits' goal is to trigger a specific metric optimizer that will lead to a promising improvement in this specific metric. More intuitively, the first-level controller sets the objective (e.g., ROUGE needs to be improved), while the second-level bandit decides which specific reward function can help accomplish the objective. A detailed procedure of our HM-Bandit is described in Algorithm 2. This concept is also loosely related to Bayesian model selection, where it is common to use a hierarchical specification of models (Rasmussen and Williams, 2005).

Figure 2: Overview of the hierarchical multi-armed bandit. The first level has a controller and the second level has bandits. The controller decides which bandit of the second level will be pulled. The second-level bandits then decide which metric to use as the reward function during RL optimization.
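A minimal sketch of this two-level structure is given below. The child bandits are stubbed with uniform weights (in our actual setup each child is an Exp3 bandit), and the rule "under-performing = lowest scaled validation score" as well as all names are illustrative assumptions, not the paper's exact specification:

```python
import random

class HMBandit:
    """Sketch of the hierarchical setup (Sec. 3.4): a first-level
    controller plus K second-level bandits, one per metric."""

    def __init__(self, num_metrics):
        self.k = num_metrics
        # One child bandit per metric; each child's arms are the K RL
        # loss functions. Stubbed here as uniform arm weights.
        self.child_weights = [[1.0] * num_metrics for _ in range(num_metrics)]

    def choose_metric(self, scaled_scores):
        """First level: the controller targets the currently
        under-performing metric (lowest scaled validation score)."""
        return min(range(self.k), key=lambda i: scaled_scores[i])

    def choose_arm(self, metric):
        """Second level: the chosen metric's child bandit picks which
        RL loss function to optimize next."""
        w = self.child_weights[metric]
        return random.choices(range(self.k), weights=w)[0]
```

Note that the child bandit for a metric is free to pick a *different* metric's loss function as its arm; as the analysis in Sec. 5.4 shows, this cross-metric triggering does occur in practice (e.g., the ROUGE-L child bandit favoring the QAP and QPP arms).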

Tasks and Setup
We use question generation and data-to-text generation tasks in our experiments. In this section, we discuss the details of these two tasks along with the experimental setup.

Question Generation
The goal of the question generation (QG) task is to generate a natural question that can be answered by the given answer span in a context. Recent works have applied seq2seq neural models for QG, e.g., generating the question given the answer sentence (Du et al., 2017; Zhou et al., 2017) or the whole paragraph (Du and Cardie, 2018; Song et al., 2018b; Liu et al., 2019a; Zhao et al., 2018; Kim et al., 2019; Sun et al., 2018). Many works also used RL to optimize specific metrics (Song et al., 2018a; Kumar et al., 2019; Yuan et al., 2017). Recently, Zhang and Bansal (2019) proposed semantics-enhanced rewards to improve the QG model, and also used the multi-reward approach proposed by Pasunuru and Bansal (2018) in their RL models.

Algorithm 2 HM-Bandit Training
1: Inputs: #rewards: K, #train steps: n_train, #steps in bandit round: n_bandit, #steps in controller round: n_controller
2: Create the controller C with K bandits
3: Initialize all bandits, and set j ← 0
4: B ← chooseBandit(C, j)    ▷ choose bandit at index j
5: a ← chooseArm(B)          ▷ based on Eqn. 5
6: i ← 0
7: while i < n_train do
8:   Sample word sequence w^s from the model
9:   Calculate rewards R_train based on w^s
10:  Optimize
     ...
     i ← i + 1
24: end while
Baseline. Given a paragraph p and an answer span a, the goal of the QG model is to generate a question q answering a. We follow the encoder-attention-decoder style architecture (see Fig. 1). The encoder is a bi-directional LSTM-RNN (Hochreiter and Schmidhuber, 1997) with self-attention, and the decoder is a uni-directional LSTM-RNN with attention (Luong et al., 2015) and pointer (Gu et al., 2016) mechanisms, similar to Zhang and Bansal (2019). The input to the model is a concatenation of contextualized word representations (BERT (Devlin et al., 2019)), answer tag embedding (BIO tagging scheme), Part-of-Speech (POS) tag embedding, and Named-Entity (NER) tag embedding.
Rewards. We use ROUGE-L, QPP, and QAP (Zhang and Bansal, 2019) as rewards for this task. QPP is calculated as the probability of the generated question being a paraphrase of the ground-truth question, via a classifier trained on Quora Question Pairs. QAP is calculated as the probability that a pre-trained QA model correctly answers the generated question.
Dataset & Evaluation. We use the SQuAD QG English dataset from Du et al. (2017) for the QG task, derived from SQuAD v1.1 (Rajpurkar et al., 2016), and the test set consists of 10% sampled examples from the training set, as the SQuAD test set is not open. For pre-processing, we do standard tokenization. We report on evaluation metrics including BLEU-4, METEOR, ROUGE-L, Q-BLEU1 (Nema and Khapra, 2018), as well as QPP and QAP (Zhang and Bansal, 2019).

Data-to-Text Generation
Data-to-text is the task of expressing the components (attributes and values) of a meaning representation (MR) as human-readable natural sentences. Previous work in this area includes templates (Reiter, 1995), rules (Reiter et al., 2005), pipelines (Reiter, 2007; Reiter and Dale, 1997), and probabilistic models (Liang et al., 2009).

Baseline. Given a set of Resource Description Framework (RDF) triples, 4 the task is to generate a natural language text describing the facts in the RDF data. Following Zhao et al. (2020), we serialize and reorder the RDF data as an intermediate planning step, and feed the plan into a seq2seq model with attention and copy mechanisms.
Rewards. We use BLEU, ROUGE-L, and Entailment-Score (Pasunuru and Bansal, 2018) as rewards. Entailment-Score is calculated based on the probability that the generated sentence is classified as an entailment w.r.t. the ground truth. 5

Table 1: Performance of our baselines and multi-armed bandit-based models on the question generation task. † denotes that these models use ROUGE-L, QPP, and QAP rewards during optimization.
Dataset & Evaluation. We use the WebNLG dataset (Gardent et al., 2017) -a widely used English benchmark for data-to-text generation which focuses on micro-planning involving several subtasks like referring expression generation, aggregation, lexicalization, sentence segmentation, and surface realization. It contains 9,674 unique RDF triple-sets and 25,298 text references, which is divided into train, dev, and test sets. 6 We report all our results on the 'seen' and 'unseen' part of the test set. For each sample, the input is a set of up to 7 RDF triples from DBPedia, and the output is their text descriptions. The standard evaluation metrics for this dataset include METEOR 7 (Denkowski and Lavie, 2014), BLEU (Papineni et al., 2002), and TER 8 (Snover et al., 2006). We also report ROUGE-L (Lin, 2004) and Entailment-Score (Pasunuru and Bansal, 2018).

Training Details
All the hyperparameters are tuned on the validation set for both the question generation and data-to-text tasks. We use TITAN X and GeForce GTX 1080 GPUs for all our experiments. For the question generation task, we use two layers for both the encoder and decoder. We set the hidden size of the LSTM-RNN to 600 and use BERT-based contextual embeddings as input. We use a batch size of 32, an encoder maximum length of 512, a decoder maximum length of 50, and maximum gradient clipping of 5. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3 and 1e-6 for the cross-entropy and RL models, respectively. For the data-to-text task, we use the same hyperparameters as discussed in Zhao et al. (2020) for the cross-entropy model, e.g., we use Adam with a batch size of 64 and an initial learning rate of 0.001. All RL models are initialized with the best cross-entropy model checkpoint, and use Adam with a learning rate of 1e-6. We refer to Appendix B for full training details.

6 https://webnlg-challenge.loria.fr/
7 http://www.cs.cmu.edu/~alavie/METEOR/
8 http://www.cs.umd.edu/~snover/tercom/

Results and Analysis
In this section, we present the performance of previous work, our cross-entropy baselines, our RL-based baselines, and finally our multi-armed bandit-based models. We start with results on automatic evaluation (Sec. 5.1-5.2). Next, we present results on human evaluation (Sec. 5.3). Finally, we present an interpretable analysis of the bandits (Sec. 5.4).

Results on Question Generation
Baselines. Table 1 presents results on the question generation dataset for our baselines. We use the previous state-of-the-art work (Zhang and Bansal, 2019) as our cross-entropy baseline. Next, we apply a policy gradients-based reinforcement learning (RL) approach, and observe that all these models are better than the baseline on all metrics. Next, we discuss the multi-reward RL models.
Multi-Armed Bandit Approaches. Finally, we evaluate our two bandit approaches, SM-Bandit and HM-Bandit, described in Sec. 3.3 and Sec. 3.4, respectively. Further, for a fair comparison with our multi-armed bandit-based models, we also implemented the multi-reward alternate optimization approach introduced by Pasunuru and Bansal (2018) and considered it as the baseline for our multi-reward models. 9,10 This model is slightly better than the single-reward-based RL baselines. Table 1 presents the performance of the two proposed bandit models (SM-Bandit and HM-Bandit) on various automatic evaluation metrics, and we observe that on average these models perform much better than the cross-entropy and single-reward RL baseline models. Further, our bandit models also perform better than the multi-reward approach proposed by Pasunuru and Bansal (2018), suggesting that our bandit-based models are able to dynamically select the reward to optimize for overall improvement across all the metrics that we want to optimize. Also see the discussion of significant improvements in human evaluation in Sec. 5.3.

Results on Data-to-Text Generation
Baselines. Table 2 presents our baselines on the WebNLG data-to-text task. Our cross-entropy model is comparable to the very recent state-of-the-art model (Zhao et al., 2020). Further, we present single-reward-based RL models with ROUGE-L, BLEU, and Entailment score as rewards, which again perform better than our cross-entropy model. Next, we discuss the multi-reward models.
Multi-Armed Bandit Approaches. Table 2 also presents our multi-armed bandit models (SM-Bandit and HM-Bandit), which simultaneously use ROUGE-L, BLEU, and Entailment score as rewards. Again, we consider the model proposed by Pasunuru and Bansal (2018) as the baseline for multi-reward models. On average, our bandit-based models perform better than all the baselines discussed in the above paragraph, including the model based on Pasunuru and Bansal (2018). 11 Also see the discussion of significant improvements in human evaluation in Sec. 5.3.

Human Evaluation
It has been shown that RL models can game the metric used as the objective function (Paulus et al., 2018). This motivated us to optimize the RL models on multiple metrics simultaneously, thus trying to improve all the metrics and making it hard for the RL model to game any particular metric. In this section, we validate the superiority of our bandit models via human evaluation studies. We performed anonymous human evaluation studies using Amazon Mechanical Turk (MTurk). We chose human annotators such that they are located in the USA, have at least 10,000 approved HITs, and have an approval rate greater than 98%. For both question generation and WebNLG data-to-text, we considered 200 samples each, and compared the ROUGE-L RL, Pasunuru and Bansal (2018), SM-Bandit, and HM-Bandit models by asking the annotators to rate the quality of the generated outputs based on relevance and coherence on a 5-point Likert scale. 12 Table 3 presents these human evaluation studies. In terms of relevance, our SM-Bandit and HM-Bandit models are significantly better than the Pasunuru and Bansal (2018) (p<0.01) and ROUGE-L RL (p<0.01) models on question generation, while maintaining coherence. 13 On data-to-text, in terms of relevance, our SM-Bandit and HM-Bandit models are significantly better than Pasunuru and Bansal (2018) with p<0.03 and p<0.02, respectively. Also, both bandit models are significantly better than the ROUGE-L RL model with p<0.01. We also performed a similar human evaluation study for the test-only transfer setup on the unseen WebNLG test set, and the results are in Table 4. Here also our bandit-based model (HM-Bandit) performed statistically significantly better than Pasunuru and Bansal (2018) on the relevance metric with p<0.01, while maintaining coherence.

12 For question generation, relevance is defined as how clearly the generated question points to the right answer, given an input paragraph as context. For WebNLG data-to-text, relevance is defined as how related the generated description is w.r.t. the given RDF data, e.g., mentioning the facts. For both tasks, coherence is based on the logic, readability, and fluency of the generated question or description.
13 We use the bootstrap test (Efron and Tibshirani, 1994; Noreen, 1989) for calculating statistical significance.

Bandit Analysis
Figure 4 presents an interpretable visualization of the probability distribution of each arm of the SM-Bandit as training progresses. We observe that each metric has played an important role (as a high-probability arm) for at least a few rounds over the training trajectory. Also, there are multiple switches among these metrics over the training trajectory, suggesting that this kind of automatic dynamic switching is important to improve the overall performance of RL models with multiple rewards. Figure 3 presents the progress of the child bandits of HM-Bandit during training for question generation. As discussed in Sec. 3.4, these child bandits are controlled by a controller that selects the under-performing bandit. We observe that our HM-Bandit mostly used the ROUGE-L child bandit for overall improvement in all metrics (as it is the under-performing metric). Further, each child bandit gave more importance to the metric that it wants to improve, e.g., the QAP child bandit gave more importance to the QAP arm. However, there is an exception for the ROUGE-L child bandit, where the ROUGE-L arm is not the most important, suggesting that other RL loss functions (QAP and QPP) are also useful for improving the ROUGE-L metric.

Conclusion
We presented novel approaches for dynamically optimizing multiple reward metrics simultaneously via a multi-armed bandit approach in the context of language generation. We described two such mechanisms, namely a single bandit and a hierarchical bandit with multiple rewards. We conducted experiments on two challenging language generation tasks, question generation and data-to-text generation, and our methods achieved strong improvements based on human evaluation over previous approaches. We further presented interpretable analyses of our bandit methods.