APRIL: Interactively Learning to Summarise by Combining Active Preference Learning and Reinforcement Learning

We propose a method to perform automatic document summarisation without using reference summaries. Instead, our method interactively learns from users’ preferences. The merit of preference-based interactive summarisation is that preferences are easier for users to provide than reference summaries. Existing preference-based interactive learning methods suffer from high sample complexity, i.e. they need to interact with the oracle for many rounds in order to converge. In this work, we propose a new objective function, which enables us to leverage active learning, preference learning and reinforcement learning techniques in order to reduce the sample complexity. Both simulation and real-user experiments suggest that our method significantly advances the state of the art. Our source code is freely available at https://github.com/UKPLab/emnlp2018-april.


Introduction
With the rapid growth of text-based information on the Internet, automatic document summarisation attracts increasing research attention from the Natural Language Processing (NLP) community (Nenkova and McKeown, 2012). Most existing document summarisation techniques require access to reference summaries to train their systems. However, obtaining reference summaries is very expensive: Lin (2004) reported that 3,000 hours of human effort were required for a simple evaluation of the summaries for the Document Understanding Conferences (DUC). Although previous work has proposed heuristics-based methods to summarise without reference summaries (Ryang and Abekawa, 2012;Rioux et al., 2014), the gap between their performance and the upper bound is still large: the ROUGE-2 upper bound of .212 on DUC'04 (P.V.S. and Meyer, 2017) is, for example, twice as high as Rioux et al.'s (2014) .114.
The Structured Prediction from Partial Information (SPPI) framework has been proposed to learn to make structured predictions without access to gold standard data (Sokolov et al., 2016b). SPPI is an interactive NLP paradigm: it interacts with a user for multiple rounds and learns from the user's feedback. SPPI can learn from two forms of feedback: point-based feedback, i.e. a numeric score for the presented prediction, or preference-based feedback, i.e. a preference over a pair of predictions. Providing preference-based feedback imposes a lower cognitive burden on humans than providing ratings or categorical labels (Thurstone, 1927; Kendall, 1948; Kingsley and Brown, 2010; Zopf, 2018). Preference-based SPPI has been applied to multiple NLP applications, including text classification, chunking and machine translation (Sokolov et al., 2016a; Kreutzer et al., 2017). However, SPPI has prohibitively high sample complexity in the aforementioned NLP tasks, as it needs at least hundreds of thousands of rounds of interaction to make near-optimal predictions, even with simulated "perfect" users. Figure 1a illustrates the workflow of preference-based SPPI.
To reduce the sample complexity, in this work, we propose a novel preference-based interactive learning framework, called APRIL (Active Preference ReInforcement Learning). APRIL goes beyond SPPI by proposing a new objective function, which divides the preference-based interactive learning problem into two phases (illustrated in Figure 1b): an Active Preference Learning (APL) phase (the right cycle in Figure 1b), and a Reinforcement Learning (RL) phase (the left cycle). We show that this separation enables us to query preferences more effectively and to use the collected preferences more efficiently, so as to reduce the sample complexity.

Figure 1: SPPI (a) and APRIL (b) in the EMDS use case. Notation details, e.g., ∆_x and r(y_n), are discussed in §3.
We apply APRIL to Extractive Multi-Document Summarisation (EMDS). The task of EMDS is to extract sentences from the original documents to build a summary under a length constraint. We accommodate multiple APL and RL techniques in APRIL and compare their performance under different simulation settings. We also compare APRIL to a state-of-the-art SPPI implementation using both automatic metrics and human evaluation. Our results suggest that APRIL significantly outperforms SPPI.

Related Work
RL has been previously used to perform EMDS without using reference summaries. Ryang and Abekawa (2012) formulated EMDS as a Markov Decision Process (MDP), designed a heuristics-based reward function considering both information coverage rate and redundancy level, and used the Temporal Difference (TD) algorithm (Sutton, 1984) to solve the MDP. In a follow-up work, Rioux et al. (2014) proposed a different reward function, which also did not require reference summaries; their experiments suggested that using their new reward function improved the summary quality. Henß et al. (2015) proposed a different RL formulation of EMDS and jointly used supervised learning and RL to perform the task. However, their method requires access to reference summaries. More recent works applied encoder-decoder-based RL to document summarisation (Ranzato et al., 2015; Narayan et al., 2018; Paulus et al., 2017; Pasunuru and Bansal, 2018). These works outperformed standard encoder-decoder models, as RL can directly optimise the ROUGE scores and can tackle the exposure bias problem. However, these neural RL methods all used ROUGE scores as their rewards, which in turn relied on reference summaries. APRIL can accommodate these neural RL techniques in its RL phase by using a ranking of summaries instead of the ROUGE scores as rewards. We leave neural APRIL for future study.
P.V.S. and Meyer (2017) proposed a bigram-based interactive EMDS framework. They asked users to label important bigrams in candidate summaries and used integer linear programming (ILP) to extract sentences covering as many important bigrams as possible. Their method requires no access to reference summaries, but it requires considerable human effort during the interaction: in simulation experiments, their system needed to collect up to 350 bigram annotations from a (simulated) user. In addition, they did not consider noise in users' annotations but simulated perfect oracles.
Preference learning aims at obtaining the ranking (i.e. total ordering) of objects from pairwise preferences (Fürnkranz and Hüllermeier, 2010). Simpson and Gurevych (2018) proposed to use an improved Gaussian process preference learning (Chu and Ghahramani, 2005) for learning to rank arguments in terms of convincingness from crowdsourced annotations. However, such Bayesian methods can hardly scale and suffer from high computation time. Zopf (2018) recently proposed to learn a sentence ranker from preferences. The resulting ranker can be used to identify the important sentences and thus to evaluate the quality of the summaries. His study also suggests that providing sentence preferences takes less time than writing reference summaries. APRIL not only learns a ranking over summaries from pairwise preferences, but also uses the ranking to "guide" our RL agent to generate good summaries.
There is a recent trend in machine learning to combine active learning, preference learning and RL, for learning to perform complex tasks from preferences (Wirth et al., 2017). The resulting algorithm is termed Preference-based RL (PbRL), and has been used in multiple applications, including training robots (Wirth et al., 2016) and Atari-playing agents (Christiano et al., 2017). SPPI and APRIL can both be viewed as PbRL algorithms. But unlike most PbRL methods that learn a utility function of the predictions (in EMDS, predictions are summaries) to guide the RL agent, APRIL is able to directly use a ranking of predictions to guide the RL agent without making assumptions about the underlying structure of the utility functions. This also enables APRIL to use non-utility-based preference learning techniques (e.g., Maystre and Grossglauser, 2017).

Background
In this section, we recap necessary details of SPPI, RL and preference learning, and adapt them to the EMDS use case, laying the foundation for APRIL.

The SPPI Framework
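Let $\mathcal{X}$ be the input space and let $\mathcal{Y}(x)$ be the set of possible outputs for input $x \in \mathcal{X}$. In EMDS, $x \in \mathcal{X}$ is a cluster of documents and $\mathcal{Y}(x)$ is the set of all possible summaries for cluster $x$. The function $\Delta_x : \mathcal{Y}(x) \times \mathcal{Y}(x) \to \{0, 1\}$ is the preference function such that $\Delta_x(y_i, y_j) = 1$ if the user believes $y_j$ is better than $y_i$ (denoted by $y_j \succ y_i$ or equivalently $y_i \prec y_j$), and 0 otherwise. Throughout this paper we assume that users do not equally prefer two different items. For a given $x$, the expected loss is:

$$L^{\mathrm{SPPI}}(w \mid x) = \mathbb{E}_{p_w(y_i, y_j \mid x)}\left[\Delta_x(y_i, y_j)\right] = \sum_{y_i, y_j \in \mathcal{Y}(x)} \Delta_x(y_i, y_j)\, p_w(y_i, y_j \mid x), \tag{1}$$

where $p_w(y_i, y_j \mid x)$ is the probability of querying the pair $(y_i, y_j)$. Formally,

$$p_w(y_i, y_j \mid x) = \frac{\exp\left[w^\top(\phi(y_i \mid x) - \phi(y_j \mid x))\right]}{\sum_{y_p, y_q \in \mathcal{Y}(x)} \exp\left[w^\top(\phi(y_p \mid x) - \phi(y_q \mid x))\right]}, \tag{2}$$

where $\phi(y \mid x)$ is the vector representation of $y$ given $x$, and $w$ is the weight vector to be learnt. Eq. (2) is a Gibbs sampling strategy: $w^\top(\phi(y_i \mid x) - \phi(y_j \mid x))$ can be viewed as the "utility gap" between $y_i$ and $y_j$. The sampling strategy $p_w$ encourages querying pairs with large utility gaps.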
To minimise L SPPI , SPPI uses gradient descent to update w incrementally. Alg. 1 presents the pseudo code of our adaptation of SPPI to EMDS. In the supplementary material, we provide a detailed derivation of ∇ w L SPPI (w|x).
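To make the sampling strategy and the loss concrete, the following is a minimal Python sketch on a toy candidate set. It assumes that the pair probability is proportional to the exponentiated utility gap and that the candidate set is small enough to enumerate all ordered pairs of distinct summaries; the exact gradient computed here stands in for the stochastic estimate used in practice. All names (`pair_probabilities`, `loss_gradient`, `FEATS`, `TRUE_UTILITY`, `toy_delta`) are hypothetical and not taken from the released code.

```python
import math
import random
from itertools import permutations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def diff(u, v):
    return [a - b for a, b in zip(u, v)]

def pair_probabilities(w, feats):
    """Gibbs distribution over ordered pairs (i, j), i != j (assumption)."""
    pairs = list(permutations(range(len(feats)), 2))
    gaps = [dot(w, diff(feats[i], feats[j])) for i, j in pairs]
    top = max(gaps)  # subtract the maximum for numerical stability
    weights = [math.exp(g - top) for g in gaps]
    z = sum(weights)
    return {pair: wt / z for pair, wt in zip(pairs, weights)}

def expected_loss(w, feats, delta):
    """Sum over pairs of delta(i, j) * p_w(i, j)."""
    probs = pair_probabilities(w, feats)
    return sum(delta(i, j) * p for (i, j), p in probs.items())

def loss_gradient(w, feats, delta):
    """Exact gradient: sum_ij delta_ij * p_ij * (d_ij - E_p[d])."""
    probs = pair_probabilities(w, feats)
    diffs = {(i, j): diff(feats[i], feats[j]) for (i, j) in probs}
    dim = len(w)
    mean = [sum(probs[p] * diffs[p][k] for p in probs) for k in range(dim)]
    return [
        sum(delta(*p) * probs[p] * (diffs[p][k] - mean[k]) for p in probs)
        for k in range(dim)
    ]

def sample_pair(w, feats, rng=random):
    """Draw one pair to present to the user."""
    probs = pair_probabilities(w, feats)
    pairs = list(probs)
    return rng.choices(pairs, weights=[probs[p] for p in pairs], k=1)[0]

# Toy data (hypothetical): three summaries with 2-d features.
FEATS = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
TRUE_UTILITY = [0.9, 0.5, 0.1]

def toy_delta(i, j):
    # 1 if summary j is better than summary i, 0 otherwise.
    return 1 if TRUE_UTILITY[j] > TRUE_UTILITY[i] else 0

w = [0.0, 0.0]
grad = loss_gradient(w, FEATS, toy_delta)
w = [wk - 1.0 * gk for wk, gk in zip(w, grad)]  # one gradient descent step
```

On this toy set, a single descent step from the zero vector already lowers the expected loss, because the weights move towards ranking the truly better summaries higher.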

Reinforcement Learning
RL amounts to efficient algorithms for searching optimal solutions in MDPs. MDPs are widely used to formulate sequential decision making problems, which EMDS falls into: in EMDS, the summariser has to sequentially select sentences from the original documents and add them to the draft summary. An (episodic) MDP is a tuple (S, A, P, R, T). S is the set of states, A is the set of actions, P : S × A × S → R is the transition function with P(s'|s, a) yielding the probability of performing action a in state s and being transited to a new state s'. R : S × A → R is the reward function with R(s, a) giving the immediate reward for performing action a in state s. T ⊆ S is the set of terminal states; visiting a terminal state terminates the current episode.

Algorithm 1: SPPI for preference-based interactive document summarisation (adjusted from Alg. 2 in Sokolov et al., 2016a).
    Input: sequence of learning rates γ_t; query budget T; document cluster x
    initialise w_0;
    while t = 0 . . . T do
        sample (y_i, y_j) according to Eq. (2);
        obtain feedback ∆_x(y_i, y_j);
        w_{t+1} := w_t − γ_t ∇_w L^SPPI(w|x);
    end
    Output: y* = arg max_{y ∈ Y(x)} w_{T+1}^⊤ φ(y|x)
In EMDS, we follow the same MDP formulation as Ryang and Abekawa (2012) and Rioux et al. (2014). Given a document cluster, a state s is a draft summary, and A includes two types of actions: concatenate a new sentence to the current draft summary, or terminate the draft summary construction. The transition function P in EMDS is trivial because, given the current draft summary and an action, the next state can be easily inferred. The reward function R returns an evaluation score of the summary once the action terminate is performed; otherwise it returns 0 because the summary is still under construction and thus not ready to be evaluated. Providing non-zero rewards before the action terminate can lead to even worse results, as reported by Rioux et al. (2014).
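The MDP formulation above can be sketched as a small environment class. This is only an illustration under our own assumptions: a state is represented as a tuple of sentence indices, the length constraint is counted in whitespace-separated words, and `EMDSEnv`, `TERMINATE` and the `evaluate` callback are hypothetical names rather than part of any existing implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

TERMINATE = -1  # hypothetical id for the "terminate" action

@dataclass
class EMDSEnv:
    sentences: List[str]                    # sentences of the document cluster
    max_words: int                          # length constraint of the summary
    evaluate: Callable[[List[str]], float]  # scores a finished summary

    def reset(self) -> Tuple[int, ...]:
        # The initial state is the empty draft summary.
        return ()

    def actions(self, state: Tuple[int, ...]) -> List[int]:
        # Sentences that are unused and still fit, plus the terminate action.
        used = sum(len(self.sentences[i].split()) for i in state)
        fits = [
            i for i in range(len(self.sentences))
            if i not in state
            and used + len(self.sentences[i].split()) <= self.max_words
        ]
        return fits + [TERMINATE]

    def step(self, state: Tuple[int, ...], action: int):
        """Return (next_state, reward, done); the transition is deterministic."""
        if action == TERMINATE:
            summary = [self.sentences[i] for i in state]
            return state, self.evaluate(summary), True
        # Rewards are zero while the summary is still under construction.
        return state + (action,), 0.0, False
```

A reward is only produced when `TERMINATE` is chosen, mirroring the delayed-reward design described above.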
A policy $\pi : S \times A \to \mathbb{R}$ in an MDP defines how actions are selected: $\pi(s, a)$ is the probability of selecting action $a$ in state $s$. In EMDS, a policy corresponds to a strategy to build summaries for a given document cluster. We let $\mathcal{Y}_\pi(x)$ be the set of all possible summaries the policy $\pi$ can construct in the document cluster $x$, and we slightly abuse the notation by letting $\pi(y \mid x)$ denote the probability of policy $\pi$ generating a summary $y$ in cluster $x$. Then the expected reward of a policy is:

$$R^{\mathrm{RL}}(\pi \mid x) = \mathbb{E}_{y \in \mathcal{Y}_\pi(x)}\left[R(y \mid x)\right] = \sum_{y \in \mathcal{Y}_\pi(x)} \pi(y \mid x)\, R(y \mid x), \tag{3}$$

where $R(y \mid x)$ is the reward for summary $y$ in document cluster $x$. The goal of an MDP is to find the optimal policy $\pi^*$ that has the highest expected reward: $\pi^* = \arg\max_\pi R^{\mathrm{RL}}(\pi \mid x)$.
Note that the loss function in SPPI (Eq. (1)) and the expected reward function in RL (Eq. (3)) are in similar forms: if we view the pair selection probability p w in Eq. (2) as a policy, and view the preference function ∆ x in Eq. (1) as a negative reward function, we can view SPPI as an RL problem. The major difference between SPPI and RL is that SPPI selects and evaluates pairs of outputs, while RL selects and evaluates single outputs. We will exploit their connection to propose our new objective function and the APRIL framework.

Preference Learning
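The linear Bradley-Terry (BT) model (Bradley and Terry, 1952) is one of the most widely used methods in preference learning. Given a set of items $\mathcal{Y}$, suppose we have observed $T$ preferences: $Q = \{q_1(y_{1,1}, y_{1,2}), \cdots, q_T(y_{T,1}, y_{T,2})\}$, where $y_{i,1}, y_{i,2} \in \mathcal{Y}$, and $q_i \in \{\prec, \succ\}$ is the oracle's preference in the $i$-th round. The BT model minimises the following cross-entropy loss:

$$L^{\mathrm{BT}}(w) = -\sum_{q_i \in Q} \left[\mu_{i,1} \log P_w(y_{i,1} \succ y_{i,2}) + \mu_{i,2} \log P_w(y_{i,2} \succ y_{i,1})\right], \tag{4}$$

where $P_w(y_i \succ y_j) = (1 + \exp[w^\top(\phi(y_j) - \phi(y_i))])^{-1}$, and $\mu_{i,1}$ and $\mu_{i,2}$ indicate the direction of preferences: if $y_{i,1} \succ y_{i,2}$ then $\mu_{i,1} = 1$ and $\mu_{i,2} = 0$, and vice versa. Let $w^* = \arg\min_w L^{\mathrm{BT}}(w)$; then $w^*$ can be used to rank all items in $\mathcal{Y}$: for any $y_i, y_j \in \mathcal{Y}$, the ranker prefers $y_i$ over $y_j$ if $w^{*\top}\phi(y_i) > w^{*\top}\phi(y_j)$.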
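As an illustration of how such a ranker can be fitted, the sketch below minimises the BT cross-entropy loss with plain batch gradient descent. It assumes each observed preference is stored as a (winner features, loser features) pair, which is equivalent to the one-hot direction indicators; the optimiser, learning rate and step count are arbitrary choices for the example, and `fit_bt` is a hypothetical name.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def bt_prob(w, phi_a, phi_b):
    """P_w(a preferred over b) = 1 / (1 + exp(w . (phi_b - phi_a)))."""
    gap = dot(w, [b - a for a, b in zip(phi_a, phi_b)])
    return 1.0 / (1.0 + math.exp(gap))

def bt_loss(w, prefs):
    """Cross-entropy loss; prefs is a list of (phi_winner, phi_loser)."""
    return -sum(math.log(bt_prob(w, win, lose)) for win, lose in prefs)

def fit_bt(prefs, dim, lr=0.1, steps=500):
    """Batch gradient descent on the BT loss (illustrative settings)."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for win, lose in prefs:
            p = bt_prob(w, win, lose)
            for k in range(dim):
                # d/dw of -log(p) is -(1 - p) * (phi_winner - phi_loser)
                grad[k] -= (1.0 - p) * (win[k] - lose[k])
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w

# Toy items (hypothetical): a is preferred over b, and b over c.
a, b, c = [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]
prefs = [(a, b), (b, c), (a, c)]
w_star = fit_bt(prefs, dim=2)
ranking = sorted([a, b, c], key=lambda phi: dot(w_star, phi), reverse=True)
```

Because all collected preferences are kept in `prefs` and revisited in every step, each of them keeps contributing to the ranker, in contrast to a single-use update.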

APRIL: Decomposing SPPI into Active Preference Learning and RL
A major problem of SPPI is its high sample complexity. We believe this is due to two reasons. First, SPPI's sampling strategy is inefficient: from Eq. (2) we can see that SPPI tends to select pairs with large quality gaps for querying the user. This strategy can quickly identify the relatively good and relatively bad summaries, but needs many rounds of interaction to find the top summaries. Second, SPPI uses the collected preferences ineffectively: in Alg. 1, each preference is used only once for performing the gradient descent update and is forgotten afterwards. SPPI does not generalise or re-use collected preferences, wasting the useful and expensive information.
These two weaknesses of SPPI motivate us to propose a new learning paradigm that can query and generalise preferences more efficiently. Recall that in EMDS, the goal is to find the optimal summary for a given document cluster $x$, namely the summary that is preferred over all other possible summaries in $\mathcal{Y}(x)$. Based on this understanding, we define a new expected reward function $R^{\mathrm{APRIL}}$ for policy $\pi$ as follows:

$$R^{\mathrm{APRIL}}(\pi \mid x) = \mathbb{E}_{y_j \sim \pi}\left[r(y_j \mid x)\right] = \sum_{y_j \in \mathcal{Y}_\pi(x)} \pi(y_j \mid x)\, r(y_j \mid x), \tag{5}$$

where $r(y_j \mid x) = \sum_{y_i \in \mathcal{Y}(x)} \Delta_x(y_i, y_j) / |\mathcal{Y}(x)|$. Note that $\Delta_x(y_i, y_j)$ equals 1 if $y_j$ is preferred over $y_i$ and equals 0 otherwise (see §3.1). Thus, $r(y \mid x)$ is the relative position of $y$ in the (ascending) sorted $\mathcal{Y}(x)$, and it can be approximated by preference learning. The use of preference learning enables us to generalise the observed preferences to a ranker (see §3.3), allowing more effective use of the collected preferences. Also, we can use active learning to select summary pairs for querying more effectively. In addition, the resemblance of $R^{\mathrm{APRIL}}$ and RL's reward function $R^{\mathrm{RL}}$ (in Eq. (3)) enables us to use a wide range of RL algorithms to maximise $R^{\mathrm{APRIL}}$ (see §2).
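The ranking-based reward can be illustrated with a small worked example. The sketch assumes that the true preferences are induced by a list of scalar utilities without ties, so that the reward of a summary is the fraction of candidates it beats; the function names and numbers are made up for illustration.

```python
def relative_positions(utilities):
    """r(y_j) = (1 / |Y|) * number of y_i that y_j is preferred over."""
    n = len(utilities)
    return [sum(1 for u_i in utilities if u_j > u_i) / n for u_j in utilities]

def expected_reward(policy_probs, rewards):
    """Sum over summaries of pi(y) * r(y)."""
    return sum(p * r for p, r in zip(policy_probs, rewards))

# Four candidate summaries with hypothetical utilities.
utilities = [0.2, 0.9, 0.5, 0.1]
rewards = relative_positions(utilities)

# A policy that always produces the best summary versus a uniform policy.
greedy = expected_reward([0.0, 1.0, 0.0, 0.0], rewards)
uniform = expected_reward([0.25, 0.25, 0.25, 0.25], rewards)
```

Here the rewards are 0.25, 0.75, 0.5 and 0.0, so the policy concentrated on the top-ranked summary obtains 0.75 while the uniform policy obtains 0.375. Only the order of the utilities matters, not their scale.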
Based on the new objective function, we split the preference-based interactive learning into two phases: an Active Preference Learning (APL) phase (the right cycle in Fig. 1b), responsible for querying preferences from the oracle and approximating the ranking of summaries, and an RL phase (the left cycle in Fig. 1b), responsible for learning to summarise based on the learned ranking. The resulting framework APRIL allows for integrating any active preference learning and RL techniques. Note that only the APL phase is online (i.e. involving humans in the loop) while the RL phase can be performed offline, helping to improve the real-time responsiveness. Also, the learned ranker can provide an unlimited number of rewards (i.e. r(y|x) in Eq. (5)) to the RL agent, enabling us to perform many episodes of RL training with a small number of collected preferences, unlike in SPPI, where each collected preference is used to train the system for one round and is forgotten afterwards. Alg. 2 shows APRIL in pseudo code.

Algorithm 2: APRIL.
    Input: query budget T; document cluster x; RL episode budget N
    /* Phase 1: active preference learning */
    while t = 0 . . . T do
        sample a summary pair (y_i, y_j) using any APL strategy;
        obtain feedback ∆_x(y_i, y_j);
        update ranker according to Eq. (4);
    end
    /* Phase 2: RL-based summarisation */
    initialise an arbitrary policy π_0;
    while n = 0 . . . N do
        evaluate policy π_n according to Eq. (5);
        update policy π_n using any RL algorithm;
    end
    Output: y* = arg max_{y ∈ Y_{π_N}(x)} π_N(y|x)

Experimental Setup
Datasets. We perform experiments on DUC '04 to find the best performing APL and RL techniques. Then we combine the best-performing APL and RL to complete APRIL and compare it against SPPI on the DUC '01, DUC '02 and DUC '04 datasets (http://duc.nist.gov/). Some statistics of these datasets are summarised in Table 1.
Simulated Users. Existing preference-based interactive learning techniques assume that the oracle has an intrinsic evaluation function U* and provides preferences consistent with U* by preferring higher valued candidates. We term this a Perfect Oracle (PO). We believe that assuming a PO is unrealistic for real-world applications, because sometimes real users tend to misjudge the preference direction, especially when the presented candidates have similar quality. In this work, besides PO, we additionally consider two types of noisy oracles based on the user-response models proposed by Viappiani and Boutilier (2010):
• Constant noisy oracle (CNO): with probability c ∈ [0, 1], this oracle randomly selects which summary is preferred; otherwise it provides preferences consistent with U*. We consider CNOs with c = 0.1 and c = 0.3.
• Logistic noisy oracle (LNO): for two summaries y_i and y_j in cluster x, the oracle prefers y_i over y_j with probability p_{U*}(y_i ≻ y_j | x; m) = (1 + exp[(U*(y_j|x) − U*(y_i|x))/m])^{−1}. This oracle reflects the intuition that users are more likely to misjudge the preference direction when two summaries have similar quality. Note that the parameter m ∈ R+ controls the "noisiness" of the user's responses: higher values of m result in a less steep sigmoid curve, and the resulting oracle is more likely to misjudge. We use LNOs with m = 0.3 and m = 1.
As for the intrinsic evaluation function $U^*$, recent work has suggested that human preferences over summaries have high correlations to ROUGE scores (Zopf, 2018). Therefore, we define:

$$U^*(y \mid x) = \frac{R_1(y \mid x)}{0.47} + \frac{R_2(y \mid x)}{0.22} + \frac{R_S(y \mid x)}{0.18}, \tag{6}$$

where $R_1$, $R_2$ and $R_S$ stand for ROUGE-1, ROUGE-2 and ROUGE-SU4, respectively. The real values (0.47, 0.22 and 0.18) are used to balance the weights of the three ROUGE scores. We choose them to be around the EMDS upper-bound ROUGE scores reported by P.V.S. and Meyer (2017). As such, an optimal summary's $U^*$ value should be around 3.
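The three simulated users can be sketched in a few lines of Python. The sketch assumes the ROUGE scores of each summary are already available as plain floats (no ROUGE computation is shown), and each oracle returns True when it prefers the first summary over the second; the function names are hypothetical.

```python
import math
import random

def u_star(r1, r2, rsu4):
    """Intrinsic utility: ROUGE scores normalised by their upper bounds."""
    return r1 / 0.47 + r2 / 0.22 + rsu4 / 0.18

def perfect_oracle(u_i, u_j):
    """PO: prefers the summary with the higher utility."""
    return u_i > u_j

def constant_noisy_oracle(u_i, u_j, c, rng=random):
    """CNO: with probability c answer at random, otherwise act like PO."""
    if rng.random() < c:
        return rng.random() < 0.5
    return perfect_oracle(u_i, u_j)

def lno_prob(u_i, u_j, m):
    """Probability that the LNO prefers y_i over y_j."""
    return 1.0 / (1.0 + math.exp((u_j - u_i) / m))

def logistic_noisy_oracle(u_i, u_j, m, rng=random):
    """LNO: noisier when the two utilities are close or m is large."""
    return rng.random() < lno_prob(u_i, u_j, m)

# Example with made-up ROUGE scores for two summaries.
u_good = u_star(0.40, 0.12, 0.14)
u_bad = u_star(0.30, 0.05, 0.09)
answer = logistic_noisy_oracle(u_good, u_bad, m=0.3, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps simulated interactions reproducible across runs.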
Implementation. All code is written in Python and runs on a desktop PC with 8 GB RAM and an i7-2600 CPU. We use NLTK (Bird et al., 2009) to perform sentence tokenisation. Our source code is freely available at https://github.com/UKPLab/emnlp2018-april.

Simulation Results
We first study the APL phase (§6.1) and the RL phase (§6.2) separately by comparing the performance of multiple APL and RL algorithms in each phase. Then, in §6.3, we combine the best performing APL and RL algorithm to complete Alg. 2 and compare APRIL against SPPI.

APL Phase Performance
Recall that the task of APL is to output a ranking of all summaries in a cluster. In this subsection, we test multiple APL techniques and compare the quality of their resulting rankings. Two metrics are used: Kendall's τ (Kendall, 1948) and Spearman's ρ (Spearman, 1904). Both metrics are valued between −1 and 1, with higher values suggesting higher rank correlation. Because the number of possible summaries in a cluster is huge, instead of evaluating the ranking quality on all possible summaries, we evaluate rankings on 10,000 randomly sampled summaries, denoted Ŷ(x). During querying, all candidate summaries presented to the oracle are also selected from Ŷ(x). Sampling Ŷ(x) a priori helps us to reduce the response time to under 500 ms for all APL techniques we test. We compare four active learning strategies under two query budgets, T = 10 and T = 100:
• Random Sampling (RND): Randomly select two summaries from Ŷ(x) to query.
• SBT: Sample summary pairs from Ŷ(x) according to the SPPI sampling strategy in Eq. (2). After each round, the weight vector w is updated according to Eq. (4).
• Uncertainty Sampling (Unc): Query the most uncertain summary pairs. In line with P.V.S. and Meyer (2017), the uncertainty of a summary is evaluated as follows: first, we estimate the probability of a summary y being the optimal summary in cluster x as p_opt(y|x) = (1 + exp(−w*_t^⊤ φ(y|x)))^{−1}, where w*_t are the weights learned by the BT model (see §3.3) in round t. Given p_opt(y|x), we let the uncertainty score unc(y|x) = 1 − p_opt(y|x) if p_opt(y|x) ≥ 0.5 and unc(y|x) = p_opt(y|x) otherwise.
• J&N is the robust query selection algorithm proposed by Jamieson and Nowak (2011). It assumes that the items' preferences are dependent on their distances to an unknown reference point in the embedding space: the farther an item is from the reference point, the more preferred the item is. After each round of interaction, the algorithm uses all collected preferences to locate the area where the reference point may fall into, and identifies the query pairs which can reduce the size of this area, termed ambiguous query pairs. To combat noise in preferences, the algorithm selects the most-likely-correct ambiguous pair to query the oracle in each round.
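The uncertainty score used by Unc is simple enough to sketch directly. The code below follows the definition of p_opt and unc given in the list above; how the "most uncertain pair" is formed is our assumption (we simply take the two summaries with the highest uncertainty scores), and the function names are hypothetical.

```python
import math

def p_opt(w, phi):
    """Estimated probability that a summary is optimal: sigmoid(w . phi)."""
    score = sum(a * b for a, b in zip(w, phi))
    return 1.0 / (1.0 + math.exp(-score))

def uncertainty(p):
    """unc = 1 - p if p >= 0.5, else p; maximal (0.5) at p = 0.5."""
    return 1.0 - p if p >= 0.5 else p

def most_uncertain_pair(w, feats):
    """Indices of the two summaries with the highest uncertainty (assumption)."""
    order = sorted(
        range(len(feats)),
        key=lambda i: uncertainty(p_opt(w, feats[i])),
        reverse=True,
    )
    return order[0], order[1]

# Toy example: one-dimensional features for four candidate summaries.
w_t = [1.0]
candidates = [[3.0], [0.1], [-0.2], [-4.0]]
query = most_uncertain_pair(w_t, candidates)
```

In this toy example the two candidates whose scores are closest to zero are selected, since the current ranker is least certain about them.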
After all preferences are collected, we obtain the ranker as follows: for any $y_i, y_j \in \mathcal{Y}(x)$, the ranker prefers $y_i$ over $y_j$ if

$$\alpha\, w^{*\top}\phi(y_i \mid x) + (1 - \alpha)\, HU(y_i \mid x) > \alpha\, w^{*\top}\phi(y_j \mid x) + (1 - \alpha)\, HU(y_j \mid x), \tag{7}$$

where $w^*$ is the weight vector learned by the BT model (see Eq. (4)), $HU$ is the heuristics-based summary evaluation function proposed by Ryang and Abekawa (2012), and $\alpha \in [0, 1]$ is a parameter. The aim of using $HU$ and $\alpha$ is to trade off between the prior knowledge (i.e. heuristics-based $HU$) and the posterior observation (i.e. the BT-learnt $w^*$), so as to combat the cold-start problem. Based on some preliminary experiments, we set $\alpha = 0.3$ when the query budget is 10, and $\alpha = 0.7$ when the query budget is 100. The intuition is to put more weight on the posterior with increasing rounds of interaction. A more systematic study of $\alpha$ may yield better results; we leave it for future work. For the vector $\phi(y \mid x)$, we use the same bag-of-bigrams embeddings as Rioux et al. (2014), and we let its length be 200.
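The prior-posterior trade-off can be written as a one-line scoring function. In this sketch, α weights the BT-learnt score and 1 − α weights the heuristic score, consistent with α = 0 corresponding to the purely prior ranking; the heuristic value is passed in as a plain number, and `mixed_score` and `prefers` are hypothetical names.

```python
def mixed_score(w, phi, hu, alpha):
    """alpha * (w . phi) + (1 - alpha) * HU, with alpha in [0, 1]."""
    learnt = sum(a * b for a, b in zip(w, phi))
    return alpha * learnt + (1.0 - alpha) * hu

def prefers(w, phi_i, hu_i, phi_j, hu_j, alpha):
    """True if the ranker prefers summary i over summary j."""
    return mixed_score(w, phi_i, hu_i, alpha) > mixed_score(w, phi_j, hu_j, alpha)

# Toy case where the heuristic and the learnt weights disagree.
w_star = [1.0, -1.0]
phi_i, hu_i = [1.0, 0.0], 0.2   # favoured by the learnt weights
phi_j, hu_j = [0.0, 1.0], 0.8   # favoured by the heuristic
prior_only = prefers(w_star, phi_i, hu_i, phi_j, hu_j, alpha=0.0)
posterior_only = prefers(w_star, phi_i, hu_i, phi_j, hu_j, alpha=1.0)
```

With α = 0 the heuristic decides and summary j wins; with α = 1 the learnt weights decide and summary i wins. Intermediate values interpolate between the two rankings.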
In Table 2, we compare the performance of the four APL methods on the DUC'04 dataset. The baseline we compared against is the prior ranking. We find that Unc significantly outperforms all other APL methods, except when the oracle is LNO-1, where the advantage of Unc over SBT is not significant. Also, both Unc and SBT are able to significantly outperform the baseline under all settings. The competitive performance of SBT, especially with LNO-1, is due to its unique sampling strategy: LNO-1 is more likely to misjudge the preference direction when the presented summaries have similar quality, but SBT has a high probability of presenting summaries with large quality gaps (see Eq. (2)), effectively reducing the chance that LNOs misjudge preference directions. However, SBT is more "conservative" compared to Unc because it tends to exploit the existing ranking to select one good and one bad summary to query, while Unc performs more exploration by querying the summaries that are least confident according to the current ranking. We believe this explains the strong overall performance of Unc.

Baseline, α = 0, T = 0: τ = .206, ρ = .304
Table 2: Performance of multiple APL algorithms (columns) using different oracles and query budgets (rows). The baseline is the purely prior ranking. All results except the baseline are averaged over 50 document clusters in DUC'04. Asterisk: significant advantage over other active learning strategies given the same oracle and budget T.

Additional experiments suggest that when we only use the posterior ranking (i.e. letting α = 1), no APL we test can surpass the baseline when T = 10. Detailed results are presented in the supplementary material. This observation reflects the severity of the cold-start problem, confirms the effectiveness of our prior-posterior trade-off mechanism in combating cold-start, and indicates the importance of tuning the α value (see Eq. (7)). This opens up exciting avenues for future work.

RL Phase Performance
We compare two RL algorithms: TD(λ) (Sutton, 1984) and LSTD(λ) (Boyan, 1999). TD(λ) has been used in previous RL-based EMDS work (Ryang and Abekawa, 2012; Rioux et al., 2014). LSTD(λ) is chosen because it is an improved TD algorithm and has been used in the state-of-the-art PbRL algorithm by Wirth et al. (2016). We set the number of learning episodes (see Alg. 2) to N = 5,000, which we found to yield good results in reasonable time (less than 1 minute to generate a summary for one document cluster). Letting N = 3,000 results in a significant performance drop, while increasing N to 10,000 brings only marginal improvement at the cost of doubling the runtime. The learning parameters we use for TD(λ) are the same as those by Rioux et al. (2014). For LSTD(λ), we let λ = 1 and initialise its square matrix as a diagonal matrix with random numbers between 0 and 1, as suggested by Lagoudakis and Parr (2003). The rewards we use are the U* function introduced in §5. Note that this serves as the upper-bound performance, because U* relies on the reference summaries (see Eq. (6)), which are not available in the interactive setting. As a baseline, we also present the upper-bound performance of integer linear programming (ILP) reported by P.V.S. and Meyer (2017), optimised for bigram coverage.
Table 3 shows the performance of RL and ILP on the DUC'04 dataset. TD(λ) significantly outperforms LSTD(λ) in terms of all ROUGE scores we consider. Although the least-square RL algorithms (which LSTD belongs to) have been proved to achieve better performance than standard TD methods in large-scale problems (see Lagoudakis and Parr, 2003), their performance is sensitive to many factors, e.g., initialisation values in the diagonal matrix, regularisation parameters, etc. We note that a similar observation about the inferior performance of least-square RL in EMDS is reported by Rioux et al. (2014).
TD(λ) also significantly outperforms ILP in terms of all metrics except ROUGE-2. This is not surprising, because the bigram-based ILP is optimised for ROUGE-2, whereas our reward function U * considers other metrics as well (see Eq. (6)). Since ILP is widely used as a strong baseline for EMDS, these results confirm the advantage of using RL for EMDS problems.
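For readers unfamiliar with TD(λ), the following is a generic sketch of the update with a linear value function and accumulating eligibility traces. It is not the configuration of Rioux et al. (2014): the step size, discount factor, λ value and the two-state toy episode are arbitrary choices for illustration, and `td_lambda_episode` is a hypothetical name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def td_lambda_episode(w, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Apply TD(lambda) updates along one episode.

    episode is a list of (phi_state, reward, phi_next_state) tuples, where
    phi_next_state is None when the transition ends the episode.
    """
    trace = [0.0] * len(w)
    for phi_s, reward, phi_next in episode:
        v_s = dot(w, phi_s)
        v_next = 0.0 if phi_next is None else dot(w, phi_next)
        td_error = reward + gamma * v_next - v_s
        # Accumulating trace: decay the old trace, then add current features.
        trace = [gamma * lam * e + f for e, f in zip(trace, phi_s)]
        w = [wk + alpha * td_error * e for wk, e in zip(w, trace)]
    return w

# Toy episode mirroring the delayed reward in EMDS: zero reward while the
# draft is being built, and a single reward of 1 on termination.
s0, s1 = [1.0, 0.0], [0.0, 1.0]
episode = [(s0, 0.0, s1), (s1, 1.0, None)]

weights = [0.0, 0.0]
for _ in range(200):
    weights = td_lambda_episode(weights, episode)
```

With an undiscounted return, both state values converge towards the terminal reward of 1, showing how the eligibility trace propagates a delayed reward back to earlier states.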

Complete Pipeline Performance
Finally, we combine the best techniques of the APL and RL phases (namely Unc and TD(λ), respectively) to complete APRIL, and compare it against SPPI. As a baseline, we use the heuristic-based rewards HU to train both TD(λ) (ranking-based training, i.e. using HU to produce r(y|x) in Eq. (5) to train) and SPPI (preference-based training, i.e. using HU for generating pairs to train SPPI) for up to 5,000 episodes. The baseline results are presented in the bottom rows of Table 4.
We make the following observations from Table 4. (i) Given the same oracle, the performance of APRIL with 10 rounds of interaction is comparable or even superior to that of SPPI after 100 rounds of interaction (see boldface in Table 4), suggesting the strong advantage of APRIL in reducing sample complexity. (ii) APRIL can significantly improve over the baseline with either 10 or 100 rounds of interaction, but SPPI's performance can be even worse than the baseline (marked by † in Table 4), especially under the high-noise low-budget settings (i.e., CNO-0.3, LNO-0.3, and LNO-1 with T = 10). This is because SPPI lacks a mechanism to balance between prior and posterior ranking, while APRIL can adjust this trade-off by tuning α (Eq. (7)). This endows APRIL with better noise robustness and lower sample complexity in high-noise low-budget settings. Note that the above observations also hold for the other two datasets, indicating the consistently strong performance of APRIL across different datasets.
As for the overall runtime, when budget T = 100, APRIL on average takes 2 minutes to interact with an oracle and output a summary, while SPPI takes around 15 minutes due to its expensive gradient descent computation (see §3.1).

Human Evaluation
Finally, we invited real users to compare and evaluate the quality of the summaries generated by SPPI and APRIL. We randomly selected three topics (d19 from DUC'01, d117i from DUC'02 and d30042 from DUC'04), and let both SPPI and our best-performing APRIL interact with PO for 10 rounds on these topics. The resulting 100-word summaries, shown in Figure 2, were presented to seven users, who had already read two background texts to familiarise themselves with the topic. The users were asked to provide their preference on the presented summary pairs and rate the summaries on a 5-point Likert scale, with higher scores for better summaries. All users are fluent in English.

Topic d30042 (DUC'04), SPPI: After meeting Libyan leader Moammar Gadhafi in a desert tent, U.N. Secretary-General Kofi Annan said he thinks an arrangement for bringing two suspects to trial in the bombing of a Pan Am airliner could be secured in the "not too distant future." TRIPOLI, Libya (AP) U.N. Secretary-General Kofi Annan arrived in Libya Saturday for talks aimed at bringing to trial two Libyan suspects in the 1988 Pan Am bombing over Lockerbie, Scotland. Secretary General Kofi Annan said Wednesday he was extending his North African tour to include talks with Libyan authorities. Annan's one-day, 2nd graf pvs During his Algerian stay,

Topic d30042 (DUC'04), APRIL: TRIPOLI, Libya (AP) U.N. Secretary-General Kofi Annan arrived in Libya Saturday for talks aimed at bringing to trial two Libyan suspects in the 1988 Pan Am bombing over Lockerbie, Scotland. Annan's one-day visit to meet with Libyan leader Col. Moammar Gadhafi followed reports in the Libyan media that Gadhafi had no authority to hand over the suspects. The 60-year-old Annan is trying to get Libya to go along with a U.S.-British plan to try the two suspects before a panel of Scottish judges in the Netherlands for the Dec. 21, 1988, bombing over Lockerbie, Scotland. Sirte is 400 kilometers (250 miles) east of the Libyan capital Tripoli. During his Algerian stay,

Topic d117i (DUC'02), SPPI: The Booker Prize is sponsored by Booker, an international food and agriculture business. The novel, a story of Scottish lowlife narrated largely in Glaswegian dialect, is unlikely to prove a popular choice with booksellers, who have damned all six books shortlisted for the prize as boring, elitist and-worst of all-unsaleable. The shortlist of six for the Pounds 20,000 Booker Prize for fiction, announced yesterday, immediately prompted the question 'Who ? ' Japanese writer Kazuo Ishiguro won the 1989 Booker Prize, Britain's top literary award, for his novel "The Remains of the Day," judges announced Thursday. He didn't win.

Topic d117i (DUC'02), APRIL: Australian novelist Peter Carey was awarded the coveted Booker Prize for fiction Tuesday night for his love story, "Oscar and Lucinda." The Booker Prize is sponsored by Booker, an international food and agriculture business, and administered by The Book Trust. British publishers can submit three new novels by British and Commonwealth writers. Six novels have been nominated for the Booker Prize, Britain's most prestigious fiction award, and bookmakers say the favorite is "The Remains of the Day" by Japanese author Kazuo Ishiguro. On the day of the Big Event, Ladbroke, the large British betting agency, posted the final odds.

Topic d19 (DUC'01), SPPI: The issue cuts across partisan lines in the Senate, with Minority Leader Bob Dole (R-Kan.) arguing against the White House position on grounds that including illegal aliens in the census is unfair to American citizens.. Loss of Seats Cited. Shelby's amendment says only that the secretary is to "make such adjustments in total population figures as may be necessary, using such methods and procedures as the secretary determines feasible and appropriate" to keep illegal aliens from being counted in congressional reapportionment. "Some states will lose congressional seats because of illegal aliens," Dole argued. But there's nothing simple about it.

Topic d19 (DUC'01), APRIL: In a blow to California and other states with large immigrant populations, the Senate voted Friday to bar the Census Bureau from counting illegal aliens in the 1990 population count. But the Senate already has voted to force the Census Bureau to exclude illegal immigrants in preparing tallies for congressional reapportionment. said that Georgia and Indiana both lost House seats after the 1980 Census, and California and New Yorkcenters of illegal immigration-each gained seats. A majority of the members of the House of Representatives has signaled support. The national head count will be taken April 1, 1990.

Figure 2: Summaries generated by SPPI and APRIL for the three selected topics.
In all three topics, all users prefer the APRIL-generated summaries over the SPPI-generated summaries. Table 5 shows the users' ratings. The APRIL-generated summaries consistently receive higher ratings. These results are consistent with our simulation experiments and confirm the significant advantage of APRIL over SPPI.

Conclusion
We propose a novel preference-based interactive learning formulation named APRIL (Active Preference ReInforcement Learning), which is able to make structured predictions without referring to the gold standard data. Instead, APRIL learns from preference-based feedback. We designed a novel objective function for APRIL, which naturally splits APRIL into an active preference learning (APL) phase and a reinforcement learning (RL) phase, enabling us to leverage a wide spectrum of active learning, preference learning and RL algorithms to maximise the output quality with a limited number of interaction rounds. We applied APRIL to the Extractive Multi-Document Summarisation (EMDS) problem, simulated the users' preference-giving behaviour using multiple user-response models, and compared the performance of multiple APL and RL techniques. Simulation experiments indicated that APRIL significantly improved the summary quality with just 10 rounds of interaction (even with high-noise oracles), and significantly outperformed SPPI in terms of both sample complexity and noise robustness. Human evaluation results suggested that real users preferred the APRIL-generated summaries over the SPPI-generated ones.
We identify two major lines of future work. On the technical side, we plan to employ more advanced APL and RL algorithms in APRIL, such as sample-efficient Bayesian-based APL algorithms (e.g., Simpson and Gurevych, 2018) and neural RL algorithms (e.g. Mnih et al., 2015) to further reduce the sample complexity of APRIL. On the experimental side, a logical next step is to implement an interactive user interface for APRIL and conduct a larger evaluation study comparing the summary quality before and after the interaction. We also plan to apply APRIL to more NLP applications, including machine translation, information exploration and semantic parsing.