Joint Optimization of User-desired Content in Multi-document Summaries by Learning from User Feedback

In this paper, we propose an extractive multi-document summarization (MDS) system using joint optimization and active learning for content selection grounded in user feedback. Our method interactively obtains user feedback to gradually improve the results of a state-of-the-art integer linear programming (ILP) framework for MDS. Our methods complement fully automatic methods in producing high-quality summaries with a minimum number of iterations and a minimum amount of feedback. We conduct multiple simulation-based experiments and analyze the effect of feedback-based concept selection in the ILP setup in order to maximize the user-desired content in the summary.


Introduction
The task of producing summaries from a cluster of multiple topic-related documents has gained much attention during the Document Understanding Conference (DUC, http://duc.nist.gov/) and the Text Analysis Conference (TAC, http://www.nist.gov/tac/) series. Despite a lot of research in this area, it is still a major challenge to automatically produce summaries that are on par with human-written ones. To a large extent, this is due to the complexity of the task: a good summary must include the most relevant information, omit redundancy and irrelevant information, satisfy a length constraint, and be cohesive and grammatical. But an even bigger challenge is the high degree of subjectivity in content selection, as can be seen in the small overlap of what is considered important by different users. Optimizing a system towards one single best summary that fits all users, as is assumed by current state-of-the-art systems, is highly impractical and diminishes the usefulness of a system for real-world use cases.
In this paper, we propose an interactive concept-based model to assist users in creating a personalized summary based on their feedback. Our model employs integer linear programming (ILP) to maximize user-desired content selection while using a minimum amount of user feedback and iterations. In addition to the joint optimization framework using ILP, we explore pool-based active learning to further reduce the required feedback. Although there have been previous attempts to assist users in single-document summarization, no existing work tackles the problem of multi-document summarization using optimization techniques for user feedback. Additionally, most existing systems produce only a single, globally optimal solution. Instead, we put the human in the loop and create a personalized summary that better captures the users' needs and their different notions of importance.
Need for personalization. Table 1 shows the ROUGE scores (Lin, 2004) of multiple existing summarization systems, namely TF*IDF (Luhn, 1958), LexRank (Erkan and Radev, 2004), TextRank (Mihalcea and Tarau, 2004), LSA (Gong and Liu, 2001), and KL-Greedy (Haghighi and Vanderwende, 2009), as provided by the sumy package, and ICSI (Gillick and Favre, 2009; Boudin et al., 2015), a strong state-of-the-art approach (Hong et al., 2014), in comparison to the extractive upper bound on DUC'04 and DBS. DUC'04 is an English dataset of abstractive summaries from ho- ... (Cao et al., 2016). Although some systems achieve state-of-the-art performance, their scores are still far from the extractive upper bound of individual reference summaries, as shown in Figure 1. This is due to low inter-annotator agreement for concept selection: Zechner (2002) reports, for example, κ = .13 and Benikova et al. (2016) κ = .23. Most systems try to optimize for all reference summaries instead of personalizing, which we consider essential to capture user-desired content.
Figure 1: Lexical overlap of a reference summary (cluster D31043t in DUC 2004) with the summary produced by ICSI's state-of-the-art system (Boudin et al., 2015).
Need for user feedback. The goal of concept selection is finding the important information within a given set of source documents. Although existing summarization algorithms come up with a generic notion of importance, it is still far from the user-specific importance as shown in Figure 1. In contrast, humans can easily assess importance given a topic or a query. One way to achieve personalized summarization is thus by combining the advantages of both human feedback and the generic notion of importance built in a system. This allows users to interactively steer the summarization process and integrate their user-specific notion of importance.
Contributions. In this work, (1) we propose a novel ILP-based model using an interactive loop to create multi-document user-desired summaries, and (2) we develop models using pool-based active learning and joint optimization techniques to collect user feedback on identifying important concepts of a topic. In order to encourage the community to advance research and replicate our results, we provide our interactive summarizer implementation as open-source software. Our proposed method and our new interactive summarization framework can be used in multiple application scenarios: as an interactive annotation tool, which highlights important sentences for the annotators, as a journalistic writing aid that suggests important, user-adapted content from multiple source feeds (e.g., live blogs), and as a medical data analysis tool that suggests key information assisting a patient's personalized medical diagnosis.
The rest of the paper is structured as follows: In section 2, we discuss related work. Section 3 introduces our computer-assisted summarization framework using the concept-based optimization. Section 4 describes our experiment data and setup. In section 5, we then discuss our results and analyze the performance of our models across different datasets. Finally, we conclude the paper in section 6 and discuss future work.

Related Work
Previous works related to our research address extractive summarization as a budgeted subset selection problem, computer-assisted approaches, and personalized summarization models.
Budgeted subset selection. Extractive summarization systems that compose a summary from a number of important sentences from the source documents are by far the most popular solution for MDS. This task can be modeled as a budgeted maximum coverage problem. Given a set of sentences in the document collection, the task is to select a subset of sentences that maximizes coverage under a length constraint. The scoring function estimates the importance of the content units for a summary. Most previous works consider sentences as content units and try different scoring functions to optimize the summary.
One of the earliest systems by McDonald (2007) models a scoring function by simultaneously maximizing the relevance scores of the selected content units and minimizing their pairwise redundancy scores. They solve the global optimization problem using an ILP framework. Later, several state-of-the-art results employed an ILP to maximize the number of relevant concepts in the created summary: Gillick and Favre (2009) use an ILP with bigrams as concepts and hand-coded deletion rules for compression. Berg-Kirkpatrick et al. (2011) combine grammatical features relating to the parse tree and use a maximum-margin SVM trained on annotated gold-standard compressions. Woodsend and Lapata (2012) jointly optimize content selection and surface realization, Li et al. (2013) estimate the weights of the concepts using supervised methods, and Boudin et al. (2015) propose an approximation algorithm to achieve the optimal solution. Although these approaches achieve state-of-the-art performance, they produce only one globally optimal summary which is impractical for various users due to the subjectivity of the task. Therefore, we research interactive computer-assisted approaches in order to produce personalized summaries.
Computer-assisted summarization. The majority of the existing computer-assisted summarization tools (Craven, 2000; Narita et al., 2002; Orǎsan et al., 2003; Orǎsan and Hasler, 2006) present important elements of a document to the user. Creating a summary then requires the human to cut, paste, and reorganize the important elements in order to formulate a final text. The work by Orǎsan and Hasler (2006) is closely related to ours, since they assist users in creating summaries for a source document based on the output of a given automatic summarization system. However, their system is neither interactive nor does it consider the user's feedback in any way. Instead, they suggest the output of the state-of-the-art (single-document) summarization method as a summary draft and ask the user to construct the summary without further interaction.
Personalized summarization. While most previous work focuses on generic summaries, there have been a few attempts to take a user's preferences into account. The study by Berkovsky et al. (2008) shows that users prefer personalized summaries that precisely reflect their interests. These interests are typically modeled with the help of a query (Park and An, 2010) or keyword annotations reflecting the user's opinions (Zhang et al., 2003).
In another strand of research, Díaz and Gervás (2007) create user models based on social tagging and Hu et al. (2012) rank sentences by combining informativeness scores with a user's interests based on fuzzy clustering of social tags. Extending the use of social content, another recent work showed how personalized review summaries (Poussevin et al., 2015) can be useful in recommender systems beyond rating predictions. Although these approaches show that personalized summaries are more useful than generic summaries, they do not attempt to iteratively refine a summary in an interactive user-system dialog.

Approach
The goal of our work is maximizing the user-desired content in a summary within a minimum number of iterations. To this end, we propose an interactive loop that alternates the automatic creation of a summary and the acquisition of user feedback to refine the next iteration's summary.

Summary Creation
Our starting point is the concept-based ILP summarization framework by Boudin et al. (2015). Let $C$ be the set of concepts in a given set of source documents $D$, $c_i$ the presence of concept $i$ in the resulting summary, $w_i$ a concept's weight, $l_j$ the length of sentence $j$, $s_j$ the presence of sentence $j$ in the summary, and $Occ_{ij}$ the occurrence of concept $i$ in sentence $j$. Based on these definitions, we formulate an ILP as follows. The objective function (1) maximizes the occurrence of concepts $c_i$ in the summary based on their weights $w_i$. The constraint formalized in (2) ensures that the summary length is restricted to a maximum length $L$, and (3) ensures the selection of all concepts in a sentence $s_j$ if $s_j$ has been selected for the summary. Constraint (4) ensures that a concept is only selected if it is present in at least one of the selected sentences.
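Written out, an ILP consistent with the description above takes the following form, where $l_j$ denotes the length of sentence $j$. This is a reconstruction from the prose, following the standard concept-based formulation of Gillick and Favre (2009); the binary domain constraints and their numbering as (5) and (6) are an assumption.

```latex
\begin{align}
\max_{c,\,s} \quad & \sum_{i} w_i c_i \tag{1} \\
\text{s.t.} \quad & \sum_{j} l_j s_j \le L \tag{2} \\
& s_j \, Occ_{ij} \le c_i \quad \forall i, j \tag{3} \\
& \sum_{j} s_j \, Occ_{ij} \ge c_i \quad \forall i \tag{4} \\
& c_i \in \{0, 1\} \quad \forall i \tag{5} \\
& s_j \in \{0, 1\} \quad \forall j \tag{6}
\end{align}
```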
The two key factors for the performance of this ILP are defining the concept set $C$ and a method to estimate the weights $w_i \in W$. Previous works have used word bigrams as concepts (Gillick and Favre, 2009; Li et al., 2013; Boudin et al., 2015) and either use document frequency (i.e., the number of source documents containing the concept) as weights (Woodsend and Lapata, 2012; Gillick and Favre, 2009) or estimate them using a supervised regression model (Li et al., 2013). For our implementation, we likewise use bigrams as concepts and document frequency as weights, as Boudin et al. (2015) report good results with this simple strategy. Our approach is, however, not limited to this setup, as our interactive approach allows for any definition of $C$ and $W$, including potentially more sophisticated weight estimation methods, e.g., based on deep neural networks. In section 5.2, we additionally analyze how other notions of concepts can be integrated into our approach.
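As a minimal sketch of this setup, the following Python code extracts bigram concepts, weights them by document frequency, and finds the best-scoring sentence subset under a length budget. It assumes whitespace tokenization, sentence length measured in words, and a toy document collection. It replaces the ILP solver with exhaustive search over sentence subsets, which yields the same optimum but is only feasible for very small inputs. All function names are illustrative.

```python
from itertools import combinations

def bigrams(tokens):
    """Return the set of word bigrams (concepts) in a token list."""
    return set(zip(tokens, tokens[1:]))

def document_frequency(docs):
    """Weight each bigram by the number of documents that contain it."""
    weights = {}
    for doc in docs:
        seen = set()
        for sentence in doc:
            seen |= bigrams(sentence.split())
        for concept in seen:
            weights[concept] = weights.get(concept, 0) + 1
    return weights

def best_summary(sentences, weights, max_len):
    """Exhaustive search over sentence subsets (stand-in for the ILP solver)."""
    best, best_score = (), -1
    for k in range(len(sentences) + 1):
        for subset in combinations(range(len(sentences)), k):
            length = sum(len(sentences[j].split()) for j in subset)
            if length > max_len:
                continue  # violates the length constraint
            covered = set()
            for j in subset:
                covered |= bigrams(sentences[j].split())
            # each covered concept counts once, regardless of repetitions
            score = sum(weights.get(c, 0) for c in covered)
            if score > best_score:
                best, best_score = subset, score
    return [sentences[j] for j in best], best_score

# Toy collection: three documents, each a list of sentences.
docs = [
    ["the storm hit the coast", "many homes lost power"],
    ["the storm hit the city", "schools were closed"],
    ["many homes lost power", "the storm hit the coast hard"],
]
weights = document_frequency(docs)
sentences = [
    "the storm hit the coast",
    "many homes lost power",
    "the storm hit the city",
    "schools were closed",
    "the storm hit the coast hard",
]
summary, score = best_summary(sentences, weights, max_len=9)
print(summary, score)
```

Because a concept contributes its weight only once, the search prefers sentences that add new bigrams over sentences that repeat already covered ones.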

Interactive Summarization Loop
Algorithm 1 provides an overview of our interactive summarization approach. The system takes the set of source documents $D$ as input, derives the set of concepts $C$, and initializes their weights $W$. In line 5, we start the interactive feedback loop iterating over $t = 0, \dots, T$. We first create a summary $S_t$ (line 6) by solving the ILP and then extract a set of concepts $Q_t$ (line 7), for which we query the user in line 11. As the user feedback in the current time step, we use the concepts $I_t \subseteq Q_t$ that have been considered important by the user. For updating the weights $W$ in line 12, we may use all feedback collected until the current time step $t$, i.e., $I_0^t = \bigcup_{j=0}^{t} I_j$, and the set of concepts $Q_0^t = \bigcup_{j=0}^{t} Q_j$ seen by the user. If there are no more concepts to query (i.e., $Q_t = \emptyset$), we stop the iteration and return the personalized summary $S_t$.
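The loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the listing of Algorithm 1: the summarizer, the concept extractor, the user query, and the weight update are passed in as hypothetical callables, and $Q_t$ is assumed to contain only concepts that the user has not seen before. The toy demo uses a single-sentence "summary" as a stand-in for solving the ILP and an ACCEPT-style weight update.

```python
def interactive_loop(weights, summarize, extract_concepts, ask_user,
                     update_weights, max_iters=10):
    """Alternate between creating a summary and collecting feedback on it."""
    weights = dict(weights)
    seen, important = set(), set()            # accumulated Q and I sets
    for _ in range(max_iters):
        summary = summarize(weights)          # S_t
        queries = extract_concepts(summary) - seen   # Q_t (assumed: unseen only)
        if not queries:                       # no more concepts to query
            return summary
        feedback = ask_user(queries)          # I_t, a subset of Q_t
        seen |= queries
        important |= feedback
        weights = update_weights(weights, important, seen)
    return summarize(weights)


# Toy demo: a "sentence" is a set of concept ids, and the summary is the single
# best-scoring sentence (a stand-in for solving the ILP).
SENTENCES = [{"a", "b"}, {"c", "d"}, {"e"}]
REFERENCE = {"c", "d"}                        # what the simulated user wants
MAX_WEIGHT = 3

def summarize(weights):
    return max(SENTENCES, key=lambda s: sum(weights[c] for c in s))

def accept_update(weights, important, seen):
    # ACCEPT-style update: MAX for accepted concepts, 0 for rejected ones
    return {c: MAX_WEIGHT if c in important else 0 if c in seen else w
            for c, w in weights.items()}

final_summary = interactive_loop(
    {"a": 3, "b": 3, "c": 2, "d": 2, "e": 1},
    summarize,
    extract_concepts=lambda summary: set(summary),
    ask_user=lambda queries: queries & REFERENCE,
    update_weights=accept_update,
)
print(sorted(final_summary))  # ['c', 'd']
```

In the demo, the first summary contains only rejected concepts, so their weights drop to zero and the second iteration surfaces the sentence that the simulated user wants.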

User Feedback Optimization
To optimize the summary creation based on user feedback, we iteratively change the concept weights in the objective function of the ILP setup. We define the following models:
Accept model (ACCEPT). This model presents the current summary $S_t$ with highlighted concepts $Q_t$ to a user and asks them to select all important concepts $I_t$. We assign the maximum weight MAX to all concepts in $I_t$ and consider the remaining concepts $Q_t \setminus I_t$ as unimportant by setting their weight to 0 (see equations 7 and 8). The intuition behind this baseline is that the modified scores cause the ILP to prefer the user-desired concepts while avoiding unimportant ones.
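Following the description above, the ACCEPT weight update can be written as below. This is a reconstruction from the prose; the numbering as (7) and (8) and the use of the per-iteration sets $I_t$ and $Q_t$ (rather than the accumulated feedback) are assumptions.

```latex
\begin{align}
w_i &= \mathit{MAX} \quad \text{if } i \in I_t \tag{7} \\
w_i &= 0 \quad \text{if } i \in Q_t \setminus I_t \tag{8}
\end{align}
```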
Joint ILP with User Feedback (JOINT). The ACCEPT model fails when important concepts never appear in any of the summaries $S_t$, since the user then has no opportunity to accept them. To tackle this, in our JOINT model, we change the objective function of the ILP in order to create $S_t$ by jointly optimizing importance and user feedback. We thus replace equation (1) with equation (9), which maximizes the use of concepts for which we yet lack feedback ($i \notin Q_0^t$) and minimizes the use of concepts for which we already have feedback ($i \in Q_0^t$). In this JOINT model, we use an exploration phase $t = 0, \dots, \tau$ to collect the feedback, which terminates when the user does not return any important concepts (i.e., $I_t = \emptyset$). In the exploration phase, the minus term in equation (9) reduces the score of the sentences whose concepts have already received feedback. In other words, it causes higher scores for sentences consisting of concepts that yet lack feedback. After the exploration phase, we fall back to the original importance-based objective function from equation (1).
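To illustrate how the minus term shifts the scores, the following Python sketch evaluates a JOINT-style objective for a candidate summary. It assumes that concepts without feedback add their weight and concepts with feedback subtract it during exploration, which matches the description above but is not taken from the equation itself. The function and parameter names are hypothetical.

```python
def joint_objective(covered, weights, seen, exploring=True):
    """Score a candidate summary by the concepts it covers.

    covered:   concepts present in the candidate summary (c_i = 1)
    seen:      concepts that already received feedback
    exploring: True during the exploration phase, False afterwards
    """
    if not exploring:
        # after exploration: plain importance-based objective
        return sum(weights[c] for c in covered)
    unseen_gain = sum(weights[c] for c in covered if c not in seen)
    seen_penalty = sum(weights[c] for c in covered if c in seen)
    return unseen_gain - seen_penalty

weights = {"a": 3, "b": 2, "c": 1}
seen = {"a"}
print(joint_objective({"a", "b"}, weights, seen))                   # -1
print(joint_objective({"b", "c"}, weights, seen))                   # 3
print(joint_objective({"a", "b"}, weights, seen, exploring=False))  # 5
```

During exploration, the candidate that covers only concepts without feedback scores higher even though its total weight is lower, which is the intended exploratory behavior.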
Active learning with uncertainty sampling (AL). Our JOINT model explores well in terms of prioritizing the concepts that yet lack user feedback. However, it gives equal probabilities to all unseen concepts. The AL model employs pool-based active learning (Kremer et al., 2014) during the exploration phase in order to prioritize concepts for which the model is most uncertain. We distinguish the unlabeled concept pool $C_u = \{\Phi(x_1), \Phi(x_2), \dots, \Phi(x_N)\}$ and the labeled concept pool $C_\ell = \{(\Phi(x_1), y_1), (\Phi(x_2), y_2), \dots, (\Phi(x_N), y_N)\}$, where each concept $x_i$ is represented as a $d$-dimensional feature vector $\Phi(x_i) \in \mathbb{R}^d$. The labels $y_i \in \{-1, 1\}$ are $1$ for all important concepts in $I_0^t$ and $-1$ for all unimportant concepts in $Q_0^t \setminus I_0^t$. Initially, the labeled concept pool $C_\ell$ is small or empty, whereas the unlabeled concept pool $C_u$ is relatively large. The learning algorithm is presented with $C = C_\ell \cup C_u$ and is first called to learn a decision function $f^{(0)} : \mathbb{R}^d \to \mathbb{R}$, where $f^{(0)}(\Phi(x))$ is taken to predict the label of the input vector $\Phi(x)$. Then, in each iteration $t = 1, 2, \dots, \tau$, the querying algorithm selects an instance $x_t \in C_u$ for which the learning algorithm is least certain. Thus, the learning goal of active learning is to minimize the expected loss $\mathcal{L}$ (i.e., hinge loss) with limited querying opportunities in order to obtain decision functions $f^{(1)}, f^{(2)}, \dots, f^{(\tau)}$ that achieve low error rates. As the learning algorithm, we use a support vector machine (SVM) with a linear kernel. To obtain the probability distribution over classes, we use Platt's calibration (Platt, 1999), an effective approach for transforming the outputs of a classification model into a probability distribution. Equation (11) shows the probability estimates for $f^{(t)}$, where $f^{(t)}$ is the uncalibrated output of the SVM in the $t$-th iteration and $A$, $B$ are scalar parameters that are learned by the calibration algorithm.
The uncertainty scores $u_i$ are calculated as described in equation (12) for all concepts that lack feedback ($C_u$).
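The calibration and sampling steps can be sketched in Python as follows. The function `platt_probability` implements the sigmoid of equation (11). The uncertainty score in `uncertainty` is an assumption: it uses one minus the probability of the more likely class, a common choice for uncertainty sampling, and may differ from the definition in equation (12). The values of `A` and `B` are hypothetical, since they are normally learned by the calibration algorithm.

```python
import math

def platt_probability(f, A, B):
    """Equation (11): map an uncalibrated SVM score f to a probability."""
    return 1.0 / (1.0 + math.exp(A * f + B))

def uncertainty(f, A, B):
    """Assumed uncertainty score: 1 minus the probability of the likelier class."""
    p = platt_probability(f, A, B)
    return 1.0 - max(p, 1.0 - p)

def most_uncertain(scores, A, B):
    """Return the concept whose calibrated prediction is least certain."""
    return max(scores, key=lambda concept: uncertainty(scores[concept], A, B))

# Hypothetical SVM decision values for three concepts without feedback.
svm_scores = {"x": 2.0, "y": -0.1, "z": -3.0}
A, B = -1.0, 0.0   # illustrative values, not learned parameters
print(most_uncertain(svm_scores, A, B))  # y
```

The concept closest to the decision boundary receives the highest uncertainty score and would therefore be prioritized for feedback.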
$$p(y \mid f^{(t)}) = \frac{1}{1 + \exp(A f^{(t)} + B)} \qquad (11)$$
For our AL model, we now change the objective function in order to create $S_t$ by multiplying the uncertainty scores $u_i$ with the weights $w_i$, which replaces the objective function from (9).
Active learning with positive sampling (AL+). One way to sample the unseen concepts is using uncertainty as in AL, but another way is to actively choose samples that the learning algorithm predicts to be important concepts. In AL+, we introduce the notion of certainty $(1 - u_i)$ for the positively predicted samples ($f^{(t)}(\Phi(x_i)) = 1$).

Experimental Setup

Data
For our experiments, we mainly focus on the DBS corpus, which is an MDS dataset of coherent extracts created from heterogeneous sources about multiple educational topics (Benikova et al., 2016). This corpus is well-suited for our evaluation setup, since we are able to easily simulate a user's feedback based on the overlap between generated and reference summary. Additionally, we carry out experiments on the most commonly used evaluation corpora published by DUC/NIST from the generic multi-document summarization task carried out in DUC'01, DUC'02, and DUC'04. The documents are all from the news domain and are grouped into various topic clusters. Table 2 shows the properties of these corpora.
For evaluating the summaries against the reference summary, we use ROUGE (Lin, 2004) with the parameters suggested by Owczarzak et al. (2012), which yield a high correlation with human judgments (i.e., with stemming and without stopword removal; see footnote 6). Since DBS summaries do not have a fixed length, we use a variable length parameter $L$ for evaluation, where $L$ denotes the length of the reference summary. All results are averaged across all topics and reference summaries.

Data Pre-processing and Features
To pre-process the datasets, we perform tokenization and stemming with NLTK (Loper and Bird, 2002) and constituency parsing with the Stanford parser (Klein and Manning, 2003) for English and German. The parse trees will be used in section 5.2 below to experiment with a syntactically motivated concept notion.
Footnote 6 (ROUGE parameters): -n 4 -m -a -x -c 95 -r 1000 -f A -p 0.5 -t 0 -2 -4 -u
As a concept's feature representation $\Phi$ for our active learning setups AL and AL+, we use pre-trained word embeddings. We use the Google News embeddings with 300 dimensions by Mikolov et al. (2013) for English and the 100-dimensional news- and Wikipedia-based embeddings by Reimers et al. (2014) for German. Additionally, we add TF*IDF, number of stop words, presence of named entities, and word capitalization as features. Discrete features, such as part-of-speech tags, are mapped into the word representation via lookup tables.

Oracle-Based Simulation of User Feedback
The presence of a human in the loop typically calls for an evaluation based on a user study, but collecting sufficient data for the various settings of our models would be too expensive. Therefore, we resort to an oracle-based approach, where the oracle is a system simulating the user by generating the feedback based on reference outputs. This idea has been widely used in the development of interactive systems (González-Rubio et al., 2012; Knowles and Koehn, 2016) for studying the problem and exhibiting solutions in a theoretical and controlled environment.
To simulate user feedback in our setting, we consider all concepts $I_t \subseteq Q_t$ from the system-suggested summary $S_t$ as important if they are present in the reference summary. Let $Ref$ be the set of concepts in the reference summary. In the $t$-th iteration, we return $I_t = Q_t \cap Ref$ as the simulated user feedback. Thus, the goal of our system is to reach the upper bound for a user's reference summary within a minimal number of iterations.
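The oracle reduces to a set intersection, as the following minimal Python sketch shows. It assumes bigram concepts over lowercased, whitespace-separated tokens and omits stemming for brevity. The helper names and example sentences are illustrative.

```python
def bigram_concepts(text):
    """Return the set of word bigrams in a text (no stemming, for brevity)."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))

def simulate_feedback(queried, reference):
    """Return the queried concepts that also occur in the reference summary."""
    return set(queried) & set(reference)

system_summary = "the storm hit the coast"
reference_summary = "a storm hit the coast on monday"

queried = bigram_concepts(system_summary)          # Q_t
ref = bigram_concepts(reference_summary)           # Ref
important = simulate_feedback(queried, ref)        # I_t
print(sorted(important))
```

Here, three of the four queried bigrams occur in the reference and are returned as important, while the bigram ("the", "storm") is treated as rejected.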

Methods
To examine the system performance based on user feedback, we analyze our models' performance on multiple datasets. The results in Table 3 show that our idea of interactive multi-document summarization allows users to steer a general summary towards a personalized summary consistently across all datasets. From the results, we can see that the AL model starts from the concept-based ILP summarization and nearly reaches the upper bound for all the datasets within ten iterations. AL+ performs similarly to AL in terms of ROUGE, but requires less feedback (compare Table 4). Furthermore, the ACCEPT and JOINT models get stuck in a local optimum due to the less exploratory nature of the models.

Concept Notion
Our interactive summarization approach is based on the scalable global concept-based model, which uses bigrams as concepts. Thus, it is intuitive to use bigrams for collecting user feedback as well. Although our models reach the upper bound when using bigram-based feedback, they require a large number of iterations and much feedback to converge, as shown in Table 4.
To reduce the amount of feedback, we also consider content phrases to collect feedback, that is, syntactic chunks from the constituency parse trees consisting of non-function words (i.e., nouns, verbs, adjectives, and adverbs). Since DBS is an extractive dataset, we use bigrams and content phrases as concepts, both for the objective function in equation (1) and as feedback items, whereas for the DUC datasets, the concepts are always bigrams for both feedback types (bigrams and content phrases). Since DUC is abstractive, feedback given on content phrases is projected back to the bigrams to change the concept weights in order to obtain more overlap with the simulated feedback. Table 4 shows that feedback based on content phrases reduces the amount of feedback by a factor of 2. Furthermore, when content phrases are used as concepts for DBS, the performance of the models is lower compared to bigrams, as seen in Table 3. For DUC'04, the improvements are +.1 ROUGE-2 after ten iterations, which is notable considering the low upper bound of .21 ROUGE-2. This is primarily because DBS is a corpus of cohesive extracts, whereas DUC'04 consists of abstractive summaries. As a result, the oracles created using abstractive reference summaries have a lower overlap of concepts than the oracles created using extractive summaries.
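One plausible way to project phrase-level feedback back to bigram concepts is sketched below in Python. The text does not specify the projection, so this is an assumption: each accepted content phrase is mapped to the word bigrams it contains, and single-word phrases contribute no bigram. The function name and example phrases are hypothetical.

```python
def project_to_bigrams(phrases):
    """Map accepted content phrases to the word bigrams they contain."""
    concepts = set()
    for phrase in phrases:
        tokens = phrase.split()
        concepts |= set(zip(tokens, tokens[1:]))
    return concepts

accepted_phrases = ["severe winter storm", "power"]
print(sorted(project_to_bigrams(accepted_phrases)))
```

Under this assumption, one accepted phrase of n words updates up to n - 1 bigram weights at once, which is one way to explain why phrase-level feedback needs fewer user decisions.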

Datasets
For DBS, it becomes clear that the JOINT model converges faster and with less feedback than the other models. ACCEPT requires more feedback than JOINT, but achieves lower ROUGE scores. The best performing models are AL and AL+, which come closest to the upper bound. This is due to the exploratory nature of these models, which use semantic representations of the concepts to predict the uncertainty and importance of possible concepts for user feedback.
For DUC'04, the JOINT model comes closest to the upper bound, closely followed by AL. The JOINT model consistently stays above all other models, and it gathers more important concepts because it optimizes for concepts that lack feedback. Interestingly, AL+ performs worse in terms of both ROUGE scores and gathering important concepts. The primary reason for this is the smaller amount of feedback collected from the simulation due to the abstractive nature of the reference summaries, which makes the AL+ model's predictions inconsistent.

Personalization
Figure 3 shows the performance of different models in comparison to two different oracles for the same document cluster. For DBS, the JOINT, AL, and AL+ models consistently converge to the upper bound in 4 iterations for different oracles, whereas ACCEPT takes longer for one oracle and does not reach the upper bound for the other.
For DUC'04, JOINT and AL show consistent performance across the oracles, whereas AL+ performs worse than the state-of-the-art system (iteration 0) for the oracle created using abstractive summaries, as shown in Figure 3 (right) for User:1. However, for User:2, we observe a ROUGE-2 improvement of +.1, indicating that the predictions of the active learning system are better if there is more feedback. Nevertheless, we expect that in practical use, human summarizers may give more feedback, similar to DBS, than in the DUC'04 simulation setting.

Conclusion and Future Work
We propose a novel ILP-based approach using interactive user feedback to create multi-document user-desired summaries. In this paper, we investigate pool-based active learning and joint optimization techniques to collect user feedback for identifying important concepts for a summary. Our models show that interactively collecting feedback consistently steers a general summary towards a user-desired personalized summary. We empirically checked the validity of our approach on standard datasets using simulated user feedback and observed that our framework shows promising results in terms of producing personalized multi-document summaries.
As future work, we plan to investigate more sophisticated sampling strategies based on active learning and concept graphs to incorporate lexical-semantic information for concept selection. We also plan to look into ways to propagate feedback to similar and related concepts with partial feedback, to reduce the total amount of feedback. This is a promising direction, as we have shown that interactive methods help to create user-desired personalized summaries with a minimum amount of feedback, which makes them useful in scenarios where user-adapted content is a requirement.