WSL-DS: Weakly Supervised Learning with Distant Supervision for Query Focused Multi-Document Abstractive Summarization

In the Query Focused Multi-Document Summarization (QF-MDS) task, a set of documents and a query are given, and the goal is to generate a summary from these documents based on the given query. However, one major challenge for this task is the lack of labeled training data. To overcome this issue, in this paper we propose a novel weakly supervised learning approach via utilizing distant supervision. In particular, we use datasets similar to the target dataset as the training data, where we leverage pre-trained sentence similarity models to generate the weak reference summary of each individual document in a document set from the multi-document gold reference summaries. Then, we iteratively train our summarization model on each single document to alleviate the computational complexity that arises when training neural summarization models on multiple documents (i.e., long sequences) at once. Experimental results on the Document Understanding Conferences (DUC) datasets show that our proposed approach sets a new state-of-the-art result in terms of various evaluation metrics.


Introduction
With the rapid growth of textual documents on the internet, accessing information from the web has become a challenging issue (Yao et al., 2017). Often users want the summary of a topic from various sources to fulfill their information needs (Feigenblat et al., 2017). The QF-MDS task deals with such problems where the goal is to summarize a set of documents to answer a given query.
In the QF-MDS task, the summaries generated by the summarizer can be either extractive or abstractive (Yao et al., 2017; Kulkarni et al., 2020). An extractive summarizer extracts relevant text spans from the source document(s), whereas an abstractive summarizer generates a summary in natural language that may contain some words which did not appear in the source document(s) (Rush et al., 2015; Nallapati et al., 2016; Nema et al., 2017). With the rising popularity of virtual assistants in recent years, there is a growing interest to integrate abstractive summarization capabilities in these systems for natural response generation (Nishida et al., 2019).
One major challenge for the QF-MDS task is that the datasets used for such tasks do not contain any labeled training data (Xu and Lapata, 2020). Therefore, neural summarization models that leverage supervised training cannot be used on these datasets. Note that for other related tasks (Allan et al., 2003; Liu et al., 2008; Miao et al., 2012), how to reduce the demands for labeling the data and how to leverage unlabeled data were also identified as major challenges. While using datasets similar to the target dataset as the training data for the QF-MDS task, we find that these datasets only contain multi-document gold summaries. However, the state-of-the-art transformer-based (Vaswani et al., 2017) summarization models (Liu and Lapata, 2019; Laskar et al., 2020a) cannot be used on long documents due to computational complexity (Beltagy et al., 2020; Zaheer et al., 2020). To tackle these issues, we propose a novel weakly supervised approach that utilizes distant supervision to generate the weak reference summary of each single document from the multi-document gold reference summaries. We train our model on each document with weak supervision and find that our proposed approach, which generates abstractive summaries, is very effective for the QF-MDS task. More concretely, we make the following contributions:
• First, to address the issue of unlabeled individual documents in a training document set, we utilize pre-trained sentence similarity models (Laskar et al., 2020b) to generate the weak reference summary of each individual document from the multi-document gold reference summaries.
• Second, to address the computational issue of training neural models on long documents (Zaheer et al., 2020; Beltagy et al., 2020), we propose an iterative approach that adopts a pre-trained single-document generic summarization model, leverages the effectiveness of fine-tuning such models for query focused abstractive summarization (Laskar et al., 2020a), and extends it to the QF-MDS task.
• Extensive experiments on the DUC 2005-07 datasets show that our proposed approach sets new state-of-the-art results in terms of various ROUGE scores. As a secondary contribution, we will make our source code publicly available here: https://github.com/tahmedge/WSL-DS-COLING-2020.

Related Work
Early work on multi-document summarization was mostly focused on generic summarization (Nayeem et al., 2018), whereas the amount of work for QF-MDS had been very limited (Yao et al., 2017). Due to the lack of training data for the QF-MDS task, most previous works were based on various unsupervised approaches that could only generate extractive summaries (Wang et al., 2008; Wan and Xiao, 2009; Haghighi and Vanderwende, 2009; Wan and Zhang, 2014; Yao et al., 2015; Zhong et al., 2015; Ma et al., 2016; Feigenblat et al., 2017; Roitman et al., 2020).
To generate the abstractive summaries for the QF-MDS task, Baumel et al. (2018) proposed a transfer learning technique to tackle the issue of no training data. They adopted the Pointer Generation Network (PGN) (See et al., 2017) pre-trained for the generic abstractive summarization task on a large dataset to predict the query focused summaries in the target dataset via modifying the attention mechanism of the PGN model. However, their model failed to outperform different extractive approaches in terms of various ROUGE scores (Feigenblat et al., 2017; Roitman et al., 2020).
Identifying sentences which are relevant to the query is an important step for the QF-MDS task. For this purpose, various approaches were utilized, such as counting word overlaps (Baumel et al., 2018) or the Cross-Entropy Method (Feigenblat et al., 2017). Though neural models based on supervised training have significantly outperformed various non-neural models for the answer selection task in recent years (Laskar et al., 2019; Laskar et al., 2020b), such neural models have not been effectively used for the QF-MDS task yet due to the absence of labeled data for the relevant sentences in the QF-MDS datasets.
Recently, Garg et al. (2019) showed that neural models pre-trained on a large Question Answering (QA) dataset could effectively select answers in other QA datasets. More recently, such pre-trained answer selection models were used by Xu and Lapata (2020) for the QF-MDS task. In their work, they utilized distant supervision from various QA datasets using the fine-tuned BERT (Devlin et al., 2019) model to filter out the irrelevant sentences from the documents. However, Baumel et al. (2018) showed that filtering sentences as an early step could lead to performance deterioration for the QF-MDS task. Thus, instead of applying distant supervision to filter out some sentences from the document, we apply it to generate the weak reference summary of each unlabeled document in our training datasets. Our proposed weakly supervised learning approach not only allows us to leverage the advantage of fine-tuning pre-trained generic summarization models (Laskar et al., 2020a), but also allows us to overcome the limitation of training neural models on long documents (Beltagy et al., 2020; Zaheer et al., 2020).

Our Proposed Approach
Suppose we have a query Q = q_1, q_2, ..., q_k containing k words and a set of N documents D = d_1, d_2, ..., d_N. For the QF-MDS task, the goal is to generate a summary S = s_1, s_2, ..., s_n containing n words from the document set D for the given query Q.
In Figure 1, we show the overall architecture of our proposed approach. Since there is no training data available for the QF-MDS task, we provide supervised training for our target dataset by using other QF-MDS datasets as the training data. However, the available QF-MDS datasets (Feigenblat et al., 2017) only contain the gold summaries generated by human experts from multiple documents and do not contain the gold summary of each individual document. Due to the limitations of using neural models on long documents (Beltagy et al., 2020; Zaheer et al., 2020), we propose an iterative approach to train our model on each document of a document set. For this purpose, we generate the weak reference summary of each document from the multi-document gold summaries using distant supervision to train our model for the QF-MDS task. Finally, we rank the generated query focused summaries via an answer selection model (Laskar et al., 2020b). In the following, we give a detailed description of our proposed approach.

Figure 1: An overview of our model that generates (a) the weak reference summary using RoBERTa for (b) iterative fine-tuning using BERTSUM to generate query focused abstractive summaries, which are then ranked by (c) RoBERTa. [CLS] and [SEP] are the special tokens used with inputs (Devlin et al., 2019).

Weakly Supervised Learning with Distant Supervision
To generate the weak reference summaries using distant supervision, we utilize the pre-trained RoBERTa model in two steps (see Figure 1a). First, we generate the weak extractive reference summary of each individual document using a RoBERTa sentence similarity model fine-tuned for the answer selection task. Then, we measure the similarity score between each sentence in the human-written (abstractive) multi-document gold summaries and each sentence in the weak extractive reference summary using a RoBERTa sentence similarity model fine-tuned for the paraphrase identification task. Based on the similarity score, we select the most relevant sentences from the gold reference summaries as the weak abstractive reference summary for each document. Below we describe these steps in detail.
RoBERTa Answer Selection Model: In this step, we first generate the weak extractive reference summary of each individual document d_k by measuring the relevance between the query Q_i and each sentence S_j in d_k. For this purpose, we adopt the RoBERTa sentence similarity model from Laskar et al. (2020b) for its impressive performance in the answer sentence selection task and fine-tune it on the QA-ALL dataset of MS-MARCO (Bajaj et al., 2016). The fine-tuned RoBERTa MS-MARCO model is then utilized on our training dataset to measure the similarity score between each sentence in the document and the query. Based on the similarity score, we select the top 3 most relevant sentences as the weak extractive reference summary, since extracting only 3 sentences was found effective in different extractive summarizers such as the LEAD-3 baseline as well as the BERTSUM EXT model (Liu and Lapata, 2019).
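The top-3 selection step above can be sketched as follows. This is a minimal illustration, not the paper's released code: `score` is a stand-in for the fine-tuned RoBERTa MS-MARCO answer selection model, and the function name is our own.

```python
from typing import Callable, List

def weak_extractive_summary(
    query: str,
    sentences: List[str],
    score: Callable[[str, str], float],
    top_k: int = 3,
) -> List[str]:
    """Select the top_k sentences most relevant to the query.

    `score(query, sentence)` is assumed to return a query-sentence
    similarity score, as the fine-tuned answer selection model would.
    """
    ranked = sorted(sentences, key=lambda s: score(query, s), reverse=True)
    # Keep the top_k sentences but preserve their original document order.
    chosen = set(ranked[:top_k])
    return [s for s in sentences if s in chosen]
```

Any scoring function with this signature works, so the sketch also applies to weaker similarity measures such as word overlap.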

RoBERTa Paraphrase Identification Model:
We provide distant supervision to generate the weak abstractive reference summary by replacing each sentence of the weak extractive reference summary (generated in the previous step) with the most similar sentence found in the multi-document gold summaries. For this purpose, we fine-tune the RoBERTa model for the paraphrase identification task on the MRPC dataset. Then, for each document d_k in a document set D_i, we utilize the fine-tuned RoBERTa MRPC paraphrase identification model to replace each sentence S_j in the weak extractive reference summary of d_k with the most similar sentence S_g found in the gold summaries of D_i.

Table 1: Performance comparisons in terms of (a) F1 and (b) Recall. '*' denotes extractive model.
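The replacement step described above can be sketched as follows; again a minimal illustration with hypothetical names, where `similarity` is a stand-in for the fine-tuned RoBERTa MRPC paraphrase identification model.

```python
from typing import Callable, List

def weak_abstractive_summary(
    extractive_summary: List[str],
    gold_sentences: List[str],
    similarity: Callable[[str, str], float],
) -> List[str]:
    """Replace each extractive sentence with the most similar sentence
    from the multi-document gold summaries.

    `similarity(a, b)` is assumed to score sentence-pair similarity,
    as a paraphrase identification model would.
    """
    return [
        max(gold_sentences, key=lambda g: similarity(s, g))
        for s in extractive_summary
    ]
```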

Iterative Fine-Tuning for Multi-Document Summarization
For the QF-MDS task, we adopt the transformer-based (Vaswani et al., 2017) BERTSUM model pre-trained for generic abstractive summarization on the CNN/DailyMail dataset (Liu and Lapata, 2019) to leverage the advantages of fine-tuning it for the query focused abstractive summarization task (Laskar et al., 2020a). However, BERTSUM was trained for the single-document summarization task by considering at most 512 tokens (Liu and Lapata, 2019; Beltagy et al., 2020; Zaheer et al., 2020). To address this issue in the multi-document scenario, we take an iterative approach (see Figure 1b). At first, we incorporate query relevance by concatenating the query with each document, similar to Laskar et al. (2020a). Then, we fine-tune BERTSUM using the weak abstractive reference summary to generate the query focused abstractive summary of each document in a document set. The sentences in the generated query focused summaries of each document set are then ranked using the fine-tuned RoBERTa MS-MARCO answer selection model to select the sentences that are most relevant to the query (see Figure 1c).
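The iterative inference loop above can be sketched as follows. This is a simplified sketch with hypothetical names: `summarize` stands in for the fine-tuned BERTSUM model and `score` for the fine-tuned RoBERTa MS-MARCO ranker.

```python
from typing import Callable, List

def query_focused_summaries(
    query: str,
    documents: List[str],
    summarize: Callable[[str], List[str]],   # stand-in for fine-tuned BERTSUM
    score: Callable[[str, str], float],      # stand-in for RoBERTa MS-MARCO
) -> List[str]:
    """Summarize each document separately, then rank all generated
    sentences in the document set by their relevance to the query."""
    generated: List[str] = []
    for doc in documents:
        # Concatenate the query with each document to inject query
        # relevance; summarizing one document at a time keeps each
        # input within the model's 512-token limit.
        generated.extend(summarize(query + " " + doc))
    return sorted(generated, key=lambda s: score(query, s), reverse=True)
```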

Experimental Setup
We now describe the datasets used in this paper, followed by the details of our implementation.
Datasets: We use the DUC 2005, 2006, and 2007 datasets (Feigenblat et al., 2017). Each document set is associated with a query, and the objective is to generate a summary containing at most 250 words from the document set based on the given query. Given the absence of training data, to evaluate our model on each year's dataset we use the datasets from the other two years for training. From each year's training data, we randomly selected 20% of the document sets for validation and used the rest for training.
Implementation: For the RoBERTa model, we used its Large version (Laskar et al., 2020b) and implemented it using HuggingFace's Transformers (Wolf et al., 2019). For fine-tuning the summarization model, we used the BERTSUM EXT-ABS model pre-trained on the CNN/DailyMail dataset (Liu and Lapata, 2019). While selecting the most relevant sentences as the final query focused summary, we used Trigram Blocking to reduce redundancy (Paulus et al., 2018). To fine-tune the BERTSUM model, we kept most parameters similar to the original work (Liu and Lapata, 2019) and ran 50 steps with a batch size of 200. Among these 50 steps, we selected the step for evaluation that performed best on the validation set. All of our models were run in multi-GPU settings using 4 NVIDIA V100 GPUs. We report the results based on both Recall and F1 scores in terms of ROUGE-1, ROUGE-2, and ROUGE-SU4 metrics (Lin, 2004). From now on, we denote ROUGE as R.

Table 2: Ablation test results based on the average R-1. '-' denotes deterioration from the original model.
without ... : ... (-4.24%), 41.01 (-4.27%), not significant (paired t-test, p ≤ .05)
without Weakly Supervised Learning: 40.01 (-6.37%), 40.12 (-6.35%), significant (paired t-test, p ≤ .05)
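The Trigram Blocking step mentioned above can be sketched as follows, assuming the common greedy formulation over ranked sentences (Paulus et al., 2018) and the 250-word DUC budget; the function name is our own illustration.

```python
from typing import List

def trigram_blocking(ranked_sentences: List[str], max_words: int = 250) -> List[str]:
    """Greedily select ranked sentences, skipping any sentence that
    shares a trigram with an already-selected one, until the word
    budget is reached. A minimal sketch of the redundancy filter."""
    def trigrams(sentence: str) -> set:
        tokens = sentence.lower().split()
        return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

    selected, seen, n_words = [], set(), 0
    for sent in ranked_sentences:
        grams = trigrams(sent)
        if grams & seen:
            continue  # redundant: shares a trigram with the summary so far
        if n_words + len(sent.split()) > max_words:
            break
        selected.append(sent)
        seen |= grams
        n_words += len(sent.split())
    return selected
```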

Results and Discussions
We now analyze the performance of our proposed model by comparing it with other models (see Table 1). We also perform an ablation test to further investigate its effectiveness (see Table 2). We denote our approach of using the Pre-trained models (RoBERTa and BERTSUM) for Query focused SUMmary generation via utilizing Weakly Supervised Learning with Distant Supervision (WSL-DS) as PQSUM WSL-DS. For performance comparisons, we use two baselines that do not utilize weak supervision and fine-tuning. Note that both of these baselines use the BERTSUM (Liu and Lapata, 2019) model pre-trained on the CNN/DailyMail dataset. One of them is pre-trained for extractive summarization: PQSUM EXT; the other is pre-trained for abstractive summarization: PQSUM ABS. These baselines generate the summaries of all documents in a document set, which are then ranked using RoBERTa MS-MARCO. Moreover, we compare our model with four recent works: i) CES-50 (Feigenblat et al., 2017), ii) RSA (Baumel et al., 2018), iii) QUERYSUM (Xu and Lapata, 2020), and iv) DUAL-CES (Roitman et al., 2020). (Roitman et al., 2020) in DUC 2007. In comparison to the abstractive RSA model (Baumel et al., 2018), we find that our model outperforms it on all datasets in terms of both R-1 and R-2 Recall, but fails to outperform it in R-SU4 scores. Moreover, we find based on a paired t-test (p ≤ .05) that weakly supervised learning significantly outperforms the baselines in terms of both Recall and F1.
Ablation Test: The results of our ablation test based on the average R-1 scores across all datasets are shown in Table 2. We find that the performance deteriorates if we exclude Distant Supervision by removing the RoBERTa MRPC model, as well as if Trigram Blocking is not used. Moreover, the performance is significantly degraded if the summary is generated by only ranking the sentences in the documents using the fine-tuned RoBERTa MS-MARCO model without utilizing Weakly Supervised Learning.
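The paired t-test used for the significance checks above compares per-document-set scores of two systems. As a minimal sketch (our own illustration, not the paper's evaluation code), the paired t-statistic can be computed as:

```python
import math
from typing import List

def paired_t_statistic(scores_a: List[float], scores_b: List[float]) -> float:
    """Paired t-statistic over matched per-document-set scores.

    The resulting t is compared against the critical value for n-1
    degrees of freedom at the chosen significance level (p <= .05 here);
    a library routine such as scipy.stats.ttest_rel returns the
    p-value directly.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences (n - 1 denominator).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)
```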

Conclusions and Future Work
In this paper, we propose a novel weakly supervised approach for the Query Focused Multi-Document Abstractive Summarization task to tackle the issue of no available labeled training data for such tasks. We also propose an iterative approach to address the computational problem that occurs while training neural models on long documents (Kitaev et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020). Experimental results on three datasets show that our proposed approach sets new state-of-the-art results in terms of various evaluation metrics. In the future, we will apply our models to more tasks, such as information retrieval applications (Huang and Hu, 2009; Huang et al., 2003; Yin et al., 2013; Huang et al., 2005), sentiment analysis (Liu et al., 2007; Yu et al., 2012), learning from imbalanced or unlabeled datasets (Liu et al., 2006; Bari et al., 2019; Bari et al., 2020), and automatic chart question answering (Kim et al., 2020).