UETrice at MEDIQA 2021: A Prosper-thy-neighbour Extractive Multi-document Summarization Model

This paper describes a system developed for the multiple-answer summarization challenge in the MEDIQA 2021 shared task, collocated with the BioNLP 2021 Workshop. We propose an extractive summarization architecture based on several scores and state-of-the-art techniques. We also present our novel prosper-thy-neighbour strategies to improve performance. Our model proved effective, achieving the best ROUGE-1/ROUGE-L scores and finishing as the shared task runner-up by ROUGE-2 F1 score (among 13 participating teams).


Introduction
A tremendous amount of biomedical documents is available on the Internet, and several search engines (e.g., Pubmed®, https://pubmed.ncbi.nlm.nih.gov/) and question-answering systems (e.g., CHiQA, https://chiqa.nlm.nih.gov/) have been developed for them. However, the results returned by these systems still contain a lot of noise and duplication, making it difficult for users without medical knowledge to quickly grasp the main content and find the necessary information. Hence, generating a shorter, condensed form with the important information would benefit many users, as it saves time while still conveying much useful information. This motivation has led to growing interest among the research community in developing automatic text summarization methods. The BioNLP-MEDIQA 2021 shared task (https://sites.google.com/view/mediqa2021) (Ben Abacha et al., 2021) aims to attract further research efforts in text summarization and its applications in medical Question-Answering (QA). The shared task is motivated by the need to develop relevant methods, techniques and gold standards for text summarization in the medical domain and to apply them to improve domain-specific QA systems. Task 2 - Summarization of Multiple Answers focuses on developing multi-document summarization approaches that can synthesize and compress information from answers to a medical question.
According to Radev et al. (2002), a summary is 'a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually, significantly less than that'. Automatic text summarization is the task of condensing the document(s) and generating a compressed summary that is shorter but preserves the key information content and overall meaning. A summary can be generated through extractive or abstractive approaches (or a hybrid of both). Typically, producing an abstractive summary requires advanced linguistic techniques to 'understand' the text and re-generate the summary in natural language from the useful information. To date, the research community has focused more on extractive summarization, which tries to achieve coherent and meaningful summaries in a simpler and faster way than the abstractive approach. Extractive summarization chooses important sentences (or phrases) from the original documents (without any modification) and merges them to generate a summary.
Our proposed model for the multi-answer summarization task follows the extractive approach: we select the sentences containing the most important information in the original answers. Our novel contributions are: (i) proposing question-driven scores to ensure that the summary answers the question; (ii) proposing Prosper-thy-neighbour (PtN) strategies, which strengthen the constraint between neighbouring sentences, to take advantage of paragraph information in the answers; (iii) combining several scores that have been successfully applied to summarization, including TF-IDF, Lexrank and Textrank, with optimized weights; and (iv) improving the Maximal Marginal Relevance (MMR) technique for multi-document summarization with BERT-based embeddings to improve performance.
The remainder of this paper is organized as follows: Section 2 gives a brief introduction to state-of-the-art related work. Section 3 describes the task data and our proposed model. Section 4 presents the experimental results and our discussion. Finally, we conclude the paper.

Related works
Since the early 1950s, various methods have been proposed for extractive summarization (Allahyari et al., 2017). Some of them are based on the idea of using scores to choose the most important phrases in the documents. Term Frequency-Inverse Document Frequency (TF-IDF) (Hovy et al., 1999; Christian et al., 2016) is a frequency-based score that detects important sentences by calculating the scores of their words. Lexrank (Erkan and Radev, 2004) and Textrank (Mihalcea and Tarau, 2004) are two graph-based methods that rank sentences/words using their degree centrality. Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998; Bennani-Smires et al., 2018) is one of the most well-known approaches for multi-document summarization. It is a diversity-based re-ranking method based on document similarities and can be used to remove redundancy in the summaries. Although encouraging results have been reported, most of these scores are applied individually. Since each score type makes its own unique contribution, combining them may help to improve performance. Hence, we propose an architecture that takes advantage of several weighted scores to calculate a final combined score.
With the advent of machine learning techniques in NLP, much research has tried to apply machine learning methods to extractive summarization, from Naive Bayes, decision trees and support vector machines (Gambhir and Gupta, 2017) to deep learning models. Most recently, Savery et al. (2020) improved the Bidirectional auto-regressive transformer (BART) with a question-driven approach; however, BART is better known for abstractive summarization, which is not discussed in depth in this paper.

Shared task data
The MEDIQA-AnS dataset (Savery et al., 2020) is used as the training set. The validation and test sets are summaries created by experts from the original answers generated by the CHiQA question-answering system. Table 1 gives our statistics on the given datasets (see Ben Abacha et al. (2021) for a detailed description of the shared task data).
An important observation is that answers often contain related sentences within a passage that together make an important 'point'. Some adjacent sentences are structured in a deductive manner (e.g., several explanatory sentences follow a stated claim) or an inductive one (e.g., the last sentence is the conclusion of the previous sentences). Extracting these whole pieces of text ensures a complete summary while enhancing fluency and natural-language resemblance. Our prosper-thy-neighbour strategies are proposed to take advantage of this characteristic.

Proposed model
The overall architecture of our Prosper-thy-Neighbour (PtN) summarization model is shown in Figure 1. It comprises four main phases: pre-processing, single-document summarization, multi-document summarization and post-processing.

Pre-processing
The pre-processing phase receives a question Q and a set of corresponding answers (documents) D = {d_i}, i = 1..n, as input. ScispaCy (Neumann et al., 2019), which is based on SpaCy (Honnibal et al., 2020), is used for the typical pre-processing steps (i.e., segmentation and tokenization) for biomedical, scientific and clinical text. We also construct two normalization modules: (i) coarse-grained normalization, applied to the answers only, which removes noise from the raw text (non-ASCII characters, HTML tags, duplicate spacing, etc.); and (ii) fine-grained normalization, which includes stop-word removal, lower-casing, stemming and full-form generation (Schwartz and Hearst, 2002) for biomedical abbreviations. Finally, BioBERT (Lee et al., 2020), which is designed for multiple biomedical text-mining tasks, is used for part-of-speech tagging, named entity/keyword recognition and embedding generation. BioBERT-based embeddings are 768-dimensional vectors used to calculate the similarity of words and sentences.

Figure 1: The proposed Prosper-thy-neighbour model (pre-processing; single-document scoring and neighbour boosting; multi-document merging of single-document summaries and re-calculation of final sentence scores; post-processing with Maximal Marginal Relevance (MMR)).
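To illustrate, the coarse-grained normalization step can be sketched with a few regular expressions. The function name and exact patterns below are our own assumptions, not the authors' implementation:

```python
import re

def coarse_normalize(text: str) -> str:
    """Coarse-grained normalization for raw answer text:
    strip HTML tags, non-ASCII characters and duplicate spacing."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop HTML tags
    text = text.encode("ascii", "ignore").decode()  # drop non-ASCII characters
    text = re.sub(r"\s+", " ", text)                # collapse duplicate spacing
    return text.strip()
```

For example, `coarse_normalize("<p>Aspirin®  helps</p>")` yields `"Aspirin helps"`.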

Single-answer extractive summarization
Using information from the pre-processing phase, the single-document extractive summarization phase generates a summary for every single answer. Our extractive summarization model determines which sentences are important to the document via sentence scoring.
Sentence scoring: Since it is difficult to identify the importance of sentences from a single point of view, we use three different types of scores: frequency-based, graph-based and question-driven scores.
Frequency-based score: Term Frequency-Inverse Document Frequency (TF-IDF) (Salton and McGill, 1986) is a statistical measure that reflects the importance of a word in a set of documents as a real number. The TF-IDF score of a word w contained in document d of document set D is denoted tfidf(w, d, D). We apply two rules to improve TF-IDF: (i) boosting the TF-IDF scores of keywords, and (ii) setting a TF-IDF score to 0 if it is lower than a pre-selected threshold. The TF-IDF score of a sentence is the sum of the TF-IDF scores of its component words.
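A self-contained sketch of this scoring scheme follows; it is our own minimal re-implementation, and the idf smoothing and default `boost`/`threshold` values are illustrative assumptions:

```python
import math
from collections import Counter

def tfidf_sentence_scores(docs, keywords=frozenset(), boost=2.0, threshold=0.05):
    """docs: list of documents; each document is a list of sentences,
    each sentence a list of tokens. Returns one score per sentence,
    in document order."""
    n_docs = len(docs)
    df = Counter()  # document frequency of each word
    for d in docs:
        for w in set(w for s in d for w in s):
            df[w] += 1
    scores = []
    for d in docs:
        tf = Counter(w for s in d for w in s)
        total = sum(tf.values())
        for s in d:
            sent_score = 0.0
            for w in s:
                # smoothed TF-IDF of word w in document d
                w_score = (tf[w] / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
                if w in keywords:
                    w_score *= boost   # rule (i): boost keyword scores
                if w_score < threshold:
                    w_score = 0.0      # rule (ii): zero out weak scores
                sent_score += w_score
            scores.append(sent_score)
    return scores
```

With `keywords={"aspirin"}`, sentences containing "aspirin" receive higher scores than in the unboosted run, while other sentences are unchanged.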
Graph-based scores are used to determine which sentences and words seem to be the core of a document. Lexrank and Textrank are two of the most well-known methods of this approach.
Lexrank (Erkan and Radev, 2004) computes sentence importance based on the concept of eigenvector centrality in a graph representation of the sentences. A document is considered a graph in which each node represents a sentence, and two nodes are connected by a weighted edge depending on the similarity of their corresponding sentences. Cosine similarity is used to calculate the similarity between two sentences x and y, represented by their n-dimensional TF-IDF vectors X and Y, respectively (n is the number of distinct tokens in the two sentences):

cosine(x, y) = (X · Y) / (‖X‖ × ‖Y‖)    (1)
To calculate the centrality of a node, we analyze the weights of its connected edges and the centrality of its adjacent nodes:

p(u) = d/n + (1 − d) × Σ_{v ∈ adj(u)} [cosine(u, v) / Σ_{z ∈ adj(v)} cosine(z, v)] × p(v)    (2)

where adj(u) is the set of nodes adjacent to u, n is the number of nodes and d is the damping factor. If a sentence is similar to many other sentences, it has higher centrality and is considered to have a certain ability to represent the other sentences.
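The Lexrank centrality computation can be sketched as a cosine-weighted power iteration over sparse TF-IDF vectors. This is our own minimal re-implementation; the damping value and similarity threshold below are illustrative:

```python
import math

def cosine(x, y):
    """Cosine similarity between two sparse TF-IDF vectors (dicts)."""
    dot = sum(v * y.get(t, 0.0) for t, v in x.items())
    nx = math.sqrt(sum(v * v for v in x.values()))
    ny = math.sqrt(sum(v * v for v in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

def lexrank(vectors, d=0.15, sim_threshold=0.1, iters=50):
    """Iterate p(u) = d/n + (1-d) * sum over adjacent v of
    cosine(u, v) / sum_z cosine(z, v) * p(v) until (near) convergence."""
    n = len(vectors)
    sim = [[cosine(a, b) for b in vectors] for a in vectors]
    # adjacency: edge only when similarity exceeds the threshold
    adj = [[v for v in range(n) if v != u and sim[u][v] > sim_threshold]
           for u in range(n)]
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for u in range(n):
            rank = 0.0
            for v in adj[u]:
                denom = sum(sim[z][v] for z in adj[v])
                if denom:
                    rank += sim[u][v] / denom * p[v]
            new_p.append(d / n + (1 - d) * rank)
        p = new_p
    return p
```

Two mutually similar sentences reinforce each other's centrality, while an isolated sentence keeps only the baseline score d/n.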
Textrank (Mihalcea and Tarau, 2004) is largely similar to Lexrank, but it calculates the centrality of terms instead of sentences:

p(w) = d/n + (1 − d) × Σ_{v ∈ adj(w)} [sim(w, v) / Σ_{z ∈ adj(v)} sim(z, v)] × p(v)    (3)

in which w is a term (token) node and sim is the similarity between the representations X and Y of two terms. In the PtN model, if a Textrank score is lower than a predefined threshold, we assign it to 0. The Textrank score of a sentence is the sum of the Textrank scores of its constituent terms.
Question-driven scores are used to give higher priorities to sentences that are related to the questions. These scores are proposed to focus on the answer summarization task, ensuring that the summary is a suitable answer to the question.
Question-similarity score uses BioBERT embeddings and Cosine similarity (Formula 1) to calculate the similarity between the question and each sentence in its answers. Formally, the question-similarity score qb(sentence) of a sentence is defined as:

qb(sentence) = cosine(BioBERT(question), BioBERT(sentence))    (4)

Keyword-based score is determined by the percentage of question keywords that appear in a sentence. Let K be the set of question keywords; the keyword-based score kw(sentence) of a sentence is defined as:

kw(sentence) = |K ∩ sentence tokens| / |K|    (5)

Scores combination: All scores are normalized into the range [0, 1] using Min-Max normalization. We then combine them into a final sentence score using optimized weights (see Formula 6).
score = w1 × tfidf + w2 × lexrank + w3 × textrank + w4 × qb + w5 × kw    (6)

in which w_i is the weight of each score. The weights are fine-tuned on the validation set.
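The normalization and weighted combination can be sketched as below (helper names are our own; any number of criteria and weights can be passed):

```python
def min_max(values):
    """Min-Max normalization into [0, 1]; constant lists map to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

def combine_scores(score_lists, weights):
    """score_lists: one per-sentence score list per criterion
    (e.g., TF-IDF, Lexrank, Textrank, qb, kw); weights: one w_i each.
    Returns the weighted sum of the normalized scores per sentence."""
    normalized = [min_max(s) for s in score_lists]
    n = len(score_lists[0])
    return [sum(w * norm[i] for w, norm in zip(weights, normalized))
            for i in range(n)]
```

For instance, `combine_scores([[1, 2, 3], [3, 2, 1]], [0.5, 0.5])` returns `[0.5, 0.5, 0.5]`, since the two criteria normalize to mirror images of each other.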
Prosper-thy-neighbour strategies: As described in Section 3.1, an important sentence may need some adjacent sentences to clarify or support it; answers therefore often contain contiguous segments of sentences that make important 'points'. Since the aforementioned scores do not consider the neighbours of a sentence, our prosper-thy-neighbour strategies are proposed to take advantage of this characteristic. There are three strategies: cluster-boosting, relative-boosting and centre-boosting.
Cluster-boosting: We calculate the average scores of n continuous sentences (n = 3, 4, 5) as cluster scores, then select the top-k clusters with the highest average scores. Each sentence's score is set to its highest cluster score; sentences not selected in any cluster are assigned a score of 0.
Relative-boosting is performed in three steps:
• Step 1: Find the top-n highest-scoring sentences, keeping their original order.
• Step 2: For consecutive selected sentences, let L be the position of the preceding sentence and R the position of the following sentence. If R − L + 1 ≤ k (k is predefined), Step 3 is executed.
• Step 3: Let score_i be the score of the i-th sentence. The final scores final_i of all sentences with positions between L and R are updated by Formula 7.
Centre-boosting: Let score_i be the score of the i-th sentence. Its final score final_i is updated by Formula 8, in which n is the number of sentences, and L and R are the numbers of sentences that influence the current sentence i in the two directions, left and right. With centre-boosting, an important sentence strongly affects its adjacent sentences.
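The idea of an important sentence lifting its neighbours can be illustrated with the sketch below. Note that this is a hypothetical distance-decayed variant of our own devising, not the paper's exact update formula; the decay factor and window sizes are assumptions:

```python
def centre_boost(scores, left=2, right=2, decay=0.5):
    """Hypothetical centre-boosting sketch: each sentence receives, in
    addition to its own score, a fraction of its neighbours' scores that
    decays with distance; `left`/`right` bound the neighbourhood."""
    n = len(scores)
    final = list(scores)
    for i in range(n):
        for j in range(max(0, i - left), min(n, i + right + 1)):
            if j != i:
                final[i] += scores[j] * decay ** abs(i - j)
    return final
```

With `centre_boost([0.0, 1.0, 0.0], left=1, right=1, decay=0.5)`, the high-scoring middle sentence lifts both neighbours to 0.5.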
However, with these prosper-thy-neighbour strategies, the selected neighbour sentences can bring redundant information, i.e., we may keep too many sentences to the left/right of an important sentence. Those redundancies can be cut off in the post-processing phase (Section 3.2.4).

Ranking and Filtering Sentences
We use the final scores boosted by the prosper-thy-neighbour strategy to rank the sentences. There are several ways to choose sentences for the single-document extractive summary: taking the top-n or top-p% of sentences, or using a threshold to filter out unimportant sentences. In the proportion- and threshold-based approaches, the number of sentences depends on the document length and the sentence scores, which could cause an unexpected bias in the subsequent multi-document summarization phase. Based on the experimental results on the validation set, we therefore fix the number of sentences selected from each document.

Multi-answer extractive summarization
The extractive single-answer summaries from the previous phase are merged into a single document. Because the previous phase chooses an equal number of sentences for every answer, there may be redundant sentences. Since the current sentence scores were computed on separate documents, we re-calculate them on the merged document using the combined score described in Section 3.2.2. A filtering step then removes the lowest-scoring sentences.
Maximal Marginal Relevance (MMR) (Carbonell and Goldstein, 1998) is also used to reduce redundancy while maintaining query relevance. MMR selects appropriate sentences from the merged document by combining the relevance and diversity concepts in a controllable way. Let S_i be the i-th sentence; its MMR score is calculated from the similarities between S_i, the other sentences of the merged document D and the question Q:

MMR(S_i) = λ × sim(S_i, Q) − (1 − λ) × max_{S_j ∈ D, j ≠ i} sim(S_i, S_j)    (9)

The ratio λ controls the trade-off between similarity to the question and duplication with other sentences. BioBERT is used to represent the sentences and the question, and Cosine similarity is used to calculate the similarities. We use the MMR score to discard duplicated and question-irrelevant sentences, i.e., we remove the m sentences with the lowest MMR scores.
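Greedy MMR selection can be sketched as follows (a standard re-implementation under our own parameter names; the similarities would come from BioBERT embeddings in practice):

```python
def mmr_select(sim_to_question, sim_matrix, k, lam=0.7):
    """Greedily pick k sentence indices, trading question relevance
    (weight lam) against redundancy with already-selected sentences
    (weight 1 - lam)."""
    selected, candidates = [], list(range(len(sim_to_question)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((sim_matrix[i][j] for j in selected), default=0.0)
            return lam * sim_to_question[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```

Given two near-duplicate relevant sentences and one diverse but less relevant sentence, the second pick skips the duplicate in favour of the diverse sentence.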

Post-processing
For each segment of continuously selected sentences, we find the position of the most important sentence, i.e., the one with the highest combined score. Then, any other sentence in the segment whose distance to that important sentence exceeds a predefined parameter k is eliminated from the final multi-document extractive summary.
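This trimming rule can be sketched as below (function and parameter names are our own):

```python
def trim_segment(positions, scores, k):
    """positions: sorted sentence indices of one contiguous selected segment;
    scores: combined score per sentence index. Keep only the sentences
    within distance k of the segment's top-scoring sentence."""
    centre = max(positions, key=lambda p: scores[p])
    return [p for p in positions if abs(p - centre) <= k]
```

For example, with scores `[0.1, 0.9, 0.2, 0.3, 0.05]` and k = 2, the segment `[0, 1, 2, 3, 4]` is trimmed to `[0, 1, 2, 3]`, dropping the sentence more than two positions away from the top-scoring one.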

Evaluation metrics
We adopt the official task evaluation with ROUGE scores (Lin and Och, 2004), including ROUGE-1, ROUGE-2 and ROUGE-L. ROUGE-n Precision (P), Recall (R) and F1 between the predicted summary and the reference summary are calculated as in Formulas 10, 11 and 14, respectively. Choosing correct sentences helps to increase ROUGE-n R and P.

ROUGE-n P = |Matched n-grams| / |Predicted summary n-grams|    (10)
ROUGE-n R = |Matched n-grams| / |Reference summary n-grams|    (11)
ROUGE-L P = Length of the LCS / |Predicted summary tokens|    (12)
ROUGE-L R = Length of the LCS / |Reference summary tokens|    (13)
F1 = 2 × P × R / (P + R)    (14)

ROUGE-L recall (R), precision (P) and F1 are calculated as in Formulas 12, 13 and 14, respectively. ROUGE-L uses the Longest Common Subsequence (LCS) between the predicted and reference summaries, normalized by the number of tokens in each summary.
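ROUGE-n can be computed from the multiset overlap of n-grams, as in this minimal sketch (our own re-implementation, without the stemming and stopword options of the official toolkit):

```python
from collections import Counter

def rouge_n(pred_tokens, ref_tokens, n=2):
    """ROUGE-n precision, recall and F1 from n-gram multiset overlap."""
    pred = Counter(tuple(pred_tokens[i:i + n])
                   for i in range(len(pred_tokens) - n + 1))
    ref = Counter(tuple(ref_tokens[i:i + n])
                  for i in range(len(ref_tokens) - n + 1))
    matched = sum((pred & ref).values())   # clipped n-gram matches
    p = matched / max(sum(pred.values()), 1)
    r = matched / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

For instance, `rouge_n(["the", "cat", "sat"], ["the", "cat", "ran"], 2)` returns `(0.5, 0.5, 0.5)`: one of two bigrams matches on each side.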

Comparative models
We use the official results of the MEDIQA shared task to compare with the other participating teams on the multi-answer summarization task. For a detailed evaluation of the effectiveness of the single-answer summarization phase, we also compare with related works:
• Lead-3: the first three sentences of an article are taken as the summary.
• k-random sentences: k random sentences are selected as the summary.
• k-best ROUGE: the k sentences with the highest ROUGE-L score relative to the question are selected.
• Bidirectional long short-term memory (BiLSTM) network (Hochreiter and Schmidhuber, 1997): the most relevant sentences in an article are selected by a BiLSTM.
• Pointer-generator network (See et al., 2017): a hybrid sequence-to-sequence attention model that creates summaries by both copying text from the source documents and generating new text.
• Bidirectional auto-regressive transformer (BART) (Savery et al., 2020): a transformer-based encoder-decoder model improved with a question-driven approach.
The results of these comparative models are taken from experimental results reported in Savery et al. (2020).

Task final results and comparison
Based on the validation set experiments, the number of sentences selected in single-answer summarization is 7 per answer. In the multi-answer summarization phase, the score-based filter selects the top-20 sentences in the merged document, and MMR then removes the 5 lowest-scoring sentences. Therefore, our multi-answer summaries have 15 sentences (or fewer, depending on the length of the original answers). Post-processing with distance value k = 3 often removes 2-4 sentences, so the final outputs typically have ∼13 sentences. Since both cluster-boosting and relative-boosting showed lower F1-score performance on the validation set, we use the centre-boosting strategy in our optimal model. Table 2 shows the official shared task results of the top-5 competitors. ROUGE-2 F1 is used as the main metric to rank the participating teams. We also show several other evaluation metrics for detailed results: ROUGE-1 F1, ROUGE-L F1, HOLMS F1 and BERT-based F1. We are the runner-up on the leaderboard, with a ROUGE-2 F1 of 0.504 (0.004 less than the top-ranked team). However, our ROUGE-1 F1 and ROUGE-L F1 are the highest of all participating teams. Table 3 shows the performance of our model and the comparative models at the single-answer level. Because the comparative models report results on the training dataset, all results here are reported on the training dataset. To ensure a fair comparison, we report our model's results both with and without the post-processing phase. The results show that our model outperforms all comparative models.

Contribution of model components
We study the contribution of each model component to system performance by ablating each of them in turn and evaluating the resulting model on the validation set. Validation data are used for this evaluation because we also used them to optimize the model's hyperparameters. We compare these experimental results with the full-system results and illustrate the changes in ROUGE-2 F1 in Figure 2. The changes show that all model components help boost system performance (in terms of increments in ROUGE-2 F1). The contribution, however, varies among components: TF-IDF and MMR make the biggest contributions, while Lexrank/Textrank bring the smallest.
The prosper-thy-neighbour strategy also demonstrates its effectiveness in improving ROUGE-2 F1. Centre-boosting seems to be the most suitable strategy for this task, since the results decrease dramatically if we replace it with cluster-boosting or relative-boosting.

We also investigate the change in results at different compression ratios. Figure 3 shows the change in ROUGE-2 P, R and F1 on the validation set when taking 2-20 sentences into the summary (excluding the post-processing step). We observe a trade-off between P and R as the number of sentences increases. F1 achieves its best result at 15 sentences, due to the balance between P and R. Therefore, we choose this configuration for our official runs on the test set.

Errors analysis
To further evaluate the performance of the proposed system, we analyzed the results of the best model on the validation set. Table 4 provides some examples of the model's problems and their effects. Firstly, because we use a fixed, statistically derived maximum number of output sentences, we ran into problems with documents that are too long or too short. Question #56 is an example of redundancy in the output summary: there are only 5 important sentences, but our model keeps a fixed 13 sentences. Conversely, in Question #91, the answers to 'How can I stop being allergic to caffeine?' are summarized in 23 sentences, so many relevant sentences were filtered out to meet the fixed output size.
Although we combined many different ranking methods for tokens and sentences, some final scores did not meet our expectations. The frequency-based score (TF-IDF) fails in Question #82, in which the token 'Hirschsprung' is over-weighted due to its repeated occurrence. In addition, popular keywords like 'treatment' and 'medicine' receive too little weight. As a result, in Question #19 about 'the cure for pulsatile tinnitus', all of the sentences related to treatment and medicine were filtered out.
Some other issues relate to the driving question, as illustrated in Questions #22 and #36. In the first example, the question analyzer failed to extract the keyword 'safe'; for this reason, the summarization phase went in the wrong direction, producing content related only to 'defibrillator'. In the second, the proposed model did not focus on the driving question, so the summary does not contain the desired information.
Besides the problems related to the model components, we also noticed some problems related to the input data, for which Question #36 is an example. The question is about 'herbal medicines for rheumatoid arthritis', while the CHiQA answers do not mention this topic. Therefore, our model, like other machine learning models, does not have enough linguistic information to summarize these documents. Some other errors seem attributable to our model's limitations: adding irrelevant sentences (decreasing precision, as in Question #28) and removing important sentences during post-processing (decreasing recall, as in Question #78). We list here some highlighted problems to prioritize in future research: (i) the neighbour-boosting method needs to be improved to increase the weight of only the related sentences instead of all neighbouring sentences; (ii) the post-processing rules need to be stricter to avoid eliminating important sentences.

Conclusions
This paper presents a systematic study of our extractive approach to MEDIQA 2021 - Task 2: Multi-answer summarization. We combined and optimized several scoring criteria, including TF-IDF, Lexrank, Textrank, question-similarity, keyword-based and MMR scores. We also developed a strategy called Prosper-thy-neighbour to take advantage of adjacent sentences in the answers. The proposed model performs competitively, finishing as the runner-up of the shared task: our best ROUGE-2 F1 is 0.504, comparable to the 0.507 of the highest-ranked system.
Experiments were also carried out to verify the rationality and impact of the model components and the compression ratio. The results demonstrate the contribution and robustness of all techniques and hyper-parameters. An error analysis was conducted to identify the sources of errors; the evidence points to some imperfections in the sentence selection strategy, the ranking score combination and the question analyzer. Our proposed system is extensible in several ways: applying machine learning models, deeper question analysis, sentence clustering, etc. We will release our source code in a public repository to support the reproducibility of our work and to facilitate other related studies.