Coarse-to-Fine Query Focused Multi-Document Summarization

We consider the problem of better modeling query-cluster interactions to facilitate query focused multi-document summarization. Due to the lack of training data, existing work relies heavily on retrieval-style methods for assembling query relevant summaries. We propose a coarse-to-fine modeling framework which employs progressively more accurate modules for estimating whether text segments are relevant, likely to contain an answer, and central. The modules can be independently developed and leverage training data if available. We present an instantiation of this framework with a trained evidence estimator which relies on distant supervision from question answering (where various resources exist) to identify segments which are likely to answer the query and should be included in the summary. Our framework is robust across domains and query types (i.e., long vs. short) and outperforms strong comparison systems on benchmark datasets.


Introduction
Query Focused Multi-Document Summarization (QFS; Dang 2006) aims to create a short summary from a set of documents that answers a specific query. It has various applications in personalized information retrieval and recommendation engines where search results can be tailored to an information need (e.g., a user might be looking for an overview summary or a more detailed one which would allow them to answer a specific question).
Neural approaches have become increasingly popular in single-document text summarization (Nallapati et al., 2016; Paulus et al., 2018; Li et al., 2017b; See et al., 2017; Narayan et al., 2018; Gehrmann et al., 2018), thanks to the representational power afforded by deeper architectures and the availability of large-scale datasets containing hundreds of thousands of document-summary pairs (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018). Unfortunately, such datasets do not exist in QFS, and one might argue it is unrealistic to expect that they will ever be created for millions of queries, across different domains and languages. In addition to the difficulties in obtaining training data, another obstacle to the application of end-to-end neural models is the size and number of source documents, which can be very large. It is practically infeasible (given memory limitations of current hardware) to train a model which encodes all of them into vectors and subsequently generates a summary from them.
In this paper we propose a coarse-to-fine modeling framework for extractive QFS which incorporates a relevance estimator for retrieving textual segments (e.g., sentences or longer passages) associated with a query, an evidence estimator which further isolates segments likely to contain answers to the query, and a centrality estimator which finally selects which segments to include in the summary. The vast majority of previous work (Wan et al., 2007;Wan, 2008;Wan and Xiao, 2009;Wan and Zhang, 2014) creates summaries by ranking textual segments (usually sentences) according to their relationship (e.g., similarity) to other segments and their relevance to the query. In other words, relevance and evidence estimation are subservient to estimating the centrality of a segment (e.g., with a graph-based model). We argue that disentangling these subtasks allows us to better model the query and specialize the summaries to specific questions or topics (Katragadda and Varma, 2009). A coarse-to-fine approach is also expedient from a computational perspective; at each step the model processes a decreasing number of segments (rather than entire documents), and as a result is insensitive to the original input size and more scalable.
Our key insight is to treat evidence estimation as a question answering task where a cluster of potentially relevant documents provides support for answering a query (Baumel et al., 2016). Advantageously, we are able to train the evidence estimator on existing large-scale question answering datasets (Rajpurkar et al., 2016; Joshi et al., 2017; Yang et al., 2018), alleviating the data paucity problem in QFS. Existing QFS systems (Wan et al., 2007; Wan, 2008; Wan and Xiao, 2009; Wan and Zhang, 2014) employ classic retrieval techniques (such as TF-IDF) to estimate the affinity between query-sentence pairs. Such techniques can handle short keyword queries, but are less appropriate in QFS settings where query narratives can be long and complex. We argue that a trained evidence estimator might be better at performing semantic matching (Guo et al., 2016) between queries and document segments. To this effect, we experiment with two popular QA settings, namely answer sentence selection (Heilman and Smith, 2010; Yang et al., 2015) and machine reading comprehension (Rajpurkar et al., 2016), which operates over passages rather than isolated sentences. In both cases, our evidence estimators take advantage of powerful pre-trained encoders such as BERT (Devlin et al., 2019) to better capture semantic interactions between queries and text units.
Our contributions in this work are threefold: we propose a coarse-to-fine model for QFS which, we argue, allows us to introduce trainable components taking advantage of existing datasets and pre-trained models; we capitalize on the connections of QFS with question answering and propose different ways to effectively estimate the query-segment relationship; we provide experimental results on several benchmarks which show that our model consistently outperforms strong comparison systems across domains (news articles vs. medical text) and query types (long narratives vs. keywords).

Related Work
Existing research on query-focused multi-document summarization largely relies on extractive approaches, where systems usually take as input a set of documents and select the sentences most relevant to the query for inclusion in the summary.
In Figure 1(a), we provide a sketch of classic centrality-based approaches which have generally shown strong performance in QFS. Under this framework, all sentences within a document cluster, together with their query relevance, are jointly considered in estimating centrality. A variety of approaches have been proposed to enhance the way relevance and centrality are estimated, ranging from incorporating topic-sensitive information (Wan, 2008; Badrinath et al., 2011; Xu and Lapata, 2019), predictions about information certainty (Wan and Zhang, 2014), and manifold-ranking algorithms (Wan et al., 2007; Wan and Xiao, 2009; Wan, 2009), to Wikipedia-based query expansion (Nastase, 2008). More recently, Li et al. (2015) estimate the salience of text units within a sparse-coding framework by additionally taking into account reader comments (associated with news reports). Li et al. (2017a) use a cascaded neural attention model to find salient sentences, whereas in follow-on work Li et al. (2017b) employ a generative model which maps sentences to a latent semantic space while a reconstruction model estimates sentence salience. There are also feature-based approaches achieving good results by optimizing sentence selection under a summary length constraint (Feigenblat et al., 2017).
In contrast to previous work, our proposal does not simultaneously perform segment selection and query matching. We introduce a coarse-to-fine approach that incorporates progressively more accurate components for selecting segments to include in the summary, making model performance relatively insensitive to the number and size of input documents. Drawing inspiration from recent work on QA, we take advantage of existing datasets in order to reliably estimate the relationship between the query and candidate segments. We focus on two QA subtasks which have attracted considerable attention in the literature, namely answer sentence selection, which aims to extract answers from a set of pre-selected sentences (Heilman and Smith, 2010; Yao et al., 2013; Yang et al., 2015), and machine reading comprehension (Rajpurkar et al., 2016; Welbl et al., 2018; Yang et al., 2018), which aims at answering a question after processing a short text passage (Chen, 2018). QA and QFS are related but ultimately different tasks. QA aims at finding the best answer in a span or sentence, while QFS extracts a set of sentences based on user preferences and the content of the input documents under a length budget (Wan, 2008; Wan and Zhang, 2014). QA questions are often short and fact-based while QFS narratives can be longer and more complex (see the example in Section 3), and as a result simply localizing an answer within a cluster is not optimal. With regard to evidence estimation, we adopt pretrained BERT (Devlin et al., 2019) which is further fine-tuned with distant signals from question answering.

Figure 1: Classic (a) and proposed framework (b) for query-focused summarization. The classic approach involves a relevance estimator nested within a summarization module while our framework takes document clusters as input, and sequentially processes them with three individual modules (relevance, evidence, and centrality estimators). The blue circles indicate a coarse-to-fine estimation process from original articles to final summaries where modules gradually discard segments (i.e., sentences or passages).

Problem Formulation
Let $Q$ denote an information request and $D = \{d_1, d_2, \ldots, d_M\}$ a set of topic-related documents. It is often assumed (e.g., in DUC competitions) that $Q$ consists of a short title (e.g., Amnesty International) highlighting the topic of interest, and a query narrative which is considerably longer and more detailed (e.g., What is the scope of operations of Amnesty International and what are the international reactions to its activities?).
We illustrate our proposed framework in Figure 1(b). We first decompose documents into segments, i.e., passages or sentences, and retrieve those which are most relevant to query Q (Relevance Estimator). Then, a trained estimator quantifies the semantic match between selected segments and the query (Evidence Estimator) to further isolate segments for consideration in the output summary (Centrality Estimator). We propose two variants of our evidence estimator; a context agnostic variant infers evidence scores over individual sentences, while a context aware one infers evidence scores for tokens within a passage which are further aggregated into sentence-level evidence. Passages might allow for semantic relations to be estimated more reliably since neighboring context is also taken into account.

Relevance Estimator
Our QFS system operates over documents within a cluster which we segment into sentences. The latter serve as input to the context agnostic evidence estimator. For the context aware variant, we obtain passages with a sliding window over continuous sentences in the same document.
During inference, we first retrieve the top $k^{\mathrm{IR}}$ answer candidates (i.e., sentences or passages) which are subsequently processed by our evidence estimator. We do this following an adaptive method that allows for a variable number of segments to be selected for each query. Specifically, for the $i$th query-cluster pair, we first rank all segments in the cluster based on term frequency with respect to the query, and determine $k_i^{\mathrm{IR}}$ such that the cumulative relevance score of the retrieved segments reaches a fixed threshold $\theta \in [0, 1]$. Formally, $k_i^{\mathrm{IR}}$, the number of retrieved segments, is given by:

$$k_i^{\mathrm{IR}} = \max \Big\{ k : \sum_{j=1}^{k} r_{i,j} < \theta \Big\}$$

where $r_{i,j}$ is the relevance score for segment $j$ (normalized over segments in the $i$th cluster). Although we adopt term frequency as our relevance estimator, there is nothing in our framework which precludes the use of more sophisticated retrieval methods (Dai and Callan, 2019; Akkalyoncu Yilmaz et al., 2019). We investigated approaches based on term frequency-inverse sentence frequency (Allan et al., 2003) and BM25 (Robertson et al., 2009); however, we empirically found that they are inferior, having a bias towards shorter segments which are potentially less informative for summarization.
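The adaptive cut-off can be illustrated with a minimal Python sketch. It assumes pre-tokenized segments, scores a segment by how often query terms occur in it, and keeps the highest-ranked segments for as long as their cumulative normalized score stays below θ. The function names, the exact stopping rule, and the choice to always keep at least one segment are assumptions made for illustration, not details taken from the system described here.

```python
from collections import Counter


def term_frequency_scores(query_tokens, segments):
    """Score each tokenized segment by how often query terms occur in it."""
    query_terms = set(query_tokens)
    scores = []
    for segment in segments:
        counts = Counter(segment)
        scores.append(float(sum(counts[t] for t in query_terms)))
    return scores


def adaptive_top_k(scores, theta=0.75):
    """Indices of the top-ranked segments kept by the threshold rule.

    Scores are normalized over the cluster; segments are added in rank order
    while the cumulative normalized score stays below theta. At least one
    segment is kept (an assumption of this sketch).
    """
    total = sum(scores)
    if total == 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    selected, cumulative = [], 0.0
    for j in ranked:
        r = scores[j] / total
        if selected and cumulative + r >= theta:
            break
        selected.append(j)
        cumulative += r
    return selected
```

Because the cut-off depends on how the score mass is distributed, a cluster with a few dominant segments yields a small candidate set, while a flatter distribution yields a larger one.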

Evidence Estimator
We argue that relevance matching is not sufficient to capture the semantics expressed in the query narrative and its relationship to the documents in the cluster. We therefore leverage distant supervision signals from existing QA datasets to train our evidence estimator and use the trained estimators to rerank answer candidates selected from the retrieval module. For the $i$th cluster, we select the top $\min\{k^{\mathrm{QA}}, k_i^{\mathrm{IR}}\}$ candidates as answer evidence (where $k^{\mathrm{QA}}$ is tuned on the development set).
Sentence Selection Let $Q$ denote a query (in practice a sequence of tokens) and $\{S_1, S_2, \ldots, S_N\}$ the set of candidate answers (also token sequences) obtained from the retrieval module. Our learning objective is to find the correct answer(s) within this set. We concatenate query $Q$ and candidate sentence $S$ into a sequence [CLS], $Q$, [SEP], $S$, [SEP] to serve as input to a BERT encoder (we pad each sequence in a minibatch to $L$ tokens). The [CLS] vector $t_i$ of the $i$th sequence serves as input to a single layer neural network to obtain the distribution over positive and negative classes:

$$p_c^{(i)} = \frac{1}{Z} \exp\left(t_i^{\top} W_{:,c}\right)$$

where $Z = \sum_{c'} \exp\left(t_i^{\top} W_{:,c'}\right)$ and matrix $W \in \mathbb{R}^{d \times 2}$ is a learnable parameter. We use a cross entropy loss where 1 denotes that a sentence contains the answer (and 0 otherwise):

$$\mathcal{L} = -\sum_{i} \log p_{y^{(i)}}^{(i)}$$

where $y^{(i)} \in \{0, 1\}$ is the label of the $i$th sequence. We treat the probability of the positive class as evidence score $q = p_1^{(i)} \in (0, 1)$ and use it to rank all retrieved segments for each query.
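As a small illustration of the classification head described above, the following Python sketch computes the two-class softmax from a [CLS] vector, the cross entropy over a batch, and the resulting evidence score. The BERT encoder itself is out of scope here: the [CLS] vector and the weight matrix are plain lists, and the function names are hypothetical.

```python
import math


def class_probabilities(cls_vector, W):
    """Softmax over the negative (0) and positive (1) class.

    W has one row per hidden dimension and two columns, i.e. shape d x 2.
    """
    logits = [sum(t * row[c] for t, row in zip(cls_vector, W)) for c in range(2)]
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    Z = sum(exps)
    return [v / Z for v in exps]


def cross_entropy(probabilities, labels):
    """Summed negative log-probability of the gold class of each example."""
    return -sum(math.log(p[y]) for p, y in zip(probabilities, labels))


def evidence_score(cls_vector, W):
    """Evidence score q: the probability of the positive class."""
    return class_probabilities(cls_vector, W)[1]
```

Ranking retrieved sentences then amounts to sorting them by `evidence_score` in descending order.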
Span Selection A span selection model allows us to capture more faithfully the answer, its local context and their interactions. Again, let $Q$ denote a query token sequence and $P$ a passage token sequence. Our training objective is to find the correct answer span in $P$. Similar to sentence selection, we concatenate the query $Q$ and the passage $P$ into a sequence [CLS], $Q$, [SEP], $P$, [SEP] and pad it to serve as input to a BERT encoder. Let $T = [t_i]_{i=1}^{N}$ denote the contextualized vector representation of the entire sequence obtained from BERT. We feed $T$ into two separate dense layers to predict probabilities $p^S$ and $p^E$:

$$p^S = \mathrm{softmax}(T w^S), \qquad p^E = \mathrm{softmax}(T w^E)$$

where $w^S$ and $w^E$ are two learnable vectors denoting the beginning and end of the (answer) span, respectively. During training we optimize the log-likelihood of the correct start and end positions. For passages without any correct answers, we set these to 0 and default to the [CLS] position.
At inference time, to allow comparison of results across passages, we remove the final softmax layer over different answer spans. Specifically, we first calculate the (unnormalized) start and end scores for all tokens in a sequence:

$$u^S = \exp(T w^S), \qquad u^E = \exp(T w^E)$$

We then collect sentence scores from token scores as follows. For each sentence starting at token $i$ and ending at token $j$, we obtain score matrix $\tilde{Q}$ via:

$$S = u^S_{i:j} \left(u^E_{i:j}\right)^{\top}, \qquad \tilde{Q} = \tanh(A \odot S)$$

where we collect all possible span scores within a sentence in matrix $S$, and $S_{i',j'}$ denotes the span score from token $i'$ to token $j'$ ($i \le i' < j' \le j$). Matrix $A$ is an upper triangular matrix masking all illegitimate spans whose end comes before the start. The tanh function scales the magnitude of extreme scores (e.g., scores over 100 or under 0.01), as a means of reducing the variance of $\tilde{Q}$. Finally, we use max pooling to obtain a scalar score $q$:

$$q = \max_{i', j'} \tilde{Q}_{i',j'}$$

It is possible to produce multiple evidence scores for the same sentence since we use overlapping passages; we select the score with the highest value in this case.
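The aggregation from token scores to a sentence score can be sketched in a few lines of Python. The sketch assumes that the unnormalized start and end scores are exponentiated logits, so that a span score is the product of its start and end scores, and it treats single-token spans as legitimate; both points are assumptions of this illustration, as are the function names.

```python
import math


def sentence_evidence(start_logits, end_logits):
    """Evidence score of one sentence from its tokens' start and end logits."""
    u_start = [math.exp(v) for v in start_logits]
    u_end = [math.exp(v) for v in end_logits]
    n = len(u_start)
    best = 0.0  # masked (illegitimate) spans contribute tanh(0) = 0
    for a in range(n):
        for b in range(a, n):  # upper-triangular mask: the end is never before the start
            span_score = u_start[a] * u_end[b]
            best = max(best, math.tanh(span_score))  # tanh squashing, then max pooling
    return best


def merge_passage_scores(scored_sentences):
    """Keep the highest score for sentences that occur in overlapping passages.

    scored_sentences is an iterable of (sentence_id, score) pairs.
    """
    best = {}
    for sentence_id, score in scored_sentences:
        best[sentence_id] = max(score, best.get(sentence_id, score))
    return best
```

Since tanh is monotonic, pooling after squashing selects the same span as pooling before it; the squashing matters only for the magnitude of the score that is passed on to later stages.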
Ensemble Selection We can also build an ensemble by linearly interpolating evidence scores from the two estimators based on sentence selection and span selection. Let $(E^S, q^S)$ and $(E^P, q^P)$ denote the selected sentence sets and their evidence scores produced by the sentence selection estimator and the span selection estimator, respectively. We obtain the ensemble score for a sentence $e$ by linearly interpolating its two evidence scores, where the interpolation coefficient was set to $\mu = 0.9$.
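A linear interpolation of this kind might look as follows in Python. Two choices in the sketch are assumptions rather than details given in the text: the coefficient µ is applied to the sentence selection score (with 1 − µ on the span-based score), and a sentence selected by only one estimator receives a score of 0 from the other.

```python
def ensemble_scores(sentence_scores, passage_scores, mu=0.9):
    """Interpolate per-sentence evidence scores from two estimators.

    Both arguments map a sentence id to its evidence score. Which estimator
    is weighted by mu, and the 0 default for missing sentences, are
    assumptions of this sketch.
    """
    combined = {}
    for e in set(sentence_scores) | set(passage_scores):
        q_s = sentence_scores.get(e, 0.0)
        q_p = passage_scores.get(e, 0.0)
        combined[e] = mu * q_s + (1.0 - mu) * q_p
    return combined
```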

Centrality Estimator
Graph Construction Inspired by Wan (2008), we introduce as our centrality estimator an extension of the well-known LEXRANK algorithm (Erkan and Radev, 2004), which we modify to incorporate the evidence estimator introduced in the previous section.
For each document cluster, LEXRANK builds a graph $G = (V, E)$ with nodes $V$ corresponding to sentences and (undirected) edges $E$ whose weights are computed based on similarity. Specifically, matrix $E$ represents edge weights where each element $E_{i,j}$ corresponds to the transition probability from vertex $i$ to vertex $j$. The original LEXRANK algorithm uses TF-IDF (Term Frequency Inverse Document Frequency) to measure similarity; since our framework operates over sentences rather than "documents", we use TF-ISF (Term Frequency Inverse Sentence Frequency), with ISF defined as:

$$\mathrm{ISF}(w) = 1 + \log \frac{C}{\mathrm{SF}(w)}$$

where $C$ is the total number of sentences in the cluster, and $\mathrm{SF}(w)$ is the number of sentences in which $w$ occurs. We integrate our evidence estimator into the original transition matrix as:

$$\tilde{E} = \phi \cdot [\tilde{q}; \ldots; \tilde{q}] + (1 - \phi) \cdot E$$

where $\phi \in (0, 1)$ controls the extent to which query-specific information influences sentence selection for the summarization task, and $\tilde{q}$ (stacked $|V|$ times to form a matrix) is a distributional evidence vector which we obtain after normalizing the evidence scores $q \in \mathbb{R}^{1 \times |V|}$ obtained from the previous module ($\tilde{q} = q / \sum_{v}^{|V|} q_v$).

Summary Generation In order to decide which sentences to include in the summary, a node's centrality is measured using a graph-based ranking algorithm (Erkan and Radev, 2004; Xu and Lapata, 2019). Specifically, we run a Markov chain with $\tilde{E}$ on $G$ until it converges to stationary distribution $e^*$ where each element denotes the salience of a sentence. In the proposed algorithm, $e^*$ jointly expresses the importance of a sentence in the document and its semantic relation to the query as modulated by the evidence estimator and controlled by $\phi$. We rank sentences according to $e^*$ and select the top $k^{\mathrm{Sum}}$ ones, subject to a budget (e.g., 250 words). To reduce redundancy, we apply the diversity algorithm proposed in Wan (2008) which penalizes the salience of sentences according to their overlap with those already selected to appear in the summary.
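The evidence-biased random walk can be made concrete with a short, dependency-free Python sketch. It assumes tokenized sentences, cosine similarity over TF-ISF vectors with the smoothed form ISF(w) = 1 + log(C / SF(w)), row-normalized similarities as transition probabilities, strictly positive evidence scores, and plain power iteration for the stationary distribution. All function names are hypothetical, and the diversity penalty and length budget are left out.

```python
import math
from collections import Counter


def tf_isf_vectors(sentences):
    """Turn tokenized sentences into sparse TF-ISF vectors (dicts)."""
    C = len(sentences)
    sf = Counter(w for s in sentences for w in set(s))
    isf = {w: 1.0 + math.log(C / sf[w]) for w in sf}  # assumed smoothed ISF
    return [{w: tf * isf[w] for w, tf in Counter(s).items()} for s in sentences]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def evidence_biased_lexrank(sentences, evidence, phi=0.15, tol=1e-8, max_iter=1000):
    """Stationary salience of each sentence under the evidence-biased walk."""
    n = len(sentences)
    vecs = tf_isf_vectors(sentences)
    sim = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]
    # Row-normalize similarities into transition probabilities E[i][j].
    E = []
    for row in sim:
        total = sum(row)
        E.append([s / total for s in row] if total else [1.0 / n] * n)
    # Normalize evidence scores into a distribution (assumes a positive sum).
    q_sum = sum(evidence)
    q_tilde = [q / q_sum for q in evidence]
    E_tilde = [[phi * q_tilde[j] + (1.0 - phi) * E[i][j] for j in range(n)]
               for i in range(n)]
    # Power iteration from the uniform distribution.
    e = [1.0 / n] * n
    for _ in range(max_iter):
        new_e = [sum(e[i] * E_tilde[i][j] for i in range(n)) for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_e, e))
        e = new_e
        if delta < tol:
            break
    return e
```

Sorting sentences by the returned salience gives the ranking from which summary sentences would then be picked, with redundancy handling applied on top.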
We also remove the sentences which have high cosine similarities (i.e., ≥ 0.6) with any sentence already included in the summary.

Answer Sentence Selection
Candidate sentences:
Owls are a group of birds that belong to the order strigiformes, constituting 200 extant bird of prey species.
Most are solitary and nocturnal, with some exceptions (e.g., the northern hawk owl).
Owls hunt mostly small mammals, insects, and other birds, although a few species specialize in hunting fish.
They are found in all regions of the earth except antarctica, most of greenland and some remote islands.
Owls are characterized by their small beaks and wide faces, and are divided into two families: the typical owls, strigidae; and the barn-owls, tytonidae.

Span Selection (answerable)
Question: By what main attribute are computational problems classified utilizing computational complexity theory?
Context: Computational complexity theory is a branch of the theory of computation in theoretical computer science that focuses on classifying computational problems according to their inherent difficulty, and relating those classes to each other. A computational problem is understood to be a task that is in principle amenable to being solved by a computer, which is equivalent to stating that the problem may be solved by mechanical application of mathematical steps, such as an algorithm.
Answer: inherent difficulty

Span Selection (unanswerable)
Question: What was the name of the 1937 treaty?
Context: Other legislation followed, including the Migratory Bird Conservation Act of 1929, a 1937 treaty prohibiting the hunting of right and gray whales, and the Bald Eagle Protection Act of 1940. These later laws had a low cost to society: the species were relatively rare and little opposition was raised.
Plausible Answer: Bald Eagle Protection Act

Table 3: Examples for two types of question answering datasets for evidence estimation: answer sentence selection and span selection. Red denotes answers while blue denotes a plausible answer to the question that cannot be answered from the given context. We use the union of WikiQA (Yang et al., 2015) and TrecQA (Heilman and Smith, 2010) for answer sentence selection and SQuAD 2.0 (Rajpurkar et al., 2016) for span selection. SQuAD 2.0 contains both answerable and unanswerable questions and we show one example for each of them.
is provided in Table 1. We used three datasets for training our evidence estimator, including WikiQA (Yang et al., 2015), TrecQA (Yao et al., 2013), and SQuAD 2.0 (Rajpurkar et al., 2018). WikiQA and TrecQA are benchmarks for answer sentence selection while SQuAD 2.0 is a popular machine reading comprehension dataset (which we used for span selection). Compared to SQuAD, WikiQA and TrecQA are smaller and we therefore integrate them for model training. We show statistics for QA datasets in Table 2 and examples in Table 3.

Implementation Details
We used the publicly released BERT model and fine-tuned it on our QA tasks. Considering the maximum input length BERT allows (512 tokens) and the query narrative (which in DUC is fairly long), we set the maximum passage size to 8 sentences (with maximum sentence length of 50 tokens). To ensure all sentences are properly contextualized, we used a stride size of 4 sentences to create overlapping passages. Details on model training and optimization are provided in Appendix A.
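The passage construction described above (windows of 8 sentences with a stride of 4) can be sketched as follows. The function name is hypothetical, and the handling of the final, possibly shorter window is an assumption of this sketch; truncating sentences to 50 tokens is left out.

```python
def sliding_passages(sentences, window=8, stride=4):
    """Group consecutive sentences of one document into overlapping passages."""
    if not sentences:
        return []
    passages = []
    start = 0
    while True:
        passages.append(sentences[start:start + window])
        if start + window >= len(sentences):
            break  # the last window already reaches the end of the document
        start += stride
    return passages
```

With these settings most sentences appear in two passages, which is why a sentence can receive more than one evidence score.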
We also evaluated model summaries in a judgment elicitation study via Amazon Mechanical Turk. Native English speakers (self-reported) were asked to rate query-summary pairs on two dimensions: Succinctness (does the summary avoid unnecessary detail and redundant information?) and Coherence (does the summary make logical sense?). The ratings were obtained using a five point Likert scale. In addition, participants were asked to assess the Relevance of the summary to the query. Crowdworkers read a summary and for each sentence decided whether it is relevant (i.e., it provides an answer to the query), irrelevant (i.e., it does not answer the query), or partially relevant (i.e., it is not clear it directly answers the query). Relevant sentences were awarded a score of 5, partially relevant ones a score of 2.5, and irrelevant ones a score of 0. Sentence scores were averaged to obtain a relevance score for the whole summary.
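The relevance scoring scheme amounts to a simple average, shown here as a small Python sketch; the label strings and the function name are hypothetical.

```python
RELEVANCE_POINTS = {"relevant": 5.0, "partially relevant": 2.5, "irrelevant": 0.0}


def summary_relevance(judgments):
    """Average the per-sentence points to score a whole summary."""
    if not judgments:
        raise ValueError("a summary needs at least one judged sentence")
    return sum(RELEVANCE_POINTS[j] for j in judgments) / len(judgments)
```

For example, a four-sentence summary judged relevant, partially relevant, irrelevant, and relevant scores (5 + 2.5 + 0 + 5) / 4 = 3.125.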

Results
Automatic Evaluation Our results on DUC are summarized in Table 4. The first block reports upper bound performance (GOLD) which we estimated by treating a (randomly selected) reference summary as the output of a hypothetical system and comparing it against the remaining (three) ground truth summaries. ORACLE uses reference summaries as queries to retrieve summary sentences, and LEAD returns all lead sentences (up to 250 words) of the most recent document. The second block in Table 4 compares our model to various graph-based approaches: LEXRANK (Erkan and Radev, 2004), a widely used unsupervised method based on Markov random walks, which is query-free and measures relations between all sentence pairs in a cluster so that sentences recommend other similar sentences for inclusion in the summary; GRSUM (Wan, 2008), a Markov random walk model that integrates query relevance into a Graph Ranking algorithm; and CTSUM (Wan and Zhang, 2014), which is based on GRSUM but additionally considers sentence CerTainty information in ranking.
The third group in the table shows the performance of autoencoder-based neural approaches. C-ATTENTION (Li et al., 2017a) is based on Cascaded attention with sparsity constraints for compressive multi-document summarization.

Table 4 presents different variants of our query-focused summarizer which we call QUERYSUM. We show automatic results with distant supervision based on isolated Sentences (QUERYSUM$_S$), Passages (QUERYSUM$_P$), and an ensemble model (QUERYSUM$_{S+P}$) which combines both. As can be seen, our models outperform strong comparison systems on both DUC test sets: QUERYSUM$_S$ achieves the best R-1 while QUERYSUM$_P$ achieves the best R-2 and R-SU4. Perhaps unsurprisingly, both models fall behind the human upper bound.
Our results on the TD-QFS dataset are summarized in Table 5. In addition to LEAD and LEXRANK, we compared to KLSUM, the best performing system on this dataset (Baumel et al., 2016). KLSUM selects a subset of sentences from retrieved candidates by minimizing the Kullback-Leibler Divergence between the unigram distribution in the selected sentences and the source cluster. QUERYSUM$_S$ and our ensemble model achieve superior results across all ROUGE metrics.

Table 6: Human evaluation results on DUC (above) and TD-QFS (below): average Relevance, Succinctness, Coherence ratings; All is the average across ratings; : significantly different from VAESUM or KLSUM; †: significantly different from QUERYSUM; •: significantly different from Gold (at p < 0.01, using a pairwise t-test).

Human Evaluation
For DUC, we evaluated summaries created by VAESUM, a neural state-of-the-art system, QUERYSUM, and the LEAD baseline. For TD-QFS, we evaluated summaries created by KLSUM, QUERYSUM, and LEAD. We also included a randomly selected GOLD standard summary as an upper bound. We sampled 20 query-cluster pairs from DUC (2006, 2007; 10 from each set), and 20 pairs from TD-QFS (5 from each cluster). We collected three responses per query-summary pair. Table 6 shows the ratings for each system. As can be seen, participants find QUERYSUM summaries on DUC more relevant and with less redundant information compared to LEAD and VAESUM. Our multi-step estimation process also produces more coherent summaries (as coherent as LEAD) even though coherence is not explicitly modeled. Overall, participants perceive QUERYSUM summaries as significantly better (p < 0.01) compared to LEAD and VAESUM (see Appendix B for examples of system output). QUERYSUM is also considered the best performing system across metrics on TD-QFS. This further demonstrates the robustness of our system on unseen domains and query types.

Ablation Studies
We also conducted ablation experiments to verify the effectiveness of the proposed coarse-to-fine framework. We present results in Table 7 when individual modules are removed. In the −Relevance setting, all text segments (i.e., sentences or passages) in a cluster are given as input to the evidence estimator module. The −Evidence setting treats all retrieved segments as evidence for summarization. Note that since our summarizer operates on sentences, we can only assess this configuration with the QUERYSUM$_S$ model; we take the top $k^{\mathrm{QA}}$ sentences from the retrieval module as evidence. The −Centrality setting treats the (ranked) output of the evidence estimator as the final summary. For the sake of brevity, we report results on DUC-2007 and TD-QFS (DUC-2006 follows a very similar pattern). As can be seen, removing the retrieval module leads to a large drop in the performance of QUERYSUM$_S$. This indicates that the (deep) semantic matching model trained for sentence selection can get distracted by noise which a (shallow) relevance matching model can help pre-filter. Interestingly, on DUC, when the matching model is trained on passages, the retrieval module seems more or less redundant; there is in fact a slight improvement in R-2 and R-SU4 (see row QUERYSUM$_P$, −Relevance in Table 7). This suggests that the evidence estimator trained on passages is more robust and captures the semantics of the query more faithfully. Moreover, since it takes contextual signals into account and explicitly models unanswerability, it is able to recognize irrelevant information. We show in Figure 2 how ROUGE-2 varies over the $k^{\mathrm{IR}}$ best retrieved segments. We compare three different types of query settings: the short title, the narrative, and the full query with both the title and the narrative. As expected, recall increases with $k^{\mathrm{IR}}$ (i.e., when more evidence is selected) and then finally converges.
For both sentence and passage retrieval settings, the full query achieves the best performance over $k^{\mathrm{IR}}$, with the narrative being most informative when it comes to relevance estimation.
Performance also drops in Table 7 when the evidence estimator is removed (see QUERYSUM$_S$, −Evidence in Table 7). In Figure 3, we plot how ROUGE-2 varies with increasing $k^{\mathrm{QA}}$ when the evidence component is estimated on passages and sentences for the full model. As can be seen, the model trained on passages surpasses the model trained on sentences roughly when $k^{\mathrm{QA}} = 80$. For comparison, we also show the performance of the retrieval module by treating the top sentences as evidence. The retrieval curve is consistently under the passage curve, and under the sentence curve when $k^{\mathrm{QA}} < 140$. Since the quality of top sentences directly affects the quality of the summarization module, this further demonstrates the effectiveness of evidence estimation in terms of reranking retrieved segments. Finally, Table 7 shows that the removal of the centrality estimator decreases performance even when the query and appropriate evidence are taken into account. This suggests that the centrality estimator further learns to select important summary worthy sentences from the available evidence. Interestingly, the gain on the DUC datasets is slight but considerable on TD-QFS, suggesting that in less topically concentrated clusters where multiple high-quality answers can be available, the soft discrimination between answer candidates based on their answerability can be useful during the final summary sentence selection.

Figure 2: Performance (ROUGE-2 Recall) over $k^{\mathrm{IR}}$ best retrieved segments (development set). S and P refer to sentence and passage retrieval, respectively. Full is the concatenation of the query title and narrative.

Conclusions
In this work, we proposed a coarse-to-fine estimation framework for query focused multi-document summarization. We explored the potential of leveraging distant supervision signals from Question Answering to better capture the semantic relations between queries and document segments. Experimental results across datasets show that the proposed model yields results superior to competitive baselines contributing to summaries which are more

GPUs with 11GB memory. For the answer sentence selection model, BERT was fine-tuned with a learning rate of $3 \times 10^{-6}$ and a batch size of 16 for 3 epochs. For span selection, we adopted a learning rate of $3 \times 10^{-5}$ and a batch size of 64 for 5 epochs. During inference, the confidence threshold for the relevance estimator was set to $\theta = 0.75$ (Kratzwald and Feuerriegel, 2018) for both sentence and passage retrieval. For the evidence estimator, $k^{\mathrm{QA}}$ was tuned on the development set. We obtained 90 and 110 evidence sentences from the sentence selection and span selection models, respectively. For the centrality estimator, the influence of the query was set to $\phi = 0.15$ (Wan, 2008; Wan and Zhang, 2014).

B Summary Outputs
We show system outputs in Table 8 and Table 9.

GOLD: In 1996, China began cracking down on crime. Extensive investigations and citizen tips led to hundreds of arrests for such crimes as drug trafficking; firearms, ammunition and explosives manufacturing, sales, smuggling and possession; burglary and robbery; murder; hooliganism; kidnapping; racketeering; gambling; and blackmail. The perpetrators are often gangs of thieves and criminals, and members of international criminal gangs operating between China and Hong Kong or China and Macau. In 1998, 60% of criminal suspects arrested were minors. Chinese authorities broke up a Hong Kong-based gang operating between Hong Kong and the mainland. Its leader was tried, convicted, and sentenced to death in China. Chinese authorities apprehended members of a Macau gang in its Guangdong Province. As part of its "Strike Hard" national crime-fighting campaign, China agreed to participate in the UN Commission on Crime Prevention and Criminal Justice. China revised its criminal and procedural laws and enacted new laws. Its Criminal Law was amended to include terrorist crime, organized crime, money-laundering, illegal immigrant trafficking, and environment-related crimes. China signed legal assistance agreements with 28 countries and extradition agreements with ten. China pledged increased cross-border anti-crime cooperation and urged Portugal to take tougher measures against gang-related crime in preparation for the 1999 handover of the Portuguese colony. After the handover, China will station troops in Macau to better fight organized criminal activity there. The Chinese government pledges to increase efforts to crack down on corruption, smuggling, and other economic crimes as well as criminal acts in 2000.

LEAD: Members of a criminal gang in Foshan city of south China's Guangdong province, which was controlled by a larger and more notorious gang in neighboring Macao, have been apprehended by local police. Police arrested 28 people who have been involved in more than 30 cases of blackmail, gambling, illegal use of guns and other crimes. The gambling cases involved more than 50 million yuan (about six million U.S. dollars) of illicit money. Police also seized a number of guns and ammunition, including eight military pistols. The gang was established by Zeng Qiqiang in 1996, as a branch of the "Shuifang Bang", a large criminal gang in Macao. The gang in Foshan, with more than 100 members, used to help the "Shuifang Bang" run its gambling operations and collect money from people by force. To date, the provincial public security department of Guangdong and the local police in Foshan have completely uprooted the gang which seriously threatened the security of Foshan and Macao.

VAESUM: Police working with Hong Kong authorities had arrested 18 members of the gang in southern Guangdong province, which is adjacent to Hong Kong. As a reputed local crime boss fights his death sentence in China, reports Thursday said Hong Kong officials had previously asked mainland counterparts to consider sparing the lives of territory residents convicted of capital offenses in China. A police chief of a southern Chinese city where a reputed Hong Kong crime boss is on trial has stepped up security following assassination threats by gang members. Gang members loyal to "Big Spender" Cheung Tze-keung have put a 4 million Hong Kong dollar (U.S. dlrs 512,000 million) price tag for the death of Guangzhou police chief Zheng Guoqiang. The arrests are the latest in a series by Chinese and Hong Kong police to crack down on criminal activities related to 43-year-old Hong Kong gang boss Cheung Tze-keung, alias "Big Spender". Charges against the reputed gangsters center around the killing of a mainland Chinese businessman and a Hong Kong resident, armed robberies, smuggling explosives into Hong Kong, and the kidnapping of the two Hong Kong businessmen for more than 1.6 billion Hong Kong dollars (U.S. dlrs 205 million) in ransom. Hong Kong officials would appeal on grounds that the mainland had no jurisdiction over Cheung's case since many of Cheung's alleged crimes, including kidnappings of two Hong Kong tycoons, were committed in Hong Kong. 18 were Hong Kong residents and 14 were from mainland China.

QUERYSUM: Zhang Fusen, head of the Chinese delegation, told the fifth session of the UN commission on Crime Prevention and Criminal Justice (CCPCJ) that China will participate in united nations activities in crime prevention and criminal justice. China has revised the criminal law and criminal procedure law, promulgated and enforced new laws such as the lawyers' law and the law on administrative punishment to strengthen the judicial guarantee for human rights during that period of time, the paper says. As a reputed local crime boss fights his death sentence in china, reports Thursday said Hong Kong officials had previously asked mainland counterparts to consider sparing the lives of territory residents convicted of capital offenses in China. China is ready to strengthen cooperation with other countries and international organizations in combating and preventing organized transnational crime, a senior Chinese official said here today. Zhang said that in the past few years, China's law enforcement authorities cracked numerous cases in southeast china involving killing, kidnapping and racketeering by members of criminal gangs which entered china from overseas. Statistics show that in 1996, courts throughout the country sentenced 322,382 criminal offenders who had seriously endangered public security by committing crimes of violence, crimes involving the use of guns, and gang-related crimes. Speaking at the opening ceremony of the seventh world conference of Asia Crime Prevention Foundation (ACPF), deputy procurator-general of the supreme people's procuratorate of China Liang Guoqing called for enhancing cooperation among asian countries to fight crimes and set up a crime prevention regime.

Table 8: [...] (3) international corporations; (4) law revision and enforcement. Our system produces more diverse content that represents these aspects compared to other systems.

Query: Describe the activities of Morris Dees and the Southern Poverty Law Center.

GOLD: Morris Dees is a co-founder and leader of the Southern Poverty Law Center, located in Montgomery, Alabama. It was founded to battle racial bias and has expanded its efforts by tracking hate crimes and the increasing spread of racist organizations across the US. "Teaching Tolerance" is a major program of the Center. Under that program, a magazine promoting interracial and intercultural understanding goes to more than 400,000 teachers. Other publications of the Center include the magazine "Intelligence Report" and pamphlets "Ten Ways to Fight Hate" and "Fighting Hate at School". Dees has determined that the civil courts are an effective forum in which to attack and destroy hate groups. He has used the civil lawsuit like a "Buck Knife, carving financial assets out of hate group leaders". Some skeptics thought that Dees sought out victims of hate groups to profit from their tragedy. However, Dees does not charge the groups and the Center estimates that it collects only 2% on successful judgments. Dees has a perfect record in the major lawsuits he has prosecuted. Successful judgments include one for $21.5M against a South Carolina branch of the Ku Klux Klan for burning the Macedonia Baptist Church. Others include $6.3M against Aryan Nation's leader Richard Butler and $7M against a Klan group that killed a black man in Mobile, Alabama. The Center operates mostly on contributions that in the late 1990s have increased to around $100 Million annually.

LEAD: Spokane, Wash. (AP) -facing eviction from its compound in northern Idaho, the aryan nations may move its annual white supremacist gathering to Pennsylvania next year. The news was posted on the Neo-Nazi group's web site Friday, a week after the group was slapped with a $6.3 million judgment in a civil lawsuit. The compound is scheduled to be seized on sept. 29 and the assets sold to satisfy a portion of the judgment due to two people who sued the group after they were assaulted by aryan nations' guards. The notice was the first indication that the lawsuit, brought by the southern poverty law center, may drive the group out of Idaho. "I have been asked if I would continue to host the yearly national congress and my answer was, of course, an astounding yes!" wrote august B. Kreis III, web master for the Aryan nations and a posse comitatus leader in Pennsylvania. Kreis wrote that if the compound is lost, the Aryan nations "National Congress 2001" would be planned for a site near ulysses, pa. Aryan nations leader Richard Butler declined to talk with reporters Friday. He is appealing the judgment to the Idaho supreme court, but that appeal is not expected to halt the seizure of the group's 20-acre compound north of Hayden lake. Morris Dees, the civil rights lawyer who led the plaintiffs' legal team, has said he expected the judgment to bring a quick end to the aryan nations and its racist, anti-semitic message.

VAESUM: A state jury in northern Idaho Thursday ordered leaders of the Aryan nations to pay more than $6 million to the victims of an attack two years ago by men who were serving as security guards at the group's compound near here. Coeur d'Alene, Idaho -issuing a verdict that civil rights organizations hope will bankrupt one of the nation 's largest white-supremacist groups and limit its ability to preach hate. Aryan nations leader Richard Butler vowed Saturday he will not leave northern Idaho, despite a $6.3 million judgment against his racist organization. Coeur d'Alene, idaho -Morris S. Dees JR. , who has won a series of civil rights suits against the Ku Klux Klan and other racist groups in a campaign to put them out of business, came to court here Monday to try to seize the Aryan nations compound that has nurtured white supremacists for more than 20 years. Her son who were attacked by Aryan nations guards outside the white supremacist group's north Idaho headquarters. One of two men convicted of assaulting a woman and her son outside the headquarters of the Aryan nations denied being a member of the white supremacist group Thursday during testimony in a civil rights case filed against them, the aryan nations and the group's founder, Richard Butler. Morris Dees, co-founder of the southern poverty law center in Montgomery, Ala., has said he intends to take everything the aryan nations owns to pay the judgment, including the sect's name.

QUERYSUM: Morris Dees, the co-founder of the southern poverty law center in Montgomery, Ala., and one of the attorneys for the plaintiffs, said he intended to enforce the judgment, taking everything the Aryan nations owns, including its trademark name. Dees, founder of the southern poverty law center, has won a series of civil right suits against the Ku Klux Klan and other racist organizations in a campaign to drive them out of business. But since co-founding the southern poverty law center in 1971, Dees has wielded the civil lawsuit like a buck knife, carving financial assets out of hate group leaders who inspire followers to beat, burn and kill. In a lawsuit that goes to trial Monday, attorney Morris Dees of the southern poverty law center is representing a mother and son who were attacked by security guards for the white supremacist group. The southern poverty law center tracks hate groups, and intelligence report covers right-wing extremists. Over the last two decades, the southern poverty law center has taken the Ku Klux Klan and other hate groups to court, starting with a successful suit against the invisible empire Klan, which in 1979 attacked a group of peaceful civil rights marchers in Decatur, Ala. He said Gilliam also told the informant someone should kill the FBI sniper who killed the wife of white supremacist randy weaver during an 11-day standoff in 1992 at Ruby Ridge, Idaho, along with civil rights lawyer Morris Dees of the Montgomery-based southern poverty law center.

Table 9: System outputs for cluster D0701A in DUC 2007. The gold summary answers the query covering three main aspects (denoted with different colors): (1) Southern Poverty Law Center and its activities; (2) Morris Dees and his activities; (3) representative successful lawsuits. For this document cluster, summarization systems are prone to extract unnecessary lawsuit details, which indirectly relate to the given query but are not the query focus. Our system contains more summary-worthy facts that succinctly respond to the given query compared to other systems.