Facet-Aware Evaluation for Extractive Summarization

Commonly adopted metrics for extractive summarization focus on lexical overlap at the token level. In this paper, we present a facet-aware evaluation setup for better assessment of the information coverage in extracted summaries. Specifically, we treat each sentence in the reference summary as a facet, identify the sentences in the document that express the semantics of each facet as support sentences of the facet, and automatically evaluate extractive summarization methods by comparing the indices of extracted sentences and support sentences of all the facets in the reference summary. To facilitate this new evaluation setup, we construct an extractive version of the CNN/Daily Mail dataset and perform a thorough quantitative investigation, through which we demonstrate that facet-aware evaluation manifests better correlation with human judgment than ROUGE, enables fine-grained evaluation as well as comparative analysis, and reveals valuable insights into state-of-the-art summarization methods. Data can be found at https://github.com/morningmoni/FAR.


Introduction
Text summarization has enjoyed increasing popularity due to its wide applications, whereas the evaluation of text summarization remains challenging and controversial. The most commonly used evaluation metric of summarization is lexical overlap, i.e., ROUGE (Lin, 2004), which regards the system and reference summaries as sequences of tokens and measures their n-gram overlap.
However, recent studies (Paulus et al., 2017; Schluter, 2017; Kryscinski et al., 2019) reveal the limitations of ROUGE and find that in many cases it fails to agree with human judgment.
Since lexical overlap only captures information coverage at the surface (token) level, ROUGE favors system summaries that share more tokens with the reference summaries. Nevertheless, such summaries may not always convey the desired semantics. For example, in Table 1, the document sentence with the highest ROUGE score has more lexical overlap but expresses rather different semantics. In contrast, the sentence manually extracted from the document by our annotators, which conveys similar semantics, is over-penalized because it involves other details or uses alternative words.
In this paper, we argue that the information coverage in summarization can be better evaluated by facet overlap, i.e., whether the system summary covers the facets in the reference summary. Specifically, we treat each reference sentence as a facet, identify document sentences that express the semantics of each facet as support sentences of the facet, and measure information coverage by Facet-Aware Recall (FAR), i.e., how many facets are covered. We focus on extractive summarization for the following two reasons. Theoretically, since extractive methods cannot paraphrase or compress the document sentences as abstractive methods do, it is somewhat unfair to penalize them for extracting long sentences that cover the facets. Pragmatically, we can evaluate extractive methods automatically by comparing the indices of extracted sentences and support sentences. We denote the mappings from each facet (sentence) in the reference summary to its support sentences in the document as Facet-Aware Mappings (FAMs). FAMs can be used as labels indicating which sentences should be extracted, but they are grouped with respect to each facet, while conventional extractive labels correspond to the entire reference summary rather than individual facets (detailed explanations in Sec. 2.1). Compared to treating one summary as a sequence of n-grams, facet-aware evaluation considers information coverage at a semantically richer granularity, and thus contributes to a more accurate assessment of summary quality.
To verify the effectiveness of facet-aware evaluation, we construct an extractive version of the CNN/Daily Mail dataset (Nallapati et al., 2016) by annotating its FAMs (Sec. 2). We revisit state-of-the-art extractive methods using this new extractive dataset (Sec. 3.2), the results of which show that FAR correlates better with human evaluation than ROUGE. We also demonstrate that FAMs are beneficial for fine-grained evaluation of both abstractive and extractive methods (Sec. 3.3). We then illustrate how facet-aware evaluation can be useful for comparing different extractive methods in terms of their capability of extracting salient and non-redundant sentences (Sec. 3.4). Finally, we explore the feasibility of automatic FAM creation by evaluating sentence regression approaches against the ground-truth annotations (i.e., FAMs), and generalize facet-aware evaluation to the entire CNN/Daily Mail dataset without any human annotation (Sec. 4). We believe that the summarization community will benefit from the proposed setup for better assessment of information coverage and gain a deeper understanding of the current benchmark dataset and state-of-the-art methods through our analysis.

Contributions.
(1) We propose a facet-aware evaluation setup that better assesses information coverage for extractive summarization. (2) We build the first dataset designed specifically for extractive summarization by creating facet-aware mappings from reference summaries to documents. (3) We revisit state-of-the-art summarization methods in the proposed setup and discover valuable insights.
(4) To our knowledge, our work is also the first thorough quantitative analysis regarding the characteristics of the CNN/Daily Mail dataset.

Figure 1: Two of three support groups of facet 1 (r_1) are covered. Facet 2 (r_2) cannot be covered as document sentence 4 (d_4) is missing in the extracted summary. The illustration corresponds to the example in Sec. 3.1.

Dataset Creation
In this section, we describe the process of creating an extractive summarization dataset to facilitate facet-aware evaluation, which involves annotating FAMs between the documents and abstractive reference summaries. We first formalize the FAMs and then describe the FAM annotation on the CNN/Daily Mail dataset (Nallapati et al., 2016).

FAMs: Facet-Aware Mappings
We denote one document-summary pair as {D, R}, where D = [d_1, d_2, ..., d_D], R = [r_1, r_2, ..., r_R], and D, R denote the numbers of document sentences and reference sentences, respectively. We conceptualize a facet as one unique semantic aspect presented in the summary. In practice, we hypothesize that each reference sentence r_i corresponds to one facet. We define support sentences as the sentences in the document that express the semantics of one facet r_i, and a support group S of facet r_i as a set of support sentences that can fully cover the information of r_i. For each facet r_i in the reference summary, we try to find all of its support sentences in the document and put them into support groups. Since we focus on single-document summarization, all support sentences come from the same document, and the FAM of facet r_i can be written as FAM(r_i) = {S^i_1, S^i_2, ..., S^i_N}, where N is the number of support groups. Each S^i_j = {d_{I_1}, d_{I_2}, ..., d_{I_{M_j}}} is a support group, where I_1, I_2, ..., I_{M_j} are the indices of support sentences and M_j is the number of support sentences in S^i_j. One illustrative example is presented in Fig. 1. The support sentences are likely to be verbose, but we consider whether they express the semantics of the facet regardless of their length. The reason is that we believe extractive summarization should focus on information coverage: since it cannot alter the original sentences, once salient sentences are extracted, one can compress them in an abstractive manner (Chen and Bansal, 2018; Hsu et al., 2018).

Relation w. Extractive Labels. Extractive methods (Nallapati et al., 2017; Chen and Bansal, 2018; Narayan et al., 2018c) typically require binary labels for every document sentence indicating whether it should be extracted during model training. Such labels are called extractive labels and are usually created heuristically from the reference summaries, since existing datasets do not provide extractive labels but only abstractive references. Our assumption that each reference sentence corresponds to one facet is similar to the assumption made during the creation of extractive labels. The major differences are as follows. (1) We allow an arbitrary number of support sentences, while extractive labels usually limit each reference sentence to one support sentence, i.e., we do not specify M_j. For example, we would put two support sentences into one support group if they are complementary and only combining them covers the facet. (2) We try to find multiple support groups (N > 1), as there could be more than one set of support sentences covering the same facet. In contrast, there is no notion of a support group in extractive labels, as they inherently form one such group (N = 1). We also allow N = 0 if such a mapping cannot be found even by humans. (3) The FAMs are more accurate, as they are created by human annotators, while extractive methods use sentence regression approaches (which we evaluate in Sec. 4.1) to obtain extractive labels approximately.
Comparison w. SCUs. Some may mistake FAMs for Summarization Content Units (SCUs) in Pyramid (Nenkova and Passonneau, 2004), but they are different in that (1) FAMs utilize both the documents and reference summaries, while SCUs ignore the documents; (2) FAMs are at the sentence level and can thus be used to automatically evaluate extractive methods once created: simply by matching sentence indices, we can know how many facets are covered, while SCUs have to be manually annotated for each system (refer to Appendix B, Fig. 4).
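Concretely, a FAM is nothing more than nested sets of sentence indices. The following minimal Python sketch (our own illustration; the names and values are hypothetical, not the released data format) makes the structure defined in Sec. 2.1 explicit.

```python
from typing import List, Set

# A support group: a set of document sentence indices that together fully cover a facet.
SupportGroup = Set[int]
# A FAM: one facet (reference sentence) mapped to all of its support groups (possibly none).
FAM = List[SupportGroup]

# Hypothetical pair with two facets, mirroring Fig. 1 (0-indexed sentences):
# facet r_1 has three alternative support groups; facet r_2 is covered only by d_4.
fams: List[FAM] = [
    [{0}, {2}, {4, 5}],  # r_1: any one of these groups covers the facet
    [{3}],               # r_2: a single group (N = 1); an unmappable facet would have []
]
```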

Creation of Extractive CNN/Daily Mail
To verify the effectiveness of facet-aware evaluation, we annotate the FAMs of 150 document-summary pairs from the test set of CNN/Daily Mail. Specifically, we take the first 50 samples in the test set, the 20 samples used in the human evaluation of Narayan et al. (2018c), and randomly draw another 80 samples. The annotators are graduate students who are required to read through the document and mark support groups for each facet. The document sentences most similar to each facet, found by ROUGE and by cosine similarity of average word embeddings, are provided as baselines for annotation. 310 non-empty FAMs are created by three annotators with high agreement (pairwise Jaccard index 0.714) and further verified to reach consensus. On average, 5.44 (6.04 non-unique) document sentences are included as support sentences in each document-summary pair.
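As an illustration of the agreement statistic above, the following sketch computes the mean pairwise Jaccard index over annotators' support-sentence selections for a single facet; the numbers are invented, and the exact aggregation used in our annotation may differ.

```python
from itertools import combinations
from typing import List, Set

def mean_pairwise_jaccard(annotations: List[Set[int]]) -> float:
    """Average Jaccard index over all annotator pairs for one facet's support sentences."""
    scores = []
    for a, b in combinations(annotations, 2):
        union = a | b
        scores.append(len(a & b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

# Hypothetical support-sentence selections from three annotators for one facet.
print(mean_pairwise_jaccard([{1, 4}, {1, 4}, {1, 5}]))  # ~0.56
```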
To summarize, we found that the facets can be divided into three categories based on their quality and degree of abstraction.

Noise: The facet is noisy and irrelevant to the main content, either because the document itself is too hard to summarize (e.g., a report full of quotations) or because the human editor was too subjective when writing the summary (See et al., 2017). Another possible reason is that the so-called "summaries" in CNN/Daily Mail are in fact "story highlights", for which including certain details seems reasonable. We found that 41/150 (27.3%) samples have noisy facet(s), indicating that the reference summaries of CNN/Daily Mail are rather noisy. We show in Sec. 3.2 that existing summarization methods perform poorly on this category, which further justifies our judgment of "noisy facets". Also note that a "clean" dataset would not have a "noise" category; however, given the creation process of popular summarization datasets (Nallapati et al., 2016; Narayan et al., 2018b), it is unlikely that all of their samples are of high quality.

Low Abstraction: The facet can be mapped to its support sentences. We denote the (rounded) average number of support sentences for each facet as M = (1/N) Σ_{j=1}^{N} M_j, where N is the number of support groups. As shown in Table 2, all the facets with non-empty FAMs in CNN/Daily Mail are paraphrases or compressions of one to two document sentences, without much abstraction.

High Abstraction: The facet cannot be mapped to its support sentences (N = 0) by humans, which indicates that writing the facet requires a deep understanding of the document rather than simply reorganizing several sentences. The proportion of this category (13.3%) also indicates how often extractive methods would not work (well) on CNN/Daily Mail.
We found it easier than previously believed to create the FAMs on CNN/Daily Mail, as it is uncommon (average number of support groups N = 1.6) to detect multiple sentences with similar semantics. In addition, most support groups only have one or two support sentences with large lexical overlap, which coincides with the fact that extractive methods work quite well on CNN/Daily Mail and abstractive methods are often hybrid and learn to copy words directly from the documents. That said, we try to automate the FAM creation and scale facet-aware evaluation to the whole test set of CNN/Daily Mail using machine-created FAMs (Sec. 4).

Facet-Aware Evaluation
In this section, we introduce the facet-aware evaluation setup (Sec. 3.1) and demonstrate its effectiveness by revisiting state-of-the-art summarization methods under this new setup (Sec. 3.2). We then illustrate the additional benefits of facet-aware evaluation, including fine-grained evaluation (Sec. 3.3) and comparative analysis (Sec. 3.4).

Proposed Metrics
As current extractive methods are facet-agnostic, i.e., their output is not nested (organized by facets) but a flat set of extracted sentences, we consider one facet as being "covered" if any of its support groups can be found in the whole extracted summary. Formally, we define the Facet-Aware Recall (FAR) as follows.
FAR = (1/R) Σ_{i=1}^{R} Any({I(S^i_j, E) : S^i_j ∈ FAM(r_i)}),

where Any(X) returns 1 if any x ∈ X is 1 and 0 otherwise, I(X, Y) returns 1 if set X ⊂ Y and 0 otherwise, E denotes the set of extracted sentences, and R is the number of facets. Intuitively, FAR does not over-penalize extractive methods for extracting long sentences as long as the extracted sentences cover the semantics of the facets. FAR also treats each facet equally, whereas ROUGE weighs facets with more tokens more heavily, since they are more likely to incur lexical overlap. To further measure model capability of retrieving salient (support) sentences without considering redundancy as FAR does, we merge all the support sentences of one document-summary pair into one single support set S = ∪_{i, j} S^i_j and define the Support-Aware Recall (SAR) as

SAR = |S ∩ E| / |S|.

SAR is used in Sec. 3.4 for the comparative analysis of extractive methods.
Note that d_1 and d_3 are salient (support sentences) and both considered positive in SAR, while they only contribute to the coverage of one facet in FAR.
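The two metrics reduce to simple set operations once the FAMs are available as sentence-index sets. The following is a minimal sketch (reusing the representation from Sec. 2.1; the example values are ours and mirror Fig. 1, not reported results).

```python
from typing import List, Set

def far(fams: List[List[Set[int]]], extracted: Set[int]) -> float:
    """Facet-Aware Recall: fraction of facets with at least one support group
    fully contained in the set of extracted sentence indices."""
    covered = sum(1 for groups in fams if any(g <= extracted for g in groups))
    return covered / len(fams)

def sar(fams: List[List[Set[int]]], extracted: Set[int]) -> float:
    """Support-Aware Recall: recall over the merged set of all support sentences."""
    support: Set[int] = set()
    for groups in fams:
        for g in groups:
            support |= g
    return len(support & extracted) / len(support) if support else 0.0

# Fig. 1-style example: facet 1 has support groups {d1}, {d3}, {d5, d6}; facet 2 has {d4};
# the system extracts d1, d2, d3 (0-indexed below).
fams = [[{0}, {2}, {4, 5}], [{3}]]
extracted = {0, 1, 2}
print(far(fams, extracted))  # 0.5 -- facet 1 covered, facet 2 not
print(sar(fams, extracted))  # 0.4 -- 2 of the 5 unique support sentences are extracted
```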

Automatic Evaluation with FAR
By utilizing the low abstraction category of the extractive CNN/Daily Mail dataset, we revisit extractive methods to evaluate how well they perform on information coverage. Specifically, we compare Lead-3 (which extracts the first three document sentences), FastRL(E) (E for extractive only) (Chen and Bansal, 2018), and other recent extractive methods, including Refresh, BanditSum, NeuSum, and UnifiedSum(E). As shown in Table 3, there is almost no discrimination among the last four methods under ROUGE-1 F1, and the rankings under ROUGE-1/2/L often contradict each other. The observations on ROUGE Precision/Recall are similar; we provide them, along with more comparative analysis under facet-aware evaluation, in Sec. 3.4. For facet coverage, the upper bound of FAR when extracting 3 sentences (Oracle, given the ground-truth FAMs) is 84.8, much higher than all the compared methods. (Extracting all the sentences would result in a perfect FAR, which is expected as FAR measures recall; one can also normalize FAR by the number of extracted sentences.) The best performing extractive method under FAR is UnifiedSum(E), which indicates that it covers the most facets semantically.

FAR's Correlation w. Human Evaluation. Although FAR is supposed to be favored because the FAMs are manually labeled and accurately indicate whether a sentence should be extracted (assuming the annotations are of high quality), we further verify that FAR correlates with human preference: we ask the annotators to rank the outputs of UnifiedSum(E), NeuSum, and Lead-3 and measure ranking correlation. As listed in Table 4, the method with the most 1st ranks in the human evaluation coincides with the method ranked best by FAR. We also find that FAR has a higher Spearman's coefficient ρ with the human ranking than ROUGE.
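As an illustration of this ranking-correlation protocol, the sketch below computes an average per-document Spearman correlation between FAR scores and human ranks with SciPy; all values are invented placeholders, not the numbers behind Table 4.

```python
from scipy.stats import spearmanr

# Hypothetical per-document data: human ranks of three systems (1 = best) and the
# corresponding FAR scores (higher = better).
human_ranks = [[1, 2, 3], [2, 1, 3], [1, 3, 2]]
far_scores = [[0.9, 0.6, 0.3], [0.5, 0.8, 0.2], [1.0, 0.3, 0.6]]

rhos = []
for scores, ranks in zip(far_scores, human_ranks):
    rho, _ = spearmanr(scores, [-r for r in ranks])  # negate ranks so that higher = better
    rhos.append(rho)
print(sum(rhos) / len(rhos))  # average rank correlation between FAR and human judgment
```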

Fine-grained Evaluation
One benefit of facet-aware evaluation is that we can employ the category breakdown of FAMs for fine-grained evaluation, namely, how one method performs on noisy, low abstraction, and high abstraction samples, respectively. Any metric of interest can be used for this fine-grained analysis. Here we consider ROUGE and additionally evaluate several abstractive methods: PG (Pointer-Generator) (See et al., 2017), FastRL(E+A) (extractive+abstractive) (Chen and Bansal, 2018), and UnifiedSum(E+A) (Hsu et al., 2018).
As shown in Table 5, extractive methods perform poorly on high abstraction samples, which is somewhat expected since they cannot perform abstraction. Abstractive methods, however, also exhibit a huge performance gap between low and high abstraction samples, which suggests that existing abstractive methods achieve decent overall performance mainly by extraction rather than abstraction, i.e., by performing well on the low abstraction samples of CNN/Daily Mail. We also found that all the compared methods perform much worse on documents with "noisy" reference summaries, implying that the randomness in the reference summaries might introduce noise to both model training and evaluation. Note that although the sample size is relatively small, we observe consistent results when analyzing different subsets of the data.

Table 5: ROUGE-1 F1 of extractive and abstractive methods on noisy (N), low abstraction (L), high abstraction (H), and high quality (L + H) samples.

Comparative Analysis
Facet-aware evaluation is also beneficial for comparing extractive methods regarding their capability of extracting salient and non-redundant sentences. We show the FAR, SAR, and ROUGE scores of various extractive methods in Fig. 2. We next illustrate how one can leverage these scores under different metrics for comparative analysis. For brevity, we denote ROUGE Precision and ROUGE Recall as RP and RR, respectively.

FAR vs. ROUGE. By comparing the scores of extractive methods under FAR and ROUGE, one can discover useful insights. For example, we observe that the performance of Refresh, FastRL(E), and NeuSum is quite close to Lead-3 under FAR, but they generally have higher RR. Such results imply that these methods might have learned to extract sentences that are not support sentences, i.e., sentences that do not directly contribute to facet coverage but still have lexical overlap with the reference summaries. It is also likely that they extract redundant support sentences that happen to have token matches with other facets. Overall, UnifiedSum(E) covers the most facets (high FAR) and also has decent lexical matches (high RR).

SAR vs. ROUGE. By comparing SAR with RP, one can find that UnifiedSum(E) extracts salient but possibly redundant support sentences, as it has higher SAR but similar RP to Lead-3. On the contrary, Refresh has similar SAR to Lead-3 but higher RP, which again implies that it might extract non-support sentences that contain token matches but irrelevant semantics. Similarly, BanditSum is capable of lexical overlap (high RP), but the matched tokens may not contribute much to the major semantics (low SAR).

FAR vs. SAR. By comparing FAR with SAR (Fig. 3), we observe that FastRL(E) and NeuSum have FAR scores similar to Lead-3 and Refresh, but higher SAR scores. One possible explanation is that FastRL(E) and NeuSum are better at extracting support sentences but do not handle redundancy very well, i.e., the extracted sentences might contain multiple support groups of the same facet (recall the example in Sec. 3.1). For instance, 30.3% of the extracted summaries of FastRL(E) cover more than one support group of the same facet, compared with 19.1% for Lead-3.

Evaluation without Human Annotation
In the previous sections, we have demonstrated the effectiveness and benefits of facet-aware evaluation. One remaining issue that might prevent facet-aware evaluation from scaling is the need for human-annotated FAMs. In this section, we thus study the feasibility of automatic FAM creation with sentence regression and present a pilot study of conducting facet-aware evaluation without any human annotation.

Sentence Regression for FAM Creation
Similar to most benchmark constructions, facet-aware evaluation requires only one-time annotation: once the FAMs are annotated, we can reuse them for automatic evaluation. That said, we explore various approaches to automate this one-time process. Specifically, we investigate whether facet-aware evaluation can be conducted without any human effort by utilizing sentence regression (Zopf et al., 2018) to automatically create the FAMs.
Sentence regression is widely used to create extractive labels. Sentence regression approaches typically transform abstractive reference summaries into extractive labels heuristically using ROUGE. Previously, one could only estimate the quality of these labels by evaluating the extractive models trained on them, i.e., comparing their extracted summaries with the reference summaries (also approximately, via ROUGE). Now that the human-annotated FAMs serve as ground-truth extractive labels, we can directly evaluate how accurately each approach performs.

Sentence Regression Approaches. We briefly review recent sentence regression approaches, e.g., Nallapati et al. (2017).
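As a reference point for how such heuristic labels are typically produced, the following is a minimal sketch of the simplest variant: each reference sentence is mapped to its single best-matching document sentence. The ROUGE-1 F1 here is a simplified re-implementation for illustration (no stemming or stopword handling), not an official scorer, and actual approaches differ in the ROUGE variant and selection strategy they use.

```python
from collections import Counter
from typing import List

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified unigram-overlap F1, used only for illustration."""
    c, r = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(c.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def heuristic_labels(doc_sents: List[str], ref_sents: List[str]) -> List[int]:
    """For each reference sentence (facet), pick the single highest-scoring document
    sentence as its approximate support sentence (i.e., N = 1 and M_j = 1)."""
    labels = set()
    for ref in ref_sents:
        best = max(range(len(doc_sents)), key=lambda i: rouge1_f1(doc_sents[i], ref))
        labels.add(best)
    return sorted(labels)
```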

Evaluation with Machine-Created FAMs
Results on Support Sentence Discovery. We first evaluate sentence regression with its original function, i.e., creating extractive labels (finding support sentences). We merge the support groups of each sample and calculate precision and recall (i.e., SAR). The performance of sentence regression approaches is shown in Table 6. The relatively low recall suggests that simply finding one support sentence for each facet, as most existing approaches do, misses plenty of salient sentences, which could worsen the models trained on such labels since they would treat the missed support sentences as unimportant. On the bright side, many sentence regression approaches achieve high precision.

Correlation w. Human-Annotated FAMs. We then explore the correlation between human-annotated and machine-created FAMs by evaluating extractive methods against both of them. This time we extend sentence regression to find multiple support sentences for each facet and put each support sentence into a separate support group. We measure the correlation between estimated and ground-truth FAR by Pearson's r, and the correlation between system rankings induced from estimated and ground-truth FAR by Spearman's ρ and Kendall's τ. The detailed correlation results of representative approaches are listed in Table 7. We observe that creating three support groups consistently shows the highest correlation for the same sentence regression approach. Also, the FAMs created by ROUGE-1 F1 and ROUGE-AVG F1 have very high correlation with the human annotation, indicating the usability and reliability of machine-created FAMs for system ranking.

FAR Prediction. Despite the high correlation, we also find that the estimated FAR scores may differ in range from the ground-truth FAR. Therefore, we further use the estimations of different sentence regression approaches to train a linear regression model that fits the ground-truth FAR (denoted as AutoFAR). We then calculate the estimated FAR scores on the whole test set of CNN/Daily Mail and use the trained linear regressor to predict a (supposedly) more accurate FAR score (denoted as AutoFAR-L). As shown in Table 8, the fit of AutoFAR is very close to the ground-truth FAR, and the system ranking in the large-scale evaluation under AutoFAR-L follows a similar trend to that under FAR, with Spearman's ρ = 54.3. On the other hand, although our preliminary analysis of AutoFAR-L shows promising results, we note that since human annotation on the whole test set is lacking, the reliability of such extrapolation is not guaranteed, and we leave a more rigorous study with a larger number of systems and samples as future work.

Regarding our annotation sample size, previous studies (Paulus et al., 2017; Narayan et al., 2018c; Chen and Bansal, 2018) generally sampled 50 to 100 documents for human evaluation in addition to ROUGE in light of its limitations. Chen et al. (2016) and Yavuz et al. (2018) inspected 100 samples and analyzed their category breakdown for reading comprehension and semantic parsing, respectively. We observed similar trends when analyzing different subsets of the FAMs, indicating that our findings are relatively stable. We thus conjecture that our sample size is sufficient to verify our hypotheses and benefit future research.
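To make the AutoFAR fitting step above concrete, here is a minimal sketch (all numbers are placeholders, not our data) that fits a linear regressor from several per-sample FAR estimates to the ground-truth FAR and then extrapolates to unannotated samples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: FAR estimates from several sentence regression approaches for one annotated
# sample; y holds the ground-truth FAR computed from the human-annotated FAMs.
X = np.array([[0.62, 0.58, 0.66],
              [0.40, 0.35, 0.45],
              [0.80, 0.75, 0.78],
              [0.55, 0.60, 0.50]])
y = np.array([0.67, 0.33, 0.83, 0.50])

auto_far = LinearRegression().fit(X, y)         # "AutoFAR" fitted on the annotated subset
new_estimates = np.array([[0.70, 0.65, 0.72]])  # estimates for an unannotated sample
print(auto_far.predict(new_estimates))          # extrapolated score ("AutoFAR-L")
```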

Conclusion and Future Work
We propose a facet-aware evaluation setup for better assessment of information coverage in extractive summarization. We construct an extractive summarization dataset and demonstrate the effectiveness of facet-aware evaluation on it, including better correlation with human judgment on the assessment of information coverage and support for fine-grained evaluation as well as comparative analysis. We also evaluate sentence regression approaches and explore the feasibility of fully-automatic evaluation without any human annotation. In the future, we will investigate multi-document summarization datasets such as DUC (Paul and James, 2004) and TAC (Dang and Owczarzak, 2008) to see whether our findings hold when multiple references are provided. We will also explore better sentence regression approaches for the use of both extractive summarization methods and automatic FAM creation.

Figure 4: Comparison of summarization metrics. Support sentences are marked in the same color as their corresponding facets. SCUs have to be annotated for each extracted summary during evaluation, while facet-aware evaluation can be conducted automatically by comparing sentence indices.

A Practical Notes on CNN/Daily Mail
We note several issues of the CNN/Daily Mail dataset in the hope that researchers working on this dataset become better aware of them. One issue is that sometimes the titles and image captions are introduced into the main body of the document by mistake (usually signaled by "-lrb- pictured -rrb-", i.e., "(pictured)", or by colons), which may lead to bias or label leakage during model training, since the reference summaries are observed to be similar to the titles and image captions (Narayan et al., 2018a). For example, we found that if there is a sentence in the main body that is almost the same as one of the captions, then that sentence is very likely to be used in the reference summary. Many such cases can be found in our annotated data.
We also found that in many documents, the fourth sentence is "scroll down for video". If this sentence appears in a document, it is often the case that the first three sentences are good enough to summarize the whole document. This finding provides yet another piece of evidence for why the simple Lead-3 baseline can be rather strong on CNN/Daily Mail. In addition, sentences similar to the first three sentences can often be found later in the document, which suggests that the first three sentences may not even belong to the main body of the document.

B Additional Illustration
In Fig. 4, we show the comparison of ROUGE, FAR, and Pyramid. In Fig. 5, we show the ground-truth FAR scores, the FAR scores estimated by various sentence regression approaches, and the FAR scores predicted by linear regression.

C Detailed Examples
We list below the full documents, reference summaries, and the corresponding FAMs of several examples shown in Table 2. In particular, Table 10 shows an example of several support groups covering the same facet. We release all of the annotated data to facilitate facet-aware evaluation and follow-up studies along this direction.

Document: -LRB-CNN -RRB-Paul Walker is hardly the first actor to die during a production . But Walker 's death in November 2013 at the age of 40 after a car crash was especially eerie given his rise to fame in the " Fast and Furious " film franchise . The release of " Furious 7 " on Friday (this is the only mention of "Friday" in the whole document) offers the opportunity for fans to remember - and possibly grieve again - the man that so many have praised as one of the nicest guys in Hollywood . " He was a person of humility , integrity , and compassion , " military veteran Kyle Upham said in an email to CNN . Walker secretly paid for the engagement ring Upham shopped for with his bride . " We did n't know him personally but this was apparent in the short time we spent with him . I know that we will never forget him and he will always be someone very special to us , " said Upham . The actor was on break from filming " Furious 7 " at the time of the fiery accident , which also claimed the life of the car 's driver , Roger Rodas . Producers said early on that they would not kill off Walker 's character , Brian O'Connor , a former cop turned road racer . Instead , the script was rewritten and special effects were used to finish scenes , with Walker 's brothers , Cody and Caleb , serving as body doubles . There are scenes that will resonate with the audience - including the ending , in which the filmmakers figured out a touching way to pay tribute to Walker while " retiring " his character . At the premiere Wednesday night in Hollywood , Walker 's co-star and close friend Vin Diesel gave a tearful speech before the screening , saying " This movie is more than a movie . " (random quotation, may use other quotes as well) " You 'll feel it when you see it , " Diesel said . " There 's something emotional that happens to you , where you walk out of this movie and you appreciate everyone you love because you just never know when the last day is you 're gon na see them . " There have been multiple tributes to Walker leading up to the release . Diesel revealed in an interview with the " Today " show that he had named his newborn daughter after Walker . Social media has also been paying homage to the late actor . A week after Walker 's death , about 5,000 people attended an outdoor memorial to him in Los Angeles . Most had never met him . Marcus Coleman told CNN he spent almost $ 1,000 to truck in a banner from Bakersfield for people to sign at the memorial . " It 's like losing a friend or a really close family member ... even though he is an actor and we never really met face to face , " Coleman said . " Sitting there , bringing his movies into your house or watching on TV , it 's like getting to know somebody . It really , really hurts . " Walker 's younger brother Cody told People magazine that he was initially nervous about how " Furious 7 " would turn out , but he is happy with the film . " It 's bittersweet , but I think Paul would be proud , " he said . CNN 's Paul Vercammen contributed to this report .
Reference Summary:
" Furious 7 " pays tribute to star Paul Walker , who died during filming
Vin Diesel : " This movie is more than a movie " (random quotation)
" Furious 7 " opens Friday (unimportant detail)

FAMs: N/A

Document: Police say Jenkins had cut a hole in the roof of a commercial business in Maryland on March 9 and deputies arrested him as he fled . According to Dover police , ' Jenkins was found in possession of .45 -caliber handgun that was stolen from a business in Delaware State Police Troop 9 jurisdiction . A search of Jenkins vehicle revealed an additional .45 -caliber handgun stolen from the same business . ' Jenkins is being held in Maryland and will face 72 charges involving the 18 burglaries in Dover when he is returned to Delaware . The charges he is facing break down to : four counts of wearing a disguise during the commission of a felony , eighteen counts of third-degree burglary , eighteen counts of possession of burglary tools , fourteen counts of theft under $ 1,500 , and eighteen counts of criminal mischief , two of which are felonies , authorities said .
Cpl. Mark Hoffman with the Dover Police Department told the News Journal that Delaware State Police are planning to file charges over a 19th robbery at Melvin 's Auto Service , which reportedly occurred in a part of Dover where jurisdiction is held by state police . Sharon Hutchison , who works at one of the businesses Jenkins allegedly robbed , told the newspaper ' He cut through two layers of drywall , studs and insulation . ' The Prince George 's County Sheriff 's Department did not immediately return a request for information on what charges Jenkins is facing there .

FAMs:
• thomas k. jenkins , 49 , was arrested last month by deputies with the prince george 's county sheriff 's office , authorities said .
[Support Group0][Sent0]: authorities said in a news release thursday that 49-year-old thomas k. jenkins of capitol heights , maryland , was arrested last month by deputies with the prince george 's county sheriff 's office .
• police say jenkins had cut a hole in the roof of a commercial business in maryland on march 9 and deputies arrested him as he fled .
[Support Group0][Sent0]: police say jenkins had cut a hole in the roof of a commercial business in maryland on march 9 and deputies arrested him as he fled .
• jenkins is accused of carrying out multiple robberies in dover , delaware .
[Support Group0][Sent0]: jenkins is being held in maryland and will face 72 charges involving the 18 burglaries in dover when he is returned to delaware .
[Support Group2][Sent0]: thomas jenkins has been accused by the dover police department of robbing multiple businesses .
• he is facing 72 charges from the dover police department for 18 robberies .
[Support Group0][Sent0]: jenkins is being held in maryland and will face 72 charges involving the 18 burglaries in dover when he is returned to delaware .
• the delaware state police is planning to file charges over a 19th robbery , which occurred in a part of dover where jurisdiction is held by state police .
[Support Group0][Sent0]: mark hoffman with the dover police department told the news journal that delaware state police are planning to file charges over a 19th robbery at melvin 's auto service , which reportedly occurred in a part of dover where jurisdiction is held by state police .