Unsupervised Abstractive Meeting Summarization with Multi-Sentence Compression and Budgeted Submodular Maximization

We introduce a novel graph-based framework for abstractive meeting speech summarization that is fully unsupervised and does not rely on any annotations. Our work combines the strengths of multiple recent approaches while addressing their weaknesses. Moreover, we leverage recent advances in word embeddings and graph degeneracy applied to NLP to take exterior semantic knowledge into account, and to design custom diversity and informativeness measures. Experiments on the AMI and ICSI corpora show that our system improves on the state-of-the-art. Code and data are publicly available, and our system can be interactively tested.


Introduction
People spend a lot of their time in meetings. The ubiquity of web-based meeting tools and the rapid improvement and adoption of Automatic Speech Recognition (ASR) are creating pressing needs for effective meeting speech summarization mechanisms.
Spontaneous multi-party meeting speech transcriptions widely differ from traditional documents. Instead of grammatical, well-segmented sentences, the input is made of often ill-formed and ungrammatical text fragments called utterances. On top of that, ASR transcription and segmentation errors inject additional noise into the input.
In this paper, we combine the strengths of 6 approaches that had previously been applied to 3 different tasks (keyword extraction, multi-sentence compression, and summarization) into a unified, fully unsupervised end-to-end meeting speech summarization framework that can generate readable summaries despite the noise inherent to ASR transcriptions. We also introduce some novel components. Our method reaches state-of-the-art performance and can be applied to languages other than English in an almost out-of-the-box fashion.
Footnote 1: https://bitbucket.org/dascim/acl2018_abssumm
Footnote 2: http://datascience.open-paas.org/abs_summ_app

Framework Overview
As illustrated in Figure 1, our system is made of 4 modules, briefly described in what follows. The first module pre-processes text. The goal of the second Community Detection step is to group together the utterances that should be summarized by a common abstractive sentence (Murray et al., 2012). These utterances typically correspond to a topic or subtopic discussed during the meeting. A single abstractive sentence is then separately generated for each community, using an extension of the Multi-Sentence Compression Graph (MSCG) of Filippova (2010). Finally, we generate a summary by selecting the best elements from the set of abstractive sentences under a budget constraint. We cast this problem as the maximization of a custom submodular quality function.
Note that our approach is fully unsupervised and does not rely on any annotations. Our input simply consists of a list of utterances without any metadata. All we need in addition to that is a part-of-speech tagger, a language model, a set of pre-trained word vectors, a list of stopwords and filler words, and optionally, access to a lexical database such as WordNet. Our system can work out-of-the-box in most languages for which such resources are available.

Related Work and Contributions
As detailed below, our framework combines the strengths of 6 recent works. It also includes novel components.

Multi-Sentence Compression Graph (MSCG) (Filippova, 2010)
Description: a fully unsupervised, simple approach for generating a short, self-sufficient sentence from a cluster of related, overlapping sentences. As shown in Figure 5, a word graph is constructed with special edge weights, the K-shortest weighted paths are then found and re-ranked with a scoring function, and the best path is used as the compression. The assumption is that redundancy alone is enough to ensure informativeness and grammaticality.
Limitations: despite making great strides and showing promising results, Filippova (2010) reported that 48% and 36% of the generated sentences were, respectively, missing important information and not perfectly grammatical.
Contributions: to respectively improve informativeness and grammaticality, we combine ideas found in Boudin and Morin (2013) and Mehdad et al. (2013), as described next.
3.2 More informative MSCG (Boudin and Morin, 2013)
Description: same task and approach as in Filippova (2010), except that a word co-occurrence network is built from the cluster of sentences, and that the PageRank scores of the nodes are computed in the manner of Mihalcea and Tarau (2004). The scores are then injected into the path re-ranking function to favor informative paths.
Limitations: PageRank is not state-of-the-art in capturing the importance of words in a document. Grammaticality is not considered.
Contributions: we take grammaticality into account as explained in subsection 3.4. We also follow recent evidence (Tixier et al., 2016a) that spreading influence, as captured by graph degeneracy-based measures, is better correlated with "keywordedness" than PageRank scores, as explained in the next subsection.
3.3 Graph-based word importance scoring (Tixier et al., 2016a)
Word co-occurrence network. As shown in Figure 2, we consider a word co-occurrence network as an undirected, weighted graph constructed by sliding a fixed-size window over text, and where edge weights represent co-occurrence counts (Tixier et al., 2016b; Mihalcea and Tarau, 2004).
Important words are influential nodes. In social networks, it was shown that influential spreaders, that is, those individuals that can reach the largest part of the network in a given number of steps, are better identified via their core numbers rather than via their PageRank scores or degrees (Kitsak et al., 2010). See Figure 3 for the intuition. Similarly, in NLP, Tixier et al. (2016a) have shown that keywords are better identified via their core numbers rather than via their TextRank scores, that is, keywords are influencers within their word co-occurrence network.
Graph degeneracy (Seidman, 1983). Let G(V, E) be an undirected, weighted graph with n = |V| nodes and m = |E| edges. A k-core of G is a maximal subgraph of G in which every vertex v has weighted degree at least k. As shown in Figures 3 and 4, the k-core decomposition of G forms a hierarchy of nested subgraphs whose cohesiveness and size respectively increase and decrease with k. The higher-level cores can be viewed as a filtered version of the graph that excludes noise. This property is highly valuable when dealing with graphs constructed from noisy text, like utterances. The core number of a node is the highest order of a core that contains this node.
Figure 3: k-core decomposition. The blue and the yellow nodes have the same degree and similar PageRank numbers. However, the blue node is a much more influential spreader as it is strategically placed in the core of the network, as captured by its higher core number.
The CoreRank number of a node (Tixier et al., 2016a;Bae and Kim, 2014) is defined as the sum of the core numbers of its neighbors. As shown in Figure 4, CoreRank more finely captures the structural position of each node in the graph than raw core numbers. Also, stabilizing scores across node neighborhoods enhances the inherent noise robustness property of graph degeneracy, which is desirable when working with noisy speech-to-text output.
Time complexity. Building a graph-of-words is O(nW), where W is the size of the sliding window, and computing the weighted k-core decomposition of a graph requires O(m log(n)) (Batagelj and Zaveršnik, 2002). For small pieces of text, this two-step process is so affordable that it can be used in real time. Finally, computing CoreRank scores can be done with only a small overhead of O(n), provided that the graph is stored as a hash of adjacency lists. Getting the CoreRank numbers from scratch for a community of utterances is therefore very fast, especially since typically in this context, n ∼ 10 and m ∼ 100.
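To make the scoring pipeline of this subsection concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions and not the released implementation: the function names are hypothetical, utterances are assumed to be already tokenized, and the peeling loop is a naive quadratic version rather than the O(m log(n)) algorithm of Batagelj and Zaveršnik (2002).

```python
from collections import defaultdict


def build_cooccurrence_graph(utterances, window=6):
    """Undirected weighted graph stored as a hash of adjacency maps.

    Edge weights count co-occurrences within a sliding window of `window`
    tokens; the window never spans two utterances.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for tokens in utterances:
        for i, word in enumerate(tokens):
            graph[word]  # make sure isolated words still become nodes
            for other in tokens[i + 1:i + window]:
                if other != word:
                    graph[word][other] += 1
                    graph[other][word] += 1
    return graph


def weighted_core_numbers(graph):
    """Naive weighted k-core peeling.

    Repeatedly remove the node with the smallest remaining weighted degree;
    its core number is the largest such minimum observed so far.
    """
    degree = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    remaining = set(graph)
    core, current = {}, 0
    while remaining:
        v = min(remaining, key=lambda u: (degree[u], u))
        current = max(current, degree[v])
        core[v] = current
        remaining.remove(v)
        for u, w in graph[v].items():
            if u in remaining:
                degree[u] -= w
    return core


def corerank(graph, core):
    """CoreRank of a node = sum of the core numbers of its neighbors."""
    return {v: sum(core[u] for u in graph[v]) for v in graph}


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["c", "a"], ["a", "d"]]
    g = build_cooccurrence_graph(toy, window=2)
    cores = weighted_core_numbers(g)
    print(cores, corerank(g, cores))
```

In the toy graph, the pendant node d has a lower core number than the triangle a, b, c, and CoreRank further separates a (which touches every other node) from b and c, which raw core numbers cannot do.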
3.4 Fluency-aware, more abstractive MSCG (Mehdad et al., 2013)
Description: a supervised end-to-end framework for abstractive meeting summarization. Community Detection is performed by (1) building an utterance graph with a logistic regression classifier, and (2) applying the CONGA algorithm. Then, before performing sentence compression with the MSCG, the authors also (3) build an entailment graph with an SVM classifier in order to eliminate redundant and less informative utterances. In addition, the authors propose the use of WordNet (Miller, 1995) during the MSCG building phase to capture lexical knowledge between words and thus generate more abstractive compressions, and of a language model when re-ranking the shortest paths, to favor fluent compressions.
Limitations: this effort was a significant advance, as it was the first application of the MSCG to the meeting summarization task, to the best of our knowledge. However, steps (1) and (3) above are complex, based on handcrafted features, and respectively require annotated training data in the form of links between human-written abstractive sentences and original utterances and multiple external datasets (e.g., from the Recognizing Textual Entailment Challenge). Such annotations are costly to obtain and very seldom available in practice.
Contributions: while we retain the use of WordNet and of a language model, we show that, without deteriorating the quality of the results, steps (1) and (2) above (Community Detection) can be performed in a much simpler, completely unsupervised way, and that step (3) can be removed. That is, the MSCG is powerful enough to remove redundancy and ensure informativeness, provided that proper edge weights and a proper path re-ranking function are used.
In addition to the aforementioned contributions, we also introduce the following novel components into our abstractive summarization pipeline:
• we inject global exterior knowledge into the edge weights of the MSCG, by using the Word Attraction Force of Wang et al. (2014), based on distance in the word embedding space,
• we add a diversity term to the path re-ranking function, that measures how many unique clusters in the embedding space are visited by each path,
• rather than using all the abstractive sentences as the final summary like in Mehdad et al. (2013), we maximize a custom submodular function to select a subset of abstractive sentences that is near-optimal given a budget constraint (summary size).
A brief background of submodularity in the context of summarization is provided next.
3.5 Submodularity for summarization (Lin and Bilmes, 2010; Lin, 2012)
Selecting an optimal subset of abstractive sentences from a larger set can be framed as a budgeted submodular maximization task:

$$\underset{S \subseteq \mathcal{S}}{\arg\max}\; f(S) \quad \text{subject to} \quad \sum_{s \in S} c_s \le B \tag{1}$$

where $S$ is a summary drawn from the full set $\mathcal{S}$ of abstractive sentences, $c_s$ is the cost (word count) of sentence $s$, $B$ is the desired summary size in words (budget), and $f$ is a summary quality scoring set function, which assigns a single numeric score to a summary $S$. This combinatorial optimization task is NP-hard. However, near-optimal performance can be guaranteed with a modified greedy algorithm (Lin and Bilmes, 2010) that iteratively selects the sentence $s$ that maximizes the ratio of quality function gain to scaled cost, $\big(f(S \cup s) - f(S)\big) / c_s^{\,r}$ (where $S$ is the current summary and $r \ge 0$ is a scaling factor).
In order for the performance guarantees to hold however, f has to be submodular and monotone non-decreasing. Our proposed f is described in subsection 4.4.
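The modified greedy procedure can be sketched as follows. This is a minimal illustration of the algorithm described by Lin and Bilmes (2010), written under assumptions: the function name `budgeted_greedy`, the representation of sentences as strings, and the toy word-coverage function used in the demonstration are all hypothetical and are not the quality function proposed in this paper.

```python
def budgeted_greedy(sentences, cost, f, budget, r=1.0):
    """Greedy budgeted maximization of a set function f.

    sentences: list of candidate sentences (hashable, e.g. strings)
    cost:      dict mapping each sentence to a positive cost (word count)
    f:         function taking a list of sentences and returning a score
    budget:    maximum total cost of the returned summary
    r:         scaling factor applied to the cost (r >= 0)
    """
    selected, spent = [], 0
    candidates = list(sentences)
    while candidates:
        base = f(selected)
        # candidate with the best ratio of quality gain to scaled cost
        best = max(candidates,
                   key=lambda s: (f(selected + [s]) - base) / cost[s] ** r)
        candidates.remove(best)
        gain = f(selected + [best]) - base
        if spent + cost[best] <= budget and gain >= 0:
            selected.append(best)
            spent += cost[best]
    # final safeguard: compare with the best single affordable sentence
    singles = [s for s in sentences if cost[s] <= budget]
    if singles:
        top = max(singles, key=lambda s: f([s]))
        if f([top]) > f(selected):
            return [top]
    return selected


if __name__ == "__main__":
    # toy monotone submodular function: number of distinct words covered
    def coverage(summary):
        return len({w for s in summary for w in s.split()})

    costs = {"a b c d": 3, "a b": 2, "e": 1}
    print(budgeted_greedy(list(costs), costs, coverage, budget=4))
```

With r = 0 the cost only acts through the budget check, so the selection is driven by raw gain; larger values of r penalize long sentences more strongly.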

Our Framework
We detail next each of the four modules in our architecture (shown in Figure 1).

Text preprocessing
We adopt preprocessing steps tailored to the characteristics of ASR transcriptions. Consecutive repeated unigrams and bigrams are reduced to single terms. Specific ASR tags, such as {vocalsound}, {pause}, and {gap} are filtered out. In addition, filler words, such as uh-huh, okay, well, and by the way are also discarded. Consecutive stopwords at the beginning and end of utterances are stripped.
In the end, utterances that contain fewer than 3 non-stopwords are pruned out. The surviving utterances are used for the next steps.
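The preprocessing steps above can be sketched as follows. This is a rough illustration only: the filler and stopword lists are small hypothetical subsets, lowercasing and whitespace tokenization are assumptions, and the order of the steps is one plausible choice rather than the authors' exact pipeline.

```python
import re

ASR_TAG = re.compile(r"\{[^}]*\}")  # matches tags such as {vocalsound}, {pause}, {gap}
FILLERS = ["by the way", "uh-huh", "okay", "well"]  # illustrative subset
STOPWORDS = {"the", "a", "is", "to", "and", "of", "it", "we", "i", "so"}  # illustrative subset


def collapse_repeats(tokens):
    """Reduce consecutive repeated unigrams and bigrams to a single occurrence."""
    out = []
    for tok in tokens:
        if out and out[-1] == tok:
            continue  # repeated unigram
        out.append(tok)
        if len(out) >= 4 and out[-4:-2] == out[-2:]:
            del out[-2:]  # repeated bigram
    return out


def preprocess(utterance, min_content_words=3):
    """Return the cleaned token list, or None if the utterance is pruned."""
    text = ASR_TAG.sub(" ", utterance.lower())
    for filler in FILLERS:
        pattern = r"(?<![\w-])" + re.escape(filler) + r"(?![\w-])"
        text = re.sub(pattern, " ", text)
    tokens = collapse_repeats(text.split())
    # strip consecutive stopwords at the beginning and at the end
    while tokens and tokens[0] in STOPWORDS:
        tokens.pop(0)
    while tokens and tokens[-1] in STOPWORDS:
        tokens.pop()
    content = [t for t in tokens if t not in STOPWORDS]
    return tokens if len(content) >= min_content_words else None


if __name__ == "__main__":
    print(preprocess("{vocalsound} okay so we we need to design design a remote control well"))
```

Stopwords inside the utterance are kept on purpose, since later stages need them to produce readable compressions; only the leading and trailing ones are stripped.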

Utterance community detection
The goal here is to cluster utterances into communities that should be summarized by a common abstractive sentence.
We initially experimented with techniques capitalizing on word vectors, such as k-means and hierarchical clustering based on the Euclidean distance or the Word Mover's Distance (Kusner et al., 2015). We also tried graph-based approaches, such as community detection in a complete graph where nodes are utterances and edges are weighted based on the aforementioned distances.
Best results were obtained, however, with a simple approach in which utterances are projected into the vector space and assigned standard TF-IDF weights. Then, the dimensionality of the utterance-term matrix is reduced with Latent Semantic Analysis (LSA), and finally, the k-means algorithm is applied. Note that LSA is only used here, during the utterance community detection phase, to remove noise and stabilize clustering. We do not use a topic graph in our approach.
We think using word embeddings was not effective, because in meeting speech, as opposed to traditional documents, participants tend to use the same term to refer to the same thing throughout the entire conversation, as noted by Riedhammer et al. (2010), and as verified in practice. This is probably why, for clustering utterances, capturing synonymy is counterproductive, as it artificially reduces the distance between every pair of utterances and blurs the picture.

Multi-Sentence Compression
The following steps are performed separately for each community.

Word importance scoring
From a processed version of the community (stemming and stopword removal), we construct an undirected, weighted word co-occurrence network as described in subsection 3.3. We use a sliding window of size W = 6 that does not span across utterances. Note that stemming is performed only here, and for the sole purpose of building the word co-occurrence network.
We then compute the CoreRank numbers of the nodes as described in subsection 3.3. We finally reweigh the CoreRank scores, indicative of word importance within a given community, with a quantity akin to an Inverse Document Frequency, where communities serve as documents and the full meeting as the collection. We thus obtain something equivalent to the TW-IDF weighting scheme of Rousseau and Vazirgiannis (2013), where the CoreRank scores are the term weights TW:

$$\text{TW-IDF}(t, d, D) = \text{TW}(t, d) \times \text{IDF}(t, D) \tag{2}$$

where $t$ is a term belonging to community $d$, and $D$ is the set of all utterance communities. We compute the IDF as $\text{IDF}(t, D) = 1 + \log\left(|D| / D_t\right)$, where $|D|$ is the number of communities and $D_t$ the number of communities containing $t$.
Figure 5: Compressed sentence (in bold red) generated by our multi-sentence compression graph (MSCG) for a 3-utterance community from meeting IS1009b of the AMI corpus. Using Filippova (2010)'s weighting and re-ranking scheme here would have selected another path: "design different remotes for different people bit of it's from their tend to for ti". Note that the compressed sentence does not appear in the initial set of utterances, and is compact and grammatical, despite the redundancy, transcription and segmentation errors of the input. The abstractive and robust nature of the MSCG makes it particularly well-suited to the meeting domain.
The intuition behind this reweighing scheme is that a term should be considered important within a given meeting if it has a high CoreRank score within its community and if the number of communities in which the term appears is relatively small.
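As a small illustration of this reweighing, the sketch below multiplies per-community CoreRank scores by the IDF defined above. The function name `tw_idf` and the input layout (one dictionary of scores per community) are assumptions made for the example, and the natural logarithm is assumed since the base is not specified.

```python
import math


def tw_idf(corerank_by_community):
    """Reweigh CoreRank scores by an IDF computed over communities.

    corerank_by_community: list of dicts, one per utterance community,
    mapping each term to its CoreRank score within that community.
    Returns a list of dicts with the same keys and TW-IDF values.
    """
    n_communities = len(corerank_by_community)
    # D_t: number of communities that contain term t
    doc_freq = {}
    for scores in corerank_by_community:
        for term in scores:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return [
        {t: tw * (1 + math.log(n_communities / doc_freq[t]))
         for t, tw in scores.items()}
        for scores in corerank_by_community
    ]


if __name__ == "__main__":
    communities = [{"remote": 4, "design": 2}, {"remote": 3, "battery": 5}]
    print(tw_idf(communities))
```

In this toy input, "remote" appears in every community, so its IDF is 1 and its score is unchanged, whereas "design" and "battery" are boosted because they are specific to a single community.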

Word graph building
The backbone of the graph is laid out as a directed sequence of nodes corresponding to the words in the first utterance, with special START and END nodes at the beginning and at the end (see Figure 5). Edge direction follows the natural flow of text. Words from the remaining utterances are then iteratively added to the graph (between the START and END nodes) based on the following rules:
1) if the word is a non-stopword, the word is mapped onto an existing node if it has the same lowercased form and the same part-of-speech tag. In case of multiple matches, we check the immediate context (the preceding and following words in the utterance and the neighboring nodes in the graph), and we pick the node with the largest context overlap or, when there is no overlap, the node that has the greatest number of words already mapped to it. When there is no match, we use WordNet as described in Appendix A.
2) if the word is a stopword and there is a match, it is mapped only if there is an overlap of at least one non-stopword in the immediate context. Otherwise, a new node is created.
Finally, note that any two words appearing within the same utterance cannot be mapped to the same node. This ensures that every utterance is a loopless path in the graph. Of course, there are many more paths in the graphs than original utterances.

Edge Weight Assignment
Once the word graph is constructed, we assign weights to its edges as:

$$w'''(p_i, p_j) = \frac{w'(p_i, p_j)}{w''(p_i, p_j)} \tag{3}$$

where $p_i$ and $p_j$ are two neighbors in the MSCG. As detailed next, those weights combine local co-occurrence statistics (numerator) with global exterior knowledge (denominator). Note that the lower the weight of an edge, the better.
Local co-occurrence statistics. We use Filippova (2010)'s formula:

$$w'(p_i, p_j) = \frac{\mathrm{freq}(p_i) + \mathrm{freq}(p_j)}{\sum_{P \in G,\; p_i, p_j \in P} \mathrm{diff}(P, p_i, p_j)^{-1}} \tag{4}$$

where $\mathrm{freq}(p_i)$ is the number of words mapped to node $p_i$ in the MSCG $G$, and $\mathrm{diff}(P, p_i, p_j)^{-1}$ is the inverse of the distance between $p_i$ and $p_j$ in a path $P$ (in number of hops). This weighting function favors edges between infrequent words that frequently appear close to each other in the text (the lower, the better).
Global exterior knowledge. We introduce a second term based on the Word Attraction Force score of Wang et al. (2014):

$$w''(p_i, p_j) = \frac{\mathrm{freq}(p_i) \times \mathrm{freq}(p_j)}{d_{p_i, p_j}^{\,2}} \tag{5}$$

where $d_{p_i, p_j}$ is the Euclidean distance between the words mapped to $p_i$ and $p_j$ in a word embedding space (see footnote 4). This component favors paths going through salient words that have high semantic similarity (the higher, the better). The goal is to ensure the readability of the compression, by avoiding sentences that jump from one word to a completely unrelated one.
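The way the two components combine can be sketched numerically as follows. This is a hedged illustration: the function names are hypothetical, node frequencies and hop distances are assumed to be precomputed from the word graph, the local term follows the form of Filippova (2010)'s formula and the global term the frequency-product-over-squared-distance form of the Word Attraction Force, and the two-dimensional vectors stand in for real word embeddings.

```python
import math


def local_weight(freq_i, freq_j, hop_distances):
    """Local term: summed node frequencies divided by the sum of inverse
    hop distances between the two nodes over the paths containing both."""
    return (freq_i + freq_j) / sum(1.0 / d for d in hop_distances)


def attraction_force(freq_i, freq_j, vec_i, vec_j):
    """Global term: product of node frequencies divided by the squared
    Euclidean distance between the two word vectors (must be non-zero)."""
    return freq_i * freq_j / math.dist(vec_i, vec_j) ** 2


def edge_weight(freq_i, freq_j, hop_distances, vec_i, vec_j):
    """Final edge weight: local term over global term (lower is better)."""
    return (local_weight(freq_i, freq_j, hop_distances)
            / attraction_force(freq_i, freq_j, vec_i, vec_j))


if __name__ == "__main__":
    # two nodes seen 2 and 3 times, 1 and 2 hops apart in two utterances
    print(edge_weight(2, 3, [1, 2], (0.0, 0.0), (2.0, 0.0)))
```

Moving the two vectors closer together increases the attraction force and therefore lowers the edge weight, which is the intended effect: semantically related neighbors become cheaper to traverse.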

Path re-ranking
As in Boudin and Morin (2013), we use a shortest weighted path algorithm to find the K paths between the START and END symbols having the lowest cumulative edge weight:

$$W(P) = \sum_{i=1}^{|P|-1} w'''(p_i, p_{i+1}) \tag{6}$$

where $|P|$ is the number of nodes in the path and $w'''$ is the edge weight. Paths having fewer than z words or that do not contain a verb are filtered out (z is a tuning parameter). However, unlike in Boudin and Morin (2013), we re-rank the K best paths with the following novel weighting scheme (the lower, the better), and the path with the lowest score is used as the compression:

$$\mathrm{score}(P) = \frac{W(P)}{|P| \times F(P) \times C(P) \times D(P)} \tag{7}$$

The denominator takes into account the length of the path, and its fluency (F), coverage (C), and diversity (D). F, C, and D are detailed in what follows.
Fluency. We estimate the grammaticality of a path with an n-gram language model. In our experiments, we used a trigram model:

$$F(P) = \frac{\sum_{i=1}^{|P|} \log \Pr\left(p_i \mid p_{i-n+1}^{\,i-1}\right)}{\#\text{n-gram}} \tag{8}$$

where $|P|$ denotes path length, and $p_i$ and #n-gram are respectively the words and the number of n-grams in the path.
Coverage. We reward the paths that visit important nouns, verbs and adjectives:

$$C(P) = \frac{\sum_{p_i \in P} \text{TW-IDF}(p_i)}{\#p_i} \tag{9}$$

where $\#p_i$ is the number of nouns, verbs and adjectives in the path. The TW-IDF scores are computed as explained in subsection 4.3.
Diversity. We cluster all words from the MSCG in the word embedding space by applying the k-means algorithm. We then measure the diversity of the vocabulary contained in a path as the number of unique clusters visited by the path, normalized by the length of the path:

$$D(P) = \frac{\sum_{j=1}^{k} \mathbb{1}_{\exists\, p_i \in P \,\mid\, p_i \in \text{cluster}_j}}{|P|} \tag{10}$$

The graphical intuition for this measure is provided in Figure 6. Note that we do not normalize D by the total number of clusters (only by path length) because k is fixed for all candidate paths.
Footnote 4: GoogleNews vectors, https://code.google.com/archive/p/word2vec
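The coverage and diversity terms, and the way they enter the re-ranking score, can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, `scores` is assumed to contain TW-IDF values for nouns, verbs and adjectives only, `cluster_of` is assumed to map each word to a precomputed k-means cluster id, and fluency is passed in as an already computed number because a language model is out of scope here.

```python
def coverage(path, scores):
    """Average TW-IDF score of the scored words (nouns, verbs, adjectives)
    that appear in the path; 0.0 if the path contains none."""
    hits = [scores[w] for w in path if w in scores]
    return sum(hits) / len(hits) if hits else 0.0


def diversity(path, cluster_of):
    """Number of unique embedding-space clusters visited by the path,
    normalized by the path length."""
    visited = {cluster_of[w] for w in path if w in cluster_of}
    return len(visited) / len(path)


def path_score(cumulative_weight, length, fluency, cov, div):
    """Re-ranking score (lower is better): cumulative edge weight divided by
    the product of path length, fluency, coverage and diversity."""
    return cumulative_weight / (length * fluency * cov * div)


if __name__ == "__main__":
    path = ["design", "different", "remotes", "for", "people"]
    scores = {"design": 2.0, "remotes": 4.0, "people": 3.0}
    clusters = {"design": 0, "different": 1, "remotes": 0, "people": 2}
    c, d = coverage(path, scores), diversity(path, clusters)
    print(c, d, path_score(6.0, len(path), 2.0, c, d))
```

Because every factor sits in the denominator, a path improves its rank by being longer, more fluent, richer in important content words, or more varied in vocabulary, all else being equal.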

Budgeted submodular maximization
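We apply the previous steps separately for all utterance communities, which results in a set $\mathcal{S}$ of abstractive sentences (one for each community). This set of sentences can already be considered to be a summary of the meeting. However, it might exceed the maximum size allowed, and still contain some redundancy or off-topic sections unrelated to the general theme of the meeting (e.g., chit-chat). Therefore, we design the following submodular and monotone non-decreasing objective function:

$$f(S) = \sum_{s_i \in S} n_{s_i} w_{s_i} + \lambda \sum_{j=1}^{k} \mathbb{1}_{\exists\, s_i \in S \,\mid\, s_i \in \text{cluster}_j} \tag{11}$$

where $\lambda \ge 0$ is the trade-off parameter, $n_{s_i}$ is the number of occurrences of word $s_i$ in $S$, and $w_{s_i}$ is the CoreRank score of $s_i$. The first term rewards coverage of important words, and the second term rewards diversity, measured as the number of word clusters represented in the summary.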
Then, as explained in subsection 3.5, we obtain a near-optimal subset of abstractive sentences by maximizing f with a greedy algorithm. CoreRank scores and clusters are found as previously described, except that this time they are obtained from the full processed meeting transcription rather than from a single utterance community.
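A quality function of this shape can be sketched as a closure over the word scores and clusters. This is a minimal illustration with hypothetical names (`make_quality_function`, `word_weight`, `cluster_of`), assuming sentences are whitespace-tokenized strings and that the CoreRank scores and cluster assignments were computed beforehand on the full meeting.

```python
def make_quality_function(word_weight, cluster_of, lam):
    """Build f(S) = coverage(S) + lam * diversity(S).

    word_weight: dict mapping a word to its CoreRank score
    cluster_of:  dict mapping a word to its cluster id in the embedding space
    lam:         non-negative trade-off parameter
    """
    def f(summary):
        words = [w for sentence in summary for w in sentence.split()]
        # summing the weight of every token equals the sum over unique
        # words of (number of occurrences x weight)
        cov = sum(word_weight.get(w, 0.0) for w in words)
        # number of distinct clusters with at least one word in the summary
        div = len({cluster_of[w] for w in words if w in cluster_of})
        return cov + lam * div
    return f


if __name__ == "__main__":
    f = make_quality_function(
        {"remote": 4.0, "design": 2.0, "battery": 5.0},
        {"remote": 0, "design": 0, "battery": 1},
        lam=0.5,
    )
    print(f([]), f(["design remote"]), f(["design remote", "remote battery"]))
```

Both terms can only grow when a sentence is added, which matches the monotone non-decreasing requirement; the diversity term saturates once a cluster is already represented, so sentences that bring only already-covered clusters add nothing to it.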

Datasets
We conducted experiments on the widely-used AMI (McCowan et al., 2005) and ICSI (Janin et al., 2003) benchmark datasets. We used the traditional test sets of 20 and 6 meetings respectively for the AMI and ICSI corpora (Riedhammer et al., 2008). Each meeting in the AMI test set is associated with a human abstractive summary of 290 words on average, whereas each meeting in the ICSI test set is associated with 3 human abstractive summaries of respective average sizes 220, 220 and 670 words. For parameter tuning, we constructed development sets of 47 and 25 meetings, respectively for AMI and ICSI, by randomly sampling from the training sets. The word error rates of the ASR transcriptions are 36% and 37% for AMI and ICSI, respectively.

Baselines
We compared our system against 7 baselines, which are listed below and more thoroughly detailed in Appendix B. Note that preprocessing was exactly the same for our system and all baselines.
• Random and Longest Greedy are basic baselines recommended by Riedhammer et al. (2008),
• Oracle is the same as the random baseline, but uses the human extractive summaries as input.
In addition to the baselines above, we included in our comparison 3 variants of our system using different MSCGs: Our System (Baseline) uses the original MSCG of Filippova (2010), Our System (KeyRank) uses that of Boudin and Morin (2013), and Our System (FluCovRank) that of Mehdad et al. (2013). Details about each approach were given in Section 3.

Parameter tuning
For Our System and each of its variants, we conducted a grid search on the development sets of each corpus, for fixed summary sizes of 350 and 450 words (AMI and ICSI). We searched the following parameters:
• n: number of utterance communities (see Section 4.2). We tested values of n ranging from 20 to 60, with steps of 5. This parameter controls how abstractive the summary should be. If all utterances are assigned to their own singleton community, the MSCG is of no utility, and our framework is extractive. It becomes more and more abstractive as the number of communities decreases.
• z: minimum path length (see Section 4.3). We searched values in the range [6, 16] with steps of 2. If a path is shorter than a certain minimum number of words, it often corresponds to an invalid sentence, and should thereby be filtered out.
The scaling factor r (see subsection 3.5) makes sure that the quality function gain and utterance cost are comparable.
The best parameter values for each corpus are summarized in Table 1. λ is mostly non-zero, indicating that it is necessary to include a regularization term in the submodular function. In some cases though, r is equal to zero, which means that utterance costs are not involved in the greedy decision heuristic. These observations contradict the conclusion of Lin (2012).
Apart from the tuning parameters, we set the number of LSA dimensions to 30 and 60 (resp. on AMI and ICSI). The small number of LSA dimensions retained can be explained by the fact that the AMI and ICSI transcriptions feature 532 and 1126 unique words on average, which is much smaller than for traditional documents. This is due to relatively small meeting duration, and to the fact that participants tend to stick to the same terms throughout the entire conversation. For the k-means algorithm, k was set equal to the minimum path length z when doing MSCG path re-ranking (see Equation 10), and to 60 when generating the final summary (see Equation 11).
Following Boudin and Morin (2013), the number of shortest weighted paths K was set to 200, which is greater than the K = 100 used by Filippova (2010). Increasing K from 100 improves performance with diminishing returns, but significantly increases complexity. We empirically found 200 to be a good trade-off.

Results and Interpretation
Metrics. We evaluated performance with the widely-used ROUGE-1, ROUGE-2 and ROUGE-SU4 metrics (Lin, 2004). These metrics are respectively based on unigram, bigram, and unigram plus skip-bigram overlap with maximum skip distance of 4, and have been shown to be highly correlated with human evaluations (Lin, 2004). ROUGE-2 scores can be seen as a measure of summary readability (Lin and Hovy, 2003; Ganesan et al., 2010). ROUGE-SU4 does not require consecutive matches but is still sensitive to word order.
Macro-averaged results for summaries generated from automatic transcriptions can be seen in Figure 7 and Table 2. Table 2 provides detailed comparisons over the fixed budgets that we used for parameter tuning, while Figure 7 shows the performance of the models for budgets ranging from 150 to 500 words. The same information for summaries generated from manual transcriptions is available in Appendix C. Finally, summary examples are available in Appendix D.
ROUGE-1. Our systems outperform all baselines on AMI (including Oracle) and all baselines on ICSI (except Oracle). Specifically, Our System is best on ICSI, while Our System (KeyRank) is superior on AMI. We can also observe in Figure 7 that our systems are consistently better throughout the different summary sizes, even though their parameters were tuned for specific sizes only. This shows that the best parameter values are quite robust across the entire budget range.
ROUGE-2. Again, our systems (except Our System (Baseline)) outperform all baselines, except Oracle. In addition, Our System and Our System (FluCovRank) consistently improve on Our System (Baseline), which indicates that the novel components we introduce improve summary fluency.
ROUGE-SU4. ROUGE-SU4 was used to measure the amount of overlapping in-order word pairs. Our systems are competitive with all baselines, including Oracle. As with ROUGE-1, Our System is better than Our System (KeyRank) on ICSI, whereas the opposite is true on AMI.
General remarks.
• The summaries of all systems except Oracle were generated from noisy ASR transcriptions, but were compared against human abstractive summaries. Since ROUGE is based on word overlap, it is very difficult to reach very high scores, because many words in the ground truth summaries do not appear in the transcriptions at all.
• The scores of all systems are lower on ICSI than on AMI. This can be explained by the fact that on ICSI, the system summaries have to jointly match 3 human abstractive summaries of different content and size, which is much more difficult than matching a single summary.
• Our framework is very competitive with Oracle, which is notable since the latter has direct access to the human extractive summaries. Note that Oracle does not reach very high ROUGE scores because the overlap between the human extractive and abstractive summaries is low (19% and 29%, respectively on AMI and ICSI test sets).

Conclusion and Next Steps
Our framework combines the strengths of 6 approaches that had previously been applied to 3 different tasks (keyword extraction, multi-sentence compression, and summarization) into a unified, fully unsupervised end-to-end summarization framework, and introduces some novel components. Rigorous evaluation on the AMI and ICSI corpora shows that we reach state-of-the-art performance, and generate reasonably grammatical abstractive summaries despite taking noisy utterances as input and not relying on any annotations or training data. Finally, thanks to its fully unsupervised nature, our method is applicable to languages other than English in an almost out-of-the-box manner.
Our framework was developed for the meeting domain. Indeed, our generative component, the multi-sentence compression graph (MSCG), needs redundancy to perform well. Such redundancy is typically present in meeting speech but not in traditional documents. In addition, the MSCG is by design robust to noise, and our custom path re-ranking strategy, based on graph degeneracy, makes it even more robust to noise. As a result, our framework is advantaged on ASR input. Finally, we use a language model to favor fluent paths, which is crucial when working with (meeting) speech but not that important when dealing with well-formed input.
Future efforts should be dedicated to improving the community detection phase and generating more abstractive sentences, probably by harnessing Deep Learning. However, the lack of large training sets for the meeting domain is an obstacle to the use of neural approaches.