Real-Time Keyword Extraction from Conversations

We introduce a novel method to extract keywords from meeting speech in real-time. Our approach builds on the graph-of-words representation of text and leverages the k-core decomposition algorithm and properties of submodular functions. We outperform multiple baselines in a real-time scenario emulated from the AMI and ICSI meeting corpora. Evaluation is conducted against both extractive and abstractive gold standard using two standard performance metrics and a newer one based on word embeddings.


Introduction
Motivation. People spend a significant amount of their time attending meetings. To benefit from recent technological advances, many companies are now using web-based meeting tools that can accommodate remote participants and allow video in addition to voice calls. While very useful, those tools typically do not offer extra features beyond screen sharing or instant messaging. In particular, they broadcast participant voices without leveraging the rich information conveyed in speech. Yet, the use of Automatic Speech Recognition (ASR) systems opens the gate to numerous text mining applications that can assist participants as the meeting unfolds, or once it is over.
Goals. Here, we focus on extracting keywords in real-time from speech transcriptions (ASR output) over the course of a virtual meeting. This task is very important, as current keywords provide a snapshot of the ongoing topics and can be used to improve productivity in a variety of ways: (1) on-the-fly retrieval of relevant internal and external resources (webpages, emails) based on the topics detected, (2) constant maintenance of a meeting summary to enable latecomers to quickly catch up, and (3) smart indexing once the meeting is over.
* This research is supported in part by the OpenPaaS::NG project.

Challenges. Processing multi-party meeting speech transcriptions is a difficult NLP task. First, spontaneous speech differs from traditional documents. In lieu of well-formed, self-contained sentences, the data consist of fragments of speech transcripts called utterances, which are often ill-formed, ungrammatical, and contain informal or filler words (e.g., "uh-huh"). Moreover, speakers dilute important information by frequently pausing, interrupting each other, and chit-chatting. Second, errors made by the ASR system inject additional noise into the transcriptions.
Contributions. Our contributions are fourfold:
1. We build on the k-core graph decomposition algorithm to assign scores to terms. As will be explained, our approach is particularly well suited to speech transcriptions as it is fully unsupervised and robust to noise.
2. To select the best terms, we propose a new keyword quality function and prove that it is submodular, which enables its near-optimal optimization under a budget constraint, fast enough to meet the real-time requirements.
3. We evaluate the performance of our method against that of numerous baselines on two standard, well-known datasets (AMI and ICSI), and reach state-of-the-art performance.
4. Finally, we release our code and data publicly 1, making our study fully reproducible. Furthermore, our system can be interactively tested online 2.
In the remainder of this paper, we introduce our system, describe our experiments, and report and interpret our results.

Proposed system
As shown in Figure 1, our system can be broken down into 4 modules, which we describe in what follows.
T parameter. We receive as input a stream of text from the ASR tool, composed of utterances lasting 2.01s on average (std. dev. of 2.03s). Starting from t = 0 (beginning of the meeting), our system considers consecutive intervals I_i of fixed size T = 60s: I_1 is made up of all utterances starting within [0, T), I_2 covers [T, 2T), etc. The number of words in each interval (before cleaning) is 200 on average (std. dev. of 75). T is a trade-off parameter: as it increases, more textual data become available for the interval, which usually yields better keywords, but additional lag is introduced. Note that we experimented with dynamic interval lengths based on speaker dominance periods, but found that this increased complexity without offering noticeable improvements.
Cleaning. At the end of each time period, we tokenize, stem, and remove punctuation and standard stopwords from the associated utterances. We also filter out ASR-specific terms indicating inaudible sounds, pauses, and background noise, such as {vocalsound}.
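The interval and cleaning steps above can be sketched as follows. This is a minimal illustration, not the released implementation: the stopword list, the regular expressions, and the function names `clean` and `bucket_utterances` are assumptions made for the example, and stemming is omitted to keep the block dependency-free.

```python
import re
from collections import defaultdict

T = 60.0  # interval length in seconds, as in the text

# Toy stopword list and ASR tag pattern: both are assumptions for this sketch.
STOPWORDS = {"the", "a", "is", "we", "uh", "um", "to", "of", "and"}
ASR_TAG = re.compile(r"\{[^}]*\}")  # matches tags such as {vocalsound}

def clean(text):
    """Lowercase, drop ASR tags, punctuation, and stopwords (no stemming here)."""
    text = ASR_TAG.sub(" ", text.lower())
    tokens = re.findall(r"[a-z']+", text)
    return [t for t in tokens if t not in STOPWORDS]

def bucket_utterances(utterances, T=T):
    """Assign each (start_time, text) utterance to the interval [iT, (i+1)T)."""
    intervals = defaultdict(list)
    for start, text in utterances:
        intervals[int(start // T)].extend(clean(text))
    return dict(intervals)

utterances = [
    (3.2, "Uh, we need to design the remote control {vocalsound}"),
    (58.9, "The budget is tight."),
    (61.0, "Battery life and the button layout"),
]
intervals = bucket_utterances(utterances)
print(intervals)
```

An utterance is assigned according to its start time only, so one that begins at 58.9s belongs to the first interval even if it ends after T.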

Graph building
Then, from the pre-processed text for the interval, we generate an undirected, weighted graph of words G(V, E) as in Mihalcea and Tarau (2004). Word co-occurrence networks are flexible, information-rich structures with many parameters (Tixier et al., 2016b). In the present study, the nodes V are the unique terms (unigrams) in the text, and two nodes are linked by an edge e ∈ E if the two words they represent co-occur within a sliding window of fixed size W = 3. The window spans utterance boundaries, which makes our system robust to utterance segmentation errors. Edge weights match co-occurrence counts. This step is O(|V|W) in time, which is very fast for the small graphs considered here (|V| ≈ |E| ≈ 10).
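As an illustration of the graph-building step, the sketch below constructs the weighted co-occurrence graph as a plain dictionary of dictionaries. The function name `build_graph` and the toy token list are assumptions for the example, and the tokens of all utterances in the interval are assumed to have been concatenated beforehand, so that the window spans utterance boundaries.

```python
from collections import defaultdict

def build_graph(tokens, W=3):
    """Undirected, weighted graph of words as a dict of dicts.

    Two terms are linked if they co-occur within W consecutive tokens;
    the edge weight is the number of such co-occurrences.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, u in enumerate(tokens):
        graph[u]  # touch the key so that isolated terms still appear as nodes
        for v in tokens[i + 1 : i + W]:
            if u != v:  # no self-loops
                graph[u][v] += 1
                graph[v][u] += 1
    return {u: dict(nbrs) for u, nbrs in graph.items()}

# Toy stemmed tokens (hypothetical data).
tokens = ["remot", "control", "design", "remot", "button", "control"]
G = build_graph(tokens)
print(G["remot"])
```

Storing each undirected edge in both directions doubles memory use but makes neighbor and weighted-degree lookups trivial, which is convenient for the scoring step.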

Term scoring
k-core. The k-core is one of the most fundamental constructs in network analysis. A maximal connected subgraph of G is said to be a k-core of G if each of its nodes has degree greater than or equal to k within that subgraph (Seidman, 1983). The core number of a node is the highest order of a k-core that contains this node.
k-core decomposition. We apply the generalized k-core algorithm of Batagelj and Zaveršnik (2011). Essentially, at each step this algorithm deletes the vertex of lowest degree (in the current subgraph) along with all its incident edges, which decreases the degrees of its neighbors. Note that for a weighted graph, the degree of a vertex is the sum of the weights of its incident edges. As shown in Figure 2, the output is the k-core decomposition of G, that is, the set of all its cores from 1 (G as a whole) to k_max (its main core). The k-cores form a hierarchy of nested subgraphs whose cohesiveness increases, and whose size decreases, with k.
Application to keyword extraction. As we move up the k-core hierarchy of a graph of words, we expect to find more and more keywords. The underlying assumption is that in a word co-occurrence network, centrality (as measured by PageRank, for example) is not the best "keywordness" criterion, and that it is better instead to look for nodes that are not only central but also form tightly knit substructures with other nodes, that is, nodes that are part of cohesive subgraphs (Tixier et al., 2016a).
CoreRank. Finally, we assign to each node v in the graph the sum of the core numbers of its neighbors N(v):
$$CR(v) = \sum_{u \in N(v)} core(u) \qquad (1)$$
where core(u) denotes the core number of node u. We will refer to this scoring scheme as CoreRank in the remainder of this paper. Assigning scores at the node level (rather than at the subgraph level) makes it possible to better discriminate between vertices, which makes ranking and selection easier. Also, stabilizing scores across node neighborhoods increases robustness to noise, which is particularly desirable when dealing with noisy text like speech transcriptions.
Figure 2: k-core decomposition of a graph (1-core, 2-core, and 3-core, with core numbers c = 1, 2, and 3) and CoreRank (CR) scoring scheme. While nodes * and ** have the same score (2) in terms of core numbers, one of them has a greater CR score (7 vs 5), which accurately reflects its more central position in the graph.
Complexity. Computing the k-cores is very efficient: thanks to Batagelj and Zaveršnik (2011), it can be done in O(|V| + |E| log |V|) time. Computing the CoreRank scores is also cheap, as it takes O(|E|) time. For the small graphs considered here, these steps can therefore be performed very quickly, which suits the real-time nature of our task well.
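The scoring step can be sketched as below, assuming the graph is stored as a dictionary of dictionaries mapping each node to its neighbors and edge weights. This is a simple heap-based version of the "remove the lowest-degree node" procedure described above, not the authors' code; the function names and the toy graph (a triangle with one pendant node, unit weights) are assumptions for the example.

```python
import heapq

def core_numbers(graph):
    """Weighted core numbers, obtained by repeatedly removing the node of
    lowest weighted degree in the remaining subgraph."""
    degree = {v: sum(nbrs.values()) for v, nbrs in graph.items()}
    heap = [(d, v) for v, d in degree.items()]
    heapq.heapify(heap)
    core, k = {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in core or d != degree[v]:
            continue  # stale heap entry, the degree of v has changed since
        k = max(k, d)  # core numbers never decrease along the removal order
        core[v] = k
        for u, w in graph[v].items():
            if u not in core:
                degree[u] -= w
                heapq.heappush(heap, (degree[u], u))
    return core

def core_rank(graph, core):
    """CoreRank: sum of the core numbers of each node's neighbors."""
    return {v: sum(core[u] for u in graph[v]) for v in graph}

# Toy graph: triangle a-b-c plus a pendant node d attached to a.
G = {
    "a": {"b": 1, "c": 1, "d": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1},
    "d": {"a": 1},
}
core = core_numbers(G)
cr = core_rank(G, core)
print(core, cr)
```

In this toy graph, a, b, and c all have core number 2, but a obtains a higher CoreRank score (5 vs 4) because it has one more neighbor, which is the kind of finer discrimination the node-level scores are meant to provide.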

Keyword extraction
Keyword quality function. Rather than using heuristics like in Tixier et al. (2016a) to select nodes from G (i.e., to extract tokens from the text), we frame the keyword identification problem as the maximization of a set function under a budget constraint. In particular, we define a keyword quality function f that measures not only the cumulative CoreRank score of a given set of terms S, but also the density of the subgraph they induce:
$$f(S) = \sum_{v \in S} CR(v) - \lambda h(S) \qquad (2)$$
where CR(v) is the CoreRank score of term v, λ is a trade-off parameter, and the set function h counts the number of edges that should be added to the subgraph induced by S to make it complete:
$$h(S) = \frac{|S|(|S|-1)}{2} - |E(S)| \qquad (3)$$
where |S|, resp. |E(S)|, denotes the number of vertices, resp. edges, in the subgraph induced by S. h(S) is null when S is complete (i.e., of unit density), and increases as the density of the graph decreases. Recall that a complete graph is a graph where every two nodes are linked by an edge, and that the subgraph of G(V, E) induced by a set of nodes S ⊆ V has S as its vertices and, as its edges, all the edges from E for which both endpoints belong to S.
Interpretation. The first component of f measures the extent to which a set contains nodes with high CoreRank scores, while its second term (h) adds an extra cohesiveness requirement, by biasing the selection towards a set of nodes that together form a dense subgraph. To maximize f, we want to jointly maximize its first term and minimize its second term.
Optimization task. Finding the best subset of terms S* ⊆ V to serve as keywords can be seen as a combinatorial optimization task under a budget constraint:
$$S^* = \arg\max_{S \subseteq V} f(S) \quad \text{subject to} \quad \sum_{v \in S} c_v \leq B \qquad (4)$$
where c_v is the unit cost of including term v as a keyword, and B is the budget, which we define as the number of keywords that should be returned. B can be expressed as a percentage of the total number of words in the interval, but here we consider it to be fixed.
Performance guarantees. As we prove in the extended version of this paper, our keyword quality function f is submodular, enabling Equation 4 (NP-complete) to be solved by a simple greedy algorithm with (1 − 1/e) ≈ 0.63 approximation guarantees (Nemhauser et al., 1978). Note that to benefit from these guarantees, f should also be monotone, which is not the case here. However, we invoke the fact that if |S| ≪ |V| (which holds here), the monotonicity constraint can be relaxed (Lin et al., 2009; Krause, 2008).
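A greedy selection under a budget can be sketched as follows, with f taken to be the cumulative CoreRank score minus λ times the number of missing edges, and with unit costs. This is a hedged illustration rather than the authors' implementation: the function names, the alphabetical tie-breaking, the early stop when no candidate improves f, and the toy graph and scores are all assumptions made for the example.

```python
from itertools import combinations

def h(S, graph):
    """Number of edges missing for the subgraph induced by S to be complete."""
    present = sum(1 for u, v in combinations(S, 2) if v in graph[u])
    return len(S) * (len(S) - 1) // 2 - present

def f(S, graph, cr, lam):
    """Keyword quality: cumulative score minus a density penalty."""
    return sum(cr[v] for v in S) - lam * h(S, graph)

def greedy_keywords(graph, cr, budget, lam=1.0):
    """With unit costs, add at each step the term that maximizes f."""
    S, candidates = [], set(graph)
    while candidates and len(S) < budget:
        # sorted() only makes tie-breaking deterministic (assumption)
        best = max(sorted(candidates), key=lambda v: f(S + [v], graph, cr, lam))
        if f(S + [best], graph, cr, lam) <= f(S, graph, cr, lam):
            break  # f is not monotone: stop when no term improves it (assumption)
        S.append(best)
        candidates.remove(best)
    return S

# Toy graph (triangle a-b-c plus pendant d) with hypothetical CoreRank scores.
G = {
    "a": {"b": 1, "c": 1, "d": 1},
    "b": {"a": 1, "c": 1},
    "c": {"a": 1, "b": 1},
    "d": {"a": 1},
}
cr = {"a": 5, "b": 4, "c": 4, "d": 2}
keywords = greedy_keywords(G, cr, budget=3)
print(keywords)
```

Re-evaluating f from scratch for each candidate is quadratic in |S|, which is wasteful in general but harmless for the small graphs and budgets considered here; an incremental update of h would be the natural optimization.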

Datasets
We used two datasets that are standard in the field of meeting speech processing: the AMI corpus 3 (McCowan et al., 2005) and the ICSI corpus 4 (Janin et al., 2003). These datasets contain respectively 137 and 57 meetings lasting from 10 to 70 minutes (2,400 to 19,000 words) and involving between 2 and 6 participants, whose conversations were automatically converted to text with a word error rate approaching 37%. Each meeting comes with a gold standard in the form of human-written abstractive and extractive summaries. The extractive summaries were put together by selecting the best utterances from the transcripts. In some cases, multiple summaries are available for the same meeting.

Baselines
We evaluated the performance of our system against that of 5 baselines and an Oracle, which are presented next. First, to better interpret our results and enable easy cross-comparison with other studies, we included two standard, basic baselines: (1) selecting words at random from the processed text (without replacement), and (2) selecting the most frequent words from the processed text. Within our graph-based submodular framework, we also considered the replacement of CoreRank scores with (3) weighted degree centrality (sum of the weights of the incident edges), (4) PageRank scores (Mihalcea and Tarau, 2004), and (5) RAKE scores, RAKE(v) = deg(v)/freq(v), where deg(v) is the weighted degree of term v in the graph and freq(v) its frequency in the text (Rose et al., 2010). Finally, we used as an Oracle (6) the most frequent words from the part of the extractive summary corresponding to the time interval considered. We used the same budget for all baselines, the Oracle, and our system.
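Two of the graph-based baseline scores can be sketched as follows: weighted degree, and the degree-to-frequency ratio used for RAKE. The function names and toy data are assumptions for the example; the graph is the one obtained from the toy token list with a sliding window of size 3.

```python
from collections import Counter

def weighted_degree(graph):
    """Baseline (3): sum of the weights of each node's incident edges."""
    return {v: sum(nbrs.values()) for v, nbrs in graph.items()}

def rake_scores(graph, tokens):
    """Baseline (5): weighted degree divided by term frequency."""
    freq = Counter(tokens)
    deg = weighted_degree(graph)
    return {v: deg[v] / freq[v] for v in graph}

# Toy stemmed tokens and the matching co-occurrence graph (hypothetical data).
tokens = ["remot", "control", "design", "remot"]
G = {
    "remot": {"control": 2, "design": 2},
    "control": {"remot": 2, "design": 1},
    "design": {"remot": 2, "control": 1},
}
deg = weighted_degree(G)
rake = rake_scores(G, tokens)
print(deg, rake)
```

Here "remot" has the highest weighted degree (4) but the lowest ratio (2.0), since it appears twice: dividing by frequency favors terms that co-occur widely relative to how often they appear.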

Evaluation methodology
We compared all systems under two settings.
Scenario 1. Using the traditional vector-space model, we computed the cosine similarity between the sum of the one-hot vectors of the keywords returned by a given method for a particular time interval, and the sum of the one-hot vectors of the words in the part of the extractive summary corresponding to the same interval. Results were averaged across summaries (when multiple ones were available), and finally across time intervals to compute the overall performance of the method (macro-averaging). For the random baseline, results were first averaged over 10 runs, to reduce variance. In this scenario, the method whose keywords most closely match the gold standard receives the highest score. Note that using TF-IDF weighting (rather than integer entries) did not change the rankings.
Scenario 2. For the sake of completeness, we also wanted to evaluate performance against the abstractive summaries. However, since the sentences in those summaries do not come from the transcripts but were freely written by annotators, they are not time-stamped and thus cannot be linked to any particular interval. Consequently, to allow comparison, we concatenated the keywords extracted by a given method and for a given meeting from all intervals, thus obtaining a concise keyword-based summary of the full meeting. To compute the similarity with the abstractive summaries, we then used ROUGE-1 (Lin, 2004) and the Word Mover's Distance (WMD) (Kusner et al., 2015). ROUGE-1 computes similarity based on unigram overlap, while the WMD takes into account semantic similarity between terms, and is therefore more robust to the fact that the abstractive summaries contain words that were never actually spoken. Very briefly, the WMD is the minimum cumulative Euclidean distance needed for all words in the first summary to travel (in an embedding space) to the second summary. As our embeddings, we used publicly available 5 300-dimensional vectors learned by Mikolov et al. (2013) from a 100B-word corpus (Google News). Note that since the WMD is a distance, the best performing methods are associated in that case with the lowest scores (for ROUGE, which is a measure of similarity, it is the opposite).
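The similarity used in Scenario 1 can be illustrated with the short sketch below. Summing one-hot vectors amounts to counting term occurrences, so each side is represented by a `Counter`; the function name `cosine` and the toy inputs are assumptions for the example.

```python
import math
from collections import Counter

def cosine(keywords, summary_tokens):
    """Cosine similarity between two bags of words (sums of one-hot vectors)."""
    a, b = Counter(keywords), Counter(summary_tokens)
    dot = sum(a[t] * b[t] for t in a)  # terms absent from b contribute 0
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical keywords for one interval and the matching summary words.
keywords = ["remot", "control"]
summary = ["remot", "control", "remot", "button"]
score = cosine(keywords, summary)
print(round(score, 3))
```

The per-interval scores would then be averaged across the available summaries and across intervals, as described for the macro-averaging above.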

Results
Tables 1 and 2 display the results for the first and second scenarios, respectively.
In both cases, and on both the AMI and ICSI corpora, CoreRank outperforms the baselines, sometimes by a wide margin. Overall, the Oracle reaches the best performance, which was expected since it has direct access to the gold standard. Nevertheless, the gap with the Oracle shows that there is still much room for improvement. It is worth noting, however, that on the AMI dataset, under the second scenario, CoreRank outperforms even the Oracle.
Impact of the budget. Figures 3 and 4 report the results under scenario 1, respectively for the AMI and ICSI datasets, for an increasing number of extracted keywords. The curves of the Oracle, Random and RAKE baselines were omitted for readability purposes. On both datasets, the performance of all methods increases with the number of extracted keywords, while the rankings remain stable.
Impact of h. Under the first setting and on the AMI corpus, we finally investigated how the density term (h) of our submodular function f influenced the performance of the graph-based systems. As shown in Table 3, h proved beneficial, even though the improvements were marginal. The only exception was RAKE, for which the best performance was achieved for λ = 0 (no density term). Note that the trade-off parameter λ was optimized for each method on a small development set consisting of 60 time intervals randomly drawn (without replacement) from the AMI corpus. We searched the [0, 3] interval, with uniform steps of size 10^{-3}.

Related work
To the best of our knowledge, this study is the first to investigate the extraction of keywords from meeting speech transcriptions in real-time. However, previous work has focused on offline meeting summarization. For instance, Lin et al. (2009) used a sentence semantic graph and a different submodular objective function. Habibi and Popescu-Belis (2013) used LDA and submodularity to select keywords covering as many topics as possible. Here, we assume that at most one topic can be discussed within each of our short time intervals. Closely related to our work is also that of Meladianos et al. (2015), who detected sub-events in real-time from the Twitter stream by stacking graphs of terms built from full tweets (without sliding window) and studying the evolution of core numbers over time in the overall graph. In our case, however, utterances are not self-contained pieces of information, and we do not receive them at a rate that is high enough to enable any kind of temporal analysis.

Conclusion
We presented a novel approach for real-time keyword extraction from ASR output, based on the core decomposition of networks and submodularity. Results show the superiority of our method over several baselines.
Table 3: Performance under scenario 1 and on the AMI corpus, with and without the density-based term of f.