A Sliding-Window Approach to Automatic Creation of Meeting Minutes

Meeting minutes record the subjects discussed, decisions reached and actions taken at a meeting. The importance of automatic minuting cannot be overstated. In this paper, we present a sliding-window approach to the automatic generation of meeting minutes. It aims to address issues arising from the nature of spoken text, including lengthy transcripts and the lack of document structure, which make it difficult to identify salient content for inclusion in meeting minutes. Our approach combines a sliding window with a neural abstractive summarizer to navigate the raw transcript and find salient content. The approach is evaluated on transcripts of natural meeting conversations, where we compare results obtained from human transcripts and two versions of automatic transcripts, and discuss how and to what extent the summarizer succeeds at capturing salient content.


Introduction
Meetings are ubiquitous across organizations of all shapes and sizes, and it takes tremendous effort to record the subjects discussed, decisions reached and actions taken at them. With the rise of the remote workforce, virtual meetings are more important than ever. An increasing number of video conferencing providers, including Zoom, Microsoft Teams, Amazon Chime and Google Meet, allow meetings to be transcribed (Martindale, 2021). However, without automatic minuting, consolidating notes and creating meeting minutes remains a tedious and time-consuming task for meeting participants. There is thus an urgent need for advanced techniques to better summarize and organize meeting content.
Meeting summarization has been attempted on a small scale before the era of deep learning. Previous work includes efforts to extract utterances and keyphrases from meeting transcripts (Galley, 2006; Murray and Carenini, 2008; Gillick et al., 2009), detect meeting decisions (Hsueh and Moore, 2008), compress or merge utterances to generate abstracts (Wang and Cardie, 2013; Mehdad et al., 2013), and make use of acoustic-prosodic and speaker features (Maskey and Hirschberg, 2005; Zhu et al., 2009; Chen and Metze, 2012) for utterance extraction. The continued development of automatic transcription and its easy accessibility have sparked renewed interest in meeting summarization (Shang et al., 2018; Li et al., 2019; Koay et al., 2020; Song et al., 2020; Zhu et al., 2020; Zhong et al., 2021), where neural representations are explored for this task. We believe the time is therefore ripe to reconsider approaches to automatic minuting.
It may be tempting to apply neural abstractive summarization to meetings given its remarkable recent success on summarization benchmarks, e.g., CNN/DM (See et al., 2017; Chen and Bansal, 2018; Gehrmann et al., 2018; Laban et al., 2020). However, the challenge lies not only in handling the hallucinations observed in abstractive models (Kryscinski et al., 2019; Lebanoff et al., 2019; Maynez et al., 2020) but also in the models' strong positional bias, a consequence of fine-tuning on news articles (Kedzie et al., 2018; Grenander et al., 2019). Neural summarizers also assume a maximum sequence length; e.g., Perez-Beltrachini et al. (2019) use the first 800 tokens of the document as input. With an estimated speaking rate of 122 words per minute (Polifroni et al., 1991), this means the summarizer may only process a relatively short transcript, about 5 minutes in duration.
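The back-of-the-envelope calculation behind the 5-minute figure can be sketched as follows; the tokens-per-word ratio is our own assumption, not a figure from the paper:

```python
# How many minutes of speech fit into a fixed-size model input?
MAX_TOKENS = 800          # input budget of the summarizer (Perez-Beltrachini et al., 2019)
TOKENS_PER_WORD = 1.3     # assumed subword tokenization overhead for English
WORDS_PER_MINUTE = 122    # estimated speaking rate (Polifroni et al., 1991)

max_words = MAX_TOKENS / TOKENS_PER_WORD
minutes_covered = max_words / WORDS_PER_MINUTE
print(f"{minutes_covered:.1f} minutes of speech")  # -> 5.0 minutes of speech
```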
In this paper, we instead study an extractive meeting summarizer to identify salient utterances from the transcripts. It leverages a sliding window to navigate through a transcript of any length and a neural abstractive summarizer to find salient local content. In particular, we aim to address three key questions: (1) what are suitable window and stride sizes? (2) can the abstractive summarizer effectively identify salient local content? (3) how should we consolidate local abstracts into meeting-level summaries? Our approach is intuitive and appealing, as humans make a sequence of local decisions when navigating through very long recordings. It is evaluated on transcripts of natural meeting conversations (Janin et al., 2003), where we obtained human transcripts and two versions of automatic transcripts produced by the AMI speech recognizer (Hain et al., 2006) and Google Cloud's Speech-to-Text API. 1 Our contributions in this paper are as follows.
• We study the feasibility of a sliding-window approach to automatic generation of meeting minutes that draws on a pretrained neural abstractive summarizer to make local decisions on utterance saliency. It does not require any annotated data and can be extended to meetings of various types and domains.
• We examine results obtained from human transcripts and two versions of automatic transcripts, and show that our summarizer either outperforms or performs comparably to competitive baselines under both automatic and human evaluation. We discuss how and to what extent the summarizer succeeds at capturing salient content. 2


Background: The BART Summarizer

BART (Lewis et al., 2020) has demonstrated strong performance on neural abstractive summarization. It consists of a bidirectional encoder and a left-to-right autoregressive decoder, each containing multiple layers of Transformers (Vaswani et al., 2017). The model is pretrained with a denoising objective: given a corrupted input text, the encoder learns meaningful representations and the decoder reconstructs the original text from those representations. In this study, we use BART-large-cnn as the base summarizer. It contains 12 layers in each of the encoder and decoder and uses a hidden size of 1024. The model is fine-tuned on the CNN dataset for abstractive summarization.
There are two obstacles to overcome before BART can generate meeting summaries from transcripts. First, BART is trained on written text rather than spoken text; the pretraining data contain 160GB of news, books, stories, and web text. It remains unclear whether the model can effectively identify salient content in spoken text and reduce the lead bias, which is not as prevalent in spoken text as in news writing (Grenander et al., 2019). Second, a transcript can far exceed the maximum input length of the model, which is restricted by GPU memory size. This is the case even for recent variants such as Reformer (Kitaev et al., 2020) and Longformer (Beltagy et al., 2020).

1 https://cloud.google.com/speech-to-text
2 Our transcripts and system outputs are released publicly at https://github.com/ucfnlp/meeting-sliding-window

Our Approach
A sliding-window approach to generating meeting minutes is appealing because it breaks lengthy transcripts into small and manageable local windows, allowing a set of "mini-summaries" to be produced from such windows, which are then assembled into meeting-level summaries. There are two essential decisions to be made when using a sliding window. Firstly, one must decide the size of the local window. Our window size is bounded by the maximum sequence length of BART, as the utterances in a window are concatenated into a flat sequence that serves as its input. We consider window sizes of W={128, 256, 512, 1024} tokens. Secondly, a transcript may be partitioned into non-overlapping or partially overlapping windows. We set the stride size to S={128, 256, 512, 1024} tokens to support both (W ≥ S). When window and stride are of equal size, a transcript is divided into a sequence of non-overlapping windows.
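The windowing step above can be sketched as a function over a token sequence; the function and variable names are ours, not from the paper:

```python
def sliding_windows(tokens, window_size, stride):
    """Partition a token sequence into (possibly overlapping) windows.

    With stride == window_size the windows are non-overlapping;
    with stride < window_size consecutive windows share tokens.
    """
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window already reaches the end of the transcript
    return windows

# Example: 10 tokens, window of 4, stride of 2 -> overlapping windows
toks = list(range(10))
print(sliding_windows(toks, window_size=4, stride=2))
# -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

In the paper's setting, each window's utterances would be concatenated and fed to BART to produce one local abstract.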
In Figure 1, we enumerate all 10 combinations of window and stride sizes.

Consolidation. BART abstracts generated from local windows cannot simply be concatenated to form meeting-level summaries, as they contain redundancy. When local windows partially overlap, the same content can be included in different abstracts. Instead, we identify supporting utterances for each abstract from the transcript. Specifically, we compute ROUGE-L scores between each utterance in the window and the abstract. If the utterance is longer than 5 tokens and achieves a recall score r > 0.5 and a precision score p > 0.1, we call it a supporting utterance. 3 The same utterance can support multiple abstracts. We include an utterance in the meeting summary if it is designated as a supporting utterance for at least one local abstract. This lends flexibility and eases the consolidation of the local abstractive summaries produced by BART.
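The consolidation rule can be sketched with a self-contained ROUGE-L (longest-common-subsequence) scorer. The thresholds follow the paper; the function names are ours, and we assume recall is measured against the utterance and precision against the abstract, which the paper does not spell out:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

def is_supporting(utterance, abstract, min_len=5, min_recall=0.5, min_precision=0.1):
    """Assumed reading: ROUGE-L recall w.r.t. the utterance, precision w.r.t. the abstract."""
    u, s = utterance.split(), abstract.split()
    if len(u) <= min_len:
        return False
    lcs = lcs_len(u, s)
    recall, precision = lcs / len(u), lcs / len(s)
    return recall > min_recall and precision > min_precision
```

An utterance passing this test for any local abstract would be added to the meeting-level extractive summary.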

Results
Dataset. Our experiments are performed on the ICSI meeting corpus (Janin et al., 2003), which is a challenging benchmark for meeting summarization.
The corpus contains 75 meeting recordings, each about an hour long. We use 54 meetings for training and report results on the standard test set containing 6 meetings. Each training meeting is annotated with an extractive summary. Each test meeting has three human-annotated extractive summaries, which we use as gold-standard summaries. The original corpus includes human transcripts and automatic speech recognition (ASR) output generated by the AMI ASR team (Hain et al., 2006). We generate a new version of automatic transcripts using Google's Speech-to-Text API as an off-the-shelf system. 4 Comparing results across different versions of transcripts allows us to better assess the generality of our findings.

3 The thresholds were determined heuristically on the training set by observing the resulting alignment.
Our baselines include both general-purpose extractive summarizers and meeting-specific summarizers. LexRank (Erkan and Radev, 2004) and TextRank (Mihalcea and Tarau, 2004) are graph-based extractive methods. SumBasic (Vanderwende et al., 2007) selects sentences containing frequently occurring content words. KL-Sum (Haghighi and Vanderwende, 2009) adds sentences to the summary so as to minimize KL divergence. We additionally experiment with two meeting summarizers. Shang et al. (2018) group utterances into clusters, generate an abstractive sentence from each cluster using sentence compression, then select the best elements from these sentences under a budget constraint. Koay et al. (2020) develop a supervised BERT summarizer to identify summary utterances.
We report test set results in Table 1, where system summaries are compared with gold-standard extractive summaries using ROUGE metrics (Lin, 2004). The summary length is computed as the percentage of selected utterances over all utterances of the meeting and the average number of words per test summary. This information is reported wherever available, and baseline summarizers are set to output the same number of summary utterances as the sliding-window (SW) approach. Our SW approach outperforms or performs comparably to competitive baselines when evaluated on human and ASR transcripts. We note that Koay et al. (2020) utilize a supervised BERT summarizer, whereas our SW approach is unsupervised. 5 It does not require annotated summaries and uses the training set only to determine window and stride sizes (S=128, W=1024; details later). A closer examination reveals that Google transcripts contain substantially fewer filled pauses (um, uh, mm-hmm), disfluencies (go-go-go away), repetitions and verbal interruptions. The Google service also tends to produce lengthier utterances. Table 2 provides an example comparing human, AMI and Google transcripts. The summaries produced from Google transcripts contain fewer utterances and fewer words per summary. They achieve higher precision and lower recall compared to those of AMI and human transcripts.

Figure 2: (TOP) Relative position of supporting utterances in their local windows. We find that BART tends to take summary content from the first 150-200 tokens of the input sequence. With a large window (W=1024), summary content is likely taken from the first 20% of the input. (BOTTOM) Length distribution of BART abstracts, measured by number of characters. Using windows ranging from 128 to 1024 tokens, the average abstract length increases from 281 to 332 characters, i.e., 56 to 66 words assuming 5 characters per word for English text (Shannon, 1951). Results are obtained on the ICSI training set using human transcripts.
We are curious to know where supporting utterances appear in the local windows. In Figure 2, we discretize the position information into 5 bins and plot the distributions for four settings that use different window sizes (W={128,256,512,1024}) but the same stride size (S=128). We observe that BART tends to select content from the first 150 to 200 tokens of the input and add it to the abstract. This indicates that the model exhibits a strong lead bias even for spoken text, which differs from news writing (Grenander et al., 2019). Additionally, we examine the length of BART abstracts, measured by the number of characters in an abstract. Using windows from 128 to 1024 tokens, we find that the average abstract length increases from 281 to 332 characters, i.e., approximately 56 to 66 words assuming 5 characters per word on average for English text (Shannon, 1951). While a larger window can lead to a longer abstract, the abstract size does not grow in proportion to the window size. These results are obtained on the training set using human transcripts as input.

5 We use pyrouge with default options to evaluate all summaries. The scores differ from those of Koay et al. (2020), which removed stopwords during evaluation by using the '-s' option.

Table 2: Example comparing human, AMI and Google transcripts.
                               Human   AMI    Google
# of utterances per meeting    1330    1410   188
# of words per utterance       7.7     7.0    33.0
(Human) and um There one of our diligent workers has to sort of volunteer to look over Tilman's shoulder while he is changing the grammars to English
(AMI) And um And they're one of our a The legend to work paris has to sort of volunteer to Look over time and shorter what he is changing that gram was to english
(Google) and they are one of our diligent workers has to sit or volunteer to look over two months shoulder while he is changing the Grandma's to English

Figure 3: Precision, recall and F-scores of summary utterance selection using different combinations of stride (S) and window (W) sizes. Results are obtained on the ICSI training set using human transcripts. We find that (S=128, W=1024) attains a good balance between precision and recall, whereas using small, non-overlapping windows (S=128, W=128) yields high recall because more utterances are included in the summary.

Figure 4: R-1 and R-2 scores when different combinations of stride (S) and window (W) sizes are used. Results are obtained on the ICSI training set for human transcripts. With (S=256, W=1024), we obtain balanced precision and recall scores. The best R-2 F-score is achieved with (S=128, W=1024).
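The position analysis in Figure 2 amounts to discretizing each supporting utterance's relative position within its window into 5 equal-width bins and counting; a sketch with names of our choosing:

```python
from collections import Counter

def position_bins(relative_positions, n_bins=5):
    """Count relative positions (0.0-1.0) of supporting utterances
    in n_bins equal-width bins."""
    counts = Counter()
    for p in relative_positions:
        b = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into the last bin
        counts[b] += 1
    return [counts[b] for b in range(n_bins)]

# Positions skewed toward the start of the window, mimicking lead bias
print(position_bins([0.05, 0.1, 0.15, 0.3, 0.9]))  # -> [3, 1, 0, 0, 1]
```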
In Figure 3, we investigate various combinations of stride (S) and window (W) sizes and report their precision, recall and F-scores on summary utterance selection. As before, the results are obtained on the training set using human transcripts as input. We highlight some interesting findings. We observe that a large context window (W=1024) tends to give high precision. A small window combined with a small stride yields high recall because more utterances are selected for the summary. For example, both settings (W=512, S=128) and (W=1024, S=256) allow an utterance to be visited 4 times. The former achieves a higher recall (0.395 vs. 0.239) due to its smaller window and stride sizes. In Figure 4, we show R-1 and R-2 scores obtained on the training set for all combinations of stride and window sizes. We find that recall scores decrease substantially with large stride sizes (>=512 tokens). With (S=256, W=1024), we obtain balanced precision and recall scores. The best R-2 F-score is achieved with (S=128, W=1024), which is used at test time.

Table 3: Percentage of summary utterances per system rated highly relevant (Score-2), relevant (Score-1) or irrelevant (Score-0).
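The precision, recall and F-scores over utterance selection reported in Figures 3 and 4 reduce to set comparisons between selected and gold utterance indices; a minimal sketch (names are ours):

```python
def selection_prf(selected, gold):
    """Precision/recall/F1 of selected utterance indices against gold indices."""
    selected, gold = set(selected), set(gold)
    tp = len(selected & gold)
    p = tp / len(selected) if selected else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 3 of 4 selected utterances are in the gold set of 6
p, r, f = selection_prf([1, 2, 3, 9], [1, 2, 3, 4, 5, 6])
print(round(p, 3), round(r, 3), round(f, 3))  # -> 0.75 0.5 0.6
```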
In Figure 5, we present the percentage of supporting (summary) utterances per meeting and per window for various combinations of window and stride sizes. On human transcripts, we observe that combining small stride and window sizes (S=128, W=128) leads to ∼30% of utterances being selected per meeting. In contrast, (S=128, W=1024) selects 19% of the utterances. Human transcripts and automatic transcripts generated by the AMI ASR system show similar behavior, but the Google transcriber breaks up utterances differently.
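With overlapping windows, a token in the interior of the transcript falls into W/S consecutive windows, which is where the "visited 4 times" figure for (W=512, S=128) comes from; a quick check, with names and parameters of our choosing:

```python
def window_visits(position, num_tokens, window_size, stride):
    """Number of sliding windows (starting at 0, stride, 2*stride, ...)
    that cover a given token position."""
    visits = 0
    start = 0
    while start < num_tokens:
        if start <= position < start + window_size:
            visits += 1
        start += stride
    return visits

# An interior token is covered by W/S = 512/128 = 4 windows
print(window_visits(position=2000, num_tokens=8000, window_size=512, stride=128))  # -> 4
```

Tokens near the very start or end of the transcript are covered fewer times, which is one reason small strides favor recall.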
We further conduct a human evaluation on the six test meetings. Three human evaluators (two native speakers and one non-native speaker) are employed for this task. They rate each summary utterance as highly relevant (2), relevant (1) or irrelevant (0) by matching the utterance against the meeting abstract provided by the ICSI corpus. The systems compared are SW, TextRank and the fully supervised BERT summarizer (Koay et al., 2020). In Table 3, we report the percentage of summary utterances assigned to each category (Fleiss' Kappa=0.29). Our summarizer obtains promising results. It outperforms TextRank and performs comparably to supervised-BERT. We find that the SW summarizer navigates through the transcript in an equally detailed manner, leading to coherent, if sometimes verbose, summaries compared to other extractive summaries. A snippet of the transcript and its accompanying summaries are shown in Table 4.

Table 4: Extractive summaries produced by the sliding-window approach (SW) appear to read more coherently than those of the supervised BERT summarizer. Consecutive sentences in SW summaries are more likely to be associated with the same idea/speaker compared to supervised-BERT. "Gold" are ground-truth summary utterances.
        carefully what they are doing with the program and I begin to -to work also in that.   1 0 1
fn002   But the first thing that I don't understand is that they   0 1 1
fn002   are using   0 0 1
fn002   the uh log energy that this quite -I don't know why they have some   0 1 1
fn002   constant in the expression of the lower energy. I don't know what that means.   0 1 1
me018   They have a constant in there, you said?   0 1 0
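Inter-rater agreement (the Fleiss' Kappa reported above) can be computed from a per-item category-count matrix; a self-contained sketch, where the toy ratings are ours and not the paper's data:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a matrix of shape [n_items][n_categories],
    where counts[i][j] is the number of raters assigning item i to category j.
    Assumes every item is rated by the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # Per-item observed agreement P_i, averaged into P-bar
    p_items = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement P_e from category marginals
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# Three raters, three items, full agreement spread over two categories
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # -> 1.0
```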

Conclusion
We investigate the feasibility of a sliding-window approach to generating meeting minutes and obtain promising results on both human and automatic transcripts. The approach does not require annotated data and has great potential to be extended to meetings of various domains. In the near term, our future work includes experimenting with a look-ahead mechanism that enables the summarizer to skip over insignificant transcript segments.