How Domain Terminology Affects Meeting Summarization Performance

Meetings are essential to modern organizations. Numerous meetings are held and recorded daily, more than can ever be comprehended. A meeting summarization system that identifies salient utterances from the transcripts to automatically generate meeting minutes can help. It empowers users to rapidly search and sift through large meeting collections. To date, the impact of domain terminology on the performance of meeting summarization remains understudied, even though meetings are rich in domain knowledge. In this paper, we create gold-standard annotations for domain terminology on a sizable meeting corpus; they are known as jargon terms. We then analyze the performance of a meeting summarization system with and without jargon terms. Our findings reveal that domain terminology can have a substantial impact on summarization performance. We publicly release all domain terminology to advance research in meeting summarization.


Introduction
A vast number of meetings are being held and recorded every day, far more than can ever be comprehended. With this explosion of meetings comes a pressing need to develop summarization techniques to assist in browsing meeting archives (Carletta et al., 2006; Ailomaa et al., 2006). A meeting summarization system takes a meeting recording and its transcript as input and produces a concise text summary as output, which preserves the most important content of the meeting discussion (Murray and Carenini, 2008; Shang et al., 2018; Li et al., 2019). These techniques hold great potential to make large archives of meetings substantially more efficient to browse and search, and to facilitate information sharing.
We envision an automated summarizer that is capable of generating meeting minutes by identifying salient utterances from transcribed meeting recordings. Neural text summarization has seen significant progress (See et al., 2017; Tan et al., 2017; Chen and Bansal, 2018; Narayan et al., 2018; Lebanoff et al., 2018; West et al., 2019; Liu and Lapata, 2019; Laban et al., 2020), but most prior work focused on written texts. In contrast, recent years have seen a growing interest in summarizing spoken texts (Tardy et al., 2020). Particularly, the characteristics of meetings, domain terminology and limited annotated data pose novel challenges to neural summarization models. We favor extractive over abstractive models as the latter are prone to hallucinate content that is unfaithful to the input (Kryscinski et al., 2019).
In this paper, we investigate how domain terminology impacts meeting summarization performance, especially in the context of neural extractive summarization. Jargon is the specialized terminology associated with a particular domain (Meyers et al., 2014). It is employed in a communicative context and may not be well understood outside that context. Because meetings are usually held among professionals, jargon is ubiquitous in meeting discussions. In Table 1, we provide an example of jargon terms identified by human experts. Without a thorough study of technical jargon in the meeting domain, it is unclear how best to optimize a meeting summarizer to incorporate domain knowledge.
We present an assessment of meeting summarization performance by comparing models trained with and without jargon. A collection of jargon terms is meticulously compiled by our expert annotators from a meeting corpus containing multi-party conversations on the topic of speech and signal processing (Janin et al., 2003). Such jargon terms are distinct from speech recognition errors; the latter substitute one word for another similar-sounding word during automatic transcription. Users can eliminate transcription errors using a modern interactive transcript editor. However, there remains a pressing need to understand how domain terminology affects meeting summarization performance. Our contributions are twofold. First, we create gold-standard annotations for domain terminology on a large meeting corpus; they are known as jargon terms. Prior work has not explored such domain-specific thesauri and thus there is limited knowledge of the target domain. Second, we analyze the performance of a meeting summarization system with and without jargon. Due to the nature of sound, such a summarizer is highly desirable to aid users in navigating through meeting recordings. Our findings suggest that domain terminology has a substantial impact on summarization performance, which should not be overlooked.
Table 1 (timestamped transcript excerpt):
.520  and um you know separated -used the individual channels we segmented it in-into the segments that Jane had used
277.520 279.952  and uh Don sampled that so -
281.374 289.611  um and then we ran up to I guess the first twenty minutes, up to synch time of one two zero zero so is that -that's twenty minutes or so?
289.611 296.601  Um yeah because I guess there's some, and Don can talk to Jane about this, there's some bug in the actual synch time file that

Data and Annotation
We extend the ICSI meeting corpus (Janin et al., 2003) for this study, which contains 75 meetings recorded at the International Computer Science Institute, Berkeley. The meetings are primarily between speech group members of ICSI. An average meeting lasts an hour and has up to 10 participants. Each participant wore a close-talking microphone and they sat around a meeting table equipped with far-field microphones. The corpus is one of the larger resources in this area (Renals et al., 2012). It contains rich annotations including human transcripts, segmentation of utterances and further annotations of extractive summaries, making the corpus suitable for summarization. We have chosen ICSI over the AMI corpus (Carletta et al., 2006); both are natural conversations, but the scenarios in AMI meetings are artificial.
Annotating domain terminology is non-trivial because there is no universal definition. Instead, we solicit annotations from undergraduate students majoring in computer science and designate words and expressions that are beyond the scope of their knowledge as domain terminology. Interestingly, modern deep neural models often acquire such generic knowledge through unsupervised pretraining (Lewis et al., 2020). The annotators are instructed to identify words and expressions from human transcripts; they are called jargon terms and usually have a particular meaning in the speech and language processing field.
The student annotators are able to annotate all 75 meetings for jargon terms. Meeting transcripts are substantially longer than typical news articles. A transcript contains 1,731 utterances on average and 7 words per utterance. Each meeting is annotated by one student due to the sheer size of the transcripts. However, one of the meetings has been annotated by all four annotators. Their average pairwise inter-annotator agreement is 0.69, indicating moderate to high agreement between the annotators. We find that an average meeting contains 92 jargon expressions and each expression contains about 3 words. Jargon terms are observed in 5.2% of the utterances; when short utterances containing fewer than 5 words are removed from consideration, the percentage rises to 11.6%. Our collection of domain terminology will be a valuable resource to investigate a variety of research questions regarding domain adaptation. Importantly, if a summarizer performs better when jargon terms are excluded, it indicates that domain terminology may have only limited impact on determining utterance salience, or that the summarizer has been ineffective in using domain knowledge. Conversely, if the summarizer performs less well, domain terminology is considered essential and it is important for speech recognizers to correctly transcribe these terms to avoid any loss in summarization performance. In what follows, we describe our meeting summarizer and examine how domain terms are processed by a modern tokenizer.
Table 3: Results of our summarizer on the ICSI test set. We report ROUGE scores for our summarizer, with and without using jargon, and contrast it with strong baseline systems (Xie and Liu, 2010; Shang et al., 2018). Experimental results on human transcripts and speech recognition outputs (ASR) suggest that our model performs on par with prior state of the art.

Meeting Summarization
Jargon Term                     Tokenization
SmartKom system                 smart-ko-m system
discourse annotations           discourse ann-ota-tions
situational context factors     situation-al context factors
modifiers, auxiliaries          mod-ifiers , aux-ilia-ries
JavaBayes belief-net            java-bay-es belief -net
a real wizard system            a real wizard system
the L_D_C                       the l _ d _ c
the near field mikes            the near field mike-s
Utterance with Jargon:          she wanted to display the stylized F_ zeroes, I think they're called?
Utterance without Jargon:       she wanted to display the [MASK] I think they're called?
Table 2: An example showing how jargon terms are processed by a modern tokenizer, WordPiece. E.g., smart-ko-m means the jargon SmartKom was split into three tokens. Moreover, our method allows jargon to be masked-out of the utterances for summarization.
The first step in building a meeting summarizer is tokenization, which transforms an input utterance into a sequence of sub-word units. WordPiece (Schuster and Nakajima, 2012) and BPE (Sennrich et al., 2016) are two modern methods for tokenization. We use WordPiece with a vocabulary of 30,522 sub-words. The method builds a vocabulary of the desired size by iteratively combining word parts into a sub-word if doing so increases the language model likelihood.
Given the vocabulary and any input word, WordPiece uses a greedy longest-match-first algorithm to tokenize it into sub-word units; the longest sub-word will be matched first. We provide example tokenization outputs in Table 2. We show that most domain terminology can be properly processed by the WordPiece tokenizer. There are two immediate issues that require attention. First, the tokenizer has considerable difficulty processing infrequent entities and terms; e.g., smart-ko-m, java-bay-es and ann-ota-tions are not well tokenized. Second, entities such as "LDC" are spelled out in the transcripts (L_D_C), and the tokenizer transforms them into three individual letters, thus losing the original meaning.
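The greedy longest-match-first step described above can be illustrated with a minimal sketch. It is not the tokenizer used in our experiments: the function name wordpiece_tokenize and the toy vocabulary are hypothetical, the input is assumed to be a single lowercased word, and continuation pieces carry the "##" prefix used by BERT-style WordPiece vocabularies (Table 2 renders piece boundaries with hyphens instead).

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word units, always taking the longest match first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is in the vocabulary: the word is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (an assumption for illustration, not the 30,522-entry vocabulary).
toy_vocab = {"smart", "##ko", "##m", "system", "mike", "##s"}

print(wordpiece_tokenize("smartkom", toy_vocab))  # ['smart', '##ko', '##m']
print(wordpiece_tokenize("mikes", toy_vocab))     # ['mike', '##s']
```

With this toy vocabulary the sketch reproduces the splits shown for SmartKom and mikes in Table 2; a rare term fragments into short pieces precisely because no longer sub-word covering it exists in the vocabulary.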
Our meeting summarizer takes as input an utterance and outputs a binary label indicating if the utterance should be included in the summary. Due to data scarcity, we refrain from using sequential prediction or a more sophisticated approach that may overfit, but focus primarily on demonstrating the impact of domain terminology on model performance. Our summarizer is based on BERT-LARGE that contains 24 layers of Transformer blocks, 16 attention heads and 1024-dimensional hidden vectors (Devlin et al., 2019). The top-layer hidden vector of the [CLS] token is used as the representation of the input utterance. We apply a linear and a softmax layer to predict a binary label. Importantly, jargon terms can be masked-out of the input utterances by replacing each term with a [MASK] token prior to training (Table 2). The method thus employs a single architecture to assess model performance with and without jargon. We train the summarizer on 38,657 utterances from 54 meetings; each meeting was annotated by a single annotator. Utterances containing fewer than 5 words are removed from consideration. The summarizer is evaluated on the standard test set containing 6 meetings; each of these meetings has been annotated by three annotators (Carenini et al., 2011). Our experiments are performed on human transcripts and ASR outputs, respectively; the latter are acquired from the SRI speech recognizer. In the following, we discuss our findings in terms of how domain terminology affects summarization.
Table 4: Results of our meeting summarizer, using jargon or not, while varying the length of output summaries. Columns: summary length; then, for each of the Without Jargon and With Jargon settings, classifier P(%), R(%), F(%) and summarizer R-1, R-2, BERTScore.
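The masking of jargon terms prior to training can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact preprocessing: the helper name mask_jargon is hypothetical, matching is assumed to be exact and case-sensitive on the transcript text, and longer terms are replaced first so that a multi-word term is not pre-empted by a shorter term it contains.

```python
import re


def mask_jargon(utterance, jargon_terms, mask_token="[MASK]"):
    """Replace every annotated jargon term in an utterance with a mask token."""
    # Process longer terms first (assumption: nested terms should yield one mask).
    for term in sorted(jargon_terms, key=len, reverse=True):
        # Only match whole terms, not fragments inside a longer word.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        utterance = re.sub(pattern, lambda _: mask_token, utterance)
    return utterance


utterance = "she wanted to display the stylized F_ zeroes, I think they're called?"
print(mask_jargon(utterance, ["stylized F_ zeroes"]))
# she wanted to display the [MASK], I think they're called?
```

In this sketch the comma after the term is kept because it is not part of the hypothetical term string; whether adjacent punctuation belongs to an annotated span depends on the annotation itself.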

Results and Analysis
Our experimental results are presented in Table 3. We evaluate against two strong baseline systems. Xie and Liu (2010) describe an extractive meeting summarizer utilizing maximum marginal relevance and speech-specific features. Shang et al. (2018) introduce a graph framework to group utterances into clusters, perform multi-sentence compression, and then select under a budget constraint. Our experiments show that, despite its simplicity, our meeting summarizer can outperform or perform on par with prior state of the art, showing a remarkable advancement of pretrained deep models in the meeting domain.
We observe that summarizing with jargon terms yields substantially better performance (an absolute gain of +4.3% R-2 F-score) on human transcripts, compared to the alternative that masks jargon out of input utterances. The performance gap narrows on ASR transcripts, as domain terminology contains infrequent entities and terms, which are subject to transcription errors. Our findings suggest that domain terminology plays a significant role in determining utterance salience. Its impact on summarization and other downstream meeting applications should not be underestimated.
In Table 4, we assess the model performance on human transcripts, using jargon or not during training, and generate output summaries of varying length. We rank the utterances by their confidence scores and select a portion of them. Gold uses the length of ground-truth summaries. We show the precision, recall and F-scores of our classifier, as well as ROUGE (Lin, 2004) and BERTScore (Zhang et al., 2020) for summaries. We find that across all lengths and evaluation metrics, summarizing with jargon can lead to a performance boost for meeting summarization. While this work has primarily experimented with the ICSI corpus, the results are sufficiently substantial that we expect them to hold over similar meeting corpora.
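The ranking-and-selection step can be sketched as below. This is an illustrative sketch only: the function name select_top_k is hypothetical, the summary length is assumed to be measured as a number of utterances k, and the selected utterances are assumed to be returned in their original transcript order.

```python
def select_top_k(utterances, scores, k):
    """Keep the k utterances with the highest confidence scores, in transcript order."""
    if len(utterances) != len(scores):
        raise ValueError("each utterance needs exactly one confidence score")
    # Rank utterance indices by confidence, highest first (ties keep transcript order).
    ranked = sorted(range(len(utterances)), key=lambda i: scores[i], reverse=True)
    # Restore transcript order among the selected utterances.
    chosen = sorted(ranked[:k])
    return [utterances[i] for i in chosen]


utterances = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.4, 0.8]  # made-up confidence values for illustration
print(select_top_k(utterances, scores, 2))  # ['b', 'd']
```

Setting k to the number of utterances in the ground-truth summary would correspond to the Gold length setting under these assumptions.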

Related Work
Generating meeting summaries is a challenging problem with great application potential. A significant number of techniques have been attempted in the past, including extraction of utterances and keyphrases from transcripts (Galley, 2006; Murray and Carenini, 2008; Gillick et al., 2009) and taking advantage of prosodic and speaker-related features (Maskey and Hirschberg, 2005; Zhu et al., 2009; Chen and Metze, 2012). As spoken utterances are verbose with low information density, some methods further compress and merge utterances (Liu and Liu, 2013; Wang and Cardie, 2013; Mehdad et al., 2013). Despite these valuable contributions, a closer investigation remains necessary to develop an understanding of how domain terminology affects meeting summarization performance.
Recent years have seen a renewed interest in summarizing meeting transcripts (Shang et al., 2018; Zhu et al., 2020; Tardy et al., 2020) and other types of online and transcribed conversations (Goo and Chen, 2018; Yuan and Yu, 2020; Gliwa et al., 2019). In particular, Tardy et al. (2020) create a corpus containing 22 public meetings including their automatic transcriptions from audio recordings and meeting reports written by a professional. Li et al. (2019) develop a multi-modal hierarchical attention mechanism for abstractive summarization, where attention is applied to topics, utterances and words to narrow the focus to salient content; their experiments were performed on the AMI corpus, thus results are not directly comparable. Our work excludes prosodic and speaker-related features to focus solely on domain terminology. It provides a new baseline for future research toward building effective meeting summarizers.

Conclusion
We seek to better understand how domain terminology impacts meeting summarization performance in the context of neural extractive summarization. We solicit quality annotations from expert annotators to compile a list of jargon terms from a sizable meeting corpus, which is a valuable resource to investigate a variety of research questions regarding domain adaptation. Our extensive experiments show that domain terminology has a substantial impact on summarization performance that should not be neglected. Future work may address the questions of how to obtain domain terminology in a semi-automatic way and inject domain knowledge into a meeting summarization system.