Keep Meeting Summaries on Topic: Abstractive Multi-Modal Meeting Summarization

Transcripts of natural, multi-person meetings differ significantly from documents like news articles, which can make Natural Language Generation models produce unfocused summaries. We develop an abstractive meeting summarizer that uses both the video and the audio of meeting recordings. Specifically, we propose a multi-modal hierarchical attention across three levels: segment, utterance and word. To narrow the focus down to topically relevant segments, we jointly model topic segmentation and summarization. In addition to traditional text features, we introduce new multi-modal features derived from visual focus of attention, based on the assumption that an utterance is more important if its speaker receives more attention. Experiments show that our model significantly outperforms the state of the art on both BLEU and ROUGE measures.


Introduction
Automatic meeting summarization is valuable, especially if it takes advantage of multi-modal sensing of the meeting environment, such as microphones to capture speech and cameras to capture each participant's head pose and eye gaze. Traditional extractive summarization methods based on selecting and reordering salient words tend to produce summaries that are unnatural and incoherent. Although state-of-the-art work (Shang et al., 2018) employs WordNet (Miller, 1995) to make summaries more abstractive, the quality is still far from that of human-written summaries, as shown in Table 1. Moreover, because these methods select salient words, they tend to have limited content coverage.
On the other hand, recent years have witnessed the success of Natural Language Generation (NLG) models in generating abstractive summaries. Since human-written summaries tend to mention the exact given keywords without paraphrasing, the copy mechanism proposed by the Pointer-Generator Network (PGN) (See et al., 2017) naturally fits this task. Apart from generating words from a fixed vocabulary, it also copies words from the input. However, transcripts of multi-person meetings differ widely from traditional documents. Instead of grammatical, well-segmented sentences, the input is often composed of ill-formed utterances. Therefore, NLG models can easily lose focus. For example, in Table 1, PGN fails to capture the keywords remote control, trendy and user-friendly.
Therefore, we propose a multi-modal hierarchical attention mechanism across topic segments, utterances, and words. We learn topic segmentation as an auxiliary task and limit the attention within each segment. Our approach mimics human summarization methods by segmenting first and then summarizing each segment. To locate key utterances, we propose that the rich multi-modal data from recording the meeting environment, especially cameras facing each participant, can provide speaker interaction and participant feedback to discover salient utterances. One typical interaction is Visual Focus Of Attention (VFOA), i.e., the target that each participant looks at at every timestamp. Possible VFOA targets include other participants, the table, etc. We estimate VFOA based on each participant's head orientation and eye gaze. The longer the other participants pay attention to the speaker, the more likely the utterance is to be important. For example, in Table 1, the high VFOA received by the speaker for the last two sentences assists in maintaining the bold keywords.

Method
Table 1

Transcript: Um I'm Sarah, the Project Manager and this is our first meeting, surprisingly enough. Okay, this is our agenda, um we will do some stuff, get to know each other a bit better to feel more comfortable with each other. Um then we'll go do tool training, talk about the project plan, discuss our own ideas and everything um and we've got twenty five minutes to do that, as far as I can understand. Now, we're developing a remote control which you probably already know. Um, we want it to be original, something that's uh people haven't thought of, that's not out in the shops, um, trendy, appealing to a wide market, but you know, not a hunk of metal, and user-friendly, grannies to kids, maybe even pooches should be able to use it.
Manual summary: The project manager gave an introduction to the goal of the project, to create a trendy yet user-friendly remote.
Extractive summary (Shang et al., 2018): hunk of metal and user-friendly granny's to kids.
Abstractive summary (See et al., 2017): The project manager opened the meeting and introduced the upcoming project to the team members.
Our approach: The project manager opens the meeting. The project manager states the goal of the project, which is to develop a remote control. It should be original, trendy, and user-friendly.

As shown in Figure 1, our meeting data consists of synchronized videos of each participant in a group meeting, as well as a time-stamped transcript of the utterances generated by Automatic Speech Recognition (ASR) tools. We formulate a meeting transcript as a list of triples $X = \{(p_i, f_i, u_i)\}$. $p_i \in P$ is the speaker of utterance $u_i$, where $P$ denotes the set of participants. $f_i$ contains the VFOA target sequence over the course of utterance $u_i$ for each participant. Each utterance $u_i$ is a sequence of words $u_i = \{w^i_0, w^i_1, \ldots\}$. The output of our model is a summary $Y$ and the segment ending boundaries $S$. The training instances for the generator are provided in the form of $T_{train} = \{(X, Y, S)\}$, and the testing instances only contain the transcripts $T_{test} = \{X\}$.
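To make the input and output formulation concrete, the following is a minimal sketch of how one instance could be stored. The class and field names (Utterance, MeetingInstance, is_training_instance) are illustrative assumptions, not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Utterance:
    """One triple (p_i, f_i, u_i) of the transcript X."""
    speaker: str                 # p_i, a member of the participant set P
    vfoa: Dict[str, List[str]]   # f_i: per-participant VFOA target sequence
    words: List[str]             # u_i: the word sequence of the utterance


@dataclass
class MeetingInstance:
    """A transcript X, optionally with the summary Y and segment ends S."""
    transcript: List[Utterance]                            # X
    summary: Optional[str] = None                          # Y (training only)
    segment_ends: List[int] = field(default_factory=list)  # S (training only)


def is_training_instance(inst: MeetingInstance) -> bool:
    # Training instances carry (X, Y, S); test instances carry X only.
    return inst.summary is not None and len(inst.segment_ends) > 0
```

Storing the per-participant target sequences in a dictionary keeps the raw frame-level estimates available, so that they can later be aggregated into a fixed-size vector per utterance.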

Visual Focus of Attention Estimation
Given the recorded video of each individual, we estimate VFOA based on each participant's head orientation and eye gaze for every frame. The VFOA targets include $F = \{p_0, \ldots, p_{|P|}, \text{table}, \text{whiteboard}, \text{projection screen}, \text{unknown}\}$. As shown in Figure 2, we feed each input color image into the OpenFace tool (Baltrusaitis et al., 2018) to estimate the head pose angles (roll, pitch and yaw) and the eye gaze direction vector (azimuth and elevation), and concatenate them into a 5-dimensional feature vector. To obtain the actual visual targets from the head pose and eye gaze estimates, we build a seven-layer network that outputs a one-hot vector indicating the most likely visual target at the current frame, where each dimension stands for a VFOA target. The network is trained on the VFOA annotation, which includes the VFOA target for each frame of each participant.
Then the outputs of all participants are concatenated. For utterance $u_i$, the VFOA vector $f_i \in \mathbb{R}^{|P| \cdot |F|}$ is the sum of each frame's VFOA outputs over the course of $u_i$, where each dimension stands for the total duration of the attention paid to the corresponding VFOA target.
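The aggregation step can be sketched as follows. This is a minimal illustration that assumes the per-frame predictions are already available as target labels; the function name vfoa_vector and the participant-major layout of the output vector are assumptions made for the example.

```python
from typing import Dict, List


def vfoa_vector(frames: Dict[str, List[str]],
                participants: List[str],
                targets: List[str]) -> List[int]:
    """Sum per-frame one-hot VFOA outputs into one |P|*|F| vector.

    frames maps each participant to the predicted target label of every
    frame within the utterance. The result concatenates one block of
    length |F| per participant; each entry counts the frames (i.e. the
    duration) that the participant spent looking at that target.
    """
    vec: List[int] = []
    for person in participants:
        counts = [0] * len(targets)
        for label in frames.get(person, []):
            counts[targets.index(label)] += 1  # add this frame's one-hot output
        vec.extend(counts)
    return vec
```

For example, with participants A and B and targets (A, B, table, unknown), frames in which A looks at B twice and at the table once, while B looks at A once and at an unknown target once, give the vector [0, 2, 1, 0, 1, 0, 0, 1].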

Meeting Transcript Encoder
For an utterance $u_i = \{w^i_0, w^i_1, \ldots\}$, we embed each word $w^i_j$ using pretrained GloVe embeddings (Pennington et al., 2014), and apply a bidirectional gated recurrent unit (GRU) (Cho et al., 2014) to obtain the encoded word representation $h^i_j$. Each utterance representation is the average of its encoded word representations. Additionally, the speaker $p_i$ is encoded into a one-hot vector $p_i \in \mathbb{R}^{|P|}$.
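The pooling and speaker encoding described above can be illustrated with a short sketch. It assumes the encoded word representations have already been produced by the bidirectional GRU; the helper names mean_pool and one_hot are hypothetical.

```python
from typing import List


def mean_pool(word_states: List[List[float]]) -> List[float]:
    """Average the encoded word vectors of one utterance, dimension by dimension."""
    n = len(word_states)
    return [sum(column) / n for column in zip(*word_states)]


def one_hot(index: int, size: int) -> List[float]:
    """Encode a speaker index as a one-hot vector of length |P|."""
    return [1.0 if i == index else 0.0 for i in range(size)]
```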

Topic Segmentation Decoder
We divide the input sequence into contiguous segments based on SegBot (Li et al., 2018). Its decoder takes the starting utterance of a segment as input at each decoding step, and outputs the ending utterance of the segment. Taking Figure 3 as an example, there are five utterances in the transcript. The initial starting utterance is $u_0$, with possible ending positions from $u_0$ to $u_4$; if $u_2$ is detected as the ending utterance, then $u_3$ is the next starting utterance and is input to the decoder, with possible ending positions from $u_3$ to $u_4$.
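The decoding procedure can be sketched as a simple loop. This is only an illustration of the control flow: pick_end is a hypothetical stand-in for the trained decoder, which in the actual model chooses the ending utterance from learned representations.

```python
from typing import Callable, List


def decode_segments(num_utterances: int,
                    pick_end: Callable[[int, int], int]) -> List[int]:
    """Return the ending index of each contiguous segment.

    pick_end(start, last) must return an index in [start, last]; it plays
    the role of one decoding step that points at the ending utterance.
    """
    ends: List[int] = []
    start = 0
    last = num_utterances - 1
    while start <= last:
        end = pick_end(start, last)
        if not start <= end <= last:
            raise ValueError("ending position must lie in [start, last]")
        ends.append(end)
        start = end + 1  # the next segment starts right after this one ends
    return ends
```

With five utterances and a decoder that answers 2 for the start 0 and 4 for the start 3, the loop returns the boundaries [2, 4], matching the walk-through above.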
We extend SegBot to obtain the distribution over possible positions $j \in \{i, i+1, \ldots\}$ by using a multi-modal segmentation attention, where $d_i$ denotes the decoded utterance of the starting utterance $u_i$. Let $s_i$ denote the ending utterance of the segment that starts with the utterance $u_i$; the probability for $u_j$ to be the ending utterance $s_i$ is given by this attention distribution over the candidate positions.
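As a rough sketch of how such a distribution over ending positions can be formed, the snippet below normalizes attention scores with a softmax restricted to positions at or after the starting utterance. The softmax normalization follows the usual pointer-network formulation and is an assumption here, as is the function name end_distribution; how the scores themselves are computed from the multi-modal features is not reproduced.

```python
import math
from typing import List


def end_distribution(scores: List[float], start: int) -> List[float]:
    """Softmax over candidate ending positions j in {start, start + 1, ...}.

    scores[j] is the (assumed given) attention score between the decoded
    starting utterance and utterance j. Positions before start cannot end
    the segment and therefore receive probability 0.
    """
    candidates = scores[start:]
    peak = max(candidates)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in candidates]
    total = sum(exps)
    return [0.0] * start + [e / total for e in exps]
```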

Meeting Summarization Decoder
We build our decoder on the Pointer-Generator Network (PGN) (See et al., 2017), which copies words from the input transcript according to the attention distribution. Unlike PGN, we introduce a hierarchical attention mechanism based on the topic segmentation results, as shown in Figure 4. As VFOA has close ties to salient utterances, we use the VFOA received by the speaker, $f_k p_k$, to capture the importance of utterance $u_k$, where $p_k$ is a vector indicating which dimensions' VFOA target is the speaker $p_k$. Formally, we use a GRU to obtain the decoded hidden state $d_i$ for the $i$-th input word.
The attention is computed at three levels. The Utterance2Word attention assigns a score $e_{ij}$ to each word $w_j$ of the utterance $u_k$, and the context representation for the utterance $u_k$ is $u_{ik} = \mathrm{Softmax}(e_{ij})\, w_j,\ w_j \in u_k$. The Segment2Utterance attention assigns a score $e_{ik}$ to each utterance $u_k$ in the input transcript, and the context representation for segment $s_q$ is $c_{iq} = \mathrm{Softmax}(e_{ik})\, u_k,\ u_k \in s_q$. The Meeting2Segment attention is then computed over the segments. The hierarchical attention of $w_j$ is calculated within the utterance $u_k$ and then within the segment $s_q$. The probability of generating $y_i$ follows the decoder in PGN (See et al., 2017), and $\alpha^{sum}_{ij}$ is the attention in the decoder for copying words from the input sequence.
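One plausible way to combine the three attention levels into a single distribution over input words is sketched below. It assumes that the levels are combined multiplicatively, with each softmax restricted to its enclosing unit; this is an illustrative reading and not necessarily the exact formulation of the model. All scores are taken as given, so the way the VFOA term enters the utterance-level scores is not shown, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(xs: List[float]) -> List[float]:
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def hierarchical_attention(word_scores: List[List[List[float]]],
                           utt_scores: List[List[float]],
                           seg_scores: List[float]) -> List[float]:
    """Flattened word-level attention for one decoding step.

    word_scores[q][k][j]: score of word j in utterance k of segment q
    utt_scores[q][k]:     score of utterance k within segment q
    seg_scores[q]:        score of segment q within the meeting
    Each word's weight is the product of its softmax weight inside its
    utterance, its utterance's weight inside the segment, and the
    segment's weight inside the meeting (an assumed combination).
    """
    seg_w = softmax(seg_scores)
    flat: List[float] = []
    for q, segment in enumerate(word_scores):
        utt_w = softmax(utt_scores[q])
        for k, utterance in enumerate(segment):
            word_w = softmax(utterance)
            flat.extend(seg_w[q] * utt_w[k] * w for w in word_w)
    return flat
```

Because every level is normalized within its parent unit, the flattened weights sum to one over the whole transcript, so they can serve directly as a copy distribution.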

Joint End-to-End Training
The summarization task and the topic segmentation task are trained jointly with a loss function defined on $P(Y, S \mid X)$, the conditional probability of the summary $Y$ and the segments $S$ given the input meeting transcript $X = \{(p_i, f_i, u_i)\}$. Here, $y_i$ is one token in the ground truth summary, and $s_j$ denotes the ending boundary of the segment that starts with $u_j$.
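A common way to train on such a joint probability is to minimize its negative log-likelihood. The sketch below assumes that the joint probability factorizes into per-token summary probabilities and per-segment boundary probabilities, and that the two terms are added without weighting; both are assumptions made for illustration, and joint_nll is a hypothetical name.

```python
import math
from typing import List


def joint_nll(token_probs: List[float], boundary_probs: List[float]) -> float:
    """Negative log-likelihood of a summary and its segment boundaries.

    token_probs[i]:    model probability of the gold summary token y_i
    boundary_probs[j]: model probability of the gold ending boundary s_j
    """
    summary_loss = -sum(math.log(p) for p in token_probs)
    segment_loss = -sum(math.log(p) for p in boundary_probs)
    return summary_loss + segment_loss
```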

Experiments
Our experiments are conducted on the widely used AMI Meeting Corpus (Carletta et al., 2005). This corpus is about a remote control design project from kick-off to completion. Each meeting lasts 30 minutes and contains four participants: a project manager, a marketing expert, an industrial designer, and a user interface designer. We follow the conventional approach (Shang et al., 2018) in the meeting analysis literature to preprocess and divide the dataset into training (97 meetings), development (20 meetings) and test sets (20 meetings). One meeting in the test set does not provide videos and is thus ignored. The ASR transcripts are provided in the dataset (Garner et al., 2009), and are manually revised based on the automatically generated ASR output. Each meeting has a summary containing about 300 words and 10 sentences. Each meeting is also divided into multiple segments focusing on various topics. The ASR transcripts and the videos recorded for all participants are the input of the model. We use manual annotations of summaries and topic segments for training, while they are generated automatically during testing. The VFOA estimation model is trained separately on the VFOA annotation of 14 meetings in the dataset, and achieves 64.5% prediction accuracy.
The baselines include: (1) the state-of-the-art extractive summarization method CoreRank (Shang et al., 2018), and (2) the neural generation model PGN (See et al., 2017). We adopt two standard metrics, ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002), for evaluation. Additionally, to show the impact of VFOA, we remove the VFOA features as an additional baseline and conduct significance testing. According to a t-test, the differences on ROUGE and BLEU are statistically significant (P value ≤ 0.09), except for BLEU 4 (P value = 0.27).
Compared to the abstractive method PGN in Table 2, the multi-modal summarizer achieves a larger improvement on ROUGE than on BLEU. This demonstrates our approach's ability to focus on topically related words. For example, 'The marketing expert discussed his findings from trend watching reports, stressing the need for a product that has a fancy look and feel, is technologically innovative...' is generated by our model, while PGN generates 'the marketing expert discussed his findings from trend watching reports'. The speaker receives higher VFOA from participants while speaking the utterances that contain these keywords. To demonstrate the effectiveness of VFOA attention, we rank the utterances in terms of VFOA, and achieve 45.8% accuracy in selecting salient utterances based on the annotation of Shang et al. (2018) 2. Therefore, the model learns that when the speaker receives higher VFOA, the utterances of that speaker are more important.
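The ranking by received VFOA can be sketched as follows. The sketch assumes a participant-major layout of the VFOA vector in which the first $|P|$ targets of each block are the participants themselves, and it excludes the speaker's own gaze; both choices, as well as the function names, are assumptions made for illustration.

```python
from typing import List, Tuple


def attention_received(vfoa_vec: List[float], speaker: int,
                       num_participants: int, num_targets: int) -> float:
    """Total duration for which the other participants looked at the speaker."""
    total = 0.0
    for observer in range(num_participants):
        if observer != speaker:
            # Entry for "observer looks at speaker" in the observer's block.
            total += vfoa_vec[observer * num_targets + speaker]
    return total


def rank_by_vfoa(utterances: List[Tuple[int, List[float]]],
                 num_participants: int, num_targets: int) -> List[int]:
    """Return utterance indices sorted by received VFOA, highest first."""
    scores = [attention_received(vec, speaker, num_participants, num_targets)
              for speaker, vec in utterances]
    return sorted(range(len(utterances)), key=lambda i: scores[i], reverse=True)
```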
Moreover, topic segmentation also contributes to better coverage of salient words, as demonstrated by the improvement on ROUGE metrics of the model without VFOA features. Each meeting is divided into six to ten segments, with special focuses on topics such as 'openings', 'trend watching', 'project budget' and 'user target group'. With the topic segmentation results, the utterances within the same segment are more correlated, and topically related words tend to be frequently mentioned. For example, 'fancy look' is more important within the 'trend watching' segment than within the whole transcript.
The VFOA distribution is highly correlated with topic segmentation. For example, the project manager pays more attention to the user interface designer in the 'trend watching' segment, but focuses more on the marketing expert in another segment about 'project budget'. Therefore, the VFOA feature not only benefits the summarization decoder, but also improves the performance of topic segmentation. The topic segmentation accuracy is 57.74% without the VFOA feature, and 60.11% with the VFOA feature in the segmentation attention.
Compared to the extractive method CoreRank in Table 2, our BLEU scores are doubled, which demonstrates that the abstractive summaries are more coherent and natural. For example, the extractive summaries are often incomplete sentences, such as 'prefer a design where the remote control and the docking station', whereas the abstractive summaries are well-organized sentences, such as 'The remote will use a conventional battery and a docking station which recharges the battery'. Also, the improvement on ROUGE 2 and ROUGE L is larger than that on ROUGE 1, which shows the superiority of abstractive methods in maintaining longer terms such as 'corporate website'.

Related Work
Extractive summarization methods rank and select words by constructing word co-occurrence graphs (Mihalcea and Tarau, 2004; Erkan and Radev, 2004; Lin and Bilmes, 2010; Tixier et al., 2016b), and they have been applied to meeting summarization (Liu et al., 2009, 2011; Tixier et al., 2016a; Shang et al., 2018). However, extractive summaries are often unnatural and incoherent, with limited content coverage. Recently, neural natural language generation models have boosted the performance of abstractive summarization (Luong et al., 2015; Rush et al., 2015; See et al., 2017), but they are often unable to focus on topic words. Inspired by utterance clustering in extractive methods (Shang et al., 2018), we propose a hierarchical attention based on topic segmentation (Li et al., 2018). Moreover, our hierarchical attention is multi-modal, narrowing down the focus by capturing participant interactions. Multi-modal features from human annotations, such as dialogue acts (Goo and Chen, 2018), have been proven effective at improving summarization. Instead of using human annotations, our approach utilizes an easily detectable multi-modal feature, VFOA.
2 https://bitbucket.org/dascim/acl2018_abssumm/src/master/data/meeting/ami

Conclusions and Future Work
We develop a multi-modal summarizer to generate natural language summaries for multi-person meetings. We present a multi-modal hierarchical attention mechanism based on VFOA estimation and topic segmentation, and the experiments demonstrate its effectiveness. In the future, we plan to further integrate higher-level participant interactions, such as gestures and facial expressions. We also plan to construct a larger multimedia meeting summarization corpus to cover more diverse scenarios, building on our previous work (Bhattacharya et al., 2019).