Personalized Multimodal Feedback Generation in Education

Automatic feedback on school assignments is an important application of AI in education. In this work, we focus on the task of personalized multimodal feedback generation, which aims to generate personalized feedback that helps teachers evaluate students' assignments involving multimodal inputs such as images, audio, and text. This task involves the representation and fusion of multimodal information as well as natural language generation, and it presents challenges from three aspects: (1) how to encode and integrate multimodal inputs; (2) how to generate feedback specific to each modality; and (3) how to fulfill personalized feedback generation. In this paper, we propose a novel Personalized Multimodal Feedback Generation Network (PMFGN) equipped with a modality gate mechanism and a personalized bias mechanism to address these challenges. Extensive experiments on real-world K-12 education data show that our model significantly outperforms baselines by generating more accurate and diverse feedback. In addition, detailed ablation experiments are conducted to deepen our understanding of the proposed framework.


Introduction
In recent years, oral presentations have become a popular form of assignment in online K-12 education (Liu et al., 2020b). An oral presentation assignment requires students to answer a question or explain a concept verbally. It can simultaneously test students' oral expression skills, language organization abilities, and understanding of the topic itself. Oral assignments are usually submitted in video format, which involves multimodal information. For example, given a math question "Please describe the procedure of finding the greatest common divisor of two integers", a student is asked to record a video in which he or she presents the answer to teachers. The video contains information in the modalities of image (pictures of the video), audio (voice of the student), and text (transcribed speech). To evaluate such oral presentation assignments, teachers on the online education platform provide short textual feedback. An example is shown in Figure 1. Feedback may involve different aspects of the modalities, such as the clarity of the video, the fluency of the voice, and the relevance between the answer and the question. Meanwhile, different teachers tend to write feedback in their own language styles. In order to lighten the workloads of teachers and improve the efficiency of online teaching, in this paper, we aim to automatically generate personalized feedback for multimodal oral presentation assignments based on the language styles of various teachers. The task, known as personalized multimodal feedback generation, is an important but rarely touched application of AI in education (Liu et al., 2020a).
Figure 1: An example of the multimodal feedback generation task. Given the same multimodal inputs, teachers may provide different feedback (e.g., "You did a great job, but the fluency should be improved.").

One traditional solution to this task is to construct a pipeline-based system. First, raw information from different modalities is passed into a series of pre-trained models for image recognition, speech fluency evaluation, and text relevance assessment. Then, according to some manually designed strategies, a piece of pre-written feedback is retrieved from a repository. The main limitations of such pipeline methods are: (1) the terminal supervisory signals (the feedback texts in our case) cannot be propagated to upstream
modules, so we have to use extra labeled data to train them, and domain knowledge may be required; (2) different modules depend on each other, so errors from upstream modules may directly lead to errors in downstream modules (Zhao and Eskénazi, 2016); and (3) it is hard to achieve personalization if we only rely on a limited feedback repository. To address the above shortcomings of the traditional solution and build a more practical end-to-end educational feedback generation system, we face unique challenges. First, since the input is composed of information from multiple modalities, it is challenging to represent and fuse multimodal information for feedback generation (Baltrusaitis et al., 2019). Moreover, to help students better understand the feedback, the generated content should be specific to each modality. In addition, teachers have different styles, and it is desirable to imitate the language styles of various teachers and generate personalized feedback.
To address the above challenges, in this work, we propose a novel deep learning architecture named Personalized Multimodal Feedback Generation Network (PMFGN), which can be trained in an end-to-end manner. Although there exist some end-to-end multimodal text generation approaches (Kiros et al., 2014; Mao et al., 2015; Hori et al., 2017), most of them cannot handle inputs in the form of image, audio, and text simultaneously. To the best of our knowledge, our model with its novel network structure is the first one designed for the task of multimodal feedback generation. In our proposed framework, we introduce a modality gate with a hierarchical attention mechanism that enables the model to generate different parts of the feedback based on the evaluation of different modalities. Meanwhile, we design a personalized bias mechanism to encourage the model to generate personalized feedback. Experiments have been conducted on a real-world K-12 oral presentation dataset. The results verify the superiority of our proposed model according to various evaluation metrics. Compared with several baselines, our method can provide more precise, reasonable, and diverse feedback. We summarize our major contributions as follows:
• We propose a modality gate with a hierarchical attention mechanism to enable multimodal data integration and feedback generation specific to each modality.
• We introduce a novel personalized bias mechanism to realize personalized feedback generation.
• We build a novel PMFGN model for the multimodal feedback generation task. It is shown by experiments to achieve state-of-the-art performance on this task.
The rest of the paper is organized as follows. First, we introduce the task definition of personalized multimodal feedback generation in Section 2. Afterward, we present our PMFGN framework in Section 3. Next, Section 4 presents our experimental setup and results with discussions. Then, we review related works in Section 5. Finally, Section 6 concludes the work with possible future research directions.

Task Definition
Typically the feedback given by teachers mainly concerns the clarity of the video, the fluency of the voice, and the relevance between the answer and the given topic. Thus, given a question q and a video submission, we extract raw information from three modalities: image, audio, and text. Since the images change little from beginning to end while a student gives an oral presentation, we simply take a screenshot of the video, i.e., an image i, as the signals of the image modality. We separate the sound from the video as the audio signals a. Besides, we transcribe the audio into a text t with automated speech recognition (ASR) tools. Both the question content q and the ASR transcription t compose the signals of the textual modality.
We formulate the task of personalized multimodal feedback generation as follows. Given a corpus X = {(i_n, a_n, t_n, q_n, p_n, y_n)}_{n=1}^{N}, where each instance contains an image i, a piece of audio a, a speech text t, a question text q, a teacher identifier p, and the corresponding feedback text y written by that teacher, we seek to train a model that can generate personalized feedback for teacher p given a set of the above multimodal inputs (i, a, t, q).

Overview
The overall framework of the model is presented in Figure 2. The model consists of four components to address the aforementioned challenges: (1) modality encoders, which encode the raw image, audio, and text inputs into a sequence of feature representations; (2) modality gate, which selects information to generate the feedback specific to each modality; (3) general language model, which is an RNN decoder that models the distribution of the feedback conditioned on the encoded modality information; and (4) personalized language model, which provides a bias on the distribution estimated by the general language model in order to imitate the wording and tone of individual teachers. The framework is an end-to-end architecture that takes the image, audio, and texts as input and generates the feedback as output.

Modality Encoders
Image Encoder. The image encoder transforms an image into a sequence of L^(i) fixed-length vector representations, each of which stands for a specific area of that image. We use GoogLeNet to extract these image features (Szegedy et al., 2015):

(v_1, v_2, ..., v_{L^(i)}) = CNN(i).

Audio Encoder. We first extract a sequence of acoustic features, i.e., Mel-Frequency Cepstral Coefficients (MFCCs), a_1, a_2, ..., a_{L^(a)}, from the original audio signals, and then encode them with a unidirectional recurrent neural network, where the hidden state of each timestep is treated as the audio feature of that timestep:

(h_1^(a), h_2^(a), ..., h_{L^(a)}^(a)) = RNN(a_1, a_2, ..., a_{L^(a)}).

A vanilla RNN cell, GRU, or LSTM can be used as the recurrent cell.

Text Encoder. To help the whole framework learn to estimate the relevance between the speech text t and the question text q, we use a MatchRNN structure (Wang and Jiang, 2016) to encode them simultaneously. Given a speech text t = (t_1, t_2, ..., t_{L^(t)}) with length L^(t) and a question text q = (q_1, q_2, ..., q_{L^(q)}) with length L^(q), we pass them through two RNNs and obtain the corresponding hidden states h_k^t and h_j^q. Then, for each word t_k in the speech text, we compute an attention-weighted combination of the hidden states of the question text:

c_k = Σ_j α_{kj} h_j^q.    (1)

Afterward, the attentive vector c_k and the hidden state h_k^t of the speech text are concatenated as m_k = [c_k : h_k^t], which is then fed into a matching RNN with LSTM cells: h_k^m = LSTM(h_{k-1}^m, m_k). The obtained hidden state h_k^m reflects the matching degree between each word t_k in the speech text and the question text q; more important matching results are selectively remembered. In Equation 1, the attention weight is determined by α_{kj} = softmax(e_{kj}), where e_{kj} is an alignment score computed from h_k^t and h_j^q. Unlike the original MatchRNN, which only takes the last hidden state h_{L^(t)}^m for prediction, our framework treats all the hidden states (h_1^m, h_2^m, ..., h_{L^(t)}^m) as the encoded features for the downstream feedback generation task. Finally, we obtain the matching features of the speech text and the question text. All the feature vectors of the three modalities are set to be d-dimensional. As a unified network, the three modality encoders are trained together with the rest of the framework in an end-to-end manner.
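For concreteness, the attention-weighted combination over the question hidden states can be sketched in pure Python as follows (toy two-dimensional hidden states; the alignment score is simplified here to a dot product, whereas the model uses a learned alignment function):

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(h_t_k, question_states):
    """Attention over the question hidden states for one speech word.

    The alignment score e_kj is sketched as a plain dot product between
    h^t_k and h^q_j; the paper uses a learned alignment function instead.
    """
    scores = [dot(h_t_k, h_q_j) for h_q_j in question_states]
    alphas = softmax(scores)                     # alpha_kj, Equation 1
    d = len(question_states[0])
    c_k = [sum(a * h[i] for a, h in zip(alphas, question_states))
           for i in range(d)]
    # m_k concatenates the attentive vector with the speech hidden state
    m_k = c_k + h_t_k
    return alphas, m_k

# toy 2-dimensional hidden states: one speech word, two question words
alphas, m_k = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The resulting m_k vectors would then be consumed step by step by the matching LSTM.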

Modality Gate
After encoding the information of multiple modalities, effectively integrating such information for downstream tasks is one of the most important challenges (Baltrusaitis et al., 2019). In the feedback generation task, a single piece of feedback may include both specific comments on the information of various modalities and general comments on the entire submission. Thus, to meet the special needs of this task, a reasonable way to integrate multimodal information is to choose, at each decoding step, the information of one noteworthy modality based on the context to generate a word of the feedback. We introduce a modality gate mechanism that selectively allows the information of one modality to pass through at each step for the decoder to generate the feedback. When no modality should receive special attention, a pre-defined feature vector indicating "general comment" passes through the gate to generate such feedback.
Moreover, the information of one modality may be evaluated from various aspects. Take the audio modality as an example: defective audio may suffer from several different flaws, e.g., the volume is too low, the voice is not fluent, or the voice is unclear. Thus, before deciding which modality to focus on, we perform structured self-attention (Lin et al., 2017) with K hops on the modality features to represent the quality of this modality in different aspects (say, the quality of the audio in volume, fluency, and clarity) as weighted sums of the modality features. Ideally, the result of each hop stands for one aspect. For modality m, the representation of the k-th aspect is calculated as

z_k^(m) = Σ_l β_{kl} h_l^(m),  β_{kl} = softmax_l(w_k^T tanh(W^(m) h_l^(m))),

where w_k ∈ R^d is a learnable vector and W^(m) ∈ R^{d×d} is a learnable parameter matrix. Besides the aspect vectors {z_k^(m)}_{k=1}^K of the three modalities, we introduce a single learnable vector z^(0) ∈ R^d to indicate the "general comment".
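The K-hop structured self-attention above can be sketched in pure Python (toy dimensions and an identity parameter matrix, purely for illustration; in the model, the hop vectors and W^(m) are learned):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def structured_self_attention(features, hop_vectors, W):
    """Compute one aspect vector z_k per hop as a weighted sum of features."""
    aspects = []
    for w_k in hop_vectors:
        # score each feature vector against this hop's query vector
        scores = [sum(a * b for a, b in zip(w_k,
                  [math.tanh(x) for x in matvec(W, h)]))
                  for h in features]
        betas = softmax(scores)
        d = len(features[0])
        z_k = [sum(b * h[i] for b, h in zip(betas, features))
               for i in range(d)]
        aspects.append(z_k)
    return aspects

# toy example: 3 feature vectors of dimension 2, K = 2 hops
H = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W = [[1.0, 0.0], [0.0, 1.0]]          # identity, for illustration only
hops = [[1.0, 0.0], [0.0, 1.0]]
Z = structured_self_attention(H, hops, W)
```

Each entry of Z is a convex combination of the modality features, one per hop/aspect.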
In the modality gate, we first map the four groups of z vectors to key vectors for a four-class classification that decides whether to generate a general comment or a comment targeted at the image, audio, or texts. We use z^(0) directly as its own key, k^(0) = z^(0), and map the aspect vectors {z_k^(m)}_{k=1}^K to their keys k^(m) ∈ R^d by concatenating them and applying a linear transformation. At each decoding step t, we use the hidden state of the decoder at the previous step, h_{t-1}, as the query to determine which modality to pass. The corresponding scores s ∈ R^4 are calculated as

s = softmax([k^(0), k^(i), k^(a), k^(t)]^T W h_{t-1}),

where W ∈ R^{d×d} is a learnable parameter matrix.
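The gate's four-way selection can be sketched as follows (toy dimensions; the key vectors and parameter matrix here are illustrative placeholders, not learned values):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gate_scores(keys, W, h_prev):
    """Score the four keys (general, image, audio, text) against the query.

    The query is the decoder's previous hidden state h_{t-1}; W is the
    learnable d x d matrix (identity here, for illustration only).
    """
    query = [sum(w * x for w, x in zip(row, h_prev)) for row in W]
    logits = [sum(k_i * q_i for k_i, q_i in zip(k, query)) for k in keys]
    return softmax(logits)

keys = [[1.0, 0.0],    # k(0): general comment
        [0.0, 1.0],    # k(i): image
        [0.5, 0.5],    # k(a): audio
        [-1.0, 0.0]]   # k(t): text
W = [[1.0, 0.0], [0.0, 1.0]]
s = gate_scores(keys, W, h_prev=[1.0, 0.0])
modality = s.index(max(s))  # index of the modality allowed through the gate
```

With this query, the "general comment" key wins and its z-vector would pass through the gate.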
During training, we use signals in the feedback to indicate which modality each sentence is directed at. For example, when only voice-related phrases such as "couldn't hear" or "not fluent" appear in a sentence, we assign the label "audio" to each word in that sentence. Only the z-vectors of the corresponding modality pass through the modality gate to generate the words in that sentence. The modality labels are provided as supervisory signals to train the classifier in the modality gate by optimizing the following cross-entropy objective:

J_gate = −(1/N) Σ_{i=1}^N Σ_{l=1}^{L_i} Σ_m y_{il}^m log s_{il}^m,

where N is the number of training instances, L_i is the length of the feedback, y_{il}^m is 1 when the l-th word belongs to modality m and 0 otherwise, and s_{il}^m is the likelihood predicted by the modality gate that the word belongs to modality m. This objective is trained jointly with the loss function in Equation 2 in the form of multi-task learning (Caruana, 1993).

General Language Model
A conditioned GRU language model is used as the decoder to generate the feedback conditioned on the z-vectors that pass through the modality gate. In contrast to the personalized language model below, we refer to it as the general language model. For modality m, at step t, we align the previous hidden state h_{t-1} with the aspect vectors {z_k^(m)}_{k=1}^K and compute a weighted sum of them, so that the model can focus on different aspects when generating different words:

z̃_t^(m) = Σ_k γ_{tk} z_k^(m),  γ_{tk} = softmax_k(e_{tk}),

where e_{tk} is an alignment score between h_{t-1} and z_k^(m). For each modality m, we introduce a learnable embedding vector e^(m) ∈ R^d to indicate which modality the feedback is targeted at. Then the word at the previous step y_{t-1}, the modality embedding e^(m), and z̃_t^(m) are taken as the inputs of the RNN to update the hidden state:

h_t = GRU(h_{t-1}, [E y_{t-1} : e^(m) : z̃_t^(m)]),

where E is the embedding matrix of the words.
The probability of the next word y_t predicted by the general language model is computed by an output layer on top of the hidden state:

P_gen(y_t | y_1, ..., y_{t-1}, e^(m), z̃_t^(m)) = g_gen(h_t),

where g_gen represents the function of the output layer in the general language model.
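One decoding step of the general language model can be sketched as follows (the aspect attention uses a simplified dot-product alignment, and the GRU update is replaced by a tanh cell purely to keep the sketch short; all values are toy values):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def decode_step(h_prev, aspects, emb_prev, modality_emb):
    """One sketched decoding step of the general language model.

    The aspect attention follows the description above (dot-product scores
    stand in for the alignment function); the tanh cell is a stand-in for
    the GRU update, purely for brevity.
    """
    # attend over the K aspect vectors with h_{t-1} as query
    scores = [sum(a * b for a, b in zip(h_prev, z)) for z in aspects]
    gammas = softmax(scores)
    d = len(aspects[0])
    z_tilde = [sum(g * z[i] for g, z in zip(gammas, aspects))
               for i in range(d)]
    # RNN input: previous word embedding, modality embedding, attended aspects
    x = emb_prev + modality_emb + z_tilde
    h_t = [math.tanh(v) for v in x[:len(h_prev)]]  # stand-in for the GRU
    return h_t, z_tilde

h_t, z_tilde = decode_step([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                           [0.1, 0.2], [0.3, 0.4])
```

A softmax output layer over h_t would then give P_gen for the next word.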

Personalized Language Model
The general language model estimates the distribution of the feedback conditioned on the encoded modality information. Since it is trained on a large amount of feedback data from all the teachers, it tends to generate generic feedback. To address the challenge of personalized feedback generation, we introduce a personalized language model that models another distribution of the feedback conditioned on the teacher. The latter acts as a bias on the former to help the model generate feedback in the specific style of an individual teacher.
We apply a bigram DNN-based language model as the personalized language model. Taking the embeddings of the previous two words, x = [E y_{t-2} : E y_{t-1}], as input, the network passes them through two hidden layers and then predicts the probability of the current word y_t:

P_per(y_t | y_{t-2}, y_{t-1}, p) = g_per(x),

where g_per represents the function composed of the hidden and output layers of the personalized language model. Unlike the original model, we use distinct parameters H_1^p and d_1^p in the first hidden layer for each teacher to enable the model to learn the different language styles of different teachers.
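The bigram DNN language model with a teacher-specific first layer can be sketched as follows (toy dimensions; all parameter values are illustrative, and in the model H_1^p and d_1^p would be looked up by teacher identifier p):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def personalized_lm(prev2_emb, prev1_emb, H1_p, d1_p, H2, d2, W_out):
    """Bigram DNN language model with a teacher-specific first layer.

    H1_p and d1_p are the per-teacher parameters; the remaining layers
    are shared across teachers.
    """
    x = prev2_emb + prev1_emb                      # [E y_{t-2} : E y_{t-1}]
    h1 = [math.tanh(a + b) for a, b in zip(matvec(H1_p, x), d1_p)]
    h2 = [math.tanh(a + b) for a, b in zip(matvec(H2, h1), d2)]
    return softmax(matvec(W_out, h2))              # P_per(y_t)

# toy setup: 2-dim embeddings, 2-dim hidden layers, vocabulary of 3 words
p_dist = personalized_lm([0.1, 0.2], [0.3, 0.4],
                         H1_p=[[1, 0, 0, 0], [0, 1, 0, 0]], d1_p=[0.0, 0.0],
                         H2=[[1, 0], [0, 1]], d2=[0.0, 0.0],
                         W_out=[[1, 0], [0, 1], [1, 1]])
```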

Loss Function
The final distribution of the word y_t is computed as a linear combination of the probabilities predicted by the general language model and the personalized language model:

P(y_t) = r · P_gen(y_t) + (1 − r) · P_per(y_t),  r = σ(w_r^T h_t),

where the weight r is decided by the current hidden state h_t of the general language model, σ is the sigmoid function, and w_r ∈ R^{d_h} is a parameter vector. In this way, the model can learn to automatically judge, at each step, whether a word should be written based on the modality information or on the teacher's personal preferences. The model is trained to minimize the negative log-likelihood of the ground-truth feedback:

J = −Σ_t log P(y_t).    (2)

Hence, the final joint loss function is L(Θ) = J + α J_gate, where J_gate is the cross-entropy objective of the modality gate and α is a hyperparameter that balances the two parts.
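The combination of the two distributions can be sketched as follows (toy probability vectors; w_r and h_t are illustrative values):

```python
import math

def mix_distributions(p_gen, p_per, h_t, w_r):
    """Combine the general and personalized distributions with weight r.

    r = sigmoid(w_r . h_t) follows the description above; the vectors
    here are toy values, not learned parameters.
    """
    r = 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(w_r, h_t))))
    mixed = [r * g + (1.0 - r) * p for g, p in zip(p_gen, p_per)]
    return mixed, r

p_final, r = mix_distributions([0.7, 0.2, 0.1], [0.1, 0.1, 0.8],
                               h_t=[1.0, -1.0], w_r=[0.5, 0.5])
# negative log-likelihood of a ground-truth word, e.g. vocabulary index 0:
nll = -math.log(p_final[0])
```

Since both inputs are valid distributions and r ∈ (0, 1), the mixture remains a valid distribution.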

Experiment
In this section, we conduct extensive experiments to evaluate our proposed framework on a real-world dataset collected from an online education platform. Through the experiments, we aim to answer two questions: (1) Does our model achieve better performance than representative baselines? and (2) How does each component in our proposed framework contribute to the performance?

Dataset
The Dolphin dataset contains 20,442 videos of oral presentation assignments collected from a real-world online education platform. The assignments are about answering a math question at the kindergarten or primary school level. Each video is accompanied by a piece of feedback manually written by a teacher; a total of 111 different teachers wrote the feedback. The average length of the feedback is 8.78 Chinese characters. The dataset is randomly divided into 15,550 records for training, 1,945 for validation, and 2,947 for test. We guarantee that in each set, the data of every teacher are selected in proportion to the total number of that teacher's records.

Implementation Details
In this subsection, we describe the implementation details of our proposed model. Speech texts, question texts, and feedback texts are first segmented by the Jieba Chinese segmentation system 1 . We build two distinct dictionaries for the texts from the input (i.e., speech texts and question texts) and the output (i.e., feedback texts). We initialize the word embeddings from a standard normal distribution N(0, 1). All the other parameters are initialized from a uniform distribution U(−1/√k, 1/√k), where k is the size of the last dimension of the parameter tensor. Hyper-parameters are determined according to the model performance on the validation set as follows. We adopt the pre-trained GoogLeNet (Inception v1) model (Szegedy et al., 2015) as the image encoder, where the layers behind inception (5b) are removed and replaced with a linear layer that transforms the features to a dimension of 256. The audio data are first downsampled to 16 kHz, and then we extract MFCCs from 50 ms time windows with a 50 ms shift. The audio encoder is implemented as a GRU RNN with a hidden size of 256. For both the text encoder and the general language model, we set the size of word embeddings to 256 and the size of the hidden states to 512. GRU cells are used for the general language model. We perform structured self-attention with K = 3 hops on all three modalities. The value of α is chosen as 0.5. An Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 0.001 is applied to train the model. The batch size is set to 10. When the perplexity of the model on the validation data doesn't drop for 3 consecutive epochs, training is terminated.
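Under the stated windowing (16 kHz sampling, 50 ms windows with a 50 ms shift, i.e., non-overlapping frames), the number of MFCC frames per clip can be computed as follows (a small sketch; an actual feature extraction tool may pad the edges differently):

```python
def num_frames(num_samples, sr=16000, win_ms=50, hop_ms=50):
    """Count analysis frames for framed feature extraction.

    With a 50 ms window and 50 ms hop, frames are non-overlapping,
    so a one-minute clip yields 1,200 frames.
    """
    win = sr * win_ms // 1000   # samples per window (800 at 16 kHz)
    hop = sr * hop_ms // 1000   # samples per hop (800 at 16 kHz)
    if num_samples < win:
        return 0
    return (num_samples - win) // hop + 1

frames_per_minute = num_frames(60 * 16000)
```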

Baselines
We compare our model with the following methods.
• GRU-LM: A GRU language model trained on the feedback in the training set (i.e., the general language model in our framework used alone). In the test phase, we randomly sample the first word and then generate an entire piece of feedback by greedy search.
• Show-Attend-and-Tell : Originally it is an image captioning model with an attention mechanism. We use it to generate feedback based on the image features.
• Attribute2Seq-Audio : The model generates product reviews based on a sequence of attribute vectors with an attention mechanism. We substitute the audio features for the attributes to generate the feedback for assignments.
• Attribute2Seq-Text: It is similar to Attribute2Seq-Audio but it takes text features as input.
• Multimodal Attention (Hori et al., 2017): The model adopts a hierarchical attention mechanism to fuse multimodal information to generate video descriptions. We use it to integrate image, audio, and text features for feedback generation.

Figure 3: The results of the ablation study. PMFGN represents the entire model; "no label" stands for the model without label information for multi-task learning; "no PLM" is the model without the personalized language model; "fixed r" means that r is set to a fixed value of 0.7; "1 hop" means that 1-hop structured self-attention is conducted in the modality gate.
• Repository: We also include a repository-based baseline. It is a traditional pipeline method that selects a piece of feedback from several repositories according to a simple strategy. First, we use three well-trained binary classification models to judge whether there is no sound in the audio, whether the voice of the student is fluent, and whether the speech text and the question text are relevant. Then, for each of the four cases, "no sound", "not relevant", "relevant but not fluent", and "relevant and fluent", we select a piece of feedback from the corresponding feedback repository. The four repositories contain 34, 29, 96, and 92 pieces of feedback written by different teachers, respectively. The sound existence model and the fluency model are realized by logistic regression on pre-extracted OpenSMILE acoustic features, and the relevance model is based on MatchRNN, the same structure used in our text encoder.
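The routing strategy of this baseline can be sketched as follows (the feedback strings and the deterministic first-entry selection are illustrative placeholders; the real repositories contain teacher-written Chinese feedback):

```python
def select_feedback(has_sound, relevant, fluent, repositories):
    """Route a submission to one of the four feedback repositories.

    The three booleans would come from the pre-trained classifiers
    (sound existence, relevance, fluency); here they are given directly.
    Returns the first entry for determinism; the real system may sample.
    """
    if not has_sound:
        case = "no sound"
    elif not relevant:
        case = "not relevant"
    elif not fluent:
        case = "relevant but not fluent"
    else:
        case = "relevant and fluent"
    return repositories[case][0]

# illustrative mini-repositories, one entry each
repos = {
    "no sound": ["We couldn't hear you; please re-record."],
    "not relevant": ["Please answer the given question."],
    "relevant but not fluent": ["Good answer, but try to speak more fluently."],
    "relevant and fluent": ["Great job! Clear and fluent."],
}
fb = select_feedback(True, True, False, repos)
```

Note how the fixed routing makes personalization impossible, which is one of the limitations discussed above.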

Experimental Results
We first evaluate the performance of our model and the baselines on the test data according to several automatic metrics: perplexity, BLEU, ROUGE, and Distinct-N. Perplexity measures how well a probabilistic generative model predicts the ground truth; the lower, the better. BLEU and ROUGE measure the text similarity between the generated feedback and reference feedback by comparing the number of overlapping text units. Besides the ground truth, we manually select 1-4 pieces of semantically identical feedback written by the same teacher as the references. Published evaluation codes are used for these two metrics 2 . Distinct-N measures the diversity of the generated texts by computing the number of distinct N-grams (Li et al., 2016a); a higher Distinct-N score indicates more diverse text. Experimental results are shown in Table 1, from which we make the following observations. First, Multimodal Attention achieves higher BLEU and ROUGE scores than the first four methods. This demonstrates that the information from all three modalities is useful for generating the feedback and enables the model to make a more accurate evaluation, which is reflected in feedback more similar to the references. Second, all the baselines get relatively low Distinct scores, which means that without a personalized mechanism, these methods can only generate universal and repetitive feedback for all the teachers. An exception is GRU-LM, which gets a higher Distinct score thanks to randomly sampling the first token; however, its generated feedback is accordingly random as well. Third, the traditional pipeline method Repository cannot obtain the desired performance, since the limited repositories are not applicable to all cases and all teachers. Finally, our model achieves the best results on all the evaluation metrics, which demonstrates that it effectively takes advantage of the information of each modality to produce more proper feedback.
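Distinct-N can be computed as follows (a minimal sketch over tokenized texts):

```python
def distinct_n(texts, n):
    """Distinct-N: ratio of unique n-grams to total n-grams over all texts.

    Each text is a list of tokens; a higher score means more diverse output.
    """
    seen, total = set(), 0
    for tokens in texts:
        for i in range(len(tokens) - n + 1):
            seen.add(tuple(tokens[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0

# two identical generations give a low Distinct-1 score
score = distinct_n([["good", "job"], ["good", "job"]], 1)
```

Repetitive outputs across teachers drive this ratio down, which is exactly what the low baseline Distinct scores reflect.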
Besides, through the personalized bias mechanism, our model generates feedback that is more diverse and in line with the style of teachers.
To further demonstrate the effectiveness of the proposed framework, we present a case study in Appendix A. We provide three examples of oral presentation assignments along with the corresponding produced feedback. A discussion on the examples is also included.

Ablation Study
In this subsection, we seek to figure out how each component contributes to the final performance of the proposed model. In Figure 3, we compare the performance of the entire proposed model and variants of the models with some components eliminated.
We make the following observations. First, if we don't use the annotated modality labels to train the model to predict the targeted modalities but instead randomly select modalities (i.e., "no label" in the figure), the BLEU scores degrade dramatically, which demonstrates that choosing the right modalities is crucial for generating correct feedback. Second, if we remove the personalized language model from the framework (i.e., "no PLM" in the figure), the BLEU scores also decline and the Distinct score drops to 1/6 of the original, which shows that the personalized bias mechanism helps substantially in generating personalized and more diverse feedback. Note that even without the personalized language model, the BLEU scores are still higher than those of the baseline Multimodal Attention, which means our model is more effective at integrating multimodal information for feedback generation than directly employing a hierarchical attention mechanism for multimodal fusion. Third, if the weight r is fixed to a constant instead of being determined by the hidden state of the general language model at each step (i.e., "fixed r" in the figure), the performance becomes even poorer than "no PLM". This is because the model needs to decide, based on the context, whether a "general" word or a "personalized" word should be generated. Otherwise, with a fixed weight, the model has constant biases toward the two sides at each step, which sometimes leads to a conflict between the two language models in deciding which words to choose, so that ungrammatical sentences can be generated. Fourth, in the complete model, we adopt K-hop structured self-attention to extract information on various aspects of a modality. If we replace multiple hops with 1 hop (i.e., "1 hop" in the figure), the performance is also poorer than the original, which verifies the benefit of this design.

Related Work
To the best of our knowledge, there is no previous work on the problem of automatic feedback generation for oral presentation assignments. However, there exist related works on text generation based on unimodal or multimodal information. Several works (Mao et al., 2015; Vinyals et al., 2015) investigate the problem of image captioning, which takes an image as input to generate a textual description. Video description is studied by Venugopalan et al. (2014), Rohrbach et al. (2015), and Hori et al. (2017). The first two take only dynamic images as input, while the last integrates the images and audio via a hierarchical attention mechanism to generate the description.
The works most similar to ours in application are those that generate reviews for products on online shopping platforms. One line of work adopts a seq2seq model to encode attributes (user, product, rating) into vectors and then generate the review with an RNN decoder, with an attention layer added between the encoder and decoder to learn the alignments between attributes and generated words. The work of Ni and McAuley (2018) generates reviews conditioned on a given user, an item, several related phrases such as product titles, and aspect-aware knowledge. Sun et al. (2019) use an image of a product and a predetermined rating as evidence to generate a review. For a given pair of user and item, Truong and Lauw (2019) first predict a rating that the user would give to the item, and then integrate the multimodal information of the user, the item, the predicted rating, and an image to generate a review. In Net2Text (Xu et al., 2019), the authors first construct a graph with users and items as nodes and then predict a review of a user towards an item with a language model conditioned on learned node embeddings.
Personalized text generation is also a popular problem in many NLP tasks, especially dialogue generation. The most common solution is to introduce persona embeddings to model the personality of speakers (Li et al., 2016b). Besides, Luan et al. (2017) propose to build personalized dialogue models via multi-task learning: the authors train a seq2seq model on common conversation data and an autoencoder on personal non-conversation data, with the parameters of their decoders shared. In addition, domain adaptation (Zhang et al., 2019) and transfer learning (Mo et al., 2018) methods have also been proposed for personalized dialogue generation.

Conclusion
In this paper, we study the problem of automated multimodal feedback generation for oral presentation assignments in K-12 education. As a pioneering work for this task, we propose a novel PMFGN that learns to produce personalized feedback for oral presentations in an end-to-end manner. Equipped with a modality gate mechanism and a personalized bias mechanism, the proposed framework encodes and fuses multimodal information in an effective way and achieves personalized feedback generation. The performance of the proposed model is demonstrated by experiments conducted on a real-world K-12 education dataset according to various evaluation metrics.
This work focuses on evaluating an oral presentation assignment based on its intrinsic quality (clarity, fluency, relevance, etc.), but does not consider judging whether an answer is right or wrong, since the automated correction of assignments is a separate problem under study. In the future, we plan to integrate this aspect into feedback generation.
(Translation: Mike and Nancy had 80 pieces of candy. After Mike gave Nancy 2 pieces of candy, they had the same number of candies. Do you know how many pieces of candy they each had at the beginning?) Submitted video (screenshot). The feedback written by a teacher, together with the feedback produced by the baseline methods and our proposed model, is listed as follows.

A.4 Discussion
Example A shows an oral presentation submission with a video 57 seconds long. The sound of the video is very clear and the answer is correct, but the image is very blurred; the video is not aimed at the student as required. Except for GRU-LM, all the baselines fail to recognize the defect in the image and provide completely positive feedback. Without the image information as input, GRU-LM generates a random piece of feedback, "Black screen?". Our method produces a piece of ideal feedback that praises the student's answer and also points out the issues with the image. For example B, an incorrect video with a length of only 5 seconds is submitted. Our model not only accurately points out this error, but also produces a piece of feedback that is very similar to the feedback given by the teacher. This means that our model successfully imitates the teacher's style and generates personalized feedback. Example C shows an eligible oral presentation submission. Although most of the baselines produce general positive