One Comment from One Perspective: An Effective Strategy for Enhancing Automatic Music Comment

The automatic generation of music comments is of great significance for increasing the popularity of music and the activity of music platforms. Human music comments on the same song are highly distinct and reflect diverse perspectives. In other words, for a given song, different comments stem from different musical perspectives. However, to date, this characteristic has not been well considered in research on automatic comment generation, and existing methods tend to generate common and meaningless comments. In this paper, we propose an effective multi-perspective strategy to enhance the diversity of the generated comments. Experimental results on two music comment datasets show that our proposed model can effectively generate a series of diverse music comments based on different perspectives and outperforms state-of-the-art baselines by a substantial margin.


Introduction
In recent years, neural networks have achieved great success in natural language generation (NLG), which can be applied in many real-world scenarios, such as poetry generation, dialogue generation, and comment generation. Music comment generation is a sub-task of NLG. High-quality comments can effectively increase the popularity of music and the activity of the platform (Zeng et al., 2019). In human comments, there are clear distinctions and diverse perspectives for the same music content. As the saying goes, there are a thousand Hamlets in a thousand people's eyes. As a result, music comment generation is typically regarded as a one-to-many generation task.
Researchers have noticed such problems and tried to solve them via multiple methods in dialogue generation. Some of them have utilized topics, keywords, meta-words, and other information during the generation process to improve performance (Xing et al., 2017; Mou et al., 2016). Some researchers try to optimize the decoding process (Vijayakumar et al., 2016; Li et al., 2016b) or reorder the candidate sequences after decoding (Yao et al., 2016; Song et al., 2017). However, these approaches cannot significantly improve the model's performance on diversity. Some methods model one-to-many relationships by multiple latent variables and conform to the dialogue generation scenario (Zhou et al., 2018; Chen et al., 2019). They add multiple latent mechanisms or multi-mapping mechanisms between the encoder and decoder of the seq2seq architecture. These one-to-many mapping modules can capture a variety of similar generation modes to a certain extent in dialogue generation. Nevertheless, the texts generated through these mapping modules overlap heavily in aspects such as topics and description methods, which is less tolerable in comment generation than in dialogue generation.
This paper aims to bridge the gap between human and machine comments via a multi-perspective mechanism. We define the different language styles, views or aspects of the human comment creation as the musical perspectives, such as emotional perspectives, content theme perspectives, lyrics style perspectives, etc. Besides, compared with other scenarios, the distinction between various comments needs to be more significant in this task for the same input content.
In detail, to better simulate human behaviors and generate more diverse comments, we construct an effective multi-perspective mechanism. The proposed model consists of a music sequence information encoder with a multi-perspective extraction mechanism and a decoder to generate different comments. There is a significant difference between the training stage and the inference stage in our model. In the training stage, the model extracts the perspectives that conform better to the music content and selects the perspective suitable for optimization through the posterior information of the target comment. Besides, we design a distinction loss function between the perspective extraction components. Our model maximizes the difference between the extraction components by minimizing this loss so that each component exerts a different effect. In the inference stage, the model separately generates different comments based on the different perspectives that were optimized in the training stage. Our proposed model not only fits the situation of generating music comments but also greatly reduces the duplication and redundancy between perspective components. As a result, it can simulate multiple perspectives and generate diverse comments.
Overall, the contributions of this paper are as follows:
• To the best of our knowledge, we are the first to improve the diversity of music comment generation through a multi-perspective mechanism.
• We propose a novel comment generation model based on multiple-perspective extraction training. The proposed model improves the quality of music comments and makes different perspective modules generate diverse comments.
• The automatic evaluation results show that our model is better than the baselines on both datasets. Further analysis and manual evaluation show that the difference between automatically generated comments is indeed improved significantly.

Model Overview
First, we define the task of music comment generation. Given an input sequence containing music information $X = \{x_1, x_2, \ldots, x_T\}$, which contains the song title, author, and lyrics, we hope that our model can generate the corresponding music comment $Y = \{y_1, y_2, \ldots, y_{T'}\}$. Here $x_i$ and $y_j$, for $i = 1, 2, \ldots, T$ and $j = 1, 2, \ldots, T'$, are words, and $T$ and $T'$ are the lengths of the input sequence and output sequence, respectively.
Figure 2: Illustration of our model.
We aim to generate a series of comments Y from multiple perspectives given the textual sequence X. Figure 2 illustrates the architecture of our model. In the training stage, the input sequence X is encoded and converted to multi-perspective vectors through a multi-perspective mechanism, which simulates commenting on the music based on the concerned aspects and language styles. These multi-perspective vectors are decoded after two kinds of extraction. The first extracts the part of the semantic vectors related to the input sequence. The second extracts the part selected through the posterior information of the target comment. The extracted vectors are finally used in the decoder to generate vivid and diverse comments. Besides, to avoid duplication and redundancy among these perspective components, we optimize the multi-perspective mechanism with a Distinction Loss Function. The final loss function consists of three parts: Generation Loss, Distinction Loss, and Matching Loss. Among them, Generation Loss is the decoder's negative log-likelihood loss function, and Matching Loss is an auxiliary loss function that projects music content and music comments into the same perspective vector space.
In the inference stage, the input sequence X is encoded and directly converted into multi-perspective vectors. All of the multi-perspective vectors individually generate different comments. In this stage, there are no two ways of perspective extraction. We generate corresponding comments for each perspective component. In this way, our model can simultaneously generate comments from multiple perspectives.

Encoders
The proposed model includes a music content encoder and a comment encoder. The comment encoder is only used in the training stage. Both of them use a single-layer bidirectional GRU (Cho et al., 2014), and their learnable parameters are not shared. For the music content encoder, the i-th hidden states of the forward and backward GRU are computed by:
$$\overrightarrow{h}_i = \mathrm{GRU}\big(e(x_i), \overrightarrow{h}_{i-1}\big), \qquad \overleftarrow{h}_i = \mathrm{GRU}\big(e(x_i), \overleftarrow{h}_{i+1}\big),$$
where $e(x_i) \in \mathbb{R}^{d}$ is the embedding of word $x_i$ and $d$ is the dimension of the embeddings. The corresponding hidden states of the forward and backward GRU are then concatenated as the i-th hidden state $h_i$. Finally, a vector $x$ is obtained as the semantic representation of the input sequence. Likewise, the semantic representation of the music comment Y is $y$.

Multi-Perspective Mechanism
We use multiple linear functions to implement the multi-perspective mechanism. We call these linear functions the perspective components. We update the parameters of these perspective components with the help of two ways of perspective extraction. The simple linear function can also simulate a variety of comment perspectives to generate diverse comments.
Each linear function performs a linear transformation on the semantic representation of music content $x$, thereby capturing the underlying regularities in semantics:
$$p_k = W_k x + b_k, \qquad k = 1, \ldots, K,$$
where $W_k \in \mathbb{R}^{l_p \times T}$ and $b_k \in \mathbb{R}^{l_p}$ are the parameters of the k-th perspective component, $l_p$ and $T$ are the dimensions of $p$ and $x$, respectively, and $\{p_i\}_{i=1}^{K}$ are the multi-perspective vectors.
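As a rough illustration of the perspective components described above, the following sketch applies K independent linear maps to one content vector. It uses plain Python lists rather than a deep learning framework, and the helper names (`make_components`, `apply_component`, `perspective_vectors`), the random initialization range, and the toy dimensions are all assumptions made for illustration, not the authors' implementation.

```python
import random


def make_components(num_components, out_dim, in_dim, seed=0):
    """Create K hypothetical perspective components, each a pair (W_k, b_k)."""
    rng = random.Random(seed)
    components = []
    for _ in range(num_components):
        # W_k has out_dim rows and in_dim columns; the init range is an assumption.
        weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                  for _ in range(out_dim)]
        bias = [0.0] * out_dim
        components.append((weight, bias))
    return components


def apply_component(weight, bias, x):
    """Compute p_k = W_k x + b_k for a single component."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def perspective_vectors(components, x):
    """Map one content representation x to K perspective vectors."""
    return [apply_component(weight, bias, x) for weight, bias in components]


# Toy usage: 3 components mapping a 5-dimensional x to 4-dimensional vectors.
components = make_components(num_components=3, out_dim=4, in_dim=5)
x = [0.2, -0.1, 0.4, 0.0, 0.3]
P = perspective_vectors(components, x)
```

Because every component sees the same $x$, any diversity among the resulting vectors has to come from the parameters themselves, which is why the later extraction steps and the Distinction Loss matter.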

Perspective Extraction
These different perspective vectors contain potential information in language expression methods or aspects of description. Eventually, they are fed into the decoder to generate vivid and diverse comments. However, if these unprocessed vectors are directly used in the decoding process, there will be no difference between these perspective components. Therefore, perspective extraction needs to be performed, and appropriate multi-perspective vectors are extracted for parameter updates in the training stage. These multi-perspective vectors go through two ways of extraction. In the prior perspective extraction, considering that not all the potential information is suitable for all songs, the model learns to extract a part of suitable vectors for each input sequence. For example, for a sad piece of music, we cannot use a comic style to comment. In the posterior perspective extraction, we utilize the posterior perspective extraction mechanism similar to Chen et al. (2019) to pick out suitable perspective vectors for parameter updates.
The Prior Perspective Extraction. The prior perspective extraction selects the perspective vectors related to the input sequence. We therefore need to model $P_\alpha(p_i \mid x)$ and find an appropriate weight for each perspective vector, where $g_\alpha(p_i, x)$ is used to get the similarity between $p_i$ and $x$. Inspired by Zhou et al. (2017), in order to avoid over-fitting of $g_\alpha(p_i, x)$, we add two learnable parameter matrices and use the Maxout activation function (Goodfellow et al., 2013). Through this process, the correlation between the encoded music content $x$ and the multi-perspective vectors is obtained, and the weight of each perspective vector can be calculated, which achieves the purpose of the prior perspective extraction.
The Posterior Perspective Extraction. The posterior perspective extraction is similar to Chen et al. (2019). We use the posterior information of the target comment to extract the multi-perspective vectors. We calculate the correlation between the target comment and each perspective vector. Then we use softmax normalization to get the probability of extracting each perspective vector:
$$P_\beta(p_i \mid y) = \frac{\exp\big(g_\beta(p_i, y)\big)}{\sum_{k=1}^{K} \exp\big(g_\beta(p_k, y)\big)},$$
where $g_\beta$ is the dot product operation.
Perspective Extraction. The two-way perspective extraction gives different weights to different perspective vectors. We fuse the extracted results of the two ways to obtain the final perspective vector, where $W_\alpha, W_\beta \in \mathbb{R}^{l_p \times l_p}$ and $b \in \mathbb{R}^{l_p}$ are learnable parameters and $l_p$ is the dimension of $p$. We extract the perspective vector with the highest probability and feed it into the decoder. For the backpropagation of the sampling process, we use Gumbel-Softmax reparametrization (Jang et al., 2017) to obtain the probability, so that different samples can reasonably optimize various perspective components in the training stage.
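The posterior step can be pictured as a softmax over dot products between the encoded target comment and each perspective vector. The sketch below is only an illustration under that reading; the function names and the toy vectors are hypothetical, and it assumes the perspective vectors and the comment representation share the same dimension.

```python
import math


def dot(u, v):
    """Dot product, standing in for the similarity function g_beta."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def posterior_probs(perspectives, y):
    """Probability of selecting each perspective vector given the comment vector y."""
    return softmax([dot(p, y) for p in perspectives])


# Toy usage: the first perspective is most aligned with the target comment.
perspectives = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
y = [2.0, 0.0]
probs = posterior_probs(perspectives, y)
best = probs.index(max(probs))
```

Since the target comment is only available during training, a selection like `best` can guide parameter updates but cannot be used at inference time.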
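To make the Gumbel-Softmax step more concrete, here is a small sketch of the relaxation from Jang et al. (2017) in plain Python: Gumbel(0, 1) noise is added to log-probabilities, and a temperature-scaled softmax is applied. The temperature value, the function name, and the toy probabilities are assumptions; in a real model the same computation would be done on framework tensors so that gradients can flow.

```python
import math
import random


def gumbel_softmax(log_probs, tau, rng):
    """Return a relaxed one-hot sample over the perspectives.

    log_probs: log-probabilities of selecting each perspective.
    tau: temperature (assumed value); smaller values give sharper samples.
    rng: a random.Random instance used to draw the Gumbel noise.
    """
    noisy = []
    for lp in log_probs:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((lp + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Toy usage: one relaxed sample over three perspectives.
rng = random.Random(0)
log_probs = [math.log(p) for p in (0.7, 0.2, 0.1)]
weights = gumbel_softmax(log_probs, tau=0.5, rng=rng)
chosen = weights.index(max(weights))
```

The index of the largest weight follows the original categorical distribution (the Gumbel-max trick), while the soft weights remain differentiable with respect to the log-probabilities.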

Decoder
The comment decoder uses a unidirectional GRU, where $s_j$ is the hidden state of the GRU and $c_j$ is the context vector at time step $j$. We use the extracted multi-perspective vector $p$ as the initial state of the hidden layer. The generation probability of each time step in the decoding process is:
$$P(y_j \mid y_{0:j-1}, X, P) = \mathrm{softmax}(s_j, c_j),$$
where $y_{0:j-1}$ is the text generated before time step $j$. In the inference stage, however, there is no perspective extraction process: each perspective vector is fed into the decoder and generates a comment. Through the multi-perspective comments generated by the multiple perspective components, we can enhance the performance of automatic music comment generation.

Distinction Loss
To further increase the difference between the perspective components, we add a regularization term on the multi-perspective mechanism to the loss function. We aim to enhance the difference between the rows of the parameter matrix of the linear functions.
Inspired by prior work, we perform a dot product operation on the parameter matrix and its transpose, and then subtract the identity matrix from the result. We add the Frobenius norm of the final result as a penalty term for enlarging the difference between rows, where $\lVert \cdot \rVert_F$ stands for the Frobenius norm of a matrix.
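The penalty described above can be sketched as follows. This is an illustrative reading, not the paper's exact formula: it assumes the parameter matrix has one row per perspective component and that the squared Frobenius norm of $W W^{\top} - I$ is used. Whether the original uses the squared or unsquared norm is not recoverable from the text, and the function name is hypothetical.

```python
def distinction_penalty(rows):
    """Squared Frobenius norm of (W W^T - I), with one row of W per component.

    Assumption: each perspective component contributes one row of W.
    """
    k = len(rows)
    total = 0.0
    for i in range(k):
        for j in range(k):
            # Entry (i, j) of the Gram matrix W W^T.
            gram = sum(a * b for a, b in zip(rows[i], rows[j]))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total


# Orthonormal rows incur no penalty; identical rows are penalized.
orthogonal = distinction_penalty([[1.0, 0.0], [0.0, 1.0]])
duplicated = distinction_penalty([[1.0, 0.0], [1.0, 0.0]])
```

Off-diagonal entries of the Gram matrix measure the overlap between two components, so driving them toward zero pushes the rows apart, while the diagonal target keeps each row at roughly unit norm.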

Overall Loss Function
In posterior perspective extraction, the encoded target comment $y$ is required. We therefore need to project the music comment $y$ into the same perspective vector space, for which we use the Matching Loss (Chen et al., 2019). For each input sequence, we randomly sample some negative comments, which guides the input content and the current target comment to map to the same semantic space. Matching Loss is the negative log-likelihood of relevance between the encoded music content and the encoded comments. In the decoding process, the loss function of generating comments is the Generation Loss, that is, the decoder's negative log-likelihood. The total loss function of our model therefore combines the Generation Loss, the Matching Loss, and the Distinction Loss, the last of which is weighted by a coefficient γ.

Experiments

Datasets and Setups
We construct two Chinese music comment datasets for all experiments: the QQ Music comment dataset and the NetEase Cloud Music comment dataset. In detail, we collect about 61,618 pieces of song information-comment data from the QQ Music website and about 205,085 comments from the NetEase Cloud Music website. The details of the datasets are shown in the accompanying table. In our experiments, the size of the word embeddings is 200, and we initialize them from the Tencent AI Lab Embedding Corpus. The number of perspective components is 20. The hidden size is set to 1024. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0002. We set a dropout rate of 0.3 and use beam search with a beam size of 10 to generate all samples. The coefficient γ of the Distinction Loss is 0.001 on the NetEase Cloud Music dataset and 0.00005 on the QQ Music dataset. The model generally reaches its best performance on the validation set within ten epochs. We choose the better one between the model of the 10th epoch and the model with the lowest loss on the validation set.

Baselines
For the experimental comparisons, we compare our model with the following baselines:
• Seq2Seq (Qin et al., 2018): This model follows the framework of the sequence-to-sequence model with attention.
• VMED (Le et al., 2018): This model associates each memory read with a mode in the latent mixture distribution at each timestep. It can capture the variability observed in sequential data.
• MMPMS (Chen et al., 2019): A state-of-the-art multi-mapping mechanism model. It focuses on selecting the corresponding mapping module by the target response. Following their setting, we set the number of mapping modules to 20.

Evaluation Metrics
We use two kinds of evaluation methods: automatic evaluation and manual evaluation. For automatic evaluation, we use BLEU-1/2 (Chen and Cherry, 2014) to measure the percentage of unigram and bigram overlap between the generated comment and the ground truth. We also use Dist-1/2 (Li et al., 2016a) to measure the richness of unigrams and bigrams in all the generated comments. For manual evaluation, inspired by Liu et al. (2019) and Zhou et al. (2018), we adopt the following four manual evaluation metrics:
• Fluency: Whether the comments are fluent and whether there are severe grammatical errors.
• Coherence: Whether the generated comments conform to the scenario of music. How relevant the comment is to the music content.
• Meaningfulness: Whether the generated comments have rich meaning and detailed content.
• Distinction: Whether there exist significant differences between the generated comments for the same music input content. The greater the difference, the higher the score.
All the above metrics are scored on a five-point scale. We construct a manual test set containing 250 generated comments for each model, which belong to 50 input samples. We invite five human experts to score them according to the above criteria, and the average score for each metric is taken as the final result.
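For reference, Dist-n is commonly computed as the number of distinct n-grams divided by the total number of n-grams over all generated comments. The sketch below follows that common definition; the function name is hypothetical, and it assumes the comments have already been tokenized (for Chinese text, the choice of word segmentation is left open here).

```python
def distinct_n(comments, n):
    """Ratio of unique n-grams to total n-grams over a list of tokenized comments."""
    total = 0
    unique = set()
    for tokens in comments:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    # Return 0.0 when no n-gram exists (e.g., all comments shorter than n).
    return len(unique) / total if total else 0.0


# Toy usage with two tokenized comments.
comments = [
    ["i", "love", "this", "song"],
    ["i", "love", "the", "melody"],
]
dist1 = distinct_n(comments, 1)
dist2 = distinct_n(comments, 2)
```

A higher value means fewer repeated n-grams across the generated set, which is why the metric is used as a proxy for lexical diversity.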

Table 2: Automatic evaluation results (BLEU-1, BLEU-2, Dist-1, Dist-2) on the QQ Music and NetEase Cloud Music datasets.
Table 2 shows the automatic evaluation results on the two datasets. Our proposed model obtains higher BLEU and Dist-2 scores than the baselines on both music comment datasets, and its Dist-1 value is close to the highest value among the baselines. The improvement in BLEU score reflects that our model can generate more informative comments, which may be attributed to the design of the prior perspective extraction. Moreover, the Dist scores show that the generated comments are diverse and that the vocabulary is rich.
The human evaluation results, shown in Table 3, indicate that our model performs better overall on manual evaluation. In terms of Fluency and Coherence on the QQ Music dataset, our model scores lower. From our observation of the results, this is because the comments to be generated are too long, and some repeated text fragments appear frequently. The Distinction metric improves significantly compared to the baselines, which shows that the comments our model generates are very diverse.

Further Analysis
To validate that the difference between perspective components in our model is significantly improved and that our model can generate multi-perspective comments, we further analyze the results generated by the models. Two similar models are selected for comparison with our model. One is the baseline MMPMS, and the other is a sub-model that removes the Distinction Loss from the proposed model.
Table 4: Average Similarity of comments generated by the different perspectives.
In order to measure the difference between comments based on different perspectives, we propose a new metric called Average Similarity, defined as the average similarity between multiple comments generated from different perspective components for the same music input content.
We randomly select 200 samples. For each sample, we calculate the average similarity of the generated comments. In detail, we use BERT (Devlin et al., 2018) to vectorize the comments and obtain 768-dimensional vectors. Similarity is then measured by cosine similarity. The results in Table 4 show that the Average Similarity of our model is lower than that of the other models, which indicates that the multiple comments generated for the same input by our proposed model are more diverse. The difference between perspective components is significantly improved. In addition to comparing the multiple comments corresponding to different perspectives, we can also analyze the overall difference between the result sets corresponding to each perspective component.
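One plausible way to compute Average Similarity for a single input is to average the cosine similarity over all pairs of comment vectors. The sketch below makes that pairwise assumption explicit and takes precomputed vectors as input; obtaining the 768-dimensional BERT vectors is outside its scope, and the function names and toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0


def average_similarity(vectors):
    """Mean pairwise cosine similarity of the comment vectors for one input."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy usage with 2-dimensional stand-ins for sentence embeddings.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score = average_similarity(vectors)
```

Averaging this per-input score over the sampled inputs would give a single number per model, where lower values indicate more distinct comments.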
We compare the Dist-2 metric of the result sets corresponding to each perspective component. To facilitate comparison, we sort the values. As can be seen from Figure 3, the difference between the perspective components of the proposed model is noticeable. Figure 3(b) only shows nineteen dots because the first value from MMPMS is too low and is treated as an abnormal value. The polyline of our model is closer to a straight line y = kx + b, which means there is a uniform degree of difference between components. Furthermore, the effect on the long text from the QQ Music dataset is more significant. It is worth noting that our improvements expand the difference between perspective components and significantly improve the diversity of the text generated from each perspective component.
Table 4 presents generated comments from Seq2Seq, MMPMS, and our model. We select five comments for each model. The five comments generated by the Seq2Seq model are the top five candidates from beam search. For MMPMS and our model, we select the five most fluent sentences from the results for the single input. It can be seen that the comments generated by Seq2Seq and MMPMS inevitably focus on the same perspective and even contain duplicate phrases and words. For instance, all of the 3rd to 5th comments from the MMPMS model are about "梦想 (dream)" and "照亮 (light up)". In contrast, the comments generated by our model are meaningful and diverse. Meanwhile, our model can also generate comments from more perspectives or topics.
Lyrics: 晴时多云偶阵雨，偶尔失去太阳的勇气，就算挫折浸湿了翅膀，梦想也不曾停止远扬，难免会哭泣，像倾盆大雨淋着雨，我陪你前行，仰望着雨后彩虹...
Translation: Cloudy occasional showers, occasionally losing the sun's courage, even if the frustration soaked the wings, the dream never stopped flying, it will inevitably cry, like a downpour rain, I accompany you, looking forward to the rainbow after the rain.

Related work
Text generation based on the Seq2Seq model tends to create general text. For example, existing models for open-domain comment generation often produce repetitive and uninteresting comments (Lin et al., 2019). One line of work models the input news as a topic interaction graph and generates comments with a graph-to-sequence model. Lin et al. (2019) retrieve informative and relevant comments by leveraging user-generated data. Many researchers also try to model a one-to-many relationship to solve a similar problem in dialogue generation tasks. In detail, Xing et al. (2017) use topics to simulate prior human knowledge and guide the model to form informative responses. In contrast, Mou et al. (2016) utilize pointwise mutual information to extract words as keywords and decode the response based on the keywords. Another approach proposes a neural knowledge diffusion model to introduce knowledge into dialogue generation. Zhang et al. (2018) apply an explicit specificity control variable to a seq2seq model to generate responses at different specificity levels. Besides, other work enhances the seq2seq architecture with a goal tracking memory network to incorporate meta-words into generation. The above methods aim to enhance the diversity of generated results by adding specific structures and have achieved some improvement.
In recent years, some researchers have tried to construct multiple latent mechanisms to model the one-to-many relationship and generate diverse results. Among them, Tao et al. (2018) propose a novel Multi-Head Attention Mechanism (MHAM), which aims at capturing multiple semantic aspects from the user utterance. Zhou et al. (2017) develop an encoder-diverter-decoder framework, which encodes the input into a mechanism-aware context and decodes responses with controlled styles or topics. Based on this work, Zhou et al. (2018) add filter modules that select a subset of all mechanisms, large enough to generate responses in multiple styles, and obtain better results. Chen et al. (2019) try to achieve accurate optimization of latent mechanisms and design a mapping selection method for a multi-mapping mechanism.

Conclusion
This paper proposes an effective multi-perspective strategy to enhance automatic music comment generation so that each comment is generated from one perspective. The strategy alleviates, to some extent, the problem of generating common but meaningless comments in automatic music comment generation. We reform the one-to-many modeling mechanism and make it fit the situation of generating music comments.
In conclusion, our model bridges the gap between human and machine comments via the multi-perspective mechanism, simulating various perspectives to generate diverse music comments. Experimental results show that our method achieves excellent performance in both automatic evaluation and manual evaluation. The proposed model can also be applied to other text generation scenarios such as news comment generation and poetry generation.