Reference and Document Aware Semantic Evaluation Methods for Korean Language Summarization

Text summarization refers to the process that generates a shorter form of text from the source document while preserving salient information. Many existing works on text summarization are evaluated using recall-oriented understudy for gisting evaluation (ROUGE) scores. However, as ROUGE scores are computed based on n-gram overlap, they do not reflect semantic meaning correspondences between generated and reference summaries. Because Korean is an agglutinative language that combines various morphemes into a word that expresses several meanings, ROUGE is not suitable for Korean summarization. In this paper, we propose evaluation metrics that reflect semantic meanings of a reference summary and the original document, Reference and Document Aware Semantic Score (RDASS). We then propose a method for improving the correlation of the metrics with human judgment. Evaluation results show that the correlation with human judgment is significantly higher for our evaluation metrics than for ROUGE scores.


Introduction
The task of text summarization is to generate a summary that conveys all the salient information of an original document. There are two strategies for this type of summarization (i.e., extractive and abstractive summarization). With the extractive approach, the most noticeable key sentences are extracted from the source and compiled into a summary (Zhong et al., 2019; Wang et al., 2019; Xiao and Carenini, 2019). The second approach is abstractive, with which a paraphrased summary is generated from the source (Zhang et al., 2018; Guo et al., 2018; Wenbo et al., 2019). The generated summary may not contain the same words that appear in the source document. Therefore, measuring factual alignment between the generated summary and source document is important (Kryscinski et al., 2019).
Most summarization models are evaluated using recall-oriented understudy for gisting evaluation (ROUGE) (Lin, 2004a), which measures n-gram overlaps between generated and reference summaries. ROUGE has proven to have a high correlation with manual evaluation methods, such as pyramid (Nenkova et al., 2007) and TAC AESOP (Owczarzak and Dang, 2011). However, Louis (2013) showed that the correlation significantly decreased when only one reference summary was provided. Additionally, considering the process by which a person manually summarizes a document, ROUGE is limited, because it does not reflect semantic meanings between generated and reference summaries. For example, when a person summarizes a document, they tend to use words that are implicit while not always using the explicit words from the original document. As the ROUGE score is computed based on an n-gram overlap, the score can be low even if two words have the same semantic meaning. This tendency is particularly prevalent in Korean, which is an agglutinative language that combines various morphemes into a word that expresses several meanings. Table 1 shows an example of the ROUGE limitation when applied to Korean summarization.
Table 1: An example showing the limitations of ROUGE in Korean summarization. The incorrectly generated summary has a high ROUGE score, but has the opposite semantic meaning. Text areas marked in blue and red serve as indicators for distinguishing the factualness of the semantic comparisons, as reflected by our metrics.
To overcome this limitation, an evaluation method that considers the semantic information of both the generated and reference summaries is required. It is also important to examine the factuality between the generated summary and source document, because the generated summary may contain false information. Each person summarizes information in a different manner, and it is difficult to reach agreement, even after cross-checking (Kryscinski et al., 2019). Therefore, the source document should also be considered together with the generated and reference summaries.
In this study, we propose metrics for evaluating a summarization model that consider both the source document and reference summary together with the generated summary (see Table 1). Our contributions can be summarized as follows:
• We propose evaluation metrics that can be applied to a summarization model using deep semantic information.
• We propose methods to improve the correlation between the proposed evaluation metrics and human judgment.
• Via extensive evaluation, we demonstrate that the correlation with human judgment is significantly higher for our proposed evaluation metrics than for ROUGE scores.

Related Work
Evaluation methods of text summarization are divided into two strategies: manual and automatic. Manual evaluation is expensive and difficult (Nenkova and Passonneau, 2004; Passonneau et al., 2013). Several studies have been conducted to develop automatic methods that facilitate fast and low-cost evaluations.
There are two types of automatic evaluation methods: extrinsic and intrinsic. An extrinsic automatic method evaluates a summarization model based on how it affects the completion of tasks comprising the judgment of document relevance (Dorr et al., 2004). An intrinsic automatic method evaluates summary quality via a property analysis or by calculating the similarity of the summary to a manually generated summary. Intrinsic methods include the pyramid method (Nenkova et al., 2007), the basic-elements method, and ROUGE (Lin, 2004b). The pyramid method inspects various human-made summaries and creates summary content units, each with a scoring weight. The basic-elements method is similar to the pyramid method. ROUGE evaluates the lexical overlap between the candidate and reference summaries.
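To make the lexical-overlap idea concrete, the following is a minimal sketch of ROUGE-N recall. It is not the official ROUGE implementation: whitespace tokenization, the function names `ngrams` and `rouge_n_recall`, and the English toy sentences are assumptions made for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams.

    Assumption: both inputs are split on whitespace; a real evaluation
    would use a proper (for Korean, morpheme-aware) tokenizer.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the movie was great"
paraphrase = "the film was excellent"
print(rouge_n_recall(paraphrase, reference, n=1))  # 0.5
print(rouge_n_recall(paraphrase, reference, n=2))  # 0.0
```

The paraphrase keeps the meaning of the reference, yet only half of the unigrams and none of the bigrams match, which is the kind of mismatch that synonym-aware and embedding-based metrics try to address.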
As the ROUGE score is computed based on the n-gram overlap, it does not account for synonymous words or phrases. Many approaches have been proposed to overcome this limitation. ParaEval, ROUGE-WE (Ng and Abrecht, 2015), ROUGE 2.0 (Ganesan, 2018), and ROUGE-G (ShafieiBavani et al., 2018) have been used to extend ROUGE to support synonymous constructs. ParaEval uses a matching method based on paraphrase tables. ROUGE-WE uses a lexical matching method with a semantic similarity measure and the cosine distances between tokens. ROUGE 2.0 uses WordNet as a synonym dictionary and computes token overlaps with all synonyms of matched words. ROUGE-G uses lexical and semantic matching from WordNet. These approaches have limitations because they require hand-crafted lexical and synonym dictionaries, which are particularly difficult to construct in Korean. Our research is similar to that of Zhang et al. (2019), who utilized BERT to compute a semantic score between the generated and reference sentences. However, Zhang et al. (2019) do not consider the document, whereas our research considers the document as well when evaluating summarization tasks. Overall, our research differs from previous approaches in three ways. 1) We propose a method to evaluate the generated summary by considering the document as well as the reference summary. 2) Our evaluation model is robust to out-of-vocabulary (OOV) words because it leverages a pre-trained neural network (SBERT) based on the byte pair encoding (BPE) (Gage, 1994) tokenization method learned in an unsupervised manner. Considering that Korean is an agglutinative language, this feature is essential. 3) Our evaluation model can be further trained to capture more contextualized information from both the reference summary and the document.
Text summarization models can be divided into abstractive, extractive, and hybrid. Abstractive models reword phrases and create summaries having novel phrases constructed from the original document. Recent text summarization approaches have leveraged multi-task and multi-reward training (Jiang and Bansal, 2018; Paulus et al., 2017; Guo et al., 2018), attention-with-copying mechanisms (Tan et al., 2017; See et al., 2017; Cohan et al., 2018), and unsupervised training strategies (Schumann, 2018; Chu and Liu, 2018). The extractive method extracts the most suitable sentences (or words) from the source document and copies them directly into the summary. Many researchers (Neto et al., 2002; Colmenares et al., 2015; Filippova and Altun, 2013) have utilized domain expertise to develop heuristics for refining summary texts. Recently, neural-based text summarization models have been proposed to train the model for predicting whether a span of text should be included in the summary (Nallapati et al., 2016; Narayan et al., 2017; Xu and Durrett, 2019; Liu et al., 2019a). Reinforcement learning-based summarization models have also been proposed to directly optimize models (Wu and Hu, 2018; Dong et al., 2018; Narayan et al., 2018b). The hybrid approach uses both abstractive and extractive methods. With this approach, the summarization process is divided into two phases: content selection and paraphrasing (Gehrmann et al., 2018; Hsu et al., 2018; Chen and Bansal, 2018).

Methodology
From Table 1, we can observe the importance of considering both the document and reference summary together for proper evaluation of the summarization model. In Subsection 3.1, we propose a method for evaluating the generated summary with the reference summary to reflect deep semantic meaning. Next, we propose a method for evaluating the generated summary with the original document and reference summary together. The reference-document-aware evaluation metric model can be further trained to capture more contextualized information from both the reference summary and the document (Subsection 3.2).

Reference and Document Aware Semantic Evaluation
Let us define the generated summary from the summarization model as $y_p = [w_1, ..., w_n]$ and the reference summary as $y_r = [w_1, ..., w_m]$, where $w_i$ indicates each word. Then, each summary representation, $v_p$ and $v_r$, can be constructed using sentence-embedding methods. Neural-based sentence-embedding methods have been broadly studied. Conneau et al. (2017) trained a siamese bidirectional long short-term memory model with a max-pooling strategy on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) and the Multi-Genre Natural Language Inference (NLI) dataset (Williams et al., 2017). Cer et al. (2018) proposed the universal sentence encoder to train a transformer on the SNLI dataset. Reimers and Gurevych (2019) recently proposed sentence-BERT (SBERT), which leverages a pre-trained BERT (Devlin et al., 2018), trained with a combination of the SNLI and multi-genre NLI, and showed state-of-the-art sentence embedding performance. SBERT is suitable for semantic similarity searches and showed faster inference speeds than previous state-of-the-art approaches, including BERT, RoBERTa (Liu et al., 2019b), and the universal sentence encoder.
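We leverage a pre-trained SBERT to construct summary representations. Each word representation, $e$, is obtained from SBERT as
$$E = [e_1, ..., e_n] = \mathrm{SBERT}(y_p). \qquad (1)$$
The summary representation $v_p$ is then computed by mean pooling over $E$:
$$v_p^j = \frac{1}{n} \sum_{i=1}^{n} e_i^j, \qquad (2)$$
where $j$ represents an index of a word-embedding dimension, and $n$ represents the length of $E$. $v_r$ can also be obtained in the same manner.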
The semantic similarity score, $s(p, r)$, between $v_p$ and $v_r$ can be obtained as follows:
$$s(p, r) = \cos(v_p, v_r) = \frac{v_p^\top v_r}{\|v_p\| \, \|v_r\|}. \qquad (3)$$
Recall that it is important to consider the factual consistency of the generated summary with the source document, and, given the same document, the method of summarizing important information varies from person to person (Owczarzak et al., 2012; Kryscinski et al., 2019). Therefore, the source document should also be considered with the generated summary when evaluating the summarization model.
Given a document, $D = [w_1, ..., w_k]$, the document representation, $v_d$, can be obtained using Eqs. (1) and (2). Thus, the similarity score between $v_p$ and $v_d$ can be defined as
$$s(p, d) = \cos(v_p, v_d) = \frac{v_p^\top v_d}{\|v_p\| \, \|v_d\|}. \qquad (4)$$
Given a reference and source document, the reference-document-aware semantic score (RDASS) of the generated summary is defined by averaging $s(p, r)$ and $s(p, d)$:
$$\mathrm{RDASS} = \frac{s(p, r) + s(p, d)}{2}. \qquad (5)$$
We also experimented with sum, max, and min operations between $s(p, r)$ and $s(p, d)$, but averaging the two scores yielded the highest correlation with human judgment.
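The scoring steps of this subsection can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the token vectors are toy values standing in for SBERT outputs, mean pooling and cosine similarity are assumed as the pooling and similarity functions, and the helper names `mean_pool`, `cosine`, and `rdass` are hypothetical.

```python
import math


def mean_pool(token_vectors):
    """Average equal-length token vectors into one sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / n for j in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rdass(v_p, v_r, v_d):
    """Return (s(p, r), s(p, d), RDASS), where RDASS averages the two scores."""
    s_pr = cosine(v_p, v_r)
    s_pd = cosine(v_p, v_d)
    return s_pr, s_pd, (s_pr + s_pd) / 2.0


# Toy 2-dimensional "token embeddings" (assumption: in practice these
# would come from a sentence encoder such as SBERT).
v_p = mean_pool([[1.0, 0.0], [1.0, 0.0]])  # generated summary
v_r = mean_pool([[2.0, 0.0]])              # reference summary
v_d = mean_pool([[0.0, 3.0]])              # source document

s_pr, s_pd, score = rdass(v_p, v_r, v_d)
print(s_pr, s_pd, score)  # 1.0 0.0 0.5
```

In this toy case the generated summary points in the same direction as the reference but is orthogonal to the document, so the averaged score sits halfway between the two similarities.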

Fine-tuning SBERT with the Abstractive Summarization Model
As SBERT is a trainable metric model, it can be further trained to capture more contextualized information about the reference summary and source document. We propose a fine-tuning method for SBERT that uses the abstractive summarization model. Most neural approaches for abstractive summarization are based on an encoder-decoder architecture (See et al., 2017). Formally, given a document, $D = [w_1, ..., w_k]$, the objective is to generate a summary, $y_p = [w_1, ..., w_n]$, from a hidden representation, $h_p = [h_1, ..., h_n]$. The hidden representation is the output vector of the decoder. We leverage the hidden representation of the decoder to fine-tune the SBERT.
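As a rough illustration of how a decoder representation can serve as the anchor of a margin-based triplet objective (the objective this subsection goes on to define), consider the sketch below. It rests on several assumptions that are not taken from the text: the decoder states are mean-pooled into a single anchor vector, the two triplet terms are combined by simple addition, all vectors are toy values, and the helper names are hypothetical.

```python
import math


def mean_pool(states):
    """Average decoder hidden states into one anchor vector (assumption)."""
    n = len(states)
    return [sum(s[j] for s in states) / n for j in range(len(states[0]))]


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: zero once the positive is closer to the
    anchor than the negative by at least `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


# Toy decoder hidden states h_1, h_2 for one generated summary.
anchor = mean_pool([[0.0, 0.0], [2.0, 0.0]])  # -> [1.0, 0.0]

# Toy positive/negative representations of references and documents.
pos_ref, neg_ref = [1.0, 0.0], [4.0, 4.0]
pos_doc, neg_doc = [1.0, 1.0], [1.0, 1.5]

j_pr = triplet_loss(anchor, pos_ref, neg_ref)  # 0 - 5 + 1 < 0 -> 0.0
j_pd = triplet_loss(anchor, pos_doc, neg_doc)  # 1 - 1.5 + 1 -> 0.5
total = j_pr + j_pd                            # combined by sum (assumption)
print(j_pr, j_pd, total)  # 0.0 0.5 0.5
```

The reference triplet is already separated by more than the margin and contributes no loss, while the document triplet is not, so only the latter would produce a gradient in a real training loop.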
Following Reimers and Gurevych (2019), we adopt a triplet objective to fine-tune the SBERT. Given an anchor $h_p$, a positive reference representation $v_r^p$, a negative representation $v_r^n$, and a Euclidean distance $d(\cdot, \cdot)$, the triplet objective for generated and reference summaries, $J(p, r)$, is then defined as
$$J(p, r) = \max\left(d(h_p, v_r^p) - d(h_p, v_r^n) + \epsilon, \; 0\right), \qquad (6)$$
where $\epsilon$ represents a margin that ensures $h_p$ is closer to $v_r^p$ than $v_r^n$. We set $\epsilon$ as 1. Similarly, the triplet objective for the generated summary and document can be defined as
$$J(p, d) = \max\left(d(h_p, v_d^p) - d(h_p, v_d^n) + \epsilon, \; 0\right). \qquad (7)$$
Thus, the final objective for SBERT is to minimize the combination of the two triplet objectives as
$$J = J(p, r) + J(p, d). \qquad (8)$$
The objective function $J$ for SBERT is jointly optimized with the abstractive summarization objective. Usually, the negative log-likelihood objective between the generated and reference summaries is used for abstractive summarization (See et al., 2017; Narayan et al., 2018a). We refer to the SBERT fine-tuned with the abstractive summarization model as "FWA-SBERT."

Experimental Setup

Dataset
We trained and evaluated our models using the Korean Daum/News dataset, comprising 10 topics, such as politics, economy, international, culture, information technology, and others. From this, we extracted 3 million news articles. The numbers of articles for training, validating, and testing were 2.98M, 0.01M, and 0.01M, respectively. We refer to this dataset as Daum/News. We used Daum/News to fully understand the content of the article and conduct a proper evaluation. The dataset contains articles from 143 newspapers, each having different summary styles, and the effectiveness of the proposed methods is exemplified using it. Therefore, we expect that our research can be applied to different languages.

Summarization Model
We adopted the abstractive summarization model of Liu and Lapata (2019). Liu and Lapata (2019) leveraged pre-trained BERT as an encoder and a six-layered transformer as a decoder, showing state-of-the-art results on the Cable News Network/DailyMail (Hermann et al., 2015), New York Times (Sandhaus, 2008), and XSum (Narayan et al., 2018a) datasets. We set all environments according to Liu and Lapata (2019), except that we leveraged the BERT pre-trained on a Korean dataset (Subsection 4.3) instead of english-bert-base-uncased. We trained the abstractive summarization model on the Korean Daum/News dataset.

SBERT
To leverage SBERT, we first pre-trained BERT (bert-base-uncased) on a Korean dataset, comprising 23M sentences and 1.6M documents, including Wiki, Sejong corpus, and web documents. Next, we trained SBERT with classification and regression objectives from NLI (Bowman et al., 2015; Williams et al., 2017) and the semantic textual similarity (STS) benchmark (STSb) (Cer et al., 2017). Because the NLI and STSb datasets are in English, we leveraged the Korean NLI and STS datasets (Ham et al., 2020), which were translated using the Kakao Machine Translator. Evaluation on the STS benchmark test dataset yielded a Spearman's rank correlation of 80.52. Subsequently, the pre-trained SBERT model was fine-tuned with the abstractive summarization model to capture more contextualized information of the reference summary and source document with a generated summary (Subsection 3.2). All training was conducted on the Kakao Brain Cloud with 4 Tesla V100 graphics processing units.

Human Judgment
To demonstrate the effectiveness of the reference-document-aware semantic metric, we evaluated its correlation with human judgment. Following Kryscinski et al. (2019), we asked annotators to score relevance, consistency, and fluency. Relevance represents the degree of appropriateness of the document, consistency represents the degree of factualness, and fluency represents the degree of the quality of the generated summary. Additionally, human avg represents the average value of the scores for the three indicators. Given a document, reference summary, and generated summary, each annotator gave a score in the range of 1 to 5 points for each evaluation indicator (i.e., relevance, consistency, fluency). The human judgment was conducted by 6 judges having a PhD (3 judges) or an MS (3 judges) degree in computer science. For 200 summaries sampled from the Korean Daum/News test dataset, the averaged human scores were 3.8 for relevance, 3.6 for consistency, and 3.9 for fluency.

Results
In this section, we first report the performance of the summarization model using ROUGE and the proposed evaluation metrics (Subsection 3.1). Next, we report how the proposed evaluation metrics correlate with human judgment. We also report the correlation of the proposed evaluation metrics with ROUGE to show that the proposed methods complement ROUGE. Finally, through qualitative evaluation, we demonstrate the limitations of ROUGE and the superiority of the proposed evaluation metrics.
The abstractive summarization model is based on the neural architecture of Liu and Lapata (2019). We trained the summarization model on the Daum/News dataset. To evaluate the summarization model, we used ROUGE and the proposed evaluation metrics. The fine-tuned FWA-SBERT was then used to compute the proposed semantic scores (s(p, r), s(p, d), and RDASS). Table 2 shows the performance of the summarization model with baseline methods (Reference Summary, Lead-1, and Lead-3) on the Daum/News dataset.

Performance of the Summarization Model
We set the reference summary as the upper bound. In the case of the reference summary, the reporter tends to use implicit words when summarizing the document, so the s(p, d) score is relatively low compared to the Lead baselines. However, because the s(p, r) score is 1.00, the reference summary shows the highest RDASS score. For Lead-1, s(p, r) is higher than s(p, d), and for Lead-3, s(p, d) is higher than s(p, r). The reason is that Lead-3 contains more sentences from the document, so its similarity with the reference summary, s(p, r), is lower, but its similarity with the document, s(p, d), is higher. The ROUGE scores of the Lead baselines are relatively low compared to those reported in other studies (Kryscinski et al., 2019) conducted on English datasets. The reason is that, in Korean, the same semantic meaning can be expressed in different surface forms because of the agglutinative nature of the language. A detailed example of this is described in Table 5 below. However, the RDASS scores of the Lead baselines are similar to that of the reference summary. This confirms that the proposed evaluation method reflects the semantic meaning of the reference summary and document well. The model of Liu and Lapata (2019) shows higher similarity with the reference summary than the Lead baselines, but because it is a generation model, it does not extract sentences from the document as the Lead baselines do. As a result, it shows a relatively low s(p, d) score. We describe how these results are correlated with human judgment in the next section.
Figures (1a) and (1b) show the Pearson correlation and Kendall rank correlation, respectively, of the proposed evaluation metrics with human judgment on the 200 sampled summaries. The Pearson correlation measures whether two variables are linearly related, where 1 indicates a perfect positive linear correlation and -1 indicates a perfect negative linear correlation. The Kendall rank correlation measures the rank correlation of two variables, where 1 indicates that the two rankings agree and -1 indicates that they are reversed. Both correlation measures are widely used in summarization tasks to analyze correlation with human judgment.
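For readers who want to reproduce this kind of analysis, the two correlation measures can be sketched in plain Python as follows. The scores below are made-up toy values, not the paper's data; the Kendall variant shown is tau-b (which corrects for ties, an assumption, since the text does not state which variant was used), and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # undefined for constant input; 0.0 is this sketch's choice
    return cov / (sd_x * sd_y)


def kendall_tau_b(x, y):
    """Kendall rank correlation (tau-b), counting pairs in O(n^2)."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both variables: ignored
            elif dx == 0:
                ties_x += 1       # tied only in x
            elif dy == 0:
                ties_y += 1       # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    if denom == 0.0:
        return 0.0  # undefined when one variable is constant
    return (concordant - discordant) / denom


# Toy data: metric scores and 1-5 human ratings for four summaries.
metric_scores = [0.2, 0.4, 0.6, 0.8]
human_scores = [1, 2, 3, 4]
print(round(pearson(metric_scores, human_scores), 6))  # 1.0
print(kendall_tau_b(metric_scores, human_scores))      # 1.0
```

Because human ratings on a 1 to 5 scale contain many ties, a tie-aware variant such as tau-b is a natural choice for the rank correlation in this setting.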

Correlation with Human Judgment
In the Pearson correlation matrix, the correlation with human judgment was significantly higher for the proposed evaluation metrics than for ROUGE scores. Additionally, in the Kendall rank matrix, the proposed evaluation metrics showed higher correlation with human judgment than did the ROUGE scores. Among the proposed evaluation metrics, s(p, r) showed higher performance than s(p, d), and RDASS showed the highest correlation with human judgment. These results indicate that the proposed evaluation metrics can reflect deep semantic meaning, overcoming the limitations of ROUGE, which is based on n-gram overlap.
To demonstrate the effectiveness of fine-tuning SBERT with an abstractive summarization model, we set baseline methods depending on which sentence representation method is used for the proposed metrics (Subsection 3.1) as follows:
Multilingual Universal Sentence Encoder (MUSE): MUSE is a multilingual sentence encoder that embeds text from 16 languages into a single semantic space using multi-task learning. This model was trained on more than 1 billion question-answer pairs and showed competitive state-of-the-art results on semantic retrieval (Gillick et al., 2018), bitext retrieval (Ziemski et al., 2016), and retrieval question-answering.
Pre-trained SBERT: We only leveraged pre-trained SBERT without fine-tuning. We refer to this as "P-SBERT."
Table 3 shows the performance comparison depending on which sentence representation was used. P-SBERT shows a higher correlation coefficient with human judgment than MUSE. Overall, FWA-SBERT showed the closest correlation with human judgment.
Through quantitative evaluation, we demonstrated that the proposed evaluation metrics had a high correlation with human judgment and that the method of fine-tuning SBERT improved the performance of the proposed evaluation metrics.
We also examined how the evaluation metrics were correlated with each other. As shown in Table 4, there was a high correlation among the ROUGE metrics. However, the proposed evaluation metrics had a relatively low correlation with ROUGE. This indicates that the proposed evaluation metrics reflect semantic meaning that ROUGE cannot. Thus, we believe they complement the ROUGE metrics.
Table 4: Pearson correlation of ROUGE and the proposed evaluation metrics.

Qualitative Analysis
In this section, through qualitative analysis, we demonstrate the effectiveness of our evaluation metrics. Table 5 shows ROUGE, RDASS and human evaluation results for the generated summaries for the two articles.
In article-1, the generated summary "On the 30th birthday of Messi, he had a good time with his family" has the same semantic meaning as the reference summary "Messi's 30th birthday with his wife and son". However, because a sentence with the same semantic meaning can be expressed in various ways in Korean, which is an agglutinative language, the ROUGE score is low while the human evaluation scores are high. Likewise, the generated summary "Samsung Electronics launches new 'qled tv' in Brazil" in article-2 has the same semantic meaning as the reference summary "Samsung Electronics launches 'qled tv' in Brazil, the largest market in Latin America". The generated summary in both articles is correct, but the ROUGE score is low. In contrast, the RDASS score is higher, indicating that the generated summary is correct.

Conclusion
In this paper, we pointed out the limitation of the widely used ROUGE evaluation metric when applied to Korean summarization. Because Korean is an agglutinative language, a generated summary having the same semantic meaning as the reference summary can be expressed in various ways. Therefore, relying only on the ROUGE metric can produce inaccurate evaluation results. To overcome this limitation, we proposed the RDASS (Reference and Document Aware Semantic Score) evaluation metric. RDASS can reflect deep semantic relationships among the generated summary, the reference summary, and the document. Through extensive evaluations, we demonstrated that the correlation with human judgment is higher for the proposed evaluation metric (RDASS) than for ROUGE scores. In future work, we will demonstrate the effectiveness of the proposed method on English summarization datasets.