Abstractive Text Summarization Based on Deep Learning and Semantic Content Generalization

This work proposes a novel framework for enhancing abstractive text summarization through the combination of deep learning techniques with semantic data transformations. Initially, a theoretical model for semantic-based text generalization is introduced and used in conjunction with a deep encoder-decoder architecture in order to produce a summary in generalized form. Subsequently, a methodology is proposed which transforms the aforementioned generalized summary into human-readable form, retaining at the same time important informational aspects of the original text and addressing the problem of out-of-vocabulary or rare words. The overall approach is evaluated on two popular datasets with encouraging results.


Introduction
Text Summarization (TS) aims at composing a concise version of an original text, retaining its salient information. Since manual TS is a demanding, time-consuming and generally laborious task, automatic TS is gaining increasing popularity and therefore constitutes a strong motivation for further research.
Current efforts in automatic TS mainly focus on summarizing single documents (e.g. news, articles, scientific papers, weather forecasts, etc.) and multiple documents (e.g. news from different sources, user reviews, e-mails, etc.), reducing the size of the initial text while at the same time preserving key informational elements and the meaning of the content.
Two main approaches to automatic TS have been reported in the relevant literature: extractive and abstractive (Gambhir and Gupta, 2017; Allahyari et al., 2017). In the former case, the sentences of the original text that best convey its content are first identified and then extracted in order to construct the summary. In the latter case, new sentences are generated which convey the overall meaning of the initial text, rephrasing its content. Abstractive TS is the more challenging task; it resembles human-written summaries, as it may contain rephrased sentences or phrases with new words (i.e. sentences, phrases and words that do not appear in the original text), thereby improving the generated summary in terms of cohesion, readability and redundancy.
The main contribution of this work is a novel abstractive TS technique that combines deep learning models of encoder-decoder architecture with semantic-based data transformations. Since the majority of the literature in abstractive TS focuses on either of the aforementioned parts, the proposed approach tries to bridge this gap by introducing a framework that combines the potential of machine learning with the importance of semantics. The said framework comprises three components: (i) a theoretical model for text generalization (Section 3), (ii) a deep learning network whose input is the text and whose output is a summary in generalized form (Section 4) and (iii) a methodology for transforming the "generalized" summary into a human-readable form, containing salient information of the original document (Section 5). Additionally, the proposed framework is capable of coping with the problem of out-of-vocabulary (OOV) words (or words of limited occurrences), thereby achieving semantic content generalization. The overall architecture is evaluated on Gigaword (Napoles et al., 2012; Rush et al., 2015) and DUC 2004 (Over et al., 2007), two popular datasets used in TS tasks, with the obtained results being promising, outperforming the current state-of-the-art.
The rest of this paper is organized as follows: Section 2 overviews the related work and Sections 3-5 outline the components of the proposed framework. Section 6 describes the experimental procedure in detail and discusses the obtained results. Finally, the paper concludes in Section 7, where possible future extensions are examined.

Related work
Abstractive TS methods can be broadly classified into structure-based and semantic-based approaches (Moratanch and Chitrakala, 2016). The former make use of pre-defined structures (e.g. ontologies, trees, templates, graphs and rules), whereas the latter utilize the semantic representation of text along with natural language generation systems (based on information items, predicate arguments and semantic graphs). Recently, deep learning architectures have been widely adopted in abstractive TS and they have since become the state-of-the-art (Gupta and Gupta, 2019), especially in short text summarization (Paulus et al., 2017), which is the focus of the current work. The proposed approach further extends the said architectures with semantic-based concept generalization, in an effort to improve the overall system performance.
In particular, semantic-based approaches utilizing (semantic) graphs produce the desired summaries through the extraction of ontological and syntactical relations in text, mainly by reducing the graph or by locating its key concepts (Khan et al., 2018; Joshi et al., 2018; Moawad and Aref, 2012). Item-based solutions, on the other hand, employ the notion of the information item (the smallest unit of coherent textual information, such as subject, verb and object triplets) in order to generate the summary out of the top-rated sentences. For example, the information items, along with temporal and spatial characteristics, are used in (Genest and Lapalme, 2011) in order to produce the abstractive summary.
Predicate argument-based approaches merge the respective structures of text (i.e. verbs, subjects and objects) and the summary is formed from the top-ranked such structures (Alshaina et al., 2017; Zhang et al., 2016). Nevertheless, semantic-based methods are not able to achieve performance comparable to deep learning approaches (Gupta and Gupta, 2019) and, for this reason, a framework utilizing semantic-based data generalization for the enhancement of sequence-to-sequence (seq2seq) deep learning abstractive summarization is presented in this work. Seq2seq architectures take a sequence of words at their input and emit a different, in the general case, sequence of words at their output.
An early approach to using semantic resources for the generalization of concepts connected with a conjunctive or disjunctive relation is due to Belkebir and Guessoum (2016), who replace two or more consecutive concepts by one more general word entailing the meaning of the initial ones (e.g. the phrase "apples and oranges" may be replaced by the word "fruits"). Our proposed methodology, however, is not limited to conjunctive and disjunctive relations and can, therefore, generalize every concept of a text.
The state-of-the-art abstractive TS deep learning systems employ seq2seq models of encoder-decoder architecture along with attention mechanisms, primarily based on recurrent neural networks (RNNs) and especially on long short-term memory networks (LSTMs) and gated recurrent units (GRUs) (Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017; Song et al., 2018; Chen et al., 2016; Gupta and Gupta, 2019). In these cases, the encoder input is a sequence of words which are subsequently converted into a vector representation; the decoder, assisted by the attention mechanism, which focuses on specific words at each step of the input sequence (Bahdanau et al., 2014), determines the output, emitting the next word of the summary based on the previous ones.
The methodology described above is further extended in (Rush et al., 2015), where a neural attention-based model is trained end-to-end on a large amount of data (article-summary pairs) and learns to produce abstractive summaries. Similarly, Nallapati et al. (2016) and See et al. (2017) train encoder-decoder models with attention mechanisms in order to address the problem of unseen (out-of-vocabulary) words, incorporating a pointer-generator network in their systems. Furthermore, See et al. (2017) avoid repetition of the same words in the summary through the inclusion of a coverage mechanism, while Lin et al. (2018) address the same problem by proposing a model with a convolutional gated unit that performs global encoding to improve the representation of the input data. Finally, Song et al. (2018) propose a deep LSTM-CNN (convolutional neural network) framework, which generates summaries via the extraction of phrases from source sentences.
The approach presented in this work is also based on a seq2seq deep learning model (See et al., 2017). In contrast to the systems outlined above, the novelty of our technique lies in the design of a semantic-based methodology for text generalization, which is presented in detail in the following sections.

Text generalization
The basic assumption of text generalization is the existence of a taxonomy of concepts that can be extracted from text (Definition 3.1). More specifically, the said taxonomy contains concepts and their hypernyms (Definition 3.2) in a hierarchical structure. Once the concepts have been extracted, the taxonomy path (Definition 3.3), containing the ordered sequence of concepts according to their taxonomy depth (Definition 3.4), is used for generalizing text. Figure 1 illustrates an example taxonomy of five concepts, where concept c_4 has a taxonomy depth equal to 3 and a taxonomy path P_{c_4} = {c_4, c_2, c_1, c_0}.

Definition 3.1 (Taxonomy of concepts) A taxonomy of concepts consists of a hierarchical structure of concepts which are related with an is-a type of relationship.
Definition 3.2 (Hypernym) Given a taxonomy of concepts, concept c_j is a hypernym of c_i if and only if c_i semantically entails c_j (c_i |= c_j).

Definition 3.3 (Taxonomy path of concept)
Given a taxonomy of concepts, the taxonomy path P_{c_a} of c_a is an ordered sequence of concepts P_{c_a} = {c_a, c_{a+1}, . . . , c_n} where c_i |= c_j, ∀i < j, and c_n is the root concept of the taxonomy.

Definition 3.4 (Taxonomy depth of concept)
Given a taxonomy path of concepts P_{c_a} = {c_a, c_{a+1}, . . . , c_i, . . . , c_n}, the taxonomy depth of concept c_i is the number of concepts from c_i to the root concept c_n in the path (d_{c_i} = n − i). By definition, the depth of the root concept is equal to zero.
A piece of text can be generalized only when it contains generalizable concepts (Definition 3.5). A concept c_i with a taxonomy path P_{c_i} is said to have been generalized when it has been replaced by a concept c_j ∈ P_{c_i} such that d_{c_j} < d_{c_i}. Accordingly, a text excerpt is said to have been generalized when it contains at least one generalized concept (Definition 3.6). The minimum taxonomy depth of a generalized concept constitutes the level of generalization of the given text (Definition 3.7).
Definition 3.5 (Generalizable concept) A concept c_i of taxonomy depth d_{c_i} is said to be generalizable when at least one concept of its taxonomy path has a taxonomy depth less than d_{c_i}.
Definition 3.6 (Generalizable text) A text excerpt is said to be generalizable when it contains at least one generalizable concept.
Definition 3.7 (Level of generalization) The level of generalization of a text excerpt is equal to the minimum depth of its generalized concepts.
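Definitions 3.3-3.5 can be sketched compactly in code. The snippet below is only an illustration, assuming the taxonomy is represented as a hypothetical child-to-parent map (`parent`) with the root mapped to None:

```python
def taxonomy_path(concept, parent):
    """Ordered sequence from the concept up to the root (Definition 3.3)."""
    path = [concept]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def taxonomy_depth(concept, parent):
    """Number of concepts from the concept to the root (Definition 3.4)."""
    return len(taxonomy_path(concept, parent)) - 1

def is_generalizable(concept, parent):
    """A concept is generalizable when some concept on its taxonomy path is
    shallower (Definition 3.5), i.e. whenever it is not the root itself."""
    return taxonomy_depth(concept, parent) > 0
```

For the example taxonomy of Figure 1, a parent map {c_4 → c_2, c_3 → c_1, c_2 → c_1, c_1 → c_0, c_0 → None} yields the path {c_4, c_2, c_1, c_0} and depth 3 for c_4, in agreement with the text.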

Text generalization strategies
Given the above definitions, two novel strategies for text generalization are presented, which take into account the frequency of a concept in the source text. The intuition behind this transformation is the fact that machine learning systems tend to require a sufficient number of training samples prior to producing accurate predictions. Therefore, low-frequency terms should ideally be replaced by respective high-frequency hypernyms that semantically convey the original meaning.
Text generalization strategies are used to generalize both the training set (i.e. the articles and their respective summaries) as well as the test set (i.e. the unseen text). As it shall be described next, the machine learning model of Section 4 generates a generalized summary that is transformed to a readable text through the post-processing methodology of Section 5.

Named Entities-driven Generalization (NEG)
NEG only generalizes those concepts whose taxonomy path contains particular named entities (NEs) such as location, person and organization (Algorithm 1). For example, given the set of named entities E = {location, person}, the sentence "John has been in Paris" can be generalized to "_person_ has been in _location_", where NEs are enclosed in underscores in order to be distinguished from the corresponding words that may appear in the dataset. Algorithm 1 requires: (i) the input text, (ii) the taxonomy of concepts T, (iii) the set C of tuples of extracted concepts c_i along with their respective taxonomy paths P_i and frequencies f_i (C = {(c_1, P_1, f_1), (c_2, P_2, f_2), . . . , (c_n, P_n, f_n)}), (iv) the set E of named entities (E = {e_1, e_2, . . .}) and (v) the threshold θ_f of the minimum number of occurrences of a concept. In lines 2−4 of Algorithm 1, a term can be generalized when both its frequency in the input text is less than the specified threshold θ_f and its taxonomy path P_i contains a named entity c ∈ E. In this case, c_i is replaced by its hypernym c (line 4). The output of the algorithm is a generalized version of the input text (genText). It should be noted that when θ_f = ∞, the operation of the NEG algorithm resembles that of named entity anonymization (Hassan et al., 2018).
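The operation of NEG can be sketched as follows; this is a minimal illustration of the algorithm described above (not the paper's implementation), assuming `concepts` is the set C of (concept, path, frequency) tuples, `entities` the set E, and token-wise replacement in the text:

```python
def neg_generalize(text, concepts, entities, theta_f):
    """Sketch of Algorithm 1 (NEG): replace each rare concept whose taxonomy
    path contains a named entity by that entity, wrapped in underscores."""
    tokens = text.split()
    for concept, path, freq in concepts:
        if freq < theta_f:
            # first named entity on the concept's taxonomy path, if any
            ne = next((c for c in path if c in entities), None)
            if ne is not None:
                tokens = ["_%s_" % ne if t == concept else t for t in tokens]
    return " ".join(tokens)
```

With θ_f = ∞ every listed concept with a named entity on its path is replaced, which mirrors the named-entity-anonymization behaviour noted above.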
Algorithm 1 Named Entities-driven Generalization
Require: text, T, C, E, θ_f
1: genText ← text
2: for all (c_i, P_i, f_i) ∈ C do
3:   if f_i < θ_f and ∃c ∈ P_i : c ∈ E then
4:     genText ← replace c_i with c
5:   end if
6: end for
7: return genText

Level-driven Generalization (LG)
LG generalizes the concepts according to the given level of generalization d (Definition 3.7), as illustrated in Algorithm 2. For instance, given the taxonomy of Figure 1 and d = 1, the sentence "banana is nutritious" may be generalized to "food is nutritious".
Similarly to Algorithm 1, Algorithm 2 requires (i) the input text, (ii) the taxonomy T, (iii) the set of tuples C, (iv) the threshold θ_f and (v) the level of generalization d. In lines 6−25, a term c_i is a candidate for generalization when its frequency f_i is below the specified threshold θ_f (line 7). More specifically, c_i is replaced by its hypernym c_h (line 11) only when the depth d_{c_h} of the latter is at least equal to d (line 9).
When a term is generalized, the set of concepts C is either updated by merging c_i with its hypernym c_h (lines 14−18), or a new entry is added to C if c_h is not already a member of the set (lines 20−21). Both the outer while-loop and the inner for-loop terminate when no more generalization can be applied to the text, because either the frequency of all concepts is greater than θ_f or all concepts have a taxonomy depth less than or equal to d. In this case, the algorithm returns the generalized version of the input text (line 27) and terminates.
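The loop structure of LG can be sketched as below; a hedged illustration of the described behaviour (not the paper's implementation), assuming `concepts` maps each concept to its (taxonomy path, frequency) pair, `depth` gives taxonomy depths, and each replacement climbs one step to the immediate hypernym:

```python
def lg_generalize(text, concepts, depth, theta_f, d):
    """Sketch of Algorithm 2 (LG): rare concepts are replaced by their
    hypernym while its depth stays >= d; frequencies merge into the
    hypernym's entry, and the process repeats until nothing changes."""
    tokens = text.split()
    changed = True
    while changed:                                    # outer while-loop
        changed = False
        for c in list(concepts):                      # inner for-loop
            path, f = concepts[c]
            if f >= theta_f or len(path) < 2:
                continue
            ch = path[1]                              # immediate hypernym
            if depth[ch] < d:                         # would pass level d
                continue
            tokens = [ch if t == c else t for t in tokens]
            hp, hf = concepts.get(ch, (path[1:], 0))  # merge or add entry
            concepts[ch] = (hp, hf + f)
            del concepts[c]
            changed = True
    return " ".join(tokens)
```

Under a toy taxonomy banana → fruit → food → root with d = 1, "banana is nutritious" climbs to "food is nutritious", matching the example given for the LG strategy.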

Deep learning model
After the text generalization phase outlined in the previous section completes, the summaries are produced by an encoder-decoder deep learning model, inspired by the "Sequence-to-sequence attentional model" (See et al., 2017). The encoder consists of a bi-directional LSTM (Graves et al., 2013), the decoder of a unidirectional LSTM, and the attention mechanism employed is similar to that of Bahdanau et al. (2014). Words are represented using a neural language model such as word2vec and the overall model is trained on article-summary pairs. Once the training phase is over, the model is expected to predict an output vector of tokens Y = (y_1, y_2, . . .) (summary) given an input vector of tokens X = (x_1, x_2, . . .) (text).
During training, the sequence of tokens (word embeddings) of the source text X = (x_1, x_2, . . . , x_n) is given to the encoder one-by-one in forward and reverse order, producing a hidden state h_i = bi_lstm(x_i, h_{i−1}) for each embedding x_i. Then, the target sequence of tokens Y = (y_1, y_2, . . . , y_m) is given to the decoder, which learns to predict the next word y_t given the previous one y_{t−1}, the state of the decoder s_t = lstm(s_{t−1}, y_{t−1}, c_t) and the context vector c_t, as computed by the attention mechanism. More specifically, the context vector c_t is computed as a weighted sum of the encoder hidden states h_i, according to Equations 1−3 below, where a_{ti} is the weight, at each time step t, of the hidden state of the encoder h_i (i.e. a_{ti} indicates the importance of h_i), e_{ti} indicates how well the output of step t matches the input around word x_i, s_{t−1} is the previous state of the decoder, and W_h, W_s and b are the weights and bias, respectively. Summary prediction is achieved using beam search (Graves, 2012; Boulanger-Lewandowski et al., 2013); at each time step of the beam search-based decoder, the w candidate tokens with the highest log-probability are kept in order to determine the best output summary, where w is the beam width.
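Written out in the standard additive-attention form of Bahdanau et al. (2014), which the description above follows (the learned projection vector v is an assumption of that formulation, as it is not named explicitly in the text), Equations 1-3 read:

```latex
e_{ti} = v^{\top} \tanh\left( W_h h_i + W_s s_{t-1} + b \right) \qquad (1)

a_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{n} \exp(e_{tk})} \qquad (2)

c_t = \sum_{i=1}^{n} a_{ti}\, h_i \qquad (3)
```

Equation 2 normalizes the alignment scores into attention weights, and Equation 3 forms the context vector c_t as the weighted sum of encoder states described above.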

Post-processing of the predicted summary
Since the output of the deep learning model described in Section 4 is in generalized form, a post-processing technique for determining the specific meaning of each general concept is necessary.
More specifically, a method should be devised that would match the generalized concepts of the predicted summary with the appropriate tokens of the original text. Essentially, this is a problem of optimal bipartite matching, between the general concepts of the (generalized) summary and candidate concepts of the original text. To address this issue, Algorithm 3 is proposed, which performs the best matching based on the similarity of the context around the generalized concepts of the summary and the candidate concepts of the text.

Algorithm 3 Matching Algorithm
Require: genSum, text, T

The input of Algorithm 3 is the generalized summary genSum, the original text text and the taxonomy of concepts T. In the first loop (lines 4−14), the similarity s between the context of each generalized token token_s and each token token_a of the source text that has a hypernym c similar to token_s is computed (line 9), and the tuple (token_s, token_a, s) is added to the set cr of candidate replacements of the generalized concepts (line 10). When all the generalized concepts of the (generalized) summary have been examined, cr is sorted in descending order according to s (line 15). In the second loop (lines 16−21), token_s is replaced by the token_a of maximum s (line 18) and is subsequently removed from gc (line 19). Eventually, Algorithm 3 returns the final summary (line 22) in human-readable form, which also contains specific information according to the source text.
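The two loops of Algorithm 3 can be sketched as a greedy best-first matching. The snippet below is an illustrative sketch, assuming generalized tokens are marked with underscores (as in the NEG examples), and that a hypernym test `is_hyper` and a context `similarity` function are supplied:

```python
def match(gen_summary, text, is_hyper, similarity):
    """Sketch of Algorithm 3: replace each generalized summary token by the
    text token whose context is most similar, in descending similarity order."""
    sum_toks, txt_toks = gen_summary.split(), text.split()
    gc = [i for i, t in enumerate(sum_toks) if t.startswith("_")]
    cr = []                                    # candidate replacements
    for i in gc:                               # first loop (lines 4-14)
        for j, ta in enumerate(txt_toks):
            if is_hyper(sum_toks[i], ta):
                cr.append((i, ta, similarity(sum_toks, i, txt_toks, j)))
    cr.sort(key=lambda x: -x[2])               # descending by s (line 15)
    for i, ta, s in cr:                        # second loop (lines 16-21)
        if i in gc:
            sum_toks[i] = ta
            gc.remove(i)
    return " ".join(sum_toks)
```

The greedy pass over the sorted candidates ensures that each generalized token is resolved once, by its highest-similarity candidate.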
Algorithm 3 works for both strategies of Section 3.1. In the LG strategy (Section 3.1.2), it is trivial to check whether token_s exists in the taxonomy path of token_a and therefore becomes a candidate for replacement. In the case of NEG (Section 3.1.1), token_s (e.g. a general concept of the summary such as _location_ or _person_) may be replaced by a concept of the article when the taxonomy path of the latter contains the former.
Finally, an important aspect affecting the performance of Algorithm 3 is the choice of the similarity function (line 9), which is a hyperparameter of the approach. Candidate similarity functions range from well-established indices like the cosine distance or the Jaccard coefficient to more complex measures like the word mover's distance (Kusner et al., 2015) and the Levenshtein edit distance (Yujian and Bo, 2007). Of course, the optimal choice is highly dependent on the available data; we reason further on this subject in the experimental part of this work.
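One of the simplest candidates, the cosine similarity of averaged embeddings over the context windows around the two tokens, can be sketched as follows; the embedding table `emb` and the window sizes are illustrative assumptions (Section 6 reports windows of 10 and 6 for the candidate and generalized concepts, respectively):

```python
import math

def window(tokens, i, size):
    """Context window of up to `size` tokens on each side of position i."""
    return tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size]

def avg_vector(words, emb):
    """Average the embeddings of the in-vocabulary words, if any."""
    vecs = [emb[w] for w in words if w in emb]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[j] for v in vecs) / len(vecs) for j in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def context_similarity(sum_toks, i, txt_toks, j, emb, ws=6, wa=10):
    """Cosine similarity between averaged context embeddings of a
    generalized summary token (i) and a candidate text token (j)."""
    u = avg_vector(window(sum_toks, i, ws), emb)
    v = avg_vector(window(txt_toks, j, wa), emb)
    return cosine(u, v) if u and v else 0.0
```

In practice the embeddings would come from a pre-trained model such as word2vec; the toy table here only serves to make the computation concrete.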

Experiments & Results
The experimental methodology followed in this work is in accordance with widely-adopted practices in the relevant literature (Rush et al., 2015; Nallapati et al., 2016; Chopra et al., 2016; See et al., 2017; Gao et al., 2019).

Datasets
Two popular datasets used in automatic TS tasks have been selected: Gigaword (Napoles et al., 2012) and DUC 2004 (Over et al., 2007). The first dataset, Gigaword, is obtained as described by Rush et al. (2015) and further preprocessed in order to remove duplicate entries, punctuation and summaries whose length is greater than or equal to the length of the articles they summarize. Moreover, the dataset has been normalized by expanding the contractions in the text (e.g. "I've" to "I have"). After the completion of this step, the training set contains about 3 million article-summary pairs which consist of 99,224 unique words (out of a total of 110 million words). The average article and summary lengths are 28.9 and 8.3 words, respectively. Finally, 4,000 pairs have been selected randomly from the test set to form the validation set and another 4,000 pairs were randomly selected to form the final test set, as is commonly done in the relevant literature (Rush et al., 2015; Nallapati et al., 2016; Chopra et al., 2016; Gao et al., 2019).
The DUC 2004 dataset, on the other hand, contains 500 news articles and 4 human-generated summaries for each one of them. The same preprocessing methodology is applied to this dataset as well, but since it contains very few instances it is solely used for evaluation purposes (and not during model training). As is common practice in relevant experimental procedures, only the first sentence of each article is used and the summaries are set to have a maximum length of 75 bytes (Rush et al., 2015; Nallapati et al., 2016; Gao et al., 2019).

Baseline and competitive approaches
The deep learning model outlined in Section 4 serves as the baseline approach. Its optimal hyperparameters are reported in the subsequent section; however, no generalization scheme is used. The baseline approach is tested on both datasets (Gigaword and DUC 2004).
On the DUC 2004 dataset, the best configurations of the proposed approach are additionally compared against previously reported state-of-the-art systems. Such a direct comparison is not possible for the Gigaword dataset, due to the extra preprocessing steps of our approach and the random sampling of the testing data.

Parameter tuning
The methodology outlined in this work depends on a number of parameters and hyperparameters. Initially, the neural language model for the vector representation of words must be decided upon; after brief experimentation with various representations and vector spaces, pre-trained word2vec embeddings of size 300 were selected.
Next, a suitable similarity function for Algorithm 3 (line 9) must be specified. Several notions of word similarity have been considered, ranging from simple indices between single words (e.g. cosine similarity, Jaccard coefficient) to more advanced measures like the word mover's distance (Kusner et al., 2015) and the Levenshtein edit distance (Yujian and Bo, 2007). The best results were achieved by the combination of the cosine similarity of averaged word2vec vectors and the cosine similarity based on bags of words. In particular, the best performance was achieved when the windows around the candidate and the generalized concepts were set to 10 and 6, respectively.
The optimal hyper-parameters of the deep learning model (Section 4) have been determined to be as follows: the encoder (bi-directional LSTM) consists of two layers (of size 200 each), while the decoder (unidirectional LSTM) is single-layered, again of size 200. The batch size has been set to 64, the learning rate to 0.001 and the training data were randomly shuffled at each epoch. The employed optimization method has been the Adam algorithm (Kingma and Ba, 2014), with gradient norm clipping (Pascanu et al., 2013) and cross-entropy as the loss function (Golik et al., 2013). Finally, all words of the vocabulary have been considered in the training phase and a beam search of width equal to 4 has been used in the evaluation phase.
In order to assess the effect of the two generalization strategies discussed in Section 3.1, three distinct system configurations have been evaluated. The first one is the baseline approach of Section 6.2. The second system is an extension of the baseline, using NEG as the generalization methodology, and the third one is also an extension of the baseline, employing the LG strategy.

Input text: police raided several locations near nobe after receiving word of a threat but no evidence of a planned attack was found
Generalized text: police raided several locations near _location_ after receiving word of a threat but no evidence of a planned attack was found
Generalized summary: police raided several locations near _location_
Output summary: police raided several locations near nobe

Table 1: An example of the NEG strategy from the input text to the output summary

Procedure
As discussed above, the experimental procedure includes three sets of experiments in total, with two of them based on the generalization strategies of Section 3.1. The WordNet taxonomy of concepts has been used (Miller, 1995; Fellbaum, 1998), out of which the hypernyms and the taxonomy paths have been extracted. To select the appropriate synset for extracting its taxonomy path, we use the WordNet first sense, as it has proved to be a very hard baseline in knowledge-based word sense disambiguation approaches (Raganato et al., 2017). Both generalization strategies are only applied to the nouns in the text, which are identified by the application of part-of-speech tagging, specifically the Stanford log-linear part-of-speech tagger (Toutanova et al., 2003). The set of named entities used in the NEG strategy (Section 3.1.1) is E = {Location, Person, Organization}, as the datasets contain news articles which are dominated by such entities.

Input text: for the second day in a row astronauts boarded space shuttle endeavour on friday for liftoff on nasa first space station construction flight
Generalized text: for the second day in a row astronauts boarded space equipment endeavour on friday for rise on nasa first space station construction flight
Generalized summary: astronauts boarded spacecraft for rise
Output summary: astronauts boarded spacecraft for liftoff

Table 2: An example of the LG strategy from the input text to the output summary
The named entities are extracted from the text using a named entity recognizer (NER) (specifically, the Stanford NER, Finkel et al., 2005) in conjunction with the WordNet taxonomy. Firstly, the pre-trained NER is executed and then the remaining named entities are extracted from WordNet; when a term in the text has a hypernym in the predefined set of named entities E, this word is annotated as a named entity. The performance of this generalization strategy is assessed for various thresholds of word frequency θ f (as stated in the respective Section, a word is generalized only if its frequency in the dataset is less than θ f ).
The level of generalization (i.e. the taxonomy depth of a generalized concept) used in LG (Section 3.1.2) has been set to d = 5. This level has been chosen because the concepts become very general when d < 5, rendering the production of the final summary a difficult task (Section 5). In a similar fashion to the NEG strategy, the performance of the LG approach is assessed for various thresholds of word frequency θ_f. The overall architecture and all model configurations were trained on a single Titan XP GPU. Each training epoch took approximately 3.5 hours and all models converged around epoch 15.
Finally, the performance of all systems is measured with the official ROUGE package (Lin, 2004), reporting ROUGE-1 (word overlap), ROUGE-2 (bigram overlap) and ROUGE-L (longest common subsequence). More specifically, for the Gigaword testing data the F-measure of the ROUGE score is reported, while for the DUC dataset the evaluation metric is the standard ROUGE recall (Nallapati et al., 2016; Chopra et al., 2016; Gao et al., 2019). Table 1 illustrates an example of the NEG approach, which includes the input text, the generalized text (after the application of the NEG algorithm), the predicted generalized summary and the output summary (after post-processing the predicted summary); the underlined words are those that have been generalized or specialized back. Similarly, Table 2 outlines an example of the LG approach. Table 3 illustrates the ROUGE scores on the Gigaword dataset for both generalization strategies (NEG, LG) and various thresholds of word frequency θ_f. Similarly, Table 4 contains the ROUGE scores on the DUC 2004 dataset. Apart from the NEG-infinity and LG-infinity configurations (which over-generalize), all other configurations of our model outperform the baseline approach on both datasets.
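To make the metrics concrete, ROUGE-1 can be computed by hand as clipped unigram overlap; this is only an illustrative sketch of what the score measures, whereas the reported results use the official ROUGE package:

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap ROUGE-1: recall, precision and F-measure."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision, "f1": f1}
```

Recall (reported for DUC 2004) rewards covering the reference, while the F-measure (reported for Gigaword) also penalizes overly long candidates.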

Results
Intuitively, improved results were expected especially when generalizing low-frequency words, as machine learning approaches typically require a sufficient number of samples in order to be trained properly. This is exactly the case for the LG strategy, as the best results are obtained when generalizing words that have at most 100 occurrences (θ_f = 100) in the Gigaword dataset. Similarly, the best ROUGE-1 and ROUGE-2 scores for the LG strategy on the DUC 2004 dataset are also obtained when θ_f = 100. However, the NEG strategy exhibits its best performance at θ_f = 500 on the Gigaword dataset and at θ_f = 1000 on the DUC 2004 dataset, with the exception of the ROUGE-2 metric, which is maximized at θ_f = 500.
Therefore, the LG strategy seems to be better suited to improving the performance of the deep learning system when generalizing low-frequency words. On the other hand, the NEG strategy has a positive effect on system performance, even though frequent words (θ_f ≥ 500) are generalized to the predefined named entities. This may be because most words describing named entities (especially those in E) have a specific function within the text, and the reduction of their number (through the generalization to named entities) may lead to more accurate predictions.
In both strategies, the configurations that generalize all concepts regardless of their frequency (θ_f = ∞) exhibit the worst performance. In these cases of over-generalization, the deep learning model fails to learn the particular function of each word, as the generalized terms have a wide range of uses in the text. Another possible explanation of this failure is that the post-processing task of producing the final summary is not able to accurately match the generalized concepts with specific words, due to the large number of the former. Obviously, a trade-off exists between θ_f and the obtained performance. The last lines of Table 4 also show that the best NEG and LG configurations outperform the other systems in terms of the ROUGE-2 and ROUGE-L scores and demonstrate near-optimal performance when the ROUGE-1 score is considered, thereby indicating the robustness of the proposed methodology on the DUC 2004 dataset. In the case of the Gigaword dataset, the further preprocessing of the data has led to significant performance improvements, especially in comparison to previous work (Chopra et al., 2016; Gao et al., 2019). Even though the aforementioned steps have resulted in more informative and accurate summaries, they do not permit a direct comparison with previously reported results.

Conclusion and Future Work
Even though deep learning approaches have been widely used in abstractive TS, it is evident that their combination with semantic-based or structure-based methodologies needs to be more thoroughly studied. In this direction, the proposed novel framework combines deep learning techniques with semantic-based content generalization so as to produce abstractive summaries in generalized form, which, in turn, are transformed into the final summaries. The experimental results have demonstrated that the followed approach enhances the performance of deep learning models.
The positive results may be attributed to the optimization of the parameters of the deep learning model and to the ability of the method to handle OOV and very low-frequency words. The obtained results show that the proposed approach is an effective methodology for handling OOV or rare words and that it improves the performance of text summarization.
Of course, certain aspects of the proposed methodology could be extended. Since currently only nouns are considered for generalization, an expansion to verbs could result in additional improvements. Moreover, as ambiguity is a challenging problem in natural language processing, it would be interesting to capture the particular meaning of each word in the text, so that our methodology uncovers the specific semantic meaning of words. Finally, a distinct semantic representation of each word could further enhance the performance of the deep learning model.