Sentence Packaging in Text Generation from Semantic Graphs as a Community Detection Problem

An increasing amount of research tackles the challenge of text generation from abstract ontological or semantic structures, which are by their very nature potentially large connected graphs. These graphs must be “packaged” into sentence-wise subgraphs. We interpret the problem of sentence packaging as a community detection problem with post-optimization. Experiments on the texts of the VerbNet/FrameNet-annotated Penn Treebank, which have been converted into graphs by a coreference merge using Stanford CoreNLP, show a high F1-score of 0.738.


Introduction
An increasing amount of research in Natural Language Text Generation (NLTG) tackles the challenge of generation from abstract ontological (Bontcheva and Wilks, 2004; Sun and Mellish, 2006; Bouayad-Agha et al., 2012; Banik et al., 2013; Franconi et al., 2014; Colin et al., 2016) or semantic (Ratnaparkhi, 2000; Varges and Mellish, 2001; Corston-Oliver et al., 2002; Kan and McKeown, 2002; Bohnet et al., 2010; Flanigan et al., 2016) structures. Unlike input structures to surface generation, which are syntactic trees, ontological and genuine semantic representations are predominantly connected graphs or collections of elementary statements (e.g., RDF-triples or minimal predicate-argument structures) in which recurring elements are duplicated (but which can, again, be considered a connected graph). In both cases, the problem of dividing the graph into sentential subgraphs, which we will henceforth refer to as "sentence packaging", arises. In the traditional distribution of generation tasks, sentence packaging is largely avoided. It is assumed that the text planning module creates a text plan from selected elementary statements (elementary discourse units), establishing discourse relations between them. The sentence planning module then either aggregates the elementary statements contained in the text plan into more complex statements or keeps them as separate simple statements, depending on the language, style, preferences of the targeted reader, etc. (Shaw, 1998; Dalianis, 1999; Stone et al., 2003). Even if data-driven, as, e.g., in (Bayyarapu, 2011), this strategy may suggest itself mainly for input representations with a limited number of elementary elements and simple sentential structures as target.
In the context of scalable report (or any other narration) generation, which can be assumed to start, for instance, from large RDF-graphs (i.e., RDF-triples with cross-referenced elements), or from large semantic graphs, the aggregation challenge is incomparably more complex. In the light of this challenge and the fact that in a narration the discourse structure is, as a rule, defined over sentential structures rather than elementary statements, sentence packaging on semantic representations appears as an alternative that is worth exploring. More recent data-driven approaches to concept-to-text NLTG, e.g., (Konstas and Lapata, 2012), to text simplification, and to dialogue act realization, e.g., (Mairesse and Young, 2014; Wen et al., 2015), deal with sentence packaging, but, as a rule, all of them concern inputs of limited size, with at most 3 to 5 resulting sentence packages, while realistic large input semantic graphs may give rise to dozens. In what follows, we present a model for sentence packaging of large semantic graphs, which contain up to 75 sentences.
In general, the problem of sentence packaging consists in the optimal decomposition of a given graph into subgraphs, such that: (i) each subgraph is in itself a connected graph; (ii) the outgoing edges of the predicative vertices in a subgraph fulfil the valency conditions of these vertices (i.e., the obligatory arguments of a predicative vertex must be included in the subgraph); (iii) the appearance of a vertex in several subgraphs is subject to linguistic restrictions of co-reference. In graph-theoretical terms, sentence packaging can thus be viewed as an approximation of dense subgraph decomposition, which is a very prominent area of research in graph theory. It has also been studied in the context of numerous applications, including biomedicine (e.g., for protein interaction network (Bader and Hogue, 2003) or brain connectivity analysis (Hagmann et al., 2008)), web mining (Sarıyuece et al., 2015), influence analysis (Ugander et al., 2012), community detection (Asim et al., 2017), etc. Our model is inspired by the work on community detection. The model has been validated in experiments on the VerbNet/FrameNet annotated version of the Penn TreeBank (Mille et al., 2017), in which coreferences in the individual texts of the corpus have been identified using the Stanford CoreNLP toolkit (Manning et al., 2014) and fused to obtain a graph representation. The experiments show that we achieve an F1-score of 0.738 (with a precision of 0.792 and a recall of 0.73), which indicates that our model is able to cope with the problem of sentence packaging in NLTG.
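Conditions (i) and (ii) above can be checked mechanically for a candidate subgraph. The following is a minimal sketch, not the implementation used in the experiments: the edge representation as (source, label, target) triples, the labels "A1" and "A2", and the `obligatory` valency table are illustrative assumptions. Condition (iii) depends on linguistic knowledge and is not covered.

```python
from collections import deque

def is_connected(nodes, edges):
    """Condition (i): the subgraph induced by `nodes` is connected (edge direction ignored)."""
    nodes = set(nodes)
    if not nodes:
        return False
    adjacency = {n: set() for n in nodes}
    for src, _label, dst in edges:
        if src in nodes and dst in nodes:
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first traversal from an arbitrary node
        current = queue.popleft()
        for nxt in adjacency[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen == nodes

def satisfies_valency(nodes, edges, obligatory):
    """Condition (ii): every predicate in `nodes` keeps all its obligatory arguments.

    `obligatory` is a hypothetical table mapping a predicate to the set of
    edge labels it requires; vertices missing from it require nothing.
    """
    nodes = set(nodes)
    for predicate in nodes:
        required = obligatory.get(predicate, set())
        present = {label for src, label, dst in edges
                   if src == predicate and dst in nodes}
        if not required <= present:
            return False
    return True
```

A decomposition would be rejected if any of its subgraphs fails either check, for example a subgraph that contains a predicate but not one of its obligatory arguments.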
The remainder of the paper is structured as follows. In Section 2, we introduce the semantic graphs that are assumed to be decomposed and analyze them. Section 3 outlines the experiments we carried out, and Section 4 discusses the outcome of these experiments. In Section 5, we briefly review the work that is related to ours. In Section 6, finally, we draw some conclusions and outline possible lines of future work.

Overview
We assume a semantic graph to which the problem of sentence packaging is applied to be a labeled graph with semantemes, i.e., word sense disambiguated lexical items, as vertex labels and predicate-argument relations as edge labels. The vertex labels are furthermore assumed to be typed in terms of semantic categories such as 'action', 'object', 'property', etc. A semantic graph of this kind can be an Abstract Meaning Representation (AMR) (Banarescu et al., 2013) obtained from the fusion of coreference vertices across individual sentential AMRs, or a VerbNet or FrameNet structure obtained from the merge of sentential VerbNet or FrameNet structures, respectively, that contain coreferences. An RDF-triple store which is annotated with semantic metadata, e.g., in OWL (https://www.w3.org/OWL/), can equally be converted into such a graph (Rodriguez-Garcia and Hoehndorf, 2018). Without loss of generality, we will assume, in what follows, that our semantic graphs are hybrid VerbNet / FrameNet graphs in that we use first level VerbNet type ids / FrameNet type ids as vertex labels and VerbNet relations as edge labels.
As already mentioned in the Introduction, we use the VerbNet/FrameNet annotated version of the Penn TreeBank (henceforth dataset), to which we apply the co-reference resolution from Stanford CoreNLP to obtain a graph representation. We split the dataset into a development set and a test set, with 85% and 15% of the texts, which contain 78% and 22% of the sentences, respectively. Consider the schematic representation of the semantic graph of one of the texts from the development set in Figure 1. It consists of two isolated subgraphs: one of them (to the left) comprises three sentences and the second (to the right) corresponds to a single sentence. The blue (dark) nodes correspond to verbal and nominal predicate tokens. As illustrated in Figure 2, a significant number (to be precise: 94%) of the text graphs obtained after the co-reference merge in the development set contain subgraphs which combine several sentences (in total, 77% of sentences were combined), such that sentence packaging is a necessary task in the context of NLTG. Even if the number of texts with a large number of merged sentences is relatively small, we can observe in Figure 2 that the line corresponding to the cumulative sum has a constant slope for the majority of texts, which implies that the number of sentences per bin of texts of the same size is close to constant. This means that each bin contributes evenly when we evaluate the quality of the obtained packaging, since we focus on recovering the sentences and assessing each of them individually, without averaging within a text.
Figure 2: Sentence distribution in the graphs of the VerbNet/FrameNet annotated version of the Penn TreeBank

Graph Analysis for Sentence Packaging
The generation information that characterizes a graph in the context of sentence packaging concerns: (i) the optimal number of sentences into which this given graph can be divided, and (ii) the profile (in semantic or graph theory terms) of a typical sentence of this graph. We use this information in the subsequent stages of sentence packaging.

Estimation of the Number of Sentences
In order to estimate the number of sentences into which a given semantic graph is to be decomposed, we built a linear regression model with Ridge regularization on the development set with the features listed in the first column of Table 1. The statistics on the chosen features are shown in the other columns, where Q2 is the median, N1 is the absolute number of sentences with a non-zero value of a parameter, and N2 is the corresponding relative number.
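To make the regression step concrete, the following sketch fits a ridge model with a single feature in closed form. It is a simplification under stated assumptions: the model described above uses several features from Table 1, whereas here one hypothetical feature (for instance, the number of predicate vertices in a graph) is used, the intercept is left unpenalized, and the function names and the value of `alpha` are illustrative.

```python
def fit_ridge_1d(xs, ys, alpha=1.0):
    """Closed-form ridge regression for one feature with an unpenalized intercept.

    Minimizes sum((y - w*x - b)**2) + alpha * w**2.
    Requires alpha > 0 or at least two distinct values in `xs`.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    w = s_xy / (s_xx + alpha)   # the penalty shrinks the slope towards zero
    b = y_mean - w * x_mean
    return w, b

def predict_sentence_count(w, b, x):
    """Round the prediction to an integer and never return fewer than one sentence."""
    return max(1, round(w * x + b))
```

The rounded prediction is what a partitioning algorithm that needs a fixed number of parts would receive as input.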
The highest R²-value was reached with the combination of all features, including FrameNet   Figure 3. The value is high, which means that the obtained model allows an accurate prediction of the number of sentences and can be used as an input parameter in community detection algorithms. We did not opt for using the number of predicates corresponding to different types for the regression since most of the types cover less than 7% of sentences from the development set.

Sentence Profiling
In order to obtain the prototypical profiles of the sentences in our dataset, we enriched the types of features used for the linear regression model above with features that play an important role in sentence formation: the type(s) of the parent node(s) of each node in the development set and the types of its arguments. With these enriched features at hand, we first built a multivariate normal distribution (MVN) of the most common non-correlated features of sentences, chosen iteratively by cross-validation in such a way that the matrix of feature vectors is not singular for any set of folds. As an alternative to an iterative selection of the appropriate features, we applied Principal Component Analysis (PCA) (Jolliffe, 1986) to a space of the 100 most common features and selected the principal vectors that describe 90% of the variance for building an MVN distribution. This step made the matrix of sentence feature values invertible, as required for the MVN distribution.
We assessed the proximity of the sentences of the development set to these initially obtained MVN distributions. As illustrated in Figure 4, the distribution of the degrees of correspondence to the joint distribution of 20 non-correlated features is right-skewed, with many sentences on the left that fit the distribution poorly. In order to remedy this, for cases in which a significant part of the sentences of the development set (more than 15%) corresponds only weakly to the joint distribution, we implemented a clustering algorithm in the space of the selected features (K-means, k=10) and built the distribution separately for each cluster. The proximity of the profile of the sentence being packaged has been assessed with respect to the joint distribution of each of the clusters, with success, as the results in Table 2, Subsection 3.2 below show.
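The degree of correspondence of a sentence profile to a normal distribution can be illustrated as follows. This is a sketch under a simplifying assumption: because the selected features are non-correlated, the covariance matrix is taken to be diagonal, so the joint log-density is a sum of univariate terms. The actual model may use a full covariance matrix, and all function names are hypothetical.

```python
import math

def fit_diagonal_gaussian(rows):
    """Estimate per-feature means and variances from a list of feature vectors."""
    n = len(rows)
    dim = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(dim)]
    # A small floor keeps constant features from producing a zero variance.
    variances = [max(sum((row[j] - means[j]) ** 2 for row in rows) / n, 1e-9)
                 for j in range(dim)]
    return means, variances

def log_density(x, means, variances):
    """Log-density of `x` under a normal distribution with diagonal covariance."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, means, variances))

def best_cluster_fit(x, cluster_models):
    """Score `x` against one model per cluster and keep the best correspondence."""
    return max(log_density(x, means, variances)
               for means, variances in cluster_models)
```

With one model per K-means cluster, `best_cluster_fit` mirrors the idea of assessing a sentence against the distribution of each cluster and keeping the most appropriate one.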

Background
Community detection aims to cluster a given social network (graph) into groups of tightly connected or similar vertices (Asim et al., 2017). The different algorithms which have been proposed are often adapted to the particular characteristics of the investigated network (Fortunato and Hric, 2016). Some algorithms take into account only the network structure (the mutual arrangement of vertices and the relationships between them) and are aimed at maximizing the modularity value (Blondel et al., 2008). Other algorithms cluster the vertices by combining the most similar elements in terms of their attribute values without link analysis (Combe et al., 2015). Recently, the tendency has been to use both the relationships between vertices and their characteristics, and to identify overlapping groups for optimal network decomposition (Yang et al., 2013). In our work, we have so far experimented with algorithms which operate on the links between vertices and allow for fast partitioning of huge graphs.
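For readers unfamiliar with modularity, the quantity that structure-based algorithms such as LOUVAIN try to maximize can be computed for a given partition as shown below. The sketch uses the standard definition for an undirected, unweighted graph; the data layout (an edge list and a vertex-to-community mapping) is an assumption made for illustration.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected, unweighted graph.

    Q = sum over communities c of  L_c / m - (d_c / (2 * m)) ** 2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of the degrees of the vertices in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal_edges = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal_edges[cu] = internal_edges.get(cu, 0) + 1
    return sum(internal_edges.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())
```

For two triangles joined by a single edge, placing each triangle in its own community yields a clearly positive value, whereas putting all vertices into one community yields zero.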

Setup of the Experiments
We first began to experiment with three community detection algorithms: LOUVAIN (Blondel et al., 2008), METIS (Karypis and Kumar, 2000), and COPRA (Gregory, 2010). However, the first simple tests already showed that COPRA performed poorly on our data in that it decomposed each graph into a small set of isolated subgraphs that did not include all the vertices of the original graph (see the exact figures below). Therefore, we discarded COPRA from further experiments, while LOUVAIN and METIS were taken to serve as baselines. Since METIS requires as input the number of communities (= sentences) into which a given graph is to be decomposed, we use the linear regression presented in Subsection 2.2.1 as a preprocessing stage.
To improve the quality of the initial decomposition made using community detection algorithms (i.e., our baselines), we carried out a local descent search, adding neighbour vertices to each subgraph one by one and keeping them if the correspondence of the subgraph to the multivariate distribution increased. The optimization is performed as a post-processing stage as follows:
1. for each s ∈ S, with S := set of sentence subgraphs obtained by LOUVAIN / METIS
(a) determine the degree of correspondence to the joint distribution (in case of several subgraphs, choose the most appropriate one) that is to be optimized.
(b) apply local descent search, adding nodes from s′ ∈ S (with s′ ≠ s) iteratively each time when it leads to the increase of the optimized parameter (subgraphs can share common nodes, i.e., overlap)
2. stop local descent search when there is no node that improves s.
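The search procedure described above can be sketched as follows. This is an illustration rather than the implementation used in the experiments: the `score` callable stands in for the degree of correspondence to the joint distribution, candidates are restricted to direct neighbours of the current subgraph, and vertex ids are assumed to be sortable so that the order of visits is deterministic.

```python
def local_descent(subgraph, neighbours, score):
    """Greedily grow `subgraph` while adding a neighbouring vertex improves `score`.

    subgraph   -- iterable of vertex ids forming the initial community
    neighbours -- dict mapping a vertex id to the set of adjacent vertex ids
    score      -- hypothetical callable: set of vertex ids -> float (higher is better)
    """
    current = set(subgraph)
    best = score(current)
    improved = True
    while improved:
        improved = False
        frontier = set()
        for node in current:
            frontier |= neighbours.get(node, set())
        for candidate in sorted(frontier - current):
            trial = score(current | {candidate})
            if trial > best:          # keep the vertex only if the fit improves
                current.add(candidate)
                best = trial
                improved = True
    return current, best
```

Because each subgraph is optimized independently, the same vertex may end up in several subgraphs, which corresponds to the overlap permitted in the procedure.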
The F1-score was chosen as a measure for the comparison of the quality of decompositions obtained by different algorithms on the test set. It is calculated for each original sentence since we consider a sentence as a separate unit. Its value takes into account which part of the original sentence was covered by the obtained subgraph and how many nodes that did not belong to the original sentence were mistakenly appended. Each isolated subgraph corresponds to one unit only, although it can include several original sentences. For those original sentences that are not captured in the majority of their nodes in any individual subgraph, the F1-score is equal to 0. The macro-F1, i.e., the average F1-score over all sentences, is the final measure.
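The evaluation measure can be illustrated with the following sketch. It makes two simplifying assumptions that should be stated plainly: "captured in the majority of their nodes" is read as a strict majority of the sentence's nodes, and the restriction that each subgraph is assigned to one sentence only is not enforced. The function names are hypothetical.

```python
def sentence_f1(gold_nodes, subgraphs):
    """Best F1 of one ground truth sentence against the obtained subgraphs.

    Returns 0.0 if no subgraph covers more than half of the sentence's nodes.
    """
    gold = set(gold_nodes)
    best = 0.0
    for subgraph in subgraphs:
        subgraph = set(subgraph)
        overlap = len(gold & subgraph)
        if 2 * overlap <= len(gold):      # majority of the sentence not captured
            continue
        precision = overlap / len(subgraph)
        recall = overlap / len(gold)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

def macro_f1(gold_sentences, subgraphs):
    """Average of the per-sentence F1-scores over all ground truth sentences."""
    scores = [sentence_f1(gold, subgraphs) for gold in gold_sentences]
    return sum(scores) / len(scores)
```

A sentence whose best matching subgraph contains three of its four nodes plus one foreign node receives precision and recall of 0.75 each, while a sentence that is not covered contributes 0 to the average.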
The results are displayed in Table 2. 'No decomposition' stands for the case when any graph in the test set is considered to be a sentence (it can be considered as an additional baseline); 'METIS LR' for "METIS with linear regression as a preprocessing stage", 'DC K' for "descent search with K-means", and 'DC ¬K' for "descent search without K-means".
Table 2: Results of testing the obtained models
As already mentioned above, COPRA showed a very poor performance on our data. The exact numbers were: mean recall = 0.113, mean precision = 0.088, and mean F1-score = 0.084. Therefore, we did not include them in Table 2 and did not combine COPRA with other techniques.

Performance Assessment
We can observe that the local descent search with the chosen optimization function leads to an increase of the mean F1-score in each case. The use of a larger number of features with PCA leads to slightly poorer results, but still shows an improvement in comparison to the baseline community detection (LOUVAIN and METIS LR). However, METIS LR is somewhat better than our optimizations with respect to precision, and METIS LR+DC ¬K is the best (even if only by a very minor margin compared to METIS LR+DC K, which reaches the best F1-score).
The very low figures for 'No Decomposition', i.e., the interpretation of each single graph as a sentence, show us that the problem of sentence packaging (or, in other words, decomposition of textual semantic graphs into sentential subgraphs) is indeed a relevant problem in large scale semantics-to-text generation.
In the error analysis, we assessed several obtained subgraphs in detail and identified at least two causes of the low values of precision and recall. The first cause lies in a suboptimal performance of the coreference resolution related to the merge of co-referenced nodes. For example, for the entity 'Mr. Peladeau', which appeared in a given text ten times, the module generated a node labeled 'Peladeau' and ten nodes labeled 'Mr.', connecting the 'Peladeau' node to all ten 'Mr.' nodes. This decreased our precision. We fixed the erroneous graphs by combining non-root nodes that were connected to the same input and output nodes with the same types of arguments, and recalculated the measures. Some sentences were significantly affected by this change. For instance, for the mentioned example, the precision increased from 0.35 to 0.44. However, the overall mean F1-score increased only by 0.5% because this error affected a relatively small number of subgraphs.
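The repair of duplicated nodes described above can be sketched as follows. The sketch assumes that edges are (source, label, target) triples and that the edge label "NAME" used in the usage note is purely illustrative; it performs a single pass and therefore does not catch duplicates that only emerge after a first round of merging.

```python
def merge_duplicate_nodes(edges):
    """Merge non-root nodes that share identical labeled incoming and outgoing edges.

    Returns the deduplicated, sorted edge list and a mapping from every
    original node id to the id of its representative.
    """
    incoming, outgoing, nodes = {}, {}, set()
    for src, label, dst in edges:
        nodes.update((src, dst))
        outgoing.setdefault(src, set()).add((label, dst))
        incoming.setdefault(dst, set()).add((src, label))

    representative = {}   # neighbourhood signature -> first node seen with it
    mapping = {}
    for node in sorted(nodes):
        ins = frozenset(incoming.get(node, ()))
        if not ins:                       # root nodes are never merged
            mapping[node] = node
            continue
        signature = (ins, frozenset(outgoing.get(node, ())))
        mapping[node] = representative.setdefault(signature, node)

    merged = sorted({(mapping[s], label, mapping[d]) for s, label, d in edges})
    return merged, mapping
```

Applied to a 'Peladeau' node with several identical 'Mr.' children attached by the same hypothetical "NAME" relation, the function keeps one 'Mr.' node and redirects the remaining ones to it.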
Another cause of the poor quality of some obtained subgraphs was the creation of subgraphs that contained subgraphs of several ground truth sentences. This led to a low value of precision, even if the recall was relatively high. To account for this problem, we defined a procedure that allowed us to separate such compound graphs into a set of subgraphs. This procedure duplicates those nodes that have two or more non-overlapping input paths from roots which include a node with a defined VerbNet class. Since the output paths of duplicated nodes and the input paths without a node from VerbNet need not necessarily be assigned to all the copies of a node, we remove these paths to avoid overloading each single subgraph with redundant information.
The application of the node duplication procedure to the graphs obtained by LOUVAIN+PCA+DC K led to an increase of the overall mean precision (taking into account only covered ground truth sentences) from 0.85 to 0.96 and to a decrease of the recall from 0.86 to 0.67, since the procedure also affected some optimal sentence subgraphs by splitting them further into single clause subgraphs. At the same time, the coverage of the original sentences was improved (857 instead of 687 out of 908 were covered), which compensated for the lower recall and led to an increase of the F1-score by 10%. The potential values of precision and recall that could be reached if we combined subgraphs that belong to the same sentences are 0.91 and 0.77 respectively, which results in an F1-score of 0.83. To tackle the problem of combining the subgraphs of clauses, full-text clustering could be used (Devyatkin et al., 2015). Adding back the removed paths linked to copied nodes would also contribute to an increase of the overall quality of the sentences.

Example
For illustration, consider in Figure 5 a subgraph obtained from a larger initial graph, which is shown in Figure 6 (the obtained subgraph is circled). The subgraph corresponds to the ground truth subgraph with a precision of 0.938 and a recall of 0.882. It can be seen that the obtained subgraph contains enough information to generate a sentence with a meaning similar to that of the original one.
The original sentence that corresponds to the subgraph in Figure 5 is "He said the company is experimenting with the technique on alfalfa, and plans to include cotton and corn, among other crops."; cf. also Figure 7 for the text (with the corresponding sentence highlighted) captured by the initial graph. The text comprises 755 tokens in 41 sentences, which formed 10 isolated graphs after coreference resolution. The largest graph contains 578 vertices, which correspond to 32 sentences, with 18 vertices that link sentences. The LOUVAIN+PCA+DC K method applied to the whole graph detected 31 sentences out of 41. An additional separation of the obtained graphs by the procedure described above led to the detection of 9 extra sentences. Thus, 98% of the ground truth sentences were recovered to a certain extent.

Related Work
A number of natural language text generators take as input sentence structures, for instance, sentence templates, as in the case of SimpleNLG generators (Gatt and Reiter, 2009), syntactic structures, as in the case of surface-oriented generators (Belz et al., 2011; Mille et al., 2018a), or more abstract semantic structures such as, e.g., AMRs; cf., e.g., (May and Priyadarshi, 2017; Song et al., 2018). For these generators, the problem of sentence packaging or aggregation obviously does not arise. As already mentioned in the Introduction, in setups that start from input that is not yet cast into sentence structures, traditional NLTG foresees the task of (content) aggregation, which is dealt with as part of sentence planning (or microplanning): the elementary content elements, as assumed to be present in the text plan, are aggregated into more complex elements; see, among others, (Shaw, 1998; Dalianis, 1999; Stone et al., 2003). Our work is more in line with Konstas and Lapata (2012)'s data-driven concept-to-text model, which creates hypergraphs from the input database records; these hypergraphs are then projected onto multiple sentence reports. We also start from graphs (which we create from isolated semantic sentence structures by establishing coreference links between coinciding elements across different structures), only that we work with graphs that are considerably larger than those Konstas and Lapata work with (up to 75 resulting sentences per graph vs. >10 resulting sentences per graph). Furthermore, while we use community detection algorithms (and focus only on the problem of sentence packaging), they view the entire problem of the verbalization of a hypergraph as a graph traversal problem.
Figure 6: Example of the initial graph with one of the detected sentence subgraphs circled
Figure 7: Original plain text with the recovered sentence subgraph highlighted
The difference in the size of the input data (and thus the number of the resulting sentences) is also a distinctive feature of our proposal when we compare it to other works that deal with sentence packaging. For instance,  split in their experiments on text simplification complex sentences into 2 to 3 more simple sentences. As content representation, they use the WebNLG dataset of RDF-triples . To split a given set of RDF-triples into several subsets, they learn a probabilistic model. Wen et al. (2015) use LSTM-models to generate utterances from a given sequence of tokens in the context of a dialogue application.
Since for our experiments we apply coreference resolution to create large connected graphs from the VerbNet/FrameNet annotated sentences of the Penn Treebank, our work could also be considered to be related to the recent efforts on the creation of datasets for NLTG; cf., e.g., (Novikova et al., 2017; Mille et al., 2018b). However, so far, the coreference resolution has been entirely automatic, with no subsequent thorough validation and manual correction. Both would be needed to ensure high quality of the resulting dataset.

Conclusions and Future Work
We have presented a community detection-based strategy for packaging semantic (VerbNet/FrameNet) graphs into sentential subgraphs and tested it on a large dataset. We have shown that, in principle, sentence packaging can be interpreted as a community detection problem since community detection algorithms aim to identify densely connected subgraphs, which can be expected from sentential structures. The evaluation suggests that the subgraphs obtained by community detection can be further improved by a post-processing stage, e.g., by descent search or PCA.
The duplication of nodes for an additional decomposition of obtained graphs led to an increase of the performance. To avoid the unnecessary splitting of optimal subgraphs, as observed in some cases, the offered procedure might be furthermore restricted, for example, by duplicating only the nodes with high centrality measures.
In the future, we plan to explore community detection algorithms which will allow us to take the attributes of the vertices into account. For this purpose, the optimization function must be modified to take into account the mutual compatibility of vertices rather than their similarity, since vertices within one sentence usually have different properties and do not form homogeneous communities in a general sense. Furthermore, we plan to explore to what extent reinforcement learning-based graph partitioning algorithms that take the specifics of the semantic graphs into account in terms of features are suitable for the problem of sentence packaging.