A Domain Agnostic Approach to Verbalizing n-ary Events without Parallel Corpora

We present a method for automatically generating descriptions of biological events encoded in the KB BIO 101 knowledge base. We evaluate our approach on a corpus of 336 event descriptions, provide a qualitative and quantitative analysis of the results obtained and discuss possible directions for further work.


Introduction
While earlier work on data-to-text generation relied heavily on handcrafted linguistic resources, more recent data-driven approaches have focused on learning a generation system from parallel corpora of data and text. Thus, (Angeli et al., 2010; Chen and Mooney, 2008; Wong and Mooney, 2007; Konstas and Lapata, 2012b; Konstas and Lapata, 2012a) trained and developed data-to-text generators on datasets from various domains including the air travel domain (Dahl et al., 1994), weather forecasts (Liang et al., 2009; Belz, 2008) and sportscasting (Chen and Mooney, 2008). In both cases, considerable time and expertise must be spent on developing the required linguistic resources. In the handcrafted, symbolic approach, appropriate grammars and lexicons must be specified, while in the parallel corpus based learning approach, an aligned data-text corpus must be built for each new domain. Here, we explore an alternative approach which uses non-parallel corpora for surface realisation from knowledge bases and which can be applied to any knowledge base for which large textual corpora exist.
A more specific, linguistic issue which has received relatively little attention is the unsupervised verbalisation of n-ary relations and the task of appropriately mapping KB roles to syntactic functions. In recent work on verbalising RDF triples, relations are restricted to binary relations (called "property" in the RDF language) and the issue is therefore intrinsically simpler. In symbolic approaches dealing with n-ary relations, the mapping between syntactic and semantic arguments is determined by the lexicon and must be manually specified. In data-driven approaches, the mapping is learned from the alignment between text and data and is restricted by cases seen in the training data. Instead, we learn a probabilistic model designed to select the most probable mapping. In this way, we provide a domain independent, fully automatic, means of verbalising n-ary relations.
The paper is structured as follows. In Section 2, we discuss related work. In Section 3, we present the method used to verbalise KB events and their participants. In Section 4, we evaluate our approach on a corpus of 336 test cases, provide a qualitative and quantitative analysis of the results obtained and discuss possible directions for further work. Section 5 concludes.

Related Work
There has been much research in recent years on developing natural language generation systems which support verbalisation from knowledge and data bases.
Many of the existing KB verbalising tools rely on generating so-called Controlled Natural Languages (CNL), i.e., languages engineered to be read and written almost like a natural language but whose syntax and lexicon are restricted to prevent ambiguity. For instance, the OWL verbaliser integrated in the Protégé tool (Kaljurand and Fuchs, 2007) is a CNL based generation tool which provides a verbalisation of every axiom present in the ontology under consideration. Similarly, (Wilcock, 2003) describes an ontology verbaliser using XML-based generation. Finally, recent work by the SWAT project has focused on producing descriptions of ontologies that are both coherent and efficient (Williams and Power, 2010). In these approaches, the mapping between relations and verbs is determined either manually or through string matching, and KB relations are assumed to map to binary verbs.
More complex NLG systems have also been developed to generate text (rather than simple sentences) from knowledge bases. Thus, the MIAKT project (Bontcheva and Wilks, 2004) and the ONTOGENERATION project (Aguado et al., 1998) use symbolic NLG techniques to produce textual descriptions from some semantic information contained in a knowledge base. Both systems require some manual input (lexicons and domain schemas). More sophisticated NLG systems such as TAILOR (Paris, 1988), MIGRAINE (Mittal et al., 1994), and STOP (Reiter et al., 2003) offer tailored output based on user/patient models. While offering more flexibility and expressiveness, these systems are difficult for non-NLG experts to adapt because they require the user to understand the architecture of the NLG system (Bontcheva and Wilks, 2004). Similarly, the NaturalOWL system (Galanis et al., 2009) has been proposed to generate fluent descriptions of museum exhibits from an OWL ontology. These approaches, however, rely on extensive manual annotation of the input data.
Related to the work discussed in this paper is the task of learning subcategorization information from textual corpora. Automatic methods for subcategorization frame acquisition have been proposed from general text corpora, e.g., (Briscoe and Carroll, 1997), (Korhonen, 2002), (Sarkar and Zeman, 2000) and specific biomedical domain corpora as well (Rimell et al., 2013). Such works are limited to the extraction of syntactic frames representing subcategorization information. Instead, we focus on relating the syntactic and semantic frame and, in particular, on the linking between syntactic and semantic arguments.
Another trend of work relevant to this paper is generation from databases using parallel corpora of data and text. (Angeli et al., 2010) train a sequence of discriminative models to predict data selection, ordering and realisation. (Wong and Mooney, 2007) use techniques from statistical machine translation to model the generation task and (Konstas and Lapata, 2012b; Konstas and Lapata, 2012a) learn a probabilistic context-free grammar modelling the structure of the database and of the associated text. Various systems from the KBGEN shared task (Banik et al., 2013), namely (Butler et al., 2013), (Gyawali and Gardent, 2013) and (Zarrieß and Richardson, 2013), perform generation from the same input data source as ours and use parallel text for supervision. Our approach differs from all these approaches in that it does not require parallel text/data corpora. Also, in contrast to the template extraction approaches described in (Kondadadi et al., 2013), (Ell and Harth, 2014) and (Duma and Klein, 2013), we cannot directly match the input data to surface text in the sentences obtained from non-parallel biomedical texts. Instead, we must extract the subcategorization frame and learn the linking between semantic and syntactic arguments.

Methodology
Our goal is to automatically generate natural language verbalisations of the biological event descriptions encoded in KB BIO 101 (Chaudhri et al., 2013) whereby an event description is assumed to consist of an event, its arguments and the roles relating each argument to the event. In the KB BIO 101 knowledge base, events are concepts of type EVENT (e.g., RELEASE), arguments are concepts of type ENTITY (e.g., GATED-CHANNEL, VASCULAR-TISSUE, IRON) and roles are relations between events and entities (e.g., AGENT, PATIENT, PATH, INSTRUMENT).
We propose a probabilistic method which extracts possible verbalisation frames from large biology specific domain corpora and uses probabilities both to select an appropriate frame given an event description and to determine the mapping between syntactic and semantic arguments. That is, probabilities are used to determine which event argument fills which syntactic function (e.g., subject, object) in the produced verbalisation.
We start by giving a brief overview of the content and the structure of KB BIO 101 (Section 3.1). We then describe the steps involved in building our generation system.

KB Bio 101
The foundational component of the KB is the Component Library (CLIB), an upper ontology which is linguistically motivated and designed to support the representation of knowledge for automated reasoning (Gunning et al., 2010). CLIB adopts four simple top level distinctions: (1) entities (things that are); (2) events (things that happen); (3) relations (associations between things); and (4) roles (ways in which entities participate in events). Figure 1 shows an example representation for a blocking event between a plasma membrane and hydrophobic compounds which could be verbalised as The plasma membrane blocks hydrophobic compounds. In this representation, Block is a subclass of the event class. Plasma-Membrane and Hydrophobic-Compound are subclasses of the entity class. The Plasma-Membrane and the Hydrophobic-Compound concepts stand respectively in an instrument and in an object role relation with the Block event.
KB BIO 101 is organized into a set of concept maps, where each concept map corresponds to a biological entity or process. It was encoded by biology teachers and contains around 5,000 concept maps. KB BIO 101 is available for download for academic purposes in various formats including OWL 2 .
To test and evaluate our approach, we focus on the subpart of KB BIO 101 isolated for the KBGEN surface realisation shared task by (Banik et al., 2013). In this dataset, content units were semi-automatically selected from KB BIO 101 in such a way that (i) the set of relations in each content unit forms a connected graph; (ii) each content unit can be verbalised by a single, possibly complex sentence which is grammatical and meaningful and (iii) the set of content units contains as many different relations and concepts of different semantic types (events, entities, properties, etc.) as possible.
That is, the KB content extracted for KBGEN isolates event descriptions which can be verbalised by a single, coherent sentence. To evaluate the ability of our generator to generate event descriptions, we further process this dataset to produce all KB fragments which represent a single event with roles to entities only. The statistics for the resulting dataset (dubbed KBGEN+) are shown in Table 1.

Corpus Collection
We begin by gathering sentences from several publicly available biomedical domain corpora. These include the BioCause (Mihăilă et al., 2013), BioDef, BioInfer (Pyysalo et al., 2007), Grec (Thompson et al., 2009), Genia (Kim et al., 2003) and PubMedCentral (PMC) corpora. We also include the sentences available in annotations of named concepts in the KB BIO 101 ontology. This custom collection of sentences is the corpus on which our learning approach builds.
Our lexicon is then a merge of all entries extracted from either a lexicon or the ontology for the KBGEN+ events and entities. In Table 3, we present the size of the lexicon available from each source (Total Entries) and the count of KBGEN+ event and entity types (Intersecting Entries) for which one or more entries were found in that source. Table 4 shows the proportion of KBGEN+ event and entity types for which a lexical entry was found, as well as the maximum, minimum and average number of lexical items associated with events and entities in the merged lexicon.

Frame Extraction
Events in KBGEN+ take an arbitrary number of participants ranging from 1 to 8. Knowing the lexicalisation of an event name is therefore not sufficient. For each event lexicalisation, information about syntactic subcategorisation and syntactic/semantic linking is also required. Consider, for instance, an event representation in which a Block event is related to three entities by the INSTRUMENT, OBJECT and BASE roles. Knowing that a possible lexicalisation of a Block event is the finite verb form blocked is not sufficient to produce an appropriate verbalisation of the KB event, e.g.,
(1) C/EBP beta blocked TNF activation in myeloid cells.
6 http://www.nlm.nih.gov/mesh/filelist.html
7 Obtained by parsing the entries in the Synonyms section of HTML pages crawled from http://www.biology-online.org/dictionary/
In addition, one must know that this verb (i) takes a subject, an object and an optional prepositional argument introduced by a locative preposition (subcategorisation information) and (ii) that the INSTRUMENT role is realised by the subject slot, the OBJECT role by the DOBJ slot and the BASE role by the PREP-LOC slot (syntax/semantics linking information). That is, we need to know, for each KB event e and its associated roles (i.e., event-to-entity relations), first, what the syntactic arguments of each possible lexicalisation of e are and second, for each possible lexicalisation, which role maps to which syntactic function.
To address this issue, we extract syntactic frames from our constructed corpus and use the collected data to learn the mapping between KB and syntactic arguments.
Frame extraction proceeds as follows. For each event name in the KBGEN+ event set, we look for sentences in the corpus that mention this event name or one of its several verbalisations available in the merged lexicon (ALL in Table 4).
We then parse these sentences using the Stanford dependency parser for collapsed dependency structures and extract frames from the resulting parse trees. A frame is a sequence of dependency relations labelling the local subtree originating at a node labelled with an event name (or one of its variants). For instance, given the sentence and the dependency tree shown in Figure 2, the extracted frame for the event Block will be nsubj,VB,dobj, indicating that the verb form block requires a subject and an object.
Table 4: Proportion of event and entity names for which a lexical entry was found. Min, max and average number of lexical items associated with events and entities.
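As a rough illustration of this step, the sketch below reads a frame off a dependency parse supplied as (head index, relation, dependent index) triples. The triple format, the function name `extract_frame`, the example parse and the choice to order slots by surface position are all assumptions made for this example; the sketch does not call the Stanford parser.

```python
from typing import List, Set, Tuple

# One dependency edge: (head token index, relation label, dependent token index).
Edge = Tuple[int, str, int]

def extract_frame(edges: List[Edge], event_index: int, allowed: Set[str]) -> Tuple[str, ...]:
    """Return the frame rooted at the token at event_index.

    The frame lists the allowed relations of the event token's direct
    dependents in surface order, with "VB" marking the event token itself.
    """
    slots = [(dep, rel) for head, rel, dep in edges
             if head == event_index and rel in allowed]
    slots.append((event_index, "VB"))
    slots.sort()  # order dependents and verb by token position
    return tuple(label for _, label in slots)

# Hypothetical parse of "The plasma membrane blocks hydrophobic compounds".
tokens = ["The", "plasma", "membrane", "blocks", "hydrophobic", "compounds"]
edges = [(2, "det", 0), (2, "nn", 1), (3, "nsubj", 2),
         (5, "amod", 4), (3, "dobj", 5)]
frame = extract_frame(edges, event_index=3, allowed={"nsubj", "dobj"})
print(",".join(frame))
```

Only direct dependents of the event token contribute to the frame, which is why the determiner and modifiers inside the noun phrases are ignored.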
That is, a syntactic frame describes the arguments required by the lexicalisations of an event and the syntactic function they realise.
When extracting the frames, we only consider a subset of the dependency relations produced by the Stanford parser, to avoid including in the frame adjuncts such as temporal or spatial phrases which are optional rather than required arguments. This subset includes relations that are not directly given by the Stanford parser: vmod_creating, vmod_forming, vmod_producing, vmod_resulting, vmod_using and xcomp_using are reconstructed from a vmod or an xcomp dependency "collapsed" with lemmas such as producing or using, much in the same way as the prep_P collapsed dependency relation provided by the Stanford parser. These added dependencies are often used in biomedical text to express, e.g., RESULT or RAW-MATERIAL role relations.
A total of 718 distinct event frames were observed, whereby 97.63% of the KBGEN+ events were assigned at least one frame and each event was assigned an average of 82.01 distinct frames. Each event can be lexicalised by several natural language words or phrases and each natural language expression may occur in several syntactic environments.
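The reconstruction of the added vmod and xcomp relations can be sketched as a small relabelling function. This is a minimal sketch under stated assumptions: the lemma lists are the ones named in the text, while the underscore naming (by analogy with prep_P), the table `COLLAPSIBLE` and the function `collapse` are hypothetical and not taken from the original system.

```python
# Lemmas that may be folded into the relation label; the underscore naming
# mirrors the parser's prep_P convention and is an assumption of this sketch.
COLLAPSIBLE = {
    "vmod": {"creating", "forming", "producing", "resulting", "using"},
    "xcomp": {"using"},
}

def collapse(relation: str, dependent_word: str) -> str:
    """Fold the dependent word into the relation label when the pair is listed."""
    word = dependent_word.lower()
    if word in COLLAPSIBLE.get(relation, set()):
        return f"{relation}_{word}"
    return relation

print(collapse("vmod", "using"))
print(collapse("xcomp", "producing"))
```

Pairs that are not listed keep their plain relation label, so the second call leaves xcomp unchanged.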

Probabilistic Models
Given F a set of syntactic frames, E a set of KB event names, D a set of syntactic dependency names and R, a set of KB roles, we next describe three probabilistic models that will be used to generate natural language sentences.
• The model P(f|e) with f ∈ F and e ∈ E, which encodes the probability of a frame given an event.
• The model P(f|r) with f ∈ F and r ∈ R, which encodes the probability of a frame given a role.
• The model P(d|r) with d ∈ D and r ∈ R, which encodes the probability of a syntactic dependency given a role.
We have chosen generative models for frames and dependencies given events and roles, and not the other way around, because such models intuitively match the generation process at test time. Each of the three models P(f|e), P(f|r) and P(d|r) is assumed multinomial, with maximum likelihood estimates determined by the labelled data built as described in Algorithm 1. Intuitively, C_e is the corpus consisting of all frames found in the text corpus to be associated with a lexicalisation of e. Similarly, C_r and C_d gather all pairs of (frame, role) and (dependency relation, role) that could be identified given the KBGEN+ KB, the corpus described in Section 3.2 and the lexicon described in Section 3.3. A symmetric Dirichlet prior with hyperparameter α = 0.1 is further used in order to favor sparse distributions. Training thus gives:
$$P(f \mid e) = \frac{\mathrm{counts}((f, e) \in C_e) + 0.1}{\sum_{f'} \left( \mathrm{counts}((f', e) \in C_e) + 0.1 \right)}$$
This first model allows us to choose a syntactic frame that will be used to verbalize a given event.
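For the second distribution:
$$P(f \mid r) = \frac{\mathrm{counts}((f, r) \in C_r) + 0.1}{\sum_{f'} \left( \mathrm{counts}((f', r) \in C_r) + 0.1 \right)}$$
This second model also ranks the frames, but this time based on the given set of roles. The third model is trained in a similar way:
$$P(d \mid r) = \frac{\mathrm{counts}((d, r) \in C_d) + 0.1}{\sum_{d'} \left( \mathrm{counts}((d', r) \in C_d) + 0.1 \right)}$$
It is used to choose which dependency in f shall represent the role r.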
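The smoothed estimates above can be computed with a single helper that works for all three models. The following is a minimal sketch, assuming each training corpus is given as a list of (outcome, condition) pairs and that the sum in the denominator runs over a known outcome set; the function name `estimate` and the toy data are illustrative only.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Tuple

ALPHA = 0.1  # symmetric Dirichlet hyperparameter used in the text

def estimate(observations: Iterable[Tuple[Hashable, Hashable]],
             outcomes: Iterable[Hashable],
             alpha: float = ALPHA) -> Dict[Hashable, Dict[Hashable, float]]:
    """Return P(outcome | condition) with additive smoothing.

    observations holds (outcome, condition) pairs, e.g. (frame, event) for C_e.
    """
    outcomes = list(dict.fromkeys(outcomes))  # de-duplicate, keep order
    counts = Counter(observations)
    conditions = {condition for _, condition in counts}
    model = {}
    for condition in conditions:
        denominator = sum(counts[(o, condition)] + alpha for o in outcomes)
        model[condition] = {o: (counts[(o, condition)] + alpha) / denominator
                            for o in outcomes}
    return model

# Toy frame-event corpus: one frame seen three times, another once.
TRANSITIVE = ("nsubj", "VB", "dobj")
INTRANSITIVE = ("nsubj", "VB")
c_e = [(TRANSITIVE, "Block")] * 3 + [(INTRANSITIVE, "Block")]
p_f_given_e = estimate(c_e, outcomes=[TRANSITIVE, INTRANSITIVE])
print(p_f_given_e["Block"])
```

Passing the (frame, role) or (dependency, role) pairs instead yields the other two models in the same way.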

Surface Realisation
In our approach, surface realisation takes as input an event description. To verbalize an input event description containing an event e and n roles r_1, ..., r_n, we first identify the event and the roles present in the input. The arity of the event is then defined as the count of distinct role types present in the input (to favor aggregation, in case of repeating roles). Thus, if the input event description contains, e.g., 2 object roles and an instrument role, its arity will be 2 rather than 3. This accounts for the fact that the two object roles will be verbalised as a coordinated NP filling a single dependency function rather than two distinct syntactic arguments. Among all the frames seen for this event during training, we select only those that have the same arity (same number of syntactic dependents) as the input event. All such frames are candidate frames for generation. We consider two alternative scoring functions for choosing the n-best frames (n = 5 in our experiments). In the first case, we select the frame which maximises the score (M1):
$$\hat{f} = \arg\max_{f} P(f \mid e) \times \prod_{i=1}^{n} P(f \mid r_i)$$
To determine the mapping between roles and syntactic dependencies, we then look for the best permutation of the roles for every winning frame $\hat{f} = (d_1, \cdots, d_n)$:
$$(\hat{r}_1, \ldots, \hat{r}_n) = \arg\max_{(r_1, \ldots, r_n) \in \mathcal{P}(\{r_1, \ldots, r_n\})} \prod_{i=1}^{n} P(d_i \mid r_i)$$
where $\mathcal{P}(\{r_1, \ldots, r_n\})$ is the set of all permutations of the roles. In the second model (M2), we first compute the optimal mapping $(\hat{r}^f_1, \ldots, \hat{r}^f_n)$ for every possible frame and then use this information to select the n-best frames for generation:
$$\hat{f} = \arg\max_{f} P(f \mid e) \times \prod_{i=1}^{n} P(f \mid r_i) \times \prod_{i=1}^{n} P(d_i \mid \hat{r}^f_i)$$
Note that (M1) (and (M2)) can be viewed as a product of experts, but with independently trained experts and without any normalization factor. The score is thus not a probability, but this is acceptable because the normalization term does not affect the choice of the winning frame.
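The selection procedure can be sketched as follows: filter frames by arity, score the candidates, then search all role permutations for the best role-to-dependency mapping. This is a sketch under stated assumptions rather than the original implementation: the frame score multiplies P(f|e) by P(f|r) over the distinct input roles, the three models are plain nested dictionaries, and all probabilities, frames and function names are invented for illustration.

```python
from itertools import permutations
from math import prod

def dependents(frame):
    """Dependency slots of a frame, i.e. every label except the verb marker."""
    return [label for label in frame if label != "VB"]

def best_mapping(frame, roles, p_d_given_r):
    """Best assignment of distinct roles to the frame's dependency slots."""
    deps = dependents(frame)
    def score(perm):
        return prod(p_d_given_r[r].get(d, 0.0) for d, r in zip(deps, perm))
    best = max(permutations(sorted(set(roles))), key=score)
    return list(zip(deps, best))

def select_frame(event, roles, p_f_given_e, p_f_given_r, p_d_given_r):
    """Pick the best frame of matching arity, then map roles to its slots."""
    arity = len(set(roles))  # repeated roles count once
    candidates = [f for f in p_f_given_e[event] if len(dependents(f)) == arity]
    if not candidates:
        return None  # no frame of matching arity: generation fails
    def frame_score(frame):
        return p_f_given_e[event][frame] * prod(
            p_f_given_r[r].get(frame, 0.0) for r in set(roles))
    winner = max(candidates, key=frame_score)
    return winner, best_mapping(winner, roles, p_d_given_r)

# Toy models (all numbers invented).
F1 = ("nsubj", "VB", "dobj", "prep_in")
F2 = ("nsubj", "VB", "dobj")
F3 = ("nsubjpass", "VB", "prep_by", "prep_in")
p_f_given_e = {"Block": {F1: 0.5, F2: 0.3, F3: 0.2}}
p_f_given_r = {r: {F1: 0.4, F2: 0.3, F3: 0.3}
               for r in ("instrument", "object", "base")}
p_d_given_r = {
    "instrument": {"nsubj": 0.6, "dobj": 0.1, "prep_in": 0.3},
    "object": {"nsubj": 0.2, "dobj": 0.7, "prep_in": 0.1},
    "base": {"nsubj": 0.1, "dobj": 0.1, "prep_in": 0.8},
}
result = select_frame("Block", ["instrument", "object", "base"],
                      p_f_given_e, p_f_given_r, p_d_given_r)
print(result)
```

Enumerating permutations is exponential in the number of roles, which stays tractable here because events take at most 8 participants.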
Both the (M1) and (M2) alternatives output a winning frame $\hat{f}$, i.e., a sequence of dependencies that shall be used to generate the sentence, as well as its mapping to roles $(\hat{r}^{\hat{f}}_1, \ldots, \hat{r}^{\hat{f}}_n)$. Thus, generation boils down to filling every dependency slot in sequence with its optional preposition (e.g., for d_i = prep_to or d_i = prep_at) and the lexical entry of the entity bound to the corresponding role. For repeating roles of the input, we aggregate their bound entities via the conjunction "and" and fill the corresponding dependency slot.
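Slot filling can be illustrated with a few lines of string assembly. The sketch below assumes that a frame is a tuple of slot labels with "VB" marking the verb position, that prepositional slots are named prep_P, and that the lexicon maps concept names to surface strings; the concept names (e.g. `C-EBP-BETA`) and the function `realise` are hypothetical.

```python
def realise(frame, mapping, verb, lexicon, fillers):
    """Fill each slot of the frame in sequence.

    frame   : slot labels, with "VB" marking the verb position
    mapping : roles aligned with the non-verb slots of the frame
    fillers : role -> list of entity names bound to that role
    """
    roles = iter(mapping)
    words = []
    for slot in frame:
        if slot == "VB":
            words.append(verb)
            continue
        role = next(roles)
        # Repeated roles are aggregated into one coordinated phrase.
        phrase = " and ".join(lexicon[entity] for entity in fillers[role])
        if slot.startswith("prep_"):
            words.append(slot[len("prep_"):])  # emit the preposition first
        words.append(phrase)
    return " ".join(words)

lexicon = {"C-EBP-BETA": "C/EBP beta", "TNF-ACTIVATION": "TNF activation",
           "MYELOID-CELL": "myeloid cells", "DOUBLE-BOND": "a double bond",
           "OXYGEN": "oxygen", "CARBON": "carbon"}

sentence = realise(("nsubj", "VB", "dobj", "prep_in"),
                   ["instrument", "object", "base"], "blocked", lexicon,
                   {"instrument": ["C-EBP-BETA"], "object": ["TNF-ACTIVATION"],
                    "base": ["MYELOID-CELL"]})
aggregated = realise(("nsubj", "VB", "dobj"), ["instrument", "object"],
                     "holds together", lexicon,
                     {"instrument": ["DOUBLE-BOND"], "object": ["OXYGEN", "CARBON"]})
print(sentence)
print(aggregated)
```

The sketch leaves out capitalisation, agreement and determiner choice, which a full realiser would have to handle.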
The results obtained by verbalising the n-best frames given by models (M1 & M2) are separately stored and we present their analysis in Section 4.

Results and Discussion
We evaluate our approach on the 336 event representations included in the KBGEN+ dataset. For each event representation, we generate the 5 best natural language verbalisations using the method described in the preceding section. We then evaluate the results both qualitatively and quantitatively.

Input:
KBGEN+
Lexicons L_e for events and L_t for entities, as described in Section 3.3
Raw text corpus T with dependency trees, as described in Section 3.4
Output:
Corpus (multiset) C_e for model P(f|e)
Corpus (multiset) C_r for model P(f|r)
Corpus (multiset) C_d for model P(d|r)
1. For every event e ∈ KBGEN+, let lex(e) be all possible lexicalisations of e taken from L_e:
2. For every lexicalisation l ∈ lex(e):
3. For every occurrence e_t ∈ T of l:
(a) Extract the frame f governed by e_t
(b) Add the observation f with label e in the frame-event corpus: $C_e \leftarrow C_e \cup \{(f, e)\}$
(c) For every entity w_t ∈ L_t that is a dependent of e_t with syntactic relation d, add every role r associated with this entity in KBGEN+ to both role corpora: $C_r \leftarrow C_r \cup \{(f, r)\}$ and $C_d \leftarrow C_d \cup \{(d, r)\}$
Algorithm 1: Preparation of the corpora used to train our probabilistic models
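Algorithm 1 maps naturally onto three counters. The following sketch assumes that the parsed corpus has already been reduced to a list of occurrences, each giving the matched word, the frame it governs and its (dependency, dependent word) pairs; this input format, the dictionary shapes and the name `build_corpora` are assumptions of the example.

```python
from collections import Counter

def build_corpora(event_lexicon, entity_lexicon, entity_roles, occurrences):
    """Collect the multisets C_e, C_r and C_d.

    event_lexicon  : event name -> set of lexicalisations (L_e)
    entity_lexicon : surface word -> entity name (L_t)
    entity_roles   : entity name -> roles it bears in the KB
    occurrences    : (word, frame, [(dependency, dependent word), ...]) tuples
    """
    c_e, c_r, c_d = Counter(), Counter(), Counter()
    for event, lexicalisations in event_lexicon.items():
        for lexicalisation in lexicalisations:
            for word, frame, deps in occurrences:
                if word != lexicalisation:
                    continue
                c_e[(frame, event)] += 1          # step (b)
                for dependency, dep_word in deps:  # step (c)
                    entity = entity_lexicon.get(dep_word)
                    if entity is None:
                        continue
                    for role in entity_roles.get(entity, ()):
                        c_r[(frame, role)] += 1
                        c_d[(dependency, role)] += 1
    return c_e, c_r, c_d

FRAME = ("nsubj", "VB", "dobj")
c_e, c_r, c_d = build_corpora(
    event_lexicon={"Block": {"blocks", "blocked"}},
    entity_lexicon={"membrane": "Plasma-Membrane",
                    "compounds": "Hydrophobic-Compound"},
    entity_roles={"Plasma-Membrane": {"instrument"},
                  "Hydrophobic-Compound": {"object"}},
    occurrences=[("blocks", FRAME,
                  [("nsubj", "membrane"), ("dobj", "compounds")])],
)
print(c_e, c_r, c_d)
```

Because every role an entity bears anywhere in the KB is added, the labels are noisy; the probabilistic models are what turn these counts into a preference.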

Coverage
We first consider coverage, i.e., the proportion of inputs in the test set for which a verbalisation is produced. In total, we generate output for 321 (95.53%) of the 336 test inputs. For 3 input cases involving two distinct event names (PHOTORESPIRATION, UNEQUAL-SHARING), there was no associated frame because none of the lexicalisations of the event name could be found in the corpus. Covering such cases would require a more sophisticated lexicalisation strategy, for instance the strategy used in (Trevisan, 2010), where names are tokenized and POS-tagged before being mapped to a lexicalisation using hand-written rules.
For the other 12 input cases, generation fails because no frame of matching arity could be found. As discussed in Section 4.3 below, this is often due to cases where a KB role (mostly the BASE role) is verbalised as a modifier of an argument rather than as a verb argument. In the remaining cases, the event is nominalised and there is no matching frame for that nominalisation.

Accuracy
Because the generated verbalisations are not learned from a parallel corpus, the generated sentences are often very different from the reference sentence. For instance, the generated sentence may contain a verb while in the reference sentence, the event is nominalised. Or the event might be verbalised by a transitive verb in the generated sentence but by a verb taking a prepositional object in the reference sentence (e.g., A double bond holds together an oxygen and a carbon vs. Carbon and oxygen are held together by double bond). To automatically assess the quality of the generated sentences, we therefore do not use BLEU. Instead, we measure the accuracy of role mapping and we complement this automatic metric with the human evaluation described in the next section.
Role mapping is assessed as follows. For each input in the test data, we record the mapping between the KB role of an argument in the event description and the syntactic dependency of the corresponding natural language argument in the gold sentence. For instance, given the event description shown in Section 3.4 for Sentence 1 (repeated below for convenience as Example 1), we record the syntax/semantics mapping in which INSTRUMENT is realised by the subject, OBJECT by the direct object and BASE by a locative prepositional argument.
(1) C/EBP beta blocked TNF activation in myeloid cells.
Accuracy is then the proportion of generated role:dependency mappings which are correct, i.e., which match the reference. Although this does not address the fact that the generated and the reference sentence may be very different, it provides some indication of whether the generated mappings are plausible. We thus report this accuracy for the 1-best and 5-best solutions provided by our model, to partly account for the variability in possible correct answers. We compare our results to two baselines. The first baseline (BL-LING) is obtained using a default role/dependency assignment which is manually defined using linguistic introspection. The second (BL-GOLD) is a strong, informed baseline which has access to the frequency of the role/dependency mapping in the gold corpus. That is, this second baseline assigns to each role in the input event description the syntactic dependency most frequently assigned to this role in the gold corpus. The default mapping used for BL-GOLD is as follows:
[Table: default role/dependency mapping used for BL-GOLD]
As expected, the difference between BL-LING and BL-GOLD shows that using information from the gold corpus strongly improves accuracy.
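The metric and the informed baseline can be made concrete with a short sketch. Two assumptions are made here that the text does not spell out: an n-best mapping for a role is counted as correct if any of the top n candidates assigns the gold dependency to that role, and ties in the gold frequencies are broken arbitrarily. The function names and the toy data are illustrative.

```python
from collections import Counter, defaultdict

def role_mapping_accuracy(nbest, gold, n=1):
    """Share of gold role:dependency pairs matched by one of the top n mappings.

    nbest : for each input, a ranked list of role -> dependency dicts
    gold  : for each input, the reference role -> dependency dict
    """
    correct = total = 0
    for candidates, reference in zip(nbest, gold):
        for role, dependency in reference.items():
            total += 1
            if any(c.get(role) == dependency for c in candidates[:n]):
                correct += 1
    return correct / total if total else 0.0

def most_frequent_mapping(gold):
    """BL-GOLD style default: the most frequent gold dependency for each role."""
    counts = defaultdict(Counter)
    for reference in gold:
        for role, dependency in reference.items():
            counts[role][dependency] += 1
    return {role: c.most_common(1)[0][0] for role, c in counts.items()}

gold = [{"instrument": "nsubj", "object": "dobj", "base": "prep_in"},
        {"agent": "nsubj", "object": "dobj"}]
nbest = [[{"instrument": "nsubj", "object": "dobj", "base": "prep_in"}],
         [{"agent": "prep_by", "object": "nsubjpass"},
          {"agent": "nsubj", "object": "dobj"}]]
acc_1best = role_mapping_accuracy(nbest, gold, n=1)
acc_5best = role_mapping_accuracy(nbest, gold, n=5)
baseline = most_frequent_mapping(gold)
print(acc_1best, acc_5best, baseline["object"])
```

In this toy data the passive first candidate for the second input is penalised under the 1-best setting even though it may be a valid verbalisation, which is the variability the 5-best figure is meant to absorb.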
While M1 and M2 do not improve on the baselines, an important drawback of these baselines is that they may map two or more roles in an event description to the same dependency (e.g., RAW-MATERIAL and RESULT to dobj). Worse, they may map a role to a dependency which is absent from the selected frame (if the dependency mapped onto by a role in the input does not exist in that frame). In contrast, the probabilistic approach is linguistically more promising as it guarantees that each role is mapped to a distinct dependency relation. We therefore take advantage of both the linguistically inspired baseline (BL-LING) and the probabilistic approach by combining both into a model (M2-BL-LING) which simply replaces the mapping proposed by the M2 model by that proposed by the BL-LING baseline whenever the probability of the M2 model is below a given threshold. (We empirically chose a threshold that retains 40% of our model's outputs; this is the only threshold value that we have tried, and we have not tuned this threshold at all.) Because it predicts role/dependency mappings that are consistent with the selected frames, this new model is linguistically sound. And because it makes use of the strong prior information contained in the BL-LING baseline, it has a good accuracy.

Human Evaluation
Taking a sample of 264 inputs from the KBGEN+ dataset, we evaluate the mappings of roles to syntax in the output. The sample contains inputs with 1 to 2 roles (40%), 3 roles (30%) and more than 3 roles (30%). For each sampled input, we consider the 5 best outputs and manually grade the output as follows:
1. Correct: both the syntax/semantic linking of the arguments and the lexicalisation of the event and of its arguments are correct.
2. Almost Correct: the lexicalisation of the event and of its arguments is correct and the linking of core semantic arguments is correct. The core arguments are the most frequent ones in the test data, namely AGENT, BASE and OBJECT.
3. Incorrect: all other cases.
Three judges independently graded the 264 inputs using the above criteria. The inter-annotator agreement, as measured with the Fleiss Kappa in a preliminary experiment in these conditions, was κ = 0.76, which is considered "good agreement" in the literature. 29% of the outputs were found to be correct, 20% to be almost correct and 51% to be incorrect.
One main factor negatively affecting results is the number of roles contained in an event description. Unsurprisingly, the greater the number of roles, the lower the accuracy. That is, for event descriptions with 3 or fewer roles, the scores are higher (40%, 23% and 37% respectively for correct, almost correct and incorrect) as there are fewer possibilities to be considered. Another, related issue is data sparsity. Unsurprisingly, roles that are less frequent often score lower (i.e., are more often incorrectly mapped to syntax) than roles which occur more frequently. Thus, the three most frequent roles (AGENT, OBJECT, BASE) have a 5-best role mapping accuracy that ranges from 43% to 77%, while most other roles have much lower accuracy. These two issues suggest that results could be improved by using either more data or a more sophisticated smoothing or learning strategy. However, linguistic factors are also at play here.
First, some semantic roles are often verbalised as verbs rather than thematic roles. For instance, in Sentence (2), the event (INTRACELLULAR-DIGESTION) is verbalised as a nominalisation and the OBJECT role as a verb (produces). More generally, a role in the KB is not necessarily realised by a thematic role.
(2) Intracellular digestion of polymers and solid substances in the lysosome produces monomers.
Second, in some cases, entities which are arguments of the event in the input are verbalised as prepositional modifiers of an argument of the verb verbalising the event rather than as an argument of the verb itself. This is frequently the case for the BASE relation. For instance, Example (3) shows the gold sentence for an input containing EUKARYOTIC-CELL as a BASE argument. As can be seen, in this case, the EUKARYOTIC-CELL entity is verbalised by a prepositional phrase modifying an NP rather than by an argument of the verb.
(3) Lysosomal enzymes digest nucleic acids and proteins in the lysosome of eukaryotic cells.

Conclusion
We have presented an approach for verbalising biological event representations which differs from previous work in that (i) it uses non-parallel corpora and (ii) it focuses on n-ary relations and on the issue of how to automatically map natural language and KB arguments. A first evaluation gives encouraging results and identifies three main open questions for further research. How best to deal with data sparsity to account for event descriptions involving a high number of roles or roles that are infrequent? How to handle semantic roles that are verbalised as modifiers rather than as syntactic arguments? How to account for cases where KB roles are verbalised by verbs rather than by syntactic dependencies?