A demo of FORGe: the Pompeu Fabra Open Rule-based Generator

This demo paper presents the multilingual deep sentence generator developed by the TALN group at Universitat Pompeu Fabra, implemented as a series of rule-based graph-transducers for the syntacticization of the input graphs, the resolution of morphological agreements, and the linearization of the trees.


Introduction
FORGe (Mille et al., 2017) 1 is a pipeline of graph transducers which, coupled with lexical resources, generates texts from a variety of abstract input structures. The current generator has mainly been developed for English, on the dependency Penn Treebank (Johansson and Nugues, 2007) automatically converted to predicate-argument structures, and on Abstract Meaning Representations, using the SemEval'17 data (May and Priyadarshi, 2017). It is currently being adapted to languages such as Spanish, German, French, and Polish, in the context of ontology-to-text generation as part of a dialogue system. Our generator follows the theoretical model of the Meaning-Text Theory (Mel'čuk, 1988), and performs the following actions: (i) syntacticization of predicate-argument graphs; (ii) introduction of function words; (iii) linearization and retrieval of surface forms.

Overview of the system
In this section, we briefly describe the input to the system and the successive transductions.

1 See Mille et al. (2017) for an evaluation of the system in the context of the SemEval AMR-to-text generation challenge.

Inputs
The input structures can be trees or acyclic graphs that contain linguistic information only, which includes meaning-bearing units and predicate-argument relations such as ARG0 (if external arguments are licensed, as in PropBank (Kingsbury and Palmer, 2002)), ARG1, ARG2, ..., ARGn. In order to allow for more compact representations, the generator can also handle "non-core" predicates as edges, either with a generic label nonCore or with a typed label such as purpose; a purpose meaning between two nodes N1 and N2 can thus be represented in two alternative ways.
[Figure: two alternative representations of a purpose meaning between N1 and N2; surviving labels: ARG2, purpose]
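The relation between the two kinds of representation can be illustrated with a minimal Python sketch. The `Edge` type, the `collapse_non_core` helper, and the convention that the predicate's ARG1 is the modified element and its ARG2 the modifier are assumptions made for this illustration; they are not taken from the FORGe implementation.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    source: str
    label: str
    target: str


def collapse_non_core(edges: list[Edge], predicate: str) -> list[Edge]:
    """Replace a non-core predicate node by a single typed edge.

    Assumed convention (not from the paper): the predicate's ARG1 is the
    element being modified and its ARG2 is the modifier.
    """
    args = {e.label: e.target for e in edges if e.source == predicate}
    # Every edge touching the predicate node is dropped in this sketch.
    kept = [e for e in edges if predicate not in (e.source, e.target)]
    return kept + [Edge(args["ARG1"], predicate, args["ARG2"])]


# Representation A: "purpose" is a node with two arguments.
as_node = [Edge("purpose", "ARG1", "N1"), Edge("purpose", "ARG2", "N2")]

# Representation B: "purpose" is the label of an edge from N1 to N2.
as_edge = collapse_non_core(as_node, "purpose")
print(as_edge)
```

The edge-based form needs one edge instead of one node and two edges, which is what makes it the more compact of the two.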

Generation of the deep syntactic structure
First of all, parts of speech are assigned to each node of the structure. Then a top-down recursive syntacticization of the semantic graph is performed: the transduction looks for the syntactic root of the sentence, and from there for its syntactic dependent(s), for the dependent(s) of the dependent(s), and so on. If the original input structure does not contain a root, one is identified first; the rules then produce a well-formed tree that covers as much of the input graph as possible while avoiding dependency conflicts. In the running example, "peek" is chosen as the root.
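The top-down traversal can be sketched in Python as follows. This is an illustration under stated assumptions, not the FORGe rule set: the root heuristic (the verb with the most arguments), the graph encoding, and the use of an ATTR label for inverted modifier edges are all chosen for this example.

```python
# Semantic graph of the running example: predicate -> {argument label: node}.
SEM = {
    "peek": {"ARG0": "he", "ARG1": "dog"},
    "black": {"ARG1": "dog"},
    "bark": {"ARG0": "dog"},
}
POS = {"peek": "VB", "bark": "VB", "black": "JJ", "he": "PRP", "dog": "NN"}


def pick_root(sem, pos):
    """Hypothetical heuristic: the verb with the most arguments is the root."""
    verbs = [n for n in sem if pos[n] == "VB"]
    return max(verbs, key=lambda n: len(sem[n]))


def syntacticize(sem, pos):
    """Build a tree top-down; every node receives at most one governor."""
    root = pick_root(sem, pos)
    tree, visited = [], {root}

    def attach(gov):
        # 1. Arguments of the governor become its dependents.
        children = list(sem.get(gov, {}).items())
        # 2. Predicates that take the governor as an argument are attached
        #    below it through an inverted, modifier-like edge (assumed label).
        children += [("ATTR", p) for p, args in sem.items() if gov in args.values()]
        for label, dep in children:
            if dep in visited:  # attaching it would create a second governor
                continue
            visited.add(dep)
            tree.append((gov, label, dep))
            attach(dep)

    attach(root)
    return root, tree


root, tree = syntacticize(SEM, POS)
print(root)
for edge in tree:
    print(edge)
```

The `visited` check is where dependency conflicts are avoided in this sketch: "dog" is an argument of three predicates in the graph, but it ends up with "peek" as its only governor, and "black" and "bark" are attached below it.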

Introduction of function words
The next step towards the realization of the sentence is the introduction of all idiosyncratic words (prepositions, auxiliaries, determiners, etc.) and of a fine-grained (surface-)syntactic structure that gives enough information for linearizing and resolving agreements between the different words. For this task, we use a valency (subcategorization) lexicon built automatically from PropBank and NomBank (Meyers et al., 2004). During this transduction, anaphora are resolved, and pronouns (possessive, relative and personal) are introduced in the tree. See, e.g., how the preposition "at" is introduced in the surface-syntactic structure of the running example.
[Figure: surface-syntactic structure with the nodes he, peek, at, dog, the, black, bark, that]
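The lexicon-driven introduction of a governed preposition can be sketched as below. The lexicon format, the `prepos` relation name, and the function name are hypothetical and serve only to illustrate the idea of a lookup keyed on a lemma and an argument slot.

```python
# Deep-syntactic edges of the running example: (governor, relation, dependent).
DEEP = [("peek", "ARG0", "he"), ("peek", "ARG1", "dog")]

# Toy valency lexicon: (lemma, argument slot) -> governed preposition.
# The single entry is illustrative; it is not copied from a real resource.
VALENCY = {("peek", "ARG1"): "at"}


def introduce_prepositions(tree, valency):
    """Insert a preposition node wherever the governor's entry requires one."""
    surface = []
    for gov, rel, dep in tree:
        prep = valency.get((gov, rel))
        if prep is None:
            surface.append((gov, rel, dep))
        else:
            # The preposition becomes a dependent of the governor and
            # the governor of the original dependent.
            surface.append((gov, rel, prep))
            surface.append((prep, "prepos", dep))  # assumed relation name
    return surface


for edge in introduce_prepositions(DEEP, VALENCY):
    print(edge)
```

Auxiliaries and determiners would need rules of their own; only the preposition case from the running example is covered here.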

Resolution of morpho-syntactic agreements, linearization, and retrieval of surface forms
In order to resolve agreements, the rules for this transduction check the governor/dependent pairs, together with the syntactic relation that links them. Other rules order governor-dependent pairs and siblings with respect to one another. We then match each triple <lemma, POS, morpho-syntactic features> with an entry of a morphological dictionary and simply replace the triple by the corresponding surface form. The final sentence corresponding to the running example would be "He peeks at the black dog that barks."
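The three operations of this step can be sketched on the running example as follows. The relation names, feature names, relative-order table, and dictionary entries are placeholders invented for this sketch, and the relative pronoun is assumed to already carry the features of its antecedent; the actual rules and resources are richer.

```python
# Surface-syntactic tree of the running example.
# NODES: id -> [lemma, POS, features]; EDGES: (governor, relation, dependent).
NODES = {
    1: ["he", "PRP", {"person": "3", "number": "sg"}],
    2: ["peek", "VB", {"tense": "pres"}],
    3: ["at", "IN", {}],
    4: ["the", "DT", {}],
    5: ["black", "JJ", {}],
    6: ["dog", "NN", {"number": "sg"}],
    7: ["that", "WDT", {"person": "3", "number": "sg"}],
    8: ["bark", "VB", {"tense": "pres"}],
}
EDGES = [(2, "subj", 1), (2, "obl", 3), (3, "prepos", 6), (6, "det", 4),
         (6, "mod", 5), (6, "relat", 8), (8, "subj", 7)]

# Position of a dependent relative to its governor (the governor is at 0).
ORDER = {"det": -2, "mod": -1, "subj": -1, "obl": 1, "prepos": 1, "relat": 2}

# Toy morphological dictionary: (lemma, POS, features) -> surface form.
MORPH = {("peek", "VB", "3.sg.pres"): "peeks", ("bark", "VB", "3.sg.pres"): "barks"}


def agree(nodes, edges):
    """Agreement: a verb copies person and number from its subject."""
    for gov, rel, dep in edges:
        if rel == "subj":
            for feat in ("person", "number"):
                nodes[gov][2][feat] = nodes[dep][2][feat]


def linearize(node, edges):
    """Order each governor and its dependents, then recurse into dependents."""
    items = [(0, [node])]
    for gov, rel, dep in edges:
        if gov == node:
            items.append((ORDER[rel], linearize(dep, edges)))
    return [n for _, span in sorted(items, key=lambda item: item[0]) for n in span]


def surface_form(lemma, pos, feats):
    """Look the triple up in the dictionary; fall back to the lemma."""
    key = ".".join(feats[f] for f in ("person", "number", "tense") if f in feats)
    return MORPH.get((lemma, pos, key), lemma)


def realize(root, nodes, edges):
    agree(nodes, edges)
    words = [surface_form(*nodes[n]) for n in linearize(root, edges)]
    return " ".join(words).capitalize() + "."


print(realize(2, NODES, EDGES))  # He peeks at the black dog that barks.
```

Encoding word order as a signed position per relation keeps the sketch short, but it can only express orders that depend on the relation alone; rules that order siblings by other properties would need a richer comparison.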

A flexible multilingual generation pipeline
The presented pipeline is flexible from several perspectives. First, it is quite easily adaptable to different types of inputs; for instance, it took only one week to adapt it to the AMRs of SemEval'17. Second, many rules are language-independent, and others can be easily adapted to other languages, which means that, with good-quality lexical resources, the effort for building a generator in a new language is minimal. Finally, it is possible to substitute some parts of the pipeline with statistical modules, e.g., the transition between deep- and surface-syntax (Ballesteros et al., 2015) or the linearization step (Bohnet et al., 2011), in order to overcome a possible lack of coverage of the rules. During the demo session, participants will be encouraged to play with the generator through a graphical interface, in order to see all the details of a generation process (in English, with some examples in German and Polish).