Language Generation via DAG Transduction

A DAG automaton is a formal device for manipulating graphs. By augmenting a DAG automaton with transduction rules, we obtain a DAG transducer, which has potential applications in fundamental NLP tasks. In this paper, we propose a novel DAG transducer to perform graph-to-program transformation. The target structure of our transducer is a program licensed by a declarative programming language rather than a linguistic structure. By executing such a program, we can easily get a surface string. Our transducer is designed especially for natural language generation (NLG) from type-logical semantic graphs. Taking Elementary Dependency Structures, a format of English Resource Semantics, as input, our NLG system achieves a BLEU-4 score of 68.07. This remarkable result demonstrates the feasibility of applying a DAG transducer to resolve NLG, as well as the effectiveness of our design.


Introduction
The recent years have seen an increased interest as well as rapid progress in semantic parsing and surface realization based on graph-structured semantic representations, e.g. Abstract Meaning Representation (AMR; Banarescu et al., 2013), Elementary Dependency Structure (EDS; Oepen and Lønning, 2006) and Dependency-based Minimal Recursion Semantics (DMRS; Copestake, 2009). Still underexploited is a formal framework for manipulating graphs that parallels automata, transducers or formal grammars for strings and trees. Two such formalisms have recently been proposed and applied in NLP. One is graph grammar, e.g. Hyperedge Replacement Grammar (HRG; Ehrig et al., 1999). The other is DAG automata, originally studied by Kamimura and Slutzki (1982) and extended by Chiang et al. (2018). In this paper, we study DAG transducers in depth, with the goal of building accurate, efficient yet robust natural language generation (NLG) systems.
The meaning representation studied in this work is what we call type-logical semantic graphs, i.e. semantic graphs grounded in type-logical semantics (Carpenter, 1997), one dominant theoretical framework for modeling natural language semantics. In this framework, adjuncts, such as adjective and adverbial phrases, are analyzed as (higher-order) functors, whose function is to consume complex arguments (Kratzer and Heim, 1998). In the same spirit, generalized quantifiers, prepositions and function words in many languages other than English are also analyzed as higher-order functions. Accordingly, all these linguistic elements are treated as roots in type-logical semantic graphs, such as EDS and DMRS. This makes the topological structure quite flat rather than hierarchical, which is an essential distinction between natural language semantics and syntax.
To the best of our knowledge, the only existing DAG transducer for NLG is the one proposed by Quernheim and Knight (2012). Quernheim and Knight introduced a DAG-to-tree transducer that can be applied to AMR-to-text generation. This transducer is designed to handle hierarchical structures with limited reentrancies, and it is unsuitable for meaning graphs derived from type-logical semantics. Furthermore, Quernheim and Knight did not describe how to acquire graph recognition and transduction rules from linguistic data, and reported no results for practical generation. It is still unknown to what extent a DAG transducer suits realistic NLG.
The design of string and tree transducers (Comon et al., 1997) specifies not only the logic of the computation for a new data structure, but also the corresponding control flow. This is very similar to the imperative programming paradigm: implementing algorithms with exact details in explicit steps. This design makes it very difficult to transform a type-logical semantic graph into a string, because the internal structures of such graphs are highly diverse. We borrow ideas from declarative programming, another programming paradigm, which describes what a program must accomplish rather than how to accomplish it. We propose a novel DAG transducer to perform graph-to-program transformation (§3). The input of our transducer is a semantic graph, while the output is a program licensed by a declarative programming language rather than a linguistic structure. By executing such a program, we can easily get a surface string. This idea can be extended to other types of linguistic structures, e.g. syntactic trees or semantic representations of another language.
We conduct experiments on richly detailed semantic annotations licensed by English Resource Grammar (ERG; Flickinger, 2000). We introduce a principled method to derive transduction rules from DeepBank (Flickinger et al., 2012). Furthermore, we introduce a fine-to-coarse strategy to ensure that at least one sentence is generated for any input graph. Taking EDS graphs, a variable-free ERS format, as input, our NLG system achieves a BLEU-4 score of 68.07. On average, it produces more than 5 sentences per second on an x86_64 GNU/Linux platform with two Intel Xeon E5-2620 CPUs. Since the data for the experiments is newswire data, i.e. WSJ sentences from the PTB (Marcus et al., 1993), the input graphs are quite large on average. The remarkable accuracy, efficiency and robustness demonstrate the feasibility of applying a DAG transducer to resolve NLG, as well as the effectiveness of our transducer design.

Preliminaries
A node-labeled simple graph over an alphabet Σ is a triple G = (V, E, ℓ), where V is a finite set of nodes, E ⊆ V × V is a finite set of edges and ℓ : V → Σ is a labeling function. For a node v ∈ V, the sets of its incoming and outgoing edges are denoted by in(v) and out(v) respectively. For an edge e ∈ E, its source node and target node are denoted by src(e) and tar(e) respectively. Generally speaking, a DAG is a directed acyclic simple graph. Different from trees, a DAG allows nodes to have multiple incoming edges. In this paper, we only consider DAGs that are unordered, node-labeled, multi-rooted 1 and connected.
Conceptual graphs, including AMR and EDS, are both node-labeled and edge-labeled. It may seem that a DAG without edge labels is inadequate, but this problem can be solved easily using the strategies introduced by Chiang et al. (2018). Take a labeled edge proper_q --BV--> named for example 2 . We can represent the same information by replacing it with two unlabeled edges and a new labeled node: proper_q → BV → named.
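The edge-label elimination trick can be sketched in a few lines of code. This is a minimal illustration with our own data layout; names such as `eliminate_edge_labels` are assumptions, not the paper's implementation.

```python
# Sketch: eliminate edge labels by introducing a fresh labeled node
# per labeled edge, following the strategy of Chiang et al. (2018).

def eliminate_edge_labels(nodes, labeled_edges):
    """nodes: dict node -> label; labeled_edges: list of (src, edge_label, tar).
    Returns (nodes, edges) of an equivalent node-labeled simple graph."""
    new_nodes = dict(nodes)
    new_edges = []
    for i, (src, lab, tar) in enumerate(labeled_edges):
        mid = f"_e{i}"              # fresh node carrying the old edge label
        new_nodes[mid] = lab
        new_edges.append((src, mid))
        new_edges.append((mid, tar))
    return new_nodes, new_edges

# the proper_q --BV--> named example from the text
nodes = {"n1": "proper_q", "n2": "named"}
g_nodes, g_edges = eliminate_edge_labels(nodes, [("n1", "BV", "n2")])
```

After conversion, the BV label lives on a new intermediate node, so the graph is purely node-labeled.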

Previous Work
DAG automata are the core engines of graph transducers (Bohnet and Wanner, 2010; Quernheim and Knight, 2012). In this work, we adopt Chiang et al. (2018)'s design and define a weighted DAG automaton as a tuple M = ⟨Σ, Q, δ, K⟩: • Σ is an alphabet of node labels.
• Q is a finite set of states.
• δ : Θ → K\{0} is a weight function that assigns nonzero weights to a finite transition set Θ. Every transition t ∈ Θ is of the form {q_1, ..., q_m} --σ--> {r_1, ..., r_n}, where the q_i and r_j are states in Q. A transition t takes m states on the incoming edges of a node and puts n states on the outgoing edges. A transition that does not belong to Θ receives a weight of zero.
• K is a semiring of weights.
A run of M on a DAG D = ⟨V, E, ℓ⟩ is an edge labeling function ρ : E → Q. The weight of a run ρ (denoted as δ′(ρ)) is the product of the weights of all local transitions:

δ′(ρ) = ∏_{v ∈ V} δ( ρ(in(v)) --ℓ(v)--> ρ(out(v)) )

Here, for a function f, we use f({a_1, · · · , a_n}) to represent {f(a_1), · · · , f(a_n)}. If K is the boolean semiring, the automaton falls back to an unweighted DAG automaton, or DAG acceptor. An accepting run, or recognition, is a run whose weight is 1, meaning true.
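The run weight can be computed directly from this definition. The following sketch is our own (not the paper's code); it encodes the state multisets as sorted tuples and multiplies the local transition weights over all nodes.

```python
# Sketch: weight of a run rho on a node-labeled DAG, as the product over
# nodes v of delta(rho(in(v)) --label(v)--> rho(out(v))).

def run_weight(nodes, edges, label, rho, delta):
    """nodes: iterable of nodes; edges: list of (src, tar) pairs;
    label: node -> sigma; rho: edge -> state;
    delta: dict transition -> weight (transitions absent from delta weigh 0)."""
    weight = 1.0
    for v in nodes:
        in_states = tuple(sorted(rho[e] for e in edges if e[1] == v))
        out_states = tuple(sorted(rho[e] for e in edges if e[0] == v))
        weight *= delta.get((in_states, label[v], out_states), 0.0)
    return weight

# a two-node DAG a -> b whose single run is accepted with weight 1
w = run_weight(
    ["a", "b"], [("a", "b")], {"a": "A", "b": "B"},
    {("a", "b"): "q"},
    {((), "A", ("q",)): 1.0, (("q",), "B", ()): 1.0})
```

With weights restricted to {0, 1}, this is exactly the boolean-semiring acceptor described above.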

Challenges
The DAG automata defined above can only be used for recognition. In order to generate sentences from semantic graphs, we need DAG transducers. A DAG transducer is a computation model that augments a DAG automaton to transduce well-formed DAGs into other data structures. Quernheim and Knight (2012) focused on feature structures and introduced a DAG-to-Tree transducer to perform graph-to-tree transformation. The input of their transducer is limited to single-rooted DAGs. When the labels of the leaves of an output tree, read in order, are interpreted as words, this transducer can be applied to generate natural language sentences.
When applying Quernheim and Knight's DAG-to-Tree transducer to type-logical semantic graphs, e.g. ERS, there are some significant problems. First, it lacks the ability to reverse the direction of edges during transduction, because it is difficult to maintain acyclicity if edge reversal is allowed. Second, it cannot handle multiple roots, yet we have argued that multi-rootedness is a necessary property of type-logical semantic graphs. It is difficult to decide which node should be the tree root during a 'top-down' transduction, and it is also difficult to merge multiple unconnected nodes into one during a 'bottom-up' transduction. At the risk of oversimplifying, we argue that the function of the existing DAG-to-Tree transducer is to transform a hierarchical structure into another hierarchical structure. Since type-logical semantic graphs are so flat, it is extremely difficult to adapt Quernheim and Knight's design to handle them. Third, there are unconnected nodes with direct dependencies, meaning that their corresponding surface expressions appear very close to each other; the conceptual nodes even_x_deg and steep_a_1 in Figure 4 are an example. It is extremely difficult for the DAG-to-Tree transducer to handle this situation.

Basic Idea
In this paper, we introduce a design for transducers that can perform structure transformation into many data structures, including but not limited to trees. The basic idea is to give up the rewriting method, which directly generates a new data structure piece by piece while recognizing an input DAG. Instead, our transducer obtains target structures as side effects of DAG recognition. The output of our transducer is no longer the target data structure itself, e.g. a tree or another DAG, but a program, i.e. a set of statements licensed by a particular declarative programming language. The target structures are constructed by executing such programs.
Since the main concern of this paper is natural language generation, we take strings, namely sequences of words, as our target structures. In this section, we introduce an extremely simple programming language for string concatenation and then detail how to leverage the power of declarative programming to perform DAG-to-string transformation.

A Declarative Programming Language
The syntax of our declarative programming language for string calculation, denoted as L_c, is given in BNF:

⟨program⟩ ::= ⟨statement⟩*
⟨statement⟩ ::= ⟨variable⟩ '=' ⟨expr⟩ ('+' ⟨expr⟩)*
⟨expr⟩ ::= ⟨variable⟩ | ⟨string⟩

Here a string is a sequence of characters selected from an alphabet (denoted as Σ_out) and can be empty (denoted as ϵ). The semantics of '=' is value assignment, while the semantics of '+' is string concatenation. The values of variables are strings. For every statement, the left-hand side is a variable and the right-hand side is a sequence of string literals and variables that are combined through '+'. Equation (1) presents an example program licensed by this language.
After solving these statements, we can query the values of all variables. In particular, we are interested in S, which is related to the desired natural language expression John want to go 3 .
Using the relations between the variables, we can easily convert the statements in (1) into a rooted tree; the result is shown in Figure 1. This tree is significantly different from the target structures discussed by Quernheim and Knight (2012) or produced by other normal tree transducers (Comon et al., 1997): it represents the calculation that solves the program. Constructing such internal trees is an essential function of the compiler of our programming language.
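A compiler for L_c can be sketched as a memoized recursive solver: resolving each variable's definition on demand implicitly performs the bottom-up evaluation of the internal tree. The statement encoding below is our own assumption, and the example program mirrors the John want to go example rather than the exact statements of Equation (1), which are not reproduced here.

```python
# Sketch: solve an L_c program.  Each statement is (lhs, items), where each
# item is ("var", name) or ("str", literal); '+' is string concatenation.

def solve(statements):
    defs = {lhs: rhs for lhs, rhs in statements}
    values = {}

    def value(var):
        # memoized recursion == bottom-up evaluation of the internal tree
        if var not in values:
            values[var] = "".join(
                value(name) if kind == "var" else name
                for kind, name in defs[var])
        return values[var]

    for lhs in defs:
        value(lhs)
    return values

# an illustrative program whose goal variable S yields "John want to go"
prog = [("S", [("var", "x1")]),
        ("x1", [("str", "John "), ("var", "x2")]),
        ("x2", [("str", "want to go")])]
result = solve(prog)
```

Note that statement order is irrelevant, which is exactly the declarative property the transducer relies on.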

Informal Illustration
We introduce our DAG transducer using a simple example. Figure 2 shows the original input graph D = (V, E, ℓ). Without any loss of generality, we remove edge labels. Table 1 lists the rule set R for this example. Every row represents an applicable transduction rule that consists of two parts. The left column is the recognition part, displayed in the form I --σ--> O, where I, O and σ denote the state set of incoming edges, the state set of outgoing edges and the node label respectively. The right column is the generation part, which consists of one or more templates of statements licensed by the programming language defined in the previous section. In practice, two different rules may have the same recognition part but different generation parts.
Every state q is of the form l(n, d), where l is the state label, n is the count of variables related to q, and d denotes a direction. The value of d can only be r (reversed), u (unchanged) or e (empty). The variable v_l(j,d) represents the jth (1 ≤ j ≤ n) variable related to state q. For example, v_X(2,r) means the second variable of the state X(3,r). There are two special variables: S, which corresponds to the whole sentence, and L, which corresponds to the output string associated with the current node label. It is reasonable to assume that there exists a function ψ : Σ → Σ*_out that maps a particular node label, i.e. a concept, to a surface string. Therefore L is determined by ψ. Now we are ready to apply transduction rules to translate D into a string. The transduction consists of two steps.

Recognition The goal of this step is to find an edge labeling function ρ : E → Q such that, for every node v, ρ(in(v)) --ℓ(v)--> ρ(out(v)) matches the recognition part of a rule in R. The recognition result is shown in Figure 3. The red dashed edges in Figure 3 make up an intermediate graph T(ρ), which is a subgraph of D if edge direction is not taken into account. Sometimes T(ρ) parallels the syntactic structure of an output sentence. For a labeling function ρ, we can construct the intermediate graph T(ρ) by checking the direction parameter of every edge state: an edge whose state has direction u is included unchanged, an edge whose state has direction r is included with its direction reversed, and an edge whose state has direction e is excluded. The recognition process is slightly different from the one in Chiang et al. (2018): since incoming edges with an Empty(0,e) state carry no semantic information, they are ignored during recognition. For example, in Figure 3, we only use e_2 and e_4 to match transduction rules for the node named(John).
Instantiation We use rule(v) to denote the rule applied to node v. Assume s is the generation part of rule(v). For every edge e_i adjacent to v, assume ρ(e_i) = l(n, d). We replace L with ψ(ℓ(v)) and replace every occurrence of v_l(j,d) in s with a new variable x_ij (1 ≤ j ≤ n). We then get a newly generated expression for v. For example, node want_v_1 is recognized using Rule 2, so we replace v_NP(1,u) with x_21, v_VP(1,u) with x_11 and L with want. After instantiation, we get all the statements in Equation (1).

Table 1: Sets of states (Q = {DET(1,r), Empty(0,e), VP(1,u), NP(1,u)}) and rules (R) that can be used to process the graph in Figure 2.

Our transducer is suitable for type-logical semantic graphs because declarative programming brings more freedom to graph transduction: we can arrange the variables in almost any order, without regard to the edge directions in the original graph. Meanwhile, the multi-rootedness problem is solved easily, because generation is based on side effects; we do not need to decide which node is the tree root.
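The instantiation step can be sketched as a simple template rewrite. The encoding below is our own (not the paper's implementation): every occurrence of v_l(j,d) on edge e_i becomes a fresh variable x_ij, and L becomes the surface string ψ(ℓ(v)).

```python
# Sketch: instantiate the generation part of a rule at one node.
# A token is the marker "L", a pair ("edge", i, j) standing for the jth
# variable of the state on edge e_i, or a literal string.

def instantiate(statements, node_string):
    """statements: list of token lists; node_string: psi(label(v))."""
    out = []
    for stmt in statements:
        tokens = []
        for tok in stmt:
            if tok == "L":
                tokens.append(node_string)          # surface string of concept
            elif isinstance(tok, tuple) and tok[0] == "edge":
                tokens.append(f"x{tok[1]}{tok[2]}")  # fresh variable x_ij
            else:
                tokens.append(tok)                   # literal (e.g. '=' or '+')
        out.append(tokens)
    return out

# an illustrative template in the spirit of Rule 2 at node want_v_1:
# the variable on edge e_1 is defined from the variable on edge e_2 plus L
rule = [[("edge", 1, 1), "=", ("edge", 2, 1), "+", "want_placeholder"]]
inst = instantiate([[("edge", 1, 1), "=", ("edge", 2, 1), "+", "L"]], "want")
```

The instantiated token lists are exactly the statements fed to the L_c solver.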

Definition
The formal definition of our DAG transducer described above is a tuple M = (Σ, Q, R, w, V, S) where: • Σ is an alphabet of node labels.
• Q is a finite set of edge states. Every state q ∈ Q is of the form l(n, d) where l is the state label, n is the variable count and d is the direction of state which can be r, u or e.
• R is a finite set of rules. Every rule is of the form I --σ--> ⟨O, E⟩. E can be any kind of statement in a declarative programming language and is called the generation part. I, σ and O have the same meanings as in the previous section; together they are called the recognition part.
• w is a score function. Given a particular run and an anchor node, w assigns a score to measure the preference for a particular rule at this anchor node.
• V is the set of parameterized variables that can be used in every expression.
• S ∈ V is a distinguished, global variable. It is like the 'goal' of a program.

DAG Transduction-based NLG
Different languages exhibit different morpho-syntactic and syntactico-semantic properties. For example, Russian and Arabic are morphologically rich languages and heavily utilize grammatical markers to indicate grammatical as well as semantic functions. On the contrary, Chinese, as an analytic language, encodes grammatical and semantic information in a highly configurational rather than inflectional or derivational way. Such differences affect NLG significantly. When generating Chinese sentences, it seems sufficient to employ our DAG transducer to obtain a sequence of lemmas, since no morphological production is needed. But for morphologically rich languages, we do need to model complex morphological changes.
To provide a unified framework for DAG transduction-based NLG, we propose a two-step strategy to achieve meaning-to-text transformation.
• In the first phase, we are concerned with syntactico-semantic properties and utilize our DAG transducer to translate a semantic graph into sequential lemmas. Information such as tense, aspect, gender, etc. is attached to anchor lemmas. In fact, our transducer generates "want.PRES" rather than "wants"; here, "PRES" indicates a particular tense.
• In the second phase, we are concerned with morpho-syntactic properties and utilize a neural sequence-to-sequence model to obtain final surface strings from the outputs of the DAG transducer.

Inducing Transduction Rules
We present an empirical study on the feasibility of DAG transduction-based NLG. We focus on EDS graphs.

Figure 4: An example graph. The intended reading is "the decline is even steeper than in September", he said. Original edge labels are removed for clarity. Every edge is associated with a span list, and spans are written in the form label<begin:end>. The red dashed edges belong to the intermediate graph T.

EDS-specific Constraints
In order to generate reasonable strings, three constraints must be maintained during transduction. First, for a rule I --σ--> ⟨O, E⟩, a state with direction u in I or a state with direction r in O is called a head state, and its variables are called head variables. For example, the head state of rule 3 in Table 1 is VP(1,u) and the head state of rule 2 is DET(1,r). There is at most one head state in a rule, and only head variables or S can appear as the left-hand sides of statements. If there is no head state, we use the global variable S as the head. Otherwise, the number of statements is equal to the number of head variables, and each statement has a distinct left-hand-side variable. An empty state does not have any variables. Second, every rule contains only no-copying, no-deleting statements; in other words, every variable must be used exactly once in the statements. Third, during recognition, a labeling function ρ is valid only if T(ρ) is a rooted tree.
After transduction, we get a result ρ*. The first and second constraints ensure that, for every node, there is at most one incoming red dashed edge in T(ρ*), and that the 'data' carried by the variables of this incoming red dashed edge (or by S) is distributed over the variables of the outgoing red dashed edges. The last constraint ensures that we can solve all statements by a bottom-up process on the tree T(ρ*).
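The third constraint is a purely structural check on T(ρ). A minimal sketch (our own code, with our own function name) that verifies a candidate intermediate graph is a rooted tree:

```python
# Sketch: check that a directed graph is a rooted tree -- exactly one root,
# every other node has exactly one incoming edge, and all nodes are
# reachable from the root.

def is_rooted_tree(nodes, edges):
    """nodes: iterable of nodes; edges: list of (src, tar) pairs."""
    indeg = {v: 0 for v in nodes}
    for src, tar in edges:
        indeg[tar] += 1
    roots = [v for v, d in indeg.items() if d == 0]
    if len(roots) != 1 or any(d > 1 for d in indeg.values()):
        return False
    # with in-degree <= 1 everywhere, reachability from the single root
    # rules out cycles and disconnected components
    children = {v: [] for v in nodes}
    for src, tar in edges:
        children[src].append(tar)
    seen, stack = set(), [roots[0]]
    while stack:
        v = stack.pop()
        seen.add(v)
        stack.extend(children[v])
    return seen == set(nodes)
```

During decoding this kind of check is applied incrementally, but the invariant being enforced is the same.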

Fine-to-Coarse Transduction
Almost all NLG systems that heavily utilize a symbolic system to encode deep syntactico-semantic information lack robustness, meaning that some input graphs cannot be successfully processed. There are two reasons: (1) some explicit linguistic constraints are not included; (2) exact decoding is too time-consuming, while inexact decoding cannot cover the whole search space. To solve this robustness problem, we introduce a fine-to-coarse strategy to ensure that at least one sentence is generated for any input graph. There are three types of rules in our system, namely induced rules, extended rules and dynamic rules. The most fine-grained rules are applied to obtain precision, while the most coarse-grained rules provide robustness.
In order to extract reasonable rules, we will use both EDS graphs and the corresponding derivation trees provided by ERG. The details will be described step-by-step in the following sections. Figure 4 shows an example for obtaining induced rules. The induced rules are directly obtained by following three steps:

Induced Rules
Finding intermediate tree T EDS graphs are highly regular semantic graphs. It is not difficult to generate T based on a highly customized 'breadth-first' search. The generation starts from the 'top' node (say v_to in Figure 4) given by the EDS graph and traverses the whole graph. No more than thirty heuristic rules are used to decide the visiting order of nodes.
Assigning states EDS graphs also provide span information for nodes. We select a group of lexical nodes which have corresponding substrings in the original sentence. In Figure 4, these nodes are shown in bold font and directly followed by a span. We then merge spans from the bottom of T to the top to assign each red edge a span list. For each node n in T, we collect the spans of every outgoing dashed edge of n into a list s. Some additional spans may be inserted into s: these spans do not occur in the EDS graph but do occur in the sentence, e.g. than<29:33>. Then we merge continuous spans in s and assign the remaining spans in s to the incoming red dashed edge of n. We apply a similar method to the derivation tree; as a result, every inner node of the derivation tree is associated with a span. We then align the edges in T to inner nodes of the derivation tree by comparing their spans. Finally, the edge labels in Figure 4 are generated.
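The merging of continuous spans can be sketched as follows (our own code; spans are (begin, end) character offsets, and two spans are continuous when one ends exactly where the next begins):

```python
# Sketch: collapse continuous (touching) character spans in a span list.

def merge_continuous(spans):
    """spans: iterable of (begin, end) offsets; returns merged, sorted spans."""
    merged = []
    for b, e in sorted(spans):
        if merged and merged[-1][1] == b:   # touches the previous span
            merged[-1][1] = e
        else:
            merged.append([b, e])
    return [tuple(s) for s in merged]

# than<29:33> followed by a span starting at 33 merges into one span
spans = merge_continuous([(29, 33), (33, 40), (50, 55)])
```

The size of the resulting span list is what determines the variable count of the assigned state.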
We use the concatenation of the edge labels in a span list as the state label; the edge labels are joined in order with '_'. Empty(0,e) is the state of the edges that do not belong to T (ignoring direction), such as e_12. The variable count of a state is equal to the size of the span list, and the direction of a state is decided by whether the edge in T related to the state and its corresponding edge in D have different directions. For example, the state of e_5 should be ADV_PP(2,r).
Generating statements After the above two steps, we are ready to generate statements according to how spans are merged. For every node, the spans of the incoming edge represent the left-hand side and those of the outgoing edges represent the right-hand side. For example, the rule for node comp will be:

Extended Rules
Extended rules are used when no induced rule covers a given node. In theory, an unlimited number of modifier nodes, such as PP and ADJ nodes, may point to a given node. We use some manually written rules to slightly change an induced rule (a prototype) by addition or deletion, generating a group of extended rules. The motivation here is to deal with the data sparseness problem.
For a group of selected non-head states in I, such as PP and ADJ, we can produce new rules by removing or duplicating some of them. As a result, we get the two rules below:

Dynamic Rules
During decoding, when neither an induced nor an extended rule is applicable, we create a dynamic rule on the fly. Our rule creator builds a new rule following a Markov assumption: the probability of the outgoing state set O = {q_1, ..., q_n} is factored into a chain of conditional probabilities, each state being conditioned on its predecessor, the incoming states I and the node label; here I and O have the same meanings as before. Though they are unordered multisets, we give them an explicit alphabetical order by their edge labels. There is also a group of hard constraints to make sure that the predicted rules are well-formed, as the definition in §5 requires. This Markovization strategy is widely utilized by lexicalized and unlexicalized PCFG parsers (Collins, 1997; Klein and Manning, 2003). For a dynamic rule, all variables in the rule appear in the statement. We use a simple perceptron-based scorer to assign every variable a score and arrange the variables in decreasing order.
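The Markovized scoring of a dynamic rule can be sketched as below. This is our own code; `cond_prob` stands in for the learned conditional model, which we do not specify here.

```python
# Sketch: score an outgoing state multiset under a first-order Markov
# factorization.  Sorting fixes an explicit (alphabetical) order on the
# unordered multiset, as described in the text.

def markov_rule_prob(out_states, incoming, sigma, cond_prob):
    """cond_prob(q, prev, incoming, sigma) -> P(q | prev, I, sigma);
    prev is None for the first state in the fixed order."""
    prob, prev = 1.0, None
    for q in sorted(out_states):        # alphabetical order over labels
        prob *= cond_prob(q, prev, incoming, sigma)
        prev = q
    return prob
```

In the real system the chain probabilities would come from counts over the induced rules; here any callable with that signature can be plugged in.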

Set-up
We use DeepBank 1.1 (Flickinger et al., 2012), i.e. gold-standard ERS annotations, as our main experimental data set to train a DAG transducer as well as a sequence-to-sequence morpholyzer, and WikiWoods (Flickinger et al., 2010), i.e. ERS annotations automatically generated by ERG, as an additional data set to enhance the sequence-to-sequence morpholyzer. The training, development and test data sets are taken from DeepBank and split according to DeepBank's recommendation. There are 34,505, 1,758 and 1,444 sentences in the training, development and test data sets respectively (all disconnected graphs, together with their associated sentences, are removed). We use a small portion of the WikiWoods data, ca. 300K sentences, for the experiments.
From the training data set, 37,537 induced rules are directly extracted and 447,602 extended rules are obtained. During DAG recognition, more than one rule may be applicable at a particular position. In this case, we need a disambiguation model as well as a decoder to search for a globally optimal solution. In this work, we train a structured perceptron model (Collins, 2002) for disambiguation and employ a beam decoder. The perceptron model used by our dynamic rule generator is trained with the induced rules. For the sequence-to-sequence model, we use the open-source tool OpenNMT 4 .

The Decoder
We implement a fine-to-coarse beam search decoder. Given a DAG D, our goal is to find the highest-scored labeling function ρ̂:

ρ̂ = argmax_ρ ∑_{i=1}^{n} ∑_{j} w_j · f_j(v_i, rule(v_i))

where n is the node count and f_j(·, ·) and w_j represent a feature and the corresponding weight, respectively. The features are chosen from the context of the given node v_i. We perform 'top-down' search to translate an input DAG into a morphology-function-enhanced lemma sequence. Each hypothesis consists of the current DAG graph, the partial labeling function, the current hypothesis score and other graph information used to perform rule selection. The decoder keeps the corresponding partial intermediate graph T acyclic during decoding. The algorithm used by our decoder is displayed in Algorithm 1. Function FindRules(h, n, R) uses hard constraints to select rules from the rule set R according to contextual information; it also performs an acyclicity check on T. Function Insert(h, r, n, B) creates and scores a new hypothesis made from the given context and then inserts it into beam B.
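Algorithm 1 can be sketched as a generic beam search. This skeleton is our own; `find_rules`, `extend` and `score` abstract over FindRules, Insert and the perceptron-based hypothesis scoring.

```python
# Skeleton: beam search over nodes in a fixed 'top-down' visiting order.
# find_rules(h, n) returns the rules passing the hard constraints
# (including the acyclicity check on T); extend(h, r, n) builds a new
# hypothesis; score(h) ranks hypotheses; beam_size bounds the beam.

def beam_decode(nodes, initial_hyp, find_rules, extend, score, beam_size):
    beam = [initial_hyp]
    for n in nodes:
        next_beam = []
        for h in beam:
            for r in find_rules(h, n):
                next_beam.append(extend(h, r, n))
        next_beam.sort(key=score, reverse=True)
        beam = next_beam[:beam_size]        # prune to the beam width
    return beam[0] if beam else None
```

With hypotheses as tuples of chosen rule ids and `score = sum`, the skeleton already performs a complete (toy) search.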
Graphs for which no natural language sentence can be generated are removed when conducting the BLEU evaluation.
As we can see from Table 2, using only induced rules achieves the highest accuracy, but the coverage is not satisfactory. Extended rules lead to a slight accuracy drop but greatly improve coverage (by ca. 10%). Using dynamic rules, we observe a significant accuracy drop; nevertheless, we are then able to handle all EDS graphs. This full-coverage robustness may benefit many NLP applications. The lemma sequences generated by our transducer are very close to the gold ones. This means that our model actually works and that most reordering patterns are handled well by the induced rules.
Compared to the AMR generation task, our transducer on EDS graphs achieves much higher accuracy. To determine how much of the improvement comes from the data and how much from our DAG transducer, we implement a purely neural baseline. The baseline converts a DAG into a concept sequence by a pre-order DFS traversal of the intermediate tree of this DAG, and then uses a sequence-to-sequence model to transform this concept sequence into a lemma sequence for comparison. This is essentially an implementation of Konstas et al.'s model, but evaluated on the EDS data. We can see that on this task, our transducer is much better than a pure sequence-to-sequence model on DeepBank data.

Table 3: Efficiency of our NL generator.

Table 3 shows the efficiency of the beam search decoder with a beam size of 128. The platform for this experiment is x86_64 GNU/Linux with two Intel Xeon E5-2620 CPUs. The second and third columns give the average and the maximal time (in seconds) to translate an EDS graph. Using dynamic rules slows down the decoder to a great degree. Since the data for the experiments is newswire data, i.e. WSJ sentences from the PTB (Marcus et al., 1993), the input graphs are quite large on average. Still, the system produces more than 5 sentences per second on average on CPU. We consider this a promising speed.

Conclusion
We extend the work on DAG automata by Chiang et al. (2018) and propose a general method to build a flexible DAG transducer. The key idea is to leverage a declarative programming language to minimize the computational burden of a graph transducer. We think many NLP tasks that involve graph manipulation may benefit from this design. To exemplify our design, we develop a practical system for the semantic-graph-to-string task. Our system is accurate (BLEU 68.07), efficient (more than 5 sentences per second on a CPU) and robust (full coverage). The empirical evaluation confirms the feasibility of applying a DAG transducer to resolve NLG, as well as the effectiveness of our design.