Exact yet Efficient Graph Parsing, Bi-directional Locality and the Constructivist Hypothesis

A key problem in processing graph-based meaning representations is graph parsing, i.e. computing all possible derivations of a given graph according to a (competence) grammar. We demonstrate, for the first time, that exact graph parsing can be efficient for large graphs and with large Hyperedge Replacement Grammars (HRGs). The advance is achieved by exploiting locality as terminal edge-adjacency in HRG rules. In particular, we highlight the importance of 1) a terminal edge-first parsing strategy, 2) a categorization of a subclass of HRG, i.e. what we call Weakly Regular Graph Grammar, and 3) distributing argument-structures to both lexical and phrasal rules.


Introduction
Language production, though as important as language understanding, has received very limited theoretical and empirical research attention. A fundamental problem in modeling language production is parsing meaning representations, i.e. computing all possible analyses of a given meaning representation (MR) according to a (competence) grammar. In theory, the worst-case complexities of existing algorithms are exponential or high-degree polynomial w.r.t. grammar size and input length. In practice, there are few systems that can parse large but frequent MRs with a realistic, wide-coverage grammar in a reasonable time.
The major contribution of this paper is an exact yet efficient method to parse MRs in the framework of graph-based semantic representations (Koller et al., 2019) and Hyperedge Replacement Grammar (Drewes et al., 1997). The ability to enumerate all possible analyses of a graph facilitates surface realization, grammar induction, recursive graph embedding, etc. The advance in efficiency comes from exploiting the locality of HRG rules from the rarely discussed perspective of language production, the reverse direction of language understanding. We discuss locality in the sense of terminal edge-adjacency and develop a locality-centric complexity analysis of the de facto standard algorithm introduced by Chiang et al. (2013). Our analysis motivates (1) a terminal edge-first parsing strategy, (2) a categorization of a subclass of HRG, which we call Weakly Regular Graph Grammar, and (3) computational support for the constructivist hypothesis in theoretical linguistics. Altogether, our analysis leads to a substantial improvement in practical graph parsing. An MR corresponding to a Wall Street Journal sentence, with 5 to 50 conceptual nodes, can receive a full-forest analysis in 0.089 seconds on average with a large-scale comprehensive grammar; even semantic graphs with ca. 80 conceptual nodes can be processed in less than 0.5 seconds.

A Graph-Structured Syntax-Semantics Interface
Linguistically-informed graph parsing needs a precise model of the syntax-semantics interface. To this end, we need to precisely describe elementary structures corresponding to linguistic units at (morphological,) lexical and phrasal levels, and precisely describe the MERGE operation on two linguistic units. Under the umbrella of graph-based MRs, we employ hypergraphs and HRGs (Drewes et al., 1997) to achieve the two goals. Throughout this paper, we define an edge-labeled, ordered hypergraph over a finite alphabet Σ as a tuple G = (V, E, ℓ), where V is a finite set of nodes, E ⊆ V+ is a finite set of hyperedges, and ℓ : E → Σ is a labeling function. A hyperedge can connect more than two nodes, or a single node. Labels are associated with edges but not nodes. The set of nodes connected by edge e is denoted by V(e), and the set of edges connected to node v is denoted by E(v). We use graph and hypergraph interchangeably, and similarly for edge and hyperedge. Fig. 1 presents an example that contains a raising construction. The graph associated to the sentence (indicated by S) is derived along with a syntactic tree, in which the leaves and internal nodes are associated with graphs (indicated by x) as lexical and phrasal interpretations.

Figure 1: An HRG-based syntactico-semantic derivation for He really seems to care. The right part shows examples of HRG rules. Throughout this paper, we use filled black nodes to indicate external nodes, arrows to indicate single-node edges and directed arcs to indicate edges connected to two nodes. The edge labeled Y in rule γ2 connects more than two nodes, whose order is indicated by tiny numbers around lines. Nodes in an HRG rule and subgraphs of an input graph are mentioned with numbers and characters respectively. Since nodes receive no informative labels, we use single-node edges with underlined terminal labels to represent concepts, e.g. "pron." Other terminal labels, e.g. "arg1," express semantic roles.
The key operation in semantic composition is to glue two graphs, say G1 and G2. It is obvious that not every node in G1 is visible to G2 and vice versa. To emphasize this point, we augment the representation of a hypergraph (V, E, ℓ) with an ordered list of external nodes Vx ∈ V+ and get a hypergraph fragment H = (V, E, ℓ, Vx). The number of external nodes is denoted by rank(H).
Graph gluing can be manipulated by an HRG G = (N , T , P, S), where N and T are two finite disjoint alphabets of nonterminal and terminal symbols respectively, S ∈ N is the start symbol, and P is the finite collection of rewriting rules in the form of A → R. The left hand side (LHS) A belongs to N , and the right hand side (RHS) R is a hypergraph fragment over N ∪ T . See γ 1 to γ 10 in Fig. 1 for example.
A carefully designed HRG can be linguistically elegant, in that its rules are consistent with state-of-the-art linguistic analysis. For instance, raising and control constructions receive a principled analysis with the rules in Fig. 1. HRG is comparable to other popular grammar formalisms, such as Combinatory Categorial Grammar (CCG; Steedman, 1996, 2000). See Fig. 2 for an illustration.

Figure 2: A comparison of CCG and HRG for the transitive verb like, with CCG category (S\NPy)/NPx and semantics λx.λy.like(y, x). The external nodes 1, 2 and 3 correspond to S, NPy and NPx in the syntactic category respectively.

Graph Parsing with a General HRG
In the framework of graph-based MRs, a key problem is graph parsing: computing all possible analyses of a given semantic graph according to a grammar. Fig. 3 demonstrates the target structure of graph parsing: the derivation forest. A derivation forest allows us to efficiently enumerate every derivation. Coupled with a local score function that evaluates the goodness of a rule application, a graph parser can further score a particular derivation tree or the full forest as a whole.
Though essential, graph parsing is only partially understood. In this section, we summarize the state-of-the-art algorithm for graph parsing with HRGs (Chiang et al., 2013), and then evaluate its efficiency with a wide-coverage grammar. The context-freeness of HRG allows us to represent a derivation as a tree, and a set of derivations as a derivation forest, which is the output structure of graph parsing.

Figure 3: In the derivation forest, a dashed rectangle (node) corresponds to a subgraph, which may be immediately built with different HRG rules. Each rule application is separately represented as a box. Necessary and sufficient information includes the BRs of Gt, GLt as well as GRt, and the rule itself.

A Dynamic Programming Algorithm
Chiang et al.'s algorithm is a dynamic programming algorithm, in which a collection of in-process subgraphs are iteratively recognized as solutions to subproblems. Two key techniques are introduced, concerning (1) how to pack a subgraph and (2) how to expand recognized subgraphs. A subgraph is compactly encoded by its boundary representation (BR), defined as follows. Assume I is a subgraph of a graph H. A boundary node of I is an external node of H or a node incident to an edge that is not in I. A boundary edge of I is an edge in I which connects to a boundary node. Let m be an arbitrarily chosen marker node in H. The BR of I is the tuple b(I) = ⟨bn(I), be(I), m ∈ I⟩, where bn(I) is the set of I's boundary nodes, be(I) is the set of I's boundary edges, and (m ∈ I) is a boolean value indicating whether m is in I. Take P1 in Fig. 5 for example. The dotted box shows a subgraph that has been recognized as Y; bn(Y) = {C, A, F}, and D and G are irrelevant to further recognition. Now consider combining two subgraphs recognized as nonterminals X and Y according to γ2 in Fig. 5. To incrementally match the elements of a rule, e.g. γ2, in an edge-by-edge way, Chiang et al. propose to leverage a tree decomposition¹ TR of the RHS of an HRG rule A → R.

¹A tree decomposition T of a graph fragment H = ⟨V, E, ℓ, Vx⟩ is a tree in which every node η is associated with a tuple ⟨Vη, Eη⟩. T must satisfy the following properties: (1) for each v ∈ V, there is a node η such that v ∈ Vη; (2) for each e ∈ E, there is exactly one node η such that e ∈ Eη and V(e) ⊆ Vη; (3) for each v ∈ V, all nodes in T that cover v are connected; (4) for the root ηr of T, Vx ⊆ Vηr.
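The boundary representation follows directly from its definition. Here is a small Python sketch (encoding and names are our own) that, given a subgraph I of a host graph H, returns its boundary nodes, boundary edges, and the marker bit:

```python
def boundary_representation(h_edges, h_external, sub, marker):
    """Boundary representation b(I) = <bn(I), be(I), marker in I>.

    h_edges:    dict edge id -> tuple of nodes (the host graph H)
    h_external: set of H's external nodes
    sub:        set of edge ids forming the subgraph I
    marker:     an arbitrarily chosen marker node of H
    """
    sub_nodes = {v for e in sub for v in h_edges[e]}
    # A boundary node is external in H, or incident to an edge outside I.
    bn = set()
    for v in sub_nodes:
        incident_outside = any(
            v in t for e, t in h_edges.items() if e not in sub)
        if v in h_external or incident_outside:
            bn.add(v)
    # A boundary edge is an edge of I touching a boundary node.
    be = {e for e in sub if any(v in bn for v in h_edges[e])}
    return bn, be, marker in sub_nodes

# Tiny example: path A -next- B -next- C, external node A; I = {first edge}.
H = {"e1": ("A", "B"), "e2": ("B", "C")}
bn, be, has_m = boundary_representation(H, {"A"}, {"e1"}, "C")
```

In the example, A is a boundary node because it is external, and B because it touches the edge outside I; the marker C lies outside I.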
A tree decomposition TR is called nice if every node of TR is one of: (1) a leaf node associated with the empty graph; (2) a unary node which introduces exactly one edge; (3) a binary node which introduces no edges. Throughout, for convenience, let η denote a node of TR and R⊵η denote the subgraph of R whose edges are induced by nodes in the subtree rooted at η. If η is binary, its children are denoted by η1 and η2. If η is unary, the edge introduced by it and its only child are denoted by e and η1 respectively.
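The three node types are easy to verify mechanically. A small Python checker (our own encoding: a tree as a child map, and each edge assigned to exactly one node per property (2) of tree decompositions):

```python
def is_nice(children, edges_at):
    """Check the 'nice' shape of a tree decomposition.

    children: node -> list of child nodes
    edges_at: node -> set of edges introduced at that node
    Leaves must introduce nothing, unary nodes exactly one edge,
    binary nodes nothing; larger fan-out is disallowed.
    """
    for node, kids in children.items():
        introduced = edges_at.get(node, set())
        if len(kids) == 0:        # leaf: empty graph
            if introduced:
                return False
        elif len(kids) == 1:      # unary: exactly one edge
            if len(introduced) != 1:
                return False
        elif len(kids) == 2:      # binary: no edges
            if introduced:
                return False
        else:                     # fan-out > 2 never occurs in a nice tree
            return False
    return True

# A binary root over two unary chains, each introducing one terminal edge.
children = {"root": ["u1", "u2"], "u1": ["l1"], "u2": ["l2"],
            "l1": [], "l2": []}
edges_at = {"u1": {"e_arg1"}, "u2": {"e_arg2"}}
```

Any tree decomposition can be converted to this shape without increasing its width, which is why the inference rules below only need to handle these three cases.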
Oriented by the fundamental architecture of chart parsing/generation (Kay, 1996), TR is used to define active and passive chart items and inference rules that process such items. A small number of inference rules (as shown in Fig. 5) are sufficient to control the merging of chart items. R0 is applied at the root node of TR. R1, R2.T, R2.NT and R3 are applied at leaf nodes, unary nodes that introduce a terminal edge, unary nodes that introduce a nonterminal edge, and binary nodes respectively. e* is an edge of G such that ℓ(e) = ℓ(e*). {e → e*} or {e → X} represents the mapping that sends each node of e to the corresponding node of e* or X. ψ(XR) denotes a list generated by applying ψ to each node of XR in order. Refer to the original paper for a complete description of the algorithm. See the bottom part of Fig. 5 for a partial recognition along with T1 in Fig. 4.

Treewidth-centric Complexity Analysis
It is an advantage of using tree decompositions that the treewidth of a grammar bounds the number of boundary nodes which we must keep track of during parsing. When applying an inference rule at η, all mentioned boundary nodes are called active nodes and denoted by A(η). A(η) = bn(R⊵η1) ∪ bn(R⊵η2) if η is binary, and A(η) = bn(R⊵η1) ∪ V(e) otherwise. Let k be the treewidth of the grammar and d be the maximum degree of any node in the input graph. The number of rule instantiations at η is in O(n^{|A(η)|} · 3^{d·|A(η)|}). The first factor, n^{|A(η)|}, is the number of ways of mapping active nodes in a rule to nodes in the input graph. The second factor, 3^{d·|A(η)|}, is an upper bound on the realizations of boundary edges. Chiang et al. derive the overall complexity of the algorithm by a parallel analysis.

Measuring Practical Performance
Successful integration of two chart items according to an inference rule requires that the items are disjoint and can make up a new bijection. When two chart items pass this check, the ensuing integration is viewed as a successful rule instantiation, and its operation cost is taken into account. When two chart items fail the check, there is no successful rule instantiation, and the operation cost of this failed integration is overlooked by the treewidth-centric complexity analysis. However, the cost of discovering that an integration is impossible is comparable to that of a successful integration. Measuring practical performance with respect to both successful and failed integration operations is therefore a necessary complement to the theoretical analysis, especially when the number of failed integrations is prominent. In the following experiments, we report the exact numbers of successful (indicated as #Succ) and total (successful + failed; indicated as #Total) integrations.

Evaluation with a Realistic Grammar
To profile the parsing algorithm, we conduct experiments on the Elementary Dependency Structure (EDS; Oepen and Lønning, 2006) graphs provided by DeepBank v1.1 (Flickinger et al., 2012). The data is separated into training, development and test sets according to the standard setup for string parsing. We obtain a wide-coverage, linguistically-meaningful grammar² by applying the grammar extraction algorithm described in Chen et al. (2018). The grammar is lexicalized (LxG), in that argument-structures are lexically encoded, like almost all popular deep grammars used in NLP. Tab. 1 shows the statistics of the rules. Referring to Bolinas³, we re-implement the algorithm in C++ and test its efficiency on 4500 EDS graphs that are randomly selected from the training set, with sizes in the range of 5 to 50. By the size of a graph, we mean the number of its nodes. If the number of total subgraphs allocated during parsing is larger than 2.6 × 10⁷, the parser throws an out-of-memory error (OOM). In all the following tables, all statistics are average values over instances which successfully receive derivation forests. The platform for all experiments is x86_64 GNU/Linux with one Intel(R) Core(TM) i7-5930K CPU at 3.50GHz.
Tab. 2 summarizes the results. For small graphs, the algorithm achieves a promising speed. For larger graphs, most of the parsing time is wasted on failed integrations. Fig. 6 presents the numbers of successful and total integrations. We can clearly see that the difference between the two types of integrations increases very quickly as the input graph grows. In §4.5 we will discuss how to reduce failed integrations.

²We only consider rules the RHS of which is connected. A very small portion of DeepBank graphs result in disconnected rules and are thus removed. These graphs contain arguable annotations related to (1) distributive readings of coordination, (2) quantifiers of bare NPs, and/or (3).

Table 2: Parsing performance of the algorithm of Chiang et al. (2013). The first column is the size of input graphs. The last column is the number of graphs in the given range.

Speeding Up by Exploiting Locality

Locality as Edge-Adjacency
Some notion of locality is conceptually necessary for studying complex structures. Adjacency is a key perspective for expressing locality in some linguistic theories, such as CCG (Steedman, 2000, p. 54):

(1) The Principle of (String-)Adjacency: Combinatory rules may only apply to finitely many phonologically realized and string-adjacent entities.
Almost all string parsing algorithms benefit from this string-adjacency. Now let us picture string-adjacency in a graph language. Fig. 7 gives a visualization of the linear chain structure of a word sequence. The terminal edge labeled next in γ11 explicitly displays a local relation: nodes 2 and 3 can be recognized almost simultaneously. String-adjacency turns out to be terminal edge-adjacency from a graph-theoretic view. What does terminal edge-adjacency actually mean? From a semiotic perspective, a key property of any language system, natural or artificial, is the form-meaning connection. A particular form triggers a particular meaning. What can be observed can be directly recognized, and then makes other things recognizable. In language production, the input is an MR, and in the graph-based framework, it is terminal edges that are directly observable. In this way a terminal edge makes the nodes connected to it co-recognizable.
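The linear-chain view of a sentence is easy to write down concretely. A Python sketch (our own encoding, in the spirit of Fig. 7): each token becomes a single-node terminal edge, and string-adjacency becomes a terminal next edge between consecutive positions.

```python
def string_as_graph(tokens):
    """Encode a word sequence as a hypergraph: one node per position,
    a single-node terminal edge per token, and a terminal 'next' edge
    between adjacent positions (string-adjacency as edge-adjacency)."""
    nodes = list(range(len(tokens)))
    edges = []
    for i, w in enumerate(tokens):
        edges.append((w, (i,)))                 # concept edge, one node
        if i + 1 < len(tokens):
            edges.append(("next", (i, i + 1)))  # adjacency edge
    return nodes, edges

nodes, edges = string_as_graph(["he", "really", "seems"])
```

A string parser that advances over adjacent words is, in this encoding, simply following next edges, which is exactly the terminal-edge locality the rest of this section exploits for general graphs.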
The existing algorithms, including Chiang et al. (2013) and Groschwitz et al. (2015), do not consider terminal edge-adjacency. We will show that capturing locality in this sense is beneficial, just like what successful string parsing algorithms do.

Locality-centric Complexity Analysis
Some active nodes are not independent of each other if we take terminal edge-adjacency into consideration. We call a graph consisting of only terminal edges a terminal graph. For a graph fragment H, we use term(H) to denote the subgraph of H that is induced by all and only terminal edges. We first informally illustrate the idea of dependency between nodes in a rule, and then present a precise analysis. Fig. 8 shows a prototype of a binary node in TR.

Proposition 1. Consider a graph G and a connected terminal graph Rt. If there is a node v1 in Rt that is tied to a node v1* in G, then finding all isomorphisms of Rt in G can be completed in O(d^{mt}) time, where mt is the number of edges in Rt and d is the maximum degree of any node in G.
Proof. We perform a depth-first search over Rt starting at v1, arranging all edges of Rt as a sequence according to the order in which they are visited. Let the edge sequence be e1, e2, ..., emt. We match edges in this sequence one by one. When we come to match ej, at least one of its nodes, say v, also occurs in an earlier edge, because the DFS order guarantees that ej is adjacent to an already-visited edge. In other words, v is already tied to a node v* ∈ G. As a result, the number of possible mappings of ej is at most d, because the degree of v* is at most d. Therefore, the number of isomorphisms of Rt is in O(d^{mt}), and all isomorphisms can be found in O(d^{mt}) time.
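The proof translates directly into an enumeration procedure. A Python sketch (our own naming; edges are simplified to binary directed labeled edges): given a seed mapping v1 → v1*, take the edges of the connected terminal graph Rt in DFS order and extend the mapping edge by edge, branching over at most d candidate edges at an already-matched node.

```python
def match_terminal(rt_edges, g_adj, seed_rule_node, seed_graph_node):
    """Enumerate embeddings of a connected terminal graph R_t into G.

    rt_edges: list of (label, src, dst) edges of R_t in DFS order from
              seed_rule_node, so each edge touches an already-matched node.
    g_adj:    node -> list of (label, src, dst) edges of G incident to it.
    Yields injective mappings {rule node -> graph node}; the branching
    factor per edge is at most d (the maximum degree in G), giving
    O(d^{m_t}) candidates overall.
    """
    def extend(i, mapping):
        if i == len(rt_edges):
            yield dict(mapping)
            return
        label, src, dst = rt_edges[i]
        anchor = src if src in mapping else dst   # already-tied node v
        for gl, gsrc, gdst in g_adj[mapping[anchor]]:
            if gl != label:
                continue
            cand = {src: gsrc, dst: gdst}
            if any(mapping.get(r, g) != g for r, g in cand.items()):
                continue  # conflicts with nodes tied earlier
            new = {**mapping, **cand}
            if len(set(new.values())) == len(new):  # keep it injective
                yield from extend(i + 1, new)
    yield from extend(0, {seed_rule_node: seed_graph_node})

# R_t: 1 -arg1-> 2 -arg2-> 3;  G: a -arg1-> b -arg2-> c.
rt = [("arg1", 1, 2), ("arg2", 2, 3)]
g_adj = {
    "a": [("arg1", "a", "b")],
    "b": [("arg1", "a", "b"), ("arg2", "b", "c")],
    "c": [("arg2", "b", "c")],
}
```

With the seed 1 → a there is exactly one embedding; crucially, the input-graph size n never enters the search once the seed is fixed.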
When l active nodes lie in one connected component of term(R⊵η), these nodes are mutually dependent. By Proposition 1, the number of valid node mappings of these l nodes is bounded by O(n · d^{mt}) rather than O(n^l).
Definition 1. For any node η in TR, δ(η) denotes the size of a maximal subset of A(η) such that all nodes in this subset are pairwise independent. We use S(η) to denote one such maximal subset. Similar to treewidth, we define δ(TR) = max_{η in TR} δ(η), and δ(R) as the minimum δ over all tree decompositions of R.
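δ(η) can be computed from the connected components of term(R⊵η): active nodes in the same component are mutually dependent, so a maximal independent subset picks at most one representative per component, plus every active node outside the terminal graph. A Python sketch (our own encoding) using union-find:

```python
def delta(active_nodes, term_edges):
    """Size of a maximal pairwise-independent subset of active nodes:
    no two chosen nodes may lie in the same connected component of the
    terminal graph, given as a list of node tuples (hyperedges)."""
    # Union-find over the nodes of the terminal graph.
    parent = {}
    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    for edge in term_edges:
        for v in edge[1:]:
            parent[find(edge[0])] = find(v)
    # One representative per component touched, plus isolated active nodes.
    seen, d = set(), 0
    for v in active_nodes:
        root = find(v) if v in parent else v
        if root not in seen:
            seen.add(root)
            d += 1
    return d
```

For example, with active nodes {1, 2, 3} and a single terminal edge tying 1 to 2, only one of {1, 2} can be selected, so δ = 2; with no terminal edges all three are independent.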
Proposition 2. For any graph fragment R, δ(R) ≤ k + 1 where k is the treewidth of R.
Proof. This proposition is trivial. For any η, we have δ(η) ≤ |A(η)| ≤ |Vη| ≤ k + 1 (Proposition 3 in Chiang et al.). By the definition of δ(R), taking a width-minimizing tree decomposition yields δ(R) ≤ k + 1.

Proposition 3. The number of ways of instantiating any inference rule is in O(n^{δ*} d^{mg} 3^{d·ng}), where ng/mg is the maximum number of nodes/terminal edges of any RHS in G and δ* is the maximum δ of any RHS in G.
Proof. When applying an inference rule at η, we first select the mappings for the nodes in S(η) independently. By the definition of S(η), for any active node v ∉ S(η), there must be a node u ∈ S(η) such that u and v belong to the same connected component c of term(R⊵η); once u is mapped, Proposition 1 bounds the mappings of v within c. Therefore, the number of possible mappings for all active nodes is in O(n^{|S(η)|} d^{mg}) ⊆ O(n^{δ*} d^{mg}). The analysis for boundary edges is similar to Chiang et al.'s. The only difference is that the tree decomposition which minimizes δ may not minimize the treewidth k. Since k ≤ ng − 1, the number of ways of realizing boundary edges is in O(3^{d·ng}).
We can conclude from Propositions 2 and 3 that our locality-centric analysis is tighter than the treewidth-centric one, and the upper bound of time complexity may decrease for some restricted HRGs. In Fig. 4, the treewidth of T2 is 3, but δ(T2) = 1. So the number of rule instantiations that can be applied along T2 is in O(n) instead of O(n^4). In §4.3, we will introduce Weakly Regular Graph Grammar (WRGG), a new subclass of HRG, the δ of which is more intuitively understandable.

Weakly Regular Graph Grammar
We discuss prototypes of HRG rules, investigating their key properties in a linguistic context. We then formally define WRGG that reflects the linguistic emphasis and also show that WRGG is actually a very expressive subclass of HRG.
Firstly, the HRG rules under discussion allow at most two nonterminals on the RHS. Computationally speaking, we can transform a multi-branching rule into multiple binary rules without loss of expressiveness, just as we can obtain a CFG in Chomsky Normal Form for any CFG. Linguistically speaking, multi-branching rules have been removed from generative linguistic theories since at least the Minimalist Program (Chomsky, 1995). Fig. 9 presents four prototypes under this binary restriction: γ3, γ6, γ7, γ8 and γ9 in Fig. 1 are of T0; γ1, γ4, γ5 and γ10 are of T1; and γ2 is of T3. Secondly, for a lexicalized grammar, most rules are of T0 or T1, since constructions rarely contribute semantic material. If a rule introduces heavy constructional meaning, it may affect one of its immediate constituents (T2) or bridge the meanings of both of its immediate constituents (T3), and hardly affects its immediate constituents separately. Even if a rule has multiple terminal components, we can replace it with several rules of T0-T3. Thirdly, a node that is only connected to a nonterminal edge is a kind of placeholder, in that it does not affect the current semantic composition but will be used in the future; otherwise it would have been removed in a previous step. Finally, we do not consider disconnected RHSs because they yield disconnected graphs.
Definition 2. A node of a graph fragment G is free if it is incident only to nonterminal edges. The number of free nodes of G is denoted by f(G).
Definition 3. A weakly regular rule A → R satisfies the following conditions: (1) R is connected; (2) term(R) is an empty graph or a connected graph; (3) if a free node of R is incident to only one edge, it is also an external node.
Definition 4. An HRG is weakly regular, if all of its rules are weakly regular.
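The three conditions of Definition 3 are directly checkable. A Python sketch (encoding and names are our own), where a rule's RHS is given by its edges, labels, and external nodes, and a free node is one incident only to nonterminal edges:

```python
def connected(nodes, edge_tuples):
    """Connectivity of a hypergraph given as an iterable of node tuples."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {v: set() for v in nodes}
    for t in edge_tuples:
        for v in t:
            adj[v] |= set(t)
    seen, todo = set(), [next(iter(nodes))]
    while todo:
        v = todo.pop()
        if v not in seen:
            seen.add(v)
            todo.extend(adj[v])
    return seen == nodes

def is_weakly_regular(nodes, edges, labels, external, nonterminals):
    """Check Definition 3 for a rule RHS.
    edges: edge id -> tuple of nodes; labels: edge id -> label."""
    term = {e: t for e, t in edges.items() if labels[e] not in nonterminals}
    term_nodes = {v for t in term.values() for v in t}
    # (1) R is connected.
    if not connected(nodes, edges.values()):
        return False
    # (2) term(R) is empty or connected.
    if term and not connected(term_nodes, term.values()):
        return False
    # (3) a free node incident to only one edge must be external.
    for v in nodes:
        incident = [e for e, t in edges.items() if v in t]
        free = all(labels[e] in nonterminals for e in incident)
        if free and len(incident) == 1 and v not in external:
            return False
    return True

# RHS with one terminal 'arg1' edge and one nonterminal 'X' edge.
edges = {"e1": (1, 2), "e2": (2, 3)}
labels = {"e1": "arg1", "e2": "X"}
```

With external nodes {1, 3} the rule is weakly regular; shrinking the externals to {1} leaves node 3 as a non-external free node of degree one, violating condition (3).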
Proposition 4. For a weakly regular rule A → R with a binary RHS, f(R) ≤ δ(R) ≤ f(R) + 1.

The proof of this proposition can be found in the appendix. The tree shown in Fig. 10 is a valid nice tree decomposition of R, and the δ of this tree is f(R) or f(R) + 1. We argue that for parsing with a binary WRGG, the number of free nodes is the more meaningful quantity, and we can use the tree decomposition shown in Fig. 10 rather than a tree decomposition with minimum treewidth.
Courcelle (1991) introduces Regular Graph Grammar (RGG). It is provable that RGG is a subclass of WRGG. There are no free nodes in RGG and graph parsing with an RGG can be finished in linear time by applying Chiang et al.'s algorithm. This result is comparable to another algorithm proposed by Gilroy et al. (2017). However, the strong restrictions of RGG make it too weak to model linguistic structures. WRGG is much more linguistically adequate.

Distributed Argument-Structure
We value the trigger role played by terminal edges in an HRG rule. Now let us revisit the derivation governed by a lexicalized grammar. It is obvious that lexical rules use up all terminal edges at the initial stage of syntactico-semantic composition. If we instead distribute terminal edges across all rules, both lexical and phrasal, we obtain a reduced number of free nodes on average, and in exactly this way improve graph parsing remarkably. The idea of distributing argument-structures exhibits a constructivist perspective, a competing hypothesis to the lexicalism that has dominated our field for decades, since at least Bresnan and Kaplan (1982). Constructivist approaches to argument structure have recently been discussed in different theoretical frameworks, including but not limited to Distributed Morphology (Halle and Marantz, 1993, 1994) and Sign-Based Construction Grammar (Boas and Sag, 2012). The advantage of distributed argument-structure under the consideration of language production is computational support for many constructivist approaches. Fig. 11 demonstrates a derivation with a construction grammar. Comparing γ12 to γ4 and γ13 to γ5, we can clearly see that δ is significantly reduced. A comparison of the lexical rules also confirms the importance of distributed argument-structure.

Fast Accessing of Chart Items
We complete our discussion of locality by considering the edge-zero case, i.e. unifying nodes. In Fig. 8, when we try to integrate R⊵η1 and R⊵η2, we must make sure that the three nodes on the boundary, viz. 2, 4 and 5, are mapped identically relative to η1 and η2 respectively; otherwise, a failure occurs. In either case, attempting the unification incurs cost, which causes a bottleneck for graph parsing, as conceptually suggested in §3.3 and empirically confirmed by Tab. 2.
Considering the above problem in the framework of chart parsing, we would like to construct a data structure to efficiently access all chart items.
In particular, when partial information is provided, this data structure can quickly find all compatible chart items. In this paper, we use a map whose keys are partial information for a query and whose values are sets of chart items. The implementation used in §3.4 follows the method proposed by Chiang et al. (2013), indexing only on ℓ(e) or η, which is not efficient in practice. We propose to build a more comprehensive map. See Tab. 3 for an example of our map.
Table 3: Examples of indexing chart items in Fig. 5, with columns for the indexing key(s) and the item. P1 has multiple keys.
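The comprehensive map is essentially a multimap from partial keys to item sets: when an item enters the chart, it is registered under every key a future lookup might present, so integration only touches compatible partners. A Python sketch with illustrative keys of our own choosing (a nonterminal plus a boundary-node set):

```python
from collections import defaultdict

class ChartIndex:
    """Index chart items under several partial-information keys so that
    compatible partners can be retrieved without scanning the chart."""
    def __init__(self):
        self.by_key = defaultdict(set)

    def add(self, item, keys):
        """Register `item` under each lookup key it may be queried by."""
        for k in keys:
            self.by_key[k].add(item)

    def lookup(self, key):
        return self.by_key.get(key, set())

idx = ChartIndex()
# One item, multiple keys (cf. Tab. 3: P1 has multiple keys).
idx.add("P1", [("Y", frozenset({"A", "C"})), ("Y", frozenset({"A", "F"}))])
idx.add("P2", [("X", frozenset({"A"}))])
```

A failed lookup now returns an empty set immediately, instead of being discovered through a failed pairwise integration.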
Note that the number of possible masks for a passive item grows exponentially with the number of its external nodes. However, a significant number of masks are not used by any tree decomposition of any rule, and such masks can be found by preprocessing the grammar before parsing. For all HRGs used in our experiments, the maximum number of useful masks for a passive item is 15.

Empirical Evaluation
A construction grammar (CxG) is automatically induced in a way similar to the experiments in §3.4. Note that our grammar extraction procedure guarantees that this grammar is weakly regular. As shown in Tab. 4, the average number of free nodes in the CxG is much smaller. We conduct new experiments using the improvements introduced in the previous sections, re-running the improved parser on the 4195 EDS graphs which successfully receive derivation forests from the original parser. Tab. 5 and Fig. 12 show the effectiveness of our improvements. The terminal-first tree decomposition (as illustrated in Fig. 10) significantly reduces the number of integrations. Our indexing method effectively reduces the number of failed integrations. For the CxG, the terminal-edge-first strategy is more effective than the indexing strategy. Note that the cost of building the map for indexing chart items is not negligible.

Figure 12: The number of total integrations relative to the size of input graphs. All data points in the plot are average values over test samples of a given size.

Conclusion
We introduce several locality-centric refinements to advance graph parsing and empirically evaluate their effectiveness. We show that exact graph parsing can be efficient even for large graphs and with large graph grammars.

A Proof for Proposition 4
We provide the proof for R with two nonterminal edges: e X and e Y .
Firstly, we prove δ(R) ≥ f (R). For any nice tree decomposition T R of R, let η m be the node with minimum height such that R ⊵ηm contains both e X and e Y .
[1] η m is binary. Let η 1 , η 2 be the two children of η m . Without loss of generality, we assume R ⊵η 1 contains e X and R ⊵η 2 contains e Y .
[2] η m is unary. Let η 1 be the only child of η m . In this case, η m introduces either e X or e Y . Without loss of generality, we assume η m introduces e X .
Let v be a free node of R.
Case 1: v is incident to only one of eX and eY. By property (3) of weak regularity, v is an external node of R. Therefore, v ∈ bn(R) ⊆ bn(R⊵η1) ⊆ A(ηm).
By the above discussion, we conclude that all free nodes of R are active nodes of ηm, and free nodes are clearly pairwise independent. As a result, we have δ(TR) ≥ δ(ηm) ≥ f(R). The arbitrariness of TR ensures that δ(R) ≥ f(R).