Prefix Lexicalization of Synchronous CFGs using Synchronous TAG

We show that an epsilon-free, chain-free synchronous context-free grammar (SCFG) can be converted into a weakly equivalent synchronous tree-adjoining grammar (STAG) which is prefix lexicalized. This transformation at most doubles the grammar’s rank and cubes its size, but we show that in practice the size increase is only quadratic. Our results extend Greibach normal form from CFGs to SCFGs and prove new formal properties about SCFG, a formalism with many applications in natural language processing.


Introduction
Greibach normal form (GNF; Greibach, 1965) is an important construction in formal language theory which allows every context-free grammar (CFG) to be rewritten so that the first character of each rule is a terminal symbol. A grammar in GNF is said to be prefix lexicalized, because the prefix of every production is a lexical item. GNF has a variety of theoretical and practical applications, including for example the proofs of the famous theorems due to Shamir and Chomsky-Schützenberger (Shamir, 1967; Chomsky and Schützenberger, 1963; Autebert et al., 1997). Other applications of prefix lexicalization include proving coverage of parsing algorithms (Gray and Harrison, 1972) and decidability of equivalence problems (Christensen et al., 1995).
By using prefix lexicalized synchronous context-free grammars (SCFGs), Watanabe et al. (2006) and Siahbani et al. (2013) obtain asymptotic and empirical speed improvements on a machine translation task. Using a prefix lexicalized grammar ensures that target sentences can be generated from left to right, which allows the use of beam search to constrain their decoder's search space as it performs a left-to-right traversal of translation hypotheses. To achieve these results, new grammars had to be heuristically constrained to include only prefix lexicalized productions, as there is at present no way to automatically convert an existing SCFG to a prefix lexicalized form.
This work investigates the formal properties of prefix lexicalized synchronous grammars as employed by Watanabe et al. (2006) and Siahbani et al. (2013), which have received little theoretical attention compared to non-synchronous prefix lexicalized grammars. To this end, we first prove that SCFG is not closed under prefix lexicalization. Our main result is that there is a method for prefix lexicalizing an SCFG by converting it to an equivalent grammar in a different formalism, namely synchronous tree-adjoining grammar (STAG) in regular form. Like the GNF transformation for CFGs, our method at most cubes the grammar size, but we show empirically that the size increase is only quadratic for grammars used in existing NLP tasks. The rank is at most doubled, and we maintain O(n^{3k}) parsing complexity for grammars of rank k. We conclude that although SCFG does not have a prefix lexicalized normal form like GNF, our conversion to prefix lexicalized STAG offers a practical alternative.

SCFG
An SCFG is a tuple G = (N, Σ, P, S) where N is a finite nonterminal alphabet, Σ is a finite terminal alphabet, S ∈ N is a distinguished nonterminal called the start symbol, and P is a finite set of synchronous rules of the form (1) ⟨A_1 → α_1, A_2 → α_2⟩ for some A_1, A_2 ∈ N and strings α_1, α_2 ∈ (N ∪ Σ)*. Every nonterminal which appears in α_1 must be linked to exactly one nonterminal in α_2, and vice versa. We write these links using numerical annotations, as in (2).

Figure 1: An example of synchronous rewriting in an STAG (left) and the resulting tree pair (right).
An SCFG has rank k if no rule in the grammar contains more than k pairs of linked nodes. In every step of an SCFG derivation, we rewrite one pair of linked nonterminals with a rule from P, in essentially the same way we would rewrite a single nonterminal in a non-synchronous CFG. For example, (3) shows linked A and B nodes being rewritten using (2). Note how the link indices 1 and 2 in rule (2) are renumbered to 2 and 3 during rewriting, to avoid an ambiguity with the 1 already present in the derivation.
An SCFG derivation is complete when it contains no more nonterminals to rewrite. A completed derivation represents a string pair generated by the grammar.
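For concreteness, the rewriting-with-renumbering step above can be sketched in Python. The representation here (symbols paired with integer links, terminals carrying link None) and all helper names are our own illustration, not the paper's notation:

```python
# Illustrative sketch of SCFG synchronous rewriting; the representation
# (symbol/link pairs, terminals with link None) is an assumption of this
# sketch, not the paper's.

def relink(rhs, offset):
    """Shift a rule's link indices so they do not clash with links
    already present in the derivation."""
    return [(sym, link + offset if link is not None else None)
            for sym, link in rhs]

def rewrite(source, target, link, rule):
    """Rewrite the pair of nonterminals sharing `link` in the sentential
    form (source, target) using `rule` = (lhs_src, rhs_src, lhs_tgt, rhs_tgt)."""
    _, rhs_src, _, rhs_tgt = rule
    used = {l for _, l in source + target if l is not None}
    offset = max(used, default=0)
    new_src, new_tgt = relink(rhs_src, offset), relink(rhs_tgt, offset)
    def subst(side, replacement):
        out = []
        for sym, l in side:
            out.extend(replacement if l == link else [(sym, l)])
        return out
    return subst(source, new_src), subst(target, new_tgt)
```

For instance, rewriting a linked pair carrying link 1 with a hypothetical two-link rule shifts the rule's links 1 and 2 to 2 and 3, mirroring the renumbering described for rule (2) above.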

STAG
An STAG (Shieber, 1994) is a tuple G = (N, Σ, T, S) where N is a finite nonterminal alphabet, Σ is a finite terminal alphabet, S ∈ N is a distinguished nonterminal called the start symbol, and T is a finite set of synchronous tree pairs of the form ⟨t_1, t_2⟩, where t_1 and t_2 are elementary trees as defined in Joshi et al. (1975). A substitution site is a leaf node marked by ↓ which may be rewritten by another tree; a foot node is a leaf marked by * that may be used to rewrite a tree-internal node. Every substitution site in t_1 must be linked to exactly one nonterminal in t_2, and vice versa. As in SCFG, we write these links using numbered annotations; rank is defined for STAG the same way as for SCFG.
In every step of an STAG derivation, we rewrite one pair of linked nonterminals with a tree pair from T , using the same substitution and adjunction operations defined for non-synchronous TAG.
For example, Figure 1 shows linked A and B nodes being rewritten and the tree pair resulting from this operation. See Joshi et al. (1975) for details about the underlying TAG formalism.

Terminology
We use synchronous production as a cover term for either a synchronous rule in an SCFG or a synchronous tree pair in an STAG.
Following Siahbani et al. (2013), we refer to the left half of a synchronous production as the source side, and the right half as the target side; this terminology captures the intuition that synchronous grammars model translational equivalence between a source phrase and its translation into a target language. Other authors refer to the two halves as the left and right components (Crescenzi et al., 2015) or, viewing the grammar as a transducer, the input and the output (Engelfriet et al., 2017).
We call a grammar ε-free if it contains no productions whose source or target side produces only the empty string ε.

Synchronous Prefix Lexicalization
Previous work (Watanabe et al., 2006; Siahbani et al., 2013) has shown that it is useful for the target side of a synchronous grammar to start with a terminal symbol. For this reason, we define a synchronous grammar to be prefix lexicalized when the leftmost character of the target side of every synchronous production in that grammar is a terminal symbol.
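This definition can be stated as a small predicate over a toy grammar encoding. In this sketch each production is a (source, target) pair of symbol lists, with lowercase strings standing in for terminals; the encoding is our assumption, not the paper's notation:

```python
# Toy check of the prefix lexicalization property; the encoding
# (lowercase = terminal) is an assumption of this sketch.

def is_terminal(sym):
    return sym.islower()

def is_prefix_lexicalized(grammar):
    """True iff the target side of every synchronous production
    begins with a terminal symbol."""
    return all(target and is_terminal(target[0])
               for _source, target in grammar)
```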

Closure under Prefix Lexicalization
We now prove that the class SCFG is not closed under prefix lexicalization.
Theorem 1. There exists an SCFG which cannot be converted to an equivalent PL-SCFG.
Proof. The SCFG in (7) generates the language L = {⟨a^i b^j c^i, b^j a^i⟩ | i ≥ 0, j ≥ 1}, but this language cannot be generated by any PL-SCFG. Suppose, for the purpose of contradiction, that some PL-SCFG does generate L; call this grammar G. Then the following derivations must all be possible in G for some nonterminals U, V, X, Y: i and ii follow from the same arguments used in the pumping lemma for (non-synchronous) context-free languages (Bar-Hillel et al., 1961): strings in L can contain arbitrarily many a's, b's, and c's, so there must exist some pumpable cycles which generate these characters. In i, k + m = n + p because the final derived strings must contain an equal number of b's, and n ≥ 1 because G is prefix lexicalized; in ii the constraints on q, r, and s follow likewise from L. iii follows from the fact that, in order to pump on the cycle in ii, this cycle must be reachable from the start symbol. iv follows from the fact that a context-free production cannot generate a discontinuous span. Once the cycle in i has generated a b, it is impossible for ii to generate an a on one side of the b and a c on the other. Therefore i must always be derived strictly later than ii, as shown in iv.
Now we obtain a contradiction. Given that G can derive all of i through iv, the following derivation is also possible: But since n, r ≥ 1, the target string derived this way contains an a before a b and does not belong to L. This is a contradiction: if G is a PL-SCFG then it must generate i through iv, but if so then it also generates strings which do not belong to L. Thus no PL-SCFG can generate L, and SCFG is not closed under prefix lexicalization.
There also exist grammars which cannot be prefix lexicalized because they contain cyclic chain rules. If an SCFG can derive something of the form ⟨X^(1), Y^(1)⟩ ⇒* ⟨xX^(1), Y^(1)⟩, then it can generate arbitrarily many symbols in the source string without adding anything to the target string. Prefix lexicalizing the grammar would force it to generate some terminal symbol in the target string at each step of the derivation, making it unable to generate the original language, where a source string may be unboundedly longer than its corresponding target. We call an SCFG chain-free if it does not contain a cycle of chain rules of this form. The remainder of this paper focuses on chain-free grammars, like (7), which cannot be converted to PL-SCFG despite containing no such cycles.
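The chain-free condition can be checked mechanically. The sketch below assumes a rule representation of our own devising (linked nonterminals as (name, link) pairs, terminals with link None); it builds a graph over pairs of linked nonterminals connected by rules whose target side is a single linked nonterminal, and searches that graph for a cycle:

```python
# Sketch of a chain-cycle test; the rule representation is hypothetical.
# A rule is ((lhs_src, rhs_src), (lhs_tgt, rhs_tgt)).

def has_chain_cycle(rules):
    """Detect cycles of chain rules, i.e. derivations of the form
    <X, Y> =>* <xX, Y> where the target side never grows."""
    edges = {}
    for (ls, rs), (lt, rt) in rules:
        linked_tgt = [(n, l) for n, l in rt if l is not None]
        # a chain rule: the target side is exactly one linked nonterminal
        if len(rt) == 1 and len(linked_tgt) == 1:
            link = linked_tgt[0][1]
            src_partner = next(n for n, l in rs if l == link)
            edges.setdefault((ls, lt), []).append((src_partner, linked_tgt[0][0]))
    # depth-first search for a cycle over nonterminal pairs
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(node):
        color[node] = GRAY
        for nxt in edges.get(node, []):
            c = color.get(nxt, WHITE)
            if c == GRAY or (c == WHITE and dfs(nxt)):
                return True
        color[node] = BLACK
        return False
    return any(dfs(n) for n in list(edges) if color.get(n, WHITE) == WHITE)
```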

Prefix Lexicalization using STAG
We now present a method for prefix lexicalizing an SCFG by converting it to an STAG.

Figure 2: A target-side terminal leftmost derivation. a ∈ Σ; X, A, Y_i, B_i ∈ N; and α_i, β_i, γ_i ∈ (N ∪ Σ)*.

Theorem 2. Given a rank-k SCFG G which is ε-free and chain-free, there exists an STAG H such that H is prefix lexicalized and L(G) = L(H). The rank of H is at most 2k, and |H| = O(|G|^3).
Proof. Let G = (N, Σ, P, S) be an ε-free, chain-free SCFG. We provide a constructive method for prefix lexicalizing the target side of G.
We begin by constructing an intermediate grammar G_XA for each pair of nonterminals X, A ∈ N \ {S}. G_XA will be constructed to generate the language of sentential forms derivable from ⟨X^(1), A^(1)⟩ via a target-side terminal leftmost derivation (TTLD). A TTLD is a derivation of the form in Figure 2, where the leftmost nonterminal in the target string is expanded until it produces a terminal symbol as the first character. We write ⟨X^(1), A^(1)⟩ ⇒*_TTLD ⟨u, v⟩ to mean that ⟨X^(1), A^(1)⟩ derives ⟨u, v⟩ by way of a TTLD. Given X, A ∈ N \ {S}, we formally define G_XA as an STAG over the terminal alphabet Σ_XA = N ∪ Σ and nonterminal alphabet N_XA = {Y_XA | Y ∈ N}, with start symbol S_XA. N_XA contains nonterminals indexed by XA to ensure that two intermediate grammars G_XA and G_YB do not interact as long as ⟨X, A⟩ ≠ ⟨Y, B⟩.
G_XA contains four kinds of tree pairs, added for rules of G of the shapes shown in Figure 3: for the first shape we add a tree pair of the form in Figure 3(a); for the second, a tree pair of the form in Figure 3(b), with the special case that if Y = Z we collapse the root node and adjunction site; for the third, a tree pair of the form in Figure 3(c); and for the last, a tree pair of the form in Figure 3(d).

Figure 4: An SCFG rule and a tree pair based on that rule, taken from an intermediate grammar G_AA. The tree pair is formed according to the pattern illustrated in Figure 3(c). Observe that the B nodes retain the link they bore in the original rule. This link is not functional in the intermediate grammar (that is, it cannot be used for synchronous rewriting) because B ∉ N_AA, but it will be functional when this tree pair is added to the final grammar H.

Lemma 1. G_XA generates the language L_XA of sentential forms derivable from ⟨X^(1), A^(1)⟩ via a TTLD.

Proof. This can be shown by induction over derivations of increasing length. The proof is straightforward but very long, so we provide only a sketch; the complete proof is provided in the supplementary material.
As a base case, observe that a tree of the shape in Figure 3(a) corresponds straightforwardly to the derivation (10) ⟨X^(1), A^(1)⟩ ⇒ ⟨α_1, aα_2⟩, which is a TTLD starting from ⟨X, A⟩. By construction, therefore, every TTLD of the shape in (10) corresponds to some tree in G_XA of shape 3(a); likewise, every derivation in G_XA comprising a single tree of shape 3(a) corresponds to a TTLD of the shape in (10). As a second base case, note that a tree of the shape in Figure 3(b) corresponds to the last step of a TTLD like (11). In the other direction, the last step of any TTLD of the shape in (11) will involve some rule of the shape ⟨Y → α_1, B → aα_2⟩; by construction G_XA must contain a corresponding tree pair of shape 3(b). Together, these base cases establish a one-to-one correspondence between single-tree derivations in G_XA and the last step of a TTLD starting from ⟨X, A⟩. Now, assume that the last n steps of every TTLD starting from ⟨X, A⟩ correspond to some derivation over n trees in G_XA, and vice versa. Then the last n + 1 steps of that TTLD will also correspond to some (n + 1)-tree derivation in G_XA, and vice versa.
To see this, consider the step n + 1 steps before the end of the TTLD. This step may be in the middle of the derivation, or it may be the first step of the derivation. If it is in the middle, then this step must involve a rule of the shape in (12). The existence of such a rule in G implies the existence of a corresponding tree in G_XA of the shape in Figure 3(c). Adding this tree to the existing n-tree derivation yields a new (n + 1)-tree derivation corresponding to the last n + 1 steps of the TTLD.[4] In the other direction, if the (n + 1)th tree[5] of a derivation in G_XA is of the shape in Figure 3(c), then this implies the existence of a production in G of the shape in (12). By assumption the first n trees of the derivation in G_XA correspond to some TTLD in G; by prepending the rule from (12) to this TTLD we obtain a new TTLD of length n + 1 which corresponds to the entire (n + 1)-tree derivation in G_XA.

Finally, consider the case where the TTLD is only n + 1 steps long. The first step must involve a rule of the form in (13). The existence of such a rule implies the existence of a corresponding tree in G_XA of the shape in Figure 3(d). Adding this tree to the derivation which corresponds to the last n steps of the TTLD yields a new (n + 1)-tree derivation corresponding to the entire (n + 1)-step TTLD. In the other direction, if the last tree of an (n + 1)-tree derivation in G_XA is of the shape in Figure 3(d), then this implies the existence of a production in G of the shape in (13). By assumption the first n trees of the derivation in G_XA correspond to some TTLD in G; by prepending the rule from (13) to this TTLD we obtain a new TTLD of length n + 1 which corresponds to the entire (n + 1)-tree derivation in G_XA.

[4] It is easy to verify by inspection of Figure 3 that whenever one rule from G can be applied to the output of another rule, the tree pairs in G_XA which correspond to these rules can compose with one another. Thus we can add the new tree to the existing derivation and be assured that it will compose with one of the trees that is already present.

[5] Although trees in G_XA may contain symbols from the nonterminal alphabet of G, these symbols belong to the terminal alphabet in G_XA. Only nonterminals in N_XA will be involved in this derivation, and by construction there is at most one such nonterminal per tree. Thus a well-formed derivation structure in G_XA will never branch, and we can refer to the (n + 1)th tree pair as the one which is at depth n in the derivation structure.
Taken together, these cases establish a one-to-one correspondence between derivations in G_XA and TTLDs which start from ⟨X, A⟩; in turn they confirm that G_XA generates the desired language L_XA.
Once we have constructed an intermediate grammar G_XA for each X, A ∈ N \ {S}, we obtain the final STAG H as follows:

1. Convert the input SCFG G to an equivalent STAG, converting each rule of G to a corresponding pair of elementary trees.

2. Add the tree pairs of every intermediate grammar G_XA to H.

3. For every X, A ∈ N, in all tree pairs where the target tree's leftmost leaf is labeled with A and this node is linked to an X, replace this occurrence of A with S_XA. Also replace the linked node in the source tree.

4. For every X, A ∈ N, let R_XA be the set of all tree pairs rooted in S_XA, and let T_XA be the set of all tree pairs whose target tree's leftmost leaf is labeled with S_XA. For every ⟨s, t⟩ ∈ T_XA and every ⟨s′, t′⟩ ∈ R_XA, substitute or adjoin s′ and t′ into the linked S_XA nodes in s and t, respectively. Add the derived trees to H.

5. For all X, A ∈ N, let T_XA be defined as above. Remove all tree pairs in T_XA from H.

6. For all X, A ∈ N, let R_XA be defined as above. Remove all tree pairs in R_XA from H.
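The composition in step 4 can be sketched in much-simplified form by treating tree pairs as flat rules, which is tenable only because the relevant target trees are depth 1; link renumbering is omitted, and the representation and function names are hypothetical:

```python
# Simplified sketch of step 4: substitute every prefix lexicalized pair in
# R_XA (rooted in the symbol `s_xa`) into every pair in T_XA whose target's
# leftmost leaf is labeled `s_xa`. Each tree pair is (source_side,
# target_side), a list of (symbol, link) items; link renumbering omitted.

def step4_compose(T_XA, R_XA, s_xa):
    derived = []
    for src, tgt in T_XA:
        head, link = tgt[0]
        if head != s_xa:
            continue
        for r_src, r_tgt in R_XA:
            # splice the lexicalized source material at the linked node
            new_src = []
            for sym, l in src:
                new_src.extend(r_src if l == link else [(sym, l)])
            # the lexicalized target material becomes the new prefix
            derived.append((new_src, r_tgt + tgt[1:]))
    return derived
```

Every derived pair begins with the lexical prefix contributed by the R_XA member, which is why steps 5-6 can then discard the unlexicalized originals.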
We now claim that H generates the same language as the original grammar G, and all of the target trees in H are prefix lexicalized.
The first claim follows directly from the construction.
Step 1 merely rewrites the grammar in a new formalism. From Lemma 1 it is clear that steps 2-3 do not change the generated language: the set of string pairs generable from a pair of S_XA nodes is identical to the set generable from ⟨X, A⟩ in the original grammar.
Step 4 replaces some nonterminals by all possible alternatives; steps 5-6 then remove the trees which were used in step 4, but since all possible combinations of these trees have already been added to the grammar, removing them will not alter the language.
The second claim follows from inspection of the tree pairs generated in Figure 3. Observe that, by construction, for all X, A ∈ N every target tree rooted in S_XA is prefix lexicalized. Thus the trees created in step 4 are all prefix lexicalized variants of non-lexicalized tree pairs; steps 5-6 then remove the non-lexicalized trees from the grammar. Figure 5 gives an example of this transformation applied to a small grammar. Note how A nodes at the left edge of the target trees end up rewritten as S_AA nodes, as per step 4 of the transformation.

Complexity & Formal Properties
Our conversion generates a subset of the class of prefix lexicalized STAGs in regular form, which we abbreviate to PL-RSTAG (regular form for TAG is defined in Rogers 1994). This section discusses some formal properties of PL-RSTAG.
Generative Capacity PL-RSTAG is weakly equivalent to the class of ε-free, chain-free SCFGs: this follows immediately from the proof that our transformation does not change the language generated by the input SCFG. Note that every TAG in regular form generates a context-free language (Rogers, 1994).
Alignments and Reordering PL-RSTAG generates the same set of reorderings (alignments) as SCFG. Observe that our transformation does not cause nonterminals which were linked in the original grammar to become unlinked, as noted for example in Figure 4. Thus subtrees which are generated by linked nonterminals in the original grammar will still be generated by linked nonterminals in the final grammar, so no reordering information is lost or added.[6] This result holds despite the fact that our transformation is only applicable to chain-free grammars: chain rules cannot introduce any reorderings, since by definition they involve only a single pair of linked nonterminals.

Figure 5: An SCFG and the STAG which prefix lexicalizes it. Non-productive trees have been omitted.

Table 1: Grammar size before (|G|) and after (|H|) prefix lexicalization for each grammar, including that of Siahbani and Sarkar (2014a); log_|G| |H| gives the increase as a power of the initial size. We also show the percentage of productions which are already prefix lexicalized in G.
Grammar Rank If the input SCFG G has rank k, then the STAG H produced by our transformation has rank at most 2k. To see this, observe that the construction of the intermediate grammars increases the rank by at most 1 (see Figure 3(b)). When a prefix lexicalized tree is substituted at the left edge of a non-lexicalized tree, the link on the substitution site will be consumed, but up to k + 1 new links will be introduced by the substituting tree, so that the final tree will have rank at most 2k.
In the general case, rank-k STAG is more powerful than rank-k SCFG; for example, a rank-4 SCFG is required to generate the reordering in ⟨S → A^(1) B^(2) C^(3) D^(4), S → C^(3) A^(1) D^(4) B^(2)⟩ (Wu, 1997), but this reordering is captured by the following rank-3 STAG. For this reason, we speculate that it is possible to further transform the grammars produced by our lexicalization in order to reduce their rank, but the details of this transformation remain as future work. This potentially poses a solution to an issue raised by Siahbani and Sarkar (2014b). On a Chinese-English translation task, they find that sentences like (15) involve reorderings which cannot be captured by a rank-2 prefix lexicalized SCFG: Tā bǔchōng shuō, liánhé zhèngfǔ mùqián zhuàngkuàng wěndìng ... 'He added that the coalition government is now in stable condition ...'

[6] Although we consume one link whenever we substitute a prefix lexicalized tree at the left edge of an unlexicalized tree, that link can still be remembered and used to reconstruct the reorderings which occurred between the two sentences.
If rank-k PL-RSTAG is more powerful than rank-k SCFG, using a PL-RSTAG here would permit capturing more reorderings without using grammars of higher rank.
Parse Complexity Because the grammar produced is in regular form, each side can be parsed in time O(n^3) (Rogers, 1994), for an overall parse complexity of O(n^{3k}), where n is the sentence length and k is the grammar rank.

Grammar Size and Experiments
If H is the PL-RSTAG produced by applying our transformation to an SCFG G, then H contains O(|G|^3) elementary tree pairs, where |G| is the number of synchronous productions in G. When the set of nonterminals N is small compared to |G|, a tighter bound is given by O(|G|^2 |N|^2). Table 1 shows the actual size increase on a variety of grammars: here |G| is the size of the initial grammar, |H| is the size after applying our transformation, and the increase is expressed as a power of the original grammar size. We apply our transformation to the grammar from Siahbani and Sarkar (2014a), which was created for a Chinese-English translation task known to involve complex reorderings that cannot be captured by PL-SCFG (Siahbani and Sarkar, 2014b). We also consider the grammar in (7) and an ITG (Wu, 1997) containing 10,000 translation pairs, which is a grammar of the sort that has previously been used for word alignment tasks (cf. Zhang and Gildea, 2005). We always observe an increase within O(|G|^2) rather than the worst-case O(|G|^3), because |N| is small relative to |G| in most grammars used for NLP tasks.
We also investigated how the proportion of prefix lexicalized rules in the original grammar affects the overall size increase. We sampled grammars with varying proportions of prefix lexicalized rules from the grammar in Siahbani and Sarkar (2014a); Table 2 shows the result of lexicalizing these samples. We find that the worst case size increase occurs when 50% of the original grammar is already prefix lexicalized. This is because the size increase depends on both the number of prefix lexicalized trees in the intermediate grammars (which grows with the proportion of lexicalized rules) and the number of productions which need to be lexicalized (which shrinks as the proportion of prefix lexicalized rules increases). At 50%, both factors contribute appreciably to the grammar size, analogous to how the function f (x) = x(1 − x) takes its maximum at x = 0.5.
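The x(1 − x) analogy can be made concrete with a toy calculation; the grammar size and the proportionality are invented for illustration, and only the shape of the curve is meaningful:

```python
# Toy calculation mirroring the f(x) = x(1 - x) analogy: if a fraction p
# of a grammar of size g is already prefix lexicalized, step 4 composes
# roughly (unlexicalized trees) x (lexicalized trees) pairs.

def added_trees(g, p):
    return (g * (1 - p)) * (g * p)

sizes = {p: added_trees(10_000, p) for p in (0.1, 0.3, 0.5, 0.7, 0.9)}
worst = max(sizes, key=sizes.get)   # the blow-up peaks at p = 0.5
```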

Applications
The LR decoding algorithm from Watanabe et al. (2006) relies on prefix lexicalized rules to generate a prefix of the target sentence during machine translation. At each step, a translation hypothesis is expanded by rewriting the leftmost nonterminal in its target string using some grammar rule; the prefix of this rule is appended to the existing translation and the remainder of the rule is pushed onto a stack, in reverse order, to be processed later. Translation hypotheses are stored in stacks according to the length of their translated prefix, and beam search is used to traverse these hypotheses and find a complete translation. During decoding, the source side is processed by an Earley-style parser, with the dot moving around to process nonterminals in the order they appear on the target side.
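The expansion step just described can be sketched as follows. This is a target-side-only simplification with hypothetical names; the real decoder also advances an Earley-style parse of the source side, scores partial translations, and organizes hypotheses into beams by prefix length:

```python
# Target-side-only sketch of hypothesis expansion in LR decoding
# (after Watanabe et al., 2006); representation is an assumption of
# this sketch.

def expand(hypothesis, rule):
    """hypothesis = (prefix, stack): `prefix` holds the target terminals
    generated so far, `stack` the symbols still to process (leftmost on
    top). `rule` = (lhs, rhs) with a prefix lexicalized rhs."""
    prefix, stack = hypothesis
    lhs, rhs = rule
    assert stack and stack[-1] == lhs, "rule must rewrite the leftmost nonterminal"
    terminal, rest = rhs[0], rhs[1:]
    # append the lexical prefix to the translation; push the remainder
    # in reverse so the leftmost remaining symbol sits on top of the stack
    return prefix + [terminal], stack[:-1] + list(reversed(rest))
```

Starting from ([], ['S']) and expanding with a hypothetical rule S → he NP VP yields (['he'], ['VP', 'NP']), leaving NP as the next symbol to rewrite.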
Since the trees on the target side of our transformed grammar are all of depth 1, and none of these trees can compose via the adjunction operation, they can be treated like context-free rules and used as-is in this decoding algorithm. The only change required to adapt LR decoding to use a PL-RSTAG is to make the source side use a TAG parser instead of a CFG parser; an Earley-style parser for TAG already exists (Joshi and Schabes, 1997), so this is a minor adjustment.
Combined with the transformation in Section 4, this suggests a method for using LR decoding without sacrificing translation quality. Previously, LR decoding required the use of heuristically generated PL-SCFGs, which cannot model some reorderings (Siahbani and Sarkar, 2014a). Now, an SCFG tailored for a translation task can be transformed directly to PL-RSTAG and used for decoding; unlike a heuristically induced PL-SCFG, the transformed PL-RSTAG will generate the same language as the original SCFG, which is known to handle more reorderings.
Note that, since applying our transformation may double the rank of a grammar, this method may prove prohibitively slow. This highlights the need for future work to examine the generative power of rank-k PL-RSTAG relative to rank-k SCFG in the interest of reducing the rank of the transformed grammar.

Related Work
Our work continues the study of TAGs and lexicalization (e.g. Joshi et al., 1975; Schabes and Waters, 1993). Schabes and Waters (1995) show that TAG can strongly lexicalize CFG, whereas CFG only weakly lexicalizes itself; we show a similar result for SCFGs. Kuhlmann and Satta (2012) show that TAG is not closed under strong lexicalization, and Maletti and Engelfriet (2012) show how to strongly lexicalize TAG using simple context-free tree grammars (CFTGs).
Other extensions of GNF to new grammar formalisms include Dymetman (1992) for definite clause grammars, Fernau and Stiebe (2002) for CF valence grammars, and Engelfriet et al. (2017) for multiple CFTGs. Although multiple CFTG subsumes SCFG (and STAG), Engelfriet et al.'s result appears to guarantee only that some side of every synchronous production will be lexicalized, whereas our result guarantees that it is always the target side that will be prefix lexicalized.
Lexicalization of synchronous grammars was addressed by Zhang and Gildea (2005), but they consider lexicalization rather than prefix lexicalization, and they only consider SCFGs of rank 2. They motivate their results using a word alignment task, which may be another possible application for our lexicalization.
Analogous to our closure result, Aho and Ullman (1969) prove that SCFG does not admit a normal form with bounded rank like Chomsky normal form. Blum and Koch (1999) use intermediate grammars like our G XA s to transform a CFG to GNF. Another GNF transformation (Rosenkrantz, 1967) is used by Schabes and Waters (1995) to define Tree Insertion Grammars (which are also weakly equivalent to CFG).
We rely on Rogers (1994) for the claim that our transformed grammars generate context-free languages despite allowing wrapping adjunction; an alternative proof could employ the results of Swanson et al. (2013), who develop their own context-free TAG variant known as osTAG. Kaeshammer (2013) introduces the class of synchronous linear context-free rewriting systems to model reorderings which cannot be captured by a rank-2 SCFG. In the event that rank-k PL-RSTAG is more powerful than rank-k SCFG, our work can be seen as an alternative approach to the same problem.
Finally, Nesson et al. (2008) present an algorithm for reducing the rank of an STAG on-the-fly during parsing; this presents a promising avenue for proving a smaller upper bound on the rank increase caused by our transformation.

Conclusion and Future Work
We have demonstrated a method for prefix lexicalizing an SCFG by converting it to an equivalent STAG. This process is applicable to any SCFG which is ε-free and chain-free. Like the original GNF transformation for CFGs, our construction at most cubes the grammar size, though when applied to the kinds of synchronous grammars used in machine translation the size is merely squared. Our transformation preserves all of the alignments generated by the SCFG, and retains properties such as O(n^{3k}) parsing complexity for grammars of rank k. We plan to verify whether rank-k PL-RSTAG is more powerful than rank-k SCFG in future work, and to reduce the rank of the transformed grammar if possible. We further plan to empirically evaluate our lexicalization on an alignment task and to offer a comparison against the lexicalization due to Zhang and Gildea (2005).