Regular transductions with MCFG input syntax

We show that regular transductions for which the input part is generated by some multiple context-free grammar can be simulated by synchronous multiple context-free grammars. We prove that synchronous multiple context-free grammars are strictly more powerful than this combination of regular transductions and multiple context-free grammars.


Introduction
In machine translation, one is interested in automatically translating sentences of one natural language into sentences of another natural language. Such translations can be considered as string-to-string transductions by viewing the words of a natural language as symbols of a formal language, and viewing sentences as strings. Several formal models for such transductions have been proposed, e.g., syntax-directed translation schemata (Lewis and Stearns, 1968), also known as synchronous context-free grammars (Chiang, 2007), two-way generalized sequential machines (2gsm) (Shepherdson, 1959; Aho and Ullman, 1970), MSO definable string-to-string transductions (MSO-sst) (Courcelle and Engelfriet, 2012), and streaming string transducers (SST) (Alur and Černý, 2010).
It has been established that the deterministic versions of 2gsm, MSO-sst, and SST generate the same class of string-to-string transductions (Alur and Černý, 2010; Engelfriet and Hoogeboom, 2001); the same is true for the nondeterministic versions of MSO-sst and SST (Alur and Deshmukh, 2011). Due to these characterizations, the involved transducers and the corresponding transductions are called regular transducers and regular transductions, respectively.
In statistical machine translation (Lopez, 2008), one aims at automatically inferring a translation model from some bilingual corpus, where the translation model is chosen from some class of formal devices, e.g., the class of regular transducers. In the seminal paper by Brown et al. (1993), the inference is based on the concept of alignment graph (used as hidden random variable in the EM-algorithm (Dempster et al., 1977)); each such graph consists of an input sentence w, an output sentence v, and a binary relation between the set pos(v) of positions of v and the set pos(w) of positions of w. In the particular case of the IBM models each alignment graph is a partial mapping from pos(v) to pos(w). These have almost the same mathematical structure as the origin graphs of Bojańczyk (2013), except that in the latter, the mapping is total. Bojańczyk (2013) and Bojańczyk et al. (2017a,b) investigated the concept of regular transductions with origin semantics, where the origin semantics of a regular transducer A is a set of the origin graphs that A can create: if A produces a portion v of the output while reading the input symbol at position i, then each position of v is aligned to i.
Since the domain of each regular transduction is a regular string language, it cannot capture nonregular syntactic phenomena on the source side of the translation. To enhance this capability, this paper investigates imposing additional syntactic restrictions on the input of a regular transducer, through intersection with a multiple context-free grammar (Seki et al., 1991) (MCFG). We prove that the resulting transduction can also be generated by a synchronous MCFG, which is a pair of MCFGs with synchronized nonterminals, much as in, e.g., synchronous context-free grammars. We further give an example of a synchronous MCFG whose transduction cannot be represented as the intersection of a regular transducer and a MCFG.

Preliminaries
We let $\mathbb{N} = \{0, 1, 2, \ldots\}$ and, for $i, j \in \mathbb{N}$, we write $[i, j]$ for the set $\{i, i+1, \ldots, j\}$. We abbreviate $[1, j]$ by $[j]$. We abbreviate sequences of objects, like $a_1 \cdots a_n$ and $a_1, \ldots, a_n$, by $a_{1,n}$. We denote the powerset of a set $A$ by $\mathcal{P}(A)$. We abbreviate a set $\{a\}$ with one element by $a$. An alphabet $A$ is a nonempty and finite set.
For functions f : A → B and g : B → C, we denote their composition by f • g, i.e., (f • g)(a) = g(f (a)) for each a ∈ A.
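Since this composition is written in diagrammatic order (the left function is applied first), a two-line illustration may help. The following Python sketch is ours; the names `compose`, `inc` and `dbl` are hypothetical and serve only to show the order of application.

```python
def compose(f, g):
    """Diagrammatic composition: (f . g)(a) = g(f(a)), so f is applied first."""
    return lambda a: g(f(a))


def inc(n):
    return n + 1


def dbl(n):
    return 2 * n


# inc first, then dbl: dbl(inc(3)) = 8
print(compose(inc, dbl)(3))
# dbl first, then inc: inc(dbl(3)) = 7
print(compose(dbl, inc)(3))
```

The same order applies later when assignments of a transducer are composed: the leftmost assignment is the one executed first.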
Let $\Sigma_1$ and $\Sigma_2$ be alphabets. An origin graph (over $\Sigma_1$ and $\Sigma_2$) is a triple $(w, v, g)$ where $w \in \Sigma_1^*$, $v \in \Sigma_2^*$, and $g$ (the origin mapping) maps each position $j$ of $v$ to a position $i$ of $w$. Intuitively, the pair $(j, i) \in g$ indicates that the symbol at position $j$ of $v$ originated from position $i$ of $w$. Let $A$ be a set of origin graphs and $L_1$ and $L_2$ formal languages. Then we define […]. We generally refer to $\Sigma_1$ as the input alphabet and $\Sigma_2$ as the output alphabet. For a set $L \subseteq L_1 \times L_2$ we define the input projection as $\mathrm{proj}_1(L) = \{w \mid (w, v) \in L\}$ and the output projection as $\mathrm{proj}_2(L) = \{v \mid (w, v) \in L\}$.
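To make the definition of origin graphs and projections concrete, we give a minimal Python sketch. The class name `OriginGraph`, the encoding of $g$ as a tuple of 1-based positions, and the sample graph are our assumptions for illustration; they are not part of the formal development.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OriginGraph:
    w: str      # input string
    v: str      # output string
    g: tuple    # g[j-1] is the origin (a 1-based position of w) of position j of v

    def __post_init__(self):
        # g must be total on the positions of v ...
        if len(self.g) != len(self.v):
            raise ValueError("g must give exactly one origin per output position")
        # ... and every origin must be a position of w.
        if any(not (1 <= i <= len(self.w)) for i in self.g):
            raise ValueError("every origin must be a position of w")


def proj1(pairs):
    """Input projection of a set of (w, v) pairs."""
    return {w for (w, v) in pairs}


def proj2(pairs):
    """Output projection of a set of (w, v) pairs."""
    return {v for (w, v) in pairs}


# Hypothetical example: output ab#ab for input ab, with '#' attributed to position 2.
example = OriginGraph("ab", "ab#ab", (1, 2, 2, 1, 2))
print(proj1({(example.w, example.v)}), proj2({(example.w, example.v)}))
```

The validation in `__post_init__` enforces that the mapping is total, which is the property that distinguishes origin graphs from the partial alignments of the IBM models.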

Streaming String Transducers
Here we recall the definition of streaming string transducer from Alur and Deshmukh (2011), with some slight modifications that concern the final output of a string.
A nondeterministic streaming string transducer (over $\Sigma_1$ and $\Sigma_2$) (for short: NSST) is a tuple $A = (Q, \Sigma_1, \Sigma_2, R, r_o, T, q_0, F)$ where $Q$ is a finite, nonempty set of states, $\Sigma_1$ and $\Sigma_2$ are the input alphabet and the output alphabet, respectively, $R$ is a finite set of registers, $r_o \in R$ is the output register, $T \subseteq Q \times \Sigma_1 \times \mathrm{Ass}(R, \Sigma_2) \times Q$ is a finite set of transitions, $q_0 \in Q$ is the initial state, and $F \subseteq Q$ is the set of final states.
The summary of $A$ is the mapping […] defined inductively as follows. […] for each $q \in Q$, $w \in \Sigma_1^*$, and $a \in \Sigma_1$.

The string-to-string transduction computed by $A$ is the set […] is obtained by at least one sequence of transitions, and possibly more than one due to nondeterminism. For a given such sequence, each symbol occurrence in $v$ is obtained by application of a transition $(q, a, \alpha, q')$, and this links the index of that symbol occurrence in $v$ to the index of the corresponding occurrence of $a$ in $w$. Thereby the sequence of transitions corresponds in a natural way to an origin graph. The set of such origin graphs is denoted by $[[A]]_o$, and will be called the origin semantics of $A$.

[…] For instance $(ab, ab\#ab) \in \tau$. The transformation $\tau$ can be computed by the NSST $A = (Q, \Sigma, \Sigma, R,$ […] Let $\alpha$ denote $(r_1, r_2) := (r_1 ab \# r_2 ab, \varepsilon)$. Then […] Since $q_0$ is the initial state and $q_f$ is the final state, we obtain $(ab, ab\#ab) \in [[A]]$. The corresponding origin graph is shown in Figure 1.
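The definitions above can be made concrete with a small simulator. The following Python sketch is illustrative only: the function name `run_nsst`, the encoding of assignments as lists of register names and output symbols, and the two-register transducer `T` are our assumptions, chosen so that the pair $(ab, ab\#ab)$ mentioned above is produced. It is not claimed to be the transducer of the example. Each output symbol is tagged with the input position at which it is emitted, so every accepting run yields one origin graph.

```python
def run_nsst(transitions, q0, finals, registers, out_reg, w):
    """Return the set of (output string, origin tuple) pairs over all accepting runs."""
    # A configuration is (state, valuation); a valuation maps each register
    # to a tuple of (output symbol, origin position) pairs.
    configs = [(q0, {r: () for r in registers})]
    for i, a in enumerate(w, start=1):              # positions of w are 1-based
        successors = []
        for q, val in configs:
            for (p, b, alpha, p_next) in transitions:
                if p != q or b != a:
                    continue
                new_val = {}
                for r in registers:
                    buf = []
                    for item in alpha[r]:
                        if item in registers:       # register occurrence: copy old content
                            buf.extend(val[item])
                        else:                       # output symbol emitted at position i
                            buf.append((item, i))
                    new_val[r] = tuple(buf)
                successors.append((p_next, new_val))
        configs = successors
    results = set()
    for q, val in configs:
        if q in finals:
            out = val[out_reg]
            results.add(("".join(s for s, _ in out), tuple(o for _, o in out)))
    return results


# Hypothetical transducer (assumption): register names must differ from output symbols.
REGS = ("r1", "r2")
T = []
for sym in "ab":
    # keep reading: append the symbol to both registers
    T.append(("q0", sym, {"r1": ["r1", sym], "r2": ["r2", sym]}, "q0"))
    # guess that this is the last symbol: assemble the output in r1
    T.append(("q0", sym, {"r1": ["r1", sym, "#", "r2", sym], "r2": []}, "qf"))

print(run_nsst(T, "q0", {"qf"}, REGS, "r1", "ab"))
# {('ab#ab', (1, 2, 2, 1, 2))}
```

Composing the two transitions taken on input $ab$ gives exactly the assignment $(r_1, r_2) := (r_1 ab \# r_2 ab, \varepsilon)$ quoted above, and the origin tuple records that the symbol `#` and both occurrences of `b` stem from input position 2.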
We call a NSST nondeleting if for each assignment $\alpha$ occurring in a transition, each register $r$ occurs exactly once in […] The proof is very similar to the proof of a similar result for MCFG by Seki et al. (1991), which we will mention again in Section 4. […] First, $Q'$ contains a new state $q'_0$ plus states of the form $q_D$ where $q \in Q$ and $D \subseteq R$. The intuition is that the registers in $D$ are those that must remain empty in $A'$, as in a corresponding computation in $A$ their contents would later appear as part of a register that is deleted (or that is not the output register when the end of the input is reached). By keeping those registers empty, they no longer need to be deleted, and instead can be added in an arbitrary way to assignments without changing the semantics. We let $q_D \in F'$ if and only if $q \in F$ and $r_o \notin D$, and $q'_0 \in F'$ if and only if $q_0 \in F$.
For each $D \subseteq R$ and $(q, a, \alpha, q') \in T$, we have $(q_{D'}, a, \alpha', q'_D) \in T'$, where $D'$ and $\alpha'$ are defined as follows. The registers in $D'$ are obtained in one of two ways. First, if $r \in D$, then every register in $\alpha(r)$ is in $D'$, and secondly, if a register $r'$ does not occur in $\alpha(r)$ for any $r$, then it is in $D'$.
In the first instance, $\alpha'(r)$ is a copy of $\alpha(r)$ for each $r \notin D$, and $\alpha'(r)$ is obtained from $\alpha(r)$ by omitting all output symbols for each $r \in D$. However, each register $r'$ that does not occur in $\alpha(r)$, for any $r$, is added to $\alpha'(r'')$ in an arbitrary place for an arbitrary $r''$. Moreover, if $q = q_0$, then $T'$ also contains $(q'_0, a, \alpha', q'_D)$.
[…] where $D'$ contains $r_2$ and $r_3$ because $r_2 b r_3$ is assumed to be deleted later, and $D'$ contains $r_4$ because it is deleted here. Further, $T'$ would include $(q_{D'}, a, (r_1, r_2, r_3, r_4) := (r_2 r_3, c r_4, \varepsilon, e r_1), q'_D)$ where we have added $r_4$ to the right-hand side of the assignment in an arbitrary place.
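The nondeleting condition can be checked mechanically. The Python sketch below assumes one reading of the definition, namely that every register occurs exactly once across all right-hand sides of an assignment. The assignment `alpha_prime` is the one displayed above; the assignment `alpha` is our reconstruction of a deleting assignment consistent with the surrounding explanation ($r_2 b r_3$ assigned to $r_1$, and $r_4$ not used), and should be treated as an assumption. The helper names are hypothetical.

```python
REGS = ("r1", "r2", "r3", "r4")


def register_counts(alpha, registers):
    """Count how often each register occurs in the right-hand sides of alpha."""
    counts = {r: 0 for r in registers}
    for rhs in alpha.values():
        for item in rhs:
            if item in counts:      # skip output symbols
                counts[item] += 1
    return counts


def is_nondeleting(alpha, registers):
    """True iff every register occurs exactly once overall (assumed reading)."""
    return all(n == 1 for n in register_counts(alpha, registers).values())


# Assumed original assignment: r4 never occurs on a right-hand side, so it is deleted.
alpha = {"r1": ["r2", "b", "r3"], "r2": ["c"], "r3": [], "r4": ["e", "r1"]}
# Assignment from the text: output symbol b dropped from r1, and r4 added to r2.
alpha_prime = {"r1": ["r2", "r3"], "r2": ["c", "r4"], "r3": [], "r4": ["e", "r1"]}

print(is_nondeleting(alpha, REGS))        # False
print(is_nondeleting(alpha_prime, REGS))  # True
```

The check confirms that placing $r_4$ somewhere in the new assignment is what restores the nondeleting property; since $r_4$ is kept empty, the placement does not affect the output.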

Synchronous Multiple Context-Free Grammars
A multiple context-free grammar (over $\Sigma$) (for short: MCFG) is a tuple $G = (N, S, \Sigma, P)$ where $N$ is an alphabet of nonterminals, each nonterminal $A$ has a fanout in $\mathbb{N}$ (denoted by $\mathrm{fo}(A)$), $S \in N$ is an initial nonterminal with $\mathrm{fo}(S) = 1$, $\Sigma$ is an alphabet of terminals, and $P$ is a finite set of rules, where each rule has the form […] $x^{(i)}_{1,\ell_i}$ is a sequence of $\ell_i$ variables in $X$ such that the set of all variables occurring in […], each $w_j$ is in $(\Sigma \cup X_m)^*$; finally, the rule is linear in $X$, i.e., each variable in $X$ occurs at most once in $w_1 \cdots w_{\ell_0}$. The rank of this rule is $n$. The rank of $G$ is the maximal rank of its rules.
Rules can be instantiated by consistent substitution of variables. The derivation relation ⇒ of G is defined in the usual way, by applying instantiated rules. The language over Σ generated by G is defined to be the set of strings w such that S(w) ⇒ * ε, and is denoted by L(G). Two MCFGs are equivalent if they generate the same language.
A MCFG is called uni-lexicalized if there is exactly one terminal in each rule. For each MCFG there is an equivalent uni-lexicalized MCFG. A MCFG is called nondeleting if each variable that occurs in the right-hand side occurs exactly once in the left-hand side. For each MCFG there is an equivalent nondeleting MCFG (Seki et al., 1991).
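As an illustration of how MCFG rules build tuples of strings, the following Python sketch enumerates derivable tuples bottom-up. The rule encoding (left-hand nonterminal, argument templates, right-hand nonterminals with variable names), the function name `derive`, and the sample grammar for the copy language $\{ww \mid w \in \{a, b\}^*\}$ are our assumptions; the grammar is neither uni-lexicalized nor taken from this paper.

```python
from itertools import product


def derive(rules, rounds):
    """Bottom-up closure: facts[A] is the set of string tuples derivable for A."""
    facts = {}
    for _ in range(rounds):
        new = {k: set(v) for k, v in facts.items()}
        for lhs, templates, rhs in rules:
            # one list of candidate tuples per right-hand side nonterminal
            choices = [list(facts.get(B, ())) for B, _ in rhs]
            for combo in product(*choices):
                env = {}
                for (B, variables), tup in zip(rhs, combo):
                    env.update(zip(variables, tup))   # consistent substitution
                # variables are replaced by their values, terminals are kept
                inst = tuple("".join(env.get(s, s) for s in t) for t in templates)
                new.setdefault(lhs, set()).add(inst)
        facts = new
    return facts


# Hypothetical fanout-2 grammar: A derives pairs (w, w), S concatenates them.
RULES = [
    ("A", ((), ()), []),
    ("A", (("a", "x1"), ("a", "x2")), [("A", ("x1", "x2"))]),
    ("A", (("b", "x1"), ("b", "x2")), [("A", ("x1", "x2"))]),
    ("S", (("x1", "x2"),), [("A", ("x1", "x2"))]),
]

language = sorted(t[0] for t in derive(RULES, 4)["S"])
print(language)
```

Each variable occurs exactly once in the left-hand side of every rule, so this sample grammar is both linear and nondeleting in the sense defined above.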
A synchronous multiple context-free grammar (for short: synchronous MCFG) is a tuple $G = (N, S, \Sigma_1, \Sigma_2, P)$ such that $G' = (N, S, \Sigma_1 \cup \Sigma_2, P)$ is an MCFG (called underlying MCFG) except that $S$ has fanout 2. Moreover, for each nonterminal $A$ we split its fanout $\ell$ into an input fanout $\ell_1$ and an output fanout $\ell_2$ such that $\ell = \ell_1 + \ell_2$, and denote this by $\mathrm{fo}(A) = (\ell_1, \ell_2)$. In particular, we let $\mathrm{fo}(S) = (1, 1)$. We call the first $\ell_1$ arguments of $A$ its input arguments and the remaining $\ell_2$ arguments its output arguments, and we separate these two blocks by a semicolon. We require that elements of $\Sigma_1$ and $\Sigma_2$ may only occur in input arguments and output arguments, respectively. Finally, we require that no variable may simultaneously occur in an input and in an output argument. We implement this requirement by choosing $X$ as set of input variables and $Y = \{y_1, y_2, \ldots\}$ as set of output variables. Hence, a rule of a synchronous MCFG has the form […] where $n \in \mathbb{N}$, $A_0, A_1, \ldots, A_n$ are nonterminals and each $A_i$ has fanout $(\ell_i, m_i)$; for each $i \in [n]$, […]; finally, the rule is linear in $X$ and $Y$.
The MCFG $G_1 = (N, S, \Sigma_1, P_1)$ is the input component of $G$, where the fanout of each nonterminal of $N$ is its input fanout in $G$, and $P_1$ is the set of all rules of $P$ in which the output arguments are dropped. Similarly, we define the output component of $G$.
Let $G$ be a synchronous MCFG. We define the derivation relation $\Rightarrow_G$ of $G$ to be the derivation relation of its underlying MCFG. The string-to-string transduction computed by $G$ is the set […]

A uni-lexicalized synchronous MCFG is a synchronous MCFG in which each rule either contains exactly one input symbol or contains neither input symbols nor output symbols. In a straightforward way, we can associate with each uni-lexicalized synchronous MCFG $G$ a set $[[G]]_o$ of origin graphs by linking each occurrence of an output terminal of a rule to the unique input terminal of that rule.

Intersecting the Input of NSST with MCFG
Lemma 5.1. For every NSST $A$ over $\Sigma_1$ and $\Sigma_2$ and every MCFG $G$ over $\Sigma_1$, there is a uni-lexicalized synchronous MCFG $G'$ over $\Sigma_1$ and $\Sigma_2$ such that […] where $R$ consists of the registers $r_1, \ldots, r_\rho$, and let $G = (N, S, \Sigma_1, P)$ be an MCFG. Without loss of generality we may assume that $A$ is nondeleting and that $G$ is uni-lexicalized.
The intuition behind the construction of $G'$ covers two aspects. Starting from the MCFG $G$, we impose the state behaviour of $A$ onto the nonterminal behaviour of $G$ by a type of construction that can be traced back to Bar-Hillel et al. (1964). This aspect of the construction achieves […] The second aspect concerns the manipulation of the registers of $A$. We let $G'$ simulate the assignments in its output component, while its input component processes the input string.
The number of relevant assignments is in general infinite. In order to be able to simulate these assignments using a finite set of rules of $G'$, we split up each assignment into a finite part, called "pattern", and a potentially infinite part, called "residue". The pattern represents $\rho$ register occurrences in the image of the assignment, while the residue consists of the $2\rho$ strings that are interlaced with the register occurrences. The patterns are maintained as annotations of the nonterminals of $G'$ and the residues appear in the output arguments of $G'$. The residues of the left-hand side of a rule will be expressed in terms of residues that appear as output variables in the right-hand side. For this purpose, we introduce assignments that have output variables in their image.
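The split into pattern and residue can be sketched as follows. This Python fragment reflects our reading of the description above: the pattern keeps, for each register, the sequence of registers on its right-hand side, and the residue lists the output strings before, between and after those occurrences. The function name `split_assignment` and the encoding are hypothetical, and the exact representation used in the construction may differ.

```python
def split_assignment(alpha, registers):
    """Split a nondeleting assignment into (pattern, residue)."""
    pattern = {}
    residue = []
    for r in registers:
        occurring = []
        current = ""                      # output symbols since the last register
        for item in alpha[r]:
            if item in registers:
                residue.append(current)   # string preceding this register occurrence
                current = ""
                occurring.append(item)
            else:
                current += item
        residue.append(current)           # string after the last register occurrence
        pattern[r] = tuple(occurring)
    return pattern, residue


REGS = ("r1", "r2", "r3", "r4")
# A sample nondeleting assignment over four registers (rho = 4).
alpha = {"r1": ["r2", "r3"], "r2": ["c", "r4"], "r3": [], "r4": ["e", "r1"]}

pattern, residue = split_assignment(alpha, REGS)
print(pattern)
print(residue, len(residue))
```

For a nondeleting assignment there are $\rho$ register occurrences in total and one extra string per right-hand side, which accounts for the $2\rho$ residue strings; here $\rho = 4$ and the residue has 8 entries. Only finitely many patterns exist, whereas the residue strings are unbounded.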
For instance, let $G$ contain the rule $A(a x_1)$ […] and $A$ have states $q_1, \ldots, q_5$. Assume that there is a transition $(q_5, a, \alpha, q_3)$, and that there are strings $w_1$ […] and $(\alpha_1, q_2) \in \Delta(q_1, w^{(1)}_1)$. Then $G'$ will have a rule of the form […] and $p_1$ are patterns corresponding to $\alpha_1$ and $\alpha_2$, respectively, and $p^{(0)}_1$ corresponds to $\alpha_1 \bullet \alpha_2 \bullet \alpha$. Hence there is a corresponding pattern for each argument in the right-hand side and in the left-hand side.
Then the set $P'$ contains the rule […]. This is illustrated in Figure 2.
In addition, for each $c \in \{(q_0, p$ […] We can prove the following invariant. For every […] This invariant implies that for every $w \in \Sigma_1^*$ and $v \in \Sigma_2^*$: […].

Example 5.2. We consider the NSST $A$ of Example 3.1 and the MCFG $G = (N, A, \Sigma, P)$ with $N = \{A\}$, $\mathrm{fo}(A) = 1$, and for each $\gamma \in \Sigma$, $P$ contains the rules […] Obviously, $[[G]] = \Sigma^*$. We apply the construction of Lemma 5.1 to $A$ and $G$ and we obtain the uni-lexicalized synchronous MCFG $G'$ which contains for each $\gamma \in \Sigma$ and $i \in \{0, 1\}$ at least the following rules. […]
On the basis of Lemma 5.3, one can obtain complexity bounds on typical tasks involving NSST, such as deciding whether (w, v) ∈ [[A]] for given strings w and v and NSST A, relying on known complexity results for synchronous MCFG, and related formalisms such as synchronous LCFRS (Kaeshammer, 2013).
However, the relation between NSST and synchronous MCFG does not in any obvious way suggest a practical algorithm to do inference of NSST on the basis of sets of origin graphs, and this problem must remain outside the scope of the present paper.

[…] and MCFG $G'$ such that $[[G]] = [[A]] \cap ([[G']] \times \Sigma_2^*)$, for the shared output alphabet $\Sigma_2$, and perhaps even that […]. In this section we show the answer to the former question is negative, whereby it is negative for the latter question as well. This holds even if the rank of $G$ is restricted to 1 and […] To see this, consider the synchronous MCFG $G$ of rank 1 with $N = \{S, A\}$, $\Sigma_1 = \Sigma_2 = \{a, b, a', b'\}$, and the following rules. […]

As illustrated in
If we fix $n > 0$, to be determined later, then the number of possible schemas $\sigma(\alpha_1, \alpha_2)$ is bounded by $(2n)^{2\rho}$, where $\rho = |R|$ as before. This follows from the fact that each schema is determined by a set of pairs of indices. There is one such pair for each register $r$, consisting of the index in the schema where the substring $r^{|(\alpha_1 \bullet \alpha_\varepsilon)(r)|}$ starts, and another index where it ends. If this substring is empty, this can be encoded by a starting index that is greater than the ending index.
For each $q \in Q$, let $C(q)$ be the number of pairs $(w, w') \in \{a, b\}^n \times \{a', b'\}^n$ such that $G(w, w', q, s)$ for some $s$. Now fix $q$ to be such that $C(q)$ is maximal among the $\kappa$ states of $A$. This means that there are at least $2^{2n}/\kappa$ pairs $(w, w') \in \{a, b\}^n \times \{a', b'\}^n$ such that $G(w, w', q, s)$ for some $s$.
For each schema $s$, let $C(q, s)$ be the number of pairs $(w, w') \in \{a, b\}^n \times \{a', b'\}^n$ such that $G(w, w', q, s)$. Now fix $s$ to be such that $C(q, s)$ is maximal among the at most $(2n)^{2\rho}$ schemas. This means that there are at least $\frac{2^{2n}}{\kappa \cdot (2n)^{2\rho}}$ pairs $(w, w') \in \{a, b\}^n \times \{a', b'\}^n$ such that $G(w, w', q, s)$.
There is a string $w' \in \{a', b'\}^n$ such that there are at least $\frac{2^{2n}}{\kappa \cdot (2n)^{2\rho}} / 2^n = \frac{2^n}{\kappa \cdot (2n)^{2\rho}}$ strings $w \in \{a, b\}^n$ such that $G(w, w', q, s)$. For this $w'$ and $\alpha_2$ fixed, there are at least $\log_2\left(\frac{2^n}{\kappa \cdot (2n)^{2\rho}}\right) = n - \log_2 \kappa - 2\rho(1 + \log_2 n)$ positions in $\mathrm{shuffle}(w, w')$ where we may find both $a$ and $b$, depending on the choice of $w \in \{a, b\}^n$. This means the schema $s$ contains at least $2n - 2\log_2 \kappa - 4\rho(1 + \log_2 n) - \rho$ occurrences of symbols from $R$; note that in the output string, symbols from $\{a', b'\}$ are interlaced with symbols from $\{a, b\}$.
Similarly, there is a string $w \in \{a, b\}^n$ such that there are at least $\frac{2^n}{\kappa \cdot (2n)^{2\rho}}$ strings $w' \in \{a', b'\}^n$ such that $G(w, w', q, s)$. For this $w$ and $\alpha_1$ fixed, there are at least $n - \log_2 \kappa - 2\rho(1 + \log_2 n)$ positions in $\mathrm{shuffle}(w, w')$ where we may find both $a'$ and $b'$, depending on the choice of $w' \in \{a', b'\}^n$. This means the schema $s$ contains at least $2n - 2\log_2 \kappa - 4\rho(1 + \log_2 n) - \rho$ occurrences of †.
Altogether, this requires the length of the schema, and thereby of the output string, to be at least $4n - 4\log_2 \kappa - 8\rho(1 + \log_2 n) - 2\rho$. Since the output string has length $2n$, we obtain the contradiction $4n - 4\log_2 \kappa - 8\rho(1 + \log_2 n) - 2\rho > 2n$ by choosing $n$ large enough that $n > 2\log_2 \kappa + 4\rho(1 + \log_2 n) + \rho$, which is possible because the right-hand side grows only logarithmically in $n$.

The transduction $[[G]]$ above is almost the same as the transduction called merge by Alur and Černý (2010), who also present a proof that this is beyond the power of deterministic SST (DSST). Because this transduction is a function, and because functional NSSTs are equivalent to DSSTs (Alur and Deshmukh, 2011), this could be used to produce an alternative to our proof above. However, the proof by Alur and Černý (2010) appears to contain at least one mistake, which is why we chose to present our own.

Conclusions
Motivated by potential applications of origin graphs for machine translation, we have considered NSSTs. We have shown that when their input languages are restricted by MCFGs, then transducers with origin semantics are obtained that can also be generated by synchronous MCFGs. We have further shown that not every synchronous MCFG can be obtained by such a combination of a NSST and a MCFG.