Lexicalization of Probabilistic Linear Context-free Rewriting Systems

In the field of constituent parsing, probabilistic grammar formalisms have been studied to model the syntactic structure of natural language. More recently, neural approaches have gained considerable traction in this field, as they achieve accurate results at high speed. We aim for a symbiosis between probabilistic linear context-free rewriting systems (PLCFRS) as a probabilistic grammar formalism and neural models to get the best of both worlds: the interpretability of grammars, and the speed and accuracy of neural models. To combine the two, we consider the approach of supertagging, which requires lexicalized grammar formalisms. Here, we present a procedure which turns any PLCFRS G into an equivalent lexicalized PLCFRS G'. The derivation trees of G' are then mapped to equivalent derivations of G. Our construction for G' preserves the probability assignment and does not increase parsing complexity compared to G.


Introduction
Constituency parsing is a syntactic analysis in NLP that aims to annotate sentences with (usually tree-shaped) phrase structures (for an example, cf. the left of Fig. 1). Formalisms such as context-free grammars (CFG) are used in this setting because they are conceptually simple and interpretable, and parsing is tractable (cubic in sentence length).
Discontinuous constituents span non-contiguous sets of positions in a sentence. The resulting phrase structures do not take the shape of a tree anymore, as they contain crossing branches (cf. the left of Fig. 1), and cannot be modeled by CFG. As a countermeasure, many corpora (e.g., the Penn Treebank (PTB)) denote these phrase structures as trees nevertheless and introduce designated notations for discontinuity, which is then often ignored in parsing. However, discontinuity occurs in about 20 % of the sentences in the PTB, and parsing discontinuous constituents can improve accuracy (Evang and Kallmeyer, 2011). For this, so-called "mildly context-sensitive" grammar formalisms have been investigated, e.g., tree-adjoining grammars (TAG; Joshi et al., 1975) and linear context-free rewriting systems (LCFRS; Vijay-Shanker et al., 1987). Their increased expressiveness comes at the cost of a higher parsing complexity: given a sentence of length n, parsing is in O(n^6) for TAG and O(n^(3·fanout(G))) for an LCFRS G. The fanout is grammar-specific and reflects the degree of discontinuity in the rules of G. The expressiveness of TAG equals that of LCFRS with fanout 2. An LCFRS derivation of a discontinuous phrase is shown in the right of Fig. 1.
Supertagging has been used for more efficient parsing with lexical TAG (Bangalore and Joshi, 1999). A TAG is lexical if each rule contains one word. A supertagger selects for each position of the input sentence a subset of the rules of the TAG; these are the so-called supertags. Parsing is then performed with the much smaller grammar of supertags. Recently, the performance of supertagging has been improved by using neural classifiers for the selection of supertags (Vaswani et al., 2016). The goal of our research is to use supertagging for parsing with LCFRS, because they are more expressive than TAG.
In this paper, we lay the theoretical foundations for a supertagging-based LCFRS parser. As LCFRS obtained from corpora such as the PTB are usually not lexical, we employ a lexicalization procedure. It can be seen as an instance of the technique for lexicalization of multiple context-free tree grammars (Engelfriet et al., 2018). However, our approach is more concise and does not increase the fanout of the grammar (thus preserving parsing complexity). Moreover, the approach is extended to account for probabilistic parsing. Furthermore, we introduce a procedure which recovers from each derivation of a lexicalized LCFRS all corresponding derivations of the original grammar. This short paper is to be seen as a report on our approaches to lexicalization and recovery of derivations. An implementation of the supertagger and experimental evaluation are currently work in progress.

Preliminaries
The set of non-negative (resp. positive) integers is denoted by N (resp. N_+). We abbreviate {1, ..., n} by [n] for each n ∈ N. Let A be a set; the set of (finite) strings over A is denoted by A*. An alphabet is a finite and non-empty set.
Let S be some set whose elements we call sorts. An S-sorted set is a tuple (A, sort) where A is a set and sort: A → S. Usually, we identify (A, sort) with A, and denote sort by sort_A and sort^(-1)(s) by A_s for each s ∈ S. The usual notation for sets (∈, ⊆, ∪, ...) is used with sorted sets in the intuitive manner. Now let A be an (S* × S)-sorted set. The set of trees over A is the S-sorted set T_A, defined as usual. A ranked set A is an (S* × S)-sorted set where S = {s}; the notation rk_A(a) = k abbreviates sort_A(a) = (s^k, s), and A_k abbreviates A_(s^k,s). If we use a usual set B in place of a ranked set, we silently assume rk_B(b) = 0 for each b ∈ B. Let X be a set. We let A(X) = {a(x_1, ..., x_k) | k ∈ N, a ∈ A_k, x_1, ..., x_k ∈ X}.
LCFRS. Linear context-free rewriting systems extend the rule-based string rewriting mechanism of CFG to string tuples; we describe the generation process by compositions. Let k ∈ N and s_1, ..., s_k, s ∈ N_+; a Σ-composition is a tuple (u_1, ..., u_s) where each of u_1, ..., u_s is a non-empty string over Σ and variables of the form x_i^j with i ∈ [k] and j ∈ [s_i]. Each of these variables must occur exactly once in u_1 ··· u_s, and they are ordered such that x_i^1 occurs before x_(i+1)^1 and x_i^j occurs before x_i^(j+1) for each i ∈ [k−1] and j ∈ [s_i−1]. We denote the set of such Σ-compositions by C^Σ_(s_1···s_k,s); we drop the superscript in the case Σ = ∅ (then C_(s_1···s_k,s) is finite), and we drop the subscript if we admit any configuration of k, s_1, ..., s_k and s. We associate with each composition (u_1, ..., u_s) ∈ C^Σ_(s_1···s_k,s) a function from k string tuples, where the i-th tuple is of length s_i, to a string tuple of length s; this function is denoted by ⟦(u_1, ..., u_s)⟧. Intuitively, it replaces each variable x_i^j in u_1, ..., u_s by the j-th component of the i-th argument.
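To make the variable-substitution semantics of compositions concrete, here is a minimal Python sketch. The tuple-of-templates encoding and the function name are our own assumptions, not the paper's: a composition is a tuple of component templates, each a list of terminal strings and 0-based variable pairs (i, j) standing for x_(i+1)^(j+1).

```python
def apply_composition(comp, args):
    """Evaluate a composition on a list of string tuples `args`: each
    component template is a list whose items are either terminal strings
    or variable pairs (i, j), i.e. the j-th component of the i-th
    argument (both 0-based)."""
    result = []
    for template in comp:
        parts = []
        for item in template:
            if isinstance(item, tuple):          # a variable (i, j)
                i, j = item
                parts.append(args[i][j])
            else:                                # a terminal string
                parts.append(item)
        result.append(" ".join(parts))
    return tuple(result)

# The composition (x_1^1, on x_2^1) applied to the argument tuples
# ("A hearing",) and ("the issue",):
print(apply_composition(([(0, 0)], ["on", (1, 0)]),
                        [("A hearing",), ("the issue",)]))
# -> ('A hearing', 'on the issue')
```

Note how the second output component interleaves a terminal with a variable; this is exactly what distinguishes LCFRS compositions from plain CFG concatenation.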
An LCFRS is a tuple G = (N, Σ, S, R) consisting of a sorted set N of nonterminals, an alphabet Σ of terminals, a start symbol S ∈ N, and a finite set R of rules of the form A → c(B_1, ..., B_k) where A, B_1, ..., B_k ∈ N and c is a Σ-composition of matching sort. The sort of the rule is (B_1 ··· B_k, A); we call A the left-hand side (lhs), B_1, ..., B_k the right-hand side (rhs), and c the rule's composition. We drop the parentheses around the rhs if k = 0. We call rules with k = 0 terminating, rules with k = 1 monic, and rules with k ≥ 2 branching, and denote the sets of these rules in R by R^(T), R^(M) and R^(B), respectively. A rule is called lexical if its composition contains at least one terminal. The LCFRS G is called lexical if each of its rules is lexical. The set of (complete) derivations in G is denoted by T_R.

Tree transducers. A tree transducer (TT) is a tuple A = (Q, Σ, ∆, q_0, δ) where Q is a set of states, Σ and ∆ are the input and output alphabets, q_0 ∈ Q is the initial state, and δ is a set of transitions of the form q(σ(x_1, ..., x_k)) → t where q ∈ Q, σ ∈ Σ_k and t ∈ T_(∆∪Q({x_1,...,x_k})). We call a transition linear (resp. non-deleting) if each variable in {x_1, ..., x_k} occurs at most once (resp. at least once) in t. We call A deterministic if, for each q ∈ Q and σ ∈ Σ, there is at most one transition of the form q(σ(x_1, ..., x_k)) → t in δ. We call A linear (resp. non-deleting) if each transition is linear (resp. non-deleting).
In a TT with ε-rules (εTT), the set δ may also contain transitions of the form q(x_1) → t where t ∈ T_(∆∪Q({x_1})) and q ∈ Q. We treat them analogously to transitions where k = 1.
The transduction of A is expressed in terms of the binary derivation relation ⇒_A over T_(∆∪Q(T_Σ)). We write s ⇒_A t if there is a subtree q(σ(s_1, ..., s_k)) in s and a transition q(σ(x_1, ..., x_k)) → t' such that t is obtained by replacing q(σ(s_1, ..., s_k)) in s by t' and replacing x_i by s_i for each i ∈ [k]. TTs are a special case of TTs with regular look-ahead (TT^R), which were used by Engelfriet et al. (2018). For an excellent overview of tree transducers, we refer to Maletti (2010).
Equivalence of (P)LCFRS. Two LCFRS G and G' are called ldTT^R-equivalent if there are two linear and deterministic TT^R, T and T', such that T transduces each derivation of G into an equivalent derivation of G', and T' vice versa. Both T and T' may map multiple derivations to one single derivation. For PLCFRS, the relation between the weight assignments assures that such a mapped derivation assumes the greatest of the original derivations' weights.

Lexicalizing LCFRS
For the remainder, we assume (without loss of generality; cf. Seki et al., 1991) that each non-terminating rule is of the form A → c(B_1, ..., B_k) where c ∈ C (i.e., c contains no terminal symbols) and none of B_1, ..., B_k is S. Starting with (G, µ), we incrementally construct a lexical PLCFRS in three steps:

1. Monic rules are removed.
2. Terminating rules are removed and, for each branching rule, each subset of its rhs nonterminals is replaced with the lexical symbols of matching terminating rules. This construction yields terminating rules with at least two lexical symbols, and each constructed monic rule contains at least one lexical symbol.
3. A terminal is cut from each terminating rule and pasted into a remaining non-lexical branching rule if a derivation containing the branching rule reaches the terminating rule at some point. At the end of this step, each rule contains at least one terminal.
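The rule classes these steps operate on (terminating, monic, branching; lexical vs. terminal-free) can be sketched in Python. The encoding below is our own hypothetical one, not the paper's: a rule is its lhs, a composition given as a tuple of templates mixing terminal strings and 0-based variable pairs (i, j), and a tuple of rhs nonterminals.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """Hypothetical rule encoding: lhs nonterminal, composition as a
    tuple of component templates (lists of terminal strings and 0-based
    variable pairs (i, j) standing for x_{i+1}^{j+1}), rhs nonterminals."""
    lhs: str
    comp: tuple
    rhs: tuple

def kind(rule):
    """Terminating (k = 0), monic (k = 1) or branching (k >= 2)."""
    k = len(rule.rhs)
    return "terminating" if k == 0 else "monic" if k == 1 else "branching"

def is_lexical(rule):
    """A rule is lexical iff its composition contains a terminal."""
    return any(isinstance(item, str)
               for template in rule.comp for item in template)

# NP_1 -> (A hearing) is a lexical terminating rule; the terminal-free
# NP_2 -> (x_1^1, x_2^1)(NP_1, PP) is branching and not lexical.
print(kind(Rule("NP1", (["A", "hearing"],), ())))                     # -> terminating
print(is_lexical(Rule("NP2", ([(0, 0)], [(1, 0)]), ("NP1", "PP"))))   # -> False
```

Under the normal form assumed above, only terminating rules are lexical before the construction starts; the three steps make every rule lexical.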
The first two steps are direct instances of the lexicalization for multiple context-free tree grammars as introduced by Engelfriet et al. (2018).
Step 1 (Dechain). In the first step, we remove each monic rule and chain its composition with the composition of each other reachable rule.

Definition 1. Let k ∈ N, s, s', s_1, ..., s_k ∈ N_+, c ∈ C_(s_1···s_k,s) and c' ∈ C_(s,s'). We denote the composition ⟦c'⟧(c) ∈ C_(s_1···s_k,s') by c' • c.
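Under a hypothetical tuple-of-templates encoding of terminal-free compositions (lists of 0-based variable pairs (i, j) standing for x_(i+1)^(j+1); terminals would appear as plain strings), the chaining c' • c of Definition 1 amounts to splicing the components of c into the variables of the monadic composition c'. A sketch, with names of our own choosing:

```python
def chain(c_outer, c_inner):
    """Compute c_outer • c_inner = [[c_outer]](c_inner): c_outer must be
    monadic (a single argument), so each of its variables (0, j) is
    replaced by the j-th component template of c_inner."""
    result = []
    for template in c_outer:
        new_template = []
        for item in template:
            if isinstance(item, tuple):           # a variable (0, j)
                _, j = item
                new_template.extend(c_inner[j])   # splice component j in
            else:                                 # a terminal string
                new_template.append(item)
        result.append(new_template)
    return tuple(result)

# Chaining the fanout-reducing composition (x_1^1 x_1^2) with
# (x_1^1, x_2^1) yields (x_1^1 x_2^1):
print(chain(([(0, 0), (0, 1)],), ([(0, 0)], [(1, 0)])))
# -> ([(0, 0), (1, 0)],)
```

The outer composition is always monadic here because it stems from a monic rule, which is exactly the situation of Definition 1 with s as the intermediate sort.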
The set of rules R_dc is the smallest set R' such that
• R \ R^(M) ⊆ R', and
• for each A → c_1(B) ∈ R^(M) and B → c_2(C_1, ..., C_k) ∈ R', the rule A → (c_1 • c_2)(C_1, ..., C_k) is in R'.
We define the function µ_dc: R_dc → [0, 1] such that, for each rule r = A → c(C_1, ..., C_k) ∈ R_dc,
µ_dc(r) = max({µ(r)} ∪ {µ(r_1) · µ_dc(r_2) | r_1 = A → c_1(B) ∈ R^(M), r_2 = B → c_2(C_1, ..., C_k) ∈ R_dc, c = c_1 • c_2}).
The set R_dc, as well as the function µ_dc, can be efficiently computed with an instance of the algorithm by Aho et al. (1974, alg. 5.5).

Step 2 (Fuse Terminals). In this step, the symbols of the terminating rules are inserted into non-terminating (terminal-free) rules. The intuition is given in Fig. 2. For the formal definition, we first introduce the insertion operation.
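The maximization behind µ_dc is a fixpoint over chains of monic rules and can be computed with a Floyd-Warshall-style closure in the (max, ·) semiring. The sketch below abstracts away the compositions and keeps only chain weights between nonterminals; the dict representation and function name are our assumptions:

```python
def best_chain_weights(nonterminals, monic):
    """Closure of the monic-rule weights in the (max, *) semiring:
    best[(A, B)] is the greatest product of weights along any chain of
    monic rules rewriting A into B. `monic` maps a pair (A, B) to the
    weight of the monic rule A -> c(B)."""
    best = {(a, b): 0.0 for a in nonterminals for b in nonterminals}
    for a in nonterminals:
        best[(a, a)] = 1.0                      # the empty chain
    for edge, weight in monic.items():
        best[edge] = max(best[edge], weight)
    for mid in nonterminals:                    # Floyd-Warshall order
        for a in nonterminals:
            for b in nonterminals:
                cand = best[(a, mid)] * best[(mid, b)]
                if cand > best[(a, b)]:
                    best[(a, b)] = cand
    return best

weights = best_chain_weights(["A", "B", "C"],
                             {("A", "B"): 0.5, ("B", "C"): 0.4})
print(weights[("A", "C")])  # -> 0.2
```

Since all weights lie in [0, 1], cycles of monic rules can never improve a chain, so the single Floyd-Warshall pass suffices; µ_dc of a dechained rule then combines such a chain weight with the weight of the non-monic rule the chain ends in.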
For each rule r = A → c(B_1, ..., B_k), we define the set
F(r) = {ρ: π → Σ | π ⊆ [k], ∀i ∈ π: B_i → (ρ(i)) ∈ R_dc^(T)}.
It contains each function that replaces a subset of r's rhs nonterminals with terminal symbols, respecting the terminating rules; we use it to define the set of rules R_ft and the function µ_ft: R_ft → [0, 1].

Step 3 (Propagate Terminals). In this final step, terminal symbols are inserted into the remaining non-lexical branching rules. Intuitively, one terminal symbol is cut from each terminating rule, which now has at least two occurrences of terminal symbols. This symbol is then pasted into a non-lexical branching rule that reaches, via its second right-hand side nonterminal, the rule the symbol was cut from. If there are rules between the terminating rule the symbol was cut from and the rule it shall be pasted into, then the information that a terminal can be pasted through these rules is propagated via the nonterminals. For this, we extend the set of nonterminals to N_pt = N ∪ (N × Σ), where (A, σ) ∈ N × Σ indicates that σ was cut from a rule with left-hand side A (and sort_(N_pt)(A, σ) = sort_N(A) for each (A, σ) ∈ N × Σ).

Definition 3. Let σ ∈ Σ and r be a rule of the form A → (σu_1, u_2, ..., u_s)(B_1, ..., B_k). We denote the rule (A, σ) → (u_1, ..., u_s)(B_1, ..., B_k) by cut(r).
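The set F(r) from step 2 can be enumerated directly: choose a subset π of rhs positions that have matching terminating rules, then one terminal per chosen position. A Python sketch, under the assumption that R_dc^(T) is given as a dict mapping each nonterminal to the list of terminals of its terminating rules:

```python
from itertools import combinations, product

def fuse_options(rhs, terminating):
    """Enumerate F(r) for a rule with rhs nonterminals `rhs`: every map
    rho from a subset of (0-based) positions to terminals such that
    position i may map to t only if the terminating rule
    rhs[i] -> (t) exists."""
    positions = [i for i, b in enumerate(rhs) if terminating.get(b)]
    for n in range(len(positions) + 1):
        for subset in combinations(positions, n):
            # one choice of terminal per selected position
            for choice in product(*(terminating[rhs[i]] for i in subset)):
                yield dict(zip(subset, choice))

# For rhs (NP_1, PP) where only NP_1 has terminating rules:
opts = list(fuse_options(("NP1", "PP"), {"NP1": ["hearing", "issue"]}))
print(opts)  # -> [{}, {0: 'hearing'}, {0: 'issue'}]
```

The empty map corresponds to keeping the rule unchanged; the other maps yield the fused rules in which the selected rhs nonterminals are replaced by lexical symbols.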
Definition 4. Let r = A → c(B_1, ..., B_k) be a rule and i ∈ [k]. We obtain c' from c by replacing the variable x_i^1 with σx_i^1, and we denote A → c'(B_1, ..., (B_i, σ), ..., B_k) by paste_σ^i(r).

Let R_ft^(∖Σ) denote the set of rules in R_ft^(B) without terminals. The set of rules R_pt is then defined in terms of cut and paste. Note that, in contrast to Engelfriet et al. (2018), we do not need to split the rules' compositions, because we always cut the first symbol. Therefore, the fanout of the grammar is unchanged.
The applications of cut and paste that led to a rule in R_pt are unambiguously determined by the lhs and rhs nonterminals in N × Σ. These operations are therefore unambiguously reversible: for each r' ∈ R_pt, there is a uniquely determined rule r ∈ R_ft such that r' is exactly one of
• r,
• paste_σ^2(r),
• cut(r),
• cut(paste_σ^1(r)), or
• cut(paste_(σ_1)^1(paste_(σ_2)^2(r))).
We define the function µ_pt: R_pt → [0, 1] such that µ_pt(r') = µ_ft(r) for each r' ∈ R_pt.
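Definitions 3 and 4 can be mirrored in code. In a hypothetical (lhs, comp, rhs) triple encoding of rules (compositions as tuples of templates mixing terminal strings and 0-based variable pairs (i, j) standing for x_(i+1)^(j+1); only the names cut and paste come from the paper):

```python
def cut(rule):
    """cut(r): remove the leading terminal sigma of the first component
    and record it in the lhs nonterminal as (A, sigma)."""
    lhs, comp, rhs = rule
    sigma = comp[0][0]
    assert isinstance(sigma, str), "first component must start with a terminal"
    new_comp = (comp[0][1:],) + comp[1:]
    return ((lhs, sigma), new_comp, rhs)

def paste(rule, i, sigma):
    """paste_sigma^i(r): prepend sigma to the variable x_{i+1}^1 (i is
    0-based here) and mark the i-th rhs nonterminal as (B_i, sigma)."""
    lhs, comp, rhs = rule
    new_comp = tuple(
        [x for item in template
           for x in ([sigma, item] if item == (i, 0) else [item])]
        for template in comp)
    new_rhs = rhs[:i] + ((rhs[i], sigma),) + rhs[i + 1:]
    return (lhs, new_comp, new_rhs)

# PP -> (on x_1^1)(NP_1) from Fig. 3: cutting "on" and pasting "the"
# yields (PP, on) -> (the x_1^1)((NP_1, the)).
r = ("PP", (["on", (0, 0)],), ("NP1",))
print(paste(cut(r), 0, "the"))
# -> (('PP', 'on'), (['the', (0, 0)],), (('NP1', 'the'),))
```

Because cut always removes the first symbol of the first component, no component is ever split, which is the code-level counterpart of the fanout-preservation remark above.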

Unlexicalizing Derivations
The ultimate goal of parsing is to obtain derivations of the original grammar (G, µ), which are quite different from the derivations of the transformed grammar ((N_pt, Σ, S, R_pt), µ_pt). Therefore, we seek a transformation from T_(R_pt) to T_R. Engelfriet et al. (2018) have introduced deterministic, linear and non-deleting TT^R from T_(R_pt) to T_(R_ft), from T_(R_ft) to T_(R_dc), and from T_(R_dc) to T_R; they were used to show the ldTT^R-equivalence of the corresponding grammars. The composition of these transducers yields a transduction from T_(R_pt) to T_R. However, for recovering derivations, these transducers are not adequate, as a derivation in T_(R_ft) or T_(R_dc) may have multiple possible originals in T_(R_dc) or T_R, respectively. We want to be able to obtain all of these derivations, as this may be beneficial for later stages of an application (e.g., when selecting the k best derivations is desired). For this, we employ two approaches. We map each derivation in T_(R_pt) to its unique original derivation using the transducer of Engelfriet et al. (2018); we denote this transducer by T←_pt. It realizes a deterministic tree relabeling, which was already indicated at the end of the previous section. For the other two transductions, we define novel nondeterministic tree transducers.

[Figure 3: Propagation of terminals. Left: derivation of a (partly lexicalized) LCFRS before the application of step 3, consisting of the rules NP_2 → (x_1^1, x_2^1)(NP_1, PP), NP_1 → (A hearing), PP → (on x_1^1)(NP_1) and NP_1 → (the issue). Right: derivation after step 3, where "the" is propagated from the leaf to the PP rule and "on" is propagated from there to the NP_2 rule, yielding NP_2 → (x_1^1, on x_2^1)(NP_1, (PP, on)), NP_1 → (A hearing), (PP, on) → (the x_1^1)((NP_1, the)) and (NP_1, the) → (issue), thus lexicalizing it.]
Transduction T_(R_dc) → T_R. Let us denote the composition (x_1^1, ..., x_1^(sort_N(A))) by id_A for each A ∈ N. We define the linear and non-deleting εTT T←_dc = (Q, R_dc, R, (S, id_S), δ) where Q = ⋃_(A,B∈N) {B} × C_(sort_N(B),sort_N(A)) and δ is the smallest set that contains, for each dechained rule in R_dc, the transitions that reconstruct a chain of monic rules in R ending in the non-monic rule the chain was built from. The composition of the three transductions introduced in this section is the inverse of the transduction T from the proof of Theorem 5 (cf. App. B, p. 7). Therefore, the k best derivations in (G, µ) must be among the transductions of the k best derivations in ((N_pt, Σ, S, R_pt), µ_pt), which benefits the enumeration of k best derivations.

Conclusion
Based on Engelfriet et al. (2018), we have introduced a procedure which constructs for every PLCFRS G an equivalent lexicalized PLCFRS G'. Moreover, we have described how to recover from each derivation of G' all corresponding derivations of G. In future work, we will use our approach to implement a supertagging-based LCFRS parser.

A Supplemental Algorithms
Algorithm 1 (comps) shows an instance of the algorithm by Aho et al. (1974, alg. 5.5) that we use to compute R_dc and µ_dc efficiently.

B Supplemental Proofs
We prove Thm. 5 in two steps.
First, Lem. 7 shows that the (unweighted) underlying grammars G and (N_pt, Σ, S, R_pt) are ldTT^R-equivalent. As linear and deterministic TT^R are closed under composition, the idea is to construct two of them for each step: one that transduces derivations of the original grammar to derivations of the constructed grammar, and one vice versa. The three transducers for each direction are then composed to obtain transductions from derivations in G to derivations in (N_pt, Σ, S, R_pt) and vice versa.
Thm. 5 additionally asserts the preservation of the weights. Similarly to the above, we show that the PLCFRS constructed in each of the three steps are ldTT^R-equivalent; this property is clearly transitive.
Proof. The first two steps are instances of Lems. 32 and 37 by Engelfriet et al. (2018); therefore, we only show the third step.
Step 3. For the construction, we consider R_ft and R_pt as ranked sets. For each R' ∈ {R_ft, R_pt} and rule of the form r = A → c(B_1, ..., B_k) ∈ R', we let rk_(R')(r) = k.
Let, for each σ ∈ Σ, R_σ and R_X be the sets of all rules in R_ft of the form A → c(B_1, ..., B_k) where c is of the form (σu_1, u_2, ..., u_s) and (x_1^1 u_1, u_2, ..., u_s), respectively. To decide whether a terminal symbol may be propagated through a derivation, we define the look-ahead language L_σ