Tensors over Semirings for Latent-Variable Weighted Logic Programs

Semiring parsing is an elegant framework for describing parsers by using semiring weighted logic programs. In this paper we present a generalization of this concept: latent-variable semiring parsing. With our framework, any semiring weighted logic program can be latentified by transforming weights from scalar values of a semiring to rank-n arrays, or tensors, of semiring values, allowing latent-variable models to be expressed within the semiring parsing framework. A semiring is too strong a notion when dealing with tensors, and we have to resort to a weaker structure: a partial semiring. We prove that this generalization preserves all the desired properties of the original semiring framework while strictly increasing its expressiveness.


Introduction
Weighted Logic Programming (WLP) is a declarative approach to specifying and reasoning about dynamic programming algorithms and chart parsers. WLP is a generalization of bottom-up logic programming where proofs are assigned weights by combining the weights of the axioms used in the proof, and the weight of a theorem is in turn calculated by combining the weights of all its possible proof paths. The combinatorial nature of this procedure makes weighted logic programs highly suitable for specifying dynamic programming algorithms. In particular, Goodman (1999) presents an elegant abstraction for specifying and computing parser values based on WLP where the values could be drawn from any complete semiring. This generalizes the case of Boolean decision problems, probabilistic grammars with Viterbi search, and other quantities of interest such as the best derivation or the set of all possible derivations. It is then possible to derive a general formulation of inside and outside calculations in a way that is agnostic to the particular semiring chosen.

[Footnote 1: Our definition of a partial semiring is slightly different from those in the abstract algebra literature, e.g. Steenstrup (1985).]

Latent variable models have been an important component in the NLP toolbox. The central assumption in latent variable models is that the correlations between observed variables in the training data can be explained by unobserved, hidden variables. Latent variables have been used with grammars such as Probabilistic Context-Free Grammars (PCFGs), where each node in the parse tree is represented using a vector of latent state probabilities that further extends the expressiveness of the grammar (Matsuzaki et al., 2005).
The approach of adding latent variables to formal grammars has proven to be a fruitful one: in the context of PCFG parsing, Matsuzaki et al. (2005) show that latent variable PCFGs (L-PCFGs) perform on par with models hand-annotated with linguistically motivated features. Cohen et al. (2013) report that on the Penn Treebank dataset, L-PCFGs trained with either EM or a spectral algorithm provide a 20% increase in F1 over PCFGs without latent states. Gebhardt (2018) shows that the benefits of latent variables are not limited to PCFGs by successfully enriching both Linear Context-Free Rewriting Systems and Hybrid Grammars with latent variables, and demonstrates their applicability on discontinuous constituent parsing.
Given the usefulness of latent variables, it would be desirable to have a generic inference mechanism for any latent variable grammar. WLPs can represent inference algorithms for probabilistic grammars effectively. However, this does not trivially extend to latent-variable models because latent variables are often represented as vectors, matrices and higher-order tensors, and these taken together no longer form a semiring. This is because in the semiring framework, values for deduction items and for rules must all come from the same set, and the semiring operations must be defined over all pairs of values from this set. This does not allow for letting different grammar nonterminals be represented by vectors of different sizes. More importantly, it does not allow for a rule's value to be a tensor whose dimensionality depends on the rule's arity, as is generally the case in latent variable frameworks.
In this paper we start with a broad interpretation of latent variables as tensors over an arbitrary semiring. While a set of tensors over semirings is no longer a semiring, we prove that if the set of tensors have certain matching dimensions for the set of grammar rules they are assigned to, then they fulfill all the desirable properties relevant for the semiring parsing framework. This paves the way to use WLPs with latent variables, naturally improving the expressivity of the statistical model represented by the underlying WLP. Introducing a semiring framework like ours makes it easier to seamlessly incorporate latent variables into any execution model for dynamic programming algorithms (or software such as Dyna, Eisner et al. 2005, and other Prolog-like/WLP-like solvers).
We focus on CFG parsing, however the same latent variable techniques can be applied to any weighted deduction system, including systems for parsing TAG, CCG and LCFRS, and systems for Machine Translation (Lopez, 2009). The methods we present for inside and outside computation can be used to learn latent refinements of a specified grammar for any of these tasks with EM (Dempster et al., 1977; Matsuzaki et al., 2005), or used as a backbone to create spectral learning algorithms (Hsu et al., 2012; Bailly et al., 2009; Cohen et al., 2014).

Main Results Takeaway
We present a strict generalization of semiring weighted logic programming, with a particular focus on parser descriptions in WLP for context-free grammars. Throughout, we utilize the correspondence between axioms and grammar rules, deductive proofs and grammar derivations, and derived theorems and strings.
We assume that axioms/grammar rules come equipped with weights in the form of tensors over semiring values. The main issue with going from semirings to tensors over semiring values is that these weights need to be well defined, in that any valid derivation should correspond to a sequence of well defined semiring operations. For CFGs, we give a straightforward condition that ensures this is the case. This essentially boils down to making sure that each non-terminal corresponds to a fixed vector space dimension. For example, if A corresponds to a space of d_1 dimensions, B to d_2 and C to d_3, then a rule A → B C would have a tensor weight in S^{d_2 × d_3 × d_1}.
As long as the weights are well defined, the standard definitions for the value of a grammar derivation and a string according to a semiring weighted grammar extend to the case of tensors of semirings. Weighted logic programming provides the means to declaratively specify an efficient algorithm to obtain these values of interest. In line with Sikkel (1998) and Goodman (1999) we present precise conditions for when a partial-semiring WLP describes a correct parser.
The value of the WLP formulation of parsing algorithms is that it provides a unified fashion in which dynamic programming algorithms can be extracted from the program description. This relies on the ability of a WLP to decompose the value of a proof to a combination of the values of the sub-proofs. Specifically, given a derivation tree, a WLP description automatically provides algorithms for calculating the inside and outside values. We provide analogous algorithms for calculating the inside and outside values for partial-semiring WLPs. Our outside formulation addresses the noncommutative nature of tensors themselves, and could be extended to cases where the underlying semiring is non-commutative using the techniques presented by Goodman (1998).

Related Work
"Parsing as deduction" (Pereira and Warren, 1983) is an established framework that allows a number of parsing algorithms to be written as declarative rules and deductive systems (Shieber et al., 1995), and their correctness to be rigorously stated (Sikkel, 1998). Goodman (1999) has extended the parsing as deduction framework to arbitrary semirings and showed that various different values of interest could be computed using the same algorithm by changing the semiring. This led to the development of Dyna, a toolkit for declaratively specifying weighted logic programs, allowing concise implementation of a number of NLP algorithms (Eisner et al., 2005).
The semiring characterization of possible values to assign to WLPs gave rise to the formulation of a number of novel semirings. One novel semiring of interest for purposes of learning parameters is the generalized entropy semiring (Cohen et al., 2008), which can be used to calculate the KL-divergence between the distributions of derivations induced by two weighted logic programs. Two other semirings of interest are the expectation and variance semirings introduced by Eisner (2002) and Li and Eisner (2009). These utilize the algebraic structure to efficiently track quantities needed by the expectation-maximization algorithm for parameter estimation. Their framework allows working with parameters in the form of vectors in R^n for a fixed n, coupled with a scalar in R_{≥0}. The semiring value of a path is roughly calculated by the multiplication of the scalars and (appropriately weighted) addition of the vectors. This is in contrast with our framework, where weights can be tensors of arbitrary rank rather than only vectors, and the values of paths are calculated via tensor multiplication.
Finally, Gimpel and Smith (2009) extended the semiring framework to a more general algebraic structure with the purpose of incorporating nonlocal features. Their extension comes at the cost that the new algebraic structure does not obey all the semiring axioms. Our framework differs from theirs in that under reasonable conditions, tensors of semirings do behave fully like regular semirings.

Background and Notation
Our formalism could be used to enrich any WLP that implements a dynamic programming algorithm, but for simplicity, we follow Goodman (1999) and focus our presentation on parsers with a context-free backbone.

Context-free Grammars
Formally, a Context-Free Grammar (CFG) is a 4-tuple ⟨N, Σ, R, S⟩. The set N contains the non-terminals, which will be denoted by uppercase letters A, B etc., and S ∈ N is the special start symbol. The set Σ contains the terminals, which will be denoted by lowercase letters a, b etc. R is the set of rules of the form A → α, consisting of one non-terminal on the left hand side (lhs) and a string α ∈ (N ∪ Σ)* on the right hand side (rhs). We will write α ⇒ β if β can be derived from α with the application of one grammar rule. We will say that a sentence σ ∈ Σ⁺ can be derived from the non-terminal A if σ can be generated by starting with A and repeatedly applying rules in R until the right hand side contains only terminals, and denote this as A ⇒* σ. We will denote the language that a grammar G defines by L(G). CFG derivations can naturally be represented as trees. We will use the notation r : T_1 … T_k to represent a tree that has the node r as its root and T_1, …, T_k as its direct subtrees. We will use D_G to denote the set of all derivation trees that can be constructed with the grammar G, and D_G(σ) for all valid derivation trees that generate the sentence σ in G.

[Footnote 2: Note that given a grammar G in a formalism F and a string α, it is possible to construct a CFG c(G, α) from G and α (Nederhof, 2003). This construction is possible even for range concatenation grammars (Boullier, 2004), which span all languages that can be parsed in polynomial time.]

Semirings
A semiring is an algebraic structure similar to a ring, except that it does not require additive inverses.
Definition 1. A semiring is a set S together with two operations + and ×, where + is commutative, associative and has an identity element 0. The operation of × is associative, has an identity element 1 and distributes over +.
There are a few less common semirings that provide useful values in parsing. The Viterbi semiring calculates the probability of the best derivation. It has values in [0, 1], with + := max, and with ×, 0 and 1 as in the probability semiring. The derivation forest, Viterbi derivation and Viterbi n-best semirings calculate the set of all derivations, the best derivation and the n-best derivations respectively. Unlike the previous examples, the × operation of these semirings is not commutative. In general, if the × operation in a semiring is commutative, we refer to it as a commutative semiring; otherwise it is referred to as non-commutative. For precise definitions and detailed descriptions of these semirings see Goodman (1999).
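As an illustration, the semiring interface and a few of these instances can be sketched in Python; the class and variable names below are our own, not part of any parsing library:

```python
# Illustrative sketch: a semiring as (plus, times, zero, one).
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]   # commutative, associative, identity `zero`
    times: Callable[[Any, Any], Any]  # associative, identity `one`, distributes over plus
    zero: Any
    one: Any

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)  # recognition
INSIDE  = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)        # total probability
VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)                       # best derivation

# The same two derivation weights summarized in different semirings:
weights = [0.5, 0.3]
print(INSIDE.plus(*weights))   # 0.8: total weight of both derivations
print(VITERBI.plus(*weights))  # 0.5: weight of the best derivation
```

Swapping the `Semiring` instance, while keeping the parser fixed, is exactly what changes the computed quantity.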

Weighted Logic Programming
A logic program consists of axioms and inference rules that can be applied iteratively to prove theorems. Inference rules are expressed in the form (A_1 … A_k) / B, where A_1, …, A_k are antecedents from which B can be concluded. Axioms are inference rules with no antecedents.
One way to express dynamic programming algorithms such as CKY is as logic programs. This approach takes the point of view of parsing as deduction: terms consist of grammar rules and items in the form of [i, A, j] that correspond to the intermediate entries in the chart. Grammar rules are taken to be axioms, and the description of the parser is given as a set of inference rules. These can have both grammar rules and items as antecedents and an item as the conclusion. A logic program in this form includes a special designated goal item that stands for a successful parse.
Continuing with the example of CKY, consider the procedural description for how to obtain a chart item from smaller chart items if we have the rule A → B C in the grammar: the item [i, A, k] can be built whenever the items [i, B, j] and [j, C, k] have been built for some i < j < k. The corresponding inference rule in a logic program would be:

(A → B C) [i, B, j] [j, C, k] / [i, A, k]

Note that in the inference rule above, A → B C is a rule template with free variables A, B, C. In general, the terms in inference rules can contain free variables; however, for a logic program to describe a valid dynamic programming algorithm, every free variable in the conclusion of an inference rule must appear in its antecedents as well.
A weighted logic program is a logic program where terms are assigned values from a semiring. When paired with semiring operations, inference rules provide the description of how to compute the value of the conclusion given the values of the antecedents. The result of an application of a particular inference rule is the semiring multiplication of the values of all the antecedents. The value of a term B is then calculated as the semiring sum of the values obtained from inference rules that have B as their conclusion.
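A minimal sketch of this computation for the CKY deduction system over a CNF grammar; the function and variable names are our own invention, and any semiring can be plugged in:

```python
# Sketch: semiring-weighted CKY deduction for a grammar in Chomsky normal form.
from collections import defaultdict

def cky(words, unary, binary, plus, times, zero, start="S"):
    """unary: {(A, word): weight} for rules A -> word;
    binary: {(A, B, C): weight} for rules A -> B C.
    Returns the value of the goal item [0, start, len(words)]."""
    n = len(words)
    chart = defaultdict(lambda: zero)  # item [i, A, j] -> semiring value
    for i, w in enumerate(words):      # axiom applications: A -> w_i
        for (A, word), wt in unary.items():
            if word == w:
                chart[(i, A, i + 1)] = plus(chart[(i, A, i + 1)], wt)
    for span in range(2, n + 1):       # rule: (A -> B C) [i,B,j] [j,C,k] / [i,A,k]
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):
                for (A, B, C), wt in binary.items():
                    v = times(times(wt, chart[(i, B, j)]), chart[(j, C, k)])
                    chart[(i, A, k)] = plus(chart[(i, A, k)], v)
    return chart[(0, start, n)]

# Toy grammar: S -> A B (0.5), A -> "a" (1.0), B -> "b" (0.8), probability semiring:
binary = {("S", "A", "B"): 0.5}
unary = {("A", "a"): 1.0, ("B", "b"): 0.8}
print(cky(["a", "b"], unary, binary, lambda x, y: x + y, lambda x, y: x * y, 0.0))  # 0.4
```

Passing `max` for `plus` instead turns the same deduction into a Viterbi parser.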

Semiring Parsing
In the context of parsing, Goodman (1999) presents a framework where a grammar G comes equipped with a function w that maps each rule in G to a semiring value. A grammar derivation E consisting of the successive applications of rules e_1, …, e_n is then defined to have the value

V(E) = w(e_1) × … × w(e_n).

A parser specification is given in the form of a weighted logic program, referred to as an item-based description. From these, the value of a derivation D = b : D_1, …, D_k is calculated recursively as

V(D) = V(D_1) × … × V(D_k), with V(r) = w(r) for a rule r,

where × is the semiring product. Let inner(x) represent the set of all derivation trees headed by the item x. Then the value of x is

V(x) = ⊕_{D ∈ inner(x)} V(D),

where ⊕ is the semiring addition. The value of a sentence is then equal to V(goal).
Given the definitions of value according to the grammar and according to the parser, Goodman (1999) provides a theorem stating conditions for correctness:

Theorem 4.1 (Goodman 1999, Theorem 1; informal). An item-based description I is correct if for every grammar G there exists a one-to-one correspondence between the grammar and item derivations, and these derivations get the same value regardless of the weight function used.
One caveat with calculating based on item-based derivations is that there is an ordering of items: we cannot compute the value of an item unless the values of all its children have already been computed. For this, Goodman (1999) assumes that each item is assigned to a bucket so that if an item b depends on a, then bucket(a) ≤ bucket(b). If a bucket depends on itself, then it is considered a special looping bucket. For all the formulas we present in the main paper we assume that the items belong to non-looping buckets. The formulas for looping buckets are provided in Appendix B.
For an item x, calculating its value might require summing over exponentially many derivation trees.
To address this, it is possible to provide a general formula that efficiently computes the inner value for an item (Goodman 1999, Theorem 2):

V(x) = ⊕_{(a_1 … a_k / x)} V(a_1) × … × V(a_k),

where the sum ranges over all instantiated inference rules with conclusion x. The other important value associated with an item x is its outside value Z(x), which is the sum of the values of derivation trees, modified so that x is removed with all its subtrees. This value is complementary to the inside value V(x) (Goodman 1999, Theorem 4):

⊕_{D ∈ inner(goal)} C(D, x) · V(D) = V(x) × Z(x),

where C(D, x) counts the occurrences of x in D and C(D, x) · V(D) denotes the C(D, x)-fold semiring sum of V(D). Z(x) can likewise be calculated using a recursive formula if the values are from a commutative semiring (Goodman 1999, Theorem 5): Z(goal) = 1 and

Z(x) = ⊕_{(a_1 … a_k / b) : a_j = x} Z(b) × Π_{i ≠ j} V(a_i).
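These scalar inside and outside recursions can be checked numerically on a toy one-non-terminal PCFG; the grammar, weights and helper names below are our own illustrative choices, not taken from the paper:

```python
# Toy PCFG with one non-terminal: S -> "a" (0.7), S -> S S (0.3); sentence "a a a".
p_word, p_split = 0.7, 0.3
n = 3

# Inside values V[i, k] for the item spanning positions i..k (Goodman, Theorem 2).
V = {}
for i in range(n):
    V[(i, i + 1)] = p_word
for span in range(2, n + 1):
    for i in range(n - span + 1):
        k = i + span
        V[(i, k)] = sum(p_split * V[(i, j)] * V[(j, k)] for j in range(i + 1, k))

# Outside values Z[i, k] (Goodman, Theorem 5), computed from larger spans down.
Z = {(0, n): 1.0}
for span in range(n - 1, 0, -1):
    for i in range(n - span + 1):
        k = i + span
        Z[(i, k)] = sum(p_split * Z[(i, m)] * V[(k, m)] for m in range(k + 1, n + 1)) \
                  + sum(p_split * Z[(m, k)] * V[(m, i)] for m in range(0, i))

# Brute-force check of Theorem 4: sum_D C(D, x) * V(D) = V(x) * Z(x).
def derivations(i, k):
    """Yield (value, list of items used) over all derivations of span (i, k)."""
    if k - i == 1:
        yield p_word, [(i, k)]
        return
    for j in range(i + 1, k):
        for vl, il in derivations(i, j):
            for vr, ir in derivations(j, k):
                yield p_split * vl * vr, [(i, k)] + il + ir

x = (1, 2)
total = sum(v * items.count(x) for v, items in derivations(0, n))
assert abs(total - V[x] * Z[x]) < 1e-12
```

The assertion verifies the inside-outside identity for the item spanning positions 1 to 2: both derivations of "a a a" contain it exactly once, and the weighted count equals V(x) × Z(x).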

Tensor Notation
We use the term tensor to refer to an n-dimensional array of semiring values. We use S to denote a semiring and A, B etc. to denote tensors. We write A ∈ S^{a_1 × a_2 × … × a_n} to denote that A is a rank-n tensor of values drawn from S, with the ith rank having dimension a_i. The entry at index k_1, …, k_n will be denoted with subscripts, A_{k_1,…,k_n}.

Latent-variable Parsing as Tensor Weighted Logic Programs
For semiring parsing to work for latent-variable models it should allow weights to be vectors, matrices and tensors. In this section we present a framework that generalizes that of Goodman (1999), and is able to capture tensors over semirings as weights.
Note that this includes scalars as a special case.

Semiring Operations
The main reason why tensors over semirings are not semirings is that with tensor weights, ⊕ and ⊗ become partially defined: not every pair of elements can be added or multiplied anymore. We refer to these structures as partial semirings. With some reasonable constraints, we show that ⊕ and ⊗ obey the semiring axioms in the cases that are relevant for the semiring parsing framework. Let S be the chosen underlying semiring, + and × the semiring operations, and 0 and 1 the additive and multiplicative identities of the semiring respectively. The set of possible weights is {S^{d_1 × … × d_n}} for n ∈ N and d_i ∈ N for all i ≤ n. ⊕ is a partial addition that is defined on two tensors A, B ∈ S^{d_1 × … × d_n} as long as the dimensions of each of their ranks match. The addition is then defined component-wise:

(A ⊕ B)_{k_1,…,k_n} = A_{k_1,…,k_n} + B_{k_1,…,k_n}.

The additive identity is now a class of tensors, one for each unique list of tensor dimensions. The additive identity for any A ∈ S^{d_1 × … × d_n} is the tensor Z ∈ S^{d_1 × … × d_n} with 0 in every entry.
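A sketch of this partial addition with the shape check made explicit; the helper name is ours, and any semiring `plus` can be passed in:

```python
# Partial addition ⊕: defined only when the two tensors have matching dimensions.
import numpy as np

def partial_add(A, B, plus):
    A, B = np.asarray(A, dtype=object), np.asarray(B, dtype=object)
    if A.shape != B.shape:
        raise ValueError(f"⊕ undefined for shapes {A.shape} and {B.shape}")
    out = np.empty(A.shape, dtype=object)
    for idx in np.ndindex(*A.shape):
        out[idx] = plus(A[idx], B[idx])  # component-wise semiring addition
    return out

A = [[1, 2], [3, 4]]
B = [[10, 20], [30, 40]]
print(partial_add(A, B, lambda x, y: x + y))  # component-wise sums
# partial_add(A, [1, 2], max) raises ValueError: the shapes (2, 2) and (2,) differ.
```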
Multiplication is defined as the contraction of an index between two tensors with arbitrary numbers of ranks. Specifically, we consider the family ⊗_{[k;l]}, which contracts the kth rank of the first tensor with the lth rank of the second tensor. This is only defined if the two ranks to be contracted have the same dimension. For A ∈ S^{d_1 × … × d_n} and B ∈ S^{e_1 × … × e_m} with d_k = e_l:

(A ⊗_{[k;l]} B)_{i_1,…,i_{k−1}, j_1,…,j_{l−1}, j_{l+1},…,j_m, i_{k+1},…,i_n} = Σ_{i_k, j_l} δ(i_k, j_l) × A_{i_1,…,i_n} × B_{j_1,…,j_m},

where δ is the identity function that is equal to 1 if i_k = j_l and 0 otherwise. Note that the ranks of B which are not contracted over go in between the ranks of A, replacing where the contracted rank of A was. We will use ⊗_j as a shorthand for ⊗_{[j;1]}, and in cases where k = l = 1, we will omit the subscript on ⊗ altogether.
More generally, we will allow multiplication operations that contract multiple consecutive ranks. A ⊗^r_{[k;l]} B will denote contracting rank k of A with rank l of B, rank k+1 of A with rank l+1 of B, and so forth until rank k+r−1 of A and rank l+r−1 of B. Formally:

(A ⊗^r_{[k;l]} B)_{i_1,…,i_{k−1}, j_1,…,j_{l−1}, j_{l+r},…,j_m, i_{k+r},…,i_n} = Σ_{i_k,…,i_{k+r−1}, j_l,…,j_{l+r−1}} Π_{t=0}^{r−1} δ(i_{k+t}, j_{l+t}) × A_{i_1,…,i_n} × B_{j_1,…,j_m}.

We will use the notation A ⊗* B as a shorthand for A ⊗^r_{[1;1]} B where r is the rank of A, i.e. for contracting all ranks of A with the leading ranks of B. To make the presentation clearer, we will also use the notation X ⊗ [A_1, A_2, …, A_k] to denote the contraction of A_1 with the first rank of X, A_2 with the second, and so forth; for rank-1 arguments this amounts to

X ⊗ [A_1, A_2, …, A_k] = (…((X ⊗ A_1) ⊗ A_2) …) ⊗ A_k.

The multiplicative identity for A ∈ S^{d_1 × … × d_n} and ⊗_k is the identity matrix I ∈ S^{d_k × d_k}, where the diagonal entries are the multiplicative identity of the underlying semiring and the non-diagonal entries are the additive identity. For A ∈ S^{d_1 × … × d_n} and ⊗^r_k, the multiplicative identity is a rank-2r tensor I ∈ S^{d_k × … × d_{k+r−1} × d_k × … × d_{k+r−1}} defined as follows:

I_{i_1,…,i_r, j_1,…,j_r} = Π_{t=1}^{r} δ(i_t, j_t).

Lastly, as the higher-order analogue of the transpose operator, we will define a permutation operator A^π, where π = [π_1, π_2, …, π_r] is a permutation of [1 … r] and r is the rank of A. The π_i-th rank of A^π is equal to the ith rank of A.
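For the real semiring, the basic contraction ⊗_{[k;l]} can be sketched with NumPy; note how the free ranks of B are moved into the position of A's contracted rank, as described above (the helper name is ours):

```python
# The contraction ⊗_[k;l] over the real semiring, sketched with NumPy.
import numpy as np

def contract(A, B, k, l):
    """A ⊗_[k;l] B: sum out rank k of A against rank l of B (1-indexed, as in the text).
    B's remaining ranks are inserted where A's contracted rank was."""
    if A.shape[k - 1] != B.shape[l - 1]:
        raise ValueError("contracted ranks must have equal dimension")
    C = np.tensordot(A, B, axes=([k - 1], [l - 1]))
    # tensordot places B's free ranks last; move them into A's position k.
    nA, nB = A.ndim - 1, B.ndim - 1
    return np.moveaxis(C, range(nA, nA + nB), range(k - 1, k - 1 + nB))

A = np.arange(24.0).reshape(2, 3, 4)   # rank-3 tensor
B = np.arange(15.0).reshape(3, 5)      # rank-2 tensor
C = contract(A, B, 2, 1)               # contract A's rank 2 with B's rank 1
print(C.shape)                         # (2, 5, 4): B's free rank replaces A's rank 2
assert np.allclose(C, np.einsum("ijk,jm->imk", A, B))
```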
The key property of semirings for purposes of efficient calculation of item values is the distributive property. This property also holds for tensors over semirings.
Lemma 5.1. For any k, l, ⊗ [k;l] distributes over ⊕ A proof can be found in Appendix A.

Grammar Derivations
For a grammar G with a function w that provides a mapping from rules to tensor weights, we will define the value of a derivation via the derivation tree:

Definition 2. Given a grammar G and a weight function w, the value of a derivation tree T = r : T_1, …, T_k is:

V^w_G(T) = w(r) ⊗ [V^w_G(T_1), …, V^w_G(T_k)],

with V^w_G(T) = w(r) when T consists of the single rule r. Note that there is no guarantee that this equation is defined for an arbitrary w. We will call a weight function w well defined for a grammar G if V^w_G(T) is defined for all valid derivation trees T in G. For CFGs there is a straightforward method to ensure that w is well defined:

Lemma 5.2. A set of weights w for a given CFG is well defined if there exist consistent dimensions d_i for each non-terminal A_i such that for all grammar rules R : A_n → α_1 A_1 α_2 A_2 … A_{n−1} α_n, where the α_i are (possibly empty) terminal strings, w(R) ∈ S^{d_1 × … × d_n}.

The proof is given together with Lemma 5.3. Note that if a weight function for a CFG is well defined, then the rank of the weights of rules with no non-terminals on their rhs is always 1.
Given a grammar derivation tree T, let us call the list of derivation rules E : R_1, R_2, …, R_n appearing in T, ordered in a depth-first, left-to-right manner, a grammar derivation string.

Definition 3. Given a CFG with tensor weights w, the value of a grammar derivation string is defined as:

V^w_G(E) = w(R_1) ⊗ w(R_2) ⊗ … ⊗ w(R_n),

where the application of ⊗ proceeds from left to right as is standard.
For semirings, since the bracketing does not affect the final value of an expression, it is straightforward to show that the value of a grammar derivation tree corresponds to that of a grammar derivation string. With tensors over semirings this might fail for an arbitrary formalism F, and in general we require the value of a derivation to be calculated with the bracketing induced by the derivation tree. However, for the special case of CFGs, the value of the grammar derivation tree and the value of its corresponding grammar derivation string are always equal. This means that for the computation of the value of the derivation, it is possible to replace the bracketing induced by the derivation tree with left-to-right bracketing without affecting the final value. Figure 1 demonstrates the calculation of the value of the tree and of the string for the same derivation, together with how the tensor dimensions of the intermediate results evolve with each step of the calculation.
Lemma 5.3. Given a CFG G and a weight function w that fulfills the condition in Lemma 5.2, then w is well defined and V w G (T ) = V w G (E) for any grammar derivation tree T and corresponding grammar derivation string E.
Proof. We will proceed by induction on the derivation tree. If T consists of only one rule r, then V^w_G(T) = V^w_G(E). Furthermore, r does not have any non-terminals on its rhs, so V^w_G(T) ∈ S^{d_0}, with S^{d_0} corresponding to the lhs non-terminal of r.

Otherwise, T has a root node labeled r and direct subtrees T_1, …, T_k. Because w fulfills the condition in Lemma 5.2, w(r) ∈ S^{d_1 × … × d_k × d_0} for some d_i, where S^{d_0} is the space corresponding to the non-terminal on the lhs of r, and S^{d_i} is the space corresponding to the ith non-terminal appearing on the rhs of r for i = 1, …, k. To complete the proof, it then suffices to show that V^w_G(T_i) ∈ S^{d_i} for all subtrees T_i. This already holds for the base case. For each subtree T_i headed by a rule R_i, the induction hypothesis gives V^w_G(T_i) ∈ S^{d_{i_0}}, where S^{d_{i_0}} is the space corresponding to the non-terminal on the lhs of R_i. For the derivation to be valid, this non-terminal needs to match the ith non-terminal on the rhs of r, hence S^{d_{i_0}} = S^{d_i}.
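The equality of tree and string bracketings in Lemma 5.3 can be checked numerically for a one-level tree over the real semiring; the dimensions and random weights below are arbitrary illustrative choices:

```python
# Tree bracketing vs. left-to-right string bracketing for the derivation
# A -> B C, B -> b, C -> c, over the real semiring.
import numpy as np

rng = np.random.default_rng(0)
d_A, d_B, d_C = 2, 3, 4
w_rule = rng.random((d_B, d_C, d_A))  # w(A -> B C) ∈ S^{d_B × d_C × d_A}
w_b = rng.random(d_B)                 # w(B -> b): rank 1, no rhs non-terminals
w_c = rng.random(d_C)                 # w(C -> c)

# Tree bracketing: w(A -> B C) ⊗ [w(B -> b), w(C -> c)]
v_tree = np.einsum("ijk,i,j->k", w_rule, w_b, w_c)

# Derivation string (depth-first, left-to-right): R1 = A->BC, R2 = B->b, R3 = C->c,
# evaluated strictly left to right: (w(R1) ⊗ w(R2)) ⊗ w(R3)
step1 = np.einsum("ijk,i->jk", w_rule, w_b)
v_string = np.einsum("jk,j->k", step1, w_c)

assert np.allclose(v_tree, v_string)  # both lie in S^{d_A}
print(v_tree.shape)                   # (2,)
```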

Item-based Descriptions
Item-based descriptions are formal descriptions of various parsers for context-free grammars. Item-based descriptions consist of a set of deduction rules of the form

T_1 … T_k / Q   with side conditions P_1 … P_j,

where the upper case letters stand either for grammar rule templates (e.g. if T_1 : A → B C, then any non-terminals from the grammar can be substituted for A, B, C) or for items. T_1 … T_k are referred to as antecedents, Q as the conclusion, and P_1 … P_j are side conditions that the parser requires in order to execute the rule, but whose values it does not use. Items correspond to chart elements in procedural descriptions of parsers, and are placeholders for intermediate results which can be combined to obtain the final result. The item-based description also provides a special goal item which is variable-free, and which does not occur as a condition of any other inference rule.
Definition 4. Given a grammar G and an itembased description I, a valid item derivation tree is defined as follows: • For all r ∈ G, r is an item derivation tree.
• If D_{a_1}, …, D_{a_k} and D_{c_1}, …, D_{c_j} are derivation trees headed by a_1, …, a_k and c_1, …, c_j respectively, and a_1 … a_k / b with side conditions c_1, …, c_j is the instantiation of a deduction rule in I, then b : D_{a_1}, …, D_{a_k} is also an item derivation tree.
inner_σ(x) denotes the set of all trees headed by x that occur in parses of σ. Formally, D ∈ inner_σ(x) if D is headed by x and is a subtree of some D′ ∈ D_{I(G)}(σ). The value of an item derivation tree is calculated similarly to that of a grammar tree:

V^w_{I(G)}(b : D_1, …, D_k) = V^w_{I(G)}(D_1) ⊗ [V^w_{I(G)}(D_2), …, V^w_{I(G)}(D_k)], with V^w_{I(G)}(r) = w(r) for a rule r.

Notice that unlike in the definition from Goodman (1999), the first antecedent in the inference rule has a special role in the calculation. Intuitively, our framework treats the value of the first antecedent as a function, and the trailing ones as its arguments. The interaction between the trailing antecedents is thus moderated through the value of the first antecedent. (See the derivation in Figure 1 and the item-based description of CKY in Figure 3.)

Definition 5. If for any σ ∈ L(G) and any T, T′ ∈ inner_σ(x), V^w_{I(G)}(T) and V^w_{I(G)}(T′) are defined and dim(V^w_{I(G)}(T)) = dim(V^w_{I(G)}(T′)), then the weights w are well defined.
Given an item-based description I, a grammar G, a well defined weight function w and a target sentence σ, the value of an item x is defined to be the sum of the values of all its possible derivations. Formally:

V^w_{I(G)}(x) = ⊕_{D ∈ inner_σ(x)} V^w_{I(G)}(D).

Definition 6. For a given grammar G and item-based description I, the value of a sentence σ is equal to the value of the goal item which spans σ: V^w_{I(G)}(σ) = V^w_{I(G)}(goal).

Definition 7. An item-based description is correct if for all grammars G, complete semirings S, well defined weight functions w and sentences σ, V^w_{I(G)}(σ) = V^w_G(σ).

Now we are ready to state the analogue of Theorem 4.1. Let us introduce a special symbol ⊥ and extend V^w_G and V^w_{I(G)} to any weight function w so that if w is not well defined for G, then V^w_G(σ) = ⊥, and likewise for V^w_{I(G)}.

Theorem 5.4. An item-based description I is correct if
• For every grammar G, the mapping g : D_{I(G)} → D_G that maps each d ∈ D_{I(G)} to the corresponding derivation in D_G is a bijection with an inverse function f.
• For any complete semiring S and weight function w, g and f preserve the values assigned to a derivation: V^w_{I(G)}(d) = V^w_G(g(d)) for all d ∈ D_{I(G)}.

The proof proceeds similarly to that in Goodman (1999) and can be found in Appendix A.

Inside and Outside Calculations
In the following, we will omit the sentence σ from inner_σ(x) and refer to this as inner(x). Let inner(a_1 … a_k / x) be the set of derivation trees where the root node is x and the direct children of x are a_1, …, a_k.
For efficient computation of this value, we will assume that there is a partial order b on the items so that if the item y depends on x, then b(x) ≤ b(y).

Theorem 6.1. For every item x,

V(x) = ⊕_{(a_1 … a_k / x)} V(a_1) ⊗ [V(a_2), …, V(a_k)].

The proof uses the distributive property and follows that of Goodman (1999). It can be found in Appendix A.
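A sketch of this inner-value computation for a toy latent-variable grammar in CNF over the real semiring; the grammar, latent dimensions and function names are our own illustrative choices:

```python
# Inside (inner) vectors for a toy latent-variable CNF grammar.
import numpy as np

dims = {"S": 2, "A": 3, "B": 3}   # latent dimension per non-terminal
rng = np.random.default_rng(1)
binary = {("S", "A", "B"): rng.random((dims["A"], dims["B"], dims["S"]))}  # w(S -> A B)
unary = {("A", "a"): rng.random(dims["A"]),   # w(A -> a): rank-1 weights
         ("B", "b"): rng.random(dims["B"])}   # w(B -> b)

def inner_values(words):
    """Inside vectors V[i, X, j] for each item, filled in bucket (span) order."""
    n = len(words)
    V = {}
    for i, w in enumerate(words):
        for (X, word), wt in unary.items():
            if word == w:
                V[(i, X, i + 1)] = V.get((i, X, i + 1), 0) + wt
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):
                for (X, Y, Z), wt in binary.items():
                    if (i, Y, j) in V and (j, Z, k) in V:
                        # w(X -> Y Z) ⊗ [V_Y, V_Z]: contract both child ranks
                        contrib = np.einsum("yzx,y,z->x", wt, V[(i, Y, j)], V[(j, Z, k)])
                        V[(i, X, k)] = V.get((i, X, k), 0) + contrib
    return V

V = inner_values(["a", "b"])
goal = V[(0, "S", 2)]
print(goal.shape)   # (2,): a vector of inside values, one per latent state of S
```

The goal value is a vector rather than a scalar; contracting it with a root distribution over the latent states of S would yield the sentence probability of an L-PCFG.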
For the notion of the value of a derivation to extend to outside trees, some modifications are needed. This is because an outside tree will have one subtree b : A_1, …, A_n such that V(A_1) ⊗ [V(A_2), …, V(A_n)] is potentially not defined, since one of the subtrees A_k will be missing. Note that the missing A_k will be headed by an item. We will say that a tree T ∈ outer(x) if T can be obtained by taking a tree T′ headed by the goal item and removing one of its subtrees headed by the item x. The outer value Z(T_k) is defined recursively as follows. If T_k is headed by the goal item, then Z(T_k) = I_{d_S}. Otherwise, it has a direct parent tree T = b : T_0, T_1, …, T_k, …, T_n, and

Z(T_k) = ((V(T_0) ⊗_k [I_{T_k × d_S}, V(T_{k+1}), …, V(T_n)])^π ⊗ [V(T_1), …, V(T_{k−1})]) ⊗* Z(T),

where I_{T_k × d_S} is the identity tensor for the space S^{d_1 × … × d_i × d_S}, V(T_k) ∈ S^{d_1 × … × d_i}, and d_S is the dimension assigned to the start symbol S. The permutation π is defined as follows:

π = [1, 2, …, i, j+1, j+2, …, n, i+1, i+2, …, j].

To understand the function of π it is useful to consider the dimensions of the term before and after it is applied. Let the term V(T_0) ⊗_k [I_{T_k × d_S}, V(T_{k+1}), …, V(T_n)] have dimensions

e_1, …, e_{k−1}, d_1, …, d_i, d_S, d_n, …, d_m.

Here e_1, …, e_{k−1} are the dimensions that will be contracted with V(T_1), …, V(T_{k−1}) by the second multiplication operation, and d_n, …, d_m are the dimensions that were either introduced by the contraction with V(T_{k+1}), …, V(T_n) or were trailing dimensions of V(T_0). The dimensions d_1, …, d_i, d_S in the middle are the result of the contraction with I_{T_k × d_S}; unlike in the original definition of I, the dimension e_k is missing from the beginning of this sequence, since it got used up during the contraction operation. What the permutation does is move this section of the dimensions introduced by I to the very end, so that the dimensions become

e_1, …, e_{k−1}, d_n, …, d_m, d_1, …, d_i, d_S.

Note that this has no effect on the next contraction with V(T_1), …, V(T_{k−1}), since the first k−1 ranks are left in place. Changing the order of the ranks, however, allows the last contraction with Z(T) to be well defined.
Lemma 6.2. Let V and Z be defined over a commutative semiring S, and let O ∈ outer_σ(x) and T ∈ inner_σ(x). If combining O and T in the obvious way results in the complete derivation D, then

V(D) = V(T) ⊗* Z(O).

Proof sketch. We proceed by induction. If x is the goal item, then T = D and O is empty, and the claim is immediate. Otherwise T has a parent tree T_p = y : T_0, …, T_n with T = T_k. Furthermore, T_p ∈ inner_σ(y), O_p ∈ outer_σ(y), and by the induction hypothesis V(D) = V(T_p) ⊗* Z(O_p). Since T_p ∈ inner_σ(y), we know that V(T_p) = V(T_0) ⊗ [V(T_1), …, V(T_n)]. The proof progresses by calculating the value of [V(D)]_i based on this term and showing that it is equal to the value of [V(T) ⊗* Z(O)]_i. The full proof can be found in Appendix A.
In the general case, Goodman (1999) defines the reverse value of x as the sum of the values of all its outer trees. Analogously, we define

Z(x) = ⊕_{O ∈ outer_σ(x)} Z(O).

We will see that for a well defined weight function w, any O ∈ outer_σ(x) is assigned a value with dimensions d_1 × … × d_n × d_S, where d_S is the dimension assigned to the start symbol S, and d_1, …, d_n are the dimensions of the values in inner_σ(x).

Lemma 6.3. Let C(D, x) represent the number of times x occurs in a derivation D. Then

V(x) ⊗* Z(x) = ⊕_{D ∈ D_{I(G)}(σ)} C(D, x) · V(D),

where C(D, x) · V(D) denotes the C(D, x)-fold semiring sum of V(D). For an item x, any O ∈ outer(x) and T ∈ inner(x) can be combined to form a successful derivation tree containing x, and the number C(D, x) corresponds exactly to the number of such pairs yielding D. Hence, by Lemma 6.2 and distributivity,

V(x) ⊗* Z(x) = ⊕_{T ∈ inner(x)} ⊕_{O ∈ outer(x)} V(T) ⊗* Z(O) = ⊕_{D} C(D, x) · V(D).

Now we are ready to state how to calculate the outside value of an item. Following Goodman (1999), we extend the notation for the set of outer trees and write outer(k, (a_1 … a_n)/b) ⊆ outer(a_k) for the subset of the outer trees in outer(a_k) where a_k has parent b and siblings a_i. In other words, this is the set of all outer trees where the rule from which a_k is removed is (a_1 … a_n)/b.
Theorem 6.4. If x is the goal item, then Z(x) = I_{d_S}. Otherwise, the set outer(x) can be written as the union of the sets outer_{k, a_1…a_n/b} over each rule a_1 … a_n / b with a_k = x for some k. Hence:

Z(x) = ⊕_{k, a_1 … a_n / b : a_k = x} ⊕_{D ∈ outer_{k, a_1…a_n/b}} Z(D)
Using the distributive property of the partial semiring, the summations in the inner part of the equation can be pushed inside, so that each factor becomes the corresponding value V(a_i) or Z(b). Replacing the inner part of the previous equation with the resulting term gives the desired equality.
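The outside recursion can be sketched with numpy over the real semiring for the single rule S → A B, with a made-up weight tensor W[s, a, b]. The outer value of A keeps one latent axis for A and one for the start symbol, matching the dimensions stated earlier, and recombining with the inner value of A recovers the inside value of S:

```python
import numpy as np

# Latent-variable outside computation on the toy rule S -> A B.
rng = np.random.default_rng(3)
dS, dA, dB = 2, 3, 4
W = rng.random((dS, dA, dB))                  # made-up rule weight tensor
vA, vB = rng.random(dA), rng.random(dB)       # inner values of A and B

V_S = np.einsum('sab,a,b->s', W, vA, vB)      # inside value of S

Z_S = np.eye(dS)                              # outer value of the goal item
Z_A = np.einsum('sab,b,st->at', W, vB, Z_S)   # remove A from S -> A B

recombined = np.einsum('a,at->t', vA, Z_A)    # V(A) (x)* Z(A)
assert np.allclose(recombined, V_S)
```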

Conclusion
We have presented a general extension of the semiring parsing framework in which the weights of grammar rules are tensors of semiring values, with the motivation of extending the framework to latent-variable models. We hope that this work will enable the streamlined development of EM-based or spectral learning algorithms for latent refinements of a number of grammar formalisms.
Appendix A - Proofs
First, note that for the left-hand side of the equation to be defined, B and C need to be of matching ranks, and B ⊕ C will have the same rank as both B and C. Therefore, if the left-hand side is well defined, then both A ⊗_{[k;l]} B and A ⊗_{[k;l]} C are defined and have matching ranks. So the right-hand side is defined if and only if the left-hand side is defined as well.
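A concrete real-semiring instance of such a rank-checked partial product, with the bracket [k; l] specialized to contracting one trailing axis against one leading axis — a sketch, not the paper's full definition:

```python
import numpy as np

# A partial product: defined only when the shapes line up, in which case it
# contracts the trailing axis of A with the leading axis of B.
def otimes(A, B):
    if A.shape[-1] != B.shape[0]:
        raise ValueError('undefined: ranks/dimensions do not match')
    return np.tensordot(A, B, axes=([A.ndim - 1], [0]))

rng = np.random.default_rng(1)
A = rng.random((2, 3))
B = rng.random((3, 4))
C = rng.random((3, 4))

# Distributivity holds whenever both sides are defined:
lhs = otimes(A, B + C)
rhs = otimes(A, B) + otimes(A, C)
assert np.allclose(lhs, rhs)
```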
• For any complete semiring S and weight function w, g and f preserve the values assigned to a derivation.
Proof.
Observe that D ∈ D_{I(G)}(α) iff g(D) ∈ D_G(α), since the rules that appear at the leaves of D, applied from left to right, determine the grammar derivation tree g(D) uniquely via g, and vice versa. Hence the equality follows.
Theorem 6.1.
V(x) =
where the last step holds due to the distributive property of the partial semiring.
Since the set inner(x) = ∪_i D_i, where D_i ∈ inner(a_1 … a_k / x) for each inference rule a_1 … a_k / x, we can write the summation over D ∈ inner(x) as a double summation over the inference rules and over the D_i; the last line is then obtained by replacing the inner part of the expression with the equality obtained in the previous part of the proof.
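For one item x with two inference rules, the inside computation has the following shape in numpy over the real semiring (all weight tensors are made-up):

```python
import numpy as np

# Latent-variable inside value for an item x with rules  A B / x  and  C / x,
# mirroring V(x) = sum over rules of w(rule) contracted with the antecedents.
rng = np.random.default_rng(2)
d = 3                                   # latent dimension for every item

vA, vB, vC = (rng.random(d) for _ in range(3))
W_AB = rng.random((d, d, d))            # weight of rule  A B / x
W_C = rng.random((d, d))                # weight of rule  C / x

V_x = np.einsum('xab,a,b->x', W_AB, vA, vB) + np.einsum('xc,c->x', W_C, vC)

# The same value computed by explicit loops, as a correctness check:
manual = np.zeros(d)
for x in range(d):
    manual[x] = sum(W_AB[x, a, b] * vA[a] * vB[b]
                    for a in range(d) for b in range(d))
    manual[x] += sum(W_C[x, c] * vC[c] for c in range(d))

assert np.allclose(V_x, manual)
```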
Lemma 6.2. Let V and Z be defined on a commutative semiring S and let O ∈ outer_α(x) and T ∈ inner_α(x). If combining O and T in the obvious way results in the complete derivation D, then V(D) = V(T) ⊗* Z(O).
Proof. To simplify the notation of the indices, let i stand for a list of indices i_1, …, i_n for some n. We will also use d_i to denote a list d_{i_1}, …, d_{i_n}, and d to denote d_1, …, d_n. We write δ(i, j) = ∏_{k=1}^{n} δ(i_k, j_k). We proceed by induction on the parse tree. The base case is x = goal, where T = D and O is empty. Otherwise T has a parent tree T_p = y : T_1, …, T_n, where T = T_k. Furthermore, T_p ∈ inner_α(y), O_p ∈ outer_α(y), and the induction hypothesis applies to the pair (T_p, O_p). Since T_p ∈ inner_α(y), its value can be expanded in terms of V(T_1), …, V(T_n). The proof proceeds by calculating the value of [V(D)]_i based on this term and showing that it is equal to [V(T) ⊗* Z(O)]_i. We then prove that this term is equal to V(T_k) ⊗* Z(O). Let I_{T_k} ∈ S^{e_k×d_k×s×e_k×d_k×s}. We calculate the value of the outside term in sections; combining them completes the proof. The last simplification step is obtained by replacing ê_k with e_k, d̂_k with d_k, and ŝ with s, since these must be equal for any term to contribute to the final sum. The commutativity of S then allows V(T_k)_{e_k, d_k} to be moved to its place in the sequence.
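The product-of-deltas identity tensor used in the proof behaves as expected numerically: over the real semiring, contracting any tensor against I returns it unchanged (toy sizes):

```python
import numpy as np

# I[a, b, c, i, j, k] = delta(a, i) * delta(b, j) * delta(c, k): the identity
# tensor on the space of shape (e, d, s).
e, d, s = 2, 3, 4
I = np.einsum('ai,bj,ck->abcijk', np.eye(e), np.eye(d), np.eye(s))

rng = np.random.default_rng(4)
T = rng.random((e, d, s))

# Contracting T against the first half of I's axes is a no-op:
out = np.tensordot(T, I, axes=([0, 1, 2], [0, 1, 2]))
assert np.allclose(out, T)
```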
Theorem 6.4. If x is the goal item, then Z(x) = I_{d_S}. Otherwise, the outer trees outer(x) can be written as the union of the sets outer_{k, a_1…a_n/b} over each rule a_1 … a_n / b with a_k = x for some k. Hence:

Z(x) = ⊕_{k, a_1 … a_n / b : a_k = x} ⊕_{D ∈ outer_{k, a_1…a_n/b}} Z(D)
For the inner part of this equation we have a summation over D_{a_{k+1}} ∈ inner(a_{k+1}), …, D_{a_n} ∈ inner(a_n) and D_b ∈ outer(b). Since ⊕ distributes over ⊗, each of these summations can be pushed inside the corresponding factor. And since V(a_i) and Z(b) are defined as the summations over their inner and outer trees respectively, the term becomes (w ⊗_k [I, V(a_{k+1}), …, V(a_n)])^π ⊗ [V(a_1), …, V(a_{k−1})] ⊗* Z(b), where w is the weight of the rule a_1 … a_n / b. Replacing the inner part of the previous equation with this term gives us the desired equality, completing the proof.

Appendix B -Inside and Outside Calculations for Looping Buckets
In computing the inside and outside values with an item-based description, we assume a pre-computed ordering over items in the form of buckets. For items x and y, we write bucket(x) ≤ bucket(y) if the value of y depends on the value of x. So far we have assumed that items could be simply sorted so that no item directly or indirectly depends on itself, and given the inside and outside formulas accordingly. In this section we give the equivalent formulas for items in looping buckets. Items in a looping bucket depend on each other and computing their values might require an infinite sum. Our presentation and proofs both follow that of Goodman (1998).
For an item x in a looping bucket B, let the generation of a derivation tree of x be the maximum number of items in B that can appear on a single path from the root to a leaf. This intuitively provides an ordering for processing a potentially infinite number of trees, starting from generation 0 and incrementally adding larger and larger trees. We denote the set of inner trees of x with generation at most g by inner_{≤g}(x, B). Adding up the values of all inner trees of x that have generation at most g then gives an approximation of the true inner value of x, and the approximation improves as g grows. Formally, we define the g-generation value V_{≤g}(x, B) for an item x in bucket B as the sum of the values of the trees in inner_{≤g}(x, B). For ω-continuous semirings, the infinite sum is equal to the supremum of the partial sums (Kuich 1997, 613), hence (Goodman 1999, 589): V(x) = sup_g V_{≤g}(x, B). Fortunately, tensors of semiring values with fixed dimensions are ω-continuous as long as the underlying semiring is ω-continuous. We give the necessary definitions to establish this property: Definition 8. (Kuich 1997, 611) A semiring is naturally ordered if there is a partial ordering ⊑ such that x ⊑ y iff there is a z s.t. x ⊕ z = y.
Definition 9. (Kuich 1997, 612) A naturally ordered complete semiring is ω-continuous if for any sequence x_1, x_2, … and any constant y, if ⊕_{0≤i≤n} x_i ⊑ y for all n, then ⊕_i x_i ⊑ y. Notice that for the set of tensors in S^d, where d is an arbitrary list of positive integers, a natural ordering of the underlying semiring extends straightforwardly to S^d by the following rule: X ⊑ Y iff X_i ⊑ Y_i for all indices i. It is straightforward to check that if the underlying semiring is ω-continuous, then S^d is ω-continuous as well.
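Over the nonnegative reals with ⊕ = +, the natural order is the usual ≤, and its elementwise extension to tensors of a fixed shape can be sketched as:

```python
import numpy as np

# x below y iff there is a z >= 0 with x + z = y; extended to tensors
# elementwise, exactly as in the rule X below Y iff X_i below Y_i for all i.
def leq(X, Y):
    return bool(np.all(X <= Y))

X = np.array([[0.1, 0.2], [0.0, 0.5]])
Z = np.array([[0.0, 0.3], [0.1, 0.0]])   # the witness z
Y = X + Z

assert leq(X, Y)
assert not leq(Y + 1.0, Y)
```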
Goodman (1999) gives a formula for V_{≤g}(x, B) in order to compute or approximate the supremum. Below we give the analogous formula for partial semirings. Note that if a_i is not in the bucket B, then V_{≤g−1}(a_i, B) = V(a_i); hence V_{≤g−1}(a_i, B) can be replaced with K_g(a_i, B), completing the proof.
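The generation-by-generation computation amounts to a fixed-point iteration. Here is a sketch for a single self-looping item with latent dimension 2, over the real semiring; the (made-up) loop matrix M has spectral radius below 1, so the g-generation values climb to the supremum V = base + M V:

```python
import numpy as np

# Looping-bucket iteration with latent dimensions: each pass adds one more
# generation of inner trees to the item's value.
base = np.array([0.2, 0.1])          # contribution of non-looping trees
M = np.array([[0.3, 0.1],
              [0.2, 0.4]])           # self-loop weight; spectral radius < 1

V = np.zeros(2)                      # generation-0 value
for g in range(200):
    V = base + M @ V                 # add one more generation

# The supremum solves V = base + M V, i.e. (I - M) V = base:
assert np.allclose(V, np.linalg.solve(np.eye(2) - M, base))
```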
We follow a similar strategy for computing the outside values of items that belong to a looping bucket. The only change is a slight difference in the definition of the generation of a tree. If D ∈ outer(x) where x belongs to a looping bucket B, then the generation of D is the maximum number of items in B that can appear on a single path from the root to x, where x is included in the count. The g-generation outside value Z_{≤g}(x, B) is then defined analogously, where π is defined as in Theorem 6.4. As in the inner case, note that for an item b not in the looping bucket B, Z_{≤g−1}(b, B) = Z(b); hence we can replace Z_{≤g−1}(b, B) with H_g(b, B), completing the proof.