K-best Iterative Viterbi Parsing

This paper presents an efficient and optimal parsing algorithm for probabilistic context-free grammars (PCFGs). To achieve faster parsing, our proposal employs a pruning technique that removes unnecessary edges from the search space. The key idea is to repeatedly conduct Viterbi inside and outside parsing, while gradually expanding the search space, to efficiently compute the heuristic bounds used for pruning. Our experimental results on the English Penn Treebank corpus show that the proposed algorithm is faster than the standard CKY parsing algorithm. In addition, we show how to extend this algorithm to extract k-best Viterbi parse trees.


Introduction
The CKY or Viterbi inside algorithm is a well-known algorithm for PCFG parsing (Jurafsky and Martin, 2000); it is a dynamic programming parser that uses a chart table to calculate the Viterbi tree. This algorithm is commonly used in natural language parsing, but when the size of the grammar is extremely large, exhaustive parsing becomes impractical. One way to reduce the computational cost of PCFG parsing is to prune the edges produced during parsing. In fact, modern parsers often employ pruning techniques such as beam search (Ratnaparkhi, 1999) and coarse-to-fine search (Charniak et al., 2006).
Despite their practical success, both pruning methods are approximate, so the solution of the parser is not always optimal, i.e., the parser does not always output the Viterbi tree. Recently, another line of work has explored A* search algorithms, in which simpler problems are used to estimate heuristic scores for prioritizing edges to be processed during parsing (Klein and Manning, 2003). If the heuristic is consistent, A* parsing always outputs the Viterbi tree. As Tsuruoka and Tsujii (2004) mentioned, however, A* parsing has a serious difficulty from an implementation point of view: "One of the most efficient way to implement an agenda, which keeps edges to be processed in A* parsing, is to use a priority queue, which requires a computational cost of O(log(n)) at each action, where n is the number of edges in the agenda. The cost of O(log(n)) makes it difficult to build a fast parser by the A* algorithm." This paper presents an alternative way of pruning unnecessary edges while keeping the optimality of the parser. We call this algorithm iterative Viterbi parsing (IVP) because the iterative process plays a central role in our proposal. The IVP algorithm repeatedly conducts Viterbi inside and outside parsing, while gradually expanding the search space, to efficiently compute the lower and upper bounds used for pruning. IVP is easy to implement and is much faster in practice than the standard CKY parsing algorithm.
In addition, we show how to extend the IVP algorithm to extract K-best Viterbi parse trees. The idea is to integrate Huang and Chiang (2005)'s K-best algorithm 3, called Lazy, with the iterative parsing process. Lazy performs a Viterbi inside pass and then extracts K-best lists in a top-down manner. Since the exhaustive Viterbi inside pass in particular is a bottleneck of the Lazy algorithm, the K-best IVP algorithm avoids much of this work, just as in the 1-best case.

Iterative Viterbi Parsing
Following Pauls and Klein (2009), we define some notation. The IVP algorithm takes as input a PCFG G and a sentence x consisting of terminal symbols t_0 ... t_{n-1}. Without loss of generality, we assume Chomsky normal form: each non-terminal rule r in G has the form r = A → B C with log-probability weight log q(r), where A, B and C are elements of N, the set of non-terminal symbols. Chart edges are labeled spans e = (A, i, j). Inside derivations of an edge e = (A, i, j) are trees rooted at A and spanning t_i ... t_{j-1}. The score of a derivation d is denoted by s(d).^1 The score of the best (maximum) inside derivation for an edge e is called the Viterbi inside score β(e). The goal of 1-best PCFG parsing is to compute the Viterbi inside score of the goal edge (TOP, 0, n), where TOP is a special root symbol. We call a derivation of the goal edge a goal derivation. The score of the best derivation of TOP → t_0 ... t_{i-1} A t_j ... t_{n-1} is called the Viterbi outside score α(e).
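To make the notation concrete, here is a minimal CKY sketch that computes Viterbi inside scores β(e) for a toy CNF grammar. The grammar, rule encoding, and function names are our own illustrative choices, not the paper's implementation:

```python
import math
from collections import defaultdict

# Toy CNF PCFG with log-probability weights log q(r) (illustrative only).
binary = {            # (A, B, C) -> log q(A -> B C)
    ("TOP", "NP", "VP"): math.log(1.0),
    ("NP", "DT", "NN"): math.log(1.0),
    ("VP", "VB", "NP"): math.log(1.0),
}
lexical = {           # (A, t) -> log q(A -> t)
    ("DT", "the"): math.log(0.5), ("DT", "a"): math.log(0.5),
    ("NN", "dog"): math.log(0.5), ("NN", "cat"): math.log(0.5),
    ("VB", "saw"): math.log(1.0),
}

def viterbi_inside(sentence):
    """Fill beta[(A, i, j)], the Viterbi inside score of edge (A, i, j)."""
    n = len(sentence)
    beta = defaultdict(lambda: -math.inf)
    for i, t in enumerate(sentence):          # width-1 spans: lexical rules
        for (A, w), lp in lexical.items():
            if w == t and lp > beta[(A, i, i + 1)]:
                beta[(A, i, i + 1)] = lp
    for width in range(2, n + 1):             # wider spans: binary rules
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):         # split point
                for (A, B, C), lp in binary.items():
                    s = lp + beta[(B, i, k)] + beta[(C, k, j)]
                    if s > beta[(A, i, j)]:
                        beta[(A, i, j)] = s
    return beta

sent = "the dog saw a cat".split()
beta = viterbi_inside(sent)
print(beta[("TOP", 0, len(sent))])    # the goal edge's Viterbi inside score
```

For this toy sentence the best goal derivation uses four lexical rules of weight log 0.5, so the goal edge's score is 4 · log 0.5.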
We assume N = {A, B, C, D}. By grouping several symbols in the same cell of the chart table, we can make a smaller table than the original one. While the original chart table in Figure 1 (a) contains non-terminal symbols only, the chart table in Figure 1 (b) also contains shrinkage symbols, each of which groups several non-terminal symbols. In this paper, to make shrinkage symbols, we use the hierarchical clustering of non-terminal symbols defined in (Charniak et al., 2006). Figure 2 shows a part of the hierarchical symbol definition. Formally, we hierarchically cluster N into m + 1 layers N_0, ..., N_m, where N_m = N, and we define a mapping π_{i→j} : N_i → P(N_j), where P(·) is the power set of its argument. Taking the symbol HP in Figure 2 as an example, π_{0→1}(HP) = {S, N}. When i = j, for an i-th layer shrinkage symbol A ∈ N_i, π_{i→j}(A) returns the singleton {A}. For all 0 ≤ i, j, k ≤ m, the rule parameter associated with symbols X_i ∈ N_i, X_j ∈ N_j, X_k ∈ N_k is defined as follows:

  log q(X_i → X_j X_k) = max { log q(A → B C) : A ∈ π_{i→m}(X_i), B ∈ π_{j→m}(X_j), C ∈ π_{k→m}(X_k) }

By this construction, each derivation in a coarse chart gives an upper bound on the score of its corresponding derivation in the original chart (Klein and Manning, 2003), and we obtain the following lemma:

Lemma 1. If the best goal derivation d̂ in the coarse chart does not include any shrinkage symbol, it is equivalent to the best goal derivation in the original chart.

Algorithm 1: iterative Viterbi parsing (only lines 5-11 are recoverable from the source):
  5:  if d̂ consists of non-terminals only then
  6:      return d̂
  7:  if lb < best(chart) then
  8:      lb ← best(chart)
  9:  expand-chart(chart, d̂, G)
  10: Viterbi-outside(chart)
  11: prune-chart(chart, lb)

^1 The score of a derivation is the sum of the rule weights for all rules used in the derivation.
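As an illustration of the max-projection construction above, the following sketch (our own, not the authors' code) derives coarse rule weights by maximizing over all original rules that project onto the same coarse rule; the projection map and weights are hypothetical, and for brevity we use a single flat projection instead of the paper's m + 1 layers:

```python
import math
from collections import defaultdict

# Hypothetical projection from original non-terminals to coarse
# shrinkage symbols (names are illustrative, not from the paper).
project = {"S": "HP", "NP": "HP", "VP": "VP0", "PP": "VP0"}

# Original binary rules with log-probability weights.
fine_rules = {
    ("S", "NP", "VP"): math.log(0.9),
    ("NP", "NP", "PP"): math.log(0.2),
    ("VP", "VP", "PP"): math.log(0.3),
}

def coarse_rules(fine_rules, project):
    """Max-projection: a coarse rule's weight is the max over all fine
    rules projecting onto it, so every coarse derivation's score
    upper-bounds the scores of its fine counterparts."""
    coarse = defaultdict(lambda: -math.inf)
    for (A, B, C), lp in fine_rules.items():
        key = (project[A], project[B], project[C])
        coarse[key] = max(coarse[key], lp)
    return dict(coarse)

print(coarse_rules(fine_rules, project))
```

Both S → NP VP and NP → NP PP project onto HP → HP VP0, whose weight therefore becomes max(log 0.9, log 0.2) = log 0.9.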
Proof. Let Y be the set of all goal derivations in the original chart, Y′ ⊂ Y be the subset of Y not appearing in the coarse chart, and Ȳ be the set of all goal derivations in the coarse chart. For each derivation d′ ∈ Y′, there exists a unique corresponding derivation d̄ ∈ Ȳ with s(d̄) ≥ s(d′) (see Figure 1). Then, for every d′ ∈ Y′,

  s(d̂) = max_{d̄ ∈ Ȳ} s(d̄) ≥ s(d̄) ≥ s(d′),

and every d ∈ Y \ Y′ also appears in the coarse chart, so s(d̂) ≥ s(d) holds as well. Since d̂ contains no shrinkage symbol, d̂ ∈ Y, and this means that d̂ is the best goal derivation in the original chart. □

Algorithm 1 shows the pseudocode for IVP. The IVP algorithm starts by initializing a coarse chart that consists of only 0-th layer shrinkage symbols. It conducts Viterbi inside parsing to find the best goal derivation. If the derivation does not contain any shrinkage symbols, the algorithm returns it and terminates. Otherwise, the chart table is expanded, and the above procedure is repeated until the termination condition is satisfied.
For efficient parsing, we integrate a pruning technique with IVP. For an edge e = (A, i, j), we denote by αβ(e) = α(e) + β(e) the score of the best goal derivation that passes through e, where β(e) and α(e) are the Viterbi inside and outside scores of e. Then, given a lower bound lb such that lb ≤ max_{d ∈ Y} s(d), where Y is the set of all goal derivations in the original chart, an edge e with αβ(e) < lb need not be processed. Although it is expensive to compute αβ(e) in the original chart, we can efficiently compute an upper bound on it by Viterbi inside-outside parsing in a coarse chart table: αβ(e) ≤ ᾱ(e) + β̄(e) = ᾱβ̄(e), where ᾱ(e) and β̄(e) are the Viterbi inside and outside scores of e in the coarse chart table. If ᾱβ̄(e) < lb, we can safely prune the edge e from the coarse chart. Note that this pruning simply reduces the search space at each IVP iteration and does not affect the number of iterations required for convergence.
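The pruning test itself is simple once the coarse inside and outside scores are available. A minimal sketch with illustrative scores (the edge keys and values are hypothetical):

```python
# Coarse Viterbi outside and inside scores for two edges (illustrative).
alpha = {("HP", 0, 2): -1.0, ("V", 2, 3): -4.0}   # outside scores
beta = {("HP", 0, 2): -2.0, ("V", 2, 3): -3.5}    # inside scores

def prune_chart(edges, alpha, beta, lb):
    """Keep an edge e only if alpha(e) + beta(e) >= lb. Because the
    coarse alpha-beta score upper-bounds the true one, any pruned edge
    cannot lie on the best goal derivation, so pruning is safe."""
    return [e for e in edges if alpha[e] + beta[e] >= lb]

print(prune_chart(list(alpha), alpha, beta, lb=-5.0))
```

Here ("V", 2, 3) has a coarse alpha-beta score of -7.5 < -5.0 and is pruned, while ("HP", 0, 2) at -3.0 survives.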
We initialize the lower bound lb with the score of a goal derivation obtained by deterministic parsing, det(), in the original chart. Deterministic parsing keeps only the non-terminal symbol with the highest score in each chart cell and removes the others. The det() function is very fast but causes many search errors. For efficient pruning, a tighter lower bound is important; thus we update the current lower bound with the score of the best derivation consisting of non-terminals only, obtained by the best() function in the current coarse chart, whenever that score exceeds the current bound.
At line 9, IVP expands the current chart table by replacing all shrinkage symbols in d̂ with their next-layer symbols using the mapping π. Although this expansion does not admit a reasonable worst-case time complexity, since it may take many iterations to converge, our experimental results show that it is highly effective in practice.
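A sketch of this expansion step, under the assumption that the chart records, per span, the set of symbols allowed in that cell (the chart representation and the next_layer map are hypothetical stand-ins for π):

```python
# Hypothetical one-step refinement map, standing in for pi_{i -> i+1}.
next_layer = {"HP": {"S", "N"}}
shrinkage = set(next_layer)      # symbols that can still be refined

def expand_chart(chart, derivation):
    """Replace each shrinkage symbol used by the best derivation with
    its next-layer symbols in the corresponding chart cell; cells not
    touched by the derivation are left coarse."""
    for sym, span in derivation:
        if sym in shrinkage:
            cell = chart[span]
            cell.discard(sym)
            cell |= next_layer[sym]
    return chart

chart = {(0, 2): {"HP", "V"}, (2, 3): {"V"}}
best = [("HP", (0, 2)), ("V", (2, 3))]   # best goal derivation as (symbol, span)
expand_chart(chart, best)
print(chart)
```

Only the cell (0, 2), where the derivation used the shrinkage symbol HP, is refined; cell (2, 3) is unchanged.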
The K-best IVP algorithm also prunes unnecessary edges, and it initializes the lower bound lb with the score of the k-th best derivation obtained by beam-search parsing in the original chart. For efficient pruning, we update lb with the score of the k-th best derivation consisting of non-terminals only, obtained by the k-best() function in the current coarse chart. The getShrinkageDeriv() function seeks the best derivation containing shrinkage symbols among [d̂_2, ..., d̂_k]. The K-best IVP algorithm inherits the other components from standard IVP.
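The selection that getShrinkageDeriv() performs can be sketched as a scan over the best-first list of derivations; the (score, symbols) representation and the example shrinkage symbols are our own assumptions:

```python
# Illustrative shrinkage symbols (not the paper's actual symbol set).
shrinkage = {"HP", "VP0"}

def get_shrinkage_deriv(derivs):
    """Scan derivations in best-first order and return the first (i.e.
    best) one that still contains a shrinkage symbol, or None if every
    derivation consists of non-terminals only."""
    for score, symbols in derivs:
        if any(s in shrinkage for s in symbols):
            return score, symbols
    return None

derivs = [(-2.0, ["S", "NP", "VP"]),     # non-terminals only
          (-2.5, ["S", "HP", "VP"])]     # contains shrinkage symbol HP
print(get_shrinkage_deriv(derivs))
```

In K-best IVP the returned derivation tells the expansion step which shrinkage symbols must be refined before the k-best list can become final.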

Experiments
We used the Wall Street Journal (WSJ) part of the English Penn Treebank: Sections 02-21 were used for training, and sentences of length 1-35 in Section 22 for testing. We estimated a Chomsky normal form PCFG by maximum likelihood from right-branching binarized trees without function labels and trace-fillers. Note that although this grammar is a proof of concept, CKY with a larger grammar does not work well even for short sentences. Table 1 shows that the number of edges produced by the IVP algorithm is significantly smaller than that of standard CKY. Moreover, many of the edges are pruned during the iterative process. Although IVP takes many iterations until convergence, it is about 8 times faster than CKY. This indicates that the computational cost of the Viterbi inside and outside algorithms on a small chart is negligible.
Next, we examine the K-best IVP algorithm. Figure 3 shows the parsing speed of the Lazy and K-best IVP algorithms for various k (2 to 128). When k is small (2 to 64), K-best IVP is much faster than Lazy. However, K-best IVP did not work well when k was set to 128 or more. Figure 4 shows the reason: we plot the number of edges in the chart table at each K-best IVP iteration for a test sentence of length 28. Clearly, the smaller k is, the earlier the algorithm converges. Moreover, when k is too large, it is difficult to compute a tight lower bound, i.e., K-best IVP cannot prune unnecessary edges efficiently. In practice, however, this is not likely to be a serious problem, since many NLP tasks use only very small k-best lists of parse trees (Choe and Charniak, 2016).

Related Work
Huang and Chiang (2005) presented an efficient K-best parsing algorithm, which extracts K-best lists after a Viterbi inside pass. Huang (2005) also described a K-best extension of the Knuth parsing algorithm (Knuth, 1977; Klein and Manning, 2004). Pauls and Klein (2009) successfully integrated A* search with the K-best Knuth algorithm. Tsuruoka and Tsujii (2004) proposed an iterative CKY algorithm, which is similar to our IVP algorithm in that it repeatedly conducts CKY parsing with a threshold until the best parse is found. The main difference is that IVP employs a coarse-to-fine chart expansion to compute better lower and upper bounds efficiently. Moreover, Tsuruoka and Tsujii (2004) did not mention how to extend their algorithm to K-best parsing.

Coarse-to-fine parsing (Charniak et al., 2006) is used in many practical parsers, such as Petrov and Klein (2007). However, coarse-to-fine search is approximate, so the solution of the parser is not always optimal.
For sequential decoding, Kaji et al. (2010) also proposed the iterative Viterbi algorithm. Huang et al. (2012) extended it to extract K-best strings by integrating the backward K-best A* search (Soong and Huang, 1991) with the iterative process. Our proposed algorithm can be regarded as a generalization of their methods to the parsing problem.

Conclusion and Future Work
This paper presents an efficient K-best parsing algorithm for PCFGs. The algorithm is based on standard Viterbi inside and outside computations and is easy to implement. As future work, we plan to conduct experiments with latent-variable PCFGs (Matsuzaki et al., 2005; Cohen et al., 2012) to show that our method is useful for a variety of grammars.