Enhancing the Inside-Outside Recursive Neural Network Reranker for Dependency Parsing

We propose solutions to enhance the Inside-Outside Recursive Neural Network (IORNN) reranker of Le and Zuidema (2014). By replacing the original softmax function with a hierarchical softmax using a binary tree constructed by combining the output of the Brown clustering algorithm and frequency-based Huffman codes, we significantly reduce the reranker's computational complexity. In addition, by enriching the contexts used in the reranker with subtrees rooted at (ancestors') cousin nodes, we increase its accuracy.


Introduction
Using neural networks for syntactic parsing has recently become popular, thanks to the promising results that neural-net-based parsers have achieved. For constituent parsing, Socher et al. (2013), using a recursive neural network (RNN), obtained an F1 score close to the state of the art on the Penn WSJ corpus. For dependency parsing, the inside-outside recursive neural net (IORNN) reranker proposed by Le and Zuidema (2014) is among the top systems, which also include the extremely fast transition-based parser of Chen and Manning (2014), built on a traditional feed-forward neural network.
There are many reasons why neural-net-based systems perform very well. First, Bansal et al. (2014) showed that using word embeddings can lead to significant improvements in dependency parsing. Neural-net-based systems can transparently and easily employ word embeddings by initializing their networks with those vectors. Second, by comparing a count-based model with their neural-net-based model on perplexity, Le and Zuidema (2014) suggested that predicting with neural nets is an effective solution to the problem of data sparsity. Last but not least, as shown in the work of Socher and colleagues on RNNs, e.g. Socher et al. (2013), neural networks are capable of 'semantic transfer', which is essential for disambiguation.
We focus on how to enhance the IORNN reranker of Le and Zuidema (2014) by both reducing its computational complexity and increasing its accuracy. Although this reranker is among the top systems in accuracy, it is computationally very costly because the softmax function used to compute the probabilities of generating tokens takes all words in the vocabulary into account. Possible solutions are to approximate the original softmax function with a hierarchical softmax (Morin and Bengio, 2005), noise-contrastive estimation (Mnih and Teh, 2012), or factorization using classes (Mikolov et al., 2011). The cost of using those approximations is, however, a drop in system performance. To reduce the drop, we use a hierarchical softmax with a binary tree constructed by combining Brown clusters and Huffman codes.
We show that, thanks to the reranking framework and the IORNN's ability to overcome the problem of data sparsity, more complex contexts can be employed to generate tokens. We introduce a new type of context, named full-history. By employing both the hierarchical softmax and the new type of context, our new IORNN reranker has significantly lower complexity but higher accuracy than the original reranker.

The ∞-order Generative Model
The reranker employs the generative model proposed by Eisner (1996). Intuitively, this model is top-down: starting with ROOT, we generate its left dependents and its right dependents. We then generate the dependents of each of ROOT's dependents. The generative process continues recursively until there are no more dependents to generate. Formally, this model is described by the following formula

$$P(T(H)) = \prod_{L \in H^L} P(L \mid C_L)\, P(T(L)) \times \prod_{R \in H^R} P(R \mid C_R)\, P(T(R)) \quad (1)$$

where $H$ is the current head, $T(N)$ is the subtree rooted at $N$, and $C_N$ is the context to generate $N$. $H^L$ and $H^R$ are respectively $H$'s left dependents and right dependents, plus EOC (End-Of-Children), a special token indicating that there are no more dependents to generate. Thus, $P(T(ROOT))$ is the probability of generating the entire tree $T$. Le and Zuidema's ∞-order generative model is defined as Eisner's model in which the context $C^\infty_D$ to generate $D$ contains all of $D$'s generated siblings, its ancestors, and their siblings. Because contexts are allowed to hold very large fragments, traditional count-based methods are impractical (even with smart smoothing techniques). They thus introduced the IORNN architecture to estimate the model.
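The top-down recursion can be illustrated with a minimal Python sketch. The `Node` class and the caller-supplied `token_prob` function (standing in for the probability of a token given its context) are hypothetical names introduced here, not part of the original system. For brevity the context is reduced to the head word, the direction, and the siblings generated so far, which is much poorer than the ∞-order context.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

EOC = "<EOC>"  # End-Of-Children marker

@dataclass
class Node:
    word: str
    left: List["Node"] = field(default_factory=list)   # left dependents, in generation order
    right: List["Node"] = field(default_factory=list)  # right dependents, in generation order

# Simplified context (an assumption of this sketch):
# (head word, direction, siblings generated so far)
Context = Tuple[str, str, Tuple[str, ...]]

def log_prob_subtree(head: Node, token_prob: Callable[[str, Context], float]) -> float:
    """Log-probability of generating everything below `head`."""
    total = 0.0
    for direction, deps in (("L", head.left), ("R", head.right)):
        generated: List[str] = []
        for dep in deps:
            # Generate the dependent token, then recursively its own subtree.
            total += math.log(token_prob(dep.word, (head.word, direction, tuple(generated))))
            total += log_prob_subtree(dep, token_prob)
            generated.append(dep.word)
        # EOC closes the sequence of dependents in this direction.
        total += math.log(token_prob(EOC, (head.word, direction, tuple(generated))))
    return total

def uniform(token: str, context: Context) -> float:
    """Toy model: every generation event has probability 0.25."""
    return 0.25

tree = Node("ROOT", right=[Node("ate", left=[Node("she")], right=[Node("fish")])])
log_p = log_prob_subtree(tree, uniform)
```

In this toy tree there are three word tokens and two EOC events per node (four nodes), so the sketch multiplies eleven probabilities in total.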

Estimation with the IORNN
Each tree node $y$ carries three vectors: an inner representation $i_y$, representing $y$; an outer representation $o_y$, representing the full context of $y$; and a partial outer representation $\bar{o}_y$, representing the partial context $C^\infty_y$ which generates the token of $y$. Without loss of generality and ignoring directions for simplicity, we assume that the model is generating dependent $y$ for node $h$ conditioning on context $C^\infty_y$ (see Figure 1). Under the approximation that the inner representation of a phrase equals the inner representation of its head, and thanks to the recursive definition of full/partial contexts ($C^\infty_y$ is a combination of $C^\infty_h$ and $y$'s previously generated sisters), the partial outer representation $\bar{o}_y$ is computed from the representations of $h$ and of the sisters of $y$ generated before it, and the outer representation $o_y$ is computed given $h$ and $y$'s siblings (Le and Zuidema, 2014).

Figure 1: The process to (a) generate $y$, (b) compute outer representation $o_y$, given head $h$ and sibling $x$. Black, grey, and white boxes are respectively inner, partial outer, and outer representations. (Le and Zuidema, 2014)

The probability $P(w \mid C^\infty_y)$ of generating a token $w$ at node $y$ is given by

$$softmax(w, \bar{o}_y) = \frac{e^{u(w, \bar{o}_y)}}{\sum_{w' \in V} e^{u(w', \bar{o}_y)}} \quad (2)$$

where $[u(w_1, \bar{o}_y), ..., u(w_{|V|}, \bar{o}_y)]^T = W \bar{o}_y + b$ and $V$ is the set of all possible tokens (i.e. the vocabulary). $W$ is a $|V| \times n$ real matrix and $b$ is a $|V|$-dimensional real vector.
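As an illustration of Equation 2, the following Python sketch computes the full softmax distribution from a partial outer representation. The function name `softmax_probs` and the toy values are assumptions made for this example. Subtracting the maximum score is a standard numerical-stability step and is not something the text describes.

```python
import math
from typing import List

def softmax_probs(W: List[List[float]], b: List[float], o_bar: List[float]) -> List[float]:
    """Return the probability of every token in V, with scores u = W o_bar + b."""
    scores = [sum(w_ij * o_j for w_ij, o_j in zip(row, o_bar)) + b_i
              for row, b_i in zip(W, b)]
    m = max(scores)  # subtract the max so that exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)    # normalization factor: touches every token in V
    return [e / z for e in exps]

# Toy example: |V| = 3 tokens, n = 2 dimensions.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, -1.0]
probs = softmax_probs(W, b, [0.5, 0.5])
```

The score computation visits every row of `W`, which is where the |V| × n cost per generated token comes from.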

The Reranker
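Le and Zuidema's (mixture) reranker is

$$T^* = \arg\max_{T \in D(S)} \; \alpha \log P(T(ROOT)) + (1 - \alpha)\, s(S, T) \quad (3)$$

where $D(S)$ and $s(S, T)$ are a $k$-best list and scores given by a third-party parser, and $\alpha \in [0, 1]$.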
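A mixture reranker of this kind can be sketched in a few lines of Python. The sketch assumes that the mixture is a linear interpolation, weighted by α, between the log-probability under the generative model and the third-party parser's score. The function name `rerank` and the tuple layout of the k-best list are hypothetical.

```python
from typing import List, Tuple

def rerank(kbest: List[Tuple[str, float, float]], alpha: float) -> str:
    """Pick the parse maximising alpha * log P(T) + (1 - alpha) * s(S, T).

    Each entry is (parse_id, log_prob_under_generative_model, parser_score).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    best = max(kbest, key=lambda e: alpha * e[1] + (1.0 - alpha) * e[2])
    return best[0]

# Toy k-best list with made-up scores.
kbest = [("t1", -40.0, 10.0), ("t2", -35.0, 9.0), ("t3", -50.0, 11.0)]
choice_parser_only = rerank(kbest, alpha=0.0)  # trusts the parser score alone
choice_model_only = rerank(kbest, alpha=1.0)   # trusts the generative model alone
```

With α = 0 the reranker simply returns the parser's top-scoring parse, and with α = 1 it ignores the parser's scores, so α controls how much the generative model is trusted.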

Reduce Complexity
The complexity of the IORNN reranker above for computing $P(T(ROOT))$ is approximately

$$O = l \times (3 \times n \times n + n \times |V|)$$

where $l$ is the length of the given sentence, and $n$ and $|V|$ are respectively the dimension of the representations and the vocabulary size (sums of vectors are ignored because their computational cost is small w.r.t. $l \times n \times n$). As Figure 1 shows, each node requires four matrix-vector multiplications: two for computing children's (partial) outer representations, one for computing sisters' (partial) outer representations, and one for computing the softmax. In Le and Zuidema's reranker, $|V| \approx 14000 \gg n = 200$. This means that the reranker spends most of its time computing $softmax(w, \bar{o}_y)$ in Equation 2. This is also true for the complexity in the training phase.
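Plugging the reported sizes into the complexity estimate makes the imbalance concrete. The short Python sketch below only does this arithmetic; the function name `per_node_cost` is introduced here for illustration.

```python
from typing import Tuple

def per_node_cost(n: int, vocab_size: int) -> Tuple[int, int]:
    """Multiplications per node: (recursive part, softmax part)."""
    recursive = 3 * n * n      # three n x n matrix-vector products
    softmax = n * vocab_size   # one |V| x n matrix-vector product
    return recursive, softmax

recursive, softmax = per_node_cost(n=200, vocab_size=14000)
softmax_share = softmax / (recursive + softmax)
```

With n = 200 and |V| = 14000, the softmax accounts for 2,800,000 of roughly 2,920,000 multiplications per node, i.e. about 96% of the estimated cost.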
To reduce the reranker's complexity, we need to approximate this softmax. Mnih and Teh (2012) propose using noise-contrastive estimation, which forces the system to discriminate correct words from randomly chosen candidates (i.e., noise). This approach is very fast in training thanks to fixing normalization factors to one, but slow in testing because normalization factors are explicitly computed. Vaswani et al. (2013) use the same approach, and also fix normalization factors to one when testing. This, however, does not guarantee properly normalized probabilities. We thus employ the hierarchical softmax proposed by Morin and Bengio (2005), which is fast in both training and testing and outputs properly normalized probabilities.
Assume that there is a binary tree whose leaf nodes each correspond to a word in the vocabulary. Let $(u^w_1, u^w_2, ..., u^w_L)$ be the path from the root to the leaf node $w$ (i.e. $u^w_1 = root$ and $u^w_L = w$). Let $L(u)$ be the left child of node $u$, and $[x]$ be 1 if $x$ is true and −1 otherwise. We then replace Equation 2 by

$$P(w \mid C^\infty_y) = \prod_{i=1}^{L-1} \sigma\left([u^w_{i+1} = L(u^w_i)]\, v_{u^w_i}^T \bar{o}_y\right)$$

where $\sigma(z) = 1/(1 + e^{-z})$ and $v_u$ is the $n$-dimensional vector of internal node $u$. If the binary tree is perfectly balanced, the new complexity is approximately $l \times (3 \times n \times n + n \times \log(|V|))$, which is less than $4l \times n \times n$ if $|V| < 2^n$ ($2^{200} \approx 1.6 \times 10^{60}$, as $n = 200$ in Le and Zuidema's reranker).

Constructing a binary tree for this hierarchical softmax turns out to be nontrivial. Morin and Bengio (2005) relied on WordNet whereas Mikolov et al. (2013) used only frequency-based Huffman codes. In our case, an ideal tree should reflect both semantic similarities between words (e.g. leaf nodes for 'dog' and 'cat' should be close to each other) and word frequencies (since we want to minimize the complexity). Therefore we propose combining the output of the Brown hierarchical clustering algorithm (Brown et al., 1992) with frequency-based Huffman codes. First, we use the Brown algorithm to find $c$ hierarchical clusters ($c = 500$ in our experiments). Then, for each cluster, we compute the Huffman code for each word in that cluster.
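The construction and use of such a tree can be sketched in Python as follows. Several things here are assumptions of the sketch rather than details given in the text: the cluster bit strings ("0" and "1") stand in for real Brown clustering output, a word's full path is taken to be its cluster's bit string followed by its in-cluster Huffman code, and each internal node is identified by its path prefix and owns one parameter vector. All function names are hypothetical.

```python
import heapq
import math
from itertools import count
from typing import Dict, List

def huffman_codes(freqs: Dict[str, int]) -> Dict[str, str]:
    """Frequency-based Huffman codes ('0' = left, '1' = right) for one cluster."""
    if len(freqs) == 1:
        return {word: "" for word in freqs}
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(f, next(tie), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f0, _, left = heapq.heappop(heap)
        f1, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}
        merged.update({w: "1" + c for w, c in right.items()})
        heapq.heappush(heap, (f0 + f1, next(tie), merged))
    return heap[0][2]

def combined_codes(clusters: Dict[str, Dict[str, int]]) -> Dict[str, str]:
    """Prefix each word's in-cluster Huffman code with its cluster's bit string."""
    codes = {}
    for bits, freqs in clusters.items():
        for word, code in huffman_codes(freqs).items():
            codes[word] = bits + code
    return codes

def hsoftmax_prob(code: str, node_vecs: Dict[str, List[float]], o_bar: List[float]) -> float:
    """Product along the path of sigma(sign * v_u . o_bar), sign = +1 when going left."""
    p = 1.0
    for i, bit in enumerate(code):
        v = node_vecs[code[:i]]  # internal node identified by the path prefix
        z = sum(a * b for a, b in zip(v, o_bar))
        sign = 1.0 if bit == "0" else -1.0
        p *= 1.0 / (1.0 + math.exp(-sign * z))
    return p

# Hypothetical clusters keyed by made-up cluster bit strings, with word counts.
clusters = {"0": {"dog": 5, "cat": 3, "horse": 1}, "1": {"run": 4, "walk": 2}}
codes = combined_codes(clusters)

# One arbitrary toy parameter vector per internal node.
internal = {code[:i] for code in codes.values() for i in range(len(code))}
node_vecs = {u: [0.1 * (k + 1), -0.2] for k, u in enumerate(sorted(internal))}
total = sum(hsoftmax_prob(c, node_vecs, [1.0, 2.0]) for c in codes.values())
```

Because every internal node splits its probability mass between its two children, the leaf probabilities sum to one without any explicit normalization. Within a cluster, frequent words such as 'dog' receive shorter paths than rare ones such as 'horse', so they cost fewer vector products.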

Enrich Contexts
Although Le and Zuidema suggest that predicting with neural networks helps overcome the problem of sparsity, their reranker still relies on two widely used independence assumptions: (i) the two Markov chains that generate dependents in the two directions are independent, given the head, and (ii) non-overlapping subtrees are generated independently. That is why the partial context used to generate a node (e.g. the red-dashed shape in Figure 2) ignores: (i) sisters in the other direction and (ii) ancestors' cousins and their descendants.
We, in contrast, eliminate those two assumptions by proposing the following top-down left-to-right generative story. From the head node, we generate its dependents from left to right. The partial context to generate a dependent is the whole fragment that has been generated so far (e.g. the blue shape in Figure 2). We then generate the subtrees rooted at those nodes, also from left to right. The full context given to a node to generate the subtree rooted at this node is thus the whole fragment that has been generated so far (e.g. the combination of the blue shape and the blue-dotted shape in Figure 2). In this way, the model always uses the whole up-to-date fragment to generate a dependent or to generate a subtree rooted at a node. To our knowledge, these contexts, which contain full derivation histories, are the most complete ones ever used for graph-based parsing.
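The generation order implied by this story can be made explicit with a small Python sketch. The dictionary-based tree representation and the function name `generation_order` are assumptions of this example, and EOC events are omitted for brevity.

```python
from typing import Dict, List

def generation_order(tree: Dict[str, List[str]], head: str) -> List[str]:
    """Order in which tokens are generated: a head's dependents left to right,
    then the subtree under each dependent, also left to right."""
    deps = tree.get(head, [])
    order = list(deps)
    for d in deps:
        order.extend(generation_order(tree, d))
    return order

# Toy tree: each head maps to its dependents in left-to-right order.
tree = {"ROOT": ["ate"], "ate": ["she", "fish"], "fish": ["fresh"]}
order = generation_order(tree, "ROOT")

# Under the full-history view, the k-th token is conditioned on all tokens before it.
histories = [order[:k] for k in range(len(order))]
```

In the toy tree, 'fresh' is generated last and its history contains 'she', a node outside its own chain of ancestors, which is the kind of material the original partial context leaves out.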
Extending the IORNN reranker in this way is straightforward. For example, we first generate a subtree $tr(x)$ rooted at node $x$ in Figure 3. We then compute the inner representation for $tr(x)$: if $tr(x)$ contains only $x$ then $i_{tr(x)} = i_x$; otherwise $i_{tr(x)}$ is computed from $i_x$ and the inner representations of the subtrees rooted at $x$'s dependents, where $S(x)$ is the set of $x$'s dependents, $dr(u)$ is the dependency relation of $u$ with $x$, $W^i_{h/dr(u)}$ are $n \times n$ real matrices, and $b_i$ is an $n$-dimensional real vector.

Figure 2: Context used in Le and Zuidema's reranker (red-dashed shape) and full-history context (blue-solid shape) to generate token 'authorization'.
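One possible shape for this bottom-up composition is sketched below in Python. The exact combination is not recoverable from the text, so the sketch assumes a head matrix applied to the head's inner representation, relation-specific matrices applied to each dependent subtree's representation and averaged over the dependents, a bias, and a tanh activation. The averaging, the activation, and all names (`inner_subtree`, `W_head`, `W_rel`) are assumptions of this example.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]

def matvec(M: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def inner_subtree(i_x: Vector,
                  deps: List[Tuple[str, Vector]],
                  W_head: Matrix,
                  W_rel: Dict[str, Matrix],
                  b: Vector) -> Vector:
    """Inner representation of tr(x) from i_x and its dependents' subtrees.

    deps holds (dependency relation, inner representation of the dependent's
    subtree) pairs. Averaging over dependents and tanh are assumptions.
    """
    if not deps:
        return list(i_x)  # tr(x) contains only x
    total = matvec(W_head, i_x)
    for rel, i_u in deps:
        contrib = matvec(W_rel[rel], i_u)
        total = [t + c / len(deps) for t, c in zip(total, contrib)]
    return [math.tanh(t + b_k) for t, b_k in zip(total, b)]

# Toy example with n = 2 and identity matrices.
I2 = [[1.0, 0.0], [0.0, 1.0]]
i_x = [0.5, -0.5]
deps = [("nsubj", [1.0, 0.0]), ("dobj", [0.0, 1.0])]
out = inner_subtree(i_x, deps, I2, {"nsubj": I2, "dobj": I2}, [0.0, 0.0])
leaf = inner_subtree(i_x, [], I2, {}, [0.0, 0.0])
```

Applying the function bottom-up, from the leaves of a generated subtree to its root, yields a single vector that can then be used wherever a subtree has to be summarised as part of the history.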

Experiments
We use the same setting reported in Le and Zuidema (2014, Section 5.3). The Penn WSJ Treebank is converted to dependencies using the head rules of Yamada and Matsumoto (2003). Sections 2-21 are used for training, section 22 for development, and section 23 for testing. The development and test sets are tagged by the Stanford POS tagger trained on the whole training data, whereas 10-way jackknifing is used to generate tags for the training set. For the new IORNN reranker, we set n = 200 and initialise it with the 50-dim word embeddings from Collobert et al. (2011). We use the MSTParser (with the 2nd-order feature mode) (McDonald et al., 2005) to generate k-best lists, and optimize k and α (Equation 3) on the development set.

Table 1 shows the comparison of our new reranker against other systems. Surprisingly, our reranker with the proposed hierarchical softmax alone achieves a score almost equivalent to that of Le and Zuidema's reranker. We conjecture that drawbacks of the hierarchical softmax compared to the original can be lessened by probabilities of generating other elements like POS-tags and dependency relations. Adding enriched contexts, our reranker achieves the second best accuracy among those systems. Because in this experiment no words have paths longer than 20 ≪ n = 200, our new reranker has a significantly lower complexity than Le and Zuidema's reranker. On a computer with an Intel Core-i5 3.3GHz CPU and 8GB RAM, it takes 20 minutes to train this reranker, which is implemented in C++, and 2 minutes to evaluate it on the test set.

System                                   UAS
MSTParser (baseline)                     92.06
Koo and Collins (2010)                   93.04
Zhang and McDonald (2012)                93.06
Martins et al. (2013)                    93.07
Bohnet and Kuhn (2012)                   93.39
Reranking
Hayashi et al. (2013)                    93.12
Le and Zuidema (2014)                    93.12
Our reranker (h-softmax only, k = 45)    93.10
Our reranker (k = 47)                    93.27

Table 1: Comparison with other systems on section 23 (excluding punctuation).

Conclusion
Solutions to enhance the IORNN reranker of Le and Zuidema (2014) were proposed. We showed that, by replacing the original softmax function with a hierarchical softmax, the reranker's computational complexity decreases significantly. The cost of this, a drop in accuracy, is avoided by enriching contexts with subtrees rooted at (ancestors') cousin nodes. According to experimental results on the Penn WSJ Treebank, the new reranker has even higher accuracy than the old one.