Enriched In-Order Linearization for Faster Sequence-to-Sequence Constituent Parsing

Sequence-to-sequence constituent parsing requires a linearization to represent trees as sequences. Top-down tree linearizations, which can be based on brackets or shift-reduce actions, have achieved the best accuracy to date. In this paper, we show that these results can be improved by using an in-order linearization instead. Based on this observation, we implement an enriched in-order shift-reduce linearization inspired by Vinyals et al. (2015)’s approach, achieving the best accuracy to date on the English PTB dataset among fully-supervised single-model sequence-to-sequence constituent parsers. Finally, we apply deterministic attention mechanisms to match the speed of state-of-the-art transition-based parsers, thus showing that sequence-to-sequence models can match them, not only in accuracy, but also in speed.


Introduction
Sequence-to-sequence (seq2seq) neural architectures have proved useful in several NLP tasks, with remarkable success in some of them such as machine translation, but they lag behind the state of the art in others. In constituent parsing, seq2seq models still need to improve to be competitive in accuracy and efficiency with their main competitors: transition-based constituent parsers (Dyer et al., 2016; Liu and Zhang, 2017b; Fernández-González and Gómez-Rodríguez, 2019). Vinyals et al. (2015) laid the first stone in seq2seq constituent parsing, proposing a linearization of phrase-structure trees as bracketed sequences following a top-down strategy, which can be predicted from the input sequence of words by any off-the-shelf seq2seq framework. While this approach is very simple, its accuracy and efficiency are significantly behind the state of the art in the fully-supervised single-model scenario.
Most attempts to improve this approach focused on modifying the neural network architecture, while keeping the top-down linearization strategy. As exceptions, Ma et al. (2017) and Liu and Zhang (2017a) proposed linearizations based on sequences of transition-based parsing actions instead of brackets. Ma et al. (2017) tried a bottom-up linearization, but obtained worse results than top-down approaches. Liu and Zhang (2017a) kept the top-down strategy, but used the transitions of the top-down transition system of Dyer et al. (2016) instead of a bracketed linearization, achieving higher performance.
In transition-based constituent parsing, an in-order algorithm has recently proved superior to the bottom-up and top-down approaches (Liu and Zhang, 2017b), but we know of no applications of this strategy in seq2seq parsing.
Contributions In this paper, we advance the understanding of linearizations for seq2seq parsing, and improve the state of the art, as follows: (1) we show that the superiority of a transition-based top-down linearization over a bracketing-based one observed by Liu and Zhang (2017a) does not hold when both are tested under the same framework. In fact, we show that the additional information provided by the larger vocabulary in the linearization of Vinyals et al. (2015) is beneficial to seq2seq predictions. (2) We implement a novel in-order transition-based linearization, based on the in-order transition system by Liu and Zhang (2017b), and manage to notably increase parsing accuracy with respect to previous approaches. (3) We enhance the in-order representation of parse trees by adding extra information following the shift-reduce version of the Vinyals et al. (2015) linearization, obtaining state-of-the-art accuracy among seq2seq parsers, on par with some well-known transition-based approaches. (4) We bridge the remaining gap with transition-based parsers (parsing speed) by applying a new variant of deterministic attention (Kamigaito et al., 2017; Ma et al., 2017) that restricts the hidden states used to compute the attention vector, doubling the system's speed. The result is a seq2seq parser that, for the first time, matches the speed and accuracy of transition-based parsers implemented under the same neural framework. (5) Using the neural framework of Dyer et al. (2015) as testing ground, we perform a homogeneous comparison among different seq2seq linearizations and widely-known transition-based parsers.

Enriched Linearizations
To cast constituent parsing as seq2seq prediction, each parse tree needs to be represented as a sequence of symbols that can be predicted from an input sentence. Initially, Vinyals et al. (2015) proposed a top-down bracketed linearization of constituent trees, where opening and closing brackets include non-terminal labels and POS tags are normalized by replacing them with a tag XX. An example is shown in linearization a of Figure 1.
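To make the mapping concrete, the following minimal Python sketch (the nested-tuple tree format and function name are our own illustration, not from the original work) produces a top-down bracketed sequence in the style of Vinyals et al. (2015), normalizing POS tags to XX:

```python
def linearize_top_down(tree):
    """Top-down bracketed linearization in the style of Vinyals et al. (2015).
    Trees are nested tuples: (label, children...) for non-terminals and
    (pos_tag, word) for pre-terminals."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return ["XX"]  # POS tags are normalized to the placeholder XX
    tokens = ["(" + label]
    for child in children:
        tokens.extend(linearize_top_down(child))
    tokens.append(")" + label)
    return tokens

tree = ("S", ("NP", ("DT", "The"), ("NN", "cat")), ("VP", ("VBZ", "sleeps")))
print(" ".join(linearize_top_down(tree)))
# (S (NP XX XX )NP (VP XX )VP )S
```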
As an alternative, Liu and Zhang (2017a) presented a shift-reduce linearization based on the top-down transition system defined for constituent parsing by Dyer et al. (2016) (linearization b of Figure 1). This transition system provides three transitions that can be used on a stack and a buffer to build a constituent tree: a Shift transition to push words from the buffer onto the stack, a Non-Terminal-X transition to push a non-terminal node X onto the stack, and a Reduce transition to pop elements from the stack until a non-terminal node is found and create a new subtree with all these elements as its children, pushing this new constituent onto the stack.
Following Vinyals et al. (2015)'s linearization, where closing brackets also include the non-terminal label, we define an equivalent shift-reduce variant where the Reduce transition is also parameterized with the non-terminal on top of the resulting subtree (Reduce-X). In that way, we can one-to-one map opening brackets to Non-Terminal-X transitions, closing brackets to Reduce-X actions and XX tags to Shift transitions, as shown in example c of Figure 1. This enriched version enlarges the vocabulary, but also adds some extra information that, as we will see below, improves parsing accuracy.
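Since the enriched shift-reduce sequence is in one-to-one correspondence with the bracketed one, the conversion reduces to a token-level rewrite, as in this sketch (token formats follow the example above; the function name is hypothetical):

```python
def brackets_to_enriched_top_down(tokens):
    """Map a labeled bracketed sequence one-to-one to enriched top-down
    actions: "(X" -> Non-Terminal-X, ")X" -> Reduce-X, "XX" -> Shift."""
    actions = []
    for tok in tokens:
        if tok.startswith("("):
            actions.append("NT(" + tok[1:] + ")")
        elif tok.startswith(")"):
            actions.append("REDUCE(" + tok[1:] + ")")
        else:  # a normalized POS tag
            actions.append("SHIFT")
    return actions

print(brackets_to_enriched_top_down("(S (NP XX XX )NP (VP XX )VP )S".split()))
# ['NT(S)', 'NT(NP)', 'SHIFT', 'SHIFT', 'REDUCE(NP)',
#  'NT(VP)', 'SHIFT', 'REDUCE(VP)', 'REDUCE(S)']
```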
As an alternative to the top-down parser of Dyer et al. (2016), Liu and Zhang (2017b) define a transition system based on in-order traversal, as in left-corner parsing (Rosenkrantz and Lewis, 1970): the non-terminal node on top of the tree being built is only considered after its first child is completed in the stack, building each subtree in a bottom-up manner, but choosing the non-terminal node on top before the new constituent is reduced. Transitions are the same as in the top-down algorithm (plus a Finish transition to terminate the parsing process), but the effect of applying a Reduce transition is different: it pops all elements from the stack until the first non-terminal node is found, which is also popped together with the preceding element in the stack to build a new constituent with all of them as children of the non-terminal node. This algorithm advanced the state of the art in shift-reduce constituent parsing; and, as we show in Section 4, it can be successfully applied as a linearization method for seq2seq constituent parsing. Sequence d in Figure 1 exemplifies the in-order linearization.
Similarly to the enriched top-down variant, we also extend the in-order shift-reduce linearization by parameterizing Reduce transitions. Additionally, we can add extra information to Shift transitions. Suzuki et al. (2018) leave the POS tags of punctuation symbols out of the normalization proposed by Vinyals et al. (2015); they give no further explanation, but presumably consider that this information can help seq2seq models. We adapt this idea to our novel enriched in-order linearization and lexicalize Shift transitions when a "." or a "," is pushed onto the stack, yielding "Shift." and "Shift,", respectively. In our experiments, we see that lexicalizing Shift transitions does have an impact on parsing performance. Sequence e in Figure 1 provides an example of this linearization technique.
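As a worked example of the enriched in-order linearization, the following sketch (same illustrative tree format as above; not the authors' code) emits each subtree's first child before its non-terminal, parameterizes Reduce with the non-terminal label, and lexicalizes Shift for "." and ",". A terminating Finish action would complete the sequence:

```python
def linearize_in_order(tree, actions=None):
    """Enriched in-order linearization: first child, then NT(X), then the
    remaining children, then REDUCE(X); SHIFT is lexicalized for '.' and ','."""
    if actions is None:
        actions = []
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        word = children[0]
        actions.append("SHIFT" + word if word in {".", ","} else "SHIFT")
        return actions
    linearize_in_order(children[0], actions)  # first child comes before NT(X)
    actions.append("NT(" + label + ")")
    for child in children[1:]:
        linearize_in_order(child, actions)
    actions.append("REDUCE(" + label + ")")
    return actions

tree = ("S", ("NP", ("DT", "The"), ("NN", "cat")),
             ("VP", ("VBZ", "sleeps")), (".", "."))
print(linearize_in_order(tree))
# ['SHIFT', 'NT(NP)', 'SHIFT', 'REDUCE(NP)', 'NT(S)',
#  'SHIFT', 'NT(VP)', 'REDUCE(VP)', 'SHIFT.', 'REDUCE(S)']
```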
Note that, although we use a transition-based linearization of parse trees, our approach is agnostic to the stack structure and the parsing process is performed by a simple seq2seq model that straightforwardly translates input sequences of words into sequences of shift-reduce actions.

Seq2seq Neural Network
Baseline Model In our experiments, we test all proposed linearizations in the seq2seq neural architecture designed by Liu and Zhang (2017a) and implemented on the framework developed by Dyer et al. (2015). This architecture proved to outperform the majority of seq2seq approaches, even without implementing beam search (which penalizes parsing speed). The difference with respect to the vanilla seq2seq configuration (Vinyals et al., 2015) is that two separate attention models are used to cover two different and variable segments of the input. This provides improvements in accuracy, regardless of the linearization method used. More specifically, Liu and Zhang (2017a) follow the common practice in stack-LSTM-based shift-reduce parsers (Dyer et al., 2015, 2016; Liu and Zhang, 2017b) of concatenating pretrained word embeddings ($e^*_{w_i}$) with randomly initialized word ($e_{w_i}$) and POS tag embeddings ($e_{p_i}$) to derive (through a ReLU non-linearity) the final representation $x_i$ of the $i$th input word:

$$x_i = \mathrm{relu}(W_{enc}\,[e^*_{w_i}; e_{w_i}; e_{p_i}] + b_{enc})$$

where $W_{enc}$ and $b_{enc}$ are model parameters, and $w_i$ and $p_i$ represent the form and the POS tag of the $i$th input word.
This representation $x_i$ is fed into the encoder (implemented by a BiLSTM) to output an encoder hidden state $h_i$:

$$h_i = [\,\overrightarrow{\mathrm{LSTM}}(x_{1:i});\ \overleftarrow{\mathrm{LSTM}}(x_{i:n})\,]$$

As decoder, an LSTM generates a sequence of decoder hidden states from which the sequence of actions is predicted. Concretely, the current decoder hidden state $d_j$ is computed by:

$$d_j = \mathrm{relu}(W_{dec}\,[d_{j-1}; l^{att}_j; r^{att}_j] + b_{dec})$$

where $W_{dec}$ and $b_{dec}$ are model parameters, $d_{j-1}$ is the previous decoder hidden state, and $l^{att}_j$ and $r^{att}_j$ are the resulting attention vectors over the left and right segments, respectively, of the encoder hidden states $h_1 \dots h_n$. These two segments of the input are delimited by an index $p$, which is initialized to the beginning of the sentence and moves one position to the right each time a Shift transition is applied. Therefore, $l^{att}_j$ and $r^{att}_j$ are computed at timestep $j$ as:

$$l^{att}_j = \sum_{i=1}^{p} \alpha_{ji} h_i, \qquad r^{att}_j = \sum_{i=p+1}^{n} \alpha_{ji} h_i$$

where the weights $\alpha_{ji}$ are obtained by a softmax over attention scores within each segment. Then, the current token $y_j$ is predicted from $d_j$ as:

$$y_j = \operatorname{argmax}\ \mathrm{softmax}(W_{pred}\,\mathrm{relu}(W_{att}\,d_j + b_{att}) + b_{pred})$$

where $W_{att}$, $b_{att}$, $W_{pred}$ and $b_{pred}$ are parameters. In Figure 2, we graphically describe the neural architecture.
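The following numpy sketch illustrates one decoder step with the two-segment probabilistic attention just described (the dimensions, the bilinear score function W_score, and the initialization are our assumptions for illustration, not the authors' exact configuration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(d_prev, H, W_score):
    """Probabilistic attention over a segment H of encoder hidden states,
    using a simple bilinear score between d_prev and each state."""
    scores = H @ (W_score @ d_prev)   # one score per encoder state
    alpha = softmax(scores)           # attention weights
    return alpha @ H                  # weighted sum of encoder states

def decoder_step(d_prev, H, p, W_score, W_dec, b_dec):
    """One dual-attention decoder step: separate attention vectors over the
    left segment h_1..h_p (H[:p]) and the right segment h_{p+1}..h_n (H[p:])."""
    l_att = attend(d_prev, H[:p], W_score)
    r_att = attend(d_prev, H[p:], W_score)
    return np.maximum(0.0, W_dec @ np.concatenate([d_prev, l_att, r_att]) + b_dec)

rng = np.random.default_rng(0)
n, dim = 5, 8                         # toy sentence length and hidden size
H = rng.normal(size=(n, dim))         # encoder hidden states h_1..h_n
d = rng.normal(size=dim)              # previous decoder hidden state
W_score = 0.1 * rng.normal(size=(dim, dim))
W_dec = 0.1 * rng.normal(size=(dim, 3 * dim))
d_next = decoder_step(d, H, p=2, W_score=W_score, W_dec=W_dec,
                      b_dec=np.zeros(dim))
```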
Note that current state-of-the-art transition-based parsers, which rely on stack-LSTMs to represent the stack structure, are also implemented under the framework by Dyer et al. (2015) and, therefore, our approach can be fairly compared to them in terms of accuracy and speed.
Deterministic Attention Previous work (Kamigaito et al., 2017; Ma et al., 2017; Liu et al., 2018) claims that using deterministic attention mechanisms instead of the standard probabilistic variant leads to accuracy and speed gains. We propose a simple and effective procedure to implement deterministic attention in the architecture by Liu and Zhang (2017a), substantially reducing the time consumed by the decoder to predict the next token.
Apart from dividing the sequence of encoder hidden states into segments, Liu and Zhang (2017a) provide explicit alignment between the input word sequence and the output transition sequence by keeping the index p that indicates a correspondence between input words and Shift transitions. This information can be used to force the model to focus on those encoder hidden states that are more informative for decoding at each timestep, avoiding going through the whole input to compute the attention vector, and thus considerably reducing decoding time.
To gain some insight on which input words are most relevant, we study on the dev set the attention values assigned by the model to each encoder hidden state and the frequency with which each of them achieves the highest value at each timestep. Surprisingly, we found that, for the top-down parser, almost 90% of the time the highest attention values were assigned to the words in positions $p$ and $p+1$ by a wide margin. For the in-order parser, words in those positions also received considerable attention values, but they were decisive only 75% of the time. Following these results, we propose a computation of $l^{att}_j$ and $r^{att}_j$ where only the encoder hidden states in the rightmost position ($p$) of the left segment and in the leftmost position ($p+1$) of the right segment are considered:

$$l^{att}_j = h_p, \qquad r^{att}_j = h_{p+1}$$

This change avoids calculating the weight $\alpha_{ji}$ for each encoder hidden state, as needed in probabilistic attention. Attention vectors are computed in constant time, notably reducing running time while keeping the accuracy, as shown in our experiments.
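Under this scheme, the attention computation reduces to indexing, as in this sketch (same tensor conventions as the snippet above; our illustration):

```python
def deterministic_attention(H, p):
    """Deterministic attention: instead of a softmax-weighted sum over a whole
    segment, directly pick the two boundary encoder states in O(1):
    l_att = h_p (rightmost of the left segment) and
    r_att = h_{p+1} (leftmost of the right segment)."""
    l_att = H[p - 1]                       # h_p, 0-indexed; assumes p >= 1
    r_att = H[p] if p < len(H) else H[-1]  # h_{p+1}, clamped at sentence end
    return l_att, r_att
```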

Experiments
We test the proposed approaches on the PTB treebank (Marcus et al., 1993) with standard splits. Table 1 compares the parsing accuracy of all linearizations proposed in Section 2 with state-of-the-art fully-supervised transition-based constituent parsing models. The results show that our enriched in-order linearization is the most suitable option implemented so far for seq2seq constituent parsing, outperforming all existing seq2seq approaches (even without beam-search decoding) and matching some transition-based models. We also demonstrate that the enriched top-down variant (equivalent to the bracketed linearization of Vinyals et al. (2015)) outperforms the regular top-down approach of Liu and Zhang (2017a). This trend can also be seen in the in-order linearization, where the addition of more tokens (parameterized Reduce and lexicalized Shift transitions) to the vocabulary benefits model performance (a gain of 0.4 F-score points), meaning that seq2seq models make use of this additional information. In fact, we analysed the average length of output sequences and noticed that enriched variants with larger vocabularies tend to produce shorter sequences. We hypothesize that the extra information helps the model to better contextualize tokens in the sequence during training, minimizing the prediction of wrong tokens at decoding time. Finally, we extend the implementation by Liu and Zhang (2017a) with 10-beam-search decoding and increase F-score by 0.3 points.
We also evaluate parsing speed under the exact same conditions for our approach and the top-down (Dyer et al., 2016) and in-order (Liu and Zhang, 2017b) transition-based constituent parsers, implemented in the framework by Dyer et al. (2015). Table 2 shows how the proposed deterministic attention technique doubles the speed of the baseline model, putting it on par with stack-LSTM-based shift-reduce systems, which are considered among the most efficient approaches for constituent parsing. We can also see from Table 1 that the presented mechanism is more beneficial in terms of accuracy for the top-down algorithm (a gain of 0.2 F-score points) than for the in-order variant (a drop of 0.1 F-score points), as could be expected from our previous analysis of attention vectors. Finally, at the bottom of Table 1, we show current state-of-the-art chart-based parsers. These approaches, while more accurate, are significantly slower than seq2seq and transition-based parsers, making them less appealing for downstream applications where speed is crucial.

Conclusion
We present significant accuracy and speed improvements in seq2seq constituent parsing. The proposed linearization techniques can be used by any off-the-shelf seq2seq model without building a specific algorithm or structure. In addition, any advances in seq2seq neural architectures or pretrained transformer-based language models (Devlin et al., 2019) can be directly used to enhance our approach.

A.1 Top-down Transition System
In the top-down transition system defined by Dyer et al. (2016), parser configurations have the form c = ⟨Σ, B⟩, where Σ is a stack of constituents and B is the buffer that contains words from the input sentence. The top-down algorithm provides three transitions (described in Figure 3) that can be used on the stack and the buffer (which initially contains the whole unparsed sentence) to build the final constituent tree. Concretely:

• a Shift transition is used to push words from the buffer onto the stack,
• a Non-Terminal-X transition pushes a non-terminal node X onto the stack,
• and a Reduce transition pops elements from the stack until a non-terminal node is found and creates a new subtree with all these elements as its children, pushing this new constituent onto the stack.
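A compact executable rendering of these three transitions on an explicit stack and buffer (our own sketch; open non-terminals are stored as ("NT", X) and finished constituents as tuples):

```python
def step_top_down(stack, buffer, action):
    """Apply one top-down transition (in the style of Dyer et al., 2016)."""
    if action == "SHIFT":
        stack.append(buffer.pop(0))
    elif action.startswith("NT("):                 # Non-Terminal-X
        stack.append(("NT", action[3:-1]))
    elif action == "REDUCE":
        children = []
        while not (isinstance(stack[-1], tuple) and stack[-1][0] == "NT"):
            children.append(stack.pop())
        label = stack.pop()[1]                     # the open non-terminal X
        stack.append((label, *reversed(children)))

stack, buf = [], ["The", "cat", "sleeps"]
for a in ["NT(S)", "NT(NP)", "SHIFT", "SHIFT", "REDUCE",
          "NT(VP)", "SHIFT", "REDUCE", "REDUCE"]:
    step_top_down(stack, buf, a)
print(stack)  # [('S', ('NP', 'The', 'cat'), ('VP', 'sleeps'))]
```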

A.2 In-order Transition System
Liu and Zhang (2017b) define a transition system that builds a phrase-structure tree following an in-order traversal: the non-terminal node on top of the tree being built is only considered after its first child is completed in the stack, building each subtree in a bottom-up manner, but choosing the non-terminal node on top before the new constituent is reduced. This transition system has parser configurations of the stack-buffer form c = ⟨Σ, B⟩ and uses the following actions (described in Figure 4):

• a Shift transition to move words from the buffer to the stack,
• a Non-Terminal-X transition to push a non-terminal node X onto the stack as long as the first child of the future constituent is on top of the stack,
• a Reduce transition to pop all elements from the stack until the first non-terminal node is found, which is also popped together with the preceding element in the stack to build a new constituent with all of them as children of the non-terminal node,
• and, finally, a Finish transition to terminate the parsing process.
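The only effect that differs from the top-down system is Reduce, which additionally pops the element preceding the open non-terminal (its already-built first child). A sketch reusing the conventions of the top-down snippet above:

```python
def step_in_order(stack, buffer, action):
    """Apply one in-order transition (in the style of Liu and Zhang, 2017b)."""
    if action == "SHIFT":
        stack.append(buffer.pop(0))
    elif action.startswith("NT("):                 # legal once the first child
        stack.append(("NT", action[3:-1]))         # is already on the stack
    elif action == "REDUCE":
        children = []
        while not (isinstance(stack[-1], tuple) and stack[-1][0] == "NT"):
            children.append(stack.pop())
        label = stack.pop()[1]
        first_child = stack.pop()                  # element preceding NT(X)
        stack.append((label, first_child, *reversed(children)))
    # a FINISH action simply terminates the parsing process

stack, buf = [], ["The", "cat", "sleeps"]
for a in ["SHIFT", "NT(NP)", "SHIFT", "REDUCE", "NT(S)",
          "SHIFT", "NT(VP)", "REDUCE", "REDUCE"]:
    step_in_order(stack, buf, a)
print(stack)  # [('S', ('NP', 'The', 'cat'), ('VP', 'sleeps'))]
```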
The in-order transition system is a combination of the classic bottom-up and the newer top-down algorithms, providing the advantages of both: access to information from partial parses, as in the bottom-up approach, and the non-local outlook of the top-down approach.

A.3 Data and Settings
Following common practice, we test the proposed approaches on the Wall Street Journal sections of the English Penn Treebank (Marcus et al., 1993) with standard splits: sections 2-21 are used as training data, section 22 for development and section 23 for testing. We adopt stochastic gradient descent with Adam (Kingma and Ba, 2014) and the hyper-parameter selection of Liu and Zhang (2017a), detailed in Table 3. In addition, we use predicted POS tags and pretrained word embeddings (generated on the AFP portion of English Gigaword), as in Dyer et al. (2016) and Liu and Zhang (2017a,b).
All neural models are trained by minimizing the following cross-entropy loss objective with an $l_2$ regularization term:

$$\mathcal{L}(\theta) = -\sum_{i}\sum_{j} \log p_{y_{ij}} + \frac{\lambda}{2}\,\lVert\theta\rVert^2$$

where $\theta$ is the set of parameters, $p_{y_{ij}}$ is the probability of the $j$th token in the $i$th training example given by the model, and $\lambda$ is a regularization hyper-parameter. For further details about the neural architecture, the reader can refer to Liu and Zhang (2017a). For our executions, we report the average accuracy and speed over 3 runs with random initialization, on a single CPU core.
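As a concrete rendering of this objective (our sketch, assuming the per-token gold probabilities have already been computed by the model):

```python
import numpy as np

def training_loss(token_probs, params, lam):
    """Cross-entropy over gold tokens plus an l2 term:
    -sum_i sum_j log p_{y_ij} + (lam / 2) * ||theta||^2.
    token_probs[i][j] is the model probability of the j-th gold token of the
    i-th training example; params is a list of parameter arrays (theta)."""
    ce = -sum(np.log(p) for seq in token_probs for p in seq)
    l2 = 0.5 * lam * sum(float(np.sum(w ** 2)) for w in params)
    return ce + l2

# toy usage with hypothetical probabilities and a single parameter matrix
print(training_loss([[0.9, 0.8], [0.7]], [np.ones((2, 2))], lam=1e-6))
```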