Constituent Parsing as Sequence Labeling

We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds and propose a set of fast baselines. We achieve 90% F-score on the PTB test set, outperforming the Vinyals et al. (2015) sequence-to-sequence parser. In addition, sacrificing some accuracy, our approach achieves the fastest constituent parsing speeds reported to date on PTB by a wide margin.


Introduction
Constituent parsing is a core problem in NLP where the goal is to obtain the syntactic structure of sentences expressed as a phrase structure tree.
Several authors have proposed more efficient approaches that gain speed while preserving (or even improving) accuracy. Sagae and Lavie (2005) present a classifier for constituency parsing that runs in linear time by relying on a shift-reduce stack-based algorithm instead of a grammar. It is essentially an extension of transition-based dependency parsing (Nivre, 2003). This line of research has been refined over the years (Wang et al., 2006; Zhu et al., 2013; Dyer et al., 2016; Liu and Zhang, 2017; Fernández-González and Gómez-Rodríguez, 2018).
With an aim more related to our work, other authors have reduced constituency parsing to tasks that can be solved faster or in a more generic way. Fernández-González and Martins (2015) reduce phrase structure parsing to dependency parsing. They propose an intermediate representation where dependency labels from a head to its dependents encode the nonterminal symbol and an attachment order that is used to arrange nodes into constituents. Their approach makes it possible to use off-the-shelf dependency parsers for constituency parsing. In a different line, Vinyals et al. (2015) address the problem by relying on a sequence-to-sequence model where trees are linearized in a depth-first traversal order. Their solution can be seen as a machine translation model that maps a sequence of words into a parenthesized version of the tree. Choe and Charniak (2016) recast parsing as language modeling. They train a generative parser that obtains the phrasal structure of sentences by relying on the Vinyals et al. (2015) intuition and on the Zaremba et al. (2014) model to build the basic language modeling architecture.
More recently, Shen et al. (2018) propose an architecture to speed up the current state-of-the-art chart parsers trained with deep neural networks (Stern et al., 2017; Kitaev and Klein, 2018). They introduce the concept of syntactic distances, which specify the order in which the splitting points of a sentence will be selected. The model learns to predict such distances, to then recursively partition the input in a top-down fashion.
Contribution We propose a method to transform constituent parsing into sequence labeling.
This reduces it to the complexity of tasks such as part-of-speech (PoS) tagging, chunking or named-entity recognition. The contribution is two-fold.
First, we describe a method to linearize a tree into a sequence of labels (§2) whose length is that of the sentence minus one. The label generated for each word encodes the number of common ancestors in the constituent tree between that word and the next, and the nonterminal symbol associated with the lowest common ancestor. We prove that the encoding function is injective for any tree without unary branches. After applying collapsing techniques, the method can also handle unary chains.
Second, we use this encoding to present different baselines that can effectively predict the structure of sentences (§3). To do so, we rely on a recurrent sequence labeling model based on BILSTMs (Hochreiter and Schmidhuber, 1997; Yang and Zhang, 2018). We also test other models inspired by classic approaches for other tagging tasks (Schmid, 1994; Sha and Pereira, 2003). We use the Penn Treebank (PTB) and the Penn Chinese Treebank (CTB) as testbeds.
The comparison against Vinyals et al. (2015), the closest work to ours, shows that our method is able to train more accurate parsers. This is in spite of the fact that our approach addresses constituent parsing as a sequence labeling problem, which is simpler than a sequence-to-sequence problem, where the output sequence has variable/unknown length. Despite being the first sequence labeling method for constituent parsing, our baselines achieve decent accuracy results in comparison to models coming from mature lines of research, and their speeds are the fastest reported to our knowledge.

Linearization of n-ary trees

Notation and Preliminaries
In what follows, we use bold style to refer to vectors and matrices (e.g., x and W). Let w = [w_1, w_2, ..., w_{|w|}] be an input sequence of words, where w_i ∈ V. Let T_{|w|} be the set of constituent trees with |w| leaf nodes that have no unary branches. For now, we will assume that the constituent parsing problem consists in mapping each sentence w to a tree in T_{|w|}, i.e., we assume that correct parses have no unary branches. We will deal with unary branches later.
To reduce the problem to a sequence labeling task, we define a set of labels L that allows us to encode each tree in T_{|w|} as a unique sequence of labels in L^{|w|−1}, via an encoding function Φ_{|w|} : T_{|w|} → L^{|w|−1}. Then, we can reduce the constituent parsing problem to a sequence labeling task where the goal is to predict a function F_{|w|,θ} : V^{|w|} → L^{|w|−1}, where θ are the parameters to be learned. To parse a sentence, we label it and then decode the resulting label sequence into a constituent tree, i.e., we apply F_{|w|,θ} followed by Φ^{−1}_{|w|}. For the method to be correct, we need the encoding of trees to be complete (every tree in T_{|w|} must be expressible as a label sequence, i.e., Φ_{|w|} must be a total function, so we have full coverage of constituent trees) and injective (so that the inverse function Φ^{−1}_{|w|} is well-defined). Surjectivity is also desirable, so that the inverse is a function on L^{|w|−1}, and the parser outputs a tree for any sequence of labels that the classifier can generate.
We now define our Φ |w| and show that it is total and injective. Our encoding is not surjective per se. We handle ill-formed label sequences in §2.3.

The Encoding
Let w_i be a word located at position i in the sentence, for 1 ≤ i ≤ |w| − 1. We will assign it a 2-tuple label l_i = (n_i, c_i), where n_i is an integer that encodes the number of common ancestors between w_i and w_{i+1}, and c_i is the nonterminal symbol at their lowest common ancestor.
Basic encodings The number of common ancestors may be encoded in several ways.
1. Absolute scale: The simplest encoding is to make n_i directly equal to the number of ancestors in common between w_i and w_{i+1}.
2. Relative scale: A second and better variant consists in making n_i represent the difference with respect to the number of common ancestors encoded by the previous label, n_{i−1}. Its main advantage is that the size of the label set is reduced considerably. Figure 1 shows an example of a tree linearized according to both absolute and relative scales.
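The two scales can be illustrated with a minimal Python sketch. The tree below is a hypothetical example (it is not copied from Figure 1), the tuple-based tree representation and all function names are assumptions made for illustration, PoS tags are omitted, and the first relative value is assumed to be measured against zero.

```python
# Hypothetical example tree: a node is a tuple (label, child, ...) and a leaf
# is a plain string. PoS tags are left out to keep the sketch small.
TREE = ("S", ("NP", "the", "boy"),
             ("VP", "eats", ("NP", "the", "red", "toy")),
             ".")

def leaf_ancestors(tree, address=(), ancestors=()):
    """Yield, for each leaf from left to right, its ancestors from the root down."""
    if isinstance(tree, str):
        yield ancestors
        return
    label, *children = tree
    # The address (path of child indices) keeps same-labeled nodes distinct.
    here = ancestors + ((address, label),)
    for k, child in enumerate(children):
        yield from leaf_ancestors(child, address + (k,), here)

def encode_absolute(tree):
    """Return [(n_i, c_i)] for i = 1..|w|-1, with n_i on the absolute scale."""
    paths = list(leaf_ancestors(tree))
    labels = []
    for left, right in zip(paths, paths[1:]):
        n = 0
        while n < min(len(left), len(right)) and left[n] == right[n]:
            n += 1
        labels.append((n, left[n - 1][1]))  # label of the lowest common ancestor
    return labels

def to_relative(labels):
    """Rewrite each n_i as the difference with the previous absolute value."""
    relative, previous = [], 0  # assumption: the first value is relative to 0
    for n, c in labels:
        relative.append((n - previous, c))
        previous = n
    return relative

absolute = encode_absolute(TREE)
relative = to_relative(absolute)
```

For this tree the absolute values are 2, 1, 2, 3, 3, 1 and the relative ones 2, −1, 1, 1, 0, −2, so the relative scale reuses small offsets where the absolute scale needs one value per depth.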
Encoding for trees with exactly k children For trees where all branchings have exactly k children, it is possible to obtain an even more efficient linearization in terms of number of labels. To do so, we take the relative scale encoding as our starting point. If we build the tree incrementally in a left-to-right manner from the labels and we find a negative n_i, we will need to attach the word w_{i+1} (or a new subtree with that word as its leftmost leaf) to the (−n_i + 2)th node in the path going from w_i to the root. If every node must have exactly k children, there is only one valid negative value of n_i: the one pointing to the first node in said path that has not received its kth child yet. Any smaller value would leave this node without enough children (which cannot be fixed later due to the left-to-right order in which we build the tree), and any larger value would create a node with too many children. Thus, we can map negative values to a single label. Figure 2 shows an example for the case of binarized trees (k = 2).
Links to root Another variant emerged from the empirical observation that some tokens that are usually linked to the root node (such as the final punctuation in Figure 1) were particularly difficult to learn for the simpler baselines. To successfully deal with these cases in practice, it makes sense to consider a simplified annotation scheme where a node is assigned a special tag (ROOT, c_i) when it is directly linked to the root of the tree.
From now on, unless otherwise specified, we use the relative scale without the simplification for exactly k children. This will be the encoding used in the experiments (§4), because the size of its label set is significantly smaller than the one obtained with the absolute scale. Also, it works directly with non-binarized trees, in contrast to the encoding that we introduce for trees with exactly k children, which is described only for completeness and possible interest for future work. For the experiments (§4), we also use the special tag (ROOT, c_i) to further reduce the size of the label set and to simplify the classification of tokens connected to the root, where |n_i| is expected to be large.

Theoretical correctness
We now prove that Φ_{|w|} is a total function and injective for any tree in T_{|w|}. Recall that trees in this set have no unary branches. Later (in §2.3) we describe how we deal with unary branches. To prove correctness, we use the relative scale. Correctness for the other scales follows trivially.
Completeness Every pair of nodes in a rooted tree has at least one common ancestor, and a unique lowest common ancestor. Hence, for any tree in T_{|w|}, the label l_i = (n_i, c_i) defined in Section 2.1 is well-defined and unique for each word w_i, 1 ≤ i ≤ |w| − 1; and thus Φ_{|w|} is a total function from T_{|w|} to L^{|w|−1}.
Injectivity The encoding method must ensure that any given sequence of labels corresponds to exactly one tree. Otherwise, we have to deal with ambiguity, which is not desirable.
For simplicity, we will prove injectivity in two steps. First, we will show that the encoding is injective if we ignore nonterminals (i.e., equivalently, that the encoding is injective for the set of trees resulting from replacing all the nonterminals in trees in T |w| with a generic nonterminal X). Then, we will show that it remains injective when we take nonterminals into account.
For the first part, let τ ∈ T_{|w|} be a tree where nonterminals take a generic value X. We represent the label of the ith leaf node as •_i. Consider the representation of τ as a bracketed string, where a single-node tree with a node labeled A is represented by (A), and a tree rooted at R with child subtrees C_1 ... C_n is represented as (R C_1 ... C_n), with each C_j standing for the bracketed string of that subtree. Each leaf node will appear in this string as a substring (•_i). Thus, the parenthesized string has the form α_0 (•_1) α_1 (•_2) ... α_{|w|−1} (•_{|w|}) α_{|w|}, where the α_i are strings that can only contain brackets and nonterminals, as by construction there can be no leaf nodes between (•_i) and (•_{i+1}).
We now observe some properties of this parenthesized string. First, note that each of the substrings α_i must necessarily be composed of zero or more closing parentheses followed by zero or more opening parentheses with their corresponding nonterminal, i.e., it must be of the form [)]*[(X]*. This is because an opening parenthesis followed by a closing parenthesis would represent a leaf node, and there are no leaf nodes between (•_i) and (•_{i+1}) in the tree.
Thus, we can write α_i as α_{i)} α_{i(}, where α_{i)} is a string matching the expression [)]* and α_{i(} a string matching the expression [(X]*. With this, we can write the parenthesized string for τ as
α_{0)} α_{0(} (•_1) α_{1)} α_{1(} (•_2) ... α_{|w|−1)} α_{|w|−1(} (•_{|w|}) α_{|w|)} α_{|w|(}.
Let us now denote by β_i the string α_{i−1(} (•_i) α_{i)}. Then, taking into account that α_{0)} and α_{|w|(} are trivially empty in the previous expression due to bracket balancing, the expression for the tree becomes simply β_1 β_2 ... β_{|w|}, where we know, by construction, that each β_i is of the form [(X]* (•_i) [)]*.
Since we have shown that each tree in T_{|w|} uniquely corresponds to a string β_1 β_2 ... β_{|w|}, to show injectivity of the encoding, it suffices to show that different values for a β_i generate different label sequences. To show this, we can say more about the form of β_i: it must be either of the form [(X]* (•_i) or of the form (•_i) [)]*, i.e., it is not possible that β_i contains both opening parentheses before the leaf node and closing parentheses after the leaf node. This could only happen if the tree had a subtree of the form (X(•_i)), but this is not possible since we are forbidding unary branches.
Hence, we can identify each β_i with an integer number δ(β_i): 0 if β_i has neither opening nor closing parentheses outside the leaf node, +k if it has k opening parentheses, and −k if it has k closing parentheses. It is easy to see that δ(β_1) δ(β_2) ... δ(β_{|w|−1}) corresponds to the values n_i in the relative-scale label encoding of the tree τ. To see this, note that the number of unclosed parentheses at the point right after β_i in the string exactly corresponds to the number of common ancestors between the ith and (i + 1)th leaf nodes. A positive δ(β_i) = k corresponds to opening k parentheses before the leaf (•_i), so the number of common ancestors of w_i and w_{i+1} will be k more than that of w_{i−1} and w_i. A negative δ(β_i) = −k corresponds to closing k parentheses after the leaf (•_i), so the number of common ancestors will conversely decrease by k. A value of zero means no opening or closing parentheses, and no change in the number of common ancestors.
Thus, different parenthesized strings β 1 β 2 . . . β |w| generate different label sequences, which proves injectivity ignoring nonterminals (note that δ(β |w| ) does not affect injectivity as it is uniquely determined by the other values: it corresponds to closing all the parentheses that remain unclosed at that point).
It remains to show that injectivity still holds when nonterminals are taken into account. Since we have already proven that trees with different structure produce different values of n_i in the labels, it suffices to show that trees with the same structure, but different nonterminals, produce different values of c_i. Essentially, this reduces to showing that every nonterminal in the tree is mapped to some c_i. To see this, consider a tree τ ∈ T_{|w|}, and some nonterminal X in τ. Since trees in T_{|w|} do not have unary branches, X has at least two children. Consider the rightmost word in the first child subtree, and call it w_i. Then, w_{i+1} is the leftmost word in the second child subtree, and X is the lowest common ancestor of w_i and w_{i+1}. Thus, c_i = X, and a tree with identical structure but a different nonterminal at that position will generate a label sequence with a different value of c_i. This concludes the proof of injectivity.
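The proof reads each relative value as a number of brackets opened before a leaf or closed after it, which suggests a simple left-to-right decoder. The following Python sketch illustrates that reading under stated assumptions and is not the authors' implementation: the function names, the list-based node representation and the example sentence are hypothetical, the first value is assumed to be measured against zero, and when several labels name the same node the first nonterminal is kept.

```python
def decode_relative(words, labels):
    """Rebuild a tree from relative-scale labels [(n_i, c_i), ...]."""
    root = [None]                      # a node is [label, child, child, ...]
    stack = [root]                     # nodes that are still open
    for i, (word, (delta, nonterminal)) in enumerate(zip(words, labels)):
        # Assumption: n_1 counts from zero, and the root is already open.
        opens = delta - 1 if i == 0 else delta
        for _ in range(max(opens, 0)):      # "+k": open k new nodes
            node = [None]
            stack[-1].append(node)
            stack.append(node)
        stack[-1].append(word)
        for _ in range(max(-opens, 0)):     # "-k": close k nodes
            if len(stack) > 1:              # never close the root early
                stack.pop()
        # The innermost open node is now the lowest common ancestor of
        # w_i and w_{i+1}; if it already has a label, the first one wins.
        if stack[-1][0] is None:
            stack[-1][0] = nonterminal
    stack[-1].append(words[-1])             # the last word carries no label
    return freeze(root)

def freeze(node):
    """Convert the nested lists into nested tuples (label, child, ...)."""
    if isinstance(node, str):
        return node
    return (node[0], *[freeze(child) for child in node[1:]])

WORDS = ["the", "boy", "eats", "the", "red", "toy", "."]
LABELS = [(2, "NP"), (-1, "S"), (1, "VP"), (1, "NP"), (0, "NP"), (-2, "S")]
decoded = decode_relative(WORDS, LABELS)
```

Because every internal node of a tree without unary branches is the lowest common ancestor of some adjacent word pair, each opened node eventually receives a label when the sequence is well formed.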

Limitations
We have shown that our proposed encoding is a total, injective function from trees without unary branches with yield of length |w| to sequences of |w| − 1 labels. This will serve as the basis for our reduction of constituent parsing to sequence labeling. However, to go from theory to practice, we need to overcome two limitations of the theoretical encoding: non-surjectivity and the inability to encode unary branches. Fortunately, both can be overcome with simple techniques.
Handling of unary branches The encoding function Φ_{|w|} cannot directly assign the nonterminal symbols of unary branches, as there is no pair of words (w_i, w_{i+1}) that has them as its lowest common ancestor. Figure 3 illustrates this with an example.
It is worth remarking that this is not a limitation of our encoding, but of any encoding that would facilitate constituent parsing as sequence labeling, as the number of nonterminal nodes in a tree with unary branches is not bounded by any function of |w|. Our encoding works for trees without unary branches because such a tree cannot have more than |w| − 1 non-leaf nodes, and therefore it is always possible to encode all of them in labels associated with |w| − 1 leaf nodes.
Figure 3: An example of a tree that cannot be directly linearized with our approach (panels: Φ(T) and Φ^{−1}(Φ(T)), over the words w_1 ... w_5 and PoS tags T_1 ... T_5). w_i and T_i abstract over words and PoS tags. Dotted lines represent incorrect branches after applying and inverting our encoding naively without any adaptation for unaries. The nonterminal symbol of the second ancestor of w_2 (X) cannot be decoded, as no pair of words have X as their lowest common ancestor. A similar situation can be observed for the closest ancestor of w_5 (Z).
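The bound of |w| − 1 non-leaf nodes for trees without unary branches follows from a short counting argument, sketched here; the symbols m and n are introduced only for this sketch.

```latex
% m = number of non-leaf nodes, n = |w| = number of leaves (names assumed here)
\begin{align*}
\text{edges} &= (m + n) - 1 && \text{(a tree with } m + n \text{ nodes)} \\
\text{edges} &\ge 2m && \text{(every non-leaf node has at least two children)} \\
m + n - 1 &\ge 2m \;\Longrightarrow\; m \le n - 1 = |w| - 1
\end{align*}
```

With unary branches the second line no longer holds, which is why the number of nonterminal nodes can then grow independently of |w|.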
To overcome this issue, we follow a collapsing approach, as is common in parsers that need special treatment of unary chains (Finkel et al., 2008; Narayan and Cohen, 2016; Shen et al., 2018). For clarity, we use the name intermediate unary chains to refer to unary chains that end in a nonterminal symbol (e.g., X → Y in Figure 3) and leaf unary chains to name those that yield a PoS tag (e.g., Z → T_5). Intermediate unary chains are collapsed into a single chained symbol, which can be encoded by Φ_{|w|} as any other nonterminal symbol. On the other hand, leaf unary chains are collapsed together with the PoS tag, but these cannot be encoded and decoded by relying on Φ_{|w|}, as our encoding assumes a fixed sequence of leaf nodes and does not encode them explicitly. To overcome this, we propose two methods:
1. To use an extra function to enrich the PoS tags before applying our main sequence labeling function. This function is of the form Ψ_{|w|} : V^{|w|} → U^{|w|}, where U is the set of labels of the leaf unary chains (without including the PoS tags) plus a dummy label ∅. Ψ_{|w|} maps w_i to ∅ if there is no leaf unary chain at w_i, or to the collapsed label otherwise.
2. To extend our encoding function to predict them as a part of our labels l_i, by transforming them into 3-tuples (n_i, c_i, u_i) where u_i encodes the leaf unary chain collapsed label for w_i, if there is any, or none otherwise. We call this extended encoding function Φ′_{|w|}.
The former requires running two passes of sequence labeling to deal with leaf unary chains. The latter avoids this, but the number of labels is larger and sparser. In §4 we discuss how these two approaches behave in terms of accuracy and speed.
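The collapsing step can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' code: the tuple-based tree representation, the function name, the "+" separator for chained symbols and the use of None for the dummy label are all hypothetical, and the example tree is only meant to be consistent with the description of Figure 3 (PoS tags are omitted).

```python
def collapse_unaries(tree, leaf_chains):
    """Return a copy of `tree` without unary branches.

    A node is a tuple (label, child, ...); a leaf is a plain string.
    One entry per word is appended to `leaf_chains`: the collapsed label of
    its leaf unary chain, or None (standing in for the dummy label).
    """
    chain = []
    # Follow a run of single-child nodes, remembering their labels.
    while not isinstance(tree, str) and len(tree) == 2:
        chain.append(tree[0])
        tree = tree[1]
    if isinstance(tree, str):
        # Leaf unary chain (or a bare word when `chain` is empty).
        leaf_chains.append("+".join(chain) if chain else None)
        return tree
    # Intermediate unary chain: merge it into one chained symbol.
    label = "+".join(chain + [tree[0]])
    return (label, *[collapse_unaries(child, leaf_chains) for child in tree[1:]])

# Hypothetical tree: X -> Y is an intermediate chain, Z -> w5 a leaf chain.
TREE = ("S", "w1", ("X", ("Y", "w2", "w3")), "w4", ("Z", "w5"))
chains = []
collapsed = collapse_unaries(TREE, chains)
```

The list of per-word chain labels is what the first method would predict with a separate tagging pass, and what the second method would attach to each label as its third component.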
Non-surjectivity Our encoding, as defined formally in Section 2.1, is injective but not surjective, i.e., not every sequence of |w| − 1 labels of the form (n_i, c_i) corresponds to a tree in T_{|w|}. In particular, there are two situations where a label sequence formally has no tree, so that Φ^{−1}_{|w|} is not formally defined and we have to use extra heuristics or processing to define it:
• Sequences with conflicting nonterminals. A nonterminal can be the lowest common ancestor of more than one pair of contiguous words when branches are non-binary. For example, in the tree in Figure 1, the lowest common ancestor of both "the" and "red" and of "red" and "toy" is the same NP node. This translates into c_4 = NP, c_5 = NP in the label sequence. If we take that sequence and set c_5 = VP, we obtain a label sequence that does not strictly correspond to the encoding of any tree, as it contains a contradiction: two elements referencing the same node indicate different nonterminal labels. In practice, this problem is trivial to solve: when a label sequence encodes several conflicting nonterminals at a given position in the tree, we compute Φ^{−1}_{|w|} using the first such nonterminal and ignoring the rest.
• Sequences that produce unary structures. There are sequences of values n_i that do not correspond to a tree in T_{|w|} because the only tree structure satisfying the common ancestor conditions of their values (the one built by generating the string of β_i in the injectivity proof) contains unary branchings, causing the problem described above where we do not have a specification for every nonterminal. An example of this is the sequence (1, S), (3, Y), (1, S), (1, S) in absolute scaling, which was introduced in Figure 3. In practice, as unary chains have been previously collapsed, any generated unary node is considered invalid and removed.
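The second repair can be sketched as a small post-processing pass. The Python snippet below is a hypothetical illustration: the function name and tuple-based tree representation are assumptions, and the input is the structure one would obtain from the sequence (1, S), (3, Y), (1, S), (1, S), with the unlabeled unary node written as None.

```python
def remove_unary_nodes(tree):
    """Replace every node that has a single child by that child."""
    if isinstance(tree, str):          # a leaf (word) is kept as is
        return tree
    label, *children = tree
    children = [remove_unary_nodes(child) for child in children]
    if len(children) == 1:             # generated unary node: drop it
        return children[0]
    return (label, *children)

# The node labeled None was opened by the jump from 1 to 3 common ancestors
# but is never the lowest common ancestor of two adjacent words.
DECODED = ("S", "w1", (None, ("Y", "w2", "w3")), "w4", "w5")
repaired = remove_unary_nodes(DECODED)
```

After this pass every remaining node has at least two children, so the output is again a member of the set of trees without unary branches.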

Sequence Labeling
Sequence labeling is a structured prediction task that generates an output label for every token in an input sequence (Rei and Søgaard, 2018). Examples of practical tasks that can be formulated under this framework in natural language processing are PoS tagging, chunking or named-entity recognition, which are in general fast. However, to our knowledge, there is no previous work on sequence labeling methods for constituent parsing, as an encoding allowing it was lacking so far.
In this work, we consider methods ranging from traditional models to state-of-the-art neural models for sequence labeling, to test whether they are valid to train constituency-based parsers following our approach. We give the essential details needed to comprehend the core of each approach, but will mainly treat them as black boxes, referring the reader to the references for a careful and detailed mathematical analysis of each method. The appendix specifies additional hyperparameters for the tested models.
Preprocessing We add to every sentence both beginning and end tokens.

Traditional Sequence Labeling Methods
We consider two baselines to train our prediction function F_{|w|,θ}, based on popular sequence labeling methods used in NLP problems, such as PoS tagging or shallow parsing (Schmid, 1994; Sha and Pereira, 2003).
Conditional Random Fields (Lafferty et al., 2001) Let CRF_{|w|,θ} be its prediction function. A CRF model computes conditional probability distributions of the form p(l | w) such that CRF_θ(w) = l̂ = argmax_{l′} p(l′ | w). In our work, the inputs to the CRF are words and PoS tags. To represent a word w_i, we use information from the word itself and also contextual information from w_{[i−1:i+1]}. In particular:
• We extract the word form (lowercased), the PoS tag and its prefix of length 2, from w_{[i−1:i+1]}. For these words we also include binary features: whether it is the first word, the last word, a number, and whether the word is capitalized or uppercased.
• Additionally, for w_i we look at the suffixes of both length 3 and 2 (i.e., w_i[−3:] and w_i[−2:]).
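The feature set above can be sketched as a plain Python function that returns one feature dictionary per word. This is an illustration only: the function name and feature keys are hypothetical, the "prefix of length 2" is assumed to be taken from the PoS tag, and "a number" is approximated with `str.isdigit`.

```python
def word_features(words, tags, i):
    """Build a feature dictionary for the word at position i (0-based)."""
    features = {}
    for offset in (-1, 0, 1):                 # the window w[i-1 : i+1]
        j = i + offset
        if not 0 <= j < len(words):
            continue                          # no neighbor at sentence edges
        word, tag = words[j], tags[j]
        prefix = f"{offset:+d}:"              # "-1:", "+0:" or "+1:"
        features[prefix + "lower"] = word.lower()
        features[prefix + "tag"] = tag
        features[prefix + "tag[:2]"] = tag[:2]   # assumption: PoS tag prefix
        features[prefix + "is_first"] = j == 0
        features[prefix + "is_last"] = j == len(words) - 1
        features[prefix + "is_number"] = word.isdigit()
        features[prefix + "is_capitalized"] = word.istitle()
        features[prefix + "is_upper"] = word.isupper()
    # Suffixes are extracted only for the current word w_i.
    features["suffix3"] = words[i][-3:]
    features["suffix2"] = words[i][-2:]
    return features

example = word_features(["The", "boy", "eats"], ["DT", "NN", "VBZ"], 0)
```

A sentence is then represented as a list of such dictionaries, paired with the list of labels produced by the encoding.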
To build our CRF models, we relied on the sklearn-crfsuite library.
MultiLayer Perceptron (Rosenblatt, 1958) We use one hidden layer. Let MLP_{|w|,θ} be its prediction function; it treats sequence labeling as a set of independent predictions, one per word. The prediction for a word is computed as softmax(W_2 · relu(W_1 · x + b_1) + b_2), where x is the input vector and W_i and b_i are the weights and biases to be learned at layer i. We consider both a discrete (MLP_d) and an embedded (MLP_e) perceptron. For the former, we use as inputs the same set of features as for the CRF. For the latter, the vector x for w_i is defined as a concatenation of word and PoS tag embeddings from w_{[i−2:i+2]}. To build our MLPs, we relied on keras.
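The prediction formula can be checked with a small forward pass. The sketch below uses only the Python standard library instead of keras; the helper names and the toy weights are made up for illustration and carry no trained values.

```python
import math

def affine(W, x, b):
    """Compute W · x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    shift = max(v)                       # subtract the max for stability
    exps = [math.exp(a - shift) for a in v]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_predict(x, W1, b1, W2, b2):
    """softmax(W2 · relu(W1 · x + b1) + b2): one distribution over labels."""
    return softmax(affine(W2, relu(affine(W1, x, b1)), b2))

# Toy parameters: 2 inputs, 2 hidden units, 3 output labels.
W1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
W2, b2 = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [0.0, 0.0, 0.0]
probabilities = mlp_predict([1.0, -2.0], W1, b1, W2, b2)
best_label = max(range(len(probabilities)), key=probabilities.__getitem__)
```

Each word is scored independently of its neighbors' predictions, which is what makes this baseline fast but blind to label dependencies.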

Sequence Labeling Neural Models
We use NCRFpp, a sequence labeling framework based on recurrent neural networks (RNN) (Yang and Zhang, 2018), and more specifically on bidirectional long short-term memory networks (Hochreiter and Schmidhuber, 1997), which have been successfully applied to problems such as PoS tagging or dependency parsing (Plank et al., 2016; Kiperwasser and Goldberg, 2016). Let LSTM(x) be an abstraction of a standard long short-term memory network that processes the sequence x = [x_1, ..., x_{|x|}]; then a BILSTM encoding of its ith element, BILSTM(x, i), is defined as:
h_i = BILSTM(x, i) = LSTM_l(x_{[1:i]}) ◦ LSTM_r(x_{[|x|:i]}),
i.e., the concatenation (◦) of the outputs at position i of a left-to-right and a right-to-left LSTM. In the case of multilayer BILSTMs, the timestep outputs of BILSTM_m are fed as input to BILSTM_{m+1}. The output label for each w_i is finally predicted as softmax(W · h_i + b).
Given a sentence [w_1, w_2, ..., w_{|w|}], the input to the sequence model is a sequence of vectors, one per word, each being the concatenation w_i ◦ p_i ◦ ch_i, where w_i and p_i are a word and a PoS tag embedding, and ch_i is a word embedding obtained from an initial character embedding layer, also based on a BILSTM. Figure 4 shows the architecture of the network.

Experiments
We report results on models trained using the relative scale encoding and the special tag (ROOT, c_i). As a reminder, to deal also with leaf unary chains, we proposed two methods in §2.3: to predict them relying both on the encoding functions Φ_{|w|} and Ψ_{|w|}, or to predict them as a part of an enriched label predicted by the function Φ′_{|w|}. For clarity, we name these models with the superscripts Ψ,Φ and Φ′, respectively.
Datasets We use the Penn Treebank (Marcus et al., 1994) and its official splits: Sections 2 to 21 for training, 22 for development and 23 for testing. For the Chinese Penn Treebank (Xue et al., 2005): articles 001-270 and 440-1151 are used for training, articles 301-325 for development, and articles 271-300 for testing. We use the version of the corpus with the predicted PoS tags of Dyer et al. (2016). We train the Φ models based on the predicted output by the corresponding Ψ model.

Metrics
We use the F-score from the EVALB script using COLLINS.prm as the parameter file. Speed is measured in sentences per second. We briefly comment on the accuracy (percentage of correctly predicted labels, no symbol excluded here) of our baselines.
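To make the metric concrete, the following Python sketch computes a labeled bracketing F-score between a gold and a predicted tree. It is a simplification and not a substitute for EVALB: it ignores the rules configured in COLLINS.prm (for instance, the treatment of punctuation and label equivalences), and the tree representation, function names and example trees are assumptions made for illustration.

```python
from collections import Counter

def labeled_spans(tree):
    """Collect (label, start, end) for every nonterminal of a tuple tree."""
    spans = Counter()

    def visit(node, start):
        if isinstance(node, str):      # a word covers one position
            return start + 1
        end = start
        for child in node[1:]:
            end = visit(child, end)
        spans[(node[0], start, end)] += 1
        return end

    visit(tree, 0)
    return spans

def f_score(gold, predicted):
    """Harmonic mean of labeled bracketing precision and recall."""
    gold_spans, predicted_spans = labeled_spans(gold), labeled_spans(predicted)
    matched = sum((gold_spans & predicted_spans).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(predicted_spans.values())
    recall = matched / sum(gold_spans.values())
    return 2 * precision * recall / (precision + recall)

GOLD = ("S", ("NP", "the", "boy"), ("VP", "eats", ("NP", "the", "red", "toy")), ".")
PREDICTED = ("S", ("NP", "the", "boy"), ("VP", "eats", "the", "red", "toy"), ".")
score = f_score(GOLD, PREDICTED)
```

In this example the prediction misses one of four gold brackets, so precision is 1, recall is 0.75 and the F-score is 6/7.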
Source code It can be found at https://github.com/aghie/tree2labels
Hardware The models are run on a single thread of a CPU and on a consumer-grade GPU.
In their sequence-to-sequence work, Vinyals et al. (2015) use a multi-core CPU (the number of threads was not specified), while we provide results on a single core for easier comparability. Parsing sentences on a CPU can be framed as an "embarrassingly parallel" problem (Hall et al., 2014), so speed can be made to scale linearly with the number of cores. We use the same batch size as Vinyals et al. (2015) for testing (128).

Results
Table 1 shows the performance of our baselines on the PTB development set. It is worth noting that since we are using different libraries to train the models, these might show some differences in terms of performance/speed beyond those expected in theory. For the BILSTM model we test:
• BILSTM_{m=1}: It uses neither pretrained word embeddings nor character embeddings. The number of layers m is set to 1.
• BILSTM_{m=1,e}: It adds pretrained word embeddings from GloVe (Pennington et al., 2014) for English and from the Gigaword corpus for Chinese (Liu and Zhang, 2017).
• BILSTM_{m=2,e,ch}: It additionally uses character embeddings (ch), and m is set to 2.
The Ψ,Φ and the Φ′ models obtain similar F-scores. When it comes to speed, the BILSTMs^{Φ′} are notably faster than the BILSTMs^{Ψ,Φ}. Φ′ models are expected to be more efficient, as leaf unary chains are handled implicitly. In practice, Φ′ is a more expensive function to compute than the original Φ, since the number of output labels is significantly larger, which reduces the expected gains with respect to the Ψ,Φ models. It is worth noting that our encoding is useful to train an MLP_e with a decent sense of phrase structure, while being very fast.
Paying attention to the differences between F-score and accuracy for each baseline, we notice that the gap between them is larger for CRFs and MLPs. This shows the difficulties that these methods have, in comparison to the BILSTM approaches, in predicting the correct label when a word w_{i+1} has few common ancestors with w_i. For example, let -10X be the correct (relative scale) label between w_i and w_{i+1}, and let l_1 = -1X and l_2 = -9X be two possible wrong labels. In terms of accuracy, it makes no difference whether a model predicts l_1 or l_2, but in terms of constituent F-score, the first will be much worse, as many closing parentheses will remain unmatched.
Tables 2 and 3 compare our best models against the state of the art on the PTB and CTB test sets. The performance corresponds to models without reranking strategies, unless otherwise specified.

Discussion
We are not aware of previous work that reduces constituency parsing to sequence labeling. The closest work to ours is that of Vinyals et al. (2015), who address it as a sequence-to-sequence problem, where the output sequence has variable/unknown length. In this context, even a perceptron with one hidden layer outperforms their 3-layer LSTM model without attention, while parsing hundreds of sentences per second. Our best models also outperform their 3-layer LSTM model with attention, and even a simple BILSTM model with pretrained GloVe embeddings obtains a similar performance. In terms of F-score, the proposed sequence labeling baselines still lag behind mature shift-reduce and chart parsers. In terms of speed, they are clearly faster than both CPU and GPU chart parsers and are at least on par with the fastest shift-reduce ones. If a phrase structure representation is needed in large-scale tasks where the speed of current systems makes parsing infeasible (Gómez-Rodríguez, 2017), we can use the simpler, less accurate models to get speeds well above any parser reported to date, although with a significant loss of accuracy.
It is also worth noting that in their recent work, published while this manuscript was under review, Shen et al. (2018) developed a mapping of binary trees with n leaves to sequences of n − 1 integers (Shen et al., 2018, Algorithm 1). This encoding is different from the ones presented here, as it is based on the height of lowest common ancestors in the tree, rather than their depth. While their purpose is also different from ours, as they use this mapping to generate training data for a parsing algorithm based on recursive partitioning using real-valued distances, their encoding could also be applied with our sequence labeling approach. However, it has the drawback that it only supports binarized trees, and some of its theoretical properties are worse for our goal, as the way to define the inverse of an arbitrary label sequence can be highly ambiguous: for example, a sequence of n − 1 equal labels in this encoding can represent any binary tree with n leaves.
Tables 2 and 3: Comparison against the state of the art on the PTB and CTB test sets (speed in sentences per second and F-score). Among the listed systems, the sequence-to-sequence parser of Vinyals et al. (2015) runs on a multi-core CPU (number of cores not specified) at 120 sentences per second with an F-score of 88.3. Table notes: Stern et al. (2017) report that they use a 16-core machine, but sentences are processed one at a time. Hence, they do not exploit inter-sentence parallelism, but they may gain some speed from intra-sentence parallelism. Some speeds were reported in the papers themselves; those marked with * and £ were extracted from Zhu et al. (2013) and Fernández and Gómez-Rodríguez (2018).

Conclusion
We presented a new parsing paradigm, based on a reduction of constituency parsing to sequence labeling. We first described a linearization function to transform a constituent tree (with n leaves) into a sequence of n − 1 labels that encodes it. We proved that this encoding function is total and injective for any tree without unary branches. We also discussed its limitations: how to deal with unary branches and non-surjectivity, and showed how these can be solved. We finally proposed a set of fast and strong baselines.
(ED431B 2017/01). We gratefully acknowledge NVIDIA Corporation for the donation of a GTX Titan X GPU.