An Empirical Study of Building a Strong Baseline for Constituency Parsing

This paper investigates the construction of a strong baseline based on general purpose sequence-to-sequence models for constituency parsing. We incorporate several techniques that were mainly developed in natural language generation tasks, e.g., machine translation and summarization, and demonstrate that the sequence-to-sequence model achieves the current top-notch parsers’ performance (almost) without requiring any explicit task-specific knowledge or architecture of constituent parsing.


Introduction
Sequence-to-sequence (Seq2seq) models have successfully improved many well-studied NLP tasks, especially for natural language generation (NLG) tasks, such as machine translation (MT) (Sutskever et al., 2014;Cho et al., 2014) and abstractive summarization (Rush et al., 2015). Seq2seq models have also been applied to constituency parsing (Vinyals et al., 2015) and provided a fairly good result. However one obvious, intuitive drawback of Seq2seq models when they are applied to constituency parsing is that they have no explicit architecture to model latent nested relationships among the words and phrases in constituency parse trees, Thus, models that directly model them, such as RNNG (Dyer et al., 2016), are an intuitively more promising approach. In fact, RNNG and its extensions (Kuncoro et al., 2017;Fried et al., 2017) provide the current stateof-the-art performance. Sec2seq models are currently considered a simple baseline of neuralbased constituency parsing.
After the first proposal of an Seq2seq constituency parser, many task-independent techniques have been developed, mainly in the NLG research area. Our aim is to update the Seq2seq approach proposed in Vinyals et al. (2015) as a stronger baseline of constituency parsing. Our motivation is basically identical to that described in Denkowski and Neubig (2017). A strong baseline is crucial for reporting reliable experimental results. It offers a fair evaluation of promising new techniques if they solve new issues or simply resolve issues that have already been addressed by current generic technology. More specifically, it might become possible to analyze what types of implicit linguistic structures are easier or harder to capture for neural models by comparing the outputs of strong Seq2seq models and task-specific models, e.g., RNNG.
The contributions of this paper are summarized as follows: (1) a strong baseline for constituency parsing based on general purpose Seq2seq models 1 , (2) an empirical investigation of several generic techniques that can (or cannot) contribute to improve the parser performance, (3) empirical evidence that Seq2seq models implicitly learn parse tree structures well without knowing taskspecific and explicit tree structure information.

Constituency Parsing by Seq2seq
Our starting point is an RNN-based Seq2seq model with an attention mechanism that was applied to constituency parsing (Vinyals et al., 2015). We omit detailed descriptions due to space limitations, but note that our model architecture is identical to the one introduced in Luong et al. (2015a) 2 .
A key trick for applying Seq2seq models to constituency parsing is the linearization of parse 1 Our code and experimental configurations for reproducing our experiments are publicly available: https://github.com/nttcslab-nlp/strong s2s baseline parser 2 More specifically, our Seq2seq model follows the one implemented in seq2seq-attn (https://github.com/harvardnlp/seq2seq-attn), which is the alpha-version of the OpenNMT tool (http://opennmt.net).

Original input
John has a dog . Output: S-exp.
(S (NP NNP ) (VP VBZ (NP DT NN ) ) . ) Linearized form (S (NP NNP )NP (VP VBZ (NP DT NN )NP )VP . )S w/ POS normalized (S (NP XX )NP (VP XX (NP XX XX )NP )VP . )S Table 1: Examples of linearization and POS-tag normalization (Vinyals et al., 2015) trees (Vinyals et al., 2015). Roughly speaking, a linearized parse tree consists of open, close bracketing and POS-tags that correspond to a given input raw sentence. Since a one-to-one mapping exists between a parse tree and its linearized form (if the linearized form is a valid tree), we can recover parse trees from the predicted linearized parse tree. Vinyals et al. (2015) also introduced the part-of-speech (POS) tag normalization technique. They substituted each POS tag in a linearized parse tree to a single XX-tag 3 , which allows Seq2seq models to achieve a more competitive performance range than the current state-ofthe-art parses 4 . Table 1 shows an example of a parse tree to which linearization and POS-tag normalization was applied.

Task-independent Extensions
This section describes several generic techniques that improve Seq2seq performance 5 . Table 2 lists the notations used in this paper for a convenient reference.

Subword as input features
Applying subword decomposition has recently become a leading technique in NMT literature (Sennrich et al., 2016;Wu et al., 2016). Its primary advantage is a significant reduction of the serious out-of-vocabulary (OOV) problem. We incorporated subword information as an additional feature of the original input words. A similar usage of subword features was previously proposed in Bojanowski et al. (2017).
Formally, the encoder embedding vector at encoder position i, namely, e i , is calculated as follows: 3 We did not substitute POS-tags for punctuation symbols such as ".", and ",".
4 Several recently developed neural-based constituency parsers ignore POS tags since they are not evaluated in the standard evaluation metric of constituency parsing (Bracketing F-measure). 5 Figure in the supplementary material shows the brief sketch of the method explained in the following section. D : dimension of the embeddings H : dimension of the hidden states i : index of the (token) position in input sentence j : index of the (token) position in output linearized format of parse tree V (e) : vocabulary of word for input (encoder) side V (s) : vocabulary of subword for input (encoder) side E : encoder embedding matrix for V (e) , where E ∈ R D×|V (e) | F : encoder embedding matrix for V (s) , where F ∈ R D×|V (s) | wi : i-th word (token) in the input sentence, wi ∈ V (e) x k : one-hot vector representation of the k-th word in V (e) s k : one-hot vector representation of the k-th subword in V (s) u : encoder embedding vector of unknown token φ(·) : function that returns the index of given word in the vocabulary V (e) ψ(·) : function that returns a set of indices in the subword vocabulary V (s) generated from the given word. e.g., k ∈ ψ(wi) ei : encoder embedding vector at position i in encoder V (d) : vocabulary of output with POS-tag normalization V (q) : vocabulary of output without POS-tag normalization zj : final hidden vector calculated at the decoder position j oj : final decoder output scores at decoder position j qj : output scores of auxiliary task at decoder position j b : additional bias term in the decoder output layer for mask pj : vector format of output probability at decoder position j A : number of models for ensembling C : number of candidates generating for LM-reranking Table 2: List of notations used in this paper.
where k = φ(w i ). Note that the second term of RHS indicates our additional subword features, and the first represents the standard word embedding extraction procedure. Among several choices, we used the byte-pair encoding (BPE) approach proposed in Sennrich et al. (2016) applying 1,000 merge operations 6 .

Unknown token embedding as a bias
We generally replace rare words, e.g., those appearing less than five times in the training data, with unknown tokens in the Seq2seq approach. However, we suspect that embedding vectors, which correspond to unknown tokens, cannot be trained well for the following reasons: (1) the occurrence of unknown tokens remains relatively small in the training data since they are obvious replacements for rare words, and (2) Seq2seq is relatively ineffective for training infrequent words (Luong et al., 2015b). Based on these observations, we utilize the unknown embedding as a bias term b of linear layer (W x + b) when obtaining every encoder embeddings for overcoming infrequent word problem. Then, we modify Eq. 2 as follows: Note that if w i is unknown token, then Eq. 2 becomes e i = 2u + k ∈ψ(w i ) (F s k + u).

Multi-task learning
Several papers on the Seq2seq approach (Luong et al., 2016) have reported that the multi-task learning extension often improves the task performance if we can find effective auxiliary tasks related to the target task. From this general knowledge, we re-consider jointly estimating POS-tags by incorporating the linearized forms without the POS-tag normalization as an auxiliary task. In detail, the linearized forms with and without the POS-tag normalization are independently and simultaneously estimated as o j and q j , respectively, in the decoder output layer by following equation:

Output length controlling
As described in Vinyals et al. (2015), not all the outputs (predicted linearized parse trees) obtained from the Seq2seq parser are valid (well-formed) as a parse tree. Toward guaranteeing that every output is a valid tree, we introduce a simple extension of the method for controlling the Seq2seq output length (Kikuchi et al., 2016). First, we introduce an additional bias term b in the decoder output layer to prevent the selection of certain output words: If we set a large negative value at the m-th element in b, namely b m ≈ −∞, then the m-th element in p j becomes approximately 0, namely p j,m ≈ 0, regardless of the value of the k-th element in o j . We refer to this operation to set value −∞ in b as a mask. Since this naive masking approach is harmless to GPU-friendly processing, we can still exploit GPU parallelization. We set b to always mask the EOS-tag and change b when at least one of the following conditions is satisfied: (1) if the number of open and closed brackets generated so far is the same, then we mask the XX-tags (or the POS-tags) and all the closed brackets.
(2) if the number of predicted XX-tags (or POS-tags) is equivalent to that of the words in a given input sentence, then we mask the XX-tags (or all the POS-tags) and all the open brackets. If both conditions (1) and (2) are satisfied, then the decoding process is finished. The additional cost for controlling the mask is to count the number of XX-tags and the open and closed brackets so far generated in the decoding process.

Pre-trained word embeddings
The pre-trained word embeddings obtained from a large external corpora often boost the final task performance even if they only initialize the input embedding layer. In constituency parsing, several systems also incorporate pre-trained word embeddings, such as Vinyals et al. (2015); Durrett and Klein (2015). To maintain as much reproducibility of our experiments as possible, we simply applied publicly available pre-trained word embeddings, i.e., glove.840B.300d 7 , as initial values of the encoder embedding layer.

Model ensemble
Ensembling several independently trained models together significantly improves many NLP tasks.
In the ensembling process, we predict the output tokens using the arithmetic mean of predicted probabilities computed by each model: where p (a) j represents the probability distribution at position j predicted by the a-th model.

Language model (LM) reranking
Choe and Charniak (2016) demonstrated that reranking the predicted parser output candidates with an RNN language model (LM) significantly improves performance. We refer to this reranking process as LM-rerank. Following their success, we also trained RNN-LMs on the PTB dataset with their published preprocessing code 8 to reproduce the experiments in Choe and Charniak (2016) for our LM-rerank. We selected the current stateof-the-art LM (Yang et al., 2018) 9 as our LMreranker, which is a much stronger LM than was used in Choe and Charniak (2016).  Table 4: Results on English PTB data: Results were average (ave), worst (min), and best (max) performance of ten models independently trained with distinct random initial values. Test data was only evaluated on baseline and our best setting ((a), (e), (f) and (j)) to prevent over-tuning to the test data. We confirmed that all our results contained no malformed parse trees.

Experiments
Our experiments used the English Penn Treebank data (Marcus et al., 1994), which are the most widely used benchmark data in the literature. We used the standard split of training (Sec.02-21), development (Sec.22), and test data (Sec.23) and strictly followed the instructions for the evaluation settings explained in Vinyals et al. (2015). For data pre-processing, all the parse trees were transformed into linearized forms, which include standard UNK replacement for OOV words and POS-tag normalization by XX-tags. As explained in Vinyals et al. (2015), we did not apply any parse tree binarization or special unary treatment, which were used as common techniques in the literature.  ments unless otherwise specified. proximately 3 point gain from the baseline conventional Seq2seq model (a) on test data Bra.F. One drawback of Seq2seq approach is that it seems sensitive to initialization. Comparing only with a single result for each setting may produce inaccurate conclusions. Therefore, we should evaluate the performances over several trials to improve the evaluation reliability.

Results
The baseline Seq2seq models, (a) and (f), produced the malformed parse trees. We postprocessed such malformed parse trees by simple rules introduced in (Vinyals et al., 2015). On the other hand, we confirmed that all the results applying the technique explained in Sec. 3.4 produced no malformed parse trees. Ensembling and Reranking: Table 5 shows the results of our models with model ensembling and LM-reranking. For ensemble, we randomly selected eight of the ten Seq2seq models reported in Table 4. For LM-reranking, we first generated 80 candidates by the above eight ensemble models and selected the best parse tree for each input in terms of the LM-reranker. The results in Table 5 were taken from a single-shot evaluation, unlike the averages of ten independent runs in Table 4. Hyper-parameter selection: We empirically investigated the impact of the hyper-parameter selections. Table 6 shows the results. The following observations appear informative for building strong baseline systems: (1) Smaller mini-batch size M and gradient clipping G provided the better performance. Such settings lead to slower and longer training, but higher performance. (2) Larger layer size, hidden state dimension, and beam size have little impact on the performance; our setting, L = 2, H = 200, and B = 5 looks adequate in terms of speed/performance trade-off. Input unit selection: As often demonstrated in the NMT literature, using subword split as input token unit instead of standard tokenized word unit has potential to improve the performance. Table 6 (e) shows the results of utilizing subword splits. Clearly, 8K and 16K subword splits as input token units significantly degraded the performance. It seems that the numbers of XX-tags in output and tokens in input should keep consistent for better performance since Seq2seq models look to somehow learn such relationship, and used it during the decoding. Thus, using subword information as features is one promising approach for leveraging subword information into constituency parsing. Table 7 lists the reported constituency parsing scores on PTB that were recently published in the literature. We split the results into three categories. The first category (top row) contains the results of the methods that were trained only from the pre-defined training data (PTB Sec.02-21), without any additional resources. The second category (middle row) consists of the results of methods that were trained from the pre-defined PTB training data as well as those listed in the top row, but incorporating word embeddings obtained from a large-scale external corpus to initialize the encoder embedding layer. The third category (bottom row) shows the performance of the methods that were trained using high-confidence, auto-parsed trees in addition to the pre-defined PTB training data.

Comparison to current top systems
Our Seq2seq approach successfully achieved the competitive level as the current top-notch methods: RNNG and its variants. Note here that, as described in Dyer et al. (2016), RNNG uses Berkeley parser's mapping rules for effectively handling singleton words in the training corpus. In contrast, we demonstrated that Seq2seq models have enough power to achieve a competitive stateof-the-art performance without leveraging such task-dependent knowledge. Moreover, they need no explicit information of parse tree structures, transition states, stacks, (Stanford or Berkeley) mapping rules, or external silver training data during the model training except general purpose word embeddings as initial values. These observations from our experiments imply that recently developed Seq2seq models have enough ability to implicitly learn parsing structures from linearized parse trees. Our results argue that Seq2seq models can be a strong baseline for constituency parsing.

Conclusion
This paper investigated how well general purpose Seq2seq models can achieve the higher performance of constituency parsing as a strong baseline method. We incorporated several generic techniques to enhance Seq2seq models, such as incorporating subword features, and output length controlling. We experimentally demonstrated that by applying ensemble and LM-reranking techniques, a general purpose Seq2seq model achieved almost the same performance level as the state-of-the-art constituency parser without any task-specific or explicit tree structure information.