Improving Constituency Parsing with Span Attention

Constituency parsing is a fundamental and important task for natural language understanding, where a good representation of contextual information can help this task. N-grams, a conventional type of feature for modeling contextual information, have been demonstrated to be useful in many tasks, and thus could also benefit constituency parsing if they are appropriately modeled. In this paper, we propose span attention for neural chart-based constituency parsing to leverage n-gram information. Considering that current chart-based parsers with Transformer-based encoders represent spans by the subtraction of the hidden states at the span boundaries, which may cause information loss, especially for long spans, we incorporate n-grams into span representations by weighting them according to their contributions to the parsing process. Moreover, we propose categorical span attention to further enhance the model by weighting n-grams within different length categories, which benefits long-sentence parsing. Experimental results on three widely used benchmark datasets demonstrate the effectiveness of our approach in parsing Arabic, Chinese, and English, where our approach obtains state-of-the-art performance on all of them.


Introduction
Constituency parsing, which aims to generate a structured syntactic parse tree for a given sentence, is one of the most fundamental tasks in natural language processing (NLP), and plays an important role in many downstream tasks such as relation extraction (Jiang and Diesner, 2019), natural language inference (Chen et al., 2017), and machine translation (Ma et al., 2018).

[Figure 1: The treelet of an example of the form "V+NP+PP", where the "PP" should attach to the "V" (in green) rather than the "NP" (in red).]

Recently, neural parsers (Vinyals et al., 2015; Dyer et al., 2016; Stern et al., 2017), which do not use any grammar rules, have significantly outperformed conventional statistical grammar-based ones (Collins, 1997; Sagae and Lavie, 2005; Glaysher and Moldovan, 2006), because neural networks, especially recurrent models (e.g., Bi-LSTM), are adept at capturing long-range contextual information, which is essential to modeling the entire sentence. Particularly, a significant boost in the performance of chart-based parsers is observed in some recent studies (Kitaev and Klein, 2018; Zhou and Zhao, 2019) that employ advanced text encoders (i.e., Transformer, BERT, and XLNet), which further demonstrates the usefulness of contexts for parsing.
In general, besides powerful encoders, other extra information (such as pre-trained embeddings and extra syntactic information) can also provide useful contextual information and thus enhance model performance in many NLP tasks (Pennington et al., 2014; Song et al., 2018a; Mrini et al., 2019; Tian et al., 2020a,b). As one type of such extra information, n-grams are used as a simple yet effective source of contextual features in many studies (Song and Xia, 2012; Yoon et al., 2018; Tian et al., 2020c). Therefore, they could potentially benefit parsing as well. However, recent chart-based parsers (Stern et al., 2017; Kitaev and Klein, 2018; Gaddy et al., 2018; Zhou and Zhao, 2019) make rare effort to leverage such n-gram information.

[Figure 2: The architecture of the chart-based constituency parser with span attention, with an example partial input sentence and its output. The right part of the figure shows the categorical span attention, where extracted n-grams in span (i, j) are categorized by their length so that n-grams in different categories are weighted separately (different colors refer to different n-gram categories). Note that for normal span attention, all n-grams are weighted together, where attention a_{i,j} directly corresponds to e_{i,j,·} in the figure.]

Another potential issue with current chart-based parsers is that they represent spans by the subtraction of hidden states at the span boundaries, where the contextual information in-between may be lost and thus hurt parsing performance, especially for long sentences. N-grams can be a simple yet useful source to fill in the missing information. For instance, Figure 1 illustrates the treelet of an example in the form of "V+NP+PP". As a classic example of PP-attachment ambiguity, a parser may wrongly attach the "PP" to the "NP" if it only focuses on the words at the boundaries of the text span "flag ... year" and the in-between information is not represented properly. In this case, n-grams within that span (e.g., the uni-gram "telescope") can provide useful cues indicating that the "PP" should be attached to the "V". Although there are traditional non-neural parsers using n-grams as features to improve parsing (Sagae and Lavie, 2005; Pitler et al., 2010), they are limited in treating them equally without learning their weights. Therefore, unimportant n-grams may deliver misleading information and lead to wrong predictions.
To address this problem, in this paper, we propose a span attention module to enhance chart-based neural constituency parsing by incorporating appropriate n-grams into span representations. Specifically, for each text span we extract all its substrings that appear in an n-gram lexicon; the span attention uses the normal attention mechanism to weight them with respect to their contributions to predicting the constituency label of the span. Because in general short n-grams occur more frequently than long ones, they may dominate the attention if all n-grams are globally weighted. We further enhance our approach with a categorical mechanism that first groups n-grams into different categories according to their length and then weights them within each category. Thus, n-grams of different lengths are treated separately, and the infrequent long ones carrying more contextual information can be better leveraged. The effectiveness of our approach is illustrated by experimental results on three benchmark datasets from different languages (i.e., Arabic, Chinese, and English), on all of which state-of-the-art performance is achieved.

The Approach
Our approach follows the chart-based paradigm for constituency parsing, where the parse tree T of an input sentence X = x_1 x_2 · · · x_i · · · x_j · · · x_q is represented as a set of labeled spans. A span is denoted by a triplet (i, j, l), with i and j referring to the beginning and ending positions of the span and l ∈ L its label. Here, L is the label set containing d_l constituent types. The architecture of our approach is shown in Figure 2. The left side is the backbone chart-based parser. It assigns real-valued scores s(i, j, l) to the labeled spans, then computes the score of a candidate tree by summing up the scores of all its spans, and finally chooses the valid tree T* with the highest score by

T* = argmax_T Σ_{(i,j,l)∈T} s(i, j, l)   (1)

The right side of Figure 2 shows the proposed span attention that enhances the backbone parser, where n-grams in X are extracted according to a pre-constructed lexicon N and are weighted through the attention module according to their contribution to the parsing process. Therefore, the process of computing s(i, j, l) for the labeled spans through our approach is formalized by

s(i, j, l) = p(l | SA(X_i^j))   (2)

where X_i^j is the text in range [i, j] of X, SA represents the span attention module, and p computes the probability of assigning label l ∈ L to the span (i, j).
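The tree-scoring scheme above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation: `score` is a hypothetical lookup from spans to per-label scores, and candidate trees are given explicitly rather than enumerated by the decoder.

```python
def tree_score(tree_spans, score):
    """s(T): the sum of labeled span scores over all spans in the tree."""
    return sum(score[(i, j)][l] for (i, j, l) in tree_spans)

def best_tree(candidate_trees, score):
    """T* = argmax over candidate trees of s(T)."""
    return max(candidate_trees, key=lambda t: tree_score(t, score))
```

In practice the argmax is never taken over an explicit list of trees; the CYK-style decoder described below searches the space of valid trees implicitly.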
In this section, we start with a brief introduction of neural chart-based parsing, then describe our span attention, and end with an illustration of incorporating span attention into the parsing process.

Neural Chart-based Parsing
Recent neural chart-based parsers (Stern et al., 2017; Kitaev and Klein, 2018; Zhou and Zhao, 2019) follow the encoder-decoder paradigm, where the encoder receives X and generates a sequence of context-sensitive hidden vectors (denoted as h_i and h_j for x_i and x_j, respectively), which are used to compute the span representation r_{i,j} ∈ R^{d_r} for (i, j) by subtraction: r_{i,j} = h_j − h_i. This span representation assumes that, for a recurrent model (e.g., LSTM), the hidden vector at each time step relies on the previous ones, so that such subtraction could, to some extent, capture the contextual information of all the words in that span. For decoders, most recent neural chart-based parsers follow the strategy proposed by Stern et al. (2017), where all span representations r_{i,j} are fed into a variant of the CYK algorithm to generate a globally optimized tree for each sentence. Normally, r_{i,j} is fed into multi-layer perceptrons (MLP) to compute its scores s(i, j, ·) over the label set L. Afterwards, a recursion function is applied to find the highest score s*(i, j) of span (i, j), which is computed by searching for the best constituency label and the corresponding split point k (i < k < j) by

s*(i, j) = max_{l∈L} s(i, j, l) + max_{i<k<j} [s*(i, k) + s*(k, j)]   (3)

Note that in the special case where j = i + 1, the best score only relies on the candidate label:

s*(i, j) = max_{l∈L} s(i, j, l)   (4)

Therefore, to parse the entire sentence, one computes s*(1, q) through the above steps and uses a back pointer to recover the full tree structure.
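The recursion above can be sketched as a bottom-up dynamic program. This is a minimal illustration: the hypothetical `score` dict stands in for the MLP outputs s(i, j, ·), and positions follow a 1-based convention with spans (i, j), 1 ≤ i < j ≤ q.

```python
def decode(score, q):
    """CYK-style decoding sketch.

    `score[(i, j)]` maps a span (i, j) to a dict {label: s(i, j, l)}.
    Returns `best` (highest span scores s*(i, j)) and `back` (back
    pointers (label, split point k) for recovering the tree).
    """
    best, back = {}, {}
    # process spans shortest-first so that sub-spans are always ready
    for length in range(1, q):
        for i in range(1, q - length + 1):
            j = i + length
            label, label_score = max(score[(i, j)].items(), key=lambda kv: kv[1])
            if length == 1:
                # single-word span: only the label choice matters (Eq. 4)
                best[(i, j)] = label_score
                back[(i, j)] = (label, None)
            else:
                # search for the best split point k with i < k < j (Eq. 3)
                k = max(range(i + 1, j), key=lambda k: best[(i, k)] + best[(k, j)])
                best[(i, j)] = label_score + best[(i, k)] + best[(k, j)]
                back[(i, j)] = (label, k)
    return best, back
```

Following the back pointers from (1, q) downward recovers the full tree.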

Span Attention
Although encoding spans by the subtraction of hidden states has been demonstrated to be effective (Stern et al., 2017; Kitaev and Klein, 2018), the subtraction might not represent all the crucial information in the text span. Especially for Transformer-based encoders, unlike recurrent models, h_i and h_j have no strong dependency on each other, so that subtraction may fail to fully capture the contextual information in the span, especially when the span is long. Since n-grams are a good source of the information in the text span, we propose span attention to incorporate weighted n-gram information into span representations to help score the spans (i, j, l). In detail, for each span (i, j) in X, we extract all n-grams in that span that appear in the lexicon N to form a set C_{i,j} = {c_{i,j,1}, c_{i,j,2}, · · ·, c_{i,j,v}, · · ·, c_{i,j,m_{i,j}}} and use the set in span attention. The attention weight of each n-gram c_{i,j,v} is computed by

a_{i,j,v} = exp(r_{i,j} · e_{i,j,v}) / Σ_{v'=1}^{m_{i,j}} exp(r_{i,j} · e_{i,j,v'})   (5)

where e_{i,j,v} ∈ R^{d_r} is the embedding of c_{i,j,v}, whose dimension is identical to that of r_{i,j}. The resulting attention vector a_{i,j} ∈ R^{d_r} is thus computed as the weighted average of the n-gram embeddings by

a_{i,j} = Σ_{v=1}^{m_{i,j}} a_{i,j,v} e_{i,j,v}   (6)

and it is used to enhance the span representation.
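A minimal numpy sketch of the n-gram extraction and attention weighting follows. It assumes dot-product scoring between the span representation and the n-gram embeddings (the paper's exact scoring function is not spelled out in this text, so that choice is an assumption); `lexicon` and `embeddings` are hypothetical stand-ins for N and the n-gram embedding table.

```python
import numpy as np

def extract_ngrams(words, i, j, lexicon, max_n=5):
    """All n-grams inside span [i, j] (1-based, inclusive) found in the lexicon."""
    out = []
    for a in range(i, j + 1):
        for n in range(1, max_n + 1):
            if a + n - 1 > j:
                break
            gram = " ".join(words[a - 1 : a + n - 1])
            if gram in lexicon:
                out.append(gram)
    return out

def span_attention(r_ij, ngrams, embeddings):
    """Softmax-weighted average of matched n-gram embeddings.

    r_ij:       span representation, shape (d_r,)
    ngrams:     n-grams extracted from span (i, j), i.e. the set C_{i,j}
    embeddings: dict mapping n-gram -> vector of shape (d_r,)
    """
    if not ngrams:
        return np.zeros_like(r_ij)          # no match: contribute nothing
    E = np.stack([embeddings[c] for c in ngrams])   # (m, d_r)
    logits = E @ r_ij                                # score each n-gram
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                         # softmax over n-grams
    return weights @ E                               # weighted average
```

The returned vector plays the role of a_{i,j} and is later concatenated with r_{i,j}.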
In normal attention, all n-grams are weighted globally, and short n-grams may dominate the attention because they occur much more frequently than long ones and are thus intensively updated. However, there are cases where long n-grams can play an important role in parsing when they carry useful context and boundary information. Therefore, we extend the span attention with a category mechanism (namely, categorical span attention) by grouping n-grams based on their lengths and weighting them within each category. In doing so, all n-grams in N are categorized into n groups according to their lengths, i.e., C_{i,j} = {C_{i,j,1}, C_{i,j,2}, · · ·, C_{i,j,u}, · · ·, C_{i,j,n}}, with u ∈ [1, n] denoting the n-gram length. Then, for each category with n-grams of length u, we follow the same process as in Eq. (5) and (6) to compute a^{(u)}_{i,j,v} and a^{(u)}_{i,j}. The final attention vector is obtained from the concatenation of all categorical attentions by

a_{i,j} = ⊕_{u=1}^{n} δ_u a^{(u)}_{i,j}   (7)

with trainable parameters δ_u ∈ R^+ to balance the contribution of attentions from different categories.
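The categorical variant can be sketched as follows. This is again a hedged illustration: the per-length softmax and concatenation mirror Eq. (7), `deltas` stands in for the trainable δ_u, dot-product scoring is assumed, and empty categories are assumed to contribute zero vectors.

```python
import numpy as np
from collections import defaultdict

def categorical_span_attention(r_ij, ngrams, embeddings, deltas, max_n=5):
    """Group n-grams by length, softmax-weight them *within* each group,
    then concatenate the per-category attention vectors scaled by deltas.
    Output shape: (max_n * d_r,)."""
    groups = defaultdict(list)
    for c in ngrams:
        groups[len(c.split())].append(c)   # category u = n-gram length
    parts = []
    for u in range(1, max_n + 1):
        if groups[u]:
            E = np.stack([embeddings[c] for c in groups[u]])
            logits = E @ r_ij
            w = np.exp(logits - logits.max())
            w /= w.sum()                              # softmax within category u
            parts.append(deltas[u - 1] * (w @ E))     # delta_u * a^(u)_{i,j}
        else:
            parts.append(np.zeros_like(r_ij))         # empty category: zeros
    return np.concatenate(parts)
```

Because the softmax is taken per category, frequent short n-grams can no longer absorb the probability mass that should go to informative long ones.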

Parsing with Span Attention
The backbone parser uses BERT as the encoder, following recent chart-based parsers, where r_{i,j} = h_j − h_i is applied to represent the span (i, j). Once a_{i,j} is obtained from the span attention for (i, j), we incorporate it into the backbone parsing process by directly concatenating it with r_{i,j}: r'_{i,j} = r_{i,j} ⊕ a_{i,j} ∈ R^{d_r·(n+1)}. Then, we apply two fully connected layers with the ReLU activation function to r'_{i,j} and compute the span scores s(i, j, ·) over the label set L, which can be formalized by

o_{i,j} = ReLU(LN(W_1 r'_{i,j} + b_1))

and

s(i, j, ·) = W_2 o_{i,j} + b_2

Here, LN denotes the layer normalization operation; W_1, W_2 and b_1, b_2 are trainable parameters of the fully connected layers. Afterwards, we use Eq. (3) and (4) to recursively find the highest score s*(1, q), and use a back pointer to recover the globally optimized parse tree.

Datasets

Our experiments are conducted on three widely used benchmark datasets: the Arabic Penn Treebank 2.0 (ATB) (Maamouri et al., 2004), the Chinese Penn Treebank 5 (CTB5) (Xue et al., 2005), and the Penn Treebank 3 (PTB) (Marcus et al., 1993).[4] For ATB, we follow Chiang et al. (2006) and Green and Manning (2010) to use their split[5] to obtain the training/dev/test sets and convert the texts in the dataset from Buckwalter transliteration[6] to modern standard Arabic. For CTB5 and PTB, we follow Shen et al. (2018) and Kamigaito et al. (2017) to split the datasets. Moreover, we use the Brown Corpus (Marcus et al., 1993) and Genia (Tateisi et al., 2005) for cross-domain experiments.[7] For all datasets, we follow Suzuki et al. (2018) to clean up the raw data[8] and report the statistics of each resulting dataset in Table 1.

N-gram Lexicon Construction
For n-gram extraction, we compute the pointwise mutual information (PMI) of any two adjacent words x′ and x′′ in the dataset by

PMI(x′, x′′) = log [ p(x′x′′) / (p(x′) p(x′′)) ]

where p is the probability of an n-gram (i.e., x′, x′′, and x′x′′) in the dataset. A high PMI score suggests that the two words co-occur frequently in the dataset and are more likely to form an n-gram. We set the threshold to 0 to determine whether a delimiter should be inserted between two adjacent words x′ and x′′. In other words, to build the lexicon N from a dataset, we use PMI as an unsupervised segmentation method to segment the dataset and collect all n-grams (n ≤ 5)[9] appearing at least twice in the training and development sets combined.[10]

[4] All the datasets are obtained from the official release of the Linguistic Data Consortium. The catalog numbers for ATB parts 1-3 are LDC2003T06, LDC2004T02, and LDC2005T20; for CTB5 it is LDC2005T01; and for PTB it is LDC99T42.
[5] This split uses the "Johns Hopkins 2005 Workshop" standard, for which we follow the detailed split guideline offered by https://nlp.stanford.edu/software/parser-arabic-data-splits.shtml.
[6] http://languagelog.ldc.upenn.edu/myl/ldc/morph/buckwalter.html
[7] The Brown Corpus is obtained together with PTB (LDC99T42), and the Genia corpus is obtained in its official PTB format from https://nlp.stanford.edu/mcclosky/biomedical.html.
[8] We use the clean-up code from https://github.com/nikitakit/parser-data-gen.
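The lexicon construction can be sketched end to end. This is an illustrative simplification under stated assumptions: PMI is estimated from raw counts over the corpus, the threshold-0 segmentation rule described above is applied, and n-grams are collected from the resulting segments.

```python
import math
from collections import Counter

def build_lexicon(sentences, max_n=5, min_count=2):
    """PMI-based lexicon sketch: PMI(x', x'') = log p(x'x'') - log p(x') - log p(x'').
    A delimiter is inserted between two adjacent words when their PMI is
    not above 0; n-grams are collected from the resulting segments."""
    word_count, pair_count, total = Counter(), Counter(), 0
    for sent in sentences:
        word_count.update(sent)
        pair_count.update(zip(sent, sent[1:]))
        total += len(sent)
    n_pairs = max(sum(pair_count.values()), 1)

    def pmi(a, b):
        p_ab = pair_count[(a, b)] / n_pairs
        if p_ab == 0:
            return float("-inf")
        return math.log(p_ab / ((word_count[a] / total) * (word_count[b] / total)))

    ngram_count = Counter()
    for sent in sentences:
        if not sent:
            continue
        segments, seg = [], [sent[0]]
        for prev, cur in zip(sent, sent[1:]):
            if pmi(prev, cur) > 0:
                seg.append(cur)        # high PMI: keep the two words together
            else:
                segments.append(seg)   # low PMI: insert a delimiter
                seg = [cur]
        segments.append(seg)
        for s in segments:             # collect n-grams (n <= max_n) per segment
            for n in range(1, min(max_n, len(s)) + 1):
                for a in range(len(s) - n + 1):
                    ngram_count[" ".join(s[a:a + n])] += 1
    return {g for g, c in ngram_count.items() if c >= min_count}
```

Only n-grams appearing at least `min_count` times survive, mirroring the "at least twice" filter above.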

Model Implementation
In our experiments, we use BERT (Devlin et al., 2019), ZEN, and XLNet as encoders[11] and, following previous studies, add three additional token-level self-attention layers on top of BERT, ZEN, and XLNet.

[9] We empirically set the max n-gram length to 5 as a unified threshold for all three languages.
[10] We show the details of extracting the lexicon with example n-grams in the Appendix.
[11] We download BERT models for Arabic and English from https://github.com/google-research/bert, and for Chinese from https://s3.amazonaws.com/models.huggingface.co/. We download ZEN and XLNet from https://github.com/sinovation/ZEN and https://github.com/zihangdai/xlnet.
For other settings, we randomly initialize all n-gram embeddings used in our attention module[12], with their dimension matching that of the hidden vectors obtained from the encoder (e.g., 1024 for BERT-large). Besides, we run our experiments with and without predicted part-of-speech (POS) tags. Following previous studies, for the experiments without POS tags, we take sentences as the only input; for the experiments with POS tags, we obtain the POS tags from the Stanford POS Tagger (Toutanova et al., 2003) and incorporate them by directly concatenating their embeddings with the output of the BERT/ZEN/XLNet encoder. Following previous studies (Suzuki et al., 2018), we use hinge loss during the training process and evaluate different models by precision, recall, F1 score, and complete match score via the standard evaluation toolkit EVALB[13].
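For reference, the labeled bracket metric can be sketched as follows. This is a simplified version: the actual EVALB toolkit applies additional normalizations (e.g., ignoring punctuation and certain labels) controlled by a parameter file.

```python
def bracket_f1(gold_spans, pred_spans):
    """Labeled bracket precision, recall, and F1 over sets of (i, j, label)
    triplets, the EVALB-style metric in simplified form."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)                       # matching labeled brackets
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

The complete match score reported alongside F1 is simply the fraction of sentences whose predicted span set equals the gold set exactly.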
During the training process, we try three learning rates, i.e., 5e-5, 1e-5, 5e-6, with a fixed random seed, pick the model with the best F1 score on the development set, and evaluate it on the test set.

Overall Performance
In the main experiment, we compare the proposed models with and without the span attention to explore the effect of the span attention on chart-based constituency parsing. For models with the span attention, we also run the settings with and without the categorical mechanism. The results (i.e., precision, recall, F1 score, and complete match scores of all models, as well as their number of trainable parameters) with different configurations (including whether to use the predicted POS tags) on the development sets of ATB, CTB5, and PTB are reported in Table 2.
There are several observations. First, the span attention over n-grams shows its generalization ability, where consistent improvements of F1 over the baseline models are observed on all languages under different settings (i.e., with and without using predicted POS tags; using BERT or XLNet encoders). Second, compared with span attention without the category mechanism, in which all n-grams are weighted together, models with categorical span attention perform better on both F1 and complete match scores with a relatively small increase in the number of parameters (around 1M). Particularly, for the complete match scores, the span attention with normal attention does not outperform the baseline models in some cases, whereas the categorical span attention mechanism does in all cases. These results could be explained by the fact that frequent short n-grams dominate the general attentions, so that the long ones containing more contextual information fail to fill in the missing information in the span representation, which harms the understanding of long spans and results in inferior complete match scores. In contrast, the categorical span attention is able to weight n-grams of different lengths separately, so that the attentions are not dominated by high-frequency short n-grams and reasonable weights can be assigned to long n-grams. Therefore, our model can learn from the important long n-grams and perform well on long spans, which results in consistent improvements over baseline models in complete match scores. Third, on CTB5, models with the ZEN encoder consistently outperform the ones with BERT without using POS tags, while they fail to do so with the POS tags as additional input, which suggests that the predicted POS tags may have more conflict with ZEN compared with BERT.
Moreover, we run our models on the test set of each dataset and compare the results with previous studies, as well as with prevailing parsers, i.e., the Stanford CoreNLP Toolkit (SCT)[14] and the Berkeley Neural Parser (BNP)[15] (Kitaev and Klein, 2018). The results are reported in Table 3, where the models using predicted POS tags are marked with "*".[16] Our models with CATSA outperform the previous best-performing models from Zhou and Zhao (2019) and Mrini et al. (2019) under different settings (i.e., with or without the predicted POS tags), and achieve state-of-the-art performance on all datasets. Compared with Zhou and Zhao (2019) and Mrini et al. (2019), which improve constituency parsing by leveraging dependency information when training their head-driven phrase structure grammar (HPSG) parser, our approach enhances the task from another direction by incorporating n-gram information through the span attention as a way to address the limitation of using hidden vector subtraction to represent spans.

Cross-domain Experiments
To further explore whether our approach can be generalized across domains, we follow the setting of Fried et al. (2019) to conduct cross-domain experiments on the Brown and Genia corpora using the models with SA and CATSA, as well as their corresponding baseline. Note that, for a fair comparison, we use BERT-large cased as the encoder without using the predicted POS tags. We follow Fried et al. (2019) to train models on the training set of PTB and evaluate them on the entire Brown corpus and the entire Genia corpus. To construct N in this experiment, we extract n-grams by PMI from the training set of PTB. The results (F1 scores) are reported in Table 4. From the table, we find that our model with categorical span attention (+ CATSA) outperforms the BERT baseline (Fried et al., 2019) on the Brown corpus, while it fails to do so on the Genia corpus. The explanation could be that the distance between Genia (medical domain) and PTB (newswire domain) is much larger than that between Brown and PTB, so that the n-gram overlap between the two domains is limited and thus has little influence on the target domain.

Effect of CATSA on Long Sentences
To explore the effect of our approach, we investigate our best-performing models (where predicted POS tags are used) with the span attention module and the corresponding baselines on sentences of different lengths in the test sets. The curves of F1 scores with respect to the minimal test sentence length (the horizontal axis) from different models on ATB, CTB5, and PTB are illustrated in Figure 3(a), 3(b), and 3(c), respectively.[17] In general, long sentences are harder to parse and thus all models' performance degrades when sentence length increases. Yet, our models with CATSA outperform the baseline for all sentence groups and the gap is bigger for long sentences, which indicates that our approach can handle long sentences better than the baselines. One possible explanation for this is that long sentences have larger text spans and may require more long-distance contextual information. Our approach incorporates n-gram information into the span representation and thus can appropriately leverage the infrequent long n-grams by separately weighting them in different categories.

Analysis on Different N-gram Lengths
To test the effect of n-grams of different lengths, we conduct an ablation study with respect to n-gram length. In doing so, we conduct experiments on the best-performing models (where predicted POS tags are used) with the span attention module, excluding from the lexicon N all n-grams whose length is larger than a threshold. We try thresholds from 1 to 5 and show the curves (F1 scores) on the test sets of ATB, CTB5, and PTB in Figure 4(a), (b), and (c), respectively. The results of the corresponding baselines are also shown as red curves for reference. It is found from the curves that our models with span attention consistently outperform the baseline models, which indicates the robustness of our approach with respect to the different n-grams used in the model. In addition, the n-gram threshold at which the best performance is obtained varies across languages. For example, the best-performing model on English uses three words as the maximum n-gram length, while it is five for Arabic and four for Chinese.
Moreover, to investigate how the categorical span attention addresses the problem that high-frequency short n-grams can dominate the general attentions, we run the best-performing models with span attention on the whole ATB, CTB5, and PTB datasets, obtain the total weight assigned to each n-gram, and compute the average weight for the n-grams in each n-gram length category. Figure 5 shows the histograms of the average weights from models with SA and CATSA.
The histograms show that the models with SA (the orange bars) tend to assign relatively high weights to short n-grams, especially uni-grams. This is not surprising because short n-grams occur more frequently and are thus updated more times than long ones. In contrast, the models with CATSA show a different weight distribution (the blue bars) among n-grams of different lengths, which indicates that the CATSA module can balance the weight distribution and thus enable the model to learn from infrequent long n-grams.

[Figure 6: An example sentence with its parsing results from the best-performing baseline and our model. The correct and wrong parsing results are highlighted on the span labels in green and red, respectively. The superscripts on the span labels indicate their heights. "V" is a POS tag, so there is no height for it.]

Case Study
To illustrate how our model improves over the baselines with the span attention, especially for long sentences, we show the parse trees produced by the two models for an example sentence in Figure 6, where the superscript on each internal node is the height of the subtree rooted at that node. In this case, our model correctly attaches the "PP" ("with two ... utilities") containing 24 words to the verb "compete", while the baseline attaches it to the noun "customers". Since the distances between the boundary positions of the wrongly predicted spans (highlighted in red) are relatively long, the baseline system, which simply represents a span as the subtraction of the hidden vectors at its boundary positions, may fail to capture the important contextual information within the text span. In contrast, the span representations used in our model are enhanced by weighted n-gram information and thus contain more contextual information. Therefore, in deciding which component (i.e., "compete" or "customers") the with-PP should attach to, n-grams (e.g., the uni-gram "companies") may provide useful cues, since "customers with companies" is less likely than "compete with companies".

Related Work
There are two main types of parsing methodologies. One is the transition-based approach (Sagae and Lavie, 2005); the other is the chart-based approach (Collins, 1997; Glaysher and Moldovan, 2006). Recently, neural methods have started to play a dominant role in this task, where improvements mainly come from powerful encodings (Dyer et al., 2016; Cross and Huang, 2016; Liu and Zhang, 2017; Stern et al., 2017; Gaddy et al., 2018; Kitaev and Klein, 2018; Fried et al., 2019). Moreover, there are studies that do not follow the aforementioned methodologies, instead regarding the task as a sequence-to-sequence generation task (Vinyals et al., 2015; Suzuki et al., 2018), a language modeling task (Choe and Charniak, 2016), or a sequence labeling task (Gómez-Rodríguez and Vilares, 2018). To further improve performance, some studies leverage extra resources, such as auto-parsed large corpora (Vinyals et al., 2015), pre-trained word embeddings (Kitaev and Klein, 2018), and HPSG information (Zhou and Zhao, 2019; Mrini et al., 2019), or use model ensembles. Compared to these studies, our approach offers an alternative way to enhance constituency parsing by effectively leveraging n-gram information. Moreover, the proposed span attention addresses the limitation of previous studies (Kitaev and Klein, 2018) in which spans are represented by the subtraction of encoded vectors at span boundaries (i.e., the hidden states at the initial and ending positions of the span), and thus reduces information loss accordingly. In addition, the categorical span attention provides a simple, yet effective, improvement over normal attention by processing n-grams in a more precise way, which could become a reference for leveraging similar resources in future research.

Conclusion
In this paper, we proposed span attention to integrate n-grams into span representations to enhance chart-based neural constituency parsing. Specifically, for each text span in an input sentence, we first extracted n-grams in that span from an n-gram lexicon, and then fed them into the span attention to weight them according to their contribution to the parsing process. To better leverage n-grams, especially the long ones, categorical span attention was proposed to improve normal attention by categorizing n-grams according to their length and weighting them separately within each category. Such span attention not only leverages important contextual information from n-grams but also addresses the limitation of current Transformer-based encoders that use subtraction for span representations. To the best of our knowledge, this is the first work using n-grams for neural constituency parsing. The effectiveness of our approach was demonstrated by experimental results on three benchmark datasets for Arabic, Chinese, and English, where state-of-the-art performance is obtained on all of them.

Table 5: Example n-grams with their average weights obtained from our best-performing model (i.e., XLNet + POS + CATSA) on the entire PTB dataset.

  N-GRAMS                  AVG. ATTENTION
  said                     0.0141
  more than                0.0063
  as well as               0.0167
  from a year earlier      0.0267
  in the past few years    0.0184
We run the best-performing models for Arabic, Chinese, and English on the entire ATB, CTB5, and PTB datasets, respectively. Then, for each n-gram, we compute its average attention weight over its appearances in the entire dataset. Afterwards, we group n-grams by their length and rank the n-grams according to their average attention weights within each group. The top 50 n-grams in each group, as well as their attention weights for each language, are reported in the supplemental material. As a demonstration, Table 5 shows a few n-grams with their average attention weights on the entire PTB dataset.