Supertagging Combinatory Categorial Grammar with Attentive Graph Convolutional Networks

Supertagging is conventionally regarded as an important task for combinatory categorial grammar (CCG) parsing, where effective modeling of contextual information is highly important. However, existing studies have made limited efforts to leverage contextual features beyond applying powerful encoders (e.g., bi-LSTM). In this paper, we propose attentive graph convolutional networks to enhance neural CCG supertagging through a novel way of leveraging contextual information. Specifically, we build the graph from chunks (n-grams) extracted from a lexicon and apply attention over the graph, so that different word pairs from the contexts within and across chunks are weighted in the model and facilitate supertagging accordingly. Experiments performed on the CCGbank demonstrate that our approach outperforms all previous studies in terms of both supertagging and parsing. Further analyses illustrate the effectiveness of each component in our approach in discriminatively learning from word pairs to enhance CCG supertagging.


Introduction
Combinatory categorial grammar (CCG) is a lexicalized grammatical formalism, where the lexical categories (also known as supertags) of the words in a sentence provide informative syntactic and semantic knowledge for text understanding. Therefore, CCG parses often provide useful information for many downstream natural language processing (NLP) tasks such as logical reasoning (Yoshikawa et al., 2018) and semantic parsing (Beschke, 2019). To perform CCG parsing in different languages, most studies conducted a supertagging-parsing pipeline (Clark and Curran, 2007; Kummerfeld et al., 2010; Lewis and Steedman, 2014b; Huang and Song, 2015; Xu et al., 2015; Lewis et al., 2016; Vaswani et al., 2016; Yoshikawa et al., 2017), in which the main focus is the first step, and the CCG parse trees are then generated directly from supertags with a few rules. Our code and models for CCG supertagging are released at https://github.com/cuhksz-nlp/NeST-CCG.
Building an accurate supertagger in a sequence labeling process requires good modeling of contextual information. Recent neural approaches to supertagging mainly focused on leveraging powerful encoders with recurrent models (Lewis et al., 2016; Vaswani et al., 2016; Clark et al., 2018), with limited attention paid to modeling extra contextual features such as word pairs with strong relations. Graph convolutional networks (GCN) have been demonstrated to be an effective approach to model such contextual information between words in many NLP tasks (Marcheggiani and Titov, 2017; Huang and Carley, 2019; De Cao et al., 2019); thus we want to determine whether this approach can also help CCG supertagging. However, we cannot directly apply conventional GCN models to CCG supertagging, because in most previous studies the GCN models are built over the edges in the dependency tree of the input sentence. As high-quality dependency parsers are not always available, we do not want our CCG supertaggers to rely on the existence of dependency parsers. Thus, we need another way to extract useful word pairs to build GCN models. For that, we propose to obtain word pairs from frequent chunks (n-grams) in the corpus, because such chunks are easy to identify with co-occurrence counts. To appropriately learn from n-grams, the GCN must be able to distinguish different word pairs, because the information in n-grams is not explicitly structured as it is in dependency parses. Because existing GCN models treat all word pairs equally, we propose an adaptation of the conventional GCN for CCG supertagging.

Figure 1: The architecture of our CCG supertagger with A-GCN and an example input sentence with its supertagging and parsing output. The supertagging process for "buy" is highlighted in green. The adjacency matrix illustrates the edges of the graph that is built upon the chunks (n-grams) extracted from the lexicon N, with the chunks illustrated in the red boxes.
In this paper, we propose the attentive GCN (A-GCN) for CCG supertagging, where the input graph is built based on chunks (n-grams) extracted with unsupervised methods. In detail, two types of edges in the graph are introduced to model word relations within and across chunks, and an attention mechanism is applied to the GCN to weight those edges. In doing so, different contextual information is discriminatively learned to facilitate CCG supertagging without requiring any external resources. The validity of our approach is demonstrated by experimental results on the CCGbank (Hockenmaier and Steedman, 2007), where state-of-the-art performance is obtained for both tagging and parsing.

The Approach
We treat CCG supertagging as a sequence labeling task, where the input is a sentence with n words X = x 1 x 2 · · · x i · · · x n , and the output is a sequence of supertags Y = y 1 y 2 · · · y i · · · y n . Our approach uses attentive GCN (A-GCN) to incorporate information of word pairs through a graph; the graph is built based on n-grams in the input sentence that appear in a lexicon N . This lexicon consists of n-grams automatically extracted from raw corpora by unsupervised methods. The overall architecture of our tagger is illustrated in Figure 1, with an input sentence and corresponding supertagging and parsing output. The details of the main components in the architecture are provided below.

GCN
Normal GCN models with L layers learn from word pairs suggested by the dependency parsing results of the input sentence X, where the edges between all pairs of words x_i and x_j are represented by an adjacency matrix A = {a_{i,j}}_{n×n}. In A, a_{i,j} = 1 if there is a dependency edge between x_i and x_j or i = j (the direction of the edge is ignored), and a_{i,j} = 0 otherwise. Based on the adjacency matrix, for each x_i, the l-th GCN layer finds all x_j associated with x_i (where a_{i,j} = 1), takes their hidden vectors h_j^{(l-1)} from the (l-1)-th layer, and computes the output

h_i^{(l)} = \sigma\Big(\mathrm{LN}\Big(\sum_{j=1}^{n} a_{i,j} \big(W^{(l)} h_j^{(l-1)} + b^{(l)}\big)\Big)\Big)    (1)

where W^{(l)} and b^{(l)} are the trainable matrix and bias for the l-th GCN layer, LN refers to layer normalization, and σ is the ReLU activation function. Therefore, in a normal GCN, for each x_i, all the x_j that connect to x_i are treated exactly the same.
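As a sketch of Eq. (1) in NumPy (our own minimal illustration, not the authors' implementation; the function and variable names are ours):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def layer_norm(x, eps=1e-5):
    # Normalize each hidden vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def gcn_layer(H, A, W, b):
    """One plain GCN layer: for each word i, sum the transformed hidden
    vectors of all neighbors j with A[i, j] == 1 (including i itself),
    then apply layer normalization and ReLU, as in Eq. (1)."""
    # H: (n, d) hidden vectors from the previous layer
    # A: (n, n) 0/1 adjacency matrix with self-loops
    # W: (d, d) trainable matrix, b: (d,) bias
    msgs = H @ W + b          # (n, d) transformed vectors
    agg = A @ msgs            # (n, d) sum over each word's neighbors
    return relu(layer_norm(agg))
```

Note that every neighbor contributes with the same weight here, which is exactly the limitation the attention mechanism below addresses.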

Graph Construction based on Chunks
Since CCG supertagging is also a parsing task, we do not want our approach to rely on the existence of a dependency parser. Without such a parser, we need an alternative for finding good word pairs to build the graph in A-GCN (which is equivalent to building the adjacency matrix A). Inspired by the studies that leverage chunks (n-grams) as effective features to carry contextual information and enhance model performance (Song et al., 2009; Song and Xia, 2012; Ishiwatari et al., 2017; Yoon et al., 2018; Tian et al., 2020a,c,b), we propose to construct the graph based on the chunks (n-grams) extracted from a pre-constructed n-gram lexicon N. Specifically, the lexicon is constructed by computing the PMI of any two adjacent words s', s'' in the training set by

\mathrm{PMI}(s', s'') = \log \frac{p(s' s'')}{p(s')\, p(s'')}

where p is the probability of an n-gram (i.e., s', s'', and s' s'') in the training set; a high PMI score thus suggests that the two words co-occur frequently in the dataset and are more likely to form an n-gram. For each pair of adjacent words s_{i-1}, s_i in a sentence S = s_1 s_2 · · · s_{i-1} s_i · · · s_n, we compute the PMI score of the two words and use a threshold to determine whether a delimiter should be inserted between them. As a result, the sentence S is segmented into pieces of n-grams, and we extract all n-grams from all sentences to form the lexicon N. Then, for graph building, given an input sentence X, we find all the n-grams in X that appear in N. A chunk is either an n-gram that does not overlap with other n-grams or a text span that covers multiple overlapping n-grams.

Figure 2: Examples of the two types of edges for building the graph in an input sentence, in which chunks (n-grams) extracted from the lexicon N are highlighted in green; example in-chunk and cross-chunk edges are marked in blue and red, respectively.
For example, in Figure 2, we find four chunks (i.e., "all students", "are required to", "finish", and "in two hours") in the example sentence according to the lexicon N (the chunks are highlighted in green). In these chunks, "all students", "finish", and "in two hours" are non-overlapping n-grams included in the lexicon, and "are required to" is a text span that covers the overlapping n-grams "are required" and "required to". In most cases, adjacent words within the same chunk tend to have a strong word-word relation in terms of co-occurrence, and thus we can build the graph and its adjacency matrix accordingly.
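The PMI-based lexicon construction described above can be sketched as follows (a simplified illustration of the idea; the counting details and the `build_lexicon` name are our assumptions, not the authors' implementation):

```python
import math
from collections import Counter

def build_lexicon(sentences, threshold=0.0):
    """Segment each sentence at low-PMI word boundaries and collect the
    resulting chunks (n-grams) into a lexicon (simplified sketch).
    `sentences` is a list of non-empty token lists."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for sent in sentences:
        for w in sent:
            unigrams[w] += 1
            total += 1
        for a, b in zip(sent, sent[1:]):
            bigrams[(a, b)] += 1
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        # PMI(a, b) = log( p(ab) / (p(a) p(b)) )
        p_ab = bigrams[(a, b)] / n_bi
        p_a = unigrams[a] / total
        p_b = unigrams[b] / total
        return math.log(p_ab / (p_a * p_b)) if p_ab > 0 else float("-inf")

    lexicon = set()
    for sent in sentences:
        chunk = [sent[0]]
        for a, b in zip(sent, sent[1:]):
            if pmi(a, b) > threshold:
                chunk.append(b)            # keep the pair in one chunk
            else:
                lexicon.add(tuple(chunk))  # insert a delimiter here
                chunk = [b]
        lexicon.add(tuple(chunk))
    return lexicon
```

In a toy corpus almost every observed bigram has positive PMI, so a realistic threshold matters in practice; the paper sets it to 0 on the CCGbank training set.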
Based on the chunks, we construct the graph with two types of edges, i.e., in-chunk and cross-chunk ones. The first type models local word pairs: the graph includes edges between any two adjacent words within the same chunk. For example, as shown in Figure 2, the in-chunk edges (blue lines) for the chunk "in two hours" are "(in, two)" and "(two, hours)". The second type models cross-chunk word pairs, built from any two adjacent chunks by connecting their starting and ending words. The motivation for using the starting and ending words is that English phrases tend to be head-initial (e.g., verb phrases such as "buy some books") or head-final (e.g., adjective phrases such as "red apples") in many cases. E.g., for the two chunks "all students" and "are required to" in Figure 2, the corresponding cross-chunk edges (red lines) are "(all, are)", "(all, to)", "(students, are)", and "(students, to)". The graph is equivalent to the adjacency matrix A, where a_{i,j} = 1 if there is an edge between x_i and x_j in the graph or i = j, and a_{i,j} = 0 otherwise.
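A minimal sketch of building this adjacency matrix from chunk spans (our own illustration, assuming chunks are given as non-overlapping, left-to-right word-index spans):

```python
import numpy as np

def build_adjacency(n, chunks):
    """Build the 0/1 adjacency matrix A from chunk spans.
    `chunks` is a list of (start, end) spans (end exclusive), assumed
    non-overlapping and ordered left to right over the n words.
    In-chunk edges connect adjacent words inside a chunk; cross-chunk
    edges connect the starting/ending words of adjacent chunks."""
    A = np.eye(n, dtype=int)             # self-loops: a[i, i] = 1

    def connect(i, j):
        A[i, j] = A[j, i] = 1            # edges are undirected

    for start, end in chunks:
        for i in range(start, end - 1):  # in-chunk: adjacent word pairs
            connect(i, i + 1)
    for (s1, e1), (s2, e2) in zip(chunks, chunks[1:]):
        # cross-chunk: start/end words of two adjacent chunks
        for i in (s1, e1 - 1):
            for j in (s2, e2 - 1):
                connect(i, j)
    return A
```

For "all students | are required to" (spans (0, 2) and (2, 5)), this yields the four cross-chunk edges named above plus the in-chunk edges, while leaving, e.g., "(all, required)" unconnected.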

The Attentive GCN
When learning from a graph, conventional GCN models treat all word pairs from the graph equally, and thus are unable to account for the possibility that the contribution of different x_j to x_i may vary. Particularly for our graph built from chunks, it is important to be able to distinguish different word pairs, because all the chunks and the graph are constructed automatically without a dependency parser. Therefore, we apply an attention mechanism to the adjacency matrix and adapt Eq. (1) used in the normal GCN for our A-GCN by replacing a_{i,j} ∈ {0, 1} with a weight p^{(l)}_{i,j} ∈ (0, 1). For each x_i and all its associated x_j, the weight p^{(l)}_{i,j} for this word pair is computed by

p^{(l)}_{i,j} = \frac{a_{i,j} \cdot \exp\big(h^{(l-1)}_i \cdot W^{(l)}_{pos} \cdot h^{(l-1)}_j\big)}{\sum_{j'=1}^{n} a_{i,j'} \cdot \exp\big(h^{(l-1)}_i \cdot W^{(l)}_{pos} \cdot h^{(l-1)}_{j'}\big)}    (2)

where W^{(l)}_{pos} models the positional relation (i.e., left, right, or self) between x_i and x_j and has three choices, i.e., W^{(l)}_{left}, W^{(l)}_{right}, and W^{(l)}_{self}, selected according to whether x_j appears to the left of, to the right of, or at the same position as x_i.
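The edge-attention computation in Eq. (2) can be sketched as follows (our own simplified illustration, not the authors' exact parameterization; the function name and arguments are assumptions):

```python
import numpy as np

def a_gcn_weights(H, A, W_left, W_right, W_self):
    """Edge-attention weights for A-GCN: each edge (i, j) is scored with
    a bilinear form h_i · W_pos · h_j, where W_pos depends on whether
    x_j is to the left of, to the right of, or equal to x_i; the scores
    are then normalized over the neighbors given by the adjacency A."""
    n, _ = H.shape
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            W = W_self if i == j else (W_left if j < i else W_right)
            scores[i, j] = H[i] @ W @ H[j]
    # mask non-edges, then normalize each row so its weights sum to 1
    e = np.exp(scores - scores.max(axis=1, keepdims=True)) * A
    return e / e.sum(axis=1, keepdims=True)
```

The resulting matrix of p_{i,j} values simply replaces the 0/1 entries a_{i,j} in the GCN layer, so connected neighbors now contribute in proportion to their learned relevance.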

Supertagging with A-GCN
To conduct supertagging with A-GCN, we first obtain the hidden vector for each word x_i from the encoder (aligning the encoder's subword pieces to words), feed these vectors through the A-GCN layers, and take the resulting output vector o_i for each x_i. Then, a softmax decoder is used to predict the supertag ŷ_i for x_i:

\hat{y}_i = \arg\max_{t \in T} \frac{\exp(o^t_i)}{\sum_{t'=1}^{|T|} \exp(o^{t'}_i)}

where T denotes the set of all CCG categories and o^t_i the value at dimension t in o_i.
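As a small sketch of the decoding step (our own illustration; since softmax is monotonic, taking the argmax over the raw output dimensions gives the same prediction as the normalized probabilities):

```python
import numpy as np

def decode_supertags(O, categories):
    """Pick the highest-scoring CCG category for each word.
    O is the (n, |T|) matrix of output vectors o_i; `categories` maps
    each dimension index t to its CCG category string."""
    return [categories[t] for t in O.argmax(axis=-1)]
```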

Settings
We run experiments on the English CCGbank (Hockenmaier and Steedman, 2007) (the official dataset is obtained from https://catalog.ldc.upenn.edu/LDC2005T13) and follow Clark and Curran (2007) to split it into train/dev/test sets, whose statistics (sentence and word numbers) are reported in Table 1. To construct the n-gram lexicon N for building the edges in our graph, we compute PMI over the training set of CCGbank to extract n-grams whose length is between 1 and 5, with the threshold of the PMI score set to 0. For the encoder, we try both cased and uncased BERT-Large (Devlin et al., 2019) with their default settings (e.g., 24 layers of self-attention with 1024-dimensional hidden vectors; the pre-trained models are downloaded from https://github.com/google-research/bert), and we use two layers of A-GCN. To obtain the CCG parse from the generated supertags, we adopt the parsing algorithm used in EasyCCG (Lewis and Steedman, 2014a). We follow previous studies (Lewis and Steedman, 2014a; Lewis et al., 2016; Yoshikawa et al., 2017) to evaluate our model on both the tagging accuracy of the most frequent 425 supertags and the labeled F-scores (LF) of the dependencies converted from the CCG parse. For other hyper-parameters, we test the values shown in Table 2 when training our models: we try all combinations of them for each model and use the one achieving the highest supertagging results in our final experiments. The best performance is achieved with warm-up rate 0.1, batch size 16, and learning rate 1e-5.

Table 3: Results (supertagging accuracy and labeled F-scores) of different models with BERT-Large encoder on the development set of CCGbank. "PARM" is the number of trainable parameters in the models; "Full" uses the fully connected graph and "Chunk" uses the graph built based on chunks.

Results
To explore the effectiveness of our approach, we run CCG taggers with and without A-GCN, and try two ways to construct the graph: one is a fully connected GCN where edges are built between every two words; the other is our proposed approach with the chunk-based graph. Experimental results on supertagging accuracy (TAG) and labeled F-scores (LF) for parsing on the development set of CCGbank are reported in Table 3, with the number of trainable parameters of all models also presented.
The experiments show that, for both cased and uncased BERT encoders, the proposed chunk-based A-GCN works best in terms of both supertagging accuracy and parsing results. In contrast, the full A-GCN performs worse than the BERT baselines. This contrast shows the importance of appropriate construction of the graphs fed into A-GCN, since the fully connected graph, with all words associated with one another, may introduce noisy word relations and thus yield bad performance. Furthermore, we run our models with the uncased BERT encoder on the test set and compare the performance with previous studies on both supertagging and parsing. Table 4 shows the results, where the studies marked by † use the same parser (i.e., the EasyCCG parser) to generate CCG trees from supertags. Among the previous studies, Stanojević and Steedman (2019) performed CCG parsing directly without the supertagging step, whereas the rest all did supertagging first. Regardless of this difference, our approach performs the best on CCGbank in both supertagging accuracy and parsing LF.

Ablation Study
We conduct an ablation study to explore the effect of the two types of edges and the attention mechanism on our best model. The supertagging and parsing results of models with different configurations are reported in Table 5, where the results are categorized into four groups. The first group (ID 1) is the results of the best performing model where all settings are activated; the second (ID 2-3) is the ablation of either in-chunk or cross-chunk edges with attention; the third (ID 4-6) is the result of using normal GCN without the attention mechanism; and the last group (ID 7) is the baseline model where none of the three settings is activated.
The results show that the model performance drops when any part is ablated (ID 1 vs. ID 2-6). Specifically, removing attention significantly hurts the performance: all configurations without attention (ID 4-6) show worse-than-baseline (ID 7) results, which confirms the importance of applying attention to GCN. One possible explanation of this phenomenon is that considerable noise is introduced to the graph, because the edges in our graph are derived from chunks and do not follow syntax in most cases; thus, it is crucial to assign weights to the edges rather than treat them equally. Interestingly, comparing the two types of edges, models with cross-chunk edges yield much higher results than the ones with only in-chunk edges when attention is not used (ID 5 vs. ID 6), while the difference is only slight when attention is applied (ID 2 vs. ID 3). This comparison suggests that in-chunk edges could introduce more noise than cross-chunk edges: when attention is not used (ID 6), the model fails to weight the edges, resulting in a significant performance drop; on the contrary, when attention is applied (ID 3), our model is able to even out the performance of models with in-chunk and cross-chunk edges, which confirms that weighting is essential for selecting useful information for CCG supertagging.

Conclusion
In this paper, we propose A-GCN for CCG supertagging, with its graph built from chunks extracted from a lexicon. We use two types of edges for the graph, namely, in-chunk and cross-chunk edges for word pairs within and across chunks, respectively, and propose an attention mechanism to distinguish the important word pairs according to their contribution to CCG supertagging. Experimental results and the ablation study on the English CCGbank demonstrate the effectiveness of our approach to CCG supertagging, where state-of-the-art performance is obtained on both CCG supertagging and parsing. Further analysis is performed to investigate using different types of edges, which reveals their quality and confirms the necessity of introducing attention to GCN for CCG supertagging.