Incorporating Syntax and Semantics in Coreference Resolution with Heterogeneous Graph Attention Network

External syntactic and semantic information has been largely ignored by existing neural coreference resolution models. In this paper, we present a heterogeneous graph-based model to incorporate the syntactic and semantic structures of sentences. The proposed graph contains a syntactic sub-graph, in which tokens are connected based on the dependency tree, and a semantic sub-graph, which contains arguments and predicates as nodes and semantic role labels as edges. By applying a graph attention network, we obtain syntactically and semantically augmented word representations, which are then integrated using an attentive integration layer and a gating mechanism. Experiments on the OntoNotes 5.0 benchmark show the effectiveness of our proposed model.


Introduction
Coreference resolution is a core task in NLP, which aims to identify all mentions that refer to the same entity. Coreference encodes rich semantic information and has been successfully applied to improve many downstream NLP tasks (Luan et al., 2019; Wadden et al., 2019; Dasigi et al., 2019; Stojanovski and Fraser, 2018). Impressive progress has been made in recent years since the introduction of the first end-to-end neural coreference resolution model (Lee et al., 2017), by utilising contextualized embeddings from large pretrained language models (Joshi et al., 2019, 2020; Kantor and Globerson, 2019; Wu et al., 2020) such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019). The rich language knowledge encoded in these pretrained models has largely alleviated the need for syntactic and semantic features. However, such information has been shown to benefit BERT-based models on other tasks (Nie et al., 2020a; Pouran Ben Veyseh et al., 2020). Therefore, we believe it could also benefit the coreference resolution task.
In this paper, we propose a neural coreference resolution model based on Joshi et al. (2019), which we extend by incorporating external syntactic and semantic information. For syntactic information, we use dependency trees to capture the long-term dependencies that exist among mentions. Kong and Jian (2019) have successfully incorporated structural information into neural models, but their model still requires the design of complex hand-engineered features. In contrast, our model is more flexible, using a graph neural network to encode syntax in the form of dependency trees. For semantic information, we adopt semantic role labelling (SRL) structures. SRL labels capture who did what to whom, which effectively provides document-level event description information and allows us to better identify the relationships between event mentions. Previous statistical coreference systems have successfully integrated such information (Ponzetto and Strube, 2006; Kong et al., 2009), but its effectiveness has not been examined in neural models.
Moreover, inspired by recent progress made in document-level relation extraction (Christopoulou et al., 2019), we encode both syntactic and semantic information in a heterogeneous graph. Nodes of different granularity are connected based on the feature structures. Node representations are updated iteratively through our defined message passing mechanism and incorporated into contextualized embeddings using an attentive integration module and gating mechanism. We conduct experiments on the OntoNotes 5.0 (Pradhan et al., 2012) benchmark, where the results show that our proposed model significantly outperforms a strong baseline.

Baseline Model
Our model is based on the c2f-coref model (Lee et al., 2018), which enumerates all text spans as potential mentions and aggressively prunes unlikely spans. For each mention i (a span of one or more tokens), the model learns a distribution over its possible antecedents Y(i):

P(y) = \frac{\exp(s(i, y))}{\sum_{y' \in Y(i)} \exp(s(i, y'))}

where the scoring function s(i, j) measures how likely spans i and j are valid mentions that corefer to one another:

s(i, j) = s_m(i) + s_m(j) + s_c(i, j)
s_m(i) = \mathrm{FFNN}_m(g_i)
s_c(i, j) = \mathrm{FFNN}_c([g_i; g_j; g_i \odot g_j; \phi(i, j)])

where g_i and g_j are span representations formed by concatenating the contextualized embeddings of the span endpoints with an attention-based head vector, FFNN denotes a feedforward layer, φ(i, j) are meta features including span distance and speaker identities, and s_m and s_c are the mention score and pairwise coreference score.
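To make the scoring functions concrete, the following is a minimal PyTorch sketch of how the mention scores s_m and the pairwise score s_c could be combined into s(i, j); the class name, hidden sizes and exact feature layout are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class PairwiseCorefScorer(nn.Module):
    """Sketch of the c2f-coref scoring functions s_m and s_c."""
    def __init__(self, span_dim, feat_dim, hidden=150):
        super().__init__()
        # s_m(i) = FFNN_m(g_i): how likely span i is a valid mention
        self.ffnn_m = nn.Sequential(
            nn.Linear(span_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # s_c(i, j) = FFNN_c([g_i; g_j; g_i * g_j; phi(i, j)])
        self.ffnn_c = nn.Sequential(
            nn.Linear(3 * span_dim + feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, g_i, g_j, phi_ij):
        s_m_i = self.ffnn_m(g_i)                                   # mention score of span i
        s_m_j = self.ffnn_m(g_j)                                   # mention score of span j
        s_c = self.ffnn_c(torch.cat([g_i, g_j, g_i * g_j, phi_ij], dim=-1))
        return s_m_i + s_m_j + s_c                                 # s(i, j)
```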

Proposed Model
Figure 2 shows the architecture of our proposed model, where the key components are presented with blue and orange backgrounds. The other parts follow Lee et al. (2018) (see §2), except that we use SpanBERT (Joshi et al., 2020) as the document encoder and discard the higher-order span refinement module, as suggested by Xu and Choi (2020).

Node Construction
There are three types of nodes in our heterogeneous graph: token nodes (T), argument nodes (A) and predicate nodes (P). The representations of token nodes and predicate nodes are the contextualized embeddings from the SpanBERT encoder, denoted h_w and h_p respectively. The representation of an argument node is formed by averaging the embeddings of the tokens it contains, denoted h_a.
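A minimal sketch of how these node representations could be assembled from the SpanBERT token embeddings; the function signature and input formats (predicate token positions and inclusive argument spans) are illustrative assumptions.

```python
import torch

def build_node_representations(token_emb, predicate_idx, argument_spans):
    """token_emb: [num_tokens, dim] SpanBERT token embeddings.
    predicate_idx: list of token positions that act as predicates.
    argument_spans: list of (start, end) token spans of arguments, end inclusive."""
    h_w = token_emb                                    # token nodes (T)
    h_p = token_emb[predicate_idx]                     # predicate nodes (P)
    # argument nodes (A): average the embeddings of the tokens they contain
    h_a = torch.stack([token_emb[s:e + 1].mean(dim=0) for s, e in argument_spans])
    return h_w, h_a, h_p
```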

Edge Construction
Edges are constructed based on feature structures. An example is shown in Figure 1.
Token-Token Edges are constructed according to the dependency tree structure. Specifically, there is a directed edge from head to dependent between two token nodes if they are connected in the tree, with the edge labelled with the corresponding dependency relation. A self-loop edge with a cyclic label is also added to each node in the graph. Besides, we also link the root nodes of adjacent sentences to allow cross-sentence interaction.
Token-Argument Argument nodes are linked to the token nodes they contain. These edges are unlabelled but bidirectional, allowing token-level information to augment the averaged representations of arguments and semantic information to propagate back to tokens.
Predicate-Argument Argument nodes are connected to the predicate nodes they belong to, with edges labelled with the corresponding SRL roles. These edges are also bidirectional to allow mutual information propagation. Predicates can be regarded as intermediate nodes that allow each argument to aggregate information from the other arguments of the same predicate.
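For illustration, the sketch below collects the three edge types above as (source type, source, target type, target, label) records; the input formats (dependency arcs, SRL triples over node indices, and sentence root positions) are assumptions made for this example only.

```python
def build_edges(dep_arcs, srl_triples, sentence_roots, num_tokens, argument_spans):
    """Collect heterogeneous graph edges as (src_type, src, dst_type, dst, label) tuples."""
    edges = []
    # Token-Token: directed head -> dependent edges with dependency labels,
    # a cyclic self-loop on every token, and links between adjacent sentence roots.
    for head, dep, label in dep_arcs:
        edges.append(("token", head, "token", dep, label))
    for t in range(num_tokens):
        edges.append(("token", t, "token", t, "cyclic"))
    for r1, r2 in zip(sentence_roots, sentence_roots[1:]):
        edges.append(("token", r1, "token", r2, "root"))
        edges.append(("token", r2, "token", r1, "root"))
    # Token-Argument: unlabelled, bidirectional edges between arguments and their tokens.
    for a, (start, end) in enumerate(argument_spans):
        for t in range(start, end + 1):
            edges.append(("token", t, "arg", a, None))
            edges.append(("arg", a, "token", t, None))
    # Predicate-Argument: bidirectional edges labelled with SRL roles.
    for p, a, role in srl_triples:
        edges.append(("pred", p, "arg", a, role))
        edges.append(("arg", a, "pred", p, role))
    return edges
```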

Graph Attention Layer
We use a Graph Attention Network (Veličković et al., 2018) to propagate syntactic and semantic information to the basic token nodes. For a node i, the attention mechanism allows it to selectively incorporate information from its neighbour nodes:

\alpha_{ij} = \frac{\exp(\sigma(a^{T}[W h_i ; W h_j ; e_{ij}]))}{\sum_{j' \in \mathcal{N}_i} \exp(\sigma(a^{T}[W h_i ; W h_{j'} ; e_{ij'}]))}   (5)

\tilde{h}_i = \sum_{j \in \mathcal{N}_i} \alpha_{ij} W_k h_j   (6)

where h_i and h_j are the embeddings of nodes i and j, a^T, W and W_k are trainable parameters, e_{ij} is the embedding of the edge label between nodes i and j given by the graph structure, σ is the LeakyReLU activation function, and [;] denotes the concatenation operation. Eqs. 5 and 6 are abbreviated as the operation \tilde{h}_i = GAT(h_i, h_j), where h_i and h_j are the embeddings of the target and neighbour nodes and \tilde{h}_i is the updated embedding of the target node.
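Below is a single-head PyTorch sketch of such an edge-label-aware graph attention layer, written only to make Eqs. 5 and 6 concrete; the class name, the edge-list input format and the per-node softmax loop are simplifying assumptions (the actual model uses multi-head attention).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareGATLayer(nn.Module):
    """Single-head sketch of a GAT layer whose attention also conditions on an
    edge-label embedding e_ij, following Eqs. 5 and 6."""
    def __init__(self, in_dim, out_dim, num_labels, label_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)    # transforms nodes for scoring
        self.W_k = nn.Linear(in_dim, out_dim, bias=False)  # transforms neighbour messages
        self.edge_emb = nn.Embedding(num_labels, label_dim)
        self.a = nn.Linear(2 * out_dim + label_dim, 1, bias=False)  # a^T [.;.;.]

    def forward(self, h, edge_index, edge_labels):
        # h: [N, in_dim]; edge_index: LongTensor [E, 2] of (source j, target i); edge_labels: [E]
        src, dst = edge_index[:, 0], edge_index[:, 1]
        e = self.edge_emb(edge_labels)
        # unnormalised attention logits per edge: LeakyReLU(a^T [W h_i ; W h_j ; e_ij])
        logits = F.leaky_relu(
            self.a(torch.cat([self.W(h[dst]), self.W(h[src]), e], dim=-1)).squeeze(-1))
        # softmax over the incoming edges of each target node (simple loop for clarity)
        alpha = torch.zeros_like(logits)
        for i in torch.unique(dst):
            mask = dst == i
            alpha[mask] = F.softmax(logits[mask], dim=0)
        # weighted aggregation of transformed neighbour embeddings (Eq. 6)
        msg = alpha.unsqueeze(-1) * self.W_k(h[src])
        out = torch.zeros(h.size(0), msg.size(-1), dtype=msg.dtype, device=h.device)
        out.index_add_(0, dst, msg)
        return out
```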

Message Propagation
To make each node embedding more informative, we update all nodes in the graph multiple times along our designed message-passing path. First, we update the token nodes using neighbour token nodes connected through dependency syntactic edges:

h^l_w = GAT(h^{l-1}_w, h^{l-1}_{w'})

where h^{l-1}_w is the token representation in the previous layer l-1, h^l_w is the updated representation in the current layer l, and h^0_w is the SpanBERT encoding. In parallel, we update the argument nodes using the token representations; the updated arguments are then used to update the predicate features; after that, the updated predicate nodes propagate information back to their connected argument nodes; finally, the updated argument nodes distribute their representations to all connected basic token nodes:

h^l_a = GAT(h^{l-1}_a, h^{l-1}_w)
h^l_p = GAT(h^{l-1}_p, h^l_a)
\tilde{h}^l_a = GAT(h^l_a, h^l_p)
h^l_w = GAT(h^{l-1}_w, \tilde{h}^l_a)

After L iterations, we obtain the final syntax-enhanced and semantics-enhanced token representations, denoted h^d_w and h^s_w respectively.
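The overall schedule can be summarised as follows. This is a minimal sketch, assuming each edge type has its own GAT layer exposed as a callable that maps (target embeddings, source embeddings) to updated target embeddings; the dictionary keys and interface are hypothetical.

```python
from typing import Callable, Dict
import torch

def propagate(h_w0: torch.Tensor, h_a0: torch.Tensor, h_p0: torch.Tensor,
              gats: Dict[str, Callable], num_layers: int = 2):
    """Sketch of the message-passing schedule over the syntactic and semantic sub-graphs."""
    h_w_dep = h_w_srl = h_w0      # h^0_w: SpanBERT token encodings
    h_a, h_p = h_a0, h_p0
    for _ in range(num_layers):
        # syntactic sub-graph: tokens attend to neighbour tokens over dependency edges
        h_w_dep = gats["tok-tok"](h_w_dep, h_w_dep)
        # semantic sub-graph: token -> argument -> predicate -> argument -> token
        h_a = gats["tok-arg"](h_a, h_w_srl)
        h_p = gats["arg-pred"](h_p, h_a)
        h_a = gats["pred-arg"](h_a, h_p)
        h_w_srl = gats["arg-tok"](h_w_srl, h_a)
    return h_w_dep, h_w_srl       # syntax-enhanced h^d_w and semantics-enhanced h^s_w
```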

Attentive Integration Layer
Since attention mechanisms are effective in choosing the most relevant information (Nie et al., 2020a,b), we use an attentive integration layer to selectively incorporate the syntactic and semantic information. For each type of information h^c_w ∈ {h^d_w, h^s_w}, we concatenate it with the initial token representation h^0_w and use the concatenation to compute the importance score of h^c_w with respect to h^0_w:

s^c_w = \mathrm{FFNN}_c([h^0_w ; h^c_w])

where FFNN_c is a one-layer feedforward network with a sigmoid activation function for information type c (either Dep or SRL). After normalising the scores into attention weights α^c_w with a softmax function, we compute the weighted sum of the syntactic and semantic information:

h^{att}_w = \sum_{c} \alpha^c_w h^c_w

Since the extra syntactic and semantic information is not always useful, we use a gate to leverage it dynamically:

g_w = \sigma(W_g [h^0_w ; h^{att}_w] + b_g)
h_w = g_w \odot h^{att}_w + (1 - g_w) \odot h^0_w

where W_g and b_g are trainable parameters, ⊙ denotes element-wise multiplication and σ is the logistic sigmoid function. Finally, the augmented token representation h_w is used to form the span representations and compute the pairwise coreference scores as in Section 2.
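A compact PyTorch sketch of this attentive integration and gating step is given below; the module name and layer sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AttentiveIntegration(nn.Module):
    """Sketch of the attentive integration layer and gating mechanism."""
    def __init__(self, dim):
        super().__init__()
        # one scorer per information type c (Dep and SRL), with sigmoid activation
        self.ffnn_dep = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.ffnn_srl = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.gate = nn.Linear(2 * dim, dim)   # W_g, b_g

    def forward(self, h0, h_dep, h_srl):
        # importance score of each information type w.r.t. the original embedding h^0_w
        s_dep = self.ffnn_dep(torch.cat([h0, h_dep], dim=-1))
        s_srl = self.ffnn_srl(torch.cat([h0, h_srl], dim=-1))
        alpha = torch.softmax(torch.cat([s_dep, s_srl], dim=-1), dim=-1)  # [*, 2]
        h_att = alpha[..., :1] * h_dep + alpha[..., 1:] * h_srl
        # gate decides how much of the augmented information to keep
        g = torch.sigmoid(self.gate(torch.cat([h0, h_att], dim=-1)))
        return g * h_att + (1.0 - g) * h0     # augmented token representation h_w
```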

Experiments
Dataset We evaluate our model on the English OntoNotes 5.0 benchmark (Pradhan et al., 2012), which consists of 2802, 343 and 348 documents in the training, development and test sets, respectively.

Implementation Details
We reimplement the c2f-coref+SpanBERT baseline using PyTorch and use the Independent setup for long documents. For the graph encoders, the number of attention heads of the syntactic and semantic sub-graphs is 4 for the base model and 8 for the large model. We set the size of the edge label embeddings to 300 and use 2 GAT layers for both sub-graphs. More details are given in Appendix A.

Results
The main evaluation metric is the average F1 of three metrics (MUC, B³ and CEAFφ4) on the test set, computed using the official CoNLL-2012 evaluation scripts. Table 1 compares our model with SpanBERT-base and SpanBERT-large encoders against previous work. Our model consistently outperforms the SpanBERT baseline (Joshi et al., 2020) on all three metrics, with improvements of 1.4% and 1.5% in average F1 respectively, as well as our reimplemented baseline (+1.3% and +1.1%), which is a substantial improvement considering the difficulty of this task. This demonstrates the effectiveness of our heterogeneous graph-based method in leveraging syntactic and semantic features, and that such features are indeed useful in neural methods. Note that we also show the current state-of-the-art CorefQA model (Wu et al., 2020), which uses a span-prediction paradigm to compute pairwise coreference scores. This model is compatible with our method, i.e. our proposed graph attention and attentive integration layers can be added on top of its document encoder with minor modification. The reason we did not use it as a starting baseline is hardware limitations, since it requires 128G of GPU memory for training. In Table 4, we show the performance of our model against the baseline on the development set as a function of document length.
As expected, our model consistently outperforms the baseline model on all document sizes, especially for documents longer than 765 tokens. This demonstrates that the incorporated external syntax and semantics are beneficial for modelling longer dependencies. However, our model shows a similar pattern to the baseline model, performing distinctly worse as document length increases. This suggests that the sentence-level syntax and semantics used in this work are not sufficient to tackle the deficiency in modelling long-range dependencies. One possible solution is to leverage document-level features such as hierarchical discourse structures.

Related Work
Graph Neural Networks (GNNs) have long been used for integrating external graph-structured features into a range of NLP tasks, including semantic role labelling and machine translation (Bastings et al., 2017). However, the application of GNNs to the coreference resolution task is less explored. Xu and Yang (2019) adopted dependency syntax to improve gendered pronoun resolution. However, they did not evaluate their model on larger datasets or examine whether syntax features are still useful for general coreference resolution. In this paper, we utilise not only syntactic but also semantic features, and we show that both contribute to a significant improvement over a strong baseline on a large standard dataset. There are many GNN variants. The Graph Convolutional Network (GCN) (Kipf and Welling, 2017) is the most widely used one and has been shown to benefit a number of NLP tasks. However, it lacks the ability to model different edge properties such as directions and edge types. Although the Relational Graph Convolutional Network (RGCN) (Schlichtkrull et al., 2017) was proposed to tackle this problem, representing edge information as label-wise parameters makes it suffer from over-parameterisation even for small label vocabularies. In this work, we use a graph encoder based on the Graph Attention Network (GAT) (Veličković et al., 2018) to better capture structural syntax and semantics, as GAT is able to model different types of edges with few parameters.

Conclusion
In this paper, we propose a heterogeneous graph-based model to enhance coreference resolution by effectively leveraging dependency tree structures and SRL semantic features. In particular, nodes of different granularity in the graph propagate and aggregate information to and from neighbour nodes to obtain both syntactically and semantically augmented representations. Moreover, an attention-based mechanism is used to dynamically aggregate the augmented information. Experiments on the OntoNotes 5.0 benchmark confirm the effectiveness of our proposed model, with significant improvements achieved over a strong baseline. Future work will focus on incorporating other features, such as constituency parse trees and WordNet.

A Implementation Details
We utilise the Adam optimizer (Kingma and Ba, 2015) with gradient clipping of 1.0 and a batch size of 1 (a single document) for both base and large models. The SpanBERT-base and large models are fine-tuned using learning rates of 2 × 10^-5 and 1 × 10^-5, with a warmup scheduler over the first 10% of training steps. We use learning rates of 3 × 10^-4 and 5 × 10^-4 for task-related parameters, with linear decay to 0. Training the base model is conducted on a single Nvidia Tesla V100 GPU with 16G of memory, while training the large model requires 32G. Gold features annotated in the OntoNotes 5.0 dataset are used in the experiments. We use the Stanford CoreNLP toolkit (Manning et al., 2014) to convert the annotated constituent trees into Stanford dependency trees (de Marneffe and Manning, 2008). SRL labels are organized in the form of triples (p, a, l), which refer to the predicate, argument and label, respectively.