Graph Parsing (State of the Art)

Background and Motivation

Graphs exceeding the formal complexity of rooted trees are of growing relevance to much NLP research. We interpret the term graph parsing broadly as mapping from surface strings to graph-structured target representations, which typically provide some level of syntactico-semantic analysis. Although such graphs are formally well understood in graph theory, there is substantial variation in the types of linguistic graphs, as well as in the interpretation of various structural properties. To provide a common terminology and transparent statistics across different collections of graphs in NLP, we propose to establish a ‘catalogue’ of graph banks and associated parsing results.

We anticipate a bit of a cottage industry in linguistic graph banks and graph processing tasks over the next few years, which may make it difficult to keep track of contentful similarities and differences across frameworks and approaches. This page is intended to stimulate community work towards an up-to-date resource combining the following components: (a) formal definitions of (relevant) structural graph properties; (b) in-depth descriptions of how these apply to different graph banks; (c) constantly growing surveys of graph bank statistics; and (d) a continuously evolving record of state-of-the-art processing results. Of these, components (a) and (b) are provided by Kuhlmann and Oepen (2016)[1], while (c) and (d) are maintained below.

This page was initiated by Marco Kuhlmann and Stephan Oepen, and for the time being (mid-May 2016) is very much a work in progress. We intend to have a first complete draft available for community review by early June 2016.

Software: Graph Analysis Toolkit

An open-source reference implementation of the toolkit that was built to conduct the study of Kuhlmann and Oepen (2016)[1] will be available later this summer.
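That toolkit is not yet public, but most of the structural properties tabulated below reduce to elementary graph computations. The following sketch (in Python, with an illustrative function name and triple format, and not the Kuhlmann and Oepen code) shows one plausible way to derive several of the per-graph properties, taking edge density as the number of edges per node; treewidth and the order-sensitive properties are not covered here, and the definitions used in the published study may differ in detail.

<pre>
# Minimal sketch (not the official toolkit): per-graph statistics of the kind
# tabulated below, for a directed graph given as labelled edge triples plus a
# set of designated top nodes. Property definitions are approximations.
from collections import defaultdict

def statistics(nodes, edges, tops):
    """nodes: iterable of node ids; edges: list of (source, label, target)
    triples; tops: collection of node ids marked as top."""
    nodes = set(nodes)
    out = defaultdict(set)
    indegree = {n: 0 for n in nodes}
    for src, _, tgt in edges:
        out[src].add(tgt)
        indegree[tgt] += 1

    roots = {n for n in nodes if indegree[n] == 0}      # no incoming edges
    reentrant = {n for n in nodes if indegree[n] > 1}   # more than one incoming edge

    # Connectedness is checked on the underlying undirected graph.
    undirected = defaultdict(set)
    for src, _, tgt in edges:
        undirected[src].add(tgt)
        undirected[tgt].add(src)
    seen, stack = set(), [next(iter(nodes))] if nodes else []
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(undirected[node] - seen)
    connected = seen == nodes

    # Cycle detection by iterative depth-first search with three colours.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in nodes}
    cyclic = False
    for start in nodes:
        if colour[start] != WHITE:
            continue
        colour[start] = GREY
        dfs = [(start, iter(out[start]))]
        while dfs:
            node, successors = dfs[-1]
            for nxt in successors:
                if colour[nxt] == GREY:        # back edge: directed cycle
                    cyclic = True
                elif colour[nxt] == WHITE:
                    colour[nxt] = GREY
                    dfs.append((nxt, iter(out[nxt])))
                    break
            else:
                colour[node] = BLACK
                dfs.pop()

    num_nodes = len(nodes)
    return {
        "edge density": len(edges) / num_nodes if num_nodes else 0.0,
        "is tree": connected and not cyclic and len(roots) == 1 and not reentrant,
        "reentrant nodes (%)": 100.0 * len(reentrant) / num_nodes if num_nodes else 0.0,
        "cyclic": cyclic,
        "connected": connected,
        "multi-rooted": len(roots) > 1,
        "non-top roots (%)": 100.0 * len(roots - set(tops)) / len(roots) if roots else 0.0,
    }
</pre>

For example, statistics({"a", "b", "c"}, [("a", "ARG0", "b"), ("a", "ARG1", "c")], {"a"}) describes a rooted tree with an edge density of 2/3, no reentrant nodes, and no non-top roots.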

AMR: Abstract Meaning Representation

Abstract Meaning Representation (AMR) eschews explicit syntactic derivations and consideration of the syntax–semantics interface; rather, it seeks to directly annotate “whole-sentence logical meanings” (Banarescu et al. 2013[2]). Node labels in AMR name abstract concepts, which in large part draw on the ontology of OntoNotes predicate senses and their corresponding semantic roles. Nodes are not overtly related to surface lexical units and thus are unordered. Although AMR has its roots in semantic networks and earlier knowledge representation approaches (Langkilde and Knight 1998[3]), larger-scale manual AMR annotation is a comparatively recent development. We sample two variants of AMR, viz. (a) the graphs as annotated in AMRBank 1.0 (LDC2014T12), and (b) a normalized version that we call AMR−1, in which so-called “inverse roles” (like ARG0-of) are reversed. Such inverted edges are frequently used in AMR to render the graph as a single rooted structure, where the root is interpreted as the top-level focus. (The graph bank is natively constructed and released with inverted edges, but for parser evaluation the AMR−1 normalization is typically assumed; our conversion builds on the code of Cai and Knight (2013)[4].) In the context of this comparison, we map this interpretation to a general concept of “top nodes” for both AMR and AMR−1. Flanigan et al. (2014)[5] published the first parser targeting AMR, and the state of the art has been updated repeatedly since.
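To make the AMR−1 normalization concrete, the sketch below reverses inverse-role edges in a list of AMR triples. It is only a simplified illustration of the idea, not the conversion actually used (which builds on the Smatch code of Cai and Knight (2013)[4]); the function name and triple format are assumptions, and a faithful conversion must special-case the handful of role names that end in “-of” without being inverse roles.

<pre>
# Simplified sketch of the AMR−1 normalization: an edge with an inverse role
# such as ARG0-of is replaced by an edge in the opposite direction carrying
# the plain role (ARG0). Caveat: a few AMR roles end in "-of" without being
# inverse roles and would need to be excluded in a faithful conversion.
def normalize_inverse_roles(triples):
    """triples: list of (source, role, target) edges of a single AMR graph."""
    normalized = []
    for source, role, target in triples:
        if role.endswith("-of"):
            normalized.append((target, role[:-len("-of")], source))
        else:
            normalized.append((source, role, target))
    return normalized

# The inverted edge (boy, ARG0-of, want-01) becomes (want-01, ARG0, boy).
print(normalize_inverse_roles([("boy", "ARG0-of", "want-01")]))
</pre>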

{|
! style="text-align:left;"| Property
! style="text-align:center;"| AMR
! style="text-align:center;"| AMR<sup>−1</sup>
|-
|number of graphs
|style="text-align:right;"|10309
|style="text-align:right;"|10309
|-
|average number of tokens
|style="text-align:right;"|20.62
|style="text-align:right;"|20.62
|-
|average number of nodes per token
|style="text-align:right;"|0.67
|style="text-align:right;"|0.67
|-
|number of edge labels
|style="text-align:right;"|135
|style="text-align:right;"|100
|-
|percentage of graphs that are trees
|style="text-align:right;"|52.48
|style="text-align:right;"|18.60
|-
|percentage of graphs with treewidth one
|style="text-align:right;"|52.72
|style="text-align:right;"|52.72
|-
|average treewidth
|style="text-align:right;"|1.524
|style="text-align:right;"|1.524
|-
|maximal treewidth
|style="text-align:right;"|4
|style="text-align:right;"|4
|-
|average edge density
|style="text-align:right;"|1.065
|style="text-align:right;"|1.065
|-
|percentage of nodes that are reentrant
|style="text-align:right;"|5.23
|style="text-align:right;"|18.95
|-
|percentage of graphs that are cyclic
|style="text-align:right;"|3.15
|style="text-align:right;"|0.71
|-
|percentage of graphs that are not connected
|style="text-align:right;"|0.00
|style="text-align:right;"|0.00
|-
|percentage of graphs that are multi-rooted
|style="text-align:right;"|0.00
|style="text-align:right;"|77.50
|-
|percentage of non-top roots
|style="text-align:right;"|47.78
|style="text-align:right;"|19.39
|}
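Two rows in this table refer to treewidth, which is defined on the underlying undirected graph; in particular, a non-trivial graph has treewidth one exactly when that underlying graph is a forest. The following sketch checks just this special case, reusing the illustrative triple format assumed above; computing treewidth in general requires considerably more machinery.

<pre>
# Treewidth-one check: ignoring edge directions (and multi-edges and
# self-loops, which do not affect treewidth), a graph has treewidth at most
# one iff its underlying undirected graph is acyclic, i.e. a forest.
def has_treewidth_at_most_one(nodes, edges):
    parent = {n: n for n in nodes}          # union-find over the undirected skeleton

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    seen = set()
    for src, _, tgt in edges:
        if src == tgt or (src, tgt) in seen or (tgt, src) in seen:
            continue                        # skip self-loops and parallel edges
        seen.add((src, tgt))
        a, b = find(src), find(tgt)
        if a == b:
            return False                    # this edge closes an undirected cycle
        parent[a] = b
    return True
</pre>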

CCD: Combinatory Categorial Grammar Dependencies

Hockenmaier and Steedman (2007)[6] construct CCGbank by combining a careful interpretation of the syntactic annotations in the PTB with additional, manually curated lexical and constructional knowledge. In CCGbank (LDC2005T13), the strings of the venerable PTB Wall Street Journal (WSJ) corpus are annotated with pairs of (a) CCG syntactic derivations and (b) sets of semantic bi-lexical dependency triples, which we term CCD. The latter “include most semantically relevant non-anaphoric local and long-range dependencies” and are suggested by the CCGbank creators as a proxy for predicate–argument structure. While CCD has mainly been used for contrastive parser evaluation (Clark and Curran [2007][7], Fowler and Penn [2010][8]; inter alios), there is current work that views each set of triples as a directed graph and parses directly into these target representations (Du, Sun, and Wan 2015[9]).

{|
! style="text-align:left;"| Property
! style="text-align:center;"| CCD
|-
|number of graphs
|style="text-align:right;"|39604
|-
|average number of tokens
|style="text-align:right;"|23.47
|-
|average number of nodes per token
|style="text-align:right;"|0.88
|-
|number of edge labels
|style="text-align:right;"|6
|-
|percentage of graphs that are trees
|style="text-align:right;"|1.45
|-
|percentage of graphs with treewidth one
|style="text-align:right;"|29.27
|-
|average treewidth
|style="text-align:right;"|1.742
|-
|maximal treewidth
|style="text-align:right;"|5
|-
|average edge density
|style="text-align:right;"|1.070
|-
|percentage of nodes that are reentrant
|style="text-align:right;"|28.09
|-
|percentage of graphs that are cyclic
|style="text-align:right;"|1.28
|-
|percentage of graphs that are not connected
|style="text-align:right;"|12.53
|-
|percentage of graphs that are multi-rooted
|style="text-align:right;"|99.67
|-
|percentage of non-top roots
|style="text-align:right;"|47.78
|-
|average edge length
|style="text-align:right;"|2.582
|-
|percentage of graphs that are noncrossing
|style="text-align:right;"|48.23
|-
|percentage of graphs with pagenumber at most two
|style="text-align:right;"|98.64
|}
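Several of the properties above depend on the linear order of the tokens: a graph over token positions is noncrossing if no two of its edges cross when drawn above the sentence, and it has pagenumber at most two if its edges can be split into two noncrossing sets while keeping the sentence order as the spine. The sketch below checks both conditions under this reading of the definitions (the published statistics may rest on slightly different formalizations); edges are given as pairs of token positions, and direction and labels are irrelevant for crossings.

<pre>
# Illustrative checks of two order-sensitive properties: whether a graph over
# token positions is noncrossing, and whether it fits on at most two pages
# when the spine order is the sentence order.
from itertools import combinations

def crosses(e, f):
    """Two edges cross iff their endpoints strictly interleave in the token order."""
    (a, b), (c, d) = sorted(e), sorted(f)
    return a < c < b < d or c < a < d < b

def is_noncrossing(edges):
    return not any(crosses(e, f) for e, f in combinations(edges, 2))

def has_pagenumber_at_most_two(edges):
    """With the spine fixed, two pages suffice iff the conflict graph
    (edges as vertices, crossings as adjacencies) is two-colourable."""
    conflicts = {i: set() for i in range(len(edges))}
    for i, j in combinations(range(len(edges)), 2):
        if crosses(edges[i], edges[j]):
            conflicts[i].add(j)
            conflicts[j].add(i)
    colour = {}
    for start in range(len(edges)):
        if start in colour:
            continue
        colour[start] = 0
        stack = [start]
        while stack:
            i = stack.pop()
            for j in conflicts[i]:
                if j not in colour:
                    colour[j] = 1 - colour[i]
                    stack.append(j)
                elif colour[j] == colour[i]:
                    return False               # odd cycle of crossings: needs a third page
    return True

# Edges as (head position, dependent position).
example = [(1, 4), (2, 3), (3, 5)]
print(is_noncrossing(example))              # False: (1, 4) and (3, 5) cross
print(has_pagenumber_at_most_two(example))  # True: move (3, 5) to a second page
</pre>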

EDS: Elementary Dependency Structures

SDP: Semantic Dependency Parsing

UCCA: Universal Conceptual Cognitive Annotation

Universal Conceptual Cognitive Annotation (UCCA) is a novel semantic approach to grammatical representation. It was developed in the Computational Linguistics Lab of the Computer Science Department of the Hebrew University by Omri Abend and Ari Rappoport[10].

The central idea of the project is to analyze and annotate natural languages using purely semantic categories and structure (a graph). Syntactic categories and structure are automatically deduced from the semantic ones using learning algorithms. The basic set of semantic categories (the foundational layer) is inspired by work in linguistic typology, cognitive grammar, and neuroscience.

Hershcovich et al. (2017)[11] introduced the first parser for UCCA, a transition-based parser built on BiLSTM feature representations.


{|
! style="text-align:left;"| Property
! style="text-align:center;"| UCCA Wiki
! style="text-align:center;"| UCCA Wiki (sentences)
! style="text-align:center;"| UCCA 20K Leagues
! style="text-align:center;"| UCCA 20K Leagues (sentences)
|-
|number of graphs
|style="text-align:right;"|367
|style="text-align:right;"|5225
|style="text-align:right;"|154
|style="text-align:right;"|506
|-
|average number of tokens
|style="text-align:right;"|431.7
|style="text-align:right;"|30.32
|style="text-align:right;"|114.24
|style="text-align:right;"|34.77
|-
|average number of nodes per token
|style="text-align:right;"|1.89
|style="text-align:right;"|1.89
|style="text-align:right;"|1.67
|style="text-align:right;"|1.67
|-
|number of edge labels
|style="text-align:right;"|12
|style="text-align:right;"|12
|style="text-align:right;"|12
|style="text-align:right;"|12
|-
|percentage of graphs that are trees
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|-
|average edge density
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|-
|percentage of nodes that are reentrant
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|-
|percentage of graphs that are cyclic
|style="text-align:right;"|0
|style="text-align:right;"|0
|style="text-align:right;"|0
|style="text-align:right;"|0
|-
|percentage of graphs that are not connected
|style="text-align:right;"|0
|style="text-align:right;"|0
|style="text-align:right;"|0
|style="text-align:right;"|0
|-
|percentage of graphs that are multi-rooted
|style="text-align:right;"|0
|style="text-align:right;"|0
|style="text-align:right;"|0
|style="text-align:right;"|0
|-
|percentage of non-top roots
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|-
|percentage of graphs that are noncrossing
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|style="text-align:right;"|
|}

References

  1. Marco Kuhlmann and Stephan Oepen. Towards a Catalogue of Linguistic Graph Banks. Computational Linguistics, 2016. In press. Preprint: http://www.mn.uio.no/ifi/english/people/aca/oe/cl.pdf
  2. Banarescu, Laura, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, Sofia, Bulgaria, August.
  3. Langkilde, Irene and Kevin Knight. 1998. Generation that exploits corpus-based statistical knowledge. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Meeting of the Association for Computational Linguistics, pages 704–710, Montréal, Canada.
  4. Cai, Shu and Kevin Knight. 2013. Smatch. An evaluation metric for semantic feature structures. In Proceedings of the 51st Meeting of the Association for Computational Linguistics, pages 748–752, Sofia, Bulgaria, August.
  5. Flanigan, Jeffrey, Sam Thomson, Jaime Carbonell, Chris Dyer, and Noah A. Smith. 2014. A discriminative graph-based parser for the Abstract Meaning Representation. In Proceedings of the 52nd Meeting of the Association for Computational Linguistics, pages 1426–1436, Baltimore, MD, USA, June.
  6. Hockenmaier, Julia and Mark Steedman. 2007. CCGbank. A corpus of CCG derivations and dependency structures extracted from the Penn Treebank. Computational Linguistics, 33:355–396.
  7. Clark, Stephen and James R. Curran. 2007. Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4):493–552.
  8. Fowler, Timothy A. D. and Gerald Penn. 2010. Accurate context-free parsing with Combinatory Categorial Grammar. In Proceedings of the 48th Meeting of the Association for Computational Linguistics, pages 335–344, Uppsala, Sweden.
  9. Du, Yantao, Weiwei Sun, and Xiaojun Wan. 2015. A data-driven, factorization parser for CCG dependency structures. In Proceedings of the 53rd Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 1545–1555, Beijing, China.
  10. Abend, Omri and Ari Rappoport. 2013. Universal Conceptual Cognitive Annotation (UCCA). In Proceedings of the 51st Meeting of the Association for Computational Linguistics, Sofia, Bulgaria, August.
  11. Hershcovich, Daniel, Omri Abend, and Ari Rappoport. 2017. A Transition-Based Directed Acyclic Graph Parser for UCCA. In Proceedings of the 55th Meeting of the Association for Computational Linguistics, Vancouver, Canada.