Difference between revisions of "Combinatory Categorial Grammar"

From ACL Wiki
(37 intermediate revisions by 4 users not shown)
 
== Introduction ==
 
+
[[Image:Steedman2001.jpg|thumb|right|Steedman (2001): The Syntactic Process. MIT Press]]
 +
 
 +
Combinatory Categorial Grammar (CCG) is an efficiently parseable, yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure.
 +
CCG relies on combinatory logic, which has the same expressive power as the lambda calculus, but builds its expressions differently.
 +
 
 +
The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi. More recent proponents of the approach include Pauline Jacobson and Jason Baldridge.
 +
For example, the [[combinator]] B (the compositor) is useful in creating long-distance dependencies, as in "Who do you think Mary is talking about?" and the combinator W (the duplicator) is useful as the lexical interpretation of reflexive pronouns, as in "Mary talks about herself". Together with I (the identity mapping) and C (the permutator) these form a set of primitive, non-interdefinable combinators. Jacobson interprets personal pronouns as the combinator I, and their binding is aided by a complex combinator Z, as in "Mary lost her way". Z is definable using W and B.
 +
 
 +
CCG is known to define the same language class as tree-adjoining grammar, linear indexed grammar, and head grammar, and is said to be mildly context-sensitive.
 +
 
 +
One of the key publications on CCG is ''The Syntactic Process'' by Mark Steedman. Several efficient parsers are available for CCG.
  
 
== Software ==
 
+
=== OpenCCG: The OpenNLP CCG library ===
  
 
[http://openccg.sourceforge.net OpenCCG], the [http://opennlp.sf.net OpenNLP] CCG Library,
is an open source natural language processing library written in
Java, which provides parsing and realization services based on Mark Steedman's
Combinatory Categorial Grammar (CCG) formalism.
 +
The library makes use of multi-modal extensions to CCG developed by
 +
[http://comp.ling.utexas.edu/jbaldrid Jason Baldridge] as part of the [http://grok.sourceforge.net/ Grok] system
 +
(the precursor to OpenCCG). Current development efforts, led by [http://www.ling.ohio-state.edu/~mwhite/ Michael White], are focused on making the realizer practical to use in dialogue systems. For the latest news about OpenCCG, check out the
 +
[http://openccg.sourceforge.net/ OpenCCG page] on SourceForge. You can also look at [http://comp.ling.utexas.edu/wiki/doku.php/openccg/projects_using_openccg some of the projects using OpenCCG].
 +
 
 +
The VisCCG tool for working with grammars is available as part of the OpenCCG source code and distribution. There are [http://comp.ling.utexas.edu/wiki/doku.php/openccg several tutorials available for learning how to edit grammars with VisCCG and use OpenCCG].
 +
 
 +
=== The C&C Parser and Supertagger ===
 +
 
 +
The [http://svn.ask.it.usyd.edu.au/trac/candc/wiki C&C CCG parser and supertagger] form
 +
part of the language processing tools developed by James Curran and Stephen Clark.
 +
The tools are written in C++ and have been designed to be efficient enough for large-scale NLP tasks.
 +
 
 +
=== StatCCG ===
 +
 
 +
StatCCG is a statistical CCG parser (trained on CCGbank) written by Julia Hockenmaier. Executables are available [http://www.cis.upenn.edu/~juliahr/Parser/index.html here].
 +
 
 +
=== Boxer ===
  
+
[http://svn.ask.it.usyd.edu.au/trac/candc/wiki/boxer Boxer] is developed by [http://homepages.inf.ed.ac.uk/jbos/ Johan Bos] and generates formal semantic representations for CCG grammars. Boxer takes as input CCG (Combinatory Categorial Grammar) derivations and produces DRSs (Discourse Representation Structures, from Hans Kamp's Discourse Representation Theory) as output. It is distributed with the C&C tools. Boxer produces standard DRS syntax, uses a neo-Davidsonian analysis for events (with thematic roles from VerbNet), incorporates Van der Sandt's algorithm for presupposition, is 100% compatible with first-order logic (FOL), and normalises cardinal and date expressions. DRSs can be generated in various output formats: resolved or underspecified, in Prolog or XML, flattened or recursive structures, with discourse referents represented by Prolog atoms or variables, and with pretty printed DRSs or not. It is also possible to output FOL formulas translated from the DRSs.
 +
 
 +
== Treebanks for CCG ==
 +
 
 +
=== CCGbank ===
 +
 
 +
CCGbank is a translation of the [http://www.cis.upenn.edu/~treebank/home.html Penn Treebank]
 +
into a corpus of Combinatory Categorial Grammar derivations, created by [http://www.cis.upenn.edu/~juliahr Julia Hockenmaier] and [http://www.inf.ed.ac.uk/~steedman Mark Steedman]. You can get it [http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2005T13 here] from the [http://www.ldc.upenn.edu Linguistic Data Consortium]. You can also have a look at this [http://www.cis.upenn.edu/~juliahr/CCGbankDemo demo] of the HTML version included in the LDC distribution.
 +
 
 +
CCGbank pairs syntactic derivations with sets of word-word dependencies which
 +
approximate the underlying predicate-argument structure.
 +
The translation process and linguistic analyses are explained in the [http://www.cis.upenn.edu/~juliahr/Papers/CCGbank/CCGbankManual.pdf manual].
 +
CCGbank contains 99.44% of the sentences in the Penn Treebank, for
 +
which it corrects a number of inconsistencies and errors in the
 +
original annotation.
 +
 
 +
The LDC distribution also contains machine-readable versions
 +
of the data, which contain the syntactic derivations and the corresponding lists of word-word dependencies,
 +
as well as a file that is searchable by [http://tedlab.mit.edu/~dr/ Doug Rohde]'s [http://tedlab.mit.edu/~dr/TGrep2/index.html TGrep2] (version 1.15).
 +
 
 +
In all versions, the file structure corresponds exactly to that of the original Treebank.
 +
 
 +
=== The Groningen Meaning Bank ===
 +
 
 +
The [http://gmb.let.rug.nl Groningen Meaning Bank] is an annotated corpus of public domain texts. Version 1.0 comprises 1,000 texts with CCG analyses for each sentence and semantic representations for each text.
  
 
== Publications ==
 
 +
 +
This is a very incomplete list of publications. Follow the links to homepages in the People section to see more, and see the [http://groups.inf.ed.ac.uk/ccg/publications.html CCG site publications page].
 +
 +
=== Books ===
 +
 +
* Curry, Haskell B. and Richard Feys (1958), Combinatory Logic, Vol. 1. North-Holland.
 +
 +
* Steedman, Mark (1996), Surface Structure and Interpretation. The MIT Press.
 +
 +
* Steedman, Mark (2000), The Syntactic Process. The MIT Press.
 +
 +
=== Journal Articles ===
 +
 +
* Jacobson, Pauline (1999), “Towards a variable-free semantics.” Linguistics and Philosophy 22, 117-184.
 +
 +
* Steedman, Mark  (1987), “Combinatory grammars and parasitic gaps”. Natural Language and Linguistic Theory 5, 403-439.
 +
 +
=== Articles in books or collections ===
 +
 +
* Szabolcsi, Anna (1989), "Bound variables in syntax (are there any?)." Semantics and Contextual Expression, ed. by Bartsch, van Benthem, and van Emde Boas. Foris, 294-318.
 +
 +
* Szabolcsi, Anna (1992), "Combinatory grammar and projection from the lexicon." Lexical Matters. CSLI Lecture Notes 24, ed. by Sag and Szabolcsi. Stanford, CSLI Publications. 241-269.
 +
 +
* Szabolcsi, Anna (2003), “Binding on the fly: Cross-sentential anaphora in variable-free semantics”. Resource Sensitivity in Binding and Anaphora, ed. by Kruijff and Oehrle.  Kluwer, 215-229.
 +
 +
=== Conference and workshop papers ===
 +
 +
* Jason Baldridge and Geert-Jan Kruijff. 2002. [http://www.aclweb.org/anthology-new/P/P02/P02-1041.pdf Coupling CCG with Hybrid Logic Dependency Semantics]. In Proceedings of ACL 2002.
 +
 +
* Jason Baldridge and Geert-Jan Kruijff. 2003. [http://www.aclweb.org/anthology-new/E/E03/E03-1036.pdf Multi-Modal Combinatory Categorial Grammar]. In Proceedings of EACL 2003.
 +
 +
* Jason Baldridge, Sudipta Chatterjee, Alexis Palmer, and Ben Wing. 2007. [http://comp.ling.utexas.edu/jbaldrid/papers/baldridge_etal_geaf07.pdf DotCCG and VisCCG: Wiki and Programming Paradigms for Improved Grammar Engineering with OpenCCG]. In Proceedings of the Workshop on Grammar Engineering Across Frameworks. Stanford, CA.
 +
 +
* Jason Baldridge. 2008. [http://comp.ling.utexas.edu/jbaldrid/papers/BaldridgeColing08.pdf Weakly supervised supertagging with grammar-informed initialization]. In Proceedings of COLING-2008. Manchester, UK.
 +
 +
* Fred Hoyt and Jason Baldridge. 2008. [http://aclweb.org/anthology-new/P/P08/P08-1038.pdf A Logical Basis for the D combinator and normal form constraints in Combinatory Categorial Grammar]. In Proceedings of ACL/HLT-2008. Columbus, OH.
 +
* Geert-Jan Kruijff and Jason Baldridge. 2004. [http://www.aclweb.org/anthology-new/C/C04/C04-1028.pdf Generalizing Dimensionality in Combinatory Categorial Grammar]. Proceedings of COLING 2004.
 +
 +
* Mike White and Jason Baldridge. 2003. [http://comp.ling.utexas.edu/jbaldrid/papers/White-Baldridge-ENLG-2003 Adapting Chart Realization to CCG]. In Proceedings of ENLG 2003.
 +
 +
=== Dissertations and Masters Theses ===
 +
 +
* Baldridge, Jason (2002). [http://comp.ling.utexas.edu/jbaldrid/papers/dissertation.html "Lexically Specified Derivational Control in Combinatory Categorial Grammar."] PhD Dissertation. Univ. of Edinburgh.
 +
 +
* Gann Bierner (2001). Alternative Phrases: Theoretical Analysis and Practical Applications, PhD thesis, University of Edinburgh.
 +
 +
* Julia Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar, PhD thesis, University of Edinburgh.
 +
 +
* Beryl Hoffman. 1995. Computational Analysis of the Syntax and Interpretation of ‘Free’ Word-order in Turkish. Ph.D. thesis, University of Pennsylvania. IRCS Report 95-17.
 +
 +
* Nobo Komagata. 1999. [http://nobo.komagata.net/thesis A Computational Analysis of Information Structure Using Parallel Expository Texts in English and Japanese]. PhD thesis. University of Pennsylvania.
 +
 +
* Mark McConville (2001) Incremental natural language understanding with Combinatory Categorial Grammar. MSc thesis, School of Cognitive Science, Division of Informatics, University of Edinburgh.
 +
 +
* Jong C. Park. 1996. [http://portal.acm.org/citation.cfm?id=923558 A Lexical Theory of Quantification in Ambiguous Query Interpretation]. Ph.D Dissertation, Department of Computer and Information Science, University of Pennsylvania.
  
 
== People ==
 
  
 +
* [http://comp.ling.utexas.edu/jbaldrid Jason Baldridge]
 +
* [http://homepages.inf.ed.ac.uk/jbos/ Johan Bos]
 +
* [http://www.ceng.metu.edu.tr/~bozsahin/ Cem Bozsahin]
 +
* [http://web.comlab.ox.ac.uk/oucl/work/stephen.clark/ Stephen Clark]
 +
* [http://www.it.usyd.edu.au/about/people/staff/james.shtml James Curran]
 +
* [http://www.cis.upenn.edu/~juliahr Julia Hockenmaier]
 
* [http://www.iccs.inf.ed.ac.uk/~steedman/ Mark Steedman]
 
 +
* [http://www.ling.ohio-state.edu/~mwhite/ Michael White]

Revision as of 06:36, 19 January 2012

Introduction

Steedman (2001): The Syntactic Process. MIT Press

Combinatory Categorial Grammar (CCG) is an efficiently parseable, yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure. CCG relies on combinatory logic, which has the same expressive power as the lambda calculus, but builds its expressions differently.

The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi. More recent proponents of the approach include Pauline Jacobson and Jason Baldridge. Individual combinators do specific linguistic work: the combinator B (the compositor) is useful in creating long-distance dependencies, as in "Who do you think Mary is talking about?", and the combinator W (the duplicator) is useful as the lexical interpretation of reflexive pronouns, as in "Mary talks about herself". Together with I (the identity mapping) and C (the permutator) these form a set of primitive, non-interdefinable combinators. Jacobson interprets personal pronouns as the combinator I, and their binding is aided by a complex combinator Z, as in "Mary lost her way". Z is definable using W and B.
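
The combinators named above can be sketched as curried Python functions. This is only an illustration: the toy meanings (talks_about, lost, way_of, you_think, mary_talks_about) and the tuple encoding of predicate-argument structure are assumptions made for the example, not part of any CCG implementation. The definition of Z follows the statement that Z is definable from W and B.

```python
# Curried combinators: each takes one argument at a time.
I = lambda x: x                               # identity:   I x = x
B = lambda f: lambda g: lambda x: f(g(x))     # compositor: B f g x = f (g x)
C = lambda f: lambda x: lambda y: f(y)(x)     # permutator: C f x y = f y x
W = lambda f: lambda x: f(x)(x)               # duplicator: W f x = f x x

# Z f g x = f (g x) x, obtained from W and B alone.
Z = lambda f: lambda g: W(B(f)(g))

# Hypothetical toy meanings: a transitive verb takes its object first,
# then its subject, and returns a predicate-argument tuple.
talks_about = lambda obj: lambda subj: ("talk_about", subj, obj)
lost = lambda obj: lambda subj: ("lose", subj, obj)
way_of = lambda owner: ("way_of", owner)

# "Mary talks about herself": W identifies the verb's two arguments.
reflexive = W(talks_about)("mary")

# "Mary lost her way": Z binds the pronoun inside the object to the subject.
bound = Z(lost)(way_of)("mary")

# "Who do you think Mary is talking about?": B composes the two clauses
# and leaves the extracted argument open.
you_think = lambda p: ("think", "you", p)
mary_talks_about = lambda obj: ("talk_about", "mary", obj)
gap = B(you_think)(mary_talks_about)

print(reflexive)
print(bound)
print(gap("who"))
```

Writing the functions in curried form keeps each combinator a direct transcription of its defining equation, so the W-and-B construction of Z can be checked by simply unfolding the definitions.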

CCG is known to define the same language class as tree-adjoining grammar, linear indexed grammar, and head grammar, and is said to be mildly context-sensitive.

One of the key publications on CCG is The Syntactic Process by Mark Steedman. Several efficient parsers are available for CCG.

Software

OpenCCG: The OpenNLP CCG library

OpenCCG, the OpenNLP CCG Library, is an open source natural language processing library written in Java, which provides parsing and realization services based on Mark Steedman's Combinatory Categorial Grammar (CCG) formalism. The library makes use of multi-modal extensions to CCG developed by Jason Baldridge as part of the Grok system (the precursor to OpenCCG). Current development efforts, led by Michael White, are focused on making the realizer practical to use in dialogue systems. For the latest news about OpenCCG, check out the OpenCCG page on SourceForge. You can also look at some of the projects using OpenCCG.

The VisCCG tool for working with grammars is available as part of the OpenCCG source code and distribution. There are several tutorials available for learning how to edit grammars with VisCCG and use OpenCCG.

The C&C Parser and Supertagger

The C&C CCG parser and supertagger form part of the language processing tools developed by James Curran and Stephen Clark. The tools are written in C++ and have been designed to be efficient enough for large-scale NLP tasks.

StatCCG

StatCCG is a statistical CCG parser (trained on CCGbank) written by Julia Hockenmaier. Executables are available at http://www.cis.upenn.edu/~juliahr/Parser/index.html.

Boxer

Boxer, developed by Johan Bos, generates formal semantic representations from CCG derivations. It takes CCG derivations as input and produces DRSs (Discourse Representation Structures, from Hans Kamp's Discourse Representation Theory) as output. It is distributed with the C&C tools. Boxer produces standard DRS syntax, uses a neo-Davidsonian analysis for events (with thematic roles from VerbNet), incorporates Van der Sandt's algorithm for presupposition, is fully compatible with first-order logic (FOL), and normalises cardinal and date expressions. DRSs can be generated in various output formats: resolved or underspecified, in Prolog or XML, as flattened or recursive structures, with discourse referents represented by Prolog atoms or variables, and with or without pretty printing. It is also possible to output FOL formulas translated from the DRSs.
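
To make the DRS-to-FOL step concrete, here is a minimal sketch of the textbook translation from Discourse Representation Theory. It is not Boxer's code or output format: the class names (DRS, Not, Imp), the string syntax of the formulas, and the role name "agent" are all assumptions made for illustration, and presupposition and underspecification are ignored. Every DRS is assumed to have at least one condition.

```python
from dataclasses import dataclass

@dataclass
class DRS:
    referents: list    # discourse referents, e.g. ["x", "e"]
    conditions: list   # atomic conditions (tuples) or Not / Imp conditions

@dataclass
class Not:
    drs: DRS

@dataclass
class Imp:
    antecedent: DRS
    consequent: DRS

def conjoin(conditions):
    # Conditions of a DRS are interpreted conjunctively.
    return " & ".join(cond_to_fol(c) for c in conditions)

def cond_to_fol(cond):
    if isinstance(cond, Not):
        return "~" + drs_to_fol(cond.drs)
    if isinstance(cond, Imp):
        # Referents of the antecedent become universally quantified.
        prefix = "".join(f"forall {r}." for r in cond.antecedent.referents)
        body = conjoin(cond.antecedent.conditions)
        return f"{prefix}(({body}) -> {drs_to_fol(cond.consequent)})"
    pred, *args = cond                      # atomic condition, e.g. ("agent", "e", "x")
    return f"{pred}({','.join(args)})"

def drs_to_fol(drs):
    # Referents of a plain DRS become existentially quantified.
    prefix = "".join(f"exists {r}." for r in drs.referents)
    return f"{prefix}({conjoin(drs.conditions)})"

# Neo-Davidsonian DRS for "Mary talks": an event e with Mary as its agent.
mary_talks = DRS(["x", "e"], [("mary", "x"), ("talk", "e"), ("agent", "e", "x")])

# "Every farmer walks" as an implicational condition.
every_farmer_walks = DRS([], [Imp(DRS(["x"], [("farmer", "x")]),
                                  DRS(["e"], [("walk", "e"), ("agent", "e", "x")]))])

print(drs_to_fol(mary_talks))
print(drs_to_fol(every_farmer_walks))
```

The event variable e in the first example is what makes the analysis neo-Davidsonian: the verb contributes a predicate over events, and participants are attached through separate role conditions.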

Treebanks for CCG

CCGbank

CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations, created by Julia Hockenmaier and Mark Steedman. It is distributed by the Linguistic Data Consortium as catalog entry LDC2005T13. A demo of the HTML version included in the LDC distribution is available at http://www.cis.upenn.edu/~juliahr/CCGbankDemo.

CCGbank pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure. The translation process and linguistic analyses are explained in the manual. CCGbank contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies and errors in the original annotation.

The LDC distribution also contains machine-readable versions of the data, which contain the syntactic derivations and the corresponding lists of word-word dependencies, as well as a file that is searchable by Doug Rohde's TGrep2 (version 1.15).

In all versions, the file structure corresponds exactly to that of the original Treebank.

The Groningen Meaning Bank

The Groningen Meaning Bank is an annotated corpus of public domain texts. Version 1.0 comprises 1,000 texts with CCG analyses for each sentence and semantic representations for each text.

Publications

This is a very incomplete list of publications. Follow the links to homepages in the People section to see more, and see the CCG site publications page.

Books

  • Curry, Haskell B. and Richard Feys (1958), Combinatory Logic, Vol. 1. North-Holland.
  • Steedman, Mark (1996), Surface Structure and Interpretation. The MIT Press.
  • Steedman, Mark (2000), The Syntactic Process. The MIT Press.

Journal Articles

  • Jacobson, Pauline (1999), “Towards a variable-free semantics.” Linguistics and Philosophy 22, 117-184.
  • Steedman, Mark (1987), “Combinatory grammars and parasitic gaps”. Natural Language and Linguistic Theory 5, 403-439.

Articles in books or collections

  • Szabolcsi, Anna (1989), "Bound variables in syntax (are there any?)." Semantics and Contextual Expression, ed. by Bartsch, van Benthem, and van Emde Boas. Foris, 294-318.
  • Szabolcsi, Anna (1992), "Combinatory grammar and projection from the lexicon." Lexical Matters. CSLI Lecture Notes 24, ed. by Sag and Szabolcsi. Stanford, CSLI Publications. 241-269.
  • Szabolcsi, Anna (2003), “Binding on the fly: Cross-sentential anaphora in variable-free semantics”. Resource Sensitivity in Binding and Anaphora, ed. by Kruijff and Oehrle. Kluwer, 215-229.

Conference and workshop papers

  • Jason Baldridge and Geert-Jan Kruijff. 2002. Coupling CCG with Hybrid Logic Dependency Semantics. In Proceedings of ACL 2002.
  • Jason Baldridge and Geert-Jan Kruijff. 2003. Multi-Modal Combinatory Categorial Grammar. In Proceedings of EACL 2003.
  • Jason Baldridge, Sudipta Chatterjee, Alexis Palmer, and Ben Wing. 2007. DotCCG and VisCCG: Wiki and Programming Paradigms for Improved Grammar Engineering with OpenCCG. In Proceedings of the Workshop on Grammar Engineering Across Frameworks. Stanford, CA.
  • Jason Baldridge. 2008. Weakly supervised supertagging with grammar-informed initialization. In Proceedings of COLING-2008. Manchester, UK.
  • Fred Hoyt and Jason Baldridge. 2008. A Logical Basis for the D combinator and normal form constraints in Combinatory Categorial Grammar. In Proceedings of ACL/HLT-2008. Columbus, OH.
  • Geert-Jan Kruijff and Jason Baldridge. 2004. Generalizing Dimensionality in Combinatory Categorial Grammar. Proceedings of COLING 2004.
  • Mike White and Jason Baldridge. 2003. Adapting Chart Realization to CCG. In Proceedings of ENLG 2003.

Dissertations and Masters Theses

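  • Baldridge, Jason (2002). "Lexically Specified Derivational Control in Combinatory Categorial Grammar." PhD Dissertation. Univ. of Edinburgh.
  • Gann Bierner (2001). Alternative Phrases: Theoretical Analysis and Practical Applications, PhD thesis, University of Edinburgh.
  • Julia Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar, PhD thesis, University of Edinburgh.
  • Beryl Hoffman. 1995. Computational Analysis of the Syntax and Interpretation of ‘Free’ Word-order in Turkish. Ph.D. thesis, University of Pennsylvania. IRCS Report 95-17.
  • Nobo Komagata. 1999. A Computational Analysis of Information Structure Using Parallel Expository Texts in English and Japanese. PhD thesis. University of Pennsylvania.
  • Mark McConville (2001) Incremental natural language understanding with Combinatory Categorial Grammar. MSc thesis, School of Cognitive Science, Division of Informatics, University of Edinburgh.
  • Jong C. Park. 1996. A Lexical Theory of Quantification in Ambiguous Query Interpretation. Ph.D Dissertation, Department of Computer and Information Science, University of Pennsylvania.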

People

  • Jason Baldridge
  • Johan Bos
  • Cem Bozsahin
  • Stephen Clark
  • James Curran
  • Julia Hockenmaier
  • Mark Steedman
  • Michael White