Combinatory Categorial Grammar
- 1 Introduction
- 2 Software
- 3 Treebanks for CCG
- 4 Publications
- 5 People
Combinatory Categorial Grammar (CCG) is an efficiently parseable yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure. CCG relies on combinatory logic, which has the same expressive power as the lambda calculus but builds its expressions from a small set of combinators rather than from variable binding and abstraction.
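As a rough illustration of this syntax-semantics interface, the following Python sketch pairs each word with a category and a meaning, and lets function application combine both at once. The names Functor, Sign, forward_apply and backward_apply, as well as the toy lexicon, are assumptions made for this example; they are not taken from OpenCCG, the C&C tools, or any other CCG software.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Functor:
    """A complex category such as S\\NP or (S\\NP)/NP."""
    result: "Category"
    slash: str            # "/" looks for its argument to the right, "\\" to the left
    argument: "Category"


Category = Union[str, Functor]   # atomic categories are plain strings like "NP"


@dataclass
class Sign:
    """A category paired with its meaning (hypothetical helper type)."""
    category: Category
    semantics: object


def forward_apply(left: Sign, right: Sign) -> Optional[Sign]:
    """Forward application:  X/Y  Y  =>  X"""
    c = left.category
    if isinstance(c, Functor) and c.slash == "/" and c.argument == right.category:
        return Sign(c.result, left.semantics(right.semantics))
    return None


def backward_apply(left: Sign, right: Sign) -> Optional[Sign]:
    """Backward application:  Y  X\\Y  =>  X"""
    c = right.category
    if isinstance(c, Functor) and c.slash == "\\" and c.argument == left.category:
        return Sign(c.result, right.semantics(left.semantics))
    return None


# Toy lexicon: a transitive verb has category (S\NP)/NP and takes its
# object first, then its subject.
mary = Sign("NP", "mary")
john = Sign("NP", "john")
likes = Sign(
    Functor(Functor("S", "\\", "NP"), "/", "NP"),
    lambda obj: lambda subj: ("like", subj, obj),
)

verb_phrase = forward_apply(likes, john)      # category S\NP
sentence = backward_apply(mary, verb_phrase)  # category S
print(sentence.category, sentence.semantics)
```

Each syntactic step has exactly one semantic counterpart (applying the functor's meaning to the argument's meaning), so the predicate-argument structure falls out of the derivation itself.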
The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi; more recent proponents of the approach include Pauline Jacobson and Jason Baldridge. Individual combinators do specific linguistic work. The combinator B (the compositor) is useful in creating long-distance dependencies, as in "Who do you think Mary is talking about?", and the combinator W (the duplicator) is useful as the lexical interpretation of reflexive pronouns, as in "Mary talks about herself". Together with I (the identity mapping) and C (the permutator), these form a set of primitive, non-interdefinable combinators. Jacobson interprets personal pronouns as the combinator I, and their binding is aided by a complex combinator Z, as in "Mary lost her way". Z is definable using W and B.
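The combinators mentioned here can be written down directly as curried functions. The Python sketch below is only an illustration: the toy meanings (talks_about, lost, way_of) and the tuple encoding of predicate-argument structure are assumptions made for the example, and the definition Z = B(BW)B is one way of building Z from W and B, giving Z f g x = f (g x) x.

```python
# Curried combinators: each takes one argument at a time.
I = lambda x: x                              # identity:   I x     = x
B = lambda f: lambda g: lambda x: f(g(x))    # compositor: B f g x = f (g x)
C = lambda f: lambda x: lambda y: f(y)(x)    # permutator: C f x y = f y x
W = lambda f: lambda x: f(x)(x)              # duplicator: W f x   = f x x

# Z built only from W and B, so that Z f g x = f (g x) x.
Z = B(B(W))(B)

# Toy meanings (hypothetical): transitive verbs take the object first,
# then the subject, and return a predicate-argument tuple.
talks_about = lambda obj: lambda subj: ("talk_about", subj, obj)
lost = lambda obj: lambda subj: ("lose", subj, obj)
way_of = lambda owner: ("way_of", owner)

# "Mary talks about herself": the reflexive is interpreted as W, which
# feeds the subject to the verb twice.
reflexive = W(talks_about)("mary")

# "Mary lost her way": the pronoun is I, so "her way" is the function
# B(way_of)(I); Z binds its open argument to the subject of the verb.
her_way = B(way_of)(I)
bound = Z(lost)(her_way)("mary")

print(reflexive)
print(bound)
```

In both cases the binding is done by a combinator applied to the verb meaning, with no variables or indices in the representation of the pronoun itself.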
CCG is known to define the same language class as tree-adjoining grammar, linear indexed grammar, and head grammar, and is said to be mildly context-sensitive.
One of the key publications on CCG is The Syntactic Process by Mark Steedman. Several efficient parsers are available for CCG.
OpenCCG: The OpenNLP CCG library
OpenCCG, the OpenNLP CCG Library, is an open source natural language processing library written in Java, which provides parsing and realization services based on Mark Steedman's Combinatory Categorial Grammar (CCG) formalism. The library makes use of multi-modal extensions to CCG developed by Jason Baldridge as part of the Grok system (the precursor to OpenCCG). Current development efforts, led by Michael White, are focused on making the realizer practical to use in dialogue systems. For the latest news about OpenCCG, check out the OpenCCG page on SourceForge. You can also look at some of the projects using OpenCCG.
The VisCCG tool for working with grammars is available as part of the OpenCCG source code and distribution. There are several tutorials available for learning how to edit grammars with VisCCG and use OpenCCG.
The C&C Parser and Supertagger
The C&C CCG parser and supertagger form part of the language processing tools developed by James Curran and Stephen Clark. The tools are written in C++ and have been designed to be efficient enough for large-scale NLP tasks.
StatCCG is a statistical CCG parser (trained on CCGbank) written by Julia Hockenmaier. Executables are available for download.
Boxer, developed by Johan Bos, generates formal semantic representations from CCG derivations. It takes CCG derivations as input and produces DRSs (Discourse Representation Structures, from Hans Kamp's Discourse Representation Theory) as output, and it is distributed with the C&C tools. Boxer produces standard DRS syntax, uses a neo-Davidsonian analysis for events (with thematic roles from VerbNet), incorporates Van der Sandt's algorithm for presupposition, is fully compatible with first-order logic (FOL), and normalises cardinal and date expressions. DRSs can be generated in various output formats: resolved or underspecified, in Prolog or XML, as flattened or recursive structures, with discourse referents represented by Prolog atoms or variables, and with or without pretty printing. It is also possible to output FOL formulas translated from the DRSs.
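To give a feel for what translating a DRS into first-order logic involves, here is a minimal Python sketch of the textbook translation: the discourse referents of a DRS become existential quantifiers over the conjunction of its conditions, and an implication between two DRSs becomes a universally quantified conditional. This is not Boxer's implementation or its output format; the DRS and Imp classes, the function names and the formula syntax are all assumptions made for this example, and only atomic conditions and implication are covered.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class DRS:
    """A box: discourse referents plus conditions (hypothetical encoding)."""
    referents: List[str]
    conditions: List["Condition"]


@dataclass
class Imp:
    """An implicative condition K1 => K2 between two DRSs."""
    antecedent: DRS
    consequent: DRS


Atom = Tuple[str, ...]            # e.g. ("walk", "e") or ("Agent", "e", "x")
Condition = Union[Atom, Imp]


def cond_to_fol(cond: Condition) -> str:
    if isinstance(cond, Imp):
        # K1 => K2 becomes: all refs(K1).(conds(K1) -> fol(K2))
        body = " & ".join(cond_to_fol(c) for c in cond.antecedent.conditions)
        formula = f"({body} -> {drs_to_fol(cond.consequent)})"
        for ref in reversed(cond.antecedent.referents):
            formula = f"all {ref}.{formula}"
        return formula
    pred, *args = cond
    return f"{pred}({','.join(args)})"


def drs_to_fol(drs: DRS) -> str:
    # A DRS becomes: exists refs.(cond1 & cond2 & ...)
    formula = "(" + " & ".join(cond_to_fol(c) for c in drs.conditions) + ")"
    for ref in reversed(drs.referents):
        formula = f"exists {ref}.{formula}"
    return formula


# "Mary walks", with a neo-Davidsonian event variable e and an Agent role.
mary_walks = DRS(["x", "e"], [("mary", "x"), ("walk", "e"), ("Agent", "e", "x")])

# "Every man walks", encoded as an implication between two boxes.
every_man_walks = DRS([], [
    Imp(DRS(["x"], [("man", "x")]),
        DRS(["e"], [("walk", "e"), ("Agent", "e", "x")])),
])

print(drs_to_fol(mary_walks))
print(drs_to_fol(every_man_walks))
```

The event variable and the separate Agent condition reflect the neo-Davidsonian style described above, in which thematic roles are stated as their own two-place relations.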
Treebanks for CCG
CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations, created by Julia Hockenmaier and Mark Steedman. It is available from the Linguistic Data Consortium (LDC). A demo of the HTML version included in the LDC distribution is also available online.
CCGbank pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure. The translation process and linguistic analyses are explained in the manual. CCGbank contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies and errors in the original annotation.
The LDC distribution also contains machine-readable versions of the data, which contain the syntactic derivations and the corresponding lists of word-word dependencies, as well as a file that is searchable by Doug Rohde's TGrep2 (version 1.15).
In all versions, the file structure corresponds exactly to that of the original Treebank.
The Groningen Meaning Bank
The Groningen Meaning Bank is an annotated corpus of public domain texts. Version 1.0 comprises 1,000 texts with CCG analyses for each sentence and semantic representations for each text.
This is a very incomplete list of publications. Follow the links to homepages in the People section to see more, and see the CCG site publications page.
- Curry, Haskell B. and Richard Feys (1958), Combinatory Logic, Vol. 1. North-Holland.
- Steedman, Mark (1996), Surface Structure and Interpretation. The MIT Press.
- Steedman, Mark (2000), The Syntactic Process. The MIT Press.
- Jacobson, Pauline (1999), "Towards a variable-free semantics." Linguistics and Philosophy 22, 117-184.
- Steedman, Mark (1987), "Combinatory grammars and parasitic gaps." Natural Language and Linguistic Theory 5, 403-439.
Articles in books or collections
- Szabolcsi, Anna (1989), "Bound variables in syntax (are there any?)." Semantics and Contextual Expression, ed. by Bartsch, van Benthem, and van Emde Boas. Foris, 294-318.
- Szabolcsi, Anna (1992), "Combinatory grammar and projection from the lexicon." Lexical Matters. CSLI Lecture Notes 24, ed. by Sag and Szabolcsi. Stanford, CSLI Publications. 241-269.
- Szabolcsi, Anna (2003), "Binding on the fly: Cross-sentential anaphora in variable-free semantics." Resource Sensitivity in Binding and Anaphora, ed. by Kruijff and Oehrle. Kluwer, 215-229.
Conference and workshop papers
- Jason Baldridge and Geert-Jan Kruijff. 2002. Coupling CCG with Hybrid Logic Dependency Semantics. In Proceedings of ACL 2002.
- Jason Baldridge and Geert-Jan Kruijff. 2003. Multi-Modal Combinatory Categorial Grammar. In Proceedings of EACL 2003.
- Jason Baldridge, Sudipta Chatterjee, Alexis Palmer, and Ben Wing. 2007. DotCCG and VisCCG: Wiki and Programming Paradigms for Improved Grammar Engineering with OpenCCG. In Proceedings of the Workshop on Grammar Engineering Across Frameworks. Stanford, CA.
- Jason Baldridge. 2008. Weakly supervised supertagging with grammar-informed initialization. In Proceedings of COLING-2008. Manchester, UK.
- Fred Hoyt and Jason Baldridge. 2008. A Logical Basis for the D combinator and normal form constraints in Combinatory Categorial Grammar. In Proceedings of ACL/HLT-2008. Columbus, OH.
- Geert-Jan Kruijff and Jason Baldridge. 2004. Generalizing Dimensionality in Combinatory Categorial Grammar. In Proceedings of COLING 2004.
- Mike White and Jason Baldridge. 2003. Adapting Chart Realization to CCG. In Proceedings of ENLG 2003.
Dissertations and Masters Theses
- Jason Baldridge (2002). Lexically Specified Derivational Control in Combinatory Categorial Grammar. PhD thesis, University of Edinburgh.
- Gann Bierner (2001). Alternative Phrases: Theoretical Analysis and Practical Applications. PhD thesis, University of Edinburgh.
- Julia Hockenmaier (2003). Data and Models for Statistical Parsing with Combinatory Categorial Grammar. PhD thesis, University of Edinburgh.
- Beryl Hoffman (1995). Computational Analysis of the Syntax and Interpretation of ‘Free’ Word-order in Turkish. PhD thesis, University of Pennsylvania. IRCS Report 95-17.
- Nobo Komagata (1999). A Computational Analysis of Information Structure Using Parallel Expository Texts in English and Japanese. PhD thesis, University of Pennsylvania.
- Mark McConville (2001). Incremental natural language understanding with Combinatory Categorial Grammar. MSc thesis, School of Cognitive Science, Division of Informatics, University of Edinburgh.
- Jong C. Park (1996). A Lexical Theory of Quantification in Ambiguous Query Interpretation. PhD thesis, Department of Computer and Information Science, University of Pennsylvania.