Combinatory Categorial Grammar

Introduction

Steedman (2001): The Syntactic Process. MIT Press

Combinatory Categorial Grammar (CCG) is an efficiently parseable, yet linguistically expressive grammar formalism. It has a completely transparent interface between surface syntax and underlying semantic representation, including predicate-argument structure, quantification and information structure. CCG relies on combinatory logic, which has the same expressive power as the lambda calculus, but builds its expressions differently.
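The transparent syntax-semantics interface can be illustrated with a minimal sketch of the two application rules (X/Y Y => X and Y X\Y => X), in which every syntactic step performs the matching semantic function application. The names Slash, Sign, forward_apply and backward_apply, and the toy lexicon for "Mary sees John", are assumptions made for this illustration; they are not taken from any CCG toolkit.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Slash:
    """A complex category: result/arg (argument to the right) or result\\arg (to the left)."""
    result: Any
    direction: str  # "/" or "\\"
    arg: Any

@dataclass
class Sign:
    """A category paired with its semantic interpretation (hypothetical container)."""
    cat: Any
    sem: Any

def forward_apply(left, right):
    # Forward application: X/Y  Y  =>  X, with semantics f(a)
    if isinstance(left.cat, Slash) and left.cat.direction == "/" and left.cat.arg == right.cat:
        return Sign(left.cat.result, left.sem(right.sem))
    return None

def backward_apply(left, right):
    # Backward application: Y  X\Y  =>  X, with semantics f(a)
    if isinstance(right.cat, Slash) and right.cat.direction == "\\" and right.cat.arg == left.cat:
        return Sign(right.cat.result, right.sem(left.sem))
    return None

# Toy lexicon (illustrative only): a transitive verb has category (S\NP)/NP.
mary = Sign("NP", "mary")
john = Sign("NP", "john")
sees = Sign(Slash(Slash("S", "\\", "NP"), "/", "NP"),
            lambda obj: lambda subj: ("sees", subj, obj))

vp = forward_apply(sees, john)       # category S\NP
sentence = backward_apply(mary, vp)  # category S
print(sentence.cat, sentence.sem)
```

Each derivation step builds the predicate-argument structure directly, so no separate mapping from syntax to semantics is needed in this sketch.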

The first linguistic and psycholinguistic arguments for basing the grammar on combinators were put forth by Mark Steedman and Anna Szabolcsi. More recent prominent proponents of the approach are Pauline Jacobson and Jason Baldridge. For example, the combinator B (the compositor) is useful in creating long-distance dependencies, as in "Who do you think Mary is talking about?", and the combinator W (the duplicator) is useful as the lexical interpretation of reflexive pronouns, as in "Mary talks about herself". Together with I (the identity mapping) and C (the permutator), these form a set of primitive, non-interdefinable combinators. Jacobson interprets personal pronouns as the combinator I, and their binding is aided by a complex combinator Z, as in "Mary lost her way". Z is definable using W and B.
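The combinators named above can be written as curried functions. The following is a minimal sketch, assuming the standard definitions B f g x = f (g x), W f x = f x x, C f x y = f y x and I x = x, and taking Z to be the function with Z f g x = f (g x) x. Under those assumptions, Z can be built from W and B alone as B (B W) B. The toy predicates talks_about, lost and her_way are hypothetical and serve only to show the resulting predicate-argument structures.

```python
# Primitive combinators as curried Python functions.
B = lambda f: lambda g: lambda x: f(g(x))   # compositor
W = lambda f: lambda x: f(x)(x)             # duplicator
C = lambda f: lambda x: lambda y: f(y)(x)   # permutator
I = lambda x: x                             # identity

# Z f g x = f (g x) x, defined from W and B only: Z = B (B W) B
Z = B(B(W))(B)

# Hypothetical toy semantics: verbs take the object first, then the subject.
talks_about = lambda obj: lambda subj: ("talks_about", subj, obj)
lost = lambda obj: lambda subj: ("lost", subj, obj)
her_way = lambda x: ("way_of", x)           # "her way" as a function of the binder

# "Mary talks about herself": the reflexive duplicates the subject argument.
reflexive = W(talks_about)("mary")

# "Mary lost her way": Z binds the pronoun inside the object to the subject.
bound = Z(lost)(her_way)("mary")

print(reflexive)
print(bound)
```

The reflexive reading fills both argument slots of the verb with the same individual, and the Z reading passes the subject both to the verb and to the pronoun inside the object.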

CCG is known to define the same class of languages as tree-adjoining grammar, linear indexed grammar, and head grammar; all four formalisms are said to be mildly context-sensitive.

One of the key publications on CCG is The Syntactic Process by Mark Steedman. Several efficient parsers are available for CCG.

Software

OpenCCG: The OpenNLP CCG Library

OpenCCG, the OpenNLP CCG Library, is an open source natural language processing library written in Java, which provides parsing and realization services based on Mark Steedman's Combinatory Categorial Grammar (CCG) formalism. The library makes use of multi-modal extensions to CCG developed by Jason Baldridge as part of the Grok system (the precursor to OpenCCG). Current development efforts, led by Michael White, are focused on making the realizer practical to use in dialogue systems. For the latest news about OpenCCG, check out the SourceForge project page.

The C&C Parser and Supertagger

The C&C CCG parser and supertagger form part of the language processing tools developed by James Curran and Stephen Clark. The tools are written in C++ and have been designed to be efficient enough for large-scale NLP tasks.

StatCCG

StatCCG is a statistical CCG parser (trained on CCGbank) written by Julia Hockenmaier. Executables are available for download.

Boxer

Boxer, developed by Johan Bos, generates formal semantic representations from CCG derivations. It takes CCG derivations as input and produces DRSs (Discourse Representation Structures, from Hans Kamp's Discourse Representation Theory) as output. It is distributed with the C&C tools. Boxer produces standard DRS syntax, uses a neo-Davidsonian analysis for events (with thematic roles from VerbNet), incorporates Van der Sandt's algorithm for presupposition, is 100% compatible with first-order logic (FOL), and normalises cardinal and date expressions. DRSs can be generated in various output formats: resolved or underspecified, in Prolog or XML, as flattened or recursive structures, with discourse referents represented by Prolog atoms or variables, and with or without pretty-printing. It is also possible to output FOL formulas translated from the DRSs.

CCGbank

CCGbank is a translation of the Penn Treebank into a corpus of Combinatory Categorial Grammar derivations, created by Julia Hockenmaier and Mark Steedman. It is available from the Linguistic Data Consortium (LDC). A demo of the HTML version included in the LDC distribution is also available.

CCGbank pairs syntactic derivations with sets of word-word dependencies which approximate the underlying predicate-argument structure. The translation process and linguistic analyses are explained in the manual. CCGbank contains 99.44% of the sentences in the Penn Treebank, for which it corrects a number of inconsistencies and errors in the original annotation.

The LDC distribution also contains machine-readable versions of the data, which contain the syntactic derivations and the corresponding lists of word-word dependencies, as well as a file that is searchable by Doug Rohde's TGrep2 (version 1.15).

In all versions, the file structure corresponds exactly to that of the original Treebank.

Publications

TBA
