Training the C&C Parser
The C&C Parser is an advanced statistical parser using the framework of Combinatory Categorial Grammar (CCG). It is quite easy to use with pre-trained models, but creating one's own models is a slightly different story. Although the software is distributed with a wealth of scripts that should make training easy, differences between systems and dependencies on various libraries make getting the training code to work a bit daunting. The following are detailed step-by-step instructions for replicating, almost exactly, the figures reported in Clark and Curran (2007) on a single 64-bit Ubuntu 12.04 machine, which should have multiple cores and roughly 40 GB of main memory or more. The steps on other recent Linux distributions should be very similar.
Please extend the instructions with more detail, helpful hints and notes on other operating systems! They were initially written up by Kilian Evang; thanks are due to Tim Dawborn, Stephen Clark and James Curran for advice without which I would probably never have gotten it to run.
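Before starting, it may be worth checking that the machine roughly matches the requirements mentioned in the introduction (several cores, around 40 GB of main memory). The following is a minimal sketch for Linux only; the thresholds `MIN_MEM_GB` and `MIN_CORES` and the helper `check_at_least` are illustrative assumptions, not part of the C&C tools.

```shell
#!/bin/sh
# Preflight sketch: warn if the machine looks too small for training.
# MIN_MEM_GB and MIN_CORES are assumed thresholds, not official requirements.
MIN_MEM_GB=${MIN_MEM_GB:-40}
MIN_CORES=${MIN_CORES:-2}

# Total memory in whole GB, read from /proc/meminfo (MemTotal is given in kB).
mem_gb=$(awk '/^MemTotal:/ { printf "%d", $2 / 1024 / 1024 }' /proc/meminfo)
# Number of processing units available to this process (GNU coreutils).
cores=$(nproc)

check_at_least() {  # usage: check_at_least LABEL ACTUAL MINIMUM
    if [ "$2" -ge "$3" ]; then
        echo "ok: $1 = $2"
    else
        echo "warning: $1 = $2, expected at least $3"
    fi
}

check_at_least "memory (GB)" "$mem_gb" "$MIN_MEM_GB"
check_at_least "cores" "$cores" "$MIN_CORES"
```

Note that `MemTotal` is usually a little lower than the installed RAM, so a machine with exactly 40 GB may trigger the memory warning; it is only a hint, not a hard failure.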
# Customize these variables:
export CANDC_PREFIX=$HOME
export CCGBANK=$HOME/data/CCGbank1.2
export TMPDIR=$HOME/tmp # the default /tmp is often on a tiny filesystem
export NUMNODES=32
export LIB=/usr/lib

# Some variables for use below:
export CANDC=$CANDC_PREFIX/candc
export SCRIPTS=$CANDC/src/scripts/ccg
export EXT=$CANDC/ext

# Package dependencies:
sudo apt-get install g++ gawk libibumad-dev mpich2 subversion

# Check out the C&C tools.
# You need credentials for that, see
# http://svn.ask.it.usyd.edu.au/trac/candc/wiki/Subversion
cd $CANDC_PREFIX
svn checkout http://svn.ask.it.usyd.edu.au/candc/trunk candc -r 2400

# Some patches to fix various problems with the scripts provided:

# Use a temp directory different from /tmp since that often doesn't have enough
# space:
sed -i -e "s|/tmp|$TMPDIR|" $SCRIPTS/*_model_*

# Replace /bin/env by /usr/bin/env
sed -i -e "s|/bin/env|/usr/bin/env|" $SCRIPTS/lexicon_features \
    $SCRIPTS/count_features

# Work around non-portable sed -f shebang
sed -i -e 's|$SCRIPTS/convert_brackets|sed -f $SCRIPTS/convert_brackets|g' \
    $SCRIPTS/create_data

# TODO patches to make the scripts work with the LDC version of CCGbank should
# go here.

# Make ext directory
mkdir $EXT

# Install Boost library (Ubuntu doesn't seem to have a version that is compiled
# against MPICH2).
# Boost's build script won't build MPI library without this for some reason
echo 'using mpi ;' > ~/user-config.jam
mkdir $EXT/install
cd $EXT/install
# or get it from Sourceforge
wget https://dl.dropboxusercontent.com/u/5358991/boost_1_53_0.tar.gz
tar -xzf boost_1_53_0.tar.gz
cd boost_1_53_0
./bootstrap.sh --with-libraries=mpi --prefix=$EXT
./b2 install

# Install ancient MR-MPI C&C depends on
cd $EXT/install
# If this link is dead, try http://dl.dropbox.com/u/5358991/mrmpi-22Apr09.tbz2
wget http://sydney.edu.au/it/~tdaw3088/misc/mrmpi-22Apr09.tbz2
tar jxf mrmpi-22Apr09.tbz2
cd mrmpi-22Apr09/src
make -f Makefile.linux clean
make -f Makefile.linux
cp *.h $EXT/include
cp libmrmpi.a $EXT/lib

# Build C&C
cd $CANDC
make -f Makefile.linux all train bin/generate

# Create data
# Will only work with CCGbank 1.2 for now, not with LDC version of CCGbank
$SCRIPTS/create_data $CCGBANK $NUMNODES working/

# Train the POS tagger and Supertagger:
$SCRIPTS/train_taggers working/

# Evaluate the supertagger model to ensure its results are sane:
$SCRIPTS/cl07_table4 working/

# Create the model_hybrid directory and empty config file:
mkdir working/model_hybrid
touch working/model_hybrid/config

# Train a hybrid model:
export LD_LIBRARY_PATH=$EXT/lib:$LIB
$SCRIPTS/create_model_hybrid `pwd` $NUMNODES working/
$SCRIPTS/train_model_hybrid `pwd` $NUMNODES working/

# Evaluate the parser model:
$SCRIPTS/cl07_table7 working/
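
Because the procedure above is long and a failure in an early step can go unnoticed, a quick check of the files the steps are expected to leave behind may help to locate problems. The following is only a sketch: it assumes that `CANDC` and `EXT` are still set as in the script above, that Boost placed its headers and MPI library under `$EXT/include` and `$EXT/lib` (the usual result of `--prefix=$EXT`), and the helper name `report` is made up for illustration.

```shell
#!/bin/sh
# Sanity-check sketch: report which expected files and directories exist.
# Assumes CANDC and EXT are still set as in the training script.

report() {  # usage: report PATH...  (prints one status line per argument)
    for f in "$@"; do
        if [ -e "$f" ]; then
            echo "found:   $f"
        else
            echo "missing: $f"
        fi
    done
}

# The libboost_mpi pattern is left unquoted on purpose so the shell expands
# it; if nothing matches, the literal pattern is reported as missing.
report "$EXT/include/boost" \
       "$EXT"/lib/libboost_mpi.* \
       "$EXT/lib/libmrmpi.a" \
       "$CANDC/bin/generate" \
       "$CANDC/working/model_hybrid/config"
```

Any line starting with `missing:` points at the step that should be rerun, for example the MR-MPI build if `libmrmpi.a` is absent.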
- Stephen Clark and James Curran (2007): Wide-Coverage Efficient Statistical Parsing with CCG and Log-Linear Models. In Computational Linguistics 33(4), http://aclweb.org/anthology-new/J/J07/J07-4004.pdf